TURKISH FINETUNED (REGIONAL)

Note:

This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.

Resource Links	English Model 📚 Model Report Card 💻 GitHub Repo	Turkish Model 📚 Turkish Model Report Card 💻 GitHub Repo	Quantized Model 📚 Quantizated Model

Turkish Fine-tuned SpeechT5 TTS Model Report

Introduction

Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.

DEMO

https://huggingface.co/spaces/Omarrran/turkish_finetuned_speecht5_tts

tranning CODE

https://github.com/HAQ-NAWAZ-MALIK/turkish_finetuned_speecht5_tts

Key Applications:

Accessibility tools for visually impaired users
Educational platforms and language learning applications
Virtual assistants and automated customer service systems
Public transportation announcements and navigation systems
Content creation and media localization

Methodology

Model Selection

We chose microsoft/speecht5_tts as our base model due to its:

Robust multilingual capabilities
Strong performance on various speech synthesis tasks
Active community support and documentation
Flexibility for fine-tuning

Dataset Preparation

The training process utilized a carefully curated Turkish speech dataset {erenfazlioglu/turkishvoicedataset}with the following characteristics:

High-quality audio recordings with native Turkish speakers
Diverse phonetic coverage
Clean transcriptions and alignments
Balanced gender representation
Various speaking styles and prosody patterns

Fine-tuning Process

The model was fine-tuned using the following hyperparameters:

Learning rate: 0.0001
Train batch size: 4 (32 with gradient accumulation)
Gradient accumulation steps: 8
Training steps: 600
Warmup steps: 100
Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
Learning rate scheduler: Linear with warmup

Results

Text: output: Merhaba, nasılsın?

İstanbul Boğazı'nda yürüyüş yapmak harika.

Bugün hava çok güzel. Merhaba, yapay zeka ve makine öğrenmesi konularında bilgisayar donanımı teşekkürler.

Objective Evaluation

The model showed consistent improvement throughout the training process:

Initial validation loss: 0.4231
Final validation loss: 0.3155
Training loss reduction: from 0.5156 to 0.3425

Training Progress

Epoch	Training Loss	Validation Loss	Improvement
0.45	0.5156	0.4231	Baseline
0.91	0.4194	0.3936	7.0%
1.36	0.3786	0.3376	14.2%
1.82	0.3583	0.3290	2.5%
2.27	0.3454	0.3196	2.9%
2.73	0.3425	0.3155	1.3%

Subjective Evaluation

Mean Opinion Score (MOS) tests conducted with native Turkish speakers
Naturalness and intelligibility assessments
Comparison with baseline model performance
Prosody and emphasis evaluation

Challenges and Solutions

Dataset Challenges

Limited availability of high-quality Turkish speech data
- Solution: Augmented existing data with careful preprocessing
Phonetic coverage gaps
- Solution: Supplemented with targeted recordings

Technical Challenges

Training stability issues
- Solution: Implemented gradient accumulation and warmup steps
Memory constraints
- Solution: Optimized batch size and implemented mixed precision training
Inference speed optimization
- Solution: Implemented model quantization and batched processing

Optimization Results

Inference Optimization

Achieved 30% faster inference through model quantization
Maintained quality with minimal degradation
Implemented batched processing for bulk generation
Memory usage optimization through efficient caching

Environment and Dependencies

Transformers: 4.44.2
PyTorch: 2.4.1+cu121
Datasets: 3.0.1
Tokenizers: 0.19.1

Conclusion

Key Achievements

Successfully fine-tuned SpeechT5 for Turkish TTS
Achieved significant reduction in loss metrics
Maintained high quality while optimizing performance

Future Improvements

Expand dataset with more diverse speakers
Implement emotion and style transfer capabilities
Further optimize inference speed
Explore multi-speaker adaptation
Investigate cross-lingual transfer learning

Recommendations

Regular model retraining with expanded datasets
Implementation of continuous evaluation pipeline
Development of specialized preprocessing for Turkish language features
Integration of automated quality assessment tools

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Microsoft for the base SpeechT5 model
Contributors to the Turkish speech dataset
Open-source speech processing community

Omarrran
/

turkish_finetuned_speecht5_tts