---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---
# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)
## Model Overview
- **Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)
- **Author:** Ayoub Laachir
- **License:** apache-2.0
- **Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)
- **Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)
## Description
This model is a fine-tuned version of OpenAI's Whisper Large V3, adapted to transcribe Moroccan Darija, the variety of Arabic spoken in Morocco, which is influenced by Amazigh (Berber), French, and Spanish. The project aims to make speech technology more accessible to millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.
## Technologies Used
- **Whisper Large V3:** OpenAI's multilingual speech recognition model, used here as the base model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** Fine-tuning that trains small low-rank adapter matrices while the base model weights stay frozen
- **Google Colab:** Cloud environment for training the model
- **Hugging Face:** Hosting the dataset and final model
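
To illustrate why LoRA counts as parameter-efficient, the sketch below compares the number of trainable parameters for a single square projection matrix under full fine-tuning and under LoRA. The hidden size of 1280 is Whisper Large's model dimension; the rank `RANK = 8` is an illustrative assumption, not necessarily the value used for this model.

```python
# Trainable-parameter count for one weight matrix: full fine-tuning vs. LoRA.
# LoRA freezes W (d_out x d_in) and trains two low-rank factors,
# B (d_out x r) and A (r x d_in); the adapted weight is W + (alpha / r) * B @ A.


def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters updated when only the rank-r factors B and A are trained."""
    return r * (d_out + d_in)


D_MODEL = 1280  # hidden size of Whisper Large
RANK = 8        # assumed LoRA rank, for illustration only

full = full_params(D_MODEL, D_MODEL)
lora = lora_params(D_MODEL, D_MODEL, RANK)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains 20,480 parameters in place of 1,638,400 for that matrix, about 1.25% of the original count.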
## Dataset Preparation
The dataset preparation involved several steps:
1. **Cleaning:** Correcting inaccurate transcriptions and standardizing word spellings.
2. **Audio Processing:** Converting all samples to a 16 kHz sample rate.
3. **Dataset Split:** Creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** Transforming the dataset into the parquet file format.
5. **Uploading:** Pushing the prepared dataset to the Hugging Face Hub.
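
Steps 2 to 5 could look roughly like the following with the Hugging Face `datasets` library. This is a sketch, not the script used for this project: the column name `audio`, the random seed, the output file names, and the use of `train_test_split` to produce the split are all assumptions.

```python
# Sketch of resampling, splitting, exporting, and uploading a speech dataset.
# Names marked "assumed" are illustrative and do not come from the model card.

TARGET_SAMPLE_RATE = 16_000  # Hz, the rate Whisper expects
TRAIN_SIZE = 3_312           # samples, as stated in the model card
TEST_SIZE = 150              # samples, as stated in the model card


def prepare_splits(dataset, audio_column="audio", seed=42):
    """Resample the audio column to 16 kHz and split into train and test.

    `dataset` is a `datasets.Dataset`; `audio_column` and `seed` are assumed.
    """
    # Imported here so the module loads even without the optional dependency.
    from datasets import Audio

    # Casting to Audio(sampling_rate=...) resamples each clip when it is read.
    dataset = dataset.cast_column(
        audio_column, Audio(sampling_rate=TARGET_SAMPLE_RATE)
    )
    # An integer test_size is interpreted as an absolute number of samples.
    return dataset.train_test_split(test_size=TEST_SIZE, seed=seed)


def export_and_upload(splits, repo_id, out_dir="."):
    """Write each split to a parquet file, then push the splits to the Hub."""
    for name, split in splits.items():
        split.to_parquet(f"{out_dir}/{name}.parquet")
    # Requires being logged in to the Hugging Face Hub.
    splits.push_to_hub(repo_id)
```

With 3,462 cleaned samples in total, `prepare_splits` would return a `DatasetDict` whose `train` and `test` splits hold 3,312 and 150 samples respectively.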
## Training Process
The model was fine-tuned using the following parameters:
- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01
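
As a rough illustration, the hyperparameters above map onto `Seq2SeqTrainingArguments` from the `transformers` library as shown below. The helper name `build_training_args` and the `output_dir` value are hypothetical, and any setting not listed above is left at the library default, which may differ from what was used for this model.

```python
# The listed hyperparameters expressed as keyword arguments for
# transformers.Seq2SeqTrainingArguments. Only values from the list are set.

TRAINING_KWARGS = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,  # log every 50 steps
    "eval_steps": 50,     # evaluate every 50 steps (see note below)
    "weight_decay": 0.01,
}

# Samples consumed per optimizer step on each device.
EFFECTIVE_BATCH_PER_DEVICE = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)


def build_training_args(output_dir: str):
    """Create the training arguments object (hypothetical helper)."""
    # Imported here so the constants above are usable without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Note that `eval_steps` only takes effect when the evaluation strategy is set to `"steps"`. The argument for that is named `evaluation_strategy` in older `transformers` releases and `eval_strategy` in newer ones, so it is left out of the sketch.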
Both training and validation loss decreased steadily over 8,000 steps.
## Testing and Evaluation
The model was evaluated using:
- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%
These error rates indicate that the model transcribes Moroccan Darija speech accurately. Compared with the base Whisper Large V3, the fine-tuned model handles Darija-specific vocabulary and sentence structure better.
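
For readers unfamiliar with the two metrics, the sketch below computes WER and CER for a single utterance from the Levenshtein edit distance. It is a minimal pure-Python illustration on toy English strings, not the evaluation script behind the numbers above; the function names are chosen for this example.

```python
# WER and CER as edit distance divided by reference length.
# WER counts edits over words, CER counts edits over characters.


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one reference/hypothesis pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one pair (spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

In the toy pair, one word is substituted and one is dropped, giving 2 word edits over 6 reference words. Over a whole test set, the error rate is usually computed from the total number of edits divided by the total reference length, not by averaging per-utterance scores.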
## Challenges and Future Improvements
### Challenges Encountered
- Diverse spellings of words in Moroccan Darija
- Cleaning and standardizing the dataset
### Future Improvements
- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services
## Conclusion
This project is a step towards making speech recognition more accessible to Moroccan Arabic speakers. The results of this fine-tuned model show that advanced speech models can be adapted to underrepresented languages, and the approach can serve as a template for similar initiatives in North Africa.