Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers

	---
	license: apache-2.0
	datasets:
	- Ayoub-Laachir/Darija_Dataset
	language:
	- ary
	metrics:
	- wer
	- cer
	base_model:
	- openai/whisper-large-v3
	pipeline_tag: automatic-speech-recognition
	---
	# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

	## Model Overview
	- Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
	- Author: Ayoub Laachir
	- License: apache-2.0
	- Repository: [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)
	- Dataset: [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)

	## Description
	This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, the Arabic dialect spoken in Morocco, which is influenced by Berber (Amazigh), French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.
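
As a rough usage sketch, a fine-tuned Whisper checkpoint can be run through the `transformers` speech-recognition pipeline. This assumes the repository linked in the Model Overview holds weights that the pipeline can load directly, which this card does not confirm; the helper name `transcribe` and the file `sample.wav` are hypothetical.

```python
# Repository ID taken from the Model Overview; that it contains directly
# loadable (merged) weights is an assumption, not something the card states.
MODEL_ID = "Ayoub-Laachir/MaghrebVoice"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file and return the recognized text.

    The import is done lazily because transformers is a heavy dependency
    and the model is downloaded the first time the pipeline is built.
    """
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


# Hypothetical usage (requires network access and an audio file):
# print(transcribe("sample.wav"))
```

If the repository only contains LoRA adapter layers, the adapter would instead need to be attached to `openai/whisper-large-v3` with the PEFT library before inference.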

	## Technologies Used
	- Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
	- PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
	- Google Colab: Cloud environment for training the model
	- Hugging Face: Hosting the dataset and final model
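
To illustrate why LoRA is considered parameter-efficient, the sketch below counts trainable parameters for one square projection matrix. It uses a hidden width of 1280, which matches the Whisper large architecture; the rank of 8 is a hypothetical value, since this card does not state the rank that was actually used.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in a LoRA update, factored as B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)


D_MODEL = 1280  # hidden width of Whisper large models
RANK = 8        # hypothetical rank, not taken from this card

full = D_MODEL * D_MODEL                               # dense weight matrix
lora = lora_trainable_params(D_MODEL, D_MODEL, RANK)   # low-rank factors only

print(full, lora, f"{lora / full:.2%}")  # 1638400 20480 1.25%
```

Under these assumptions, the adapter trains about 1.25% of the parameters of each matrix it is attached to, while the original weights stay frozen.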

	## Dataset Preparation
	The dataset preparation involved several steps:
	1. Cleaning: Correcting bad transcriptions and standardizing word spellings.
	2. Audio Processing: Converting all samples to a 16 kHz sample rate.
	3. Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
	4. Format Conversion: Converting the dataset to the Parquet file format.
	5. Uploading: Publishing the prepared dataset to the Hugging Face Hub.
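
The split sizes in step 3 can be reproduced with a seeded shuffle of sample indices, as in the minimal sketch below. The card does not say how the split was actually drawn, so the shuffling strategy, the seed, and the function name `split_indices` are all assumptions made for illustration.

```python
import random

TRAIN_SIZE = 3312  # training samples, from the list above
TEST_SIZE = 150    # test samples, from the list above


def split_indices(n_samples, test_size, seed=0):
    """Shuffle sample indices reproducibly and hold out `test_size` of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first `test_size` shuffled indices form the test set.
    return indices[test_size:], indices[:test_size]


train_idx, test_idx = split_indices(TRAIN_SIZE + TEST_SIZE, TEST_SIZE)
print(len(train_idx), len(test_idx))  # 3312 150
```

Fixing the seed keeps the held-out samples identical across runs, so evaluation numbers stay comparable.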

	## Training Process
	The model was fine-tuned using the following parameters:
	- Per device train batch size: 8
	- Gradient accumulation steps: 1
	- Learning rate: 1e-4 (0.0001)
	- Warmup steps: 200
	- Number of train epochs: 2
	- Logging and evaluation: every 50 steps
	- Weight decay: 0.01
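
The hyperparameters above map onto keyword arguments of `Seq2SeqTrainingArguments` in the `transformers` library. The sketch below is one possible way to express them, not the author's actual training script; `output_dir` and the helper `build_training_args` are hypothetical.

```python
# Keyword arguments mirroring the hyperparameters listed above.
TRAINING_KWARGS = {
    "output_dir": "./whisper-darija-lora",  # hypothetical path, not from the card
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,
    "eval_steps": 50,
    "weight_decay": 0.01,
}


def build_training_args(**overrides):
    """Build Seq2SeqTrainingArguments from the settings above.

    transformers is imported lazily so that the dictionary can be
    inspected without the library installed.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

Evaluating every 50 steps additionally requires a step-based evaluation strategy. The argument for that is named `eval_strategy` in recent `transformers` releases and `evaluation_strategy` in older ones, so it is left out here and can be passed through `overrides`.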

	Training progress showed a steady decrease in both training and validation loss over 8000 steps.
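
For reference, the number of optimizer steps implied by a set of hyperparameters can be estimated as follows. This is a simplified calculation that assumes a single device and no dropped final batch; the function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(n_samples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Estimate optimizer steps per epoch and in total."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch, steps_per_epoch * epochs


# 3,312 training samples, batch size 8, no accumulation, 2 epochs
per_epoch, total = optimizer_steps(3312, 8, 1, 2)
print(per_epoch, total)  # 414 828
```

Under these assumptions, the listed settings give 828 optimizer steps, so the 8000-step figure reported above presumably reflects a different configuration or a different way of counting steps; the card does not specify which.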

	## Testing and Evaluation
	Evaluation produced the following error rates:
	- Word Error Rate (WER): 3.1467%
	- Character Error Rate (CER): 2.3893%

	These error rates indicate that the model transcribes Moroccan Darija speech accurately. The fine-tuned model also shows improved handling of Darija-specific words and sentence structure, and improved overall accuracy.
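
Both WER and CER are edit-distance ratios: the number of insertions, deletions, and substitutions needed to turn the model output into the reference, divided by the reference length in words or characters. The sketch below is a minimal pure-Python version for illustration; it is not the evaluation code used for this model, applies no text normalization, and counts spaces as characters in CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong word out of three; one wrong character out of eleven.
print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

Because Darija has no single standard spelling, scores computed this way depend heavily on how consistently the reference transcriptions are normalized.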

	## Challenges and Future Improvements
	### Challenges Encountered
	- Diverse spellings of words in Moroccan Darija
	- Cleaning and standardizing the dataset

	### Future Improvements
	- Expand the dataset to include more Darija accents and expressions
	- Further fine-tune the model for specific Moroccan regional dialects
	- Explore integration into practical applications like voice assistants and transcription services

	## Conclusion
	This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.