metadata

license: apache-2.0
datasets:
  - Ayoub-Laachir/Darija_Dataset
language:
  - ary
metrics:
  - wer
  - cer
base_model:
  - openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition

Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

Model Overview

Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset

Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, a variety of Arabic with Berber, French, and Spanish influences. The project aims to make speech technology more accessible to millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

Technologies Used

Whisper Large V3: OpenAI’s multilingual speech recognition model, used as the base model
PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): Fine-tuning that trains small low-rank adapter matrices while the base weights stay frozen
Google Colab: Cloud environment used for training
Hugging Face Hub: Hosting for the dataset and the final model
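
To make the efficiency of LoRA concrete, the sketch below counts trainable parameters for a single square projection matrix. The hidden size of 1280 is that of Whisper Large; the rank of 8 is an assumed, illustrative value, since the adapter rank used for this model is not stated in this card. The helper name `lora_param_counts` is likewise hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in weights. LoRA freezes them and
    trains two low-rank factors instead: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# 1280 is the hidden size of Whisper Large; rank=8 is an assumption for illustration.
full, lora = lora_param_counts(d_out=1280, d_in=1280, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains only a small percentage of the weights of each adapted matrix, which is what makes fine-tuning a model of this size practical on a single Colab GPU.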

Dataset Preparation

The dataset preparation involved several steps:

Cleaning: Correcting inaccurate transcriptions and standardizing word spellings.
Audio Processing: Converting all samples to a 16 kHz sample rate, the rate Whisper expects as input.
Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
Format Conversion: Converting the dataset to the Parquet file format.
Uploading: Publishing the prepared dataset to the Hugging Face Hub.
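
The split step can be sketched as follows. This is a minimal illustration that assumes a random, seeded hold-out; the card does not say how samples were assigned to each set, and the helper `split_indices` and the seed value are hypothetical.

```python
import random


def split_indices(n_total: int, n_test: int, seed: int = 0) -> tuple[list[int], list[int]]:
    """Shuffle sample indices reproducibly and hold out n_test of them."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


# 3,312 training + 150 test samples = 3,462 samples in total.
train_idx, test_idx = split_indices(n_total=3312 + 150, n_test=150)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the test set identical across runs, so error rates measured at different times stay comparable.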

Training Process

The model was fine-tuned using the following parameters:

Per device train batch size: 8
Gradient accumulation steps: 1
Learning rate: 1e-4 (0.0001)
Warmup steps: 200
Number of train epochs: 2
Logging and evaluation: every 50 steps
Weight decay: 0.01
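
As a rough guide to how these settings interact, the sketch below derives the number of optimizer steps they imply. It assumes a single GPU and the 3,312-sample training set described above; a run with a different dataset size, device count, or an explicit step limit would give a different total.

```python
import math

train_samples = 3312          # training split size from the dataset section
per_device_batch_size = 8
grad_accum_steps = 1
num_devices = 1               # assumption: a single GPU (e.g. one Colab GPU)
num_epochs = 2
warmup_steps = 200
eval_every = 50

# Samples consumed per optimizer step.
effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print("steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
print("warmup share:", round(warmup_steps / total_steps, 3))
print("evaluations:", total_steps // eval_every)
```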

Training progress showed a steady decrease in both training and validation loss over 8000 steps.

Testing and Evaluation

The model was evaluated with the following metrics:

Word Error Rate (WER): 3.1467%
Character Error Rate (CER): 2.3893%

At these rates, roughly 3 errors occur per 100 reference words and fewer than 3 per 100 reference characters when transcribing Moroccan Darija speech.

Compared with the base model, the fine-tuned model handles Darija-specific vocabulary and sentence structure better and is more accurate overall.
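
For reference, both error rates are edit distances normalized by the length of the reference transcription: WER counts word-level substitutions, insertions, and deletions, and CER does the same at character level. The sketch below is a minimal pure-Python version, not the evaluation script used for this model; the function names and the example strings are invented for illustration.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example pair: one word differs by a single inserted character.
reference = "kifach dayr lyoum"
hypothesis = "kifach dayer lyoum"
print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

Over a whole test set, the rates are usually computed by summing the edits and the reference lengths across all utterances before dividing, rather than averaging per-utterance scores.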

Challenges and Future Improvements

Challenges Encountered

Diverse spellings of words in Moroccan Darija
Cleaning and standardizing the dataset

Future Improvements

Expand the dataset to include more Darija accents and expressions
Further fine-tune the model for specific Moroccan regional dialects
Explore integration into practical applications like voice assistants and transcription services

Conclusion

This project is a step toward making speech technology more accessible to Moroccan Arabic speakers. Its results show that advanced speech models can be adapted to underrepresented languages, and the approach can serve as a template for similar initiatives in North Africa.