Ayoub-Laachir's picture
Update README.md
1744f1a verified
|
raw
history blame
3.27 kB
metadata
license: apache-2.0
datasets:
  - Ayoub-Laachir/Darija_Dataset
language:
  - dj
metrics:
  - wer
  - cer
base_model:
  - openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition

Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

Model Overview

Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset

Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, specifically adapted for recognizing and transcribing Moroccan Darija, a dialect influenced by Arabic, Berber, French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and serve as a blueprint for similar advancements in underrepresented languages.

Technologies Used

  • Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
  • PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
  • Google Colab: Cloud environment for training the model
  • Hugging Face: Hosting the dataset and final model

Dataset Preparation

The dataset preparation involved several steps:

  1. Cleaning: Correcting bad transcriptions and standardizing word spellings.
  2. Audio Processing: Converting all samples to a 16 kHz sample rate.
  3. Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
  4. Format Conversion: Transforming the dataset into the parquet file format.
  5. Uploading: The prepared dataset was uploaded to the Hugging Face Hub.

Training Process

The model was fine-tuned using the following parameters:

  • Per device train batch size: 8
  • Gradient accumulation steps: 1
  • Learning rate: 1e-4 (0.0001)
  • Warmup steps: 200
  • Number of train epochs: 2
  • Logging and evaluation: every 50 steps
  • Weight decay: 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.

Testing and Evaluation

The model was evaluated using:

  • Word Error Rate (WER): 3.1467%
  • Character Error Rate (CER): 2.3893%

These metrics demonstrate the model's ability to accurately transcribe Moroccan Darija speech.

The fine-tuned model shows improved handling of Darija-specific words, sentence structure, and overall accuracy.

Challenges and Future Improvements

Challenges Encountered

  • Diverse spellings of words in Moroccan Darija
  • Cleaning and standardizing the dataset

Future Improvements

  • Expand the dataset to include more Darija accents and expressions
  • Further fine-tune the model for specific Moroccan regional dialects
  • Explore integration into practical applications like voice assistants and transcription services

Conclusion

This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.