metadata
license: apache-2.0
datasets:
  - Ayoub-Laachir/Darija_Dataset
language:
  - dj
metrics:
  - wer
  - cer
base_model:
  - openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition

Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

Model Overview

Model Name: Whisper Large V3 (Fine-tuned for Moroccan Darija)
Author: Ayoub Laachir
License: apache-2.0
Repository: Ayoub-Laachir/MaghrebVoice
Dataset: Ayoub-Laachir/Darija_Dataset

Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, the Arabic vernacular of Morocco, which carries Amazigh (Berber), French, and Spanish influences. The project aims to improve access to speech technology for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

Technologies Used

  • Whisper Large V3: OpenAI’s state-of-the-art speech recognition model
  • PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation): An efficient fine-tuning technique
  • Google Colab: Cloud environment for training the model
  • Hugging Face: Hosting the dataset and final model
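To give a sense of why LoRA counts as parameter-efficient, the sketch below compares the number of weights in one dense projection matrix with the number in a LoRA adapter for that matrix. It assumes Whisper Large V3's hidden size of 1280 and a LoRA rank of 32; the rank is an illustrative guess, because this card does not state the LoRA settings that were actually used.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_out x d_in projection (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


D_MODEL = 1280  # hidden size of Whisper Large V3
RANK = 32       # assumed LoRA rank, not taken from this model card

dense = full_params(D_MODEL, D_MODEL)
adapter = lora_params(D_MODEL, D_MODEL, RANK)
print(f"dense: {dense:,}  LoRA: {adapter:,}  ratio: {adapter / dense:.1%}")
```

Under these assumed values the adapter holds about 5% as many weights as the matrix it adapts, and the original matrix stays frozen during training.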

Dataset Preparation

The dataset preparation involved several steps:

  1. Cleaning: Correcting bad transcriptions and standardizing word spellings.
  2. Audio Processing: Converting all samples to a 16 kHz sample rate.
  3. Dataset Split: Creating a training set of 3,312 samples and a test set of 150 samples.
  4. Format Conversion: Transforming the dataset into the Parquet file format.
  5. Uploading: Publishing the prepared dataset to the Hugging Face Hub.
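The split in step 3 can be illustrated with a small, standard-library-only sketch. It assumes the 150 test samples were held out at random from a pool of 3,462 cleaned samples; the card does not say how the split was made, so the helper name `holdout_split` and the seed are hypothetical.

```python
import random


def holdout_split(items, test_size, seed=0):
    """Shuffle a copy of items and hold out test_size of them for testing."""
    if not 0 < test_size < len(items):
        raise ValueError("test_size must be between 1 and len(items) - 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in sample IDs; the real dataset holds audio/transcript pairs.
sample_ids = list(range(3312 + 150))
train_ids, test_ids = holdout_split(sample_ids, test_size=150)
print(len(train_ids), len(test_ids))
```

Fixing the seed means the same samples end up in the test set on every run, which keeps later WER and CER figures comparable.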

Training Process

The model was fine-tuned using the following parameters:

  • Per device train batch size: 8
  • Gradient accumulation steps: 1
  • Learning rate: 1e-4 (0.0001)
  • Warmup steps: 200
  • Number of train epochs: 2
  • Logging and evaluation: every 50 steps
  • Weight decay: 0.01
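As a rough guide, the parameters above map onto Hugging Face TrainingArguments field names as shown below. The dictionary is a sketch and not the author's training script. The warmup function assumes the Trainer's default linear schedule, which this card does not confirm, and `total_steps` is a placeholder the caller supplies.

```python
# Keys follow Hugging Face TrainingArguments field names; values come from the list above.
training_args = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,
    "eval_steps": 50,
    "weight_decay": 0.01,
}


def effective_batch_size(args, num_devices=1):
    """Samples consumed per optimizer step (num_devices is an assumption)."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_devices
    )


def lr_at(step, total_steps, args=training_args):
    """Learning rate under linear warmup followed by linear decay (assumed schedule)."""
    peak = args["learning_rate"]
    warmup = args["warmup_steps"]
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total_steps - step) / max(1, total_steps - warmup))


print(effective_batch_size(training_args))
print(lr_at(100, total_steps=1000))  # halfway through warmup
```

With 200 warmup steps, the learning rate ramps up from zero and only reaches 1e-4 at step 200, which can help stabilize the early updates of the adapter weights.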

Training progress showed a steady decrease in both training and validation loss over 8000 steps.

Testing and Evaluation

The model was evaluated using:

  • Word Error Rate (WER): 3.1467%
  • Character Error Rate (CER): 2.3893%

These low error rates indicate that the model transcribes Moroccan Darija speech accurately on the evaluation data.

The fine-tuned model handles Darija-specific words and sentence structure better, and is more accurate overall.
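For reference, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the model output into the reference, divided by the reference length in words or characters. The sketch below is a minimal standard-library implementation. The card does not say which tool or text normalization produced the figures above, so this is illustrative and may not reproduce them exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level distance (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings, not taken from the Darija test set.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.2%}  CER: {cer(reference, hypothesis):.2%}")
```

Multiplying the returned ratios by 100 gives percentages in the same form as the figures reported above.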

Audio Transcription Script

This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija. It includes steps for installing necessary libraries, loading the model, and processing audio files.

Required Libraries

Before running the script, install the following libraries. The leading "!" is notebook syntax (for example in Google Colab); drop it when running the commands in a terminal:

!pip install --upgrade pip
!pip install --upgrade transformers accelerate librosa soundfile

# Imports used by the transcription script
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the model
config = {
    "model_id": "Ayoub-Laachir/MaghrebVoice",  # Model ID from Hugging Face
    "language": "arabic",                          # Language for transcription
    "task": "transcribe",                          # Task type
    "chunk_length_s": 30,                          # Length of each audio chunk in seconds
    "stride_length_s": 5,                          # Stride on each side of a chunk, in seconds (adjacent chunks overlap)
}

# Load the model and processor
def load_model_and_processor():
    try:
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            config["model_id"],
            torch_dtype=torch_dtype,               # Use appropriate data type
            low_cpu_mem_usage=True,                # Use low CPU memory
            use_safetensors=True,                   # Load model with safetensors
            attn_implementation="sdpa",            # Specify attention implementation
        )
        model.to(device)  # Move model to the specified device

        processor = AutoProcessor.from_pretrained(config["model_id"])

        print("Model and processor loaded successfully.")
        return model, processor
    except Exception as e:
        print(f"Error loading model and processor: {e}")
        return None, None

# Load the model and processor
model, processor = load_model_and_processor()
if model is None or processor is None:
    # Stop here: nothing below can run without the model and processor
    raise SystemExit("Failed to load model or processor")

# Configure the generation parameters for the pipeline
generate_kwargs = {
    "language": config["language"],  # Language for the pipeline
    "task": config["task"],          # Task for the pipeline
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs=generate_kwargs,
    chunk_length_s=config["chunk_length_s"],  # Length of each audio chunk
    stride_length_s=config["stride_length_s"],  # Stride on each side of a chunk
)

# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio

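# Format time as HH:MM:SS.mmm (WebVTT timestamp format)
def format_time(seconds):
    # Work in whole milliseconds so rounding can never produce "60.000" seconds
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"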

# Transcribe audio file
def transcribe_audio(audio_path):
    try:
        result = pipe(audio_path, return_timestamps=True)  # Transcribe audio and get timestamps
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start, end = chunk["timestamp"]
            if end is None:  # Whisper may leave the final end timestamp unset
                end = start  # Fall back to the start time so formatting does not fail
            print(f"{format_time(start)} --> {format_time(end)}")  # Print time range
            print(f"{chunk['text'].strip()}\n")                      # Print transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
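
The script sets chunk_length_s to 30 and stride_length_s to 5, so long recordings are processed as overlapping windows. The sketch below is a simplified model, in seconds, of how the transformers ASR pipeline appears to lay out those windows: each chunk advances by the chunk length minus the stride on both sides. The real pipeline works on sample indices and handles edge cases this helper ignores; the function name `chunk_windows` is hypothetical.

```python
def chunk_windows(audio_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) windows in seconds for strided chunking."""
    audio_s = float(audio_s)
    # The same stride is kept on the left and right of every chunk
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    windows = []
    start = 0.0
    while start < audio_s:
        end = start + chunk_length_s
        windows.append((start, min(end, audio_s)))
        if end >= audio_s:  # this chunk already reaches the end of the audio
            break
        start += step
    return windows


print(chunk_windows(70))
```

For a 70-second file this yields three windows that start 20 seconds apart, so neighbouring chunks share 10 seconds of audio. That shared context is what lets the pipeline stitch the chunk transcripts back together.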

Challenges and Future Improvements

Challenges Encountered

  • Diverse spellings of words in Moroccan Darija
  • Cleaning and standardizing the dataset

Future Improvements

  • Expand the dataset to include more Darija accents and expressions
  • Further fine-tune the model for specific Moroccan regional dialects
  • Explore integration into practical applications like voice assistants and transcription services

Conclusion

This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.