---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---
# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

## Model Overview
**Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)  
**Author:** Ayoub Laachir  
**License:** apache-2.0  
**Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)  
**Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)  

## Description
This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, a variety of Arabic influenced by Amazigh (Berber), French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on underrepresented languages.

## Technologies Used
- **Whisper Large V3:** OpenAI’s state-of-the-art speech recognition model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** An efficient fine-tuning technique
- **Google Colab:** Cloud environment for training the model
- **Hugging Face:** Hosting the dataset and final model
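
As a rough illustration of why LoRA is considered parameter-efficient, the sketch below compares the number of trainable values in a full weight matrix with those in its low-rank update `(alpha / rank) * B @ A`. The dimensions, rank, and `alpha` are illustrative assumptions, not the configuration used for this model, and the helper names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_delta(B, A, alpha: float, rank: int):
    """Return the scaled low-rank update (alpha / rank) * B @ A as nested lists."""
    scale = alpha / rank
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


# Illustrative (assumed) sizes: one 1280 x 1280 projection adapted with rank 8
print(full_param_count(1280, 1280))     # values updated by full fine-tuning
print(lora_param_count(1280, 1280, 8))  # values updated by LoRA

# Tiny rank-1 example of the update that gets added to the frozen weight
print(lora_delta([[1.0], [2.0]], [[3.0, 4.0]], alpha=2, rank=1))
```

The base weights stay frozen during training; only `A` and `B` are learned, which is what makes fine-tuning a model of this size feasible on a single Colab GPU.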

## Dataset Preparation
The dataset preparation involved several steps:
1. **Cleaning:** Correcting bad transcriptions and standardizing word spellings.
2. **Audio Processing:** Converting all samples to a 16 kHz sample rate.
3. **Dataset Split:** Creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** Transforming the dataset into the parquet file format.
5. **Uploading:** Publishing the prepared dataset to the Hugging Face Hub.
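
The split in step 3 can be sketched as follows. This is a minimal illustration that assumes a seeded random hold-out; the model card does not say how the 150 test samples were actually chosen, and `split_dataset` and the integer sample IDs are hypothetical stand-ins for the real audio/transcript pairs.

```python
import random


def split_dataset(samples, test_size, seed=0):
    """Shuffle a copy of `samples` and hold out `test_size` items for testing."""
    if not 0 < test_size < len(samples):
        raise ValueError("test_size must be between 1 and len(samples) - 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in IDs for the 3,312 + 150 cleaned samples
sample_ids = list(range(3312 + 150))
train_ids, test_ids = split_dataset(sample_ids, test_size=150)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the test set stable across runs, so error rates stay comparable from one training run to the next.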

## Training Process
The model was fine-tuned using the following parameters:
- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01
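
For reference, the parameters above can be written as a keyword dictionary. The keys follow the argument names of `Seq2SeqTrainingArguments` in `transformers`; that the author used this class, and that training ran on a single device, are assumptions. `effective_batch_size` is a hypothetical helper.

```python
# Hyperparameters from the list above, keyed by transformers argument names
training_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-4,
    "warmup_steps": 200,
    "num_train_epochs": 2,
    "logging_steps": 50,  # log every 50 steps
    "eval_steps": 50,     # evaluate every 50 steps
    "weight_decay": 0.01,
}


def effective_batch_size(kwargs, num_devices=1):
    """Samples consumed per optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # assumes a single device
```

With `transformers` installed, a dictionary like this could be unpacked into `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, where `output_dir` is a path of your choosing.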

Training progress showed a steady decrease in both training and validation loss over 8000 steps.

## Testing and Evaluation
The model was evaluated using:
- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%

These metrics indicate that the model transcribes Moroccan Darija speech accurately. Compared with the base model, the fine-tuned model handles Darija-specific words and sentence structure better.
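
Both metrics are edit distances normalized by the length of the reference: WER counts word-level substitutions, insertions, and deletions, while CER does the same over characters. The sketch below is a minimal pure-Python illustration, not the evaluation code used for this model; it assumes whitespace tokenization for WER and counts spaces as characters for CER, and the sample sentences are placeholders.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder sentences: one substituted word out of four
print(wer("the cat sat down", "the cat sit down"))  # 0.25
print(cer("the cat sat down", "the cat sit down"))  # 0.0625
```

In practice, reference and hypothesis text are usually normalized the same way (for example, punctuation and spelling variants) before scoring, since Darija has no single standard orthography.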

## Audio Transcription Script with PEFT Layers

This script shows how to transcribe audio files with the fine-tuned model: it loads the base Whisper Large V3 model, applies the LoRA adapter with PEFT (Parameter-Efficient Fine-Tuning), merges the weights, and prints timestamped output in WebVTT format.

### Required Libraries

Before running the script, ensure you have the following libraries installed. You can install them using:

```bash
# In a notebook such as Google Colab, prefix each command with "!"
pip install --upgrade pip
pip install --upgrade transformers accelerate librosa soundfile pydub
pip install peft==0.3.0  # Install PEFT library
```

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from peft import PeftModel

# Use the GPU with half precision if available, else the CPU with full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The processor (tokenizer + feature extractor) comes from the base Whisper model
base_model_name = "openai/whisper-large-v3"
processor = AutoProcessor.from_pretrained(base_model_name)

# Repository that holds the fine-tuned LoRA adapter weights
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"

# Load the base model, then attach the LoRA adapter to it
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, model_name)

# Merge the LoRA weights into the base model so inference needs no PEFT wrappers
model = model.merge_and_unload()

# Cast to the dtype the pipeline uses for its inputs and move to the device,
# so that inputs and weights have matching precision
model = model.to(device=device, dtype=torch_dtype)
model.eval()

# Configuration for transcription
config = {
    "language": "arabic",  # Language for transcription
    "task": "transcribe",  # Task type
    "chunk_length_s": 30,  # Length of each audio chunk in seconds
    "stride_length_s": 5,   # Overlap between chunks in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
)

# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)  # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)  # Save the converted audio

# Format time in HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe audio file
def transcribe_audio(audio_path):
    try:
        result = pipe(
            audio_path,
            return_timestamps=True,  # Return per-chunk timestamps
            # Pass the language and task from `config` through to generation
            generate_kwargs={"language": config["language"], "task": config["task"]},
        )
        return result["chunks"]  # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"  # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start, end = chunk["timestamp"]
            if end is None:  # The final chunk may lack an end timestamp
                end = start
            print(f"{format_time(start)} --> {format_time(end)}")  # Print time range
            print(f"{chunk['text'].strip()}\n")  # Print transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
```

## Challenges and Future Improvements
### Challenges Encountered
- Diverse spellings of words in Moroccan Darija
- Cleaning and standardizing the dataset

### Future Improvements
- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services

## Conclusion
This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.