---
language: ary
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Moroccan Arabic dialect by Boumehdi
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    metrics:
    - name: Test WER
      type: wer
      value: 49.68
---
# Wav2Vec2-Large-XLSR-53-Moroccan-Darija

wav2vec2-large-xlsr fine-tuned on 6 hours of labeled Darija audio.

I have also added three phonetic units to this model: ڭ, ڤ, and پ, as in ڭال, ڤيديو, and پودكاست.
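The new characters were presumably appended to the tokenizer's character vocabulary (`vocab.json`, which for `Wav2Vec2CTCTokenizer` maps each character to an integer id). A hedged sketch of what that extension looks like; the surrounding entries and ids here are illustrative, not the model's actual vocabulary:

```python
import json

# Illustrative fragment of a Wav2Vec2 CTC character vocabulary;
# the real vocab.json contains the model's full character set and ids.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "ا": 3, "ل": 4}

# give each new phonetic unit the next free id
for ch in ["ڭ", "ڤ", "پ"]:
    if ch not in vocab:
        vocab[ch] = len(vocab)

# this is what would be written back to vocab.json
print(json.dumps(vocab, ensure_ascii=False))
```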
## Usage
The model can be used directly (without a language model) as follows:
```python
import librosa
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC, Wav2Vec2Processor

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
processor = Wav2Vec2Processor.from_pretrained('boumehdi/wav2vec2-large-xlsr-moroccan-darija', tokenizer=tokenizer)
model = Wav2Vec2ForCTC.from_pretrained('boumehdi/wav2vec2-large-xlsr-moroccan-darija')

# load the audio data (use your own wav file here!)
input_audio, sr = librosa.load('file.wav', sr=16000)

# preprocess the audio into model inputs
input_values = processor(input_audio, sampling_rate=sr, return_tensors="pt", padding=True).input_values

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# greedy decoding: take the most likely token at each time step
tokens = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(tokens)

# print the output
print(transcription)
```
Here's the output: `ڭالت ليا هاد السيد هادا ما كاينش بحالو` (roughly: "she told me, this man, there is no one like him").
## Evaluation
- WER: 49.68
- Training loss: 9.88
- Validation loss: 45.24

The high validation loss is mainly due to the fact that Darija has no standardized orthography: the same word can be written in many ways, so valid transcriptions are often penalized as errors.
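WER (word error rate) is the word-level edit distance between hypothesis and reference (substitutions + insertions + deletions) divided by the number of reference words, which is why a correctly transcribed word written with a variant spelling still counts as a full substitution. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# a one-word spelling variant in a four-word sentence costs 25% WER
print(wer("قال ليا هاد السيد", "ڭال ليا هاد السيد"))  # -> 0.25
```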
## Future Work

I am currently working on improving this model; a new version will be available soon.