---
language: ary
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Moroccan Arabic dialect by Boumehdi
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    metrics:
    - name: Test WER
      type: wer
      value: 44.30
---

# Wav2Vec2-Large-XLSR-53-Moroccan-Darija

**wav2vec2-large-xlsr-53** fine-tuned on 8.5 hours of labeled Darija audio.

Three phonetic units have also been added to this model: ڭ, ڤ, and پ. For example: ڭال, ڤيديو, پودكاست.

## Usage

The model can be used directly (without a language model) as follows:

```python
import librosa
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC, Wav2Vec2Processor

# load the tokenizer, processor, and fine-tuned model
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
processor = Wav2Vec2Processor.from_pretrained('boumehdi/wav2vec2-large-xlsr-moroccan-darija', tokenizer=tokenizer)
model = Wav2Vec2ForCTC.from_pretrained('boumehdi/wav2vec2-large-xlsr-moroccan-darija')

# load the audio data (use your own wav file here!)
input_audio, sr = librosa.load('file.wav', sr=16000)

# extract input features
input_values = processor(input_audio, sampling_rate=16000, return_tensors="pt", padding=True).input_values

# retrieve logits
logits = model(input_values).logits

# greedy decoding: take the most likely token at each frame
tokens = torch.argmax(logits, axis=-1)

# decode the predicted token ids to text
transcription = tokenizer.batch_decode(tokens)

# print the output
print(transcription)
```

Output: ڭالت ليا هاد السيد هادا ما كاينش بحالو

## Evaluation & Previous Work

### v2 (current)

Fine-tuned on 8.5 hours of audio. The characters أ, إ, and ى were replaced with ا, since they caused many inconsistencies, and the Moroccan Darija spelling was standardized where possible.

**WER**: 44.30

**Training loss**: 12.99

**Validation loss**: 36.93

### v1

Fine-tuned on 6 hours of audio.

**WER**: 49.68

**Training loss**: 9.88

**Validation loss**: 45.24

## Future Work

I am currently working on improving this model. The new model will be available soon.

Email: souregh@gmail.com
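
As a supplement to the evaluation notes above, here is a minimal, hypothetical sketch of how the alef normalization described for v2 (أ, إ, ى → ا) and a WER score could be computed on your own reference/prediction pairs. It assumes the `jiwer` package is installed; the sentences shown are placeholders, not the actual evaluation data.

```python
import re
from jiwer import wer

def normalize_darija(text: str) -> str:
    # replace أ, إ, and ى with bare ا, matching the v2 normalization described above
    return re.sub("[أإى]", "ا", text)

# placeholder reference transcriptions and model predictions
references = ["ڭالت ليا هاد السيد هادا ما كاينش بحالو"]
predictions = ["ڭالت ليا هاد السيد هدا ما كاينش بحالو"]

# normalize both sides before scoring so the alef replacement is not counted as an error
references = [normalize_darija(t) for t in references]
predictions = [normalize_darija(t) for t in predictions]

print(f"WER: {wer(references, predictions) * 100:.2f}")
```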