metadata
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
- phoneme
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
results:
- task:
name: Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice v13
type: mozilla-foundation/common_voice_13_0
args: fr
metrics:
- name: Test PER on Common Voice FR 13.0 | Trained
type: per
value: 5.52
- name: Test PER on Multilingual Librispeech FR | Trained
type: per
value: 4.36
- name: Val PER on Common Voice FR 13.0 | Trained
type: per
value: 4.31
Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French
Fine-tuned facebook/wav2vec2-base-fr-voxpopuli-v2 for French speech-to-phoneme (without language model) using the train and validation splits of Common Voice v13.
Audio sample rate for usage
When using this model, make sure that your speech input is sampled at 16 kHz.
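As a quick sanity check before inference, the sample rate of a file can be read from its header. The sketch below is only an illustration that uses Python's standard `wave` module, so it covers PCM WAV files only; the helper names `wav_sample_rate` and `is_model_ready` are hypothetical and not part of this model or any library.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate stated in this model card (16 kHz)


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, expected_rate=EXPECTED_RATE):
    """Return True if the WAV file is already sampled at the expected rate."""
    return wav_sample_rate(path) == expected_rate
```

If the check fails, the audio has to be resampled to 16 kHz with an audio library of your choice before it is passed to the model.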
Output
As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of IPA-encoded words, without punctuation.
If you do not read the phonetic alphabet fluently, you can paste the transcript into an IPA reader website to convert it back to synthetic speech and check the quality of the phonetic transcription.
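When post-processing the output, note that French nasal vowels such as ɔ̃ or ɑ̃ are written as a base letter followed by a combining tilde, so a naive character split breaks them in two. The following minimal sketch keeps combining marks attached to their base symbol. It assumes that each base character is one phoneme, which is a simplification, and the function names are hypothetical.

```python
import unicodedata


def split_phonemes(word):
    """Split one IPA word into symbols, keeping combining marks attached."""
    symbols = []
    for char in word:
        # A combining mark (e.g. U+0303, the nasal tilde) belongs to the
        # previous base character rather than being a symbol of its own.
        if symbols and unicodedata.combining(char):
            symbols[-1] += char
        else:
            symbols.append(char)
    return symbols


def tokenize_transcription(transcription):
    """Split a transcription into words, then each word into IPA symbols."""
    return [split_phonemes(word) for word in transcription.split()]


# "k\u0254\u0303t" is the word "kɔ̃t": 4 code points but 3 phoneme symbols
print(len(split_phonemes("k\u0254\u0303t")))  # → 3
```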
Training procedure
The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on four 2080 Ti GPUs at Cnam/LMSSC, using a DDP (distributed data parallel) strategy with gradient accumulation. Each update uses 256 audio clips, which corresponds to roughly 25 minutes of speech per update and about 2k updates per epoch.
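The figures above can be related to each other with a little arithmetic. The sketch below is illustrative only: the per-GPU batch size is not stated in this card, so `PER_GPU_BATCH = 8` is a hypothetical value chosen to show how DDP and gradient accumulation combine to reach 256 clips per update.

```python
# Values taken from the training description
N_GPUS = 4
AUDIOS_PER_UPDATE = 256
MINUTES_PER_UPDATE = 25      # approximate
UPDATES_PER_EPOCH = 2_000    # approximate ("2k")
EPOCHS = 14

# Hypothetical value, NOT stated in the model card
PER_GPU_BATCH = 8


def accumulation_steps(effective_batch, n_gpus, per_gpu_batch):
    """Number of accumulated forward/backward passes per optimizer update."""
    per_step = n_gpus * per_gpu_batch
    if effective_batch % per_step:
        raise ValueError("effective batch is not a multiple of n_gpus * per_gpu_batch")
    return effective_batch // per_step


accum = accumulation_steps(AUDIOS_PER_UPDATE, N_GPUS, PER_GPU_BATCH)
mean_clip_seconds = MINUTES_PER_UPDATE * 60 / AUDIOS_PER_UPDATE
audios_per_epoch = AUDIOS_PER_UPDATE * UPDATES_PER_EPOCH
total_updates = UPDATES_PER_EPOCH * EPOCHS

print(f"accumulation steps (assuming {PER_GPU_BATCH} clips per GPU): {accum}")
print(f"mean clip duration: about {mean_clip_seconds:.1f} s")
print(f"clips seen per epoch: about {audios_per_epoch}")
print(f"optimizer updates over {EPOCHS} epochs: about {total_updates}")
```

With these numbers, a clip lasts a little under 6 seconds on average, and the whole run amounts to roughly 28k optimizer updates.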
Usage (using the online Inference API)
Just record your voice with the ⚡ Inference API on this webpage, then click "Compute". That's all!
Usage (with HuggingSound library)
The model can be used directly using the HuggingSound library:
import pandas as pd
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-french-phonemizer")
audio_paths = ["./test_relecture_texte.wav", "./10179_11051_000021.flac"]

# transcribe() returns one dict per audio file, with a "transcription" entry
transcriptions = model.transcribe(audio_paths)

# Collect the results in a table indexed by audio file
df = pd.DataFrame(transcriptions)
df["Audio file"] = audio_paths
df.set_index("Audio file", inplace=True)
df[["transcription"]]
Output:
| Audio file | Phonetic transcription (IPA) |
|---|---|
| ./test_relecture_texte.wav | ʃapitʁ di də abɛse pəti kɔ̃t də ʒyl ləmɛtʁ ɑ̃ʁʒistʁe puʁ libʁivɔksɔʁɡ ibis dɑ̃ la bas kuʁ dœ̃ ʃato sə tʁuva paʁmi tut sɔʁt də volaj œ̃n ibis ʁɔz |
| ./10179_11051_000021.flac | kɛl dɔmaʒ kə sə nə swa pa dy sykʁ supiʁa se foʁaz ɑ̃ pasɑ̃ sa lɑ̃ɡ syʁ la vitʁ fɛ̃ dy ʃapitʁ kɛ̃z ɑ̃ʁʒistʁe paʁ sonjɛ̃ sɛt ɑ̃ʁʒistʁəmɑ̃ fɛ paʁti dy domɛn pyblik |
Inference script (if you do not want to use the HuggingSound library):
import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2Processor

MODEL_ID = "Cnam-LMSSC/wav2vec2-french-phonemizer"

# Load the fine-tuned CTC model and its processor
model = AutoModelForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# sf.read returns (samples, sample_rate); the audio must be sampled at 16 kHz
audio, sample_rate = sf.read("example.wav")
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Greedy CTC decoding: keep the most likely token at each frame
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("Phonetic transcription : ", transcription)
Output:
'ʒə syi tʁɛ kɔ̃tɑ̃ də vu pʁezɑ̃te notʁ solysjɔ̃ puʁ fonomize dez odjo fasilmɑ̃ sa fɔ̃ksjɔn kɑ̃ mɛm tʁɛ bjɛ̃'
Test Results:
The table below reports the Phoneme Error Rate (PER) of the model on Common Voice and Multilingual Librispeech (using the French configuration of each dataset), when fine-tuned on the Common Voice train set only:
| Model | Test Set | PER |
|---|---|---|
| Cnam-LMSSC/wav2vec2-french-phonemizer | Common Voice v13 (French) | 5.52% |
| Cnam-LMSSC/wav2vec2-french-phonemizer | Multilingual Librispeech (French) | 4.36% |
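For readers who want to score their own transcriptions, a phoneme error rate is commonly computed as the edit distance between predicted and reference phoneme sequences, divided by the total reference length. The sketch below follows that common definition. It is an assumption that the figures above were obtained this way, since the exact evaluation script and tokenization are not described here, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_symbol == hyp_symbol else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER in percent: total edits / total reference symbols."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_length = sum(len(r) for r in references)
    if total_length == 0:
        raise ValueError("references contain no phoneme symbols")
    return 100.0 * total_edits / total_length


# One substitution in a 4-symbol reference
print(phoneme_error_rate([["a", "b", "c", "d"]], [["a", "b", "e", "d"]]))  # → 25.0
```

Each sequence is a list of phoneme symbols. Whether word boundaries count as symbols changes the result, so that choice should be reported alongside any PER.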
Citation
If you use this fine-tuned model in a publication, please cite our work as follows:
@misc {lmssc-wav2vec2-base-phonemizer-french_2023,
author = { Olivier, Malo AND Hauret, Julien AND Bavu, {É}ric },
title = { wav2vec2-french-phonemizer (Revision e715906) },
year = 2023,
url = { https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer },
doi = { 10.57967/hf/1339 },
publisher = { Hugging Face }
}