Fine-tuned whisper-medium model for ASR in French
This model is a fine-tuned version of openai/whisper-medium, trained on a composite dataset of over 2200 hours of French speech audio, using the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is sampled at 16 kHz. This model does not predict casing or punctuation.
Performance
Below are the word error rates (WERs) of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech, VoxPopuli, and Fleurs. These results are reported in the original paper.
Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
---|---|---|---|---|
openai/whisper-small | 22.7 | 16.2 | 15.7 | 15.0 |
openai/whisper-medium | 16.0 | 8.9 | 12.2 | 8.7 |
openai/whisper-large | 14.7 | 8.9 | 11.0 | 7.7 |
openai/whisper-large-v2 | 13.9 | 7.3 | 11.4 | 8.3 |
Below are the WERs of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, and Fleurs. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as WER (greedy search) / WER (beam search with beam width 5).
Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
---|---|---|---|---|
bofenghuang/whisper-small-cv11-french | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
bofenghuang/whisper-medium-cv11-french | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
bofenghuang/whisper-medium-french | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
bofenghuang/whisper-large-v2-cv11-french | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
bofenghuang/whisper-large-v2-french | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
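
The preprocessing applied to the evaluation datasets (French alphabet characters only, punctuation removed except apostrophes) can be approximated with a small normaliser. This is a minimal sketch and not the script used to produce the numbers above: the function name `normalise_french`, the exact set of accented letters, the lowercasing step, and the choice to replace hyphens and digits with spaces are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed character set: a-z plus common French accented letters and ligatures.
FRENCH_LETTERS = "a-zàâäçéèêëîïôöùûüÿœæ"
# Anything that is not a French letter, an apostrophe, or a space is dropped.
_DISALLOWED = re.compile(rf"[^{FRENCH_LETTERS}' ]+")
_WHITESPACE = re.compile(r"\s+")


def normalise_french(text: str) -> str:
    """Lowercase the text, keep French letters and apostrophes, drop the rest."""
    # Compose accents (e + combining acute -> é) so the character class matches.
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the plain ASCII one.
    text = text.replace("’", "'")
    # Replace every run of disallowed characters with a single space.
    text = _DISALLOWED.sub(" ", text)
    # Collapse repeated spaces and trim the ends.
    return _WHITESPACE.sub(" ", text).strip()


print(normalise_french("Bonjour, l’été est là !"))
```

Applying the same function to both the reference transcripts and the model predictions keeps the comparison consistent, since the model itself outputs no casing or punctuation.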
Usage
Inference with 🤗 Pipeline
import torch
from datasets import load_dataset
from transformers import pipeline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-french", device=device)
# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")
# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]
# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"] # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"] # beam search
# Normalise predicted sentences if necessary
Inference with 🤗 low-level APIs
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-french", language="french", task="transcribe")
# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")
# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate
# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]
# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)
# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225) # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5) # beam search
# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Normalise predicted sentences if necessary
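
To score a prediction such as `generated_sentences` against its reference transcript, the WER reported in the Performance tables can be computed as the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch under that standard definition, not the evaluation code behind the reported numbers. The function name `word_error_rate` is hypothetical, and both strings are assumed to be already normalised. For a whole test set, WER is usually computed from the total number of errors over the total number of reference words rather than by averaging per-sentence scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenised strings."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref_words)


# One missing word out of four reference words.
print(word_error_rate("le chat dort ici", "le chat dort"))
```

A dedicated library such as `jiwer` is a common alternative when scoring full datasets.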
Evaluation results
- WER (Greedy) on Common Voice 11.0 test set (self-reported): 9.03
- WER (Beam 5) on Common Voice 11.0 test set (self-reported): 8.73
- WER (Greedy) on Multilingual LibriSpeech (MLS) test set (self-reported): 4.60
- WER (Beam 5) on Multilingual LibriSpeech (MLS) test set (self-reported): 4.44
- WER (Greedy) on VoxPopuli test set (self-reported): 9.53
- WER (Beam 5) on VoxPopuli test set (self-reported): 9.46
- WER (Greedy) on Fleurs test set (self-reported): 6.33
- WER (Beam 5) on Fleurs test set (self-reported): 5.94
- WER (Greedy) on African Accented French test set (self-reported): 4.89
- WER (Beam 5) on African Accented French test set (self-reported): 4.56