---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Fine-tuned whisper-medium model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 9.03
    - name: WER (Beam 5)
      type: wer
      value: 8.54
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 6.34
    - name: WER (Beam 5)
      type: wer
      value: 5.86
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 11.64
    - name: WER (Beam 5)
      type: wer
      value: 11.35
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 7.13
    - name: WER (Beam 5)
      type: wer
      value: 6.85
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 8.88
    - name: WER (Beam 5)
      type: wer
      value: 7.02
---

![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
![Model size](https://img.shields.io/badge/Params-769M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

# Fine-tuned whisper-medium model for ASR in French

This model is a fine-tuned version of
[openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on the mozilla-foundation/common_voice_11_0 fr dataset. When using the model, make sure that your speech input is sampled at 16 kHz. **This model also predicts casing and punctuation.**

## Performance

*Below are the WERs of the pre-trained models on [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs). These results are reported in the original [paper](https://cdn.openai.com/papers/whisper.pdf).*

| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 22.7 | 16.2 | 15.7 | 15.0 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 16.0 | 8.9 | 12.2 | 8.7 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 14.7 | 8.9 | **11.0** | **7.7** |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | **13.9** | **7.3** | 11.4 | 8.3 |

*Below are the WERs of the fine-tuned models on [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs). Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes.
The results in the table are reported as `WER (greedy search) / WER (beam search with beam width 5)`.*

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [bofenghuang/whisper-small-cv11-french](https://huggingface.co/bofenghuang/whisper-small-cv11-french) | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| [bofenghuang/whisper-medium-cv11-french](https://huggingface.co/bofenghuang/whisper-medium-cv11-french) | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| [bofenghuang/whisper-medium-french](https://huggingface.co/bofenghuang/whisper-medium-french) | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| [bofenghuang/whisper-large-v2-cv11-french](https://huggingface.co/bofenghuang/whisper-large-v2-cv11-french) | **8.05** / **7.67** | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| [bofenghuang/whisper-large-v2-french](https://huggingface.co/bofenghuang/whisper-large-v2-french) | 8.15 / 7.83 | **4.20** / **4.03** | **9.10** / **8.66** | **5.22** / **4.98** |

## Usage

Inference with 🤗 Pipeline

```python
import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
```

Inference with 🤗 low-level APIs
```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# 16_000
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Get feat
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
```
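
Normalising predictions before scoring

Since the model predicts casing and punctuation while the evaluation sets in the Performance section were restricted to French alphabet characters and apostrophes, predictions usually need to be normalised before computing WER. The snippet below is a minimal pure-Python sketch of that step. The function names `normalise_fr` and `word_error_rate`, the exact character set, and the choice to lowercase are assumptions made for illustration; this is not the preprocessing used to produce the numbers reported above.

```python
import re
import unicodedata

# Assumed character set: lowercase French letters (including accented letters
# and ligatures), the ASCII apostrophe, and the space. Illustrative only.
_ALLOWED = "a-zàâäçéèêëîïôöùûüÿœæ' "
_DISALLOWED_RE = re.compile(f"[^{_ALLOWED}]")


def normalise_fr(text: str) -> str:
    """Lowercase, unify apostrophes, replace other characters with spaces."""
    # Compose characters such as "e" + combining accent into a single code point.
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the ASCII one so both forms match.
    text = text.replace("’", "'")
    # Everything outside the allowed set (punctuation, digits, hyphens) becomes a space.
    text = _DISALLOWED_RE.sub(" ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In practice, both the reference transcript and `generated_sentences` would be passed through `normalise_fr` before calling `word_error_rate`, so that differences in casing or punctuation are not counted as word errors.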