Update README.md
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-language:
+language:
 - en
 - zh
 - de
@@ -29,7 +29,7 @@ language:
 - da
 - hu
 - ta
-- no
+- 'no'
 - th
 - ur
 - hr
@@ -109,7 +109,7 @@ widget:
 - example_title: Librispeech sample 2
   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 pipeline_tag: automatic-speech-recognition
-license:
+license: mit
 ---
 
 # Whisper
@@ -120,4 +120,98 @@ et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a
 datasets and domains in a zero-shot setting.
 
 @OpenAI
-Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)
+Downloaded from: [link](https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt)

## Available models and languages

There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model.
The relative speeds below are measured by transcribing English speech on an A100, and the real-world speed may vary significantly depending on many factors including the language, the speaking speed, and the available hardware.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~10x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~7x            |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~4x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |
| turbo  | 809 M      | N/A                | `turbo`            | ~6 GB         | ~8x            |

The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy.
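
The names in the table are the identifiers accepted by `whisper.load_model()` (used in the Python usage section below). A minimal sketch of loading an English-only checkpoint alongside the multilingual `turbo` one (attribute names as in the `openai-whisper` package, so check your installed version):

```python
import whisper

# English-only checkpoint: smaller and often more accurate on English audio
english_model = whisper.load_model("base.en")

# multilingual checkpoint, the optimized version of large-v3 mentioned above
multilingual_model = whisper.load_model("turbo")

# English-only models report themselves as monolingual
print(english_model.is_multilingual)       # False
print(multilingual_model.is_multilingual)  # True
```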

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of `large-v3` and `large-v2` models by language, using WERs (word error rates) or CER (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

![WER breakdown by language](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62)


## Command-line usage

The following command will transcribe speech in audio files, using the `turbo` model:

    whisper audio.flac audio.mp3 audio.wav --model turbo

The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:

    whisper japanese.wav --language Japanese

Adding `--task translate` will translate the speech into English:

    whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

    whisper --help

See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) for the list of all available languages.
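
The same list can also be inspected programmatically; a small sketch (module paths as in the `openai-whisper` package, so double-check against your installed version):

```python
# the tokenizer module exposes the language code/name tables
from whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE

print(len(LANGUAGES), "languages")   # code -> name, e.g. "en" -> "english"
print(LANGUAGES["ja"])               # "japanese"
print(TO_LANGUAGE_CODE["japanese"])  # names/aliases map back to codes: "ja"
```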


## Python usage

Transcription can also be performed within Python:

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
```

Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
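
The CLI flags from the previous section map onto keyword arguments of `transcribe()`. A rough sketch (argument names as accepted by recent `openai-whisper` releases; run `whisper --help` or check the docstring for your install):

```python
import whisper

model = whisper.load_model("turbo")

# mirrors: whisper japanese.wav --language Japanese --task translate
result = model.transcribe(
    "japanese.wav",
    language="ja",      # skip automatic language detection
    task="translate",   # "transcribe" (default) or "translate"
    fp16=False,         # use FP32, e.g. when running on CPU
)
print(result["text"])
```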

Below is an example usage of `whisper.detect_language()` and `whisper.decode()`, which provide lower-level access to the model.

```python
import whisper

model = whisper.load_model("turbo")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram with the number of mel bins the model expects,
# and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
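
`DecodingOptions` is where per-window decoding behaviour is configured. As a sketch (field names as defined in `whisper/decoding.py`, so verify against your installed version), the language, task, and search strategy can be set explicitly instead of relying on the defaults:

```python
import whisper

model = whisper.load_model("turbo")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# force English transcription with beam search and no timestamp tokens
options = whisper.DecodingOptions(
    language="en",
    task="transcribe",
    beam_size=5,
    without_timestamps=True,
)
result = whisper.decode(model, mel, options)
print(result.text)
```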

## More examples

Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.


## License

Whisper's code and model weights are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.