sanchit-gandhi (HF staff) committed
Commit e80f01d
1 Parent(s): 27b5bb3

Update README.md

Files changed (1):
  README.md +6 -4
README.md CHANGED
@@ -356,8 +356,8 @@ This code snippet shows how to evaluate Whisper Large on [LibriSpeech test-clean
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
 algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
 [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. It can also be extended to
-predict utterance level timestamps by passing `return_timestamps=True`:
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
 
 ```python
 >>> import torch
@@ -376,15 +376,17 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> sample = ds[0]["audio"]
 
->>> prediction = pipe(sample.copy())["text"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
 " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
 
 >>> # we can also return timestamps for the predictions
->>> prediction = pipe(sample, return_timestamps=True)["chunks"]
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
 [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
 ```
 
+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
+
 ## Fine-Tuning
 
 The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
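
For reference, a minimal sketch of the full snippet as it reads after this change. The setup lines that fall between the two hunks are not part of the diff and are reconstructed here; the `openai/whisper-large` checkpoint and the device selection are assumptions, not text taken from this commit:

```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # "openai/whisper-large" is an assumed checkpoint -- substitute the one this card documents
>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-large",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> # chunked + batched inference; .copy() because the pipeline mutates its input dict
>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # sequence-level timestamps for each transcribed chunk
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
```

With chunking enabled, `batch_size=8` forwards several 30s windows through the model in a single call; the value is a throughput/memory trade-off rather than a requirement of the chunking algorithm.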