Why not print the entire text of the audio?

#50
by wadexiao - opened

Below is my code:

from transformers import pipeline

speech_recognizer = pipeline("automatic-speech-recognition", chunk_length_s=30, model="openai/whisper-large-v2")

s=speech_recognizer(r"C:\Users\Administrator\Desktop\bad.mp3", max_new_tokens=8000)
print(s['text'])

The audio is almost 210 s long, but the output only contains the text for roughly the first 30 s, even when I change chunk_length_s to 200. How can I get the full text?

Hey @wadexiao - are you able to share your audio so I can reproduce locally please? Your code looks correct otherwise.

I have the same issue: for a 2-minute audio file, the output covers only the first 15 to 20 seconds (roughly 100 tokens).

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="spanish", task="transcribe")

wav_path = "audios/output1.wav"

audio_data, sample_rate = sf.read(wav_path)

input_features = processor(audio_data, sampling_rate=sample_rate, return_tensors="pt").input_features

predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=4000)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Hey @felipedelacruz - for long-form transcription, it's advised to use the pipeline class:

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v2",
  chunk_length_s=30,
  device=device,
)

# to transcribe a local file
wav_path = "audios/output1.wav"
prediction = pipe(wav_path, batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(wav_path, batch_size=8, return_timestamps=True)["chunks"]
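
When timestamps are requested, the "chunks" entry is a list of dictionaries, each with a "timestamp" pair and a "text" string. As a minimal sketch of how to inspect that output, the helper below prints one line per chunk. The function name format_chunks and the sample data are made up for illustration; the end timestamp is treated as optional because the final chunk may not have one.

```python
def format_chunks(chunks):
    """Turn pipeline-style timestamp chunks into readable lines."""
    def label(value):
        # A timestamp can be missing (None), e.g. at the very end of the audio.
        return f"{value:.2f}" if value is not None else "?"

    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{label(start)} -> {label(end)}] {chunk['text'].strip()}")
    return lines


# Hypothetical data in the same shape as pipe(...)["chunks"].
sample_chunks = [
    {"timestamp": (0.0, 4.5), "text": " first segment"},
    {"timestamp": (4.5, None), "text": " second segment"},
]

for line in format_chunks(sample_chunks):
    print(line)
```

Checking the last printed timestamp against the known duration of the file is a quick way to see whether the transcription really stopped early.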

@sanchit-gandhi Hi, as you suggested, I am using the pipeline to transcribe long audio and video, but it is not working for me even though I use the same technique.

I have a 10-minute video and it only transcribes the first 30 seconds.

Hey @Imran1 - do you have a reproducible code snippet that shows the behaviour you're seeing, where only the first 30s of an audio file is transcribed? If you set chunk_length_s=30 when you initialise the pipeline (as done above), then chunking should be enabled, meaning you can transcribe audio files of arbitrary length.
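
To make the idea of chunking concrete, here is a rough sketch of how a long file can be split into overlapping 30 s windows. It is not the pipeline's actual implementation: the function name chunk_windows is made up, and the 5 s overlap on each side is an assumption based on the documented default stride of chunk_length_s / 6.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds of overlapping windows covering the audio."""
    # Each new window advances by the chunk length minus the overlap on both sides.
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be smaller than half of chunk_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window reaches the end of the audio
        start += step
    return windows


# A 210 s file, like the one in the original question.
for start, end in chunk_windows(210.0):
    print(f"{start:6.1f} s -> {end:6.1f} s")
```

Each window is transcribed on its own and the overlapping regions are merged afterwards, which is why the total length of the file should not matter once chunking is enabled.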

hi, @sanchit-gandhi
sure, below is my audio

Hey @wadexiao, it worked perfectly fine for me using the code snippet I shared before:

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v2",
  chunk_length_s=30,
  device=device,
)

audio = "https://cdn-uploads.huggingface.co/production/uploads/64b38065f44fd957490e79af/EK5e1kQPiJvW1dawJv2l7.mpga"
text = pipe(audio, batch_size=16)
print(text)

Let me know if you continue to encounter any issues, more than happy to help here!

Hi @sanchit-gandhi , I am having the same problem here as @wadexiao .

I've tried running your above code on a 2-minute audio file, and it only provides the text for the first few seconds. I have also tried running the model that I wanted using a gradio app from HuggingFace and I get the same thing... after a while, the transcription stops.

Any solutions for this?

Hey @MMYDatasets! Could you share the audio file and code that you're using so that we can reproduce the error on our side? Thanks!

The code that I am using is this one, which you shared:

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v2",
  chunk_length_s=30,
  device=device,
)

audio = "https://cdn-uploads.huggingface.co/production/uploads/64b38065f44fd957490e79af/EK5e1kQPiJvW1dawJv2l7.mpga"
text = pipe(audio, batch_size=16)
print(text)

As for the file, it's an MP3 file which is about 2 minutes in length.

I don't know Arabic, but the transcription looks to be improved when we return timestamps:

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v2",
  chunk_length_s=30,
  device=device,
)

audio = "https://cdn-uploads.huggingface.co/production/uploads/65671e8a77fe61d0fce9dde0/HCSF1wW79K6be5ucybPHb.mpga"
output = pipe(audio, batch_size=16, generate_kwargs={"task": "transcribe"}, return_timestamps=True)
print(output["text"])

Print Output

 الوحدة الثامنة الجو والملابس كيف الجو عندكم؟ رابعاً الاستماع والمحادثة التدريب التاسع عشر استمع إلى النشرة الجوية ثم أجب عن الأسئلة ألف ضع علامة صح أو خطأ ثم صحح الخطأ يتوقع غدا أن تستمر درجات الحرارة في الانخفاض مع وجود سماء صافية احيانا وغائمة جزئية احيانا اخرى كما يلاحظ سقوط امطار خفيفة على بعض البلاد مثل تركيا ويطاليا وظهور صحب كثيفة في فترة الظهيرة مع احتمال زيادة الرطوبة في الليل والصباح الباكر ومما يلاحظ أيضا زيادة درجات الحرارة على بعض البلاد مثل السعودية ومصر تتدرجات الحرارة على بعض البلاد مثل السعودية ومصر أما عن درجات الحرارة المتوقعة غدا فهي السعودية العظمة 35 والصغرى 25مس وعشرون مصر العظمى ثلاثون والصغرى عشرون تركيا العظمى ثلاث عشرة والصغرى تسع ايطاليا العظمى تسع درجات الصغرى اربع درجات

Did you figure this out? Thanks.

Hi @sanchit-gandhi ,
Above, you advised using the pipeline for long-form transcription with Whisper. Am I correct in understanding that the pipeline does not let you customize all the long-form transcription parameters that the original whisper package released by OpenAI provides, such as:
temperature: float = 0.0
sample_len: Optional[int] = None
best_of: Optional[int] = None
beam_size: Optional[int] = None
patience: Optional[float] = None
length_penalty: Optional[float] = None
prompt: Optional[Union[str, List[int]]] = None
prefix: Optional[Union[str, List[int]]] = None
suppress_tokens: Optional[Union[str, Iterable[int]]] = "-1"
suppress_blank: bool = True
without_timestamps: bool = False
max_initial_timestamp: Optional[float] = 1.0
fp16: bool = True
verbose: Optional[bool] = None,
temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
compression_ratio_threshold: Optional[float] = 2.4,
logprob_threshold: Optional[float] = -1.0,
no_speech_threshold: Optional[float] = 0.6,
condition_on_previous_text: bool = True,
initial_prompt: Optional[str] = None,
word_timestamps: bool = False,
prepend_punctuations: str = "\"'“¿([{-",
append_punctuations: str = "\"'.。,,!!??::”)]}、",
clip_timestamps: Union[str, List[float]] = "0",
hallucination_silence_threshold: Optional[float] = None,

Most of these seem to be available as arguments to the generate method of transformers.WhisperForConditionalGeneration.
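
As a rough sketch of how some of these options might be carried over, the snippet below translates a few openai/whisper option names into generate() argument names, which can then be passed through the generate_kwargs argument used earlier in this thread. The mapping is an assumption and deliberately incomplete: only beam_size (as num_beams) and length_penalty are included, and the helper name to_generate_kwargs is made up. Options without an entry are reported instead of being silently dropped.

```python
# Assumed, partial mapping from openai/whisper option names to generate() names.
OPENAI_TO_GENERATE = {
    "beam_size": "num_beams",
    "length_penalty": "length_penalty",
}


def to_generate_kwargs(openai_options):
    """Split options into generate() kwargs and a list of names with no known mapping."""
    kwargs = {}
    unmapped = []
    for name, value in openai_options.items():
        if value is None:
            continue  # unset options keep the library default
        if name in OPENAI_TO_GENERATE:
            kwargs[OPENAI_TO_GENERATE[name]] = value
        else:
            unmapped.append(name)
    return kwargs, unmapped


kwargs, unmapped = to_generate_kwargs(
    {"beam_size": 5, "patience": 1.0, "length_penalty": None}
)
print(kwargs)    # arguments to pass on
print(unmapped)  # options that need a closer look

# With a pipeline created as in the earlier posts, the call would look like:
# output = pipe(audio, batch_size=16, generate_kwargs={"task": "transcribe", **kwargs})
```

Whether a given option behaves the same way under chunked long-form transcription depends on the installed transformers version, so it is worth checking each one against the documentation.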
