Transcription repeats the same word
Hi,
Thanks for making this model available!
I tried it out and it works like a charm on audio up to 1 minute long. Unfortunately, when I try to transcribe a recording longer than 1 minute, it only transcribes the first 2-3 sentences and then keeps repeating the word where it got stuck for the rest of the text. I have a 1-hour recording I'm trying to transcribe; if I crop it to 1 minute, the result is perfect. A 5-minute clip already shows the repeated-word problem, and even its first minute is not transcribed properly. I used the basic example code for English transcription. Do you have any idea how to solve this issue?
I have an NVIDIA RTX 4090 GPU with 24 GB of memory, and I only need to run inference.
Thanks,
Agi
Hi, thanks for trying out the model! We have a special script for inference on longer samples here: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_multitask/speech_to_text_aed_chunked_infer.py. It should fix your issues.
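For readers wondering why a separate script helps: the general idea behind chunked inference is to cut the long recording into short windows, transcribe each one on its own, and join the text. The sketch below only illustrates that idea and is not the code of the linked script. The 40-second default and the `transcribe_chunk` callback are assumptions made for the example, and the real script may handle chunk boundaries differently.

```python
from typing import Callable, List, Tuple


def chunk_bounds(total_sec: float, chunk_sec: float = 40.0) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) times in seconds covering the recording."""
    if total_sec < 0 or chunk_sec <= 0:
        raise ValueError("total_sec must be >= 0 and chunk_sec must be > 0")
    total_sec = float(total_sec)
    bounds = []
    start = 0.0
    while start < total_sec:
        # The last chunk is shortened so it never runs past the end of the audio.
        end = min(start + chunk_sec, total_sec)
        bounds.append((start, end))
        start = end
    return bounds


def transcribe_long(
    total_sec: float,
    transcribe_chunk: Callable[[float, float], str],  # hypothetical: returns text for one window
    chunk_sec: float = 40.0,
) -> str:
    """Transcribe each chunk independently and join the non-empty pieces with spaces."""
    pieces = [transcribe_chunk(s, e) for s, e in chunk_bounds(total_sec, chunk_sec)]
    return " ".join(p.strip() for p in pieces if p.strip())
```

With a 1-hour recording and 40-second windows this yields 90 chunks, each short enough to stay in the range where the model was reported to work well in this thread.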
Oh, perfect!
It took me a bit of time to figure out that I needed to set up the environment by installing from source (via pip, straight from the Git repository) instead of using the plain pip package, but otherwise it worked smoothly and transcribed the 1-hour recording.
@AgnesG can you share instructions?
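For anyone landing here with the same question, here is a minimal sketch of what an install from source via pip could look like. It is not the exact set of steps used above. The `main` branch is taken from the script URL in this thread; the `asr` extra and the helper names are assumptions for illustration, so check the NeMo README for the currently recommended command.

```python
import importlib.metadata
import sys
from typing import List, Optional


def installed_version(dist_name: str = "nemo_toolkit") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def source_install_command(
    repo_url: str = "https://github.com/NVIDIA/NeMo.git",
    ref: str = "main",    # branch or tag to install from
    extras: str = "asr",  # assumed extra; adjust to what the README lists
) -> List[str]:
    """Build a pip command that installs the package directly from the Git repository."""
    requirement = f"nemo_toolkit[{extras}] @ git+{repo_url}@{ref}"
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    print("Currently installed:", installed_version())
    # Print the command instead of running it, so nothing is changed by accident.
    print("To install from source, run:", " ".join(source_install_command()))
```

Printing the command rather than executing it keeps the sketch safe to run; pass the list to `subprocess.run` only once you have confirmed it matches your environment.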