Does anyone know how to translate longer text? - I am new to this

#18
by JaimeLugo - opened

I have the code below and I am only interested in the T2T (text-to-text) format. I am new to this, so I have very likely made a newbie mistake, but I am not able to see the translated text if it is longer than 500 characters; I only see the first 400 characters. Does anyone know how to solve this?

Thanks!

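from transformers import AutoProcessor, SeamlessM4Tv2Model

def translate_text(text, src_lang, tgt_lang):
    # Load the processor and model (note: this reloads them on every call)
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

    # Tokenize the source text
    text_inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # Generate text only (no speech), using beam search with 5 beams
    output_tokens = model.generate(**text_inputs, tgt_lang=tgt_lang, text_num_beams=5, generate_speech=False)
    # Decode the first generated sequence back into a string
    translated_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

    return translated_text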

It seems the default max_new_tokens is set to 256 for this model. You can probably increase this but be mindful of your input token length and the context length of the model (which I believe is 4096).

max_new_tokens (int, optional, defaults to 256) — The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt.

If you need to translate even longer text, it is probably best to chunk it: split at a period once a chunk has reached some length, then loop over the chunks until your entire text is translated.
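A minimal sketch of that chunking idea, in case it helps. It is only one way to do it: `chunk_text`, `translate_long_text` and the `max_chars=400` default are names and values I made up for illustration, and this variant closes a chunk just before it would go over the limit rather than just after. It also assumes sentences end in `.`, `!` or `?` followed by whitespace, so it will not split Chinese or Japanese text as written.

```python
import re


def chunk_text(text, max_chars=400):
    """Split text into chunks of roughly max_chars, breaking only between sentences."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole as its own chunk.
    return chunks


def translate_long_text(text, translate_fn, max_chars=400):
    """Translate each chunk with translate_fn (any str -> str callable) and rejoin."""
    return " ".join(translate_fn(chunk) for chunk in chunk_text(text, max_chars))
```

Here `translate_fn` would be something like `lambda chunk: translate_text(chunk, "eng", "fra")`, using the function from the first post. Loading the model once outside that function would avoid reloading it for every chunk.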

Thanks Noobmaster29! I managed to increase the output length simply by passing "max_new_tokens=1000". This works very well; however, the model in general likes to cut sentences. Perhaps if it feels a sentence has redundant words, it simply ignores it.

Hmmm, what language are you using? I'm finding some of the translations from English to Chinese somewhat questionable. It seems like the model does not like long sentences or really short phrases.

Can anybody let me know where I have to add the max_new_tokens parameter? I am not able to figure it out.
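Going by the posts above, max_new_tokens goes in as a keyword argument to the model.generate(...) call. A rough sketch based on the code from the first post follows. Passing `processor` and `model` in as arguments and the `load_model` helper are my own changes for illustration (so the model is loaded only once), and 1000 is just the value reported to work earlier in this thread.

```python
def load_model(name="facebook/seamless-m4t-v2-large"):
    """Load the processor and model once; imported lazily because the download is large."""
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(name)
    model = SeamlessM4Tv2Model.from_pretrained(name)
    return processor, model


def translate_text(text, src_lang, tgt_lang, processor, model, max_new_tokens=1000):
    """Text-to-text translation with a larger generation budget than the default 256."""
    text_inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    output_tokens = model.generate(
        **text_inputs,
        tgt_lang=tgt_lang,
        text_num_beams=5,
        generate_speech=False,
        max_new_tokens=max_new_tokens,  # <-- the parameter goes here
    )
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)


# Example usage (downloads the model on first run):
# processor, model = load_model()
# print(translate_text("Some long English text ...", "eng", "fra", processor, model))
```

Keep in mind this only raises the cap on generated tokens; very long inputs may still need to be split into smaller pieces first.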

Please, can anyone help me?
Where should I start so that I can also use the models available on Hugging Face?

@junelegend I changed max_new_tokens with model.config.max_new_tokens = 4096,
and I can confirm it has changed when you print model.config again.
But it still doesn't change the output audio length for S2ST (speech-to-speech). I'm not sure which other config parameter needs to be changed.
Could someone help, please?
