Over-generation issues

#5
by jurgiraud - opened

Hello,

I am trying to fine-tune the model for in-domain translation (I am working with a specialised scientific domain) with my own data.

I use
formatted_chat = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False) to format my data so that it looks like the ChatML template provided in the model card.

This is then what my data looks like, e.g.:
<|im_start|>user
Translate from English to French.
Source: Access to high-quality bioinformatics resources is essential for conducting meaningful genetic analyses.
Target: <|im_end|>
<|im_start|>assistant
L'accès à des ressources bioinformatiques de haute qualité est essentiel pour mener des analyses génétiques significatives.<|im_end|>

I then tokenize the data:
tokenizer(example['formatted_chat'], padding="max_length", truncation=True, max_length=162, add_special_tokens=False)
and return tokenized input_ids, attention_mask, and labels.
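
The post does not show how the labels are built, so the following is only a minimal sketch of two things worth checking at this step, not a confirmed cause. It assumes labels are a copy of input_ids, masks padded positions with -100 (the value ignored by the PyTorch cross-entropy loss), and checks whether the end-of-turn token survived truncation to max_length. The function names and the toy token ids are hypothetical.

```python
# Label value skipped by the PyTorch cross-entropy loss (its default ignore_index).
IGNORE_INDEX = -100


def build_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


def ends_with_token(input_ids, attention_mask, end_id):
    """Return True if the last non-padded token equals end_id.

    A False result for a truncated example means the end-of-turn token was
    cut off, so that example never shows the model where to stop.
    """
    real = [tok for tok, keep in zip(input_ids, attention_mask) if keep == 1]
    return bool(real) and real[-1] == end_id


# Toy example: 2 stands in for the id of <|im_end|>, 0 for the pad id (both hypothetical).
ids = [5, 6, 7, 2, 0, 0]
mask = [1, 1, 1, 1, 0, 0]
labels = build_labels(ids, mask)
kept_end = ends_with_token(ids, mask, end_id=2)
```

Masking by attention_mask rather than by pad id keeps the real end-of-turn token in the loss even when the pad token and the end-of-turn token share an id.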

Fine-tuning goes well, with very good training and validation loss.
However, when using my fine-tuned model at inference:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=250, do_sample=False)
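
One thing that could be tried here, as a sketch only: pass the id of the end-of-turn token explicitly as eos_token_id so that generation stops when it is produced. Whether the model's configured EOS token differs from <|im_end|> is not stated in this thread, so that is an assumption. The helper name build_generation_kwargs is hypothetical; convert_tokens_to_ids is the standard tokenizer method.

```python
def build_generation_kwargs(tokenizer, end_of_turn="<|im_end|>", max_new_tokens=250):
    """Build generation kwargs that stop at the end-of-turn token.

    Assumes end_of_turn is a token in the tokenizer's vocabulary.
    """
    stop_id = tokenizer.convert_tokens_to_ids(end_of_turn)
    if stop_id is None:
        raise ValueError(f"{end_of_turn!r} is not in the tokenizer vocabulary")
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        "eos_token_id": stop_id,
    }
```

With the pipeline from the post, this would be used as outputs = pipe(prompt, **build_generation_kwargs(pipe.tokenizer)).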

I notice severe over-generation, e.g.:
<|im_start|>user
Translate from English to French.
English:The deletion of a gene may result in 'death' or in a block of 'cell division'.
French: <|im_end|>
<|im_start|>assistant
La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire". French: La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire".. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also looked at the effects of medication on the body’s ability to produce vitamin D. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also studied the effects of taking medications on the body.

What could be the issue?
Many thanks.

Unbabel org

Sorry for the delay. Over-generation can have many causes, and some of it is "normal" in these models. Do you face this issue across many instances, or just a small number of them?

Hi, thank you for your reply. I notice this issue across all instances at inference. However, I do not see as much over-generation with the v0.1 model. I am happy to provide more details if needed.
