Unbabel/TowerInstruct-7B-v0.2 · Over-generation issues

Jul 9

Hello,

I am trying to fine-tune the model on my own data for in-domain translation in a specialised scientific domain.

I use
formatted_chat = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False) to format my data so that it looks like the ChatML template provided in the model card.

My data then looks like this, e.g.:
<|im_start|>user
Translate from English to French.
Source: Access to high-quality bioinformatics resources is essential for conducting meaningful genetic analyses.
Target: <|im_end|>
<|im_start|>assistant
L'accès à des ressources bioinformatiques de haute qualité est essentiel pour mener des analyses génétiques significatives.<|im_end|>

I then tokenize the data:
tokenizer(example['formatted_chat'], padding="max_length", truncation=True, max_length=162, add_special_tokens=False)
and return tokenized input_ids, attention_mask, and labels.
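One thing worth checking at this step is how the labels are built. The sketch below is an assumption about a common setup, not the code actually used in this post: it sets prompt and padding positions to -100 (the index PyTorch's cross-entropy loss ignores by default) and keeps the end-of-turn token in the labels so the model is trained to stop. The helper `build_labels`, the `prompt_len` argument, and the token IDs are hypothetical, and right-side padding is assumed.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default


def build_labels(input_ids, attention_mask, prompt_len):
    """Return labels that supervise only the assistant turn.

    prompt_len (hypothetical) is the number of leading tokens that belong
    to the prompt, e.g. obtained by tokenizing the prompt part on its own.
    Padding is detected through attention_mask rather than by token ID, so
    an end-of-turn token that shares its ID with the padding token is
    still kept as a training target.
    """
    labels = []
    for pos, (tok, mask) in enumerate(zip(input_ids, attention_mask)):
        if mask == 0 or pos < prompt_len:
            labels.append(IGNORE_INDEX)  # no loss on prompt or padding
        else:
            labels.append(tok)           # loss on the answer and end-of-turn
    return labels


# Toy example with made-up IDs: 7 stands for end-of-turn, 0 for padding.
input_ids = [11, 12, 13, 21, 22, 7, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)
```

If padding were masked by token ID instead, and the padding token happened to be the same as the end-of-turn token, the end-of-turn position would also be ignored and the model would never be trained to stop. Whether that applies to this tokenizer is not something this thread confirms.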

Fine-tuning goes pretty well, with very good training and validation loss.
However, when I use my fine-tuned model at inference:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=250, do_sample=False)

I notice severe over-generation, e.g.:
<|im_start|>user
Translate from English to French.
English:The deletion of a gene may result in 'death' or in a block of 'cell division'.
French: <|im_end|>
<|im_start|>assistant
La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire". French: La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire".. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also looked at the effects of medication on the body’s ability to produce vitamin D. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also studied the effects of taking medications on the body.
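Output like this continues until max_new_tokens when generation never stops on the end-of-turn token. A minimal sketch of cutting generated token IDs at the first stop token is given below, assuming the model does emit that token. The helper `truncate_at_stop` and the IDs are hypothetical, and whether `<|im_end|>` is configured as the EOS token for this checkpoint is not confirmed here.

```python
def truncate_at_stop(token_ids, stop_id):
    """Keep only the tokens generated before the first stop token.

    In a real setup, stop_id could be looked up with
    tokenizer.convert_tokens_to_ids("<|im_end|>"), and the same ID could
    be passed as eos_token_id to generation so decoding halts early.
    Both are assumptions about the setup, not taken from this thread.
    """
    kept = []
    for tok in token_ids:
        if tok == stop_id:
            break  # drop the stop token and everything after it
        kept.append(tok)
    return kept


# Toy example with made-up IDs: 7 stands for the end-of-turn token.
generated = [21, 22, 23, 7, 31, 32, 7, 41]
translation_ids = truncate_at_stop(generated, stop_id=7)
print(translation_ids)
```

If the stop token never appears in the generated IDs, the list comes back unchanged. That outcome would point to the model not having learned to emit the token, rather than to a decoding setting.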

What could be the issue?
Many thanks.

jmprcp

Unbabel org Sep 9

Sorry for the delay. Over-generation can have many causes; some of it is "normal" in these models. Do you face this issue across many instances, or just a small number of them?

jurgiraud

Sep 9

Hi, thank you for your reply. I notice this issue across all instances at inference. However, I do not see as much over-generation with the v0.1 model. I am happy to provide more details if needed.