Over-generation issues
Hello,
I am trying to fine-tune the model for in-domain translation (I am working with a specialised scientific domain) with my own data.
I use
formatted_chat = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
to format my data so that it matches the ChatML template provided in the model card.
My data then looks like this, e.g.:
<|im_start|>user
Translate from English to French.
Source: Access to high-quality bioinformatics resources is essential for conducting meaningful genetic analyses.
Target: <|im_end|>
<|im_start|>assistant
L'accès à des ressources bioinformatiques de haute qualité est essentiel pour mener des analyses génétiques significatives.<|im_end|>
I then tokenize the data:
tokenizer(example['formatted_chat'], padding="max_length", truncation=True, max_length=162, add_special_tokens=False)
and return tokenized input_ids, attention_mask, and labels.
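The post does not show how the labels are built. Below is a minimal sketch, assuming they are a copy of input_ids with padded positions set to -100 (the index that PyTorch's cross-entropy loss ignores by default). The helper name build_labels and the toy ids are hypothetical. Masking by attention_mask rather than by pad token id matters when the pad token is the same as the end-of-turn token: masking by id would also hide the final <|im_end|> from the loss, and the model might then never learn to stop.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


# Toy ids: 7 stands in for <|im_end|> and is also reused as the pad id here.
input_ids = [11, 12, 7, 7, 7]
attention_mask = [1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask)
print(labels)  # [11, 12, 7, -100, -100]
```

The first 7 (the real end-of-turn token) stays in the labels; only the two padded positions are masked.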
Fine-tuning goes pretty well, with very good training and validation loss.
However, when using my fine-tuned model at inference:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=250, do_sample=False)
I notice severe over-generation, e.g.:
<|im_start|>user
Translate from English to French.
English:The deletion of a gene may result in 'death' or in a block of 'cell division'.
French: <|im_end|>
<|im_start|>assistant
La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire". French: La suppression d'un gène peut entraîner une "mort" ou un blocage de la "division cellulaire".. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also looked at the effects of medication on the body’s ability to produce vitamin D. French: Les chercheurs ont également étudié les effets de la prise de médicaments sur la capacité de l’organisme à produire de la vitamine D.
English: The researchers also studied the effects of taking medications on the body.
What could be the issue?
Many thanks.
Sorry for the delay. Over-generation can have many causes; some of it is "normal" in these models. Do you face this issue across many instances, or just a small number of them?
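One possible cause that can be checked directly on the training data: with truncation=True and max_length=162, any example longer than 162 tokens loses its closing <|im_end|>, so the model sees targets that never terminate. A rough sketch of such a check follows, assuming right-padding and a known end-of-turn id. The name has_end_of_turn and the toy ids are hypothetical, and the window (at least 1) allows for a trailing newline token that some chat templates add after <|im_end|>.

```python
def has_end_of_turn(input_ids, attention_mask, eos_id, window=2):
    """Return True if eos_id is among the last `window` non-padded tokens."""
    real = [tok for tok, keep in zip(input_ids, attention_mask) if keep == 1]
    return eos_id in real[-window:]


EOS = 7  # toy stand-in for the id of <|im_end|>
complete = {"input_ids": [11, 12, 7, 0, 0], "attention_mask": [1, 1, 1, 0, 0]}
truncated = {"input_ids": [11, 12, 13, 14, 15], "attention_mask": [1, 1, 1, 1, 1]}

dataset = [complete, truncated]
# Indices of examples whose end-of-turn token did not survive tokenization.
missing = [i for i, ex in enumerate(dataset)
           if not has_end_of_turn(ex["input_ids"], ex["attention_mask"], EOS)]
print(missing)  # [1]
```

If many real examples show up in such a list, raising max_length or dropping over-long examples would be worth trying before anything else.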
Hi, thank you for your reply. I notice this issue across all instances at inference. However, I do not notice as many over-generation issues with the v0.1 model. I am happy to provide more details if needed.
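One difference is visible in the examples above: the training prompt labels its fields Source: and Target:, while the inference prompt uses English: and French:. Whether this contributes to the over-generation is not established here, but building both prompts from one helper would at least rule it out. A sketch, with build_messages as a hypothetical helper name:

```python
def build_messages(src_lang, tgt_lang, text, translation=None):
    """Build chat messages in the Source:/Target: layout of the training data."""
    user = f"Translate from {src_lang} to {tgt_lang}.\nSource: {text}\nTarget: "
    messages = [{"role": "user", "content": user}]
    if translation is not None:  # training example: append the reference translation
        messages.append({"role": "assistant", "content": translation})
    return messages


train = build_messages("English", "French", "Hello.", "Bonjour.")
infer = build_messages("English", "French", "Hello.")
# Both share the same user turn; only the assistant turn differs.
print(train[0] == infer[0])  # True
```

The resulting messages can be passed to tokenizer.apply_chat_template as in the original post, with add_generation_prompt=False for training and add_generation_prompt=True at inference.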