Results validation with other benchmarks?
Hello!
Nice work! Very interesting results!
Did you try to validate with other benchmarks? I checked with MMLU (lm-eval-harness), and it looks like the MMLU results degrade a bit compared with the original suzume_multilingual. Wondering if the MT-Bench score is the preferable metric...
Yeah, as I found in the paper, the Belebele scores drop with ORPO training while the MT-Bench scores increase. I think this is because they measure different things: MT-Bench measures the chat ability of the generated output, whereas Belebele and MMLU measure the logit scores of the "correct" answer. So I think this ORPO-trained model will be better at chatting, but worse at logit-based knowledge-testing tasks. We found in the paper that lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 did better at Belebele than the base model, so it might also be better at MMLU?
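(Roughly, by "logit scores" I mean this kind of evaluation - a minimal sketch, and the model name and prompt below are just illustrative:)

```python
# Minimal sketch of logit-based scoring (the MMLU/Belebele style of evaluation):
# score each candidate answer by the log-probability the model assigns to it and
# pick the highest. Model name and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lightblue/suzume-llama-3-8B-multilingual"  # illustrative
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
choices = [" A", " B", " C", " D"]

def answer_logprob(prompt: str, choice: str) -> float:
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    # sum the log-probs of just the answer tokens (each predicted by the preceding position)
    answer_ids = ids[0, prompt_len:]
    positions = range(prompt_len - 1, ids.shape[1] - 1)
    return sum(logprobs[p, t].item() for p, t in zip(positions, answer_ids))

prediction = max(choices, key=lambda c: answer_logprob(prompt, c))
```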
Thanks. Yes, you are right, I've checked https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 and MMLU is better for it, as well as other logit-based benchmarks. BTW, when you did the MT-Bench scoring, did you use any kind of length control? As mentioned in some papers (e.g. https://arxiv.org/html/2404.04475v1), OpenAI's GPT judges typically prefer lengthy answers, so maybe that is also the case with ORPO-trained models?
Yes, there is a preference for long answers. And in this version of the model, the answers are just huge. In fact, the training dataset should contain long answers among both the accepted and the rejected examples; otherwise the model will learn that it should simply write a long answer. Hence, you need to carefully validate the training dataset, both in terms of answer lengths and in terms of which examples are accepted and rejected.
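For example, a quick check like this on the preference dataset will show whether the accepted answers are systematically longer (the dataset name and the plain-string "chosen"/"rejected" columns are assumptions - adjust to the actual schema):

```python
# Quick length-bias check on a preference dataset before ORPO training.
# Dataset name and the "chosen"/"rejected" string columns are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative
ds = load_dataset("your-org/your-orpo-preference-dataset", split="train")   # placeholder

def n_tokens(text: str) -> int:
    return len(tok(text).input_ids)

chosen_lens = [n_tokens(ex["chosen"]) for ex in ds]
rejected_lens = [n_tokens(ex["rejected"]) for ex in ds]

print(f"mean chosen length:   {sum(chosen_lens) / len(chosen_lens):.1f} tokens")
print(f"mean rejected length: {sum(rejected_lens) / len(rejected_lens):.1f} tokens")
```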
Hey, yeah, I agree that that is something I need to work on for the next iteration of this model. If you just say "Hi" to the model, it gives this loooong answer about how it is here to help and how useful it will be. Ironically, not very helpful haha.
The idea of training using long negatives is a good one - I have not checked whether the positives are substantially longer than the negatives, but I would wager they are.
However, I think I will probably focus on training with a method like SimPO (https://arxiv.org/pdf/2405.14734), as it naturally normalizes rewards by answer length, which would (I think) mean I could use answers of any length for both positives and negatives.
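For reference, this is the core of SimPO as I understand it from the paper - a minimal sketch, with illustrative (not tuned) hyperparameter values:

```python
# Sketch of the SimPO objective (https://arxiv.org/pdf/2405.14734): rewards are the
# *length-normalized* log-probabilities of each answer, so a response gets no credit
# just for being longer. beta/gamma values here are illustrative, not tuned.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor, rejected_lens: torch.Tensor,
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    # *_logps: summed log-probs of each answer under the policy, shape [batch]
    # *_lens:  number of tokens in each answer, shape [batch]
    r_chosen = beta * chosen_logps / chosen_lens        # length-normalized reward
    r_rejected = beta * rejected_logps / rejected_lens
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```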