Incorrect IFEval benchmark
Hello everyone,
Apparently the IFEval evaluation went wrong for our model, and unfortunately I can't explain how this could have happened. As you can see in the IFEval benchmark results dataset, most responses are simply empty. In our internal tests (set up according to the HF leaderboard documentation) everything worked correctly.
You can also tell from the remaining benchmarks that the IFEval score cannot have been calculated correctly: all other values are similar to our internal tests (we have also included diagrams in the model card where you can verify this).
Do you have an idea, or even a solution?
Thanks in advance,
David
The same behavior can also be observed with the 9B finetunes of Gemma 2:
https://huggingface.co/datasets/open-llm-leaderboard/UCLA-AGI__Gemma-2-9B-It-SPPO-Iter3-details/viewer/UCLA-AGI__Gemma-2-9B-It-SPPO-Iter3__leaderboard_ifeval
https://huggingface.co/datasets/open-llm-leaderboard/princeton-nlp__gemma-2-9b-it-SimPO-details/viewer/princeton-nlp__gemma-2-9b-it-SimPO__leaderboard_ifeval
Hi @DavidGF,
It looks like the issue with your model's responses on the IFEval benchmark is hard to pin down. Sometimes the model responds as expected, but other times it doesn't, particularly on the more complex prompts.
From what I can see, everything seems to be set up correctly (the BOS token, the chat template), so the problem might lie in how the model handles the specific generation settings. These settings might be making the model too rigid, which could explain why it occasionally fails to generate a response. I've also tried re-evaluating your model and got the same results. Have you tried evaluating your model with the BOS token added?
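If it helps, here is a minimal sketch (not the leaderboard's internal code) of how you could check locally whether the chat template already inserts the BOS token, so that tokenizing with `add_special_tokens=True` would prepend a second one. The model ID is a placeholder:

```python
# Minimal sketch, assuming a standard transformers tokenizer.
# The model ID is a placeholder; replace it with the checkpoint under test.
from transformers import AutoTokenizer

model_id = "your-org/your-gemma-2-finetune"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Write a haiku about evaluation."}]

# Render the prompt the way a chat-template-aware harness would
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(prompt))  # check whether the rendered prompt already starts with <bos>

# Tokenize and look for a duplicated BOS id at the start of the sequence
ids = tokenizer(prompt, add_special_tokens=True).input_ids
print(ids[:5], "bos_token_id:", tokenizer.bos_token_id)
```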
Hello @alozowski,
First of all, thank you very much for your efforts!
We have also evaluated the model several times with the lm-evaluation-harness and haven't run into any problems.
If the issue were specific to our model, the same behavior shouldn't also appear in the other models I mentioned.
A lot of Gemma 2 finetunes are affected by this behavior.
The results of the other benchmarks also show that the model performs well, so I don't think it is being overwhelmed by the complexity of certain IFEval prompts.
Hi @DavidGF,
After manually inspecting the different outputs (and re-running the model locally), we haven't been able to pinpoint where this failure comes from, as we consistently get the same results.
Could you share:
- the command you are using to run the harness locally? (notably, are you using the same command as the one indicated in the Reproducibility section of our docs, including `fewshot_as_multiturn`? see the sketch below)
- a detailed result file of one of your runs?
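For reference, a leaderboard-style invocation looks roughly like the following; the model ID, dtype, and output path are placeholders, and the exact flags may differ slightly from the current Reproducibility section:

```bash
# Rough sketch of a leaderboard-style lm-evaluation-harness run.
# Placeholders: model ID, dtype, output path; check the Reproducibility docs
# for the exact command the leaderboard uses.
lm_eval --model hf \
    --model_args pretrained=your-org/your-gemma-2-finetune,dtype=bfloat16 \
    --tasks leaderboard_ifeval \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --batch_size auto \
    --output_path results \
    --log_samples   # keeps per-sample outputs, useful as the detailed result file
```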
Thanks a lot for your answer, it would help us debug this way faster!
Closing for inactivity