How do I view the results of my submission?

#980
by ymcki - opened

I am in the process of fine-tuning google/gemma-2-2b-jpn-it. The first step is to know the benchmark scores of google/gemma-2-2b-jpn-it itself.

Since Google didn't submit the model, I submitted it myself. According to the requests page, my submission is finished. However, after one day, it still doesn't show up on the leaderboard. Where can I see the results of my submission? The revision hash of my submission is 6b046bbc091084a1ec89fe03e58871fde10868eb.

I did read the FAQ and the docs, but I couldn't find anything that explains how to view the results of my submission. Thanks a lot in advance.

Taking hints from this discussion,
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/981

I found that the results of my submission are available here:
https://huggingface.co/datasets/open-llm-leaderboard/results
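
For anyone else looking, here is a rough sketch of how a result file can be fetched programmatically (assuming the huggingface_hub package is installed; the file name below is the one from my run and will differ for yours):

```python
import json
from huggingface_hub import hf_hub_download

# Download one result file from the results dataset (repo_type must be "dataset").
path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json",
)

with open(path) as f:
    data = json.load(f)

# The file follows the lm-evaluation-harness output layout; the "results"
# section holds the per-benchmark scores (exact structure may vary between runs).
print(list(data.get("results", {}).keys()))
```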

This should probably be added to the FAQ in case people can't find their results.

I would also like to know whether my results will be published or not.

Open LLM Leaderboard org

Hi @ymcki ,

As noted in our FAQ, please provide us with the request file for your model next time. Here is the request file for the google/gemma-2-2b-jpn-it submission you made:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json
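
For reference, the request file can also be read programmatically; a minimal sketch, assuming huggingface_hub is installed:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the request file from the requests dataset and check its status.
path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    repo_type="dataset",
    filename="google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json",
)

with open(path) as f:
    request = json.load(f)

# The "status" field moves from PENDING to RUNNING to FINISHED as the
# evaluation progresses (exact field names may change over time).
print(request.get("status"))
```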

According to the status, it is FINISHED. It usually takes approximately one day for results to appear on the Leaderboard, but it can take longer over weekends. Currently, the model is displayed, as you can see in my screenshot.

I should also note that google/gemma-2-2b-jpn-it is a conversational model with a chat_template, and the correct precision to submit it in is bfloat16 according to its config.json. I have therefore submitted it with chat_template = true and in bfloat16; please find the request file here:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_bfloat16_Original.json
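
A quick sketch of how to check both of these locally with transformers (the repository is gated, so you may need to authenticate first; the attributes below are standard, but verify against your transformers version):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "google/gemma-2-2b-jpn-it"  # gated repo: log in with your HF token first

# torch_dtype in config.json is the precision the weights were saved in.
config = AutoConfig.from_pretrained(model_id)
print(config.torch_dtype)  # expected to be torch.bfloat16 for this model

# A non-empty chat_template on the tokenizer marks it as a conversational model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.chat_template is not None)
```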

[Screenshot 2024-10-14 at 13.14.49.png]

Thanks for telling me I submitted with the wrong data type.

But isn't gemma-2-2b-jpn-it an instruct model rather than a chat model? Based on my understanding of the README
https://huggingface.co/google/gemma-2b-it/blob/main/README.md
models ending in -it are instruct models and those without -it are chat models, in Google's naming convention.

Open LLM Leaderboard org

Yes, we evaluate all -it models with the chat_template applied, as you can see on the Leaderboard.

Models without -it are usually base models.
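
For context, "with the chat_template applied" means prompts are rendered through the tokenizer's template before evaluation; a rough illustration (the evaluation harness handles this internally, so the exact call may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")

# Render a single-turn prompt through the model's chat template, leaving the
# assistant turn open for the model to complete.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of Japan?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```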

[Screenshot 2024-10-15 at 11.47.08.png]

Thanks for your clarification. Interestingly, my submission has a higher raw score than yours:
https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/google/gemma-2-2b-jpn-it/results_2024-10-15T15-21-39.173019.json

| Dev    | Average | IFEval | BBH   | MathLv5 | GPQA  | MUSR  | MMLU-PRO | Model                                  |
|--------|---------|--------|-------|---------|-------|-------|----------|----------------------------------------|
| google | 31.82   | 51.37  | 42.21 | 3.474   | 28.52 | 39.56 | 25.78    | gemma-2-2b-jpn-it (float16, not chat)  |
| google | 30.82   | 54.11  | 41.43 | 0.0     | 27.52 | 37.17 | 24.67    | gemma-2-2b-jpn-it (bfloat16, chat)     |

Is this normal? Does float16 really make a big difference?

If not, doesn't that imply that ticking the chat_template option makes the model a bit dumber?

Open LLM Leaderboard org

We have seen similar behavior with Gemma model evaluations before, and it is normal. The chat_template has a positive effect on the IFEval score, while the MATH and, in particular, MUSR scores might be lower.

Nevertheless, we strongly advise people to run the evaluation of instruct models with the chat template applied, as noted in our documentation.

clefourrier changed discussion status to closed
