How do I view the results of my submission?

#980
by ymcki - opened

I am in the process of fine-tuning google/gemma-2-2b-jpn-it. The first step is to know the benchmark scores of google/gemma-2-2b-jpn-it itself.

Since Google didn't submit the model, I submitted it myself. According to the requests page, my submission is finished. However, after one day, it still doesn't show up on the leaderboard. Where can I see the results of my submission? The revision hash of my submission is 6b046bbc091084a1ec89fe03e58871fde10868eb.

I did read the FAQ and the docs, but I couldn't find anything that explains how to view the results of my submission. Thanks a lot in advance.

Taking hints from this discussion,
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/981

I found that the results of my submission are available here:
https://huggingface.co/datasets/open-llm-leaderboard/results
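
For anyone else looking, here is a rough sketch of how a result file can be fetched programmatically (assuming the huggingface_hub package is installed; the file name below is the one from my run and will differ for yours):

```python
import json
from huggingface_hub import hf_hub_download

# Download one result file from the results dataset (repo_type must be "dataset").
path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json",
)

with open(path) as f:
    data = json.load(f)

# The file follows the lm-evaluation-harness output layout; the "results"
# section holds the per-benchmark scores (exact structure may vary between runs).
print(list(data.get("results", {}).keys()))
```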

This should probably be added to the FAQ in case people can't find their results.

I would also like to know whether my results will be published or not.

Open LLM Leaderboard org

Hi @ymcki ,

As noted in our FAQ, please provide us with the request file for your model next time. Here is the request file for the google/gemma-2-2b-jpn-it submission you made:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json
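
For reference, the request file can also be read programmatically; a minimal sketch, assuming huggingface_hub is installed:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch the request file from the requests dataset and check its status.
path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    repo_type="dataset",
    filename="google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json",
)

with open(path) as f:
    request = json.load(f)

# The "status" field moves from PENDING to RUNNING to FINISHED as the
# evaluation progresses (exact field names may change over time).
print(request.get("status"))
```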

According to the status, it is FINISHED. It usually takes approximately one day for results to appear on the Leaderboard, but it can take longer over weekends. Currently, the model is displayed, as you can see in my screenshot.

I should also note that google/gemma-2-2b-jpn-it is a conversational model with a chat_template, and the correct precision to submit it in is bfloat16 according to its config.json. I have therefore submitted it with chat_template = true and in bfloat16; please find the request file here:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_bfloat16_Original.json
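
A quick sketch of how to check both of these locally with transformers (the repository is gated, so you may need to authenticate first; the attributes below are standard, but verify against your transformers version):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "google/gemma-2-2b-jpn-it"  # gated repo: log in with your HF token first

# torch_dtype in config.json is the precision the weights were saved in.
config = AutoConfig.from_pretrained(model_id)
print(config.torch_dtype)  # expected to be torch.bfloat16 for this model

# A non-empty chat_template on the tokenizer marks it as a conversational model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.chat_template is not None)
```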

[Screenshot 2024-10-14 at 13.14.49.png]

Thanks for telling me I submitted with the wrong data type.

But isn't gemma-2-2b-jpn-it an instruct model rather than a chat model? Based on my understanding of the README
https://huggingface.co/google/gemma-2b-it/blob/main/README.md
models ending in -it are instruct models and those without -it are chat models, in Google's naming convention.

Open LLM Leaderboard org

Yes, we evaluate all -it models with the chat_template applied, as you can see on the Leaderboard.

Models without -it are usually base models.
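
For context, "with the chat_template applied" means prompts are rendered through the tokenizer's template before evaluation; a rough illustration (the evaluation harness handles this internally, so the exact call may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")

# Render a single-turn prompt through the model's chat template, leaving the
# assistant turn open for the model to complete.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of Japan?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```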

[Screenshot 2024-10-15 at 11.47.08.png]

Thanks for your clarification. Interestingly, my submission has a higher raw score than yours:
https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/google/gemma-2-2b-jpn-it/results_2024-10-15T15-21-39.173019.json

| Dev    | Average | IFEval | BBH   | MathLv5 | GPQA  | MUSR  | MMLU-PRO | Model                                  |
|--------|---------|--------|-------|---------|-------|-------|----------|----------------------------------------|
| google | 31.82   | 51.37  | 42.21 | 3.474   | 28.52 | 39.56 | 25.78    | gemma-2-2b-jpn-it (float16, not chat)  |
| google | 30.82   | 54.11  | 41.43 | 0.0     | 27.52 | 37.17 | 24.67    | gemma-2-2b-jpn-it (bfloat16, chat)     |

Is this normal? Does float16 really make a big difference?

If not, doesn't that imply that ticking the chat_template option makes the model a bit dumber?

Open LLM Leaderboard org

We have seen similar behavior with Gemma model evaluations before, and it is normal. The chat_template has a positive effect on the IFEval score, while the MATH and, in particular, MUSR scores might be lower.

Nevertheless, we strongly advise people to run the evaluation of instruct models with the chat template applied, as noted in our documentation.

clefourrier changed discussion status to closed
