The reason for the high performance may be an error in evaluation
Look at the scores: the MATH score for Qwen2.5-72B-Instruct is suspiciously low. It could be that your model outperforms it simply because Qwen2.5-72B-Instruct was somehow misevaluated. I have opened a discussion here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/975
@ChuckMcSneed I 100% agree that the instruct model was evaluated incorrectly, but I still think my model would outperform it. It would be cool to see it re-evaluated.
but I still think my model would outperform it
It depends on the corrected score, but if Qwen got the same MATH score as this model, then in terms of averages Qwen would beat this model by about 0.5%.
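To make that arithmetic concrete, here is a minimal sketch in Python. All of the numbers are hypothetical placeholders (the real leaderboard values are not quoted in this thread); the only assumption carried over from the leaderboard is that the average is an unweighted mean over its six benchmarks, so correcting one score shifts the average by delta / 6.

```python
# Sketch of how correcting one benchmark score moves a 6-benchmark average.
# All scores below are HYPOTHETICAL placeholders, not real leaderboard values.

def average(scores: dict[str, float]) -> float:
    """Unweighted mean over the benchmarks, Open LLM Leaderboard style."""
    return sum(scores.values()) / len(scores)

# Hypothetical pre-correction scores for Qwen2.5-72B-Instruct.
qwen = {"IFEval": 86.0, "BBH": 62.0, "MATH": 2.0,
        "GPQA": 16.0, "MUSR": 12.0, "MMLU-PRO": 51.0}

# Hypothetical re-evaluated MATH score.
corrected = dict(qwen, MATH=48.0)

# With six equally weighted benchmarks, the average moves by delta / 6.
delta = corrected["MATH"] - qwen["MATH"]
print(f"average shift: {delta / 6:.2f} points")
print(f"before: {average(qwen):.2f}, after: {average(corrected):.2f}")
```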
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/975
According to @alozowski, Qwen Instruct got worse at in-context learning and tried to highlight the answers instead of following the expected format. What happens is that your method brings it closer to the base model, which can be both good (better at in-context learning, less forgetting) and bad (less obedient).
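As an illustration of that failure mode, here is a small Python sketch of a strict, format-based answer extractor. This is a hypothetical parser written for this thread, not the leaderboard's actual evaluation code: the point is that a model which "highlights" a correct answer with markdown emphasis gets scored as a miss.

```python
import re

# Hypothetical strict extractor in the spirit of format-based evals;
# NOT the leaderboard's actual parsing code.
ANSWER_RE = re.compile(r"^Answer:\s*([A-D])\s*$", re.MULTILINE)

def extract(completion: str) -> str | None:
    """Return the answer letter if the completion follows the format."""
    m = ANSWER_RE.search(completion)
    return m.group(1) if m else None

# A model that follows the format gets credit...
print(extract("Reasoning...\nAnswer: B"))      # -> "B"

# ...while one that highlights the answer instead is scored as wrong,
# even though the underlying answer is correct.
print(extract("Reasoning...\nAnswer: **B**"))  # -> None
```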
What is your fine-tuning dataset?