Questions on MMLU

#1
by jphme - opened

Hey guys, congrats on the release, the model looks great and I'll try it out soon.

One quick question regarding the MMLU scores (because that's where we find it hard to get any improvement over the official Instruct versions): did you run your own tests, and did you investigate the reasons for the deviation from Meta's results?
E.g. English MMLU is 78.83 for Llama 3.1 70b Instruct vs. 83.6 claimed on the official model card. For German MMLU it's 72.85 (here) vs. 79.27 (Meta).
(Not trying to nitpick here, just curious about the reason for the deviations and/or whether you weren't able to replicate Meta's claimed results.)

Thanks + keep it coming! :-)

VAGO solutions org

Hey @jphme , thank you very much :)

Improving the MMLU scores of a model is indeed one of the greatest challenges in fine-tuning. The results reported by the major LLM providers are an important reference point in this process.
However, major providers typically do not disclose which evaluation framework and batch size they used to achieve their results.
To enable precise comparisons with our own models, we always conduct our own benchmarks.
We adhere to the frameworks and versions specified by the HF leaderboard. Our aim is to keep the batch size as low as possible to obtain accurate results.
An exception was made for this particular model in the AGIEval and GPT4All evaluations, as the assessment would otherwise have taken an excessively long time. Nevertheless, the trend of improvements is still clearly visible in these cases.
For this model, we tested the MMLU benchmarks using the HF leaderboard version of lm-evaluation-harness with a batch size of 6 (3 x A100 GPUs with a batch size of 2 each).
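
For anyone who wants to reproduce a comparable run, here is a rough sketch using the harness's Python API; the model id, dtype and few-shot count are placeholders rather than our exact configuration:

```python
# Rough sketch of an MMLU run with the lm-evaluation-harness Python API.
# The model id, dtype and few-shot count below are illustrative assumptions,
# not necessarily the exact settings described above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # placeholder checkpoint; parallelize=True shards the model across GPUs
    model_args="pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,dtype=bfloat16,parallelize=True",
    tasks=["mmlu"],
    num_fewshot=5,   # the common 5-shot MMLU setting
    batch_size=2,    # keep the per-device batch size small, as noted above
)

print(results["results"])  # per-subject and aggregated MMLU accuracies
```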

Long story short:

  • LLM providers rarely disclose details about their evaluation framework and batch size
  • Running our own benchmarks is necessary for accurate comparisons
  • We adhere to the HF leaderboard specifications
  • We use the smallest possible batch sizes for precise results

I hope this has been helpful to you!

Thanks @DavidGF, appreciate the effort and explanation :-)
(Although I still think that the deviation is quite large and apparently others were able to reproduce the MMLU scores more closely, but MMLU isn't everything...)

Hope to chat with you guys again soon; I'd also be interested in your experiences with Spectrum vs. full fine-tuning/LoRA and some other stuff.

jphme changed discussion status to closed
VAGO solutions org

With pleasure :-)

It may well be that our results do not directly confirm the scores measured by Meta. We used the --apply_chat_template parameter to evaluate the MMLU score (as we did for the other benchmarks), which is consistent with how our models are evaluated when we submit them to the new HF leaderboard.

While this approach may yield different absolute values, it doesn't alter the relative performance between our model and the original one. This is supported by our previous experiments with other models, where the relative performance in MMLU (and MMLU-Pro) showed the same pattern.
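
To make that concrete, here is a minimal sketch of evaluating two checkpoints under the identical chat-template setting (assuming a recent lm-evaluation-harness release where simple_evaluate exposes apply_chat_template and fewshot_as_multiturn; the model ids are placeholders):

```python
# Minimal sketch: scoring two checkpoints under the *same* chat-template
# setting, so the relative difference is comparable. Model ids are
# placeholders; assumes a recent lm-evaluation-harness where
# simple_evaluate exposes apply_chat_template and fewshot_as_multiturn.
import lm_eval

for model_id in [
    "meta-llama/Meta-Llama-3.1-70B-Instruct",  # reference model (placeholder)
    "your-org/your-finetune",                  # hypothetical fine-tune
]:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16,parallelize=True",
        tasks=["mmlu"],
        batch_size=2,
        apply_chat_template=True,   # wrap every request in the model's chat template
        fewshot_as_multiturn=True,  # present few-shot examples as chat turns
    )
    print(model_id, out["results"])
```

Only the gap between the two numbers is meant to be compared here, not the absolute values.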

And true, it's important to note that MMLU is just one metric among many. Benchmarks in general have limitations, but they currently offer the most objective means of assessing model performance.
Unfortunately, submissions to platforms like the Chatbot Arena for evaluation are "restricted" for most researchers :-/

I'm also looking forward to chatting with you again about the latest FT techniques :)
