Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️ the order in which the few-shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?
I ran a small experiment on some MMLU subsets, using the best-performing pretrained models of 7B parameters or fewer from the leaderboard.
I tried 8 different prompting methods that are commonly used in evaluation, containing more or less information (just the question, or "Question: <question>", or "Question: <question> Choices: <choices>", ..., see the x axis).
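To make this concrete, here is a minimal Python sketch of what such format variants can look like (illustrative templates only, not the exact ones used in the experiment):

```python
# Illustrative MMLU-style prompt templates carrying more or less information.
# These are assumptions for the sketch, not the exact formats from the experiment.

def format_bare(question: str, choices: list[str]) -> str:
    # Just the raw question text
    return f"{question}\nAnswer:"

def format_question(question: str, choices: list[str]) -> str:
    # Add an explicit "Question:" prefix
    return f"Question: {question}\nAnswer:"

def format_question_choices(question: str, choices: list[str]) -> str:
    # Add the enumerated choices as well
    lettered = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return f"Question: {question}\nChoices:\n{lettered}\nAnswer:"
```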
I then compared the results for all these methods, in 5-shot, across 2 runs. The *only difference* between the first and second run was the order in which the few-shot samples were introduced.
For example, run 1 would use "A B C D E Current sample", while run 2 would use "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.
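In code, the two runs differ only in the permutation of the few-shot block, nothing else (a minimal sketch with hypothetical example names):

```python
# Hypothetical few-shot pool: 5 fixed (question, answer) pairs, labelled A..E.
fewshot = [("Q_A", "Ans_A"), ("Q_B", "Ans_B"), ("Q_C", "Ans_C"),
           ("Q_D", "Ans_D"), ("Q_E", "Ans_E")]

def build_prompt(shots, current_question):
    # Same template and same examples in both runs; only the order of `shots` changes.
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in shots)
    return demos + f"Question: {current_question}\nAnswer:"

run1 = build_prompt(fewshot, "Current sample")                                # A B C D E
run2 = build_prompt([fewshot[i] for i in (3, 2, 4, 0, 1)], "Current sample")  # D C E A B
```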
As you can see in the attached picture, you get a difference of up to 3 points between the two few-shot orderings.
So, when just changing *the order of the few-shot samples* can shift your results by several points, what is the impact of all the other "minimal" and unreported prompting changes?
-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or comms).
-> This is why we need reproducible evaluation in a fair and strictly identical setup, using evaluation suites such as lm_eval from the Harness, lighteval from Hugging Face, or the Open LLM Leaderboard.
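For instance, with lm_eval you can pin the whole setup (task, number of shots, model) in a single reproducible call. This is a hedged sketch for a recent harness version; argument names and task names may differ across releases, so check the repo's docs:

```python
# Hedged sketch for lm-evaluation-harness (lm_eval) >= 0.4; argument and task
# names can change between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # example model, swap in yours
    tasks=["mmlu_abstract_algebra"],                     # an example MMLU subset
    num_fewshot=5,
)
print(results["results"])
```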