Are benchmark scores normalised to a baseline?
#2 by Abulaphia - opened
In the documentation, I see reference to a baseline model for GSM8k. Are the scores for models on the archived leaderboard raw scores, or are they normalised in some way / compared to a standard benchmark? If the latter, is there somewhere I can find details on the methodology?
Hi! The scores here are all raw; we only added normalisation in v2 :)
The baseline scores (the "baseline" row) were taken, for each benchmark, from the paper that introduced it.
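To make the raw vs. normalised distinction concrete, here is a minimal sketch of one common way to rescale a raw score against a baseline, so that the baseline maps to 0 and a perfect score maps to 100. The function name `normalise_score`, the clipping of below-baseline scores to 0, and the example numbers are all assumptions for illustration; nothing in this thread confirms that v2 uses exactly this formula.

```python
def normalise_score(raw: float, baseline: float, max_score: float = 100.0) -> float:
    """Rescale `raw` so that `baseline` maps to 0 and `max_score` maps to 100.

    Hypothetical helper for illustration only; scores below the baseline
    are clipped to 0 (an assumption, not taken from this thread).
    """
    if max_score <= baseline:
        raise ValueError("max_score must be greater than baseline")
    if raw < baseline:
        return 0.0
    return (raw - baseline) / (max_score - baseline) * 100.0


# Made-up example: a 4-way multiple-choice task where guessing scores 25%.
raw_score = 62.5
print(normalise_score(raw_score, baseline=25.0))  # 50.0
```

With raw scores, as on the archived leaderboard, 62.5 would simply be reported as 62.5.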
clefourrier changed discussion status to closed