open_llm_leaderboard

Are benchmark scores normalised to a baseline?

by Abulaphia - opened Sep 20

In the documentation, I see a reference to a baseline model for GSM8K. Are the scores for models on the archived leaderboard raw scores, or are they normalised in some way (for example, compared against a standard baseline)? If the latter, is there somewhere I can find details on the methodology?

clefourrier

Open LLM Leaderboard Archive org 6 days ago

Hi! The scores here are all raw; we only added normalisation in v2 :)
The baseline scores (the row labelled "baseline") were taken, for each benchmark, from the paper that introduced it.
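
To make the raw vs. normalised distinction concrete, here is a minimal sketch of one common way to normalise a score: rescaling it between a random-guess baseline and the maximum possible score. This is an illustration under assumptions, not the leaderboard's confirmed implementation; the function name `rescale_score`, its parameters, and the example numbers are all hypothetical.

```python
def rescale_score(raw: float, lower_bound: float, upper_bound: float = 1.0) -> float:
    """Map a raw score onto a 0-100 scale.

    Hypothetical helper: lower_bound is an assumed random-guess baseline,
    upper_bound is the assumed maximum achievable score.
    """
    if upper_bound <= lower_bound:
        raise ValueError("upper_bound must be greater than lower_bound")
    # Scores at or below the baseline carry no signal, so clip them to 0.
    if raw < lower_bound:
        return 0.0
    return (raw - lower_bound) / (upper_bound - lower_bound) * 100.0


# Illustrative numbers only: a 4-choice multiple-choice task has a
# random-guess accuracy of 0.25.
raw_accuracy = 0.625
print(rescale_score(raw_accuracy, lower_bound=0.25))  # 50.0
```

With a raw leaderboard like the archived one, the 0.625 would be reported as-is (62.5%), so comparing against the "baseline" row is left to the reader.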

clefourrier changed discussion status to closed 6 days ago
