MMLU-Pro benchmark
#13 · opened by kth8
In Meta's announcement, I noticed they showed MMLU scores for the 1B and 3B models but not MMLU-Pro. Here are my test results, with Llama 3.1 8B and Qwen2.5 3B included for comparison:
| Models | Data Source | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health | History | Law | Math | Philosophy | Physics | Psychology | Other |
|-----------------------|---------------|---------|---------|----------|-----------|------------------|-----------|-------------|---------|---------|-------|-------|------------|---------|------------|-------|
| Llama-3.1-8B-Instruct | TIGER-Lab | 0.443 | 0.630 | 0.493 | 0.376 | 0.483 | 0.551 | 0.297 | 0.507 | 0.423 | 0.273 | 0.438 | 0.445 | 0.403 | 0.600 | 0.448 |
| Qwen2.5-3B | Self-Reported | 0.437 | 0.545 | 0.541 | 0.407 | 0.432 | 0.530 | 0.292 | 0.440 | 0.391 | 0.223 | 0.545 | 0.371 | 0.440 | 0.555 | 0.415 |
| Llama-3.2-3B-Instruct | Self-Reported | 0.365 | 0.552 | 0.399 | 0.264 | 0.371 | 0.480 | 0.260 | 0.461 | 0.336 | 0.227 | 0.378 | 0.349 | 0.302 | 0.514 | 0.358 |
You can view the full leaderboard here: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
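If you want to sanity-check numbers like these locally, here is a minimal sketch of a zero-shot run over the MMLU-Pro test split with `transformers`. It assumes the `TIGER-Lab/MMLU-Pro` dataset layout on the Hub (`question` / `options` / `answer` / `category` fields); note that the official leaderboard uses 5-shot CoT prompting, so a simplified script like this won't reproduce the table above exactly.

```python
# Minimal zero-shot MMLU-Pro run -- a sketch, not the official harness
# (the leaderboard uses 5-shot CoT prompting, so scores will differ).
import re

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed dataset fields: question / options / answer / category.
# The full test split is ~12k questions; shuffle + select a subset
# for a quick spot-check.
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

correct = total = 0
for row in dataset:
    choices = "\n".join(
        f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"])
    )
    messages = [{
        "role": "user",
        "content": (
            f"{row['question']}\n{choices}\n\n"
            "Answer with the letter of the correct option only."
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    reply = tokenizer.decode(
        output[0, input_ids.shape[1]:], skip_special_tokens=True
    )
    # Take the first standalone A-J letter as the model's choice.
    match = re.search(r"\b([A-J])\b", reply)
    correct += bool(match) and match.group(1) == row["answer"]
    total += 1

print(f"Accuracy: {correct / total:.3f} over {total} questions")
```

For numbers directly comparable to the leaderboard, use the official evaluation scripts in the TIGER-Lab MMLU-Pro repo instead, since prompt format and answer extraction both shift the scores by a few points.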