Update utils.py
utils.py CHANGED
@@ -39,11 +39,14 @@ Compared to the original MMLU, there are three major differences:
 
 - The original MMLU dataset only contains 4 options, MMLU-Pro increases it to 10 options. The increase in options will make the evaluation more realistic and challenging. The random guessing will lead to a much lower score.
 - The original MMLU dataset contains mostly knowledge-driven questions without requiring much reasoning. Therefore, PPL results are normally better than CoT. In our dataset, we increase the problem difficulty and integrate more reasoning-focused problems. In MMLU-Pro, CoT can be 20% higher than PPL.
-- By increasing the distractor numbers, we significantly reduce the probability of correct guess by chance to boost the benchmark’s robustness. Specifically, with 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro
+- By increasing the distractor numbers, we significantly reduce the probability of correct guess by chance to boost the benchmark’s robustness. Specifically, with 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro.
 
 For detailed information about the dataset, visit our page on Hugging Face: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.
+
 If you are interested in replicating these results or wish to evaluate your models using our dataset, access our evaluation scripts available on GitHub: https://github.com/TIGER-AI-Lab/MMLU-Pro.
-
+
+If you would like to learn more details about our dataset, please check out our paper: https://arxiv.org/abs/2406.01574.
+
 Below you can find the accuracies of different models tested on this dataset.
 
 """
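Since the docstring points readers to the dataset page and the evaluation scripts, here is a minimal sketch of how one might load MMLU-Pro with the Hugging Face `datasets` library and check the 10-option format the docstring describes. It is not part of the commit above; the split name ("test") and field names ("question", "options", "answer") are assumptions based on the dataset card and should be verified against https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

```python
# Sketch only: load MMLU-Pro from the Hugging Face Hub and inspect one item.
# Split and field names below are assumptions from the dataset card, not from
# this repository's code.
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

example = dataset[0]
print(example["question"])
print(example["options"])  # list of up to 10 answer choices
print(example["answer"])   # gold answer label

# With 10 options instead of MMLU's 4, the random-guessing baseline drops from
# 1/4 = 25% to 1/10 = 10%, which is what the docstring means by random guessing
# leading to a much lower score.
```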