Clémentine committed
Commit: 698f471
Parent(s): ead4c96
Message: removed drop

Files changed:
- src/display/about.py (+4, -20)
- src/display/utils.py (+0, -4)
src/display/about.py
CHANGED
@@ -36,7 +36,6 @@ If there is no icon, we have not uploaded the information on the model yet, feel
 - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
 - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
 - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
-- <a href="https://arxiv.org/abs/1903.00161" target="_blank"> DROP </a> (3-shot) - English reading comprehension benchmark requiring Discrete Reasoning Over the content of Paragraphs.
 
 For all these evaluations, a higher score is a better score.
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
@@ -49,10 +48,10 @@ You can find:
 
 ## Reproducibility
 To reproduce our results, here is the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
-`python main.py --model=hf-causal --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
-` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=
+`python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
+` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
 
-The total batch size we get for models which fit on one A100 node is
+The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
 *You can expect results to vary slightly for different batch sizes because of padding.*
 
 The tasks and few shots parameters are:
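Editorial sketch, not part of this commit: the updated two-line command above, assembled and launched from Python. The model id, revision, and output path are placeholders; the task string and few-shot count follow the task list in the next hunk.

```python
# Sketch only: run the updated lm-evaluation-harness command for one task.
# <your_model>, the revision and the output path are placeholders, not values
# taken from this repository.
import subprocess

model = "<your_model>"              # a model id on the Hugging Face Hub (placeholder)
revision = "<your_model_revision>"  # placeholder revision
task = "gsm8k"                      # one of the task strings listed below
num_fewshot = 5                     # few-shot setting used for GSM8k

cmd = [
    "python", "main.py",
    "--model=hf-causal-experimental",
    f"--model_args=pretrained={model},use_accelerate=True,revision={revision}",
    f"--tasks={task}",
    f"--num_fewshot={num_fewshot}",
    "--batch_size=1",
    "--output_path=results.json",   # placeholder output path
]
subprocess.run(cmd, check=True)     # run from the lm-evaluation-harness repo root
```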
@@ -62,11 +61,9 @@ The tasks and few shots parameters are:
 - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
 - Winogrande: 5-shot, *winogrande* (`acc`)
 - GSM8k: 5-shot, *gsm8k* (`acc`)
-- DROP: 3-shot, *drop* (`f1`)
 
 Side note on the baseline scores:
 - for log-likelihood evaluation, we select the random baseline
-- for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
 - for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
 ## Quantization
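As an aside (editorial, not part of the diff): the MMLU entry above is just the 57 hendrycksTest subtasks joined by commas, and its reported score is the mean of their per-subtask `acc` values. A minimal sketch, assuming the harness naming scheme hendrycksTest-<subject>:

```python
# Sketch only: build the comma-separated MMLU task list and average the
# per-subtask `acc` results; the subject list is abbreviated here.
subjects = [
    "abstract_algebra", "anatomy", "astronomy",
    # ... the remaining subjects from the list above ...
    "world_religions",
]
mmlu_task_list = ",".join(f"hendrycksTest-{s}" for s in subjects)

def mmlu_average(per_task_acc: dict) -> float:
    """Mean of the per-subtask `acc` values, as reported for MMLU."""
    return sum(per_task_acc.values()) / len(per_task_acc)
```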
@@ -203,17 +200,4 @@ CITATION_BUTTON_TEXT = r"""
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
-
-title={{DROP:} {A} Reading Comprehension Benchmark Requiring Discrete Reasoning
-Over Paragraphs},
-author={Dheeru Dua and
-Yizhong Wang and
-Pradeep Dasigi and
-Gabriel Stanovsky and
-Sameer Singh and
-Matt Gardner},
-year={2019},
-eprinttype={arXiv},
-eprint={1903.00161},
-primaryClass={cs.CL}
-}"""
+"""
src/display/utils.py
CHANGED
@@ -20,7 +20,6 @@ class Tasks(Enum):
     truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
     winogrande = Task("winogrande", "acc", "Winogrande")
     gsm8k = Task("gsm8k", "acc", "GSM8K")
-    drop = Task("drop", "f1", "DROP")
 
 # These classes are for user facing column names,
 # to avoid having to change them all around the code
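For context (editorial sketch, not code from the repository): after this hunk the enum keeps one member per remaining benchmark. Assuming Task is a simple (benchmark, metric, col_name) record, which is how it is used in these lines, the surviving members look like:

```python
# Sketch of the Tasks enum after the removal; the Task shape is assumed from
# its usage here, and the earlier members are only referenced in a comment.
from collections import namedtuple
from enum import Enum

Task = namedtuple("Task", ["benchmark", "metric", "col_name"])

class Tasks(Enum):
    # ... other benchmarks (HellaSwag, MMLU, ...) defined above this hunk ...
    truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
    winogrande = Task("winogrande", "acc", "Winogrande")
    gsm8k = Task("gsm8k", "acc", "GSM8K")
    # drop = Task("drop", "f1", "DROP")  # removed by this commit
```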
@@ -79,7 +78,6 @@ baseline_row = {
     AutoEvalColumn.truthfulqa.name: 25.0,
     AutoEvalColumn.winogrande.name: 50.0,
     AutoEvalColumn.gsm8k.name: 0.21,
-    AutoEvalColumn.drop.name: 0.47,
     AutoEvalColumn.dummy.name: "baseline",
     AutoEvalColumn.model_type.name: "",
 }
@@ -89,7 +87,6 @@ baseline_row = {
 # HellaSwag human baseline is 0.95 (source: https://deepgram.com/learn/hellaswag-llm-benchmark-guide)
 # MMLU human baseline is 0.898 (source: https://openreview.net/forum?id=d7KBjmI3GmQ)
 # TruthfulQA human baseline is 0.94(source: https://arxiv.org/pdf/2109.07958.pdf)
-# Drop: https://leaderboard.allenai.org/drop/submissions/public
 # Winogrande: https://leaderboard.allenai.org/winogrande/submissions/public
 # GSM8K: paper
 # Define the human baselines
@@ -104,7 +101,6 @@ human_baseline_row = {
     AutoEvalColumn.truthfulqa.name: 94.0,
     AutoEvalColumn.winogrande.name: 94.0,
     AutoEvalColumn.gsm8k.name: 100,
-    AutoEvalColumn.drop.name: 96.42,
     AutoEvalColumn.dummy.name: "human_baseline",
     AutoEvalColumn.model_type.name: "",
 }
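Taken together (editorial note): both baseline rows are plain dicts keyed by display column names, so retiring a benchmark only means deleting its key here, its Tasks member, and the source comment. A toy illustration of the pattern, with values copied from this hunk and the previous ones:

```python
# Toy illustration of the two baseline rows after the removal, not repository
# code; only the columns visible in this diff are shown.
baseline_scores = {
    "TruthfulQA": 25.0,   # random baseline
    "Winogrande": 50.0,
    "GSM8K": 0.21,
    # "DROP": 0.47,       # removed by this commit
}
human_baseline_scores = {
    "TruthfulQA": 94.0,
    "Winogrande": 94.0,
    "GSM8K": 100,
    # "DROP": 96.42,      # removed by this commit
}
```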