Clémentine committed
Commit: 698f471
Parent(s): ead4c96
Message: removed drop

Files changed:
- src/display/about.py (+4, -20)
- src/display/utils.py (+0, -4)
src/display/about.py
CHANGED
@@ -36,7 +36,6 @@ If there is no icon, we have not uploaded the information on the model yet, feel
 - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.
 - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande </a> (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
 - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k </a> (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
-- <a href="https://arxiv.org/abs/1903.00161" target="_blank"> DROP </a> (3-shot) - English reading comprehension benchmark requiring Discrete Reasoning Over the content of Paragraphs.
 
 For all these evaluations, a higher score is a better score.
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
@@ -49,10 +48,10 @@ You can find:
 
 ## Reproducibility
 To reproduce our results, here is the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
-`python main.py --model=hf-causal --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
-` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=
+`python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
+` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
 
-The total batch size we get for models which fit on one A100 node is
+The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
 *You can expect results to vary slightly for different batch sizes because of padding.*
 
 The tasks and few shots parameters are:
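Editorial sketch, not part of this commit: the updated two-line command above, assembled and launched from Python. The model id, revision, and output path are placeholders; the task string and few-shot count follow the task list in the next hunk.

```python
# Sketch only: run the updated lm-evaluation-harness command for one task.
# <your_model>, the revision and the output path are placeholders, not values
# taken from this repository.
import subprocess

model = "<your_model>"              # a model id on the Hugging Face Hub (placeholder)
revision = "<your_model_revision>"  # placeholder revision
task = "gsm8k"                      # one of the task strings listed below
num_fewshot = 5                     # few-shot setting used for GSM8k

cmd = [
    "python", "main.py",
    "--model=hf-causal-experimental",
    f"--model_args=pretrained={model},use_accelerate=True,revision={revision}",
    f"--tasks={task}",
    f"--num_fewshot={num_fewshot}",
    "--batch_size=1",
    "--output_path=results.json",   # placeholder output path
]
subprocess.run(cmd, check=True)     # run from the lm-evaluation-harness repo root
```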
@@ -62,11 +61,9 @@ The tasks and few shots parameters are:
 - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
 - Winogrande: 5-shot, *winogrande* (`acc`)
 - GSM8k: 5-shot, *gsm8k* (`acc`)
-- DROP: 3-shot, *drop* (`f1`)
 
 Side note on the baseline scores:
 - for log-likelihood evaluation, we select the random baseline
-- for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
 - for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
 ## Quantization
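As an aside (editorial, not part of the diff): the MMLU entry above is just the 57 hendrycksTest subtasks joined by commas, and its reported score is the mean of their per-subtask `acc` values. A minimal sketch, assuming the harness naming scheme hendrycksTest-<subject>:

```python
# Sketch only: build the comma-separated MMLU task list and average the
# per-subtask `acc` results; the subject list is abbreviated here.
subjects = [
    "abstract_algebra", "anatomy", "astronomy",
    # ... the remaining subjects from the list above ...
    "world_religions",
]
mmlu_task_list = ",".join(f"hendrycksTest-{s}" for s in subjects)

def mmlu_average(per_task_acc: dict) -> float:
    """Mean of the per-subtask `acc` values, as reported for MMLU."""
    return sum(per_task_acc.values()) / len(per_task_acc)
```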
@@ -203,17 +200,4 @@ CITATION_BUTTON_TEXT = r"""
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
-
-title={{DROP:} {A} Reading Comprehension Benchmark Requiring Discrete Reasoning
-Over Paragraphs},
-author={Dheeru Dua and
-Yizhong Wang and
-Pradeep Dasigi and
-Gabriel Stanovsky and
-Sameer Singh and
-Matt Gardner},
-year={2019},
-eprinttype={arXiv},
-eprint={1903.00161},
-primaryClass={cs.CL}
-}"""
+"""
src/display/utils.py
CHANGED
@@ -20,7 +20,6 @@ class Tasks(Enum):
     truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
     winogrande = Task("winogrande", "acc", "Winogrande")
     gsm8k = Task("gsm8k", "acc", "GSM8K")
-    drop = Task("drop", "f1", "DROP")
 
 # These classes are for user facing column names,
 # to avoid having to change them all around the code
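For context (editorial sketch, not code from the repository): after this hunk the enum keeps one member per remaining benchmark. Assuming Task is a simple (benchmark, metric, col_name) record, which is how it is used in these lines, the surviving members look like:

```python
# Sketch of the Tasks enum after the removal; the Task shape is assumed from
# its usage here, and the earlier members are only referenced in a comment.
from collections import namedtuple
from enum import Enum

Task = namedtuple("Task", ["benchmark", "metric", "col_name"])

class Tasks(Enum):
    # ... other benchmarks (HellaSwag, MMLU, ...) defined above this hunk ...
    truthfulqa = Task("truthfulqa:mc", "mc2", "TruthfulQA")
    winogrande = Task("winogrande", "acc", "Winogrande")
    gsm8k = Task("gsm8k", "acc", "GSM8K")
    # drop = Task("drop", "f1", "DROP")  # removed by this commit
```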
@@ -79,7 +78,6 @@ baseline_row = {
     AutoEvalColumn.truthfulqa.name: 25.0,
     AutoEvalColumn.winogrande.name: 50.0,
     AutoEvalColumn.gsm8k.name: 0.21,
-    AutoEvalColumn.drop.name: 0.47,
     AutoEvalColumn.dummy.name: "baseline",
     AutoEvalColumn.model_type.name: "",
 }
@@ -89,7 +87,6 @@ baseline_row = {
 # HellaSwag human baseline is 0.95 (source: https://deepgram.com/learn/hellaswag-llm-benchmark-guide)
 # MMLU human baseline is 0.898 (source: https://openreview.net/forum?id=d7KBjmI3GmQ)
 # TruthfulQA human baseline is 0.94(source: https://arxiv.org/pdf/2109.07958.pdf)
-# Drop: https://leaderboard.allenai.org/drop/submissions/public
 # Winogrande: https://leaderboard.allenai.org/winogrande/submissions/public
 # GSM8K: paper
 # Define the human baselines
@@ -104,7 +101,6 @@ human_baseline_row = {
     AutoEvalColumn.truthfulqa.name: 94.0,
     AutoEvalColumn.winogrande.name: 94.0,
     AutoEvalColumn.gsm8k.name: 100,
-    AutoEvalColumn.drop.name: 96.42,
     AutoEvalColumn.dummy.name: "human_baseline",
     AutoEvalColumn.model_type.name: "",
 }
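Taken together (editorial note): both baseline rows are plain dicts keyed by display column names, so retiring a benchmark only means deleting its key here, its Tasks member, and the source comment. A toy illustration of the pattern, with values copied from this hunk and the previous ones:

```python
# Toy illustration of the two baseline rows after the removal, not repository
# code; only the columns visible in this diff are shown.
baseline_scores = {
    "TruthfulQA": 25.0,   # random baseline
    "Winogrande": 50.0,
    "GSM8K": 0.21,
    # "DROP": 0.47,       # removed by this commit
}
human_baseline_scores = {
    "TruthfulQA": 94.0,
    "Winogrande": 94.0,
    "GSM8K": 100,
    # "DROP": 96.42,      # removed by this commit
}
```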