benlipkin committed
Commit f2cdfc1
1 Parent(s): 82929e6

add results table

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -101,8 +101,6 @@ NuminaMath is a series of language models that are trained with two stages of su
 * **Stage 1:** fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.
 * **Stage 2:** fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs.
 
-
-
 ## Model description
 
 - **Model type:** A 72B parameter math LLM fine-tuned on a dataset with 860k+ math problem-solution pairs.
@@ -110,6 +108,21 @@ NuminaMath is a series of language models that are trained with two stages of su
 - **License:** Tongyi Qianwen
 - **Finetuned from model:** [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B)
 
+## Model performance
+
+| Benchmark | Setting | NuminaMath-72B-CoT | NuminaMath-72B-TIR | Qwen2-72B-Instruct | Llama3-70B-Instruct | Claude-3.5-Sonnet | GPT-4o-0513 |
+| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: |
+| **GSM8k** | 0-shot | 91.4% | 91.5% | 91.1% | 93.0% | **96.4%** | 95.8% |
+| Grade school math | | | | | | | |
+| **MATH** | 0-shot | 68.0% | 75.8% | 59.7% | 50.4% | 71.1% | **76.6%** |
+| Math problem-solving | | | | | | | |
+| **AMC 2023** | 0-shot | 21/40 | **24/40** | 19/40 | 13/40 | 17/40 | 20/40 |
+| Competition-level math | maj@64 | 24/40 | **34/40** | 21/40 | 13/40 | - | - |
+| **AIME 2024** | 0-shot | 1/30 | **5/30** | 3/30 | 0/30 | 2/30 | 2/30 |
+| Competition-level math | maj@64 | 3/30 | **12/30** | 4/30 | 2/30 | - | - |
+
+*Table: Comparison of various open-weight and proprietary language models on math benchmarks. All scores except those for NuminaMath-72B-TIR are reported without tool-integrated reasoning.*
+
 ### Model Sources
 
 <!-- Provide the basic links for the model. -->
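For context on the maj@64 rows in the added table: the notation commonly denotes majority voting (self-consistency), where 64 solutions are sampled per problem and the most frequent final answer is the one scored. The sketch below illustrates that scoring rule; `majority_at_k`, its input format, and the toy data are illustrative assumptions, not the NuminaMath evaluation harness.

```python
from collections import Counter

def majority_at_k(answers_per_problem, references):
    """Score a benchmark with majority voting (maj@k).

    answers_per_problem: one entry per problem, each holding the k final
        answers extracted from k sampled completions (illustrative format;
        answer extraction is assumed to have already happened).
    references: gold answers, aligned with answers_per_problem.
    Returns the fraction of problems whose most frequent sampled answer
    matches the gold answer.
    """
    correct = 0
    for answers, gold in zip(answers_per_problem, references):
        # The most common extracted answer across the k samples wins the vote.
        voted, _ = Counter(answers).most_common(1)[0]
        correct += int(voted == gold)
    return correct / len(references)

# Toy usage: 2 problems, 4 samples each (the table above uses k = 64).
print(majority_at_k([["42", "42", "41", "42"], ["7", "9", "9", "9"]], ["42", "8"]))  # 0.5
```

Read against the table, majority voting helps NuminaMath-72B-TIR most on the competition-level sets, moving it from 24/40 to 34/40 on AMC 2023 and from 1 in 5/30 to 12/30 on AIME 2024.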