benlipkin committed
Commit f2cdfc1
1 Parent(s): 82929e6

add results table

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -101,8 +101,6 @@ NuminaMath is a series of language models that are trained with two stages of su
 * **Stage 1:** fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.
 * **Stage 2:** fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs.
 
-
-
 ## Model description
 
 - **Model type:** A 72B parameter math LLM fine-tuned on a dataset with 860k+ math problem-solution pairs.
@@ -110,6 +108,21 @@ NuminaMath is a series of language models that are trained with two stages of su
 - **License:** Tongyi Qianwen
 - **Finetuned from model:** [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B)
 
+## Model performance
+
+| Benchmark | Setting | NuminaMath-72B-CoT | NuminaMath-72B-TIR | Qwen2-72B-Instruct | Llama3-70B-Instruct | Claude-3.5-Sonnet | GPT-4o-0513 |
+| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: |
+| **GSM8k** | 0-shot | 91.4% | 91.5% | 91.1% | 93.0% | **96.4%** | 95.8% |
+| Grade school math | | | | | | | |
+| **MATH** | 0-shot | 68.0% | 75.8% | 59.7% | 50.4% | 71.1% | **76.6%** |
+| Math problem-solving | | | | | | | |
+| **AMC 2023** | 0-shot | 21/40 | **24/40** | 19/40 | 13/40 | 17/40 | 20/40 |
+| Competition-level math | maj@64 | 24/40 | **34/40** | 21/40 | 13/40 | - | - |
+| **AIME 2024** | 0-shot | 1/30 | **5/30** | 3/30 | 0/30 | 2/30 | 2/30 |
+| Competition-level math | maj@64 | 3/30 | **12/30** | 4/30 | 2/30 | - | - |
+
+*Table: Comparison of various open-weight and proprietary language models on math benchmarks. All scores except those for NuminaMath-72B-TIR are reported without tool-integrated reasoning.*
+
 ### Model Sources
 
 <!-- Provide the basic links for the model. -->
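For context on the maj@64 rows in the added table: the notation commonly denotes majority voting (self-consistency), where 64 solutions are sampled per problem and the most frequent final answer is the one scored. The sketch below illustrates that scoring rule; `majority_at_k`, its input format, and the toy data are illustrative assumptions, not the NuminaMath evaluation harness.

```python
from collections import Counter

def majority_at_k(answers_per_problem, references):
    """Score a benchmark with majority voting (maj@k).

    answers_per_problem: one entry per problem, each holding the k final
        answers extracted from k sampled completions (illustrative format;
        answer extraction is assumed to have already happened).
    references: gold answers, aligned with answers_per_problem.
    Returns the fraction of problems whose most frequent sampled answer
    matches the gold answer.
    """
    correct = 0
    for answers, gold in zip(answers_per_problem, references):
        # The most common extracted answer across the k samples wins the vote.
        voted, _ = Counter(answers).most_common(1)[0]
        correct += int(voted == gold)
    return correct / len(references)

# Toy usage: 2 problems, 4 samples each (the table above uses k = 64).
print(majority_at_k([["42", "42", "41", "42"], ["7", "9", "9", "9"]], ["42", "8"]))  # 0.5
```

Read against the table, majority voting helps NuminaMath-72B-TIR most on the competition-level sets, moving it from 24/40 to 34/40 on AMC 2023 and from 1 in 5/30 to 12/30 on AIME 2024.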