arxiv:2312.09241

TinyGSM: achieving >80% on GSM8k with small language models

Published on Dec 14, 2023

· Submitted by

akhaliq on Dec 15, 2023

#2 Paper of the day

Upvote

Authors:

Bingbin Liu ,

Sebastien Bubeck ,

Ronen Eldan ,

Yuanzhi Li ,

Anh Nguyen ,

Abstract

Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80\% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5\% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 ``teacher'' model (77.4\%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.

View arXiv page View PDF Add to collection

Community

unclecode

Dec 16, 2023

This will open the door to having very specific model runs locally, making AI accessible for all children everywhere. Instead of needing an AI tutor with high generalization, we can have a tutor that answers the same questions asked many years ago. By using a verifier trained on a tiny amount of data from GSM, we can intentionally contaminate it, resulting in a SML that is good at answering GSM-like questions. This is indeed a smart move!

Intentional overfitting or contamination can be beneficial, especially for educational AI tutors. For instance, Grade 7 math questions haven't changed significantly over time. A specialized AI tutor for this grade should focus on these specific questions, using overfitting as a tool for precision rather than generalization. This approach aligns with the educational domain's needs, ensuring that the AI remains focused on relevant material.

I wonder if you guys can share the TinyGSM dataset? I like to try your approach for other STEM topics and different grades, to have many SMLs each sophisticated on one topic and one grade.

Thanks.

ClaraBing

Paper author Jul 11

I apologize for the late response and thank you for your interest in our dataset!
Please find our dataset here: https://huggingface.co/datasets/TinyGSM/TinyGSM

librarian-bot

Dec 21, 2023

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

michal-stefanik

Dec 25, 2023

Thank you for sharing the work!

Regarding the construction of the TinyGSM dataset used for training, I was wondering if some arrangements/checks were made to avoid coincidental leakage of duplicates or near-duplicates of GSM8K's test set. As the scale and diversity were the main objectives in creating the dataset, it might be worth checking.

Once the TinyGSM dataset is available, we can also run the check ourselves, like we did with other math reasoning datasets where we found this to be a common issue.

ClaraBing

Paper author Jul 11

(I'm sorry that this is a very late reply!)

Thank you for the great question and for your interest in our work! We ran n-gram checks to make sure that there is no verbatim duplication, but we are not sure whether there are duplication by rephrasing. One possible way to check for semantic similarity is by comparing the embeddings of some trained language models, but this doesn't work well for math: for example, changing the numbers would make the question different but still have the same semantics.

Our dataset is available here: https://huggingface.co/datasets/TinyGSM/TinyGSM
Any duplication checks are welcome and we'd love to learn about your findings. Thank you!

MartialTerran

Mar 27

Dear Ronen Eldan, I would like to try to finetune this Math GSM model to improve its performance. If it the model is reliable then it could become a module in the TimeCapsuleTeacher(TM) platform, for teaching Math. Can you give me the model.py and train.py and configs and remote access to fast GPU/TPU to finetune a separate version of the GSM model weights according to my own MathTrain.txt and finetuning methods? Is there a simple way to automatically test the performance-benchmarks of the finetuned model periodically during finetuning? (One local save of model weights to support an instance of inference mode, while the finetuning still proceeds.)