Performance on MATH dataset?

#3
opened by fzyzcjy

Hi, thanks for the LLM! I would appreciate knowing the MATH performance of the SmolLM2 series (currently it seems only GSM8K is reported).

Hugging Face TB Research org

Hi, HuggingFaceTB/SmolLM2-1.7B-Instruct scores 16.72 on MATH (4-shot).

@loubnabnl Hi, thank you very much! Btw, it seems that Llama-3.2-1B is 30.6 on MATH and Qwen2.5-1.5B is 55.2 on MATH. Therefore, I wonder whether Hugging Face will create models that are stronger in math in the future?

Hugging Face TB Research org

Evaluation setups can differ. In ours (which we'll share soon), Llama-3.2-1B-Instruct scores 6.48 on MATH and Qwen2.5-1.5B-Instruct scores 31.07, so the model is already good at math among 1B-scale models, and we will continue to improve it in the next iterations.

Thank you! That's interesting - I personally reproduced Llama-3.2-1B at 27.8 with zero-shot CoT, for example. Looking forward to your evaluation setups!

Looking forward to your evaluation setups! +1

Hugging Face TB Research org

The MATH task will likely be updated in mainline lighteval, but in the meantime you could add the task code to smollm/evaluation/tasks.py, for example along the lines of the sketch below.
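For reference, here is a minimal sketch of what such a task definition in tasks.py could look like. This assumes lighteval's custom-task API (LightevalTaskConfig, Doc, and a module-level TASKS_TABLE) and the lighteval/MATH dataset on the Hub; the prompt format and metric are illustrative, and the actual code in the smollm repo may differ:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def math_prompt(line, task_name: str = None):
    # Each MATH example has a "problem" statement and a reference "solution"
    return Doc(
        task_name=task_name,
        query=f"Problem: {line['problem']}\nAnswer:",
        choices=[line["solution"]],
        gold_index=0,
    )


math_task = LightevalTaskConfig(
    name="math",                     # matched by the "custom|math|4|1" task spec
    suite=["custom"],
    prompt_function=math_prompt,
    hf_repo="lighteval/MATH",        # assumed dataset id
    hf_subset="all",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling",
    generation_size=1024,
    metric=[Metrics.quasi_exact_match_math],  # assumed metric; the repo may use its own
    stop_sequence=["\n"],
)

# lighteval discovers custom tasks through this module-level list
TASKS_TABLE = [math_task]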

And run it with

lighteval accelerate \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048" \
  --custom_tasks "tasks.py" --tasks "custom|math|4|1" --use_chat_template --output_dir "./evals" --save_details
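In the --tasks spec "custom|math|4|1", custom is the suite, math is the task name, 4 is the number of few-shot examples, and the trailing 1 lets lighteval reduce the few-shot count if the prompt would exceed the model's context length.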
