Performance on MATH dataset?

#3
by fzyzcjy - opened

Hi, thanks for the LLM! I would appreciate it if I could see the MATH performance of the SmolLM2 series (currently only GSM8K seems to be reported).

Hugging Face Smol Models Research org

Hi, HuggingFaceTB/SmolLM2-1.7B-Instruct scores 16.72 on MATH (4-shot).

@loubnabnl Hi, thank you very much! By the way, it seems that Llama-3.2-1B scores 30.6 on MATH and Qwen2.5-1.5B scores 55.2. Therefore, I wonder whether Hugging Face will create models that are stronger at math in the future?

Hugging Face Smol Models Research org
• edited Nov 4, 2024

Evaluation setups can differ. In ours (which we'll share soon), Llama-3.2-1B-Instruct scores 6.48 on MATH and Qwen2.5-1.5B-Instruct scores 31.07, so the model is already good at math for 1B-scale models, and we will continue to improve it in the next iterations.

Thank you! That's interesting: I personally reproduced Llama-3.2-1B at 27.8 with zero-shot CoT, among other results. Looking forward to your evaluation setups!
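For context on why numbers diverge so much, a zero-shot CoT MATH evaluation can be set up along the lines of the minimal sketch below. The dataset id, prompt wording, and answer extraction are illustrative assumptions (not the setup used by either party here), and each of them can shift MATH scores by several points.

# Minimal sketch of a zero-shot CoT MATH evaluation loop.
# Dataset id, prompt wording, and answer extraction are illustrative
# assumptions; each choice can move MATH scores by several points.
import re

import torch
from datasets import load_dataset
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assumed dataset id; the MATH test set exists in several
# repackaged versions on the Hub.
dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")

def extract_boxed(text):
    # MATH solutions conventionally wrap the final answer in \boxed{...}.
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1).strip() if match else None

correct = 0
for example in dataset:
    prompt = (
        example["problem"]
        + "\n\nThink step by step and put your final answer in \\boxed{}."
    )
    result = generator(
        [{"role": "user", "content": prompt}],
        max_new_tokens=1024,
        do_sample=False,
    )
    # With chat-format input, generated_text holds the whole conversation;
    # the assistant reply is the last message.
    reply = result[0]["generated_text"][-1]["content"]
    correct += extract_boxed(reply) == example["answer"]

print(f"accuracy: {correct / len(dataset):.3f}")

Even small variations of this harness (exact-match vs. normalized answer comparison, greedy vs. sampled decoding, prompt phrasing) are enough to explain gaps like 6.48 vs. 27.8.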

Looking forward to your evaluation setups! +1

Hugging Face Smol Models Research org
• edited Nov 26, 2024

Update: the code has been merged into smollm/evaluation.

The MATH task will likely be updated in mainline lighteval, but in the meantime you can use the task code in smollm/evaluation/tasks.py.
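If you haven't written a lighteval custom task before, such a file generally has the shape sketched below: a prompt function plus a LightevalTaskConfig, exposed through a module-level TASKS_TABLE. This follows lighteval's custom-task API as I understand it around this version and may differ across releases; the MATH-specific field values are assumptions, so refer to the actual smollm/evaluation/tasks.py for the real definition.

# Rough sketch of a lighteval custom task file; field values are
# illustrative assumptions, not the actual smollm/evaluation/tasks.py.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def math_prompt(line, task_name: str = None):
    # Turn one dataset row into a Doc: the query shown to the model
    # and the gold reference it is scored against.
    return Doc(
        task_name=task_name,
        query=f"Problem: {line['problem']}\nSolution:",
        choices=[line["solution"]],
        gold_index=0,
    )

math = LightevalTaskConfig(
    name="math",
    prompt_function=math_prompt,
    suite=["custom"],
    hf_repo="lighteval/MATH",  # assumed dataset id
    hf_subset="all",
    evaluation_splits=["test"],
    few_shots_split="train",
    generation_size=1024,
    metric=[Metrics.quasi_exact_match_math],
    stop_sequence=["\n"],
)

# lighteval discovers custom tasks through this module-level table.
TASKS_TABLE = [math]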

And run it with:

lighteval accelerate \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048" \
  --custom_tasks "tasks.py" --tasks "custom|math|4|1" --use_chat_template --output_dir "./evals" --save_details
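For reference, the task spec custom|math|4|1 follows lighteval's suite|task|num_fewshot|truncate_few_shots format: run the custom MATH task with 4 few-shot examples, allowing them to be truncated if the prompt exceeds the context length. --use_chat_template formats prompts with the model's chat template, which matters for instruct models like this one.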
loubnabnl changed discussion status to closed
