Performance on MATH dataset?

#3
by fzyzcjy - opened

Hi, thanks for the LLM! I would appreciate it if I could see the MATH performance of the SmolLM2 series (currently only GSM8K seems to be reported).

Hugging Face Smol Models Research org

Hi, HuggingFaceTB/SmolLM2-1.7B-Instruct scores 16.72 on MATH (4-shot).

@loubnabnl Hi, thank you very much! By the way, it seems that Llama-3.2-1B scores 30.6 on MATH and Qwen2.5-1.5B scores 55.2. Therefore, I wonder whether Hugging Face will create models that are stronger at math in the future?

Hugging Face Smol Models Research org
• edited Nov 4, 2024

Evaluation setups can differ. In ours (which we'll share soon), Llama-3.2-1B-Instruct scores 6.48 on MATH and Qwen2.5-1.5B-Instruct scores 31.07, so the model is already good at math for 1B-scale models, and we will continue to improve it in the next iterations.

Thank you! That's interesting: I personally reproduced Llama-3.2-1B at 27.8 with zero-shot CoT, among other results. Looking forward to your evaluation setups!
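For context on why numbers diverge so much, a zero-shot CoT MATH evaluation can be set up along the lines of the minimal sketch below. The dataset id, prompt wording, and answer extraction are illustrative assumptions (not the setup used by either party here), and each of them can shift MATH scores by several points.

# Minimal sketch of a zero-shot CoT MATH evaluation loop.
# Dataset id, prompt wording, and answer extraction are illustrative
# assumptions; each choice can move MATH scores by several points.
import re

import torch
from datasets import load_dataset
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assumed dataset id; the MATH test set exists in several
# repackaged versions on the Hub.
dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")

def extract_boxed(text):
    # MATH solutions conventionally wrap the final answer in \boxed{...}.
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1).strip() if match else None

correct = 0
for example in dataset:
    prompt = (
        example["problem"]
        + "\n\nThink step by step and put your final answer in \\boxed{}."
    )
    result = generator(
        [{"role": "user", "content": prompt}],
        max_new_tokens=1024,
        do_sample=False,
    )
    # With chat-format input, generated_text holds the whole conversation;
    # the assistant reply is the last message.
    reply = result[0]["generated_text"][-1]["content"]
    correct += extract_boxed(reply) == example["answer"]

print(f"accuracy: {correct / len(dataset):.3f}")

Even small variations of this harness (exact-match vs. normalized answer comparison, greedy vs. sampled decoding, prompt phrasing) are enough to explain gaps like 6.48 vs. 27.8.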

Looking forward to your evaluation setups! +1

Hugging Face Smol Models Research org
• edited Nov 26, 2024

Update: the code has been merged into smollm/evaluation.

The MATH task will likely be updated in mainline lighteval, but in the meantime you can use the task code in smollm/evaluation/tasks.py.
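If you haven't written a lighteval custom task before, such a file generally has the shape sketched below: a prompt function plus a LightevalTaskConfig, exposed through a module-level TASKS_TABLE. This follows lighteval's custom-task API as I understand it around this version and may differ across releases; the MATH-specific field values are assumptions, so refer to the actual smollm/evaluation/tasks.py for the real definition.

# Rough sketch of a lighteval custom task file; field values are
# illustrative assumptions, not the actual smollm/evaluation/tasks.py.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def math_prompt(line, task_name: str = None):
    # Turn one dataset row into a Doc: the query shown to the model
    # and the gold reference it is scored against.
    return Doc(
        task_name=task_name,
        query=f"Problem: {line['problem']}\nSolution:",
        choices=[line["solution"]],
        gold_index=0,
    )

math = LightevalTaskConfig(
    name="math",
    prompt_function=math_prompt,
    suite=["custom"],
    hf_repo="lighteval/MATH",  # assumed dataset id
    hf_subset="all",
    evaluation_splits=["test"],
    few_shots_split="train",
    generation_size=1024,
    metric=[Metrics.quasi_exact_match_math],
    stop_sequence=["\n"],
)

# lighteval discovers custom tasks through this module-level table.
TASKS_TABLE = [math]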

And run it with:

lighteval accelerate \
  --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,revision=main,dtype=bfloat16,vllm,gpu_memory_utilisation=0.8,max_model_length=2048" \
  --custom_tasks "tasks.py" --tasks "custom|math|4|1" --use_chat_template --output_dir "./evals" --save_details
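For reference, the task spec custom|math|4|1 follows lighteval's suite|task|num_fewshot|truncate_few_shots format: run the custom MATH task with 4 few-shot examples, allowing them to be truncated if the prompt exceeds the context length. --use_chat_template formats prompts with the model's chat template, which matters for instruct models like this one.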
loubnabnl changed discussion status to closed
