Majority-Voting: Llama-3.2-3B-Instruct trained on DAPO-14k

This model is a Llama-3.2-3B-Instruct checkpoint trained with the Majority-Voting self-rewarding method on the DAPO-14k training set. It is released as part of the research presented in the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models.
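As a rough illustration of the majority-voting self-reward idea (this sketch is not taken from the Co-rewarding codebase, and all names are illustrative): for each prompt, the model samples several rollouts, the most common extracted final answer is treated as a pseudo-label, and each rollout is rewarded for matching it.

```python
from collections import Counter

def majority_vote_rewards(final_answers):
    """Given the extracted final answers of all rollouts for one prompt,
    reward each rollout 1.0 if it matches the majority answer, else 0.0."""
    majority, _ = Counter(final_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in final_answers]

# Example: 8 sampled rollouts for one math question.
answers = ["42", "42", "17", "42", "9", "42", "17", "42"]
print(majority_vote_rewards(answers))
# [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```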

Co-rewarding is a self-supervised reinforcement learning (RL) framework designed to elicit stronger reasoning in large language models (LLMs). It stabilizes training by seeking complementary supervision from different views of the same problem, which mitigates the training collapse and reward hacking commonly observed in other self-rewarding methods. Empirically, Co-rewarding trains stably and outperforms other self-rewarding baselines on a range of mathematical reasoning benchmarks.
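One hedged reading of "complementary supervision from different views" is that the pseudo-label for a question comes from the model's answers to a semantically equivalent rephrasing of that question, rather than from the question's own rollout group. The sketch below is illustrative only; the helper names and the exact agreement rule are assumptions, not the repository's implementation.

```python
from collections import Counter

def majority_answer(answers):
    """Most common extracted answer in a group of rollouts."""
    return Counter(answers).most_common(1)[0][0]

def cross_view_reward(answer_on_original, answers_on_rephrased):
    """Reward an answer to the original question by its agreement with the
    majority answer on a rephrased view of the same question, so a rollout
    cannot trivially collude with its own group's majority."""
    return 1.0 if answer_on_original == majority_answer(answers_on_rephrased) else 0.0

print(cross_view_reward("84", ["84", "84", "80", "84"]))  # 1.0
print(cross_view_reward("80", ["84", "84", "80", "84"]))  # 0.0
```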

For more details, code, and other checkpoints, refer to the official GitHub repository: https://github.com/tmlr-group/Co-rewarding.
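A typical transformers loading snippet (not part of the original card; the chat-template call assumes the standard Llama 3.2 instruct format):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Majority-Voting-Llama-3.2-3B-Instruct-DAPO14k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Compute 12 * 7 + 5, reasoning step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```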

Checkpoint format: Safetensors · Tensor type: BF16 · Model size: ~3B params (per the Llama-3.2-3B base)