Majority-Voting: Llama-3.2-3B-Instruct trained on DAPO-14k
This model is a Llama-3.2-3B-Instruct checkpoint trained using the Majority-Voting method on the DAPO-14k training set. It is part of the research presented in the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models.
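Below is a minimal inference sketch using Hugging Face Transformers. The repository id is a placeholder assumption; substitute the actual model id of this checkpoint.

```python
# Minimal inference sketch with Hugging Face Transformers.
# NOTE: the model_id below is a placeholder, not the confirmed repository id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tmlr-group/Llama-3.2-3B-Instruct-Majority-Voting-DAPO14k"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "If 3x + 5 = 20, what is x? Reason step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```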
Co-rewarding is a self-supervised reinforcement learning (RL) framework designed to elicit stronger reasoning from Large Language Models (LLMs). It stabilizes training by drawing complementary supervision from different views of the same problem, which mitigates the training collapse and reward hacking often seen in other self-rewarding methods. Empirically, the framework trains stably and outperforms other self-rewarding baselines on a range of mathematical reasoning benchmarks.
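The Majority-Voting baseline named in the title can be illustrated as follows: sample several completions per question, take the most frequent final answer as a pseudo-label, and reward completions that agree with it. The sketch below shows this general idea only; the answer-extraction helper and reward values are assumptions, not the paper's exact implementation.

```python
# Illustrative majority-voting (self-consistency) reward sketch.
# extract_final_answer is an assumed helper, not code from the official repository.
from collections import Counter
from typing import List


def extract_final_answer(completion: str) -> str:
    """Assumed helper: take the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def majority_vote_rewards(completions: List[str]) -> List[float]:
    """The most common final answer among sampled completions becomes the
    pseudo-label; each completion gets reward 1.0 if it matches it, else 0.0."""
    answers = [extract_final_answer(c) for c in completions]
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


# Example: four sampled completions for the same question.
rollouts = [
    "3x = 15\nx = 5",
    "Subtract 5 to get 3x = 15, then divide by 3.\nx = 5",
    "3x = 25\nx = 25/3",
    "Subtract 5, divide by 3.\nx = 5",
]
print(majority_vote_rewards(rollouts))  # [1.0, 1.0, 0.0, 1.0]
```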
For more details, code, and other checkpoints, refer to the official GitHub repository: https://github.com/tmlr-group/Co-rewarding.