Majority-Voting: Llama-3.2-3B-Instruct trained on DAPO-14k
This model is a Llama-3.2-3B-Instruct checkpoint trained using the Majority-Voting method on the DAPO-14k training set. It is part of the research presented in the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models.
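Below is a minimal inference sketch using Hugging Face Transformers. The repository id is a placeholder assumption; substitute the actual model id of this checkpoint.

```python
# Minimal inference sketch with Hugging Face Transformers.
# NOTE: the model_id below is a placeholder, not the confirmed repository id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tmlr-group/Llama-3.2-3B-Instruct-Majority-Voting-DAPO14k"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "If 3x + 5 = 20, what is x? Reason step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```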
Co-rewarding is a self-supervised reinforcement learning (RL) framework designed to elicit stronger reasoning from Large Language Models (LLMs). It stabilizes training by drawing complementary supervision from different views of the same problem, which mitigates the training collapse and reward hacking often seen in other self-rewarding methods. Empirically, the framework trains stably and outperforms other self-rewarding baselines on a range of mathematical reasoning benchmarks.
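The Majority-Voting baseline named in the title can be illustrated as follows: sample several completions per question, take the most frequent final answer as a pseudo-label, and reward completions that agree with it. The sketch below shows this general idea only; the answer-extraction helper and reward values are assumptions, not the paper's exact implementation.

```python
# Illustrative majority-voting (self-consistency) reward sketch.
# extract_final_answer is an assumed helper, not code from the official repository.
from collections import Counter
from typing import List


def extract_final_answer(completion: str) -> str:
    """Assumed helper: take the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def majority_vote_rewards(completions: List[str]) -> List[float]:
    """The most common final answer among sampled completions becomes the
    pseudo-label; each completion gets reward 1.0 if it matches it, else 0.0."""
    answers = [extract_final_answer(c) for c in completions]
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


# Example: four sampled completions for the same question.
rollouts = [
    "3x = 15\nx = 5",
    "Subtract 5 to get 3x = 15, then divide by 3.\nx = 5",
    "3x = 25\nx = 25/3",
    "Subtract 5, divide by 3.\nx = 5",
]
print(majority_vote_rewards(rollouts))  # [1.0, 1.0, 0.0, 1.0]
```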
For more details, code, and other checkpoints, refer to the official GitHub repository: https://github.com/tmlr-group/Co-rewarding.