---
base_model: Qwen/Qwen2.5-3B
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- mathematical-reasoning
- code-generation
- reinforcement-learning
- reasoning
---

# Qwen2.5-3B-GRPO-MATH-1EPOCH

This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).

**Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty* (the model's own internal confidence) as the sole reward. It is built on a novel paradigm called **Reinforcement Learning from Internal Feedback (RLIF)**. This model is an instance fine-tuned with the GRPO policy optimization algorithm within that framework.

RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments show that Intuitor matches GRPO's performance on mathematical benchmarks while generalizing better to out-of-domain tasks such as code generation.

## Key Features

* **Reinforcement Learning from Internal Feedback (RLIF)**: A framework that lets LLMs learn from intrinsic signals, without external rewards, gold labels, or verifiers.
* **Self-Certainty as Reward**: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal.
* **Mathematical Reasoning**: Fine-tuned on the MATH dataset to strengthen mathematical reasoning capabilities.
* **Code Generation**: Demonstrates strong generalization to code generation tasks.

## Usage

This model is compatible with the Hugging Face `transformers` library and can be loaded for text generation as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Define a prompt for mathematical reasoning
prompt = "Question: What is the sum of the first 100 positive integers? Answer:"

# Apply the chat template used by Qwen models
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Encode the input
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

# Set the generation configuration
generation_config = GenerationConfig(
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Generate and decode the response
outputs = model.generate(input_ids, generation_config=generation_config)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Code

The official implementation and training scripts are available in the [GitHub repository](https://github.com/sunblaze-ucb/Intuitor).
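For a concrete sense of the reward signal described above, the snippet below is a minimal, illustrative sketch of a self-certainty score, assuming (following the paper's description) that self-certainty is the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the response tokens. The function name and indexing details are hypothetical and are not the official training code; see the GitHub repository above for the exact implementation.

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def self_certainty_score(model, tokenizer, prompt: str, response: str) -> float:
    """Illustrative (unofficial) self-certainty score: the mean KL(U || p) over
    response tokens, where U is uniform over the vocabulary and p is the model's
    next-token distribution. Higher values indicate more peaked, confident
    predictions."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # The distribution that predicts response token t sits at position t - 1,
    # so take logits from the last prompt position through the second-to-last token.
    start = prompt_ids.shape[-1] - 1
    end = input_ids.shape[-1] - 1
    log_probs = F.log_softmax(logits[0, start:end].float(), dim=-1)

    vocab_size = log_probs.shape[-1]
    # KL(U || p) = -log(V) - (1/V) * sum_j log p_j at each position
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean().item()

# Example, reusing `model` and `tokenizer` from the Usage section:
# print(self_certainty_score(model, tokenizer, "2 + 2 =", " 4"))
```

In GRPO-style training, scores like this would be computed for a group of sampled responses to the same prompt and normalized into advantages; the function above only illustrates how the intrinsic signal itself can be measured.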
## Citation

If you use this model or the associated research, please cite the paper:

```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}

@article{sha2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and … Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```