---
base_model: Qwen/Qwen2.5-3B
datasets:
- math
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- mathematical-reasoning
- code-generation
- reinforcement-learning
- reasoning
---

# Qwen2.5-3B-GRPO-MATH-1EPOCH

This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).

**Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty* (the model's own internal confidence) as the sole reward. It is built on a novel paradigm called **Reinforcement Learning from Internal Feedback (RLIF)**. This model is an instance fine-tuned with the GRPO policy optimization algorithm within that framework.

RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments show that Intuitor matches GRPO's performance on mathematical benchmarks while generalizing better to out-of-domain tasks such as code generation.

## Key Features

* **Reinforcement Learning from Internal Feedback (RLIF)**: A framework that lets LLMs learn from intrinsic signals, without external rewards, gold labels, or verifiers.
* **Self-Certainty as Reward**: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal.
* **Mathematical Reasoning**: Fine-tuned on the MATH dataset to strengthen mathematical reasoning capabilities.
* **Code Generation**: Demonstrates strong generalization to code generation tasks.

## Usage

This model is compatible with the Hugging Face `transformers` library and can be loaded for text generation as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Define a prompt for mathematical reasoning
prompt = "Question: What is the sum of the first 100 positive integers? Answer:"

# Apply the chat template used by Qwen models
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Encode the input
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

# Set the generation configuration
generation_config = GenerationConfig(
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Generate and decode the response
outputs = model.generate(input_ids, generation_config=generation_config)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Code

The official implementation and training scripts are available in the [GitHub repository](https://github.com/sunblaze-ucb/Intuitor).
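For a concrete sense of the reward signal described above, the snippet below is a minimal, illustrative sketch of a self-certainty score, assuming (following the paper's description) that self-certainty is the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the response tokens. The function name and indexing details are hypothetical and are not the official training code; see the GitHub repository above for the exact implementation.

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def self_certainty_score(model, tokenizer, prompt: str, response: str) -> float:
    """Illustrative (unofficial) self-certainty score: the mean KL(U || p) over
    response tokens, where U is uniform over the vocabulary and p is the model's
    next-token distribution. Higher values indicate more peaked, confident
    predictions."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # The distribution that predicts response token t sits at position t - 1,
    # so take logits from the last prompt position through the second-to-last token.
    start = prompt_ids.shape[-1] - 1
    end = input_ids.shape[-1] - 1
    log_probs = F.log_softmax(logits[0, start:end].float(), dim=-1)

    vocab_size = log_probs.shape[-1]
    # KL(U || p) = -log(V) - (1/V) * sum_j log p_j at each position
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean().item()

# Example, reusing `model` and `tokenizer` from the Usage section:
# print(self_certainty_score(model, tokenizer, "2 + 2 =", " 4"))
```

In GRPO-style training, scores like this would be computed for a group of sampled responses to the same prompt and normalized into advantages; the function above only illustrates how the intrinsic signal itself can be measured.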
## Citation

If you use this model or the associated research, please cite the paper:

```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}

@article{sha2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and … Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```