Xuandong nielsr HF Staff committed on
Commit 2915b7d · verified · 1 Parent(s): 6cfb9d6

Improve model card: Add library, usage, tags, and links (#1)


- Improve model card: Add library, usage, tags, and links (2f0880aa9f648da8ae611f9c02d0920223da82f9)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +72 -10
README.md CHANGED
@@ -1,25 +1,88 @@
  ---
  base_model: Qwen/Qwen2.5-3B
- license: apache-2.0
  datasets:
- - math
  metrics:
- - accuracy
  pipeline_tag: text-generation
- language:
- - en
  ---

  # Qwen2.5-3B-GRPO-MATH-1EPOCH

- **Description:**

- A GRPO-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.

- ---

  ## Citation

  ```bibtex
  @article{zhao2025learning,
  title={Learning to Reason without External Rewards},
@@ -34,5 +97,4 @@ A GRPO-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2402.03300},
  year = {2024},
  }
- ```
-
 
  ---
  base_model: Qwen/Qwen2.5-3B
  datasets:
+ - math
+ language:
+ - en
+ license: apache-2.0
  metrics:
+ - accuracy
  pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - mathematical-reasoning
+ - code-generation
+ - reinforcement-learning
+ - reasoning
  ---

  # Qwen2.5-3B-GRPO-MATH-1EPOCH

+ This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).

+ **Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm called **Reinforcement Learning from Internal Feedback (RLIF)**. This model represents an instance fine-tuned using the GRPO policy optimization algorithm within this framework.

+ RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation.
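+
+ For intuition, self-certainty can be read as how far the model's next-token distribution sits from uniform. The snippet below is a minimal sketch of such a reward, assuming self-certainty is approximated as the mean KL divergence KL(U || p) between a uniform distribution U over the vocabulary and the model's per-token distribution p, averaged over the generated tokens; the function name, tensor shapes, and masking convention are illustrative assumptions, not the training implementation (see the GitHub repository linked below for the actual code).
+
+ ```python
+ import math
+ import torch
+ import torch.nn.functional as F
+
+ def self_certainty_reward(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
+     """Illustrative self-certainty score: mean KL(U || p) over generated tokens.
+
+     logits: (batch, seq_len, vocab_size) logits for the sampled response.
+     response_mask: (batch, seq_len) with 1 for response tokens, 0 elsewhere.
+     """
+     vocab_size = logits.size(-1)
+     log_probs = F.log_softmax(logits, dim=-1)
+     # KL(U || p) = -log|V| - (1/|V|) * sum_j log p(j)
+     kl_from_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
+     # Average over response tokens only; this per-sequence value acts as the intrinsic reward.
+     return (kl_from_uniform * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
+ ```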
+
+ ## Key Features
+ * **Reinforcement Learning from Internal Feedback (RLIF)**: A framework enabling LLMs to learn from intrinsic signals without external rewards, gold labels, or verifiers (see the sketch after this list for how such intrinsic rewards plug into GRPO's group-relative updates).
+ * **Self-Certainty as Reward**: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal.
+ * **Mathematical Reasoning**: Specifically fine-tuned on the MATH dataset to enhance mathematical reasoning capabilities.
+ * **Code Generation**: Demonstrates strong generalization to code generation tasks.
+
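+ As referenced above, the snippet below sketches the group-relative advantage step that GRPO-style optimization applies to whatever reward it is given (here, self-certainty rather than an external verifier score). It is illustrative only; `eps` and the variable names are assumptions, and the actual training code lives in the GitHub repository linked below.
+
+ ```python
+ import torch
+
+ def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+     """GRPO-style advantages: z-score each reward within its group of sampled responses.
+
+     rewards: (group_size,) one intrinsic (self-certainty) reward per sampled response.
+     """
+     return (rewards - rewards.mean()) / (rewards.std() + eps)
+
+ # Example: four responses sampled for one prompt, scored by self-certainty.
+ advantages = group_relative_advantages(torch.tensor([0.9, 1.4, 1.1, 0.7]))
+ ```
+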
+ ## Usage
+
+ This model is compatible with the Hugging Face `transformers` library. You can load and use it for text generation as follows:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
+ # Define a conversation prompt for mathematical reasoning
+ prompt = "Question: What is the sum of the first 100 positive integers?\nAnswer:"
+
+ # Apply the chat template suitable for Qwen models
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ # Encode the input
+ input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
+
+ # Set generation configuration
+ generation_config = GenerationConfig(
+     bos_token_id=tokenizer.bos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     max_new_tokens=2048,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ # Generate response
+ outputs = model.generate(input_ids, generation_config=generation_config)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(response)
+ ```
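+
+ For quick experiments, the same checkpoint can also be driven through the high-level `pipeline` API. This is a minimal sketch, not an official recipe: it assumes a recent `transformers` version that accepts chat-style message lists in the pipeline, and the generation parameters simply mirror the ones above.
+
+ ```python
+ from transformers import pipeline
+
+ # Load the checkpoint through the text-generation pipeline; weights and device
+ # placement are handled automatically.
+ generator = pipeline(
+     "text-generation",
+     model="sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH",
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
+ result = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
+
+ # For chat-style input, the last message in `generated_text` is the model's reply.
+ print(result[0]["generated_text"][-1]["content"])
+ ```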
+
+ ## Code
+ The official implementation and training scripts are available on the [GitHub repository](https://github.com/sunblaze-ucb/Intuitor).

  ## Citation

+ If you use this model or the associated research, please cite the paper:
+
  ```bibtex
  @article{zhao2025learning,
  title={Learning to Reason without External Rewards},

  journal = {arXiv preprint arXiv:2402.03300},
  year = {2024},
  }
+ ```