Xuandong nielsr HF Staff committed on
Commit 2915b7d · verified · 1 Parent(s): 6cfb9d6

Improve model card: Add library, usage, tags, and links (#1)


- Improve model card: Add library, usage, tags, and links (2f0880aa9f648da8ae611f9c02d0920223da82f9)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +72 -10
README.md CHANGED
@@ -1,25 +1,88 @@
  ---
  base_model: Qwen/Qwen2.5-3B
- license: apache-2.0
  datasets:
- - math
  metrics:
- - accuracy
  pipeline_tag: text-generation
- language:
- - en
  ---

  # Qwen2.5-3B-GRPO-MATH-1EPOCH

- **Description:**

- A GRPO-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.

- ---

  ## Citation

  ```bibtex
  @article{zhao2025learning,
  title={Learning to Reason without External Rewards},
@@ -34,5 +97,4 @@ A GRPO-fine-tuned version of Qwen2.5-3B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2402.03300},
  year = {2024},
  }
- ```
-
 
  ---
  base_model: Qwen/Qwen2.5-3B
  datasets:
+ - math
+ language:
+ - en
+ license: apache-2.0
  metrics:
+ - accuracy
  pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - mathematical-reasoning
+ - code-generation
+ - reinforcement-learning
+ - reasoning
  ---

  # Qwen2.5-3B-GRPO-MATH-1EPOCH

+ This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper [Learning to Reason without External Rewards](https://huggingface.co/papers/2505.19590).

+ **Intuitor** is a reinforcement learning method that fine-tunes large language models (LLMs) using *self-certainty*—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm called **Reinforcement Learning from Internal Feedback (RLIF)**. This model represents an instance fine-tuned using the GRPO policy optimization algorithm within this framework.

+ RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation.
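+
+ For intuition, self-certainty can be read as how far the model's next-token distribution sits from uniform. The snippet below is a minimal sketch of such a reward, assuming self-certainty is approximated as the mean KL divergence KL(U || p) between a uniform distribution U over the vocabulary and the model's per-token distribution p, averaged over the generated tokens; the function name, tensor shapes, and masking convention are illustrative assumptions, not the training implementation (see the GitHub repository linked below for the actual code).
+
+ ```python
+ import math
+ import torch
+ import torch.nn.functional as F
+
+ def self_certainty_reward(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
+     """Illustrative self-certainty score: mean KL(U || p) over generated tokens.
+
+     logits: (batch, seq_len, vocab_size) logits for the sampled response.
+     response_mask: (batch, seq_len) with 1 for response tokens, 0 elsewhere.
+     """
+     vocab_size = logits.size(-1)
+     log_probs = F.log_softmax(logits, dim=-1)
+     # KL(U || p) = -log|V| - (1/|V|) * sum_j log p(j)
+     kl_from_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
+     # Average over response tokens only; this per-sequence value acts as the intrinsic reward.
+     return (kl_from_uniform * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
+ ```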
+
+ ## Key Features
+ * **Reinforcement Learning from Internal Feedback (RLIF)**: A framework enabling LLMs to learn from intrinsic signals without external rewards, gold labels, or verifiers (see the sketch after this list for how such intrinsic rewards plug into GRPO's group-relative updates).
+ * **Self-Certainty as Reward**: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal.
+ * **Mathematical Reasoning**: Specifically fine-tuned on the MATH dataset to enhance mathematical reasoning capabilities.
+ * **Code Generation**: Demonstrates strong generalization to code generation tasks.
+
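+ As referenced above, the snippet below sketches the group-relative advantage step that GRPO-style optimization applies to whatever reward it is given (here, self-certainty rather than an external verifier score). It is illustrative only; `eps` and the variable names are assumptions, and the actual training code lives in the GitHub repository linked below.
+
+ ```python
+ import torch
+
+ def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+     """GRPO-style advantages: z-score each reward within its group of sampled responses.
+
+     rewards: (group_size,) one intrinsic (self-certainty) reward per sampled response.
+     """
+     return (rewards - rewards.mean()) / (rewards.std() + eps)
+
+ # Example: four responses sampled for one prompt, scored by self-certainty.
+ advantages = group_relative_advantages(torch.tensor([0.9, 1.4, 1.1, 0.7]))
+ ```
+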
+ ## Usage
+
+ This model is compatible with the Hugging Face `transformers` library. You can load and use it for text generation as follows:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
+ # Define a conversation prompt for mathematical reasoning
+ prompt = "Question: What is the sum of the first 100 positive integers?\nAnswer:"
+
+ # Apply the chat template suitable for Qwen models
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ # Encode the input
+ input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
+
+ # Set generation configuration
+ generation_config = GenerationConfig(
+     bos_token_id=tokenizer.bos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     max_new_tokens=2048,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ # Generate response
+ outputs = model.generate(input_ids, generation_config=generation_config)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(response)
+ ```
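+
+ For quick experiments, the same checkpoint can also be driven through the high-level `pipeline` API. This is a minimal sketch, not an official recipe: it assumes a recent `transformers` version that accepts chat-style message lists in the pipeline, and the generation parameters simply mirror the ones above.
+
+ ```python
+ from transformers import pipeline
+
+ # Load the checkpoint through the text-generation pipeline; weights and device
+ # placement are handled automatically.
+ generator = pipeline(
+     "text-generation",
+     model="sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH",
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
+ result = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
+
+ # For chat-style input, the last message in `generated_text` is the model's reply.
+ print(result[0]["generated_text"][-1]["content"])
+ ```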
+
+ ## Code
+ The official implementation and training scripts are available on the [GitHub repository](https://github.com/sunblaze-ucb/Intuitor).

  ## Citation

+ If you use this model or the associated research, please cite the paper:
+
  ```bibtex
  @article{zhao2025learning,
  title={Learning to Reason without External Rewards},

  journal = {arXiv preprint arXiv:2402.03300},
  year = {2024},
  }
+ ```