---
license: mit
library_name: transformers
datasets:
- AI-MO/NuminaMath-CoT
- KbsdJames/Omni-MATH
- RUC-AIBOX/STILL-3-Preview-RL-Data
- hendrycks/competition_math
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
tags:
- vllm
- qwen
- qwq
- deepseek
---


![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e900beaad53ff67ba84db/BU86-Rfy6RN7jdq-vESsU.png)
<div align="center">
<span style="font-family: default; font-size: 1.5em;">QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha</span>
<div>
- Solo Innovation: Breaking Performance Barriers with Minimal Resources -
<div><b>Powered by personal research with insights from agentica-org</b></div>
</div>
</div>

## Overview
QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha is a language model built on top of the DeepSeek‑R1‑Distill‑Qwen‑1.5B base. Developed entirely by a solo researcher, with valuable inspiration from Berkeley's research, the model employs a novel reinforcement learning distillation framework that substantially improves performance while keeping training data requirements and compute costs to a minimum. Despite having only 1.5B parameters, the model achieves a 47.18 MMLU score and outperforms prior baselines on several math and reasoning benchmarks.

---

## Data
Our training dataset comprises 6,170 meticulously curated problem–answer pairs drawn from high-quality sources such as:
- AIME problems (QwQ-32B generated)
- AMC problems (QwQ-32B generated)
- MMLU problems (QwQ-32B generated)
- Complementary academic math and reasoning datasets (QwQ-32B generated)

By focusing on a lean yet highly informative dataset, the model efficiently learns critical reasoning capabilities without the burden of excessive data volume.

All pairs were generated with QwQ-32B, with reference to the datasets listed in the model metadata as well as other sources.

---

## Training Recipe
To maximize performance with minimal resources, QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha utilizes an innovative training strategy that includes:

- <b>Scaled Group Relative Policy Optimization (GRPO):</b> An adaptation of PPO that normalizes the advantage function across samples generated from the same prompt.
- <b>KL Divergence Regularization:</b> Additional regularization applied on top of the surrogate loss to prevent significant policy drift.
- <b>Iterative Context Scaling:</b> The context length is expanded progressively to boost model performance while reducing compute costs.
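As a rough illustration, the group-relative advantage at the heart of GRPO can be sketched in plain Python. This is a minimal sketch under our own simplifying assumptions (the function name is ours, and the clipping and KL terms of the full objective are omitted), not the actual training code:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantage: for a group of completions sampled from
    the SAME prompt, subtract the group mean reward and divide by the
    group standard deviation. Completions that beat their siblings get
    a positive advantage; no learned value function is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(math.sqrt(var), 1e-6)  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: 4 completions for one prompt, scored 1 (correct) or 0 (wrong).
print(grpo_advantages([1, 0, 1, 0]))  # [1.0, -1.0, 1.0, -1.0]
```

The normalized advantages then weight the PPO-style surrogate loss, with a KL penalty against the reference policy added on top as described above.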

Training was carried out using <b>H200 GPUs for 336 hours</b> at an exceptionally low cost of approximately <b>$1,341</b>. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.

---

## Evaluation
The model has been rigorously evaluated on a variety of challenging benchmarks. Below is a snapshot of the results:

| **Benchmark**    | **Metric (Pass@1)** | **Metric (cons@64)** | **Avg. Token Count** |
|------------------|---------------------|----------------------|----------------------|
| **MMLU**         | 47.18               | –                    | –                    |
| **AIME 2024**    | 33.33               | 53.33                | 21,191               |
| **AIME 2025-I**  | 34.58               | 40.00                | 17,952               |
| **AIME 2025-II** | 21.56               | 33.33                | 21,376               |
| **AMC 2023**     | 75.00               | 58.92                | 44.17                |
| **MATH 5000**    | 38.89               | –                    | 20,173               |
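For reference, cons@64 is majority-vote accuracy: 64 answers are sampled per problem and only the most frequent one is graded. A minimal sketch (the function name is ours, not part of any particular evaluation harness):

```python
from collections import Counter

def consensus_correct(sampled_answers, reference):
    """cons@k scoring for one problem: take the most frequent answer
    among the k sampled answers and compare it to the reference."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == reference

# Three of four samples agree on "72", so the consensus answer is "72".
print(consensus_correct(["72", "70", "72", "72"], "72"))  # True
```

Pass@1, by contrast, averages per-sample correctness, which is why the two columns can diverge on the same benchmark.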

---

## Comparison

![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e900beaad53ff67ba84db/pm9JyHri9_Fo1FzztI2LE.png)

## Serving QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha

Deploy your model effortlessly using high-performance inference systems, including:

- **vLLM**  
- **Hugging Face Text Generation Inference (TGI)**  
- **SGLang**  
- **TensorRT-LLM**

All these systems support the OpenAI Chat Completions API format, ensuring smooth integration into your applications.

---

## How to Use:

<b>Runs on a single A40 GPU!</b>
### Serving Model:
```shell
vllm serve AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha --max-model-len 32768 --enforce-eager
```

### Call API Without Streaming:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
completion = client.chat.completions.create(
  model="AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

print(completion.choices[0].message)
```

### Call API With Streaming:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
messages = [{"role": "user", "content": prompt}]
stream = client.chat.completions.create(model=model,
                                        messages=messages,
                                        stream=True)

print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    elif hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("reasoning_content:", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        # Extract and print the content
        print(content, end="", flush=True)
```

## License

This project is released under the **MIT License**, reflecting our commitment to open and accessible AI. We firmly believe that cutting-edge AI research should be available for anyone to use, modify, and build upon.


---

## Special Thanks
We extend our sincere gratitude to the following teams and organizations whose contributions and ideas were instrumental in this project:
- Qwen Team (Alibaba Cloud): for creating the exceptional QwQ-32B model used as the distillation source.
- Agentica-org (Berkeley Sky Computing Lab and Berkeley AI Research): for valuable insights and pioneering reinforcement learning techniques.
- DeepSeek AI: for developing the robust foundational model upon which this research is built.

Their groundbreaking work made our innovations possible.