---
license: mit
library_name: transformers
datasets:
- AI-MO/NuminaMath-CoT
- KbsdJames/Omni-MATH
- RUC-AIBOX/STILL-3-Preview-RL-Data
- hendrycks/competition_math
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
tags:
- vllm
- qwen
- qwq
- deepseek
---


![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e900beaad53ff67ba84db/BU86-Rfy6RN7jdq-vESsU.png)
<div align="center">
<span style="font-family: default; font-size: 1.5em;">QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha</span>
<div>
- Solo Innovation: Breaking Performance Barriers with Minimal Resources -
<div><b>Powered by personal research with insights from agentica-org</b></div>
</div>
</div>

## Overview
QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha is a language model built on top of the DeepSeek‑R1‑Distill‑Qwen‑1.5B base. Developed entirely by a solo researcher, with valuable inspiration from Berkeley's research, the model employs a novel reinforcement learning distillation framework that substantially improves performance while keeping training data requirements and compute costs to a minimum. Despite having only 1.5B parameters, the model achieves a 47.18 MMLU score and outperforms prior baselines on several math and reasoning benchmarks.

---

## Data
Our training dataset comprises 6,170 meticulously curated problem–answer pairs drawn from high-quality sources such as:
- AIME problems (QwQ-32B generated)
- AMC problems (QwQ-32B generated)
- MMLU problems (QwQ-32B generated)
- Complementary academic math and reasoning datasets (QwQ-32B generated)

By focusing on a lean yet highly informative dataset, the model efficiently learns critical reasoning capabilities without the burden of excessive data volume.

All pairs were generated with QwQ-32B, with reference to the datasets listed in the model metadata as well as other sources.

---

## Training Recipe
To maximize performance with minimal resources, QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha utilizes an innovative training strategy that includes:

- <b>Scaled Group Relative Policy Optimization (GRPO):</b> An adaptation of PPO that normalizes the advantage function across samples generated from the same prompt.
- <b>KL Divergence Regularization:</b> Additional regularization applied on top of the surrogate loss to prevent significant policy drift.
- <b>Iterative Context Scaling:</b> The context length is expanded progressively to boost model performance while reducing compute costs.
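As a rough illustration, the group-relative advantage at the heart of GRPO can be sketched in plain Python. This is a minimal sketch under our own simplifying assumptions (the function name is ours, and the clipping and KL terms of the full objective are omitted), not the actual training code:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantage: for a group of completions sampled from
    the SAME prompt, subtract the group mean reward and divide by the
    group standard deviation. Completions that beat their siblings get
    a positive advantage; no learned value function is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(math.sqrt(var), 1e-6)  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: 4 completions for one prompt, scored 1 (correct) or 0 (wrong).
print(grpo_advantages([1, 0, 1, 0]))  # [1.0, -1.0, 1.0, -1.0]
```

The normalized advantages then weight the PPO-style surrogate loss, with a KL penalty against the reference policy added on top as described above.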

Training was carried out using <b>H200 GPUs for 336 hours</b> at an exceptionally low cost of approximately <b>$1,341</b>. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.

---

## Evaluation
The model has been rigorously evaluated on a variety of challenging benchmarks. Below is a snapshot of the results:

| **Benchmark**    | **Metric (Pass@1)** | **Metric (cons@64)** | **Avg. Token Count** |
|------------------|---------------------|----------------------|----------------------|
| **MMLU**         | 47.18               | –                    | –                    |
| **AIME 2024**    | 33.33               | 53.33                | 21,191               |
| **AIME 2025-I**  | 34.58               | 40.00                | 17,952               |
| **AIME 2025-II** | 21.56               | 33.33                | 21,376               |
| **AMC 2023**     | 75.00               | 58.92                | 44.17                |
| **MATH 5000**    | 38.89               | –                    | 20,173               |
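For reference, cons@64 is majority-vote accuracy: 64 answers are sampled per problem and only the most frequent one is graded. A minimal sketch (the function name is ours, not part of any particular evaluation harness):

```python
from collections import Counter

def consensus_correct(sampled_answers, reference):
    """cons@k scoring for one problem: take the most frequent answer
    among the k sampled answers and compare it to the reference."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == reference

# Three of four samples agree on "72", so the consensus answer is "72".
print(consensus_correct(["72", "70", "72", "72"], "72"))  # True
```

Pass@1, by contrast, averages per-sample correctness, which is why the two columns can diverge on the same benchmark.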

---

## Comparison

![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e900beaad53ff67ba84db/pm9JyHri9_Fo1FzztI2LE.png)

## Serving QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha

Deploy your model effortlessly using high-performance inference systems, including:

- **vLLM**  
- **Hugging Face Text Generation Inference (TGI)**  
- **SGLang**  
- **TensorRT-LLM**

All these systems support the OpenAI Chat Completions API format, ensuring smooth integration into your applications.

---

## How to Use:

<b>Runs on a single A40 GPU!</b>
### Serving Model:
```shell
vllm serve AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha --max-model-len 32768 --enforce-eager
```

### Call API Without Streaming:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
completion = client.chat.completions.create(
  model="AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

print(completion.choices[0].message)
```

### Call API With Streaming:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
messages = [{"role": "user", "content": prompt}]
stream = client.chat.completions.create(model=model,
                                        messages=messages,
                                        stream=True)

print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    elif hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("reasoning_content:", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        # Extract and print the content
        print(content, end="", flush=True)
```

## License

This project is released under the **MIT License**, reflecting our commitment to open and accessible AI. We firmly believe that cutting-edge AI research should be available for anyone to use, modify, and build upon.


---

## Special Thanks
We extend our sincere gratitude to the following teams and organizations whose contributions and ideas were instrumental in this project:
- Qwen Team (Alibaba Cloud): for creating the exceptional QwQ-32B model used as the distillation source.
- Agentica-org (Berkeley Sky Computing Lab and Berkeley AI Research): for valuable insights and pioneering reinforcement learning techniques.
- DeepSeek AI: for developing the robust foundational model upon which this research is built.

Their groundbreaking work made our innovations possible.