1. Introduction
Nanbeige4-3B-Thinking is a 3B-parameter reasoning model within the fourth-generation Nanbeige LLM family. It demonstrates that even compact models can achieve advanced reasoning capabilities through continued improvements in data quality and training methodology. To support research and technological advancement in the open-source community, we have open-sourced the Nanbeige4-3B-Thinking model together with its technical methodology.
2. Model Summary
Pre-Training
- We constructed a comprehensive 23T-token training corpus from web text, books, code, and papers, meticulously filtered through a hybrid strategy of tagging-based quality scoring and retrieval-based recall. This foundation was then augmented with knowledge-dense and reasoning-intensive synthetic data, including Q&A pairs, textbooks, and Long-CoT samples, which significantly improved downstream task performance.
- We designed an innovative FG-WSD (Fine-Grained Warmup-Stable-Decay) training scheduler that refines the conventional WSD approach with a fine-grained, quality-progressive data curriculum: the Stable stage is divided into multiple phases with progressively improved data mixtures. Compared to vanilla WSD, this schedule yielded notable performance gains. During the Decay stage, we increased the proportion of math, code, synthetic QA, and synthetic Long-CoT data to further strengthen reasoning capabilities. The stage breakdown is shown in the table below, followed by a minimal sketch of the learning-rate schedule.
| Stage | Training Tokens | Learning Rate |
|---|---|---|
| Warmup Stage | 0.1T | 0 → 4.5e-4 |
| Diversity-Enriched Stable Stage | 12.4T | Constant 4.5e-4 |
| High-Quality Stable Stage | 6.5T | Constant 4.5e-4 |
| Decay and Long-Context Stage | 4T | 4.5e-4 → 1.5e-6 |
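The following is a minimal sketch of how the learning-rate schedule above could be realized in code. The stage token budgets and the peak/final learning rates come from the table; the linear warmup/decay shapes and the function name are illustrative assumptions, not the exact production implementation.

```python
# Illustrative sketch of the FG-WSD learning-rate schedule described above.
# Stage token budgets and learning rates come from the table; the linear
# warmup/decay shapes are assumptions, not the exact production schedule.

STAGES = [
    ("warmup",               0.1e12, None),    # 0 -> 4.5e-4, linear
    ("diversity_enriched",  12.4e12, 4.5e-4),  # constant
    ("high_quality",         6.5e12, 4.5e-4),  # constant
    ("decay_long_context",   4.0e12, None),    # 4.5e-4 -> 1.5e-6, linear
]
PEAK_LR, FINAL_LR = 4.5e-4, 1.5e-6


def fg_wsd_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    start = 0.0
    for name, budget, const_lr in STAGES:
        end = start + budget
        if tokens_seen <= end:
            frac = (tokens_seen - start) / budget
            if name == "warmup":
                return PEAK_LR * frac                         # linear warmup from 0
            if name == "decay_long_context":
                return PEAK_LR + (FINAL_LR - PEAK_LR) * frac  # linear decay
            return const_lr                                   # stable phases
        start = end
    return FINAL_LR


# Example: learning rate midway through the High-Quality Stable Stage.
print(fg_wsd_lr(15.0e12))  # -> 0.00045
```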
Post-Training
SFT phase. We constructed a collection of over 30 million high-quality Long Chain-of-Thought (Long-CoT) samples to support multi-stage curriculum learning. By combining rule-based and model-based verification, we ensured answer correctness and, for each prompt, selected the most comprehensive and instructive response among candidate generations. This diversity of instructions and high response quality equipped the model to achieve outstanding performance across a variety of benchmarks.
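As a rough illustration (not the exact data pipeline), the sketch below shows one way rule-based and model-based verification can be combined to select the best response per prompt. The helper names, the verifier interface, and the acceptance threshold are hypothetical assumptions.

```python
# Illustrative sketch of combining rule-based and model-based verification to
# select Long-CoT SFT samples. Helper names, the verifier interface, and the
# acceptance threshold are assumptions made for illustration only.
import re
from typing import Callable, Optional


def rule_based_check(response: str, reference_answer: str) -> bool:
    """Accept if the final (\\boxed{} or last-line) answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    lines = response.strip().splitlines()
    predicted = match.group(1) if match else (lines[-1] if lines else "")
    return predicted.strip() == reference_answer.strip()


def select_best_response(
    prompt: str,
    candidates: list[str],
    reference_answer: Optional[str],
    judge: Callable[[str, str], float],  # model-based verifier returning a 0-1 quality score
    threshold: float = 0.8,
) -> Optional[str]:
    """Keep candidates that pass the available checks, then return the highest-scoring one."""
    verified = []
    for resp in candidates:
        if reference_answer is not None and not rule_based_check(resp, reference_answer):
            continue                    # rule-based filter for verifiable prompts
        score = judge(prompt, resp)     # model-based comprehensiveness/quality score
        if score >= threshold:
            verified.append((score, resp))
    return max(verified)[1] if verified else None
```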
Distillation phase. Following SFT, we employed the Nanbeige flagship reasoning model as the teacher to distill Nanbeige4-3B-Thinking, which further improved performance. We observed that on-policy distillation provides greater benefits for mathematical reasoning tasks, while off-policy distillation is more effective for general tasks such as human-preference alignment.
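For readers unfamiliar with the distinction, the sketch below contrasts the two settings under their usual definitions: on-policy distillation scores sequences sampled from the student against the teacher's distribution, while off-policy distillation trains the student on sequences produced by the teacher. The loss formulations and hyperparameters here are illustrative assumptions, not the recipe used for Nanbeige4-3B-Thinking.

```python
# Minimal sketch contrasting on-policy and off-policy distillation losses.
# The reverse-KL formulation, sampling length, and lack of prompt masking are
# simplifying assumptions; the exact Nanbeige distillation recipe is not
# specified here.
import torch
import torch.nn.functional as F


def on_policy_distill_loss(student, teacher, prompt_ids):
    """Student samples a continuation; the teacher's distribution is the target."""
    sampled = student.generate(prompt_ids, max_new_tokens=64, do_sample=True)
    student_logp = F.log_softmax(student(sampled).logits[:, :-1], dim=-1)
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(sampled).logits[:, :-1], dim=-1)
    # KL(student || teacher) on student-sampled tokens (a common on-policy form).
    # For brevity the loss covers all positions; in practice prompt tokens are masked.
    return F.kl_div(teacher_logp, student_logp, log_target=True, reduction="batchmean")


def off_policy_distill_loss(student, teacher_generated_ids):
    """Student is trained to reproduce sequences generated beforehand by the teacher."""
    logits = student(teacher_generated_ids).logits[:, :-1]
    labels = teacher_generated_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```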
RL phase. We then advanced to a multi-stage, on-policy reinforcement learning phase. This stage uses verifiable rewards to strengthen reasoning and a preference reward model to improve alignment, drawing on a carefully filtered blend of real-world and synthetic data calibrated for appropriate difficulty.
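A minimal sketch of how a verifiable reward and a preference reward model might be combined per prompt is shown below. The routing rule and function names are assumptions, since the report only states that both reward types are used.

```python
# Illustrative sketch of routing RL rollouts to a verifiable reward or a
# preference reward model. The routing rule, helper names, and binary reward
# are assumptions; only the use of both reward types is stated above.
from typing import Callable, Optional


def combined_reward(
    prompt: str,
    response: str,
    reference_answer: Optional[str],
    verify: Callable[[str, str], bool],          # rule-based checker for verifiable tasks
    preference_rm: Callable[[str, str], float],  # scalar preference reward model
) -> float:
    if reference_answer is not None:
        # Verifiable task (e.g. math/code): binary reward from the checker.
        return 1.0 if verify(response, reference_answer) else 0.0
    # Open-ended task: scalar score from the preference reward model.
    return preference_rm(prompt, response)
```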
3. Model Performance
For model performance comparison, we benchmark our model against recent reasoning LLMs from the Qwen3 series. All models are evaluated under identical configurations to ensure fairness. The results show that our model outperforms the baselines across a range of mainstream benchmarks, including math, science, creative writing, tool use, and human preference alignment.
| Model | AIME24 | AIME25 | GPQA | Super-GPQA | Science-QA | Writing-Bench | BFCL-V4-Agentic | Arena-hard2 |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B-Thinking-2504 | 76.0 | 67.3 | 62.0 | 39.1 | 24.8 | 74.8 | 14.4 | 26.4 |
| Qwen3-14B-Thinking-2504 | 79.3 | 70.4 | 64.0 | 46.8 | 23.2 | 77.2 | 17.0 | 40.5 |
| Qwen3-4B-Thinking-2507 | 83.3 | 81.3 | 67.2 | 46.7 | 24.4 | 84.3 | 14.3 | 37.7 |
| Nanbeige4-3B-Thinking-2510 | 87.5 | 81.7 | 77.2 | 51.4 | 26.0 | 85.5 | 17.2 | 42.9 |
4. Quickstart
For the chat scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (trust_remote_code is required for the custom code in this repo).
tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

messages = [
    {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]

# Build the prompt with the model's chat template.
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids

# 166101 is the model's end-of-turn token id; max_new_tokens leaves room for
# long reasoning traces (adjust as needed).
output_ids = model.generate(
    input_ids.to(model.device),
    eos_token_id=166101,
    max_new_tokens=32768
)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
For the tool use scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

messages = [
    {'role': 'user', 'content': 'Help me check the weather in Beijing now'}
]

# Tool schema passed to the chat template; 'required' sits alongside
# 'properties' at the parameters level.
tools = [{
    'type': 'function',
    'function': {
        'name': 'SearchWeather',
        'description': 'Find out current weather in a certain place on a certain day.',
        'parameters': {
            'type': 'dict',
            'properties': {
                'location': {
                    'type': 'string',
                    'description': 'A city in China.'
                }
            },
            'required': ['location']
        }
    }
}]

# Build the prompt with the chat template, including the tool definitions.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids

# 166101 is the model's end-of-turn token id.
output_ids = model.generate(
    input_ids.to(model.device),
    eos_token_id=166101,
    max_new_tokens=32768
)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
5. Limitations
While we place great emphasis on model safety during training, striving to ensure that outputs align with ethical and legal requirements, the model's limited size and probabilistic nature mean it may still produce unexpected outputs. These may include harmful content such as bias or discrimination. Please do not propagate such content. We do not assume any responsibility for the consequences resulting from the dissemination of inappropriate information.
6. Citation
If you find our model useful or want to use it in your projects, please cite this Hugging Face project.
7. Contact
If you have any questions, please raise an issue or contact us at [email protected].