Add 1M support (#6)
- add 1m support (503c9057e3ef9b527e65c091b7afd1d3713a5737)
- update README (c3fe46fa1f5339394c1eed3ad5a709ceeb6eb180)
- update README (061a2acd8330082c3319b4bb069e476984386222)
- update README (17cc3062bb77ed3d1552b8c3c54a61b35ba3005a)
- update README (e13b5952f87f15783efe9f614abc82e033ae94e2)
- README.md +132 -0
- config_1m.json +0 -0
- tokenizer_config.json +1 -1
README.md
CHANGED

@@ -213,6 +213,138 @@ for responses in bot.run(messages=messages):

## Processing Ultra-Long Texts

To support **ultra-long context processing** (up to **1 million tokens**), we integrate two key techniques:

- **[Dual Chunk Attention](https://arxiv.org/abs/2402.17463) (DCA)**: A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
- **[MInference](https://arxiv.org/abs/2407.02490)**: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.

Together, these innovations significantly improve both **generation quality** and **inference efficiency** for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a **3× speedup** compared to standard attention implementations.

For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.org/abs/2501.15383).
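
For intuition only, the sketch below illustrates the general chunk-wise position remapping idea behind this family of methods. It is a toy simplification, not the actual DCA formulation (which additionally treats the immediately preceding chunk specially to keep local distances exact) and not the fused kernel used at serving time; `CHUNK` and `SEQ_LEN` are toy values.

```python
import numpy as np

# Toy illustration of chunk-wise position remapping (NOT the exact DCA scheme):
# keys take their offset within a fixed-size chunk, and queries attending to
# earlier chunks use a constant large index, so every relative distance the
# attention sees stays within a bounded range regardless of sequence length.
CHUNK = 4        # toy chunk size (real systems use thousands of tokens)
SEQ_LEN = 12     # toy sequence length

key_pos = np.arange(SEQ_LEN) % CHUNK               # key positions restart inside each chunk
query_pos_intra = np.arange(SEQ_LEN) % CHUNK       # same-chunk queries keep their local offsets
query_pos_inter = np.full(SEQ_LEN, 2 * CHUNK - 1)  # cross-chunk queries use a fixed large index

rel = np.zeros((SEQ_LEN, SEQ_LEN), dtype=int)
for i in range(SEQ_LEN):
    for j in range(i + 1):                         # causal attention: j <= i
        same_chunk = (i // CHUNK) == (j // CHUNK)
        q = query_pos_intra[i] if same_chunk else query_pos_inter[i]
        rel[i, j] = q - key_pos[j]

# All remapped relative distances stay below 2 * CHUNK, however long the sequence is.
print(rel.max())  # 7 here, and it stays 7 even if SEQ_LEN grows
```

Because the remapping keeps relative distances inside the range seen during pretraining, no retraining is needed; in the served model it is handled entirely inside the attention backend enabled below.
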
### How to Enable 1M Token Context

> [!NOTE]
> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
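
For a rough, hedged sense of where that budget goes, the sketch below applies the standard KV-cache sizing formula for a GQA transformer. The layer and head numbers plugged in are illustrative placeholders, not this model's published configuration; substitute the fields from your downloaded `config.json`.

```python
# Back-of-the-envelope KV-cache estimate for a GQA transformer:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens.
# The values below are ILLUSTRATIVE PLACEHOLDERS -- read the real ones from config.json.
num_layers = 94        # placeholder for "num_hidden_layers"
num_kv_heads = 4       # placeholder for "num_key_value_heads"
head_dim = 128         # placeholder for the per-head dimension
bytes_per_value = 2    # BF16/FP16 cache
context_tokens = 1_010_000

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens
print(f"KV cache for one {context_tokens:,}-token sequence: {kv_bytes / 1e9:.0f} GB")
# ~194 GB with these placeholder numbers, on top of weights and activations,
# which is why roughly 1000 GB of total GPU memory is recommended.
```
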
#### Step 1: Update Configuration File

Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.

```bash
export MODELNAME=Qwen3-235B-A22B-Thinking-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
```
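
If you prefer to do the same download-and-swap from Python (for example in a provisioning script), a minimal equivalent using `huggingface_hub` is sketched below, assuming the same local directory layout as the shell commands above.

```python
from pathlib import Path

from huggingface_hub import snapshot_download

model_name = "Qwen3-235B-A22B-Thinking-2507"

# Download the full repository (weights plus both config files) to a local directory.
local_dir = Path(snapshot_download(repo_id=f"Qwen/{model_name}", local_dir=model_name))

# Back up the default config and promote the 1M-context config in its place.
(local_dir / "config.json").rename(local_dir / "config.json.bak")
(local_dir / "config_1m.json").rename(local_dir / "config.json")
```
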
#### Step 2: Launch Model Server

After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.

#### Option 1: Using vLLM

To run Qwen with 1M context support:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

Then launch the server with Dual Chunk Flash Attention enabled:

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --enable-reasoning --reasoning-parser deepseek_r1
```
##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used by the model executor |
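
Once the server is running it exposes an OpenAI-compatible API (vLLM defaults to port 8000), so any OpenAI-style client can send requests. A minimal hedged sketch, assuming the default port, the model path above as the served model name, and a stand-in input file:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; adjust host/port if you changed the defaults.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("very_long_document.txt") as f:  # hypothetical ~1M-token input
    long_text = f.read()

response = client.chat.completions.create(
    model="./Qwen3-235B-A22B-Thinking-2507",  # must match the path passed to `vllm serve`
    messages=[
        {"role": "user", "content": long_text + "\n\nSummarize the key points above."},
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

The same client works against the SGLang server described below; only the endpoint changes (SGLang listens on port 30000 by default).
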
#### Option 2: Using SGLang

First, clone and install the specialized branch:

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Launch the server with DCA support:

```bash
python3 -m sglang.launch_server \
  --model-path ./Qwen3-235B-A22B-Thinking-2507 \
  --context-length 1010000 \
  --mem-frac 0.75 \
  --attention-backend dual_chunk_flash_attn \
  --tp 8 \
  --chunked-prefill-size 131072 \
  --reasoning-parser deepseek-r1
```
##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines the maximum input length |
| `--mem-frac 0.75` | The fraction of memory used for static allocation (model weights and the KV-cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--tp 8` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
#### Troubleshooting

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."

   The VRAM reserved for the KV cache is insufficient.
   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size`` and ``gpu_memory_utilization``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp`` and ``mem-frac``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."

   The VRAM reserved for activations is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxxxx tokens) is longer than the model's context length (xxx tokens)."

   The input is too long. Consider using a shorter sequence or increasing the ``max_model_len`` (vLLM) or ``context-length`` (SGLang); a quick local token count is sketched below.
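
For error 3 above, a hedged way to check an input before sending it is to count tokens locally with the model's tokenizer (the input file name is a stand-in):

```python
from transformers import AutoTokenizer

# Tokenize locally to check the prompt length against the serving limit
# (1010000 tokens with config_1m.json, minus room for the generated output).
tokenizer = AutoTokenizer.from_pretrained("./Qwen3-235B-A22B-Thinking-2507")

with open("very_long_document.txt") as f:  # hypothetical input file
    prompt = f.read()

num_tokens = len(tokenizer(prompt).input_ids)
print(f"Prompt length: {num_tokens:,} tokens")
if num_tokens > 1_000_000:
    print("Input likely exceeds the configured context length; shorten it or raise the limit.")
```
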
#### Long-Context Performance

We test the model on a 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B (Thinking) | 82.9 | 97.3 | 95.9 | 95.3 | 88.7 | 91.7 | 91.5 | 87.9 | 85.4 | 78.4 | 75.6 | 73.7 | 73.6 | 70.6 | 69.9 | 67.6 |
| Qwen3-235B-A22B-Thinking-2507 (Full Attention) | 95.4 | 99.6 | 100.0 | 99.5 | 99.6 | 99.1 | 100.0 | 98.8 | 98.1 | 96.1 | 95.2 | 90.0 | 91.7 | 89.7 | 87.9 | 85.9 |
| Qwen3-235B-A22B-Thinking-2507 (Sparse Attention) | 95.5 | 100.0 | 100.0 | 100.0 | 100.0 | 98.6 | 99.5 | 98.8 | 98.1 | 95.4 | 93.0 | 90.7 | 91.9 | 91.7 | 87.8 | 86.6 |

* All models are evaluated with Dual Chunk Attention enabled.
* Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each).
* To avoid overly verbose reasoning, we set the thinking budget to 8,192 tokens.

## Best Practices
To achieve optimal performance, we recommend the following settings:

config_1m.json
ADDED

The diff for this file is too large to render; see the raw diff.

tokenizer_config.json
CHANGED

@@ -230,7 +230,7 @@
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",
-  "model_max_length":
+  "model_max_length": 1010000,
   "pad_token": "<|endoftext|>",
   "split_special_tokens": false,
   "tokenizer_class": "Qwen2Tokenizer",
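
As a quick, optional sanity check after downloading, the tokenizer should now report the extended limit from the updated `tokenizer_config.json`; a minimal sketch assuming the local directory from Step 1:

```python
from transformers import AutoTokenizer

# model_max_length is read from tokenizer_config.json, so after this change
# the tokenizer should report the extended 1M-token limit.
tokenizer = AutoTokenizer.from_pretrained("./Qwen3-235B-A22B-Thinking-2507")
print(tokenizer.model_max_length)  # expected: 1010000
```
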