jklj077 committed on
Commit ddee1c5 · verified · 1 Parent(s): d24ea30
Files changed (3)
  1. README.md +132 -0
  2. config_1m.json +0 -0
  3. tokenizer_config.json +1 -1
README.md CHANGED
@@ -213,6 +213,138 @@ for responses in bot.run(messages=messages):
      print(responses)
  ```
 
+ ## Processing Ultra-Long Texts
+
+ To support **ultra-long context processing** (up to **1 million tokens**), we integrate two key techniques:
+
+ - **[Dual Chunk Attention](https://arxiv.org/abs/2402.17463) (DCA)**: A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
+ - **[MInference](https://arxiv.org/abs/2407.02490)**: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.
+
+ Together, these innovations significantly improve both **generation quality** and **inference efficiency** for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a **3× speedup** compared to standard attention implementations.
+
+ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.org/abs/2501.15383).
+
+ ### How to Enable 1M Token Context
+
+ > [!NOTE]
+ > To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
+
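+ As a rough sanity check (not an official sizing guide), the back-of-envelope sketch below adds up BF16 weight memory and KV-cache memory for a ~1M-token sequence. The layer count, KV-head count, and head dimension used here are illustrative assumptions; read the actual values from the model's `config.json`.
+
+ ```python
+ # Back-of-envelope GPU memory estimate for a ~1M-token context.
+ # Architecture numbers below are illustrative assumptions; take the real values
+ # from num_hidden_layers / num_key_value_heads / head_dim in config.json.
+ num_params      = 235e9   # total parameters (MoE: all experts resident in memory)
+ bytes_per_param = 2       # BF16 weights
+ num_layers      = 94      # assumed
+ num_kv_heads    = 4       # assumed (GQA)
+ head_dim        = 128     # assumed
+
+ kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * 2  # K and V, BF16
+ seq_len = 1_010_000
+
+ weights_gb  = num_params * bytes_per_param / 1e9
+ kv_cache_gb = kv_bytes_per_token * seq_len / 1e9
+ print(f"weights ≈ {weights_gb:.0f} GB, KV cache ≈ {kv_cache_gb:.0f} GB")
+ # Peak activation memory and framework overhead push the total toward the
+ # ~1000 GB figure quoted above.
+ ```
+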
+ #### Step 1: Update Configuration File
+
+ Download the model and replace the content of your `config.json` with that of `config_1m.json`, which includes the configuration for length extrapolation and sparse attention.
+
+ ```bash
+ export MODELNAME=Qwen3-235B-A22B-Thinking-2507
+ huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+ mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+ mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+ ```
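+
+ To confirm the swap took effect, a minimal sketch along these lines (assuming the directory layout produced by the commands above) prints the top-level keys that `config_1m.json` adds or changes relative to the backed-up config:
+
+ ```python
+ import json
+
+ # Paths assume the download/rename commands above were run in the current directory.
+ model_dir = "Qwen3-235B-A22B-Thinking-2507"
+ with open(f"{model_dir}/config.json.bak") as f:
+     old_cfg = json.load(f)
+ with open(f"{model_dir}/config.json") as f:
+     new_cfg = json.load(f)
+
+ # Top-level keys added or changed by config_1m.json
+ changed = {k for k in set(old_cfg) | set(new_cfg) if old_cfg.get(k) != new_cfg.get(k)}
+ print("Keys added/changed for 1M context:", sorted(changed))
+ print("max_position_embeddings:", new_cfg.get("max_position_embeddings"))
+ ```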
+
+ #### Step 2: Launch Model Server
+
+ After updating the config, serve the model with either **vLLM** or **SGLang**.
+
+ #### Option 1: Using vLLM
+
+ To run Qwen with 1M context support, first install vLLM from source:
+
+ ```bash
+ git clone https://github.com/vllm-project/vllm.git
+ cd vllm
+ pip install -e .
+ ```
+
+ Then launch the server with Dual Chunk Flash Attention enabled:
+
+ ```bash
+ VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
+ vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
+   --tensor-parallel-size 8 \
+   --max-model-len 1010000 \
+   --enable-chunked-prefill \
+   --max-num-batched-tokens 131072 \
+   --enforce-eager \
+   --max-num-seqs 1 \
+   --gpu-memory-utilization 0.85 \
+   --enable-reasoning --reasoning-parser deepseek_r1
+ ```
+
+ ##### Key Parameters
+
+ | Parameter | Purpose |
+ |-----------|---------|
+ | `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
+ | `--max-model-len 1010000` | Sets the maximum context length to ~1M tokens |
+ | `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
+ | `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
+ | `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
+ | `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
+ | `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |
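+
+ Once the server is running, it exposes an OpenAI-compatible API (port 8000 by default for vLLM). The sketch below shows one way to send a long document to it; the input file name is a placeholder, and the `model` field simply matches the local path passed to `vllm serve`. The same request also works against the SGLang server described below if you point `base_url` at that server's port.
+
+ ```python
+ from openai import OpenAI
+
+ # The vLLM server launched above listens on localhost:8000 by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ # long_document.txt is a placeholder for your own ultra-long input.
+ with open("long_document.txt") as f:
+     long_text = f.read()
+
+ response = client.chat.completions.create(
+     model="./Qwen3-235B-A22B-Thinking-2507",  # matches the path passed to `vllm serve`
+     messages=[
+         {"role": "user",
+          "content": long_text + "\n\nSummarize the key points of the document above."}
+     ],
+     temperature=0.6,
+     max_tokens=8192,
+ )
+ print(response.choices[0].message.content)
+ ```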
+
+ #### Option 2: Using SGLang
+
+ First, clone and install SGLang from source:
+
+ ```bash
+ git clone https://github.com/sgl-project/sglang.git
+ cd sglang
+ pip install -e "python[all]"
+ ```
+
+ Launch the server with DCA support:
+
+ ```bash
+ python3 -m sglang.launch_server \
+     --model-path ./Qwen3-235B-A22B-Thinking-2507 \
+     --context-length 1010000 \
+     --mem-frac 0.75 \
+     --attention-backend dual_chunk_flash_attn \
+     --tp 8 \
+     --chunked-prefill-size 131072 \
+     --reasoning-parser deepseek-r1
+ ```
+
+ ##### Key Parameters
+
+ | Parameter | Purpose |
+ |-----------|---------|
+ | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
+ | `--context-length 1010000` | Defines the maximum input length |
+ | `--mem-frac 0.75` | Fraction of GPU memory used for static allocation (model weights and the KV-cache memory pool); use a smaller value if you see out-of-memory errors |
+ | `--tp 8` | Tensor parallelism size (matches the model sharding) |
+ | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
+
+ #### Troubleshooting
+
+ 1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
+
+    The VRAM reserved for the KV cache is insufficient.
+    - vLLM: Consider reducing `max_model_len` or increasing `tensor_parallel_size` and `gpu_memory_utilization`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.
+    - SGLang: Consider reducing `context-length` or increasing `tp` and `mem-frac`. Alternatively, you can reduce `chunked-prefill-size`, although this may significantly slow down inference.
+
+ 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
+
+    The VRAM reserved for activations is insufficient. You can try lowering `gpu_memory_utilization` or `mem-frac`, but be aware that this might reduce the VRAM available for the KV cache.
+
+ 3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxxx tokens) is longer than the model's context length (xxx tokens)."
+
+    The input is too long. Consider using a shorter sequence or increasing `max_model_len` or `context-length`. A quick way to check a prompt's token count before sending it is sketched after this list.
+
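+ As referenced in item 3, a minimal sketch for checking a prompt's token count before sending a request, using the tokenizer shipped with the model, looks like this:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer from the local model directory downloaded in Step 1.
+ tokenizer = AutoTokenizer.from_pretrained("./Qwen3-235B-A22B-Thinking-2507")
+
+ with open("long_document.txt") as f:  # placeholder input file
+     prompt = f.read()
+
+ num_tokens = len(tokenizer(prompt).input_ids)
+ print(f"{num_tokens} tokens")
+ # Leave headroom below the served limit (1,010,000 above) for the chat template
+ # and the generated reasoning/output tokens.
+ ```
+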
+ #### Long-Context Performance
+
+ We evaluate the model on the 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
+
+ | Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
+ |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+ | Qwen3-235B-A22B (Thinking) | 82.9 | 97.3 | 95.9 | 95.3 | 88.7 | 91.7 | 91.5 | 87.9 | 85.4 | 78.4 | 75.6 | 73.7 | 73.6 | 70.6 | 69.9 | 67.6 |
+ | Qwen3-235B-A22B-Thinking-2507 (Full Attention) | 95.4 | 99.6 | 100.0 | 99.5 | 99.6 | 99.1 | 100.0 | 98.8 | 98.1 | 96.1 | 95.2 | 90.0 | 91.7 | 89.7 | 87.9 | 85.9 |
+ | Qwen3-235B-A22B-Thinking-2507 (Sparse Attention) | 95.5 | 100.0 | 100.0 | 100.0 | 100.0 | 98.6 | 99.5 | 98.8 | 98.1 | 95.4 | 93.0 | 90.7 | 91.9 | 91.7 | 87.8 | 86.6 |
+
+ * All models are evaluated with Dual Chunk Attention enabled.
+ * Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each).
+ * To avoid overly verbose reasoning, we set the thinking budget to 8,192 tokens.
+
  ## Best Practices
 
  To achieve optimal performance, we recommend the following settings:
config_1m.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -230,7 +230,7 @@
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
- "model_max_length": 262144,
+ "model_max_length": 1010000,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",