Add 1M support (#6)
- add 1m support (503c9057e3ef9b527e65c091b7afd1d3713a5737)
- update README (c3fe46fa1f5339394c1eed3ad5a709ceeb6eb180)
- update README (061a2acd8330082c3319b4bb069e476984386222)
- update README (17cc3062bb77ed3d1552b8c3c54a61b35ba3005a)
- update README (e13b5952f87f15783efe9f614abc82e033ae94e2)
- README.md +132 -0
- config_1m.json +0 -0
- tokenizer_config.json +1 -1
README.md
CHANGED

@@ -213,6 +213,138 @@ for responses in bot.run(messages=messages):

## Processing Ultra-Long Texts

To support **ultra-long context processing** (up to **1 million tokens**), we integrate two key techniques:

- **[Dual Chunk Attention](https://arxiv.org/abs/2402.17463) (DCA)**: A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
- **[MInference](https://arxiv.org/abs/2407.02490)**: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.

Together, these innovations significantly improve both **generation quality** and **inference efficiency** for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a **3× speedup** compared to standard attention implementations.

For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.org/abs/2501.15383).
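
For intuition only, the sketch below illustrates the general chunk-wise position remapping idea behind this family of methods. It is a toy simplification, not the actual DCA formulation (which additionally treats the immediately preceding chunk specially to keep local distances exact) and not the fused kernel used at serving time; `CHUNK` and `SEQ_LEN` are toy values.

```python
import numpy as np

# Toy illustration of chunk-wise position remapping (NOT the exact DCA scheme):
# keys take their offset within a fixed-size chunk, and queries attending to
# earlier chunks use a constant large index, so every relative distance the
# attention sees stays within a bounded range regardless of sequence length.
CHUNK = 4        # toy chunk size (real systems use thousands of tokens)
SEQ_LEN = 12     # toy sequence length

key_pos = np.arange(SEQ_LEN) % CHUNK               # key positions restart inside each chunk
query_pos_intra = np.arange(SEQ_LEN) % CHUNK       # same-chunk queries keep their local offsets
query_pos_inter = np.full(SEQ_LEN, 2 * CHUNK - 1)  # cross-chunk queries use a fixed large index

rel = np.zeros((SEQ_LEN, SEQ_LEN), dtype=int)
for i in range(SEQ_LEN):
    for j in range(i + 1):                         # causal attention: j <= i
        same_chunk = (i // CHUNK) == (j // CHUNK)
        q = query_pos_intra[i] if same_chunk else query_pos_inter[i]
        rel[i, j] = q - key_pos[j]

# All remapped relative distances stay below 2 * CHUNK, however long the sequence is.
print(rel.max())  # 7 here, and it stays 7 even if SEQ_LEN grows
```

Because the remapping keeps relative distances inside the range seen during pretraining, no retraining is needed; in the served model it is handled entirely inside the attention backend enabled below.
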
### How to Enable 1M Token Context

> [!NOTE]
> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
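
For a rough, hedged sense of where that budget goes, the sketch below applies the standard KV-cache sizing formula for a GQA transformer. The layer and head numbers plugged in are illustrative placeholders, not this model's published configuration; substitute the fields from your downloaded `config.json`.

```python
# Back-of-the-envelope KV-cache estimate for a GQA transformer:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens.
# The values below are ILLUSTRATIVE PLACEHOLDERS -- read the real ones from config.json.
num_layers = 94        # placeholder for "num_hidden_layers"
num_kv_heads = 4       # placeholder for "num_key_value_heads"
head_dim = 128         # placeholder for the per-head dimension
bytes_per_value = 2    # BF16/FP16 cache
context_tokens = 1_010_000

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens
print(f"KV cache for one {context_tokens:,}-token sequence: {kv_bytes / 1e9:.0f} GB")
# ~194 GB with these placeholder numbers, on top of weights and activations,
# which is why roughly 1000 GB of total GPU memory is recommended.
```
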
#### Step 1: Update Configuration File

Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.

```bash
export MODELNAME=Qwen3-235B-A22B-Thinking-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
```
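
If you prefer to do the same download-and-swap from Python (for example in a provisioning script), a minimal equivalent using `huggingface_hub` is sketched below, assuming the same local directory layout as the shell commands above.

```python
from pathlib import Path

from huggingface_hub import snapshot_download

model_name = "Qwen3-235B-A22B-Thinking-2507"

# Download the full repository (weights plus both config files) to a local directory.
local_dir = Path(snapshot_download(repo_id=f"Qwen/{model_name}", local_dir=model_name))

# Back up the default config and promote the 1M-context config in its place.
(local_dir / "config.json").rename(local_dir / "config.json.bak")
(local_dir / "config_1m.json").rename(local_dir / "config.json")
```
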
#### Step 2: Launch Model Server

After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.

#### Option 1: Using vLLM

To run Qwen with 1M context support:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

Then launch the server with Dual Chunk Flash Attention enabled:

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --enable-reasoning --reasoning-parser deepseek_r1
```
##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used by the model executor |
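
Once the server is running it exposes an OpenAI-compatible API (vLLM defaults to port 8000), so any OpenAI-style client can send requests. A minimal hedged sketch, assuming the default port, the model path above as the served model name, and a stand-in input file:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; adjust host/port if you changed the defaults.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("very_long_document.txt") as f:  # hypothetical ~1M-token input
    long_text = f.read()

response = client.chat.completions.create(
    model="./Qwen3-235B-A22B-Thinking-2507",  # must match the path passed to `vllm serve`
    messages=[
        {"role": "user", "content": long_text + "\n\nSummarize the key points above."},
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

The same client works against the SGLang server described below; only the endpoint changes (SGLang listens on port 30000 by default).
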
#### Option 2: Using SGLang

First, clone and install the specialized branch:

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Launch the server with DCA support:

```bash
python3 -m sglang.launch_server \
  --model-path ./Qwen3-235B-A22B-Thinking-2507 \
  --context-length 1010000 \
  --mem-frac 0.75 \
  --attention-backend dual_chunk_flash_attn \
  --tp 8 \
  --chunked-prefill-size 131072 \
  --reasoning-parser deepseek-r1
```
##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines the maximum input length |
| `--mem-frac 0.75` | The fraction of memory used for static allocation (model weights and the KV-cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--tp 8` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
#### Troubleshooting

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."

   The VRAM reserved for the KV cache is insufficient.
   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size`` and ``gpu_memory_utilization``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp`` and ``mem-frac``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."

   The VRAM reserved for activations is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxxxx tokens) is longer than the model's context length (xxx tokens)."

   The input is too long. Consider using a shorter sequence or increasing the ``max_model_len`` (vLLM) or ``context-length`` (SGLang); a quick local token count is sketched below.
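
For error 3 above, a hedged way to check an input before sending it is to count tokens locally with the model's tokenizer (the input file name is a stand-in):

```python
from transformers import AutoTokenizer

# Tokenize locally to check the prompt length against the serving limit
# (1010000 tokens with config_1m.json, minus room for the generated output).
tokenizer = AutoTokenizer.from_pretrained("./Qwen3-235B-A22B-Thinking-2507")

with open("very_long_document.txt") as f:  # hypothetical input file
    prompt = f.read()

num_tokens = len(tokenizer(prompt).input_ids)
print(f"Prompt length: {num_tokens:,} tokens")
if num_tokens > 1_000_000:
    print("Input likely exceeds the configured context length; shorten it or raise the limit.")
```
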
#### Long-Context Performance

We test the model on a 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-235B-A22B (Thinking) | 82.9 | 97.3 | 95.9 | 95.3 | 88.7 | 91.7 | 91.5 | 87.9 | 85.4 | 78.4 | 75.6 | 73.7 | 73.6 | 70.6 | 69.9 | 67.6 |
| Qwen3-235B-A22B-Thinking-2507 (Full Attention) | 95.4 | 99.6 | 100.0 | 99.5 | 99.6 | 99.1 | 100.0 | 98.8 | 98.1 | 96.1 | 95.2 | 90.0 | 91.7 | 89.7 | 87.9 | 85.9 |
| Qwen3-235B-A22B-Thinking-2507 (Sparse Attention) | 95.5 | 100.0 | 100.0 | 100.0 | 100.0 | 98.6 | 99.5 | 98.8 | 98.1 | 95.4 | 93.0 | 90.7 | 91.9 | 91.7 | 87.8 | 86.6 |

* All models are evaluated with Dual Chunk Attention enabled.
* Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each).
* To avoid overly verbose reasoning, we set the thinking budget to 8,192 tokens.

## Best Practices
To achieve optimal performance, we recommend the following settings:

config_1m.json
ADDED

The diff for this file is too large to render; see the raw diff.

tokenizer_config.json
CHANGED

@@ -230,7 +230,7 @@
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",
-  "model_max_length":
+  "model_max_length": 1010000,
   "pad_token": "<|endoftext|>",
   "split_special_tokens": false,
   "tokenizer_class": "Qwen2Tokenizer",
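
As a quick, optional sanity check after downloading, the tokenizer should now report the extended limit from the updated `tokenizer_config.json`; a minimal sketch assuming the local directory from Step 1:

```python
from transformers import AutoTokenizer

# model_max_length is read from tokenizer_config.json, so after this change
# the tokenizer should report the extended 1M-token limit.
tokenizer = AutoTokenizer.from_pretrained("./Qwen3-235B-A22B-Thinking-2507")
print(tokenizer.model_max_length)  # expected: 1010000
```
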