---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B
pipeline_tag: text-generation
tags:
- language model
- parallel-decoding
library_name: transformers
---

# WeDLM-8B-Instruct ⭐

**WeDLM-8B-Instruct** is our flagship instruction-tuned diffusion language model. Fine-tuned from [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), it performs parallel decoding under standard causal attention.

**Highlights:**
- 🚀 3-6× faster than vLLM-optimized Qwen3-8B-Instruct on math reasoning tasks
- 📈 Outperforms Qwen3-8B-Instruct on most benchmarks
- ✅ Natively compatible with the standard KV-cache stack (FlashAttention, PagedAttention, CUDA Graphs)

For the base (pretrained) version, see [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base), which is based on Qwen3-8B-Base.

📄 [Paper](https://arxiv.org/abs/2512.22737) | 🌐 [Project Page](https://wedlm.github.io) | 💻 [GitHub](https://github.com/tencent/WeDLM)

## Model Details

| Attribute | Value |
|:----------|:------|
| Base Model | [WeDLM-8B](https://huggingface.co/tencent/WeDLM-8B-Base) |
| Parameters | 8B |
| Context Length | 32,768 tokens |

## Installation

```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```

<details>
<summary><b>Manual Installation</b></summary>

```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129

# Step 2: flash-attn build dependencies
pip install psutil ninja packaging

# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```

</details>

<details>
<summary><b>Docker Installation</b></summary>

```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3

# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash

# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
```

</details>

> **Note:** `flash-attn` requires compilation and must be installed after PyTorch.
> The `install.sh` script handles this automatically (default: CUDA 12.9).
> For other CUDA versions: `CUDA_VERSION=cu124 bash install.sh`
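
After installation, a quick import check catches most environment problems early. Below is a minimal sanity-check sketch; it assumes the editable install above succeeded and that `flash_attn` exposes `__version__` (true of recent releases):

```python
# Post-install sanity check (sketch; package layout assumed as above).
import torch
import flash_attn  # an ImportError here usually means the build against torch failed
import wedlm       # installed by install.sh / `pip install -e .`

print("torch:", torch.__version__)            # expect 2.8.0+cu129 with the defaults
print("flash-attn:", flash_attn.__version__)  # expect 2.7.4.post1
print("CUDA available:", torch.cuda.is_available())
```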

## Quick Start (Recommended)

For **fast inference**, use the `wedlm` engine:

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```
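
An interactive session follows the same pattern: append each reply to `messages` and re-apply the chat template so the model always sees the full history. Here is a minimal loop sketch built only from the calls shown above (the loop itself is illustrative, not part of the `wedlm` API):

```python
# REPL-style chat loop (sketch): reuses `llm` and `tokenizer` from Quick Start.
messages = []
while True:
    user = input("You: ").strip()
    if not user:  # empty line exits
        break
    messages.append({"role": "user", "content": user})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    reply = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))[0]["text"]
    messages.append({"role": "assistant", "content": reply})
    print("Assistant:", reply)
```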

### Batch Inference

```python
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]

outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```

## HuggingFace Transformers

For **training** or simple forward passes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)  # forward pass; outputs.logits holds per-position token logits
```

> ⚠️ **Note:** The HuggingFace interface is for training/forward-pass convenience. For optimized inference throughput, use the `wedlm` engine above.
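
Since this path is meant for training, computing a loss is the natural next step. The sketch below uses the standard `labels=` convention of `AutoModelForCausalLM`; this card does not say whether WeDLM's remote-code class substitutes a diffusion-specific objective, so verify against the training code on GitHub before relying on it:

```python
# LM-loss sketch via the HF forward pass (assumes the usual `labels` contract).
import torch

batch = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():  # drop no_grad (and add an optimizer step) for real training
    out = model(**batch, labels=batch["input_ids"])
print("loss:", out.loss.item())
```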

## Performance

### Generation Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|:----------|:-----------------:|:-----------------:|
| ARC-C (0-shot) | 91.47 | **92.92** |
| GSM8K (3-shot) | 89.91 | **92.27** |
| MATH (4-shot) | **69.60** | 64.80 |
| HumanEval (4-shot) | 71.95 | **80.49** |
| MMLU (5-shot) | 71.52 | **75.14** |
| GPQA-Diamond (5-shot) | 41.41 | **44.95** |
| **Average** | 72.64 | **75.10** |

### Inference Speed

Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):

| Scenario | Speedup | Notes |
|:---------|:-------:|:------|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
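
Absolute throughput depends on hardware, batch size, and prompt mix, so it is worth measuring on your own setup. A rough probe built from the Quick Start calls (the token-counting convention here is ours, not an official benchmark harness):

```python
# Rough tokens/sec probe for the wedlm engine (sketch; single prompt, warm run).
import time

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}],
    tokenize=False, add_generation_prompt=True,
)
llm.generate([text], SamplingParams(temperature=0.2, max_tokens=64))  # warm-up

start = time.perf_counter()
out = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))[0]["text"]
elapsed = time.perf_counter() - start
print(f"{len(tokenizer.encode(out)) / elapsed:.1f} tokens/sec generated")
```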

## Citation

```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
```

## License

Apache 2.0