Enhance model card: Add project page link, overall concept, training, evaluation, case, and acknowledgment sections
This PR significantly enhances the model card for `SDLM-32B-D4` by integrating comprehensive information from the project's GitHub README. Key improvements include:
* **Header Links:**
  * Added a direct link to the project's blog page (`🚀 Project Page`) for improved discoverability.
  * Clarified the existing Hugging Face collection link as `🤗 HuggingFace Collection`.
  * Retained the arXiv link for the paper, as per instructions.
* **Overall Concept:** Added a visual explanation of SDLM's core concept, sourced from the GitHub README's introduction.
* **Training Section:** Included detailed training instructions, environment setup, dependencies, a table of datasets used with their Hugging Face links, training parameters, and loss plots, greatly aiding reproducibility.
* **Evaluation Section:** Added information on the evaluation framework used.
* **Case Section:** Incorporated a visual demonstration of the model's capabilities.
* **Acknowledgment Section:** Added a section to acknowledge foundational contributions, fostering community recognition.
All new content, including images and internal links, has been carefully sourced and adapted from the official GitHub repository to maintain accuracy and consistency.
---
base_model:
- Qwen/Qwen2.5-32B
datasets:
- dyyyyyyyy/ScaleQuest-Math
- OpenCoder-LLM/opc-sft-stage2
- HuggingFaceTB/smoltalk2
- LipengCS/Table-GPT
- allenai/SciRIFF
language:
- en
library_name: transformers
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- sdlm
- diffusion language model
- custom_code
base_model_relation: finetune
---

# SDLM-32B-D4

[\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM) [\[📜 Tech Report\]](https://arxiv.org/abs/2509.24007) [\[🚀 Project Page\]](https://internvl.github.io/blog/2025-09-29-SDLM/) [\[🤗 HuggingFace Collection\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552)

## Introduction

We propose a **S**equential **D**iffusion **L**anguage **M**odel (**SDLM**) that cheaply elicits the parallel prediction capabilities of diffusion models. SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces the decoding order through longest-prefix decoding, significantly improving prediction efficiency while preserving generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework with only minimal instruction fine-tuning.

### Overall Concept

SDLM delivers strong performance with significantly faster decoding: it runs roughly 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.

![image/jpeg](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/teaser.png)

- Autoregression: predicts tokens one by one.
- Diffusion: regenerates all tokens at each step.
- SDLM (ours): decodes D tokens per step, then **keeps the longest consecutive run of confident tokens**, n of them (1 ≤ n ≤ D). Cached tokens are reused, saving computation.
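
As an illustrative sketch of this acceptance rule (not the repository's implementation; `token_confidences` and `threshold` are hypothetical names), the longest confident prefix of a block can be selected like this:

```python
import torch

def longest_confident_prefix(token_confidences: torch.Tensor, threshold: float = 0.5) -> int:
    """Length n of the accepted prefix: the first token is always kept, and each
    following token is kept only while its confidence stays above `threshold`,
    so 1 <= n <= D."""
    confident = token_confidences >= threshold
    n = 1
    while n < confident.numel() and bool(confident[n]):
        n += 1
    return n

# One decoding step of a D=4 block: the first two tokens are accepted,
# the remaining two are re-predicted in the next step.
block_confidence = torch.tensor([0.92, 0.81, 0.44, 0.67])
print(longest_confident_prefix(block_confidence))  # -> 2
```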

## SDLM Family

In the following table, we provide an overview of the SDLM series.

| Model Name  | Base Model 🤗 | HF Link 🤗 |
| :---------- | :------------ | :--------- |
| SDLM-3B-D4  | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a>   | https://huggingface.co/OpenGVLab/SDLM-3B-D4  |
| SDLM-3B-D8  | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a>   | https://huggingface.co/OpenGVLab/SDLM-3B-D8  |
| SDLM-32B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen2.5-32B</a> | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |

## Model Architecture

We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

* **(a) Training pipeline.** The reordered input enables a structured mask with a causal prefix (top-left), a visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
* **(b) Sampling pipeline.** Confidence-based dynamic block decoding with KV-cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks, the longest high-confidence prefix is selected as the dynamic output, and cached KV states enable efficient decoding.

![image/png](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/framework.png)

## Inference

1. Install Dependencies

   Key package versions:

   ```
   transformers==4.37.2
   torch>=2.5.0
   ```

2. Download the model generation script [sdlm_inference.py](https://github.com/OpenGVLab/SDLM/blob/main/sdlm_inference.py) to your working directory.

3. We provide example code for running `SDLM-32B-D4` with `transformers`.

   ```python
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer
   from sdlm_inference import SDLM_generate

   if __name__ == "__main__":
       ckpt_hf = 'OpenGVLab/SDLM-32B-D4'

       # Load the checkpoint with its custom SDLM code and run it in half precision.
       model = AutoModelForCausalLM.from_pretrained(
           ckpt_hf,
           attn_implementation="eager",
           trust_remote_code=True
       ).to(dtype=torch.float16)
       tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)

       prompt = 'Write a Fibonacci function in Python.'
       messages = [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": prompt}
       ]
       text = tokenizer.apply_chat_template(
           messages,
           tokenize=False,
           add_generation_prompt=True
       )

       model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

       # Blockwise diffusion decoding: predict n_future_tokens per step and keep
       # the longest prefix whose confidence exceeds the threshold.
       response, history = SDLM_generate(
           model,
           tokenizer,
           model_inputs,
           max_gen_len=1024,
           temperature=0,
           threshold=0.5,
           n_future_tokens=4,
           alg='prob_conf',  # prob_conf | entropy_conf | self_speculative
           save_history=True,
           use_cache=True
       )

       print('response: ', response[0])

       print('======= history')
       for item in history:
           print('current total tokens: ', item[1])
           print(item[0][0])
           print('--------')
   ```
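
   Here, `n_future_tokens` matches the block size D of this checkpoint (D4) and `threshold` plays the role of the confidence threshold τ in the longest-prefix acceptance rule described above; `alg` presumably selects how per-token confidence is measured (probability, entropy, or a self-speculative check).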

## Train

1. Environment Setup

   ```bash
   git clone https://github.com/OpenGVLab/SDLM.git
   cd SDLM
   ```

2. Install Dependencies

   Key package versions:

   ```
   transformers==4.37.2
   deepspeed==0.16.5
   torch>=2.5.0
   accelerate==0.32.1
   ```

   **Note**: Additional setup is required if you use Flex Attention.

3. Prepare Training Data

   The training dataset we used is specified in the meta file [meta.json](https://github.com/OpenGVLab/SDLM/blob/main/shell/playground/data/meta/sft_opc436k_scale_math_1m_smoltalk_1m_tulu_1m.json) and is organized in the ShareGPT style, following the [InternVL chat data format](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html); a rough sketch of a single sample is shown after the dataset table below.

   This dataset is composed of several open-source datasets, with the following structure:

   | Dataset Name | # Samples | Domain |
   | :----------- | :-------- | :----- |
   | <a href="https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math">ScaleQuest-Math</a> | 1,000K | Math |
   | <a href="https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2">Opc-sft-stage2</a> | 436K | Code |
   | <a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk">Smoltalk</a> | 1,100K | General |
   | <a href="https://huggingface.co/datasets/allenai/tulu-3-sft-mixture">Tulu-3-sft-mixture</a> | 939K | General |
   | <a href="https://huggingface.co/datasets/allenai/SciRIFF">SciRIFF</a> | 79K | Science |
   | <a href="https://huggingface.co/datasets/LipengCS/Table-GPT">Table-GPT</a> | 13K | Table |
   | **Total** | **3,506K** | -- |
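
   The field layout follows the InternVL chat data format linked above; as a rough, hypothetical illustration (values invented for this sketch), a single ShareGPT-style sample looks roughly like:

   ```python
   # One jsonl record, shown as a Python dict; "from" alternates between "human" and "gpt".
   sample = {
       "id": 0,
       "conversations": [
           {"from": "human", "value": "Write a Fibonacci function in Python."},
           {"from": "gpt", "value": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"},
       ],
   }
   ```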

4. Start Training

   All training scripts are available in the [shell/train](https://github.com/OpenGVLab/SDLM/tree/main/shell/train) directory. Key parameters include:

   * `block_size`: size of the diffusion window. Current settings use `4`; we have also tried `8`, and larger sizes are under exploration.
   * `attn_implementation`: attention implementation. Options are `sdpa`, `eager`, or `flex_attn`; Flex Attention requires additional setup, so prefer `sdpa` for a quick start.
   * `causal_attn`: whether to use causal attention within the window. Currently set to non-causal (`False`).

   Our training settings are:

   <p align="center">
     <img src="https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/hyper-param.png" width="50%">
   </p>

   Training loss of our 3B model: `loss_pos_i` is the loss at the `i`-th position of each block, and the loss at the first position is close to the SFT loss of AR next-token prediction (NTP).

   Below we show the loss at each position within the window during training. For bs = 8, only the first 4 positions are plotted. The correspondence is as follows:

   bs = 4 (red):

   | x | m | m | m |
   | :-- | :-- | :-- | :-- |
   | loss_pos_1 | loss_pos_2 | loss_pos_3 | loss_pos_4 |

   bs = 8 (orange):

   | x | m | m | m | m | m | m | m |
   | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
   | loss_pos_1 | loss_pos_2 | loss_pos_3 | loss_pos_4 | -- | -- | -- | -- |

   ![image/png](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/training_loss.png)

## Evaluation

Currently, we use [OpenCompass](https://github.com/open-compass/opencompass) for evaluation. For more details, please refer to the [evaluation guide](https://github.com/OpenGVLab/SDLM/blob/main/eval/with_opencompass/readme.md).

## Case

<p align="center">
  <img src="https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/case.gif" width="70%">
</p>

## Acknowledgment

We extend our gratitude to the open-source community for their foundational contributions:

* [InternVL](https://github.com/OpenGVLab/InternVL/tree/main): the codebase we build upon.
* [SMDM](https://github.com/ML-GSAI/SMDM), [LLaDA](https://github.com/ML-GSAI/LLaDA), [Dream](https://github.com/HKUNLP/Dream), and [Block Diffusion](https://github.com/kuleshov-group/bd3lms) for insights into diffusion-based generative modeling.
* [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5-llm/) as a robust base model for comparative studies.
* [OpenCompass](https://github.com/open-compass/opencompass) for providing a comprehensive evaluation framework.
* The creators of all datasets used in this work, which enable rigorous training and validation.

## Citation

If you find this project useful in your research, please consider citing:

```
  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
}
```
|