---
base_model:
- Qwen/Qwen2.5-32B
datasets:
- dyyyyyyyy/ScaleQuest-Math
- OpenCoder-LLM/opc-sft-stage2
- allenai/tulu-3-sft-mixture
- HuggingFaceTB/smoltalk2
- LipengCS/Table-GPT
- allenai/SciRIFF
language:
- en
library_name: transformers
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- sdlm
- diffusion language model
- custom_code
base_model_relation: finetune
---

# SDLM-32B-D4

[\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM) [\[📜 Tech Report\]](https://arxiv.org/abs/2509.24007) [\[🚀 Project Page\]](https://internvl.github.io/blog/2025-09-29-SDLM/) [\[🤗 HuggingFace Collection\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552)

## Introduction

We propose the **S**equential **D**iffusion **L**anguage **M**odel (**SDLM**), which cheaply stimulates the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces the decoding order through longest-prefix decoding, significantly improving prediction efficiency while preserving generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework quickly with only minimal instruction fine-tuning.

### Overall Concept

SDLM delivers strong performance with significantly faster decoding. It operates approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.

![Overall Framework](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/three_framework.png)

-   Autoregression: Predicts tokens one by one.
-   Diffusion: Regenerates all tokens each step.
-   SDLM (ours): Decodes D tokens per step, then **keeps the longest consecutive run of n confident tokens** (1 ≤ n ≤ D). Cached tokens are reused, saving computation.
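
To make the decoding rule concrete, here is a minimal sketch (an illustration, not the repository's implementation) of selecting the longest confident prefix from a decoded block:

```python
import torch

def longest_confident_prefix(token_ids, token_probs, tau=0.5):
    """Return the longest prefix of a decoded block whose tokens all
    exceed the confidence threshold tau. At least one token is always
    kept (n >= 1), so decoding degenerates gracefully to
    one-token-per-step autoregression in the worst case."""
    keep = 1  # the first token is always accepted, as in AR decoding
    for p in token_probs[1:]:
        if p < tau:
            break
        keep += 1
    return token_ids[:keep]

# Toy example with D = 4: the first three tokens clear tau = 0.5,
# so three tokens are emitted in a single forward pass.
ids = torch.tensor([11, 42, 7, 99])
probs = torch.tensor([0.91, 0.80, 0.62, 0.31])
print(longest_confident_prefix(ids, probs))  # tensor([11, 42,  7])
```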

## SDLM Family

In the following table, we provide an overview of the SDLM series.

| Model Name | Base Model 🤗 | HF Link 🤗 |
| :--------- | :----------------------------------------------------------- | :-------------------------------------------- |
| SDLM-3B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D4 |
| SDLM-3B-D8 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D8 |
| SDLM-32B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen2.5-32B</a> | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |

## Model Architecture

We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

*   **(a) Training pipeline.** Reordered input enables structured mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
*   **(b) Sampling pipeline.** Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as the dynamic output, and cached KV states enable efficient decoding.

![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/framework.png)
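
As a rough illustration of the structured mask in (a), the following sketch (our own construction, assuming a sequence laid out as [prefix | block]) builds a boolean attention mask with a causal prefix, full prefix visibility for the block, and bidirectional attention inside the block:

```python
import torch

def sdlm_block_mask(prefix_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for [prefix | block]."""
    n = prefix_len + block_size
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Top-left: causal attention among prefix tokens.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )
    # Bottom-left: every block token sees the whole prefix.
    mask[prefix_len:, :prefix_len] = True
    # Bottom-right: bidirectional attention inside the block.
    mask[prefix_len:, prefix_len:] = True
    return mask

print(sdlm_block_mask(prefix_len=3, block_size=2).int())
```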

## Performance

### Long-Form Benchmarks

On long-form generation benchmarks, SDLM matches the accuracy of comparable autoregressive models while decoding approximately 2x faster, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on MATH-500.

![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/main_exp1.png)

### General Multiple-Choice Benchmarks

![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/main_exp2.png)

### Block Size & Self-Speculative Decoding

![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/self_speculative_decoding.png)

## Trade-off Between Performance and Speed

The figure below shows the trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (B=4) and SDLM-3B (B=8). By adjusting τ, a controllable balance between speed and quality can be achieved; SpeedUp denotes the average number of tokens emitted per forward pass.

![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/ablation_tau.png)
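
As a usage sketch, the threshold can be swept with the generation script from the Inference section below (this assumes the `SDLM_generate` signature and `history` format shown there, and that `model`, `tokenizer`, and `model_inputs` are already prepared):

```python
from sdlm_inference import SDLM_generate

# Sweep the confidence threshold: higher values accept fewer tokens per
# forward pass (favoring quality), lower values accept more (favoring speed).
for tau in (0.3, 0.5, 0.7, 0.9):
    response, history = SDLM_generate(
        model, tokenizer, model_inputs,
        max_gen_len=1024,
        temperature=0,
        threshold=tau,
        n_future_tokens=4,
        alg='prob_conf',
        save_history=True,
        use_cache=True,
    )
    # Each history entry records the running token count, so tokens
    # generated divided by forward passes approximates the SpeedUp.
    print(f'tau={tau}: {history[-1][1]} tokens in {len(history)} steps')
```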

## Inference

1.  Install Dependencies

    Key package versions:

    ```
    transformers==4.37.2
    torch>=2.5.0
    ```
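
    For example, you can install them with pip:

    ```bash
    pip install "transformers==4.37.2" "torch>=2.5.0"
    ```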

2.  Download the model generation script [sdlm_inference.py](https://github.com/OpenGVLab/SDLM/blob/main/sdlm_inference.py) to your working directory.

3.  We provide example code to run `SDLM-32B-D4` using `transformers`.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sdlm_inference import SDLM_generate

    if __name__ == "__main__":
        ckpt_hf = 'OpenGVLab/SDLM-32B-D4'

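        # Optional: on multi-GPU machines, device_map="auto" in
        # from_pretrained() can shard the fp16 weights across devices.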
        model = AutoModelForCausalLM.from_pretrained(
            ckpt_hf,
            attn_implementation="eager",
            trust_remote_code=True
        ).to(dtype=torch.float16)
        tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)

        prompt = 'Write a Fibonacci function in Python.'
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

        response, history = SDLM_generate(
            model,
            tokenizer,
            model_inputs,
            max_gen_len=1024,
            temperature=0,
            threshold=0.5,
            n_future_tokens=4,
            alg='prob_conf',  # prob_conf | entropy_conf | self_speculative
            save_history=True,
            use_cache=True
        )

        print('response: ', response[0])

        print('======= history')
        for item in history:
            print('current total tokens:', item[1])
            print(item[0][0])
            print('--------')
    ```

## Train

1.  Environment Setup

    ```bash
    git clone https://github.com/OpenGVLab/SDLM.git
    cd SDLM
    ```

2.  Install Dependencies

    Key package versions:
    ```
    transformers==4.37.2
    deepspeed==0.16.5
    torch>=2.5.0
    accelerate==0.32.1
    ```
    **Note**: Additional setup is required if using Flex Attention.


3.  Prepare Training Data

    The training data we use are specified in the meta file [meta.json](https://github.com/OpenGVLab/SDLM/blob/main/shell/playground/data/meta/sft_opc436k_scale_math_1m_smoltalk_1m_tulu_1m.json) and are organized in the ShareGPT style, following the [InternVL chat data format](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html); a minimal record sketch appears after the table below.

    This dataset is composed of several open-source datasets, with the following structure:

    | Dataset Name | # Samples | Domain |
    | :----------------------------------------------------------------------------------------- | :--------- | :------- |
    | <a href="https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math">ScaleQuest-Math</a> | 1,000K | Math |
    | <a href="https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2">Opc-sft-stage2</a> | 436K | Code |
    | <a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk">Smoltalk</a> | 1,100K | General |
    | <a href="https://huggingface.co/datasets/allenai/tulu-3-sft-mixture">Tulu-3-sft-mixture</a> | 939K | General |
    | <a href="https://huggingface.co/datasets/allenai/SciRIFF">SciRIFF</a> | 79K | Science |
    | <a href="https://huggingface.co/datasets/LipengCS/Table-GPT">Table-GPT</a> | 13K | Table |
    | **Total** | **3,506K** | -- |
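
    For reference, a single ShareGPT-style record looks roughly like this (a minimal sketch; see the InternVL chat data format documentation linked above for the authoritative schema):

    ```json
    {
      "id": 0,
      "conversations": [
        {"from": "human", "value": "Write a Fibonacci function in Python."},
        {"from": "gpt", "value": "def fib(n):\n    a, b = 0, 1\n    ..."}
      ]
    }
    ```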


4.  Start Training

    All training scripts are available in the [shell/train](https://github.com/OpenGVLab/SDLM/tree/main/shell/train) directory. Key parameters include the following (a hypothetical invocation sketch follows this list):
    *   `block_size`: The size of the diffusion window. Current settings use `4`; we have also experimented with `8`, and larger sizes are under exploration.
    *   `attn_implementation`: Attention implementation. Options are `sdpa`, `eager`, and `flex_attn`; Flex Attention requires additional setup, so prefer `sdpa` for a quick start.
    *   `causal_attn`: Whether to use causal attention within the window. Currently set to non-causal (`False`).
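
    A hypothetical invocation could pass these parameters as follows (illustrative only; the launcher and script name are placeholders, so use the actual scripts in [shell/train](https://github.com/OpenGVLab/SDLM/tree/main/shell/train) for real runs):

    ```bash
    # Illustrative sketch: entry point and launcher are placeholders.
    torchrun --nproc_per_node=8 train.py \
        --block_size 4 \
        --attn_implementation sdpa \
        --causal_attn False
    ```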

    For more details about training, please refer to [GitHub](https://github.com/OpenGVLab/SDLM).

## Evaluation

Currently, we use [OpenCompass](https://github.com/open-compass/opencompass) for evaluation. For more details, please refer to the [evaluation guide](https://github.com/OpenGVLab/SDLM/blob/main/eval/with_opencompass/readme.md).


## Acknowledgements

We extend our gratitude to the open-source community for their foundational contributions:

*   [InternVL](https://github.com/OpenGVLab/InternVL/tree/main): the codebase we build upon.
*   [SMDM](https://github.com/ML-GSAI/SMDM), [LLaDA](https://github.com/ML-GSAI/LLaDA), [Dream](https://github.com/HKUNLP/Dream), and [Block Diffusion](https://github.com/kuleshov-group/bd3lms): for insights into diffusion-based generative modeling.
*   [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5-llm/): a robust base model for comparative studies.
*   [OpenCompass](https://github.com/open-compass/opencompass): a comprehensive evaluation framework.
*   The creators of all datasets used in this work, which enable rigorous training and validation.

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{liu2025sdlm,
  title={Sequential Diffusion Language Models},
  author={Liu, Yangzhou and Cao, Yue and Li, Hao and Luo, Gen and Chen, Zhe and Wang, Weiyun and Liang, Xiaobo and Qi, Biqing and Wu, Lijun and Tian, Changyao and Zhang, Yanting and Li, Yuqiang and Lu, Tong and Qiao, Yu and Dai, Jifeng and Wang, Wenhai},
  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
}
```