lll2343 nielsr HF Staff committed on
Commit 71164b1 · verified · 1 Parent(s): 2ec6c6d

Enhance model card: Add project page link, overall concept, training, evaluation, case, and acknowledgment sections (#1)


- Enhance model card: Add project page link, overall concept, training, evaluation, case, and acknowledgment sections (17d4f5413338f3666f67f3a4cd695c386b416fb3)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +180 -84
README.md CHANGED
@@ -1,18 +1,6 @@
  ---
- license: apache-2.0
- license_name: qwen
- license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
- pipeline_tag: text-generation
- library_name: transformers
  base_model:
  - Qwen/Qwen2.5-32B
- base_model_relation: finetune
- language:
- - en
- tags:
- - sdlm
- - diffusion language model
- - custom_code
  datasets:
  - dyyyyyyyy/ScaleQuest-Math
  - OpenCoder-LLM/opc-sft-stage2
@@ -20,34 +8,54 @@ datasets:
  - HuggingFaceTB/smoltalk2
  - LipengCS/Table-GPT
  - allenai/SciRIFF
  ---

  # SDLM-32B-D4

- [\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM) [\[📜 Tech Report\]](https://arxiv.org/abs/2509.24007) [\[🤗 HuggingFace\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552)

  ## Introduction

- We propose a <b>S</b>equential <b>D</b>iffusion <b>L</b>anguage <b>M</b>odel (<b>SDLM</b>), to cheaply stimulate the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces decoding order through the longest prefix decoding method, thereby significantly improving prediction efficiency while ensuring generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm. Therefore, it is possible to use pre-trained AR weights and quickly migrate to the diffusion framework with only minimal instruction fine-tuning.

- ![image/png](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/three_framework.png)

  ## SDLM Family

  In the following table, we provide an overview of the SDLM series.

- | Model Name | Base Model 🤗 | HF Link 🤗 |
- | ----------- | ------------------------------------------------------------ | -------------------------------------------- |
- | SDLM-3B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D4 |
- | SDLM-3B-D8 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D8 |
  | SDLM-32B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen2.5-32B</a> | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |

  ## Model Architecture

  We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

- * **(a) Training pipeline.** Reordered input enables structured mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right).
- * **(b) Sampling Pipeline.** Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as dynamic output. Cached KV states enable efficient decoding.

  ![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/framework.png)

@@ -75,70 +83,158 @@ Trade-off between performance and speed under different confidence thresholds τ

  ## Inference

- 1. Install Dependencies
-
- Key package versions:
-
- ```
- transformers==4.37.2
- torch>=2.5.0
- ```
-
- 2. Download the model generation script [sdlm_inference.py](https://github.com/OpenGVLab/SDLM/blob/main/sdlm_inference.py) to your working directory.
-
- 3. We provide example code to run `SDLM-32B-D4` using `transformers`.
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from sdlm_inference import SDLM_generate
-
- if __name__ == "__main__":
-     ckpt_hf = 'OpenGVLab/SDLM-32B-D4'
-
-     model = AutoModelForCausalLM.from_pretrained(
-         ckpt_hf,
-         attn_implementation="eager",
-         trust_remote_code=True
-     ).to(dtype=torch.float16)
-     tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)
-
-     prompt = 'Write a Fibonacci function in Python.'
-     messages = [
-         {"role": "system", "content": "You are a helpful assistant."},
-         {"role": "user", "content": prompt}
-     ]
-     text = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True
-     )
-
-     model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
-     response, history = SDLM_generate(
-         model,
-         tokenizer,
-         model_inputs,
-         max_gen_len=1024,
-         temperature=0,
-         threshold=0.5,
-         n_future_tokens=4,
-         alg='prob_conf',  # prob_conf | entropy_conf | self_speculative
-         save_history=True,
-         use_cache=True
-     )
-
-     print('response: ', response[0])
-
-     print('======= history')
-     for item in history:
-         print('cur total token ', item[1])
-         print(item[0][0])
-         print('--------')
- ```

  ## Citation
 
@@ -151,4 +247,4 @@ If you find this project useful in your research, please consider citing:
  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
  }
- ```
 
  ---
  base_model:
  - Qwen/Qwen2.5-32B
  datasets:
  - dyyyyyyyy/ScaleQuest-Math
  - OpenCoder-LLM/opc-sft-stage2
  - HuggingFaceTB/smoltalk2
  - LipengCS/Table-GPT
  - allenai/SciRIFF
+ language:
+ - en
+ library_name: transformers
+ license: apache-2.0
+ license_name: qwen
+ license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
+ pipeline_tag: text-generation
+ tags:
+ - sdlm
+ - diffusion language model
+ - custom_code
+ base_model_relation: finetune
  ---

  # SDLM-32B-D4

+ [\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM) [\[📜 Tech Report\]](https://arxiv.org/abs/2509.24007) [\[🚀 Project Page\]](https://internvl.github.io/blog/2025-09-29-SDLM/) [\[🤗 HuggingFace Collection\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552)

  ## Introduction

+ We propose the **S**equential **D**iffusion **L**anguage **M**odel (**SDLM**), which cheaply elicits the parallel prediction capabilities of diffusion models. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces decoding order through a longest-prefix decoding method, significantly improving prediction efficiency while preserving generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be migrated to the diffusion framework with only minimal instruction fine-tuning.

+ ### Overall Concept
+
+ SDLM delivers strong performance with significantly faster decoding. It runs approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.
+
+ ![Overall Framework](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/framwork_compare.png)
+
+ - Autoregression: Predicts tokens one by one.
+ - Diffusion: Regenerates all tokens at each step.
+ - SDLM (ours): Decodes D tokens per step, then **keeps the longest prefix of n consecutive confident tokens** (1 ≤ n ≤ D). Cached tokens are reused, saving computation (see the sketch below).
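+
+ As a minimal sketch of this selection rule (illustrative only; the released `SDLM_generate` with `alg='prob_conf'` implements the actual decoding), confidence-based longest-prefix selection can be written as:
+
+ ```python
+ import torch
+
+ def longest_confident_prefix(block_logits: torch.Tensor, threshold: float = 0.5):
+     """Keep the longest confident prefix of a block of D draft tokens.
+
+     block_logits: (D, vocab_size) logits for the D positions of one block.
+     Returns the accepted tokens and their count n (1 <= n <= D); the first
+     token is always accepted, so decoding never stalls.
+     """
+     probs = block_logits.softmax(dim=-1)
+     confidence, tokens = probs.max(dim=-1)  # top-1 confidence per position
+     n = 1
+     while n < confidence.numel() and confidence[n] >= threshold:
+         n += 1
+     return tokens[:n], n
+ ```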

  ## SDLM Family

  In the following table, we provide an overview of the SDLM series.

+ | Model Name | Base Model 🤗 | HF Link 🤗 |
+ | :--------- | :----------------------------------------------------------- | :-------------------------------------------- |
+ | SDLM-3B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D4 |
+ | SDLM-3B-D8 | <a href="https://huggingface.co/Qwen/Qwen2.5-3B">Qwen2.5-3B</a> | https://huggingface.co/OpenGVLab/SDLM-3B-D8 |
  | SDLM-32B-D4 | <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen2.5-32B</a> | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |

  ## Model Architecture

  We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

+ * **(a) Training pipeline.** Reordered input enables a structured mask with a causal prefix (top-left), a visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right); see the mask sketch after this list.
+ * **(b) Sampling pipeline.** Confidence-based dynamic block decoding with KV-cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks, and the longest high-confidence prefix is kept as the dynamic output. Cached KV states enable efficient decoding.
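+
+ The structured mask in (a) can be sketched as follows (an illustration under our own naming, not the repository's actual implementation):
+
+ ```python
+ import torch
+
+ def block_diffusion_mask(prefix_len: int, num_blocks: int, block: int) -> torch.Tensor:
+     """Boolean attention mask (True = may attend).
+
+     Prefix tokens attend causally; each block attends to the full prefix,
+     to all earlier blocks, and bidirectionally within itself.
+     """
+     total = prefix_len + num_blocks * block
+     mask = torch.zeros(total, total, dtype=torch.bool)
+     # causal prefix (top-left)
+     mask[:prefix_len, :prefix_len] = torch.tril(
+         torch.ones(prefix_len, prefix_len, dtype=torch.bool))
+     for b in range(num_blocks):
+         start = prefix_len + b * block
+         end = start + block
+         mask[start:end, :start] = True     # visible cross-block prefix
+         mask[start:end, start:end] = True  # intra-block bidirectional attention
+     return mask
+ ```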

  ![image/png](https://huggingface.co/OpenGVLab/SDLM-3B-D4/resolve/main/assets/framework.png)

  ## Inference

+ 1. Install Dependencies
+
+ Key package versions:
+
+ ```
+ transformers==4.37.2
+ torch>=2.5.0
+ ```
+
+ 2. Download the model generation script [sdlm_inference.py](https://github.com/OpenGVLab/SDLM/blob/main/sdlm_inference.py) to your working directory.
+
+ 3. We provide example code to run `SDLM-32B-D4` using `transformers`.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from sdlm_inference import SDLM_generate
+
+ if __name__ == "__main__":
+     ckpt_hf = 'OpenGVLab/SDLM-32B-D4'
+
+     # Load the SDLM checkpoint together with its custom modeling code
+     model = AutoModelForCausalLM.from_pretrained(
+         ckpt_hf,
+         attn_implementation="eager",
+         trust_remote_code=True
+     ).to(dtype=torch.float16)
+     tokenizer = AutoTokenizer.from_pretrained(ckpt_hf)
+
+     prompt = 'Write a Fibonacci function in Python.'
+     messages = [
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": prompt}
+     ]
+     text = tokenizer.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+     model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+     # Confidence-based block decoding
+     response, history = SDLM_generate(
+         model,
+         tokenizer,
+         model_inputs,
+         max_gen_len=1024,
+         temperature=0,
+         threshold=0.5,
+         n_future_tokens=4,
+         alg='prob_conf',  # prob_conf | entropy_conf | self_speculative
+         save_history=True,
+         use_cache=True
+     )
+
+     print('response: ', response[0])
+
+     print('======= history')
+     for item in history:
+         print('cur total token ', item[1])
+         print(item[0][0])
+         print('--------')
+ ```
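+
+ A practical note on the parameters above: `n_future_tokens` should match the block size of the checkpoint (4 for the D4 models, 8 for D8), and `threshold` trades speed for quality, since lower values accept longer prefixes per step while higher values accept only the most confident tokens.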
+
+ ## Train
+
+ 1. Environment Setup
+
+ ```bash
+ git clone https://github.com/OpenGVLab/SDLM.git
+ cd SDLM
+ ```
+
+ 2. Install Dependencies
+
+ Key package versions:
+
+ ```
+ transformers==4.37.2
+ deepspeed==0.16.5
+ torch>=2.5.0
+ accelerate==0.32.1
+ ```
+
+ **Note**: Additional setup is required if using Flex Attention.
+
+ 3. Prepare Training Data
+
+ The training dataset we used is specified in the meta file [meta.json](https://github.com/OpenGVLab/SDLM/blob/main/shell/playground/data/meta/sft_opc436k_scale_math_1m_smoltalk_1m_tulu_1m.json) and is organized in the ShareGPT style, following the [InternVL chat data format](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html); a sketch of one record appears below.
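+
+ For orientation, one record in this style looks roughly like the following (field names follow the common ShareGPT convention; the linked format guide is authoritative):
+
+ ```python
+ # A hypothetical ShareGPT-style training record, written as a Python dict.
+ example_record = {
+     "id": 0,
+     "conversations": [
+         {"from": "human", "value": "Write a Fibonacci function in Python."},
+         {"from": "gpt", "value": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"},
+     ],
+ }
+ ```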
+
+ This dataset is composed of several open-source datasets, with the following structure:
+
+ | Dataset Name | # Samples | Domain |
+ | :----------------------------------------------------------------------------------------- | :--------- | :------ |
+ | <a href="https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math">ScaleQuest-Math</a> | 1,000K | Math |
+ | <a href="https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2">Opc-sft-stage2</a> | 436K | Code |
+ | <a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk">Smoltalk</a> | 1,100K | General |
+ | <a href="https://huggingface.co/datasets/allenai/tulu-3-sft-mixture">Tulu-3-sft-mixture</a> | 939K | General |
+ | <a href="https://huggingface.co/datasets/allenai/SciRIFF">SciRIFF</a> | 79K | Science |
+ | <a href="https://huggingface.co/datasets/LipengCS/Table-GPT">Table-GPT</a> | 13K | Table |
+ | **Total** | **3,506K** | -- |
+
+ 4. Start Training
+
+ All training scripts are available in the [shell/train](https://github.com/OpenGVLab/SDLM/tree/main/shell/train) directory. Key parameters include:
+ * `block_size`: The size of the diffusion window. Current settings use `4`; we have also tried `8`, and larger sizes are under exploration.
+ * `attn_implementation`: The attention implementation. Options are `sdpa`, `eager`, and `flex_attn`; Flex Attention requires additional setup, so prefer `sdpa` for a quick start.
+ * `causal_attn`: Whether to use causal attention within the window. Currently set to non-causal (`False`).
+
+ Our training settings are:
+
+ <p align="center">
+ <img src="https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/hyper-param.png" width="50%">
+ </p>
+
+ The figure below shows the training loss of our 3B model. loss_pos_`i` refers to the loss at the `i`-th position of each block; the loss at the first position is close to the SFT loss of AR next-token prediction.
+
+ We display the loss for each position within the window during training. For block size 8 (bs=8), only the first 4 positions are shown. The correspondence is as follows:
+
+ bs = 4 (red):
+
+ | x | m | m | m |
+ | :-- | :-- | :-- | :-- |
+ | loss_pos_1 | loss_pos_2 | loss_pos_3 | loss_pos_4 |
+
+ bs = 8 (orange):
+
+ | x | m | m | m | m | m | m | m |
+ | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
+ | loss_pos_1 | loss_pos_2 | loss_pos_3 | loss_pos_4 | -- | -- | -- | -- |
+
+ ![](https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/train_log_3b.png)
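+
+ As an illustration of how such per-position curves can be computed (names and tensor layout are our own assumptions, not the repository's training code):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def per_position_losses(logits: torch.Tensor, targets: torch.Tensor, block: int):
+     """Average cross-entropy at each intra-block position.
+
+     logits: (num_blocks * block, vocab_size) predictions for the masked blocks.
+     targets: (num_blocks * block,) gold token ids.
+     Returns [loss_pos_1, ..., loss_pos_block]; the first entry behaves like
+     ordinary AR next-token prediction loss.
+     """
+     losses = []
+     for i in range(block):
+         sel = slice(i, logits.size(0), block)  # i-th position of every block
+         losses.append(F.cross_entropy(logits[sel], targets[sel]).item())
+     return losses
+ ```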
+
+ ## Evaluation
+
+ Currently, we use [OpenCompass](https://github.com/open-compass/opencompass) for evaluation. For more details, please refer to the [evaluation guide](https://github.com/OpenGVLab/SDLM/blob/main/eval/with_opencompass/readme.md).
+
+ ## Case
+
+ <p align="center">
+ <img src="https://huggingface.co/OpenGVLab/SDLM-32B-D4/resolve/main/assets/case.gif" width="70%">
+ </p>
+
+ ## Acknowledgements
+
+ We extend our gratitude to the open-source community for their foundational contributions:
+
+ * [InternVL](https://github.com/OpenGVLab/InternVL/tree/main): the codebase we build upon.
+ * [SMDM](https://github.com/ML-GSAI/SMDM), [LLaDA](https://github.com/ML-GSAI/LLaDA), [Dream](https://github.com/HKUNLP/Dream), and [Block Diffusion](https://github.com/kuleshov-group/bd3lms): insights into diffusion-based generative modeling.
+ * [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5-llm/): a robust base model for comparative studies.
+ * [OpenCompass](https://github.com/open-compass/opencompass): a comprehensive evaluation framework.
+ * The creators of all datasets used in this work, enabling rigorous training and validation.

  ## Citation

  journal={arXiv preprint arXiv:2509.24007},
  year={2025}
  }
+ ```