shuai bai committed
Commit a6647ef · verified · 1 Parent(s): 3d0753c

Update README.md

Files changed (1)
  1. README.md +146 -58
README.md CHANGED
@@ -7,6 +7,8 @@ license: apache-2.0
 
  # Qwen3-VL-235B-A22B-Thinking-FP8
 
+ > This repository contains an FP8 quantized version of the [Qwen3-VL-235B-A22B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking) model. We quantized it using Activation-aware Weight Quantization (AWQ), and its performance metrics are nearly identical to those of the original BF16 model. Enjoy!
+
 
  Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.
 
@@ -64,75 +66,161 @@ This is the weight repository for the FP8 version of Qwen3-VL-235B-A22B-Thinking
 
  ## Quickstart
 
- Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.
 
- The code for Qwen3-VL is in the latest Hugging Face Transformers, and we advise you to build from source with the command:
- ```
- pip install git+https://github.com/huggingface/transformers
- # pip install transformers==4.57.0 # currently, v4.57.0 is not released
- ```
 
- ### Using 🤗 Transformers to Chat
 
- Here we show a code snippet demonstrating how to use the chat model with `transformers`:
 
  ```python
- from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
-
- # default: Load the model on the available device(s)
- model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
-     "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8", dtype="auto", device_map="auto"
- )
-
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
- # model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
- #     "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8",
- #     dtype="auto",
- #     attn_implementation="flash_attention_2",
- #     device_map="auto",
- # )
-
- processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Thinking-FP8")
-
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {
-                 "type": "image",
-                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
-             },
-             {"type": "text", "text": "Describe this image."},
-         ],
      }
- ]
-
- # Preparation for inference
- inputs = processor.apply_chat_template(
-     messages,
-     tokenize=True,
-     add_generation_prompt=True,
-     return_dict=True,
-     return_tensors="pt"
- )
-
- # Inference: Generation of the output
- generated_ids = model.generate(**inputs, max_new_tokens=128)
- generated_ids_trimmed = [
-     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_text = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_text)
  ```
- ## Note on FP8
 
- For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3-VL, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
 
- You can use the Qwen3-VL-235B-A22B-Thinking-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, as with the original bfloat16 model.
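
The `quantization_config` mentioned in the note above can be inspected without loading any weights. Below is a minimal sketch, assuming the standard `transformers` `AutoConfig` API and a recent transformers build that recognizes the Qwen3-VL architecture; the exact keys inside `quantization_config` are whatever ships in the repository's `config.json`:

```python
# Minimal sketch: read the FP8 quantization settings from the checkpoint metadata.
# This only fetches config.json, not the weights; the field names inside
# quantization_config (e.g. the 128 block size) come from that file.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Thinking-FP8")
print(config.quantization_config)
```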
+ Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned!
 
+ We recommend deploying the model using vLLM or SGLang; example inference code is provided below. For details on the runtime environment and deployment, please refer to this [link](https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#deployment).
 
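Both vLLM and SGLang can also expose an OpenAI-compatible server for this checkpoint. The snippet below is a hedged client-side sketch, not part of the official examples: the endpoint `http://localhost:8000/v1`, the placeholder API key, and the served model name are assumptions and must match however you actually launch the server.

```python
# Hypothetical client-side sketch: query an already-running vLLM / SGLang
# OpenAI-compatible server. Endpoint, API key, and model name are assumptions --
# adjust them to match your launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Thinking-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The image URL reused here is the demo image from the original Transformers example above.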
+ ### vLLM Inference
 
+ Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the [community deployment guide](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html).
 
  ```python
+ # -*- coding: utf-8 -*-
+ import torch
+ from qwen_vl_utils import process_vision_info
+ from transformers import AutoProcessor
+ from vllm import LLM, SamplingParams
+
+ import os
+ os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
+
+ def prepare_inputs_for_vllm(messages, processor):
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     # qwen_vl_utils 0.0.14+ required
+     image_inputs, video_inputs, video_kwargs = process_vision_info(
+         messages,
+         image_patch_size=processor.image_processor.patch_size,
+         return_video_kwargs=True,
+         return_video_metadata=True
+     )
+     print(f"video_kwargs: {video_kwargs}")
+
+     mm_data = {}
+     if image_inputs is not None:
+         mm_data['image'] = image_inputs
+     if video_inputs is not None:
+         mm_data['video'] = video_inputs
+
+     return {
+         'prompt': text,
+         'multi_modal_data': mm_data,
+         'mm_processor_kwargs': video_kwargs
      }
 
+
+ if __name__ == '__main__':
+     # messages = [
+     #     {
+     #         "role": "user",
+     #         "content": [
+     #             {
+     #                 "type": "video",
+     #                 "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
+     #             },
+     #             {"type": "text", "text": "How long is this video?"},
+     #         ],
+     #     }
+     # ]
+
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
+                 },
+                 {"type": "text", "text": "Read all the text in the image."},
+             ],
+         }
+     ]
+
+     # TODO: change to your own checkpoint path
+     checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
+     processor = AutoProcessor.from_pretrained(checkpoint_path)
+     inputs = [prepare_inputs_for_vllm(message, processor) for message in [messages]]
+
+     llm = LLM(
+         model=checkpoint_path,
+         trust_remote_code=True,
+         gpu_memory_utilization=0.70,
+         enforce_eager=False,
+         tensor_parallel_size=torch.cuda.device_count(),
+         seed=0
+     )
+
+     sampling_params = SamplingParams(
+         temperature=0,
+         max_tokens=1024,
+         top_k=-1,
+         stop_token_ids=[],
+     )
+
+     for i, input_ in enumerate(inputs):
+         print()
+         print('=' * 40)
+         print(f"Inputs[{i}]: {input_['prompt']=!r}")
+         print('\n' + '>' * 40)
+
+     outputs = llm.generate(inputs, sampling_params=sampling_params)
+     for i, output in enumerate(outputs):
+         generated_text = output.outputs[0].text
+         print()
+         print('=' * 40)
+         print(f"Generated text: {generated_text!r}")
  ```
 
+ ### SGLang Inference
 
+ Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally.
 
+ ```python
+ import time
+ import torch  # needed for torch.cuda.device_count() below
+ from PIL import Image
+ from sglang import Engine
+ from qwen_vl_utils import process_vision_info
+ from transformers import AutoProcessor, AutoConfig
+
+ if __name__ == "__main__":
+     # TODO: change to your own checkpoint path
+     checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
+     processor = AutoProcessor.from_pretrained(checkpoint_path)
+
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
+                 },
+                 {"type": "text", "text": "Read all the text in the image."},
+             ],
+         }
+     ]
+
+     text = processor.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+     image_inputs, _ = process_vision_info(messages, image_patch_size=processor.image_processor.patch_size)
+
+     llm = Engine(
+         model_path=checkpoint_path,
+         enable_multimodal=True,
+         mem_fraction_static=0.8,
+         tp_size=torch.cuda.device_count(),
+         attention_backend="fa3"
+     )
+
+     start = time.time()
+     sampling_params = {"max_new_tokens": 1024}
+     response = llm.generate(prompt=text, image_data=image_inputs, sampling_params=sampling_params)
+     print(f"Response costs: {time.time() - start:.2f}s")
+     print(f"Generated text: {response['text']}")
+ ```
 
  ## Citation