shuai bai committed
Commit a6647ef · verified · 1 Parent(s): 3d0753c

Update README.md

Files changed (1)
  1. README.md +146 -58
README.md CHANGED
@@ -7,6 +7,8 @@ license: apache-2.0
 
  # Qwen3-VL-235B-A22B-Thinking-FP8
 
+ > This repository contains an FP8 quantized version of the [Qwen3-VL-235B-A22B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking) model. We quantized it using Activation-aware Weight Quantization (AWQ), and its performance metrics are nearly identical to those of the original BF16 model. Enjoy!
+
 
  Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.
 
@@ -64,75 +66,161 @@ This is the weight repository for the FP8 version of Qwen3-VL-235B-A22B-Thinking
 
  ## Quickstart
 
- Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.
 
- The code for Qwen3-VL is in the latest Hugging Face Transformers, and we advise you to build from source with the command:
- ```
- pip install git+https://github.com/huggingface/transformers
- # pip install transformers==4.57.0 # currently, v4.57.0 is not released
- ```
 
- ### Using 🤗 Transformers to Chat
 
- Here we show a code snippet demonstrating how to use the chat model with `transformers`:
 
  ```python
- from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
-
- # default: Load the model on the available device(s)
- model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
-     "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8", dtype="auto", device_map="auto"
- )
-
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
- # model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
- #     "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8",
- #     dtype="auto",
- #     attn_implementation="flash_attention_2",
- #     device_map="auto",
- # )
-
- processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Thinking-FP8")
-
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {
-                 "type": "image",
-                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
-             },
-             {"type": "text", "text": "Describe this image."},
-         ],
      }
- ]
-
- # Preparation for inference
- inputs = processor.apply_chat_template(
-     messages,
-     tokenize=True,
-     add_generation_prompt=True,
-     return_dict=True,
-     return_tensors="pt"
- )
-
- # Inference: Generation of the output
- generated_ids = model.generate(**inputs, max_new_tokens=128)
- generated_ids_trimmed = [
-     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_text = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_text)
  ```
- ## Note on FP8
 
- For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3-VL, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
 
- You can use the Qwen3-VL-235B-A22B-Thinking-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, as with the original bfloat16 model.
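
The `quantization_config` mentioned in the note above can be inspected without loading any weights. Below is a minimal sketch, assuming the standard `transformers` `AutoConfig` API and a recent transformers build that recognizes the Qwen3-VL architecture; the exact keys inside `quantization_config` are whatever ships in the repository's `config.json`:

```python
# Minimal sketch: read the FP8 quantization settings from the checkpoint metadata.
# This only fetches config.json, not the weights; the field names inside
# quantization_config (e.g. the 128 block size) come from that file.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Thinking-FP8")
print(config.quantization_config)
```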
+ Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned!
 
+ We recommend deploying the model using vLLM or SGLang; example inference code is provided below. For details on the runtime environment and deployment, please refer to this [link](https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#deployment).
 
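Both vLLM and SGLang can also expose an OpenAI-compatible server for this checkpoint. The snippet below is a hedged client-side sketch, not part of the official examples: the endpoint `http://localhost:8000/v1`, the placeholder API key, and the served model name are assumptions and must match however you actually launch the server.

```python
# Hypothetical client-side sketch: query an already-running vLLM / SGLang
# OpenAI-compatible server. Endpoint, API key, and model name are assumptions --
# adjust them to match your launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Thinking-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The image URL reused here is the demo image from the original Transformers example above.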
+ ### vLLM Inference
 
+ Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the [community deployment guide](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html).
 
  ```python
+ # -*- coding: utf-8 -*-
+ import torch
+ from qwen_vl_utils import process_vision_info
+ from transformers import AutoProcessor
+ from vllm import LLM, SamplingParams
+
+ import os
+ os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
+
+ def prepare_inputs_for_vllm(messages, processor):
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     # qwen_vl_utils 0.0.14+ required
+     image_inputs, video_inputs, video_kwargs = process_vision_info(
+         messages,
+         image_patch_size=processor.image_processor.patch_size,
+         return_video_kwargs=True,
+         return_video_metadata=True
+     )
+     print(f"video_kwargs: {video_kwargs}")
+
+     mm_data = {}
+     if image_inputs is not None:
+         mm_data['image'] = image_inputs
+     if video_inputs is not None:
+         mm_data['video'] = video_inputs
+
+     return {
+         'prompt': text,
+         'multi_modal_data': mm_data,
+         'mm_processor_kwargs': video_kwargs
      }
 
+
+ if __name__ == '__main__':
+     # messages = [
+     #     {
+     #         "role": "user",
+     #         "content": [
+     #             {
+     #                 "type": "video",
+     #                 "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
+     #             },
+     #             {"type": "text", "text": "How long is this video?"},
+     #         ],
+     #     }
+     # ]
+
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
+                 },
+                 {"type": "text", "text": "Read all the text in the image."},
+             ],
+         }
+     ]
+
+     # TODO: change to your own checkpoint path
+     checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
+     processor = AutoProcessor.from_pretrained(checkpoint_path)
+     inputs = [prepare_inputs_for_vllm(message, processor) for message in [messages]]
+
+     llm = LLM(
+         model=checkpoint_path,
+         trust_remote_code=True,
+         gpu_memory_utilization=0.70,
+         enforce_eager=False,
+         tensor_parallel_size=torch.cuda.device_count(),
+         seed=0
+     )
+
+     sampling_params = SamplingParams(
+         temperature=0,
+         max_tokens=1024,
+         top_k=-1,
+         stop_token_ids=[],
+     )
+
+     for i, input_ in enumerate(inputs):
+         print()
+         print('=' * 40)
+         print(f"Inputs[{i}]: {input_['prompt']=!r}")
+         print('\n' + '>' * 40)
+
+     outputs = llm.generate(inputs, sampling_params=sampling_params)
+     for i, output in enumerate(outputs):
+         generated_text = output.outputs[0].text
+         print()
+         print('=' * 40)
+         print(f"Generated text: {generated_text!r}")
  ```
 
+ ### SGLang Inference
 
+ Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally.
 
+ ```python
+ import time
+ import torch  # needed for torch.cuda.device_count() below
+ from PIL import Image
+ from sglang import Engine
+ from qwen_vl_utils import process_vision_info
+ from transformers import AutoProcessor, AutoConfig
+
+ if __name__ == "__main__":
+     # TODO: change to your own checkpoint path
+     checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8"
+     processor = AutoProcessor.from_pretrained(checkpoint_path)
+
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
+                 },
+                 {"type": "text", "text": "Read all the text in the image."},
+             ],
+         }
+     ]
+
+     text = processor.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+     image_inputs, _ = process_vision_info(messages, image_patch_size=processor.image_processor.patch_size)
+
+     llm = Engine(
+         model_path=checkpoint_path,
+         enable_multimodal=True,
+         mem_fraction_static=0.8,
+         tp_size=torch.cuda.device_count(),
+         attention_backend="fa3"
+     )
+
+     start = time.time()
+     sampling_params = {"max_new_tokens": 1024}
+     response = llm.generate(prompt=text, image_data=image_inputs, sampling_params=sampling_params)
+     print(f"Response costs: {time.time() - start:.2f}s")
+     print(f"Generated text: {response['text']}")
+ ```
 
  ## Citation