Xiangtai committed on
Commit fa5f0ed · verified · 1 Parent(s): 5301f61

Upload folder using huggingface_hub

Browse files
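For reference, an upload like this one can be reproduced with the `huggingface_hub` Python client. The sketch below is illustrative only; the local folder path is a placeholder and the target repo id is an assumption.

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` by default
api.upload_folder(
    folder_path="/PATH/TO/LOCAL/FOLDER",    # placeholder: local checkpoint folder
    repo_id="ByteDance/Sa2VA-Qwen3-VL-4B",  # assumption: this repository
    repo_type="model",
)
```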
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,162 @@
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: image-text-to-text
4
+ library_name: transformers
5
+ base_model:
6
+ - OpenGVLab/InternVL3-8B
7
+ base_model_relation: merge
8
+ language:
9
+ - multilingual
10
+ tags:
11
+ - Sa2VA
12
+ - custom_code
13
+ ---
14
+
15
+ # Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
16
+
17
+ [\[📂 GitHub\]](https://github.com/bytedance/Sa2VA)
18
+ [\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
19
+ [\[🚀 Quick Start\]](#quick-start)
20
+
21
+
22
+
23
+ ## Introduction
24
+
25
+ Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video level. It achieves performance comparable to SOTA MLLMs such as Qwen2.5-VL and InternVL3 on question-answering benchmarks, while also offering the visual prompt understanding and dense object segmentation capabilities that those models lack, and it reaches SOTA performance on both image and video grounding and segmentation benchmarks.
26
+
27
+ ## Sa2VA Family
28
+
29
+ We built the Sa2VA series on Qwen2.5-VL/Qwen3-VL and InternVL2.5/InternVL3. The table below lists the released Sa2VA models built on Qwen2.5/3-VL and InternVL3.
30
+
31
+ | Model Name | Base MLLM | Language Part | HF Link |
32
+ |:----------:|:------------------------------------------------------------------:|:---------------------------------------------------------------------------:|:-----------------------------------------------------:|
33
+ | Sa2VA-InternVL3-2B | [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) | [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-InternVL3-2B) |
34
+ | Sa2VA-InternVL3-8B | [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-InternVL3-8B) |
35
+ | Sa2VA-InternVL3-14B | [InternVL3-14B](https://huggingface.co/OpenGVLab/InternVL3-14B) | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-InternVL3-14B) |
36
+ | Sa2VA-Qwen2_5-VL-3B | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) | [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-Qwen2_5-VL-3B) |
37
+ | Sa2VA-Qwen2_5-VL-7B | [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-Qwen2_5-VL-7B) |
38
+ | Sa2VA-Qwen3-VL-4B | [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-Qwen3-VL-4B) |
39
+
40
+ ## Sa2VA Performance
41
+ | Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeViS (val_u) | DAVIS |
42
+ |:----------:|:--------:|:----:|:-------:|:--------:|:--------:|:-------------:|:-----:|
43
+ | Sa2VA-InternVL3-2B | 1631/559 | 79.8 | 81.4 | 75.7 | 80.3 | 53.9 | 74.5 |
44
+ | Sa2VA-InternVL3-8B | 1743/633 | 83.0 | 83.3 | 78.9 | 81.8 | 56.4 | 76.3 |
45
+ | Sa2VA-InternVL3-14B | 1746/724 | 84.3 | 83.6 | 79.9 | 83.6 | 59.2 | 76.6 |
46
+ | Sa2VA-Qwen2_5-VL-3B | 1533/572 | 78.4 | 79.6 | 74.0 | 77.1 | 51.6 | 73.4 |
47
+ | Sa2VA-Qwen2_5-VL-7B | 1552/676 | 84.5 | 82.4 | 77.5 | 81.5 | 56.4 | 79.4 |
48
+ | Sa2VA-Qwen3-VL-4B | 1660/655 | 86.3 | 81.7 | 77.4 | 80.0 | 57.1 | 75.9 |
49
+
50
+ ## Quick Start
51
+
52
+ We provide example code for running `Sa2VA` with `transformers`.
53
+
54
+ ```python
55
+ import torch
56
+ from transformers import AutoProcessor, AutoModel
57
+ from PIL import Image
58
+ import numpy as np
59
+ import os
60
+
61
+ # load the model and processor
62
+ path = "ByteDance/Sa2VA-Qwen3-VL-4B"
63
+ model = AutoModel.from_pretrained(
64
+ path,
65
+ torch_dtype=torch.bfloat16,
66
+ low_cpu_mem_usage=True,
67
+ use_flash_attn=True,
68
+ trust_remote_code=True).eval().cuda()
69
+ processor = AutoProcessor.from_pretrained(path, trust_remote_code=True, use_fast=False)
70
+
71
+ # for image chat
72
+ image_path = "/PATH/TO/IMAGE"
73
+ text_prompts = "<image>Please describe the image."
74
+ image = Image.open(image_path).convert('RGB')
75
+ input_dict = {
76
+ 'image': image,
77
+ 'text': text_prompts,
78
+ 'past_text': '',
79
+ 'mask_prompts': None,
80
+ 'processor': processor,
81
+ }
82
+ return_dict = model.predict_forward(**input_dict)
83
+ answer = return_dict["prediction"] # the text format answer
84
+
85
+ # for image chat with segmentation output
86
+ image_path = "/PATH/TO/IMAGE"
87
+ text_prompts = "<image>Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
88
+ image = Image.open(image_path).convert('RGB')
89
+ input_dict = {
90
+ 'image': image,
91
+ 'text': text_prompts,
92
+ 'past_text': '',
93
+ 'mask_prompts': None,
94
+ 'processor': processor,
95
+ }
96
+ return_dict = model.predict_forward(**input_dict)
97
+ answer = return_dict["prediction"] # the text format answer
98
+ masks = return_dict['prediction_masks'] # segmentation masks, list(np.array(1, h, w), ...)
99
+
100
+ # for chat with visual prompt (mask format) input
101
+ mask_prompts = np.load('/PATH/TO/pred_masks.npy') # np.array(n_prompts, h, w)
102
+ image_path = "/PATH/TO/IMAGE"
103
+ text_prompts = "<image>Can you provide me with a detailed description of the region in the picture marked by region1."
104
+ image = Image.open(image_path).convert('RGB')
105
+ input_dict = {
106
+ 'image': image,
107
+ 'text': text_prompts,
108
+ 'past_text': '',
109
+ 'mask_prompts': mask_prompts,
110
+ 'processor': processor,
111
+ }
112
+ return_dict = model.predict_forward(**input_dict)
113
+ answer = return_dict["prediction"] # the text format answer
114
+
115
+ # for video chat
116
+ video_folder = "/PATH/TO/VIDEO_FOLDER"
117
+ images_paths = os.listdir(video_folder)
118
+ images_paths = [os.path.join(video_folder, image_name) for image_name in sorted(images_paths)]  # sort to keep frame order
119
+ if len(images_paths) > 5: # uniformly sample 5 frames
120
+     step = (len(images_paths) - 1) // (5 - 1)
121
+     images_paths = [images_paths[0]] + images_paths[1:-1][::step][1:] + [images_paths[-1]]
122
+ text_prompts = "<image>Please describe the video."
123
+ input_dict = {
124
+ 'video': images_paths,
125
+ 'text': text_prompts,
126
+ 'past_text': '',
127
+ 'mask_prompts': None,
128
+ 'processor': processor,
129
+ }
130
+ return_dict = model.predict_forward(**input_dict)
131
+ answer = return_dict["prediction"] # the text format answer
132
+
133
+
134
+ # for video chat with segmentation mask output
135
+ video_folder = "/PATH/TO/VIDEO_FOLDER"
136
+ images_paths = os.listdir(video_folder)
137
+ images_paths = [os.path.join(video_folder, image_name) for image_name in sorted(images_paths)]  # sort to keep frame order
138
+ text_prompts = "<image>Please segment the person."
139
+ input_dict = {
140
+ 'video': images_paths,
141
+ 'text': text_prompts,
142
+ 'past_text': '',
143
+ 'mask_prompts': None,
144
+ 'processor': processor,
145
+ }
146
+ return_dict = model.predict_forward(**input_dict)
147
+ answer = return_dict["prediction"] # the text format answer
148
+ masks = return_dict['prediction_masks'] # segmentation masks, list(np.array(n_frames, h, w), ...)
149
+ ```
150
+
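+ The following snippet is a minimal sketch (not part of the official example) of one way to overlay the returned masks on the input image. It assumes the image example with segmentation output above has already produced `image` and `masks`, where each entry of `masks` is a boolean array of shape `(1, h, w)` at the original image resolution.
+
+ ```python
+ import numpy as np
+ from PIL import Image
+
+ overlay = np.array(image).copy()  # H x W x 3 uint8 copy of the original image
+ for mask in masks:
+     m = mask[0].astype(bool)  # (h, w) boolean mask for one object
+     color = np.random.randint(0, 256, size=3)  # illustrative random color per object
+     overlay[m] = (0.5 * overlay[m] + 0.5 * color).astype(np.uint8)  # alpha-blend the masked region
+ Image.fromarray(overlay).save("masks_overlay.png")
+ ```
+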
151
+ ## Citation
152
+
153
+ If you find this project useful in your research, please consider citing:
154
+
155
+ ```BibTeX
156
+ @article{sa2va,
157
+ title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
158
+ author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
159
+ journal={arXiv preprint arXiv:2501.04001},
160
+ year={2025}
161
+ }
162
+ ```
added_tokens.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "</p>": 151671,
3
+ "</think>": 151668,
4
+ "</tool_call>": 151658,
5
+ "</tool_response>": 151666,
6
+ "</vp>": 151673,
7
+ "<p>": 151670,
8
+ "<think>": 151667,
9
+ "<tool_call>": 151657,
10
+ "<tool_response>": 151665,
11
+ "<vp>": 151672,
12
+ "<|box_end|>": 151649,
13
+ "<|box_start|>": 151648,
14
+ "<|endoftext|>": 151643,
15
+ "<|file_sep|>": 151664,
16
+ "<|fim_middle|>": 151660,
17
+ "<|fim_pad|>": 151662,
18
+ "<|fim_prefix|>": 151659,
19
+ "<|fim_suffix|>": 151661,
20
+ "<|im_end|>": 151645,
21
+ "<|im_start|>": 151644,
22
+ "<|image_pad|>": 151655,
23
+ "<|object_ref_end|>": 151647,
24
+ "<|object_ref_start|>": 151646,
25
+ "<|quad_end|>": 151651,
26
+ "<|quad_start|>": 151650,
27
+ "<|repo_name|>": 151663,
28
+ "<|video_pad|>": 151656,
29
+ "<|vision_end|>": 151653,
30
+ "<|vision_pad|>": 151654,
31
+ "<|vision_start|>": 151652,
32
+ "[SEG]": 151669
33
+ }
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {%- if messages[0].content is string %}
5
+ {{- messages[0].content }}
6
+ {%- else %}
7
+ {%- for content in messages[0].content %}
8
+ {%- if 'text' in content %}
9
+ {{- content.text }}
10
+ {%- endif %}
11
+ {%- endfor %}
12
+ {%- endif %}
13
+ {{- '\n\n' }}
14
+ {%- endif %}
15
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
16
+ {%- for tool in tools %}
17
+ {{- "\n" }}
18
+ {{- tool | tojson }}
19
+ {%- endfor %}
20
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
21
+ {%- else %}
22
+ {%- if messages[0].role == 'system' %}
23
+ {{- '<|im_start|>system\n' }}
24
+ {%- if messages[0].content is string %}
25
+ {{- messages[0].content }}
26
+ {%- else %}
27
+ {%- for content in messages[0].content %}
28
+ {%- if 'text' in content %}
29
+ {{- content.text }}
30
+ {%- endif %}
31
+ {%- endfor %}
32
+ {%- endif %}
33
+ {{- '<|im_end|>\n' }}
34
+ {%- endif %}
35
+ {%- endif %}
36
+ {%- set image_count = namespace(value=0) %}
37
+ {%- set video_count = namespace(value=0) %}
38
+ {%- for message in messages %}
39
+ {%- if message.role == "user" %}
40
+ {{- '<|im_start|>' + message.role + '\n' }}
41
+ {%- if message.content is string %}
42
+ {{- message.content }}
43
+ {%- else %}
44
+ {%- for content in message.content %}
45
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
46
+ {%- set image_count.value = image_count.value + 1 %}
47
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
48
+ <|vision_start|><|image_pad|><|vision_end|>
49
+ {%- elif content.type == 'video' or 'video' in content %}
50
+ {%- set video_count.value = video_count.value + 1 %}
51
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
52
+ <|vision_start|><|video_pad|><|vision_end|>
53
+ {%- elif 'text' in content %}
54
+ {{- content.text }}
55
+ {%- endif %}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {{- '<|im_end|>\n' }}
59
+ {%- elif message.role == "assistant" %}
60
+ {{- '<|im_start|>' + message.role + '\n' }}
61
+ {%- if message.content is string %}
62
+ {{- message.content }}
63
+ {%- else %}
64
+ {%- for content_item in message.content %}
65
+ {%- if 'text' in content_item %}
66
+ {{- content_item.text }}
67
+ {%- endif %}
68
+ {%- endfor %}
69
+ {%- endif %}
70
+ {%- if message.tool_calls %}
71
+ {%- for tool_call in message.tool_calls %}
72
+ {%- if (loop.first and message.content) or (not loop.first) %}
73
+ {{- '\n' }}
74
+ {%- endif %}
75
+ {%- if tool_call.function %}
76
+ {%- set tool_call = tool_call.function %}
77
+ {%- endif %}
78
+ {{- '<tool_call>\n{"name": "' }}
79
+ {{- tool_call.name }}
80
+ {{- '", "arguments": ' }}
81
+ {%- if tool_call.arguments is string %}
82
+ {{- tool_call.arguments }}
83
+ {%- else %}
84
+ {{- tool_call.arguments | tojson }}
85
+ {%- endif %}
86
+ {{- '}\n</tool_call>' }}
87
+ {%- endfor %}
88
+ {%- endif %}
89
+ {{- '<|im_end|>\n' }}
90
+ {%- elif message.role == "tool" %}
91
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
92
+ {{- '<|im_start|>user' }}
93
+ {%- endif %}
94
+ {{- '\n<tool_response>\n' }}
95
+ {%- if message.content is string %}
96
+ {{- message.content }}
97
+ {%- else %}
98
+ {%- for content in message.content %}
99
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
100
+ {%- set image_count.value = image_count.value + 1 %}
101
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
102
+ <|vision_start|><|image_pad|><|vision_end|>
103
+ {%- elif content.type == 'video' or 'video' in content %}
104
+ {%- set video_count.value = video_count.value + 1 %}
105
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
106
+ <|vision_start|><|video_pad|><|vision_end|>
107
+ {%- elif 'text' in content %}
108
+ {{- content.text }}
109
+ {%- endif %}
110
+ {%- endfor %}
111
+ {%- endif %}
112
+ {{- '\n</tool_response>' }}
113
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
114
+ {{- '<|im_end|>\n' }}
115
+ {%- endif %}
116
+ {%- endif %}
117
+ {%- endfor %}
118
+ {%- if add_generation_prompt %}
119
+ {{- '<|im_start|>assistant\n' }}
120
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,69 @@
1
+ {
2
+ "architectures": [
3
+ "Sa2VAChatModelQwen"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_sa2va_chat.Sa2VAChatConfigQwen",
7
+ "AutoModel": "modeling_sa2va_qwen.Sa2VAChatModelQwen",
8
+ "AutoModelForCausalLM": "modeling_sa2va_qwen.Sa2VAChatModelQwen"
9
+ },
10
+ "dtype": "float32",
11
+ "image_token_id": 151655,
12
+ "model_type": "sa2va_chat",
13
+ "template": "qwen_chat",
14
+ "text_config": {
15
+ "attention_bias": false,
16
+ "attention_dropout": 0.0,
17
+ "bos_token_id": 151643,
18
+ "dtype": "bfloat16",
19
+ "eos_token_id": 151645,
20
+ "head_dim": 128,
21
+ "hidden_act": "silu",
22
+ "hidden_size": 2560,
23
+ "initializer_range": 0.02,
24
+ "intermediate_size": 9728,
25
+ "max_position_embeddings": 262144,
26
+ "model_type": "qwen3_vl_text",
27
+ "num_attention_heads": 32,
28
+ "num_hidden_layers": 36,
29
+ "num_key_value_heads": 8,
30
+ "rms_norm_eps": 1e-06,
31
+ "rope_scaling": {
32
+ "mrope_interleaved": true,
33
+ "mrope_section": [
34
+ 24,
35
+ 20,
36
+ 20
37
+ ],
38
+ "rope_type": "default"
39
+ },
40
+ "rope_theta": 5000000,
41
+ "use_cache": true,
42
+ "vocab_size": 151674
43
+ },
44
+ "tie_word_embeddings": false,
45
+ "transformers_version": "4.57.0",
46
+ "video_token_id": 151656,
47
+ "vision_config": {
48
+ "deepstack_visual_indexes": [
49
+ 5,
50
+ 11,
51
+ 17
52
+ ],
53
+ "depth": 24,
54
+ "hidden_act": "gelu_pytorch_tanh",
55
+ "hidden_size": 1024,
56
+ "in_channels": 3,
57
+ "initializer_range": 0.02,
58
+ "intermediate_size": 4096,
59
+ "model_type": "qwen3_vl",
60
+ "num_heads": 16,
61
+ "num_position_embeddings": 2304,
62
+ "out_hidden_size": 2560,
63
+ "patch_size": 16,
64
+ "spatial_merge_size": 2,
65
+ "temporal_patch_size": 2
66
+ },
67
+ "vision_end_token_id": 151653,
68
+ "vision_start_token_id": 151652
69
+ }
configuration_sa2va_chat.py ADDED
@@ -0,0 +1,34 @@
1
+ import copy
2
+
3
+ import transformers
4
+ from transformers import Qwen2Config
5
+ from transformers.configuration_utils import PretrainedConfig
6
+ from transformers.utils import logging
7
+
8
+ from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig
9
+
10
+ logger = logging.get_logger(__name__)
11
+
12
+ class Sa2VAChatConfigQwen(Qwen3VLConfig):
13
+ model_type = 'sa2va_chat'
14
+
15
+ def __init__(
16
+ self,
17
+ template=None,
18
+ **kwargs
19
+ ):
20
+ super().__init__(**kwargs)
21
+ self.template = template
22
+
23
+ def to_dict(self):
24
+ """
25
+ Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
26
+
27
+ Returns:
28
+ `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
29
+ """
30
+
31
+ output = super().to_dict()
32
+ output["template"] = self.template
33
+
34
+ return output
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ed2f72ade8566319cf154c12f4765166fa4bc2524614aa446aef2a123dd28f09
3
+ size 4934328696
model-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:11aa35d66d696c6e4ec0b4246a9edb0f81049bd59e94cf4a376dd8713eddec71
3
+ size 4944311840
model-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:991fc96652cddd8992e32aa1cfe61b4c153cca335b72c2c26341ff185735649c
3
+ size 4944311896
model-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:600c3ea1ea7fe5421a30ef8b67bba119eeb10168818bfe57a9852a8b9572543b
3
+ size 4998026056
model-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:35642f38b2551f8a32c3ee8bf9b579ebc71c9e7366601d4250e2407a20157ff4
3
+ size 407536000
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_sa2va_qwen.py ADDED
@@ -0,0 +1,260 @@
1
+ import torch
2
+ from torch import nn
3
+ from transformers import (AutoModel, GenerationConfig, Qwen3VLForConditionalGeneration,
4
+ Qwen2ForCausalLM)
5
+ from transformers.modeling_utils import PreTrainedModel
6
+
7
+ from .configuration_sa2va_chat import Sa2VAChatConfigQwen
8
+
9
+ from .sam2 import SAM2
10
+
11
+ import numpy as np
12
+ from torchvision.transforms.functional import to_pil_image
13
+
14
+ import torch.nn.functional as F
15
+
16
+ from qwen_vl_utils import process_vision_info
17
+
18
+
19
+
20
+ class DirectResize:
21
+ def __init__(self, target_length: int) -> None:
22
+ self.target_length = target_length
23
+
24
+ def apply_image(self, image: np.ndarray) -> np.ndarray:
25
+ """
26
+ Expects a numpy array with shape HxWxC in uint8 format.
27
+ """
28
+ img = to_pil_image(image, mode='RGB')
29
+ return np.array(img.resize((self.target_length, self.target_length)))
30
+
31
+ class Sa2VAChatModelQwen(PreTrainedModel):
32
+ config_class = Sa2VAChatConfigQwen
33
+ main_input_name = 'pixel_values'
34
+ base_model_prefix = 'language_model'
35
+ _no_split_modules = ['Qwen3VisionTransformerPretrainedModel', 'Qwen3VLDecoderLayer', 'SAM2']
36
+ _supports_flash_attn_2 = True
37
+ supports_gradient_checkpointing = True
38
+
39
+
40
+
41
+ def __init__(self, config: Sa2VAChatConfigQwen, model=None, use_flash_attn=True):
42
+ super().__init__(config)
43
+ self.extra_image_processor = DirectResize(target_length=1024, )
44
+
45
+ self.min_pixels = 512 * 28 * 28
46
+ self.max_pixels = 2048 * 28 * 28
47
+
48
+ self.torch_dtype = torch.bfloat16
49
+
50
+ if model is not None:
51
+ self.model=model
52
+ else:
53
+ self.model = Qwen3VLForConditionalGeneration(config)
54
+
55
+ llm_hidden_size = config.text_config.hidden_size
56
+
57
+ self.grounding_encoder = SAM2()
58
+ out_dim = self.grounding_encoder.hidden_dim
59
+ in_dim = llm_hidden_size
60
+ self.text_hidden_fcs = nn.Sequential(
61
+ nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
62
+ nn.Linear(in_dim, out_dim), nn.Dropout(0.0)
63
+ )
64
+
65
+ @property
66
+ def lm_head(self):
67
+ return self.model.lm_head
68
+
69
+ def get_input_embeddings(self):
70
+ return self.model.get_input_embeddings()
71
+
72
+ def get_output_embeddings(self):
73
+ return self.model.get_output_embeddings()
74
+
75
+ def predict_forward(
76
+ self,
77
+ image=None,
78
+ video=None,
79
+ text=None,
80
+ past_text='',
81
+ mask_prompts=None,
82
+ tokenizer=None,
83
+ processor=None,
84
+ ):
85
+ assert processor is not None
86
+ self.processor = processor
87
+
88
+ self.seg_token_idx = self.processor.tokenizer.convert_tokens_to_ids('[SEG]')
89
+
90
+ text = text.replace('<image>', "")
91
+
92
+ if image is None and video is None and '<image>' not in past_text:
93
+
94
+ messages = [
95
+ {
96
+ "role": "user",
97
+ "content": [
98
+ {"type": "text", "text": past_text + text},
99
+ ],
100
+ }
101
+ ]
102
+
103
+ # Preparation for inference
104
+ processsed_text = self.processor.apply_chat_template(
105
+ messages, tokenize=False, add_generation_prompt=True
106
+ )
107
+
108
+ mm_inputs = self.processor(
109
+ text=[processsed_text],
110
+ images=None,
111
+ videos=None,
112
+ padding=True,
113
+ return_tensors="pt",
114
+ )
115
+ mm_inputs = mm_inputs.to(self.device)
116
+
117
+ ret_masks = []
118
+ else:
119
+ input_dict = {}
120
+ if video is not None:
121
+ pixel_values = []
122
+ extra_pixel_values = []
123
+ images = []
124
+ content = []
125
+ ori_image_size = video[0].size
126
+ for frame_idx, frame_image in enumerate(video):
127
+ # assert ori_image_size == frame_image.size
128
+ g_image = np.array(frame_image) # for grounding
129
+ g_image = self.extra_image_processor.apply_image(g_image)
130
+ g_image = torch.from_numpy(g_image).permute(2, 0, 1).contiguous()
131
+ extra_pixel_values.append(g_image)
132
+ if frame_idx < 5:
133
+ content.append({"type": "image", "image": frame_image},)
134
+
135
+
136
+ content.append({"type": "text", "text": text})
137
+ messages = [
138
+ {
139
+ "role": "user",
140
+ "content": content,
141
+ }
142
+ ]
143
+
144
+ # Preparation for inference
145
+ processsed_text = self.processor.apply_chat_template(
146
+ messages, tokenize=False, add_generation_prompt=True
147
+ )
148
+
149
+ image_inputs, video_inputs = process_vision_info(messages)
150
+ mm_inputs = self.processor(
151
+ text=[processsed_text],
152
+ images=image_inputs,
153
+ videos=video_inputs,
154
+ padding=True,
155
+ return_tensors="pt",
156
+ min_pixels=self.min_pixels,
157
+ max_pixels=self.max_pixels
158
+ )
159
+ mm_inputs = mm_inputs.to(self.device)
160
+
161
+ g_pixel_values = torch.stack([
162
+ self.grounding_encoder.preprocess_image(pixel) for pixel in extra_pixel_values
163
+ ]).to(self.torch_dtype)
164
+
165
+ num_frames = min(5, len(video))
166
+
167
+ else:
168
+ ori_image_size = image.size
169
+
170
+ # prepare grounding images
171
+ g_image = np.array(image) # for grounding
172
+ g_image = self.extra_image_processor.apply_image(g_image)
173
+ g_pixel_values = torch.from_numpy(g_image).permute(2, 0, 1).contiguous().to(self.torch_dtype)
174
+ extra_pixel_values = [g_pixel_values]
175
+ g_pixel_values = torch.stack([
176
+ self.grounding_encoder.preprocess_image(pixel) for pixel in extra_pixel_values
177
+ ]).to(self.torch_dtype)
178
+
179
+ messages = [
180
+ {
181
+ "role": "user",
182
+ "content": [
183
+ {
184
+ "type": "image",
185
+ "image": image,
186
+ },
187
+ {"type": "text", "text": text},
188
+ ],
189
+ }
190
+ ]
191
+
192
+ # Preparation for inference
193
+ processsed_text = self.processor.apply_chat_template(
194
+ messages, tokenize=False, add_generation_prompt=True
195
+ )
196
+
197
+ image_inputs, video_inputs = process_vision_info(messages)
198
+ mm_inputs = self.processor(
199
+ text=[processsed_text],
200
+ images=image_inputs,
201
+ videos=video_inputs,
202
+ padding=True,
203
+ return_tensors="pt",
204
+ min_pixels=self.min_pixels,
205
+ max_pixels=self.max_pixels
206
+ )
207
+ mm_inputs = mm_inputs.to(self.device)
208
+
209
+ num_frames = 1
210
+
211
+ input_dict['g_pixel_values'] = g_pixel_values
212
+ ret_masks = []
213
+
214
+ generate_output = self.model.generate(
215
+ **mm_inputs,
216
+ max_new_tokens=2048,
217
+ do_sample=False,
218
+ output_hidden_states=True,
219
+ return_dict_in_generate=True
220
+ )
221
+
222
+ generate_output_trimmed = [
223
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(mm_inputs.input_ids, generate_output.sequences)
224
+ ]
225
+
226
+ predict = self.processor.batch_decode(generate_output_trimmed, skip_special_tokens=False)[0].strip()
227
+
228
+ if image is None and video is None and '<image>' not in past_text:
229
+ return {'prediction': predict, 'prediction_masks': ret_masks, }
230
+
231
+ # if have seg result, find the seg hidden states
232
+ hidden_states = generate_output.hidden_states
233
+ last_hidden_states = [item[-1][0] for item in hidden_states]
234
+ last_hidden_states = torch.cat(last_hidden_states, dim=0)
235
+ seg_hidden_states = get_seg_hidden_states(
236
+ last_hidden_states, generate_output.sequences[0][:-1],
237
+ seg_id=self.seg_token_idx
238
+ )
239
+ all_seg_hidden_states = self.text_hidden_fcs(seg_hidden_states)
240
+
241
+ for seg_hidden_states in all_seg_hidden_states:
242
+ seg_hidden_states = seg_hidden_states.unsqueeze(0)
243
+ g_pixel_values = input_dict['g_pixel_values']
244
+ sam_states = self.grounding_encoder.get_sam2_embeddings(g_pixel_values)
245
+ pred_masks = self.grounding_encoder.language_embd_inference(sam_states, [seg_hidden_states] * num_frames)
246
+ w, h = ori_image_size
247
+ masks = F.interpolate(pred_masks, size=(h, w), mode='bilinear', align_corners=False)
248
+ masks = masks[:, 0]
249
+ masks = masks.sigmoid() > 0.5
250
+ masks = masks.cpu().numpy()
251
+ ret_masks.append(masks)
252
+
253
+ return {'prediction': predict, 'prediction_masks': ret_masks,}
254
+
255
+ def get_seg_hidden_states(hidden_states, output_ids, seg_id):
256
+ seg_mask = output_ids == seg_id
257
+ n_out = len(seg_mask)
258
+ if n_out == 0:
259
+ return hidden_states[0:0]
260
+ return hidden_states[-n_out:][seg_mask]
preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": true,
5
+ "device": null,
6
+ "disable_grouping": null,
7
+ "do_center_crop": null,
8
+ "do_convert_rgb": true,
9
+ "do_normalize": true,
10
+ "do_pad": null,
11
+ "do_rescale": true,
12
+ "do_resize": true,
13
+ "image_mean": [
14
+ 0.5,
15
+ 0.5,
16
+ 0.5
17
+ ],
18
+ "image_processor_type": "Qwen2VLImageProcessorFast",
19
+ "image_std": [
20
+ 0.5,
21
+ 0.5,
22
+ 0.5
23
+ ],
24
+ "input_data_format": null,
25
+ "max_pixels": null,
26
+ "merge_size": 2,
27
+ "min_pixels": null,
28
+ "pad_size": null,
29
+ "patch_size": 16,
30
+ "processor_class": "Qwen3VLProcessor",
31
+ "resample": 3,
32
+ "rescale_factor": 0.00392156862745098,
33
+ "return_tensors": null,
34
+ "size": {
35
+ "longest_edge": 16777216,
36
+ "shortest_edge": 65536
37
+ },
38
+ "temporal_patch_size": 2
39
+ }
sam2.py ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
templates.py ADDED
@@ -0,0 +1,170 @@
1
+
2
+ PROMPT_TEMPLATE = dict(
3
+ default=dict(
4
+ SYSTEM='<|System|>:{system}\n',
5
+ INSTRUCTION='<|User|>:{input}\n<|Bot|>:',
6
+ SEP='\n'),
7
+ zephyr=dict(
8
+ SYSTEM='<|system|>\n{system}\n',
9
+ INSTRUCTION='<|user|>\n{input}\n<|assistant|>\n',
10
+ SEP='\n'),
11
+ internlm_chat=dict(
12
+ SYSTEM='<|System|>:{system}\n',
13
+ INSTRUCTION='<|User|>:{input}<eoh>\n<|Bot|>:',
14
+ SUFFIX='<eoa>',
15
+ SUFFIX_AS_EOS=True,
16
+ SEP='\n',
17
+ STOP_WORDS=['<eoa>']),
18
+ internlm2_chat=dict(
19
+ SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
20
+ INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
21
+ '<|im_start|>assistant\n'),
22
+ SUFFIX='<|im_end|>',
23
+ SUFFIX_AS_EOS=True,
24
+ SEP='\n',
25
+ STOP_WORDS=['<|im_end|>']),
26
+ moss_sft=dict(
27
+ SYSTEM='{system}\n',
28
+ INSTRUCTION='<|Human|>: {input}<eoh>\n',
29
+ SEP='\n',
30
+ STOP_WORDS=['<eoc>', '<eom>']),
31
+ llama2_chat=dict(
32
+ SYSTEM=(
33
+ '[INST] <<SYS>>\n You are a helpful, respectful and honest '
34
+ 'assistant. Always answer as helpfully as possible, while being '
35
+ 'safe. Your answers should not include any harmful, unethical, '
36
+ 'racist, sexist, toxic, dangerous, or illegal content. Please '
37
+ 'ensure that your responses are socially unbiased and positive in '
38
+ 'nature.\n{system}\n<</SYS>>\n [/INST] '),
39
+ INSTRUCTION='[INST] {input} [/INST]',
40
+ SEP='\n'),
41
+ code_llama_chat=dict(
42
+ SYSTEM='{system}\n', INSTRUCTION='[INST] {input} [/INST]'),
43
+ chatglm2=dict(
44
+ SYSTEM='{system}\n',
45
+ INSTRUCTION='[Round {round}]\n\n问:{input}\n\n答:',
46
+ SEP='\n\n'),
47
+ chatglm3=dict(
48
+ SYSTEM='<|system|>\n{system}',
49
+ INSTRUCTION='<|user|>\n{input}<|assistant|>\n',
50
+ SEP='\n'),
51
+ qwen_chat=dict(
52
+ SYSTEM=('<|im_start|>system\n{system}<|im_end|>\n'),
53
+ INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
54
+ '<|im_start|>assistant\n'),
55
+ SUFFIX='<|im_end|>',
56
+ SUFFIX_AS_EOS=True,
57
+ SEP='\n',
58
+ STOP_WORDS=['<|im_end|>', '<|endoftext|>']),
59
+ baichuan_chat=dict(
60
+ SYSTEM='{system}\n',
61
+ INSTRUCTION='<reserved_102>{input}<reserved_103>',
62
+ SEP='\n'),
63
+ baichuan2_chat=dict(
64
+ SYSTEM='{system}\n',
65
+ INSTRUCTION='<reserved_106>{input}<reserved_107>',
66
+ SEP='\n'),
67
+ wizardlm=dict(
68
+ SYSTEM=('A chat between a curious user and an artificial '
69
+ 'intelligence assistant. The assistant gives '
70
+ 'helpful, detailed, and polite answers to the '
71
+ 'user\'s questions. {system}\n '),
72
+ INSTRUCTION=('USER: {input} ASSISTANT:'),
73
+ SEP='\n'),
74
+ wizardcoder=dict(
75
+ SYSTEM=(
76
+ 'Below is an instruction that describes a task. '
77
+ 'Write a response that appropriately completes the request.\n\n'
78
+ '{system}\n '),
79
+ INSTRUCTION=('### Instruction:\n{input}\n\n### Response:'),
80
+ SEP='\n\n'),
81
+ vicuna=dict(
82
+ SYSTEM=('A chat between a curious user and an artificial '
83
+ 'intelligence assistant. The assistant gives '
84
+ 'helpful, detailed, and polite answers to the '
85
+ 'user\'s questions. {system}\n '),
86
+ INSTRUCTION=('USER: {input} ASSISTANT:'),
87
+ SEP='\n'),
88
+ deepseek_coder=dict(
89
+ SYSTEM=('You are an AI programming assistant, utilizing '
90
+ 'the DeepSeek Coder model, developed by DeepSeek'
91
+ 'Company, and you only answer questions related '
92
+ 'to computer science. For politically sensitive '
93
+ 'questions, security and privacy issues, and '
94
+ 'other non-computer science questions, you will '
95
+ 'refuse to answer. {system}\n'),
96
+ INSTRUCTION=('### Instruction:\n{input}\n### Response:\n'),
97
+ SEP='\n'),
98
+ # TODO: deprecation, v0.2.0
99
+ deepseekcoder=dict(
100
+ SYSTEM=('You are an AI programming assistant, utilizing '
101
+ 'the DeepSeek Coder model, developed by DeepSeek'
102
+ 'Company, and you only answer questions related '
103
+ 'to computer science. For politically sensitive '
104
+ 'questions, security and privacy issues, and '
105
+ 'other non-computer science questions, you will '
106
+ 'refuse to answer. {system}\n'),
107
+ INSTRUCTION=('### Instruction:\n{input}\n### Response:\n'),
108
+ SEP='\n'),
109
+ deepseek_moe=dict(
110
+ SYSTEM=('[INST] {system} [/INST]\n'),
111
+ INSTRUCTION=('[INST] {input} [/INST]'),
112
+ SEP='\n'),
113
+ deepseek_v2=dict(
114
+ SYSTEM='{system}\n\n',
115
+ INSTRUCTION='User: {input}\n\nAssistant: ',
116
+ SUFFIX='<|end▁of▁sentence|>',
117
+ SUFFIX_AS_EOS=True,
118
+ STOP_WORDS=['<|end▁of▁sentence|>']),
119
+ mistral=dict(
120
+ SYSTEM=('[INST] {system} [/INST]\n'),
121
+ INSTRUCTION=('[INST] {input} [/INST]'),
122
+ SEP='\n'),
123
+ mixtral=dict(
124
+ SYSTEM=('[INST] {system} [/INST]\n'),
125
+ INSTRUCTION=('[INST] {input} [/INST]'),
126
+ SEP='\n'),
127
+ minicpm=dict(INSTRUCTION=('<用户> {input} <AI>'), SEP='\n'),
128
+ minicpm3=dict(
129
+ SYSTEM=('<|im_start|>system\n{system}<|im_end|>\n'),
130
+ INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
131
+ '<|im_start|>assistant\n'),
132
+ SUFFIX='<|im_end|>',
133
+ SUFFIX_AS_EOS=True,
134
+ SEP='\n',
135
+ STOP_WORDS=['<|im_end|>', '<|endoftext|>']),
136
+ gemma=dict(
137
+ # `system` field is extended by xtuner
138
+ SYSTEM=('<start_of_turn>system\n{system}<end_of_turn>\n'),
139
+ INSTRUCTION=('<start_of_turn>user\n{input}<end_of_turn>\n'
140
+ '<start_of_turn>model\n'),
141
+ SUFFIX='<end_of_turn>',
142
+ SUFFIX_AS_EOS=False,
143
+ SEP='\n',
144
+ STOP_WORDS=['<end_of_turn>']),
145
+ cohere_chat=dict(
146
+ SYSTEM=('<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}'
147
+ '<|END_OF_TURN_TOKEN|>'),
148
+ INSTRUCTION=(
149
+ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{input}<|END_OF_TURN_TOKEN|>'
150
+ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'),
151
+ SUFFIX='<|END_OF_TURN_TOKEN|>',
152
+ SUFFIX_AS_EOS=True,
153
+ STOP_WORDS=['<|END_OF_TURN_TOKEN|>']),
154
+ llama3_chat=dict(
155
+ SYSTEM=('<|start_header_id|>system<|end_header_id|>\n\n'
156
+ '{system}<|eot_id|>'),
157
+ INSTRUCTION=(
158
+ '<|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|>'
159
+ '<|start_header_id|>assistant<|end_header_id|>\n\n'),
160
+ SUFFIX='<|eot_id|>',
161
+ SUFFIX_AS_EOS=True,
162
+ STOP_WORDS=['<|eot_id|>']),
163
+ phi3_chat=dict(
164
+ SYSTEM='<|system|>\n{system}<|end|>\n',
165
+ INSTRUCTION='<|user|>\n{input}<|end|>\n<|assistant|>\n',
166
+ SUFFIX='<|end|>',
167
+ SUFFIX_AS_EOS=True,
168
+ SEP='\n',
169
+ STOP_WORDS=['<|end|>']),
170
+ )
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ade4f08c34dfcbd3b4d11082162940b785e27af18f916549aa4bc223155c91b8
3
+ size 11423560
tokenizer_config.json ADDED
@@ -0,0 +1,281 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ },
213
+ "151669": {
214
+ "content": "[SEG]",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<p>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "</p>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<vp>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "</vp>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ }
253
+ },
254
+ "additional_special_tokens": [
255
+ "<|im_start|>",
256
+ "<|im_end|>",
257
+ "<|object_ref_start|>",
258
+ "<|object_ref_end|>",
259
+ "<|box_start|>",
260
+ "<|box_end|>",
261
+ "<|quad_start|>",
262
+ "<|quad_end|>",
263
+ "<|vision_start|>",
264
+ "<|vision_end|>",
265
+ "<|vision_pad|>",
266
+ "<|image_pad|>",
267
+ "<|video_pad|>"
268
+ ],
269
+ "bos_token": null,
270
+ "clean_up_tokenization_spaces": false,
271
+ "eos_token": "<|im_end|>",
272
+ "errors": "replace",
273
+ "extra_special_tokens": {},
274
+ "model_max_length": 262144,
275
+ "pad_token": "<|endoftext|>",
276
+ "padding_side": "right",
277
+ "processor_class": "Qwen3VLProcessor",
278
+ "split_special_tokens": false,
279
+ "tokenizer_class": "Qwen2Tokenizer",
280
+ "unk_token": null
281
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": true,
5
+ "device": null,
6
+ "do_center_crop": null,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "do_sample_frames": true,
12
+ "fps": 2,
13
+ "image_mean": [
14
+ 0.5,
15
+ 0.5,
16
+ 0.5
17
+ ],
18
+ "image_std": [
19
+ 0.5,
20
+ 0.5,
21
+ 0.5
22
+ ],
23
+ "input_data_format": null,
24
+ "max_frames": 768,
25
+ "merge_size": 2,
26
+ "min_frames": 4,
27
+ "num_frames": null,
28
+ "pad_size": null,
29
+ "patch_size": 16,
30
+ "processor_class": "Qwen3VLProcessor",
31
+ "resample": 3,
32
+ "rescale_factor": 0.00392156862745098,
33
+ "return_metadata": false,
34
+ "size": {
35
+ "longest_edge": 25165824,
36
+ "shortest_edge": 4096
37
+ },
38
+ "temporal_patch_size": 2,
39
+ "video_metadata": null,
40
+ "video_processor_type": "Qwen3VLVideoProcessor"
41
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff