jwnt4 committed on
Commit d44b994 · verified · 1 Parent(s): b216c2d

Upload model

README.md ADDED
@@ -0,0 +1,199 @@
---
library_name: transformers
tags: []
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
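In the meantime, a minimal loading sketch is given below. It is not an official snippet; it assumes only what this commit's `config.json` declares (custom code in `modeling_xgenmm.py` registered via `auto_map`, hence `trust_remote_code=True`, and `bfloat16` weights). The repo id is a placeholder, and the image preprocessing and prompt format live in the custom modeling code, which is not part of this commit.

```python
# Minimal loading sketch (unofficial). MODEL_PATH is a placeholder for the repo
# id or local folder holding the files uploaded in this commit.
import torch
from transformers import AutoConfig, AutoModelForVision2Seq, AutoTokenizer

MODEL_PATH = "path-or-repo-id-of-this-model"  # placeholder, not a real repo id

# config.json maps AutoConfig / AutoModelForVision2Seq onto the custom
# modeling_xgenmm classes, so trust_remote_code=True is required.
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # matches "torch_dtype": "bfloat16" in config.json
    trust_remote_code=True,
)

# Assumes tokenizer files are also available in the same repo (they are not part
# of this commit); the phi3 text backbone implies a Phi-3-style tokenizer.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Image preprocessing ("anyres" tiling) and prompt formatting are defined by the
# custom modeling/processing code and are intentionally omitted here.
```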

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
config.json ADDED
@@ -0,0 +1,71 @@
{
  "architectures": [
    "XGenMMModelForConditionalGeneration"
  ],
  "auto_map": {
    "AutoConfig": "modeling_xgenmm.XGenMMConfig",
    "AutoModelForVision2Seq": "modeling_xgenmm.XGenMMModelForConditionalGeneration"
  },
  "model_type": "xgenmm",
  "text_config": {
    "attention_dropout": 0.0,
    "embd_pdrop": 0.0,
    "hidden_act": "silu",
    "hidden_size": 3072,
    "initial_tokenizer_len": 32012,
    "initializer_range": 0.02,
    "intermediate_size": 8192,
    "max_position_embeddings": 4096,
    "model_type": "phi3",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 32,
    "original_max_position_embeddings": 4096,
    "partial_rotary_factor": 1.0,
    "resid_pdrop": 0.0,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 10000.0,
    "sliding_window": 2047,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "vocab_size": 32064
  },
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.0",
  "vision_encoder_config": {
    "anyres_grids": [
      [384, 768],
      [768, 384],
      [768, 768],
      [1152, 384],
      [384, 1152]
    ],
    "anyres_patch_sampling": true,
    "image_aspect_ratio": "anyres",
    "model_name": "google/siglip-so400m-patch14-384",
    "model_type": "xgenmm_vision_encoder"
  },
  "vision_tokenizer_config": {
    "image_aspect_ratio": "anyres",
    "lang_embedding_dim": 3072,
    "model_type": "xgenmm_vision_tokenizer",
    "num_vis_tokens": 128,
    "vis_feature_dim": 1152
  }
}
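This config wires a Phi-3-style language model (`hidden_size` 3072, 32 layers) to a SigLIP vision encoder through a vision tokenizer that emits 128 visual tokens of feature dimension 1152 into a 3072-dimensional language embedding space. The `anyres_grids` values are all multiples of 384, the input size of `google/siglip-so400m-patch14-384`, which suggests each entry is a pixel canvas tiled from 384×384 crops; the sketch below spells out that reading, which is an interpretation rather than anything stated in this commit.

```python
# Hypothetical reading of "anyres_grids" from config.json above: each entry is
# taken to be a (height, width) canvas in pixels composed of 384x384 tiles,
# matching google/siglip-so400m-patch14-384. This is an assumption, not
# something defined in this commit.
BASE = 384
anyres_grids = [[384, 768], [768, 384], [768, 768], [1152, 384], [384, 1152]]
for h, w in anyres_grids:
    print(f"{h}x{w} px -> {h // BASE} x {w // BASE} grid of {BASE}px tiles")
```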
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 32007,
  "pad_token_id": 32000,
  "transformers_version": "4.55.0"
}
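These are the generation defaults for the checkpoint: `eos_token_id` 32007 and `pad_token_id` 32000 (in the Phi-3 tokenizer these correspond to `<|end|>` and `<|endoftext|>`), with `bos_token_id` 1. They are picked up automatically by `generate()`; the short sketch below, reusing the same placeholder repo id as in the model-card example above, only makes that mapping visible.

```python
# Sketch: load the generation defaults explicitly. MODEL_PATH is the same
# placeholder used in the loading sketch above.
from transformers import GenerationConfig

MODEL_PATH = "path-or-repo-id-of-this-model"  # placeholder
gen_cfg = GenerationConfig.from_pretrained(MODEL_PATH)
print(gen_cfg.eos_token_id, gen_cfg.pad_token_id)  # 32007, 32000
# outputs = model.generate(**inputs, generation_config=gen_cfg, max_new_tokens=128)
```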
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:436f7e50d403dddff7200e0fd9d988dc3685a4432e9627529162456a934da823
size 4972926984
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c8d7a10df665d105008f7e9464091473151991f7c670c83452a0de3f7ca2f0e6
size 3745680670
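The two entries above are Git LFS pointer files rather than the weights themselves: `oid sha256:` records the hash of the real shard and `size` its length in bytes (≈4.97 GB and ≈3.75 GB, ≈8.7 GB total, consistent with the bfloat16 parameter count recorded in the index below). A small verification sketch, assuming the shards have already been downloaded into the working directory:

```python
# Sketch: check downloaded shards against the LFS pointers above.
import hashlib
from pathlib import Path

EXPECTED = {
    "model-00001-of-00002.safetensors": (
        "436f7e50d403dddff7200e0fd9d988dc3685a4432e9627529162456a934da823",
        4972926984,
    ),
    "model-00002-of-00002.safetensors": (
        "c8d7a10df665d105008f7e9464091473151991f7c670c83452a0de3f7ca2f0e6",
        3745680670,
    ),
}

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB shards fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

for name, (oid, size) in EXPECTED.items():
    p = Path(name)
    ok = p.stat().st_size == size and sha256_of(p) == oid
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```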
model.safetensors.index.json ADDED
@@ -0,0 +1,726 @@
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 4359257219,
4
+ "total_size": 8718514438
5
+ },
6
+ "weight_map": {
7
+ "vlm.lang_model.lm_head.additional_fc.bias": "model-00002-of-00002.safetensors",
8
+ "vlm.lang_model.lm_head.additional_fc.weight": "model-00002-of-00002.safetensors",
9
+ "vlm.lang_model.lm_head.bias": "model-00002-of-00002.safetensors",
10
+ "vlm.lang_model.lm_head.weight": "model-00002-of-00002.safetensors",
11
+ "vlm.lang_model.model.embed_tokens.additional_embedding.weight": "model-00001-of-00002.safetensors",
12
+ "vlm.lang_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
13
+ "vlm.lang_model.model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
14
+ "vlm.lang_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
15
+ "vlm.lang_model.model.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
16
+ "vlm.lang_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
17
+ "vlm.lang_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
18
+ "vlm.lang_model.model.layers.0.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
19
+ "vlm.lang_model.model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
20
+ "vlm.lang_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
21
+ "vlm.lang_model.model.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
22
+ "vlm.lang_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
23
+ "vlm.lang_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
24
+ "vlm.lang_model.model.layers.1.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
25
+ "vlm.lang_model.model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
26
+ "vlm.lang_model.model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
27
+ "vlm.lang_model.model.layers.10.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
28
+ "vlm.lang_model.model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
29
+ "vlm.lang_model.model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
30
+ "vlm.lang_model.model.layers.10.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
31
+ "vlm.lang_model.model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
32
+ "vlm.lang_model.model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
33
+ "vlm.lang_model.model.layers.11.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
34
+ "vlm.lang_model.model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "vlm.lang_model.model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
36
+ "vlm.lang_model.model.layers.11.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
37
+ "vlm.lang_model.model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
38
+ "vlm.lang_model.model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
39
+ "vlm.lang_model.model.layers.12.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
40
+ "vlm.lang_model.model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
41
+ "vlm.lang_model.model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
42
+ "vlm.lang_model.model.layers.12.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
43
+ "vlm.lang_model.model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
44
+ "vlm.lang_model.model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
45
+ "vlm.lang_model.model.layers.13.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
46
+ "vlm.lang_model.model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
47
+ "vlm.lang_model.model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
48
+ "vlm.lang_model.model.layers.13.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
49
+ "vlm.lang_model.model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
50
+ "vlm.lang_model.model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
51
+ "vlm.lang_model.model.layers.14.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
52
+ "vlm.lang_model.model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
53
+ "vlm.lang_model.model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
54
+ "vlm.lang_model.model.layers.14.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
55
+ "vlm.lang_model.model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
56
+ "vlm.lang_model.model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
57
+ "vlm.lang_model.model.layers.15.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
58
+ "vlm.lang_model.model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
59
+ "vlm.lang_model.model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "vlm.lang_model.model.layers.15.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
61
+ "vlm.lang_model.model.layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
62
+ "vlm.lang_model.model.layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
63
+ "vlm.lang_model.model.layers.16.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
64
+ "vlm.lang_model.model.layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
65
+ "vlm.lang_model.model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
66
+ "vlm.lang_model.model.layers.16.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
67
+ "vlm.lang_model.model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
68
+ "vlm.lang_model.model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
69
+ "vlm.lang_model.model.layers.17.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
70
+ "vlm.lang_model.model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
71
+ "vlm.lang_model.model.layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
72
+ "vlm.lang_model.model.layers.17.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
73
+ "vlm.lang_model.model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
74
+ "vlm.lang_model.model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
75
+ "vlm.lang_model.model.layers.18.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
76
+ "vlm.lang_model.model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
77
+ "vlm.lang_model.model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
78
+ "vlm.lang_model.model.layers.18.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
79
+ "vlm.lang_model.model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
80
+ "vlm.lang_model.model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
81
+ "vlm.lang_model.model.layers.19.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
82
+ "vlm.lang_model.model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
83
+ "vlm.lang_model.model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
84
+ "vlm.lang_model.model.layers.19.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
85
+ "vlm.lang_model.model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
86
+ "vlm.lang_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
87
+ "vlm.lang_model.model.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
88
+ "vlm.lang_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
89
+ "vlm.lang_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
90
+ "vlm.lang_model.model.layers.2.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
91
+ "vlm.lang_model.model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
92
+ "vlm.lang_model.model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
93
+ "vlm.lang_model.model.layers.20.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
94
+ "vlm.lang_model.model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
95
+ "vlm.lang_model.model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
96
+ "vlm.lang_model.model.layers.20.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
97
+ "vlm.lang_model.model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
98
+ "vlm.lang_model.model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
99
+ "vlm.lang_model.model.layers.21.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
100
+ "vlm.lang_model.model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
101
+ "vlm.lang_model.model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
102
+ "vlm.lang_model.model.layers.21.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
103
+ "vlm.lang_model.model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
104
+ "vlm.lang_model.model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
105
+ "vlm.lang_model.model.layers.22.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
106
+ "vlm.lang_model.model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
107
+ "vlm.lang_model.model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
108
+ "vlm.lang_model.model.layers.22.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
109
+ "vlm.lang_model.model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
110
+ "vlm.lang_model.model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
111
+ "vlm.lang_model.model.layers.23.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
112
+ "vlm.lang_model.model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
113
+ "vlm.lang_model.model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
114
+ "vlm.lang_model.model.layers.23.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
115
+ "vlm.lang_model.model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
116
+ "vlm.lang_model.model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
117
+ "vlm.lang_model.model.layers.24.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
118
+ "vlm.lang_model.model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
119
+ "vlm.lang_model.model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
120
+ "vlm.lang_model.model.layers.24.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
121
+ "vlm.lang_model.model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
122
+ "vlm.lang_model.model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
123
+ "vlm.lang_model.model.layers.25.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
124
+ "vlm.lang_model.model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
125
+ "vlm.lang_model.model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
126
+ "vlm.lang_model.model.layers.25.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
127
+ "vlm.lang_model.model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
128
+ "vlm.lang_model.model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
129
+ "vlm.lang_model.model.layers.26.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
130
+ "vlm.lang_model.model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
131
+ "vlm.lang_model.model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
132
+ "vlm.lang_model.model.layers.26.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
133
+ "vlm.lang_model.model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
134
+ "vlm.lang_model.model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
135
+ "vlm.lang_model.model.layers.27.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
136
+ "vlm.lang_model.model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
137
+ "vlm.lang_model.model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
138
+ "vlm.lang_model.model.layers.27.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
139
+ "vlm.lang_model.model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
140
+ "vlm.lang_model.model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
141
+ "vlm.lang_model.model.layers.28.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
142
+ "vlm.lang_model.model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
143
+ "vlm.lang_model.model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
144
+ "vlm.lang_model.model.layers.28.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
145
+ "vlm.lang_model.model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
146
+ "vlm.lang_model.model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
147
+ "vlm.lang_model.model.layers.29.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
148
+ "vlm.lang_model.model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
149
+ "vlm.lang_model.model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
150
+ "vlm.lang_model.model.layers.29.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
151
+ "vlm.lang_model.model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
152
+ "vlm.lang_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
153
+ "vlm.lang_model.model.layers.3.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
154
+ "vlm.lang_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
155
+ "vlm.lang_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
156
+ "vlm.lang_model.model.layers.3.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
157
+ "vlm.lang_model.model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
158
+ "vlm.lang_model.model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
159
+ "vlm.lang_model.model.layers.30.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
160
+ "vlm.lang_model.model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
161
+ "vlm.lang_model.model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
162
+ "vlm.lang_model.model.layers.30.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
163
+ "vlm.lang_model.model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
164
+ "vlm.lang_model.model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
165
+ "vlm.lang_model.model.layers.31.mlp.gate_up_proj.weight": "model-00002-of-00002.safetensors",
166
+ "vlm.lang_model.model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
167
+ "vlm.lang_model.model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
168
+ "vlm.lang_model.model.layers.31.self_attn.qkv_proj.weight": "model-00002-of-00002.safetensors",
169
+ "vlm.lang_model.model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
170
+ "vlm.lang_model.model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
171
+ "vlm.lang_model.model.layers.4.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
172
+ "vlm.lang_model.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
173
+ "vlm.lang_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
174
+ "vlm.lang_model.model.layers.4.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
175
+ "vlm.lang_model.model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
176
+ "vlm.lang_model.model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
177
+ "vlm.lang_model.model.layers.5.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
178
+ "vlm.lang_model.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
179
+ "vlm.lang_model.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
180
+ "vlm.lang_model.model.layers.5.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
181
+ "vlm.lang_model.model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
182
+ "vlm.lang_model.model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
183
+ "vlm.lang_model.model.layers.6.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
184
+ "vlm.lang_model.model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
185
+ "vlm.lang_model.model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
186
+ "vlm.lang_model.model.layers.6.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
187
+ "vlm.lang_model.model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
188
+ "vlm.lang_model.model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
189
+ "vlm.lang_model.model.layers.7.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
190
+ "vlm.lang_model.model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
191
+ "vlm.lang_model.model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
192
+ "vlm.lang_model.model.layers.7.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
193
+ "vlm.lang_model.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
194
+ "vlm.lang_model.model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
195
+ "vlm.lang_model.model.layers.8.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
196
+ "vlm.lang_model.model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
197
+ "vlm.lang_model.model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
198
+ "vlm.lang_model.model.layers.8.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
199
+ "vlm.lang_model.model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
200
+ "vlm.lang_model.model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
201
+ "vlm.lang_model.model.layers.9.mlp.gate_up_proj.weight": "model-00001-of-00002.safetensors",
202
+ "vlm.lang_model.model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
203
+ "vlm.lang_model.model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
204
+ "vlm.lang_model.model.layers.9.self_attn.qkv_proj.weight": "model-00001-of-00002.safetensors",
205
+ "vlm.lang_model.model.norm.weight": "model-00002-of-00002.safetensors",
206
+ "vlm.vision_encoder.embeddings.patch_embedding.bias": "model-00001-of-00002.safetensors",
207
+ "vlm.vision_encoder.embeddings.patch_embedding.weight": "model-00001-of-00002.safetensors",
208
+ "vlm.vision_encoder.embeddings.position_embedding.weight": "model-00001-of-00002.safetensors",
209
+ "vlm.vision_encoder.encoder.layers.0.layer_norm1.bias": "model-00001-of-00002.safetensors",
210
+ "vlm.vision_encoder.encoder.layers.0.layer_norm1.weight": "model-00001-of-00002.safetensors",
211
+ "vlm.vision_encoder.encoder.layers.0.layer_norm2.bias": "model-00001-of-00002.safetensors",
212
+ "vlm.vision_encoder.encoder.layers.0.layer_norm2.weight": "model-00001-of-00002.safetensors",
213
+ "vlm.vision_encoder.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
214
+ "vlm.vision_encoder.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
215
+ "vlm.vision_encoder.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
216
+ "vlm.vision_encoder.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
217
+ "vlm.vision_encoder.encoder.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
218
+ "vlm.vision_encoder.encoder.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
219
+ "vlm.vision_encoder.encoder.layers.0.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
220
+ "vlm.vision_encoder.encoder.layers.0.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
221
+ "vlm.vision_encoder.encoder.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
222
+ "vlm.vision_encoder.encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
223
+ "vlm.vision_encoder.encoder.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
224
+ "vlm.vision_encoder.encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
225
+ "vlm.vision_encoder.encoder.layers.1.layer_norm1.bias": "model-00001-of-00002.safetensors",
226
+ "vlm.vision_encoder.encoder.layers.1.layer_norm1.weight": "model-00001-of-00002.safetensors",
227
+ "vlm.vision_encoder.encoder.layers.1.layer_norm2.bias": "model-00001-of-00002.safetensors",
228
+ "vlm.vision_encoder.encoder.layers.1.layer_norm2.weight": "model-00001-of-00002.safetensors",
229
+ "vlm.vision_encoder.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
230
+ "vlm.vision_encoder.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
231
+ "vlm.vision_encoder.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
232
+ "vlm.vision_encoder.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
233
+ "vlm.vision_encoder.encoder.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
234
+ "vlm.vision_encoder.encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
235
+ "vlm.vision_encoder.encoder.layers.1.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
236
+ "vlm.vision_encoder.encoder.layers.1.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
237
+ "vlm.vision_encoder.encoder.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
238
+ "vlm.vision_encoder.encoder.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
239
+ "vlm.vision_encoder.encoder.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
240
+ "vlm.vision_encoder.encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
241
+ "vlm.vision_encoder.encoder.layers.10.layer_norm1.bias": "model-00001-of-00002.safetensors",
242
+ "vlm.vision_encoder.encoder.layers.10.layer_norm1.weight": "model-00001-of-00002.safetensors",
243
+ "vlm.vision_encoder.encoder.layers.10.layer_norm2.bias": "model-00001-of-00002.safetensors",
244
+ "vlm.vision_encoder.encoder.layers.10.layer_norm2.weight": "model-00001-of-00002.safetensors",
245
+ "vlm.vision_encoder.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
246
+ "vlm.vision_encoder.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
247
+ "vlm.vision_encoder.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
248
+ "vlm.vision_encoder.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
249
+ "vlm.vision_encoder.encoder.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
250
+ "vlm.vision_encoder.encoder.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
251
+ "vlm.vision_encoder.encoder.layers.10.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
252
+ "vlm.vision_encoder.encoder.layers.10.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
253
+ "vlm.vision_encoder.encoder.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
254
+ "vlm.vision_encoder.encoder.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
255
+ "vlm.vision_encoder.encoder.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
256
+ "vlm.vision_encoder.encoder.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
257
+ "vlm.vision_encoder.encoder.layers.11.layer_norm1.bias": "model-00001-of-00002.safetensors",
258
+ "vlm.vision_encoder.encoder.layers.11.layer_norm1.weight": "model-00001-of-00002.safetensors",
259
+ "vlm.vision_encoder.encoder.layers.11.layer_norm2.bias": "model-00001-of-00002.safetensors",
260
+ "vlm.vision_encoder.encoder.layers.11.layer_norm2.weight": "model-00001-of-00002.safetensors",
261
+ "vlm.vision_encoder.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
262
+ "vlm.vision_encoder.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
263
+ "vlm.vision_encoder.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
264
+ "vlm.vision_encoder.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
265
+ "vlm.vision_encoder.encoder.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
266
+ "vlm.vision_encoder.encoder.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
267
+ "vlm.vision_encoder.encoder.layers.11.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
268
+ "vlm.vision_encoder.encoder.layers.11.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
269
+ "vlm.vision_encoder.encoder.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
270
+ "vlm.vision_encoder.encoder.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
271
+ "vlm.vision_encoder.encoder.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
272
+ "vlm.vision_encoder.encoder.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
273
+ "vlm.vision_encoder.encoder.layers.12.layer_norm1.bias": "model-00001-of-00002.safetensors",
274
+ "vlm.vision_encoder.encoder.layers.12.layer_norm1.weight": "model-00001-of-00002.safetensors",
275
+ "vlm.vision_encoder.encoder.layers.12.layer_norm2.bias": "model-00001-of-00002.safetensors",
276
+ "vlm.vision_encoder.encoder.layers.12.layer_norm2.weight": "model-00001-of-00002.safetensors",
277
+ "vlm.vision_encoder.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
278
+ "vlm.vision_encoder.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
279
+ "vlm.vision_encoder.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
280
+ "vlm.vision_encoder.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
281
+ "vlm.vision_encoder.encoder.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
282
+ "vlm.vision_encoder.encoder.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
283
+ "vlm.vision_encoder.encoder.layers.12.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
284
+ "vlm.vision_encoder.encoder.layers.12.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
285
+ "vlm.vision_encoder.encoder.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
286
+ "vlm.vision_encoder.encoder.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
287
+ "vlm.vision_encoder.encoder.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
288
+ "vlm.vision_encoder.encoder.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
289
+ "vlm.vision_encoder.encoder.layers.13.layer_norm1.bias": "model-00001-of-00002.safetensors",
290
+ "vlm.vision_encoder.encoder.layers.13.layer_norm1.weight": "model-00001-of-00002.safetensors",
291
+ "vlm.vision_encoder.encoder.layers.13.layer_norm2.bias": "model-00001-of-00002.safetensors",
292
+ "vlm.vision_encoder.encoder.layers.13.layer_norm2.weight": "model-00001-of-00002.safetensors",
293
+ "vlm.vision_encoder.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
294
+ "vlm.vision_encoder.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
295
+ "vlm.vision_encoder.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
296
+ "vlm.vision_encoder.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
297
+ "vlm.vision_encoder.encoder.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
298
+ "vlm.vision_encoder.encoder.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
299
+ "vlm.vision_encoder.encoder.layers.13.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
300
+ "vlm.vision_encoder.encoder.layers.13.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
301
+ "vlm.vision_encoder.encoder.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
302
+ "vlm.vision_encoder.encoder.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
303
+ "vlm.vision_encoder.encoder.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
304
+ "vlm.vision_encoder.encoder.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
305
+ "vlm.vision_encoder.encoder.layers.14.layer_norm1.bias": "model-00001-of-00002.safetensors",
306
+ "vlm.vision_encoder.encoder.layers.14.layer_norm1.weight": "model-00001-of-00002.safetensors",
307
+ "vlm.vision_encoder.encoder.layers.14.layer_norm2.bias": "model-00001-of-00002.safetensors",
308
+ "vlm.vision_encoder.encoder.layers.14.layer_norm2.weight": "model-00001-of-00002.safetensors",
309
+ "vlm.vision_encoder.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
310
+ "vlm.vision_encoder.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
311
+ "vlm.vision_encoder.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
312
+ "vlm.vision_encoder.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
313
+ "vlm.vision_encoder.encoder.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
314
+ "vlm.vision_encoder.encoder.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
315
+ "vlm.vision_encoder.encoder.layers.14.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
316
+ "vlm.vision_encoder.encoder.layers.14.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
317
+ "vlm.vision_encoder.encoder.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
318
+ "vlm.vision_encoder.encoder.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
319
+ "vlm.vision_encoder.encoder.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
320
+ "vlm.vision_encoder.encoder.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
321
+ "vlm.vision_encoder.encoder.layers.15.layer_norm1.bias": "model-00001-of-00002.safetensors",
322
+ "vlm.vision_encoder.encoder.layers.15.layer_norm1.weight": "model-00001-of-00002.safetensors",
323
+ "vlm.vision_encoder.encoder.layers.15.layer_norm2.bias": "model-00001-of-00002.safetensors",
324
+ "vlm.vision_encoder.encoder.layers.15.layer_norm2.weight": "model-00001-of-00002.safetensors",
325
+ "vlm.vision_encoder.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
326
+ "vlm.vision_encoder.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
327
+ "vlm.vision_encoder.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
328
+ "vlm.vision_encoder.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
329
+ "vlm.vision_encoder.encoder.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
330
+ "vlm.vision_encoder.encoder.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
331
+ "vlm.vision_encoder.encoder.layers.15.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
332
+ "vlm.vision_encoder.encoder.layers.15.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
333
+ "vlm.vision_encoder.encoder.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
334
+ "vlm.vision_encoder.encoder.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
335
+ "vlm.vision_encoder.encoder.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
336
+ "vlm.vision_encoder.encoder.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
337
+ "vlm.vision_encoder.encoder.layers.16.layer_norm1.bias": "model-00001-of-00002.safetensors",
338
+ "vlm.vision_encoder.encoder.layers.16.layer_norm1.weight": "model-00001-of-00002.safetensors",
339
+ "vlm.vision_encoder.encoder.layers.16.layer_norm2.bias": "model-00001-of-00002.safetensors",
340
+ "vlm.vision_encoder.encoder.layers.16.layer_norm2.weight": "model-00001-of-00002.safetensors",
341
+ "vlm.vision_encoder.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
342
+ "vlm.vision_encoder.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
343
+ "vlm.vision_encoder.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
344
+ "vlm.vision_encoder.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
345
+ "vlm.vision_encoder.encoder.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
346
+ "vlm.vision_encoder.encoder.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
347
+ "vlm.vision_encoder.encoder.layers.16.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
348
+ "vlm.vision_encoder.encoder.layers.16.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
349
+ "vlm.vision_encoder.encoder.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
350
+ "vlm.vision_encoder.encoder.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
351
+ "vlm.vision_encoder.encoder.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
352
+ "vlm.vision_encoder.encoder.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
353
+ "vlm.vision_encoder.encoder.layers.17.layer_norm1.bias": "model-00001-of-00002.safetensors",
354
+ "vlm.vision_encoder.encoder.layers.17.layer_norm1.weight": "model-00001-of-00002.safetensors",
355
+ "vlm.vision_encoder.encoder.layers.17.layer_norm2.bias": "model-00001-of-00002.safetensors",
356
+ "vlm.vision_encoder.encoder.layers.17.layer_norm2.weight": "model-00001-of-00002.safetensors",
357
+ "vlm.vision_encoder.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
358
+ "vlm.vision_encoder.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
359
+ "vlm.vision_encoder.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
360
+ "vlm.vision_encoder.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
361
+ "vlm.vision_encoder.encoder.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
362
+ "vlm.vision_encoder.encoder.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
363
+ "vlm.vision_encoder.encoder.layers.17.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
364
+ "vlm.vision_encoder.encoder.layers.17.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
365
+ "vlm.vision_encoder.encoder.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
366
+ "vlm.vision_encoder.encoder.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
367
+ "vlm.vision_encoder.encoder.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
368
+ "vlm.vision_encoder.encoder.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
369
+ "vlm.vision_encoder.encoder.layers.18.layer_norm1.bias": "model-00001-of-00002.safetensors",
370
+ "vlm.vision_encoder.encoder.layers.18.layer_norm1.weight": "model-00001-of-00002.safetensors",
371
+ "vlm.vision_encoder.encoder.layers.18.layer_norm2.bias": "model-00001-of-00002.safetensors",
372
+ "vlm.vision_encoder.encoder.layers.18.layer_norm2.weight": "model-00001-of-00002.safetensors",
373
+ "vlm.vision_encoder.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
374
+ "vlm.vision_encoder.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
375
+ "vlm.vision_encoder.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
376
+ "vlm.vision_encoder.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
377
+ "vlm.vision_encoder.encoder.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
378
+ "vlm.vision_encoder.encoder.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
379
+ "vlm.vision_encoder.encoder.layers.18.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
380
+ "vlm.vision_encoder.encoder.layers.18.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
381
+ "vlm.vision_encoder.encoder.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
382
+ "vlm.vision_encoder.encoder.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
383
+ "vlm.vision_encoder.encoder.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
384
+ "vlm.vision_encoder.encoder.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
385
+ "vlm.vision_encoder.encoder.layers.19.layer_norm1.bias": "model-00001-of-00002.safetensors",
386
+ "vlm.vision_encoder.encoder.layers.19.layer_norm1.weight": "model-00001-of-00002.safetensors",
387
+ "vlm.vision_encoder.encoder.layers.19.layer_norm2.bias": "model-00001-of-00002.safetensors",
388
+ "vlm.vision_encoder.encoder.layers.19.layer_norm2.weight": "model-00001-of-00002.safetensors",
389
+ "vlm.vision_encoder.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
390
+ "vlm.vision_encoder.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
391
+ "vlm.vision_encoder.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
392
+ "vlm.vision_encoder.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
393
+ "vlm.vision_encoder.encoder.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
394
+ "vlm.vision_encoder.encoder.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
395
+ "vlm.vision_encoder.encoder.layers.19.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
396
+ "vlm.vision_encoder.encoder.layers.19.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
397
+ "vlm.vision_encoder.encoder.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
398
+ "vlm.vision_encoder.encoder.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
399
+ "vlm.vision_encoder.encoder.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
400
+ "vlm.vision_encoder.encoder.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
401
+ "vlm.vision_encoder.encoder.layers.2.layer_norm1.bias": "model-00001-of-00002.safetensors",
402
+ "vlm.vision_encoder.encoder.layers.2.layer_norm1.weight": "model-00001-of-00002.safetensors",
403
+ "vlm.vision_encoder.encoder.layers.2.layer_norm2.bias": "model-00001-of-00002.safetensors",
404
+ "vlm.vision_encoder.encoder.layers.2.layer_norm2.weight": "model-00001-of-00002.safetensors",
405
+ "vlm.vision_encoder.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
406
+ "vlm.vision_encoder.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
407
+ "vlm.vision_encoder.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
408
+ "vlm.vision_encoder.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
409
+ "vlm.vision_encoder.encoder.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
410
+ "vlm.vision_encoder.encoder.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
411
+ "vlm.vision_encoder.encoder.layers.2.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
412
+ "vlm.vision_encoder.encoder.layers.2.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
413
+ "vlm.vision_encoder.encoder.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
414
+ "vlm.vision_encoder.encoder.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
415
+ "vlm.vision_encoder.encoder.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
416
+ "vlm.vision_encoder.encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
417
+ "vlm.vision_encoder.encoder.layers.20.layer_norm1.bias": "model-00001-of-00002.safetensors",
418
+ "vlm.vision_encoder.encoder.layers.20.layer_norm1.weight": "model-00001-of-00002.safetensors",
419
+ "vlm.vision_encoder.encoder.layers.20.layer_norm2.bias": "model-00001-of-00002.safetensors",
420
+ "vlm.vision_encoder.encoder.layers.20.layer_norm2.weight": "model-00001-of-00002.safetensors",
421
+ "vlm.vision_encoder.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
422
+ "vlm.vision_encoder.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
423
+ "vlm.vision_encoder.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
424
+ "vlm.vision_encoder.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
425
+ "vlm.vision_encoder.encoder.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
426
+ "vlm.vision_encoder.encoder.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
427
+ "vlm.vision_encoder.encoder.layers.20.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
428
+ "vlm.vision_encoder.encoder.layers.20.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
429
+ "vlm.vision_encoder.encoder.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
430
+ "vlm.vision_encoder.encoder.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
431
+ "vlm.vision_encoder.encoder.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
432
+ "vlm.vision_encoder.encoder.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
433
+ "vlm.vision_encoder.encoder.layers.21.layer_norm1.bias": "model-00001-of-00002.safetensors",
434
+ "vlm.vision_encoder.encoder.layers.21.layer_norm1.weight": "model-00001-of-00002.safetensors",
435
+ "vlm.vision_encoder.encoder.layers.21.layer_norm2.bias": "model-00001-of-00002.safetensors",
436
+ "vlm.vision_encoder.encoder.layers.21.layer_norm2.weight": "model-00001-of-00002.safetensors",
437
+ "vlm.vision_encoder.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
438
+ "vlm.vision_encoder.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
439
+ "vlm.vision_encoder.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
440
+ "vlm.vision_encoder.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
441
+ "vlm.vision_encoder.encoder.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
442
+ "vlm.vision_encoder.encoder.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
443
+ "vlm.vision_encoder.encoder.layers.21.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
444
+ "vlm.vision_encoder.encoder.layers.21.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
445
+ "vlm.vision_encoder.encoder.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
446
+ "vlm.vision_encoder.encoder.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
447
+ "vlm.vision_encoder.encoder.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
448
+ "vlm.vision_encoder.encoder.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
449
+ "vlm.vision_encoder.encoder.layers.22.layer_norm1.bias": "model-00001-of-00002.safetensors",
450
+ "vlm.vision_encoder.encoder.layers.22.layer_norm1.weight": "model-00001-of-00002.safetensors",
451
+ "vlm.vision_encoder.encoder.layers.22.layer_norm2.bias": "model-00001-of-00002.safetensors",
452
+ "vlm.vision_encoder.encoder.layers.22.layer_norm2.weight": "model-00001-of-00002.safetensors",
453
+ "vlm.vision_encoder.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
454
+ "vlm.vision_encoder.encoder.layers.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
455
+ "vlm.vision_encoder.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
456
+ "vlm.vision_encoder.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
457
+ "vlm.vision_encoder.encoder.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
458
+ "vlm.vision_encoder.encoder.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
459
+ "vlm.vision_encoder.encoder.layers.22.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
460
+ "vlm.vision_encoder.encoder.layers.22.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
461
+ "vlm.vision_encoder.encoder.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
462
+ "vlm.vision_encoder.encoder.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
463
+ "vlm.vision_encoder.encoder.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
464
+ "vlm.vision_encoder.encoder.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
465
+ "vlm.vision_encoder.encoder.layers.23.layer_norm1.bias": "model-00001-of-00002.safetensors",
466
+ "vlm.vision_encoder.encoder.layers.23.layer_norm1.weight": "model-00001-of-00002.safetensors",
467
+ "vlm.vision_encoder.encoder.layers.23.layer_norm2.bias": "model-00001-of-00002.safetensors",
468
+ "vlm.vision_encoder.encoder.layers.23.layer_norm2.weight": "model-00001-of-00002.safetensors",
469
+ "vlm.vision_encoder.encoder.layers.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
470
+ "vlm.vision_encoder.encoder.layers.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
471
+ "vlm.vision_encoder.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
472
+ "vlm.vision_encoder.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
473
+ "vlm.vision_encoder.encoder.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
474
+ "vlm.vision_encoder.encoder.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
475
+ "vlm.vision_encoder.encoder.layers.23.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
476
+ "vlm.vision_encoder.encoder.layers.23.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
477
+ "vlm.vision_encoder.encoder.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
478
+ "vlm.vision_encoder.encoder.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
479
+ "vlm.vision_encoder.encoder.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
480
+ "vlm.vision_encoder.encoder.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
481
+ "vlm.vision_encoder.encoder.layers.24.layer_norm1.bias": "model-00001-of-00002.safetensors",
482
+ "vlm.vision_encoder.encoder.layers.24.layer_norm1.weight": "model-00001-of-00002.safetensors",
483
+ "vlm.vision_encoder.encoder.layers.24.layer_norm2.bias": "model-00001-of-00002.safetensors",
484
+ "vlm.vision_encoder.encoder.layers.24.layer_norm2.weight": "model-00001-of-00002.safetensors",
485
+ "vlm.vision_encoder.encoder.layers.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
486
+ "vlm.vision_encoder.encoder.layers.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
487
+ "vlm.vision_encoder.encoder.layers.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
488
+ "vlm.vision_encoder.encoder.layers.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
489
+ "vlm.vision_encoder.encoder.layers.24.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
490
+ "vlm.vision_encoder.encoder.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
491
+ "vlm.vision_encoder.encoder.layers.24.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
492
+ "vlm.vision_encoder.encoder.layers.24.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
493
+ "vlm.vision_encoder.encoder.layers.24.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
494
+ "vlm.vision_encoder.encoder.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
495
+ "vlm.vision_encoder.encoder.layers.24.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
496
+ "vlm.vision_encoder.encoder.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
497
+ "vlm.vision_encoder.encoder.layers.25.layer_norm1.bias": "model-00001-of-00002.safetensors",
498
+ "vlm.vision_encoder.encoder.layers.25.layer_norm1.weight": "model-00001-of-00002.safetensors",
499
+ "vlm.vision_encoder.encoder.layers.25.layer_norm2.bias": "model-00001-of-00002.safetensors",
500
+ "vlm.vision_encoder.encoder.layers.25.layer_norm2.weight": "model-00001-of-00002.safetensors",
501
+ "vlm.vision_encoder.encoder.layers.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
502
+ "vlm.vision_encoder.encoder.layers.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
503
+ "vlm.vision_encoder.encoder.layers.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
504
+ "vlm.vision_encoder.encoder.layers.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
505
+ "vlm.vision_encoder.encoder.layers.25.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
506
+ "vlm.vision_encoder.encoder.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
507
+ "vlm.vision_encoder.encoder.layers.25.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
508
+ "vlm.vision_encoder.encoder.layers.25.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
509
+ "vlm.vision_encoder.encoder.layers.25.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
510
+ "vlm.vision_encoder.encoder.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
511
+ "vlm.vision_encoder.encoder.layers.25.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
512
+ "vlm.vision_encoder.encoder.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
513
+ "vlm.vision_encoder.encoder.layers.26.layer_norm1.bias": "model-00001-of-00002.safetensors",
514
+ "vlm.vision_encoder.encoder.layers.26.layer_norm1.weight": "model-00001-of-00002.safetensors",
515
+ "vlm.vision_encoder.encoder.layers.26.layer_norm2.bias": "model-00001-of-00002.safetensors",
516
+ "vlm.vision_encoder.encoder.layers.26.layer_norm2.weight": "model-00001-of-00002.safetensors",
517
+ "vlm.vision_encoder.encoder.layers.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
518
+ "vlm.vision_encoder.encoder.layers.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
519
+ "vlm.vision_encoder.encoder.layers.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
520
+ "vlm.vision_encoder.encoder.layers.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
521
+ "vlm.vision_encoder.encoder.layers.26.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
522
+ "vlm.vision_encoder.encoder.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
523
+ "vlm.vision_encoder.encoder.layers.26.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
524
+ "vlm.vision_encoder.encoder.layers.26.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
525
+ "vlm.vision_encoder.encoder.layers.26.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
526
+ "vlm.vision_encoder.encoder.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
527
+ "vlm.vision_encoder.encoder.layers.26.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
528
+ "vlm.vision_encoder.encoder.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
529
+ "vlm.vision_encoder.encoder.layers.3.layer_norm1.bias": "model-00001-of-00002.safetensors",
530
+ "vlm.vision_encoder.encoder.layers.3.layer_norm1.weight": "model-00001-of-00002.safetensors",
531
+ "vlm.vision_encoder.encoder.layers.3.layer_norm2.bias": "model-00001-of-00002.safetensors",
532
+ "vlm.vision_encoder.encoder.layers.3.layer_norm2.weight": "model-00001-of-00002.safetensors",
533
+ "vlm.vision_encoder.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
534
+ "vlm.vision_encoder.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
535
+ "vlm.vision_encoder.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
536
+ "vlm.vision_encoder.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
537
+ "vlm.vision_encoder.encoder.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
538
+ "vlm.vision_encoder.encoder.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
539
+ "vlm.vision_encoder.encoder.layers.3.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
540
+ "vlm.vision_encoder.encoder.layers.3.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
541
+ "vlm.vision_encoder.encoder.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
542
+ "vlm.vision_encoder.encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
543
+ "vlm.vision_encoder.encoder.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
544
+ "vlm.vision_encoder.encoder.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
545
+ "vlm.vision_encoder.encoder.layers.4.layer_norm1.bias": "model-00001-of-00002.safetensors",
546
+ "vlm.vision_encoder.encoder.layers.4.layer_norm1.weight": "model-00001-of-00002.safetensors",
547
+ "vlm.vision_encoder.encoder.layers.4.layer_norm2.bias": "model-00001-of-00002.safetensors",
548
+ "vlm.vision_encoder.encoder.layers.4.layer_norm2.weight": "model-00001-of-00002.safetensors",
549
+ "vlm.vision_encoder.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
550
+ "vlm.vision_encoder.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
551
+ "vlm.vision_encoder.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
552
+ "vlm.vision_encoder.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
553
+ "vlm.vision_encoder.encoder.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
554
+ "vlm.vision_encoder.encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
555
+ "vlm.vision_encoder.encoder.layers.4.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
556
+ "vlm.vision_encoder.encoder.layers.4.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
557
+ "vlm.vision_encoder.encoder.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
558
+ "vlm.vision_encoder.encoder.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
559
+ "vlm.vision_encoder.encoder.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
560
+ "vlm.vision_encoder.encoder.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
561
+ "vlm.vision_encoder.encoder.layers.5.layer_norm1.bias": "model-00001-of-00002.safetensors",
562
+ "vlm.vision_encoder.encoder.layers.5.layer_norm1.weight": "model-00001-of-00002.safetensors",
563
+ "vlm.vision_encoder.encoder.layers.5.layer_norm2.bias": "model-00001-of-00002.safetensors",
564
+ "vlm.vision_encoder.encoder.layers.5.layer_norm2.weight": "model-00001-of-00002.safetensors",
565
+ "vlm.vision_encoder.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
566
+ "vlm.vision_encoder.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
567
+ "vlm.vision_encoder.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
568
+ "vlm.vision_encoder.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
569
+ "vlm.vision_encoder.encoder.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
570
+ "vlm.vision_encoder.encoder.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
571
+ "vlm.vision_encoder.encoder.layers.5.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
572
+ "vlm.vision_encoder.encoder.layers.5.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
573
+ "vlm.vision_encoder.encoder.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
574
+ "vlm.vision_encoder.encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
575
+ "vlm.vision_encoder.encoder.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
576
+ "vlm.vision_encoder.encoder.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
577
+ "vlm.vision_encoder.encoder.layers.6.layer_norm1.bias": "model-00001-of-00002.safetensors",
578
+ "vlm.vision_encoder.encoder.layers.6.layer_norm1.weight": "model-00001-of-00002.safetensors",
579
+ "vlm.vision_encoder.encoder.layers.6.layer_norm2.bias": "model-00001-of-00002.safetensors",
580
+ "vlm.vision_encoder.encoder.layers.6.layer_norm2.weight": "model-00001-of-00002.safetensors",
581
+ "vlm.vision_encoder.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
582
+ "vlm.vision_encoder.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
583
+ "vlm.vision_encoder.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
584
+ "vlm.vision_encoder.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
585
+ "vlm.vision_encoder.encoder.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
586
+ "vlm.vision_encoder.encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
587
+ "vlm.vision_encoder.encoder.layers.6.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
588
+ "vlm.vision_encoder.encoder.layers.6.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
589
+ "vlm.vision_encoder.encoder.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
590
+ "vlm.vision_encoder.encoder.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
591
+ "vlm.vision_encoder.encoder.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
592
+ "vlm.vision_encoder.encoder.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
593
+ "vlm.vision_encoder.encoder.layers.7.layer_norm1.bias": "model-00001-of-00002.safetensors",
594
+ "vlm.vision_encoder.encoder.layers.7.layer_norm1.weight": "model-00001-of-00002.safetensors",
595
+ "vlm.vision_encoder.encoder.layers.7.layer_norm2.bias": "model-00001-of-00002.safetensors",
596
+ "vlm.vision_encoder.encoder.layers.7.layer_norm2.weight": "model-00001-of-00002.safetensors",
597
+ "vlm.vision_encoder.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
598
+ "vlm.vision_encoder.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
599
+ "vlm.vision_encoder.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
600
+ "vlm.vision_encoder.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
601
+ "vlm.vision_encoder.encoder.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
602
+ "vlm.vision_encoder.encoder.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
603
+ "vlm.vision_encoder.encoder.layers.7.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
604
+ "vlm.vision_encoder.encoder.layers.7.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
605
+ "vlm.vision_encoder.encoder.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
606
+ "vlm.vision_encoder.encoder.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
607
+ "vlm.vision_encoder.encoder.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
608
+ "vlm.vision_encoder.encoder.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
609
+ "vlm.vision_encoder.encoder.layers.8.layer_norm1.bias": "model-00001-of-00002.safetensors",
610
+ "vlm.vision_encoder.encoder.layers.8.layer_norm1.weight": "model-00001-of-00002.safetensors",
611
+ "vlm.vision_encoder.encoder.layers.8.layer_norm2.bias": "model-00001-of-00002.safetensors",
612
+ "vlm.vision_encoder.encoder.layers.8.layer_norm2.weight": "model-00001-of-00002.safetensors",
613
+ "vlm.vision_encoder.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
614
+ "vlm.vision_encoder.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
615
+ "vlm.vision_encoder.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
616
+ "vlm.vision_encoder.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
617
+ "vlm.vision_encoder.encoder.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
618
+ "vlm.vision_encoder.encoder.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
619
+ "vlm.vision_encoder.encoder.layers.8.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
620
+ "vlm.vision_encoder.encoder.layers.8.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
621
+ "vlm.vision_encoder.encoder.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
622
+ "vlm.vision_encoder.encoder.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
623
+ "vlm.vision_encoder.encoder.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
624
+ "vlm.vision_encoder.encoder.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
625
+ "vlm.vision_encoder.encoder.layers.9.layer_norm1.bias": "model-00001-of-00002.safetensors",
626
+ "vlm.vision_encoder.encoder.layers.9.layer_norm1.weight": "model-00001-of-00002.safetensors",
627
+ "vlm.vision_encoder.encoder.layers.9.layer_norm2.bias": "model-00001-of-00002.safetensors",
628
+ "vlm.vision_encoder.encoder.layers.9.layer_norm2.weight": "model-00001-of-00002.safetensors",
629
+ "vlm.vision_encoder.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
630
+ "vlm.vision_encoder.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
631
+ "vlm.vision_encoder.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
632
+ "vlm.vision_encoder.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
633
+ "vlm.vision_encoder.encoder.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
634
+ "vlm.vision_encoder.encoder.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
635
+ "vlm.vision_encoder.encoder.layers.9.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
636
+ "vlm.vision_encoder.encoder.layers.9.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
637
+ "vlm.vision_encoder.encoder.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
638
+ "vlm.vision_encoder.encoder.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
639
+ "vlm.vision_encoder.encoder.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
640
+ "vlm.vision_encoder.encoder.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
641
+ "vlm.vision_encoder.head.attention.in_proj_bias": "model-00001-of-00002.safetensors",
642
+ "vlm.vision_encoder.head.attention.in_proj_weight": "model-00001-of-00002.safetensors",
643
+ "vlm.vision_encoder.head.attention.out_proj.bias": "model-00001-of-00002.safetensors",
644
+ "vlm.vision_encoder.head.attention.out_proj.weight": "model-00001-of-00002.safetensors",
645
+ "vlm.vision_encoder.head.layernorm.bias": "model-00001-of-00002.safetensors",
646
+ "vlm.vision_encoder.head.layernorm.weight": "model-00001-of-00002.safetensors",
647
+ "vlm.vision_encoder.head.mlp.fc1.bias": "model-00001-of-00002.safetensors",
648
+ "vlm.vision_encoder.head.mlp.fc1.weight": "model-00001-of-00002.safetensors",
649
+ "vlm.vision_encoder.head.mlp.fc2.bias": "model-00001-of-00002.safetensors",
650
+ "vlm.vision_encoder.head.mlp.fc2.weight": "model-00001-of-00002.safetensors",
651
+ "vlm.vision_encoder.head.probe": "model-00001-of-00002.safetensors",
652
+ "vlm.vision_encoder.post_layernorm.bias": "model-00001-of-00002.safetensors",
653
+ "vlm.vision_encoder.post_layernorm.weight": "model-00001-of-00002.safetensors",
654
+ "vlm.vision_tokenizer.latents": "model-00001-of-00002.safetensors",
655
+ "vlm.vision_tokenizer.layers.0.0.norm_latents.bias": "model-00001-of-00002.safetensors",
656
+ "vlm.vision_tokenizer.layers.0.0.norm_latents.weight": "model-00001-of-00002.safetensors",
657
+ "vlm.vision_tokenizer.layers.0.0.norm_media.bias": "model-00001-of-00002.safetensors",
658
+ "vlm.vision_tokenizer.layers.0.0.norm_media.weight": "model-00001-of-00002.safetensors",
659
+ "vlm.vision_tokenizer.layers.0.0.to_kv.weight": "model-00001-of-00002.safetensors",
660
+ "vlm.vision_tokenizer.layers.0.0.to_out.weight": "model-00001-of-00002.safetensors",
661
+ "vlm.vision_tokenizer.layers.0.0.to_q.weight": "model-00001-of-00002.safetensors",
662
+ "vlm.vision_tokenizer.layers.0.1.0.bias": "model-00001-of-00002.safetensors",
663
+ "vlm.vision_tokenizer.layers.0.1.0.weight": "model-00001-of-00002.safetensors",
664
+ "vlm.vision_tokenizer.layers.0.1.1.weight": "model-00001-of-00002.safetensors",
665
+ "vlm.vision_tokenizer.layers.0.1.3.weight": "model-00001-of-00002.safetensors",
666
+ "vlm.vision_tokenizer.layers.1.0.norm_latents.bias": "model-00001-of-00002.safetensors",
667
+ "vlm.vision_tokenizer.layers.1.0.norm_latents.weight": "model-00001-of-00002.safetensors",
668
+ "vlm.vision_tokenizer.layers.1.0.norm_media.bias": "model-00001-of-00002.safetensors",
669
+ "vlm.vision_tokenizer.layers.1.0.norm_media.weight": "model-00001-of-00002.safetensors",
670
+ "vlm.vision_tokenizer.layers.1.0.to_kv.weight": "model-00001-of-00002.safetensors",
671
+ "vlm.vision_tokenizer.layers.1.0.to_out.weight": "model-00001-of-00002.safetensors",
672
+ "vlm.vision_tokenizer.layers.1.0.to_q.weight": "model-00001-of-00002.safetensors",
673
+ "vlm.vision_tokenizer.layers.1.1.0.bias": "model-00001-of-00002.safetensors",
674
+ "vlm.vision_tokenizer.layers.1.1.0.weight": "model-00001-of-00002.safetensors",
675
+ "vlm.vision_tokenizer.layers.1.1.1.weight": "model-00001-of-00002.safetensors",
676
+ "vlm.vision_tokenizer.layers.1.1.3.weight": "model-00001-of-00002.safetensors",
677
+ "vlm.vision_tokenizer.layers.2.0.norm_latents.bias": "model-00001-of-00002.safetensors",
678
+ "vlm.vision_tokenizer.layers.2.0.norm_latents.weight": "model-00001-of-00002.safetensors",
679
+ "vlm.vision_tokenizer.layers.2.0.norm_media.bias": "model-00001-of-00002.safetensors",
680
+ "vlm.vision_tokenizer.layers.2.0.norm_media.weight": "model-00001-of-00002.safetensors",
681
+ "vlm.vision_tokenizer.layers.2.0.to_kv.weight": "model-00001-of-00002.safetensors",
682
+ "vlm.vision_tokenizer.layers.2.0.to_out.weight": "model-00001-of-00002.safetensors",
683
+ "vlm.vision_tokenizer.layers.2.0.to_q.weight": "model-00001-of-00002.safetensors",
684
+ "vlm.vision_tokenizer.layers.2.1.0.bias": "model-00001-of-00002.safetensors",
685
+ "vlm.vision_tokenizer.layers.2.1.0.weight": "model-00001-of-00002.safetensors",
686
+ "vlm.vision_tokenizer.layers.2.1.1.weight": "model-00001-of-00002.safetensors",
687
+ "vlm.vision_tokenizer.layers.2.1.3.weight": "model-00001-of-00002.safetensors",
688
+ "vlm.vision_tokenizer.layers.3.0.norm_latents.bias": "model-00001-of-00002.safetensors",
689
+ "vlm.vision_tokenizer.layers.3.0.norm_latents.weight": "model-00001-of-00002.safetensors",
690
+ "vlm.vision_tokenizer.layers.3.0.norm_media.bias": "model-00001-of-00002.safetensors",
691
+ "vlm.vision_tokenizer.layers.3.0.norm_media.weight": "model-00001-of-00002.safetensors",
692
+ "vlm.vision_tokenizer.layers.3.0.to_kv.weight": "model-00001-of-00002.safetensors",
693
+ "vlm.vision_tokenizer.layers.3.0.to_out.weight": "model-00001-of-00002.safetensors",
694
+ "vlm.vision_tokenizer.layers.3.0.to_q.weight": "model-00001-of-00002.safetensors",
695
+ "vlm.vision_tokenizer.layers.3.1.0.bias": "model-00001-of-00002.safetensors",
696
+ "vlm.vision_tokenizer.layers.3.1.0.weight": "model-00001-of-00002.safetensors",
697
+ "vlm.vision_tokenizer.layers.3.1.1.weight": "model-00001-of-00002.safetensors",
698
+ "vlm.vision_tokenizer.layers.3.1.3.weight": "model-00001-of-00002.safetensors",
699
+ "vlm.vision_tokenizer.layers.4.0.norm_latents.bias": "model-00001-of-00002.safetensors",
700
+ "vlm.vision_tokenizer.layers.4.0.norm_latents.weight": "model-00001-of-00002.safetensors",
701
+ "vlm.vision_tokenizer.layers.4.0.norm_media.bias": "model-00001-of-00002.safetensors",
702
+ "vlm.vision_tokenizer.layers.4.0.norm_media.weight": "model-00001-of-00002.safetensors",
703
+ "vlm.vision_tokenizer.layers.4.0.to_kv.weight": "model-00001-of-00002.safetensors",
704
+ "vlm.vision_tokenizer.layers.4.0.to_out.weight": "model-00001-of-00002.safetensors",
705
+ "vlm.vision_tokenizer.layers.4.0.to_q.weight": "model-00001-of-00002.safetensors",
706
+ "vlm.vision_tokenizer.layers.4.1.0.bias": "model-00001-of-00002.safetensors",
707
+ "vlm.vision_tokenizer.layers.4.1.0.weight": "model-00001-of-00002.safetensors",
708
+ "vlm.vision_tokenizer.layers.4.1.1.weight": "model-00001-of-00002.safetensors",
709
+ "vlm.vision_tokenizer.layers.4.1.3.weight": "model-00001-of-00002.safetensors",
710
+ "vlm.vision_tokenizer.layers.5.0.norm_latents.bias": "model-00001-of-00002.safetensors",
711
+ "vlm.vision_tokenizer.layers.5.0.norm_latents.weight": "model-00001-of-00002.safetensors",
712
+ "vlm.vision_tokenizer.layers.5.0.norm_media.bias": "model-00001-of-00002.safetensors",
713
+ "vlm.vision_tokenizer.layers.5.0.norm_media.weight": "model-00001-of-00002.safetensors",
714
+ "vlm.vision_tokenizer.layers.5.0.to_kv.weight": "model-00001-of-00002.safetensors",
715
+ "vlm.vision_tokenizer.layers.5.0.to_out.weight": "model-00001-of-00002.safetensors",
716
+ "vlm.vision_tokenizer.layers.5.0.to_q.weight": "model-00001-of-00002.safetensors",
717
+ "vlm.vision_tokenizer.layers.5.1.0.bias": "model-00001-of-00002.safetensors",
718
+ "vlm.vision_tokenizer.layers.5.1.0.weight": "model-00001-of-00002.safetensors",
719
+ "vlm.vision_tokenizer.layers.5.1.1.weight": "model-00001-of-00002.safetensors",
720
+ "vlm.vision_tokenizer.layers.5.1.3.weight": "model-00001-of-00002.safetensors",
721
+ "vlm.vision_tokenizer.norm.bias": "model-00001-of-00002.safetensors",
722
+ "vlm.vision_tokenizer.norm.weight": "model-00001-of-00002.safetensors",
723
+ "vlm.vision_tokenizer.projection.bias": "model-00001-of-00002.safetensors",
724
+ "vlm.vision_tokenizer.projection.weight": "model-00001-of-00002.safetensors"
725
+ }
726
+ }
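The `weight_map` above assigns each parameter name to the shard file that stores it; `transformers` resolves these shards automatically during `from_pretrained`. As a minimal illustrative sketch (assuming the index and both shard files have been downloaded locally), a single tensor can also be pulled from the right shard by hand:

import json
from safetensors import safe_open

# Read the shard assignment for every parameter.
with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

name = "vlm.vision_tokenizer.latents"        # any key listed in the weight_map above
shard = weight_map[name]                     # e.g. "model-00001-of-00002.safetensors"
with safe_open(shard, framework="pt") as shard_file:
    latents = shard_file.get_tensor(name)    # the torch.Tensor for that parameter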
modeling_xgenmm.py ADDED
@@ -0,0 +1,2107 @@
1
+ import ast
2
+ import math
3
+ from einops import rearrange, repeat
4
+ from einops_exts import rearrange_many
6
+ from PIL import Image
7
+ import torch
8
+ from torch import einsum, nn
9
+
10
+
11
+ from typing import List, Optional, Tuple, Union
12
+ import torch.nn.functional as F
13
+ from transformers.modeling_outputs import CausalLMOutputWithPast
14
+ from dataclasses import dataclass
15
+ from transformers import CLIPVisionModel
16
+ from transformers import PreTrainedModel, AutoModelForCausalLM, AutoModel
17
+ from transformers import PretrainedConfig, logging, CONFIG_MAPPING
18
+ from transformers.models.siglip.modeling_siglip import SiglipVisionTransformer
19
+
20
+
21
+ logger = logging.get_logger(__name__)
22
+
23
+
24
+ class XGenMMVisionEncoderConfig(PretrainedConfig):
25
+ model_type = "xgenmm_vision_encoder"
26
+
27
+ def __init__(
28
+ self,
29
+ model_name: str = "google/siglip-so400m-patch14-384",
30
+ anyres_grids: list[list[int]] = [
31
+ [384, 768],
32
+ [768, 384],
33
+ [768, 768],
34
+ [1152, 384],
35
+ [384, 1152],
36
+ ],
37
+ **kwargs,
38
+ ):
39
+ self.model_name = model_name
40
+ self.anyres_grids = anyres_grids
41
+ super().__init__(**kwargs)
42
+
43
+
44
+ class XGenMMVisionTokenizerConfig(PretrainedConfig):
45
+ model_type = "xgenmm_vision_tokenizer"
46
+
47
+ def __init__(
48
+ self,
49
+ vis_feature_dim: int = 1152,
50
+ lang_embedding_dim: int = 3072,
51
+ num_vis_tokens: int = 128,
52
+ image_aspect_ratio: str = "anyres",
53
+ **kwargs,
54
+ ):
55
+ self.vis_feature_dim = vis_feature_dim
56
+ self.lang_embedding_dim = lang_embedding_dim
57
+ self.num_vis_tokens = num_vis_tokens
58
+ self.image_aspect_ratio = image_aspect_ratio
59
+ super().__init__(**kwargs)
60
+
61
+
62
+ class XGenMMConfig(PretrainedConfig):
63
+ model_type = "xgenmm"
64
+
65
+ def __init__(
66
+ self,
67
+ vision_encoder_config: dict = None,
68
+ vision_tokenizer_config: dict = None,
69
+ text_config: dict = None,
70
+ **kwargs,
71
+ ):
72
+
73
+ if vision_encoder_config is None:
74
+ vision_encoder_config = {
75
+ "image_aspect_ratio": "anyres",
76
+ "anyres_patch_sampling": True,
77
+ }
78
+ logger.info(
79
+ "vision_encoder_config is None. Initializing the XGenMMVisionEncoderConfig with default values."
80
+ )
81
+
82
+ if vision_tokenizer_config is None:
83
+ vision_tokenizer_config = {}
84
+ logger.info(
85
+ "vision_tokenizer_config is None. Initializing the XGenMMVisionTokenizerConfig with default values."
86
+ )
87
+
88
+ if text_config is None:
89
+ text_config = {
90
+ "initial_tokenizer_len": 32012,
91
+ "pad_token_id": 32011,
92
+ "bos_token_id": 1,
93
+ "eos_token_id": 32000,
94
+ "vocab_size": 32064,
95
+ "hidden_size": 3072,
96
+ "intermediate_size": 8192,
97
+ "num_hidden_layers": 32,
98
+ "num_attention_heads": 32,
99
+ "num_key_value_heads": 32,
100
+ "resid_pdrop": 0.0,
101
+ "embd_pdrop": 0.0,
102
+ "attention_dropout": 0.0,
103
+ "hidden_act": "silu",
104
+ "max_position_embeddings": 4096,
105
+ "original_max_position_embeddings": 4096,
106
+ "initializer_range": 0.02,
107
+ "rms_norm_eps": 1e-05,
108
+ "use_cache": True,
109
+ "rope_theta": 10000.0,
110
+ "rope_scaling": None,
111
+ "sliding_window": 2047,
112
+ "return_dict": True,
113
+ "output_hidden_states": False,
114
+ "output_attentions": False,
115
+ "torchscript": False,
116
+ "torch_dtype": "bfloat16",
117
+ "use_bfloat16": False,
118
+ "tf_legacy_loss": False,
119
+ "pruned_heads": {},
120
+ "tie_word_embeddings": False,
121
+ "chunk_size_feed_forward": 0,
122
+ "is_encoder_decoder": False,
123
+ "is_decoder": False,
124
+ "cross_attention_hidden_size": None,
125
+ "add_cross_attention": False,
126
+ "tie_encoder_decoder": False,
127
+ "max_length": 20,
128
+ "min_length": 0,
129
+ "do_sample": False,
130
+ "early_stopping": False,
131
+ "num_beams": 1,
132
+ "num_beam_groups": 1,
133
+ "diversity_penalty": 0.0,
134
+ "temperature": 1.0,
135
+ "top_k": 50,
136
+ "top_p": 1.0,
137
+ "typical_p": 1.0,
138
+ "repetition_penalty": 1.0,
139
+ "length_penalty": 1.0,
140
+ "no_repeat_ngram_size": 0,
141
+ "encoder_no_repeat_ngram_size": 0,
142
+ "bad_words_ids": None,
143
+ "num_return_sequences": 1,
144
+ "output_scores": False,
145
+ "return_dict_in_generate": False,
146
+ "forced_bos_token_id": None,
147
+ "forced_eos_token_id": None,
148
+ "remove_invalid_values": False,
149
+ "exponential_decay_length_penalty": None,
150
+ "suppress_tokens": None,
151
+ "begin_suppress_tokens": None,
152
+ "finetuning_task": None,
153
+ "id2label": {0: "LABEL_0", 1: "LABEL_1"},
154
+ "label2id": {"LABEL_0": 0, "LABEL_1": 1},
155
+ "tokenizer_class": None,
156
+ "prefix": None,
157
+ "bos_token_id": 1,
158
+ "pad_token_id": 32000,
159
+ "eos_token_id": 32000,
160
+ "sep_token_id": None,
161
+ "decoder_start_token_id": None,
162
+ "task_specific_params": None,
163
+ "problem_type": None,
164
+ "model_type": "phi3",
165
+ "_attn_implementation": "flash_attention_2",
166
+ }
167
+ logger.info(
168
+ "text_config is None. Initializing the text config with default values (`Phi3Config`)."
169
+ )
170
+
171
+ self.vision_encoder_config = XGenMMVisionEncoderConfig(**vision_encoder_config)
172
+
173
+ self.vision_tokenizer_config = XGenMMVisionTokenizerConfig(
174
+ **vision_tokenizer_config
175
+ )
176
+
177
+ text_model_type = (
178
+ text_config["model_type"] if "model_type" in text_config else "phi3"
179
+ )
180
+ self.text_config = CONFIG_MAPPING[text_model_type](**text_config)
181
+
182
+ for key in ["initial_tokenizer_len", "pad_token_id"]:
183
+ if key not in self.text_config.to_dict():
184
+ raise ValueError(f"The key `{key}` is missing in the text_config.")
185
+
186
+ super().__init__(**kwargs)
187
+
188
+
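# Illustrative usage (editor's sketch, not part of the original file): called with
# no arguments, the three sub-configs above are built from their default dicts, e.g.
#   cfg = XGenMMConfig()
#   cfg.vision_encoder_config.model_name         # "google/siglip-so400m-patch14-384"
#   cfg.vision_tokenizer_config.num_vis_tokens   # 128
#   cfg.text_config.model_type                   # "phi3"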
189
+ def hasattr_recursive(obj, att):
190
+ """
191
+ Check if obj has nested attribute
192
+ Example: hasattr_recursive(obj, 'a.b.c') is equivalent to hasattr(obj, 'a') and hasattr(obj.a, 'b') and hasattr(obj.a.b, 'c')
193
+ """
194
+ if att == "":
195
+ return True
196
+ i = att.find(".")
197
+ if i < 0:
198
+ return hasattr(obj, att)
199
+ else:
200
+ try:
201
+ return hasattr_recursive(getattr(obj, att[:i]), att[i + 1 :])
202
+ except AttributeError:
203
+ return False
204
+
205
+
206
+ def getattr_recursive(obj, att):
207
+ """
208
+ Return nested attribute of obj
209
+ Example: getattr_recursive(obj, 'a.b.c') is equivalent to obj.a.b.c
210
+ """
211
+ if att == "":
212
+ return obj
213
+ i = att.find(".")
214
+ if i < 0:
215
+ return getattr(obj, att)
216
+ else:
217
+ return getattr_recursive(getattr(obj, att[:i]), att[i + 1 :])
218
+
219
+
220
+ def setattr_recursive(obj, att, val):
221
+ """
222
+ Set nested attribute of obj
223
+ Example: setattr_recursive(obj, 'a.b.c', val) is equivalent to obj.a.b.c = val
224
+ """
225
+ if "." in att:
226
+ obj = getattr_recursive(obj, ".".join(att.split(".")[:-1]))
227
+ setattr(obj, att.split(".")[-1], val)
228
+
229
+
230
+ def check_embedding_fns(lang_model):
231
+ """Checks for and attempts to set {get/set}_{input/output}_embeddings functions to the model"""
232
+ if not has_fn(lang_model, "get_input_embeddings"):
233
+ if hasattr_recursive(lang_model, "transformer.wte"): # MPT
234
+ lang_model.get_input_embeddings = lambda: lang_model.transformer.wte
235
+ elif hasattr_recursive(lang_model, "model.decoder.embed_tokens"): # OPT
236
+ lang_model.get_input_embeddings = lambda: lang_model.model.decoder.embed_tokens
237
+ else:
238
+ raise ValueError(
239
+ "We require the language encoder to have a get_input_embeddings method but we couldn't determine the name of the input embeddings attribute. Please supply this manually in factory.py."
240
+ )
241
+
242
+ if not has_fn(lang_model, "set_input_embeddings"):
243
+ if hasattr_recursive(lang_model, "transformer.wte"): # MPT
244
+ lang_model.set_input_embeddings = lambda x: setattr_recursive(
245
+ lang_model, "transformer.wte", x
246
+ )
247
+ elif hasattr_recursive(lang_model, "model.decoder.embed_tokens"): # OPT
248
+ lang_model.set_input_embeddings = lambda x: setattr_recursive(
249
+ lang_model, "model.decoder.embed_tokens", x
250
+ )
251
+ else:
252
+ raise ValueError(
253
+ "We require the language encoder to have a set_input_embeddings method but we couldn't determine the name of the input embeddings attribute. Please supply this manually in factory.py."
254
+ )
255
+
256
+ if not has_fn(lang_model, "get_output_embeddings"):
257
+ if hasattr_recursive(lang_model, "lm_head"):
258
+ lang_model.get_output_embeddings = lambda: lang_model.lm_head
259
+ else:
260
+ raise ValueError(
261
+ "We require the language encoder to have a get_output_embeddings method but we couldn't determine the name of the output embeddings attribute. Please supply this manually in factory.py."
262
+ )
263
+
264
+ if not has_fn(lang_model, "set_output_embeddings"):
265
+ if hasattr_recursive(lang_model, "lm_head"):
266
+ lang_model.set_output_embeddings = lambda x: setattr_recursive(
267
+ lang_model, "lm_head", x
268
+ )
269
+ else:
270
+ raise ValueError(
271
+ "We require the language encoder to have a set_output_embeddings method but we couldn't determine the name of the output embeddings attribute. Please supply this manually in factory.py."
272
+ )
273
+
274
+
275
+ def has_fn(model, fn_name):
276
+ """Check if model has a function fn_name"""
277
+ return callable(getattr(model, fn_name, None))
278
+
279
+
280
+ def stack_with_padding(list_of_tensors, padding_value=0, padding_side="right"):
281
+ """
282
+ Stack a list of tensors with padding on one side
283
+ Args:
284
+ list_of_tensors (list[torch.Tensor]): List of tensors to stack
285
+ padding_value (int, optional): Value to pad with. Defaults to 0.
286
+ padding_side (str, optional): Side to pad on. Defaults to "right".
287
+ Returns:
288
+ torch.Tensor: Stacked tensors
289
+ """
290
+ max_tokens = max(tensor.size(0) for tensor in list_of_tensors)
291
+ padded_tensors = []
292
+ for tensor in list_of_tensors:
293
+ num_tokens = tensor.size(0)
294
+ if len(tensor.size()) == 1:
295
+ padding = torch.full(
296
+ (max_tokens - num_tokens,),
297
+ padding_value,
298
+ dtype=tensor.dtype,
299
+ device=tensor.device,
300
+ )
301
+ else:
302
+ padding = torch.full(
303
+ (max_tokens - num_tokens, tensor.size(1)),
304
+ padding_value,
305
+ dtype=tensor.dtype,
306
+ device=tensor.device,
307
+ )
308
+ padded_tensor = (
309
+ torch.cat((tensor, padding), dim=0)
310
+ if padding_side == "right"
311
+ else torch.cat((padding, tensor), dim=0)
312
+ )
313
+ padded_tensors.append(padded_tensor)
314
+ return torch.stack(padded_tensors)
315
+
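# Editor's note (illustrative, not part of the original file): stacking a length-3
# and a length-5 1-D tensor right-pads the shorter one with `padding_value`, giving
# a (2, 5) batch:
#   stack_with_padding([torch.ones(3), torch.ones(5)], padding_value=0)
#   # -> [[1., 1., 1., 0., 0.],
#   #     [1., 1., 1., 1., 1.]]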
316
+
317
+ def unpad_image(tensor, original_size, keep_original_shape=False):
318
+ """
319
+ Unpads a PyTorch tensor of a padded and resized image.
320
+
321
+ Args:
322
+ tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format.
323
+ original_size (tuple): The original size of the image (width, height).
324
+
325
+ Returns:
326
+ torch.Tensor: The unpadded image tensor.
327
+ """
328
+ original_width, original_height = original_size
329
+ current_height, current_width = tensor.shape[1:]
330
+
331
+ original_aspect_ratio = original_width / original_height
332
+ current_aspect_ratio = current_width / current_height
333
+
334
+ if original_aspect_ratio > current_aspect_ratio:
335
+ scale_factor = current_width / original_width
336
+ new_height = int(original_height * scale_factor)
337
+ padding = (current_height - new_height) // 2
338
+ if keep_original_shape:
339
+ attention_mask = torch.ones(
340
+ (current_height, current_width), device=tensor.device
341
+ )
342
+ attention_mask[:padding, :] = 0
343
+ attention_mask[current_height - padding :, :] = 0
344
+ return tensor, attention_mask
345
+ else:
346
+ unpadded_tensor = tensor[:, padding : current_height - padding, :]
347
+ return unpadded_tensor, None
348
+ else:
349
+ scale_factor = current_height / original_height
350
+ new_width = int(original_width * scale_factor)
351
+ padding = (current_width - new_width) // 2
352
+ if keep_original_shape:
353
+ attention_mask = torch.ones(
354
+ (current_height, current_width), device=tensor.device
355
+ )
356
+ attention_mask[:, :padding] = 0
357
+ attention_mask[:, current_width - padding :] = 0
358
+ return tensor, attention_mask
359
+ else:
360
+ unpadded_tensor = tensor[:, :, padding : current_width - padding]
361
+ return unpadded_tensor, None
362
+
363
+
364
+ def select_best_resolution(original_size, possible_resolutions):
365
+ """
366
+ Selects the best resolution from a list of possible resolutions based on the original size.
367
+
368
+ Args:
369
+ original_size (tuple): The original size of the image in the format (width, height).
370
+ possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
371
+
372
+ Returns:
373
+ tuple: The best fit resolution in the format (width, height).
374
+ """
375
+ original_width, original_height = original_size
376
+ best_fit = None
377
+ max_effective_resolution = 0
378
+ min_wasted_resolution = float("inf")
379
+
380
+ for width, height in possible_resolutions:
381
+ scale = min(width / original_width, height / original_height)
382
+ downscaled_width, downscaled_height = int(original_width * scale), int(
383
+ original_height * scale
384
+ )
385
+ effective_resolution = min(
386
+ downscaled_width * downscaled_height, original_width * original_height
387
+ )
388
+ wasted_resolution = (width * height) - effective_resolution
389
+
390
+ if effective_resolution > max_effective_resolution or (
391
+ effective_resolution == max_effective_resolution
392
+ and wasted_resolution < min_wasted_resolution
393
+ ):
394
+ max_effective_resolution = effective_resolution
395
+ min_wasted_resolution = wasted_resolution
396
+ best_fit = (width, height)
397
+
398
+ return best_fit
399
+
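# Worked example (editor's note, not part of the original file): for a 1000x600
# image and the default anyres grids, the 768x768 canvas wins because aspect-
# preserving downscaling keeps 768*460 ≈ 353k effective pixels there, versus
# 640*384 ≈ 246k on the 768x384 and 1152x384 canvases:
#   select_best_resolution((1000, 600), [[384, 768], [768, 384], [768, 768],
#                                        [1152, 384], [384, 1152]])  # -> (768, 768)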
400
+
401
+ def resize_and_pad_image(image, target_resolution):
402
+ """
403
+ Resize and pad an image to a target resolution while maintaining aspect ratio.
404
+
405
+ Args:
406
+ image (PIL.Image.Image): The input image.
407
+ target_resolution (tuple): The target resolution (width, height) of the image.
408
+
409
+ Returns:
410
+ PIL.Image.Image: The resized and padded image.
411
+ """
412
+ original_width, original_height = image.size
413
+ target_width, target_height = target_resolution
414
+
415
+ scale_w = target_width / original_width
416
+ scale_h = target_height / original_height
417
+
418
+ if scale_w < scale_h:
419
+ new_width = target_width
420
+ new_height = min(math.ceil(original_height * scale_w), target_height)
421
+ else:
422
+ new_height = target_height
423
+ new_width = min(math.ceil(original_width * scale_h), target_width)
424
+
425
+ # Resize the image
426
+ resized_image = image.resize((new_width, new_height))
427
+
428
+ new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0))
429
+ paste_x = (target_width - new_width) // 2
430
+ paste_y = (target_height - new_height) // 2
431
+ new_image.paste(resized_image, (paste_x, paste_y))
432
+
433
+ return new_image
434
+
435
+
436
+ def divide_to_patches(image, patch_size):
437
+ """
438
+ Divides an image into patches of a specified size.
439
+
440
+ Args:
441
+ image (PIL.Image.Image): The input image.
442
+ patch_size (int): The size of each patch.
443
+
444
+ Returns:
445
+ list: A list of PIL.Image.Image objects representing the patches.
446
+ """
447
+ patches = []
448
+ width, height = image.size
449
+ for i in range(0, height, patch_size):
450
+ for j in range(0, width, patch_size):
451
+ box = (j, i, j + patch_size, i + patch_size)
452
+ patch = image.crop(box)
453
+ patches.append(patch)
454
+
455
+ return patches
456
+
457
+
458
+ def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
459
+ """
460
+ Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
461
+
462
+ Args:
463
+ image_size (tuple): The size of the input image in the format (width, height).
464
+ grid_pinpoints (str): A string representation of a list of possible resolutions.
465
+ patch_size (int): The size of each image patch.
466
+
467
+ Returns:
468
+ tuple: The shape of the image patch grid in the format (width, height).
469
+ """
470
+ if type(grid_pinpoints) is list:
471
+ possible_resolutions = grid_pinpoints
472
+ else:
473
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
474
+ width, height = select_best_resolution(image_size, possible_resolutions)
475
+ return width // patch_size, height // patch_size
476
+
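# Continuing the worked example above (editor's note, not part of the original
# file): with patch_size=384, the best-fit 768x768 canvas corresponds to a 2x2
# patch grid:
#   get_anyres_image_grid_shape((1000, 600), [[384, 768], [768, 384], [768, 768],
#                                             [1152, 384], [384, 1152]], 384)  # -> (2, 2)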
477
+
478
+ def process_anyres_image(image, processor, grid_pinpoints):
479
+ """
480
+ Process an image with variable resolutions.
481
+
482
+ Args:
483
+ image (PIL.Image.Image): The input image to be processed.
484
+ processor: The image processor object.
485
+ grid_pinpoints (str): A string representation of a list of possible resolutions.
486
+
487
+ Returns:
488
+ torch.Tensor: A tensor containing the processed image patches.
489
+ """
490
+ # FIXME: determine grid_pinpoints from image sizes.
491
+ if type(grid_pinpoints) is list:
492
+ possible_resolutions = grid_pinpoints
493
+ else:
494
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
495
+ best_resolution = select_best_resolution(image.size, possible_resolutions)
496
+ image_padded = resize_and_pad_image(image, best_resolution)
497
+
498
+ processor_size = processor.transforms[0].size
499
+ patches = divide_to_patches(image_padded, processor_size[0])
500
+
501
+ image_original_resize = image.resize((processor_size[0], processor_size[0]))
502
+
503
+ image_patches = [image_original_resize] + patches
504
+ image_patches = [processor(image_patch) for image_patch in image_patches]
505
+ return torch.stack(image_patches, dim=0)
506
+
507
+
508
+ def expand2square(pil_img, background_color):
509
+ width, height = pil_img.size
510
+ if width == height:
511
+ return pil_img
512
+ elif width > height:
513
+ result = Image.new(pil_img.mode, (width, width), background_color)
514
+ result.paste(pil_img, (0, (width - height) // 2))
515
+ return result
516
+ else:
517
+ result = Image.new(pil_img.mode, (height, height), background_color)
518
+ result.paste(pil_img, ((height - width) // 2, 0))
519
+ return result
520
+
521
+
522
+ class VisionTokenizer(nn.Module):
523
+ def __init__(self, dim_media, num_tokens_per_media):
524
+ super().__init__()
525
+ self.dim_media = dim_media
526
+ self.num_tokens_per_media = num_tokens_per_media
527
+
528
+
529
+ class PerceiverAttention(nn.Module):
530
+ def __init__(self, *, dim, dim_head=64, heads=8):
531
+ super().__init__()
532
+ self.scale = dim_head**-0.5
533
+ self.heads = heads
534
+ inner_dim = dim_head * heads
535
+
536
+ self.norm_media = nn.LayerNorm(dim)
537
+ self.norm_latents = nn.LayerNorm(dim)
538
+
539
+ self.to_q = nn.Linear(dim, inner_dim, bias=False)
540
+ self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
541
+ self.to_out = nn.Linear(inner_dim, dim, bias=False)
542
+
543
+ def forward(self, x, latents, vision_attn_masks=None):
544
+ """
545
+ Args:
546
+ x (torch.Tensor): image features
547
+ shape (b, T, n1, D)
548
+ latents (torch.Tensor): latent features
549
+ shape (b, T, n2, D)
550
+ """
551
+ x = self.norm_media(x)
552
+ latents = self.norm_latents(latents)
553
+
554
+ h = self.heads
555
+
556
+ q = self.to_q(latents)
557
+ kv_input = torch.cat(
558
+ (x, latents), dim=-2
559
+ ) # TODO: Change the shape of vision attention mask according to this.
560
+ if vision_attn_masks is not None:
561
+ vision_attn_masks = torch.cat(
562
+ (
563
+ vision_attn_masks,
564
+ torch.ones(
565
+ (latents.shape[0], latents.shape[-2]),
566
+ dtype=latents.dtype,
567
+ device=latents.device,
568
+ ),
569
+ ),
570
+ dim=-1,
571
+ )
572
+ k, v = self.to_kv(kv_input).chunk(2, dim=-1)
573
+ q, k, v = rearrange_many((q, k, v), "b t n (h d) -> b h t n d", h=h)
574
+ q = q * self.scale
575
+
576
+ # attention
577
+ sim = einsum("... i d, ... j d -> ... i j", q, k)
578
+ # Apply vision attention mask here.
579
+ # Reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention
580
+ if vision_attn_masks is not None:
581
+ attn_bias = torch.zeros(
582
+ (q.size(0), 1, 1, q.size(-2), k.size(-2)),
583
+ dtype=q.dtype,
584
+ device=q.device,
585
+ )
586
+ vision_attn_masks = repeat(
587
+ vision_attn_masks, "b n -> b 1 1 l n", l=q.size(-2)
588
+ )
589
+ attn_bias.masked_fill_(vision_attn_masks.logical_not(), float("-inf"))
590
+ sim += attn_bias
591
+
592
+ sim = sim - sim.amax(dim=-1, keepdim=True).detach()
593
+ attn = sim.softmax(dim=-1)
594
+
595
+ out = einsum("... i j, ... j d -> ... i d", attn, v)
596
+ out = rearrange(out, "b h t n d -> b t n (h d)", h=h)
597
+ return self.to_out(out)
598
+
599
+
600
+ def FeedForward(dim, mult=4):
601
+ inner_dim = int(dim * mult)
602
+ return nn.Sequential(
603
+ nn.LayerNorm(dim),
604
+ nn.Linear(dim, inner_dim, bias=False),
605
+ nn.GELU(),
606
+ nn.Linear(inner_dim, dim, bias=False),
607
+ )
608
+
609
+
610
+ def num_params(module, filter_to_trainable=False):
611
+ """Returns the number of parameters in the module, or optionally only the trainable parameters"""
612
+ if filter_to_trainable:
613
+ return sum(p.numel() for p in module.parameters() if p.requires_grad)
614
+ else:
615
+ return sum(p.numel() for p in module.parameters())
616
+
617
+
618
+ class PerceiverResampler(VisionTokenizer):
619
+ def __init__(
620
+ self,
621
+ *,
622
+ dim,
623
+ dim_inner=None,
624
+ depth=6,
625
+ dim_head=96,
626
+ heads=16,
627
+ num_latents=128,
628
+ max_num_media=None,
629
+ max_num_frames=None,
630
+ ff_mult=4,
631
+ ):
632
+ """
633
+ Perceiver module which takes in image features and outputs image tokens.
634
+ Args:
635
+ dim (int): dimension of the incoming image features
636
+ dim_inner (int, optional): final dimension to project the incoming image features to;
637
+ also the final dimension of the outputted features. If None, no projection is used, and dim_inner = dim.
638
+ depth (int, optional): number of layers. Defaults to 6.
639
+ dim_head (int, optional): dimension of each head. Defaults to 96.
640
+ heads (int, optional): number of heads. Defaults to 16.
641
+ num_latents (int, optional): number of latent tokens to use in the Perceiver;
642
+ also corresponds to the number of tokens per sequence to output. Defaults to 128.
643
+ max_num_media (int, optional): maximum number of media per sequence to input into the Perceiver
644
+ and keep positional embeddings for. If None, no positional embeddings are used.
645
+ max_num_frames (int, optional): maximum number of frames to input into the Perceiver
646
+ and keep positional embeddings for. If None, no positional embeddings are used.
647
+ ff_mult (int, optional): dimension multiplier for the feedforward network. Defaults to 4.
648
+ """
649
+ if dim_inner is not None:
650
+ projection = nn.Linear(dim, dim_inner)
651
+ else:
652
+ projection = None
653
+ dim_inner = dim
654
+ super().__init__(dim_media=dim, num_tokens_per_media=num_latents)
655
+ self.projection = projection
656
+ self.latents = nn.Parameter(torch.randn(num_latents, dim))
657
+
658
+ # positional embeddings
659
+ self.frame_embs = (
660
+ nn.Parameter(torch.randn(max_num_frames, dim))
661
+ if exists(max_num_frames)
662
+ else None
663
+ )
664
+ self.media_time_embs = (
665
+ nn.Parameter(torch.randn(max_num_media, 1, dim))
666
+ if exists(max_num_media)
667
+ else None
668
+ )
669
+
670
+ self.layers = nn.ModuleList([])
671
+ for _ in range(depth):
672
+ self.layers.append(
673
+ nn.ModuleList(
674
+ [
675
+ PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
676
+ FeedForward(dim=dim, mult=ff_mult),
677
+ ]
678
+ )
679
+ )
680
+
681
+ self.norm = nn.LayerNorm(dim)
682
+
683
+ def forward(self, x, vision_attn_masks):
684
+ """
685
+ Args:
686
+ x (torch.Tensor): image features
687
+ shape (b, T, F, v, D)
688
+ vision_attn_masks (torch.Tensor): attention masks for padded vision tokens (i.e., x)
689
+ shape (b, v)
690
+ Returns:
691
+ shape (b, T, n, D) where n is self.num_latents
692
+ """
693
+ b, T, F, v = x.shape[:4]
694
+
695
+ # frame and media time embeddings
696
+ if exists(self.frame_embs):
697
+ frame_embs = repeat(self.frame_embs[:F], "F d -> b T F v d", b=b, T=T, v=v)
698
+ x = x + frame_embs
699
+ x = rearrange(
700
+ x, "b T F v d -> b T (F v) d"
701
+ ) # flatten the frame and spatial dimensions
702
+ if exists(self.media_time_embs):
703
+ x = x + self.media_time_embs[:T]
704
+
705
+ # blocks
706
+ latents = self.latents
707
+ latents = repeat(latents, "n d -> b T n d", b=b, T=T)
708
+ for attn, ff in self.layers:
709
+ latents = attn(x, latents, vision_attn_masks) + latents
710
+ latents = ff(latents) + latents
711
+
712
+ if exists(self.projection):
713
+ return self.projection(self.norm(latents))
714
+ else:
715
+ return self.norm(latents)
716
+
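# Shape note (editor's sketch, not part of the original file): with the default
# vision stack (dim=1152 SigLIP features, num_latents=128), an input of shape
# (b, T, F=1, v, 1152) is compressed to (b, T, 128, 1152); if dim_inner is set to
# the language embedding width (3072 in the default tokenizer config), the final
# projection maps the output to (b, T, 128, 3072).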
717
+
718
+ class DecoupledEmbedding(nn.Embedding):
719
+ # Derived from https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding
720
+ """
721
+ Implements a decoupling of parameters to allow freezing (or not) a subset of the embeddings. In practice, the
722
+ regular `weight` can be trained or frozen (i.e. `partially_freeze=True`), and if `num_additional_embeddings` > 0,
723
+ then it will create `num_additional_embeddings` additional parameters that are always trained. If
724
+ `num_additional_embeddings=0`, then the module defaults back to the regular behavior of `nn.Embedding`.
725
+ """
726
+
727
+ def __init__(
728
+ self,
729
+ max_original_id: int,
730
+ num_additional_embeddings: int = 0,
731
+ _weight: torch.Tensor = None,
732
+ num_original_embeddings: int = None,
733
+ embedding_dim: int = None,
734
+ partially_freeze=True,
735
+ device=None,
736
+ dtype=None,
737
+ pad_token_id=None,
738
+ ) -> None:
739
+ """
740
+ Args:
741
+ max_original_id (`int`):
742
+ The largest token id that should be embedded using the regular embedding (regular `weight`).
743
+ This is usually len(tokenizer) - 1 before additional tokens are added.
744
+ Note that this may not equal self.weight.shape[0]
745
+ num_additional_embeddings (`int`):
746
+ Number of additional tokens to initialize an Embedding matrix for (`additional_weight`).
747
+ _weight (`torch.Tensor`, *optional*, defaults to `None`): The regular weight tensor.
748
+ If provided, this sets the `num_original_embeddings` and `embedding_dim` parameters.
749
+ num_original_embeddings (`int`):
750
+ self.weight.shape[0]
751
+ embedding_dim (`int`):
752
+ The size of each embedding vector
753
+ partially_freeze: (`bool`, *optional*, defaults to `True`):
754
+ If `True`, the regular `weight` will be frozen. `additional_weight` is never frozen.
755
+ pad_token_id (`int`, *optional*):
756
+ The padding index (needs to be less than num_embeddings)
757
+
758
+ Note: there are a lot of other parameters to initialize a standard `nn.Embedding` such as `padding_idx`,
759
+ `max_norm` or `norm_type`. We are not supporting these.
760
+ """
761
+ # validate args
762
+ if pad_token_id is not None and pad_token_id > max_original_id:
763
+ raise ValueError(
764
+ f"pad_token_id must be <= max_original_id. Got {pad_token_id} and {max_original_id}."
765
+ + "If the original tokenizer does not have a pad_token_id, use pad_token_id=None."
766
+ )
767
+ if _weight is not None:
768
+ assert (num_original_embeddings is None) or (
769
+ _weight.shape[0] == num_original_embeddings
770
+ ), f"num_original_embeddings={num_original_embeddings} but _weight.shape[0]={_weight.shape[0]}"
771
+ assert (embedding_dim is None) or (
772
+ _weight.shape[1] == embedding_dim
773
+ ), f"embedding_dim={embedding_dim} but _weight.shape[1]={_weight.shape[1]}"
774
+ num_original_embeddings = _weight.shape[0]
775
+ embedding_dim = _weight.shape[1]
776
+ else:
777
+ assert (
778
+ num_original_embeddings is not None
779
+ ), "num_original_embeddings must be provided if _weight is not provided"
780
+ assert (
781
+ embedding_dim is not None
782
+ ), "embedding_dim must be provided if _weight is not provided"
783
+
784
+ super().__init__(
785
+ num_embeddings=num_original_embeddings,
786
+ embedding_dim=embedding_dim,
787
+ device=device,
788
+ dtype=dtype,
789
+ padding_idx=pad_token_id,
790
+ _weight=_weight,
791
+ )
792
+ self.max_original_id = max_original_id
793
+ self.padding_idx = pad_token_id
794
+ self.num_additional_embeddings = num_additional_embeddings
795
+ if self.num_additional_embeddings > 0:
796
+ self.additional_embedding = nn.Embedding(
797
+ num_embeddings=self.num_additional_embeddings,
798
+ embedding_dim=embedding_dim,
799
+ device=device,
800
+ dtype=dtype,
801
+ )
802
+ self.set_requires_grad(
803
+ require_regular_grad=not partially_freeze, require_additional_grad=True
804
+ )
805
+
806
+ def set_requires_grad(self, require_regular_grad, require_additional_grad):
807
+ """
808
+ Helper function to separately set the requires_grad flag for the regular weight and the additional weight.
809
+ """
810
+ self.weight.requires_grad_(require_regular_grad)
811
+ self.additional_embedding.requires_grad_(require_additional_grad)
812
+
813
+ def forward(self, input_ids):
814
+ """
815
+ we have 2 embeddings, with different indices - one pretrained self.weight and another
816
+ self.additional_embedding.weight that is being trained.
817
+
818
+ in order to make a lookup of the input ids, we:
819
+ 1. find out the indices of the entries belonging to the 2nd embedding
820
+ 2. extract those values while subtracting the size of the first embedding (num_embeddings), since the 2nd
821
+ embedding starts from 0 and not num_embeddings
822
+ 3. perform the 2nd embedding lookup
823
+ 4. now we handle the 1st embedding, we overwrite indices belonging to the 2nd embedding with a padding index
824
+ 5. perform the 1st embedding lookup
825
+ 6. now we overwrite the values in the 1st embedding lookup with the values of the 2nd embedding lookup
826
+
827
+ note: for the 1st embedding lookup we could have looked up only the low indices and skipped the padding, but
828
+ then we would have to allocate a new tensor and populate it from two tensors whose values are spread across
829
+ various indices, i.e. not a simple concat. We have not benchmarked whether that more complex approach is any
830
+ faster; since sequence lengths are usually short, it is probably not faster, or not faster by much, but it
831
+ might be worth measuring.
832
+
833
+ """
834
+ if self.num_additional_embeddings == 0:
835
+ return F.embedding(input_ids, self.weight)
836
+
837
+ # Clone so that we don't modify the original input_ids later on
838
+ input_ids = input_ids.clone()
839
+ additional_vocab_indices = torch.where(input_ids > self.max_original_id)
840
+ input_ids_additional_vocab = input_ids[additional_vocab_indices]
841
+ additional_embeddings = self.additional_embedding(
842
+ input_ids_additional_vocab - self.max_original_id - 1
843
+ )
844
+
845
+ # for successful lookup replace input_ids with 0, the results of these will be discarded anyway
846
+ input_ids[additional_vocab_indices] = 0
847
+ full_vector = F.embedding(input_ids, self.weight)
848
+
849
+ # overwrite the records with high indices
850
+ full_vector[additional_vocab_indices] = additional_embeddings
851
+
852
+ return full_vector
853
+
854
+ def extra_repr(self) -> str:
855
+ return "num_original_embeddings={}, num_additional_embeddings={}, embedding_dim={}, partially_freeze={}".format(
856
+ self.max_original_id + 1,
857
+ self.num_additional_embeddings,
858
+ self.embedding_dim,
859
+ (not self.weight.requires_grad),
860
+ )
861
+
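+ # Minimal usage sketch for DecoupledEmbedding (sizes below are hypothetical): ids <= max_original_id
+ # are looked up in the (optionally frozen) regular weight, ids above it in the always-trainable
+ # additional_embedding, and the two lookups are merged back into one tensor.
+ # >>> import torch
+ # >>> emb = DecoupledEmbedding(max_original_id=9, num_additional_embeddings=2,
+ # ...                          num_original_embeddings=10, embedding_dim=4)
+ # >>> ids = torch.tensor([[1, 9, 10, 11]])   # 10 and 11 route to additional_embedding rows 0 and 1
+ # >>> emb(ids).shape
+ # torch.Size([1, 4, 4])
+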
862
+
863
+ class DecoupledLinear(nn.Linear):
864
+ # Derived from https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
865
+ """
866
+ Implements a decoupling of parameters to allow freezing (or not) a subset of the parameters. In practice, the
867
+ regular `weight` can be trained or frozen (frozen if `partially_freeze=True`), and if `additional_out_features` > 0,
868
+ then it will create `additional_out_features * in_features` additional parameters that are always trained. If
869
+ `additional_out_features=0`, then the module defaults back to the regular behavior of `nn.Linear`.
870
+ """
871
+
872
+ def __init__(
873
+ self,
874
+ max_original_id: int,
875
+ additional_out_features: int = 0,
876
+ _weight: torch.Tensor = None,
877
+ _bias: torch.Tensor = None,
878
+ in_features: int = None,
879
+ original_out_features: int = None,
880
+ bias: bool = True,
881
+ partially_freeze: bool = True,
882
+ device=None,
883
+ dtype=None,
884
+ ) -> None:
885
+ """
886
+ Args:
887
+ max_original_id (`int`): The largest token id that should be extracted from the regular weight.
888
+ This is usually len(tokenizer) - 1 before additional tokens are added.
889
+ Note that this may not equal original_out_features - 1
890
+ _weight: torch.Tensor, *optional*, defaults to `None`. The regular weight tensor.
891
+ If provided, this sets the `in_features` and `original_out_features` parameters.
892
+ _bias: torch.Tensor, *optional*, defaults to `None`. The regular bias tensor.
893
+ in_features: int. Input hidden size.
894
+ original_out_features: int. Original out_features of the language model's get_output_embeddings() function.
895
+ additional_out_features: int. Number of additional trainable dimensions.
896
+ bias: bool. Whether to include a bias term.
897
+ partially_freeze: bool, *optional*, defaults to `True`. If `True`, the regular `weight` will be frozen.
898
+ """
899
+ # argument validation
900
+ if _weight is not None:
901
+ assert (_weight.shape[0] == original_out_features) or (
902
+ original_out_features is None
903
+ ), f"original_out_features={original_out_features} but _weight.shape[0]={_weight.shape[0]}"
904
+ assert (_weight.shape[1] == in_features) or (
905
+ in_features is None
906
+ ), f"in_features={in_features} but _weight.shape[1]={_weight.shape[1]}"
907
+ in_features = _weight.shape[1]
908
+ original_out_features = _weight.shape[0]
909
+ else:
910
+ assert (
911
+ in_features is not None
912
+ ), "in_features must be provided if _weight is not provided"
913
+ assert (
914
+ original_out_features is not None
915
+ ), "original_out_features must be provided if _weight is not provided"
916
+
917
+ if _bias is not None:
918
+ assert bias is True, "bias must be True if _bias is provided"
919
+
920
+ # initialize original linear
921
+ super().__init__(in_features, original_out_features, bias, device, dtype)
922
+
923
+ # set weight and bias manually
924
+ if _weight is not None:
925
+ self.weight = nn.Parameter(_weight)
926
+ if _bias is not None:
927
+ self.bias = nn.Parameter(_bias)
928
+
929
+ self.in_features = in_features
930
+ self.original_out_features = original_out_features
931
+ self.max_original_id = max_original_id
932
+
933
+ # initialize additional linear
934
+ self.additional_out_features = additional_out_features
935
+ self.has_bias = bias
936
+ if additional_out_features > 0:
937
+ self.additional_fc = nn.Linear(
938
+ in_features=in_features,
939
+ out_features=additional_out_features,
940
+ bias=self.has_bias,
941
+ device=device,
942
+ dtype=dtype,
943
+ )
944
+ self.set_requires_grad(
945
+ require_regular_grad=not partially_freeze, require_additional_grad=True
946
+ )
947
+
948
+ def set_requires_grad(self, require_regular_grad, require_additional_grad):
949
+ """
950
+ Helper function to separately set the requires_grad flag for the regular weight and the additional weight.
951
+ """
952
+ self.weight.requires_grad_(require_regular_grad)
953
+ if self.has_bias:
954
+ self.bias.requires_grad_(require_regular_grad)
955
+ self.additional_fc.requires_grad_(require_additional_grad)
956
+
957
+ def forward(self, input: torch.Tensor) -> torch.Tensor:
958
+ output = F.linear(input, self.weight, self.bias)
959
+ output = output[..., : self.max_original_id + 1]
960
+
961
+ if self.additional_out_features > 0:
962
+ additional_features = F.linear(
963
+ input, self.additional_fc.weight, self.additional_fc.bias
964
+ )
965
+ output = torch.cat((output, additional_features), -1)
966
+ return output
967
+
968
+ def extra_repr(self) -> str:
969
+ """Overwriting `nn.Linear.extra_repr` to include new parameters."""
970
+ return "in_features={}, out_features={}, additional_out_features={}, bias={}, partially_freeze={}".format(
971
+ self.in_features,
972
+ self.max_original_id + 1,
973
+ self.additional_out_features,
974
+ self.bias is not None,
975
+ (not self.weight.requires_grad or not self.bias.requires_grad),
976
+ )
977
+
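+ # Minimal usage sketch for DecoupledLinear (sizes below are hypothetical): the regular projection is
+ # truncated to max_original_id + 1 columns and the always-trainable additional_fc columns are
+ # concatenated on the right, so the output width is (max_original_id + 1) + additional_out_features.
+ # >>> import torch
+ # >>> lin = DecoupledLinear(max_original_id=9, additional_out_features=2,
+ # ...                       in_features=8, original_out_features=12)
+ # >>> lin(torch.randn(3, 8)).shape   # 10 original columns + 2 additional columns
+ # torch.Size([3, 12])
+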
978
+
979
+ class VLM(nn.Module):
980
+ """
981
+ Generic vision-language model (VLM) class.
982
+ A VLM consists of four components:
983
+ 1. A vision encoder that extracts features from pixels, e.g. CLIP
984
+ input: (B, T_img, F, C, H, W)
985
+ output: (B, T_img, F, v, d)
986
+ 2. A vision tokenizer that converts these features to visual token-like embeddings, e.g. Perceiver, or a linear projection head
987
+ input: (B, T_img, F, v, d)
988
+ output: (B, T_img, n, d)
989
+ 3. A fusion method that allows the language model to attend to these tokens, e.g. cross-attention, or placing the tokens directly in the language model's input sequence
990
+ 4. A language model
991
+ """
992
+
993
+ def __init__(
994
+ self,
995
+ vision_encoder: nn.Module,
996
+ vision_tokenizer: nn.Module,
997
+ lang_model: nn.Module,
998
+ initial_tokenizer_len: int,
999
+ pad_token_id: int,
1000
+ gradient_checkpointing: bool = False,
1001
+ ):
1002
+ """
1003
+ Args:
1004
+ vision_encoder (nn.Module): e.g. CLIP
1005
+ vision_tokenizer (nn.Module): e.g. PerceiverResampler
1006
+ lang_model (nn.Module): e.g. MPT
1007
+ initial_tokenizer_len (int): size of the original tokenizer vocab
1008
+ pad_token_id (int): id of the pad token
1009
+ gradient_checkpointing (bool, optional): Whether to use gradient checkpointing. Defaults to False.
1010
+ """
1011
+ super().__init__()
1012
+
1013
+ # save dimension information
1014
+ self.lang_embedding_dim = lang_model.get_input_embeddings().weight.shape[1]
1015
+ if hasattr(lang_model.config, "d_model"):
1016
+ self.lang_hidden_dim = lang_model.config.d_model # mpt uses d_model
1017
+ else:
1018
+ self.lang_hidden_dim = lang_model.config.hidden_size
1019
+ self.vis_embedding_dim = vision_tokenizer.dim_media
1020
+ self.num_tokens_per_vis = vision_tokenizer.num_tokens_per_media
1021
+
1022
+ # core components
1023
+ self.vision_encoder = vision_encoder
1024
+ self.vision_tokenizer = vision_tokenizer
1025
+ self.lang_model = lang_model
1026
+
1027
+ # lm embeddings
1028
+ self.pad_token_id = pad_token_id
1029
+ self.initial_tokenizer_len = initial_tokenizer_len
1030
+ input_embeds = DecoupledEmbedding(
1031
+ max_original_id=initial_tokenizer_len - 1,
1032
+ num_additional_embeddings=len(self.special_tokens),
1033
+ _weight=self.lang_model.get_input_embeddings().weight,
1034
+ pad_token_id=self.pad_token_id,
1035
+ ).to(self.lang_model.dtype)
1036
+ if hasattr(input_embeds, "additional_embedding"):
1037
+ input_embeds.additional_embedding.weight.data.normal_(
1038
+ mean=0.0,
1039
+ std=(
1040
+ self.lang_model.config.initializer_range
1041
+ if hasattr(self.lang_model.config, "initializer_range")
1042
+ else 0.02
1043
+ ),
1044
+ )
1045
+ self.lang_model.set_input_embeddings(input_embeds)
1046
+
1047
+ out_embeds = DecoupledLinear(
1048
+ max_original_id=initial_tokenizer_len - 1,
1049
+ additional_out_features=len(self.special_tokens),
1050
+ _weight=self.lang_model.get_output_embeddings().weight,
1051
+ _bias=(
1052
+ self.lang_model.get_output_embeddings().bias
1053
+ if hasattr(self.lang_model.get_output_embeddings(), "bias")
1054
+ else None
1055
+ ),
1056
+ ).to(self.lang_model.dtype)
1057
+ if hasattr(out_embeds, "additional_fc"):
1058
+ out_embeds.additional_fc.weight.data.normal_(
1059
+ mean=0.0,
1060
+ std=(
1061
+ self.lang_model.config.initializer_range
1062
+ if hasattr(self.lang_model.config, "initializer_range")
1063
+ else 0.02
1064
+ ),
1065
+ )
1066
+ self.lang_model.set_output_embeddings(out_embeds)
1067
+
1068
+ # gradient checkpointing
1069
+ self.vision_tokenizer._use_gradient_checkpointing = gradient_checkpointing
1070
+
1071
+ def forward(
1072
+ self,
1073
+ vision_x: Optional[torch.Tensor],
1074
+ lang_x: torch.Tensor,
1075
+ attention_mask: Optional[torch.Tensor] = None,
1076
+ labels: Optional[torch.Tensor] = None,
1077
+ past_key_values: Optional[
1078
+ List[Union[torch.Tensor, Tuple[torch.Tensor]]]
1079
+ ] = None,
1080
+ past_media_locations: Optional[torch.Tensor] = None,
1081
+ past_vision_tokens: Optional[torch.Tensor] = None,
1082
+ use_cache: Optional[bool] = False,
1083
+ **kwargs,
1084
+ ):
1085
+ """
1086
+ Args:
1087
+ vision_x: Vision input
1088
+ shape (B, T_img, F, C, H, W) with F=1
1089
+ only F = 1 is supported (single-frame videos)
1090
+ if T_img > the number of media tokens in the corresponding input_ids (lang_x),
1091
+ only the first number of media tokens in lang_x are used
1092
+ lang_x: Language input ids, with media tokens denoting where
1093
+ visual media should be inserted.
1094
+ shape (B, T_txt)
1095
+ attention_mask: Attention mask. Defaults to None.
1096
+ labels: Labels. Defaults to None.
1097
+ shape (B, T_txt)
1098
+ past_key_values (Tuple[torch.Tensor]], optional): Past key value pairs for each of the T_txt previous tokens in the language model. Defaults to None.
1099
+ list of length = number of decoder layers in the LM
1100
+ exact implementation depends on LM, see Hugging Face docs
1101
+ past_media_locations (torch.Tensor, optional): boolean mask denoting which of the previous T_txt tokens were media tokens. Defaults to None.
1102
+ shape (B, T_txt)
1103
+ past_vision_tokens (torch.Tensor, optional): Previous vision tokens. Defaults to None.
1104
+ use_cache (Optional[bool], optional): Whether to use cache. Defaults to False.
1105
+ If True, includes key_values, media_locations, and vision_tokens in the output.
1106
+ """
1107
+ assert not (past_vision_tokens is None) ^ (
1108
+ past_media_locations is None
1109
+ ), "past_vision_tokens and past_media_locations must both be None or both be not None"
1110
+
1111
+ # convert pixels to vision tokens
1112
+ if vision_x is not None:
1113
+ vision_features = self._encode_vision_x(vision_x=vision_x)
1114
+ vision_tokens = self.vision_tokenizer(vision_features)
1115
+ else:
1116
+ vision_tokens = None
1117
+
1118
+ # fuse the vision and language tokens
1119
+ new_inputs = self._prepare_inputs_for_forward(
1120
+ vision_tokens=vision_tokens,
1121
+ lang_x=lang_x,
1122
+ attention_mask=attention_mask,
1123
+ labels=labels,
1124
+ past_key_values=past_key_values,
1125
+ past_media_locations=past_media_locations,
1126
+ padding_side="right",
1127
+ past_vision_tokens=past_vision_tokens,
1128
+ )
1129
+ output = self.lang_model(
1130
+ **new_inputs,
1131
+ use_cache=use_cache,
1132
+ past_key_values=past_key_values,
1133
+ **kwargs,
1134
+ )
1135
+
1136
+ # postprocessing may be needed, e.g. to remove extra tokens from logits that were inserted into the language stream
1137
+ # or to add the past_vision_tokens and past_media_locations to the output
1138
+ output = self._postprocess_outputs_from_forward(
1139
+ output=output,
1140
+ lang_x=lang_x,
1141
+ vision_tokens=vision_tokens,
1142
+ use_cache=use_cache,
1143
+ past_vision_tokens=past_vision_tokens,
1144
+ past_media_locations=past_media_locations,
1145
+ )
1146
+
1147
+ # postforward hooks
1148
+ self._post_forward_hook()
1149
+ return output
1150
+
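+ # Illustrative call signature for forward (a sketch only; `vlm`, `pixels`, `input_ids`, `attn`,
+ # and `labels` are hypothetical names, and the concrete modules are whatever the subclass was built with):
+ # >>> out = vlm(
+ # ...     vision_x=pixels,        # (B, T_img, 1, C, H, W); None for text-only batches
+ # ...     lang_x=input_ids,       # (B, T_txt), containing <image> media tokens
+ # ...     attention_mask=attn,    # (B, T_txt)
+ # ...     labels=labels,          # (B, T_txt), with -100 at positions to ignore in the loss
+ # ... )
+ # >>> out.logits.shape            # (B, T_txt, vocab_size), realigned to the original lang_x
+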
1151
+ def _encode_vision_x_anyres(self, samples, device):
1152
+ assert self.anyres_grids is not None
1153
+ image_raw = samples[
1154
+ "image"
1155
+ ] # list of patch list in of shape [1, N_patch, C, H, W]
1156
+ image_sizes = samples["image_size"]
1157
+
1158
+ # image_raw can be a list of lists of patches when a sample has multiple images.
1159
+ if isinstance(image_raw[0], list):
1160
+ images = [x.squeeze(0) for sample_img in image_raw for x in sample_img]
1161
+ image_sizes = [s for sample_sizes in image_sizes for s in sample_sizes]
1162
+ else:
1163
+ # assert isinstance(image_raw[0], torch.Tensor), f"Unkown image type: {image_raw[0]}"
1164
+ # concatenate the lists of patches into one batch for anyres encoding.
1165
+ images = [x.squeeze(0) for x in image_raw] # [N_patch, C, H, W]
1166
+ image = torch.cat(images, dim=0) # [\sum{B}{N_patch_i}, C, H, W]
1167
+ image = image.to(device)
1168
+
1169
+ with torch.no_grad():
1170
+ if self.vision_encoder.__class__.__name__ == "TimmModel":
1171
+ image_embeds = self.vision_encoder.trunk.forward_features(image)
1172
+ elif self.vision_encoder.__class__.__name__ in [
1173
+ "CLIPVisionModel",
1174
+ "SiglipVisionTransformer",
1175
+ ]:
1176
+ image_embeds = self.vision_encoder(image).last_hidden_state
1177
+ else:
1178
+ image_embeds = self.vision_encoder(image)[1] # OpenCLIP returns tuples
1179
+
1180
+ if isinstance(self.vision_encoder, CLIPVisionModel) or isinstance(
1181
+ self.vision_encoder, SiglipVisionTransformer
1182
+ ):
1183
+ base_img_size = self.vision_encoder.config.image_size
1184
+ else:
1185
+ base_img_size = self.vision_encoder.image_size[0]
1186
+
1187
+ if self.vision_encoder.__class__.__name__ == "TimmModel":
1188
+ grid_size = self.vision_encoder.trunk.patch_embed.grid_size
1189
+ elif self.vision_encoder.__class__.__name__ in [
1190
+ "CLIPVisionModel",
1191
+ "SiglipVisionTransformer",
1192
+ ]:
1193
+ grid_size_base = (
1194
+ self.vision_encoder.config.image_size
1195
+ // self.vision_encoder.config.patch_size
1196
+ )
1197
+ grid_size = (grid_size_base, grid_size_base)
1198
+ else:
1199
+ grid_size = self.vision_encoder.grid_size
1200
+ height, width = grid_size
1201
+
1202
+ if not image_embeds.shape[1] == height * width:
1203
+ assert (
1204
+ image_embeds.shape[1] == height * width + 1
1205
+ ) # For vision encoders that have a [CLS] token.
1206
+ image_embeds = image_embeds[:, 1:, :] # Drop the cls token for each patch.
1207
+ n_vis_token_per_patch = image_embeds.shape[1]
1208
+
1209
+ # Split encoded patches and merge patch features
1210
+ # 1. Get the raw sizes from samples, and split the image embeds [\sum_{B}(N_patch_i), N_tok(16*16), C]
1211
+ split_sizes = [image.shape[0] for image in images]
1212
+ image_embeds = torch.split(image_embeds, split_sizes, dim=0)
1213
+ # 2. For each image (consisting of a list of patches), merge the patches spatially (of shape [C, n_patch_height, n_patch_width])
1214
+ new_image_embeds = []
1215
+ patch_attn_masks = []
1216
+ max_n_img_token = -1
1217
+ for idx, patch_embeds in enumerate(image_embeds):
1218
+ if patch_embeds.shape[0] > 1:
1219
+ # 3. Flatten the patch features and get [C, n_patch_height * (n_patch_width+1)]
1220
+ base_patch_embeds = patch_embeds[
1221
+ 0
1222
+ ] # TODO: prepend the CLS token for the base patch embeds (of the resized entire image).
1223
+ patch_embeds = patch_embeds[1:]
1224
+
1225
+ assert height * width == base_patch_embeds.shape[0]
1226
+
1227
+ num_patch_width, num_patch_height = get_anyres_image_grid_shape(
1228
+ image_sizes[idx], self.anyres_grids, base_img_size
1229
+ ) # Hardcoded grid_pinpoints.
1230
+ patch_embeds = patch_embeds.view(
1231
+ num_patch_height, num_patch_width, height, width, -1
1232
+ )
1233
+
1234
+ patch_embeds = patch_embeds.permute(4, 0, 2, 1, 3).contiguous()
1235
+ patch_embeds = patch_embeds.flatten(1, 2).flatten(2, 3)
1236
+ patch_embeds, patch_attn_mask = unpad_image(
1237
+ patch_embeds, image_sizes[idx], self.anyres_patch_sampling
1238
+ )
1239
+ if hasattr(self, "image_newline"):
1240
+ patch_embeds = torch.cat(
1241
+ (
1242
+ patch_embeds,
1243
+ self.image_newline[:, None, None].expand(
1244
+ *patch_embeds.shape[:-1], 1
1245
+ ),
1246
+ ),
1247
+ dim=-1,
1248
+ )
1249
+ if self.anyres_patch_sampling:
1250
+ patch_embeds = patch_embeds.view(
1251
+ -1, num_patch_height, num_patch_width, height * width
1252
+ )
1253
+ patch_embeds = patch_embeds.flatten(1, 2).permute(1, 2, 0)
1254
+ assert patch_attn_mask is not None
1255
+ patch_attn_mask = patch_attn_mask.view(
1256
+ num_patch_height, num_patch_width, height * width
1257
+ )
1258
+ patch_attn_mask = patch_attn_mask.flatten(0, 1)
1259
+ patch_embeds = torch.cat(
1260
+ (base_patch_embeds.unsqueeze(0), patch_embeds), dim=0
1261
+ )
1262
+ patch_attn_mask = torch.cat(
1263
+ (
1264
+ torch.ones(
1265
+ n_vis_token_per_patch, device=patch_embeds.device
1266
+ ).unsqueeze(0),
1267
+ patch_attn_mask,
1268
+ ),
1269
+ dim=0,
1270
+ )
1271
+ else:
1272
+ patch_embeds = patch_embeds.flatten(1, 2).transpose(0, 1)
1273
+ patch_embeds = torch.cat((base_patch_embeds, patch_embeds), dim=0)
1274
+ else:
1275
+ patch_embeds = (
1276
+ patch_embeds[0].unsqueeze(0)
1277
+ if self.anyres_patch_sampling
1278
+ else patch_embeds[0]
1279
+ )
1280
+ patch_attn_mask = (
1281
+ torch.ones(
1282
+ n_vis_token_per_patch, device=patch_embeds.device
1283
+ ).unsqueeze(0)
1284
+ if self.anyres_patch_sampling
1285
+ else None
1286
+ )
1287
+ if hasattr(self, "image_newline"):
1288
+ patch_embeds = torch.cat(
1289
+ (patch_embeds, self.image_newline[None]), dim=0
1290
+ )
1291
+ if not self.anyres_patch_sampling:
1292
+ max_n_img_token = max(patch_embeds.shape[0], max_n_img_token)
1293
+
1294
+ new_image_embeds.append(patch_embeds)
1295
+ patch_attn_masks.append(patch_attn_mask)
1296
+
1297
+ if self.anyres_patch_sampling:
1298
+ # Return individual patches for independent token downsampling.
1299
+ return new_image_embeds, patch_attn_masks
1300
+
1301
+ # 4. Pad and concat the list of image_embeds [N_tok_i, C] together into a batch. Also modify the query attention mask.
1302
+ image_embeds = []
1303
+ image_atts = []
1304
+ for image_embed in new_image_embeds:
1305
+ n_img_token = image_embed.shape[0]
1306
+ img_attn = torch.ones(
1307
+ (max_n_img_token), dtype=torch.long, device=image_embed.device
1308
+ )
1309
+ if n_img_token < max_n_img_token:
1310
+ padded_embed = torch.zeros(
1311
+ (max_n_img_token, image_embed.shape[-1]),
1312
+ dtype=image_embed.dtype,
1313
+ device=image_embed.device,
1314
+ )
1315
+ padded_embed[:n_img_token, :] = image_embed
1316
+ img_attn[n_img_token:] = 0 # Mask out the padded entries.
1317
+ else:
1318
+ padded_embed = image_embed
1319
+ image_embeds.append(padded_embed)
1320
+ image_atts.append(img_attn)
1321
+ image_embeds = torch.stack(
1322
+ image_embeds, dim=0
1323
+ ) # Shape [B, N_tok_longest, C_dim]
1324
+ image_atts = torch.stack(image_atts, dim=0) # Shape [B, N_tok_longest]
1325
+ # TODO: reshape image_embeds and image_atts to "b T F v d"
1326
+ image_embeds = image_embeds[:, None, None, :, :]
1327
+ # image_atts = image_atts[:, None, None, :, :]
1328
+
1329
+ return image_embeds, image_atts
1330
+
1331
+ def _encode_vision_x(self, vision_x: torch.Tensor):
1332
+ """
1333
+ Compute media tokens from vision input by passing it through vision encoder and conditioning language model.
1334
+ Args:
1335
+ vision_x: Vision input
1336
+ shape (B, T_img, F, C, H, W)
1337
+ Images in the same chunk are collated along T_img, and frames are collated along F
1338
+ Currently only F=1 is supported (single-frame videos)
1339
+
1340
+ rearrange code based on https://github.com/dhansmair/flamingo-mini
1341
+ """
1342
+ assert vision_x.ndim == 6, "vision_x should be of shape (b, T_img, F, C, H, W)"
1343
+ b, T, F = vision_x.shape[:3]
1344
+
1345
+ vision_x = rearrange(vision_x, "b T F c h w -> (b T F) c h w")
1346
+ with torch.no_grad():
1347
+ if self.vision_encoder.__class__.__name__ == "TimmModel":
1348
+ vision_x = self.vision_encoder.trunk.forward_features(vision_x)
1349
+ elif self.vision_encoder.__class__.__name__ in [
1350
+ "CLIPVisionModel",
1351
+ "SiglipVisionTransformer",
1352
+ ]:
1353
+ vision_x = self.vision_encoder(vision_x).last_hidden_state
1354
+ else:
1355
+ vision_x = self.vision_encoder(vision_x)[1] # OpenCLIP returns tuples
1356
+ vision_x = rearrange(vision_x, "(b T F) v d -> b T F v d", b=b, T=T, F=F)
1357
+ return vision_x
1358
+
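+ # Shape round-trip used in _encode_vision_x above (illustrative; the encoder output sizes are assumptions):
+ # >>> import torch
+ # >>> from einops import rearrange
+ # >>> x = torch.randn(2, 3, 1, 3, 224, 224)                  # (b, T_img, F, C, H, W)
+ # >>> flat = rearrange(x, "b T F c h w -> (b T F) c h w")    # (6, 3, 224, 224) fed to the encoder
+ # >>> feats = torch.randn(6, 196, 1024)                      # hypothetical encoder output (v tokens, dim d)
+ # >>> rearrange(feats, "(b T F) v d -> b T F v d", b=2, T=3, F=1).shape
+ # torch.Size([2, 3, 1, 196, 1024])
+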
1359
+ def _concat_vision_cache(
1360
+ self, lang_x, vision_tokens, past_vision_tokens, past_media_locations, use_cache
1361
+ ):
1362
+ """
1363
+ Helper function to include the past vision tokens and past media locations in the output.
1364
+ """
1365
+ if use_cache:
1366
+ if past_media_locations is not None and past_vision_tokens is not None:
1367
+ if vision_tokens is not None:
1368
+ updated_vision_tokens = torch.cat(
1369
+ [
1370
+ past_vision_tokens,
1371
+ vision_tokens,
1372
+ ],
1373
+ dim=1,
1374
+ )
1375
+ else:
1376
+ updated_vision_tokens = past_vision_tokens
1377
+ updated_media_locations = torch.cat(
1378
+ [
1379
+ past_media_locations,
1380
+ lang_x == self.media_token_id,
1381
+ ],
1382
+ dim=1,
1383
+ )
1384
+ else:
1385
+ updated_vision_tokens = vision_tokens
1386
+ updated_media_locations = lang_x == self.media_token_id
1387
+
1388
+ else:
1389
+ updated_vision_tokens = None
1390
+ updated_media_locations = None
1391
+
1392
+ return updated_vision_tokens, updated_media_locations
1393
+
1394
+ def generate(
1395
+ self,
1396
+ vision_x: torch.Tensor,
1397
+ lang_x: torch.Tensor,
1398
+ attention_mask: torch.Tensor = None,
1399
+ past_key_values: Optional[
1400
+ List[Union[torch.Tensor, Tuple[torch.Tensor]]]
1401
+ ] = None,
1402
+ past_media_locations: Optional[torch.Tensor] = None,
1403
+ past_vision_tokens: Optional[torch.Tensor] = None,
1404
+ **kwargs,
1405
+ ):
1406
+ """
1407
+ Generate text conditioned on vision and language inputs.
1408
+ Args:
1409
+ vision_x (torch.Tensor): Vision input
1410
+ shape (B, T_img, F, C, H, W)
1411
+ see documentation for forward
1412
+ lang_x (torch.Tensor): Language input
1413
+ shape (B, T_txt)
1414
+ attention_mask (torch.Tensor, optional): Attention mask. Defaults to None.
1415
+ **kwargs: see generate documentation in Hugging Face CausalLM models.
1416
+ Returns:
1417
+ torch.Tensor: lang_x with generated tokens appended to it
1418
+ """
1419
+ num_beams = kwargs.pop("num_beams", 1)
1420
+
1421
+ # convert pixels to vision tokens
1422
+ if vision_x is not None:
1423
+ vision_features = self._encode_vision_x(vision_x=vision_x)
1424
+ vision_tokens = self.vision_tokenizer(vision_features)
1425
+ else:
1426
+ vision_tokens = None
1427
+
1428
+ # fuse the vision and language tokens
1429
+ # for xattn, vision_x and media_location are repeat_interleaved s.t.
1430
+ # the total batch size is B * num_beams
1431
+ new_inputs = self._prepare_inputs_for_forward(
1432
+ vision_tokens=vision_tokens,
1433
+ lang_x=lang_x,
1434
+ attention_mask=attention_mask,
1435
+ past_key_values=past_key_values,
1436
+ past_media_locations=past_media_locations,
1437
+ past_vision_tokens=past_vision_tokens,
1438
+ padding_side="left",
1439
+ num_beams=num_beams,
1440
+ )
1441
+ output = self.lang_model.generate(
1442
+ **new_inputs,
1443
+ past_key_values=past_key_values,
1444
+ num_beams=num_beams,
1445
+ use_cache=True,
1446
+ **kwargs,
1447
+ )
1448
+ self._post_forward_hook()
1449
+ return output
1450
+
1451
+ @property
1452
+ def num_trainable_params(self):
1453
+ """Print the number of trainable parameters"""
1454
+ return num_params(self, filter_to_trainable=True)
1455
+
1456
+ def set_trainable(self):
1457
+ """
1458
+ Freeze appropriate parameters in the model.
1459
+ """
1460
+ raise NotImplementedError
1461
+
1462
+ def group_params_by_weight_decay(self):
1463
+ """
1464
+ Return a tuple of (params to optimize w/ weight decay, params to optimize w/o weight decay)
1465
+ """
1466
+ params_with_wd, params_without_wd = [], []
1467
+ for n, p in self.named_parameters():
1468
+ if p.requires_grad:
1469
+ if self._should_apply_weight_decay(n):
1470
+ params_with_wd.append(p)
1471
+ else:
1472
+ params_without_wd.append(p)
1473
+ return params_with_wd, params_without_wd
1474
+
1475
+ def _should_apply_weight_decay(self, parameter_name):
1476
+ """
1477
+ Return whether weight decay should be applied to a parameter.
1478
+ """
1479
+ raise NotImplementedError
1480
+
1481
+ @property
1482
+ def special_tokens(self):
1483
+ """
1484
+ Returns a dict mapping from the attribute name of a special token to its string format,
1485
+ e.g. "media_token": "<image>"
1486
+ """
1487
+ assert (
1488
+ "media_token" in self._special_tokens
1489
+ ), "VLMs need to request that the tokenizer add a media_token and call set_special_token_ids to set self.media_token_id"
1490
+ return self._special_tokens
1491
+
1492
+ @property
1493
+ def special_token_ids(self):
1494
+ """
1495
+ Returns a list of the special token ids
1496
+ """
1497
+ return [getattr(self, f"{att_name}_id") for att_name in self.special_tokens]
1498
+
1499
+ def set_special_token_ids(self, string_to_ids):
1500
+ """
1501
+ Args:
1502
+ string_to_ids (dict): mapping from token string to id
1503
+ """
1504
+ assert set(self.special_tokens.values()).issubset(set(string_to_ids.keys()))
1505
+ for att_name, token_str in self.special_tokens.items():
1506
+ token_id = string_to_ids[token_str]
1507
+ setattr(self, f"{att_name}_id", token_id)
1508
+ setattr(self.lang_model, f"{att_name}_id", token_id)
1509
+
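+ # Illustrative call (the token ids below are made up): after the tokenizer has been extended with
+ # the special tokens, this sets e.g. self.media_token_id and mirrors it onto self.lang_model.
+ # >>> vlm.set_special_token_ids({
+ # ...     "<image>": 32011, "<image placeholder>": 32012, "<|endofchunk|>": 32013,
+ # ... })
+ # >>> vlm.media_token_id
+ # 32011
+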
1510
+ def init_gradient_checkpointing(self):
1511
+ from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
1512
+ checkpoint_wrapper,
1513
+ CheckpointWrapper,
1514
+ CheckpointImpl,
1515
+ apply_activation_checkpointing,
1516
+ )
1517
+ from functools import partial
1518
+
1519
+ non_reentrant_wrapper = partial(
1520
+ checkpoint_wrapper,
1521
+ checkpoint_impl=CheckpointImpl.NO_REENTRANT,
1522
+ )
1523
+ apply_activation_checkpointing(
1524
+ self,
1525
+ checkpoint_wrapper_fn=non_reentrant_wrapper,
1526
+ check_fn=lambda m: getattr(m, "_use_gradient_checkpointing", False)
1527
+ and not isinstance(m, CheckpointWrapper),
1528
+ )
1529
+
1530
+
1531
+ @dataclass
1532
+ class VLMOutputWithPast(CausalLMOutputWithPast):
1533
+ """
1534
+ VLMOutputWithPast is a wrapper around CausalLMOutputWithPast that adds the following attributes:
1535
+ past_media_locations: Optional[torch.Tensor] = None,
1536
+ past_vision_tokens: Optional[torch.Tensor] = None,
1537
+ """
1538
+
1539
+ past_media_locations: Optional[torch.Tensor] = None
1540
+ past_vision_tokens: Optional[torch.Tensor] = None
1541
+
1542
+
1543
+ def exists(val):
1544
+ return val is not None
1545
+
1546
+
1547
+ def FeedForward(dim, mult=4):
1548
+ inner_dim = int(dim * mult)
1549
+ return nn.Sequential(
1550
+ nn.LayerNorm(dim),
1551
+ nn.Linear(dim, inner_dim, bias=False),
1552
+ nn.GELU(),
1553
+ nn.Linear(inner_dim, dim, bias=False),
1554
+ )
1555
+
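+ # Illustrative: FeedForward(dim=512) builds LayerNorm(512) -> Linear(512, 2048) -> GELU -> Linear(2048, 512),
+ # so the block preserves the model dimension.
+ # >>> import torch
+ # >>> FeedForward(512)(torch.randn(2, 10, 512)).shape
+ # torch.Size([2, 10, 512])
+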
1556
+
1557
+ class VLMWithLanguageStream(VLM):
1558
+ """
1559
+ VLM that fuses modalities by inserting vision tokens directly into the language stream.
1560
+ """
1561
+
1562
+ def __init__(
1563
+ self,
1564
+ vision_encoder: nn.Module,
1565
+ vision_tokenizer: nn.Module,
1566
+ lang_model: nn.Module,
1567
+ initial_tokenizer_len: int,
1568
+ pad_token_id: int,
1569
+ decoder_layers_attr_name: str = None,
1570
+ gradient_checkpointing: bool = False,
1571
+ ):
1572
+ super().__init__(
1573
+ vision_encoder=vision_encoder,
1574
+ vision_tokenizer=vision_tokenizer,
1575
+ lang_model=lang_model,
1576
+ initial_tokenizer_len=initial_tokenizer_len,
1577
+ pad_token_id=pad_token_id,
1578
+ gradient_checkpointing=gradient_checkpointing,
1579
+ )
1580
+ self.decoder_layers_attr_name = decoder_layers_attr_name
1581
+ if decoder_layers_attr_name is not None:
1582
+ for block in getattr_recursive(
1583
+ self.lang_model, self.decoder_layers_attr_name
1584
+ ):
1585
+ block._use_gradient_checkpointing = gradient_checkpointing
1586
+
1587
+ def _prepare_inputs_for_forward(
1588
+ self,
1589
+ vision_tokens: torch.Tensor,
1590
+ lang_x: torch.Tensor,
1591
+ attention_mask: torch.Tensor,
1592
+ labels: torch.Tensor = None,
1593
+ past_key_values=None,
1594
+ vision_attention_mask: Optional[torch.Tensor] = None,
1595
+ past_media_locations: torch.Tensor = None,
1596
+ past_vision_tokens: torch.Tensor = None,
1597
+ padding_side: str = "left",
1598
+ num_beams: int = 1,
1599
+ ):
1600
+ """
1601
+ Insert the vision tokens directly into the language stream.
1602
+ This requires us to modify the input_ids, attention_mask, and labels.
1603
+ """
1604
+ if past_key_values is not None:
1605
+ past_len = past_key_values[0][0].shape[2]
1606
+ assert attention_mask.shape[1] == past_len + lang_x.shape[1], (
1607
+ "Attention_mask must be as long as the entire past len (including image tokens) and current input IDs. "
1608
+ + "Check that you've expanded the attention mask to account for past image tokens."
1609
+ )
1610
+
1611
+ if vision_tokens is None:
1612
+ return {
1613
+ "input_ids": lang_x,
1614
+ "attention_mask": attention_mask,
1615
+ "labels": labels,
1616
+ }
1617
+
1618
+ # get the language embeddings
1619
+ lang_embeds = self.lang_model.get_input_embeddings()(lang_x)
1620
+
1621
+ # build up the multimodal embeddings
1622
+ B = lang_x.shape[0]
1623
+ has_labels = labels is not None
1624
+ multimodal_embeds = []
1625
+ multimodal_attention_mask = []
1626
+ multimodal_labels = [] if has_labels else None
1627
+ for i in range(B):
1628
+ # get index of <image> tokens in lang_x[i]
1629
+ image_token_idxs = torch.where(lang_x[i] == self.media_token_id)[0]
1630
+
1631
+ if len(image_token_idxs) == 0:
1632
+ multimodal_embeds.append(lang_embeds[i].clone())
1633
+ multimodal_attention_mask.append(attention_mask[i].clone())
1634
+ if has_labels:
1635
+ multimodal_labels.append(labels[i].clone())
1636
+ continue
1637
+
1638
+ # loop through the image_token_idxs and insert the vision tokens
1639
+ new_embed = lang_embeds[i].clone()
1640
+ new_attention_mask = (
1641
+ attention_mask[i].clone() if attention_mask is not None else None
1642
+ )
1643
+ if has_labels:
1644
+ new_label = labels[i].clone()
1645
+
1646
+ for img_num in range(len(image_token_idxs)):
1647
+ img_idx = image_token_idxs[img_num]
1648
+ # Get vision token attention mask for padded llava-style any resolution image tokens.
1649
+ if self.image_aspect_ratio == "anyres":
1650
+ num_vis_tokens = vision_tokens[i][img_num].shape[0]
1651
+ if vision_attention_mask is not None:
1652
+ vis_attention_mask = vision_attention_mask[i]
1653
+ else:
1654
+ vis_attention_mask = torch.ones(
1655
+ num_vis_tokens, dtype=torch.long
1656
+ ).to(attention_mask.device)
1657
+ else:
1658
+ assert (
1659
+ vision_tokens[i][img_num].shape[0] == self.num_tokens_per_vis
1660
+ ), f"vision token number mismatch: image embedding ({vision_tokens[i][img_num].shape[0]}) \
1661
+ vs. model.num_tokens_per_vis ({self.num_tokens_per_vis})"
1662
+ # By default, vision tokens are not padded.
1663
+ num_vis_tokens = self.num_tokens_per_vis
1664
+ vis_attention_mask = torch.ones(
1665
+ num_vis_tokens, dtype=torch.long
1666
+ ).to(attention_mask.device)
1667
+
1668
+ # Offset the rest of image tokens with current num_vis_tokens
1669
+ for j in range(img_num+1, len(image_token_idxs)):
1670
+ image_token_idxs[j] += (num_vis_tokens - 1)
1671
+
1672
+ new_embed = torch.cat(
1673
+ (
1674
+ new_embed[:img_idx],
1675
+ vision_tokens[i][img_num],
1676
+ new_embed[img_idx + 1 :],
1677
+ ),
1678
+ dim=0,
1679
+ )
1680
+ new_attention_mask = torch.cat(
1681
+ (
1682
+ new_attention_mask[:img_idx],
1683
+ vis_attention_mask,
1684
+ new_attention_mask[img_idx + 1 :],
1685
+ ),
1686
+ dim=0,
1687
+ )
1688
+ if has_labels:
1689
+ new_label = torch.cat(
1690
+ (
1691
+ new_label[:img_idx],
1692
+ torch.ones(num_vis_tokens, dtype=torch.long).to(
1693
+ labels.device
1694
+ )
1695
+ * -100,
1696
+ new_label[img_idx + 1 :],
1697
+ ),
1698
+ dim=0,
1699
+ )
1700
+ multimodal_embeds.append(new_embed)
1701
+ multimodal_attention_mask.append(new_attention_mask)
1702
+ if has_labels:
1703
+ multimodal_labels.append(new_label)
1704
+
1705
+ # stack
1706
+ multimodal_embeds = stack_with_padding(
1707
+ multimodal_embeds,
1708
+ padding_value=self.pad_token_id,
1709
+ padding_side=padding_side,
1710
+ )
1711
+ multimodal_attention_mask = stack_with_padding(
1712
+ multimodal_attention_mask,
1713
+ padding_value=0,
1714
+ padding_side=padding_side,
1715
+ )
1716
+ if has_labels:
1717
+ multimodal_labels = stack_with_padding(
1718
+ multimodal_labels,
1719
+ padding_value=-100,
1720
+ padding_side=padding_side,
1721
+ )
1722
+
1723
+ return {
1724
+ "inputs_embeds": multimodal_embeds,
1725
+ "attention_mask": multimodal_attention_mask,
1726
+ "labels": multimodal_labels,
1727
+ }
1728
+
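+ # Worked length example for the insertion above (hypothetical numbers): with T_txt = 20 input ids,
+ # 2 <image> tokens, and 128 vision tokens per image, each <image> embedding is replaced by 128
+ # vision embeddings, so the fused sequence has 20 - 2 + 2 * 128 = 274 positions; the attention
+ # mask grows the same way, and the inserted positions get label -100 so they are ignored by the loss.
+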
1729
+ def _postprocess_outputs_from_forward(
1730
+ self,
1731
+ output: CausalLMOutputWithPast,
1732
+ lang_x: torch.Tensor,
1733
+ vision_tokens: torch.Tensor,
1734
+ past_vision_tokens: torch.Tensor,
1735
+ past_media_locations: torch.Tensor,
1736
+ use_cache: bool = False,
1737
+ ):
1738
+ # Include the past vision tokens and past media locations in the output
1739
+ updated_vision_tokens, updated_media_locations = self._concat_vision_cache(
1740
+ lang_x=lang_x,
1741
+ vision_tokens=vision_tokens,
1742
+ past_vision_tokens=past_vision_tokens,
1743
+ past_media_locations=past_media_locations,
1744
+ use_cache=use_cache,
1745
+ )
1746
+
1747
+ # return logits that are the same shape as the original input_ids
1748
+ logits = output.logits
1749
+ batch_logits = []
1750
+ B, T_txt = lang_x.shape
1751
+ for i in range(B):
1752
+ sequence_logits = []
1753
+ logits_j = 0
1754
+ for j in range(T_txt):
1755
+ if lang_x[i, j] != self.media_token_id:
1756
+ sequence_logits.append(logits[i, logits_j])
1757
+ logits_j += 1
1758
+ else:
1759
+ # append the logit for the first image token, then skip over the rest
1760
+ # note: the model actually learns to predict <im_patch>, not <image>
1761
+ sequence_logits.append(logits[i, logits_j])
1762
+ logits_j += self.num_tokens_per_vis
1763
+ sequence_logits = torch.stack(sequence_logits, dim=0) # (T_txt, vocab_size)
1764
+ batch_logits.append(sequence_logits)
1765
+
1766
+ batch_logits = torch.stack(batch_logits, dim=0) # (B, T_txt, vocab_size)
1767
+ # The final logits shape should be the same as the original input_ids shape
1768
+ assert batch_logits.shape[:2] == (B, T_txt)
1769
+
1770
+ # assemble the output
1771
+ output = VLMOutputWithPast(
1772
+ loss=output.loss,
1773
+ logits=batch_logits,
1774
+ past_key_values=output.past_key_values,
1775
+ hidden_states=output.hidden_states,
1776
+ attentions=output.attentions,
1777
+ past_media_locations=updated_media_locations,
1778
+ past_vision_tokens=updated_vision_tokens,
1779
+ )
1780
+
1781
+ return output
1782
+
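+ # Worked example of the realignment above (same hypothetical numbers as the note after
+ # _prepare_inputs_for_forward): the LM returns logits over 274 fused positions; for each <image>
+ # position the first of its 128 logits is kept and the remaining 127 are skipped, giving back
+ # exactly 274 - 2 * 127 = 20 = T_txt logits per sequence, aligned with the original input_ids.
+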
1783
+ def _post_forward_hook(self):
1784
+ pass
1785
+
1786
+ @property
1787
+ def num_params_per_module(self):
1788
+ """Print the number of parameters per module in the model"""
1789
+ return "\n".join(
1790
+ [
1791
+ f"Vision encoder: {num_params(self.vision_encoder):,} parameters",
1792
+ f"Vision tokenizer: {num_params(self.vision_tokenizer):,} parameters",
1793
+ f"Language model: {num_params(self.lang_model):,} parameters",
1794
+ ]
1795
+ )
1796
+
1797
+ @property
1798
+ def num_trainable_params_per_module(self):
1799
+ """Print the number of trainable parameters per module in the model"""
1800
+ return "\n".join(
1801
+ [
1802
+ f"Vision encoder: {num_params(self.vision_encoder, filter_to_trainable=True):,} trainable parameters",
1803
+ f"Vision tokenizer: {num_params(self.vision_tokenizer, filter_to_trainable=True):,} trainable parameters",
1804
+ f"Language model: {num_params(self.lang_model, filter_to_trainable=True):,} trainable parameters",
1805
+ ]
1806
+ )
1807
+
1808
+
1809
+ class XGenMMPerceiver(VLMWithLanguageStream):
1810
+ def __init__(
1811
+ self,
1812
+ vision_encoder: nn.Module,
1813
+ vision_tokenizer: nn.Module,
1814
+ lang_model: nn.Module,
1815
+ initial_tokenizer_len: int,
1816
+ pad_token_id: int,
1817
+ decoder_layers_attr_name: str = None,
1818
+ gradient_checkpointing: bool = False,
1819
+ image_aspect_ratio: str = "anyres",
1820
+ anyres_patch_sampling: bool = True,
1821
+ anyres_grids: list[int] = None,
1822
+ ):
1823
+ """
1824
+ Args:
1825
+ vision_encoder (nn.Module): HF vision encoder (e.g. a CLIP or SigLIP vision model)
1826
+ lang_model (nn.Module): HF causal language model
1827
+ vision_tokenizer (nn.Module): vision tokenizer (e.g. PerceiverResampler) that maps encoder features to vision tokens
1828
+ initial_tokenizer_len (int): size of the tokenizer vocab
1829
+ pad_token_id (int): id of the padding token. None if no padding token; then a padding token
1830
+ will be inserted into self.special_tokens, which factory.py fills after creating new tokens
1831
+ decoder_layers_attr_name (str, optional): name of the decoder layers attribute. Defaults to None.
1832
+ gradient_checkpointing (bool, optional): whether to use gradient checkpointing. Defaults to False.
1833
+ """
1834
+ self._special_tokens = {
1835
+ "media_token": "<image>",
1836
+ "image_placeholder_token": "<image placeholder>",
1837
+ "end_of_trunk_token": "<|endofchunk|>",
1838
+ }
1839
+ lang_embedding_dim = lang_model.get_input_embeddings().weight.shape[1]
1840
+ super().__init__(
1841
+ vision_encoder=vision_encoder,
1842
+ vision_tokenizer=vision_tokenizer,
1843
+ lang_model=lang_model,
1844
+ initial_tokenizer_len=initial_tokenizer_len,
1845
+ gradient_checkpointing=gradient_checkpointing,
1846
+ decoder_layers_attr_name=decoder_layers_attr_name,
1847
+ pad_token_id=pad_token_id,
1848
+ )
1849
+ self.image_aspect_ratio = image_aspect_ratio
1850
+ self.anyres_patch_sampling = anyres_patch_sampling
1851
+ self.anyres_grids = anyres_grids
1852
+
1853
+ def set_trainable(self):
1854
+ """
1855
+ Unfreeze everything except the vision_encoder
1856
+ """
1857
+ self.requires_grad_(True)
1858
+ self.vision_encoder.requires_grad_(False)
1859
+
1860
+ def _should_apply_weight_decay(self, parameter_name):
1861
+ """
1862
+ Kosmos applies 0.01 weight decay to everything
1863
+ """
1864
+ return True
1865
+
1866
+ def generate(
1867
+ self,
1868
+ vision_x: torch.Tensor,
1869
+ lang_x: torch.Tensor,
1870
+ image_size: Optional[Tuple] = None,
1871
+ attention_mask: torch.Tensor = None,
1872
+ past_key_values: Optional[
1873
+ List[Union[torch.Tensor, Tuple[torch.Tensor]]]
1874
+ ] = None,
1875
+ past_media_locations: Optional[torch.Tensor] = None,
1876
+ past_vision_tokens: Optional[torch.Tensor] = None,
1877
+ **kwargs,
1878
+ ):
1879
+ """
1880
+ Generate text conditioned on vision and language inputs.
1881
+ Args:
1882
+ vision_x (torch.Tensor): Vision input
1883
+ shape (B, T_img, F, C, H, W)
1884
+ see documentation for forward
1885
+ lang_x (torch.Tensor): Language input
1886
+ shape (B, T_txt)
1887
+ attention_mask (torch.Tensor, optional): Attention mask. Defaults to None.
1888
+ **kwargs: see generate documentation in Hugging Face CausalLM models.
1889
+ Returns:
1890
+ torch.Tensor: lang_x with generated tokens appended to it
1891
+ """
1892
+ num_beams = kwargs.pop("num_beams", 1)
1893
+
1894
+ # convert pixels to vision tokens
1895
+ vision_attention_mask = None
1896
+ if vision_x is not None:
1897
+ if self.image_aspect_ratio == "anyres":
1898
+ input_dict = dict(image=vision_x, image_size=image_size)
1899
+ vision_features, vision_attn_masks = self._encode_vision_x_anyres(
1900
+ input_dict, lang_x.device
1901
+ )
1902
+ else:
1903
+ vision_features = self._encode_vision_x(vision_x=vision_x)
1904
+ vision_attn_masks = None
1905
+ # If doing patch sampling, then flatten patches of shape [b, Np_i, v, d] -> [b*Np, v, d]
1906
+ # Same for attention masks: [b, Np, v] -> [b*Np, v]
1907
+ if self.anyres_patch_sampling:
1908
+ split_sizes = [feature.shape[0] for feature in vision_features]
1909
+ # Nested splits for multi-image samples.
1910
+ if isinstance(vision_x[0], list):
1911
+ nt_images = [len(images) for images in vision_x]
1912
+ split_split_sizes = []
1913
+ img_id = 0
1914
+ for nt in nt_images:
1915
+ split_split_sizes.append(split_sizes[img_id : img_id + nt])
1916
+ img_id += nt
1917
+ else:
1918
+ nt_images = [1] * len(vision_x)
1919
+ split_split_sizes = split_sizes
1920
+ vision_features = torch.cat(vision_features, dim=0)
1921
+ vision_features = vision_features[
1922
+ :, None, None, :, :
1923
+ ] # Expand dimensions.
1924
+ vision_attn_masks = torch.cat(vision_attn_masks, dim=0)
1925
+ vision_tokens = self.vision_tokenizer(vision_features, vision_attn_masks)
1926
+
1927
+ # Post-processing: Split the batches into groups of patches and concatenate them together.
1928
+ if self.anyres_patch_sampling:
1929
+ assert isinstance(vision_x, list)
1930
+ if isinstance(vision_x[0], list):
1931
+ vision_token_groups = torch.split(
1932
+ vision_tokens,
1933
+ list(sum(nt_img) for nt_img in split_split_sizes),
1934
+ dim=0,
1935
+ )
1936
+ vision_tokens = []
1937
+
1938
+ for sample_id, patch_vis_tokens in enumerate(vision_token_groups):
1939
+ patch_vis_token_groups = torch.split(
1940
+ patch_vis_tokens, split_split_sizes[sample_id], dim=0
1941
+ ) # [Np*nt, 1, v, d] -> [[Np_t, 1, v, d], ...]
1942
+ flatten_vision_tokens = []
1943
+ for image_vis_token in patch_vis_token_groups:
1944
+ image_vis_token = image_vis_token.flatten(
1945
+ 0, 2
1946
+ ) # [Np, 1, v, d] -> [Np*v, d]
1947
+ flatten_vision_tokens.append(image_vis_token)
1948
+ vision_tokens_i = flatten_vision_tokens
1949
+ vision_tokens.append(vision_tokens_i)
1950
+ else:
1951
+ vision_token_groups = torch.split(vision_tokens, split_sizes, dim=0)
1952
+ vision_tokens = []
1953
+ for patch_vis_tokens in vision_token_groups:
1954
+ patch_vis_tokens = patch_vis_tokens.flatten(
1955
+ 0, 2
1956
+ ) # [Np, 1, v, d] -> [Np*v, d]
1957
+ vision_tokens.append(
1958
+ patch_vis_tokens.unsqueeze(0)
1959
+ ) # Add the nt dimension.
1960
+ else:
1961
+ vision_tokens = None
1962
+
1963
+ # fuse the vision and language tokens
1964
+ # for xattn, vision_x and media_location are repeat_interleaved s.t.
1965
+ # the total batch size is B * num_beams
1966
+ new_inputs = self._prepare_inputs_for_forward(
1967
+ vision_tokens=vision_tokens,
1968
+ lang_x=lang_x,
1969
+ attention_mask=attention_mask,
1970
+ vision_attention_mask=vision_attention_mask,
1971
+ past_key_values=past_key_values,
1972
+ past_media_locations=past_media_locations,
1973
+ past_vision_tokens=past_vision_tokens,
1974
+ padding_side="left",
1975
+ num_beams=num_beams,
1976
+ )
1977
+ if past_key_values is not None:
1978
+ output = self.lang_model.generate(
1979
+ **new_inputs,
1980
+ past_key_values=past_key_values,
1981
+ num_beams=num_beams,
1982
+ use_cache=True,
1983
+ **kwargs,
1984
+ )
1985
+ else:
1986
+ output = self.lang_model.generate(
1987
+ **new_inputs,
1988
+ num_beams=num_beams,
1989
+ use_cache=True,
1990
+ **kwargs,
1991
+ )
1992
+ self._post_forward_hook()
1993
+ return output
1994
+
1995
+
1996
+ class XGenMMVisionEncoder(PreTrainedModel):
1997
+ main_input_name = "pixel_values"
1998
+ config_class = XGenMMVisionEncoderConfig
1999
+
2000
+ def __init__(self, config: XGenMMVisionEncoderConfig):
2001
+ super().__init__(config)
2002
+ if config.model_name != "google/siglip-so400m-patch14-384":
2003
+ raise ValueError(
2004
+ f"Unsupported model {config.model_name}. New vision models will be added soon."
2005
+ )
2006
+ self.model = AutoModel.from_pretrained(config.model_name)
2007
+
2008
+ def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
2009
+ # assert pixel_values.ndim == 4, f"Expected 4D tensor (bs, c, h, w), got {pixel_values.ndim}"
2010
+ return self.model.encode_image(pixel_values)
2011
+
2012
+
2013
+ # vision tokenizer
2014
+ class XGenMMVisionTokenizer(PreTrainedModel):
2015
+ config_class = XGenMMVisionTokenizerConfig
2016
+
2017
+ def __init__(self, config: XGenMMVisionTokenizerConfig):
2018
+ super().__init__(config)
2019
+ self.model = PerceiverResampler(
2020
+ dim=config.vis_feature_dim,
2021
+ dim_inner=config.lang_embedding_dim,
2022
+ num_latents=config.num_vis_tokens,
2023
+ )
2024
+
2025
+ def forward(self, vision_features: torch.Tensor, vision_attn_masks: torch.Tensor):
2026
+ return self.model(vision_features, vision_attn_masks)
2027
+
2028
+
2029
+ # XGenMM model
2030
+ class XGenMMModelForConditionalGeneration(PreTrainedModel):
2031
+ config_class = XGenMMConfig
2032
+
2033
+ def __init__(self, config: XGenMMConfig):
2034
+ super().__init__(config)
2035
+
2036
+ # vision encoder initialization
2037
+ vision_encoder = AutoModel.from_pretrained(
2038
+ config.vision_encoder_config.model_name,
2039
+ torch_dtype=config.text_config.torch_dtype,
2040
+ ).vision_model
2041
+
2042
+ # language model initialization
2043
+ language_model = AutoModelForCausalLM.from_config(
2044
+ config.text_config,
2045
+ torch_dtype=config.text_config.torch_dtype,
2046
+ )
2047
+ check_embedding_fns(language_model)
2048
+ # Update _tied_weights_keys using the base model used.
2049
+ if language_model._tied_weights_keys is not None:
2050
+ self._tied_weights_keys = [
2051
+ f"language_model.{k}" for k in language_model._tied_weights_keys
2052
+ ]
2053
+
2054
+ # vision tokenizer initialization
2055
+ if (
2056
+ config.vision_tokenizer_config.lang_embedding_dim
2057
+ != language_model.get_input_embeddings().weight.shape[1]
2058
+ ):
2059
+ overwrite = language_model.get_input_embeddings().weight.shape[1]
2060
+ config.vision_tokenizer_config.lang_embedding_dim = overwrite
2061
+ print(
2062
+ f"Warning: The language embedding dimension in the vision tokenizer config is different from the language model's embedding dimension. Overwriting the language embedding dimension in the vision tokenizer config to {overwrite}."
2063
+ )
2064
+
2065
+ vision_tokenizer = XGenMMVisionTokenizer(config.vision_tokenizer_config).model.to(language_model.dtype)
2066
+
2067
+ self.vlm = XGenMMPerceiver(
2068
+ vision_encoder=vision_encoder,
2069
+ vision_tokenizer=vision_tokenizer,
2070
+ lang_model=language_model,
2071
+ initial_tokenizer_len=config.text_config.initial_tokenizer_len,
2072
+ pad_token_id=config.text_config.pad_token_id,
2073
+ image_aspect_ratio=config.vision_encoder_config.image_aspect_ratio,
2074
+ anyres_patch_sampling=config.vision_encoder_config.anyres_patch_sampling,
2075
+ anyres_grids=config.vision_encoder_config.anyres_grids,
2076
+ )
2077
+ # Initialize weights and apply final processing
2078
+ self.post_init()
2079
+
2080
+ @torch.no_grad()
2081
+ def generate(
2082
+ self,
2083
+ pixel_values: torch.FloatTensor,
2084
+ input_ids: Optional[torch.LongTensor] = None,
2085
+ attention_mask: Optional[torch.LongTensor] = None,
2086
+ **generate_kwargs,
2087
+ ) -> torch.LongTensor:
2088
+ self.vlm = self.vlm.eval()
2089
+ return self.vlm.generate(
2090
+ vision_x=pixel_values,
2091
+ lang_x=input_ids,
2092
+ attention_mask=attention_mask,
2093
+ **generate_kwargs,
2094
+ )
2095
+
2096
+ def update_special_tokens(self, tokenizer):
2097
+ tokenizer.add_special_tokens(
2098
+ {"additional_special_tokens": list(self.vlm.special_tokens.values())}
2099
+ )
2100
+ self.vlm.lang_model.config.vocab_size = len(tokenizer)
2101
+ self.vlm.set_special_token_ids(
2102
+ {
2103
+ v: tokenizer.convert_tokens_to_ids(v)
2104
+ for v in self.vlm.special_tokens.values()
2105
+ }
2106
+ )
2107
+ return tokenizer
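+
+ # End-to-end usage sketch (hedged: the repo id, processor classes, input variable `image`, and the
+ # prompt format below are assumptions for illustration, not guaranteed by this file):
+ # >>> from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor
+ # >>> model = AutoModelForVision2Seq.from_pretrained("<this-repo-id>", trust_remote_code=True)
+ # >>> tokenizer = model.update_special_tokens(
+ # ...     AutoTokenizer.from_pretrained("<this-repo-id>", trust_remote_code=True))
+ # >>> image_processor = AutoImageProcessor.from_pretrained("<this-repo-id>", trust_remote_code=True)
+ # >>> pixel_values = image_processor([image], return_tensors="pt")["pixel_values"]
+ # >>> prompt = tokenizer("<image> Describe this image.", return_tensors="pt")
+ # >>> out = model.generate(pixel_values=pixel_values, input_ids=prompt["input_ids"],
+ # ...                      attention_mask=prompt["attention_mask"],
+ # ...                      image_size=[image.size], max_new_tokens=64)
+ # >>> print(tokenizer.decode(out[0], skip_special_tokens=True))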