---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- multimodal
library_name: transformers
datasets:
- BAAI/Infinity-MM
- BAAI/Infinity-Instruct
- BAAI/Infinity-Preference
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- google/siglip-so400m-patch14-384
pipeline_tag: visual-question-answering
---

![mof-class1](https://mot.isitopen.ai/model/1130/badge/1)

# Introduction

The **Aquila-VL-2B** model is a vision-language model (VLM) trained with the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) is used as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) serves as the vision tower.

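
Conceptually, the LLaVA-style composition works as follows: the SigLIP vision tower encodes the image into patch embeddings, an MLP projector maps them into the LLM's embedding space, and the resulting visual tokens are fed to the Qwen2.5 language model together with the text tokens. The sketch below only illustrates the tensor shapes involved; the projector type (a 2-layer MLP) and the patch-token count are assumptions based on standard LLaVA-OneVision settings, not values read from this checkpoint.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the LLaVA-style composition (not the actual model code).
vision_dim = 1152   # hidden size of siglip-so400m-patch14-384
llm_dim = 1536      # hidden size of Qwen2.5-1.5B-Instruct

# Assumption: a standard LLaVA 2-layer MLP projector (mlp2x_gelu-style).
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# e.g. 729 patch embeddings per 384-px image tile (illustrative number).
patch_features = torch.randn(1, 729, vision_dim)
visual_tokens = projector(patch_features)   # now in the LLM embedding space
print(visual_tokens.shape)                  # torch.Size([1, 729, 1536])
```
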
The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated by open-source VLMs.

We have open-sourced the [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset and related resources. We hope you enjoy using them!

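
If you want to browse the training data, it can be streamed with the `datasets` library so you do not have to download all ~40M pairs up front. This is a minimal sketch; the config and split names are assumptions, so check the dataset card for the actual layout.

```python
# Minimal sketch: stream a few Infinity-MM samples with the Hugging Face `datasets` library.
# Assumption: a default config with a "train" split exists; see the dataset card for the
# actual subset/stage layout before relying on these names.
from datasets import load_dataset

ds = load_dataset("BAAI/Infinity-MM", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())  # inspect the image / instruction fields of this subset
    if i >= 2:
        break
```
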
## News
- `2024/11/19`: We have released [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-VL-2B-Intermediate) obtained at different stages of training. Please feel free to use these models for analysis and experimentation.
- `2024/10/25`: The [Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model and the [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset are now available, together with the accompanying [technical report](https://arxiv.org/abs/2410.18558).

# Evaluation

We evaluated the model with the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) toolkit. Where a benchmark supports API-based evaluation, we prioritized using the OpenAI API.

| Benchmark | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Aquila-VL-2B |
| :--------------------------- | :---------: | :----------: | :-----------: | :------------------: | :----------: |
| MMBench-EN<sub>test</sub> | 69.4 | 73.4 | **78.9** | 74.9 | 78.8 |
| MMBench-CN<sub>test</sub> | 65.9 | 70.9 | 76.1 | 73.9 | **76.4** |
| MMBench_V1.1<sub>test</sub> | 65.2 | 69.7 | **75.4** | 72.7 | 75.2 |
| MMT-Bench<sub>test</sub> | 54.5 | 53.3 | 57.2 | 54.8 | **58.2** |
| RealWorldQA | 55.4 | 57.3 | 63.9 | 62.6 | **63.9** |
| HallusionBench | 36.8 | 38.1 | 36.0 | 41.5 | **43.0** |
| SEEDBench2<sub>plus</sub> | 51.8 | 60.0 | 63.0 | 62.4 | **63.0** |
| LLaVABench | 66.1 | 64.8 | 42.4 | 52.5 | **68.4** |
| MMStar | 41.6 | 50.2 | 51.9 | 47.8 | **54.9** |
| POPE | 86.6 | 85.3 | **89.4** | 88.0 | 83.6 |
| MMVet | 44.0 | 41.1 | 42.7 | **50.7** | 44.3 |
| MMMU<sub>val</sub> | 39.6 | 34.9 | 43.6 | 41.7 | **47.4** |
| ScienceQA<sub>test</sub> | 80.4 | 94.1 | 86.6 | 78.1 | **95.2** |
| AI2D<sub>test</sub> | 64.8 | 74.4 | 74.2 | 74.6 | **75.0** |
| MathVista<sub>testmini</sub> | 39.0 | 45.0 | 47.1 | 47.9 | **59.0** |
| MathVerse<sub>testmini</sub> | 19.8 | 24.7 | 22.2 | 21.0 | **26.2** |
| MathVision | 15.4 | 12.6 | 16.3 | 17.5 | **18.4** |
| DocVQA<sub>test</sub> | 71.0 | 86.9 | 87.6 | **89.9** | 85.0 |
| InfoVQA<sub>test</sub> | 40.0 | 59.5 | 59.1 | **65.4** | 58.3 |
| ChartQA<sub>test</sub> | 59.6 | 71.4 | 57.1 | 73.5 | **76.5** |
| TextVQA<sub>val</sub> | 74.3 | 73.5 | 77.6 | **79.9** | 76.4 |
| OCRVQA<sub>testcore</sub> | 54.4 | 40.2 | 67.6 | **68.7** | 64.0 |
| VCR<sub>en easy</sub> | 27.6 | 51.6 | 67.7 | 68.3 | **70.0** |
| OCRBench | 613 | 784 | 782 | **810** | 772 |
| Average | 53.5 | 58.8 | 60.9 | 62.1 | **64.1** |

For the comparison models, evaluations were conducted in our local environment, so the scores may differ slightly from those reported in the original papers or on the official VLMEvalKit leaderboard.

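
As a sanity check, the `Average` row can be reproduced from the 24 rows above. Note that OCRBench is reported on a 0-1000 scale; the averaging below assumes it is divided by 10 first (an assumption that is consistent with the published averages), while the other benchmarks are already on a 0-100 scale.

```python
# Reproduce the "Average" entry for the Aquila-VL-2B column of the table above.
# Assumption: OCRBench (0-1000 scale) is divided by 10 before averaging; the other
# 23 benchmarks are already on a 0-100 scale.
aquila_vl_2b = [
    78.8, 76.4, 75.2, 58.2, 63.9, 43.0, 63.0, 68.4, 54.9, 83.6, 44.3, 47.4,
    95.2, 75.0, 59.0, 26.2, 18.4, 85.0, 58.3, 76.5, 76.4, 64.0, 70.0,
    772 / 10,  # OCRBench, rescaled to 0-100
]
print(round(sum(aquila_vl_2b) / len(aquila_vl_2b), 1))  # 64.1
```
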
# How to use

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "BAAI/Aquila-VL-2B-llava-qwen"

model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# Pass any additional llava_model_args to load_pretrained_model if needed.
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()

# Load an image from a URL
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# ... or load an image from a local path
# url = "./local_image.jpg"
# image = Image.open(url)

# Preprocess the image and move it to the GPU in half precision
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Tokenize the prompt, replacing the image placeholder with IMAGE_TOKEN_INDEX
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

# Greedy decoding
cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

print(text_outputs)
```

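
The conversation object can also carry a multi-turn exchange. The following is a minimal sketch (not from the original card) that reuses the variables from the block above and asks a follow-up question about the same image; the follow-up text itself is only an example.

```python
# Minimal multi-turn sketch reusing the objects from the previous block (assumes they are
# still in scope). The follow-up question below is only an example.
conv.messages[-1][-1] = text_outputs[0]  # record the assistant's first answer in the conversation
conv.append_message(conv.roles[0], "Which model performs best in this chart?")
conv.append_message(conv.roles[1], None)

followup_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(device)

followup = model.generate(
    followup_ids,
    images=image_tensor,      # the image is reused; the prompt still contains one image token
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(followup, skip_special_tokens=True))
```
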

# Future Plan

* We plan to train models of various sizes.
* Future training will incorporate multi-image and video data.

## Citation

If you find this work useful, please cite the following work:

```
@misc{gu2024infinitymmscalingmultimodalperformance,
      title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
      author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
      year={2024},
      eprint={2410.18558},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.18558},
}
```