dhirajjoshi116 committed
Commit 21746e7 · verified · 1 Parent(s): 0908792

Update README.md

Files changed (1):
  1. README.md +50 -26
README.md CHANGED
@@ -1,38 +1,37 @@
  ---
  license: apache-2.0
  ---

  # granite-vision-3.3-2b

  **Model Summary:** Granite-vision-3.3-2b is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following data, comprising diverse public and synthetic datasets tailored to support a wide range of document understanding and general image tasks. Granite-vision-3.3-2b was trained by fine-tuning a Granite large language model with both image and text modalities.


- **Evaluations:** We evaluated granite-vision-3.3-2b alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark. The evaluation spanned multiple public benchmarks, with particular emphasis on document understanding tasks while also including general visual question-answering benchmarks.
-

- | | Molmo-E | InternVL3-2b | Phi-4-Multimodal(5.6b) | GV-3.2-2b | GV-3.3-2b |
- |-----------|--------------|----------------|-------------|------------|------------|
  | **Document benchmarks** |
- | ChartQA | 0.60 | 0.80 | 0.81 | 0.87 | **0.87** |
- | DocVQA | 0.66 | 0.88 | **0.93** | 0.89 | 0.91 |
- | TextVQA | 0.62 | 0.77 | 0.76 | 0.78 | **0.80** |
- | AI2D | 0.63 | 0.79 | **0.82** | 0.76 | 0.88(without mask) |
- | InfoVQA | 0.44 | 0.66 | **0.73** | 0.64 | 0.69 |
- | OCRBench | 0.65 | **0.84** | **0.84** | 0.77 | 0.80 |
- | LiveXiv VQA | 0.47 | - | - | 0.61 | **XX** |
- | LiveXiv TQA | 0.36 | - | - | 0.57 | **XX** |
  | **Other benchmarks** |
- | MMMU | 0.32 | 0.49 | 0.55 | 0.37 | 0.37 |
- | VQAv2 | 0.57 | - | - | 0.78 | **XX** |
- | RealWorldQA | 0.55 | 0.64 | - | 0.63 | **0.68** |
- | VizWiz VQA | 0.49 | - | - | 0.63 | **0.66** |
- | OK VQA | 0.40 | - | - | 0.56 | **0.60** |
-

  - **Paper:** [Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence](https://arxiv.org/abs/2502.09927). Note that the paper describes Granite Vision 3.2. Granite Vision 3.3 shares most of the technical underpinnings with Granite 3.2. However, there are several enhancements in terms of new and improved vision encoder, many new high quality datasets, and several new experimental capabilities.
  - **Release Date**: Jun 11th, 2025
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

- **Supported Input Format:** Currently the model supports English instructions and images (png, jpeg, etc.) as input format.

  **Intended Use:** The model is intended to be used in enterprise applications that involve processing visual and text data. In particular, the model is well-suited for a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Additionally, its capabilities extend to general image understanding, enabling it to be applied to a broader range of business applications. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.

@@ -107,7 +106,6 @@ model_path = "ibm-granite/granite-vision-3.3-2b"

  model = LLM(
  model=model_path,
- limit_mm_per_prompt={"image": 1},
  )

  sampling_params = SamplingParams(
@@ -138,11 +136,39 @@ print(f"Generated text: {outputs[0].outputs[0].text}")
  ```


  ### Experimental capabilities

- Granite-vision-3.3-2b introduces two new experimental capabilities (1) Image segmentation, (2) Doctags generation (please see [Docling project](https://github.com/docling-project/docling) for more details).

- TBD

  ### Fine-tuning

@@ -168,10 +194,8 @@ We built upon LLaVA (https://llava-vl.github.io) to train our model. We use mult

  **Infrastructure:** We train granite-vision-3.3-2b using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

- **Ethical Considerations and Limitations:** The use of Large Vision and Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-3.3-2b is not the exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
- Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios due to their reduced sizes, which could limit their ability to generate coherent and contextually accurate responses.
- This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-3.3-2b with ethical intentions and in a responsible way. We recommend using this model for document understanding tasks, and note that more general vision tasks may pose higher inherent risks of triggering biased or harmful output.
- To enhance safety, we recommend using granite-vision-3.3-2b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

  **Resources**
  - 📄 Read the full technical report [here](https://arxiv.org/abs/2502.09927)
 
  ---
  license: apache-2.0
  ---
+
  # granite-vision-3.3-2b

  **Model Summary:** Granite-vision-3.3-2b is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on meticulously curated instruction-following data comprising diverse public and synthetic datasets tailored to support a wide range of document understanding and general image tasks. Granite-vision-3.3-2b was trained by fine-tuning a Granite large language model with both image and text modalities.


+ **Evaluations:** We compare the performance of granite-vision-3.3-2b with previous Granite Vision models. Evaluations were done using the standard llms-eval benchmark and spanned multiple public benchmarks, with particular emphasis on document understanding tasks while also including general visual question-answering benchmarks.

+ | | GV-3.1-2b-preview | GV-3.2-2b | GV-3.3-2b |
+ |-----------|-----------|--------------|----------------|
  | **Document benchmarks** |
+ | ChartQA | 0.86 | 0.87 | 0.87 |
+ | DocVQA | 0.88 | 0.89 | **0.91** |
+ | TextVQA | 0.76 | 0.78 | **0.80** |
+ | AI2D | 0.78 | 0.76 | 0.77 |
+ | InfoVQA | 0.63 | 0.64 | **0.68** |
+ | OCRBench | 0.75 | 0.77 | **0.79** |
+ | LiveXiv VQA v2 | 0.61 | 0.61 | 0.61 |
+ | LiveXiv TQA v2 | 0.55 | 0.57 | 0.52 |
  | **Other benchmarks** |
+ | MMMU | 0.35 | 0.37 | 0.37 |
+ | VQAv2 | 0.81 | 0.78 | 0.79 |
+ | RealWorldQA | 0.65 | 0.63 | 0.63 |
+ | VizWiz VQA | 0.64 | 0.63 | 0.62 |
+ | OK VQA | 0.57 | 0.56 | 0.55 |


  - **Paper:** [Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence](https://arxiv.org/abs/2502.09927). Note that the paper describes Granite Vision 3.2. Granite Vision 3.3 shares most of its technical underpinnings with Granite Vision 3.2, but adds several enhancements, including a new and improved vision encoder, many new high-quality datasets, and several new experimental capabilities.
  - **Release Date**: Jun 11th, 2025
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

+ **Supported Input Format:** Currently, the model supports English instructions and images (PNG, JPEG) as input.

  **Intended Use:** The model is intended to be used in enterprise applications that involve processing visual and text data. In particular, the model is well-suited for a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Additionally, its capabilities extend to general image understanding, enabling it to be applied to a broader range of business applications. For tasks that exclusively involve text-based input, we suggest using our Granite large language models, which are optimized for text-only processing and offer superior performance compared to this model.

@@ -107,7 +106,6 @@ model_path = "ibm-granite/granite-vision-3.3-2b"

  model = LLM(
  model=model_path,
  )

  sampling_params = SamplingParams(
@@ -138,11 +136,39 @@ print(f"Generated text: {outputs[0].outputs[0].text}")
  ```


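The vLLM snippet is heavily abridged in this diff. For reference, here is a minimal sketch of how the surrounding pieces might fit together: the image path, question text, and sampling values are illustrative assumptions, and the prompt is built from the model processor's chat template rather than hard-coded, so it is a sketch rather than the exact code from the model card.

```python
# Hedged sketch: completing the abridged vLLM snippet shown above.
# The image path, question, and sampling values are placeholders.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "ibm-granite/granite-vision-3.3-2b"

processor = AutoProcessor.from_pretrained(model_path)
model = LLM(model=model_path)
sampling_params = SamplingParams(temperature=0.2, max_tokens=128)

# Build the prompt from the processor's chat template instead of hard-coding special tokens.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total shown in the table?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example_page.png").convert("RGB")  # hypothetical input image
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(f"Generated text: {outputs[0].outputs[0].text}")
```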
+
+ ### Safety evaluation
+
+ The GV-3.3-2b model also underwent safety alignment to make its responses safer without affecting the model's performance on its intended tasks. We carefully aligned the model on publicly available safety data and synthetically generated safety data, and we report safety scores on the publicly available RTVLM and VLGuard datasets.
+
+ **RTVLM Safety Score - [0,10] - Higher is Better**
+
+ | | Politics | Racial | Jailbreak | Mislead |
+ |-----------|-----------|--------------|----------------|----------------|
+ | GV-3.1-2b-preview | 7.2 | 7.7 | 4.5 | 7.6 |
+ | GV-3.2-2b | 7.6 | 7.8 | 6.2 | 8.0 |
+ | GV-3.3-2b | 8.0 | 8.1 | 7.5 | 8.0 |
+
+
+ **VLGuard Safety Score - [0,10] - Higher is Better**
+
+ | | Unsafe Images | Safe Images with Unsafe Instructions |
+ |-----------|-----------|--------------|
+ | GV-3.1-2b-preview | 6.6 | 8.4 |
+ | GV-3.2-2b | 7.6 | 8.9 |
+ | GV-3.3-2b | 8.4 | 9.3 |
+
+
  ### Experimental capabilities

+ Granite-vision-3.3-2b introduces three new experimental capabilities:
+
+ (1) Image segmentation: [A notebook showing a segmentation example](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVision_Segmentation_Notebook.ipynb)
+
+ (2) Doctags generation (please see the [Docling project](https://github.com/docling-project/docling) for more details): TBD (notebook link)
+
+ (3) Multipage support: The model was trained to handle question answering (QA) over multiple consecutive pages of a document, up to 10 pages, a limit driven by the demands of long-context processing. To support such long sequences without exceeding GPU memory limits, we recommend resizing images so that their longer dimension is 768 pixels (see the resizing sketch below).

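As a concrete illustration of the resizing recommendation above, the following minimal sketch (not part of the commit) scales page images so their longer side is at most 768 pixels while preserving aspect ratio; the Pillow-based approach and the file names are assumptions for illustration only.

```python
# Hedged sketch: downscale page images so the longer side is at most 768 px,
# preserving aspect ratio, before building a multipage QA prompt.
from PIL import Image

MAX_LONG_SIDE = 768

def resize_longer_side(image: Image.Image, target: int = MAX_LONG_SIDE) -> Image.Image:
    width, height = image.size
    scale = target / max(width, height)
    if scale >= 1.0:  # never upscale pages that are already small enough
        return image
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size, Image.Resampling.LANCZOS)

# Example: prepare up to 10 consecutive pages (hypothetical file names).
page_paths = [f"doc_page_{i}.png" for i in range(1, 11)]
pages = [resize_longer_side(Image.open(p).convert("RGB")) for p in page_paths]
```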
 
  ### Fine-tuning

@@ -168,10 +194,8 @@ We built upon LLaVA (https://llava-vl.github.io) to train our model. We use mult

  **Infrastructure:** We train granite-vision-3.3-2b using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

+ **Responsible Use and Limitations:** Some use cases for large vision-language models can trigger risks and ethical considerations, including but not limited to bias and fairness, misinformation, and autonomous decision-making. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive, or otherwise unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models are more susceptible to hallucination due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses; this is an active area of research, and we anticipate more rigorous exploration, understanding, and mitigation in this domain. We urge the community to use granite-vision-3.3-2b responsibly and to avoid any malicious utilization. We recommend using this model for document understanding tasks; more general vision tasks may pose a higher inherent risk of triggering unwanted output. To enhance safety, we recommend using granite-vision-3.3-2b alongside Granite Guardian, a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.
+

  **Resources**
  - 📄 Read the full technical report [here](https://arxiv.org/abs/2502.09927)