Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Abstract
Griffon v2, a unified high-resolution model with a down-sampling projector and visual tokenizer, achieves state-of-the-art performance in object detection, counting, and referring tasks by overcoming image resolution and multimodal perception limitations.
Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete context and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon.
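To make the token-budget argument in the abstract concrete, here is a minimal sketch of why down-sampling the visual feature grid helps. It is not the authors' projector: the abstract does not specify the mechanism, so plain non-overlapping average pooling is used as a stand-in, and the image size, patch size, stride, and the helper names `patch_grid` and `downsample_tokens` are all illustrative assumptions.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Patches along one side of a square image (assumed ViT-style encoder)."""
    return image_size // patch_size


def downsample_tokens(grid, stride):
    """Average-pool a 2D grid of feature vectors with non-overlapping windows.

    grid: list of rows, each row a list of feature vectors (lists of floats).
    Rows or columns that do not fill a complete window are dropped.
    This is a stand-in for a learned down-sampling projector, not the paper's.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for r in range(0, rows - stride + 1, stride):
        out_row = []
        for c in range(0, cols - stride + 1, stride):
            # Collect the stride x stride neighbourhood and average each channel.
            window = [grid[r + dr][c + dc]
                      for dr in range(stride) for dc in range(stride)]
            pooled = [sum(v[k] for v in window) / len(window)
                      for k in range(dim)]
            out_row.append(pooled)
        out.append(out_row)
    return out


# Illustrative numbers only: a 1024-pixel image with 16-pixel patches.
side = patch_grid(image_size=1024, patch_size=16)
# Toy 2-channel features that simply encode each patch's (row, col) position.
grid = [[[float(r), float(c)] for c in range(side)] for r in range(side)]
small = downsample_tokens(grid, stride=2)

tokens_before = side * side
tokens_after = len(small) * len(small[0])
print(tokens_before, tokens_after)
```

Under these assumed settings, a stride of 2 cuts the number of visual tokens passed to the language model by a factor of four while every output token still summarizes a region of the full-resolution image, which is the trade-off the abstract describes.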
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception (2024)
- RegionGPT: Towards Region Understanding Vision Language Model (2024)
- Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models (2024)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (2024)
- Multi-modal Attribute Prompting for Vision-Language Models (2024)
Thank you for sharing your work, well done! 👏
Besides the ablation studies aimed at selecting the best encoder among those chosen for the experiments, have you attempted to build your system on top of even smaller LLMs (3B parameters or fewer)?