arxiv:2510.18876

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Published on Oct 21

· Submitted by

taesiri on Oct 22

ByteDance

Upvote

Authors:

Haochen Wang ,

Ye Tian ,

Jiahao Meng ,

Abstract

Grasp Any Region (GAR) enhances region-level visual understanding by integrating global contexts and modeling interactions, achieving advanced reasoning and outperforming existing models in captioning and video reference tasks.

AI-generated summary

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehen- sive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

View arXiv page View PDF GitHub 68 Add to collection

Community

taesiri

Paper submitter 6 days ago

librarian-bot

5 days ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Abstract

Community

Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 3