Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Abstract
Decomposed Attention Fusion (DecAF) enhances video object segmentation by refining attention maps from multimodal large language models without retraining.
Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To adapt this ability directly to localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via the attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. Together, these mechanisms suppress irrelevant activations and enhance object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
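Since the abstract outlines the full pipeline (attention rollout, the two fusion mechanisms, then thresholding into coarse masks), here is a minimal PyTorch sketch of that flow. It is not the official DecAF implementation: the function names, the subtraction used for contrastive object-background fusion, the averaging used for complementary video-frame fusion, and the mean threshold are illustrative assumptions; only the rollout step follows its standard formulation.

```python
import torch

def attention_rollout(attn_layers):
    # attn_layers: list of head-averaged [num_tokens, num_tokens] attention matrices.
    # Standard rollout: add the residual path, renormalize rows, multiply across layers.
    num_tokens = attn_layers[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn in attn_layers:
        attn = 0.5 * attn + 0.5 * torch.eye(num_tokens)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout
    return rollout

def token_to_spatial_map(rollout, query_idx, visual_idx, h, w):
    # Slice attention from one text token to the visual tokens and reshape to a spatial map.
    amap = rollout[query_idx, visual_idx].reshape(h, w)
    return (amap - amap.min()) / (amap.max() - amap.min() + 1e-6)

def decomposed_attention_fusion(obj_video, bg_video, obj_frame, bg_frame):
    # (1) Contrastive object-background fusion (assumed form): subtract the
    #     background-query map from the object-query map to suppress irrelevant activations.
    # (2) Complementary video-frame fusion (assumed form): average the video-level
    #     and frame-level object cues so they compensate for each other.
    contrastive_video = torch.relu(obj_video - bg_video)
    contrastive_frame = torch.relu(obj_frame - bg_frame)
    fused = 0.5 * (contrastive_video + contrastive_frame)
    coarse_mask = fused > fused.mean()  # simple threshold into a coarse binary mask
    return fused, coarse_mask
```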
Community
We introduce DecAF (Decomposed Attention Fusion), a training-free framework that turns MLLMs' attention maps into video segmentation masks. DecAF refines noisy attention maps through two fusion mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion.
In addition, our attention-guided SAM2 prompting strategy lets DecAF obtain fine-grained masks and reach performance comparable to training-based methods, all without retraining.
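As a usage illustration of the attention-guided SAM2 prompting step, the sketch below turns a fused attention map into a positive point prompt and propagates it through the video with SAM2's video predictor. The paper's actual prompting strategy may differ: `fused_attn`, the checkpoint and config paths, and the peak-point heuristic are assumptions here, and the SAM2 calls follow the public facebookresearch/sam2 video predictor API (verify the names against your installed version).

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

def attention_to_point_prompt(fused_attn: np.ndarray):
    # Use the attention peak as a single positive click (one simple prompting choice).
    y, x = np.unravel_index(np.argmax(fused_attn), fused_attn.shape)
    points = np.array([[x, y]], dtype=np.float32)  # SAM2 expects (x, y) coordinates
    labels = np.array([1], dtype=np.int32)         # 1 = positive point
    return points, labels

# Placeholder config/checkpoint/frame paths; replace with your local files.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path="path/to/frames")

# fused_attn: DecAF's fused attention map for the prompt frame, resized to frame resolution.
points, labels = attention_to_point_prompt(fused_attn)
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
)

# Propagate the prompt through the video to obtain fine-grained per-frame masks.
video_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    video_masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()
```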
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding (2025)
- RefAM: Attention Magnets for Zero-Shot Referral Segmentation (2025)
- Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation (2025)
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs (2025)
- SimToken: A Simple Baseline for Referring Audio-Visual Segmentation (2025)
- Temporal Prompting Matters: Rethinking Referring Video Object Segmentation (2025)
- VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning (2025)
