---
license: apache-2.0
papers:
- arxiv:2509.16944
pipeline_tag: image-text-to-text
library_name: transformers
---
# llava-v1.5-13b-roi-K15T3-152k-v1bf16Mheads-twiginit

This model is associated with the paper [Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception](https://huggingface.co/papers/2509.16944).

## Introduction
Recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient image areas, but they typically face a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that exploit the model's internal attention are computationally inefficient, requiring either multi-pass prefill stages or the slow auto-regressive decoding process to identify the RoI.
We propose an efficient, annotation-free **S**elf-**D**istilled **R**egion **P**roposal **N**etwork (SD-RPN) that resolves this trade-off. Our core innovation is a pipeline that processes and denoises the noisy cross-attention maps from the MLLM's middle layers to generate pseudo-RoI labels. We then use these labels to train a lightweight and tunable Region Proposal Network (RPN) built upon the frozen MLLM backbone. The RPN predicts the RoI in a single forward pass using features already available from the MLLM's middle layers, completely decoupling RoI identification from the auto-regressive generation process and avoiding costly multi-pass operations.
<p align="center">
  <img src="https://github.com/YuHengsss/SD-RPN/raw/main/assets/framework.png" width="800" />
</p>
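To make the two ingredients above concrete, here is a minimal, self-contained sketch: binarizing a denoised attention map over image tokens into pseudo-RoI labels, and a small scoring head that reads frozen middle-layer image-token features in a single forward pass. The hidden size, token count, keep ratio, and helper names (`RoIHead`, `pseudo_roi_labels`) are illustrative assumptions, not the released implementation; see the repository for the actual architecture and training recipe.

```python
import torch
import torch.nn as nn

class RoIHead(nn.Module):
    """Toy RPN head: scores each image token from frozen middle-layer features."""
    def __init__(self, hidden_dim: int = 5120, proj_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.GELU(),
            nn.Linear(proj_dim, 1),
        )

    def forward(self, image_token_feats: torch.Tensor) -> torch.Tensor:
        # image_token_feats: (batch, num_image_tokens, hidden_dim),
        # taken from a middle layer of the frozen MLLM in one forward pass.
        return self.proj(image_token_feats).squeeze(-1)  # (batch, num_image_tokens)

def pseudo_roi_labels(attn_map: torch.Tensor, keep_ratio: float = 0.15) -> torch.Tensor:
    """Binarize a denoised cross-attention map: top `keep_ratio` tokens become positives."""
    # attn_map: (batch, num_image_tokens), already aggregated/denoised.
    k = max(1, int(attn_map.shape[-1] * keep_ratio))
    thresh = attn_map.topk(k, dim=-1).values[..., -1:]
    return (attn_map >= thresh).float()

if __name__ == "__main__":
    head = RoIHead()
    feats = torch.randn(2, 576, 5120)   # e.g. 24x24 image tokens, 13B hidden size (assumed)
    attn = torch.rand(2, 576)           # stand-in for a denoised cross-attention map
    labels = pseudo_roi_labels(attn)
    loss = nn.functional.binary_cross_entropy_with_logits(head(feats), labels)
    loss.backward()
    print(f"toy BCE loss: {loss.item():.4f}")
```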
For more details, code, and training instructions, visit the [GitHub repository](https://github.com/YuHengsss/SD-RPN).
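If the checkpoint is stored in the standard LLaVA-1.5 format for `transformers` (as the metadata above suggests), a plain image-text-to-text call might look like the sketch below. The full Hub repo id, prompt template, and any RoI-specific inference hooks are assumptions here, so prefer the inference scripts in the GitHub repository for the supported path.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Replace with this model's full Hub repo id (namespace assumed, not confirmed).
model_id = "llava-v1.5-13b-roi-K15T3-152k-v1bf16Mheads-twiginit"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```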
## Citation

If you use this model, please cite the original paper:
```bibtex
@misc{shi2025catching,
      title={Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception},
      author={Yuheng Shi and Xiaohuan Pei and Minjing Dong and Chang Xu},
      year={2025},
      eprint={2509.16944},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```