arxiv:2510.21323

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Published on Oct 24 · Submitted by Shufan Shen on Oct 29

AI-generated summary

VL-SAE, a sparse autoencoder, enhances vision-language alignment by correlating its hidden neurons with a unified concept set, improving interpretability and downstream performance in tasks such as zero-shot image classification and hallucination elimination.

Abstract

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty of mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer corresponds to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Code is available at https://github.com/ssfgunner/VL-SAE.
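The abstract names the key components (a distance-based encoder shared across modalities, two modality-specific decoders, and an activation-consistency objective for semantically aligned pairs) but not their exact implementation. The PyTorch sketch below is only a minimal illustration of that structure under stated assumptions: the class and argument names are invented here, the encoder is approximated by cosine similarity to learned concept vectors with a ReLU for sparsity, and the loss weighting is arbitrary. It is not the authors' code; see the linked repository for the actual implementation.

```python
# Minimal sketch of a VL-SAE-style sparse autoencoder (NOT the authors' code).
# Assumptions: the "distance-based encoder" is approximated by cosine similarity
# to learned concept vectors; names (VLSparseAutoencoder, n_concepts, ...) and
# loss weights are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLSparseAutoencoder(nn.Module):
    def __init__(self, dim: int, n_concepts: int):
        super().__init__()
        # One learned concept vector per hidden neuron, shared across modalities.
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim) * 0.02)
        # Two modality-specific decoders map concept activations back to the
        # vision / language representation spaces.
        self.decoder_vision = nn.Linear(n_concepts, dim)
        self.decoder_text = nn.Linear(n_concepts, dim)

    def encode(self, x):
        # Similarity of the normalized input to each concept vector,
        # rectified so activations are non-negative and sparse.
        x = F.normalize(x, dim=-1)
        c = F.normalize(self.concepts, dim=-1)
        return F.relu(x @ c.t())  # (batch, n_concepts)

    def forward(self, x, modality: str):
        z = self.encode(x)
        decoder = self.decoder_vision if modality == "vision" else self.decoder_text
        return decoder(z), z


def training_step(sae, img_emb, txt_emb, l1_weight=1e-3):
    """One self-supervised step: reconstruct each modality and encourage
    paired (semantically aligned) image/text inputs to share activations."""
    img_rec, z_img = sae(img_emb, "vision")
    txt_rec, z_txt = sae(txt_emb, "text")
    recon = F.mse_loss(img_rec, img_emb) + F.mse_loss(txt_rec, txt_emb)
    sparsity = z_img.abs().mean() + z_txt.abs().mean()
    consistency = F.mse_loss(z_img, z_txt)  # activation consistency for aligned pairs
    return recon + l1_weight * sparsity + consistency
```

In this reading, each row of `concepts` plays the role of one hidden neuron, so collecting the images and texts that most strongly activate a given neuron is what yields its concept in the unified concept set.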

Community

Shufan Shen (paper author and submitter):

[NeurIPS 2025] We propose VL-SAE, which helps users understand the vision-language alignment of VLMs via concepts.
Github: https://github.com/ssfgunner/VL-SAE
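As a rough illustration of the interpretation workflow described in the abstract, the sketch below extracts CLIP image features and ranks the autoencoder's hidden neurons (concepts) by activation. It reuses the hypothetical `VLSparseAutoencoder` from the sketch above; the checkpoint name, the example image path, and the idea of mapping neuron indices to human-readable concept names are assumptions for illustration, not the repository's actual API.

```python
# Illustration only: inspect which concepts fire for an image using CLIP
# features and a VL-SAE-style autoencoder (VLSparseAutoencoder from the
# sketch above). Model name, image path, and concept naming are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# In practice, load trained VL-SAE weights; here the module is untrained.
sae = VLSparseAutoencoder(dim=512, n_concepts=4096)

image = Image.open("example.jpg")  # hypothetical example image
with torch.no_grad():
    feats = clip.get_image_features(**proc(images=image, return_tensors="pt"))
    z = sae.encode(feats)  # (1, n_concepts) concept activations

# Rank hidden neurons (concepts) by activation; a concept-name table would be
# built offline from the images/texts that most activate each neuron.
top = torch.topk(z.squeeze(0), k=5)
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"concept #{idx}: activation {val:.3f}")
```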


Models citing this paper 1


Collections including this paper 1