arXiv:2507.13361

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Published on Jul 4, 2025

Abstract

Vision-Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation of vision-language models' capacity for nonlocal visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant regions of an image. We isolate three distinct forms of nonlocal vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves following a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test whether VLMs can perform visual algorithms similar to those used by humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

AI-generated summary

Vision-language models struggle with nonlocal visual reasoning tasks such as comparative perception, saccadic search, and smooth visual search, despite advancements in visual acuity.
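To make the three task families concrete, here is a minimal sketch of what a structured multiple-choice evaluation over them could look like. Everything in it is hypothetical rather than the authors' actual suite: the Task fields, the example prompts, and the query_vlm stub, which guesses uniformly at random and therefore scores at exactly the chance baseline the paper reports flagship models barely exceeding.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """One nonlocal-vision test item (fields are illustrative)."""
    kind: str           # "comparative", "saccadic", or "smooth"
    image_path: str     # path to the stimulus image
    question: str       # natural-language prompt shown to the VLM
    choices: list[str]  # multiple-choice options
    answer: str         # ground-truth choice

def query_vlm(task: Task) -> str:
    """Stand-in for a real VLM call; guessing uniformly over the
    options reproduces the random-accuracy baseline."""
    return random.choice(task.choices)

def evaluate(tasks: list[Task]) -> dict[str, float]:
    """Accuracy per task kind, to be read against the
    1 / len(choices) chance level for each item."""
    hits: dict[str, list[int]] = {}
    for t in tasks:
        hits.setdefault(t.kind, []).append(int(query_vlm(t) == t.answer))
    return {kind: sum(h) / len(h) for kind, h in hits.items()}

if __name__ == "__main__":
    demo = [
        Task("saccadic", "img_0.png",
             "Starting at the red arrow, which letter does the chain of arrows end on?",
             ["A", "B", "C", "D"], "C"),
        Task("smooth", "img_1.png",
             "Which numbered endpoint does the highlighted contour connect to?",
             ["1", "2", "3", "4"], "2"),
    ]
    print(evaluate(demo))
```

Replacing the stub with a call to an actual model would yield the per-family accuracy-versus-chance comparison the abstract describes; the per-kind breakdown matters because the paper's claim is about which form of nonlocal reasoning fails, not aggregate score.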

