DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Abstract
DeepWideSearch is a benchmark that evaluates agents' ability to integrate deep reasoning and wide-scale information collection, revealing significant challenges and limitations in current agent architectures.
Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications such as comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must collect a large volume of entries, each requiring deep reasoning over a multi-hop retrieval path. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve an average success rate of only 2.39% on DeepWideSearch, highlighting the substantial challenge of integrating deep and wide search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
Community
The DeepWideSearch benchmark is designed to evaluate LLM-based agents on simultaneous deep reasoning over multi-hop retrieval and wide-scale information collection, a critical capability for real-world tasks like market analysis and business development. An example question is: "List all second-tier suppliers of Apple's AirPods, with contact info, location, and certification status." The output of each task is a table: rows are the candidate answers to the question, and columns are the attributes the question requires for each candidate.
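Since each answer is a table of candidate rows and attribute columns, one natural way to score a prediction is cell-level overlap with the gold table. The sketch below is a hypothetical illustration, assuming exact string matching per cell; it is not the benchmark's official metric, and the `table_scores` name and the dict-of-dicts representation are assumptions for this example.

```python
def table_scores(pred: dict[str, dict[str, str]],
                 gold: dict[str, dict[str, str]]) -> dict[str, float]:
    """Compare two answer tables, each keyed by candidate entity and
    mapping attribute name -> value. Returns cell-level precision,
    recall, and F1 over (entity, attribute, value) triples."""
    pred_cells = {(e, a, v) for e, attrs in pred.items() for a, v in attrs.items()}
    gold_cells = {(e, a, v) for e, attrs in gold.items() for a, v in attrs.items()}
    hits = len(pred_cells & gold_cells)
    precision = hits / len(pred_cells) if pred_cells else 0.0
    recall = hits / len(gold_cells) if gold_cells else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy usage: the agent found the right supplier and location,
# but the wrong certification status.
gold = {"SupplierA": {"location": "Shenzhen", "certified": "yes"}}
pred = {"SupplierA": {"location": "Shenzhen", "certified": "no"}}
scores = table_scores(pred, gold)
```

In practice, a benchmark like this would likely use fuzzier matching (aliases, normalization) for attribute values, but the row-and-column structure of the score stays the same.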
Github: https://github.com/AIDC-AI/Marco-Search-Agent
Huggingface: https://huggingface.co/datasets/AIDC-AI/DeepWideSearch
The following papers were recommended by the Semantic Scholar API
- WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents (2025)
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents (2025)
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL (2025)
- ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization (2025)
- Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents (2025)
- Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics (2025)
- MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents (2025)