MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
Abstract
MSC-Bench evaluates multi-hop tool orchestration by LLM agents in a hierarchical ecosystem, addressing challenges like functional overlap and cross-server planning with a five-level curriculum and objective metrics.
We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, which leads to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', enabling objective metrics such as the F1 score and reducing dependence on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and that even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
Community
To address the questions "How to evaluate multi-hop, end-to-end tool orchestration by LLM agents?" and "How to design benchmarks that reflect real-world MCP tool ecosystems?", we propose MSC-Bench. MSC-Bench constructs ground truth via equal function sets to enable objective evaluation and reduce reliance on LLM-as-a-judge scoring, while organizing tasks into a five-level curriculum that progressively tests agents’ orchestration and robustness across servers and functions.
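To illustrate the idea, here is a minimal sketch of how an equal-function-set ground truth could support an objective F1 metric over an agent's tool selections. The function name, data shapes, and tool IDs below are illustrative assumptions, not MSC-Bench's actual interface.

```python
# Hypothetical sketch: set-based F1 where each ground-truth "slot" is an
# equal function set, i.e. any tool in the slot satisfies that step.
def tool_f1(predicted_tools, gold_slots):
    """predicted_tools: set of tool IDs chosen by the agent.
    gold_slots: list of sets; each set holds interchangeable tool IDs."""
    matched_slots = set()
    true_positives = 0
    for tool in predicted_tools:
        for i, slot in enumerate(gold_slots):
            if i not in matched_slots and tool in slot:
                matched_slots.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted_tools) if predicted_tools else 0.0
    recall = true_positives / len(gold_slots) if gold_slots else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two required steps; either search tool satisfies the first slot.
gold = [{"maps.search", "places.search"}, {"weather.forecast"}]
pred = {"places.search", "weather.forecast", "calendar.create"}
print(round(tool_f1(pred, gold), 3))  # 0.8 (precision 2/3, recall 1.0)
```

Because correctness is decided by membership in a fixed equal function set rather than by a judge model's opinion, the same prediction always receives the same score, which is the property the benchmark relies on for objective evaluation.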
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools (2025)
- MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers (2025)
- TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments (2025)
- AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production (2025)
- MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use (2025)
- Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution (2025)
- MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents (2025)