kaizuberbuehler's Collections
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World
Tasks
Paper
•
2412.14161
•
Published
•
51
Training Software Engineering Agents and Verifiers with SWE-Gym
Paper
•
2412.21139
•
Published
•
24
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse
Task Synthesis
Paper
•
2412.19723
•
Published
•
87
AgentGen: Enhancing Planning Abilities for Large Language Model based
Agent via Environment and Task Generation
Paper
•
2408.00764
•
Published
•
1
More Agents Is All You Need
Paper
•
2402.05120
•
Published
•
57
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
•
2402.07456
•
Published
•
46
Generative Agents: Interactive Simulacra of Human Behavior
Paper
•
2304.03442
•
Published
•
13
Language Agent Tree Search Unifies Reasoning Acting and Planning in
Language Models
Paper
•
2310.04406
•
Published
•
10
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and
Optimisation
Paper
•
2312.13010
•
Published
•
6
GAIA: a benchmark for General AI Assistants
Paper
•
2311.12983
•
Published
•
237
LLM Agent Operating System
Paper
•
2403.16971
•
Published
•
72
Octopus v2: On-device language model for super agent
Paper
•
2404.01744
•
Published
•
58
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler
Generation
Paper
•
2404.12753
•
Published
•
43
Scaling Instructable Agents Across Many Simulated Worlds
Paper
•
2404.10179
•
Published
•
28
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real
Computer Environments
Paper
•
2404.07972
•
Published
•
50
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Paper
•
2404.05902
•
Published
•
22
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
•
2404.05719
•
Published
•
83
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web
Navigating Agent
Paper
•
2404.03648
•
Published
•
29
Voyager: An Open-Ended Embodied Agent with Large Language Models
Paper
•
2305.16291
•
Published
•
11
LASER: LLM Agent with State-Space Exploration for Web Navigation
Paper
•
2309.08172
•
Published
•
13
The Rise and Potential of Large Language Model Based Agents: A Survey
Paper
•
2309.07864
•
Published
•
7
Reflexion: Language Agents with Verbal Reinforcement Learning
Paper
•
2303.11366
•
Published
•
5
LEGENT: Open Platform for Embodied Agents
Paper
•
2404.18243
•
Published
•
22
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
•
2405.12399
•
Published
•
30
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
•
2406.09246
•
Published
•
41
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex
Interactive Tasks
Paper
•
2305.17390
•
Published
•
3
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper
•
2407.18961
•
Published
•
40
AppWorld: A Controllable World of Apps and People for Benchmarking
Interactive Coding Agents
Paper
•
2407.18901
•
Published
•
35
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper
•
2407.21787
•
Published
•
13
OmniParser for Pure Vision Based GUI Agent
Paper
•
2408.00203
•
Published
•
25
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper
•
2307.13854
•
Published
•
25
Diffusion Augmented Agents: A Framework for Efficient Exploration and
Transfer Learning
Paper
•
2407.20798
•
Published
•
24
Diversity Empowers Intelligence: Integrating Expertise of Software
Engineering Agents
Paper
•
2408.07060
•
Published
•
42
The AI Scientist: Towards Fully Automated Open-Ended Scientific
Discovery
Paper
•
2408.06292
•
Published
•
126
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Paper
•
2408.14354
•
Published
•
41
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated
clinical environments
Paper
•
2405.07960
•
Published
•
1
On the limits of agency in agent-based models
Paper
•
2409.10568
•
Published
•
14
DSBench: How Far Are Data Science Agents to Becoming Data Science
Experts?
Paper
•
2409.07703
•
Published
•
67
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks
at Scale
Paper
•
2409.16299
•
Published
•
12
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
•
2411.10323
•
Published
•
34
Generative World Explorer
Paper
•
2411.11844
•
Published
•
77
Paper
•
2412.13501
•
Published
•
29
Large Action Models: From Inception to Implementation
Paper
•
2412.10047
•
Published
•
36
A3: Android Agent Arena for Mobile GUI Agents
Paper
•
2501.01149
•
Published
•
22
ResearchTown: Simulator of Human Research Community
Paper
•
2412.17767
•
Published
•
14
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital
World
Paper
•
2412.17589
•
Published
•
14
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Paper
•
2412.14470
•
Published
•
13
GenEx: Generating an Explorable World
Paper
•
2412.09624
•
Published
•
97
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web
Tutorials
Paper
•
2412.09605
•
Published
•
29
The BrowserGym Ecosystem for Web Agent Research
Paper
•
2412.05467
•
Published
•
22
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Paper
•
2412.04454
•
Published
•
70
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and
Proactive Robotic Failure Detection
Paper
•
2412.04455
•
Published
•
38
MALT: Improving Reasoning with Multi-Agent LLM Training
Paper
•
2412.01928
•
Published
•
45
Mars-PO: Multi-Agent Reasoning System Preference Optimization
Paper
•
2411.19039
•
Published
•
1
Flow-DPO: Improving LLM Mathematical Reasoning through Online
Multi-Agent Learning
Paper
•
2410.22304
•
Published
•
18
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics
Manipulation
Paper
•
2411.17636
•
Published
•
2
Cooperative Strategic Planning Enhances Reasoning Capabilities in Large
Language Models
Paper
•
2410.20007
•
Published
•
1
Enhancing LLM Agents for Code Generation with Possibility and Pass-rate
Prioritized Experience Replay
Paper
•
2410.12236
•
Published
•
1
Large Language Model-Brained GUI Agents: A Survey
Paper
•
2411.18279
•
Published
•
31
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
•
2411.17465
•
Published
•
89
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning
for Web Agents
Paper
•
2411.06559
•
Published
•
16
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile
Manipulation
Paper
•
2411.04999
•
Published
•
18
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle
Grandmaster Level
Paper
•
2411.03562
•
Published
•
68
Agent Laboratory: Using LLM Agents as Research Assistants
Paper
•
2501.04227
•
Published
•
94
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
•
2501.04575
•
Published
•
25
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Paper
•
2501.01821
•
Published
•
20
SOTOPIA: Interactive Evaluation for Social Intelligence in Language
Agents
Paper
•
2310.11667
•
Published
•
4
WebWalker: Benchmarking LLMs in Web Traversal
Paper
•
2501.07572
•
Published
•
23
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
•
2501.05707
•
Published
•
20
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub
Issue Resolution
Paper
•
2501.05040
•
Published
•
15
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
•
2501.09747
•
Published
•
27
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous
Reinforcement Learning
Paper
•
2406.11896
•
Published
•
20
From Novice to Expert: LLM Agent Policy Optimization via Step-wise
Reinforcement Learning
Paper
•
2411.03817
•
Published
•
1
PaSa: An LLM Agent for Comprehensive Academic Paper Search
Paper
•
2501.10120
•
Published
•
52
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Paper
•
2501.12326
•
Published
•
65
Agent-R: Training Language Model Agents to Reflect via Iterative
Self-Training
Paper
•
2501.11425
•
Published
•
108
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
•
2501.11733
•
Published
•
28
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in
Realistic Environments
Paper
•
2501.10893
•
Published
•
26
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in
Virtual 3D Spaces
Paper
•
2501.12909
•
Published
•
71
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI
Systems
Paper
•
2501.11067
•
Published
•
13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Paper
•
2501.13200
•
Published
•
68
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Paper
•
2502.02584
•
Published
•
17
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models
Beneficial?
Paper
•
2502.00674
•
Published
•
13
TwinMarket: A Scalable Behavioral and Social Simulation for Financial
Markets
Paper
•
2502.01506
•
Published
•
38
Large Language Model Guided Self-Debugging Code Generation
Paper
•
2502.02928
•
Published
•
13
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference
Optimization
Paper
•
2502.04306
•
Published
•
20
Training Language Models for Social Deduction with Multi-Agent
Reinforcement Learning
Paper
•
2502.06060
•
Published
•
38
CODESIM: Multi-Agent Code Generation and Problem Solving through
Simulation-Driven Planning and Debugging
Paper
•
2502.05664
•
Published
•
24
Hephaestus: Improving Fundamental Agent Capabilities of Large Language
Models through Continual Pre-Training
Paper
•
2502.06589
•
Published
•
20
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
•
2502.08047
•
Published
•
28
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
•
2502.09560
•
Published
•
35
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in
Agentic Tasks
Paper
•
2502.08235
•
Published
•
58
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper
•
2502.14499
•
Published
•
192
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
•
2502.14282
•
Published
•
29
Magma: A Foundation Model for Multimodal AI Agents
Paper
•
2502.13130
•
Published
•
58
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for
Multimodal Web Agents
Paper
•
2502.11357
•
Published
•
10
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
•
2502.19400
•
Published
•
48
Towards an AI co-scientist
Paper
•
2502.18864
•
Published
•
51
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning
Trajectories for Complex Problem Solving
Paper
•
2502.16111
•
Published
•
9
TAG: A Decentralized Framework for Multi-Agent Hierarchical
Reinforcement Learning
Paper
•
2502.15425
•
Published
•
9
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided
Multi-Agent Collaboration
Paper
•
2502.17110
•
Published
•
13
WebGames: Challenging General-Purpose Web-Browsing AI Agents
Paper
•
2502.18356
•
Published
•
14
VEM: Environment-Free Exploration for Training GUI Agent with Value
Environment Model
Paper
•
2502.18906
•
Published
•
12
Curie: Toward Rigorous and Automated Scientific Experimentation with AI
Agents
Paper
•
2502.16069
•
Published
•
20
Agentic Reward Modeling: Integrating Human Preferences with Verifiable
Correctness Signals for Reliable Reward Systems
Paper
•
2502.19328
•
Published
•
23
ATLaS: Agent Tuning via Learning Critical Steps
Paper
•
2503.02197
•
Published
•
9
Gemini Robotics: Bringing AI into the Physical World
Paper
•
2503.20020
•
Published
•
29
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Paper
•
2503.02268
•
Published
•
11
Unified Video Action Model
Paper
•
2503.00200
•
Published
•
14
MPO: Boosting LLM Agents with Meta Plan Optimization
Paper
•
2503.02682
•
Published
•
28
MultiAgentBench: Evaluating the Collaboration and Competition of LLM
agents
Paper
•
2503.01935
•
Published
•
29
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
•
2503.10480
•
Published
•
55
Automated Movie Generation via Multi-Agent CoT Planning
Paper
•
2503.07314
•
Published
•
44
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via
Reinforcement Learning and Reasoning
Paper
•
2503.07608
•
Published
•
23
SafeArena: Evaluating the Safety of Autonomous Web Agents
Paper
•
2503.04957
•
Published
•
21
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based
VLM Agent Training
Paper
•
2503.08525
•
Published
•
17
Agent models: Internalizing Chain-of-Action Generation into Reasoning
models
Paper
•
2503.06580
•
Published
•
19
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for
Complex Medical Reasoning
Paper
•
2503.07459
•
Published
•
16
LocAgent: Graph-Guided LLM Agents for Code Localization
Paper
•
2503.09089
•
Published
•
13
AI-native Memory 2.0: Second Me
Paper
•
2503.08102
•
Published
•
13
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
•
2503.21620
•
Published
•
62
Agents Play Thousands of 3D Video Games
Paper
•
2503.13356
•
Published
•
9
SWE-smith: Scaling Data for Software Engineering Agents
Paper
•
2504.21798
•
Published
•
11
Survey on Evaluation of LLM-based Agents
Paper
•
2503.16416
•
Published
•
95
Why Do Multi-Agent LLM Systems Fail?
Paper
•
2503.13657
•
Published
•
47
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?
Paper
•
2503.12349
•
Published
•
44
API Agents vs. GUI Agents: Divergence and Convergence
Paper
•
2503.11069
•
Published
•
37
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of
Tools
Paper
•
2503.10970
•
Published
•
18
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
•
2503.13444
•
Published
•
17
STEVE: A Step Verification Pipeline for Computer-use Agent Training
Paper
•
2503.12532
•
Published
•
17
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning
Tasks
Paper
•
2503.15478
•
Published
•
13
Measuring AI Ability to Complete Long Tasks
Paper
•
2503.14499
•
Published
•
15
Free-form language-based robotic reasoning and grasping
Paper
•
2503.13082
•
Published
•
11
Large Language Model Agent: A Survey on Methodology, Applications and
Challenges
Paper
•
2503.21460
•
Published
•
83
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
•
2503.16905
•
Published
•
54
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
•
2503.20201
•
Published
•
48
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for
Automated Prompt Optimization
Paper
•
2503.16874
•
Published
•
44
RoboFactory: Exploring Embodied Agent Collaboration with Compositional
Constraints
Paper
•
2503.16408
•
Published
•
41
AgentRxiv: Towards Collaborative Autonomous Research
Paper
•
2503.18102
•
Published
•
25
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
•
2503.21696
•
Published
•
23
Defeating Prompt Injections by Design
Paper
•
2503.18813
•
Published
•
22
MDocAgent: A Multi-Modal Multi-Agent Framework for Document
Understanding
Paper
•
2503.13964
•
Published
•
20
Advances and Challenges in Foundation Agents: From Brain-Inspired
Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper
•
2504.01990
•
Published
•
300
PaperBench: Evaluating AI's Ability to Replicate AI Research
Paper
•
2504.01848
•
Published
•
36
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive
Program Synthesis
Paper
•
2503.23145
•
Published
•
35
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist
Policy
Paper
•
2503.24388
•
Published
•
30
Agent S2: A Compositional Generalist-Specialist Framework for Computer
Use Agents
Paper
•
2504.00906
•
Published
•
26
Towards Trustworthy GUI Agents: A Survey
Paper
•
2503.23434
•
Published
•
21
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Paper
•
2504.01871
•
Published
•
12
ActionStudio: A Lightweight Framework for Data and Training of Large
Action Models
Paper
•
2503.22673
•
Published
•
12
Scaling Laws in Scientific Discovery with AI and Robot Scientists
Paper
•
2503.22444
•
Published
•
12
VerifiAgent: a Unified Verification Agent in Language Model Reasoning
Paper
•
2504.00406
•
Published
•
8
MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via
Reasoning Agentic Workflow
Paper
•
2503.18968
•
Published
•
8
A Unified Agentic Framework for Evaluating Conditional Image Generation
Paper
•
2504.07046
•
Published
•
30
Agentic Knowledgeable Self-awareness
Paper
•
2504.03553
•
Published
•
27
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in
Multi-Agent Simulations
Paper
•
2504.07830
•
Published
•
18
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge
Refinement
Paper
•
2504.03561
•
Published
•
18
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated
Agent-Human Interplay
Paper
•
2504.03601
•
Published
•
17
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Paper
•
2503.22738
•
Published
•
17
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
•
2504.06148
•
Published
•
13
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing
Skills
Paper
•
2504.07079
•
Published
•
12
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Paper
•
2504.11536
•
Published
•
62
Genius: A Generalizable and Purely Unsupervised Self-Training Framework
For Advanced Reasoning
Paper
•
2504.08672
•
Published
•
55
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
•
2504.08388
•
Published
•
42
Paper
•
2504.11442
•
Published
•
29
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent
Trajectories
Paper
•
2504.08942
•
Published
•
28
MLRC-Bench: Can Language Agents Solve Machine Learning Research
Challenges?
Paper
•
2504.09702
•
Published
•
18
SocioVerse: A World Model for Social Simulation Powered by LLM Agents
and A Pool of 10 Million Real-World Users
Paper
•
2504.10157
•
Published
•
17
Breaking the Data Barrier -- Building GUI Agents Through Task
Generalization
Paper
•
2504.10127
•
Published
•
17
ReZero: Enhancing LLM search ability by trying one-more-time
Paper
•
2504.11001
•
Published
•
15
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via
Agentic Tree Search
Paper
•
2504.08066
•
Published
•
14
Exploring Expert Failures Improves LLM Agent Tuning
Paper
•
2504.13145
•
Published
•
12
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic
Data Generation
Paper
•
2504.12563
•
Published
•
4
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
•
2504.17192
•
Published
•
120
FlowReasoner: Reinforcing Query-Level Meta-Agents
Paper
•
2504.15257
•
Published
•
47
ToolRL: Reward is All Tool Learning Needs
Paper
•
2504.13958
•
Published
•
48
OTC: Optimal Tool Calls via Reinforcement Learning
Paper
•
2504.14870
•
Published
•
35
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Paper
•
2504.13203
•
Published
•
34
BookWorld: From Novels to Interactive Agent Societies for Creative Story
Generation
Paper
•
2504.14538
•
Published
•
30
UFO2: The Desktop AgentOS
Paper
•
2504.14603
•
Published
•
29
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making
Abilities
Paper
•
2504.16078
•
Published
•
21
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World
Model-based LLM Agents
Paper
•
2504.15785
•
Published
•
20