Abstract
SkillClaw enables collective skill evolution in multi-user LLM agent systems by aggregating user interactions to autonomously update and improve reusable skills across the ecosystem.
Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats interactions across users and over time as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.
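The aggregate-then-evolve loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Trajectory` and `SkillRepository` types, the failure-rate heuristic, and the `evolve` function are all hypothetical stand-ins for the autonomous evolver.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One user interaction: which skill it invoked and whether it succeeded."""
    skill: str
    success: bool

@dataclass
class SkillRepository:
    """Hypothetical shared skill store synchronized across users.

    Here a skill is just a name mapped to a version counter; refining a
    skill bumps its version so all users pick up the update."""
    skills: dict = field(default_factory=dict)

    def refine(self, name: str) -> None:
        self.skills[name] = self.skills.get(name, 0) + 1

def evolve(repo: SkillRepository, trajectories: list[Trajectory],
           failure_threshold: float = 0.5) -> list[str]:
    """Aggregate cross-user trajectories and refine skills whose observed
    failure rate crosses a threshold (a toy proxy for the evolver's
    pattern-mining step)."""
    stats = defaultdict(lambda: [0, 0])  # skill -> [failures, total]
    for t in trajectories:
        stats[t.skill][1] += 1
        if not t.success:
            stats[t.skill][0] += 1
    updated = []
    for skill, (fails, total) in stats.items():
        if total and fails / total >= failure_threshold:
            repo.refine(skill)
            updated.append(skill)
    return updated
```

Because the repository is shared, a refinement triggered by one user's failures propagates to every user on the next sync, which is the cross-user transfer the abstract emphasizes.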
Community
found a good walkthrough of this paper here https://arxivexplained.com/skillclaw-let-skills-evolve-collectively-with-agentic-evolver the part about SkillClaw was the most interesting bit to me
The 404 error
the most interesting detail here is how SkillClaw clusters cross-user trajectories into referenced skills and then uses the evolver to translate those patterns into concrete updates. i’m curious about how you decide the granularity of the skill taxonomy, and what happens if two different workflows land in the same cluster or if a single workflow touches multiple skills. btw the arXivLens breakdown helped me parse the method details, especially the nighttime validation gate that defers updates until a safe rollout. a potential pitfall is drift when the taxonomy is too coarse or when noisy signals from users push updates that barely generalize, so i’d love to see an ablation on taxonomy granularity and tests on non-stationary user distributions.
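The comment's question about a workflow touching multiple skills has a simple baseline answer: let a trajectory join every cluster whose skill it references, rather than forcing a single assignment. A minimal sketch, assuming trajectories carry an explicit list of referenced skill names (the `skills` field is hypothetical):

```python
from collections import defaultdict

def cluster_by_skill(trajectories: list[dict]) -> dict[str, list[dict]]:
    """Group trajectories under each skill they reference.

    A workflow that touches multiple skills appears in every matching
    cluster, so no single-assignment decision is needed; trajectories
    referencing no known skill simply form no cluster."""
    clusters = defaultdict(list)
    for traj in trajectories:
        for skill in traj.get("skills", []):
            clusters[skill].append(traj)
    return dict(clusters)
```

This sidesteps the multi-skill ambiguity but not the granularity question: a too-coarse taxonomy still merges distinct workflows into one cluster, which is exactly the drift risk the comment raises.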