Alibaba Cloud researchers published a paper on arXiv presenting SkillWeaver, a framework that routes AI agent tasks to relevant tools and claims a reduction of more than 99% in context-window token consumption. The research addresses a concrete problem: as agents gain access to hundreds or thousands of tools via Model Context Protocol (MCP) servers, stuffing every tool description into the prompt becomes prohibitively expensive.
The benchmark numbers are dramatic. On CompSkillBench, a test suite of 300 compositional queries across 2,209 real skills from public MCP servers, context-window usage dropped from an estimated 884,000 tokens per query to roughly 1,160 tokens, according to the arXiv paper as reported by WinBuzzer. The baseline comparison loads every skill description into the prompt, a method no production system would actually use, which makes the headline figure less actionable than it sounds.
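The headline percentage follows directly from the two token counts quoted above. The short calculation below reproduces it and also derives an average description size per skill; that second figure is an inference from the reported totals, not a number taken from the paper.

```python
# Figures quoted in the article; variable names are illustrative.
baseline_tokens = 884_000   # estimated tokens per query with every skill loaded
routed_tokens = 1_160       # approximate tokens per query after routing
skill_count = 2_209         # skills in the benchmark catalog

# Fraction of context-window tokens saved by routing.
reduction = 1 - routed_tokens / baseline_tokens

# Derived, not reported: mean tokens per skill description in the baseline.
avg_tokens_per_skill = baseline_tokens / skill_count

print(f"reduction: {reduction:.2%}")                            # reduction: 99.87%
print(f"avg tokens per skill: {avg_tokens_per_skill:.0f}")      # avg tokens per skill: 400
```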
How the Routing Works
SkillWeaver operates in three stages: Decompose, Retrieve, and Compose. A complex request is broken into atomic sub-tasks. Each sub-task triggers a vector search against a FAISS-backed index of tool descriptions using all-MiniLM-L6-v2 embeddings, with retrieval latency under 15 milliseconds. Qwen2.5-7B-Instruct handles the primary decomposition logic.
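Since the source code is unreleased, the retrieve step can only be illustrated, not reproduced. The sketch below is a minimal stand-in under loud assumptions: a toy bag-of-words vector replaces the all-MiniLM-L6-v2 embeddings, a brute-force cosine scan replaces the FAISS index, and the tool catalog, `embed`, `retrieve`, and `route` are all hypothetical names invented for this example.

```python
import math
from collections import Counter

# Hypothetical tool catalog; names and descriptions are illustrative only.
CATALOG = {
    "weather.get_forecast": "get the weather forecast for a city",
    "calendar.create_event": "create a calendar event with a date and time",
    "email.send_message": "send an email message to a recipient",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Embed every description once; a vector index would play this role at scale.
INDEX = {name: embed(desc) for name, desc in CATALOG.items()}

def retrieve(sub_task, k=1):
    """Return the k tool names whose descriptions best match the sub-task."""
    query = embed(sub_task)
    ranked = sorted(INDEX, key=lambda name: cosine(query, INDEX[name]), reverse=True)
    return ranked[:k]

def route(sub_tasks, k=1):
    """Map each already-decomposed sub-task to its top-k candidate tools."""
    return {task: retrieve(task, k) for task in sub_tasks}

# Sub-tasks are written by hand here; in the paper an LLM produces them.
plan = route(["check the weather forecast for Berlin", "send an email to the team"])
print(plan)
```

Only the descriptions of the retrieved tools would then be placed in the prompt, which is where the token savings come from.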
A method called Skill-Aware Decomposition uses retrieved tool hints to refine the task breakdown before the final execution plan assembles. The system then builds a dependency graph determining which sub-tasks can run in parallel and which require sequential ordering.
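One common way to derive parallel and sequential groups from a dependency graph is a level-by-level topological sort. The sketch below uses Python's standard `graphlib` module to do that; the sub-task names and the `parallel_batches` helper are hypothetical, and nothing here is confirmed to match how SkillWeaver builds its own graph.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each sub-task maps to the set of sub-tasks it depends on.
PLAN = {
    "fetch_forecast": set(),
    "fetch_calendar": set(),
    "pick_date": {"fetch_forecast", "fetch_calendar"},
    "send_invite": {"pick_date"},
}

def parallel_batches(dependencies):
    """Group sub-tasks into batches; tasks within a batch have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch finished to unlock the next
    return batches

batches = parallel_batches(PLAN)
print(batches)
# [['fetch_calendar', 'fetch_forecast'], ['pick_date'], ['send_invite']]
```

Each inner list can run concurrently, while the outer list order enforces sequencing.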
Benchmark Results in Context
Strict decomposition accuracy rose from 51.0% to 67.7% after one Skill-Aware Decomposition pass with Qwen2.5-7B-Instruct, while Qwen-Max reached 92% accuracy on the same task. A pilot execution study reported a 76.7% chain completion rate for routed plans using mock executors.
The ReAct baseline scored 0% decomposition accuracy on the same benchmark, a low bar that flatters SkillWeaver by comparison. Mock executors are also not production APIs, so the 76.7% chain completion rate is best read as an upper bound under ideal conditions.
Production Gaps
Three significant limitations remain. First, source code has not been released. Second, the compose stage lacks built-in error recovery when an API call in the dependency graph fails. Third, retrieval rank quality matters: if the correct tool does not appear in the initial retrieval results, downstream graph steps that depend on it will fail silently.
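The second and third limitations can be made concrete with two small checks a team might add around such a pipeline. The sketch below is an assumption about how one could surface these failures, not something the paper describes: `hit_rate_at_k` measures how often the expected tool appears in the top-k retrieval results on a labeled set, and `blocked_steps` lists every step that cannot run once an upstream step has failed. All names and data are hypothetical.

```python
def hit_rate_at_k(ranked_results, expected, k):
    """Fraction of sub-tasks whose expected tool appears in the top-k retrieved tools."""
    hits = sum(
        1 for task, tool in expected.items()
        if tool in ranked_results.get(task, [])[:k]
    )
    return hits / len(expected)

def blocked_steps(dependencies, failed):
    """Return steps that depend, directly or transitively, on a failed step."""
    blocked = set()
    changed = True
    while changed:  # repeat until no new blocked step is discovered
        changed = False
        for step, deps in dependencies.items():
            if step in failed or step in blocked:
                continue
            if deps & (failed | blocked):
                blocked.add(step)
                changed = True
    return blocked

# Hypothetical labeled retrieval results for two sub-tasks.
ranked = {"a": ["t1", "t2"], "b": ["t3", "t4"]}
expected = {"a": "t2", "b": "t9"}
print(hit_rate_at_k(ranked, expected, k=1))  # 0.0
print(hit_rate_at_k(ranked, expected, k=2))  # 0.5

# Hypothetical plan in which the first step fails (retrieval miss or API error).
plan = {
    "fetch_forecast": set(),
    "pick_date": {"fetch_forecast"},
    "send_invite": {"pick_date"},
}
print(sorted(blocked_steps(plan, failed={"fetch_forecast"})))
# ['pick_date', 'send_invite']
```

Reporting the blocked set explicitly turns a silent downstream failure into a visible one, which is a precondition for any retry or fallback logic.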
The benchmark also tests a specific scenario: routing across a large tool catalog. It does not test the more common production pattern where agents operate with 10 to 50 configured tools, a regime where the context-window problem is less acute.
The Cost Problem It Targets
Agent token costs remain a documented pain point. NCT reported last week that OpenAI’s API tier system automatically escalates spending limits to $200,000 per month, and Forbes documented an OpenClaw user running 100 agents at $1.3 million monthly in token fees. SkillWeaver’s routing approach, if validated in production, would reduce the tool-description portion of that spend, though tool descriptions typically represent a fraction of total token usage compared to conversation history and reasoning chains.
For teams building multi-tool agent systems on MCP infrastructure, the research demonstrates a viable architectural pattern: separate tool selection from task execution, and use lightweight retrieval to avoid loading entire skill catalogs into every prompt. Whether Alibaba Cloud ships this as a product or leaves it as a research contribution depends on the code release and production hardening that the paper itself acknowledges are missing.