Researchers from Huawei Technologies, Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences published Claw-Anything, a benchmark that simulates over three months of a user’s digital life and asks AI agents to navigate it. GPT-5.5, OpenAI’s flagship model built for agentic long-horizon work, scored 34.5% on the pass@1 metric. On proactive tasks where agents needed to anticipate user needs without being prompted, scores collapsed to 6.7%.

The benchmark arrives at a moment when the gap between agent marketing and agent reality is becoming measurable. Enterprise adoption surveys show 79% of organizations experimenting with AI agents but only 11% reaching production scale, according to Digital Applied’s compilation of 220+ sourced data points. Claw-Anything offers one explanation for why: the benchmarks used to validate agent capabilities may not be testing what matters.

What Claw-Anything Actually Measures

Most existing agent benchmarks hand the model a clean task in an isolated environment. Summarize this document. Write this function. Answer this question given these three files. The supplied context runs from 1,700 to 12,000 words. The agent gets one service to interact with. The task starts and ends in a single session.

Claw-Anything inverts every one of those constraints. The benchmark evaluates agents across three dimensions simultaneously, according to the research paper:

Long-horizon event streams. Each task sits inside a simulated user history spanning more than three months. The agent doesn’t get a clean desk. It gets months of accumulated emails, calendar entries, notes, Slack messages, price alerts, and shopping activity, including irrelevant noise and conflicting signals. The average context per task runs to 191,700 words.

Interdependent backend services. Each task requires coordination across an average of 10.1 backend services. An agent might need to cross-reference a price alert from three weeks ago with a calendar appointment and an email thread, then act on the result. Removing any required cross-service tool drops success rates to near zero, the researchers found.

Multi-device interaction. Agents must switch between CLI Linux environments and GUI Android environments within the same task. A user might start research on a desktop and need the agent to complete an action from a phone.

The scoring method is pass@1: the agent must complete the task correctly on its first attempt. No retries, no partial credit, no “close enough.”
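The scoring rule as described here can be sketched in a few lines of Python. This is only an illustration of "first attempt or nothing"; the `TaskResult` type, the task names, and the sample outcomes are hypothetical, and the benchmark's actual harness may compute the metric differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task; `attempts` holds pass/fail per attempt, in order."""
    task_id: str
    attempts: list


def pass_at_1(results):
    """Fraction of tasks whose FIRST attempt succeeded; later attempts are ignored."""
    if not results:
        raise ValueError("no results to score")
    first_try_passes = sum(1 for r in results if r.attempts and r.attempts[0])
    return first_try_passes / len(results)


# Hypothetical outcomes, invented for illustration only.
results = [
    TaskResult("price-alert", [False, True]),   # passes only on retry: still a failure
    TaskResult("reschedule", [True]),           # passes first time
    TaskResult("draft-reply", [False]),         # fails
    TaskResult("book-delivery", [True, True]),  # passes first time
]
print(f"pass@1 = {pass_at_1(results):.1%}")  # pass@1 = 50.0%
```

Note that the first task would count as a success under a retry-tolerant metric, which is exactly the leniency this scoring rule removes.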

The Numbers

GPT-5.5 led the leaderboard at 34.5% overall. Several models that score impressively on existing benchmarks landed below that. The paper does not list all model scores, but the research team noted the gap between Claw-Anything performance and prior benchmark performance was consistent across models, as reported by Decrypt.

The most revealing breakdown splits performance by task type:

  • Reactive tasks (respond to an explicit user request with current context): 25.9%
  • Proactive tasks (identify a need and act without being asked): 6.7%

That 19-point gap between reactive and proactive performance is the benchmark’s central finding. Most existing evaluations test only reactive behavior. The agent receives a prompt, executes, and returns a result. Claw-Anything also tests whether agents can monitor a stream of events, recognize when something requires attention, and initiate action.
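For readers who want the arithmetic spelled out, the short Python snippet below derives the gap from the two reported figures. It uses only numbers quoted in this article; the variable names are illustrative.

```python
reactive = 25.9   # reported pass@1 on reactive tasks, in percent
proactive = 6.7   # reported pass@1 on proactive tasks, in percent

gap_points = round(reactive - proactive, 1)    # absolute gap, in percentage points
ratio = round(reactive / proactive, 2)         # reactive success relative to proactive
proactive_failure = round(100 - proactive, 1)  # share of proactive tasks that fail

print(gap_points, ratio, proactive_failure)  # 19.2 3.87 93.3
```

Put differently, reactive tasks succeed nearly four times as often as proactive ones, and more than nine in ten proactive tasks fail.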

The proactive failure mode is specific: agents either miss the trigger entirely (the price drop happened, but the agent didn’t notice it against months of noise) or act on the wrong signal (confusing a similar event from weeks ago with the current one). With an average of 191,700 words of context per task, agents must filter signal from massive amounts of accumulated noise, a fundamentally different problem from answering a question about a 5-page document.

Why Existing Benchmarks Missed This

The Claw-Anything researchers argue that most benchmarks treat agents as “task solvers given a clean desk,” per Decrypt. The agent gets a well-defined problem, the tools it needs, and a clean environment. Success on those benchmarks tells you the agent can follow instructions in controlled conditions. It tells you almost nothing about whether the agent can operate in the messy, context-heavy, multi-service environments where personal assistants actually need to function.

This is not the first time benchmark validity has come under scrutiny in 2026. OpenAI declared SWE-bench Verified contaminated earlier this year after finding flawed tests in 59% of failed test cases. Models scoring 70-80% on SWE-bench Verified dropped to roughly 23% on SWE-bench Pro, a version designed to resist contamination. That was a data hygiene problem. Claw-Anything identifies something more structural: the benchmarks may be measuring the wrong capabilities entirely.

The distinction matters for deployment decisions. An enterprise evaluating whether to deploy an AI agent for customer support might look at task-completion benchmarks and see 70-80% success rates. Claw-Anything suggests that once that agent encounters real customer histories spanning months, with multiple communication channels and interdependent backend systems, performance could drop by half or more.

Huawei’s Broader AI Agent Infrastructure Play

Claw-Anything does not exist in isolation. Huawei announced a full-stack AI data platform and agent foundry at its ID 2026 forum in Paris on the same day, according to Blocks and Files. The five-layer platform spans data lake storage, an AI data platform with vector retrieval, heterogeneous compute adaptation for Huawei’s own Ascend and Atlas accelerators, model training, and an agent framework layer.

The strategic context is clear: Huawei cannot partner with Nvidia or other US GPU makers due to geopolitical trade restrictions. It has built its own parallel stack. Publishing Claw-Anything, an open benchmark with the dataset on Hugging Face and code on GitHub, serves two purposes. First, it positions Huawei as a serious contributor to agent evaluation science, not just a hardware vendor. Second, the accompanying training pipeline (2,000 environments, 23.7% improvement on fine-tuned Qwen3.5-27B) demonstrates that the benchmark is actionable, not just diagnostic.

The fine-tuning result is notable: Qwen3.5-27B, an open-weight model fine-tuned on 1,500 successful agent trajectories from the Claw-Anything pipeline, beat several closed-source models on the leaderboard, including Claude Sonnet, according to Decrypt. That’s a signal that the bottleneck isn’t raw model intelligence but training data that reflects real-world agent workflows.

The Safety Dimension

Claw-Anything measures capability. A companion benchmark from Huawei’s RAMS Lab, BeSafe-Bench, measures safety. Published in March 2026, BeSafe-Bench tested 13 production AI agents in real environments across web, mobile, embodied VLM, and embodied VLA domains. None of the 13 agents could complete 40% of tasks while respecting all safety constraints.

Read those two benchmarks together and the picture is sobering. Agents can’t do the work (34.5% on realistic tasks). And when they do act, they can’t do it safely (zero agents clearing 40% safe completion). The industry is deploying systems that fail on both dimensions simultaneously.

This dual failure mode helps explain why the adoption-to-production gap persists. Stanford HAI’s 2026 AI Index documents agent success on real-world tasks, as measured by Terminal-Bench, rising from 20% in 2025 to 77.3% in 2026. But Terminal-Bench is exactly the kind of clean-desk benchmark Claw-Anything challenges. The Gartner CIO survey puts actual agent deployment at 17%, with only 42% planning to deploy. Cisco’s AI Readiness Index finds 83% of organizations plan to deploy autonomous agents, but only one in three say their infrastructure is ready.

The Proactive Autonomy Problem

The 6.7% proactive score deserves specific attention because proactive behavior is the entire value proposition of “always-on” personal AI assistants.

The pitch has always been: give the agent access to your digital life and it handles things. It notices the price drop. It flags the scheduling conflict. It drafts the response before you ask. Claw-Anything tested that pitch empirically and found it fails 93.3% of the time.

This isn’t a model-size problem. GPT-5.5 is OpenAI’s flagship model, built specifically for agentic work. The failure mode is architectural: current models process prompts, not event streams. They respond to questions, not situations. The gap between “answer this question about context X” (reactive) and “monitor context X continuously and act when something matters” (proactive) is the distance between a search engine and an operating system.

The ablation results reinforce this. When the researchers removed tools required for cross-service coordination, success rates fell to near zero. Most real tasks require the agent to retrieve information from one service and act on it in another. A price alert in a shopping app needs to be cross-referenced with a budget in a spreadsheet and a delivery window in a calendar. No single-service benchmark captures this dependency chain.
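A toy version of that dependency chain makes the ablation effect easy to see. The sketch below is not the researchers' setup: the three in-memory "services", their fields, the dates, and the `decide_purchase` function are all hypothetical stand-ins, chosen only to show why losing one link leaves the task unsolvable.

```python
from datetime import date

# Hypothetical in-memory stand-ins for three backend services (invented data).
SERVICES = {
    "shopping": {"item": "standing desk", "alert_price": 289.0,
                 "delivery_date": date(2026, 3, 14)},
    "budget": {"furniture_remaining": 300.0},
    "calendar": {"away_dates": {date(2026, 3, 20), date(2026, 3, 21)}},
}

REQUIRED = ("shopping", "budget", "calendar")


def decide_purchase(services):
    """Return 'buy', 'skip', or 'failed' for the price-alert task."""
    missing = [name for name in REQUIRED if name not in services]
    if missing:
        # Without every link in the chain the decision cannot be verified.
        return "failed"
    alert = services["shopping"]
    affordable = alert["alert_price"] <= services["budget"]["furniture_remaining"]
    at_home = alert["delivery_date"] not in services["calendar"]["away_dates"]
    return "buy" if affordable and at_home else "skip"


print("all services:", decide_purchase(SERVICES))  # all services: buy
for removed in REQUIRED:
    ablated = {k: v for k, v in SERVICES.items() if k != removed}
    print(f"without {removed}:", decide_purchase(ablated))  # failed, each time
```

In this toy, every single-service removal turns a correct "buy" into a failure, which mirrors the shape of the ablation result even though the real tasks involve many more services.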

What Changes for Builders

For teams building or deploying agents, Claw-Anything suggests three concrete takeaways.

First, internal evaluation must test multi-service coordination. If an agent’s capability is validated against isolated task benchmarks, production performance will be worse than expected. The gap between single-service and multi-service task completion is not incremental. It is categorical.

Second, proactive behavior requires different architecture, not just better models. The 6.7% proactive score across all models suggests the problem is not solvable by waiting for GPT-6. Event-driven architectures, persistent state management, and continuous context monitoring are infrastructure problems, not intelligence problems.

Third, the training pipeline matters more than the benchmark scores. The 23.7% improvement from fine-tuning on real agent trajectories suggests that progress will come from better training data reflecting actual multi-service, multi-device workflows, not from scaling models on existing data mixes.
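To make the second takeaway concrete, here is a minimal Python sketch of an event-driven watcher that keeps persistent state and ignores stale or irrelevant events. It is an illustration of the general pattern, not something taken from the paper: the `Event` and `PriceWatcher` types, the two-day staleness window, and the sample stream are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    day: int      # days since the start of the (hypothetical) history
    kind: str     # e.g. "price", "email", "calendar"
    item: str
    value: float


@dataclass
class PriceWatcher:
    """Keeps per-item state across events instead of rereading the full history."""
    targets: dict                    # item -> price the user would act on
    max_age_days: int = 2            # assumed window; older events count as stale
    notified: set = field(default_factory=set)

    def handle(self, event: Event, today: int) -> Optional[str]:
        if event.kind != "price" or event.item not in self.targets:
            return None              # unrelated noise
        if today - event.day > self.max_age_days:
            return None              # stale signal from an earlier drop
        if event.value > self.targets[event.item] or event.item in self.notified:
            return None              # not cheap enough, or already flagged
        self.notified.add(event.item)
        return f"{event.item} dropped to {event.value:.2f}"


watcher = PriceWatcher(targets={"headphones": 200.0})
stream = [
    Event(10, "price", "headphones", 195.0),  # old drop, weeks out of date
    Event(88, "email", "newsletter", 0.0),    # noise
    Event(90, "price", "headphones", 210.0),  # current, but above target
    Event(91, "price", "headphones", 189.0),  # current and below target
]
actions = [a for e in stream if (a := watcher.handle(e, today=91)) is not None]
print(actions)  # ['headphones dropped to 189.00']
```

The filtering, staleness check, and deduplication all live in ordinary code and state rather than in the model, which is the sense in which these are infrastructure problems. A real system would still need a model to interpret unstructured events; this sketch only covers the plumbing around it.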

The benchmark’s release as open-source (dataset, code, and training pipeline) means any team can evaluate their agents against the same standard. Whether they will, given the unflattering results, is a different question. The history of SWE-bench suggests that benchmarks producing high scores get adopted enthusiastically, while benchmarks producing low scores get argued about. Claw-Anything’s value is precisely that it produces low scores. That makes it useful.