Scale AI Benchmark Finds Top AI Agents Complete Just 4% of Professional Freelance Work to Client Standard

The best AI agent on the market completes 4.17% of real professional freelance tasks to a standard a paying client would accept. That is the headline finding from the Remote Labor Index (RLI), a benchmark jointly developed by Scale AI and the Center for AI Safety that measures agent performance on actual paid digital work, not synthetic test tasks.

The top performer is Anthropic’s claude-opus-4-6 running on the CoWork platform, according to HRD America. Everything else on the live leaderboard scores lower. When the benchmark launched in late 2025, the best agent managed 2.5%. Six months later, the needle has barely moved.

How the Benchmark Works

Most AI benchmarks test isolated capabilities: answering questions, writing code snippets, summarizing text. The RLI was designed around a different question: can an AI agent take a real task from start to finish the way a paid professional would, and would the client actually pay for the output?

Tasks were sourced from digital labor platforms like Upwork across 23 sectors, including video editing, logo design, architecture, data analysis, jewelry design, and game development. Evaluators compared AI-generated deliverables against human-produced ones on a single criterion: would the client pay for this?

“If you consider creating a window and you have designs and all that, it could be the case that AI can create something very aesthetically pleasing,” Udari Madhushani Sehwag, security and policy research lead at Scale AI, told HRD America. “But if the dimensions are incorrect, it doesn’t matter how pleasing it looks. A human is not going to actually pay for that.”

Pockets of Strength, a Flat Improvement Curve

Image generation tasks showed the strongest agent performance. In logo creation, AI outputs sometimes outperformed human work in evaluator preference. Sehwag attributed this partly to the subjectivity in visual work, where models trained on vast datasets produce polished results that sway human judgment. But that advantage did not extend to specification-driven tasks requiring precise measurements, technical accuracy, or multi-step coordination.

The improvement curve is the more striking finding. Other AI benchmarks, which test discrete skills, tend to show steep progress over time. The RLI, grounded in end-to-end task completion, shows something flatter.

“After six months we are still seeing less than 5%,” Sehwag told HRD America. “They started very low initially and we are still in this very low region.”
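To put that flat curve in numbers, the short sketch below works through the two figures reported above: 2.5% for the best agent at launch and 4.17% for the current leader. The six-month span is the article's approximate figure, and the function and variable names are illustrative, not part of the RLI itself.

```python
# Figures as reported in the article; the six-month span is approximate.
LAUNCH_BEST_PCT = 2.5    # best agent's completion rate at launch (late 2025)
CURRENT_BEST_PCT = 4.17  # best agent's completion rate on the live leaderboard
MONTHS_ELAPSED = 6       # assumption: "six months later", taken at face value


def point_gain(start_pct: float, end_pct: float) -> float:
    """Absolute change in completion rate, in percentage points."""
    return round(end_pct - start_pct, 2)


gain = point_gain(LAUNCH_BEST_PCT, CURRENT_BEST_PCT)
gain_per_month = round(gain / MONTHS_ELAPSED, 2)
not_accepted_pct = round(100 - CURRENT_BEST_PCT, 2)

print(f"Gain since launch: {gain} percentage points")
print(f"Average gain per month: {gain_per_month} percentage points")
print(f"Tasks not completed to client standard: {not_accepted_pct}%")
```

On these numbers the top score rose by about 1.67 percentage points, or roughly 0.28 points a month, leaving about 95.83% of tasks short of what a client would pay for.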

The Reliability Gap

The low score is not simply about bad output. Sehwag pointed to a specific failure mode: agents can complete parts of tasks but cannot reliably execute the full sequence from brief to deliverable.

“The keyword is reliability,” she told HRD America. “They can complete parts of the tasks, but for the most part they’re not able to complete the end to end tasks reliably.”

She identified three interconnected gaps: agents need to fully understand a task brief, complete all component parts, and assemble those parts into a coherent whole. Until all three work consistently, full end-to-end automation remains out of reach.

The Organizational Momentum Gap

The findings land against a backdrop of aggressive enterprise AI adoption. A Salesforce survey of 200 chief human resources officers (CHROs), reported by HRD America, found that 89% believe AI agents will allow them to reassign employees to new roles, with roughly 23% of the workforce expected to be redeployed as a result.

Sehwag’s recommendation: base decisions on demonstrated capability, not projections. “Decisions should be based on what we can see and the proof that exists, not based on the projections we have in mind,” she told HRD America.

Her framing for teams building with agents: augmentation, not automation. An agent can compress a 30-minute task to 10 minutes when a human reviews the output. Treating it as a full replacement for the human means accepting the benchmark's result: with even the best agent completing 4.17% of tasks to client standard, roughly 96% of professional work fails.

“I wouldn’t expect to see a rapid jump,” Sehwag said. “And this is what we’ve been seeing since late 2025 as well.”
