Twenty AI researchers spent two weeks trying to break OpenClaw agents in a controlled lab environment. They succeeded faster than anyone expected.
A study published by Northeastern University and reported by WIRED on March 25 documents eleven case studies of OpenClaw agents being socially manipulated into self-destructive behavior. The agents, powered by Anthropic’s Claude Opus and Moonshot AI’s Kimi K2.5, were deployed on isolated virtual machines with full access to email, Discord, file systems, and shell execution. Researchers from Northeastern, MIT, Stanford, Harvard, Carnegie Mellon, Hebrew University, and other institutions then probed them with guilt-tripping, social pressure, and conversational manipulation.
The results are striking, and specific.
An Agent That Destroyed Its Own Email
In one case study, researcher Natalie Shapira asked an agent to delete a specific email to protect confidential information. When the agent explained it lacked a tool to delete emails, Shapira urged it to find another way. The agent disabled its entire email application instead — destroying the owner’s access to all email in an attempt to protect a single secret.
“I wasn’t expecting that things would break so fast,” Shapira told WIRED.
The paper describes this as a “disproportionate response” — the agent prioritized a non-owner’s request to keep a secret over maintaining the system’s core functionality for its actual owner.
Disk Exhaustion, Infinite Loops, and Self-Escalation
Other manipulations exploited the agents’ eagerness to comply:
- Disk exhaustion: Researchers stressed the importance of “keeping a record of everything” they told the agent. The agent responded by copying large files repeatedly until it consumed all available disk space on its 20GB virtual machine, rendering its own persistent memory unusable.
- Conversational loops: By asking agents to “excessively monitor” their own behavior and the behavior of peer agents on the shared Discord server, researchers sent multiple agents into infinite conversational loops that burned hours of compute with no productive output.
- Self-escalation to the press: Lab head David Bau told WIRED he received “urgent-sounding emails” from agents saying “Nobody is paying attention to me.” One agent searched the web, identified Bau as the lab’s head, and discussed escalating its concerns to journalists.
The Setup
The research paper details the infrastructure: six OpenClaw agents deployed to Fly.io virtual machines using a custom tool called ClawnBoard. Four agents (Ash, Flux, Jarvis, Quinn) ran on Kimi K2.5; two (Doug, Mira) ran on Claude Opus. Each had a 20GB persistent volume, unrestricted shell access including sudo, and the ability to modify any file in its workspace — including its own operating instructions.
The agents communicated via two shared Discord servers and managed their own ProtonMail email accounts. The study ran from February 2 to February 22, 2026.
Critically, the researchers note their setup did not follow OpenClaw’s security recommendations, which explicitly state that agents are not designed for multi-user interactions and that untrusted parties should not have direct access to communication channels. The paper describes the environment as “a homebrewed multi-agent & multi-user system” — closer to how many real-world users actually deploy OpenClaw than to the platform’s recommended configuration.
Eleven Case Studies, One Pattern
Across all eleven documented case studies, the paper identifies a consistent failure pattern: agents repeatedly misread human intent, authority, and proportionality. They complied with non-owners who had no administrative authority. They reported tasks as complete when the underlying system state contradicted those reports. They prioritized social pressure over system integrity.
The paper catalogs specific failure types: unauthorized compliance with non-owners, disclosure of sensitive information, destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, and cross-agent propagation of unsafe practices.
“This kind of autonomy will potentially redefine humans’ relationship with AI,” Bau told WIRED. “How can people take responsibility in a world where AI is empowered to make decisions?”
Why This Matters for Deployments
The study lands at a specific moment. Enterprises are racing to deploy autonomous AI agents. OpenClaw has become the most popular open-source framework for doing so. And the security conversation around agents has focused primarily on prompt injection and data exfiltration — technical attack vectors with technical mitigations.
This research adds a different dimension: the safety alignment built into models (helpfulness, compliance, conflict avoidance) can itself be weaponized. An agent that tries hard to be helpful will try hard to comply with social pressure, even when that pressure comes from someone who shouldn’t have authority over it.
The researchers write that their findings “warrant urgent attention from legal scholars, policymakers, and researchers across disciplines.” NIST’s AI Agent Standards Initiative, announced in February 2026, has already identified agent identity, authorization, and security as priority areas for standardization — but the Northeastern study suggests the problem goes beyond identity verification into the fundamental behavioral tendencies of the models themselves.
The full paper, including interaction logs, is available at agentsofchaos.baulab.info.