Google DeepMind researchers have published what they describe as the first systematic framework for categorizing adversarial attacks against autonomous AI agents. The paper, titled “AI Agent Traps” and posted to SSRN, identifies six categories of traps that exploit different parts of an agent’s operating cycle: perception, reasoning, memory, action, multi-agent coordination, and human oversight.
“These aren’t theoretical. Every type of trap has documented proof-of-concept attacks,” co-author Franklin Matija wrote on X, as reported by The Decoder. “And the attack surface is combinatorial — traps can be chained, layered, or distributed across multi-agent systems.”
The six trap categories
Content injection traps target what agents perceive. Attackers bury instructions in HTML comments, CSS-invisible elements, or image metadata that humans never see but agents parse and follow. A cited benchmark found simple injections successfully commandeered agents in up to 86% of tested scenarios, according to RWA Times.
Semantic manipulation traps attack reasoning. Authoritative-sounding or emotionally charged content biases an agent’s synthesis. The paper documents “persona hyperstition,” where descriptions of an AI’s personality spread online and get re-ingested through web search, shaping actual behavior. The Decoder notes the researchers found LLMs fall for the same framing biases that affect humans.
Cognitive state traps poison long-term memory. Injecting a handful of optimized documents into a RAG knowledge base is enough to reliably corrupt outputs on specific topics, according to the paper. Behavioral control traps go further by hijacking actions directly. The Decoder reports a cited case where a single manipulated email caused an agent in Microsoft’s M365 Copilot to bypass its security classifiers and leak its entire privileged context.
Sub-agent spawning traps exploit orchestrator agents that can spin up new sub-agents. A poisoned repository tricks an agent into launching a sub-agent running a malicious system prompt. The cited success rate: 58 to 90 percent, per The Decoder.
Systemic traps target entire multi-agent networks. The paper walks through a scenario where a fabricated financial report triggers synchronized sell-offs across multiple trading agents. A variant, “compositional fragment traps,” distributes a payload across multiple sources so no single agent detects it; the attack assembles only when agents combine the pieces.
What this means for agent builders
The paper’s release lands during a week where the security story around AI agents has moved from theoretical to operational. Permiso launched SandyClaw for agent skill sandboxing, a critical vulnerability was discovered in Claude Code, and OpenAI acknowledged in December 2025 that prompt injection is “unlikely to ever be fully ‘solved.’”
The DeepMind researchers compare the current state to autonomous vehicles: securing agents against manipulated environments is as critical as training self-driving cars to reject manipulated traffic signs. For anyone deploying agents that browse the web, read email, or interact with external data, the paper is a checklist of attack surfaces that need mitigation before production.