Hermes Agent won 14 of 18 real-world tasks against Claude Code running on Opus 4.7 and OpenClaw running on Sonnet 4.6, according to a benchmark published by developer Chew Loong Nian on Towards AI. The four tasks Claude Code won were raw coding challenges. Hermes won each of its 14 tasks on the strength of one feature neither competitor has: session memory.

Hermes ships with a SQLite database, indexed with FTS5, that stores the transcript of every session the user has run. When a new session starts and the user references prior work (“fix the bug we were chasing on Friday”), Hermes searches Friday’s transcript, pulls the relevant turns into context, and resumes where it left off. Claude Code and OpenClaw start each session with no automatic recall of previous conversations.

“That’s not a metaphor,” Chew writes. “Hermes greps Friday’s transcript, pulls the relevant turns into context, and picks up where you left off. Claude Code and OpenClaw don’t. They start every session with an empty room.”
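To make the mechanism concrete, here is a minimal sketch of transcript recall using Python's built-in sqlite3 module and an FTS5 virtual table. It is not Hermes' actual schema or code: the table name `turns`, its columns, and the helper functions `open_store`, `log_turn`, and `recall` are all hypothetical, and the sketch assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a transcript store backed by an FTS5 virtual table."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: only `content` is full-text indexed; the other
    # columns are stored alongside it but excluded from the index.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS turns USING fts5("
        "session_id UNINDEXED, day UNINDEXED, role UNINDEXED, content)"
    )
    return conn


def log_turn(conn, session_id, day, role, content):
    """Append one conversation turn to the store."""
    conn.execute(
        "INSERT INTO turns(session_id, day, role, content) VALUES (?, ?, ?, ?)",
        (session_id, day, role, content),
    )


def recall(conn, query, limit=5):
    """Return the best-matching past turns for a free-text query."""
    # Quote each term so punctuation is not parsed as FTS5 query syntax,
    # then OR the terms together so partial matches still surface.
    terms = " OR ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT session_id, day, role, content FROM turns "
        "WHERE turns MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    )
    return rows.fetchall()


conn = open_store()
log_turn(conn, "s1", "2026-05-08", "user", "the parser crashes on empty input, null pointer bug")
log_turn(conn, "s1", "2026-05-08", "assistant", "traced the bug to tokenizer.next() returning None")
log_turn(conn, "s2", "2026-05-09", "user", "write a haiku about autumn")
hits = recall(conn, "parser bug")
```

In this sketch `hits` holds the two turns from session `s1` and skips the unrelated haiku request. A real agent would then place those turns, or a summary of them, into the new session's context.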

How the Memory Architecture Works

Hermes Agent’s memory operates across three layers, according to MarkTechPost’s analysis and the project’s GitHub repository. The first layer is a persistent snapshot of user and agent identity. The second is the SQLite FTS5 full-text search database that indexes every past session, with results summarized by a fast model (Gemini Flash) for cross-session recall. The third layer consists of procedural skill files that capture repeatable task logic, generated autonomously after the agent completes complex tasks.

When memory capacity reaches its limit, the system forces consolidation, prompting the agent to merge or prune entries rather than silently dropping context.
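One way to picture forced consolidation is a store that refuses a write that would exceed its capacity, so the caller has to merge or prune before continuing. The sketch below is an illustration under stated assumptions, not Hermes' implementation: the `BoundedMemory` class, the `MemoryFull` exception, and the choice to measure capacity in characters are all hypothetical.

```python
class MemoryFull(Exception):
    """Raised when a write would exceed capacity; the caller must consolidate."""


class BoundedMemory:
    def __init__(self, max_chars):
        self.max_chars = max_chars  # assumed unit: characters
        self.entries = []

    def used(self):
        return sum(len(e) for e in self.entries)

    def add(self, entry):
        # Refuse the write instead of silently evicting older entries.
        if self.used() + len(entry) > self.max_chars:
            raise MemoryFull("consolidate before adding more entries")
        self.entries.append(entry)

    def consolidate(self, old_indexes, merged):
        """Replace several entries with one merged entry supplied by the agent."""
        drop = set(old_indexes)
        kept = [e for i, e in enumerate(self.entries) if i not in drop]
        if sum(len(e) for e in kept) + len(merged) > self.max_chars:
            raise MemoryFull("merged entry is still too large")
        self.entries = kept + [merged]


mem = BoundedMemory(max_chars=40)
mem.add("prefers pytest")
mem.add("uses pytest -x flag")
try:
    mem.add("deploys on Fridays")  # would exceed 40 characters
    rejected = False
except MemoryFull:
    rejected = True
    # In an agent, this is the point where the model would be prompted
    # to merge or prune; here the merged text is written by hand.
    mem.consolidate([0, 1], "pytest with -x")
    mem.add("deploys on Fridays")
```

The design point is that the failure is loud: nothing is dropped unless the agent explicitly chooses what to merge or remove.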

This architecture is built around the premise that an agent’s value compounds over time. The longer a user runs Hermes, the more optimized it becomes for their specific workflows.

Where Claude Code Still Wins

The benchmark was not a clean sweep. Claude Code took all four coding-specific tasks, winning on what Chew calls “raw coding chops.” On tasks requiring precise code generation, refactoring, or debugging without prior context, Opus 4.7’s stronger model capabilities gave it the edge.

The split maps to a real architectural tradeoff. Claude Code’s advantage is model quality per turn. Hermes’ advantage is context accumulation across sessions. For a developer working on the same codebase over weeks, the second advantage may matter more than the first.

The Competitive Context

Hermes Agent launched in February 2026 and by May 10 had overtaken OpenClaw on OpenRouter’s daily inference rankings, processing 224 billion tokens per day to OpenClaw’s 186 billion, as reported by MarkTechPost.

The benchmark arrives as session continuity is becoming a recognized gap across agent frameworks. Projects like claude-mem, a third-party tool that captures session activity and injects relevant context into future sessions, have appeared specifically to retrofit memory onto Claude Code, OpenClaw, Codex, and other agents that lack it natively.

OpenClaw stores memory in human-authored markdown files and does not automatically index or search prior sessions. Claude Code offers no cross-session persistence at all.

Session Memory as a Differentiator

The benchmark’s core finding is narrow but consequential: in 14 of 18 tasks, the ability to reference prior work was the deciding factor, not model quality, tool count, or platform support. Hermes’ 10-week-old codebase beat two established agents because it solved a UX problem (session continuity) that neither competitor has addressed natively.

The question for Anthropic and the OpenClaw foundation is whether session memory moves from competitive differentiator to table-stakes feature before the next round of benchmarks.