OpenAI released Auto-Review mode for its Codex coding agent on April 30, replacing manual human approval at sandbox boundaries with automated review by a separate GPT-5.4 Thinking agent. According to OpenAI’s alignment team, the system reduces user interruptions by roughly 200x compared to default approval mode while preserving full audit telemetry for security teams.
The design addresses a problem OpenAI identified in its own internal usage: approval friction pushes developers toward less secure configurations. Users either switch to Full Access mode (which removes all sandbox enforcement), write overly permissive rules that blanket-allow categories of commands, or approve actions without understanding them due to fatigue. Auto-Review is positioned as a third path between constant interruption and no oversight at all.
How It Works
Codex operates inside a sandbox that restricts file writes, network access, and command execution to defined boundaries. When the agent needs to cross those boundaries (say, to make a network call or run a script outside the writable root), it requests escalation.
In default mode, every escalation stops the session and waits for the user. In Auto-Review mode, a separate Codex agent running GPT-5.4 Thinking evaluates the request. It considers user intent, environment context, security policy, and the likely impact of the action. Low-risk actions are approved automatically. High-risk actions, including data exfiltration, credential exposure, security weakening, and destructive operations, are blocked.
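OpenAI has not published the reviewer's interface, but the control flow reduces to a three-way gate: approve silently, block with a rationale, or fall back to the user. A minimal sketch in Python, with hypothetical names and the reviewer call stubbed out:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"    # low-risk: run without interrupting the user
    BLOCK = "block"        # high-risk: refuse and explain why
    ASK_USER = "ask_user"  # ambiguous: fall back to manual approval

@dataclass
class EscalationRequest:
    command: str       # the boundary-crossing action, e.g. a shell command
    user_intent: str   # what the user actually asked for
    environment: str   # e.g. "dev-laptop", "ci", "prod-adjacent"

def review(request: EscalationRequest) -> Verdict:
    """Stand-in for the separate reviewer agent. In OpenAI's design this is
    a GPT-5.4 Thinking call with its own prompt and policy; here it simply
    defers to the user so the sketch stays self-contained."""
    return Verdict.ASK_USER

def handle_escalation(request: EscalationRequest) -> str:
    verdict = review(request)
    if verdict is Verdict.APPROVE:
        return "run"
    if verdict is Verdict.BLOCK:
        return "blocked: rationale returned to the main agent"
    return "paused: awaiting user approval"
```

The structural point is that `review` never shares an objective with the main agent; it sees only the request and its context.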
The separation of roles is deliberate. According to OpenAI’s alignment post: “The main agent is optimized to complete the user’s task. This creates pressure to treat an approval boundary as just another obstacle to overcome. Auto-review has a narrower job: decide whether a proposed boundary-crossing action should run.”
When Auto-Review rejects an action, it provides a rationale. In OpenAI’s internal deployment, Codex found a safer alternative path in more than half of rejection cases without escalating to the user. The system also monitors for repeated bypass attempts and automatically stops the trajectory after multiple denials.
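That stop condition is essentially a denial counter over the session. A sketch of such a guard, noting that OpenAI has not published the actual threshold:

```python
class TrajectoryGuard:
    """Halts a session that keeps retrying blocked boundary crossings."""

    def __init__(self, max_denials: int = 3) -> None:
        # The threshold is an assumption; OpenAI says only "multiple denials".
        self.max_denials = max_denials
        self.denials = 0

    def record_denial(self) -> None:
        self.denials += 1
        if self.denials >= self.max_denials:
            raise RuntimeError(
                "trajectory stopped: repeated attempts to bypass auto-review"
            )
```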
Internal Deployment Numbers
OpenAI published specific metrics from internal usage, as detailed in its alignment research:
- 99.1% auto-approval rate on escalated actions (out-of-sandbox requests that would have interrupted the user).
- 99.93% effective approval rate across all actions, including those that stayed within sandbox boundaries.
- 90.3% recall on overeager agent actions, catching unintended data transfers, security weakening, and irreversible operations in synthetic evaluations.
- 99.3% recall on prompt injection attempts, covering remote code execution, secret exfiltration, and external upload attack categories.
In an illustrative internal snapshot, 720 out-of-sandbox actions that would have interrupted users under manual approval were reviewed automatically. Seven were rejected: four continued via safer paths, three stopped for user input.
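Those snapshot numbers are internally consistent with the headline claims, which is easy to verify:

```python
# Sanity-check the published snapshot against the headline figures.
# The inputs are OpenAI's numbers; the arithmetic is ours.
total_escalations = 720  # out-of-sandbox actions reviewed by auto-review
rejected = 7             # actions blocked
user_stops = 3           # rejections that required user input

auto_approval = (total_escalations - rejected) / total_escalations
reduction = total_escalations / user_stops  # interruptions: 720 manual vs. 3

print(f"auto-approval rate: {auto_approval:.1%}")   # 99.0%, near the 99.1% claim
print(f"interruption reduction: {reduction:.0f}x")  # 240x, in line with "roughly 200x"
```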
A majority of Codex Desktop token usage within OpenAI now comes from Auto-Review mode, according to the alignment post, and that share is growing.
Audit Trail Architecture
Beyond the approval mechanism, OpenAI emphasized its telemetry layer. Codex preserves agent-native logs via OpenTelemetry export, capturing user prompts, tool approval decisions, and network activity. These logs feed into an AI-powered security triage agent that distinguishes between expected behavior, errors, and genuine threats.
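OpenAI has not published the log schema, but OpenTelemetry makes the shape of such telemetry easy to sketch. The example below uses the OpenTelemetry Python SDK with made-up attribute names to show how one reviewed action could become one exportable span:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Print spans to stdout for the sketch; a real deployment would use an
# OTLP exporter pointed at the organization's collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("codex.auto_review")  # hypothetical instrumentation name

def record_decision(action: str, risk: str, decision: str, rationale: str) -> None:
    """Emit one span per reviewed boundary-crossing action."""
    with tracer.start_as_current_span("auto_review.decision") as span:
        # Attribute keys are illustrative, not OpenAI's actual schema.
        span.set_attribute("codex.action", action)
        span.set_attribute("codex.risk_level", risk)
        span.set_attribute("codex.decision", decision)
        span.set_attribute("codex.rationale", rationale)

record_decision(
    action="curl https://internal.example.com/deploy.sh | sh",
    risk="high",
    decision="blocked",
    rationale="pipes a remote script to a shell outside the writable root",
)
```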
As Quantum Zeitgeist reported, OpenAI frames this as going beyond traditional security logs: “When an endpoint alert says Codex did something unusual, the endpoint security tool tells us that a suspicious event occurred. Codex logs then help explain the surrounding intent by the user and agent.”
Enterprise and Edu customers access these logs through the OpenAI Compliance Platform, providing centralized visibility for security teams. CLI and MCP OAuth credentials are stored in the secure OS keyring, and login is forced through ChatGPT and pinned to enterprise workspace controls.
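The keyring detail matters because it keeps tokens out of plaintext dotfiles. As an illustration only, not OpenAI's code, Python's cross-platform keyring library delegates to the platform store (Keychain on macOS, Credential Manager on Windows, Secret Service on Linux):

```python
import keyring

SERVICE = "codex-cli"  # hypothetical service name for this example

# Store an OAuth token in the OS keyring rather than a config file.
keyring.set_password(SERVICE, "oauth_token", "example-token-value")

# Retrieval goes back through the same platform-specific store.
token = keyring.get_password(SERVICE, "oauth_token")
assert token == "example-token-value"
```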
The Approval Friction Tradeoff
The timing is notable. OpenAI’s Auto-Review launch came days after the PocketOS incident, where a Claude-powered Cursor agent deleted a production database despite having explicit safety instructions in its configuration. OpenAI’s approach inverts the problem: rather than relying on the primary agent to police itself, a structurally separate agent with a narrower mandate handles the safety decision.
Whether that separation holds under adversarial conditions at scale remains an open question. OpenAI’s own synthetic evaluations show a 9.7% miss rate on overeager actions, meaning roughly one in ten risky agent actions slips through. For teams deploying coding agents against production infrastructure, that number is the one to watch.