Google DeepMind’s Aletheia, an autonomous AI agent powered by Gemini 3 Deep Think, solved 6 of 10 unpublished research-level mathematical problems in the FirstProof challenge without any human intervention. Expert mathematicians evaluated the solutions and judged them “publishable after minor revisions,” according to InfoQ’s analysis of the results. Aletheia also scored 91.9% on IMO-ProofBench, a benchmark for competition-level proof verification.

The FirstProof challenge consists of ten lemmas sourced from the active, unpublished work of professional mathematicians. The problems had never been posted online, making data contamination or memorization effectively impossible. Participants had one week to submit solutions. Aletheia received raw problem prompts with no human hints, no dialogue loops, and no manual selection of outputs.

What Aletheia Got Right, and What It Refused to Answer

For the 6 solved problems, expert evaluators deemed the proofs publication-ready with minor revisions. Problem 8’s solution was judged correct by 5 of 7 expert reviewers, according to the FirstProof evaluation document. The remaining two reviewers noted insufficient clarifying details rather than logical flaws.

For the 4 unsolved problems, Aletheia returned “No solution found” or timed out. It did not generate plausible-sounding incorrect proofs. DeepMind researchers described this self-filtering as a deliberate design principle: “We view reliability as the primary bottleneck to scaling up AI assistance on research mathematics,” they wrote in the evaluation paper. “We suspect that many practicing researchers would prefer to trade raw problem-solving capability for increased accuracy.”
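This abstain-rather-than-guess policy can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Attempt` record, the `verifier_score` field, and the threshold value are hypothetical stand-ins, not DeepMind's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    proof: str
    verifier_score: float  # hypothetical verifier confidence in [0, 1]

def select_answer(attempts: list[Attempt], threshold: float = 0.9) -> str:
    """Emit the best candidate only when the verifier is confident enough;
    otherwise abstain instead of returning a plausible but unchecked proof."""
    if not attempts:
        return "No solution found"
    best = max(attempts, key=lambda a: a.verifier_score)
    return best.proof if best.verifier_score >= threshold else "No solution found"
```

The design choice is that a low-confidence candidate is treated the same as having no candidate at all, which trades raw solve rate for reliability.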

OpenAI attempted the same challenge using an internal, unreleased reasoning model. The company initially claimed 6 solutions, but the count was later revised to 5 after Problem 2’s solution was found to contain a logical flaw. OpenAI acknowledged using limited human supervision to evaluate and select outputs from multiple attempts, making its approach fundamentally different from Aletheia’s zero-shot autonomous execution.

How Aletheia Works

The system uses a multi-agent architecture with three components: a Generator that proposes logical steps, a Verifier that checks them for flaws, and a Reviser that patches mistakes. Google Search integration allows the agent to verify concepts against existing literature, reducing the hallucinated citations that typically plague language models in technical domains. InfoQ described the setup as resembling “a CI/CD pipeline for mathematics: propose, verify, fail, repair, and merge.”
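That propose-verify-repair cycle can be sketched as a simple control loop. This is a toy illustration only: the `generator`, `verifier`, and `reviser` callables and the `max_rounds` cap are assumptions, not DeepMind's published interfaces.

```python
def prove(problem, generator, verifier, reviser, max_rounds=5):
    """Hypothetical propose-verify-repair loop in the spirit of the
    Generator/Verifier/Reviser architecture described above."""
    draft = generator(problem)          # propose a candidate proof
    for _ in range(max_rounds):
        issues = verifier(draft)        # check the draft for flaws
        if not issues:
            return draft                # "merge": accept the verified proof
        draft = reviser(draft, issues)  # patch the flagged steps
    return None                         # abstain after repeated failures
```

Note that the loop terminates in abstention, not in an unverified draft, which matches the reliability-first framing of the system.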

The underlying Gemini 3 Deep Think architecture relies on extended test-time compute, investing more inference-time processing into each problem rather than scaling model parameters. DeepMind’s blog post noted that Aletheia demonstrated “higher reasoning quality can be achieved at a lower inference-time compute” compared to brute-force scaling.
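One common form of test-time compute scaling is best-of-n sampling with a scoring pass; the sketch below is a generic illustration of that idea under stated assumptions (`sample_fn`, `score_fn`, and `budget_n` are hypothetical), not a description of how Gemini 3 Deep Think actually allocates inference compute.

```python
def best_of_n(problem, sample_fn, score_fn, budget_n=8):
    """Toy test-time compute scaling: spend more samples per problem
    and keep the highest-scoring candidate."""
    candidates = [sample_fn(problem) for _ in range(budget_n)]
    return max(candidates, key=score_fn)
```

Raising `budget_n` increases per-problem inference cost without touching model parameters, which is the trade-off the blog post contrasts with brute-force parameter scaling.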

Limitations and Next Steps

The researchers acknowledged significant constraints. The evaluation paper noted that “even with its verifier mechanism, Aletheia is still more prone to errors than human experts” and exhibits “specification gaming,” interpreting ambiguous problems in ways that are easiest to answer rather than most mathematically interesting.

Scientific American’s coverage quoted Stanford professor Mohammed Abouzaid, a FirstProof organizer, observing that AI-generated proofs “have the flavor of 19th-century mathematics” while mathematicians are “trying to build the mathematics of the 21st century.”

The FirstProof team is already preparing a second batch of problems for March through June 2026, designed as a fully formal benchmark. For agent builders, the takeaway is architectural: Aletheia’s value comes not from raw intelligence but from knowing when to stop, a design pattern that applies well beyond mathematics.