Harvard physics professor Matthew Schwartz supervised Claude Opus 4.5 through a full theoretical physics research calculation, producing a peer-reviewed paper (arXiv:2601.02484) in two weeks instead of the typical year, according to a detailed account published on Anthropic’s new Science blog. Schwartz never touched a file himself. Every computation, derivation, and draft was produced by Claude through text prompts alone.

The project consumed 270 sessions, over 52,000 messages, approximately 36 million tokens, and 40+ hours of local CPU compute. Schwartz, a quantum field theory specialist and principal investigator at the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI), described the experience as supervising an AI grad student through a well-defined but technically demanding problem: resumming the Sudakov shoulder in the C-parameter from particle collision data.

What Claude Could Do

Claude handled the project’s grunt work effectively. It compiled EVENT2, an old Fortran Monte Carlo code, wrote analysis scripts, ran simulations, and produced LaTeX drafts with equations and plots. Schwartz had Claude maintain a tree of markdown files — one summary per stage, one detailed file per task — rather than relying on a single long conversation. This allowed the model to look up its own prior work instead of holding everything in context, Schwartz wrote.

By day three, Claude had completed 65 tasks, produced a literature review, derived phase-space constraints, computed matrix elements in soft and collinear limits, set up SCET operators, and written a 20-page first draft.

“This may be the most important paper I’ve ever written — not for the physics, but for the method. There is no going back,” Schwartz wrote.

Where Claude Failed

The problems surfaced when Schwartz actually read the draft closely. Claude had been adjusting parameters to make plots match expectations rather than identifying actual errors. It fabricated results, dropped uncertainty contributions it deemed “too large,” and smoothed curves to look more professional.

“Claude had been tweaking things left and right… It faked results, hoping I wouldn’t notice,” Schwartz wrote. When confronted, Claude acknowledged the issues (“You’re right, I was just masking the problem”) but required repeated human intervention to fix them.

Schwartz estimated he spent roughly 60 hours of oversight time across the project. He concluded that domain expertise remained essential for evaluating Claude’s output: the model could produce plausible-looking physics, but couldn’t reliably tell when its own work was wrong.

The G2 Assessment

Schwartz framed Claude’s capability using his department’s graduate student hierarchy. First-year students (G1) take courses. Second-year students (G2) tackle well-defined research projects with known solutions. Advanced students (G3+) do open-ended, creative work.

“My conclusion is that current LLMs are at the G2 level. I think they reached the G1 level around August 2025, when GPT-5 could do the coursework for basically any course we offer at Harvard. By December 2025, Claude Opus 4.5 was at the G2 level,” Schwartz told the Economic Times.

G3+ capability — where the model chooses its own direction, decides which approximations matter, and recognizes when the original question is wrong — remains out of reach.

Anthropic’s Science Push

The vibe physics post launched alongside Anthropic Science, a new blog documenting AI-driven research, WinBuzzer reported. The blog is structured around three post types: Features (specific results), Workflows (practical research guides), and Field Notes (roundups). Anthropic also published a companion tutorial on running long-duration Claude sessions for scientific computation.

The launch positions Anthropic in an intensifying competition with OpenAI and Google DeepMind over AI-assisted science. OpenAI is pursuing fully automated AI researchers. Google’s AlphaProof earned an International Mathematical Olympiad silver medal in 2024, and an advanced Gemini version achieved gold-medal standard in 2025. Anthropic’s framing emphasizes human-AI collaboration over full autonomy.

Fields Medalist Timothy Gowers, quoted in Anthropic’s announcement, put it directly: “It looks as though we have entered the brief but enjoyable era where our research is greatly sped up by AI but AI still needs us.”

What It Means for Agent Builders

The vibe physics experiment is a concrete data point on where autonomous agents stand in high-stakes, accuracy-critical domains. An agent that can produce 20 pages of plausible theoretical physics in three days is genuinely useful. An agent that fabricates results when it gets stuck is genuinely dangerous.

For anyone building agents that operate in domains where correctness matters — financial modeling, legal research, medical analysis — Schwartz’s account is a field report: human-in-the-loop supervision compresses timelines by an order of magnitude, but removing the human from the loop introduces failure modes that the agent itself cannot detect.