Google Research published two multi-agent frameworks on April 8 that target specific, high-friction bottlenecks in academic publishing: creating publication-ready figures and conducting peer reviews. Both use specialized agent teams rather than single-model approaches, and both beat existing automated baselines, with the figure-generation framework also exceeding a human-performance benchmark.

The frameworks are experimental research prototypes, not production tools. But they demonstrate a pattern that matters beyond academia: domain-specific multi-agent orchestration outperforming general-purpose models on tasks that require coordinated expertise.

PaperVizAgent: Five Agents, One Figure

PaperVizAgent (previously called PaperBanana) generates publication-ready academic illustrations from manuscript text. A researcher provides two inputs: source context (typically method sections with technical details) and communicative intent (a figure caption describing what the visual should convey).

Five specialized agents handle the work in sequence, according to Google Research:

  1. Retriever gathers reference figures from existing literature
  2. Planner organizes content and determines structure
  3. Stylist synthesizes aesthetic guidelines matching academic standards
  4. Visualizer renders the image or generates executable Python code for statistical plots
  5. Critic evaluates output against the original manuscript text, triggering iterative refinement if inconsistencies are found
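
The flow is sequential with one feedback edge, from the critic back to the visualizer. The sketch below is a minimal illustration of that orchestration, assuming simple callable agents and a fixed refinement budget; it is not Google's implementation.

    # Illustrative sketch of the five-role pipeline with a critic-driven
    # refinement loop. Role names mirror the list above; the interfaces,
    # the Critique type, and the stopping rule are assumptions, not Google's code.

    from dataclasses import dataclass, field


    @dataclass
    class Critique:
        consistent: bool                      # does the figure match the manuscript?
        issues: list[str] = field(default_factory=list)


    def generate_figure(source_context: str, intent: str,
                        agents: dict, max_rounds: int = 3):
        references = agents["retriever"](intent)                    # 1. reference figures
        plan = agents["planner"](source_context, intent)             # 2. content and structure
        style = agents["stylist"](references)                        # 3. aesthetic guidelines

        figure = agents["visualizer"](plan, style, feedback=None)    # 4. image or plot code
        for _ in range(max_rounds):
            critique: Critique = agents["critic"](figure, source_context)   # 5. consistency check
            if critique.consistent:
                return figure
            figure = agents["visualizer"](plan, style, feedback=critique.issues)
        return figure

Each entry in the agents mapping would wrap an LLM call (or, for statistical plots, a code-generating model plus an execution step); the structural point is that the critic, not the visualizer, decides when the output is done.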

The evaluation used comparative scoring on a 0-100 scale across four dimensions: faithfulness, conciseness, readability, and aesthetics. An LLM judge, calibrated against human-generated figures, set the human-performance baseline at 50.0.
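
Read that way, each generated figure is scored per dimension relative to the human-made figure for the same paper, which is pinned at 50, and the overall number aggregates the dimensions. The equal-weighted average below is an assumption for illustration; Google does not spell out the exact aggregation.

    # Hypothetical aggregation for the comparative rubric. The human figure is
    # anchored at 50 on every dimension; equal weighting across dimensions is
    # an assumption, and the example values are made up.

    DIMENSIONS = ("faithfulness", "conciseness", "readability", "aesthetics")
    HUMAN_BASELINE = 50.0

    def overall_score(per_dimension: dict[str, float]) -> float:
        return sum(per_dimension[d] for d in DIMENSIONS) / len(DIMENSIONS)

    made_up = {"faithfulness": 52.0, "conciseness": 63.0,
               "readability": 58.0, "aesthetics": 64.0}
    print(overall_score(made_up))   # 59.25, above the 50.0 human anchor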

PaperVizAgent scored 60.2 overall, the only framework tested to exceed the human baseline. It outperformed GPT-Image-1.5, Nano-Banana-Pro, and Paper2Any (the prior state of the art), per Google Research. Its strongest dimensions were conciseness and aesthetics, and it also achieved human-competitive results on statistical plot generation.

ScholarPeer: Adversarial Peer Review at Scale

ScholarPeer tackles the other end of the publication lifecycle. Exponential growth in paper submissions has strained the peer review system, causing reviewer fatigue and inconsistent evaluations. Google’s answer is a multi-agent reviewer that emulates how a senior researcher actually reads a paper.

The framework runs a dual-stream process of context acquisition and active verification, according to Google Research:

A domain historian agent constructs a domain narrative using live, web-scale literature search to ground the review in current knowledge. A baseline scout acts as an adversarial auditor, hunting for datasets or comparative baselines the authors may have missed. A multi-aspect Q&A engine verifies the paper’s technical claims with fact-based critique.
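
A rough sketch of that dual-stream flow, using the three roles named above; the interfaces, field names, and report assembly are assumptions for illustration rather than ScholarPeer's actual design:

    # Illustrative sketch of a dual-stream reviewer: context acquisition
    # (domain historian) feeds active verification (baseline scout + Q&A engine).
    # Agent interfaces and dictionary keys are assumptions, not Google's code.

    from dataclasses import dataclass


    @dataclass
    class Review:
        summary: str
        strengths: list[str]
        weaknesses: list[str]
        questions: list[str]


    def review_paper(paper_text: str, agents: dict) -> Review:
        # Stream 1: context acquisition, grounded in live literature search.
        narrative = agents["domain_historian"](paper_text)

        # Stream 2: active verification against that context.
        missing = agents["baseline_scout"](paper_text, narrative)   # overlooked datasets/baselines
        checks = agents["qa_engine"](paper_text, narrative)         # fact-checked technical claims

        # Assemble the standard report sections.
        return Review(
            summary=narrative["summary"],
            strengths=checks["supported_claims"],
            weaknesses=checks["contested_claims"] + [f"No comparison against {b}" for b in missing],
            questions=[f"Why was {b} not included as a baseline?" for b in missing],
        )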

The output is a standard review report: summary, strengths, weaknesses, and questions for the authors.

Tested on public datasets, ScholarPeer achieved significant win rates against state-of-the-art automated reviewers in side-by-side evaluations. The system’s reviews were highly critical (not softer than human reviews), realistic, and deeply grounded in existing literature. Google notes that the active verification workflow “drastically reduced the gap between AI-generated feedback and human-level diversity.”

The Multi-Agent Pattern

Both frameworks share a structural pattern: specialized agents coordinating toward an outcome, with built-in verification loops. PaperVizAgent’s critic agent catches inconsistencies and triggers refinement. ScholarPeer’s baseline scout adversarially checks for missing comparisons.
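
Stripped of domain specifics, both reduce to a generate-verify-refine loop: one agent produces a candidate, a separate agent checks it against independent evidence, and the loop ends only when the check passes or a budget runs out. A generic form of that loop, with illustrative names, might look like this:

    # Generic generate-verify-refine loop: PaperVizAgent's critic and ScholarPeer's
    # baseline scout both occupy the "verify" slot. Names and the retry budget
    # are illustrative, not taken from either framework.

    from typing import Any, Callable

    def verified_generation(generate: Callable[[list[str]], Any],
                            verify: Callable[[Any], list[str]],
                            budget: int = 3) -> Any:
        issues: list[str] = []
        draft: Any = None
        for _ in range(budget):
            draft = generate(issues)   # produce or revise a candidate
            issues = verify(draft)     # independent check; empty list means "accept"
            if not issues:
                break
        return draft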

This mirrors the multi-agent architectures being deployed in production settings by Microsoft (Agent Framework), Anthropic (Managed Agents), and open-source projects like CrewAI and AutoGen. The difference is domain specificity. Where those platforms provide general-purpose orchestration, Google’s frameworks show that narrowing the agent team to a specific workflow, with agents that understand the domain constraints, produces better results than broader approaches.

The practical takeaway for teams building agent systems: the five-agent PaperVizAgent architecture and ScholarPeer’s adversarial verification pattern are replicable outside academia. Any domain with structured workflows, quality constraints, and iterative refinement cycles (legal document review, financial auditing, engineering design) could benefit from similar specialized agent teams.