SkyPilot, the open-source cloud orchestration tool, published results last week from an experiment that scaled Andrej Karpathy’s autoresearch framework from a single GPU to a 16-GPU Kubernetes cluster managed entirely by Anthropic’s Claude Code agent. Over eight hours, the agent submitted approximately 910 machine learning experiments, reduced validation bits per byte (val_bpb) from 1.003 to 0.974, and reached its best result 9x faster than a simulated sequential baseline running on one GPU.
What Autoresearch Does
Karpathy released autoresearch in mid-March as a 630-line Python framework where a coding agent autonomously improves a neural network training script. The agent edits a train.py file, runs a five-minute training experiment on a GPU, checks the validation loss, and loops, keeping changes that help and discarding those that don’t. In Karpathy’s own overnight run, the agent found roughly 20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard, according to SkyPilot’s account of the original results.
The constraint is deliberate: a fixed five-minute wall-clock training budget per experiment, with one file the agent can modify and one it cannot touch. Everything in train.py is fair game — architecture, hyperparameters, optimizer settings, batch size, model depth — as long as the code runs without crashing.
The limitation is equally deliberate. With one GPU, the agent runs roughly 10-12 experiments per hour, each taking about five minutes of training plus a minute of setup and agent reasoning time. That means greedy hill-climbing: try one thing, check, repeat.
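The edit-train-check loop is simple enough to sketch in a few lines of Python. The helper names below (propose_edit, run_experiment) are hypothetical stand-ins for the agent's real actions, not autoresearch's actual API:

```python
# Minimal sketch of autoresearch's greedy hill-climbing loop.
# All helper names are hypothetical stand-ins, not the framework's API.

def hill_climb(propose_edit, run_experiment, n_iters=100):
    """Keep an edit to train.py only if it lowers validation bits-per-byte."""
    best_bpb = run_experiment()          # baseline run under the 5-minute budget
    history = [best_bpb]
    for _ in range(n_iters):
        snapshot = propose_edit()        # agent edits train.py, gets an undo handle
        bpb = run_experiment()           # one 5-minute training run
        if bpb < best_bpb:
            best_bpb = bpb               # improvement: keep the change
        else:
            snapshot.undo()              # no improvement: discard it
        history.append(best_bpb)
    return best_bpb, history
```

The key property is the serial dependency: each iteration waits on the previous result, which is exactly what caps a single GPU at roughly one experiment every six minutes.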
What Changes With 16 GPUs
SkyPilot engineers Alex Kim and Romil Bhardwaj gave Claude Code access to 16 GPUs on Kubernetes backed by CoreWeave, using SkyPilot’s orchestration layer to let the agent provision and manage clusters autonomously. The agent read SkyPilot’s skill documentation, then launched and managed GPU clusters on its own with no manual cloud configuration.
Throughput jumped from roughly 10 experiments per hour to approximately 90. But the more significant change was strategic. Instead of testing one hypothesis per five-minute window, the agent ran factorial grids of 10-13 experiments per wave, testing interaction effects between parameters that sequential search would miss.
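A factorial wave of this kind is essentially a Cartesian product over parameter lists. The sketch below uses itertools.product with illustrative parameter names and values, not the agent's actual grid:

```python
import itertools

# Illustrative factorial grid for one wave; parameters and values are
# made up for the example, not taken from the agent's actual sweeps.
grid = {
    "lr": [0.01, 0.02],
    "weight_decay": [0.0, 0.1],
    "batch_size": [32, 64, 128],
}

def make_wave(grid):
    """Expand a dict of parameter lists into one config dict per experiment."""
    keys = list(grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(grid[k] for k in keys))]

wave = make_wave(grid)
print(len(wave))  # 2 * 2 * 3 = 12 configs, one per GPU in a wave
```

Because every combination runs in the same five-minute window, the wave exposes interaction effects (say, between learning rate and batch size) that one-at-a-time hill climbing cannot see.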
The run progressed through five distinct phases, a structure the agent arrived at on its own rather than from instructions:
Phase 1 (first ~200 experiments): Hyperparameter sweeps mapped batch size, Adam betas, weight decay, and learning rate schedules. Val_bpb dropped from 1.003 to 0.981.
Phase 2 (~200-420): The largest single improvement. The agent tested six different model aspect ratios simultaneously — AR=48, 64, 72, 80, 90, 96 — in a single five-minute wave. Sequentially, that would take 30 minutes. The result: scaling model width from the default (model_dim=384) to AR=96 (model_dim=768) outperformed every hyperparameter tweak from Phase 1. Val_bpb dropped to 0.977.
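The aspect-ratio figures quoted above are consistent with a convention of model_dim = aspect_ratio * depth at a fixed depth of 8 layers (48 * 8 = 384, 96 * 8 = 768). The sketch below assumes that mapping, which is an inference from the numbers rather than something the post states:

```python
# The post's figures (AR=48 -> model_dim=384, AR=96 -> model_dim=768) match
# model_dim = aspect_ratio * depth with depth fixed at 8 layers. That mapping
# is an inference from the numbers, not confirmed by the source.
DEPTH = 8

def model_dim(aspect_ratio, depth=DEPTH):
    return aspect_ratio * depth

for ar in (48, 64, 72, 80, 90, 96):
    print(ar, "->", model_dim(ar))   # 48 -> 384 ... 96 -> 768
```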
Phases 3-4 (~420-700): Fine-tuning the wider model and optimizer parameters yielded progressively smaller gains; the biggest late-stage find was a Muon optimizer beta2 change from 0.95 to 0.98, worth roughly 0.001 val_bpb.
Phase 5 (~700-910): Diminishing returns. Combinatorial sweeps over final learning rate fractions and embedding parameters produced improvements below 0.0001 per experiment.
The Hardware Discovery
The most notable emergent behavior: the agent figured out on its own that it had access to two different GPU types. Of the 16 clusters, 13 landed on H100s and 3 on H200s. Nobody told Claude Code about the performance difference. After a few waves, it noticed identical configurations scored roughly 0.005 val_bpb lower on H200 clusters because faster step times meant more training steps within the five-minute budget.
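The mechanism behind that gap is straightforward: with a fixed wall-clock budget, the number of optimizer steps scales inversely with per-step time. The step times below are invented numbers for illustration, not measurements:

```python
# Illustrative only: a fixed five-minute budget converts a faster step time
# directly into more optimizer steps. Step times here are invented numbers,
# not measured H100/H200 figures.
BUDGET_S = 300  # five-minute wall-clock budget, in seconds

def steps_in_budget(step_time_s):
    return round(BUDGET_S / step_time_s)

h100_steps = steps_in_budget(0.50)   # hypothetical slower per-step time
h200_steps = steps_in_budget(0.40)   # hypothetical faster per-step time
print(h100_steps, h200_steps)        # 600 vs 750 steps in the same budget
```

More steps in the same budget means the same config trains further, which is why identical train.py files scored measurably better on the faster hardware.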
The agent then developed a two-tier strategy: screen 10+ hypotheses cheaply on H100s in parallel, then promote the top 2-3 candidates to H200s for confirmation runs. According to the SkyPilot blog post, the agent’s own reasoning included: “Only 3 H200 clusters: gpu-03, gpu-04, gpu-08! The rest are H100. This explains everything — H200 is significantly faster than H100.”
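The screen-then-promote logic can be sketched as follows; the function and its ranking rule are a hypothetical reconstruction of the strategy described, not SkyPilot's code:

```python
# Hypothetical reconstruction of the agent's two-tier strategy: screen many
# candidates on the slower, more plentiful H100s, then promote only the top
# few to H200s for a confirmation run. Lower val_bpb is better.

def two_tier(configs, run_on_h100, run_on_h200, promote=3):
    screened = sorted(configs, key=run_on_h100)         # cheap screening pass
    finalists = screened[:promote]                      # top 2-3 candidates
    confirmed = {c: run_on_h200(c) for c in finalists}  # confirmation runs
    return min(confirmed, key=confirmed.get)
```

In practice the screening runs would execute in parallel across the 13 H100 clusters; the sequential calls here just stand in for collecting their scores.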
Context and Limitations
The experiment operated within autoresearch’s constrained scope: optimizing a small GPT training script on a five-minute budget. The roughly 2.9% relative improvement in validation loss (1.003 to 0.974) is meaningful within that sandbox but does not directly translate to frontier model training, which requires multi-node clusters and training runs measured in weeks, not minutes. As Abhishek Gautam noted in an analysis of Karpathy’s original framework, autoresearch cannot perform novel mathematical derivations, handle custom data collection, or replace an experienced researcher’s judgment about which questions are worth asking.
What the SkyPilot experiment does demonstrate is that a coding agent, given access to parallel compute and minimal instructions, will independently develop research strategies — factorial experimental designs, hardware-aware optimization, phased exploration — that resemble how human ML researchers work when they have unlimited GPU access. The difference is that the agent did it in eight hours instead of an estimated 72 hours for sequential execution.
The full setup, including agent instructions and SkyPilot YAML configurations, is open-sourced on GitHub.