Meta has deployed a unified AI agent platform that autonomously finds and fixes infrastructure performance issues across its fleet, recovering hundreds of megawatts of power and compressing investigations that took engineers 10 hours into 30-minute automated workflows. The company detailed the system in a Meta Engineering blog post published April 16, calling it the infrastructure backbone of its Capacity Efficiency Program.
How It Works
The platform operates on two layers. Standardized MCP tool interfaces let agents query profiling data, fetch experiment results, retrieve configuration history, search code, and extract documentation. On top of those tools sit “skills” that encode the domain expertise of senior efficiency engineers, capturing reasoning patterns developed over years of debugging performance regressions at hyperscale.
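The two-layer split can be pictured in a few lines of code. This is a minimal sketch under stated assumptions: all tool and skill names below are hypothetical illustrations of the pattern, not Meta's published implementation.

```python
"""Illustrative two-layer agent platform: standardized tool interfaces
below, expert 'skills' above. All names are hypothetical."""
from dataclasses import dataclass
from typing import Callable

# Layer 1: standardized, MCP-style tool interfaces. Each tool is a named
# callable the agent can invoke against an infrastructure data source.
TOOLS: dict[str, Callable[[str], str]] = {
    "query_profiling_data": lambda fn: f"profile for {fn}",
    "fetch_experiment_results": lambda exp: f"results for {exp}",
    "get_config_history": lambda svc: f"config diffs for {svc}",
    "search_code": lambda pattern: f"matches for {pattern}",
    "extract_docs": lambda topic: f"docs on {topic}",
}

# Layer 2: a 'skill' encodes a senior engineer's reasoning pattern as an
# ordered plan of tool calls; the agent executes the plan via Layer 1.
@dataclass
class Skill:
    name: str
    plan: list[tuple[str, str]]  # (tool_name, argument) steps

    def run(self) -> list[str]:
        return [TOOLS[tool](arg) for tool, arg in self.plan]

triage = Skill(
    name="regression_triage",
    plan=[
        ("query_profiling_data", "hot_function"),
        ("get_config_history", "web_tier"),
        ("search_code", "hot_function"),
    ],
)
print(triage.run())
```

The key property is that skills never talk to data sources directly; they compose the shared tool layer, which is what lets one platform serve many efficiency problems.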
The system runs both offense and defense. On defense, Meta’s FBDetect tool catches thousands of performance regressions weekly, detecting changes as small as 0.005% in noisy production environments. When it flags a regression, an AI Regression Solver activates: the agent gathers context about the regressing functions, identifies the root-cause pull request, applies domain-specific mitigation knowledge, and produces a fix-forward pull request sent to the original author for review.
On offense, the platform generates pull requests for proactive efficiency opportunities: proposed code changes expected to improve the performance of existing systems. What previously required hours of investigation by engineers now takes minutes to review and deploy.
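The defensive loop described above (gather context, identify the root-cause pull request, apply mitigation knowledge, draft a fix-forward PR) can be sketched as a simple pipeline. Every function name, data shape, and PR identifier here is an assumption for illustration; Meta has not published this interface.

```python
"""Hedged sketch of a regression-solver pipeline: gather context, find
the root-cause change, draft a fix-forward PR for author review.
All identifiers are hypothetical placeholders."""
from dataclasses import dataclass

@dataclass
class Regression:
    function: str     # regressing function flagged by the detector
    delta_pct: float  # measured slowdown, in percent

def gather_context(reg: Regression) -> dict:
    # A real system would call profiling / config-history tools here.
    return {"function": reg.function, "recent_prs": ["PR-101", "PR-102"]}

def find_root_cause(context: dict) -> str:
    # Placeholder heuristic: attribute the regression to one recent PR.
    return context["recent_prs"][0]

def propose_fix(pr: str, reg: Regression) -> dict:
    # Draft a fix-forward PR, routed back to the original author.
    return {
        "title": f"Fix {reg.delta_pct}% regression in {reg.function}",
        "root_cause": pr,
        "action": "send_for_author_review",
    }

def solve(reg: Regression) -> dict:
    return propose_fix(find_root_cause(gather_context(reg)), reg)

fix = solve(Regression(function="render_feed", delta_pct=0.12))
print(fix["root_cause"], fix["action"])
```

The design choice worth noting is the final step: the agent does not land the fix itself; it hands a reviewable pull request back to the engineer who introduced the change, keeping a human in the loop.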
The Scale of Recovery
When code serves more than 3 billion people, a 0.1% performance regression translates directly to significant additional power consumption. The hundreds of megawatts Meta reports recovering are, in the company’s own framing, “enough to power hundreds of thousands of American homes for a year.”
The efficiency gains compound. Without automated resolution, regressions detected by FBDetect would stack up across the fleet while waiting for engineer bandwidth. The AI agent platform breaks that bottleneck, allowing the Capacity Efficiency Program to scale megawatt delivery across more product areas without proportionally scaling headcount.
Part of a Broader Agent Push
The Capacity Efficiency platform sits alongside a growing portfolio of internal AI agent systems at Meta. In a separate April 2 post, Meta Engineering detailed KernelEvolve, an agentic kernel authoring system that compresses weeks of expert hardware optimization work into hours of automated search and evaluation, achieving over 60% inference throughput improvements across NVIDIA GPUs, AMD GPUs, and Meta’s custom MTIA silicon.
Reuters reported on April 9 that Meta reorganized its Applied AI organization, transferring top engineers into a new team specifically tasked with building tools and evaluations to accelerate AI agent development. The reorganization accompanied broader layoffs as Meta redirects resources toward AI infrastructure.
What the Architecture Reveals
The design choice that stands out is composability. Rather than building bespoke AI systems for each efficiency problem, Meta built a shared platform where the same tools power both regression detection and proactive optimization. Only the skills layer differs between offense and defense.
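That composability claim can be stated almost as an invariant. A minimal sketch, assuming hypothetical tool and skill names: both skill sets draw exclusively from one shared tool layer, and only the skill layer differs.

```python
"""Sketch of the composability invariant: one shared tool layer, two
distinct skill sets. Names are illustrative, not Meta's implementation."""
shared_tools = {"profile", "experiments", "config_history",
                "code_search", "docs"}

# Defense (regression triage) and offense (proactive optimization)
# select different subsets of the same shared tools.
defense_skill_tools = {"profile", "config_history", "code_search"}
offense_skill_tools = {"profile", "experiments", "docs"}

# Invariant: no skill uses a tool outside the shared layer.
assert defense_skill_tools <= shared_tools
assert offense_skill_tools <= shared_tools
print("both skill sets compose only the shared tool layer")
```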
For infrastructure teams watching from outside Meta, the architectural lesson is specific: encode your senior engineers' debugging patterns as reusable agent skills rather than leaving them as tribal knowledge that walks out the door when someone changes teams. Meta's system works because it encodes years of domain expertise in structured, composable components that any agent can invoke. The agents then handle the long tail of performance work that no human team could scale to manually.