Meta published a detailed post on April 16 on Engineering at Meta, its official engineering blog, describing how its Capacity Efficiency Program uses AI agents to find and fix performance problems across infrastructure serving more than 3 billion users. The concrete numbers: hundreds of megawatts of power recovered (enough to supply hundreds of thousands of American homes for a year), and automated diagnosis that compresses approximately 10 hours of manual regression investigation into 30 minutes.
The Architecture: Tools and Skills
Meta’s approach splits AI agent capabilities into two layers.
MCP Tools are standardized interfaces that let language models invoke specific functions: query profiling data, fetch experiment results, retrieve configuration history, search code, or extract documentation. Each tool does one thing.
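As a concrete illustration, here is a minimal sketch of what one such tool could look like, built on the open-source MCP Python SDK. The server name, tool function, and canned data are hypothetical stand-ins, not Meta's internal tooling.

```python
# A minimal MCP tool sketch using the open-source MCP Python SDK.
# Server name, tool, and returned data are hypothetical illustrations.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("profiling-tools")

@mcp.tool()
def query_profiling_data(function_name: str, days: int = 7) -> dict:
    """Return CPU-time samples for one function over the last `days` days."""
    # A real tool would query a profiling datastore; canned data keeps
    # this sketch self-contained.
    return {
        "function": function_name,
        "window_days": days,
        "cpu_seconds_per_day": [1520.4, 1498.7, 1611.2],  # placeholder values
    }

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio to an MCP-capable model client
```

The typed signature and docstring are what the MCP client exposes to the model, so each narrowly scoped function is discoverable and callable without custom glue code.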
Skills encode the domain expertise of senior efficiency engineers. A skill tells the model which tools to use and how to interpret results, capturing reasoning patterns developed over years. For example, “consult the top GraphQL endpoints for endpoint latency regressions” or “look for recent schema changes if the affected function handles serialization.”
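A skill, by contrast, is mostly structured instructions plus an allowed tool set. The schema below is an assumption for illustration; Meta has not published its internal skill format.

```python
# A hypothetical skill definition: expert reasoning expressed as
# instructions plus the tools the model may use. The field names and
# structure are illustrative assumptions, not Meta's internal schema.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    tools: list[str]      # MCP tools this skill is allowed to invoke
    instructions: str     # the captured reasoning pattern

graphql_latency = Skill(
    name="graphql-endpoint-latency",
    tools=["query_profiling_data", "fetch_experiment_results", "search_code"],
    instructions=(
        "For endpoint latency regressions, consult the top GraphQL endpoints "
        "first. If the affected function handles serialization, look for "
        "recent schema changes before inspecting application code."
    ),
)
```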
Together, tools and skills turn a general language model into something that can apply domain knowledge typically held by the most experienced engineers on the team. The same tools power both offensive (finding optimizations) and defensive (catching regressions) workflows. Only the skills differ.
Defense: FBDetect and the AI Regression Solver
FBDetect, Meta’s in-house regression detection system (described in a paper at SOSP 2024), catches performance regressions as small as 0.005% in noisy production environments. When it identifies a regression, the system correlates it to a specific code or configuration change.
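To get a feel for why a 0.005% threshold is hard in noisy data, a back-of-the-envelope sample-size calculation helps. This is generic statistics with an assumed noise level, not FBDetect's actual algorithm.

```python
# Back-of-the-envelope: how many samples does it take to resolve a 0.005%
# mean shift in noisy measurements? Generic statistics, not FBDetect's
# algorithm; the noise level is an assumed illustration.
import math

effect = 0.00005   # 0.005% relative regression
noise_cv = 0.10    # assumed coefficient of variation (10% measurement noise)
z = 3.0            # roughly a three-sigma detection threshold

# Detect when the standard error of the mean shrinks below effect / z:
#   noise_cv / sqrt(n) <= effect / z  =>  n >= (z * noise_cv / effect)^2
n = math.ceil((z * noise_cv / effect) ** 2)
print(f"samples needed: {n:,}")  # -> samples needed: 36,000,000
```

Tens of millions of samples per comparison are a non-starter for a single service, which suggests why this kind of detection is only practical when you can aggregate across a fleet of Meta's size.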
The new addition is what Meta calls the AI Regression Solver. Previously, the offending change was either rolled back (slowing engineering velocity) or the regression was left in place (wasting infrastructure resources). Now, Meta’s in-house coding agent activates automatically to work through three steps, sketched in code after this list:
- Gather context using tools: identify the functions that regressed and look up the root-cause pull request, including the exact files and lines changed.
- Apply domain expertise through skills: select the right mitigation strategy for the specific codebase, language, or regression type. Logging regressions, for instance, can be mitigated by increasing sampling.
- Produce a fix-forward pull request and send it to the original author for review.
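That loop might look something like the skeleton below. The `agent` API, tool names, and regression fields are hypothetical stand-ins; Meta has not published its coding agent's interface.

```python
# A skeleton of the fix-forward loop described in the post. The `agent`
# object, tool names, and `regression` fields are hypothetical stand-ins.
def solve_regression(regression, agent, skills):
    # 1. Gather context with tools: what regressed, and which PR caused it.
    profile = agent.call_tool("query_profiling_data",
                              function_name=regression.function)
    culprit = agent.call_tool("get_root_cause_pr", regression_id=regression.id)

    # 2. Apply domain expertise: pick the skill matching the regression type
    #    (a "logging" skill, for example, might mitigate by adjusting sampling).
    skill = skills[regression.kind]
    plan = agent.reason(skill.instructions,
                        context={"profile": profile, "culprit": culprit})

    # 3. Produce a fix-forward PR and route it to the original author.
    patch = agent.generate_patch(plan)
    return agent.open_pull_request(patch, reviewer=culprit["author"])
```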
The human engineer becomes the reviewer, not the implementer.
Offense: From Opportunity to Shipped Code
On the proactive side, Meta’s system identifies “efficiency opportunities,” proposed code changes believed to improve performance. Engineers can view an opportunity and request an AI-generated pull request that implements it. What previously required hours of manual investigation now takes minutes to review and deploy.
The pipeline mirrors the defensive workflow: gather context with tools, apply domain expertise through skills, produce a pull request. The result is a fully automated path from identified opportunity to ready-to-review code change.
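Under the same assumptions as the solver skeleton above, the offensive path would differ only in which skill drives the reasoning; the tool layer is unchanged.

```python
# The same hypothetical pipeline, reused for proactive optimization:
# identical tools, a different skill. All names remain illustrative.
def implement_opportunity(opportunity, agent, skills):
    profile = agent.call_tool("query_profiling_data",
                              function_name=opportunity.function)
    source = agent.call_tool("search_code", symbol=opportunity.function)

    skill = skills["optimization"]  # an offensive skill, not a defensive one
    plan = agent.reason(skill.instructions,
                        context={"profile": profile, "source": source})

    patch = agent.generate_patch(plan)
    return agent.open_pull_request(patch, reviewer=opportunity.owner)
```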
What the Numbers Mean
At Meta’s scale, where code serves 3 billion people, even a 0.1% performance regression translates to significant additional power consumption. The “hundreds of megawatts recovered” figure represents cumulative savings from both sides of the program: catching regressions before they compound across the fleet, and shipping proactive optimizations that engineers would never have had time to implement manually.
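The post does not break the savings down, but rough arithmetic shows the scale involved. Every number below is an assumption for illustration, not a figure from Meta.

```python
# Illustrative arithmetic only: the fleet draw and homes-per-MW figures
# are assumptions, not numbers from Meta's post.
fleet_power_mw = 5_000   # hypothetical aggregate compute draw, in MW
regression = 0.001       # a 0.1% fleet-wide performance regression

# Assume wasted CPU time translates roughly one-to-one into power draw.
extra_mw = fleet_power_mw * regression
homes_per_mw = 800       # ~1.2 kW average draw per U.S. household
print(f"{extra_mw:.0f} MW ≈ {extra_mw * homes_per_mw:,.0f} homes")  # 5 MW ≈ 4,000 homes
```

On an assumed fleet of this size, one tenth of one percent is already a small town's worth of electricity, and repeated wins of that shape are what compound into a headline figure measured in hundreds of megawatts.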
Meta described the end goal as “a self-sustaining efficiency engine where AI handles the long tail,” scaling megawatt delivery across a growing number of product areas without proportionally scaling headcount. The agents handle the investigation and implementation. Engineers handle the review and approval.
Production-Scale Validation
This is one of the most concrete, metrics-backed AI agent deployment case studies published by a major technology company. The specifics matter: these are not projected outcomes or benchmark results but production measurements from live infrastructure, at a scale at which few organizations in the world operate.
For teams building their own AI agent platforms, Meta’s architecture offers a replicable pattern: standardize tool interfaces, encode domain expertise as composable skills, and let the same platform serve both proactive optimization and reactive incident response. The human bottleneck shifts from doing the work to reviewing the work.