This is a developing story. NCT previously reported on April 20 that Google Chief Scientist Jeff Dean confirmed inference-specialized TPU development. Today’s announcement formalizes the product launch with full specifications.
Google unveiled TPU 8t and TPU 8i at Cloud Next 2026 in Las Vegas on April 22, formally splitting its custom tensor processing unit line into dedicated training and inference architectures for the first time. The dual-chip design delivers up to 2.8x faster training and 80% higher performance per dollar for LLM inference compared to last year’s Ironwood TPUs, according to The Register’s analysis.
TPU 8t: Training at Scale
TPU 8t targets frontier model development, with Google’s Chief Technologist for AI Infrastructure Amin Vahdat stating it is “built to reduce the frontier model development cycle from months to weeks,” according to NewsBytes.
Each TPU 8t chip features 216 GB of high-bandwidth memory at 6.5 TB/s bandwidth, 128 MB of on-chip SRAM, up to 12.6 petaFLOPS of FP4 compute, and 19.2 Tbps of chip-to-chip bandwidth, per The Register.
A single TPU 8t superpod scales to 9,600 chips with 2 petabytes of shared high-bandwidth memory and double the interchip bandwidth of the previous generation, delivering 121 exaFLOPS of compute, according to Google’s technical blog. That represents nearly 3x the compute performance per pod over Ironwood.
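Those pod-level figures follow directly from the per-chip specs above; a quick back-of-envelope check (illustrative arithmetic only, using decimal units):

```python
# Sanity-check the published TPU 8t pod figures against the per-chip specs
# reported above. Illustrative arithmetic only, using decimal (SI) units.

CHIPS_PER_POD = 9_600            # chips in one TPU 8t superpod
HBM_PER_CHIP_GB = 216            # GB of high-bandwidth memory per chip
FP4_PER_CHIP_PFLOPS = 12.6       # peak FP4 petaFLOPS per chip

pod_hbm_pb = CHIPS_PER_POD * HBM_PER_CHIP_GB / 1_000_000      # GB -> PB
pod_fp4_eflops = CHIPS_PER_POD * FP4_PER_CHIP_PFLOPS / 1_000  # PFLOPS -> exaFLOPS

print(f"Shared HBM per pod: ~{pod_hbm_pb:.1f} PB")            # ~2.1 PB
print(f"Peak FP4 per pod:   ~{pod_fp4_eflops:.0f} exaFLOPS")  # ~121 exaFLOPS
```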
The scaling architecture uses optical circuit switches to connect those 9,600 chips in a single unified pod. Multiple pods then link through the new Virgo Network fabric, which uses high-density packet switches to connect up to 134,000 TPUs per datacenter and up to one million TPUs across multiple sites, as The Register reported.
Google is targeting 97% “goodput,” a measure of productive compute time, through real-time telemetry across tens of thousands of chips, automatic detection and rerouting around faulty links, and optical circuit switching that reconfigures hardware around failures without human intervention, per Google’s blog.
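Goodput is simply the share of wall-clock time a job spends doing useful work rather than recovering from failures; a minimal sketch of what a 97% target means over a month-long run, with invented downtime numbers:

```python
# Minimal illustration of the goodput metric: productive compute time divided
# by total wall-clock time. The downtime figures below are invented.

total_hours = 24 * 30              # a hypothetical 30-day training run
hours_lost_to_faults = 14          # link failures, chip swaps, reroutes (hypothetical)
hours_lost_to_restarts = 8         # checkpoint reloads after interruptions (hypothetical)

productive_hours = total_hours - hours_lost_to_faults - hours_lost_to_restarts
goodput = productive_hours / total_hours

print(f"Goodput: {goodput:.1%}")   # ~96.9%, just under the 97% target
```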
TPU 8i: Inference for Multi-Agent Workloads
TPU 8i addresses a different bottleneck. Inference is memory-bandwidth limited: each generated token requires streaming the entire model’s active weights through memory. Google redesigned the chip to trade raw FLOPS for faster memory access and larger on-chip caches.
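To see why bandwidth rather than raw FLOPS sets the ceiling, divide memory bandwidth by the bytes of active weights that must be read per generated token. The sketch below uses the TPU 8t bandwidth figure cited above (8i’s is not disclosed) and a hypothetical model size:

```python
# Rough upper bound on single-chip decode throughput when inference is
# memory-bandwidth limited: each token requires streaming the model's active
# weights from HBM at least once. Model size and precision are assumptions.

hbm_bandwidth_bytes_s = 6.5e12     # 6.5 TB/s, the TPU 8t figure; TPU 8i's is not stated
active_params = 70e9               # hypothetical 70B active parameters
bytes_per_param = 0.5              # 4-bit (FP4) weights

bytes_per_token = active_params * bytes_per_param
tokens_per_sec = hbm_bandwidth_bytes_s / bytes_per_token

print(f"Bandwidth-bound ceiling: ~{tokens_per_sec:.0f} tokens/s per chip")  # ~186
# Larger on-chip SRAM and batching raise effective throughput by reducing how
# often the same weight bytes must be re-read from HBM.
```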
The chip pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM, three times the on-chip cache of the previous generation, according to Google’s blog. TPU 8i connects 1,152 chips in a single pod, as Sundar Pichai noted in his keynote post.
Pichai stated that TPU 8i is designed to “deliver the massive throughput and low latency needed to concurrently run millions of agents cost-effectively.” The architectural intent is clear: when agents run continuously and interact with each other in multi-step workflows, small latency inefficiencies compound. Google designed TPU 8i to eliminate what it calls the “waiting room” effect, where processors sit idle between reasoning steps.
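The compounding is straightforward to model: if every reasoning step carries idle overhead on top of its compute time, total task latency scales with the number of steps times that overhead. A toy calculation with invented numbers:

```python
# Toy model of the "waiting room" effect: per-step idle time (queueing,
# scheduling, weight loading) compounds across multi-step agent workflows.
# All numbers are invented for illustration.

steps_per_task = 20                # reasoning and tool-use steps per agent task
compute_ms_per_step = 150          # time spent on actual inference per step

def task_latency_s(idle_ms_per_step):
    """End-to-end latency for one task, in seconds."""
    return steps_per_task * (compute_ms_per_step + idle_ms_per_step) / 1000

for idle_ms in (60, 10):
    print(f"{idle_ms} ms idle/step -> {task_latency_s(idle_ms):.1f} s per task")
# 60 ms idle/step -> 4.2 s per task; 10 ms idle/step -> 3.2 s per task
```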
Competitive Context
Google is not the first to specialize training and inference silicon. AWS has maintained separate Inferentia (inference) and Trainium (training) chip lines since introducing Inferentia in 2019 and Trainium the following year. Nvidia’s Blackwell Ultra generation was likewise tuned for inference, trading away high-precision compute for a 50% jump in memory capacity and FP4 throughput, as The Register noted.
Where Google differentiates is scale-up architecture. Nvidia’s latest GPUs support up to 576 accelerators in a single NVLink domain before requiring Ethernet or InfiniBand scale-out. Google’s TPU 8t connects 9,600 chips in a unified pod through optical circuit switches, bypassing that constraint entirely.
Google also disclosed that it is replacing x86 host processors with its own Arm-based Axion CPUs for TPU servers, mirroring Amazon’s shift to Graviton for Trainium 3, according to The Register.
Citadel Securities was named as an early TPU 8 customer, choosing the chips to “power their cutting-edge AI workloads,” per Google’s blog.
Infrastructure as Competitive Moat
The dual-chip announcement arrives alongside Google’s Gemini Enterprise Agent Platform launch and Agentic Data Cloud, forming a complete stack play: custom silicon, agent development platform, data infrastructure, and security. Pichai framed it in his keynote post as Google being “customer zero” for its own technologies: “75% of all new code at Google is now AI-generated and approved by engineers, up from 50% last fall,” he wrote.
For teams evaluating inference infrastructure for autonomous agent workloads, the key question is whether workload specialization at the silicon level translates into meaningful cost advantages over general-purpose GPU clusters. Google is betting that agents’ continuous, stateful, multi-model inference patterns are different enough from batch LLM inference to justify dedicated hardware. Both chips are expected to reach general availability later this year.