NVIDIA published the full technical report for Nemotron 3 Super on April 17, 2026, detailing the architecture behind its open-weight model designed explicitly for agentic AI workloads. The model totals 120B parameters but activates only 12B per token through a Mixture-of-Experts (MoE) architecture, targeting the cost and latency problems that make multi-agent systems impractical at scale.
Architecture: Mamba-Transformer Hybrid with Latent MoE
The technical blog post explains that Nemotron 3 Super interleaves three layer types: Mamba-2 layers for linear-time sequence processing, Transformer attention layers for precise associative recall, and MoE layers for scaling parameter count without proportional compute cost.
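The post does not specify the interleaving ratio between the three layer types, so the repeating pattern below is purely hypothetical. A minimal sketch of how such a hybrid stack might be planned:

```python
# Sketch of a hybrid block stack. The real Mamba-2 : attention : MoE
# ratio is not published in the source; the pattern here is a
# hypothetical example of the interleaving idea.
MAMBA, ATTN, MOE = "mamba2", "attention", "moe"

def build_layer_plan(num_layers: int, pattern=(MAMBA, MAMBA, ATTN, MOE)) -> list:
    """Repeat a short layer pattern to fill the stack, as hybrid
    Mamba-Transformer models typically do."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]

plan = build_layer_plan(12)
# Mamba-2 layers dominate for linear-time sequence mixing, with
# periodic attention layers for recall and MoE layers for capacity.
```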
The standout architectural choice is latent MoE. Standard MoE routes tokens directly from the full hidden dimension to experts. Nemotron 3 Super compresses tokens into a low-rank latent space before routing, allowing the model to consult 4x as many experts for the same inference cost. According to NVIDIA’s blog, this enables “highly specialized routing, for example, activating distinct experts for Python syntax versus SQL logic, that are only activated when strictly necessary.”
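The core of the idea can be sketched in a few lines: the router scores experts from a small latent vector rather than the full hidden state. All dimensions and weights below are hypothetical stand-ins, not Nemotron's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 64, 16, 32, 2  # hypothetical sizes

# Low-rank down-projection into the latent routing space.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
# The router scores experts from the small latent vector, not the full
# hidden state, so far more experts can be scored for the same cost.
W_router = rng.normal(size=(d_latent, n_experts)) / np.sqrt(d_latent)

def route(hidden):
    """Return the top-k expert indices and their softmax weights."""
    latent = hidden @ W_down              # (d_latent,)
    logits = latent @ W_router            # (n_experts,)
    top = np.argsort(logits)[-top_k:][::-1]
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()

experts, weights = route(rng.normal(size=d_model))
```

Only the selected top-k experts run a forward pass for that token; the rest stay idle, which is how expert count scales without proportional compute.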
Multi-token prediction (MTP) heads predict several future tokens simultaneously from each position, enabling built-in speculative decoding at inference. NVIDIA reports up to 3x wall-clock speedups for structured generation tasks like code and tool calls, without requiring a separate draft model.
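The draft-then-verify loop behind this speedup can be illustrated with toy stand-ins; the functions below are not the model's actual heads, just a sketch of the accept/reject mechanics of speculative decoding with built-in draft heads.

```python
def mtp_draft(context, k=4):
    # Toy stand-in for k MTP heads: propose the next k tokens in one
    # cheap step (here, a deterministic formula over a toy vocab).
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def verify(context, drafted):
    # Toy stand-in for one batched forward pass of the main model that
    # scores every drafted position at once. Matching draft tokens are
    # accepted; the first mismatch is replaced and drafting restarts.
    accepted = []
    for tok in drafted:
        target = (context[-1] + 1) % 100  # toy "true" next token
        if tok == target:
            accepted.append(tok)
            context = context + [tok]
        else:
            accepted.append(target)
            break
    return accepted
```

When drafts match (common in highly structured output like code or tool-call JSON), several tokens land per verification pass instead of one, which is where the reported wall-clock gains come from.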
Performance and Context
The model ships with a native 1M-token context window, addressing what NVIDIA calls the “context explosion” problem in multi-agent systems, where agents re-send history, tool outputs, and reasoning steps at every turn. According to the technical blog, multi-agent systems generate up to 15x the tokens of standard chats.
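The 15x figure is NVIDIA's; a toy accounting shows why the growth is superlinear when each turn re-sends the full history. The turn counts and token sizes below are made up for illustration.

```python
def total_tokens_processed(turns, tokens_per_turn):
    """If every turn re-sends the entire history, the model processes
    the running prefix each time: turn t costs t * tokens_per_turn,
    so total work grows roughly quadratically in turn count."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

single = total_tokens_processed(1, 500)   # one-shot chat: 500 tokens
multi = total_tokens_processed(10, 500)   # hypothetical 10-turn agent loop
ratio = multi / single                    # 55x in this toy setup
```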
On PinchBench, a benchmark measuring LLM performance as the brain of an OpenClaw agent, Nemotron 3 Super scores 85.6% across the full test suite, making it the highest-scoring open model in its class. NVIDIA reports over 5x throughput compared to the previous Nemotron Super.
Native NVFP4 pretraining on NVIDIA Blackwell hardware cuts memory requirements and speeds up inference by 4x on B200 GPUs compared to FP8 on H100, according to the model card. The model was post-trained with reinforcement learning across 21 environment configurations using more than 1.2 million environment rollouts.
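FP4 (E2M1) can represent only 16 values, so 4-bit formats pair the tiny grid with per-block scale factors. The sketch below shows that blockwise idea in simplified form; it is not NVIDIA's exact NVFP4 encoding, and the block size is a hypothetical choice.

```python
import numpy as np

# Positive magnitudes representable in E2M1 (FP4). NVFP4 combines a
# grid like this with a per-block scale; this is a simplified sketch,
# not NVIDIA's exact format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, block=16):
    """Quantize-dequantize x in blocks: scale each block so its max
    magnitude maps to the top of the FP4 grid, then snap to the grid."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        scale = np.abs(blk).max() / FP4_GRID[-1]
        if scale == 0.0:
            scale = 1.0  # all-zero block
        mags = np.abs(blk) / scale
        idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block] = np.sign(blk) * FP4_GRID[idx] * scale
    return out
```

Each value stores in 4 bits plus a shared scale per block, which is where the memory savings over FP8 and FP16 come from.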
Open Weights, Open Deployment
The model is fully open under the NVIDIA Open License with weights, datasets, and training recipes published. It is already available on Hugging Face, NVIDIA NIM, Ollama, and OpenRouter.
For agent builders evaluating open-weight alternatives to closed API models, the combination of 1M context, 12B active parameters, and the efficiency gains from latent MoE and MTP makes Nemotron 3 Super a direct competitor to proprietary agentic models. Teams running multi-agent orchestration on-premise or through NVIDIA partners can avoid per-inference API costs entirely.