Nvidia dropped two new entries in its Nemotron 3 open model family at GTC 2026: Nemotron 3 Nano 4B, a compact model that fits on consumer GeForce RTX GPUs, and Nemotron 3 Super 120B, a large-scale model designed for DGX Spark and RTX PRO workstations. Alongside the Nemotron launches, Nvidia announced inference optimizations for Alibaba’s Qwen 3.5 and Mistral’s Small 4, a deliberate bet on the third-party open-model ecosystem rather than a walled-garden approach.

The models were announced via the RTX AI Garage blog post on March 18, positioned explicitly for running AI agents locally on Nvidia hardware.

Nemotron 3 Super 120B: The Flagship Local Model

Nemotron 3 Super is a 120-billion-parameter model with 12 billion active parameters (a mixture-of-experts architecture), designed for complex agentic AI workloads. On PinchBench — a benchmark specifically measuring LLM performance with OpenClaw — it scored 85.6%, making it the top-performing open model in its class.
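The active-parameter figure is what makes the MoE design matter for local deployment. A rough sketch of why, under the common approximation that per-token inference compute scales with active parameters (the ~2 FLOPs-per-active-parameter rule of thumb is an assumption, not a figure from the announcement):

```python
# Rough illustration (assumption): per-token transformer inference costs
# roughly 2 FLOPs per ACTIVE parameter, so an MoE model's per-token cost
# tracks its active parameters, not its total size.
def flops_per_token(active_params_billion: float) -> float:
    """Approximate per-token inference FLOPs for a given active-parameter count."""
    return 2 * active_params_billion * 1e9

moe = flops_per_token(12)     # Nemotron 3 Super: 12B active of 120B total
dense = flops_per_token(120)  # hypothetical dense 120B model for comparison

print(f"MoE per-token cost: ~{moe / dense:.0%} of an equally sized dense model")
```

By this estimate, Super 120B generates tokens at roughly the cost of a 12B dense model while retaining the capacity of its full 120B parameter pool.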

The model runs locally on DGX Spark (128GB unified memory) and RTX PRO 6000 workstations (96GB GPU memory). At quantized precision (Q4_K_M), the 120B model fits within the memory constraints of professional-grade hardware without requiring multi-GPU setups. For organizations deploying NemoClaw on-premises, this is the default model: large enough for production agentic workflows, small enough for a single workstation.
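The single-workstation claim checks out on a napkin. Assuming Q4_K_M averages about 4.8 bits per weight (a commonly cited figure for llama.cpp’s mixed 4/6-bit scheme, not one from Nvidia’s announcement):

```python
# Back-of-envelope weight footprint for a quantized model.
# Assumption: Q4_K_M averages ~4.8 bits per weight; KV cache and
# activations add more on top of this, but weights dominate.
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

super_gb = quantized_weight_gb(120, 4.8)  # ~72 GB of weights
print(f"Nemotron 3 Super 120B @ Q4_K_M: ~{super_gb:.0f} GB")
print(f"Fits in a 96 GB RTX PRO 6000: {super_gb < 96}")
```

Roughly 72GB of weights leaves comfortable headroom for KV cache within 96GB of GPU memory, and even more within DGX Spark’s 128GB of unified memory.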

Nemotron 3 Nano 4B: Agents on Consumer Hardware

The more surprising release is Nano 4B. At 4 billion parameters, it’s the smallest model Nvidia has branded under Nemotron — and the first explicitly positioned for consumer GPU deployment. Nvidia describes it as a fit for “action-taking conversational personas in games and apps that run on resource-constrained hardware.”

A 4B model runs comfortably on a GeForce RTX 4060 or higher, requiring roughly 3-4GB of VRAM at 4-bit quantization. The model emphasizes instruction-following and tool use — the two capabilities that matter most for AI agents — rather than raw reasoning benchmarks. For an OpenClaw user running a personal agent on a gaming PC or laptop, Nano 4B is the local alternative to paying per-token for GPT-5 or Claude API calls.
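The same arithmetic explains the 3-4GB figure. A minimal sketch, where the 1GB allowance for KV cache and activations is an assumption for illustration, not a measured number:

```python
# VRAM estimate = quantized weights + a rough fixed allowance for
# KV cache and activations. The overhead figure is an assumption.
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Approximate total VRAM needed to run a quantized model."""
    weights = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights + overhead_gb

nano = vram_estimate_gb(4, 4.0)  # 2 GB weights + ~1 GB overhead
print(f"Nemotron 3 Nano 4B @ 4-bit: ~{nano:.1f} GB VRAM")
```

At 4-bit, the weights alone are about 2GB, so even the 8GB VRAM of an RTX 4060 leaves room for a long context window.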

Third-Party Model Optimizations

Nvidia also announced optimized inference for two third-party open models:

Qwen 3.5 from Alibaba, available in 27B, 9B, and 4B parameter versions. The models natively support vision, multi-token prediction, and a 262,000-token context window. Nvidia specifically calls out the dense 27B variant as an ideal pairing with the RTX 5090 GPU. Qwen 3.5 models are available on Hugging Face.

Mistral Small 4, the 119-billion-parameter model (6 billion active parameters) that Mistral released last week under Apache 2.0. Nvidia is providing optimized inference paths for the model on DGX Spark and RTX PRO GPUs.

Both sets of optimizations work through Ollama, LM Studio, and llama.cpp with RTX GPU acceleration.

The Strategic Play: Open Models, Nvidia Hardware

The model lineup reveals Nvidia’s local agent strategy with unusual clarity. Rather than building one model to rule them all, Nvidia is covering four price-performance tiers simultaneously: Nemotron 3 Nano 4B on consumer GeForce RTX cards; Qwen 3.5, with its dense 27B variant paired to the RTX 5090, for enthusiast hardware; Mistral Small 4 on DGX Spark and RTX PRO workstations; and Nemotron 3 Super 120B at the top, on the same workstation-class hardware, as the default for production agentic workloads.

Each tier pairs a specific model class with specific Nvidia hardware. The message to developers: whatever your deployment target, there’s an Nvidia-optimized model and an Nvidia GPU waiting for it. The open licensing (Nemotron’s custom open license, Qwen’s Apache-adjacent terms, Mistral’s Apache 2.0) means none of these models lock developers into Nvidia’s ecosystem — but the optimized inference stack makes Nvidia hardware the path of least resistance.

The models are available now through Ollama, LM Studio, and llama.cpp, with Nvidia’s full model family details on the Nvidia newsroom.