MiniMax M3 Launches on NVIDIA with Free Inference Endpoint, Targeting 24/7 Agent Workloads

MiniMax, the Chinese AI lab, released M3 on NVIDIA’s accelerated infrastructure on June 12, with NVIDIA providing a free GPU-accelerated inference endpoint through its API catalog. The 428-billion-parameter mixture-of-experts model supports text, image, and video inputs natively and carries a one-million-token context window, positioning it for long-running agent workflows and extended coding sessions.

Architecture

M3 uses a mixture-of-experts design with 128 total experts and 4 activated per token, keeping the active parameter count at approximately 22 billion per forward pass, according to NVIDIA’s technical blog. The model includes a 600-million-parameter visual encoder and supports BF16 and MXFP8 precision formats.

The core architectural innovation is MiniMax Sparse Attention (MSA), which replaces standard quadratic attention with a pre-filtering stage that identifies relevant context blocks and attends only to those. NVIDIA’s blog states MSA yields 1/20th the per-token compute cost of the previous-generation M2 at one-million-token context, with 9x faster prefill and 15x faster decoding, without compressing key-values or sacrificing precision. Each KV cache block is read once with contiguous memory access, which NVIDIA reports is more than 4x faster than existing sparse attention implementations.

The model trains natively on text, images, and video from step zero across approximately 100 trillion interleaved tokens, rather than adding multimodal capabilities through post-training.

Deployment and Pricing

Developers can access M3 through three open-source inference paths: NVIDIA TensorRT LLM (text-only), SGLang, and vLLM. For large-scale production serving, NVIDIA Dynamo provides disaggregated inference. Fine-tuning is available through the NVIDIA NeMo Framework with full N-D parallelism and context parallelism up to 128K tokens.

The free endpoint removes the upfront cost barrier for teams evaluating the model. TestingCatalog noted that this is “especially useful if you want to run a weekend project or save tokens for your 24/7 agents, such as OpenClaw or Hermes.”

The Inference Cost Race

M3’s arrival on NVIDIA’s platform with a free endpoint adds to the downward pressure on inference pricing across the industry. OpenAI has been considering significant token price reductions ahead of its IPO, and Google has already cut Gemini pricing. For teams running autonomous agents that consume tokens continuously, the model-layer economics are shifting: a competitive 428B MoE with a million-token context window, available at zero cost through NVIDIA’s catalog, changes the calculus for which provider handles long-running agent sessions.

The open weights, deployment flexibility across multiple inference engines, and NVIDIA’s hardware optimization for Blackwell GPUs give MiniMax M3 a specific niche: high-context, multimodal agent workloads where cost and context length matter more than brand loyalty to OpenAI or Anthropic.

MiniMax M3 Launches on NVIDIA with Free Inference Endpoint, Targeting 24/7 Agent Workloads

Architecture

Deployment and Pricing

The Inference Cost Race

Get our morning briefing in your inbox

Keep Reading

Barret Zoph Exits OpenAI for Second Time After Five Months as Enterprise Head

Yahoo DSP Launches Agent Network With 30+ Partners Across Ad-Tech Workflow

Omdia: Agentic AI Is Forcing AWS, Google, and Microsoft to Redesign Their Cloud Infrastructure