NVIDIA Nemotron 3 Super: Best Open-Weight Coding Model

MAR 21 · AI · 4 MIN READ

At NVIDIA's GTC conference on March 11, 2026, the company released what is currently the highest-scoring open-weight model on SWE-Bench Verified: Nemotron 3 Super, a 120-billion-parameter hybrid model that keeps only 12 billion parameters active at inference time. It scored 60.47% on SWE-Bench Verified under the OpenHands evaluation framework — a result that, until recently, was territory only closed frontier models occupied. GPT-OSS, OpenAI's published open model, scores 41.90% on the same benchmark. Nemotron 3 Super beats the previous best published open-weight result by nearly 20 percentage points.

This is a meaningful threshold. SWE-Bench Verified measures a model's ability to resolve real GitHub issues — actual software engineering tasks pulled from production repositories, not synthetic puzzles. Scoring above 60% means the model can close more than three out of five real issues autonomously. Until now, that was a frontier-model capability; it is now available in a model developers can download and self-host.

Architecture: Hybrid Mamba-Transformer MoE

Nemotron 3 Super is built as a hybrid architecture combining Mamba state-space layers with transformer attention blocks and a sparse Mixture-of-Experts routing system. The MoE design makes the 120B/12B split practical: the model has 120 billion total parameters, but any given inference step activates only 12 billion, keeping compute and memory requirements closer to a 12B-class model while preserving the capacity of a much larger one.
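The routing mechanics behind that 120B/12B split can be sketched in a few lines. This is a generic top-k MoE router, not NVIDIA's actual implementation (the router design, expert count, and k are not stated in the release notes); the point is simply that each token pays for only k expert matmuls while the full expert pool holds the model's capacity.

```python
import numpy as np

def topk_moe(x, expert_weights, router_weights, k=2):
    """Route one token through the top-k of n experts.

    Toy illustration of sparse MoE: expert count, k, and router
    details here are assumptions, not Nemotron's published config.
    """
    logits = x @ router_weights                 # (n_experts,) routing scores
    topk = np.argsort(logits)[-k:]              # indices of the k best experts
    gates = np.exp(logits[topk])
    gates /= gates.sum()                        # softmax over the selected experts
    # Only k expert matmuls execute; the other experts stay idle,
    # which is why active parameters << total parameters.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
W_router = rng.standard_normal((d, n_experts))
W_experts = rng.standard_normal((n_experts, d, d))
y = topk_moe(x, W_experts, W_router, k=2)
print(y.shape)  # (8,)
```

With k=2 of 16 experts, only one-eighth of the expert parameters touch any given token — the same principle, at a much larger scale, that lets a 120B-parameter model run with 12B-class inference cost.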

The context window is one million tokens. On the RULER benchmark at 1M context length, Nemotron 3 Super scored 91.75%, compared to GPT-OSS's 22.30%. The gap reflects a structural advantage: Mamba's recurrent architecture scales linearly with sequence length, where standard transformer attention scales quadratically. For agentic coding tasks that need to hold large codebases or long execution histories in context at once, this difference is not marginal. A model that actually functions at 1M tokens is qualitatively different from one that nominally supports it but degrades at scale.
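The scaling argument is easy to make concrete with back-of-the-envelope FLOP counts. The formulas below are the standard asymptotic costs (attention's score matrix vs. a Mamba-style linear scan); the dimensions are illustrative assumptions, not Nemotron's actual hyperparameters.

```python
def attn_flops(seq_len, d_model):
    # Self-attention's QK^T score matrix alone costs O(L^2 * d) multiply-adds.
    return seq_len ** 2 * d_model

def ssm_flops(seq_len, d_model, state_size=16):
    # A Mamba-style selective scan costs O(L * d * N) for state size N.
    return seq_len * d_model * state_size

# Illustrative dimensions; not Nemotron's published config.
L, d = 1_000_000, 4096
ratio = attn_flops(L, d) / ssm_flops(L, d)
print(f"{ratio:,.0f}x")  # 62,500x
```

At 1M tokens the quadratic term dominates by four to five orders of magnitude, which is why a hybrid that replaces most attention layers with state-space layers can remain usable at context lengths where pure transformers become impractical.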

NVIDIA claims 5x higher inference throughput than previous-generation models at equivalent quality. Fewer active parameters per token translate directly into lower latency and higher requests-per-second on the same hardware. The post-training data cutoff is February 2026, making it one of the most current open models in this size class. The model was released with fully open weights on Hugging Face in both BF16 and FP8 variants, along with documented training datasets and recipes. The license covers research and commercial use.

What to Evaluate Before Adopting

SWE-Bench Verified at 60.47% is the headline number, but it represents performance on a standardized evaluation set. Before routing production coding tasks through Nemotron 3 Super, run representative workloads from your actual use case. SWE-Bench correlates with real coding task performance better than most benchmarks do — it uses real issues from real repositories, not constructed problems — but domain-specific tasks can still diverge from aggregate scores.
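A minimal harness for that kind of representative-workload check might look like the sketch below. Everything here is hypothetical scaffolding: `generate` stands in for whatever wraps your deployment (vLLM, TensorRT-LLM, an HTTP endpoint), and the tasks and checkers are yours to supply.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # validates the model's answer for this task

def pass_rate(tasks: List[Task], generate: Callable[[str], str]) -> float:
    """Fraction of in-house tasks the model resolves.

    `generate` is any prompt -> completion callable wrapping your
    serving stack; this harness is illustrative, not a standard API.
    """
    passed = sum(1 for t in tasks if t.check(generate(t.prompt)))
    return passed / len(tasks)

# Stub model for illustration; swap in your real endpoint.
tasks = [
    Task("What is 2+2?", lambda a: "4" in a),
    Task("Capital of France?", lambda a: "Paris" in a),
]
stub = lambda p: "4" if "2+2" in p else "Paris"
print(pass_rate(tasks, stub))  # 1.0
```

The value of a harness like this is that it reports a pass rate on your distribution of tasks, which is the number adoption decisions should actually rest on, rather than the benchmark aggregate.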

The long-context RULER result is worth taking more seriously than typical context window marketing claims. Most models that advertise large windows perform poorly at the extremes — they can technically accept the input but quality degrades. A 91.75% score at 1M tokens suggests the window is functional rather than nominal, but test your actual input lengths and content before depending on it for production workloads.
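One cheap way to run that test is a needle-in-a-haystack probe: bury a fact deep in filler approaching your real input lengths and ask for it back. This is a crude stand-in for RULER, not the benchmark itself, and `generate` is again a hypothetical wrapper around your deployment; use realistic documents rather than synthetic filler for a serious check.

```python
import random

def needle_probe(generate, filler_lines=20_000, seed=0):
    """Hide one fact in a long synthetic context and check retrieval.

    Illustrative probe only: filler volume, needle placement, and the
    `generate` callable are all assumptions to adapt to your setup.
    """
    random.seed(seed)
    needle = "The vault code is 7431."
    filler = [f"Line {i}: routine log entry, nothing notable." for i in range(filler_lines)]
    filler.insert(random.randrange(len(filler)), needle)  # random depth
    prompt = "\n".join(filler) + "\nQuestion: what is the vault code?"
    return "7431" in generate(prompt)

# Stub that "reads" the context; replace with your real model call.
stub = lambda p: "7431" if "7431" in p else "unknown"
print(needle_probe(stub))  # True
```

Sweep the needle depth and the total length up to your production maximum; a window that is functional rather than nominal should hold its retrieval rate roughly flat across depths.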

For teams running self-hosted coding agents, automated code review pipelines, or any application where frontier-competitive performance without per-token API costs matters, Nemotron 3 Super is the model to evaluate. It is the first open-weight option with a credible claim to matching closed frontier models on software engineering tasks — and with open weights, you can run it, fine-tune it, and deploy it on your own infrastructure without a vendor dependency.

KEY POINTS:

- Nemotron 3 Super scores 60.47% on SWE-Bench Verified — highest open-weight result
- 120B total parameters, only 12B active per inference (sparse MoE architecture)
- Hybrid Mamba-Transformer design: linear attention scaling for 1M-token context
- 91.75% on RULER at 1M tokens vs. GPT-OSS's 22.30% — functional long context
- 5x throughput vs. previous-gen models at equivalent quality
- Fully open weights, datasets, and training recipes; BF16 and FP8 on Hugging Face