DeepSeek-V4-Pro: 1.6 Trillion Parameters, MIT License, 1 Million Token Context
The benchmark gap between open-source and closed-source frontier models has been closing steadily. DeepSeek-V4-Pro is the latest — and most significant — step in that direction.
Released by DeepSeek AI in 2026 under an MIT license, DeepSeek-V4-Pro is a Mixture-of-Experts language model with 1.6 trillion total parameters. It achieves 90.1% on GPQA Diamond, 93.5% on LiveCodeBench, and 80.6% on SWE-Verified. It supports a 1 million token context window with architectural changes that reduce inference cost at that length to 27% of the compute required by its predecessor. And it's fully open-source, commercially usable, no restrictions.
This post covers what DeepSeek-V4-Pro actually is, the architectural choices that make it work, and what the benchmarks mean in practice.
The Scale: 1.6T Parameters, 49B Activated
DeepSeek-V4-Pro uses a Mixture-of-Experts architecture, which means the 1.6 trillion parameter count is total capacity: only 49 billion parameters are active for any given token. This is how the model achieves frontier-class performance while remaining practically deployable. You're not running 1.6T parameters for every token; the compute cost per token is closer to that of a 49B dense model.
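DeepSeek hasn't published V4's routing code alongside these numbers, so here is a minimal, generic sketch of top-k MoE routing in PyTorch. The class name, dimensions, and expert counts are illustrative assumptions, not the model's actual configuration; the point is only that per-token compute scales with the experts that fire, not with total capacity.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE layer. Illustrative only; not DeepSeek's code."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model). The router scores every expert, but only the
        # top-k experts per token actually run, so per-token compute tracks
        # activated parameters rather than total parameters.
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():
                mask = idx[:, k] == e                    # tokens routed to expert e
                out[mask] = out[mask] + weights[mask, k:k+1] * self.experts[int(e)](x[mask])
        return out
```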
Precision is FP4 + FP8 mixed: MoE expert weights use FP4, most other parameters use FP8. This enables the large parameter scale while keeping memory footprint manageable.
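A back-of-envelope calculation shows why the mixed scheme matters. FP4 stores a parameter in half a byte and FP8 in one byte; the 95% expert-weight share below is an assumed figure for illustration, not a number from the model card:
```python
# Back-of-envelope weight memory. The 95% FP4 share is an assumption for
# illustration; the model card only says expert weights are FP4.
TOTAL_PARAMS = 1.6e12
EXPERT_FRAC = 0.95                                   # assumed FP4 share
fp4_bytes = TOTAL_PARAMS * EXPERT_FRAC * 0.5         # FP4 = 4 bits/param
fp8_bytes = TOTAL_PARAMS * (1 - EXPERT_FRAC) * 1.0   # FP8 = 8 bits/param
print(f"mixed FP4/FP8: ~{(fp4_bytes + fp8_bytes) / 1e12:.2f} TB of weights")
print(f"all FP16:      ~{TOTAL_PARAMS * 2 / 1e12:.2f} TB of weights")
```
Even at under a terabyte of weights, this still implies a multi-GPU deployment, which is consistent with the hardware caveats later in this post.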
The pre-training dataset exceeds 32 trillion tokens of diverse, high-quality text.
Architectural Innovations
Hybrid Attention for Long Context
The 1 million token context window is the headline number, but the architectural work behind it is what makes it credible. Standard attention implementations don't scale to 1M tokens — the compute and memory costs are prohibitive.
DeepSeek-V4-Pro uses a hybrid attention architecture combining two mechanisms:
- Compressed Sparse Attention (CSA): Reduces the key-value footprint by compressing the attended context
- Heavily Compressed Attention (HCA): Further compresses attention for long-range dependencies
The result: at 1 million tokens, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. You get a model that can actually process a million tokens without requiring unreasonable hardware.
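The internals of CSA and HCA aren't detailed in the release materials. One simple way to picture KV-cache compression of this general flavor is block-pooling the cached keys and values; the sketch below is an assumed illustration of the idea, not DeepSeek's mechanism, with the 10x block size chosen to mirror the quoted cache reduction:
```python
import torch

def compress_kv(k, v, block=10):
    # Mean-pool every `block` cached positions. A 10x reduction in stored
    # positions mirrors the "10% KV cache" figure, though the real CSA/HCA
    # mechanisms are certainly more sophisticated than pooling.
    t = k.shape[-2] // block * block          # drop the ragged tail for simplicity
    k = k[..., :t, :].reshape(*k.shape[:-2], -1, block, k.shape[-1]).mean(dim=-2)
    v = v[..., :t, :].reshape(*v.shape[:-2], -1, block, v.shape[-1]).mean(dim=-2)
    return k, v

k = torch.randn(1, 8, 10_000, 128)            # (batch, heads, positions, head_dim)
v = torch.randn(1, 8, 10_000, 128)
ck, cv = compress_kv(k, v)
print(ck.shape)                               # torch.Size([1, 8, 1000, 128])
```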
Benchmark evidence: on MRCR at 1M tokens, the model achieves 83.5% recall. On CorpusQA at 1M tokens, 62.0% accuracy. These aren't theoretical context windows — they hold up under evaluation.
Manifold-Constrained Hyper-Connections
Training very deep MoE models introduces gradient propagation challenges. DeepSeek-V4-Pro addresses this with Manifold-Constrained Hyper-Connections (mHC), which strengthens conventional residual connections while preserving model expressivity. The effect is more stable signal propagation across layers during both training and inference.
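The mHC construction itself isn't spelled out here. In the published hyper-connections literature, the basic move is to widen the residual stream into several parallel streams mixed by learned weights; the sketch below shows only that general pattern, with the manifold constraint omitted and all names and shapes chosen for illustration:
```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Hyper-connection-style residual sketch; not DeepSeek's actual mHC."""
    def __init__(self, layer, n_streams=4):
        super().__init__()
        self.layer = layer
        # Learned stream mixing; the plain residual h + f(h) is the special
        # case n_streams=1, alpha=1, beta=1.
        self.alpha = nn.Parameter(torch.eye(n_streams))
        self.beta = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))

    def forward(self, h):
        # h: (n_streams, batch, seq, d_model) widened residual stream
        mixed = torch.einsum("ij,jbsd->ibsd", self.alpha, h)   # mix streams
        out = self.layer(mixed.sum(dim=0))                     # layer sees one stream
        return mixed + self.beta.view(-1, 1, 1, 1) * out       # redistribute output
```
Wrapping each transformer block this way gives the signal more learned paths across depth than a single residual stream, which is the stability property the mHC description points at.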
Muon Optimizer
The model was trained with the Muon optimizer, which converges faster and more stably than standard AdamW. That efficiency is part of why pre-training on 32+ trillion tokens was tractable.
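Muon is publicly documented: it applies momentum to the gradient, then approximately orthogonalizes each 2D weight update with a Newton-Schulz iteration before applying it. A minimal sketch of that core step follows, using coefficients from the public reference implementation; whether DeepSeek's training run used exactly this variant is not stated:
```python
import torch

def newton_schulz(g, steps=5):
    # Odd-polynomial iteration that approximately orthogonalizes a matrix;
    # coefficients are from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    # Momentum first, then orthogonalize the 2D update before applying it.
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz(momentum), alpha=-lr)
```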
Three Reasoning Modes
One of the more interesting product decisions in DeepSeek-V4-Pro is the explicit three-tier reasoning system:
| Mode | What it does | When to use |
|---|---|---|
| Non-think | Fast, direct responses | Routine tasks, simple queries |
| Think High | Logical step-by-step analysis | Complex problem-solving, planning |
| Think Max | Maximum reasoning depth | Hardest problems, boundary testing |
Think Max requires a minimum 384K token context to work effectively — the model needs headroom to lay out its full reasoning chain. The benchmarks cited in the model card (90.1% GPQA Diamond, 95.2% HMMT 2026 Feb) are Think Max results.
This explicit mode system means you can tune cost vs. capability depending on the task. Non-think mode for high-volume, straightforward work; Think Max when you're hitting the model's reasoning ceiling.
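The release notes don't specify how a caller selects a mode. If the model is served behind an OpenAI-compatible endpoint, selection might look like the sketch below, where the endpoint URL, model name, and `reasoning_mode` field are all hypothetical placeholders:
```python
from openai import OpenAI

# Endpoint, model name, and the `reasoning_mode` field are hypothetical;
# check your serving stack's docs for the actual mode-selection mechanism.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning_mode": "think_max"},  # non_think | think_high | think_max
)
print(resp.choices[0].message.content)
```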
Two-Stage Post-Training
The post-training pipeline follows a two-stage design:
- Stage 1 — Domain-specific expert cultivation: Each domain (math, code, science, etc.) is trained independently using supervised fine-tuning (SFT) and reinforcement learning with Group Relative Policy Optimization (GRPO). Experts are developed with domain-specific reward signals.
- Stage 2 — Unified consolidation: The domain experts are merged into a single model via on-policy distillation. The model learns to route queries to the appropriate expert behavior without requiring explicit domain labels at inference time.
This pipeline is partly why the coding benchmarks (93.5% LiveCodeBench, 3206 Codeforces rating) and math benchmarks (95.2% HMMT 2026 Feb, 89.8% IMOAnswerBench) are both strong in the same model.
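GRPO itself is publicly described: for each prompt, a group of responses is sampled and each response's reward is normalized against its own group's mean and standard deviation, removing the need for a learned value function. A minimal sketch of that advantage computation:
```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each sampled response is
    scored against the mean/std of its own group, so no critic is needed.

    rewards: (n_prompts, group_size) tensor of scalar reward scores.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
r = torch.tensor([[0.1, 0.9, 0.4, 0.6], [0.0, 0.0, 1.0, 0.5]])
print(grpo_advantages(r))
```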
Benchmark Results
All results below are Think Max mode.
Reasoning and knowledge:
- GPQA Diamond: 90.1%
- MMLU-Pro: 87.5%
- SimpleQA-Verified: 57.9%
- Chinese-SimpleQA: 84.4%
Coding:
- LiveCodeBench: 93.5%
- Codeforces rating: 3206
Math:
- HMMT 2026 Feb: 95.2%
- IMOAnswerBench: 89.8%
Long context (1M tokens):
- MRCR 1M: 83.5% recall
- CorpusQA 1M: 62.0% accuracy
Agentic tasks:
- SWE-Verified: 80.6%
- BrowseComp: 83.4%
- Toolathlon: 51.8%
The SWE-Verified score is the one worth pausing on. 80.6% on autonomous software engineering tasks — finding and fixing real GitHub issues — is competitive with the frontier closed-source models. It's what makes DeepSeek-V4-Pro meaningful for actual AI agent workloads, not just benchmark comparisons.
The Model Family
DeepSeek-V4 ships in two variants:
| Model | Total Params | Activated Params | Precision |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | FP4 + FP8 |
| DeepSeek-V4-Pro | 1.6T | 49B | FP4 + FP8 |
Flash is the fast, lightweight option. Pro is the performance ceiling. Both use the same FP4 + FP8 mixed precision scheme and share the architectural innovations described above.
Sampling Parameters
The model card recommends these defaults for local deployment:
```
Temperature: 1.0
Top P: 1.0
```
The chat template is custom rather than a standard Jinja template. DeepSeek ships an `encoding_dsv4` module in the model repo that handles message encoding and completion parsing.
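Putting the defaults together, a deployment sketch might look like the following. The vLLM-style setup and repository id are assumptions for illustration; per the note above, the custom `encoding_dsv4` template would still need to be wired into whatever stack you use:
```python
# Hypothetical vLLM-style deployment; the model path is illustrative and
# the custom encoding_dsv4 chat template would still need to be wired in.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V4-Pro")          # hypothetical repo id
params = SamplingParams(temperature=1.0, top_p=1.0)     # model-card defaults
outputs = llm.generate(["Explain KV-cache compression in one paragraph."], params)
print(outputs[0].outputs[0].text)
```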
What This Means for Open-Source AI
Three things stand out about DeepSeek-V4-Pro relative to the broader landscape:
The parameter efficiency story. 1.6T total parameters activating 49B per pass is a bet on MoE scaling that's paying off. The performance-per-activated-parameter is exceptional — this isn't a model that wins benchmarks by brute-forcing compute.
The MIT license. Full commercial use, no restrictions, no usage agreements. For teams building AI products, the licensing difference between MIT and non-commercial open-weight licenses is significant. DeepSeek-V4-Pro is usable in production without legal review.
The 1M token context that actually works. Many models claim million-token context windows; few maintain meaningful accuracy at that scale. The 83.5% MRCR result at 1M tokens is evidence of real capability, not a theoretical limit. The hybrid attention architecture is what makes the cost at that scale viable.
Limitations
A few honest caveats:
- Hardware requirements are substantial. Memory is set by the 1.6T total parameters, which must all be resident even though only 49B activate per token. Even with FP4 quantization this isn't a laptop model; running DeepSeek-V4-Pro in production requires A100 or H100-class hardware.
- Think Max needs 384K+ context. You cannot use maximum reasoning depth without allocating a substantial context window. For shorter conversations, Think High is the practical ceiling.
- Toolathlon at 51.8% is the weak spot. Tool-use performance is noticeably lower than the other agentic benchmarks, and complex multi-tool orchestration workflows will run into this.
- Custom chat template. The `encoding_dsv4` module adds an integration dependency; standard LLM serving stacks need adjustment before deployment.
---
DeepSeek-V4-Pro is the most capable open-source language model released to date. The combination of MIT licensing, genuine long-context capability, and benchmark performance competitive with closed frontier models makes it practically significant — not just for researchers, but for anyone building AI systems who doesn't want to be permanently dependent on API providers.