Deepseek

Cross-section of a translucent crystal brain threaded by red, gold, and teal attention ribbons resting on a doubly-stochastic matrix pedestal beside a guitar-tuning lab figure.

DeepSeek V4 Tech Report: 3 Tricks That Cut Compute 73%

DeepSeek V4 is a 1.6 trillion parameter open-weight Mixture-of-Experts model. It reads 1M tokens at once. It uses 27% of V3.2’s inference FLOPs and 10% of its KV cache. The DeepSeek V4 tech report credits three moves: hybrid CSA plus HCA attention, Manifold-Constrained Hyper-Connections, and the Muon optimizer in place of AdamW.

Key Takeaways

DeepSeek V4 is a free, open-weight AI that goes toe-to-toe with the top closed models from OpenAI, Anthropic, and Google.
It reads 1 million tokens in one prompt, enough for several full books or a long agent run without losing track.
It runs on roughly a quarter of the compute its previous version needed, making long-context AI affordable to operate.
A smaller team built it without access to top NVIDIA chips, proving clever engineering can rival raw GPU spend.
It scored a perfect 120 out of 120 on the 2025 Putnam math competition and beats Google’s Gemini 3.1 Pro at 1M-token recall.

DeepSeek V4 at a Glance

The official launch announcement on April 24, 2026 framed the release as “the era of cost-effective 1M context length.” It shipped two checkpoints under the MIT license. DeepSeek-V4-Pro runs at 1.6T total and 49B active parameters. DeepSeek-V4-Flash runs at 284B total and 13B active. Both models read 1M tokens at once. Both ship as open weights on Hugging Face . The routed expert weights use FP4 math, and most other weights use FP8.

Run DeepSeek R1 Locally: Reasoning Models on Consumer Hardware

You can run DeepSeek R1 ’s distilled reasoning models on an RTX 5080 with 16 GB of VRAM. Use Ollama or llama.cpp with 4-bit quantization. The 14B distilled variant (Q4_K_M) fits in about 10 GB of VRAM. It shows visible <think> reasoning traces that rival cloud quality on math, coding, and logic. The full 671B model needs multi-GPU rigs, but the distilled models give you 80-90% of the quality for far less hardware.