<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Deepseek - Tag - Botmonster Tech</title><link>https://botmonster.com/tags/deepseek/</link><description>Deepseek - Tag - Botmonster Tech</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Fri, 08 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://botmonster.com/tags/deepseek/" rel="self" type="application/rss+xml"/><item><title>DeepSeek V4 Tech Report: 3 Tricks That Cut Compute 73%</title><link>https://botmonster.com/ai/deepseek-v4-tech-report-3-revolutionary-tricks-chinese-ai/</link><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><author>Botmonster</author><guid>https://botmonster.com/ai/deepseek-v4-tech-report-3-revolutionary-tricks-chinese-ai/</guid><description><![CDATA[<div class="featured-image">
                <img src="/deepseek-v4-tech-report-3-revolutionary-tricks-chinese-ai.png" referrerpolicy="no-referrer">
            </div><p>DeepSeek V4 is a 1.6-trillion-parameter open-weight Mixture-of-Experts model. It accepts 1M tokens of context in one prompt. It needs 27% of V3.2&rsquo;s inference FLOPs and 10% of its KV cache. The <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf" target="_blank" rel="noopener noreferrer ">DeepSeek V4 tech report</a>
 credits three moves: hybrid CSA plus HCA attention, Manifold-Constrained Hyper-Connections, and the Muon optimizer in place of AdamW.</p>
<section class="key-takeaways">
  <h2 id="key-takeaways" class="key-takeaways-title">
    <svg class="key-takeaways-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true">
      <rect x="4" y="3" width="16" height="18" rx="2" ry="2"></rect>
      <path d="M8 8h5"></path>
      <path d="M8 13h8"></path>
      <path d="M8 17h8"></path>
      <path d="M15.5 7l1.2 1.2L19 6"></path>
    </svg>
    <span>Key Takeaways</span>
  </h2>
  <div class="key-takeaways-body">
<ul>
<li>DeepSeek V4 is a free, open-weight AI that goes toe-to-toe with the top closed models from OpenAI, Anthropic, and Google.</li>
<li>It reads 1 million tokens in one prompt, enough for several full books or a long agent run without losing track.</li>
<li>It needs about 27% of the inference compute of its predecessor, V3.2, which makes long-context AI cheaper to operate.</li>
<li>A smaller team built it without access to top NVIDIA chips, proving clever engineering can rival raw GPU spend.</li>
<li>It scored a perfect 120 out of 120 on the 2025 Putnam math competition and beats Google&rsquo;s Gemini 3.1 Pro at 1M-token recall.</li>
</ul>

  </div>
</section>
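<p>The compute figures quoted above reduce to simple arithmetic. The sketch below is a minimal illustration, not anything from the tech report: it takes the two ratios stated in this summary (27% of V3.2&rsquo;s inference FLOPs and 10% of its KV cache) and converts each into a percentage saved and a reduction factor. The helper name <code>savings</code> is hypothetical, and the calculation assumes the ratios apply as flat multipliers, which the report may qualify by context length.</p>

```python
# Ratios quoted in the summary above (V4 relative to V3.2).
FLOPS_RATIO = 0.27      # V4 inference FLOPs as a fraction of V3.2's
KV_CACHE_RATIO = 0.10   # V4 KV cache as a fraction of V3.2's


def savings(ratio: float) -> tuple[float, float]:
    """Return (percent saved, reduction factor) for a cost ratio.

    Hypothetical helper for illustration; `ratio` is new cost / old cost.
    """
    if ratio <= 0 or ratio > 1:
        raise ValueError("ratio must be in (0, 1]")
    return (1 - ratio) * 100, 1 / ratio


flops_saved, flops_factor = savings(FLOPS_RATIO)
kv_saved, kv_factor = savings(KV_CACHE_RATIO)

print(f"FLOPs:    {flops_saved:.0f}% saved, {flops_factor:.1f}x reduction")
print(f"KV cache: {kv_saved:.0f}% saved, {kv_factor:.1f}x reduction")
```

<p>Under those assumptions, the 27% figure is the source of the 73% cut in the headline, and it corresponds to the &ldquo;roughly a quarter&rdquo; phrasing in the takeaways.</p>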

<h2 id="deepseek-v4-at-a-glance">DeepSeek V4 at a Glance</h2>
<p>The <a href="https://api-docs.deepseek.com/news/news260424" target="_blank" rel="noopener noreferrer ">official launch announcement</a>
 on April 24, 2026 framed the release as &ldquo;the era of cost-effective 1M context length.&rdquo; DeepSeek shipped two checkpoints under the MIT license. DeepSeek-V4-Pro has 1.6T total parameters, 49B of them active. DeepSeek-V4-Flash has 284B total and 13B active. Both accept 1M tokens of context, and both ship as open weights on <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" target="_blank" rel="noopener noreferrer ">Hugging Face</a>.
 The routed expert weights use FP4 precision, and most other weights use FP8.</p>]]></description></item><item><title>Run DeepSeek R1 Locally: Reasoning Models on Consumer Hardware</title><link>https://botmonster.com/ai/run-deepseek-r1-locally-reasoning-models-consumer-hardware/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><author>Botmonster</author><guid>https://botmonster.com/ai/run-deepseek-r1-locally-reasoning-models-consumer-hardware/</guid><description><![CDATA[<div class="featured-image">
                <img src="/run-deepseek-r1-locally.png" referrerpolicy="no-referrer">
            </div><p>You can run <a href="https://github.com/deepseek-ai/DeepSeek-R1" target="_blank" rel="noopener noreferrer ">DeepSeek R1</a>&rsquo;s distilled reasoning models on an RTX 5080 with 16 GB of VRAM. Use <a href="https://ollama.com/" target="_blank" rel="noopener noreferrer ">Ollama</a>
 or <a href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener noreferrer ">llama.cpp</a>
 with 4-bit quantization. The 14B distilled variant (Q4_K_M) fits in about 10 GB of VRAM. It shows visible <code>&lt;think&gt;</code> reasoning traces, with quality that rivals cloud models on math, coding, and logic. The full 671B model needs a multi-GPU rig, but the distilled models deliver 80-90% of the quality on far less hardware.</p>]]></description></item></channel></rss>