MoE


Open-Weight Coding Models Ranked by Capability Per GB (2026)

The best open-weight coding model you can run on a 24 GB GPU in 2026 is Qwen3.6-27B at Q4. It scores 77.2 on SWE-bench Verified while fitting in about 17 GB, the highest coding skill per gigabyte you can actually load at home. DeepSeek V4 wins the leaderboard, but no consumer card can hold it.

Key Takeaways

Qwen3.6-27B at Q4 gives the most coding skill per GB on a 24 GB card.
DeepSeek V4 tops the leaderboard, but no home GPU can run it.
GLM-4.7-Flash fits 24 GB and still clears 59 percent on SWE-bench.
Qwen and Devstral ship Apache 2.0; the big models lean on MIT.
Pick by the GPU you own, not by the top of the leaderboard.

Why Capability Per GB Beats the Leaderboard

Most 2026 roundups rank coding models by the score of a flagship variant that needs a multi-GPU server. For anyone running models at home, that number is a fantasy. The only figure that counts is how much coding skill fits in the VRAM you actually own.
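The capability-per-GB idea reduces to a single division, which a minimal sketch can make concrete. The score and size below are the Qwen3.6-27B Q4 figures quoted above. The function names and the 2 GB headroom reserved for context and runtime overhead are illustrative assumptions, not part of any published ranking method.

```python
def capability_per_gb(score: float, size_gb: float) -> float:
    """Benchmark points per gigabyte of loaded weights."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return score / size_gb


def fits(size_gb: float, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in VRAM.

    headroom_gb is a hypothetical allowance for KV cache and runtime
    overhead; real usage depends on context length and backend.
    """
    return size_gb + headroom_gb <= vram_gb


# Figures quoted in the article for Qwen3.6-27B at Q4.
qwen_score = 77.2      # SWE-bench Verified
qwen_size_gb = 17.0    # approximate size of the Q4 weights

density = capability_per_gb(qwen_score, qwen_size_gb)
print(f"{density:.2f} points per GB")
print("fits on a 24 GB card:", fits(qwen_size_gb, vram_gb=24.0))
```

A model that cannot pass the `fits` check for your card has no usable capability per GB on it, whatever its leaderboard score.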

Qwen3.6-35B-A3B: Alibaba's Open-Weight Coding MoE

Qwen3.6-35B-A3B is Alibaba Cloud’s Apache 2.0 sparse Mixture-of-Experts model, released April 14, 2026. It carries 35 billion total parameters but activates only about 3 billion per token. On agentic coding suites it beats Gemma 4-31B, and it matches Claude Sonnet 4.5 on most vision tasks. A 20.9 GB Q4 quantization runs on a MacBook Pro M5, which is why this release has dominated AI timelines for the past week.
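The parameter counts and file size quoted above imply two useful ratios, worked out in the rough sketch below. It assumes that 1 GB means 10^9 bytes and that the whole 20.9 GB file is weights. Both are simplifications, so treat the bits-per-weight figure as an estimate.

```python
# Figures quoted in the article.
total_params_b = 35.0    # total parameters, in billions
active_params_b = 3.0    # parameters activated per token, in billions
q4_file_gb = 20.9        # size of the Q4 quantization, in GB

# Share of the model that does work on any single token.
active_fraction = active_params_b / total_params_b

# Average storage per parameter, assuming 1 GB = 1e9 bytes and that
# the file holds nothing but weights (an approximation).
bits_per_weight = q4_file_gb * 8 / total_params_b

print(f"active per token: {active_fraction:.1%}")
print(f"average bits per weight: {bits_per_weight:.2f}")
```

The two numbers pull in different directions. Memory footprint is set by the total parameter count, because every expert has to be resident, while per-token compute follows the much smaller active count.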

Running Gemma 4 26B MoE on 8GB VRAM: Three Strategies That Work

The short answer is no: the Gemma 4 26B MoE model will not fit entirely in 8 GB of VRAM at standard Q4_K_M quantization, because the weights alone require roughly 16-18 GB. With the right approach, though, you can run it on budget hardware and get usable interactive performance. The three practical strategies are aggressive quantization (IQ3_XS brings the weights under 10 GB), GPU-CPU layer offloading (place 15-20 of the 30 layers on the GPU and keep the rest in system RAM), and multi-GPU setups (two cheap 8 GB cards via tensor parallelism). Each trades off quality, speed, and hardware requirements differently.
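For the layer-offloading strategy, a back-of-the-envelope estimate of how many layers fit on the GPU can be sketched as follows. It assumes all 30 layers are the same size and that some VRAM is held back for the KV cache and runtime buffers. The function name and the example budgets are hypothetical, and real layer sizes vary, so the result is only a starting point for tuning.

```python
def layers_that_fit(weights_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate how many layers fit in a VRAM budget.

    Assumes every layer is the same size (weights_gb / n_layers),
    which is a simplification of how real checkpoints are laid out.
    """
    if weights_gb <= 0 or n_layers <= 0:
        raise ValueError("weights_gb and n_layers must be positive")
    per_layer_gb = weights_gb / n_layers
    # Never report more layers than the model has, or fewer than zero.
    return max(0, min(n_layers, int(vram_budget_gb / per_layer_gb)))


# Hypothetical budgets on an 8 GB card, leaving room for cache and buffers.
print(layers_that_fit(weights_gb=10.0, n_layers=30, vram_budget_gb=6.5))
print(layers_that_fit(weights_gb=17.0, n_layers=30, vram_budget_gb=8.0))
```

The count is the kind of value you would pass to a runtime's GPU-layer setting, such as llama.cpp's `--n-gpu-layers` option. Start from the estimate and lower it if the model fails to load.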