
Llama 4.0 Inference on Consumer GPUs: GGUF, 10 tok/s Real-Time

On an RTX 5090 with 32 GB of VRAM, Llama 4.0 70B runs at roughly 28 tokens per second using 4-bit GGUF quantization through llama.cpp or Ollama. Mid-range cards like the RTX 5070 Ti with 16 GB manage around 11 tokens per second on the same model. This guide covers the install, the VRAM math, and the benchmark numbers.

What Is Llama 4.0? Architecture and What Changed

Llama 4.0 marks a real architectural shift, and that shift directly affects VRAM use and speed. The biggest change is a move to a Mixture-of-Experts (MoE) layout, with some variants using a hybrid dense-MoE design. In a dense model like Llama 3, every parameter fires for every token. In a MoE model, the network splits into many “expert” sub-networks, and a routing layer picks only a few of them per token. So a 70 billion parameter Llama 4.0 model might fire just 13 billion of them on any forward pass. The upshot: a 70B Llama 4.0 model often runs at speeds closer to a 13B dense model, while keeping the reasoning depth of a much larger network.
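
To make that concrete, here is a back-of-the-envelope sketch in Python. The parameter split and bit width are illustrative assumptions, not official Llama 4.0 figures; the point is the ratio, not the exact numbers:

# Illustrative MoE arithmetic. Parameter split and bit width are assumptions,
# not official Llama 4.0 figures; the point is the ratio.
TOTAL_PARAMS = 70e9      # all experts, all stored in memory
ACTIVE_PARAMS = 13e9     # parameters the router actually fires per token
BYTES_PER_WEIGHT = 0.5   # ~4-bit quantization

print(f"weights held in memory:  ~{TOTAL_PARAMS * BYTES_PER_WEIGHT / 1e9:.0f} GB")
print(f"bytes read per token:    ~{ACTIVE_PARAMS * BYTES_PER_WEIGHT / 1e9:.1f} GB")
print(f"dense speed equivalent:  ~{ACTIVE_PARAMS / 1e9:.0f}B model")

The VRAM bill is still driven by the total parameter count, which is why quantization matters just as much for MoE models.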

The second big change is native Inference-Time Compute (ITC). Llama 4.0 is trained to “think before it answers,” writing a chain-of-thought trace before the final reply. OpenAI’s o-series made the pattern famous, and Llama now bakes it into the base model rather than coaxing it out through prompt tricks. The trade-off is real: ITC raises the token count per response, so the wait for the first readable line is longer. On tasks that need multi-step reasoning (coding, math, logic), the quality lift is worth it. For simple chat, a short system prompt can tell the model to skip the trace and reply directly.

Llama 4.0 also ships with a long native context window, and some variants support 256,000 tokens or more. For local users, that’s both exciting and demanding. With 256k of context, you can feed the model a whole codebase, a book, or months of chat history in one prompt. The catch is memory cost. KV cache (the buffer that stores past context) grows in step with context length, so at 256k the cache alone can eat 8 to 16 GB of VRAM. Most local setups stay much smaller, around 8k to 32k, and treat the long window as an upper bound, not a default.
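
If you want to budget for the KV cache before downloading anything, the estimate is simple arithmetic. The sketch below uses placeholder architecture numbers (layers, KV heads, head dimension); substitute the values llama.cpp prints when it loads your model. It also assumes a full-attention FP16 cache, so long-context variants that use windowed or compressed attention will come in lower:

# Rough KV-cache estimate. Layer/head/dimension values are placeholders --
# use the numbers llama.cpp prints at model load time.
def kv_cache_gb(ctx_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # keys + values, one entry per layer per token (full attention, FP16 cache)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

for ctx in (8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")

Quantizing the cache (llama.cpp's --cache-type-k and --cache-type-v flags) roughly halves or quarters these figures.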

[Figure: Llama 4 model family overview. The family spans Scout, Maverick, and Behemoth with their parameter counts and context lengths; consumer GPUs target the smaller variants with quantization.]

Some Llama 4.0 variants are also multimodal: they accept image inputs along with text. On a consumer rig, that means loading a vision encoder beside the language model, which costs another 2 to 4 GB of VRAM and requires llama.cpp’s multimodal build or an Ollama manifest that bundles the vision projector, the same approach LLaVA-style models use.

Hardware Baseline for 2026

Before you pull a 40 GB model file, get clear on what your hardware can and can’t do. The main limit for local LLM inference is GPU VRAM: model weights need to sit somewhere the GPU can read at high bandwidth. When they don’t fit in VRAM, they spill to system RAM, which is 10 to 15 times slower depending on your memory bus. The Blackwell RTX 50-series shifts the math a bit, but the same hierarchy still holds.
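
That slowdown falls out of bandwidth arithmetic: token generation mostly streams weights from memory, so tokens per second is roughly bandwidth divided by bytes read per token. The bandwidth figures below are ballpark assumptions and the model ignores compute and transfer overhead, but it shows why even a modest spill hurts:

# Why CPU offload hurts: generation is roughly memory-bandwidth bound.
# Bandwidth figures are ballpark assumptions for illustration only.
MODEL_BYTES = 40e9   # bytes streamed per token (dense worst case; MoE reads less)
GPU_BW = 1.5e12      # high-end GDDR7, ~1.5 TB/s
CPU_BW = 0.1e12      # dual-channel DDR5, ~100 GB/s

def tok_per_sec(gpu_fraction):
    # time per token = streaming the GPU-resident bytes + the CPU-resident bytes
    t = MODEL_BYTES * gpu_fraction / GPU_BW + MODEL_BYTES * (1 - gpu_fraction) / CPU_BW
    return 1 / t

for frac in (1.0, 0.8, 0.5):
    print(f"{frac:.0%} of weights on GPU -> ~{tok_per_sec(frac):.1f} tok/s")

Real numbers land lower because compute, the KV cache, and PCIe transfers add overhead, but the cliff once weights spill is exactly what the table below shows.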

Here is how the major GPU tiers break down against common Llama 4.0 model sizes at Q4_K_M quantization (the practical sweet spot for most users):

| GPU | VRAM | 7B Q4_K_M (~4.1 GB) | 13B Q4_K_M (~7.9 GB) | 34B Q4_K_M (~20 GB) | 70B Q4_K_M (~40 GB) |
|---|---|---|---|---|---|
| RTX 5060 | 12 GB | Fully on-GPU | Fully on-GPU | Partial offload | Heavy CPU offload |
| RTX 5070 Ti | 16 GB | Fully on-GPU | Fully on-GPU | Partial offload | Heavy CPU offload |
| RTX 5080 | 24 GB | Fully on-GPU | Fully on-GPU | Fully on-GPU | Partial offload |
| RTX 5090 | 32 GB | Fully on-GPU | Fully on-GPU | Fully on-GPU | Partial offload |
| 2x RTX 5090 | 64 GB | Fully on-GPU | Fully on-GPU | Fully on-GPU | Fully on-GPU |

One key feature of the Blackwell RTX 50-series is a better memory subsystem that changes how the GPU handles layer offload. On older cards, offloading layers to CPU RAM tanked throughput because of PCIe bandwidth limits, but Blackwell’s wider memory bus and updated driver-side memory handling cut that penalty. In practice, users report 30 to 40% faster token generation on 50-series cards versus 40-series, at the same VRAM use, in cases where some layers spill to CPU. So an RTX 5070 Ti running a 70B model with partial CPU offload feels far more usable than an RTX 4070 Ti in the same setup.

System RAM is the other big variable. When VRAM fills and layers spill to CPU RAM, those layers need a home. The floor for a 4-bit 70B with CPU offload is 32 GB of DDR5; below that, you risk running out of memory mid-inference. 64 GB is the comfortable zone, giving you room for the model, the OS, other apps, and a KV cache that grows over a long chat. DDR5 bandwidth is important too: system RAM is the live bottleneck when many layers offload, and DDR5’s edge over DDR4 turns straight into tokens per second.

Storage gets skipped a lot. The GGUF Q4_K_M file for a 70B weighs about 40 GB, and loading it from a SATA SSD takes 40 to 60 seconds. From a Gen4 or Gen5 NVMe, cold start drops to 10 to 20 seconds, and that gap shows up fast if you swap models often or test new prompts. Gen5 NVMe also helps during inference: when VRAM is full and layers page in from the file, read latency directly hits your tokens per second.

Quantization Deep Dive: GGUF vs. EXL2 vs. BitNet

Quantization is what makes the whole project work. A full-precision (FP16 or BF16) 70B Llama 4.0 model needs about 140 GB of VRAM, which is the realm of multi-GPU server boxes. Quantization stores each weight in fewer bits: 4-bit cuts the 140 GB requirement down to about 40 GB, and 2-bit can push it to 20 GB or less. The cost is lower model quality, tracked by perplexity (lower is better) and by tasks like MMLU or HumanEval. The art is picking how much quality to trade for how much size reduction.
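
The size figures follow from the obvious arithmetic: bytes ≈ parameter count × bits per weight ÷ 8. A minimal sketch, using approximate effective bit widths; real GGUF files mix precisions and add metadata, so treat these as estimates:

# Weight-file size as a function of quantization level.
# Effective bits per weight are approximations; real files vary by a few percent.
PARAMS = 70e9  # total parameter count

QUANTS = [("FP16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.6), ("IQ2_M", 2.5), ("BitNet", 1.58)]

for name, bpw in QUANTS:
    print(f"{name:<8} ~{PARAMS * bpw / 8 / 1e9:5.0f} GB")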

GGUF is the native format for llama.cpp and the most widely supported quant format for consumer inference in 2026. GGUF covers a range of quant types. Q4_K_M is the sweet spot for most users: it runs 4-bit with a mixed strategy that shields the most sensitive layers, and the quality loss versus full precision is too small to spot in most real-world tasks. Q5_K_M gives a small quality bump for about 25% larger file size, which is worth it if you have VRAM to spare and run reasoning-heavy work. The IQ family (importance-weighted), like IQ3_M and IQ2_M, pushes to 3-bit and 2-bit using smarter bit allocation that holds quality better than naive fixed-bit at the same extreme compression. Use IQ3_M only when you truly can’t fit Q4 and need the smallest footprint that still works.

EXL2 (from the ExLlamaV2 engine) takes a different path: it uses non-uniform quantization, where different layers and attention heads get different bit widths based on how much each one suffers from precision loss. The result is better quality-per-bit than GGUF at the same average level. An EXL2 4.0-bpw model usually beats a GGUF Q4_K_M on perplexity, often by a clear margin on hard reasoning tasks. The catch is tooling: EXL2 needs ExLlamaV2, which is NVIDIA-only (no CPU fallback, no AMD support), and the ecosystem is smaller, with fewer pre-quantized models on Hugging Face than GGUF. If you’re on a 50-series card and want quality first, EXL2 is worth a look.

BitNet (1.58-bit) is the frontier. A 70B at 1.58 bits per weight needs less than 15 GB for weights alone, which in theory fits on one consumer GPU. In practice, BitNet only works with models trained for BitNet-style weight layouts; you can’t just quant an existing Llama 4.0 checkpoint to 1.58-bit and hope for the best. Community-trained BitNet-style Llama 4.0 variants do exist, and they handle factual recall and summarization well, but they slip on multi-step math and code generation. Think of BitNet as a niche option for tight memory budgets, not a default pick.

Approximate benchmark comparisons across quantization levels on a representative reasoning benchmark (normalized to FP16 = 100):

| Quantization | Relative Quality | 70B Model Size | Best Use Case |
|---|---|---|---|
| FP16 | 100% | ~140 GB | Multi-GPU servers |
| Q8_0 | ~99% | ~74 GB | Professional workstations |
| Q5_K_M | ~97% | ~50 GB | RTX 5090 with headroom |
| Q4_K_M | ~95% | ~40 GB | Primary consumer recommendation |
| IQ3_M | ~90% | ~30 GB | Budget GPU configurations |
| IQ2_M | ~83% | ~22 GB | Extreme compression, acceptable for simple tasks |
| BitNet 1.58b | ~85%* | ~15 GB | BitNet-trained models only |

*BitNet quality varies significantly by task type; coding and math tasks may score lower.

Setting Up llama.cpp and Ollama

Two main paths run Llama 4.0 locally: llama.cpp for full control and top speed, or Ollama for a friendlier, more wrapped-up feel. Both are great. llama.cpp lets you tune every inference flag, and it’s the right pick if you plan to benchmark, tune, or wire the model into custom pipelines. Ollama wraps llama.cpp in a clean CLI and a REST API, handling model downloads and serving on its own. Pick it if you want to be up and running in under ten minutes without reading docs.

The end-to-end procedure boils down to six steps. The detailed walkthroughs follow below.

Run Llama 4 70B on RTX 5090 with llama.cpp

1. Install the NVIDIA driver and CUDA Toolkit. Install driver 570 or newer and CUDA Toolkit 12.4 or newer (12.8 or newer to build for Blackwell). Verify with nvidia-smi and nvcc --version before continuing.
2. Build llama.cpp with CUDA support. Clone llama.cpp and build it with cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 so the binary targets the Blackwell architecture used by the RTX 50-series.
3. Download the Llama 4 GGUF weights. Use huggingface-cli download to fetch Llama-4-70B-Instruct-Q4_K_M.gguf from a reputable Hugging Face repository into a local ./models directory.
4. Run inference with llama-cli. Launch ./build/bin/llama-cli with the model path, --n-gpu-layers tuned to your VRAM, --ctx-size 8192, and --flash-attn enabled, then issue a test prompt.
5. Benchmark token throughput. Run llama-bench with -p 512 -n 128 -r 3 to measure prompt processing and token generation speed. Target above 10 tokens per second for a real-time feel.
6. Or install Ollama as the faster path. Run curl -fsSL https://ollama.com/install.sh | sh, then ollama pull llama4:70b-instruct-q4_K_M and ollama run to chat through the bundled OpenAI-compatible REST API.

Prerequisites

Before installing either tool, ensure your environment is in order:

  • NVIDIA driver: 570.x or newer (check with nvidia-smi)
  • CUDA Toolkit: 12.4 or newer, 12.8 or newer to target the RTX 50-series (check with nvcc --version)
  • CMake: 3.20+ for building llama.cpp from source
  • Disk space: At least 50 GB free for a 70B Q4_K_M model plus temporary files
  • OS: Ubuntu 22.04/24.04 or Windows 11 with WSL2 (native Linux preferred - see the optimization section)

Installing llama.cpp with CUDA Support

Clone the repository and build with CUDA acceleration enabled:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120  # 120 = Blackwell (RTX 50-series)
cmake --build build --config Release -j $(nproc)

# Verify the build succeeded and GPU is detected
./build/bin/llama-cli --version

On an RTX 40-series card, use -DCMAKE_CUDA_ARCHITECTURES=89 instead; for RTX 30-series, use 86. Targeting the 50-series (compute capability 12.0) requires CUDA 12.8 or newer. The right arch flag turns on hardware-specific tuning that helps speed.

Once built, grab the Llama 4.0 GGUF weights from Hugging Face. Go to the model page (search “Llama-4-70B-Instruct-GGUF” or a community quant like “bartowski/Llama-4-70B-Instruct-GGUF”) and pick the right quant file. The huggingface-cli makes this quick:

pip install huggingface_hub
huggingface-cli download bartowski/Llama-4-70B-Instruct-GGUF \
  --include "Llama-4-70B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

With the model downloaded, run your first inference:

./build/bin/llama-cli \
  -m ./models/Llama-4-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --threads 8 \
  --flash-attn \
  -p "Explain the concept of quantization in language models in three paragraphs."

The key flags here deserve a quick tour. --n-gpu-layers sets how many transformer layers run on the GPU. Set it to the max your VRAM can hold. Start high and back off if you see CUDA out-of-memory errors. For a 70B Q4_K_M on a 16 GB card, you’ll usually land around 35 to 45 layers, depending on your context size. --ctx-size sets the max context for this session. Smaller values use less VRAM for the KV cache, so start at 4096 or 8192 unless you need more. --threads controls CPU threads for any layer the GPU doesn’t hold. Set this to your CPU’s physical core count, not the hyperthreaded logical count. --flash-attn turns on Flash Attention. It cuts VRAM use a lot for long contexts.

Installing Ollama

Ollama is a single binary that handles model management, serving, and inference:

curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 4.0 (Ollama handles GGUF download automatically)
ollama pull llama4:70b-instruct-q4_K_M

# Start an interactive chat session
ollama run llama4:70b-instruct-q4_K_M

Ollama also exposes an OpenAI-compatible REST API on http://localhost:11434. Any app that speaks the OpenAI API can talk to your local model with zero changes. That includes Open WebUI, Continue.dev, and custom Python scripts using the openai library:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:70b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'
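
The same endpoint works from Python with the stock openai package, which is handy for scripting. A minimal sketch; the model tag is whatever you pulled with ollama pull:

# Minimal Python client for Ollama's OpenAI-compatible endpoint.
# pip install openai; the api_key is required by the client but ignored by Ollama.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama4:70b-instruct-q4_K_M",  # the tag you pulled with `ollama pull`
    messages=[{"role": "user", "content": "Give me three good uses for a local LLM."}],
)
print(response.choices[0].message.content)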

To configure GPU layer offloading in Ollama, set environment variables before starting the service:

OLLAMA_NUM_GPU=1 OLLAMA_GPU_LAYERS=40 ollama serve

Benchmarking Your Setup

After getting the model running, measure your baseline performance so you can verify that optimizations actually help:

# llama.cpp built-in benchmark
./build/bin/llama-bench \
  -m ./models/Llama-4-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 40 \
  --flash-attn \
  -p 512 -n 128 -r 3

# This outputs prompt processing speed (pp) and token generation speed (tg)
# Target: pp > 500 tokens/sec, tg > 10 tokens/sec for a good experience

Realistic tokens-per-second benchmarks for the 70B Q4_K_M model across hardware tiers:

| Hardware | GPU Layers | System RAM | Prompt Processing | Token Generation |
|---|---|---|---|---|
| RTX 5090 (32 GB) | 80 (all) | 64 GB DDR5 | ~850 tok/s | ~28 tok/s |
| RTX 5080 (24 GB) | 65 | 64 GB DDR5 | ~620 tok/s | ~18 tok/s |
| RTX 5070 Ti (16 GB) | 40 | 64 GB DDR5 | ~320 tok/s | ~11 tok/s |
| RTX 5060 (12 GB) | 25 | 32 GB DDR5 | ~180 tok/s | ~6 tok/s |

Above 10 tokens per second, chat feels real-time. Below 5 tokens per second, it feels slow for live use. The 13B at Q4_K_M runs 3 to 4 times faster than the 70B on the same GPU. If your main use is chat, not deep reasoning, the 13B is worth a try.
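
The 10-tokens-per-second threshold maps neatly onto reading speed. Assuming roughly 0.75 English words per token, which is a rule of thumb rather than a measured constant:

# Why ~10 tok/s feels real-time: it outruns typical reading speed.
# The words-per-token ratio is a rough rule of thumb.
WORDS_PER_TOKEN = 0.75

for tps in (5, 10, 28):
    wpm = tps * WORDS_PER_TOKEN * 60
    print(f"{tps:>2} tok/s ~ {wpm:>4.0f} words/min (typical reading speed is ~200-300 wpm)")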

The Model Context Protocol (MCP) Integration

Running Llama 4.0 locally is strong on its own. Hook it to your file system, databases, and tools, and it becomes a real local agent. The Model Context Protocol (MCP) is the standard way to do that: Anthropic shipped it first, and a growing set of clients and servers now back it. MCP draws a clean line between an LLM host (the inference server) and outside tools or data. Think of it as a USB standard for AI tools: any MCP-friendly client can talk to any MCP-friendly server, no matter which model or app you run.

The gap between a raw model endpoint and an MCP-enabled agent is huge in practice. Without MCP, you can ask Llama 4.0 about a file only if you paste it into the prompt. With an MCP filesystem server, the model lists directories, reads files, and writes output directly, much as a developer using Claude Code in a terminal does. The model takes an active role in your file system rather than acting as a static text engine.

Setting up a basic MCP filesystem server is simple. The official MCP SDK ships ready-made servers:

# Install the MCP filesystem server
npm install -g @modelcontextprotocol/server-filesystem

# Start it, pointing it at a directory you want the model to access
mcp-server-filesystem /home/youruser/documents /home/youruser/projects

Hooking the MCP server to Ollama needs a compatible frontend. Open WebUI (a popular Ollama GUI) supports MCP tool use through its tool panel, or you can write a small Python bridge with the mcp SDK client that forwards tool calls from the llama.cpp server endpoint to the MCP server. The llama.cpp server with function-calling support exposes this through its /v1/chat/completions endpoint with a tools parameter:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Tools are registered as OpenAI-compatible function definitions
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a local file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute file path"}
                },
                "required": ["path"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Read my TODO.md file and summarize it"}],
    tools=tools
)
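
That call returns a tool_calls entry rather than a finished answer, and the bridge has to execute the call and hand the result back for a second pass. A minimal dispatch sketch, continuing the example above; a real bridge would forward read_file to the MCP server instead of touching the disk directly:

import json

# Minimal dispatch for the read_file tool defined above. A real bridge would
# forward the call to the MCP server rather than reading the disk itself.
def read_file(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    result = read_file(**json.loads(call.function.arguments))

    # Hand the tool result back so the model can write its final answer
    followup = client.chat.completions.create(
        model="local",
        messages=[
            {"role": "user", "content": "Read my TODO.md file and summarize it"},
            message,
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)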

Security is a real concern when you hand a language model access to your file system. At minimum, sandbox the MCP server: give it read-only access to specific directories by default and require a clear opt-in before granting write access. Never point an MCP server with write rights at folders that hold SSH keys, browser profiles, or password stores. A simple, solid safeguard is to run the MCP server in a Docker container and mount only the directories you’ve chosen to share:

docker run --rm \
  -v /home/youruser/documents:/data/documents:ro \
  -v /home/youruser/projects:/data/projects:rw \
  mcp-filesystem-server /data

Optimizing for Speed: Practical Tweaks

Once your baseline works, a few high-impact tweaks can lift performance a lot. Some are software flags, some are system config, and a few come from knowing the thermal limits of your hardware.

Flash Attention and Context Handling

Flash Attention 2 and 3 are smarter ways to run attention that cut memory use and lift throughput for long contexts. In llama.cpp, you turn it on with one flag (--flash-attn), and the impact is real: on a 16k context, Flash Attention often cuts VRAM use by 20 to 30% versus the naive path, and it lifts prompt processing speed by 15 to 25%. Always set this flag unless you’re on very old hardware with weak Flash Attention support. It’s safe to use by default.
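
The saving comes from what naive attention has to materialize while chewing through a long prompt: a score block of roughly (batch rows × context columns) per head, which Flash Attention computes in small tiles and never stores. A rough sketch of that scratch cost, with head count and batch size as assumptions:

# Scratch memory naive attention needs while processing a long prompt.
# Head count and batch size are assumptions; the point is the growth with context.
def naive_attn_scratch_gb(ctx_len, batch=2048, n_heads=64, bytes_per_elem=2):
    # score block is (batch rows) x (context columns) per head
    return batch * ctx_len * n_heads * bytes_per_elem / 1e9

for ctx in (4_096, 16_384, 65_536):
    print(f"{ctx:>6}-token context -> ~{naive_attn_scratch_gb(ctx):.1f} GB of attention scratch")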

Speculative Decoding

Speculative decoding is one of the strongest throughput tricks for local inference, and it’s underused because it needs a second “draft” model. Here’s the idea. A small, fast model (the drafter) writes a batch of candidate tokens, and the large model (the verifier) checks them all in one parallel pass, keeping the ones it agrees with and regenerating only from the first point of disagreement. The big model can verify many tokens almost as fast as it writes one, which gets you 2 to 3 times the throughput on repetitive text, like code, template-heavy docs, or structured outputs.

# llama.cpp speculative decoding with Llama 3.2-1B as the draft model
./build/bin/llama-speculative \
  -m ./models/Llama-4-70B-Instruct-Q4_K_M.gguf \
  -md ./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --n-gpu-layers-draft 32 \
  --draft 8 \
  --flash-attn

The draft model should be from the same family as the verifier for the best acceptance rate. A Llama 3.2-1B drafter with a Llama 4.0-70B verifier isn’t a perfect match, but it still gives real gains. Community Llama 4.0-1B or 4.0-3B drafters, once they’re released, should do better.
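
The expected gain has a simple model: if each drafted token is accepted independently with probability p and the drafter proposes k tokens per round, the big model emits about (1 - p^(k+1)) / (1 - p) tokens per full pass. A quick sketch, with acceptance rates as assumptions:

# Expected tokens emitted per full-model pass with speculative decoding,
# under the classic i.i.d. acceptance model. Acceptance rates are assumptions.
def tokens_per_pass(p, k):
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.75, 0.9):
    print(f"acceptance {p:.0%}, draft 8 -> ~{tokens_per_pass(p, 8):.1f} tokens per 70B pass")

The real speedup is a bit lower because the draft model itself costs time, but the shape of the curve explains why a well-matched drafter matters so much.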

Linux vs. Windows Performance

Your OS choice has more impact than most people expect. On Windows, you can pick native or WSL2 (Windows Subsystem for Linux). WSL2 adds overhead through its virtualized GPU layer, so expect 10 to 15% lower GPU use and the same drop in tokens per second versus native Linux. Native Windows without WSL2 beats WSL2 but still trails native Linux by 5 to 8%, mostly due to memory management and the CUDA driver stack.

If you take speed seriously and can dual-boot or run Linux as your main OS, the newest Linux kernel (6.9+) with the current NVIDIA open-source kernel modules gives the best result. Recent kernels handle memory better. That helps large model inference, where big-chunk allocations are the norm.

Thermal Management

Sustained LLM inference is one of the hardest thermal loads you can put on a GPU. It runs hotter than gaming (which has frame gaps) and hotter than most video encoding. An RTX 5070 Ti at full load on a long generation will hit its thermal limit (around 83°C for most variants) in 10 to 15 minutes if your case airflow is poor. Once throttling kicks in, GPU clocks drop to stay safe, and your tokens per second drop with them, often by 20 to 30%.

The best fixes: undervolt or power-limit the GPU (NVIDIA Inspector on Windows, nvidia-smi -pl on Linux), which cuts heat with little speed loss; set an aggressive fan curve that spins fans up sooner than the default; and make sure your case has at least one intake and one exhaust fan with clear airflow paths. For all-day inference, it’s worth seeking out an AIB card variant with a larger heatsink.

Troubleshooting Common Issues

Even with careful setup, a few problems show up often enough to be worth direct coverage.

CUDA out-of-memory errors: The most common error new users hit. The fix is almost always to lower --n-gpu-layers until the model fits in VRAM, or to lower --ctx-size to shrink the KV cache. Start with --n-gpu-layers 0 to confirm the model loads on CPU alone, then raise GPU layers until you find the max that fits. Watch VRAM use live with nvidia-smi dmon -s u.

Driver version conflicts: llama.cpp CUDA builds need CUDA 12.x. With an older driver, the build will pass but runtime will fail with strange errors. Run nvidia-smi and check that the “CUDA Version” in the top-right corner is 12.4 or higher. If not, update your NVIDIA driver through your OS package manager or from the NVIDIA site.

Slow first-load times: The model file has to read from disk and map into memory on startup. On SATA SSDs with a 40 GB file, that takes 60+ seconds. It’s normal, not a bug. Later loads in the same session are faster thanks to OS page cache. If you run many inference sessions per day, keep the llama.cpp server running rather than starting it per request.

Lower than expected tokens per second: If your benchmarks sit well below the table above, check that GPU offload is on. Look for llm_load_tensors: offloaded X/Y layers to GPU in the startup output. If X is 0, your CUDA build isn’t working. Also check that Flash Attention is on and that --threads matches your physical core count, not the hyperthreaded count.

Running Llama 4.0 on macOS (Apple Silicon)

Not everyone has an NVIDIA GPU. Apple Silicon is a first-class home for local LLM inference in 2026. Apple’s unified memory means an M4 Max with 48 GB or 64 GB of RAM can run the full 70B Q4_K_M with every “layer” in the same high-bandwidth pool. No VRAM versus RAM split needed. llama.cpp has full Metal GPU support on macOS. Per watt, it often matches or beats the same-tier NVIDIA setup.

# Build llama.cpp with Metal on macOS (no extra flags needed - Metal is auto-detected)
cmake -B build
cmake --build build --config Release -j $(sysctl -n hw.physicalcpu)

# Run inference (Metal is used automatically)
./build/bin/llama-cli \
  -m ./models/Llama-4-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --flash-attn \
  -p "Hello!"

On an M4 Max (48 GB unified memory), expect 15 to 20 tokens per second for the 70B Q4_K_M. That’s on par with an RTX 5070 Ti and much better per watt. Ollama on macOS uses Metal on its own with no config: ollama pull llama4:70b-instruct-q4_K_M && ollama run llama4:70b-instruct-q4_K_M.

Adding a Web Interface with Open WebUI

If you prefer a GUI over the command line, Open WebUI gives you a full chat interface. It connects to Ollama and any OpenAI-compatible endpoint. It supports model switching, chat history, file uploads, and tool use, all in the browser.

# Run Open WebUI via Docker, connecting to a local Ollama instance
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

[Figure: Open WebUI provides a polished, ChatGPT-style interface for local Ollama models, with a conversation sidebar, model selector, and suggested prompts.]

After it starts, open http://localhost:3000, create a local-only account (no external signup), and your Ollama models will show up in the model picker. Open WebUI also talks to llama.cpp server mode. In admin settings, add http://localhost:8080/v1 as an OpenAI-compatible endpoint, and set the API key to any placeholder string.

The latest Open WebUI releases let you register MCP-style tools right in the UI. Your Llama 4.0 instance can then run web search, file reads, and code, with no glue code to write. It’s a strong setup if you want a local agent with a polished UI.


Running Llama 4.0 on consumer hardware in 2026 isn’t just possible. It’s practical. Real-time speeds are within reach on mid-range cards for most model setups. GGUF quantization, fast inference engines, and Blackwell-class GPUs have moved local AI from a weekend hobby to a daily tool. Start with Ollama for the quickest path to a working setup. Move to llama.cpp when you want to tune flags or build custom integrations. Your data stays on your machine. Your inference costs drop to zero once the hardware is paid for. The model is always there: no API rate limits, no cloud outages, no terms-of-service changes to worry about.