AI

Hands-on guides to LLMs, agents, prompt engineering, and the AI tools I run every day for real work, not demos.

A glowing desktop graphics card streams data into a landscape painting on an easel beside VRAM and wattage gauges

Run FLUX 2 Locally in 2026: VRAM by GPU + ComfyUI Setup

You can run FLUX 2 locally on a single consumer GPU in 2026. The open-weight FLUX 2 dev is a 32B model from Black Forest Labs that fits a 24GB card when quantized, while the smaller Klein builds run on 8GB. This guide picks the right variant for your card, installs it in ComfyUI, and covers what it costs to run.

Key Takeaways

Quantized FLUX 2 dev fits a 24GB card; Klein runs on 8GB.
ComfyUI plus Stability Matrix is the fastest way to start.
Quantized GGUF builds cut VRAM in half with little quality loss.
Running locally costs a fraction of a cent per image in power.
Only dev and Klein have downloadable weights; Pro and Max are API only.

FLUX 2 dev sample output showing a retro-futuristic cityscape with Japanese-inspired typography and cosmic sky — FLUX 2 produces photorealistic and stylized images with strong detail and coherence

Why Small Language Models (SLMs) are Better for Edge Devices

Small Language Models, sub-4B parameter models built to run on local hardware, now handle most of the edge AI work that used to need the cloud. Phi-4, Gemma 3, and Llama 3.2-1B run offline on Raspberry Pi boards, phones, and industrial PLCs. The economics, latency, and privacy story all point the same way: edge first.

What Counts as a Small Language Model

In 2023, “small” meant under 13B parameters. Today, three tiers matter for edge work.

SDXL 2.0 LoRA: 50-300 MB Adapters on 12 GB VRAM

The best way to fine-tune Stable Diffusion XL 2.0 is with Low-Rank Adaptation (LoRA): a small adapter that injects your style or subject without touching the base weights. Instead of retraining the full model, LoRA trains a tiny side network next to the frozen base. The result is a 50 to 300 MB file you can load, swap, and stack at inference, trained on a 12 GB GPU in an afternoon.
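To see why the adapter file is so small, here is a minimal sketch of the parameter arithmetic behind LoRA. For one weight matrix of shape d_out x d_in, LoRA trains two thin matrices of rank r instead of the full matrix. The layer size and rank below are illustrative assumptions, not values taken from any specific model.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one full weight matrix (d_out x d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    LoRA adds two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer width and rank, chosen only for illustration.
d_in = d_out = 1280
rank = 16

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
ratio = lora / full

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({ratio:.1%} of the full matrix)")
```

Raising the rank grows the adapter linearly, which is one reason adapter files span a range of sizes rather than one fixed size.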

Underground vault library with glowing holographic books arranged in vector space and a robot librarian retrieving relevant volumes

Setup a Private Local RAG Knowledge Base

To build a private Retrieval-Augmented Generation (RAG) system, pair a local vector database like Qdrant with an embedding model like BGE-M3. Add a local LLM through Ollama, and you can index hundreds of documents and ask questions about them. Your data stays on your machine.
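The retrieve-then-prompt flow can be sketched in plain Python. This is a toy stand-in, not the Qdrant, BGE-M3, or Ollama APIs: `embed` uses word counts in place of a real embedding model, `retrieve` scans a list in place of a vector database, and `build_prompt` only assembles the text you would send to a local LLM. All function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt a local LLM would receive."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical private documents.
docs = [
    "The backup server runs every night at 2 am",
    "Invoices are stored in the finance share",
    "The office wifi password rotates monthly",
]

question = "when does the backup server run"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real setup the embedding model and vector database replace `embed` and `retrieve`, but the shape of the pipeline stays the same: embed, search, then prompt.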

Why RAG? The Problem With Pure LLM Memory

Large language models sound smart, but they are poor knowledge stores. They learn from old training data and know nothing about files you created later or keep private. Ask about your own data, and the model will often guess. Even strong open-weight models like Llama 4.0 can invent plausible but wrong answers about content they never saw.

Building Multi-Step AI Agents with LangGraph

AI agents built on LangGraph run as stateful graphs, not linear prompts. The graph can loop, branch on tool output, retry after a failure, and save its progress. That structure is what lets one agent handle long, multi-step tasks reliably.

Key Takeaways

LangGraph models an agent as a stateful graph, so it can loop, retry, and recover.
The state schema you design up front decides how stable the agent turns out.
Built-in checkpointing lets an agent crash, pause for approval, and resume without lost work.
Conditional edges turn failures into retries instead of dead ends.
One agent task can fire dozens of LLM calls, so plan for cost before you deploy.

Prerequisites

You should know Python 3.11+ and the LangChain basics: LLMs, tools, prompts. The code below uses these versions:

High-end gaming desktop with illuminated NVIDIA GPU visible through a glass side panel, surrounded by floating holographic neural network diagrams and data streams

Run Llama 4 Scout Locally: 24GB VRAM, GGUF, Real Speeds

You can run Llama 4 Scout on a 24 GB consumer GPU, but only with aggressive quantization and some patience. Scout is a 109B-parameter Mixture-of-Experts model, and even its smallest Unsloth dynamic GGUF build is about 32 GB, so a 24 GB card runs it with CPU offload at roughly 20 tokens per second. This guide covers which Llama 4 model fits your hardware, the real VRAM math, and the fastest way to get it running.
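The numbers in that paragraph can be checked with a few lines of arithmetic. This is a rough sketch that counts weights only, treats 1 GB as 10^9 bytes, and ignores the KV cache, activations, and runtime overhead, so real memory use will be higher. The function names are made up for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a given parameter count and precision."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a model file of a given size."""
    return file_gb * 8 / params_billion


PARAMS_B = 109  # Scout's total parameter count, in billions
FILE_GB = 32    # approximate size of the smallest GGUF build
VRAM_GB = 24    # consumer GPU memory

fp16_gb = weights_gb(PARAMS_B, 16)
bits = implied_bits_per_weight(FILE_GB, PARAMS_B)
min_offload_gb = max(0, FILE_GB - VRAM_GB)

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"32 GB build:    about {bits:.2f} bits per weight on average")
print(f"minimum spill to system RAM on a 24 GB card: {min_offload_gb} GB")
```

The gap between the file size and the card's memory is the floor for CPU offload, which is why the 24 GB setup depends on system RAM as well as the GPU.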