How to Fine-Tune Stable Diffusion XL 2.0 with LoRA

Fine-tuning Stable Diffusion XL 2.0 is most efficiently achieved using Low-Rank Adaptation (LoRA) — a lightweight adapter technique that injects your custom style or subject concept into the model without modifying the base weights. Instead of retraining the full model (which requires enormous compute and produces a 6+ GB file that overwrites the model’s general capabilities), a LoRA trains a small side-network that sits alongside the frozen base. The resulting file is typically 50–300 MB and can be loaded, unloaded, and stacked at inference time. With the right tooling, you can train a quality LoRA on a mid-range RTX 50-series GPU with 12 GB of VRAM in an afternoon.

The 2026 Image Gen Landscape: SDXL 2.0 vs. Flux.1 vs. Others

Before investing time in a training run, it is worth understanding where SDXL 2.0 sits relative to its competitors in early 2026. The image generation field has bifurcated into two major architectural camps: the UNet-based diffusion stack that SDXL inherits from Stable Diffusion 1.x/2.x, and the Diffusion Transformer (DiT) architecture used by Flux.1 (Black Forest Labs), Sora-class video models, and PixArt-Sigma. SDXL 2.0 remains on the UNet path but benefits from substantially improved training data curation, a redesigned VAE with near-lossless latent compression, and a distillation pipeline that closes much of the quality gap with Flux.1 Dev at lower inference cost.

Flux.1 Dev produces stunning images — particularly for photorealism and complex scene composition — but its DiT architecture means the LoRA training ecosystem is still maturing. Tools like x-flux and SimpleTuner support Flux.1 LoRA training, but the hyperparameter landscape is less understood and training stability is more sensitive. SDXL 2.0, by contrast, has a mature LoRA ecosystem backed by years of community experimentation through kohya_ss, thousands of public LoRA releases on CivitAI, and well-documented training recipes. If your goal is a style LoRA with predictable results and a large base model community to build on, SDXL 2.0 is still the right choice in 2026. If you need maximum photorealism and can accept a longer experimentation cycle, Flux.1 Dev is the frontier.

LoRA’s dominance over alternative fine-tuning approaches comes down to a simple set of trade-offs. Full DreamBooth fine-tuning modifies every weight in the UNet and text encoders — it can produce excellent results but requires 24 GB+ of VRAM in full precision, produces a model checkpoint of 6–7 GB, and risks catastrophic forgetting of the base model’s general capabilities. At the other extreme, textual inversion only learns a new embedding vector and cannot teach the model new visual patterns beyond what its weights already know how to express. LoRA sits squarely in the productive middle: it injects small rank-decomposed weight matrices at each attention layer, is trainable on consumer hardware, produces a modular file you can share without distributing the full model, and can be combined with other LoRAs at inference time in ComfyUI or AUTOMATIC1111.
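The memory arithmetic behind this trade-off is easy to verify. LoRA replaces a d×d weight update with two factors B (d×r) and A (r×d), scaled by alpha/r. A pure-Python sketch (the dimensions are illustrative, not SDXL 2.0's actual layer sizes):

```python
# Sketch of LoRA's rank decomposition: instead of learning a full d x d
# update, learn B (d x r) and A (r x d) and apply W' = W + (alpha / r) * B @ A.

def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for a d x d weight."""
    full = d * d
    lora = d * r + r * d  # the B and A factors
    return full, lora

# A hypothetical 1280-wide attention projection at rank 16:
full, lora = lora_param_counts(1280, 16)
print(full, lora, f"{lora / full:.1%}")  # the adapter is ~2.5% of the full update

def apply_lora(w, b, a, alpha: float, r: int):
    """W' = W + (alpha / r) * B @ A, on plain nested lists for clarity."""
    d = len(w)
    scale = alpha / r
    out = [row[:] for row in w]
    for i in range(d):
        for j in range(d):
            delta = sum(b[i][k] * a[k][j] for k in range(r))
            out[i][j] += scale * delta
    return out
```

The same arithmetic explains the 50–300 MB file sizes quoted earlier: only the B and A factors are saved, never the frozen base weights.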

For base model sourcing, Hugging Face Hub (stabilityai/stable-diffusion-xl-base-2.0) and CivitAI are the two primary distribution points. When choosing a base model, look for one with a VAE baked in (or pair it with the SDXL 2.0 VAE separately), verify the license for your use case (SDXL uses CreativeML Open RAIL-M, which permits commercial use with attribution and prohibits generating certain categories of harmful content), and check whether the checkpoint is a full model or a distilled variant. For training purposes, you almost always want the full, non-distilled base — distilled models have had steps removed from their denoising schedule and do not fine-tune as cleanly.

Dataset Preparation with Vision-Language Models

A LoRA is only as good as the data it trains on, and dataset preparation is the most underestimated step in the entire pipeline. The amount of data you need depends on what you are teaching. For a style LoRA that captures a particular artistic aesthetic — watercolor illustration, brutalist architectural photography, a specific comic book inking style — 15 to 50 high-quality images is typically sufficient if they are well-captioned and visually consistent. For a character or face LoRA, where the model must learn a specific identity across varied poses, lighting conditions, and expressions, 20 to 100 images is a more reliable range. In both cases, quality beats quantity: ten crisp, diverse images outperform a hundred blurry, repetitive ones.

The captioning step is where modern VLMs have transformed the workflow. Manually writing a precise caption for each of your 50 training images is feasible but tedious; for 500 images it becomes the bottleneck of the entire project. Florence-2 (Microsoft) and LLaVA-v1.6 are both capable of generating detailed, accurate image descriptions at low VRAM cost and can be run locally. The following pipeline processes a directory of images and writes matching .txt caption files, which is the format expected by both kohya_ss and SimpleTuner:

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import os, glob, torch

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).cuda().half()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image_dir = "/data/my_lora_dataset"
prompt = "<MORE_DETAILED_CAPTION>"

for img_path in sorted(glob.glob(f"{image_dir}/*.jpg") + glob.glob(f"{image_dir}/*.png")):
    image = Image.open(img_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
    with torch.no_grad():  # inference only, no gradients needed
        output = model.generate(**inputs, max_new_tokens=256)
    caption = processor.batch_decode(output, skip_special_tokens=True)[0].strip()
    # Write the caption next to the image: photo_001.jpg -> photo_001.txt
    caption_path = os.path.splitext(img_path)[0] + ".txt"
    with open(caption_path, "w") as f:
        f.write(caption)
    print(f"Captioned: {img_path}")

Once you have machine-generated captions, review a sample of them and correct any that are systematically wrong — VLMs sometimes hallucinate artistic medium or misread text in images. The goal is captions that describe what is visually present without describing the concept you are trying to teach, because that concept will be activated by the trigger word instead.

Trigger words are one of the most important design decisions in LoRA training. Every LoRA needs a unique activation token that does not exist in the base model’s vocabulary — something like ohwx for a person, styl3 for a style, or artst_painterly for a fine art aesthetic. This token is prepended to every training caption: "ohwx woman standing in a park, afternoon light, casual clothing". At inference time, including your trigger word in the prompt activates the learned concept without interfering with the base model’s general understanding. Using a real word like van_gogh or watercolor risks entangling your LoRA concept with the model’s existing knowledge of those terms, producing unpredictable blending.
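Prepending the trigger token across a dataset is worth automating so no caption is missed. A minimal sketch (the helper name and directory path are placeholders):

```python
import glob, os

def prepend_trigger(caption_dir: str, trigger: str) -> int:
    """Prepend 'trigger, ' to every .txt caption that doesn't already start with it."""
    updated = 0
    for path in glob.glob(os.path.join(caption_dir, "*.txt")):
        with open(path) as f:
            caption = f.read().strip()
        if not caption.startswith(trigger):
            with open(path, "w") as f:
                f.write(f"{trigger}, {caption}")
            updated += 1
    return updated

# e.g. prepend_trigger("/data/my_lora_dataset", "ohwx")
```

The idempotency check (skip captions that already start with the trigger) lets you re-run the script safely after adding new images.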

Regularization images address a related problem called concept bleeding, where the LoRA becomes so dominant that it overwrites the base model’s general knowledge. If you train a face LoRA without regularization, prompts that do not include your trigger word may still generate your subject’s face — the model has effectively forgotten how to generate generic people. Regularization works by including a set of generic images from the same class (e.g., random portraits of people who are not your subject) without the trigger word, so the model is continuously reminded what a non-triggered version of that concept looks like. A 1:1 ratio of subject images to regularization images is a common starting point.

Image preprocessing standardizes the technical quality of your dataset. SDXL’s native training resolution is 1024×1024, but multi-resolution bucketing (supported by both kohya_ss and SimpleTuner) allows you to include images at various aspect ratios and resolutions — the training framework groups images into compatible buckets and rescales within each bucket rather than force-cropping everything to square. Before training, run an automated rejection pass: discard images below 512px on either dimension, flag blurry images using a Laplacian variance filter, and deduplicate using perceptual hashing. Tools like imgdataset-tools and wd14-tagger automate much of this preprocessing and are commonly referenced in the kohya_ss community.
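The rejection pass can be sketched without an imaging library. The helpers below operate on width, height, and a 2D list of grayscale pixel values; in practice you would load pixels with PIL and downscale to 8×8 before hashing, and the thresholds here are illustrative starting points, not tuned values:

```python
def laplacian_variance(gray):
    """Variance of the 3x3 Laplacian response; low values indicate blur."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y-1][x] + gray[y+1][x] + gray[y][x-1] + gray[y][x+1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def average_hash(gray):
    """Perceptual hash: one bit per pixel, 1 if above mean brightness.
    In practice, downscale the image to 8x8 before hashing."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

def should_reject(width, height, gray, seen_hashes,
                  min_side=512, blur_threshold=50.0, dup_distance=4):
    """Apply the three rejection rules: too small, too blurry, near-duplicate."""
    if min(width, height) < min_side:
        return "too_small"
    if laplacian_variance(gray) < blur_threshold:
        return "blurry"
    h = average_hash(gray)
    if any(hamming(h, s) <= dup_distance for s in seen_hashes):
        return "duplicate"
    seen_hashes.append(h)
    return None
```

Returning a rejection reason rather than a bare boolean makes it easy to log why each image was dropped and spot systematic problems in a dataset.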

Training Optimization on Consumer Hardware

SDXL 2.0 is a large model — the UNet alone has 2.6 billion parameters — and naive full-precision training quickly exceeds the VRAM of any consumer GPU. Getting a LoRA training run to fit on a 12 GB card requires a specific stack of memory-reduction techniques applied together, not one at a time. The good news is that all of the popular training frameworks have these techniques pre-configured as sensible defaults, so you do not need to implement them from scratch.

Choosing a Training Framework

kohya_ss remains the most popular SDXL LoRA training framework thanks to its mature Gradio-based GUI, enormous community knowledge base, and compatibility with nearly every training technique. If you want a web interface for configuring your training run and prefer not to edit config files by hand, start here.

SimpleTuner has emerged as a strong alternative with a cleaner Python codebase and better support for SDXL 2.0’s specific improvements. It handles multi-resolution bucketing more gracefully, has first-class support for the SDXL 2.0 VAE, and its YAML-based configuration is more legible than kohya’s .toml format. It is the better choice if you are comfortable with the command line and want more control over the training loop.

Diffusers (Hugging Face) provides the lowest-level access — you are working directly with the training scripts (train_dreambooth_lora_sdxl.py) and have full flexibility, but also the most configuration responsibility.

Key Hyperparameters

The learning rate is the single most impactful hyperparameter. Too high and the LoRA will rapidly overfit, producing images that look exactly like your training data regardless of the prompt. Too low and the training will not converge within a reasonable number of steps. A baseline of 1e-4 with a cosine annealing schedule works for most style LoRAs; for face/character LoRAs where fine-grained identity fidelity matters, drop to 5e-5 and train for more steps.
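The warmup-plus-cosine schedule is a simple function of the step index. The sketch below shows what the scheduler computes; kohya_ss and SimpleTuner implement this internally, and the constants mirror the baseline above:

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=100, max_steps=3000):
    """Linear warmup to base_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Ramps up over the first 100 steps, peaks at 1e-4, decays to ~0 by step 3000.
```

The gentle tail of the cosine curve is why late checkpoints change slowly: by the final third of training, the effective learning rate is a fraction of the peak.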

The network rank (r in LoRA nomenclature) controls the capacity of the adapter. A rank of 16 is sufficient for most style LoRAs and trains faster with lower memory use. Rank 32 or 64 gives more capacity for complex concepts — detailed character designs, multi-element styles — at the cost of a larger output file and higher VRAM use. The alpha parameter is typically set equal to rank (so alpha=16 for r=16), which normalizes the adapter’s scale during training.

Step count depends on your dataset size. A rough heuristic: multiply your image count by 100 to get a starting step estimate. For a 30-image style dataset, start at 3000 steps. For a 60-image character dataset, start at 4000–6000 steps. These are starting points — monitor training loss via TensorBoard and save intermediate checkpoints every 500 steps so you can sample from them and identify where the model produces the best results before overfitting sets in.
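The heuristic is trivial to encode alongside the checkpoint schedule. A sketch (the multiplier and checkpoint interval mirror the numbers in this section; the helper name is hypothetical):

```python
def plan_training(num_images, steps_per_image=100, checkpoint_every=500):
    """Starting step estimate (images x 100) plus the checkpoints to sample from."""
    total = num_images * steps_per_image
    checkpoints = list(range(checkpoint_every, total + 1, checkpoint_every))
    return total, checkpoints

# A 30-image style dataset:
total, checkpoints = plan_training(30)
print(total, checkpoints)  # 3000 [500, 1000, 1500, 2000, 2500, 3000]
```

Treat the output as a sampling plan, not a stopping rule: the best checkpoint is usually one or two intervals before overfitting becomes visible in the samples.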

A Practical kohya_ss Training Run

Here is an example kohya_ss TOML configuration for a style LoRA on an RTX 5060 12 GB GPU:

[general]
enable_bucket = true
pretrained_model_name_or_path = "/models/sdxl-2.0-base.safetensors"
output_dir = "/output/my_style_lora"
output_name = "my_style_v1"
save_model_as = "safetensors"
caption_extension = ".txt"

[dataset_arguments]
train_data_dir = "/data/my_lora_dataset"
resolution = "1024,1024"
batch_size = 1

[training_arguments]
max_train_steps = 3000
learning_rate = 1e-4
lr_scheduler = "cosine"
lr_warmup_steps = 100
optimizer_type = "AdamW8bit"
mixed_precision = "bf16"
gradient_checkpointing = true
save_every_n_steps = 500

[network_arguments]
network_module = "networks.lora"
network_dim = 16
network_alpha = 16

[sample_prompt_arguments]
sample_every_n_steps = 500
sample_prompts = "/data/sample_prompts.txt"
sample_sampler = "euler_a"

Run it from the kohya_ss directory:

python sdxl_train_network.py --config_file my_style_lora.toml

A SimpleTuner Configuration

SimpleTuner uses a YAML config file and a separate environment file:

# config.yaml
model_type: "sdxl"
pretrained_model_name_or_path: "stabilityai/stable-diffusion-xl-base-2.0"
output_dir: "/output/my_style_lora"
train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
lr_scheduler: "cosine"
lr_warmup_steps: 100
max_train_steps: 3000
mixed_precision: "bf16"
gradient_checkpointing: true
use_8bit_adam: true
lora_rank: 16
lora_alpha: 16
report_to: "tensorboard"
validation_steps: 500
Launch training:

python train.py --config_file config.yaml

VRAM Trade-offs by GPU Tier

| GPU | VRAM | Max LoRA Rank (bf16 + gradient checkpointing) | Typical Step Time |
|---|---|---|---|
| RTX 5060 | 12 GB | r=32 (r=16 comfortable) | ~4s/step |
| RTX 5070 Ti | 16 GB | r=64 | ~3s/step |
| RTX 5080 | 24 GB | r=128 | ~2s/step |
| RTX 5090 | 32 GB | r=256 (full fine-tune possible) | ~1.5s/step |

The 8-bit Adam optimizer (AdamW8bit from the bitsandbytes library) is the single most impactful memory reduction after gradient checkpointing: it stores optimizer states in 8 bits rather than 32, cutting optimizer state memory by roughly 75%. Combined with bf16 mixed precision and gradient checkpointing, a rank-16 SDXL 2.0 LoRA fits comfortably on 10 GB of VRAM, leaving headroom for the VAE and the batch.
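The saving is easy to estimate from first principles: AdamW keeps two moment tensors per trainable parameter. A rough calculator (the parameter count is illustrative, and real allocations add alignment and allocator overhead):

```python
def optimizer_state_bytes(n_params, bits_per_state=32, states_per_param=2):
    """AdamW tracks two moments per parameter; 8-bit Adam stores each in 8 bits."""
    return n_params * states_per_param * bits_per_state // 8

# Hypothetical rank-16 LoRA with ~20M trainable parameters:
lora_params = 20_000_000
fp32 = optimizer_state_bytes(lora_params, bits_per_state=32)
int8 = optimizer_state_bytes(lora_params, bits_per_state=8)
print(fp32 // 2**20, "MiB vs", int8 // 2**20, "MiB")
```

Note that this only covers optimizer state; gradients, activations, and the frozen base weights dominate the rest of the VRAM budget, which is what gradient checkpointing and bf16 address.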

Monitor training loss in TensorBoard by launching tensorboard --logdir /output/my_style_lora/logs in a second terminal. Loss should decrease steadily for the first 1000–1500 steps and then plateau. If loss plateaus early (before step 1000), your learning rate may be too low. If loss spikes or oscillates, your learning rate is too high. Sample images at checkpoint intervals are more informative than the loss number alone — a LoRA can have acceptable loss but produce flat, concept-collapsed images if the training data was too homogeneous.

Fine-Tuning Method Comparison

Not all fine-tuning approaches are equal. Here is how the major methods compare for SDXL 2.0 in practice:

| Method | VRAM Required | Output Size | Preserves Base Model | Swappable | Best For |
|---|---|---|---|---|---|
| LoRA (r=16) | 8–12 GB | 50–150 MB | Yes (adapter only) | Yes | Style, character, quick experiments |
| LoRA (r=64) | 12–18 GB | 200–400 MB | Yes | Yes | Complex concepts, high fidelity |
| LyCORIS / LoCon | 10–16 GB | 100–500 MB | Yes | Yes | Fine-grained style with local attention |
| LyCORIS / LoHa | 10–16 GB | 100–400 MB | Yes | Yes | Styles with Hadamard decomposition |
| DreamBooth (LoRA) | 12–16 GB | 50–200 MB | Yes | Yes | Subject-driven, more stable than full DB |
| DreamBooth (Full) | 24–40 GB | 6–7 GB | No (new checkpoint) | No | Maximum fidelity, production checkpoints |
| Full fine-tune | 40–80 GB | 6–7 GB | No | No | Dataset distillation, NSFW tuning |
| Textual Inversion | 6–8 GB | < 100 KB | Yes | Yes | Soft concept injection, limited range |

LyCORIS adapters (LoCon and LoHa) are worth highlighting as a middle ground. LoCon applies LoRA-style decomposition to the convolutional layers of the UNet in addition to the attention layers, giving it more capacity for style capture. LoHa uses Hadamard products to compose the weight update, which some practitioners find produces more stable training at higher ranks. Both are supported by kohya_ss via the lycoris network module. For most use cases, a standard LoRA at rank 16–32 is the right starting point, with LyCORIS worth experimenting with if you find standard LoRA is not capturing the full richness of your target style.

Testing, Merging, and Sharing Your LoRA

Once training completes, resist the temptation to immediately share your checkpoint. Systematic evaluation catches problems that spot-checking misses. Build a standardized test prompt set before you start training so you can compare checkpoints fairly: include prompts that should activate your concept strongly (trigger word, expected subject matter), prompts that are adjacent (trigger word, unexpected context), and prompts without the trigger word (to test for concept bleeding). Run all prompts at each saved checkpoint using a fixed seed and compare the image grids.
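Building that standardized prompt set is worth scripting so every checkpoint is evaluated on identical inputs. A sketch (the function name, subjects, and contexts are hypothetical):

```python
def build_test_prompts(trigger, subjects, expected_ctx, unexpected_ctx):
    """Three evaluation groups: triggered in expected contexts, triggered in
    unexpected (adjacent) contexts, and trigger-free to detect concept bleeding."""
    grid = {"triggered": [], "adjacent": [], "no_trigger": []}
    for s in subjects:
        for c in expected_ctx:
            grid["triggered"].append(f"{trigger}, {s}, {c}")
        for c in unexpected_ctx:
            grid["adjacent"].append(f"{trigger}, {s}, {c}")
        for c in expected_ctx + unexpected_ctx:
            grid["no_trigger"].append(f"{s}, {c}")
    return grid

prompts = build_test_prompts(
    "styl3",
    ["portrait of a woman", "city street at night"],
    ["soft morning light"],
    ["underwater", "isometric diagram"],
)
```

Run every prompt in the grid at every saved checkpoint with the same fixed seed; the no_trigger group is the one that exposes concept bleeding.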

The lora_scale parameter (called “weight” in AUTOMATIC1111 and “strength” in ComfyUI’s LoRA loader node) controls how strongly the LoRA influences the output at inference time. The trained LoRA is at full strength at lora_scale=1.0, but this is frequently too aggressive — it can flatten the compositional control of the prompt and make every output look like a training image. A range of 0.7–0.9 is the typical sweet spot for style LoRAs. Face/character LoRAs often need 0.8–1.0 to maintain identity. You can also stack multiple LoRAs at fractional strengths in ComfyUI: [style_lora:0.6, lighting_lora:0.4] is a valid and common workflow.

Merging a LoRA into the base model produces a single checkpoint that loads faster (no runtime adapter overhead) and is simpler to distribute to users who do not want to manage multiple files. The kohya_ss merge script handles this:

python networks/merge_lora.py \
  --sd_model /models/sdxl-2.0-base.safetensors \
  --models /output/my_style_lora/my_style_v1.safetensors \
  --ratios 0.8 \
  --save_to /output/merged/my_style_sdxl.safetensors \
  --sdxl

The --ratios 0.8 argument sets the merge strength — equivalent to using the LoRA at 0.8 scale at inference time. The trade-off versus keeping the LoRA separate is loss of flexibility: a merged checkpoint cannot be unloaded at runtime, and you cannot combine it with other LoRAs using the standard adapter stack.

Before publishing on CivitAI or Hugging Face, review the base model’s license. SDXL uses CreativeML Open RAIL-M, which permits commercial use as long as you comply with its use restrictions (no CSAM, no content designed to deceive through disinformation, etc.) and include the license text when redistributing. If you trained on a community checkpoint that added its own license terms, those apply in addition. On CivitAI, the “type” tag (style LoRA, character LoRA, concept LoRA) determines how the community discovers your work; accurate tagging and a clear trigger word in the description are the most important metadata decisions.

Troubleshooting Common Training Failures

Even with a well-prepared dataset and sensible hyperparameters, training runs go wrong. The most common failure modes are predictable and fixable.

NaN loss (the training loss shows nan after a few steps) is almost always caused by a learning rate that is too high, a corrupt training image, or a mismatch between the model’s expected dtype and the optimizer’s output. Start by halving the learning rate. If the problem persists, run a preprocessing pass to check for corrupt images (e.g. Image.open(path).verify() from PIL), and ensure you are using bf16 rather than fp16 — SDXL 2.0 is trained in bfloat16, and training it in fp16 can cause numerical instability.

Mode collapse occurs when the LoRA converges to a narrow distribution — every generated image looks similar regardless of the prompt, typically resembling the modal image in your training set. It is caused by insufficient dataset diversity, too-high learning rate, or too many training steps relative to dataset size. Add more diverse images to your dataset, reduce the learning rate, and stop training earlier.

Catastrophic overfitting shows a different symptom than mode collapse: the LoRA reproduces your training images faithfully but fails to generalize beyond them. Prompts that include your trigger word but describe contexts not in your training set produce bizarre, incoherent outputs. Fix this by reducing step count (use earlier checkpoints), increasing the regularization image count, and adding a small amount of noise augmentation if your training framework supports it.

Slow convergence (loss barely moving after 1000 steps) usually means your learning rate is too low or your dataset captions are too inconsistent. Verify that captions across your dataset use consistent terminology and that your trigger word appears at the start of every caption string.
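That last check is scriptable. A small audit pass (the helper name is hypothetical; directory and trigger are placeholders):

```python
import glob, os

def audit_captions(caption_dir, trigger):
    """Return caption files whose text does not start with the trigger word."""
    bad = []
    for path in sorted(glob.glob(os.path.join(caption_dir, "*.txt"))):
        with open(path) as f:
            if not f.read().lstrip().startswith(trigger):
                bad.append(path)
    return bad

# e.g. audit_captions("/data/my_lora_dataset", "ohwx")
```

An empty return list before every training run is a cheap invariant that rules out the most common captioning mistake.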

Loading and Testing in ComfyUI

ComfyUI’s node-based workflow makes LoRA testing iterative and visual. Place your trained .safetensors file in ComfyUI/models/loras/. In the workflow, add a Load LoRA node between your CheckpointLoaderSimple and your CLIPTextEncode / KSampler nodes. The node exposes two strength sliders: strength_model (affects the UNet denoiser) and strength_clip (affects the text encoder conditioning). For most LoRAs, keeping both at the same value and sweeping from 0.5 to 1.0 in 0.1 increments while holding a fixed seed gives you a clear calibration of how the LoRA’s effect scales.

A systematic test workflow: create a prompt grid using ComfyUI’s batch features — fix the seed, vary the LoRA strength across columns, and vary the prompt’s subject matter across rows. This gives you a two-dimensional view of how the LoRA interacts with the base model’s capabilities and immediately reveals both concept bleeding (trigger-free prompts affected) and under-training (trigger prompts insufficiently differentiated from the base).

The investment of a careful test workflow before publishing pays dividends in user trust and adoption — a LoRA with a clear model card, accurate trigger words, and example images generated at representative strength values will be used and rated far more reliably than a bare safetensors file with no documentation.