Run Vision Models Locally: Florence-2 and Qwen-VL for Image Analysis

Florence-2 and Qwen2-VL both run on consumer NVIDIA GPUs with as little as 8 GB VRAM. They handle OCR, object detection, image captioning, and visual question answering, all of it offline. Florence-2 uses a small sequence-to-sequence design with task prompt tokens. That makes it fast and reliable for structured extraction. Qwen2-VL takes a chat-style approach. It handles open-ended reasoning, dense documents, and follow-up questions. The two models work best as a pair, not as swaps for each other.

Why Run Vision Models Locally

Cloud vision APIs have made image analysis easy. Google Cloud Vision , AWS Rekognition , and OpenAI’s GPT-4o Vision all return solid results. The catch: every image you send leaves your network, costs money per request, and adds round-trip lag.

For many jobs, none of that is a problem. But think about medical scans, security camera frames, internal product catalogs, or scanned legal files. For those, the rules or plain privacy worries make cloud uploads risky. Even with no hard rules, the cost math shifts fast. Process 10,000 product images per day at $0.0015 each and you spend about $450 per month on one task.

Local inference flips that. Once the models sit in GPU memory, each new image costs almost nothing beyond power. An RTX 5070 runs a Florence-2-large image in about 150ms. It returns a Qwen2-VL 7B response in about 600ms. That is fast enough for batch jobs and many real-time apps.

The models have also gotten good enough to make this a real trade. Sub-10B vision-language models in 2026 match or beat the cloud models of a few years back. You can see this on benchmarks like TextVQA and COCO Captions. You no longer swap accuracy for privacy. The local models are genuinely strong.

The use cases that benefit most from running locally:

  • Batch OCR on scanned documents, receipts, or invoices
  • Automated alt-text generation for large image libraries
  • Product catalog tagging and categorization
  • Screenshot-to-code extraction for UI automation workflows

Florence-2: Architecture and Setup

Florence-2 is Microsoft’s unified vision model, and its design is unusual. Both the task instruction and the output are plain text. You prompt the model with a task token: <OD> for object detection, <OCR> for text extraction, <CAPTION> for a short description. It then returns structured text you can parse directly.

Florence-2 unified architecture showing task prompts flowing through a shared vision-language backbone
Florence-2's sequence-to-sequence architecture accepts task tokens and images as input
Image: Microsoft Research

Two model sizes ship. Florence-2-base is 0.23B parameters (~1.2 GB VRAM). Florence-2-large is 0.77B parameters (~2.5 GB VRAM). Both fit on any modern GPU with ease. That leaves most of your VRAM free for batching or other models.

Installation:

pip install transformers torch pillow

Loading the model:

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large",
    trust_remote_code=True
)

Running inference:

def run_florence(image_path, task_token):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=task_token,
        images=image,
        return_tensors="pt"
    ).to("cuda")

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            do_sample=False
        )

    result = processor.batch_decode(output_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        result,
        task=task_token,
        image_size=image.size
    )
    return parsed

# Extract text from a scanned document
text_result = run_florence("document.jpg", "<OCR>")

# Object detection - returns bounding boxes
detection_result = run_florence("photo.jpg", "<OD>")

The task tokens worth using day-to-day:

  • <CAPTION> - short one-sentence description
  • <DETAILED_CAPTION> - dense multi-sentence description
  • <OD> - object detection with bounding boxes in <loc_XXX> format
  • <OCR> - text extraction from the image
  • <REGION_PROPOSAL> - generate region proposals
  • <REFERRING_EXPRESSION_SEGMENTATION> - segment a referred object

Florence-2 returns bounding boxes in a normalized 1000x1000 grid. The post_process_generation method converts those back to pixel coordinates. So you can pass results straight to PIL’s ImageDraw or OpenCV’s rectangle function with no extra math.

Florence-2 OCR and region detection output showing bounding boxes and extracted text overlaid on an image
Florence-2 OCR with region-level text extraction and bounding box output
Image: Microsoft Research

Florence-2 is narrow by design. It is not built for open-ended chat questions, multi-image comparisons, or tasks that need context beyond a single frame. For those, you want Qwen2-VL.

Qwen2-VL: Architecture and Setup

Qwen2-VL is Alibaba’s multimodal model, built for visual understanding and reasoning. Florence-2 uses structured task tokens. Qwen2-VL takes plain language instead. You ask “What text is visible in this image?” or “Describe how the objects in this scene relate,” and it answers in chat form.

Qwen2-VL architecture diagram showing the vision encoder, language model, and multimodal fusion pipeline
Qwen2-VL processes images at native resolution through a vision transformer before fusing with the language model
Image: Alibaba Qwen Team

Three parameter sizes ship: 2B, 7B, and 72B. The 7B variant is the right pick for consumer hardware. It needs about 5.5 GB VRAM at Q4_K_M quantization. The 72B version works on a multi-GPU setup or with heavy quantization. But the 7B hits the best balance of accuracy and resources for most tasks.

The fastest way to start is Ollama , the same tool used for running large language models on consumer hardware :

ollama pull qwen2-vl:7b
ollama run qwen2-vl:7b "Describe this image" --images ./photo.jpg

Ollama handles quantization and GPU memory for you. The --images flag accepts local file paths. For batch jobs or API work, the transformers library gives you more control:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="cuda"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def run_qwen(image_path, question):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt"
    ).to("cuda")

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)

    trimmed = [
        out[len(inp):]
        for inp, out in zip(inputs.input_ids, output_ids)
    ]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

caption = run_qwen("photo.jpg", "Describe what you see in this image in detail.")

Qwen2-VL accepts more than one image in a single message. Add extra {"type": "image"} entries in the content array for comparison tasks or frame sequences. It also handles images at native resolution, so you do not need to resize them first.

Qwen2-VL 7B benchmark results comparing performance across vision-language tasks
Qwen2-VL 7B benchmark scores on DocVQA, ChartQA, and other vision-language tasks
Image: Alibaba Qwen Team

Where Qwen2-VL pulls ahead of Florence-2:

  • Multi-turn visual QA with follow-up questions in a chat loop
  • Document understanding for tables, charts, and infographics
  • Cultural and contextual reasoning about scene content
  • Comparing multiple images in a single prompt
  • Video analysis by passing extracted keyframes as an image sequence

The trade-off is speed. Qwen2-VL 7B takes 600ms to 2 seconds per image, based on response length. Florence-2-large takes 150ms. For bulk structured extraction, that gap adds up fast.

Head-to-Head: Which Model for Which Task

The choice between them comes down to what a given task needs.

TaskFlorence-2Qwen2-VL 7B
OCR accuracy (printed English)~94% char-level~91% char-level
Object detection outputStructured bounding boxesNatural language descriptions
Image captioning depthFlat but accurateRich, contextually aware
Visual QANot designed for itStrong, handles follow-ups
Inference speed (RTX 5070)~150ms~600ms - 2s
VRAM at float16~2.5 GB~5.5 GB (Q4_K_M)
Multi-image inputNoYes

A few specifics. Florence-2’s OCR edge comes from its dedicated text extraction head. The gap shrinks on handwritten or low-resolution text. Qwen2-VL’s object location output needs post-processing if you want pixel coordinates. It tells you “a dog in the lower right” rather than a bounding box.

On captioning, the gap is wider than benchmarks show. Florence-2 writes correct but flat descriptions. Qwen2-VL picks up on humor, visual irony, implied motion, and scene context. That makes its output far more usable for anything a person will read.

One practical note: both models can run at the same time on a 12 GB VRAM card. Florence-2-large at 2.5 GB and Qwen2-VL 7B at 5.5 GB leaves enough room for inference overhead. That opens up a useful pipeline design.

Building a Two-Stage Image Analysis Pipeline

Use Florence-2 for fast triage and Qwen2-VL for deeper analysis on flagged images. You get the speed of the small model on most of your data. You spend the costly inference only where it pays off.

The logic is simple. Run Florence-2’s <OD> and <OCR> on every image. If the object count tops a threshold, or the OCR output looks like a form, receipt, or table, pass that image to Qwen2-VL for a closer read. Simpler images get handled by Florence-2 alone: single object, no text, clean scene.

import sqlite3
from pathlib import Path
from fastapi import FastAPI, UploadFile

app = FastAPI()

def analyze_image(image_path: str) -> dict:
    # Stage 1: Florence-2 fast pass
    detection = run_florence(image_path, "<OD>")
    ocr_output = run_florence(image_path, "<OCR>")
    caption = run_florence(image_path, "<DETAILED_CAPTION>")

    object_count = len(detection.get("labels", []))
    has_complex_text = len(ocr_output.get("<OCR>", "")) > 100

    qwen_analysis = None
    if object_count > 5 or has_complex_text:
        qwen_analysis = run_qwen(
            image_path,
            "Describe what is happening in this image in detail, "
            "including any visible text and the relationships between objects."
        )

    return {
        "caption": caption,
        "objects": detection,
        "text": ocr_output,
        "deep_analysis": qwen_analysis
    }

@app.post("/analyze")
async def analyze_endpoint(file: UploadFile):
    tmp_path = f"/tmp/{file.filename}"
    with open(tmp_path, "wb") as f:
        f.write(await file.read())
    result = analyze_image(tmp_path)
    return result

For batch jobs, wrap Florence-2 calls in a torch.utils.data.DataLoader with a batch size of 4 to 8. The small model fits several images in its VRAM context at once. Run Qwen2-VL queries one at a time, since its dynamic resolution does not parallelize as cleanly.

Store results in SQLite with columns for file path, Florence-2 captions, detected objects as JSON, extracted text, and Qwen2-VL analysis. That gives you a searchable index over an image library with no outside service. For richer retrieval over mixed image and text content, feed these outputs into a multi-modal RAG pipeline that indexes captions and text alongside image embeddings.

Here is a concrete example. Process 1,000 product photos to generate alt-text, pull visible label text, and sort by product type. Florence-2 handles the roughly 80% that are clean product shots. Qwen2-VL takes the 20% with busy scenes or label-heavy content. The full batch finishes in under 15 minutes on an RTX 5070. The same job through a cloud API costs $3 to $5 per run, every run.

Hardware Requirements and GPU Compatibility

The minimum practical setup is an NVIDIA GPU with 8 GB VRAM. That leaves plenty of room for Florence-2-large, and a small margin for Qwen2-VL 7B at Q4_K_M quantization. Running both models at once needs around 10-11 GB VRAM with inference overhead. So a 12 GB card (RTX 4070 Ti, RTX 3080 12GB, RTX 4080, or RTX 5070) is the comfortable baseline for the two-stage pipeline.

If you are stuck with 8 GB, keep only one model loaded at a time. Use Python’s garbage collection to free the GPU memory between calls. For Florence-2 this is easy, since the model fits well and reloads fast. For Qwen2-VL, Ollama loads and unloads the model for you.

import gc
import torch

def free_gpu_memory():
    gc.collect()
    torch.cuda.empty_cache()

# Unload Florence-2 before loading Qwen2-VL
del model
free_gpu_memory()

CPU inference works for Florence-2-base, since 0.23B parameters runs at a tolerable pace on a modern CPU. But at roughly 5-10 seconds per image, it suits only low-volume tasks. Qwen2-VL is too slow on CPU for anything beyond testing.

AMD GPU users get ROCm support through PyTorch. First confirm your GPU is on the ROCm compatibility list. Florence-2 and Qwen2-VL both load on ROCm-capable cards with the same code, since ROCm exposes itself as CUDA to PyTorch. Apple Silicon users can run Qwen2-VL through Ollama , which ships Metal-optimized builds, or through MLX for direct framework access.

Wrapping Up

Both models live on Hugging Face and need only the standard Transformers and PyTorch packages. Florence-2 needs trust_remote_code=True at load time, since its processor includes custom code. Qwen2-VL needs the qwen-vl-utils package for image preprocessing.

For most people, the fastest way to start is Ollama for Qwen2-VL and the Hugging Face Transformers path for Florence-2. Once both models run, you can build the two-stage pipeline in an afternoon. The pair covers most practical vision tasks, and no image ever leaves your machine.