Set Up a Private Local RAG Knowledge Base

Creating a private Retrieval-Augmented Generation (RAG) system requires a local vector database like Qdrant paired with a strong embedding model like BGE-M3. Together with a locally served LLM via Ollama, this configuration lets you index hundreds of documents and answer questions about them with AI — without a single byte of your data leaving your machine.

Why RAG? The Problem With Pure LLM Memory

Large language models are impressive but fundamentally limited as knowledge stores. They are trained on a frozen snapshot of data and have no awareness of anything that happened after their training cutoff, let alone your personal files, internal documents, or private notes. When you ask a model about your own data, it has no choice but to confabulate — and it does so confidently, which is the dangerous part. Even the most capable open-weight models, the Llama family included, will invent plausible-sounding but entirely wrong answers when asked about content they have never seen.

The naive workaround is to paste all your documents directly into the context window. This breaks down almost immediately in practice. Context windows have hard limits, processing long contexts is slow and expensive (or resource-intensive locally), and most critically, research has repeatedly shown that models suffer from the “lost in the middle” problem: they disproportionately weight content near the beginning and end of a long prompt while largely ignoring material in the middle. If your 200-page legal document gets stuffed into a context, the model may completely miss the clause on page 95.

RAG sidesteps all of these problems by changing the information retrieval architecture entirely. Instead of feeding documents to the model, you first convert every document chunk into a dense vector embedding and store it in a vector database. When a query arrives, the same embedding model converts the query into a vector, and the vector database retrieves the 3–10 most semantically similar chunks. Only those chunks are injected into the LLM’s context as grounding material. The model answers based on retrieved evidence rather than memorized weights. The result is accurate, traceable, and — crucially — updatable without retraining anything.
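The retrieve-then-generate loop can be sketched end to end with toy data; here cosine similarity over hand-written three-dimensional vectors stands in for the embedding model and vector database (the chunk texts and vectors are illustrative, not real embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "index": (chunk_text, embedding) pairs -- a real system stores these in a vector DB
index = [
    ("The contract terminates after 12 months.", [0.9, 0.1, 0.0]),
    ("Liability is capped at the annual fee.",   [0.1, 0.9, 0.0]),
    ("Payment is due within 30 days.",           [0.0, 0.2, 0.9]),
]

def retrieve(query_vec: list[float], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# A query about termination embeds near the first chunk, so only that chunk
# is injected into the LLM's context as grounding material
context = retrieve([0.85, 0.15, 0.05], top_k=1)
prompt = f"Answer ONLY from this context:\n{context[0]}\n\nQuestion: When does the contract end?"
```

The real pipeline later in this article replaces the toy index with Qdrant and the hand-written vectors with BGE-M3 embeddings, but the shape of the loop is identical.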

It is worth being clear about when RAG is the right tool versus fine-tuning. RAG is for dynamic, frequently updated knowledge bases: document collections, Obsidian vaults, codebases, support ticket archives. Fine-tuning is for instilling static behavioral patterns, domain-specific writing styles, or structured output formats into the model itself. If you need to ask questions about your documents, RAG is nearly always the correct answer.

Choosing Your Stack: Vector DB, Embedding Model, and LLM

The RAG stack has three moving parts — the vector database, the embedding model, and the generative LLM — and the choices interact. Picking the wrong combination can mean either mediocre retrieval quality or a system that won’t fit in your available RAM.

For the vector database, Qdrant is the recommended choice for a local setup that you intend to run seriously. It ships as a single Docker image, stores its data on disk (so it survives restarts), and natively supports both dense vector search and hybrid sparse+dense search — which matters a great deal for retrieval quality (more on that later). Chroma is a popular alternative for experimentation: it is Python-native, trivial to set up, and requires no Docker, but it lacks production-grade features like filtering-aware HNSW indexing and built-in hybrid search. Milvus is the enterprise-scale option with distributed deployment capabilities that are overkill for a single-person knowledge base but relevant for teams indexing millions of documents. The table below summarizes the practical differences:

| Feature               | Qdrant                | Chroma            | Milvus            | LanceDB                |
| --------------------- | --------------------- | ----------------- | ----------------- | ---------------------- |
| Deployment            | Docker / binary       | Python in-process | Docker / K8s      | Python in-process / S3 |
| Hybrid search         | Built-in (RRF)        | No                | Yes (BM25 plugin) | No                     |
| Disk-backed storage   | Yes (HNSW on disk)    | Yes (SQLite)      | Yes               | Yes (Lance columnar)   |
| Filtering on metadata | Yes (payload filters) | Yes (basic)       | Yes               | Yes                    |
| Ease of local setup   | Medium (Docker)       | Very easy         | Complex           | Very easy               |
| Best for              | Production local RAG  | Prototyping       | Enterprise scale  | Analytics workloads    |

For the embedding model, BGE-M3 from BAAI is the current best-in-class choice for a local setup. It produces 1024-dimensional dense vectors, supports over 100 languages, and critically can generate both dense and sparse (BM25-style) embeddings from the same model — making it a natural fit for Qdrant’s hybrid search. all-MiniLM-L6-v2 is a faster alternative that produces 384-dimensional vectors; it works well for prototyping where retrieval quality is not critical. nomic-embed-text is a solid middle ground with strong English-language quality at a dimension of 768.

| Model            | Dimensions | Languages       | Speed (CPU)  | Quality   |
| ---------------- | ---------- | --------------- | ------------ | --------- |
| BGE-M3           | 1024       | 100+            | ~80ms/chunk  | Excellent |
| nomic-embed-text | 768        | English-primary | ~45ms/chunk  | Very good |
| all-MiniLM-L6-v2 | 384        | English         | ~15ms/chunk  | Good      |

For the generative LLM, Ollama makes local inference trivial. Any model you can pull via ollama pull — Llama 3, Mistral Nemo, Gemma 3 — can serve as the generation step. For a machine with 16GB RAM and no discrete GPU, Mistral Nemo (12B, Q4 quantized) is a solid choice. With a dedicated GPU and 24GB+ VRAM, a larger model such as Gemma 3 27B at Q4 is excellent. Install the dependencies before starting:

pip install qdrant-client fastembed ollama langchain-text-splitters pymupdf python-docx

Pull the generative model via Ollama:

ollama pull mistral-nemo

Start Qdrant with Docker:

docker run -d --name qdrant \
  -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Document Ingestion and Chunking Strategies

The quality of a RAG system is determined more by how you chunk your documents than by almost any other factor. Retrieval works by comparing query vectors against chunk vectors. If a chunk spans two unrelated topics because you naively split at every 512 characters, the resulting embedding will be a confused average of both — and it will fail to surface reliably for either topic.

Fixed-size character chunking is the fastest approach and the worst for quality. Splitting on a raw character count means you will routinely cut mid-sentence, mid-table, and mid-code-block. The resulting chunks are semantically incoherent and produce poor embeddings. Despite being the default in many tutorials, avoid it in production.
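The failure mode is easy to see with a toy example: fixed-width slicing cuts straight through words and sentences, while even a trivial sentence-boundary split keeps each chunk coherent (the 40-character width here is arbitrary, chosen to force a bad cut):

```python
import re

text = "Termination requires 90 days notice. Liability is capped at the annual fee."

# Naive fixed-size chunking: slice every 40 characters, no regard for boundaries
fixed_chunks = [text[i:i + 40] for i in range(0, len(text), 40)]
# The first chunk ends mid-word ("...notice. Lia"), so its embedding blends
# two unrelated clauses and represents neither well

# Boundary-aware chunking: split after sentence-ending periods instead
sentence_chunks = re.split(r"(?<=\.)\s+", text)
```

Recursive splitting, covered next, generalizes this idea across paragraph, sentence, and word boundaries.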

Recursive character splitting is the baseline production approach. LangChain’s RecursiveCharacterTextSplitter splits first on paragraph boundaries (\n\n), then sentence boundaries (. ), then word boundaries, and only falls back to raw character counts as a last resort. This preserves semantic coherence in the vast majority of cases. Note that chunk_size here is measured in characters, not tokens (use RecursiveCharacterTextSplitter.from_tiktoken_encoder for token-based sizing); a chunk size of 512 with 64 of overlap is a solid starting point:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(document_text)

For code files (.py, .ts, .go), use language-aware splitting. LangChain provides Language.PYTHON, Language.JS, etc., which split on function and class boundaries rather than arbitrary character counts — essential for keeping function signatures paired with their bodies.

Handling diverse file types requires different loading strategies. PDFs are the most common source and also the most treacherous. pymupdf (imported as fitz) preserves text layout, handles multi-column documents reasonably well, and extracts page numbers — which you need for citations. DOCX files are straightforward with python-docx. Markdown from Obsidian vaults can be split with a Markdown-aware splitter that respects heading hierarchy.

Metadata enrichment is the most commonly skipped step and one of the most valuable. Every chunk stored in Qdrant should carry a payload that includes at minimum: the source filename, page or section number, the section heading it appeared under, and a creation or modification timestamp. This metadata serves two purposes: it enables payload filtering (e.g., “only search documents tagged as ’legal’”), and it lets your system generate citations rather than just answers.
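Concretely, a chunk payload along these lines covers the minimum (the text, source, and page keys match the ingestion pipeline in this article; section, modified_at, and tags are the enrichment fields just described, named illustratively):

```python
from datetime import datetime, timezone

# Minimum useful payload for one chunk: enough for filtering AND citations
payload = {
    "text": "Termination requires 90 days written notice...",
    "source": "vendor_contract.pdf",          # filename, for citations
    "page": 12,                               # page number, for citations
    "section": "7. Termination",              # heading the chunk appeared under
    "modified_at": datetime.now(timezone.utc).isoformat(),  # for freshness filters
    "tags": ["legal"],                        # enables "only search legal docs" filters
}
```

Everything here is stored alongside the vector in Qdrant, so a retrieved hit can be turned directly into a citation like "[vendor_contract.pdf, p.12]".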

Here is a complete ingestion pipeline for a directory of PDFs:

import fitz  # pymupdf
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

COLLECTION_NAME = "knowledge_base"
EMBED_MODEL = "BAAI/bge-m3"
QDRANT_URL = "http://localhost:6333"

client = QdrantClient(url=QDRANT_URL)
embedder = TextEmbedding(model_name=EMBED_MODEL)

# Create collection if it doesn't exist
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

def ingest_pdf(pdf_path: Path):
    doc = fitz.open(pdf_path)
    all_points = []

    for page_num, page in enumerate(doc, start=1):
        page_text = page.get_text("text")
        if not page_text.strip():
            continue

        chunks = [c for c in splitter.split_text(page_text) if len(c.strip()) >= 50]  # skip tiny fragments
        if not chunks:
            continue

        # Batch-embed the page's chunks in one call -- much faster than per-chunk calls
        embeddings = embedder.embed(chunks)

        for chunk, embedding in zip(chunks, embeddings):
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={
                    "text": chunk,
                    "source": pdf_path.name,
                    "page": page_num,
                    "path": str(pdf_path),
                },
            )
            all_points.append(point)

    # Upsert in batches of 100
    for i in range(0, len(all_points), 100):
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=all_points[i:i+100],
        )

    print(f"Ingested {len(all_points)} chunks from {pdf_path.name}")


docs_dir = Path("./documents")
for pdf_file in docs_dir.glob("**/*.pdf"):
    ingest_pdf(pdf_file)

Hybrid Search: Dense Vectors, BM25, and Reranking

Pure dense vector search has well-understood failure modes that bite you in production. Because dense embeddings capture semantic meaning, they excel at conceptual questions (“what does the contract say about liability?”) but struggle with exact-term matching. If you ask about “Section 4.2.1” or “the BGE-M3 model” or a specific error code like “E_CONN_TIMEOUT”, the embedding of your query may not rank documents containing that exact string anywhere near the top — because the embedding space encodes meaning, not lexical identity.

BM25 sparse retrieval is the traditional information retrieval approach that handles exact-term matching precisely: it scores documents by term frequency weighted by inverse document frequency. It is excellent for proper nouns, version numbers, model names, and technical identifiers. It is poor at synonymy and conceptual matching. Neither approach alone is adequate for a real knowledge base.
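The TF-IDF core of that behavior can be sketched in a few lines. This toy scorer omits BM25’s length normalization and saturation terms (k1, b), but it shows why a rare exact token like an error code reliably pulls the right document to the top:

```python
import math

docs = [
    "retry after E_CONN_TIMEOUT error on worker node",
    "the worker node restarts after an error",
    "liability is capped at the annual fee",
]

def idf(term: str) -> float:
    """Smoothed inverse document frequency: rare terms score high."""
    df = sum(1 for d in docs if term in d.split())
    return math.log((len(docs) + 1) / (df + 1)) + 1

def score(query: str, doc: str) -> float:
    """Term frequency weighted by IDF, summed over query terms."""
    words = doc.split()
    return sum(words.count(t) * idf(t) for t in query.split())

# "E_CONN_TIMEOUT" appears in exactly one document, so its high IDF
# dominates the ranking; the generic word "error" contributes little
ranked = sorted(range(len(docs)), key=lambda i: score("E_CONN_TIMEOUT error", docs[i]), reverse=True)
```

A dense embedding of the same query might rank all three documents similarly, since none of them is semantically "about" the literal string E_CONN_TIMEOUT.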

Hybrid search combines the two with Reciprocal Rank Fusion (RRF). RRF is a rank-merging algorithm: instead of trying to combine raw scores (which are on incompatible scales), it converts each result list to ranked positions and combines the positional scores. It is parameter-free, robust, and consistently outperforms either method alone on standard retrieval benchmarks. Qdrant supports hybrid search natively through its sparse vector capability, provided the collection is created with both a named dense vector and a sparse vector. The sketch below uses fastembed’s BM42 sparse model for the sparse side (BGE-M3 can also emit sparse term weights, though generating them is easier through the FlagEmbedding library than through fastembed):

from fastembed import SparseTextEmbedding
from qdrant_client.models import SparseVector, NamedSparseVector, NamedVector, SearchRequest

# Note: this assumes the collection was created with a named dense vector
# ("dense") plus a sparse vector ("sparse") via sparse_vectors_config -- the
# single-vector collection created earlier would need to be recreated that way.
sparse_embedder = SparseTextEmbedding(model_name="Qdrant/bm42-all-minilm-l6-v2-attentions")

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    # Get dense and sparse embeddings for the query
    dense_embedding = list(embedder.embed([query]))[0].tolist()
    sparse_result = list(sparse_embedder.embed([query]))[0]

    results = client.search_batch(
        collection_name=COLLECTION_NAME,
        requests=[
            SearchRequest(
                vector=NamedVector(name="dense", vector=dense_embedding),
                limit=top_k,
                with_payload=True,
            ),
            SearchRequest(
                vector=NamedSparseVector(
                    name="sparse",
                    vector=SparseVector(
                        indices=sparse_result.indices.tolist(),
                        values=sparse_result.values.tolist(),
                    ),
                ),
                limit=top_k,
                with_payload=True,
            ),
        ],
    )

    # Reciprocal Rank Fusion over the two result lists (k = 60)
    scores, payloads = {}, {}
    for result_list in results:
        for rank, hit in enumerate(result_list):
            scores[hit.id] = scores.get(hit.id, 0) + 1.0 / (60 + rank + 1)
            payloads[hit.id] = hit.payload

    # Sort by combined RRF score and return the payloads
    sorted_ids = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [payloads[pid] for pid in sorted_ids]
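RRF itself is small enough to demonstrate standalone. Each hit contributes 1/(k + rank) with the conventional k = 60, so an ID ranked well in both lists beats one ranked first in only one (the ID strings and rankings below are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists by Reciprocal Rank Fusion; no score normalization needed."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits  = ["a", "b", "c"]   # dense ranking: "a" first
sparse_hits = ["b", "d", "a"]   # sparse ranking: "b" first, "a" only third
fused = rrf([dense_hits, sparse_hits])
# "b" (2nd + 1st) edges out "a" (1st + 3rd): agreement across both
# retrievers is rewarded over a single strong placement
```

Because only ranks matter, RRF never has to reconcile cosine similarities with BM25 scores, which is exactly why it works across heterogeneous retrievers.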

Reranking adds a second-pass filter that significantly improves precision. After hybrid search retrieves the top-20 candidates, a Cross-Encoder model re-scores each candidate against the original query in a more compute-intensive but higher-accuracy way. A model like cross-encoder/ms-marco-MiniLM-L-6-v2 takes (query, passage) pairs and outputs a relevance score. Filtering from 20 candidates down to the top 4 with a reranker typically improves answer quality noticeably:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 4) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

Querying: Putting It All Together With Ollama

With ingestion and retrieval in place, the query pipeline is straightforward. Retrieve the top reranked chunks, format them as context, and pass them to your local Ollama model with a prompt that instructs it to answer only from the provided material:

import ollama

def query_knowledge_base(question: str, model: str = "mistral-nemo") -> str:
    # 1. Embed the question
    question_embedding = list(embedder.embed([question]))[0].tolist()

    # 2. Retrieve top candidates from Qdrant
    search_results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=question_embedding,
        limit=20,
        with_payload=True,
    )

    candidates = [
        {"text": r.payload["text"], "source": r.payload["source"], "page": r.payload["page"]}
        for r in search_results
    ]

    # 3. Rerank to top 4
    top_chunks = rerank(question, candidates, top_n=4)

    # 4. Build grounded context
    context_parts = []
    for i, chunk in enumerate(top_chunks, start=1):
        context_parts.append(
            f"[{i}] Source: {chunk['source']}, Page {chunk['page']}\n{chunk['text']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # 5. Generate answer with Ollama
    prompt = f"""You are a helpful assistant. Answer the user's question based ONLY on the provided context.
If the context does not contain sufficient information to answer, say so explicitly.
Cite the source number (e.g., [1], [2]) when referencing specific information.

Context:
{context}

Question: {question}

Answer:"""

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    return response["message"]["content"]


# Example usage
answer = query_knowledge_base("What are the termination clauses in the vendor contract?")
print(answer)

This is the complete loop: document in, answer out, entirely on your own hardware.

Local Model Context Protocol (MCP) Servers

Running the query pipeline as a one-off Python script is fine for personal use, but it becomes limiting quickly when you want to access your knowledge base from multiple tools — an Obsidian plugin, a Claude Desktop session, Open WebUI, a custom CLI. The Model Context Protocol (MCP) solves this by turning your RAG pipeline into a lightweight service that any MCP-compatible client can call as a tool.

An MCP server is a process that listens for structured tool-call requests and returns structured responses. From an LLM client’s perspective, calling your RAG knowledge base is identical to calling any other MCP tool: the client describes what it needs, MCP routes the call, and your server executes the Qdrant query and returns the retrieved chunks. The LLM never needs to know or care that vector search is happening under the hood.

Setting up a minimal MCP server with the mcp Python library takes about 50 lines of code:

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types
import asyncio

app = Server("rag-knowledge-base")

@app.list_tools()
async def list_tools():
    return [
        types.Tool(
            name="search_knowledge_base",
            description="Search the private local knowledge base for information relevant to a query.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The question or search query",
                    },
                    "top_k": {
                        "type": "integer",
                        "description": "Number of results to return (default 4)",
                        "default": 4,
                    },
                },
                "required": ["query"],
            },
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "search_knowledge_base":
        query = arguments["query"]
        top_k = arguments.get("top_k", 4)

        embedding = list(embedder.embed([query]))[0].tolist()
        results = client.search(
            collection_name=COLLECTION_NAME,
            query_vector=embedding,
            limit=top_k * 3,
            with_payload=True,
        )
        candidates = [
            {"text": r.payload["text"], "source": r.payload["source"], "page": r.payload["page"]}
            for r in results
        ]
        top_chunks = rerank(query, candidates, top_n=top_k)

        output = "\n\n".join(
            f"[Source: {c['source']}, p.{c['page']}]\n{c['text']}" for c in top_chunks
        )
        return [types.TextContent(type="text", text=output)]

    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as streams:
        await app.run(*streams, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

With this MCP server running, you can connect it to Claude Desktop by adding it to claude_desktop_config.json, or to Open WebUI’s tool configuration. Your private Qdrant knowledge base becomes available as a tool to any LLM session without any cloud involvement.

For Obsidian users, pairing this with obsidian-local-rest-api (a community plugin that exposes your vault’s notes as a REST API) lets you build a pipeline where notes from your vault are continuously indexed into Qdrant and queryable from any MCP client. When to build an MCP server versus calling your vector DB directly from Python is a straightforward decision: build the MCP server if you want to reuse the RAG system across multiple tools and clients; skip it if you are building a single, purpose-built script.

Privacy and Zero-Cloud Architecture

The entire motivation for building this stack locally is privacy. But “it runs locally” is not a complete privacy story — there are several specific threat vectors worth addressing explicitly.

The most common accidental privacy failure in local RAG setups is using a cloud LLM for the generation step. If you ingest private HR documents into a local Qdrant database, retrieve chunks locally, and then send those chunks to the OpenAI API for generation, your data has left your machine. The retrieval was private but the generation was not. Use Ollama with a local model for the final answer generation, not a cloud API endpoint, even during development when the local model feels slower.

The second threat is accidental PII in your document corpus. If you are indexing documents that contain employee names, email addresses, patient records, or financial account numbers, those strings will be embedded verbatim into your vector store and surfaced in retrieved chunks. The presidio library from Microsoft provides a fast, local NER-based PII detection and redaction pipeline. Run it as a preprocessing step before chunking:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_pii(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

# Use in your ingestion pipeline before chunking
clean_text = redact_pii(raw_document_text)
chunks = splitter.split_text(clean_text)

Network isolation prevents any accidental outbound calls from your Docker containers. Qdrant, despite being a local service, ships with telemetry enabled by default. Disable it:

docker run -d --name qdrant \
  -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  -e QDRANT__TELEMETRY_DISABLED=true \
  qdrant/qdrant

For maximum isolation, run your entire RAG stack in a Docker Compose network with no external access, and use firewall rules to prevent outbound connections from the embedding and LLM processes. Be aware that Docker does not route published ports on an internal network, so any process that needs to reach Qdrant must run as a service on the same Compose network rather than connecting from the host. A minimal docker-compose.yml with network isolation:

version: "3.9"
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__TELEMETRY_DISABLED=true
    networks:
      - rag_internal

networks:
  rag_internal:
    driver: bridge
    internal: true  # no external internet access

Hardware considerations are worth mentioning because they directly affect the experience. The HNSW index that Qdrant uses for approximate nearest-neighbor search is loaded from disk on startup. On a 5400 RPM HDD, loading a large index can take tens of seconds and result in high cold-start query latency. On a Gen4 NVMe SSD, the same load takes under a second. If you are building a knowledge base with tens of thousands of chunks, ensure Qdrant’s storage volume lives on an NVMe drive. Once the index is warm in memory, query latency drops to single-digit milliseconds regardless of storage speed.

Evaluating and Maintaining Your Knowledge Base

A RAG system you cannot measure is one you cannot improve. The ragas library provides a suite of metrics designed specifically for RAG evaluation without requiring human annotation: context precision (are the retrieved chunks actually relevant?), answer faithfulness (does the generated answer stick to what the retrieved chunks say?), and answer relevancy (does the answer actually address the question?). Running a small evaluation set of 20–50 (question, expected answer) pairs through ragas will immediately surface whether your chunking strategy, retrieval parameters, or reranker thresholds need adjustment.
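If you want a zero-dependency smoke test before reaching for ragas, a crude keyword-overlap check over a handful of (question, must-mention-terms) pairs already catches gross retrieval regressions. The eval pairs and the answer_fn signature below are illustrative, not a substitute for ragas' model-based metrics:

```python
def keyword_recall(answer: str, required_terms: list[str]) -> float:
    """Fraction of required terms that appear in the answer (case-insensitive)."""
    hits = sum(1 for t in required_terms if t.lower() in answer.lower())
    return hits / len(required_terms)

# Tiny eval set: each question lists terms a correct answer must mention
eval_set = [
    ("What is the notice period for termination?", ["90 days"]),
    ("Who is liable for data loss?", ["vendor", "capped"]),
]

def run_eval(answer_fn) -> float:
    """Average keyword recall across the eval set; answer_fn maps question -> answer."""
    scores = [keyword_recall(answer_fn(q), terms) for q, terms in eval_set]
    return sum(scores) / len(scores)
```

Run it with answer_fn=query_knowledge_base after every chunking or retrieval change; a sudden drop in the average points at the change that broke retrieval.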

Adding new documents to a running knowledge base is straightforward: run the ingestion pipeline on the new files and upsert the resulting points into the existing collection. Qdrant’s upsert is idempotent if you use deterministic IDs (e.g., a hash of the source path and chunk index). Handling deletions requires tracking which point IDs belong to which source file; a simple SQLite table mapping (source_path, point_id) is sufficient.
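Deterministic IDs of that kind can be derived with uuid5 over the source path and chunk index, so re-ingesting an unchanged file overwrites its existing points instead of duplicating them (the namespace string here is arbitrary, but it must stay fixed for the lifetime of the collection):

```python
import uuid

# Fixed namespace: any constant UUID works, as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rag-knowledge-base")

def chunk_id(source_path: str, chunk_index: int) -> str:
    """Same (path, index) always yields the same point ID, making upserts idempotent."""
    return str(uuid.uuid5(NAMESPACE, f"{source_path}#{chunk_index}"))
```

Swapping this in for the str(uuid.uuid4()) call in the ingestion pipeline is the only change needed; Qdrant accepts UUID strings as point IDs.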

Re-embedding changed files is the trickiest maintenance case. If you update a document and re-ingest it without deleting the old chunks, you end up with duplicated content from both the old and new versions. The correct approach is to delete all existing points for a given source file before upserting the new chunks. If you stored source path in the payload, Qdrant’s payload filtering makes this a single delete call:

from qdrant_client.models import Filter, FieldCondition, MatchValue

client.delete(
    collection_name=COLLECTION_NAME,
    points_selector=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="updated_report.pdf"))]
    ),
)
# Then re-ingest the updated file
ingest_pdf(Path("./documents/updated_report.pdf"))

The result is a knowledge base that stays current with your document collection, runs entirely on your hardware, and gives your local LLM the grounding it needs to answer questions accurately rather than hallucinating. The full stack — Qdrant, BGE-M3, Ollama — requires no cloud accounts, no API keys, and no monthly subscription. Your documents stay exactly where you put them.