Self-Hosted AI Search: Combine SearXNG and a Local RAG Pipeline

You can build a private AI search engine modeled on Perplexity by combining SearXNG with a local language model running through Ollama. The stack works like this: SearXNG aggregates results from multiple search engines simultaneously, a Python scraper fetches and cleans the actual page content, and the LLM synthesizes everything into a cited answer with inline references like [1], [2]. No API keys, no telemetry, no query logging to third-party AI services. A machine with 12 GB of VRAM handles the whole pipeline, and most queries come back in 6-15 seconds.

This is a practical build guide. By the end you will have a working Docker Compose stack with three services and a minimal web interface you can open in any browser.

How the Pipeline Works

The data flow has four stages. The user submits a query. SearXNG searches Google, Bing, DuckDuckGo, and other engines simultaneously and returns ranked URLs with short snippets. A scraper fetches the top 5 URLs and extracts clean article text. That text gets chunked and passed to the LLM along with the original query, and the model produces an answer where each claim is tied to a numbered source.

The RAG part here is intentionally simple - no vector database, no embeddings, no persistent index. For live web search you do not need them. The search engine has already ranked results by relevance; all you need is to strip the HTML, break the text into manageable pieces, and give the LLM enough raw material to write a grounded answer. This approach is sometimes called “naive RAG” or “context stuffing”, and it works well precisely because the hard work of relevance ranking is already done upstream.

The three components talk over localhost HTTP: SearXNG on port 8888, Ollama on port 11434, and a FastAPI orchestration app on port 8000. The browser talks only to FastAPI, which coordinates the rest.

Deploying SearXNG with Docker Compose

Start with SearXNG since everything else depends on it. The official Docker image is straightforward to deploy. Create a docker-compose.yaml with a dedicated internal network:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8888:8080"
    volumes:
      - ./searxng:/etc/searxng
    networks:
      - search-net

networks:
  search-net:
    driver: bridge

The ./searxng volume mount holds your settings.yml. Two changes in that file are mandatory before anything else works.

First, enable JSON output format. Without this, you cannot query SearXNG programmatically - it only returns HTML by default:

search:
  formats:
    - html
    - json

Second, disable the rate limiter. It exists to prevent abuse on public instances but is unnecessary on a private deployment:

server:
  limiter: false

For search engine selection, enable Google, Bing, DuckDuckGo, Brave Search, and Wikipedia. Disable social media engines - they tend to surface low-quality results for technical queries. SearXNG ships with roughly 70 available engines and most are off by default, so this is mainly a matter of turning on the ones you want.
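The toggles live in the same settings.yml, under the engines key. A sketch of what the overrides look like - the entry names here follow SearXNG's engine list, but check them against your instance's preferences page before copying:

```yaml
engines:
  - name: google
    disabled: false
  - name: bing
    disabled: false
  - name: brave
    disabled: false
  - name: wikipedia
    disabled: false
  # Social media engines off for technical queries
  - name: reddit
    disabled: true
```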

Once the container is running, test JSON output directly in a terminal:

curl "http://localhost:8888/search?q=linux+kernel+6.12&format=json" | jq '.results[:3]'

You should get back an array with title, url, and content fields for each result. The content field is the search snippet - typically 100-200 characters. That is nowhere near enough text for the LLM to synthesize a meaningful answer, which is where the scraper layer comes in.

SearXNG’s memory footprint is minimal, around 150 MB RAM, so it does not add meaningful overhead to your server.
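The orchestration code later in this guide assumes a query_searxng helper wrapping this JSON endpoint. A minimal sketch - the function and helper names are this guide's own, and the URL assumes the port mapping above:

```python
def parse_results(data: dict, limit: int = 10) -> tuple[list[str], list[str]]:
    # Pull (url, snippet) pairs out of SearXNG's JSON response.
    results = data.get("results", [])[:limit]
    return [r["url"] for r in results], [r.get("content", "") for r in results]

async def query_searxng(query: str) -> tuple[list[str], list[str]]:
    import httpx  # local import: only this coroutine needs the dependency
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "http://localhost:8888/search",
            params={"q": query, "format": "json"},
            timeout=10.0,
        )
        response.raise_for_status()
        return parse_results(response.json())
```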

Building the Scraper and Chunking Layer

The scraper has one job: fetch URLs and return clean article text. The tricky part is that web pages vary wildly in structure. Some are clean article pages; others are buried under navigation bars, sidebars, cookie banners, and footer links that produce garbage if you just convert HTML to text naively.

Trafilatura handles this well. It is specifically designed to identify the main content block of a page and discard the surrounding clutter - far more reliable than writing custom BeautifulSoup selectors per site. Set up a three-step fallback chain for cases where trafilatura comes up empty:

  1. Try trafilatura.extract()
  2. Fall back to readability-lxml if that returns empty
  3. Fall back to BeautifulSoup with get_text() as a last resort
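A sketch of that chain, with a stdlib tag-stripper added as a final guard so the function never hard-fails (the guard is this guide's addition, not part of the three libraries):

```python
import re
from html.parser import HTMLParser

class _TagStripper(HTMLParser):
    # Bare-bones extractor used as the last resort: drop script/style
    # contents, keep everything else as whitespace-normalized text.
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def _strip_tags(html: str) -> str:
    parser = _TagStripper()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def extract_text(html: str) -> str:
    # 1. trafilatura: best at isolating the main content block.
    try:
        import trafilatura
        text = trafilatura.extract(html)
        if text:
            return text
    except Exception:
        pass
    # 2. readability-lxml: summary() returns the main block as HTML.
    try:
        from readability import Document
        text = _strip_tags(Document(html).summary())
        if text:
            return text
    except Exception:
        pass
    # 3. BeautifulSoup over the whole page, else the stdlib stripper.
    try:
        from bs4 import BeautifulSoup
        return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    except Exception:
        return _strip_tags(html)
```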

Fetch URLs concurrently using httpx and asyncio. A 5-second timeout per request prevents slow-loading pages from stalling the entire pipeline:

import asyncio
import httpx
import trafilatura

async def fetch_and_extract(client: httpx.AsyncClient, url: str) -> str:
    try:
        response = await client.get(url, timeout=5.0, follow_redirects=True)
        text = trafilatura.extract(response.text)
        return text or ""
    except Exception:
        return ""

async def scrape_urls(urls: list[str]) -> list[str]:
    headers = {"User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)"}
    async with httpx.AsyncClient(headers=headers) as client:
        tasks = [fetch_and_extract(client, url) for url in urls]
        return await asyncio.gather(*tasks)

After fetching, chunk each page’s text into 1000-character segments with 100-character overlap. The overlap ensures that sentences split at a chunk boundary still appear in their full form in at least one chunk. Prepend each chunk with its source number so the LLM knows which URL to cite:

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_context(scraped_pages: list[str], urls: list[str]) -> str:
    parts = []
    for i, (text, url) in enumerate(zip(scraped_pages, urls), 1):
        if not text:
            continue
        for chunk in chunk_text(text):
            parts.append(f"[Source {i}: {url}]\n{chunk}\n")
    return "\n".join(parts)

Keep the total assembled context under 6000 tokens - roughly 24 KB of text. This leaves enough room in an 8K-16K context window for the system prompt, user query, and the generated answer. If you have a model with a larger context window and enough VRAM, you can expand this and pull in more source material.
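One simple way to enforce that budget is a character cap using the rough 4-characters-per-token heuristic. trim_context is a helper name made up for this guide; you would call it on the output of build_context:

```python
MAX_CONTEXT_CHARS = 24_000  # ~6000 tokens at ~4 characters per token

def trim_context(context: str, max_chars: int = MAX_CONTEXT_CHARS) -> str:
    # Drop whole source-labelled chunks from the end until the budget
    # fits, so no chunk is cut off mid-sentence.
    if len(context) <= max_chars:
        return context
    kept: list[str] = []
    total = 0
    for part in context.split("\n\n"):
        if total + len(part) + 2 > max_chars:
            break
        kept.append(part)
        total += len(part) + 2
    return "\n\n".join(kept)
```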

LLM Synthesis and Citation

Model choice matters less than prompt structure here. Both Qwen 2.5 7B Instruct and Mistral 7B Instruct v0.3 produce reliable cited answers at Q4_K_M quantization. The system prompt needs to be explicit about using only the provided sources and about citation format:

You are a research assistant. Answer the user's question using ONLY the
provided sources. Cite sources inline using [1], [2], etc. where the number
matches the source number in the context. If the sources do not contain enough
information to answer the question, say so. Do not invent information that
is not present in the sources.

Structure the user prompt to clearly separate the question from the source material:

Question: {query}

Sources:
{context}

Provide a comprehensive answer with inline citations.

Call the Ollama API with specific parameters tuned for factual output:

import httpx

SYSTEM_PROMPT = "..."  # prompt above

async def generate_answer(query: str, context: str) -> str:
    payload = {
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {query}\n\nSources:\n{context}"}
        ],
        "options": {
            "temperature": 0.3,
            "num_ctx": 16384,
            "num_predict": 1024
        },
        "stream": False
    }
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json=payload,
            timeout=60.0
        )
        return response.json()["message"]["content"]

Temperature 0.3 keeps the output factual without pushing the model into repetitive loops. After getting the response, run a quick post-processing pass to strip any citation numbers that exceed your actual source count. Occasional hallucinated citations like [8] when you only provided 5 sources are easy to catch with a regex and should be removed before displaying to the user.
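A minimal version of that post-processing pass (scrub_citations is a name made up for this guide):

```python
import re

def scrub_citations(answer: str, source_count: int) -> str:
    # Remove bracketed citation numbers that point past the real source
    # list, e.g. [8] when only 5 sources were provided.
    def repl(match: re.Match) -> str:
        return match.group(0) if int(match.group(1)) <= source_count else ""
    return re.sub(r"\[(\d+)\]", repl, answer)
```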

Typical timing breakdown: SearXNG returns in 1-2 seconds, concurrent scraping takes 2-5 seconds, and generation takes 3-8 seconds depending on the model and your GPU. The overall 6-15 second window is in line with what Perplexity delivers, though latency is less consistent since you are scraping live websites rather than drawing on a pre-indexed corpus.

The Orchestration Layer and Web Interface

Wire everything into a FastAPI app. The main endpoint accepts a query string and coordinates the three stages:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str

@app.post("/search")
async def search(request: SearchRequest):
    urls, snippets = await query_searxng(request.query)
    pages = await scrape_urls(urls[:5])
    context = build_context(pages, urls[:5])
    answer = await generate_answer(request.query, context)
    return {
        "answer": answer,
        "sources": [{"url": u, "snippet": s} for u, s in zip(urls, snippets)]
    }

Add a streaming endpoint using Server-Sent Events so the browser can display the answer as it generates. The 3-8 second generation window is much less noticeable when tokens appear progressively. Use Ollama’s stream: true mode and forward each chunk as an SSE event:

import json
from fastapi.responses import StreamingResponse

@app.post("/search/stream")
async def search_stream(request: SearchRequest):
    urls, snippets = await query_searxng(request.query)
    pages = await scrape_urls(urls[:5])
    context = build_context(pages, urls[:5])

    async def generate():
        sources = [{"url": u, "snippet": s} for u, s in zip(urls, snippets)]
        yield f"data: {json.dumps({'sources': sources})}\n\n"
        async for token in stream_answer(request.query, context):
            yield f"data: {json.dumps({'token': token})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
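The stream_answer generator used above is not shown earlier; a sketch against Ollama's streaming /api/chat output might look like this. Each streamed line is a JSON object carrying a partial message, and extract_token is a helper name invented here:

```python
import json

SYSTEM_PROMPT = "..."  # the research-assistant prompt from the synthesis section

def extract_token(line: str) -> str:
    # Return the partial message text from one streamed line, or "" for
    # blank keep-alive lines and the final "done" line.
    if not line.strip():
        return ""
    return json.loads(line).get("message", {}).get("content", "")

async def stream_answer(query: str, context: str):
    import httpx  # local import: only this coroutine needs the dependency
    payload = {
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {query}\n\nSources:\n{context}"},
        ],
        "options": {"temperature": 0.3, "num_ctx": 16384, "num_predict": 1024},
        "stream": True,
    }
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST", "http://localhost:11434/api/chat", json=payload, timeout=120.0
        ) as response:
            async for line in response.aiter_lines():
                token = extract_token(line)
                if token:
                    yield token
```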

The frontend can be a single static HTML file. An input box, a results area, and a source sidebar are all you need. One caveat: the browser’s native EventSource API only supports GET requests, and the streaming endpoint above is a POST, so consume the SSE stream with a fetch() call and a ReadableStream reader instead - still no JavaScript framework required.

For the complete three-service Docker Compose setup:

services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8888:8080"
    volumes:
      - ./searxng:/etc/searxng
    networks:
      - search-net

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - search-net

  app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - searxng
      - ollama
    networks:
      - search-net
    environment:
      - SEARXNG_URL=http://searxng:8080
      - OLLAMA_URL=http://ollama:11434

volumes:
  ollama-data:

networks:
  search-net:
    driver: bridge

The depends_on directive makes Docker start SearXNG and Ollama before the app, but it only orders container startup - it does not wait for the services inside to be ready. For production use within a home network, also add a health check against Ollama’s /api/tags endpoint so the app waits until the server is actually responding before accepting requests.
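One way to express that health check, assuming the ollama CLI inside the container can reach its own server (ollama list queries the same /api/tags endpoint), is a Compose healthcheck plus the long-form depends_on condition:

```yaml
  ollama:
    # ...existing config...
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 10s
      timeout: 5s
      retries: 12

  app:
    # ...existing config...
    depends_on:
      searxng:
        condition: service_started
      ollama:
        condition: service_healthy
```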

Running the Stack

With all three services running, pull your chosen model into Ollama once:

docker exec -it <ollama-container> ollama pull qwen2.5:7b-instruct-q4_K_M

Then point your browser at http://localhost:8000. Type a question. SearXNG fires off requests to multiple search engines in parallel, the scraper fetches real page content from the top results, and typically within 6-15 seconds you get a synthesized answer with numbered citations linking back to the original sources.

The architecture gives you a meaningful privacy advantage over cloud AI search tools. Your queries go to SearXNG, which passes them to public search engines as anonymous requests with no persistent session cookie or account association. The LLM processing - the part that actually reads your question and formulates an answer - happens entirely on your hardware. Nothing about the synthesis step touches an external service.

One practical note: SearXNG’s aggregation can occasionally hit rate limits on individual search engines, especially if you are running many queries in quick succession. Spreading requests across a larger set of enabled engines reduces the chance of any single engine throttling you. If you notice partial results on some queries, check SearXNG’s engine statistics page at http://localhost:8888/stats to see which engines are erroring out.

For follow-up questions, store the last two or three query-answer pairs in the session and include them in the LLM prompt. Most 7B models can handle a few turns of conversation context without losing coherence, which makes the system usable for iterative research rather than just one-off lookups. You can also add a simple Redis cache keyed on query hashes to avoid re-scraping and re-generating for repeated queries - a common pattern when multiple household members share the same instance.

Extending the Stack

Once the basic pipeline is working reliably, a few additions improve day-to-day usability without much added complexity.

A browser bookmarklet that sends selected text to your local search endpoint is useful for quick lookups while reading. A simple fetch() call to http://localhost:8000/search with the selected text as the query body is all it takes. On mobile, you can expose the FastAPI app through a local reverse proxy or a WireGuard tunnel and add the URL as a home screen shortcut.

If answer quality on broad or ambiguous queries feels inconsistent, the most targeted fix is adding a reranking step after scraping. Instead of passing all chunks in document order, score each chunk’s relevance to the query using a small cross-encoder model (ms-marco-MiniLM works well and runs entirely on CPU) and sort by score before assembling the context block. This costs a few hundred milliseconds but noticeably improves results when the top search results mix on-topic and off-topic content.
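The reranking step itself is a few lines once scoring is factored out. In this sketch the scorer is injected as a callable so you can plug in a cross-encoder or any cheaper heuristic; the function name is this guide's own:

```python
from typing import Callable

def rerank_chunks(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_k: int = 12,
) -> list[str]:
    # Sort chunks by relevance to the query, highest first, and keep the
    # best top_k instead of passing everything in document order.
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)[:top_k]
```

With sentence-transformers, score would wrap CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict on (query, chunk) pairs; in practice you would batch all pairs through one predict call rather than scoring them one at a time.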

For households where multiple people will use the same instance, basic HTTP authentication on the FastAPI app prevents unintended external access. A single HTTPBasic dependency added to your route handlers is enough for a local network deployment. Pair it with nginx as a reverse proxy if you want TLS.