Personal AI Research Assistant: Local Semantic Search

You can build a personal AI research assistant that ingests PDFs, web bookmarks, and notes into a local ChromaDB vector store. It answers questions with cited sources using Ollama and a local LLM like Llama 4 Scout. The system uses sentence-transformers to embed your documents into a searchable index. When you ask a question, it pulls relevant passages and writes an answer that cites the exact source and page. The whole stack runs offline on consumer hardware, so your research data stays private.
Once you have a few hundred documents indexed, you can ask plain questions across your whole library and get cited answers. That saves real time during daily research.
Why Build Your Own Instead of Using ChatGPT or Perplexity
Cloud AI assistants like ChatGPT and Perplexity are good at searching the public web. They can’t search your PDF library, your Obsidian vault, or your saved bookmarks. The most useful research context is almost always in your private collection, not on the open internet.
Even with 200K token context windows, you can’t paste a whole library into one chat. Vector search fixes this by pulling only the relevant passages from any size library. A 10,000-document collection is just as searchable as a 100-document one. Query time stays under a second.
Privacy is a hard rule for many use cases. Academic work in progress, business documents, medical records, legal case files: none of these can go to cloud services. Local processing means zero data leakage. Your documents never leave your machine.

Session memory is the next gap. Cloud assistants forget everything between chats. Your local vector store sticks around and grows over time. Every document you add makes the system more useful. After six months of steady use, you have a knowledge base that no cloud tool can copy, because it holds your specific set of sources.
Cost is simple. Querying your own knowledge base with a local LLM costs nothing per query. If you run 50 or more queries a day during active research, the savings over API tools add up fast.
You also control every knob. The embedding model, chunk size, retrieval strategy, and generation model are all yours to tune. You can tune for your own domain, such as academic papers, technical docs, or legal briefs. That beats a general tool that works fine for everyone but is not great for anyone.
System Architecture and Component Selection
The research assistant has four parts: document ingestion, embedding and indexing, semantic search, and answer generation. Here is what each part needs, and which tools work well on consumer hardware.
For ingestion, PyMuPDF (v1.25+) pulls text from PDFs, including multi-column layouts and footnotes. markdownify turns saved HTML pages into clean text. python-docx handles Word files. Plain Markdown notes are read as is.
Documents need to be split into chunks the embedding model can handle. Use RecursiveCharacterTextSplitter from LangChain with 512-token chunks and 50-token overlap. This size trades off precision against context. Smaller chunks are more precise but lose context. Larger chunks keep context but dilute the embedding.
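A minimal sketch of that setup, assuming the langchain-text-splitters package; the from_tiktoken_encoder variant counts tokens, where the plain constructor would count characters:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-aware splitting so chunk_size=512 matches the sizing above.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_text(page_text)  # page_text: text from one ingested page
```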
The embedding model converts text chunks into vectors. Two good options:
| Model | Dimensions | Size | Speed | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~80 MB | ~14K docs/sec | Fast setup, smaller collections |
| nomic-embed-text-v2 | 768 | ~270 MB | ~5K docs/sec | Better quality, production use |
Start with all-MiniLM-L6-v2 from sentence-transformers to get running fast. Switch to Nomic’s embedding model later if retrieval quality needs a lift.
For the vector store, use ChromaDB v0.6+ with persistent storage. Initialize it with:
```python
import chromadb

client = chromadb.PersistentClient(path="./research_db")
collection = client.get_or_create_collection("research")
```

ChromaDB supports metadata filters, partial updates, and scales to over a million documents on one machine. The persistent client saves to disk on its own, so your index survives restarts. For a Qdrant-based option with BGE-M3 embeddings, see the guide on setting up a private local RAG knowledge base.
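As a quick illustration of those metadata filters, a query can be scoped with a where clause. This sketch uses query_texts, which relies on ChromaDB's built-in default embedder; the rest of this guide passes its own embeddings instead. The document_type field follows the metadata schema described in the ingestion section:

```python
# Illustration only: combine semantic search with a metadata filter.
results = collection.query(
    query_texts=["contrastive learning for retrieval"],
    n_results=5,
    where={"document_type": "pdf"},
)
```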
The generation model runs through Ollama. llama4-scout:17b at Q4_K_M quantization needs about 10 GB of VRAM and gives good answers with citations. If your GPU has less memory, mistral-nemo:12b works well too. Both support the system/user/assistant message format that RAG prompts need. For a sharper model on a budget card, the offload tricks for running Gemma 4 26B MoE on 8GB VRAM apply here as well.
The whole thing is run by a single Python script. No framework needed. Use httpx for Ollama API calls, chromadb for vector ops, and Rich for terminal output. It all fits in a few hundred lines of code.
Ingesting and Indexing Your Documents
Personal document collections are messy. PDFs have broken text extraction. Web pages carry boilerplate nav. Notes show up in many formats. Getting ingestion right, with metadata kept intact, is what makes the system useful instead of annoying.
PDF ingestion with PyMuPDF is straightforward:
```python
import pymupdf

def ingest_pdf(filepath):
    doc = pymupdf.open(filepath)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text()
        if text.strip():
            pages.append({"text": text, "page": i + 1})
    return pages
```

PyMuPDF handles multi-column layouts and keeps reading order better than tools like pdfplumber. For scanned PDFs without embedded text, you would need OCR (Tesseract or EasyOCR). That’s a separate project.
Web bookmark ingestion strips boilerplate from saved HTML:
```python
from markdownify import markdownify

def ingest_html(filepath, source_url=None):
    with open(filepath) as f:
        html = f.read()
    text = markdownify(html, strip=["script", "style", "nav", "footer"])
    return {"text": text, "source_url": source_url}
```

For live URLs, fetch with httpx.get(url) first, then convert. Always store the source URL as metadata so citations can link back to the original page.
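A small sketch of that flow (ingest_url is a helper name introduced here, not part of the earlier code):

```python
import httpx
from markdownify import markdownify

def ingest_url(url):
    # Fetch the live page, then reuse the same HTML-to-text conversion.
    resp = httpx.get(url, follow_redirects=True, timeout=30)
    resp.raise_for_status()
    text = markdownify(resp.text, strip=["script", "style", "nav", "footer"])
    return {"text": text, "source_url": url}
```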
Markdown and Obsidian notes work better with semantic chunking. Don’t split on token counts. Split on headings to make chunks that hold one idea each:
```python
import re

def split_markdown_by_headings(text):
    sections = re.split(r'\n(?=#{1,3}\s)', text)
    return [s.strip() for s in sections if s.strip()]
```

Keep the heading path as metadata (section: "Literature Review > Methodology") so you know where a retrieved chunk came from.
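One way to track that path, reusing split_markdown_by_headings and the re import from above (chunk_markdown_with_paths is a name introduced here for illustration):

```python
def chunk_markdown_with_paths(text):
    # Record the heading trail for each chunk so citations can show
    # where in the note a passage came from.
    chunks = []
    path = []
    for section in split_markdown_by_headings(text):
        match = re.match(r'(#{1,3})\s+(.+)', section)
        if match:
            level = len(match.group(1))
            path = path[:level - 1] + [match.group(2).strip()]
        chunks.append({"text": section, "section": " > ".join(path)})
    return chunks
```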
The metadata schema is the key to useful citations. Every chunk gets:
- source_file - original filename or URL
- page_number - for PDFs
- section_heading - for structured documents
- date_added - when the document was ingested
- document_type - pdf, web, or note
- content_hash - SHA-256 for deduplication
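A sketch of a helper that fills this schema for one chunk; make_metadata is a name introduced here, and empty fields are dropped because ChromaDB metadata values must be strings, numbers, or booleans:

```python
import hashlib
from datetime import datetime, timezone

def make_metadata(source_file, text, doc_type, page=None, section=None):
    # One metadata dict per chunk, following the schema above.
    meta = {
        "source_file": source_file,
        "page_number": page,
        "section_heading": section,
        "date_added": datetime.now(timezone.utc).isoformat(),
        "document_type": doc_type,
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
    }
    return {k: v for k, v in meta.items() if v is not None}
```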
Batch embedding and indexing processes documents efficiently:
```python
from sentence_transformers import SentenceTransformer
import hashlib

model = SentenceTransformer("all-MiniLM-L6-v2")

def index_chunks(collection, chunks, metadatas):
    ids = [hashlib.sha256(c.encode()).hexdigest()[:16] for c in chunks]
    embeddings = model.encode(chunks, batch_size=100,
                              show_progress_bar=True).tolist()
    collection.add(
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
```

Process in batches of 100 chunks. On a modern CPU, embedding 1,000 PDF pages takes about 2 minutes, with extraction and chunking included. The resulting vector store is around 50 MB for 10,000 chunks.
Partial updates stop you from reprocessing unchanged files. On re-ingestion, check the content_hash against existing entries. Skip unchanged docs, re-embed changed ones, and drop deleted sources. A small SQLite table next to ChromaDB tracks what has been ingested and when:
```python
import sqlite3

def init_tracking_db():
    conn = sqlite3.connect("./research_db/tracking.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ingested (
            filepath TEXT PRIMARY KEY,
            content_hash TEXT,
            last_ingested TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    return conn
```

Querying - Semantic Search with Cited Answers
The query pipeline is the part you use day to day. You ask a plain question and get a written answer with citations pointing to specific documents and pages.
Retrieval embeds the query with the same model used for indexing, then searches ChromaDB:
```python
def retrieve(collection, query, n_results=10):
    q_embedding = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=q_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    return results
```

A relevance filter stops the LLM from seeing junk context. Drop results with cosine distance above 0.7 (where 0 is identical and 2 is opposite). Without this filter, weak matches can lead the model astray:
```python
def filter_results(results, max_distance=0.7):
    filtered = {"documents": [], "metadatas": []}
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        if dist <= max_distance:
            filtered["documents"].append(doc)
            filtered["metadatas"].append(meta)
    return filtered
```

Context assembly formats retrieved chunks so the LLM can cite each one by number:
```python
def format_context(filtered):
    passages = []
    for i, (doc, meta) in enumerate(
        zip(filtered["documents"], filtered["metadatas"]), 1
    ):
        source = meta.get("source_file", "unknown")
        page = meta.get("page_number", "")
        page_str = f", p.{page}" if page else ""
        passages.append(f"[{i}] (source: {source}{page_str}) \"{doc[:500]}\"")
    return "\n\n".join(passages)
```

The generation prompt tells the LLM to use only the provided context and to cite sources:
```
Answer the question using ONLY the provided context passages.
Cite your sources using [1], [2], etc.
If the context doesn't contain enough information, say so.

Context:
{passages}

Question: {query}

Answer:
```

Calling Ollama streams the response for an interactive feel:
```python
import json
import httpx

def generate_answer(query, context):
    system_prompt = f"""Answer the question using ONLY the provided context
passages. Cite your sources using [1], [2], etc. If the context doesn't
contain enough information, say so.

Context:
{context}"""
    # httpx.stream keeps the connection open so tokens arrive as Ollama
    # generates them, instead of buffering the whole response.
    with httpx.stream(
        "POST",
        "http://localhost:11434/api/chat",
        json={
            "model": "llama4-scout",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query}
            ],
            "stream": True
        },
        timeout=120
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if "message" in chunk:
                yield chunk["message"].get("content", "")
```

After generation, parse the [N] references from the answer. Check that each cited passage really exists in the context. Flag any citations that point outside the provided set. That catches hallucinated citations, which local LLMs sometimes produce.
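A minimal check along those lines (check_citations is a helper introduced here):

```python
import re

def check_citations(answer, n_passages):
    # Pull out every [N] marker and flag references to passages that
    # were never included in the prompt context.
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    invalid = sorted(n for n in cited if n < 1 or n > n_passages)
    return sorted(cited), invalid
```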
Follow-up queries work by keeping conversation history. When a user asks “Tell me more about the method in source [3]”, the system can re-search the same source with a refined query aimed at that section.
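One way to wire that up is a filtered re-search: take the metadata of the cited passage and query only within its source file. This sketch reuses the embedding model and collection from earlier; follow_up is a name introduced here:

```python
def follow_up(collection, refined_query, cited_meta, n_results=5):
    # Re-search only within the document behind a cited passage by
    # filtering on its source_file metadata.
    q_embedding = model.encode([refined_query]).tolist()
    return collection.query(
        query_embeddings=q_embedding,
        n_results=n_results,
        where={"source_file": cited_meta["source_file"]},
    )
```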
Building the User Interface
You have three good interface options, based on how you like to work.
The CLI mode is the simplest. It’s a loop that reads queries, runs the pipeline, and prints formatted answers:
```python
from rich.console import Console
from rich.markdown import Markdown

console = Console()

while True:
    query = input("\nQuery: ").strip()
    if not query:
        continue
    if query.startswith("/"):
        handle_command(query)
        continue
    results = retrieve(collection, query)
    filtered = filter_results(results)
    context = format_context(filtered)
    console.print("\n[bold]Answer:[/bold]")
    answer = ""
    for chunk in generate_answer(query, context):
        console.print(chunk, end="")
        answer += chunk
    console.print()
```

Rich handles colored output and markdown, so citations and code blocks stay readable in the terminal.
A TUI built with Textual gives a richer terminal feel. You can build a split-panel layout in about 100 lines of Textual code. Put the query input at the bottom, the streaming answer in the main panel, and a sidebar that shows retrieved sources with metadata. For terminal users, this hits the right balance of simple and useful.
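A bare-bones sketch of that layout, reusing the retrieval functions from the CLI section; a real app would run the pipeline in a background worker so the UI stays responsive while the model generates:

```python
from textual.app import App, ComposeResult
from textual.containers import Horizontal
from textual.widgets import Input, Static

class ResearchApp(App):
    """Split-panel layout: answer panel, sources sidebar, query input."""

    def compose(self) -> ComposeResult:
        with Horizontal():
            yield Static(id="answer")
            yield Static(id="sources")
        yield Input(placeholder="Ask a question...")

    def on_input_submitted(self, event: Input.Submitted) -> None:
        results = retrieve(collection, event.value)
        filtered = filter_results(results)
        context = format_context(filtered)
        self.query_one("#sources", Static).update(
            "\n".join(m.get("source_file", "?") for m in filtered["metadatas"])
        )
        self.query_one("#answer", Static).update(
            "".join(generate_answer(event.value, context))
        )

if __name__ == "__main__":
    ResearchApp().run()
```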
For a browser-based option, Gradio gets you a chat interface with minimal code:
```python
import gradio as gr

def query_pipeline(message, history):
    results = retrieve(collection, message)
    filtered = filter_results(results)
    context = format_context(filtered)
    answer = "".join(generate_answer(message, context))
    return answer

gr.ChatInterface(fn=query_pipeline, title="Research Assistant").launch()
```

Gradio creates a chat-style web UI at localhost:7860. You can add a file upload widget for drag-and-drop ingestion, so users can add sources without touching the command line.
Useful commands for the CLI and TUI modes: /ingest path/to/file.pdf adds a new document on the fly. /sources lists every indexed file with its metadata. /stats shows vector store totals (chunks, doc count, storage size). /clear resets conversation history.
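A sketch of the handle_command dispatcher referenced in the CLI loop; ingest_and_index and conversation_history stand in for whatever ingestion helper and history structure you use:

```python
def handle_command(query):
    # Dispatch the slash commands listed above.
    cmd, _, arg = query.partition(" ")
    if cmd == "/ingest":
        ingest_and_index(arg)  # placeholder for your ingestion + indexing steps
    elif cmd == "/sources":
        metas = collection.get(include=["metadatas"])["metadatas"]
        for source in sorted({m.get("source_file", "unknown") for m in metas}):
            console.print(source)
    elif cmd == "/stats":
        console.print(f"{collection.count()} chunks indexed")
    elif cmd == "/clear":
        conversation_history.clear()  # whatever structure holds past turns
    else:
        console.print(f"Unknown command: {cmd}")
```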
One handy tip: lazy-load the embedding model and LLM on the first query, not at startup. Show a “Loading models…” note during that first load. After that, models stay in memory. Later queries return in 1 to 3 seconds, based on your hardware and the number of retrieved passages.
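A minimal version of that pattern for the embedding model; the same trick works for deferring the first Ollama call:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embedder():
    # Import and load on first use; cached for the rest of the session.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")
```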
Putting It All Together
The complete system requires these Python packages:
```
chromadb>=0.6.0
sentence-transformers>=3.0
pymupdf>=1.25.0
markdownify>=0.14
httpx>=0.27
rich>=13.0
```

Install Ollama on its own and pull your generation model with ollama pull llama4-scout. The first run takes longest. It downloads the embedding model (~80 MB) and builds the initial index. After that, adding new documents is incremental and fast.
A practical workflow looks like this. Save useful PDFs and web pages to one folder. Run the ingestion script every so often, or set up a file watcher with watchdog. Query when you need to find something in your collection. The system gets more useful with every doc you add. After a few months of steady use, it turns into a real research tool, not a demo.
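A sketch of such a watcher; the folder name and the ingest_and_index helper are placeholders for your own setup:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class InboxHandler(FileSystemEventHandler):
    # Ingest any new file dropped into the watched folder.
    def on_created(self, event):
        if not event.is_directory:
            ingest_and_index(event.src_path)  # placeholder for your pipeline

observer = Observer()
observer.schedule(InboxHandler(), "./research_inbox", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```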
The whole codebase fits in under 500 lines of Python. No frameworks, no fancy deploy, no cloud accounts. Just a local vector store, a local LLM, and your own documents.