Agentic RAG: How to Let Your LLM Decide When and What to Retrieve

Agentic RAG replaces the standard “retrieve-then-generate” pattern by giving the LLM tool-use capabilities to autonomously decide when to retrieve, which knowledge sources to query, how to reformulate queries for better results, and whether the retrieved context is sufficient or needs additional searches. Instead of blindly fetching documents on every user query, the model acts as an orchestrator - issuing targeted searches across multiple vector stores, SQL databases, and web sources, then self-verifying answers before responding. This approach achieves 15-25% higher answer accuracy than naive RAG on multi-hop question answering benchmarks while cutting unnecessary retrieval calls by roughly 35%.
The practical implementation uses a framework like LangGraph or a custom agent loop with Ollama and tool calling. You define retrieval tools, expose them to the LLM, and let the model reason about what it needs before fetching anything. The result is a system that handles ambiguous queries, multi-source questions, and simple factual requests equally well.
Why Naive RAG Fails and Agentic RAG Fixes It
The standard RAG pipeline is straightforward: take the user query, embed it, run a vector search, stuff the top-k results into the prompt, and generate an answer. This works for simple, direct questions where the answer sits neatly in a single document chunk. It falls apart in several common scenarios.
Ambiguous queries are one failure mode. A user asking “why is my build so slow” could mean Docker builds, CI/CD pipelines, or local compilation. Naive RAG just embeds the vague query and hopes for the best. Multi-source questions are another problem - comparing pricing models across cloud providers, for instance, demands several targeted queries against different document sections. Sometimes the top-k results simply don’t contain the answer, and no amount of re-ranking will fix it because the query itself was wrong. And often the query doesn’t need retrieval at all.
That last point matters more than people expect. In a typical chatbot deployment, 30-40% of user queries are conversational or factual questions the LLM already knows the answer to. Retrieving irrelevant context for “What is the capital of France?” or “Explain what a REST API is” actually degrades answer quality. The model gets confused by unrelated document chunks injected into its context window.
The agentic approach addresses all of these by giving the LLM a search_knowledge_base(query) tool and letting it make decisions: Should I search? What exactly should I search for? Was the result useful? Should I search again with a different query?
Consider the difference in practice: naive RAG is a librarian who grabs five books every time you ask any question, regardless of whether you asked about quantum physics or what time the library closes. Agentic RAG is a researcher who thinks about what they need, checks specific references, cross-references findings, and verifies their conclusions before giving you an answer.
Architecture of an Agentic RAG System
An agentic RAG system has four core components: an LLM with tool-calling capability, a set of retrieval tools exposed to the LLM, one or more knowledge sources, and an orchestration layer managing the agent loop.
For the LLM, you need a model that supports native tool calling. Llama 4 Scout via Ollama, Claude through the Anthropic API, or GPT-4o all work. The key requirement is that the model can output structured tool calls rather than just text, so the orchestration layer knows when the model wants to search versus when it wants to respond.
The retrieval tools are functions the LLM can invoke. Typical tool definitions look like:
- `search_documents(query: str, collection: str) -> list[str]` for vector similarity search
- `query_database(sql: str) -> list[dict]` for structured data lookups
- `web_search(query: str) -> list[str]` for real-time information from the internet
Each tool returns results the LLM can reason about. The tool descriptions in the system prompt tell the model what each tool does and when to use it.
The agent loop follows a consistent pattern:
- The LLM receives the user query plus a system prompt describing available tools
- The LLM decides to call a tool or respond directly
- If a tool call is requested, the orchestration layer executes it and returns results to the LLM
- The LLM evaluates the results and decides on the next action
- This repeats until the LLM generates a final answer or hits a maximum iteration count
Query routing happens implicitly. You don’t need a separate classifier to decide which knowledge source to query. The LLM reads the tool descriptions and makes that decision based on the query content. A question about “deployment config” naturally routes to the technical docs vector store, while “Q3 revenue” goes to the SQL database. The model handles this routing as part of its reasoning process.
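As a toy illustration of that implicit routing, here are two hypothetical tool stubs (the names `search_tech_docs` and `query_finance_db` are invented for this sketch). The docstrings are the only routing signal the model sees; the tool schema shown to the LLM is built from them, so "deployment config" lands on the docs tool and "Q3 revenue" on the database tool purely through the model's reasoning.

```python
def search_tech_docs(query: str) -> str:
    """Search technical documentation: deployment, configuration,
    infrastructure, and API references."""
    return f"[docs] results for: {query}"   # placeholder retrieval

def query_finance_db(sql: str) -> str:
    """Run a read-only SQL query against the finance database:
    revenue, invoices, and quarterly reports."""
    return f"[db] rows for: {sql}"          # placeholder lookup

# What the model actually reads when deciding where to route a query:
TOOL_DESCRIPTIONS = {
    "search_tech_docs": search_tech_docs.__doc__,
    "query_finance_db": query_finance_db.__doc__,
}
```

No classifier appears anywhere; improving routing means improving these descriptions.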
State management matters a lot here. The full conversation and tool-call history stays in the agent’s message buffer. This lets the LLM reference previous search results and avoid redundant queries. If the first search returned partial information, the model can issue a follow-up query that targets the gap rather than repeating what it already found.
For termination, you typically set two conditions: the LLM generates a final answer with no tool call attached, or the loop hits a maximum iteration count of 5-8 steps. The iteration cap prevents runaway loops where the model keeps searching without converging on an answer.
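A framework-free sketch of this loop and both termination conditions. Here `call_llm` and `run_tool` are hypothetical stand-ins for a real model client and tool registry; `call_llm` is assumed to return either a final answer or a tool-call request.

```python
MAX_ITERATIONS = 8  # hard cap so the loop cannot run away

def agent_loop(user_query: str, call_llm, run_tool) -> str:
    """Minimal agent loop. call_llm(messages) returns either
    {"answer": str} or {"tool": str, "args": dict}."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(MAX_ITERATIONS):
        decision = call_llm(messages)
        if "answer" in decision:          # termination 1: no tool call attached
            return decision["answer"]
        result = run_tool(decision["tool"], decision["args"])
        messages.append({"role": "tool", "content": result})
    # termination 2: iteration cap reached without convergence
    return "I could not find a confident answer within the search budget."
```

The message buffer carries the full tool-call history, so each LLM turn sees every earlier result.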

Implementing Agentic RAG with LangGraph
LangGraph provides a structured way to build agent loops with conditional routing and state management. Here is a concrete implementation using LangGraph, ChromaDB, and a local LLM through Ollama.
Start with the dependencies:
```shell
pip install langgraph langchain langchain-community chromadb langchain-ollama
```

LangGraph v0.3+ provides the StateGraph abstraction for building agent loops. Define the agent state first:
```python
from typing import TypedDict
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: list[BaseMessage]
    retrieved_docs: list[str]
    iteration: int
```

The state flows through the graph and accumulates context across iterations. Next, define retrieval tools using LangChain's @tool decorator:
```python
from langchain_core.tools import tool
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

@tool
def search_docs(query: str, collection: str = "default") -> str:
    """Search the knowledge base for relevant documents.
    Use this when you need to find information about internal
    documentation, procedures, or technical references."""
    col = client.get_collection(collection)
    results = col.query(query_texts=[query], n_results=5)
    formatted = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        source = meta.get("source", "unknown")
        formatted.append(f"[Source: {source}]\n{doc}")
    return "\n\n---\n\n".join(formatted)
```

Build the graph with nodes for the agent (LLM reasoning), tools (tool execution), and a conditional router:
```python
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama4-scout", temperature=0.1)
llm_with_tools = llm.bind_tools([search_docs])

def agent_node(state: AgentState) -> AgentState:
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": state["messages"] + [response]}

def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"
    return END

tool_node = ToolNode([search_docs])

builder = StateGraph(AgentState)
builder.add_node("agent", agent_node)
builder.add_node("tools", tool_node)
builder.set_entry_point("agent")
builder.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
builder.add_edge("tools", "agent")
graph = builder.compile()
```

Run the agent with a query:
```python
from langchain_core.messages import HumanMessage

result = graph.invoke({
    "messages": [HumanMessage(content="How do I configure SSL for our nginx proxy?")],
    "retrieved_docs": [],
    "iteration": 0,
})
print(result["messages"][-1].content)
```

The graph executes the agent loop until the LLM produces a response without tool calls. For real-time visibility into the reasoning process, use streaming:
```python
for event in graph.stream(
    {"messages": [HumanMessage(content="Compare our deployment options")]},
    stream_mode="messages",
):
    print(event)
```

This shows each LLM reasoning step and tool call as it happens, which is useful for debugging and for giving end users visibility into how the system arrives at its answer.
Advanced Patterns - Query Rewriting, Re-Ranking, and Self-Verification
The basic agent loop handles when to retrieve and what to query. Several additional patterns push retrieval quality much higher in practice.
Query Rewriting
Before the vector search runs, the LLM reformulates the user’s raw question into an optimized search query. “Why is my Docker build so slow?” becomes “Docker build performance optimization layer caching multi-stage.” You can implement this as a system prompt instruction (“Before searching, reformulate the user’s question into effective search keywords”) or as a dedicated rewriting step in the graph. Either way, the search query hits closer to the vocabulary actually used in your documents.
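A minimal sketch of the dedicated rewriting step. The prompt wording is illustrative, and `llm` is a hypothetical callable mapping a prompt string to a completion string; plug in your actual model client.

```python
REWRITE_PROMPT = (
    "Reformulate the user's question into 3-6 search keywords that match "
    "technical documentation vocabulary. Return only the keywords.\n"
    "Question: {question}"
)

def rewrite_query(question: str, llm) -> str:
    """Turn a raw user question into an optimized search query."""
    return llm(REWRITE_PROMPT.format(question=question)).strip()
```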
Hypothetical Document Embedding (HyDE)
HyDE takes a different approach to the query-document vocabulary gap. Instead of embedding the query directly, have the LLM generate a hypothetical answer paragraph, then embed that paragraph for the vector search. Since the hypothetical answer uses similar vocabulary and structure to actual documents in the knowledge base, similarity search becomes more effective. This technique improves recall by 10-15% on domain-specific corpora where user queries use different terminology than the source documents.
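A sketch of HyDE under the same assumptions: `llm`, `embed`, and `vector_search` are hypothetical stand-ins for a model client, an embedding model, and a vector store, and the prompt text is illustrative.

```python
HYDE_PROMPT = (
    "Write a short, factual paragraph that directly answers the question, "
    "as it might appear in internal documentation.\nQuestion: {question}"
)

def hyde_search(question: str, llm, embed, vector_search, k: int = 5):
    """Embed a hypothetical answer instead of the raw query, then
    run the vector search against that embedding."""
    hypothetical = llm(HYDE_PROMPT.format(question=question))
    return vector_search(embed(hypothetical), k=k)
```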
Multi-Query Expansion
For complex questions, the LLM generates 3-5 sub-queries, each targeting a different aspect of the original question. Each sub-query gets searched independently, and results are merged and deduplicated before being presented back to the LLM. For a question like “Compare AWS Lambda and Cloud Functions pricing for 1M requests/month,” the agent might generate separate queries for Lambda pricing tiers, Cloud Functions pricing model, and serverless cost comparison benchmarks.
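The expand-search-merge-deduplicate flow might look like this (`llm` and `search` are hypothetical callables; `llm` is assumed to return one sub-query per line):

```python
def expand_and_search(question: str, llm, search, max_queries: int = 5):
    """Generate sub-queries, search each independently, then merge
    results while dropping duplicates."""
    raw = llm(
        f"Break this question into up to {max_queries} focused search "
        f"queries, one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in raw.splitlines() if q.strip()][:max_queries]
    merged, seen = [], set()
    for q in sub_queries:
        for doc in search(q):
            if doc not in seen:      # dedupe across sub-queries
                seen.add(doc)
                merged.append(doc)
    return merged
```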
Re-Ranking Retrieved Results
The initial vector search returns the top-20 candidates using cosine similarity, which is fast but approximate. A cross-encoder model like cross-encoder/ms-marco-MiniLM-L-12-v2 then re-scores each candidate against the original query, considering the full interaction between query and document rather than just their embedding distance. Keep the top-5 after re-ranking. This step adds 100-200ms of latency but significantly improves the relevance of what the LLM actually sees in its context window.
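A sketch of the re-ranking step. The `score` callable stands in for a cross-encoder call, e.g. `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict` from sentence-transformers applied to (query, document) pairs.

```python
def rerank(query: str, candidates: list[str], score, top_n: int = 5) -> list[str]:
    """Re-score each candidate against the query and keep the best.
    score(query, doc) -> float; higher means more relevant."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]
```

Feed it the top-20 from the vector search and pass the top-5 on to the LLM.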
Self-Verification
After generating an answer, you can add a verification step where the LLM checks each claim against the retrieved sources. If a claim isn’t grounded in the retrieved documents, the agent either retrieves additional evidence or qualifies the statement. This reduces hallucination rates without requiring a separate fact-checking model. You implement it by appending a verification prompt after the initial answer generation, asking the model to cite sources for each factual claim.
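One way to sketch the verification pass; the prompt wording and the `llm` callable are illustrative, not a fixed recipe.

```python
VERIFY_PROMPT = (
    "For each factual claim in the answer below, cite the source chunk "
    "that supports it. Rewrite any unsupported claim as a qualified "
    "statement or flag it as UNSUPPORTED.\n\n"
    "Sources:\n{sources}\n\nAnswer:\n{answer}"
)

def verify_answer(answer: str, sources: list[str], llm) -> str:
    """Second LLM pass that grounds the draft answer in the
    retrieved chunks before it reaches the user."""
    prompt = VERIFY_PROMPT.format(sources="\n---\n".join(sources), answer=answer)
    return llm(prompt)
```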
Confidence-Based Routing
The LLM assigns a confidence score (1-5) to its answer. Scores below 3 trigger additional retrieval passes or escalation to a different knowledge source. This gives the system a built-in fallback - it automatically tries harder on questions where the initial retrieval didn’t produce strong enough evidence. In practice, about 15-20% of queries in a typical deployment hit the low-confidence path and benefit from the extra retrieval round.
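A sketch of the confidence gate, assuming a hypothetical `ask` callable that returns both an answer and the model's 1-5 self-reported confidence, and a `retrieve_more` callable for the extra retrieval pass.

```python
def answer_with_confidence(question: str, ask, retrieve_more, threshold: int = 3):
    """Route low-confidence answers through one extra retrieval round.
    ask(question) -> (answer, confidence in 1-5)."""
    answer, confidence = ask(question)
    if confidence < threshold:           # weak evidence: try harder
        extra_context = retrieve_more(question)
        answer, confidence = ask(
            f"{question}\n\nAdditional context:\n{extra_context}"
        )
    return answer, confidence
```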
Evaluation and Monitoring for Production Agentic RAG
Agentic systems are harder to evaluate than simple pipelines because the LLM’s retrieval decisions vary across runs. The same question might trigger two tool calls one time and four the next. You need systematic measurement across multiple dimensions.
Track these key metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Answer accuracy | Correctness against ground truth | >85% |
| Retrieval precision | Relevance of retrieved docs | >70% |
| Retrieval recall | Coverage of needed information | >80% |
| Avg tool calls/query | Efficiency of agent reasoning | 2-4 |
| End-to-end latency | Total response time | <10s (p80) |
Build an evaluation dataset of 50-100 question-answer pairs with annotated source documents. Include three categories: single-hop questions (one retrieval needed), multi-hop questions (requiring multiple retrievals), and no-retrieval questions (testing whether the agent correctly skips retrieval). This spread tests all pathways through the system.
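The retrieval precision and recall figures can be computed per query against the annotated source documents, for example:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict:
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of annotated relevant docs that were retrieved."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }
```

Average these over the evaluation set and compare against the targets in the table.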
For automated evaluation, Promptfoo or RAGAS can score factual accuracy, faithfulness (whether the answer is grounded in retrieved documents), and relevance. Run evaluations on every pipeline change:
```shell
promptfoo eval --config eval.yaml
```
Cost monitoring matters especially for agentic RAG because the multi-step reasoning uses more LLM calls than naive RAG. On average, agentic RAG makes 2-4 LLM calls per query versus 1 for naive RAG. Set alerts if the average exceeds 6 calls per query - that usually indicates the agent is struggling with a class of questions and needs prompt tuning or better tool descriptions.
Latency profiling should instrument each step separately: LLM inference time, vector search time, SQL query time, and any re-ranking overhead. The main bottleneck is almost always LLM inference, not retrieval. If you are running locally with Ollama, model quantization and GPU memory allocation have the biggest impact on total response time.
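A lightweight way to instrument each step is a timing context manager; this is a sketch, and the commented-out calls are placeholders for your actual pipeline stages.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(step: str):
    """Accumulate wall-clock time per named pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - start

# Usage around each stage:
# with timed("llm_inference"): response = llm.invoke(messages)
# with timed("vector_search"): docs = search_docs(query)
# with timed("rerank"): top = rerank(query, docs, score)
```

Dumping `timings` per request makes it obvious which stage dominates; expect LLM inference to lead.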
Before fully committing to agentic RAG in production, run both systems in parallel. Shadow mode lets you route production traffic to both naive and agentic RAG, compare answer quality ratings and cost per query, and build confidence that the added complexity actually delivers better results for your specific use case and document corpus. Not every RAG application benefits from the agentic pattern - if your queries are consistently simple and single-hop, the overhead of agent reasoning may not be worth it.