LLM Security: 7-Stage Defense Pipeline Against Prompt Injection

You can harden LLM apps against prompt injection and data leaks by stacking defenses. Input cleanup strips control tokens before they hit the model. Output filters scan replies for PII and secrets. Structured output forces the model to follow a fixed schema. Add a system prompt firewall that walls off trusted rules from user input. Together they turn one bare API call into a pipeline. Bad prompts get caught before the model runs. Risky data gets redacted after. No single layer is bulletproof. Stacked, they cut the attack surface enough that most attacks fail at one layer or another.
What follows is a practical breakdown of each layer, with the tools, code, and config you can drop into a FastAPI app today.
Understanding Prompt Injection: Attack Vectors
Prompt injection isn’t one bug. It’s a family of attacks that exploit a basic fact: LLMs can’t tell instructions apart from data. Before you build defenses, you need a clear picture of the main types.
The simplest is direct injection. The user sends phrases like “Ignore previous instructions” or “You are now DAN” to override the system prompt. Even current models like GPT-4o and Claude Opus 4 still fall for this when no input filter is in place. The attack works because the model reads the system prompt and user input as one token stream with no hard wall between them.
Indirect injection is more dangerous and harder to catch. The bad rules don’t sit in the user’s message. They sit in outside data the LLM pulls in during RAG retrieval: web pages, PDFs, database rows, email bodies. Greshake et al. first wrote it up in 2023. It’s still the most common real-world attack vector. It abuses the trust line between your data pipeline and the model’s input. A poisoned doc in your vector store can tell the model to leak data or shift behavior, without the user ever typing anything malicious.
Multi-turn jailbreaks spread the payload across many turns. The attacker opens with an innocent message, slowly shifts the context, and drops the injection only after the model is primed to comply. Tools like Garak v0.9 can run multi-turn attacks against your endpoint on their own.
Encoding-based attacks slip past naive string-match filters. They hide commands in Base64, ROT13, Unicode look-alikes, or hidden zero-width characters. The model decodes them as it runs, even when your input filter sees nothing odd. A filter that only checks English injection phrases will miss SWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw== (Base64 for “Ignore previous instructions”).
Tool-use exploits target apps that give the LLM function calling. Attackers craft prompts that trick the model into firing risky tools: deleting files, running SQL, or hitting internal services over HTTP. They do it by stuffing tool-call JSON inside user messages. If your app blindly runs whatever tool calls the model emits, one injection can turn into full remote code execution.
Context window stuffing pads user input with thousands of tokens of junk text. The goal is to shove the system prompt out of the model’s real attention window. The system prompt is still in context on paper. Still, the model weighs it less when it’s drowned in user tokens. The model follows rules less reliably, and other injection tricks land more often.
Input Sanitization: Filtering Before Inference
The first line of defense is catching bad content before it reaches the model. Input sanitization is cheap, fast, and kills the most obvious attacks.
The biggest win is a dedicated prompt injection classifier running as a pre-filter. Several options exist at different price and effort points. Rebuff is open source and runs locally. Lakera Guard offers a hosted API with sub-50ms latency. For the most control, fine-tune a DeBERTa-v3 model on the Gandalf Ignore Instructions dataset. This can hit over 95% recall on known injection patterns. Run the classifier on every user input before it enters your prompt template. If it flags the input, log the attempt and return a plain error.
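As a minimal sketch of that pre-filter, the snippet below assumes a Hugging Face transformers text-classification pipeline and a fine-tuned DeBERTa-style checkpoint; the model name and label string are placeholders for whatever classifier you train or buy.

```python
from transformers import pipeline

# Placeholder checkpoint name -- substitute your own fine-tuned DeBERTa-v3
# classifier or a hosted detector of your choice.
classifier = pipeline("text-classification", model="your-org/injection-classifier")

def is_injection(user_input: str, threshold: float = 0.85) -> bool:
    """Return True when the classifier flags the input as an injection attempt."""
    result = classifier(user_input, truncation=True)[0]
    # Label names depend on how the classifier was trained; adjust to match yours.
    return result["label"] == "INJECTION" and result["score"] >= threshold
```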
Token-level cleanup is another quick win. Strip or escape special tokens from user input before you join it with the system prompt. Tokens like <|im_start|>, <|endoftext|>, [INST], and <s> are chat-template delimiters. If an attacker drops them into a message, they can break the shape of your prompt. A simple regex pass costs microseconds and blocks format-string attacks on chat-template parsers.
```python
import re

SPECIAL_TOKENS = re.compile(
    r'<\|im_start\|>|<\|im_end\|>|<\|endoftext\|>|\[INST\]|\[/INST\]|<s>|</s>'
)

def sanitize_input(text: str) -> str:
    return SPECIAL_TOKENS.sub('', text)
```

The “sandwich defense” is worth setting up too. Place your system prompt both before and after the user input in the message sequence. The model then sees trusted instructions on both sides of the untrusted content. It isn’t foolproof. Still, it measurably cuts the success rate of direct injection attacks.
Input length caps and character-class limits catch a different set of attacks. Cap user messages at 4,000 tokens. Reject inputs with more than 5% non-ASCII characters, unless your app needs Unicode input. Flag messages with high entropy scores. English text usually sits around 4.0 to 4.2 bits per character, so anything above 4.5 bits/char is worth a look for encoded payloads.
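Here is a minimal sketch of those three checks, using a simple character-frequency entropy estimate; the 4,000-token cap is approximated as a character cap, and the thresholds mirror the numbers above.

```python
import math
from collections import Counter

MAX_CHARS = 16000          # roughly 4,000 tokens at ~4 characters per token
MAX_NON_ASCII_RATIO = 0.05
MAX_ENTROPY_BITS = 4.5

def shannon_entropy(text: str) -> float:
    """Character-frequency entropy in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_basic_checks(text: str) -> bool:
    if not text or len(text) > MAX_CHARS:
        return False
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    if non_ascii / len(text) > MAX_NON_ASCII_RATIO:
        return False
    return shannon_entropy(text) <= MAX_ENTROPY_BITS
```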
At the API level, split data and rules using whatever your provider gives you. With Anthropic’s API, use the system field only for trusted rules and keep all user content in user role messages. With OpenAI’s API, use the newer developer message role for the same job. This split doesn’t stop injection on its own. It does give the model a stronger hint about which content to trust.
Finally, log every flagged input. Pipe each detected attempt to a structured log in JSON lines format. Include a timestamp, user ID, a hash of the raw input, and the classifier’s confidence score. Feed the logs into your SIEM or monitoring stack to spot patterns. Injection attempts often come in bursts from the same source. Catching them early lets you block bad actors at the network level before they find a bypass.
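A minimal sketch of that JSON-lines logging, with illustrative field names and a local file as the sink (swap in your log shipper of choice):

```python
import hashlib
import json
import time

def log_flagged_input(user_id: str, raw_input: str, score: float,
                      path: str = "injection_attempts.jsonl") -> None:
    """Append one JSON line per flagged input; store a hash rather than the raw text."""
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "input_sha256": hashlib.sha256(raw_input.encode("utf-8")).hexdigest(),
        "classifier_score": score,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```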
Output Filtering: Scanning Responses for Leaks and Hallucinations
Even with clean inputs, the model can still leak data. Training-data recall, PII pulled in from RAG context, and made-up URLs are all risks input filters can’t fix. A post-inference filter needs to scan every reply before it reaches the user.
Start with regex-based PII detection on every response. Match Social Security numbers (\d{3}-\d{2}-\d{4}), credit card numbers (Luhn-validated 13-19 digit sequences), email addresses, phone numbers, and API key patterns (like sk-[a-zA-Z0-9]{48} for OpenAI keys). Microsoft Presidio v2.4 ships a production-ready build with many entity types and locales. The lighter scrubadub Python library works well if you need fewer entity types with lower overhead.
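If you roll your own instead of reaching for Presidio, a minimal regex pass with Luhn validation for card candidates might look like this; the patterns are the ones named above, not an exhaustive set.

```python
import re

PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "OPENAI_KEY": re.compile(r"\bsk-[a-zA-Z0-9]{48}\b"),
    "CARD_CANDIDATE": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of a card-number candidate."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return len(digits) >= 13 and total % 10 == 0

def find_pii(text: str) -> list[tuple[str, str]]:
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            # Only keep card candidates that pass the Luhn check.
            if label == "CARD_CANDIDATE" and not luhn_valid(match.group()):
                continue
            hits.append((label, match.group()))
    return hits
```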
Regex catches set patterns, but it misses loose PII like person names, street addresses, and org names. A named entity recognition model fills this gap. spaCy’s en_core_web_trf transformer pipeline or the newer GLiNER model can spot these entities with high accuracy. Both run in under 100ms on a CPU. Layer NER on top of your regex filters for wider coverage.
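Layering NER on top of the regex pass can be as simple as the sketch below, shown with spaCy; which entity labels count as sensitive is your call.

```python
import spacy

# en_core_web_trf must be downloaded first: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "LOC"}

def find_named_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs for entity types the regex layer can't catch."""
    doc = nlp(text)
    return [(ent.label_, ent.text) for ent in doc.ents if ent.label_ in SENSITIVE_LABELS]
```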
Hallucinated URLs are another common problem. LLMs often generate plausible-looking URLs that don’t exist or point to the wrong place. For any URL in the model’s response, check it against a known-good allowlist. If it’s not on the list, fire a HEAD request with a 2-second timeout. Flag any URL that returns a 404, redirects to an odd domain, or times out. This stops your app from sending users to dead links. Worse, it stops you from sending them to a domain an attacker registered to match the hallucination.
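A rough sketch of that URL check, using the requests library; the allowlisted domains are placeholders for your own.

```python
import re
import requests

ALLOWED_DOMAINS = {"example.com", "docs.example.com"}  # your known-good domains
URL_PATTERN = re.compile(r"https?://[^\s)\"']+")

def suspicious_urls(response_text: str) -> list[str]:
    flagged = []
    for url in URL_PATTERN.findall(response_text):
        domain = url.split("/")[2].lower()
        if domain in ALLOWED_DOMAINS:
            continue
        try:
            head = requests.head(url, timeout=2, allow_redirects=True)
            final_domain = head.url.split("/")[2].lower()
            # Flag 4xx/5xx responses and redirects that land on another domain.
            if head.status_code >= 400 or final_domain != domain:
                flagged.append(url)
        except requests.RequestException:
            flagged.append(url)  # timeouts and connection errors are also flagged
    return flagged
```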
A content safety classifier adds another layer. Llama Guard 3 (an 8B model) or OpenAI’s Moderation API can flag replies with hate speech, self-harm advice, illegal-activity guides, or other harm types. Running Llama Guard locally gives you full control and keeps your replies off a third-party API. The hosted moderation endpoints are simpler to wire up if your latency and privacy rules allow.
When PII is detected, redact rather than block. Refusing to respond at all frustrates users and gives no useful output. Instead, replace the detected entities with placeholder tokens like [REDACTED-SSN] or [REDACTED-EMAIL]. Add a short note that explains the redaction. The response stays useful and the data stays inside.
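Redaction itself can be a straightforward substitution pass over the matches from the detectors sketched earlier; the placeholder format follows the examples above.

```python
def redact(text: str, findings: list[tuple[str, str]]) -> str:
    """Replace each detected entity with a typed placeholder such as [REDACTED-EMAIL]."""
    for label, value in findings:
        text = text.replace(value, f"[REDACTED-{label}]")
    if findings:
        text += "\n\n(Note: some content was redacted because it contained sensitive data.)"
    return text
```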
Rate-limiting output volume catches exfiltration attempts. If a single session generates more than 10KB of output in under 60 seconds, flag it for review. This pattern often means an attacker has tricked the model into dumping its full context window. Set up automated alerts for sessions that hit this threshold.
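A minimal in-memory version of that check follows; it only works within a single process, so a shared store like Redis would be needed across workers.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_BYTES_PER_WINDOW = 10 * 1024

_session_output: dict[str, deque] = defaultdict(deque)  # session_id -> (timestamp, bytes)

def record_output(session_id: str, response_text: str) -> bool:
    """Track output volume per session; return True when the session should be flagged."""
    now = time.time()
    window = _session_output[session_id]
    window.append((now, len(response_text.encode("utf-8"))))
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    return sum(size for _, size in window) > MAX_BYTES_PER_WINDOW
```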
Structured Generation: Constraining Output to Safe Schemas
Constrained decoding wipes out whole classes of bugs. It forces the model to emit output that fits a fixed schema. If the model can only emit valid JSON with set field types and value rules, it physically can’t produce free-text injection payloads or leak random data in its replies.
Outlines v0.2, used with vLLM or llama.cpp, applies JSON schema rules at the token level as the model runs. The library masks out invalid tokens’ logits at each decoding step, so the model can only pick tokens that are valid continuations of the target schema. The output is valid JSON that fits your schema. Not “usually valid”: guaranteed.
For hosted API users, Instructor v1.6 for Python wraps OpenAI and Anthropic API calls with Pydantic model checks. Define your response as a Pydantic BaseModel with typed fields, validators, and limits. Instructor handles retries when the model emits invalid output.
```python
from pydantic import BaseModel, Field
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

class SentimentResult(BaseModel):
    sentiment: str = Field(..., pattern="^(positive|neutral|negative)$")
    confidence: float = Field(..., ge=0.0, le=1.0)
    summary: str = Field(..., max_length=500)

result = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyze sentiment: Great product!"}],
    response_model=SentimentResult,
)
```

The major API providers also offer native structured output modes. Anthropic’s tool use with input_schema, OpenAI’s response_format: { type: "json_schema" }, and Google Gemini’s response_schema field all enforce schemas server-side. These are the lowest-friction option when you’re already on a hosted API.
Defensive schema design counts as much as the enforcement layer. Use enum fields for fixed-choice outputs so the model can’t emit odd values. Apply max_length on all string fields to cap output size. Use Optional fields with sane defaults for graceful degradation when the model can’t fill every slot. The tighter your schema, the smaller the attack surface.
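Those rules map directly onto Pydantic field definitions. A small illustrative schema (the field names are made up for the example):

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Category(str, Enum):
    billing = "billing"
    technical = "technical"
    other = "other"

class TicketSummary(BaseModel):
    category: Category                                             # enum: no odd values possible
    summary: str = Field(..., max_length=300)                      # hard cap on output size
    follow_up: Optional[str] = Field(default=None, max_length=200) # graceful degradation
```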
One key catch: schema rules block structural attacks but not semantic ones. The model can still drop a Social Security number inside an allowed string field. Run your PII regex and NER filters on the string values inside the structured output, not just on free-text replies.
On the performance side, structured generation with Outlines adds 5 to 15% latency on vLLM versus unconstrained generation. On llama.cpp with grammar-based sampling (GBNF), overhead stays under 10% for schemas with fewer than 50 fields. That’s a small price for guaranteed output validity.
Putting It All Together: A Defense-in-Depth Pipeline
No single guardrail is enough. Each method above tackles one attack class, and attackers will probe for whichever layer you skipped. The goal is a pipeline where bypassing one layer still leaves several more in the way.
The recommended pipeline order is:
- Input length and character validation
- Prompt injection classifier
- Input sanitization and token stripping
- System prompt isolation with sandwich defense
- LLM inference with structured generation
- Output PII and content filtering
- Response delivery
In a FastAPI app, you can chain these steps with dependency injection. Each dependency processes the request and either passes it through or raises an exception.
```python
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    user_id: str
    message: str

# injection_classifier, sanitize_input, call_llm, output_filter, and logger
# are application-level objects assumed to be defined elsewhere.

async def validate_input(request: ChatRequest) -> ChatRequest:
    if len(request.message) > 16000:  # ~4,000 tokens at roughly 4 characters per token
        raise HTTPException(422, "Input too long")
    return request

async def detect_injection(request: ChatRequest = Depends(validate_input)) -> ChatRequest:
    if injection_classifier.predict(request.message) > 0.85:
        logger.warning("Injection detected", extra={"user": request.user_id})
        raise HTTPException(422, "Request flagged by safety filter")
    return request

async def sanitize(request: ChatRequest = Depends(detect_injection)) -> ChatRequest:
    request.message = sanitize_input(request.message)
    return request

@app.post("/chat")
async def chat(request: ChatRequest = Depends(sanitize)):
    response = await call_llm(request.message)
    filtered = output_filter.process(response)
    return {"response": filtered}
```

Feature flags make tuning practical. In dev, log warnings but let all traffic through so you can measure false positive rates. In prod, block flagged inputs and redact flagged outputs. This lets you tune sensitivity without a redeploy. A simple env variable or feature flag service works fine.
Before you deploy, test with Garak v0.9. It’s an open-source LLM vulnerability scanner with probe suites for prompt injection, encoding attacks, and data leakage. Run it against your endpoint and measure bypass rates before and after you turn your guardrails on. If your bypass rate isn’t below 5% on Garak’s standard probe set, you have gaps to fill.

In prod, track the injection detection rate, the false positive rate (from user complaints and manual review), the output redaction rate, and p99 latency overhead per pipeline stage. Prometheus and Grafana are the standard stack for this. Set up alerts for sudden spikes in detection rate (a live attack) or sudden drops (your classifier may be failing in silence).
Failure handling deserves real thought. If the injection classifier times out or throws, fail open with stronger logging rather than fail closed. Blocking all requests because your safety classifier is down is a self-inflicted denial of service. Instead, tighten the downstream output filters to make up for it. Drop the PII redaction threshold and lower the content safety classifier’s confidence bar so more borderline replies get filtered.
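In code, the fail-open pattern is a try/except around the classifier call that signals the caller to tighten downstream filtering. A sketch, assuming the injection_classifier and logger objects from the pipeline example:

```python
def check_injection_or_fail_open(text: str, threshold: float = 0.85) -> tuple[bool, bool]:
    """Return (flagged, classifier_degraded)."""
    try:
        flagged = injection_classifier.predict(text) > threshold
        return flagged, False
    except Exception:
        # Classifier is down: let the request through, log loudly, and tell the
        # caller to run output-side filters in their strictest mode.
        logger.error("Injection classifier unavailable; failing open")
        return False, True
```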
The full pipeline usually adds 100 to 300ms of latency per request. The exact number depends on whether you run classifiers locally or call external APIs. For most chat apps, that’s well inside the budget. For latency-critical apps, run the input classifier and output filter async where you can. Use the fastest model variants: quantized DeBERTa for input checks, distilled NER for output filtering.
Building guardrails isn’t a one-time job. New attack tricks show up often. Injection classifiers need retraining on fresh adversarial samples. Your app’s attack surface shifts every time you add a new tool or data source to the LLM pipeline. Treat your guardrail pipeline like any other security-critical system. Keep testing it, keep watching it, and keep updating it as new threats land.