Guardrails for LLM Apps: How to Prevent Prompt Injection and Data Leaks

Contents

You can harden LLM-powered applications against prompt injection and data leaks by layering multiple defenses: input sanitization that strips control tokens before they reach the model, output filtering that scans responses for PII and secrets, and structured generation that constrains model output to a predefined schema. Combined with a system prompt firewall that separates trusted instructions from untrusted user input, these defenses turn a single unguarded API call into a pipeline where adversarial prompts are detected before inference and sensitive data is redacted after inference. None of these techniques alone is bulletproof, but stacked together they reduce attack surface to a level that makes exploitation impractical for most threat models.

What follows is a practical breakdown of each layer, with concrete tools, code patterns, and configuration you can drop into a production FastAPI application today.

Understanding Prompt Injection: Attack Vectors

Prompt injection is not a single vulnerability. It is a family of attack techniques that exploit the fundamental inability of LLMs to distinguish between instructions and data. Before building defenses, you need a clear picture of the major categories.

The simplest form is direct injection, where the user sends input containing phrases like “Ignore previous instructions” or “You are now DAN” to override the system prompt. Even current-generation models like GPT-4o and Claude Opus 4 remain partially susceptible when no input filtering is in place. The attack works because the model processes the system prompt and user input as a single token stream with no hard boundary between them.

Indirect injection is more dangerous and harder to catch. Malicious instructions are embedded not in the user’s message but in external data sources - web pages, PDFs, database rows, email bodies - that the LLM processes during RAG retrieval. First documented by Greshake et al. in 2023, indirect injection remains the most common real-world attack vector because it exploits the trust boundary between your application’s data pipeline and the model’s input context. A poisoned document in your vector store can instruct the model to exfiltrate data or change its behavior without the end user typing anything adversarial.

Multi-turn jailbreaks spread the adversarial payload across multiple conversation turns. The attacker starts with an innocent-looking message, gradually shifts the conversation context, and delivers the injection payload only after the model has been primed to comply. Tools like Garak v0.9 can simulate multi-turn attacks automatically against your endpoint.

Encoding-based attacks bypass naive string-matching filters by encoding instructions in Base64, ROT13, Unicode homoglyphs, or invisible zero-width characters. The model decodes these during generation even when your input filter sees nothing suspicious. A filter that only checks for English-language injection phrases will miss SWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw== (Base64 for “Ignore previous instructions”).

Tool-use exploitation targets applications that give the LLM access to function calling. Attackers craft prompts that trick the model into calling dangerous tools - file deletion, SQL execution, HTTP requests to internal services - by embedding tool-call JSON structures in user messages. If your application blindly executes whatever tool calls the model generates, a single injection can become a full remote code execution vulnerability.

Context window stuffing pads user input with thousands of tokens of irrelevant text to push the system prompt out of the model’s effective attention window. Even though the system prompt is technically still in context, the model’s attention mechanism gives it less weight when it is surrounded by a massive volume of user-controlled tokens. This reduces instruction-following fidelity and makes other injection techniques more likely to succeed.

Input Sanitization: Filtering Before Inference

The first line of defense is catching adversarial content before it reaches the model. Input sanitization is cheap, fast, and eliminates the most obvious attacks.

The most impactful addition is a dedicated prompt injection classifier running as a pre-filter. Several options exist at different price and complexity points. Rebuff is open source and runs locally. Lakera Guard provides a hosted API with sub-50ms latency. For maximum control, fine-tune a DeBERTa-v3 model on the Gandalf Ignore Instructions dataset - this approach can achieve over 95% recall on known injection patterns. Run the classifier on every user input before it enters your prompt template. If the classifier flags the input, log the attempt and return a generic error response.

Token-level sanitization is another quick win. Strip or escape special tokens from user input before concatenating it with the system prompt. Tokens like <|im_start|>, <|endoftext|>, [INST], and <s> are chat-template delimiters that can break the structure of your prompt if an attacker includes them in their message. A simple regex pass that removes these tokens costs microseconds and prevents format-string attacks on chat-template parsers.

import re

SPECIAL_TOKENS = re.compile(
    r'<\|im_start\|>|<\|im_end\|>|<\|endoftext\|>|\[INST\]|\[/INST\]|<s>|</s>'
)

def sanitize_input(text: str) -> str:
    return SPECIAL_TOKENS.sub('', text)

The “sandwich defense” is worth implementing too. Place your system prompt both before and after the user input in the message sequence so the model sees trusted instructions on both sides of the untrusted content. It is not foolproof, but it measurably reduces the success rate of direct injection attacks.

Input length limits and character-class restrictions catch a different set of attacks. Cap user messages at 4,000 tokens. Reject inputs containing more than 5% non-ASCII characters (unless your application specifically requires Unicode input). Flag messages with high entropy scores - English text typically has around 4.0-4.2 bits per character, so anything above 4.5 bits/char is worth inspecting for encoded payloads.

At the API level, separate data and instructions using whatever mechanism your provider offers. With Anthropic’s API , use the system parameter exclusively for trusted instructions and confine all user content to user role messages. With OpenAI’s API , use the developer message role introduced in 2025 for the same purpose. This separation does not prevent injection on its own, but it gives the model a stronger signal about which content to treat as authoritative.

Finally, log every flagged input. Pipe detected injection attempts to a structured log in JSON lines format with timestamp, user ID, a hash of the raw input, and classifier confidence score. Feed these logs into your SIEM or monitoring stack for pattern analysis. Injection attempts often come in bursts from the same source, and early detection lets you block bad actors at the network level before they find a bypass.

Output Filtering: Scanning Responses for Leaks and Hallucinations

Even with perfectly sanitized inputs, the model can still leak sensitive data. Training data memorization, PII regurgitation from RAG context, and hallucinated URLs are all risks that input filtering cannot address. A post-inference filter needs to scan every response before it reaches the user.

Start with regex-based PII detection on every response. Match Social Security numbers (\d{3}-\d{2}-\d{4}), credit card numbers (Luhn-validated 13-19 digit sequences), email addresses, phone numbers, and API key patterns (like sk-[a-zA-Z0-9]{48} for OpenAI keys). Microsoft Presidio v2.4 provides a production-ready implementation with support for multiple entity types and locales. The lighter-weight scrubadub Python library is a good alternative if you need fewer entity types with lower overhead.

Presidio identifies and redacts multiple PII entity types simultaneously, replacing them with labelled placeholder tokens

Regex catches structured patterns, but it misses unstructured PII like person names, street addresses, and organization names. A named entity recognition model fills this gap. spaCy’s en_core_web_trf transformer pipeline or the newer GLiNER model can identify these entities with high accuracy, running in under 100ms on a CPU. Layer NER on top of your regex filters for broader coverage.

Hallucinated URLs are another common problem. LLMs frequently generate plausible-looking URLs that either do not exist or point to unexpected destinations. For any URL in the model’s response, check it against a known-good allowlist. If the URL is not on the list, perform a HEAD request with a 2-second timeout. Flag any URL that returns a 404, redirects to an unexpected domain, or times out. This prevents your application from sending users to dead links or, worse, to domains controlled by an attacker who registered the hallucinated URL.

A content safety classifier adds another layer. Llama Guard 3 (8B parameter model) or OpenAI’s Moderation API can flag responses containing hate speech, self-harm instructions, illegal activity guides, or other harmful content categories. Running Llama Guard locally gives you full control and avoids sending your responses to a third-party API, but the hosted moderation endpoints are simpler to integrate if latency and privacy constraints allow.

When PII is detected, redact rather than block. Instead of refusing to respond entirely - which frustrates users and provides no useful output - replace the detected entities with placeholder tokens like [REDACTED-SSN] or [REDACTED-EMAIL]. Include a brief note explaining the redaction. This approach keeps the response useful while preventing data exposure.

Rate-limiting output volume catches exfiltration attempts. If a single session generates more than 10KB of output in under 60 seconds, flag it for review. This pattern often indicates an attacker has tricked the model into dumping its full context window. Set up automated alerts for sessions that hit this threshold.

Structured Generation: Constraining Output to Safe Schemas

Constrained decoding eliminates entire classes of vulnerabilities by forcing the model to produce output that matches a predefined schema. If the model can only output valid JSON with specific field types and value constraints, it physically cannot produce free-text injection payloads or leak arbitrary data in its responses.

Outlines v0.2, used with vLLM or llama.cpp, applies JSON schema constraints at the token level during generation. The library modifies logit probabilities at each decoding step so the model can only produce tokens that are valid next-steps in the target schema. The output is guaranteed to be valid JSON conforming to your schema - not just “usually valid” but mathematically guaranteed.

For hosted API users, Instructor v1.6 for Python wraps OpenAI and Anthropic API calls with Pydantic model validation. Define your response as a Pydantic BaseModel with typed fields, validators, and constraints, and Instructor handles retry logic when the model produces invalid output.

from pydantic import BaseModel, Field
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

class SentimentResult(BaseModel):
    sentiment: str = Field(..., pattern="^(positive|neutral|negative)$")
    confidence: float = Field(..., ge=0.0, le=1.0)
    summary: str = Field(..., max_length=500)

result = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyze sentiment: Great product!"}],
    response_model=SentimentResult,
)

The major API providers also offer native structured output modes. Anthropic’s tool use with input_schema, OpenAI’s response_format: { type: "json_schema" }, and Google’s Gemini response_schema parameter all provide server-side schema enforcement. These are the lowest-friction option when you are already using a hosted API.

Defensive schema design matters as much as the enforcement mechanism. Use enum fields for categorical outputs so the model cannot produce unexpected values. Apply max_length constraints on all string fields to prevent unbounded output. Use Optional fields with sensible defaults for graceful degradation when the model cannot populate every field. The tighter your schema, the smaller the attack surface.

One important caveat: schema constraints prevent structural attacks but not semantic ones. The model can still embed a Social Security number inside an allowed string field. Run your PII regex and NER filters on the string values within the structured output, not just on free-text responses.

On the performance side, structured generation with Outlines adds 5-15% latency on vLLM compared to unconstrained generation. On llama.cpp with grammar-based sampling (GBNF), overhead is under 10% for schemas with fewer than 50 fields. This is a small price for guaranteed output validity.

Putting It All Together: A Defense-in-Depth Pipeline

No single guardrail is sufficient. Each technique described above addresses a specific attack class, and attackers will probe for whichever layer you skipped. The goal is a pipeline where bypassing one layer still leaves multiple additional barriers.

The recommended pipeline order is:

Input length and character validation
Prompt injection classifier
Input sanitization and token stripping
System prompt isolation with sandwich defense
LLM inference with structured generation
Output PII and content filtering
Response delivery

In a FastAPI application, you can implement this as middleware using dependency injection to chain validator classes. Each step processes the text and either passes it through or raises an exception.

from fastapi import FastAPI, Depends, HTTPException

app = FastAPI()

async def validate_input(request: ChatRequest) -> ChatRequest:
    if len(request.message) > 16000:
        raise HTTPException(422, "Input too long")
    return request

async def detect_injection(request: ChatRequest = Depends(validate_input)) -> ChatRequest:
    if injection_classifier.predict(request.message) > 0.85:
        logger.warning("Injection detected", extra={"user": request.user_id})
        raise HTTPException(422, "Request flagged by safety filter")
    return request

async def sanitize(request: ChatRequest = Depends(detect_injection)) -> ChatRequest:
    request.message = sanitize_input(request.message)
    return request

@app.post("/chat")
async def chat(request: ChatRequest = Depends(sanitize)):
    response = await call_llm(request.message)
    filtered = output_filter.process(response)
    return {"response": filtered}

Feature flags make tuning practical. In development, log warnings but allow all traffic through so you can measure false positive rates. In production, block flagged inputs and redact flagged outputs. This lets you adjust sensitivity without redeploying your application. A simple environment variable or feature flag service works fine for this.

Before deploying, test with Garak v0.9, an open-source LLM vulnerability scanner that includes probe suites for prompt injection, encoding attacks, and data leakage. Run it against your endpoint and measure bypass rates before and after enabling your guardrails. If your bypass rate is not below 5% on Garak’s standard probe set, you have gaps to fill.

Garak LLM vulnerability scanner showing encoding probe results with pass/fail rates per attack category — Garak's encoding probe module tests models against Base64, ROT13, and homoglyph-encoded injection payloads

Image: NVIDIA/garak

In production, track injection detection rate, false positive rate (derived from user complaints and manual review), output redaction frequency, and p99 latency overhead per pipeline stage. Prometheus and Grafana are the standard stack for this. Set up alerts for sudden spikes in detection rate (which indicates an active attack) or sudden drops (which indicates your classifier may be failing silently).

Failure handling deserves careful thought. If the injection classifier times out or throws an exception, fail open with enhanced logging rather than fail closed. Blocking all requests because your safety classifier is down amounts to a self-inflicted denial of service. Instead, increase the strictness of downstream output filters as compensation - tighten PII redaction thresholds and lower the content safety classifier’s confidence threshold so more borderline responses get filtered.

The total latency overhead for this full pipeline is typically 100-300ms per request, depending on whether you run classifiers locally or call external APIs. For most conversational applications, this is well within acceptable bounds. For latency-critical applications, run the input classifier and output filter asynchronously where possible, and use the fastest model variants available (quantized DeBERTa for input classification, distilled NER for output filtering).

Building guardrails is not a one-time task. New attack techniques emerge regularly, injection classifiers need retraining on fresh adversarial examples, and your application’s attack surface changes every time you add a new tool or data source to the LLM pipeline. Treat your guardrail pipeline like any other security-critical system - keep testing it, keep monitoring it, and keep updating it as new threats appear.