How to Evaluate LLM Outputs Systematically with Promptfoo

Promptfoo is an open-source CLI tool that lets you define test cases with expected outputs, run them against one or more LLM providers simultaneously, and score the results using deterministic checks, LLM-as-judge grading, or custom scoring functions. You write a YAML configuration file defining your prompts, test cases, and assertions, then run promptfoo eval to generate a detailed report showing pass/fail rates, regressions, and side-by-side comparisons. This catches prompt regressions, model upgrade breakages, and quality degradation before they reach production.
If you have spent any time building applications on top of large language models, you have probably experienced the frustration of tweaking a prompt to fix one edge case and then discovering - days later, from a user complaint - that the change broke something else entirely. Manual spot-checking does not scale. Promptfoo gives you a structured, repeatable way to verify that your LLM-powered features actually work across the full range of inputs you care about.
Why You Need Automated LLM Evaluation
LLM outputs are non-deterministic, and the surface area of possible inputs is enormous. A typical LLM pipeline handles hundreds of different input patterns, but most teams manually test three to five examples and call it done. That gives you coverage of maybe 5% of your real-world cases.
Prompt regression is the most common failure mode. You tweak your system prompt to handle a tricky edge case - maybe users were getting overly verbose responses for simple questions - and unknowingly break the formatting for code generation outputs, or introduce a new tendency to hallucinate citations. Without automated evaluation running across dozens of test cases, you will not notice until users start complaining. By then you have lost trust that is hard to rebuild.
Model upgrades carry similar risk. Switching from GPT-4o to GPT-4o-mini, or from Claude 3.5 to Claude 4, changes output behavior in ways that are difficult to predict. The new model might be better on average but worse on specific categories of input that matter to your application. An evaluation suite lets you quantify exactly what changes - which test cases improved, which regressed, and by how much - before you commit to the switch.
Cost-quality trade-offs are another area where systematic evaluation pays off fast. Running the same test suite against GPT-4o ($2.50/MTok input) versus Claude 4 Sonnet ($3/MTok) versus Llama 4 Scout running locally through Ollama ($0) lets you see exactly how much quality you sacrifice by choosing a cheaper option. Without numbers, these decisions get made on gut feeling. With an evaluation suite, you have a concrete quality score for each provider on your specific use case.
For regulated industries like healthcare and finance, there is also the compliance angle. Documented testing of AI outputs is increasingly expected by auditors and regulators. Promptfoo generates timestamped evaluation reports that serve as audit artifacts, showing exactly what was tested, when, and what the results were.
The feedback loop this creates is what makes it practical for day-to-day development. When you can see which specific test cases fail after a prompt change, you know exactly where to focus your prompt engineering effort. Instead of guessing and iterating blindly, each change is validated against your full test suite.
Installing Promptfoo and Understanding the Configuration
Promptfoo is distributed as an npm package. Install it globally with:
```shell
npm install -g promptfoo
```

You need version 0.100 or later. If you prefer not to install it globally, use npx promptfoo@latest eval to run it directly.
To scaffold a new project, run:
```shell
promptfoo init
```

This creates a promptfooconfig.yaml template with example prompts and test cases. The configuration file has three main sections that work together: prompts, providers, and tests.
The prompts section defines the prompt templates you want to evaluate. You can reference external files with file://prompt.txt or write inline strings using {{variable}} template syntax for dynamic inputs. You can list multiple prompt variations, and Promptfoo tests every prompt against every provider and every test case, creating a full comparison matrix.
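For example, a prompt file referenced as file://prompt.txt could contain a template like the sketch below. The user_input variable name is just an illustration; it must match the vars you define in your tests:

```
You are a concise customer-support assistant.
Answer the following question in three sentences or fewer:

{{user_input}}
```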
The providers section specifies which LLM APIs to test against. A typical configuration might look like:
```yaml
providers:
  - openai:gpt-4o
  - anthropic:claude-4-sonnet
  - ollama:llama4-scout
```

Each provider gets tested against every prompt and every test case, so you define your tests once and run them everywhere.
The tests section is where you define individual test cases. Each test has vars (the input variables that get substituted into your prompt template), assert (the conditions the output must satisfy), and an optional description for human-readable labeling.
Here is a minimal but complete configuration example:
```yaml
prompts:
  - file://system_prompt.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-4-sonnet

tests:
  - description: "Basic greeting"
    vars:
      user_input: "Hello, how are you?"
    assert:
      - type: contains
        value: "hello"
      - type: not-contains
        value: "I'm an AI"

  - description: "JSON output format"
    vars:
      user_input: "List three colors as JSON"
    assert:
      - type: is-json
      - type: javascript
        value: "JSON.parse(output).length === 3"
```

For API keys, create a .env file in your project directory with OPENAI_API_KEY, ANTHROPIC_API_KEY, or whatever credentials your providers need. Promptfoo loads these automatically. For Ollama, set OLLAMA_BASE_URL=http://localhost:11434.
Writing Effective Test Cases and Assertions
The quality of your evaluation depends on how well you write your test cases and assertions. Promptfoo provides several categories of assertion types, each suited to different kinds of validation.
Deterministic assertions are the foundation. These are fast, free (no API calls), and completely reproducible:
- contains - checks that the output includes a specific substring
- not-contains - verifies a substring is absent
- regex - matches a regular expression pattern
- equals - exact string match
- is-json - validates that the output is parseable JSON
- is-valid-openai-function-call - checks function call format compliance
Use deterministic assertions whenever possible. They run instantly, cost nothing, and produce the same result every time. For checking that a summarization prompt includes specific key facts, or that a code generation prompt produces valid JSON, these are all you need.
Similarity assertions handle cases where the exact wording does not matter but the meaning does. similar(expected, threshold) computes cosine similarity between embeddings of the output and your expected text. levenshtein(expected, maxDistance) uses edit distance for cases where you expect nearly identical text with minor variations. These work well for open-ended questions where there are multiple valid phrasings of a correct answer.
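Expressed in configuration, the two similarity checks might look like the sketch below. The expected strings and threshold values are illustrative, and the exact option names should be checked against the current Promptfoo docs:

```yaml
assert:
  # Passes if embedding cosine similarity to the expected text is high enough
  - type: similar
    value: "Paris is the capital of France"
    threshold: 0.85
  # Passes if edit distance to the expected text is within the threshold
  - type: levenshtein
    value: "Paris is the capital of France."
    threshold: 5
```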
The llm-rubric assertion type is the most flexible option. It sends the output to a grading LLM along with your rubric, and the grader returns a pass/fail decision with reasoning. For example:
```yaml
assert:
  - type: llm-rubric
    value: "The response should be professional in tone, include at least 2 specific examples, and not exceed 200 words"
```

This works well for subjective quality dimensions like tone, completeness, and helpfulness that are difficult to capture with deterministic rules. The trade-off is cost (each assertion requires an API call to the grading model) and slight non-determinism in the grading itself.
Custom scoring functions give you full programmatic control. Write JavaScript inline or reference external Python files:
```yaml
assert:
  - type: javascript
    value: "output.split('\\n').length <= 10"
  - type: python
    value: "file://custom_scorer.py"
```

Use these for application-specific logic - checking that generated SQL is syntactically valid, verifying that a response respects a word count constraint, or validating that extracted entities match a known schema.
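The file://custom_scorer.py reference above points at a Python file that, in Promptfoo's convention, exposes a get_assert(output, context) function returning a pass/fail boolean, a score, or a detailed result dict. The word-count rule below is an illustrative example, not a Promptfoo built-in:

```python
# custom_scorer.py - sketch of a Python assertion file for promptfoo.
# Promptfoo calls get_assert(output, context) and interprets the return
# value as pass/fail (bool), a score (float), or a detailed result dict.

def get_assert(output: str, context: dict):
    """Pass if the output stays within a 150-word budget."""
    word_count = len(output.split())
    return {
        "pass": word_count <= 150,
        "score": max(0.0, 1.0 - word_count / 300),
        "reason": f"Output contains {word_count} words (limit: 150)",
    }
```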
For organizing your test suite, aim for 50 to 100 test cases covering several categories: happy path inputs that should work cleanly, edge cases that have caused problems before, adversarial inputs that test safety guardrails, and format compliance checks. Use the description field on every test case so you can quickly scan results. You can also assign weight to assertions to reflect their importance - a factual error (weight: 5) should count more than a minor formatting issue (weight: 1).
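A weighted assertion setup along those lines might look like this sketch (the rubric text and weights are illustrative):

```yaml
assert:
  - type: llm-rubric
    value: "All factual claims in the response are accurate"
    weight: 5
  - type: contains
    value: "Summary:"
    weight: 1
```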
Running Evaluations and Interpreting Results
Running an evaluation is a single command:
```shell
promptfoo eval
```

This executes all test cases against all providers. Add --no-cache to force fresh API calls if you want to verify that results are not stale. Use -j 5 to set concurrency if you need to manage API rate limits (default is 4 concurrent requests).
To view results interactively, run:
```shell
promptfoo view
```

This opens a local web UI at localhost:15500 with a table showing pass/fail status per test case per provider.

For programmatic access, export results to JSON:
```shell
promptfoo eval -o results.json
```

Or print a summary table directly to the terminal with promptfoo eval --table.
Regression detection is one of the most useful features for ongoing development. Save a baseline after a known-good evaluation:
```shell
promptfoo eval -o baseline.json
```

Then after making changes to your prompt or switching models, run a comparison:

```shell
promptfoo eval --compare baseline.json
```

This highlights which test cases changed from pass to fail (regressions) and from fail to pass (improvements), giving you a clear picture of whether your change was a net positive or negative.
Promptfoo also tracks token usage and estimated cost per provider in every evaluation run. Use this to calculate cost-per-test-case and extrapolate to production volumes, which is especially relevant when comparing a $2.50/MTok provider against a $0.15/MTok alternative.
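As a back-of-envelope check, extrapolating from per-test token counts to production cost is simple arithmetic. The token counts and the $10/MTok output price below are illustrative assumptions, not figures from an actual eval run:

```python
# Rough monthly cost extrapolation from an eval run's token statistics.
# Prices are in dollars per million tokens ($/MTok).
def monthly_cost(avg_input_tok: int, avg_output_tok: int, requests_per_month: int,
                 input_price: float, output_price: float) -> float:
    input_cost = avg_input_tok * requests_per_month / 1_000_000 * input_price
    output_cost = avg_output_tok * requests_per_month / 1_000_000 * output_price
    return input_cost + output_cost

# 500 input + 300 output tokens per request, 1M requests/month,
# at GPT-4o-style pricing ($2.50/MTok in, assumed $10/MTok out)
print(monthly_cost(500, 300, 1_000_000, 2.50, 10.00))  # 4250.0
```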
For CI integration, add the evaluation to your GitHub Actions workflow or equivalent:
```shell
promptfoo eval --ci --fail-on-error-rate 0.1
```

This returns exit code 1 if the pass rate drops below the configured threshold (10% failure rate in this example), blocking merges that would degrade quality.
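A minimal GitHub Actions job wiring this up might look like the following sketch (the workflow file name, Node version, and secret names are assumptions for illustration):

```yaml
# .github/workflows/llm-eval.yml (illustrative)
name: LLM evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npx promptfoo@latest eval --ci --fail-on-error-rate 0.1
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```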
Run promptfoo share to upload results to Promptfoo’s hosted viewer and get a shareable URL for team review, without requiring everyone to run the evaluation locally.
Advanced Patterns - Red Teaming, Model Comparison, and RAG Evaluation
Beyond basic prompt testing, Promptfoo supports several advanced evaluation scenarios worth covering.

Red teaming is built in. Running promptfoo generate redteam automatically generates adversarial test cases tailored to your specific prompt - prompt injections, jailbreak attempts, PII extraction probes, and similar attack vectors. This is faster than writing adversarial tests by hand, and it covers attack patterns you might not think of on your own. Run the generated tests to verify that your safety guardrails hold.
For systematic model comparison, define three to five providers and run your full test suite (ideally 100+ cases) against all of them. The results view lets you sort by overall score to identify the best model for your specific use case. This turns model selection from a subjective “GPT-4o feels better” into a quantified “GPT-4o scores 94% on our test suite, Claude 4 Sonnet scores 91%, and Llama 4 Scout scores 82% but costs nothing to run.” Those numbers make procurement and architecture decisions much easier to justify.
RAG pipeline evaluation extends testing beyond the LLM itself to your full retrieval-augmented generation system. Define test cases with known answers and expected source documents, then use assertions to check both answer quality (via llm-rubric) and retrieval quality (via contains checks for expected source content in the retrieved context). This catches problems at every stage of the pipeline, from retrieval failures to context window overflow to answer hallucinations.
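A RAG test case following this pattern might look like the sketch below. The variable name, rubric text, and source file name are hypothetical and depend on how your pipeline exposes retrieved context:

```yaml
tests:
  - description: "Refund policy question"
    vars:
      question: "How long do customers have to request a refund?"
    assert:
      # Answer quality: graded against the known-correct answer
      - type: llm-rubric
        value: "States that refunds are available within 30 days of purchase"
      # Retrieval quality: the expected source document should appear
      - type: contains
        value: "refund-policy.md"
```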
A/B prompt testing lets you compare multiple prompt variations against the same test cases. Define several versions of your system prompt, each as a separate file in the prompts section, and Promptfoo shows which version scores highest overall and on specific test case categories.
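In the configuration, A/B testing is just a matter of listing the variants (the file names here are hypothetical):

```yaml
prompts:
  - file://prompts/system_v1.txt
  - file://prompts/system_v2_concise.txt
  - file://prompts/system_v3_detailed.txt
```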
For testing your complete application pipeline rather than just the LLM call, custom providers let you wrap your entire system:
```yaml
providers:
  - type: python
    config:
      file: "custom_provider.py"
```

Your custom provider can call your retrieval system, run post-processing, apply formatting rules, and return the final output that users would actually see. This tests the real system rather than the LLM in isolation.
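A skeleton of such a provider, following Promptfoo's Python provider convention of a call_api function that returns a dict with an output key. The retrieval stub and response formatting below are placeholders for your real system:

```python
# custom_provider.py - sketch of a promptfoo custom Python provider.
# Promptfoo calls call_api(prompt, options, context) and expects a dict
# with an "output" key containing the final text users would see.

def retrieve_context(query: str) -> str:
    # Placeholder for your real retrieval system (vector DB, search, etc.)
    return "Relevant documents for: " + query

def call_api(prompt: str, options: dict, context: dict) -> dict:
    docs = retrieve_context(prompt)
    # In a real provider you would call your LLM here with prompt + docs,
    # then apply whatever post-processing your application performs.
    answer = f"[answer based on {docs!r}]"
    return {"output": answer}
```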
Scheduled evaluations are worth considering as a production monitoring strategy. Run promptfoo eval on a daily cron job against your production prompts to detect model-side changes - providers do silently update their models, and this can shift output behavior without any changes on your end. Set up webhook alerts on score drops so you catch these before users notice.
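A scheduled run can be as simple as a crontab entry like this sketch (the project path is hypothetical; note that % must be escaped as \% inside crontab):

```shell
# Run the eval suite daily at 06:00 and archive timestamped results
0 6 * * * cd /srv/llm-app && promptfoo eval -o "eval-$(date +\%F).json"
```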
Getting Started with Your First Evaluation
If you have an existing LLM application, the fastest way to get value from Promptfoo is straightforward: identify the five most common failure modes your users have reported, write a test case for each one with appropriate assertions, and run the evaluation. You will immediately see whether your current prompt and model handle those cases correctly. From there, grow the test suite incrementally - add a few test cases every time you encounter a new edge case or bug report. Within a few weeks you will have a solid evaluation suite that gives you actual confidence when making changes.
The YAML configuration is straightforward, the CLI works well, and most test cases take a minute or two to write. Once you have a test suite running, prompt changes and model upgrades stop being anxiety-inducing because you have a safety net that catches regressions before they reach production.