Evals

AI Coding Benchmarks in 2026: Why the Leaderboard You Pick Decides the Winner

The SWE-bench Verified leaderboard in June 2026 is led by OpenAI’s GPT-5.5 at 88.7%, with Claude Opus 4.7 a step behind at 87.6% and GPT-5.3-Codex at 85.0%. Anthropic’s June flagships, Opus 4.8 and the new Fable 5, ship as the current top Claude models but have not landed on the public board yet. Pick a different benchmark and the order flips. On SWE-bench Pro, Claude Opus 4.7 leads at 64.3%. On Terminal-Bench 2.0 , Codex CLI paired with GPT-5.5 tops the chart at 82.0%, while the cheaper, faster Gemini 3.5 Flash hit 76.2% on the newer 2.1 set with output about 4x faster. LiveCodeBench favors Google. There is no single best AI coding model. There is only a best model for the kind of task you care about, and the agent scaffold around that model can shift scores by several points.

The 80% Coverage Trap: Why AI-Generated Tests Create a False Sense of Security

AI test generators make it easy to hit 80% or even 90%+ line coverage. Point GitHub Copilot at a codebase, use the @Test directive, and watch it write hundreds of test methods by itself. The number looks great on a dashboard. But line coverage only measures execution, not detection. A test suite can run every line of your code while checking nothing about whether that code is correct. In one 2026 experiment, an AI-built suite scored 93.1% line coverage but only 58.6% on mutation testing. Over a third of realistic bugs slipped through undetected, with CI green across the board.

Promptfoo: Catch LLM Regressions Before Production

Promptfoo is an open-source CLI tool that runs your test cases against one or more LLM providers at once. You write a YAML file with prompts, test cases, and checks, then run promptfoo eval to get a report with pass/fail rates, regressions, and side-by-side comparisons. It scores results three ways: simple text checks, LLM-as-judge grading, or your own scoring code. The point is to catch prompt regressions, broken model upgrades, and quality drops before users see them.

Production LLM Hallucinations: Taxonomy, Evals, and RAG Defenses

Fixing LLM hallucinations in production needs a layered defense. Use Chain-of-Verification at inference time. Ground the model in trusted data. Build eval suites that give you a hallucination rate you can track and gate in CI . No single trick fixes this. But pair prompt rules with retrieval-augmented grounding , self-checking, and validation layers, and you turn it into a problem you can measure and ship against.

What Is Hallucination? A Taxonomy for Developers

“Hallucination” has become an umbrella label for almost any unexpected LLM output. That fuzziness is dangerous in production. Each failure mode has a distinct cause and a distinct fix. Lump them together and you’ll apply the wrong remedy to the wrong problem. You’ll spend cycles on prompt tuning when the real issue is retrieval quality, or add RAG when the failure is instruction-following. Before you can fix hallucinations, you need a precise vocabulary for what you’re seeing.