Joseph Gitonga

Building an eval harness that actually catches regressions

2025-05-22 / 9 min / llm / evals / rag / production

Retrieval and prompt evaluation pipelines that drove an 18% lift in answer correctness across enterprise LLM deployments. Plus why most eval setups silently lie to you.


01

The problem with most LLM evals

Teams ship an LLM feature. Their offline benchmark numbers go up. Then production users start complaining. The usual culprit is an eval suite that grades on the wrong axis: average correctness on a static set, not the long-tail regressions that appear when retrieval, prompt, or model versions drift.

If your eval pipeline gives you a single green number per release, it is almost certainly hiding the regressions you most need to see.

02

What we built

A harness that scored every release across three independent axes. Retrieval grounding: did the model use evidence that was actually retrieved? Prompt-following: did it obey constraints on out-of-distribution queries? Rubric-based correctness: was the answer right against a versioned gold set, graded by a stronger model with sampled human spot-checks?
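For concreteness, here is a minimal sketch of how one release candidate could be scored along those three axes against a canary set. The `ScoredExample` record, the grader names, and the retriever/generator interfaces are illustrative assumptions, not the production harness.

```python
from dataclasses import dataclass

@dataclass
class ScoredExample:
    """One canary query scored on all three axes for a given release.

    Field names are illustrative, not the actual schema.
    """
    query_id: str
    release: str
    retrieval_grounding: float   # did the answer use evidence that was actually retrieved
    prompt_following: float      # did the answer obey the prompt's constraints
    rubric_correctness: float    # graded by a stronger model against a versioned gold set

def score_release(release, canary_set, retriever, generate, graders):
    """Run every canary query through retrieval + generation, then all three graders."""
    results = []
    for example in canary_set:
        docs = retriever(example["query"])
        answer = generate(example["query"], docs)
        results.append(ScoredExample(
            query_id=example["id"],
            release=release,
            retrieval_grounding=graders["grounding"](answer, docs),
            prompt_following=graders["prompt"](answer, example["constraints"]),
            rubric_correctness=graders["rubric"](answer, example["gold"]),
        ))
    return results
```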

Each axis ran against the same canary set so regressions could not hide behind a flattering mean. Scoring traces were written to a queryable store with the inputs, outputs, retrieved documents, and grader rationale for every example.
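The queryable store does not need to be elaborate. The sketch below assumes SQLite and an illustrative column layout for the inputs, outputs, retrieved document ids, and grader rationale; the table and column names are assumptions, not the system described above.

```python
import json
import sqlite3

def init_trace_store(path="eval_traces.db"):
    """Create (or open) the trace store. Column names are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS traces (
            release_tag TEXT,
            query_id TEXT,
            axis TEXT,              -- grounding | prompt | rubric
            score REAL,
            input TEXT,
            output TEXT,
            retrieved_docs TEXT,    -- JSON list of document ids
            grader_rationale TEXT,
            PRIMARY KEY (release_tag, query_id, axis)
        )
    """)
    return conn

def write_trace(conn, release, query_id, axis, score, input_text, output_text, doc_ids, rationale):
    """Persist one scored example so later releases can be diffed against it."""
    conn.execute(
        "INSERT OR REPLACE INTO traces VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (release, query_id, axis, score, input_text, output_text,
         json.dumps(doc_ids), rationale),
    )
    conn.commit()
```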

Critically, the queryable store let us ask 'which queries got worse between v23 and v24?' instead of 'did the average go up?'. That shift in the question we were asking was the biggest unlock.
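With traces in something SQL-shaped, that question is a single join. The query below runs against the illustrative `traces` table from the previous sketch; the v23/v24 labels come from the text, and the 0.1 drop threshold is an arbitrary example, not a recommendation.

```python
# Per-query, per-axis regressions between two releases, worst first.
# Assumes the illustrative `traces` table defined above.
REGRESSION_QUERY = """
SELECT base.query_id,
       base.axis,
       base.score AS v23_score,
       cand.score AS v24_score
FROM traces AS base
JOIN traces AS cand
  ON cand.query_id = base.query_id AND cand.axis = base.axis
WHERE base.release_tag = 'v23'
  AND cand.release_tag = 'v24'
  AND cand.score < base.score - 0.1   -- only meaningful drops; threshold is arbitrary
ORDER BY (base.score - cand.score) DESC;
"""

def queries_that_got_worse(conn):
    """Return the canary queries that regressed between the two releases."""
    return conn.execute(REGRESSION_QUERY).fetchall()
```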

03

Why it moved the needle

The 18% correctness lift across deployments did not come from a smarter prompt. It came from being able to see, in minutes, which kinds of queries each iteration was breaking. Roughly half the prompt and retrieval changes we previously shipped would have been reverted if we had been measuring this way from the start.

We also caught a class of silent regression where retrieval quality dropped but generation papered over it with plausible-sounding output. Without scoring retrieval and generation separately, that kind of failure is invisible.
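A rough sketch of the kind of check that surfaces this: score retrieval against the gold evidence and score the answer separately, then flag the case where retrieval falls but the answer-level score barely moves. The dict shape, thresholds, and function names are illustrative assumptions.

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that retrieval actually returned."""
    if not gold_ids:
        return 1.0
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)

def flag_silent_regression(prev, curr, drop=0.2, tolerance=0.05):
    """True when retrieval got meaningfully worse but the answer score looks fine.

    `prev` and `curr` are dicts like {"retrieval": 0.9, "answer": 0.8};
    the shape and thresholds are illustrative, not the production harness.
    """
    retrieval_dropped = curr["retrieval"] < prev["retrieval"] - drop
    answer_looks_fine = curr["answer"] >= prev["answer"] - tolerance
    return retrieval_dropped and answer_looks_fine
```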

04

What I would do differently

Build the eval pipeline before the first feature ships. Otherwise every team rationalises 'we will add evals once the prompt stabilises'. It never stabilises.

Treat the gold set as a living artefact. Stale gold sets reward overfitting to the past. We added a weekly process where a small number of new examples were curated from production failures, with a 60-day rotation on the oldest examples.
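A sketch of that weekly rotation under the constraints above: a handful of new examples curated from production failures, and the oldest examples retired after roughly 60 days. The record shape, the weekly cap, and the assumption that curation stays manual are all illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=60)      # rotation window from the text
WEEKLY_NEW_EXAMPLES = 10          # illustrative cap, not the actual number

def rotate_gold_set(gold_set, curated_failures):
    """Weekly pass: retire stale gold examples, add freshly curated ones.

    Each example is assumed to be a dict with a timezone-aware 'added_at' datetime;
    choosing which production failures become gold examples stays a human step.
    """
    now = datetime.now(timezone.utc)
    kept = [ex for ex in gold_set if now - ex["added_at"] <= MAX_AGE]
    fresh = [{**failure, "added_at": now} for failure in curated_failures[:WEEKLY_NEW_EXAMPLES]]
    return kept + fresh
```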