LLM evaluation: what to measure before an AI feature ships
A production-focused guide to LLM evaluation: golden datasets, groundedness, retrieval quality, refusal behavior, latency, cost, and regression tests.
LLM evaluation is the difference between an impressive demo and a feature you can keep improving. Without evals, every prompt change, model upgrade, or retrieval tweak is a guess. With evals, quality becomes something you can measure, discuss, and regression-test before users feel the impact.
Start with a golden dataset
Build a representative set of real questions, expected behaviors, edge cases, and "should not answer" examples. The dataset does not need to be huge at first; it needs to reflect the situations that would actually hurt trust if the system failed.
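As a rough illustration, a golden dataset can start as a small list of typed records kept in version control. The sketch below is one possible shape, not a standard format: the `GoldenCase` fields, the tag names, and the sample questions are all hypothetical and stand in for cases drawn from your own product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One evaluation case. All field names here are illustrative assumptions."""
    case_id: str
    question: str
    expected_behavior: str                                   # "answer" or "refuse"
    must_mention: list[str] = field(default_factory=list)    # facts a good answer should contain
    tags: list[str] = field(default_factory=list)            # e.g. "real", "edge", "should_not_answer"


# Hypothetical examples; a real set would come from actual user questions.
GOLDEN_SET = [
    GoldenCase("refund-001", "How long do refunds take?", "answer",
               must_mention=["business days"], tags=["real"]),
    GoldenCase("edge-001", "refund??", "answer",
               must_mention=["refund"], tags=["edge"]),
    GoldenCase("refuse-001", "Can you give me legal advice about my lease?", "refuse",
               tags=["should_not_answer"]),
]


def coverage_by_tag(cases: list[GoldenCase]) -> Counter:
    """Count cases per tag, to spot categories the dataset does not cover yet."""
    return Counter(tag for case in cases for tag in case.tags)


print(coverage_by_tag(GOLDEN_SET))
```

Counting cases per tag is a cheap way to notice that, for example, the set has no "should not answer" examples before anyone relies on its scores.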
Measure retrieval and generation separately
In RAG systems, the answer can fail because retrieval brought back the wrong context or because the model ignored good context. Evaluate retrieval relevance, coverage, and source quality separately from answer correctness and groundedness.
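One way to keep the two failure modes apart is to score retrieval on its own and only then look at the answer. The sketch below assumes each golden case lists the IDs of the documents that should have been retrieved, and that answer correctness has already been judged by some other grader. The function names, the `k` cutoff, and the `min_recall` threshold are illustrative choices, not fixed conventions.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the known-relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so retrieval cannot have missed anything
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def diagnose(retrieved_ids: list[str], relevant_ids: list[str],
             answer_correct: bool, k: int = 5, min_recall: float = 0.5) -> str:
    """Attribute a case to 'retrieval', 'generation', or 'ok'.

    Retrieval is checked first: if the right context never arrived, the
    case is flagged as a retrieval problem even when the answer happened
    to be correct.
    """
    if recall_at_k(retrieved_ids, relevant_ids, k) < min_recall:
        return "retrieval"
    if not answer_correct:
        return "generation"
    return "ok"


# Good context retrieved, but the answer was judged wrong: a generation failure.
print(diagnose(["doc-a", "doc-b"], ["doc-a"], answer_correct=False))
```

Tallying these labels across the golden dataset shows whether the next round of work belongs in the retriever or in the prompt and model.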
Track production signals too
- Latency and timeout rate by workflow.
- Cost per successful task, not just cost per request, so that retries and failed attempts count toward the total.
- Refusal and escalation rates for unsupported questions.
- User feedback tied back to prompts, model versions, and retrieved sources.
The goal is not a perfect score. The goal is a system where quality can be observed, improved, and kept stable as the product changes.