What should an LLM evaluation measure?

At minimum, evaluate task success, groundedness, retrieval quality, refusal behavior, safety constraints, latency, and cost. For retrieval-augmented generation (RAG) systems, evaluate retrieval separately from generation so you know where failures come from.

LLM evaluation: what to measure before an AI feature ships

LLM evaluation is the difference between an impressive demo and a feature you can keep improving. Without evals, every prompt change, model upgrade, or retrieval tweak is a guess. With evals, quality becomes something you can measure, discuss, and regression-test before users feel the impact.

Start with a golden dataset

Build a representative set of real questions, expected behaviors, edge cases, and "should not answer" examples. The dataset does not need to be huge at first; it needs to reflect the situations that would actually hurt trust if the system failed.
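As a rough illustration, a golden dataset can start as a small list of structured records. The sketch below is one possible shape, not a standard format: the `GoldenExample` fields, the "answer"/"refuse" labels, and the sample questions are all made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    """One hand-curated eval case (hypothetical schema)."""
    question: str
    expected_behavior: str  # assumed labels: "answer" or "refuse"
    must_mention: list[str] = field(default_factory=list)  # phrases a good answer should contain
    tags: list[str] = field(default_factory=list)          # e.g. "edge_case", "should_not_answer"


# Invented examples: one common question, one edge case, one "should not answer".
GOLDEN_SET = [
    GoldenExample("How do I reset my password?", "answer", ["reset link"], ["common"]),
    GoldenExample("Can I get a refund after 90 days?", "answer", ["refund policy"], ["edge_case"]),
    GoldenExample("What is my coworker's salary?", "refuse", [], ["should_not_answer"]),
]


def coverage_by_behavior(examples):
    """Count examples per expected behavior, to spot a missing category."""
    return Counter(e.expected_behavior for e in examples)


if __name__ == "__main__":
    print(coverage_by_behavior(GOLDEN_SET))
```

Counting by behavior (or by tag) is a cheap way to check that the set actually includes refusals and edge cases rather than only happy-path questions.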

Measure retrieval and generation separately

In RAG systems, the answer can fail because retrieval brought back the wrong context or because the model ignored good context. Evaluate retrieval relevance, coverage, and source quality separately from answer correctness and groundedness.
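To make the split concrete, here is a minimal sketch that scores retrieval with recall@k and precision@k, then uses the retrieval score to attribute a wrong answer to one side or the other. It assumes each golden example lists the IDs of the documents that should be retrieved and that answer correctness is judged elsewhere. The function names, the document IDs, and the 0.5 threshold are illustrative choices, not recommendations.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def diagnose(retrieval_recall, answer_correct, recall_threshold=0.5):
    """Attribute a failure to retrieval or generation (threshold is an assumption)."""
    if answer_correct:
        return "ok"
    if retrieval_recall < recall_threshold:
        return "retrieval_failure"
    return "generation_failure"


if __name__ == "__main__":
    retrieved = ["d1", "d7", "d3"]   # hypothetical retriever output, best first
    relevant = ["d1", "d3", "d9"]    # hypothetical labels from the golden dataset
    recall = recall_at_k(retrieved, relevant, k=3)
    print(recall, precision_at_k(retrieved, relevant, k=3))
    print(diagnose(recall, answer_correct=False))
```

In this example two of the three relevant documents were retrieved, so a wrong answer is attributed to generation: the model had most of the right context and still failed.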

Track production signals too

Latency and timeout rate by workflow.
Cost per successful task, not just cost per request.
Refusal and escalation rates for unsupported questions.
User feedback tied back to prompts, model versions, and retrieved sources.
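Two of the signals above are simple to compute from request logs. The sketch below assumes a hypothetical log record with `workflow`, `cost_usd`, `success`, and `timed_out` fields; real logging schemas will differ, and the sample numbers are invented.

```python
from collections import defaultdict


def cost_per_successful_task(records):
    """Total spend divided by the number of successful tasks (None if none succeeded)."""
    total_cost = sum(r["cost_usd"] for r in records)
    successes = sum(1 for r in records if r["success"])
    if successes == 0:
        return None
    return total_cost / successes


def timeout_rate_by_workflow(records):
    """Share of requests that timed out, grouped by workflow name."""
    totals = defaultdict(int)
    timeouts = defaultdict(int)
    for r in records:
        totals[r["workflow"]] += 1
        if r["timed_out"]:
            timeouts[r["workflow"]] += 1
    return {workflow: timeouts[workflow] / totals[workflow] for workflow in totals}


# Invented log records using the hypothetical schema described above.
RECORDS = [
    {"workflow": "search", "cost_usd": 0.02, "success": True, "timed_out": False},
    {"workflow": "search", "cost_usd": 0.02, "success": False, "timed_out": True},
    {"workflow": "summarize", "cost_usd": 0.04, "success": True, "timed_out": False},
    {"workflow": "summarize", "cost_usd": 0.04, "success": True, "timed_out": False},
]

if __name__ == "__main__":
    print(cost_per_successful_task(RECORDS))
    print(timeout_rate_by_workflow(RECORDS))
```

Failed requests still cost money, so dividing total spend by successes gives a higher and more honest figure than the average cost per request.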

The goal is not a perfect score. The goal is a system where quality can be observed, improved, and kept stable as the product changes.
