Evals — short for evaluations — are the test suites of applied AI. An eval is a set of example inputs paired with a way of judging whether the system's output is good: it might check an exact answer, score against a reference, apply a rubric, or use another model as a judge. Run together, evals turn "the model feels better" into a measurable score you can track across prompt changes, model upgrades, and code deploys.
A practical eval set is built from real tasks the system has to handle, including the edge cases and failure modes that matter to you. Each item defines an input and a grading criterion. Grading can be deterministic (does the extracted figure match?), reference-based (how close is this to a known-good answer?), rubric-based (does it satisfy these criteria?), or model-graded (an LLM judges the output against instructions). Most serious systems combine several methods because no single one covers every kind of correctness.
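To make the grading methods above concrete, here is a minimal sketch in Python. Every name in it (`EvalItem`, `exact_match`, `reference_similarity`, `rubric`, `run_evals`, `toy_system`) is hypothetical and chosen for illustration, not taken from any particular eval framework. It assumes each grader returns a score between 0.0 and 1.0, and it uses character-level similarity as a crude stand-in for reference-based scoring. A model-graded check would fit the same `Grader` shape, wrapping a call to a judge model, which is omitted here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable

# Assumption: a grader takes the system's output and returns a score in [0.0, 1.0].
Grader = Callable[[str], float]


@dataclass
class EvalItem:
    """One eval item: an input plus the criterion used to grade the output."""
    input_text: str
    grader: Grader


def exact_match(expected: str) -> Grader:
    """Deterministic: 1.0 only if the output equals the expected string."""
    return lambda output: 1.0 if output.strip() == expected.strip() else 0.0


def reference_similarity(reference: str) -> Grader:
    """Reference-based: character-level similarity to a known-good answer."""
    return lambda output: SequenceMatcher(None, output, reference).ratio()


def rubric(checks: list[Callable[[str], bool]]) -> Grader:
    """Rubric-based: fraction of the criteria that the output satisfies."""
    return lambda output: sum(check(output) for check in checks) / len(checks)


def run_evals(system: Callable[[str], str], items: list[EvalItem]) -> float:
    """Run every item through the system and return the mean score."""
    scores = [item.grader(system(item.input_text)) for item in items]
    return sum(scores) / len(scores)


def toy_system(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "42" if "total" in prompt else "Paris is the capital of France."


items = [
    EvalItem("What is the total on the invoice?", exact_match("42")),
    EvalItem(
        "Name the capital of France.",
        rubric([lambda out: "Paris" in out, lambda out: len(out) < 100]),
    ),
]
score = run_evals(toy_system, items)
```

Because every grader shares one signature, a single eval set can mix methods item by item, which is one way to combine several methods in the same suite.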
In production, evals do for AI what unit and integration tests do for ordinary software: they catch regressions before users do. When you change a prompt, swap a model, or tune retrieval, you rerun the evals and compare scores. They are also how you make a go/no-go decision on a new model: rather than trusting a vendor's benchmark, you measure on your own tasks. Teams that ship reliably treat the eval set as a living asset that grows every time a real failure is found.
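The rerun-and-compare step can be sketched as a simple gate. This is an illustration under stated assumptions rather than a standard procedure: the function names, the per-eval score dictionaries, and the 0.02 tolerance are all hypothetical, and the scores in the usage lines are made-up values that show the mechanics only.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin; tune it for your own evals
) -> dict[str, float]:
    """Map each eval whose score dropped by more than `tolerance` to the size of the drop."""
    regressions = {}
    for name, old_score in baseline.items():
        # Assumption: an eval missing from the candidate run counts as a score of 0.0.
        new_score = candidate.get(name, 0.0)
        drop = old_score - new_score
        if drop > tolerance:
            regressions[name] = drop
    return regressions


def should_ship(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> bool:
    """Go/no-go: ship only if no eval regressed beyond the tolerance."""
    return not find_regressions(baseline, candidate, tolerance)


# Illustrative scores only, not real measurements.
baseline_scores = {"extraction": 0.90, "summaries": 0.80}
candidate_scores = {"extraction": 0.95, "summaries": 0.70}

regressed = find_regressions(baseline_scores, candidate_scores)
decision = should_ship(baseline_scores, candidate_scores)
```

Here the candidate improves on extraction but drops on summaries, so the gate blocks it. Comparing per-eval scores instead of a single average keeps a gain in one area from hiding a regression in another.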
Evals matter because without them, AI development is guesswork — you can't tell whether a change helped or hurt, and you discover regressions in production. Good evals are the difference between an AI feature you can iterate on confidently and one nobody dares to touch. They are also the foundation of safety work: you can't claim a system is reliable, unbiased, or on-policy without a measurable way to test it.