
What Is an Offline Eval? How Pre-Deployment Testing Works for AI Products

An offline eval tests your AI system against a fixed dataset before it ships. Learn what offline evals are, how to build one, and when to use them alongside online evals.


                An offline eval is a repeatable test that runs your AI system against a fixed dataset of inputs before any change reaches users. It's the AI product equivalent of pre-deployment regression testing: you define what good output looks like, collect cases that capture real failures and edge cases, and re-run the suite every time the system changes.

                Teams use offline evals to catch regressions before they ship and to build a shared, explicit definition of quality that both PMs and engineers can work from.

                How offline evals work

                An offline eval runs a fixed set of inputs through your AI system, scores the outputs against predefined criteria, and returns a pass rate the team can track over time. The core components are a dataset of input/output pairs, a scoring function, a pass/fail threshold, and a run cadence tied to system changes.

                In practice, this looks like a CI check attached to your model deployment. A customer support agent might be tested against 40 historical support tickets before any prompt change goes live. Each ticket runs through the system, the response is scored, and the deployment only proceeds if the pass rate stays above the threshold. If a change drops the pass rate, the team investigates before users see anything.
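The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `EvalCase`, `run_offline_eval`, `stub_support_agent`, and `contains_expected` are hypothetical names invented for the example, the stub stands in for a real support agent, and the 0.9 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    input_text: str
    expected: str  # text the response must contain to pass (assumed criterion)


def run_offline_eval(
    cases: list[EvalCase],
    system: Callable[[str], str],
    scorer: Callable[[str, str], bool],
    threshold: float,
) -> dict:
    """Run every case through the system, score it, and apply the gate."""
    if not cases:
        raise ValueError("dataset is empty")
    verdicts = {c.case_id: scorer(system(c.input_text), c.expected) for c in cases}
    pass_rate = sum(verdicts.values()) / len(verdicts)
    return {
        "pass_rate": pass_rate,
        "passed_gate": pass_rate >= threshold,
        "failures": [cid for cid, ok in verdicts.items() if not ok],
    }


def stub_support_agent(ticket: str) -> str:
    # Hypothetical stand-in for the real AI system under test.
    if "refund" in ticket.lower():
        return "You can request a refund within 30 days."
    return "Please contact support."


def contains_expected(output: str, expected: str) -> bool:
    # Simplest possible scorer: case-insensitive substring match.
    return expected.lower() in output.lower()


CASES = [
    EvalCase("t1", "How do I get a refund?", "refund within 30 days"),
    EvalCase("t2", "Can I reset my password?", "reset link"),
]

report = run_offline_eval(CASES, stub_support_agent, contains_expected, threshold=0.9)
print(report)
```

In a CI job, the script would typically exit with a non-zero status when `report["passed_gate"]` is false, which blocks the deployment, and the `failures` list tells the team which cases to investigate first.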

                The key property of an offline eval is that the dataset is fixed. You control exactly what the system is tested against, which makes it fast to run, easy to reproduce, and easy to diff across versions. That same fixedness is also its limitation: a curated dataset can't cover every phrasing or intent combination that real users produce.

                One important distinction for PMs: offline evals are not the same as benchmarks. A benchmark is a shared, public test used to compare models across organizations. An offline eval is private and specific to your product's quality bar. Benchmarks answer "which model is best in general." Offline evals answer "is this version of my system good enough for our users."

                What goes into an offline eval dataset

                A strong offline eval dataset starts with real failures, not invented scenarios. The cases that matter most come from four places:

                • Manual testing during development. Engineers and PMs work through the product and deliberately probe known edge cases. What happens when a user types an ambiguous question? What happens when the intent is clear but the phrasing is unusual? These cases go into the dataset as they surface.
                • Customer-reported bugs and support tickets. When a user complains that the AI gave a wrong answer, that specific input and the expected correct output become an eval case. Each complaint is data.
                • Production traces with negative signals. Sessions where users gave thumbs-down feedback, asked the same question twice, or abandoned mid-conversation are high-signal candidates. These traces reveal real failure modes that synthetic datasets miss.
                • Synthetic edge cases for known risks. If your system handles financial data, you add cases that test for hallucinated figures. If it calls external tools, you add cases that test for incorrect tool selection. Synthetic cases cover the failure modes you know to watch for before they happen in production.

                How many cases do you need? Twenty to fifty real failures is enough to produce a useful first signal. Quality matters far more than count. Each case should be unambiguous enough that two reviewers working independently would reach the same pass/fail verdict. A dataset of 20 sharp, specific cases beats a dataset of 200 fuzzy ones.
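One way to apply the two-reviewer test is to have two people label the same sample outputs independently and keep only the cases where their verdicts match. The sketch below assumes each reviewer's verdicts are stored as a mapping from case ID to pass/fail; `split_by_agreement` and the sample labels are hypothetical.

```python
def split_by_agreement(
    labels_a: dict[str, bool], labels_b: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Return (agreed, disputed) case IDs for cases both reviewers labeled."""
    shared = sorted(labels_a.keys() & labels_b.keys())
    agreed = [cid for cid in shared if labels_a[cid] == labels_b[cid]]
    disputed = [cid for cid in shared if labels_a[cid] != labels_b[cid]]
    return agreed, disputed


# Hypothetical verdicts from two independent reviewers (True = pass).
reviewer_a = {"c1": True, "c2": False, "c3": True}
reviewer_b = {"c1": True, "c2": True, "c3": True}

agreed, disputed = split_by_agreement(reviewer_a, reviewer_b)
print("keep:", agreed)
print("rewrite or drop:", disputed)
```

Disputed cases are not necessarily bad inputs. They usually mean the pass criteria need to be written more precisely before the case joins the dataset.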

                Datasets also need maintenance. As the system ships and new failure modes surface in production, those cases get added back to the offline set. The offline dataset is a living record of every class of failure the team has named and decided to guard against.

                Offline evals vs. online evals

                Offline and online evals are not interchangeable. They run in different environments, test different things, and catch different failure modes. Most production AI teams run both.

                Offline evals run before deployment, against a dataset you control. They're fast to run (milliseconds to minutes), easy to integrate into CI, and optimized for catching regressions. Their weakness is coverage: no curated dataset can anticipate every intent variation, unusual phrasing, or edge case that real users produce.

                Online evals run continuously against live production traffic. They score traces as they happen, which gives teams a continuous quality signal across the full distribution of real user behavior. An online eval catches the long tail of phrasings that no curated dataset ever includes. The tradeoff is that online evals run after the fact: by the time a new failure mode surfaces in production, some users have already seen the bad response.

                | | Offline eval | Online eval |
                | --- | --- | --- |
                | When it runs | Pre-deployment, in CI | Continuously in production |
                | Input source | Fixed, curated dataset | Real user traffic |
                | Primary purpose | Catch regressions before release | Monitor live quality, surface new failure modes |
                | Strength | Fast, reproducible, gates releases | Covers the full distribution of real usage |
                | Limitation | Curated dataset can't anticipate all real-world variation | Runs after deployment; can't prevent the first bad response |

                The relationship between the two is iterative. Offline evals gate releases. Online evals monitor live behavior. When online evals surface a new failure mode, that case gets added back to the offline dataset, which tightens the pre-deployment gate for the next release. Each type informs the other.
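The feedback loop can be sketched as a small promotion step that copies flagged production traces into the offline dataset. Everything here is an assumption made for illustration: the `Trace` fields, the `promote_failures` helper, and the choice to deduplicate on normalized input text. New cases are added with no expected output, because a person still has to write the pass criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    user_input: str
    thumbs_down: bool = False
    retried: bool = False
    abandoned: bool = False


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so near-identical inputs share a key.
    return " ".join(text.lower().split())


def promote_failures(traces: list[Trace], dataset: dict[str, dict]) -> list[str]:
    """Add traces with a negative signal to the dataset; return added trace IDs."""
    added = []
    for trace in traces:
        if not (trace.thumbs_down or trace.retried or trace.abandoned):
            continue  # no negative signal, not a candidate
        key = normalize(trace.user_input)
        if key in dataset:
            continue  # this input is already guarded by an existing case
        dataset[key] = {
            "input": trace.user_input,
            "expected": None,  # to be written by a human reviewer
            "source_trace": trace.trace_id,
        }
        added.append(trace.trace_id)
    return added


dataset = {
    "how do i cancel my plan?": {
        "input": "How do I cancel my plan?",
        "expected": "Explains the cancel flow",
        "source_trace": None,
    }
}
traces = [
    Trace("p1", "How do I  cancel my plan?", thumbs_down=True),  # duplicate input
    Trace("p2", "Why was I charged twice?", retried=True),
    Trace("p3", "What are your hours?"),  # no negative signal
]

added = promote_failures(traces, dataset)
print(added, len(dataset))
```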

                Scoring offline evals: code-based vs. LLM judge

                Offline evals use one of two scoring approaches, and the choice depends on what property you're trying to measure.

                Code-based scoring

                Code-based scoring uses deterministic logic written in code. Examples include a regex match on required text, a JSON schema validation, a check that the agent called the correct tool, or a row count on a SQL result. Code-based scoring is fast, cheap, and perfectly reproducible. It works well for any output property that is objectively verifiable: did the response include the required disclaimer? Did the agent select the right retrieval source? Did the output parse as valid JSON?
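The checks listed above translate directly into small functions. The following sketch uses only the Python standard library. The disclaimer wording, the required JSON keys, and the shape of the tool-call records (a list of dicts with a `name` field) are assumptions chosen for the example, not a format any particular framework prescribes.

```python
import json
import re


def has_disclaimer(output: str) -> bool:
    # Regex match on required text (the wording here is an assumed example).
    return re.search(r"not\s+financial\s+advice", output, re.IGNORECASE) is not None


def is_json_with_keys(output: str, required: set[str]) -> bool:
    # Passes only if the output parses as a JSON object with all required keys.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())


def called_tool(tool_calls: list[dict], expected_tool: str) -> bool:
    # Passes if the agent invoked the expected tool at least once.
    return any(call.get("name") == expected_tool for call in tool_calls)


print(has_disclaimer("Rates may change. This is not financial advice."))
print(is_json_with_keys('{"answer": "42", "sources": []}', {"answer", "sources"}))
print(called_tool([{"name": "search_orders", "args": {"id": 7}}], "search_orders"))
```

Each function returns a plain boolean, so the same case can be scored by several checks and marked as passing only when all of them pass.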

                The limitation is that anything subjective or open-ended is hard to express in code. A response can be technically correct in structure and genuinely unhelpful in substance, or technically wrong in one field and completely right in everything that matters. For those cases, teams use LLM judges.

                LLM judges

                LLM judges use a separate language model to score output against a rubric written in natural language. The judge receives the rubric, the user's input, and the system's output, and returns a score or a pass/fail decision. LLM judges scale to open-ended quality questions: Was the answer helpful? Did the response address what the user actually asked? Was the analysis grounded in the retrieved context rather than the model's general knowledge?
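A judge of this kind can be sketched as a prompt template plus a strict parser for the verdict. The model call is passed in as a plain function so the example stays self-contained. `JUDGE_TEMPLATE`, `judge`, and `stub_model` are hypothetical, and a real setup would replace the stub with a call to whichever model provider the team uses.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI system's response.

Rubric:
{rubric}

User input:
{user_input}

System output:
{system_output}

Reply with exactly one word: PASS or FAIL."""


def judge(
    rubric: str,
    user_input: str,
    system_output: str,
    call_model: Callable[[str], str],
) -> bool:
    """Build the judge prompt, call the model, and parse a strict verdict."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, user_input=user_input, system_output=system_output
    )
    reply = call_model(prompt).strip().upper()
    if reply not in {"PASS", "FAIL"}:
        # Refuse to guess: an unparseable verdict should never count as a pass.
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return reply == "PASS"


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call, used only for this demo.
    return "FAIL" if "I don't know" in prompt else "PASS"


RUBRIC = "The answer must address what the user actually asked."
bad = judge(RUBRIC, "How do I export my data?", "I don't know.", stub_model)
good = judge(RUBRIC, "How do I export my data?", "Open Settings, then Export.", stub_model)
print(bad, good)
```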

                LLM judges require calibration before you trust them. A judge that passes everything is worse than no judge, because it creates false confidence. Teams calibrate by sampling judge decisions and comparing them against human reviewer decisions on the same cases. When the agreement rate is high enough, the judge runs at scale. Until then, treat judge scores as directional rather than definitive.
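Calibration comes down to comparing two lists of verdicts on the same cases. The sketch below reports the overall agreement rate and, separately, how often the judge passes a case that the human reviewer failed, which is the failure mode behind a judge that passes everything. `calibration_report` and the sample verdicts are illustrative, and what counts as "high enough" agreement is left to the team.

```python
def calibration_report(judge_verdicts: list[bool], human_verdicts: list[bool]) -> dict:
    """Compare judge and human pass/fail verdicts on the same cases."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and the same length")
    pairs = list(zip(judge_verdicts, human_verdicts))
    agreement = sum(j == h for j, h in pairs) / len(pairs)
    # Judge verdicts on the cases the human reviewer marked as failures.
    on_human_fails = [j for j, h in pairs if not h]
    false_pass_rate = (
        sum(on_human_fails) / len(on_human_fails) if on_human_fails else 0.0
    )
    return {"agreement": agreement, "false_pass_rate": false_pass_rate}


# Hypothetical sample of five cases (True = pass).
judge_verdicts = [True, True, True, True, False]
human_verdicts = [True, False, True, True, False]

calibration = calibration_report(judge_verdicts, human_verdicts)
print(calibration)
```

Here the judge agrees with the reviewer on four of five cases, but it passed one of the two cases the reviewer failed. A high agreement rate can hide a high false-pass rate when most cases pass, so it helps to track both.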

                A practical starting point: use code-based scoring for everything deterministic, and add LLM judges only for the cases where code can't capture what "good" means.

                How Amplitude supports offline eval workflows

                Seeding an offline eval dataset requires finding real failures, and finding real failures requires access to what actually happened during AI interactions. Amplitude's Agent Analytics treats AI interactions as events in the same product event stream where retention, conversion, and engagement are already measured, which means the trace data needed to build eval datasets lives in the same place as the product data.

                Teams building an initial offline eval set can filter their Session Replay and trace data for high-signal cases: sessions where users gave negative feedback, conversations where the same intent was retried multiple times, or interactions where latency spiked and engagement dropped. Those sessions become the raw material for eval cases. The failure is already named by user behavior; the team's job is to write the scoring criteria around it.

                This connection matters for PMs specifically because it closes the loop between eval quality and product outcomes. A pass rate tells you the model handles a test set. Amplitude shows you whether the failures your offline eval is designed to catch occur in the sessions that drive retention and conversion. When the two views are in the same tool, the team can prioritize which failure modes to guard against based on business impact, not just frequency.

                If you're getting started with AI evals, the AI Evals for Product Managers overview covers the full picture: traces, eval types, LLM judges, and how to connect eval quality to engagement metrics.

                Connect eval quality to product outcomes

                Offline evals define what good looks like before a change ships. Connecting that quality signal to the product data that drives retention and conversion is where the real leverage is.


                Frequently asked questions about offline evals

                What is the difference between an offline eval and a unit test?

                A unit test checks that a function returns an exact expected output for a given input. An offline eval checks that an AI system produces output meeting quality criteria, where the output may vary between runs and the criteria may be partly subjective. Unit tests assert equality. Offline evals assert quality against a defined rubric.

                How many cases does an offline eval dataset need?

                Twenty to fifty real failures is enough to get a useful first signal. Quality matters more than count. Each case should be clear enough that two reviewers working independently reach the same pass/fail verdict. A small, well-curated set produces a more trustworthy signal than a large dataset of ambiguous cases.

                Where do offline eval cases come from?

                Most teams seed their initial dataset from three sources: manual testing during development, bug reports and customer-reported failures, and production traces where users gave negative feedback or abandoned the conversation. Synthetic edge cases for known risks round out the set. As the system matures, new production failures get added back to keep the dataset current.

                What is the difference between offline and online evals?

                Offline evals run in development against a fixed dataset and gate releases by catching regressions before a change ships. Online evals run continuously against real production traffic. Most teams run both: offline evals prevent known failure modes from reaching users, and online evals surface new failure modes from live traffic that feed back into the offline dataset.

                Do offline evals require an LLM judge?

                No. Many offline evals use code-based scoring: a regex match, a JSON schema check, a verified tool call. LLM judges are only needed for outputs where correctness is open-ended or partly subjective. Start with code-based scoring for anything deterministic and add LLM judges for the cases code can't evaluate.

                Can you run an offline eval without a dedicated framework?

                Yes. A spreadsheet with input, expected output, actual output, and a pass/fail column is a working offline eval. Frameworks like DeepEval and Arize add scale, CI integration, and reporting, but they're not required to start. Defining what good output looks like is the hard part. The tooling is secondary.
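To make the spreadsheet approach concrete, here is a minimal sketch that reads such a sheet as CSV and computes the pass rate. The column names, the sample rows, and the `pass_rate_from_csv` helper are assumptions made for the example. A real sheet would be exported to a file rather than embedded as a string.

```python
import csv
import io

# Hypothetical sheet contents; in practice this would be an exported .csv file.
SHEET = """\
input,expected_output,actual_output,verdict
How do I get a refund?,Mentions the 30-day window,You can request a refund within 30 days.,pass
Can I reset my password?,Offers a reset link,Please contact support.,fail
What plans do you offer?,Lists current plans,We offer Free and Pro plans.,pass
Is my data encrypted?,Confirms encryption at rest,Yes. Data is encrypted at rest.,pass
"""


def pass_rate_from_csv(handle) -> float:
    """Compute the share of rows whose verdict column says 'pass'."""
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("sheet has no eval cases")
    passed = sum(row["verdict"].strip().lower() == "pass" for row in rows)
    return passed / len(rows)


rate = pass_rate_from_csv(io.StringIO(SHEET))
print(f"pass rate: {rate:.0%}")
```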