AI Evals for Product Teams
What Is an Offline Eval? A Guide for Product Teams
An offline eval is a pre-deployment test that measures whether an AI system handles a fixed set of known inputs correctly. Here's how product teams use them.
An offline eval is a test you run before a change ships. You feed a fixed dataset of known inputs to your AI system, score the outputs against defined quality criteria, and confirm that the system meets your quality bar before any production user sees the change.
The analogy to existing PM muscle memory is direct: offline evals are pre-deployment regression testing for AI features. Where a traditional regression suite checks that your checkout flow does not break after a backend change, an offline eval checks that your AI feature does not degrade after a prompt change, a model swap, or a retrieval update.
This page explains how offline evals work, what belongs in the dataset, and how they fit into the broader development cycle for AI features.
How offline evals work
An offline eval takes a fixed dataset of inputs, runs each one through the AI system, scores the output against defined criteria, and produces a pass rate: the percentage of cases where the system met the quality bar.
That is the full loop. The complexity lives in the details of each step.
The dataset is a curated collection of test cases. Each case pairs an input (a user query, a document, a conversation snippet) with a defined expectation for what a good output looks like. The dataset is fixed: it does not change between runs unless the team deliberately updates it.
Scoring happens in one of two ways. A code-based eval uses deterministic logic to check the output: does the response include a required field, does it match a JSON schema, did the agent call the expected tool? A large language model (LLM) judge uses a separate model to score open-ended quality: was the response helpful, was the answer grounded in the retrieved context, was the tone appropriate? Many eval suites use both.
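As a rough illustration of the code-based side, here is a minimal Python sketch of two deterministic checks. The response shape (a JSON object with `answer` and `citations` fields) and the idea of passing tool calls as a list of names are assumptions made for this example, not a format any particular system requires.

```python
import json

# Hypothetical schema for illustration: fields a valid response must contain.
REQUIRED_FIELDS = {"answer", "citations"}


def has_required_fields(raw_output: str) -> bool:
    """Pass if the output parses as a JSON object with every required field."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_FIELDS.issubset(payload)


def called_expected_tool(tool_calls: list[str], expected_tool: str) -> bool:
    """Pass if the agent invoked the expected tool at least once."""
    return expected_tool in tool_calls
```

Checks like these are cheap and give the same verdict on every run, which is why they are usually applied before any LLM judge is consulted.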
The output is typically a pass rate, a list of failing cases, and (for LLM judges) a set of scores with explanations. A team running an offline eval before a release reviews the failures, decides whether they represent regressions or acceptable tradeoffs, and either ships or iterates.
A concrete example: a product team ships an AI-powered Q&A feature for their support docs. Before each release, they run an offline eval against 80 curated questions drawn from real customer tickets. The eval checks two things: whether the agent returns a valid citation (code-based) and whether the answer is accurate and helpful (LLM judge). If the pass rate drops below 90%, the release is blocked until the team identifies and fixes the failure cases.
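The release gate in that example can be sketched in a few lines of Python. Everything named here is hypothetical: the `EvalCase` shape, the `[source: ...]` citation format, and especially `judge_stub`, which is a substring check standing in for a real LLM judge so the sketch stays runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_phrase: str  # hypothetical expectation: a phrase a good answer contains


def has_citation(case: EvalCase, output: str) -> bool:
    """Code-based check, assuming citations look like '[source: ...]'."""
    return "[source:" in output


def judge_stub(case: EvalCase, output: str) -> bool:
    """Placeholder for the LLM judge; a real one would call a separate model."""
    return case.expected_phrase.lower() in output.lower()


def run_eval(cases, system: Callable[[str], str], scorers, threshold: float = 0.90) -> dict:
    """Run each case once; a case passes only if every scorer passes."""
    if not cases:
        raise ValueError("eval dataset is empty")
    failures = []
    for case in cases:
        output = system(case.question)
        if not all(scorer(case, output) for scorer in scorers):
            failures.append(case)
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures, "blocked": pass_rate < threshold}


# Toy stand-in for the AI feature: two canned answers, one missing its citation.
CANNED = {
    "How do I reset my password?": "Use the reset link in Settings. [source: docs/account]",
    "Can I export my data?": "Yes, exports are supported.",
}
cases = [
    EvalCase("How do I reset my password?", "reset link"),
    EvalCase("Can I export my data?", "export"),
]
result = run_eval(cases, CANNED.__getitem__, [has_citation, judge_stub])
print(f"pass rate: {result['pass_rate']:.0%}, blocked: {result['blocked']}")
```

The list of failing cases is returned alongside the pass rate because the review step depends on it: the number says whether to block, and the cases say what to fix.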
What belongs in an offline eval dataset
An offline eval dataset is built from four sources: real failures, edge cases, known-good examples, and adversarial inputs.
Real failures are the most valuable cases in any dataset. When a user reports a bad output, files a bug, or gives a thumbs-down on an AI response, that input belongs in your eval set. These cases encode the failure modes your system has already exhibited in the wild, and they are the most likely to recur.
Edge cases come from manual testing and from reasoning about your product's hardest problems. If your AI assistant handles both casual and technical queries, the hardest technical queries go into the eval set. If your agent retrieves context from a document store, queries that require synthesizing multiple documents are edge cases worth capturing.
Known-good examples establish what correct looks like. These are inputs where your best team members agree on the right output. They function as the positive ground truth that your pass rate is measured against.
Adversarial inputs are queries designed to break the system: jailbreaks, ambiguous phrasings, inputs that are close to real user queries but subtly different in ways that matter. They make the eval set harder to game.
On dataset size: 20 to 50 real cases is enough to produce a useful first signal. Quality matters more than count. Each case should be unambiguous enough that two independent reviewers would reach the same pass or fail verdict. If your team cannot agree on whether an output passes, the case needs a clearer rubric before it goes into the eval set.
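One way to apply the two-reviewer test is to collect independent verdicts and set aside any case where they differ. The sketch below assumes each reviewer's verdicts are stored as a mapping from a case ID to pass (True) or fail (False); the IDs and data are made up for illustration.

```python
def split_by_agreement(
    verdicts_a: dict[str, bool], verdicts_b: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Return (agreed, disputed) case IDs among cases both reviewers labeled."""
    shared = sorted(verdicts_a.keys() & verdicts_b.keys())
    agreed = [cid for cid in shared if verdicts_a[cid] == verdicts_b[cid]]
    disputed = [cid for cid in shared if verdicts_a[cid] != verdicts_b[cid]]
    return agreed, disputed


reviewer_a = {"case-1": True, "case-2": False, "case-3": True}
reviewer_b = {"case-1": True, "case-2": True, "case-3": True}
agreed, disputed = split_by_agreement(reviewer_a, reviewer_b)
# Disputed cases need a clearer rubric before they join the eval set.
```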
How offline evals fit into the development cycle
Offline evals run in the development environment before a change ships, giving teams a quality gate between the "let's try this" stage and the "we're shipping this" stage.
The typical workflow looks like this: an engineer makes a change to the system (a new prompt, a different retrieval strategy, a model upgrade). Before the change gets merged and deployed, the eval suite runs automatically in CI. If the pass rate holds or improves, the change ships. If it drops, the engineer reviews the failing cases, fixes the root cause, and runs the eval again.
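A CI step for this workflow might compare the candidate's results against a stored baseline, as in the sketch below. The per-case result format (case ID mapped to pass or fail) and the function names are assumptions for illustration; real CI systems differ in how they store baselines and report failures.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed on the baseline but do not pass on the candidate."""
    return sorted(
        cid for cid, passed in baseline.items()
        if passed and not candidate.get(cid, False)
    )


def ci_gate(baseline: dict[str, bool], candidate: dict[str, bool]) -> int:
    """Return a process exit code: 0 if the pass rate holds or improves, else 1."""
    for cid in find_regressions(baseline, candidate):
        print(f"regression: {cid}")  # surfaced for the engineer to review
    return 0 if pass_rate(candidate) >= pass_rate(baseline) else 1


baseline = {"a": True, "b": True, "c": False}
candidate = {"a": True, "b": False, "c": False}
exit_code = ci_gate(baseline, candidate)
# In a CI script this would typically end with sys.exit(exit_code).
```

Listing the individual regressions matters because an unchanged headline pass rate can hide one case that broke while another was fixed.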
This is what makes offline evals valuable for PMs specifically: the eval suite is a shared definition of quality that is not locked in one engineer's head. When the suite is the release gate, the team has an explicit answer to "is this ready to ship?" rather than a subjective one.
The dataset itself should grow over time. When an online eval surfaces a new failure mode in production, or when a user reports a bug, that case gets added to the offline set. The offline dataset is a living record of every failure your system has exhibited and every quality standard your team has committed to.
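Growing the dataset can be as simple as appending each new failure with a note about where it came from. The sketch below keeps the dataset as an in-memory list of dictionaries and skips inputs that are already present after light normalization. The field names (`input`, `expectation`, `source`) and the normalization rule are assumptions for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical inputs compare equal."""
    return " ".join(text.lower().split())


def add_case(dataset: list[dict], input_text: str, expectation: str, source: str) -> bool:
    """Append a case unless an equivalent input already exists; return True if added."""
    key = normalize(input_text)
    if any(normalize(case["input"]) == key for case in dataset):
        return False
    dataset.append({"input": input_text, "expectation": expectation, "source": source})
    return True


dataset: list[dict] = []
add_case(dataset, "How do I cancel my plan?", "Explains the cancellation steps", "user bug report")
add_case(dataset, "how do I   cancel my plan?", "Explains the cancellation steps", "online eval")
# The second call is skipped as a duplicate, so the dataset holds one case.
```

Recording the source of each case makes it easier to see later how much of the set comes from real failures versus hand-written edge cases.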
Offline evals vs. online evals
Offline and online evals cover different parts of the quality problem. Running both is standard practice for production AI teams.
Offline evals catch regressions: changes that break something that used to work. Online evals catch the long tail of phrasings, intents, and edge cases that no curated dataset can fully anticipate, because they run against actual user traffic.
The two systems feed each other. A failure surfaced by online monitoring in production is a candidate for the offline dataset. Over time, the offline set grows to reflect the failure modes the product has actually encountered, and the eval suite becomes more representative of real usage.
How offline eval scores connect to product outcomes
A pass rate tells you whether the system handles a known set of cases correctly. It does not tell you whether quality improvements translate to better retention, whether failure modes concentrate in your highest-value user segments, or whether the queries that cost the most to run are also the ones most likely to convert.
Answering those questions requires joining eval scores and trace data to the product engagement data you already measure, under the same user identity. That join is where AI quality stops being a technical metric and starts informing product decisions: where to invest in improvements, which failure modes matter most to the business, and whether a model upgrade that lifts the pass rate by eight points actually changes anything users care about.
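To make the join concrete, here is a minimal Python sketch that groups users by their average eval score and compares retention between the groups. The event shape (`user_id`, `score`), the 0.8 cutoff, and the retained-user set are all hypothetical; this is not Amplitude's API, and a real analysis would run inside an analytics tool or data warehouse.

```python
from collections import defaultdict


def retention_by_quality(eval_events: list[dict], retained_users: set[str], cutoff: float = 0.8) -> dict:
    """Bucket users by mean eval score, then compute the retention rate per bucket."""
    scores_by_user = defaultdict(list)
    for event in eval_events:
        scores_by_user[event["user_id"]].append(event["score"])

    buckets = {"high_quality": [], "low_quality": []}
    for user_id, scores in scores_by_user.items():
        mean_score = sum(scores) / len(scores)
        bucket = "high_quality" if mean_score >= cutoff else "low_quality"
        # The join: the same user_id links eval scores to product retention.
        buckets[bucket].append(user_id in retained_users)

    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }


events = [
    {"user_id": "u1", "score": 0.9},
    {"user_id": "u1", "score": 1.0},
    {"user_id": "u2", "score": 0.5},
    {"user_id": "u3", "score": 0.85},
    {"user_id": "u4", "score": 0.2},
]
rates = retention_by_quality(events, retained_users={"u1", "u2", "u3"})
```

A gap between the two rates is a correlation to investigate, not proof that response quality caused the difference in retention.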
Amplitude's Agent Analytics was built for this connection. It treats AI interactions as events in the same product event stream where retention, conversion, and adoption are already measured, so teams can ask questions like "do users who get high-quality AI responses retain at higher rates?" alongside their existing product analytics.
Start measuring AI quality with Amplitude
Offline evals are one piece of the quality measurement picture. For the full framework, including online evals, LLM judges, trace analysis, and how to connect eval scores to product retention, keep reading the AI Evals for Product Managers series.
Try Amplitude for free today to connect AI interaction quality to the product metrics you already track.
Offline evals FAQ
How many cases does an offline eval dataset need?
Twenty to fifty cases drawn from real failures, manual testing, and early user feedback is enough to produce a useful first signal. The quality of each case matters more than the count. Each input and its expected output should be clear enough that two independent reviewers would reach the same pass or fail verdict.
How is an offline eval different from a unit test?
A unit test checks whether a function returns an exact expected output for a given input. An offline eval checks whether an AI system's output meets a quality bar, where the output may vary between runs and the criteria may be partly subjective. Unit tests assert equality. Evals assess quality.
Do teams need both offline and online evals?
Yes. Offline evals catch regressions before they reach users. Online evals surface new failure modes after a release. Each catches what the other misses. Most production AI teams run both, with online failures feeding back into the offline dataset over time.
What is a pass rate?
The pass rate is the percentage of eval cases where the AI system's output meets the defined success criteria. It is the most common headline metric for an eval suite and the number teams track over time to confirm that changes to the system improve rather than degrade quality.
Who owns the offline eval dataset?
PMs and engineers typically share ownership. PMs define what counts as success and contribute cases from real user behavior and business requirements. Engineers implement the eval infrastructure and integrate it into CI. Most mature AI product teams treat the eval set as a shared artifact that both functions can edit, similar to a shared spec.
What is eval-driven development?
Eval-driven development is a practice where teams write evals before the system can pass them, then use the failing cases to guide iteration. It is the AI equivalent of test-driven development: the eval defines the target, and the development work is the process of getting there.