# What Is an AI Evaluation? A Plain-Language Guide for Product Teams | Amplitude

An AI evaluation (or eval) is a repeatable test that measures whether an AI system produces output meeting defined quality criteria. Here's what product teams need to know.

Source: https://amplitude.com/en-us/explore/analytics/what-is-an-ai-evaluation

---


An AI evaluation, or eval, is a repeatable test that measures whether an AI system produces output meeting defined quality criteria for a given input. Evals serve the same purpose for AI products that unit tests serve for deterministic software: they define what good looks like, run every time the system changes, and produce a score the team can track over time.

If you're a product manager shipping AI features, evals are how you answer the question your dashboards can't: did the AI actually help?

Browse this guide

- [What an eval is](#what-an-eval-is)
- [The two dimensions that matter](#the-two-dimensions-that-matter)
- [Offline evals](#offline-evals)
- [Online evals](#online-evals)
- [Connecting eval scores to product outcomes](#connecting-eval-scores-to-product-outcomes)
- [Frequently asked questions](#frequently-asked-questions)

## What an eval is

An AI evaluation is a structured test that checks whether an AI system's output meets a defined quality standard for a specific input. Unlike a unit test, which asserts that a function returns an exact expected value, an eval asserts quality. The output may vary between runs. The criteria may be partially subjective. And a response can be technically correct in form while being useless in substance, or technically incomplete while still being the right answer.
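To make the contrast concrete, here is a minimal Python sketch. The `slugify` function, the `eval_setup_answer` checks, and the sample answers are all hypothetical, invented for illustration; a real eval would use criteria the team has agreed on rather than simple keyword checks.

```python
def slugify(title: str) -> str:
    """Deterministic function: the same input always gives the same output."""
    return "-".join(title.lower().split())


# Unit test: asserts equality with one exact expected value.
assert slugify("Hello World") == "hello-world"


def eval_setup_answer(answer: str) -> dict:
    """Eval: checks quality criteria that many different wordings can satisfy.

    The three criteria below are illustrative assumptions, not a standard.
    """
    text = answer.lower()
    checks = {
        "mentions_api_key": "api key" in text,
        "mentions_settings_page": "settings" in text,
        "within_length_budget": len(answer.split()) <= 60,
    }
    return {"checks": checks, "passed": all(checks.values())}


# Two differently worded answers can both pass the same eval.
answer_a = "Open Settings, create an API key, and paste it into the integration form."
answer_b = "You'll find the API key under Settings > Integrations. Copy it into the form."
# A fluent but unhelpful answer fails, even though it is well-formed text.
answer_c = "Integrations are a great way to connect your tools!"

results = [eval_setup_answer(a)["passed"] for a in (answer_a, answer_b, answer_c)]
```

The unit test accepts exactly one string. The eval accepts any answer that satisfies the criteria, which is why varying output between runs does not break it.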

That distinction matters for product teams because it changes what "testing" means. A support agent that returns grammatically valid text has passed a format check. Whether it actually resolved the user's question is a different test entirely.

Consider a product team that ships an AI assistant to help users set up their first integration. Traditional analytics records every session in which the user opens the assistant as an engaged interaction. Evals record whether the assistant's instructions were correct. Without evals, a hallucinated API endpoint looks identical to accurate guidance in every dashboard the team owns.

Evals are also how teams catch regressions. When a prompt changes, a model updates, or a retrieval system shifts, evals run against the same inputs and flag whether quality held.

## The two dimensions that matter

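Evals break down along two axes: where they run (offline or online) and how they score (with code or with an LLM judge). Code-based scoring applies deterministic checks to the output. LLM judge scoring asks a separate language model to grade the output against a rubric. Understanding this grid is the fastest way to get oriented in a new eval conversation.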

|                     | Code-based scoring                                           | LLM judge scoring                                                   |
| ------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------- |
| Offline (pre-ship)  | Check: did the agent call the right tool? Return valid JSON? | Check: was the response relevant? Did it address the user's intent? |
| Online (production) | Monitor: is the agent returning valid outputs at scale?      | Monitor: is response quality holding across real user traffic?      |

Most production teams run all four quadrants. The offline checks gate releases. The online checks catch what production traffic throws at the system that no dataset anticipated.
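The code-based column of the grid can be sketched in a few lines of Python. The output shape (a JSON object with a `tool` field) and the tool names are assumptions made for this example; a real agent framework will have its own format.

```python
import json


def score_tool_call(raw_output: str, expected_tool: str) -> dict:
    """Code-based scorer: is the output valid JSON, and did it name the right tool?

    Assumes (hypothetically) that the agent emits a JSON object with a "tool" key.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        # Output that does not parse fails both checks.
        return {"valid_json": False, "right_tool": False}

    right_tool = isinstance(payload, dict) and payload.get("tool") == expected_tool
    return {"valid_json": True, "right_tool": right_tool}


good = score_tool_call('{"tool": "create_webhook", "args": {}}', "create_webhook")
wrong_tool = score_tool_call('{"tool": "delete_webhook"}', "create_webhook")
malformed = score_tool_call("Sure! I will create a webhook.", "create_webhook")
```

The same function can run offline against a fixed dataset or online against sampled production outputs; only the source of `raw_output` changes.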

## Offline evals

Offline evals run in development against a fixed dataset of inputs before the system ships or before a change deploys. They're the AI equivalent of pre-deployment regression testing.

The dataset is built from real failures, edge cases, customer scenarios, and known good examples. Each eval runs the same input through the system and scores the result. If a new prompt version causes the agent to fail a case it previously handled correctly, the offline eval catches it before users see it.
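A rough sketch of that regression check follows. The two `prompt_v1` and `prompt_v2` functions are stand-ins for the real system under two prompt versions, and the `must_contain` substring check is a deliberately simple scorer chosen for illustration; all names and cases here are hypothetical.

```python
from typing import Callable


def run_offline_suite(system: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Run every case through the system and record pass/fail per case id."""
    return {
        case["id"]: case["must_contain"].lower() in system(case["input"]).lower()
        for case in cases
    }


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed under the baseline but fail under the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical fixed dataset.
cases = [
    {"id": "reset-password", "input": "How do I reset my password?", "must_contain": "reset link"},
    {"id": "export-csv", "input": "How do I export my data?", "must_contain": "csv"},
]


def prompt_v1(question: str) -> str:
    """Stub for the system under the current prompt."""
    if "password" in question.lower():
        return "We'll email you a reset link."
    return "Use the Export button."


def prompt_v2(question: str) -> str:
    """Stub for the system under the proposed prompt."""
    if "password" in question.lower():
        return "Please contact support."
    return "Click Export and choose CSV."


baseline = run_offline_suite(prompt_v1, cases)
candidate = run_offline_suite(prompt_v2, cases)
regressions = find_regressions(baseline, candidate)
```

In this toy run the new prompt fixes the export case but breaks the password case, so a release gate keyed on `regressions` being empty would block the change even though the overall pass rate is unchanged.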

The limitation is scope. Any fixed dataset encodes assumptions about how the product will be used, and production traffic always exceeds those assumptions. An offline suite built from early beta users won't cover the full range of phrasings, intents, and edge cases that arrive after a wider launch. That's the gap online evals fill.

A practical starting point: 20 to 50 real failures from manual testing, bug reports, and early user feedback are enough to produce a useful first signal. Quality of cases matters more than count. Each case should be unambiguous enough that two reviewers would independently reach the same pass or fail verdict.
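One way to test the "two reviewers" criterion is to have two people label the candidate cases independently and flag any case where they disagree. The sketch below assumes each reviewer's verdicts are stored as a mapping from case id to pass/fail; the case ids and labels are invented.

```python
def ambiguous_cases(reviewer_a: dict[str, bool], reviewer_b: dict[str, bool]) -> list[str]:
    """Case ids that both reviewers labeled but gave different verdicts."""
    shared = reviewer_a.keys() & reviewer_b.keys()
    return sorted(cid for cid in shared if reviewer_a[cid] != reviewer_b[cid])


def agreement_rate(reviewer_a: dict[str, bool], reviewer_b: dict[str, bool]) -> float:
    """Fraction of shared cases on which the two reviewers agree."""
    shared = reviewer_a.keys() & reviewer_b.keys()
    if not shared:
        raise ValueError("reviewers have no cases in common")
    return sum(reviewer_a[cid] == reviewer_b[cid] for cid in shared) / len(shared)


# Hypothetical independent verdicts on four candidate cases.
labels_a = {"case-1": True, "case-2": False, "case-3": True, "case-4": False}
labels_b = {"case-1": True, "case-2": False, "case-3": False, "case-4": False}

to_rewrite = ambiguous_cases(labels_a, labels_b)
rate = agreement_rate(labels_a, labels_b)
```

Cases in `to_rewrite` are candidates for tightening the pass criteria or dropping from the set until the expected behavior is clear.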

## Online evals

Online evals run continuously against real production traffic, scoring traces as they happen. Where offline evals tell you whether the system handles a known set of cases correctly, online evals tell you whether it handles actual user traffic correctly.

They capture the long tail. Users phrase the same question dozens of different ways. They ask things no one on the team anticipated. They combine requests that individually work fine but trip the system when combined. No curated dataset covers this surface.

Online evals also produce a continuous quality signal. Instead of a pass/fail check at deploy time, the team has a time-series score they can monitor, alert on, and segment. Quality by user segment. Quality by query type. Quality over time as the model and the product evolve.
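As an illustration of that segmentation, the sketch below groups scored traces by an arbitrary field and flags groups whose pass rate drops below a threshold. The trace fields (`day`, `segment`, `query_type`, `passed`), the sample data, and the 0.5 threshold are all assumptions for this example.

```python
from collections import defaultdict


def pass_rate_by(traces: list[dict], key: str) -> dict[str, float]:
    """Pass rate of scored traces, grouped by the given field."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for trace in traces:
        totals[trace[key]] += 1
        passes[trace[key]] += int(trace["passed"])
    return {group: passes[group] / totals[group] for group in totals}


def below_threshold(rates: dict[str, float], threshold: float) -> list[str]:
    """Groups whose pass rate is under the alert threshold."""
    return sorted(group for group, rate in rates.items() if rate < threshold)


# Hypothetical scored production traces.
traces = [
    {"day": "2024-05-01", "segment": "free", "query_type": "setup", "passed": True},
    {"day": "2024-05-01", "segment": "enterprise", "query_type": "billing", "passed": True},
    {"day": "2024-05-01", "segment": "enterprise", "query_type": "setup", "passed": False},
    {"day": "2024-05-02", "segment": "free", "query_type": "setup", "passed": True},
    {"day": "2024-05-02", "segment": "enterprise", "query_type": "billing", "passed": False},
    {"day": "2024-05-02", "segment": "enterprise", "query_type": "billing", "passed": False},
]

by_segment = pass_rate_by(traces, "segment")
by_day = pass_rate_by(traces, "day")
alerts = below_threshold(by_segment, threshold=0.5)
```

The same `pass_rate_by` call with `"day"` gives the time series and with `"query_type"` gives the per-intent breakdown, so one scored trace table can back all three views.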

The standard pattern is to use both: offline evals gate releases, online evals monitor behavior post-release, and failures surfaced by online monitoring get added back to the offline set. Each catches what the other misses.

## Connecting eval scores to product outcomes

Eval pass rates tell you whether the model performs on a test set. They don't tell you whether quality drives retention, whether failure modes concentrate in high-value user segments, or whether the query types that fail most often are the ones your best customers use most.

Answering those questions requires joining eval scores to product engagement data under the same user identity. A team that can segment users by the quality of their AI interactions and compare retention curves across those segments has a fundamentally different view of the product than a team looking at pass rates alone. If users with high-quality interactions turn out to retain at, say, twice the rate of everyone else, that finding is no longer just a quality metric; it's a growth argument.
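In its simplest form, the join is a lookup on a shared user id. The sketch below is a toy version with invented data: `eval_scores` maps each user to the scores of their AI interactions, `retained` maps each user to whether they came back, and the 0.8 quality threshold is an arbitrary assumption.

```python
def retention_by_quality(
    eval_scores: dict[str, list[float]],
    retained: dict[str, bool],
    threshold: float = 0.8,
) -> dict[str, float]:
    """Retention rate for users bucketed by their average eval score."""
    buckets: dict[str, list[bool]] = {"high_quality": [], "low_quality": []}
    for user_id, scores in eval_scores.items():
        # Only users present in both data sets can be joined.
        if user_id not in retained or not scores:
            continue
        average = sum(scores) / len(scores)
        bucket = "high_quality" if average >= threshold else "low_quality"
        buckets[bucket].append(retained[user_id])
    return {
        name: sum(flags) / len(flags)
        for name, flags in buckets.items()
        if flags
    }


# Hypothetical data keyed by the same user identity.
eval_scores = {"u1": [1.0, 0.9], "u2": [0.4, 0.6], "u3": [0.95], "u4": [0.2]}
retained = {"u1": True, "u2": False, "u3": True, "u4": True}

retention = retention_by_quality(eval_scores, retained)
```

With real data the buckets would be far larger and the comparison would typically be a retention curve over time rather than a single rate, but the dependency is the same: both data sets need a common user key.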

Amplitude's [Agent Analytics](https://amplitude.com/ai-agents) was built for this join. It treats AI interactions as events in the same product event stream where retention, conversion, and feature adoption are already measured. Conversation traces decompose into behavioral events. Eval scores attach to user profiles. Quality-to-retention correlation becomes a query you can run in the same workspace where you already track your product.

That's the loop that makes eval work worth funding: not just "we found the failure" but "the failure cost us these users."

## Frequently asked questions

### What is the difference between an eval and a unit test?

A unit test checks that a function returns an exact expected output for a given input. An eval checks that an AI system produces output meeting quality criteria, where the output may vary between runs and the criteria may be partly subjective. Unit tests assert equality. Evals assert quality.

### What is the difference between offline and online evals?

Offline evals run in development against a fixed dataset before a change ships. They catch regressions before users see them. Online evals run continuously against real production traffic after the system is live. They surface the failure modes that no curated dataset anticipates. Most production AI teams run both.

### What is LLM as a judge?

LLM as a judge (sometimes written LLMaaJ) is an eval approach where a separate language model scores an AI output against a rubric defined in natural language. It scales to open-ended quality questions that code can't evaluate, such as whether a response addressed the user's actual intent or whether the tone was appropriate. LLM judges require calibration against human labels before they can be trusted at scale.
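Calibration can be as simple as running the judge over examples that humans have already labeled and counting where the two disagree. In the sketch below, `stub_judge` is a placeholder that stands in for a real model call (it is not an LLM and makes no API request), and the labeled examples are invented.

```python
from typing import Callable


def calibrate_judge(judge: Callable[[str, str], bool], labeled: list[dict]) -> dict:
    """Compare judge verdicts to human pass/fail labels."""
    if not labeled:
        raise ValueError("need at least one human-labeled example")
    agree = false_pass = false_fail = 0
    for example in labeled:
        verdict = judge(example["question"], example["response"])
        if verdict == example["human_pass"]:
            agree += 1
        elif verdict:
            false_pass += 1  # judge passed something humans failed
        else:
            false_fail += 1  # judge failed something humans passed
    return {
        "agreement": agree / len(labeled),
        "false_pass": false_pass,
        "false_fail": false_fail,
    }


def stub_judge(question: str, response: str) -> bool:
    """Placeholder for an LLM call: passes any response of five or more words."""
    return len(response.split()) >= 5


question = "How do I rotate my API key?"
labeled = [
    {"question": question, "response": "Go to Settings, open API keys, and click Rotate.", "human_pass": True},
    {"question": question, "response": "I'm not sure.", "human_pass": False},
    {"question": question, "response": "API keys are an important part of modern software security.", "human_pass": False},
    {"question": question, "response": "Click Rotate in Settings.", "human_pass": True},
]

report = calibrate_judge(stub_judge, labeled)
```

Separating false passes from false fails matters because they cost different things: a judge that passes bad answers hides failures, while one that fails good answers creates noise in alerts.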

### How many evals do you need to start?

Twenty to 50 real failures from manual testing, bug reports, and early user feedback are enough to produce a useful first signal. The quality of the cases matters more than the count. Each case should be unambiguous enough that two reviewers would independently reach the same pass or fail verdict.

### Who owns evals on a product team?

In most mature AI product teams, product managers and engineers share ownership. PMs define what counts as success and contribute eval cases from real user behavior. Engineers implement the eval infrastructure and integrate it into CI. Many teams treat evals as a shared artifact that both functions can edit, the same way both functions edit a product spec.

## Start connecting eval scores to product outcomes

If you're building AI features and want to connect evaluation scores to the product metrics that fund improvements, [Amplitude Agent Analytics](https://amplitude.com/ai-agents) joins trace data, eval results, and behavioral analytics in one workspace.

[Try Amplitude for free today](https://app.amplitude.com/signup) and start measuring what your AI actually does.
