
What Is an Online Eval? Monitoring AI Quality in Production

An online eval scores your AI system's outputs against real production traffic, continuously. Learn how online evals work, what they catch, and how they pair with offline evals.


                  An online eval is a repeatable quality check that runs continuously against live production traffic, scoring AI outputs as they happen. Where offline evals guard against known failure modes before a change ships, online evals tell you what your system actually does with real users after it ships. For PM teams, that continuous signal is what makes sustained improvement possible once a feature is live.

                  How online evals work

                  An online eval runs asynchronously: traces flow from your production AI system into a scoring pipeline, each trace is evaluated against a rubric, and results accumulate into a live quality signal the team can monitor, segment, and alert on. The scoring happens after the response has already been returned to the user, so it adds no latency to the product experience.

                  The pipeline looks roughly like this in practice. A user sends a message to your AI feature. The system returns a response. That interaction is captured as a trace and routed to the eval pipeline. A scoring function evaluates it, passes or fails it against defined criteria, and logs the result. The pass rate across all traces updates in near-real time.
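The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Trace` shape, the `score` criterion, and the `EvalPipeline` class are all hypothetical names chosen for the example, and a real system would more likely use a message queue or a tracing backend than an in-process thread.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Trace:
    """One captured production interaction (hypothetical shape)."""
    trace_id: str
    user_message: str
    response: str


def score(trace: Trace) -> bool:
    """Placeholder criterion: the response is non-empty and not a canned refusal."""
    text = trace.response.strip().lower()
    return bool(text) and "i can't help with that" not in text


class EvalPipeline:
    """Scores traces on a background thread, off the request path."""

    def __init__(self) -> None:
        self._queue = queue.Queue()
        self._lock = threading.Lock()
        self.passed = 0
        self.total = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, trace: Trace) -> None:
        """Called after the response is returned; it only enqueues the trace."""
        self._queue.put(trace)

    def _run(self) -> None:
        while True:
            trace = self._queue.get()
            if trace is None:  # sentinel: stop the worker
                break
            ok = score(trace)
            with self._lock:
                self.total += 1
                self.passed += int(ok)

    def close(self) -> None:
        """Score everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()

    def pass_rate(self) -> float:
        with self._lock:
            return self.passed / self.total if self.total else 0.0


pipeline = EvalPipeline()
pipeline.submit(Trace("t1", "Reset my password", "Here is how to reset it."))
pipeline.submit(Trace("t2", "Cancel my plan", ""))
pipeline.close()
print(pipeline.pass_rate())  # 0.5: one trace passed, one failed
```

The point of the design is that `submit` does no scoring work itself, which is why the user-facing request never waits on the eval.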

                  Teams typically set up dashboards showing pass rate over time, segmented by intent category, user cohort, or feature version. Alerts fire when the pass rate for a segment drops below a defined threshold. A sudden drop in pass rate on high-value user sessions is as actionable as a spike in error rate: it means something changed and needs investigation.
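As a rough sketch of the segment-and-alert pattern, the snippet below computes pass rate per segment and flags any segment that falls below its threshold. The segment names, the 0.90 default threshold, and the function names are assumptions made for illustration; a real setup would also want a minimum sample size per segment before alerting.

```python
from collections import defaultdict


def pass_rate_by_segment(results):
    """results: iterable of (segment, passed) pairs -> {segment: pass rate}."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [passed, total]
    for segment, passed in results:
        counts[segment][0] += int(passed)
        counts[segment][1] += 1
    return {seg: p / t for seg, (p, t) in counts.items()}


def segments_to_alert(rates, thresholds, default_threshold=0.90):
    """Return the segments whose pass rate is below their threshold, sorted by name."""
    return sorted(
        seg for seg, rate in rates.items()
        if rate < thresholds.get(seg, default_threshold)
    )


results = [
    ("billing", True), ("billing", True), ("billing", False), ("billing", False),
    ("search", True), ("search", True), ("search", True), ("search", True),
]
rates = pass_rate_by_segment(results)
alerts = segments_to_alert(rates, thresholds={"billing": 0.95})
print(rates)   # {'billing': 0.5, 'search': 1.0}
print(alerts)  # ['billing']
```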

                  The key difference from running an offline eval is the input source. Offline evals use a curated dataset you control. Online evals use whatever real users actually send, which is always a larger and more varied distribution than any curated dataset captures.

                  What online evals catch that offline evals miss

                  Online evals surface failure modes that no pre-deployment test suite can anticipate, because they come from real user behavior at scale.

The clearest example is phrasing variation. An offline dataset of 50 carefully constructed test cases captures the failure modes a team knows to look for. It cannot capture the range of ways 10,000 users will phrase the same underlying intent. A phrasing variant that affects 0.3% of sessions looks negligible in a test environment; at 10,000 sessions per day, 0.3% is 30 failed sessions every day. Online evals catch it because they score live traffic rather than a fixed set.
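The arithmetic behind that example is worth making explicit. The sketch below assumes a volume of 10,000 sessions per day, and it makes the generous assumption that the 50 offline cases are random draws from production traffic (curated cases usually are not), to estimate how likely the offline set is to contain the variant at all.

```python
failure_share = 0.003    # 0.3% of sessions hit the problematic phrasing
daily_sessions = 10_000  # assumed production volume
test_cases = 50          # size of the offline dataset

expected_failures_per_day = failure_share * daily_sessions

# Chance that at least one of 50 randomly drawn cases contains the variant.
p_covered_offline = 1 - (1 - failure_share) ** test_cases

print(round(expected_failures_per_day))  # 30
print(round(p_covered_offline, 2))       # 0.14
```

Under those assumptions, the offline set has roughly a one-in-seven chance of including the variant even once, while production sees it about 30 times a day.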

                  Distribution shift is a second failure mode offline evals reliably miss. User behavior changes over time. Seasonal patterns, new user cohorts, and product changes all alter the distribution of intents hitting the system. An offline eval suite is a snapshot of what the product team expected users to do. Online evals track what users actually do as that evolves.

                  Novel failure modes come third. Some failure patterns only emerge at scale, when combinations of inputs and system states occur that low-volume testing never produces. Online evals surface these by running continuously rather than against a fixed set.

                  Online evals vs. offline evals

                  The two types cover different parts of the AI product lifecycle. Most production AI teams run both.

| | Online eval | Offline eval |
| --- | --- | --- |
| When it runs | Continuously in production | Pre-deployment, in CI |
| Input source | Real user traffic | Fixed, curated dataset |
| Primary purpose | Monitor live quality, surface new failure modes | Catch regressions before release |
| Strength | Covers the full distribution of real usage | Fast, reproducible, gates releases |
| Limitation | Runs after deployment; can't prevent the first bad response | Curated dataset can't anticipate all real-world variation |

                  Offline evals gate releases. Online evals monitor live behavior. A team running only offline evals ships with confidence but flies blind after launch. A team running only online evals catches problems after users have already seen them, with no pre-deployment safety net.

                  The right question for any AI product team isn't which type to run. It's how to connect them so each one improves the other over time.

                  Scoring online evals

Online evals use the same two automated scoring approaches as offline evals, code-based checks and LLM judges, applied differently at production scale. Human review backs both up.

                  Code-based scoring

                  Code-based scoring runs on every trace. A regex match, a JSON schema check, a verified tool call, a required phrase present or absent: these checks are fast and cheap enough to apply to 100% of production traffic without meaningful infrastructure cost. Code-based scoring gives the team a constant baseline signal across the full volume of real sessions.
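To make the idea concrete, here is a small sketch of three such checks using only the Python standard library. The required JSON keys, the `ORD-` order ID format, and the forbidden phrase are hypothetical stand-ins for whatever a given product actually needs to verify.

```python
import json
import re

ORDER_ID = re.compile(r"\bORD-\d{6}\b")  # hypothetical order ID format


def check_valid_json(response: str, required_keys=("answer", "sources")) -> bool:
    """Pass if the response parses as a JSON object containing the required keys."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)


def check_contains_order_id(response: str) -> bool:
    """Pass if the response mentions something shaped like an order ID."""
    return ORDER_ID.search(response) is not None


def check_no_forbidden_phrase(response: str, forbidden=("as an ai language model",)) -> bool:
    """Pass if none of the forbidden phrases appear (case-insensitive)."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in forbidden)


def run_checks(response: str) -> dict:
    """Run every code-based check and return a name -> passed mapping."""
    return {
        "valid_json": check_valid_json(response),
        "contains_order_id": check_contains_order_id(response),
        "no_forbidden_phrase": check_no_forbidden_phrase(response),
    }


good = '{"answer": "Order ORD-123456 has shipped.", "sources": []}'
bad = "As an AI language model, I cannot check orders."
print(run_checks(good))
print(run_checks(bad))
```

Each check is a pure function of the response text, which is what keeps this kind of scoring cheap enough to run on every trace.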

                  LLM judges

                  LLM judges don't run on every trace. Calling a separate language model to evaluate each production interaction adds cost that scales directly with traffic volume. Most teams handle this by sampling: score 5-20% of traces with an LLM judge, with higher sample rates for the segments that matter most (enterprise users, high-value intent types, sessions that followed an unusual path). This keeps cost manageable while still covering the cases where nuanced quality evaluation matters.
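One way to implement segment-aware sampling is to hash the trace ID into a number between 0 and 1 and compare it with the segment's sample rate. The sketch below assumes a string `trace_id` and two illustrative segments; the rates and names are examples, not recommendations beyond the 5-20% range mentioned above.

```python
import hashlib

# Hypothetical per-segment sample rates for LLM judge scoring.
SAMPLE_RATES = {"enterprise": 0.20, "default": 0.05}


def sample_for_judge(trace_id: str, segment: str) -> bool:
    """Deterministically decide whether a trace is sent to the LLM judge."""
    rate = SAMPLE_RATES.get(segment, SAMPLE_RATES["default"])
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


sampled = sum(sample_for_judge(f"trace-{i}", "enterprise") for i in range(10_000))
print(sampled / 10_000)  # close to 0.20
```

Hashing rather than calling a random number generator means the same trace always gets the same decision, so a rerun or a retry does not change which traces were judged.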

                  Human review

                  Human review plays a smaller but critical role. When online evals surface a novel failure mode that neither code-based scoring nor an existing LLM judge captures clearly, a human reviewer names it, writes the scoring criteria, and either creates a new code-based check or updates the judge rubric. Human review is the mechanism by which new failure modes get turned into repeatable eval cases.

                  A practical starting point: deploy code-based checks on everything from day one, sample 10% for LLM judge scoring, and reserve human review for the failures automated scoring can't characterize.

                  The feedback loop between online and offline evals

                  Online evals are how an offline eval dataset stays current as a product grows.

                  A production AI system encounters failure modes no pre-deployment test suite anticipated. Online evals surface those failures. A PM or engineer reviews the trace, names the failure mode, writes a scoring criterion for it, and adds it to the offline dataset. The next CI run tests against a tighter set of cases. The pre-deployment gate improves.
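The "add it to the offline dataset" step can be as simple as appending a record to a file that CI reads. The sketch below assumes the offline dataset is a JSON Lines file and that a trace is a dict with `trace_id` and `user_message` keys; both are illustrative choices, and the field names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def promote_to_offline_case(trace: dict, failure_mode: str, criterion: str,
                            dataset_path: Path) -> dict:
    """Turn a reviewed production failure into a permanent offline eval case."""
    case = {
        "input": trace["user_message"],
        "failure_mode": failure_mode,           # the name the reviewer gave it
        "criterion": criterion,                 # what a passing response must do
        "source_trace_id": trace["trace_id"],   # link back to the original trace
    }
    with dataset_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return case


# Usage, written to a temporary directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "offline_cases.jsonl"
    promote_to_offline_case(
        {"trace_id": "t-981", "user_message": "gimme my money back"},
        failure_mode="informal_refund_phrasing",
        criterion="Response starts the refund flow",
        dataset_path=path,
    )
    print(path.read_text(encoding="utf-8"))
```

Keeping the source trace ID on the case lets a later reader see which real session motivated it.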

                  This loop is what eval-driven development looks like in a mature AI product team. The offline dataset is not a fixed artifact written at launch. It grows from production behavior over time, with each class of real failure becoming a permanent eval case that prevents the same failure from shipping again in a future version.

                  Teams that skip online evals lose access to this loop. Their offline datasets reflect what the team expected users to do at the time the system shipped, not what users actually do now. Over time, the eval suite drifts from reality, and quality problems surface only when users report them.

                  The feedback loop also applies to the scoring rubric. When human reviewers calibrate an LLM judge against online eval samples, the judge improves. Rubric updates propagate back to both offline and online scoring, raising the quality bar across the full eval system.
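A simple way to check a judge against human reviewers is to compare their pass/fail labels on the same sampled traces. The sketch below computes raw agreement and Cohen's kappa, a standard chance-corrected agreement statistic; using kappa here is a suggestion of this example rather than something the workflow above prescribes, and the labels are made up.

```python
def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length lists of booleans."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n   # share of traces the human passed
    p_judge = sum(judge) / n   # share of traces the judge passed
    # Agreement expected if both labeled at random with those pass rates.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, False, True, True, True]
observed, kappa = agreement_and_kappa(human, judge)
print(observed)         # 0.75
print(round(kappa, 2))  # 0.47
```

Raw agreement can look high simply because most traces pass; kappa discounts that, which makes it a more honest signal of whether a rubric change actually improved the judge.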

                  How Amplitude supports online eval workflows

                  Running online evals generates a quality signal. The question PM teams then face is what that signal means for the business.

                  A pass rate tells you the system handled X% of production traffic correctly. It doesn't tell you whether the sessions that failed are also the sessions that drive retention, or whether failure concentrates in your most valuable user segments. Answering those questions requires the eval signal to live next to the product engagement data.

                  Amplitude's Agent Analytics treats AI interactions as events in the same product event stream where retention, conversion, and feature adoption are already tracked. Online eval scores can live in that same stream, which means a PM can ask: are the users whose sessions fail evals more likely to churn? Which intent categories are failing and how much of my core retention do they represent? Which model versions have better quality scores and better downstream conversion?
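The first of those questions boils down to a comparison between two groups of users. The sketch below is plain Python over in-memory data, not Amplitude's API: the `(user_id, passed)` session tuples and the set of churned user IDs are hypothetical inputs standing in for whatever the event stream provides.

```python
def churn_rate_by_eval_outcome(sessions, churned_users):
    """Compare churn for users with at least one failed eval vs. users with none.

    sessions: iterable of (user_id, passed) pairs
    churned_users: set of user IDs that later churned
    """
    sessions = list(sessions)
    all_users = {user for user, _ in sessions}
    failed_users = {user for user, passed in sessions if not passed}
    clean_users = all_users - failed_users

    def rate(users):
        return len(users & churned_users) / len(users) if users else 0.0

    return {
        "had_failed_eval": rate(failed_users),
        "no_failed_eval": rate(clean_users),
    }


sessions = [
    ("u1", True), ("u1", False),
    ("u2", True),
    ("u3", False),
    ("u4", True),
]
print(churn_rate_by_eval_outcome(sessions, churned_users={"u1"}))
# {'had_failed_eval': 0.5, 'no_failed_eval': 0.0}
```

A gap between the two rates is a correlation, not proof that eval failures cause churn, but it is a reasonable first cut for deciding which failure modes deserve attention.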

                  Session Replay adds a second layer: when an online eval fails, the team can pull the session recording and watch exactly what the user experienced. The eval score names the failure. The replay shows the context. Together they produce the information needed to prioritize which failure modes to fix first.

                  If you're building out your AI eval system from scratch, the AI Evals for Product Managers overview covers the full picture: traces, eval types, LLM judges, and how to connect quality to engagement metrics.

                  Connect online eval quality to product outcomes

                  Online evals are how AI product teams find out what actually breaks in production. Connecting those failures to the product metrics that matter is how teams decide which ones to fix first.

                  Try Amplitude for free today and see how AI interaction data and product engagement metrics live in the same workflow.

                  Frequently asked questions about online evals

What is the difference between online evals and LLM observability?

LLM observability covers the infrastructure view of a running AI system: latency, cost, token usage, error rates, and trace capture. An online eval is a quality check layered on top of that observability data. Observability tells you the system ran without errors. Online evals tell you whether the output it produced was actually good.

Do online evals slow down the product?

No. Online evals run asynchronously, after the response has been returned to the user. Scoring happens in a separate pipeline so it adds no latency to the product. Results typically appear in dashboards within seconds to minutes, depending on the scoring method and infrastructure setup.

Do online evals score every production trace with an LLM judge?

Most teams don't score every trace with an LLM judge. Code-based checks run on everything cheaply. LLM judge scoring is sampled, typically 5-20% of traces, with higher rates for high-value user segments. The sampling approach keeps cost manageable while still covering the cases where open-ended quality evaluation matters most.

What is the difference between online and offline evals?

Offline evals run in development against a fixed dataset before a change ships, and their primary job is catching regressions. Online evals run continuously against live production traffic and monitor quality after release. Most teams run both: offline evals gate releases, and online evals surface new failure modes that feed back into the offline dataset over time.

How many online evals should a team start with?

Start with a small set that mirrors your offline evals: the same quality criteria, applied to live traffic. A handful of well-calibrated checks covering your most critical user intents is more useful than a large suite of loosely defined ones. Add new online evals as production traffic surfaces failure modes your initial set didn't cover.

Can online evals trigger alerts or automated actions?

Yes. Teams commonly configure online evals to send alerts to Slack or PagerDuty when pass rate drops below a threshold, route failing sessions to human review queues, or trigger feature flags that roll back a model change if quality degrades. The alert-on-threshold pattern is straightforward to implement even early on.