Analyze agent results

Early Access

This feature is in Early Access. During this time, parts of the functionality may still be in development, and this documentation may not always be up to date. If you have any questions, contact Amplitude Support.

This page explains how to read the Agent Analytics UI, interpret evaluator results, and turn agent data into Amplitude cohorts. If you're new to the product, start with the Agent Analytics overview. Set up the SDK to send agent data before you analyze it.

Open Agent Analytics

In the Amplitude app, select the Agent Analytics entry in the left navigation. The entry expands into pages for Sessions, Datasets, Evaluators, Runs, and Monitor. Selecting the top-level entry opens the Monitor page.

Agent Analytics scopes sessions, evaluators, and overview metrics to the selected project. Choose a project to see its data.

URL path

The URL path uses the legacy llm-analytics slug, so the address bar shows a path like /llm-analytics/<org>/sessions even though the product name in the UI is Agent Analytics.

Monitor agent activity

Open Monitor from the Agent Analytics left navigation. Monitor is the landing page, so selecting the top-level Agent Analytics entry also opens it.

Monitor summarizes agent health across the selected project and time range. The page charts agent activity and session volume, session quality from the always-on signals, and cost and latency trends. Use these charts to spot a shift before you open individual sessions, and apply the agent filter to focus on one agent or compare agents.

Before a project sends any agent data, Monitor shows a get-started guide for SDK setup in place of the charts.

Browse sessions

Open Sessions from the Agent Analytics left navigation. The sessions list shows every agent session in the selected time range. Use it to:

Filter by date, agent, evaluator result, error type, or topic.
Filter sessions by an Amplitude cohort to focus on a specific group of users.
Filter by agent name through the URL query parameter, for example ?agents=<agent-name>.
Search interaction content, with the most relevant sessions first.

Filters combine, so you can narrow to sessions from one agent that failed a specific evaluator on a given day.
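As a rough illustration of the URL path and the agents query parameter described above, the following Python sketch builds a link to the Sessions page for one agent. The host name, the sessions_url helper, and the example org and agent values are assumptions made for illustration. Only the /llm-analytics/<org>/sessions path and the ?agents= parameter come from this page.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: placeholder host. Replace it with the URL of your Amplitude app.
BASE_URL = "https://app.amplitude.com"


def sessions_url(org: str, agent_name: Optional[str] = None) -> str:
    """Build a Sessions page link that uses the legacy llm-analytics slug."""
    path = f"/llm-analytics/{org}/sessions"
    if agent_name is None:
        return f"{BASE_URL}{path}"
    # Percent-encode the agent name so spaces and reserved characters are safe.
    query = urlencode({"agents": agent_name}, quote_via=quote)
    return f"{BASE_URL}{path}?{query}"


# Hypothetical org and agent names, used only to show the shape of the link.
print(sessions_url("my-org", "support-bot"))
```

A link like this is handy for bookmarks or runbooks that should open the sessions list already narrowed to one agent.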

Inspect a session in detail

Open any session to view the Session Detail View. The view organizes session data into five tabs:

Thread: the conversation between the user and the AI, message by message. Includes the session ID.
Info: session metadata including session ID, agent, model, cost, tokens, and latency.
Turns: turn-by-turn breakdown, including nested tool calls and spans.
Evaluators: evaluator results for the session, with pass or fail status and signals.
Review: review the session and set its review settings and labels.

The session hierarchy is Session > Turn > Span. Spans roll up into turns, and turns roll up into the session. Amplitude identifies a session with the [Agent] Session ID property.
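To make the Session > Turn > Span hierarchy concrete, here is a minimal Python sketch of how span values could roll up into turns and then into a session. The class names, field names, and example values are hypothetical and don't reflect Amplitude's internal schema. Amplitude computes the real rollups for you.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Smallest unit in the hierarchy (assumed here to be a model or tool call)."""
    name: str
    cost_usd: float = 0.0
    tokens: int = 0


@dataclass
class Turn:
    """One turn. Its totals are the sum of its spans."""
    spans: List[Span] = field(default_factory=list)

    @property
    def cost_usd(self) -> float:
        return sum(span.cost_usd for span in self.spans)

    @property
    def tokens(self) -> int:
        return sum(span.tokens for span in self.spans)


@dataclass
class Session:
    """One session. Its totals are the sum of its turns."""
    session_id: str
    turns: List[Turn] = field(default_factory=list)

    def rollup(self) -> dict:
        return {
            "total_turns": len(self.turns),
            "total_cost_usd": sum(turn.cost_usd for turn in self.turns),
            "total_tokens": sum(turn.tokens for turn in self.turns),
        }


# Hypothetical session with two turns and three spans.
session = Session(
    session_id="sess-123",
    turns=[
        Turn(spans=[Span("llm_call", 0.25, 400), Span("tool:search", 0.0, 0)]),
        Turn(spans=[Span("llm_call", 0.5, 600)]),
    ],
)
print(session.rollup())
```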

From a session, you can:

Select User Activity to open that user's event timeline. Amplitude filters the timeline to [Agent] events and starts it at the session.
Select Filter to Similar Sessions to find sessions that share characteristics with this one.
Select Add to Dataset to include this session in a specific dataset.

Find similar sessions

When a session shows a problem, select Suggested filters to surface other sessions with the same pattern. Suggested filters anchor on the session you're viewing, so they work best when you start from a concrete example rather than a broad search.

Read evaluator results

Agent Analytics measures quality with two layers: Signals and Evaluators.

Signals are always-on enrichments that Amplitude runs on every closed session. You don't configure them. The default signals cover task completion, response quality, user friction, negative feedback, user intent, and session safety, plus a code-based data quality check. Amplitude refines how signals work over time, so treat them as directional indicators.

Evaluators are scorers you define and calibrate yourself for precise, product-specific criteria.

Enrichment writes results to two events in your event stream:

[Agent] Session Record: one event for each session. This event carries the signal results, session rollups (total turns, total cost, and tokens), and flags such as [Agent] Has Data Quality Issues.
[Agent] Evaluator Result: one event for each evaluator run on a session. Key properties include [Agent] Evaluator Name, [Agent] Evaluator Output Type, [Agent] Binary Label, [Agent] Rationale, [Agent] Evidence, and [Agent] Evaluator Model.

For the full list of events and their properties, go to the Agent Analytics event taxonomy.

These results land as queryable Amplitude events, so you can build a cohort like "users whose sessions failed task completion" with a single filter. The data quality signal is code-based (rule and pattern matching) rather than LLM-based, so the signal returns a clear "no issues found" result instead of a generated rationale.
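If you work with these events outside the UI, for example in an export, the same cohort logic can be sketched in a few lines of Python. This sketch makes several assumptions: each event is a dictionary with user_id, event_type, and event_properties keys; a failed binary evaluator is stored as the boolean False in [Agent] Binary Label; and the evaluator name task_completion is a placeholder. Check the event taxonomy for the actual value formats.

```python
from typing import Iterable, Set

EVALUATOR_RESULT = "[Agent] Evaluator Result"


def users_with_failed_evaluator(events: Iterable[dict], evaluator_name: str) -> Set[str]:
    """Collect the users who have at least one failed result for an evaluator."""
    failed_users: Set[str] = set()
    for event in events:
        if event.get("event_type") != EVALUATOR_RESULT:
            continue
        props = event.get("event_properties", {})
        # Assumption: a failed binary evaluator is stored as the boolean False.
        if (
            props.get("[Agent] Evaluator Name") == evaluator_name
            and props.get("[Agent] Binary Label") is False
        ):
            failed_users.add(event["user_id"])
    return failed_users


# Hypothetical exported events.
sample_events = [
    {"user_id": "u1", "event_type": EVALUATOR_RESULT,
     "event_properties": {"[Agent] Evaluator Name": "task_completion",
                          "[Agent] Binary Label": False}},
    {"user_id": "u2", "event_type": EVALUATOR_RESULT,
     "event_properties": {"[Agent] Evaluator Name": "task_completion",
                          "[Agent] Binary Label": True}},
    {"user_id": "u3", "event_type": "[Agent] Session Record",
     "event_properties": {}},
]
print(users_with_failed_evaluator(sample_events, "task_completion"))
```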

Create and refine custom evaluators

Signals run automatically. To measure product-specific criteria, create your own evaluators on the Evaluators page, which you open from the Agent Analytics left navigation. Define an evaluator with a name, a judge prompt, and pass criteria, then test it before you activate it:

Create the evaluator with its judge prompt and pass criteria.
Run it on the Runs page against a sample of sessions or a saved dataset to check how it scores.
Review the results, then revise the judge prompt or activate the evaluator.

Evaluators support binary (true/false), classification, and score output types. Activate an evaluator to start scoring new sessions. The results land as [Agent] Evaluator Result events that you can query alongside the signal data.

By default, evaluators run on Amplitude-managed models. To run them on your own LLM provider instead, add your provider key in AI Controls.

Human reviewers improve evaluator accuracy. Reviewers label sessions and correct evaluator outputs in the Review tab, and those reviewed labels build a ground-truth dataset. Rerun an evaluator against that dataset to compare how a revised prompt scores against known-good answers.

Run evaluators

The Runs page runs evaluators against a set of sessions on demand. Use the Runs page to test an evaluator before you activate it, or to score sessions that predate the evaluator. Open Runs from the Agent Analytics left navigation.

To start a run, select Create Run, choose one or more evaluators, and choose which sessions to score: a sample of recent sessions, a filtered set, or a saved dataset (a reusable set of sessions you curate on the Datasets page).

While a run is in progress, the run drawer shows a live X / N sessions progress bar and a count of any failures. Results stream in as each session finishes, so you can read early results without waiting for the whole batch. Agent Analytics isolates a failed or timed-out session, surfaces a per-evaluator error, and completes the rest of the run. To stop a run early, select Cancel.
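The failure isolation described above follows a common batch-processing pattern: score each session independently, record any error against that session, and keep going. The Python sketch below illustrates the pattern only. It isn't Amplitude's implementation, and the run_evaluator function and the evaluate callback are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def run_evaluator(
    session_ids: List[str],
    evaluate: Callable[[str], bool],
) -> Tuple[Dict[str, bool], Dict[str, str]]:
    """Score each session independently so one failure doesn't stop the batch."""
    results: Dict[str, bool] = {}
    errors: Dict[str, str] = {}
    total = len(session_ids)
    for done, session_id in enumerate(session_ids, start=1):
        try:
            results[session_id] = evaluate(session_id)
        except Exception as exc:
            # Record the error for this session and continue with the rest.
            errors[session_id] = str(exc)
        print(f"{done} / {total} sessions, {len(errors)} failed")
    return results, errors


def fake_evaluate(session_id: str) -> bool:
    """Hypothetical evaluator that times out on one session."""
    if session_id == "s2":
        raise TimeoutError("evaluation timed out")
    return True


results, errors = run_evaluator(["s1", "s2", "s3"], fake_evaluate)
```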

For each session in the run, the results show every evaluator's result (pass or fail, class, or score), the judge prompt that ran, the rationale, and the latency of the evaluation. Use these results to refine a judge prompt, then rerun until the output aligns with your expectations.

Each action requires a specific permission:

Action	Required permission
Run evaluators on demand	Manage Inactive Evals and Runs
Activate an evaluator so it scores new sessions automatically	Activate Evals

Curate datasets

A dataset is a reusable, named collection of agent sessions that you run evaluators against repeatedly. Datasets turn one-off spot checks into a regression suite. You fix a set of sessions and establish the correct answers, then rerun an evaluator against those sessions whenever you change the evaluator's prompt. Open Datasets from the Agent Analytics left navigation.

Build a dataset around whatever slice of agent activity you want to track, such as sessions from one model, one product surface, or a filtered set of users. Start with a handful of representative sessions that cover your common cases and known failure modes.

Add sessions to a dataset in two ways:

From a session: open the session and use the Add to Dataset control. Amplitude lists the datasets that match the session's agent.
In bulk: select sessions on the Sessions page and add them at once. A dataset holds up to 100 sessions, so filter to the sessions you care about first.

Open a dataset to view its sessions, read any session's thread with its evaluator results inline, remove sessions, or download the dataset as JSON.

Establish ground truth

A dataset's value comes from its ground truth: the verified correct answer for each session. When you add a session, Amplitude snapshots the session's current evaluator results as a baseline, so you only review and correct the results that are wrong.

To review a session, open it from the dataset and change any evaluator result that's wrong, or select Mark as correct to confirm a result you agree with. Each session moves from pending to partial to verified as you review its evaluators. Reviews attach to the session, so a session keeps its labels everywhere it appears, and your corrections survive when you edit the evaluator's prompt.
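One way to read the pending, partial, and verified states is as a function of how many of a session's evaluator results have been reviewed. The Python sketch below assumes that pending means none reviewed, partial means some reviewed, and verified means all reviewed. This page doesn't spell out those thresholds, so treat the mapping and the review_status helper as illustrative.

```python
from typing import Dict


def review_status(reviewed_by_evaluator: Dict[str, bool]) -> str:
    """Map per-evaluator review flags to a session-level review state.

    Assumption: pending = none reviewed, partial = some, verified = all.
    """
    total = len(reviewed_by_evaluator)
    reviewed = sum(1 for is_reviewed in reviewed_by_evaluator.values() if is_reviewed)
    if total == 0 or reviewed == 0:
        return "pending"
    if reviewed < total:
        return "partial"
    return "verified"


# Hypothetical evaluator names.
print(review_status({"tone": False, "accuracy": False}))
print(review_status({"tone": True, "accuracy": False}))
print(review_status({"tone": True, "accuracy": True}))
```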

With ground truth in place, run an evaluator against the dataset from the Runs page and compare the evaluator's output to the verified answers. Comparing output to verified answers confirms that a prompt change improved the evaluator before you activate it.
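The comparison boils down to an agreement rate: the share of sessions where the evaluator's output matches the verified answer. The Python sketch below computes it for a binary evaluator. The agreement function, the session IDs, and the labels are made up for illustration, and the Runs page may present the comparison differently.

```python
from typing import Dict, List, Tuple


def agreement(
    ground_truth: Dict[str, bool],
    outputs: Dict[str, bool],
) -> Tuple[float, List[str]]:
    """Return the agreement rate and the session IDs where the evaluator disagrees."""
    # Compare only the sessions that have both a verified answer and an output.
    shared = sorted(set(ground_truth) & set(outputs))
    if not shared:
        return 0.0, []
    disagreements = [sid for sid in shared if ground_truth[sid] != outputs[sid]]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Hypothetical verified answers and outputs from two prompt versions.
verified = {"s1": True, "s2": False, "s3": True, "s4": True}
old_prompt = {"s1": True, "s2": True, "s3": False, "s4": True}
new_prompt = {"s1": True, "s2": False, "s3": False, "s4": True}

print(agreement(verified, old_prompt))
print(agreement(verified, new_prompt))
```

A higher agreement rate for the revised prompt suggests the change helped, and the list of disagreeing sessions shows where to look next.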

Connect agent quality to product behavior

Agent Analytics connects to the rest of Amplitude through cohorts. Cohort integration is the core differentiator over standalone agent observability tools. The workflow has three steps:

Filter a set of sessions in Agent Analytics.
Create a cohort from that selection.
Use the cohort in any standard Amplitude chart such as funnel, retention, or segmentation.

For example, build a cohort of users whose sessions failed task completion, then chart their 30-day retention against a cohort of users whose sessions succeeded. The comparison shows whether agent quality maps to outcomes like conversion, retention, or churn.
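The arithmetic behind that comparison is simple, and a small Python sketch shows it. Amplitude's retention chart does this for you with a more precise definition of retention. Here, retention is simplified to the share of a cohort that is active on day 30, and all user IDs are invented for illustration.

```python
from typing import Set


def retention_rate(cohort: Set[str], active_on_day_30: Set[str]) -> float:
    """Share of the cohort that is still active on day 30 (simplified definition)."""
    if not cohort:
        return 0.0
    return len(cohort & active_on_day_30) / len(cohort)


# Hypothetical cohorts and activity data.
failed_cohort = {"u1", "u2", "u3", "u4"}
succeeded_cohort = {"u5", "u6", "u7", "u8"}
active_on_day_30 = {"u1", "u5", "u6", "u7"}

failed_rate = retention_rate(failed_cohort, active_on_day_30)
succeeded_rate = retention_rate(succeeded_cohort, active_on_day_30)
print(f"failed: {failed_rate:.0%}, succeeded: {succeeded_rate:.0%}")
```

A gap between the two rates points to a relationship between agent quality and retention, although it doesn't establish cause on its own.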

Query Agent Analytics from your tools

The Amplitude MCP server exposes Agent Analytics to AI tools like Claude and Cursor. Use the MCP server to query metrics, sessions, spans, and conversations, inspect the schema, manage datasets, and create, update, and list evaluators.

Go to Amplitude MCP for the current tool list, setup, and authentication details.
