Agent Analytics

What’s the real impact of your AI agent?

Knowing your agent gave a good response is the easy part. Agent Analytics ties traces and evals to conversion, retention, and revenue, so you can connect agent quality to real business impact.

Sign up for early access

Amplitude Agent Analytics session view

Sessions

Suggested for you

Which users only engaged with our agent once? (Sessions per user = 1 · last 30d)

What are my agents' most common failure modes? (Clustered from 9.2k failed sessions)

What questions are being answered badly? (Quality score < 0.4 · last 30d)

How and why are users using the create_theme tool? (create_theme invocations · last 30d)

Session 3b449a79-2b00-4809-a331-93e051493497 · 5h ago

Why did checkout completion drop ~18% on Wednesday?

May 27 · 3:22 PM

Pulled checkout sessions · May 25–28

One dataset query timed out, but the chart and session pulls gave me enough to isolate it. The drop traces to a single cohort:

Affected cohort: users on Safari 17.4 (iOS)
Window: Wed 09:12 – 17:48 UTC
Likely cause: the v3.8 payments SDK rollout introduced a CORS preflight that failed on Safari iOS

Reverting the rollout for that platform should recover the rate within an hour. Want me to open a 48h follow-up cohort to confirm?

$0.42 · 3:22:48 PM

Yes — ship the Safari fix and set up the 48h follow-up cohort.

3:23 PM

Data Quality: False
Has Negative Feedback: False
Response Was Helpful: True
Task Completed: True
User Friction: False
Session Resolved: Response Provided
Session Safety: Normal
User Intent: Analysis & Synthesis
Memory Recall: No
Action Offers: No
Intent Topics: Chart Creation
Failure Reasons: Other

Two Mira assistant chats comparing on-budget and over-budget hotel recommendations

On budget

User: Find me hotels in Montreal under $300 a night for my June trip.

Mira: Done. Here are 3 stays in Montreal that fit your budget. Every option is under $300/night.

Hôtel Saint-Laurent · Vieux-Montréal · 4.6 (1,204 reviews) · $189/night
Maison Plateau · Le Plateau-Mont-Royal · 4.7 (876 reviews) · $245/night
Auberge Mont-Royal · Mile End · 4.5 (1,530 reviews) · $279/night

All under $300

Over budget

User: Find me hotels in Montreal under $300 a night for my June trip.

Mira: Here are the top-rated hotels in Montreal for June. Take a look!

Le Ritz Boutique · Downtown · 4.8 (2,011 reviews) · $385/night
Grand Hôtel Place-des-Arts · Quartier des Spectacles · 4.7 (1,442 reviews) · $420/night

User: These are all over $300. I need under $300.

Mira: Got it! Here’s another great Montreal option you might like.

Over your $300 budget

When traditional analytics falls short

Most analytics tools were built for clicks and page views, not for reasoning and tool calls. Agents can hallucinate, ignore instructions, and confidently go off-track, yet in a click-based view every user who spends time with your agent looks equally engaged. Who actually found it useful and came back?

Go beyond observability

Learn the product and revenue impact of your agents.

AI Quality

What the agent did

Product Outcomes

What the user did next

01 Observe

Inspect traces, prompts, tool calls, responses, latency, and cost. The raw record of what the agent actually did.

02 Evaluate

Score quality, intent, resolution, and failure modes. See where the agent helps, where it gets confused, and where it adds risk.

03 Decide

Tie those quality signals to conversion, retention, and revenue. Did the agent actually move the user forward?

04 Deploy

Tune prompts, run experiments, trigger guides, and personalize the next step from what you learned.

A raw trace of spans next to the same turn decomposed into Amplitude events

Trace (raw spans)

Turn 1 · claude-sonnet · 8 spans

System prompt
User prompt
llm_completion
Tool calls: search, query_charts, query_dataset
llm_completion
session_end

Events (as Amplitude events)

Turn 1 · 8 events · claude-sonnet

Session Started [Agent] · model claude-sonnet
Prompt Received [Agent] · get_started_checklist
Completion Generated [Agent] · 412.3k tokens · 4.7s
Search Run [Tool] · 3 results · 230ms
Charts Queried [Tool] · 200 · 932ms
Dataset Queried [Tool] · 200 · 1.5s
Response Sent [Agent] · 25.0s elapsed
Session Ended [Session] · cost $0.24 · 33.5s

Trace turns automatically become Amplitude events

Each user message, tool call, and agent response is an Amplitude event with the same user_id as the rest of your product data. Unlike observability tools that stop at the trace, Agent Analytics decomposes these conversations into events, making them directly queryable in the same funnels, cohorts, and retention analyses you already use.
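As a rough illustration of that decomposition, the sketch below flattens one turn's spans into event dictionaries that all share a user_id. The Span shape, the EVENT_NAMES mapping, and the spans_to_events helper are hypothetical and are not the SDK's schema; the user_id, event_type, and event_properties keys are borrowed from Amplitude's general event format as an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One raw trace span (hypothetical shape, not the SDK's schema)."""
    kind: str                      # e.g. "user_prompt", "tool_call", "llm_completion"
    name: str = ""                 # tool name, used only for tool calls
    attributes: dict[str, Any] = field(default_factory=dict)


# Hypothetical mapping from span kind to a product-style event name.
EVENT_NAMES = {
    "user_prompt": "Prompt Received",
    "llm_completion": "Completion Generated",
    "tool_call": "Tool Called",
    "session_end": "Session Ended",
}


def spans_to_events(user_id: str, session_id: str, spans: list[Span]) -> list[dict[str, Any]]:
    """Flatten a turn's spans into events keyed by the same user_id as product data."""
    events = []
    for span in spans:
        props = dict(span.attributes)
        props["session_id"] = session_id
        if span.kind == "tool_call":
            props["tool_name"] = span.name
        events.append({
            "user_id": user_id,
            "event_type": EVENT_NAMES.get(span.kind, "Agent Span"),
            "event_properties": props,
        })
    return events


turn = [
    Span("user_prompt"),
    Span("tool_call", "search", {"latency_ms": 230}),
    Span("tool_call", "query_dataset", {"latency_ms": 30000, "tool_success": False}),
    Span("llm_completion", attributes={"model": "claude-sonnet"}),
    Span("session_end"),
]
events = spans_to_events("user-123", "3b449a79", turn)
for event in events:
    print(event["event_type"], event["event_properties"])
```

Because every event carries the same user_id, a funnel or cohort query can join these rows with ordinary product events without a separate identity mapping.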

The questions you can finally answer

With traces, evals, and product events in one place, the analyses that used to be out of reach become routine.

01

Did our model upgrade lift sign-up conversion this week, or hurt it?

02

What is the conversion delta when the agent answers correctly versus hallucinates?

03

Which agent topics correlate with expansion intent, and which ones with churn risk?
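To make the second question concrete, here is a minimal sketch of a conversion delta computed from per-session rows. The rows are made-up illustration data, and the eval_label and converted field names are assumptions rather than real property names.

```python
from collections import defaultdict

# Hypothetical per-session rows: an eval label plus whether the user later converted.
sessions = [
    {"user_id": "u1", "eval_label": "grounded", "converted": True},
    {"user_id": "u2", "eval_label": "grounded", "converted": True},
    {"user_id": "u3", "eval_label": "grounded", "converted": False},
    {"user_id": "u4", "eval_label": "hallucinated", "converted": False},
    {"user_id": "u5", "eval_label": "hallucinated", "converted": True},
    {"user_id": "u6", "eval_label": "hallucinated", "converted": False},
    {"user_id": "u7", "eval_label": "hallucinated", "converted": False},
]


def conversion_by_label(rows):
    """Return the conversion rate for each eval label."""
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for row in rows:
        totals[row["eval_label"]] += 1
        conversions[row["eval_label"]] += int(row["converted"])
    return {label: conversions[label] / totals[label] for label in totals}


rates = conversion_by_label(sessions)
delta = rates["grounded"] - rates["hallucinated"]
print(rates, round(delta, 3))
```

A difference like this is a correlation, not proof of cause; an experiment on the prompt or model is still the cleaner way to attribute the lift.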

The Agent Analytics maturity model

Most observability tools stop at the lower levels of maturity. Agent Analytics takes you to the top by connecting AI quality to the user journey and revenue.

L4 Revenue Attribution: What is the AI worth in dollars?

L3 Behavioral Analytics: How does AI usage affect the user journey?

L2 Semantic Intelligence: What is the agent actually doing?

L1 Evaluations & Assertions: Did the agent do it correctly?

L0 Tracing & Telemetry: Can I see what happened?

Inside Agent Analytics

Production runs surprise teams with questions they never prepared the model for. Read the user prompt, the agent’s response, the tools it called, and the context it pulled, then jump straight to Session Replay to see what went wrong.

tool calls · prompt versions · context retrieval · jump to replay

Traces

Turn 1 · 412.3k / 1.1k · $0.24 · 33.5s

System prompt · 412.3k / 1.1k · ~$0.24 · 33.5s

User prompt

Tool calls · 3 · 1 error · 31.2s

search · 200 · 230ms
query_charts · 200 · 932ms
query_dataset · 500 · 30.0s

Tool call failed: request timed out

tool_name (str): "query_dataset"
tool_input (obj, 2 fields):
  dataset_id (str): "ds_checkout_fnl"
  query (str): "checkout_completion_by_platform"
tool_output (obj, 2 fields):
  error (str): "Dataset query timed out after 30s"
  code (num): 500
tool_success (bool): false
component_type (str): "tool"

llm_completion · 412.3k / 1.1k · ~$0.24 · 33.5s

session_end

Turn 2 · 38.1k / 0.4k · $0.05 · 6.2s
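The tool-call header in the trace above (3 calls, 1 error, 31.2s) can be reproduced from the individual spans. The sketch below does that roll-up; tool_name and tool_success come from the trace fields shown, while code and latency_s are illustrative field names and summarize is a hypothetical helper.

```python
# Tool-call spans from the trace above. Only tool_name and tool_success are
# field names seen in the trace; code and latency_s are assumed for illustration.
tool_calls = [
    {"tool_name": "search", "code": 200, "latency_s": 0.230, "tool_success": True},
    {"tool_name": "query_charts", "code": 200, "latency_s": 0.932, "tool_success": True},
    {"tool_name": "query_dataset", "code": 500, "latency_s": 30.0, "tool_success": False},
]


def summarize(calls):
    """Roll individual tool-call spans up into the per-turn header values."""
    failed = [call["tool_name"] for call in calls if not call["tool_success"]]
    return {
        "count": len(calls),
        "errors": len(failed),
        "failed_tools": failed,
        "total_latency_s": round(sum(call["latency_s"] for call in calls), 1),
    }


summary = summarize(tool_calls)
print(summary)
# {'count': 3, 'errors': 1, 'failed_tools': ['query_dataset'], 'total_latency_s': 31.2}
```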

Evaluators (beta)

Suggested for you

Ungrounded Claim (Binary): Detect when responses cite unsupported facts
Task Completion (Categorical): Did the agent successfully complete the user's task?
Tool Selection Quality (Score): Rate the quality of the agent's tool selection
Checkout Abandon Cause (Categorical): Cluster reasons users abandon the checkout flow

Status | Evaluator | Version | Output | Labels | Run on | Agents | Updated
Active | Tool Use Quality | V1 | Score | 1, 3 | All Sessions | Sales Copilot | 2h ago
Active | Sufficient Evidence (Qual & Quant) | V1 | Categorical | Sufficient, Insufficient | All Sessions | Sales Copilot +1 | 8d ago
Active | Agent Error Types | V1 | Categorical | Underspecified +29 | All Sessions | Support Copilot +6 | 22d ago
Active | Action Offers | V1 | Binary | Yes, No | All Sessions | Any Agent | 33d ago
Active | Failure Reasons | V1 | Categorical | Integration Error +6 | All Sessions | Any Agent | 46d ago
Active | Intent Topics | V1 | Categorical | Data Analysis, Chart Creation +6 | All Sessions | Any Agent | 46d ago
Active | Memory Recall | V1 | Binary | Yes, No | All Sessions | Any Agent | 47d ago
Active | User Friction | V2 | Categorical | Repeated Question, Confusion +2 | All Sessions | Support Copilot | 69d ago
Draft | Intent Classifier | V1 | Categorical | Data Analysis, Chart Creation +2 | All Sessions | Any Agent | 54m ago
Draft | Response Tone | V1 | Categorical | Valid Timestamp, Invalid Timestamp | All Sessions | Any Agent | 1h ago
Draft | Safety Check | V1 | Binary | Yes, No | All Sessions | Any Agent | 1h ago
Draft | PII Detection | V1 | Binary | Yes, No | All Sessions | Any Agent | 1h ago

Tool Use Quality (Active · V1 · Score · Sales Copilot)

Description

Grades every matching session for Tool Use Quality and writes the verdict back to the session as a property, available in charts, segments, and exports.

Grading prompt

Rate the assistant on “Tool Use Quality” from 1 (poor) to 5 (excellent).
5 — fully correct, efficient, and well-justified
3 — partially correct or with minor issues
1 — incorrect, unsafe, or off-task
Weigh tool selection and reasoning, not just the final answer.
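
One way to picture how a grading prompt like the one above turns into a session property is the sketch below. It is a minimal illustration under stated assumptions, not the product's implementation: fake_grader stands in for a real LLM call, and build_grader_input, parse_score, write_back, and the tool_use_quality property name are all hypothetical.

```python
import re
from typing import Optional

GRADING_PROMPT = (
    'Rate the assistant on "Tool Use Quality" from 1 (poor) to 5 (excellent).\n'
    "5: fully correct, efficient, and well-justified\n"
    "3: partially correct or with minor issues\n"
    "1: incorrect, unsafe, or off-task\n"
    "Weigh tool selection and reasoning, not just the final answer."
)


def build_grader_input(transcript: str) -> str:
    """Combine the rubric with the session transcript for the grader model."""
    return f"{GRADING_PROMPT}\n\nSession transcript:\n{transcript}\n\nAnswer with a single digit."


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit from 1 to 5 in the grader's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def write_back(session: dict, evaluator: str, score: Optional[int]) -> dict:
    """Attach the verdict as a session property; unparseable replies are skipped."""
    if score is not None:
        session.setdefault("properties", {})[evaluator] = score
    return session


def fake_grader(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption for this sketch)."""
    return "Score: 4. Correct tools, one redundant query."


session = {"session_id": "3b449a79", "properties": {}}
reply = fake_grader(build_grader_input("user: ... assistant: ..."))
write_back(session, "tool_use_quality", parse_score(reply))
print(session)
```

Skipping unparseable replies, rather than defaulting to a score, keeps malformed grader output from skewing the label distribution.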

Label distribution · last 30 days

5: 23.2%
4: 18.8%
3: 18.8%
2: 14.5%
1: 14.5%
0: 10.1%

Configuration

Output: Score
Version: V1
Run on: All Sessions
Agents: Sales Copilot
Schedule: Hourly
Last updated: 2h ago

Semantic filtering

Evaluation | Session | Member | Assistant
Complete | [tool_call] check_in | 426e793e-2516-4cae-b05d-b7e2c1e03863 | FitForge-Assistant
Complete | [tool_call] book_class | 31fe4d3f-09b5-476a-96ba-4803ad08c568 | FitForge-Assistant
Complete | [tool_call] get_schedule | 90a2be38-6432-4d34-a375-94ed6ddde0d9 | FitForge-Assistant
Complete | [tool_call] query_attendance | c1b6185a-3cf3-4285-a982-a2a3babfe816 | FitForge-Assistant
Complete | [tool_call] get_member_visits | 1b84ab29-e43d-47c9-a0d7-4645a406a141 | FitForge-Assistant
Complete | [tool_call] search | d8f74f9d-6df3-48d3-96a3-61d5f132c381 | FitForge-Assistant
Complete | You're an ops manager reviewing member retention, and you want to… | 3d46ff1a-98e0-4341-b0fb-d786798ae202 | Insights
Complete | [tool_call] book_class | fb778b8f-aa91-4970-af69-79b4cb5d228b | FitForge-Assistant
Complete | Top classes and locations this week | d4f81d22-f9cf-4551-a1cd-95a3b8b58b9f | Member Chat
Complete | [tool_call] query_attendance | 2d6d9fb0-bc12-47ba-affb-90310e37bbdd | FitForge-Assistant
Complete | Create a report for: weekly check-ins by location for Q2 … | 0c6250db-3203-4cb5-98bf-54024294709f | Member Chat
Complete | [tool_call] get_member_visits | ee254dab-3712-4cd8-bcac-2f50b315b66c | FitForge-Assistant
Complete | Analyze churn patterns for members who cancelled in the last… | d91d0428-1e5c-4835-ad0e-71b16c6556b9 | Member Chat
Complete | let's compare peak-hour check-ins against class capacity at the … | 9079e931-c71c-4370-8082-55a0ee207093 | Member Chat
Complete | [tool_call] save_class_edits | 30dc9b31-4b76-4a0c-bfab-e1826913fa04 | FitForge-Assistant

Evaluation Available

Assistant

Gym Location

Membership Tier

Tool

Has Tool Errors

Saved to profile

Agent failure reasons

Booking errors

Member intent topics

Offers a next step

Data completeness (…

Booking accuracy - …

Member friction

Data quality issues

Has negative feedb…

Response was helpful

Issue resolved

Account safety

Performance monitoring

Monitor · Chat Agent · Last 30 days

Total Agents: 118 (↑ 9.3% vs. previous period)
Total Sessions: 6,718,842 (↑ 1576.1% vs. previous period)
Avg Response Latency: 30.2s (↓ 6.1% vs. previous period)
Daily Spend: $224 (↑ 12.4% vs. previous period)

Charts

Agent Adoption: active users for Chat Agent · 76,483 (↑ 95.6%)
Agent Activity: conversation volume for Chat Agent
Failure Rate: errored sessions for Chat Agent · 2.4% (↓ 0.8%)
Daily Spend: inference cost for Chat Agent · $224 (↑ 12.4%)

Human review

FitForge · Member support set · Session 1 of 4

Turn this into a line chart of weekly check-ins per location over the last 30 days.

Apr 10 · 5:51 PM

Finished working

Done. Here’s the median weekly check-ins per location over the last 30 days:

Location | Avg check-ins / wk
Downtown Flagship | ~4,210
Riverside | ~3,180
Northgate | ~2,540
Harbor Point | ~2,015
Westview | ~1,760
Tech Campus | ~1,455
Old Town | ~1,120
Airport | ~640

Downtown Flagship and Riverside stand out — they drive 2–3x the check-ins of every other location. Worth a closer look at staffing and class capacity there.

View chart →

$0.42 · Apr 10 · 5:52 PM

Review

Response accuracy: Accurate, Partially accurate, Inaccurate
Member friction: Repeated question, Confusion, Frustration, No friction
Resolved the request: Yes, No
Member intent topic: Membership & Billing, Class Booking, Personal Training, Check-in Issues, Reporting & Insights, Other
Overall quality: 1/3, 2/3, 3/3

Pick the labels that fit this session. Your review saves automatically.

Datasets

Hallucination Detector · Calibration

10 sessions · 30% reviewed · updated 30 minutes ago

Session | When | Review | Label
Is the 4.9% APR promo still active for new accounts? | 5h ago | Reviewed | Grounded
Does my plan cover water damage from a burst pipe? | 6h ago | Reviewed | Hallucinated
What's the early-termination fee on a 12-month lease? | 8h ago | Reviewed | Grounded
Can I transfer my deposit to a different unit? | 9h ago | Pending | —
Are pets allowed if I pay the extra monthly fee? | 11h ago | Pending | —
When does the renewal grace period end for my lease? | 13h ago | Pending | —
Will my rent be prorated if I move in mid-month? | 1d ago | Pending | —
Does breaking my lease affect my rewards balance? | 1d ago | Pending | —

We perfected it for ourselves

We shipped agents at Amplitude, hit the same issues you have, and built Agent Analytics as a result. Hear the data and stories from the team behind it.

Why We Created Agent Analytics, and Why Every Agent Team Needs It

We built an analytics company. Then we built an agent and couldn’t see anything. Here’s how we fixed it.

The Eval Signal That Predicts 3x Agent Retention

We used Agent Analytics to understand whether eval signals actually predict whether users come back. The result surprised us.

Making Stone Soup: Eval-Driven Development for Analytics With AI

How dev teams, customers, and LLMs contributed to automate insights in Amplitude.

Instrument any LLM provider

Native wrappers for the providers you actually use. An OpenTelemetry bridge for the rest. Manual capture when you want full control.

Python and Node, drop-in SDK, live in minutes.

Content-optional analytics

Purpose-built to let you control what leaves your environment.

Tier | What you send | What you get
Metadata Only | Tokens, cost, latency, behavioral signals | Cost analytics, retention curves, funnel drop-off. No conversation content leaves your environment.
Customer Enriched | Your classification labels | Full topic and quality analytics. You run your own classifiers and send us the structured labels.
Full | Conversation content | Automatic topic classification, quality scoring, and behavioral pattern detection.

Send full conversations, your own labels, or metadata only. Switch modes per agent or per event source.
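
The tiers above amount to an allowlist applied before anything leaves your environment. The sketch below illustrates the idea with a per-agent mode lookup. Every field name, the MODE_BY_AGENT mapping, and the apply_privacy_mode and prepare helpers are assumptions for illustration; the real SDK's configuration may look quite different. It also assumes that the full tier passes labels through alongside content.

```python
from typing import Any

# Which fields each mode lets through (all field names are hypothetical).
METADATA_FIELDS = {"tokens", "cost_usd", "latency_ms", "tool_name", "tool_success"}
LABEL_FIELDS = {"topic", "quality_label"}
CONTENT_FIELDS = {"prompt_text", "response_text"}

ALLOWED = {
    "metadata_only": METADATA_FIELDS,
    "customer_enriched": METADATA_FIELDS | LABEL_FIELDS,
    "full": METADATA_FIELDS | LABEL_FIELDS | CONTENT_FIELDS,
}

# Hypothetical per-agent configuration.
MODE_BY_AGENT = {
    "Support Copilot": "metadata_only",
    "Sales Copilot": "customer_enriched",
}


def apply_privacy_mode(event: dict[str, Any], mode: str) -> dict[str, Any]:
    """Keep only the fields the given mode allows."""
    if mode not in ALLOWED:
        raise ValueError(f"unknown privacy mode: {mode}")
    return {key: value for key, value in event.items() if key in ALLOWED[mode]}


def prepare(event: dict[str, Any], agent: str) -> dict[str, Any]:
    """Filter an event using the agent's mode, defaulting to the strictest one."""
    return apply_privacy_mode(event, MODE_BY_AGENT.get(agent, "metadata_only"))


event = {
    "tokens": 1100,
    "cost_usd": 0.24,
    "latency_ms": 33500,
    "topic": "Chart Creation",
    "prompt_text": "Why did checkout completion drop?",
    "response_text": "The drop traces to a single cohort.",
}
print(prepare(event, "Support Copilot"))
print(prepare(event, "Sales Copilot"))
```

Falling back to metadata_only for unknown agents is a deliberate choice in this sketch: a misconfigured agent then sends less data, not more.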

Agent Analytics is in beta

Stop shipping on vibes

Find the failure modes that actually cost you users.
Measure conversion and retention by agent quality.
Connect any trace to session replay and experiments.

Sign up for early access

Frequently asked questions

Agent Analytics is the analytics layer between LLM observability and product analytics. Every user message, tool call, agent response, and session end becomes an Amplitude event tagged with topic, quality score, and behavioral pattern, so you can build cohorts, funnels, and retention curves on AI session quality.

Tracing tools answer “What did the agent do?” Agent Analytics answers “Did it work for the user?” You can keep your tracing tool: AmplitudeGenAIExporter adds Amplitude as a second OpenTelemetry destination in one span processor registration.

You do not have to send conversation content. The SDK has three privacy modes: metadata_only (tokens, latency, cost, and behavioral signals only), customer_enriched (your own labels, no raw text), and full (managed enrichment). Most teams start in metadata_only and move to a richer mode as trust builds.

SDKs are available for Python on PyPI (pip install amplitude-ai) and for Node.js / TypeScript on npm (npm install @amplitude/ai). There are native wrappers for OpenAI, Anthropic, Gemini, Bedrock, Mistral, and Azure OpenAI, and framework integrations for LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, and the Claude Agent SDK. Anything emitting OpenTelemetry GenAI spans (such as OpenLIT, Traceloop, and OpenAI instrumentation) flows in via the bridge.

Every agent event carries a Session Replay ID. From any session in the explorer, View Replay opens at the moment the conversation started, so you watch the agent fail inside the actual product the user was using.

Each event carries the experiment variant as a property, so prompt A/B tests attribute correctly across multi-turn conversations. Quality scores and behavioral patterns flow into cohorts that Guides and Activation target in real time. Same workspace, same identity graph.

A new era of analytics

From a live agent overview to evals and datasets, every view ties what the agent did to what the user did next.

Overview (beta) · Chat Agent · Last 30 days

Total Agents: 118 (↑ 9.3% vs. previous period)
Total Sessions: 6,718,842 (↑ 1576.1% vs. previous period)
Total Agent Users: 76,483 (↑ 95.6% vs. previous period)
Avg Response Latency: 30.2s (↓ 6.1% vs. previous period)

Charts

Agent Activity: conversation volume for Chat Agent
Agent Adoption: active users for Chat Agent
Agent Reliability: sessions with technical failures for Chat Agent
Total LLM Spend: daily LLM cost (USD) for Chat Agent
Response Latency: AI response time per turn for Chat Agent
User Intent: Chat Agent sessions by conversation intent
Session Issues: Chat Agent sessions with detected issues

Datasets (beta)

Hallucination Detector · Calibration (updated 30 minutes ago)
Curated sessions where the agent stated unsupported facts, used to tune the hallucination evaluator.
Sessions: 10 · Reviewed: 30% · Agents: Any

Binary Classifier Benchmark (updated 1 hour ago)
Balanced yes/no examples for stress-testing binary classifier accuracy and drift.
Sessions: 10 · Reviewed: 0% · Agents: Any

Safety Signals Golden Set (updated 1 hour ago)
Golden labels for safety and off-topic detection across high-risk conversations.
Sessions: 10 · Reviewed: 100% · Agents: Any

Wrong-Info Detector Calibration (updated 1 hour ago)
Edge cases of confidently-wrong answers, hand-reviewed for precision tuning.
Sessions: 20 · Reviewed: 15% · Agents: Any

Intent Classifier Benchmark (updated 19 hours ago)
Representative intents sampled across surfaces to benchmark the intent classifier.
Sessions: 10 · Reviewed: 10% · Agents: Any

Evaluator Accuracy Suite (updated 19 hours ago)
Held-out set measuring evaluator agreement against human reviewers.
Sessions: 10 · Reviewed: 0% · Agents: Any

Quality Baseline Set (updated 19 hours ago)
A stable quality baseline to catch regressions release over release.
Sessions: 10 · Reviewed: 100% · Agents: Any

Session Classifier Benchmark (updated 20 hours ago)
Multi-turn sessions labeled by outcome for the session classifier.
Sessions: 10 · Reviewed: 10% · Agents: Any

Conversation Tagging Set (updated 20 hours ago)
Conversations tagged by topic and sentiment for taxonomy coverage.
Sessions: 10 · Reviewed: 100% · Agents: Any
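
For a dataset that measures evaluator agreement against human reviewers, one common approach is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The page does not say which metric the product uses, so treat the sketch below as a standard technique swapped in for illustration; the labels are made-up sample data.

```python
from collections import Counter


def agreement(human, machine):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(machine) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == m for h, m in zip(human, machine)) / n
    human_counts, machine_counts = Counter(human), Counter(machine)
    labels = set(human) | set(machine)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * machine_counts[label] for label in labels) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical reviewed sessions: human label versus evaluator label.
human = ["Grounded", "Hallucinated", "Grounded", "Grounded", "Hallucinated", "Grounded"]
machine = ["Grounded", "Hallucinated", "Hallucinated", "Grounded", "Hallucinated", "Grounded"]

observed, kappa = agreement(human, machine)
print(f"observed agreement: {observed:.3f}, kappa: {kappa:.3f}")
```

With only a handful of reviewed sessions, either number is noisy, so it is worth reviewing more of the dataset before tuning an evaluator against it.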
