What We Learned Building AI Products in 2025
Amplitude customers consumed billions of tokens last year. Here's what actually worked and what didn't.
Last year, we shipped three new AI products and more than 20 new AI analytics features to over 4,500 enterprise customers, who consumed 13 billion tokens, up 100x from the previous year. But here's what those numbers don't tell you: we learned more from what broke than from what worked.
There's a lot of buzz around agents heading into 2026. Anthropic’s launch of Claude Cowork last week is igniting the debate over specialized vertical agents vs. general-purpose, long-horizon agents. Meanwhile, VCs are predicting AGI in 2026, yet Andrej Karpathy thinks we're looking at a 10-year timeline for functional, reliable AI agents.
After 10 years of building both ML and GenAI products that serve millions of users, here's what we learned building AI products for enterprise customers in 2025: what's working for vertical agents in our industry, and what's still broken.
The gap between general agent demos and production scale is real
Agents work best on tasks where outputs are easy to verify automatically. That's why coding agents and math solvers have taken off. But analytics? Analytics is a hard AI use case. Unlike code, which is structured and defined, an organization’s data is messy and ambiguous. On top of that, it’s hard to verify an insight when a question is open-ended or intentionally exploratory.
We learned this the hard way in analytics. Most organizations aren't ready for general-purpose agents in production. They have data silos and lack specialized context, tools, and observability infrastructure.
In this landscape, getting to full autonomy even for simple workflows requires higher quality, reliability, and trust than most people realize. Most folks overestimate how quickly agents will automate many forms of specialized work, such as analytics.
Users need different levels of autonomy from agents
For exploratory analysis, users don’t want to wait 30 minutes to see results. They want fast responses with a logical thought process. Agents that run faster iterations and interact with the user more frequently outperform long-running autonomous background tasks on user satisfaction. People want to stay in the loop and understand what's happening and why.
When agents take smaller steps, there’s less room for error, and users can give feedback sooner to keep the project on the right track.
Just to be clear, we think there’s still a lot of room for long-running background agents with autonomy, especially as models continue to improve. They are a good fit for clearly defined work where the answers are known and the output can be verified at key checkpoints. Think data migration, taxonomy clean-ups, and the like. These are set-and-forget tasks; they don’t require the same level of collaboration between user and machine.
Distribution and flexibility matter more than features
New AI features struggle without existing user pathways. "Build it and they will come" doesn’t work.
Engagement spiked 10x as soon as we made Agents available in our product via Slack. MCP (Model Context Protocol) continues to be a big unlock for customer teams that are traditionally not power users of our product, like software engineering and product engineering.
Customers expect to flexibly pivot across tools and context sources mid-thread. Power users need control over tools, context sources, and prompts. Success requires integration into the workflows and tools people already use.
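To make that concrete, here's a minimal sketch of what exposing an analytics capability over MCP can look like, using the FastMCP helper from the official Python SDK. The `run_funnel_query` tool, its parameters, and the canned result are hypothetical illustrations, not our actual MCP surface.

```python
# Hypothetical MCP server exposing one analytics tool.
# Assumes the official `mcp` Python SDK (pip install mcp); the tool name,
# parameters, and query logic are illustrative, not Amplitude's real API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("analytics-tools")

@mcp.tool()
def run_funnel_query(event_a: str, event_b: str, days: int = 30) -> dict:
    """Return conversion from event_a to event_b over the last `days` days."""
    # A real server would call the analytics query engine here;
    # we return a canned shape so the sketch stays self-contained.
    return {"from": event_a, "to": event_b, "window_days": days, "conversion": 0.42}

if __name__ == "__main__":
    mcp.run()  # serves over stdio so MCP-aware clients can discover the tool
```

Once a tool like this is registered, MCP-aware clients (IDEs, chat surfaces, internal agents) can discover and call it without any bespoke integration work.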
Teams should invert the build:eval ratio
Most teams spend 60-70% of their time building features and only 30-40% building evals. For some of our bets, we flipped that ratio, and those were the bets that actually added customer value.
Candidly, before our team could embrace eval-driven development, we also had to invest in strong LLM observability and analytics. We looked for an existing solution that could reliably tie agent performance to better product UX, but didn’t find anything that worked for us. We ended up building something to give us the kind of visibility and tight feedback loop we needed.
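We can't open-source that tooling here, but the core idea is simple: instrument every agent step so model, latency, and outcome land in your analytics pipeline. Below is a minimal sketch; `track_event` is a placeholder for whatever event client you already use, not a specific Amplitude API.

```python
# Minimal sketch of agent-step instrumentation; `track_event` stands in for
# whatever event/analytics client you already have.
import time
from functools import wraps

def track_event(name: str, properties: dict) -> None:
    print(name, properties)  # replace with your analytics client

def observed_step(step_name: str):
    """Decorator that records latency and success/failure for each agent step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = fn(*args, **kwargs)
                track_event("agent_step", {
                    "step": step_name,
                    "latency_ms": round((time.time() - start) * 1000),
                    "status": "ok",
                })
                return result
            except Exception as exc:
                track_event("agent_step", {
                    "step": step_name,
                    "latency_ms": round((time.time() - start) * 1000),
                    "status": "error",
                    "error": type(exc).__name__,
                })
                raise
        return wrapper
    return decorator

@observed_step("generate_chart_spec")
def generate_chart_spec(question: str) -> dict:
    # The LLM call would go here; stubbed so the sketch runs as-is.
    return {"chart": "funnel", "question": question}
```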
Here's the playbook that worked for us:
- Phase 1: Evals become the new PRDs. We started by using evaluations to define the core use cases and requirements that guide how AI agents are built, tuned, and improved. We then empowered the broadest set of SMEs and product teams to create and maintain evals. Next, we expanded and sharpened the eval suite. This is one of the highest-leverage ways PMs can spend their time.
- Phase 2: Move fast, ship often—with high visibility. We started with manual reviews, then automated as we earned confidence, backed by strong observability (hello, Amplitude!). Don’t dismiss qualitative judgment (“vibe checks”). We made sure to capture it, then translate it into repeatable evals whenever possible.
- Phase 3: Keep growing your eval bank as you learn. Now, every time we uncover a new failure mode, we add an eval for it. We use evals to prevent regressions, compare approaches, and consistently choose the best model for each task.
The playbook works when you repeatedly observe and iterate. The number of iterations matters. Re-analyze every few weeks. Your understanding of "good" evolves, and that's normal.
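To make the playbook concrete, here's a stripped-down sketch of what an eval case and harness can look like. The cases, the grading checks, and the `run_agent` stub are illustrative, not our production suite.

```python
# Stripped-down eval harness: each case pairs a prompt with a checkable
# expectation; `run_agent` is a stand-in for your agent's entry point.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable

def run_agent(prompt: str) -> str:
    # Placeholder for the real agent call; stubbed so the sketch runs as-is.
    return "Weekly retention, segmented by pricing plan, over the last 8 weeks."

CASES: List[EvalCase] = [
    EvalCase(
        name="segments_by_plan",
        prompt="How does weekly retention differ by pricing plan?",
        check=lambda out: "plan" in out.lower(),
    ),
    EvalCase(
        name="uses_weekly_granularity",
        prompt="How does weekly retention differ by pricing plan?",
        check=lambda out: "week" in out.lower(),
    ),
]

def run_suite(cases: List[EvalCase]) -> float:
    """Run every case against the agent and return the pass rate."""
    passed = sum(case.check(run_agent(case.prompt)) for case in cases)
    return passed / len(cases)

if __name__ == "__main__":
    print(f"pass rate: {run_suite(CASES):.0%}")
```

In practice, the checks range from simple string and schema assertions like these to LLM-graded rubrics; the point is that every new failure mode becomes a case you can re-run on demand.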
Foundation model improvements create huge product leverage
One advantage of our new agentic architecture is that we get leverage from foundation model improvements immediately. With eval tooling, we can test new models in a few hours.
In initial testing for Agents, our eval success rate was in the mid-40s. When we moved from Haiku to Claude Sonnet, eval accuracy (before any further optimizations) jumped from 47% to 65%. Within a few hours of its launch, we evaluated Gemini 3.0 Pro, which scored 62%. We've also found that certain sub-agents do better on other models, and we route those tasks accordingly to get the best performance.
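As a rough illustration of that routing, the sketch below keeps a static map from sub-agent to the model that currently wins its slice of the eval suite, and gets re-pointed whenever a new model clears the bar. The sub-agent names and model labels are illustrative, not our actual configuration or exact API model IDs.

```python
# Illustrative sub-agent -> model routing table; labels are examples only.
SUBAGENT_MODELS = {
    "query_planning": "claude-sonnet",   # currently highest eval accuracy on planning
    "sql_generation": "gemini-pro",      # example of a different model winning a slice
    "summarization": "claude-haiku",     # cheaper model is good enough here
}

def pick_model(subagent: str) -> str:
    """Return the model currently assigned to a sub-agent, with a safe default."""
    return SUBAGENT_MODELS.get(subagent, "claude-sonnet")

def call_model(model: str, prompt: str) -> str:
    # Placeholder for your LLM client call; stubbed so the sketch is runnable.
    return f"[{model}] response to: {prompt}"

if __name__ == "__main__":
    print(call_model(pick_model("sql_generation"), "events by plan, last 30 days"))
```

Because the routing table is driven by eval results rather than gut feel, swapping a model in or out is a one-line change backed by evidence.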
Speed matters. When you can validate model improvements in hours instead of weeks, you can ride the wave of foundation model progress instead of being left behind.
Roadmaps need to update in real time to reflect development velocity
For AI product development, we've shifted from quarterly planning to a flexible model.
A tiger team of engineers, PMs, and designers meets with two to three customers per day, identifies broken experiences, and fixes them. Often, those fixes happen the same day customers flag them. Traditional planning cycles are too slow for how fast this space moves.
Non-engineers are shipping code across the organization
With tools like Cursor and Claude Code integrated into our codebase, along with DevX improvements, almost everyone on our design and product team has started writing code and improving our product surface areas directly. We’ve seen 300% growth in the number of PRs created by non-engineers, spanning bug fixes, copy improvements, layout updates, new feature requests, and more.
We’re not aiming to replace engineers. It's about unlocking velocity across the entire organization. When designers can fix button alignment themselves and PMs can adjust copy without a sprint ticket, everyone moves faster.
Demand for engineering is increasing
In 2025, the big question was whether the rise of these agents would lead to job losses, with engineering first on the chopping block. We've done the opposite: we doubled our new engineering headcount in 2025, and we plan to grow open headcount by 55% in 2026.
None of this is to imply that AI is not automating our work. But as we continue to ship faster, we’re also seeing that teams building AI products need more humans in the loop, not fewer, to handle the complexity of evals, integrations, and the constant iteration required to ship real value.
The bottom line
The hype cycle around AI agents is real, and while we’re seeing agents excel at verifiable tasks like coding, in many other fields we’re still seeing a gap between demos and enterprise readiness.
2025 taught us that customers want software they can use collaboratively and reliably. The software teams that are winning today aren't the ones chasing full autonomy (that might be the right strategy only for frontier AI research labs). The best AI application and infra teams are the ones building robust evals, meeting users where they already work, and staying flexible enough to adapt as foundation models improve.
We're still early. But we're learning fast. And we’re excited to ship in 2026!

Nirmal Utwani
Amplitude Director of Engineering, AI Analytics
Nirmal is a founding engineer at Amplitude, part of the team from Day 1, building the company's analytics platform from the ground up. As Director of Engineering, he now leads teams responsible for Amplitude's AI products and core analytics capabilities, serving millions of users across thousands of enterprise customers. His expertise spans distributed systems, query engines, and applying LLMs to complex analytical workflows.