AI Broke Your Experimentation Program. Here’s How to Fix It.
How mature experimentation teams are optimizing for speed of learning without sacrificing quality.
This blog was co-authored by Ken Kutyn, Head of Solutions Engineering APJ at Amplitude.
You probably ran more tests last quarter than the year before. Most teams did. And most teams think that’s what a healthy experimentation program looks like.
But AI has changed that.
Idea generation is essentially free. Your team can generate 50 test ideas before lunch, build the variations by the afternoon, and have a ship/roll-back recommendation by the end of the day. Now that velocity costs nothing, it tells you nothing.
The teams pulling ahead are asking a harder question: Are we actually learning anything?
After years of building experimentation programs, I can tell you that scaling one requires three fundamental things in the AI-native world: thoughtfully filtering ideas, building trust into results, and expanding experiments beyond the UI to new interfaces.
Velocity is not the only metric
For years, velocity was how mature experimentation teams proved their programs were working. Run more tests, surface more ideas, and find the unexpected winners.
Faster is better, right? But velocity as a North Star doesn’t mean what it used to.
With AI, significant portions of the experiment workflow that used to take days or weeks now take minutes:
- Ideation: Anyone can screenshot a landing page and share it with their agent of choice, asking for 5 test ideas.
- Building tests: Agents can quickly generate the JavaScript for custom test variations.
- Documentation: MCP connectors make it possible to pull in design assets, Confluence pages, and backlog tickets, and generate reports in seconds.
But running more tests is no longer a sign of a healthy program. The best experimentation teams are now asking themselves:
- Are these testing ideas grounded in our user behavior and friction (quantitative and qualitative)?
- What is the negative impact on user experience of shipping a very high volume of low-quality tests?
- Are we learning anything from these tests?
The good news is you don’t have to build the infrastructure to answer these questions yourself. Modern experimentation platforms have made it significantly cheaper and faster to get to a behavioral foundation, combining experiment data, session replay, surveys, and cohorts under a single set of events so the “why” behind a result isn’t a separate research project.
When test hypotheses come from observed user behavior, they’re more likely to positively impact the user experience.
Validate your learnings
Getting to a data-backed hypothesis is only half the equation; trusting what it says is the other half.
For years, the hard part of experimentation was instrumentation. Getting clean data, setting up the test, waiting for significance. AI is compressing all of that so that novice testers can now paste impression and conversion counts into an agent and get a ship/roll-back recommendation in seconds.
That’s the problem.
A generic agent doesn’t know your data taxonomy, your experiment history, or what “conversion” actually means in your product. Speed without semantic knowledge isn’t analysis, and it can lead to incorrect conclusions.
Mature teams are asking harder questions before they act on a result:
- Can we trust the data collected by this experiment?
- Does a generic agent sitting directly on our warehouse have enough semantic knowledge of our experiment and data taxonomy to produce reliable results?
- If this test goes sideways, can we identify the affected users and roll back the feature?
- Do we know which other metrics are unexpectedly affected by our test, without having to repeat the full experiment?
- How can we use qualitative analysis to learn from a test that is not statistically significant?
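To make the first question concrete, here is a minimal sketch of the kind of check a “paste the counts into an agent” workflow tends to skip. It assumes a simple two-arm test with an intended 50/50 split, and it runs a sample ratio mismatch (SRM) check before it looks at conversion at all. The function names and thresholds (`alpha`, `srm_alpha`) are illustrative assumptions, not part of any particular platform, and a real analysis would also need your metric definitions and experiment history.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) that the observed split matches the intended one."""
    total = n_control + n_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0  # no variation in outcomes, so no evidence of a difference
    z = (conv_t / n_t - conv_c / n_c) / se
    return math.erfc(abs(z) / math.sqrt(2))


def recommend(conv_c: int, n_c: int, conv_t: int, n_t: int,
              alpha: float = 0.05, srm_alpha: float = 0.001) -> str:
    """Hypothetical decision helper: refuse to call a winner if the split looks broken."""
    if srm_p_value(n_c, n_t) < srm_alpha:
        return "investigate: sample ratio mismatch"
    if two_proportion_p_value(conv_c, n_c, conv_t, n_t) >= alpha:
        return "inconclusive"
    return "ship" if conv_t / n_t > conv_c / n_c else "roll back"


if __name__ == "__main__":
    # Balanced arms: the conversion test decides.
    print(recommend(500, 10_000, 600, 10_000))
    # Same conversions, but the treatment arm is missing about 1,000 users.
    print(recommend(500, 10_000, 600, 9_000))
```

The design choice worth noting is the ordering: the SRM check gates the recommendation, so a broken assignment or logging pipeline produces “investigate” rather than a confident ship call.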
A thin layer of AI on top of a mediocre foundation doesn’t cut it. Fast research that you can trust requires running experiments on a foundation where data quality is monitored and agents work with context, not just numbers. And all of this needs to happen in a unified platform, where experiment data, session replay, surveys, and cohorts share the same event taxonomy.
The best experimentation teams don’t treat a shipped test as the end of the loop. They treat it as the beginning of the next one.
New interfaces are the next frontier
Most of today’s experimentation best practices were shaped in a world of landing pages, CTAs, and checkout flows. That world no longer exists.
When your product is an agent, a chatbot, or a workflow copilot, the variables that matter have shifted to prompt phrasing, model choice, and more. Small changes in any of these can have an outsized impact on outcomes. With no “best practices,” you can’t safely copy an airline’s agent pattern and assume it’ll work for an ecommerce cart or B2B SaaS onboarding.
This is where most experimentation programs hit a wall.
If your experimentation platform can’t meet your data where it already lives, you’re either throwing spaghetti against the wall or engineering expensive pipelines just to run a single test. Warehouse-native experimentation lets you define metrics, assign treatments, and analyze results directly against your existing infrastructure, without duplicating your stack or forcing every experiment through a single instrumentation layer.
For teams building on foundation models, warehouse-native experimentation lets you run experiments on prompts, models, flows, and agent behaviors with the same rigor you used to reserve for UI tests. And you can do it without rebuilding your instrumentation layer to support it.
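As a rough illustration of what “assigning treatments” can look like when the thing under test is a prompt rather than a button, here is a minimal sketch of deterministic, hash-based assignment. Everything in it is a hypothetical placeholder: the experiment name, the `VARIANTS` configuration, the model label, and the shape of the exposure row are assumptions for illustration, not any vendor’s API.

```python
import hashlib

# Hypothetical variants: same model, different system prompt.
VARIANTS = {
    "control": {
        "model": "model-a",
        "system_prompt": "You are a helpful shopping assistant.",
    },
    "treatment": {
        "model": "model-a",
        "system_prompt": "You are a concise shopping assistant. Ask one clarifying question first.",
    },
}


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Map a user to a variant deterministically, so repeat sessions see the same prompt."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def exposure_row(user_id: str, experiment: str) -> dict:
    """Build the record you would log so exposures can later be joined to outcome metrics."""
    variant = assign_variant(user_id, experiment)
    return {"experiment": experiment, "user_id": user_id, "variant": variant, **VARIANTS[variant]}


if __name__ == "__main__":
    row = exposure_row("user-123", "assistant_prompt_v2")
    print(row["variant"], "->", row["system_prompt"])
```

Salting the hash with the experiment name keeps assignments independent across experiments, and logging the exposure alongside the prompt and model makes it possible to analyze the result against metrics that already live in the warehouse.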
A few questions worth taking back to your team:
- Are you experimenting on prompts, models, and agent behaviors, or just UI tweaks?
- Do your engineering and marketing experiments share the same metrics and user definitions?
- When a feature is behind a flag, can non-engineering teams still test how it’s presented and messaged?
- Can you run experiments on data that lives in your warehouse without rebuilding your instrumentation layer?
- Does your current pricing and packaging still make sense for an AI-first product, and are you actually testing that hypothesis?
Experimentation in an AI world
AI has fundamentally changed experimentation.
When idea generation is free, velocity stops being the signal of a healthy, scaled experimentation program. When an agent can ship an analysis in seconds, trusted results matter more than fast ones. When your product is a nondeterministic agent, the surfaces that need testing outgrow the platforms most teams are still using.
Modern platforms like Amplitude offer an easy-to-use, unified place to run experimentation as a continuous operating system. Amplitude helps teams test fast enough, broadly enough, and deeply enough to matter.

Viv Magida
Global Solutions Architect, Amplitude
Viv is a Global Solutions Architect at Amplitude, where she helps customers build rigorous experimentation programs. Previously, she helped scale Peacock’s experimentation program from zero to 750 tests per year. She jumped at the chance to join a leader in product analytics and experimentation, and is thrilled to be part of the Amplitude team helping companies build better products in the AI age.