When product intuition and bias collide, experimentation offers a way forward. In Amplitude’s recent webinar, Ronny Kohavi, renowned experimentation expert and former executive at Microsoft, Amazon, and Airbnb, joined Jheel, Principal Product Manager at Amplitude, to share decades of lessons from building experimentation platforms and cultures at scale.
What followed was part data science masterclass, part cultural intervention, and all-around a call to action: if you want innovation, you need to test—early, often, and across your entire organization. Read on for a summary of their discussion!
Check out the full recording of Ronny and Jheel’s webinar.
The experimentation paradox: Most ideas fail
Ronny opens with a hard truth: most product ideas don’t work.
Industry-wide data shows that the vast majority of experiments fail to deliver meaningful gains on core business metrics. At Bing, only 15% of launched experiments succeeded. Even top performers like Google Ads, Netflix, and Booking.com report 10% success rates on average.
So why test?
Because the alternative is building blindly—shipping unvalidated ideas based on opinion or instinct, often backed by the “HiPPO” (Highest Paid Person’s Opinion). And if those don’t succeed, what do you learn?
Experimentation helps you find what works, fail safely, and learn fast. In short, low success rates aren’t a bug; they’re the reason you need experimentation in the first place.
From HiPPO to data-driven: How culture drives outcomes
Throughout the talk, Ronny emphasized that tools alone don’t build great experimentation programs—culture does. And shifting that culture requires challenging some deeply held beliefs:
- “We know what to do. It’s in our DNA.”
- “We don’t need to test something we’re going to ship anyway.”
- “We’ve always done it this way.”
Ronny called this the “Semmelweis Reflex,” referencing the 19th-century doctor who discovered that handwashing dramatically reduced deaths in hospitals—only to be rejected by the medical establishment for decades. Innovation often meets resistance, especially when it contradicts existing power structures or expertise.
The same holds true in product development. Executives and project managers may bristle at experiments that contradict their assumptions. The answer, Ronny says, is to normalize experimentation as a learning tool—not a judgment of someone's ability.
Four stages of experimentation maturity
To help organizations map their path forward, Ronny shared a model for experimentation maturity:
1. Hubris
Teams are frequently overconfident in their instincts. Too often, decisions are based on opinions, not evidence. A data-driven culture shift is crucial for overcoming this stage.
2. Measurement and control
Organizations begin to track metrics and adopt safe deployment practices. They may use feature flags and A/B tests—but usually only after shipping.
3. Accept results (“Semmelweis Reflex”)
Experiments challenge long-held beliefs, and sometimes teams resist unexpected or uncomfortable results. When the “But that’s how we’ve always done it” argument comes in, it’s crucial to trust in the data.
4. Fundamental understanding
Experimentation becomes embedded in every decision. Teams test small ideas regularly, iterate based on results, and embrace failures as opportunities to learn.
Lessons from experience
Ronny’s team scaled Bing from a handful of experiments to more than 1,000 per month. This didn’t happen overnight. Key accelerators included:
- Top-down support: When Satya Nadella became CEO, experimentation got a mandate.
- Safe deployment: Feature flags allowed teams to ship with control and gradually introduce experiments.
- Cultural storytelling: Sharing wins (and surprising failures) helped shift mindsets company-wide.
- Automation: The cost of running an experiment dropped to near-zero thanks to robust infrastructure.
Other brands saw similar results. When Microsoft Office adopted the same platform, experimentation grew 700% year over year. But Ronny warned that you can’t go from zero experiments to thousands overnight. The groundwork—metrics, culture, tooling—must come first.
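The safe-deployment idea above, shipping behind a feature flag and widening exposure gradually, can be sketched as deterministic percentage bucketing. This is a minimal illustration, not Bing's or Amplitude's implementation: the function name `in_rollout`, the flag name, and the choice of SHA-256 hashing into 10,000 buckets are all assumptions made for the example.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hypothetical helper: hashes the flag name and user ID together so each
    user lands in a stable bucket from 0 to 9999 for a given flag.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    # percent is on a 0-100 scale; 10,000 buckets give 0.01% granularity.
    return bucket < percent * 100


# Widen exposure in steps; the flag name and user IDs are made up.
users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50, 100):
    exposed = sum(in_rollout(u, "new-checkout", percent) for u in users)
    print(f"{percent:>3}% rollout -> {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 10% still sees it at 50%, which keeps the experience consistent while exposure grows.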
Helping the HiPPO
Without data, the HiPPO usually takes precedence. But with it, leaders are far more likely to set aside their personal opinion in favor of the cold, hard facts.
Ronny introduced a hierarchy of evidence to deliver to the highest-paid person:
- Anecdotes and opinions: lowest trust (e.g., “My daughter liked this feature.”)
- Observational data: useful, but prone to bias.
- Non-randomized tests: better, but causality isn’t clear.
- Randomized controlled experiments: the gold standard.
- Multiple randomized controlled experiments + meta-analysis: highest confidence.
Online, randomized controlled experiments take the form of A/B tests. They allow teams to quantify causal impact with high confidence—and avoid being misled by intuition or correlation.
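To make "quantify causal impact" concrete, here is a minimal sketch of one common way to analyze a conversion-rate A/B test: a two-proportion z-test. The webinar summary does not say which statistical method Ronny's teams used, so treat the test choice, the function name, and the sample numbers as assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of control (A) and treatment (B).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one user")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Made-up numbers: 10,000 users per arm.
lift, z, p = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p:.4f}")
```

Random assignment is what lets the measured lift be read as caused by the change; the test only says how surprising that lift would be if the change had no effect.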
Building the right metrics and mindset
A successful experimentation culture also requires well-defined metrics. At Bing, Ronny’s team used an Overall Evaluation Criterion (OEC)—a composite score combining revenue, relevance, and user satisfaction.
The goal? Shift accountability from output (“Did we ship it?”) to outcomes (“Did it work?”). That’s when teams start optimizing for impact instead of activity.
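One simple way to picture a composite OEC is a weighted sum of each metric's relative change between treatment and control. The sketch below is an assumption for illustration only: the source does not describe how Bing actually combined revenue, relevance, and user satisfaction, and the weights and metric values here are invented.

```python
def oec_score(treatment: dict, control: dict, weights: dict) -> float:
    """Weighted sum of per-metric relative changes (treatment vs. control).

    Hypothetical formula: a positive score means the treatment improved
    the weighted mix of metrics; a negative score means it hurt it.
    """
    score = 0.0
    for name, weight in weights.items():
        relative_change = (treatment[name] - control[name]) / control[name]
        score += weight * relative_change
    return score


# Illustrative weights and values, not taken from the webinar.
weights = {"revenue": 0.3, "relevance": 0.4, "satisfaction": 0.3}
control = {"revenue": 2.00, "relevance": 0.80, "satisfaction": 4.0}
treatment = {"revenue": 2.10, "relevance": 0.80, "satisfaction": 3.8}

print(f"OEC change: {oec_score(treatment, control, weights):+.4f}")
```

In this made-up case a 5% revenue gain is cancelled out by a 5% drop in satisfaction, which is the point of a composite score: a change cannot look like a win by improving one metric at the expense of another.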
Common questions, real answers
During the Q&A, Jheel asked Ronny to address some of the attendees’ questions around experimentation:
My org only ships releases every six months. Can we even experiment?
Yes. Ronny has seen organizations shift their release-cycle culture by solving for real pain points in a controlled environment. Safe deployment can solve issues on a smaller scale and lead to wider buy-in for a more agile culture.
Who should own experimentation?
A model that centralizes best practices, tooling, and training, then enables individual teams to run their own tests, often works best.
How do I convince leadership?
Share success rates from industry leaders and emphasize the cost of false confidence. Most code doesn’t help users—experimentation prevents bloat and regression.
How do you test new features without hurting the user experience?
Make small, gradual changes instead of big redesigns. Start testing with your most active users, who will give you the clearest feedback.
Experimentation culture is about learning from mistakes
Ronny closed with an example that sums up experimentation, leadership support, and a data-driven culture. A young executive at IBM attempted a risky venture that ultimately cost the company $10MM. When the executive tried to hand in his resignation, Tom Watson Sr., the founder of IBM, responded, “You can’t be serious! We’ve just spent $10MM educating you.”
For Ronny Kohavi, successful experimentation is tied to metrics, proved by A/B tests, and fueled by a data-driven culture—even if your experiment fails.
Ronny and Jheel’s conversation is a goldmine of experimentation insights. Watch the full recording to learn more.
Ronny also offers two courses on Maven:
- A two-week deep dive into experimentation strategy
- A course for teams already running tests but looking to level up