Set the MDE for your experiment
Before you run an experiment, set a Minimum Detectable Effect (MDE) to define how you'll measure success. Think of the MDE as the smallest change you hope to detect by running your experiment. Because there's no fail-safe formula for calculating an MDE, setting one can be tricky. Amplitude Experiment uses a default MDE of 2%; however, because the MDE is directly tied to your unique business needs, be thoughtful during each experiment's design phase. Considerations for setting the MDE should include the primary metric and any associated risks.
MDE and the metric goal type
When you create your experiment, you select between two metric goal types: success or guardrail.
The following case study examines how the goal type can alter the MDE.
The marketing director of a small arts organization is using Amplitude Experiment to help plan updates to a ticketing management system. With no data science team, the director decides whether experiments are needed and, if so, how best to run them. The anticipated updates are:
- Adding a "quick checkout" option on event pages, to increase conversion of page visits to ticket sales for logged-in users.
- Adding a new payment option during checkout for all users.
Because the goal of the first update is to increase conversion rates, a success metric is appropriate here. That metric should tell the marketing director whether the new button is in the right place and visible enough to meet the conversion rate goal. The marketing director notes that their goal for the next fiscal quarter is to increase ticket sale revenue by 3%. These company goals are important when planning the success metric: they steer the test's direction to increase and set the MDE to 3%.
The second update is needed to meet financial requirements. As a non-negotiable enhancement to the checkout process, a guardrail metric may help confirm that the additional payment method doesn't decrease completed sales for users in that process. Over the last four fiscal quarters, an average of 1% of users abandoned checkout after starting the process. Therefore, this guardrail metric would have a direction set to decrease and an MDE set to 1%.
If running a T-test, Amplitude's duration estimator can also help set the MDE. Review the recommended MDE that Amplitude gives you or change the MDE until the duration estimate is reasonable.
MDE and the primary metric
In Amplitude, the MDE is relative to the control mean of the primary metric. For example, if the conversion rate for the control group is 10%, an MDE of 2% (0.2 percentage points) means the experiment detects a change only if the rate moves outside the range of 9.8% to 10.2%.
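As a concrete illustration, this minimal sketch converts a relative MDE into the absolute detection bounds described above. The function name and numbers are illustrative, not part of Amplitude's API.

```python
def detectable_range(control_mean: float, relative_mde: float) -> tuple[float, float]:
    """Return the absolute bounds implied by a relative MDE.

    A relative MDE is a fraction of the control mean, so a 2% MDE on a
    10% conversion rate corresponds to 0.2 percentage points.
    """
    delta = control_mean * relative_mde
    return control_mean - delta, control_mean + delta

low, high = detectable_range(control_mean=0.10, relative_mde=0.02)
print(f"Detectable outside {low:.3f}-{high:.3f}")  # 0.098-0.102, i.e. 9.8%-10.2%
```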
Refer to the case study from the previous section. Consider how the primary metric of ticket sales may require a change in the MDE if:
- The hypothesis testing experiment runs during an annual discount on ticket prices.
- The number of distinct event tickets on sale, which is positively correlated with ticket sales, is significantly smaller than in previous fiscal quarters.
- The experiment runs during a global pandemic where large in-person gatherings are prohibited.
You must consider any unique business needs and circumstances when planning for an experiment and setting the MDE. One goal of any experiment should be to cause as little harm as possible.
You can also set the MDE when analyzing your experiment results.
MDE and associated risk
Experiments don't produce risk-free results, and running them can take significant time and require large sample sizes. This can mean higher costs and a greater potential for adverse effects on users. The most important thing to remember when assessing risk is that the MDE is inversely related to sample size: the smaller or more "sensitive" the MDE, the larger the sample size needed to reach statistical significance.
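To see that inverse relationship in numbers, here's a minimal sketch using the textbook normal-approximation formula for a two-proportion test. This is a standard approximation for illustration only; Amplitude's T-test and sequential testing compute their own requirements.

```python
from statistics import NormalDist

def sample_size_per_group(p_control: float, relative_mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-proportion test.

    Textbook normal-approximation formula; Amplitude's actual tests
    will produce different exact numbers.
    """
    p_treat = p_control * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    delta = p_treat - p_control
    return int((z_alpha + z_power) ** 2 * variance / delta ** 2)

# Halving the MDE roughly quadruples the required sample size.
for mde in (0.04, 0.02, 0.01):
    print(f"MDE {mde:.0%}: ~{sample_size_per_group(0.10, mde):,} users per group")
```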
Here are some additional questions to help you further assess risk:
- Are the estimated costs or run time of an experiment worth the wanted outcome?
- What are the possible negative side effects for users exposed to the experiment, and would the outcome be worth the potential losses?
- Is an experiment needed at all, or should other options, such as a feature release, be considered instead?
- What's the smallest percentage change you would be happy with? For example, would you be willing to roll out the experiment if you saw a lift of 2%, 3%, or 5%?
- If your experiment resulted in positive outcomes, such as an increase in the number of annual subscribers from 100 to 105, would that be a big enough change to present to leadership?
Common questions
These questions cover Amplitude Experiment's duration estimate. For setup guidance and pre-launch planning, see Estimate the duration of your experiments.
How does the duration estimate work?
The experiment duration estimate predicts the length of time your experiment needs to generate statistically significant results. It can only be used with the primary metric and sequential testing, and isn't supported in Experiment Results.
Amplitude Experiment uses the means, variances, and exposures of your control and variants to forecast expected behavior and calculate the number of days your experiment takes to reach statistical significance. The prediction improves as more data arrives. If any of these inputs change significantly during the experiment, the accuracy of the prediction is likely to decrease.
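The sketch below mimics that idea with a deliberately simplified Monte Carlo simulation: accumulate simulated daily data, test each day, and take the median day at which significance is first reached. It assumes numpy and scipy are available, uses a plain Welch t-test rather than Amplitude's sequential test, and all inputs are illustrative.

```python
import numpy as np
from scipy import stats

def estimate_days(control_mean, treatment_mean, std, daily_exposures,
                  alpha=0.05, max_days=40, n_sims=200, seed=0):
    """Median day on which a repeated t-test first reaches significance.

    Simplified illustration of simulation-based duration estimation;
    Amplitude's actual sequential-testing procedure differs.
    """
    rng = np.random.default_rng(seed)
    days_to_significance = []
    for _ in range(n_sims):
        control, treatment = [], []
        for day in range(1, max_days + 1):
            control.extend(rng.normal(control_mean, std, daily_exposures))
            treatment.extend(rng.normal(treatment_mean, std, daily_exposures))
            _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
            if p_value < alpha:
                days_to_significance.append(day)
                break
        else:
            days_to_significance.append(max_days)  # capped, like the 40-day limit
    return int(np.median(days_to_significance))

# Illustrative inputs: control mean 10.0, a 1% lift, 500 exposures per day.
print(estimate_days(control_mean=10.0, treatment_mean=10.1, std=2.0,
                    daily_exposures=500))
```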
What's the difference between the duration estimate and the duration estimator?
Amplitude calculates the duration estimate using sequential testing as the experiment is running. The duration estimator uses the T-test.
Why is the duration estimate not showing?
The duration estimate is visible when your experiment meets all the following criteria:
- The metric hasn't yet reached statistical significance
- The end date of the analysis window is in the past
- The experiment has enough observations
- The experiment is rolled out or rolled back
- None of the following statistical conditions hold:
  - The absolute lift is outside the confidence interval
  - The confidence interval flips (lower confidence interval > upper confidence interval); this can happen if the mean for either the treatment or control fluctuates while the experiment is running, or if rollout weights or targeting segments change
  - The standard error is very small
  - The variance is negative
  - The conversion rate is greater than 1 or less than 0 (where applicable)
If the estimate isn't showing, one or more of these criteria isn't met.
Is there a cap for the duration estimate?
Yes: the cap is currently 40 days, for the following reasons:
- The duration estimate uses real-time simulations, where latency scales with the number of days simulated.
- Means and standard deviations usually don't change much over time, especially for experiments with longer running times.
- Short-term predictions are easier to make accurately than long-term predictions. (This is why weather forecasts beyond ten days change frequently as the date approaches.)
- Most experiments shouldn't take 40 days to complete.
How does Amplitude Experiment determine the number of exposures per day?
Amplitude Experiment assumes a constant number of exposures per day, calculated by dividing the cumulative exposures (as of today) by the number of days the experiment has been running so far.
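In code, that assumption amounts to a simple projection. The function and values below are illustrative, not Amplitude's API:

```python
def projected_days_remaining(cumulative_exposures: int, days_running: int,
                             exposures_needed: int) -> float:
    """Project remaining days assuming a constant daily exposure rate."""
    daily_rate = cumulative_exposures / days_running
    remaining = max(exposures_needed - cumulative_exposures, 0)
    return remaining / daily_rate

# For example: 30,000 exposures over 6 days, 80,000 needed in total.
print(projected_days_remaining(30_000, 6, 80_000))  # 10.0 days
```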
What types of errors are there?
The duration estimate is still an estimate; don't treat it as ground truth. You may encounter three types of error:
- Irreducible error: error inherent to the estimation process, which you can't correct for. The time it takes an experiment to reach statistical significance is itself a random variable: it depends on the p-value, which in turn depends on the data the experiment collects. As a result, each simulation reaches statistical significance at a different time, which is the main reason to run multiple simulations. Even if you knew the control mean, control standard deviation, treatment mean, and treatment standard deviation, and forced normality and independence on everything, Experiment couldn't reduce this error all the way to zero.
- Incorrect estimates: when Amplitude Experiment generates a duration estimate, it estimates the control population mean, control population standard deviation, and other quantities from the sample. These estimates are as good as they can be, but they still leave room for error.
- Drift: for example, if the control mean is 5 today and 15 ten days from now, the control mean has drifted. A common cause is seasonality. Drift in any input degrades the estimate, because the model assumes no drift when doing hypothesis testing.
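As a toy illustration of drift, here's what a control mean drifting from 5 toward 15 over ten days looks like against the no-drift assumption (illustrative numbers only):

```python
import numpy as np

# Illustrative only: a control mean that drifts from 5 toward 15 over
# ten days, e.g. seasonality. An estimate made on day 1 assumes the
# flat value and diverges from the drifting reality.
true_means = np.linspace(5, 15, 10)       # drifting control mean
assumed = np.full(10, true_means[0])      # no-drift assumption
print(np.round(true_means - assumed, 1))  # gap grows day by day
```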
What does "Threshold reached" mean?
If your experiment displays "Threshold reached" with "0 days left" in the duration estimator, the confidence interval doesn't contain the MDE (the threshold).
This isn't necessarily bad if your recommendation metric is a guardrail — the effect size would be smaller than the allowed amount.
It's a bad sign if your recommendation metric is a success metric — the effect size would be smaller than what you hoped for. End the experiment in this case: even if you reach statistical significance, the lift would be smaller than what's practically significant.
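Conceptually, the check behind "Threshold reached" can be sketched as the lift's confidence interval sitting entirely below the MDE. This is a simplified reading with illustrative values, not Amplitude's exact logic:

```python
def threshold_reached(ci_lower: float, ci_upper: float, mde: float) -> bool:
    """Simplified reading: the entire lift confidence interval sits
    below the MDE, so the true effect is likely smaller than the threshold."""
    return ci_upper < mde

# A lift CI of [0.2%, 1.4%] against a 3% MDE: the interval excludes
# the hoped-for effect, so the threshold is reached.
print(threshold_reached(0.002, 0.014, 0.03))  # True
```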
What does "Statistical significance may never reach" mean?
When the duration estimate shows 40 or more days to complete an experiment, Amplitude may assume it isn't likely to reach statistical significance after running for two weeks. In those cases, Experiment shows this message.