Set the MDE for your experiment
Before you run an experiment, set a Minimum Detectable Effect (MDE) to estimate how you measure success. Think of MDE as the minimum change you want to find by running your experiment. No standard calculation exists for the MDE, so setting one requires judgment. In Amplitude Experiment, the default MDE is 2%. Because the MDE links directly to your business needs, be thoughtful during each experiment's design phase. When you set the MDE, consider the primary metric and any associated risks.
MDE and the metric goal type
When you create your experiment, you select between two metric goal types: success or guardrail. The following case study shows how the goal type can change the MDE.
The marketing director of a small arts organization uses Amplitude Experiment to plan updates to a ticketing management system. With no data science team, the director decides which experiments to run and how to run them. The planned updates are:
- Add a "quick checkout" option on event pages to increase conversion from page visits to ticket sales for logged-in users.
- Add a new payment option during checkout for all users.
The goal of the first update is to increase conversion rates, so a success metric fits. The metric tells the marketing director whether the new button sits in the right place and stays visible enough to meet the conversion rate goal. The marketing director's next fiscal quarter goal is to increase ticket sale revenue by 3%. These company goals shape the success metric and set the test direction to increase and the MDE to 3%.
The second update meets financial requirements. Because this change to the checkout process is required, a guardrail metric helps confirm that the new payment method doesn't decrease completed sales. Over the last four fiscal quarters, an average of 1% of users abandoned checkout after starting the process. The guardrail metric direction is decrease and the MDE is 1%.
If you run a T-test, Amplitude's duration estimator can also help set the MDE. Review the recommended MDE that Amplitude provides, or change the MDE until the duration estimate is reasonable.
MDE and the primary metric
In Amplitude, the MDE is relative to the control mean of the primary metric. For example, if the conversion rate for the control group is 10%, an MDE of 2% corresponds to an absolute change of 0.2 percentage points (10% × 0.02), so Amplitude detects a change when the rate moves outside the range 9.8% to 10.2%.
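The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only: the function name `mde_bounds` is hypothetical and isn't part of any Amplitude API.

```python
def mde_bounds(control_mean: float, relative_mde: float) -> tuple[float, float]:
    """Return the (lower, upper) values that a relative MDE implies around the control mean."""
    delta = control_mean * relative_mde  # absolute change implied by the relative MDE
    return control_mean - delta, control_mean + delta


# Control conversion rate of 10% with a relative MDE of 2%.
lower, upper = mde_bounds(0.10, 0.02)
print(f"{lower:.3f} to {upper:.3f}")  # 0.098 to 0.102
```

The same relative MDE produces a different absolute range for a different control mean, which is why the MDE needs a second look whenever the baseline shifts.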
In the ticketing case study, the primary metric of ticket sales may require a different MDE if:
- The experiment runs during an annual discount on ticket prices.
- The number of available events, which correlates positively with ticket sales, is much smaller than in previous fiscal quarters.
- The experiment runs during a global pandemic that prohibits large in-person gatherings.
Consider your business needs and circumstances when you plan an experiment and set the MDE. One goal of any experiment is to cause as little harm as possible.
You can also set the MDE when you analyze your experiment results.
MDE and associated risk
Experiments don't produce risk-free results, and running them can take time and require large samples. This can mean higher costs and greater potential for adverse effects on users. The MDE has an inverse relationship to sample size: the smaller or more "sensitive" the MDE, the larger the sample size you need to reach statistical significance.
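To get a feel for how quickly the required sample grows as the MDE shrinks, here is a minimal sketch that uses the textbook normal approximation for a fixed-horizon, two-sided test of two proportions. It isn't Amplitude's calculation. The function name, the 10% baseline rate, the 5% significance level, and the 80% power are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided, two-proportion z-test."""
    treated_rate = baseline_rate * (1 + relative_mde)  # rate if the MDE is exactly met
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)      # critical value for the test
    z_power = NormalDist().inv_cdf(power)              # quantile for the desired power
    variance_sum = (baseline_rate * (1 - baseline_rate)
                    + treated_rate * (1 - treated_rate))
    effect = treated_rate - baseline_rate              # absolute difference to detect
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


# Assumed baseline conversion rate of 10%; compare three candidate MDEs.
for mde in (0.05, 0.03, 0.02):
    print(f"MDE {mde:.0%}: {sample_size_per_variant(0.10, mde):,} users per variant")
```

Because the effect is squared in the denominator, halving the MDE roughly quadruples the sample size under this approximation.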
Use these questions to assess risk:
- Are the estimated costs or run time of an experiment worth the expected outcome?
- What are the possible negative side effects for users in the experiment, and is the outcome worth potential losses?
- Do you need an experiment at all, or should you consider other options, such as a feature release?
- What's the smallest percentage change that satisfies you? For example, would you roll out the experiment if you saw a lift of 2%, 3%, or 5%?
- If your experiment produces positive outcomes, such as an increase in annual subscribers from 100 to 105, is that change large enough to present to leadership?
Common questions
These questions cover Amplitude Experiment's duration estimate. For setup guidance and pre-launch planning, refer to Estimate the duration of your experiments.
How does the duration estimate work?
The experiment duration estimate predicts how long your experiment needs to run to generate statistically significant results. The duration estimate works only with the primary metric and sequential testing, and doesn't support Experiment Results.
Amplitude Experiment uses the means, variances, and exposures of your control and variants to forecast expected behavior and calculate the number of days your experiment takes to reach statistical significance. The prediction improves as more data arrives. If any of these inputs change significantly during the experiment, the accuracy of the prediction is likely to decrease.
What's the difference between the duration estimate and the duration estimator?
Amplitude calculates the duration estimate using sequential testing while the experiment runs. The duration estimator uses the T-test.
Why isn't the duration estimate showing?
The duration estimate displays when your experiment meets all the following criteria:
- The metric hasn't yet reached statistical significance.
- The end date of the analysis window is in the future.
- The experiment has enough observations.
- The experiment status isn't rolled out or rolled back.
- None of the following statistical conditions hold:
  - The absolute lift is outside the confidence interval.
  - The confidence interval flips (the lower bound is greater than the upper bound). This can happen when the mean for the treatment or control fluctuates while the experiment runs, or when rollout weights or targeting segments change.
  - The standard error is very small.
  - The variance is negative.
  - The conversion rate is greater than 1 or less than 0 (where applicable).
If the estimate doesn't show, one or more of these criteria isn't met.
Is there a cap for the duration estimate?
Yes, the cap is 40 days. The reasons:
- The duration estimate uses real-time simulations, where latency scales with the number of days simulated.
- Means and standard deviations can change over time, especially for experiments with longer running times.
- Short-term predictions are easier to make accurately than long-term predictions. (Weather forecasts beyond ten days change frequently as the date approaches for the same reason.)
- Most experiments shouldn't take 40 days to complete.
How does Amplitude Experiment determine the number of exposures per day?
Amplitude Experiment assumes a constant number of exposures per day. Amplitude calculates this value by dividing the cumulative exposures as of today by the number of days the experiment has run so far.
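As a rough illustration of this constant-rate assumption, the calculation can be sketched as follows. The function names are hypothetical, and `projected_exposures` only extends the same assumption forward. It doesn't describe Amplitude's internal implementation.

```python
def exposures_per_day(cumulative_exposures: int, days_run: int) -> float:
    """Average daily exposures so far, treated as constant for the rest of the experiment."""
    if days_run <= 0:
        raise ValueError("days_run must be positive")
    return cumulative_exposures / days_run


def projected_exposures(cumulative_exposures: int, days_run: int, extra_days: int) -> float:
    """Total exposures expected after extra_days more days at the same daily rate."""
    return cumulative_exposures + exposures_per_day(cumulative_exposures, days_run) * extra_days


# 14,000 exposures collected over the first 7 days.
print(exposures_per_day(14_000, 7))        # 2000.0
print(projected_exposures(14_000, 7, 10))  # 34000.0
```

If traffic ramps up or down during the experiment, the real exposure count departs from this projection, and the duration estimate shifts with it.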
What types of errors are there?
The duration estimate is still an estimate. Don't take it as truth. You may encounter:
- Irreducible error: error inherent to the estimation process. You can't correct for it. Each simulation reaches statistical significance at a different time, which is the main reason to run multiple simulations. The time it takes for an experiment to reach statistical significance is itself a random variable. The time depends on the p-value, which depends on the data the experiment collects. Even if you know the control mean, control standard deviation, treatment mean, and treatment standard deviation, and you force a normal distribution and independence on everything, Experiment can't reduce error all the way to zero.
- Incorrect estimates: when Amplitude Experiment generates a duration estimate, Amplitude estimates the control population mean, control population standard deviation, and other quantities from the sample. These estimates are as good as they can be, but they still leave room for error.
- Drift: for example, if today the control mean is 5 and ten days from now it's 15, the control mean shows drift. A common example is seasonality. Drift in any input degrades the estimate, because the model assumes no drift during hypothesis testing.
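The irreducible error described above can be seen in a small simulation. The sketch below is a simplification, not Amplitude's method: it checks a plain z-test with a known standard deviation once per day instead of running a proper sequential test, and every name and parameter value in it is an assumption made for illustration. Even with fixed, known means and standard deviation, each simulated run can reach significance on a different day.

```python
import random
from statistics import NormalDist


def days_to_significance(rng, control_mean, treatment_mean, std_dev,
                         exposures_per_day, max_days=40, alpha=0.05):
    """Simulate one experiment; return the first day a two-sided z-test is significant, else None."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    control_sum = treatment_sum = 0.0
    n = 0  # observations collected so far in each group
    for day in range(1, max_days + 1):
        for _ in range(exposures_per_day):
            control_sum += rng.gauss(control_mean, std_dev)
            treatment_sum += rng.gauss(treatment_mean, std_dev)
        n += exposures_per_day
        diff = treatment_sum / n - control_sum / n
        std_error = std_dev * (2 / n) ** 0.5  # standard error of the difference in means
        if abs(diff) / std_error > z_crit:
            return day
    return None  # not significant within max_days


rng = random.Random(7)  # fixed seed so the run is repeatable
results = [days_to_significance(rng, 5.0, 5.2, 2.0, 100) for _ in range(20)]
print(results)  # stopping day for each simulated experiment (None = never significant)
```

The spread in the printed stopping days comes only from sampling noise, which is why a single duration estimate always carries some uncertainty.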
What does "Threshold reached" mean?
If your experiment displays "Threshold reached" with "0 days left" in the duration estimate, the confidence interval doesn't contain the MDE (the threshold).
This message isn't necessarily bad when your recommendation metric is a guardrail, because the effect size is smaller than the allowed amount.
The message is a bad sign when your recommendation metric is a success metric, because the effect size is smaller than what you hoped for. End the experiment in this case: even if you reach statistical significance, the lift is smaller than what's practically significant.
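The condition behind this message can be written as a simple check. This is a literal reading of the description above, not Amplitude's code: the function name is hypothetical, and the sketch assumes the confidence interval and the MDE are both expressed as relative lifts for a metric whose direction is increase.

```python
def threshold_reached(ci_lower: float, ci_upper: float, mde: float) -> bool:
    """True when the MDE falls outside the confidence interval for the relative lift."""
    return not (ci_lower <= mde <= ci_upper)


# The interval lies entirely below a 3% MDE: the observed effect is smaller than the threshold.
print(threshold_reached(0.002, 0.015, 0.03))  # True

# The interval still contains the 3% MDE: the threshold hasn't been reached.
print(threshold_reached(0.010, 0.050, 0.03))  # False
```

The case discussed above, where the effect size is smaller than the MDE, corresponds to the first example, in which the whole interval sits below the threshold.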
What does "Statistical significance may never reach" mean?
When the duration estimate shows 40 or more days to complete an experiment, Amplitude may assume that the experiment isn't likely to reach statistical significance after running for two weeks. In those cases, Experiment shows this message.