This article covers frequently asked questions about Amplitude Experiment's duration estimate.
The experiment duration estimate predicts how long your experiment will need to run to generate statistically significant results. It works only with the primary metric and sequential testing, and is not currently supported in Experiment Results.

How does the duration estimate work?
Amplitude Experiment uses the means, variances, and exposures of your control and treatment variants to forecast expected behavior and calculate how many days your experiment will take to reach statistical significance. As Amplitude Experiment receives more data over time, this prediction improves. However, if any of these inputs changes significantly during the experiment, the accuracy of the prediction will likely decrease.

What is the difference between the duration estimate and the duration estimator?
The duration estimate is calculated using sequential testing while the experiment is running. The duration estimator, however, uses a T-test.
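As a concrete illustration of how means, variances, and daily exposures can translate into a days-to-significance forecast, here is a minimal sketch using a two-sample z-approximation. This is not Amplitude's actual sequential-testing algorithm, and all names and numbers are hypothetical.

```python
from math import ceil

# Illustrative only: a fixed-horizon z-approximation, not Amplitude's
# actual sequential-testing calculation. All names are hypothetical.
def estimate_days(mean_c, sd_c, mean_t, sd_t, exposures_per_day):
    # Standard normal quantiles for a two-sided alpha of 0.05 and 80% power.
    z_alpha = 1.95996  # norm.ppf(1 - 0.05 / 2)
    z_beta = 0.84162   # norm.ppf(0.80)
    delta = abs(mean_t - mean_c)
    if delta == 0:
        return float("inf")  # no effect to detect
    # Required sample size per group for this effect size.
    n_per_group = (z_alpha + z_beta) ** 2 * (sd_c ** 2 + sd_t ** 2) / delta ** 2
    # Assume daily exposures split 50/50 between control and treatment.
    return ceil(2 * n_per_group / exposures_per_day)

print(estimate_days(5.0, 2.0, 5.3, 2.0, exposures_per_day=400))  # 4
```

Note how the forecast depends on all three inputs: a larger variance or a smaller difference in means pushes the estimate out, while more exposures per day pulls it in.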
Why is the duration estimate not showing?
The duration estimate is visible only when certain criteria are met. If the estimate is not showing, it likely means that one or more of these criteria have not been met.
What do worst case, average case, and best case mean?
Amplitude Experiment uses the worst case, average case, and best case to describe the uncertainty inherent in its estimate of the time it will take for a hypothesis test to reach statistical significance.
Is there a cap for the duration estimate?
Yes. The duration estimate is currently capped at 40 days.
How does Amplitude Experiment determine the number of exposures per day?
Amplitude Experiment assumes a constant number of exposures per day. Exposures per day are calculated by dividing the cumulative exposures (as of today) by the number of days the experiment has been running so far.
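The exposures-per-day calculation is simple division. A quick sketch with hypothetical numbers:

```python
# Hypothetical numbers: an experiment that has accumulated 12,000
# exposures over its first 8 days.
cumulative_exposures = 12_000
days_running = 8

# Exposures per day = cumulative exposures / days running so far.
exposures_per_day = cumulative_exposures / days_running
print(exposures_per_day)  # 1500.0
```

Because this is an average over the whole run so far, a sudden traffic spike or drop will shift the figure, and the estimate with it.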
What types of errors are there?
The experiment duration estimate is still an estimate; it should not be taken as ground truth. Here is a list of some of the error types you might encounter.

Irreducible error
Irreducible error is error inherent to the estimation process; unfortunately, it cannot be corrected for. If you run multiple simulations of the same experiment, each one reaches statistical significance at a different time: this variation is the main reason to run multiple simulations in the first place. This is just how randomness works. In fact, the time it takes for an experiment to reach statistical significance is itself a random variable. It depends on the p-value, which in turn depends on the data your experiment collects. Even if we cheat and pretend to know the control mean, control standard deviation, treatment mean, and treatment standard deviation, and even if we force normality and independence on everything, we still cannot reduce the error all the way to zero. See this video on irreducible error and bias for more information.

Incorrect estimates
When Amplitude Experiment generates a duration estimate, it estimates the control population mean and control population standard deviation, among other things, from the sample data. These estimates are as good as they can be; that said, there is potential for error here as well.

Drift
Drift means a statistic changes over the course of the experiment. For example, if today the control mean equals 5, and ten days from now the control mean equals 15, there is drift in the control mean. A common example of drift is seasonality. If there is any drift in any of the statistics, the estimate will perform poorly, because it assumes no drift when doing hypothesis testing.
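The irreducible-error point, that the time to reach statistical significance is itself a random variable, can be illustrated with a small simulation. This sketch naively checks a fixed z-threshold every day (the "peeking" problem that sequential testing is designed to correct), so it is illustrative only; the function and parameters are hypothetical.

```python
import random
import statistics

def days_to_stat_sig(mean_c, mean_t, sd, exposures_per_day,
                     max_days=100, z_crit=1.95996, rng=None):
    """Simulate one experiment day by day; return the first day the
    two-sample z statistic crosses the critical value.

    Checking a fixed threshold daily ("peeking") inflates false
    positives; sequential testing exists to correct exactly that.
    """
    rng = rng or random.Random()
    control, treatment = [], []
    for day in range(1, max_days + 1):
        # Split the day's exposures 50/50 between the two groups.
        for _ in range(exposures_per_day // 2):
            control.append(rng.gauss(mean_c, sd))
            treatment.append(rng.gauss(mean_t, sd))
        n = len(control)
        se = (2 * sd ** 2 / n) ** 0.5  # std. error of the mean difference
        z = abs(statistics.fmean(treatment) - statistics.fmean(control)) / se
        if z > z_crit:
            return day
    return max_days

rng = random.Random(0)
runs = [days_to_stat_sig(5.0, 5.3, 2.0, 400, rng=rng) for _ in range(20)]
print(sorted(runs))  # the stopping day differs from run to run
```

Even with identical true means and standard deviations in every run, the stopping day varies, which is exactly the irreducible error described above.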
What does 'Threshold reached' mean?
If your experiment displays the message "Threshold reached" with "0 days left" in the duration estimator, the confidence interval does not contain the MDE (the threshold, in this instance). This isn't necessarily a bad result if you were running a do-no-harm experiment, since the effect size would be smaller than the allowed amount. Conversely, it's a bad sign if you were running a hypothesis test, because the effect size would be smaller than what you hoped for. It's recommended to end the experiment if this happens: even if you would eventually reach statistical significance, the lift would be smaller than what is practically significant, and you wouldn't have moved the metric as you were hoping to.
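Following the description above, the condition boils down to checking whether the confidence interval for the lift excludes the MDE. A minimal sketch, with a hypothetical helper name and sign conventions assumed:

```python
# Hypothetical helper: the "Threshold reached" condition holds when the
# confidence interval for the observed lift does not contain the MDE.
def threshold_reached(ci_low, ci_high, mde):
    return not (ci_low <= mde <= ci_high)

print(threshold_reached(0.01, 0.04, 0.05))  # True: observed lift below the MDE
print(threshold_reached(0.01, 0.08, 0.05))  # False: the MDE is still plausible
```

In the first case the entire interval sits below the MDE, so even a significant result would fall short of practical significance.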
What does 'Statistical significance may never reach' mean?
When the duration estimator shows 40 or more days to complete an experiment that has already been running for two weeks, Amplitude assumes it's not likely to reach statistical significance. In those cases, you will see this message.
July 4th, 2024
© 2024 Amplitude, Inc. All rights reserved. Amplitude is a registered trademark of Amplitude, Inc.