This article covers some frequently asked questions about sequential testing.
What is the statistical power of this approach?
Why hasn’t the p-value or confidence interval changed, even though the number of exposures is greater than 0?
For uniques, Experiment waits until there are at least 25 conversions and 100 exposures each for the treatment and control. Then it will start computing the p-values and confidence intervals. For average totals and sum of property, Experiment waits until it has at least 100 exposures each for the treatment and control.
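As a rough illustration of these thresholds, the sketch below shows the kind of minimum-data gate that has to pass before results are shown. The function name and arguments are hypothetical and invented for illustration; this is not Experiment's actual code or API.

```python
# Hypothetical illustration of the minimum-data gate described above;
# not Amplitude Experiment's actual implementation.
def ready_to_compute(metric_type: str,
                     exposures_control: int, exposures_treatment: int,
                     conversions_control: int = 0, conversions_treatment: int = 0) -> bool:
    """Return True once a metric has enough data to start showing p-values and CIs."""
    enough_exposures = min(exposures_control, exposures_treatment) >= 100
    if metric_type == "uniques":
        # Uniques also require at least 25 conversions in each arm.
        return enough_exposures and min(conversions_control, conversions_treatment) >= 25
    # Average totals and sum of property only require the exposure minimum.
    return enough_exposures
```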
Why don’t I see any confidence interval on the Confidence Interval Over Time chart?
What are we estimating when we choose Uniques?
What are we estimating when we choose Average Totals?
What are we estimating when we choose Average Sum of Property?
What is absolute lift?
What is relative lift?
Why does absolute lift exit the confidence interval?

Experiment’s approach is robust to symmetric time variation, which occurs when both the treatment and control maintain their absolute difference over time and their means vary in sync. Absolute lift can exit the confidence interval when there is a novelty effect or a drift in lift over time. If that happens, one option is to choose a different starting date (or date range) where the absolute lift is more stable and the allocation is static.

Sequential testing also allows for a flexible sample size. Because of this, whenever there is a large time delay between exposure and conversion for your test metrics, you should not stop the test before considering the impact of exposed users who have not yet had time to convert.
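To make the distinction concrete, here is a small, purely illustrative sketch contrasting symmetric time variation (both arms move together, so absolute lift is stable) with a novelty effect (lift decays over time). All of the daily rates and the decay shape are made-up numbers, not data from Experiment.

```python
import numpy as np

days = np.arange(14)
seasonality = 0.02 * np.sin(2 * np.pi * days / 7)   # weekly pattern shared by both arms

# Symmetric time variation: both arms move in sync, so absolute lift stays constant.
control_sym = 0.10 + seasonality
treatment_sym = 0.12 + seasonality
print("Symmetric variation, daily absolute lift:", np.round(treatment_sym - control_sym, 3))

# Novelty effect: the treatment's advantage decays, so absolute lift drifts over time
# and can wander outside a confidence interval computed from earlier data.
treatment_novelty = 0.10 + seasonality + 0.03 * np.exp(-days / 4)
print("Novelty effect, daily absolute lift:   ", np.round(treatment_novelty - control_sym, 3))
```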
How does sequential testing compare to a T-test?

Below is a visualization of p-values over time from a simulation we ran of 100 A/A tests for a particular configuration (alpha = 0.05, beta = 0.2). As we ran a T-test on the incoming data, we peeked at our results at regular intervals. Whenever the p-value fell below alpha, we stopped the test and concluded that it had reached statistical significance. You can see the p-values fluctuate quite a bit, even before the end of our test when we’ve reached 10,000 visitors. By peeking, we are inflating the number of false positives.

The table below summarizes the number of rejections for different configurations of our experiment when we run a T-test. Here, baseline is the conversion rate of our control variant, and delta_true is the absolute difference between the treatment and the control. Since this is an A/A test, there is no true difference. With alpha set to 0.05, the number of rejections far exceeds the threshold we set for our Type 1 error if we peek at our results: num_reject should never be higher than 5 in this example.

Now compare that to a sequential testing approach. Again, we have 100 A/A tests, and alpha is set to 0.05. We peek at our results at regular intervals, and if the p-value goes below alpha, we conclude that the test has reached statistical significance. With always-valid results, we can end our test any time the p-value goes below the threshold. Out of 100 trials with alpha = 0.05, only four fall below it, so Type 1 errors are still controlled.

The table below summarizes the number of rejections for different configurations of our experiment when we run a sequential test with mSPRT. Using the same basic configurations as before, the number of rejections (out of 100 trials) stays within our predetermined threshold of alpha = 0.05. With alpha set to 0.05, we know that only 5% of our experiments will yield false positives, as opposed to 30-50% when using a T-test. With sequential testing, we can confidently look at our results and conclude experiments at any time, without worrying about inflating false positives.
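The sketch below is a minimal, hypothetical version of the simulation described above: it runs 100 A/A tests, peeks at regular intervals, and counts false positives for a naive two-sample t-test versus a normal-mixture SPRT on the difference in means. The baseline conversion rate, peek interval, sample size, and mixture scale tau are illustrative assumptions, not Amplitude's actual configuration or implementation.

```python
# A/A peeking simulation (illustrative only): naive t-test vs. a normal-mixture SPRT.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ALPHA, BASELINE, N, PEEK_EVERY, TRIALS = 0.05, 0.10, 10_000, 500, 100


def peeking_t_test(control, treatment):
    """Return True if any interim two-sample t-test rejects (a false positive here)."""
    for n in range(PEEK_EVERY, N + 1, PEEK_EVERY):
        _, p = stats.ttest_ind(treatment[:n], control[:n])
        if p < ALPHA:
            return True
    return False


def msprt_test(control, treatment, tau=0.1):
    """Sequential test using a normal-mixture likelihood ratio on the mean difference.

    Uses the closed-form mixture statistic under a Gaussian approximation and
    rejects when the always-valid p-value (1 / Lambda_n) drops to alpha or below.
    """
    for n in range(PEEK_EVERY, N + 1, PEEK_EVERY):
        x, y = control[:n], treatment[:n]
        sigma2 = (x.var(ddof=1) + y.var(ddof=1)) / 2   # pooled per-user variance
        delta = y.mean() - x.mean()
        v = 2 * sigma2 / n                             # variance of the observed difference
        log_lambda = (0.5 * np.log(v / (v + tau**2))
                      + delta**2 * tau**2 / (2 * v * (v + tau**2)))
        if np.exp(log_lambda) >= 1 / ALPHA:            # always-valid p-value <= alpha
            return True
    return False


t_rejects = msprt_rejects = 0
for _ in range(TRIALS):
    # A/A test: both arms share the same conversion rate, so any rejection is a false positive.
    control = rng.binomial(1, BASELINE, N).astype(float)
    treatment = rng.binomial(1, BASELINE, N).astype(float)
    t_rejects += peeking_t_test(control, treatment)
    msprt_rejects += msprt_test(control, treatment)

print(f"Peeking t-test false positives: {t_rejects}/{TRIALS}")
print(f"mSPRT false positives:          {msprt_rejects}/{TRIALS}")
```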
Read about the T-test in this help center article, and more about the difference in testing options in this blog.