This article covers some frequently asked questions about sequential testing.
What is the statistical power of this approach?
Sequential testing allows for a flexible sample size. Because of this, whenever there is a large time delay between exposure and conversion for your test metrics, you shouldn't stop the test before considering the impact of exposed users who haven't yet had time to convert.
Why don't I see any confidence interval on the Confidence Interval Over Time chart?
For uniques, Experiment waits until there are at least 25 conversions and 100 exposures each for the treatment and control. Then it starts computing the p-values and confidence intervals. For average totals and sum of property, Experiment waits until it has at least 100 exposures each for the treatment and control.
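As a rough illustration, you can think of this as a gating check that runs before any p-value or confidence interval is computed. The sketch below is hypothetical and is not Experiment's implementation; the function name and metric-type labels are assumptions, and only the thresholds mirror the values described above.

```python
# Hypothetical sketch of the "wait before computing results" rule described above.
# Not Amplitude Experiment's implementation; it only mirrors the stated thresholds.

def ready_to_compute_results(
    metric_type: str,          # "uniques", "average_totals", or "sum_of_property" (assumed labels)
    control_exposures: int,
    treatment_exposures: int,
    control_conversions: int = 0,
    treatment_conversions: int = 0,
) -> bool:
    """Return True once both variants meet the minimum counts described above."""
    enough_exposures = control_exposures >= 100 and treatment_exposures >= 100
    if metric_type == "uniques":
        # Uniques also require at least 25 conversions in each variant.
        enough_conversions = control_conversions >= 25 and treatment_conversions >= 25
        return enough_exposures and enough_conversions
    # Average totals and sum of property only require the exposure minimum.
    return enough_exposures


# Example: a uniques metric with too few treatment conversions shows no interval yet.
print(ready_to_compute_results("uniques", 150, 140,
                               control_conversions=30, treatment_conversions=18))  # False
```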
Why does absolute lift exit the confidence interval?
This may happen if there is a novelty effect or a drift in lift over time. Experiment's approach is robust to symmetric time variation, which occurs when both the treatment and control maintain their absolute difference over time and their means vary in sync. One option is to choose a different starting date (or date range) where the absolute lift is more stable and the allocation is static.

How does sequential testing compare to a T-test?
Consider a simulation of 100 A/A tests for a particular configuration (alpha = 0.05, beta = 0.2), with each test's p-value tracked over time. With a T-test run on the data as it arrives, you can peek at the results at regular intervals, and whenever the p-value falls below alpha, stop the test and conclude that it has reached statistical significance. In this example the p-values fluctuate, even before the end of the test at 10,000 visitors, so peeking inflates the number of false positives. Tallying the number of rejections for different configurations of the experiment when running a T-test makes this concrete. Here, baseline is the conversion rate of the control variant, and delta_true is the absolute difference between the treatment and the control; since these are A/A tests, there is no true difference. With alpha set to 0.05, the number of rejections far exceeds the threshold set for Type 1 error when we peek at the results: num_reject should never be higher than 5 in this example.
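A small simulation makes the peeking problem concrete. The sketch below is illustrative rather than a reproduction of the simulation described above: it repeatedly runs Welch's T-test on accumulating A/A data (no true difference between variants) and stops the first time the p-value dips below alpha. The baseline rate, peek schedule, sample sizes, and random seed are assumptions.

```python
# Illustrative sketch: false-positive inflation when peeking at a T-test on A/A data.
# Assumed parameters (not from the article): 10,000 visitors per variant,
# a peek every 500 visitors, and a 10% baseline conversion rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_TRIALS, N_PER_ARM, PEEK_EVERY = 100, 10_000, 500
BASELINE, ALPHA = 0.10, 0.05

def one_aa_test_with_peeking() -> bool:
    """Return True if any peek (falsely) reaches significance in an A/A test."""
    control = (rng.random(N_PER_ARM) < BASELINE).astype(float)
    treatment = (rng.random(N_PER_ARM) < BASELINE).astype(float)  # same rate: no true lift
    for n in range(PEEK_EVERY, N_PER_ARM + 1, PEEK_EVERY):
        _, p = stats.ttest_ind(treatment[:n], control[:n], equal_var=False)
        if p < ALPHA:
            return True  # stop early and (wrongly) declare significance
    return False

num_reject = sum(one_aa_test_with_peeking() for _ in range(N_TRIALS))
print(f"False positives in {N_TRIALS} A/A tests with peeking: {num_reject}")
# Typically well above the ~5 you would expect at alpha = 0.05 without peeking.
```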
Now compare that to a sequential testing approach. Again, there are 100 A/A tests, and alpha is set to 0.05. You peek at the results at regular intervals, and if the p-value goes below alpha, you conclude that the test has reached statistical significance. Because sequential testing produces always-valid results, you can end the test any time the p-value goes below the threshold, and the number of false positives stays below alpha: out of 100 trials with alpha = 0.05, only four fall below the threshold, so Type 1 errors are still controlled. Tallying the number of rejections for different configurations of the experiment when running a sequential test with mSPRT, using the same basic configurations as before, the number of rejections (out of 100 trials) stays within the predetermined threshold of alpha = 0.05. In other words, with alpha set to 0.05, at most 5% of experiments yield false positives, as opposed to 30-50% when using a T-test. With sequential testing, you can confidently review the results and conclude experiments at any time, without worrying about inflating false positives.
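For readers who want to see what an always-valid p-value looks like mechanically, here is a minimal sketch of a mixture sequential probability ratio test (mSPRT) for the difference in means of two streams. It assumes roughly normal data with a known observation variance and a normal mixing distribution; the mixing variance, function name, and example parameters are assumptions for illustration, and this is not Amplitude Experiment's production implementation.

```python
# Minimal mSPRT-style always-valid p-value for the difference in means of two streams.
# Assumptions: known observation variance, N(0, TAU_SQ) mixing distribution, null of zero lift.
import numpy as np

TAU_SQ = 0.0001   # mixing variance (tuning parameter, assumed)
ALPHA = 0.05

def always_valid_p_values(control: np.ndarray, treatment: np.ndarray, sigma_sq: float) -> np.ndarray:
    """Return the running always-valid p-value after each paired observation."""
    n = np.arange(1, len(control) + 1)
    diff = np.cumsum(treatment) / n - np.cumsum(control) / n   # running difference in means
    v = 2 * sigma_sq / n                                       # variance of that difference
    # Mixture likelihood ratio against H0: difference = 0, with N(0, TAU_SQ) mixing.
    lam = np.sqrt(v / (v + TAU_SQ)) * np.exp(diff**2 * TAU_SQ / (2 * v * (v + TAU_SQ)))
    # The always-valid p-value is the running minimum of 1 / likelihood ratio, capped at 1.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

# Example A/A stream: both arms share a 10% conversion rate, so there is no true lift.
rng = np.random.default_rng(1)
baseline = 0.10
control = (rng.random(10_000) < baseline).astype(float)
treatment = (rng.random(10_000) < baseline).astype(float)
p = always_valid_p_values(control, treatment, sigma_sq=baseline * (1 - baseline))
print("Ever significant while peeking continuously:", bool((p < ALPHA).any()))
```

Because the p-value can only decrease, you can check it after every new observation; across many A/A runs it falls below alpha in at most roughly alpha of them, which is the property the comparison above relies on.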



