Sequential testing

This article covers some frequently asked questions about sequential testing.

What is the statistical power of this approach?

Given enough time, the statistical power of Amplitude's sequential testing method is 1. In other words, if there is a true effect to detect, this approach eventually detects it.

Why don’t I see any confidence interval on the Confidence Interval Over Time chart?

This is because the thresholds haven’t been met yet.

For uniques, Experiment waits until there are at least 25 conversions and 100 exposures each for the treatment and control. Then it starts computing the p-values and confidence intervals.

For average totals and sum of property, Experiment waits until it has at least 100 exposures each for the treatment and control.
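The thresholds above can be expressed as a small check. This is a minimal sketch for illustration only: the function name, the `VariantCounts` type, and the metric-type strings are hypothetical and aren't part of Experiment's API.

```python
from dataclasses import dataclass

# Thresholds taken from the text above; the constant names are illustrative.
MIN_EXPOSURES = 100
MIN_CONVERSIONS = 25  # applies to uniques metrics only


@dataclass
class VariantCounts:
    exposures: int
    conversions: int = 0  # only checked for uniques metrics


def thresholds_met(metric_type: str, control: VariantCounts, treatment: VariantCounts) -> bool:
    """Return True once both variants have enough data for p-values and intervals."""
    for variant in (control, treatment):
        if variant.exposures < MIN_EXPOSURES:
            return False
        if metric_type == "uniques" and variant.conversions < MIN_CONVERSIONS:
            return False
    return True


if __name__ == "__main__":
    # Treatment has enough exposures but only 24 conversions, so no interval yet.
    print(thresholds_met("uniques", VariantCounts(120, 30), VariantCounts(110, 24)))
```

Both variants have to clear the thresholds, so a single under-sampled variant is enough to keep the chart empty.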

Why does absolute lift exit the confidence interval?

Occasionally, the absolute lift may exit the confidence interval, which can cause the confidence bounds to flip. This happens when the parameter you're estimating (in other words, the absolute lift) changes over time and the allocation between your treatment and control has also changed. The statistical model Experiment uses assumes that neither the absolute lift nor the variant allocation changes over time.

Graph showing the absolute lift and variant allocation remaining constant over time.

The good thing about Experiment's approach is that it's robust to symmetric time variation, which occurs when the means of the treatment and control vary in sync, so their absolute difference stays the same over time.

If the absolute lift does exit the confidence interval, one option is to choose a different starting date (or date range) in which the absolute lift is more stable and the allocation is static.

This may also happen if there is a novelty effect or a drift in lift over time. Sequential testing allows for a flexible sample size. Because of this, whenever there is a long delay between exposure and conversion for your test metrics, you shouldn't stop the test before considering the impact of exposed users who haven't yet had time to convert. To do this, you could:

Compare the average time to convert for each variant using a funnel chart
Adjust the date range when analyzing the experiment results to include users who were exposed but converted after you stopped the test

How does sequential testing compare to a T-test?

Sequential testing lets you look at the results whenever you like. Fixed-horizon tests such as T-tests, on the other hand, can give you inflated false positives if you peek while your experiment is running.

Below is a visualization of p-values over time in a simulation run of 100 A/A tests for a particular configuration (alpha=0.05, beta=0.2). When you run a T-test on data as it comes in, you can peek at the results at regular intervals. Whenever the p-value falls below alpha, you stop the test and conclude that it has reached statistical significance.

P-values over time where 100 A/A tests were run for a consistent time frame. Displays statistical significance.

In this example, the p-values fluctuate throughout the test, well before it ends at 10,000 visitors. By peeking, you inflate the number of false positives. The table below summarizes the number of rejections for different configurations of the experiment when running a T-test.

Table describing fluctuations in p-values, with num_reject values ranging from 38 to 59 across configurations.

Here, baseline is the conversion rate of the control variant, and delta_true is the absolute difference between the treatment and the control. Since this is an A/A test, there is no difference, so every rejection is a false positive. With alpha set to 0.05, you would expect num_reject to be about 5 out of 100 in this example. When you peek at the results, the number of rejections far exceeds that Type 1 error threshold.
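The effect of peeking can be reproduced with a short simulation. This is a minimal sketch, not the simulation behind the table above. It assumes a pooled two-proportion z-test as a stand-in for the T-test, a baseline of 0.1, and a peek every 500 visitors per variant, and all function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation observed yet, so nothing to reject
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_aa_test(rng, baseline=0.1, n_max=10_000, peek_every=500, alpha=0.05):
    """Simulate one A/A test; return (rejected_at_any_peek, rejected_at_end).

    n_max should be a multiple of peek_every so the last peek is the final analysis.
    """
    conv_a = conv_b = 0
    rejected_at_any_peek = False
    p = 1.0
    for n in range(1, n_max + 1):
        # Both arms share the same true conversion rate (A/A test).
        conv_a += rng.random() < baseline
        conv_b += rng.random() < baseline
        if n % peek_every == 0:
            p = z_test_p_value(conv_a, conv_b, n)
            if p < alpha:
                rejected_at_any_peek = True
    return rejected_at_any_peek, p < alpha


def count_rejections(n_tests=100, seed=0, **kwargs):
    """Count false positives with peeking and with a single final look."""
    rng = random.Random(seed)
    peeking = at_end = 0
    for _ in range(n_tests):
        rejected_peeking, rejected_end = run_aa_test(rng, **kwargs)
        peeking += rejected_peeking
        at_end += rejected_end
    return peeking, at_end


if __name__ == "__main__":
    peeking, at_end = count_rejections()
    print(f"rejections when peeking: {peeking} / 100")
    print(f"rejections at the end only: {at_end} / 100")
```

Because the final analysis is also one of the peeks, the peeking count can never be lower than the end-only count. The gap between the two is the inflation caused by stopping at the first significant peek.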

Now compare that to a sequential testing approach. Again, there are 100 A/A tests, and alpha is set to 0.05. You peek at the results at regular intervals, and if the p-value goes below alpha, you conclude that the test has reached statistical significance. With this statistical method, the number of false positives stays below the threshold:

Graph of a sequential testing approach. Describes the results of 100 A/A tests over time. When a p-value goes below alpha, the test reaches statistical significance.

With always-valid results, you can end the test any time the p-value goes below the threshold. Of the 100 trials with alpha = 0.05, four fall below the threshold, so Type 1 errors are still controlled.

The table below summarizes the number of rejections for different configurations of the experiment when we run a sequential test with mSPRT:

Table describing the number of rejections for different configurations.

Using the same basic configurations as before, the number of rejections (out of 100 trials) stays within the predetermined threshold of alpha = 0.05. With alpha set to 0.05, you can expect at most about 5% of A/A experiments to yield false positives, as opposed to 30-50% when peeking with a T-test. With sequential testing, you can confidently review the results and conclude experiments at any time, without worrying about inflating false positives.
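To make the idea of an always-valid p-value concrete, here is a minimal sketch of a textbook mSPRT for the difference between two variants, using a normal mixing distribution. It isn't Amplitude's implementation. The mixing variance `tau2`, the plug-in variance estimate, the peek schedule, and all function names are assumptions chosen for illustration.

```python
import math
import random


def msprt_log_lr(mean_diff: float, n: int, sigma2: float, tau2: float) -> float:
    """Log mixture likelihood ratio against H0: difference = 0, with n users per arm.

    sigma2 is the per-user variance within one arm; tau2 is the variance of the
    normal mixing distribution over the true difference (an assumed tuning value).
    """
    v = 2 * sigma2  # variance of a single treatment-minus-control difference
    shrink = 0.5 * math.log(v / (v + n * tau2))
    evidence = (n * n * tau2 * mean_diff ** 2) / (2 * v * (v + n * tau2))
    return shrink + evidence


def update_p_value(prev_p: float, mean_diff: float, n: int, sigma2: float, tau2: float) -> float:
    """Always-valid p-value: the running minimum of 1 / likelihood ratio."""
    return min(prev_p, math.exp(-msprt_log_lr(mean_diff, n, sigma2, tau2)))


def run_sequential_aa_test(rng, baseline=0.1, n_max=10_000, peek_every=500,
                           alpha=0.05, tau2=1e-4) -> bool:
    """Simulate one A/A test; return True if the p-value ever drops below alpha."""
    conv_a = conv_b = 0
    p = 1.0
    for n in range(1, n_max + 1):
        conv_a += rng.random() < baseline
        conv_b += rng.random() < baseline
        if n % peek_every == 0:
            pooled = (conv_a + conv_b) / (2 * n)
            sigma2 = pooled * (1 - pooled)  # plug-in estimate (an approximation)
            if sigma2 > 0:
                p = update_p_value(p, (conv_b - conv_a) / n, n, sigma2, tau2)
    return p < alpha


def count_sequential_rejections(n_tests=100, seed=0, **kwargs) -> int:
    """Count false positives across simulated A/A tests."""
    rng = random.Random(seed)
    return sum(run_sequential_aa_test(rng, **kwargs) for _ in range(n_tests))


if __name__ == "__main__":
    print(f"rejections with mSPRT: {count_sequential_rejections()} / 100")
```

The p-value only ever moves down, so stopping at the first peek where it crosses alpha gives the same decision as checking it at every peek. The trade-off is that, for the same data, this p-value is typically larger than a fixed-horizon one.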

Note

Read about the T-test in this help center article, and read more about the differences between the testing options in this blog post.