
Experiment uses a sequential testing method of statistical inference. With sequential testing, results stay valid whenever you view them. You can end an experiment early based on observations made to that point. The number of observations you need to make an informed decision is, on average, much lower than the number you need with [T-tests](/docs/feature-experiment/experiment-theory/analyze-with-t-test) or similar procedures. You can experiment rapidly, incorporate what you learn into your product, and accelerate the pace of your experimentation program.

Sequential testing has several advantages over T-tests. Primarily, you don't need to know the number of observations necessary to achieve significance before you start the experiment. You can use both sequential testing and T-tests for binary metrics and continuous metrics. If you have concerns about long-tailed distributions affecting the Central Limit Theorem assumption, refer to this article about [outliers](/docs/feature-experiment/advanced-techniques/winsorization-in-experiment).

Given enough time, the statistical power of the sequential testing method is 1. If there is a nonzero effect to detect, this approach eventually detects it.

This article explains the basics of sequential testing, how it fits into Amplitude Experiment, and how to make it work for you.

## Hypothesis testing in Amplitude Experiment

When you run an A/B test, Experiment conducts a hypothesis test using a randomized controlled trial. In this trial, Amplitude randomly assigns users to either a treatment variant or the control. The control represents your product in its current state. Each treatment includes a set of potential changes to your current baseline product. With a predetermined metric, Experiment compares the performance of the treatment and control populations using a test statistic.

In a hypothesis test, you look for performance differences between the control and your treatment variants. Amplitude Experiment tests the null hypothesis

$$
H_0:\ \delta = 0
$$

where

$$
\delta = \mu_{\text{treatment}} - \mu_{\text{control}}
$$

The null hypothesis states that there's no difference between the treatment's mean and the control's mean.

For example, suppose you want to measure the conversion rate of a treatment variant. The null hypothesis posits that the conversion rates of your treatment variant and your control are the same.

The alternative hypothesis states that there is a difference between the treatment and control. Experiment's statistical model uses sequential testing to look for any difference between treatments and control.
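In the same notation as the null hypothesis, a two-sided alternative ("any difference") can be written as follows. The symbol $H_1$ is an assumed label for the alternative and doesn't appear elsewhere in this article.

$$
H_1:\ \delta \neq 0
$$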

There are many different sequential testing options. Amplitude Experiment uses a family of sequential tests called the mixture sequential probability ratio test (mSPRT). The mSPRT averages the likelihood ratio against the null hypothesis over a weight function, H, which is called the mixing distribution.
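For reference, the general textbook form of the mixture likelihood ratio is sketched below. The symbols $f_\theta$ (the density of an observation under parameter $\theta$), $\theta_0$ (the parameter value under the null hypothesis), and $X_i$ (the observations) are assumptions borrowed from the general mSPRT literature. This isn't a statement of the exact expression Experiment evaluates.

$$
\Lambda_n^{H} = \int \prod_{i=1}^{n} \frac{f_{\theta}(X_i)}{f_{\theta_0}(X_i)} \, dH(\theta)
$$

In the standard formulation, the test rejects the null hypothesis the first time $\Lambda_n^{H} \geq 1/\alpha$.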

## Common questions

{% accordion title="Why hasn't the p-value or confidence interval changed, even though the number of exposures is greater than 0?" %}
For uniques, Amplitude Experiment doesn't compute p-values and confidence intervals until there are at least 25 conversions and 100 exposures each for both the treatment and control.

For average totals and sum of property, Experiment waits until there are at least 100 exposures each for the treatment and control.
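
The thresholds above can be sketched as a simple check. This is a minimal illustration, not Experiment's code: the function name, the metric type labels, and the dictionary keys are all hypothetical.

```python
MIN_EXPOSURES = 100   # required for every metric type
MIN_CONVERSIONS = 25  # additionally required for uniques


def thresholds_met(metric_type, variants):
    """Return True when every variant has enough data for p-values and CIs.

    metric_type: "uniques", "average_totals", or "sum_of_property"
                 (hypothetical labels for this sketch).
    variants: list of dicts with "exposures" and "conversions" counts,
              one per variant (control and each treatment).
    """
    for variant in variants:
        if variant["exposures"] < MIN_EXPOSURES:
            return False
        if metric_type == "uniques" and variant["conversions"] < MIN_CONVERSIONS:
            return False
    return True
```

For example, a uniques metric with 100 exposures but only 24 conversions in the control fails the check, even if the treatment already passes both thresholds.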
{% /accordion %}

{% accordion title="Why don't I see any confidence interval on the Confidence Interval Over Time chart?" %}
The thresholds haven't yet been reached.

For uniques, Experiment waits until there are at least 25 conversions and 100 exposures each for the treatment and control. After those thresholds, Experiment starts computing the p-values and confidence intervals.

For average totals and sum of property, Experiment waits until there are at least 100 exposures each for the treatment and control.
{% /accordion %}

{% accordion title="What are we estimating when we choose Uniques?" %}
Uniques measures whether your visitors fired a specific event. The result is the proportion of the population that has taken this action. Uniques is a comparison of proportions, or the conversion rates between treatment and control.
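
As a rough sketch of the proportion being estimated, the following computes a conversion rate for one variant. The event dictionary shape (`user_id`, `event`) and the function name are assumptions made for illustration.

```python
def unique_conversion_rate(exposed_users, events, event_name):
    """Share of exposed users who fired event_name at least once."""
    converted = {e["user_id"] for e in events if e["event"] == event_name}
    return len(converted & set(exposed_users)) / len(exposed_users)


# Hypothetical sample data: user "a" purchases twice but counts once.
events = [
    {"user_id": "a", "event": "purchase"},
    {"user_id": "a", "event": "purchase"},
    {"user_id": "b", "event": "purchase"},
    {"user_id": "c", "event": "view"},
]
rate = unique_conversion_rate(["a", "b", "c", "d"], events, "purchase")
```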
{% /accordion %}

{% accordion title="What are we estimating when we choose Average Totals?" %}
Average Totals counts the average number of times a visitor has fired an event. For each visitor, Experiment counts the number of times they took the action you're interested in, then averages that across the sample within both the control and treatment. The result is a comparison of the average totals between the treatment and control.
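
A minimal sketch of this calculation for one variant follows. The event dictionary shape and the function name are assumptions made for illustration. Exposed users who never fired the event count as zero.

```python
from collections import Counter


def average_totals(exposed_users, events, event_name):
    """Average number of times each exposed user fired event_name."""
    counts = Counter(e["user_id"] for e in events if e["event"] == event_name)
    # Counter returns 0 for users with no matching events.
    return sum(counts[user] for user in exposed_users) / len(exposed_users)


# Hypothetical sample data: "a" fires the event twice, "b" once.
events = [
    {"user_id": "a", "event": "purchase"},
    {"user_id": "a", "event": "purchase"},
    {"user_id": "b", "event": "purchase"},
]
avg = average_totals(["a", "b", "c", "d"], events, "purchase")
```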
{% /accordion %}

{% accordion title="What are we estimating when we choose Average Sum of Property?" %}
Average Sum of Property sums the values of a specific event property for each user. For example, if you want the total cart value of a user over the whole analysis period, pick the event "add to cart" with the property "cart value." The result of this specific example is a comparison of the average cart value between treatment and control.
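
The cart value example can be sketched as follows for one variant. The event dictionary shape, the `properties` key, and the function name are assumptions made for illustration.

```python
def average_sum_of_property(exposed_users, events, event_name, prop):
    """Sum prop per user over matching events, then average across users."""
    totals = {}
    for e in events:
        if e["event"] == event_name:
            totals[e["user_id"]] = totals.get(e["user_id"], 0.0) + e["properties"][prop]
    # Users without a matching event contribute 0 to the average.
    return sum(totals.get(user, 0.0) for user in exposed_users) / len(exposed_users)


# Hypothetical sample data: "a" adds 10.0 and 5.0 to the cart, "b" adds 20.0.
events = [
    {"user_id": "a", "event": "add to cart", "properties": {"cart value": 10.0}},
    {"user_id": "a", "event": "add to cart", "properties": {"cart value": 5.0}},
    {"user_id": "b", "event": "add to cart", "properties": {"cart value": 20.0}},
]
avg_cart = average_sum_of_property(["a", "b", "c", "d"], events, "add to cart", "cart value")
```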
{% /accordion %}

{% accordion title="What is absolute lift?" %}
Absolute lift is the raw difference between the treatment mean and the control mean.
{% /accordion %}

{% accordion title="What is relative lift?" %}
Relative lift is the absolute lift scaled by the mean of the control. Some people find this value useful to determine the relative change a treatment has with respect to the baseline.
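
The relationship between the two lifts can be sketched as follows. This assumes absolute lift is the treatment mean minus the control mean, matching the definition of delta earlier in this article. The function names are hypothetical.

```python
def absolute_lift(treatment_mean, control_mean):
    """Difference between the treatment mean and the control mean."""
    return treatment_mean - control_mean


def relative_lift(treatment_mean, control_mean):
    """Absolute lift scaled by the control mean."""
    if control_mean == 0:
        raise ValueError("relative lift is undefined when the control mean is 0")
    return absolute_lift(treatment_mean, control_mean) / control_mean
```

For example, with a control conversion rate of 0.10 and a treatment conversion rate of 0.12, the absolute lift is about 0.02 and the relative lift is about 0.2, or 20%.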
{% /accordion %}

{% accordion title="Why does absolute lift exit the confidence interval?" %}
Occasionally, the absolute lift exits the confidence interval, which can cause confidence bounds to flip. This happens when the parameter you're estimating (for example, absolute lift) changes over time or the allocation for your treatment and control has changed. The underlying assumption in Experiment's statistical model is that the absolute lift and variant allocation don't change over time.

![Chart showing uniques with confidence level over time. The confidence bounds flip midway through the chart.](/images/faq/image6-png.png)

Experiment's approach incorporates symmetric time variation, which occurs when both the treatment and control maintain their absolute difference over time and their means vary in sync.

One option is to choose a different starting date (or date range) where the absolute lift is more stable and the allocation is static.

This pattern may also occur when there is a novelty effect or a drift in lift over time. Sequential testing allows for a flexible sample size. When there is a large time delay between exposure and conversion for your test metrics, don't stop the test before considering the impact of exposed users who haven't yet had time to convert. To address this:

- Compare the average time to convert for each variant using a funnel chart.
- Adjust the date range when analyzing the experiment results to include users who were exposed but converted after you stopped the test.
{% /accordion %}

{% accordion title="How does sequential testing compare to a T-test?" %}
Sequential testing lets you look at the results whenever you like. In contrast, fixed-horizon tests such as T-tests can inflate the false positive rate if you review and act on results while your experiment is still running.

The following visualization shows p-values over time in a simulation of 100 A/A tests for a particular configuration (alpha=0.05, beta=0.2). In the simulation, a T-test ran on incoming data and the results were reviewed at regular intervals. When the p-value falls below alpha, the test stops and you conclude that it has reached statistical significance.

![A visualization of p-values over time in a simulation of 100 A/A tests for a configuration where alpha=0.05 and beta=0.2.](/images/faq/image7-png.png)

In this example, the p-values fluctuate, even before the end of the test when it reaches 10,000 visitors. By reviewing results early, you inflate the number of false positives. The following table summarizes the number of rejections recorded for different configurations of the experiment when a T-test runs.

|     | alpha | beta | baseline | delta_true | num_reject |
| --- | ----- | ---- | -------- | ---------- | ---------- |
| 0   | 0.05  | 0.2  | 0.01     | 0.0        | 0          |
| 1   | 0.05  | 0.2  | 0.05     | 0.0        | 0          |
| 2   | 0.05  | 0.2  | 0.10     | 0.0        | 1          |
| 3   | 0.05  | 0.2  | 0.20     | 0.0        | 0          |

In the table, baseline is the conversion rate of the control variant, and delta_true is the absolute difference between the treatment and the control. Because this is an A/A test, there is no difference. With alpha set to 0.05, num_reject should be about 5 or fewer out of 100 trials. When you peek at the results, the number of rejections far exceeds this threshold for Type 1 error.
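
The effect of peeking can be explored with a small simulation. This is a sketch under stated assumptions, not the simulation behind the chart or table: it uses a two-proportion z-test as a large-sample stand-in for the T-test, and the sample sizes, peek interval, and function names are hypothetical.

```python
import math
import random


def two_sided_p_value(conv_c, n_c, conv_t, n_t):
    """Two-proportion z-test p-value (normal approximation)."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0
    z = (conv_t / n_t - conv_c / n_c) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(num_tests=100, visitors_per_arm=2000, peek_every=100,
                      baseline=0.10, alpha=0.05, seed=0):
    """Count A/A tests rejected at any peek versus at the final look only.

    visitors_per_arm should be a multiple of peek_every so that the last
    peek coincides with the end of the test.
    """
    rng = random.Random(seed)
    rejected_with_peeking = 0
    rejected_at_end_only = 0
    for _ in range(num_tests):
        conv_c = conv_t = 0
        below_alpha_at_some_peek = False
        p_value = 1.0
        for n in range(1, visitors_per_arm + 1):
            # Both arms share the same true conversion rate (A/A test).
            conv_c += rng.random() < baseline
            conv_t += rng.random() < baseline
            if n % peek_every == 0:
                p_value = two_sided_p_value(conv_c, n, conv_t, n)
                if p_value < alpha:
                    below_alpha_at_some_peek = True
        rejected_with_peeking += below_alpha_at_some_peek
        rejected_at_end_only += p_value < alpha
    return rejected_with_peeking, rejected_at_end_only


peeking, end_only = simulate_aa_tests()
```

Because the final look is also one of the peeks, the peeking count can never be lower than the end-only count. Any gap between the two is the inflation caused by stopping at the first significant peek.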

Compare that to a sequential testing approach. In this example, there are again 100 A/A tests, and alpha is set to 0.05. If you peek at your results at regular intervals and the p-value goes below alpha, you can conclude that the test has reached statistical significance. Because the p-values are always valid, the number of false positives stays below the threshold set by alpha, even with peeking.

![Sequential testing approach showing peeking at results before the end of the experiment, with the test concluding it has reached statistical significance.](/images/faq/image5-png.png)

With always-valid results, you can end your test any time the p-value goes below the threshold. From 100 trials where alpha = 0.05, the number that fall below that threshold is 4, so Type 1 errors stay controlled.
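
To illustrate how an always-valid p-value can be computed, here is a minimal mSPRT sketch. It isn't Experiment's implementation. It assumes paired streams of control and treatment observations, a known common variance `sigma_sq`, and a normal mixing distribution centered on zero with variance `tau_sq`. These names and modeling choices are assumptions made for this example.

```python
import math


def msprt_p_values(control, treatment, sigma_sq, tau_sq):
    """Always-valid p-values for H0: delta = 0, one per paired observation."""
    p_value = 1.0
    sum_c = sum_t = 0.0
    history = []
    for n, (x, y) in enumerate(zip(control, treatment), start=1):
        sum_c += x
        sum_t += y
        diff = (sum_t - sum_c) / n  # running estimate of delta
        denom = 2 * sigma_sq + n * tau_sq
        # Log of the mixture likelihood ratio under the normal mixing prior.
        log_lr = (0.5 * math.log(2 * sigma_sq / denom)
                  + (n * n * tau_sq * diff * diff) / (4 * sigma_sq * denom))
        # p_n = min(p_{n-1}, 1 / Lambda_n), capped at 1; never increases.
        p_value = min(p_value, math.exp(min(0.0, -log_lr)))
        history.append(p_value)
    return history
```

Because each p-value is the running minimum of the inverse likelihood ratio, it can only decrease as data arrives. That property is what makes it safe to stop as soon as the p-value drops below alpha.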

The following table summarizes the number of rejections for different configurations of the experiment when you run a sequential test with mSPRT:

|     | alpha | beta | baseline | delta_true | num_reject |
| --- | ----- | ---- | -------- | ---------- | ---------- |
| 0   | 0.05  | 0.2  | 0.01     | 0.0        | 0          |
| 1   | 0.05  | 0.2  | 0.05     | 0.0        | 0          |
| 2   | 0.05  | 0.2  | 0.10     | 0.0        | 1          |
| 3   | 0.05  | 0.2  | 0.20     | 0.0        | 0          |

Using the same basic configurations as before, the number of rejections (out of 100 trials) stays within the predetermined threshold of alpha = 0.05. With alpha set to 0.05, no more than about 5% of the experiments yield false positives, compared to 30-50% when using a T-test with peeking.
{% /accordion %}
