Bonferroni Correction

In an experiment, think of each variant or metric you include as its own hypothesis. For example, when you add a new variant, you propose the hypothesis that the changes in that variant produce a detectable impact on the experiment's results.

The simplest experiments have only a single hypothesis. Single-hypothesis tests can yield valuable insights. However, including more than one metric or variant, or multiple hypotheses, is often more efficient or enlightening.

Multiple hypothesis testing can introduce errors into your calculations of statistical significance through the multiple comparisons problem. The probability of making an error increases with the number of hypothesis tests you run.

Problems with multiple hypothesis testing

For example, imagine you run an experiment on the color of your site's "Buy now" button. The default setting for the site is blue, which makes blue the control. You also want to test green (variant #1) and purple (variant #2), so you run two hypothesis tests, one for each variant against the control. If your false positive rate is 0.05 (5%) for each individual hypothesis test, the probability of finding at least one statistically significant result when the null hypothesis is true for both variants is:

1 - 0.95^2 = 0.0975

This calculation assumes the tests are independent.
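The same arithmetic generalizes to any number of independent tests. The following is a minimal Python sketch, not part of Amplitude; the function name `family_wise_error_rate` is made up for illustration, and the formula only holds under the independence assumption stated above.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    # Each test avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


# Two variants against the control, as in the button color example.
print(round(family_wise_error_rate(0.05, 2), 4))  # 0.0975
```

Calling the function with larger values of `num_tests` shows how quickly the combined risk grows as you add variants or metrics.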

If you run enough tests, you are almost certain to get at least one statistically significant result, even when no real effect exists. With a 0.05 false positive rate, expect about one out of every 20 hypothesis tests with a true null hypothesis to show statistical significance by random chance alone.
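A small simulation can make the "one out of every 20" intuition concrete. This sketch is an illustration only and doesn't reflect how Amplitude computes anything. It assumes that every null hypothesis is true, that the tests are independent, and that p-values under the null are uniformly distributed between 0 and 1. The function name and parameters are hypothetical.

```python
import random


def simulate_null_experiments(num_tests, alpha=0.05, trials=20_000, seed=0):
    """Simulate experiments in which no variant has any real effect.

    Returns (average number of false positives per experiment,
             share of experiments with at least one false positive).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_significant = 0
    experiments_with_any = 0
    for _ in range(trials):
        # Under a true null hypothesis, a p-value is modeled as Uniform(0, 1).
        significant = sum(rng.random() < alpha for _ in range(num_tests))
        total_significant += significant
        experiments_with_any += significant > 0
    return total_significant / trials, experiments_with_any / trials


mean_false_positives, share_with_any = simulate_null_experiments(num_tests=20)
print(mean_false_positives)  # close to 1, that is, about one test in 20
print(share_with_any)        # close to 1 - 0.95**20, roughly 0.64
```

With 20 tests per experiment, the average number of spurious significant results comes out near one, and roughly two out of three experiments report at least one.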

Multiple hypothesis correction asks: is this statistically significant result due to chance, or is it genuine?

False positives

A false positive rate is the ratio between:

the number of negative events falsely described as positive, and
the total number of actual negative events.
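As a quick illustration of this ratio, the sketch below computes a false positive rate from two counts. The function name and the example counts are hypothetical and chosen only to show the arithmetic.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of actual negative events that were falsely described as positive."""
    # Every actual negative is reported either as positive (a false positive)
    # or as negative (a true negative).
    actual_negatives = false_positives + true_negatives
    if actual_negatives == 0:
        raise ValueError("at least one actual negative event is required")
    return false_positives / actual_negatives


# Hypothetical counts: 5 of 100 actual negatives were flagged as positive.
print(false_positive_rate(false_positives=5, true_negatives=95))  # 0.05
```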

Every experiment carries the risk of a false positive result. A false positive occurs when an experiment reports a conclusive result in either direction even though no real difference exists between variations.

The risk of a false positive result increases with each metric or variant you add to your experiment. This risk increases even though the false positive rate stays the same for each individual metric or variant.

Statistical tools compensate and correct for the multiple comparisons problem. Amplitude uses the Bonferroni correction.

Amplitude enables the Bonferroni correction by default. You can manually toggle it off in your statistical settings.

Bonferroni correction

The Bonferroni correction is the simplest statistical method for counteracting the multiple comparisons problem. It's also one of the more conservative methods, and carries a greater risk of false negatives than other techniques. One reason is that the Bonferroni correction doesn't consider the distribution of p-values across all comparisons, which could be uniform if the null hypothesis is true for all hypotheses.

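The Bonferroni correction controls the family-wise error rate, which is the probability of at least one false positive across all the hypothesis tests in the experiment, and applies to the confidence interval. In the button color example above, dividing the 0.05 false positive rate by 2 gives a per-test threshold of 0.025. If the tests are independent, the probability of at least one false positive is then 1 - 0.975^2 ≈ 0.0494, which is below the 0.05 target. The family-wise error rate stays controlled.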

The proof follows from Boole's inequality.
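The argument can be written out in a few lines. The notation here is introduced only for this sketch: m is the number of hypothesis tests, α is the target family-wise error rate, and A_i is the event that test i produces a false positive when each test uses the threshold α/m.

```latex
\[
\mathrm{FWER}
  = P\!\left(\bigcup_{i=1}^{m} A_i\right)
  \le \sum_{i=1}^{m} P(A_i)
  \le \sum_{i=1}^{m} \frac{\alpha}{m}
  = \alpha
\]
```

The first inequality is Boole's inequality. It holds whether or not the tests are independent, so the bound doesn't rely on the independence assumption used in the earlier calculation.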

Mathematically, the Bonferroni correction divides the false positive rate by the number of hypothesis tests you run. This is the same as multiplying each p-value by the number of hypothesis tests (capping the result at 1) and comparing it against the original false positive rate.
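Both views of the correction can be sketched in a few lines of Python. This is a generic illustration of the textbook method, not Amplitude's implementation. The function name `bonferroni` and the example p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni correction to a list of p-values.

    Returns (corrected threshold, corrected p-values, significance flags).
    """
    num_tests = len(p_values)
    if num_tests == 0:
        raise ValueError("at least one p-value is required")
    # View 1: shrink the threshold that each test must meet.
    corrected_alpha = alpha / num_tests
    # View 2: inflate each p-value instead, capping at 1 because a
    # probability can't exceed 1.
    corrected_p_values = [min(1.0, p * num_tests) for p in p_values]
    # A test is significant when its raw p-value meets the corrected threshold.
    significant = [p <= corrected_alpha for p in p_values]
    return corrected_alpha, corrected_p_values, significant


# Hypothetical p-values for the green and purple variants.
threshold, corrected, flags = bonferroni([0.01, 0.04], alpha=0.05)
print(threshold)  # 0.025
print(corrected)  # [0.02, 0.08]
print(flags)      # [True, False]
```

In this example the second p-value, 0.04, would pass an uncorrected 0.05 threshold but no longer counts as significant after the correction.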

Amplitude Experiment performs Bonferroni corrections on both the number of treatments and the number of primary and secondary metrics:

Bonferroni applies to the primary metric when more than one treatment exists.
Bonferroni applies to the secondary metric when multiple secondary metrics or multiple treatments exist.
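The two rules above can be read as a simple decision function. The sketch below is a literal restatement of those rules for illustration and is not Amplitude's code. The function name, the parameter names, and the "primary" and "secondary" labels are assumptions. It only reports whether a correction applies, not how many comparisons Amplitude divides by.

```python
def bonferroni_applies(metric_role: str, num_treatments: int,
                       num_secondary_metrics: int = 0) -> bool:
    """Return True when the rules above call for a Bonferroni correction."""
    if metric_role == "primary":
        # Primary metric: corrected only when more than one treatment exists.
        return num_treatments > 1
    if metric_role == "secondary":
        # Secondary metric: corrected when there are multiple secondary
        # metrics or multiple treatments.
        return num_secondary_metrics > 1 or num_treatments > 1
    raise ValueError("metric_role must be 'primary' or 'secondary'")


print(bonferroni_applies("primary", num_treatments=1))    # False
print(bonferroni_applies("primary", num_treatments=2))    # True
print(bonferroni_applies("secondary", num_treatments=1,
                         num_secondary_metrics=3))        # True
```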

In either case, Amplitude places an info icon in the significance column when the Bonferroni correction is applied. The tooltip shows both the corrected and uncorrected p-values.
