P Values: What They Are and How to Calculate Them
Explore p-values in experiments, understand how to calculate and apply them, and discover best practices to make informed decisions in product development.
What is a p-value?
A p-value helps you judge whether an experiment's results are likely real or just a fluke.
P-values are probabilities, meaning they can take any value between 0 and 1. The smaller the p-value (i.e., closer to zero), the more confident you can be that the results aren’t just random chance.
In web and product testing, p-values help you separate meaningful changes from noise when tweaking websites and apps.
Say you change the color of your website's “Buy Now” button. A p-value can tell you whether any increase in sales is likely due to your change or if it’s a normal, day-to-day variation.
However, it’s important to remember that p-values can’t prove anything with 100% certainty—they instead give you a strong idea of how likely it is that your results are meaningful. This insight makes p-values (used alongside other calculations) a crucial part of the decision-making toolkit for anyone running experiments or analyzing data.
P-values vs confidence intervals
P-values and confidence intervals are two sides of the same coin. Both values can help you understand your data, but they do so in slightly different ways.
P-values tell us how likely our results are if there’s no real effect—they measure probability.
Confidence intervals, on the other hand, give us a range within which the true effect likely lies. The interval is an estimated range rather than a single number.
Experiments can use both p-values and confidence intervals, and together the results help you make more informed decisions about your product and website.
Imagine you’re testing a new site feature:
- A p-value of 0.03 indicates that there’s only a 3% chance you’d see these results if the feature had no effect.
- A 95% confidence interval of 2% to 8% suggests the feature most likely increases engagement by between 2% and 8%.
Using both values provides you with a more complete picture. The p-value tells you if there’s likely an effect, while the confidence interval gives you an idea of how big that effect might be.
You can use these tools to decide whether that new feature or product update is worth the investment. Is the potential increase in engagement worth the risk and effort? Or do you need to go back to the drawing board? P-values provide a practical framework for making these decisions.
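To make this concrete, here is a minimal sketch of how a p-value and a confidence interval might be computed together for a test like this, assuming engagement is measured as a simple rate. The visitor and engagement counts below are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for an illustrative feature test (not real data)
control_engaged, control_n = 800, 4_000    # 20% engagement without the feature
variant_engaged, variant_n = 880, 4_000    # 22% engagement with the feature

p1, p2 = control_engaged / control_n, variant_engaged / variant_n
diff = p2 - p1

# Two-proportion z-test: pooled standard error under the null hypothesis of "no effect"
pooled = (control_engaged + variant_engaged) / (control_n + variant_n)
se_null = np.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
z = diff / se_null
p_value = 2 * stats.norm.sf(abs(z))        # two-sided p-value (~0.03 for these counts)

# 95% confidence interval for the lift, using the unpooled standard error
se_diff = np.sqrt(p1 * (1 - p1) / control_n + p2 * (1 - p2) / variant_n)
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"p-value: {p_value:.3f}")
print(f"95% CI for the lift: {ci_low:+.1%} to {ci_high:+.1%}")
```

The two outputs answer different questions: the p-value speaks to whether an effect is likely present at all, while the interval suggests how big it might plausibly be.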
The role of p-values in hypothesis testing
P-values are an essential part of hypothesis testing, a method for determining whether changes to a product have a meaningful impact.
The process starts with questions like “Does this change make a difference?”
You then set up two hypotheses:
- Null hypothesis: There’s no real effect (the skeptical stance).
- Alternative hypothesis: There is a real effect.
After you run the experiment and collect data, you calculate the p-value. This value helps answer the question: “If there was no effect, how likely are we to see these results?”
If the p-value is small (usually less than 0.05), you might conclude that the results are unlikely under the null hypothesis, leading you to reject the null hypothesis in favor of the alternative.
The process helps you decide whether or not to implement changes. For example, when testing a new layout, your null hypothesis would be “The new layout doesn’t affect user engagement.”
After the test, you might get a p-value of 0.02. This value suggests there is only a 2% chance of observing the results if the layout truly had no effect. Based on this information, you might decide the new layout positively impacts engagement and roll out the change.
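One way to build intuition for the question “If there were no effect, how likely are we to see these results?” is to simulate that no-effect world directly. The sketch below runs a simple permutation test on invented engagement data, rather than the layout experiment described here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented per-visitor engagement data (1 = engaged) for the old and new layouts
old_layout = rng.binomial(1, 0.30, size=5_000)
new_layout = rng.binomial(1, 0.33, size=5_000)
observed_diff = new_layout.mean() - old_layout.mean()

# Simulate a world where the layout makes no difference by shuffling the group labels
combined = np.concatenate([old_layout, new_layout])
null_diffs = []
for _ in range(10_000):
    rng.shuffle(combined)
    null_diffs.append(combined[5_000:].mean() - combined[:5_000].mean())

# p-value: how often random shuffles produce a difference at least as extreme as observed
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"Observed lift: {observed_diff:.3f}, permutation p-value: {p_value:.4f}")
```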
However, a low p-value alone doesn’t necessarily mean the new layout is better—it simply indicates that the difference in results is probably not due to chance. The improvement could be due to other factors you hadn’t considered (which you can find and eliminate with other tests).
Consider this analogy: You hear a noise at night. Your null hypothesis is “It’s just the wind.” A low p-value is like saying, “It’s very unlikely to hear this noise if it’s just the wind.” The value suggests it’s probably not the wind, but it doesn’t prove it’s a burglar (your alternative hypothesis). The noise could be your cat, a tree branch, or something else entirely.
This slight difference matters because it keeps us humble and open to other explanations. Statistics guide our decisions, but they don’t make them for us. You must still use your judgment, consider other evidence, and be open to alternative explanations.
In summary, hypothesis testing with p-values is vital for helping product teams make quick, data-driven decisions. However, it’s always wise to combine the results with other evidence, such as user feedback and long-term trends, along with practical considerations like how the change aligns with your goals and the cost of implementation.
How to calculate the p-value
Calculating a p-value might seem complex initially, but the process compares your observed results to what you’d expect by chance.
Collect the results
Begin by gathering the data from your experiment. For example, if you’re testing two webpage versions, you would record how many clicks each version gets.
Calculate a test statistic
Next, calculate a test statistic summarizing your data into a single number. You do this by taking the difference between your results and what you expected and adjusting for the spread (variability) of your data.
Standard test statistics include:
- T-statistics: Compare group averages relative to their internal variation.
- Z-scores: Measure how far a data point is from the mean, in standard deviations.
- Chi-square tests: Compare the observed versus expected frequencies in categorical data.
While these test statistics are the most used, many others can be applied depending on your data.
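As a rough sketch, here is how each of these test statistics might be computed in Python with scipy; all the data below is invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# T-statistic: compare average session length (minutes) between two groups
group_a = rng.normal(10.0, 2.0, size=200)
group_b = rng.normal(10.5, 2.0, size=200)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Z-score: how far one day's sales sit from the historical mean, in standard deviations
daily_sales = rng.normal(1_000, 100, size=365)
z_score = (1_230 - daily_sales.mean()) / daily_sales.std()

# Chi-square test: observed vs. expected click frequencies for two page versions
#                        clicked  not clicked
contingency = np.array([[120,     880],    # version A
                        [150,     850]])   # version B
chi2, chi_p, dof, expected = stats.chi2_contingency(contingency)

print(f"t = {t_stat:.2f} (p = {t_p:.3f}), z = {z_score:.2f}, chi2 = {chi2:.2f} (p = {chi_p:.3f})")
```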
Determine the probability distribution
Determine the range of results you’d expect if there were no real differences (i.e., if the null hypothesis were true).
For example, when flipping a coin, you know the expected pattern of heads and tails for a fair coin. Similarly, statisticians have established expected result patterns for various experiments when there’s no real effect.
Find the probability
Finally, ask yourself, “If there were no real effect, how often would I see results like mine or more extreme?” This probability is the p-value.
Using the coin toss analogy, you might ask, “If this coin were fair, how often would I see this many heads in a row or more?”
A small p-value suggests that your results would be rare if there were no real effect, indicating that your observed effect is likely genuine.
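Continuing the coin analogy, here is a small sketch of that exact question using an exact binomial test (scipy 1.7+); the number of flips and heads is made up.

```python
from scipy import stats

# "If this coin were fair, how often would I see this many heads or more?"
heads, flips = 16, 20
result = stats.binomtest(heads, flips, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")       # roughly 0.006: rare for a fair coin

# The same probability computed directly from the binomial distribution
p_manual = stats.binom.sf(heads - 1, flips, 0.5)
print(f"manual:  {p_manual:.4f}")
```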
Using the p-value in product and web experiments
In practice, most teams don’t calculate p-values by hand. They use statistical software, online calculators, or built-in functions in programming languages like R or Python. These tools handle the complex calculations for you.
Many A/B testing platforms automatically calculate p-values. You input your data, and they provide the results.
For instance, your A/B testing tool might use a two-sample t-test when comparing click-through rates for two webpage versions. This test calculates the t-statistic based on the difference between your two samples. Then it determines the p-value from the t-distribution (which describes how a sample mean varies around the population mean).
The specific calculation method may vary based on your experiment design and the data type. However, for most practical testing purposes, understanding what the p-value means and how to interpret it is more important than knowing how to calculate it manually.
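That said, for readers who want intuition about what such a tool might be doing under the hood, here is a minimal sketch of a two-sample t-test on simulated per-visitor click data; real platforms may use different tests and corrections.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-visitor click outcomes (1 = clicked) for two page versions
version_a = rng.binomial(1, 0.10, size=8_000)
version_b = rng.binomial(1, 0.12, size=8_000)

# Two-sample t-test on the click data, similar in spirit to many A/B testing tools
t_stat, p_value = stats.ttest_ind(version_b, version_a)
print(f"Observed CTRs: {version_a.mean():.3f} vs. {version_b.mean():.3f}")
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
```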
Interpreting p-values: Significance levels and decision-making
Interpreting p-values is crucial for making decisions after your experiments. Making sense of them helps you see if your results are significant and if you should reject the null hypothesis.
The most common significance level used in testing is 0.05. This level means that if your p-value is less than 0.05, your results are typically considered statistically significant. In other words, there’s less than a 5% chance you’d see these results if there were no real effect.
But interpretation doesn’t just involve looking for that 0.05 mark. As a rough guide:
- p < 0.01: strong evidence against the null hypothesis
- 0.01 < p < 0.05: moderate evidence against the null hypothesis
- 0.05 < p < 0.1: weak evidence against the null hypothesis
- p > 0.1: little or no evidence against the null hypothesis
The strength of evidence needed also depends on the stakes of your experiment. For instance, a p-value of 0.04 might be enough to try a new feature, but you may want a lower p-value before making a major change to your core product.
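If it helps to encode that rough scale in an analysis script, a small helper like the one below keeps interpretations consistent across reports. It is purely illustrative, with the usual caveat that these cutoffs are conventions rather than laws.

```python
def evidence_strength(p_value: float) -> str:
    """Map a p-value to the rough evidence scale described above (a convention, not a rule)."""
    if p_value < 0.01:
        return "strong evidence against the null hypothesis"
    if p_value < 0.05:
        return "moderate evidence against the null hypothesis"
    if p_value < 0.1:
        return "weak evidence against the null hypothesis"
    return "little or no evidence against the null hypothesis"

print(evidence_strength(0.04))   # moderate evidence against the null hypothesis
```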
Statistical significance doesn’t always mean practical significance. A tiny change might be statistically significant in a large sample but not worth implementing if it doesn’t have a meaningful impact.
You should also be wary of p-hacking, i.e., running too many tests and only reporting the significant ones. This practice can lead to false positives and misleading conclusions.
P-values assist your decision-making—they’re not a rule. Use them alongside other measures like effect sizes, confidence intervals, and your business knowledge to make the best choices for your product or website.
Best practices for reporting p-values in research
Providing a complete and transparent picture of your experimental results improves the credibility of your findings, leading to better decision-making in your process. Accurate reporting means more trustworthy research that benefits your entire business.
Following these best practices will help ensure clarity and precision.
Report exact p-values
Instead of just saying “p < 0.05” or “p > 0.05”, report the actual p-value (e.g., p = 0.032).
Readers can use this data to interpret the strength of the evidence for themselves and compare the results across different experiments.
Provide context
Always report p-values alongside effect sizes and confidence intervals. For example, you might say, “The new checkout process increased conversion rates by 2.5 percentage points (95% CI: 1.2 to 3.8 percentage points, p = 0.001).”
This context provides a more complete and more precise picture of your results.
Pre-specify your threshold
Before running your experiment, decide on and clearly state your significance level (e.g., α = 0.05). Committing to a threshold upfront prevents the temptation to move the goalposts after seeing the results.
Be transparent
Report the results of all the tests you ran, even those that weren’t significant. Being more open with your results helps combat publication bias and p-hacking.
Avoid loaded language
Terms like “highly significant” or “marginally significant” can mislead. Let the numbers speak for themselves.
Discuss practical significance
Go beyond statistical significance to explain what your results mean in real-world terms for your product—even ‘positive’ changes may not provide enough uplift to implement fully.
For instance: “While the new email subject line significantly increased open rates (p = 0.04), the absolute increase was only 0.5 percentage points, translating to about 50 additional opens per campaign. Given the effort required to implement this change, we don’t consider this practically significant.”
Interpret cautiously
Discuss p-values in context, considering limitations and alternative explanations. Did you run the test during a major holiday or broader product update? Acknowledge how events like these might have influenced your results, and mention them in your reports to avoid overstating your conclusions.
Use visualizations
Graphs, charts, and other visualizations communicate your results more clearly than p-values alone.
You might include a chart showing effect sizes and confidence intervals for different test variations, or a cumulative graph showing how the p-value evolved throughout your experiment.
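As one possible sketch, the matplotlib snippet below draws effect sizes with 95% confidence intervals for a few hypothetical variations; the lifts and interval widths are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical lifts (percentage points) and 95% CI half-widths for three variations
variations = ["Variant A", "Variant B", "Variant C"]
lifts = [2.5, 0.8, 4.1]
ci_half_widths = [1.3, 1.1, 1.6]

fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(lifts, range(len(variations)), xerr=ci_half_widths, fmt="o", capsize=4)
ax.axvline(0, linestyle="--", color="grey")   # "no effect" reference line
ax.set_yticks(range(len(variations)))
ax.set_yticklabels(variations)
ax.set_xlabel("Lift in conversion rate (percentage points)")
ax.set_title("Effect sizes with 95% confidence intervals")
plt.tight_layout()
plt.show()
```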
Real-world examples of p-value application
P-values guide decisions across various aspects of products, from design elements to core business strategies. Let’s examine how p-values apply in practice.
Ecommerce button color
An ecommerce store tests green vs. blue “Add to Cart” buttons. After a week-long A/B test with 50,000 visitors split evenly between the two versions, the green button shows a 2-percentage-point higher click-through rate (22% vs. 20%) with a p-value of 0.03.
This p-value suggests there’s only a 3% chance of seeing this difference if the button colors were equally effective. The marketing team decides to implement the green button, estimating it could increase sales.
App onboarding flow
A fitness tracking app experiments with a new onboarding process that includes more visual guides and fewer steps. Over a month, 10,000 new users are randomly assigned to the old or new process.
The completion rate increases from 60% to 65%, with a p-value of 0.001. This low p-value gives strong confidence that the new process is better. The product team rolls out the change and projects it will lead to a higher number of active users.
Email subject line test
A company tests two email subject lines for its monthly newsletter. With a sample size of 20,000 subscribers split evenly, open rates differ by 1 percentage point (22% vs. 21%), but the p-value is 0.2.
This high p-value indicates that the difference might be due to chance. The marketing team chooses to run a longer test with more variation before making any changes.
Website loading speed
After optimizing images and scripts, an e-learning platform’s average load time decreases from 3.5 to 3.0 seconds.
In a two-week test period with 100,000 visitors, engagement (measured by average session duration) increases by 5%, from 10 minutes to 10.5 minutes, with a p-value of 0.07. This p-value is in the “weak evidence” range.
As a result, the development team opts to keep the optimizations but continues monitoring to see if the engagement increase reaches statistical significance over time.
Pricing strategy
A project management company tests a new mid-tier pricing option between its basic and premium plans. Over two months, it exposes 50% of its 10,000 monthly website visitors to the new pricing structure.
Conversions (free trial sign-ups) increase by 10% in the test group, with a p-value of 0.04. This number is just under the standard 0.05 threshold, so the product team decides to launch the new pricing tier gradually, starting with 20% of new visitors while closely monitoring long-term metrics.
Common misconceptions about p-values
P-values are often misunderstood, even by experienced researchers and those who use them regularly.
These misconceptions can lead to poor decisions. You might overhaul your website based on a “significant” result that’s tiny in practical terms or miss out on a promising feature because your test was underpowered.
Understanding these nuances helps you accurately interpret your A/B tests and other experiments, leading to better product decisions.
Let’s clear up some common myths.
- “A low p-value proves my hypothesis is true.” Not quite. A low p-value just means your data is unlikely if the null hypothesis is true. It doesn’t prove your specific alternative hypothesis.
- “P-values measure the probability that my results occurred by chance.” The values measure the probability of seeing such results (or more extreme) if the null hypothesis were true—a subtle but important difference.
- “A non-significant p-value means there’s no effect.” This statement isn’t necessarily true. The value might just mean you don’t have enough evidence to detect an effect. Maybe your sample size was too small, for instance.
- “P-values tell me how important results are.” The values only speak to statistical significance, not practical importance. A tiny effect with little practical meaning can still have a very low p-value in a large sample.
- “If p < 0.05, the effect is real; if p > 0.05, it isn’t.” This idea treats p-values as a binary switch, which they’re not. P-values are more like a continuous measure of evidence against the null hypothesis.
Limitations and criticisms
Although p-values are widely used in experiments, they have drawbacks.
- Overreliance: Many product teams focus too heavily on p-values, sometimes ignoring other important aspects of their data.
- Arbitrary thresholds: The typical 0.05 significance level is somewhat arbitrary and can lead to thinking in black-and-white about more complex results.
- Doesn’t measure effect size: A low p-value doesn’t tell you whether an effect is large or important. Small, practically insignificant effects can have low p-values in large samples.
- Vulnerable to p-hacking: Researchers can manipulate analyses to achieve significant p-values, leading to false positives. This practice is especially tempting in fast-paced testing environments.
- Sample size sensitivity: With larger datasets (common in web experiments), almost everything becomes “statistically significant.”
- Doesn’t work well for multiple comparisons: When you run many tests, which is typical in product refinements, the chance of false positives increases.
- Encourages binary thinking: P-values can lead to oversimplified “significant or not” decisions rather than nuanced interpretations.
These limitations are why you shouldn’t rely on p-values alone. A slight tweak to a website might yield a statistically significant result but have no real impact on the user experience or business metrics.
Recognizing these criticisms enables you to use p-values as part of your broader decision-making resources rather than as the sole judge of truth in your experiments.
P-value alternatives
Given the limitations of p-values, many researchers and product teams explore alternative or complementary approaches.
Here are some popular options:
- Effect sizes: Measure the magnitude of an effect (like how big a difference your change made), not just its statistical significance. You might use metrics like Cohen’s d or percentage change.
- Confidence intervals: Provide a range of plausible values for the true effect. In A/B testing, confidence intervals might show you the likely range of improvement for a new feature, giving more context than a single p-value.
- Bayes factors: Compare the likelihood of the data under different hypotheses. This approach is particularly useful for ongoing tests, as you can update your beliefs as new data comes in.
- False discovery rate: This approach adjusts for multiple comparisons and reduces the risk of false positives when running many tests simultaneously—a common scenario in web experimentation.
- Practical significance thresholds: Instead of statistical significance, you set thresholds based on what matters for your business. For example, “We’ll only implement changes that improve conversion rates by at least 2%.”
- Multi-armed bandit algorithms: Dynamically allocate traffic to better-performing variations during the experiment, potentially finding wins faster than traditional A/B tests.
- Regression discontinuity designs: Useful when you have a clear cutoff point, such as analyzing the effect of a loyalty program that kicks in at a certain spending level.
In practice, you might use a combination of these methods. For instance, you could report effect sizes and confidence intervals alongside p-values or use Bayesian methods for ongoing tests where you want to make decisions as data accumulates.
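As a brief sketch of two of these ideas, the snippet below computes Cohen's d for an invented before/after comparison and applies a Benjamini-Hochberg false discovery rate correction to a handful of made-up p-values using statsmodels.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size: standardized difference between two group means (equal-sized groups)."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

rng = np.random.default_rng(1)
old_design = rng.normal(10.0, 2.0, size=500)   # invented session lengths (minutes)
new_design = rng.normal(10.4, 2.0, size=500)
print(f"Cohen's d: {cohens_d(old_design, new_design):.2f}")

# False discovery rate: adjust p-values from several simultaneous tests (Benjamini-Hochberg)
raw_p_values = [0.003, 0.041, 0.062, 0.20, 0.011]
reject, adjusted, _, _ = multipletests(raw_p_values, alpha=0.05, method="fdr_bh")
for raw, adj, keep in zip(raw_p_values, adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f} "
          f"({'significant' if keep else 'not significant'})")
```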
The goal of any experiment is to make informed decisions about your product or website. These alternatives often provide richer information than p-values alone and help you understand whether a change had an impact, as well as how large and important that impact is likely to be.
Results you can trust with Amplitude
Although p-values are useful, they’re just one piece of the puzzle. A holistic view of your experimental results will help you make more knowledgeable decisions as quickly as possible.
Amplitude enables you to run and analyze A/B tests and other experiments, providing not just p-values but a suite of statistical measures to help you understand your results.
- Easily set up and run experiments across your digital products
- Get real-time results with clear visualizations
- Analyze several metrics at the same time
- Automatically adjust for multiple comparisons
- Integrate your experiment results with user behavior data for deeper insights
With a solid understanding of p-values and the right resources, you’ll gain actionable insights that lead to meaningful improvements for your users and your company. These confident decisions are the driving force behind business growth.
Take your experimentation to the next level.