The Benefits of Bayesian Statistics
A technical dive into how Bayesian and frequentist probability compare, and why Bayesian can give you a better “best bet”
Earlier this year, Amplitude Feature Experimentation and Web Experimentation expanded their support for Bayesian statistics, empowering teams to interpret results with greater confidence. In this article, we’ll walk through the key differences between frequentist and Bayesian approaches, along with practical examples to make those distinctions clear.
This piece assumes a basic understanding of statistics. If you’re new to statistics, our blog on data science skills is a good place to start.
The basics: Frequentist vs. Bayesian probability
The frequentist interpretation of probability is the long-run proportion of times an event occurs. The Bayesian interpretation of probability quantifies uncertainty about the world.
In the frequentist model, it’s assumed that population parameters are fixed, unknown values that we are trying to estimate.
In the Bayesian model, population parameters are random variables, which allows us to make probability statements about them.
Why go Bayesian?
If you’ve taken a statistics class, it most likely had a frequentist focus. Frequentist methods are very common because they allow clearer control over false-positive and false-negative rates.
However, frequentist probability relies on having enough data to form a conclusion. If you don’t have enough data, you’re out of luck. Or at least you were until the 18th century and the advent of Bayesian probability.
At its core, Bayesian statistics follows Bayes’ rule, a theorem that explains how to find the probability of a cause given its effect. In statistical terms, it shows how to convert from P(B|A) to P(A|B): P(A|B) = P(B|A) × P(A) / P(B). In other words, the posterior is directly proportional to the prior times the likelihood. Bayesian probability uses this to calculate probabilities even from small sample datasets.
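To make the rule concrete, here is a minimal sketch of the computation. The probabilities below are invented purely for illustration:

```python
def bayes_rule(prior_a, likelihood_b_given_a, prob_b):
    """Compute the posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / prob_b

# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.5, P(B) = 0.4
posterior = bayes_rule(prior_a=0.3, likelihood_b_given_a=0.5, prob_b=0.4)
print(posterior)  # 0.375
```

The point is the direction of the flip: you start with P(B|A) (the likelihood) and a prior on A, and end with P(A|B) (the posterior).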
Many statisticians prefer Bayesian statistics because it lets you compute the probability that the treatment outperforms the control. You can’t compute this in the frequentist framework; instead, you compute a p-value, which is P(observing data at least as extreme as yours | H0 is true). But what you really want is the opposite conditional probability, P(H0 is true | data)—and that’s what you get with Bayesian methods, because Bayes’ rule tells you how to flip the conditional.
One of the new concepts in Bayes is the prior distribution. This allows you to specify your domain knowledge. Another benefit of Bayesian is that it allows for specifying the data-generating process so that you can specify the whole distribution.
One disadvantage of Bayesian methods is that there is often no single correct prior. Two people may pick different priors and get different results. On the other hand, if you have enough data, the choice of prior should not matter that much.
Example 1: No 6’s in the die roll set
Say you want to figure out the probability of rolling a 6 on a normal 6-sided die. Because you’re an empiricist, you roll the die 10 times and record your findings. No 6’s come up. So, what’s the probability of rolling a 6?
From the frequentist perspective, you would estimate the probability of rolling each number as the sample mean of the empirical data. This corresponds to the maximum likelihood estimate (MLE). That means you would estimate the probability of rolling a 6 as 0.
Saying an event has a probability of 0 is a very strong statement. You’re saying that this event can’t happen. However, we know that’s not true—what you really mean to say is, “The probability of this event happening is unlikely, and I just have not had enough data to give a good estimate of this probability.”
From the Bayesian perspective, instead of using the MLE, you would use the maximum a posteriori estimate (MAP). One prior you might pick is adding one pseudo-observation of each face of the die. With this prior, the MAP is (0+1)/(10+1+1+1+1+1+1) = 0.0625.
We can see that the MLE and MAP give different estimates. The MAP result still may not match the true probability of rolling a 6, but it’s substantially closer than p=0.
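The arithmetic above can be sketched in a few lines. The rolls below are a hypothetical sample with no 6s, matching the setup of the example:

```python
# Hypothetical 10 rolls of a fair die in which no 6 happened to appear.
rolls = [1, 2, 2, 3, 3, 4, 5, 5, 1, 4]

n = len(rolls)
count_six = rolls.count(6)

# Frequentist MLE: sample proportion of 6s.
mle = count_six / n

# Bayesian MAP with one pseudo-observation per face (6 extra "rolls").
map_estimate = (count_six + 1) / (n + 6)

print(mle, map_estimate)  # 0.0 0.0625
```

The pseudo-count prior guarantees no outcome is ever assigned probability exactly 0, and its influence fades as real data accumulates.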
Example 2: Top batting average
If you’re watching a baseball game, the broadcast will often highlight the players with the top batting average in the league. Look closely, and you’ll always see an asterisk at the bottom of the graphic saying, “At least 20 games played,” or “At least 100 at bats.” Why include that asterisk?
Because otherwise, if you sorted by raw batting average, you’d suddenly see new batters with a 100% batting average (or other very high averages)—simply because they’ve only had a couple of at-bats. Oftentimes, these are pitchers. Pitchers are generally the worst hitters, so it’s safe to say these small-sample batting averages are a bad estimate of their future career-long batting averages, or even of their averages over the next 100 at-bats.
From the frequentist perspective, the batting average asterisk sets a sample-size threshold to make the data more trustworthy. That gets rid of those lucky, low-at-bat pitchers.
But if you wanted to include the pitchers out of a sense of fairness, you could avoid the asterisk using Bayesian statistics. One option would be to group all pitchers, compute their global batting average, and then regularize each pitcher’s batting average toward that global average.
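A simple way to implement that shrinkage is to treat the global average as prior evidence worth some number of pseudo at-bats. The function name, the pseudo-count weight, and all numbers below are assumptions chosen for illustration:

```python
def shrunk_average(hits, at_bats, global_avg, pseudo_at_bats=100):
    """Shrink a raw batting average toward a global average.

    The global average contributes `pseudo_at_bats` worth of prior
    evidence; players with few real at-bats stay near the prior,
    while players with many at-bats keep their own average.
    """
    return (hits + global_avg * pseudo_at_bats) / (at_bats + pseudo_at_bats)

global_avg = 0.250  # assumed group-wide average

# A pitcher with 2 hits in 2 at-bats: raw average 1.000,
# but the shrunk estimate stays close to the prior.
print(round(shrunk_average(2, 2, global_avg), 3))      # 0.265

# A regular with 150 hits in 500 at-bats: raw 0.300, barely shrunk.
print(round(shrunk_average(150, 500, global_avg), 3))  # 0.292
```

With this estimate, no sample-size cutoff (and no asterisk) is needed: small samples automatically carry little weight.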
Edge cases
Sometimes the distinction between Bayesian and frequentist is not as strong, and you can interpret the same methodology in different ways.
For example, say you’re trying to solve a least squares problem. From a frequentist perspective, you could use the lasso: assume normally distributed errors and add L1 regularization. This amounts to finding the MLE while applying regularization to prevent overfitting and encourage a sparse model. L1 regularization is chosen because it is convex and makes the coefficients more likely to be 0, since its contours are “pointy.” During optimization, you’ll generally land on one of those corner points, where some coefficients are exactly 0.
Another frequentist approach is the Simplex method. With that, we only test the vertices of the boundary of the feasible region, and we don’t need to test anything on the interior of the feasible region.
The Bayesian viewpoint of the lasso is that you have normally distributed errors with Laplace priors on the coefficients. The standard follow-up question is: why does L1 regularization create more sparsity than L2? From the frequentist perspective (even though this doesn’t really have anything to do with probabilities and is more of a geometry argument), you would reason about the contours of the L1 norm vs. the L2 norm. From the Bayesian perspective, you would say that the Laplace prior has more mass close to 0 than a normal prior, so it takes more data to push a coefficient away from zero.
So, which approach should I use?
Like all things in probability, it depends. Evaluate how much data you have and the type of problem you’re solving. If you have a smaller sample size, odds are that a Bayesian approach will serve you better.
Interested in trying Bayesian statistics in your Amplitude experiments? Check out our documentation on how to get started!

Akhil Prakash
Senior Machine Learning Scientist, Amplitude
Akhil is a senior ML scientist at Amplitude. He focuses on using statistics and machine learning to bring product insights to the Experiment product.