Multi-armed bandit experiments
In a traditional A/B test, Amplitude Experiment assesses all the variants in your experiment until it reaches a statistically significant result. From there, you can choose to roll out the winning variant, or roll all users back to the control variant. Your decision depends on why a particular variant outperformed the others.
Sometimes, that reason isn't relevant. You want to identify the best-performing variant and send as much traffic to it as possible. For example:
- Optimizing hero images, messaging, or color changes to UI elements.
- In-product layout changes, like information hierarchy or order of operations.
- Optimizing menus or navigation.
- Ad optimization for seasonal or time-sensitive promotions or events.
- Hyperparameter tuning for ML models.
Unlike a traditional A/B test, multi-armed bandits don't use statistical significance to determine success. They also don't use a control or baseline variant. Amplitude Experiment also displays results differently for the two experiment types. A later section covers those differences.
Multi-armed bandit experiments use Thompson sampling. Amplitude Experiment doesn't support other statistical methodologies for multi-armed bandits.
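To make the idea concrete, here is a minimal sketch of Thompson sampling for a binary (conversion-style) metric. It is an illustration only, not Amplitude's implementation: the function name, the uniform Beta(1, 1) prior, and the Monte Carlo estimate of each variant's traffic share are all assumptions made for this example.

```python
import random

def thompson_allocation(successes, failures, draws=10_000, seed=0):
    """Estimate traffic shares with Beta-Bernoulli Thompson sampling.

    successes[i] and failures[i] are the observed counts for variant i.
    A variant's share is the fraction of posterior draws in which it had
    the highest sampled conversion rate.
    """
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # Beta(1 + s, 1 + f) is the posterior under an assumed uniform prior.
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

# Hypothetical counts for three variants with 200 exposures each.
shares = thompson_allocation(successes=[30, 45, 38], failures=[170, 155, 162])
print(shares)
```

Variants with stronger observed performance win more of the posterior draws, so they receive a larger share of traffic at the next reallocation, while weaker variants keep a small share in case the early data was misleading.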
Before you begin
- You can evaluate multi-armed bandit experiments locally or remotely.
- You can configure multi-armed bandit experiments to reallocate traffic hourly, daily, or weekly.
- Amplitude Experiment requires at least 100 exposures in each variant before it reallocates traffic.
- Multi-armed bandits respect all mutual exclusion groups and holdouts that you associate with them.
- The flag config history shows each reallocation. Amplitude makes entries under the user ampex_data_monster.
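The minimum-exposure requirement above amounts to a simple gate on reallocation. The sketch below shows one way to express it; the function and constant names are hypothetical, and only the threshold of 100 exposures per variant comes from this page.

```python
MIN_EXPOSURES_PER_VARIANT = 100  # threshold stated in the list above

def can_reallocate(exposures_by_variant):
    """Return True only if every variant has reached the minimum exposures."""
    if not exposures_by_variant:
        return False
    return all(n >= MIN_EXPOSURES_PER_VARIANT
               for n in exposures_by_variant.values())

print(can_reallocate({"variant-a": 100, "variant-b": 250}))  # True
print(can_reallocate({"variant-a": 99, "variant-b": 250}))   # False
```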
Reallocation schedule
Reallocation runs at fixed times, not relative to the experiment start date. The schedule depends on the reallocation frequency you choose. You can't configure these times.
Create a multi-armed bandit experiment
Creating a multi-armed bandit experiment is almost identical to creating an A/B test in Amplitude Experiment. The next section covers the differences.
Differences between multi-armed bandits and A/B tests
Metrics
A multi-armed bandit experiment requires a primary metric. Amplitude Experiment uses the primary metric to optimize your experiment. You can include secondary metrics, but Amplitude Experiment uses them for reporting only.
In an A/B test, your primary metric can be a guardrail metric: a metric that you don't want your experiment to negatively affect. Clickthrough rate is a common guardrail metric. A multi-armed bandit experiment optimizes for a metric, so a guardrail metric doesn't apply. You can't optimize for a change you don't want to occur. Primary metrics for multi-armed bandit experiments must be success metrics ("will increase" or "will decrease"). Amplitude Experiment supports both binary metrics and continuous metrics.
To optimize two metrics in your multi-armed bandit experiment, create a custom metric that's a weighted average of both. If you face a tradeoff between metrics you want to optimize, run an A/B test instead.
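As a rough illustration of the arithmetic behind a weighted-average metric, the sketch below combines two per-variant values into one score. The metric names, values, and weights are hypothetical, and the custom metric itself is something you define in Amplitude rather than in code.

```python
def weighted_metric(values, weights):
    """Combine metric values into one score using a weighted average."""
    total_weight = sum(weights.values())
    return sum(values[name] * w for name, w in weights.items()) / total_weight

# Hypothetical metrics and weights for one variant.
values = {"conversion_rate": 0.12, "retention_rate": 0.30}
weights = {"conversion_rate": 0.7, "retention_rate": 0.3}

score = weighted_metric(values, weights)
print(round(score, 3))  # 0.174
```

The weights encode how much each metric matters to you. If no single set of weights feels right because the two metrics pull in opposite directions, that is the tradeoff case where an A/B test is the better fit.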
Traffic allocation
Allocation for a multi-armed bandit experiment always begins with a uniform distribution, because the model can't know which variant is most effective before it collects data. The allocation changes after data starts arriving.
A multi-armed bandit adjusts the allocation between the variants only. It doesn't adjust the percentage rollout.
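The following sketch shows how the two settings combine. It assumes, for illustration, that a variant's share of all users is the percentage rollout multiplied by the variant's allocation within the experiment; the function name and numbers are hypothetical.

```python
def effective_share(rollout, allocation):
    """Share of all users who see each variant.

    rollout is the fraction of users in the experiment (fixed by you);
    allocation is the bandit-controlled split among variants (sums to 1).
    """
    return {variant: rollout * share for variant, share in allocation.items()}

# A 50% rollout: the bandit can shift the split, but never the 50%.
before = effective_share(0.5, {"variant-a": 0.5, "variant-b": 0.5})
after = effective_share(0.5, {"variant-a": 0.2, "variant-b": 0.8})
print(before)
print(after)
```

In both cases the shares add up to the 50% rollout. Only the split between the variants changes.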
Confidence level
The confidence level plays a different role in a multi-armed bandit experiment than it does in an A/B test: it can accelerate traffic to the winning variant. For example, if your experiment's confidence level is 95%, and the multi-armed bandit has already allocated at least 95% of the experiment's traffic to one variant, Amplitude Experiment treats that variant as the winner and allocates 100% of traffic to it from that point on.
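The rule in the example above can be sketched as a small function. This is a simplified illustration with hypothetical names, not Amplitude's code: once the leading variant's allocation reaches the confidence level, it receives all of the traffic.

```python
def apply_confidence_rule(allocation, confidence_level=0.95):
    """Send all traffic to the leader once its share reaches the confidence level."""
    leader = max(allocation, key=allocation.get)
    if allocation[leader] >= confidence_level:
        return {v: (1.0 if v == leader else 0.0) for v in allocation}
    # Below the threshold, the allocation is left unchanged.
    return dict(allocation)

print(apply_confidence_rule({"variant-a": 0.96, "variant-b": 0.04}))
# {'variant-a': 1.0, 'variant-b': 0.0}
print(apply_confidence_rule({"variant-a": 0.90, "variant-b": 0.10}))
# {'variant-a': 0.9, 'variant-b': 0.1}
```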
Duration estimate and MDE
In multi-armed bandit experiments, the minimum detectable effect (MDE) is an input to the duration estimate and nothing else. Because these experiments are automated and optimize for a metric, the MDE doesn't affect the experiment after it starts running.
When calculating the duration estimate before the experiment starts, Amplitude Experiment simulates what happens when all variants except one share the same baseline mean (computed from historical data). When measuring an increase, the exception variant has mean * (1+MDE). When measuring a decrease, the exception variant has mean * (1-MDE). Amplitude Experiment then calculates how long the multi-armed bandit might take to assign all traffic to one variant. The duration estimate caps at 31 days.
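The setup for that simulation can be sketched as follows. The snippet only builds the assumed per-variant means and applies the 31-day cap; it doesn't reproduce the simulation itself, and the function names and sample numbers are hypothetical.

```python
MAX_DURATION_DAYS = 31  # cap stated above

def simulated_means(baseline_mean, mde, n_variants, direction="increase"):
    """Means used in the simulation: all variants at baseline except one."""
    factor = 1 + mde if direction == "increase" else 1 - mde
    return [baseline_mean] * (n_variants - 1) + [baseline_mean * factor]

def cap_duration(estimated_days):
    """Apply the 31-day cap to a raw duration estimate."""
    return min(estimated_days, MAX_DURATION_DAYS)

# Three variants, a 10% baseline conversion rate, and a 5% relative MDE.
print(simulated_means(0.10, 0.05, 3))
print(simulated_means(0.10, 0.05, 3, direction="decrease"))
print(cap_duration(45))  # 31
```

A smaller MDE puts the exception variant closer to the baseline, so the simulated bandit needs more data to separate the variants and the duration estimate grows.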
Displayed results
Amplitude Experiment doesn't display variant jumping while a multi-armed bandit runs, because variant jumping is expected behavior in these experiments.
Amplitude Experiment doesn't display the data quality card for multi-armed bandit experiments. Most of the checks on that card don't apply to this experiment type, and you can't make changes that affect traffic allocation while the experiment runs.
The Bandits card resembles the non-cumulative exposure chart in the Monitor card, but normalizes to 100%. The Bandits card lets you visualize the percentage of traffic each variant receives on a given day.
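The normalization can be sketched like this: each day's exposure counts are converted to percentages of that day's total. The data shape, dates, and function name are assumptions for illustration, not Amplitude's export format.

```python
def normalize_daily_exposures(daily_counts):
    """Convert per-day exposure counts into per-day percentages (0-100)."""
    normalized = {}
    for day, counts in daily_counts.items():
        total = sum(counts.values())
        normalized[day] = {
            variant: (100.0 * n / total if total else 0.0)
            for variant, n in counts.items()
        }
    return normalized

# Hypothetical non-cumulative exposures for two days.
daily_counts = {
    "day-1": {"variant-a": 100, "variant-b": 100},
    "day-2": {"variant-a": 50, "variant-b": 150},
}
print(normalize_daily_exposures(daily_counts))
# {'day-1': {'variant-a': 50.0, 'variant-b': 50.0},
#  'day-2': {'variant-a': 25.0, 'variant-b': 75.0}}
```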
Notifications
Amplitude Experiment sends notifications to experiment editors when a multi-armed bandit allocates 70%, 80%, 90%, or 100% of traffic to a variant. Amplitude Experiment also sends a notification if the bandit takes a long time to complete or if the experiment's end date arrives. Notifications can go through Slack or email. For more information, go to Integrate Slack with Amplitude.
To configure your notifications, go to Settings > Personal settings > Notifications.