How to Calculate the Sample Size for an A/B Test
80% is the conventional minimum statistical power: an 80% chance of detecting a real effect if one exists.
A 5% significance level (alpha) is the default tolerance for a false positive across the industry.
Halving the minimum detectable effect roughly quadruples the sample size you need per variant.
Running an A/B test without first calculating the sample size is one of the most common ways experiments go wrong. You either stop too early and ship a "winner" that was really just noise, or you run forever chasing an effect your traffic could never have detected. A sample size calculation tells you, before you start, how many visitors each variant needs, so you know whether the test is even worth running.
This guide explains what sample size means in A/B testing, the four inputs that drive it, the exact formula behind the numbers, and how to use the result to plan a test you can actually trust.
What Is Sample Size in an A/B Test?
Sample size is the number of visitors (or sessions) each variant needs before your test has enough statistical power to reliably detect the effect you care about. It is not a number you discover after the fact by watching the dashboard — it is a target you commit to before launching, based on four parameters: your baseline conversion rate, the minimum detectable effect, the statistical power, and the significance level.
Get the sample size right and you avoid two expensive mistakes: a false positive (calling a winner that does not exist) and a false negative (missing a real improvement because the test was under-powered).
The Four Inputs That Drive Sample Size
Every sample size calculation — including the one above — rests on the same four levers:
- Baseline conversion rate. The current conversion rate of your control. Lower baselines need dramatically more traffic, because there is less signal to measure against.
- Minimum detectable effect (MDE). The smallest improvement you want to be able to detect. It is expressed either as a relative lift (e.g. "a 10% lift over baseline") or as an absolute change (e.g. "+0.5 percentage points"). Smaller MDEs require much larger samples.
- Statistical power (1 − β). The probability of detecting a real effect if one exists. 80% is the conventional minimum; 90% is safer for high-stakes decisions. Higher power means a bigger sample.
- Significance level (α). Your tolerance for a false positive. 5% (95% confidence) is standard. A stricter 1% (99% confidence) reduces false positives but raises the sample needed.
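Before these inputs enter a formula, they are usually turned into a target rate and two critical z-values. The sketch below shows one way to do that with Python's standard library. The function names `z_values` and `target_rate` are illustrative, and the two-sided default is an assumption rather than something this guide specifies.

```python
from statistics import NormalDist

def z_values(alpha=0.05, power=0.80, two_sided=True):
    """Return (z_alpha, z_beta) for a significance level and a power."""
    # A two-sided test splits alpha across both tails of the distribution.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    return z_alpha, z_beta

def target_rate(baseline, mde, relative=True):
    """Convert an MDE into the target conversion rate."""
    # Relative: 10% lift on 5% -> 5.5%. Absolute: +0.5 points on 5% -> 5.5%.
    return baseline * (1 + mde) if relative else baseline + mde

z_a, z_b = z_values()
print(round(z_a, 3), round(z_b, 3))                        # 1.96 0.842
print(round(target_rate(0.05, 0.10), 4))                   # 0.055
print(round(target_rate(0.05, 0.005, relative=False), 4))  # 0.055
```

Raising power to 90% or tightening alpha to 1% increases the corresponding z-value, which is why both choices increase the required sample.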
How the Sample Size Formula Works
For a two-proportion test comparing a control rate (p1) with a target rate (p2), the required sample size per variant follows from the formula given after this table of symbols:
| Symbol | Meaning |
|---|---|
| n | Visitors required per variant |
| zα | Critical z-value for your significance level and test type |
| zβ | Critical z-value for your chosen power |
| p1, p2 | Baseline and target conversion rates |
Written out, the formula is:

n = (zα·√(2·p̄·(1−p̄)) + zβ·√(p1·(1−p1) + p2·(1−p2)))² ÷ (p2 − p1)²

where p̄ = (p1 + p2) ÷ 2 is the average of the two rates. The calculator solves this for you, converting your significance level and power into the critical z-values with the inverse normal distribution. Because the denominator squares the effect, halving your MDE roughly quadruples the visitors you need.
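The formula can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculator's actual implementation: the function name `sample_size_per_variant` is hypothetical, the test is assumed to be two-sided by default, and the result is rounded up to a whole visitor.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Visitors needed per variant to detect a change from p1 to p2."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("p1 and p2 must be strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    # Critical z-values from the inverse normal distribution.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # average of the two rates
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 5% baseline, 10% relative lift (target 5.5%), 80% power, 95% confidence.
n_full = sample_size_per_variant(0.05, 0.055)
# Same baseline with the MDE halved to a 5% relative lift (target 5.25%).
n_half = sample_size_per_variant(0.05, 0.0525)
print(n_full, n_half, round(n_half / n_full, 2))
```

With these inputs the first call returns a little over 31,000 visitors per variant, and the second returns roughly four times as many, which matches the rule of thumb about halving the MDE.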
Estimating Test Duration
Sample size only tells you how many visitors you need — duration tells you how long that will take. Divide the required per-variant sample by your daily visitors per variant:
- Days = sample per variant ÷ (daily visitors ÷ number of variants).
- Fill in the optional daily-visitors field in the calculator and it estimates the days and weeks automatically.
Always run for at least one full week — ideally two — even if you hit the sample sooner, because weekday and weekend behaviour differ. If the estimate stretches past roughly eight weeks, seasonality and cookie churn start to distort results; raise your MDE, send more traffic, or test a higher-converting metric instead.
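The duration arithmetic can be sketched as follows. The function name, the even traffic split, and the example figure of 4,000 daily visitors are assumptions made for illustration. The 7-day floor and the 56-day ceiling simply encode the one-week and eight-week guidance above.

```python
from math import ceil

MIN_DAYS = 7   # at least one full week
MAX_DAYS = 56  # roughly eight weeks

def estimate_duration_days(sample_per_variant, daily_visitors, num_variants=2):
    """Days until every variant reaches its sample, assuming an even split."""
    if sample_per_variant <= 0 or daily_visitors <= 0 or num_variants < 2:
        raise ValueError("need positive sample and traffic, and at least 2 variants")
    daily_per_variant = daily_visitors / num_variants
    days = ceil(sample_per_variant / daily_per_variant)
    # Never plan for less than a full week, even if the sample arrives sooner.
    return max(days, MIN_DAYS)

days = estimate_duration_days(31_234, daily_visitors=4_000)
print(f"{days} days, about {ceil(days / 7)} weeks")  # 16 days, about 3 weeks
print("too long" if days > MAX_DAYS else "within the eight-week guideline")
```

Adding a third variant to the same traffic lowers the daily visitors per variant, so the same per-variant sample takes proportionally longer to collect.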
Common Sample Size Mistakes
- Peeking and stopping early. Checking the dashboard daily and stopping the moment it looks significant inflates your false-positive rate far above the stated 5%.
- Chasing tiny MDEs. Wanting to detect a 1% relative lift on a 2% baseline can require several million visitors per arm. Be honest about the effect you can realistically achieve and measure.
- Ignoring multiple comparisons. Testing A/B/C/D means three comparisons against the control, which raises the chance of at least one false positive from 5% to roughly 14% unless you correct alpha. This calculator applies a Bonferroni adjustment automatically.
- Forgetting low baselines. A 0.5% conversion rate needs vastly more traffic than a 10% rate to detect the same relative lift.
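To make the multiple-comparisons point concrete, here is a small sketch of the arithmetic. It assumes each variant is compared only with the control and that the comparisons are independent. The function names are illustrative, and how the calculator counts comparisons is not stated in this guide.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha after a Bonferroni correction."""
    return alpha / comparisons

comparisons = 3  # A/B/C/D: variants B, C and D each compared with control A
uncorrected = family_wise_error_rate(0.05, comparisons)
adjusted_alpha = bonferroni_alpha(0.05, comparisons)
corrected = family_wise_error_rate(adjusted_alpha, comparisons)

print(round(uncorrected, 3))     # 0.143
print(round(adjusted_alpha, 4))  # 0.0167
print(round(corrected, 3))       # 0.049
```

The stricter per-comparison alpha keeps the overall false-positive risk near 5%, at the cost of a larger sample for each variant.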
How to Use the Result
Once the calculator gives you a per-variant sample, do three things before launching: confirm the duration is realistic for your traffic, lock the sample size and stopping rule in your test plan, and resist the urge to peek. If the numbers are impossible (say, six months to reach the required sample), that is valuable information too: pick a bigger MDE, consolidate variants, or move the test higher in the funnel where conversion volume is greater.
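When the required sample is out of reach, it can help to turn the question around and ask what lift your available traffic could detect. The sketch below uses a simplified approximation that applies the baseline variance to both arms, so it is a rough planning aid and not an exact inverse of the full formula. The function name and the example budget of 30,000 visitors per variant are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def approx_relative_mde(n_per_variant, baseline, alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with n visitors per variant."""
    if n_per_variant <= 0 or not 0 < baseline < 1:
        raise ValueError("need a positive sample and a baseline between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test assumed
    z_beta = NormalDist().inv_cdf(power)
    # Approximation: both arms share the baseline variance p * (1 - p).
    absolute = (z_alpha + z_beta) * sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return absolute / baseline

mde = approx_relative_mde(30_000, baseline=0.05)
print(f"{mde:.1%}")
```

For a 5% baseline and 30,000 visitors per variant this comes out at just under a 10% relative lift. If that is larger than any change you realistically expect, the test is not worth running in its current form.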
Expert Tips
Pick the smallest lift worth shipping
Your minimum detectable effect should be the smallest improvement that would actually change your decision. Set it too small and the test becomes impossibly large; too large and you miss real wins.
Plan duration, not just sample
Divide the required sample by your daily traffic to get a realistic timeline. Run at least one full week to capture weekday/weekend behaviour, and avoid tests that stretch past ~8 weeks.
Frequently Asked Questions
How many visitors do I need for an A/B test?
It depends on your baseline conversion rate and the minimum detectable effect. A 5% baseline detecting a 10% relative lift at 80% power and 95% confidence needs roughly 31,000 visitors per variant. Lower baselines and smaller effects require far more; enter your own numbers above for an exact figure.
What is a good minimum detectable effect for A/B testing?
Choose the smallest lift that would still be worth shipping. Most teams target a 5–20% relative lift. Smaller MDEs are more sensitive but require dramatically more traffic, so balance ambition against the sample size and duration you can afford.
Should I use 80% or 90% power?
80% is the conventional minimum and a fine default. Use 90% (or higher) for high-stakes decisions where missing a real winner is costly — it lowers your false-negative rate at the cost of a larger sample.
Why does a lower baseline conversion rate need a bigger sample?
Rare events carry more relative variance, so it takes more observations to separate a true effect from random noise. Detecting the same relative lift on a 1% baseline can need roughly ten times the traffic of a 10% baseline.
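A quick way to see this is to compare the relative variance of a conversion rate, (1 − p) ÷ p, at two baselines. For a fixed relative lift and small effects, the required sample is approximately proportional to that quantity. The sketch below is an approximation for intuition only, and the function names are illustrative.

```python
def relative_variance(p):
    """Variance of one visitor's outcome relative to the squared rate: (1 - p) / p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return (1 - p) / p

def approx_traffic_ratio(low_baseline, high_baseline):
    """Roughly how much more traffic the lower baseline needs for the same relative lift."""
    return relative_variance(low_baseline) / relative_variance(high_baseline)

print(round(approx_traffic_ratio(0.01, 0.10), 1))   # 11.0
print(round(approx_traffic_ratio(0.005, 0.10), 1))  # 22.1
```

Under this approximation a 1% baseline needs about eleven times the traffic of a 10% baseline, in line with the roughly tenfold figure above, and a 0.5% baseline needs about twenty-two times as much.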