What "statistical significance" actually means
"95% confidence" sounds like "we are 95% sure variant B is better." It is not. It is a much narrower claim, and confusing the two is why most teams ship false winners.
The actual definition
A 95% confidence result means: if there were truly no difference between the variants, we would see a result at least this lopsided less than 5% of the time by random chance alone.
That is not the same as "B is 95% likely to be better." It's a statement about how likely your data would be in a world where there's no real effect.
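To make the definition concrete, here is a minimal simulation sketch. Every number in it is made up for illustration: both variants truly convert at 10% (a world with no real effect), each arm gets 10,000 visitors, and we ask how often chance alone produces a gap at least as large as a hypothetical observed 0.9-percentage-point "lift".

```python
# Minimal sketch of the definition above. All numbers are hypothetical:
# both arms truly convert at 10%, 10,000 visitors per arm, and we count
# how often chance alone yields a gap >= the 0.9 pp lift we "observed".
import numpy as np

rng = np.random.default_rng(42)

TRUE_RATE = 0.10       # shared conversion rate in the "no difference" world
N_PER_ARM = 10_000     # visitors per variant (assumed)
OBSERVED_GAP = 0.009   # the lift we happened to see (assumed)
SIMULATIONS = 20_000   # number of simulated no-difference experiments

# Conversion counts for each simulated experiment, in each arm.
a = rng.binomial(N_PER_ARM, TRUE_RATE, size=SIMULATIONS)
b = rng.binomial(N_PER_ARM, TRUE_RATE, size=SIMULATIONS)
gaps = np.abs(b - a) / N_PER_ARM

share = np.mean(gaps >= OBSERVED_GAP)
print(f"share of no-difference worlds with a gap this lopsided: {share:.3f}")
```

The printed share is the simulated p-value: with these assumed numbers it lands around 0.03, which is what "significant at 95%" is actually asserting.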
Why this matters in practice
Two failure modes follow from misreading this:
- Stopping early. If you peek at results every day and stop the moment significance crosses 95%, you will declare winners that aren't real, again and again (the sketch after this list simulates how often). The math assumes you commit to a sample size up front.
- Confusing significance with impact. A variant that converts 0.05 percentage points better can become "significant" with enough traffic. Significant ≠ meaningful.
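Here is a rough simulation of the peeking failure mode. The traffic numbers and test are assumptions for illustration only: both arms truly convert at 10% (so every declared winner is false), 1,000 visitors per arm per day for 20 days, and a plain pooled two-proportion z-test checked either daily or once at the planned end.

```python
# Rough sketch of how peeking inflates false winners. Assumed numbers:
# both arms truly convert at 10%, 1,000 visitors per arm per day, 20 days.
# "Peeking" stops at the first day a two-sided z-test reports p < 0.05;
# "fixed" runs one test at the pre-committed end of the experiment.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
RATE, DAILY_N, DAYS, RUNS = 0.10, 1_000, 20, 2_000

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled z-test on two conversion counts."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

peeking_wins, fixed_wins = 0, 0
for _ in range(RUNS):
    a = rng.binomial(DAILY_N, RATE, size=DAYS).cumsum()  # running conversions, arm A
    b = rng.binomial(DAILY_N, RATE, size=DAYS).cumsum()  # running conversions, arm B
    n = DAILY_N * np.arange(1, DAYS + 1)                 # running visitors per arm
    daily_p = [two_proportion_p(a[d], n[d], b[d], n[d]) for d in range(DAYS)]
    peeking_wins += min(daily_p) < 0.05   # declared a winner at some peek
    fixed_wins += daily_p[-1] < 0.05      # declared a winner at the planned end

print(f"false winners when peeking daily:  {peeking_wins / RUNS:.1%}")
print(f"false winners with one final test: {fixed_wins / RUNS:.1%}")
```

With these assumptions the fixed-sample test stays near the promised 5% false-winner rate, while daily peeking pushes it several times higher.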
What to do instead
- Decide your minimum sample size before launching and stick to it (a back-of-the-envelope calculation follows this list).
- Look at the effect size first. A 5% lift that's not yet significant is more useful than a 0.1% lift that is.
- Run the test for at least one full business cycle (typically a week). Weekday/weekend traffic patterns will lie to you otherwise.
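One way to fix the sample size up front is the standard two-proportion power calculation, sketched below. The baseline rate, lifts, α = 0.05, and 80% power are illustrative assumptions, not recommendations; the contrast between the two calls is also the effect-size point above, since a tiny lift demands an enormous sample.

```python
# Back-of-the-envelope sample size per arm, using the standard
# two-proportion formula. Baseline, lifts, alpha, and power are
# illustrative assumptions, not recommendations.
from statistics import NormalDist

def sample_size_per_arm(baseline, lift, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect an absolute lift."""
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(variance * ((z_alpha + z_power) / (p2 - p1)) ** 2) + 1

# A meaningful lift is affordable to detect; a trivial one is not.
print(sample_size_per_arm(0.10, 0.01))    # 1 pp lift: about 14,700 per arm
print(sample_size_per_arm(0.10, 0.0005))  # 0.05 pp lift: about 5.7 million per arm
```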