What "statistical significance" actually means
"95% confidence" sounds like "we are 95% sure variant B is better." It is not. It is a much narrower claim, and confusing the two is why most teams ship false winners.
The actual definition
A 95% confidence result means: if there were truly no difference between the variants, we would see a result at least this lopsided less than 5% of the time by random chance alone.
That is not the same as "B is 95% likely to be better." It's a statement about how likely your data would be in a world where there's no real effect.
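To make the definition concrete, here is a minimal simulation sketch. Every number in it is made up for illustration: both variants truly convert at 10% (a world with no real effect), each arm gets 10,000 visitors, and we ask how often chance alone produces a gap at least as large as a hypothetical observed 0.9-percentage-point "lift".

```python
# Minimal sketch of the definition above. All numbers are hypothetical:
# both arms truly convert at 10%, 10,000 visitors per arm, and we count
# how often chance alone yields a gap >= the 0.9 pp lift we "observed".
import numpy as np

rng = np.random.default_rng(42)

TRUE_RATE = 0.10       # shared conversion rate in the "no difference" world
N_PER_ARM = 10_000     # visitors per variant (assumed)
OBSERVED_GAP = 0.009   # the lift we happened to see (assumed)
SIMULATIONS = 20_000   # number of simulated no-difference experiments

# Conversion counts for each simulated experiment, in each arm.
a = rng.binomial(N_PER_ARM, TRUE_RATE, size=SIMULATIONS)
b = rng.binomial(N_PER_ARM, TRUE_RATE, size=SIMULATIONS)
gaps = np.abs(b - a) / N_PER_ARM

share = np.mean(gaps >= OBSERVED_GAP)
print(f"share of no-difference worlds with a gap this lopsided: {share:.3f}")
```

The printed share is the simulated p-value: with these assumed numbers it lands around 0.03, which is what "significant at 95%" is actually asserting.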
Why this matters in practice
Two failure modes follow from misreading this:
- Stopping early. If you peek at results every day and stop the moment significance crosses 95%, you will declare winners that aren't real, again and again (the sketch after this list simulates how often). The math assumes you commit to a sample size up front.
- Confusing significance with impact. A variant that converts 0.05 percentage points better can become "significant" with enough traffic. Significant ≠ meaningful.
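Here is a rough simulation of the peeking failure mode. The traffic numbers and test are assumptions for illustration only: both arms truly convert at 10% (so every declared winner is false), 1,000 visitors per arm per day for 20 days, and a plain pooled two-proportion z-test checked either daily or once at the planned end.

```python
# Rough sketch of how peeking inflates false winners. Assumed numbers:
# both arms truly convert at 10%, 1,000 visitors per arm per day, 20 days.
# "Peeking" stops at the first day a two-sided z-test reports p < 0.05;
# "fixed" runs one test at the pre-committed end of the experiment.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
RATE, DAILY_N, DAYS, RUNS = 0.10, 1_000, 20, 2_000

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled z-test on two conversion counts."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

peeking_wins, fixed_wins = 0, 0
for _ in range(RUNS):
    a = rng.binomial(DAILY_N, RATE, size=DAYS).cumsum()  # running conversions, arm A
    b = rng.binomial(DAILY_N, RATE, size=DAYS).cumsum()  # running conversions, arm B
    n = DAILY_N * np.arange(1, DAYS + 1)                 # running visitors per arm
    daily_p = [two_proportion_p(a[d], n[d], b[d], n[d]) for d in range(DAYS)]
    peeking_wins += min(daily_p) < 0.05   # declared a winner at some peek
    fixed_wins += daily_p[-1] < 0.05      # declared a winner at the planned end

print(f"false winners when peeking daily:  {peeking_wins / RUNS:.1%}")
print(f"false winners with one final test: {fixed_wins / RUNS:.1%}")
```

With these assumptions the fixed-sample test stays near the promised 5% false-winner rate, while daily peeking pushes it several times higher.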
What to do instead
- Decide your minimum sample size before launching and stick to it (a back-of-the-envelope calculation follows this list).
- Look at the effect size first. A 5% lift that's not yet significant is more useful than a 0.1% lift that is.
- Run the test for at least one full business cycle (typically a week). Weekday/weekend traffic patterns will lie to you otherwise.
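One way to fix the sample size up front is the standard two-proportion power calculation, sketched below. The baseline rate, lifts, α = 0.05, and 80% power are illustrative assumptions, not recommendations; the contrast between the two calls is also the effect-size point above, since a tiny lift demands an enormous sample.

```python
# Back-of-the-envelope sample size per arm, using the standard
# two-proportion formula. Baseline, lifts, alpha, and power are
# illustrative assumptions, not recommendations.
from statistics import NormalDist

def sample_size_per_arm(baseline, lift, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect an absolute lift."""
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(variance * ((z_alpha + z_power) / (p2 - p1)) ** 2) + 1

# A meaningful lift is affordable to detect; a trivial one is not.
print(sample_size_per_arm(0.10, 0.01))    # 1 pp lift: about 14,700 per arm
print(sample_size_per_arm(0.10, 0.0005))  # 0.05 pp lift: about 5.7 million per arm
```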