The Null World

A p-value is one counted number — and almost no one, including the people who report it, can say what it counts. Here is the whole thing, made of nothing but coin flips: assume nothing is going on, spin that world thousands of times, and count how often blind chance does what you saw or more. That fraction is the p-value. Watch it assemble itself.

The null-world simulator

You flipped a coin 100 times and got 60 heads. Does that prove it's biased — or is a fair coin just doing what fair coins do?

Number of flips n = 100

Heads you observed 60 of 100

35heads from a fair coin →65

fair-coin outcomes as extreme as yours, or more your result

null worlds spun

as extreme, or more

p — counted fraction

—

p — exact binomial

0.0569

A fair coin lands this far from an even split — or further — about 5.7% of the time.

That number you watched build — the orange fraction — is the entire definition: the probability that a fair coin produces a result at least this extreme. Written out, p = P( data this surprising given the null is true ). The simulation just counts it; the exact binomial sum (in green) is where the count is heading.

The default already hides the first surprise. 60 heads out of 100 feels obviously rigged — yet a fair coin does that or better about 5.7% of the time, a hair over the sacred 0.05 line. Nudge the slider to 62 and it slips under. The line is a convention someone chose in the 1920s, not a fact about the coin. Drag the dials and feel where "significant" actually lives.

Three things a p-value is not

Every misuse below is in textbooks, press releases, and papers. Each one is undone by a number you can recompute here.

Misreading 1 — the inverse

Not the probability the coin is fair

The p-value runs one direction: it assumes the coin is fair and asks how surprising your data is. The thing people want — the chance the coin is fair given the data — runs the other way, and you can't flip a conditional probability without a prior. (This is the exact inversion that Bayes' theorem exists to fix.)

How far apart are the two directions? A much-cited calibration (Sellke, Bayarri & Berger, 2001) gives a lower bound: even being as generous as the mathematics allows to the "it's biased" verdict, a result sitting at p = 0.05 still leaves the coin fair at least ≈ 29% of the time. Not 5%. The p-value is not your error rate.

Misreading 2 — significance is not size

Not a measure of how biased the coin is

A barely-bent coin becomes "statistically significant" if you flip it enough. Significance measures detectability — effect size tangled up with sample size — never the size of the effect alone. Hold a real bias fixed and just add flips:

significance versus sample size

A truly biased coin 52% heads

If you flip it this many times n = 100

p = 0.76 not significant

The effect never changes; only n does. A 52% coin needs about 2,400 flips before it crosses 0.05; a 51% coin, nearly 10,000. A tiny p means a detectable effect, not a big one.

Misreading 3 — the silence

p > 0.05 does not prove the coin is fair

A non-significant result is not a verdict of "no effect." It can equally mean "real effect, not enough flips to see it." Absence of evidence is not evidence of absence — and usually you cannot tell the two apart from the p-value alone.

Concretely: a coin that truly lands heads 55% of the time, flipped 50 times, fails to reach significance ≈ 92% of the time. The bias is real and sizeable; the experiment is just blind to it. Reading that silence as "the coin is fair" gets it backwards nine times in ten.

The check — show the working

Both p-values on the simulator are recomputed live, exactly. The green figure is the two-sided binomial tail — the probability mass of every outcome at least as far from an even split as yours — summed from exact binomial probabilities built up by the recurrence P(k+1) = P(k)·(n−k)/(k+1) (no normal approximation, no library). The orange figure is the Monte-Carlo estimate from the histogram you watched; it converges to the green one, which is the point: a p-value is a counted tail-fraction.

The three anchor numbers — ≈29% (Sellke–Bayarri–Berger), ≈2,400 flips for a 52% coin, ≈92% miss-rate for a 55% coin at n=50 — are each recomputed from first principles, offline, by verify-p-value.mjs (notebook in research/p-value/), and the verifier also drives this page headless to confirm the default reads 0.0569 and that the live exact value matches its own binomial sum to 6 decimals.

The conventions, named honestly. "Two-sided" has more than one definition; with a fair-coin null (p₀ = ½) the distribution is symmetric and they all agree, so the figure is unambiguous — we say so rather than hiding the choice. The 29% bound assumes a point null and equal prior odds and is a minimum over a class of alternative priors (real posteriors are often higher). The "2,400 flips" figure assumes you observe exactly the expected count, the cleanest case; real samples scatter. These are conventions of inference, not measurements of the world — flagged so you can see exactly what is claimed.