How many words did Shakespeare know?
The Words He Never Used
Across everything he wrote, Shakespeare used about 31,000 different words. But the more interesting number is the one you can't read off the page: how many words he knew and simply never happened to write down. Remarkably, you can estimate it — from the words he used exactly once. Here's the method, recomputed live, on a canon you can download and check yourself.
In 1976 two statisticians, Bradley Efron and Ronald Thisted, asked this question and answered it with a tool built for counting species (butterflies in a net, not words on a page).[1] The link is exact: a word you've seen only once is a hint that the author has many more like it, still uncollected. Count the once-only words, the twice-only words, and so on, and the shape of that list tells you how much vocabulary is hiding just out of frame.
First, look at how often each word appears
Take the complete works, lowercase everything, and tally how many distinct words occur exactly once, exactly twice, exactly three times, and so on. This list — the frequency spectrum — is the raw material for everything that follows. Its shape is startling.
[Figure: the frequency spectrum of the canon. Each bar shows how many distinct words appear exactly x times.]
10,164 distinct words appear exactly once. That's 37.8% of the 26,909 distinct words in the corpus used here.
Words used only once have a name as old as scholarship itself: hapax legomena, Greek for "said only once." Nearly two in five of Shakespeare's distinct words are hapaxes — bubukles, anthropophaginian, honorificabilitudinitatibus. That long, fat tail of rare words is not noise. It is the signal. The more of an author's vocabulary you meet only once, the more of it you haven't met at all.
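The tally described above can be sketched in a few lines of Python. This is a minimal sketch, not the page's own tokenizer: the regular expression and the function name `frequency_spectrum` are assumptions made for illustration, and a different rule for splitting words will give different counts.

```python
import re
from collections import Counter

def frequency_spectrum(text):
    """Return (token count, distinct-word count, spectrum).

    spectrum[x] is the number of distinct words that occur exactly x times.
    The tokenizer (lowercase, runs of letters, internal apostrophes kept)
    is an assumption for this sketch.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    word_counts = Counter(words)                # word -> occurrences
    spectrum = Counter(word_counts.values())    # occurrences -> number of words
    return len(words), len(word_counts), dict(spectrum)

sample = "To be, or not to be, that is the question"
n_tokens, n_types, spectrum = frequency_spectrum(sample)
print(n_tokens, n_types, sorted(spectrum.items()))  # 10 8 [(1, 6), (2, 2)]
print(f"hapax share: {spectrum[1] / n_types:.1%}")  # hapax share: 75.0%
```

On a single line of verse most words are hapaxes, which is why the share only becomes informative on a corpus the size of the complete works.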
Now imagine a second Shakespeare
Suppose we discovered a fresh, equally large body of Shakespeare: another 36 plays and 154 sonnets, written but lost. How many words would appear in it that never appear in the canon we have? Good and Toulmin (1956) gave the estimate, and it is almost eerily simple. It's just the frequency spectrum with alternating signs:[2]
new words(t) = n₁·t − n₂·t² + n₃·t³ − n₄·t⁴ + …
where nₓ is the number of words used exactly x times, and t is the size of the new corpus relative to the old. At t = 1 you're asking the original question: a second canon the same size.
[Interactive figure: the unseen-species estimator. At t = 1 it reads 7,643 brand-new words expected in a second canon of equal size.]
On the freely downloadable text this page uses, a second equal-sized Shakespeare would bring an estimated 7,643 never-before-seen words. Run on Efron and Thisted's 1976 word counts, the same formula gives 11,434, almost exactly the 11,430 they published.[1] The two numbers differ, and the gap is explained in the check at the end of this page: a different corpus and a different rule for what counts as a word. But notice first: the method doesn't care which corpus you feed it. The estimate is genuine, not fitted.
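The alternating series above is short enough to write out directly. The sketch below evaluates it as a plain sum, with no acceleration, on a small made-up spectrum; the name `good_toulmin` and the toy counts are assumptions for illustration and are not Shakespeare's figures.

```python
def good_toulmin(spectrum, t):
    """Plain alternating sum n1*t - n2*t^2 + n3*t^3 - ...

    spectrum maps x to n_x, the number of words seen exactly x times;
    t is the size of the new corpus relative to the old one.
    """
    return sum((-1) ** (x + 1) * n_x * t ** x for x, n_x in spectrum.items())

# Hypothetical toy spectrum: 10 words seen once, 4 twice, 2 three times, 1 four times.
toy_spectrum = {1: 10, 2: 4, 3: 2, 4: 1}

print(good_toulmin(toy_spectrum, 1.0))  # 10 - 4 + 2 - 1 = 7.0
print(good_toulmin(toy_spectrum, 0.5))  # 5 - 1 + 0.25 - 0.0625 = 4.1875
```

At t = 1 the estimate reduces to n₁ − n₂ + n₃ − …, which is why the count of once-only words dominates the answer.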
Why you can't just keep dragging
Push t past 1 and the curve starts to misbehave; past about t = 1.3 the formula explodes. Efron's own word is that it "diverges for t greater than 1."[3] This is not a bug to hide; it is the truth about how far the data can see. The once-only words let you estimate one more canon's worth of vocabulary, and no further. Which is exactly why the famous headline number, the one everyone quotes, needs more than this honest little series.
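The blow-up is easy to reproduce on synthetic data. The sketch below tracks the running total of the plain series on a hypothetical, slowly decaying spectrum (an assumption chosen only because its tail is heavy; it is not Shakespeare's spectrum) and compares how much the last term moves the total at t = 1 and at t = 2.

```python
def partial_sums(spectrum, t):
    """Running totals of n1*t - n2*t^2 + n3*t^3 - ... in order of x."""
    total, sums = 0.0, []
    for x in sorted(spectrum):
        total += (-1) ** (x + 1) * spectrum[x] * t ** x
        sums.append(total)
    return sums

# Hypothetical heavy-tailed spectrum, NOT Shakespeare's counts.
toy_spectrum = {x: 10000 / (x * (x + 1)) for x in range(1, 41)}

for t in (1.0, 2.0):
    sums = partial_sums(toy_spectrum, t)
    swing = abs(sums[-1] - sums[-2])  # size of the last term added
    print(f"t = {t}: last partial sum {sums[-1]:.3g}, last swing {swing:.3g}")
```

At t = 1 the terms keep shrinking, so the running total settles. At t = 2 each term is multiplied by a further factor of 2 while nₓ falls only slowly, so the terms grow and the total swings ever more wildly.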
"He knew 66,000 words" — the number with an asterisk
The question people actually remember is not "a second canon" but the unbounded one: how many words did Shakespeare know in total, used or not? That's t → ∞, and the simple series can't reach it, because it diverges. To answer it at all, Efron and Thisted had to add a model: a parametric assumption about how a writer's rare words are distributed, plus a linear program to bound the answer from below.[1]
The popular "Shakespeare knew ~66,000 words" comes from that model: the roughly 31,000 words he used plus a lower bound of about 35,000 he never did. Treat it with the caveat Efron himself insists on: the figure is a lower bound, and it rides on a modelling choice. As he puts it, "there can't be a way to pin down" the true total.[3] The clean, model-free part of this whole exercise is the doubling estimate above: about 11,000 more words in one more canon. Everything past a doubling is an extrapolation wearing a confidence it hasn't quite earned, and the honest version says so.
The estimator caught a forgery attempt — or didn't
In 1985 a Shakespeare editor announced a "newly discovered" nine-stanza poem, Shall I die?, and the field erupted. Thisted and Efron realised their tool could weigh in.[4] If the poem were really Shakespeare's, its rare and brand-new words should appear at the rate his spectrum predicts. The poem is only 429 words long; here is what they predicted versus what the poem actually contained:
| words in the poem that were… | predicted | observed |
|---|---|---|
| never used by Shakespeare before | 6.97 | 9 |
| used by him exactly once before | 4.21 | 7 |
| used exactly twice before | 3.33 | 5 |
| used exactly three times before | 2.84 | 4 |
The fit is good: the poem has about the right number of Shakespearean rarities (the nine new words include admiration, besots, exiles, tormentor, twined). On the rare-word test, authorship could not be ruled out.[4] But read the caveat in the numbers: in every row the poem is a little richer in rare words than predicted. A consistency test is not a fingerprint; the poem has never entered the canon, and the statistics settle nothing on their own. The instrument's honesty is precisely that it returns "consistent with," not "proven."
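One rough way to read the table is to ask, row by row, how surprising the observed count would be if the prediction were right. The sketch below does that under an assumption of its own: each count is treated as a Poisson variable with the predicted value as its mean. That is a simplification for illustration and is not presented here as the published test; the helper name `poisson_upper_tail` is likewise hypothetical.

```python
import math

# (label, predicted, observed) rows copied from the table above.
rows = [
    ("never used before", 6.97, 9),
    ("used once before", 4.21, 7),
    ("used twice before", 3.33, 5),
    ("used three times before", 2.84, 4),
]

def poisson_upper_tail(mean, k):
    """P(X >= k) for X ~ Poisson(mean). The Poisson model is an assumption."""
    below = sum(math.exp(-mean) * mean ** j / math.factorial(j) for j in range(k))
    return 1.0 - below

for label, predicted, observed in rows:
    ratio = observed / predicted
    tail = poisson_upper_tail(predicted, observed)
    print(f"{label:<24} ratio {ratio:.2f}  P(X >= observed) {tail:.2f}")
```

Under that assumption no single row is surprising at the 5% level, yet every ratio sits above 1, which is the "a little richer than predicted" pattern the text describes.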
And no, he didn't invent 1,700 words
You'll have read that Shakespeare coined some 1,700 words — assassination, eyeball, lonely, bedazzle. The number comes from counting the words for which the Oxford English Dictionary lists him as the earliest recorded user. But "earliest recorded" is a fact about the dictionary's filing cabinet, not about the English language.
The OED's Victorian readers were steeped in Shakespeare. Mary Cowden Clarke's 1845 concordance made his works uniquely searchable decades before anyone else's, so he collected roughly 33,000 quotations, far more than equally inventive contemporaries like Nashe or Marlowe.[5] It's a streetlight effect: he gets first credit because that's where the lexicographers were looking.
So: a man who used about 31,000 words, knew at least 35,000 more he never needed, and invented far fewer than the legend says. The dazzling part isn't a coinage count. It's that the words he used once — the throwaways, the hapaxes — are a window onto the words he never used at all. The negative space of a vocabulary has a measurable shape.
The check — every number here is recomputed, not quoted
- The live instrument computes its claim in your browser. The frequency spectrum and both estimator readouts are produced from word-counts embedded from research/shakespeare-vocabulary/; the estimator (Euler-accelerated alternating series) runs on the fly as you drag.
- Our own corpus, fully reproducible. From Project Gutenberg's Complete Works (eBook #100), our stated tokenizer counts N = 968,242 tokens, V = 26,909 distinct words, 10,164 hapaxes (37.8%). A second equal canon → ≈ 7,643 new words. Re-run node verify.mjs to regenerate every figure.
- The historical headline reproduced. Running the same estimator on Efron & Thisted's published spectrum (n₁=14,376, n₂=4,343, …) yields 11,434.5 — matching their reported 11,430 to 0.04%, and already converged by the 8th term (so the unpublished tail can't move it).
- Why our 7,643 ≠ their 11,430. Not an error — a different corpus and a different rule for "what counts as a word." Spevack's concordance counts 31,534 distinct types where our naive lowercasing finds 26,909. The estimator is invariant; the inputs are not. Counting words is itself not a settled operation — which is the point.
- The method's own limit is shown, not hidden. The series diverges past t ≈ 1.3 (verified: the t=2 value is >80× the t=1 value); that's why the "words he knew but never used" figure is flagged as a model-dependent lower bound.
- Self-test: the Euler summation reproduces ln 2 from the alternating harmonic series to 1×10⁻¹². Offline verifier: 14/14 checks pass.
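The ln 2 self-test in the last bullet can be reproduced with a short Euler-transform routine. This is a minimal sketch of one standard form of Euler summation, not the code in verify.mjs; the function name `euler_sum` and the choice of 50 terms are assumptions.

```python
import math

def euler_sum(terms):
    """Euler-transform estimate of terms[0] - terms[1] + terms[2] - ...

    Adds (n-th forward difference at index 0) / 2**(n + 1) for each n,
    where one differencing step maps a[k] to a[k] - a[k + 1].
    """
    diffs = [float(a) for a in terms]
    total, weight = 0.0, 0.5
    while diffs:
        total += weight * diffs[0]
        diffs = [diffs[i] - diffs[i + 1] for i in range(len(diffs) - 1)]
        weight *= 0.5
    return total

# Alternating harmonic series 1 - 1/2 + 1/3 - ... = ln 2.
harmonic = [1.0 / (k + 1) for k in range(50)]
plain = sum((-1) ** k * a for k, a in enumerate(harmonic))
accelerated = euler_sum(harmonic)

print(abs(plain - math.log(2)))        # plain partial sum: off by about 1e-2
print(abs(accelerated - math.log(2)))  # Euler transform: below 1e-12
```

The same routine could be pointed at the estimator by passing the terms nₓ·tˣ in order of x, though how the page itself orders and truncates those terms is not stated here.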