Satoshi Nakamoto Stylometric Analysis: 'Where is Satoshi?' — Bas van Dorst's 75,000-Author Open-Source Comparison Dataset (April 13, 2024)

On April 13, 2024, Bas van Dorst released “Where is Satoshi?”, a large-scale, open-source stylometric comparison of Satoshi Nakamoto’s writing corpus against more than 75,000 cryptography mailing-list authors and over 70,000 Reddit /r/Bitcoin commenters. Judged by corpus size and by the transparency of its data release, the project is the most rigorous numerical multi-candidate stylometric resource on Satoshi authorship in the public record.

Corpus

Mailing-list corpus: 500,000+ posts across 10+ cryptography-related mailing lists, covering 75,000+ authors writing between 1992 and 2000.
Reddit corpus: 7,500,000+ comments from /r/Bitcoin, covering 70,000+ authors between 2005 and 2019.
Satoshi corpus: 81,500 words across the Bitcoin whitepaper, the BitcoinTalk forum posts, private email correspondence, and code comments from the v0.1 source release.

Stylometric metrics

For every (author, time-window) pair the project computes:

Metric	What it measures
N-gram analysis (1/2/3-gram)	Vocabulary and phrase patterns
Burrows’ Delta	Standard distance metric in computational stylometry
Jaccard similarity	Set-overlap measure across vocabulary
Five readability indices	Flesch, Gunning Fog, Dale-Chall, Coleman-Liau, SMOG
Punctuation patterns	Hyphenation conventions, double spacing after periods (later highlighted in the Carreyrou NYT investigation)
Word-length frequency distributions	Lexical-length signature
Personal pronoun usage	First-person singular vs. plural, presence/absence patterns
British vs. American spelling variants	Including the British-spelling tells in Satoshi’s writing that have driven multiple identification hypotheses
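
As an illustration of two of the metrics listed above, the following is a minimal sketch of Burrows’ Delta and Jaccard similarity. It is not van Dorst’s code: the tokenizer, the `n_features` cutoff, the use of the population standard deviation, and the inclusion of the target text in the corpus statistics are all simplifying assumptions made for this example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_delta(target, candidates, n_features=50):
    """Return {name: delta} for each candidate text against the target text."""
    docs = {"__target__": tokenize(target)}
    docs.update({name: tokenize(text) for name, text in candidates.items()})

    # Feature set: the most frequent words in the pooled corpus.
    pooled = Counter()
    for tokens in docs.values():
        pooled.update(tokens)
    vocab = [word for word, _ in pooled.most_common(n_features)]

    # Relative frequency of each feature word in each document.
    freqs = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        freqs[name] = [counts[w] / total for w in vocab]

    # z-score each word's frequency against the corpus mean and spread.
    columns = list(zip(*freqs.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]

    def zscores(vec):
        return [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(vec, mus, sigmas)]

    # Delta is the mean absolute difference between the two z-score profiles.
    zt = zscores(freqs["__target__"])
    return {
        name: mean(abs(a - b) for a, b in zip(zt, zscores(freqs[name])))
        for name in candidates
    }


def jaccard(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

A lower Delta means the two frequency profiles are closer; a Jaccard value nearer 1 means more vocabulary overlap. Neither number on its own says anything about shared authorship.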

Data release

The complete numerical output is released as downloadable spreadsheets:

XLSX aggregate: 40 MB.
CSV raw per-chunk: 240 MB.

This is the largest publicly released numerical dataset of Satoshi-vs-candidates stylometric comparisons. It is the only resource in the field that allows independent re-analysis: a researcher with a different distance metric or a different candidate-pruning strategy can run their own ranking against the same underlying numbers.
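
A re-ranking of this kind might look roughly like the sketch below. The column names `author` and `delta` are hypothetical placeholders, since the actual layout of the released CSV is not described here, and the pruning rule (a minimum number of chunks per author) is just one possible strategy.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def rank_authors(csv_file, min_chunks=10, author_col="author", score_col="delta"):
    """Rank authors by mean per-chunk distance, closest first.

    author_col and score_col are hypothetical column names; adjust them
    to whatever headers the downloaded per-chunk CSV actually uses.
    """
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        scores[row[author_col]].append(float(row[score_col]))

    # Pruning step: drop authors with too little text to compare reliably.
    kept = {a: mean(v) for a, v in scores.items() if len(v) >= min_chunks}
    return sorted(kept.items(), key=lambda item: item[1])


# Toy rows for illustration only; these are not values from the dataset.
toy = io.StringIO("author,delta\nauthor_a,1.0\nauthor_a,2.0\nauthor_b,0.5\n")
ranking = rank_authors(toy, min_chunks=2)
```

With the toy rows, `author_b` is pruned for having a single chunk even though its one score is the lowest, which shows how much the pruning threshold alone can change a ranking.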

Author’s framing

Van Dorst explicitly refrains from naming a leading candidate:

Yes, I have a short-list of suspects. No, I’m not going to drop names here because I’m not 100% sure.

He also states that he has no personal stake in the answer:

There is no personal interest in Satoshi’s real identity. I sold my stake in 2012, way before the hype started.

The combination of a large dataset, a full numerical release, and no push toward an identification is methodologically more rigorous than the narrative-driven stylometric pieces that have named specific candidates: Skye Grey (2013) for Szabo, the Aston University study (2014) for Szabo, and Cafiero’s analysis in Carreyrou’s 2026 NYT investigation for Adam Back.

Methodological caveats stated by the author

“High correlation only shows similar language patterns → high correlation does not imply causation.” Two authors writing about the same technical subject in the same era will share vocabulary and sentence structures regardless of whether they are the same person.
N-gram analysis is complicated by terminology emerging only after Bitcoin’s 2009 release. Comparisons to post-2010 writing on cryptocurrency topics can be confounded by Bitcoin-specific vocabulary that any informed writer will have adopted.
Email-thread reply extraction may inadvertently include quoted text from earlier replies, misattributing language to the wrong author.
The last data chunk in any time-windowed analysis is often incomplete, potentially skewing aggregated results.
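
The last two caveats can be made concrete with a small sketch. The heuristics below (dropping lines that start with `>` and attribution lines of the form "On ... wrote:", and discarding a trailing chunk shorter than the chunk size) are assumptions chosen for illustration; the project’s actual extraction and chunking rules are not described here.

```python
import re

# Matches common reply attribution lines such as "On Mon, Jan 5, Alice wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted(body):
    """Remove quoted reply lines and attribution lines from an email body."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier message
        if ATTRIBUTION.match(line.strip()):
            continue  # "On ... wrote:" header introducing the quote
        kept.append(line)
    return "\n".join(kept).strip()


def full_chunks(tokens, size):
    """Split tokens into fixed-size chunks, dropping an incomplete final chunk."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]
```

Simple filters like these miss quotes that lack a `>` prefix (for example, top-posted replies that paste the original inline), which is one way misattributed text can still slip through.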

These caveats are the same structural limitations that constrain all stylometric Satoshi-identification work; the project’s value is in stating them explicitly and providing the raw numerical data that lets a critic test them.

Position in the stylometric record

The table below sets this corpus against the named-candidate stylometric tradition:

Study	Year	Candidate scope	Top match	Numerical data public
Skye Grey	2013	Szabo (single-hypothesis)	Szabo	No (narrative phrase list)
Aston University	2014	11 candidates	Szabo	No (results in press releases only)
Cafiero / Carreyrou NYT	2026	12 candidates (focus); 620 (broader)	Adam Back (Hal Finney near tie)	No (results summarized in NYT article)
van Dorst	2024	75,000+	Not named publicly (Szabo top of 5 named candidates per Bitcoin Institute reanalysis)	Yes — 280 MB CSV/XLSX

Van Dorst declines to publicly name a candidate, while the narrative-driven studies do. Read alongside the Bitcoin Institute reanalysis, the studies mostly converge on Szabo (3 of 4), with Cafiero / Carreyrou as the Adam Back outlier. Taken together, this is itself a methodological observation: stylometric Satoshi identification is sensitive to corpus selection, distance metric, and candidate pre-selection, but the most persistent signal across methods places Szabo highest among the named candidates.

Per-candidate values from this corpus (Bitcoin Institute reanalysis)

Although van Dorst declines to publish a ranking, the underlying numerical data is downloadable and the named identity-hypothesis candidates can be located in it. The Bitcoin Institute reanalysis (May 2026) extracts Burrows’ Delta values for the five most-cited named Satoshi-identity candidates from the published comparison.xlsx and ranks each against the 12,739 authors in the corpus with at least 10 chunks of writing. It then reads van Dorst’s “I’m not 100% sure” caveat against the data. Nick Szabo leads the named group at top 4.67%, Hal Finney and Adam Back follow within 0.85 standard deviations, and 594 unnamed authors rank closer than Szabo. Most of the corpus’ apparent top matches turn out on inspection to be noise (e-commerce accounts, anonymous remailer relays, disposable accounts) rather than signal. See the analysis entry for the per-candidate table, the corpus’ top-20 noise discussion, and the four-reading interpretation of why van Dorst’s silence on names is the data-honest position.
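
The "top 4.67%" figure and the count of 594 closer authors are consistent with each other under one simple reading: if 594 authors rank closer, Szabo sits at rank 595 of 12,739. The sketch below checks that arithmetic. The rank-over-population definition of "top percent" is an assumption here, not a formula quoted from the reanalysis.

```python
def top_percent(authors_closer, population):
    """Percentile position of an author, assuming rank = authors_closer + 1."""
    rank = authors_closer + 1
    return 100 * rank / population


# Figures taken from the text above: 594 closer authors, 12,739 ranked authors.
szabo_top_percent = round(top_percent(594, 12739), 2)  # 4.67
```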

For the analytical treatment of stylometric methods in Satoshi identification more broadly, see the relevant identity-hypothesis entries: Nick Szabo, Adam Back, and the identity-hypotheses overview.

This dataset is the foundational evidence for two later readings. The 2026 Bitcoin Institute reanalysis of named candidates is built entirely on this corpus — extracting the five named identity-hypothesis candidates and ranking them within the 12,739-author distribution derived from this dataset, citing it across its opening paragraph, §1.1 source-data section, §2.3 timeline, and §3 noise discussion. The identification-asymmetry analysis references the dataset in its §1 timeline as the 2024 technical attempt and again in §1.3 with the specific numerical findings (the 75,000-author corpus and the 594 unnamed authors closer than Szabo) that anchor its named-vs-unnamed asymmetry argument.
