For those who don't know, the 1911 Britannica is heralded for several reasons (and rightly criticized for regrettable others), but the most well-known is that it was the last encyclopedia before The Great War, and hence had a good amount of steam/optimism coming from the first and second industrial revolutions and the "Progressive Era", not sullied yet by thoughts of "the war to end all wars".
Trying https://britannica11.org specifically, it quickly found and displayed the article I searched for, chosen (to search for) at random: Portuguese East Africa, at https://britannica11.org/article/22-0177-portuguese-east-afr...
A question/idea for nice-to-haves, most respectfully. I don't know if it would be feasible. It's probably perfect as it is, simply linking to the image-page in unobtrusive text for each section. But I would love an option (emphasis on option) to see the text side by side with the page images. That parallel view would load all of the page images on the same page as the full article text. That way, I could "confirm" or "fact check" the faithfulness of the OCR, and also see the beautiful printing, at once, without opening each page separately and managing the images/windows myself. Most likely, I would use the site to jump to the articles, and read them mainly as images, only switching to the text form to verify what something said, or to copy-paste cleanly, etc. (As it is, initially, I thought I read the original images were available, but had to visit the page three (3!) times before finding where the side-links to them were.) Maybe thumbnails could be a middle-ground option (again, optional) for salience.
Very, very well done. And it's fast!
You can already do that on Wikisource. For example, here's p. 658 from the entry on "Molecule":
https://en.wikisource.org/wiki/Page:EB1911_-_Volume_18.djvu/...
Also OP: I noticed some fidelity issues in your version (at https://britannica11.org/article/18-0684-s2/molecule). For example parts of the math formula under the line that ends with "the molecules of other kinds" ([1]) are missing (compare [2]). Also, in your version fn. 1 of this article is attached to "as they have always done" ([3]) but it should actually be attached to "Atom" on p. 654 ([4]):
[1] https://britannica11.org/article/18-0684-s2/molecule#:~:text...
[2] https://en.wikisource.org/wiki/Page:EB1911_-_Volume_18.djvu/...
[3] https://britannica11.org/article/18-0684-s2/molecule#:~:text...
[4] https://en.wikisource.org/wiki/Page:EB1911_-_Volume_18.djvu/...
As an example flow (since it took a minute to figure out): we can start at https://en.wikisource.org/wiki/1911_Encyclopædia_Britannica then click to navigate/browse volume > section > topic to get to a text page, then click the Source tab, then click a page number (maybe hunt around for the correct page number), and see the parallel view, text + image. Previous and next page buttons are available there, and they retain the parallel text + image view.
That’s a great suggestion. A side-by-side text + page view would be very nice for exactly the reasons you mention (verifying the text and seeing the original layout). I haven’t built that yet, but I’ve considered it.
Also helpful to hear that the links to the scans weren’t immediately obvious — I should probably make them a bit clearer. This may also not be obvious, but you can click the vol:page links in the left margin and go directly to the scan of whatever page you're reading.
Thanks again.
What it does:
– ~37k articles reconstructed from the original volumes
– section-level structure (contents are clickable within articles)
– cross-references extracted and linked
– contributors indexed and searchable
– original volume + page references preserved and shown while reading
– links to the original scans for each page
– ancillary material included (prefaces, abbreviations, etc.)
– topic index reproduced and cross-linked
– full-text search with article metadata (length, volume, etc.)
Most of the work was in parsing and reconstruction: headings, multi-page articles, tables, math, languages, footnotes, plates, and all the small edge cases that come up in a work like this.
The goal was to make something that feels like the original, but is actually usable.
I’d especially appreciate feedback on:
– search quality
– navigation (sections, cross-references)
– anything that looks structurally off
Happy to answer questions about the pipeline or data model.
A few things... when I click an article and try to jump to a new topic, the top search box (labeled "Search titles and full text...") doesn't work. Second, when I first came to the site, I was a bit stuck. It took a bit of time to realize I need to click on "Articles" or even "Topics" to start browsing. Not sure why, maybe I expected the image to let me enter the site somehow...?
The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.
If you have a specific use case in mind (especially for training), I’d be interested to hear more.
Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopedia Britannica.
I would use the entries as training data and train it to talk in the same style. There isn't a specific use case for why, I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of 1911 Britannica.
[1] https://stateofutopia.com/encyclopedia/
[2] To talk like a pirate! https://www.youtube.com/live/WuCxWJhrkIM
The underlying text is public domain, but the structured version here is something I put together for the site. I haven’t released a bulk dataset yet.
If you end up experimenting with it, I’d love to hear how it turns out — and I’m still figuring out what structured access might look like.
Another reason would be to be able to keep running/using it even if the main site were to go down eventually for whatever reason; or to operate a mirror of it, for redundancy (linking back to the original, of course).
What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.
They only release books that are in the public domain.
"In the case of girls, let them run, leap and climb with their brothers for the first twelve years or so of life. But as puberty approaches, with all the change, stress and strain dependent thereon, their lives should be appropriately modified. Rest should be enforced during the menstrual periods of these earlier years, and milder, more graduated exercise taken at other times. In the same way all mental strain should be diminished. Instead of pressure being put on a girl’s intellectual education at about this time, as is too often the case, the time devoted to school and books should be diminished. Education should be on broader, more fundamental lines, and much time should be passed in the open air."
On Reading Old Books C. S. Lewis
https://bradleyggreen.com/attachments/article/97/Lewis.On-Re...
Many people practice it, and women’s movements that put most energy on doing the opposite have since dialed back to pointing out that they were fighting for choice, including that choice of not being in a workforce. An option of a “soft life” that is wildly popular, and timeless. People just needed a new way to say it.
If it was culturally supported for men to be subsidized by another, a large percentage of men would immediately take that graduated and intellectually diminished role too. This is not a reliable option and is rare.
If common, it would unironically solve representation imbalances in other fields, since it would no longer be about shoehorning women into them, because enough men would leave on their own. A level of enlightenment still missing from Women in <field> fireside chats at every industry conference worldwide
Sometimes the work is the POINT. We read things like this not just to learn about the past, but for novelty and to exercise our critical thinking powers. To outsource that labor before even trying is like going to the gym and having your butler lift the weights. The weights got lifted, but what was really accomplished?
But whatever the reason is why the ideas have fallen out of fashion, it can broaden the mind to encounter them.
I've had a ton of fun learning about BaseX and XQuery to ask questions like "Which classical authors are responsible for writing words that appear only once in the entire corpus (hapax legomena)?" or "What are the longest hapax words?" (usually the funniest ones) and that kind of thing. Shout out to Tufts University for making this available!
I would love to be able to load the 1911 Britannica into BaseX and see what interesting things I could learn about it via XQuery!
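For anyone who wants to try the hapax idea without setting up BaseX, here is a rough Python sketch of the same question. It is only an illustration: the `articles` dictionary is a made-up stand-in for whatever structured text you have, and the tokenizer (lowercase runs of letters) is a simplifying assumption, not how any of the sites mentioned here process their text.

```python
import re
from collections import Counter


def hapax_legomena(texts):
    """Return words that occur exactly once across all texts, longest first."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters (accented letters included).
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    hapaxes = [word for word, n in counts.items() if n == 1]
    # Sort by descending length, then alphabetically for a stable order.
    return sorted(hapaxes, key=lambda w: (-len(w), w))


# Hypothetical stand-in corpus: article title -> plain text.
articles = {
    "STAR": "The stars shine; the background of the sky is dark.",
    "MOLECULE": "The molecules of a gas move; the gas is not dark.",
}

longest = hapax_legomena(articles.values())[:3]
print(longest)
```

Attributing each hapax to its author or article would need one more pass that records where each word was first seen.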
People asking for dataset access has definitely been one of the themes of this thread. I’m taking that seriously. If I do expose it, I’d want to do it in a form that preserves the structure and doesn't just dump plain text.
Take the article about Copenhagen as an example: https://britannica11.org/article/07-0111-copenhagen/copenhag... The geography and key points of interest are described very accurately, but the authors aren’t shy about inserting emotionally charged adjectives and personal opinions on what they consider interesting or curious. Also, the huge portion about the Battle of Copenhagen at the bottom is a complete departure and shifts the genre from a geographical description to the shot-by-shot narration of a naval battle.
You get that mix of geography, history, and sometimes quite opinionated description all in one place, which makes them much more readable, in my view. My introduction to this version discusses this and other related matters: https://britannica11.org/about.html
"anything approaching a uniform distribution of the stars cannot extend indefinitely. It can be shown that, if the density of distribution of the stars through infinite space is nowhere less than a certain limit (which may be as small as we please), the total amount of light received from them (assuming that there is no absorption of light in space) would be infinitely great, so that the background of the sky would shine with a dazzling brilliancy ...."
[0] https://britannica11.org/article/25-0806-star/star#section-1...
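The "it can be shown" step in that quote can be sketched in a few lines. This is the standard shell argument, not the article's own derivation, and the symbols are assumptions introduced here: $n$ is a uniform number density of stars, $L$ the luminosity of each star, and $R$ the radius out to which the stars extend. Absorption and the blocking of distant stars by nearer ones are both ignored.

```latex
% A thin shell of radius r and thickness dr holds 4\pi r^2 n \, dr stars,
% and each of them delivers a flux L / (4\pi r^2) to the observer.
\[
  F(R) \;=\; \int_0^R \frac{L}{4\pi r^2}\, 4\pi r^2\, n \, dr
        \;=\; n L R
  \;\longrightarrow\; \infty \quad \text{as } R \to \infty .
\]
% If the density is only bounded below, n(r) \ge n_{\min} > 0, then
% F(R) \ge n_{\min} L R, which still diverges.
```

Every shell contributes the same amount of light because the $r^2$ growth in the number of stars cancels the $1/r^2$ dimming of each one, so any nonzero lower bound on the density is enough to make the total unbounded.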
I've been testing different OCR tools and so far I've been the most impressed with PaddleOCR: it correctly split the text columns, labeled the illustrations, and noted the margin text.
Still, it's not perfect, so I'm having to hand-edit some tables. I plan to put the source pages online as well so you can switch between the scanned page and the electronic text.
Other material that would be fun to put online in a hyperlinked and indexed format includes geographic and medical atlases and the Baedeker travel guides.
Thank you for keeping the encyclopedia books alive.
Pre-LLM and post-COVID, and perhaps the best we can hope for before AI taints all the info.
One of my prized possessions as a child was a CD-ROM-based encyclopedia (well before the internet was common). I don't know why I liked it so much, but on a rainy afternoon I'd pull up some of my favourite articles and read and learn more from them.
2009: https://archive.org/details/britannica-multimedia-dvd-2009-d...
2012: https://archive.org/details/britannica-dvd_20230709
2013: https://archive.org/details/encyclopedia-britannica-dvd-2013
Part of the motivation here was to bring that kind of exploration back, but with the original 1911 text and structure.
Some bugs I noticed:
Searching for Zurich allows you to go to the article for the canton of Zurich, not the city. Clicking the link "Zürich (city)" inside this article opens the same article about the canton again, rather than the actual article for the city.
When viewing an article, the search for articles (leftmost search box) doesn't seem to work at all for me (in Firefox). On the main page, it does work.
There's a small clickable 'home' button on the right, but muscle memory from how other websites work makes me expect that clicking the big title "Encyclopædia Britannica, 11th Edition" on the top left also goes to home
I haven't tested the article search box on the article viewer in Firefox. I'll look into that as well.
Making the title linkable is a great idea and it will be implemented shortly. Thanks for catching all of this.
I highly recommend getting an old set of volumes.
I actually took a recent crack at making a more modern website for Webster's 1913: https://websters1913.timcieplowski.com/
There's a bit of funkiness with "<?/" appearing here:
> Though the offence of eavesdropping still exists at common law, there is no modern instance of a prosecution or indictment.
Thanks for posting this resource, I've often wanted to share a link to this and other entries.
[0] https://britannica11.org/article/08-0867-eavesdrip/eavesdrip...
I didn’t do OCR myself, except for the topic index and to fill in a few gaps. I started from existing Wikisource text and then built a pipeline around that: cleaning (headers, hyphenation, etc.), detecting article boundaries, reconstructing sections, and linking things back to the original page images. Most of the effort went into rendering the complex layouts, and handling the cross-linking, not the initial ingestion.
Glad to go into more detail if you’re interested, but that’s the gist of it.
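To give a feel for what the cleaning step (headers, hyphenation) can involve, here is a minimal Python sketch. It is not the pipeline described above: the `clean_page` function, its `running_header` parameter, and the sample lines are all hypothetical, and the rule it applies (a line ending in "-" always continues on the next line) is a deliberately naive assumption.

```python
def clean_page(lines, running_header=None):
    """Drop a running header and rejoin words hyphenated across line breaks."""
    # Remove the page's running header if it is the first line.
    if running_header is not None and lines and lines[0].strip() == running_header:
        lines = lines[1:]

    out = []
    for line in lines:
        line = line.strip()
        if out and out[-1].endswith("-") and line:
            # Glue the first word of this line onto the broken word above.
            first, _, rest = line.partition(" ")
            out[-1] = out[-1][:-1] + first
            line = rest
        if line:
            out.append(line)
    return " ".join(out)


# Made-up sample page, not quoted from the encyclopedia.
page = [
    "MOLECULE",
    "The kinetic theory of gases ex-",
    "plains pressure by molecular im-",
    "pacts on the walls.",
]
print(clean_page(page, running_header="MOLECULE"))
```

The naive rule also joins genuinely hyphenated compounds that happen to break at the hyphen (for example "well-" / "known" becomes "wellknown"), so a real pipeline would likely need a dictionary check or a comparison against the page scan to decide which hyphens to keep.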
https://britannica11.org/article/15-0341-jenghiz-khan/jenghi...
Just kidding, of course. This is incredible and surprisingly nostalgic. Reading some of the entries took me right back to being a kid huddled in my room for hours poring over an encyclopedia or even the dictionary.
And I still vividly remember the rush of installing Encarta for the first time on the family PC.
I couldn't believe that I, a mere kid, now had access to iconic historical footage that I could watch anytime I felt like it. I can't describe how amazingly cool that felt at the time! It still gives me a hit of endorphins when I remember it today.
By the way, it looks like there's a bug where I can't search for articles when already inside one. To do so, I need to go back to home > articles and then search.