Visualizing All ISBNs
393 points | 1 year ago | 16 comments | annas-archive.org | HN
graypegg
1 year ago
I see that bounty at the bottom, so tossing away my chances here, but this visualization is just asking to be mapped onto a Hilbert Curve. [0] When you "stripe" the data like this, points that are sorted close together could end up pretty far apart, since a distance in the Y axis skips an entire row of data as you move down, rather than a distance in the X axis which is 1-to-1 with the source data.

If you map it onto a Hilbert curve, the X and Y axes mean nothing on their own, but points that are close together in the sorted list will be visually close together in the output image.

Since the first part of an ISBN is the country, the second part the publisher, and the third part the title, with a checksum at the end, I would remove the checksum, strip the hyphens, and sort each ISBN as one big number.

You should end up with "islands", where you see big areas covered by big publishing countries, with these "islands" having bright spots for the publisher codes.

Bonus points for labeling these areas!

I set up something a while ago [1] for an interview that does this with weather data. It makes the seasons really obvious since they're all grouped together.

[0] https://en.wikipedia.org/wiki/Hilbert_curve

[1] https://graypegg.com/hilbert (https://github.com/graypegg/hilbertcurveplayground code if anyone wants to go for the prize using this! Please at least mention me if you decide to reuse this code, but I can't stop ya lol)
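
A minimal Python sketch of the mapping described above, for anyone who wants to experiment. It uses the standard iterative index-to-coordinate conversion for a Hilbert curve and is not taken from the linked playground; the function name `hilbert_d2xy` and the 8x8 grid are illustrative assumptions, and the side length must be a power of two.

```python
def hilbert_d2xy(side, d):
    """Map position d along a Hilbert curve to (x, y) on a side x side grid.

    Assumes side is a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        # Pick the quadrant for this level from the low two bits of t.
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square so consecutive positions stay adjacent.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# The ISBN with rank r in the sorted list would be drawn at hilbert_d2xy(side, r).
points = [hilbert_d2xy(8, d) for d in range(64)]
print(points[:8])
```

Every pair of consecutive ranks lands on touching pixels, which is the property that keeps neighbours in the sorted list close together in the image.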

abetusk
1 year ago
And there's a generalized Hilbert curve, the Gilbert curve, for rectangular regions whose sides are not powers of two [0] (online demo [1]).

[0] https://github.com/jakubcerveny/gilbert

[1] https://jakubcerveny.github.io/gilbert/demo/

n2d4
1 year ago
What property makes the Hilbert curve desirable compared to, say, a snake pattern, with which neighbouring ISBNs are also neighbours in the visualisation?

The worry I have with Hilbert curves is that they make the result look like there are distinct "squares" of data [0] when really this is just an artifact of how Hilbert curves work. In that sense, the current visualization is more useful, because it's straightforward to identify the location of each country in it.

[0] https://raw.githubusercontent.com/jakubcerveny/gilbert/maste...

graypegg
1 year ago
In a snake pattern, the neighbouring pixels on the left and right are related, but the ones above and below have skipped a whole row.
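
To put a number on the "skipped a whole row" point, here is a small Python sketch. The helper names and the 10-pixel width are made up for illustration; it only compares how far apart two vertically adjacent pixels are in the sorted 1D order under a striped (raster) layout and under a snake layout.

```python
def raster_index(x, y, width):
    """Position in the sorted list of pixel (x, y) when every row runs left to right."""
    return y * width + x


def snake_index(x, y, width):
    """Position in the sorted list of pixel (x, y) when odd rows run right to left."""
    if y % 2 == 0:
        return y * width + x
    return y * width + (width - 1 - x)


def vertical_gap(index_fn, x, y, width):
    """Distance in the sorted list between pixel (x, y) and the pixel directly below it."""
    return abs(index_fn(x, y + 1, width) - index_fn(x, y, width))


width = 10
raster_gaps = [vertical_gap(raster_index, x, 0, width) for x in range(width)]
snake_gaps = [vertical_gap(snake_index, x, 0, width) for x in range(width)]
print(raster_gaps)
print(snake_gaps)
```

In the striped layout the pixel directly below is always one full row (`width` positions) away. In the snake layout the gap is 1 at the edge where the row turns around and grows to `2 * width - 1` at the opposite edge.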

And yeah, that's true! You end up with squares with Hilbert curves. But those squares are all "related" data. Then those squares are related to the squares near them. Zoom out more and that grouping of squares is related to the neighbouring macro-squares, and so on.

Basically the square shape is a positive. Kind of like how charting the derivative lets you see how random/related information is, grouping into these squares gives you a visualization of pattern-ness, rather than any specific measurement.

n2d4
1 year ago
> In a snake pattern, the neighbouring pixels on the left and right are related, but the ones above and below have skipped a whole row.

But this is also true in Hilbert curves across the boundaries of the "squares" that I mentioned. The two center pixels in the top row are much more distant than any two pixels would be in a snake pattern.

NooneAtAll3
1 year ago
> What property makes the Hilbert curve desirable compared to, say, a snake pattern, with which neighbouring ISBNs are also neighbours in the visualisation?

2D neighbourhood is better than 1D one

> The worry I have with Hilbert curves is that they make the result look like there are distinct "squares" of data

that's the point, tho? instead of distinct lines of taken ISBNs in a row, you get distinct squares of taken ISBNs in a row - much more noticeable

WillAdams
1 year ago
The thing is, ISBNs aren't hierarchical --- they are bought in blocks (or even individually at an exorbitant markup, says the guy who bought one to reprint a single book), so this doesn't show anything really interesting/useful.

A visualization using LoC or even Dewey Decimal would be far more useful, esp. if it also linked to public domain and copyright-free repositories/lists, say an interactive and visual version of John Mark Ockerbloom's:

https://onlinebooks.library.upenn.edu/

est31
1 year ago
ISBNs are hierarchical, what do you mean? Like Gaul, ISBNs are divided into multiple parts, where one part is for the language, another is for the publisher, and another is for the title. The final part is a checksum. https://en.wikipedia.org/wiki/ISBN#Overview
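
As a concrete illustration of that final part, here is a minimal Python sketch of the ISBN-13 check digit (alternating weights 1 and 3 over the first twelve digits), plus a helper that strips hyphens and drops the check digit to get a sortable integer. The function names are hypothetical, and 978-0-306-40615-7 is just a commonly cited example ISBN.

```python
def isbn13_check_digit(first12):
    """Return the ISBN-13 check digit for a string of exactly 12 digits."""
    digits = [int(c) for c in first12]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


def isbn13_sort_key(isbn):
    """Strip hyphens/spaces and the trailing check digit; return the rest as an int."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("expected a 13-digit ISBN")
    return int(digits[:12])


example = "978-0-306-40615-7"
key = isbn13_sort_key(example)
print(key, isbn13_check_digit(str(key)))
```

Sorting by this key groups ISBNs by prefix first, then by each successive part, because the parts appear in that order from left to right.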
WillAdams
1 year ago
Yes, but this internal hierarchy for an issued number doesn't tell anything beyond those facts about a specific edition of a specific text.

One can't use ISBNs alone to create a hierarchical listing of texts which is useful for anything beyond browsing by language/publisher/order in which the ISBN was generated.

A visual and interactive representation of books by LoC or some other cataloging system would actually be useful.

PaulHoule
1 year ago
I got into an argument with the manager of South End Press back in '94 about whether 'Futuresplash' (soon to be Macromedia Flash) had a future, he thought it did and he was right.

Years later I was working at the library and got a little bit steamed because South End Press was reusing ISBNs after books went out of print, which was allowed but, I think, lame.

One of my strategies for researching a topic is looking a few up in the OPAC, finding them in the stacks, and finding more books on the topic in those areas. (In the Library of Congress system, machine vision could be under QA56 with the rest of computer science or around TA1630, thus "areas".)

From time to time I've thought about trying to replicate the feel of this with some kind of UI, given that our library moved a lot of the collection into deep archives and we have a very fast 'Borrow Direct' service with other peers.

convolvatron
1 year ago
Totally agree, but that's not in the data. However, since blocks are assigned to agencies associated with countries and publishers, you might find some utility in showing coverage by likely language and/or country of origin and date.
MarceColl
1 year ago
It shows what they want to show, which is mostly how much of the world's books they have. Hierarchy has nothing to do with it.
Finnucane
1 year ago
It only sort of shows that. ISBNs are issued by edition, not title, so many books would have more than one. And books published before 1970 or so might not be represented at all if they have no recent edition.
NoMoreNicksLeft
1 year ago
They can't even have a tiny fraction of the world's books. Each edition of a book gets a new ISBN... if a book is released as a paperback, hardback, Kindle edition, PDF, and EPUB, then there are supposed to be five ISBNs.

The vast, vast majority have only been released as dead-tree versions. They have none of those. The books they scan may have an ISBN, but the scans do not have them. Like all Project Gutenberg books, their books have no ISBNs at all. From a strict point of view, they've released new editions of these books.

nickelpro
1 year ago
Worthless semantics in the context of the mission of the project.

What you've described is that the archived content can be mapped to multiple ISBNs. It's clear the only element of concern here is the content itself. The failure to preserve a particular binding or printer's choice of typeface is irrelevant.

Failing to recognize this requires an almost malicious level of pedantry.

jameshart
1 year ago
A successful archival of one of those ISBNs will light up; four of those ISBNs remain dark. Yet they have that content archived. It means that lighting up the entire grid is not necessary to achieve their goal.

Indeed a bigger problem is that it’s much harder to know which areas of the grid are never going to light up because the ISBN has not been used.

nickelpro
1 year ago
This is a separate problem, but a notable one.

Lighting up the entire grid is still the goal, you're describing the problem of ensuring the right set of squares is illuminated for each piece of archived content. One is a problem of archiving the content, the other is a problem of bookkeeping.

NoMoreNicksLeft
1 year ago
>Worthless semantics in the context of the mission of the project.

Hardly worthless... oftentimes, the edition of the book matters as much as the title. Stephen King wrote two books named The Stand, and one isn't anything like the other. He pulled a Lucas pretty early on.

He's hardly the only author to ever do this. But it's not just authors either. Editors, collectors, translators all make their mark, and give you works that might seem only slightly different to you, but the differences actually matter to the rest of us. It's not that you're ignorant that offends me, it's the arrogance about a subject you seem to know so little about that makes it difficult to tolerate.

There is no pedantry here, just a desire to actually preserve books and to organize them.

nickelpro
1 year ago
> Stephen King wrote two books named The Stand, and one isn't anything like the other

Then those two texts would map to different ISBNs, or perhaps each maps to multiple different ISBNs, it doesn't matter. That some texts exist with the same title but different content is similarly irrelevant.

The content is all that matters. Two different bodies of content, two different entries in the archive. Each entry may map to one or more ISBN numbers.

> the differences actually matter to the rest of us

The only differences that matter are what matters to the archive that made the blog post. Your concerns are for entirely different things, which is fine, but don't say the OP's concerns or initiatives are impossible or ill-suited based on criteria you're projecting onto them.

mmooss
1 year ago
> The books they scan may have an ISBN, but the scans do not have them. Like all Project Gutenberg books, their books have no ISBNs at all. From a strict point of view, they've released new editions of these books.

Are you saying they actively remove ISBN numbers from scans? If I downloaded one of the books, it wouldn't have an ISBN?

Why? That seems like a bunch of extra processing per book, makes it harder for users to specifically identify a book, and probably does nothing for legality. Also, can people search by ISBN?

Tomte
1 year ago
> Are you saying they actively remove ISBN numbers from scans?

No, he's playing the pointless "well, actually a scan of a book is a different thing from the book itself" game.

NoMoreNicksLeft
1 year ago
No, I'm saying that the ISBN doesn't describe titles, it describes editions, and editions matter.
nickelpro
1 year ago
You said:

> From a strict point of view, they've released new editions of these books.

And this is clearly a semantically worthless distinction from the point of view of the archive.

When different editions have different content, archiving those differences in that content may matter (arguably not for simple typographical corrections, printing errors, etc). When different ISBNs have identical content, it is totally irrelevant to the goals of the archive.

edflsafoiewq
1 year ago
This is addressed somewhat in the "The critical window of shadow libraries" post

> Until now, the only options to shrink the total size of our collection has been through more aggressive compression, or deduplication. However, to get significant enough savings, both are too lossy for our taste. Heavy compression of photos can make text barely readable. And deduplication requires high confidence of books being exactly the same, which is often too inaccurate, especially if the contents are the same but the scans are made on different occasions.

Finnucane
1 year ago
A text may be derived from an edition with an ISBN, but the ISBN wouldn't apply to that file; it is effectively a different edition.
omoikane
1 year ago
One thing it shows is how ISBNs are allocated much faster than they are used, judging by the amount of black pixels.

The image contains 1000*800 pixels at 2500 ISBNs per pixel, so it's visualizing 2e9 ISBNs. ISBN-13 contains 12 digits plus one check digit, so we might have expected the image to be 500 times bigger/denser than the current image. The fact that it's at its current size suggests that only ISBNs with 978 and 979 prefixes are included, and since the bottom half is more sparse, that probably corresponds to the new 979 range.
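
The arithmetic above can be checked in a few lines of Python. The variable names are only illustrative, and the figures (1000 by 800 pixels, 2,500 ISBNs per pixel) are the ones quoted in this comment.

```python
pixels = 1000 * 800
isbns_per_pixel = 2500
covered = pixels * isbns_per_pixel   # ISBNs the image can represent

all_12_digit = 10 ** 12              # every 12-digit number, check digit excluded
per_prefix = 10 ** 9                 # 9 free digits after a fixed 3-digit prefix

print(covered)                       # total ISBNs covered by the image
print(all_12_digit // covered)       # how many times larger the full 12-digit space is
print(covered == 2 * per_prefix)     # whether exactly two 3-digit prefixes fit
```

The image covers exactly as many ISBNs as two 3-digit prefixes allow, which is consistent with the guess that only the 978 and 979 ranges are drawn.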

skrebbel
1 year ago
I thought it was my color blindness that made me unable to distinguish between the red and green pixels as described (I only see red and black ones), but even with a browser extension that counters color blindness I can't distinguish more colors. Is this just me, or is the graph weird?
saithound
1 year ago
Fwiw (not color-blind) I can see red, green and black pixels. The graph doesn't look weird to the naked eye.

Find the interactive visualiser by scrolling down, and switch it to "Files in Anna's Archive [md5]". This will highlight the location of the green pixels in grey.

Muehe
1 year ago
If you have red-green blindness like me try this:

- Right-click the image and select "Inspect".

- Add a new CSS hue-rotate filter to the element:

    element {
       max-width: 100%;
       margin: 0 auto;
       filter: hue-rotate(-90deg);
    }

Usually I use "filter: saturate(100);", but that didn't really work well for this image. You might have to adjust the rotation degree though, -90 worked best for me.
superzamp
1 year ago
The graph seems to be alright, there are indeed red and (some) green pixels, looks like an issue with your extension unfortunately.
Finnucane
1 year ago
I am also color blind and the graph is not good.
rendx
1 year ago
I see green dots and a few lines of green dots. Did you try zooming in?
thaumasiotes
1 year ago
I see red, green, and a bit of yellow. I assume the yellow is what happens when the red and green pixels come too close to each other.
psychoslave
1 year ago
No idea where the issue might lie, but I can see the difference in colors.
asfasdfasdfn
1 year ago
The graphs are very easy to read, although that depends on your ability to distinguish between red and green.

Can you change the green channel to blue to better view it?

glimshe
1 year ago
Anna's archive is one of the wonders of the world. If we almost destroyed our species but Anna's archive endured, there would be hope for a relatively expedient reconstruction.
wayathr0w
1 year ago
>relatively expedient reconstruction

If self-destruction is a necessary premise here, is that really a good thing?

jdblair
1 year ago
It appears that the IP of the server is blocked in the EU. I get this from my ISP (Ziggo, in the Netherlands):

This website is blocked

European sanctions

The Council of Europe has decided that the websites of RT (formerly Russia Today) and Sputnik News may no longer be transmitted. The website you are trying to visit falls under this European sanction.

VodafoneZiggo is obliged to implement the sanction and has blocked the website.

voytec
1 year ago
jdblair
1 year ago
UPDATE: I updated my DNS server config (I run my own already) to use root DNS rather than forward to my ISP, problem solved.
hk__2
1 year ago
No issue here in France.
manosyja
1 year ago
Running your own recursive resolver has certain advantages…
jdblair
1 year ago
And I was so close! I just disabled forwarding to my ISP DNS on my home DNS, now there is no block.
usr1106
1 year ago
No issue in Finland.
billpg
1 year ago
Anyone else seeing this?

"This server couldn't prove that it's annas-archive.org; its security certificate is from *.hs.llnwd.net. This may be caused by a misconfiguration or an attacker intercepting your connection."

masfuerte
1 year ago
Yes. A DNS request for annas-archive.org to my ISP (EE in the UK) returns an address for www.ukispcourtorders.co.uk, which also gives a security warning. If I click through the warning on either site I get an HTTP 400 error.

According to Wikipedia, www.ukispcourtorders.co.uk used to list the blocked domains and the court orders responsible.

https://en.wikipedia.org/wiki/List_of_websites_blocked_in_th...

c0balt
1 year ago
No, it sounds like your connection is being MITM'd. The domain does look like a legitimate CDN, though.
usr1106
1 year ago
I get a valid-looking cert issued by Google Trust Services. Finnish ISP's DNS.
swores
1 year ago
Same for me
quink
1 year ago
Kind of hard to tell what corresponds to what in these graphs, maybe if someone could point out Bookland (i.e. 978), it would be a bit easier to orient oneself?
seszett
1 year ago
Making it easier to visualise is the whole point of the bounty announced by this post.
greenie_beans
1 year ago
is it illegal to download and use their isbn file? like what is wrong with having that information?
karel-3d
1 year ago
I don't think this page, which links to libgen and sci-hub, is that concerned about copyright.
greenie_beans
1 year ago
annoying non-answer to my question. i already know all about anna's archive. i'm asking if a person can download these isbns and use them to make data visualizations without fear of breaking a law? https://software.annas-archive.li/AnnaArchivist/annas-archiv...
qingcharles
1 year ago
Seeing as nobody has provided a real answer: the answer is maybe.

Anna's Archive is currently being sued for scraping vast amounts of essentially public metadata which was being gatekept by a single organisation.

Here's the longer and more complicated answer for you:

https://libraries.emory.edu/research/copyright/copyright-dat...

greenie_beans
1 year ago
Feist is what comes up when I search around, too. The ISBNs might be poisoned if Anna broke terms of service to get them.
karel-3d
1 year ago
Sorry, I misunderstood your question.
salomonk_mur
1 year ago
They explicitly provide that data for you to do as you wish. They are in a grey area, not you. You can download it no problem.
greenie_beans
1 year ago
is there legal precedent for that?

already asked LLMs so please don't copy/paste an LLM response.

eemil
1 year ago
Depends on your jurisdiction.
whataguy
1 year ago
> Each pixel represents 2,500 ISBNs. If we have a file for an ISBN, we make that pixel more green.

What do you mean by "more green"? I don't see any shaded green.

And I presume the black pixels are unregistered ISBNs?

slyall
1 year ago
I'd suggest you try a color blindness test. The green is very obvious, especially about 40% of the way down the whole image.
whataguy
1 year ago
No, I see the green, but I don't see any shaded green. Though this probably has to do with ISBNs being distributed in blocks, so that every pixel is either red or green?
lmm
1 year ago
If you look closely there are definitely some brownish pixels and some dim greens.
eporomaa
1 year ago
Hm, I got:

"...

European sanctions

The Council of Europe has decided that the websites of RT (formerly Russia Today) and Sputnik News may no longer be transmitted. The website you are trying to visit falls under this European sanction.

..."

reddalo
1 year ago
I think the website is censored at DNS level but they chose the wrong error page.

In Italy it just errors out with a NS_ERROR_CONNECTION_REFUSED.

flir
1 year ago
You've just cleared up a minor mystery I never bothered to investigate (BT, UK). Thanks.

Flipping DNS to 8.8.4.4 fixed it for now but I really need to move this connection to A&A.

TonyTrapp
1 year ago
Works fine here from a European IP.
jaapz
1 year ago
It's blocked at least in the Netherlands. Weirdly it mentions it being part of the sanctions against Russia, while from a cursory search I only found a judge ordering the site to be blocked because of copyright issues (thanks Brein). They probably just show the wrong error page?
Cthulhu_
1 year ago
Must be ISP specific, I'm also in NL and can access it fine.
jaapz
1 year ago
I'm on Ziggo
rchard2scout
1 year ago
It's blocked by my corporate networking filter for me, in the category "Illegal downloads". So the Russian sanctions message is probably incorrect indeed.
rollulus
1 year ago
I'm also in NL. Ziggo's DNS server blocks it:

  $ dig annas-archive.org @89.101.251.228
  annas-archive.org. 360 IN CNAME unavailable.for.legal.reasons.
  unavailable.for.legal.reasons. 339 IN A 213.46.185.10

213.46.185.10 serves a generic page mentioning Russia Today and the Pirate Bay. Not sure which one applies here.
seszett
1 year ago
> CNAME unavailable.for.legal.reasons.

Not really standards compliant, but an interesting use of DNS.

Freak_NL
1 year ago
Same for KPN:

http://195.121.82.125/

Would Tweak have blocked this? Most households in the Netherlands currently have the choice of Ziggo, KPN, and Odido. Long live VPNs…

xp84
1 year ago
Is that three broadband providers serving the same address?? You guys are so lucky you don’t even know. In America we generally have a choice of one if you aren’t including Starlink or legacy slow satellite. And perhaps a joke of a 1-6Mbps DSL option in some parts.
reddalo
1 year ago
Oh wow, don't look at Italy then! At my current address I have coverage from at least 7 different providers (even though they're all based on only 3 different infrastructures/lines).
xp84
1 year ago
Three usable lines to your home??? I hope you're happy, you've made at least one American cry today.
reddalo
1 year ago
Yes. One of those three lines is based on the old copper phone lines; the other two are optical fiber (FTTH).

I currently have a 1 Gbps down / 300 Mbps up unlimited connection, and I pay only 16 euros (~16 USD) per month.

I wonder why the US is so bad on home internet connections, but maybe it's because of the scale of your country?

powerhugs
1 year ago
Switch DNS to like 1.1.1.1 (Cloudflare) or 8.8.8.8 (Google)
usr1106
1 year ago
What is Anna's archive and why is it blocked by law enforcement in several European countries (EU + UK)?
nout
1 year ago
It's the largest collection of books in easy to download formats for e-readers (often epub).
usr1106
1 year ago
So blocked because of copyright issues?
ge96
1 year ago
Ooh prize money, D3 those are fun, where you can map a million things/zoom into it
friend_Fernando
1 year ago
Isn't it interesting how certain online forces affiliated with the letter Z are against copyright for Western IP in general, but are pro copyright when it comes to hamstringing Western AI?
CaptainFever
1 year ago
The letter Z? What does that mean?
aspenmayer
1 year ago
Probably a reference to Z-Library, or as a stand-in for Russia.

https://en.wikipedia.org/wiki/Z-Library

https://en.wikipedia.org/wiki/Z_(military_symbol)

netman21
1 year ago
Hee, hee. "Imperial Library of Trantor."
qingcharles
1 year ago
Now do ISSNs, please.
sebstefan
1 year ago
>$10,000 bounty

>There is much to explore here, so we’re announcing a bounty for improving the visualization above. Unlike most of our bounties, this one is time-bound. You have to submit your open source code by 2025-01-31 (23:59 UTC).

>The best submission will get $6,000, second place is $3,000, and third place is $1,000.

>All bounties will be awarded using Monero (XMR).

? Why are they using crypto, and, weirdly enough, specifically the crypto people use for buying drugs, to award this?

Is it some kind of scam?

yawndex
1 year ago
Because the efforts of Anna's Archive are unfortunately currently very much illegal, and XMR is one of the few cryptocurrencies that can actually offer some privacy to its users.
sebstefan
1 year ago
I've used XMR before. Just surprised to see it used to pay for legitimate and harmless visualization work.

I see, that makes sense

aprilnya
1 year ago
So what you’re saying is you think XMR is just for buying drugs, and you’re also saying you’ve used XMR before.

Hmmmmmm

/s

fear-anger-hate
1 year ago
They use Monero because what they are doing (copyright infringement) will get you into big trouble anywhere in the Western world. Without cryptocurrencies, much of the modern large-scale archival effort wouldn't be possible, or at the very least the risks for the people participating in it would be significantly higher. For me this alone is a good enough reason to admit that there are valid reasons for the existence of privacy coins.

The harm they may cause in the short term via tax avoidance or being used to buy drugs is minimal, but the possibility that because of them archivists are able to fund servers for data that future historians wouldn't have otherwise been able to get their hands on? Priceless.

Klaus23
1 year ago
Because it is a book download site, which is illegal in every country that has copyright, and revealing one's identity with a bank transfer would be a stupid way to go to jail.
akimbostrawman
1 year ago
>Why are they using crypto, and, weirdly enough, specifically the crypto people use for buying drugs, to award this?

You really have to ask why an illegal/grey site is using a currency that is built to protect privacy and anonymity?

Is this some kind of sarcasm?
