Losing 1½ Million Lines of Go
89 points
by moks
4 days ago
| 2 comments
| tbray.org
| HN
mroche
5 hours ago
> Unfortunately, Go’s library doesn’t get updated every time Unicode does. As of now, January 2026, it’s still stuck at Unicode 15.0.0, which dates to September 2023; the latest version is 17.0.0, last September. Which means there are plenty of Unicode characters Go doesn’t know about, and I didn’t want Quamina to settle for that.

I have to say I am surprised about that. Does anyone have any context or guesses as to why this is the case?

EDIT: Go's unicode was actually updated to v17 yesterday:

https://github.com/golang/go/commit/dd39dfb534d2badf1bb2d72d...

fsmv
4 hours ago
There was a short thread about this on Mastodon involving Rob Pike the other day: https://hachyderm.io/@robpike/115896334649905170
matt3210
4 hours ago
Based on the commit message, which uses "CL" (Google lingo for a Change List in their internal system), I bet this was already available in the internal version and was only ported to the GitHub version after someone pointed it out.
neild
4 hours ago
Much more prosaic (if slightly embarrassing), I'm afraid: The update was non-trivial (this CL is simple, but there are some accompanying ones in x/text which are not) and it didn't hit the top of the priority list for anyone who understands x/text.

Go is pretty much entirely developed in public; there are some Google-internal customizations but none of them are particularly exciting and almost all changes start in the open source repo and are imported from there.

LukeShu
4 hours ago
"CL"/"Change List" is the lingo for the Gerrit code review tool, which is how all contributions to Go happen. Creating a GitHub PR simply triggers a bot to create a Gerrit CL, which is where all discussion about the "PR" happens and where the "accept" button gets clicked.
8n4vidtmkvmk
59 minutes ago
Is Gerrit the same as Critique?
watchful_moose
5 hours ago
Hard to get promoted at Google doing that
Someone
1 hour ago
> Sure, these automata are “wide”, with lots of branches, but they’re also shallow, since they run on UTF-8 encoded characters whose maximum length is four and average length is much less

I would consider splitting this task into two:

- extracting the next Unicode code point

- determining whether it’s in the code class

For the second, instead of using an automaton, one could use a perfect hash (https://en.wikipedia.org/wiki/Perfect_hash_function). That could make that part branch-free.

Is that a good idea?
