Losing 1½ Million Lines of Go

89 points

by moks

4 days ago

| 2 comments

| tbray.org

mroche

5 hours ago

> Unfortunately, Go’s library doesn’t get updated every time Unicode does. As of now, January 2026, it’s still stuck at Unicode 15.0.0, which dates to September 2023; the latest version is 17.0.0, last September. Which means there are plenty of Unicode characters Go doesn’t know about, and I didn’t want Quamina to settle for that.

I have to say I am surprised about that. Does anyone have any context or guesses as to why this is the case?

EDIT: Go's unicode was actually updated to v17 yesterday:

https://github.com/golang/go/commit/dd39dfb534d2badf1bb2d72d...
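If you want to check which Unicode edition your own toolchain ships, the standard `unicode` package exports a `Version` constant. A minimal sketch (the helper name `tableVersion` is just mine, for illustration):

```go
package main

import (
	"fmt"
	"unicode"
)

// tableVersion is a hypothetical helper that reports the Unicode edition
// the standard library's character tables were generated from.
func tableVersion() string {
	return unicode.Version
}

func main() {
	// The value depends on the Go release you build with.
	fmt.Println("Unicode tables in this toolchain:", tableVersion())
}
```

Running it under different Go releases should show when the tables actually change.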

fsmv

4 hours ago

There was a short thread about this on Mastodon involving Rob Pike the other day: https://hachyderm.io/@robpike/115896334649905170

matt3210

4 hours ago

Based on the commit message's use of "CL", which is Google lingo for a Change List in their internal system, I bet this was already available in the internal version and was only ported to the GitHub version after someone pointed it out.

neild

4 hours ago

Much more prosaic (if slightly embarrassing), I'm afraid: The update was non-trivial (this CL is simple, but there are some accompanying ones in x/text which are not) and it didn't hit the top of the priority list for anyone who understands x/text.

Go is pretty much entirely developed in public; there are some Google-internal customizations but none of them are particularly exciting and almost all changes start in the open source repo and are imported from there.

LukeShu

4 hours ago

"CL"/"Change List" is the lingo for the Gerrit code review tool, which is how all contributions to Go happen. Creating a GitHub PR simply triggers a bot to create a Gerrit CL, which is where all discussion about the "PR" happens and where the "accept" button gets clicked.

8n4vidtmkvmk

59 minutes ago

Is Gerrit the same as Critique?

watchful_moose

5 hours ago

Hard to get promoted at Google doing that.

Someone

1 hour ago

> Sure, these automata are “wide”, with lots of branches, but they’re also shallow, since they run on UTF-8 encoded characters whose maximum length is four and average length is much less

I would consider splitting this task into two:

- extracting the next Unicode code point (decoding one to four UTF-8 bytes)

- determining whether it’s in the code class

For the second, instead of using an automaton, one could use a perfect hash (https://en.wikipedia.org/wiki/Perfect_hash_function). That could make that part branch-free.
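To make the idea concrete, here is a rough Go sketch of the two-step split. It is not how Quamina works. It assumes a small, fixed member set, and the names `perfectSet`, `buildPerfectSet`, and `countMembers` are made up for illustration. The build step brute-forces a multiplier that sends every member to its own slot. After that, a lookup is one multiply, one shift, one load, and one compare (plus whatever bounds check the Go compiler inserts).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// empty marks an unused slot. It is not a valid code point, and
// utf8.DecodeRuneInString never returns it, so it cannot match a decoded rune.
const empty rune = -1

// perfectSet is a hypothetical collision-free hash set of code points.
type perfectSet struct {
	mul   uint32 // multiplier found by search
	shift uint32 // keeps the top bits of the 32-bit product
	slots []rune // power-of-two table, at most half full
}

// buildPerfectSet tries odd multipliers until no two members share a slot.
// Members are assumed to be valid code points. The search is only practical
// for small sets; it reports false if it gives up.
func buildPerfectSet(members []rune) (perfectSet, bool) {
	size, bits := 2, uint32(1)
	for size < 2*len(members) {
		size *= 2
		bits++
	}
	slots := make([]rune, size)
	for i := uint32(1); i < 1<<16; i++ {
		mul := i*0x9E3779B1 | 1 // wraps modulo 2^32, forced odd
		for j := range slots {
			slots[j] = empty
		}
		collision := false
		for _, r := range members {
			h := (uint32(r) * mul) >> (32 - bits)
			if slots[h] != empty && slots[h] != r {
				collision = true
				break
			}
			slots[h] = r
		}
		if !collision {
			return perfectSet{mul: mul, shift: 32 - bits, slots: slots}, true
		}
	}
	return perfectSet{}, false
}

// contains is step two: hash the code point and compare against the one
// candidate slot. r must be a value returned by the UTF-8 decoder.
func (s perfectSet) contains(r rune) bool {
	return s.slots[(uint32(r)*s.mul)>>s.shift] == r
}

// countMembers is step one plus step two: decode each code point from the
// UTF-8 text, then test membership.
func countMembers(text string, set perfectSet) int {
	n := 0
	for len(text) > 0 {
		r, size := utf8.DecodeRuneInString(text)
		text = text[size:]
		if set.contains(r) {
			n++
		}
	}
	return n
}

func main() {
	set, ok := buildPerfectSet([]rune("aeiouéü"))
	if !ok {
		fmt.Println("no collision-free multiplier found")
		return
	}
	fmt.Println(countMembers("Quamina évalue über", set))
}
```

The obvious catch is that real Unicode classes can hold tens of thousands of code points, mostly in contiguous ranges. For those, this brute-force search would almost certainly fail, and even a proper perfect-hash construction needs a table far larger than a range list.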

Is that a good idea?