The Burrows-Wheeler Transform
130 points
13 hours ago
| 15 comments
| sandbox.bio
| HN
mfworks
6 hours ago
[-]
The most magical part of this transform is the search! First learned about this in a bioalgorithms course, and the really cool property is that for a string length l, you can search for the string in O(l) time. It has the same search time complexity of a suffix tree with O(n) space complexity (with a very low constant multiple). To this day it may be the coolest algorithm I've encountered.
reply
dgacmu
6 hours ago
[-]
I encountered the search version of this, which is turned suffix arrays, in grad school and was so taken by them I incorporated them as the primary search mechanism for the pi searcher. 25 years later it's still the best way to do it. Incredible insight behind bwt and suffix arrays.
reply
dcl
5 hours ago
[-]
A friend doing bioinformatics told me about this at uni, it was definitely one of those "i can't believe this is doable" sort of things.
reply
moribvndvs
4 hours ago
[-]
For some reason I was really getting lost on step 2, “sort” it. They mean lexicographically or “sort it like each rotation is a word in the dictionary, where $ has the lowest character value”. Now that I see it, I’m a little embarrassed I got stuck on that step, but hopefully it helps someone else.
reply
hinkley
9 hours ago
[-]
Noteworthy that the paper that was published describing this technique came at a time when IP lawyers had begun to destroy the world wrt to small business innovation. And that they released it unencumbered is a huge debt of gratitude that we haven’t really honored.
reply
Imnimo
7 hours ago
[-]
An interesting bit of trivia about the Burrows-Wheeler transform is that it was rejected when submitted to a conference for publication, and so citations just point to a DEC technical report, rather than a journal or conference article.
reply
lancefisher
10 hours ago
[-]
This is a good article and a fun algorithm. Google made a great series of compression videos a while ago - albeit sometimes silly. For the Burrows-Wheeler one they interviewed Mike Burrows for some of the history. https://youtu.be/4WRANhDiSHM
reply
emmelaich
9 hours ago
[-]
I wonder if he came up with it when pondering permuted indexes. Which used to be a feature of the man pages system.
reply
kvark
5 hours ago
[-]
I've spent years playing with BWT and suffix sorting at school (some work can be found be names of archon and dark). It's a beautiful domain to learn!

Now I'm wondering: in the era of software 2.0, everything is figured out by AI. What are the chances AI would discover this algorithm at all?

reply
mrkeen
3 hours ago
[-]
No need to speculate about this algorithm in particular. It's been a few years since every programmer has had access to LLMs (that's a huge sample size!). Some LLMs are even branded as 'Research'. Plus they never get tired like humans do. We should be swimming in algorithms like this.
reply
pixelpoet
7 hours ago
[-]
I remember randomly reading about this in the Hugi demoscene diskmag as a young teen, and it completely blew my mind (notably the reverse transform, apparently not covered in OP's article): https://hugi.scene.org/online/coding/hugi%2013%20-%20cobwt.h...

The author later wrote many now-famous articles, including A Trip Through the Graphics Pipeline.

reply
rollulus
2 hours ago
[-]
Oh wow, I clicked that link without reading the last section of your comment, and was like “Fabian Giesen!”, an absolute HN favourite: https://hn.algolia.com/?q=fgiesen
reply
raboukhalil
11 hours ago
[-]
Author here, nice to see the article posted here! I'm currently looking for ideas for other interactive articles, so let me know if think of other interesting/weird algorithms that are hard to wrap your head around.
reply
foobarian
11 hours ago
[-]
The unintuitive part of BWT to me was always the reverse operation, which was the only thing the post didn't give intuition for :-)
reply
SimplyUnknown
1 hour ago
[-]
I'm still not sure I get it. I think it is

1. Put the BWT string in the right-most empty column

2. Sort the rows of the matrix such that the strings read along the columns of the matrix are in lexicographical order starting from the top-row????

3. Repeat step 1 and 2 until matrix is full

4. Extract the row of the matrix that has the end-delimiter in the final column

It's the "sort matrix" step that seems under-explained to me.

reply
hinkley
9 hours ago
[-]
Most BWT descriptions: now draw the rest of the owl.

Once upon a time I found one that did the entire round trip, I neglected to bookmark it. So when I eventually forgot how it worked, or wanted to show others, I was stuck. Never did find it.

reply
raboukhalil
6 hours ago
[-]
I added more details about how the decoding works here: https://sandbox.bio/concepts/bwt#intuition-decoding , I'd love to hear if that is more clear now
reply
raboukhalil
11 hours ago
[-]
Good point, thanks! I'll add a subsection about the intuition for that.
reply
raboukhalil
6 hours ago
[-]
I just added a section about the intuition behind the decoding: https://sandbox.bio/concepts/bwt#intuition-decoding , hope it helps!
reply
zahlman
4 hours ago
[-]
BWT is how I first got to interact with Tim Peters on the Internet: https://stackoverflow.com/questions/21297887
reply
fair_enough
11 hours ago
[-]
Thanks, man! This helps a lot, because as you say the algorithm is not intuitive. The description of the type of data it is very good for is the best value-adding part of this writeup. Greatly appreciated!
reply
username223
7 hours ago
[-]
Does anyone else remember when Dr. Dobb's Journal wrote about stuff like this? https://web.archive.org/web/20040217202950/http://www.ddj.co...
reply
embit
8 hours ago
[-]
This is great. The best article I have read on BWT and experimented was in Dr Dobbs Journal around 1996-97.
reply
loloquwowndueo
11 hours ago
[-]
I was once mentioning this transform to someone and they were like “what? The Bruce Willis transform?” - I guess my pronunciation sucked that much.
reply
Bukhmanizer
10 hours ago
[-]
Once in a lecture one of my friends asked “You keep mentioning Cooper Netties, who is Cooper Netties?” Everyone was very confused until someone figured out she was asking about Kubernetes.
reply
jansan
12 hours ago
[-]
For an article describing a compression algorithm this was very digestible and entertaining.
reply
jakedata
11 hours ago
[-]
This transformation in and of itself does not perform any compression. In fact it adds an additional string marker to the data. Performing compression on these strings is addressed here: https://www.cs.cmu.edu/~15451-f18/lectures/lec25-bwt.pdf#pag...
reply