FilterHN

The Burrows-Wheeler Transform

130 points

by g0xA52A2A

13 hours ago

| past

| 15 comments

| sandbox.bio

| HN

▲

mfworks

6 hours ago

[-]

The most magical part of this transform is the search! First learned about this in a bioalgorithms course, and the really cool property is that for a string length l, you can search for the string in O(l) time. It has the same search time complexity of a suffix tree with O(n) space complexity (with a very low constant multiple). To this day it may be the coolest algorithm I've encountered.

▲

dgacmu

6 hours ago

[-]

I encountered the search version of this, which is turned suffix arrays, in grad school and was so taken by them I incorporated them as the primary search mechanism for the pi searcher. 25 years later it's still the best way to do it. Incredible insight behind bwt and suffix arrays.

▲

dcl

5 hours ago

[-]

A friend doing bioinformatics told me about this at uni, it was definitely one of those "i can't believe this is doable" sort of things.

▲

moribvndvs

4 hours ago

[-]

For some reason I was really getting lost on step 2, “sort” it. They mean lexicographically or “sort it like each rotation is a word in the dictionary, where $ has the lowest character value”. Now that I see it, I’m a little embarrassed I got stuck on that step, but hopefully it helps someone else.

▲

hinkley

9 hours ago

[-]

Noteworthy that the paper that was published describing this technique came at a time when IP lawyers had begun to destroy the world wrt to small business innovation. And that they released it unencumbered is a huge debt of gratitude that we haven’t really honored.

▲

Imnimo

7 hours ago

[-]

An interesting bit of trivia about the Burrows-Wheeler transform is that it was rejected when submitted to a conference for publication, and so citations just point to a DEC technical report, rather than a journal or conference article.

▲

lancefisher

10 hours ago

[-]

This is a good article and a fun algorithm. Google made a great series of compression videos a while ago - albeit sometimes silly. For the Burrows-Wheeler one they interviewed Mike Burrows for some of the history. https://youtu.be/4WRANhDiSHM

▲

emmelaich

9 hours ago

[-]

I wonder if he came up with it when pondering permuted indexes. Which used to be a feature of the man pages system.

▲

kvark

5 hours ago

[-]

I've spent years playing with BWT and suffix sorting at school (some work can be found be names of archon and dark). It's a beautiful domain to learn!

Now I'm wondering: in the era of software 2.0, everything is figured out by AI. What are the chances AI would discover this algorithm at all?

▲

mrkeen

3 hours ago

[-]

No need to speculate about this algorithm in particular. It's been a few years since every programmer has had access to LLMs (that's a huge sample size!). Some LLMs are even branded as 'Research'. Plus they never get tired like humans do. We should be swimming in algorithms like this.

▲

pixelpoet

7 hours ago

[-]

I remember randomly reading about this in the Hugi demoscene diskmag as a young teen, and it completely blew my mind (notably the reverse transform, apparently not covered in OP's article): https://hugi.scene.org/online/coding/hugi%2013%20-%20cobwt.h...

The author later wrote many now-famous articles, including A Trip Through the Graphics Pipeline.

▲

rollulus

2 hours ago

[-]

Oh wow, I clicked that link without reading the last section of your comment, and was like “Fabian Giesen!”, an absolute HN favourite: https://hn.algolia.com/?q=fgiesen

▲

raboukhalil

11 hours ago

[-]

Author here, nice to see the article posted here! I'm currently looking for ideas for other interactive articles, so let me know if think of other interesting/weird algorithms that are hard to wrap your head around.

▲

foobarian

11 hours ago

[-]

The unintuitive part of BWT to me was always the reverse operation, which was the only thing the post didn't give intuition for :-)

▲

SimplyUnknown

1 hour ago

[-]

I'm still not sure I get it. I think it is

1. Put the BWT string in the right-most empty column

2. Sort the rows of the matrix such that the strings read along the columns of the matrix are in lexicographical order starting from the top-row????

3. Repeat step 1 and 2 until matrix is full

4. Extract the row of the matrix that has the end-delimiter in the final column

It's the "sort matrix" step that seems under-explained to me.

▲

hinkley

9 hours ago

[-]

Most BWT descriptions: now draw the rest of the owl.

Once upon a time I found one that did the entire round trip, I neglected to bookmark it. So when I eventually forgot how it worked, or wanted to show others, I was stuck. Never did find it.

▲

raboukhalil

6 hours ago

[-]

I added more details about how the decoding works here: https://sandbox.bio/concepts/bwt#intuition-decoding , I'd love to hear if that is more clear now

▲

raboukhalil

11 hours ago

[-]

Good point, thanks! I'll add a subsection about the intuition for that.

▲

raboukhalil

6 hours ago

[-]

I just added a section about the intuition behind the decoding: https://sandbox.bio/concepts/bwt#intuition-decoding , hope it helps!

▲

zahlman

4 hours ago

[-]

BWT is how I first got to interact with Tim Peters on the Internet: https://stackoverflow.com/questions/21297887

▲

fair_enough

11 hours ago

[-]

Thanks, man! This helps a lot, because as you say the algorithm is not intuitive. The description of the type of data it is very good for is the best value-adding part of this writeup. Greatly appreciated!

▲

username223

7 hours ago

[-]

Does anyone else remember when Dr. Dobb's Journal wrote about stuff like this? https://web.archive.org/web/20040217202950/http://www.ddj.co...

▲

embit

8 hours ago

[-]

This is great. The best article I have read on BWT and experimented was in Dr Dobbs Journal around 1996-97.

▲

loloquwowndueo

11 hours ago

[-]

I was once mentioning this transform to someone and they were like “what? The Bruce Willis transform?” - I guess my pronunciation sucked that much.

▲

Bukhmanizer

10 hours ago

[-]

Once in a lecture one of my friends asked “You keep mentioning Cooper Netties, who is Cooper Netties?” Everyone was very confused until someone figured out she was asking about Kubernetes.

▲

jansan

12 hours ago

[-]

For an article describing a compression algorithm this was very digestible and entertaining.

▲

jakedata

11 hours ago

[-]

This transformation in and of itself does not perform any compression. In fact it adds an additional string marker to the data. Performing compression on these strings is addressed here: https://www.cs.cmu.edu/~15451-f18/lectures/lec25-bwt.pdf#pag...