Sorted string tables (SST) from first principles
67 points | 4 days ago | 4 comments | bitsxpages.com
craftkiller
19 hours ago
The diagrams on this page are stunning! My only complaint is that the close/maximize/minimize buttons in the top left are unnecessary, but this is the kind of clarity I always strive for (and fail to achieve) every time I make diagrams.

Did you use a tool to create them, and if so, what is that tool?

agavra
13 hours ago
Thanks! I use https://monodraw.helftone.com/ which is my favorite one-time-purchase software of all time. I definitely agree the buttons in the top left are unnecessary but ... it's cute and it makes me happy so I can't help it. Maybe I'll come up with a different style for the next blog post.
cipehr
4 hours ago
Thanks for sharing. I also like the diagrams for this.
epistasis
17 hours ago
> There are several ways to organize immutable data durably that meet these requirements, the simplest of which is an append-only log.

This is also a fairly good way to handle large amounts of data with maximum performance on spinning rust, and at the heart of systems like Kafka.
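For anyone following along, a minimal sketch of that pattern (the record format and names here are mine, not from the article): writes are strictly sequential appends of length-prefixed records, and reads are random-access by offset.

```python
import os

class AppendLog:
    """Append-only log: writes are sequential, existing bytes never change."""

    def __init__(self, path):
        # 'ab' opens for appending, so every write lands at the end of the file
        self.f = open(path, "ab")

    def append(self, record: bytes) -> int:
        """Append one length-prefixed record; return its byte offset."""
        offset = self.f.tell()
        self.f.write(len(record).to_bytes(4, "big"))
        self.f.write(record)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability: force the write to stable storage
        return offset

def read_at(path, offset):
    """Random read: seek to a known offset and decode one record."""
    with open(path, "rb") as f:
        f.seek(offset)
        n = int.from_bytes(f.read(4), "big")
        return f.read(n)
```

The sequential-write pattern is exactly what both spinning disks and (as the author notes elsewhere) SSD flash translation layers reward.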

I had assumed that the story would be very different with SSDs, so it's surprising to see append only logs show up again.

agavra
13 hours ago
We haven't even started to discuss object storage, but it ends up looking very similar if you're building data systems on top of that instead of raw filesystems (not so much for physics reasons, but because object storage requires immutable objects and penalizes you for making many API calls).
agavra
13 hours ago
So cool to see this make the front page of Hacker News! I'm the author; I'll be online here throughout the weekend to answer any questions you might have :) Excited for the next post, on LSM trees, which is in the works.
mac3n
17 hours ago
See https://gitlab.com/mac3n/ksip: binary search on mmap'd sorted text files, no index needed.
agavra
13 hours ago
This is pretty different, but it reminds me of https://en.wikipedia.org/wiki/Bitcask - if you're storing it all in memory, why not just use a hash index?
mac3n
11 hours ago
The trick is, it's not all in memory - it's a memory-mapped file. If you look at the cache (with `fincore` or similar) you'll see that the binary search only loads the pages it examines, roughly logarithmic in the file size.

And a text file is the most useful general format - easy to write, easy to process with standard tools.

I've used this in the past on data sets of hundreds of millions of lines, maybe billions.

It's also true that you could use a memory-mapped indexed file for faster searches - I've used sqlite for this.
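A rough sketch of that technique (not ksip's actual code): binary-search byte offsets in a memory-mapped sorted text file, snapping each probe to the start of the line it lands in. Only the pages the probes touch get faulted in.

```python
import mmap

def mmap_search(path, key: bytes):
    """Binary search a sorted, newline-delimited text file via mmap.

    Each probe reads only the page(s) holding one line, so a search
    faults in O(log n) pages regardless of total file size.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lo, hi = 0, len(mm)          # lo and hi always sit on line boundaries
        while lo < hi:
            mid = (lo + hi) // 2
            # snap back to the start of the line containing mid
            nl = mm.rfind(b"\n", lo, mid)
            start = lo if nl == -1 else nl + 1
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            line = mm[start:end]
            if line < key:
                lo = end + 1
            elif line > key:
                hi = start
            else:
                return line
        return None
```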

mac3n
17 hours ago
What this could really use is a compression format that compresses a variable amount of text into fixed-size blocks. With that, it could binary-search compressed text.
agavra
13 hours ago
RocksDB actually does something somewhat similar with its prefix compression. It prefix-compresses keys and then "resets" the prefix compression every N records, storing a mapping of restart point -> offset so you can skip across compressed records. It's pretty neat.
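A simplified sketch of that scheme (names mine; RocksDB's real block format stores more per entry): each key becomes a (shared-prefix length, suffix) pair, a full key is written every N entries, and the indexes of those full keys form a restart array. To read key i, you replay forward from its nearest restart point.

```python
RESTART_INTERVAL = 16  # write a full, uncompressed key every N entries

def encode_block(sorted_keys):
    """Prefix-compress sorted keys, resetting every RESTART_INTERVAL entries."""
    entries, restarts, prev = [], [], b""
    for i, k in enumerate(sorted_keys):
        if i % RESTART_INTERVAL == 0:
            restarts.append(i)       # restart point: entry i is stored in full
            shared = 0
        else:
            # length of the common prefix with the previous key
            shared, m = 0, min(len(prev), len(k))
            while shared < m and prev[shared] == k[shared]:
                shared += 1
        entries.append((shared, k[shared:]))
        prev = k
    return entries, restarts

def decode_at(entries, restarts, idx):
    """Reconstruct key idx by replaying from its nearest restart point."""
    start = max(r for r in restarts if r <= idx)
    key = b""
    for shared, suffix in entries[start:idx + 1]:
        key = key[:shared] + suffix
    return key
```

At read time you binary-search the restart array (which is small and holds full keys), then scan forward at most N entries; that's what lets you skip across compressed records without decoding the whole block.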