Sorted string tables (SST) from first principles
67 points | 4 days ago | 4 comments | bitsxpages.com
craftkiller
19 hours ago
The diagrams on this page are stunning! My only complaint is that the close/maximize/minimize buttons in the top left are unnecessary, but this is the kind of clarity I always strive for (and fail to achieve) every time I make diagrams.

Did you use a tool to create them, and if so, what is that tool?

agavra
13 hours ago
Thanks! I use https://monodraw.helftone.com/ which is my favorite one-time-purchase software of all time. I definitely agree the buttons in the top left are unnecessary but ... it's cute and it makes me happy so I can't help it. Maybe I'll come up with a different style for the next blog post.
cipehr
4 hours ago
Thanks for sharing. I also like the diagrams for this.
epistasis
17 hours ago
> There are several ways to organize immutable data durably that meet these requirements, the simplest of which is an append-only log.

This is also a fairly good way to handle large amounts of data with maximum performance on spinning rust, and at the heart of systems like Kafka.
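For anyone following along, a minimal sketch of that pattern (the record format and names here are mine, not from the article): writes are strictly sequential appends of length-prefixed records, and reads are random-access by offset.

```python
import os

class AppendLog:
    """Append-only log: writes are sequential, existing bytes never change."""

    def __init__(self, path):
        # 'ab' opens for appending, so every write lands at the end of the file
        self.f = open(path, "ab")

    def append(self, record: bytes) -> int:
        """Append one length-prefixed record; return its byte offset."""
        offset = self.f.tell()
        self.f.write(len(record).to_bytes(4, "big"))
        self.f.write(record)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability: force the write to stable storage
        return offset

def read_at(path, offset):
    """Random read: seek to a known offset and decode one record."""
    with open(path, "rb") as f:
        f.seek(offset)
        n = int.from_bytes(f.read(4), "big")
        return f.read(n)
```

The sequential-write pattern is exactly what both spinning disks and (as the author notes elsewhere) SSD flash translation layers reward.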

I had assumed that the story would be very different with SSDs, so it's surprising to see append only logs show up again.

agavra
13 hours ago
We haven't even started to discuss object storage, but it ends up looking very similar if you're building data systems on top of that instead of raw filesystems (not so much for physics reasons, but because object storage requires immutable objects and penalizes you for making many API calls).
agavra
13 hours ago
So cool to see this make the front page of Hacker News! I'm the author; I'll be online here throughout the weekend to answer any questions you might have :) Excited for the next post, on LSM trees, which is in the works.
mac3n
17 hours ago
See https://gitlab.com/mac3n/ksip: binary search on mmap'd sorted text files, no index needed.
agavra
13 hours ago
This is pretty different, but it reminds me of https://en.wikipedia.org/wiki/Bitcask - if you're storing it all in memory, why not just use a hash index?
mac3n
11 hours ago
The trick is, it's not all in memory - it's a memory-mapped file. If you look at the cache (with `fincore` or similar) you'll see that the binary search only loads the pages it examines, roughly logarithmic in the file size.

And a text file is the most useful general format - easy to write, easy to process with standard tools.

I've used this in the past on data sets of hundreds of millions of lines, maybe billions.

It's also true that you could use a memory-mapped indexed file for faster searches - I've used sqlite for this.
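A rough sketch of that technique (not ksip's actual code): binary-search byte offsets in a memory-mapped sorted text file, snapping each probe to the start of the line it lands in. Only the pages the probes touch get faulted in.

```python
import mmap

def mmap_search(path, key: bytes):
    """Binary search a sorted, newline-delimited text file via mmap.

    Each probe reads only the page(s) holding one line, so a search
    faults in O(log n) pages regardless of total file size.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lo, hi = 0, len(mm)          # lo and hi always sit on line boundaries
        while lo < hi:
            mid = (lo + hi) // 2
            # snap back to the start of the line containing mid
            nl = mm.rfind(b"\n", lo, mid)
            start = lo if nl == -1 else nl + 1
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            line = mm[start:end]
            if line < key:
                lo = end + 1
            elif line > key:
                hi = start
            else:
                return line
        return None
```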

mac3n
17 hours ago
What this could really use is a compression format that compresses a variable amount of text into fixed-size blocks. With that, it could binary-search compressed text.
agavra
13 hours ago
RocksDB actually does something somewhat similar with its prefix compression. It prefix-compresses keys and then "resets" the prefix compression every N records, storing a mapping of restart point -> offset so you can skip across compressed records. It's pretty neat.
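A simplified sketch of that scheme (names mine; RocksDB's real block format stores more per entry): each key becomes a (shared-prefix length, suffix) pair, a full key is written every N entries, and the indexes of those full keys form a restart array. To read key i, you replay forward from its nearest restart point.

```python
RESTART_INTERVAL = 16  # write a full, uncompressed key every N entries

def encode_block(sorted_keys):
    """Prefix-compress sorted keys, resetting every RESTART_INTERVAL entries."""
    entries, restarts, prev = [], [], b""
    for i, k in enumerate(sorted_keys):
        if i % RESTART_INTERVAL == 0:
            restarts.append(i)       # restart point: entry i is stored in full
            shared = 0
        else:
            # length of the common prefix with the previous key
            shared, m = 0, min(len(prev), len(k))
            while shared < m and prev[shared] == k[shared]:
                shared += 1
        entries.append((shared, k[shared:]))
        prev = k
    return entries, restarts

def decode_at(entries, restarts, idx):
    """Reconstruct key idx by replaying from its nearest restart point."""
    start = max(r for r in restarts if r <= idx)
    key = b""
    for shared, suffix in entries[start:idx + 1]:
        key = key[:shared] + suffix
    return key
```

At read time you binary-search the restart array (which is small and holds full keys), then scan forward at most N entries; that's what lets you skip across compressed records without decoding the whole block.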