Fast B-Trees (scattered-thoughts.net)
271 points | 2 months ago | 19 comments
BeeOnRope
2 months ago
[-]
To answer a question implied in the article, per-lookup timing with rdtscp hurts the hash more than the btree for the same reason the hash is hurt by the data-dependent chaining: rdtscp is an execution barrier which prevents successive lookups from overlapping. rdtsc (no p) isn't, and would probably produce quite different timings.

That the btree doesn't benefit from overlapping adjacent lookup/inserts is interesting.

I suppose it is because btree access (here) involves data-dependent branches, so with random access you'll get about L mispredicts per lookup in an L-deep tree, and adjacent lookups are separated by at least one mispredict: adjacent lookups can overlap execution, but the overlap is useless since everything beyond the next mispredict is on the bad path.

That's probably at least true for the small map regime. For the larger maps, the next iteration is actually very useful, even if on a mispredicted path, because the data accesses are at the right locations, so it serves to bring in all the nodes for the next iteration. This matters a lot outside of L2. At 5 instructions per comparison and 32-element nodes, however, there are just so many instructions in the window for 1 lookup that it's hard to make it to the next iteration.

So b-trees benefit a lot from a tight linear search (e.g. 2 instructions per check, macro-fused to 1 op), or a branch-free linear search, or, far better than either of those for big nodes, a vectorized branch-free search.
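
As a rough illustration (my sketch, not the article's code; the node layout is an assumption), a branch-free linear search can compute the rank of the needle within a node instead of branching per comparison, which also gives the compiler a clean loop to vectorize:

    // Branch-free rank of `needle` within one node's sorted keys: no early
    // exit, so there is nothing to mispredict and the comparison loop is
    // free to autovectorize.
    fn rank_in_node(keys: &[u64], needle: u64) -> usize {
        keys.iter().map(|&k| (k < needle) as usize).sum()
    }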

reply
dwattttt
2 months ago
[-]
> the overlapping is useless since everything beyond the next mispredict is useless as it is on the bad path

Is this a consequence of Spectre et al mitigations?

reply
BeeOnRope
2 months ago
[-]
No, just a consequence of how mispredicts work: all execution after a mispredict is thrown away, though some traces remain in the caches, which can be very important for performance (and also, of course, for Spectre).
reply
jnordwick
2 months ago
[-]
Can you be more specific about "all execution after a mispredict is thrown away"? Are you saying even non-dependent instructions? I thought the CPU just ignored the registers resulting from the wrong instruction path and started calculating the correct path - that the way the register file worked allowed it to be much better than just junking everything.

A little less verbosely: is non-dependent ILP work also tossed on a mispredict, even though it won't change?

reply
BeeOnRope
2 months ago
[-]
> Can you be more specific about "all execution after a mispredict is thrown away"? Are you saying even non-dependent instructions?

To clarify, the misprediction happens at some point in the linear instruction stream, and all instructions and results before that in the linear stream are kept and everything after is thrown out. Everything after is wrong because the stream of instructions didn't go down the expected path, so it's not even really about instruction dependencies at that point: the wrong stream of instructions executed in the first place.

In some cases the wrong and good paths join up very quickly, e.g.:

    cmp rax, rbx
    jl  over
    inc rax
  over:
    ...
In this case the jump is over a single instruction, so even if it's mispredicted the right stream will vary only in the `inc` instruction, so in principle it seems possible to save some of the further work which doesn't depend on rax, but this is very difficult in practice and no CPU does it in a general way that I know of. I believe some POWER arches could actually internally convert the pattern above into the moral equivalent of a predicated operation, removing the branch completely, but that's a different thing entirely with a different set of tradeoffs.
reply
jnordwick
2 months ago
[-]
Does this mean that a compiler should strive to move everything that isn't affected by the branch to before it?

e.g., if there was:

  over:
    inc rbx
should it move that to before the cmp? Is this always easy to do, and do compilers do it?
reply
dwattttt
2 months ago
[-]
That's the part I was curious about; whether there would've been a helpful cache impact, if not for modern Spectre prevention.
reply
starspangled
2 months ago
[-]
Spectre mitigations don't change that. Some of them do require adding speculation barriers, or otherwise turning off branch prediction, for cases where unprivileged state can be used to steer mispredicted privileged branches into gadgets that create a side channel to privileged data via speculative state.

But in general execution (i.e., no privilege domain crossings), this mechanism is not required.

Performance effects of executing mispredicted branches (called something like "wrong-path execution" or "mispredicted-path ..." in the literature) are interesting and have been studied. I don't know what the state of the art is, although I've seen results showing both speedups and slowdowns (as you would expect with any cache / BP kind of topic :P).

reply
BeeOnRope
2 months ago
[-]
> Spectre mitigations don't change that, ...

Yes, exactly. To first order, I think Spectre didn't really change the performance of existing userspace-only code. What slowed down was system calls, kernel code and some things which were recompiled or otherwise adjusted to mitigate some aspects of Spectre. There might be a rare exception, e.g., IIRC `lfence` slowed down on AMD in order to make it more useful as a speculation barrier, but this is hardly an instruction that saw much use before.

> I don't know what the state of the art is, although I've seen results showing both speedups and slowdowns

Yeah. This seems like a pretty cut-and-dried case where you'd get a speedup from wrong-path misses, since the independent next search will be correctly predicted from the start and access exactly the right nodes, so it serves as highly accurate prefetching: it only gets thrown out because of a mispredict at the end of the _prior_ search.

Something like the misses within a single binary search are more ambiguous: for random input the accuracy drops off like 0.5^n as you predict n levels deep, but that still adds up to ~double MLP compared to not speculating, so in a microbenchmark it tends to look good. In the real world with 1 lookup mixed in with a lot of other code, the many cache lines brought in on the bad path may be overall worse than inserting a speculation barrier yourself.

That's the cool part: we can choose whether we want speculation or not if we know up front if it's harmful.
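
For what it's worth, a minimal sketch of "opting out" on x86_64 (my example, not from the article) is just a serializing fence between lookups; whether that's a win is exactly the workload question above:

    // lfence doesn't let later instructions start until earlier ones have
    // completed, so wrong-path loads from a mispredicted previous search
    // never reach the memory system.
    #[cfg(target_arch = "x86_64")]
    fn speculation_barrier() {
        unsafe { std::arch::x86_64::_mm_lfence() };
    }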

reply
scythmic_waves
2 months ago
[-]
I appreciate write-ups of failed experiments like this. They're sorta like null results in science, but for engineering. And they can keep others from needlessly walking down the same path.

If everyone only wrote about their successes, we'd all have to independently rediscover failures behind closed doors.

reply
aidenn0
2 months ago
[-]
It's been a while since I last tried things, but I found crit-bit trees[1] to be much faster than b-trees. Hash array-mapped tries are also good if you don't need the things that trees give you (e.g. in-order traversal, get all values in a certain range).

1: https://cr.yp.to/critbit.html

reply
shpongled
2 months ago
[-]
Something like a radix trie can give you the in-order traversal and range query aspect, while still keeping some of the nice properties of HAMTs. In practice though (for my domain, scientific floating point data), I have found that b-trees and simple sorted arrays with binary search (sadly) outperform radix tries and cleverer solutions.
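
For reference, the "simple sorted arrays" baseline is tiny; a sketch (assuming float keys with no NaNs, which is my assumption, not the parent's):

    // Sorted parallel arrays with binary search: two contiguous slabs and
    // no per-node pointers at all.
    fn get(keys: &[f64], vals: &[f64], needle: f64) -> Option<f64> {
        keys.binary_search_by(|k| k.partial_cmp(&needle).unwrap()) // no NaNs
            .ok()
            .map(|i| vals[i])
    }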
reply
aidenn0
2 months ago
[-]
For a very small number of keys, an array with linear searching can win out.
reply
fweimer
2 months ago
[-]
B-trees are supposed to address the bad cache behavior of binary trees because they are generally much shallower. Crit-bit trees as originally described do not have this property.
reply
crest
2 months ago
[-]
Yes and no. A lot of it depends on how the low fan-out tree is laid out in memory. Also some binary tree structures (e.g. binary heaps) are often stored in an array with one child implied and only the other child index stored in each node. In such cases chasing down the child references may be no more expensive than searching inside a B(+)-tree node.
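
For comparison, the fully implicit array layout (the classic binary-heap arrangement, shown here as my own sketch) stores no child links at all; both children follow from index arithmetic:

    // Implicit binary tree in an array: the children of node i sit at
    // 2*i + 1 and 2*i + 2, so traversal is index math plus one array load.
    fn left(i: usize) -> usize { 2 * i + 1 }
    fn right(i: usize) -> usize { 2 * i + 2 }
    fn parent(i: usize) -> usize { (i - 1) / 2 }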
reply
josephg
2 months ago
[-]
I'd be curious to see how performance would change from storing b-tree entries in a semi-sorted array, and applying various other optimizations from here:

https://en.algorithmica.org/hpc/data-structures/b-tree/

The aggregate performance improvements Sergey Slotin gets from applying various "tricks" is insane.

reply
rebanevapustus
2 months ago
[-]
That's how it's done in the Rust stdlib alternative https://github.com/brurucy/indexset

Faster reads, slower inserts, but then you get the capability of indexing by position in (almost) O(1). In regular B-Trees this can only happen in O(n).

reply
vlovich123
2 months ago
[-]
Notably, I believe his data structures tend to ignore string keys because they're less amenable to SIMD. Would be interesting to see if his ideas about layout still show improvements for strings.
reply
anonymoushn
2 months ago
[-]
You can do strings. You probably want to store a bunch of string prefixes inline though, so that when the prefixes are not equal, you can use essentially the same code that he uses for integers.
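
A sketch of what storing prefixes inline might look like (the layout and names here are my assumption, not his): keep the first 8 bytes of each key next to the entry, compare them as a big-endian integer, and only touch the full string on a tie.

    use std::cmp::Ordering;

    struct Entry {
        prefix: [u8; 8], // first 8 bytes of the key, zero-padded
        key: Box<str>,   // full key, only consulted when prefixes tie
    }

    fn cmp_entry(e: &Entry, needle: &str) -> Ordering {
        let mut p = [0u8; 8];
        let n = needle.len().min(8);
        p[..n].copy_from_slice(&needle.as_bytes()[..n]);
        // Big-endian conversion preserves lexicographic byte order, so this
        // is the same integer-compare code path as for u64 keys.
        match u64::from_be_bytes(e.prefix).cmp(&u64::from_be_bytes(p)) {
            Ordering::Equal => e.key.as_bytes().cmp(needle.as_bytes()),
            other => other,
        }
    }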
reply
ur-whale
2 months ago
[-]
Unless I'm missing something, title of the article doesn't really correlate with its conclusion.
reply
dpatterbee
2 months ago
[-]
The title of the article is "Smolderingly fast b-trees". Smoldering is (sorta) an antonym of blazing. Blazingly fast means very fast, smolderingly fast would therefore mean not very fast.
reply
hinkley
2 months ago
[-]
Smoldering has other connotations though, which still imply attractiveness.
reply
kibo_money
2 months ago
[-]
Very interesting! You mentioned memory usage at the end; BTreeMaps are actually better than HashMaps most of the time, at least in Rust.

Here's a good breakdown: https://ntietz.com/blog/rust-hashmap-overhead/

reply
pclmulqdq
2 months ago
[-]
Btrees don't waste much memory, while hash tables have to have excess capacity if you want them to go fast.
reply
marginalia_nu
2 months ago
[-]
That's true for on-disk b-trees, which typically have large node sizes (often 4KB), but in-memory btrees tend to aim for CPU cache lines (a small multiple of 32B), and thus do actually waste a fair amount of memory: their comparatively low branching factor means a relatively large number of branches compared to the number of leaves.
reply
pclmulqdq
2 months ago
[-]
The waste of abseil's b-tree, for example, is small per value: https://abseil.io/about/design/btree

The efficiency compared to hash tables very much does carry over to the small b-trees used for in-memory data.

reply
cmrdporcupine
2 months ago
[-]
Appreciate the attention to detail on the microbenchmarks.

Skimming through (I need to read in more detail later), but what I would love to see is a real-world comparison against just a linear search in a vector: either of associated pairs, or of two vectors (one for keys, one for values, with matching offsets).

My hunch is that people in general are more often than not reaching for hash-tables (and sometimes trees) for the API convenience of an associative structure -- but that on modern architectures with decent cache sizes and for small(ish) data sets they can be outperformed by a simple O(N) lookup.

For example, it would be an interesting experiment to take something like the Python runtime (or the JDK etc) and replace its dictionary type with vectors -- at least for small dictionaries -- and then run through some common applications and frameworks and see what impact this has.
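
To make the comparison concrete, here's a minimal sketch of the kind of linear-scan associative structure meant above (hypothetical, just for illustration):

    // Hypothetical linear-scan "map": contiguous storage, no hashing, no
    // rebalancing; lookups are O(N) but cache-friendly and easy to predict.
    struct VecMap<K, V> {
        entries: Vec<(K, V)>,
    }

    impl<K: PartialEq, V> VecMap<K, V> {
        fn get(&self, key: &K) -> Option<&V> {
            self.entries.iter().find(|e| &e.0 == key).map(|e| &e.1)
        }

        fn insert(&mut self, key: K, value: V) {
            if let Some(e) = self.entries.iter_mut().find(|e| e.0 == key) {
                e.1 = value; // overwrite existing key
            } else {
                self.entries.push((key, value));
            }
        }
    }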

reply
tialaramex
2 months ago
[-]
I expect this experiment provides a small perf win for very small N, but that's swamped by the price of deciding whether to try this and in many applications it's also noise compared to the perf win from using hash tables for larger N.

A hash table with a very cheap hash (remember in C++ out of the box the hash for integers is usually the identity function) will be cheaper for quite modest N because it's mostly just doing less work. I could believe N>=4, for example.

reply
cmrdporcupine
2 months ago
[-]
I think N>=4 is woefully pessimistic. Contiguous, cached memory... it's just so much faster on modern machines.

That and all the potential for branch prediction misses in a hashtable or tree vs a simple array lookup...

Linear scan can be stupid fast.

reply
tialaramex
2 months ago
[-]
Linear scan is very fast, but it's not competing against a worse scan; it's competing against not scanning at all, which is what the hash table is going to do.

At N=4, say we're mapping one 64-bit integer to another for some reason, so that's 16 bytes per entry, four entries, 64 bytes, conveniently sized linear scan table.

The linear scan reads each of the four keys in turn, if one matches that's a hit, if none do we missed. Four 64-bit reads, four 64-bit compares & branches.

A Swiss table big enough for N=4 has eight slots (128 bytes) plus 16 bytes of metadata (total 144 bytes). Half of the slots are empty.

We do a single arithmetic operation on the key, picking the only 16 bytes of metadata which exist, then we load that data, and a single SIMD instruction matches either zero or one entry, and if it wasn't zero we load and compare that entry.

In practical systems you'll get more leeway because Swiss tables want a real hash, your keys probably aren't suitable so maybe there's a dozen or more ALU operations to hash a key - which is a substantial head start for a linear scan. But hey we're talking small amounts of data, who even cares about hash quality?

But do measure, I too am interested, at least in a real shoot out between an actual linear scan and a decent hash table. If the "same API as a hash table but same implementation as linear scan" makes sense for some sizes maybe it's worth writing that collection type.

reply
menaerus
2 months ago
[-]
> Linear scan is very fast, but it's not competing against a worse scan it's competing against not scanning which is what the hash table is going to do.

Not sure I agree, since "not scanning" would imply a hash table where probing never goes beyond the initial step. That is not a very realistic workload, so I also think that N=4 is way pessimistic. Memory prefetchers, especially with linear access patterns in your code, will hide the latency, so I would expect N to need to be much larger before the difference shows.

reply
tialaramex
2 months ago
[-]
Probabilistically, the long probe lengths occur for large N, but for large N the linear scan is already horrible. But it's definitely worth actually measuring.
reply
cmrdporcupine
2 months ago
[-]
The point is that linear scan is not horrible when held in cache lines that are several times faster than main memory.

That, and branch prediction (or the absence of it), leads to performance characteristics that seem entirely counterintuitive to people of my age who grew up counting cycles on a 68000.

reply
tialaramex
2 months ago
[-]
The point, as I have already said once, is that even though it's fast it's not faster than not doing it at all.

For very small containers the linear scan wins because it's the same work you'd have done anyway, so we're cutting to the chase. My conjecture is that they have to be very small. Several people now have insisted that it will somehow just stay faster beyond that, and my guess is that while they've thought about an actual linear scan on modern CPUs, they have chosen to imagine the 1980s closed-addressing hash table, so they're racing against pointer chasing - but that was last century. A modern hash table is just linear data laid out more cleverly, so it's not chasing pointers; it's just not wasting time on your linear scan.

C++ std::unordered_map is going to involve pointer chasing (the "hash table" just has pointers in it; each bucket is allocated separately), which is probably going to result in a cache miss and give you a lot of time to win the race. A Swiss table is not: it's just M slots plus metadata, where typically M <= N*2 - several other techniques have the same consequence but I won't spell them all out.

Because M > N, linear scanning the entire table would be maybe 2x slower, but we're not going to scan the entire table - that's why it's a hash table; we're probably jumping to a single entry, if it's even present.

reply
SPascareli13
2 months ago
[-]
I tested this very casually some time ago with Go maps, and up to around a hundred items the linear search over an array was faster than the map lookup. Considering that many of the maps we use for convenience have fewer than a hundred items, this could be useful.

Unfortunately I don't have the results (or the test code) anymore, but it shouldn't be hard to do again (casually at least).

reply
hinkley
2 months ago
[-]
Trees can also be useful here but they have some of the same branching problems that hashes do and lists do not.
reply
hinkley
2 months ago
[-]
Do we need TimSearch?
reply
ww520
2 months ago
[-]
Adaptive radix trees are pretty good as well, with support for in-order listing and range queries. They can beat b-trees and come in close behind hashmaps.
reply
vlowther
2 months ago
[-]
These days I reach for adaptive radix trees over b-trees when I have keys and don't need arbitrary sort orderings. They are just that much more CPU- and memory-efficient.
reply
orlp
2 months ago
[-]
Why was Rust's hashmap only tested with SipHash? It's known to be pretty bad for performance.

I'm biased as the author of course, but try adding a benchmark with the Rust hasher + foldhash as well: https://github.com/orlp/foldhash.

reply
espadrine
2 months ago
[-]
They are looking for a data structure that is robust against hash flooding attacks like https://www.cve.org/CVERecord?id=CVE-2011-4815

You mention that foldhash does not claim to provide HashDoS resistance against interactive attackers, so perhaps that disqualifies it.

If anything, given this requirement, comparing with wyhash, as they do in the article, is misleading.

reply
orlp
2 months ago
[-]
> You mention that foldhash does not claim to provide HashDoS resistance against interactive attackers, so perhaps that disqualifies it.

The linked CVE is not an interactive attack either FYI, so foldhash would be sufficient to protect against that. When I say an "interactive attacker" I mean one that analyzes hash outcomes (either directly or indirectly through things like timing attacks and iteration order) to try and reverse engineer the hidden internal state.

> If anything, given this requirement, comparing with wyhash, as they do in the article, is misleading.

That is correct. There is no reason to believe wyhash is secure against interactive attackers.

reply
vlovich123
2 months ago
[-]
XXH3 would have this property and would be drastically faster. SipHash is just a bad default choice imho.
reply
orlp
2 months ago
[-]
XXH3 does not have this property, no more than foldhash does.
reply
pjdesno
2 months ago
[-]
It would be interesting to compare the Python sortedcontainers algorithm - I've been using a C++ version of it that works quite well.

Note also that nodes in B-trees (and other splitting-based data structures) have a mean load factor more like 75% - 50% is the minimum load for non-root nodes, right after splitting, and 100% is the max right before splitting.

reply
evanmoran
2 months ago
[-]
This post reminded me of this killer btree implementation for Swift. The analysis on GitHub is fantastic and reads like a thriller novel :)

https://github.com/attaswift/BTree#why

reply
helltone
2 months ago
[-]
Possibly off topic, but I was wondering: what are the most comprehensive data structure benchmarks out there?
reply
waynecochran
2 months ago
[-]
If you can keep all the data in core memory, why not just use your favorite balanced binary search tree (BST), like a red-black tree? B-trees only help performance, at least in the common understanding, when all the data cannot be in core memory (which is why databases use them).
reply
ismailmaj
2 months ago
[-]
For cache locality: if the layout is wide instead of deep, you can avoid many cache misses.
reply
winwang
2 months ago
[-]
Yep -- and more generally, sequential access > random access, just a bit less so in memory than on SSD, not even mentioning spinning disk.
reply
winwang
2 months ago
[-]
Would be really interesting to see a similar study but with skiplists! I'd imagine it would be slower than hashmaps for many of the exact same reasons outlined in the article, but numbers would be cool.
reply
tekknolagi
2 months ago
[-]
I thought a lot of the b(+)tree advantage was in bigger-than-RAM workloads for large databases, and these benchmarks seem relatively small in comparison.
reply
foota
2 months ago
[-]
B-Trees are good for in memory data too because they have fairly good cache behavior.
reply
crest
2 months ago
[-]
As long as your puny little working set (2^16 small keys) fits into L2 cache and is perfectly covered by the L1 dTLB, you won't see the cost of touching random pages in a big hash table larger than the last-level TLB coverage and on-chip caches. There won't be any TLB stalls waiting for the page walkers, and you won't miss the lost spatial locality in the key space that B(+)-trees preserve if everything is in L2 cache. At the very least it proves that hash tables can be a good fit for point queries of datasets too large for linear searching or sorting + binary searches, but not yet large enough to exhaust CPU cache capacity.
reply
robertclaus
2 months ago
[-]
Ya, I would be curious to see how this performs on out-of-cache data on an SSD and actual hard drive. On the other hand, the findings are definitely still relevant since RAM has gotten fairly cheap and most applications probably fit in it just fine.

Regarding databases - btrees also have a natural sort order, which hash tables don't. This means a btree as your main data structure helps with sort, range, or list operations in a way hash tables can't. That being said, even traditional databases obviously still use hash tables extensively (e.g. hash joins).

reply
scotty79
2 months ago
[-]
In Rust, thanks to that, you can have a BTreeSet of BTreeSets.
reply
scotty79
2 months ago
[-]
I think my comment was way too short.

What I meant was that I needed a set type in Rust. At first I tried to use HashSet, but I needed its elements to also be sets. I couldn't, because HashSet doesn't itself have a natural Hash; I'd need to implement it myself.

However, with BTreeSet the requirement on the elements is that they have a natural ordering (the Ord trait), and BTreeSet itself also implements Ord, so I could have a BTreeSet of BTreeSets without implementing anything extra. That's thanks to the fact that B-trees have a natural ordering.

So I used BTreeSet instead.
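
For anyone curious, the minimal version of this (my example, not OP's code) compiles with no extra impls, because BTreeSet<T> is Ord whenever T is Ord:

    use std::collections::BTreeSet;

    fn main() {
        // Sets of sets "just work": BTreeSet<u32> already implements Ord.
        let mut outer: BTreeSet<BTreeSet<u32>> = BTreeSet::new();
        outer.insert(BTreeSet::from([1, 2, 3]));
        outer.insert(BTreeSet::from([4, 5]));
        assert!(outer.contains(&BTreeSet::from([4, 5])));
    }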

reply
xxs
2 months ago
[-]
b-trees are just 'better' binary trees, as they have fewer indirections (nodes)
reply
marginalia_nu
2 months ago
[-]
You can line them up with disk blocks, or with CPU cache lines; the effect is relatively similar.
reply
nialv7
2 months ago
[-]
I feel I missed the point of this article. I thought the author was trying to prove that b-trees aren't that bad compared to hashmaps, but taking 2~3x longer looks pretty bad.

If I need predictable ordering (but not actually sorting the keys) I will use something like indexmap, not b-tree.

reply
magicalhippo
2 months ago
[-]
The point seems to be that the author found very different estimates of just how much worse b-trees would be. As the author notes, hashmaps have some less desirable properties as well. So the author ran some benchmarks to find out, and ended with the following conclusion:

> Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps and I'll either iterate in insertion order or I'll require sorting entries before iterating.

reply
lsb
2 months ago
[-]
Clojure, for example, uses Hash Array Mapped Tries as its associative data structure, and those work well
reply
georgehill
2 months ago
[-]
reply
BeeOnRope
2 months ago
[-]
Nice article!

Very cool to see both the "independent" and "serially dependent" cases addressed. Microbenchmarks still have lots of ways of giving the wrong answer, but looking at both these cases exposes one of the big variables which cause that.

In my experience looking at container performance you often pass through two distinct regimes (in a microbenchmark!):

Small regime: for small containers, instruction count, instruction dependencies and IPC (including the effect of branch misses) dominate.

In this regime the fastest container in a "throughput" sense will often be the one with the fewest micro-operations (including those executed on the wrong path). Fewer operations helps both in raw speed and in overlapping more independent lookups within the instruction window. Any type of per-lookup misprediction is hugely destructive to performance. For random keys, this often favors hash tables because they can be written to have << 1 mispredict per lookup.

In this small regime the fastest container in a latency sense is the one with the shortest critical path from input to output, again considering mispredicts. The latency of non-memory instructions will be very important on this critical path, and again mispredicts are very destructive since they usually add directly to the critical path (not always!). There are lots of tricks for keeping the critical path short, including hashes with somewhat higher operation counts but smaller critical paths (e.g., a short merge), linear searches which have > 1 independent stream, etc. If the keys are predictable, hash containers can look bad because they tend to have a long data dependency from the hash through the lookup to the output. Tree-like containers tend to replace those data dependencies with control dependencies, so the data-dependent critical path can be very short! With random keys, hashes win again because mispredicts are so destructive.

Then in the large regime, a lot of the same themes repeat, but instead of applying to "all instructions", it's mostly about memory access. I.e., the winning throughput containers are the ones that can get the highest useful MLP, and the winning latency containers are the ones with the shortest critical path of data-dependent memory accesses, mostly ignoring everything else. Instructions still matter because MLP is often dependent on how many accesses you can stuff into the processor's OoOE execution window, and the size of that structure is (roughly speaking) counted in instructions. Software prefetching can help a ton with stuffing the right memory accesses into the OoOE window.
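
As a hedged sketch of that last point (x86_64 only; the batching scheme and names are mine, not from the article): before doing the dependent work for the current lookup, issue prefetches for the nodes the next few lookups will need, so several misses are in flight at once.

    #[cfg(target_arch = "x86_64")]
    unsafe fn prefetch_lines(ptrs: &[*const u8]) {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        for &p in ptrs {
            // Hint the cache hierarchy to start fetching each line now;
            // the dependent loads issued later should then hit in cache.
            _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
        }
    }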

For random keys and "usually hit" lookups, hashes again tend to win in this regime, because they can usually get down to 1 miss per lookup, and that's math the other structures just can't overcome. For non-random keys, the field is wide open: it depends heavily on the key distribution. For lookups which often miss, there are plenty of ways to break the 1-miss-per-lookup barrier too.

reply