A hash table by any other name
127 points | 1 month ago | 5 comments | lwn.net | HN
metadat
1 month ago
[-]
Related reading:

What is RCU, Fundamentally? - https://lwn.net/Articles/262464/

Amusingly, the first several hits for "Golang RCU" point to GitHub repos which contain a bunch of Mutexes (locks), import "unsafe", and haven't been updated in 7+ years. Perhaps this is not the most well-understood area of computer science.

Let's face it, when was the last time you needed RCU for a web app?

reply
yencabulator
1 month ago
[-]
In a language with GC, RCU is just a thing you can atomically swap to a new version. C is why RCU is complicated.

https://pkg.go.dev/sync/atomic#example-Value-Config
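A minimal sketch of that idea (mirroring the linked atomic.Value Config example, but with generics and a made-up config type; assumes Go 1.19+ for atomic.Pointer): readers do one atomic load, writers publish a fresh immutable value, and the GC reclaims the old one once no reader still holds it.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // config is treated as immutable once published.
    type config struct {
        timeoutMillis int
        endpoint      string
    }

    var current atomic.Pointer[config]

    func main() {
        current.Store(&config{timeoutMillis: 100, endpoint: "example.internal"})

        // Reader side: a single atomic load, no locks.
        cfg := current.Load()
        fmt.Println(cfg.endpoint, cfg.timeoutMillis)

        // Writer side: copy, modify, publish a new version.
        next := *current.Load()
        next.timeoutMillis = 250
        current.Store(&next)
    }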

reply
facturi
1 month ago
[-]
In GC languages, RCU is just an atomic pointer update to an immutable data structure. In a language without GC, it requires delayed destruction to avoid memory safety issues.

The Linux kernel's RCU waits for all CPU cores to context switch, and it disables preemption while reading an RCU-protected data structure. This is not usable in normal application programming.

reply
stoperaticless
1 month ago
[-]
In the LWN comments a guy insists that “O(f(n)) is by definition a bound over the worst-case input to a function.”

Is that actually true?

Mathematically: O(f(n)) is an asymptotic bound as n approaches infinity, which can be used to describe worst, best, average, and other cases.

Have programmers redefined O to specifically mean worst case?

reply
lkirkwood
1 month ago
[-]
No, I believe you are forgetting the context. Big O is an upper bound on the growth rate of a function. In this case the function describes the time taken, therefore a higher time is a "worse case". So Big O is, in this scenario, a bound on the worst-case time taken (essentially what the commenter said).

Certainly as I understand it your definition of Big O is incorrect - it provides exclusively an upper bound. Big Omega is used for a lower bound and Big Theta for an upper and lower bound.

reply
stoperaticless
1 month ago
[-]
The math definition goes something like this: f(n) = O(g(n)) if there exists a constant C such that f(n) ≤ C·g(n) for all sufficiently large n.
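For reference, the usual formal statement (conventions vary slightly between textbooks; this is the common one):

    f(n) = O(g(n)) \iff \exists\, C > 0,\ \exists\, n_0 \text{ such that } f(n) \le C \cdot g(n) \text{ for all } n \ge n_0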

The function “quickSortExecutionTime(permutation)” depends on the exact input (each permutation/shuffle has a different execution time), and the parameter “permutation” cannot go to infinity, so it can't be used with O notation directly. (The parameter needs to be a single real number.)

quickSortWorstCaseTime(n) is a single-valued function of n, so it can be used with O notation. The same is true for quickSortAvgCaseTime(n) and quickSortBestCaseTime(n).

But you answered my question. (Very indirect “Yes” :) )

reply
lkirkwood
1 month ago
[-]
It's true that in the case of a sorting algorithm it's not immediately obvious how to analyse the time complexity with Big O. However, I feel that you may be getting bogged down in semantics.

Perhaps it is accurate to say that when Big O is used to describe the time complexity of a function in computer science the variable n in O(f(n)) usually describes the size of the input, which may not be common knowledge. But if the question is:

> Have programmers redefined O to specifically mean worst case?

Then in general I would say no. I suppose to answer more concretely I would have to ask: "Which programmers?"

reply
John23832
1 month ago
[-]
Big O is upper bound (worst case), big theta is average case, big omega is lower bound (best case).
reply
msm_
1 month ago
[-]
Not according to my understanding of the subject, my copy of "Introduction to Algorithms", and Wikipedia.

Big O is an upper bound on a function's growth - being O(f(n)) means growing asymptotically no faster than f(n) (up to multiplication by a constant).

Big Omega is a lower bound on a function's growth - being Ω(f(n)) means growing asymptotically at least as fast as f(n) (up to multiplication by a constant).

Big Theta means the function grows exactly that fast - being Θ(f(n)) means the function is both O(f(n)) and Ω(f(n)), so it is exactly f(n) up to multiplication by a constant. In practice, due to typographical limitations, people use O(n) where Θ(n) would be more precise.

Average-case, best-case, worst-case, amortized, etc. complexities are a wholly different thing. For example, quicksort is Θ(n^2) in the worst case and Θ(n log n) in the average case.
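To make that split concrete (my notation, with |x| denoting the size of input x): the case analysis comes first and produces an ordinary function of n, and only then does the asymptotic notation apply. For quicksort:

    T_{\mathrm{worst}}(n) = \max_{|x| = n} T(x) = \Theta(n^2), \qquad T_{\mathrm{avg}}(n) = \mathbb{E}_{|x| = n}[T(x)] = \Theta(n \log n)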

reply
commandlinefan
1 month ago
[-]
Interestingly, I was just reading the book "Java Concurrency in Practice", where the author compares the performance of an array-based hash table like the one described here with a linked-list-based hash table and finds that the linked list comes out ahead performance-wise.
reply
kevincox
1 month ago
[-]
A JVM with a moving GC can have wildly different performance characteristics from C. A moving collector can use a bump allocator, which can result in these independently allocated chain nodes ending up adjacent in memory, mitigating much of the cache-miss cost of linked hash tables. Especially in microbenchmarks, where basically all allocations come from the hash table under measurement.

In C it is much more likely that these allocations are spread around memory and in different cache lines.

reply
NoahKAndrews
1 month ago
[-]
That's a fantastic book. I hope it gets a new edition at some point.
reply
commandlinefan
1 month ago
[-]
I'd love to read it if it did, but honestly I wonder if enough's changed to warrant a new edition. Everything I'm reading now is still relevant.
reply
NoahKAndrews
29 days ago
[-]
Virtual threads seem like a change worth covering, but yes, it's hardly obsolete.
reply
pphysch
1 month ago
[-]
> "I've assumed an average chain length of 10 for rhashtable in the above memory calculations."

> Reviewers were, as ever, skeptical in the face of assertions about performance without corresponding measurements. Peng Zhang asked "how much performance improvement is expected if it is applied to dcache?" Wilcox didn't reply to Zhang's question.

Am I reading this right that the author supplied no actual (benchmarking) evidence to support their proposed performance optimization?

reply
RandomThoughts3
1 month ago
[-]
Seriously, a simple Google search, or taking a minute to follow the first link in the article, would have shown you that the patch set obviously came with a benchmark [1]. What is missing is performance data from actual real use in the kernel. It's someone posting a patch to the kernel mailing list, not some random kid on GitHub; there is no point in assuming they are an idiot just to make yourself look good.

My days as an LWN subscriber are now far behind me, but gosh, reading the comments there was a blast. I had forgotten how good proper technical discussions used to be.

[1] https://lore.kernel.org/lkml/20240222203726.1101861-1-willy@...

reply
jonhohle
1 month ago
[-]
I poked through a few messages in that thread, and while there is a space benchmark, there is no time benchmark, something I would expect to see when switching from a linked list to a cache-aware implementation. Phoronix, who would benchmark a potato if you told them it gave you better performance, has an article with no benchmarks. The commit says it should be good enough for someone to run a perf benchmark.

Is it common to add an unused data structure to the kernel without understanding its performance characteristics? If benchmarks are available, they are not obvious to someone who has read the article, read the commit message, read LKML, gone looking for benchmarks, and still come up with no concept of how this might perform in realistic or synthetic scenarios.

reply
scottlamb
1 month ago
[-]
I'm pretty sure pphysch is correct.

> Seriously, a simple google search or taking a minute to follow the first link in the article would have shown you that the patch set obviously came with a benchmark [1].

If you're referring to the table with headers "number xarray maple rhash rosebush", it's an estimate of memory usage, not a benchmark. And the author (Matthew Wilcox) says "I've assumed an average chain length of 10 for rhashtable in the above memory calculations", which folks questioned. So there's reason to doubt its correctness.

> It’s someone posting a patch to the kernel mailing list, not some random kid on GitHub, there is no point assuming they are an idiot to try to shine.

Let's just let the work speak rather than flagging the author as an idiot or genius...at present rosebush is unproven and based on questionable assumptions. That might change in future rounds.

reply
pphysch
1 month ago
[-]
The entire point of a LWN article like this is to accurately summarize the patch sets and email discussions for casual readers. No need for the snark.
reply
kelnos
1 month ago
[-]
That's how I read it too, and that seems bizarre to me. The one request for actual performance numbers was ignored? That seems pretty crappy.
reply
rurban
1 month ago
[-]
Because many others have done those benchmarks before and found vast improvements over simple linked lists.
reply
o11c
1 month ago
[-]
Is it just me, or is this unsafely assuming that collisions can only happen in the low bits, not for the full hash (despite it being only 32 bits)?
reply
masklinn
1 month ago
[-]
No? From the text at least, the low bits are just used to select the bucket; this is a common technique (and a reason for using power-of-two sizes, so you don't need a full modulo).

Once you're in a bucket, it does a full hash match, then, I assume (I didn't look at the patch), an actual key comparison.
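A minimal sketch of that lookup scheme as I read it (illustrative only, not the actual patch): low bits pick the bucket, the full hash filters candidates, and the key comparison settles it.

    package sketch

    type entry struct {
        hash uint32
        key  string
        val  any
    }

    // table has a power-of-two number of buckets so the low bits of the
    // hash can select a bucket with a mask instead of a full modulo.
    type table struct {
        buckets [][]entry
    }

    func (t *table) lookup(hash uint32, key string) (any, bool) {
        b := t.buckets[hash&uint32(len(t.buckets)-1)] // low bits select the bucket
        for _, e := range b {
            if e.hash == hash && e.key == key { // full-hash match, then key comparison
                return e.val, true
            }
        }
        return nil, false
    }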

reply
o11c
1 month ago
[-]
That would be fine, except that in v2 the buckets are fixed size.

What happens when more than 20 objects end up with the exact same hash?

reply
tialaramex
1 month ago
[-]
Yes, it does seem like a problem. By my reading, this will just endlessly try to "split" the bucket of 20 items, hoping that eventually one of the items goes into a new bucket and leaves space; but if they all have the same hash, that never happens. Disaster.

If I'm wrong I don't think this code does a good enough job of explaining what's going on, and if I'm correct it's defective and should not ship.

Even if this situation "shouldn't" happen, it clearly can happen; you just have to get unlucky. It's difficult to compute how unlikely that is for real systems, so it's better to ensure that "we got unlucky" isn't a disaster for the kernel.

reply
rurban
1 month ago
[-]
No disaster. Against an actual attack (e.g. if the seed leaked) you count the collisions and return empty at some low threshold. Like 100 on 32-bit.
reply
tialaramex
1 month ago
[-]
Where in the code is this "no disaster" routine you describe?
reply
rurban
1 month ago
[-]
The code is not security-hardened yet; seed attacks are not yet accounted for. But it's no disaster, as it's trivial to prevent. They say seeds are secure, so there's no chance to attack it. Which is a bit childish.

It would just be a disaster if they followed Miller's networking code, which uses slower hashes to prevent seed attacks. But every other popular language/library does the very same wrong thing.

reply
samatman
1 month ago
[-]
If it were my hash table, the arrays would just be unrolled linked lists.

So a "twenty object" bucket would actually store 19 objects, and if the last one wasn't null it would contain a pointer to the next bucket.

I'm willing to presume these folks know what they're doing, so that's probably what's happening here.
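For concreteness, a rough sketch of that shape (my guess at the layout samatman describes, not the rosebush code): a fixed array of entry slots plus an overflow pointer that is only followed once the array fills up.

    package unrolled

    type entry struct {
        hash uint32
        key  string
        val  any
    }

    const slotsPerBucket = 19 // hypothetical: the "20th slot" is the overflow link

    type bucket struct {
        entries [slotsPerBucket]*entry
        next    *bucket // non-nil only once every slot is occupied
    }

    // insert places e in the first free slot, chaining a new bucket if the
    // current one (and all linked ones) are full.
    func (b *bucket) insert(e *entry) {
        for {
            for i := range b.entries {
                if b.entries[i] == nil {
                    b.entries[i] = e
                    return
                }
            }
            if b.next == nil {
                b.next = &bucket{}
            }
            b = b.next
        }
    }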

reply
tubs
1 month ago
[-]
They resize the top level array and rehash items into their new buckets.
reply
tialaramex
1 month ago
[-]
Sure. There were 20 items in the bucket with the same hash, and we wanted to add one more.

So we grow the top level hash table and we put all of the twenty items in the same new bucket, because they have the same hash, and then we... oh, we don't have enough room. Try again? (The code presented does). Too bad, that can't help.

In your understanding, why does this not happen? That's the question. How does this code ensure this can't happen? If it doesn't, if it's just hoping to get lucky, the code must not be used.
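A toy illustration of why growing can't help here (assumptions mine: power-of-two table, bucket picked by masking the hash): entries with an identical full hash land in the same bucket at every table size.

    package main

    import "fmt"

    func main() {
        const sharedHash uint32 = 0x1234abcd // pretend 21 different keys all hash to this value
        for _, nBuckets := range []uint32{16, 64, 1 << 20} {
            idx := sharedHash & (nBuckets - 1) // mask == modulo for power-of-two sizes
            fmt.Printf("%8d buckets -> every colliding entry lands in bucket %d\n", nBuckets, idx)
        }
    }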

reply
tubs
1 month ago
[-]
They go to the same bucket because their hash *mod the size of the top array* collides.
reply
tialaramex
1 month ago
[-]
This whole sub-thread is about the case where the entire hash collides.

Changing which bits we use changes nothing for this problem case.

reply
tubs
1 month ago
[-]
I see what you mean here, sorry.

I guess I'd argue that if you have 20 items whose full hashes collide then something is pretty seriously wrong with the hashing algorithm - raise an error/exception/panic (whatever is appropriate for the situation) and surface it.

reply
bufferoverflow
1 month ago
[-]
I think they address this in the article. They use arrays instead of linked lists, which makes direct key comparison very fast because it's all in CPU cache.
reply
tialaramex
1 month ago
[-]
"I just use an array" doesn't solve the problem "What if the array is full?".

In fact it makes that problem acute. A linked-list structure degrades gradually as we shove more entries with the same hash into it: every look-up incurs a pointer chase down the entire list, and while the slowdown eventually becomes unacceptable, we at least get warning signs first. With arrays we're just screwed; in C it's pretty likely we end up with an illegal dereference, or we lose data once there's no room.

reply
mst
1 month ago
[-]
If it's designed to be used for caching, then replacing an old entry or throwing away the new one is quite possibly a perfectly reasonable "out."

If it's designed to be used for other purposes as well, not so much.

(I wasn't entirely sure which from reading the article)

reply
10000truths
1 month ago
[-]
The rhashtable's bucket_table struct stores a random seed (the hash_rnd field) that is folded into the calculation of the hash. Without knowing what the seed is going to be, an attacker has no way to precompute keys with colliding hashes.
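A rough userspace analogy (Go's hash/maphash, not the kernel's rhashtable): the hash of a key depends on a random seed chosen when the table is created, so colliding keys can't be precomputed without knowing it.

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    func main() {
        seed := maphash.MakeSeed() // random per run, playing the role of hash_rnd
        for _, key := range []string{"foo", "bar"} {
            h := maphash.String(seed, key)
            fmt.Printf("%s -> bucket %d\n", key, h&1023) // e.g. a 1024-bucket table
        }
    }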
reply