FilterHN

ssivark

1 hour ago

[-]

Daniel Lemire's points about low-level hardware optimization notwithstanding, it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic.

If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. E.g., a human searching a physical paper dictionary can zoom into the right bunch of pages faster than pure idealized binary search; it's a separate matter that it's hard for humans to continue binary search till the very end, so we might default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc).

In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.

More generally, the best bet for solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representation. That can offer scalable orders-of-magnitude speedups compared to the constant-factor speedups from just using hardware better.

hinkley

47 minutes ago

[-]

I swear I read an article about treaps but instead of being used to balance the tree, they used the weights to Huffman encode the search depth to reduce the average access time for heterogeneous fetch frequencies.

I did not bookmark it and about twice a year I go searching for it again. Some say he’s still searching to this day.

mvelbaum

16 minutes ago

[-]

https://arxiv.org/abs/2206.12110 ?

rixed

31 minutes ago

[-]

> it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic

Also if you do not learn anything about the data while performing the binary search, no? Like, if you are constantly below the estimate, you could guess that the distribution is biased toward large values and adjust your guess based on this prediction.

molf

4 minutes ago

[-]

It's not possible to learn anything about other elements when performing binary search, _except_ the only thing there is to learn: whether the target is before or after the most recently compared element.

If we guessed that there is a bias in the distribution based on recently seen elements, the guess would be just as likely to be wrong as right.

Unless we have prior knowledge. For example: if there is a particular distribution, or if we know we're dealing with integers without any repetition (i.e. each element is strictly greater than the previous one), etc.

painted-now

43 minutes ago

[-]

> In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm.

Never thought about it this way. Brilliant!

tantalor

51 minutes ago

[-]

> use that extra information to perform MUCH better

Do you mean using a better estimator for the median value? Or something else?

mycall

1 hour ago

[-]

Furthermore, with the vast and immediate knowledge that LLMs have, we could see a proliferation of domain-specific sorting algorithms designed for all types of purposes.

locknitpicker

54 minutes ago

[-]

> If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better.

You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is by estimating the likely place it would be by interpolating the elements.
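A minimal sketch of the interpolation search described above, assuming a sorted array of `uint32_t` with values spread roughly evenly. The function name `interpolation_search` and the choice of element type are illustrative, not taken from any library or from the article.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of key in sorted a[0..n-1], or -1 if it is absent.
 * Each probe position is a linear estimate between the two known
 * endpoints (position lo with value a[lo], position hi with value a[hi]). */
ptrdiff_t interpolation_search(const uint32_t *a, size_t n, uint32_t key) {
    if (n == 0) return -1;
    size_t lo = 0, hi = n - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo]) {
            /* Every remaining value equals a[lo], and key lies in range. */
            return (ptrdiff_t)lo;
        }
        /* 64-bit arithmetic keeps the product from overflowing. */
        uint64_t num = (uint64_t)(key - a[lo]) * (uint64_t)(hi - lo);
        size_t mid = lo + (size_t)(num / (a[hi] - a[lo]));
        if (a[mid] == key) return (ptrdiff_t)mid;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1; /* mid > lo here, so this cannot underflow */
    }
    return -1;
}
```

On evenly spaced data the first estimate usually lands on or next to the target. On skewed data such as powers of two the estimates land far too low and the search can take more probes than binary search, which is the sense in which it still depends on the distribution.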

rv64imafdc

50 minutes ago

[-]

> knowing the position and value of two elements in a sorted list

That's a prior about the distribution, if a relatively weak one (in some sense, at least).

darknoon

46 minutes ago

[-]

This relies on knowledge of the distribution: on A = [1, 2, 4, 8, 16, ..., 2^(n-1)], interpolating is slower than binary search, which just queries the middle.

lalitmaganti

1 hour ago

[-]

I also wrote recently [1] about Exponential Search [2], which is another algorithm to consider if you need to repeatedly binary search in an array where the elements you're searching for are themselves sorted. It allowed for an 8x speedup in our workload!

[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search
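A rough sketch of exponential (galloping) search for the sorted-queries case, not the code from the linked post. It assumes each query is at least as large as the previous one, so the previous result can be passed in as a starting hint. The names `gallop_lower_bound` and `lower_bound_range` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain binary search: first index i in [lo, hi) with a[i] >= key, else hi. */
static size_t lower_bound_range(const uint32_t *a, size_t lo, size_t hi,
                                uint32_t key) {
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* First index i >= start with a[i] >= key in sorted a[0..n-1], or n.
 * Assumption: the true answer is not before `start`, e.g. because start
 * is the result of the previous (smaller or equal) query. */
size_t gallop_lower_bound(const uint32_t *a, size_t n, size_t start,
                          uint32_t key) {
    if (start >= n) return n;
    if (a[start] >= key) return start;
    size_t prev = start; /* invariant: a[prev] < key */
    size_t step = 1;
    /* Probe at growing distances until we pass the key or run off the end. */
    while (prev + step < n && a[prev + step] < key) {
        prev += step;
        step *= 2;
    }
    size_t hi = (prev + step < n) ? prev + step : n;
    /* The answer lies in (prev, hi]; finish with a binary search. */
    return lower_bound_range(a, prev + 1, hi, key);
}
```

The cost is logarithmic in the distance between consecutive answers rather than in the array length, which is where the win for sorted query streams would come from.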

drob518

4 hours ago

[-]

Isn't "quaternary" just sort of unrolling the binary search loop by one level? I mean, to find the partition in which the item is located, you still do roughly the same number of comparisons. You're just taking them 4 at a time, not 2 at a time. Seems like loop unrolling would give you the same.

nkurz

3 hours ago

[-]

It's trickier than that. Modern processors are speculative, which means that they guess at the result for a comparison and keep going along one side of a branch as far as they can until they are told they guessed wrong or hit some internal limit. If they guessed wrong, they throw away the speculative work, take a penalty of a handful of cycles, and do the same thing again from a different starting point.

Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples), this means that the real game is how to get the processor to issue the requests for the data you will eventually need as far in advance as possible so you don't have to wait as long for it to arrive.

The difference in algorithm for quad search (or anything higher than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases but with less depth. This way you are guaranteed to have successfully issued the prefetch you will eventually need, and are spending slightly less of your bandwidth budget on data that will never be used in the actual execution path.

As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real-world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.

drob518

3 hours ago

[-]

Yea, I get that the actual comparison instruction itself is insignificant. It's everything that goes along with it. Seems like quaternary is fetching more data, however.

For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary, you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements, you'd just delegate to the SIMD instruction, but if this was a much larger array, you'd be doing more work.

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

LoganDark

2 hours ago

[-]

It's not about doing more or less work; it's about doing the work faster. For instance, it's relatively common to discover that some recomputation can be faster than caching or lookup tables. Similarly, fetching more from memory can also be faster if it means you make fewer round trips.

crdrost

1 hour ago

[-]

Well that's where I thought this link was going to go before it went down the SIMD path... We have a way to beat binary search, it is called B-trees. It has the same basic insight: you can easily take 64 elements from your data set evenly spaced, compare against all of those rapidly, and instead of bifurcating your search space once, you do the equivalent of six bifurcations; but because you store the 64 elements in an array in memory, they only take one array fetch and you get cache locality... But as you have more elements, you need to repeat this lookup table like three or four or five times, so it costs a bit of extra space, so what if we make it not cost space by just storing the data in these lookup tables...

wtallis

4 hours ago

[-]

Yes, this can be seen as unrolling the loop a bit. It improves performance not by significantly reducing the number of instructions or memory reads, but by relaxing the dependencies between operations so that it doesn't have to be executed purely serially. You could also look at it as akin to speculatively executing both sides of the branch.

mayoff

4 hours ago

[-]

Quaternary search effectively performs both of the next loop iteration’s possible comparisons simultaneously with the current iteration’s comparison. This is a little more complex than simple loop unrolling.

Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.
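To make the "three comparisons at once" idea concrete, here is a scalar sketch of a quaternary lower-bound search. It is not the article's implementation (no SIMD, no block structure), and the function name `quad_lower_bound` is hypothetical. The point is that the three loads in each iteration do not depend on one another, so the processor is free to issue them in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* First index i in sorted a[0..n-1] with a[i] >= key, or n if none. */
size_t quad_lower_bound(const uint16_t *a, size_t n, uint16_t key) {
    size_t base = 0;
    while (n > 3) {
        size_t q = n / 4;
        /* Three independent loads: none needs the result of another. */
        int c1 = a[base + q - 1] < key;
        int c2 = a[base + 2 * q - 1] < key;
        int c3 = a[base + 3 * q - 1] < key;
        /* Because the array is sorted, c1 + c2 + c3 counts how many
         * quarters lie entirely below key; skip them without branching. */
        base += (size_t)(c1 + c2 + c3) * q;
        /* Keep a window of n - 3q >= q elements; it always contains
         * the answer and never extends past the original range. */
        n -= 3 * q;
    }
    /* At most three candidates remain; finish with a short scan. */
    while (n > 0 && a[base] < key) {
        base++;
        n--;
    }
    return base;
}
```

Each iteration does three comparisons where binary search would need two to cover the same ground, which matches the point above: slightly more work per level, but no dependency between the loads.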

3 hours ago

[-]

Sort of, yes, but you're also removing a data dependency between the unrolled stages.

pfortuny

3 hours ago

[-]

It is because processors do not do what one might naively think they do.

taeric

4 hours ago

[-]

If you are talking smaller arrays, linear search with a sentinel value at the end is already tough to beat. The thing that sucks about that claim, is that "smaller" is such a nebulous qualifier that it is really hard to internalize.
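For reference, a minimal sketch of a sentinel linear search. It assumes the caller owns one writable slot past the last element, and the name `sentinel_search` is just for illustration. The sentinel removes the bounds check from the inner loop, leaving a single comparison per element.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of key in a[0..n-1], or -1 if absent.
 * Assumption: a has room for n + 1 elements; a[n] is overwritten. */
ptrdiff_t sentinel_search(uint16_t *a, size_t n, uint16_t key) {
    a[n] = key; /* sentinel: guarantees the loop below terminates */
    size_t i = 0;
    while (a[i] != key) i++; /* no "i < n" test inside the loop */
    return (i < n) ? (ptrdiff_t)i : -1;
}
```

Note that this writes to the array, so it is not a drop-in replacement for a search over read-only or shared data.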

rao-v

4 hours ago

[-]

This is simply not true - if you look at this article’s excellent benchmarking, linear search falls behind somewhere around 200-400 elements.

In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.

KalMann

2 hours ago

[-]

I don't really see how this implies the above commenter's statement is "simply not true".

taeric

3 hours ago

[-]

I don't think std::find typically uses a sentinel, though?

BeetleB

4 hours ago

[-]

For that machine and compiler version, yes.

eggprices

4 hours ago

[-]

Except on Apple, where binary search always wins. Does anyone know why?

stephencanon

3 hours ago

[-]

Prior to the current generation Intel designs, Apple’s branch predictor tables were a good deal larger than Intel’s IIRC, so depending on benchmarking details it’s plausible that Apple Silicon was predicting every branch perfectly in the benchmark, while Intel had a more real-world mispredict rate. Perf counters would confirm.

SuperV1234

4 hours ago

[-]

That's not what the article is about.

layer8

54 minutes ago

[-]

…for 16-bit integers, and it’s still a binary search with the same asymptotic complexity, just a constant-factor speedup.

srcreigh

3 hours ago

[-]

The algorithm description was a bit confusing for me.

The SIMD part is just in the last step, where it uses SIMD to search the last 16 elements.

The Quad part is that it checks 3 points to create 4 paths, but also it's searching for the right block, not just the right key.

The details are a bit interesting. The author chooses to use the last element in each block for the quad search. I'm curious how the algorithm would change if you used the first element in each block instead, or even an arbitrary element.

4 hours ago

[-]

As a teenager I spent a weekend thinking that if binary search was good, because it cuts the search space in half at every step, then wouldn't a ternary search be better? Because we'd cut it into thirds at every step.

So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.

Unfortunately although we cut the search space to 2/3 of what it was for binary search at each step (1/3 vs 1/2), we do 3/2 as many comparisons at each step (one comparison 50% of the time, two comparisons the other 50%), so it averages out to equivalence.

EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.

zamadatix

4 hours ago

[-]

This ternary approach doesn't actually average 3/2 comparisons per level:

- First third: 1 comparison

- Second third: 2 comparisons

- Third third: 2 comparisons

(1+2+2)/3 = 5/3 average comparisons. I think the gap starts here at assuming it's 50% of the time because it feels like "either you do 1 comparison or 2" but it's really 33% of the time because "there is 1/3 chance it's in the 1st comparison and 2/3 chance it'll be 2 comparisons".

This lets us show ternary is worse in total average comparisons, just barely: 5/3*Log_3[n] = 1.052... * Log_2[n].

In other words, you end up with fewer levels but doing more comparisons (on average) to get to the end. This is true for all searches of this type (w/ a few general assumptions like the values being searched for are evenly distributed and the cost of the operations is idealized - which is where the main article comes in) where the number of splits is > 2.
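Spelling out the arithmetic above, plus a generalization to k-way splits. The general formula is a sketch under stated assumptions: the k-1 pivots are compared one after another from left to right, and the target is equally likely to fall in each of the k parts.

```latex
% Ternary case: change of base from log_3 to log_2.
\frac{5}{3}\log_3 n \;=\; \frac{5}{3}\cdot\frac{\log_2 n}{\log_2 3}
\;\approx\; \frac{1.6667}{1.5850}\,\log_2 n \;\approx\; 1.052\,\log_2 n

% k-way case: part j (j < k) is identified after j comparisons,
% the last part after k-1 comparisons.
E_k \;=\; \frac{1 + 2 + \dots + (k-1) + (k-1)}{k} \;=\; \frac{(k-1)(k+2)}{2k},
\qquad
\text{total comparisons} \;\approx\; \frac{E_k}{\log_2 k}\,\log_2 n
```

For k = 2 this gives E_2 = 1 and a factor of exactly 1; for k = 3, E_3 = 5/3 and a factor of about 1.052; for k = 4, E_4 = 9/4 and a factor of 1.125.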

4 hours ago

[-]

Oh yeah!

GuB-42

3 hours ago

[-]

It turns out that teenager you had something.

Not for the ternary version of the binary search algorithm, because what you had is just a skewed binary search, not an actual ternary search. Because comparisons are binary by nature, any search algorithm involving comparisons is a type of binary search, and any choice other than the middle element is less efficient in terms of algorithmic complexity, though in some conditions, it may be better on real hardware. For an actual ternary search, you need a 3-way comparison as an elementary operation.

Where it gets interesting is when you consider "radix efficiency" [1], for which the best choice is 3, the natural number closest to e. And it is relevant to tree search, that is, a ternary tree may be better than a binary tree.

[1] https://en.wikipedia.org/wiki/Optimal_radix_choice

bryanlarsen

4 hours ago

[-]

Did you continue by fantasizing about CPUs that contain ternary comparators?

ack_complete

3 hours ago

[-]

Note that CPUs have also gotten dramatically wider in both execution width and vector capability since you were a teenager. The increased throughput shifts the balance more toward being able to burn operations to reduce dependency chains. It's possible for your idea to have been both non-viable on the CPUs at the time and more viable on CPUs now.

compiler-guy

2 hours ago

[-]

This idea is closely related to the famous "Stooge Sort", which is basically quicksort with the pivot at 1/3 rather than 1/2. Naively, one might think it is faster than Quicksort, but of course it isn't.

For years--maybe still?--analyzing its running time was a staple of the first or second problem set in a college-level "Introduction to Algorithms" course.

https://en.wikipedia.org/wiki/Stooge_sort

nkurz

4 hours ago

[-]

> Unfortunately although we cut the search space to 2/3 of what it was for binary search at each step (1/3 vs 1/2), we do 3/2 as many comparisons at each step (one comparison 50% of the time, two comparisons the other 50%), so it averages out to equivalence.

True, but is there some particular reason that you want to minimize the number of comparisons rather than have a faster run time? Daniel doesn't overly emphasize it, but as he mentions in the article: "The net result might generate a few more instructions but the number of instructions is likely not the limiting factor."

The main thing this article shows is that (at least sometimes on some processors) a quad search is faster than a binary search _despite_ the fact that it performs theoretically unnecessary comparisons. While some computer scientists might scoff, I'd bet heavily that an optimized ternary search could also frequently outperform.

4 hours ago

[-]

You normally measure runtime of a sorting algorithm in terms of the number of comparisons it has to do.

Obviously real-world performance depends on other things as well.

Someone

4 hours ago

[-]

Not “normally”, but “in computer science” and even then, mostly “in the past” and even then, only “typically” (there are sorting algorithms that make zero comparisons. See for example https://pages.cs.wisc.edu/~paton/readings/Old/fall01/LINEAR-...)

All other people live in the real world, and care about real-world performance, and modern computer scientists know that.

alexfoo

3 hours ago

[-]

Those algorithms may not be doing any pairwise comparisons (e.g. between elements being sorted) but they still do plenty of comparisons.

And some of the algorithms, as described, still end up doing pairwise comparisons in all-but-optimal cases.

(Bucket sort requires items that end up in the same bucket to be sorted. This doesn't happen automatically via the algorithm as stated. Radix sort requires the items at each "level" to be sorted. Neither algorithm specifies how this should be done without pairwise comparisons.)

Counting Sort does work without pairwise comparisons, but is only efficient for small ranges of values, and if that's the case then it's obvious you don't need to apply a traditional sort if the number of elements greatly outnumbers the number of possible values.

Also, the algorithms still require some form of comparisons, just not pairwise comparisons.

> All other people live in the real world, and care about real-world performance, and modern computer scientists know that.

Yes, completely agree with that, but traditional "Comp Sci" is built on small building blocks of counting "comparisons" or "memory accesses". It's not designed to analyse prospective performance given modern processors with L1/L2/L3 caches, branch prediction, SIMD instructions, etc.

eggprices

4 hours ago

[-]

When you can't seek quickly, e.g. on a disk, you can use a B-tree with say 128-way search. Fetching 128 keys doesn't cost much more than fetching 1 but it saves an additional 7 fetches.

madcaptenor

4 hours ago

[-]

Isn't it a bit better on average, although not as much as you'd hoped? For example 19 steps of binary search get you down to 1/524288 of the original search space with 19 comparisons. 12 steps of ternary search get you down to 1/3^12 = 1/531441 of the original search space with, on average, 12 * 3/2 = 18 comparisons.

4 hours ago

[-]

Maybe! But you can see the other comment that points out I was wrong and it is actually 5/3 comparisons so it still works out worse.

bena

4 hours ago

[-]

Imagine if you split the search space N times, no middles. Then you could just compare the value.

gobdovan

4 hours ago

[-]

I thought this would be about how you can beat binary search in the 'Guess Who?' game. There's a cool math paper about it [0] and an approachable video by the author. [1]

[0] https://arxiv.org/abs/1509.03327

[1] https://www.youtube.com/watch?v=_3RNB8eOSx0

thaumasiotes

1 hour ago

[-]

You can't beat binary search in Guess Who. From the abstract:

>> Instead, the optimal strategy for the player who trails is to make certain bold plays in an attempt catch up.

The reason that's optimal, if you're losing, is that you assume that your opponent, who isn't losing, is going to use binary search. They're going to use binary search because it's the optimal way to find the secret.

Since you're behind, if you also use binary search, both players will progress toward the goal at the same rate, and you'll lose.

Trying to get lucky means that you intentionally play badly in order to get more victories. You're redistributing guesses taken between games in a negative-sum manner - you take more total guesses (because your search strategy is inferior to binary search), but they are unevenly distributed across your games, and in the relatively few games where you perform well above expectation, you can score a victory.

gobdovan

25 minutes ago

[-]

You're mixing two different objectives the paper presents. You can't beat binary search when the objective is to minimise the expected number of turns in a single player setting.

However, in a two player setting, using the strategies presented in the paper, you will beat an adversary that uses binary search in more than 50% of the games played.

Here's another visual demonstration: https://www.youtube.com/watch?v=zmvn4dnq82U

alexfoo

3 hours ago

[-]

The classical canonical Comp Sci algorithms are effectively "designed" for CPUs with no parallelism (either across multiple cores, via Hyper-threading technology, or "just" SIMD style instructions), and also where all memory accesses take the same amount of time (so no concept of L1/L2/L3/etc caches of varying latencies). And all working on general/random data.

As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.

What the classical algorithms do offer is a very good starting point for developing a more optimal/efficient solution once you know more about the specific shape of data or quirks/features of a specific CPU.

When you start to get at the pointy end of optimising things then you generally end up looking at how the data is stored and accessed in memory, and whether any changes you can make to improve this don't hurt things further down the line. In a job many many years ago I remember someone who spent way too long optimising a specific part of some code only to find that the overall application ran slower as the optimisations meant that a lot more information needed later on had been evicted from the cache.

(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)

BeetleB

4 hours ago

[-]

Some of the plots would have been much more helpful if instead of absolute value in seconds, the y-axis were the multiplier w.r.t binary search (and eyeballing suggests a relatively constant multiplier).

Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.

quirino

3 hours ago

[-]

On optimizing binary search: https://en.algorithmica.org/hpc/data-structures/binary-searc...

garaetjjte

2 hours ago

[-]

I once did have a need for binary search in memory mapped files and I experimented with Eytzinger layout (which I learned from https://bannalia.blogspot.com/2015/06/cache-friendly-binary-...). It turned out that it was slower than plain binary search, I think because keys I was looking up were often clumped together thus it played quite well with cache anyway.

2 hours ago

[-]

The title is slightly misleading, I mean yes, naive binary search might have larger constant but the algorithm is still O(log(n)). This is still some "divide and conquer" style algorithm, just with a bunch of CPU-specific optimizations. Also, this works well with simple data types like integers; with more complex objects (custom comparators) it matters less.

pfortuny

1 hour ago

[-]

The complexity of binary search in terms of "search" (comparison) operations is exactly log_2(n)+1, not just O(n). This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096 elements.

So not exactly "n" as in O(n).

Also: only for 16-bit integers.

1 hour ago

[-]

> The complexity of binary search in terms of "search" (comparison) operations is exactly log_2(n)+1, not just O(n)

> So not exactly "n" as in O(n).

For large enough inputs the algorithm with better Big O complexity will eventually win (at least in the worst cases). Yes, sometimes it never happens in practice when the constants are too large. But say 100 * n will eventually beat 5 * n * log(n) for large enough n. Some advanced algorithms can use algorithms with worse Big O complexity but smaller constants for small enough sub-problems to improve performance. But that's more of an optimization detail than a completely different algorithm.

> This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096

Yes, that's my point. It's basically "I made binary search for integers X times faster on some specific CPUs". "Beating binary search" is somewhat misleading, it's more like "micro-optimizing binary search".

cubefox

2 hours ago

[-]

> The title is slightly misleading, I mean yes, naive binary search might have larger constant but the algorithm is still O(log(n)).

I think the title is not misleading since the Big O notation is only supposed to give a rough estimate of the performance of an algorithm.

(I agree though that binary search is already extremely fast, so making something twice as fast won't move the needle for the vast majority of applications where the speed bottleneck is elsewhere. Even infinite speed, i.e. instant sorted search, would likely not be noticeable for most software.)

1 hour ago

[-]

For me it's slightly misleading because it's almost like saying "I wrote a faster quicksort implementation, so it beats quicksort!". In this case the fundamental binary search idea of "divide and conquer" is still there; the article just does micro-optimizations (which seem to be not very portable and are less relevant/applicable for more complex data structures) in order to reduce the constant part.

Yes, algorithmic complexity is theoretical, it often ignores the real world constants, but they are usually useful when comparing algorithms for larger inputs, unless we are talking about "galactic algorithms" with insanely large constants.

aidenn0

4 hours ago

[-]

If you are storing 16-bit integers, wouldn't an 8kB bitmap be even faster?

3 hours ago

[-]

The library the author is talking about selects between bitmap and array dynamically depending on density.

https://roaringbitmap.org/

Findecanor

4 hours ago

[-]

The range is 1..4096, so 4096 bits = 512 byte bitmap would suffice.

That is, if you're only ever going to test for membership in the set. If you need metadata then ... You could store that in a packed array and use a population count of the bit-vector before the lookup bit as index into it. For each word of bits, store the accumulated population count of the words before it to speed up lookup. Modern CPUs are memory-bound so I don't think SIMD would help much over using 64-bit words. For 4096 bits / 64, that would be 64 additional bytes.
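A sketch of the bitmap-plus-rank scheme described above, under a few assumptions of mine: values are in 0..4095, the per-word prefix counts are stored as `uint16_t` (a prefix count can exceed 255, so this version spends 128 bytes on them rather than 64), and the names `rank_bitmap`, `rb_build` and `rb_index` are invented for the example. The popcount is written portably; a compiler builtin or hardware instruction would normally replace it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { UNIVERSE = 4096, WORDS = UNIVERSE / 64 };

typedef struct {
    uint64_t bits[WORDS]; /* bit v is set if and only if v is a member */
    uint16_t rank[WORDS]; /* number of set bits in all earlier words */
} rank_bitmap;

/* Portable population count (clears the lowest set bit each step). */
static int popcount64(uint64_t x) {
    int c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Builds the bitmap from n values, each assumed to be below UNIVERSE. */
void rb_build(rank_bitmap *rb, const uint16_t *values, size_t n) {
    memset(rb, 0, sizeof *rb);
    for (size_t i = 0; i < n; i++)
        rb->bits[values[i] >> 6] |= (uint64_t)1 << (values[i] & 63);
    uint16_t running = 0;
    for (int w = 0; w < WORDS; w++) {
        rb->rank[w] = running;
        running = (uint16_t)(running + popcount64(rb->bits[w]));
    }
}

/* Returns the position of v among the members in sorted order (usable
 * as an index into a packed metadata array), or -1 if v is not a member. */
int rb_index(const rank_bitmap *rb, uint16_t v) {
    uint64_t word = rb->bits[v >> 6];
    uint64_t bit = (uint64_t)1 << (v & 63);
    if (!(word & bit)) return -1;
    return rb->rank[v >> 6] + popcount64(word & (bit - 1));
}
```

A lookup touches one bitmap word and one rank entry, with no data-dependent loop, which is the property that makes this attractive compared with any search over a sorted array.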

gobdovan

3 hours ago

[-]

I remember I had a pedagogy class in Uni taught by psychology faculty, and was messing with them by proposing a mock syllabus where we'd teach students binary search, then the advanced ones ternary search, and the very advanced, Quaternary, with a big Q, as in the geological period. Joke's on me now, I suppose.

wood_spirit

4 hours ago

[-]

A beautiful algorithm.

Would there be any value in using simd to check the whole cache line that you fetch for exact matches on the narrowing phase for an early out?

bediger4000

2 days ago

[-]

The (AI generated?) image on this article is absolutely not helpful, and I think it's wrong based on how I read the article. Better not to have an image at all.

crazygringo

3 hours ago

[-]

Seriously. It makes it seem like this is going to be a blog post either intended for elementary school students, or more likely for teachers on how to better explain some arithmetic concept to elementary school students.

It's absolutely bizarre. Images communicate meaning. Much better to have no image than to have an image that is completely misleading about the target audience or level of technical sophistication.

iosovi

4 hours ago

[-]

Agreed, it threw me off at first but the rest of the article was quite nice.

kardos

4 hours ago

[-]

So is the SIMD the magic piece here, or is it the interpolation search? If the data is evenly distributed, that is pretty optimal for the interpolation search..

mayoff

4 hours ago

[-]

In the Intel CPU + cold cache case, the quad search matters. In the other three cases, only the SIMD matters.

3 hours ago

[-]

To put it another way: this is addressed in the article.

vasco

1 hour ago

[-]

This was the entry level project we did in a hardware optimization course I took maybe 15 years ago, using SIMD instructions. Lots of things can be naively optimized by unrolling any loops like this. Compilers do some of this themselves.

peter_d_sherman

1 hour ago

[-]

>"Virtually all processors today have data parallel instructions (sometimes called SIMD) that can check several values at once.

[...]

The binary search checks one value at a time. However, recent processors can load and check more than one value at once. They have excellent memory-level parallelism. This suggests that instead of a binary search, we might want to try a quaternary search..."

First of all, brilliant observations! (Overall, a great article too!)

Yes, today's processors indeed have a parallelism which was unconceived of at the time the original Mathematicians, then-to-be Computer Scientists, conceived of Binary Search...

Now I myself wonder if these ideas might be extended to GPUs, that is, if the massively parallel execution capability of GPUs could be extended to search for data like Binary Search does, and what such an appropriately parallelized algorithm/data structure would look like... keep in mind, if we consider an updateable data structure, then that means that parts of it may need to be appropriately locked at the same time that multiple searches and updates are occurring simultaneously... so what data structure/algorithm would be the most efficient for a massively parallel scenario like that?

Anyway, great article and brilliant observations!

owlcompliance

3 hours ago

[-]

What about non-binary search?

gowld

1 hour ago

[-]

40x Faster Binary Search - This talk will first expose the lie that binary search takes O(lg n) time — it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in order to gain 40x increased throughput over the Rust standard library implementation.

jonfe-darontos

3 hours ago

[-]

And here I thought this was going to be related to quaternions

aamargulies

1 hour ago

[-]

Here's my version with a key spline improvement. I should really write this up...

#include <stdbool.h>
#include <stdint.h>
#include <arm_neon.h>

/* Author: aam@fastmail.fm
 *
 * Apple M4 Max (P-core) variant of simd_quad which uses a key spline
 * to great effect (blog post summary incoming!)
 */
bool simd_quad_m4(const uint16_t *carr, int32_t cardinality, uint16_t pos) {
    enum { gap = 64 };

    if (cardinality < gap) {
        if (cardinality >= 32) {
            // 32 <= n < 64: NEON-compare the first 32 as a single x4 load,
            // sweep the remainder.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            // 16 <= n < 32: paired x2 load + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            // 8 <= n < 16: single 128-bit compare + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;
    int32_t base = 0;
    int32_t n = num_blocks;

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = carr[(base + quarter + 1) * gap - 1];
        int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
        int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (carr[(base + half + 1) * gap - 1] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (carr[(base + 1) * gap - 1] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

}

/*
 * Spine variant, M4 edition.
 *
 * Pack the interpolation probe keys into a dense contiguous region so the
 * cold-cache pointer chase streams through consecutive cache lines:
 *
 *   n=4096 -> 64 spine keys -> 128 B = 1 M4 cache line
 *   n=2048 -> 32 spine keys ->  64 B = half a line
 *   n=1024 -> 16 spine keys ->  32 B
 *
 * The entire interpolation phase for a max-sized Roaring container now
 * lives in one cache line. The final SIMD block check still loads from
 * carr.
 *
 * The num_blocks <= 3 fallback:
 * with very few blocks the carr-based probes accidentally prime the final
 * block's lines, which the spine path disrupts.
 */
bool simd_quad_m4_spine(const uint16_t *carr, const uint16_t *spine,
                        int32_t cardinality, uint16_t pos) {
    enum { gap = 64 };

    if (cardinality < gap) {
        // Same fast paths as simd_quad_m4 -- spine is irrelevant here.
        if (cardinality >= 32) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;

    if (num_blocks <= 3) {
        return simd_quad_m4(carr, cardinality, pos);
    }

    int32_t base = 0;
    int32_t n = num_blocks;

    // Pull the whole spine into L1 up front. For n in [256, 4096] this is
    // 1 line (128 B); for smaller n it is a partial line. Cheap on cold.
    __builtin_prefetch(spine);

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = spine[base + quarter];
        int32_t k2 = spine[base + 2 * quarter];
        int32_t k3 = spine[base + 3 * quarter];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (spine[base + half] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (spine[base] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

}

// Build the spine for a given carr. Caller allocates cardinality/64 u16s.
void simd_quad_m4_build_spine(const uint16_t *carr, int32_t cardinality, uint16_t *spine) {
    enum { gap = 64 };
    int32_t num_blocks = cardinality / gap;
    for (int32_t i = 0; i < num_blocks; i++) {
        spine[i] = carr[(i + 1) * gap - 1];
    }
}
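For anyone who wants to try the spine idea without NEON hardware, here is a portable scalar sketch of how building the spine and looking up a key fit together. It is an illustration and not the code above: the names `build_spine_scalar` and `contains_spine_scalar` are hypothetical, the block probe is an ordinary binary search over the spine instead of the quartering loop, and the final 64-key block is scanned linearly where the original uses SIMD compares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { GAP = 64 };  /* block size, same value as `gap` in the NEON version */

/* spine[i] holds the last (largest) key of block i.
 * The caller provides room for cardinality / GAP entries. */
void build_spine_scalar(const uint16_t *carr, int32_t cardinality,
                        uint16_t *spine) {
    int32_t num_blocks = cardinality / GAP;
    for (int32_t i = 0; i < num_blocks; i++) {
        spine[i] = carr[(i + 1) * GAP - 1];
    }
}

/* Membership test on the sorted array carr[0..cardinality). */
bool contains_spine_scalar(const uint16_t *carr, const uint16_t *spine,
                           int32_t cardinality, uint16_t pos) {
    assert(cardinality >= 0);
    int32_t num_blocks = cardinality / GAP;

    /* Find the first block whose last key is >= pos. Only the small,
     * contiguous spine is touched here; spine is never read when
     * num_blocks == 0. */
    int32_t lo = 0, hi = num_blocks;
    while (lo < hi) {
        int32_t mid = lo + (hi - lo) / 2;
        if (spine[mid] < pos) lo = mid + 1;
        else hi = mid;
    }

    /* Scan either the chosen full block or, if every full block ends
     * below pos, the tail of fewer than GAP keys. */
    int32_t begin = lo * GAP;
    int32_t end = (lo < num_blocks) ? begin + GAP : cardinality;
    for (int32_t j = begin; j < end; j++) {
        if (carr[j] >= pos) return carr[j] == pos;
    }
    return false;
}
```

The spine has to be rebuilt (or patched) whenever `carr` changes, so this layout suits containers that are queried far more often than they are updated.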

m3kw9

2 hours ago

[-]

Will I get a job if I say I can beat binary search?

cubefox

4 hours ago

[-]

Since binary search is already very fast with its O(log n) time complexity: are there any real world applications which could practically benefit from this improvement?

2 hours ago

[-]

I guess it matters if you have to do lookups in a tight loop. If you do this occasionally, I think it's not worth it, especially for complex objects with custom comparators. The algorithm is still O(log(n)), just a more advanced "divide and conquer" with a smaller constant.
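To put a number on the "smaller constant", here is a small sketch that counts rounds of dependent probes for a plain halving loop and for a quartering-then-halving loop of the shape used in the code posted in this thread. The function names `rounds_halving` and `rounds_quartering` are made up for this example, and counting rounds is only a proxy for latency, not a benchmark.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds of dependent probes when the range is halved each step,
 * plus one final probe that settles the last candidate. */
int32_t rounds_halving(int32_t n) {
    assert(n >= 1);
    int32_t rounds = 0;
    while (n > 1) {
        n -= n >> 1;
        rounds++;
    }
    return rounds + 1;
}

/* Same count when each step issues three probes at the quarter points
 * and keeps one quarter, falling back to halving for n <= 3. */
int32_t rounds_quartering(int32_t n) {
    assert(n >= 1);
    int32_t rounds = 0;
    while (n > 3) {
        n -= 3 * (n >> 2);
        rounds++;
    }
    while (n > 1) {
        n -= n >> 1;
        rounds++;
    }
    return rounds + 1;
}
```

With 64 blocks (a 4096-element array split into blocks of 64) this gives 7 rounds for halving against 4 for quartering. Both grow logarithmically; the quartering loop does more loads per round, but the three loads in a round do not depend on each other, so they can be in flight at the same time.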

VorpalWay

15 minutes ago

[-]

I would expect the standard library of various languages to provide an optimised implementation such as this. Then everyone downstream benefits, and benefits from future improvements when compiled for a newer version of the language / executed under a newer version of the runtime.

You see this in Rust, where they replaced the hash tables many years ago, the channel a couple of years ago, and most recently the sort implementations for both stable and unstable sort. I expect other languages / runtimes do similar things over time as well, as CPUs change and new approaches are discovered.

3 hours ago

[-]

This is a drop-in improvement for essentially any binary search over 16-bit integer members.