So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)
On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:
Sorting 50 million doubles:
ipnsort 0.79s
blqs 0.90s
driftsort 1.13s (stable)
std::sort 1.22s
std::stable_sort 4.64s (stable)
Sorting 50 million (i32, i32) structs:
ipnsort 0.82s
blqs 0.89s
driftsort 1.07s (stable)
std::sort 3.09s
std::stable_sort 3.15s (stable)
And now for a cool party trick, let's repeat the 50 million doubles experiment, but have the first 90% already sorted, last 10% random:
driftsort 0.29s (stable)
ipnsort 0.81s
std::sort 1.15s
std::stable_sort 1.63s (stable)
blqs 1.89s
T k; // default-construct
if (i > 0) k = left[--i]; // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know.
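To make the cost difference concrete, here is a minimal sketch using a hypothetical instrumented type, `Tracked`, which is not part of the library. The functions `two_step` and `one_step` are also made up for illustration; `one_step` assumes the caller guarantees `i > 0`, which the original snippet does not.

```cpp
#include <cassert>

// Hypothetical instrumented type; counts which special members run.
struct Tracked {
    static int default_ctors, copy_ctors, copy_assigns;
    int value = 0;
    Tracked() { ++default_ctors; }
    explicit Tracked(int v) : value(v) {}
    Tracked(const Tracked& o) : value(o.value) { ++copy_ctors; }
    Tracked& operator=(const Tracked& o) {
        value = o.value;
        ++copy_assigns;
        return *this;
    }
};
int Tracked::default_ctors = 0;
int Tracked::copy_ctors = 0;
int Tracked::copy_assigns = 0;

void reset_counts() {
    Tracked::default_ctors = Tracked::copy_ctors = Tracked::copy_assigns = 0;
}

// Pattern from the comment: default-construct, then copy-assign.
int two_step(const Tracked* left, int i) {
    Tracked k;
    if (i > 0) k = left[--i];
    return k.value;
}

// Alternative: copy-construct directly (assumes i > 0).
int one_step(const Tracked* left, int i) {
    Tracked k(left[--i]);
    return k.value;
}
```

For a type like `int` both versions compile to the same thing; the difference only shows up when construction and assignment do real work, as with strings.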
Again, in `partition_small`, we have
T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.
All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.
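One way to avoid those up-front constructions is raw aligned storage plus placement new. The sketch below is only an illustration of that general technique, not a patch for the library: `Noisy` is a hypothetical type that counts default constructions, and `kSmallPart = 64` is an assumed stand-in for `SMALLPART`, whose real value I haven't checked.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical instrumented type that counts default constructions.
struct Noisy {
    static int default_ctors;
    int value;
    Noisy() : value(0) { ++default_ctors; }
    explicit Noisy(int v) : value(v) {}
};
int Noisy::default_ctors = 0;

constexpr std::size_t kSmallPart = 64;  // assumed size; the real SMALLPART may differ

// A plain array default-constructs every element up front.
int default_ctors_for_array() {
    Noisy::default_ctors = 0;
    Noisy buf[kSmallPart];
    (void)buf;
    return Noisy::default_ctors;
}

// Raw storage constructs nothing until an element is placed into it.
int default_ctors_for_raw_storage() {
    Noisy::default_ctors = 0;
    alignas(Noisy) unsigned char raw[kSmallPart * sizeof(Noisy)];
    // Construct a single slot in place and use the pointer placement new returns.
    Noisy* first = ::new (static_cast<void*>(raw)) Noisy(42);
    assert(first->value == 42);
    first->~Noisy();  // manual destruction is the price of raw storage
    return Noisy::default_ctors;
}
```

The trade-off is that every element placed into raw storage has to be destroyed by hand, which is why this is usually only worth it for types that are expensive to construct.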
None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
But it's perfectly possible for a type to be "trivially copyable" without being "default-constructible." An example of such a type from the STL: `std::reference_wrapper<int>`.
Anyway, looks like a quick fix for this would be to just extend the list of traits on which blqsort is gated (currently `is_trivially_copyable` and `sizeof(T) <= 16`) by adding `is_trivially_default_constructible<T>::value` also.
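A rough sketch of what that gate could look like. The trait names `current_gate` and `proposed_gate` are made up for illustration and are not the library's identifiers; the `sizeof` check assumes a typical platform where `std::reference_wrapper<int>` is pointer-sized.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Gate as described above: trivially copyable and at most 16 bytes.
template <typename T>
struct current_gate
    : std::integral_constant<bool, std::is_trivially_copyable<T>::value &&
                                       sizeof(T) <= 16> {};

// Same gate with the extra default-construction requirement.
template <typename T>
struct proposed_gate
    : std::integral_constant<bool,
                             current_gate<T>::value &&
                                 std::is_trivially_default_constructible<T>::value> {};

// reference_wrapper passes the current gate but cannot be default-constructed.
static_assert(current_gate<std::reference_wrapper<int>>::value, "");
static_assert(!std::is_default_constructible<std::reference_wrapper<int>>::value, "");
static_assert(!proposed_gate<std::reference_wrapper<int>>::value, "");
static_assert(proposed_gate<int>::value, "");
```

Note that `is_trivially_default_constructible` is stricter than strictly needed to make `T k;` compile; plain `is_default_constructible` would also rule out `std::reference_wrapper<int>`.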
Why not compare against that?
>
>for (int i = 0; i < 1000; i++) {
> small_numbers[smlen] = numbers[i];
> smlen += (numbers[i] < 500);
>}
Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?
> then wouldn't they also optimize the naive piece of code?
Great question. They do sometimes!
In general, the problem for compilers is that it's not obvious which method would be better in some given piece of code. Most branches are highly predictable. Like, imagine a for loop which counts to 1000. At the end of the loop body, the code branches to see whether we should stay in the loop, or exit the loop. The first 999 times through the loop we keep going - so 99.9% of the time, the branch ends up taking the same path. It's very predictable! CPU designers optimise heavily for this, via branch prediction logic. Highly predictable branches run fast. (This is also why array bounds checking doesn't really hurt performance at all.)
But the branch predictor really struggles when the condition is unpredictable - i.e., when a conditional branch is taken about 50% of the time. As is the case in a sorting algorithm.
The compiler has no idea whether any condition in your code is predictable or not. There are hints you can use, but it often defaults to just doing whatever you ask it to do.
Here's what the compiler actually does with the code you quoted. You can see the extra branch + jump for the second version of the code:
https://c.godbolt.org/z/zv7Tcd49f
I clicked around - for some reason even using __builtin_expect_with_probability, none of the compilers I tried will convert from one version of this code into the other.
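For reference, the two versions being compared can be written side by side roughly like this. The function names and the `limit` parameter are mine, not from the article, and whether a given compiler actually emits a conditional jump for the second one depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Branchless version: always store, then advance the output index by the
// comparison result (0 or 1). `out` needs room for n elements, because
// rejected values are written too and later overwritten.
std::size_t filter_branchless(const int* numbers, std::size_t n, int limit, int* out) {
    std::size_t len = 0;
    for (std::size_t i = 0; i < n; i++) {
        out[len] = numbers[i];
        len += (numbers[i] < limit);
    }
    return len;
}

// Naive version: store only when the condition holds. If the data is random,
// this `if` is the unpredictable branch under discussion.
std::size_t filter_branchy(const int* numbers, std::size_t n, int limit, int* out) {
    std::size_t len = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (numbers[i] < limit) {
            out[len++] = numbers[i];
        }
    }
    return len;
}
```

Both return the same count and the same first `len` elements; the branchless one just does some extra stores to get there.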
Unless the loop is unrolled, yes, there is a branch to exit the loop. But then that doesn’t matter because the whole goal at the beginning was to avoid branch misprediction (which is not the same thing as avoiding branches entirely).