Alexander Monakov has called the attention of the highload Telegram chat to this paper[0], saying:
Haswell is tricky for memory bw tuning, as even at fixed core frequency, uncore frequency is not fixed, and depends on factors such as hardware-measured stall cycles:
> According to the respective patent [15], the uncore frequency depends on the stall cycles of the cores, the EPB of the cores, and c-states
> ... uncore frequencies–in addition to EPB and stall cycles–depend on the core frequency of the fastest active core on the system. Moreover, the uncore frequency is also a target of power limitations.
So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers, but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.

[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/proje...
#pragma clang loop vectorize(enable)
#pragma clang loop interleave(enable)
for (; offset < length; offset += 4) {
    const auto x = ((uint32_t*)start)[offset / 4];
    count += ((x & 0xFF) == 0x7F);
    count += ((x & 0xFF00) == 0x7F00);
    count += ((x & 0xFF0000) == 0x7F0000);
    count += ((x & 0xFF000000) == 0x7F000000);
}
also gives some points. It'd probably be more if I could be bothered to break apart your assembly. :)

550x gains with some C++ / mixed gnarly low-level assembly vs standard C++ is pretty shocking to me.
#include <cstdint>
#include <sys/mman.h>

typedef uint8_t u8;
typedef uint64_t u64;

// Map 250 MB from fd 0 (stdin) and count bytes equal to 127, one byte at a time.
const u8 *file_lo = (const u8 *)mmap(0, 250000000ull, PROT_READ, MAP_PRIVATE | MAP_POPULATE, 0, 0);
const u8 *file_hi = file_lo + 250000000ull;
u64 count = 0;
while (file_lo < file_hi) {
    if (*file_lo == 127) {
        ++count;
    }
    file_lo++;
}
I got a bit under 54ms. The solution in the article runs in a bit under 16ms. Used clang with -Ofast -march=native -static. Funnily, gcc gets only 54000 with the same options, 1.6 times slower.
So you can get 25k with the following code (clang -Ofast -std=c++17 -march=native -static):

#include <iostream>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    auto file_lo = (const uint8_t*)mmap(0, 250000000ull, PROT_READ, MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);
    int count = 0;
    for (uint32_t i = 0; i < 250000000; ++i) {
        if (file_lo[i] == 127) {
            ++count;
        }
    }
    std::cout << count << std::endl;
    _exit(0); // skip normal teardown, including unmapping the 250 MB region
    return 0;
}
You could replace

    if (*file_lo == 127) {
        ++count;
    }

with

    count += (*file_lo == 127);

That might save you the occasional branch misprediction, and might possibly open up some hardware-level loop optimisations. Any difference?

With that in mind, I propose the following solution.
`print(976563)`
For SVE: prior to very recent versions there was no tblq (a pshufb equivalent), so I didn't have much hope for using it for general-purpose programming, though of course it would be fine for stuff like TFA.
[0]: https://community.arm.com/arm-community-blogs/b/infrastructu...
[1]: https://www.corsix.org/content/whirlwind-tour-aarch64-vector...
A parallel matrix multiply running across every core shouldn't have any trouble maximizing memory bandwidth utilization.
... std::cin >> v; ...
Oh come on! That's I/O for every item; I'm surprised it's not even slower.