/**
* @brief Defines variable alignment to avoid false sharing.
* @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
* @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
*
* The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
*
* @code{.cpp}
* #if defined(__cpp_lib_hardware_interference_size)
* static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
* #else
* static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
* #endif
* @endcode
*
 * That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment choice,
 * unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
 * with `-Wno-interference-size`.
*/
static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms. [1]

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
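To make the padding half of that concrete, here is a minimal sketch (my illustration, not the library's actual code): each per-thread slot is aligned to the conservative 128-byte bound so that no two slots ever share a cache line, or an adjacent-line prefetch pair.

    #include <atomic>
    #include <cstddef>

    static constexpr std::size_t default_alignment_k = 128;

    // Each counter starts on its own 128-byte boundary, so increments on
    // different slots never contend on the same (or adjacent) cache line.
    struct alignas(default_alignment_k) padded_counter_t {
        std::atomic<std::size_t> value {0};
    };

    padded_counter_t per_thread_counters[16];

The runtime half - picking 64 vs 128 bytes per platform - would then be ordinary dynamic dispatch between variants compiled with different values of `default_alignment_k`.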
I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.
Yes, there can be additional destructive interference effects at that granularity, but that's due to prefetching (of two cachelines whose coherency is managed independently) rather than coherency operating on one 128-byte granule.
AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.
Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 CPUs, and it is a good idea to avoid threads modifying the same cacheline.
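A rough way to see that is a toy benchmark like the sketch below (my illustration, nothing from the article): two threads hammer two counters that either sit in the same 64-byte line or on separate lines, and the padded version is typically several times faster on common x86 parts.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    struct shared_line_t {  // both counters typically land in one 64-byte line
        std::atomic<std::uint64_t> a {0};
        std::atomic<std::uint64_t> b {0};
    };

    struct split_lines_t {  // each counter gets its own 64-byte line
        alignas(64) std::atomic<std::uint64_t> a {0};
        alignas(64) std::atomic<std::uint64_t> b {0};
    };

    template <typename counters_t>
    double run(std::uint64_t iterations) {
        counters_t counters;
        auto start = std::chrono::steady_clock::now();
        std::thread first([&] {
            for (std::uint64_t i = 0; i != iterations; ++i)
                counters.a.fetch_add(1, std::memory_order_relaxed);
        });
        std::thread second([&] {
            for (std::uint64_t i = 0; i != iterations; ++i)
                counters.b.fetch_add(1, std::memory_order_relaxed);
        });
        first.join();
        second.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        constexpr std::uint64_t n = 100'000'000;
        std::printf("same line:   %.3f s\n", run<shared_line_t>(n));
        std::printf("split lines: %.3f s\n", run<split_lines_t>(n));
    }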
Memory ordering has nothing to do with cache coherency; it's all about what happens within the CPU pipeline itself. On ARM, reads and writes can be reordered within the pipeline, before they hit the caches (which are still fully coherent).
ARM still has strict memory ordering for code within a single core (some older processors do not), but the writes from one core might become visible to other cores in the wrong order.
The instructions to which you refer are not atomics, but rather instructions that influence the ordering of loads and stores. x86 has total store ordering by design. On ARM, the program has to use LDAR/STLR to establish ordering.
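A small sketch of that mapping (my example, assuming AArch64 codegen from a mainstream compiler): with C++ atomics, the release store below typically compiles to STLR and the acquire load to LDAR, while on x86 plain MOVs already satisfy both orderings thanks to TSO.

    #include <atomic>

    int data = 0;                   // plain, non-atomic payload
    std::atomic<bool> ready {false};

    void producer() {
        data = 42;
        ready.store(true, std::memory_order_release);   // STLR on AArch64
    }

    int consumer() {
        while (!ready.load(std::memory_order_acquire))  // LDAR on AArch64
            ;                                           // spin until published
        return data;                                    // guaranteed to see 42
    }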
The reason programmers don't believe in cache coherency is that they have experienced a closely related phenomenon, memory reordering. This requires you to use a memory fence when accessing a shared value between multiple cores - as if they needed to synchronize.
Also, it's not only compilers that reorder things; most processors nowadays do OoOE. Even if the order emitted by the compiler is perfect in theory, different latencies for different instruction operands may cause later instructions to execute earlier so the CPU doesn't stall.
On a single core a load can be served from the store buffer (queue), but other cores can't see those stores yet, which is where all the inconsistencies come from.
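That is exactly the classic "store buffer" litmus test; here is a hedged sketch (my own, not from the thread) in C++ terms:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x {0}, y {0};
    int r1 = -1, r2 = -1;

    int main() {
        std::thread first([] {
            x.store(1, std::memory_order_relaxed);  // may sit in the store buffer...
            r1 = y.load(std::memory_order_relaxed); // ...while this load hits the cache
        });
        std::thread second([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        first.join();
        second.join();
        // r1 == 0 && r2 == 0 is a permitted outcome - one that no interleaving
        // of the source lines could ever produce. seq_cst operations forbid it.
        std::printf("r1 = %d, r2 = %d\n", r1, r2);
    }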
Does anyone understand how Go handles the CPU cache?
This means you need to synchronize every shared access, whether it's a read or write. In hardware systems you can cheat because usually a write performs a write-through. In a JVM that's not the case.
It's been a long time since I had to think about this, but it bit us pretty hard when we ran into it.
Actual manual cache management is way too much of an implementation detail for a general-purpose CPU to expose; doing so would deeply tie code to specific processor behavior. Cache sizes and even hierarchies change often between processor generations, and some internal cache behavior has changed within a generation as a result of microcode and/or hardware steppings. Actual cache control would be like MIPS exposing delay slots, but so much worse (older delay-slot assumptions really only turn into performance issues, while older cache-control assumptions would easily turn into correctness issues).
Really the only way to make this work is for the final compilation/"specialization" step to occur on the specific device in question, like with a processor using binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill) or a system that effectively enforces runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).
This ignores store buffers and, consequently, memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.
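For what it's worth, the fence fix for the store-buffer reordering described above is short. This is a sketch assuming a mainstream compiler, where the seq_cst fence typically becomes an MFENCE (or a locked op) on x86 and a DMB ISH on AArch64, and it rules out the both-threads-read-zero outcome:

    #include <atomic>

    std::atomic<int> x {0}, y {0};

    int thread_one() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // order the store before the load
        return y.load(std::memory_order_relaxed);
    }

    int thread_two() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return x.load(std::memory_order_relaxed);
    }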
<https://marabos.nl/atomics/hardware.html>
While the book this chapter comes from is about Rust, the chapter itself is pretty much language-agnostic.
Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)