If your sections are that short, you can use a hybrid mutex and never actually park. And if you're wrong about how long things take, the blocking fallback saves you.
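As a toy sketch of the shape of it (the spin budget is arbitrary, and a real hybrid mutex such as glibc's adaptive one is more careful):

    #include <mutex>

    // Toy hybrid acquire: optimistically spin on try_lock() for a bounded
    // number of iterations, then fall back to a blocking lock().
    class HybridMutex {
    public:
        void lock() {
            for (int i = 0; i < 40; ++i)       // arbitrary spin budget
                if (m_.try_lock()) return;     // short critical section: no park
            m_.lock();                         // give up and block (may park)
        }
        void unlock() { m_.unlock(); }
    private:
        std::mutex m_;
    };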
>alignas(64) in C++
std::hardware_destructive_interference_size
Exists so you don't have to guess, although in practice it'll basically always be 64.

The code samples also don't obey the basic best practices for spinlocks on x86_64 or arm64. Spinlocks should perform a relaxed read in the loop, and only attempt a compare-and-swap with acquire ordering if that first check shows the lock is unowned. This avoids hammering the CPU with cache coherency traffic.
Similarly the x86 PAUSE instruction isn't mentioned, even though it exists specifically to signal spin-wait sections to the CPU.
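Putting both points together, a test-and-test-and-set loop looks roughly like this (x86_64 sketch only; _mm_pause would be a yield/isb on arm64, and the alignment is just to keep the lock on its own line):

    #include <atomic>
    #include <new>           // std::hardware_destructive_interference_size
    #include <immintrin.h>   // _mm_pause (x86 only)

    struct alignas(std::hardware_destructive_interference_size) SpinLock {
        std::atomic<bool> locked{false};

        void lock() {
            for (;;) {
                // Cheap relaxed read first: contended waiters stay in their
                // own cache and don't bounce the line around.
                if (!locked.load(std::memory_order_relaxed)) {
                    // Only now pay for the RMW with acquire ordering.
                    if (!locked.exchange(true, std::memory_order_acquire))
                        return;
                }
                _mm_pause();   // tell the CPU this is a spin-wait loop
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };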
Spinlocks outside the kernel are a bad idea in almost all cases, except in dedicated non-preemptible setups; use a hybrid mutex instead. Spinning in consumer threads can make sense in specialized exclusive thread-per-core designs where you want to minimize wakeup costs, but that's not the same as a spinlock, which would make any contending thread spin.
Most people using spinlocks really care about latency, and many will have hyperthreading disabled to reduce jitter.
Very much this. Spins benchmark well but scale poorly.
Unfortunately it's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].
[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...
[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...
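In other words, the adjacent-line prefetcher on x86 can drag in pairs of 64-byte lines, which is why folly pads to 128 there. Rolling your own version looks roughly like this (the 128/64 numbers are the conventional choices, not anything guaranteed for a given part):

    #include <cstddef>

    // Destructive interference: two 64-byte lines on x86_64 because of the
    // spatial (adjacent-line) prefetcher, one line elsewhere. Conventional
    // values only.
    #if defined(__x86_64__) || defined(_M_X64)
    inline constexpr std::size_t kNoFalseSharing = 128;
    #else
    inline constexpr std::size_t kNoFalseSharing = 64;
    #endif

    struct alignas(kNoFalseSharing) PaddedCounter {
        long value = 0;   // gets its own line (pair) so writers don't interfere
    };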
Yeah, pure spinlocks in user-space programs are a big no-no in my book. If you're on the happy path then it costs you nothing extra in terms of performance, and if you for some reason slide off the happy path you have a sensible fallback.
glibc's pthread mutex already uses a user-space fast path (an atomic CAS, plus an optional short spin for the adaptive type) so the uncontended case never pays the syscall cost.
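If you want the spinning part explicitly, glibc exposes it as a non-portable mutex type (sketch; needs _GNU_SOURCE and only exists on glibc):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // for PTHREAD_MUTEX_ADAPTIVE_NP (glibc extension)
    #endif
    #include <pthread.h>

    // Adaptive mutex: spins briefly in user space under contention before
    // falling back to futex-based sleeping.
    void init_adaptive_mutex(pthread_mutex_t* m) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }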
Of course, this is just the number the compiler thinks is good. It’s not necessarily the number that is actually good for your target machine.
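If it matters, you can at least compare the baked-in constant against what the machine reports at run time; on Linux/glibc something like this (sysconf may return 0 or -1 if the value isn't known):

    #include <new>       // std::hardware_destructive_interference_size (C++17)
    #include <unistd.h>  // sysconf; the _SC_LEVEL1_* names are glibc extensions
    #include <cstdio>

    int main() {
        std::printf("compile-time guess: %zu\n",
                    std::hardware_destructive_interference_size);
        std::printf("runtime L1d line:   %ld\n",
                    sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    }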
Your code will look great in your synthetic benchmarks and then it will end up burning CPU for no good reason in the real world.
Notably the claim about how atomic operations invalidate the cache line in every other CPU. Wow! Shared data can really be a performance limitation.
Lock-free algorithms for read-only access to shared data structures seldom have disadvantages (only when the shared structure is modified so frequently by writers that readers never manage to complete a read between changes).
On the other hand, lock-free algorithms for read/write access to shared data structures must be used with great caution, because they frequently cost more than mutual exclusion. Such algorithms are based on the optimistic assumption that your thread will complete the access before the shared structure is touched by a competing thread. Whenever that assumption fails (which it will under high contention) the transaction must be retried, which wastes much more work than a spinlock would.
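Concretely, that's the usual compare-exchange retry loop; every failed CAS below throws away the computation that preceded it, which is the wasted work under contention (toy example, not any specific algorithm):

    #include <atomic>

    std::atomic<long> shared_value{0};

    long expensive_transform(long v) { return v * 2 + 1; }  // stand-in for real work

    // Optimistic read-modify-write: snapshot, compute, try to publish.
    // Under heavy write contention the CAS keeps failing and the transform
    // is redone each time -- that's the retried/wasted work.
    void update() {
        long expected = shared_value.load(std::memory_order_relaxed);
        long desired;
        do {
            desired = expensive_transform(expected);   // redone on every retry
        } while (!shared_value.compare_exchange_weak(
                     expected, desired,
                     std::memory_order_release,    // publish on success
                     std::memory_order_relaxed));  // just reload on failure
    }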
It also depends which lock-free solutions you're evaluating.
Some are effectively higher-order spins (with similar high-level problems), others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points; for anything with a strong atomic in the middle, that means things like synchronization instruction costs and the pipeline impact of barriers/fences, etc.