Spinlocks vs. Mutexes: When to Spin and When to Sleep
74 points | 10 hours ago | 7 comments | howtech.substack.com
charleslmunger
8 hours ago
> Critical section under 100ns, low contention (2-4 threads): Spinlock. You’ll waste less time spinning than you would on a context switch.

If your sections are that short then you can use a hybrid mutex and never actually park. And if you're wrong about how long things take, the hybrid mutex will save you by parking instead of spinning forever.

> alignas(64) in C++

    std::hardware_destructive_interference_size
Exists so you don't have to guess, although in practice it'll basically always be 64.
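
For example (a minimal sketch: C++17, and assuming a standard library that actually defines the constant, which lives in <new>):

    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size

    struct alignas(std::hardware_destructive_interference_size) PaddedFlag {
        std::atomic<bool> locked{false};  // gets a cache line to itself
    };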

The code samples also don't obey the basic best practices for spinlocks for x86_64 or arm64. Spinlocks should perform a relaxed read in the loop, and only attempt a compare and set with acquire order if the first check shows the lock is unowned. This avoids hammering the CPU with cache coherency traffic.

Similarly, the x86 PAUSE instruction isn't mentioned, even though it exists specifically to signal spin-wait sections to the CPU.
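
Something like this test-and-test-and-set shape (a sketch of the pattern, not code from the article; class and helper names are mine):

    #include <atomic>
    #if defined(__x86_64__)
    #include <immintrin.h>  // _mm_pause
    #endif

    class TtasSpinlock {
        std::atomic<bool> locked_{false};

        static void cpu_relax() {
    #if defined(__x86_64__)
            _mm_pause();            // tells the core this is a spin-wait loop
    #elif defined(__aarch64__)
            asm volatile("yield");  // arm64 equivalent hint
    #endif
        }

     public:
        void lock() {
            for (;;) {
                // Spin on a cheap relaxed load: it reads the local cache copy
                // and generates no coherency traffic while the lock is held.
                while (locked_.load(std::memory_order_relaxed)) {
                    cpu_relax();
                }
                // The lock looked free; only now attempt the expensive RMW.
                bool expected = false;
                if (locked_.compare_exchange_weak(expected, true,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed)) {
                    return;
                }
            }
        }

        void unlock() { locked_.store(false, std::memory_order_release); }
    };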

Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated non-preemptable ones; use a hybrid mutex. Spinning for consumer threads can be done in specialized exclusive thread-per-core setups where you want to minimize wakeup costs, but that's not the same as a spinlock, which would cause any contending thread to spin.

nly
57 minutes ago
The PAUSE instruction isn't actually as good as it used to be. In Skylake (iirc), Intel massively increased its latency to improve utilisation under hyperthreading, so it is now a really high-latency instruction.

Most people using spinlocks really care about latency, and many will have hyperthreading disabled to reduce jitter.

raggi
7 hours ago
> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated non-preemptable ones; use a hybrid mutex. Spinning for consumer threads can be done in specialized exclusive thread-per-core setups where you want to minimize wakeup costs, but that's not the same as a spinlock, which would cause any contending thread to spin.

Very much this. Spins benchmark well but scale poorly.

charleshn
6 hours ago
> std::hardware_destructive_interference_size exists so you don't have to guess, although in practice it'll basically always be 64.

Unfortunately that's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].

[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...

[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...
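
The gist of what Folly does (illustrative constants, not copied from Folly): on x86 the spatial prefetcher pulls in adjacent 64-byte lines in pairs, so you pad to 128 to be safe.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Assumed values for illustration: two 64-byte lines where the
    // adjacent-line prefetcher is in play, one line elsewhere.
    #if defined(__x86_64__) || defined(__aarch64__)
    inline constexpr std::size_t kDestructiveInterference = 128;
    #else
    inline constexpr std::size_t kDestructiveInterference = 64;
    #endif

    struct alignas(kDestructiveInterference) PaddedCounter {
        std::atomic<std::uint64_t> value{0};
    };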

magicalhippo
8 hours ago
> Spinlocks outside the kernel are a bad idea in almost all cases, except dedicated non-preemptable ones; use a hybrid mutex

Yeah, pure spinlocks in user-space programs are a big no-no in my book. With a hybrid mutex, the happy path costs you nothing extra in terms of performance, and if you for some reason slide off the happy path you have a sensible fall-back.

surajrmal
6 hours ago
Hybrid locks are also bad for overall system performance: they maximize local application performance at the expense of everything else on the machine. There is a reason the default lock implementations from OSes don't spin even a little bit.
charleslmunger
1 hour ago
That depends on your workload. If you're making a game that's expected to use near 100% of system resources, or a real time service pinned to specific cores, your local application is the overall system.
menaerus
2 hours ago
> There is a reason the default lock implementations from OSes don't spin even a little bit.

glibc's pthread mutex acquires in user space first (an atomic fast path, plus bounded spinning in the adaptive variant) precisely to mitigate the syscall cost in uncontended cases.
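
If you want the spinning behaviour explicitly, glibc exposes it as a non-portable mutex type (a sketch; Linux/glibc only, since PTHREAD_MUTEX_ADAPTIVE_NP is a GNU extension):

    #include <pthread.h>  // g++ on Linux defines _GNU_SOURCE by default

    pthread_mutex_t g_lock;

    void init_adaptive_lock() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Spin a bounded number of times in user space before parking
        // in the kernel via futex.
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&g_lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }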

nly
54 minutes ago
GNU libc POSIX mutexes do spin...
saagarjha
6 hours ago
> std::hardware_destructive_interference_size

Of course, this is just the number the compiler thinks is good. It’s not necessarily the number that is actually good for your target machine.

haileys
3 hours ago
Please just don't use spinlocks in userland code. It's really not the appropriate mechanism.

Your code will look great in your synthetic benchmarks and then it will end up burning CPU for no good reason in the real world.

nly
53 minutes ago
Burning CPU is preferable in some industries where latency is all that matters.
gebdev
54 minutes ago
Loved this article. It showed how lacking my knowledge is in how operating systems implement concurrency primitives. It motivated me to do a bunch of research and learn more.

Notably the claim about how an atomic write invalidates the cache line in every other CPU. Wow! Shared data can really be a performance limitation.

EdSchouten
4 hours ago
I don’t understand why I would need to care about this. Can’t my operating system and/or pthread library sort this out by itself?
senderista
3 hours ago
Pretty much, given that any decent pthreads implementation will offer an adaptive mutex. Unless you really need a mutex the size of a single bit or byte (which likely implies false sharing), there's little reason to ever use a pure spinlock, since a mutex with adaptive spinning (up to context switch latency) gives you the same performance for short critical sections without the disastrous worst-case behavior.
nly
51 minutes ago
Some people don't want to block for a microsecond because their lock was held 1ns past your adaptive mutex's spin deadline. That kind of jitter is unacceptable.
baobun
3 hours ago
General heuristics only get you so far, and at the limit they come with their own overhead compared to what you can do with a tailored solution built on knowledge of your usage and data-access patterns. The cases where this makes a practical difference for higher-level apps are rare, but they exist.
staticfloat
8 hours ago
I love that this article includes a test program at the bottom to allow you to verify its claims.
Tom1380
9 hours ago
I didn't know about using alignment to avoid cache bouncing. Fascinating stuff
bitexploder
8 hours ago
Yep. Super important in lock-free synchronization primitives like ring buffers. Cache-line-padded atomics are really cool :)
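
For example, the head/tail indices of an SPSC ring buffer (a sketch, with an assumed 64-byte line size; names are mine):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheLine = 64;  // assumption; see the interference-size subthread

    struct RingIndices {
        // Each index gets its own line, so the producer bumping tail never
        // invalidates the line the consumer is reading head from.
        alignas(kCacheLine) std::atomic<std::uint64_t> head{0};  // consumer-owned
        alignas(kCacheLine) std::atomic<std::uint64_t> tail{0};  // producer-owned
    };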
markisus
8 hours ago
Where do lock free algorithms fall in this analysis?
adrian_b
1 minute ago
There are several classes of lock-free algorithms with different properties.

Lock-free algorithms for read-only access to shared data structures rarely have disadvantages (only when the shared data structure is modified extremely frequently by writers, so that readers never manage to read it between changes).

On the other hand, lock-free algorithms for read/write access to shared data structures must be used with great caution, because they frequently have a higher cost than mutual exclusion. Such algorithms are based on the optimistic assumption that your thread will complete its access before the shared structure is touched by another competing thread. Whenever this assumption fails (which will happen under high contention), the transaction must be retried, which wastes much more work than a spinlock would.

raggi
7 hours ago
Some considerations can be similar, but the total set of details is different.

It also depends which lock-free solutions you're evaluating.

Some are higher-order spins (with similar high-level problems), others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points; for a lot of stuff with a strong atomic in the middle, that means things like sync-instruction costs, pipeline impacts of barriers/fences, etc.

bluecalm
4 hours ago
In a very basic case, lock-free data structures make threads race instead of spin. A thread makes its own copy of the part of the list/tree/whatever it needs to update, applies its changes to that copy, and then tries to swap its pointer into the data structure's pointer, provided that pointer hasn't changed in the meantime (there is an atomic CPU instruction for exactly that). If the thread fails (someone changed the pointer in the meantime), it tries again.
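
A minimal sketch of that race-and-retry loop, as a Treiber-style stack push (names are mine; the instruction in question is compare-and-swap, std::atomic's compare_exchange in C++):

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak reloads n->next with the current
        // head, so the loop simply retries until nobody raced us.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // another thread swapped head first; try again
        }
    }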