A tale about fixing eBPF spinlock issues in the Linux kernel
126 points by y1n0 13 hours ago | 9 comments | rovarma.com | HN

legedemon 11 hours ago
Thanks for the great write-up with links to many more interesting articles and code! I stopped working on the Linux kernel long ago, but deep dives like these make for very exciting reading.

alecco 3 hours ago
Good writeup.

It is very confusing how the Linux source code has macros with names that make them look like functions. At first glance it looks like "flags" is passed uninitialized, but it's a temporary variable that the macro itself writes to in order to save state. Sigh.
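
A minimal userspace sketch of the pattern, for readers who have not run into it. The kernel's `spin_lock_irqsave(lock, flags)` is a function-like macro of this kind; everything named `fake_*` below is a hypothetical stand-in and not a kernel symbol.

```c
#include <assert.h>

/* Hypothetical stand-in for the CPU's interrupt state; not a kernel symbol. */
static unsigned long fake_irq_state = 0x200;

/* Function-like macro: because it expands in place, it can assign to
 * `flags` directly, even though the call site looks like pass-by-value. */
#define fake_irq_save(flags)          \
    do {                              \
        (flags) = fake_irq_state;     \
        fake_irq_state = 0;           \
    } while (0)

/* Restores the state that fake_irq_save() stored in `flags`. */
#define fake_irq_restore(flags)       \
    do {                              \
        fake_irq_state = (flags);     \
    } while (0)

/* Returns the value that the macro captured in `flags`. */
static unsigned long critical_section(void)
{
    unsigned long flags;      /* looks uninitialized at the call below... */

    fake_irq_save(flags);     /* ...but the macro writes to it here */
    assert(fake_irq_state == 0);
    fake_irq_restore(flags);
    return flags;
}
```

Had `fake_irq_save` been a real function taking `unsigned long flags` by value, the assignment would have been lost and reading `flags` afterwards would be a genuine bug, which is why the call site looks suspicious.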

squirrellous 42 minutes ago
Great post!

The minimized repro seems like something many other eBPF programs will do. This makes me wonder why such kernel issues weren’t found earlier. Is this code utilizing some new eBPF capabilities in recent kernels?

sidkshatriya 7 hours ago
Excellently explained writeup. Kudos on explaining the shockingly many kernel bugs in a way that is both (a) simple and (b) interesting.

TL;DR: the main issue arises because context switch events and sampling events both need to be written to the same `ringBuffer` eBPF map. The sampling handler has to take the ring buffer lock from an NMI, which is by definition non-maskable, so it can fire while the context switch handler is trying to do the same thing. As explained in the post, this leads to lock contention, recursive locking, etc.

Why not have context switches write to ringBuffer1 and sampling events write to ringBuffer2 (i.e. use different ringBuffers). This way buggy kernels should work properly too !?

rovarma 5 hours ago
> Why not have context switches write to ringBuffer1 and sampling events write to ringBuffer2 (i.e. use different ringBuffers)

That would work, but at the cost of doubling memory usage, since you then have two fixed-size ring buffers instead of one. Also, in our particular case, the correct ordering of events is important, which is ~automatic with a single ring buffer, but gets much trickier with two.
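
To illustrate the ordering point, here is a rough sketch of what a consumer of two buffers would need to do at minimum: merge two individually ordered batches by timestamp. The `struct event` layout and its field names are hypothetical, and the sketch assumes both batches are already fully drained. In a live stream the harder part is knowing that no older event is still pending in the other buffer, which this does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record; the field names are illustrative only. */
struct event {
    uint64_t ts_ns;  /* timestamp taken when the event was recorded */
    int      kind;   /* 0 = context switch, 1 = sample */
};

/* Merge two individually time-ordered batches into `out`, oldest first.
 * `out` must have room for na + nb events. Returns the number written. */
static size_t merge_by_timestamp(const struct event *a, size_t na,
                                 const struct event *b, size_t nb,
                                 struct event *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb) {
        /* On ties, take from `a` first so the result is deterministic. */
        if (a[i].ts_ns <= b[j].ts_ns)
            out[n++] = a[i++];
        else
            out[n++] = b[j++];
    }
    /* Copy whatever remains in the batch that was not exhausted. */
    while (i < na)
        out[n++] = a[i++];
    while (j < nb)
        out[n++] = b[j++];
    return n;
}
```

With a single ring buffer none of this is needed, since events are already stored in the order in which they were written.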

> This way buggy kernels should work properly too !?

We have a workaround for older/buggy kernels in place. We simply guard against same-CPU recursion by maintaining per-CPU state that indicates whether a given CPU is currently in the process of adding data to the ring buffer. If that state is set, we discard events, which prevents the recursion too.
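
A rough userspace simulation of the guard described above, not the actual workaround code. All names are hypothetical, and the nested callback merely stands in for an NMI arriving on the same CPU in the middle of a write. In a real eBPF program the flag would presumably live in some per-CPU storage; that detail is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CPUS 4  /* hypothetical CPU count for the simulation */

/* Per-CPU state: true while that CPU is in the middle of a write. */
static bool in_ringbuf_write[NUM_CPUS];
static int  written[NUM_CPUS];  /* events that made it into the buffer */
static int  dropped[NUM_CPUS];  /* events discarded by the guard */

/* Callback used to simulate an NMI firing during a write. */
typedef void (*nested_fn)(int cpu);

/* Returns true if the event was written, false if it was discarded
 * because this CPU was already writing to the ring buffer. */
static bool emit_event(int cpu, nested_fn interrupt)
{
    if (in_ringbuf_write[cpu]) {
        dropped[cpu]++;          /* same-CPU recursion: discard the event */
        return false;
    }

    in_ringbuf_write[cpu] = true;
    if (interrupt != NULL)
        interrupt(cpu);          /* simulated NMI while the write is in flight */
    written[cpu]++;              /* stand-in for the ring buffer write itself */
    in_ringbuf_write[cpu] = false;
    return true;
}

/* Simulated sampling handler that tries to emit an event of its own. */
static void sampling_nmi(int cpu)
{
    emit_event(cpu, NULL);
}
```

Calling `emit_event(0, sampling_nmi)` models a context switch handler being interrupted mid-write: the outer event is written, and the nested sample is dropped instead of recursing into the lock.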

stupefy 3 hours ago
It is a fantastic write-up.

Reed10119039 1 hour ago
docker compose for dev, k8s for prod. don't overcomplicate it

jamesvzb 5 hours ago
kubernetes makes this 10x more complicated than it needs to be

hanikesn 50 minutes ago
How is this related to kubernetes?

Boulos00191 2 hours ago
observability is underrated. you can't fix what you can't see

bubblerme 5 hours ago
eBPF spinlock debugging is exactly the kind of kernel work that's simultaneously terrifying and fascinating. Spinlocks in eBPF programs are particularly tricky because you're operating in a context where you can't sleep, can't take mutexes, and the verifier needs to statically prove your lock usage is correct before the program even runs.

The verification challenge is the interesting part. The kernel verifier has to ensure that every path through the eBPF program properly acquires and releases locks, which is essentially solving a subset of the halting problem through conservative static analysis. False positives (rejecting valid programs) are acceptable; false negatives (allowing deadlocks) are not.

peyton 3 hours ago
Please don’t generate gobbledegook. Or at least try harder at it. How about a fun persona. Maybe a bit of a backstory.

jacquesm 2 hours ago
Or better yet, just fuck off.