Safepoints and Fil-C
67 points | 3 days ago | 4 comments | fil-c.org | HN
cryptonector
5 hours ago
[-]
> Fil-C's pollchecks also support stop-the-world (via the FILC_THREAD_STATE_STOP_REQUESTED bit). This is used for:

> - Implementing fork(2), which needs all threads to stop at a known-good point before the fork child jettisons them.

Makes me wonder how it handles `vfork()`, but I think it needs just a safepoint and no stop-the-world since, after all, the child side of `vfork()` is executing in the same address space as the parent (until exec-or-exit), so it's as though it's the same thread as in the parent (which is stopped waiting for the child to exec-or-exit). All the more reason that `fork()` is evil and `vfork()` is much better.

reply
pizlonator
4 hours ago
[-]
I don't support vfork(2) right now, only fork(2).

I have some super crazy ideas about how to make vfork(2) work, but I don't particularly like any of them. I'll implement it if I find some situation where vfork(2) is strongly required (so far I haven't found even one such case). The closest is that I've found cases where I have to add the CLOEXEC pipe trick to do error handling the fork(2) way rather than the vfork(2) way.
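
For reference, the CLOEXEC pipe trick looks roughly like this (just a sketch; the helper name is made up):

    #define _GNU_SOURCE      /* for pipe2 */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Report exec failure from a fork(2) child through a pipe whose ends are
     * O_CLOEXEC, so a successful exec closes the write end automatically. */
    int spawn_with_error_reporting(char *const argv[])
    {
        int fds[2];
        if (pipe2(fds, O_CLOEXEC) < 0)
            return -1;
        pid_t pid = fork();
        if (pid == 0) {
            close(fds[0]);
            execvp(argv[0], argv);
            int err = errno;                      /* exec failed */
            write(fds[1], &err, sizeof err);      /* tell the parent why */
            _exit(127);
        }
        close(fds[1]);
        int err = 0;
        if (read(fds[0], &err, sizeof err) > 0) { /* 0 bytes read: exec succeeded */
            close(fds[0]);
            errno = err;
            return -1;
        }
        close(fds[0]);
        return pid;
    }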

reply
kmavm
3 hours ago
[-]
Hi Fil! Congrats on all the amazing progress on Fil-C.

We needed to port all the user-level fork(2) calls to vfork(2) when working on uClinux, a port of Linux to MMU-less microcontrollers[1]. It used to be that paging MMUs were kinda expensive (those TLBs! so much associativity!!!), and the CPU on your printer/ethernet card/etc. might not have that much grit. Nowadays not so much.

Still. A hard-and-fast use for vfork(2), as requested perhaps.

[1] http://www.ibiblio.org/lou/old/ViewStation/

reply
pizlonator
3 hours ago
[-]
It'll be a while before Fil-C is appropriate for use in MMU-less microcontrollers.
reply
cryptonector
4 hours ago
[-]
Remember that the child side of `vfork(2)` can only do async-signal-safe things as `vfork(2)` is documented, and it's really as if it were executing in the same thread as the parent (down to thread-locals, I believe), so `vfork(2)` shouldn't be all that special for Fil-C! You might want to try it.
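
For concreteness, the textbook discipline is exec-or-exit and nothing else in the child (a sketch; the names are illustrative):

    #include <unistd.h>

    void run_true(void)
    {
        pid_t pid = vfork();
        if (pid == 0) {
            /* child: borrows the parent's address space and, effectively, its thread */
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);   /* must not return from run_true() */
        }
        /* parent resumes here only after the child has exec'ed or exited */
    }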

As u/foota notes, it's important. The point of `vfork(2)` is that it's _fast_. https://news.ycombinator.com/item?id=30502392

reply
pizlonator
4 hours ago
[-]
> Remember that the child side of `vfork(2)` can only do async-signal-safe things as `vfork(2)` is documented

And if it does anything that isn't async-signal-safe, then all bets are off.

So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.

I'd have to have so many checks in so many weird places!

> The point of `vfork(2)` is that it's _fast_

Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).

reply
cryptonector
4 hours ago
[-]
> And if it does anything that isn't async-signal-safe, then all bets are off.

> So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.

Not really. See, Fil-C already makes the things that wouldn't otherwise be safe safe, except for returning from the caller of `vfork(2)`. Treat the child as if it were the same thread as in the parent and let it do whatever it would do.

Well, I guess the biggest issue is that you'd have to deal with how to communicate between the child and the GC thread, if you have to. One option is to just not allow the GC to run until the child of `vfork(2)` has exec'ed or exited.

> > The point of `vfork(2)` is that it's _fast_

> Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).

Typically it's programs with large RSS that need it, so things like JVMs, which probably wouldn't run under Fil-C.

reply
pizlonator
4 hours ago
[-]
Simple example that breaks the world:

    void foo(void)
    {
        vfork();
    }
In this case, the vfork child is returning. Eventually it'll exit. In the meantime, it's clobbered the vfork parent's stack.

Kaboom

reply
cryptonector
3 hours ago
[-]
Yes, _that_ is not to be allowed. I think you could have LLVM know about `vfork(2)` and treat that as an error -- heck, clang already knows how to do that, does it not?
reply
pizlonator
3 hours ago
[-]
That's just one example and the problem isn't just checking this one case, but any generalization of it. It can't be a fully static check because you could achieve similar things with longjmp or exceptions (both of which Fil-C supports).

And then there are similar issues with heap accesses and safepoints. The vfork child has to participate in safepoints, except that it can't quite (you can't use pthread_mutex/pthread_cond primitives in the vfork child to synchronize with threads in the vfork parent).

Anyway, I've thought about this a lot, and I could keep coming at you with reasons why it's hard. It's not like I've skipped implementing vfork(2) out of some abstract fear.

reply
trws
2 hours ago
[-]
The other place it comes up is launchers and resource managers. We actually have a series of old issues and implementation work on Flux (a large-scale resource manager for clusters) working around fork becoming a significant bottleneck in parallel launch. IIRC it showed up when we had ~1 GB of memory in use and needed to spawn between 64 and 192 processes per node. That said, we actually didn’t pivot to vfork, we pivoted to posix_spawn for all but the case where we have to change working directory (we had to support old glibc without the attr for that in spawn). If you’re interested, I think we did some benchmarking with public results I could dredge up.

Anyway, much as I have cases where it matters I guess what I’m saying is I think you’re right that vfork is rarely actually necessary, especially since you’d probably have a much easier time getting a faster and still deterministic spawn if it ever actually becomes a bottleneck for something you care about.
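
In case it helps, the posix_spawn path looks roughly like this (a simplified sketch, not our actual code; the chdir attribute we were missing is posix_spawn_file_actions_addchdir_np in newer glibc, IIRC):

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    /* Spawn a child without duplicating the parent's (possibly large) address
     * space; glibc implements this with a vfork-style clone internally. */
    int spawn_and_wait(char *const argv[])
    {
        pid_t pid;
        int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
            return -1;
        }
        int status;
        waitpid(pid, &status, 0);
        return status;
    }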

reply
foota
5 hours ago
[-]
I was looking at https://fil-c.org/programs_that_work and saw "dash 0.5.12. One tiny change: use fork(2) instead of vfork(2).", so maybe it doesn't :)
reply
cryptonector
5 hours ago
[-]
@pizlonator I wonder if you couldn't bracket all assembly with `filc_exit`/`filc_enter` as you do system calls. When you know the assembly doesn't allocate memory then this should work fine, but... ah, it's the stack allocations that are also interesting, right? So you'd have to ensure that the assembly has enough stack space to execute _and_ that it does not invoke the heap allocator. But that seems doable for things like cryptography in OpenSSL's libcrypto.
reply
pizlonator
4 hours ago
[-]
Yeah you could do that. It's tantamount to having an FFI. It's super risky though - lots of ways that the assembly code could break Fil-C's assumptions and then you'd get memory safety issues.

My long term plan is to give folks a way to write memory safe assembly. I have some crazy ideas for how to do that

reply
cryptonector
4 hours ago
[-]
Yeah, but it would go a looong way to getting us (the public) to use Fil-C in production. Like I'd really like to be able to run OpenSSH using Fil-C and I really don't want to have to worry about the C-coded crypto being non-constant-time.

What I'd suggest as a medium-term solution is to have macros that expand to nothing when Fil-C is not in use, but which expand to a Fil-C decoration indicating that the author thinks the assembly (or assembly-coded function) is Fil-C-safe.
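
Something like this, say (all names here are invented; I have no idea what Fil-C would actually call it):

    #include <stddef.h>

    /* Hypothetical opt-in marker; __FILC__ and the annotate string are made up. */
    #ifdef __FILC__
    #  define FILC_TRUSTED_ASM __attribute__((annotate("filc_trusted_asm")))
    #else
    #  define FILC_TRUSTED_ASM /* expands to nothing for ordinary builds */
    #endif

    /* The author asserts this assembly-coded routine stays within its buffers
     * and never touches the allocator. */
    FILC_TRUSTED_ASM void chacha20_blocks_asm(unsigned char *out,
                                              const unsigned char *in,
                                              size_t len,
                                              const unsigned int key[8],
                                              const unsigned int counter[4]);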

Now... I know, there's more, right? Like, you burn a register to keep a pointer to the Fil-C thread data structure, and the assembly needs to not step on that. So OK, it's probably harder than I'm making it seem, but maybe not much harder, because you can save that pointer as part of the exit and restore it as part of the enter.

I do trust that your crazy ideas are crazy but workable though!

reply
pizlonator
4 hours ago
[-]
That's a good point.

Note that statically checked inline asm is very achievable, so those folks who do constant-time crypto by concealing their math operators behind inline asm will get what they need.
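
(The idiom I mean is, roughly, an empty-asm optimization barrier around the operands, something like this sketch:)

    #include <stdint.h>

    /* Hide a value from the optimizer behind empty inline asm so a
     * constant-time select can't be turned back into a branch. */
    static inline uint64_t value_barrier(uint64_t x)
    {
        __asm__("" : "+r"(x));
        return x;
    }

    /* mask is 0 or ~0; selects a or b without branching on secret data */
    static inline uint64_t ct_select(uint64_t mask, uint64_t a, uint64_t b)
    {
        return (value_barrier(mask) & a) | (value_barrier(~mask) & b);
    }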

But I guess you really want the OpenSSL out-of-line assembly to work?

reply
cryptonector
3 hours ago
[-]
> But I guess you really want the OpenSSL out-of-line assembly to work?

Yes. And that code generally does not invoke the allocator or stray out of the bounds of the given buffers, so at the very least you could mark that assembly as trusted for this mode.

reply
pizlonator
2 hours ago
[-]
Yeah, I'll have to come up with a decent way of allowing this, at some point.

But see https://fil-c.org/runtime

It's really about creating an FFI for Fil-C, because Fil-C has its own calling convention and symbol mangling. It could be a lot of work.

And, worst case, it ends up being a footgun

reply
cryptonector
2 hours ago
[-]
I'm sure for many people's $WORK the ability to run OpenSSH built with Fil-C and constant-time crypto would be amazing, and it would be great advertising for Fil-C. But there is no way any of us would run OpenSSH built with Fil-C in production w/o constant-time crypto.
reply
pizlonator
2 hours ago
[-]
That's good to know.

If I made the assembly memory safe under Fil-C rules by running it through a transform that inserted additional instructions but did not otherwise change what was happening, would you trust that it's still constant-time?

reply
correct_horse
6 hours ago
[-]
Fil-C seems interesting, and I didn’t understand the details of how multi-threaded garbage collectors worked before reading it (I still don’t, but I’m closer!). The tradeoff between a compacting garbage collector (Java) and what you can bolt onto C without forking LLVM is particularly interesting.
reply
pjmlp
4 hours ago
[-]
Note that it is very simplistic to say Java has a compacting garbage collector; between Oracle JDK, OpenJDK, Temurin, Azul, PTC, Aicas, microEJ, OpenJ9, GraalVM, Jikes RVM, and the cousin Android, there is a plethora of GC configurations to choose from that go beyond that.
reply
whizzter
4 hours ago
[-]
Not to mention that many of them have multiple GCs that are selected as startup options.

Personally though, I can't wait for Azul's patents to start running out, so that OS vendors can start implementing similar machinery into mainline OSes for all runtimes to use.

reply
cryptonector
5 hours ago
[-]
TFA is really good at explaining how a threaded GC works with minimal impact when it's not running.
reply
cyberax
4 hours ago
[-]
Polling for the safepoint signal adds overhead to tight loops, so Golang uses asynchronous signals to interrupt threads. It needs that for async preemption, not just GC. It also means not knowing the exact state of the frame, so it has to scan the last frame of the stack conservatively.

There are no good ways to deal with that currently. Some VMs even used CPU instruction simulators to interpret the code until it hit a known-good safepoint.

I had one idea about improving that: have a "doppelganger" mirror of the innermost tight loops that is instruction-for-instruction identical to the regular version, except that backjumps are replaced with a call into the GC safepoint handler.

It'd be interesting to try this approach with Fil-C.
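
To make that concrete, roughly (shown as C for readability, though the real thing would operate on the machine code; gc_safepoint is a made-up name):

    void gc_safepoint(void);

    /* Regular hot loop: no pollchecks at all. */
    long sum(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* "Doppelganger" copy: same body, but the back-edge goes through the GC.
     * The signal handler would compute the offset of the interrupted
     * instruction within sum() and resume at the same offset in sum_mirror(). */
    long sum_mirror(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += a[i];
            gc_safepoint();   /* stands in for "backjump replaced with a call" */
        }
        return s;
    }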

reply
pizlonator
4 hours ago
[-]
> Polling for the safepoint signal adds overhead to tight loops

The usual trick, which Fil-C doesn't do yet, is to just unroll super tight loops

Also, the pollcheck doesn't have to be a load-and-branch. There are other ways. There's a bottomless pit of optimization strategies you can do.

Conservative scanning wouldn't be sound in Fil-C, because the LLVM passes that run after FilPizlonator could break the fundamental assumptions of conservative scanning

reply
cyberax
4 hours ago
[-]
What other methods of stopping at safepoints do you think are viable? Another approach is code patching, but it's not possible on iOS, and it's a bad idea in general.
reply
pizlonator
4 hours ago
[-]
The most commonly used optimization is to have just a load, or just a store, rather than a load-and-branch. Basically you access a page that you mprotect to trigger the handshake.
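
Sketched in C (names made up), the contrast between the two flavors is:

    #include <stdatomic.h>
    #include <stdint.h>

    extern _Atomic uint8_t safepoint_requested;   /* set by the runtime */
    extern volatile uint8_t *polling_page;        /* runtime mprotects this to PROT_NONE */
    void handle_safepoint(void);

    /* load-and-branch pollcheck */
    static inline void pollcheck_branch(void)
    {
        if (atomic_load_explicit(&safepoint_requested, memory_order_relaxed))
            handle_safepoint();
    }

    /* load-only pollcheck: when the runtime wants a handshake it revokes the
     * page's permissions, and the fault handler runs the safepoint instead */
    static inline void pollcheck_page(void)
    {
        (void)*polling_page;
    }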

Unrolling loops is also super common. Recognizing loops that have a bounded runtime is also common.

My favorite technique to try one day is:

1. just record where the pollcheck points are but don't emit code there

2. to handshake with a thread, shoot it with a signal and have the signal handler interpret the machine code starting at the ucontext until it gets to a pollcheck

reply
cyberax
1 hour ago
[-]
Memory protection games just don't scale. You're modifying global structures, resulting in tons of contention. That's also why compacting GCs prefer to use card marking rather than mprotect.
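
(By card marking I mean, roughly, a byte store per reference write instead of any page-protection traffic; a sketch with made-up names and geometry:)

    #include <stdint.h>

    #define CARD_SHIFT 9                      /* 512-byte cards, say */
    extern uint8_t *card_table;
    extern uintptr_t heap_base;

    /* generational store barrier: dirty the card covering the written slot */
    static inline void write_ref(void **slot, void *value)
    {
        *slot = value;
        card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
    }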

About (2): that's exactly what I meant. The downside is that now you have to write a full x86 emulator, maybe including SIMD instructions. Good luck.

And that's also what I want to try to avoid. Imagine generating a copy of the innermost loops, with the nearest backbranch replaced with a jump into the GC code. Then from the signal handler you just need to find the offset between the current instruction pointer and the mirror copy, adjust the pointer, and exit from the signal handler.

reply
pizlonator
37 minutes ago
[-]
> Memory protection games just don't scale. You're modifying global structures, resulting in tons of contention. That's also why compacting GCs prefer to use card marking rather than mprotect.

I’m describing what production JVMs do. They do it because it’s a speedup for them. Like, those compacting GCs that you’re saying use card marking are using page faults for safe points more often than not.

It’s true that page protection games don’t result in speedups for generational store barriers. A safe point is not that.

reply
cyberax
3 minutes ago
[-]
I believe the Oracle JVM uses polling?

This is the class responsible for safepoints: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...

And this is one of the implementations: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...

The arming code: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...

I might be missing something, but I'd be terribly surprised if VM games are worth it.

reply