> - Implementing fork(2), which needs all threads to stop at a known-good point before the fork child jettisons them.
Makes me wonder how it handles `vfork()`, but I think it needs just a safepoint and no stop-the-world since, after all, the child side of `vfork()` is executing in the same address space as the parent (until exec-or-exit), so it's as though it's the same thread as in the parent (which is stopped waiting for the child to exec-or-exit). All the more reason that `fork()` is evil and `vfork()` much better.
I have some super crazy ideas about how to make vfork(2) work, but I don't particularly like any of them. I'll implement it if I find some situation where vfork(2) is strongly required (so far I haven't found even one such case). The closest is that I've found cases where I have to add the CLOEXEC pipe trick to do error handling the fork(2) way rather than the vfork(2) way.
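For anyone who hasn't seen it, the trick looks roughly like this (a minimal portable sketch; spawn_reporting_errors is just an illustrative name, not anything that actually exists in Fil-C's userland):

    /* CLOEXEC pipe trick: the child reports an exec failure to the parent
       over a pipe whose write end disappears on a successful exec, so
       "parent reads 0 bytes" means the exec worked. */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t spawn_reporting_errors(char *const argv[])
    {
        int pfd[2];
        if (pipe(pfd) < 0)
            return -1;
        fcntl(pfd[0], F_SETFD, FD_CLOEXEC);
        fcntl(pfd[1], F_SETFD, FD_CLOEXEC);

        pid_t pid = fork();
        if (pid < 0) {
            close(pfd[0]);
            close(pfd[1]);
            return -1;
        }
        if (pid == 0) {
            close(pfd[0]);
            execvp(argv[0], argv);
            int err = errno;                  /* only reached if exec failed */
            (void)write(pfd[1], &err, sizeof err);
            _exit(127);
        }

        close(pfd[1]);
        int err = 0;
        ssize_t n = read(pfd[0], &err, sizeof err);
        close(pfd[0]);
        if (n == (ssize_t)sizeof err) {       /* child told us exec failed */
            errno = err;
            return -1;
        }
        return pid;
    }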
We needed to port all the user-level fork(2) calls to vfork(2) when working on uCLinux, a port of Linux to MMU-less microcontrollers[1]. It used to be that paging MMUs were kinda expensive (those TLBs! so much associativity!!!), and the CPU on your printer/ethernet card/etc. might not have that much grit. Nowadays not so much.
Still. A hard-and-fast use for vfork(2), as requested perhaps.
As u/foota notes, it's important. The point of `vfork(2)` is that it's _fast_. https://news.ycombinator.com/item?id=30502392
And if it does anything that isn't async-signal-safe, then all bets are off.
So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.
I'd have to have so many checks in so many weird places!
> The point of `vfork(2)` is that it's _fast_
Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).
> So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.
Not really. See, Fil-C already makes the things that wouldn't be safe be safe, except for returning from the caller of `vfork(2)`. Treat the child as if it's the same thread as in the parent and let it do whatever it would do.
Well, I guess the biggest issue is that you'd have to deal with how to communicate between the child and the GC thread, if you have to. One option is to just not allow the GC to run while a child of `vfork(2)` hasn't exec'ed-or-exit'ed.
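Roughly this, with made-up names (gc_block()/gc_unblock() stand in for whatever hook the runtime would actually expose):

    /* Hypothetical sketch; gc_block()/gc_unblock() are invented names, not
       real Fil-C API.  Written as a macro because returning from a function
       that called vfork() would trash the parent's stack. */
    #include <sys/types.h>
    #include <unistd.h>

    extern void gc_block(void);    /* hypothetical: forbid collections */
    extern void gc_unblock(void);  /* hypothetical: allow them again */

    #define VFORK_GC_PARKED(pid_var)                                     \
        do {                                                             \
            gc_block();                                                  \
            (pid_var) = vfork();                                         \
            if ((pid_var) != 0)                                          \
                gc_unblock();  /* parent: child has exec'd or exited */  \
        } while (0)

The child still has to exec or _exit immediately, as with any vfork().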
> > The point of `vfork(2)` is that it's _fast_
> Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).
Typically it's programs with large RSS that need it, so things like JVMs, which probably wouldn't run under Fil-C.
    void foo(void)
    {
        vfork();
    }
In this case, the vfork child is returning. Eventually it'll exit. In the meantime, it's clobbered the vfork parent's stack. Kaboom.
And then there are similar issues with heap accesses and safepoints. The vfork child has to participate in safepoints, except that it can't quite (you can't use pthread_mutex/pthread_cond primitives in the vfork child to synchronize with threads in the vfork parent).
Anyway, I've thought about this a lot, and I could keep coming at you with reasons why it's hard. It's not like I've left vfork(2) unimplemented out of some abstract fear.
Anyway, much as I have cases where it matters, I guess what I'm saying is that I think you're right: vfork is rarely actually necessary, especially since you'd probably have a much easier time getting a faster and still deterministic spawn if it ever actually becomes a bottleneck for something you care about.
My long-term plan is to give folks a way to write memory-safe assembly. I have some crazy ideas for how to do that.
What I'd suggest as a medium-term solution is to have macros that expand to nothing if Fil-C is not used, but which expand to a Fil-C decoration indicating that the author thinks the assembly (or assembly-coded function) is Fil-C-safe.
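Something like this, say (every name here, __FILC__, filc_trusted_asm, and the sha256 signature, is made up for illustration):

    /* Sketch only: __FILC__ and filc_trusted_asm are invented names standing
       in for whatever the compiler would actually define. */
    #include <stddef.h>
    #include <stdint.h>

    #ifdef __FILC__
    #  define FILC_TRUSTED_ASM __attribute__((filc_trusted_asm))
    #else
    #  define FILC_TRUSTED_ASM /* expands to nothing for other compilers */
    #endif

    /* The author asserts that this assembly-coded routine stays within the
       bounds of the buffers it is handed and never touches the allocator. */
    FILC_TRUSTED_ASM void sha256_block_asm(uint32_t *state,
                                           const uint8_t *in, size_t nblocks);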
Now... I know, there's more, right? Like, you burn a register to keep a pointer to the Fil-C thread data structure, and the assembly needs to not step on that. So OK, it's probably harder than I'm making it seem, but maybe not much harder, because you can save that pointer as part of the exit and restore it as part of the enter.
I do trust that your crazy ideas are crazy but workable though!
Note that statically checked inline asm is very achievable, so those folks who do constant time crypto by concealing their math operators behind inline asm will get what they need.
But I guess you really want the OpenSSL out-of-line assembly to work?
Yes. And that code generally does not invoke the allocator or stray out of the bounds of the given buffers, so at the very least you could mark that assembly as trusted for this mode.
But see https://fil-c.org/runtime
It's really about creating an FFI for Fil-C, because Fil-C has its own calling convention and symbol mangling. It could be a lot of work.
And, worst case, it ends up being a footgun
If I made the assembly memory safe under Fil-C rules by running it through a transform that inserted additional instructions but did not otherwise change what was happening, would you trust that it's still constant-time?
Personally, though, I can't wait for Azul's patents to start running out so that OS vendors can start building similar machinery into mainline OSes for all runtimes to use.
There are no good ways to deal with that currently. Some VMs even used CPU instruction simulators to interpret the code until it hit the known good safepoint.
I had one idea about improving that: have a "doppelganger" mirror of the inner tight loops that is instruction-by-instruction identical to the regular version, except that back-branches are replaced with calls into the GC safepoint handler.
It'd be interesting to try this approach with Fil-C.
The usual trick, which Fil-C doesn't do yet, is to just unroll super tight loops.
Also, the pollcheck doesn't have to be a load-and-branch. There are other ways. There's a bottomless pit of optimization strategies you can do.
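For context, the load-and-branch form is basically the following sketch (made-up names, not Fil-C's actual thread-state layout):

    /* Sketch of the classic load-and-branch pollcheck; names are made up and
       this is not Fil-C's actual thread-state layout. */
    #include <stdbool.h>

    struct thread_state {
        volatile bool safepoint_requested;   /* set by the collector */
    };

    void enter_safepoint_slow(struct thread_state *ts);  /* out-of-line path */

    static inline void pollcheck(struct thread_state *ts)
    {
        /* Hot path is one load and one (normally untaken) branch. */
        if (ts->safepoint_requested)
            enter_safepoint_slow(ts);
    }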
Conservative scanning wouldn't be sound in Fil-C, because the LLVM passes that run after FilPizlonator could break the fundamental assumptions of conservative scanning.
Unrolling loops is also super common. Recognizing loops that have a bounded runtime is also common.
My favorite technique to try one day is:
1. just record where the pollcheck points are but don't emit code there
2. to handshake with a thread, shoot it with a signal and have the signal handler interpret the machine code starting at the ucontext until it gets to a pollcheck
About 2. - that's exactly what I meant. The downside is that now you have to write a full x86 emulator, maybe including SIMD instructions. Good luck.
And that's also what I want to try to avoid. Imagine generating a copy of the innermost loops, with the nearest backbranch replaced with a jump into the GC code. Then from the signal handler you just need to find the offset between the current instruction pointer and the mirror copy, adjust the pointer, and exit from the signal handler.
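The handler side would be roughly this (an x86-64 Linux sketch; hot_base/hot_size/mirror_base are placeholders for metadata the compiler would have to emit):

    /* Sketch: redirect a signalled thread into the mirror copy of its hot
       loop.  hot_base/hot_size/mirror_base are placeholders for per-loop
       metadata the compiler would have to emit; none of this exists today. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <ucontext.h>

    extern uintptr_t hot_base, hot_size, mirror_base;   /* placeholders */

    static void handshake_handler(int sig, siginfo_t *info, void *uctx)
    {
        (void)sig; (void)info;
        ucontext_t *uc = uctx;
        uintptr_t ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];

        /* If we interrupted the hot loop, resume at the same instruction in
           the mirror, whose back-branch falls into the GC safepoint code. */
        if (ip - hot_base < hot_size)
            uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(ip - hot_base + mirror_base);
    }

Install it with sigaction() and SA_SIGINFO; the hard part is generating the mirror copies and the metadata, not the handler.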
I’m describing what production JVMs do. They do it because it’s a speedup for them. Like, those compacting GCs that you’re saying use card marking are using page faults for safe points more often than not.
It’s true that page protection games don’t result in speedups for generational store barriers. A safe point is not that.
This is the class responsible for safepoints: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...
And this is one of the implementations: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...
The arming code: https://github.com/openjdk/jdk/blob/4b544f93ad0e2beae4c80e06...
I might be missing something, but I'd be terribly surprised if VM games are worth it.