[disclaimer] Without brushing up on the details of this, I strongly suspect that this is more about removing the need for executable stacks than about performance. Allocating a trampoline on the stack rather than the heap is good for efficiency.
These days, many GNU/Linux distros are disabling executable stacks by default in their toolchain configuration, both for building the distro and for the toolchain offered by the system to the user.
When you use GCC local functions, it overrides the linker behavior so that the executable is marked as requiring an executable stack.
Of course, that is a security concession: an executable stack enables remote-code-execution exploits that rely on injecting code onto the stack via a buffer overflow and tricking the process into jumping to it.
If trampolines can be allocated in a heap, then you don't need an executable stack. You do need an executable heap, or a dedicated executable heap for these allocations. (Trampolines are all the same size, so they could be packed into an array.)
Code that calls GCC local functions indirectly through pointers is not aware of the trampolines. The trampolines are deallocated naturally when the stack rolls back on function return, longjmp, or a C++ exception passing through.
Heap-allocated trampolines have an obvious deallocation problem; it would be interesting to see what strategy is used for that.
With -ftrampoline-impl=heap, GCC automatically inserts[1] pairs of constructor/destructor calls to routines from libgcc which are built around mmap/munmap.
It probably has some logic for dealing with different threads, and also for dealing with stack frames abandoned by longjmp or a longjmp-like mechanism.
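To make the idea concrete, here is a heavily simplified sketch of a heap-allocated trampoline (x86-64 SysV Linux only, and emphatically not libgcc's implementation: the byte sequence, the make_trampoline helper, and the choice to pass the environment in an ordinary argument register are all illustrative assumptions; GCC's real scheme uses a dedicated static-chain register, pools same-size slots, and pairs the allocation with an automatic cleanup call as described above):

/* Hedged sketch only: build the trampoline in an executable heap mapping
 * instead of on the stack. The trampoline loads the environment pointer
 * into a spare argument register and tail-jumps to the real function,
 * so callers see a plain, thin function pointer. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

struct env { int bias; };

/* the "nested" function written out by hand, with an explicit extra
 * environment parameter that the trampoline will supply in %rsi */
static int add_bias_impl(int x, struct env *e) { return x + e->bias; }

typedef int (*thin_fn)(int);   /* what a qsort-style API would see */

static thin_fn make_trampoline(int (*target)(int, struct env *), struct env *chain) {
    unsigned char code[] = {
        0xF3, 0x0F, 0x1E, 0xFA,        /* endbr64 (harmless NOP without CET) */
        0x48, 0xBE, 0,0,0,0,0,0,0,0,   /* movabs $chain,  %rsi */
        0x48, 0xB8, 0,0,0,0,0,0,0,0,   /* movabs $target, %rax */
        0xFF, 0xE0,                    /* jmp *%rax            */
    };
    memcpy(code + 6,  &chain,  sizeof chain);
    memcpy(code + 16, &target, sizeof target);

    /* writable+executable mapping; hardened systems may refuse this */
    unsigned char *slot = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (slot == MAP_FAILED) return NULL;
    memcpy(slot, code, sizeof code);
    return (thin_fn)(void *)slot;      /* munmap() would be the "destructor" half */
}

int main(void) {
    struct env e = { .bias = 41 };
    thin_fn f = make_trampoline(add_bias_impl, &e);
    if (!f) return 1;
    printf("%d\n", f(1));              /* prints 42 */
}

A real implementation would also have to decide when to run the munmap half, which is exactly the deallocation problem discussed above.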
The most striking surprise is the magnitude of the gap between std::function and std::function_ref. It turns out std::function (the owning container) forces copy-by-value semantics deep into the recursion. In the "Man-or-Boy" test, this apparently causes an exponential explosion of copying the closure state at every recursive step. std::function_ref (the non-owning view) avoids this entirely.
Unlike GCC 14, though, GCC 15 itself does seem to be able to optimize away the allocation (and the whole std::function) in trivial cases (independently of what the library does).
Therefore it's very jarring to see this text after the first C code example:
> This uses a static variable to have it persist between both the compare function calls that qsort makes and the main call which (potentially) changes its value to be 1 instead of 0
This feels completely made up, and/or confused about things I would expect the author of a piece like this to really know.
In reality, in this usage (at the global, outermost scope level) `static` has nothing to do with persistence. All it does is make the variable "private" to the translation unit (C parlance, read as "C source code file"). The value will "persist" since the global outermost scope can't go out of scope while the program is running.
It's different when used inside a function: there it makes the value persist between invocations, in practice typically by moving the variable from the stack to the "global data", which is generally heap-allocated as the program loads. Note that C does not mention the existence of a stack for local variables, but of course that is the typical implementation on modern systems.
The fact that you are questioning the use of the term shows that you are not familiar with the ISO C standard. What the author alludes to is static storage duration. And whether or not you use the "static" keyword in that declaration (also definition), the storage duration of the object remains "static". People mostly call those things "global variables", but the proper standardese is "static storage duration". In that sense, the author was right to use "static" for the lifetime of the object.
EDIT: if you drop "static" from that declaration, what changes is the linkage of the identifier (from internal to external).
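A tiny illustration of the distinction (identifiers invented for the example):

#include <stdio.h>

int counter_ext = 0;          /* external linkage, static storage duration */
static int counter_int = 0;   /* internal linkage, static storage duration */

int bump(void) {
    static int calls = 0;     /* block scope: here `static` really does change
                                 the storage duration from automatic to static,
                                 so the value persists across calls */
    return ++calls;
}

int main(void) {
    bump(); bump();
    printf("%d %d %d\n", counter_ext, counter_int, bump());   /* 0 0 3 */
}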
The only misleading thing here is that ‘static’ is monospaced in the article (this can’t be seen on HN). Other than that, ‘static variable’ can plausibly refer to an object with a static storage duration, which is what the C standard would call it.
>moving the variable from the stack to the "global data" which is generally heap-allocated as the program loads
It is not heap-allocated because you can’t free() it. Non-zero static data is not even anonymously mapped, it is file-backed with copy-on-write.
This doesn’t mean that it’s impossible to make mistakes, but still.
If I follow your comment, you mean that he could have used a non-static global variable instead and avoided mentioning the "static" keyword afterward?
Yes, the `static` can simply be dropped, it does no additional work for a single-file snippet like this.
I tried diving into Compiler Explorer to examine this, and it actually produces slightly different code for the with/without `static` cases, but it was confusing to deeply understand quickly enough to use the output here. Sorry.
Also, the difference manifests in the symbol table, not the assembly.
(I can't be bothered to run his benchmarks)
#include <stdio.h>

/* GNU C: nested functions + statement expressions.
   Knuth's man-or-boy test, with closures passed around as explicit "fat pointers". */

typedef struct env_ E;
typedef struct fat_ptr_ Fp;
typedef int fn(E*);

struct fat_ptr_ {
    fn *f;
    E *e;
};

#define INT(body) ({ int lambda(E *unused_){ return body; }; (Fp){lambda,0}; })

struct env_ {
    int k;
    Fp x1; Fp x2; Fp x3; Fp x4;
};

#define FpMk(fn,e) {fn, e}
#define FpCall(fn) (fn.f(fn.e))

int main(){
    int a(E env, Fp x5){
        int b(E *ep){
            return a( (E){--(ep->k), FpMk(b, ep), ep->x1, ep->x2, ep->x3}, ep->x4 );
        }
        return env.k<=0 ? FpCall(env.x4) + FpCall(x5) : b(&env);
    }
    printf(" %d\n", a( (E){10, INT(1), INT(-1), INT(-1), INT(1)}, INT(0)) );   /* -67 */
}
Something I've been thinking about lately is having a "state" keyword for declaring variables in a "stateful" function. This works just like "static" except that instead of having a single global instance of each variable, the variables are added to an automatically defined struct, whose type is available via "statetype(foo)" or some other mechanism; then you can invoke foo with an instance of the state (in C this would be an explicit first parameter, also marked with the "state" keyword). Stateful functions are colored in the sense that if you invoke a nested stateful function, its state gets added to the caller's state. This probably won't fly with separate compilation though.
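To make that concrete, here's a rough sketch of what it might desugar to in today's C, written out by hand (the names counter and counter_state stand in for the auto-generated ones):

#include <stdio.h>

/* the automatically defined struct -- what "statetype(counter)" would name */
struct counter_state { int calls; int total; };

/* the stateful function, with its state as an explicit first parameter */
int counter(struct counter_state *st, int x) {
    st->calls++;            /* would be: state int calls; */
    st->total += x;         /* would be: state int total; */
    return st->total;
}

int main(void) {
    struct counter_state a = {0}, b = {0};   /* two independent instances */
    counter(&a, 5); counter(&a, 7);
    counter(&b, 1);
    printf("%d %d\n", a.total, b.total);     /* 12 1 */
}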
C++Builder’s entire UI system is built around __closure and it is remarkably efficient: effectively, a very neat fat pointer of object instance and method.
[*] Edit: two dates on the paper, but “bound pointer to member” and they note the connection to events too: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2002/n13...
Raku (née Perl 6) has this! https://docs.raku.org/language/variables#The_state_declarato...
- where does the automatically defined struct live? Data segment might work for static, but doesn't allow dynamic use. Stack will be garbage if the closure outlives the function context (i.e. callback, future). Heap might work, but how do you prevent leaks without C++/Rust RAII?
- while a function pointer may be copied or moved, the state area probably cannot. It may contain pointers to stack objects or point into itself (think Rust's pinning)
- you already mention recursion, compilation
- ...
[1] https://github.com/ThePhD/future_cxx/issues/55#issuecomment-...
int main(int argc, char* argv[]) {
    if (argc > 1) {
        char* r_loc = strchr(argv[1], 'r');
        if (r_loc != NULL) {
            ptrdiff_t r_from_start = (r_loc - argv[1]);
            if (r_from_start == 1 && argv[1][0] == '-' && strlen(r_loc) == 1) {
                in_reverse = 1;
            }
        }
    }
    ...
}

Why not
if (argc > 1 && strcmp(argv[1], "-r") == 0) {
    in_reverse = 1;
}

for example?
Your solution is perfectly fine. Even if you don't have access to strcmp for some reason, the original snippet is really convoluted.
You could just write (strlen(argv[1]) > 1 && argv[1][0] == '-' && argv[1][1] == 'r') if you really want to.
And if you ever find yourself actually doing command line parsing, use getopt(). It handles all the corner cases reliably, and consistently with other tools.
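For the article's single -r flag, a minimal POSIX getopt() version might look like this (in_reverse is declared here just to keep the snippet self-contained):

#include <unistd.h>

static int in_reverse = 0;

int main(int argc, char *argv[]) {
    int opt;
    while ((opt = getopt(argc, argv, "r")) != -1) {
        switch (opt) {
        case 'r': in_reverse = 1; break;
        default:  return 1;   /* getopt already printed a diagnostic */
        }
    }
    /* ... sort using in_reverse ... */
    return 0;
}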
Also, the use of a convoluted `if` to conditionally assign a literal boolean is a code smell (to me), I would drop the `if` and just use:
in_reverse = argc > 1 && argv[1][0] == '-' && argv[1][1] == 'r';
if a more forward-thinking/strict check is not needed.

And then we have this "modern" way of spelling pointers, "const int* right" (note the space). In C, declaration syntax mirrors use, so it should be "const int *right", because "*right" is a "const int".
I feel too old for this shit. :(
const int left = *(const int*)untyped_left, right = *(const int*)untyped_right;
return in_reverse
    ? (right > left) - (right < left)
    : (left > right) - (left < right);

I wonder if there is a way to actually do it with only arithmetic, without comparisons?

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3654.pdf
(and I am not impressed by micro benchmarks)
You can call the local functions directly and get the benefits of the specialized code.
There's no way to spell out this function's type, and no way to store it anywhere. This is true of regular functions too!
To pass it around you need to use the type-erased "fat pointer" version.
I don't see how anything else makes sense for C.
Well, regular functions decay to function pointers. You could of course have the moral equivalent of std::function_ref (or, similarly, Borland's __closure) in C and have closures decay to it.
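For illustration, a minimal sketch of what such a type-erased "fat pointer" could look like in plain C (all names here are invented; it's roughly the shape of function_ref/__closure):

#include <stdio.h>

typedef struct {
    int (*call)(void *env, int arg);   /* type-erased entry point */
    void *env;                          /* captured state */
} int_fn_ref;

struct adder_env { int delta; };

static int adder_call(void *env, int arg) {
    return arg + ((struct adder_env *)env)->delta;
}

static int apply(int_fn_ref f, int x) { return f.call(f.env, x); }

int main(void) {
    struct adder_env e = { .delta = 10 };
    int_fn_ref add10 = { adder_call, &e };   /* the "decayed" closure */
    printf("%d\n", apply(add10, 32));        /* 42 */
}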
I'm a fan of nested functions but don't think the executable stack hack is worth it, and using a 'display' is a better solution.
See the Dragon Book, or Compiler Construction: Principles and Practice (1997) by Louden.
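For anyone unfamiliar with the technique, here's a hand-written sketch of the "display" idea: an array of enclosing-frame pointers indexed by lexical nesting depth. The names are invented, a compiler would generate this bookkeeping itself, and a threaded implementation would make the display _Thread_local:

#include <stdio.h>

struct outer_frame { int counter; };

/* display[d] points at the live frame of the enclosing function at depth d */
static struct outer_frame *display[8];

static void inner(void) {                 /* conceptually nested inside outer() */
    struct outer_frame *up = display[0];  /* reach the enclosing frame */
    up->counter++;
}

static void outer(void) {
    struct outer_frame frame = { 0 };
    struct outer_frame *saved = display[0];   /* save, so recursion works */
    display[0] = &frame;                      /* publish our frame at our depth */
    inner();
    inner();
    display[0] = saved;                       /* restore on the way out */
    printf("%d\n", frame.counter);            /* prints 2 */
}

int main(void) { outer(); }

Since the environment is reached through the display rather than through a per-instance code stub, a plain pointer to inner() works, with no executable stack involved.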
I've used lambdas extensively in modern C++. I hate them with a passion.
I've also used OCaml. An awesome language where this stuff is super natural and beautiful.
I don't understand why people want to shoehorn functional programming into C. C++ was terrible already, and is now worse for it.
> we’re going to be focusing on and looking specifically at Closures in C and C++, since this is going to be about trying to work with and – eventually – standardize something for ISO C that works for everyone.
Sigh. My heart sinks.
I've often seen that; I call this curiously recurring clumping pattern "encapsulation". Maybe if I blogged about it, it would catch on.
#include <stdlib.h>
#include <string.h>
#include <stddef.h>
/* `thread_local` is a keyword only in C23; older C needs _Thread_local or <threads.h> */

typedef int (*comp)(const void *untyped_left, const void *untyped_right);

thread_local int in_reverse = 0;

__attribute__((noinline))
int compare_impl(const void *untyped_left, const void *untyped_right, int in_reverse) {
    const int* left = untyped_left;
    const int* right = untyped_right;
    return (in_reverse) ? *right - *left : *left - *right;
}

comp make_sort(int direction) {
    in_reverse = direction;
    int compare(const void *untyped_left, const void *untyped_right) {
        return compare_impl(untyped_left, untyped_right, in_reverse);
    }
    return compare;
}

int main(int argc, char* argv[]) {
    int list[] = { 2, 11, 32, 49, 57, 20, 110, 203 };
    comp normal_sort = make_sort(0);
    comp reverse_sort = make_sort(1);   /* this call also flips in_reverse to 1 */
    qsort(list, (sizeof(list)/sizeof(*list)), sizeof(*list), normal_sort);
    return list[0];
}
Because we create `reverse_sort` between creating `normal_sort` and calling it, we end up with a reverse sort despite clearly asking for a normal sort.

Then it's clearly only half a solution.
The example I gave above should work fine in any language with first-class closures.
#include <stdlib.h>   /* qsort, size_t */

_Thread_local struct {
    void *data;
    int (*compare)(const void *a, const void *b, void *data);
} _qsort2_closure;

static int _qsort2_helper(const void *a, const void *b) {
    return _qsort2_closure.compare(a, b, _qsort2_closure.data);
}

void qsort2(void *base, size_t elements, size_t width,
            int (*compare)(const void *a, const void *b, void *data), void *userData)
{
    _qsort2_closure.data = userData;
    _qsort2_closure.compare = compare;
    qsort(base, elements, width, _qsort2_helper);
}

No I do not. It will be reassigned on the next call.
> But again you are reinventing dynamic scoping
No. I’m not reinventing anything. I’m using the existing feature of thread local variables.
The usage of such is entirely an implementation detail of qsort2 with the exception of recursion.
Dynamic scoping typically refers to defining variables which have scope outside of their call stack. No usage of this API requires it.
Can you just try to learn something new?
Once again, the caller of the API does not declare any variables so there is no dynamic scoping.
It’s confusing to me that thread locals are “not the best idea outside small snippets” meanwhile the top solution is templating on recursion depth with a constexpr limit of 11.
(You could solve that with a manually maintained stack for the context in a thread local, but you'd have to do that case-by-case)
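In the simplest form that "stack" can just be the C call stack: save and restore the previous thread-local closure inside the qsort2 wrapper from above. A sketch, using GNU __typeof__ only because the struct above has no tag:

void qsort2(void *base, size_t elements, size_t width,
            int (*compare)(const void *a, const void *b, void *data),
            void *userData)
{
    __typeof__(_qsort2_closure) saved = _qsort2_closure;   /* "push" */
    _qsort2_closure.data = userData;
    _qsort2_closure.compare = compare;
    qsort(base, elements, width, _qsort2_helper);
    _qsort2_closure = saved;                                /* "pop" */
}

That makes nested or recursive qsort2 calls safe, as long as nothing longjmps out of the comparator.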
I think the times you need to do this are few. And this version is much more prudent.
// imagine my_function takes 3 ints, the first 2 args are captured and curried.
Function<void(int)> my_closure(&my_function, 1, 2);
my_closure(3);
I've never implemented it myself, as I don't use C++ features all that much, but as a pet project I'd like to someday. I wonder how something like that compares!

I have a case where I need to create a static templated lambda to be passed to C as a pointer. Such a thing is impossible in Rust, which I considered at first.
let mut state = 1;
let mut fat_closure = || state += 1;
let (fnptr, userdata) = make_trampoline(&mut &mut fat_closure);
unsafe {
    fnptr(userdata);
}
assert_eq!(state, 2);

use std::ffi::c_void;

fn make_trampoline<C: FnMut()>(closure: &mut &mut C) -> (unsafe fn(*mut c_void), *mut c_void) {
    let fnptr = |userdata: *mut c_void| {
        let closure: *mut &mut C = userdata.cast();
        (unsafe { &mut *closure })()
    };
    (fnptr, closure as *mut _ as *mut c_void)
}
It requires a userdata arg for the C function, since there's no allocation or executable-stack magic to give a unique function pointer to each data instance. OTOH it's zero-cost. The generic make_trampoline inlines the code of the closure, so there's no extra indirection.

This isn't fully accurate. In your example, `&mut C` actually has the same layout as usize. It's not a fat pointer. `C` is a concrete type and essentially just an anonymous struct with FnMut implemented for it.
You’re probably thinking of `&mut dyn FnMut` which is a fat pointer that pairs a pointer to the data with a pointer to a VTable.
So in your specific example, the double indirection is unnecessary.
The following passes miri: https://play.rust-lang.org/?version=nightly&mode=debug&editi...
(did this on mobile, so please excuse any messiness).
Well, capturing closures that are implemented like C++ lambdas or Rust closures anyway. The executable stack crimes do make a thin fn-ptr with state.
Unfortunately a lot of existing C APIs won't have the user arg in the place you need it; it's a mix of first, last, and sometimes even middle.
The only unsafe here is to demonstrate it works with C/C++ FFI (where void* userdata is actually not type safe)
Practically speaking all lambda options except for the one involving allocation (why would you even do that) are equivalent modulo inlining.
In particular, the caveat with the type erasure/helper variants is precisely that it prevents inlining, but given everything is in the same translation unit and isn't runtime-driven, it's still possible for the compiler to devirtualize.
I think it would be more interesting to make measurements when controlling explicitly whether inlining happens or the function type can be deduced statically.
In a simple test I see that GCC 14 has no problem completely removing the overhead of std::function_ref, but plain std::function is a huge mess.
Eventually we will get there [1], but in the meantime I prefer not to rely on devirtualization, and heap elision is more of a party trick.
edit: to compare early vs late inlining: while GCC 14 can remove one layer of function_ref, it seems that it cannot remove two layers, as it apparently doesn't rerun the required passes to take advantage of the new opportunity. It has no problem, of course, removing an arbitrarily large (but finite) number of layers of plain lambdas.
edit2: GCC 15 can remove trivial uses of std::function, but this is very fragile. It still can't remove two layers of function_ref.
[1] For example, 25 years ago compilers were terrible at removing the abstraction overhead of the STL; today there is very little cost.