Allocating on the Stack
84 points
4 hours ago
| 8 comments
| go.dev
| HN
csjh
30 minutes ago
[-]
Optimizations like these are so cool. I love seeing higher level languages take advantage of their high level-ness
reply
nasretdinov
2 hours ago
[-]
Nice to see common and natural patterns to have their performance improved. Theoretically appending to a slice would be possible to handle with just stack growth, but that would require having large gaps between goroutine stacks and mapping them lazily upon access instead of moving goroutines to the new contiguous blocks as it's implemented right now. But given how many questionable changes it requires from runtime it's certainly not going to happen :)
reply
ivanjermakov
2 hours ago
[-]
Having big stack frames is bad for cache locality. Stack is not something magical, it's mapped to the same physical memory as heap and needs to be loaded. Pretty sure such optimization would reduce performance in most cases.
reply
wahern
13 minutes ago
[-]
In the case where you're using the top of the stack as a, well, stack, I don't see the problem. It would only work if you're not interleaving processing of dynamically-sized objects and function codegen works out. It's similar to TCO in the sense of maintaining certain invariants across calls (e.g. no temporaries need be preserved), and actually in languages with TCO, like Lua, you can hack an application-level stack data structure using tail recursion (and coroutines/threads if you need more than one) that can sometimes be more performant or more convenient than using a native data structure.

There's been a least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on-demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping--the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today since Linux 6.13's lightweight guard page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.

reply
HarHarVeryFunny
2 hours ago
[-]
This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().

Of course use cases are limited (variable length buffers/strings, etc) since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e the calling function), but it's super fast since it's just bumping up the stack pointer.

reply
spacechild1
2 hours ago
[-]
alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.

The obvious issue is that you can't know how much space is left on the stack, so you basically have to guess and pick an arbitrary "safe" size limit. This gets even more tricky when functions may be called recursively.

The more subtle issue is that the stack memory returned by alloca() has function scope and therefore you must never call it directly in a loop.

I use alloca() on a regular basis, but I have to say there are safer and better alternatives, depending on the particular use case: arena/frame allocators, threadlocal pseudo-stacks, static vectors, small vector optimizations, etc.

reply
12_throw_away
1 hour ago
[-]
> The obvious issue is that you can't know how much space is left on the stack [...]

Oh, huh. I've never actually tried it, but I always assumed it would be possible to calculate this, at least for a given OS / arch. You just need 3 quantities, right? `remaining_stack_space = $stack_address - $rsp - $system_stack_size`.

But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?

reply
chuckadams
1 hour ago
[-]
If your API includes inline assembly, then it's trivial. Go's internals would need it to swap stacks like it does. But I doubt any of that is exposed at the language level.
reply
fluntcaps
38 minutes ago
[-]
You can do something like:

    void *get_sp(void) {
        volatile char c;
        return (void *)&c;
    }
Or, in GCC and Clang:

    void *get_sp(void) {
        return __builtin_frame_address(0);
    }
Which gets you close enough.
reply
Joker_vD
1 hour ago
[-]
> $system_stack_size

Does such thing even exist? And non-64 bit platforms the address space is small enough that with several threads of execution you may just be unable to grow your stack even up to $system_stack_size because it'd bump into something else.

reply
masklinn
1 hour ago
[-]
> Does such thing even exist?

AFAIK no. There are default stack sizes, but they're just that, defaults, and they can vary on the same system: main thread stacks are generally 8MiB (except for Windows where it's just 1) but the size of ancillary stacks is much smaller everywhere but on linux using glibc.

It should be possible to get the stack root and size using `pthread_getattr_np`, but I don't know if there's anyone bothering with that, and it's a glibc extension.

reply
MarkSweep
55 minutes ago
[-]
.NET bothers with it, to support RuntimeHelpers.EnsureSufficientExecutionStack [1] and other things. See the pthreads calls used to here [2].

[1]: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

[2]: https://github.com/dotnet/runtime/blob/b6a3e784f0bb418fd2fa7...

reply
norir
1 hour ago
[-]
If you have well defined boundaries, you can move the stack to an arbitrarily large chunk of memory before the recursive call and restore it to the system stack upon completion.
reply
chuckadams
1 hour ago
[-]
And if you never do reach completion, you can just garbage collect that chunk. AKA "Cheney on the MTA": https://dl.acm.org/doi/10.1145/214448.214454
reply
anematode
2 hours ago
[-]
If you're not doing recursion, I prefer using an appropriately sized thread_local buffer in this scenario. Saves you the allocation and does the bookkeeping of having one per thread
reply
dzdt
42 minutes ago
[-]
For purely historical reasons the C/C++ stack is "small" with exactly how small being outside of programmer control. So you have to avoid using the stack even if it would be the better solution. Otherwise you risk your program crashing/failing with stack overflow errors.
reply
csjh
33 minutes ago
[-]
What do you mean outside of programmer control? What's stopping you from setting the stack size in the linker flags?
reply
HarHarVeryFunny
12 minutes ago
[-]
With Linux the stack size is a process limit, set with ulimit (default 8MB?). You can even set it to unlimited if you want, meaning that essentially (but not quite) the stack and heap grow towards each other only limited by the size of the address space.

ulimit only affects the main program stack though. if you are using multi-threading then there is a per-thread stack limit, which you can configure with pthreads, but not until C++23 for std::thread.

reply
rwmj
2 hours ago
[-]
Most C compilers let you use variable length arrays on the stack. However they're problematic and mature code bases usually disable this (-Werror -Wvla) because if the size is derived from user input then it's exploitable.
reply
ozgrakkurt
2 hours ago
[-]
This is more of a patch/hack solution as far as I can understand.

You can just as well pass a heap allocated buffer + size around and allocate by incrementing/decrementing size.

Or even better use something like zig's FixedSizeAllocator.

Correct me if I am wrong please

reply
HarHarVeryFunny
2 hours ago
[-]
I wouldn't call it a hack, but it's not a general alternative for memory allocated on the heap since the lifetime is tied to that of the allocating function.

I think what you're referring to is an arena allocator where you allocate a big chunk of memory from the heap, then sequentially sub-allocate from that, then eventually free the entire heap chunk (arena) in one go. Arena allocators are therefore also special use case since they are for when all the sub-allocations have the same (but arbitrary) lifetime, or at least you're willing to defer deallocation of everything to the same time.

So, heap, arena and stack allocation all serve different purposes, although you can just use heap for everything if memory allocation isn't a performance issue for your program, which nowadays is typically the case.

Back in the day when memory was scarce and computers were much slower, another common technique was to keep a reuse "free list" of allocated items of a given type/size, which was faster than heap allocate and free/coalesce, and avoided the heap fragmentation of random malloc/frees.

reply
stackghost
2 hours ago
[-]
alloca()'s availability and correctness/bugginess is platform dependent, so it probably sees only niche usage since it's not portable. Furthermore, even its man page discourages its use in the general case:

>The alloca() function is machine- and compiler-dependent. Because it allocates from the stack, it's faster than malloc(3) and free(3). In certain cases, it can also simplify memory deallocation in applications that use longjmp(3) or siglongjmp(3). Otherwise, its use is discouraged.

Furthermore:

>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.

https://man7.org/linux/man-pages/man3/alloca.3.html

reply
lstodd
1 hour ago
[-]
It becames super slow when you bump that pointer into a page that's missing from the TLB.
reply
HarHarVeryFunny
1 hour ago
[-]
A TLB miss could happen when executing the next statement in your program. It's not something you have a lot of control over, and doesn't change the fact that allocating from the stack (when an option) is going to be faster than allocating from the heap.
reply
lstodd
52 minutes ago
[-]
So you don't allocate left and right, be it stack or heap.

It's all useless though unless you control the hardware. If you don't, you might as well prlimit --stack=unlimited and have at it.

reply
anematode
2 hours ago
[-]
Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.
reply
tptacek
2 hours ago
[-]
Yep. `go build -pgo=foo.pprof`

https://go.dev/doc/pgo

reply
bertylicious
2 hours ago
[-]
Nice! That's (seems) so simple yet also so very effective. Shouldn't other memory-managed languages be able to profit from this as well?
reply
lionkor
1 hour ago
[-]
C# has `stackalloc`
reply
bertylicious
1 hour ago
[-]
But that requires an explicit declaration and isn't done automatically under the hood, or am I missing something?
reply
Smaug123
39 minutes ago
[-]
The JIT does this automatically in some cases as of .NET 10 (https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotn...).
reply
lstodd
1 hour ago
[-]
I read that as "Allocating on the Slack" and immediately came up with three ways how to do that.
reply
zabzonk
2 hours ago
[-]
alloca() is not part of the C++ standard, and I can't imagine how it could used safely in a C++ environment
reply
mwkaufma
1 hour ago
[-]
If I had a nickel for every article about avoiding implicit boxing in gc-heap languages...
reply
adonovan
51 minutes ago
[-]
...you would have the same balance as before, because this is not an article about implicit boxing. ;-)
reply