They had a nice little optimization where word-sized values were stored in-place in the interface value rather than behind a pointer out to a value. With the precise gc, they had to change to storing only pointers, which means small values now get heap-allocated.
I don't know if they've done work to recover (or perhaps better put: had success recovering) the performance lost to the extra allocation & gc load.
On the flip side, my experience is that they've made the gc pretty unobtrusive with regards to latency and pauses. Or perhaps I'm just not stressing it as much as I did in the past.
Random anecdote on gc tuning: I was once dealing with a Go process that sped up (higher throughput, lower latency) by an alarming amount when we limited the max number of processors. This was many years ago, and I couldn't say which version of Go. It was after you no longer had to set GOMAXPROCS explicitly, but probably not very long after.
Performance tuning is crazy sometimes.
One very common thing I've noticed is that people new to the language overuse/abuse goroutines and channels. I completely get why: they're two of the selling points you learn about early on. But it's really easy to make things go wrong when you spam them everywhere.
It’s both weird and beautiful because getting an end-to-end understanding is instrumental, so you often have to go look at many marginal things that might actually play a significant role.
I can't tell you the number of times I've fixed code like this:

    matches = []
    for (first : items) {
        for (second : items) {
            if (first.name == second.name)
                matches.add(first);
        }
    }
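The usual fix is a single pass with a map; here's a minimal sketch in Go (the Item type and Name field are stand-ins for whatever the real code used, and I'm assuming the intent was to collect items whose name occurs more than once):

    package main

    import "fmt"

    // Item is a stand-in for whatever type the real code used.
    type Item struct{ Name string }

    // findMatches does one counting pass plus one collection pass: O(n)
    // instead of the O(n^2) nested loop above.
    func findMatches(items []Item) []Item {
        counts := make(map[string]int, len(items))
        for _, it := range items {
            counts[it.Name]++
        }
        var matches []Item
        for _, it := range items {
            if counts[it.Name] > 1 { // name appears on some other item too
                matches = append(matches, it)
            }
        }
        return matches
    }

    func main() {
        fmt.Println(findMatches([]Item{{"a"}, {"b"}, {"a"}})) // [{a} {a}]
    }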
The quadratic version is very frequently a bright red spot in most profiler output.

Almost certainly true; I can only speak of my own experiences.
> Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.
I have to disagree with you on this. Sampling profilers are good at finding out exactly which methods are eating performance. In fact, if anything they have a tendency to push you towards looking at single methods for problems rather than moving up a layer or two to see the big picture (it's why flame graphs are so important in profiling).
I have, for example, seen plenty of times where the profiler indicated that double math was the root cause of problems yet popping a few layers up the stack revealed (sometimes non-obviously) that there was n^2 behavior going on.
I assume those NOPs you mention exist for alignment padding. Clang and GCC let you configure the alignment padding of any or all functions, and Clang lets you explicitly align any for-loop anywhere you want with `[[clang::code_align]]`.
Why? Many dynamic languages use a tagged pointer representation which isn't incompatible with precise garbage collectors. Couldn't Go do the same?
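For illustration, a minimal sketch of the tagged-word idea (hypothetical; this is not how Go actually represents interface values):

    package main

    // A tagged word: the low bit says whether the word is an inlined small
    // integer (1) or a real pointer (0) that the collector must trace.
    type word uintptr

    func makeInt(i int) word    { return word(i)<<1 | 1 }
    func isPointer(w word) bool { return w&1 == 0 }
    func intValue(w word) int   { return int(w) >> 1 }

    func main() {
        w := makeInt(21)
        if !isPointer(w) {
            println(intValue(w)) // 21
        }
    }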
The release notes[0] at the time stated:
> The implementation of interface values has been modified. In earlier releases, the interface contained a word that was either a pointer or a one-word scalar value, depending on the type of the concrete object stored. This implementation was problematical for the garbage collector, so as of 1.4 interface values always hold a pointer. In running programs, most interface values were pointers anyway, so the effect is minimal, but programs that store integers (for example) in interfaces will see more allocations.
@rsc wrote in some detail[1] about the initial interface layout on his blog.
I couldn't find a detailed explanation of how the optimization interacted with the gc, but my understanding is that the collector couldn't discern between pointer and integer values.
[0] https://go.dev/doc/go1.4#runtime
[1] https://research.swtch.com/interfaces
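If you want to observe the post-1.4 behavior yourself, here's a small probe (hedged: the runtime caches boxed values 0-255 and escape analysis can change the outcome, so the count isn't guaranteed):

    package main

    import "testing"

    var sink interface{}

    func main() {
        x := 12345 // values 0-255 hit a runtime cache, so pick a bigger one
        n := testing.AllocsPerRun(100, func() {
            sink = x // boxing the int into an interface allocates
        })
        println(int(n)) // typically prints 1 on Go 1.4+
    }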
The bigger issue IMO is and has been that fat pointers (strings, slices) can't be inlined in interface values. There's definitely some benefit to inlining integers and floats, but the indirection that comes from not inlining them isn't that significant compared to the double-indirection that comes from not inlining strings and slices. There was some discussion on the mailing list IIRC about expanding the interface to 3 words wide (to hold strings at least) or 4 words wide (to hold slices too), but this was rejected as going too far the other way (at the time).
Interestingly, the log/slog package does inline string values at least [1], demonstrating how such a thing can be done when needed, albeit with a fair amount of complexity.
[1]: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/...
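For a rough idea of the trick, here's a sketch (not slog's actual code; requires Go 1.20+ for unsafe.StringData): keep the string's data pointer in a pointer-typed field the GC can see, and the length in a plain integer field, so no string header gets allocated.

    package main

    import "unsafe"

    // value mimics the idea: the data pointer lives in a field the GC
    // traces, while the length shares an untyped integer payload field.
    type value struct {
        num uint64         // string length (or an int/float payload)
        ptr unsafe.Pointer // string data pointer
    }

    func stringValue(s string) value {
        return value{num: uint64(len(s)), ptr: unsafe.Pointer(unsafe.StringData(s))}
    }

    func (v value) str() string {
        return unsafe.String((*byte)(v.ptr), int(v.num))
    }

    func main() {
        v := stringValue("hello")
        println(v.str()) // hello
    }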
I don’t know Go. Is that the problem we are seeing here?
One of the realtime GC solutions that stuck with me is amortized GC, which might be appropriate to Go.
Instead of moving dead objects to the free list immediately, you "just" need to stay ahead of allocation. You can accomplish that by freeing memory every time you allocate memory - not a full GC, just finishing the free of a handful of dead objects.
That puts an upper bound on per-allocation time, though with no guarantee of how quickly dead memory becomes reusable.
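A toy sketch of the idea (hypothetical free-list allocator; the constant and names are made up):

    package main

    // A free-list allocator that amortizes reclamation: each allocation
    // first frees a handful of already-dead objects, so reclamation work
    // is spread across allocations instead of done in one big pause.
    type object struct{ next *object }

    type heap struct {
        free *object // ready for reuse
        dead *object // identified as unreachable, not yet reclaimed
    }

    const reclaimPerAlloc = 4

    func (h *heap) alloc() *object {
        for i := 0; i < reclaimPerAlloc && h.dead != nil; i++ {
            d := h.dead
            h.dead = d.next
            d.next = h.free
            h.free = d
        }
        if h.free == nil {
            return new(object) // grow instead of pausing
        }
        o := h.free
        h.free = o.next
        o.next = nil
        return o
    }

    func main() {
        var h heap
        _ = h.alloc()
    }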
Precise means that on a GC cycle, only true pointers to heap allocated objects are identified as roots, which means that an unreferenced object cannot be kept alive by a value that happens to have a similar bit pattern as a heap pointer. This guarantees that unreferenced objects are eventually cleaned up, but not that this cleanup happens immediately, certainly not as part of a busy loop. (Unless the system runs out of memory, but in that case, a GC iteration is required anyway.)
GC is a large field, it has its own terminology, and instead of asking "why are we inventing terminology", I'd ask "who we are to change the existing terminology".
I wonder if there's anything that automatically defers collection within a loop and collects the garbage after the loop is exited. Something like Obj-C's old @autoreleasepool but inserted automatically. Static analysis has gotten so fancy that that might not be a terrible idea once you work out all the nuances (e.g. what if the loop is too short, nested loops, what happens when your entire program is a loop like game loops, etc). Could be the best of both worlds.
But generally I think it turns out that in refcounted GC systems you just learn not to allocate/free in your hot loop, or, if you do allocate in a loop, it's probably not what the loop's cost is dominated by.
The odd thing about GenGC is that it punishes you for trying to recycle old objects yourself. You create much more pressure to scan the old generation by doing so, which is expensive. If you have the available memory it’s often better to build a new object and swap it for the retained reference to the old one at the end when you’re done (poor man’s MVCC as well if the switch takes a couple context switches to finish)
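A sketch of that build-new-and-swap pattern (Go syntax for illustration only; all names hypothetical, and Go's own GC isn't generational):

    package main

    import "sync/atomic"

    // Instead of mutating a long-lived object in place (which dirties the
    // old generation and increases scanning pressure in a generational GC),
    // build a fresh object and atomically swap the retained reference.
    type config struct {
        entries map[string]string
    }

    var current atomic.Pointer[config]

    func update(key, val string) {
        old := current.Load()
        fresh := &config{entries: make(map[string]string, len(old.entries)+1)}
        for k, v := range old.entries {
            fresh.entries[k] = v
        }
        fresh.entries[key] = val
        current.Store(fresh) // the swap; the old object becomes garbage
    }

    func main() {
        current.Store(&config{entries: map[string]string{}})
        update("a", "1")
        println(current.Load().entries["a"])
    }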
You have code like this:
    function(make_infinite_lazy_list());

where function walks down the list:

    fun function(list)
    {
        while (list) { // nil is false
            ...
            list = cdr(list);
        }
    }
The problem is that the parent stack frame contains a spurious copy of the original return value from make_infinite_lazy_list, and so as function marches down the list, the discarded prefix of the list does not become garbage as expected.

This is the showstopper for conservative GC. Not the bogeyman of a stray machine integer suddenly looking exactly like a heap pointer.
Stick a conservative scanner under C, make yourself some lazy lists, and this problem will easily reproduce!
This is not necessarily accurate with true precise tracking. E.g.:

    using System.Runtime.CompilerServices;

    Example();

    // Skip Tier 0 compilation, which does not track gcrefs as precisely
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static void Example() {
        var obj = new object();
        var wr = new WeakReference(obj);
        for (var i = 0; i < 3; i++) {
            Console.WriteLine(obj);
        }
        GC.Collect();
        Console.WriteLine(wr.IsAlive);
    }
This works in much more advanced scenarios too.

Unfortunately, I can't link a simple document that covers this in detail off the top of my head, but there's a wealth of information in Konrad Kokosa's works:
.NET GC internals lectures: https://www.youtube.com/watch?v=8i1Nv7wGsjk&list=PLpUkQYy-K8...
Pro .NET Memory Management (which is a great book in general): https://prodotnetmemory.com/
The number of trackable lifetimes was 64 in .NET Framework but has been steadily increased in modern .NET and is now 1024, so it's rarely a capacity issue; but there are cases where we can't effectively reason about lifetimes.
For us another big drawback to conservative scanning is that any object referred to by a conservative reference cannot be relocated, since the reference might be live and is not guaranteed to be a GC reference; these objects are (in our parlance) effectively pinned, and this causes additional overhead.
I knew about the 1000 (turns out 1024) limit for method locals; in hindsight it does make sense for it to apply to gcref tracking just as much...
In addition, there are plenty of cases where memory usage is unbounded or excessive, not because of allocator behavior, but because of programming mistakes. In fact, memory can sometimes blow up just because of large user inputs and very few systems are prepared for properly handling OOM conditions that happen legitimately.
Case in point: Both CRuby and Apple's JavaScriptCore have garbage collectors that use conservative stack scanning and are widely used in production systems without the world ending.
That said, you're probably not going to use conservative stack scanning because of collection speed alone. There are other trade-offs between conservative stack scanning and precise stack scanning that weigh more heavily.
I'll add the caveat that I would be very cautious about using a conservative GC on 32-bit systems, but on 64-bit systems this is about as much of a concern as memory fragmentation.
After long years of finding problems and trying to encourage, pressure, bribe, cajole and finally yell at people about said problems, my professional opinion is that people aren’t very observant, and either do not see problems or pretend not to see them.
They just reboot the process every 48 hours and walk away.
And sadly continuous deployment is making this worse. The collective We have problems that only show up on three or four day weekends, right when you don’t want to be on call.
It is not sexy, and it has other issues, but it has worked for a long time.
Another possibility would be two adjacent 32-bit ints merging. For a 2GB heap, this'd require the high half hitting one of the two valid values (which fall in 1≤x≤32767, a reasonably frequent range for general-purpose numbers), while the bottom half can be anything. Whether such packing can happen on-stack depends on codegen (and, to a lesser extent, the program, as one can just do "12345L<<32" and get a thing that has the 2-in-32767 (0.006%) chance of hitting the heap). But even then, fitting a million of these on the stack only gives ~61 potential false roots, and for that to be significant, some of those must hit massive objects (which brings back some one-in-a-billion factor unless there are large cyclic object subgraphs).
That said, this new generation of conservatively scanning only the stack has a significant advantage: the stack has remained pretty small (8MB is still the standard stack size), so if you have precise heap scanning, the odds of spurious retention from conservative scanning go down a ton.
> Also, Apple’s JavaScriptCore uses conservative stack scanning, and V8 is looking at switching to it. Funny, right?
If your use case requires a minimum bound, use another algorithm.
The only way for the total amount of leaked memory to grow is by replacing something else on the stack, which releases whatever was there before, and, assuming there's some maximum stack size, there's a finite amount of such space.
End result being that, even if you have some code path that generates an integer that looks like a reference, it can only leak one object (which could be massive perhaps, but with generational GC or similar there's no guarantee of all garbage being GC'd at any single GC anyway); if codegen clears stack slots upon their lifetime ending, you need one such integer-holding frame to be on the stack for every object you want to be leaked.
Luckily sorting is something you can easily choose another implementation of, if the default one doesn't fit your use-case, unlike the GC built into the single language implementation that your customer uses.
Branch misprediction is a permanent loss. You will never get those nanoseconds back.
Space inefficiencies require you to install more memory.
Go is supposedly a counterexample. I haven’t used it enough to offer an opinion, but I have heard of companies rewriting Go services simply to avoid GC at runtime.
IME, the most common GC problems I see are:
- incredibly basic settings are wrong, like the max heap size
- the app was written with some very poor behavior, e.g. unreal amounts of allocation or tons of allocation thrashing in tenured heap.
There aren't even a lot of GC knobs to "manage" anymore compared to a decade ago [0]. Don't treat your GC as a black-box machine with infinitely available resources, and you'll be fine.

[0] https://docs.oracle.com/en/java/javase/22/gctuning/ergonomic...
An extremely simplified description of how OpenJDK's GCs work is like this: every allocation bumps the "heap end" pointer, and when memory runs out, the GC scans the objects reachable from some set of "roots", resets the allocation pointer to the start of the heap, and copies the live objects to the allocation pointer, bumping it along the way (and in the process, overwriting the dead objects). In practice, the scanning and the compaction may be done concurrently with the application, there is not just one allocation pointer (to avoid contention), and not the entire object graph needs to be scanned to free memory.
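A toy version of just the bump-allocation part (hypothetical, not OpenJDK code):

    package main

    // A bump allocator over a fixed arena, illustrating why allocation in a
    // compacting GC is just a pointer increment.
    type arena struct {
        buf  []byte
        next int // the "heap end" pointer
    }

    func (a *arena) alloc(size int) []byte {
        if a.next+size > len(a.buf) {
            return nil // out of memory: time to trace roots and compact
        }
        p := a.buf[a.next : a.next+size]
        a.next += size // bump
        return p
    }

    func main() {
        a := &arena{buf: make([]byte, 1<<20)}
        _ = a.alloc(64)
    }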
The cost of Java's good GC is virtually entirely in memory footprint (there may be some speed costs to GC in some cases, but they're offset by GC's speed benefits).[1]
[1]: Java may indeed be slower than C on some workloads due to memory layout, but that's because of data locality, not GC. The upcoming inlined value objects will take care of that.
Precision gives you a lot more flexibility in moving data. Conservative scanning is way more pleasant for embedding (you don't have to go through a ritual to register every root). The difference in what gets collected rarely seems to matter much in practice, though the determinism can matter, especially for intermittently failing tests.
There's a lot of path dependence in choosing one over the other. It's easy to go from precise to conservative, but very hard to go the other way. In practice, that means you tend to stick with whatever you've got: if you have a working precise collector, you're loath to loosen it up because you might get stuck. I was heavily involved in moving the SpiderMonkey collector from conservative to precise, and I wouldn't like to lose that work, and I'd really hate to have to do it again! And yet, the maintenance cost of keeping the precise collector correct is not low.
I suspect a lot of the argument for precision boils down to: bump allocators go brrr. With a precise scanner, you can bump allocate into a young generation, move everything out during a minor GC without regard for conservative pinning, and then do it over again. No looking at allocation bitmaps or free lists or whatever. Easily JITted (at least for the basic allocation), good locality, simple code.
A more tenuous guess is that a lot of the argument for being conservative boils down to ease of embedding. You mostly don't have to think about registering your stack roots, you can just allocate stuff and call things.
Edit: ah, he says to assume the heap is precisely traced (somehow?), so I guess it would already be known what references are in there.
To do this chasing precisely you need some metadata, saying which fields are pointers vs other data like ints. For OOP languages this metadata is stored with the vtables for objects. For the stack you need similar metadata, at the very least how many pointers there are if you put them first in the stackframe.
Not having this metadata and conservatively treating random ints as pointers isn't always catastrophic, but it has some drawbacks. Two big ones:
- a moving GC is tough, you have to update pointers after moving an object but can't risk updating random ints that happen to have that value
- You do more work during GC chasing random non-pointers, and you free less memory, meaning more GC runs
Generating this precise GC metadata for stack frames is sort-of easy. You need specific safe points where all data is at known locations for the GC to work anyway, usually by spilling registers to the stack. These GC checkpoints usually coincide with heap allocation, which is why long non-allocating loops can block stop-the-world GC and send other threads into a spin-lock in many GC implementations.
Maybe a non-precise GC could treat registers as possible pointers and skip spilling to the stack at safe points? There are alternatives to spilling, like a precise stack map for each program instruction (so every instruction is a safe point), but those are expensive to process and usually only used for debugging or exception handling, not something frequent like GC.
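To make the metadata idea concrete, a toy sketch (layout entirely hypothetical): per safe point, a bitmap says which stack slots hold pointers, and the collector visits only those slots.

    package main

    // Per safe point, a bitmap records which stack slots hold pointers;
    // integers are never mistaken for references.
    type stackMap struct {
        slots   int    // number of slots in the frame
        ptrBits uint64 // bit i set => slot i holds a pointer
    }

    func scanFrame(frame []uintptr, m stackMap, visit func(uintptr)) {
        for i := 0; i < m.slots; i++ {
            if m.ptrBits&(1<<uint(i)) != 0 {
                visit(frame[i])
            }
        }
    }

    func main() {
        frame := []uintptr{0xdeadbeef, 42, 0xcafebabe}
        m := stackMap{slots: 3, ptrBits: 0b101} // slots 0 and 2 are pointers
        scanFrame(frame, m, func(p uintptr) { println(p) })
    }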
But usually you want the objects falsely retained by ints which look like ptrs to be freed earlier, not kept around. It's rare, but when it happens it's a memory hog. Especially on long-running processes I would only do precise GC. Those simple bit shifts and masks cost almost nothing today.
For C interop you need to track pointers anyway separately.
Precision is sometimes useless due to time and computing resources constraints. Speed can come at the cost of poor results.
I think the answer depends upon the specific use case and environment.
One day a student came to Moon and said, “I understand how to make a better garbage collector...”
Reading the conclusion, I don't know if anybody working on actual top-tier GCs (Java, JavaScript, C#) finds this useful. It strikes me more as a somewhat interesting factoid, as demonstrated by a paper.
The conclusion:
> When it comes to designing a system with GC, don’t count out conservative stack scanning; the tradeoffs don’t obviously go one way or the other, and conservative scanning might be the right engineering choice for your system.
If there would be examples of how this relates to actual GCs in production and compares them, now that would be interesting.
I would venture that, as a bare minimum, the intended audience is people who understand and care about the subject. That includes me; your mileage may vary.
    diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
    index a2b6b979c1..d2f2852294 100644
    --- a/src/runtime/mgc.go
    +++ b/src/runtime/mgc.go
    @@ -144,7 +144,7 @@ const (
     	// debugScanConservative enables debug logging for stack
     	// frames that are scanned conservatively.
    -	debugScanConservative = false
    +	debugScanConservative = true
     	// sweepMinHeapDistance is a lower bound on the heap distance
     	// (in bytes) reserved for concurrent sweeping between GC
And observe that Go scans a few stack frames conservatively. I think it only does this if a stack frame is preempted; most frames are scanned precisely.

Someone reading your sentence might be at risk of conflating a minority position with a fringe one. Clearly this isn't the case here.
But fringe isn’t far off. Among those who write GCs professionally, I was definitely on the fringe as an advocate for conservative-on-the-stack.
(I wrote most of JSC’s GC and I was one of the folks pushing for it to remain conservative at a time when V8 was accurate and SpiderMonkey transitioned to being accurate. I like that folks are now acknowledging the good choice we made in JSC, but it was a fringe/minority choice at the time for sure, and it’s still a minority choice among high performance GCs.)
It goes to show that popularity is a poor measure of technical merit, and the fringe doesn't always stay that way.
The literature uses “Accurate GC”, “Precise GC”, and “Exact GC” interchangeably.
Famous paper on this that uses “accurate”: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
Paper that calls it “exact”: https://dl.acm.org/doi/10.1145/2660193.2660198