Conservative GC can be faster than precise GC
144 points | 9 days ago | 12 comments | wingolog.org | HN
chrsig
9 days ago
[-]
It was a bit of a bummer when Go switched from the conservative GC to a precise GC. One of the implications was that they needed to change how interface types were represented.

They had a nice little optimization for word-sized values: store them in-place rather than as a pointer out to a value. With the precise GC, they had to change to storing only pointers, which meant heap-allocating small values.

I don't know if they've since done the work to (or, perhaps better put, had success in) winning back the performance lost to the extra allocation & GC load.

On the flip side, my experience is that they've made the GC pretty unobtrusive with regard to latency and pauses. Or perhaps I'm just not stressing it as much as I had in the past.

Random anecdote on GC tuning: I was once dealing with a Go process that sped up (higher throughput, lower latency) by an alarming amount when we limited the max processors. This was many years ago, and I wouldn't be able to say what version of Go. It was after you no longer had to set GOMAXPROCS, but probably not very long after.

Performance tuning is crazy sometimes.
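
(In today's Go that experiment is a one-liner; a minimal sketch, where the value 4 is arbitrary:)

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Cap the number of OS threads executing Go code; the same effect is
        // available without a rebuild via the GOMAXPROCS environment variable.
        prev := runtime.GOMAXPROCS(4) // returns the previous setting
        fmt.Println("was", prev, "now", runtime.GOMAXPROCS(0)) // an argument of 0 only queries
    }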

reply
silisili
9 days ago
[-]
It's always important to know your bottlenecks before tuning. If you're disk io bound, throwing 100 goroutines at it just compounds the problem, for example.

One very common thing I've noticed is that people new to the language overuse/abuse goroutines and channels. I completely get why, they are two selling points you learn about early on. But it's really easy to make things go wrong when spamming them everywhere.
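
(To make that concrete, a minimal sketch of the usual remedy: bound the concurrency with a buffered channel used as a semaphore instead of spawning a goroutine per item. The limit of 2 and the fake work are placeholders.)

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        jobs := []int{1, 2, 3, 4, 5, 6, 7, 8}
        sem := make(chan struct{}, 2) // at most 2 workers hit the disk at once
        var wg sync.WaitGroup
        for _, j := range jobs {
            wg.Add(1)
            sem <- struct{}{} // acquire a slot
            go func(j int) {
                defer wg.Done()
                defer func() { <-sem }()     // release the slot
                fmt.Println("processing", j) // stand-in for the I/O-bound work
            }(j)
        }
        wg.Wait()
    }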

reply
znpy
9 days ago
[-]
Performance tuning is still largely a dark art from what I see, having dabbled in the space.

It’s both weird and beautiful because getting an end-to-end understanding is instrumental, so you often have to go look at many marginal things that might actually play a significant role.

reply
bqmjjx0kac
9 days ago
[-]
Disappointingly, it's a dark art often because the CPU is a black box. Intel X86 chips translate the instructions you give them to some internal microcode and then execute them speculatively, out of order, etc. I'm still mystified by the performance gains afforded by randomly inserting NOP instructions.
reply
cogman10
9 days ago
[-]
Fortunately, at least in my experience, the variability that CPUs introduce (which does matter in many contexts) isn't often the source of slowness. More often, plain old algorithmic complexity is the culprit, and fixing it goes a long way toward making stuff faster.

I can't tell you the number of times I've fixed code like this

    matches = []
    for (first : items) {
      for (second : items) {
        if (first.name == second.name)
          matches.add(first);
      }
    }
Very frequently a bright red spot in most profiler output.
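
(The usual shape of the fix, sketched in Go: index the names once and do O(1) lookups instead of the quadratic inner loop. Note the original also pairs each item with itself; the sketch below treats a match as "another item shares the name".)

    package main

    import "fmt"

    type item struct{ name string }

    func findMatches(items []item) []item {
        counts := make(map[string]int, len(items)) // one pass to index the names
        for _, it := range items {
            counts[it.name]++
        }
        var matches []item
        for _, it := range items {
            if counts[it.name] > 1 { // some other item shares this name
                matches = append(matches, it)
            }
        }
        return matches
    }

    func main() {
        fmt.Println(findMatches([]item{{"a"}, {"b"}, {"a"}})) // [{a} {a}]
    }
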
reply
marginalia_nu
9 days ago
[-]
I think there's an element of selection bias in that observation. Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.
reply
cogman10
9 days ago
[-]
> I think there's an element of selection bias in that observation.

Almost certainly true, I can only speak of my own experiences.

> Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.

I have to disagree with you on this. Sampling profilers are good at finding out exactly what methods are eating performance. In fact, if anything they have a tendency to push you towards looking at single methods for problems rather than moving up a layer or two to see the big picture (it's why flame graphs are so important in profiling).

I have, for example, seen plenty of times where the profiler indicated that double math was the root cause of problems yet popping a few layers up the stack revealed (sometimes non-obviously) that there was n^2 behavior going on.

reply
10000truths
9 days ago
[-]
There is a plethora of information regarding instruction timings, throughput/latency, execution port usage, etc. that compilers make liberal use of for optimization purposes. You could, in theory, also use that information to establish an upper bound on how long a series of instructions would take to execute. The problem lies in the difference in magnitude between average case and worst case, due to dynamic execution state like CPU cache, branch prediction, kernel scheduling, and so on.
reply
Mathnerd314
8 days ago
[-]
There is uiCA, which achieves an error of about 1% relative to actual measurements of basic-block throughput across a wide range of microarchitectures, and FACILE, which is similar. I don't know of any compilers using these more accurate models, but it is certainly possible.
reply
Conscat
9 days ago
[-]
Intel also provides VTune to annotate sequenced instructions and profile the microseconds and power consumption down to the level of individual instructions.

I assume those NOPs you mention exist for alignment padding. Clang and GCC let you configure the alignment padding of any or all functions, and Clang lets you explicitly align any for-loop anywhere you want with `[[clang::code_align]]`.

reply
pjmlp
9 days ago
[-]
That is why tooling like VTune exist.
reply
matheusmoreira
8 days ago
[-]
> With the precise gc, they had to make the change to only storing pointers, leading to allocating small values.

Why? Many dynamic languages use a tagged pointer representation which isn't incompatible with precise garbage collectors. Couldn't Go do the same?

reply
skybrian
8 days ago
[-]
Perhaps because Go has pointers to the middle of records and arrays? (And records of arrays, etc.) It's an unusual language feature.
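
(A small illustration of what such interior pointers look like; the GC has to map the pointer back to the enclosing allocation and keep all of it alive.)

    package main

    import "fmt"

    type record struct {
        header int
        body   [4]int
    }

    func main() {
        r := &record{}
        p := &r.body[2] // interior pointer into the middle of the struct
        r = nil         // drop the reference to the record as a whole
        *p = 42         // p alone keeps the entire record alive, not just one int
        fmt.Println(*p)
    }
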
reply
kbolino
8 days ago
[-]
This, plus Go pointers are real memory addresses, a property which is guaranteed by the runtime for unsafe pointer operations and C interop.
reply
soegaard
9 days ago
[-]
> Do you have more details (or a reference)?
reply
chrsig
9 days ago
[-]
It took me a bit to hunt down, but it was part of the Go 1.4 release in 2014.

the release notes[0] at the time stated

> The implementation of interface values has been modified. In earlier releases, the interface contained a word that was either a pointer or a one-word scalar value, depending on the type of the concrete object stored. This implementation was problematical for the garbage collector, so as of 1.4 interface values always hold a pointer. In running programs, most interface values were pointers anyway, so the effect is minimal, but programs that store integers (for example) in interfaces will see more allocations.

@rsc wrote in some detail[1] about the initial layout on his blog.

I couldn't find a detailed explanation of how the optimization interacted with the GC, but my understanding is that it couldn't discern between pointer and integer values.

[0] https://go.dev/doc/go1.4#runtime
[1] https://research.swtch.com/interfaces
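
(The extra allocations mentioned in the release note are easy to observe; a minimal sketch, assuming a recent Go toolchain. Note that small integers 0–255 are served from a runtime cache and would not allocate.)

    package main

    import (
        "fmt"
        "testing"
    )

    var sink interface{} // package-level sink so the conversion can't be optimized away

    func main() {
        n := 12345 // >255, so it isn't covered by the runtime's small-int cache
        allocs := testing.AllocsPerRun(1000, func() {
            sink = n // since Go 1.4 the interface always holds a pointer, so n is boxed on the heap
        })
        fmt.Println("allocations per conversion:", allocs) // typically prints 1
    }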

reply
kbolino
8 days ago
[-]
I can't imagine it was impossible to keep the inline-value optimization; after all, the full type information is always right there in the other word of the interface value, but I think it would have been too complicated/expensive for too little benefit.

The bigger issue IMO is and has been that fat pointers (strings, slices) can't be inlined in interface values. There's definitely some benefit to inlining integers and floats, but the indirection that comes from not inlining them isn't that significant compared to the double-indirection that comes from not inlining strings and slices. There was some discussion on the mailing list IIRC about expanding the interface to 3 words wide (to hold strings at least) or 4 words wide (to hold slices too), but this was rejected as going too far the other way (at the time).

Interestingly, the log/slog package does inline string values at least [1], and demonstrates how such a thing can be done when needed, albeit with a fair deal of complexity.

[1]: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/...
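
(A minimal sketch of the idea, explicitly not the real log/slog code: keep the string header's two words in the value itself instead of boxing the string in an interface. Requires Go 1.20+ for unsafe.String/StringData.)

    package main

    import (
        "fmt"
        "unsafe"
    )

    type value struct {
        ptr unsafe.Pointer // points at the string's bytes (still traced by the GC)
        num int            // string length, or an integer payload for other kinds
    }

    func stringValue(s string) value {
        return value{ptr: unsafe.Pointer(unsafe.StringData(s)), num: len(s)}
    }

    func (v value) str() string {
        return unsafe.String((*byte)(v.ptr), v.num)
    }

    func main() {
        v := stringValue("hello")
        fmt.Println(v.str()) // "hello", without boxing the string in an interface
    }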

reply
_rlh
5 days ago
[-]
It was a memory model / two-word atomicity problem. The mutator uses two writes, one for the type and one for the value, to create the interface. The GC concurrently reads the two words of the interface to see if the value is a pointer or not. This is a race that was considered too expensive / complicated to fix.
reply
hinkley
9 days ago
[-]
The problem with precise GC is usually the same problem with malloc/free - if you allocate in an inner loop you have to free in that inner loop and the bookkeeping kills throughput.

I don’t know Go. Is that the problem we are seeing here?

One of the realtime GC solutions that stuck with me is amortized GC, which might be appropriate to Go.

Instead of moving dead objects immediately to the free list you “just” need to stay ahead of allocation. You can accomplish that by freeing memory every time you allocate memory - but not a full GC, just finishing the free of a handful of dead objects.

That puts an upper bound on allocation time without a lower bound on reallocation time.
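
(A minimal sketch of that amortization, using a hypothetical manual pool rather than a real collector: every allocation also retires a couple of pending dead objects, so reclamation keeps pace with allocation instead of happening in one big pause.)

    package main

    type object struct{ data [64]byte }

    type pool struct {
        free    []*object // objects ready for reuse
        pending []*object // dead objects not yet returned to the free list
    }

    const sweepPerAlloc = 2 // tuning knob: how much freeing to do per allocation

    func (p *pool) alloc() *object {
        // Amortized sweep: move a handful of pending objects to the free list.
        for i := 0; i < sweepPerAlloc && len(p.pending) > 0; i++ {
            n := len(p.pending) - 1
            p.free = append(p.free, p.pending[n])
            p.pending = p.pending[:n]
        }
        if n := len(p.free); n > 0 {
            o := p.free[n-1]
            p.free = p.free[:n-1]
            return o
        }
        return &object{}
    }

    func (p *pool) release(o *object) { p.pending = append(p.pending, o) }

    func main() {
        p := &pool{}
        o := p.alloc()
        p.release(o)
        _ = p.alloc() // reuses o after the amortized sweep
    }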

reply
sltkr
9 days ago
[-]
That's not what precise GC means.

Precise means that on a GC cycle, only true pointers to heap allocated objects are identified as roots, which means that an unreferenced object cannot be kept alive by a value that happens to have a similar bit pattern as a heap pointer. This guarantees that unreferenced objects are eventually cleaned up, but not that this cleanup happens immediately, certainly not as part of a busy loop. (Unless the system runs out of memory, but in that case, a GC iteration is required anyway.)

reply
vlovich123
9 days ago
[-]
Since reference counting is GC, isn't reference counting a form of precise GC? That's certainly the scenario I had in mind reading what OP wrote where deallocation within a hot loop would be possible.
reply
hinkley
8 days ago
[-]
No it’s usually called conservative unless you have loop detection.
reply
hinkley
8 days ago
[-]
Oof. We used to just call that tracing.
reply
ordu
8 days ago
[-]
Tracing is the technique of finding all live objects from the roots: you dereference a pointer to find an object there, then recursively iterate over the pointers in that object. But before that you need to choose what to use as roots, and here is the difference: you can be precise in choosing the roots, or err on the conservative side, choosing more roots than you actually need.
reply
hinkley
8 days ago
[-]
That is too subtle of a distinction. Tracing demands roots. Why are we inventing a new classification where we have accurate versus inaccurate roots?
reply
ordu
7 days ago
[-]
This classification was invented decades ago, I think in the 1970s or even earlier. Garbage collection is a large field with lots of ideas and research behind it. There are different approaches to structuring the heap, finding roots, and tracing live objects. These different approaches are oftentimes (but not always) interchangeable; you can change how you find roots while leaving other things intact.

GC is a large field with its own terminology, and instead of asking "why are we inventing terminology", I'd ask "who are we to change the existing terminology".

reply
chrsig
8 days ago
[-]
I think it is still tracing; it's just a matter of identifying possible roots versus actual roots, and the constraints that fall out from that.
reply
vlovich123
9 days ago
[-]
> if you allocate in an inner loop you have to free in that inner loop and the bookkeeping kills throughput

I wonder if there's anything that automatically defers collection within a loop and collects the garbage after the loop is exited. Something like Obj-C's old @autoreleasepool but inserted automatically. Static analysis has gotten so fancy that that might not be a terrible idea once you work out all the nuances (e.g. what if the loop is too short, nested loops, what happens when your entire program is a loop like game loops, etc). Could be the best of both worlds.

But generally I think it turns out that in refcounted GC systems you just learn not to allocate/free in your hot loop, or if you are doing it in a loop, it's probably not what the hot loop is dominated by.

reply
hinkley
8 days ago
[-]
Generational GC - particularly with escape analysis - can make those temp objects just slightly more expensive than stack allocation.

The odd thing about GenGC is that it punishes you for trying to recycle old objects yourself. You create much more pressure to scan the old generation by doing so, which is expensive. If you have the available memory it’s often better to build a new object and swap it for the retained reference to the old one at the end when you’re done (poor man’s MVCC as well if the switch takes a couple context switches to finish)
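
(A minimal sketch of that "build new, then swap" pattern, using atomic.Pointer so readers never observe a half-updated structure; the map and the transformation are placeholders.)

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    var table atomic.Pointer[map[string]int]

    func rebuild(src map[string]int) {
        fresh := make(map[string]int, len(src)) // brand-new, young-generation object
        for k, v := range src {
            fresh[k] = v + 1
        }
        table.Store(&fresh) // swap the reference; the old map becomes garbage wholesale
    }

    func main() {
        initial := map[string]int{"a": 1}
        table.Store(&initial)
        rebuild(*table.Load())
        fmt.Println(*table.Load()) // map[a:2]
    }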

reply
Rusky
8 days ago
[-]
Any tracing GC, conservative or precise, generational or not, already does this. Hinkley is wrong about what "conservative" and "precise" mean.
reply
kazinator
9 days ago
[-]
Conservative GC is quite vicious against the use of lazy lists:

You have code like this:

  function(make_infinite_lazy_list());
where function walks down the list:

  fun function(list)
  {
     while (list) {  // nil is false
        ...
        list = cdr(list);
     }
  }
The problem is that the parent stack frame contains a spurious copy of the original return value from make_infinite_lazy_list, and so as function marches down the list, the discarded prefix of the list does not become garbage as expected.

This is the showstopper for conservative GC. Not the bogeyman of a stray machine integer suddenly looking exactly like a heap pointer.

Stick a conservative scanner under C, make yourself some lazy lists, and this problem will easily reproduce!
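
(The same shape, transliterated into Go only for concreteness — Go's own stack scanning is mostly precise, so this is just meant to show where the spurious reference would sit: a stale copy of `head` in the caller's frame keeps the whole prefix alive while walk marches down the list.)

    package main

    type node struct {
        value int
        next  *node
    }

    func walk(list *node) int {
        sum := 0
        for list != nil {
            sum += list.value
            list = list.next // the local advances, so earlier nodes could become garbage...
        }
        return sum
    }

    func main() {
        head := &node{value: 1, next: &node{value: 2}}
        _ = walk(head)
        // ...unless this frame still holds `head`, or, under a conservative
        // scanner, a dead stack slot that merely looks like it does. Nulling
        // the slot is the compensation discussed further down the thread.
        head = nil
    }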

reply
celeritascelery
8 days ago
[-]
Wouldn’t a stack map have that value as well?
reply
kazinator
8 days ago
[-]
If the compiler puts out a stack map that is conservative, then the GC scan will be effectively conservative. The compiler has to compensate for that somehow. If a temporary location is in the stackmap, such that the value in it is a dead value before a function call, the compiler has to insert an instruction to null out that temporary location.
reply
Rusky
8 days ago
[-]
And the same can of course be done under conservative GC...
reply
kazinator
8 days ago
[-]
But it won't, because people implement conservative GCs when they are not in control of all the pieces, like the compilers used for some parts of their run-time.
reply
Rusky
8 days ago
[-]
The point of the post is that conservative GC can be useful for reasons other than that simple lack of control.
reply
neonsunset
9 days ago
[-]
> A compiler that does precise root-finding will typically output a side-table indicating which slots in a stack frame hold references to heap objects. These lifetimes aren’t always precise, in the sense that although they precisely enumerate heap references, those heap references might actually not be used in the continuation of the stack frame. When GC occurs, it might mark more objects as live than are actually live, which is the imputed disadvantage of conservative collectors.

This is not necessarily accurate with true precise tracking. E.g.:

    using System.Runtime.CompilerServices;

    Example();

    // Skip Tier 0 compilation which does not track gcrefs as precisely
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static void Example() {
        var obj = new object();
        var wr = new WeakReference(obj);
        for (var i = 0; i < 3; i++) {
            Console.WriteLine(obj);
        }
        GC.Collect();
        Console.WriteLine(wr.IsAlive);
    }
This works in much more advanced scenarios too.

Unfortunately, I can't link a single document that covers this in detail off the top of my head, but there's a wealth of information in Konrad Kokosa's works:

.NET GC internals lectures: https://www.youtube.com/watch?v=8i1Nv7wGsjk&list=PLpUkQYy-K8...

Pro .NET Memory Management (which is a great book in general): https://prodotnetmemory.com/

reply
andyayers
9 days ago
[-]
In .NET, even in optimized methods, there can be "untracked" lifetimes where a stack slot is reported live to GC throughout the extent of a method, so presumably these can lead to the "over-reporting" cases mentioned.

The number of trackable lifetimes was 64 in .NET Framework but has been steadily increased in modern .NET and is now 1024, so it's rarely a capacity issue; but there are cases where we can't effectively reason about lifetimes.

For us another big drawback to conservative scanning is that any object referred to by a conservative reference cannot be relocated, since the reference might be live and is not guaranteed to be a GC reference; these objects are (in our parlance) effectively pinned, and this causes additional overhead.

reply
neonsunset
8 days ago
[-]
Thanks!

I knew about the 1000 (turns out 1024) limit for method locals; in hindsight it does make sense for it to apply to gcref tracking just as much...

reply
nu11ptr
9 days ago
[-]
I've never understood why anyone would use a conservative collector outside toy programs or academia. It is hard enough to make programs deterministic even with precise collection. I can't even imagine releasing software that was inherently non-deterministic and could suddenly, and without notice, start retaining memory (even if atypical in practice). Thus, IMHO, which is faster is a moot point.
reply
rbehrends
9 days ago
[-]
The same argument would apply to any non-compacting allocator, because the worst case memory blowup due to external fragmentation is huge. But such cases are extremely rarely observed in practice, so people use e.g. standard malloc()/free() implementations or non-compacting garbage collectors without being concerned about that.

In addition, there are plenty of cases where memory usage is unbounded or excessive, not because of allocator behavior, but because of programming mistakes. In fact, memory can sometimes blow up just because of large user inputs and very few systems are prepared for properly handling OOM conditions that happen legitimately.

Case in point: Both CRuby and Apple's JavaScriptCore have garbage collectors that use conservative stack scanning and are widely used in production systems without the world ending.

That said, you're probably not going to use conservative stack scanning because of collection speed alone. There are other trade-offs between conservative stack scanning and precise stack scanning that weigh more heavily.

I'll add the caveat that I would be very cautious about using a conservative GC on 32-bit systems, but on 64-bit systems this is about as much of a concern as memory fragmentation.

reply
hinkley
9 days ago
[-]
> But such cases are extremely rarely observed in practice

After long years of finding problems and trying to encourage, pressure, bribe, cajole and finally yell at people about said problems, my professional opinion is that people aren’t very observant, and either do not see problems or pretend not to see them.

They just reboot the process every 48 hours and walk away.

And sadly continuous deployment is making this worse. The collective We have problems that only show up on three or four day weekends, right when you don’t want to be on call.

reply
bjoli
9 days ago
[-]
Unless you have deterministic threading, no parallel GC will be deterministic. A lot of software has used the Boehm collector without issue, like Inkscape and, I believe, Crystal.

It is not sexy, and it has other issues, but it has worked for a long time.

reply
adgjlsfhk1
9 days ago
[-]
The counterpoint here is that if your program uses 2 GB of memory on a 64-bit computer, only about 1 in 4 billion random 64-bit values will be plausible addresses (and almost all of those will only pin small amounts of memory).
reply
dzaima
9 days ago
[-]
It's less trivial considering that typically only the low ~47 bits of addresses are allowed to be non-zero; then again, values on the stack are still full 64 bits, and any non-zero bit in the top 17 immediately disqualifies a word; then again, besides literally random 64-bit integers, values that don't push up against the 64-bit limit are probably more likely anyway.

Another possibility is two adjacent 32-bit ints merging into one word: for a 2 GB heap, this would require the high half to hit one of the two valid values (which lie in 1≤x≤32767, a reasonably frequent range for general-purpose numbers), while the bottom half can be anything. Whether such packing can happen on the stack depends on codegen (and, to a lesser extent, the program, as one can just write "12345L<<32" and get a value with the 2-in-32767 (0.006%) chance of hitting the heap); but even then, fitting a million of these on the stack only gives ~61 potential false roots, and for that to be significant some of them must hit massive objects (which brings back a roughly one-in-a-billion factor, unless there are large cyclic object subgraphs).

reply
cpeterso
8 days ago
[-]
Depending on the processor architecture, you might be able to assume a minimum memory alignment and then immediately disqualify values whose least significant bits aren’t zero.
reply
adgjlsfhk1
8 days ago
[-]
You probably can't if your programming language puts data in those bits (which is fairly common, since it's a very convenient place to store a couple of bits of data for the GC).
reply
PaulHoule
9 days ago
[-]
Didn’t the opposite happen near the end of the 32 bit age? My take was that you could get away with conservative garbage collection if you had 16MB of RAM in a 32 bit address space but as you filled more of it the probability of a significant memory leak converged on 1.
reply
adgjlsfhk1
9 days ago
[-]
I assume you mean 16 mb of ram in a 32 gb space, but yes. As app size approaches address space size, lots of things get worse.

That said, this new generation of GCs that only scan the stack conservatively has the significant advantage that stacks have remained pretty small (8 MB is still the standard stack size), so if you have precise heap scanning, the odds of spurious retention from conservative scanning go down a ton.

reply
PaulHoule
9 days ago
[-]
I corrected GB -> MB in my post, thanks!
reply
IshKebab
9 days ago
[-]
That's only true if the valid addresses or values are uniformly distributed, which is not the case.
reply
MobiusHorizons
9 days ago
[-]
The article suggests this might be used for JavaScript VMs; maybe the lack of int64s in the language helps?

> Also, Apple’s JavaScriptCore uses conservative stack scanning, and V8 is looking at switching to it. Funny, right?

reply
sestep
9 days ago
[-]
Yeah... I can't imagine trying to debug that. "We kept getting memory leaks, so we dug into it and realized that the language was interpreting local integer variables as pointers and refusing to free memory, but this only happened sporadically and we couldn't reproduce the bug on our development machines. After banging our heads against the wall for weeks we realized what was going on, and it turns out this behavior is completely intentional and they have no plans to change it."
reply
hypertele-Xii
9 days ago
[-]
Computer programming is full of probabilistic edge cases with ridiculous costs that are amortized over normal use. Most optimization, encoding, and compression is based on statistics.

If your use case requires a minimum bound, use another algorithm.

reply
sestep
9 days ago
[-]
Amortized analysis actually provides a guarantee (either deterministic or probabilistic) that things will tend to even out in the long run. Unless I misunderstand something, conservative GC provides no such guarantee, and there are no hard statistics behind the claim that memory leaks caused by it should be rare. There's a difference between "this algorithm is actually random and so sometimes will happen to exhibit suboptimal behavior" and "under the right circumstances, this garbage collection scheme will consistently produce memory leaks due to arcane rules, but those conditions are practically impossible to reproduce in a controlled setting."
reply
dzaima
9 days ago
[-]
Assuming that on-heap objects are tracked precisely, the maximum number of objects conservative stack scanning can incorrectly consider as roots is O(stack size) (with, in practice, a tiny constant factor, from excluding actual intentional references and frequently-changing or trivial non-reference values).

The only way for the total amount of leaked memory to grow is by replacing something else on the stack, which releases whatever was there before, and, assuming there's some maximum stack size, there's a finite amount of such space.

End result being that, even if you have some code path that generates an integer that looks like a reference, it can only leak one object (which could be massive perhaps, but with generational GC or similar there's no guarantee of all garbage being GC'd at any single GC anyway); if codegen clears stack slots upon their lifetime ending, you need one such integer-holding frame to be on the stack for every object you want to be leaked.

reply
reichstein
9 days ago
[-]
Many quick-sort implementations are deterministic, so they will consistently exhibit their worst-case behavior on the same inputs again and again. The good ones try to do a little better than choosing the center element as pivot, but with a well-crafted input it can easily become quadratic anyway.

Luckily, sorting is something you can easily swap for another implementation if the default one doesn't fit your use case, unlike the GC built into the single language implementation that your customer uses.

reply
hypertele-Xii
6 days ago
[-]
I take it there are languages that are locked to a particular GC, and this is a point of pain and friction? A good garbage collected language would make automated memory management both optional and modular. I recall D having a GC that can be disabled?
reply
paulddraper
9 days ago
[-]
Yeah, but we're talking about memory leaks not some branch misprediction or cache miss.
reply
hypertele-Xii
6 days ago
[-]
You're thinking in terms of space, now consider the same concept in the dimension of time.

Branch misprediction is a permanent loss. You will never get those nanoseconds back.

reply
paulddraper
6 days ago
[-]
Yes, you have to wait longer.

Space inefficiencies require you to install more memory.

reply
hypertele-Xii
5 days ago
[-]
If you exhaust memory, it gets paged to disk, increasing access latency. I.e., you have to wait longer.
reply
hedora
9 days ago
[-]
I’ve never worked with a language that had a precise GC that wasn’t also a nightmare to run in production. Java’s the obvious example of a language with an unmanageable GC. (Yes, they’re claiming the next GC will work, but that was the top-line feature in the 1.4 marketing back in the late 1990’s, and I simply don’t believe such claims after 25+ years and over a dozen major releases that failed to deliver it.)

Go is supposedly a counterexample. I haven’t used it enough to offer an opinion, but I have heard of companies rewriting Go services simply to avoid GC at runtime.

reply
foobazgt
8 days ago
[-]
This is such a weird take. Java's been everywhere in prod in the industry for decades from backends running millions of qps to HFT to game clients.

IME, the most common GC problems I see are:

  - incredibly basic settings are wrong, like the max heap size
  - the app was written with some very poor behavior, e.g. unreal amounts of allocation or tons of allocation thrashing in tenured heap.
There aren't even a lot of GC knobs to "manage" anymore compared to a decade ago [0]. Don't treat your GC as a black-box machine with infinitely available resources, and you'll be fine.

0) https://docs.oracle.com/en/java/javase/22/gctuning/ergonomic...

reply
pron
8 days ago
[-]
reply
adgjlsfhk1
9 days ago
[-]
Counterpoint: the problem with Java isn't the GC, but the language. If every struct you instantiated in C required a malloc/free, that would be really slow too.
reply
pron
8 days ago
[-]
Java's memory management doesn't work like malloc/free at all, and that's why it's fast [1]. Even disregarding scalar replacement (which means many `new`s don't allocate anything on the heap), nearly all allocations are a pointer bump, and objects that die don't require any active deallocation. With tracing collectors it's keeping objects alive for a long time that requires computational work, not deallocating them (as is the case with refcounting collectors).

An extremely simplified description of how OpenJDK's GCs work is like this: every allocation bumps the "heap end" pointer, and when memory runs out, the GC scans the objects reachable from some set of "roots", resets the allocation pointer to the start of the heap, and copies the live objects to the allocation pointer, bumping it along the way (and in the process, overwriting the dead objects). In practice, the scanning and the compaction may be done concurrently with the application, there is not just one allocation pointer (to avoid contention), and not the entire object graph needs to be scanned to free memory.
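
(A minimal sketch of the bump-pointer half of that description; sizes and layout are illustrative, not OpenJDK's.)

    package main

    import "fmt"

    type heap struct {
        space []byte
        next  int // allocation pointer: every allocation just bumps it
    }

    func (h *heap) alloc(size int) (offset int, ok bool) {
        if h.next+size > len(h.space) {
            return 0, false // out of space: trace from the roots and copy the survivors
        }
        offset = h.next
        h.next += size
        return offset, true
    }

    func main() {
        h := &heap{space: make([]byte, 1<<20)}
        for i := 0; i < 3; i++ {
            fmt.Println(h.alloc(64)) // 0 true, 64 true, 128 true
        }
    }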

The cost of Java's good GC is virtually entirely in memory footprint (there may be some speed costs to GC in some cases, but they're offset by GC's speed benefits).

[1]: Java may indeed be slower than C on some workloads due to memory layout, but that's because of data locality, not GC. The upcoming inlined value objects will take care of that.

reply
sfink
8 days ago
[-]
I find this to be more of an interesting observation than a reason to choose a conservative scanner. I've never really thought of conservative GCs as faster or slower. The stack scan is fast, simple, and prefetchable, but occasionally has to do extra work. The precise scan has to consult separate tables, often not stored nearby. I guess I'd expect the speed difference to come more as a result of conservative pointers getting pinned and what repercussions that has on the overall collection. I did think it was interesting that precise scanning can sometimes be more conservative than a conservative scanner because of excessively long live ranges. It's great to have a wingo writeup on these things.

Precision gives you a lot more flexibility in moving data. Conservative scanning is way more pleasant for embedding (you don't have to go through a ritual to register every root). The difference in what gets collected rarely seems to matter much in practice, though the determinism can matter, especially for intermittently failing tests.

There's a lot of path dependence in choosing one over the other. It's easy to go from precise to conservative, but very hard to go the other way. In practice, that means you tend to stick to whatever you've got—if you have a working precise collector, you're loath to loosen it up because you might get stuck. I was heavily involved in moving the SpiderMonkey collector from conservative to precise, and I wouldn't like to lose that work, and I'd really hate to have to do it again! And yet, the maintenance cost of keeping the precise collector correct is not low.

I suspect a lot of the argument for precision boils down to: bump allocators go brrr. With a precise scanner, you can bump allocate into a young generation, move everything out during a minor GC without regard for conservative pinning, and then do it over again. No looking at allocation bitmaps or free lists or whatever. Easily JITted (at least for the basic allocation), good locality, simple code.

A more tenuous guess is that a lot of the argument for being conservative boils down to ease of embedding. You mostly don't have to think about registering your stack roots, you can just allocate stuff and call things.

reply
tinco
9 days ago
[-]
I've heard about scanning of the stack but I'm not sure if I get it. Is the strategy literally to not keep track of references at all, and simply do a sequential scan over the entire memory looking for bytes that look like they're pointers into the heap, and then assuming they are roots? And then you look them up in the allocation records and mark them as still in use? You'd have to scan them in turn as well to know if they've got references to other heap locations right?

Edit: ah, he says to assume the heap is precisely traced (somehow?), so I guess it would already be known what references are in there.

reply
Tarean
9 days ago
[-]
Both strategies start from roots (e.g. the stack) and then transitively chase pointers. Any memory reachable this way is live.

To do this chasing precisely you need some metadata, saying which fields are pointers vs other data like ints. For OOP languages this metadata is stored with the vtables for objects. For the stack you need similar metadata, at the very least how many pointers there are if you put them first in the stackframe.

Not having this metadata and conservatively treating random ints as pointers isn't always catastrophic, but it has some drawbacks. Two big ones:

- a moving GC is tough, you have to update pointers after moving an object but can't risk updating random ints that happen to have that value

- You do more work during GC chasing random non-pointers, and free less memory meaning more GC runs

Generating the precise GC metadata for stack frames is sort-of easy. You need specific safe points where all data is at known locations for the GC to work anyway, usually by spilling registers to the stack. These GC checkpoints usually coincide with heap allocation, which is why long non-allocating loops can block stop-the-world GC and send other threads into a spin-lock in many GC implementations.

Maybe a non-precise GC could treat registers as possible pointers and skip spilling to the stack for safe points? There are alternatives to spilling, like a precise stack map for each program instruction (so every instruction is a safe point), but those are expensive to process and usually only used for debugging or exception handling, not something frequent like GC.

reply
Rohansi
9 days ago
[-]
Pretty much, yes. There are optimizations so it doesn't need to scan every possible location: pointers should be properly aligned, of course, and things like keeping separate heap regions for data that contains no pointers (large data arrays) let scanning be skipped entirely.
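
(A conceptual sketch of that filtering, with made-up addresses and layout, nothing like a real runtime: treat every aligned stack word that lands inside the heap range as a potential root. 0x1234_5678 below is a plain integer that would nonetheless be retained.)

    package main

    import "fmt"

    const wordSize = 8 // assume 64-bit words

    func conservativeRoots(stack []uint64, heapStart, heapEnd uint64) []uint64 {
        var roots []uint64
        for _, word := range stack {
            // A word "looks like" a pointer if it is aligned and lands in the heap;
            // it may still be an integer that happens to match (a false positive).
            if word >= heapStart && word < heapEnd && word%wordSize == 0 {
                roots = append(roots, word)
            }
        }
        return roots
    }

    func main() {
        heapStart, heapEnd := uint64(0x1000_0000), uint64(0x2000_0000)
        stack := []uint64{42, 0x1000_0040, 7, 0x1234_5678, 0xdead_beef}
        fmt.Printf("%#x\n", conservativeRoots(stack, heapStart, heapEnd))
        // [0x10000040 0x12345678] — the second is the false positive
    }
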
reply
adgjlsfhk1
9 days ago
[-]
The heap is a lot easier to trace than the stack because objects on the heap are put there explicitly, so as long as you know their type, it's pretty easy to know where their pointers are. The tricky part of the stack is that once your compiler has figured out that something can go on the stack, it might also want to do things like store part of the object only in registers.
reply
gok
9 days ago
[-]
I would have assumed the major benefit to precision is that it enables compaction…
reply
sfink
8 days ago
[-]
You can still compact with a conservative scanner, you just have to accommodate pinned regions.
reply
themk
8 days ago
[-]
How do you compact with conservative GC? You can't change the pointer values because they might not be pointers right?
reply
dzaima
8 days ago
[-]
Any object which is referenced by the stack cannot be moved, but the rest of the heap (i.e. the vast majority, assuming the stack is much smaller than the heap) still can.
reply
rurban
8 days ago
[-]
Of course it can be faster, because it won't need a shift, pointers won't need a mask, and objects won't need two words.

But usually you want the ints which look like pointers to be freed earlier, not kept around. It's rare, but when it happens it's a memory hog. Especially for long-running processes I would only do precise GC. Those simple bit shifts and masks cost almost nothing today.
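
(For reference, roughly what those shifts and masks look like in a common tagged-value scheme used by precise collectors for dynamic languages — explicitly not Go's representation.)

    package main

    import "fmt"

    const intTag = 1 // low bit set: immediate integer; low bit clear: aligned heap pointer

    func box(n int64) uint64      { return uint64(n)<<1 | intTag } // one shift + one or
    func unbox(w uint64) int64    { return int64(w) >> 1 }         // one shift
    func isPointer(w uint64) bool { return w&intTag == 0 }         // one mask

    func main() {
        w := box(12345)
        fmt.Println(unbox(w), isPointer(w)) // 12345 false
    }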

For C interop you need to track pointers anyway separately.

reply
tmendez
8 days ago
[-]
I'd be curious to run experiments on this with one of my Go services. However, as far as I can tell, Go's garbage collection is precise and there's no way to run a Go program with a different (conservative) garbage collector. Am I overlooking something or am I out of luck here?
reply
notorandit
9 days ago
[-]
Just as with a number of (unrelated) algorithms, the sweet spot can be found in the middle between two (or more) optimal solutions.

Precision is sometimes useless due to time and computing resource constraints. Speed can come at the cost of poor results.

I think the answer depends upon the specific use case and environment.

reply
cancerhacker
9 days ago
[-]
“One day a student came to Moon and said, I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons. Moon patiently told the student the following story:

One day a student came to Moon and said, I understand how to make a better garbage collector...”

reply
DarkNova6
9 days ago
[-]
I'm not sure what this article is trying to convey to me. I can only suppose the intended audience is entirely academia-centric.

Reading the conclusion, I don't know if anybody working on actual top-tier GCs (Java, JavaScript, C#) finds this useful. It strikes me more as a somewhat interesting factoid, as demonstrated by a paper.

The conclusion:

> When it comes to designing a system with GC, don’t count out conservative stack scanning; the tradeoffs don’t obviously go one way or the other, and conservative scanning might be the right engineering choice for your system.

If there would be examples of how this relates to actual GCs in production and compares them, now that would be interesting.

reply
samatman
9 days ago
[-]
Andy Wingo is a professional compiler engineer. Best known for his work on Guile Scheme, but he works for Igalia, and if you were to read more of his blog (I recommend this highly) you'll see references to work he's done on the so-called top-tier GCs you seem to favor.

I would venture that, as a bare minimum, the intended audience is people who understand and care about the subject. That includes me; your mileage may vary.

reply
zelphirkalt
9 days ago
[-]
Well, there are many ecosystems and languages, and not all of them are running a top-tier GC. The author, afaik, has himself been a driving force in developing a JIT for GNU Guile. Posts like this, where experts share their knowledge and insights, are among the most valuable here on HN.
reply
pizlonator
9 days ago
[-]
Most production GCs are accurate and the article’s position on conservative GC is a minority position.
reply
fweimer
9 days ago
[-]
Some are not. You can try this:

    diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
    index a2b6b979c1..d2f2852294 100644
    --- a/src/runtime/mgc.go
    +++ b/src/runtime/mgc.go
    @@ -144,7 +144,7 @@ const (
     
      // debugScanConservative enables debug logging for stack
      // frames that are scanned conservatively.
    - debugScanConservative = false
    + debugScanConservative = true
     
      // sweepMinHeapDistance is a lower bound on the heap distance
      // (in bytes) reserved for concurrent sweeping between GC
And observe that Go scans a few stack frames conservatively. I think it only does this if a stack frame is preempted. Most frames are scanned precisely.
reply
samatman
9 days ago
[-]
As the Fine Article mentions, the JavaScriptCore GC is conservative, and V8 is considering a switch.

Someone reading your sentence might be at risk of conflating a minority position with a fringe one. Clearly this isn't the case here.

reply
pizlonator
9 days ago
[-]
I said minority, not fringe.

But fringe isn’t far off. Among those who write GCs professionally, I was definitely on the fringe as an advocate for conservative-on-the-stack.

(I wrote most of JSC’s GC and I was one of the folks pushing for it to remain conservative at a time when V8 was accurate and SpiderMonkey transitioned to being accurate. I like that folks are now acknowledging the good choice we made in JSC, but it was a fringe/minority choice at the time for sure, and it’s still a minority choice among high performance GCs.)

reply
samatman
9 days ago
[-]
I would say you're being a bit modest! JSC is deployed on hundreds of millions of devices, and if V8 does adopt a conservative GC strategy, that would be a comfortable majority of JavaScript engines making use of that approach.

It goes to show that popularity is a poor measure of technical merit, and the fringe doesn't always stay that way.

reply
mseepgood
9 days ago
[-]
s/accurate/precise/
reply
pizlonator
9 days ago
[-]
No.

The literature uses “Accurate GC”, “Precise GC”, and “Exact GC” interchangeably.

Famous paper on this that uses “accurate”: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...

Paper that calls it “exact”: https://dl.acm.org/doi/10.1145/2660193.2660198

reply