During my time at Facebook, I maintained a bunch of kernel patches to improve jemalloc's purging mechanisms. They weren't popular with the kernel or security communities, but they were certainly more efficient on benchmarks.
Many programs run multiple threads, allocating in one and freeing in another. Jemalloc's primary mechanism used to be: madvise the page back to the kernel, then have the kernel hand it out again in another thread's pool.
One problem: this involves zeroing memory, which has an impact on cache locality and overall app performance. It's completely unnecessary if the page is being recirculated within the same security domain.
The problem was getting everyone to agree on what that security domain is, even if the mechanism was opt-in.
We did extensive benchmarking of HHVM with and without your patches, and they were proven to make no statistically significant difference in high level metrics. So we dropped them out of the kernel, and they never went back in.
I don't doubt for a second you can come up with specific counterexamples and microbenchmarks which show benefit. But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over. If you're restarting the server every few hours, of course the memory fragmentation isn't much of an issue.
> But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
You mean 5 years after I stopped working on the kernel and the underlying system had changed?
I don't recall ever talking to you on the matter.
Nope, I started in 2014.
> I don't recall ever talking to you on the matter.
I recall. You refused to believe the benchmark results and made me repeat the test, then stopped replying after I did :)
If you don't like the idea of memory cgroups as a security domain, you could tighten it to be a process. But kernel developers have been opposed to tracking pages on a per address space basis for a long time. On the other hand memory cgroup tracking happens by construction.
There needs to be more competition in the malloc space. Between various huge page sizes and transparent huge pages, there are a lot of gains to be had over what you get from a default GNU libc.
Our results from July 2025:
rows are <allocator>: <RSS>, <time spent for allocator operations>
app1:
glibc: 215,580 KB, 133 ms
mimalloc 2.1.7: 144,092 KB, 91 ms
mimalloc 2.2.4: 173,240 KB, 280 ms
tcmalloc: 138,496 KB, 96 ms
jemalloc: 147,408 KB, 92 ms
app2, bench1:
glibc: 1,165,000 KB, 1.4 s
mimalloc 2.1.7: 1,072,000 KB, 5.1 s
mimalloc 2.2.4:
tcmalloc: 1,023,000 KB, 530 ms
app2, bench2:
glibc: 1,190,224 KB, 1.5 s
mimalloc 2.1.7: 1,128,328 KB, 5.3 s
mimalloc 2.2.4: 1,657,600 KB, 3.7 s
tcmalloc: 1,045,968 KB, 640 ms
jemalloc: 1,210,000 KB, 1.1 s
app3:
glibc: 284,616 KB, 440 ms
mimalloc 2.1.7: 246,216 KB, 250 ms
mimalloc 2.2.4: 325,184 KB, 290 ms
tcmalloc: 178,688 KB, 200 ms
jemalloc: 264,688 KB, 230 ms
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29. I don't recall which jemalloc was tested.
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.
Are you using async rust, or sync rust?
[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...
Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.
Even toolchains like Turbo Pascal for MS-DOS had an API to customise the memory allocator.
The one size fits all was never a solution.
If you got a web request, you could allocate a memory pool for it, then you would do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors, you could just free the entire pool.
It was nice and made an impression on me.
I think the lowly malloc probably has lots of interesting ways of growing and changing.
Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.
(99% of the time, I find this less problematic than Java’s approach, fwiw).
I heard that was a common complaint for Minecraft.
To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak.
This feels like a huge understatement. I still have some PTSD around when I did Java professionally between like 2005 and 2014.
The early part of that was particularly horrible.
Barring bugs/native leaks, Java has very predictable memory allocation.
The issue is that through the standard course of a JVM application running, every allocated page will ultimately be touched. The JVM fills up new gen, runs a minor collection, moves old objects to old gen, and continues until old gen gets filled. When old gen is filled, a major collection is triggered and all the live objects get moved around in memory.
This natural action of the JVM means you'll see a sawtooth of used memory in a properly running JVM where the peak of the sawtooth occasionally hits the memory maximum, which in turn causes the used memory to plummet.
There are a lot of bad tuning guides for Minecraft that should be completely ignored and thrown in the trash. The only GC setting you need for it is `-XX:+UseZGC`.
For example, a number of the Minecraft tuning guides I've seen suggest setting pause targets alongside survivor space sizes. The thing is, the pause target is disabled once you start playing with survivor space sizes.
It was a better idea when Java had the old mark-and-sweep collector. With the generational collectors (which is every Java collector now, except Epsilon), it's more problematic: reusing buffers and the objects in them pretty much guarantees the buffer ends up in old gen. That means clearing it out requires more expensive collections.
The actual allocation time for most of Java's collectors is almost zero: a capacity check and a pointer bump in most circumstances. Giving the JVM more memory will generally solve issues with memory pressure and GC times. That's (generally) a better solution to performance problems than keeping a large reusable buffer.
Now, that said, there certainly have been times where allocation pressure is a major problem and removing the allocation is the solution. In particular, I've found boxing to often be a major cause of performance problems.
For example, some code I had to clean up pretty early on in my career was a dev, for unknown reasons, reinventing the `ArrayList` and then using that invention as a set (doing deduplication by iterating over the elements and checking for duplicates). It was done in the name of performance, but it was never a slow part of the code. I replaced the whole thing with a `HashSet` and saved ~300 loc as a result.
This individual did that sort of stuff all over the code base.
Heap allocation in Java is trivial and happens constantly. People typically do funky stuff with memory allocation because they have to: the GC is causing pauses.
People avoid system allocators in C++ too, they just don't have to do it because of uncontrollable pauses.
This same dev did things like putting what he deemed as being large objects (icons) into weak references to save memory. When the references were collected, invariably they had to be reloaded.
That was not the source of memory pressure issues in the app.
I've developed a mistrust of a lot of devs "doing it because we have to" when it comes to performance tweaks. It's not that a reusable buffer is never the right call, but it's not something I've had to reach for to solve GC pressure issues. Often, far simpler fixes, like pulling an allocation out of the middle of a loop or switching from boxed types to primitives, were all that was needed to relieve memory pressure.
The closest I've come to it is replacing code which would do an expensive and allocation heavy calculation with a field that caches the result of that calculation on the first call.
Last time I checked mimalloc, which was admittedly a while ago, probably 5 years, it was noticeably worse, and I saw a lot of people on their GitHub issues agreeing with me, so I just never looked at it again.
Jemalloc can usually keep the smallest memory footprint, followed by tcmalloc.
Mimalloc can really speed things up sometimes.
As usual, YMMV.
Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing, so I am not inclined to trust it now.
That’s… ahistorical, at least as far as I remember. It wasn’t marketed as either of those; it was marketed as small/simple/consistent with an opt-in high-security mode, and then its performance bore out as a result of the first set of target features/design goals. It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.
That is true of basically every single malloc replacement out there, that is not a uniquely defining feature.
https://jemalloc.net/jemalloc.3.html
One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html
The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.
EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).
My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple percent might have been allocator change but the boost from forcing huge pages was very significant)
Jemalloc Postmortem - https://news.ycombinator.com/item?id=44264958 - June 2025 (233 comments)
Jemalloc Repositories Are Archived - https://news.ycombinator.com/item?id=44161128 - June 2025 (7 comments)
Most of the savings seemed to come from HVAC costs, followed by buying fewer computers and in turn fewer data centers. I'm sure these days saving memory is also a big deal, but it doesn't seem to have been then.
The above was already the case 10 years ago, so LLMs are at most another factor added on.
In startups I've put more effort into squeezing blood from a stone for far less change; even if the change was proportionally more significant to the business. Sometimes it would be neat to say "something I did saved $X million dollars or saved Y kWh of energy" or whatever.
At most... Think 10x rather than 0.1x or 1x.
they've been using jemalloc (and employing "je") since 2009.
I'm saddened that the job market in Australia is largely React CRUD applications and that it's unlikely I will find a role that lets me leverage my niche skill set (which is also my hobby)
Link in bio.
I applied for both and got ghosted, haha.
I also saw a government role as a security researcher. Involves reverse engineering, ghidra and that sort of thing. Super awesome - but the pay is extremely uncompetitive. Such a shame.
Other than that, the most interesting roles are in finance (like HFT) - where you need to juggle memory allocations, threads and use C++ (hoping I can pitch Rust but unlikely).
Sadly they have a reputation of having pretty rough cultures, uncompetitive salaries and it's all in-office
It's really awful how bad the pay in the public sector is. I saw an SWE job posting with the BOM, and it paid about half the salary of an equivalent role in the private sector. It could be an interesting position, but how could you justify taking it?
I interviewed for one local trading firm a few years ago (it's not hard to guess which one). I made it through the initial screening, interview, and leetcode exam. The final recruitment stage was doing a two-hour presentation of some of your own projects to their employees. I presented some of what eventually would become [1], and a few enterprise web projects from my career. It was basically 2 hours of me being mocked by two of their engineers. It was a really negative experience. I actually looked up one of my interviewers on LinkedIn after the interview, and they were a graduate less than a year out of university. Unreal.
There was an 'Ask HN' post a few months ago asking 'who isn't working on web services?'. I can't seem to find it now, but it showed there's a lot of people out there feeling unenthusiastic about web development.
The one I know of (IMC trading) does a lot of low level stuff like this and is currently hiring.
https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
Facebook's coding AIs to the rescue, maybe? I wonder how good all these "agentic" AIs are at dreaded refactoring jobs like these.
There are not many engineers capable of working on memory allocators, so adding the extra burden of supervising agentic tools is unlikely to produce anything of value.
No.
This is something you shouldn't allow coding agents anywhere near, unless you have expert-level understanding required to maintain the project like the previous authors have done without an AI for years.
I've done some work in this sort of area before, though not literally on a malloc. Yes you very much want to be careful, but ultimately it's the tests that give you confidence. Pound the heck out of it in multithreaded contexts and test for consistency.
Second thoughts: Actually the fb.com post is more transparent than I'd have predicted. Not bad at all. Of course it helps that they're delivering good news!
Initially the idea was diagnostics; instead the problem disappeared on its own.
He's doing just fine. If you're looking for a story about a FAANG company not paying engineers well for their work, this isn't it.
void* malloc(size_t size) {
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (ptr == MAP_FAILED) ? NULL : ptr;
}
void free(void *ptr) { /* YOLO */ }
/s
From the Department of Redundancy Department.
I was recently debugging an app double-free segfault on my android 13 samsung galaxy A51 phone, and the internal stack trace pointed to jemalloc function calls (je_free).
> This branch is 71 commits ahead of and 70 commits behind jemalloc/jemalloc:dev.
It looks like both have been independently updated.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
They should have just called it an ivory tower, as that's what they're building whenever they're not busy destroying democracy with OS Backdoor lobbyism or Cambridge Analytica shenanigans.
Edit: If every thread about any of Elon Musk's companies can contain at least 10 comments talking about Elon's purported crimes against humanity, threads about Zuckerberg's companies can contain at least 1 comment. Without reminders like this, stories like last week's might as well remain inconsequential.