RAM Has a Design Flaw from 1966. I Bypassed It [video]
142 points
2 days ago
| 8 comments
| youtube.com
| HN
kreelman
3 hours ago
This is very much worth watching. It is a tour de force.

Laurie does an amazing job of reimagining Google's strange job optimisation technique (for jobs running on hard disk storage) that uses 2 CPUs to do the same job. The technique simply takes the result from whichever machine finishes first and discards the slower job's result. It seems expensive in resources, but it works and allows high priority tasks to run optimally.
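
A minimal sketch of that first-result-wins idea, not taken from the video or from the tailslayer library. The names `hedged_call` and `fake_job` are made up for illustration, and a `sleep` stands in for a job whose latency differs per machine:

```python
import concurrent.futures as cf
import time


def hedged_call(fn, replicas):
    """Run fn once per replica concurrently and return the first result."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fn, r) for r in replicas]
    # Block only until the fastest attempt has finished.
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    # Do not wait for the loser; its result is simply discarded.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


def fake_job(delay_s):
    # Hypothetical stand-in for a job with machine-dependent latency.
    time.sleep(delay_s)
    return delay_s


winner = hedged_call(fake_job, [0.2, 0.01])
print(winner)
```

The cost is visible in the sketch: every call does the work twice, and the slower attempt still runs to completion in the background.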

Laurie re-imagines this process but for RAM!! In doing this she needs to deal with Cores, RAM channels and other relatively undocumented CPU memory management features.

She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed various CPU settings.

She's turned "Tailslayer" into a lib now, available on GitHub: https://github.com/LaurieWired/tailslayer

You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.

The experimentation, explanation and graphing of results is fantastic. Amazing stuff. Perhaps someone will use this somewhere?

As mentioned in the YT comments, what's been done here is probably a Master's degree's worth of work, experimentation and documentation.

Go Laurie!

gopalv
1 hour ago
>> It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules

This is the sort of thing that was done before in a world where there was NUMA, but that is easy: just taskset and mbind your way around it to keep your copies in both places.

The crazy part of what she's done is working out how to determine that the two copies don't get hit by refresh cycles at the same time.

Particularly by experimenting on something proprietary like Graviton.

rockskon
1 hour ago
She determines that by having three copies. Or four. Or eight.

'Tis just probabilities: the unlikelihood of hitting a refresh cycle across that many memory channels all at once.
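
A back-of-the-envelope sketch of that argument. It assumes each channel is busy refreshing some fraction of the time and that the channels are independent, which is exactly what the channel reverse engineering has to establish. The 2% figure is an illustrative assumption, not a measurement:

```python
def p_all_stalled(p_single, copies):
    """Probability that every copy's channel is refreshing at the same instant,
    assuming independent channels with the same stall probability."""
    return p_single ** copies


P_SINGLE = 0.02  # assumed fraction of time one channel is stalled (illustrative)

for n in (1, 2, 3, 4, 8):
    print(f"{n} copies: {p_all_stalled(P_SINGLE, n):.3e}")
```

If the refresh schedules are correlated, the exponent no longer applies and extra copies buy much less.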

ufocia
2 hours ago
I like the video, but this is hardly groundbreaking. You send out two or more messengers hoping at least one of them will get there on time.
rcbdev
1 hour ago
Yeah. These are literally just mainframe techniques from yesteryear.
npunt
2 hours ago
and dropbox was just rsync
UltraSane
2 hours ago
The clever part is figuring out what RAM is controlled by which controllers.
foltik
3 hours ago
Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.

The hedging technique is a cool demo too, but I’m not sure it’s practical.

At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
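
Rough arithmetic behind that claim, using the numbers measured above (70ns baseline, 330ns spikes, one spike every 15us). The model itself is an assumption: reads arrive at uniformly random times, the stall window is as wide as the spike overshoot, and a read that lands inside it waits for the remainder:

```python
BASE_NS = 70.0        # typical read latency reported above
SPIKE_NS = 330.0      # worst-case read latency reported above
PERIOD_NS = 15_000.0  # one spike roughly every 15us

# Assumed: the stall window is as wide as the extra latency of the worst read.
stall_ns = SPIKE_NS - BASE_NS

# Chance that a uniformly random read lands inside the stall window.
p_hit = stall_ns / PERIOD_NS

# A read landing in the window waits, on average, half the window.
mean_extra_ns = p_hit * stall_ns / 2

print(f"reads hitting a stall: {p_hit:.2%}")
print(f"mean extra latency per read: {mean_extra_ns:.2f} ns")
```

Under these assumptions roughly 1.7% of reads hit a stall, and the mean cost per read is a couple of nanoseconds. Hedging moves the tail, not the average, while the second copy costs bandwidth and cache on every read.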

HFT especially is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache, and to shrink what doesn’t fit as much as possible.

mzajc
3 hours ago
rkagerer
2 hours ago
Halfway through this great video and I have two questions:

1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? I.e., I'd be willing to halve my effective memory in order to completely eliminate the tail latency problem, without having to rewrite legacy software to implement this invention.

2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.

lauriewired
19 minutes ago
1. not that I can think of, due to the core split. It really has to be independent cores racing independent loads. anything clever you could do with kernel modules, page-table-land, or dynamically reacting via PMU counters would likely cost microseconds...far larger than the 10s-100s of nanoseconds you gain.

what I wished I had during this project is a hypothetical hedged_load ISA instruction. Issue two requests to two memory controllers and drop the loser. That would let the strategy work on a single thread! Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation. But, you’d have to convince Intel/AMD/someone else :)

2. It’s called a “smokeninja”. Fairly popular in product photography circles, it’s quite fun!

hawk_
27 minutes ago
> halve my effective memory in order to completely eliminate the tail latency problem,

Wouldn't you have a tail latency problem on the write side though, if you just blindly apply it everywhere? As in, unless all the replicas are done writing, you can't proceed.
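
A toy simulation of that asymmetry: a hedged read finishes with the fastest replica (the minimum), while a replicated write has to wait for the slowest one (the maximum). The stall probability and latencies are illustrative assumptions, and the replicas are assumed independent:

```python
import random

BASE_NS = 70.0    # assumed normal access latency
STALL_NS = 330.0  # assumed latency when the access hits a refresh
P_STALL = 0.02    # assumed chance that one access hits a refresh


def sample_latency(rng):
    return STALL_NS if rng.random() < P_STALL else BASE_NS


def simulate(copies, trials=100_000, seed=0):
    """Return the fraction of slow reads and slow writes over `trials` accesses."""
    rng = random.Random(seed)
    slow_reads = slow_writes = 0
    for _ in range(trials):
        lat = [sample_latency(rng) for _ in range(copies)]
        slow_reads += min(lat) > BASE_NS   # read: the fastest replica wins
        slow_writes += max(lat) > BASE_NS  # write: all replicas must finish
    return slow_reads / trials, slow_writes / trials


read_frac, write_frac = simulate(copies=2)
print(f"slow reads: {read_frac:.4%}, slow writes: {write_frac:.4%}")
```

In this model adding copies shrinks the read tail but grows the write tail, since any one stalled replica delays the whole write.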

imp0cat
1 hour ago
Brio 33884. It has a tiny ultrasonic humidifier in there.
WatchDog
1 hour ago
Laurie makes great content that covers really interesting and low level subject matter, and this in particular is really cool, it's an idea I've never heard anyone talk about before.

Her videos have really high production value, but man I just really struggle to watch them.

rcbdev
1 hour ago
Am I the only one who feels the comments here don't sound organic at all?
tredre3
40 minutes ago
No, I felt the same way. They're exactly like the usual LLM bot comment, where an LLM recaps the OP and ends with a platitude or witty encouragement.

But all the accounts are old/legit, so I think that you and I have just become paranoid...

isoprophlex
45 minutes ago
You're absolutely right
silisili
23 minutes ago
You're absolutely right to call this out. No humans, no emotion, no real comments - just LLM slop.

In all seriousness, agreed. The top comment at time of this writing seems like a poor summarizing LLM treating everything as the best thing since sliced bread. The end result is interesting, but neither this nor Google invented the technique of trying multiple things at once as the comment implies.

Alifatisk
57 minutes ago
I don’t see anything unusual
boznz
2 hours ago
Should say DRAM; SRAM does not have this.
dinkumthinkum
1 hour ago
This is an unreasonably good video. Hopefully, it inspires others to see we can still think hard and critically about technical things.