FilterHN

Bug 1950764: Work Around Crash on Intel Raptor Lake CPU

57 points

by luu

2 days ago

| 6 comments

| phabricator.services.mozilla.com

▲

bri3d

3 hours ago

[-]

Linked in the Bugzilla thread is a really nice in-depth investigation of the same issue with high-byte register aliases in a similar algorithm (Huffman coding) but in an entirely different product: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-in...

It's concerning that Intel doesn't seem to have been responsive to anyone about this issue, and there doesn't appear to be an official erratum yet. Then again, Raptor Lake was the Intel CPU with voltage issues and basically random bit rot, so I suppose it's hard to tell whether this is a silicon-level erratum caused by bad design or the result of some kind of post-manufacturing damage. Raptor Lake in general causes enough non-reproducible noise that I believe Firefox gave up on automated crash reports from it ( https://bugzilla.mozilla.org/show_bug.cgi?id=1975808 ).

EDIT: I read that Oodle article (which is SO good!) again and realized that their customer-provided reproduction of the bug was directly linked to boost clock speeds (the customer said that overclocking by 5% made it happen entirely reliably), so this is definitely not a case of "the architecture has a 100% bug in it" but rather some deeper issue with clock propagation that appears in edge cases.

▲

mtlmtlmtlmtl

16 minutes ago

[-]

It's very interesting because my 13900K has worked like a dream from day one and still to this day. Never had any of the voltage issues, never had any abnormal crashes in Firefox or any other software. I was undervolting it for a long while, so I wonder if somehow that saved me from the voltage issues before they were fixed?

▲

Polizeiposaune

3 hours ago

[-]

Details of the errata from a comment in the diff:

"Write both dist bytes as a single 2-byte store. This avoids the `movb %ch, [mem]` instruction pattern (store from high-byte register alias) that LLVM otherwise emits when dist arrives as a wide register. That pattern triggers the Intel Raptor Lake CPU errata, causing silent 2-byte stores that corrupt the adjacent `len` byte."

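The equivalence that comment relies on can be sketched in Python. This is only a byte-level model, not the patched code: the three-byte `[dist_lo, dist_hi, len]` layout, the offsets, and both function names are assumptions made up for illustration, and Python cannot show which machine instruction a compiler would emit.

```python
import struct

# Hypothetical record layout (an assumption, not taken from the patch):
# byte 0 = low byte of dist, byte 1 = high byte of dist, byte 2 = len.
DIST_OFFSET = 0
LEN_OFFSET = 2


def store_dist_bytewise(buf: bytearray, dist: int) -> None:
    """Two 1-byte stores; the second is the kind of store that may be
    compiled to a store from a high-byte register alias."""
    buf[DIST_OFFSET] = dist & 0xFF
    buf[DIST_OFFSET + 1] = (dist >> 8) & 0xFF


def store_dist_single(buf: bytearray, dist: int) -> None:
    """One little-endian 2-byte store covering both dist bytes."""
    struct.pack_into("<H", buf, DIST_OFFSET, dist)


if __name__ == "__main__":
    a = bytearray([0, 0, 7])
    b = bytearray([0, 0, 7])
    store_dist_bytewise(a, 0x1234)
    store_dist_single(b, 0x1234)
    # On a correctly working machine both buffers match and len is untouched.
    print(a.hex(), b.hex(), a[LEN_OFFSET])
```

Both functions leave the same three bytes behind, which is why swapping one form for the other at the source level should not change behavior on a healthy CPU.
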
▲

whadawha

3 hours ago

[-]

How did this get past validation at Intel?

This is worse than https://en.wikipedia.org/wiki/Pentium_FDIV_bug

▲

bri3d

3 hours ago

[-]

There's another blog post going into more depth about the issue here: https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-in... where they speculate that it relates both to other clock-related instability on specific Raptor Lake parts and possibly to the overarching voltage control problems that the platform had early on. I can't tell entirely from the bug reports whether the behavior reliably reproduces on 100% of Raptor Lakes, but the indicators I'm reading suggest that it doesn't. It is concerning that Intel didn't get back to Mozilla about it though, since it's certainly a lot more than a one-off.

▲

userbinator

2 hours ago

[-]

"validation? what validation?"

https://news.ycombinator.com/item?id=27244941

Edit: you should probably read the article I linked first.

▲

dmitrygr

3 hours ago

[-]

Modifying source to avoid an assembly instruction isn't a fix... this most likely needs a compiler fix, or a microcode fix if possible.

▲

whadawha

2 hours ago

[-]

Does anyone know whether microcode can be patched on consumer-grade Intel CPUs?

▲

bri3d

2 hours ago

[-]

Yes? It is regularly; either the firmware or the OS can deliver updates, depending on configuration. The Raptor Lake CPUs in question have gone through an enormous number of microcode revisions already due to quite famous voltage scaling issues; it's unclear whether this erratum is fallout from, or related to, a similar root cause or is just another issue with the processor.

▲

__patchbit__

1 hour ago

[-]

At boot time, the following package provides the latest Intel CPU microcode data files on NetBSD.

  sysutils/intel-microcode-netbsd

dmesg shows

  cpu 0: ucode 0xf0->0xf6
  cpu 1: ucode 0xf0->0xf6

▲

altairprime

2 hours ago

[-]

https://github.com/intel/intel-linux-processor-microcode-dat...

  $ echo 1 > /sys/devices/system/cpu/microcode/reload

Hot-swappable, even. TIL!
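
To see which microcode revision is actually loaded on Linux, before and after a reload, something like the following Python sketch should do. It assumes an x86 Linux system whose `/proc/cpuinfo` carries one `microcode` line per logical CPU; the function name is made up for illustration.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Collect the distinct values of the 'microcode' field in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        # More than one entry would mean the logical CPUs disagree.
        print(sorted(microcode_revisions(path.read_text())))
    else:
        print("no /proc/cpuinfo here; this check only applies to Linux")
```

Note that writing to the `reload` file generally needs root.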

▲

robin_reala

2 hours ago

[-]

Also worth reading this thread on the subject: https://mas.to/@gabrielesvelto/116630047156991279

> Regarding the Raptor Lake bug I received a couple of messages from confused users that had read articles on Tomshardware and Neowin. They asked about erratas and microcode updates which puzzled me, because that was part of my early investigation into the bug and we know that the failure is not caused by a known errata and microcode updates cannot fix broken CPUs. So why did they ask? As it turns out it was slop. Both articles are 100% slop full of confusing and inaccurate claims.

▲

userbinator

3 hours ago

[-]

WTF, Intel? This is reminding me of a very similar bug from 9 years ago: https://news.ycombinator.com/item?id=14630183

Clearly Intel needs to do far more extensive regression-testing, with things like demoscene productions --- especially the extremely size-optimised ones that can exercise the edge-cases much better than the usual "compiler slop".

▲

hsbauauvhabzb

2 hours ago

[-]

Intel knowingly sold defective CPUs and denied the defect until reports hit critical mass. I don’t think they care.

▲

userbinator

2 hours ago

[-]

"knowingly" is meaningless, as otherwise they wouldn't even bother releasing errata lists; it's more likely that they underestimated the severity or their planned obsolescence calculations happened to be more statistically favourable than reality.

https://news.ycombinator.com/item?id=41041855

▲

hsbauauvhabzb

1 hour ago

[-]

The link you provided doesn’t match your comment; one of the comments in that thread points out that Intel blamed motherboards during the early stages.

▲

mike_hock

3 hours ago

[-]

Uh ... working around this in each and every piece of software sounds like a non-starter? Intel should be on the hook to fix this.

▲

Polizeiposaune

3 hours ago

[-]

Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare -- otherwise this would have been noticed much sooner!

Agner Fog's optimization guide says "Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code."
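
For anyone unsure what the "h" slices are: in a 16-bit register such as CX, CL is bits 0..7 and CH is bits 8..15. A small Python model of that, with helper names that are made up for illustration:

```python
def low_byte(reg16: int) -> int:
    """Bits 0..7, i.e. what AL/BL/CL/DL expose."""
    return reg16 & 0xFF


def high_byte(reg16: int) -> int:
    """Bits 8..15, i.e. what AH/BH/CH/DH expose."""
    return (reg16 >> 8) & 0xFF


if __name__ == "__main__":
    cx = 0x1234
    print(hex(low_byte(cx)), hex(high_byte(cx)))
```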

▲

userbinator

3 hours ago

[-]

> Use of the "h" register slices (bits 8..15) by compilers is thankfully pretty rare

That's unfortunate, because it's precisely why things like this will keep happening.

> Agner Fog's optimization guide says "Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code."

The sad vicious cycle of compilers not exercising the hardware, and then the hardware designers not paying attention. Using the high 8-bit registers and "implicitly merging" them is one of the ways to reduce the number of instructions and thus improve size optimisation.
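
As a rough model of that "implicit merging": writing the two 8-bit halves separately and then reading the 16-bit register yields the combined value with no explicit shift or OR. The Python below only models the register semantics, not any real compiler output; the class and its method names are hypothetical.

```python
class Reg16:
    """Toy model of a 16-bit x86 register with addressable low and high bytes."""

    def __init__(self, value: int = 0) -> None:
        self.value = value & 0xFFFF

    def write_low(self, byte: int) -> None:
        # Like `mov al, byte`: replaces bits 0..7, keeps bits 8..15.
        self.value = (self.value & 0xFF00) | (byte & 0xFF)

    def write_high(self, byte: int) -> None:
        # Like `mov ah, byte`: replaces bits 8..15, keeps bits 0..7.
        self.value = (self.value & 0x00FF) | ((byte & 0xFF) << 8)


if __name__ == "__main__":
    ax = Reg16()
    ax.write_low(0x34)
    ax.write_high(0x12)
    # Reading the full register now gives both bytes merged.
    print(hex(ax.value))
```

Without the high-byte alias, getting the same merged value typically takes an explicit shift and OR, which is the extra code size at stake.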

▲

charcircuit

2 hours ago

[-]

Hopefully this bug is getting handled upstream in a microcode update or a compiler fix to avoid emitting such instructions. Just a comment mentioning that you should not emit a particular instruction is not a strong guarantee.