This went on for months. Then one day we had a power outage. Two months later, every single machine failed at the same time. I checked the logs and it had been 49 days and a few hours since that outage. It didn't take me too long to figure out what the underlying programming error inside the TPM was. At least we could then describe exactly what the problem was to our PC vendor.
Back in the day it was Windows that had a hard limit on how long it could stay up in one stretch. I forget exactly when that began and ended, but happily AI helped me dig back in time.
The bug primarily affected the Windows 9x family of operating systems:
Windows 95 (all versions)
Windows 98 (original release)
Windows 98 Second Edition (SE)
While there were separate reports of similar 497-day overflows in Windows NT 4.0 and Windows 2000, the "classic" version of this bug that most people remember is the 49.7-day limit on Windows 95 and 98.
Why 49.7 days? The issue was a classic integer overflow. Windows used a 32-bit counter to track the number of milliseconds since the system started. This counter was used by the Virtual Machine Manager (VMM) to manage system timers.
The maximum value of a 32-bit unsigned integer is 2^32 - 1 = 4,294,967,295 milliseconds.
Converting those milliseconds into days:
4,294,967,295 / 1,000 ≈ 4,294,967 seconds
4,294,967 / 60 / 60 / 24 ≈ 49.71 days
When the counter hit that maximum value, it would "wrap around" to zero. Because many system services and drivers were waiting for the counter to increase to a certain target time, they would suddenly find themselves waiting for a number that had already passed or was now mathematically impossible to reach in their logic. This caused the "hang"—the mouse might still move, but the OS could no longer process tasks.
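To make that failure mode concrete, here is a minimal C sketch (my own illustration, not actual Windows code) of a timer deadline computed from a 32-bit millisecond counter becoming unreachable after the wrap:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Stand-in for a 32-bit millisecond uptime counter. It wraps to 0
         * after 2^32 ms, i.e. after roughly 49.7 days of uptime. */
        uint32_t ticks = 0xFFFFFF00u;     /* almost 49.7 days in */

        /* A timer service schedules work 100 ms in the future. */
        uint32_t deadline = ticks + 100u; /* still just below the wrap point */

        ticks += 400u;                    /* 400 ms pass; the counter wraps to a small value */

        /* Naive "has the deadline passed?" check: after the wrap, ticks is
         * tiny and deadline is huge, so this stays false for ~49.7 days. */
        if (ticks >= deadline)
            puts("timer fires");
        else
            puts("timer never fires: deadline now looks ~49.7 days away");

        /* Wrap-safe variant: the unsigned subtraction wraps too, and the
         * sign of the result gives the right answer as long as the two
         * values are less than 2^31 ms apart. */
        if ((int32_t)(ticks - deadline) >= 0)
            puts("wrap-safe check: timer fires");

        return 0;
    }

The wrap-safe comparison at the end is the same sign-test trick the Linux kernel uses for jiffies, quoted further down the thread.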
When did it start and end? Started: With the release of Windows 95 in August 1995.
Ended: Microsoft officially fixed the bug with a patch in 1999 (Knowledge Base article KB216641). Windows Me (released in 2000) was the first in that specific family to ship with the fix included, and the transition to the Windows NT architecture (Windows XP and later) eventually rendered the specific underlying cause obsolete for home users.
1: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971...
https://elixir.bootlin.com/linux/v6.15.7/source/include/linu...
* Do this with "<0" and ">=0" to only test the sign of the result. A
* good compiler would generate better code (and a really good compiler
* wouldn't care). Gcc is currently neither.
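For context, that comment sits next to the jiffies comparison helpers in include/linux/jiffies.h. Stripped of the typecheck() machinery the real macros carry, the trick looks roughly like this (standalone demo, not kernel code):

    #include <stdio.h>

    /* Simplified forms of the helpers that live next to that comment in
     * include/linux/jiffies.h (the real macros also typecheck that both
     * arguments are unsigned long). The unsigned subtraction wraps, and
     * casting the result to signed reduces "which came first?" to a sign
     * test, which stays correct across counter wraparound as long as the
     * two values are less than half the counter range apart. */
    #define time_after(a, b)   ((long)((b) - (a)) < 0)
    #define time_before(a, b)  time_after(b, a)

    int main(void)
    {
        unsigned long now      = (unsigned long)-100;  /* just before the wrap */
        unsigned long deadline = now + 200;            /* wraps past zero */

        printf("naive now >= deadline:       %d\n", now >= deadline);            /* 1 -- wrong */
        printf("time_after(now, deadline):   %d\n", time_after(now, deadline));  /* 0 -- right */
        printf("time_before(now, deadline):  %d\n", time_before(now, deadline)); /* 1 -- right */
        return 0;
    }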
It's funny the love-hate relationship the Linux kernel has with GCC. It's the only supported compiler[1], and yet...

[1] can Clang fully compile Linux yet? I haven't followed the updates in a while.
GCC is a different beast and far better nowadays.
They suspect jobs will work if you only use 1 B200, but one person power cycled, so they weren’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.
JavaScript and some other languages store every number as a 64-bit float, so integers are only exact up to 53 bits; beyond that, precision is silently lost.
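Since a C double is the same IEEE 754 format JavaScript uses for all its numbers, the cutoff is easy to demonstrate (small illustration, nothing more):

    #include <stdio.h>

    int main(void)
    {
        /* An IEEE 754 double has a 53-bit significand (52 stored bits plus
         * one implicit bit), so every integer up to 2^53 is representable
         * exactly; above that, adjacent integers start to collapse. */
        double limit = 9007199254740992.0;   /* 2^53 */

        printf("%d\n", limit - 1 != limit);  /* 1: 2^53 - 1 is still exact */
        printf("%d\n", limit + 1 == limit);  /* 1: 2^53 + 1 rounds back down to 2^53 */
        return 0;
    }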
Curious.
Maybe the clock was just feeling a little sluggish? /s
Only remember that because that's the limit for Windows 95…
Also, who else immediately noticed the AI-generated comment?
ouch
This meant that every single device would, seemingly at random, break completely: touchscreen, keyboard, modems, you name it. Everything broke. And since the modem was part of it, we would lose access to the device, which was very hard to recover from because maintenance teams were sometimes hours (& flights!) away.
It seemed to happen at random, and it was very hard to track down because we were also gearing up for an absolutely massive launch (hundreds of devices, and then a couple of months later, thousands) and had pretty much every conceivable issue thrown at us: faulty USB hubs, broken modems (which would also kill the USB hub if they pulled too much power), and I'm sure a bunch of other issues I've since forgotten.
Plus, since the problem took a week to manifest, we couldn't really iterate on fixes quickly - after deploying a "potential fix", we'd have to wait a whole week to actually see if it worked. I can vividly remember the joy I felt when I managed to get the issue to happen consistently within 2 hours instead of a week. I had no idea _why_, but at least I could now get a serviceable feedback loop.
Eventually, after trying to mess with every variable we could, and isolating this specific issue from the other ones, we somehow figured out that the issue was indeed a bug in the kernel, or at least in one of its drivers: https://github.com/raspberrypi/linux/issues/5088 . We had many serial ports and a pattern of opening and closing them which triggered the issue. Upgrading the kernel was impossible due to a specific vendor lock-in, and we had to fix live devices and ship hundreds of them in less than a month.
In the end, we managed to build several layers on top of this unpatchable ever-growing USB-incapacitating bug: (i) we changed our serial port access patterns to significantly reduce the frequency of crashes; (ii) we adjusted boot parameters to make it much harder to trigger (aka "throw more memory at the memory leak"); (iii) we built a system that proactively detected the issue and triggered a USB reset in a very controlled fashion (this would sometimes kill the network of the device for a while, but we had no choice!); (iv) if, for some reason, all else failed, a watchdog would still reboot the system (but we really _really_ _reaaaally_ didn't want this to happen).
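In case anyone wonders what "triggering a USB reset in a very controlled fashion" can look like on Linux (I don't know what their system actually did; the device path below is a placeholder): one common approach is the USBDEVFS_RESET ioctl on the device node under /dev/bus/usb.

    /* Minimal sketch of forcing a USB device reset from userspace on Linux
     * via the USBDEVFS_RESET ioctl -- one common way to recover a wedged
     * device without rebooting. Usually needs root. Not necessarily what
     * the parent commenter's system did; the device path is a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/usbdevice_fs.h>

    int main(void)
    {
        const char *dev = "/dev/bus/usb/001/002";  /* bus/device numbers from lsusb */

        int fd = open(dev, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, USBDEVFS_RESET, 0) < 0) {
            perror("USBDEVFS_RESET");
            close(fd);
            return 1;
        }

        close(fd);
        puts("device reset");
        return 0;
    }

Resetting an entire bus rather than a single device is typically done differently, e.g. by unbinding and rebinding the host controller driver via sysfs.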
In a way, even though these issues suck, it's when we are faced with them that we really grow. We need to grab our whole troubleshooting arsenal, do things that would otherwise feel "wrong" or "inelegant", and push through the issues. Just thinking back to that period, I'm engulfed by a mix of gratitude for how much I learned and an uneasy sense of dread (what if next time I can't figure it out?).
> at day 66 all our jobs started randomly failing
if there's a definable pattern, you can call it unpredictable, but you can't call it random.
But what they seem to be indicating is that all jobs fail on day 66. There's no randomness in evidence.