Yep, this kind of thing can happen. I found and reported incorrect gradients in Apple's Metal-backed TensorFlow conv2d in 2021 [1].
(Pretty sure I've seen incorrect gradients with another Pytorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to... )
One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than Pytorch, so I could be missing something.)
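For example, a bare-bones version of that kind of check in PyTorch terms (just a sketch; torch.autograd.gradcheck does the same thing more carefully):

    import torch

    # Compare autodiff gradients against central finite differences.
    # Use float64 so the finite-difference error stays small.
    def numeric_grad(f, x, eps=1e-6):
        g = torch.zeros_like(x)
        flat = x.view(-1)
        for i in range(flat.numel()):
            orig = flat[i].item()
            flat[i] = orig + eps
            plus = f(x).item()
            flat[i] = orig - eps
            minus = f(x).item()
            flat[i] = orig
            g.view(-1)[i] = (plus - minus) / (2 * eps)
        return g

    f = lambda t: (t * t).sum()
    x = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)
    f(x).backward()
    assert torch.allclose(x.grad, numeric_grad(f, x.detach().clone()), atol=1e-6)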
[1] https://github.com/apple/tensorflow_macos/issues/230
[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...
BTW, numerical differentiation can only be tested in a fairly limited way (due to the algorithmic cost once you're dealing with big matrices). It's much easier and more effective to test against multiple implementations.
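E.g., run the exact same op and backward pass on two backends and compare (rough sketch; assumes an MPS machine, but "cuda" vs "cpu" works the same way):

    import torch

    # Cross-implementation test: same op + backward on CPU and MPS, then compare.
    if torch.backends.mps.is_available():
        x_cpu = torch.randn(8, 16, requires_grad=True)
        x_mps = x_cpu.detach().clone().to("mps").requires_grad_(True)

        for x in (x_cpu, x_mps):
            torch.nn.functional.softmax(x, dim=-1).square().sum().backward()

        torch.testing.assert_close(x_cpu.grad, x_mps.grad.cpu(), rtol=1e-4, atol=1e-5)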
Not that I understand much of what they say, but it appears there are a lot of correctness bugs in PyTorch flying under the radar, probably with a measurable impact on model quality.
It would be interesting to compare the weights of the same model trained with the two, to see if they exhibit meaningfully different behavior.
What do you mean by "we're required to"? Isn't that something you do with all libraries, and something you as an engineer want to do anyway, at the very least to prove correctness? Personally I couldn't imagine using a 3rd party library without at least having some basic tests to confirm correctness; even when I use PyTorch I do the same.
Do you have any links to public write-ups about this? Because if it were true, it could mean a lot of research is invalidated, which would obviously make huge news.
Also, it feels like something that would be relatively easy to build reproducible test cases for, so it should be easy to prove whether it's true or not.
And finally, if something is easy to validate and would make huge news, I feel like someone would already have attempted to prove it, and if it were true, would have published something a long time ago.
In the academic sense, a model that happens to work isn't research; the product of research should be a technique or insight that generalizes.
"Standard technique X doesn't work in domain Y, so we developed modified technique X' that does better" is the fundamental storyline of many machine learning papers, and that could be 'invalidated' if the poor performance of X was caused by a hidden correctness bug avoided by X'.
A lot of research is unreproducible crap. That’s not news to anyone. Plus, bugs usually make results worse, not better.
So if PyTorch is full of numerical flaws, that would likely mean many models with mediocre/borderline performance were discarded (never published) because they just failed to meet the threshold where the authors felt it was worth their time to package it up for a mid-tier conference. A finding that many would-be mediocre papers are actually slightly less mediocre than believed would be an utterly unremarkable conclusion and I believe that's why we haven't seen a bombshell analysis of PyTorch flaws and reproducibility at NeurIPS.
A software error in, say, a stats routine or a data-preprocessing routine would be a different story: the degrees of freedom are fewer, so there's a greater chance of an error hitting a path that pushes a result to look artificially better rather than artificially worse.
TLDR: Python gevent compiled with -Ofast messes up x87 floating point unit state. Bad for PyTorch.
I’d bet the majority of ML people are unaware, including those doing lower level stuff.
> The exact same float32 code updates weights on CPU but fails on MPS
It's MPS... Exactly zero research is being impacted. Why doesn't the $3.9T corporation contribute more to torch?
> Checking the latest version revealed the bug was already fixed in v2.4, patched by an ML engineer at Apple last year using almost the exact same approach I’d used.
Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...
Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.
It sounds like a lot of work, but writing an optimized kernel takes far longer than writing the numerical gradient check and the simple kernel, and since in numerical code it's basically impossible to pin down the source of a bug without doing the equivalent of all those checks anyway, it only takes one bug in the whole project for the effort to pay off.
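To make that second layer concrete, here's a toy version of the "compare against the dead-simple implementation" check (here `a @ b` stands in for the optimized kernel, and the triple loop plays the role of the reference):

    import torch

    # Reference implementation: plain triple loop, no vectorization, nothing clever.
    def matmul_reference(a, b):
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        out = torch.zeros(m, n, dtype=a.dtype)
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    out[i, j] += a[i, p] * b[p, j]
        return out

    a = torch.randn(5, 7, dtype=torch.float64)
    b = torch.randn(7, 3, dtype=torch.float64)
    assert torch.allclose(a @ b, matmul_reference(a, b))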
I'll try to replace bits with simplified versions though; that could probably help at least get closer to where the issue is.
If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious", as I'm more or less about to lose my mind.
1. Numerical code is the canonical example of "functional" code. If you prove all the pieces correct then the result is also correct. If you prove one wrong then you know why your overall code is wrong. As such, focusing more heavily than normal on proving each piece correct is prudent. Use automated techniques (like numerical gradient checking), and use randomized inputs. It's easier than you'd think for your favorite special cases to be correct in both right and wrong algorithms. Your eyes will deceive you, so use the computer to do your spot checks.
2. I lied in (1). Especially once GPUs are involved, you start having to worry about variable lifetimes, UAF, double-free, uninitialized memory, accidental clobbering, and other ways in which an innocent "functional" computation can stomp on something else you're doing. Still start with all the checks from (1); if the parts are correct and the whole is broken, then you're messing up global state somewhere. Tracking that down is more art than science, but one technique is adding a "poison" field, tracking deinit counts, and otherwise exposing metrics around those failure modes (a tiny sketch of the idea follows this list). Panic/crash when you hit an invalid state, and once you figure out where the issue happens you can triage as normal (working backward from the broken state to figure out how you got there). With a solid memory-management strategy up-front you won't see this sort of thing, but if it's not something you've thought about, I wouldn't rule it out.
3. Not really another point, just an extension of (2): corruption can show up in subtle ways (like stack-copied pointers inside a paused async function closure that occasionally gets copied by your event loop). If global state is the issue, it's worth a full audit of the application.
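Here's a tiny Python-level sketch of the poison-field/deinit-counter idea from (2); in a real project this would live at whatever level actually manages your buffers:

    import torch

    # Wrap a buffer so use-after-free and double-free crash loudly instead of
    # silently corrupting results.
    class GuardedBuffer:
        def __init__(self, tensor):
            self._tensor = tensor
            self._alive = True
            self._free_count = 0

        def get(self):
            assert self._alive, "use-after-free: buffer accessed after release"
            return self._tensor

        def release(self):
            self._free_count += 1
            assert self._free_count == 1, "double-free detected"
            self._alive = False
            self._tensor = None  # poison the field so later reads fail fast

    buf = GuardedBuffer(torch.zeros(4))
    buf.get()       # fine
    buf.release()
    # buf.get()     # would now trip the use-after-free assertion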
E(loss).cuda() <= E(loss.cuda())
But then this joke might be flying above my head as well.
I say "consumer-visible" because the bugs still exist and people who can catch them early get promoted quickly and paid a lot. It's very exciting work if you can get it, since you really have to understand the full GPU to break it.
Good luck!!
Funnily, only a few days ago I was thinking about just how far the field has come since 2014 or so when you'd build a computational graph, initialize weights manually and so on, versus now, where you just have to use a library like Ultralytics or HuggingFace most of the time. Then I thought about just how many deep, undetected bugs there would be in this mountain of abstraction. Bugs that make the computation invalid.
I also had a very similar bug a while ago, broken gradients due to non-contiguous data for masked_select: https://github.com/pytorch/pytorch/issues/99638
In my case it was easier to identify: I had another implementation of my loss function before that did not use masked_select. But then I thought I could be clever and use masked_select to take out the non-masked frames and calculate the loss only on those. Except it wasn't working. Also, it only happened for some models, not for all. It turned out it was always happening when the data coming out of the model was non-contiguous.
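The pattern looked roughly like this (shapes and numbers made up, but the .contiguous() call was essentially the workaround):

    import torch

    # Loss computed only on non-masked frames via masked_select, where the model
    # output happens to be a non-contiguous view.
    out = torch.randn(4, 10, 8, requires_grad=True)
    out_t = out.transpose(1, 2)            # non-contiguous, like my model output was
    mask = torch.zeros(4, 8, 10, dtype=torch.bool)
    mask[:, :, :6] = True                  # keep only the first 6 frames

    loss = torch.masked_select(out_t, mask).pow(2).mean()
    loss.backward()                        # this was the sort of gradient that came out wrong

    # Workaround at the time: force contiguity before masked_select.
    loss_safe = torch.masked_select(out_t.contiguous(), mask).pow(2).mean()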
I think bugs with non-contiguous data are not so uncommon. I wonder how many of those we still have.
Meta, the creator and main contributor to PyTorch, does not use Macs for their day-to-day ML work (they focus on GPUs and CPUs), so the MPS backend is sadly incomplete and has errors like the one you see here.
EDIT: for the downvoters, I'll repeat: this is not a correct assessment of the relationship between Apple and PyTorch. But you can keep downvoting if you want <shrug>
https://x.com/soumithchintala/status/1978848796953161754
"MacStudio you ask?
Apple Engineering's *actual* time spent on PyTorch support hasn't given me confidence that PyTorch Mac experience would get anywhere close to NVIDIA's any time soon, if ever.
The Meta engineers continue to do a huge amount of heavy-lifting for improving the MPS backend, including feeling the responsibility for the Mac experience. Apple's priorities keep changing, the number of engineering hours they contribute keeps changing and their interest in actually and wholly owning the PyTorch MPS backend keeps varying.
If Apple wants MacStudio to become an actual AI devbox, and not just an AI inference machine, then prioritizing software support for PyTorch (>90% marketshare in AI) would probably be a good idea."
Even identical classes could help future folks know copying back is platform specific: “hm, we wrote to an OutputPlaceholder but didn’t read back from it, that seems wrong”.
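Purely hypothetical sketch of what I mean (names and details invented):

    import warnings
    import torch

    # A thin marker type so "we wrote to an OutputPlaceholder but never read it
    # back" stands out at the call site; the copy-back is the platform-specific step.
    class OutputPlaceholder:
        def __init__(self, device_buffer):
            self._device_buffer = device_buffer
            self._read_back = False

        def read_back(self):
            self._read_back = True
            return self._device_buffer.cpu()

        def __del__(self):
            if not self._read_back:
                warnings.warn("OutputPlaceholder was written but never read back")

    result = OutputPlaceholder(torch.ones(3))
    host_copy = result.read_back()   # forgetting this line would trigger the warning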
The landing page in our app used jqueryUI’s drag and drop support, back around the time they declared bankruptcy on the confusing buggy code and wouldn’t even accept bug fixes because they were replacing it component by component (which was taking almost 3x as long as predicted). We had columns you could drag items between but they had a max height and scroll bars and it turned out jqueryUI would let you drag items into different rows if the overflow area for adjacent drag targets overlapped your row.
The person who found it couldn’t fix it. The other fixer couldn’t fix it. I diagnosed it but the spaghetti code was a recursive mess and I could not find a spot where I could fix it. Especially given I couldn’t send in a patch to them.
So I spent half of my free time on the last day of every (2-week) sprint for almost six months before I finally found a small function I could monkey patch to wrap it in a short-circuit check for the clipping region. I spent maybe 20-30 hours on this, a lot of it just getting back to the same situation to debug. But it felt like it took forever to fix it.
The short circuit also made drag and drop faster; the slowness had been just on the edge of distracting, particularly on a crowded page.
Also remembering when Firebug for Firefox appeared, and made so many things so much easier. Suddenly things that took hours took days, and it was so much easier when you had some introspection tools.
God the bad karma for working with this crap. I'm glad it's over.
Inverse? Shouldn't it be things that took days took hours ?
I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.
What’s the trickiest bug you’ve ever run into?
I originally wrote "vanilla" there but didn't want to repeat that word twice in a row, so I swapped it for "standard" without realizing it would then look like the SGD acronym.
Just fixed that to avoid confusion, thanks for pointing it out!