Although unit testing an entire LLM is not really feasible right now, all these bugs were in small deterministic parts of the system. Load balancing, top-k probability calculations, and so on are all engineered parts no different from other software, and should in principle all be unit testable. At most you need an injectable PRNG. Yes, non-deterministic optimization bugs are awful, but I've personally found compiler and database bugs in the past using just regular app test suites. With CI you get a lot of runs, so rare events can still surface as long as you investigate flakes.
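To make that concrete, here is roughly the kind of test I mean, as a minimal sketch with invented helper names and plain NumPy: once the PRNG is injectable, the top-k path becomes an ordinary deterministic unit test.

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest logits, set the rest to -inf."""
    if k >= logits.size:
        return logits
    cutoff = np.sort(logits)[-k]
    return np.where(logits >= cutoff, logits, -np.inf)

def sample_token(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Top-k sampling with an injected PRNG so tests are reproducible."""
    filtered = top_k_filter(logits, k)
    probs = np.exp(filtered - filtered.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def test_top_k_never_picks_excluded_token():
    logits = np.array([5.0, 4.0, 3.0, -2.0, -3.0])
    rng = np.random.default_rng(0)  # fixed seed: fully deterministic test
    picks = {sample_token(logits, k=3, rng=rng) for _ in range(1000)}
    assert picks <= {0, 1, 2}  # tokens outside the top-3 must never appear
```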
A few days ago I commented on a thread about the Java launch that people often feel Java is "enterprisey" compared to Python because Java code is typically written to be heavily unit testable. A lot of abstraction is driven by the desire for dependency injection, for example. I contrasted that with scripting-language culture, where I've found testing is often either missing or kinda surface-level (e.g. mostly just asserting on types).
When I was learning PyTorch a few years ago I noticed the same thing. The tutorials took you from simple to complex stuff without talking much about how to test or best structure the code. This makes sense for ML research, where you don't have a clear goal and success boils down to maxing a score in some kind of human-driven eval, but it doesn't make sense for production deployment at scale.
I wonder if the AI labs could use more people with SRE and HA SWE backgrounds to focus on things like this. I'm kinda skeptical that more aggressive rolling evals-in-prod are the best way to avoid bugs like these happening again.
Even more than that, AI tends to mock _everything_. Mocking is useful, but the more real code a unit test invokes, the better, because the risk lies not only in the code itself but in its interactions, the interface. Yet AI writing Python will mock so heavily it barely tests even the code itself, with tautological assertions.
I've prompted with heavy warnings against mocking and pointed directly at thorough tests as examples. FWIW, Python does have excellent tools for injection, and you can write really nicely structured code with it.
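The pattern I try to steer it toward is plain constructor injection with a tiny fake instead of patching. A sketch, with names invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class FakeHttp:
    """Test double: stands in for a real HTTP client without any patching."""
    payload: dict
    def get_json(self, url: str) -> dict:
        return self.payload

class PriceFetcher:
    def __init__(self, http):  # injected; production code passes a real client
        self.http = http
    def latest_price(self, symbol: str) -> float:
        data = self.http.get_json(f"https://example.test/price/{symbol}")
        return float(data["price"])

def test_latest_price_exercises_the_real_parsing_path():
    fetcher = PriceFetcher(FakeHttp(payload={"price": "101.5"}))
    assert fetcher.latest_price("ACME") == 101.5
```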
> Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback.
Ok, makes sense, and glad to hear it.
> It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code
Ok, makes sense, and I’d expect that a human can then see the context in that case, although I hope it is still made very explicit to the end user (I’m not a Claude Code user, so I cannot comment).
> or you can use the "thumbs down" button in the Claude apps to do so
This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.
> I’m pretty surprised that Anthropic can directly impact the infra for AWS Bedrock as this article suggests.
We don't directly manage AWS Bedrock deployments today; those are managed by AWS.
> I can’t imagine the average person equates hitting this button with forfeiting their privacy.
We specify
> Submitting this report will send the entire current conversation to Anthropic for future improvements to our models.
in the thumbs down modal. Is there a straightforward way to improve this copy?
That was my understanding before this article. But the article is pretty clear that these were "infrastructure bugs" and the one related to AWS Bedrock specifically says it was because "requests were misrouted to servers". If Anthropic doesn't manage the AWS Bedrock deployments, how could it be impacting the load balancer?
When you click "thumbs down" you get the message "Submitting this report will send the entire current conversation to Anthropic for future improvements to our models." before you submit the report. I'd consider that pretty explicit.
Interesting, this implies that the 1M-context servers perform worse at low context. Perhaps this is due to some KV cache compression, eviction, or sparse attention scheme being applied on those 1M-context servers?
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
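For anyone curious what that advice looks like in practice, here's a rough sketch using Hugging Face transformers. The model id is a placeholder and the exact rope_scaling keys vary by model family, so treat this as illustrative rather than copy-paste:

```python
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "your-long-context-model"  # placeholder, not a real checkpoint

config = AutoConfig.from_pretrained(MODEL_ID)
# Only add YaRN scaling when you actually need long contexts; the quoted docs
# suggest sizing the factor to your typical context length (e.g. 2.0 for
# ~524k-token workloads on a model with a ~262k native window).
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": config.max_position_embeddings,
}
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, config=config)
```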
Despite this I'm still a paying customer because Claude is a fantastic product and I get a lot of value from it. After trying the API, it became a no-brainer to buy a 20x Max membership. The amount of stuff I've gotten done with Claude has been awesome.
The last several weeks have strongly made me question my subscription. I appreciate the openness of this post, but as a customer I'm not happy.
I don't trust that these issues are all discovered and resolved yet, especially the load balancing ones. At least anecdotally, I notice that around 12 ET (9 AM Pacific) my Claude Code sessions noticeably drop in quality. Again, I hope the team is able to continue finding and fixing these issues. Even running local models on my own machine at home I run into complicated bugs all the time; I won't pretend these are easy problems, they are difficult to find and fix.
Doesn't that say it all? At this point the quality of the AI trumps reliability for the customer (you and me), so even though of course they should (and I'm sure will) focus on it, why would they prioritise reliability over model quality right now?
If you trust this OpenRouter data the uptime record of these APIs is... not good to say the least: https://openrouter.ai/openai/gpt-5/uptime
It's clear to me that every provider is having enormous scale challenges. Claude Code often slows to a crawl and I have to interrupt it and tell it to try again.
This is especially pronounced around 4-6pm UK time (when we have Europe, Eastern US and West Coast US all hammering it).
Even today I was getting 503 "model overloaded" errors from Gemini AI Studio around that time, with nothing on the status page.
I really wonder if it would be worth Claude et al. offering a cheaper off-peak plan to try and level out demand. Perhaps the optics of that don't look good, though.
Edit to add: I think another potential dimension to this is that GB200s have been a lot slower to come on stream than the industry probably expected. There have been a lot of defects with various hardware and software components, and I suspect the liquid cooling has been difficult to get right (with far more catastrophic failure states!).
At this point, I'd be surprised if the different vendors on openrouter weren't abusing their trust by silently dropping context/changing quantization levels/reducing experts - or other mischievous means of delivering the same model at lower compute.
e.g. S3 has many times encountered increased error rates without reporting them. No one says anything about S3.
People will say many things, but their behaviour is to reward the lie. Every growth hack startup guy knows this already.
Can anyone explain to a layperson how this sort of thing is even possible for an LLM?
For normal code, of course stupid bugs happen all the time. You accidentally introduce an off-by-one error in a conditional, for example, or add an extra `goto fail`.
But LLMs aren't written by humans! Models are trained by automated programs over a period of many months across unfathomably massive data centers.
How would a human introduce a bug like the one described in TFA?
[1] Here is an example of two common approaches: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
I've honestly received the best results in creative writing by ignoring top_k/top_p and simply tuning temperature. Restricting my output to only common words causes everything to feel generic. But DeepSeek constantly breaks into Chinese/gibberish/ZALGO! when I push the temperature to 1.14.
This isn't related to the "recent issues" but I feel like it's useful advice for anyone trying out AI story creation.
As you discuss, training happens over a long period of time in a (mostly) hands-off fashion once it starts.
But inference? That’s a separate process which uses the trained model to generate responses, and it’s a runtime process - send a prompt, inference runs, response comes back. That’s a whole separate software stack, and one that is constantly being updated to improve performance.
It’s in the inference process that these issues were introduced.
Matches my experience. I use CC through our enterprise Vertex AI account and never noticed any degradation.
In general it seems like these bugs, while serious, were substantially less prevalent than anecdotal online reports would have you believe. We are really talking about a ~1-2 week window here where most issues were concentrated, with a relatively small percentage of total requests and total users impacted.
> Approximately 30% of Claude Code users had at least one message routed to the wrong server type, resulting in degraded responses.
> However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.
30% of Claude Code users getting a degraded response is a huge bug.
I would have appreciated it if they had released the full distribution of impact, though.
They don't give an upper bound though. 30% had at least one message degraded. Some proportion of that 30% (maybe most of them?) had some larger proportion of their messages (maybe most of them?) degraded. That matters, and presumably the reason we're not given those numbers is that they're bad.
Regardless of whether it’s to save money, it’s purposefully inaccurate:
“When Claude generates text, it calculates probabilities for each possible next word, then randomly chooses a sample from this probability distribution.”
I think the reason for this is that if you were to always choose the highest probable next word, you may actually always end up with the wrong answer and/or get stuck in a loop.
They could sandbag their quality or rate limit, and I know they will rate limit because I’ve seen it. But, this is a race. It’s not like Microsoft being able to take in the money for years because people will keep buying Windows. AI companies can try to offer cheap service to government and college students, but brand loyalty is less important than selecting the smarter AI to help you.
No, it's just the definition of sampling at non-zero temperature. You can set T=0 to always get the most likely token. Temperature trades off consistency for variety. You can set T to zero in the API; I assume the defaults for Claude Code and their chat are nonzero.
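For anyone who hasn't seen it spelled out, the mechanics are roughly this; a simplified NumPy sketch, not what any provider literally runs:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float,
                      rng: np.random.Generator) -> int:
    """Temperature sampling; temperature 0 collapses to greedy argmax."""
    if temperature == 0.0:
        return int(np.argmax(logits))      # greedy: always the most likely token
    scaled = logits / temperature          # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)
logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_next_token(logits, temperature=0.0, rng=rng))  # always token 0
print(sample_next_token(logits, temperature=1.0, rng=rng))  # any token, weighted by softmax
```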
You are absolutely right! Greedy decoding does exactly that for longer sequences: https://huggingface.co/docs/transformers/generation_strategi...
Interestingly, DeepSeek recommends a temperature of 0 for math/coding, which is effectively greedy.
How many users forget they have a sub? How many get a sub through work and don't use it often?
I'd bet a large number tbh based on other subscription services.
I typically read corporate posts as cynically as possible, since it's so common to word things in any way to make the company look better.
Glad to see an outlier!
Unless you consider service responsiveness a factor of integrity. Still waiting on a reply to a service message from the third week of May. I’m sure it’s right around the corner, though.
imho there's a big market gap for companies that are truly honest with customers instead of resorting to corporate gaslighting
I do think an independent service status monitor might be an easier stop-gap and could serve to improve honesty. It's not trivial.
Statistically, it's likely that the dip occurred at a point that wasn't too important. But what happens if the idiocy comes out at a critical point?
Kind of reminds me of the two alternate ways that time travel works in sci-fi. Does the small change to the past explode like a fission reaction, or does history heal itself?
Anywho, if errors do accumulate, I can see being very pissed off even with temporary idiocy from the model, as it means it poisons the context for the entire rest of the conversation.
[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
A Google search isn't deterministic. Neither is loading an upvote count on social media.
It's common advice in distributed systems to have a graceful degradation state instead of becoming unavailable. That wouldn't be possible in a system that's completely deterministic.
> to have a graceful degradation state instead of becoming unavailable. That wouldn't be possible in a system that's completely deterministic.
What does this even mean? I see no incompatibility between determinism and your ability to perform the same function more slowly. Determinism just means that the output of the system is solely dependent on the inputs: feed the same inputs, get the same outputs. If by a degraded state you're intentionally choosing to change your inputs, that doesn't change the determinism of your system.
When it is said that LLMs aren't deterministic, it's because the output token depends not only on the input context but also on all the other contexts processed in the same batch, because the kernels are written non-deterministically. If the kernels were written deterministically (so that the output only depended on your input context), then there wouldn't be a problem, and it also wouldn't change the ability of the system to degrade; it would be deterministic because capturing the input context and random seed would be sufficient. As it stands, you'd have to capture the interim states of the other inputs being processed in the same batch, and that interim-state problem is what makes it non-deterministic.
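A tiny self-contained illustration of the usual culprit: floating-point addition isn't associative, so a kernel whose reduction order changes with batch shape can produce slightly different logits for the very same context.

```python
import numpy as np

# Sum the same 10,000 float32 values in two different orders.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for v in x:
    forward += v

backward = np.float32(0.0)
for v in x[::-1]:
    backward += v

# The mathematical sum is identical, but the floating-point results usually
# differ in the last bits -- the same effect, inside a batched matmul or
# softmax reduction, is what makes "same prompt, different logits" possible.
print(forward, backward, forward == backward)
```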
As for Google search, it’s not clear to me it’s non-deterministic. When you Google the exact same thing twice you get exactly the same page of results and selected snippets. That suggests there’s more determinism in the system than you’re giving it credit for.
https://www.reddit.com/r/singularity/comments/1khxwjh/claude...
Was it 1% worse / unnoticeable? Did it become useless? The engineering is interesting, but I'd like to see it tied to actual impact.
I know I'll probably get push back on this, but it left a sour taste in my mouth when I paid for a $200 sub that felt like it was less useful than ChatGPT Plus ($20) at times.
Or to summarize: [south park "we're sorry" gif]
So to be fair, you are getting exactly what you paid for - a non-deterministic set of generated responses of varying quality and accuracy.
According to reports, users did not stop coming back even when the app was broken for hours.
A similar thing happened to me when playing some initial version of The Binding of Isaac on Linux, when it was made with Flash. Its performance wasn't the best but I couldn't stop playing.
So if people still return, maybe Anthropic has something great going on with Claude Code.
[1]: https://www.theguardian.com/technology/2016/jan/05/facebook-...
Calling the platforms A, B, and C might give us the insight we're missing and let us spot incongruous behaviors faster than trying to aggregate more generalized feedback.
“I refuse to believe what the people who would know the best said, for no real reason except that it doesn’t feel right” isn’t exactly the level of considered response we’re hoping for here on HN. :)
There are a thousand and one reasons why a company valued in the billions, with the eyes of the world watching, would not be completely honest in its public response.
In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.
The Three Issues

1. *Context window routing error*
   - Short-context requests sometimes routed to long-context servers.
   - Started small, worsened after load-balancing changes.

2. *Output corruption*
   - TPU misconfigurations led to weird outputs (wrong language, syntax errors).
   - Runtime optimizations wrongly boosted improbable tokens.

3. *Approximate top-k miscompilation*
   - A compiler bug in the TPU/XLA stack corrupted token probability selection.
   - Occasionally dropped the true top token.

Why It Was Hard to Detect

- Bugs were subtle, intermittent, and platform-dependent.
- Benchmarks missed these degradations.
- Privacy/safety rules limited access to real user data for debugging.

Fixes and Next Steps

- More sensitive, continuous evals on production.
- Better tools to debug user feedback safely.
- Stronger validation of routing, output correctness, and token selection.
Do their ToS really limit access to user data (prompt/response)? I don't remember seeing anything to that effect in their terms.
Layered in aggrandizement. You host a service, people give you money.
My criticism is it's 'puffy'. The 'scope and complexity' for a public postmortem is 'customer-facing'. Otherwise it's a tree/forest scenario.
One might say 'the lady doth protest too much'; this should be routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your decade.
Claude Code has made almost half a billion so far[1] (>$500M in ARR and it's like 9 months old), and 30% of all users were impacted at least once, just from the first routing bug. Scary stuff.
Their post-mortem is basically "evaluations are hard, we relied on vibe checking, now we are going to have even more frequent vibe checking". I believe it was indeed unintentional, but in a future where investors' money won't come down from the skies, serving distilled models will be very tempting. And you cannot be held to any SLA currently; it's just vibes. I wonder how enterprise vendors are going to deal with this going forward; you can't just degrade quality without the client or vendor even being able to really prove it.
[1] https://www.anthropic.com/news/anthropic-raises-series-f-at-...
We're firmly in the realm of "this thing is kind of smarter / faster at a task compared to me or my employees, so I am contracting it to do that task".
That doesn't mean 'if it fails, no payment'.
But I think it's too analogous to non-tech-products to hide behind a 'no refunds' policy. It's that good - there are consequences for it, I think.
The blog explains what issues they had and how they fixed them. This is good enough.
It's a material difference in the product, not just "a bug."