I've noticed recently that when I am using Opus at night (Eastern US), I am seeing it go down extreme rabbit holes on the same types of requests I put through on a regular basis. It is more likely to undertake refactors that break the code and then iterate on those errors in a sort of spiral. A request that would normally take 3-4 minutes turns into a 10-minute adventure before I revert the changes, call out the mistake, and try again. It will happily admit the mistake, but the pattern seems to be consistent.
I haven't performed a like-for-like test, which would be interesting, but has anyone else noticed the same?
The most reliable time to see it fall apart is when Google makes a public announcement that is likely to cause a sudden influx of people using it.
And there are multiple levels of failure: first you start seeing iffy responses of obviously lower quality than usual, and then if things get really bad you start seeing random errors where Gemini suddenly loses all of its context (even in a new chat), or it just starts failing at the UI level by not bothering to finish answers, etc.
The obvious likely reason for this is that when the models are under high load, the providers probably engage in a type of dynamic load balancing where they fall back to lighter models or limit the amount of time/resources allowed for any particular prompt.
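Just to make that concrete, here's a toy sketch of what that kind of fallback could look like (purely hypothetical; the model names, thresholds, and budgets are made up, not anything a provider has published):

```python
# Hypothetical sketch of load-based degradation: route to a cheaper model
# or cap per-request compute once cluster utilization crosses a threshold.
# None of these names or numbers come from any provider's actual system.
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str
    max_thinking_tokens: int

def route_request(cluster_utilization: float) -> RoutingDecision:
    """Pick a model and compute budget based on current load (illustrative only)."""
    if cluster_utilization < 0.7:
        return RoutingDecision(model="big-model", max_thinking_tokens=32_000)
    if cluster_utilization < 0.9:
        # Same model, but less time/compute allowed per prompt.
        return RoutingDecision(model="big-model", max_thinking_tokens=8_000)
    # Under heavy load, fall back to a lighter model entirely.
    return RoutingDecision(model="small-model", max_thinking_tokens=4_000)

print(route_request(0.95))
# RoutingDecision(model='small-model', max_thinking_tokens=4000)
```

Either lever (smaller model or tighter compute budget) would show up to the user as the same thing: noticeably dumber answers at peak hours.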
I just assume it went to the bar, got wasted, and needed time to sober up!
What Anthropic does do is poke the model to tell you to go to bed if you use it too long (the "long conversation reminder"), which distracts it from actually answering.
Sometimes they do have associations with things like the day of the year and might be lazier some months than others.
I jokingly (and not so) thought that it was trained on data that made it think it should be tired at the end of the day.
But it is happening daily and at night.
Now I don't know what to think.
LLM providers must dynamically scale inference-time compute based on current load because they have limited compute. Thus it's impossible for traffic spikes _not_ to cause some degradation in model performance (at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions -- it does not seem plausible that they are anywhere close to this).
They either overprovision servers during low demand, or they dynamically provision servers based on load.
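A back-of-the-envelope sketch of that tradeoff (all numbers invented for illustration):

```python
# Hypothetical back-of-the-envelope: why a sudden spike can still hurt even
# with autoscaling. Every number here is made up for illustration.
PEAK_RPS = 12_000          # requests/sec at evening peak
BASELINE_RPS = 5_000       # requests/sec off-peak
RPS_PER_NODE = 10          # throughput of one serving node at full quality
SCALE_UP_MINUTES = 15      # time to bring new capacity online

# Option 1: overprovision statically, i.e. pay for peak capacity 24/7.
static_nodes = PEAK_RPS / RPS_PER_NODE
print(f"Static overprovisioning: {static_nodes:.0f} nodes, mostly idle off-peak")

# Option 2: provision dynamically. Cheaper, but a spike outruns the scale-up,
# so requests get queued, truncated, or served by a lighter configuration
# until new capacity arrives ~SCALE_UP_MINUTES later.
dynamic_nodes = BASELINE_RPS / RPS_PER_NODE
shortfall = PEAK_RPS - dynamic_nodes * RPS_PER_NODE
print(f"Dynamic provisioning: {shortfall:.0f} req/s over capacity during the spike")
```

Either way the economics push toward degrading something (quality, latency, or both) during the spike rather than eating the cost of idle capacity the rest of the day.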
I clearly remember this problem happening in the past, despite their claims [1]. I initially thought it was an elaborate hoax, but it turned out to be factually true in my case.
[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
Now GPT-4.1 was another story last year: I remember cooking at 4am Pacific and feeling the whole thing slam to a halt as the US east coast came online.
This is my guess. Sometimes it churns through things without a care in the world, and other times it seems to be intentionally annoying, eating up the token quota without doing anything productive.
Kind of have to see which mode it's in before turning it loose unsupervised and keep an eye on it just in case it decides to get stupid and/or lazy.
What I find IS tied to time of day is my own fatigue, my own ability to detect garbage-tier code and footguns, and my patience, which runs short; if I am going to start cussing at Clod, it is almost always after 4 when I am trying to close out my day.
FWIW, I experienced it with Sonnet as well. My conspiracy brain says they're testing tuning the model to use up more tokens when they want to increase revenue, especially as agents become more automated. Making things worse == more money! Just like the rest of tech.
People have put forward many theories for this (weaker-model routing seems the most popular, be it a different model entirely, Sonnet or Haiku, or a lower-quantized Opus); Anthropic says none of it is happening.