There's also the memory side. A large model's weights have to live entirely in GPU VRAM (or be sharded across several GPUs) to run efficiently. You can't just "add more RAM" on the fly the way you can with CPU workloads. Scaling means acquiring and provisioning entirely new physical machines and loading tens to hundreds of gigabytes of weights onto each one, which takes minutes to hours, not seconds.
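To make that concrete, here's a rough back-of-envelope sketch of why a replica can't just be squeezed onto spare hardware. The parameter count, precision, and cache settings below are assumptions picked for illustration, not any particular provider's numbers:

```python
# Back-of-envelope VRAM estimate for serving a large transformer.
# Every number here is an illustrative assumption, not a real deployment.
import math

PARAMS = 70e9          # assumed parameter count (a 70B-class model)
BYTES_PER_PARAM = 2    # fp16/bf16 weights

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~140 GB just for weights

# The KV cache grows with concurrent requests and context length.
# Assumed architecture: 80 layers, 8 KV heads, head dim 128 (GQA-style).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT_TOKENS, CONCURRENT_REQS = 8192, 32
kv_gb = (2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_PARAM
         * CONTEXT_TOKENS * CONCURRENT_REQS) / 1e9  # ~86 GB at these settings

GPU_VRAM_GB = 80  # e.g. one H100
print(f"weights:   {weights_gb:6.1f} GB")
print(f"KV cache:  {kv_gb:6.1f} GB")
print(f"min GPUs:  {math.ceil((weights_gb + kv_gb) / GPU_VRAM_GB)}")
# No single 80 GB GPU holds this, so one replica already spans several
# GPUs, and "scaling out" means bringing up whole new multi-GPU machines.
```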
So you end up with a system that's simultaneously very expensive per request, very hard to scale horizontally in real time, and very sensitive to traffic spikes. That's a reliability engineer's nightmare even before you factor in the supply constraints the sibling comment mentioned.
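And to see why spike sensitivity and slow scaling compound each other, here's a toy arithmetic sketch; the traffic rates and provisioning time are hypothetical:

```python
# Toy model: a traffic spike hits while new capacity takes tens of
# minutes to come online. All numbers are hypothetical.

capacity_rps = 1000        # requests/sec the current fleet can serve
spike_rps = 1800           # incoming load during the spike
provision_secs = 30 * 60   # assumed time to bring up new GPU machines

excess_rps = spike_rps - capacity_rps
backlog = excess_rps * provision_secs  # requests queued while you wait

print(f"backlog by the time new capacity arrives: {backlog:,} requests")
# 800 rps * 1800 s = 1,440,000 queued requests. Nobody waits that long,
# so in practice the choice is shedding load (errors, rate limits) or
# paying for idle headroom around the clock.
```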
Another factor is simply that it's a new field: "move fast and break things" is still the default approach, since competition is fierce and the monetary stakes are enormous.
A pessimistic, but perhaps plausible, theory is that vibe-coding/slop is itself reducing their reliability.
A counterpoint is that regular services like GitHub seem to go down almost as frequently.