The frustrating part isn't the outages themselves — it's that the feedback loop for debugging them is so slow. A step fails, you read logs, make a guess, push, wait 5 minutes for the run to get to the same point, and repeat. There's no way to inspect the state of a runner mid-pipeline or retry a single step without re-running the whole workflow.
Reliability issues hurt more when your only debugging tool is "push and pray" (I'm a recovering VC).