Isn't this the time they accidentally deleted governmental databases? I love the attempt at blameless generalization, but wow.
https://cloud.google.com/blog/products/infrastructure/detail...
If GCP is composed of 10-30 services (hypothetically), then keeping 5-10 employees whose job is to ensure mega-deletes are safe is not too much of a cost.
If GCP is composed of 500 services, then it is all the more important to have humans in the loop to ensure correct behaviour, so that complex interacting services don't take a wrong action.
But the article itself contains no concrete examples.
https://kagi.com/search?q=STPA&r=no&sh=6ZXVCq1feUflSKjoBMMXm...
This seems so cool at a scale that I can't fathom. Tell me specifically how it's done at Google with regard to a specific service, at least enough information to understand what's going on. Make it concrete. Like "B lacks feedback from C": why is this bad?
You've told me absolutely nothing and it makes me angry.
The biggest danger is taking everything at face value and structuring your work or organization the same exact way based solely on these documents. The reality is that the vast majority of companies are not Google and will never encounter Google's problems. That's not where the value is, though.
Someone else made the point that the book itself is idealistic, a visionary document of what Google wants SRE to be; but from someone actually sitting in the SRE role at Google, the role is probably not exactly as described.
https://www.usenix.org/publications/loginonline/evolution-sr...
There is a feedback loop through D? And why does the same issue not apply to the missing directed edge from B to D?
EDIT: I figured it out on a reread: the vertical up/down orientation matters for whether an edge represents control vs. feedback, so B is merely not controlling D, which is fine. But if B is only controlling C as a way to get through to D (which is what I would have guessed, absent other information), what's the issue with that?
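Since neither the article nor the figure makes this concrete, here's a minimal sketch of what a "missing feedback" finding means in STPA terms. Everything here is invented for illustration (the article gives no edge sets); downward edges are control actions and upward edges are feedback, matching the orientation convention above:

    # Hypothetical control structure: A commands B, B commands C,
    # C commands D, and only D reports back up (to A). All names and
    # edges are made up for illustration, not taken from the article.
    control = {("A", "B"), ("B", "C"), ("C", "D")}
    feedback = {("D", "A")}  # a loop exists globally, but C never reports to B

    def missing_feedback(control, feedback):
        """Flag control edges whose controller receives no direct
        feedback from the component it commands (the classic STPA gap)."""
        return sorted((ctrl, proc) for ctrl, proc in control
                      if (proc, ctrl) not in feedback)

    for ctrl, proc in missing_feedback(control, feedback):
        print(f"{ctrl} commands {proc} but gets no feedback from {proc}")
    # prints A->B, B->C, C->D: every controller here is flying blind

If this reading is right, the global loop through D doesn't rescue B: B's process model of C is only corrected indirectly, via D and A, after a delay. A controller issuing commands whose effects it cannot directly observe is exactly the kind of unsafe control action STPA is meant to surface, which would be the issue with B "getting through to D" via C.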
The article spends paragraphs on some childhood radio repair story before awkwardly linking it to STPA, a safety analysis method that’s been around for decades. Google didn’t invent it, but they act like adapting it for software is a major breakthrough.
Most of the piece is just filler about feedback loops and control structures—basic engineering concepts—framed as deep insights. The actual message? "We made an internal training program because existing STPA examples didn’t click with Googlers." That’s it. But instead of just saying that, they pad it out with corporate storytelling, self-congratulation, and hand-wringing over how hard it is to teach people things.
The ending is especially cringe: "You can't afford NOT to use this!" A classic corporate play: take something mundane, slap on some urgency, and act like ignoring it is a reckless gamble.
TL;DR: Google is training engineers in STPA. That’s the whole story.
The breaking point for me (and why I left after almost a decade) was when people started getting high ratings for fixing things they had an original hand in causing. Honestly, the comfiest job in the world if you're a professional bullshitter.
Subsequent data collection showed an X% drop in outage frequency, clearly demonstrating readiness for promotion. Data driven.
Hmm..