Teaching a new way to prevent outages at Google
106 points | 1 month ago | 13 comments | sre.google | HN
hinkley
1 month ago
[-]
> In one particular case at Google, a software controller–acting on bad feedback from another software system–determined that it should issue an unsafe control action. It scheduled this action to happen after 30 days. Even though there were indicators that this unsafe action was going to occur, no software engineers–humans–were actually monitoring the indicators. So, after 30 days, the unsafe control action occurred, resulting in an outage.

Isn't this the time they accidentally deleted governmental databases? I love the attempt at blameless generalization, but wow.

reply
decimalenough
1 month ago
[-]
If you're referring to the time they nuked an Australian retirement fund's VMware setup, no, that was basically a billing screwup. An operator left a field blank, the system assumed that meant a 1-year expiry, and dutifully deleted it after 1 year was up.

https://cloud.google.com/blog/products/infrastructure/detail...
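
To make the failure mode concrete (a purely hypothetical sketch; the field names and defaults are made up, not GCP's actual billing code), it's the classic pattern of a blank input silently falling through to a destructive default:

    from datetime import datetime, timedelta

    DEFAULT_TERM = timedelta(days=365)  # hypothetical fallback used when the field is blank

    def provision(order_form):
        # Operator leaves "term" blank; instead of rejecting the order,
        # the system quietly substitutes a finite default and schedules deletion.
        term = order_form.get("term") or DEFAULT_TERM
        return {"delete_after": datetime.utcnow() + term}

    def provision_safe(order_form):
        # Safer variant: a blank term is an error, not a default.
        if not order_form.get("term"):
            raise ValueError("subscription term must be set explicitly")
        return {"delete_after": datetime.utcnow() + order_form["term"]}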

reply
nthingtohide
1 month ago
[-]
All mega-deletes should be authorised by a human. A person should have to type the word "delete" before the action takes place. Otherwise the decision is effectively taken by a void created by complex interacting systems.
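
A gate like that is cheap to build; here's a minimal sketch, assuming a CLI context (the function and prompt are made up, not any real GCP tooling):

    def confirm_mega_delete(resource_name, record_count, prompt=input):
        # Require a human to type the literal word "delete" before anything destructive runs.
        print(f"About to permanently delete {record_count} records in {resource_name}.")
        typed = prompt('Type "delete" to confirm: ')
        if typed.strip().lower() != "delete":
            raise PermissionError("mega-delete aborted: confirmation not given")
        return True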
reply
hinkley
1 month ago
[-]
Honestly, unless it's RTBF, no deletion should happen at all as long as you meet your reserve capacity of freshly silvered disks. Every defunct account should probably go to cold storage first.
reply
nthingtohide
1 month ago
[-]
We have sensible reasons to suggest this in both cases: simple and complex.

If GCP is composed of 10-30 services (hypothetically), then keeping 5-10 employees whose job is to ensure mega-deletes are safe is not too much of a cost.

If GCP is composed of 500 services, then it is all the more important to have humans in the loop to ensure correct behaviour, so that complex interacting services don't take a wrong action.

reply
cynicalsecurity
1 month ago
[-]
The most unbelievable thing about that case was that Google actually deleted data instead of keeping it forever and using it for ads.
reply
perching_aix
1 month ago
[-]
Username checks out.
reply
mimikatz
1 month ago
[-]
Thanks to all the people here pointing out how bloated, overly broad and useless this is. I went to read it thinking I would pick up something applicable, but it was written in such an overwrought, humanless style that I gave up, learning nothing, and thought the problem was me. I am glad to learn I am not alone.
reply
smcameron
1 month ago
[-]
> "The class itself is very well structured. I've heard about STPA in past years, but this was the first time I saw it explained with concrete examples. The Google example at the end was also really helpful."

But the article itself contains no concrete examples.

reply
eitland
1 month ago
[-]
If you'd like examples from outside Google, STPA seems to have been around for years:

https://kagi.com/search?q=STPA&r=no&sh=6ZXVCq1feUflSKjoBMMXm...

reply
irjustin
1 month ago
[-]
I don't understand and I really really want to.

This seems so cool at a scale that I can't fathom. Tell me specifically how it's done at Google with regard to a specific service, with at least enough information to understand what's going on. Make it concrete. Like "B lacks feedback from C", why is this bad?

You've told me absolutely nothing and it makes me angry.

reply
SlightlyLeftPad
1 month ago
[-]
This has really always been the case with Google philosophy docs. They tend to be very abstract and academic.

The biggest danger is taking everything at face value and structuring your work or organization the same exact way based solely on these documents. The reality is, the vast majority of companies are not Google and will never encounter Google’s problems. That’s not where the value is though.

reply
bbkane
1 month ago
[-]
Maybe less of a philosophy doc, but I found the Google SRE workbook to have plenty of helpful concrete examples
reply
SlightlyLeftPad
1 month ago
[-]
Of course, I’m not suggesting they don’t contain great examples. It’s just silly to apply the book wholesale to a company 1/100th the size of Google or even 1/10th the size of Google. It’ll almost never work verbatim. You must adapt it to the organization, resources, architecture you have and adjust the direction for where you need it to go.

Someone else made the point that the book itself is an idealistic, visionary document of what Google wants SRE to be; for someone actually sitting in the SRE role at Google, the role is probably not exactly as described.

reply
twalla
1 month ago
[-]
The other thing to consider is that, a lot of the time, these docs are guidelines or wishlists for the way things ought to be, while an outside observer will assume they describe the way things actually are.
reply
hinkley
1 month ago
[-]
This link at the bottom is less confusing:

https://www.usenix.org/publications/loginonline/evolution-sr...

reply
snorkel
1 month ago
[-]
In other words, STPA is a design-review framework for finding less obvious failure modes. FMEA is more popular, but it relies on listing all of the knowable failure modes in a system, so the failure modes you haven't thought of never make it onto the list. STPA helps fill in some of those gaps.
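
For a rough feel of the difference (a toy sketch under my own assumptions; the guide phrases are the standard STPA ones, the services and actions are invented): FMEA starts from failure modes you can already name per component, while STPA mechanically walks every control action through guide phrases and asks when it would become unsafe.

    # FMEA-style: enumerate the failure modes you already know about, per component.
    fmea = {
        "load balancer": ["drops packets", "routes to a dead backend"],
        "autoscaler": ["scales down under load"],
    }

    # STPA-style: take each control action and ask how it could be unsafe,
    # using the four standard guide phrases. Gaps nobody listed fall out of the grid.
    GUIDE_PHRASES = [
        "not provided when needed",
        "provided when it causes a hazard",
        "provided too early, too late, or out of order",
        "stopped too soon or applied too long",
    ]

    def unsafe_control_actions(control_actions):
        return [(a, g) for a in control_actions for g in GUIDE_PHRASES]

    for uca in unsafe_control_actions(["scale down", "drain traffic", "delete instance"]):
        print(uca)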
reply
primitivesuave
1 month ago
[-]
This would have been a lot more compelling had they provided a single real-world example of STPA actually solving a reliability issue at Google.
reply
MinelloGiacomo
1 month ago
[-]
STAMP/STPA work well as a model and methodology for complex systems; I was interested in them a while ago in the context of cyber risk quantification. Having a fairly easy model for reasoning about unsafe control actions is not a given in other approaches. I just wish they were adopted by more companies; I have seen too many of them stuck with ERM-based frameworks that make no sense most of the time when scaled down to system-level granularity.
reply
dooglius
1 month ago
[-]
> After working with the system experts to build this control structure, we immediately noticed missing feedback from controller C to controller B–in other words, controller B did not have enough information to support the decisions it needed to make

There is a feedback loop through D? And why does the same issue not apply to the missing directed edge from B to D?

EDIT: I figured it out on a reread: the vertical up/down orientation matters for whether an edge represents control vs feedback, so B is merely not controlling D, which is fine. But if B is only controlling C as a way to get through to D (which is what I would have guessed, absent other information), what's the issue with that?
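
Here's a minimal sketch of that reading (purely my interpretation of the A/B/C/D placeholders in the post, not the real structure): model control edges and feedback edges separately, then flag every control relationship that has no matching feedback path.

    # Hypothetical control structure: who issues control actions to whom,
    # and who sends feedback to whom (top-down vs bottom-up edges in the diagram).
    control = {"A": ["B"], "B": ["C"], "C": ["D"]}
    feedback = {"B": ["A"], "D": ["C"]}

    def missing_feedback(control, feedback):
        gaps = []
        for controller, controlled in control.items():
            for target in controlled:
                # A controller acting without feedback from what it controls is
                # working from a stale or incomplete process model.
                if controller not in feedback.get(target, []):
                    gaps.append((target, controller))
        return gaps

    print(missing_feedback(control, feedback))  # [('C', 'B')] -> B is flying blind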

reply
mianos
1 month ago
[-]
This is peak corporate drivel—bloated storytelling, buzzwords everywhere, and a desperate attempt to make an old idea sound revolutionary.

The article spends paragraphs on some childhood radio repair story before awkwardly linking it to STPA, a safety analysis method that’s been around for decades. Google didn’t invent it, but they act like adapting it for software is a major breakthrough.

Most of the piece is just filler about feedback loops and control structures—basic engineering concepts—framed as deep insights. The actual message? "We made an internal training program because existing STPA examples didn’t click with Googlers." That’s it. But instead of just saying that, they pad it out with corporate storytelling, self-congratulation, and hand-wringing over how hard it is to teach people things.

The ending is especially cringe: You can’t afford NOT to use this! Classic corporate play—take something mundane, slap on some urgency, and act like ignoring it is a reckless gamble.

TL;DR: Google is training engineers in STPA. That’s the whole story.

reply
sepositus
1 month ago
[-]
I'm not sure if things have changed over the past five years, but this is exactly the stuff you'd throw in a promotion packet or maybe in a performance (perf) review to hit that mythical "superb" rating.

The breaking point for me (and why I left after almost a decade) was when people started getting high ratings for fixing things they had an original hand in causing. Honestly, the comfiest job in the world if you're a professional bullshitter.

reply
dataflow
1 month ago
[-]
By "had a hand in causing" do you mean "they should have prevented it", or do you just mean "they were involved in the causation"? Because sometimes you're forced to do things you know are wrong, because that's what other people are making you do, and in that case you still "have a hand" in causing.
reply
praptak
1 month ago
[-]
Something in between. Like being pushed to implement a feature without the safety measures. When outages started to happen, they implemented an Outage Prevention Program, i.e. the safety measures that should have been there from the start.

Subsequent data collection demonstrated an X% drop in outage frequency, clearly demonstrating readiness for promotion. Data driven.

reply
sepositus
1 month ago
[-]
Exactly this.
reply
AStonesThrow
1 month ago
[-]
It's not easy or popular to link Dilbert these days, but there's a classic cartoon of the PHB announcing their bug bounty program for dev employees, and one of the fellows exclaims that he's going to “code his way to a minivan”!
reply
SlightlyLeftPad
1 month ago
[-]
What I’ve been seeing from Google’s products lately suggests that these are the only ones still there. It’s a house of cards built by professional bullshitters. Google’s culture has entered, or is already deep within, the bullshit era.
reply
z3t4
1 month ago
[-]
It will happen in any company that has monopoly status. If they start to struggle, they will just increase the rent.
reply
ikiris
1 month ago
[-]
You can’t swoop in and be a hero and make impact without a meteor.
reply
hansmayer
1 month ago
[-]
The point about basic engineering concepts is spot on. But I wonder how much it has to do with the influx of superficially educated "tech" people across the technology sector. Not to downplay the value of self-learning (I'm a bit of an autodidact myself), but the number of people who switch into the mythical "tech" field without ever having heard of a differential equation is worrying. Hence companies unfortunately really do seem to need to explain concepts like feedback loops to people who have only ever heard of them in the context of a performance review. The article itself is a word salad though; the start reads like an SEO-optimised cooking blog ;)
reply
tekla
1 month ago
[-]
Woah, hold up, why does anyone need to know math?
reply
agumonkey
1 month ago
[-]
Oh wow, a shallow, performative communication piece, in a way?
reply
pcdoodle
1 month ago
[-]
I'd love for Google to just go down and create a vacuum suction sound for a year...
reply
1970-01-01
1 month ago
[-]
Ctrl+F "DNS"

Hmm..

reply
croisillon
1 month ago
[-]
an early April fool's?
reply
ikiris
1 month ago
[-]
... So where's the training or examples of application?
reply
jldugger
1 month ago
[-]
I do see one example at the bottom of https://www.usenix.org/publications/loginonline/evolution-sr.... But I'm not sure it's particularly compelling?
reply