Try to fix it one level deeper (matklad.github.io)
123 points | 2 days ago | 11 comments

raphlinus | 2 days ago
The title immediately brings to mind the Ousterhout classic, "Always Measure One Level Deeper" [1], and I imagine it was probably inspired by it. Also worth revisiting.

[1]: https://cacm.acm.org/research/always-measure-one-level-deepe...

matklad | 1 day ago
I was not actually aware of the paper, and it is indeed pure gold, thanks for linking it. It is _also_ extremely timely, as with my current project (TigerBeetle), we are exactly at the point of transition from first-principles back-of-the-envelope performance architecture to measurement-driven performance engineering, and all the advice here is directly applicable!

WillAdams | 1 day ago
See "Book review: A Philosophy of Software Design (2020)" (johz.bearblog.dev)

https://news.ycombinator.com/item?id=27686818

for more in this vein.

andai | 2 days ago
I was reading about NASA's software engineering practices.

When they find a bug, they don't just fix the bug, they fix the engineering process that allowed the bug to occur in the first place.

anotherhue | 2 days ago
Maintenance is never as rewarded as new features; there's probably some MBA logic behind it, to do with avoiding commoditisation.

It's true in software, it's true in physical infrastructure (read about the sorry state of most dams).

Until we root-cause that process, I don't see much progress coming from this direction. On the plus side, CS principles are making their way into compilers. We're a long way from C.

hedvig23 | 2 days ago
Speaking of digging deeper, can you expand on that theory of why focus/man-hours spent on maintenance leads to commoditization, and why a company wants to avoid that?

asdff | 1 day ago
Given enough iteration with the same incentives, two engineering teams might end up with the same sort of product overall. We see this with airframes: we established the prototypical airframe for the commercial airliner in the 1950s and haven't changed it in the 70 years since. This is good for the airline but bad for the aircraft manufacturer. The airline can now choose between Boeing or Airbus or anyone else for their product. If Boeing had some novel plane design that wasn't copied the world over, then the airline company would be beholden to them alone.

anotherhue | 2 days ago
Off the top of my head: new things have unbounded potential, existing ones have known potential. We assume the new will be better.

I think it's part of the reason stocks almost always dip after positive earnings reports. No matter how positive the report, it's always less than the idealised version.

You might think there's a trick where you can sell maintenance as a new thing, but you've just invented the unnecessary rewrite.

To answer your question more directly: once something has been achieved, it's safe to assume someone else can achieve it too, so the focus turns to the new thing. Why else would we develop hydrogen or neutron bombs when we already had perfectly good fission ones? (They got commoditised.)

giantg2 | 2 days ago
"Maintenance is never as rewarded as new features,"

And security work is rewarded even less!

riknos314 | 2 days ago
> And security work is rewarded even less

While I do recognize that this is a pervasive problem, it seems counter-intuitive to me based on the tendency of the human brain to be risk averse.

It raises an interesting question of "why doesn't the risk of security breaches trigger the emotions associated with risk in those making the decision of how much to invest in security?".

Downstream of that is likely "Can we communicate the security risk story in a way that more appropriately triggers the associated risk emotions?"

SAI_Peregrinus | 2 days ago
What is the consequence for security breaches? Usually some negative press everyone forgets in a week. Maybe a lost sale or two, but that's hard to measure. If you're exceedingly unlucky, an inconsequential fine. At worst paying for two years of credit monitoring for your users.

What's the risk? The stock price will be back up by next week.

chii | 6 hours ago
The easy explanation is that the cost of a breach is externalized, so decision makers benefit from the savings of not investing in security.

Look at the CrowdStrike failure as a recent example, but there are plenty more in the past.

giantg2 | 2 days ago
The people making the decision don't face a direct negative impact. Someone's head might roll, but that's usually far up the chain, where the comp and connections are high enough not to care. The POs making the day-to-day decisions are under more pressure for new features than they are for security.

amonon | 1 day ago
It's easier to consider that people tend towards the conservative rather than the risk averse. If we were truly risk averse, society would be very different.

xelxebar | 2 days ago
This is such a powerful frame of mind. Bugs, software architecture, tooling choices, etc. all happen within organizational, social, political, and market machinery. A bug isn't just a technical failure, but a potential issue with the meta-structures in which the software is embedded.

Code review is one example of addressing the engineering process, but I also find it very helpful to consider business and political processes as well. Granted, NASA's concerns are very different from those of most companies, but as engineers and consultants, we have leeway to choose where and how to address bugs, beyond just the technical fix and immediate dev habits.

Soft skills matter hard.

asdff | 1 day ago
It makes you wonder if there's been work on designing software that is resilient to bugs. Maybe you could test this by writing a given function in a variety of different ways, simulating some type of bug (fat-fingering is probably easiest), and comparing outputs. Some of these functions might not work at all. Some might spit out the wrong result. But there will probably be a few written in such a way as to get very close to the true result, and maybe that variance is acceptable for your purposes. Given how we currently write code (in English, in a way a human can read), maybe it's not so realistic. But if we get to the point with generative code where you can produce good-quality machine code without having it transmuted into human-readable form for human verification, then this is how we would be operating: looking at distributions of results from a billion putative functions.
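
A toy sketch of what that comparison might look like (the idea exists in the literature as "N-version programming"). All the names here are hypothetical, invented for illustration, and mean_v3 is the deliberately fat-fingered variant:

    import statistics

    def mean_v1(xs):
        return sum(xs) / len(xs)

    def mean_v2(xs):
        total = 0.0
        for x in xs:
            total += x
        return total / len(xs)

    def mean_v3(xs):
        # The simulated fat-finger: len(xs) - 1 instead of len(xs).
        return sum(xs) / (len(xs) - 1)

    def vote(variants, xs, tolerance=1e-9):
        # Run every variant and accept the value the majority agrees on.
        results = [f(xs) for f in variants]
        candidate = statistics.median(results)
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if len(agreeing) <= len(results) // 2:
            raise RuntimeError(f"no majority among results: {results}")
        return candidate

    print(vote([mean_v1, mean_v2, mean_v3], [1.0, 2.0, 3.0]))  # prints 2.0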

toolz | 2 days ago
To that example though, is NASA really the pinnacle of achievement in their field? Sure, it's not a very competitive field (e.g. compared to something like the restaurant industry), and most of their existence has been about R&D for tech there wasn't really a market for yet. But still, SpaceX comes along and in a fraction of the time they're landing and reusing rockets, making space launches more attainable and significantly cheaper.

I'm hoping that example holds up, but I'm not well versed in that area, so it may be a terrible counter-example. My overarching point is this: overly engineered code often produces less value than quickly executed code. We're not in the business of making computers do things artfully just for the beauty of the rigor and correctness of our systems. We're doing it to make computers do useful things for humanity.

You may think that spending an extra year perfecting a pacemaker might end up saving lives, but what if more people die in the year before you go to market than would've died had you launched with something almost perfect, but with potential defects?

Time is expensive in so many more ways than just capital spent.

the_other | 2 days ago
SpaceX came along decades after NASA’s most famous projects. Would SpaceX have been able to do what they did if NASA hadn’t engineered to their standard earlier on?

My argument (and I'm just thought-experimenting here) is that without NASA's rigor, their programmes would have failed. Public support, and thus the market for space projects, would have dried up before SpaceX was able to "do it faster".

(Feel free to shoot this down: I wasn't there and I haven't read any deep histories of the companies. I'm just brainstorming to explore the problem space.)

euroderf | 20 hours ago
So I wonder if SpaceX has a lending library of all of the relevant (quality-related) NASA documents, printed out on dead trees. For light lunchtime reading.

sfn42 | 2 days ago
The fallacy here is that you're assuming that doing things right takes more time.

Doing things right takes less time in my experience. You spend a little more time up front to figure out the right way to do something, and a lot of the time that investment pays dividends. The alternative is to just choose the quickest fix every time until eventually your code is so riddled with quick fixes that nobody knows how it works and it's impossible to get anything done.

dsego | 1 day ago
It's tough to sell leaders and managers on the idea that there could be more benefit in quality and stability, at the cost of cutting scope and losing a few oh-so-indispensable features. But their incentive is to dream up imaginative OKRs and come up with deadlines to show visible progress and justify their roles until the next quarter.

sendfoods | 2 days ago
Which blog/post/book was this? Thanks

niccl | 2 days ago
In the course of interviewing a bunch of developers, and employing a few of them, I've concluded that this ability/inclination/something to do this deeper digging is one of the things I prize most in a developer. They have to know when to go deep and when not to, though, and that's sometimes a hard balancing act.

I've never found a good way of screening for the ability, and even more, for knowing when not to go deep, because everyone will come up with some example if you ask, and it's not the sort of thing I can see highlighting in a coding test (and _certainly_ not in a leet-code test!). If anyone has any suggestions on how to uncover it during the hiring process, I'd be ecstatic!

giantg2 | 2 days ago
"I've concluded that this ability/inclination/something to do this deeper digging is one of the things I prize most in a developer."

Where have you been all my life? It seems most of the teams I've been on value speed over future-proofing bugs. The systems-thinking approach is rare.

If you want to test for this, you can create a PR for a fake project. Make sure the project runs but has errors, code smells, etc. Include a few things like they talk about in the article, like a message about being out of disk space but missing the critical messaging/logging infrastructure to cover other scenarios. The best part is, you can use the same PR for all the levels you're hiring for, by expecting seniors to catch X% of the bugs, mids X/2%, and juniors X/4%.
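
For instance, a hypothetical seeded flaw along those lines (everything here is invented for illustration): the happy path works, but every failure is reported as "out of disk space" and nothing is ever logged.

    def save_report(path, data):
        try:
            with open(path, "w") as f:
                f.write(data)
        except Exception:
            # Seeded smell 1: a bare except that swallows every error class,
            # not just the disk-full case.
            # Seeded smell 2: no logging at all, so other failure scenarios
            # are invisible to operators.
            print("Error: out of disk space")

    save_report("/nonexistent-dir/report.txt", "hello")  # prints the misleading message

Juniors might spot the bare except; seniors should also flag the misleading message and the missing logging infrastructure behind it.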

jerf | 2 days ago
"It seems most of teams I've been on value speed over future proofing bugs."

So, obviously, if one team is future-proofing bugs, and the other team just blasts out localized short-term fixes as quickly as possible, there will come a point where the first team overtakes the second, because the second team's velocity will by necessity slow down more than the first's as the code base grows.

If the crossover point is ten years hence, then it only makes sense to be the second team.

However, what I find a bit horrifying as a developer is that my estimate of the crossover point keeps coming in closer. When I'm working by myself on greenfield code, I'd put it at about three weeks; yes, I'll go somewhat faster today if I just blast out code and skip the unit tests, but it's only weeks before I'm getting bitten by that. Bigger teams may have a somewhat farther crossover point, but it's still likely to be small single-digit months.

There is of course overdoing it and being too perfectionist, and that does get some people, but the people, teams, managers, and companies who always vote for short-term code blasting simply have no idea how much performance they are leaving on the table almost immediately.

Established code bases are slower to turn, naturally. But even so, I still think the constant short-term focus is vastly more expensive than those who choose it understand. And I don't even mean obvious stuff like "oh, you'll have more bugs" or "oh, it's so much harder to onboard", even if that's true... no, I mean that even by the only metric you seem to care about, the team that takes the time to fix fundamental issues and invests in better logging and metrics and all those things you think just slow you down can smoke you on dev speed after a couple of months... and they'll have the solid code base, too!

"Make sure the project runs but has error, code smells, etc."

It is a hard problem to construct a test for this, but it would be interesting to provide the candidate some code that compiles with warnings and just watch how they react to the warnings. You may not learn everything you need, but it'll certainly teach you something.
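
As a hypothetical Python-flavored version of that probe (assuming a Python shop; the snippet is invented for illustration): this file runs fine, but CPython 3.8+ emits SyntaxWarnings when it byte-compiles it, and you watch whether the candidate reacts to them.

    def check_user(name):
        # SyntaxWarning: "is" compares identity, not equality.
        if name is "admin":
            return True
        # SyntaxWarning: asserting a non-empty tuple is always true.
        assert(name != "", "name must not be empty")
        return False

    print(check_user("admin"))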

daelon | 2 days ago
Slow is smooth, smooth is fast.

ozim | 2 days ago
Unfortunately, I believe there is no crossover point, even at 10 years out.

If the quick fix works, it is most likely a proper fix; if it doesn't work, then you dig deeper. It's also a question of whether the feature being fixed is even worth spending that much time on.

rocqua | 2 days ago
A quick fix works now. It makes the next fix or change much harder, because it just added a special case, or ignored an edge case that wasn't possible in the configuration at the time.

ozim | 2 days ago
My main point is that it's a false dichotomy.

There is a bunch of stuff that could be "fixed better" or "properly" if someone took a closer look, but a lot of the time it is just good enough and is not somehow magically impeding a proper fix.

jerf | 1 day ago
It is and it isn't a false dichotomy.

It is a false dichotomy in that, in the Aristotelian sense where "X -> Y" means that absolutely, positively every X must with 100% probability lead to Y, it is absolutely true that "this is a quick fix -> this is not the best fix" is false. Sometimes the quick fix is correct. A quick example: I'm doing some math of some sort and literally typed minus instead of plus. The quick fix of changing minus to plus is reasonable.

(If you're wondering about testing, well, let's say I wrote unit tests that asserted the wrong behavior. I've written plenty of unit tests that turn out to be asserting the wrong thing. So the quick fix may involve fixing those too.)
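
A minimal worked version of that example, with hypothetical names: the fix really is one character, but the test suite was pinning the buggy behavior, so the quick fix has to touch both.

    def net_balance(deposits, fees):
        return sum(deposits) - sum(fees)  # was: sum(deposits) + sum(fees)

    def test_net_balance():
        # This previously asserted 130, the result of the buggy "+";
        # the quick fix includes correcting the expectation.
        assert net_balance([100, 20], [10]) == 110

    test_net_balance()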

It is true in the sense that if you plot the quickness of the fix versus the correctness of the fix, you're not going to get a perfectly uniform random two-dimensional scatter that would indicate they are uncorrelated. Some sort of Pareto-optimal [1] front will develop, becoming more pronounced as the problem and the minimum-size fix become larger (and they can get pretty large in programming). It'll be a bit loose; you'll get occasional outliers where you have otherwise fantastic code that just happened to have one tiny screw loose that caused a lot of problems everywhere, and one quick fix can fix a lot of issues at once. I think a lot of us will see those once or twice a decade or so. But for the most part, once you eliminate all the fixes that are neither terribly fast nor terribly good for the long term, a fairly normal "looks like 1/x" curve of tradeoffs between speed and long-term value will develop.

This is a very common pattern across many combinations of X and Y that don't literally, 100% oppose each other, but that in the real world, with many complicated interrelated factors and many different distributions of effort and value interacting, do contradict each other... but only if you are actually on the Pareto frontier! For practical purposes in this case I think we usually are, at least relative to the local developers fixing the bug; nobody deliberately sets out to make a fix that is visibly harder than it needs to be and less long-term valuable than it needs to be.

My favorite "false dichotomy" that arises is the supposed contradiction between security and usability. It's true they oppose each other... but only if your program is already roughly optimally usable and secure on the Pareto frontier and now you really can't improve one without diminishing the other. Most programs aren't actually there, and thus there are both usability and security improvements that can be made without affecting the other.

I'm posting this because this is one of those things that sounds really academic and abstruse and irrelevant, but if you learn to see it, becomes very practical and powerful for your own engineering.

[1]: https://en.wikipedia.org/wiki/Pareto_front

ozim | 13 hours ago
Well, I mostly work on systems where people don't have their lives on the line and don't care that much, like HN or Facebook. If I cannot post my comment, I can go on with my life.

Sometimes I get a "hey, you're making requests too quickly" while posting.

The proper fix would be a better check of whether I am really a bot or just cast a vote and wrote a quick comment - but no one is going to care enough.

Whatever the long-dead dude was saying.

marcosdumay | 1 day ago
My impression is that bigger teams have a shorter crossover point.

Weirdly, teams seem to adapt better to bad code. But that adaptation occurs through meetings, and meetings just destroy a team's productivity.

carlmr | 1 day ago
>Weirdly, teams seem to adapt better to bad code.

The greenfield team usually adapts well to its own buggy code. They know the system so well inside-out that if a bug pops up they have a general idea why.

This is bad, because with natural fluctuation in team members this institutional knowledge is lost. New members don't have the benefit of knowing about the whole evolution with all its quirks, and don't have the unit tests from the previous team to prevent regressions.

This then slows velocity to near zero as the team gets replaced, and leads to the inevitable rewrite.

marcosdumay | 18 hours ago
My experience is that people adapt differently to different issues, so in a team, people can specialize in the types of bad code they handle best. Code quality then degrades their productivity less than if each person had to deal with the entire diversity alone.

But that implies a division of work that is not aligned with any communication-reducing objective.

ozim | 2 days ago
I have seen enough BSers who claimed that they needed to "do the proper fix", doing analysis and wasting everyone's time.

They would be vocal about it and then spend weeks delivering nothing, "tweaking db indexes", while I could see immediately that the code was crap and needed slight changes - but I also don't have time to fight all the fights in the company.

giantg2 | 2 days ago
That's the thing: my comment wasn't about long analysis or doing the proper fix. It's all about asking whether this is the root cause or not, or whether there is a similar related bug not yet identified. You could find a root cause and bring it back to the team if the fix is going to take weeks. At that point the team has the say on whether that fix is necessary.

niccl | 2 days ago
That's a really good idea. Thanks

bongodongobob | 2 days ago
Knowing when to go down the rabbit hole is probably more about experience/age than anything. I work with a very intelligent junior who is constantly going down rabbit holes. His heart is in the right spot, but sometimes you just need to make things work and get things done.

I used to do it a lot too and I kind of had a "shit, I'm getting old" moment the other day when I was telling him something along the lines of "yeah, we could probably fix that deeper but it's going to take 6 weeks of meetings and 3 departments to approve this. Is that really what you want to spend your time on?"

Like you said, it's definitely a balancing act and the older I get, the less I care about "doing things the right way" when no one actually cares or will know.

I get paid to knock out tickets, so that's what I'm going to do. I'll let the juniors spin their wheels and burn mental CPU on the deep dives and I'm around to lend a hand when they need it.

layer8 | 2 days ago
However, you have to overdo it a sufficient number of times when you’re still inexperienced, in order to gain the experience of when it’s worth it and when it’s not. You have to make mistakes in order to learn from them.

giantg2 | 2 days ago
When it's worth it and when it's not seems to be more of a business question for the product owner. It's all opinion.

I've been on a team where I had 2 weeks left, and they didn't want me working on anything high priority during that time so it wouldn't be half-finished when I left. I had a couple of small stories I was assigned. Then I decided to cherry-pick the backlog to see how much tech debt I could close for the team before I left. I cleared something like 11 stories out of 100. I was then chewed out by the product owner because she "would have assigned [me] other higher priority stories". But the whole point was that I wasn't supposed to be on high-priority tasks, because I was leaving...

layer8 | 2 days ago
The product owner often isn't technical enough, or into the technical weeds enough, to be able to assess how long it might take. You need the technical experience to have a feeling for the effort/risk/benefit profile. You also may have to start going down the hole to assess the situation in the first place.

The product owner can decide how much time would be worth it given a probable timeline, risks and benefits, but the experienced developer is needed to provide that input information. The developer has to present the case to the product owner, who can then make the decision about if, when, and how to proceed. Or, if the developer has sufficient slack and leeway, they can make the decision themselves within the latitude they’ve been given.

giantg2 | 2 days ago
Yeah. The team agreed I should just do the two stories, which was what was committed to in that sprint. I got that done and then ripped through those other 11 stories in the slack time before I left the team. My TL supported that I didn't do anything wrong in picking up the stories. The PO still didn't like it.

seadan83 | 2 days ago
Why the product owner? (Rather than, say, the team lead?)

Are these deeply technical product owners? Which ones would be best placed to make this decision, and which less so?

giantg2 | 2 days ago
In a non-technical company with IT being a cost center, it seems that the product owner gets the final say. My TL supported me, but the PO was still upset.

rocqua | 2 days ago
Regardless, these deep dives are so valuable in teaching yourself, they can be worth it just for that.

userbinator | 2 days ago
Have you been asked "why do we never have the time to do it right, but always time to do it twice?"

sqeaky | 2 days ago
His response is likely something like "I'm an hourly contractor, I have however much time they want", or something with the same no-longer-gives-a-shit energy.

But their manager likely believes that deeper fixes aren't possible or useful for some shortsighted bean-counter reason. Not that bean counting isn't important, but the beans are often counted early and wrong.

bongodongobob | 2 days ago
Yeah, don't get me wrong, I'm not saying "don't care about anything and do a shitty job", but sometimes the extra effort just isn't worth it. I'm a perfectionist at heart, but I have to weigh the cost of meeting my manager's goals against getting behind because I want it to be perfect. Then 6 months later my perfect thing gets hacked apart by a new request/change. Knowing when and where to go deeper and when to polish things is a learned skill, and has more to do with politics and the internal workings of your company than with some ideal. Everything is in constant flux, and having insight into smart deep dives isn't some black-and-white general issue. It's completely context dependent.

thelostdragon | 2 days ago
"yeah, we could probably fix that deeper but it's going to take 6 weeks of meetings and 3 departments to approve this. Is that really what you want to spend your time on?"

This is where a developer goes from junior to senior.

variadix | 18 hours ago
Surely there is a way to present some piece of buggy code to a candidate and ask what's wrong with it (letting them use whatever tools they want; doing this on a whiteboard is senseless), let them determine what the bug is, and see how they fix it. Obviously there are constraints that make the code in question difficult to construct: it needs to be simple and small enough to fit in an interview without the candidate ever having seen the code before, but not so simple that it becomes a non-differentiating question; the bug has to have multiple levels to it, where there's an obvious fix and possibly several better but less obvious ways to fix the issue; etc.
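
A hypothetical specimen of what such a question could look like (all names invented; a sketch, not a vetted interview question): the symptom is obvious, and the fixes come in layers.

    def average_latency(samples):
        # The seeded, obvious bug: ZeroDivisionError when samples is empty.
        return sum(samples) / len(samples)

    def average_latency_v2(samples):
        # Level 1, the obvious fix: guard the empty case.
        return sum(samples) / len(samples) if samples else 0.0

    def average_latency_v3(samples):
        # Level 2, less obvious: returning 0.0 fabricates a measurement and
        # hides upstream problems; make "no data" explicit instead.
        if not samples:
            return None  # or raise a domain-specific error
        return sum(samples) / len(samples)

    print(average_latency_v2([]), average_latency_v3([10.0, 20.0]))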

iamcreasy | 1 day ago
Mike Acton said the same thing in an interview. He said curiosity is the best indicator of whether a candidate will be a good hire.

Casey Muratori was interviewing him at HandmadeHero Con back in 2016. Here is the snippet: https://youtu.be/qWJpI2adCcs?si=ezSKud42PC3Ub-UO&t=3112

atoav | 2 days ago
Such qualities can sometimes be unearthed when you ask candidates to deal with a problem they can't know the answer to. In the end, the ability to go deep has a lot to do with them being confident in their ability to understand things that are new to them.

Most people can go into a deep dive if you force them to do it, but how they conduct themselves while doing it can show you if this is a thing they would do on their own.

brody_hamer | 2 days ago
I learned a similar mantra that I keep returning to: “there’s never just one problem.”

- How did this bug make it to production? Where’s the missing unit test? Code review?

- Could the error have been handled automatically? Or more gracefully? (Sketched below.)
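
A small sketch of treating one bug as two problems, with hypothetical names: handle the error more gracefully at the boundary, and add the regression test whose absence let the bug ship.

    import json

    def load_config(text):
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            # More graceful: a clear, actionable message instead of a raw traceback.
            raise ValueError(f"config is not valid JSON (line {e.lineno}): {e.msg}") from e

    def test_load_config_rejects_bad_json():
        # The missing regression test, added alongside the fix.
        try:
            load_config("{ not json }")
        except ValueError as e:
            assert "not valid JSON" in str(e)
        else:
            raise AssertionError("expected ValueError")

    test_load_config_rejects_bad_json()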

Cpoll | 2 days ago
This kind of reminds me of https://en.m.wikipedia.org/wiki/Five_whys.

peter_d_sherman | 2 days ago
>"There’s a bug! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the bug is valuable, and it is pointing in the direction of a bigger related problem."

That is an absolutely stellar quote!

It's also more broadly applicable to life / problem solving / goal setting (if we replace the word 'bug' with 'problem' in the above quote):

"There’s a problem! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the problem is valuable, and it is pointing in the direction of a bigger related problem."

In other words, in life / problem solving / goal setting -- smaller problems can be really valuable, because they can be pointers/signs/omens/subcases/indicators of/to larger surrounding problems in larger surrounding contexts...

(Just like bugs can be, in Software Engineering!)

Now if only our political classes (on both sides!) could see the problems they typically treat as problems as effects, not causes (because that's what they all are: effects) of as-yet-unseen larger problems -- problems which those smaller problems point to, hint at, are subcases of, indicate (use whatever terminology you prefer...)

Phrased another way, in life/legislation/problem solving/Software Engineering -- you always have to nail down first causes -- otherwise you're always in "Effectsville"... :-)

You don't want to live in "Effectsville" -- because anything you change will be changed back to what it was previously in the shortest time possible, because everything is an effect in Effectsville! :-)

Legislating against something that is merely the effect of another, greater, as-yet-unseen problem -- will not fix the seen problem!

Finally, all problems are always valuable -- but if and only if their surrounding context is properly perceived...

So, an excellent observation by the author, in the context of Software Engineering!

Terr_ | 2 days ago
IMO it may be worth distinguishing between:

1. Diagnosing the "real causes" one level deeper

2. Implementing a "real fix" one level deeper

Sometimes they have huge overlap, but the first is much more consistently desirable.

For example, it might be that the most practical fix is to add some "if this happens, just retry" logic, but it would be beneficial to know--and leave a comment--that it occurs because of a race condition.
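
A sketch of that pairing, with hypothetical names: ship the practical retry, but record the deeper diagnosis next to it.

    import time

    def fetch_row(db, key, attempts=3, delay=0.05):
        # Practical fix: retry on a miss, with simple exponential backoff.
        #
        # Deeper diagnosis (documented, not yet fixed): the writer publishes
        # the key before committing the row, so a fast consumer can look up
        # a key that doesn't exist yet. The real fix is reordering
        # commit-then-publish on the writer's side.
        for attempt in range(attempts):
            row = db.get(key)
            if row is not None:
                return row
            time.sleep(delay * (2 ** attempt))
        raise KeyError(f"{key!r} not found after {attempts} attempts")

    print(fetch_row({"a": 1}, "a"))  # a dict standing in for the database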

hoherd | 1 day ago
This seems like the code implementation way of shifting left. https://news.ycombinator.com/item?id=38187879

KaiserPro | 2 days ago
You need to choose your rabbit holes carefully.

In large and complex codebases, it's often more pragmatic to build a guard in your local area against that bug than to follow the bug all the way down the stack.

It's not optimal, and doesn't make the system better as a whole, but it's the only way to get things done.

That doesn't mean you should be silent, though; you do need to contact the team that looks after that part of the system.

cantSpellSober | 1 day ago
In enterprise monorepos I find this hard because "one level deeper" is often code you don't own.

Fun article, good mantra!

kayvulpe | 20 hours ago
Echoes of "Hal fixing a light bulb" from Malcolm in the Middle