It's not enough for a program to work – it has to work for the right reasons
124 points | 2 days ago | 17 comments | buttondown.com
peterldowns
1 day ago
[-]
Haven't seen it mentioned here in the comments so I'll throw in — this is one of the best uses for code coverage tooling. When I'm trying to make sure something really works, I'll start with a failing testcase, get it passing, and then also use coverage to make sure that the testcase is actually exercising the logic I expect. I'll also use the coverage measured when running the entire suite to make sure that I'm hitting all the corner cases or edges that I thought I was hitting.

I never measure coverage percentage as a goal; I don't even bother turning it on in CI. But I do use it locally as part of my regular debugging and hardening workflow. Strongly recommend doing this if you haven't before.

I'm spoiled in that the golang+vscode integration works really well and can highlight executed code in my editor in a fast cycle; if you're using different tools, it might be harder to try out and benefit from it.
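
If you want to try the same idea in Python (the workflow above is Go + VSCode; this is just a rough analogue using coverage.py, with the module and test names invented), the programmatic API makes it easy to see exactly which lines one test exercises:

    # Minimal sketch with coverage.py (pip install coverage); mymodule is hypothetical.
    import coverage

    cov = coverage.Coverage(include=["mymodule.py"])  # only measure the code under test
    cov.start()

    from mymodule import classify          # import after start() so defs are measured
    assert classify(-1) == "negative"      # the single test case we want to inspect

    cov.stop()
    cov.save()
    cov.report(show_missing=True)  # "Missing" column = lines this test never executed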

reply
hinkley
1 day ago
[-]
I don’t mind coverage in CI except when someone fails builds based on reductions in coverage percentage, because that ends up squashing refactoring, and we want people doing more of that, not less.

Sometimes very well covered code is dead code. If it has higher coverage than the rest of the project, then deleting it removes, say, 1,000 lines of code at 99% coverage, which could reduce the overall percentage by 0.1%.

And even if it wasn’t 99% when you started, rewriting modules often involves first adding pinning tests, so replacing 1,000 lines with 200 new ones could first raise the coverage percentage and then drop it again at the end.

There are some things in CI/CD that should be charts, not failures, and this is one of them.

reply
maxbond
2 days ago
[-]
> It's not enough for a program to work, it has to work for the right reasons. Code working for the wrong reasons is code that's going to break when you least expect it.

This reminds me of the recent discussion of gettiers [1a][1b]. That article focused on Gettier bugs, but this passage discusses what you might call Gettier features.

Something that's gotten me before is Python's willingness to interpret a comma as a tuple. So instead of:

    my_event.set()
I wrote:

    my_event,set()
Which was syntactically correct, equivalent to:

    _ = (my_event, set())
The autoformatter does insert a space, though, which helps. Maybe it could be made to transform it as I did above; that would make it screamingly obvious.

[1a] https://jsomers.net/blog/gettiers

[1b] https://news.ycombinator.com/item?id=41840390

reply
csours
1 day ago
[-]
I was expecting to find the word Gettier in the text.

My comment on that Gettier post:

Puttiers: When a junior engineer fixes something, but a different error is returned, so they cannot tell if progress was made or not.

https://news.ycombinator.com/item?id=41850429

reply
hansvm
1 day ago
[-]
One pattern that eliminates a lot of such bugs is never using any name that's a keyword or common name in any mainstream programming language. The existence of `def set...` in your code was already asking for trouble, and you were unlucky enough to find it.
reply
maxbond
1 day ago
[-]
I agree, but this is a method on an object in the standard library!

https://docs.python.org/3/library/asyncio-sync.html#asyncio....

Could've been called fire() or activate(), perhaps. This is also the kind of problem lints are really good for. I wouldn't be surprised if there were a lint for this already (I haven't checked).

reply
hansvm
1 day ago
[-]
Oh that's annoying.
reply
bobbylarrybobby
1 day ago
[-]
This is ridiculous; not having access to a set() method is just silly. Also, I'd imagine that in most languages people use another language's function keyword to name a callable variable, func being (I'd assume) a particularly common choice. match is a perfectly good name for the result of a regex search... etc. etc.

The real issue here is simply that Python didn't camel-case its built-in class names; then this would only be a problem if you violated case conventions and named your method Set() (or mistyped the case as well).

reply
hansvm
1 day ago
[-]
Those are perfectly good names, but it's also incredibly easy to avoid a few hundred words. `f` or `predicate` or what have you are easy names for functions as arguments, and something like `maybe_match` or `maybe_parsed_username` is more likely to represent the results of an arbitrary regex.

You absolutely don't have to follow every piece of advice you read, but I'll argue for it once more anyway. My guiding principle is that I make mistakes from time to time, and those mistakes are incredibly expensive when they hit prod. If two things need to change in tandem, I'll derive one from the other in code so they're guaranteed to actually change together (or, at a bare minimum, put comments in both locations warning that they need to be changed in tandem, though with modern language features that's rarely necessary nowadays). If I don't need the result of a function call, I'll still give it a name and explicitly discard it, adding redundant bits of information for future readers -- indicating that I ignored the result intentionally -- and reducing future git blame search time in case the result is ever valuable or the API changes. When a chunk of data has constraints beyond being an arbitrary primitive, I create a wrapper type (free at runtime in any systems language) to let the compiler double-check my work. Those are all small things, but with a short library of such patterns the code I write is almost always correct if it compiles, and none of those habits are particularly hard to develop.
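
For instance, a small Python sketch of the explicit-discard and wrapper-type habits (all names here are invented for illustration):

    from typing import NewType

    UserId = NewType("UserId", int)  # wrapper type: an int that carries extra meaning

    def delete_user(user_id: UserId) -> bool:
        # Pretend deletion; returns whether anything was actually removed.
        return True

    # Explicitly name and discard the result we don't need, so future readers
    # (and git blame) can see the choice was deliberate rather than an oversight.
    _was_deleted_ignored = delete_user(UserId(42))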

Not using potentially reserved words similarly reduces the chance of certain errors. Kind of like the Swiss cheese model in aviation, you add one extra (fallible) layer of redundancy against bugs like foo,set(). Since it's a cheap mitigation, it's one I follow unless I have an amazing reason to do otherwise.

Yes, Python could have named things differently, but that ship has sailed. If I'm using Python (and it winds up being a good choice a few times a week currently) then I have to work around those design issues.

reply
praptak
1 day ago
[-]
So we have an analogy:

accidentally working app : correct app :: Gettier "not knowledge" JTB : proper knowledge JTB

Is it possible to backport the program analogy back into the realm of philosophy? I'm dreaming of a philosophy paper along the lines of "Knowledge is JTB with proper testing".

reply
HL33tibCe7
2 days ago
[-]
Your font and/or eyesight might need attention!
reply
maxbond
2 days ago
[-]
You know what, I do use a small font size in my editor. I like to see a lot of code at once. And if memory serves I spotted this in the browser, where I do the opposite.

I'll have to look into hyperlegible monospace fonts. Or maybe I'll just use Atkinson Hyperlegible and deal with the variable spacing.

reply
nomel
1 day ago
[-]
My editor uses a different color for comma and period.
reply
settsu
2 days ago
[-]
Tell me you're a 20-something engineer without telling me you're a 20-something engineer.
reply
lcnPylGDnU4H9OF
1 day ago
[-]
The implication being that older programmers would be entirely unconcerned with one's eyesight and the effect that reading a small font could have on such? Somehow that seems a bit backwards. People don't know what they have until it's gone.
reply
PaulHoule
1 day ago
[-]
Developing presbyopia was the first time I thought "getting old sucks".

I am doing an accessibility audit for an application at work right now, and I was tasked with getting things up to the WCAG AA level. But I found that I had huge amounts of unused white space, even viewing it at 1024x768 desktop, so I jacked font sizes up, not to the "large type" levels of AAA but substantially larger, which I think is easier for everyone. Given the size of the UI controls, information density is not going down a whole lot.

reply
HL33tibCe7
1 day ago
[-]
Wrong
reply
BerislavLopac
2 days ago
[-]
> How do I know whether my tests are passing because they're properly testing correct code or because they're failing to test incorrect code?

One mechanism to verify that is by running a mutation testing [0] tool. They are available for many languages; mutmut [1] is a great example for Python.

[0] https://en.wikipedia.org/wiki/Mutation_testing

[1] https://mutmut.readthedocs.io
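
A toy illustration of what a mutation tool checks (the mutant here is hand-written; mutmut and friends generate such edits automatically and rerun your suite against each one):

    # Code under test.
    def is_adult(age: int) -> bool:
        return age >= 18

    # A typical generated mutant: the boundary operator is changed.
    def is_adult_mutant(age: int) -> bool:
        return age > 18

    # A weak suite that only checks obvious cases can't tell them apart,
    # so the mutant "survives" and the tool reports a gap in your tests.
    assert is_adult(30) and is_adult_mutant(30)
    assert not is_adult(5) and not is_adult_mutant(5)

    # A boundary test "kills" the mutant: it passes for the real code
    # and would fail against the mutated version.
    assert is_adult(18)
    assert not is_adult_mutant(18)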

reply
layer8
2 days ago
[-]
That’s basically the approach mentioned in the article’s paragraph starting with “A broader technique I follow is make it work, make it break.”
reply
bunderbunder
2 days ago
[-]
If there's one thing that engineer engineers have considered standard practice for ever and ever, but software engineers seem to still not entirely grok, it's destructive testing.

I see this a lot with performance measurement, for example. A team will run small-scale benchmarks, and then try to estimate how a system will scale by linearly extrapolating those results. I don't think I've ever seen it work out well in practice. Nothing scales linearly forever, and there's no reliable way to know when and how it will break down unless you actually push it to the point of breaking down.

reply
johnnyanmac
1 day ago
[-]
I think it's because companies tend to segregate engineering from testing/QA. There are more features to work on, stress testing is QA's job, and those tickets will come later. Engineers will do some basic common tests to make sure it functions as expected, but aren't given the time or tools to really dig in and ensure it's truly robust.

It also reflects the domain. For mission-critical code there had better be 10 different layers of red lines between development and shipping. For web code, care for things like performance and even correctness can fall by the wayside.

reply
erik_seaberg
1 day ago
[-]
One of the things I like about the cloud is that it's relatively easy to spin up an isolated full-scale environment and find out where prod's redline probably is. On-prem hardware might have different bottlenecks.
reply
pron
1 day ago
[-]
Yep. And for Java: https://pitest.org
reply
dan-robertson
2 days ago
[-]
One general way I like to think about this is that most software you use has passed through some filter – it needed to be complete enough for people to use it, people needed to find it somehow (eg through marketing), etc. If you have some fixed amount of resources to spend on making that software, there is a point where investing more of them in reducing bugs harms one’s chances of passing the filter more than it helps. In particularly competitive markets you are likely to find that the most popular software is relatively buggy (because it won by spending more on marketing or features) and you are often more likely to be using that software (for eg interoperability reasons) too.
reply
TeMPOraL
2 days ago
[-]
Conversely, the occasional success Open Source tooling has is in large part due to it not competing, and therefore not being forced by competitive pressure to spend ~all resources on marketing and ~nil on development. I'm not sure where computing would be today if all software was marketing-driven, but I guess nowhere near as far as it is now.
reply
talldayo
1 day ago
[-]
> I'm not sure where computing would be today if all software was marketing-driven

Basically just look at the 80s and early 90s. Video games, C compilers, NAS software, operating systems, and hardware sales were all almost entirely marketing-driven. Before any serious Open Source revolution, you paid for almost any code that was perceived to have value. Built-in functionality was not something people took for granted.

Open Source won not because you can't market it (in fact, you can - it's just that nobody is paid to do it), but because it's free. The ultimate victory Linux wielded over its contemporaries was that you could host a web server without paying out the ass to do it. It turned out to be so competitive that it pretty much decimated the market for commercial OSes with word-of-mouth alone. It's less about their neglect of marketing tactics and more a reflection of the resentment for the paid solutions at the time.

reply
kazinator
23 hours ago
[-]
> If the code still works even after the change, my model of the code is wrong and it was succeeding for the wrong reasons.

If someone else wrote the code, your model of why it works being wrong doesn't mean anything is wrong other than your understanding.

Sometimes even if you wrote something that works and your own model is wrong, you don't necessarily have to fix anything: just learn the real reason the code works, go "oh", and leave it. :) (Revise some documentation and write some tests based on the new understanding.)

reply
krackers
17 hours ago
[-]
Is this not common practice? I'd expect a good engineer who cares about their work to be just as suspicious if something works _when it shouldn't_ as when something doesn't work when it should. Both indicate a mismatch between your mental model and what it's actually doing.
reply
foobar8495345
2 days ago
[-]
In my regression suites, I make sure I include an "always fail" test, to make sure the test infrastructure is capable of correctly flagging it.
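
A minimal pytest-flavored sketch of such a canary (the marker name is invented so the test can be selected or excluded with -m; you'd verify separately that the runner reports it as failed):

    import pytest

    @pytest.mark.canary  # hypothetical marker: excluded from normal runs
    def test_always_fails():
        # If the harness ever reports this as passing (or silently skips it),
        # the test infrastructure itself is broken.
        pytest.fail("canary: the runner must flag this test as a failure")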
reply
rzzzt
1 day ago
[-]
This opens up a philosophical can of worms. Does the test pass when it fails? Is it marked green or red?
reply
jerf
1 day ago
[-]
Not only philosophical, it can come out in the code too. I've written a number of testing packages over the years, and it's a rare testing platform that can assert that a test-failure assertion "correctly" fails without major hoop jumping, usually having to run that test in an isolated OS process and parse the output of that process.

This isn't a complaint; it's too marginal and weird a test case to complain about, and the separate OS process is always there as a fallback solution.

reply
heisgone
1 day ago
[-]
You want both. To test green and red pixels.
reply
TeMPOraL
1 day ago
[-]
So basically you want yellow? As it's what you get when you start testing red and green subpixels simultaneously.
reply
JohnMakin
1 day ago
[-]
"Task failed successfully!"
reply
joeyagreco
1 day ago
[-]
could you give a concrete example of what you mean by this?
reply
maxbond
1 day ago
[-]
Not GP but when I feel like I'm going crazy I insert an "assert False" test into my test suite. It's a good way to reveal when you're testing a cached version of your code for some reason (for instance integration tests using Docker Compose that aren't picking up your changes because you've forgotten to specify --build or your .dockerignore is misconfigured).

But I delete it when I'm done.

reply
singron
1 day ago
[-]
We once accidentally made a change to a Python project's test suite that caused it to successfully run none of the tests. Then we broke some stuff, but the tests kept "passing".

It's a little difficult to productionize an always_fail test, since you do actually want the test suite to succeed. You could affirmatively test that you have a non-zero number of passing tests, which I think is what we did. If you have an always_fail test, you could check that it's your only failure, but you have to be super careful that your test suite doesn't stop after a failure.
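
One rough way to productionize the affirmative check with pytest (the threshold is invented; pytest also exits with code 5 on its own when nothing at all is collected):

    # conftest.py
    import pytest

    MIN_EXPECTED_TESTS = 50  # hypothetical floor for this particular suite

    def pytest_collection_finish(session):
        # Fail loudly if collection silently found far fewer tests than expected,
        # e.g. because an import or naming change hid an entire test package.
        if len(session.items) < MIN_EXPECTED_TESTS:
            pytest.exit(
                f"only {len(session.items)} tests collected; expected at least {MIN_EXPECTED_TESTS}",
                returncode=1,
            )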

reply
maxbond
1 day ago
[-]
Maybe you could ignore that test by default, and then write a shell script to run your tests in two stages. First you run only the should-fail test(s) and assert that they fail. Then you can run your actual tests.
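
A sketch of that two-stage idea as a small Python driver (the canary marker name is invented, matching the sketch upthread):

    import subprocess
    import sys

    # Stage 1: run only the should-fail canaries and demand a failing exit code.
    canary = subprocess.run([sys.executable, "-m", "pytest", "-m", "canary"])
    if canary.returncode == 0:
        sys.exit("canary tests unexpectedly passed; the harness may not be running real tests")

    # Stage 2: run the real suite with the canaries excluded.
    sys.exit(subprocess.run([sys.executable, "-m", "pytest", "-m", "not canary"]).returncode)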
reply
SoftTalker
1 day ago
[-]
Sounds like the old George Carlin one-liner. Or maybe it's a two-liner:

The following statement is true.

The preceding statement is false.

reply
robotresearcher
1 day ago
[-]
Even older than George Carlin. The Liar Paradox is documented from at least 400BC.

https://en.m.wikipedia.org/wiki/Liar_paradox

reply
maxbond
1 day ago
[-]
I have to imagine it's about as old as propositional logic (so, as old as the hills).

I most closely associate it with Gödel and his work on incompleteness.

reply
marcosdumay
1 day ago
[-]
> We once accidentally made a change to a python project test suite that caused it to successfully run none of the tests.

That shouldn't be an easy mistake to make.

Your test code should be clearly marked, and ideally slightly separated from the rest of the code. Also, there should be some feedback about the number of tests that ran.

And yeah, I know Python doesn't help you with those things.

reply
eternityforest
1 day ago
[-]
I love unit tests, but I sometimes additionally manually step through code in the debugger, looking for anything out of place. If a variable does anything surprising then I know I don't understand what I just wrote.
reply
RangerScience
1 day ago
[-]
Colleagues: If the code works, it’s good!

Me: Hmmm.

Managers, a week later: We’re starting everyone on a 50% on-call rotation because there are so many bugs that the business is on fire.

Anyway, now I get upset and ask them to define “works”, which… they haven’t been able to do yet.

reply
tayo42
1 day ago
[-]
It's amazing that people put through changes without truly understanding what's going on.

I also don't understand how it's even done. Do you just guess until you get the result you kind of want, then make up a story for yourself explaining it?

reply
awesomerob
1 day ago
[-]
> I also don't understand how it's even done

Ooh, I know this one! "I asked shatGPT to write some code that does X..."

Then they ask that same bullshit-generator to explain the code, or write a test, etc.

reply
JohnMakin
2 days ago
[-]
There are few things that terrify me more at this point in my career than spending a lot of time writing something and setting it up, only to turn it on for the first time and have it work without any issues.
reply
twic
2 days ago
[-]
"If it ain't broke, open it up and see what makes it so bloody special." -- The BOFH
reply
norir
2 days ago
[-]
Yes and the first thing I might ask is "how can I break this?" If I can't easily break it with a small change, I've probably missed something.
reply
sudhirj
2 days ago
[-]
Oh god, this is such a nightmare. It takes much longer to build something that works on the first try, because then I have to force-simulate a mistake to make sure things were actually correct in the first place.

Test Driven Development had a fix for this, which I used to practice back in the day when I was evangelical about the one true way to write software. You wrote a test that failed, and added or wrote code only to make that test pass. Never add any code except to make a failing test pass.

It didn't guarantee 100% correct software, of course, but it prevented you from gaslighting yourself for being too awesome.
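
A minimal sketch of that red/green loop in Python (the function and test names are invented):

    # Step 1 (red): write a failing test before the implementation exists.
    def test_slugify_lowercases_and_joins_words():
        assert slugify("Hello World") == "hello-world"

    # Step 2 (green): add only enough code to make that test pass.
    def slugify(text: str) -> str:
        return "-".join(text.lower().split())

    # Re-run the test; it fails before slugify exists and passes afterward.
    test_slugify_lowercases_and_joins_words()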

reply
ipaddr
2 days ago
[-]
Tests are like the burning sun in your eyes after you wake up from a night of drinking.

I prefer separating the steps: writing some code down, making it functionally work on screen, and writing tests. I usually cover the cases in step 2, but when you add something new later it is nice to have step 3.

reply
shahzaibmushtaq
1 day ago
[-]
The author is simply talking about the most common testing types[0] but in a more philosophical way.

[0] https://www.perfecto.io/resources/types-of-testing

reply
RajT88
1 day ago
[-]
I once had a customer complain to me about how great Akamai's WAF was, because it never had false positives. (My company's WAF solution had many.)

Is that actually desirable? This article articulates my exact gut feeling.

reply
teddyh
2 days ago
[-]
> This is why test-driven development gurus tell people to write a failing test first.

To be precise, it’s one of the big reasons, but it’s far from the only reason to write the test first.

reply
klabb3
2 days ago
[-]
I’m increasingly of the opinion that TDD is only as good as your system is testable.

This means that the moment you write your first test is already too late. Testability is part of the core business logic architecture: the whiteboard stage.

If you can make it testable, TDD isn’t just good practice – it’s what you want to do because it’s so natural. Similar to how unit tests are already natural when you write hermetic code (like say a string formatter).

If, OTOH, your business logic is inseparable from prod databases, files, networking, current time & time zone, etc., then TDD and tests in general are both cumbersome to write and simultaneously deliver much less value (as in finding errors) per test case. Controversially, I think that for a spaghetti-code application tests are quite useless and largely ritualistic.

The only way I know how to design such testable systems (or subsystems) is through the “functional core, imperative shell” pattern. Not necessarily religious adherence to “no side effects”, but isolation is a must.
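
A minimal Python sketch of that shape (domain and names invented): the core is a pure function over plain values, and the shell owns the database, clock, and network.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    # Functional core: pure decision logic, trivially unit-testable.
    @dataclass
    class Subscription:
        expires_at: datetime

    def should_send_renewal_reminder(sub: Subscription, now: datetime) -> bool:
        # No clock, database, or network access in here; time is passed in.
        return timedelta(0) <= sub.expires_at - now <= timedelta(days=7)

    # Imperative shell: gathers inputs, applies the core, performs effects.
    def remind_expiring_subscribers(db, mailer):  # db and mailer are injected
        now = datetime.utcnow()
        for sub in db.fetch_active_subscriptions():   # hypothetical query
            if should_send_renewal_reminder(sub, now):
                mailer.send_renewal_reminder(sub)      # hypothetical effect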

reply
TechDebtDevin
2 days ago
[-]
This is my problem: I don't worry about tests until I'm already putting the marinara sauce on my main functions.
reply
treflop
1 day ago
[-]
Writing for reusability also tends to make software testable, in my experience. If you make excessively involved units of code, you can't test them, but you also can't reuse them.

And I'm big on reusability because I'm lazy. If requirements change, I'd rather tweak than rebuild.

reply
pdimitar
2 days ago
[-]
> If, OTOH, your business logic is inseparable from prod databases, files, networking, current time & time zone, etc, then TDD and tests in general are both cumbersome to write and simultaneously delivers much less value (as in finding errors) per test-case. Controversially, I think that for a spaghetti code application tests are quite useless and are largely ritualistic.

I don't disagree with this and have found it to be quite true -- though IMO it still has to be said that you can mock / isolate a lot of stuff, system time included. I'm guessing you already accounted for that when you said that tests can become cumbersome to write, and I agree. But we should still try, because there are projects where you can never get a truly isolated system to test. For example, I recently finished a contract where I had to write a server for dispatching SMS jobs to the right per-tenant and per-data-center instances of the actual connected-to-the-telco-network SMS servers; the dev environment was practically useless because the servers there did not emit half the events my application needed to function properly, so I had to record the responses from the production servers and use them as mocks in my dev-env tests.

Did the tests succeed? Sure they did, but they ultimately gave me almost no confidence. :/

But yeah, anyway, I agree with your premise; I just think that we should still go the extra mile to reduce entropy and chaos as much as we can, because nobody likes being woken up to fight a fire in production.

reply
lo_zamoyski
1 day ago
[-]
As the Dijkstrian expression goes, testing shows the presence of bugs, not their absence. Unit tests can show that a bug exists, but they cannot show you that there are no bugs, save for the particular cases tested, and even then only in a behaviorist sort of way (meaning your buggy code may still produce the expected output for the tested cases). For that, you need to be able to prove your code possesses certain properties.

Type systems and various forms of static analysis are going to increasingly shape the future of software development, I think. Large software systems especially become practically impossible to work with and impossible to verify and test without types.

reply
computersuck
2 days ago
[-]
Website not quite loading.. HN hug of death?
reply
ilrwbwrkhv
2 days ago
[-]
Buttondown is a great non-success. Ergo they are a good company.
reply
hwayne
2 days ago
[-]
I like buttondown because I can directly contact the developer when I have problems. Some downsides to small companies, lots of upsides too.
reply
sciencesama
2 days ago
[-]
Nvidia?
reply
kellymore
2 days ago
[-]
Site is buggy
reply
mobeigi
1 day ago
[-]
Works fine for me?
reply