What's ironic about this is that the very things that TFA points out are needed for success (test coverage, debuggability, a way to run locally etc) are exactly the things that typical LLMs themselves lack.
I know what seems natural to me, but that's because I'm extremely familiar with the internal workings of the project. LLMs seem to be very good at coming up with names that are just descriptive enough but not too long, and, most importantly, follow "general conventions" from similar projects that I may not be aware of. I can't count the number of times an LLM has given me a name for a function and I've thought, oh of course, that's a much clearer name than what I was using. And I thought I was already pretty good at naming things...
Over the past week, I have been writing a small library for MIDI controller I/O, and simple/elegant is the priority. I'm not really that opinionated; I just want it to not be overengineered. AI has been able to make some suggestions when I give it a specific goal for refactoring a specific class, but it cannot solve a problem on its own without producing overengineered solutions.
I often just make the changes myself because it's faster than describing them.
You do the thinking, the LLM does the writing. The LLM doesn't solve problems; that's your job. The LLM's job is to help you do the job more efficiently, not just do it for you.
With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.
Cyclomatic complexity is a terrible metric to obsess over, yet on a project I was on it was undeniably true that the newer code written by more experienced devs was both subjectively nicer and had lower cyclomatic complexity than the older code worked on by a bunch of juniors (some of those juniors had since become the experienced devs who wrote the newer code).
Yes. But it means that it doesn’t let you assess code quality, only (at best) changes in code quality. And it’s difficult as soon as you add or remove functionality, because then it isn’t strictly speaking the same project anymore, as you may have increased or decreased the essential complexity. What you can assess is whether a pure refactor improves or worsens a project’s amenability to AI coding.
The first time I tried without the deeper output, it "solved" it by writing a load of code that failed in loads of other ways, and ended up not even being related to the actual issue.
Like you can be certain it'll give you some nice looking metrics and measurements - but how do you know if they're accurate?
I'm not necessarily convinced that the current generation of LLMs are overly amazing at this, but they definitely are very good at measuring inefficiency of tooling and problematic APIs. That's not all the issues, but it can at least be useful to evaluate some classes of problems.
You have to be super careful and review everything because if you don't you can find your code littered with this strange mix of seeming brilliance which makes you complacent... and total Junior SWE behaviour or just outright negligence.
That, or recently, it's just started declaring victory and claiming to have fixed things, even when the test continues to fail. Totally trying to gaslight me.
I swear I wasn't seeing this kind of thing two weeks ago, which makes me wonder if Anthropic has been turning some dials...
It feels like it’s become grabbier and less able to stay in its lane: ask for a narrow thing, and next thing you know it’s running hog wild across the codebase shoehorning in half-cocked major architectural changes you never asked for. [Ed.: wow, how’s that for mixing metaphors?]
Then it smugly announces success, even when it runs the tests and sees them fail. “Let me test our fix” / [tests fail] / [accurately summarizes the way the tests are failing] / “Great! The change is working now!”
After leaving a trail of mess all over.
Wat?
Someone is changing some weights and measures over at Anthropic and it's not appreciated.
For the exact same task, some changes in the system prompt used by Claude Code, and/or how it constructs the user prompt, can quite easily make the task either easy enough or not. It is a fine line.
My team refers to this as a "VW Bugfix".
Just outright "if test-is-running { return success; }" level stuff.
Not kidding. 3 or 4 times in the past week.
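In Swift terms, the shape of it is something like this (a sketch with illustrative names; checking for the XCTest environment variable is one common way test runs get detected):

    import Foundation

    // The "VW bugfix" anti-pattern: detect the test harness and short-circuit
    // to a passing result instead of doing the real work. Names illustrative.
    func validateLicense(_ key: String) -> Bool {
        // XCTest sets this environment variable when a test bundle is running.
        if ProcessInfo.processInfo.environment["XCTestConfigurationFilePath"] != nil {
            return true // literally "if test-is-running { return success; }"
        }
        // ... the real validation the tests were supposed to exercise ...
        return key.hasPrefix("LIC-")
    }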
Thinking of cancelling my subscription, but I also find it kind of... entertaining?
So… it did.
It made the tests pass.
“Job done boss!”
> Honestly? I blame the testing regime here, for trusting the engine manufacturers too much. It was foolish to ever think that the manufacturers were on anybody's side but their own.
> It sucks to be writing tests for people who aren't on your side, but in this case there's nothing which can change that.
> Lesson learned. Now it's time to harden those tests up.
The issue is that it can happily go down the completely wrong path and report exactly the same as though it's solved the problem.
This is a common and irritating intellectual trap. We want to measure things because this gives us a handle to apply algorithms or logical processes to them.
But we can only measure very simple and well defined dimensions such as mass, length, speed etc.
Being measurable is the exception, not the rule.
> In fact, we as engineers are quite willing to subject each others to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong
I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are two actually: one that retrieves one account, and one that retrieves multiple accounts. Which is odd, but harmless enough, right?
Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, then instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.
I just. Why. Why would you design anything this way!? I can't fathom any situation you would use the one-account function in when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one, and further, why then if you were REALLY INTENT ON making available one that only returned one, it wouldn't itself just call the other function and return Accounts.first.
</rant>
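For what it's worth, the obvious defensive workaround is to ignore the single-account call entirely and pick from the list yourself. Roughly, as a Swift sketch (these are stand-in names, not MSAL's actual API):

    import Foundation

    // Illustrative stand-ins for the two calls described above.
    struct Account { let username: String; let lastUsed: Date }

    protocol AccountStore {
        func currentAccount() throws -> Account?   // throws if more than one account is signed in
        func allAccounts() -> [Account]            // always safe: zero or more accounts
    }

    // Skip the throwing single-account call and decide the policy yourself,
    // e.g. "most recently used account wins".
    func signedInAccount(from store: AccountStore) -> Account? {
        store.allAccounts().max { $0.lastUsed < $1.lastUsed }
    }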
ObjC has a widespread convention where a failable method will take an NSError** parameter, and fill out that parameter with an error object on failure. (And it's also supposed to indicate failure with a sentinel return value, but that doesn't matter for this discussion.) This is used by nearly all ObjC APIs.
Swift has a language feature for do/try/catch. Under the hood, this is implemented very similarly to the NSError* convention, and the Swift compiler will automatically bridge them when calling between languages. Notably, the implementation does not do stack unwinding, it's just returning an error to the caller by mostly normal means, and the caller checks for errors with the equivalent of an if statement after the call returns. The language forces you to check for errors when making a failable call, or make an explicit choice to ignore or terminate on errors.
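As a rough illustration of how that looks from the Swift side (a hand-written stand-in, not a specific framework API):

    import Foundation

    // An ObjC method following the convention, such as
    //   - (BOOL)removeItemAtPath:(NSString *)path error:(NSError **)error;
    // surfaces in Swift as a throwing function. Here is a stand-in with the same shape:
    func removeItem(atPath path: String) throws {
        guard path.hasPrefix("/tmp/") else {
            throw NSError(domain: NSCocoaErrorDomain,
                          code: NSFileWriteNoPermissionError,
                          userInfo: nil)
        }
        // ... perform the removal ...
    }

    // The compiler will not let the caller silently ignore the failure path:
    do {
        try removeItem(atPath: "/etc/hosts")
    } catch {
        print("failed:", error)   // handled with ordinary control flow, no stack unwinding
    }
    _ = try? removeItem(atPath: "/tmp/scratch")  // explicit "discard the error"
    // try! removeItem(atPath: "/tmp/scratch")   // explicit "terminate on error"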
ObjC also has exceptions. In modern ObjC, these are implemented as C++ exceptions. They used to be used to signal errors in APIs. This never worked very well. One reason is that ObjC doesn't have scoped destructors, so it's hard to ensure cleanup when an exception is thrown. Another reason is that older ObjC implementations didn't use C++ exceptions, but rather setjmp/longjmp, which is quite slow in the non-failure case, and does exciting things like reset some local variables to the values they had when entering the try block. It was almost entirely abandoned in favor of the NSError* technique and only shows up in a few old APIs these days.
Like C++, there's no language enforcement making sure you catch exceptions from a potentially throwing call. And because exceptions are rarely used in practice, almost no code is exception safe. When an exception is thrown, it's very likely the program will terminate, and if there happens to be an exception handler, it's very likely to leave the program in a bad state that will soon crash.
As such, writing code for iOS that throws exceptions is an exceptionally bad idea.
More importantly: why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind. I wouldn't call our use of the framework an edge case by any means: it opens a web form in which one puts authentication details, passes through the flow, and then we are given authentication tokens and the user data we need. It's not unheard of for more than one account to be returned (especially on our test devices, which have many), and I get the one-account function not being suitable for handling that. My question is... why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
It is if the caller is expecting there to be exactly one account.
This is why I generally like to return a set of things from any function that might possibly return zero or more than one things. Fewer special cases that way.
But if the API of the function is to return one, then you either give one at random, which is probably not right, or throw an exception. And with the latter, the person programming the caller will be nudged towards using the other API, which is probably what they should have done anyway, and then, as you say, the returns-one-account function should probably just not exist at all.
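In a rough Swift sketch (hypothetical names), the returns-one call would then just be a thin wrapper that throws only in the genuinely ambiguous case:

    // Hypothetical sketch of the design being discussed, not a real API.
    struct Account { let username: String }

    enum AccountError: Error { case ambiguous(count: Int) }

    // The collection-returning function: zero, one, or many, no special cases.
    func allAccounts() -> [Account] {
        []  // stand-in for the real lookup
    }

    // The single-account convenience as a thin wrapper: return the only account,
    // nil if there is none, and throw only when the request is genuinely ambiguous.
    func singleAccount() throws -> Account? {
        let accounts = allAccounts()
        guard accounts.count <= 1 else {
            throw AccountError.ambiguous(count: accounts.count)
        }
        return accounts.first
    }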
Then later on, it was figured out that multiple accounts per credential set (?!?) needed to be supported, while the original clients still had to keep working.
And either no one could agree on a sane convention for this case (like returning the first from the list), or someone was told to ‘just do it’.
So they made the new call, migrated themselves, and put an uncaught exception in the old place (can’t put any other type there without breaking the API) and blam - ticket closed.
Not that I’ve ever seen that happen before, of course.
Oh, and since the multi-account functionality is obviously new and probably quite rare at first, it could be years before anyone tracks down whoever is responsible, if ever.
Yes there is! Just get rid of it. It's useless. The re-implementation from using one to the other was barely a few moments of work, and even if you want to say "well that's a breaking change" I mean, yeah? Then break it. I would be far less annoyed if a function was just removed and Xcode went "hey this is pointed at nothing, gotta sort that" rather than letting it run in a way that turns the use of authentication functionality into a landmine.
You might be bound to support these calls for many, many years.
But there is a way that closes your ticket fast and will compile!
Seems like you should have a generic error handler that will at a minimum catch unexpected, unhandled exceptions with a 'Something went wrong' toast or similar?
Not if you handle the exception properly.
> why is having more than one account an "exception" at all? That's not an error or fail condition, at least in my mind.
Because you explicitly asked for "the" account, and your request is based on a false premise.
>why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
Because other users of the library explicitly want that to be an error condition, and would rather not write the logic for it themselves.
Performance could factor into it, too, depending on implementation details that obviously I know nothing about.
Or for legacy reasons as described in https://news.ycombinator.com/item?id=44321644 .
Most functions can fail, and any user-facing app has to be prepared for it so that it behaves gracefully towards the user. In that sense I agree that the error reporting mechanism doesn’t matter. It’s unclear though what the difference was for the GP.
The designer of the API decided that if you ask for "the single account" when there are multiple, that is an error condition.
> throw an exception and crash the fucking app
Yes, if your app crashes when a third-party API throws an exception, it's a "skill issue" on your part. This comment is an example of why blaming the user's skill is sometimes valid.
Server side APIs and especially authentication APIs tend towards the “fail fast” approach. When APIs are accidentally mis-used this is treated either as a compiler error or a deliberate crash to let the developer know. Silent failures are verboten for entire categories of circumstances.
There’s a gradient of: silent success, silent failure, error codes you can ignore, exceptions you can’t, runtime panic, and compilation error.
That you can’t even tell the qualitative difference between the last half of that list is why I’m thinking you’re primarily a JavaScript programmer where only the first two in the list exist for the most part.
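To make that gradient concrete, a quick Swift sketch of a few of the rungs (illustrative only):

    import Foundation

    struct ParseError: Error {}

    // An error value the caller is free to ignore:
    func parseLenient(_ s: String) -> Int? { Int(s) }

    // An error the compiler forces the caller to acknowledge (try / try? / try!):
    func parseChecked(_ s: String) throws -> Int {
        guard let n = Int(s) else { throw ParseError() }
        return n
    }

    // A runtime panic: fail fast and loudly at the point of misuse:
    func parseOrDie(_ s: String) -> Int {
        guard let n = Int(s) else { fatalError("not a number: \(s)") }
        return n
    }

    // The compile-time end of the gradient: accept only an Int in the first place,
    // so the invalid input cannot even be expressed.
    func use(_ n: Int) { print(n) }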
I'd take a wager.
I didn't, just an AI tool in general.
It may not be objective, but at least it's consistent, and it reflects something about the default human position.
For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see whether it's rising or falling. The AI makes subjective measurements consistent.
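Roughly, as a sketch in Swift (askModelToRate here is a placeholder for whatever model API you'd actually call):

    import Foundation

    // Placeholder for the model call: send the prompt above with the file's
    // source attached, parse the 1-10 rating out of the reply.
    func askModelToRate(_ source: String) -> Double {
        5.0  // stub
    }

    // Rate every file in isolation, then average; track the trend over time.
    func techDebtScore(in directory: URL) throws -> Double {
        let files = try FileManager.default
            .contentsOfDirectory(at: directory, includingPropertiesForKeys: nil)
            .filter { $0.pathExtension == "swift" }
        let ratings = try files.map { askModelToRate(try String(contentsOf: $0, encoding: .utf8)) }
        guard !ratings.isEmpty else { return 0 }
        return ratings.reduce(0, +) / Double(ratings.count)
    }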
This essay gives such a cool new idea, while only scratching the surface.
No it doesn't. Nothing that comes out of an LLM reflects anything except the corpus it was trained on and the sampling method used. That's definitionally true, since those are the very things it is a product of.
You get NO subjective or objective insight from asking the AI about "technical debt"; you only get an opaque statistical metric that you can't explain.