The claim is that a fast-moving, high-performing team has become a 10x fast-moving, high-performing team. That's the equivalent of two and a half years of development across a team.
Shall we expect the tangible results soon?
I'm perfectly willing to accept that AI coding will make us all a lot more productive, but I need to see the results.
I'm willing to believe it will make high-judgement, autonomous people more productive; I'm less sure it will scale to everyone. The author is one of the senior-most technical staff at AWS.
I know software people don't want to accept that, but it's almost always something on the business or administrative/management side of things.
Even for the programming bits, if your initial programmers suck (for some reason) but you have money, a great management team would just replace them with better programmers and fix the code mess with their help. So even that isn't a programming problem, it's a management problem.
And let's look at Twitter, which had atrocious code early on (fail whale galore), yet managed to build a profitable business due to amazing product-market fit, despite management incompetence.
Companies just need to pass a code quality bar which is much, much, much lower than the bar programmers set.
Before we just had code that devs don't know how to build securely.
Now we'll have code that the devs don't even know what it's doing internally.
Someone found a critical RCE in your code? Good luck learning your own codebase starting now!
"Oh, but we'll just ask AI to write it again, and the code will (maybe) be different enough that the exact same vuln won't work anymore!" <- some person who is going to be updating their resume soon.
I'm going to repurpose the term, and start calling AI-coding "de-dev".
I think that has already been true for some time for large projects that are continuously updated over many years, with lots of developers entering and leaving throughout, because nobody who has a choice wants to do that demoralizing job for long (I was one of them in the 1990s; the job was later given to an Indian H1B who could not easily switch to something better, not before putting in a few years of torture to have a better resume, and possibly a greencard).
Most famous post here, but I would like to see what e.g. Microsoft's devs would have to say, or Adobe's:
https://news.ycombinator.com/item?id=18442941
Such code has long been held together by the extensive test suites rather than intimate knowledge of how it all works.
The task of the individual developer is to close bug tickets and add features, not to produce an optimal solution or even to refactor. They long ago gave up on that as taking too long.
But the reality is that most of us will never work on anything that big. I think the biggest thing I've worked on was in the 500K LOC range, tops.
The code base is disproportionately testing automation, telemetry, and monitoring systems, but it's a lot of code nonetheless ;) So even solo/small-team projects depend on architecture, procedures, test suites, etc. over knowing every line of code.
I’d rather tell it as a joke than be blunt about the left tail of engineers being made redundant for life, slowly, but inevitably.
Haha, that already happens in almost any project after 2-3 years.
Now with AI you’ll be able to not understand your code in only 2-3 days.
The next release will reduce the time to confusion to 2-3 hours.
Imagine a future where you’ll be able to generate a million lines of code per second, and not understand any of it.
Rookie. Numbers.
With ADHD I lose all understanding of my code in 20-30 minutes
The problem is marketing.
The cycling industry is akin to audiophiles and will swear on their lives that a $15,000 bicycle is the pinnacle of human engineering. This year's bike will go 11% faster than the previous model. But if you read the last 10 years of marketing materials and do the math, it should basically ride itself.
There's so much money in AI right now that you can't really expect anyone to say "well, we had hopes, but it doesn't really work the way we expected". Instead you have pitch after pitch, masses parroting CEOs, and everyone wants to get a seat on the hype train.
It's easy to dispel audiophiles or carbon enthusiasts but it's not so easy with AI, because no one really knows how it works. OpenAI released a paper in which they stated, sorry for paraphrasing, "we did this, we did that, and we don't know why results were different".
I am working on a legacy project. This is already the case!
It's like with AI images, where they look plausible at first, but then you start noticing all the little things that are off in the sidelines.
writing code is the easiest part of software development. Reviewing code is so much more difficult than writing it
A lot of people say this, and I do not doubt that it is fully true in their real experience. But it is not necessarily the only way for things to be. If more time and effort were put into writing code which is easier to review, the difficulty of writing it would increase and the difficulty of reading it would decrease, flipping that equation. The incentives just aren't like that. It doesn't pay to maximize readability against time spent writing: not every line will have to be reviewed, and not every line that has to be reviewed will be so complex that readability needs to be perfect to be maintainable.
And regarding "not every line will have to be reviewed, and not every line that has to be reviewed will be so complex that readability needs to be perfect to be maintainable.", the problem with AI is that code becomes basically unknowable.
Which is fine if everything that is built is slop, but many things aren't slop. Stuff that touches money, healthcare, personal relationships, etc. (you know, the things that matter in life) risks all turning into slop, which *will* have real-life consequences.
We'll start seeing this in a few years.
This sounds a lot like Tesla's Fake Self Driving. It self drives right up to the crash, then the user is blamed.
Part of being a mature engineer is knowing when to use which tools, and accepting responsibility for your decisions.
It's not that different from collaborating with a junior engineer. This one can just churn out a lot more code, and has occasional flashes of brilliance, and occasional flashes of inanity.
By the people who are disclaiming it, yes.
I naively believed that we'd start building black boxes based on requirements and sets of inputs and outputs, and that the sudden changes of heart from stakeholders (which for many of us happen on a daily basis and mandate an almost complete reimagining of the project architecture) would simply require another pass of training with new parameters.
Instead, the mainstream is pushing a hard reality where we mass-produce a ton of code until it starts to work within guard rails.
Does it really work? Is it maintainable?
Get out of here. We're moving at 200mph. Karpathy is bullish on everything bleeding edge, and unfortunately it kinda shows when you know the material better than he does (source: I've been lecturing on all of it for a few years now). I'm not saying this is bad. It's great to see people who are engaging and bullish; it's better than most futurists waving their hands and going "something, something warp drive".
But when you take a step back and really ask what is going on behind the scenes, all we have is massive statistical tools performing neato tricks at statistical probability to predict patterns. There's no greater understanding or ability to learn or mimic. YET. The transformer, for instance, can't easily learn complex mathematical operations. There's a Google paper on "learning" multiplication, and I know people working on building networks to "learn" sin/cos from scratch. But given these basic limitations, and pretty much every, single, paper out of Apple "intelligence" crapping on the buzz, we've pretty much hit a limit beyond being the first company to allow for multi-trillion-token parsing (or basic, limited, token-parsing memory) for companies to capture and retrieve information.
I'm not quite sure why everyone seems to want the AIs to be writing typescript - that's a language designed for human capabilities, with all the associated downsides.
Why not Prolog? APL? Something with richer primitives and tighter guardrails that is intrinsically hard for humans to wrangle with.
That source is bearing a lot of weight.
This makes Karpathy look worse, not better.
I just think he puts on very rose tinted glasses when looking to the future rather than seeing the problems hitting ML model design/implementation now. We had a great leap forward with Attention, it woke an entire industry up by giving them something solid to lean on. But it also highlights we should see a _lot_ more pollination of ideas between maths, sciences, stats and comp-sci rather than re-inventing the wheel in every discipline.
I switched back to Rails for my side project a month ago, and AI coding has been great when doing not-too-complex stuff. Meanwhile, the old NextJS code base was in shambles.
Before I was still doing a good chunk of the NextJS coding. I’m probably going to be directly coding less than 10% of the code base from here on out. I’m now spending time trying to automate things as much as possible, make my workflow better, and see what things can be coded without me in the loop. The stuff I’m talking about is basic CRUD and scraping/crawling.
For serious coding, I’d think coding yourself and having ai as your pair programmer is still the way to go.
> These aren't just implementation details - they're architectural choices that ripple through the codebase.
> The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable.
Enjoyed the article and the points it brought up. I do find it uncanny that this article about the merits and challenges of AI coding was likely written by ChatGPT.
The way to code going forward with AI is Test Driven Development. The code itself no longer matters. You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements. That's it. The new reality us programmers need to face is that code itself has an exact value of $0. That's because AI can generate it, and with every new iteration of the AI, the internal code will get better. What matters now are the prompts.
I always thought TDD was garbage, but now with AI it's the only thing that makes sense. The code itself doesn't matter at all, the only thing that matters is the tests that will prove to the AI that their code is good enough. It can be dogshit code but if it passes all the tests, then it's "good enough". Then, just wait a few months and then rerun the code generation with a new version of the AI and the code will be better. The humans don't need to know what the code actually is. If they find a bug, write a new test and force the AI to rewrite the code to include the new test.
I think TDD has really found its future now that AI coding is here to stay. Human code doesn't matter anymore and in fact I would wager that modifying AI generated code is as bad and a burden. We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
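To make that concrete, here's a minimal sketch of what "the prompt is the tests" could look like (the invoice module and total_with_tax are made-up names, purely for illustration): the human reviews only this file, and the AI regenerates the implementation until it goes green.

```python
# Spec-as-tests sketch: the only human-authored artifact.
import pytest
from invoice import total_with_tax  # hypothetical module the AI must generate

def test_applies_standard_rate():
    # 100.00 net at a 20% rate should come to 120.00 gross
    assert total_with_tax(100.00, rate=0.20) == pytest.approx(120.00)

def test_zero_rate_leaves_amount_unchanged():
    assert total_with_tax(59.99, rate=0.0) == pytest.approx(59.99)

def test_negative_amount_is_rejected():
    # The spec says refunds go through a different path, so this must raise.
    with pytest.raises(ValueError):
        total_with_tax(-5.00, rate=0.20)
```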
> with every new iteration of the AI, the internal code will get better
This is a claim that requires proof; it cannot just be asserted as fact. Especially because there's a silent "appreciably" hidden in there between "get" and "better" which has been less and less apparent with each new model. In fact, it more and more looks like "Moore's law for AI" is dead or dying, and we're approaching an upper limit where we'll need to find ways to be properly productive with models only effectively as good as what we already have!
Additionally, there's a relevant adage in computer science: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." If the code being written is already at the frontier capabilities of these models, how the hell are they supposed to fix the bugs that crop up, especially if we can't rely on them getting twice as smart? ("They won't write the bugs in the first place" is not a realistic answer, btw.)
Additionally, while the intelligence floor is shooting up and the intelligence ceiling is very slowly rising, the models are also getting better at following directions, writing cleaner prose, and their context length support is increasing so they can handle larger systems. The progress is still going strong, it just isn't well represented by top line "IQ" style tests.
LLMs and humans are good at dealing with different kinds of complexity. Humans can deal with messy imperative systems more easily assuming they have some real world intuition about it, whereas LLMs handily beat most humans when working with pure functions. It just so happens that messy imperative systems are bad for a number of reasons, so the fact that LLMs are really good at accelerating functional systems gives them an advantage. Since functional systems are harder to write but easier to reason about and test, this directly addresses the issue of comprehending code.
How many times have you seen a code change that “passed all the tests” take down production or break an important customer’s workflow?
Usually that was just a relatively small change.
Now imagine that you regenerated literally all the code.
The code is the spec. Any other spec comprehensive enough to cover all possible functionality has to be at least as complex as the code.
Put another way, if your TDD tests always pass then there's no point in writing them, and there are no known bugs before you have any code. So discovering future bugs that didn't exist when you wrote those tests is the point.
TDD is useful to build some initial "guard rails" when writing new code and it's useful to prevent regressions (by adding more guard rails when you notice the program went off the road). You can't just add "all the guard rails ever needed" in advance.
Similarly, bugs often crop up because of interactions which aren’t obvious at the time. Thus the reason a test is failing can be wildly different than the intended use case of a test. Perhaps the test failed because the continuous integration environment has some bad RAM, you’ll need to investigate to discover why a test fails.
You then determine what are the inputs / outputs that you're taking for each function / method / class / etc.
You also determine what these functions / methods / classes / etc. compute within their blocks.
Now you have that on paper and have it planned out, so you write tests first for valid / invalid values, edge cases, etc.
There are workflows that work for this, but nowadays I automate a lot of test creation. It's a lot easier to hack a few iterations first, play with it, then when I have my desired behaviour I write some tests. Gradually you just write tests first, you may even keep a repo somewhere for tests you might use again for common patterns.
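For illustration, the "tests first for valid / invalid values and edge cases" step might look something like this (parse_port and the netconfig module are hypothetical, planned on paper but not written yet):

```python
# Tests-first sketch: inputs/outputs were decided up front, code comes later.
import pytest
from netconfig import parse_port  # hypothetical module, not written yet

@pytest.mark.parametrize("raw, expected", [
    ("80", 80),        # plain valid value
    ("  443 ", 443),   # whitespace should be tolerated
    ("65535", 65535),  # upper edge of the valid range
])
def test_valid_ports(raw, expected):
    assert parse_port(raw) == expected

@pytest.mark.parametrize("raw", ["0", "65536", "-1", "http", ""])
def test_invalid_ports_raise(raw):
    with pytest.raises(ValueError):
        parse_port(raw)
```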
Quite curious about the TDD approach to that, especially taking into account the religious "no code without broken tests" mantra.
Once you've got unit tests and built what you think you need, write integration/e2e tests and try to get those green as well. As you integrate you'll probably also run into more bugs, make sure you add regression tests for those and fix them as you're working.
2. Write code that makes it look right, running the test and checking that picture periodically. When it looks right, lock in the artefact which should now be checked against the actual picture (green, if it matches).
3. Refactor.
The only criticism I've heard of this is that it doesn't fit some people's conceptions of what they think TDD "ought to be" (i.e. some bullshit with a low-level unit test).
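For concreteness, the "lock in the artefact" step above is essentially a golden-file test. A rough sketch, where render_report() and the golden path are made up:

```python
# Golden-file sketch: once the output looks right, it gets locked in.
from pathlib import Path
from report import render_report  # hypothetical function under test

GOLDEN = Path("tests/golden/report.html")

def sample_data():
    return {"title": "Q3 summary", "rows": [("widgets", 12), ("gadgets", 7)]}

def test_report_matches_locked_in_artefact():
    actual = render_report(sample_data())
    if not GOLDEN.exists():
        # First run ("red"): eyeball the output, then commit it as the golden copy.
        GOLDEN.parent.mkdir(parents=True, exist_ok=True)
        GOLDEN.write_text(actual)
        raise AssertionError("Golden file created; review it and re-run.")
    # Subsequent runs ("green"): the artefact is locked in and compared exactly.
    assert actual == GOLDEN.read_text()
```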
You are right that most people wouldn't know what 10/10 design looks/behaves like. That's the real bottleneck: people can't prompt for what they don't understand.
It is also so monumentally brittle that if you do this for interactive software, you will drive yourself nuts trying.
For a simple example, FizzBuzz written as a single loop with some if statements inside is not so easy to test. Instead, break it in half so you have a function that does the fiddly bits and a loop that just contains "output += MakeFizzBuzzLineForNumber(X);". Now it's easy to come up with tests for likely mistakes, and conceptually you're working with two simpler problems with clear boundaries between them.
In a slightly different context you might have a function that decides which kind of account to create based on some criteria, and then returns the account type rather than creating the account. That function's logic is then testable by passing in some parameters and looking at the type of account returned, without actually creating any accounts. Getting good at this requires looking at programs in a more abstract way, but a secondary benefit is rather easy-to-maintain code at the cost of a little bookkeeping. Just don't go overboard: the value is in breaking out the bits that are likely to contain bugs at some point, whereas abstraction for abstraction's sake is just wasted effort.
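A sketch of that decomposition, with the fiddly bits in a pure function (same idea as MakeFizzBuzzLineForNumber above; the account-type decision function works the same way):

```python
def make_fizzbuzz_line(n: int) -> str:
    # All the fiddly, bug-prone logic lives here, in a pure function.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def fizzbuzz(up_to: int) -> str:
    # The loop is now trivial; it just concatenates lines.
    return "\n".join(make_fizzbuzz_line(i) for i in range(1, up_to + 1))

def test_fiddly_bits():
    assert make_fizzbuzz_line(3) == "Fizz"
    assert make_fizzbuzz_line(5) == "Buzz"
    assert make_fizzbuzz_line(15) == "FizzBuzz"
    assert make_fizzbuzz_line(7) == "7"
```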
Your edge case depends on the kind of experimentation you’re doing. I sometimes treat CSS as kind of black magic and just look for the right incantation that happens to work across a bunch of browsers. It’s not efficient, but I’m ok punting because I don’t have the time to become an expert on everything.
On the other hand, when looking for an efficient algorithm or optimization, I'm likely to know what kind of results I'm looking for at some stage before creating the relevant code. In such cases tests help clarify what exactly the mysterious code needs to do, so a few hours to weeks later, when inspiration hits, you haven't forgotten any relevant details. I might have gone in a wildly different direction, but as long as I consider why each test was made before deleting it, the process of drilling down into the details has value.
I get where you're coming from, because I'm about a decade behind you, but resisting change is not a good look. I feel the same way about all this vibe coding and junk--don't really think it's a good idea, but there it is. Get used to being wrong about everything.
Your condescending attitude is not a good look. You don't know me at all.
There were a few places I worked where TDD actually succeeded, because the project was fairly well baked and the requirements that came in could be understood. That was the exception, not the rule.
If you can fully design what your system does before starting, it is more reasonable. And often that means going down to the level of inputs and states. Think of something like control systems for, say, mobile networks or planes or factory control. You could design the whole operation and all the states that should or could happen before a single line of code.
Write some tests for a non trivial function before creating the function and the entire cycle might take as little as 20 minutes.
Again, I don't do that for correctness, I do it because it's faster than not having something to work against, that you can run with one command that tells you "Yup, you did the thing!" or "Nope, not there yet". When I don't do TDD, I'm slower, because I have to manually verify things and sometimes there are regressions.
Catching these things and automating the process is what makes (for me) TDD worth it.
> Put another way if your TDD always pass then there’s no point in writing them
Uuh, no one said this?
I'm not sure where people got the idea that TDD is this very strict "one way and one way only" thing. The core idea is that your work gets easier to do; if it doesn't, then you're doing it wrong, probably by following the rules too tightly.
We don't have to be so dogmatic about any of the methodologies out there; everything has tradeoffs, choose wisely.
Ironically, AI can. In my experience it is extremely good at thinking about edge cases and writing tests to defend against them.
That means you have to understand whether it is even proving the properties you require for the software to work.
It's very easy to write a proof akin to a test that does not test anything useful...
1. Tradeoffs, as always. The more advanced typing you head towards, the more time-consuming it becomes to reason about the program. There is good reason why even the most staunch type advocates rarely push for anything more advanced than monads. A handful of assertive tests is usually good enough, while requiring significantly less effort.
2. Not just time-consuming, but often beyond comprehension. Most developers just don't know how to think in terms of formal proofs. Throw a language with an advanced type system, like Coq or Idris, in front of them and they wouldn't have a clue what to do with it (even ignoring the unfamiliar syntax). And with property tests, you're now asking them to not only think in advanced types, but to also effectively define the types themselves from scratch. Despite #1, I fully expect we would still see more property testing if it weren't for this huge impediment. (A small example is sketched below.)
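For reference, a property test is usually only a few lines. A self-contained Hypothesis example, with a toy run-length codec as the thing under test:

```python
# Property-test sketch: checks a whole class of inputs, not hand-picked examples.
from hypothesis import given, strategies as st

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string, e.g. 'aaab' -> [('a', 3), ('b', 1)]."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in runs)

@given(st.text())
def test_round_trip(s):
    # The property: decoding an encoding always yields the original string.
    assert rle_decode(rle_encode(s)) == s
```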
Formal proofs are useful on the same class of bugs that property tests are.
And vice versa.
The issue isn't necessarily that devs can't use them; it's that the problems they have which cause most bugs don't map onto the space of "what formal proofs are good at".
> You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements.
SQLite has a tests-lines-to-code-lines ratio above 1000 (yes, 1000 lines of tests for a single line of code) and still has bugs. AMD, at the time it decided to apply ACL2 to its FPU, had 29 million tests (not lines of code, but test inputs and outputs). ACL2 verification found several bugs in the FPU.
Just to make a couple of points for someone to draw a line.
I never bought into TDD because it is only useful for business logic, plain algorithms, and data structures; it is no accident that that's what 99% of conference talks and books focus on.
There isn't a single TDD talk about shader programming for GPGPU and validating what the shader algorithms produce via automated tests, the reason being the amount of engineering effort required just to make it work, and it still lacks human sensitivity for what gets rendered.
You can argue semantics until you're blue in the face it still follows red-green-refactor and it confers the same benefits as TDD.
The bugs occur because the initial tests didn’t fully capture the desired and undesired behaviors.
I’ve never seen a formal list of software requirements state that a product cannot take more than an hour to do a (trivial) operation. Nobody writes that out because it’s implicitly understood.
Imagine writing a “life for dummies” textbook on how to grow from a 5yr old to 10yr old. It’s impossible to fully cover.
Hah, if that were true the industry would be a better place. Or a worse place. Or a slower place but exactly the same. I should build a test for that...
I've worked on many projects where tests get disabled as nobody can tell why it's failing (or why it was even written in some cases).
I've rewritten test systems from scratch in the past to drag projects out of the dumpster fire by getting them into a state of passing simple startup/shutdown safety routines, and then watched, as I passed the project on to others, how it rots until some "genius" young coder comes along and "removes the slow test suite because it takes 2hr+ to run on my way-out-of-spec laptop".
TDD combined with vibe-coding can create code that has unwanted side-effects, because your tests only check the result. It can also have various security vulnerabilities, which you don't test for, because how would you know what to test. It can also lead to massive duplication and code bloat, while tests still pass. It can lead to software which wastes a lot of resources (memory, cpu, inefficient network requests and the like) due to bad algorithms. If you try to keep that in check by writing performance tests, how do you know what acceptable performance is, if you have no idea how your program works?
Also, you can give AI an SLO for the code and fail stress tests that don't meet it. In many cases, AI will happily respond to a failing stress test with profiling and well-thought-out optimizations.
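As a rough sketch, an SLO encoded as a test could be as simple as this (search_catalog and the 50 ms budget are made up for illustration):

```python
# SLO-as-test sketch: a latency budget the generated code must stay under.
import time
from catalog import search_catalog  # hypothetical function under test

def test_search_meets_latency_slo():
    queries = ["red shoes", "usb-c cable", "left-handed hammer"] * 100
    start = time.perf_counter()
    for q in queries:
        search_catalog(q)
    avg_ms = (time.perf_counter() - start) * 1000 / len(queries)
    # Fails the build (and the agent's loop) if average latency regresses.
    assert avg_ms < 50, f"average latency {avg_ms:.1f} ms exceeds the 50 ms SLO"
```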
The reason AI code generation works so well is that a) it is text-based, so the training data is huge, and b) the output is not the final result but a human-readable blueprint (source code), ready to be made fit by a human who can form an abstract idea of the whole in their head. The final product is the compiled machine code; we use compilers for that, not LLMs.
AI-generated code is not suitable to be transferred directly into the final product awaiting validation by TDD; it would simply be very inefficient to do so.
None of that matters if it's not a person writing the code.
If you give AI a set of tests to pass and turn it loose with no oversight, it will happily spit out 500k LOC when 500 would do. And then it will have a very hard time when you ask it to add some functionality.
AI routinely writes code that is beyond its ability to maintain and extend. It can’t just one shot large code bases either, so any attempt to “regenerate the code” is going to run into these same issues.
I've been playing around with getting the AI to write a program, where I pretend I don't know anything about coding, only giving it scenarios that need to work in a specific way. The program is about financial planning and tax computations.
I recently discovered AI had implemented four different tax predictions to meet different scenarios. All of them incompatible and all incorrect but able to pass the specific test scenarios because it hardcoded which one to use for which test.
This is the kind of mess I'm seeing in the code when AI is left alone to just meet requirements without any oversight on the code itself.
Yes. The first things I always check in every project (and especially in vibe-coded projects) are:
A. Does it have tests?
B. Is the coverage over 70%?
C. Do the tests actually test the behaviour of the code (good) or just its implementation (bad)? (A rough sketch of the difference is below.)
If any of those requirements are missing, then that is a red flag for the project.
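A rough sketch of what point C means in practice (Cart and its methods are hypothetical): the first test pins behaviour, the second only pins implementation details.

```python
from unittest.mock import patch
from cart import Cart  # hypothetical shopping-cart class

def test_discount_behaviour():
    # Good: asserts what the user actually observes.
    cart = Cart()
    cart.add("book", price=20.00, qty=3)
    cart.apply_coupon("SAVE10")  # 10% off
    assert cart.total() == 54.00

def test_discount_implementation():
    # Bad: asserts *how* it's done, so any harmless refactor breaks it.
    cart = Cart()
    with patch.object(Cart, "_recalculate_totals") as recalc:
        cart.apply_coupon("SAVE10")
        recalc.assert_called_once()
```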
While TDD is absolutely valuable for clean code, focusing too much on it can be the death of a startup.
As you said, the code itself is worth $0; but then the first product is still worth $10, and the finished product is worth $1M+ once it makes money, which is what matters.
My prediction is that in the future, a lot of desperate companies are going to need living, breathing reverse software engineers to aid them because they have lost the ability to understand their own codebases.
Oh, and why is code worth $0? A lot of code is throwaway, but I still got paid to produce it and much of it makes money for the company or saves them money.
writing comprehensive tests is harder than writing the code
Today, it just does something and when corrected it says "You are right!....".
Yeah, me neither…
oh no! another bug!
I agree rigidly defining exactly what the code does through tests is harder than people think.
> The way to code going forward with AI is Test Driven Development.
No. TDD already collapses under its own weight as a project grows.
> The code itself no longer matters.
No. Definitely no. That’s absurd. You can’t box in a correct solution with guard rails. Especially since, even if you could get something close to that, you would also lose the ability to understand the tests.
> You give the AI a set of requirements, ie. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements. That's it. The new reality us programmers need to face is that code itself has an exact value of $0.
No. The opposite. When code is cheap, understanding and control become expensive. Code a human can understand will be the most valuable going forward.
> That's because AI can generate it, and with every new iteration of the AI, the internal code will get better.
No. All code is technical debt. AI produces code faster. Therefore AI produces bugs faster.
”Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it” -Brian Kernighan
This is literally where we’re at. AI writes code just beyond its ability to fix.
> What matters now are the prompts.
No. This is such a dead end. It’s a roll of the dice, and so we have examples of people who seem to get it to build something faster. That’s like saying there are people who win the lottery. It’s true, and it also says nothing of your ability to repeat their process. Confirmation bias of the wins. But in building something reliable, we care more about the floor (minimum quality) than the ceiling (the peak it can reach sometimes).
That's only true for problems that have been solved and well documented before. AI can't solve novel problems. I have a ton of examples I use from time to time when new models come out. I've tried to ride the hype train, and I've been frustrated working with people before, but I've never been so frustrated as when trying to make AI follow a simple set of rules and getting:
"Oh yes, my bad, I get that now. Black is white and white is black. Let me rewrite the code..."
My favorite example: I tasked AI with a rudimentary task and it gave me a working answer, but it was fishy, so I googled the answer and lo and behold I landed on a Stack Overflow page with the exact same answer as the top-voted answer to a question very similar to my task. But that answer also had a ton of comments explaining why you should never do it that way. I've been told many times that "you know, Kubernetes is so complicated, but I tell AI what I want and it gives me a command I simply paste in my terminal". Fuck no.
AI is great for scaffolding projects, working with typical web apps where you have repeatable, well documented scenarios, etc.
But it's not a silver bullet.
We had no clue that this could actually happen one day in the form of gen AI. I want to agree with you just to prove that I was right!
This is going to bring up a huge issue though: nailing requirements. Because of the nature of this, you're going to have to spec out everything in great detail to avoid edge cases. At that point, will the juice be worth the squeeze? Maybe. It feels like good businesses are thorough with those kinds of requirements.
What you need is indeed spec-driven development, but specs need to be written in some kind of language that allows for more formal verification. Something like https://en.wikipedia.org/wiki/Design_by_contract, basically.
It is extremely ironic that, instead, the two languages that LLMs are the most proficient in - and thus the ones most heavily used for AI coding - are JavaScript and Python...
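To give a flavour of what contract-style specs look like, even in plain Python: preconditions and postconditions the generated implementation must respect (transfer and Account are illustrative, not from any particular DbC framework).

```python
from dataclasses import dataclass

@dataclass
class Account:
    balance: int  # cents

def transfer(src: Account, dst: Account, amount: int) -> None:
    # Preconditions: what the caller promises.
    assert amount > 0, "amount must be positive"
    assert src.balance >= amount, "insufficient funds"
    total_before = src.balance + dst.balance

    src.balance -= amount
    dst.balance += amount

    # Postcondition: money is conserved. A generated body that violates this
    # invariant fails immediately instead of slipping past vague tests.
    assert src.balance + dst.balance == total_before
```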
It's sort of like a director telling an AI the high level plot of a movie, vs giving an AI the actual storyboards. The storyboards will better capture the vision of the director vs just a high level plot description, in my opinion.
This is not new at all. Code has always been a liability. It having $0 value would be a great improvement IMHO.
The value was always in the product regardless of the amount of code in it and regardless of its quality. Customers don’t buy code. (Except of course when the code is the product, which is very unusual nowadays.)
Once AI has cheap real-time eyes it might get slightly better, but all the logs and browser MCP tools and yadda yadda in the world will not get it to produce anything remotely efficient.
Good luck explaining that when you get hacked out of oblivion.
This is like saying the fine-print of contracts don't matter so I get "AI" to regurgitate them all for me as a lawyer. It's so wrong as to be beyond laughable.
Put the coffee down and go for a walk, preferably to a library, and LEARN SOMETHING.
When I tried it, it "worked", I admittedly felt really good about it, but I stepped away for a few weeks because of life and now I can't tell you how it works beyond the high level concepts I fed into the LLM.
When there's bugs, I basically have to derive from first principles where/how/why the bug happens instead of having good intuition on where the problem lies because I read/wrote/reviewed/integrated with the code myself.
I've tried this method of development with various levels of involvement in implementation itself and the conclusion I came to is if I didn't write the code, it isn't "mine" in every sense of the term, not just in terms of legal or moral ownership, but also in the sense of having a full mental model of the code in a way I can intellectually and intuitively own it.
Really digging into the tests and code, there are fundamental misunderstandings that are very, very hard to discern when doing the whole agent interfacing loop. I believe they're the types of errors you'd only pick up on if you wrote the code yourself, you have to be in that headspace to see the problem.
Also, I'd be embarrassed to put my name on the project, given my lack of implementation, understanding and the overall quality of the code, tests, architecture, etc. It isn't honest and it's clearly AI slop.
It did make me feel really productive and clever while doing it, though.
And that's the greatest trap of this whole thing. That the _feels_ are so quickly diverged from the actual.
Since its mechanism is to predict the next token of the conversation, it's reasonable to "predict" itself making more mistakes once it has made one.
There's currently not an official workflow on how to manage these steering files across repos if you want to have organisation-wide standards, which is probably my main criticism.
Then they claim (and demonstrate with a picture of a commits/day chart) a team-wide 10x throughput increase. I claim there's got to be a lot of rubber-stamp reviewing going on here. It may help to challenge the "author" to explain things like "why does this lifetime have the scope it does?" or "why did you factor it this way instead of some other way?" e.g. questions which force them to defend the "decisions" they made. I suspect if you're actually doing thorough reviews that the velocity will actually decrease instead of increase using LLMs.
To quote Joey from Friends - "400 bucks are gone from my pocket and nobody is getting happier?"
1) Abstract data showing an increase in "productivity" ... CHECK
2) Completely lacking in any information on what was built with that "productivity" ... CHECK
Hilarious to read this on the backend of the most widely publicized AWS failure.
The first paragraph of said guidelines reads:
What to Submit
On-Topic: Anything that good hackers would find interesting. That includes more than hacking and startups. If you had to reduce it to a sentence, the answer might be: anything that gratifies one's intellectual curiosity.
And yet, the original submission was just another version of the trope "I used AI to boost my productivity 10 fold, and it was all roses and butterflies." After the n-th iteration of the same self-congratulating, hype-pushing, AI-generated drivel, the point can be made that the original submission does not meet the HN guidelines.
To quote Jimmy in South Park: it's an ad.
Comments like the one I replied to just make HN seem mean and miserable, and that's definitely something we're trying to avoid.
And guess what happens? Reality doesn't match expectations and everyone ends up miserable.
Good engineering orgs should have engineers deciding what tools are appropriate based on what they're trying to do.
I’ve sure used various LLMs to solve difficult nuts to crack. Problems I have been able to verbalise, but unable to solve.
Chances are that if you are using an LLM to mass-produce boilerplate, you are writing too much boilerplate.
The corollary being: if you can't verify (through skill or effort), don't trust.
If you break this pattern you deserve all the follies that become you as a "professional".
This is really something striking to me about all these AI productivity claims. They never provide the methodology and data.
But I think the objections can mostly be overcome with a minor adjustment: You only need to couple TDD with a functional programming style. Functional programming lets you tightly control the context of each coding task, which makes AI models ridiculously good at generating the right code.
Given that, if most of your code is tightly-scoped, well-tested components implementing orthogonal functionality, the actual code within those components will not matter. Only glue code becomes important and that too could become much more amenable to extensive integration testing.
At that point, even the test code may not matter much, just the test-cases. So as a developer you would only really need to review and tweak the test cases. I call this "Test-Case-Only Development" (TCOD?)
The actual code can be completely abstracted away, and your main task becomes design and architecture.
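One possible shape of TCOD, for illustration: the only artifact a human writes and reviews is the case table, while the pure function behind it is generated (shipping_cost and the numbers are made up).

```python
import pytest
from pricing import shipping_cost  # hypothetical AI-generated pure function

CASES = [
    # (weight_kg, destination, expected_cost) -- the human-reviewed spec
    (0.5, "domestic",        4.99),
    (0.5, "international",  14.99),
    (20.0, "domestic",      24.99),
    (20.0, "international", 89.99),
]

@pytest.mark.parametrize("weight, dest, expected", CASES)
def test_shipping_cost(weight, dest, expected):
    assert shipping_cost(weight, dest) == pytest.approx(expected)
```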
It's not obvious this could work, largely because it violates every professional instinct we have. But apparently somebody has even already tried it with some success: https://www.linkedin.com/feed/update/urn:li:activity:7196786...
All the downsides that have been mentioned will be true, but also may not matter anymore. E.g. in a large team and large codebase, this will lead to a lot of duplicate code with low cohesion. However, if that code does what it is supposed to and is well-tested, does the duplication matter? DRY was an important principle when the cost of code was high, and so you wanted to have as much leverage as possible via reuse. You also wanted to minimize code because it is a liability (bugs, tech debt, etc.) and testing, which required even more code that still didn't guarantee lack of bugs, was also very expensive.
But now that the cost of code is plummeting, that calculus is shifting too. You can churn out code and tests (including even performance tests, which are always an afterthought, if thought of at all) at unimaginable rates.
And all this while reducing the dependencies of developers on libraries and frameworks and each other. Fewer dependencies means higher velocity. The overall code "goodput" will likely vastly outweigh inefficiencies like duplication.
Unfortunately, as TFA indicates, there is a huge impedance mismatch with this and the architectures (e.g. most code is OO, not functional), frameworks, and processes we have today. Companies will have to make tough decisions about where they are and where they want to get.
I suspect AI-assisted coding taken to its logical conclusion is going to look very different from what we're used to.
> I suspect AI-assisted coding taken to its logical conclusion is going to look very different from what we're used to.
100%. I now design new libraries so that AI can easily write code for them.
These guys actually seem rattled now.
"The Arithmetic of AI-Assisted Coding Looks Marginal" would be the more honest article title.
Could be the other way around, but I think marketing-speak is taking cues here from legal-ese and especially the US supreme court, where it's frequently used by the justices. They love to talk about "ethical calculus" and the "calculus of stare decisis" as if they were following any rigorous process or believed in precedent if it's not convenient. New translation from original Latin: "we do what we want and do not intend to explain". Calculus, huh? Show your work and point to a real procedure or STFU
Waiting to see anyone show even a month ahead of schedule after 6 months.
AI can't keep up because its context window is full of yesteryear's wrong ideas about what next month will look like.
We've been having a go around with corporate leadership at my company about "AI is going to solve our problems". Dude, you don't even know what our problems are. How are you going to prompt the AI to analyze a 300 page PDF on budget policy when you can't even tell me how you read a 300 page PDF with your eyes to analyze the budget policy.
I'm tempted to give them what they want: just a chatter box they can ask, "analyze this budget policy for me", just so I can see the looks on their faces when it spits out five poorly written paragraphs full of niceties that talk its way around ever doing any analysis.
I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
Ah, but they'll love it.
> I don't know, maybe I'm too much of a perfectionist. Maybe I'm the problem because I value getting the right answer rather than just spitting out reams of text nobody is ever going to read anyway. Maybe it's better to send the client a bill and hope they are using their own AIs to evaluate the work rather than reading it themselves? Who would ever think we were intentionally engaging in Fraud, Waste, and Abuse if it was the AI that did it?
We're already doing all the same stuff, except today it's not the AI that's doing that, it's people. One overworked and stressed person somewhere makes for a poorly designed, buggy library, and then millions of other overworked and stressed people spend most of their time at work finding out how to cobble dozens of such poorly designed and buggy pieces of code together into something that kinda sorta works.
This is why top management is so bullish on AI. It's because it's a perfect fit for a model that they have already established.
That, or its a discovery of why what I wanted is impossible and it's back to the drawing board.
It's nice to not be throwing away code that I'd otherwise have been a perfectionist about (and still thrown away).
So, yeah, they probably think typing is a huge bottleneck and it's a huge time saver.
How about learning to touch type? Clearly code manipulation is not the hard part of writing software, so all the people finding efficiency improvements in that tooling and skill set would be better served doing something else with their time? I find it instructive that the evergreen dismissal of one person's enthusiasm as unimportant rarely says what exactly they should be investing in instead.
Fine for just you. Not fine for others, not fine for business, not fine the moment your star count starts moving.
Congratulations, you invented end-to-end testing.
"We have yellow flags when the build breaks!"
Congratulations! You invented backpressure.
Every team has different needs and path dependencies, so settles on a different interpretation of CI/CD and software eng process. Productizing anything in this space is going to be an uphill battle to yank away teams' hard-earned processes.
Productizing process is hard but it's been done before! When paired with a LOT of spruiking it can really progress the field. It's how we got the first CI/CD tools (eg. https://en.wikipedia.org/wiki/CruiseControl) and testing libraries (eg. pytest)
So I wish you luck!
This article attempted to outline a fairly reasonable approach to using AI tooling, and the criticisms hardly seem related to it at all.
Now an AWS guy is doing it!
"My team is no different—we are producing code at 10x of typical high-velocity team. That's not hyperbole - we've actually collected and analyzed the metrics."
Rofl
"The Cost-Benefit Rebalance"
In here he basically just talks about setting up mock dependencies and introducing intermittent failures into them. Mock dependencies have been around for decades, nothing new here.
It sounds like this test system you set up is as time consuming as solving the actual problems you're trying to solve, so what time are you saving?
"Driving Fast Requires Tighter Feedback Loop"
Yes if you're code-vomiting with agents and your test infrastructure isn't rock solid things will fall apart fast, that's obvious. But setting up a rock solid test infrastructure for your system involves basically solving most of the hard problems in the first place. So again, what? What value are you gaining here?
"The communication bottleneck"
Amazon was doing this when I worked there 12 years ago. We all sat in the same room.
"The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable."
Show the data and proof. Doubt.
Yeah I don't know. This reads like complete nonsense honestly.
Paraphrasing: "AI will give us huge gains, and we're already seeing it. But our pipelines and testing will need to be way stronger to withstand the massive increase in velocity!"
Velocity to do what? What are you guys even doing?
Amazon is firing 30,000 people by the way.
Can you point me to anyone who knows what they're talking about declaring that LOC is the best productivity metric for AI-assisted software development?
Can you point me to where the author of this article gives any proof to the claim of 10x increased productivity other than the screenshot of their git commits, which shows more squares in recent weeks? I know git commits could be net deleting code rather than adding code, but that's still using LOC, or number of commits as a proxy to it, as a metric.
Yes, I'm also reading that the author believes commit velocity is one reflection of the productivity increases they're seeing, but I assume they're not a moron and have access to many other signals they're not sharing with us. Probably stuff like: https://www.amazon.science/blog/measuring-the-effectiveness-...
and this guy didn't survive there for a decade by challenging it
I think he is right.