These questions are not even about AI: if I were to give money to a human agency and were given something they tell me works, I would ask the same questions. If I did not know how to evaluate it, I would hire people who do. With LLMs the verification part is what bothers me the most.
The only decent software engineering perspective I’ve seen has been from Mitchell Hashimoto.
They can just summon bespoke software out of the ether that only handles the use cases of themselves and a few of their collaborators.
Making “side projects” was not possible for non-developers before powerful LLMs. Now it is.
Imagine not being an architect and using Claude to put together a building plan, then concluding it’s basically done but we might need a real architect to double check the measurements. It may even be true but I’d be skeptical if it’s always non-architects saying this.
Building in the physical world has physical and time constraints that cannot be overcome, which is one of the reasons architecture (and engineering) are so important in this domain. In software development these constraints were only inherent when people were writing the majority of the software. I feel like I’m seeing what I thought were fundamental constraints being eroded by the increasing speed and correctness of these tools and it’s making me reconsider the importance of some of the values that are held by software engineering.
It’s obviously dependent on the domain and solution, but if your software can be extremely rapidly rearranged, bugs found and fixed with little effort, and features added with only a minimum prompt, I think the entire definition of technical debt has changed. I’ve been sceptical of these tools and still approach their output with caution. I also worry that, as a software developer, if more can be accomplished in less time there will be less room on this planet for software developers.
This very well summarizes my current thinking on the subject as well. And most of my career has been playing the role of technical debt nazi. Much to the detriment of my earning potential.
Does AI make incredibly inefficient code most of the time? Yup. But it does it at lightspeed with minimal effort.
I think many software engineers forget they exist to get real things done (in many cases at least) and they are a cost center for most businesses. If your end product is not selling software, very few people actually Doing the Thing(tm) will give a single solitary care about code quality or maintainability when they can just spend 30 minutes and $15 worth of tokens to fix it.
It won't take over everything, but I've already seen otherwise very intelligent go-getter type folks who are not technical and don't know how to code make extremely useful things for themselves and their small little enterprises. And this will seemingly only get better and more efficient.
For someone who really does love the idea of well architected and future-proof code this is just icky to even say or consider. But I'm coming around to the idea that this is the future for the majority of software for most places. And it may have the ability to seriously even the playing field for small enterprises in some industries.
I'm currently using it to implement a zillion side projects at home I've been "meaning to get to" for years. It makes incredibly silly unmaintainable code most of the time - but I learned to not care, and just tell the AI bot to fix it/add to it as I go along. Worst-case I spend a single night deleting it all and starting from zero to "refactor" an entire thing.
I am surprised to hear people so naive that they expect their token usage to stay flat if code quality and maintainability start falling exponentially.
What if, to fix 2 bugs, your LLM starts adding 50 new ones? Will you tell your customers in the support channel "sorry, the software is finished; if we try fixing anything, everything else might break, not worth it"? Or "we can probably fix it, but our AI usage will rise so much we need to up the subscription threefold, you choose"?
The speed at which LLMs code is only comparable to the speed at which they add garbage to your repo. If you stop caring about maintainability, you also stop caring about your AI/LLM related bills and the viability of your project past the PoC stage.
Another thing though is selling software in the first place will soon become a tough proposition outside of a few niches.
There's no reason to think that quality and maintainability will start falling exponentially. On the contrary, these models get better every couple months, and 99% of software isn't actually that complicated. There's just no reason for the fear-mongering that fixing 2 bugs will cause the LLM to add 50 new ones.
Not 50:1 but it does happen
One billion percent. I think the vast majority of the anti-AI sentiment I hear from software engineers comes down to them caring more about playing with their tools than actually solving the problem.
This hits the nail on the head.
Detractors often hang on to examples of coding assistants making mistakes or outputting subpar code, but they somehow miss the fact that coding assistants can also be prompted again and refactor whole swaths of code just as fast as they introduce oopsies. This means that the worst case scenario implies fast convergence to an acceptable outcome, and from there also fast iteration to improve upon that.
The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and having actual engineering. Let engineers really own the architecture. Require that any new feature - no matter how small - be specced out with complete sequence diagrams. Ensure that every new software package is put on a UML component diagram for the team to review and see how each addition interacts with the whole system, etc.
If we do that, then we can just give all the documents to a coding agent and say "go ahead and implement this" with a minimal amount of confidence. But in doing this, I bet we will realize the following:
- the "effort" has never been about writing code itself. The code is just the material manifestation of all the thought that went into working out a solution to the problems that the product is attempting to solve.
- we will likely be better off by using code generation tools (e.g., UML-to-code) and a "weak" LLM (that can run locally) than by playing the token lottery at the Anthropic Casino.
I'd substitute "owner" for the team and in that sense the owner will not need to be human.
We're at this state where Claude is great at doing the "middle" part of work, but it's crap at gathering requirements and verifying what it has done. I also don't see people caring about these aspects of software development, as shown in the article.
That's so far been called software development.
All software developed by people suffers from this issue.
Where exactly is the novelty?
> The only way I see AI coding working in the long run is if we go back to a Waterfall/BDUF process and having actual engineering.
Nonsense. The problem is exactly the same.
With agents, iterations are much faster. This can mean things get messier faster, but they can get back into shape just as fast.
Ironically, agents improve the quality of the deliverable as well. Approaches such as spec-driven development do a far better job delivering features up to spec than manual coding by flesh and blood developers.
There's an awful lot of baseless scaremongering in your post. You make it sound like with AI assisted coding developers stopped paying any attention to quality.
And that’s pretty much where you are wrong. Take any long running open source project and you can see the craftsmanship that goes into it. It may not be perfect, but hacks are clearly marked as such.
I think you are demonstrating a clear lack of insight and experience in software development settings, including FLOSS projects. I can name you a dozen fairly well-known FLOSS projects which are a big ball of mud. Just go to the likes of GitHub, check out the list of popular projects, and peek at their code. You will get a very mixed set of results.
The compounding speed. Your devs might reach a point where they have to rewrite and refactor, in a decade.
Your LLM, with its higher throughput, may put you in that game breaking situation next week.
I think that this is exactly why this scaremongering breaks down. If you believe the compounding speed is that much greater, wouldn't you be compelled to accept that refactoring and cleaning things up is just as fast and effortless?
I mean, you have a tool that writes software for you following your commands. If you are that concerned with maintainability then what can possibly compel you to not invest any effort in it?
No, not at all. Imagine that each unit of work (a new PR for a feature, a bugfix) builds something that is 99% close to optimal and you can only bring it to 100% if you spend time to really review and rewrite the "not good" part. Also, for the sake of argument, let's just say that the overall quality of the system is the product of the quality scores of each unit of work. The only way to get an "ideal" system is by ensuring that work done on it follows the "ideal" architecture - for whatever "ideal" means for the developers/maintainers.
You are arguing that you are saving time because you only have to write the 1% that the AI got wrong, so you'd be getting a 100x speed up. My argument is that there is not so much time because if you want 100% quality, you will have to review 100% of the code. Understanding the produced code is the time-consuming part, not typing it out.
So, the only way to have these time savings by working with coding agents is if you accept that the code generated is good enough to not have careful review. But if you do that, then each unit of work where you tell yourself "not ideal but good enough. Ship it and we refactor later" ends up bringing down the overall system quality. If you have 10 of these "99% good enough" PRs, your overall system score is already at 90%. With 50 of these, the score dives down to 60%.
This is what OP and I mean by "compounding" issues: unless we get to a point where generated code does not need review at all, your development speed will always be bottle-necked by the human in the loop. The only way to get speed benefits from the code generation is if we remove the human in the loop, but in doing so quality will drop faster than you can fix it.
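The compounding arithmetic in this comment can be checked with a short sketch. It assumes a deliberately simple model in which overall quality is the product of per-change scores (which is what the 90% and 60% figures imply); the function name `system_quality` and the uniform 99% score are illustrative assumptions, not anything measured.

```python
def system_quality(per_change_quality: float, n_changes: int) -> float:
    """Overall quality when every unreviewed change multiplies
    the running score by the same per-change factor (assumed model)."""
    return per_change_quality ** n_changes


if __name__ == "__main__":
    # Hypothetical scenario: every merged PR is "99% good enough".
    for n in (10, 50):
        print(f"{n:>2} changes at 99% each -> {system_quality(0.99, n):.1%}")
```

Under this model the score after 10 such changes is roughly 90% and after 50 roughly 60%, matching the numbers above. Whether real codebases degrade multiplicatively is the contested part; the sketch only shows that the arithmetic follows once that assumption is granted.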
I’m not saying that you can’t use AI to do it because I believe that with carefully controlled workflows and context management you can, but it’s not a simple prompt away, it requires guidance and understanding, and isn’t the speed demon that raw prompting is.
That's not really the point though. That presumes models are only useful if they are one-shot models. That is false.
I mean, what if your prompt successfully changes 20 source files and makes a mess in one? How much work did it save?
And the elephant in the room is when models actually outperform whatever the prompter is able to deliver, and faster. That is somehow left out.
That’s not at all what I’m saying.
I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.
I want them to be more useful outside of one-shot uses, but I find that they currently miss the mark.
That's not my experience at all, and I have been using models that are far from being cutting edge. Even in the cases where a model generates utter nonsense, a couple of clarifying questions is all it takes to get it back on track.
But that might be a factor of the project being worked on, and the extent of the changes being asked for.
Doesn't matter how fast you can make the wrong thing.
I think computers are incredibly cheap compared to humans. These models and infrastructure to run them are going to only get more efficient in time. Right now we are still using (for the most part) entire hardware architectures mostly shoehorned from one purpose (graphics) into another. As purpose-built hardware becomes more prevalent and the SOTA starts to slow down I can't imagine a $100k hardware box not being able to handle a small team of developers' needs for many things.
I do think there will be a place for the top 20% of software engineers forever. But most people are not in that top 20%, and the quality when you get below average is not a linear progression. It will not be that difficult for AI generated code to beat the "bottom end" of the industry since tbh it's hard for me to tell the difference between LLM generated code and some of the shit I've seen over the years. I've run across code written by folks who don't know what an array is more than once.
Most software is not built by MIT and Stanford grads making $500k/yr in the Valley. It's built by work-a-day programmers in the middle of nowhere making $80k/yr to keep some niche small business going with hyper-specific software that was first designed for Windows 95. Or stuff like making horribly designed Wordpress plugins. Or Shopify integrations. etc. etc.
I've also seen these small businesses totally held back by incompetent programmers, and despite their best efforts and huge amounts (for them!) of investment they can never seem to fix it. These types of enterprises are having AI run circles around their current engineering practices, even if it would make most FAANG engineers gasp in horror.
Either way it will certainly be interesting to watch! I just wish I was closer to retirement.
My preferred workflow these days is to pair program with an LLM until it gets close-ish and then manually touch it up. Without that, it just produces junk in different forms.
No, you can't. Adjusting prompts ensures absolutely nothing.
The author specifically says:
> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly (which is one reason we may need more, not less, coders in the future, to help with the explosion of new uses for software)
which acknowledges pretty clearly that engineers bring a level of insight and experience still missing from Mythos. Saying that, I totally disagree with his contention that this will always be true. It's pretty weird that the author of an article stressing the steep improvements in a model's capability can't seem to imagine further improvements in that capability. As if Mythos is where development ends or whatever gap remains between models and experts won't steadily narrow or eventually widen in reverse.
> With Fable the spell has gotten powerful enough that I am no longer sure I am the wizard. I am closer to a patron. I describe what I want, I pay for it, and I judge the result. The conjuring happens somewhere I cannot watch, in hundreds of small choices I never get a vote on. The work has shifted from process to outcome. I no longer steer; I commission.
have a very different meaning coming from a non-technical researcher than they would from someone who builds software for a living.
Apple was Woz's side project, once upon a time. Adsense came from Google's 20% time. Social media started as a side project.
Forests grow from trees. Trees grow from seeds. More potential seeds = more potential forests.
The question was "are side projects a trillion dollar industry" not "has a side project ever started an industry"
How much of a new $1T software product will Anthropic capture in token costs, anyway?
But which self-own exactly do you mean, of the many there are?
> I am sure it is not perfect (I only spent an hour working with the results), but a software engineer would iron out the remaining potential bugs that I could not find quickly [...]
People have said things like this many times in the past, and, in the past (perhaps not now), it's always been a misunderstanding of what is good and bad, what's difficult and easy.
For example, someone would draw a UI in a GUI painter that generates code (or a resource file), and a manager would see it and think the majority of the work towards the product is done. (Incidentally, then there seemed to be a reaction, towards making your UI mockups look abstract or otherwise different from runnable code, helping the nontechnical to understand that this isn't 90% of the finished product.)
Or a student intern hacks out a homework-grade demo, and a manager who understands neither software engineering nor product domain says "we just need some engineers to polish it up for production", and thinks the student is a star and why can't their engineers be as brilliant and productive. (I might have once been that energetic intern, who was happy for the encouragement, but then learned more, and saw it was a thing.)
This common misunderstanding was sometimes self-correcting -- when trying to ship became a disaster of misery and regretted-attrition, or the product was poorly received by the market because it was neither thought through nor implemented well, or building subsequent functionality atop it was a nightmare. (But the adverse effects of bad approaches are one of the reasons for management and ICs to job-hop, before the unwanted effects affect them personally.)
What might be different now is that some of these AI tools are outputting better-engineered work than some software engineers, and much faster.
At the back of my mind, I'm wondering how the really great software engineers will continue to stand out, as the discipline is being devalued in the minds of most leadership, and anyone can prompt an AI to generate something that superficially appears to them like what they assume a great software engineer would produce. (Even if the great engineer would do much better quality of implementation, have innovative ideas that ML from open source code would not, and maybe arrive at better product concepts as they worked through the problems.)
The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.
Like, an AI coaching session for executives at the yearly executive retreat. You show up, spend a few hours going through some nonsense slides ChatGPT put together for you, you charge an eye watering fee for it, HR or whoever organizes it will gladly pay for it because it will make them look all cutting edge in front of the CEO, by the next day everyone will forget about it. No accountability at all!
If you want to get paid to work on software, you get involved after it's found success and the stakes get higher.
(Which assumes there are still significant areas where economies of scale reward that vs everybody just having their own DIY version of everything.)
Monoliths vs micro-services.
Unless you know enough to tell them to! And keep them honest about it...
this doesn't really work in the real world. There are many things that actually matter, engineering is fundamentally about handling them.
the quality of produced code and the medium
A thought I have been tossing around in my head as the models get better is that it really may not matter what the code looks like. If the observed behavior of the software is good, then the software is good. If a bug, of whatever kind, can be fixed by a model on a vibe-coded codebase, then that's a fixable bug. If there are no exploitable vulnerabilities, then the code is secure. If the performance is adequate, then the code is performant.
It simply does not matter what the code looks like if, from the outside, it does what it's supposed to, and, from the inside, a model can fix the issue if one is found.
More than ever, software engineering is now really a job about making sure the code is doing what it's supposed to.
And even if it DOES matter what the code looks like, you can have a model fix that too.
But all of those kinds of correctness are imaginary. The hardware only enforces a few (and it may be buggy). The OS adds some more (and it’s buggy). The compiler/interpreter may have bugs (but that’s rarely a nuisance) and the libraries are often brittle. There are cracks everywhere in the tower of abstractions.
The code has never mattered. What has always mattered is knowing the model of correctness of the software (Naur's "Programming as Theory Building"), so that you can discern where a program is wrong.
The thing is a crash or some other immediate error is actually nice to have. You get to react immediately and can have a core dump or a stacktrace that points you to the error. What is truly a terror is silent corruption (wrong order of operations, wrong values for a comparison that has expanded the idea of correctness, security issues that have been backdoored for years,…).
As Hoare said:
> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
> The first method is far more difficult.
LLMs are very much the second kind. You write a lot of complicated code, and then you can no longer reason about its correctness.
That is so real. Brilliant!
I clicked one of his examples intrigued "a snake game where the snake is self-aware and crazy things happen;". Played for 1-2 minutes, and it's the classic 1980s snake game. Am I missing something? What is "self-aware" about it? Some funny messages at the bottom of the screen? And what are the "crazy things"?
I will say, the way the act of eating creates a "bulge distortion" that flows down the length of the snake is a nice touch though.
And at my own firm, I think every developer is generating most of their code using agentic coding. We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase. That is because writing the changes and raising the PRs is much faster, but also a lot of boring admin and support work is now mostly done by LLMs. Reports of instability, vague client requests, etc? Throw the LLM at them and it usually figures it out while I continue to engineer.
So I know, first hand, that these things are very good. I also know second and third hand that pretty much every fintech in the industry is as heavily using agentic coding as we are.
And then I come to HN or reddit and I see people telling us that they cannot write decent production code, and this is just wrong. This isn't opinion wrong, it is objectively wrong. Any fintech that wants to keep up will tell you this.
I can't speak for other industries but I can't imagine they're different.
So, I'm not sure what to conclude from this. I don't want to be uncharitable, but when HN/reddit posts just don't match the reality I see for myself, I have no choice but to categorise them as being emotionally driven to stick to a particular narrative, and so I can dismiss them.
What I take from that time also is that the hand loom weavers were not incorrect. The power loom did not do as good of a job as they did by hand.
You can still buy a hand-woven shirt today at a premium price.
There is a category error as if quality is the product as opposed to one input of the product.
You probably don't get to be a master craftsman without that quality mindset so they aren't wrong but missing the forest for the trees.
Yes, it does nearly all the typing for me now. But left to its own devices, it'll happily spit out awful code.
> I see people telling us that they cannot write decent production code, and this is just wrong.
At least for me, that has never been the counterpoint that I’ve been making. I’ve never cared about code itself, especially with languages like Java and Kotlin, where you basically autocomplete most of the code, and with SDKs like iOS where you can collect snippets for most of the patterns that you need. And with frameworks like Laravel, where most big additions are done with the tooling. And because code is so repetitive, editors like Emacs and Vim have lots of features and plugins to help with copying and pasting (registers, macros, navigation, snippets,…)
And the fact is some code you wrote today will be worthless tomorrow and will be replaced and deleted. So, it’s very rare to care about some particular snippets or patch of code.
What I, and others, have been complaining about is the quality of the codebase and the sustainability of the practice. Especially with the associated claims about increased productivity.
I care about correctness. Simplicity and reduced amount of code increase my confidence that I can achieve it. New features, until tested in production, are more probable to decrease the reliability of the software. And with each fix for a bug, I need to make sure that I’m not adding five more.
To this day, I’ve not seen any compelling argument that is about writing better code reliably. I’ve seen a lot about writing more code. It’s like a manager thinking that if you’re not at your computer typing, you’re not working.
> We're still sceptical enough that we are doing the usual heavy handed human review process, so we're not seeing a huge speed up in delivery times, but we are seeing a volume increase
Are you seeing a quality increase? Fewer customer bugs, fewer outages, faster resolution? Are you measuring those?
We're not at the stage to measure yet. We may be behind others, not sure. Actually, this isn't quite true. I was interested, so I created an ad-hoc report (with AI) on PRs landed per week over time. This has gone up over the last 6 months. But it is hard to say why that is. It might just be people are raising smaller PRs because it becomes easy to have the AI split things up, while before, people were too lazy to do this.
Our bottleneck is still that we want humans to review. Sometimes we spot errors, but our pre-existing testing frameworks are very robust already, so if these pass, we're very confident to release to production, and the agent is excellent at understanding the existing testing frameworks and adding to them for new stuff.
So in our team, we don't often see blatant logic errors. It is mostly to do with things like using a pattern that is used elsewhere in the codebase (or not at all) and doesn't belong in our specific section of the code (we have a large monorepo). These become fewer as we enhance our ruleset (AGENTS.md or CLAUDE.md) for our particular developers.
So how can you justify this comment of yours from your reply if you’re not measuring anything? Mind you, I can easily get good results from AI tools, but I don’t like the experience and the code is often over-engineered and drifts away from my target architecture.
But the worst is quickly losing sight of the tiny technical details that matter when solving bugs or altering features. I don’t like typing code. What I like is to be able to go directly to the code that I need to change, modify it, and then verify that it works. Most of my time is spent deep thinking about the design of the software which is orthogonal to code.
And if there is one thing that is common about people fully onboard with LLMs, it is that they can talk about the product, but they can’t argue about its behavior and its correctness. There’s no intrinsic model that they can compare with the real code. They don’t know the edge cases, the technical pitfalls, how the software will react if you modify one component. Any brainstorming session quickly turns into a slog because they cannot contrast approaches anymore. You can see the decay of understanding in realtime.
I think it is going to continue to get better, and I don't think we'll be having this argument in two years' time. Our entire industry will look very different.
I am creating a game and I can say that with the coding part the models help a lot, mostly gpt 5.5 high. Tbh to me all the frontier models feel the same and they can all solve the stuff I do quite well with some guidance and prompting. But that kind of makes me appreciate the other stuff more like visual style, sound design, mechanics etc etc. Tons of work still.
For brainstorming I find the models bad nowadays or maybe I am just too critical of the results
The lack of downvotes on posts on HN has always felt like more of a bug than a feature to me.
Everyone does. You don’t think about it every day because we’ve delegated it to experts who don’t come up with a new composition of asphalt every time you press “generate”. It’s rigorously battle tested and short of intentional negligence, it’s consistent. I’m amazed how people are forgetting how the world actually works.
If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour, no-one outside of software engineers will know or care.
They do those in labs, and then studies are made to prove that it can replace the current composition. They do not invent those on the spot and let the drivers QA the road.
> If AIs can generate code that looks ridiculous to humans but over time has the correct performance, the correct behaviour
It’s on you to prove that this big “if” can be realized. A -> B only matters when A is true.
Not really. This is a discussion about what code looks like if AI can write applications that are as good, stable, correct as humans.
I think they can, better than most programmers at the moment, with the correct guardrails and supervision. But in time, I think we may not need to review the code at all, but instead verify correctness and performance only. The AI can write the code however it likes.
Obviously I don't have a proof for this, but based on the progress I've seen so far, if someone forced me to bet one way or the other, this is what I'd bet on.
But yes, you are right - I don't build roads and don't know what it costs to build a road or how to determine the quality of a correctly built one, nor will I ever care or learn.
That's not how I am reading it. You will get a road built exactly to your spec, quickly. So no penguin crossings unless you ask for them.
I am also not entirely sure how the pothole argument translates.
I get that there's little sense in arguing with the MBA hivemind, but... c'mon.
I manage two teams of highly motivated, largely pro-AI engineers. Both teams have independently concluded that they needed to ramp down GenAI usage because of code quality / maintainability concerns. Both teams have suffered from protracted outages caused by LLM jank not being sufficiently fenced off and guarded against. Both teams have expressed concern that the code generated by LLMs is far too verbose, full of slop, and rapidly becomes an unmaintainable mess.
These are teams that are building non-trivial LLM solutions (deep agentic data synthesis and multi-modal data tagging). They are using the technology creatively and pro-actively, not just vibe-coding slop and throwing their hands up when it fails. Both teams will continue using GenAI coding agents, don't get me wrong - but the gains are incremental, not transformative, and need careful fencing to make sustainable.
Nothing in these articles resonates as real. People who work in reality don't agree. I don't understand why this shit keeps getting attention (or rather I do, but the reasons aren't good).
So AI is only interesting to you / your org / humans if it can do things that you cannot achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?
If we cannot rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.
In fact, that's the entire reason we care about "quality code", because we assume that quality code is code that does what you expect well and consistently.
I say this as someone who hand writes code pretty much every night for fun, just to experiment with computation. Which, oddly, is more fun than ever because I don't feel like there's any need to connect this type of programming with "real world software", and I can really enjoy code for its own sake, meanwhile my job is mostly just running agent loops (which I quite like as well).
That is the entire purpose of "quality of code".
If the end user experiences a correctly performing application, now, and in the future, they don't care at all what the code looks like.
AIs could resort to a single global array of primitives and forget all about functions, and just use gotos if it helped them (it probably doesn't).
Also this is easily solved by .md spec files, this whole "bad code" cope is just FUD.
Yet, I can't deny the reality that I observe working with LLMs every day. If this truly is a step-function (as some are suggesting), then I have absolutely zero concern for the quality of the code.
I said I had zero concern for the quality of the code. That is, I do not have concern that the quality of the code will be a concern in and of itself.
It's a subtle but, IMO, important difference. We only care about code quality insofar as it gives us stable, understandable systems. Historically that meant a human had to read and understand it. Suppose a future where that's no longer the case: then we may still end up with stable, understandable systems without understanding every minutia of the substrate. It's the same way I don't really know if my compiler is correct, but the behavioral patterns of my code suggest it is without me understanding anything about its code quality.
It also burned through my usage quota like a late-90s Hummer.
Yeah. I have a Max 5x subscription and Fable burned through 16% of my weekly quota in a 40 minute code review session. It didn't even finish the review, it switched back to Opus 4.8 in the critical memory safety parts where I actually needed Fable.
I feel like I'm going to get priced out of these models soon. I should probably try to get the most out of Fable until June 22nd.
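To put the quota figure above in perspective, here is a rough extrapolation. It assumes the burn rate stays constant for the whole week, which is an assumption of this sketch and not something the comment claims; the function name is made up for illustration.

```python
def minutes_until_quota_exhausted(percent_used: float, minutes_elapsed: float) -> float:
    """Linear extrapolation: total minutes of similar work before 100% is used.

    Assumes a constant burn rate (an assumption, not a measured fact).
    """
    if percent_used <= 0:
        raise ValueError("percent_used must be positive")
    return minutes_elapsed * 100 / percent_used


if __name__ == "__main__":
    # Figures quoted in the comment: 16% of the weekly quota in 40 minutes.
    total = minutes_until_quota_exhausted(percent_used=16, minutes_elapsed=40)
    hours, minutes = divmod(int(total), 60)
    print(f"Weekly quota gone after about {hours} h {minutes} min of similar sessions")
```

At that rate, the whole weekly allowance covers a little over four hours of this kind of review.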
It's not just salary, but also safety/labor regulation, legal risk, vacations, sick time, personal conflicts, HR, benefits.
Even when automation is more expensive on paper, it's generally still cheaper
You underestimate what these models cost. Uber's budget is $1,500/dev/month. I gather that was put in place because the devs were going through $6,000/dev/month, which Uber decided could not be cost justified.
Fable costs at least twice as much, or $12,000/dev/month.
Fable can apparently work for hours without supervision, which means a skilled engineer can now have it working on many tasks concurrently. I would not be at all surprised if they can put a nought or two on that number. If you do that, you are well out of "what a human costs" territory.
$1,500/month needs to be contextualised against the fully-loaded cost of a software engineer. Uber's average TC for a US-based software engineer is around $350k, the fully-loaded cost is going to be in the $450k-$500k range. So we're talking around $38k/month for a software engineer.
$1,500/month isn't even a drop in the bucket. If LLM use lets them shave just one person off a team, that pays for tokens for the next 25 engineers.
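The arithmetic in this exchange is easy to check. The sketch below only reuses the numbers quoted in the comments ($1,500 budget, $450k-$500k fully-loaded cost); none of them are independently verified, and the function names are invented for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the thread.
# Every input is a number from the comments, not verified data.

TOKEN_BUDGET_PER_DEV_MONTH = 1_500   # quoted Uber budget, USD
FULLY_LOADED_LOW = 450_000           # quoted annual fully-loaded cost, low end
FULLY_LOADED_HIGH = 500_000          # quoted annual fully-loaded cost, high end


def monthly_cost(annual_cost: float) -> float:
    """Spread an annual cost evenly over twelve months."""
    return annual_cost / 12


def budgets_covered(annual_cost: float, budget: float = TOKEN_BUDGET_PER_DEV_MONTH) -> float:
    """How many monthly token budgets one engineer's monthly cost would pay for."""
    return monthly_cost(annual_cost) / budget


if __name__ == "__main__":
    for annual in (FULLY_LOADED_LOW, FULLY_LOADED_HIGH):
        print(
            f"${annual:,}/yr -> ${monthly_cost(annual):,.0f}/month, "
            f"covers {budgets_covered(annual):.1f} token budgets"
        )
```

The low end gives $37,500/month and exactly 25 budgets, which matches the "next 25 engineers" claim. Plugging in the $12,000/dev/month figure from the other side of the argument, `budgets_covered(450_000, budget=12_000)`, gives only about 3, which is why the two commenters reach such different conclusions.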
I kinda get why execs are excited
Our 401ks turn on this actually being true. Otherwise pop.
People keep saying this and it keeps not happening.
ChatGPT Pro was $200/mo when it launched in '23 for a ~100B class model with 8k context. Claude Max is now the same price for practically unlimited access to a ~1T class model with 1M context.
Moore's Law never died, it just switched architectures.
I can't help thinking there might be some kind of strategic issue here.
Perhaps someone should ask Mythos about it.
If you get $100,000 per year as a SWE, and Anthropic offers a coding model for $100,000 per year (but working 24/7), then you'll have to give up all of those addons that make the fully burdened cost of the employee. Say goodbye to vacation, sick time, benefits, etc.
> "They're slaves."
> "Well, what the heck," said Buck. "I mean, they aren't people. They don't suffer. They don't mind working."
> "No. But they compete with people."
> "That's a pretty good thing, isn't it--considering what a sloppy job most people do of anything?"
> "Anybody that competes with slaves becomes a slave," said Harrison thickly, and he left.
Kurt Vonnegut, Player Piano
As far as I can tell this part of the job isn't really on anyone's radar anymore.
However, given this model now silently corrupts its own work if it thinks you are up to no good, it's absolutely 100% not Mythos so possibly Mythos is better, but who knows now that the alignment and safety people are on the case, inadvertently keeping humans in the loop?
https://simonwillison.net/2026/Jun/10/if-claude-fable-stops-...
Do you not believe in running tests, evaluations, or experiments at all to better understand your environment?
The ROI in the case of a positive outcome is the reduced time needed to inspect the results in the future (the entire point of AI is to know what you can trust it on, so you can delegate everything at that level with less oversight). The ROI in the negative case is the tokens not wasted on tasks too ambitious for the model.
We know this model will be cheaper and faster with time.
And we have not even reached the timeframe where we have ASIC-style models.
OpenAI has to do something that beats Fable, otherwise Anthropic has won. China is currently overtaking in cars, PV, batteries, and very soon silicon chip making; it has every incentive to also take over AI.
Why? Demand for AI compute seems to be increasing faster than new production is due to come online for the foreseeable future, particularly if more-intensive models induce demand.
So I would expect Fable-level intelligence to get cheaper.
I find it good for code reviews.
Huawei just showed LogicFolding and have a roadmap for 1.4 nanometer by 2031; SMIC is going for 5nm.
And all of this WITHOUT EUV.
Every sw dev knows this is a very dangerous, and unrealistic, assumption.
"Posterior beliefs about market demand are purely referencedependent: holding dollars raised constant, they track only performance relative to the founder’s self-chosen goal—jumping half a standard deviation at the threshold, responding steeply for the first ten points past it, and flattening thereafter"
Humans generally don't verbalize data this way. The summary document is also very fluffy.
So nope, not the AGI. But definitely an improvement.
That's the kind of behaviour I've seen in Claude Code (Opus 4.8) when its context space is over the 40-50% range.
I tend to keep an eye on the context usage (ie `/context`) quite a lot, and generally see good results as long as the context usage is ~30% or below.
Which isn't heaps, considering having to ensure it has the required docs/stuff it needs can take 15-20% of context by itself.
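To make the squeeze concrete, here is a small sketch of the budget described above. The 30% threshold and the 15-20% docs overhead come from the comment; the 200,000-token window is purely a hypothetical placeholder, since the comment does not state a window size.

```python
HYPOTHETICAL_WINDOW_TOKENS = 200_000  # placeholder, not stated in the thread


def working_headroom_pct(threshold_pct: int, baseline_pct: int) -> int:
    """Percentage points of context left for the task before quality drops."""
    if baseline_pct > threshold_pct:
        raise ValueError("baseline already exceeds the threshold")
    return threshold_pct - baseline_pct


def headroom_tokens(window_tokens: int, headroom_pct: int) -> int:
    """Convert percentage points of headroom into a token count."""
    return window_tokens * headroom_pct // 100


if __name__ == "__main__":
    for docs_pct in (15, 20):
        pct = working_headroom_pct(threshold_pct=30, baseline_pct=docs_pct)
        tokens = headroom_tokens(HYPOTHETICAL_WINDOW_TOKENS, pct)
        print(f"docs at {docs_pct}% -> {pct} points of headroom (~{tokens:,} tokens)")
```

Under these numbers only 10 to 15 percentage points of the window are left for the actual work.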
Not exactly strange behavior, Opus acted just like this too when I first subscribed. The popular meme is Anthropic nerfed Opus during their capacity crunch. No idea if it's true, but I do wonder if Fable will fall victim to the same fate.
> Again, it wasn’t perfect. As an expert, I was able to spot some errors and omissions (some as a result of the design I had asked for) that I had the AI correct
That's the bit that stuck out to me - that's longer than I would expect to work on a problem in a day or even expect to go back & fix the output of something that has a core reward loop of hours.
My customers are currently clamoring to push down my agent response times from 85 seconds down to below the 20s mark.
At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.
We're gonna go back to the days where our bosses ask why we're just sitting around, but instead of saying "compiling," we'll just say, "waiting for Claude."
https://monstersandmemories.com
It's in private beta but sometimes they have a public beta, like just last week. They were supposed to have released this month but they pushed back to October.
Also check out Adrullan Online, it's also an EQ clone but Minecraft voxel style. More like alpha status, they don't seem as far along.
Will Claude's code be perfect in one shot? Probably not, will it get you 80 to 90% of the way there with your chosen design patterns in under a few hours? Absolutely.
Sounds like we've nearly reached in coding the point where Paul Bunyan [0] has his epic competition with the chainsaw... and loses by 1/4" and history forever changes...
It's some prompt engineered AI harness, that guides the AI to create stats after it researches a subject and ingests the data, but I'm not sure what is it that the tool actually does on top of this.
At this point, pay me significantly more, and I'll do it.
Ha ha, that's how you negotiate yourself out of a job!
There are people that almost feel physical pain if something is unnecessarily incorrect.
+ That if the mental model of something is accurate, it is actually _more_ work to say something that is incorrect than just saying the correct thing.
If you had your own on-premises LLM, that would indeed be your LLM, and it would make sense to compare it to the on-premises LLMs of other people, as your setup particulars would affect the result.
There was a time where one actually bought software to own it.
This time is.. actually it is right now. Please leave at once.
Similar to "My game just crashed".
Jira otoh is not yours, because it's in the cloud. It might be "my internet connection", "my browser" or "my account" that is having trouble.
___
Hm. "My train got delayed" is interesting in this context. I don't find that offensive. But that also might be because trains don't seek rent the way SaaS does? Not sure.
I guess trains do not hold me hostage. They might just be a container in which someone does that.
Jira, cloud LLM inference or similar otoh..
I guess the main difference is that TAAS has many different trains where the experience varies wildly, so it helps to be specific on which train you're licensing; but LLMs are the same product for everyone, and you can't stay with say, ChatGPT 1.0, you get the same choices as everyone else.
That's ridiculous. You wouldn't respond to "I went to visit my doctor yesterday" with "but slavery has been illegal since forever!" Similarly it would be foolish to respond to "where should we meet? my place or yours" with "but we both rent!"
I'm amazed we're so far into SOTA bloat that the Chinese will kill once they start etching silicon with these models.
In a project like mine (https://github.com/tsz-org/tsz) I am constantly frustrated that models were not doing enough research and were not taking into account other situations. Again and again models would produce code that would fix one thing and break 2 other tests that were "unrelated".
With Fable it seems like tasks are taking much longer (I have not seen a pull request from Fable sessions yet) but reading the transcription of those sessions I can see how it is doing the right thing by not leaving any stone unturned.
As the article says, it's hard to communicate this "feeling" about models because it is very project specific, but I thought I'd share.
But overall, this is pretty normal for compilers to have this sort of "unexpected" tests failing due to some work in an area. It happened to me when I was coding everything manually back in the day too
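The "fix one thing, break two unrelated tests" pattern described above can be caught mechanically by diffing test outcomes before and after a change. The sketch below is a generic illustration, not how the tsz project actually does it; the test names and the `find_regressions` helper are hypothetical.

```python
def find_regressions(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two test runs, each mapping test name -> passed.

    Returns the tests that went from failing to passing ("fixed") and
    the ones that went from passing to failing ("broken").
    """
    fixed = sorted(t for t, ok in after.items() if ok and before.get(t) is False)
    broken = sorted(t for t, ok in after.items() if not ok and before.get(t) is True)
    return {"fixed": fixed, "broken": broken}


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    before = {"generics_infer": False, "enum_merge": True, "tuple_spread": True}
    after = {"generics_infer": True, "enum_merge": False, "tuple_spread": False}
    report = find_regressions(before, after)
    print(f"fixed:  {report['fixed']}")
    print(f"broken: {report['broken']}")
```

Feeding a report like this back to the agent after every change is one way to make it notice the "unrelated" breakage instead of declaring victory on the single test it was asked to fix.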
That's not what a clean setup means... I mean good separation of concerns, established invariants, etc.
Personally I don't really care, because I like coding and learning myself and DeepSeek Flash is all I really care about. But it's really easy to have a ton of benchmarks where the top models can't get anywhere close - and I like to test them on these problems to see how good they are getting.
Fable 5 is def a little better than 4.8 btw.
Myth. Total myth! I recently had to beg for more RAM after continually hitting swap space which causes tools like dictation to stop working, failure to load certain websites without rebooting, and so on. Devs do in fact need powerful machines and the ~$500-1000 an employer saves upfront in machine costs is dwarfed by productivity losses.
Giving your engineering employees new machines in a 2-year cycle that are between the middle and high end is one of the cheapest ROI decisions that a tech org can make.
A small portion of this effort is having a high quality Lua in Rust repo. I’m using Mythos to fix some of the performance issues with my Lua interpreter that GPT 5.5 / Opus 4.8 had stonewalled on.
Not sure if Mythos will be able to crack this but it has been running for a couple hours now with some promising results.
Performance charts linked here if you're curious: https://github.com/ianm199/lua-rs
The other reason is that because mlua is just a wrapper around the C code, it has unsafe you can't really get around. So for example Lua is used in Redis, which has this critical CVE https://github.com/redis/redis/security/advisories/GHSA-4789... that a memory safe version of Lua wouldn't have to deal with.
Mlua is still fine or even better for many other cases though!
It just seems like a lot of hassle to write a lua interpreter, although it would be nice to see a high quality one in Rust :)
Hematita was promising, but looks abandoned.
And yes, it seems like there have been many attempts to get a solid Rust Lua over the years and most never reached parity, so I'm hoping some people can find a use case for it! This one is at full parity in terms of behavior, and performance is getting to within striking distance.
Fable 5 found quite a few issues Opus 4.8 missed on code review, even though the stupid cybersecurity nonsense downgraded it. I can't tell you more, I only get a single session per 5h window on Max 5x. Only ran two sessions so far.
On the margins, suppose the prompt is literally: "Build a feature complete, high polish Facebook clone". Facebook is complex but likely not super complicated tech, and still I would assume that (after having burned through a substantial amount of tokens) you would find substantial enough differences in the outcomes between different models on that prompt on various fronts.
The above ask is obviously not useful, but what's preventing you from taking on bigger chunks until you approach the limit? At some point you would hit a boundary, where the diff will be obvious.
> This is from "The Cyberiad", a collection of science-fiction fairy tales by Polish author Stanislaw Lem ... In one of the stories, a robot constructor named Trurl creates a machine that writes poetry. A jealous rival named Klapaucian challenges the machine to compose "...a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism and in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter s!!"
And the computer responds with:
"Seduced, shaggy Samson snored.
She scissored short. Sorely shorn,
Soon shackled slave, Samson sighed.
Silently scheming,
Sightlessly seeking
Some savage, spectacular suicide"
The author had to be referencing this moment in their challenge to Fable/Mythos. I'm curious to know what their exact prompt was.
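Two of the constraints in that challenge (six lines, every word starting with "s") are mechanically checkable, which makes this a nice small test for any model's attempt. A minimal sketch follows; it does not try to judge rhyme or tone, and the `check_constraints` helper is made up for illustration.

```python
import re

POEM = """Seduced, shaggy Samson snored.
She scissored short. Sorely shorn,
Soon shackled slave, Samson sighed.
Silently scheming,
Sightlessly seeking
Some savage, spectacular suicide"""


def check_constraints(poem: str, letter: str = "s", expected_lines: int = 6) -> dict:
    """Check the line count and that every word starts with `letter`."""
    lines = [line for line in poem.splitlines() if line.strip()]
    # Runs of letters only, so punctuation is ignored.
    words = re.findall(r"[^\W\d_]+", poem)
    offenders = [w for w in words if not w.lower().startswith(letter.lower())]
    return {"line_count_ok": len(lines) == expected_lines, "offenders": offenders}


if __name__ == "__main__":
    result = check_constraints(POEM)
    print(f"six lines: {result['line_count_ok']}, offending words: {result['offenders']}")
```

Run against the Kandel translation quoted above, both checks pass: six lines and no offending words.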
Cyprian cyberotoman, cynik, ceniąc czule
Czarnej córy cesarskiej cud ciemnego ciała,
Ciągle cytrą czarował. Czerwieniała cała,
Cicha, co-dzień czekała, cierpiała, czuwała...
... Cyprian ciotkę całuje, cisnąwszy czarnulę!!
You can consider the job of a translator as compared to an LLM. Both produce derivative works, working within some constraints but with room for creativity. Or it just swept it up in the training data, given that Anthropic licenses Reddit comments.
https://isochronic-passage-chart.netlify.app/
Doesn’t work too well on mobile but looks interesting
I also see some logic flaws. It overlooks the option of going to a major hub to access faster aircraft, rather than hopping on local hubs.
Also, immigration and customs are cleared at the first airport you arrive at in the country, not at the last one.
In some countries, you need to clear immigration even while going to a third country, so 1 hour is not enough to do it.
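The hub objection above is the kind of thing a plain shortest-path search over travel time handles naturally. The sketch below uses Dijkstra's algorithm with a fixed layover charged at every connection. The airports, flight times, and the one-hour layover are all hypothetical, and this is not a claim about how the linked map computes its routes.

```python
import heapq


def fastest_route(graph, start, goal, layover_hours=1.0):
    """Dijkstra over travel time. Returns (hours, path) or None.

    graph maps an airport to a list of (neighbor, flight_hours) pairs.
    A layover is added before every leg except the first one.
    """
    best = {start: 0.0}
    queue = [(0.0, start, [start])]
    while queue:
        elapsed, airport, path = heapq.heappop(queue)
        if airport == goal:
            return elapsed, path
        if elapsed > best.get(airport, float("inf")):
            continue  # stale queue entry, a faster way here was already found
        for neighbor, hours in graph.get(airport, []):
            layover = layover_hours if len(path) > 1 else 0.0
            total = elapsed + layover + hours
            if total < best.get(neighbor, float("inf")):
                best[neighbor] = total
                heapq.heappush(queue, (total, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    # Hypothetical network: slow local hops versus a detour via a major hub.
    graph = {
        "SMALL": [("REGIONAL_A", 1.0), ("HUB", 2.0)],
        "REGIONAL_A": [("REGIONAL_B", 3.0)],
        "REGIONAL_B": [("DEST", 3.0)],
        "HUB": [("DEST", 4.0)],
    }
    print(fastest_route(graph, "SMALL", "DEST"))
```

In this toy network the hub route takes 7 hours against 9 for the local hops. Country-specific immigration rules, like the ones mentioned above, could be modelled by making the layover depend on the airport instead of using one constant.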
Which just about sums up my experience with using LLMs to code, really (though not with these state-of-the-art models, admittedly) - it's amazing what they can do, but left to their own devices they'll make boneheaded decisions.
Yeah, the whole "can run for 9 hours on a task" to me is not a positive.
I tend to find if Opus 4.8 runs for ~15 mins on a task, then the end result has gone off in a weird direction at some point, and it needs winding back a fair bit.
And that's with extremely clear direction, literal specification docs to follow, etc.
That being said, having functional code already created beforehand (ie by a human) goes a long way to ensuring the AI model has a path it can build on without making too many dumb architectural choices by itself. Generally.
The real issue with the title is that it doesn't fit in the box!
It's like someone took a beautiful, intricate piece of vintage jewellery and made a slapdash imitation out of cheap plastic.
I'm not very threatened by this if this is the dangerous Mythos model - it just seems like a slightly incrementally better sonnet
He is a professor but sadly also an AI shill. He should switch to advertising washing powder.
> Switched to Opus 4.8: Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.
The poem Kandel translated from the original Polish was, for artistic reasons, completely different. I will be impressed when machine translation can duplicate that!
She scissored short. Sorely shorn,
Soon shackled slave, Samson sighed.
Silently scheming,
Sightlessly seeking
Some savage, spectacular suicide.
- That's the translated Cyberiad poem the blog post based it off of (or the AI decided to do so)
I don’t see why working longer is a pro. The results don’t seem much better than you’d get from putting Opus in a long loop.
Care to share the results you got from Opus working on the same prompt? It should be easy to compare quality.
I do not fear that management will get tools like Mythos and then not need people like me. Most of the value I provide is in translating what the management/client _thinks_ they need into what is the real problem and solution.
That's not an insult to them, it's just pointing out that they see only their problem, and they imagine what would be the solution. They then ask for that solution. Quite often, what they want built isn't what they need. And I've seen so many problems, from so many domains and scenarios, that I can usually recognize the core need and propose (and build or direct building of) a solution which resolves that need AND has an eye toward the likely future needs.
Mythos may do an excellent job providing a high quality result based on what is asked of it. But the result will only be as good as the quality, clarity, and presentation of the request.
If I hire a home builder to build me a custom home, that builder is going to ask me a thousand questions - questions I had never even thought of. Mythos isn't going to ask all those questions - it's going to make the best choices it can without the consultant's level of interaction. And the buyer will get what they get. Sure, the buyer can then say, "oh, I don't want any hallways - just connected spaces." Then the house gets demolished and rebuilt to the new, clearer spec. Repeat, repeat repeat. Maybe eventually the buyer gets what they really want. More likely they give up before reaching that point, and they go and hire a real builder.
I'll sum it up like this: You can get great results with minimal effort if you don't really care too much about the details. But if you don't care much about the details, then your need probably wasn't very significant.
Sure, AI can auto-complete the line, but it can't write full functions.
Sure, AI can write functions, but it can't complete full features.
Sure, AI can write full features, but it can't build full applications.
Sure, AI can write full applications, but it can't build them in the right way / ask the right questions / write beautiful maintainable code / do what _I_ do..
Time will tell.
The problem is much broader though - consolidation of wealth and power have enabled, frankly, idiots to be able to control how the world works - from politics to business. Greed and stupidity is eating the world.
I don't see any solution. This is like a disease that will either eventually kill the body or take a long time to heal, leaving deep scars and forever changing humanity.
Maybe War Games was right - the only way to win is not to play. Therefore, find something you love (even if it doesn't pay well), and do that.
(I spent two years looking for a tech job. My 30 years of broad and meaningful experience is apparently not interesting to at least the 200 companies I applied to. So now I'm a teacher, and I'm quite happy.)
I have a mere 10y of experience, but I have also been looking for 1 year already and am considering whether I should become a teacher. Dealing with unruly children might be nerve-wracking, and the tech level will be very basic, but I have always enjoyed helping others understand things, seeing them grow, and having done my part in that. Also, solid, well-taught foundations are very important. Currently, my only obstacle is that it is not so easy to become a teacher where I live: there are certain requirements, kind of like certifications, that you can't just easily get but need to invest a lot of time into.
Good to read of someone who went that way. How did you manage the transition?
But recently I got my TEFL certification. Now I teach English in Bangkok. My students are high school, and they are fantastic people. Honestly I'm happier than I would have imagined. I only wish I had more time with each student, because they're all great in one way or another. To be more transparent, my school is one where students have to be top performers and compete to get in. So I'm not dealing with students who were like me when I was young ;).
I earn a fraction of what I earned in tech in the past, but it's enough to live with a modest buffer - and still actually enjoy life. I wish I had done this long ago.
Before I took this job, I spent a week teaching "computing" to grades 1-6. For various reasons that wasn't a good situation and I left, but even those kids were pretty great. It's humbling to see what some motivated 6 year olds are capable of creating.
[1] https://isochronic-passage-chart.netlify.app/
[2] https://mapitout.welcome-to-nl.nl/
- Went deep on "what types of guidance even are there? what does giving good guidance mean?"
- Sampled my existing Claude guidance (CLAUDE.md, skills, hooks, etc.) and broke their guidance into "atoms"
- Categorized them by clustering, the same way Big Five was generated
- Generated a new candidate
- Then used independent agents to compare it against my existing corpus assuming that the new one would be worse
Working with it felt like working with a supersmart entity capable of generating very plausible-sounding but not-necessarily-true statements. The outcome certainly felt like an alien artifact, like nothing I'd make myself.
Only time'll tell if it holds up, but it sure had some interesting ideas.
I made serious progress towards repairing a proof for a conjecture that was published 10 days ago but kept running into a wall with one of the Lemmas.
I threw Fable 5 Max at it with the same subagent set up and in an hour it claimed to have disproved a core theorem of the paper.
The Lean construction looks correct, but I still need to verify it rigorously. This is certainly not something Opus 4.6 Max could do and it’s likely something Opus 4.8 Max could do with more delicate orchestration and time. However, the “one-shot” Fable 5 did give me pause.
Maybe my prompts are too vague, but it’s worth noting that every example in the post is a greenfield build, and vague prompting seems to hold up fine when there are no existing constraints to respect.
Other commenters have pointed out that his isochrone map contains a lot of nonsense as well.
So the most charitable interpretation here is that this is a case of Gell-Mann amnesia.
Just an FYI this guy is an AI hype-beast. Some of his tweets are truly out there.
It's likely that at least some amount of additional context was provided to the model to enable it to reliably create the desired form factor. This introduces the caveat that the author probably views some amount of context as being trivial / beneath the level of mentioning. But then the question becomes where they draw the line.
Given a tool that is supposed to unlock creativity and excitement, he made a series of worse clones of things.
Again, technically impressive, but the world has never needed the ability to make Balatro but less polished and coherent. We already have Balatro.
I'd be more convinced if people made things that didn't already exist; show me that these tools enable something you actually want.
Most of the “impressive” stuff is not “the model” but “the harness”. Spinning up the subagents and teams of lower models, letting them explore, do adversarial coding. It’s all in the harness. Granted, Mythos might be better at that orchestration, but it’s still the harness.
Second is the prompting. The author is an expert in what they’re doing and prompts the system in a way that yields useful results. I see too many people believing that if an expert can achieve those results in a domain they’re familiar with, then they, as non-experts, will be able to as well. And that’s a fallacy that Mythos doesn’t change.
There is only one hint: 475k tokens in the screenshot when OP asked the model to fix some behaviour, but it would be fascinating to know the total tokens amount.
The first item on the article, the first thing it showed, was wrong though.
It is 100% faster to go from London to New York in 1881 than to Volgograd. Or any of the Russian hinterland colored green, or Turkey, or Egypt.
the map is for 2026, yeah?
And I'm excited to try it, but also have a fear that I will like it too much and then won't have access to it in 2 weeks... But maybe I will and maybe it will be worth it and I'll just pay a bunch of extra for it and it'll be great!
I think the article could be improved by actually sharing more feelings. I clicked on the article for feelings but I didn't see that many feelings described.
Wow
What makes me excited is that GPT 5.6 (it's actually GPT 6) is going to be crazy
Not a great start for "a generational leap in model effectiveness"
> It worked for nine and a half hours.
And how much did that cost?
Is it a hard problem or is it just labor intensive?
At first I thought its routing was just completely botched.
The text overflow on the legend is pretty funny considering how well the other graphics turned out
(Edit: referring to the map app)
looks nice but deeply flawed
classic LLM output
Edit: A couple hours in and I just got my first gaslighting attempt from the model. Good times!
What?