We do have some idea. Kimi K2 is a relatively high-performing open source model. People have it running at 24 tokens/second on a pair of Mac Studios, which costs about $20k. This setup draws less than a kW of power, so the $0.08-0.15 an hour being spent on electricity is negligible compared to a developer's time. This might be the cheapest setup to run locally, but it's almost certain that the cost per token is far lower with specialized hardware at scale.
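For a rough sense of scale, here's my own back-of-the-envelope math on those numbers (a sketch assuming sustained generation at 24 tok/s and the full $0.15/kWh; hardware amortization excluded):

# electricity cost per million tokens at 24 tok/s, <1 kW draw, $0.15/kWh
awk 'BEGIN { tok_per_hour = 24 * 3600; printf "electricity per 1M tokens: $%.2f\n", 0.15 / (tok_per_hour / 1e6) }'
# prints roughly $1.74 per million tokens

Even at the high end that's a couple of dollars of power per million tokens, before amortizing the hardware.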
In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized but I think this might be overblown [1].
[1] training models is still extraordinarily expensive and that is certainly being subsidized, but you can amortize that cost over a lot of inference, especially once we reach a plateau for ideas and stop running training runs as frequently.
Is Kimi K2 near-frontier though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say, they always say it's great and close to frontier models, but is this others' impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing that for the type of engineering work I do, which is fairly typical stuff.
I’ve been pretty active in the open model space and 2 years ago you would have had to pay 20k to run models that were nowhere near as powerful. It wouldn’t surprise me if in two more years we continue to see more powerful open models on even cheaper hardware.
Are you just using the API mode?
So above and beyond frontier models? Because they certainly aren't "flawless" yet, or we have very different understanding of that word.
During the day I am working on building systems that move lots of data around, where context and understanding of the business problem is everything. I largely use LLMs for assistance. This is because I need the system to be robust, scalable, maintainable by other people, and adaptable to a large range of future needs. LLMs will never be flawless in a meaningful sense in this space (at least in my opinion).
When I'm using Kimi I'm using it for purely vibe coded projects where I don't look at the code (and if I do I consider this a sign I'm not thinking about the problem correctly). Are these programs robust, scalable, generalizable, adaptable to future use case? No, not at all. But they don't need to be, they need to serve a single user for exactly the purpose I have. There are tasks that used to take me hours that now run in the background while I'm at work.
In this latter sense I say "flawless" because 90% of my requests solve the problem on the first pass, and the 10% of the time where there is some error, it is resolved in a single request, and I don't have to ever look at the code. For me that "don't have to look at the code" is a big part of my definition of "flawless".
I've increasingly come to see this as the wrong way to understand how LLMs are changing things.
I fully agree that LLMs are not suitable for creating production code. But the bigger question you need to ask is 'why do we need production code?' (and to be clear, there are and always will be cases where we do, just increasingly fewer of them)
The entire paradigm of modern software engineering is fairly new. It wasn't until the invention of the programmable microprocessor that we even had the concept of software, and that was less than 100 years ago. Even if you go back to the 80s, a lot of software didn't need to be distributed or serve an endless variety of users. I've been reading a lot of old Common Lisp books recently, and it's fascinating how often you're really programming Lisp for yourself and your experiments. But since the advent of the web and scaling software to many users with diverse needs, we've increasingly needed to maintain systems that have all the assumed properties of "production" software.
Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems for solving more than one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time left to also write software.
All of the standard requirements of good software exist for reasons that are increasingly becoming less relevant. You shouldn't rely on agents/LLMs to write production code, but you also should increasingly question "do I need production code?"
Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4: 6-12 months behind SOTA, on something that's well within a personal hobby budget.
I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?
I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop.
Same here: unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss-20b gets 195 tps!
The advantage is that you can boot off the APU, pass the GPU through to a VM, and get nicer, safer VMs for agents at the same time while sticking with DDR4, IMHO.
I won’t use a public model for my secret sauce, no reason to help the foundation models on my secret sauce.
Even an old 1080ti works well for FIM for IDEs.
IMHO the above setup works well for boilerplate and even the sota models fail for the domain specific portions.
While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude Code churn locally on the boring parts while you work on the hard parts.
I'm still not sold on the idea, but this lets me experiment with it fully locally, without paying rent to some companies I find quite questionable. I know exactly how much power I'm drawing, and the money is already spent; I'm not spending hundreds a month on a subscription.
And yes, the Strix Halo isn't the only way to run models locally for a relatively affordable price; it's just the one I happened to pick, mostly because I already needed a new laptop, and that 128 GiB of unified RAM is pretty nice even when I'm not using most of it for a model.
$ uname -a
Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb 6 21:43:09 UTC 2026 x86_64 GNU/Linux
You also need to set a few kernel command line parameters to allow it to use most of your memory as graphics memory. I have the following in my kernel command line; those are each 110 GiB expressed as a number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea): ttm.pages_limit=28835840 ttm.page_pool_size=28835840
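If you want to sanity-check that number: the limits are counted in pages (4 KiB on a standard x86_64 kernel), so 110 GiB works out to:

echo $(( 110 * 1024 * 1024 * 1024 / 4096 ))   # prints 28835840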
Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm; the llama.cpp container ships ROCm 7.0 but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet; I'm pretty much at the point where I've just gotten everything working. In a checkout of the llama.cpp repo:
podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .
Then I run the container with something like: podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2 --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
Still getting my setup dialed in, but this is working for now.

Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:
podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
--rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2 -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
--jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
(As mentioned, I'm still just getting this set up, so I've been moving around between using `-hf` to pull directly from HuggingFace vs. using `uvx hf download` in advance; sorry that these commands are a bit messy. The problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up.)

You might be correct when you say the global 1%, but that's still 83 million people.
have you paid any attention to the hardware situation over the last year?
this week they've bought up the 2026 supply of disks
$20,000 is a lot to drop on a hobby. We're probably talking less than 10%, maybe less than 5% of all hobbyists could afford that.
The naivete on here is crazy tbh.
this is marketing not reality.
Get a few lines of code and it becomes unusable.
I walked into that room expecting to learn from people who were further ahead. People who'd cracked the code on how to adopt AI at scale, how to restructure teams around it, how to make it work. Some of the sharpest minds in the software industry were sitting around those tables. And nobody has it all figured out.
People who say they have are trying to mess with your head.

I don't think you can find that level of ego anywhere in the software industry, or any other industry for that matter. Respect.
This is one of the most interesting questions right now I think.
I've been taking on much more significant challenges in areas like frontend development and ops and automation and even UI design now that LLMs mean I can be much more of a generalist.
Assuming this works out for more people, what does this mean for the shape of our profession?
In the past 6 months, all my code has been written by Claude Code and Gemini CLI. I have written backend, frontend, infrastructure, and iOS code. Considering my career trajectory, all of this was impossible a couple of years ago.
But the technical debt has been enormous. And I'll be honest, my understanding of these technologies hasn't been 'expert' level. I'm 100% sure any experienced dev could go through my code and may think it's a load of crap requiring serious re-architecture.
It works (that's great!) but the 'software engineering' side of things is still subpar.
We’ve been trying to build well engineered, robust, scalable systems because software had to be written to serve other users.
But LLMs change that. I have a bunch of vibe coded command line tools that exactly solve my problems, but would very likely make terrible software. The thing is, these programs only need to run on my machine, the way I like to use them.
In a growing class of cases, bespoke tools are superior to generalized software. Historically this was not the case because it took too much time and energy to build and maintain these things. But today, if my vibe coded solution breaks, I can rebuild it almost instantly (because I understand the architecture). It takes less time today to build a bespoke tool that solves your problem than it does to learn how to use existing software.
There’s still plenty of software that cannot be replaced with bespoke tools, but that list is shrinking.
The value proposition isn't really "we'll help you write all the code for your company" it's a world where the average user's computer is a dumb terminal that opens up to a ChatGPT interface.
I didn't initially understand the value prop but have increasingly come to see it. The gamble is that LLMs will be your interface to everything the same way HTTP was for the last 20 years.
The mid-90s had a similar mix of deep skepticism and hype-driven madness (and if you read my comments you'll see I've historically been much closer to the skeptic side, despite a lot of experience in this space). But even in the 90s the hyped-up bubble riders didn't really see the idea that http would be how everything happens. We've literally hacked a document format and document serving protocol to build the entire global application infrastructure.
We saw a similar transformation with mobile devices where most of your world lives on a phone and the phone maker gets a nice piece of that revenue.
People thought Zuck was insane for his metaverse obsession, but what he was chasing was that next platform. He was wrong, of course, but his hope was that VR would be the way people did everything.
Now this is what the LLM providers are really after. Claude/ChatGPT/Grok will be your world. You won't have to buy SaaS subscriptions for most things because you can just build it yourself. Why use Hubspot when you can have AI do all your marketing and only need Hubspot for its message-sending infrastructure? Why pay for a budgeting app when you can just build a custom one that lives on OpenAI's servers (today your computer, but tomorrow theirs)? Companies like banks will maintain interfaces to LLMs, but you won't be doing your banking in their web app. Even social media will ultimately be replaced by an endless stream of bespoke images, video, and content made just for you (and of course it will be much easier to inject advertising into a space you don't even recognize as advertising).
The value prop is that these large, well funded, AI companies will just eat large chunks of industry.
There are a ton of recipe management apps out there, and all of them are more complex than I really need. They have to be, because other people looking for recipe management software have different needs and priorities. So I just vibe coded my own recipe management app in an afternoon that does exactly what I want and nothing more. I'm sure it would crash and burn if I ever tried to launch it at scale, but I don't have to care about that.
If I was in the SaaS business I would be extremely worried about the democratization of bespoke software.
What people are describing is that normies can now do the kinds of things that only wizards with Perl could do in the 90s. The sorts of things that were always technically possible with computers if you were a very specific kind of person are now possible with computers for everyone else.
People want to open Netflix / YT / TikTok, open instagram, scroll reddit, take pictures, order stuff online, etc. Then professionals in fields want to read / write emails, open drawings, CADs, do tax returns, etc.
If anything, overall interest in software seems to be going down for the average person compared to the 2010s. I don't feel like normal people are going to stop doing most of the above in favor of LLMs, though LLMs certainly do compete with Googling and writing emails for regular people.
Claude Code is producing working useful GUIs for me using Qt via pyside6. They work well but I have no doubt that a dev with real experience with Qt would shudder. Nonetheless, because it does work, I am content to accept that this code isn't meant to be maintained by people so I don't really care if it's ugly.
FOSS meant that the cost of building on reusable components was nearly zero. Large public clouds meant the cost of running code was negligible. And now the model providers (Anthropic, Google, OpenAI) mean that the cost of producing the code is relatively small. When the marginal cost of producing code approaches zero, we start optimizing for all the things around it. Code is now like steel. It's somewhat valuable by itself, but we don't need the town blacksmith to make us things anymore.
What is still valuable is the intuition to know what to build, and when to build it. That's the je ne sais quoi still left in our profession.
“Ideas that surfaced: code as ‘just another projection’ of intended behaviour. Tests as an alternative projection. Domain models as the thing that endures. One group posed the provocative question: what would have to be true for us to ‘check English into the repository’ instead of code?
The implications are significant. If code is disposable and regenerable, then what we review, what we version-control, and what we protect all need rethinking.”
Absolutely. Also crucial is what's possible to build. That takes a great deal of knowledge and experience, and is something that changes all the time.
Second, there's still a world of difference between a developer with taste using AI with care and the slop cannons out there churning out garbage for others to suffer through. I'm betting there is value in the former in the long run.
The generalist capability boost is real. I'm shipping things that would have required frontend, backend, and devops specialists two years ago. But a new specialization is quietly emerging alongside that: people who understand how LLM pipelines behave in production.
This is genuinely hard knowledge that doesn't transfer from traditional engineering. Multi-step agent pipelines fail in ways that look nothing like normal software bugs - context contamination between model calls, confidence-correlated hallucinations that vary by model family, retry logic that creates feedback loops in agentic chains. Debugging this requires understanding the statistical behavior of models as much as the code.
My guess: the profession splits more than it unifies. Most developers will use LLMs to be faster generalists on standard work. A smaller group will specialize in building the infrastructure those LLMs run on - model routing, context management, failure isolation, eval pipelines. That second group isn't really a generalist or a traditional specialist. It's something new.
The Fowler article's 'supervisory middle loop' concept hints at this - someone has to monitor what the agents are doing, and that role requires both breadth and a very specific kind of depth.
Expert generalists are also almost impossible to distinguish from bullshitters. It’s why we get along so well with LLMs. ;)
If you want to get/stay good at debugging--again IMO--it's more important to be involved in operations, where shit goes wrong in the real world because you're dealing with real invalid data that causes problems like poison pill messages stuck in a message queue, real hardware failures causing services to crash, real network problems like latency and timeouts that cause services which work in the happy path to crumble under pressure. Not only does this instil a more methodical mentality in you, it also makes you a better developer because you think about more classes of potential problems and how to handle them.
A useful complement is the programmer-level shift: agents are great at narrow, reversible work when verification is cheap. Concretely, think small refactors behind golden tests, API adapters behind contract tests, and mechanical migrations with clear invariants. They fail fast in codebases with implicit coupling, fuzzy boundaries, or weak feedback loops, and they tend to amplify whatever hygiene you already have.
So the job moves from typing to making constraints explicit and building fast verification, while humans stay accountable for semantics and risk.
If useful, I expanded this “delegation + constraints + verification” angle here: https://thomasvilhena.com/2026/02/craftsmanship-coding-five-...
When we have solid tests, the agent output is useful and we can trust it. When tests are thin or missing, the agents still ship a lot of code, but we spend way more time debugging and fixing subtle bugs.
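To make "solid tests" concrete, here's a minimal golden-test sketch in shell (the tool and file names are hypothetical): regenerate the output and diff it against a checked-in reference. It's cheap to run and gives the agent an unambiguous pass/fail signal after every change.

# run the tool against a fixture and compare to the checked-in golden output
./mytool --config fixtures/basic.conf < fixtures/input.txt > /tmp/actual.txt
diff -u tests/golden/basic.txt /tmp/actual.txt && echo PASS || echo FAIL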
The text is actually about the Thoughtworks Future of Software Development retreat.
Personally, I'm more interested in whether software development has become more or less pay-to-win with LLMs.
I do like the idea that "all code is tech debt", and we shouldn't want to produce more of it than we need. But it's also worth remembering that debt is not bad per se, buying a house with a mortgage is also debt and can be a good choice for many reasons.
I suggest something like "Tidbits from the Thoughtworks Future of Software Development Retreat" (from the first sentence, captures the content reasonably well.)
I agree that AI tools are likely to amplify the importance of quick cycles and continuous delivery.
the part that's tricky is that slow lane and fast lane look identical in a PR. the framework only works if it's explicit enough to survive code review fatigue and context switching. and most teams are figuring that out as they go.
> One large enterprise employee commented that they were deliberately slow with AI tech, keeping about a quarter behind the leading edge. “We’re not in the business of avoiding all risks, but we do need to manage them”.
I’m unclear how this pattern helps with security vis-à-vis LLMs. It makes sense when talking about software versions, in hoping that any critical bugs are patched, but prompt injection springs eternal.
We've experimented with rolling open source models on local hardware, but it's so easy to inject things into them that it's not really going anywhere. It's going to be a massive challenge, because if we don't provide the tools, employees are going to figure out how to do it on their own.
Yes, but some are mitigated when discovered, and some more critical areas need to be isolated from the LLM, so taking their time to provision LLMs into their lifecycle is important; they're happy to spend the time doing it right rather than just throwing the latest edge tech into their system.
* Never give an agent any input that is not trusted
* Never give an agent access to anything that would cause a security problem (even read-only access to sensitive data/credentials, or write access to anything dangerous to write to)
* Never give an agent access to the internet (which is full of untrusted input, as well as places that sensitive data could be exfiltrated)
An LLM is effectively an unfixable confused deputy, so the only way to deal with it is effectively to lock it down so it can't read untrusted input and then do anything dangerous.
But it is really hard to do any of the things that folks find agents useful for, without relaxing those restrictions. For instance, most people let agents install packages or look at docs online, but any of those could be places for prompt injection. Many people allow it to run git and push and interact with their Git host, which allow for dangerous operations.
My current experimentation is running my coding agent in a container that only has access to the one source directory I'm working on, as well as the public internet. Still not great as the public internet access means that there's a huge surface area for prompt injection, though for the most part it's not doing anything other than installing packages from known registries where a malicious package would be just as harmful as a prompt injection.
Anyhow, there have been various people talking about how we need more sandboxes for agents, I'm sure there will be products around that, though it's a really hard problem to balance usability with security here.
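For what it's worth, a minimal sketch of the kind of container setup described above (not the author's exact command; the image and directory names are just placeholders): a throwaway container that only sees one project directory, drops capabilities, and mounts no credentials. Network access is still open here, which remains the big prompt-injection surface.

podman run --rm -it --cap-drop ALL --security-opt no-new-privileges \
  -v ./my-project:/work:Z -w /work \
  fedora:41 /bin/bash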
[0] Which is not even enough, these are the ones with truly excess money to burn.
Are you assuming tech debt has no financial cost?
Exactly.
That's one of the reasons it's gotten so out-of-hand.
Chinese open source models are dirt cheap, you can buy $20 worth of kimi-k2.5 on opencode and spam it all week and barely make a dent.
Assuming we never got bigger models, but hardware keeps improving, we'll either be serving current models for pennies, or at insane speeds, or both.
The only actual situation where tokens are being subsidized is free tiers on chat apps, which are largely irrelevant for any sort of useful economic activity.
I think this is often a mental excuse for continuing to avoid engaging with this tech, in the hope that it will all go away.
What people probably get messed up on as being the loss leader is likely generous usage limits on flat rate subscriptions.
For example GitHub Copilot Pro+ comes with 1500 premium requests a month. That's quite a lot and it's only $39.00. (Requests ~ Prompts).
For some time they were offering Opus 4.6 Fast at 9x billing (now raised to 30x).
That was up to 167 requests of around ~128k context for just $39. That ridiculous model costs $30/$150 per Mtok (input/output), so you can easily imagine the economics on this.
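Running my own rough numbers on that (a sketch using only the figures quoted above, counting input tokens only and ignoring output):

# 1500 premium requests / 9x billing ~= 167 Opus requests; ~128k context at $30 per Mtok input
awk 'BEGIN { r = 1500 / 9; printf "requests: %.0f, list-price input cost: $%.0f vs. a $39 subscription\n", r, r * 0.128 * 30 }'
# prints roughly: requests: 167, list-price input cost: $640 vs. a $39 subscription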
There's a difference between running inference and running a frontier model company.
Inference costs grow with your users.
Provided you are making a profit on that inference you can eventually cover your training costs if you sign up enough paying customers.
If you LOSE money on inference, every new customer makes your financial position worse.
You're putting way too much faith in Dario's statements. It wasn't "abundantly clear" to me. In that interview, prior to explaining how inference profits work, he said, "These are stylized facts. These numbers are not exact. I'm just trying to make a toy model," followed shortly by "[this toy model's economics] are where we're projecting forward in a year or two."
https://www.theinformation.com/articles/anthropic-lowers-pro...
This isn't a case where you have specific code/capital you have borrowed and need to pay for its use or give back. This is flat out putting liabilities into your assets that will have to be discovered and dealt with, someday.
Now producing code is _cheap_. You can write and run code in an automated way _on demand_. But if you do that, you have essentially traded upfront cost for run time cost. It's really only worth it if the work is A) high value and B) intermittent.
There is probably a formula you can write to figure out where this trade off makes sense and when it doesn't.
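Something like the following toy calculation, with entirely made-up numbers, is the shape of it:

# hypothetical break-even: one-time cost to turn an agentic workflow into normal code
# vs. the per-run token cost of keeping it agentic
awk 'BEGIN { upfront = 2000; per_run = 0.50; printf "codifying pays for itself after %.0f runs\n", upfront / per_run }'
# with these made-up numbers: 4000 runs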
I'm working on a system where we can just chuck out autonomous agents onto our platform with a plain text description, and one thing I have been thinking about is tracking those token costs and figuring out how to turn agentic workflows into just normal code.
I've been thinking about running an agent that watches the other agents for cost and reads their logs on a schedule to see if any of what the agents are doing can be codified and turned into a normal workflow, and possibly even _writing that workflow itself_.
It would be analogous to the JVM optimizing hot-path functions...
What I do know is that what we are doing for a living will be near unrecognizable in a year or two.
Token costs are also non-trivial. Claude can exhaust a $20/month session limit with one difficult problem (didn't even write code, just planned). Each engineer needs at least the $200/mo plan - I have multiple plans from multiple providers.
Local or self hosted LLMs will ultimately be the future. Start learning how to build up your own AI stack and use it day to day. Hopefully hardware catches up so eventually running LLMs on device is the norm.