At some point, we might end up in a steady state where the models are as good as they can be and the training arms race is over, but we're not there yet.
Fixed costs can't be rolled into the unit economics because the divisor is continually growing. The marginal costs of each incremental token/query don't depend on the training cost.
The training is already done when you make a generative query. No matter how many consumers there are, the cost for training is fixed.
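A toy illustration of that point, with made-up numbers: as the query count grows, the amortized training cost per query shrinks toward zero, and the unit cost approaches the marginal cost.

```python
# Hypothetical figures: a $100M training run, $0.003/query marginal inference cost.
fixed_training_cost = 100e6
marginal_cost_per_query = 0.003

for n_queries in (1e6, 1e9, 1e12):
    unit_cost = marginal_cost_per_query + fixed_training_cost / n_queries
    print(f"{n_queries:.0e} queries: ${unit_cost:.4f}/query")
```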
The economy will destroy inefficient actors in due course. The environmental and economic incentives are not entirely misaligned here.
Taken literally, this is just an agreement with the comment you're replying to.
Amortizing means the cost is gradually written off over a period, and that is completely consistent with averaging it over usage. For example, if a printing company buys a big new printing machine every 5 years (because that's how long they last before wearing out), they would amortize its cost over those 5 years (strictly speaking it's depreciation, not amortization, because it's a physical asset, but the idea is the same). It's still 100% possible to count the documents printed over that period and work out the machine's cost per document. And that's still perfectly consistent with the machine paying for itself.
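A minimal sketch of that per-document arithmetic, with made-up numbers for the machine cost and print volume:

```python
# Hypothetical figures: a $500k press replaced every 5 years.
machine_cost = 500_000        # $, fixed cost per 5-year cycle
docs_per_year = 2_000_000     # documents printed annually

docs_per_cycle = docs_per_year * 5
cost_per_doc = machine_cost / docs_per_cycle
print(f"${cost_per_doc:.3f} per document")   # $0.050
```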
I presume historical internal datasets remain high-value, since they might be cleaner (no slop) or no longer obtainable elsewhere (copyright takedowns), and companies are getting better at hiding their data from spidering.
> So, if I wanted to analogize the energy usage of my use of coding agents, it’s something like running the dishwasher an extra time each day, keeping an extra refrigerator, or skipping one drive to the grocery store in favor of biking there.
That's for someone spending about $15-$20 in a day on Claude Code, estimated at the equivalent of 4,400 "typical queries" to an LLM.
This is for someone using a lot of LLM tokens relative to the average customer of these companies.
Electricity and cooling incur wider costs and consequences.
Why is it an externality? Anthropic (or other model provider) pays the electricity cost, then it's passed along in the subscription or API bill. The direct cost of the energy is fully internalized in the price.
I'm all for regulation that makes businesses pay for their externalities - I'd argue that's a key economic role that a government should play.
I've been told that in other (non-US) economies, decisions to site hyperscaler DCs have had downstream impacts on power costs and long-term power planning. The infrastructure needed to make a lot of power appear at one site means the same capex and inputs cannot be used to supply power to towns and villages. There's a social opportunity cost in hosting the DC, because the power supply doesn't magically make more transformers, wires, and syncons appear on the market: prices for these things are going up because of a worldwide shortage.
It's like the power version of worldwide RAM pricing.
That's why I think most of this data center energy use, especially over longer terms, is a joke. Data centers can pretty easily run on solar and wind energy if we spend even a small amount of political capital to make it happen.
I am not in the DC business. If somebody who is says "that's bunkum", I'd pay attention to it.
> For the purposes of this post, I’ll use the figures from the 100,000 “maximum”–Claude Sonnet and Opus 4.5 both have context windows of 200,000 tokens, and I run up against them regularly–to generate pessimistic estimates. So, ~390 Wh/MTok input, ~1950 Wh/MTok output.
Expensive commercial energy would be 30¢ per kWh in the US, so the energy cost implied by these figures would be about 12¢/MTok input and 60¢/MTok output. Anthropic's API price for Opus 4.5 is $5/MTok input and $25/MTok output, roughly 40x higher than these figures.
The direct energy cost of inference is still covered even if you assume that Claude Max/etc plans are offering a tenfold subsidy over the API cost.
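A quick sanity check of that arithmetic, using the figures quoted above:

```python
# Pessimistic energy estimates from the post, priced at expensive US
# commercial electricity, vs. Anthropic's published Opus 4.5 API prices.
wh_per_mtok = {"input": 390, "output": 1950}   # Wh per million tokens
price_per_kwh = 0.30                           # $/kWh
api_price = {"input": 5.00, "output": 25.00}   # $/MTok

for kind, wh in wh_per_mtok.items():
    energy_cost = wh / 1000 * price_per_kwh    # $/MTok of electricity
    ratio = api_price[kind] / energy_cost
    print(f"{kind}: ${energy_cost:.2f}/MTok energy "
          f"vs ${api_price[kind]}/MTok API ({ratio:.0f}x)")
```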
This has been covered a lot. You can find quotes from one of the companies saying that they'd be profitable if not for training costs. In other words, inference is a net positive.
You have to keep in mind that the average customer doesn't use much inference. Most customers on the $20/month plans never come close to using all of their token allowance.
If that's so common then what's your theory as to why Anthropic aren't price competitive with GPT-5.2?
Do people even care about this?
How much energy does it take to download a video on YouTube versus the energy input to keep it all setup and running?
We've only launched to friends and family, but I'll share this here since it's relevant: we have a service which actually optimizes and measures the energy of your AI use: https://portal.neuralwatt.com if you want to check it out. We also have a tools repo we put together that shows some demonstrations of surfacing energy metadata into your tools: https://github.com/neuralwatt/neuralwatt-tools/
Our underlying technology is really about OS-level energy optimization and datacenter grid flexibility, so if you are on the pay-by-kWh plan you get additional value as we continue to roll out new optimizations.
DM me with your email and I'd be happy to add some additional credits to you.
TLDR this is, intentionally or not, an industry puff piece that completely misunderstands the problem.
Also, even if everyone is effectively running a dishwasher cycle every day, this is still a problem we can't just ignore; that's still a massive increase in ecological impact.
It is true that there are always more training runs going, and I don't think we'll ever find out how much energy was spent on experimental or failed training runs.
Constant until the next release? The battle for the benchmark-winning model is driving cadence up, and this competition probably puts a higher cost on training and evaluation too.
I wish we'd reach the "everything is equally bad" phase so we could start enjoying a more constant cost for this whole craze of building your own model with more data than everyone else.
Training is more or less the same as doing inference on each input token two to three times (a forward pass plus a backward pass, the latter costing roughly twice the forward). But because it's offline and predictable, it can be done fully batched with very high utilization (efficiently).
Training is, at a guesstimate, maybe 100 trillion total tokens, but these guys apparently do inference at quadrillion-token monthly scales.
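A rough scale comparison under those guesstimates (counting each training token as ~3x an inference token, per the forward-plus-backward point above):

```python
# All figures are the guesstimates from the comments above, not measurements.
training_tokens = 100e12          # ~100 trillion tokens for a training run
monthly_inference_tokens = 1e15   # ~1 quadrillion tokens served per month

# Count each training token as ~3 inference-equivalent tokens
# (forward + backward pass vs. a single forward pass).
training_equiv = 3 * training_tokens

print(training_equiv / monthly_inference_tokens)  # ~0.3 months of inference
```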
Propellers are a very common means of making aeroplanes work, though. Instead of a piston engine, which is cheap to make but relatively unreliable and expensive to run, you can use a turbine engine, which runs on Jet A (i.e. kerosene); the rotary motion of the turbine drives the propeller, making a turboprop. In the US you won't see many turboprops in passenger service, but in the rest of the world they're a very common choice for medium-distance routes, while the turbofan planes common everywhere in the US are in most places reserved for longer distances between bigger airfields, because they deliver peak efficiency when they spend longer up in the sky.
Jet A, whether burned in a turbofan or a turboprop, does not have lead in it, so to a first approximation no commercial flights spew lead. They're bad for the climate, but they don't put lead into the atmosphere.
Most of the innovation happening today is in post-training rather than pre-training, which is good for people concerned with energy use because post-training is relatively cheap (I was able to post-train a ~2B model in less than 6 hours on a rented cluster[2]).
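For a concrete sense of what a run like that can look like, here's a minimal supervised fine-tuning sketch using TRL. The model and dataset names follow the linked artifacts, but the hyperparameters are illustrative assumptions, not the actual recipe behind [2]:

```python
# Minimal SFT sketch with TRL; hyperparameters are guesses, not the
# settings actually used for [2].
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Chat-format dataset; "all" is the combined smoltalk config.
dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",   # base model to post-train
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-1.7b-smoltalk",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
)
trainer.train()
```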
[1]: https://github.com/lino-levan/wubus-1
[2]: https://huggingface.co/lino-levan/qwen3-1.7b-smoltalk