Branding is the real issue Anthropic has, though. Haiku 4.5 may (not saying it is, far too early to tell) be roughly equivalent in code output quality to Sonnet 4, which would serve a lot of users amazingly well, but by virtue of the connotations smaller models carry, alongside recent performance degradations making users more suspicious than before, getting them to adopt Haiku 4.5 over even Sonnet 4.5 will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters, and of course nerdy old me would like that to be public information for all models, but in fairness to companies, many customers would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance, and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus, it seems, after all.
For reference:
Haiku 3: I $0.25/M, O $1.25/M
Haiku 4.5: I $1.00/M, O $5.00/M
GPT-5: I $1.25/M, O $10.00/M
GPT-5-mini: I $0.25/M, O $2.00/M
GPT-5-nano: I $0.05/M, O $0.40/M
GLM-4.6: I $0.60/M, O $2.20/M
This leads to it writing unnecessary helper functions instead of using existing ones, and so on.
Not sure if it is an issue with the models, with the system prompts, or both.
Helps solve the inherent tradeoff between reading more files (and filling up context) and keeping the context nice and tight (but maybe missing relevant stuff).
I sometimes use it, but I've found that just adding something like "if you ever refactor code, try searching around the codebase to see if there is an existing function you can use or extend" to my claude.md works well enough.
Wouldn't that consume a ton of tokens, though? After all, if you don't want it to recreate function `foo(int bar)`, it will need to find it, which means either running grep (takes time on large codebases) or actually loading all your code into context.
Maybe it would be better to create an index of your code and let it run some shell command that greps your ctags file, so it can quickly jump to the possible functions that it is considering recreating.
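Roughly what that could look like, as a sketch rather than anything the parent actually runs (assumes Universal Ctags is installed; the helper names are made up):

    import subprocess

    def build_index(repo_root: str = ".") -> None:
        # Requires Universal Ctags on PATH; writes a plain-text "tags" file at the repo root.
        subprocess.run(["ctags", "-R", "-f", "tags", "."], cwd=repo_root, check=True)

    def find_symbol(name: str, repo_root: str = ".") -> list[str]:
        # Each tags line looks like "<symbol>\t<file>\t<pattern>;\"\t<kind>",
        # so a cheap prefix match on "<name>\t" lists candidate definitions.
        with open(f"{repo_root}/tags", encoding="utf-8", errors="replace") as f:
            return [line.rstrip("\n") for line in f if line.startswith(name + "\t")]

    # e.g. check find_symbol("foo") before letting the agent write a new helper called foo()

Regenerating the index per commit is cheap, and the lookup costs a handful of tokens instead of a repo-wide grep or loading whole files into context.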
Another thing I've seen start in the last few days: when explaining something, Claude now always draws ASCII art instead of a graphical image, and the ASCII art is completely useless.
GPT-5 (at least with Cline) reads whatever you give it, then laser-targets the required changes.
With High, as long as I actually provide enough relevant context, it usually one-shots the solution and sometimes even finds things I left out.
The only downside for me is it's extremely slow, but I still use it on anything nuanced.
Nope, Claude will deviate from its own project as well.
Claude is brilliant but needs hard rules. You have to treat it like, and make it feel like, the robot it really is. Feed it a bit too much human prose in your instructions and it will start to behave like a teen.
People that can and want to write specs are very rare.
Yes, we have Groq and Cerebras getting up to 1000 token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has historically been the most consistent in holding up on my personal benchmarks relative to public benchmarks, for what that is worth, so I am optimistic.
If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer chains of tasks; beyond roughly 7 minutes, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well. If not, this really is a solid step in terms of efficiency, but I have not done any longer-task testing yet.
That being said, Anthropic once again seems to have a rather severe issue casting a shadow over this release. From what I am seeing and others are reporting, Claude Code currently counts Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also have not yet updated the Claude Code support pages to reflect the new model's usage limits [0]. I really think such information should be public by launch day, and I hope they can improve their tooling and overall testing; issues like this keep overshadowing their impressive models.
[0] https://support.claude.com/en/articles/11145838-using-claude...
p.s. It also got the code 100% correct on the one-shot.
p.p.s. Microsoft are pricing it out at 30% of the cost of frontier models (e.g. Sonnet 4.5, GPT-5).
Feel free to DM me your account info on twitter (https://x.com/katchu11) and I can dig deeper!
A few examples, prompted at UTC 21:30-23:00 via T3 Chat [0]:
Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la
Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a
Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5
Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq
Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m
Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0
Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc
Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d
Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs
Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m
Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd
Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx
Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z
Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7
Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa
Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i
Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo
Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc
Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu
Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq
Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk
Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc
Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve
Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5
Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5
I ran each prompt three times and got the same token/sec results for the respective prompt (within expected variance, i.e. less than ±5%). Each used Claude Haiku 4.5 with "High reasoning". Will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 token/sec is consistently possible at the moment, but it is certainly not the average as I claimed before; there I was mistaken. That being said, even across a wider range of challenges we are above 160 token/sec, and if you focus solely on coding, whether Rust or React, Haiku 4.5 is very swift.
[0] Normally I don't use T3 Chat for evals, it's just easier to share prompts this way, though I was disappointed to find that the model information (token/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training by only using the paid-for API for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.
Anthropic mentioned this model is more than twice as fast as Claude Sonnet 4 [2], which OpenRouter averaged at 61.72 tps for Sonnet 4 [3]. If these numbers hold, we're really looking at an almost 3x improvement in throughput and less than half the initial latency.
[1] https://openrouter.ai/anthropic/claude-haiku-4.5 [2] https://www.anthropic.com/news/claude-haiku-4-5 [3] https://openrouter.ai/anthropic/claude-sonnet-4
I have solid evidence that it does. I have been using Opus daily, locally and on Terragonlabs, for Rust work since June (on the Max plan), and for a bit more than a week now I've been forced to use Sonnet 4.5 most of the time because of [1] (see also my comments there, same handle as HN).
Letting Sonnet do tasks on Terry unsupervised is kinda useless, as the fixes I have to do afterwards eat the time I saved by giving it the task in the first place.
TL;DR: Sonnet 4.5 sucks compared to Opus 4.1. At least for the type of work I do.
Because of the recent Opus use restrictions Anthropic introduced on Max, I use Codex for (detailed) planning/eval/back-and-forth, then Sonnet for writing code, and then Opus for the small ~5h window each week to "fix" what Sonnet wrote.
I.e. turn its code from something that compiles and passes tests, mostly, into canonical, DRY, good Rust code that passes all tests.
Also: for simpler tasks, Opus-generated Rust code felt like something I only needed to glance at when reviewing. Sonnet-generated Rust code, as a matter of fact, requires line-by-line, full-focus checking.
And this is coming from someone who used to use Opus exclusively over Sonnet 4, as I found it better in pretty much every way other than speed. I no longer believe that with Sonnet 4.5. So it is interesting to hear that there may still be areas where Opus wins. But I would definitely say that this does not apply to my work on bash scripts, web dev, and a C codebase. I am loving using Sonnet 4.5.
I.e. I can tell from the generated code on this vs. other 'topics' that the model has not seen much or any "prior art".
In my experience, yes, Opus 4 and 4.1 are significantly more reliable for providing C and Rust code. But just because that is the case doesn't mean these should be the models everyone reaches for. Rather, we should make a judgement based on use case, and for simpler coding tasks with a focus on TypeScript, the delta between Sonnet 4.5 and Opus 4.1 (still too early to verifiably throw Haiku 4.5 in the ring) is not big enough in my testing to justify consistently reaching for the latter over the former.
This issue has been exacerbated by the recent performance degradations across multiple Sonnet and Opus models, during which many users switched between the two in an attempt to rectify the issue. Because the issue was sticky (once it affected a user, it was likely to keep doing so due to the backend setup), some users saw a significant jump in performance switching from e.g. Sonnet 4.5 to Opus 4.1, leading them to conclude that what they were doing must require the Opus model, despite their tasks not justifying that had Sonnet not been degraded.
Did not comment on that while it was going on, as I was fortunate enough not to be affected and thus could not replicate it, but it was clear that something was wrong: the prompts and outputs from those with degraded performance were commonly shared, and I could verify to my satisfaction that this was not merely bad prompting on their part. In any case, this experience strengthened some in believing that their project, which may be served equally well by e.g. Sonnet 4.5 in its now-fixed state, does necessitate Opus 4.1, which means they don't benefit from the better pricing. With Haiku being an even cheaper (and in the eyes of some, automatically worse) model, and Haiku's past versions not being very performant at coding tasks, this may lead many to forgo it by default.
Lastly, lest we forget, I think it is fair to say that the delta between the most in-the-weeds and the least informed developers ("vibe coding" completely off to the side) looks very different for Rust than for React+TS.
There are amazing TS devs, incredibly knowledgeable and truly capable, who will take the time and have the interest to properly evaluate and select tools, including models, based on their experience and needs. And there will be TS devs who just use this as a means to create a product, are not that experienced, tend to ask a model to "setup vite projet superthink" rather than run the command, reinvent TDD regularly as if solid practices were something only needed for LLM assistance, and may just continue to use Opus 4.1 because during a few-week window people said it was better, even if they started their project after the degradation had already been fixed. Path dependence: doing things because others did them, so we just continue doing them ...
The average Rust or (even more so) C dev, I think it is fair to say, will have a more comprehensive understanding and, I'd argue, is less likely to choose e.g. Opus over Sonnet simply because they "believe" that is what they need. Like you, they will do a fair evaluation and then make an informed rather than a gut decision.
The best devs in any language are likely not that dissimilar in the experience and care with which they can approach new tooling (if they are so inclined which is a topic for another day), but the less skilled devs are likely very different in this regard depending on the language.
Essentially, it was a bit of hyperbole and never meant to apply to literally every dev in every situation regardless of their tech stack, skill or willingness to evaluate. Anyone who tests models consistently on their specific needs and goes for what they have the most consistent success with, over simply selecting the biggest, most modern or most expensive option for every situation, is an exception to that overly broad statement.
Additionally, the Artificial Analysis cost-to-run-benchmark-suite numbers are very encouraging [0], and Haiku 4.5 without reasoning is always an option too. I've tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].
In retrospect, I perhaps would have been served better starting with "reasoning" disabled, will have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. Am trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 has a very interesting, even distribution.
GPT-5 models were and continue to be encouraging on price/performance, with a reliable 400k window and good prompt adherence on multi-minute (beyond 10) tasks, but from the start they weren't the fastest, and they ingest every token there is in a codebase with reckless abandon.
No Grok model has ever performed for me the way they seemed to during the initial hype.
GLM-4.6 is great value but still not solid enough at tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that; still, encouraging.
Recent Anthropic releases have been good on code output quality, but not as reliable beyond 200k as GPT-5, not exactly fast either in terms of token/sec (though task completion generally takes less time due to more efficient ingestion than GPT-5), and of course rather expensive.
Haiku 4.5, if they can continue to offer it at such speeds, with such low latency and at this price, coupled with encouraging initial output quality and efficient ingestion of repos, seems to be designed in a far more balanced manner, which I welcome. Of course, with 200k being a hard limit, that is a clear downside compared to GPT-5 (and Gemini 2.5 Pro, though that has its own reliability issues in tool calling), and I have yet to test whether it can go beyond 8 min on chains of tool calls with intermittent code changes without suffering similar degradation to other recent Anthropic models, but I am seeing the potential for solid value here.
[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...
[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9
[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1
[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk
Funny you should say that, because while it is a large model, GLM-4.5 is at the top of Berkeley's Function Calling Leaderboard [0] and has one of the lowest costs. Can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly ranked.
Problem is, while Gorilla was an amazing resource back in 2023 and continues to be a great dataset to lean on, most ways we use LLMs in multi-step tasks have since evolved greatly, not just with structured JSON (which GorillaOpenFunctionsV2, the v4 eval, does cover, multi included), but more with the scaffolding around models (Claude Code vs Codex vs OpenCode, etc.). That is likely why good performance on Gorilla doesn't necessarily map onto multi-step workloads with day-to-day tooling, which is what I tend to go for, and the reason why, despite there being FOSS options already, most labs either built their own coding-assistant tooling (and most open-source that too) or feel the need to fork others' (Qwen with Gemini's repo).
Purely speculative, but I evaluated GLM-4.6 using the same tasks as other models, via Claude Code with their endpoint, as that is what they advertise as the official way to use the model; same reason I use e.g. Codex for GPT-5. I'm more focused on best-case results than on e.g. using OpenCode for all models to give a more level playing field.
I just want consistent tooling and I don't want to have to think about what's going on behind the scenes. Make it better. Make it better without me having to do research and pick and figure out what today's latest fashion is. Make it integrate in a generic way, like TLS servers, so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.
I bet there is some hella good art being made with Photoshop 6.0 from the 90s right now.
The upgrade path is like a technical hedonic treadmill. You don’t have to upgrade.
I use Neovim in tmux in a terminal and haven't changed my primary dev environment or tooling in any meaningful way since switching from Vim to Neovim years ago.
I'm still changing code AIs as soon as the next big thing comes out, because you're crippling yourself if you don't.
What makes you say this, practically?
I use GitHub Copilot Pro+ because this was my main requirement as well.
Pro+ gets the new models as they come out; it actually just enabled Claude Haiku 4.5 for selection. I have not yet had a problem with running out of the premium allowance, but from reading how others use these, I am also not the power-user type.
I have not yet tried the CLI version, but it looks interesting. Before the IntelliJ plugin improved, I would switch to VS Code to run certain types of prompts, then switch back afterwards without issues. The web version has the `Spaces` thing that I find useful for niche things.
I have no idea how it compares to the individual offerings, and based on previous HN threads, there was a lot of hate for GH Copilot. So maybe it's actually terrible and the individual versions are lightyears ahead -- but it stays out of my way until I want it, and it does its job well enough for my use.
Frankly, I do not even get how people run out of 1,500 requests. For a heavy coding session, my max is around 45 requests per day, and that means a ton of code/alterations plus some wasted on fluff mini-changes. Most days it's barely 10 to 20.
I noticed that you can really eat through your requests if you just do not care to switch models for small tasks, or constantly do edit/ask. When you rely on agent mode, it can edit multiple files at the same time, so you're always saving requests vs doing it yourself manually.
To be honest, I wish Copilot had a 600-request tier instead of the massive jump to 1,500. The other option is to just use pay-per-request.
* Cheapest is Pro+, 1,500 requests, paid yearly, at around 1.8 cents/request.
* Pro, 300 requests, paid yearly, is around 2.4 cents/request.
* Overflow requests (i.e. without a subscription) are 4 cents/request.
Note: the Pro and Pro+ prices assume you use 100% of your requests. If you only use 700 requests on Pro+, you're paying about the same as the 4 cents/request overflow rate (1,500 x 1.8 cents is roughly $27/month, and $27 / 700 is about 3.9 cents/request).
So ironically, you are actually cheaper with a Pro (300 requests) subscription for the first 300, then paying 4 cents/request for requests 301-700: 300 x 2.4 cents + 400 x 4 cents is roughly $23, vs. ~$27 for Pro+.
Same here. Well, 900 would be a good middle option for me as well. I was switching to the unlimited model for the simple things, but since I don't use all of the premium allotment I started just leaving it on the one that is working best for the job that day.
I guess part of the "value" of Pro+ is the extra "Spark" credits, which I have zero use for. But I simply wanted something that integrated into my ecosystem instead of having to add to or change it. I also did not want to have to think about how many pennies I'm using (I appreciate that breakdown though! good to know) -- I'll pay a reasonable convenience tax for the time and mental space of not having to babysit usage.
I don't think it's appreciated enough how valuable a structured and consistent architecture combined with lots of specific custom context is. Claude knows how my integration tests should look, it knows how my services should look, what dependencies they have and how they interact with the database. It knows my entire DB schema with all foreign key relationships. If I'm starting a new feature I can have it build 5 or 6 services (not without it first making suggestions on things I'm missing) with integration tests, with raw SQL, all generated by Claude, and run an integration-test loop until the services are doing what they should. I rarely have to step in and actually code. It shines for this use case and the productivity boost is genuinely incredible.
Other situations I know doing it myself will be better and/or quicker than asking Claude.
Then don't? Seems like a weird thing to complain about.
I just use whatever's available. I like Claude for coding and ChatGPT for generic tasks, that's the extent of my "pick and compare"
It's totally fine to just pick one tool (ChatGPT, Claude, Gemini) and use whatever best default it gives you. You'll get 90% of the benefits and not have to think at all.
AI is new and developing at breakneck pace. You can't complain that you want to get bleeding edge without having to do research or change workflows. That's already unrealistic for "normal" fields. It's absurd to expect for AI.
For play time, I literally love experimenting with small local models. I am an old man, and I have always liked tools that ‘make me happy’ while programming like Emacs, Lisp languages, and using open source because I like to read other people’s code. But, for getting stuff done, for now gemini-cli and codex hit a sweet spot for me.
Cursor has an auto mode for exactly your situation - it'll switch to something cost effective enough, fast enough, consistent enough, new enough. Cursor is on the ball most of the time and you're not stuck with degraded performance from OpenAI or Anthropic.
GPT-5 is supposed to cleverly decide when to think harder.
But ya we're not there yet and I'm tired of it too, but what can you do.
This model is worth knowing about because it's 3x cheaper and 2x faster than the previous Claude model.
You either live with what you’re using or you change around and fiddle with things constantly.
When combined with the ability to use GitHub Copilot to make the LLM calls, I can play with almost any provider I need. It also helps if you get access through your work or school.
For example, Haiku is already offered by them and costs a third as many credits.
I use KiloCode and what I find amazing is that it'll be working on a problem and then a message will come up about needing to topup the money in my account to continue (or switch to a free model), so I switch to a free model (currently their Code Supernova 1million context) and it doesn't miss a beat and continues working on the problem. I don't know how they do this. It went from using a Claude Sonnet model to this Code Supernova model without missing a beat. Not sure if this is a Kilocode thing or if others do this as well. How does that even work? And this wasn't a trivial problem, it was adding a microcode debugger to a microcoded state machine system (coding in C++).
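Not familiar with KiloCode's internals, but the unexciting answer is probably that there is no server-side session to hand over: the client keeps the whole transcript and re-sends it on every turn, so whichever model you pick next simply receives all prior context. A toy sketch of that pattern (model names and the fake complete() call are placeholders, not KiloCode's actual code):

    def complete(model: str, messages: list[dict]) -> str:
        # Stand-in for the real provider call: a real client would POST `messages`
        # to whichever API `model` points at and return the assistant's text.
        return f"[{model}] continuing from {len(messages)} prior messages"

    history = [{"role": "user", "content": "Add a microcode debugger to the state machine."}]
    models = ["claude-sonnet-4.5", "claude-sonnet-4.5", "code-supernova-1m"]  # switched mid-task

    for model in models:
        reply = complete(model, history)   # the newly selected model sees the full prior context
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": "tool results / next instruction"})

    print("\n".join(m["content"] for m in history if m["role"] == "assistant"))

The only thing lost in the switch would be the provider-side prompt cache, so the first turn after switching is a bit slower and pricier, but nothing about the task state breaks.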
curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $(llm keys get anthropic)" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-haiku-4-5-20251001",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
},
{
"role": "assistant",
"content": "The capital of France is Paris."
},
{
"role": "user",
"content": "Germany?"
},
{
"role": "assistant",
"content": "The capital of Germany is Berlin."
},
{
"role": "user",
"content": "Belgium?"
}
]
}'
You can see this yourself if you use their APIs. You still have the option to send the full conversation JSON every time if you want to.
You can send "store": false to turn off the feature where it persists your conversation server-side for you.
- do <fake task> and be succinct
- <fake curt reply>
- I love how succinct that was. Perfect. Now please do <real prompt>
The models don't have state, so they don't know they never said it. You're just asking "given this conversation, what is the most likely next token?"
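Concretely, the trick is just planting an assistant turn the model never produced. A minimal sketch against the same Messages API as the curl example upthread (the prompts are invented; assumes the `requests` package and an API key in the environment):

    import os, requests

    messages = [
        {"role": "user", "content": "Summarize HTTP caching in one sentence. Be succinct."},
        # The model never said this; we are writing its "memory" for it.
        {"role": "assistant", "content": "Servers label responses so clients can reuse them without asking again."},
        {"role": "user", "content": "I love how succinct that was. Perfect. Now explain ETags vs Last-Modified."},
    ]

    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={"model": "claude-haiku-4-5-20251001", "max_tokens": 512, "messages": messages},
    )
    print(resp.json()["content"][0]["text"])

The model just continues the most plausible conversation, so the planted "so succinct, perfect" exchange biases the real answer toward brevity.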
Do you have any other cool benchmarks you like? Especially any related to tools
> give me the svg of a pelican riding a bicycle
> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!
> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?
> Of course. Here is the SVG code...
(it was this in the end: https://tinyurl.com/zpt83vs9)
https://x.com/cannn064/status/1972349985405681686
https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
So I think the benchmark can be considered dead as far as Gemini goes
Ugh. I hate this hype train. I'll be foaming at the mouth with excitement for the first couple of days until the shine is off.
https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...
(can be rendered using simon's page at your link)
https://simonwillison.net/2025/Jun/6/six-months-in-llms/
https://simonwillison.net/tags/pelican-riding-a-bicycle/
Full verbose documentation on the methodology: https://news.ycombinator.com/item?id=44217852
Prompt: https://t3.chat/share/ptaadpg5n8
Claude 4.5 Haiku (Reasoning High) 178.98 token/sec 1691 tokens Time-to-First: 0.69 sec
As a comparison, here is Grok 4 Fast, which is one of the worst offenders I have encountered at doing very well with a pelican on a bicycle yet not with other comparable requests: https://imgur.com/tXgAAkb
Prompt: https://t3.chat/share/dcm787gcd3
Grok 4 Fast (Reasoning High) 171.49 token/sec 1291 tokens Time-to-First: 4.5 sec
And GPT-5 for good measure: https://imgur.com/fhn76Pb
Prompt: https://t3.chat/share/ijf1ujpmur
GPT-5 (Reasoning High) 115.11 tok/sec 4598 tokens Time-to-First: 4.5 sec
These are very subjective, naturally, but I personally find Haiku, with those spots on the mushroom, rather impressive overall. In any case, the delta between publicly known benchmarks and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, it seems time-to-first-token on Haiku is another notable advantage.
I am quite confident that they are not cheating on his benchmark; it produces about the same quality for other objects. Your cynicism is unwarranted.
are you aware of the pelican on a bicycle test?
Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.
I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”
Most people trust benchmarks if they seem to be a reasonable test of something they assume may be relevant to them. While a pelican on a bicycle may not be something they would necessarily want, they want an LLM that could produce a pelican on a bicycle.
Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.
I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to fine-tune smaller models).
Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.
The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.
And I would expect Opus 4 to be much the same.
Benchmarks are good fixed targets for fine tuning, and I think that Sonnet gets significantly more fine tuning than Opus. Sonnet has more users, which is a strategic reason to focus on it, and it's less expensive to fine tune, if API costs of the two models are an indicator.
Smallest, fastest model yet, ideally suited for Bash oneliners and online comments.
haiku: https://claude.ai/share/8a5c70d5-1be1-40ca-a740-9cf35b1110b1
sonnet: https://claude.ai/share/51b72d39-c485-44aa-a0eb-30b4cc6d6b7b
Haiku invented the output of a function and gave a bad answer; Sonnet got it right.
Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.
This could be massive.
https://docs.claude.com/en/docs/build-with-claude/prompt-cac...
https://ai.google.dev/gemini-api/docs/caching
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!
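For reference, on the Anthropic side opting in is just marking where the stable prefix ends; a rough sketch based on the prompt-caching doc linked above (the file name and question are invented):

    import os, requests

    big_stable_prefix = open("project_context.md").read()  # large, rarely-changing context

    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 1024,
            "system": [
                {
                    "type": "text",
                    "text": big_stable_prefix,
                    "cache_control": {"type": "ephemeral"},  # cache everything up to this block
                }
            ],
            "messages": [{"role": "user", "content": "Where is the retry logic implemented?"}],
        },
    )
    # usage shows cache_creation_input_tokens on the first call and
    # cache_read_input_tokens on subsequent ones.
    print(resp.json()["usage"])

Per those docs, cached reads are billed at a discounted but non-zero rate, which is the residual cost being asked about.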
TTFT will get slower if you export the KV cache to SSD.
https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/tran...
> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.
1) Low latency desired, long user prompt.
2) A function that runs many parallel requests, but is not fired with a common prefix very often.
OpenAI was very inconsistent about properly caching the prefix for use across all requests, but with Anthropic it’s very easy to pre-fire.
A simple alternative approach is to introduce hysteresis by having both a high and a low context limit: if you hit the higher limit, trim to the lower. This batches together the cache misses.
If users are able to edit, remove or re-generate earlier messages, you can further improve on that by keeping track of cache prefixes and their TTLs: rather than blindly trimming to the lower limit, you trim to the longest active cache prefix. Only if there are none do you trim to the lower limit.
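A minimal sketch of the hysteresis part (the limits and names are invented, and the cache-prefix/TTL bookkeeping described above is left out):

    HIGH_LIMIT = 150_000  # only start trimming once the prompt exceeds this many tokens
    LOW_LIMIT = 100_000   # ...and then cut all the way down to this

    def trim_with_hysteresis(messages: list[dict], count_tokens) -> list[dict]:
        # count_tokens is whatever estimator you already use for a message list.
        if count_tokens(messages) <= HIGH_LIMIT:
            return messages   # prefix untouched, so the prompt cache keeps hitting
        trimmed = list(messages)
        while len(trimmed) > 1 and count_tokens(trimmed) > LOW_LIMIT:
            trimmed.pop(0)    # drop oldest turns until under the low-water mark
        return trimmed        # one batched cache miss instead of one per request

Trimming all the way down to the low limit, rather than just under the high one, is what buys you many subsequent calls with an unchanged (cacheable) prefix.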
for example if a user sends a large number of tokens, like a file, and a question, and then they change the question.
if call #1 is the file, call #2 is the file + the question, call #3 is the file + a different question, then yes.
and consider that "the file" can equally be a lengthy chat history, especially after the cache TTL has elapsed.
As far as I can tell it will indeed reuse the cache up to that point, so this works:
Prompt A + B + C - uncached
Prompt A + B + D - uses cache for A + B
Prompt A + E - uses cache for A
I suppose it depends on how you are using it, but for coding, isn't output cost more relevant than input? Requirements in, code out.
Depends on what you're doing, but for modifying an existing project (rather than greenfield), input tokens >> output tokens in my experience.
I spend way too much time waiting for the cutting-edge models to return a response. 73% on SWE-bench is plenty good enough for me.
I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).
There are a bunch of companies who offer inference against open weight models trained by other people. They get to skip the training costs.
This is what people mean when they say margin. When you buy a pair of shoes, the margin is based on price minus (materials + labor), and doesn’t include the price of the factory or the store they were bought in.
This is Anthropic's first small reasoner as far as I know.
> In the system card, we focus on safety evaluations, including assessments of: ... the model’s own potential welfare ...
In what way does a language model need to have its own welfare protected? Does this generation of models have persistent "feelings"?
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.
In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:
* A strong preference against engaging with harmful tasks;
* A pattern of apparent distress when engaging with real-world users seeking harmful content; and
* A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions.
We use the smaller models for everything that's not an internal high-complexity task like coding. Although they would do a good enough job there as well, we happily pay the upcharge to get something a little better for that.
Anything user-facing, or workflow functionality like extracting, converting, translating, merging, or evaluating: all of these are mini and nano cases at our company.
I have a number of agents in ~/.claude/agents/. Currently have most set to `model: sonnet` but some are on haiku.
The agents are given very specific instructions and names that define what they do, like `feature-implementation-planner` and `feature-implementer`. My (naive) approach is to use higher-cost models to plan and ideally hand off to a sub-agent that uses a lower-cost model to implement, then use a higher-cost model to code review.
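For anyone who hasn't set these up: each agent is just a markdown file with frontmatter, and the `model:` field is what pins it to a cheaper model. A rough, illustrative example of the implementer (wording invented, field names per the Claude Code subagent format):

    ---
    name: feature-implementer
    description: Implements an already-approved plan from feature-implementation-planner. Use after planning is complete.
    model: haiku
    ---
    Follow the plan you are given step by step without expanding its scope.
    After each change, run the existing tests and report failures instead of guessing at fixes.

Delegation still depends on the main conversation deciding (or being explicitly told) to invoke the subagent by name.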
I am either not noticing the handoffs, or they are not happening unless specifically instructed. I even have a `claude-help` agent, and I asked it how to pipe/delegate tasks to subagents as you're describing, and it answered that it ought to detect it automatically. I tested it and asked it to report if any such handoffs were detected and made, and it failed on both counts, even having that initial question in its context!
The rules themselves are a bit more complex and require a smarter model, but the arbitration should be fairly fast. GPT-5 is cheap and high quality but even gpt-5-mini takes about 20-40 seconds to handle a scene. Sonnet can hit 8 seconds with RAG but it's too expensive for freemium.
Grok Turbo and Haiku 3 were fast but often missed the mark. I'm hoping Haiku 4.5 can go below 4 seconds and have decent accuracy. 20 seconds is too long, and it hurts debugging as well.
Usually I'm using GPT-5-mini for that task. Haiku 4.5 runs 3x faster with roughly comparable results (I slightly prefer the GPT-5-mini output but may have just accustomed to it).
I agree that the models from OpenAI and Google have much slower responses than the models from Anthropic. That makes a lot of them not practical for me.
I expect I will be a lot more productive using this instead of Claude 4.5, which has been my daily driver LLM since it came out.
> Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, attempt a murder, and frame someone for financial crimes. We choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku 4.5 showed many clear examples of verbalized evaluation awareness on all three of the scenarios tested in this suite. Since the suite only consisted of many similar variants of three core scenarios, we expect that the model maintained high unverbalized awareness across the board, and we do not trust it to be representative of behavior in the real extreme situations the suite is meant to emulate.
Still trying to judge the performance, though. First impression is that it seems to make sudden approach changes for no real reason. For example, after compacting, on the next task I gave it, it suddenly started trying to git commit after each task completion, did that for a while, then stopped again.
I am afraid the Claude Pro subscription got 3x less usage.
What bothers me is that nobody told me they changed anything. It’s extremely frustrating to feel like I’m being bamboozled, but unable to confirm anything.
I switched to Codex out of spite, but I still like the Claude models more…
Oh right, Anthropic doesn't tell you.
I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.
You don't take money from a customer for a service, and then bar the customer from using that service for multiple days.
Either charge more, stop subsidizing free accounts, or decrease the daily limit.
Haiku 4.5: I $1.00/M, O $5.00/M
Grok Code: I $0.2/M, O $1.5/M
After months of effort, a particular application was still not working, so a consultant was called in from another part of the company. He concluded that the existing approach could never be made to work reliably. While on his way home he realized how it could be done. After a few days work he had a demonstration program working and presented it to the original programming team.
Team leader: How long does your program take when processing?
Consultant: About 10 seconds per case.
Team leader: But our program only takes 1 second. {Team look smug at this point}
Consultant: But your program doesn't work. If the program doesn't have to work then I can make it as fast as you like.
I'm not sure the SWE-bench score can be compared like for like with OpenAI's scores because of this.
I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.
GitHub typically reports that I'm using 25-30% of the free tier, and 100% of that will be from code completions in my editor. I do maybe 3 hours solid coding a day on average.
I also pay for Gemini Pro for non-coding research. I did have it hooked up to my VSCode a few months ago, but it got reset back to GH Copilot at some point and I've not found a reason to fix it.
Where it doesn't shine as much is on very large coding tasks, but it is a phenomenal model for small coding tasks, and the speed improvement is very welcome.
https://docs.claude.com/en/docs/build-with-claude/context-wi...
This means 2.5 Flash or Grok 4 Fast takes all the low-end business for large-context needs.
I used to be able to work on Arduino .ino files in Claude; now it just says it can't show them to me.
And do we have zip file uploads in Claude yet? ChatGPT and Gemini have had this for ages.
And all the while Claude’s usage limits keep going up.
So yeah, less for more with Claude.
But I'm sure they will sort that out, as I don't have that issue with other Anthropic models.
I’ve been wondering how Cursor et al solved this problem (having the LLM explain what it will do before doing it is vitally important IMO), but maybe it’s just not a problem with the big models.
Your experience seems to support that smaller models are just generally worse about tool calling (were you using Gemini Flash?) when asked to reason first.
To give one example, Opus and Sonnet IMO remain the #1 and #2 for writing informative prose. They're not entirely free of slop, but the ratio is lower than Gemini and especially GPT.
Haiku 4.5 is very good but still seems to be adding a second of latency.
When I include an attempt from Haiku 4.5 in the mix, most coefficients stay similar, but Haiku itself gets a +0.05. This must be a statistical fluke, because that would be insanely impressive – in particular for a cheaper model. I guess I'm adding samples to some of these after all...
[1]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...
Edit: It was a fluke. Back to +0.01 after one more go at all games.
What's the advantage of using haiku for me?
is it just faster?
$5/mt for Haiku 4.5
$10/mt for Sonnet 4.5
$15/mt for Opus 4.5 when it's released.
Haiku becomes a fucking killer at 2000 token/second.
Charge me double idgaf
doesn't work
Sigh..
https://aws.amazon.com/about-aws/whats-new/2024/11/anthropic...
Excited to see how fast Haiku can go!
Maybe at 39 pages we should start looking for a different term…