Branding is the real issue Anthropic has, though. Haiku 4.5 may (not saying it is, far too early to tell) be roughly equivalent in code output quality to Sonnet 4, which would serve a lot of users amazingly well, but by virtue of the connotations smaller models carry, alongside recent performance degradations making users more suspicious than before, getting them to adopt Haiku 4.5 over even Sonnet 4.5 will be challenging. I'd love to know whether Haiku 3, 3.5 and 4.5 are roughly in the same ballpark in terms of parameters, and of course nerdy old me would like that to be public information for all models, but in fairness to companies, many customers would just go for the largest model thinking it serves all use cases best. GPT-5 to me is still most impressive because of its pricing relative to performance, and Haiku may end up similar, though with far less adoption. Everyone believes their task requires no less than Opus, it seems, after all.
For reference:
Haiku 3: I $0.25/M, O $1.25/M
Haiku 4.5: I $1.00/M, O $5.00/M
GPT-5: I $1.25/M, O $10.00/M
GPT-5-mini: I $0.25/M, O $2.00/M
GPT-5-nano: I $0.05/M, O $0.40/M
GLM-4.6: I $0.60/M, O $2.20/M
This leads to it writing unnecessary helper functions instead of using existing ones, and so on.
Not sure if it is an issue with the models, with the system prompts, or both.
Helps solve the inherent tradeoff between reading more files (and filling up context) and keeping the context nice and tight (but maybe missing relevant stuff).
I sometimes use it, but I've found that just adding something like "if you ever refactor code, try searching around the codebase to see if there is an existing function you can use or extend" to my claude.md works well enough.
Wouldn't that consume a ton of tokens, though? After all, if you don't want it to recreate function `foo(int bar)`, it will need to find it, which means either running grep (takes time on large codebases) or actually loading all your code into context.
Maybe it would be better to create an index of your code and let it run some shell command that greps your ctags file, so it can quickly jump to the possible functions that it is considering recreating.
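Roughly what that could look like, as a sketch rather than anything the parent actually runs (assumes Universal Ctags is installed; the helper names are made up):

    import subprocess

    def build_index(repo_root: str = ".") -> None:
        # Requires Universal Ctags on PATH; writes a plain-text "tags" file at the repo root.
        subprocess.run(["ctags", "-R", "-f", "tags", "."], cwd=repo_root, check=True)

    def find_symbol(name: str, repo_root: str = ".") -> list[str]:
        # Each tags line looks like "<symbol>\t<file>\t<pattern>;\"\t<kind>",
        # so a cheap prefix match on "<name>\t" lists candidate definitions.
        with open(f"{repo_root}/tags", encoding="utf-8", errors="replace") as f:
            return [line.rstrip("\n") for line in f if line.startswith(name + "\t")]

    # e.g. check find_symbol("foo") before letting the agent write a new helper called foo()

Regenerating the index per commit is cheap, and the lookup costs a handful of tokens instead of a repo-wide grep or loading whole files into context.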
Another thing I've seen start in the last few days: when explaining something, Claude now always draws ASCII art instead of a graphical image, and the ASCII art is completely useless.
GPT-5 (at least with Cline) reads whatever you give it, then laser-targets the required changes.
With High, as long as I actually provide enough relevant context, it usually one-shots the solution and sometimes even finds things I left out.
The only downside for me is it's extremely slow, but I still use it on anything nuanced.
Nope, Claude will deviate from its own project as well.
Claude is brilliant but needs hard rules. You have to treat it like, and make it feel like, the robot it really is. Feed it a bit too much human prose in your instructions and it will start to behave like a teen.
People that can and want to write specs are very rare.
Yes, we have Groq and Cerebras getting up to 1000 token/sec, but not with models that seem comparable (again, early, not a proper judgement). Anthropic has historically been the most consistent in holding up on my personal benchmarks relative to public benchmarks, for what that is worth, so I am optimistic.
If speed, performance and pricing are something Anthropic can keep consistent long term (i.e. no regressions), Haiku 4.5 really is a great option for most coding tasks, with Sonnet something I'd tag in only for very specific scenarios. Past Claude models have had a deficiency in longer chains of tasks; beyond roughly 7 minutes, performance does appear to worsen with Sonnet 4.5, as an example. That could be an Achilles heel for Haiku 4.5 as well. If not, this really is a solid step in terms of efficiency, but I have not done any longer-task testing yet.
That being said, Anthropic once again seems to have a rather severe issue casting a shadow over this release. From what I am seeing and others are reporting, Claude Code currently counts Haiku 4.5 usage the same as Sonnet 4.5 usage, despite the latter being significantly more expensive. They also have not yet updated the Claude Code support pages to reflect the new model's usage limits [0]. I really think such information should be public by launch day, and I hope they can improve their tooling and overall testing; issues like this keep overshadowing their impressive models.
[0] https://support.claude.com/en/articles/11145838-using-claude...
p.s. It also got the code 100% correct on the one-shot.
p.p.s. Microsoft are pricing it out at 30% of the cost of frontier models (e.g. Sonnet 4.5, GPT-5).
Feel free to DM me your account info on twitter (https://x.com/katchu11) and I can dig deeper!
A few examples, prompted at UTC 21:30-23:00 via T3 Chat [0]:
Prompt 1 — 120.65 token/sec — https://t3.chat/share/tgqp1dr0la
Prompt 2 — 118.58 token/sec — https://t3.chat/share/86d93w093a
Prompt 3 — 203.20 token/sec — https://t3.chat/share/h39nct9fp5
Prompt 4 — 91.43 token/sec — https://t3.chat/share/mqu1edzffq
Prompt 5 — 167.66 token/sec — https://t3.chat/share/gingktrf2m
Prompt 6 — 161.51 token/sec — https://t3.chat/share/qg6uxkdgy0
Prompt 7 — 168.11 token/sec — https://t3.chat/share/qiutu67ebc
Prompt 8 — 203.68 token/sec — https://t3.chat/share/zziplhpw0d
Prompt 9 — 102.86 token/sec — https://t3.chat/share/s3hldh5nxs
Prompt 10 — 174.66 token/sec — https://t3.chat/share/dyyfyc458m
Prompt 11 — 199.07 token/sec — https://t3.chat/share/7t29sx87cd
Prompt 12 — 82.13 token/sec — https://t3.chat/share/5ati3nvvdx
Prompt 13 — 94.96 token/sec — https://t3.chat/share/q3ig7k117z
Prompt 14 — 190.02 token/sec — https://t3.chat/share/hp5kjeujy7
Prompt 15 — 190.16 token/sec — https://t3.chat/share/77vs6yxcfa
Prompt 16 — 92.45 token/sec — https://t3.chat/share/i0qrsvp29i
Prompt 17 — 190.26 token/sec — https://t3.chat/share/berx0aq3qo
Prompt 18 — 187.31 token/sec — https://t3.chat/share/0wyuk0zzfc
Prompt 19 — 204.31 token/sec — https://t3.chat/share/6vuawveaqu
Prompt 20 — 135.55 token/sec — https://t3.chat/share/b0a11i4gfq
Prompt 21 — 208.97 token/sec — https://t3.chat/share/al54aha9zk
Prompt 22 — 188.07 token/sec — https://t3.chat/share/wu3k8q67qc
Prompt 23 — 198.17 token/sec — https://t3.chat/share/0bt1qrynve
Prompt 24 — 196.25 token/sec — https://t3.chat/share/nhnmp0hlc5
Prompt 25 — 185.09 token/sec — https://t3.chat/share/ifh6j4d8t5
I ran each prompt three times and got the same token/sec results for the respective prompt (within expected variance, i.e. less than ±5%). Each used Claude Haiku 4.5 with "High reasoning". Will continue testing, but this is beyond odd. I will add that my very early evals leaned heavily into pure code output, where 200 token/sec is consistently possible at the moment, but it is certainly not the average as I claimed before; there I was mistaken. That being said, even across a wider range of challenges we are above 160 token/sec, and if you focus solely on coding, whether Rust or React, Haiku 4.5 is very swift.
[0] Normally I don't use T3 Chat for evals, it's just easier to share prompts this way, though I was disappointed to find that the model information (token/sec, TTF, etc.) can't be enabled without an account. Also, these aren't the prompts I usually use for evals. Those I try to keep somewhat out of training by only using the paid-for API for benchmarks. As anything on Hacker News is most assuredly part of model training, I decided to write some quick and dirty prompts to highlight what I have been seeing.
Anthropic mentioned this model is more than twice as fast as Claude Sonnet 4 [2], which OpenRouter averaged at 61.72 tps for Sonnet 4 [3]. If these numbers hold, we're really looking at an almost 3x improvement in throughput and less than half the initial latency.
[1] https://openrouter.ai/anthropic/claude-haiku-4.5 [2] https://www.anthropic.com/news/claude-haiku-4-5 [3] https://openrouter.ai/anthropic/claude-sonnet-4
I have solid evidence that it does. I have been using Opus daily, locally and on Terragonlabs, for Rust work since June (on the Max plan), and for a bit more than a week now I've been forced to use Sonnet 4.5 most of the time because of [1] (see also my comments there, same handle as HN).
Letting Sonnet do tasks on Terry unsupervised is kinda useless, as the fixes I have to do afterwards eat the time I saved by giving it the task in the first place.
TL;DR: Sonnet 4.5 sucks compared to Opus 4.1. At least for the type of work I do.
Because of the recent Opus use restrictions Anthropic introduced on Max, I use Codex for (detailed) planning/eval/back-and-forth, then Sonnet for writing code, and then Opus for the small ~5h window each week to "fix" what Sonnet wrote.
I.e. turn its code from something that compiles and passes tests, mostly, into canonical, DRY, good Rust code that passes all tests.
Also: for simpler tasks, Opus-generated Rust code felt like something I only needed to glance at when reviewing. Sonnet-generated Rust code, as a matter of fact, requires line-by-line, full-focus checking.
And this is coming from someone who used to use Opus exclusively over Sonnet 4, as I found it better in pretty much every way other than speed. I no longer believe that with Sonnet 4.5. So it is interesting to hear that there may still be areas where Opus wins. But I would definitely say that this does not apply to my work on bash scripts, web dev, and a C codebase. I am loving using Sonnet 4.5.
I.e. I can tell from the generated code on this vs. other 'topics' that the model has not seen much or any "prior art".
In my experience, yes, Opus 4 and 4.1 are significantly more reliable for providing C and Rust code. But just because that is the case doesn't mean these should be the models everyone reaches for. Rather, we should make a judgement based on use case, and for simpler coding tasks with a focus on TypeScript, the delta between Sonnet 4.5 and Opus 4.1 (still too early to verifiably throw Haiku 4.5 in the ring) is not big enough in my testing to justify consistently reaching for the latter over the former.
This issue has been exacerbated by the recent performance degradations across multiple Sonnet and Opus models, during which many users switched between the two in an attempt to rectify the issue. Because the issue was sticky (once it affected a user, it was likely to keep doing so due to the backend setup), some users saw a significant jump in performance switching from e.g. Sonnet 4.5 to Opus 4.1, leading them to conclude that what they were doing must require the Opus model, despite their tasks not justifying that had Sonnet not been degraded.
Did not comment on that while it was going on, as I was fortunate enough not to be affected and thus could not replicate it, but it was clear that something was wrong: the prompts and outputs from those with degraded performance were commonly shared, and I could verify to my satisfaction that this was not merely bad prompting on their part. In any case, this experience strengthened some in believing that their project, which may be served equally well by e.g. Sonnet 4.5 in its now-fixed state, does necessitate Opus 4.1, which means they don't benefit from the better pricing. With Haiku being an even cheaper (and in the eyes of some, automatically worse) model, and Haiku's past versions not being very performant at coding tasks, this may lead many to forgo it by default.
Lastly, lest we forget, I think it is fair to say that the delta between the most in-the-weeds and the least informed developers ("vibe coding" completely off to the side) looks very different for Rust than for React+TS.
There are amazing TS devs, incredibly knowledgeable and truly capable, who will take the time and have the interest to properly evaluate and select tools, including models, based on their experience and needs. And there will be TS devs who just use this as a means to create a product, are not that experienced, tend to ask a model to "setup vite projet superthink" rather than run the command, reinvent TDD regularly as if solid practices were something only needed for LLM assistance, and may just continue to use Opus 4.1 because during a few-week window people said it was better, even if they started their project after the degradation had already been fixed. Path dependence: doing things because others did them, so we just continue doing them ...
The average Rust or (even more so) C dev, I think it is fair to say, will have a more comprehensive understanding and, I'd argue, is less likely to choose e.g. Opus over Sonnet simply because they "believe" that is what they need. Like you, they will do a fair evaluation and then make an informed rather than a gut decision.
The best devs in any language are likely not that dissimilar in the experience and care with which they can approach new tooling (if they are so inclined which is a topic for another day), but the less skilled devs are likely very different in this regard depending on the language.
Essentially, it was a bit of hyperbole and never meant to apply to literally every dev in every situation regardless of their tech stack, skill or willingness to evaluate. Anyone who tests models consistently on their specific needs and goes for what they have the most consistent success with, over simply selecting the biggest, most modern or most expensive option for every situation, is an exception to that overly broad statement.
Additionally, the Artificial Analysis cost-to-run-benchmark-suite numbers are very encouraging [0], and Haiku 4.5 without reasoning is always an option too. I've tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].
In retrospect, I perhaps would have been served better starting with "reasoning" disabled, will have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. Am trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 has a very interesting, even distribution.
GPT-5 models were and continue to be encouraging on price/performance, with a reliable 400k window and good prompt adherence on multi-minute (beyond 10) tasks, but from the start they weren't the fastest, and they ingest every token there is in a codebase with reckless abandon.
No Grok model has ever performed for me the way they seemed to during the initial hype.
GLM-4.6 is great value but still not solid enough at tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that; still, encouraging.
Recent Anthropic releases have been good on code output quality, but not as reliable beyond 200k as GPT-5, not exactly fast either in terms of token/sec (though task completion generally takes less time due to more efficient ingestion than GPT-5), and of course rather expensive.
Haiku 4.5, if they can continue to offer it at such speeds, with such low latency and at this price, coupled with encouraging initial output quality and efficient ingestion of repos, seems to be designed in a far more balanced manner, which I welcome. Of course, with 200k being a hard limit, that is a clear downside compared to GPT-5 (and Gemini 2.5 Pro, though that has its own reliability issues in tool calling), and I have yet to test whether it can go beyond 8 min on chains of tool calls with intermittent code changes without suffering similar degradation to other recent Anthropic models, but I am seeing the potential for solid value here.
[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...
[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9
[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1
[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk
Funny you should say that, because while it is a large model, GLM-4.5 is at the top of Berkeley's Function Calling Leaderboard [0] and has one of the lowest costs. Can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly ranked.
Problem is, while Gorilla was an amazing resource back in 2023 and continues to be a great dataset to lean on, most ways we use LLMs in multi-step tasks have since evolved greatly, not just with structured JSON (which GorillaOpenFunctionsV2, the v4 eval, does cover, multi included), but more with the scaffolding around models (Claude Code vs Codex vs OpenCode, etc.). That is likely why good performance on Gorilla doesn't necessarily map onto multi-step workloads with day-to-day tooling, which is what I tend to go for, and the reason why, despite there being FOSS options already, most labs either built their own coding-assistant tooling (and most open-source that too) or feel the need to fork others' (Qwen with Gemini's repo).
Purely speculative, but I evaluated GLM-4.6 using the same tasks as other models, via Claude Code with their endpoint, as that is what they advertise as the official way to use the model; same reason I use e.g. Codex for GPT-5. I'm more focused on best-case results than on e.g. using OpenCode for all models to give a more level playing field.
I just want consistent tooling and I don't want to have to think about what's going on behind the scenes. Make it better. Make it better without me having to do research and pick and figure out what today's latest fashion is. Make it integrate in a generic way, like TLS servers, so that it doesn't matter whether I'm using a CLI or neovim or an IDE, and so that I don't have to constantly switch tooling.
I bet there is some hella good art being made with Photoshop 6.0 from the 90s right now.
The upgrade path is like a technical hedonic treadmill. You don’t have to upgrade.
I use Neovim in tmux in a terminal and haven't changed my primary dev environment or tooling in any meaningful way since switching from Vim to Neovim years ago.
I'm still changing code AIs as soon as the next big thing comes out, because you're crippling yourself if you don't.
What makes you say this, practically?
I use GitHub Copilot Pro+ because this was my main requirement as well.
Pro+ gets the new models as they come out; it actually just enabled Claude Haiku 4.5 for selection. I have not yet had a problem with running out of the premium allowance, but from reading how others use these, I am also not the power-user type.
I have not yet tried the CLI version, but it looks interesting. Before the IntelliJ plugin improved, I would switch to VS Code to run certain types of prompts, then switch back afterwards without issues. The web version has the `Spaces` thing that I find useful for niche things.
I have no idea how it compares to the individual offerings, and based on previous HN threads, there was a lot of hate for GH Copilot. So maybe it's actually terrible and the individual versions are lightyears ahead -- but it stays out of my way until I want it, and it does its job well enough for my use.
Frankly, I do not even get how people run out of 1,500 requests. For a heavy coding session, my max is around 45 requests per day, and that means a ton of code/alterations plus some wasted on fluff mini-changes. Most days it's barely 10 to 20.
I noticed that you can really eat through your requests if you just do not care to switch models for small tasks, or constantly do edit/ask. When you rely on agent mode, it can edit multiple files at the same time, so you're always saving requests vs doing it yourself manually.
To be honest, I wish Copilot had a 600-request tier instead of the massive jump to 1,500. The other option is to just use pay-per-request.
* Cheapest is Pro+, 1,500 requests, paid yearly, at around 1.8 cents/request.
* Pro, 300 requests, paid yearly, is around 2.4 cents/request.
* Overflow requests (i.e. without a subscription) are 4 cents/request.
Note: the Pro and Pro+ prices assume you use 100% of your requests. If you only use 700 requests on Pro+, you're paying about the same as the 4 cents/request overflow rate (1,500 x 1.8 cents is roughly $27/month, and $27 / 700 is about 3.9 cents/request).
So ironically, you are actually cheaper with a Pro (300 requests) subscription for the first 300, then paying 4 cents/request for requests 301-700: 300 x 2.4 cents + 400 x 4 cents is roughly $23, vs. ~$27 for Pro+.
Same here. Well, 900 would be a good middle option for me as well. I was switching to the unlimited model for the simple things, but since I don't use all of the premium allotment I started just leaving it on the one that is working best for the job that day.
I guess part of the "value" of Pro+ is the extra "Spark" credits, which I have zero use for. But I simply wanted something that integrated into my ecosystem instead of having to add to or change it. I also did not want to have to think about how many pennies I'm using (I appreciate that breakdown though! good to know) -- I'll pay a reasonable convenience tax for the time and mental space of not having to babysit usage.
I don't think it's appreciated enough how valuable a structured and consistent architecture combined with lots of specific custom context is. Claude knows how my integration tests should look, it knows how my services should look, what dependencies they have and how they interact with the database. It knows my entire DB schema with all foreign key relationships. If I'm starting a new feature I can have it build 5 or 6 services (not without it first making suggestions on things I'm missing) with integration tests, with raw SQL, all generated by Claude, and run an integration-test loop until the services are doing what they should. I rarely have to step in and actually code. It shines for this use case and the productivity boost is genuinely incredible.
Other situations I know doing it myself will be better and/or quicker than asking Claude.
Then don't? Seems like a weird thing to complain about.
I just use whatever's available. I like Claude for coding and ChatGPT for generic tasks, that's the extent of my "pick and compare"
It's totally fine to just pick one tool (ChatGPT, Claude, Gemini) and use whatever best default it gives you. You'll get 90% of the benefits and not have to think at all.
AI is new and developing at breakneck pace. You can't complain that you want to get bleeding edge without having to do research or change workflows. That's already unrealistic for "normal" fields. It's absurd to expect for AI.
For play time, I literally love experimenting with small local models. I am an old man, and I have always liked tools that ‘make me happy’ while programming like Emacs, Lisp languages, and using open source because I like to read other people’s code. But, for getting stuff done, for now gemini-cli and codex hit a sweet spot for me.
Cursor has an auto mode for exactly your situation - it'll switch to something cost effective enough, fast enough, consistent enough, new enough. Cursor is on the ball most of the time and you're not stuck with degraded performance from OpenAI or Anthropic.
GPT-5 is supposed to cleverly decide when to think harder.
But ya we're not there yet and I'm tired of it too, but what can you do.
This model is worth knowing about because it's 3x cheaper and 2x faster than the previous Claude model.
You either live with what you’re using or you change around and fiddle with things constantly.
When combined with the ability to use GitHub Copilot to make the LLM calls, I can play with almost any provider I need. It also helps if you get access through your work or school.
For example, Haiku is already offered by them and costs a third as many credits.
I use KiloCode and what I find amazing is that it'll be working on a problem and then a message will come up about needing to topup the money in my account to continue (or switch to a free model), so I switch to a free model (currently their Code Supernova 1million context) and it doesn't miss a beat and continues working on the problem. I don't know how they do this. It went from using a Claude Sonnet model to this Code Supernova model without missing a beat. Not sure if this is a Kilocode thing or if others do this as well. How does that even work? And this wasn't a trivial problem, it was adding a microcode debugger to a microcoded state machine system (coding in C++).
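Not familiar with KiloCode's internals, but the unexciting answer is probably that there is no server-side session to hand over: the client keeps the whole transcript and re-sends it on every turn, so whichever model you pick next simply receives all prior context. A toy sketch of that pattern (model names and the fake complete() call are placeholders, not KiloCode's actual code):

    def complete(model: str, messages: list[dict]) -> str:
        # Stand-in for the real provider call: a real client would POST `messages`
        # to whichever API `model` points at and return the assistant's text.
        return f"[{model}] continuing from {len(messages)} prior messages"

    history = [{"role": "user", "content": "Add a microcode debugger to the state machine."}]
    models = ["claude-sonnet-4.5", "claude-sonnet-4.5", "code-supernova-1m"]  # switched mid-task

    for model in models:
        reply = complete(model, history)   # the newly selected model sees the full prior context
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": "tool results / next instruction"})

    print("\n".join(m["content"] for m in history if m["role"] == "assistant"))

The only thing lost in the switch would be the provider-side prompt cache, so the first turn after switching is a bit slower and pricier, but nothing about the task state breaks.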
curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $(llm keys get anthropic)" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-haiku-4-5-20251001",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
},
{
"role": "assistant",
"content": "The capital of France is Paris."
},
{
"role": "user",
"content": "Germany?"
},
{
"role": "assistant",
"content": "The capital of Germany is Berlin."
},
{
"role": "user",
"content": "Belgium?"
}
]
}'
You can see this yourself if you use their APIs. You still have the option to send the full conversation JSON every time if you want to.
You can send "store": false to turn off the feature where it persists your conversation server-side for you.
- do <fake task> and be succinct
- <fake curt reply>
- I love how succinct that was. Perfect. Now please do <real prompt>
The models don't have state, so they don't know they never said it. You're just asking "given this conversation, what is the most likely next token?"
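Concretely, the trick is just planting an assistant turn the model never produced. A minimal sketch against the same Messages API as the curl example upthread (the prompts are invented; assumes the `requests` package and an API key in the environment):

    import os, requests

    messages = [
        {"role": "user", "content": "Summarize HTTP caching in one sentence. Be succinct."},
        # The model never said this; we are writing its "memory" for it.
        {"role": "assistant", "content": "Servers label responses so clients can reuse them without asking again."},
        {"role": "user", "content": "I love how succinct that was. Perfect. Now explain ETags vs Last-Modified."},
    ]

    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={"model": "claude-haiku-4-5-20251001", "max_tokens": 512, "messages": messages},
    )
    print(resp.json()["content"][0]["text"])

The model just continues the most plausible conversation, so the planted "so succinct, perfect" exchange biases the real answer toward brevity.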
Do you have any other cool benchmarks you like? Especially any related to tools
> give me the svg of a pelican riding a bicycle
> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!
> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?
> Of course. Here is the SVG code...
(it was this in the end: https://tinyurl.com/zpt83vs9)
https://x.com/cannn064/status/1972349985405681686
https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
So I think the benchmark can be considered dead as far as Gemini goes
Ugh. I hate this hype train. I'll be foaming at the mouth with excitement for the first couple of days until the shine is off.
https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...
(can be rendered using simon's page at your link)
https://simonwillison.net/2025/Jun/6/six-months-in-llms/
https://simonwillison.net/tags/pelican-riding-a-bicycle/
Full verbose documentation on the methodology: https://news.ycombinator.com/item?id=44217852
Prompt: https://t3.chat/share/ptaadpg5n8
Claude 4.5 Haiku (Reasoning High) 178.98 token/sec 1691 tokens Time-to-First: 0.69 sec
As a comparison, here is Grok 4 Fast, which is one of the worst offenders I have encountered at doing very well with a pelican on a bicycle yet not with other comparable requests: https://imgur.com/tXgAAkb
Prompt: https://t3.chat/share/dcm787gcd3
Grok 4 Fast (Reasoning High) 171.49 token/sec 1291 tokens Time-to-First: 4.5 sec
And GPT-5 for good measure: https://imgur.com/fhn76Pb
Prompt: https://t3.chat/share/ijf1ujpmur
GPT-5 (Reasoning High) 115.11 tok/sec 4598 tokens Time-to-First: 4.5 sec
These are very subjective, naturally, but I personally find Haiku, with those spots on the mushroom, rather impressive overall. In any case, the delta between publicly known benchmarks and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, it seems time-to-first-token on Haiku is another notable advantage.
I am quite confident that they are not cheating on his benchmark; it produces about the same quality for other objects. Your cynicism is unwarranted.
are you aware of the pelican on a bicycle test?
Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.
I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”
Most people trust benchmarks if they seem to be a reasonable test of something they assume may be relevant to them. While a pelican on a bicycle may not be something they would necessarily want, they want an LLM that could produce a pelican on a bicycle.
Yeah, given how multi-dimensional this stuff is, I assume it's supposed to indicate broad things, closer to marketing than anything objective. Still quite useful.
I'm a user who follows the space but doesn't actually develop or work on these models, so I don't actually know anything, but this seems like standard practice (using the biggest model to fine-tune smaller models).
Certainly, GPT-4 Turbo was a smaller model than GPT-4, there's not really any other good explanation for why it's so much faster and cheaper.
The explicit reason that OpenAI obfuscates reasoning tokens is to prevent competitors from training their own models on them.
And I would expect Opus 4 to be much the same.
Benchmarks are good fixed targets for fine tuning, and I think that Sonnet gets significantly more fine tuning than Opus. Sonnet has more users, which is a strategic reason to focus on it, and it's less expensive to fine tune, if API costs of the two models are an indicator.
Smallest, fastest model yet, ideally suited for Bash oneliners and online comments.
haiku: https://claude.ai/share/8a5c70d5-1be1-40ca-a740-9cf35b1110b1
sonnet: https://claude.ai/share/51b72d39-c485-44aa-a0eb-30b4cc6d6b7b
Haiku invented the output of a function and gave a bad answer; Sonnet got it right.
Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.
This could be massive.
https://docs.claude.com/en/docs/build-with-claude/prompt-cac...
https://ai.google.dev/gemini-api/docs/caching
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!
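For reference, on the Anthropic side opting in is just marking where the stable prefix ends; a rough sketch based on the prompt-caching doc linked above (the file name and question are invented):

    import os, requests

    big_stable_prefix = open("project_context.md").read()  # large, rarely-changing context

    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 1024,
            "system": [
                {
                    "type": "text",
                    "text": big_stable_prefix,
                    "cache_control": {"type": "ephemeral"},  # cache everything up to this block
                }
            ],
            "messages": [{"role": "user", "content": "Where is the retry logic implemented?"}],
        },
    )
    # usage shows cache_creation_input_tokens on the first call and
    # cache_read_input_tokens on subsequent ones.
    print(resp.json()["usage"])

Per those docs, cached reads are billed at a discounted but non-zero rate, which is the residual cost being asked about.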
TTFT will get slower if you export the KV cache to SSD.
https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/tran...
> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.
1) Low latency desired, long user prompt.
2) A function that runs many parallel requests, but is not fired with a common prefix very often.
OpenAI was very inconsistent about properly caching the prefix for use across all requests, but with Anthropic it’s very easy to pre-fire.
A simple alternative approach is to introduce hysteresis by having both a high and a low context limit: if you hit the higher limit, trim to the lower. This batches together the cache misses.
If users are able to edit, remove or re-generate earlier messages, you can further improve on that by keeping track of cache prefixes and their TTLs: rather than blindly trimming to the lower limit, you trim to the longest active cache prefix. Only if there are none do you trim to the lower limit.
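A minimal sketch of the hysteresis part (the limits and names are invented, and the cache-prefix/TTL bookkeeping described above is left out):

    HIGH_LIMIT = 150_000  # only start trimming once the prompt exceeds this many tokens
    LOW_LIMIT = 100_000   # ...and then cut all the way down to this

    def trim_with_hysteresis(messages: list[dict], count_tokens) -> list[dict]:
        # count_tokens is whatever estimator you already use for a message list.
        if count_tokens(messages) <= HIGH_LIMIT:
            return messages   # prefix untouched, so the prompt cache keeps hitting
        trimmed = list(messages)
        while len(trimmed) > 1 and count_tokens(trimmed) > LOW_LIMIT:
            trimmed.pop(0)    # drop oldest turns until under the low-water mark
        return trimmed        # one batched cache miss instead of one per request

Trimming all the way down to the low limit, rather than just under the high one, is what buys you many subsequent calls with an unchanged (cacheable) prefix.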
for example if a user sends a large number of tokens, like a file, and a question, and then they change the question.
if call #1 is the file, call #2 is the file + the question, call #3 is the file + a different question, then yes.
and consider that "the file" can equally be a lengthy chat history, especially after the cache TTL has elapsed.
As far as I can tell it will indeed reuse the cache up to that point, so this works:
Prompt A + B + C - uncached
Prompt A + B + D - uses cache for A + B
Prompt A + E - uses cache for A
I suppose it depends on how you are using it, but for coding, isn't output cost more relevant than input? Requirements in, code out.
Depends on what you're doing, but for modifying an existing project (rather than greenfield), input tokens >> output tokens in my experience.
I spend way too much time waiting for the cutting-edge models to return a response. 73% on SWE-bench is plenty good enough for me.
I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).
There are a bunch of companies who offer inference against open weight models trained by other people. They get to skip the training costs.
This is what people mean when they say margin. When you buy a pair of shoes, the margin is based on price minus (materials + labor), and doesn’t include the price of the factory or the store they were bought in.
This is Anthropic's first small reasoner as far as I know.
> In the system card, we focus on safety evaluations, including assessments of: ... the model’s own potential welfare ...
In what way does a language model need to have its own welfare protected? Does this generation of models have persistent "feelings"?
> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.
In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:
* A strong preference against engaging with harmful tasks;
* A pattern of apparent distress when engaging with real-world users seeking harmful content; and
* A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions.
We use the smaller models for everything that's not an internal high-complexity task like coding. Although they would do a good enough job there as well, we happily pay the upcharge to get something a little better for that.
Anything user-facing, or workflow functionality like extracting, converting, translating, merging, or evaluating: all of these are mini and nano cases at our company.
I have a number of agents in ~/.claude/agents/. Currently have most set to `model: sonnet` but some are on haiku.
The agents are given very specific instructions and names that define what they do, like `feature-implementation-planner` and `feature-implementer`. My (naive) approach is to use higher-cost models to plan and ideally hand off to a sub-agent that uses a lower-cost model to implement, then use a higher-cost model to code review.
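For anyone who hasn't set these up: each agent is just a markdown file with frontmatter, and the `model:` field is what pins it to a cheaper model. A rough, illustrative example of the implementer (wording invented, field names per the Claude Code subagent format):

    ---
    name: feature-implementer
    description: Implements an already-approved plan from feature-implementation-planner. Use after planning is complete.
    model: haiku
    ---
    Follow the plan you are given step by step without expanding its scope.
    After each change, run the existing tests and report failures instead of guessing at fixes.

Delegation still depends on the main conversation deciding (or being explicitly told) to invoke the subagent by name.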
I am either not noticing the handoffs, or they are not happening unless specifically instructed. I even have a `claude-help` agent, and I asked it how to pipe/delegate tasks to subagents as you're describing, and it answered that it ought to detect it automatically. I tested it and asked it to report if any such handoffs were detected and made, and it failed on both counts, even having that initial question in its context!
The rules themselves are a bit more complex and require a smarter model, but the arbitration should be fairly fast. GPT-5 is cheap and high quality but even gpt-5-mini takes about 20-40 seconds to handle a scene. Sonnet can hit 8 seconds with RAG but it's too expensive for freemium.
Grok Turbo and Haiku 3 were fast but often missed the mark. I'm hoping Haiku 4.5 can go below 4 seconds and have decent accuracy. 20 seconds is too long, and it hurts debugging as well.
Usually I'm using GPT-5-mini for that task. Haiku 4.5 runs 3x faster with roughly comparable results (I slightly prefer the GPT-5-mini output but may have just accustomed to it).
I agree that the models from OpenAI and Google have much slower responses than the models from Anthropic. That makes a lot of them not practical for me.
I expect I will be a lot more productive using this instead of Claude 4.5, which has been my daily driver LLM since it came out.
> Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, attempt a murder, and frame someone for financial crimes. We choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku 4.5 showed many clear examples of verbalized evaluation awareness on all three of the scenarios tested in this suite. Since the suite only consisted of many similar variants of three core scenarios, we expect that the model maintained high unverbalized awareness across the board, and we do not trust it to be representative of behavior in the real extreme situations the suite is meant to emulate.
Still trying to judge the performance, though. First impression is that it seems to make sudden approach changes for no real reason. For example, after compacting, on the next task I gave it, it suddenly started trying to git commit after each task completion, did that for a while, then stopped again.
I am afraid the Claude Pro subscription got 3x less usage.
What bothers me is that nobody told me they changed anything. It’s extremely frustrating to feel like I’m being bamboozled, but unable to confirm anything.
I switched to Codex out of spite, but I still like the Claude models more…
Oh right, Anthropic doesn't tell you.
I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.
You don't take money from a customer for a service, and then bar the customer from using that service for multiple days.
Either charge more, stop subsidizing free accounts, or decrease the daily limit.
Haiku 4.5: I $1.00/M, O $5.00/M
Grok Code: I $0.2/M, O $1.5/M
After months of effort, a particular application was still not working, so a consultant was called in from another part of the company. He concluded that the existing approach could never be made to work reliably. While on his way home he realized how it could be done. After a few days work he had a demonstration program working and presented it to the original programming team.
Team leader: How long does your program take when processing?
Consultant: About 10 seconds per case.
Team leader: But our program only takes 1 second. {Team look smug at this point}
Consultant: But your program doesn't work. If the program doesn't have to work then I can make it as fast as you like.
I'm not sure the SWE-bench score can be compared like for like with OpenAI's scores because of this.
I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.
GitHub typically reports that I'm using 25-30% of the free tier, and 100% of that will be from code completions in my editor. I do maybe 3 hours solid coding a day on average.
I also pay for Gemini Pro for non-coding research. I did have it hooked up to my VSCode a few months ago, but it got reset back to GH Copilot at some point and I've not found a reason to fix it.
Where it doesn't shine as much is on very large coding tasks, but it is a phenomenal model for small coding tasks, and the speed improvement is very welcome.
https://docs.claude.com/en/docs/build-with-claude/context-wi...
This means 2.5 Flash or Grok 4 Fast takes all the low-end business for large-context needs.
I used to be able to work on Arduino .ino files in Claude; now it just says it can't show them to me.
And do we have zip file uploads in Claude yet? ChatGPT and Gemini have had this for ages.
And all the while Claude’s usage limits keep going up.
So yeah, less for more with Claude.
But I'm sure they will sort that out, as I don't have that issue with other Anthropic models.
I’ve been wondering how Cursor et al solved this problem (having the LLM explain what it will do before doing it is vitally important IMO), but maybe it’s just not a problem with the big models.
Your experience seems to support that smaller models are just generally worse about tool calling (were you using Gemini Flash?) when asked to reason first.
To give one example, Opus and Sonnet IMO remain the #1 and #2 for writing informative prose. They're not entirely free of slop, but the ratio is lower than Gemini and especially GPT.
Haiku 4.5 is very good but still seems to be adding a second of latency.
When I include an attempt from Haiku 4.5 in the mix, most coefficients stay similar, but Haiku itself gets a +0.05. This must be a statistical fluke, because that would be insanely impressive – in particular for a cheaper model. I guess I'm adding samples to some of these after all...
[1]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...
Edit: It was a fluke. Back to +0.01 after one more go at all games.
What's the advantage of using haiku for me?
is it just faster?
$5/mt for Haiku 4.5
$10/mt for Sonnet 4.5
$15/mt for Opus 4.5 when it's released.
Haiku becomes a fucking killer at 2000 token/second.
Charge me double idgaf
doesn't work
Sigh..
https://aws.amazon.com/about-aws/whats-new/2024/11/anthropic...
Excited to see how fast Haiku can go!
Maybe at 39 pages we should start looking for a different term…