Show HN: State of the Art of Coding Models, According to Hacker News Commenters

87 points

by yunusabd

9 hours ago

| 18 comments

| hnup.date

Hello HN,

I was away from my computer for two weeks, and after coming back and reading the latest discussions on HN about coding assistants (models, harnesses), I felt very out of the loop. My normal process would have been to keep reading and figure out the latest and greatest from people's comments, but I wanted to try and automate this process.

Basically the goal is to get a quick overview of which coding models are popular on HN. A next iteration could also scan for harnesses that people use, or info on self-hosting or hardware setups.

I wrote a short intro on the page about the pipeline that collects and analyzes the data, but feel free to ask for more details or check the Google Sheet for more info.

https://hnup.date/hn-sota

jdw64

9 hours ago

Interpreting these metrics is quite interesting.

One thing for sure is that while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime. On the other hand, the runner-up, GPT-5.5, actually seems to have more positive feedback.

Personally, my experience with Codex wasn't as good as with Claude Code (Codex freezes on Windows more often than you'd expect), so this is a bit surprising. That said, the more defensive GPT is definitely better in terms of sheer code-writing capability. However, GPT actually has quite a few issues with text corruption when generating in Korean or Chinese—something English-speaking users probably don't notice. In terms of model capabilities, when given the same agent.md (CLAUDE.md) file, I think GPT is better at writing code, while Claude is better at writing text during code reviews.

Looking at the bottom right, Qwen and DeepSeek are open-source, so they are largely mentioned in the context of guarding against vendor lock-in, which drives positive sentiment. Considering that Hacker News occasionally shows negative sentiment toward China, the fact that they are viewed this positively—unlike US models—shows that being open-source is a massive advantage in itself.

Anyway, one thing for sure is that Gemini is pretty much unusable.

sgc

2 hours ago

I think it's decidedly premature to compare models using the same .md file, since they respond quite differently to the same input. I try to narrow down to the top 2-3 and then refine inputs for each one. For me it's unfortunately not much better than an intuitive process of trial and error.

Gemini is not at all unusable. It is quite usable for the tasks it excels at, to the point that it is the top pick for many tasks and I spend more money there than elsewhere. On the other hand, it responds quite differently from the other major models: Claude and GPT are similar to each other, while Gemini requires a different approach. In my opinion, people who think Gemini is worthless have not learned how to prompt it correctly. Again, this is intuitive and comes from watching concrete differences in responses caused by small input changes, but if I had to summarize, it shows its Google Books / Google Scholar roots.

I have started experimenting with Qwen more than DeepSeek, but I have not had good results yet. Given the good press, I presume I will learn how to interact with it for better results.

Curious if others have similar experiences in comparing models usefully, or if most don't bother with this, or do something else? I mainly use models for highly focused specialty tasks, so this fine tuning makes the difference between usable and unusable. I don't yet have the luxury of defining my preferred workflow and finding the tool for the task. Everything just breaks almost immediately if I try to shoehorn into my preferred flow.

uxcolumbo

9 minutes ago

What are your prompting and general tips for using Gemini effectively?

And what use cases do you think it’s best suited for?

pryanshu89

27 minutes ago

I know it's subjective, but I tried different models with my OpenRouter subscription and the VSCode Roocode plugin. I evaluated them based on cost and code quality. I liked gemini-3-flash-preview.

It's really a cost-effective model.

2ndorderthought

6 hours ago

I like your analysis but I think the open models are genuinely well received not only because of vendor lock in or being open source.

They are cheaper! All signals point to them staying cheaper because they are built more sustainably. Also, some of the latest entries can run on 1 GPU! Literally available at your desktop, where there can be no service interruptions. Not even network latency. People are one- and few-shotting little games for 0 dollars because they bought a GPU to play video games this year. To me that's an unbeatable value. Once the tooling catches up and a few more models are released, it could change everything completely.

dgacmu

5 hours ago

I had a surprisingly positive experience with Gemini optimizing some mathy MPS code. It did far better than Claude.

Of course, when I tried it on something else it rewrote every line in the file for no good reason, applied changes directly when I told it just to plan, etc.

So maybe it has one strength.

petesergeant

1 hour ago

Yeah, I think we are pretty past an idea of "better" and are at the point where it needs qualification as "better at". "Claude writes, Codex reviews, and Gemini doesn't get installed" is my go-to, although I go to Gemini whenever I want an advanced graphical calculator, or data extraction of any type.

devmor

35 minutes ago

Mostly my experience, but “Gemini crunches data” would be my replacement there.

If I have a task that requires parsing through swathes of irregular data that traditional ML would choke on (or require an intermediate training step à la BigQuery), I have gotten much better results from Gemini than the other two.

awesome_dude

6 hours ago

> Anyway, one thing for sure is that Gemini is pretty much unusable

Ha! I find that Gemini is quite useful - if only because I am forced to use it (on my personal projects) because it's the only one that has unlimited interaction for "free"

It has its limitations, yes, but so does Claude (which I am leaning on too heavily at work at the moment)

gertlabs

1 hour ago

This is awesome data! I've been wanting to measure how closely hype aligns to our results at https://gertlabs.com/rankings

Subjectively, it seemed like DeepSeek V4 Pro had the highest hype/performance ratio (meaning high hype for lower performance). Whereas MiMo V2.5 Pro didn't get much attention despite being the top dog in the open weights world, not even an honorable mention in your chart :( ...

yunusabd

2 minutes ago

There is one mention of MiMo V2.5 Pro in the data by... you! It's in the UserRatings tab in the sheet, if you want to have a look.

Searching for it on HN shows very few results, that's why it's not showing up in the analysis yet. But it might in the future, once it gains traction.

I'll keep an eye on it, thanks for bringing it up!

https://news.ycombinator.com/item?id=47911464

2ndorderthought

6 hours ago

Interesting to see the positive sentiment around kimi2.6, qwen3.6 and deepseek relative to the negative. I hope the trend of people appreciating open models continues. They aren't household names yet, but it's a higher percentage than I thought it would be. Especially on HN where we are all talking about businesses.

I am upset because now Anthropic, OpenAI, Meta, etc. will continue their smear campaigns here. But I am also happy because it will make HN less useful when they do.

Everything is a give and take I guess. Excited to see where the equilibrium sits

SilverElfin

5 hours ago

Is it just “smear campaigns”? Don’t get me wrong - I don’t want big tech or big AI monopolies and appreciate the open weight models. But it’s also true that Chinese companies are basically stealing through distillation and also that they censor to align to CCP rules. They’re problematic in a different way.

What I want is more fully open models where everything is shared. Data, training algorithms, weights. That way we can figure out if we should trust it.

2ndorderthought

5 hours ago

They are all stealing from each other, just like how they all stole from us. Grok supposedly admitted to distilling from OpenAI, for instance.

I think it's also unfair to say their success is solely due to stealing data. They are contributing a lot of advances to the literature about what they are doing. The proof is in the results: we have 27B models you can vibe code with, not 1T+.

It's murky, sure. But there are smear campaigns about how people can't trust China too. There's some truth to that, but we can't trust the US either, so local models are an interesting way for China to offer us some level of sovereignty.

cheesecakegood

3 hours ago

It's extra interesting because I think the model people should be talking about is actually not DeepSeek V4 Pro, but the Flash version. When accounting for cache hits, the input price (per OpenRouter) is effectively only 6 cents per million tokens (3 vs 14 cents hit/miss), and 28 cents on output. That's really good efficiency, and it's not a sale price like they are doing with V4 Pro, it's the normal price.
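To make the cache arithmetic above concrete, here is a minimal sketch in Python. It assumes the effective price is a simple linear blend of the hit and miss prices, which is an assumption about how the 6 cent figure was derived rather than something the comment states. Under that assumption, 3 vs 14 cents blending to 6 cents implies a cache hit rate of 8/11, or roughly 73%.

```python
def effective_input_price(hit_price: float, miss_price: float, hit_rate: float) -> float:
    """Blend cached and uncached input prices (cents per million tokens)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


def implied_hit_rate(effective: float, hit_price: float, miss_price: float) -> float:
    """Solve the linear blend above for the hit rate that gives `effective`."""
    return (miss_price - effective) / (miss_price - hit_price)


# Prices quoted in the comment: 3 cents (hit), 14 cents (miss), 6 cents effective.
rate = implied_hit_rate(6.0, 3.0, 14.0)
print(f"implied cache hit rate: {rate:.1%}")
```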

It's actually pretty difficult to find a good comparison model because there isn't one. Again, as a 14/28 cent in/out model, ignoring cache, it scores just below GPT 5.4 Mini-xhigh (75/450) and Gemini 3 Flash (50/300) in intelligence. It's similar to Gemma 4 31B in some metrics (13/38) including cost, so it's not completely unheard of, but it's pretty notable that virtually everything else in the same region in most benchmarks is going to cost at least 5 times more (much, much more in very output-heavy contexts).
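How large the price gap is depends on the input/output mix. The rough Python sketch below uses only the in/out prices quoted above (cents per million tokens, ignoring cache) and compares each model to the 14/28 one. The 10:1 input-to-output token mix is an assumption chosen for illustration, not a measured workload.

```python
# (input, output) prices in cents per million tokens, as quoted in the comment.
PRICES = {
    "DeepSeek V4 Flash": (14, 28),
    "GPT 5.4 Mini": (75, 450),
    "Gemini 3 Flash": (50, 300),
    "Gemma 4 31B": (13, 38),
}


def blended_cost(prices: tuple[int, int], input_mtok: float, output_mtok: float) -> float:
    """Total cost in cents for the given millions of input and output tokens."""
    input_price, output_price = prices
    return input_price * input_mtok + output_price * output_mtok


def cost_ratios(baseline: str, input_mtok: float, output_mtok: float) -> dict[str, float]:
    """Cost of every model relative to the baseline model for the same token mix."""
    base = blended_cost(PRICES[baseline], input_mtok, output_mtok)
    return {
        name: blended_cost(prices, input_mtok, output_mtok) / base
        for name, prices in PRICES.items()
    }


# Assumed mix: 10M input tokens per 1M output tokens.
ratios = cost_ratios("DeepSeek V4 Flash", input_mtok=10, output_mtok=1)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x")
```

Changing `output_mtok` relative to `input_mtok` shows how the gap widens for output-heavy workloads.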

esperent

2 hours ago

It's well priced but does that have much relevance for "state of the art coding models", specifically?

I wouldn't use Gemini 3 Flash or GPT 5.4 mini for anything except the most trivial work, although both are useful for basic exploratory work.

So I'm using a heavy model for the bulk of the work and the cost of that so far outweighs the light model that the light model cost is effectively irrelevant.

julianlam

1 hour ago

It's so interesting to see the wild pendulum swings of LLM sentiment here.

If one likes a model then it's capable of one-shotting entire apps.

Otherwise it's "only suitable for the most trivial tasks".

Never in between.

esperent

1 hour ago

You're confusing "different people with different opinions" with "wild pendulum swings".

Personally my opinion in this regard is highly consistent over time.

Jabbles

8 hours ago

Please fix your graph so the names of the models are readable

marcuskaz

8 hours ago

Also, the stacked graph only allows you to quickly see total mentions; it's really hard to compare negative or positive sentiment across models at a glance.

yunusabd

7 hours ago

Yep, a toggle to scale all columns to the same height could solve this. I'll look into it when I do the custom graph.

Edit: Done
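The equal-height toggle described above amounts to converting each model's sentiment counts into shares of that model's total. A minimal sketch in Python follows. The model names and counts are made up for illustration, and this is not the code behind the actual chart.

```python
def to_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw sentiment counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # A model with no mentions gets an empty (all-zero) column.
        return {sentiment: 0.0 for sentiment in counts}
    return {sentiment: n / total for sentiment, n in counts.items()}


# Made-up counts for two hypothetical models.
mentions = {
    "model_a": {"positive": 30, "neutral": 50, "negative": 20},
    "model_b": {"positive": 6, "neutral": 2, "negative": 2},
}

normalized = {model: to_shares(counts) for model, counts in mentions.items()}
for model, shares in normalized.items():
    print(model, {k: round(v, 2) for k, v in shares.items()})
```

With equal-height columns, a model with few mentions can be compared on sentiment mix, at the cost of hiding how small its sample is.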

marcuskaz

6 hours ago

Much better, nice update!

yunusabd

6 hours ago

Thanks for the comment, should be fixed now.

smeej

7 hours ago

Came here to offer this feedback. If I can't see the name of the model, nothing else in the chart really matters to me. I even tried going to the Google Sheet.

It's way too important a piece of information not to have it visible.

yunusabd

6 hours ago

Thanks, I replaced it with a custom graph, should be easier to read now.

idivett

6 hours ago

Thanks for doing the hard work. I've bookmarked this, hoping it'll come in handy when new models are released. If you're taking feature requests, I have a few:

- Show combined measurements per model maker, e.g. all Claude models vs. OpenAI, DeepSeek, and so on.
- Another toggle to remove the neutral section?
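Both requests could be sketched roughly as below, in Python. The prefix-to-maker mapping and the counts are hypothetical, and matching on name prefixes is a simplification that would need a proper lookup table in practice.

```python
from collections import Counter

# Hypothetical mapping from model-name prefix to maker.
MAKERS = {"claude": "Anthropic", "gpt": "OpenAI", "deepseek": "DeepSeek"}


def maker_of(model: str) -> str:
    """Guess the maker from the start of the model name."""
    name = model.lower()
    for prefix, maker in MAKERS.items():
        if name.startswith(prefix):
            return maker
    return "Other"


def combine_by_maker(per_model: dict[str, dict[str, int]], drop_neutral: bool = False) -> dict[str, Counter]:
    """Sum sentiment counts per maker, optionally leaving out neutral mentions."""
    combined: dict[str, Counter] = {}
    for model, counts in per_model.items():
        bucket = combined.setdefault(maker_of(model), Counter())
        for sentiment, n in counts.items():
            if drop_neutral and sentiment == "neutral":
                continue
            bucket[sentiment] += n
    return combined


# Made-up counts, for illustration only.
sample = {
    "Claude Opus 4.7": {"positive": 5, "neutral": 3, "negative": 4},
    "Claude Opus 4.6": {"positive": 2, "neutral": 1, "negative": 1},
    "GPT-5.5": {"positive": 6, "neutral": 2, "negative": 1},
}
combined = combine_by_maker(sample, drop_neutral=True)
print({maker: dict(counts) for maker, counts in combined.items()})
```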

chillfox

5 hours ago

Surely "Claude Opus 4.7" and "Claude Opus Latest" should be the same, right?

brooksc

7 hours ago

It'd be interesting to also graph this over time to see how sentiment changes from when a model is released to today.

yunusabd

5 hours ago

Yes! Going forward I'm definitely doing that, once there is enough data. Might even backfill the data more into the past. I just want to stabilize the methodology before burning more tokens.

And it's probably a good idea to create a list of model release dates, so older comments can't accidentally map to models that weren't released yet.
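The release-date guard could look something like the sketch below, in Python. The model names and dates are placeholders, not real release dates, and keeping unknown models instead of dropping them is just one possible design choice.

```python
from datetime import date

# Placeholder release dates, for illustration only.
RELEASE_DATES = {
    "model_x": date(2026, 1, 15),
    "model_y": date(2026, 4, 1),
}


def is_plausible_mention(model: str, comment_date: date) -> bool:
    """Reject a mention if the comment predates the model's release."""
    released = RELEASE_DATES.get(model)
    # Models without a known release date are kept rather than silently dropped.
    return released is None or comment_date >= released


print(is_plausible_mention("model_x", date(2026, 2, 1)))
print(is_plausible_mention("model_y", date(2026, 2, 1)))
```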

gobdovan

6 hours ago

Before harnesses, I'd fix the methodology/claims. A saner methodology would be to look for comments that compare two models, say 'gpt5.5>opus4.7', and infer context ('ctx:frontend', for example). With your current methodology, 'opus 4.6 was very smart, opus4.7 is a disappointing upgrade to 4.6' would make normal aspect-based sentiment analysis conclude that 4.6 is smarter than 4.7. But considering you have <300 mentions total, you'd probably be better off scraping some other websites as well. I'd also take out the SotA claim completely and downgrade the mentions to measuring something like visibility rather than performance.
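One way to picture the pairwise idea is as a list of (winner, loser, context) records tallied into per-context win rates. The Python sketch below assumes the comparisons have already been extracted from comments. The record format and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    winner: str
    loser: str
    context: str  # e.g. "frontend"; "unknown" when it cannot be inferred


def win_rates(comparisons: list[Comparison]) -> dict[tuple[str, str], float]:
    """Share of comparisons won, keyed by (context, model)."""
    wins: dict[tuple[str, str], int] = defaultdict(int)
    games: dict[tuple[str, str], int] = defaultdict(int)
    for c in comparisons:
        wins[(c.context, c.winner)] += 1
        games[(c.context, c.winner)] += 1
        games[(c.context, c.loser)] += 1
    return {key: wins[key] / played for key, played in games.items()}


# Made-up comparisons, for illustration only.
sample = [
    Comparison("gpt5.5", "opus4.7", "frontend"),
    Comparison("gpt5.5", "opus4.7", "frontend"),
    Comparison("opus4.7", "gpt5.5", "frontend"),
]
rates = win_rates(sample)
print(rates)
```

With very few comparisons per pair, these rates are anecdotal, so a real analysis would also need to report the sample size behind each number.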

yunusabd

5 hours ago

That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.

The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.

I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.

For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.
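As a rough illustration of the scratch pad idea, the structure could be as simple as a per-model dictionary of notes that is updated on each run. This Python sketch is hypothetical: the function name, the keys, and the example note are assumptions, and the LLM call that would produce the notes is left out.

```python
import json

ScratchPad = dict[str, dict[str, list[str]]]


def add_note(pad: ScratchPad, model: str, kind: str, note: str) -> ScratchPad:
    """Record a strength or weakness for a model, skipping exact duplicates."""
    if kind not in ("strengths", "weaknesses"):
        raise ValueError("kind must be 'strengths' or 'weaknesses'")
    entry = pad.setdefault(model, {"strengths": [], "weaknesses": []})
    if note not in entry[kind]:
        entry[kind].append(note)
    return pad


pad: ScratchPad = {}
add_note(pad, "model_x", "strengths", "some users say it is good for writing tests")
add_note(pad, "model_x", "strengths", "some users say it is good for writing tests")
print(json.dumps(pad, indent=2))
```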

For the wording, I'd like to keep a certain amount of click bait, sorry ;)

skeptrune

1 hour ago

What a win it is for open source that qwen and kimi show up on this at all.

jesse_dot_id

3 hours ago

I suspect companies are deploying bots to shift sentiment around their products. I find metrics like this to be largely useless vs. actually just trying stuff out.

yakkomajuri

8 hours ago

"Prompts an LLM" -> which LLM?

I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)

But would be interesting to get more details overall

yunusabd

7 hours ago

It's actually ChatGPT at the moment for the first filtering step, for no other reason than having a code snippet ready that I could point Cursor at (I know, so 2025). The Gemini call is using batch processing, so it's handled differently.

julianlam

2 hours ago

Interesting that Gemma 4 didn't crack the top 10.

I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality — 15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on, on my 32GB unified RAM.
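For the "will it fit" question, a back-of-the-envelope estimate of weight memory is parameters times bits per weight divided by 8. The Python sketch below covers only the weights. KV cache, activations, and runtime overhead come on top and are not modeled, and the quantization levels shown are assumptions, since the comment does not say which one would be used.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed quantization levels for a 35B-parameter model.
for bits in (4, 8, 16):
    print(f"35B at {bits}-bit: ~{weight_memory_gb(35, bits):.1f} GB of weights")
```

On that estimate, a 35B model at 4-bit needs about 17.5 GB for weights, which leaves some headroom on 32GB, while 8-bit (about 35 GB) would not fit.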

However I think I may be in the minority of HN commenters exploring models for local inference.

pbgcp2026

7 hours ago

So, it's a webpage with 3 paragraphs and a simple chart. It has: 1) terrible color scheme – fine, I switch to reader mode 2) shitloads of JS - fine, NoScript works, page breaks 3) Fancy "design" with simple graph but unreadable X axis labels - fine, I can use screen zoom for that ... to see 3x "Claude O..." LOL are we playing guess-me-over game? 4) ... "LxxxLxxx - Learn languages with YouTube!"

Hari2028

5 hours ago

How noisy is the sentiment classification? Feels like that could skew results a lot

yunusabd

5 hours ago

From the comments that I've checked manually it's pretty good. You can go to the "User Ratings" tab in the Google Sheet and check some comments to get an idea.

tokkkie

2 hours ago

more users = more complaints. negativity just means popularity.

kimi...?

ranger_danger

8 hours ago

Just FYI, this article seems to define "state of the art" as "popular", as measured by "total mentions and user sentiment", without any bearing on the technical abilities or actual usage of the model.

yunusabd

7 hours ago

Calling it sota might be a bit provocative, but what actually is the "state of the art"? We have benchmarks, but those are getting increasingly gamed and don't necessarily reflect the actual performance of a model, see Opus 4.7. So I think it's useful to have real world data from actual users as an additional data point.

mellosouls

8 hours ago

That's pretty much exactly what the title says.

The technical abilities and usage are derived from the commenters' usage reflections.

Frannky

4 hours ago

I am looking for a good alternative to Claude Code + Opus that is not Codex. I tried switching back to Opus 4.6. The attitude of 4.7 is what is most problematic: it is difficult to make it check things before answering, and it assumes it knows better than me and reality. Plus all the latest shenanigans they did. Pretty disgusted that I am still using them.

Frannky

4 hours ago

I forgot to add the tendency to not own problems, take care of them, and solve them immediately, but instead to deflect and say "it shouldn't be done now", "it's not my responsibility", etc. Just terrible.