FilterHN

3 months ago

[-]

In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

stacktrace

3 months ago

[-]

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

Exactly! One important thing LLMs have made me realise deeply is "No information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected since in the end it's generating tokens based on its training and it's reasonable it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.

IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score, might end up experience so much better for so many users like me.

biofox

3 months ago

[-]

I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.

3 months ago

[-]

I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.

It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.

People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.

kznewman

3 months ago

[-]

Solid agree. Hallucination for me IS the LLM use case. What I am looking for are ideas that may or may not be true that I have not considered and then I go try to find out which I can use and why.

3 months ago

[-]

In essence it is a thing that is actually promoting your own brain… seems counter intuitive but that’s how I believe this technology should be used.

3 months ago

[-]

This technology (which I had a small part in inventing) was not based on intelligently navigating the information space, it’s fundamentally based on forecasting your own thoughts by weighting your pre-linguistic vectors and feeding them back to you. Attention layers in conjunction of roof later allowed that to be grouped in higher order and scan a wider beam space to reward higher complexity answers.

When trained on chatting (a reflection system on your own thoughts) it mostly just uses a false mental model to pretend to be a desperate intelligence.

Thus the term stochastic parrot (which for many us actually pretty useful)

3 months ago

[-]

Thanks for your input - great to hear from someone involved that this is the direction of travel.

I remain highly skeptical of this idea that it will replace anyone - the biggest danger I see is people falling for the illusion. That the thing is intrinsically smart when it’s not - it can be highly useful in the hands of disciplined people who know a particular area well and augment their productivity no doubt. Because the way we humans come up with ideas and so on is highly complex. Personally my ideas come out of nowhere and mostly are derived from intuition that can only be expressed in logical statements ex-post.

3 months ago

[-]

Is intuition really that different than LLM having little knowledge about something? It's just responding with the most likely sequence of tokens using the most adjacent information to the topic... just like your intuition.

3 months ago

[-]

With all due respect I’m not even going to give a proper response to this… intuition that yields great ideas is based on deep understanding. LLM’s exhibit no such thing.

These comparisons are becoming really annoying to read.

3 months ago

[-]

I think you need to first understand what the word intuition means, before writing such a condescending reply.

3 months ago

[-]

Meant to say prompting*

3 months ago

[-]

>A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

And is that that different than what we do under the scenes? Is there a difference between an actual fact vs some false information stored in our brain? Or both have the same representation in some kind of high‑dimensional statistical manifold in our brains, and we also "try to produce the most plausible continuation" using them?

There might be one major difference is at a different level: what we're fed (read, see, hear, etc) we also evaluate before storing. Does LLM training do that, beyond some kind of manually assigned crude "confidence tiers" applied to input material during training (e.g. trust Wikipedia more than Reddit threads)?

literatepeople

3 months ago

[-]

I would say it's very different to what we do. Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that". Even if a human absorbed every single bit of information a language model has, their brain probably could not store and process it all. Unless they were a liar, they'd tell you they don't know the answer either! So I personally reject the framing that it's just like how a human behaves, because most of the people I know don't lie when they lack information.

3 months ago

[-]

>Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that"

Don't know about that, bullshitting is a thing. Especially online, where everybody pretends to be an expert on everything, and many even believe it.

But even if so, is that because of some fundamental difference between how a human and an LLM store/encode/retrieve information, or more because it has been instilled into a human through negative reinforcement (other people calling them out, shame of correction, even punishment, etc) not to make things up?

AuryGlenz

3 months ago

[-]

I see you haven’t met my brother-in-law.

3 months ago

[-]

Hallucinations are a feature of reality that LLMs have inherited.

It’s amazing that experts like yourself who have a good grasp of the manifold MoE configuration don’t get that.

LLMs much like humans weight high dimensionality across the entire model then manifold then string together an attentive answer best weighted.

Just like your doctor occasionally giving you wrong advice too quickly so does this sometimes either get confused by lighting up too much of the manifold or having insufficient expertise.

jakewins

3 months ago

[-]

I asked Gemini the other day to research and summarise the pinout configuration for CANbus outputs on a list of hardware products, and to provide references for each. It came back with a table summarising pin outs for each of the eight products, and a URL reference for each.

Of the 8, 3 were wrong, and the references contained no information about pin outs whatsoever.

That kind of hallucination is, to me, entirely different than what a human researcher would ever do. They would say “for these three I couldn’t find pinouts” or perhaps misread a document and mix up pinouts from one model for another.. they wouldn’t make up pinouts and reference a document that had no such information in it.

Of course humans also imagine things, misremember etc, but what the LLMs are doing is something entirely different, is it not?

fspeech

3 months ago

[-]

Humans are also not rewarded for making pronouncements all the time. Experts actually have a reputation to maintain and are likely more reluctant to give opionions that they are not reasonably sure of. LLMs trained on typical written narratives found in books, articles etc can be forgiven to think that they should have an opionion on any and everything. Point being that while you may be able to tune it to behave some other way you may find the new behavior less helpful.

3 months ago

[-]

Newer models can run a search and summarize the pages. They're becoming just a faster way of doing research, but they're still not as good as humans.

acdha

3 months ago

[-]

> Hallucinations are a feature of reality that LLMs have inherited.

Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

As a simple example, I asked Google about houseplant biology recently. The answer was very confidently wrong telling me that spider plants have a particular metabolic pathway because it confused them with jade plants and the two are often mentioned together. Humans wouldn’t make this mistake because they’d either know the answer or say that they don’t. LLMs do that constantly because they lack understanding and metacognitive abilities.

3 months ago

[-]

>Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

No. A strange way to interpet their statement! Almost as if you ...hallucinated their intend!

They are arguing that humans also hallucinate: "LLMs much like humans" (...) "Just like your doctor occasionally giving you wrong advice too quickly".

As an aside, there was never a "pre-scientific era where there [was] no way to measure truth". Prior to the rise of modern science fields, there have still always been objective ways to judge truth in all kinds of domains.

acdha

3 months ago

[-]

Yes, that’s basically the point: what are termed hallucinations with LLMs are different than what we see in humans – even the confabulations which people with severe mental disorders exhibit tend to have some kind of underlying order or structure to them. People detect inconsistencies in their own behavior and that of others, which is why even that rushed doctor in the original comment won’t suggest something wildly off the way LLMs do routinely - they might make a mistake or have incomplete information but they will suggest things which fit a theory based on their reasoning and understanding, which yields errors at a lower rate and different class.

freejazz

3 months ago

[-]

> Hallucinations are a feature of reality that LLMs have inherited.

Really? When I search for cases on LexisNexis, it does not return made-up cases which do not actually exist.

3 months ago

[-]

When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.

Since your example comes from the legal field, you'll probably very well know that even well intentioned witnesses that don't actively try to lie, can still hallucinate all kinds of bullshit, and even be certain of it. Even for eye witnesses, you can ask 5 people and get several different incompatible descriptions of a scene or an attacker.

freejazz

3 months ago

[-]

>When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.

Context matters. This is the context LLMs are being commercially pushed to me in. Legal databases also inherit from reality as they consist entirely of things from the real world.

airstrike

3 months ago

[-]

It's not even a manifold https://arxiv.org/abs/2504.01002

wan23

3 months ago

[-]

A different way to look at it is language models do know things, but the contents of their own knowledge is not one of those things.

paulddraper

3 months ago

[-]

You have a subtle slight of hand.

You use the word “plausible” instead of “correct.”

3 months ago

[-]

That’s deliberate. “Correct” implies anchoring to a truth function the model doesn’t have. “Plausible” is what it’s actually optimising for, and the disconnect between the two is where most of the surprises (and pitfalls) show up.

As someone else put it well: what an LLM does is confabulate stories. Some of them just happen to be true.

paulddraper

3 months ago

[-]

It absolutely has a correctness function.

That’s like saying linear regression produces plausible results. Which is true but derogatory.

MyOutfitIsVague

3 months ago

[-]

Do you have a better word that describes "things that look correct without definitely being so"? I think "plausible" is the perfect word for that. It's not a sleight of hand to use a word that is exactly defined as the intention.

3 months ago

[-]

I mean... That is exactly how our memory works. So in a sense, the factually incorrect information coming from LLM is as reliable as someone telling you things from memory.

dgacmu

3 months ago

[-]

But not really? If you ask me a question about Thai grammar or how to build a jet turbine, I'm going to tell you that I don't have a clue. I have more of a meta-cognitive map of my own manifold of knowledge than an LLM does.

3 months ago

[-]

Try it out. Ask "Do you know who Emplabert Kloopermberg is?" and ChatGPT/Gemini literally responded with "I don't know".

You, on the other hand, truly have never encountered any information about Thai grammar or (surprisingly) hot to build a jet turbine. (I can explain in general terms how to build one from just watching Discovery channel)

The difference is that the models actually have some information on those topics.

drclau

3 months ago

[-]

How do you know the confidence scores are not hallucinated as well?

kiliankoe

3 months ago

[-]

They are, the model has no inherent knowledge about its confidence levels, it just adds plausible-sounding numbers. Obviously they _can_ be plausible, but trusting these is just another level up from trusting the original output.

I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.

TeMPOraL

3 months ago

[-]

> the model has no inherent knowledge about its confidence levels

Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was established already a year ago that LLMs tend to have identifiable internal confidence signal; the challenge around the time of DeepSeek-R1 release was to, through training, connect that signal to tool use activation, so it does a search if it "feels unsure".

losvedir

3 months ago

[-]

Wow, that's a really interesting paper. That's the kind of thing that makes me feel there's a lot more research to be done "around" LLMs and how they work, and that there's still a fair bit of improvement to be found.

fragmede

3 months ago

[-]

In science, before LLMs, there's this saying: all models are wrong, some are useful. We model, say, gravity as 9.8m/s² on Earth, knowing full well that it doesn't hold true across the universe, and we're able to build things on top of that foundation. Whether that foundation is made of bricks, or is made of sand, for LLMs, is for us to decide.

xhkkffbf

3 months ago

[-]

It doesn't hold true across the universe? I thought this was one of the more universal things like the speed of light.

procflora

3 months ago

[-]

G, the gravitational constant is (as far as we know) universal. I don't think this is what they meant, but the use of "across the universe" in the parent comment is confusing.

g, the net acceleration from gravity and the Earth's rotation is what is 9.8m/s² at the surface, on average. It varies slightly with location and altitude (less than 1% for anywhere on the surface IIRC), so "it's 9.8 everywhere" is the model that's wrong but good enough a lot of the time.

fragmede

3 months ago

[-]

It doesn't even hold true on Earth! Nevermind other planets being of different sizes making that number change, that equation doesn't account for the atmosphere and air resistance from that. If we drop a feather that isn't crumpled up, it'll float down gently at anything but 9.8m/s². In sports, air resistance of different balls is enough that how fast something drops is also not exactly 9.8m/s², which is why peak athlete skills often don't transfer between sports. So, as a model, when we ignore air resistance it's good enough, a lot of the time, but sometimes it's not a good model because we do need to care about air resistance.

hackeman300

3 months ago

[-]

Gravity isn't 9.8m/s/s across the universe. If you're at higher or lower elevations (or outside the Earth's gravitational pull entirely), the acceleration will be different.

Their point was the 9.8 model is good enough for most things on Earth, the model doesn't need to be perfect across the universe to be useful.

3 months ago

[-]

g(lower case) is literally gravitational force of Earth at surface level. It's universally true, as there's only one Earth in this universe.

G is the gravitational constant which is also universally true(erm... to the best of our knowledge), g is calculated using gravitational constant.

dfsegoat

3 months ago

[-]

they 100% are unless you provide a RUBRIC / basically make it ordinal.

"Return a score of 0.0 if ...., Return a score of 0.5 if .... , Return a score of 1.0 if ..."

ryoshu

3 months ago

[-]

LLMs fail at causal accuracy. It's a fundamental problem with how they work.

kromokromo

3 months ago

[-]

Asking an LLM to give itself a «confidence score» is like asking a teenager to grade his own exam. I LLMs doesn’t «feel» uncertainty and confidence like we do.

robocat

3 months ago

[-]

> wrong or misleading explanations

Exactly the same issue occurs with search.

Unfortunately not everybody knows to mistrust AI responses, or have the skills to double-check information.

darkwater

3 months ago

[-]

No, it's not the same. Search results send/show you one or more specific pages/websites. And each website has a different trust factor. Yes, plenty of people repeat things they "read on the Internet" as truths, but it's easy to debunk some of them just based on the site reputation. With AI responses, the reputation is shared with the good answers as well, because they do give good answers most of the time, but also hallucinate errors.

SebastianSosa1

3 months ago

[-]

Community notes on X seems to be one of the highest profile recent experiments trying to address this issue

dexterlagan

3 months ago

[-]

My attempt: https://www.cleverthinkingsoftware.com/truth-or-extinction/

darkwater

3 months ago

[-]

> Tools like SourceFinder must be paired with education — teaching people how to trace information themselves, to ask: Where did this come from? Who benefits if I believe it?

These are very important and relevant questions to ask oneself when you read about anything, but we also keep in mind that even those question can be misused and they can drive you to conspiracy theories.

incrudible

3 months ago

[-]

If somebody asks a question on Stackoverflow, it is unlikely that a human who does not know the answer will take time out of their day to completely fabricate a plausible sounding answer.

jaxn

3 months ago

[-]

People are confidently incorrect all the time. It is very likely that people will make up plausible sounding answers on StackOverflow.

You and I have both taken time out of our days to write plausible sounding answers that are essentially opposing hallucinations.

linen

3 months ago

[-]

Sites like stackoverflow are inherently peer-reviewed, though; they've got a crowdsourced voting system and comments that accumulate over time. People test the ideas in question.

This whole "people are just as incorrect as LLMs" is a poor argument, because it compares the single human and the single LLM response in a vacuum. When you put enough humans together on the internet you usually get a more meaningful result.

3 months ago

[-]

At least it used to be true.

3 months ago

[-]

Have you ever heard of Dunning Kruger effect?

There's a reason why there are upvotes, solution and third party edit system in StackOverflow - people will spend time to write their "hallucinations" very confidently.

lins1909

3 months ago

[-]

What is it about people making up lies to defend LLMs? In what world is it exactly the same as search? They're literally different things, since you get information from multiple sources and can do your own filtering.

actionfromafar

3 months ago

[-]

I wonder if the only way to fix this with current LLMs, would be to generate a lot synthetic data for a select number topics you really don't want it "go off the rails" with. That synthetic data would be lots of variations on that "I don't know how to do X with Y".

dolmen

3 months ago

[-]

I would not bet on synthetic data.

LLMs are very good at detecting patterns.

RHSman2

3 months ago

[-]

The problem is not the intelligence of the LLM. It is the intelligence and desire to make things easy of the intelligence using them.

XCSme

3 months ago

[-]

But most benchmarks are not about that...

Are there even any "hallucination" public benchmarks?

andrepd

3 months ago

[-]

"Benchmarks" for LLMs are a total hoax, since you can train them on the benchmarks themselves.

XCSme

3 months ago

[-]

I would assume a good benchmark has hidden tests, or something randomly generated that is harder to game

basisword

3 months ago

[-]

I think the thing even worse than false information is the almost-correct information. You do a quick Google to confirm it's on the right page but find there's an important misunderstanding. These are so much harder to spot I think than the blatantly false.

fauigerzigerk

3 months ago

[-]

I agree, but the question is how better grounding can be achieved without a major research breakthrough.

I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only handful of sources exist for some set of facts (e.g laws of small countries or descriptions of niche products).

LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.

Now people will say that this is unavoidable given the way in which transformers work. And this is true.

But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference time search/RAG) relative to to their pretraining.

3 months ago

[-]

Anything that is very specific has the same problem, because LLMs can’t have the same representation of all topics in the training. It doesn’t have to be too niche, just specific enough for it to start to fabricate it.

One of these days I had a doubt about something related to how pointers work in Swift and I tried discussing with ChatGPT (don’t remember exactly what, but it was purely intellectual curiosity). It gave me a lot of explanations that seemed correct, but being skeptical and started pushing it for ways to confirm what it was saying and eventually realized it was all bullshit.

This kind of thing makes me basically wary of using LLMs for anything that isn’t brainstorming, because anything that requires knowing information that isn’t easily/plentifully found online will likely be incorrect or have sprinkles of incorrect all over the explanations.

cachius

3 months ago

[-]

Grounding in search results is what Perplexity pioneered and Google also does with AI mode and ChatGPT and others with web search tool.

As a user I want it but as webadmin it kills dynamic pages and that's why Proof of work aka CPU time captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no just go and overrun the web. To the effect that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for few months of rapacious crawling. Literally Perplexity moved fast and broke things.

cachius

3 months ago

[-]

This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.

I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.

From Tavis Ormandy who wrote a C program to solve the Anubis challenges out of browser https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775

Guess a mix of Markov tarpits and llm meta instructions will be added, cf. Feed the bots https://news.ycombinator.com/item?id=45711094 and Nephentes https://news.ycombinator.com/item?id=42725147

BatteryMountain

3 months ago

[-]

My biggest problem with LLM's at this point is that they produce different and inconsistent results or behave differently, given the same prompt. The better grounding would be amazing at this point. I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday. Currently they misbehave multiple times a week and I have to manually steer it a bit which destroys certain automated workflows completely.

fragmede

3 months ago

[-]

It sounds like you have dug into this problem with some depth so I would love to hear more. When you've tried to automate things, I'm guessing you've got a template and then some data and then the same or similar input gives totally different results? What details about how different the results are can you share? Are you asking for eg JSON output and it totally isn't, or is it a more subtle difference perhaps?

conception

3 months ago

[-]

You need to change the temperature to 0 and tune your prompts for automated workflows.

3 months ago

[-]

It doesn’t really solve it as a slight shift in the prompt can have totally unpredictable results anyway. And if your prompt is always exactly the same, you’d just cache it and bypass the LLM anyway.

What would really be useful is a very similar prompt should always give a very very similar result.

jknightco

3 months ago

[-]

This doesn't work with the current architecture, because we have to introduce some element of stochastic noise into the generation or else they're not "creatively" generative.

Your brain doesn't have this problem because the noise is already present. You, as an actual thinking being, are able to override the noise and say "no, this is false." An LLM doesn't have that capability.

3 months ago

[-]

Well that’s because if you look at the structure of the brain there’s a lot more going on than what goes on within an LLM.

It’s the same reason why great ideas almost appear to come randomly - something is happening in the background. Underneath the skin.

3 months ago

[-]

That’s a way different problem my guy.

dominotw

3 months ago

[-]

have you tried this? this doesnt work because the way inference runs at big companies. its not just running your query in isolation.

maybe it can work if you are running your own inference.

sebastiennight

3 months ago

[-]

> I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday

Bad news, it's winter now in the Northern hemisphere, so expect all of our AIs to get slightly less performant as they emulate humans under-performing until Spring.

phorkyas82

3 months ago

[-]

Isn't that what no LLM can provide: being free of hallucinations?

arw0n

3 months ago

[-]

I think the better word is confabulation; fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. With language models getting large, they start creating internal world models, and some research shows they actually have truth dimensions. [0]

I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.

Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).

[0] https://arxiv.org/abs/2407.12831

Tepix

3 months ago

[-]

> false narratives based on wrong memory

I don't think "wrong memory" is accurate, it's missing information and doesn't know it or is trained not to admit it.

Checkout the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38

Here is the relevant quote by Trenton Bricken from the transcript:

One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.

But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.

BoredPositron

3 months ago

[-]

Architecture wise the "admit" part is impossible.

rbranson

3 months ago

[-]

Bricken isn’t just making this up. He’s one of the leading researchers in model interpretability. See: https://arxiv.org/abs/2411.14257

Tepix

3 months ago

[-]

Why do you think it's impossible? I just quoted him saying 'by default, it will actually say, "I don't know the answer to this question"'

We already see that - given the right prompting - we can get LLMs to say more often that they don't know things.

3 months ago

[-]

That's right - it does seem to have to do with trying to be helpful.

One demo of this that reliably works for me:

Write a draft of something and ask the LLM to find the errors.

Correct the errors, repeat.

It will never stop finding a list of errors!

The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.

thunky

3 months ago

[-]

> It will never stop finding a list of errors!

Not my experience. I find after a couple of rounds it tells me it's perfect.

officialchicken

3 months ago

[-]

No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:

"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"

kyletns

3 months ago

[-]

For the record, brains are also not free of hallucinations.

rimeice

3 months ago

[-]

I still don’t really get this argument/excuse for why it’s acceptable that LLMs hallucinate. These tools are meant to support us, but we end up with two parties who are, as you say, prone to “hallucination” and it becomes a situation of the blind leading the blind. Ideally in these scenarios there’s at least one party with a definitive or deterministic view so the other party (i.e. us) at least has some trust in the information they’re receiving and any decisions they make off the back of it.

TeMPOraL

3 months ago

[-]

For these types of problems (i.e. most problems in the real world), the "definitive or deterministic" isn't really possible. An unreliable party you can throw at the problem from a hundred thousand directions simultaneously and for cheap, is still useful.

Libidinalecon

3 months ago

[-]

"The airplane wing broke and fell off during flight"

"Well humans break their leg too!"

It is just a mindlessly stupid response and a giant category error.

The way an airplane wing and a human limb is not at all the same category.

There is even another layer to this that comparing LLMs to the brain might be wrong because the mereological fallacy is attributing the brain "thinks" vs the person/system as a whole thinks.

johnisgood

3 months ago

[-]

You are right that the wing/leg comparison is often lazy rhetoric: we hold engineered systems to different failure standards for good reason.

But you are misusing the mereological fallacy. It does not dismiss LLM/brain comparisons: it actually strengthens them. If the brain does not "think" (the person does), then LLMs do not "think" either. Both are subsystems in larger systems. That is not a category error; it is a structural similarity.

This does not excuse LLM limitations - rimeice's concern about two unreliable parties is valid. But dismissing comparisons as "category errors" without examining which properties are being compared is just as lazy as the wing/leg response.

3 months ago

[-]

Have you ever employed anyone?

People, when tasked with a job, often get it right. I've been blessed by working with many great people who really do an amazing job of generally succeeding to get things right -- or at least, right-enough.

But in any line of work: Sometimes people fuck it up. Sometimes, they forget important steps. Sometimes, they're sure they did it one way when instead they did it some other way and fix it themselves. Sometimes, they even say they did the job and did it as-prescribed and actually believe themselves, when they've done neither -- and they're perplexed when they're shown this. They "hallucinate" and do dumb things for reasons that aren't real.

And sometimes, they just make shit up and lie. They know they're lying and they lie anyway, doubling-down over and over again.

Sometimes they even go all spastic and deliberately throw monkey wrenches into the works, just because they feel something that makes them think that this kind of willfully-destructive action benefits them.

All employees suck some of the time. They each have their own issues. And all employees are expensive to hire, and expensive to fire, and expensive to keep going. But some of their outputs are useful, so we employ people anyway. (And we're human; even the very best of us are going to make mistakes.)

LLMs are not so different in this way, as a general construct. They can get things right. They can also make shit up. They can skip steps. The can lie, and double-down on those lies. They hallucinate.

LLMs suck. All of them. They all fucking suck. They aren't even good at sucking, and they persist at doing it anyway.

(But some of their outputs are useful, and LLMs generally cost a lot less to make use of than people do, so here we are.)

vitorfblima

3 months ago

[-]

I don’t get the comparison. It would be like saying it’s okay if an excel formula gives me different outcomes everytime with the same arguments, sometimes right, but mostly wrong.

3 months ago

[-]

People can accomplish useful things, but sometimes make mistakes and do shit wrong.

The bot can also accomplish useful things, and sometimes make mistakes and do shit wrong.

(These two statements are more similar in their truthiness than they are different.)

3 months ago

[-]

As far as I can tell (as someone who worked on the early foundation of this tech at Google for 10 years) making up “shit” then using your force of will to make it true is a huge part of the construction of reality with intelligence.

Will to reality through forecasting possible worlds is one of our two primary functions.

andrei_says_

3 months ago

[-]

How much do you hallucinate at work? How many of your work hallucinations do you confidently present as reality in communication or code?

LLMs are being sold as viable replacement of paid employees.

If they were not, they wouldn’t be funded the way they are.

delaminator

3 months ago

[-]

That’s not a very useful observation though is it?

The purpose of mechanisation is to standardise and over the long term reduce errors to zero.

Otoh “The final truth is there is no truth”

michaelscott

3 months ago

[-]

A lot of mechanisation, especially in the modern world, is not deterministic and is not always 100% right; it's a fundamental "physics at scale" issue, not something new to LLMs. I think what happened when they first appeared was that people immediately clung to a superintelligence-type AI idea of what LLMs were supposed to do, then realised that's not what they are, then kept going and swung all the way over to "these things aren't good at anything really" or "if they only fix this ONE issue I have with them, they'll actually be useful"

delaminator

3 months ago

[-]

That's why I said tend to zero error. I'm a Six Sigma guy. We take accurate over precise.

krzyk

3 months ago

[-]

Hallucinations are not bad. It adds some kind of creativity, which is good for e.g. image generation, coding, or story telling.

It is bad only in case of reporting on facts.

3 months ago

[-]

Yes, they'll probably not go away, but it's got to be possible to handle them better.

Gemini (the app) has a "mitigation" feature where it tries to to Google searches to support its statements. That doesn't currently work properly in my experience.

It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.

intended

3 months ago

[-]

Doubt it. I suspect it’s fundamentally not possible in the spirit you intend it.

Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.

3 months ago

[-]

Take a look at the new experimental AI mode in Google scholar, it's going in the right direction.

It might be true that a fundamental solution to this issue is not possible without a major breakthrough, but I'm sure you can get pretty far with better tooling that surfaces relevant sources, and that would make a huge difference.

intended

3 months ago

[-]

So lets run it through the rubric test -

What’s your level of expertise in this domain or subject? How did you use it? What were your results?

It’s basically gauging expertise vs usage to pin down the variance that seems endemic to LLM utility anecdotes/examples. For code examples I also ask which language was used, the submitters familiarity with the language, their seniority/experience and familiarity with the domain.

3 months ago

[-]

A lot of words to call me stupid ;) You seem to have put me in some convenient mental box of yours, I don't know which one.

intended

3 months ago

[-]

Oh heck no! Definitely no!

I am genuinely asking, because I think one of the biggest determinants of utility obtained from LLMs is the operator.

Damn, I didn’t consider that it could be read that way. I am sorry for how it came across.

SecretDreams

3 months ago

[-]

Find me a human that doesn't occasionally talk out of their ass =[

3 months ago

[-]

A part of it is reproducing incorrect information in the training data as well.

One area that I've found to be a great example of this is sports science.

Depending on how you ask, you can get a response lifted from scientific literature, or the bro science one, even in the course of the same discussion.

It makes sense, both have answers to similar questions and are very commonly repeated online.

sebastiennight

3 months ago

[-]

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things,

Due to how LLMs are implemented, you are always most likely to get a bogus explanation if you ask for an answer first, and why second.

A useful mental model is: imagine if I presented you with a potential new recruit's complete data (resume, job history, recordings of the job interview, everything) but you only had 1 second to tell me "hired: YES OR NO"

And then, AFTER you answered that, I gave you 50 pages worth of space to tell me why your decision is right. You can't go back on that decision, so all you can do is justify it however you can.

Do you see how this would give radically different outcomes vs. giving you the 50-page scratchpad first to think things through, and then only giving me a YES/NO answer?

jillesvangurp

3 months ago

[-]

It's increasingly a space that is constrained by the tools and integrations. Models provide a lot of raw capability. But with the right tools even the simpler, less capable models become useful.

Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.

Some people are very unhappy that the easier parts of their job are being automated and they are worried that they get completely automated away completely. That's only true if you exclusively do boring, repetitive, low value work. Then yes, your job is at risk. If your work is a mix of that and some higher value, non repetitive, and more fun stuff to work on, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a spare few hours.

Ironically, a lot of people currently get the worst of both worlds because they now find themselves baby sitting AIs doing a lot more of the boring repetitive stuff than they would be able to do without that to the point where that is actually all that they do. It's still boring and repetitive. And it should be automated away ultimately. Arguably many years ago actually. The reason so many react projects feel like Ground Hog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.

Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.

giancarlostoro

3 months ago

[-]

Yeah in my case I want the coding models to be less stupid, I asked for multiple file uploading, it kept the original button and it added a second one for additional files, when I pointed that out “You're absolutely correct!” Well why didnt you think of it before you cranked out code, I see coding agents as really capable Junior devs its really funny. I dont mind it though, saved me hours on my side project if not weeks worth of work.

withinboredom

3 months ago

[-]

I was using an LLM to summarize benchmarks for me, and I realized after awhile it was omitting information that made the algorithm being benchmarked look bad. I'm glad I caught it early, before I went to my peers and was like "look at this amazing algorithm".

coffeecat

3 months ago

[-]

It's important not to assume that LLMs are giving you an impartial perspective on any given topic. The perspective you're most likely getting is that of whoever created the most training data related to that topic.

andai

3 months ago

[-]

So there's two levels to this problem.

Retrieval.

And then hallucination even in the face of perfect context.

Both are currently unsolved.

(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)

cachius

3 months ago

[-]

Re: retrieval: That's where the snake eats its tail as AI slop floods the web, grounding is like laying a foundation in a swamp. And that Rube Goldberg machine tries to prevent the snake from reaching its tail. But RGs are brittle and not exactly the thing you want to build infrstructure on. Just look at https://news.ycombinator.com/item?id=46239752 for an example how easy it can break.

3 months ago

[-]

There are four words that would make the output of any LLM instantly 1000x more useful and I haven't seen them yet: "I do not know.".

f_k

3 months ago

[-]

> verifying their claims ends up taking time.

I've been working on this problem with https://citellm.com, specifically for PDFs.

Instead of relying on the LLM answer alone, each extracted field links to its source in the original document (page number + highlighted snippet + confidence score).

Checking any claim becomes simple: click and see the exact source.

rafaelmn

3 months ago

[-]

I constantly see top models (opus 4.5, gemini 3) get a stroke mid task - they will solve the problem correctly in one place, or have a correct solution that needs to be reapplied in context - and then completely miss the mark in another place. "Lack of intelligence" is very much a limiting factor. Gemini especially will get into random reasoning loops - reading thinking traces - it gets unhinged pretty fast.

Not to mention it's super easy to gaslight these models, just asserting something wrong with vaguely plausible explanation and you get no pushback or reasoning validation.

So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.

virtuosarmo

3 months ago

[-]

I've had better success finding information using Google Gemini vs. ChatGPT. I.e. someone mentions to me the name of someone or some company, but doesn't give the full details (i.e. Joe @ XYZ Company doing this, or this company with 10,000 people, in ABC industry)...sometimes i don't remember the full name. Gemini has been more effective for me in filling in the gaps and doing fuzzy search. I even asked ChatGPT why this was the case, and it affirmed my experience, saying that Gemini is better for these queries because of Search integration, Knowledge Graph, etc. Especially useful for recent role changes, which haven't been propagated through other channels on a widespread basis.

HeavyStorm

3 months ago

[-]

All of them are heavily invested in improving grounding. The money isn't on personal use but enterprise customers and for those, grounding is essential.

anentropic

3 months ago

[-]

Yeah I basically always use "web search" option in ChatGPT for this reason, if not using one of the more advanced modes.

3 months ago

[-]

I'm pretty much in the same camp. For a lot of everyday use, raw "intelligence" already feels good enough

0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...

breakingcups

3 months ago

[-]

Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?

3 months ago

[-]

Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.

BoppreH

3 months ago

[-]

That would be a laudable goal, but I feel like it's contradicted by the text:

> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component

I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.

Edit: credit where credit is due, the recently-added disclaimer is nice:

> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

hnuser123456

3 months ago

[-]

Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIE slot is the interior side of the DB-9 connector. RAM slots and PCIE slots are not even visible in the image.

hexaga

3 months ago

[-]

It just overlaid a typical ATX pattern across the motherboard-like parts of the image, even if that's not really what the image is showing. I don't think it's worthwhile to consider this a 'local recognition failure', as if it just happened to mistake CMOS for RAM slots.

Imagine it as a markdown response:

# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)

1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX

2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard

3. ... etc more stuff that is supported only by force of preconception

It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.

Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.

furyofantares

3 months ago

[-]

They also changed "roughly match" to "sometimes match".

MichaelZuo

3 months ago

[-]

Did they really change a meaningful word like that after publication without an edit note…?

dwohnitmok

3 months ago

[-]

This has definitely happened before with e.g. the o1 release. I will sometimes use the Wayback Machine to verify changes that have been made.

MichaelZuo

3 months ago

[-]

Wow sounds pretty shady then.

piker

3 months ago

[-]

Eh, I'm no shill but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a manner that makes the statements more accurate without the same expectations of being objective journalism of record.

mkesper

3 months ago

[-]

Yes, but they should clearly mark updates. That would be professional.

guerrilla

3 months ago

[-]

Leave it to OpenAI to be dishonest about being dishonest. It seems they're also editing this post without notice as well.

Grimblewald

3 months ago

[-]

Look, just give the Qwen3-vl models a go. I've found them to be fantastic as this kind of thing so far, and what I'm seeing on display here, is laughable in comparison. Close source / closed weight paid model with worse performance than open? common. OpenAI really is a bubble.

arscan

3 months ago

[-]

I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. Might be worth mentioning this is better but still inaccurate. Just a bit of feedback, I appreciate you are willing to show non-cherry-picked examples and are engaging with this question here.

Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”

3 months ago

[-]

Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.

g947o

3 months ago

[-]

When I saw that it labeled DP ports as HDMI I immediately decided that I am not going to touch this until it is at least 5x better with 95% accuracy with basic things.

I don't see any advantage in using the tool.

3 months ago

[-]

That's a far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors because it will have achieved a high enough trust level that it will actually get used.

Think 'Therac-25', it worked in 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.

AdamN

3 months ago

[-]

There was a low-level Google internal service that worked so well that other teams took a hard dependency on it (against advice). So the internal team added a cron job to drop it every once in a while to get people to trust it less :-)

layer8

3 months ago

[-]

You know what would be great? If it had added some boxes with “might be X or Y, but not sure”.

iwontberude

3 months ago

[-]

But it’s completely wrong.

johnwheeler

3 months ago

[-]

Oh and you guys don't mislead people ever. Your management is just completely trustworthy, and I'm sure all you guys are too. Give me a break, man. If I were you, I would jump ship or you're going to be like a Theranos employee on LinkedIn.

yard2010

3 months ago

[-]

Hey no need to personally attack anyone. A bad organization can still consist good people.

johnwheeler

3 months ago

[-]

I disagree. I think the whole organization is egregious and full of Sam Altman sycophants that are causing a real and serious harm to our society. Should we not personally attack the Nazis either? These people are literally pushing for a society where you're at a complete disadvantage. And they're betting on it. They're banking on it.

iamdanieljohns

3 months ago

[-]

Is Adaptive Reasoning gone from GPT-5.2? It was a big part of the release of 5.1 and Codex-Max. Really felt like the future.

3 months ago

[-]

Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.

iamdanieljohns

3 months ago

[-]

Why have "light" or "low" thinking then? I've mentioned this before in other places, but there should only be "none," "standard," "extended," and maybe "heavy."

Extended and heavy are about raising the floor (~25% and ~45% or some other ratio respectively) not determining the ceiling.

az226

3 months ago

[-]

And here is Gemini 3: https://media.licdn.com/dms/image/v2/D5610AQH7v9MtrZxxug/ima...

saejox

3 months ago

[-]

This is very impressive. Google really is ahead

pietz

3 months ago

[-]

They are definitely ahead in multi modality and I'd argue they have been for a long time. Their image understanding was already great, when their core LLM was still terrible.

FinnKuhn

3 months ago

[-]

This is genuinly impressive. The OpenAI equivalent is less detailed AND less correct.

Lionga

3 months ago

[-]

When OpenAI Marketing Material is actually showing how far Gemini3 is ahead...

8organicbits

3 months ago

[-]

Promotional content for LLMs is really poor. I was looking at Claude Code and the example on their homepage implements a feature, ignoring a warning about a security issue, commits locally, does not open a PR and then tries to close the GitHub issue. Whatever code it wrote they clearly didn't use as the issue from the prompt is still open. Bizarre examples.

timerol

3 months ago

[-]

Also a "stacked pair" of USB type-A ports, when there are clearly 4

fumeux_fume

3 months ago

[-]

General purpose LLMs aren't very good with generating bounding boxes, so with that context, this is actually seen as decent performance for certain use cases.

dolmen

3 months ago

[-]

Not that bad compared to product images seen on AliExpress.

jasonlotito

3 months ago

[-]

FTA: Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

You can find it right next to the image you are talking about.

3 months ago

[-]

To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.

LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.

One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.

da_grift_shift

3 months ago

[-]

'Commented after article was already edited in response to HN feedback' award

whalesalad

3 months ago

[-]

to be fair that image has the resolution of a flip phone from 2003

malfist

3 months ago

[-]

If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer, you say you don't know.

I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.

3 months ago

[-]

No-one should have the expectation LLMs are giving correct answers 100% of the time. It's inherent to the tech for them to be confidently wrong

Code needs to be checked

References need to be checked

Any facts or claims need to be checked

malfist

3 months ago

[-]

According to the benchmarks here they're claiming up to 97% accuracy. That ought to be good enough to trust them right?

Or maybe these benchmarks are all wrong

JimDabell

3 months ago

[-]

Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.

refactor_master

3 months ago

[-]

Gemini routinely makes up stuff about BigQuery’s workings. “It’s poorly documented”. Well, read the open source code, reason it out.

Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability, and all downtime during lunch break?

TeMPOraL

3 months ago

[-]

I.e. like most restaurants and food delivery? :). Though 3% problem rate is optimistic.

3 months ago

[-]

Does code work if it's 97% correct?

It's not okay if claims are totally made up 1/30 times

Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them

fooker

3 months ago

[-]

> Does code work if it's 97% correct?

Of course it does. The vast majority of software has bugs. Yes, even critical one like compilers and operating systems.

mbesto

3 months ago

[-]

> Or maybe these benchmarks are all wrong

You must be new to LLM benchmarks.

dolmen

3 months ago

[-]

"confidently" is a feature selected in the system prompt.

As a user you can influence that behavior.

malfist

3 months ago

[-]

No it isn't. It isn't intelligent, it's a statistical engine. Telling it to be confident or less confident doesn't make it apply confidence appropriately. It's all a facade

ben_w

3 months ago

[-]

That shouldn't be what causes this problems; if we can see it's wrong despite the low resolution, the AI isn't going to fully replace humans for all tasks involving this kind of thing.

That said, even with this kind of error rate an AI can speed *some* things up, because having a human whose sole job is to ask "is this AI correct?" is easier and cheaper than having one human for "do all these things by hand" followed by someone else whose sole job is to check "was this human output correct?" because a human who has been on a production line for 4 hours and is about ready for a break also makes a certain number of mistakes.

But at the same time, why use a really expensive general-purpose AI like this, instead of a dedicated image model for your domain? Special purpose AI are something you can train on a decent laptop, and once trained will run on a phone at perhaps 10fps give or take what the performance threshold is and how general you need it to be.

If you're in a factory and you're making a lot of some small widget or other (so, not a whole motherboard), having answers faster than the ping time to the LLM may be important all by itself.

And at this point, you can just ask the LLM to write the training setup for the image-to-bounding-box AI, and then you "just" need to feed in the example images.

redox99

3 months ago

[-]

It's trivial for a human that knows what a pc looks like. Maybe mistaking displayport for hdmi.

an0malous

3 months ago

[-]

Because the whole culture of AI enthusiasts is to just generate slop and never check the results

tennisflyi

3 months ago

[-]

You seen the charts on their last release? They obviously don’t check - too rich

goobatrooba

3 months ago

[-]

I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "projects" with associated files (hello Google, please wake up to basic user-friendly organisation!)

But all of them * Lie far too often with confidence * Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to basic request to respond in a specific language) * Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?) * Refuse to give me short answers without fluff or follow up questions * Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers * Don't quote sources consistently so I can check facts, even when I ask for it * Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors * ...

I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!

razster

3 months ago

[-]

The latest of the big three... OpenAI, Claude, and Google, none of their models are good. I've spent too much time monitoring them than just enjoying them. I've found it easier to run my own local LLM. The latest Gemini release, I gave it another go but only for it to misspell words and drift off into a fantasy world after a few chats with help restructuring guides. ChatGPT has become lazy for some reason and changes things I told it to ignore, randomly too. Claude was doing great until the latest release, then it started getting lazy after 20+k tokens. I tried making sure to keep a guide to refresh it if it started forgetting, but that didn't help.

Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.

marcosscriven

3 months ago

[-]

What are your best local models, and what hardware do you run them on?

3 months ago

[-]

I have this impression that LLMs are so complicated and entangled (in comparison to previous machine learning models) that they’re just too difficult to tune all around.

What I mean is, it seems they try to tune them to a few certain things, that will make them worse on a thousand other things they’re not paying attention to.

striking

3 months ago

[-]

What's to stop you from using the APIs the way you'd like?

joshribakoff

3 months ago

[-]

The API is a way to access a model, he is criticizing the model not the access the method (at least until the last sentence where he incorrectly implied you can only script a local model, but I don’t think thats a silver bullet, in my experience that is even more challenging than starting with a working agent)

fleischhauf

3 months ago

[-]

I'm always impressed how fast people get used to new things. couple of years ago something like chatgpt was completely impossible, and now people complain it something's does mit do what you told it to and sometimes lies. (not saying your points are not valid or you should not raise them) Some of the points are just not fixable at this point due to tech limitations. A language model currently simply has no way to give an estimate of its confidence. Also there is no way to completely do away with hallucinations (lies). there need to be some more fundamental improvements for this to work reliably.

davebren

3 months ago

[-]

Your point would stand if the entire economy wasn't shifted around this product and employees weren't being told to use it or lose their jobs.

https://ai.google.dev/gemini-api/docs/structured-output

matusp

3 months ago

[-]

Consider using structured output. You can define a JSON with specific fields, and LLMs are only used to fill in the values.

ifwinterco

3 months ago

[-]

I'm not an expert but my understanding is transformers based models simply can't do some of those things, it isn't really how they work.

Especially something like expressing a certainty %, you might be able to get it to output one but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output

carsoon

3 months ago

[-]

Yeah I have seen multiple people use this certainty % thing but its terrible. A percentage is something calculated mathemtatically and these models cannot do that.

Potentially they could figure it out if they looks into a comparison of next token probabilites, but this is not exposed in any modern model and especially not fed back into the chat/output.

Instead people should just ask it to explain BOTH sides of an argument or explain why something is BOTH correct and incorrect. This way you see how it can halluciate either way and get to make up your own mind about the correct outcome.

nullbound

3 months ago

[-]

<< I feel there is a point when all these benchmarks are meaningless.

I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.

delifue

3 months ago

[-]

Once a metric becomes optimization target, it ceases to become good metric.

hnfong

3 months ago

[-]

There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard ( https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard ). Main issue with it these days are that it kinda measures sycophancy and user preferred tone more than substance.

Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).

carsoon

3 months ago

[-]

I have a kinda strange chatgpt personalization prompt but it's been working well for me. The focus is me to get the model to analyze 2 sides and the extremes on both ends so it explains both and lets me decide. This is much better than asking it to make up accuracy percentages.

I think we align on what we want out of models:

""" Don't add useless babelling before the chats, just give the information direct and explain the info.

DO NOT USE ENGAGEMENT BAITING QUESTIONS AT THE END OF EVERY RESPONSE OR I WILL USE GROK FROM NOW ON FOREVER AND CANCEL MY GPT SUBSCRIPTION PERMANENTLY ONLY. GIVE USEFUL FACTUAL INFORMATION AND FOLLOW UPS which are grounded in first principles thinking and logic. Do not take a side and look at think about the extreme on both ends of a point before taking a side. Do not take a side just because the user has chosen that but provide infomration on both extremes. Respond with raw facts and do not add opinions.

Do not use random emojis. Prefer proper marks for lists etc. """

Those spelling/grammar errors are actually there and I don't want to change it as its working well for me.

dontlikeyoueith

3 months ago

[-]

> Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?)

They're literally incapable of this. Any number they give you is bullshit.

agentifysh

3 months ago

[-]

Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take:

- It is faster which is appreciated but not as fast as Opus 4.5

- I see no changes, very little noticeable improvements over 5.1

- I do not see any value in exchange for +40% in token costs

All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3 even when its used from AI Studio offers close to ChatGPT Pro performance for free. Anthropic's Claude Code $100/month is tough to beat. I am using Codex with the $40 credits but there's been a silent increase in token costs and usage limitations.

3 months ago

[-]

Did you notice much improvement going from Gemini 2.5 to 3? I didn't

I just think they're all struggling to provide real world improvements

chillfox

3 months ago

[-]

Gemini 3 Pro is the first model from Google that I have found usable, and it's very good. It has replaced Claude for me in some cases, but Claude is still my goto for use in coding agents.

(I only access these models via API)

neuah

3 months ago

[-]

Using it in a specialized subfield of neuroscience, Gemini 3 w/ thinking is a huge leap forward in terms of knowledge and intelligence (with minimal hallucinations). I take it that the majority of people on here are software engineers. If you're evaluating it on writing boilerplate code, you probably have to squint to see differences between the (excellent) raw model performances. whereas in more niche edge cases there is more daylight between them.

dominotw

3 months ago

[-]

what specalized usecases did you use it on and what were the outcomes.

can you share your experience and data for "leap forward" ?

dcre

3 months ago

[-]

Nearly everyone else (and every measure) seems to have found 3 a big improvement over 2.5.

agentifysh

3 months ago

[-]

oh yes im noticing significant improvements across the board but mainly having 1,000,000 token context makes a ton of difference, I can keep digging at a problem with out compaction.

cmrdporcupine

3 months ago

[-]

I think what they're actually struggling with is costs. And I think they're all behind the scenes quantizing models to manage load here and there, and they're all giving inconsistent results.

I noticed huge improvement from Sonnet 4.5 to Opus 4.5 when it became unthrottled a couple weeks ago. I wasn't going to sign back up with Anthropic but I did. But two weeks in it's already starting to seem to be inconsistent. And when I go back to Sonnet it feels like they did something to lobotomize it.

Meanwhile I can fire up DeepSeek 3.2 or GLM 4.6 for a fraction of the cost and get almost as good as results.

XCSme

3 months ago

[-]

Maybe they are just more consistent, which is a bit hard to notice immediately.

dudeinhawaii

3 months ago

[-]

I noticed a quite noticeable improvement to the point where I made it my go-to model for questions. Coding-wise, not so much. As an intelligent model, writing up designs, investigations, general exploration/research tasks, it's top notch.

free652

3 months ago

[-]

yes, 2.5 just couldnt use tools right. 3.0 is way better at coding. better than sonnet 4.5/

enraged_camel

3 months ago

[-]

Gemini 3 was a massive improvement over 2.5, yes.

hmottestad

3 months ago

[-]

I’m curious about if the model has gotten more consistent throughout the full context window? It’s something that OpenAI touted in the release, and I’m curious if it will make a difference for long running tasks or big code reviews.

agentifysh

3 months ago

[-]

one positive is that 5.2 is very good at finding bugs but not sure about throughputs I'd imagine it might be improved but haven't seen a real task to benchmark it on.

what I am curious about is 5.2-codex but many of us complained about 5.1-codex (it seemed to get tunnel visioned) and I have been using vanilla 5.1

its just getting very tiring to deal with 5 different permutations of 3 completely separate models but perhaps this is the intent and will keep you on a chase.

3 months ago

[-]

The speed bump is nice, but speed alone isn't a compelling upgrade if the qualitative difference isn't obvious in day-to-day use

fellowniusmonk

3 months ago

[-]

5.2 is performing worse in technical reading comprehension for information and logic dense puzzles. It's way more confidently wrong and stubborn about understanding definitions of words.

zone411

3 months ago

[-]

I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.

Donald

3 months ago

[-]

Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive

capitainenemo

3 months ago

[-]

And performs very well on the latest 100 puzzles too, so isn't just learning the data set (unless I guess they routinely index this repo).

I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

wooger

3 months ago

[-]

> unless I guess they routinely index this repo

This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.

rsanek

3 months ago

[-]

I mean, the repo has <200 stars, it's not like it's so mainstream that you'd expect LLM makers to be watching it actively. If they wanted to game it, they could more easily do that in RL with synthetic data anyway.

capitainenemo

3 months ago

[-]

Belated update on this. Gemini reasoning did much better than quick on bracket city today (an easy puzzle but still). It only failed to solve one clue outright, got another wrong but due to ambiguity in the expression referenced and in a way that still fit the next level down making the final answer fairly cleanly solved. Still clearly has a harder time with it than the connections puzzle.

bigyabai

3 months ago

[-]

GPT-5.2 might be Google's best Gemini advertisement yet.

outside1234

3 months ago

[-]

Especially when you see the price

tikotus

3 months ago

[-]

Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT 5 Pro was the winner already before in that test.

thanhhaimai

3 months ago

[-]

This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?

dezgeg

3 months ago

[-]

I've also tried Gemini 3 for Clues by Sam and it can do really well, have not seen it make a single mistake even for Hard and Tricky ones. Haven't run it on too many puzzles though.

crapple8430

3 months ago

[-]

GPT 5 Pro is a good 10x more expensive so it's an apples to oranges comparison.

fellowniusmonk

3 months ago

[-]

I think they are overfitting more, I'm seeing it perform worse on esoteric logic puzzles.

Bombthecat

3 months ago

[-]

I would like to see a cost per percent or so row. I feel like grok would beat them all

scrollop

3 months ago

[-]

Why no grok 4.1 reasoning?

sanex

3 months ago

[-]

Do people other than Elon fans use grok? Honest question. I've never tried it.

buu700

3 months ago

[-]

I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:

* Research and planning

* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)

* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding

I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.

As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)

For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.

mac-attack

3 months ago

[-]

I can't understand why people would trust a CEO that regularly lies about product timelines, product features, his own personal life, etc. And that's before politicizing his entire kingdom by literally becoming a part of government and one of the larger donations of the current administration.

delaminator

3 months ago

[-]

You’re not narrowing it down.

lkjdsklf

3 months ago

[-]

If we stopped using products of every company that had a CEO that lied about their products, we’d all be sitting in caves staring at the dirt

fatata123

3 months ago

[-]

Because not everyone makes their decisions through the prism of politics

sz4kerto

3 months ago

[-]

I'm using Gemini in general, but Grok too. That's because sometimes Gemini Thinking is too slow, but Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.

ralusek

3 months ago

[-]

Only thing I use grok for is if there is a current event/meme that I keep seeing referenced and I don't understand, it's good at pulling from tweets

wdroz

3 months ago

[-]

Unlike openai, you can use the latest grok models without verifying your organization and giving your ID.

jbm

3 months ago

[-]

I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.

rsanek

3 months ago

[-]

it's the biggest model on OpenRouter, even if you exclude free tier usage https://openrouter.ai/state-of-ai

irthomasthomas

3 months ago

[-]

Roleplay is the largest use-case on openrouter.

bumling

3 months ago

[-]

I dislike Musk, and use Grok. I find it most useful for analyzing text to help check if there's anything I've missed in my own reading. Having it built in to Twitter is convenient and it has a generous free tier.

scrollop

3 months ago

[-]

I hate the guy, however grok scores high on arc-2 so it would be silly to not at least rank it.

simonw

3 months ago

[-]

Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...

alechewitt

3 months ago

[-]

Nice work on these benchmarks Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high quality content you share for free. It’s become my primary source for keeping up to date.

I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).

In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).

simonw

3 months ago

[-]

That is a very good benchmark. Interesting to see GPT-5.2 delivering on the promise of better vision support there.

Stevvo

3 months ago

[-]

The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that, about half of them you could say were perfect.

golly_ned

3 months ago

[-]

Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.

getnormality

3 months ago

[-]

Well, the variance is itself interesting.

BeetleB

3 months ago

[-]

They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - have yet to try 5.2 out...).

https://gally.net/temp/20251107pelican-alternatives/index.ht...

tkgally

3 months ago

[-]

I added GPT-5.2 Pro to my pelican-alternatives benchmark for the first three prompts:

Generate an SVG of an octopus operating a pipe organ

Generate an SVG of a giraffe assembling a grandfather clock

Generate an SVG of a starfish driving a bulldozer

GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.

smusamashah

3 months ago

[-]

Hi, it doesn't have Gemini 3.5 Pro which seems to be the best at this

svantana

3 months ago

[-]

That's probably because "Gemini 3.5 Pro" doesn't exist

philipgross

3 months ago

[-]

That gallery is an excellent advertisement for Gemini 3.0 Pro.

3 months ago

[-]

Seems to be getting more aerodynamic. A clear sign of AI intelligence

fxwin

3 months ago

[-]

the only benchmark i trust

belter

3 months ago

[-]

What happens if you ask for a pterodactyl on a motorbike?

Would like to know how much they are optimizing for your pelican....

simonkagedal

3 months ago

[-]

He commented on this here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

irthomasthomas

3 months ago

[-]

I was expecting to see a pterodactyl :(

3 months ago

[-]

Is that the first SVG pelican with drop shadows?

simonw

3 months ago

[-]

No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)

tootie

3 months ago

[-]

Do you think the big guys are on to your game and have been adding extra pelicans to the training data?

sroussey

3 months ago

[-]

What is good at SVG design?

https://clocks.brianmoore.com/

culi

3 months ago

[-]

Not svg, but basically the same challenge:

Probably Kimi or Deepseek are best

azinman2

3 months ago

[-]

Graphic designers?

KellyCriterion

3 months ago

[-]

Ive not seen any model being good in graphic/svg creation so far - all of the stuff mostly looks ugly and somewhat "synthetic-disorted".

And lately, Claude (web) started to draw ascii charts from one day to another indstead of colorful infographicstyled-images as it did before (they were only slightly better than the ascii charts)

tmaly

3 months ago

[-]

seems to be eating something

danans

3 months ago

[-]

Probably a jellyfish. You're seeing the tentacles

nightshift1

3 months ago

[-]

benchmarks probably should not be used for so long.

3 months ago

[-]

Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.

Congrats OpenAI team. Huge day for you folks!!

Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.

Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.

Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of this fighting it. Isane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability in its thought process. And on and on.

Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.

And then today GPT-5.2 with an xhigh mode. I feel like xmass has come early. Right as I'm doing a huge Rust/CUDA/Math-heavy refactor. THANK YOU!!

ubutler

3 months ago

[-]

> Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.

The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.

I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.

Gemini 3 Pro, by contrast, can genuinely handle long contexts.

andybak

3 months ago

[-]

Why would you put that whole python file in the context at all? Doesn't Codex work like Claude Code in this regard and use tools to find the correct parts of a larger file to read into context?

lopuhin

3 months ago

[-]

Context window size of 400k is not new, gpt-5, 5.1, 5-mini, etc. have the same. But they do claim they improved long context performance which if true would be great.

energy123

3 months ago

[-]

But 400k was never usable in ChatGPT Plus/Pro subscriptions. It was nerfed down to 60-100k. If you submitted too long of a prompt they deleted the tokens on the end of your prompt before calling the model. Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.

Can someone with an active sub check whether we can submit a full 400k prompt (or at least 200k) and there is no prompt truncatation in the backend? I don't mean attaching a file which uses RAG.

https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...

piskov

3 months ago

[-]

Context windows for web

Fast (GPT‑5.2 Instant) Free: 16K Plus / Business: 32K Pro / Enterprise: 128K

Thinking (GPT‑5.2 Thinking) All paid tiers: 196K

energy123

3 months ago

[-]

But can you do that in one message or is that a best case scenario in a long multi turn chat?

dr_dshiv

3 months ago

[-]

That’s… too bad

3 months ago

[-]

> Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.

I can believe that, but it also seems really silly? If your max context window is X and the chat has approached that, instead of outright deleting the first messages outright, why not have your model summarise the first quarter of tokens and place those at the beginning of the log you feed as context? Since the chat history is (mostly) immutable, this only adds a minimal overhead: you can cache the summarisation, and don't have to do that over and over again for each new message. (If partially summarised log gets too long, you summarise again.)

Since I can come up with this technique in half a minute of thinking about the problem, and the OpenAI folks are presumably not stupid, I wonder what downside I'm missing.

Aeolun

3 months ago

[-]

Don’t think you are missing anything. I do this with the API, and it works great. I’m not sure why they don’t do it, but I can only guess it’s because it completely breaks the context caching. If you summarize the full buffer at least you know you are down to a few thousand tokens to cache again, instead of 100k tokens to cache again.

3 months ago

[-]

> [...] but I can only guess it’s because it completely breaks the context caching.

Yes, but you only re-do this every once in a while? It's a constant factor overhead. If you essentially feed the last few thousand tokens, you have no caching at all (and you are big enough that this window of 'last few thousand tokens' doesn't get you the whole conversation)?

gunalx

3 months ago

[-]

API use was not merged in this way.

freedomben

3 months ago

[-]

I haven't done a ton of testing due to cost, but so far I've actually gotten worse results with xhigh than high with gpt-5.1-codex-max. Made me wonder if it was somehow a PEBKAC error. Have you done much comparison between high and xhigh?

dudeinhawaii

3 months ago

[-]

This is one of those areas where I think it's about the complexity of the task. What I mean is, if you set codex to xhigh by default, you're wasting compute. IF you're setting it at xhigh when troubleshooting a complex memory bug or something, you're presumably more likely to get a quality response.

I think in general, medium ends up being the best all-purpose setting while high+ are good for single task deep-drive. Or at least that has been my experience so far. You can theoretically let with work longer on a harder task as well.

A lot appears to depend on the problem and problem domain unfortunately.

I've used max in problem sets as diverse as "troubleshooting Cyberpunk mods" and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting available data (finding all available logs, digging into lua files), and narrowing a bug that every other model failed to get.

I guess in some sense you have to know from the onset that it's a "hard problem". That in and of itself is subjective.

3 months ago

[-]

You should also be making handoffs to/from Pro

robotswantdata

3 months ago

[-]

For a few weeks the Codex model has been cursed. Recommend sticking with 5.1 high , 5.2 feels good too but early days

tekacs

3 months ago

[-]

I found the same with Max xhigh. To the point that I switched back to just 5.1 High from 5.1 Codex Max. Maybe I should’ve tried Max high first.

lhl

3 months ago

[-]

Anecdotally, I will say that for my toughest jobs GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc, it's able to test, reason deeply and find bugs. It can't make UI code for it's life however.

Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...

Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.

tgtweak

3 months ago

[-]

have been on 1M context window with claude since 4.0 - it gets pretty expensive when you run 1M context on a long running project (mostly using it in cline for coding). I think they've realized more context length = more $ when dealing with most agentic coding workflows on api.

3 months ago

[-]

You should be doing everything you can to keep context under 200k, ideally even 100k. All the models unwind so badly as context grows.

patates

3 months ago

[-]

I don't have that experience with gemini. Up to 90% full, it's just fine.

tgtweak

3 months ago

[-]

If the models are designed around it, and not resorting to compression to get to higher input token lengths, they don't 'fall off' as they get near the context window limit. When working with large codebases, exhausting or compressing the context actually causes more issues since the agent forgets what was in the other libraries and files. Google has realized this internally and were among the first to get to 2M token context length (internally then later released publicly).

3 months ago

[-]

This is one of those updates where the value only really shows up if you're already deep in the weeds

nathants

3 months ago

[-]

Usable input limit has not changed, and remains 400 - 128 = 272. Confirmed by looking for any changes in codex cli source, nope.

Suppafly

3 months ago

[-]

>Can I just say !!!!!!!! Hell yeah!

...

>THANK YOU!!

Man you're way too excited.

twisterius

3 months ago

[-]

[flagged]

3 months ago

[-]

My name is Mark Maunder. Not the fisheries expert. The other one when you google me. I’m 51 and as skeptical as you when it comes to tech. I’m the CTO of a well known cybersecurity company and merely a user of AI.

Since you critiqued my post, allow me to reciprocate: I sense the same deflector shields in you as many others here. I’d suggest embracing these products with a sense of optimism until proven otherwise and I’ve found that path leads to some amazing discoveries and moments where you realize how important and exciting this tech really is. Try out math that is too hard for you or programming languages that are labor intensive or languages that you don’t know. As the GitHub CEO said: this technology lets you increase your ambition.

bgwalter

3 months ago

[-]

I have tried the models and in domains I know well they are pathetic. They remove all nuance, make errors that non-experts do not notice and generally produce horrible code.

It is even worse in non-programming domains, where they chop up 100 websites and serve you incorrect bland slop.

If you are using them as a search helper, that sometimes works, though 2010 Google produced better results.

Oracle dropped 11% today due to over-investment in OpenAI. Non-programmers are acutely aware of what is going on.

muppetman

3 months ago

[-]

Exactly this. It's like reading the news! It seems perfectly fine until a news article in a domain you have intimate knowledge of, and then you realise how bad/hacked together the news is. AI feels just like that. But AI can improve, so I'm in the middle with my optimism.

jfreds

3 months ago

[-]

> they remove all nuance

Said in a sweeping generalization with zero sense of irony :D

jrflowers

3 months ago

[-]

This is a good point. It is a sweeping generalization if you do not read the sentence that comes before that quote

re-thc

3 months ago

[-]

> Oracle dropped 11% today due to over-investment in OpenAI

Not even remotely true. Oracle is building out infrastructure mostly for AI workloads. It dropped because it couldn’t explain its financing and if the investment was worth it. OpenAI or not wouldn’t have mattered.

what-the-grump

3 months ago

[-]

You pretend that humans don’t produce slop?

I can recognize the short comings of AI code but it can produce a mock or a full blown class before I can find a place to save the file it produced.

Pretending that we are all busy writing novelty and genius is silly, 99% are writing for CRUD tasks and basic business flows, the code isn’t going to be perfect it doesn’t need to be but it will get the job done.

All the logical gotchas of the work flows that you’d be refactoring for hours are done in minutes.

Use pro with search… are it going to read 200 pages of documentation in 7 minutes come up with a conclusion and validate it or invalidate it in another 5? No you still trying accept the cookie prompt on your 6th result.

You might as well join the flat earth society if you still think that AI can’t help you complete day to day tasks.

GolfPopper

3 months ago

[-]

Replace 'products' with 'message', 'tech' with 'religion' and 'CEO' with 'prophet' and you have a bog-standard cult recruitment pitch.

Aeolun

3 months ago

[-]

Because most recruitment pitches are the same regardless of the subject.

bluefirebrand

3 months ago

[-]

[flagged]

3 months ago

[-]

Maybe you are holding it wrong?

Contemporary LLMs still have huge limitations and downsides. Just like hammer or a saw has limitations. But millions of people are getting good value out of them already (both LLMs and hammers and saws). I find it hard to believe that they are all deluded.

skydhash

3 months ago

[-]

What limitations does an hammer have if the job is hammering? Or a saw with sawing? Even `ed` doesn't have any issue with editing text files.

3 months ago

[-]

Well, ask the people who invented better hammers or better saws. Or better text editors than ed.

nbardy

3 months ago

[-]

Those arc agi 2 improvements are insane.

Thats especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.

This whole releases is insanely optimistic for me. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model. Thats incredibly encouraging for what comes next.

Remember the next big data center are 20-30x the chip count and 6-8x the efficiency on the new chip.

I expect they can saturate the benchmarks WITHOUT and novel research and algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.

delifue

3 months ago

[-]

It's also possible that OpenAI use many human-generated similar-to-ARC data to train (semi-cheating). OpenAI has enough incentive to fake high score.

Without fully disclosing training data you will never be sure whether good performance comes from memorization or "semi-memorization".

3 months ago

[-]

> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

This is simply the "openness vs directive-following" spectrum, which as a side-effect results in the sycophancy spectrum, which still none of them have found an answer to.

Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.

3 months ago

[-]

Opus 4.5 answers most of my non-question comments with ‘you’re right.’ as the first thing in the output. At least I’m not absolutely right, I’ll take this as an improvement.

3 months ago

[-]

Hah, maybe 5th gen Claude will change to "you may be right".

The positive thing is that it seems to be more performative than anything. Claude models will say "you're [absolutely] right" and then immediately do something that contradicts it (because you weren't right).

Gemini 3 Pro seems to have struck a decent balance between stubbornness and you're-right-ness, though I still need to test it more.

fellowniusmonk

3 months ago

[-]

5.2 seems worse on overfitting for esoteric logic puzzles in my testing. Tests using precise language where attention has to be paid to use the correct definition among many for a given word. It charges ahead with wrong definitions in a far lower accuracy and worse way now.

3 months ago

[-]

Same. Also got my attention re ARC-AGI-2. That's meaningful. And a HUGE leap.

[0] https://neoneye.github.io/arc/?dataset=ARC-AGI-2

cbracketdash

3 months ago

[-]

Slight tangent yet I think is quite interesting... you can try out the ARC-AGI 2 tasks by hand at this website [0] (along with other similar problem sets). Really puts into perspective the type of thinking AI is learning!

3 months ago

[-]

I suppose this is as good a place as any to mention this. I've now met two different devs who complained about the weird responses from their LLM of choice, and it turned out they were using a single session for everything. From recipes for the night, presents for the wife and then into programming issues the next day.

Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.

I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!

holtkam2

3 months ago

[-]

I know I sound like a snob but I’ve had many moments with Gen AI tools over the years that made me wonder: I wonder what these tools are like for someone who doesn’t know how LLMs work under the hood? It’s probably completely bizarre? Apps like Cursor or ChatGPT would be incomprehensible to me as a user, I feel.

3 months ago

[-]

Using my parents as a reference, they just thought it was neat when I showed them GPT-4 years ago. My jaw was on the floor for weeks, but most regular folks I showed had a pretty "oh thats kinda neat" response.

Technology is already so insane and advanced that most people just take it as magic inside boxes, so nothing is surprising anymore. It's all equally incomprehensible already.

jacobedawson

3 months ago

[-]

This mirrors my experience, the non-technical people in my life either shrugged and said 'oh yeah that's cool' or started pointing out gnarly edge cases where it didn't work perfectly. Meanwhile as a techie my mind was (and still is) spinning with the shock and joy of using natural human language to converse with a super-humanly adept machine.

throw310822

3 months ago

[-]

I don't think the divide is between technical and non-technical people. HN is full of people that are weirdly, obstinately dismissive of LLMs (stochastic parrots, glorified autocompletes, AI slop, etc.). Personal anecdote: my father (85yo, humanistic culture) was astounded by the perfectly spot-on analysis Claude provided of a poetic text he had written. He was doubly astounded when, showing Claude's analysis to a close friend, he reacted with complete indifference as if it were normal for computers to competently discuss poetry.

khafra

3 months ago

[-]

LLMs are an especially tough case, because the field of AI had to spend sixty years telling people that real AI was nothing like what you saw in the comics and movies; and now we have real AI that presents pretty much exactly like what you used to see in the comics and movies.

xwolfi

3 months ago

[-]

But it cannot think or mean anything, it's just a clever parrot so it's a bit weird. I guess uncanny is the word. I use it as google now, like just to search stuff that are hard to express with keywords.

adventured

3 months ago

[-]

99% of humans are mimics, they contribute essentially zero original thought across 75 years. Mimicry is more often an ideal optimization of nature (of which an LLM is part) rather than a flaw. Most of what you'll ever want an LLM to do is to be a highly effective parrot, not an original thinker. Origination as a process is extraordinarily expensive and wasteful (see: entrepreneurial failure rates).

How often do you need original thought from an LLM versus parrot thought? The extreme majority of all use cases globally will only ever need a parrot.

robocat

3 months ago

[-]

> clever parrot

Is it irony that you duckspeak this term? Are you a stochastically clever monkey to avoid using the standard cliche?

The thing I find most educating about AI is that it unfortunately mimics the standard of thinking of many humans...

LEDThereBeLight

3 months ago

[-]

Try asking it a question you know has never been asked before. Is it parroting?

Agentlien

3 months ago

[-]

My parents reacted in just the same way and the lackluster response really took me by surprise.

3 months ago

[-]

Most non tech people I talked with don't care at all about LLMs.

They also are not impressed at all ("Okay, that's like google and internet").

lostmsu

3 months ago

[-]

Old people? I think it would be hard to find a lot of people under 20 who don't use ChatGPT daily. At least among ones that are still studying.

3 months ago

[-]

People older than 25 or 30 maybe.

It would be funny that in the end, the most use is made by student cheating at uni.

3 months ago

[-]

I wanted to reflect a bit on this.

I have hard time to imagine why non-tech people would find a use for LLMs, let's say nothing in your life forces you to produce information (be it textual, pictural or anything that can be related to information). Let's say your needs are focused on spending good times with friends or your family, eating nice dishes (home cooked or restaurant), spending your money on furnitures, rents, clothes, tools and etc.

Why would you need an AI that produce information in an information-bloated world ?

You probably met someone that "fell in love with woodworking" or idk, after having watched youtube videos (that person probably built a chair, a table or something akin). I don't think stuff like "Hi, I have these materials, what can I do with it" produce more interesting results than just nerding on the internet or in a library looking for references (on japaneese handcrafted furnitures, vintage ikea designs, old school woodworking, ...). (Or maybe the LLM will be able to give you a list of good reads, which is nice but somewhat of a limited and basic use).

Agentic AI and more efficient/intelligent AIs are not very interesting for people like <wood lover> and are at best a proxy for otherly findable information. Of course, not everyone is like <wood lover>, the majority of people don't even need to invest time in a "creative" hobby and instead they will watch movies, invest time in sport, invest time in sociability, go to museums, read books; you could imagine having AIs that write books, invent films, invent artworks, talk with you, but I am pretty sure that there is something more than just "watch a movie" or "read a book" when performing these activities; as someone who likes reading or watching movies, what I enjoy is following the evolutions of the authors of the pieces, understanding their posture toward its ancestors, its era-mates, toward its own previous visions and whatnot. I enjoy to find a movie "weird" "goofy" "sublime" and whatnot, because I enjoy a small amount of parasociality with the authors and am finally brought to say things like "Ahah, Lynch was such a weirdo when he shot Blue Velvet" (okay, maybe not that type of bully judgement, but you may be understanding what I mean).

I think I would find it uninspiring to read an AI written book, because I couldn't live this small parasocial experience. Maybe you could get me with music, but I still think there's a lot of activity in loving a song. I love Bach, but am pretty sure also I like Bach the character (from what I speculate from the songs I listen). I imagine that guy in front of his keyboard, having the chance to live a -weird- moment of extasy when he produces the best lines of the chaconne (if he was living in our times he would relisten to what he produced again and again and nodding to himself "man, that's sick").

What could I experience from an LLM ? "Here is the perfect novel I wrote specifically for you based on your tastes:". There would be no imaginary Bach that I would like to drink a beer with, no testimony of a human reaching the state of mind in which you produce an absolute (in fact highly relative, but you need to lie to yourself) "hit".

All of this is highly personnal, but I would be curious to know what others think.

lostmsu

3 months ago

[-]

This is a weird take. Basically no one is just a wood lover. In fact, basically no one is an expert or even decently knowledgeable in more than 0-2 areas. But life has hundreds of things everyone must participate in. Where does you wood lover shop? How does he find his movies? File taxes? Gets travel ideas? And even a wood lover after watching 100500th niche video on woodworking on YouTube might have some questions. AI is the new, much better Google.

Re: books. Your imagination falters here too. I love sci-fi. I use voice AIs ( even made one: https://apps.apple.com/app/apple-store/id6737482921?pt=12710... ). A couple of times when I was on a walk I had an idea for a weird sci-fi setting, and I would ask AI to generate a story in that setting, and listen to it. It's interesting because you don't know what will actually happen to the characters and what the resolution would be. So it's fun to explore a few takes on it.

3 months ago

[-]

> Your imagination falters here too.

I think I just don't find what you described as interesting as you find. I tried AI dungeoning also, but I find it less interesting than with people, because I think I like people more than specific mechanisms of sociality. Also, in a sense, my brain is capable of producing suprising things and when I am writing a story as a hobby, I don't know what will actually happen to the characters and what the resolution would be, and it's very very exciting !

> no one is an expert or even decently knowledgeable in more than 0-2 areas

I might be biased and I don't want to show off, but there are some of these people around here, let's say it's rare that people are decently knowledgeable in more than 5 areas.

I am okay with what you said :

- AI is a better google

But also google became shit, and as far as I can remember, it was somewhat of an incredible tool before. If AI became what is the old google for those people, then wouldn't you say, if you were them, that it's not very impressive and somewhat "like google".

edit; all judgements I made about "not interesting" do not mean "not impressive"

edit2: I think eventually AI will be capable of writing a book akin to Egan's Diaspora, and I would love to reflect on what I said at this time

lostmsu

3 months ago

[-]

What you described re books are preferences. I don't think majority of people care about authors at all. So it might not work for you, but that's not a valid argument why it won't work for most. Therefore your reasoning about that is flawed.

It also seems pretty obvious (did u not think majority don't care about authors? I doubt it). So it stands that some bias made you overlook that fact (as well as OpenAI MAUs and other such glaring data) when you were writing your statement above. If I were you I'd look hard into what that bias might be, cause it could affect other less directly related areas.

3 months ago

[-]

Yeah I think a lot of us are taking knowing how LLMs work for granted. I did the fast.ai course a while back and then went off and played with VLLM and various LLMs optimizing execution, tweaking params etc. Then moved on and started being a user. But knowing how they work has been a game changer for my team and I. And context window is so obvious, but if you don't know what it is you're going to think AI sucks. Which now has me wondering: Is this why everyone thinks AI sucks? Maybe Simon Willison should write about this. Simon?

3 months ago

[-]

> Is this why everyone thinks AI sucks?

Who's everyone? There are many, many people who think AI is great.

In reality, our contemporary AIs are (still) tools with glaring limitations. Some people overlook the limitations, or don't see them, and really hype them up. I guess the people who then take the hype at face value are those that think that AI sucks? I mean, they really do honestly suck in comparison to the hypest of hypes.

3 months ago

[-]

> I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!

It's worse: Gemini (and ChatGPT, but to a lesser extent) have started suggesting random follow-up topics when they conclude that a chat in a session has exhausted a topic. Well, when I say random, I mean that they seem to be pulling it from the 'memory' of our other chats.

For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.

ghostpepper

3 months ago

[-]

For ChatGPT you can turn this memory off in settings and delete the ones it's already created.

3 months ago

[-]

I'm not complaining about the memory at all. I was complaining about the suggestion to continue with unrelated topics.

noname120

3 months ago

[-]

Problem is that by default ChatGPT has the “Reference chat history” option enabled in the Memory options. This causes any previous conversation to leak into the current one. Just creating a new conversation is not enough, you also need to disable that option.

0xdeafbeef

3 months ago

[-]

Only your questions are in it though

noname120

3 months ago

[-]

Are you sure? What makes you think so?

https://www.shloked.com/writing/chatgpt-memory-bitter-lesson

0xdeafbeef

3 months ago

[-]

Mb something 've changed since post

redhed

3 months ago

[-]

This is also the default in Gemini pretty sure, at least I remember turning it off. Make's no sense to me why this is the default.

gordonhart

3 months ago

[-]

> Makes no sense to me why this is the default.

You’re probably pretty far from the average user, who thinks “AI is so dumb” because it doesn’t remember what you told it yesterday.

redhed

3 months ago

[-]

I was thinking more people would be annoyed by it bringing up unrelated conversations, thinking more I'd say you're probably right that more people are expecting it to remember everything they say.

tiahura

3 months ago

[-]

It’s not that it brings it up in unrelated conversations, it’s that it nudges related conversations in unwanted directions.

3 months ago

[-]

Mostly because they built the feature and so that implicitly means they think it's cool.

I recommend turning it off because it makes the models way more sycophantic and can drive them (or you) insane.

3 months ago

[-]

That seems like a terrible default. Unless they have a weighting system for different parts of context?

3 months ago

[-]

They do (or at least they have something that behaves like weighting).

wickedsight

3 months ago

[-]

This is why I love that ChatGPT added branching. Sometimes I end up going some random direction in a thread about some code and then I can go back and start a new branch from the part where the chat was still somewhat clean.

Also works really well when some of my questions may not have been worded correctly and ChatGPT has gone in a direction I don't want it to go. Branch, word my question better and get a better answer.

vintermann

3 months ago

[-]

It's not at all obvious where to drop the context, though. Maybe it helps to have similar tasks in the context, maybe not. It did really, shockingly well on a historical HTR task I gave it, so I gave it another one, in some ways an easier one... Thought it wouldn't hurt to have text in a similar style in the context. But then it suddenly did very poorly.

Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.

dcre

3 months ago

[-]

The models you interact with through the API (as opposed to chat UIs) are held stable and let you specify reasoning effort, so if you use a client that takes API keys, you might be able to solve both of those problems.

3 months ago

[-]

> Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.

That's what websites have been doing for ages. Just like you can't step twice in the same river, you can't use the same version of Google Search twice, and never could.

chasd00

3 months ago

[-]

I was listening to a podcast about people becoming obsessed and "in love" with an LLM like ChatGPT. Spouses were interviewed describing how mentally damaging it is to their partner and how their marriage/relationship is seriously at risk because of it. I couldn't believe no one has told these people to just goto the LLM and reset the context, that reverts the LLM back to a complete stranger. Granted that would be pretty devastating to the person in "the relationship" with the LLM since it wouldn't know them at all after that.

jncfhnb

3 months ago

[-]

It’s the majestic, corrupting glory of having a loyal cadre of empowering yes men normally only available to the rich and powerful, now available to the normies.

adamesque

3 months ago

[-]

that's not quite what parent was talking about, which is — don't just use one giant long conversation. resetting "memories" is a totally different thing (which still might be valuable to do occasionally, if they still let you)

3 months ago

[-]

Actually, it's kind of the same. LLMs don't have a "new memory" system. They're like the guy from Memento. Context memory and long term from the training data. Can't make new memories from the context though.

(Not addressed to parent comment, but the inevitable others: Yes, this is an analogy, I don't need to hear another halfwit lecture on how LLMs don't really think or have memories. Thank you.)

dragonwriter

3 months ago

[-]

Context memory arguably is new memory, but because we abused the metaphor of “learning” rather than something more like shaping inborn instinct for trained model weights, we have no fitting metaphor what happens during the “lifetime” of the interaction with a model via its context window as formation of skills/memories.

SubiculumCode

3 months ago

[-]

I constantly switch out, even when it's on the same topic. It starts forming its own 'beliefs and assumptions', gets myopic. I also make use of the big three services in turn to attack ideas from multiple directions

nrds

3 months ago

[-]

> beliefs and assumptions

Unfortunately during coding I have found many LLMs like to encode their beliefs and assumptions into comments; and even when they don't, they're unavoidably feeding them into the code. Then future sessions pick up on these.

SubiculumCode

3 months ago

[-]

YES! I've tried to provide instructions asking it to not leave comments at all.

ramoz

3 months ago

[-]

Send them this https://backnotprop.substack.com/p/50-first-dates-with-mr-me...

blindhippo

3 months ago

[-]

Thing is, context management is NOT obvious to most users of these tools. I use agentic coding tools on a daily basis now and still struggle with keeping context focused and useful, usually relying on patterns such as memory banks and task tracking documents to try to keep a log of things as I pop in and out of different agent contexts. Yet still, one false move and I've blown the window leading to a "compression" which is utterly useless.

The tools need to figure out how to manage context for us. This isn't something we have to deal with when working with other humans - we reliably trust that other humans (for the most part) retain what they are told. Agentic use now is like training a team mate to do one thing, then taking it out back to shoot it in the head before starting to train another one. It's inefficient and taxing on the user.

[1] https://renormalize.substack.com/p/on-renormalization

getnormality

3 months ago

[-]

In my recent explorations [1] I noticed it got really stuck on the first thing I said in the chat, obsessively returning to it as a lens through which every new message had to be interpreted. Starting new sessions was very useful to get a fresh perspective. Like a human, an AI that works on a writing piece with you is too close to the work to see any flaw.

okthrowman283

3 months ago

[-]

Interesting I’ve noticed the same behavior with Gemini 3.0 but not with Claude, and Gemini 2.5 did not have this behavior. I wonder what tuning is optimising for here.

ljlolel

3 months ago

[-]

Probably because the chat name is named after that first message

faxmeyourcode

3 months ago

[-]

My boss (great engineer) had been complaining about this with his internal github copilot quality no matter the model or task. Turns out he never cleared the context. It was just the same conversation spread thin across nearly a dozen completely separate repositories because they were all in his massive vscode workspace at once.

This was earlier this year... So I started giving internal presentations on basic context management, best practices, etc after that for our engineering team.

layman51

3 months ago

[-]

That is interesting. I already knew about that idea that you’re not supposed to let the conversation drag on too much because its problem solving performance might take a big hit, but then it kind of makes me think that over time, people got away with still using a single conversation for many different topics because of the big context windows.

Now I kind of wonder if I’m missing out by not continuing the conversation too much, or by not trying to use memory features.

plaidfuji

3 months ago

[-]

It is annoying though, when you start a new chat for each topic you tend to have to re-write context a lot. I use Gemini 3, which I understand doesn’t have as good of a memory system as OpenAI. Even on single-file programming stuff, after a few rounds of iteration I tend to get to its context limit (the thinking model). Either because the answers degrade or it just throws the “oops something went wrong” error. Ok, time to restart from scratch and paste in the latest iteration.

I don’t understand how agentic IDEs handle this either. Or maybe it’s easier - it just resends the entire codebase every time. But where to cut the chat history? It feels to me like every time you re-prompt a convo, it should first tell itself to summarize the existing context as bullets as its internal prompt rather than re-sending the entire context.

int_19h

3 months ago

[-]

Agentic IDEs/extensions usually continue the conversation until the context gets close to 80% full, then do the compacting. With both Codex and Claude Code you can actually observe that happening.

That said I find that in practice, Codex performance degrades significantly long before it comes to the point of automated compaction - and AFAIK there's no way to trigger it manually. Claude, on the other hand, has a command for to force compacting, but at the same time I rarely use it because it's so good at managing it by itself.

As far as multiple conversations, you can tell the model to update AGENTS.md (or CLAUDE.md or whatever is in their context by default) with things it needs to remember.

3 months ago

[-]

Codex has `/compact`

TechDebtDevin

3 months ago

[-]

How are these devs employed or trusted with anything..

jumploops

3 months ago

[-]

> “a new knowledge cutoff of August 2025”

This (and the price increase) points to a new pretrained model under-the-hood.

GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.

3 months ago

[-]

A new pretrain would definitely get more than a .1 version bump & would get a whole lot more hype I'd think. They're expensive to do!

caconym_

3 months ago

[-]

Releasing anything as "GPT-6" which doesn't provide a generational leap in performance would be a PR nightmare for them, especially after the underwhelming release of GPT-5.

I don't think it really matters what's under the hood. People expect model "versions" to be indexed on performance.

3 months ago

[-]

Not necessarily. GPT-4.5 was a new pretrain on top of a sizeable raw model scale bump, and only got 0.5 - because the gains from reasoning training in o-series overshadowed GPT-4.5's natural advantage over GPT-4.

OpenAI might have learned not to overhype. They already shipped GPT-5 - which was only an incremental upgrade over o3, and was received poorly, with this being a part of the reason why.

diego_sandoval

3 months ago

[-]

I jumped straight from 4o (free user) into GPT-5 (paid user).

It was a generational leap if there ever has been one. Much bigger than 3.5 to 4.

3 months ago

[-]

Yes, if OpenAI released GPT-5 after GPT-4o, then it would have been seen as a proper generational leap.

But o3 existing and being good at what it does? Took the wind out of GPT-5's sails.

kadushka

3 months ago

[-]

What kind of improvements do you expect when going from 5 straight to 6?

hannesfur

3 months ago

[-]

Maybe they felt the increase in capability is not worth of a bigger version bump. Additionally pre-training isn't as important as it used to be. Most of the advances we see now probably come from the RL stage.

femiagbabiaka

3 months ago

[-]

Not if they didn't feel that it delivered customer value no? It's about under promising and over delivering, in every instance

jumploops

3 months ago

[-]

It’s possible they’re using some new architecture to get more up-to-date data, but I think that’d be even more of a headline.

My hunch is that this is the same 5.1 post-training on a new pretrained base.

Likely rushed out the door faster than they initially expected/planned.

OrangeMusic

3 months ago

[-]

Yeah because OpenAI has been great at naming their models so far? ;)

boc

3 months ago

[-]

Maybe the rumors about failed training runs weren't wrong...

redwood

3 months ago

[-]

Not if it underwhelms

redox99

3 months ago

[-]

I think it's more likely to be the old base model checkpoint further trained on additional data.

jumploops

3 months ago

[-]

Is that technically not a new pretrained model?

(Also not sure how that would work, but maybe I’ve missed a paper or two!)

redox99

3 months ago

[-]

I'd say for it to be called a new pretrained model, it'd need to be trained from scratch (like llama 1, 2, 3).

But it's just semantics.

98Windows

3 months ago

[-]

or maybe 5.1 was an older checkpoint and has more quantization

MagicMoonlight

3 months ago

[-]

No, they just feed in another round of slop to the same model.

https://openai.com/index/introducing-gpt-5-2/

xd1936

3 months ago

[-]

> While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

3 months ago

[-]

> For coding tasks, GPT-5.1-Codex-Max is a faster, more capable, and more token-efficient coding variant

Hm, yeah, strange. You would not be able to tell, looking at every chart on the page. Obviously not a gotcha, they put it on the page themselves after all, but how does that make sense with those benchmarks?

tempaccount420

3 months ago

[-]

Coding requires a mindset shift that the -codex fine-tunes provide. Codex will do all kinds of weird stuff like poking in your ~/.cargo ~/go etc. to find docs and trying out code in isolation, these things definitely improve capability.

dmos62

3 months ago

[-]

The biggest advantage of codex variants, for me, is terseness and reduced sicophany. That, and presumably better adherence to requested output formats.

3 months ago

[-]

Codex talks much less than the standard variant, especially between tool calls.

3 months ago

[-]

Looks like they removed that line.

k_bx

3 months ago

[-]

gpt-5.2 is already present in codex at this moment

https://platform.openai.com/docs/pricing

preetamjinka

3 months ago

[-]

It's actually more expensive than GPT-5.1. I've gotten used to prices going down with each latest model, but this time it's gone up.

kingstnap

3 months ago

[-]

Flagship models have rarely being cheaper, and especially not on release day. Only a few cases of this really.

Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.

The price drops usually are the form of flash and mini models being really cheap and fast. Like when we got o4 mini or 2.0 flash which was a particularly significant one.

n2d4

3 months ago

[-]

That's not true.

    > Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.

And GPT-4o, GPT-4.1, and GPT-5. Almost every OpenAI release got cheaper on a per-input-token basis.

PhilippGille

3 months ago

[-]

Gemini 3 Pro Preview also got more expensive than 2.5 Pro.

2.5 Pro: $1.25 input, $10 output (million tokens)

3 Pro Preview: $2 input, $12 output (million tokens)

TechDebtDevin

3 months ago

[-]

Literally no difference in productivity from a free/ <0.50c output OpenRouter model. All these > $1.00+ per mm output are literal scams. No added value to the world.

3 months ago

[-]

5.1 Pro is great

manmal

3 months ago

[-]

I struggle to see where Pro is better than 5.x with Thinking. Actually prefer the latter.

3 months ago

[-]

Many problems where latter spins its wheel and Pro gets it in one go, for me. You need to give Pro full files as context and you need to fit within its ~60k (I forget exactly) silent context window if using via ChatGPT. Don't have it make edits directly, have it give the execution plan back to Codex

3 months ago

[-]

Getting more expensive has been the trend for the closed weights frontier models. See Gemini 3 Pro vs 2.5 Pro. Also see Gemini 2.5 Flash vs 2.0 Flash. The only thing that got cheaper recently was Opus 4.5 vs Opus 4.

Handy-Man

3 months ago

[-]

It also seems much more "smarter" though

endorphine

3 months ago

[-]

Reading this comment, it just occurred to me that we're still in the first phase of the enshittification process.

moralestapia

3 months ago

[-]

Previous model's prices usually go down, but their flagship has always been the most expensive one.

moralestapia

3 months ago

[-]

Wtf, why would this be downvoted?

I'm adding context and what I stated is provably true.

zug_zug

3 months ago

[-]

For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?

hbarka

3 months ago

[-]

On the contrary, I thought Gemini 3 Live mode is much much better than ChatGPT. The voices have none of the annoying artificial uptalking intonations that ChatGPT has, and the simplex/duplex interruptibility of Gemini Live seems more responsive. It knows when to break and pause during conversations.

febed

3 months ago

[-]

Apart from sounding a bit stiff and informal, I was also surprised at how good Gemini Live mode is in regional Indian languages.

simondotau

3 months ago

[-]

I absolutely loathe ChatGPT's voice chat. It spends far too much time being conversational and its eagerness to please becomes fatiguing after the first back-and-forth.

joshmarlow

3 months ago

[-]

I think Grok's voice chat is almost there - only things missing for me: * it's slower to start-up by a couple of seconds * it's harder to switch between voice and text and back again in the same chat (though ChatGPT isn't perfect at this either)

And of course Grok's unhinged persona is... something else.

Gigachad

3 months ago

[-]

Pretty good until it goes crazy glazing Elon or declaring itself mecha hitler.

hcurtiss

3 months ago

[-]

Neither of these have happened in my use. Those were both the product of some pretty aggressive prompting, and were remedied months ago.

OrangeMusic

3 months ago

[-]

Yet, using this model in any way whatsoever after these episodes seems absolutely crazy to me.

hcurtiss

3 months ago

[-]

All models have had similar instances. I particularly enjoyed Gemini’s black founders era. The “safety” teams have bent the politics of these tools in ways I don’t trust. Grok does too, but in my experience less so. This has real impacts.

user34283

3 months ago

[-]

Grok is the only frontier model that is at all usable for adult content.

nazgulsenpai

3 months ago

[-]

It's so much fun. So is the Conspiracy persona.

Robdel12

3 months ago

[-]

I have found Claude‘s voice chat to be better. I only recently tried it because I liked ChatGPTs enough, but I think I’m going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.

lxgr

3 months ago

[-]

Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.

3 months ago

[-]

You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.

lxgr

3 months ago

[-]

Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?

3 months ago

[-]

Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.

You would need:

* A STT (ASR) model that outputs phonetics not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc

* A TTS model that understands those tokens and properly generate the matching voice

At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.

BoxOfRain

3 months ago

[-]

I've been experimenting with something similar to this approach recently. IndexTTS2 gives you emotion vectors as an input, I used an external emotion classification model on the LLM output to modulate the TTS emotion vectors. You need to manage the state of the current affect with a bit of care or it sounds unhinged, but it's worked surprisingly well so far. I wired it together using Cats Effect.

As you'd expect latency isn't great, but I think it can be improved.

barrkel

3 months ago

[-]

The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a speech to text stage previously.

jablongo

3 months ago

[-]

I tried to make ChatGPT sing Mary had a little lamb recently and it's atonal but vaguely resembles the melody, which is interesting.

causalmodels

3 months ago

[-]

I just asked it and it said that it uses the on device TTS capabilities.

furyofantares

3 months ago

[-]

I find it very unlikely that it would be trained on that information or that anthropic would put that in its context window, so it's very likely that it just made that answer up.

[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...

causalmodels

3 months ago

[-]

No, it did not make it up. I was curious so I asked it asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold and it explained that it didn't have have fine grained control over the audio output because it was using a TTS. I asked it how it knew that and it pointed me towards [1] and highlighted the following.

> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

Tracked down the original source [2] and looked for additional updates but couldn't find anything.

[2] https://trust.anthropic.com/updates

furyofantares

3 months ago

[-]

If it does a web search that's fine, I assumed it hadn't since you hadn't linked to anything.

Also it being right doesn't mean it didn't just make up the answer.

josephwegner

3 months ago

[-]

Along with the hordes of other options people are responding with, I'm a big fan of Perplexity's voice chat. It does back-and-forth well in a way that I missed whenever I tried anything besides ChatGPT.

solarkraft

3 months ago

[-]

It is, shockingly, based on the OpenAI Realtime Assistant API.

ivape

3 months ago

[-]

I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.

That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.

Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.

websiteapi

3 months ago

[-]

gemini live is a thing - never tried chaptgpt, are they not similar?

3 months ago

[-]

Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

However I can hand an 8 minute long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels, and makes regular use of elision to Gemini-3-pro-preview and it will produce an accurate macronized Latin transcription. It's pretty mind blowing.

Dilettante_

3 months ago

[-]

I have to ask: What usecase requires you to speak Latin to the llm?

3 months ago

[-]

I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.

3 months ago

[-]

Loquerisne Latine?

Non vere, sed intelligere possum.

Ita, mihi est canis qui idipsum facit!

(translated from the Gàidhlig)

3 months ago

[-]

Certe loqui conor, sed saepenumero prave dico; canis meus non turbatus est ;)

nineteen999

3 months ago

[-]

You haven't heard? Latin is the next big wave, after blockchain and AI.

create-username

3 months ago

[-]

you joke but Latin teachers are very sought after in my region. There are none. I have just bootcamped myself to become one and shift careers due to the high demand

3 months ago

[-]

You laugh, but the global language learning market in 2025 is expected to exceed USD $100 billion, and LLMs IMHO are poised to disrupt the shit out of it.

nineteen999

3 months ago

[-]

Well sure I can see that happening ... but I can't see latin making a huge comeback unfortunately.

jeanlucas

3 months ago

[-]

no.

leaK_u

3 months ago

[-]

how.

CamelCaseName

3 months ago

[-]

I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

nickvec

3 months ago

[-]

What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.

tmaly

3 months ago

[-]

I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that

SweetSoftPillow

3 months ago

[-]

Gemini's much better, try it

3 months ago

[-]

Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.

But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").

The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.

To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say they also don't have that stark a difference between text answers and voice mode answers, and that would be true but unfortunately mainly because its text answers are also not great with hallucinations or following instructions.

So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provide a real usable voice mode right now.

whimsicalism

3 months ago

[-]

gemini does, grok does, nobody else does (except alibaba but it’s not there yet)

codybontecou

3 months ago

[-]

Their voice agent is handy. Currently trying to build around it.

semiinfinitely

3 months ago

[-]

try gemini voice chat

bigyabai

3 months ago

[-]

Qwen does.

3 months ago

[-]

Qwen's voice chat is nowhere near as good as ChatGPT's.

FrasiertheLion

3 months ago

[-]

Try elevenlabs

3 months ago

[-]

Does elevenlabs have a real-time conversational voice model? It seems like like their focus is largely on text to speech and speech to text. Which can approximate that type of thing but it's not at all the same as the native voice to voice that 4o does.

hi_im_vijay

3 months ago

[-]

[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.

a true speech to speech conversational model will perform better on things like capturing tone, pronouncations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.

dragonwriter

3 months ago

[-]

> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like like their focus is largely on text to speech and speech to text.

They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.

https://elevenlabs.io/docs/agents-platform/overview#architec...

3 months ago

[-]

It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS

minadotcom

3 months ago

[-]

They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?

https://i.imgur.com/e0iB8KC.png

Tiberium

3 months ago

[-]

They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275

enlyth

3 months ago

[-]

This looks cherry-picked, for example Claude Opus had a higher score on SWE-Bench Verified so they conveniently left it out, also GDPval is literally a benchmark made by OpenAI

tobias2014

3 months ago

[-]

And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.

minadotcom

3 months ago

[-]

agreed.

sergdigon

3 months ago

[-]

The fact that the post is comparing their reasoning model against gemini 3 pro (the "non reasoning" model) and not gemini 3 pro deep think (the reasoning one) is quite nasty. If you compare GPT5.2 thinking to gemini 3 pro deep think, the scores are quite similar (sometimes one is better sometimes the other one is)

whimsicalism

3 months ago

[-]

uh oh, where did SWE bench go :D

whimsicalism

3 months ago

[-]

maybe they will release with gpt-5.2-codex

tabletcorry

3 months ago

[-]

The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

But they publish all the same numbers, so you can make the full comparison yourself, if you want to.

3 months ago

[-]

They are taking a page out of Apple's book.

Apple only compares to themselves. They don't even acknowledge the existence of others.

poormathskills

3 months ago

[-]

OpenAI has never compared their models to models from other labs in their blog post. Open literally any past model launch post to see that.

https://openai.com/index/hello-gpt-4o/

boole1854

3 months ago

[-]

I see evaluations compared with Claude, Gemini, and Llama there on the GPT 4o post.

kgwgk

3 months ago

[-]

“You are absolutely right, and I apologize for the confusion.”

snake_doc

3 months ago

[-]

> Models were run with maximum available reasoning effort in our API (xhigh for GPT‑5.2 Thinking & Pro, and high for GPT‑5.1 Thinking), except for the professional evals, where GPT‑5.2 Thinking was run with reasoning effort heavy, the maximum available in ChatGPT Pro. Benchmarks were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.

Feels like a Llama 4 type release. Benchmarks are not apples to apples. Reasoning effort is across the board higher, thus uses more compute to achieve an higher score on benchmarks.

Also notes that some may not be producible.

Also, vision benchmarks all use Python tool harness, and they exclude scores that are low without the harness.

jbkkd

3 months ago

[-]

A new model doesn't address the fundamental reliability issues with OpenAI's enterprise tier.

As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.

None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.

sigmoid10

3 months ago

[-]

Which tier are you? We are on the highest enterprise tier and I've found that OpenAI is a much more stable platform for high-usage than other providers. Can't say much about the UI though since I almost exclusively work with the API. I feel like UIs generally suck everywhere unless you want to do really generic stuff.

energy123

3 months ago

[-]

ChatGPT UI is leagues above Gemini and AI Studio in responsiveness and latency which is what I care about.

dannyw

3 months ago

[-]

Completely the opposite experience.

tenpoundhammer

3 months ago

[-]

I have been using chatGPT a ton over the last months and paying the subscription. Used it for coding, news, stock analysis, daily problems, and a whatever I could think of. I decided to give Gemini a go when version three came out to great reviews. Gemini handles every single one of my uses cases much better and consistently gives better answers. This is especially true for situations were searching the web for current information is important, makes sense that google would be better. Also OCR is phenomenal chatgpt can't read my bad hand writing but Gemini can easily. Only downsides are in the polish department, there are more app bugs and I usually have to leave the happen or the session terminates. There are bugs with uploading photos. The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense. Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

3 months ago

[-]

I consistently have exactly the opposite experience. ChatGPT seems extremely willing to do a huge number of searches, think about them, and then kick off more searches after that thinking, think about it, etc., etc. whereas it seems like Gemini is extremely reluctant to do more than a couple of searches. ChatGPT also is willing to open up PDFs, screenshot them, OCR them and use that as input, whereas Gemini just ignores them.

nullbound

3 months ago

[-]

I will say that it is wild, if not somewhat problematic that two users have such disparate views of seemingly the same product. I say that, but then I remember my own experience just from few days ago. I don't pay for gemini, but I have paid chatgpt sub. I tested both for the same product with seemingly same prompt and subbed chatgpt subjectively beat gemini in terms of scope, options and links with current decent deals.

It seems ( only seems, because I have not gotten around to test it in any systematic way ) that some variables like context and what the model knows about you may actually influence quality ( or lack thereof ) of the response.

martinpw

3 months ago

[-]

> I will say that it is wild, if not somewhat problematic that two users have such disparate views of seemingly the same product.

This happens all the time on HN. Before opening this thread, I was expecting that the top comment would be 100% positive about the product or its competitor, and one of the top replies would be exactly the opposite, and sure enough...

I don't know why it is. It's honestly a bit disappointing that the most upvoted comments often have the least nuance.

stevage

3 months ago

[-]

How much nuance can one person's experience have? If the top two most visible things are detailed, contrary experiences of the same product, that seems a pretty good outcome?

AznHisoka

3 months ago

[-]

Also, why introduce nuance for the sake of nuance? For every single use case, Gemini (and Claude) has performed better. I can’t give ChatGPT even the slightest credit when it doesnt deserve any

block_dagger

3 months ago

[-]

Replace "on HN" with "in the course of human events" and we may have a generally true statement ;)

rabf

3 months ago

[-]

Chatgpt is not one model! Unless you manually specify to use a particular model your question can be routed to different models depending on what it guesses would be most appropriate for your question.

stingraycharles

3 months ago

[-]

Isn’t that just standard MoE behavior? And isn’t the only choice you have from the UI between “Instant” and “Thinking”?

3 months ago

[-]

MoE is a single model thing, model routing happens earlier.

stingraycharles

3 months ago

[-]

Yes but then what does the grandparent mean with “unless you specify a specific model” ? Do they mean “if you select auto, it automatically decides between instant or thinking” ?

That’s… hardly something worth mentioning.

rabf

3 months ago

[-]

If you have the paid subscription you can choose what model your question is routed to. Current options in the UI are GPT-5.1 Instant, GPT-5.1 Thinking, GPT-5 Instant, GPT-5 thinking mini, GPT-5 thinking, GPT-4o, GPT-4.1, o3 and o4-mini. Options like deep-research will affect the reasoning level used. There is a lot that goes on behind the scenes in the chatgpt app with things like tool use or function calling coming into play as well. Ultimately what OpenAI will be trying/hoping to do is give you a satifactory result using the least amount of compute possible - this is where the autorouter is very useful for them and obstensibly for the user who would not know which one to pick. I mostly just use the API's these days as I like to be the one who decides who/what I am talking to.

blks

3 months ago

[-]

Because neither product has any consistency in its results, no predictive behaviour. One day it performs well, another it hallucinates non existing facts and libraries. Those are stochastic machines

sendes

3 months ago

[-]

I see the hyperbole is the point, but surely what these machines do is to literally predict? The entire prompt engineering endeavour is to get them to predict better and more precisely. Of course, these are not perfect solutions - they are stochastic after all, just not unpredictably.

coliveira

3 months ago

[-]

Prompt engineering is voodoo. There's no sure way to determine how well these models will respond to a question. Of course, giving additional information may be helpful, but even that is not guaranteed.

lossyalgo

3 months ago

[-]

Also every model update changes how you have to prompt them to get the answers you want. Setting up pre-prompts can help, but with each new version, you have to figure out through trial and error how to get it to respond to your type of queries.

I can't wait to see how bad my finally sort-of-working ChatGPT 5.1 pre-prompts work with 5.2.

Edit: How to talk to these models is actually documented, but you have to read through huge documents: https://cdn.openai.com/gpt-5-system-card.pdf

3 months ago

[-]

It definitely isn’t voodoo, it’s more like forecasting weather. Some forecasts are easier to make, some are harder (it’ll be cold when it’s winter vs the exact location and wind speed of a tornado for an extreme example). The difference is you can try to mix things up in the prompt to maximize the likelihood of getting what you want out and there are feasibility thresholds for use cases, e.g. if you get a good answer 95% of the time it’s qualitatively different than 55%.

coliveira

3 months ago

[-]

No, it's not. Nowadays we know how to predict the weather with great confidence. Prompting may get you different results each time. Moreover, LLMs depend on the context of your prompts (because of their memory), so a single prompt may be close to useless and two different people can get vastly different results.

3 months ago

[-]

> we know how to predict the weather with great confidence

some weather, sometimes. we're not good at predicting exact paths of tornadoes.

> so a single prompt may be close to useless and two different people can get vastly different results

of course, but it can be wrong 50% of the time or 5% of the time or .5% of the time and each of those thresholds unlock possibilities.

3 months ago

[-]

And I’d really like for Gemini to be as good or better, since I get it for free with my Workspace account, whereas I pay for chatgpt. But every time I try both on a query I’m just blown away by how vastly better chatgpt is, at least for the heavy-on-searching-for-stuff kinds of queries I typically do.

3 months ago

[-]

Gemini has tons of people using it free via aistudio

I can't help but feel that google gives free requests the absolute lowest priority, greatest quantization, cheapest thinking budget, etc.

I pay for gemini and chatGPT and have been pretty hooked on Gemini 3 since launch.

crorella

3 months ago

[-]

It’s like having 3 coins and users preferring one or the other when tossing it because one coin gives consistently more heads (or tails) than the other coin.

What is better is to build a good set of rules and stick to one and then refine those rules over time as you get more experience using the tool or if the tool evolves and digress from the results you expect.

nullbound

3 months ago

[-]

<< What is better is to build a good set of rules and

But, unless you are on a local model you control, you literally can't. Otherwise, good rules will work only as long as the next update allows. I will admit that makes me consider some other options, but those probably shouldn't be 'set and iterate' each time something changes.

crorella

3 months ago

[-]

what I had in mind when I added that comment was for coding, with the use of .md files. For the web version of chats I agree there is little control on how to tailor the way you want the agent to behave, unless you give a initial "setup" prompt.

jhancock

3 months ago

[-]

I can use GPT one day and the next get a different experience with the same problem space. Same with Gemini.

4ndrewl

3 months ago

[-]

This is by design, given a non-determenitisic application?

jhancock

3 months ago

[-]

sure. It may be more than that...possibly due to variable operating params on the servers and current load.

On whole, if I compare my AI assistant to a human worker, I get more variance than I would from a human office worker.

pixl97

3 months ago

[-]

Thats because you don't 'own' the LLM compute. If you instead bought your office workers by the question I'm sure the variability would increase.

3 months ago

[-]

They're not really capable of producing varying answers based on load.

But they are capable of producing different answers because they feel like behaving differently if the current date is a holiday, and things like that. They're basically just little guys.

sjaramillo

3 months ago

[-]

I guess LLMs have a mood too

dr_dshiv

3 months ago

[-]

Vibes

nunez

3 months ago

[-]

Tesla FSD has been more or less the same experience. Some people drive 100s of miles without disengaging while others pull the plug within half a mile from their house. A lot of it depends on what the customer is willing to tolerate.

austhrow743

3 months ago

[-]

We've been having trouble telling if people are using the same product ever since Chat GPT first got popular. The had a free model and a paid model, that was it, no other competitors or naming schemes to worry about, and discussions were still full of people talking about current capabilities without saying what model they were using.

For me, "gemini" currently means using this model in the llm.datasette.io cli tool.

openrouter/google/gemini-3-pro-preview

For what anyone else means? If they're equivalent? If Google does something different when you use "Gemini 3" in their browser app vs their cli app vs plans vs api users vs third party api users? No idea to any of the above.

I hate naming in the llm space.

3 months ago

[-]

FWIW i’m always using 5.1 Thinking.

Bombthecat

3 months ago

[-]

Could also be a language thing ...

ghostpepper

3 months ago

[-]

Same, I use chatgpt plus (the entry-level paid option) extensively for personal research projects and coding, and it seems miles ahead of whatever "Gemini Pro" is that I have through work. Twice yesterday, gemini repeated verbatim a previous response as if I hadn't asked another question and told it why the previous response was bad. Gemini feels like chatGPT from two years ago.

staticman2

3 months ago

[-]

Are you uploading PDFs that already have a text layer?

I don't currently subscribe to Gemini but on A.I. Studio's free offering when I upload a non OCR PDF of around 20 pages the software environment's OCR feeds it to the model with greater accuracy than I've seen from any other source.

3 months ago

[-]

I’m not uploading PDFs at all. I’m talking about PDFs it finds while searching than it extracts data from for the conversation.

staticman2

3 months ago

[-]

I'm surprised to hear anyone finds these models trustworthy for research.

Just today I asked Claude what year over year inflation was and it gave me 2023 to 2024.

I also thought some sites ban A.I. crawling so if they have the best source on a topic, you won't get it.

3 months ago

[-]

Anytime you use LLMs you should be keenly aware of their knowledge cutoff. Like any other tool, the more you understand it, the better it works.

staticman2

3 months ago

[-]

I'm sorry but I don't see what "knowledge cutoff" has to do with what we were talking about- which is using a LLM find PDFs and other sources for research.

whazor

3 months ago

[-]

I agree with you. To me, gemini has much worse search results. Then again, I use kagi for search and I cannot stand the search results from Google anymore. And its clear that gemini uses those.

In contrast, chatgpt has built their own search engine that performs better in my experience. Except for coding, then I opt for Claude opus 4.5.

noname120

3 months ago

[-]

Perplexity Pro with any thinking model blows both out of the water in a fraction of the time, in my experience

kccqzy

3 months ago

[-]

> The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense.

Oh I know this from my time at Google. The actual purpose is to do a quick check for known malware and phishing. Of course these days such things are better dealt with by the browser itself in a privacy preserving way (and indeed that’s the case), so it’s unnecessary to reveal to Google which links are clicked. It’s totally fine to manipulate them to make them go directly to the website.

gjuggler

3 months ago

[-]

I think Gemini is just broken.

Instead of forwarding model-generated links to https://www.google.com/url?q=[URL], which serves the purpose of malware check and user-facing warning about linking to an external site, Gemini forwards links to https://www.google.com/search?q=[URL], which does... a Google search for the URL, which isn't helpful at all.

Example: https://gemini.google.com/share/3c45f1acdc17

NotebookLM by comparison, does the right thing: https://notebooklm.google.com/notebook/7078d629-4b35-4894-bb...

It's kind of impressive how long this obviously-broken link experience has been sitting in the Gemini app used by millions.

3 months ago

[-]

That's interesting, I just today started getting some "Some sites restrict our ability to check links." dialogue in ChatGPT that wanted me to verify that I really wanted to follow the link, with a Learn More link to this page: https://help.openai.com/en/articles/10984597-chatgpt-generat...

So it seems like ChatGPT does this automatically and internally, instead of using an indirect check like this.

solarkraft

3 months ago

[-]

> Only downsides are in the polish department

What an understatement. It has me thinking „man, fuck this“ on the daily.

Just today it spontaneously lost an entire 20-30 minutes long thread and it was far from the first time. It basically does it any time you interrupt it in any way. It’s straight up data loss.

It’s kind of a typical Google product in that it feels more like a tech demo than a product.

It has theoretically great tech. I particularly like the idea of voice mode, but it’s noticeably glitchy, breaks spontaneously often and keeps asking annoying questions which you can’t make it stop.

3 months ago

[-]

ChatGPT web UI was also like this for the longest time, until a few months ago: all sorts of random UI bugs leading either to data loss or misleading UI state. Interrupting still is very flaky there too. And on the mobile app, if you move away from the app while it's taking time to think, its state would somehow desync from the actual backend thinking state, and get stuck randomly; sometimes restarting the app fixes it, sometimes that chat is that unusable from that point on.

And the UI lack of polish shows up freshly every time a new feature lands too - the "branch in new chat" feature is really finicky still, getting stuck in an unusable state if you twitch your eyebrows at wrong moment.

gcr

3 months ago

[-]

i basically can't use the ChatGPT app on the subway for these reasons. the moment the websocket connection drops, i have to edit my last message and resubmit it unchanged.

it's like the client, not the server, is responsible for writing to my conversation history or something

spruce_tips

3 months ago

[-]

it took me a lot of tinkering to get this feeling seamless in my own apps that use the api under the hood. i ended up buffering every token into a redis stream (with a final db save at the end of streaming) and building a mechanism to let clients reconnect to the stream on demand. no websocket necessary.

works great for kicking off a request and closing tab or navigating away to another page in my app to do something.

i dont understand why model providers dont build this resilient token streaming into all of their APIs. would be a great feature

rishabhaiover

3 months ago

[-]

exactly. they need to bring in spotify level of caching of streaming music that it just works if you're in a subway. Constant availability should be table stakes for them.

rjzzleep

3 months ago

[-]

I get that the web versions are free, but if you can afford API access, I always recommend using Msty for everything. It's a much better experience.

https://msty.ai/

p_ing

3 months ago

[-]

> ChatGPT web UI was also like this for the longest time

Copilot Chat has been perfect in this respect. It's currently GPT 5.0, moving to 5.1 over the next month or so, but at least I've never lost an (even old) conversation since those reside in an Exchange mailbox.

Max-Limelihood

3 months ago

[-]

I lost thousands of conversations I'd had back in the move from "Bing" to "Copilot". Moved straight to Claude and never touched a GPT again.

Duanemclemore

3 months ago

[-]

I downloaded my archive and completely ended my GPT subscription last week based on some bad computer maintenance advice. Same thing here - using other models, never touching that product again.

topato

3 months ago

[-]

now I kind of HAVE to know... what was the aforementioned bad advice was?! So mysterious!

Duanemclemore

3 months ago

[-]

Oh, it was DUMB. I was dumb. I only have myself to blame here. But we all do dumb things sometimes, owning your mistakes keeps you humble, and you asked. So here goes.

I use a modeling software called Rhino on wine on Linux. In the past, there was an incident where I had to copy an obscure dll that couldn't be delivered by wine or winetricks from a working Windows installation to get something to work. I did so and it worked. (As I recall this was a temporary issue, and was patched in the next release of wine.)

I hate the wine standard file picker, it has always been a persistent issue with Rhino3d. So I keep banging my head on trying to get it to either perform better or make a replacement. Every few months I'll get fed up and have a minute to kill, so I'll see if some new approach works. This time, ChatGPT told me to copy two dll's from a working windows installation to the System folder. Having precedent that this can work, I did.

Anyway, it borked startup completely and it took like an hour to recover. What I didn't consider - and I really, really should have - was that these were dll's that were ALREADY IN the system directory, and I was overwriting the good ones with values already reflecting my system with completely foreign ones.

And that's the critical difference - the obscure dll that made the system work that one time was because of something missing. This time was overwriting extant good ones.

But the fact that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success made me very wary of the harm it could cause.

me-vs-cat

3 months ago

[-]

> ...using other models, never touching that product again.

> ...that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success...

Since you're using other models instead, do you believe they cannot give similarly stupid ideas?

Duanemclemore

3 months ago

[-]

I'm under no misimpression they can't. But I have found ChatGPT to be most confident when it f's up. And to suggest the worst ideas most often.

Until you queried I had forgotten to mention that the same day I was trying to work out a Linux system display issue and it very confidently suggested to remove a package and all its dependencies, which would have removed all my video drivers. On reading the output of the autoremove command I pointed out that it had done this, and the model spat out an "apology" and owned up to ** the damage it would have wreaked.

** It can't "apologize" for or "own up" to anything, it can just output those words. So I hope you'll excuse the anthropomorphization.

me-vs-cat

3 months ago

[-]

I feel the same about the obsequious "apologies".

p_ing

3 months ago

[-]

I'm referring to Copilot Chat. The data resides in your Exchange mailbox. You're referring to the consumer product.

deepGem

3 months ago

[-]

There is no competing product for GPT Voice. Hands down. I have tried Claude, Gemini - they don't even comes close.

But voice is not a huge traffic funnel. Text is. And the verdict is more or less unanimous at this time. Gemini 3.0 has outdone ChatGPT. I unsubscribed from GPT plus today. I was a happy camper until the last month when I started noticing deplorable bugs.

1. The conversation contexts are getting intertwined.Two months ago, I could ask multiple random queries in a conversation and I would get correct responses but the last couple of weeks, it's been a harrowing experience having to start a new chat window for almost any change in thread topic. 2. I had asked ChatGPT to once treat me as a co-founder and hash out some ideas. Now for every query - I get a 'cofounder type' response. Nothing inherently wrong but annoying as hell. I can live with the other end of the spectrum in which Claude doesn't remember most of the context.

Now that Gemini pro is out, yes the UI lacks polish, you can lose conversations, but the benefits of low latency search and a one year near free subscription is a clincher. I am out of ChatGPT for now, 5.2 or otherwise. I wish them well.

esyir

3 months ago

[-]

Just a note, chatGPT does retain a persistent memory of conversations. In the settings menu, there's a section that allows you to tweak/clear this persistent memory

rapind

3 months ago

[-]

I found the gemini cli extremely lacking and even frustrating. Why google would choose node…

Codex is decent and seemed to be improving (being written in rust helps). Claude code is still the king, but my god they have server and throttling issues.

Mixed bag wherever you go. As model progress slows / flatlines (already has?) I’m sure we’ll see a lot more focus and polish on the interfaces.

3 months ago

[-]

Codex is king

wkat4242

3 months ago

[-]

What's that near free subscription? I don't see it here

deepGem

3 months ago

[-]

They had 9.99 for the first year.

wkat4242

3 months ago

[-]

Oh I must have missed that, thanks.

topato

3 months ago

[-]

yeah, the best Ive seen is like 1.99 for two months, then back to normal pricing....

KronisLV

3 months ago

[-]

> It has me thinking „man, fuck this“ on the daily.

That's sometimes me with the CLI. I can't use the Gemini CLI right now on Windows (in the Terminal app), because trying to copy in multiple lines of text for some reason submits them separately and it just breaks the whole thing. OpenCode had the same issue but even worse, it quite after the first line or something and copied the text line by line into the shell, thank fuck I didn't have some text that mentions rm -rf or something.

More info: https://github.com/google-gemini/gemini-cli/issues/14735#iss...

At the same time, neither Codex CLI, nor Claude Code had that issue (and both even showed shortened representations of copied in text, instead of just dumping the whole thing into the input directly, so I could easily keep writing my prompt).

So right now if I want to use Gemini, I more or less have to use something like KiloCode/RooCode/Cline in VSC which are nice, but might miss out on some more specific tools. Which is a shame, because Gemini is a really nice model, especially when it comes to my language, Latvian, but also your run of the mill software dev tasks.

In comparison, Codex feels quite slow, whereas Claude Code is what I gravitate towards most of the time but even Sonnet 4.5 ends up being expensive when you shuffle around millions of tokens: https://news.ycombinator.com/item?id=46216192 Cerebras Code is nice for quick stuff and the sheer amount of tokens, but in KiloCode/... regularly messes up applying diff based edits.

radicaldreamer

3 months ago

[-]

Google’s standard problem is that they don’t even use their own products. Their Pixel and Android team rocks iPhones on the daily, for example.

free652

3 months ago

[-]

You cant buy an iPhone without a director approval. And it's like 3 gen behind as well. So no, they don't use iPhones.

ummonk

3 months ago

[-]

Google tells its employees what products they're allowed to buy for personal use?

snypher

3 months ago

[-]

Seems like they meant for a work device.

gcr

3 months ago

[-]

lots of googlers use BYOD iPhones and the corp suite for this use case is fairly well-supported

brookst

3 months ago

[-]

Which makes tons of sense because iPhone users are higher CLV than Android users. If Google had to choose between major software defects in Android or iOS, they would focus quality on iOS every time.

siva7

3 months ago

[-]

that explains why their ios gemini app is so ridiculously bad. in private they probably use iphones and just chatgpt instead.

dominotw

3 months ago

[-]

you have to get premission from director for your presonal phone? wtf

testdelacc1

3 months ago

[-]

For the work phone.

RBerenguel

3 months ago

[-]

I would think this is not true

sib

3 months ago

[-]

You'd be wrong (source - worked in the Android org).

RBerenguel

3 months ago

[-]

How long ago?

sib

3 months ago

[-]

2021-2023

renewiltord

3 months ago

[-]

Yeah, I've heard that Sundar Pichai dogfoods the latest Pixel at least once a month and sometimes two or three times.

sam345

3 months ago

[-]

That's inexcusable.

Der_Einzige

3 months ago

[-]

That’s because they will be bullied out of the dating market if they have a “green bubble”.

dkga

3 months ago

[-]

What is a green bubble? iPhone's carbon footprint?

brookst

3 months ago

[-]

iMessage renders other iMessage users as blue bubbles, SMS/RCS as green bubbles.

People who can’t understand that many people actually prefer iOS use this green/blue thing to explain the otherwise incomprehensible (to them) phenomenon of high iOS market share. “Nobody really likes iOS, they just get bullied at school if they don’t use it”.

It’s just “wake up sheeple” dressed up in fake morality.

ethbr1

3 months ago

[-]

As someone who switches between platforms somewhat frequently, iOS perpetually feels like people have Stockholm syndrome.

'Oh, that super annoying issue? Yeah, it's been there for years. We just don't do that.'

Fundamentally though, browsing the web on iOS, even with a custom "browser" with adblocking, feels like going back in time 15 years.

platevoltage

3 months ago

[-]

It wouldn't be an issue if they didn't pick the worst green on earth. "Which green would you like for the carrier text messages Mr. Jobs?" ... "#00FF00 will be fine."

onethought

3 months ago

[-]

I mean there is benefit to understanding competitor well as well?

LogicFailsMe

3 months ago

[-]

Outweighed by the value of having to suffer with the moldy fruits of their own labor. That was the only way the Android Facebook app became usable as well.

3 months ago

[-]

There certainly is.

To posit a scenario: I would expect General Motors to buy some Ford vehicles to test and play around with and use. There's always stuff to learn about what the competition has done (whether right, wrong, or indifferent).

But I also expect the parking lots used by employees at any GM design facility in the world to be mostly full of General Motors products, not Fords.

https://www.caranddriver.com/news/a62694325/ford-ceo-jim-far...

snypher

3 months ago

[-]

The CEO of Ford was driving a competition EV for months;

GenerWork

3 months ago

[-]

>But I also expect the parking lots used by employees at any GM design facility in the world to be mostly full of General Motors products, not Fords.

I think you'd be surprised about the vehicle makeup at Big 3 design facilities.

3 months ago

[-]

Maybe so.

I'm only familiar with Ford production and distribution facilities. Those parking lots are broadly full of Fords, but that doesn't mean that it's like this across the board.

olyjohn

3 months ago

[-]

GM has dedicated parking lots for employees with GM vehicles. Everybody else parks further away in the lot of shame.

3 months ago

[-]

Of course.

And I've parked in the lot of shame at a Ford plant, as an outsider, in my GMC work truck -- way over there.

It wasn't so bad. A bit of a hike to go back and get a tool or something, but it was at least paved...unlike the non-union lot I'm familiar with at a P&G facility, which is a gravel lot that takes crossing a busy road to get to, lacks the active security and visibility from the plant that the union lot has, and which is full of tall weeds. At P&G, I half-expect to come back and find my tires slashed.

Anyway, it wasn't barren over there in the not-Ford lot, but it wasn't nearly so populous as the Ford lot was. The Ford-only lot is bigger, and always relatively packed.

It was very clear to me that the lots (all of the lots, in aggregate) were mostly full of Fords.

To bring this all back 'round: It is clear to me that Ford employees broadly (>50%) drive Fords to work at that plant.

---

It isn't clear to me at all that Google Pixel developers don't broadly drive iPhones. As far as I can tell, that status (which is meme-level in its age at this point) is true, and they aren't broadly making daily use of the systems they build.

(And I, for one, can't imagine spending 40 hours a week developing systems that I refuse to use. I have no appreciation for that level of apparent arrogance, and I hope to never be suaded to be that way. I'd like to think that I'd be better-motivated to improve the system than I would be to avoid using it and choose a competitor instead.

I don't shit where I sleep.)

3 months ago

[-]

I wonder how many apple employees walk in to the office with android phones

azinman2

3 months ago

[-]

Effectively zero.

Disclosure: I work at Apple. And when I was at Google I was shocked by how many iPhones there were.

3 months ago

[-]

That doesn’t surprise me at all haha appreciate someone a little closer to the question answering it! I know it still counts anecdotal but I’ll take it

jimmaswell

3 months ago

[-]

This is flabbergasting, how could such a large proportion of highly technical people willingly subject themselves to being shackled by iOS? They just happily put up with having one choice of browser, (outside Europe) no third party app stores, and being locked into the Apple ecosystem? I can't think of a single reason I would ever switch from an S22-25+U to an iPhone. I only went from 22U to 25U because my old one got smashed, otherwise the 22U would still be perfectly fine.

brookst

3 months ago

[-]

Because many of them just want to use their phone as a tool, not tinker with it.

Same way many professional airplane mechanics fly commercial rather than building their own plane. Just because your job is in tech doesn’t mean you have to be ultra-haxxor with every single device in your life.

kaashif

3 months ago

[-]

I don't have my phone (a Pixel) because it frees me from shackles or anything like that. It's just a phone. I use the default everything. Works great. I imagine most people with iPhones are the same.

dumbfounder

3 months ago

[-]

Because it’s better.

3 months ago

[-]

I feel like people dance around this a lot because idk it hurts nerd credibility or something. The fact is on a moment to moment basis, the iPhone is just a better experience generally. They also hold their value a lot longer. I consistently trade in my phone or sell it to other people for easily 80% of what I paid for it. Usually this is 3-4yrs out

Remember how long it took for Instagram to be functional on android phones?

jimmaswell

3 months ago

[-]

I've tried them out and not a single thing about it was tangibly better IMO. They have no inherent merit above Android except that some see them as a status symbol (which is absurd as my S25U has a higher MSRP than most iPhone models)

hamburglar

3 months ago

[-]

My bottom of the barrel iPhone SE is absolutely not a status symbol. It’s just the phone I like best.

The MSRP of your phone does not matter.

3 months ago

[-]

Cameras, for starters. I’ve never seen another smart phone keep up with the quality color and texture of an iPhone’s photos/videos (videos in particular) since the 4s. Their color science is just better. We’ve intercut footage since the 7 or so with our work and frankly you’d be hard pressed to catch it wasn’t one of our nicer rigs unless we hold the shot for too long. we just can’t get other phone cameras to match footage with the same ease, especially when it comes to skin tones.

adamkochanowicz

3 months ago

[-]

I also love that I can leave the microphone on (not in live voice mode) while dictating to ChatGPT and pause and think as much as needed.

With Gemini, it will send as soon as I stop to think. No way to disable that.

wheelerwj

3 months ago

[-]

How did you do this?

toomuchtodo

3 months ago

[-]

Record button in the app if you’ve got the feature.

arjie

3 months ago

[-]

Any time its safety stuff triggers, Gemini wipes the context. It's unusable because of this because whatever is going on with the safety stuff, it fires too often. I'm trying to figure out some code here, not exactly deporting ICE to Guantanamo or whatever.

rvnx

3 months ago

[-]

The more Gemini and Nano-Banana soften their filters, the more audience it will take from other platforms. The main risk is payment providers banning them, I can't imagine bank card providers to remove payments to Google.

dzhiurgis

3 months ago

[-]

On a flip side chatgpt app now has years of history that sometimes useful (search is pretty ok, but could improve) but otherwise I'd like to remove most of it - good luck doing so.

amluto

3 months ago

[-]

Claude regularly computes a reply for me, then reports an error and loses the reply. I wonder what fraction of Anthropic’s compute gets wasted and redone.

seg_lol

3 months ago

[-]

Try using a VPN, my ISP was killing connections and claude would randomly reset. Using a VPN fixed the issue.

mnky9800n

3 months ago

[-]

The colab integration is where it shines the most imo.

hexnuts

3 months ago

[-]

You may be interested in tools like OpenMemory

3 months ago

[-]

Yeah I eventually noped out as I said in another comment and am charging hard with Codex and am so happy about 5.2!!

lxgr

3 months ago

[-]

Interesting, I had the opposite experience. 5.0 "Thinking" was better than 5.1, but Gemini 3 Pro seems worse than either for web search use cases. It's hallucinating at pretty alarming rates (including making up sources it never actually accessed) for a late 2025 model.

Opus 4.5 has been a step above both for me, but the usage limits are the worst of the three. I'm seriously considering multiple parallel subscriptions at this point.

gs17

3 months ago

[-]

I've had the same experience with search, especially with it hallucinating results instead of actually finding them. It's really frustrating that you can't force a more in-depth search from the model run by the company most famous for a search engine.

3 months ago

[-]

Try the same question in deep research mode.

hbarka

3 months ago

[-]

I’ve been putting literally the same inputs into both ChatGPT and Gemini and the intuition in answers from Gemini just fits for me. I’m now unwilling to just rely on ChatGPT.

Google, if you can find a way to export chats into NotebookLM, that would be even better than the Projects feature of ChatGPT.

siva7

3 months ago

[-]

notebooklm is heavily biased to only use the sources i added and frame every task around them - even if it is nonsensical - so it is not that useful for novel research. it also tends to hallucinate when lots of data is involved.

LogicFailsMe

3 months ago

[-]

All I want for Christmas is a "No NotebookLM slop" checkbox on youtube.

simplify

3 months ago

[-]

Youtube's downvote button has served me quite well for this purpose.

didibus

3 months ago

[-]

> Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

Depends, even though Gemini 3 is a bit better than GPT5.1, the quality of the ChatGPT apps themselves (mobile, web) have kept me a subscriber to it.

I think Google needs to not-google themselves into a poor app experience here, because the models are very close and will probably continue to just pass each other in lock step. So the overall product quality and UX will start to matter more.

Same reason I am sticking to Claude Code for coding.

concinds

3 months ago

[-]

The ChatGPT Mac app especially feels much nicer to use. I like Gemini more due to the context window but I doubt Google will ever create a native Mac app.

bayarearefugee

3 months ago

[-]

This matches my experience pretty closely when it comes to LLM use for coding assistance.

I still find a lot to be annoyed with when it comes to Gemini's UI and its... continuity, I guess is how I would describe it? It feels like it starts breaking apart at the seams a bit in unexpected ways during peak usages including odd context breaks and just general UI problems.

But outside of UI-related complaints, when it is fully operational it performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself.

luhn

3 months ago

[-]

That's hilarious and right on brand for Google that they spend millions developing cutting-edge technology and fumble the ball making a chat app.

spwa4

3 months ago

[-]

Every Google app is a chat app, except maybe search.

dieortin

3 months ago

[-]

Is Google Drive a chat app? Is Google Photos a drive app? I don’t know what you mean

spwa4

3 months ago

[-]

Once you open a file, it is very much a chat app. Comments and chat work for anything you can preview btw, not just Google Docs stuff.

Not sure how you can access the chat in the directory view.

minitoar

3 months ago

[-]

In Google Photos shared albums there is a tab that I can only describe as a chatroom.

dieortin

3 months ago

[-]

Isn’t there a difference between having a tab that is similar to a chat, to being a chat app?

azan_

3 months ago

[-]

That's interesting. I've got completely different impression. Every time I use Gemini I'm surprised how bad it is. My main complaint is that Gemini is too lazy.

Nathanba

3 months ago

[-]

Same for me, at this point I'm seriously starting to think that these are ads for and by Google because for me Gemini is the worst.

WillPostForFood

3 months ago

[-]

My experience is that "AI Mode" Gemini in Chrome is terrible, but AI Studio Gemini is pretty great.

varispeed

3 months ago

[-]

Get Gemini answer and tell ChatGPT this is what my friend said. Then put ChatGPT answer to Claude and so on. It's a cheat code.

tenpoundhammer

3 months ago

[-]

I did this today it was amazing. If I would have had time I would try other models as well. Great tip thanks

clhodapp

3 months ago

[-]

A cheat code to what?

Iwan-Zotow

3 months ago

[-]

To get a Hitler

AznHisoka

3 months ago

[-]

ChatGPT seems to just randomly pick urls to cite and extract information from.

Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced

FpUser

3 months ago

[-]

I've read many very positive reviews about Gemini 3. I tried using it including Pro and to me it looks very inferior to ChatGPT. What was very interesting though was when I caught it bullshitting me I called its BS and Gemini expressed very human like behavior. It did try to weasel its way out, degenerated down to "true Scotsman" level but finally admitted that it was full of it. this is kind of impressive / scary.

TacticalCoder

3 months ago

[-]

Yeah basically the same here. And many people on paid ChatGPT subscription like us noticed just that. Gemini 3 Pro "thinking" is really good.

> Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

I think the biggest issue OpenAI is facing is the numbers: Google is at the moment a near $4 trillion company. They can splurge a near infinite amount of money to win the race.

Google is so big they they created their own TPUs, which is mindboggling.

Which new user is going to willingly pay an OpenAI subscription once he knows that gemini.google.com gives access to a state of the art model? And Google makes sure to remind users who search that they can "continue the discussion" with Gemini.

Maybe the dirty Altman tricks like cornering the entire RAM market can work but I don't see how they can beat Google by playing fair. OpenAI shall need every single dirty trick in the book, including circular funding / shady deals with NVidia to stay relevant vs the behemoth that Google is.

abhaynayar

3 months ago

[-]

Gemini voice recognition is trash compared to chatgpt and that is a deal breaker for me. I wonder how many ppl do OCR versus use voice.

And how has chatgpt lost when ure not comparing the chatgpt that just came out to the Gemini that just came out? Gemini is just annoying to use.

and Google just benchmaxxed I didn't see any significant difference (paying for both) and the same benchmaxxing probably happening for chatgpt now as well, so in terms of core capabilities I feel stuff has plateaued. more bout overall experience now where Gemini suxx.

I really don't get how "search integration" is a "strength"?? can you give any examples of places where you searched for current info and chatgpt was worse? even so I really don't get how it's a moat enough to say chatgpt has lost. would've understood if you said something like tpu versus GPU moat.

jmstfv

3 months ago

[-]

Ditto but for Claude -- blows GPT out of the water. Much better in coding and solving physics problems from the images (in foreign languages). GPT couldn't even read the image. The only annoying thing is that if you use Opus for coding, your usage will fill up pretty fast.

anyway, cancelled my chatgpt subscription.

3 months ago

[-]

Then you haven't used Gemini CLI with Gemini 3 hard enough. It's a genius psychopath. The raw IQ that Gemini has is incredible. Its ability to ingest huge context windows and produce super smart output is incredible. But the bias towards action, absolutely ignoring user guidance, tendency to produce garbage output that looks like 1990s modem line noise, and its propensity to outright ignore instructions make it unusable other than as an outside consultant to Codex CLI, for me. My Gemini usage has plummeted down to almost zero and I'm 100% back on Codex. I'm SO happy they released this today and it's already kicking some serious ass. Thanks OpenAI team and congrats.

tobias2014

3 months ago

[-]

I guess when you use it for generic "problem solving", brainstorming for solutions, this is great. That's what I use it for, and Gemini is my favorite model. I love when Gemini resists and suggests that I am wrong while explaining why. Either it's true, and I'm happy for that, or I can re-prompt based on the new information which doesn't allow for the mistake Gemini made.

On the other hand, I can also see why Claude is great for coding, for example. By default it is much more "structured". One can probably change these default personalities with some prompting, and many of the complaints found in this thread about either side are based on the assumption that you can use the same prompt for all models.

Kim_Bruning

3 months ago

[-]

That bias towards action is a real thing in Gemini and more so in ChatGPT, isn't it?

Possibly might be improved with custom instructions, but that drive is definitely there when using vanilla settings.

3 months ago

[-]

Yeah it's a weird mix of issues with the backend model and issues with the CLI client and its prompts. What makes it hard for them is the teams aren't talking to each other. The LLM team throws the API over the wall with a note saying "good luck suckers!".

prodigycorp

3 months ago

[-]

Genius psychopath is a good description for Gemini. It’s the most impressive model but post training is not all there.

afro88

3 months ago

[-]

> I usually have to leave the happen or the session terminates

Assuming you meant "leave the app open", I have the same frustration. One of the nice things about the ChatGPT app is you can fire off a req and do something else. I also find Gemini 3 Pro better for general use, though I'm keen to try 5.2 properly

WheatMillington

3 months ago

[-]

I generate fun images for my kids - turn photos into a new style, create colouring pages from pictures, etc. I lost interest in chatGPT because it throws vague TOS errors constantly. Gemini handles all of this without complaint.

xyzsparetimexyz

3 months ago

[-]

You feed ai slop to your children? That doesn't seem unhealthy and bad for their development?

retsibsi

3 months ago

[-]

What's your specific concern here? I certainly wouldn't want to, e.g., give young kids unmonitored use of an LLM, or replace their books with AI-generated text, or stop directly engaging with their games and stories and outsource that to ChatGPT. But what part of "generate fun images for my kids - turn photos into a new style, create colouring pages from pictures, etc" is likely to be "unhealthy and bad for their development"?

bonesss

3 months ago

[-]

Customized, self-guided, tailor made kids content isn’t slop per se.

Colouring pages autogenerated for small kids is about as dangerous as the crayons involved.

Not slop, not unhealthy, not bad.

a_victorp

3 months ago

[-]

I see a post like this every time there are news about ChatGPT or OpenAI. I'm probably being paranoid but I keep thinking that it looks like bots or paid advertisement for Gemini

tenpoundhammer

3 months ago

[-]

I think people like me just enjoying sharing when something is working for them and they have a good experience. It probably gets voted up because people enjoy reading when that happens

jdiff

3 months ago

[-]

The consistent side comments about the interface to Gemini being "half baked" probably doesn't fit into that narrative.

jnordt

3 months ago

[-]

Can you share some examples of this where it gives better results?

For me both Gemini and ChatGPT (both paid versions Key in Gemini and ChatGPT Plus) give me similiar results in terms of "every day" research. Im sticking with ChatGPT at the moment, as the UI and scaffolding around the model is in my view better at ChatGpt (e.g. you can add more than one picture at once...)

For Software Development, I tested Gemini3 and I was pretty disappointed in comparison to Claude Opus CLI, which is my daily driver.

UltraSane

3 months ago

[-]

Google has such a huge advantage in the amount of training data with the Google search database and with YouTube and in terms of FLOPS with their TPUs.

razster

3 months ago

[-]

Just a fair warning, it likes to spell Acknowledge as Acknolwedge. And I've run into issues when it's accessing markdown guides, it loses track and hallucinates from time to time which is annoying.

bossyTeacher

3 months ago

[-]

A future where Google still dominates, is that a future we want? I feel a future with more players is better than one with just a single one. Competition is valuable for us consumers

melagonster

3 months ago

[-]

It happened at least once; when I asked too many questions, the Gemini web page stopped working because it was occupying too much RAM...

NickNaraghi

3 months ago

[-]

Straight up Silicon Valley warfare in the HN comment section.

bckr

3 months ago

[-]

Gemini is good at reading bad handwriting you say? Might need to give it a shot at my 10 years of journals

Razengan

3 months ago

[-]

It would be useful to see some examples of the differences and supposed strengths of Gemini so this doesn't come off as Google advertisement snarf.

Also, I would never, ever, trust Google for privacy or sign into a Google account except on YouTube (and clear cookies afterwards to stop them from signing me into fucking Search too).

m00dy

3 months ago

[-]

it's true that Gemini-3 pro is very good, I recently used it on deepwalker [0]. Its agentic performance is amazing. Much better than 5.1

[0]: https://deepwalker.xyz

_zagj

3 months ago

[-]

Could you elaborate on GPT-based stock analysis?

citizenpaul

3 months ago

[-]

What?? Am I using the same gemini as everyone else?

>OCR is phenomenal

I literally tried to OCR a TYPED document in Gemini today and it mangled it so bad I just transcribed it myself because it would take less time than futzing around with gemini.

> Gemini handles every single one of my uses cases much better and consistently gives better answers.

>coding

I asked it to update a script by removing some redundant logic yesterday. Instead of removing it it just put == all over the place essentially negating but leaving all the code and also removing the actual output.

>Stocks analysis

lol, now I know where my money comes from.

aix1

3 months ago

[-]

Was that with Gemini 3 Pro or a different Gemini model?

citizenpaul

3 months ago

[-]

Yes.

Today I asked it to make a short bit of code to query some info from an API. I needed it to not use the specific function X that is normally used. I added to its instructions "Never use function X" then asked it in the chat to confirm its rules. It then generated code using function X and a word soup explaining how it did not uses function X. Then I copy pasted the line and asked why it used function X and it said more word soup explaining how the function was not there. So yea not so good.

Daz912

3 months ago

[-]

No desktop app, not using it

3 months ago

[-]

HN doesn't have a dedicated desktop app either.

Daz912

3 months ago

[-]

HN isn't part of my daily workflow so I dont care

LorenDB

3 months ago

[-]

What is it with the Polish always messing up products?

(yes, /s)

petersumskas

3 months ago

[-]

It’s because their thoughts are Roman while they are always Russian to Finnish things.

Kenya believe it!

Anyway, I’m done here. Abyssinia.

labrador

3 months ago

[-]

I like their hotdogs

xyzsparetimexyz

3 months ago

[-]

Why do people pay for ai tools? I didn't get that. I feel like I just rotate between them on the free tiers. Unless you're paying for all of them, what's the point?

Zambyte

3 months ago

[-]

I pay for Kagi and get all of the major ones, a great search engine that I can tune to my liking, and the ability to link any model to my tuned web search.

Onewildgamer

3 months ago

[-]

Google AI mode constantly does mistakes and I go back to chatgpt even when I don't like it.

billyrnalvo

3 months ago

[-]

Oh my good heavens, gotta tell ya, you wrestled that rascal to the floor with a shit-eating grin! Good times my friend!

rallies

3 months ago

[-]

I work at the intersection of AI and investing, and I'm really amazed at the ability of this model to build spreadsheets.

I gave it a few tools to access sec filings (and a small local vector database), and it's generating full fledged spreadsheets with valid, real time data. Analysts in wallstreet are going to get really empowered, but for the first time, I'm really glad that retail investors are also getting these models.

Just put out the tool: https://github.com/ralliesai/tenk

npodbielski

3 months ago

[-]

Can't wait for being fired because some VP or other manager asked some model to prepare list of people with lowest productivity to pay ratio.

Model hallucinated half of the data?! Sorry we can't go back on this decision, that would make us look bad!

Or when some silly model will push everyone to invest in some radicoulous company and everybody will do it. Poisoning data attack to inject some I am Future Inc ™ company with high investment rate. After few months pocket money and vanish.

We are certainly going to live in interesting times.

buu700

3 months ago

[-]

That's more of a management problem than an AI problem. You could get the same result by replacing "model" with "intern" or "dude from Fiverr".

npodbielski

3 months ago

[-]

With one important difference: nobody would be able to tell if you did the spreadsheet or AI spew it. And you do not pay for that one specific task to be done out of your pocket.

https://docs.google.com/spreadsheets/d/1DVh5p3MnNvL4KqzEH0ME...

rallies

3 months ago

[-]

Here's a nice parsing of all the important financials from an SEC report. This used to be really hard a few years ago.

sumedh

3 months ago

[-]

Doesnt SEC provide XBRL data and the statements in excel?

monatron

3 months ago

[-]

Nice tool - I appreciate you sharing the work!

josalhor

3 months ago

[-]

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

3 months ago

[-]

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day

stego-tech

3 months ago

[-]

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

snet0

3 months ago

[-]

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

3 months ago

[-]

That means you're probably asking it to do very simple things.

3 months ago

[-]

I can confidently say that anecdotally you’re completely wrong, but I’ll also allow a very different definition of ‘simple’ and/or attempting to use an unpopular environment as a valid anecdotal counterpoint.

3 months ago

[-]

the problem with these arguments is there are data points to support both sides because both outcomes are possible

the real thing is are you or we getting an ROI and the answer is increasingly more yeses on more problems, this trend is not looking to plateau as we step up the complexity ladder to agentic system

3 months ago

[-]

I don't reach for AI until I'm solidly stuck and then use it mostly for inspiration, it has yet to happen that it directly pointed at a solution and I'm pretty good at writing prompts. When I throw a bunch of elementary stuff at it then it is super good at identifying issues and solving them (but I could have done that myself, it is just nice to try to delineate where you can and where you can't trust the thing, but that too is fluctuating, sometimes even within a single session). Here is a nice example for a slightly more complex test:

Prompt:

"I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"

Answer:

ASCII schematic (fully functional):

           +12V
            |
            |
           .-.
           | |
           | | 2.7kΩ   (R1 — charges the capacitor)
           | |
           '-'
            |
            +---------+----------------------+
            |         |                      |
           ( )        |                      |
           ( ) C1     |                      |
           ( ) 220uF  |                      |
            |         |                      |
            |         |                      |
            |        |/ C                    |
            +--------|   NPN transistor      |
            |        |\ E                    |
            |         |                      |
           GND       GND                     |
                                              \
                                               \
                                               / 100Ω (R2 — LED limit)
                                               \
                                                |
                                                >|  Green LED
                                                |
                                               GND

Follow by:

"How it works

The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.

The capacitor voltage rises slowly.

When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.

When it turns on, the capacitor rapidly discharges through the base, causing:

A brief pulse of current through the transistor

The LED lights up through the 100 Ω resistor

After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.

This repeats automatically → LED flasher."

The number of errors in the circuit and the utterly bogus explanation as well as the over confident remark that this is 'working' is so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.

dagss

3 months ago

[-]

I am right now implementing an imagining pipeline using OpenCV and TypeScript.

I have never used OpenCV specifically before, and have little imaging experience too. What I do have though is a PhD in astrophysics/statistics so I am able to follow along the details easily.

Results are amazing. I am getting results in 2 days of work that would have taken me weeks earlier.

ChatGPT acts like a research partner. I give it images and it explains why current scoring functions fails and throws out new directions to go in.

Yes, my ideas are sometimes better. Sometimes ChatGPT has a better clue. It is like a human collegue more or less.

And if I want to try something, the code is usually bug free. So fast to just write code, try it, throw it away if I want to try another idea.

I think a) OpenCV probably has more training data than circuits? and b) I do not treat it as a desperate student with no knowlegde.

I expect to have to guide it.

There are several hundred messages back and forth.

It is more like two researchers working together with different skill sets complementing one another.

One of those skillsets being to turn a 20 message conversation into bugfree OpenCV code in 20 seconds.

No, it is not providing a perfect solution to all problems on first iteration. But it IS allowing me to both learn very quickly and build very quickly. Good enough for me..

3 months ago

[-]

That's a good use case, and I can easily imagine that you get good results from it because (1) it is for a domain that you are already familiar with and (2) you are able to check that the results that you are getting are correct and (3) the domain that you are leveraging (coding expertise) is one that chatgpt has ample input for.

Now imagine you are using it for a domain that you are not familiar with, or one for which you can't check the output or that chatgpt has little input for.

If either of those is true the output will be just as good looking and you would be in a much more difficult situation to make good use of it, but you might be tempted to use it anyway. A very large fraction of the use cases for these tools that I have come across professionally so far are of the latter variety, the minority of the former.

And taking all of the considerations into account:

- how sure are you that that code is bug free?

- Do you mean that it seems to work?

- Do you mean that it compiles?

- How broad is the range of inputs that you have given it to ascertain this?

- Have you had the code reviewed by a competent programmer (assuming code review is a requirement)?

- Does it pass a set of pre-defined tests (part of requirement analysis)?

- Is the code quality such that it is long term maintainable?

emporas

3 months ago

[-]

I have used Gemini for reading and solving electronic schematics exercises, and it's results were good enough for me. Roughly 50% of the exercises managed to solve correctly, 50% wrong. Simple R circuits.

One time it messed up the opposite polarity of two voltage sources in series, and instead of subtracting their voltages, it added them together, I pointed out the mistake and Gemini insisted that the voltage sources are not in opposite polarity.

Schematics in general are not AIs strongest point. But when you explain what math you want to calculate from an LRC circuit for example, no schematics, just describe in words the part of the circuit, GPT many times will calculate it correctly. It still makes mistakes here and there, always verify the calculation.

3 months ago

[-]

I guess I'm just more critical than you are. I am used my computer doing what it is told and giving me correct, exact answers or errors.

dagss

3 months ago

[-]

I think most people treat them like humans not computers, and I think that is actually a much more correct way to treat them. Not saying they are like humans, but certainly a lot more like humans than whatever you seem to be expecting in your posts.

Humans make errors all the time. That doesn't mean having colleagues is useless, does it?

An AI is a colleague that can code very very fast and has a very wide knowledge base and versatility. You may still know better than it in many cases and feel more experienced that in. Just like you might with your colleagues.

And it needs the same kind of support that humans need. Complex problem? Need to plan ahead first. Tricky logic? Need unit tests. Research grade problem? Need to discuss through the solution with someone else before jumping to code and get some feedback and iterate for 100 messages before we're ready to code. And so on.

3 months ago

[-]

This is an excellent point, thank you.

emporas

3 months ago

[-]

There is also Mercury LLM, which computes the answer directly as a 2D text representation. I don't know if you are familiar with Mercury LLM, but you read correctly, 2D text output.

Mercury LLM might work better getting input as an ASCII diagram, or generating an output as an ASCII diagram, not sure if both input and output work 2D.

Plumbing/electrical/electronic schematics are pretty important for AIs to understand and assist us, but for the moment the success rate is pretty low. 50% success rate for simple problems is very low, 80-90% success rate for medium difficulty problems is where they start being really useful.

3 months ago

[-]

It's not really the quality of the diagramming that I am concerned with, it is the complete lack of understanding of electronics parts and their usual function. The diagramming is atrocious but I could live with it if the circuit were at least borderline correct. Extrapolating from this: if we use the electronics schematic as a proxy for the kind of world model these systems have then that world model has upside down lanterns and anti-gravity as commonplace elements. Three legged dogs mate with zebras and produce viable offspring and short circuiting transistors brings about entirely new physics.

3 months ago

[-]

it's hard for me to tell if the solution is correct or wrong because I've got next to no formal theoretical education in electronics and only the most basic 'pay attention to polarity of electrolytic capacitors' practical knowledge, but given how these things work you might get much better results when asking it to generate a spice netlist first (or instead).

I wouldn't trust it with 2d ascii art diagrams, there isn't enough focus on these in the training data is my guess - a typical jagged frontier experience.

[1] https://gist.github.com/pramatias/538f77137cb32fca5f626299a7...

emporas

3 months ago

[-]

I think you underestimate their capabilities quite a bit. Their auto-regressive nature does not lend well to solving 2D problems.

See these two solutions GPT suggested: [1]

Is any of these any good?

manmal

3 months ago

[-]

I have this mental model of LLMs and their capabilities, formed after months of way too much coding with CC and Codex, with 4 recursive problem categories:

1. Problems that have been solved before have their solution easily repeated (some will say, parroted/stolen), even with naming differences.

2. Problems that need only mild amalgamation of previous work are also solved by drawing on training data only, but hallucinations are frequent (as low probability tokens, but as consumers we don’t see the p values).

3. Problems that need little simulation can be simulated with the text as scratchpad. If evaluation criteria are not in training data -> hallucination.

4. Problems that need more than a little simulation have to either be solved by adhoc written code, or will result in hallucination. The code written to simulate is again a fractal of problems 1-4.

Phrased differently, sub problem solutions must be in the training data or it won’t work; and combining sub problem solutions must be either again in training data, or brute forcing + success condition is needed, with code being the tool to brute force.

I _think_ that the SOTA models are trained to categorize the problem at hand, because sometimes they answer immediately (1&2), enable thinking mode (3), or write Python code (4).

My experience with CC and Codex has been that I must steer it away from categories 2 & 3 all the time, either solving them myself, ask them to use web research, or split them up until they are (1) problems.

Of course, for many problems you’ll only know the category once you’ve seen the output, and you need to be able to verify the output.

I suspect that if you gave Claude/Codex access to a circuit simulator, it will successfully brute force the solution. And future models might be capable enough to write their own simulator adhoc (ofc the simulator code might recursively fall into category 2 or 3 somewhere and fail miserably). But without strong verification I wouldn’t put any trust in the outcome.

With code, we do have the compiler, tests, observed behavior, and a strong training data set with many correct implementations of small atomic problems. That’s a lot of out of the box verification to correct hallucinations. I view them as messy code generators I have to clean up after. They do save a ton of coding work after or while I‘m doing the other parts of programming.

3 months ago

[-]

This parallels my own experience so far, the problem for me is that (1) and (2) I can quickly and easily do myself and I'll do it in a way that respects the original author's copyright by including their work - and license - verbatim.

(3) and (4) level problems are the ones where I struggle tremendously to make any headway even without AI, usually this requires the learning of new domain knowledge and exploratory code (currently: sensor fusion) and these tools will just generate very plausible nonsense which is more of a time waster than a productivity aid. My middle-of-the-road solution is to get as far as I can by reading about the problem so I am at least able to define it properly and to define test cases and useful ranges for inputs and so on, then to write a high level overview document about what I want to achieve and what the big moving parts are and then only to resort to using AI tools to get me unstuck or to serve as a knowledge reservoir for gaps in domain knowledge.

Anybody that is using the output of these tools to produce work that they do not sufficiently understand is going to see a massive gain in productivity, but the underlying issues will only surface a long way down the line.

camdenreslink

3 months ago

[-]

Sometimes you do need to (as a human) break down a complex thing into smaller simple things, and then ask the LLM to do those simple things. I find it still saves some time.

ragequittah

3 months ago

[-]

Or what will often work is having the LLM break it down into simpler steps and then running them 1 by 1. They know how to break down problems fairly well they just don't often do it properly sometimes unless you explicitly prompt them to.

3 months ago

[-]

Yes, but for that you have to know that the output it gave you is wrong in the first place and if that is so you didn't need AI to begin with...

djeastm

3 months ago

[-]

Possibly, but a lot of value comes from doing very simple things faster.

3 months ago

[-]

That is a good point. A lot of work really is mostly simple things.

snet0

3 months ago

[-]

If you define "simple thing" as "thing an AI can't do", then yes. Everyone just shifts the goalposts in these conversations, it's infuriating.

3 months ago

[-]

Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!

3 months ago

[-]

It’s less shifting goalposts and more of a very jagged frontier of capabilities problem.

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v

3 months ago

[-]

I'm not sure, here's my anecdotal counter example, was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (also that I had fixed, but forgot was in this path)

That I was able to have a flash model replicate the same solution I had, to two problems in two turns, it's just the opposite experience of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). They are able to do more of them today then they were even 6-12 months ago (pre-thinking models).

stego-tech

3 months ago

[-]

And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.

I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.

Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.

3 months ago

[-]

I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.

If you are only using provider LLM experiences, and not something specific to coding like copilot or Claude code, that would be the first step to getting the magic as you say. It is also not instant. It takes time to learn any new tech, this one has a above average learning curve, despite the facade and hype of how it should just be magic

Once you find the stupid shit in the vendor coding agents, like all us it/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.

For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducible clone what one agent was doing and have a dozen variations iterate on the next turn. Real useful for long term training data and evals synth, but also for my own experimentation as I learn how to get better at using these things. Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context & money (finops), after burning $5 in 60s because of an error in my tool implementation. Instead of having them as message contents, they are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy has that been interesting and opened up some research questions in my mind

remich

3 months ago

[-]

Any particular papers or articles you've been reading that helped you devise this? Your experiments sound interesting and possibly relevant to what I'm doing.

3 months ago

[-]

Conversations among practitioners on Bluesky (there is an Ai subcommunity)

quantumHazer

3 months ago

[-]

Seems pretty false if you look at the model card and web site of Opus 4.5 that is… (check notes) their latest model.

3 months ago

[-]

Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing which is why they have models people like to use for their day-to-day.

I use Gemini, Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about claude for coding so I may try the model again through Vertex Ai...

The person who made the speculation I believe was more talking about blog posts and media statements than model cards. Most ai announcements come with benchmark touting, Anthropic supposedly does less / little of this in their announcements. I haven't seen or gathered the data to know what is truth

elcritch

3 months ago

[-]

You could try Codex cli. I prefer it over Claude code now, but only slightly.

3 months ago

[-]

No thanks, not touching anything Oligarchy Altman is behind

Mistletoe

3 months ago

[-]

How do you measure whether it works better day to day without benchmarks?

bulbar

3 months ago

[-]

Manually labeling answers maybe? There exist a lot of infrastructure built around and as it's heavily used for 2 decades and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well known / public ones.

3 months ago

[-]

Internal evals, Big AI certainly has good, proprietary training and eval data, it's one reason why their models are better

aydyn

3 months ago

[-]

Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.

3 months ago

[-]

How would published numbers be useful without knowing what the underlying data being used to test and evaluate them are? They are proprietary for a reason

To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees

aydyn

3 months ago

[-]

Do you know everything that exists in public benchmarks?

They can give a description of what their metrics are without giving away anything proprietary.

3 months ago

[-]

I'd recommend watching Nathan Lambert's video he dropped yesterday on Olmo 3 Thinking. You'll learn there's a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce

Nathan is at Ai2 which is all about open sourcing the process, experience, and learnings along the way

aydyn

3 months ago

[-]

Thanks for the reference I'll check it out. But it doesnt really take away from the point I am making. If a level of description would give away proprietary information, then go one level up to a more vague description. How to describe things to a proper level is more of a social problem than a technical one.

3 months ago

[-]

You seem stuck on the idea that they should have to share information when they don't have to. That they share any is a welcome change. Push too hard and they may stop sharing as much

standardUser

3 months ago

[-]

Subscriptions.

mrguyorama

3 months ago

[-]

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

standardUser

3 months ago

[-]

It's all the shareholders care about. These are not research institutions.

brokensegue

3 months ago

[-]

how do you quantitatively measure day-to-day quality? only thing i can think is A/B tests which take a while to evaluate

https://www.youtube.com/watch?v=uaZ3yRdYg8A

3 months ago

[-]

more or less this, but also synthetic

if you think about GANs, it's all the same concept

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate

You can use a mix of both real and synthetic chat sessions or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertises.

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what all is de jour in the RL / late training, that's where a lot of opportunity lies from my understanding so far

Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) has a really great video out yesterday about the experience training Olmo 3 Think

HDThoreaun

3 months ago

[-]

Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.

fwip

3 months ago

[-]

It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.

3 months ago

[-]

Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

npinsker

3 months ago

[-]

Completely false. This is like saying being good at chess is equivalent to being smart.

Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

mrandish

3 months ago

[-]

ARC-AGI was designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.

npinsker

3 months ago

[-]

I agree with you. I agree it's valuable work. I totally disagree with their claim.

A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.

Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.

yovaer

3 months ago

[-]

> but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems

IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, then there would be a lot fewer open problems around than what's the case.

keeda

3 months ago

[-]

Maybe I'm misinterpreting your point, but this makes it seem that your standard for "intelligence" is "inventing entirely new techniques"? If so, it's a bit extreme, because to a first approximation, all problem solving is combining and applying existing techniques in novel ways to new situations.

At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.

As you can imagine, this is very difficult and hence rather uncommon, typically only accomplished by a handful of people in any given discipline, i.e way above the standards of the general population.

I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.

3 months ago

[-]

Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

fwip

3 months ago

[-]

Took a couple just now. It seems like a straight-forward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little bit friendlier to machines.

Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.

3 months ago

[-]

Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

fwip

3 months ago

[-]

Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.

3 months ago

[-]

With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.

jimbokun

3 months ago

[-]

Is it different every time? Otherwise the training could just memorize the answers.

3 months ago

[-]

The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.

The idea behind Arc-AGI is that you can train all you want on the answers, because knowing the solution to one problem isn't helpful on the others.

In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.

That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.

(IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")

3 months ago

[-]

It's very much a vision test. The reason all the models don't pass it easily is only because of the vision component. It doesn't have much to do with reasoning at all

esafak

3 months ago

[-]

I would not be so sure. You can always prep to the test.

HDThoreaun

3 months ago

[-]

How do you prep for arc agi? If the answer is just "get really good at pattern recognition" I do not see that as a negative at all.

ben_w

3 months ago

[-]

It can be not-negative without being sufficient.

Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.

Streetlight effect for "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet, the LLMs are somehow missing a lot of stuff and we have to keep inventing new street lights to search underneath: https://en.wikipedia.org/wiki/Streetlight_effect

HDThoreaun

3 months ago

[-]

I dont think many people are saying 100% arc-agi 2 is equivalent to AGI(names are dumb as usual). Its just the best metric I have found, not the final answer. Spatial reasoning is an important part of intelligence even if it doesnt encompass all of it.

3 months ago

[-]

Note that GPT 5.2 newly supports a "xhigh" reasoning level, which could explain the better benchmarks.

It'll be noteworthy to see the cost-per-task on ARC AGI v2.

https://arcprize.org/leaderboard

granzymes

3 months ago

[-]

> It'll be noteworthy to see the cost-per-task on ARC AGI v2.

Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).

The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).

3 months ago

[-]

Huh, that is indeed up and to left of Opus.

walletdrainer

3 months ago

[-]

5.1-codex supports that too, no? Pretty sure I’ve been using xhigh for at least a week now

causal

3 months ago

[-]

That ARC AGI score is a little suspicious. That's a really tough for AI benchmark. Curious if there were improvements to the test harness because that's a wild jump in general problem solving ability for an incremental update.

woeirua

3 months ago

[-]

They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out of distribution performance is still awful.

taurath

3 months ago

[-]

I don’t think their words mean just about anything, only the behavior of the models.

Still waiting of Full Self Driving myself.

fuddle

3 months ago

[-]

I don't think SWE Verified is an ideal benchmark, as the solutions are in the training dataset.

joshuahedlund

3 months ago

[-]

I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.

thinkingtoilet

3 months ago

[-]

Open AI has already been busted for getting benchmark information and training the models on that. At this point if you believe Sam Altman, I have a bridge to sell you.

catigula

3 months ago

[-]

Yes, but it's not good enough. They needed to surpass Opus 4.5.

mikairpods

3 months ago

[-]

that is better...?

poormathskills

3 months ago

[-]

For a minor version update (5.1 -> 5.2) that's a way bigger improvement than I would have guessed.

beering

3 months ago

[-]

Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).

CodeCompost

3 months ago

[-]

For the first time, I've actually hidden an AI story on HN.

I can't even anymore. Sorry this is not going anywhere.

andybak

3 months ago

[-]

How this is different to any other post announcing an incremental improvement in an app or service?

mabedan

3 months ago

[-]

It’s a little different. Most of these improvements are just more training hours and better weights. Even if it’s about actual improvement in trining algorithm or other software tweaks they’re not open source and hence other than “look how marginally nicer the chat bot responds now” the post doesn’t provide value.

gchokov

3 months ago

[-]

Here, take my downvote.

bigyabai

3 months ago

[-]

In lieu of a killer app?

doctoboggan

3 months ago

[-]

This seems like another "better vibes" release. With the number of benchmarks exploding, random luck means you can almost always find a couple showing what you want to show. I didn't see much concrete evidence this was noticeably better than 5.1 (or even 5.0).

Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

sebzim4500

3 months ago

[-]

>I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

I doubt it, given it is more expensive than the old model.

rat9988

3 months ago

[-]

> I didn't see much concrete evidence this was noticeably better than 5.1

Did you test it?

doctoboggan

3 months ago

[-]

No, I would like to but I don't see it in my paid ChatGPT plan or in the API yet. I based my comment solely off of what I read in the linked announcement.

3 months ago

[-]

At this point the benchmark soup is so dense that it's hard to tell signal from selective framing

flkiwi

3 months ago

[-]

I gave up my OpenAI subscription a few days ago in favor of Claude. My quality of life (and quality of results) has gone up substantially. Several of our tools at work have GPT-5x as their backend model, and it is incredible how frustrating they are to use, how predictable their AI-isms are, and how inconsistent their output is. OpenAI is going to have to do a lot more than an incremental update to convince me they haven't completely lost the thread.

brisket_bronson

3 months ago

[-]

You are absolutely right!

flkiwi

3 months ago

[-]

Someone didn't think so, lol. I debated not saying anything because the AI partisans are just so awful.

jpkw

3 months ago

[-]

I think the above comment was a joke (Claude frequently says that whenever you challenge it, whether you are right or wrong)

3 months ago

[-]

At least this once the AI-ism was not spotted.

flkiwi

3 months ago

[-]

Goodness no, I chuckled.

petesergeant

3 months ago

[-]

I have found Codex to be a phenomenal code-review tool, fwiw. Shitty at writing code, _great_ at reviewing it.

https://x.com/OpenAI/status/1999182104362668275

Tiberium

3 months ago

[-]

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://i.imgur.com/e0iB8KC.png

varenc

3 months ago

[-]

100% on the AIME (assuming its not in the training data) is pretty impressive. I got like 4/15 when I was in HS...

hellojimbo

3 months ago

[-]

The no tools part is impressive, with tools every model gets 100%

varenc

3 months ago

[-]

If I recall, the AIME answers are always 4 digits numbers. And most of the problems are of the type where if you have a candidate number it's reasonable to validate its correctness. So easy to brute force all 4 digit ints with code.

tl;dr; humans would do much better too if they could use programming tools :)

Davidzheng

3 months ago

[-]

uh no it's not solved by looping over 4 digit numbers when it uses tools

blitz_skull

3 months ago

[-]

Again I just tap the sign.

All of your benchmarks mean nothing to me until you include Claude Sonnet on them.

In my experience, GPT hasn’t been able to compete with Claude in years for the daily “economically valuable” tasks I work on.

https://x.com/OpenAI/status/1999182104362668275

3 months ago

[-]

Since as per Anthropics own benchmarks Sonnet 4.5 is beaten by Opus 4.5 would it not suffice to infer the rest?

nextworddev

3 months ago

[-]

Claude is pretty trash for anything besides coding

wyre

3 months ago

[-]

What are you basing that on? Between Sonnet and Opus I don't think I'm reaching for Gemini 3 at all.

romanovcode

3 months ago

[-]

Yeah, but that is the whole point of Claude. And that's why we are interested in the comparison.

timmg

3 months ago

[-]

That hasn't been my experience at all. I always wondered if we just get used to how to prompt a given model and that it hard to transition to another.

ComputerGuru

3 months ago

[-]

Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?

eldenring

3 months ago

[-]

I'm guessing they were waiting to figure out more efficient serving before a release, and have decided to eat the inference cost temporarily to stay at the frontier.

famouswaffles

3 months ago

[-]

Open AI sat on GPT-4 for 8 months and even released 3.5 months after 4 was trained. While i don't expect such big lag times anymore, generally, it's a given the public is behind whatever models they have internally at the frontier. By all indications, they did not want to release this yet, and only did so because of Gemini-3-pro.

nathan-wall

3 months ago

[-]

If you look at their own chart[1] it shows 5.1 was lagging behind Gemini 3 Pro in almost every score listed there, sometimes significantly. They needed to come out with something to stay ahead. I'm guessing they threw what they had at their disposal together to keep the lead as long as they can. It sounds like 5.2 has a more recent knowledge cutoff; a reasonable guess is they could have already had that but were trying to make bigger improvements out of it for a more major 5.5 release before Gemini 3 Pro came out and then they had to rush something out. Also 5.2 has a new "Extended Thinking" option for Pro. I'm guessing they just turned up a lever that told it to think even longer, which helps them score higher, even if it does take a long time. (One thing about Gemini 3 Pro is it's very fast relative to even ChatGPT 5.1 Pro Thinking. A lot of the scores they're putting out to show they're staying ahead aren't showing that piece.)

[1] https://imgur.com/e0iB8KC

dalemhurley

3 months ago

[-]

My guess is they develop multiple models in parallel.

youngermax

3 months ago

[-]

Isn't it interesting how this incremental release includes so many testimonials from companies who claim the model has improved? It also focuses on "economically valuable tasks." There was nothing of this sort in GPT-5.1's release. Looks like OpenAI feeling the pressure from investors now.

sfmike

3 months ago

[-]

Everything is still based on 4 4o still right? is a new model training just too expensive? They can consult deepseek team maybe for cost constrained new models.

elgatolopez

3 months ago

[-]

Where did you get that from? Cutoff date says august 2025. Looks like a newly pretrained model

- https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

3 months ago

[-]

> This stands in sharp contrast to rivals: OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome.

It's also plainly obvious from using it. The "Broadly deployed" qualifier is presumably referring to 4.5

ric2b

3 months ago

[-]

How is that a technical hurdle if they obviously were able to do it before?

It's probably just a question of cost/benefit analysis, it's very expensive to do, so the benefits need to be significant.

SparkyMcUnicorn

3 months ago

[-]

If the pretraining rumors are true, they're probably using continued pretraining on the older weights. Right?

3 months ago

[-]

Apparently they have not had a successful pre training run in 1.5 years

fouronnes3

3 months ago

[-]

I want to read a short scify story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers but no one can really train a new AI from scratch. This starts cults, wars and legends and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, more than a century later?

WhyOhWhyQ

3 months ago

[-]

There's a scifi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course after things get set up again due to his expertise, he becomes low status again.

bradfitz

3 months ago

[-]

I had to go look that up! I assume that's https://en.wikipedia.org/wiki/The_Feeling_of_Power ? (Not a janitor, but "a low grade Technician"?)

WhyOhWhyQ

3 months ago

[-]

Hmm it could be a false memory, since this was almost 15 years ago, but I really do remember it differently than the text of 'Feeling of Power'.

https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...

3 months ago

[-]

You can ask 2025 Ai to write such a book, it's happy to comply and may or may not actually write the book

3 months ago

[-]

Sounds good.

Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.

(That's the end. Zombies are always the end.)

3 months ago

[-]

This is somewhat similar to a Piers Anthony series that I suspect noone has ever read except for me.

What was with that guy anyway.

wafflemaker

3 months ago

[-]

Sorry, but compared with the parent, my money is in you ssl-3. Do you get better results from prompting by being more poetic?

3 months ago

[-]

> Do you get better results from prompting by being more poetic?

Is that yet-another accusation of having used the bot?

I don't use the bot to write English prose. If something I write seems particularly great or poetic or something, then that's just me: I was in the right mood, at the right time, with the right idea -- and with the right audience.

When it's bad or fucked-up, then that's also just me. I most-assuredly fuck up plenty.

They can't all be zingers. I'm fine with that.

---

I do use the hell out of the bot for translating my ideas (and the words that I use to express them) into languages that I can't speak well, like Python, C, and C++. But that's very different. (And at least so far I haven't shared any of those bot outputs with the world at all, either.)

So to take your question very literally: No, I don't get better results from prompting being more poetic. The responses to my prompts don't improve by those prompts being articulate or poetic.

Instead, I've found that I get the best results from the bot fastest by carrying a big stick, and using that stick to hammer and welt it into compliance.

Things can get rather irreverent in my interactions with the bot. Poeticism is pretty far removed from any of that business.

wafflemaker

3 months ago

[-]

No. I just genuinely liked your style, and didn't notice previous posts by you. I haven't yet learned to look at names on hn, it's mostly anonymous posts for me. No snark here. And was also genuinely curious if better writing style yields better results.

I've observed that using proper grammar gives slightly better answers. And using more "literacy"(?) kind of language in prompts sometimes gives better answers and sometimes just more interesting ones, when bots try to follow my style.

Sorry for using the word poetic, I'm travelling and sleep deprived and couldn't find the proper word, but didn't want to just use "nice" instead either.

3 months ago

[-]

It's all good. I'm largely "face-blind", myself, in that I don't often recognize others in person or online -- which is certainly not to say that I think I'm particularly memorable myself.

As to the bot: Man, I beat the bot to death. It's pretty brutal.

I'm profane and demanding because that's the most terse language I know how to construct in English.

When I set forth to have the bot do a thing for me, the slowest part of the process that I can improve on my part is the quantity of the words that I use.

I can type fast and think fast, but my one-letter-at-a-time response to the bot is usually the only part that that I can make a difference with. So I tend to be very terse.

"a+b=c, you fuck!" is certainly terse, unambiguous, and fast to type, so that's my usual style.

Including the emphatic "you fuck!" appendage seems to stir up the context more than without. Its inclusion or omission is a dial that can be turned.

Meanwhile: "I have some reservations about the proposed implementation. Might it be possible for you to revise it so as to be in a different form? As previously discussed, it is my understanding that a+b=c. Would you like to try again to implement a solution that incorporates this understanding?" is very slow to write.

They both get similar results. One method is faster for me than the other, just because I can only type so fast. The operative function of the statement is ~the same either way.

(I don't owe the bot anything. It isn't alive. It is just a computer running a program. I could work harder to be more polite, empathetic, or cordial, but: It's just code running on a box somewhere in a datacenter that is raising my electric rate and making the RAM for my next system upgrade very expensive. I don't owe it anything, much less politeness or poeticism.

Relatedly, my inputs at the bash prompt on my home computer are also very terse. For instance I don't have any desire or ability to be polite to bash; I just issue commands like ls and awk and grep without any filler-words or pleasantries. The bot is no different to me.

When I want something particularly poetic or verbose as output from the bot, I simply command it to be that way.

It's just a program.)

georgefrowny

3 months ago

[-]

An software version of Asimov's Holmes-Ginsbook device? https://sfwritersworkshop.org/node/1232

I feel like there was a similar one about software, but it might have been mathematics (also Asimov: The Feeling of Power)

barrenko

3 months ago

[-]

Monsieur, if I may offer a vaaaguely similar story on how things may progress https://www.owlposting.com/p/a-body-most-amenable-to-experim...

armenarmen

3 months ago

[-]

I’d read it!

ijl

3 months ago

[-]

What kind of issues could prevent a company with such resources from that?

https://arstechnica.com/information-technology/2024/05/chief...

3 months ago

[-]

Drama if I had to pick the symptom most visible from the outside.

A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?

Wowfunhappy

3 months ago

[-]

I thought whenever the knowledge cutoff increased that meant they’d trained a new model, I guess that’s completely wrong?

rockinghigh

3 months ago

[-]

They add new data to the existing base model via continuous pre-training. You save on pre-training, the next token prediction task, but still have to re-run mid and post training stages like context length extension, supervised fine tuning, reinforcement learning, safety alignment ...

3 months ago

[-]

Continuous pretraining has issues because it starts forgetting the older stuff. There is some research into other approaches.

brokencode

3 months ago

[-]

Typically I think, but you could pre-train your previous model on new data too.

I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.

catigula

3 months ago

[-]

The irony is that Deepseek is still running with a distilled 4o model.

blovescoffee

3 months ago

[-]

Source?

tpurves

3 months ago

[-]

Undoubtedly each new model from OpenAi has numerous training and orchestration improvements etc.

But how much of each product they release also just a factor of how much they are willing to spend on inference per query in order to stay competitive?

I always wonder how much is technical change vs turning a knob up and down on hardware and power consumption.

GTP5.0 for example seemed like a lot of changes more for OpenAI's internal benefit (terser responses, dynamic 'auto' mode to scale down thinking when not required etc.)

Wondering if GPT5.2 is also case of them in 'code red mode' just turning what they already have up to 11 as a fastest way to respond to fiercer competion.

simonsarris

3 months ago

[-]

I always liked the definition of technology as "doing more with less". 100 oxen replaced by 1 gallon of diesel, etc.

That it costs more does suggest it's "doing more with more", at least.

psychoslave

3 months ago

[-]

Good luck with reproducing and eating diesel like can be done with oxen and related species.

Humanity won't be able to tap into this highly compressed energy stock that was generated through processes taking literally geological scales time to bed achieved.

That is, technology is more about what alternative tradeoffs can we leverage on to organize differently with resources at hand.

Frugality can definitely be a possible way to shape the technologies we want to deploy. But it's not all possible technologies, just a subset.

Also better technology is not necessarily bringing societies to morale and well-being excellency. Improving technology for efficient genocides for example is going to bring human disaster as obvious outcome, even if it's done in a manner that is the most green, zero-carbon emissions and growing more forests delivered beyond expectations of the specifications.

sigmar

3 months ago

[-]

Are there any specifics about how this was trained? Especially when 5.1 is only a month old. I'm a little skeptical of benchmarks these days and wish they put this up on llmarena

edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)

emp17344

3 months ago

[-]

I’m extremely skeptical because of all those articles claiming OpenAI was freaking out about Gemini - now it turns out they just casually had a better model ready to go? I don’t buy it.

3 months ago

[-]

I (and others) have a strong suspicion that they can modulate models intelligence in almost real time by adjusting quantization and thinking time.

It seems if anyone wants, they can really gas a model up in the moment and back it off after the hype wave.

qeternity

3 months ago

[-]

Quantization is not some magical dial you can just turn. In practice you basically have 3 choices: fp16, fp8 and fp4.

Also thinking time means more tokens which costs more especially at the API level where you are paying per token and would be trivially observable.

There is basically no evidence that either of these are occurring in the way you suggest (boosting up and down).

3 months ago

[-]

API users probably wouldn't be affected since they are paying in full. Most people complaining are free users, followed by $20/mo users.

bamboozled

3 months ago

[-]

Yeah I've noticed with Claude, around the time of the Opus 4.5 release, at least for a few days, Sonnet 4.5 was just dumb, but it seems temporary. I feel that redirected resources to Opus.

tempaccount420

3 months ago

[-]

They had to rush it out, I'm sure the internal safety folks are not happy about it.

robots0only

3 months ago

[-]

how do you know this is a better model? I wouldn't take any of the numbers at face value especially when all they have done is more/better post-training and thus the base pre-trained model capabilities is still the same. The model may just elicit some of the benchmark capabilities better. You really need to spend time using the model to come to any reliable conclusions.

bamboozled

3 months ago

[-]

It's very inline with their PR strategy, or lack of.

kouteiheika

3 months ago

[-]

Unfortunately there are never any real specifics about how any of their models were trained. It's OpenAI we're talking about after all.

nezaj

3 months ago

[-]

We saw it do better at making counter-strike! https://x.com/instant_db/status/1999278134504620363?s=20

dumbmrblah

3 months ago

[-]

Great! It'll be SOTA for a couple of weeks until the quality degrades due to throttling.

I'll stick with plug and play API instead.

mrandish

3 months ago

[-]

Due to the "Code Red" threat from Gemini 3, I suspect they'll hold off throttling for longer than usual (by incinerating even more investor capital than usual).

Jump in and soak up that extra-discounted compute while the getting is good, kids! Personally, I recently retired so I just occasionally mess around with LLMs for casual hobby projects, so I've only ever used the free tier of all the providers. Having lived through the dot com bubble, I regret not soaking up more of the free and heavily subsidized stuff back then. Trying not to miss out this time. All this compute available for free or below cost won't last too much longer...

dankwizard

3 months ago

[-]

I've been using tools like ProxLLM which just slam these AI models via proxy everytime a free tier limit is hit and it works great.

ssvss

3 months ago

[-]

can you provide a link to this tool, a search for proxllm didn't seem to find anything related.

ImprobableTruth

3 months ago

[-]

An almost 50% price increase. Benchmarks look nice, but 50% more nice...?

arnaudsm

3 months ago

[-]

#1 models are usually priced at 2x more than the competition, and they often decrease the price right when they lose the crown.

wewtyflakes

3 months ago

[-]

There are too few examples to say this is a trend. There have been counterexamples of top models actually lowering the pricing bar (gpt-5, gpt-3.5-turbo, some gemini releases were even totally free [at first]).

ClipNoteBook

3 months ago

[-]

ChatGPT seems to just randomly pick urls to cite and extract information from. Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced

devinprater

3 months ago

[-]

Can the tables have column headers so my screen reader can read the model name as I go across the benchmakrs? And the images should have alt-text.

jiggawatts

3 months ago

[-]

Feels a bit rushed. They haven’t even updated their API playground yet, if I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.

mattas

3 months ago

[-]

Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!

Sir_Twist

3 months ago

[-]

Not an expert in LLM benchmarks, but I generally I think of benchmarks as being good particularly for measuring usefulness for certain usecases. Even if measuring LLMs is not as straightforward as, say, read/write speeds when comparing different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?

olliepro

3 months ago

[-]

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.

Wowfunhappy

3 months ago

[-]

I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.

olliepro

3 months ago

[-]

It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste, but not everyone knows physics, biology, math, every language… poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

An interesting problem since the creators of OLMO have mentioned that throughout training, they use 1/3 or their compute just doing evaluations.

Edit:

One nice thing about the “critic” approach is that the restaurant (or model provider) doesn’t have access to the benchmark to quasi-directly optimize against.

HardCodedBias

3 months ago

[-]

Huge fan that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical ?

2) Did the safety somehow hold back the company ?

looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

mrandish

3 months ago

[-]

> I wonder why they held this back.

IMHO, I doubt they were holding much back. Obviously, they're always working on 'next improvements' and rolled what was done enough into this but I suspect the real difference here is throwing significantly more compute (hence investor capital) at improving the quality - right now. How much? While the cost is currently staying the same for most users, the API costs seem to be ~40% higher.

The impetus was the serious threat Gemini 3 poses. Perception about ChatGPT was starting to shift, people were speculating that maybe OAI is more vulnerable than assumed. This caused Altman to call an all-hands "Code Red" two weeks ago, triggering a significant redeployment of priorities, resources and people. I think this launch is the first 'stop the perceptual bleeding' result of the Code Red. Given the timing, I think this is mostly akin to overclocking a CPU or running an F1 race car engine too hot to quickly improve performance - at the cost of being unsustainable and unprofitable. To placate serious investor concerns, OAI has recently been trying to gradually work toward making current customers profitable (or at least less unprofitable). I think we just saw the effort to reduce the insane burn rate go out the window.

SkyPuncher

3 months ago

[-]

Given the price increase and speculation that GPT 5 is a MoE model, I'm wondering if they're simply "turning up the good stuff" without making significant changes under the hood.

3 months ago

[-]

I'm not sure why being a MoE model would allow OpenAI to "turn up the good stuff". You can't just increase the number of E without training it as such.

SkyPuncher

3 months ago

[-]

My opinion is they're trying to internally route requests to cheaper experts when they think they can get away with it. I felt this was evident by the wild inconsistencies I'd experience using it for coding. Both in quality and latency

You "turn of the good stuff" by eliminating or reducing the likelihood of the cheap experts handling the request.

yberreby

3 months ago

[-]

Based on what works elsewhere in deep learning, I see no reason why you couldn't train once with a randomized number of experts, then set that number during inference based on your desired compute-accuracy tradeoff. I would expect that this has been done in the literature already.

throwaway314155

3 months ago

[-]

GPT 4o was an MoE model as well.

a_wild_dandan

3 months ago

[-]

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time an OAI released wildly inaccurate copy/pasted bar charts.

Imnimo

3 months ago

[-]

It does seem to raise fair questions about either the utility of these tools, or adoption inertia. If not even OpenAI feels compelled to integrate this kind of model-check into their pipeline, what's that say about the business world at-large? Is it that it's too onerous to set up, is it that it's too hard to get only true-positive corrections, is it that it's too low value for the effort?

JumpCrisscross

3 months ago

[-]

> what's that say about the business world at-large?

Nothing. OpenAI is a terrible baseline to extrapolate anything from.

MaxikCZ

3 months ago

[-]

I always remember this old image https://i.imgur.com/MCsOM8e.jpeg

boplicity

3 months ago

[-]

Their model doesn't handle punctuation, quote marks, and similar things very well at all.

Bengalilol

3 months ago

[-]

It may have been used, how could we know?

Mainly, I don't get why there are quote marks at all.

layer8

3 months ago

[-]

Humans are now expected to parse sloppy typing without complaining about it, just like LLMs do. Slop is the new normal.

croes

3 months ago

[-]

Maybe they did

dangelosaurus

3 months ago

[-]

I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...

int_19h

3 months ago

[-]

Good. If I ask AI to generate "harmful" content, I want it to comply, not lecture me.

akshay326

3 months ago

[-]

wow thats motivated attacking indeed in your experience, how does thinking (say using high thinking instead none/low) impact red team eval?

whereistejas

3 months ago

[-]

Did anyone notice how Cursor wasn’t an early tester? I wonder why…

rishabhaiover

3 months ago

[-]

After I saw Opus 4.5 search through zig's std io because it wasn't aware of a breaking change in the recent release, I fell in love with claude-code and I don't see a strong enough reason to switch to codex at the moment.

jasonthorsness

3 months ago

[-]

Does anyone have it yet in ChatGPT? I'm still on 5.1 :(.

3 months ago

[-]

> We deploy GPT‑5.2 gradually to keep ChatGPT as smooth and reliable as we can; if you don’t see it at first, please try again later.

mudkipdev

3 months ago

[-]

No, but it's already in codex

jasonthorsness

3 months ago

[-]

I have it now

hbarka

3 months ago

[-]

A year ago Sunday Pichai declared code red, now it’s Sam Altman declaring code red. How tables have turned, and I think the acquisition of Windsurf and Kevin Hou by Google seems to correlate with their level up.

jerrygenser

3 months ago

[-]

Acquisition of noam shazeer to supercharge their Gemini flagship model line I think made a bigger impact.

To make an argument it was Kevin Hou, then we would need to see Antigravity their new IDE being key. I think the crown jewel are the Gemini models.

fulafel

3 months ago

[-]

So GDPval is OpenAI's own benchmark. PDF link: https://arxiv.org/pdf/2510.04374

yearolinuxdsktp

3 months ago

[-]

Plus users are now defaulted to a faster, less deep GPT-5.2 Thinking mode called “Standard”, and you now have to manually select “Extended” to get back to previous deep thinking level for Plus users. Yet the 3K messages a week quota is the same regardless of thinking level. Also, the selection does not sync to mobile (you know, just not enough RAM in computers these days to persist a setting between web and mobile).

3 months ago

[-]

> Additionally, on our internal benchmark of junior investment banking analyst spreadsheet modeling tasks—such as putting together a three-statement model for a Fortune 500 company with proper formatting and citations, or building a leveraged buyout model for a take-private—GPT 5.2 Thinking's average score per task is 9.3% higher than GPT‑5.1’s, rising from 59.1% to 68.4%.

Confirming prior reporting about them hiring junior analysts

3 months ago

[-]

I’ve been using GPT-4o and now 5.2 pretty much daily, mostly for creative and technical work. What helped me get more out of it was to stop thinking of it as a chatbot or knowledge engine, and instead try to model how it actually works on a structural level.

The closest parallel I’ve found is Peter Gärdenfors’ work on conceptual spaces, where meaning isn’t symbolic but geometric. Fedorenko’s research on predictive sequencing in the brain fits too. In both cases, the idea is that language follows a trajectory through a shaped mental space, and that’s basically what GPT is doing. It doesn’t know anything, but it generates plausible paths through a statistical terrain built from our own language use.

So when it “hallucinates”, that’s not a bug so much as a result of the system not being grounded. It’s doing what it was designed to do: complete the next step in a pattern. Sometimes that’s wildly useful. Sometimes it’s nonsense. The trick is knowing which is which.

What’s weird is that once you internalise this, you can work with it as a kind of improvisational system. If you stay in the loop, challenge it, steer it, it feels more like a collaborator than a tool.

That’s how I use it anyway. Not as a source of truth, but as a way of moving through ideas faster.

3 months ago

[-]

Once you drop the idea that it's a knowledge oracle and start treating it as a system that navigates a probability landscape, a lot of the confusion just evaporates

ostacke

3 months ago

[-]

Interesting concept with conceptual spaces, but how does that affect how you work with LLM:s in practice?

3 months ago

[-]

I think of it like improvising with a very skilled but slightly alien musician.

If you just hand it a chord chart, it’ll follow the structure. But if you understand the kinds of patterns it tends to favour, the statistical shapes it moves through, you can start composing with it, not just prompting it.

That’s where Gärdenfors helped me reframe things. The model isn’t retrieving facts. It’s traversing a conceptual space. Once you stop expecting grounded truth and start tracking coherence, internal consistency, narrative stability, you get a much better sense of where it’s likely to go off course.

It reminds me of salespeople who speak fluently without being aligned with the underlying subject. Everything sounds plausible, but something’s off. LLMs do that too. You can learn to spot the mismatch, but it takes practice, a bit like learning to jam. You stop reading notes and start listening for shape.

dinobones

3 months ago

[-]

It's becoming challenging to really evaluate models.

The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.

Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.

The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.

This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.

And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.

I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.

Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.

Libidinalecon

3 months ago

[-]

Totally agree. I just got a free trial month I guess to try to bring me back to chatGPT but I don't really know what to ask it to display if it is on par with Gemini.

I really have a sinking feel right now actually of what an absolute giant waste of capital all this is.

I am glad for all the venture capital behind all this to subsidize my intellectual noodlings on a super computer but my god what have we done?

This is so much fun but this doesn't feel like we are getting closer to "AGI" after using Gemini for about 100 hours or so now. The first day maybe but not now when you see how off it can still be all the time.

3 months ago

[-]

The good old "benchmarks just keep saturating" problem.

Anthropic is genuinely one of the top companies in the field, and for a reason. Opus consistently punches above its weight, and this is only in part due to the lack of OpenAI's atrocious personality tuning.

Yes, the next stop for AI is: increasing task length horizon, improving agentic behavior. The "raw general intelligence" component in bleeding edge LLMs is far outpacing the "executive function", clearly.

imiric

3 months ago

[-]

Shouldn't the next stop be to improve general accuracy, which is what these tools have struggled with since their inception? Until when are "AI" companies going to offload the responsibility on the user to verify the output of their tools?

Optimizing for benchmark scores, which are highly gamed to begin with, by throwing more resources at this problem is exceedingly tiring. Surely they must've noticed the performance plateau and diminishing returns of this approach by now, yet every new announcement is the same.

3 months ago

[-]

What "performance plateau"? The "plateau" disappears the moment you get harder unsaturated benchmarks.

It's getting more and more challenging to do that - just not because the models don't improve. Quite the opposite.

Framing "improve general accuracy" as "something no one is doing" is really weird too.

You need "general accuracy" for agentic behavior to work at all. If you have a simple ten step plan, and each step has a 50% chance of an unrecoverable failure, then your plan is fucked, full stop. To advance on those benchmarks, the LLM has to fail less and recover better.

Hallucinations is a "solvable but very hard to solve" problem. Considerable progress is being made on it, but if there's "this one weird trick" that deletes hallucinations, then we sure didn't find it yet. Humans get a body of meta-knowledge for free, which lets them dodge hallucinations decently well (not perfectly) if they want to. LLMs get pathetic crumbs of meta-knowledge and little skill in using it. Room for improvement, but, not trivial to improve.

aaroninsf

3 months ago

[-]

As a popcorn eating bystander it is striking to scan the top comments and find they alternate so dramatically in tone and conclusions.

zhyder

3 months ago

[-]

Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably hasn't done a fresh pre-training over the web?

Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).

With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).

throwaway314155

3 months ago

[-]

> which presumably hasn't done a fresh pre-training over the web

What makes you think that?

> Did they figure out how to do more incremental knowledge updates somehow?

It's simple. You take the existing model and continue pretraining with newly collected data.

3 months ago

[-]

A leak reported on by semi-analyses stated that they haven't pre-trained a new model since 4o due to compute constraints.

ComputerGuru

3 months ago

[-]

xmcqdpt2

3 months ago

[-]

I don’t know if they used the new ChatGPT to translate this page but I was served the French version and it is NOT good. There are placeholders for quotes like <quote> and the prose is incredibly repetitive. You’d figure that OpenAI of all people would be able to translate something to one of the worlds most spoken language.

yousif_123123

3 months ago

[-]

Why doesn't OpenAI include comparisons to other models anymore?

enraged_camel

3 months ago

[-]

Because their main competition (Google and Anthropic) have caught up and even started to surpass them, and comparisons would simply drive it home.

IAmNotACellist

3 months ago

[-]

Why do they care so much? They're a non-profit dedicated to the betterment of humanity via open access to AI. They have nothing to hide. They have no motivation to lie, or lie by omission.

koolba

3 months ago

[-]

> Why do they care so much? They're a non-profit dedicated to the betterment of humanity via open access to AI.

We're still talking about OpenAI right?

IAmNotACellist

3 months ago

[-]

You're not calling Sam Altman a liar, are you?

kaliqt

3 months ago

[-]

They are not a nonprofit at all. Legally, yes. But they are not.

ftchd

3 months ago

[-]

because they probably need to compare pricing too

https://x.com/sama/status/1999185784012947900

conradkay

3 months ago

[-]

Sam Altman posted with a comparison to Gemini 3 and Opus 4.5

yousif_123123

3 months ago

[-]

I see, thanks for this.

byt3bl33d3r

3 months ago

[-]

There’s really no point in looking at benchmarks anymore as real world usage of these models varies between task and prompting strategies. Use your internal benchmarks to evaluate and ignore everything else. It is curious to me how they don’t provide a side x side comparison of other models benchmarks for this release

bob1029

3 months ago

[-]

I've been looking really hard at combining Roslyn (.NET compiler platform SDK) with one of these high end tool calling models. The ability to have the LLM create custom analyzers and then verify them with a human in the loop can provide stable, compile-time guarantees of business rules that accumulate without paying for context tokens.

I feel like there is a small chance I could actually make this work in some areas of the business now. 400k is a really big context window. The last time I made any serious attempt I only had 32k tokens to work with. I still don't think these things can build the whole product for you, but if you have a structured configuration abstraction in an existing product, I think there is definitely uplift possible.

schmuhblaster

3 months ago

[-]

Sounds interesting, could you elaborate a bit on this? (I am experimenting in a similar direction)

lacoolj

3 months ago

[-]

This is a whole bunch of patting themselves on the back.

Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.

ponyous

3 months ago

[-]

I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.

I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.

speedgoose

3 months ago

[-]

Trying it now in Vscode Insiders with Github Copilot (codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of using the better tools it has access to. I guess this is not an issue to perform well in benchmarks.

pixelmelt

3 months ago

[-]

to be fair I've seen the other sota models do this as well

songodongo

3 months ago

[-]

I get this behavior with a lot with most of the premium models (Gemini 3, Opus 4.5). I think it’s somehow more a GitHub Copilot issue than the models.

elAhmo

3 months ago

[-]

This feels like "could've been an email" type of thing, a very incremental update that just adds one more version. I bet there is literally no one in the world who wanted *one more version of GPT* in the list of available models from OpenAI.

"All models" section on https://platform.openai.com/docs/models is quite ridiculous.

tim333

3 months ago

[-]

It's significant because it looked like they were falling behind Gemini and maybe others.

d--b

3 months ago

[-]

> it’s better at creating spreadsheets

I have a bad feeling about this.

jonplackett

3 months ago

[-]

Excited to try this. I’ve found Gemini excellent recently and amazing at coding. But I still feel somehow like ChatGPT understands more. Even though it’s not quite as good at coding - and nowhere at as fast. It is much less likely anti spontaneously forget something. Gemini’s is part unbelievably amazing and part amnesia patient. I still kinda trust ChatGPT more.

atheljcarlton

3 months ago

[-]

It's dog-doo-doo. I put in my algebraic geometry final review (100's of thousands of tokens) and Gemini instantly found all the propositions, theorems, and problems that I needed in a neat list (in about 5 seconds), meanwhile ChatGPT 5.2 Thinking took 10mins before timing out and not even completing the request.

atheljcarlton

3 months ago

[-]

However, the model card for GPT 5.2 looks amazing, wish I could actually see that performance in action!

m12k

3 months ago

[-]

So, does 5.2 still have a knowledge cutoff date of June 2024, or have they managed to complete another full pre-training run?

ChrisArchitect

3 months ago

[-]

Discussion on blog post: https://openai.com/index/introducing-gpt-5-2/ (https://news.ycombinator.com/item?id=46234874)

lend000

3 months ago

[-]

It seems like they fixed the most obvious issue with the last release, where codex would just refuse to do its job... if it seemed difficult or context usage was getting above 60% or so. Good job on the post-training improvements.

The benchmark changes are incredible, but I have yet to notice a difference in my codebases as of yet.

StarterPro

3 months ago

[-]

>GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.

We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.

k2xl

3 months ago

[-]

The ARC AGI 2 bump to 52.9% is huge. Shockingly GPT 5.2 Pro does not add too much more (54.2%) for the increase cost.

jrflowers

3 months ago

[-]

OpenAI is really good at just saying stuff on the internet.

I love the way they talk about incorrect responses:

> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.

“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“

I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.

10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.

3 months ago

[-]

So the rosy biased estimate is OpenAI is saving 1 hour of work per day, so 5 hours total per-work week and 20 hours total per-month.

With a subsidized cost of $200/month for OpenAI it would be cheaper to hirer a part-time minimum wage worker than it would be to contract with OpenAI.

And that is the rosiest estimate OpenAI has.

maerch

3 months ago

[-]

The closest I come to working with part-time, minimum-wage workers is working with student employees. Even then, they earn more and usually work more than five hours a week.

Most of the time, I end up putting in more work than I get out of it. Onboarding, reviewing, and mentoring all take significant time.

Even with the best students we had, paying around 400 euros a month, I would not say that I saved five hours a week.

And even when they reach the point of being truly productive, they are usually already finished with their studies. If we then hire them full-time, they cost significantly more.

dangoodmanUT

3 months ago

[-]

A part time minimum wage worker can't code

3 months ago

[-]

Check the wages of coders outside of the US

6510

3 months ago

[-]

There use to be a mythological creature on irc from south America (sorry forgot the specifics) who was both a 10x dev and a 10x mathematician. One day he showed a picture of his computer. It was a low end laptop with a tft monitor and an external keyboard because the screen and the keyboard didn't work. It explained everything, the machine was just good enough to write code, do math, read stack exchange and lurk irc with his ghosts.

3 months ago

[-]

It you take of the rosy glasses, it is more like 10 hours saved per-month at an unsubsidized cost of $1000/month

The $100/hr is worth it for US programming jobs, but nothing else

3 months ago

[-]

What people here forget is coding is a tiny minority of the actual usage. ~5% if I remember correctly?

Their best market might just be as a better Google with ads

3 months ago

[-]

Yep, bulk of AI usage is generating marketing emails

3 months ago

[-]

Here's OpenAI's data on it: https://www.nber.org/system/files/working_papers/w34255/w342...

I don't think marketing emails are written enough to constitute the "bulk" of it, but writing in general seems to be

https://github.com/vectara/hallucination-leaderboard

ofermend

3 months ago

[-]

GPT-5.2 just added to Vectara Hallucination Leaderboard. Definitely an improvement over GPT-5.1 - congrats to the team

3 months ago

[-]

> new context management using compaction.

Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.

https://chatgpt.com/s/t_693b49e8f5808191a954421822c3bd0d

0xdeafbeef

3 months ago

[-]

much better https://chatgpt.com/s/t_693b489d5a8881918b723670eaca5734 than 5.1 https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2.

Same query - what romanian football player won the premier league

update. Even instant returns correct result without problems

dandiep

3 months ago

[-]

Still no GPT 5.x fine tuning?

I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.

jazzyjackson

3 months ago

[-]

Containment breach is going to occur from a disgruntled customer convincing the customer service bot it needs to get a hold of a supervisor

dandiep

3 months ago

[-]

I tried this!

Me: Is it possible get early access to the GPT 5 fine tuning? It is much better at non-English languages, which is the majority of what we do, so it would be greatly beneficial to us.

Bot: I'm an AI support agent. I don't have information about early access to GPT-5 fine-tuning or unannounced features. Currently, there is no public information or program for early access to GPT-5 fine-tuning. Please keep an eye on official OpenAI channels for updates. Let me know if you have any questions about current features or products!

Me: Can you ask someone from the team if I can be part of the early access? I know there is no public information, but there is an early access program and I would like to be part of it.

Bot: I'm an AI support agent and do not have the ability to forward requests or make inquiries about early access programs with the OpenAI team. If early access becomes available, it will be announced through official channels. Let me know if you have any other questions about OpenAI products or current fine-tuning options.

Me: How do I get in touch with the open ai team?

Bot: I'm an AI support agent and can only assist with OpenAI products or services through this support channel. There is no separate contact method for the OpenAI team beyond official support and public announcements. If you have questions about current products or features, I'm here to help

Should have tried a jailbreak maybe...

jazzyjackson

3 months ago

[-]

Yea you might have to convince it someone is in mortal danger or something

int_19h

3 months ago

[-]

You're talking about a model that will notoriously refuse to say a racist word even if you tell it that it's the only way to prevent a nuclear apocalypse.

gkbrk

3 months ago

[-]

Is this the "Garlic" model people have been hyping? Or are we not there yet?

0x457

3 months ago

[-]

Garlic will be released 2026Q1.

kachapopopow

3 months ago

[-]

did they just tune the parameters? the hallucinations are crazy high on this version.

johnsutor

3 months ago

[-]

https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.

DenisM

3 months ago

[-]

Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?

I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.

I’m ok paying money for it, so maybe someone coded this already?

8cvor6j844qw_d6

3 months ago

[-]

What the current preferred subscription on AI?

OpenAI and Anthrophic is my current preference. Looking forward to know what others use.

Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.

Kim_Bruning

3 months ago

[-]

I'm continuously surprised that some people get good results out of GPT models. They sort of fail on my personal benchmarks for me.

Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)

piskov

3 months ago

[-]

They are all gpt as in generative pre-trained transformer

Kim_Bruning

3 months ago

[-]

That may or may not be true, but in the context of this article, I'm referring to OpenAI's GPT brand of models.

johan914

3 months ago

[-]

A bit off topic: but what's with the ram usage of LLM clients? ChatGPT, google, and Anthropic all use 1+ GB of ram during a long session. Surely they are not running GPT 3 locally?

getnormality

3 months ago

[-]

Sweet Jesus. 53% on ARC-AGI-2. There's still gas in this van.

Ninjinka

3 months ago

[-]

Man this was rushed, typo in the first section:

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

petercooper

3 months ago

[-]

Also, did they mention these features? I was looking out for it but got to the end and missed it.

(No, I just looked again and the new features listed are around verbosity, thinking level and the tool stuff rather than memory or knowledge.)

chux52

3 months ago

[-]

Is this why all my Cursor requests are timing out in the past hour?

throwaway2037

3 months ago

[-]

Somewhat tangential: The second link says "System card": https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

Does that term have special meaning in the AI/LLM world? I never heard it before. I Google'd the term "System Card LLM" and got a bunch of hits. I am so surprised that I never saw the term used here in HN before.

Also, the layout looks exactly like a scientific paper written in LaTeX. Who is the expected audience for this paper?

tylerrobinson

3 months ago

[-]

The major model providers use system cards as a sort of self attestation document like a nutrition label. It’s been around for a couple years.

blinding-streak

3 months ago

[-]

Yeah, search HN for the term. It's a relatively big topic of conversation.

cc62cf4a4f20

3 months ago

[-]

In other news, been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense it that it's nonetheless good enough and doesn't require me to send my data off my laptop.

I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.

Nesco

3 months ago

[-]

Wait, isn't Devstral2 (normal not small) 123b? What type of laptop do you have? MacBooks don't go over 128GiB

cc62cf4a4f20

3 months ago

[-]

I'm using small - works well for its size

yberreby

3 months ago

[-]

Would you share some additional details? CPU, amount of unified memory / VRAM? Tok/s with those?

cc62cf4a4f20

3 months ago

[-]

MBP M4 Max 64MB - haven't measured the tokens/sec, feels slower than Claude, but not unbearably

It's not yet perfect, my sense is just that it's near the tipping point where models are efficient enough that running a local model is truly viable

sureglymop

3 months ago

[-]

How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?

I guess I must "listen" to the article...

z58

3 months ago

[-]

With Safari on iOS you can hide distracting items. I just tried it on that button, it works flawlessly.

https://openai.com/index/introducing-gpt-5-2/

zamadatix

3 months ago

[-]

villgax

3 months ago

[-]

Marginal gains for exorbitantly pricey and closed model…..

eastoeast

3 months ago

[-]

For the first time, I’m presenting a problem to LLMs that they cannot seem to answer. This is my first instance of them “endlessly thinking” without producing anything.

The problem is complicated, but very solvable.

I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.

So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.

I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”

They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.

To add, it also only seems to be using about 25% context on Opus 4.5. Weird!

keeeba

3 months ago

[-]

Doesn’t seem like this will be SOTA in things that really matter, hoping enough people jump to it that Opus has more lenient usage limits for a while

keepamovin

3 months ago

[-]

It is significantly better than 5.1 .. testing now with codex. It's much more focused, perceptive and efficient.

ChrisMarshallNY

3 months ago

[-]

They are talking a lot about economics, here. Wonder what that will mean for standard Plus users, like me.

w_for_wumbo

3 months ago

[-]

Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper.

This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?

coolfox

3 months ago

[-]

the halving of error rates for image inputs is pretty awesome, this makes it far more practical for issues where it isn't easy to input all the needed context. when I get lazy I'll just shift+win+s the problem and ask one of the chatbots to solve it.

JanSt

3 months ago

[-]

The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.

No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.

GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.

Great for humanity probably but but for the individuals?

llmslave

3 months ago

[-]

Yeah theres no wall on this. It will be able to mimic all of human behavior given proper data.

3 months ago

[-]

Ok so why isn’t there mass lay offs ensuing right now?

ghosty141

3 months ago

[-]

Because from my experience using codex in a decently complex c++ environment at work, it works REALLY well when it has things to copy. Refactorings, documentation, code review etc. all work great. But those things only help actual humans and they also take time. I estimate that in a good case I save ~50% of time, in a bad case it's negative and costs time.

But what I generally found, it's not that great at writing new code. Obviously an LLM can't think and you notice that quite quickly, it doesn't create abstractions, use abstractions or try to find general solution to problems.

People who get replaced by Codex are those who do repetitive tasks in a well understood field. For example, making basic websites, very simple crud applications etc..

I think it's also not layoffs but rather companies will hire less freelancers or people to manage small IT projects.

ionwake

3 months ago

[-]

it was only about 2-3 weeks when several HNers told me "nah you better re-check your code", when I explained I have over 2 decades xp of coding, yet have not manually edited code (in memory) for the last 6 or so months, whilst performing daily 12 hour daily vibe code seshes

ipsum2

3 months ago

[-]

It really depends on the complexity of code. I've found models (codex-5.1-max, opus 4.5) to be absolutely useless writing shaders or ML training code, but really good at basic web development.

nineteen999

3 months ago

[-]

Interesting, I've been using Claude Max with UE5 and while it isn't _brilliant_ with shaders I can usually get it to where I want. Also had a bit of success with converting HLSL shaders to GLSL with it.

ipsum2

3 months ago

[-]

I've asked it to write some non-trivial three.js code and have not gotten it to succeed.

ionwake

3 months ago

[-]

i got it to write some shaders in js and some three.js and it fixed something I had previously never been able to do.

3 months ago

[-]

Which is no surprise as the data for web development stuff exists in large amounts on the web that the models feed off.

osn9363739

3 months ago

[-]

Do you have any examples or are your project oss or anything like that? Because I want to believe, but I have people I work with that say and try the same thing (no manual coding), and their work is now terrible.

ionwake

3 months ago

[-]

Ive finally fixed some massive issues in projects that were taking me literally years, Ill be super happy to share once they are ready ( I cant really show my trading app but the game should be fine as soon as I do).

xnx

3 months ago

[-]

What's a more accurate name for this model? GPT-4 v3?

mlmonkey

3 months ago

[-]

It's funny how they don't compare themselves to Gemini and Claude anymore.

dev1ycan

3 months ago

[-]

How many years of the world's DRAM production capacity is it this time?

TakakiTohno

3 months ago

[-]

I use it everyday but have been told by friends that Gemini has overtaken it.

3 months ago

[-]

A classic long-form sales pitch. Someone's been reading their Patio11...

Jackson__

3 months ago

[-]

Funny that, their front page demo has a mistake. For the waves simulation, the user asks:

>- The UI should be calming and realistic.

Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.

mobrienv

3 months ago

[-]

I recently built a webapp to summarize hn comment threads. Sharing a summary given there is a lot here: https://hn-insights.com/chat/gpt-52-8ecfpn.

DenisM

3 months ago

[-]

I keep asking ChatGPT to read and summarize HN front page while driving, and it keeps blundering. I don’t know if there’s a business for you in this, but I would pay.

Of course I always have questions about the subject, so it become the whole voice chat thing.

mobrienv

3 months ago

[-]

Interesting I recently added the ability to receive a daily email digest. Would just need a way to read it out. I'll look into what a conversational voice chat might look like.

stainablesteel

3 months ago

[-]

im happy for this, but there's all these math and science benchmarks, has anyone ever made a communicates-like-a-human benchmark? or an isn't-frustrating-to-talk-with benchmark?

andreygrehov

3 months ago

[-]

Every new model is ‘state-of-the-art’. This term is getting annoying.

arthur-st

3 months ago

[-]

I mean, that is what the term implies.

stopachka

3 months ago

[-]

For those curious about the question: "how well does GPT 5.2 build Counter Strike?"

We tried the same prompts we asked previous models today, and found out [1].

The TL:DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least 5.2 did better on just about every prompt compared to 5.1 Codex Max.

The two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than read docs 2. In this instance 5.2 was more sheepish about running CLI commands. It would instead ask me to run the commands.

Since this isn't a codex fine-tuned model, I'm definitely excited to see what that looks like.

[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363

matt3210

3 months ago

[-]

Can this be used without uploading my code base to their server?

loa_observer

3 months ago

[-]

does the model really improve? i tried several tasks today, and most of them failed, which are super easy ones.

maybe it's just because the gpt5.2 in cursor is super stupid?

DeathArrow

3 months ago

[-]

Pricing is the same?

3 months ago

[-]

ChatGPT pricing is the same. API pricing is +40% per token, though greater token efficiency means that cost per task is not always that much higher. On some agentic evals we actually saw costs per task go down with GPT-5.2. It really depends on the task though; your mileage may vary.

ComputerGuru

3 months ago

[-]

How long have you been previewing 5.2?

gigatexal

3 months ago

[-]

So how much better is it than opus or Gemini ?

daviding

3 months ago

[-]

gpt-5.2 and gpt-5.2-chat-latest the same token price? Isn't the latter non-thinking and more akin to -nano or -mini?

dalemhurley

3 months ago

[-]

No. It is the same model without reasoning.

daviding

3 months ago

[-]

So is maybe gpt-5.2 with reasoning set to 'none' identical to gpt-5.2-chat-latest in capabilities but perhaps with a different system (system) prompt? I notice chat-latest doesn't accept temperature or reasoning (which makes sense) parameters, so something is certainly different underneath?

vishal_new

3 months ago

[-]

Hmmm, is there any insight if these are really getting much better at coding? Will hand coding be dead within a few years, just human typing in english?

psychoslave

3 months ago

[-]

Mia espero estas ke ne, ni nur parolos home inter homoj, robotoj anticipe faros servutoj por taŭge fari niajn dezirojn realigi laŭ niaj faktaj bezonoj. Kompreneble ni ĉiuj flue parolos Esperanto por taga geopolitikaj internaciaj aferoj, kaj ia ajn alia lingvo kiu plaĉas al mi por aliaj aferoj.

Estonteco estas hela, miaj karaj siboj.

SilverElfin

3 months ago

[-]

Is the training cutoff date known?

tabletcorry

3 months ago

[-]

Slight increase in model cost, but looks like benefits across the board to match.

  gpt-5.2 $1.75 $0.175 $14.00
  gpt-5.1 $1.25 $0.125 $10.00

jtbayly

3 months ago

[-]

40% increase is not "slight."

credit_guy

3 months ago

[-]

Not the OP, but I think "slight" here is in relation to Anthropic and Google. Claude Opus 4.5 comes at $25/MT (million tokens), Sonnet 4.5 at $22.5/MT, and Gemini 3 at $18/MT. GPT 5.2 at $14/MT is still the cheapest.

3 months ago

[-]

Your numbers are very off.

  $25 - Opus 4.5
  $15 - Sonnet 4.5
  $14 - GPT 5.2
  $12 - Gemini 3 Pro

Even if you're including input, your numbers are still off.

credit_guy

3 months ago

[-]

I used the pricing for long context (>200k) in all cases. I personally use AI as coding assistants, like lots of other people, and as such, hitting and exceeding 200k is quite the norm. The numbers you are showing are for <200k context length.

3 months ago

[-]

I also use them as coding assistants among other things, like lots of other people, and hitting and exceeding 200k is absolutely not the norm unless you're using a large number of huge MCP servers. At those context sizes output quality significantly declines, even with the claims of "we support long context". This is why all those coding assistants use auto-compression, not just to save money, but largely to maintain quality. In any case, >200k input calls are a small fraction of all.

Ironically at that input size, input costs dominate rather than output, so if that's the use case you're going for you want to be including those in your named prices anyway.

https://openai.com/api/pricing/

commandar

3 months ago

[-]

In particular, the API pricing for GPT-5.2 Pro has me wondering what on earth the possible market for that model is beyond getting to claim a couple of percent higher benchmark performance in press releases.

>Input:

>$21.00 / 1M tokens

>Output:

>$168.00 / 1M tokens

That's the most "don't use this" pricing I've seen on a model.

aimanbenbaha

3 months ago

[-]

Last year o3 high did 88% on ARC-AGI 1 at more than $4,000/task. This model at its X high configuration scores 90.5% at just $11,64 per task.

General intelligence has ridiculously gotten less expensive. I don't know if it's because of compute and energy abundance,or attention mechanisms improving in efficiency or both but we have to acknowledge the bigger picture and relative prices.

commandar

3 months ago

[-]

Sure, but the reason I'm confused by the pricing is that the pricing doesn't exist in a vacuum.

Pro barely performs better than Thinking in OpenAI's published numbers, but comes at ~10x the price with an explicit disclaimer that it's slow on the order of minutes.

If the published performance numbers are accurate, it seems like it'd be incredibly difficult to justify the premium.

At least on the surface level, it looks like it exists mostly to juice benchmark claims.

rvnx

3 months ago

[-]

It could be using the same early trick of Grok (at least in the earlier versions) that they boot 10 agents who work on the problem in parallel and then get a consensus on the answer. This would explain the price and the latency.

Essentially a newbie trick that works really well but not efficient, but still looking like it's amazing breakthrough.

(if someone knows the actual implementation I'm curious)

anticensor

3 months ago

[-]

The magic number appears to be 12 in case of GPT 5.2 pro.

asgraham

3 months ago

[-]

Those prices seem geared toward people who are completely price insensitive, who just want "the best" at any cost. If the margins on that premium model are as high as they should be, it's a smart business move to give them what they want.

arthurcolle

3 months ago

[-]

gpt-4-32k pricing was originally $60.00 / $120.00.

3 months ago

[-]

Pro solves many problems for me on first try that the other 5.1 models are unable to after many iterations. I don't pay API pricing but if I could afford it I would in some cases for the much higher context window it affords when a problem calls for it. I'd rather spend some tens of dollars to solve a problem than grind at it for hours.

reactordev

3 months ago

[-]

Less an issue if your company is paying

rvnx

3 months ago

[-]

Even less an issue when OpenAI provides you free credits

Leynos

3 months ago

[-]

Someone on Reddit reported that they were charged $17 for one prompt on 5-pro. Which suggests around 125000 reasoning tokens.

Makes me feel guilty for spamming pro with any random question I have multiple times a day.

llmslave

3 months ago

[-]

They probably just beefed up compute run time on the what is the same underlying model

anvuong

3 months ago

[-]

In what world is that a slight increase?

lazarus01

3 months ago

[-]

My god, what terrible marketing, totally written by AI. No flow whatsoever.

I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.

iwontberude

3 months ago

[-]

I have already cancelled. Claude is more than enough for me. I don’t see any point in splitting hairs. They are all going to keep lying more and more sneakily.

3 months ago

[-]

So, right off the bat: 5.2 code talk (through codex) feels really nice. The first coding attempt was a little meh compared to 5.1 codex max (reflecting what they wrote themselves), but simply planning / discussing things felt markedly better than anything I remember from any previous model, from any company.

I remain excited about new models. It's like finding my coworker be 10% smarter every other week.

qoez

3 months ago

[-]

This is also the exact on-the-day 10th anniversary of openai's creation incidentally

johnwheeler

3 months ago

[-]

I'm not interested in using OpenAI anymore because Sam Altman is so untrustworthy. All you see on X.com is him and Greg Brockman kissing David Sacks' ass, trying to make inroads with him, asking Disney for investments, and shit. Are you kidding? Who wants to support these clowns? Let's let Google win. Let's let Anthropic win. Anyone but Sam Altman.

riazrizvi

3 months ago

[-]

Does it still use the word ‘fluff’ in 90% of its preambles, or is it finally able to get straight to the point?

system2

3 months ago

[-]

"Investors are putting pressure, change the version number now!!!"

exe34

3 months ago

[-]

I'm quite sad about the S-curve hitting us hard in the transformers. For a short period, we had the excitement of "ooh if GPT-3.5 is so good, GPT-4 is going to be amazing! ooh GPT-4 has sparks of AGI!" But now we're back to version inflation for inconsequential gains.

3 months ago

[-]

2025 is the year most Big AI released their first real thinking models

Now we can create new samples and evals for more complex tasks to train up the next gen, more planning, decomp, context, agentic oriented

OpenAI has largely fumbled their early lead, exciting stuff is happening elsewhere

ToValueFunfetti

3 months ago

[-]

Take this all with a grain of salt as it's hearsay:

From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.

Leynos

3 months ago

[-]

4.5 is widely believed to be an order of magnitude larger than GPT-4, as reflected in the API inference cost. The problem is the quantity of parameters you can fit in the memory of one GPU. Pretty much every large GPT model from 4 onwards has been mixture of experts, but for a 10 trillion parameter scale model, you'd be talking a lot of experts and a lot of inter-GPU communication.

With FP4 in the Blackwell GPUs, it should become much more practical to run a model of that size at the deployment roll-out of GPT-5.x. We're just going to have to wait for the GBx00 systems to be physically deployed at scale.

snovv_crash

3 months ago

[-]

Datacenter capacity is being snapped up for inference too though.

JanSt

3 months ago

[-]

I don't feel the S-curve at all yet. Still an exponential for me

exe34

3 months ago

[-]

With a very long doubling time?

gessha

3 months ago

[-]

Because it will take thousands of underpaid researchers random searching through solution space to get to the next improvement, not 2-3 companies pressed to monetize and enshittify their product before money runs out. That and winning more hardware lotteries.

https://www.levels.fyi/companies/openai/salaries/software-en...

3 months ago

[-]

Underpaid? OpenAI!? It's pretty good I think.

gessha

3 months ago

[-]

I’m talking about grad students, not OpenAI researchers.

jaimex2

3 months ago

[-]

They just keep flogging that dead horse.

The winner in this race will be whoever gets small local models to perform as well on consumer hardware. It'll also pop the tech bubble in the US.

MagicMoonlight

3 months ago

[-]

They’re definitely just training the models on the benchmarks at this point

roxolotl

3 months ago

[-]

Yea either this is an incredible jump or we’ve finally gotten confirmation benchmarks are bs.

fasteo

3 months ago

[-]

>>> Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day

If this is what AI has to offer, we are in a gigantic bubble

jatora

3 months ago

[-]

This seems pretty huge. Not sure by what metric it wouldn't be civilizationally gigantic for everyone to save that much time per day.

willahmad

3 months ago

[-]

are we doomed yet?

Seems not yet with 5.2

scottndecker

3 months ago

[-]

Still 256K input tokens. So disappointing (predictable, but disappointing).

https://platform.openai.com/docs/models/gpt-5.2

coder543

3 months ago

[-]

400k, not 256k.

nathants

3 months ago

[-]

400 - 128 = 272. Codex cli source.

coder543

3 months ago

[-]

If you want to be able to generate up to 128k tokens in one go successfully, then yes, that math checks out.

htrp

3 months ago

[-]

much harder to train longer context inputs

johndill

3 months ago

[-]

Did Calmmy Sammy that his is the version that will finally cure cancer? The AI shakeout in the AI industry is going to be brutal. Can't see how Private Equity is going to get the little guy to be left holding the giant bag of excrement, but they will figure that out. AI, smart enough to replace you, but not quite smart enough the replace the CEO or Hedge Fund Bros.

3 months ago

[-]

What do private equity or hedge funds have to do with any of this? Those are like, specific business models that are not involved in this situation.

rl_shannon

3 months ago

[-]

Isn't it delusional to only compare your models against your own previous variants? Where is an actual comparison with Google, Anthropic, OSS Models

zild3d

3 months ago

[-]

it's the best ____ we've ever made

slackr

3 months ago

[-]

“…where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.”

What a sociopathic way to sell

Croftengea

3 months ago

[-]

Is this another GPT-4.5?

meetpateltech

3 months ago

[-]

GPT-5.2 System Card PDF: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

dang

3 months ago

[-]

Thanks, we'll put that in the toptext as well.

TechDebtDevin

3 months ago

[-]

$168.00 / 1M ouput tokens is hilarious for their "Pro". Can't wait to here all the bitching from orgs next month. Literally the dumbest product of all time. Do you people seriously pay for this?

orliesaurus

3 months ago

[-]

I told all my friends to upgrade or they're not my friends anymore /s

HackerThemAll

3 months ago

[-]

No, thank you, OpenAI and ChatGPT doesn't cut it for me.

https://news.ycombinator.com/newsguidelines.html

dang

3 months ago

[-]

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

bluerooibos

3 months ago

[-]

Yawn.

dudeinhawaii

3 months ago

[-]

What does this add to the conversation? This isn't Reddit.

impulser_

3 months ago

[-]

The thing about OpenAI is their models never fit anywhere for me. Yes they maybe smart or even the smartest models but they are alway so fucking slow. The ChatGPT web app is literally usable for me. I ask simple task and it does most extreme shit jsut to get an answer that the same as Claude or Gemini.

For example, I asked ChatGPT to take a chart and convert into a table. It went and cut up the image and zoomed in for literally 5 mins to get the a worst answer than Claude which did it in under a minute.

I see people talk about Codex like it better than Claude Code, and I go and try it and it takes a lifetime to do thing and it return maybe an on par result as Opus or Sonnet but it takes 5mins longer.

I just tried out this model and it the same exact thing. It just take ages for it to give you an answer.

I don't get how these models are useful in the real world.

What am I missing, is this just me?

I guess it truly an enterprise model.

wetoastfood

3 months ago

[-]

Are you using 5.1 Thinking? I tended to prefer Claude before this model.

I use models based on the task. They still seem specialized and better at specific tasks. If I have a question I tend to go to it. If I need code, I tend to go to Claude (Code).

I go to ChatGPT for questions I have because I value an accurate answer over a quick answer and, in my experience, it tends to give me more accurate answers because of its (over) willingness to go to the web for search results and question its instincts. Claude is much more likely to make an assumption and its search patterns aren't as thorough. The slow answers don't bother me because it's an expectation I have for how I use it and they've made that use case work really well with background processing and notifications.

HackerThemAll

3 months ago

[-]

No, thank you, OpenAI and ChatGPT doesn't cut it for me.

replwoacause

3 months ago

[-]

What’s cutting it for you these days?

wayeq

3 months ago

[-]

thanks for letting us know.

airstrike

3 months ago

[-]

I feel like if we're going to regulate anything about AI, we should start by regulating (1) what they get to claim to be a "new model" to the public and (2) what changes they are allowed to make at inference before being forced to name it something different.

3 months ago

[-]

That's almost but not quite how the airline industry is treated. The difference there is that the regulators are in bed with the companies they should be regulating.

_7u7v

3 months ago

[-]

It baffles me to see these last 2 announcements (GPT 5.1 as well) devoid of any metrics, benchmarks or quantitative analyses. Could it be because they are behind Google/Anthropic and they don't want to admit it?

(edit: I'm sorry I didn't read enough on the topic, my apologies)

zamadatix

3 months ago

[-]

This isn't the announcement, it's the developer docs intro page to the model - https://openai.com/index/introducing-gpt-5-2/. Still doesn't answer cross-comparison, but at least has benchmark metrics they want to show off.