FilterHN

GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

78 points

by maille

1 hour ago

| 7 comments

| github.com

▲

ACCount37

2 minutes ago

A rare case of "they made the model dumber" where they actually made the model dumber, instead of the usual user psychosis?

▲

nsingh2

4 minutes ago

Oh, this seems bad, and it is fairly easy to reproduce using Codex CLI. You give it a puzzle prompt that it has to reason about and solve; occasionally it will seemingly short-circuit, think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens, it returns the correct result.

Maybe some issue with adaptive thinking? Another point for local models, I guess: you don't have to worry about silent server-side changes causing bugs.
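For anyone who wants to check this on their own runs, here is a rough sketch of how the tally could look. It assumes you have already logged each run as a (reasoning token count, was the answer correct) pair; the function name, the bucket labels, and the sample data are all made up for illustration, and 516 is just the value reported above, not a documented constant.

```python
from collections import defaultdict


def error_rate_by_bucket(runs, short_circuit_tokens=516):
    """Return the error rate for short-circuited runs vs. all other runs.

    runs: iterable of (reasoning_tokens, is_correct) pairs (hypothetical log format).
    short_circuit_tokens: token count treated as a short circuit (assumed, from the report above).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [number of runs, number wrong]
    for tokens, is_correct in runs:
        bucket = "short_circuit" if tokens == short_circuit_tokens else "other"
        totals[bucket][0] += 1
        totals[bucket][1] += 0 if is_correct else 1
    return {bucket: wrong / n for bucket, (n, wrong) in totals.items()}


# Made-up sample data, for illustration only; not real measurements.
sample_runs = [(516, False), (516, False), (6420, True), (7810, True), (516, True)]
rates = error_rate_by_bucket(sample_runs)
print(rates)
```

A large gap between the two buckets on real logs would support the short-circuit explanation; a small one would suggest something else is going on.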

▲

zenapollo

25 minutes ago

I’ve definitely experienced step jumps down in quality on an almost daily basis. I usually use xhigh. The experience of relying on Codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I can tell, they haven’t taken it seriously in the several months I’ve been personally seeing it.

▲

siva7

21 minutes ago

I switched to Codex 3 months ago because Claude got incredibly stupid. 6 months ago it was the other way around. It doesn't matter if you use Codex or Claude. Both will fuck with you at some point. Though Codex probably less.

▲

cyanydeez

6 minutes ago

I don't ever believe these issues are technical. They're business decisions to downgrade performance, because fixing it means $$$$ and you aren't paying them enough.

▲

kleton

34 minutes ago

Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization

▲

kbdiaz

8 minutes ago

Isn't the standard to use continuous batching? If they are using continuous batching -- I'm curious why generated token length matters, and why the lengths might be clustering. If not -- I'm curious why they aren't and what the tradeoff is here.

▲

siva7

13 minutes ago

I swear some days ago someone here claimed OpenAI succeeded in cutting their compute cost in half with a breakthrough optimization. So this is it?

▲

simonw

9 minutes ago

That was an article in The Information, but it didn't read very well to me. I didn't get the impression the author was enough of a technical expert on how LLMs work to credibly evaluate the claim, which came from an insider rumor: https://www.theinformation.com/newsletters/ai-agenda/openai-...

> OpenAI engineers earlier this month told some colleagues they had figured out a way to more than halve the cost of inference, or running existing models, thanks to some newly-discovered optimizations, according to a person with knowledge of those discussions.

▲

maille

1 hour ago

tldr:

The GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens counts pile up at fixed values spaced 518 apart.

These stuck responses at fixed thresholds are strongly correlated with errors in complex tasks.

The observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3.
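A minimal sketch of how one might test their own logs for this pattern, assuming the clusters start at 516 (the value reported elsewhere in this thread, not something stated in this summary) and repeat every 518 tokens. The function names and the sample counts are hypothetical, for illustration only.

```python
def on_cluster(tokens, first=516, spacing=518):
    """True if tokens equals first + k * spacing for some integer k >= 0.

    first=516 is an assumption taken from another comment; spacing=518 is from the summary above.
    """
    return tokens >= first and (tokens - first) % spacing == 0


def cluster_fraction(counts, first=516, spacing=518):
    """Fraction of reasoning_output_tokens counts that land exactly on a cluster value."""
    if not counts:
        return 0.0
    hits = sum(1 for c in counts if on_cluster(c, first, spacing))
    return hits / len(counts)


# Made-up counts, for illustration only; not real measurements.
sample_counts = [516, 1034, 1552, 6420, 7810, 516]
fraction = cluster_fraction(sample_counts)
print(f"{fraction:.2f} of runs landed on a cluster value")
```

If token counts were spread smoothly, only about 1 in 518 runs would be expected to land on one of these values by chance, so a fraction well above that on real logs would be consistent with the clustering described here.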

▲

ProofHouse

1 hour ago

Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.

▲

benjiro29

57 minutes ago

Care to explain what you mean by that?

▲

criley2

5 minutes ago

I'm struggling to understand as well. I think perhaps they mean they use the ChatGPT website with GPT-5.5+reasoning for problem solving, and paste the output into Codex CLI/App. I think they're saying that letting Codex CLI/App problem-solve with GPT-5.5 isn't as effective. Essentially, that the web harness is superior to the agentic engineering harness for problem solving?

Not sure if I agree, but I do happen to use a fair bit of web harness as well, just because I find it to be much more effective at web search and a different type of reasoning. So I must agree a little or else I wouldn't do that.

▲

dimitrios1

33 minutes ago

I know that these types of comments are not really popular here, but this struck a chord with me because I feel the same. They aren't remotely close.

I have Codex right now purely because they gave me a free month of ChatGPT Pro, so I have been using it in between my usage resets with Claude. Since it's "free money" for me, I have been using it exclusively on xHigh.

One of my most frequent prompts is "hey, Codex worked on ____, but it didn't quite hit the mark, can we review the work..."

Yes, part of this is normal even within the same model -- you have the highest-power model review the work for correctness, refactoring opportunities, and so on. But man, I tell you, I don't know what it is about Codex. This is obviously one guy's anecdote -- same prompting style, same repository documentation à la MD files, same skills, way different results.

All that to say, maybe the bug report is on to something here, and it can be fixed.

▲

m101

58 minutes ago

What?