What eval is tracking that? It seems like it's potentially the most important metric for real-world software engineering, and not one-shot vibe prayers.
https://charlielabs.ai/research/gpt-5
Often, our tasks take 30-45 minutes and involve massive context threads in Linear or GitHub, without the model getting tripped up by things like changes in direction partway through the thread.
While 10 issues isn't crazy comprehensive, we found it to be directionally very impressive and we'll likely build upon it to better understand performance going forward.
For better accessibility and a safer experience[1] I would recommend not animating the background, or at least making it easily toggleable.
[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/G...
Edited to add: I am, in fact, photosensitive (due to a genetic retinal condition), and for my eyes, your site as it is is very easy to read, and the visualizations look great.
Love that you included the judge prompts in your article.
For whatever reason, GitHub's Copilot is treated like the redheaded stepchild of coding assistants, even though there are Anthropic, OpenAI, and Google models to choose from. And there is a "spaces"[0] website feature that may be close to what you are looking for.
I got better results testing some larger tasks using that than I did through the IDE version, but I have not used it much. Maybe others have more experience with it. Trying to gather all the context and then review the results was taking longer than doing it myself; having the context gathered already, or building it up over time, is probably where its value is.
For my use cases, this mostly comes down to needing to really home in on the relevant code files, issues, discussions, and PRs. I'm hopeful that GPT-5 will be a step forward in this regard that isn't fully captured in the benchmark results. It's certainly promising that it can achieve similar results more cheaply than e.g. Opus.
To get great results, it's still very important to manage context well. It doesn't matter that the model allows a very large context window; you can't just throw in the kitchen sink and expect good results.
But is it really 272k even if the output was, say, 10k? Because it does say "max output" in the docs, so I wonder.
>"GPT‑5 is the strongest coding model we’ve ever released. It outperforms o3 across coding benchmarks and real-world use cases, and has been fine-tuned to shine in agentic coding products like Cursor, Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha testers, setting records on many of their private internal evals."
If there's no substantial difference in software development expertise then GPT-5 absolutely blows Opus out of the water due to being almost 10x cheaper.
Because if not, I'd still go with Opus + Claude Code. I'd rather be able to tell my employer, "this will cost you $200/month" than "this might cost you less than $200/month, but we really don't know because it's based on usage"
> Availability and access
> GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week. Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI by signing in with ChatGPT.
The power of these models has peaked; they simply aren't going to manage the type of awareness being promised.
However, at least for me, there is lots of "small enough context" boilerplate that it can deal with.
Clearly this is not a tool in the sense of being predictable.
I'm using it mostly for C#, WPF and OpenTK. The type system seems to help a lot.
The UI logic it recommends is mostly god awful. But at least for me when it's given a pattern it can apply, it does so pretty well.
don’t have long-running tasks, llms or not. break the problem down into small manageable chunks and then assemble it. neither humans nor llms are good at long-running tasks.
That's a wild comparison to make. I can easily work for an hour. Cursor can hardly work for a continuous pomodoro. "Long-running" is not a fixed size.
The long-running task, at its core, is composed of many smaller tasks, and you mostly focus on one task at a time per brain part. It's why you cannot read two streams of text simultaneously even if both are in your visual focus field.
I think the plan is not just words, if it was, you could read a book on how to ride a bike.
Because we communicate in language and because code output is also a language we think that the process is also language based, but I think it's not, especially when doing hard stuff.
I know for certain that in my case it isn't -- when tracking down a hard problem with a junior the other week, after 2 hours of pair programming, I had to tell him to commit everything and just let me do some deep thinking/debugging, and I solved the problem myself. Sure, I explained my process to him in language as best I could, but it's clear it was not language, it was not linear, I did not think it step by step.
I wish I could explain it, but when figuring out a hard problem, for me it takes some time to take it all in, get used to the moving parts, play with them. I'm sure there are actual neurons/synapses formed then, actual new wires sprawling about in the brain, that's why it takes time. I think the solution is a hardware one, not a software one.
That's why we can sleep on it and get better the next day, and that's why we feel the problem. There are actually multiple parallel "threads" of thinking going at the same time in our heads, and we can FEEL the solution as almost there.
I think it simply is that hard problems can occur in a combination of code, state, models that simply cannot be solved incrementally and big jumps are necessary.
I'm not saying the problem cannot be solved incrementally, but it's possible that by going in small steps, you either reach the solution or a blocker that requires a big jump.
Claude always misunderstands how the API exported by my service works, and after every compaction it forgets it all over again and commits "oh, the API has changed since the last time I used it, let me use different query parameters." My brother in Christ, nothing has changed, and you are the one who made this API.
LLMs multiply errors over time.
If LLMs are going to act as agents, they need to maintain context across these chunks.
- "don't nest modules'–nests 4 mods in 1 file
- "don't write typespecs"–writes typespecs
- "Always give the user design choices"– skips design choices.
gpt-4.1 way outperforms with the same instructions. And Sonnet is a whole different league (remains my go-to). gpt-5 Elixir code is syntactically correct, but weird in a lot of ways, junior-esque inefficient, and just odd: e.g. function arguments that aren't used yet are passed in from callers, duplicated if checks, duplicated queries in the same function. I imagine their chat and multimodal stuff strikes a nice balance with leaps in some areas, but for coding agents this is way behind any other SOTA model I've tried. Seems like this release was more about striking a capability balance between roflscale and costs than a GPT-3-to-4 leap.

I used gpt-5-mini with reasoning_effort="minimal", and that model finally resisted a hallucination that every other model generated.
Screenshot in post here: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
I'll run formal evaluations next.
GPT4: Collaborating with engineering, sales, marketing, finance, external partners, suppliers and customers to ensure …… etc
GPT5: I don't know.
Upon speaking these words, AI was enlightened.
The new training rewards that suppress hallucinations and tool-skipping hopefully push us in the right direction.
I think the dev workflow is going to fundamentally change, because to maximise productivity out of this you need multiple AIs working in parallel. So rather than jumping straight into coding, we're going to end up writing a bunch of tickets out in a PM tool (Linear[3] looks like it's winning the race atm), then working out (or using the AI to work out) which ones can be run in parallel without causing merge conflicts, then pulling multiple tickets into your IDE/terminal, cycling through the tabs, and jumping in as needed.
Atm I'm still not really doing this but I know I need to make the switch and I'm thinking that Warp[4] might be best suited for this kind of workflow, with the occasional switch over to an IDE when you need to jump in and make some edits.
Oh also, to achieve this you need to use git worktrees[5,6,7].
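A rough sketch of what that per-ticket setup could look like, in Python; the ticket IDs and paths are made up and not tied to any particular tool:

```python
# Hypothetical helper: one git worktree (and branch) per ticket so that
# parallel agents don't step on each other's working copies.
import subprocess

def make_worktree(ticket_id: str) -> str:
    path = f"../wt-{ticket_id}"
    branch = f"feature/{ticket_id}"
    # `git worktree add -b <branch> <path>` creates a new branch checked out
    # in its own directory, sharing the same underlying repository.
    subprocess.run(["git", "worktree", "add", "-b", branch, path], check=True)
    return path

for ticket in ["ENG-101", "ENG-102", "ENG-103"]:  # made-up ticket IDs
    print("agent workspace:", make_worktree(ticket))
```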
[1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
[3]: https://linear.app/
[5]: https://docs.anthropic.com/en/docs/claude-code/common-workfl...
[1]: https://code.visualstudio.com/updates/v1_103#_git-worktree-s...
[2]: https://code.visualstudio.com/updates/v1_103#_chat-sessions-...
This is a complete game changer for staying on top of what's being covered in local government meetings. Our local bureaucrats are astoundingly competent at talking about absolutely nothing for 95% of the time, but hidden in there are three minutes of "oh btw we're planning on paving over the local open space preserve to provide parking for the local business".
Write '!sum ', hit cmd-v, and press enter.
Then the Kagi summariser will do that :)
Spend 1.5 hours now to learn from an experienced dev on a stack that is better suited for the job: most likely future hours gained.
With text I can skim around the headings and images and see at a glance how deep the author is going into the subject.
In that specific video the first 30 minutes is related to everything but the new Web Scale[0] LLM native database the author is "moving to" from SQL.
Meanwhile PostgreSQL is just chugging along and outperforming all of them.
Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.
When you break the problem of "create new endpoint" down into its sub-components (Which you can do with the agent) and then work on one part at a time, with a new session for each part, you generally do have more success.
The more boilerplate-y the part is, the better it works. I have not really found one model that can yet reliably one-shot things in real-life projects, but they do get quite close.
For many tasks, the models are slower than I am, but IMO at this point they are helpful and definitely should be part of the toolset involved.
I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude code is a much more useful product, even if the model itself is weaker.
Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.
Telecom was made after retail & airline, and fixes some of their problems. In retail and airline, the model is graded against a ground truth reference solution. But in reality, there can be multiple solutions that solve the problem, and perfectly good answers can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why airline and retail scores haven't climbed with the latest generations of models and are stuck around 60% / 80%. Even a literal superintelligence would probably plateau here.
In telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.
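A toy contrast between the two grading styles, purely illustrative (this is not the actual tau2-bench harness; the action names and state keys are made up):

```python
# Reference-solution grading (retail/airline style): only one action sequence
# counts, so a different-but-valid path scores 0.
reference_actions = ["cancel_flight('F123')", "refund_payment('F123')"]

def grade_by_actions(agent_actions: list[str]) -> bool:
    return agent_actions == reference_actions

# Outcome-state grading (telecom style): any path that leaves the world in
# the goal state gets credit.
goal_state = {"flight_F123": "cancelled", "refund_issued": True}

def grade_by_outcome(world_state: dict) -> bool:
    return all(world_state.get(k) == v for k, v in goal_state.items())
```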
Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that telecom is much better than airline/retail for measuring tool use.
Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if they trigger a quirk not present in the eval).
Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)
Output: $10 / 1M tokens
For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M for output tokens.
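For a rough sense of scale, a hypothetical agentic request with 50k input and 5k output tokens (made-up counts) at those list prices:

```python
def cost(in_tok, out_tok, in_price, out_price):
    # prices are dollars per 1M tokens
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

print(cost(50_000, 5_000, 1.25, 10))  # GPT-5:    ~$0.11
print(cost(50_000, 5_000, 15, 75))    # Opus 4.1: ~$1.13, roughly 10x more
```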
The big question remains: how well does it handle tools? (i.e. compared to Claude Code)
Initial demos look good, but it performs worse than o3 on Tau2-bench airline, so the jury is still out.
It's interesting that they're using flat token pricing for a "model" that is explicitly made of (at least) two underlying models, one with much lower compute costs than the other, and with the user able to at least influence (via prompt), if not choose, which model is being used. I have to assume this pricing is based on a predicted split between how often the underlying models get used; I wonder if that will hold up, if users will instead try to rouse the better model into action more than expected, or if the pricing is so padded that it doesn't matter.
what do you mean?
The price is what it is today because they are trying to become a dominant platform. It doesn't mean the price reflects what it actually costs to run.
I'd bet a lot of the $40 billion they got in March goes towards loss leaders.
My go to benchmark is a 3d snake game Claude does almost flawlessly (or at least in 3-4 iterations)
The prompt:
write a 3d snake game in js and html. you can use any libraries you want. the game still happens inside a single plane, left arrow turns the snake left, right arrow turns it right. the plane is black and there's a green grid. there are multiple rewards of random colors at a given time. each time a reward is eaten, it becomes the snake's new head. The camera follows the snake's head, it is above an a bit behind it, looking forward. When the snake moves right or left, the camera follows gradually left or right, no snap movements. write everything in a single html file.
EDIT: I'm not trying to shit on GPT-5, so many people here seem to be getting very good results, am I doing something wrong with my prompt?
If you need to know how the snake game should look to get the code, then Claude is not doing the work; you are.
[^1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a...
I'm already running into a bunch of issues with the structured output APIs from other companies like Google and OpenAI have been doing a great job on this front.
This run-on sentence swerved at the end; I really can't tell what your point is. Could you reword it for clarity?
The basic idea is: at each auto-regressive step (each token generation), instead of letting the model generate a probability distribution over "all tokens in the entire vocab it's ever seen" (the default), only allow the model to generate a probability distribution over "this specific set of tokens I provide". And that set can change from one sampling step to the next, according to a given grammar. E.g. if you're using a JSON grammar, and you've just generated a `{`, you can provide the model a choice of only which tokens are valid JSON immediately after a `{`, etc.
[0] https://github.com/dottxt-ai/outlines [1] https://github.com/guidance-ai/guidance
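A minimal sketch of the idea above in plain Python, with a made-up vocab, a toy grammar, and stand-in logits (nothing here is OpenAI's or llguidance's actual API):

```python
import math, random

VOCAB = ["{", "}", '"key"', ":", '"value"', ",", "<eos>"]

def allowed_next(generated):
    # Toy grammar: a flat JSON object of "key": "value" pairs.
    if not generated:
        return {"{"}
    last = generated[-1]
    if last == "{":
        return {'"key"', "}"}
    if last == '"key"':
        return {":"}
    if last == ":":
        return {'"value"'}
    if last == '"value"':
        return {",", "}"}
    if last == ",":
        return {'"key"'}
    return {"<eos>"}          # after the closing brace, only end-of-sequence

def fake_logits(generated):
    # Stand-in for the model's real scores over the vocab.
    return [random.uniform(-1, 1) for _ in VOCAB]

def sample_constrained():
    out = []
    while not out or out[-1] != "<eos>":
        mask = allowed_next(out)
        # Zero out every token the grammar disallows, then sample from the rest.
        weights = [math.exp(l) if tok in mask else 0.0
                   for tok, l in zip(VOCAB, fake_logits(out))]
        out.append(random.choices(VOCAB, weights=weights)[0])
    return "".join(t for t in out if t != "<eos>")

print(sample_constrained())   # e.g. {"key":"value","key":"value"} or {}
```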
This sounds like a really cool feature. I'm imagining giving it a grammar that can only output safe, well-constrained SQL queries. Would I actually point an LLM directly at my database in production? Hell no! It's nice to see OpenAI trying to solve that problem anyway.
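Sketching what such a grammar could look like, in a Lark-style syntax similar to what llguidance-based tooling accepts; the table/column names are hypothetical and the definition is untested:

```python
# Hypothetical grammar that only admits read-only SELECTs over a whitelisted
# table and columns; anything else simply can't be generated.
SAFE_SQL_GRAMMAR = r"""
start: "SELECT " columns " FROM orders" where? limit?
columns: COLUMN (", " COLUMN)*
where: " WHERE " COLUMN " " OP " " NUMBER
limit: " LIMIT " NUMBER
COLUMN: "id" | "customer_id" | "total" | "created_at"
OP: "=" | ">" | "<"
NUMBER: /[0-9]+/
"""
```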
It was (attempted to be) solved by a human before, yet not merged... With all the great coding models OpenAI has access to, their SDK team still feels too small for the needs.
I just went through the agony of provisioning my team with new Claude Code 5x subs 2 weeks ago after reviewing all of the options available at that time. Since then, the major changes include a Cerebras sub for Qwen3 Coder 480B, and now GPT-5. I’m still not sure I made the right choice, but hey, I’m not married to it either.
If you plan on using this much at all then the primary thing to avoid is API-based pay per use. It’s prohibitively costly to use regularly. And even for less important changes it never feels appropriate to use a lower quality model when the product counts.
Claude Code won primarily because of the sub and because they have a top-tier agentic harness and models that know how to use it. Opus and Sonnet are fantastic agents and very good at our use case, and were our preferred API-based models anyway. We can use Claude Code basically all day with at least Sonnet after using up our Opus limits. Worth noting that Cline built a Claude Code provider that the derivatives aped, which is great, but I've found Claude Code to be as good or better anyway. The CLI interface is actually a bonus for ease of sharing state via copy/paste.
I'll probably change over to Gemini Code Assist next, as it's half the price with more context length, but I'm waiting for a better Gemini 2.5 Pro and for the gemini-cli/Code Assist extensions to have first-party planning support. You can get some form of it third party through custom extensions with the CLI, but as an agent harness they are incomplete without it.
The Cerebras + Qwen3 Coder 480B with qwen3-cli is seriously tempting. Crazy generation speed. There's some question about how big the rate limit really is, but it's half the cost of Claude Code 5x. I haven't checked, but I know qwen3-cli, which was introduced alongside the model, is a fork of gemini-cli with Qwen-focused updates; I wonder if they landed a planning tool?
I don’t really consider Cursor, Windsurf, Cline, Roo, Kilo et al as they can’t provide a flat rate service with the kind of rate limits you can get with the aforementioned.
GitHub Copilot could be a great offering if they were willing to really compete with a good unlimited premium plan but so far their best offering has less premium requests than I make in a week, possibly even in a few days.
Would love to hear if I missed anything, or somehow missed some dynamic here worth considering. But as far as I can tell, given heavy use, you only have 3 options today: Claude Max, Gemini Code Assist, Cerebras Code.
I find there's a niche where API pay-per-use is cost effective. It's for problems that require (i) small context and (ii) not much reasoning.
Coding problems with 100k-200k context violates (i). Math problems violate (ii) because they generate long reasoning streams.
Coding problems with 10k-20k context are well suited, because they generate only ~5k output tokens. That's $0.03-$0.04 per prompt to GPT-5 under flex pricing. The convenience is worth it, unless you're relying on a particular agentic harness that you don't control (I am not).
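A quick sanity check of that figure, assuming flex pricing is roughly half the list rates quoted elsewhere in the thread ($1.25 / $10 per 1M tokens); the 50% discount is my assumption, not something stated here:

```python
in_tok, out_tok = 15_000, 5_000           # a typical small coding prompt
flex_in, flex_out = 1.25 / 2, 10 / 2      # assumed flex $/1M token rates
print(in_tok / 1e6 * flex_in + out_tok / 1e6 * flex_out)  # ~$0.034 per prompt
```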
For large context questions, I send them to a chat subscription, which gives me a budget of N prompts instead of N tokens. So naturally, all the 100k-400k token questions go there.
To achieve AGI, we will need to be capable of high fidelity whole brain simulations that model the brain's entire physical, chemical, and biological behavior. We won't have that kind of computational power until quantum computers are mature.
Both of those seem questionable, multiplying them together seems highly unlikely.
Even though every NES in existence is a physical system, you don't need physics-level simulation to create and have a playable NES system via emulation.
Maybe your point is that until we understand our own intelligence, which would be reflected in such a simulation, it would be difficult to improve upon it.
what's the next sentence i'm going to type? is it not just based on the millions of sentences i've typed and read before? even the premise of me playing devil's advocate here, that's a pattern i've learned over my entire life too.
your argument also falls apart a bit when we see emergent behavior, which has definitely happened
That's what I hear when people say stuff like this anyway.
Similar to CS folks throwing around physics 'theories'
why isn't it on https://aider.chat/docs/leaderboards/?
"last updated August 07, 2025"
The leaderboard score would come from Aider independently running GPT-5 themselves. The score should be about the same.
(I work at OpenAI.)
That's really interesting to me. Looking forward to trying GPT-5!
Doesn't look like it. Unless they add fixed pricing, Claude IMO would still be better from a developer POV.
- The permission system is broken (this is such an obvious one that I wonder if it's specific to GPT5 or my environment). If you tell Codex to ask permission before running commands, it can't ever write to files. It also runs some commands (e.g. `sed`) without asking. Once you skip sandbox mode, it's difficult to go back.
- You can't paste or attach images (helpful for design iteration)
- No built-in login flow so you have to mess with your shell config and export your OpenAI key to all terminal processes.
- Terminal width isn't respected. Model responses always wrap at some hard-coded value. Resizing the window doesn't correctly redraw the screen.
- Some keyboard shortcuts aren't supported, like option+delete to delete words (which I use often, apparently...)
This is on MacOS, iTerm2, Fish shell. I guess everyone uses Cursor or Windsurf?
GPT-5 solved the problem - which Gemini failed to solve - then failed 6 times in a row to write the code to fix it.
I then gave ChatGPT-5's problem analysis to Google Gemini and it immediately implemented the correct fix.
The lesson - ChatGPT is good at analysis and code reviews, not so good at coding.
Problem is, the models have zero idea whether they are right or wrong and always believe they are right. Which makes them useful for anything where either you do not care if the answer is actually right, or where it is somehow hard to come up with the right answer but very easy to verify whether the answer is right, and kind of useless for everything else.
I haven't tried ChatGPT on it yet, hoping to do so soon.
hmm, they should call it gpt-5-chat-nonreasoning or something.
"gpt-5-chat-latest is described by OpenAI as a non-reasoning GPT-5 variant—meaning it doesn’t engage in the extended “thinking token” process at all.
gpt-5 with reasoning_effort="minimal" still uses some internal reasoning tokens—just very few—so it’s not truly zero-reasoning.
The difference: "minimal" is lightweight reasoning, while non-reasoning is essentially no structured chain-of-thought beyond the basic generation loop."
So, at least twice the context of those.
I almost exclusively wrote and released https://github.com/andrewmcwattersandco/git-fetch-file yesterday with GPT 4o and Claude Sonnet 4, and the latter's agentic behavior was quite nice. I barely had to guide it, and was able to quickly verify its output.
https://extraakt.com/extraakts/openai-s-gpt-5-performance-co...
EDIT: It's out now
Looks like they're trying to lock us into using the Responses API for all the good stuff.
I will say I would far more appreciate an AI that when it faces these ambiguous problems, either provides sources for further reading, or just admits it doesn't know and is, you know, actually trying to work together to find a solution instead of being trained to 1 shot everything.
When generalizing these skills to say, debugging, I will often just straight up ignore the AI slop output it concluded and instead explore the sources it found. o3 is surprisingly good at this. But for hard niche debugging, the conclusions it comes to are not only wrong, but it phrases it in an arrogant way and when you push back it's actually like talking to a narcissist (phrasing objections as "you feel", being excessively stubborn, word dumping a bunch of phrases that sound correct but don't hold up to scrutiny, etc).
That's been out for a while and used their 'codex' model, but they updated it today to default to gpt-5 instead.
You can't make this up
/s
I’m not sure of the utility of being so outraged that some people made wrong predictions.
No, I don’t take them seriously, that was my point, which apparently I didn’t make clear enough.
There are billions of people. You have people who think the earth is flat. You can probably find any insane take if you look for it. Best not to let anything they tell you get to you, as you seem to have taken it to heart.
https://x.com/elonmusk/status/1953509998233104649
Anyone know why he said that?
Once OpenAI breaks out of the "App" space and into the "OS" and "Device" space, Microsoft may get absorbed into the ouroboros.
OpenAI's dependence on Microsoft currently is purely financial (investment) and contractual (exclusivity, azure hosting).