Codex vs. Claude Code (today)
59 points
3 hours ago
| 16 comments
| build.ms
| HN
willaaam
2 hours ago
[-]
This blog post lacks almost any form of substance.

It could've been shortened to: Codex is more hands-off; I personally prefer that over Claude's more hands-on approach. Neither is bad. I won't bring you proof or examples, this is just my opinion based on my experience.

reply
mergesort
26 minutes ago
[-]
Heya, author here! Admittedly this was a quick blog post I fired off, much shorter than my usual writing.

My goal wasn't to create a complete comparison of both tools — but to provide a little theory about a behavior I'm seeing. You're (absolutely) right that it's a theory, not a study, and I made sure to state that in the post. :)

Mostly though, the conclusion describes pretty succinctly why I wrote the post: as a way to get more people to try more of these tools so they can adequately form their own conclusions.

> I think back to coworkers I’ve had over the years, and their varying preferences. Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem. Others would dive right in and prototype to learn about the space they would be operating in.

> The tools we use to build are moving fast and hard to keep up with, but we’ve been blessed with a plethora of choices. The good news is that there is no wrong choice when it comes to AI. That’s why I don’t dismiss people who live in Claude Code, even though I personally prefer Codex.

> The tool you choose should match how you work, not the other way around. If you use Claude, I’d suggest trying Codex for a week to see if maybe you’re a Codex person and didn’t know it. And if you use Codex, I’d recommend trying Claude Code for a week to see if maybe you’re more of a Claude person than you thought.

> Maybe you’ll discover your current approach isn’t the best fit for you. Maybe you won’t. But I’m confident you’ll find that every AI tool has its strengths and weaknesses, and the only way to discover what they are is by using them.

reply
adastra22
14 minutes ago
[-]
It’s funny because my use of Claude Code is the opposite. I use slash commands with instructions to find context, and basically never interact with it while it is doing its thing.
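
(For anyone unfamiliar: a custom slash command in Claude Code is just a markdown prompt file. A rough sketch of the kind I mean, with the file name and contents invented for illustration, at .claude/commands/investigate.md:

    Before writing any code, investigate the area described in $ARGUMENTS:
    read the relevant modules, their tests, and any docs you find, then
    write a short plan. Implement the plan, run the tests, and summarize
    what changed. Only stop to ask me questions if something is genuinely
    ambiguous.

Claude Code substitutes whatever follows /investigate for $ARGUMENTS and runs with it.)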
reply
deepdarkforest
2 hours ago
[-]
> Codex is more hands off, I personally prefer that over claude's more hands-on approach

Agree, and it's a nice reflection of the individual companies' goals. OpenAI is about AGI, and they have insane pressure from investors to show that that is still the goal. Hence Codex: when it works they can say "look, it worked for 5 hours!", discarding the fact that 90% of the time it's just pure trash.

While Anthropic/Boris is more about value now, more grounded/realistic, providing a more consistent, hence more trustable/intuitive, experience that you can steer. (Even if Dario says the opposite.) The ceiling/best-case scenario of a Claude Code session is maybe a bit lower than Codex's, but with less variance.

reply
dworks
1 hour ago
[-]
Well, if you had tried using GPT/Codex for development, you would know that the output from those 5 hours would not be 90% trash; it would be close to 100% pure magic. I'm not kidding. It's incredible as long as you use a proper analyze-plan-implement-test-document process.
reply
cube2222
1 hour ago
[-]
I checked out Codex after the glowing reviews here around September/October and it was, all in all, a letdown (this was writing greenfield modules in a larger existing codebase).

Codex was very context-efficient, but also slow (even though I used the highest thinking effort), and it didn't adapt to the wider codebase almost at all (even when I pointed it at the files to reference / get inspired by). Lots of defensive programming, hacky implementations, not adapting to the codebase style and patterns.

With Claude Code and starting each conversation by referencing a couple existing files, I am able to get it to write code mostly like I would’ve written it. It adapts to existing patterns, adjusts to the code style, etc. I can steer it very well.

And now with the new cheaper, faster Opus it's also quite an improvement. If you kicked off Sonnet with a long list of constraints (e.g. 20), it would often ignore many of them. Opus is much better at "keeping more in mind" while writing the code.

Note: yes, I do also have an agent.md / claude.md. But I also heavily rely on warming the context up with some context dumping at conversation starts.
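
(Concretely, the "context dumping" is just a first message along these lines; the file paths are made up for illustration:

    Before writing any code, read internal/billing/invoice.go and
    internal/billing/invoice_test.go to see how we structure services,
    errors, and table-driven tests. Follow those patterns for the new
    credit-note module, and ask before adding any new dependency.

Nothing fancy, but it gets the existing style and patterns into context before the real task.)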

reply
throwaway12345t
1 hour ago
[-]
All Codex conversations need to be caveated with the model used, because it varies significantly. Codex requires very little tweaking, but you do need to select the highest-thinking model if you're writing code, and I recommend the highest-thinking NON-code model for planning. That's really it; it takes task time up to 5-20 minutes, but it's usually great.
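
(If it helps, the knobs I mean live in Codex's config. A rough sketch of my setup; the exact keys can shift between Codex CLI versions, so check the docs for yours:

    # ~/.codex/config.toml
    model = "gpt-5.2-codex"          # coding runs
    model_reasoning_effort = "high"  # maximum thinking while implementing

For planning I switch over to the non-code model, keeping the reasoning effort on high.)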

Then I ask Opus to take a pass and clean things up to match the codebase specs, and that's usually sufficient. Most of what I do now is write detailed briefs for Codex, which is…fine.

reply
dworks
22 minutes ago
[-]
I jump between a ChatGPT window and a VSCode window with the Codex plugin. I'll create an initial prompt in ChatGPT, which asks the coding agent to audit the current implementation, then draft an implementation plan. The plan bounces between Chat and Codex about 5 times, with Chat telling Codex how to improve it. Then Codex implements and creates an implementation summary, which I give to Chat. Chat then asks for a couple of additional fixes, and then it's done.
reply
IgorPartola
41 minutes ago
[-]
Why non-thinking model? Also 5-20 minutes?! I guess I don’t know what kind of code you are writing but for my web app backends/frontends planning takes like 2-5 minutes tops with Sonnet and I have yet to feel the need to even try Opus.
reply
thejazzman
13 minutes ago
[-]
In my experience Sonnet > Opus, so it's no surprise you don't "need" Opus. They charge a premium on Sonnet now instead.
reply
N_Lens
2 hours ago
[-]
A lot of (carefully hedged) pro-Codex posts on HN read as suspect to me. I've had mixed results with both CC and Codex, and these kinds of glowing reviews have the air of marketing rather than substance.
reply
mergesort
23 minutes ago
[-]
Heya, I'm the author! I can promise you that I am 0% affiliated with OpenAI — and I have no qualms about calling them out for the larger moral, ethical, and societal questions that have emerged from the strategy they've set out on.

I do earnestly believe their models are currently the best for software developers to work with, but as I state in my post, this is the state of the world today and I have no expectation of it being true forever.

Same questions apply to Anthropic, Google, etc, etc — I'm not paid by anyone to say anything.

reply
jstummbillig
1 hour ago
[-]
If only fair comparisons weren't so costly, in both time and money.

For example, I have a ChatGPT and a Gemini subscription, and thus could somewhat quickly check out their products, and I have looked at a lot of the various Google AI dev ventures, but I have not yet found the energy/will to get more into Gemini CLI specifically. Antigravity with Gemini 3 pro did some really wonky stuff when I tried it.

I also have a Windsurf subscription, which allows me to try out pretty much any frontier model for coding (well, most of the time, unless there's some sort of company beef going on). I have often used this to check out Anthropic models, with much less success than Codex with GPT-5.1 and later – but of course, that's without using Claude Code (which I subscribed to for a month, idk, 6 months ago; it seemed fine back then, but not mind-blowingly so).

Idk! Codex (mostly using the vscode extension) works really well for me right now, but I would assume this is simply true across the board: everything has gotten so much better. If I had to put my finger on what feels best about Codex right now, specifically: the fewest oversights and mistakes when working on gnarly backend code, with the amount of steering I am willing to put into it, mostly working off of 3-4 paragraph prompts.

reply
baq
1 hour ago
[-]
I've been using frontier Claude and GPT models for a loooong time (all of 2025 ;)) and I can say anecdotally that the post is 100% correct. GPT Codex, given good enough context and harness, will just go. Claude is better at interactive develop-test-iterate because it's much faster to get a useful response, but it isn't as thorough and/or fills in its context gaps too eagerly, so it needs more guidance. Both are great tools and they complement each other.
reply
pitched
2 hours ago
[-]
The usage limits on Claude have been making it too hard to experiment with. Lately, I get about an hour a day before hitting session/weekly limits. With Codex, the limits are higher than my own usage so I never see them.

Because of that, everyone who is new to this will be focused on Codex and write their glowing reviews of the current state of AI tools in that context.

reply
sebzim4500
1 hour ago
[-]
For what it's worth I just switched from claude code to codex and have found it to be incredibly impressive.

You can check my history to confirm I criticize sama far too much to be an OpenAI shill.

reply
mold_aid
1 hour ago
[-]
Yeah. I can excuse bad writing, I can tolerate evangelism. I don't have patience for both.
reply
thedelanyo
2 hours ago
[-]
Exactly my thoughts. Most of these posts are what I'd call "paid posts".
reply
lmeyerov
2 hours ago
[-]
I've been using Claude Code for most of the year, and Codex since soon after it was released:

It's important to separate vibes coding from vibes engineering here. For production coding, I create fairly strict plans -- not details, but sequences, step requirements, and documented updating of the plan as it goes. I can run the same plan in both, and it's clear that Codex is poor at instruction following, because I see it go off-plan most of the time. At the same time, it can go pretty far on its own in an undirected way.
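
(For a sense of what "strict" means here, a single plan step looks roughly like this; the structure is real, the contents are invented for illustration:

    ## Step 3: rate-limit the ingest endpoint
    Requires: step 2 merged; feature flag in place
    Done when: load test returns 429s past the threshold; docs updated
    Status: NOT STARTED  <- must be updated before moving to step 4

The Status line is the "documented updating of the plan" part.)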

The result is when I'm doing serious planned work aimed for production PRs, I have to use Claude. When it's experimental and I don't care about quality but speed and distance, such as for prototyping or debugging, codex is great.

Edit: I don't think codex being poor at instruction following is inherent, just where they are today

reply
AbrahamParangi
2 hours ago
[-]
Respectfully, I don't think the author appreciates that the configurability of Claude Code is its performance advantage. I would much rather just tell it what to do and have it go do it, but I am much more able to do that with a highly configured Claude Code than with Codex, which is pretty much just set at its out-of-the-box quality level.

I spend most of my engineering time these days not on writing code or even thinking about my product, but on Claude Code configuration (which is portable so should another solution arise I can move it). Whenever Claude Code doesn’t oneshot something, that is an opportunity for improvement.

reply
mergesort
4 minutes ago
[-]
Heya, I'm the author of the post and I just wanted to say I do appreciate the configurability! As I mentioned in the post, I have been that kind of developer in the past.

> This is a perfect match for engineers who love configuring their environments. I can’t tell you how many full days of my life I’ve lost trying out new Xcode features or researching VS Code extensions that in practice make me 0.05% more productive.

And I tried to be pretty explicit about the idea that this is a very personal choice.

> Personally — and I do emphasize this is a personal decision — I'd rather write a well-spec'd plan and go do something else for 15 minutes. Claude's Plan Mode is exceptional, and that's why so many people fall in love with Claude once they try it.

For every person who feels like me today, there's someone who feels like you out there. And for every person who feels like you, there's someone like me (today) who doesn't find it as valuable to their workflow. That's the reason my conclusion was all about getting folks to try out both to see what works for them — because people change and it's worth finding out who you really are at this moment in time.

Anyhow, I do think that Codex is also very configurable — I was just trying to emphasize that it's really great out of the box, while Claude Code requires more tuning. But that tuning makes it more personal, which as you mention is a huge plus! As I've touched on in a few posts [1] [2], Skills are a big deal to me, because they allow people to achieve high levels of customization without having to be the kind of developer who devotes a lot of time to creating their perfect setup. (Now supported in both Claude Code and Codex.)

I don't want this to turn into a bit of a ramble so I'll just say that I agree with you — but also there's a lot of nuance here because we're all having very personal coding experiences with AI — so it may not entirely sound like I agree with you. :)

Would love to hear more about your specific customizations, to make sure that I'm not missing out on anything valuable. :D

[1]: https://build.ms/2025/10/17/your-first-claude-skill/
[2]: https://build.ms/2025/12/1/scribblenauts-for-software/

reply
monerozcash
2 hours ago
[-]
Hey, I'm not very familiar with Claude Code. Can you explain what configuration you're referring to?

Is this just things like skills and MCPs, or something else?

reply
CharlesW
1 hour ago
[-]
Skills, MCPs, /commands, agents, hooks, plugins, etc. I package https://charleswiltgen.github.io/Axiom/ as an easily-installable Claude Code plugin, and AFAICT I'm not able to do that for any other AI coding environment.
reply
dist-epoch
1 hour ago
[-]
OpenCode and Pi are even more configurable.
reply
btbuildem
54 minutes ago
[-]
I don't think the comparison to programming languages holds, or only very tenuously at best. Coding assistants evolve constantly; you can't even talk about "Codex" without specifying the time range (i.e., Codex 2025-10), because it's different from quarter to quarter. Same with CC.

I believe this is the main source of disagreement / disappointment when people read opinions / reviews, then proceed to have an experience very different from expected.

Ironically, this constant improvement/evolution erodes product loyalty -- personally, I'm a creature of habit and will stay with a tool past its expiry date; with coding assistants / SOTA LLMs, I cancel and switch subscriptions all the time.

reply
motoboi
2 hours ago
[-]
It's hard to compare the two tools because they change so much and so fast.

Right now, as an example, Claude Code with Opus 4.5 is a beast, but before that, with Sonnet 4.0, Codex was much better.

Gemini CLI, on the other hand, with gemini-flash-3.0 (which is strangely good for a "small and fast" model), is very good (though the CLI and the user experience are not on par with Codex or Claude yet).

So we need to keep those tools under constant observation. Currently (after gemini-flash-3.0 came out), I tend to submit the same task to Claude (with Opus) and to Gemini to understand the behaviour. Gemini is surprising me.

reply
funnyfoobar
1 hour ago
[-]
The process you have described for Codex is scary to me personally.

It takes only one extra line of code in my world (finance) to have catastrophic consequences.

Even though I am using these tools like Claude/Cursor, I make sure to review every small bit they generate. I ask for a plan with steps, then have it perform each step and ask me for feedback; only when I give approval/feedback does it either proceed to the next step or iterate on the previous one. On top of that, I manually test everything I send for PR.

Because there is no value in just sending a PR vs. sending a verified/tested PR.

With that said, I'm not sure how much of your code is getting checked in without supervision, as it's very difficult for people to review weeks' worth of work at a time.

Just my 2 cents.

reply
cherryteastain
1 hour ago
[-]
I think the author glosses over the real reason why tons of people use Codex over CC: limits. If you want to use CC properly you must use Opus 4.5, which is not even included in the Claude Pro plan. Meanwhile you can use Codex with gpt-5.2-codex on the ChatGPT Plus plan for some seriously long sessions.

Looks like Gemini has even more generous limits on the equivalently priced plan (Google AI Pro). I'd be interested in the experiences of people who've used Google Antigravity/Gemini CLI/Gemini Code Assist for nontrivial tasks.

reply
Tiberium
1 hour ago
[-]
A small correction: Opus 4.5 is included in the Pro plan nowadays, but yeah, the usage limits for it on the $20 sub are really, really low.
reply
Maxious
52 minutes ago
[-]
Both Claude Pro and Google Antigravity free tier have Opus 4.5
reply
oldandboring
31 minutes ago
[-]
Personally, I bit the bullet and went with the Max plan for Claude Code. After tax it costs me $108, less than I earn from one billable hour. I have been punishing it for the last two months; it defaults to Opus 4.5, and while I occasionally hit my session limit (it resets after an hour or so), I can't even scratch the surface of my monthly usage limit.
reply
sourcecodeplz
1 hour ago
[-]
Opus IS included in the Pro plan.
reply
throwawaybla73
1 hour ago
[-]
Opus 4.5 is included in the Pro plan.
reply
cherryteastain
58 minutes ago
[-]
Thanks for the correction, looks like I misremembered. But limits are low enough even with Sonnet that, I imagine, you can barely do anything serious with Opus on the Pro plan.
reply
ChicagoDave
1 hour ago
[-]
Spec dev can certainly be effective, but having used Claude Code since its release, I’ve found the pattern of continuous refactoring of design and code produces amazing results.

And I’ll never use OpenAI dev tools because the company insists on a complete absence of ethical standards.

reply
sixhobbits
2 hours ago
[-]
This is an interesting opinion but I would like to see some proof or at least more details.

What plans are you using, what did you build, what was the output from both on similar inputs, what's an example of a prompt that took you two hours to write, what was the output, etc?

reply
Rperry2174
2 hours ago
[-]
I've noticed a lot of these posts tend to go Codex vs. Claude, but as the author is someone who does AI workshops, I'm curious why Cursor is left out of this post (and of posts like this more generally).

From my personal experience, I find Cursor to be much more robust because rather than "either/or" it's both, and I can switch depending on the time, the task, or whatever the newest model is.

It feels like the same way people often try to avoid "vendor lock-in" in the software world, Cursor allows that freedom. But maybe I'm on my own here, as I don't see it come up naturally in posts like these as much.

reply
tin7in
1 hour ago
[-]
Speaking from personal experience and from talking to other users: the vendors' own agents/harnesses are just better, and they are customized for their own models.
reply
Rperry2174
1 hour ago
[-]
What kinds of tasks do you find this to be true for? For a while I was using Claude Code inside of the Cursor terminal, but I found it to be basically the same as just using the same Claude model in Cursor.

Presumably the harness can't be doing THAT much differently, right? Or rather, which of the tasks that are the harness's responsibility could differentiate one harness from another?

reply
oldandboring
27 minutes ago
[-]
I feel you brother/sister. I actually pay for Claude Code Max and also for the $20/mo Cursor plan. I use Claude Code via the VSCode extension running within the Cursor IDE. 95% of my usage is Claude Code via that extension (or through the CLI in certain situations) but it's great having Cursor as a backup. Sometimes I want to have another model check Claude's work, for example.
reply
dist-epoch
1 hour ago
[-]
GitHub Copilot also allows you to use both models, Codex and Claude, with Gemini on top.

Cursor has this "tool for kids" vibe; it's also more about the past ("tab, tab, enter" low-level coding) versus the future ("implement task 21" high-level delegating).

reply
CjHuber
2 hours ago
[-]
I do feel like the Codex CLI is quite a bit behind CC. If I recall correctly, it took months for Codex to get the nice ToDo tool Claude Code uses in memory to structure a task into substeps. Also, I very much miss the ability to have the main agent invoke subagents.

All of this can of course be added using MCPs, but it's still friction. The Claude Code SDK is also way better than OpenAI Agents; there's almost no comparison.

Also in general when I experienced bugs with Codex I was always almost sure to find an open GitHub issue with people already asking about a fix for months.

Still, I like GPT-5.2 very much for coding and general agent tasks, and there is Every Code, a nice fork of Codex that mitigates a lot of its shortcomings.

reply
frwickst
2 hours ago
[-]
You can use Every Code [1] (a Codex fork) for this; it can invoke agents, and not just Codex ones but Claude and Gemini as well.

[1] https://github.com/just-every/code

reply
CjHuber
2 hours ago
[-]
Seems like you wrote that at the same time I made my edit. Yes, Every Code is great; however, Ctrl+T is important for terminal rendering, otherwise it has performance problems for me.
reply
sumedh
2 hours ago
[-]
> with people already asking about a fix for months.

OpenAI needs to get access to Claude Code to fix them :)

reply
dist-epoch
1 hour ago
[-]
The general consensus today is that the ToDo tool is obsolete and lowers performance for frontier models (Opus 4.5, GPT-5.2).
reply
pshirshov
1 hour ago
[-]
On hard projects (really hard, like https://github.com/7mind/jopa), Codex fails spectacularly. The only competition is Claude vs Gemini 3 Pro.
reply
oldandboring
24 minutes ago
[-]
I must be doing something wrong. When I last tried to use Codex 5.2 (via Cursor), no amount of prompting could get it to stop aggressively asking me for permission to do things. This seems to be the opposite of the article's claim, which is that Codex is better for long-running, hands-off tasks.
reply
mergesort
14 minutes ago
[-]
Heya, I'm the author of the post! This was probably unintentional but I think you're making a really valuable observation that will be helpful to others.

The models Cursor provides in their product are intermediated versions of the models that companies like OpenAI and Anthropic offer. You are technically using Codex, but not in the way you would be if you were in a tool like Codex CLI or Claude Code.

If you ask Cursor to solve a tough problem, Cursor will break the problem down into a different problem before sending that request to OpenAI to use Codex. They do this for two reasons:

1. To save money. By restructuring the prompt they can use fewer tokens, which saves them money, since they are the ones paying for the tokens out of your subscription cost.

2. [Based on things the Cursor team has said] They believe they can construct a better intermediate prompt that is more representative of the problem you want to solve.

This extra level of abstraction means that you are not getting the best results when you use a tool like Cursor. OpenAI and Anthropic are running their harnesses, Codex CLI and Claude Code, at a loss (because VC), but providing better results. This is not the best way to make money, but it's a great way to build mindshare and hopefully win customers for life. (People are fickle and cheap though, so I doubt this is a customers-for-life strategy the way people buy the same brand of deodorant once they start buying Dove.)

Happy to answer any questions you may have, but mostly I would highly suggest trying out Codex CLI and Claude Code to get a better feel for what I'm saying — and also to get more out of your AI tools. :)

reply
veidr
42 minutes ago
[-]
I tried so hard to make Codex work, after the glowing reviews (not just from Internet randos/potential-shills, though; people I know well, also).

It's objectively worse for me on every possible axis than Claude Code. I even wondered if maybe I was on some kind of shadow-ban nerf-list for making fun of Sam Altman's WWDC outfit in a tweet 20 years ago. (^_^)

I don't love Claude's over-exuberant personality, and prefer Codex's terse (arguably sullen) responses.

But they both fuck up often (as they all do), and unlike Claude Code (Opus, always), Codex has been net-negative for me. I'm not speed-sensitive, I round-robin among a bunch of sessions, so I use the max thinking option at all times, but Codex 5.1 and 5.2 for me just produce worse code, and worse than that, are worse at code review, to the point that it negated whatever gains I had gotten from it.

While all of them miss a ton of stuff (of course), and LLM code review just really isn't good unless the PR is tiny — Claude just misses stuff (fine; expected), while Codex comes up with plausible edge-case database query concurrency bugs that I have to look at, and squint at, and then think "hmm, fuck" and manually google with kagi.com for 30 minutes (LIKE AN ANIMAL), only to conclude "yeah, not true, you're hallucinating, bud", to which Codex is just like, "Noted; you are correct. If you want, I can add a comment to that effect, to avoid confusion in future."

So for me, head-to-head, Claude murders Codex — and yet I know that isn't true for everybody, so it's weird.

What I do like Codex for is reviewing Claude's work (and of course I have all of them review my own work, why not?). Even there, though, Codex sometimes flags nonexistent bugs in Claude's code — less annoying, though, since I just let them duke it out, writing tests that prove it one way or the other, and don't have to manually get involved.

reply