This is not a benchmark. It is just my experience from daily use on one production codebase. For some medium-complexity tasks, I also ran both tools with the same prompts, but I did not try to make this a controlled evaluation.
TL;DR: for my production Python monolith, I still prefer Codex.
The codebase is a Python backend that has grown over many years. It has several architectural layers from different periods: a newer, experimental DDD-ish style; older but still well-structured legacy code; and very old, fragile spaghetti code.
We usually do not rewrite the old parts unless we have to; the preferred strategy is to leave them alone until they are naturally replaced or removed. This is not a simple CRUD web server. It is a complex, sometimes overcomplicated application with many A/B tests and very specific business logic in many corners.
Why I prefer Codex for this codebase:
1. Codex follows harness-engineering principles much more reliably for me (see https://openai.com/index/harness-engineering/). Claude does not follow this workflow dependably unless my AGENTS.md contains very explicit, short instructions, such as: “Read exec_plan.md and follow it.” (An illustrative excerpt follows this list.)
2. Claude more often creates new tools instead of first searching the codebase for existing ones. In this kind of codebase, reusing existing project-specific tools and patterns matters a lot.
3. Claude more often reads too little code or documentation before deciding where new functionality belongs. I frequently went through several correction rounds within the same task: “Put this functionality in module A, not in the controller. That is the right place.” “Do not construct the response object from the statuses you sent in the request. The API already returns the updated object; use that response, include it in the result, and validate that its state matches what we expect.” “No, validate it in the same module that owns this boundary.” This back-and-forth became tiring. Codex seems to have a better planning mode for this type of work: it more often notices missing context in my prompt and asks clarifying questions before making architectural changes. (A sketch of the pattern I kept asking for follows this list.)
4. New Codex/GPT model versions were released while I was testing, so my experience spans several of them. I have not yet tested GPT-5.5 on UI-heavy work, but in my experience Opus 4.6 was much better for frontend work than Codex 5.3 and GPT-5.4. For UI tasks, I currently prefer Claude.
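On point 1, here is the kind of AGENTS.md excerpt I mean. The exec_plan.md line is the one quoted above; the other lines are illustrative examples in the same spirit, not my literal file:

```markdown
# AGENTS.md (excerpt)

- Read exec_plan.md and follow it.
- Before adding a new helper, search the codebase for an existing one and reuse it.
- Run the test suite inside the Docker Compose environment before finishing a task.
```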
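And to make point 3 concrete, here is a minimal sketch of the pattern I kept asking for. All names (`Subscription`, `client.update`, the status field) are hypothetical, not from the real codebase; the point is the shape: the logic lives in the module that owns the boundary, it builds the result from the object the API returns rather than from request data, and it validates the returned state right there.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    id: str
    status: str


class SubscriptionStateError(Exception):
    pass


def update_subscription(client, subscription_id: str, new_status: str) -> Subscription:
    """Lives in the boundary module that owns subscriptions, not in a controller."""
    # The API returns the updated object; build the result from that
    # response, not from the statuses we happened to send in the request.
    payload = client.update(subscription_id, status=new_status)
    updated = Subscription(id=payload["id"], status=payload["status"])

    # Validate the returned state here, in the same module that owns this boundary.
    if updated.status != new_status:
        raise SubscriptionStateError(
            f"expected status {new_status!r}, API returned {updated.status!r}"
        )
    return updated
```

The anti-pattern was the opposite on both counts: a controller that built `Subscription(id=subscription_id, status=new_status)` straight from the request and never checked what the API actually did.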
Skills and MCP: I use only one shared skill for both LLMs, containing commands to start and stop the Docker Compose environment and run the tests inside it. (A minimal sketch is below.)
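For what it is worth, the skill boils down to standard Docker Compose commands along these lines; `app` is a placeholder for the real service name:

```bash
# Start the environment in the background.
docker compose up -d

# Run the tests inside the app container.
docker compose exec app pytest

# Stop the environment and clean up.
docker compose down
```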