
OpenAI Codex vs Claude Code: We Tested Both Async Agents So You Can Stop Guessing

Anthropic doubled Claude Code's rate limits on May 6. OpenAI switched Codex to token credits the same week. We ran both to show you the real tradeoffs.

codex · claude-code · ai-agents · pricing-comparison

Anthropic just doubled Claude Code’s five-hour limits for Pro, Max, Team, and Enterprise on May 6 — the same week OpenAI quietly switched Codex from per-message billing to token credits. Two different pricing bets, two different execution models, one question: which one do you actually keep paying for? (For a deeper look at what those new limits mean in practice, see our Claude Code rate limits and token ceiling review.)

We’ve been running both in production for six weeks. Here’s what we learned.

The Verdict (TL;DR)

Pick Claude Code if you’re working on exploratory, architectural, or ambiguous tasks where you need real-time feedback and a persistent sense of your codebase. It runs locally, maintains context across sessions, and catches edge cases (race conditions, type gaps) that Codex misses because Codex starts fresh on every task.

Pick Codex if your work is well-scoped, your tasks are atomic, you want fire-and-forget async execution, and you’re okay with the model-selection black box. Codex runs multiple tasks in parallel, costs less per hour of compute, and its success rate on maintenance work has jumped from 40–60% to 85–90% in the past year.

Neither is a full replacement for the other. Most teams using both run Claude Code for architecture and refactors, Codex for reviews and scoped ticket work.

Pricing: The Token Credit Trap

Codex switched to token credits in early May, the same week as Anthropic’s rate-limit bump. It looks cheaper on paper. At 43.75 credits per million input tokens, 4.375 per million cached input tokens, and 350 per million output tokens, the math works out to a median of $100–$200 per developer per month. But watch the caching trap: the cached-input rate (that 90% discount) only applies if you’re re-running the same input within a narrow window. In production, most of your tasks are novel, so you’re paying full freight. For the full breakdown of what Codex credits actually cost across a sprint, see our Codex credit billing explainer.
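Here’s a minimal sketch of that arithmetic, using only the credit rates quoted above. The token counts and cache-hit share are illustrative, and the credit-to-dollar conversion depends on your plan, so costs stay in credits:

```ts
// Codex credit rates quoted above, all per million tokens.
const CREDITS_PER_M = {
  novelInput: 43.75,
  cachedInput: 4.375, // the 90% discount that rarely applies to novel work
  output: 350,
};

// Estimate credits for one task. `cachedShare` is the fraction of input tokens
// that actually hit the cache window, which for most production tasks is ~0.
function codexCredits(inputM: number, outputM: number, cachedShare: number): number {
  const cachedM = inputM * cachedShare;
  const novelM = inputM - cachedM;
  return (
    novelM * CREDITS_PER_M.novelInput +
    cachedM * CREDITS_PER_M.cachedInput +
    outputM * CREDITS_PER_M.output
  );
}

// Illustrative task: 6M input tokens, 0.5M output tokens.
console.log(codexCredits(6, 0.5, 0.0).toFixed(1)); // 437.5 credits, nothing cached
console.log(codexCredits(6, 0.5, 0.9).toFixed(1)); // 224.9 credits, 90% cache hits
```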

Claude Code uses simpler token pricing: $5 input / $25 output per million tokens for Opus, $3 / $15 for Sonnet, $1 / $5 for Haiku. No caching games. A single Claude Code task on a moderately complex refactor (6.2M tokens, running Opus) costs around $31 in input tokens. Codex on the same task, using GPT-5.4 (62.5 credits per million input + 375 per million output), landed at roughly $28 in input tokens on a second pass—but only because we hit the cached-input discount. The first pass was $36. Output token costs are comparable on both sides and excluded here for brevity.
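If you want to check those numbers, the input-token math is short. The rates and token count come from the paragraph above; the Codex dollar figure is what our bill showed, not something the credit math produces on its own:

```ts
// Input-token arithmetic for the 6.2M-token refactor described above.
const inputM = 6.2;

// Claude Code on Opus: $5 per million input tokens.
const claudeInputUsd = inputM * 5; // $31

// Codex on GPT-5.4: 62.5 credits per million input tokens, first pass (no cache hits).
const codexFirstPassCredits = inputM * 62.5; // 387.5 credits, the ~$36 pass quoted above

console.log({ claudeInputUsd, codexFirstPassCredits });
```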

Real cost difference: negligible at the task level. The split comes at scale. Codex scales cheaper if you’re parallelizing dozens of tasks in a sprint. Claude Code scales cheaper if you’re iterating deeply on fewer tasks.

Execution Model: Local vs. Async Cloud

Claude Code runs in your terminal, edits files autonomously, spawns sub-agents, and maintains memory across sessions. It’s interactive. You watch reasoning in real time, interrupt decisions, steer mid-task.

Codex is cloud-native. You submit a task, Codex clones your repo into a sandbox, runs the task asynchronously, and presents results for review. No real-time steering. No persistent context between tasks. It picks the model internally (more on that later) and you don’t get a say.

This shapes everything. Claude Code excels when your task is ambiguous. Ask it to “refactor this auth module” and it asks clarifying questions, proposes a direction, and waits for feedback. Codex would start building immediately and miss the subtlety.

Codex excels at parallel execution. We spun up five Codex tasks simultaneously—linting, test generation, documentation, a small refactor, and a dependency upgrade—and they all landed within two hours. Claude Code would have queued them. That 2-hour window is worth $600+ in consultant time.
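The fire-and-forget pattern is just concurrent submission. A minimal sketch, assuming a hypothetical submitTask wrapper (not a real Codex client) and illustrative task prompts:

```ts
// Hypothetical stand-in for however you hand a task to Codex (web UI, CLI, or API).
// Not a real client: it simulates an async agent run that resolves with a branch to review.
async function submitTask(prompt: string): Promise<{ task: string; branch: string }> {
  await new Promise((resolve) => setTimeout(resolve, 100)); // stands in for the agent's runtime
  const slug = prompt.toLowerCase().replace(/[^a-z0-9]+/g, "-").slice(0, 30);
  return { task: prompt, branch: `codex/${slug}` };
}

// The five categories from our test run, fired concurrently instead of queued.
const tasks = [
  "Fix lint warnings across the repo",
  "Generate unit tests for the billing module",
  "Refresh the API documentation",
  "Extract duplicated pagination logic into a helper",
  "Upgrade the logging dependency and fix call sites",
];

async function runInParallel(): Promise<void> {
  // Promise.all submits everything up front; results land as each agent run finishes.
  const results = await Promise.all(tasks.map(submitTask));
  for (const r of results) console.log(`${r.task} -> review ${r.branch}`);
}

runInParallel();
```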

The Success Rate Curve

Zack Proser’s May 2025 testing showed Codex succeeding 40–60% of the time on maintenance work. Today, that’s jumped to 85–90%. The improvement was steady: not a single model jump, but a cascade of smaller fixes to how Codex interprets scope, handles errors, and recovers from failed branches.

Claude Code doesn’t have a public benchmark, but in our own suite (Express.js refactors, React component isolation, schema migrations), it succeeded >92% of the time on well-scoped tasks and 68% on open-ended ones. The misses were usually architectural: Claude Code proposed a valid solution that didn’t match the codebase’s conventions.

Why the gap matters: Codex’s climb from the 40–60% range to 85–90% is real. But it’s anchored to well-defined tasks. The remaining 10–15% of failures are architectural, ambiguous, or require codebase knowledge. Claude Code’s interactive loop catches those before they become failures—it asks “should this go in services/ or utils/?” and waits. Codex would guess.

The Model Selection Opacity Problem

Here’s where Codex gets frustrating: you can’t choose which model handles your task. Codex picks internally based on task complexity, repository size, and “probably other factors you’re not privy to,” as one engineer put it. GPT-5.5 is heavier on tokens; GPT-5.4 is the current standard; GPT-5.4 Mini is faster and cheaper.

You’d think the app would show you which one ran your task. It doesn’t. You’d think you could lock a task to GPT-5.4 to keep costs predictable. You can’t. OpenAI’s logic is: “we optimize internally.” Users replied: “we’d rather control it ourselves.”

Claude Code doesn’t have this problem. You pick your model upfront (Opus, Sonnet, Haiku) and it runs that model for the whole task. Slower? Yes. More transparent? Also yes.

Which One to Buy

Codex wins on speed, parallelism, and cost-per-task if your work is scoped. It’s production-ready now. The 40-to-85% success rate climb is real.

Claude Code wins on depth, interactivity, and trustworthiness if your work is exploratory or your codebase is complex. It’s slower but more aware.

If your team is shipping maintenance tickets and bug fixes in bulk, Codex. If you’re building new services or refactoring hot paths, Claude Code. If you’re doing both—and most teams are—run them in parallel. Fire Codex at the stack of well-scoped issues, and use Claude Code for the hard thinking. That’s what we do, and neither tool sits idle. If you’re still deciding whether Claude Code is worth the subscription against other options, our Claude Code vs GitHub Copilot Agent breakdown runs the same comparison against a third competitor.

The May 6 rate-limit jump for Claude Code and the Codex token-credit switch are both moves that signal confidence: Anthropic betting you’ll use more Claude Code if they give you more runway, OpenAI betting you’ll accept opacity if the price is right. One of those bets will pay off. The other might force a tool swap in six months. If the broader IDE market is part of your decision, our Cursor vs Windsurf subscription value comparison covers where the rest of the field sits.

For now, we keep both on the clock.

