Here's a sentence that would have seemed absurd two years ago: An AI model just outscored every human candidate who applied for an engineering job at one of the world's top AI companies.
That's what Anthropic claims happened with Claude Opus 4.5. Released November 24, 2025, it's the first model to exceed 80% on SWE-bench Verified — the benchmark that measures whether AI can actually fix real bugs in real codebases.
Let's break down what this means, what it doesn't mean, and why the "beat human engineers" claim needs more context than the headlines give it.
The Numbers: Where Opus 4.5 Actually Lands
SWE-bench Verified is the industry's most rigorous test for AI coding ability. It presents models with actual GitHub issues from popular open-source projects and asks them to generate patches that fix the bugs. No multiple choice. No synthetic problems. Real code, real bugs, real tests.
The jump from Claude 3.5 Sonnet (49.0%) to Opus 4.5 (80.9%) is a roughly 65% relative improvement. That's not iterative progress. That's a step change.
"Claude Opus 4.5 achieves 80.9% on SWE-bench Verified, making it the first model to exceed the 80% threshold. On our internal SWE-bench hiring test, it surpassed the performance of all human engineering candidates who have taken the test."
The "Beat Human Engineers" Claim — Let's Be Precise
The headline sounds dramatic: AI beats all human engineers. But precision matters here.
Anthropic gave Claude Opus 4.5 the same SWE-bench-style test they give to engineering candidates during their hiring process. The model scored higher than every human who has taken that specific test.
What this means:
- On a specific benchmark (SWE-bench-style bug fixing), the model outperforms human candidates.
- This measures one dimension of engineering ability — diagnosing and patching isolated bugs.
- It doesn't measure system design, requirements gathering, collaboration, debugging production issues with incomplete information, or any of the other things software engineers actually do.
What this doesn't mean:
- Claude can replace a software engineer.
- AI is "better" than human engineers in any general sense.
- You should fire your developers and let Claude take over.
The Real Story
Benchmarks measure what benchmarks measure. SWE-bench tests whether a model can read a bug report and produce a working patch. That's valuable. But it's not software engineering.
Real engineering involves understanding why a bug matters, whether fixing it creates other problems, how the fix affects the broader system, and whether the "bug" is actually a feature someone depends on. No benchmark captures that.
The honest framing: Claude Opus 4.5 is the best AI tool for automated bug fixing we've ever seen. That's significant. It's just not "AI replaces engineers."
New Features Worth Knowing
Beyond the benchmark numbers, Opus 4.5 introduces several architectural changes:
The "Effort" Parameter
You can now control how much compute Claude uses per request. Set effort to "low" for quick responses, "medium" for balanced performance, or "high" when you need maximum reasoning capability. This lets you optimize for cost on simple tasks while throwing full compute at complex problems.
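Here's a minimal sketch of how that might look with Anthropic's Python SDK. The `effort` field and its values are assumptions based on the description above, passed through the SDK's generic `extra_body` escape hatch; check the official docs for the real parameter name and placement.

```python
# Sketch only: the "effort" field is an assumption based on Anthropic's
# description, sent via extra_body; the real API parameter may differ.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, effort: str = "medium") -> str:
    """Single-turn request that trades cost for reasoning depth via `effort`."""
    response = client.messages.create(
        model="claude-opus-4-5",        # model ID assumed; confirm in the docs
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        extra_body={"effort": effort},  # "low" | "medium" | "high"
    )
    return response.content[0].text

# Cheap call for a trivial question, full compute for a hard one.
print(ask("What does HTTP status 409 mean?", effort="low"))
print(ask("Why might this async job queue deadlock under load? ...", effort="high"))
```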
Hybrid Reasoning
Opus 4.5 combines extended thinking (like o1's approach) with Claude's standard inference. The model can "think" through multi-step problems when needed, but doesn't waste compute on simple questions. Anthropic reports this makes it competitive with reasoning-specialized models like o1 while maintaining Claude's general-purpose capabilities.
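Claude's existing extended-thinking API exposes this as a per-request `thinking` block with a token budget. Assuming Opus 4.5 keeps that interface (and the model ID below is an assumption), selective use looks roughly like this:

```python
import anthropic

client = anthropic.Anthropic()

# Extended thinking is opt-in per request: enable it for multi-step problems,
# skip it for simple lookups so you don't pay for reasoning tokens you don't need.
response = client.messages.create(
    model="claude-opus-4-5",  # model ID assumed; confirm against Anthropic's docs
    max_tokens=4096,          # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{
        "role": "user",
        "content": "Refactor this recursive parser to be iterative and explain why the two versions are equivalent: ...",
    }],
)

# The response interleaves "thinking" blocks with the final "text" answer;
# print only the user-facing text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```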
Infinite Chat (Beta)
A new feature that maintains context across arbitrarily long conversations. The technical details haven't been fully disclosed, but early reports suggest it uses a combination of summarization and retrieval rather than simply expanding the context window.
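For a sense of what the summarization half of that pattern looks like, here is an illustrative sketch, not Anthropic's implementation: keep recent turns verbatim and periodically fold older turns into a rolling summary that rides along in the system prompt. All names, thresholds, and the model ID are made up for the example.

```python
# Illustrative only: a generic summarize-and-carry-forward loop, not
# Anthropic's actual "infinite chat" mechanism.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"    # assumed model ID
MAX_RECENT_MESSAGES = 20     # fold history once the raw transcript exceeds this

summary = ""                 # rolling compressed history
recent: list[dict] = []      # verbatim recent turns (alternating user/assistant)

def chat(user_text: str) -> str:
    global summary, recent

    # Fold completed exchanges into the rolling summary once history gets long.
    if len(recent) >= MAX_RECENT_MESSAGES:
        old, recent = recent[:-8], recent[-8:]  # keep the last 4 exchanges verbatim
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        folded = client.messages.create(
            model=MODEL,
            max_tokens=512,
            messages=[{"role": "user",
                       "content": "Summarize this conversation in a few bullet points:\n\n" + transcript}],
        )
        summary = (summary + "\n" + folded.content[0].text).strip()

    recent.append({"role": "user", "content": user_text})
    system = f"Summary of the earlier conversation:\n{summary}" if summary else "You are a helpful assistant."
    reply = client.messages.create(model=MODEL, max_tokens=1024, system=system, messages=recent)
    recent.append({"role": "assistant", "content": reply.content[0].text})
    return reply.content[0].text
```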
Computer Use Improvements
Opus 4.5 scores 66.3% on OSWorld (computer use benchmark), up from Claude 3.5's 22%. That's a 3x improvement in the model's ability to navigate UIs, click buttons, and complete tasks across applications.
The Price Drop Nobody Expected
Anthropic cut Opus pricing from $15/$75 per million input/output tokens to $5/$25, a roughly 66% reduction, while dramatically increasing capability. That's unusual in AI: most frontier models get more expensive, not less.
The economics here matter. At $5/$25, Opus 4.5 costs about the same as GPT-4 Turbo did in 2024. For a model that beats GPT-5.1-Codex-Max on coding benchmarks, that's aggressive pricing.
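To make that concrete, here is a back-of-the-envelope comparison. The token counts are invented for illustration; the per-million-token prices are the new $5/$25 and the previous Opus pricing of $15/$75.

```python
# Back-of-the-envelope cost math. Token counts are invented for illustration;
# prices are per million tokens (input, output).
PRICES = {
    "Opus 4.5 (new)":          (5.00, 25.00),
    "Previous Opus pricing":   (15.00, 75.00),
}

def session_cost(input_tokens: int, output_tokens: int, prices: tuple[float, float]) -> float:
    in_price, out_price = prices
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical agentic bug-fix session: 400K tokens read, 60K tokens generated.
for name, p in PRICES.items():
    print(f"{name}: ${session_cost(400_000, 60_000, p):.2f}")
# Opus 4.5 (new): $3.50
# Previous Opus pricing: $10.50
```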
Possible explanations:
- Efficiency gains: Anthropic may have achieved significant inference optimization.
- Competitive pressure: DeepSeek's $6M training costs and open-source releases have compressed margins industry-wide.
- Market share play: Anthropic may be pricing to gain developer adoption, betting on lock-in.
Where It Fits in the Competitive Landscape
The AI coding tool market as of December 2025:
| Model | SWE-bench Verified | Price per 1M tokens (Input/Output) | Context Window |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | $5 / $25 | 200K |
| GPT-5.1-Codex-Max | 77.9% | $10 / $30 | 128K |
| Gemini 3 Pro | 76.2% | $7 / $21 | 2M |
| DeepSeek V3.2 | 74.8% | $0.55 / $2.19 | 128K |
Claude Opus 4.5 leads on SWE-bench while maintaining competitive pricing. DeepSeek remains the cost leader by a massive margin. Gemini 3 Pro offers the largest context window. Each has trade-offs.
What This Actually Means For You
If you're a developer using AI coding tools:
- Bug fixing workflows — Opus 4.5 is now the strongest option for automated bug diagnosis and patching. If you're building pipelines that automatically triage and fix issues, this should be your default model (a minimal sketch of that loop follows this list).
- Complex reasoning tasks — The hybrid reasoning architecture makes it competitive with o1 while remaining a general-purpose model. You don't need separate models for reasoning vs. coding.
- Cost optimization — The effort parameter lets you balance quality vs. cost per request. Use low effort for simple tasks, high effort for complex ones.
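As a concrete example of the bug-fixing workflow mentioned above, here is a minimal sketch of one pipeline step: hand the model an issue plus the suspect file, ask for a unified diff, and gate the result on the test suite. The function names, prompt wording, and model ID are all illustrative, not a prescribed interface.

```python
# Minimal sketch of one step in an automated triage-and-fix pipeline.
# Function names, prompt wording, and model ID are illustrative only.
import subprocess
import anthropic

client = anthropic.Anthropic()

def propose_patch(issue_text: str, file_path: str) -> str:
    """Ask the model for a unified diff that fixes the reported bug."""
    source = open(file_path).read()
    response = client.messages.create(
        model="claude-opus-4-5",  # model ID assumed; confirm against the docs
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Fix the bug described below. Reply with a unified diff only.\n\n"
                f"### Issue\n{issue_text}\n\n"
                f"### File: {file_path}\n{source}"
            ),
        }],
    )
    return response.content[0].text

def patch_passes_tests(diff: str) -> bool:
    """Apply the proposed diff and run the tests; never merge an unverified patch."""
    subprocess.run(["git", "apply"], input=diff, text=True, check=True)
    return subprocess.run(["pytest", "-q"]).returncode == 0
```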
If you're building AI-powered developer tools:
- Benchmark leadership matters — Users increasingly compare tools based on which model powers them. Being on Opus 4.5 is now a competitive advantage.
- The price/performance ratio shifted — At $5/$25, you can offer more AI-heavy features without crushing your unit economics.
The Honest Assessment
Claude Opus 4.5 is genuinely impressive. A score of 80.9% on SWE-bench Verified represents a meaningful capability jump. The price drop makes it accessible. The hybrid reasoning architecture is elegant.
But let's not lose our heads.
This is a better tool for specific tasks. It's not artificial general intelligence. It's not replacing software engineers. It's not the end of human programming.
It's an AI that's really good at reading bug reports and generating patches. That's useful. That's valuable. That's what it is.
The hype cycle wants every release to be revolutionary. The reality is that AI capabilities are improving steadily, with occasional step changes like this one. Progress is real. Revolution is marketing.
Use the tool. Understand its limits. Don't believe anyone who tells you it's more than it is — or less than it is.