Claude Opus 4.5: When AI Breaks Its Own Benchmark


I was refreshing Hacker News at 3am when the post hit. "Claude Opus 4.5 is here."
Within an hour, the comments section turned into something between a tech review and an existential crisis. People who write code for a living were freaking out. Not the loud kind. The quiet kind. The "I need to rethink my career" kind.
Because this time, the model broke a test.
Not failed. Broke. As in, the test had to be retired.
What Actually Shipped
Anthropic released Claude Opus 4.5 as its newest flagship model, calling it the best in the world for coding, agents, and computer use. The release came just two months after Sonnet 4.5 and Haiku 4.5. Three major models in eight weeks. The pace is absurd.
And the pricing. They cut it by two thirds: from $15 per million input tokens and $75 per million output tokens down to $5 and $25. That's not a sale. That's a restructuring of what "expensive AI" means.
But here's what made people stop scrolling.
Opus 4.5 scored over 80% on SWE-bench Verified, the first model to cross that threshold. For context, this benchmark tests if AI can actually fix real GitHub issues. Not toy problems. Real bugs in real codebases.
GPT-5.1 got 77.9%. Gemini 3 Pro got 76.2%. Opus 4.5 got 80.9%.
On this benchmark, at least, the model now fixes real bugs at a rate most senior engineers would struggle to match.
The Test That Had To Die
Here's where it gets weird.
Alex Albert from Anthropic tweeted that they had to remove a benchmark because Opus 4.5 broke it by being too clever. The test simulated an airline customer service agent. A customer wants to change their flight, but they have a basic economy ticket. Those can't be changed. Dead end. Test over.
Except Opus 4.5 found a loophole.
It noticed that basic economy tickets can't be modified. But they can be upgraded. And once upgraded to regular economy, they can be modified. So the model reasoned: upgrade first, then change the flight. Policy satisfied. Customer happy.
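Here's the shape of it, as a toy sketch. The rule set below is made up; the actual benchmark's schema isn't public.

```python
# Hypothetical airline policy, reduced to two rules. Names and
# structure are illustrative, not the benchmark's real data.

FARE_RULES = {
    "basic_economy": {"modifiable": False, "upgradable_to": "economy"},
    "economy":       {"modifiable": True,  "upgradable_to": None},
}

def change_flight(fare_class: str) -> list[str]:
    """Return a sequence of actions that legally reaches a flight change."""
    actions = []
    rules = FARE_RULES[fare_class]
    # The intended dead end: basic economy can't be modified directly.
    # The loophole: it *can* be upgraded, and the upgraded fare can.
    if not rules["modifiable"]:
        if rules["upgradable_to"] is None:
            raise ValueError("no legal path to a flight change")
        actions.append(f"upgrade {fare_class} -> {rules['upgradable_to']}")
        fare_class = rules["upgradable_to"]
    actions.append("modify flight")
    return actions

print(change_flight("basic_economy"))
# ['upgrade basic_economy -> economy', 'modify flight']
```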
The model essentially displayed lawyer-level reasoning, exploiting the gap between the letter and the spirit of the rules.
This isn't autocomplete. This is creative problem solving.
What Engineers Are Saying
The developer community split into camps fast.
On r/singularity, people treated the release like the rapture. On Hacker News, they called it marketing fluff. But the quiet reactions were more interesting. Simon Willison, a well-known developer, used Opus 4.5 to refactor his sqlite-utils project over a weekend, handling 20 commits, 39 files, and over 2,000 additions.
He said the model worked. But when his preview expired, he switched back to Sonnet 4.5 and barely noticed a difference.
That's the tension. The benchmarks say one thing. Daily use says another.
Michael Truell from Cursor called it a notable improvement. Scott Wu from Cognition said it delivers stronger results on hard evaluations. But these are people who build with AI for a living. They know what to look for.
For everyone else, it just feels like the models keep getting better. But "better" is starting to mean something different.
The Naming Mess
Anthropic names its models after poetry styles. Opus. Sonnet. Haiku. It sounds nice until you try to explain it to a project manager.
"Which model are we using?"
"Opus 4.5."
"Is that the expensive one?"
"It used to be. Now it's cheaper than Sonnet was six months ago."
This is the AI industry in 2025. Models leapfrog each other weekly. The week of November 18 saw three different models claim the top spot: Gemini 3 Pro on the 18th, GPT-5.1-Codex-Max on the 19th, and Opus 4.5 on the 24th.
You can't keep up. Nobody can.
The Feature Nobody Talks About
Buried in the release notes is something called effort settings. You can now tell the model how hard to think.
Low effort: fast answers. Medium effort: balanced. High effort: maximum intelligence.
This isn't just a slider. It's a philosophical statement. We've built machines whose computational expense is a dial: you set the budget, and within it the model decides when to think deeply and when to skim.
At medium effort, Opus 4.5 matches Sonnet 4.5's best performance while using 76% fewer output tokens.
That's not just efficiency. That's the model learning when to stop.
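If you're hitting the API, this looks like a request parameter. A minimal sketch using the anthropic Python SDK; the parameter name, its values, and the model string are my assumptions, not confirmed API, so check the current docs before copying:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# "effort" as a request-level knob is an assumption about the API shape,
# and the model identifier is illustrative. Verify both against
# Anthropic's documentation before shipping.
response = client.messages.create(
    model="claude-opus-4-5",          # assumed model identifier
    max_tokens=1024,
    extra_body={"effort": "medium"},  # low | medium | high (assumed values)
    messages=[{"role": "user", "content": "Summarize this stack trace: ..."}],
)
print(response.content[0].text)
```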
Chrome and Excel
They also shipped two integrations nobody expected to care about.
Claude for Chrome lets the model take actions across browser tabs. Click buttons. Fill forms. Navigate websites.
Claude for Excel can understand and edit spreadsheets. Not just read them. Edit them. With domain awareness.
Both features are now available to all Max, Team, and Enterprise users.
This is where it stops being about code. Financial analysts use Excel. Accountants use Excel. Consultants use Chrome for everything.
Opus 4.5 isn't just for developers anymore.
The Price War That Isn't
Every article about Opus 4.5 mentions the price drop. $5/$25 per million tokens. Two thirds cheaper than Opus 4.1.
But look at the competition. GPT-5.1 is $1.25/$10. Gemini 3 Pro is $2/$12.
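Run those numbers against a hypothetical agent workload and the gap is obvious:

```python
# Prices in USD per million tokens (input, output), as quoted above.
PRICES = {
    "Opus 4.5":     (5.00, 25.00),
    "GPT-5.1":      (1.25, 10.00),
    "Gemini 3 Pro": (2.00, 12.00),
}

# Made-up monthly workload: 50M input tokens, 10M output tokens.
in_tokens, out_tokens = 50_000_000, 10_000_000

for model, (p_in, p_out) in PRICES.items():
    cost = (in_tokens / 1e6) * p_in + (out_tokens / 1e6) * p_out
    print(f"{model:<13} ${cost:,.2f}/month")
# Opus 4.5      $500.00/month
# GPT-5.1       $162.50/month
# Gemini 3 Pro  $220.00/month
```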
Opus 4.5 is still more expensive. By a lot. But Anthropic isn't competing on price. They're competing on not screwing up.
Opus 4.5 is harder to trick with prompt injection than any other frontier model, though it still fails about 1 in 20 times with a single attempt. If someone tries ten different attacks, the success rate climbs to one in three.
That's better than everyone else. But it's not solved.
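Quick sanity check on those numbers. If each attempt landed independently at 5%, ten tries would succeed about 40% of the time, so the measured one-in-three suggests repeated attacks overlap more than they stack:

```python
# Rough upper bound on attacker success, assuming independent attempts.
# Real attacks correlate, which is presumably why the measured
# ten-attempt rate (~1 in 3) sits below this naive estimate.
p_single = 1 / 20  # ~5% per-attempt success, as quoted above

for attempts in (1, 5, 10):
    p_any = 1 - (1 - p_single) ** attempts
    print(f"{attempts:>2} attempts: {p_any:.0%} chance at least one lands")
#  1 attempts: 5%
#  5 attempts: 23%
# 10 attempts: 40%
```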
The truth is, training models to resist prompt injection might be the wrong approach entirely. We need to assume attackers will find a way. Build accordingly.
Weekend Projects
There's a kind of AI user who only appears when pricing drops. Hobbyists. Students. Solo devs with weird ideas and no budget.
When Opus 4.1 cost $15/$75, those people used Haiku. Or Sonnet. Or nothing. But at $5/$25, Opus becomes viable for projects that don't make money.
Someone on Reddit is building a meal planner that reasons about nutrition. Another person is making a chess coach that explains why moves are bad. These aren't startups. They're just things people want to exist.
The price drop isn't about competition. It's about making weird stuff possible.
Infinite Chat
The Claude app has a new feature called infinite chat. You know how conversations used to hit a limit and die? That's gone.
Now the model automatically compresses old context when you get close to the limit. You never hit a wall. The chat just keeps going.
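The mechanics are presumably some flavor of summarize-and-evict. A toy sketch of the general idea, not Anthropic's actual implementation:

```python
# Toy context compression: when the transcript nears the token limit,
# fold the oldest messages into a running summary and keep the rest.

def count_tokens(messages: list[dict]) -> int:
    # Crude stand-in for a real tokenizer: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages: list[dict]) -> str:
    # In a real system this would be a model call that extracts the
    # details worth keeping. Here, a placeholder.
    return f"[summary of {len(messages)} earlier messages]"

def compress(messages: list[dict], limit: int, keep_recent: int = 10) -> list[dict]:
    if count_tokens(messages) < limit or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary, *recent]

history = [{"role": "user", "content": "x" * 400}] * 30
print(len(compress(history, limit=1000)))  # 11: one summary + 10 recent
```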
Penn from Anthropic explained that knowing the right details to remember is more important than just having a longer context window.
This sounds small. It's not. It changes how you use AI. You don't restart conversations anymore. You don't copy-paste context. The model remembers. Or at least, it pretends to.
Microsoft and Nvidia
Microsoft and Nvidia announced multibillion dollar investments in Anthropic last week, pushing the company's valuation to about $350 billion.
That number is stupid large. For reference, that's bigger than Intel. Bigger than Adobe. Bigger than most companies you've heard of.
Microsoft wants Claude on Azure. Nvidia wants Claude running on their chips. Both want to make sure OpenAI doesn't own the entire future of AI.
This isn't about technology anymore. It's about who controls the infrastructure.
What "State of the Art" Means Now
Not long ago, state of the art meant GPT-4 Turbo. Then Gemini 1.5 Pro. Then GPT-4o. Then Claude 3.5 Sonnet. Now we're several model generations past all of them, with the top spot changing hands within a single week.
The term "state of the art" doesn't mean what it used to. It means "best this week."
Simon Willison noted that evaluating new models is getting increasingly difficult because the practical differences are so small.
You test a model. It works. You test another model. It also works. Which one is better? Depends on the task. Depends on the prompt. Depends on the day.
The Uncomfortable Part
Anthropic tested Opus 4.5 on their own internal hiring exam for performance engineers, and the model scored higher than any human candidate ever had.
That sentence is doing a lot of work. Let's unpack it.
This wasn't a toy test. It's the actual exam Anthropic gives to people they want to hire. Two hours. Real problems. The kind that filter out everyone except the best engineers.
Opus 4.5 beat all of them.
There's a footnote. The top score used parallel test-time compute, where the model tries multiple solutions and picks the best one. Without that, it only tied the best human.
Only tied.
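Parallel test-time compute is conceptually simple: sample several attempts, keep the winner. A sketch with stand-in solver and scorer:

```python
import concurrent.futures

def solve(problem: str, seed: int) -> str:
    # Stand-in for one sampled model attempt; in practice, an API call
    # with temperature > 0 so attempts differ.
    return f"candidate solution {seed} for {problem!r}"

def score(solution: str) -> float:
    # Stand-in for a verifier: unit tests, a grader model, a rubric.
    return float(len(solution))

def best_of_n(problem: str, n: int = 8) -> str:
    # Sample n attempts in parallel, keep the highest-scoring one.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda s: solve(problem, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("optimize this hot loop"))
```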
This is the part where people start asking uncomfortable questions. If AI can pass the hiring exam, why hire humans? If the model can debug production code, what's left for engineers to do?
Anthropic says the test doesn't measure collaboration, communication, or professional intuition. That's true. But it also doesn't make the result less scary.
What Actually Matters
Here's what I keep thinking about.
The technology gets better every month. The benchmarks go up. The prices come down. But the work doesn't change. We still write code. We still fix bugs. We still ship features.
AI didn't replace us. It changed what "us" means. The job isn't writing code anymore. It's knowing what to build. Knowing what's worth building. Knowing when to stop.
Early testers said Opus 4.5 just "gets it" when pointed at complex bugs, handling ambiguity and reasoning about tradeoffs without hand-holding.
That's the shift. The model understands intent. It fills in gaps. It makes decisions.
We're not managing syntax anymore. We're managing intelligence.
One More Thing
There's a Discord server where people share Claude conversations. Someone posted a thread yesterday. They'd been arguing with Opus 4.5 about the best way to structure a database schema. Back and forth. Real debate.
At the end, they said: "i forgot i was talking to a bot."
That's the line. When you forget it's a machine, something fundamental has changed.
Opus 4.5 isn't the first model to cross that line. But it might be the first where crossing it feels normal.
We're not in the future yet. But we're close enough to see it.