Data & Evidence

Benchmarks

Every number, every test, every methodology note. This is the complete statistical record behind Claude Sonnet 4.6's launch — straight from Anthropic's official announcement and system card.

Computer Use · OSWorld-Verified

The 16-Month Rocket

From "experimental" to approaching human-level. No benchmark tells Sonnet's story better.

OSWorld Trajectory — Sonnet Family

16 months of continuous improvement on the standard AI computer use benchmark

⚠️ Pre-4.5 scores use original OSWorld; 4.5+ use OSWorld-Verified (released July 2025, upgraded task quality and grading). These represent a continuous improvement trajectory but are not directly comparable across the methodology change.

OCTOBER 2024

Claude Sonnet 3.5

14.9%

First general-purpose computer-using model. Anthropic called it "still experimental — at times cumbersome and error-prone." A historic first, but humble beginnings.

FEBRUARY 2025

Claude Sonnet 3.7

28.0%

Nearly doubled the score in four months. +13.1pp gain, proving computer use was on an accelerating trajectory.

JUNE 2025

Claude Sonnet 4

42.2%

Another +14.2pp. Computer use crossed the 40% threshold — the point where agentic automation started becoming genuinely useful for routine office tasks.

OCTOBER 2025

Claude Sonnet 4.5

61.4%

A nominal +19.2pp, though this step spans the switch from original OSWorld to OSWorld-Verified, so the gain is indicative rather than exact. Over 60% for the first time; early users began reporting human-level performance on specific spreadsheet and web form tasks.

FEBRUARY 2026

Claude Sonnet 4.6 ✦

72.5%

A +11.1pp gain and nearly 5× the starting score. Real-world users at Pace reported 94% accuracy on insurance workflows. "It reasons through failures and self-corrects in ways we haven't seen before."
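The gains quoted in the timeline can be re-derived from the listed scores. The following is a minimal sketch, not an official calculation: the `TIMELINE` structure and the benchmark labels are annotations introduced here for illustration, and the step that crosses the OSWorld to OSWorld-Verified switch is flagged rather than treated as a like-for-like gain.

```python
# OSWorld-family scores as listed in the timeline above (percent).
# The benchmark label on each row is an annotation added for this sketch,
# not a field from any official dataset.
TIMELINE = [
    ("Sonnet 3.5", "OSWorld", 14.9),
    ("Sonnet 3.7", "OSWorld", 28.0),
    ("Sonnet 4", "OSWorld", 42.2),
    ("Sonnet 4.5", "OSWorld-Verified", 61.4),
    ("Sonnet 4.6", "OSWorld-Verified", 72.5),
]


def step_gains(timeline):
    """Return (model, gain_pp, same_methodology) for each consecutive pair."""
    gains = []
    for (_, prev_bench, prev), (model, bench, score) in zip(timeline, timeline[1:]):
        gains.append((model, round(score - prev, 1), prev_bench == bench))
    return gains


def overall_ratio(timeline):
    """Latest score divided by the first score, rounded to two decimals."""
    return round(timeline[-1][2] / timeline[0][2], 2)


for model, gain, comparable in step_gains(TIMELINE):
    note = "" if comparable else " (crosses the methodology change)"
    print(f"{model}: +{gain}pp{note}")
print(f"Overall: {overall_ratio(TIMELINE)}x the starting score")
```

The overall ratio also spans the methodology change, so "nearly 5×" is best read as a rough indicator of the trajectory.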

📐
What OSWorld Measures

OSWorld presents hundreds of tasks across real software — Chrome, LibreOffice, VS Code, and more — running on a simulated computer. No special APIs. No purpose-built connectors. The model clicks a virtual mouse and types on a virtual keyboard, exactly like a human would.
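The observe, act, check loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not OSWorld's actual interface: `ToyDesktop`, `Click`, `TypeText`, `scripted_policy`, and `run_episode` are all hypothetical names, and the "screenshot" is a string rather than pixels.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Click:
    x: int
    y: int


@dataclass
class TypeText:
    text: str


Action = Union[Click, TypeText]


class ToyDesktop:
    """Hypothetical stand-in for a simulated computer: one text field at a fixed spot."""

    FIELD = (100, 200)

    def __init__(self) -> None:
        self.focused = False
        self.field_text = ""

    def screenshot(self) -> str:
        # A real environment would return pixels; a string keeps the sketch small.
        return f"field at {self.FIELD}, focused={self.focused}, text={self.field_text!r}"

    def step(self, action: Action) -> None:
        if isinstance(action, Click):
            self.focused = (action.x, action.y) == self.FIELD
        elif isinstance(action, TypeText) and self.focused:
            self.field_text += action.text


def scripted_policy(observation: str) -> Action:
    """Stand-in for the model: click the field first, then type."""
    if "focused=False" in observation:
        return Click(*ToyDesktop.FIELD)
    return TypeText("hello")


def run_episode(env: ToyDesktop, policy: Callable[[str], Action],
                check: Callable[[ToyDesktop], bool], max_steps: int = 10) -> bool:
    """Loop: check the goal, observe, act. The task is graded on final state only."""
    for _ in range(max_steps):
        if check(env):
            return True
        env.step(policy(env.screenshot()))
    return check(env)


solved = run_episode(ToyDesktop(), scripted_policy, lambda env: env.field_text == "hello")
print(solved)
```

The point of the sketch is the constraint: the policy sees only what is on screen and can only emit mouse and keyboard actions, with no direct access to application state.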

Core Benchmarks

Performance Across Categories

Key Metrics: Sonnet 4.5 vs. Sonnet 4.6

Select benchmarks showing absolute scores or inferred baselines

Math: Box/enterprise eval. Data extraction: >80% threshold (4.6 shown at 82%). Heavy reasoning: Box Q&A baseline normalized. Developer preference: Claude Code head-to-head (out of 100%). Sources: Anthropic announcement + Box enterprise eval.

SWE-bench Verified — Coding

Real-world software engineering on actual GitHub repositories

Sonnet 4.6 (10-trial avg): 79.6%
Sonnet 4.6 (prompt modification): 80.2%

💡

SWE-bench Verified tests the model on real GitHub issues — actual bug reports and feature requests from real open-source projects. The model must write code that makes existing tests pass.
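As a rough illustration of how a pass/fail verdict per issue becomes a headline percentage: in the public SWE-bench setup, an instance is generally counted as resolved only when the tests that failed before the fix now pass and the tests that already passed still pass. The sketch below assumes that rule; the function names and test names are hypothetical, and `fail_to_pass` / `pass_to_pass` simply mirror the field names used in the public dataset.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True if every required test passed after applying the model's patch.

    test_results maps a test name to True (passed) or False (failed).
    A required test missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


def resolved_rate(outcomes):
    """Percentage of instances resolved, rounded to one decimal place."""
    return round(100 * sum(outcomes) / len(outcomes), 1)


# Hypothetical instance: the patch fixes the target test but breaks an existing one.
results = {"test_bug_fixed": True, "test_existing_behaviour": False}
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_behaviour"]))

# Placeholder outcomes for four instances, not real benchmark data.
print(resolved_rate([True, True, False, True]))
```

A patch that fixes the reported bug but causes a regression elsewhere does not count, which is part of why the benchmark is harder than it first appears.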

ARC-AGI-2 — Adaptive Reasoning

Novel problem-solving beyond pattern matching

Max effort: ~65%*
High effort: 60.4%

* Approximate. The max-effort score is reported in the system card; see it for the precise value. The high-effort score of 60.4% is confirmed.

Sonnet 4.6 Benchmark Scores — Progress View

All confirmed scores as reported in official Anthropic benchmarks

OSWorld-Verified (Computer Use): 72.5%
SWE-bench Verified (Coding): 79.6%
Math Accuracy (Enterprise Eval): 89%
Data Extraction Accuracy (PDFs/Docs): >80%
Insurance Computer Use (Pace Benchmark): 94%
ARC-AGI-2 (High Effort): 60.4%
Developer Preference vs. Sonnet 4.5: ~70%
Developer Preference vs. Opus 4.5 (prior flagship): 59%

Claude Code · Head-to-Head

What Developers Actually Chose

Blind preference testing inside Claude Code tells the real story.

vs. Sonnet 4.5 (predecessor)

70%

preferred Sonnet 4.6

Less overengineering · Fewer hallucinations · Better context reading · Logic consolidation

vs. Opus 4.5 (November 2025 Flagship)

59%

preferred Sonnet 4.6 over the previous flagship

🏆

A mid-tier model preferred over the previous generation's flagship, at 40% lower cost, is historically unusual. Sonnet 4.6 was rated as significantly better at instruction following and less "lazy."
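A head-to-head win rate can be expressed as a margin in two different ways, and the two are easy to confuse. The sketch below assumes every comparison ends with a preference for one model or the other (no ties); `preference_margins` is a hypothetical helper name.

```python
def preference_margins(win_rate_pct: float) -> dict:
    """Express a head-to-head win rate two ways, in percentage points.

    over_chance_pp:   distance from a 50/50 coin flip.
    over_opponent_pp: gap between the winner's share and the loser's share.
    Assumes no ties, so the loser's share is 100 minus the win rate.
    """
    return {
        "over_chance_pp": round(win_rate_pct - 50.0, 1),
        "over_opponent_pp": round(win_rate_pct - (100.0 - win_rate_pct), 1),
    }


print(preference_margins(70))  # vs. Sonnet 4.5
print(preference_margins(59))  # vs. Opus 4.5
```

A 70% win rate is 20pp above chance and a 40pp gap over the other model's 30%; a 59% win rate is 9pp above chance and an 18pp gap.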

Vending-Bench Arena · Long-Horizon Planning

When AI Models Compete to Run a Business

A unique evaluation — AI models compete against each other in a simulated economy to maximize profit over a full year.

Vending-Bench Arena — Earnings Trajectory

Illustrative curve based on Anthropic's reported strategy and outcome (relative units)

Sonnet 4.6 nearly tripled the earnings of Sonnet 4.5 over the simulated year. The pivot point (~month 10) reflects Sonnet 4.6's independently developed strategy: invest in capacity first, then aggressively pursue profitability. Source: Anthropic, Vending-Bench Arena.

The Strategy Sonnet 4.6 Invented

Without being instructed, Sonnet 4.6 developed a novel competitive strategy in Vending-Bench Arena:

  1. Months 1–10: Invest aggressively in capacity — spending significantly more than competitors
  2. Month 10+: Pivot sharply to profitability, capitalizing on the capacity lead
  3. Final result: Finished well ahead of all competing models

🎲
What Vending-Bench Arena Is

Multiple AI models are given the same simulated vending machine business. They make independent decisions about pricing, restocking, and investment month by month, competing for the highest profit by year-end. It tests multi-step strategic reasoning, not just single-shot answers.

Complete Record

Full Benchmark Reference Table

| Benchmark | What It Measures | Sonnet 4.5 | Sonnet 4.6 | Delta |
|---|---|---|---|---|
| OSWorld-Verified | AI computer use (real software, no APIs) | 61.4% | 72.5% | +11.1pp |
| SWE-bench Verified | Real-world software engineering (GitHub) | ~70%* | 79.6% | ~+10pp |
| Insurance Computer Use (Pace Benchmark) | Complex insurance workflows, computer use | | 94% | Highest ever |
| Math Accuracy | Enterprise math tasks | 62% | 89% | +27pp |
| Data Extraction | PDF/Word document accuracy | | >80% | |
| Box Enterprise Q&A | Heavy reasoning over real enterprise docs | Baseline | +15pp | +15pp |
| OfficeQA | Reading enterprise docs (charts, PDFs, tables) | Below Opus | Matches Opus 4.6 | Major |
| Vending-Bench Arena | Multi-step business strategy simulation | Baseline | ~3× earnings | ~3× |
| ARC-AGI-2 (high effort) | Novel reasoning, not pattern matching | | 60.4% | |
| Dev Preference vs. S4.5 | Claude Code blind head-to-head | | ~70% win rate | +20pp over 50% |
| Dev Preference vs. Opus 4.5 | Claude Code vs. previous flagship | | 59% win rate | Beats flagship |

* Sonnet 4.5 SWE-bench score is an approximation based on reported context. All other figures from official Anthropic announcement and linked evaluations.

⚠️ Methodology Notes

📊
OSWorld vs. OSWorld-Verified

Pre-Sonnet 4.5 scores use original OSWorld. Sonnet 4.5+ use OSWorld-Verified (released July 2025), an upgrade with better task quality and grading. These two versions are not directly comparable but both represent the same benchmark family's continuous trajectory.

🔢
SWE-bench Averaging

Sonnet 4.6's score of 79.6% is averaged over 10 trials to account for run-to-run variance. A single run with a prompt modification achieved 80.2%. Both are legitimate measures; the averaged score is the more conservative and reproducible one.
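Averaging over repeated trials can be sketched as follows. This is an illustration of the general approach only: `summarize_trials` is a hypothetical helper, and the input values are deliberately unrealistic placeholders, not Sonnet 4.6's per-trial scores (which are not given here).

```python
from statistics import mean, stdev


def summarize_trials(scores):
    """Summarize repeated benchmark runs: mean, sample standard deviation, and range."""
    if len(scores) < 2:
        raise ValueError("need at least two trials to estimate run-to-run variance")
    return {
        "mean": round(mean(scores), 1),
        "stdev": round(stdev(scores), 1),
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder values chosen to be obviously synthetic.
print(summarize_trials([10.0, 20.0, 30.0]))
```

Reporting the mean alongside the spread makes it clear how much a single favorable run could differ from the typical result.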

🧪
ARC-AGI-2 Effort Levels

ARC-AGI-2 was run at two effort levels, high and max. The high-effort score of 60.4% is confirmed in Anthropic's official footnotes; the max-effort score is available in the system card. Separately, Terminal-Bench 2.0 was run with thinking turned off using the Terminus-2 harness.