Every number, every test, every methodology note. This is the complete statistical record behind Claude Sonnet 4.6's launch — straight from Anthropic's official announcement and system card.
From "experimental" to approaching human-level. No benchmark tells Sonnet's story better.
OSWorld Trajectory — Sonnet Family
16 months of continuous improvement on the standard AI computer use benchmark
⚠️ Pre-4.5 scores use original OSWorld; 4.5+ use OSWorld-Verified (released July 2025, upgraded task quality and grading). These represent a continuous improvement trajectory but are not directly comparable across the methodology change.
OCTOBER 2024
Claude 3.5 Sonnet
14.9%
First general-purpose computer-using model. Anthropic called it "still experimental — at times cumbersome and error-prone." A historic first, but humble beginnings.
FEBRUARY 2025
Claude 3.7 Sonnet
28.0%
Nearly doubled the score in four months: a +13.1pp gain, suggesting computer use was on an accelerating trajectory.
MAY 2025
Claude Sonnet 4
42.2%
Another +14.2pp. Computer use crossed the 40% threshold — the point where agentic automation started becoming genuinely useful for routine office tasks.
SEPTEMBER 2025
Claude Sonnet 4.5
61.4%
A nominal +19.2pp, measured on OSWorld-Verified (new methodology), so not directly comparable with the earlier scores. Over 60% for the first time; early users began reporting human-level performance on specific spreadsheet and web form tasks.
FEBRUARY 2026
Claude Sonnet 4.6 ✦
72.5%
A +11.1pp gain and nearly 5× the starting score. Real-world users at Pace reported 94% accuracy on insurance workflows. "It reasons through failures and self-corrects in ways we haven't seen before."
OSWorld presents hundreds of tasks across real software — Chrome, LibreOffice, VS Code, and more — running on a simulated computer. No special APIs. No purpose-built connectors. The model clicks a virtual mouse and types on a virtual keyboard, exactly like a human would.
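The point gains and the "nearly 5×" figure quoted in the timeline can be re-derived from the five scores alone. The following is a minimal sketch; the names `OSWORLD_SCORES` and `point_gains` are illustrative, and the arithmetic deliberately ignores the change from original OSWorld to OSWorld-Verified, so the cross-methodology numbers remain nominal.

```python
# OSWorld scores (%) from the timeline, keyed by Sonnet version.
# Versions before 4.5 were scored on original OSWorld, 4.5+ on OSWorld-Verified.
OSWORLD_SCORES = {
    "3.5": 14.9,
    "3.7": 28.0,
    "4": 42.2,
    "4.5": 61.4,
    "4.6": 72.5,
}


def point_gains(scores):
    """Return percentage-point gains between consecutive scores, rounded to 0.1."""
    return [round(later - earlier, 1) for earlier, later in zip(scores, scores[1:])]


scores = list(OSWORLD_SCORES.values())
gains = point_gains(scores)                # pp gain for each release over the last
ratio = round(scores[-1] / scores[0], 2)   # latest score relative to the first

print(gains)   # [13.1, 14.2, 19.2, 11.1]
print(ratio)   # 4.87
```

A ratio of 4.87 is what "nearly 5×" rounds from.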
Key Metrics: Sonnet 4.5 vs. Sonnet 4.6
Select benchmarks showing absolute scores or inferred baselines
Math: Box/enterprise eval. Data extraction: >80% threshold (4.6 shown at 82%). Heavy reasoning: Box Q&A baseline normalized. Developer preference: Claude Code head-to-head (out of 100%). Sources: Anthropic announcement + Box enterprise eval.
SWE-bench Verified — Coding
Real-world software engineering on actual GitHub repositories
SWE-bench Verified tests the model on real GitHub issues — actual bug reports and feature requests from real open-source projects. The model must write code that makes existing tests pass.
ARC-AGI-2 — Adaptive Reasoning
Novel problem-solving beyond pattern matching
* The max-effort score is confirmed in the official system card; see it for the precise value. The high-effort score of 60.4% is confirmed.
Sonnet 4.6 Benchmark Scores — Progress View
All confirmed scores as reported in official Anthropic benchmarks
Blind preference testing inside Claude Code tells the real story.
vs. Sonnet 4.5 (predecessor)
70%
preferred Sonnet 4.6
vs. Opus 4.5 (November 2025 Flagship)
59%
preferred Sonnet 4.6 over the previous flagship
A mid-tier model preferred over the previous generation's flagship, at 40% lower cost, is historically unusual. Sonnet 4.6 was rated as significantly better at instruction following and less "lazy."
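A win rate can be read two ways: as points above 50% parity, or as the head-to-head margin between winner and loser. The sketch below computes both. The helper name `preference_margins` is hypothetical, and the calculation assumes every comparison ended in a preference for one side (no ties), which the source does not state.

```python
def preference_margins(win_rate_pct):
    """Return (points above 50% parity, head-to-head margin) for a win rate in %.

    Assumes no ties, so the other model's share is 100 - win_rate_pct.
    """
    loss_rate_pct = 100 - win_rate_pct
    return win_rate_pct - 50, win_rate_pct - loss_rate_pct


vs_sonnet_45 = preference_margins(70)
vs_opus_45 = preference_margins(59)

print(vs_sonnet_45)  # (20, 40)
print(vs_opus_45)    # (9, 18)
```

So a 70% preference is 20pp above parity and a 40pp margin over the 30% that preferred Sonnet 4.5; the two numbers should not be conflated.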
A unique evaluation — AI models compete against each other in a simulated economy to maximize profit over a full year.
Vending-Bench Arena — Earnings Trajectory
Illustrative curve based on Anthropic's reported strategy and outcome (relative units)
Sonnet 4.6 nearly tripled the earnings of Sonnet 4.5 over the simulated year. The pivot point (~month 7) reflects Sonnet 4.6's independently developed strategy: invest in capacity first, then aggressively pursue profitability. Source: Anthropic, Vending-Bench Arena.
Without being instructed, Sonnet 4.6 developed a novel competitive strategy in Vending-Bench Arena: invest in capacity first, then aggressively pursue profitability.
Multiple AI models are given the same simulated vending machine business. They make independent decisions about pricing, restocking, and investment month by month, competing for the highest profit by year-end. It tests multi-step strategic reasoning, not just single-shot answers.
| Benchmark | What It Measures | Sonnet 4.5 | Sonnet 4.6 | Delta |
|---|---|---|---|---|
| OSWorld-Verified | AI computer use (real software, no APIs) | 61.4% | 72.5% | +11.1pp |
| SWE-bench Verified | Real-world software engineering (GitHub) | ~70%* | 79.6% | ~+10pp |
| Pace insurance workflows | Complex insurance workflows, computer use (reported by Pace) | — | 94% | Highest ever |
| Math Accuracy | Enterprise math tasks | 62% | 89% | +27pp |
| Data Extraction | PDF/Word document accuracy | — | >80% | — |
| Box Enterprise Q&A | Heavy reasoning over real enterprise docs | Baseline | +15pp | +15pp |
| OfficeQA | Reading enterprise docs (charts, PDFs, tables) | Below Opus | Matches Opus 4.6 | Major |
| Vending-Bench Arena | Multi-step business strategy simulation | Baseline | ~3× earnings | ~3× |
| ARC-AGI-2 (high effort) | Novel reasoning, not pattern matching | — | 60.4% | — |
| Dev Preference vs. S4.5 | Claude Code blind head-to-head | — | ~70% win rate | +20pp over 50% |
| Dev Preference vs. Opus 4.5 | Claude Code vs. previous flagship | — | 59% win rate | Beats flagship |
* The Sonnet 4.5 SWE-bench score is an approximation based on reported context. All other figures are from the official Anthropic announcement and linked evaluations.
Pre-Sonnet 4.5 scores use original OSWorld. Sonnet 4.5+ use OSWorld-Verified (released July 2025), an upgrade with better task quality and grading. These two versions are not directly comparable but both represent the same benchmark family's continuous trajectory.
Sonnet 4.6's SWE-bench Verified score of 79.6% is averaged over 10 trials to account for run-to-run variance. A single run with a prompt modification achieved 80.2%. Both are legitimate measures; the averaged score is the more conservative and reproducible one.
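To illustrate why a multi-trial average is more reproducible than a single run, the sketch below summarizes ten per-trial scores. The values in `trial_scores` are invented purely for illustration and are not Anthropic's trial results; the helper name `summarize` is likewise hypothetical.

```python
from statistics import mean, stdev

# HYPOTHETICAL per-trial pass rates (%), made up for this example only.
trial_scores = [50.0, 52.0, 51.0, 49.0, 53.0, 51.0, 50.0, 52.0, 51.0, 51.0]


def summarize(trials):
    """Return (mean, sample standard deviation, best single run), rounded to 0.1."""
    return round(mean(trials), 1), round(stdev(trials), 1), max(trials)


avg, spread, best = summarize(trial_scores)
print(f"mean={avg}  stdev={spread}  best single run={best}")
```

In this made-up data the best single run sits above the mean, which is why a lone run tends to flatter a model relative to its average.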
ARC-AGI-2 was run at two effort levels. The max-effort score is available in the system card; the high-effort score of 60.4% is confirmed in Anthropic's official footnotes. Terminal-Bench 2.0 was run with thinking turned off, using the Terminus-2 harness.