The Sonnet line didn't arrive fully formed. It was built through relentless iteration — five major releases in sixteen months, each one redefining what mid-tier AI could do.
Every step in the journey, with the milestone that defined each release.
OCTOBER 2024
Claude 3.5 Sonnet
14.9% OSWorld
History was made. Anthropic released the world's first general-purpose computer-using AI model. The score was humble — and the company was honest about it, calling the capability "still experimental — at times cumbersome and error-prone." But a new era began here.
FEBRUARY 2025
Claude 3.7 Sonnet
28.0% OSWorld
The first proof that computer use wasn't a one-time trick — it was improvable. +13.1 percentage points in four months. Nearly doubled the starting score. The trajectory was unmistakable.
JUNE 2025
Claude Sonnet 4
42.2% OSWorld
Crossed 40%. This was the threshold where computer use started becoming genuinely useful for routine office automation. Enterprise adoption began accelerating. Another +14.2 percentage points in four months.
OCTOBER 2025
Claude Sonnet 4.5
61.4% OSWorld-Verified
Sonnet 4.5 was the first release in this timeline scored on OSWorld-Verified, the updated benchmark with stricter task quality and grading. Despite the harder benchmark, the model scored 61.4%. Early users began reporting human-level performance on specific tasks: spreadsheet navigation, web forms, multi-tab workflows.
FEBRUARY 17, 2026
Claude Sonnet 4.6 (current)
72.5% OSWorld-Verified
The culmination of sixteen months of relentless improvement. Computer use nearly 5× the starting score. Coding at 79.6% SWE-bench. Math at 89%. Matching Opus 4.6 on OfficeQA. A 1M token context window. And the same price as the model it replaced.
[Chart: OSWorld score growth, cumulative, with percentage point gains per release]
* Partially affected by the OSWorld → OSWorld-Verified methodology change
OSWorld-Verified, released July 2025, upgraded the original benchmark with better task quality, improved grading, and updated infrastructure. Sonnet 4.5 and 4.6 use this harder version — meaning their scores are held to a stricter standard than earlier models.
Five major releases in sixteen months. The Sonnet line went from being called "experimental" to achieving 72.5% on one of the hardest AI benchmarks in existence. This rate of improvement, roughly 3.6 percentage points per month (57.6 points over sixteen months), is exceptionally fast for an AI benchmark, though the switch to OSWorld-Verified means the early and late scores are not strictly like-for-like.
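The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only the scores and release months quoted in the timeline; the month offsets (counted from October 2024) and the function names are illustrative assumptions, and the flag on the 4.5 step simply marks where the benchmark version changed.

```python
# Scores as quoted in the timeline: (version, months since Oct 2024, score %, benchmark).
# Month offsets are derived from the release months given in the text.
RELEASES = [
    ("3.5", 0, 14.9, "OSWorld"),
    ("3.7", 4, 28.0, "OSWorld"),
    ("4", 8, 42.2, "OSWorld"),
    ("4.5", 12, 61.4, "OSWorld-Verified"),
    ("4.6", 16, 72.5, "OSWorld-Verified"),
]


def gains(releases):
    """Per-release gain in percentage points, plus a flag for a benchmark change."""
    steps = []
    for prev, cur in zip(releases, releases[1:]):
        gain = round(cur[2] - prev[2], 1)
        benchmark_changed = prev[3] != cur[3]
        steps.append((cur[0], gain, benchmark_changed))
    return steps


def monthly_rate(releases):
    """Average percentage points gained per month between first and last release."""
    first, last = releases[0], releases[-1]
    return (last[2] - first[2]) / (last[1] - first[1])


def growth_multiple(releases):
    """Latest score divided by the starting score."""
    return releases[-1][2] / releases[0][2]


if __name__ == "__main__":
    for version, gain, changed in gains(RELEASES):
        note = " (benchmark changed)" if changed else ""
        print(f"Sonnet {version}: +{gain} pp{note}")
    print(f"Average: {monthly_rate(RELEASES):.1f} pp/month")
    print(f"Multiple of starting score: {growth_multiple(RELEASES):.2f}x")
```

The 3.6 points per month and "nearly 5×" figures both fall out of the first and last rows; the per-step flag is a reminder that the 4 to 4.5 gain spans two versions of the benchmark.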
Three tiers. Different strengths. Here is how to choose which to reach for.
Claude Haiku 4.5
The fastest and most cost-efficient model in the family. Built for high-volume tasks where throughput matters more than depth.
Claude Sonnet 4.6
The model for the vast majority of real work: coding, computer use, document reasoning, agent workflows, and design, at $3/$15 per million tokens. It is the default for Free and Pro plans, so no change is needed.
Claude Opus 4.6
The deepest reasoning and the highest precision, for tasks where getting it exactly right matters more than cost or speed. $5/$25 per million tokens.
[Chart: Claude 4.6 family pricing comparison, cost per million tokens across the model tiers]
Sonnet 4.6 pricing is unchanged from Sonnet 4.5. Haiku pricing is approximate. Source: claude.com/pricing
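To see what the per-million-token prices mean for a single job, a rough estimate like the following can help. This is a minimal sketch: it assumes the quoted pairs ("$3/$15", "$5/$25") are input and output prices respectively, the function name is hypothetical, and Haiku is left out because the text gives no exact figure for it.

```python
# USD per million tokens, as quoted in the text.
# Assumption: the first number is the input price, the second the output price.
PRICES = {
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4.6": (5.00, 25.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one job for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Illustrative workload: 200k tokens in, 20k tokens out.
    for model in PRICES:
        print(f"{model}: ${estimate_cost(model, 200_000, 20_000):.2f}")
```

Under these assumptions the same workload costs $0.90 on Sonnet 4.6 and $1.50 on Opus 4.6; actual bills depend on current pricing and any features not modeled here.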
| Task Type | Recommended Model | Why |
|---|---|---|
| Everyday coding, bug fixes, code review | Sonnet 4.6 | 79.6% SWE-bench; 70% dev preference vs predecessor |
| Computer use / UI automation | Sonnet 4.6 | 72.5% OSWorld-Verified; 94% on insurance tasks |
| Enterprise document analysis | Sonnet 4.6 | Matches Opus 4.6 on OfficeQA |
| Multi-step agentic workflows | Sonnet 4.6 | Adaptive thinking; improved orchestration evals |
| Full codebase refactoring | Opus 4.6 | Opus retains top spot on Terminal-Bench 2.0 |
| Coordinating multiple AI agents | Opus 4.6 | Deepest reasoning for coordination complexity |
| High-volume classification / tagging | Haiku 4.5 | Fastest, most cost-efficient for volume tasks |
| Real-time API responses | Haiku 4.5 | Speed-optimized |
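
The table above can be turned into a simple lookup for routing work to a tier. The sketch below is illustrative only: the task labels are copied from the table, the values are display names rather than API model identifiers, and falling back to Sonnet 4.6 for unlisted tasks is an assumption based on its description as the model for most real work.

```python
# Task type -> recommended model, copied from the table above.
# Values are display names, not API model identifiers.
RECOMMENDATIONS = {
    "Everyday coding, bug fixes, code review": "Sonnet 4.6",
    "Computer use / UI automation": "Sonnet 4.6",
    "Enterprise document analysis": "Sonnet 4.6",
    "Multi-step agentic workflows": "Sonnet 4.6",
    "Full codebase refactoring": "Opus 4.6",
    "Coordinating multiple AI agents": "Opus 4.6",
    "High-volume classification / tagging": "Haiku 4.5",
    "Real-time API responses": "Haiku 4.5",
}

# Case-insensitive index so callers need not match capitalization exactly.
_LOOKUP = {task.casefold(): model for task, model in RECOMMENDATIONS.items()}


def recommend(task_type, default="Sonnet 4.6"):
    """Return the recommended model for a task type, or the default if unlisted."""
    return _LOOKUP.get(task_type.strip().casefold(), default)


if __name__ == "__main__":
    print(recommend("Full codebase refactoring"))
    print(recommend("real-time api responses"))
    print(recommend("Drafting a newsletter"))  # unlisted, falls back to the default
```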