A full upgrade across every dimension. Here's what changed, why it matters, and how to get the most out of each capability pillar.
Better code, less friction, fewer mistakes — across the entire development lifecycle.
Sonnet 4.6 scored 79.6% on SWE-bench Verified — averaged over 10 trials for reliability, with a single-run high of 80.2% with prompt modification. This benchmark tests the model on real GitHub issues in real open-source codebases.
But the numbers only tell part of the story. Developers in Claude Code testing reported qualitative shifts:
Sonnet 4.6 is better at reading context — so give it more. Include the relevant file structure, existing function signatures, and the error message or test output you're working from. The model will consolidate logic rather than patch over it.
"Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential. For teams running agentic coding at scale, we're seeing strong resolution rates and the kind of consistency developers need."
— Joe Binder, VP of Product, GitHub
72.5% on OSWorld-Verified. Human-level on spreadsheets and web forms. 94% accuracy in insurance workflows.
Computer use is Claude's ability to operate real software interfaces the way a human would — clicking, typing, scrolling, and navigating — without needing purpose-built API connectors. It sees the screen. It acts on the screen.
This matters enormously for organizations with legacy software — insurance portals, government databases, ERP systems, hospital scheduling tools — all built before modern APIs existed. Sonnet 4.6 can work with any of them.
Computer use carries a specific risk: prompt injection attacks — malicious instructions hidden on websites that try to hijack the model mid-task. Sonnet 4.6 shows major improvement in resisting these, performing on par with Opus 4.6 in safety evaluations.
When deploying computer use, always define a clear task boundary in your system prompt. Tell the model what it's allowed to interact with and what to do if it encounters something unexpected. Set explicit stop conditions — "if you see an error you can't resolve, stop and report back."
14.9%
Oct 2024 · Sonnet 3.5
"Still experimental"
72.5%
Feb 2026 · Sonnet 4.6
Human-level on key tasks
4.87×
improvement in 16 months
Pace tested Sonnet 4.6 on their complex insurance benchmark — submission intake, first notice of loss — and scored 94%. CEO Jamie Cuffe: "It reasons through failures and self-corrects in ways we haven't seen before."
1 million tokens. Not just stored — actively reasoned across.
Context Window Comparison
Sonnet 4.5 vs. Sonnet 4.6 (Beta)
1M token context is in beta. 200k tokens = ~150,000 words. 1M tokens = ~750,000 words ≈ 10 novels or the full Linux kernel source.
Earlier models with large context windows could store information but often struggled to reason across it. Sonnet 4.6 is specifically noted for reasoning effectively across all that context — not just retrieving from it. This is the capability that enables long-horizon planning.
For agentic tasks that run longer than even 1M tokens, context compaction automatically summarizes older conversation when approaching limits — enabling effectively unlimited session length for long-running agents.
Adaptive thinking. Context compaction. Branched multi-step task execution that actually works.
Before (Sonnet 4.5)
Binary choice: extended thinking ON or OFF. Developers had to decide upfront how much reasoning to enable — and couldn't change it mid-task.
Now (Sonnet 4.6)
Claude decides when deeper reasoning is warranted. At default effort (high), extended thinking activates automatically for complex steps — and disengages for simple ones. Better reasoning where it matters, lower cost where it doesn't.
For multi-step agent workflows, structure your system prompt around outcomes, not steps. Let Sonnet 4.6's adaptive thinking determine when to reason deeply. Specify clear handoff conditions and failure modes — the model is now much better at recognizing when it's stuck and reporting back rather than hallucinating progress.
Sonnet 4.6 is a significant leap forward on reasoning through difficult tasks. We find it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination — exactly where our customers need strong model sense and reliability.
Sonnet 4.6 outperforms on our orchestration evals, handles our most complex agentic workloads, and keeps improving the higher you push the effort settings.
89% math accuracy. Matches Opus 4.6 on OfficeQA. +15pp on Box's enterprise reasoning benchmark.
Knowledge Work Gains
Specific enterprise measurements comparing Sonnet 4.5 to 4.6
Math and extraction data from Box enterprise evaluation. OfficeQA achievement reported by Databricks.
Retail scenario: Sonnet 4.5 stumbled with financial interpretation, causing cascading calculation errors. Sonnet 4.6 correctly computed investment-to-cost ratios and ranked articles by price increase.
Education scenario: Sonnet 4.5 miscounted students who passed, producing a flawed recommendation. Sonnet 4.6 counted correctly and delivered an accurate recommendation.
Claude Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, which measures how well a model can read enterprise documents (charts, PDFs, tables), pull the right facts, and reason from those facts. It's a meaningful upgrade for document comprehension workloads.
For complex documents, ask Sonnet 4.6 to first identify the structure and key data points before answering your question. Chain your prompts: "What are the key tables and metrics in this document?" → then "Based on those metrics, calculate X." The new model handles this chain more reliably.
"Perfect design taste." Better layouts, animations, and outputs — fewer iterations to production quality.
The design improvement was perhaps the most surprising finding from early customers — reported independently by multiple companies who didn't know others were saying the same thing.
Frontend code outputs were consistently described as more polished, with better layouts, cleaner animations, and a stronger visual design sensibility. More importantly, customers needed fewer rounds of iteration to reach production quality.
Claude Sonnet 4.6 has perfect design taste when building frontend pages and data reports, and it requires far less hand-holding to get there than anything we've tested before.
Claude Sonnet 4.6 produced the best iOS code we've tested for Rakuten AI. Better spec compliance, better architecture, and it reached for modern tooling we didn't ask for, all in one shot. The results genuinely surprised us.
Design quality in code generation improves when the model:
Reads the full context before generating — understanding what aesthetic already exists
Has better instruction following — translates design briefs more accurately
Reasons about what "complete and professional" means beyond just functional
Applies modern design patterns without being explicitly asked — "reached for modern tooling we didn't ask for"
Describe the feeling you want, not just the spec. Instead of "make a dashboard with a header and three columns," try "make a dashboard that feels like a Bloomberg terminal — dense, data-forward, professional dark theme." Sonnet 4.6's improved design reasoning responds better to intent-driven briefs.