Capabilities — Claude Sonnet 4.6

Pillar 01

⌨️ Coding

Better code, less friction, fewer mistakes — across the entire development lifecycle.

What Changed

Sonnet 4.6 scored 79.6% on SWE-bench Verified — averaged over 10 trials for reliability, with a single-run high of 80.2% with prompt modification. This benchmark tests the model on real GitHub issues in real open-source codebases.

But the numbers only tell part of the story. Developers in Claude Code testing reported qualitative shifts:

Context comprehension before edits↑ Major

Logic consolidation (vs duplication)↑ Significant

Reduced overengineering/laziness↑ Significant

Instruction following accuracy↑ Meaningfully

Fewer false success claims↑ Fewer

🧑‍💻

Tutorial: Better Prompts for Coding

Sonnet 4.6 is better at reading context — so give it more. Include the relevant file structure, existing function signatures, and the error message or test output you're working from. The model will consolidate logic rather than patch over it.

# Good: give context first
model: "claude-sonnet-4-6"
system: "You are a senior engineer. Read the full
  file structure before modifying anything."
user: "Here is my codebase structure:
  [paste structure]

  The failing test is:
  [paste test]

  Fix the root cause, don't patch."
      

Why GitHub Uses It

"Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential. For teams running agentic coding at scale, we're seeing strong resolution rates and the kind of consistency developers need."

— Joe Binder, VP of Product, GitHub

Pillar 02

🖥️ Computer Use

72.5% on OSWorld-Verified. Human-level on spreadsheets and web forms. 94% accuracy in insurance workflows.

What Computer Use Actually Means

Computer use is Claude's ability to operate real software interfaces the way a human would — clicking, typing, scrolling, and navigating — without needing purpose-built API connectors. It sees the screen. It acts on the screen.

This matters enormously for organizations with legacy software — insurance portals, government databases, ERP systems, hospital scheduling tools — all built before modern APIs existed. Sonnet 4.6 can work with any of them.

🔐

Safety: Prompt Injection

Computer use carries a specific risk: prompt injection attacks — malicious instructions hidden on websites that try to hijack the model mid-task. Sonnet 4.6 shows major improvement in resisting these, performing on par with Opus 4.6 in safety evaluations.

📋

Tutorial: Setting Up Computer Use

When deploying computer use, always define a clear task boundary in your system prompt. Tell the model what it's allowed to interact with and what to do if it encounters something unexpected. Set explicit stop conditions — "if you see an error you can't resolve, stop and report back."

The 5× Achievement

14.9%

Oct 2024 · Sonnet 3.5

"Still experimental"

72.5%

Feb 2026 · Sonnet 4.6

Human-level on key tasks

4.87×

improvement in 16 months

🏢

Real-World: Pace Insurance

Pace tested Sonnet 4.6 on their complex insurance benchmark — submission intake, first notice of loss — and scored 94%. CEO Jamie Cuffe: "It reasons through failures and self-corrects in ways we haven't seen before."

Pillar 03

🧠 Long-Context Reasoning

1 million tokens. Not just stored — actively reasoned across.

Context Window Comparison

Sonnet 4.5 vs. Sonnet 4.6 (Beta)

1M token context is in beta. 200k tokens = ~150,000 words. 1M tokens = ~750,000 words ≈ 10 novels or the full Linux kernel source.

What 1 Million Tokens Looks Like

📚~10 full-length novels

💾Full Linux kernel source code

📄Dozens of research papers simultaneously

⚖️An entire multi-year contract corpus

🏗️A large enterprise codebase with full dependency graph

💡

The Key Distinction

Earlier models with large context windows could store information but often struggled to reason across it. Sonnet 4.6 is specifically noted for reasoning effectively across all that context — not just retrieving from it. This is the capability that enables long-horizon planning.

⚙️

Context Compaction (Beta)

For agentic tasks that run longer than even 1M tokens, context compaction automatically summarizes older conversation when approaching limits — enabling effectively unlimited session length for long-running agents.

Pillar 04

🤖 Agent Planning

Adaptive thinking. Context compaction. Branched multi-step task execution that actually works.

Adaptive Thinking — What Changed

Before (Sonnet 4.5)

Binary choice: extended thinking ON or OFF. Developers had to decide upfront how much reasoning to enable — and couldn't change it mid-task.

Now (Sonnet 4.6)

Claude decides when deeper reasoning is warranted. At default effort (high), extended thinking activates automatically for complex steps — and disengages for simple ones. Better reasoning where it matters, lower cost where it doesn't.

🌐

Tutorial: Building Agentic Workflows

For multi-step agent workflows, structure your system prompt around outcomes, not steps. Let Sonnet 4.6's adaptive thinking determine when to reason deeply. Specify clear handoff conditions and failure modes — the model is now much better at recognizing when it's stuck and reporting back rather than hallucinating progress.

Sonnet 4.6 is a significant leap forward on reasoning through difficult tasks. We find it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination — exactly where our customers need strong model sense and reliability.

Wade FosterCo-founder & CEO · Zapier

Sonnet 4.6 outperforms on our orchestration evals, handles our most complex agentic workloads, and keeps improving the higher you push the effort settings.

Michele CatastaPresident · Replit

New GA API Tools (Shipped Feb 2026)

Code Execution Memory Tool Programmatic Tool Calling Tool Search Tool Use Examples Web Search + Fetch (improved)

Pillar 05

📊 Knowledge Work

89% math accuracy. Matches Opus 4.6 on OfficeQA. +15pp on Box's enterprise reasoning benchmark.

Knowledge Work Gains

Specific enterprise measurements comparing Sonnet 4.5 to 4.6

Math Accuracy89% (+27pp)

Previously (Sonnet 4.5)62%

Box Heavy Reasoning Q&A gain+15pp

Data Extraction Accuracy>80%

Math and extraction data from Box enterprise evaluation. OfficeQA achievement reported by Databricks.

🧾

What Broke in Sonnet 4.5 (and doesn't in 4.6)

Retail scenario: Sonnet 4.5 stumbled with financial interpretation, causing cascading calculation errors. Sonnet 4.6 correctly computed investment-to-cost ratios and ranked articles by price increase.

Education scenario: Sonnet 4.5 miscounted students who passed, producing a flawed recommendation. Sonnet 4.6 counted correctly and delivered an accurate recommendation.

Claude Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, which measures how well a model can read enterprise documents (charts, PDFs, tables), pull the right facts, and reason from those facts. It's a meaningful upgrade for document comprehension workloads.

Hanlin TangCTO of Neural Networks · Databricks

💡

Tutorial: Document Analysis

For complex documents, ask Sonnet 4.6 to first identify the structure and key data points before answering your question. Chain your prompts: "What are the key tables and metrics in this document?" → then "Based on those metrics, calculate X." The new model handles this chain more reliably.

Pillar 06

🎨 Design

"Perfect design taste." Better layouts, animations, and outputs — fewer iterations to production quality.

The design improvement was perhaps the most surprising finding from early customers — reported independently by multiple companies who didn't know others were saying the same thing.

Frontend code outputs were consistently described as more polished, with better layouts, cleaner animations, and a stronger visual design sensibility. More importantly, customers needed fewer rounds of iteration to reach production quality.

Claude Sonnet 4.6 has perfect design taste when building frontend pages and data reports, and it requires far less hand-holding to get there than anything we've tested before.

AJ OrbachCo-founder · Triple Whale

Claude Sonnet 4.6 produced the best iOS code we've tested for Rakuten AI. Better spec compliance, better architecture, and it reached for modern tooling we didn't ask for, all in one shot. The results genuinely surprised us.

Yusuke KajiGeneral Manager, AI · Rakuten

Why Design Improved

Design quality in code generation improves when the model:

01

Reads the full context before generating — understanding what aesthetic already exists

02

Has better instruction following — translates design briefs more accurately

03

Reasons about what "complete and professional" means beyond just functional

04

Applies modern design patterns without being explicitly asked — "reached for modern tooling we didn't ask for"

🖌️

Tutorial: Getting Better Design Output

Describe the feeling you want, not just the spec. Instead of "make a dashboard with a header and three columns," try "make a dashboard that feels like a Bloomberg terminal — dense, data-forward, professional dark theme." Sonnet 4.6's improved design reasoning responds better to intent-driven briefs.