Seventeen executives, from GitHub to Zapier and Rakuten to Replit, shared what they found when Sonnet 4.6 landed in their workflows. These are their words, unfiltered.
Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential. For teams running agentic coding at scale, we're seeing strong resolution rates and the kind of consistency developers need.
Claude Sonnet 4.6 is a notable improvement over Sonnet 4.5 across the board, including long-horizon tasks and more difficult problems.
For the first time, Sonnet brings frontier-level reasoning in a smaller and more cost-effective form factor. It provides a viable alternative if you are a heavy Opus user.
The performance-to-cost ratio of Claude Sonnet 4.6 is extraordinary—it's hard to overstate how fast Claude models have been evolving in recent months. Sonnet 4.6 outperforms on our orchestration evals, handles our most complex agentic workloads, and keeps improving the higher you push the effort settings.
Claude Sonnet 4.6 has meaningfully closed the gap with Opus on bug detection, letting us run more reviewers in parallel, catch a wider variety of bugs, and do it all without increasing cost.
Claude Sonnet 4.6 delivers frontier-level results on complex app builds and bug-fixing. It's becoming our go-to for the kind of deep codebase work that used to require more expensive models.
Box evaluated how Claude Sonnet 4.6 performs when tested on deep reasoning and complex agentic tasks across real enterprise documents. It demonstrated significant improvements, outperforming Claude Sonnet 4.5 in heavy reasoning Q&A by 15 percentage points.
Claude Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, which measures how well a model can read enterprise documents (charts, PDFs, tables), pull the right facts, and reason from those facts. It's a meaningful upgrade for document comprehension workloads.
Claude Sonnet 4.6 meaningfully improves the answer retrieval behind our core product—we saw a significant jump in answer match rate compared to Sonnet 4.5 in our Financial Services Benchmark, with better recall on the specific workflows our customers depend on.
Claude Sonnet 4.6 is faster, cheaper, and more likely to nail things on the first try. That was a surprising combination of improvements, and we didn't expect to see it at this price point.
Sonnet 4.6 is a significant leap forward on reasoning through difficult tasks. We find it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination—exactly where our customers need strong model sense and reliability.
Claude Sonnet 4.6 was exceptionally responsive to direction — delivering precise figures and structured comparisons when asked, while also generating genuinely useful ideas on trial strategy and exhibit preparation.
Claude Sonnet 4.6 hit 94% on our insurance benchmark, making it the highest-performing model we've tested for computer use. This kind of accuracy is mission-critical to workflows like submission intake and first notice of loss. It reasons through failures and self-corrects in ways we haven't seen before.
We've been impressed by how accurately Claude Sonnet 4.6 handles complex computer use. It's a clear improvement over anything else we've tested in our evals.
Claude Sonnet 4.6 has perfect design taste when building frontend pages and data reports, and it requires far less hand-holding to get there than anything we've tested before.
Claude Sonnet 4.6 produced the best iOS code we've tested for Rakuten AI. Better spec compliance, better architecture, and it reached for modern tooling we didn't ask for, all in one shot. The results genuinely surprised us.
The Boldest Claim
"Claude Sonnet 4.6 is the best model we have seen to date. It has Opus 4.6 level accuracy, instruction following, and UI, all for a meaningfully lower cost."
Brendan Falk
Founder & CEO · Hercules
Every Claude model undergoes extensive safety evaluation. Here's what Anthropic's researchers concluded about Sonnet 4.6.
Chart: Safety Dimensions, Sonnet 4.5 vs. Sonnet 4.6 (illustrative index based on reported safety evaluation findings)
Prompt injection resistance is a confirmed major improvement. All other dimensions reflect Anthropic's qualitative safety conclusions. See the official system card for full methodology.
"A broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment."
— Anthropic Safety Researchers, Sonnet 4.6 System Card
| Safety Dimension | Finding |
|---|---|
| Overall safety | As safe as, or safer than, other recent Claude models |
| Prompt injection resistance | Major improvement vs. Sonnet 4.5; on par with Opus 4.6 |
| High-stakes misalignment | No signs of major concerns |
| Character assessment | Warm, honest, prosocial |
| Hallucinations | Fewer false claims of success (developer evals) |
When Sonnet 4.6 browses the web through computer use, it can encounter malicious content designed specifically to hijack its behavior, an attack known as prompt injection. Sonnet 4.6's resistance to these attacks is now on par with Opus 4.6, Anthropic's most capable model, which is critical for safe enterprise deployment.
Chart: Early Adopters by Industry Vertical (illustrative distribution based on publicly featured customers at launch)
Based on 16 companies featured at the Sonnet 4.6 launch. Developer Tools (6): GitHub, Cursor, Windsurf, Replit, Bolt, Cognition. Enterprise/Finance (6): Box, Databricks, Hebbia, Mercury, Zapier, Harvey. Insurance/Legal (2): Pace, Convey. E-commerce/Design (2): Triple Whale, Rakuten.
Multiple companies independently reported the same finding about design quality. Triple Whale called it "perfect design taste." Anthropic notes that "customers independently described visual outputs from Sonnet 4.6 as notably more polished." Convergent, uncoordinated validation is one of the strongest signals of genuine improvement.