Anthropic Sonnet 4.6
Weighted composite: 4.35
Recommendation: Practitioner-grade
Cohort: legacy frontier
Scorecard
Per-criterion scores are not available for this model. It appears in the cross-part overview but is not part of the canonical hand-graded set (Models 1–10), so only the headline composite is shown.
Per-part composites
| Part | Opus 4.7 (inline) | DeepSeek V4 Pro (judge) |
|---|---|---|
| Part A | 4.35 rep-sample → 4.55 full-pass⁴ | 3.40 |
| Part B | 3.95 sampled (`_or`, truncated)¹ / Opus inline regrade pending on `_or_r1_64k`³ | 3.95 |
| Part C | 4.65 (`_or`)² | 4.60 |
Notes from the evaluation
| Anthropic Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | `raw_responses/Part_B_Sonnet_4_6_or_r1_openrouter.txt` | 123,142 | Q51–Q80 (response truncated at max_tokens=32000 mid-Q80) |
| OpenAI GPT 5.4 (non-thinking) | `openai/gpt-5.4` | `raw_responses/Part_B_GPT_5_4_nonthinking_or_r1_openrouter.txt` | 91,174 | Q51–Q100 (full 50/50) |
Source files in the repo
Cross-judge composites: analysis/results_overview.md
Full report · PDF
Get the full report
All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.