Kimi K2.6
Weighted composite: 4.60
Unweighted: 4.64
Recommendation: Verifier-grade
Cohort: large open weight
Scorecard
Per-part composites
| Part | Opus 4.7 (inline) | DeepSeek V4 Pro (judge) |
|---|---|---|
| Part A | 4.85 | 4.85 |
| Part B | 4.75 | 2.05 |
| Part C | 4.55 | 4.90 |
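The two judges score Parts A and C within half a point of each other but diverge sharply on Part B. As a minimal sketch of how that gap could be surfaced from the table above, the snippet below computes the absolute per-part difference between the two judges and flags any part above a cutoff. The 1.0-point threshold and all variable names are illustrative assumptions, not part of the evaluation's documented methodology.

```python
# Per-part composites copied from the table above.
opus_inline = {"Part A": 4.85, "Part B": 4.75, "Part C": 4.55}
deepseek_judge = {"Part A": 4.85, "Part B": 2.05, "Part C": 4.90}

# Hypothetical cutoff for "the judges disagree enough to re-grade".
DISAGREEMENT_THRESHOLD = 1.0

# Absolute gap between the two judges for each part, rounded to 2 decimals.
gaps = {
    part: round(abs(opus_inline[part] - deepseek_judge[part]), 2)
    for part in opus_inline
}

# Parts whose gap exceeds the assumed threshold.
flagged = [part for part, gap in gaps.items() if gap > DISAGREEMENT_THRESHOLD]

for part, gap in gaps.items():
    marker = "  <- re-grade candidate" if part in flagged else ""
    print(f"{part}: gap {gap:.2f}{marker}")
```

With the values in the table, only Part B crosses the assumed threshold, which is the kind of cell where grading again with a different `--judge-model` would be worth trying.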
Notes from the evaluation
Kimi K2.6 at 84 KB graded fine). Documented in `analysis/judge_agreement.md` under "Judge-side failures on Part A". Try a different `--judge-model` to grade this cell.
Source files in the repo
Hand-graded report: evaluations/evaluation_model_5.md
Cross-judge composites: analysis/results_overview.md
Full report · PDF
Get the full report
All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.