Kimi K2.6

Canonical label · Model 5

Weighted composite

4.60

Unweighted

4.64

Recommendation

Verifier-grade

Cohort

large open weight

Scorecard

Problem framing: 5.0

Framework knowledge: 4.5

Regulatory knowledge: 4.5

Mathematical reasoning: 4.5

Domain knowledge: 4.5

Critical reasoning: 5.0

Epistemic discipline: 4.5
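
The unweighted figure can be reproduced from the seven scorecard values as a plain arithmetic mean. The sketch below is illustrative only: the `composite` helper and its optional `weights` argument are assumptions, and the rubric weights behind the 4.60 weighted composite are not listed on this page, so none are supplied here.

```python
from statistics import mean

# Scorecard values as listed above.
scores = {
    "Problem framing": 5.0,
    "Framework knowledge": 4.5,
    "Regulatory knowledge": 4.5,
    "Mathematical reasoning": 4.5,
    "Domain knowledge": 4.5,
    "Critical reasoning": 5.0,
    "Epistemic discipline": 4.5,
}


def composite(scores, weights=None):
    """Return the mean of the scores, weighted if weights are given.

    Hypothetical helper: `weights` maps each dimension name to a
    non-negative weight. The actual rubric weights are not shown on
    this page and would have to come from the full report.
    """
    if weights is None:
        return mean(scores.values())
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


unweighted = composite(scores)
print(f"{unweighted:.2f}")  # 4.64
```

Passing equal weights for every dimension gives the same value as the unweighted call, which is a quick sanity check before plugging in the real rubric weights.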

Per-part composites

Part	Opus 4.7 (inline)	DeepSeek V4 Pro (judge)
Part A	4.85	4.85
Part B	4.75	2.05
Part C	4.55	4.90
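
One way to read the table is to compute the gap between the two graders for each part and flag large gaps for re-grading. This is a minimal sketch under stated assumptions: the `flag_disagreements` helper and the threshold of 1.0 are hypothetical and are not taken from the evaluation's own tooling.

```python
# Per-part composites from the table: (Opus 4.7 inline, DeepSeek V4 Pro judge).
part_scores = {
    "Part A": (4.85, 4.85),
    "Part B": (4.75, 2.05),
    "Part C": (4.55, 4.90),
}

# Hypothetical cutoff for a gap worth a second look.
THRESHOLD = 1.0


def flag_disagreements(part_scores, threshold):
    """Return {part: gap} for parts where the two graders differ by more than threshold."""
    flagged = {}
    for part, (inline, judge) in part_scores.items():
        gap = round(abs(inline - judge), 2)
        if gap > threshold:
            flagged[part] = gap
    return flagged


flagged = flag_disagreements(part_scores, THRESHOLD)
print(flagged)  # {'Part B': 2.7}
```

With these numbers only Part B is flagged; Parts A and C differ by 0.00 and 0.35 respectively.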

Notes from the evaluation

Kimi K2.6 at 84 KB graded fine). Documented in `analysis/judge_agreement.md` under "Judge-side failures on Part A". Try a different `--judge-model` to grade this cell.

Source files in the repo

Hand-graded report: evaluations/evaluation_model_5.md
Cross-judge composites: analysis/results_overview.md

Full report · PDF

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
