Last refresh: 2026-05-17 · 17 models graded · 3 open rubric items · DeepSeek judge coverage: 98%


Kimi K2.6

Canonical label · Model 5

Weighted composite: 4.60
Unweighted: 4.64
Recommendation: Verifier-grade
Cohort: large open weight

Scorecard

Problem framing: 5.0
Framework knowledge: 4.5
Regulatory knowledge: 4.5
Mathematical reasoning: 4.5
Domain knowledge: 4.5
Critical reasoning: 5.0
Epistemic discipline: 4.5
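
The unweighted figure of 4.64 is consistent with a plain arithmetic mean of the seven scorecard dimensions. The following is a minimal Python sketch under that assumption; the page does not state the formula, and the weights behind the 4.60 weighted composite are not given here, so that figure is not reproduced. The names `scorecard` and `unweighted_composite` are illustrative only.

```python
from statistics import mean

# Scorecard values as listed on this page.
scorecard = {
    "Problem framing": 5.0,
    "Framework knowledge": 4.5,
    "Regulatory knowledge": 4.5,
    "Mathematical reasoning": 4.5,
    "Domain knowledge": 4.5,
    "Critical reasoning": 5.0,
    "Epistemic discipline": 4.5,
}


def unweighted_composite(scores):
    """Plain mean of the dimension scores (assumed formula, not confirmed by the source)."""
    return mean(scores.values())


print(f"{unweighted_composite(scorecard):.2f}")  # 4.64
```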

Per-part composites

| Part | Opus 4.7 (inline) | DeepSeek V4 Pro (judge) |
|------|-------------------|-------------------------|
| Part A | 4.85 | 4.85 |
| Part B | 4.75 | 2.05 |
| Part C | 4.55 | 4.90 |
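
The gap between the two graders can be checked per part with simple arithmetic. The sketch below is one way to surface cells where they diverge sharply; the threshold of 1.0 and the function name `flag_disagreements` are hypothetical and not taken from the evaluation's methodology.

```python
# Per-part composites as listed on this page: (inline grader, judge).
parts = {
    "Part A": (4.85, 4.85),
    "Part B": (4.75, 2.05),
    "Part C": (4.55, 4.90),
}

# Hypothetical threshold for "large" disagreement; an assumption for illustration.
DISAGREEMENT_THRESHOLD = 1.0


def flag_disagreements(table, threshold=DISAGREEMENT_THRESHOLD):
    """Return {part: gap} for parts where the two graders differ by more than threshold."""
    flagged = {}
    for part, (inline, judge) in table.items():
        # Round to two decimals to match the precision of the published scores.
        gap = round(abs(inline - judge), 2)
        if gap > threshold:
            flagged[part] = gap
    return flagged


print(flag_disagreements(parts))  # {'Part B': 2.7}
```

With the values above, only Part B is flagged: the inline grader and the judge differ by 2.70 points there, against 0.00 on Part A and 0.35 on Part C.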

Notes from the evaluation

Kimi K2.6 at 84 KB graded fine). Documented in `analysis/judge_agreement.md` under "Judge-side failures on Part A". Try a different `--judge-model` to grade this cell.

Source files in the repo

Hand-graded report: evaluations/evaluation_model_5.md
Cross-judge composites: analysis/results_overview.md

Full report · PDF

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
