DeepSeek V4 Pro

Weighted composite

4.48

Recommendation

Practitioner-grade

Cohort

large open weight

Scorecard

Per-criterion scores are not available for this model. It appears in the cross-part overview but is not part of the canonical Models 1–10 hand-graded set, so only the headline composite is shown.

Per-part composites

Part	Opus 4.7 (inline)	DeepSeek V4 Pro (judge)
Part A	4.48	4.00
Part B	4.50	3.55
Part C	4.65	4.45
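
To make the judge disagreement in the table easier to read, the following is a minimal sketch that computes the per-part gap between the two judges. The values are copied from the table above. The names `composites` and `judge_gaps` are hypothetical and used only for illustration; this is not the logic in `runner/grade.py`, and it makes no assumption about how the weighted composite is derived.

```python
# Per-part composites copied from the table above:
# (Opus 4.7 inline score, DeepSeek V4 Pro judge score).
# The dictionary name and helper below are illustrative, not from the repo.
composites = {
    "Part A": (4.48, 4.00),
    "Part B": (4.50, 3.55),
    "Part C": (4.65, 4.45),
}


def judge_gaps(scores):
    """Return {part: opus_score - deepseek_score}, rounded to two decimals."""
    return {
        part: round(opus - deepseek, 2)
        for part, (opus, deepseek) in scores.items()
    }


gaps = judge_gaps(composites)
largest = max(gaps, key=gaps.get)  # part where the two judges differ most

for part, gap in gaps.items():
    print(f"{part}: {gap:+.2f}")
print("Largest gap:", largest)
```

On these numbers the two judges are furthest apart on Part B, the part for which the notes describe the Opus scores as sampled estimates unless stated otherwise.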

Notes from the evaluation

DeepSeek V4 Pro second-judge outputs from `runner/grade.py`.
- Part B Opus scores are sampled estimates unless the file explicitly says a full per-question pass was performed.
- Part B DeepSeek scores are the current whole-response judge scores.

Source files in the repo

Cross-judge composites: analysis/results_overview.md

Full report · PDF

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
