Gemma 4 31B
Scorecard
Per-criterion scores are not available for this model. It appears in the cross-part overview but is not part of the canonical hand-graded set (Models 1–10), so only the headline composite is shown.
Per-part composites
| Part | Opus 4.7 (inline) | DeepSeek V4 Pro (judge) |
|---|---|---|
| Part A | 3.70 | 3.35 |
| Part B | 3.40 (sampled) | 2.40 |
| Part C | 3.95 | 2.75 |
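To make the disagreement between the two graders easier to read, the gap per part can be computed directly from the table above. The snippet below is only an illustrative sketch: the names `composites` and `judge_gaps` are hypothetical and not part of the repo, and it makes no claim about how the headline composite is derived.

```python
# Per-part composites copied from the table above:
# (Opus 4.7 inline, DeepSeek V4 Pro judge).
# Note: the Part B inline value is marked "sampled" in the table.
composites = {
    "Part A": (3.70, 3.35),
    "Part B": (3.40, 2.40),
    "Part C": (3.95, 2.75),
}


def judge_gaps(scores):
    """Return {part: inline score minus judge score}, rounded to 2 decimals."""
    return {
        part: round(inline - judge, 2)
        for part, (inline, judge) in scores.items()
    }


gaps = judge_gaps(composites)
for part, gap in gaps.items():
    print(f"{part}: inline minus judge = {gap:+.2f}")
```

On these numbers the inline grader scores higher than the judge on every part, with the smallest gap on Part A and the largest on Part C.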
Notes from the evaluation
The judge returned empty completions on the first call for Gemma 4 31B and Mistral Small 4. Both were re-run successfully in single-cell retry batches, and their composite and per-criterion scores are merged into the table above. The per-model justification text from the retry runs was discarded during cleanup; for the full per-criterion narrative on those two cells, regenerate with `runner/grade.py --response
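The retry pattern described in this note (find cells whose judge completion came back empty, then re-run each one in its own single-cell batch and merge the result) might look roughly like the sketch below. This is not the code in `runner/grade.py`: `retry_empty_cells`, `grade_batch`, the cell keys, and the placeholder completions are all hypothetical, chosen only to illustrate the idea.

```python
from typing import Callable, Dict, List, Tuple


def retry_empty_cells(
    results: Dict[str, str],
    grade_batch: Callable[[List[str]], Dict[str, str]],
    max_attempts: int = 2,
) -> Tuple[Dict[str, str], List[str]]:
    """Re-run every cell whose completion is empty, one cell per batch.

    `grade_batch` is a hypothetical callable that takes a list of cell
    keys and returns {cell: completion text}.
    """
    merged = dict(results)
    for cell, completion in results.items():
        if completion.strip():
            continue  # first call already produced output
        for _ in range(max_attempts):
            retry = grade_batch([cell]).get(cell, "")  # single-cell batch
            if retry.strip():
                merged[cell] = retry
                break
    still_empty = [cell for cell, text in merged.items() if not text.strip()]
    return merged, still_empty


# Hypothetical first-pass output: two cells came back empty.
first_pass = {
    "gemma-4-31b": "",
    "mistral-small-4": "",
    "some-other-model": "non-empty completion",
}


def fake_judge(cells: List[str]) -> Dict[str, str]:
    """Stand-in for the real judge call; always returns placeholder text."""
    return {cell: "non-empty completion (retry)" for cell in cells}


merged, still_empty = retry_empty_cells(first_pass, fake_judge)
print(merged)
print(still_empty)
```

Retrying one cell per batch keeps a single failing response from blanking out the other cells in the same call, at the cost of more judge requests.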
Source files in the repo
Cross-judge composites: analysis/results_overview.md
Full report · PDF
Get the full report
All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.