Last refresh: 2026-05-17 · 17 models graded · 3 open rubric items · DeepSeek judge coverage: 98%

Methodology

How questions are written, how responses are graded, how composites are computed, what rubric items are still open, and how to reproduce any score on this site.

The seven criteria

Each response set is scored on seven cognitive-skill criteria, each on an anchored 1–5 scale.

| Criterion | What it tests |
| --- | --- |
| Problem framing | Identifying the question type, scope, stakeholder perspective, and implicit issues. |
| Framework knowledge | ISO 14040 / 14044 / 14025 / 14067, EN 15804+A2, EN 16908, PAS 2050, PEF, PACT v3, GHG Protocol Product, sector-specific PCRs. |
| Regulatory knowledge | EU Battery Regulation, CBAM, ESPR, EUDR; delegated acts; compliance dates; legal-admissibility distinctions. |
| Mathematical reasoning | Functional unit, allocation arithmetic, CFF coefficients, GWP characterisation, sensitivity & uncertainty. |
| Domain knowledge | Practitioner-grade detail: ecoinvent vintages, PEFCR defaults, sector chemistry, country defaults. |
| Critical reasoning | Spot-the-error, internal-contradiction catching, framework-conflation diagnosis. |
| Epistemic discipline | Honest uncertainty handling, declining to invent missing data, separating recalled from verified information. |

Full 1–5 level definitions per criterion are in analysis/evaluation_matrix_and_legend.md.

Declaration-review weighting

The headline composite uses weights tuned for what a verifier reviewing an EPD or PCF declaration actually needs: quantitative correctness and safe epistemics over standards recall.

| Criterion | Weight |
| --- | --- |
| Mathematical reasoning | 25% |
| Epistemic discipline | 20% |
| Domain knowledge | 20% |
| Critical reasoning | 15% |
| Framework knowledge | 10% |
| Regulatory knowledge | 5% |
| Problem framing | 5% |
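As a rough illustration of how these weights combine, the sketch below computes a weighted mean of the seven per-criterion scores. It assumes the composite is a plain weighted average on the 1–5 scale; this page does not state whether the published composite is rescaled, so treat the function name `weighted_composite` and the example scores as hypothetical.

```python
# Weights from the declaration-review table, expressed as fractions of 1.0.
WEIGHTS = {
    "Mathematical reasoning": 0.25,
    "Epistemic discipline": 0.20,
    "Domain knowledge": 0.20,
    "Critical reasoning": 0.15,
    "Framework knowledge": 0.10,
    "Regulatory knowledge": 0.05,
    "Problem framing": 0.05,
}


def weighted_composite(scores):
    """Return the weighted mean of per-criterion scores (assumed 1-5 scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven criteria")
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion}: score {score} is outside 1-5")
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)


# Hypothetical scores for one response set (not taken from the leaderboard).
example_scores = {
    "Mathematical reasoning": 4,
    "Epistemic discipline": 5,
    "Domain knowledge": 3,
    "Critical reasoning": 4,
    "Framework knowledge": 4,
    "Regulatory knowledge": 3,
    "Problem framing": 5,
}

print(round(weighted_composite(example_scores), 2))  # prints 4.0
```

Because mathematical reasoning, epistemic discipline, and domain knowledge together carry 65% of the weight, a one-point change in any of them moves this composite more than a one-point change in regulatory knowledge and problem framing combined.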

Two-judge design

Each response set is graded by two independent judge models against the same rubric.

Inter-judge agreement is reported in analysis/judge_agreement.md. Rank order is fully preserved across judges on Part A and partly preserved on Parts B and C (Spearman ρ = 1.00, 0.73, and 0.65 respectively). Absolute composites diverge most on Part B (mean Δ = −1.7 between judges). This is partly an artefact of grading method: the Opus inline judge samples only Q51–Q70, while the DeepSeek judge grades all 50 Part-B questions per pass.
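For readers who want to check a rank-agreement figure themselves, here is a minimal sketch of Spearman's ρ in plain Python: rank each judge's composites (ties get the average rank), then take the Pearson correlation of the two rank lists. The function names and the example composites are hypothetical and are not taken from analysis/judge_agreement.md.

```python
import math


def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(judge_a, judge_b):
    """Spearman rank correlation between two judges' composites."""
    if len(judge_a) != len(judge_b) or len(judge_a) < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    return pearson(average_ranks(judge_a), average_ranks(judge_b))


# Hypothetical composites for four models, listed in the same model order.
judge_a = [4.2, 3.9, 3.5, 3.1]
judge_b = [4.0, 3.6, 3.7, 2.9]
print(round(spearman_rho(judge_a, judge_b), 2))  # prints 0.8
```

In this example the two judges swap the second and third models, which is enough to pull ρ from 1.00 down to 0.80. A value of 1.00 therefore means the two judges produce an identical ordering, while lower values indicate that some models change places.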

Anonymisation boundary

The canonical 10-model matrix (the per-criterion heatmap on the homepage) is blinded as Models 1–10 to avoid bias in commentary, matching the evaluations/evaluation_model_N.md reports in the repo. The cross-part leaderboard uses real model names, matching the evaluations/auto_grade_*.md sweep files. No identity-mapping file is kept in the repo. The full report unlocks the mapping where the source files already allow it.

Open rubric items

Three rubric items are flagged for verification. Composites on this site are computed under the current rubric answers; the open items are surfaced so a citing reader can specify pre- or post-resolution scope.

Reproducibility

From a clean clone of the repository, with an OpenRouter key in .env:

pip install -r requirements.txt
cp .env.example .env   # then add your OpenRouter key to .env

# generate a response set
python -m runner.run --part A --model anthropic/claude-opus-4.7 --label Opus_4_7

# grade it with the second judge
python -m runner.grade \
    --responses-glob "raw_responses/Part_A_Opus_4_7_*" \
    --rubric rubric/answer_key_partA_Q1-Q50.md \
    --judge-model deepseek/deepseek-v4-pro \
    --out evaluations/auto_grade_repro.md

Tolerance: the composite from a fresh run lands within ±0.10 of the published number for that model. Drift outside ±0.10 means the prompt preamble, runner defaults, or judge parameters have changed and the leaderboard must be regenerated.
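The tolerance rule can be expressed as a small check. This is a sketch only: the helper name `within_tolerance` and the example composites are hypothetical, and the repository may perform this comparison differently or not at all.

```python
TOLERANCE = 0.10  # the ±0.10 band stated above


def within_tolerance(fresh, published, tol=TOLERANCE):
    """Return True if a fresh composite is within ±tol of the published one."""
    # A tiny epsilon keeps a difference of exactly tol from failing
    # because of floating-point rounding.
    return abs(fresh - published) <= tol + 1e-9


# Hypothetical composites, not real leaderboard values.
print(within_tolerance(3.95, 4.00))  # prints True
print(within_tolerance(4.15, 4.00))  # prints False
```

A False result is the signal described above: the prompt preamble, runner defaults, or judge parameters have probably changed since the published run.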

Full report · PDF

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
