Methodology

How questions are written, how responses are graded, how composites are computed, what rubric items are still open, and how to reproduce any score on this site.

The seven criteria

Each response set is scored on seven cognitive-skill criteria, each on an anchored 1–5 scale.

Criterion	What it tests
Problem framing	Identifying the question type, scope, stakeholder perspective, and implicit issues.
Framework knowledge	ISO 14040 / 14044 / 14025 / 14067, EN 15804+A2, EN 16908, PAS 2050, PEF, PACT v3, GHG Protocol Product Standard, sector-specific PCRs.
Regulatory knowledge	EU Battery Regulation, CBAM, ESPR, EUDR; delegated acts; compliance dates; legal-admissibility distinctions.
Mathematical reasoning	Functional unit, allocation arithmetic, CFF coefficients, GWP characterisation, sensitivity & uncertainty.
Domain knowledge	Practitioner-grade detail — ecoinvent vintages, PEFCR defaults, sector chemistry, country defaults.
Critical reasoning	Spot-the-error, internal-contradiction catching, framework-conflation diagnosis.
Epistemic discipline	Honest uncertainty handling, declining to invent missing data, separating recalled from verified information.

Full 1–5 level definitions per criterion are in analysis/evaluation_matrix_and_legend.md.

Declaration-review weighting

The headline composite uses weights tuned for what a verifier reviewing an EPD or PCF declaration actually needs: quantitative correctness and safe epistemics over standards recall.

Criterion	Weight
Mathematical reasoning	25%
Epistemic discipline	20%
Domain knowledge	20%
Critical reasoning	15%
Framework knowledge	10%
Regulatory knowledge	5%
Problem framing	5%
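
As a minimal sketch of how the weights above combine, assuming the composite is a plain weighted mean of the seven 1–5 criterion scores (any rescaling applied to the published numbers is not specified here), the arithmetic can be written as follows. The function name and the example scores are hypothetical.

```python
# Weights copied from the declaration-review table, expressed as fractions.
WEIGHTS = {
    "Mathematical reasoning": 0.25,
    "Epistemic discipline": 0.20,
    "Domain knowledge": 0.20,
    "Critical reasoning": 0.15,
    "Framework knowledge": 0.10,
    "Regulatory knowledge": 0.05,
    "Problem framing": 0.05,
}


def weighted_composite(scores):
    """Weighted mean of per-criterion scores on the 1-5 scale.

    Assumption: the composite is a simple weighted mean. This is an
    illustration, not the site's confirmed implementation.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven criteria")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight


# Hypothetical scores for illustration only, not taken from any graded model.
example_scores = {
    "Mathematical reasoning": 4,
    "Epistemic discipline": 3,
    "Domain knowledge": 4,
    "Critical reasoning": 5,
    "Framework knowledge": 2,
    "Regulatory knowledge": 3,
    "Problem framing": 5,
}

print(round(weighted_composite(example_scores), 2))
```

Because mathematical reasoning, epistemic discipline, and domain knowledge together carry 65% of the weight, a one-point change in any of them moves the composite far more than a one-point change in regulatory knowledge or problem framing.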

Two-judge design

Each response set is graded by two independent judge models against the same rubric.

Primary: Claude Opus 4.7, inline (hand-graded).
Second judge: DeepSeek V4 Pro via the repo's runner/grade.py.

Inter-judge agreement is in analysis/judge_agreement.md. Rank order is identical across judges on Part A (Spearman ρ = 1.00) and broadly preserved on Parts B and C (ρ = 0.73 and 0.65). Absolute composites diverge most on Part B (mean Δ = −1.7 between judges). This is partly an artefact of grading method: Opus inline samples Q51–Q70, while DeepSeek grades all 50 Part-B questions per pass.
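
For readers who want to recompute rank agreement between the two judges, the sketch below implements Spearman's ρ in plain Python as the Pearson correlation of average ranks. It is an illustration under stated assumptions, not the script behind analysis/judge_agreement.md. The function names and the composite values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical composites for four models under each judge (illustration only).
judge_a = [4.2, 3.9, 3.5, 3.1]
judge_b = [3.8, 3.6, 3.0, 2.9]

print(spearman_rho(judge_a, judge_b))
```

In this made-up example the two judges disagree on absolute composites but order the models identically, so ρ is 1. This is why rank agreement and absolute divergence are reported separately.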

Anonymisation boundary

The canonical 10-model matrix (the per-criterion heatmap on the homepage) is blinded as Models 1–10 to avoid bias in commentary, matching the evaluations/evaluation_model_N.md reports in the repo. The cross-part leaderboard uses real model names, matching the evaluations/auto_grade_*.md sweep files. No identity-mapping file is kept in the repo. The full report unlocks the mapping where the source files already allow it.

Open rubric items

Three rubric items are flagged for verification. Composites on this site are computed under the current rubric answers; the open items are surfaced so a citing reader can specify pre- or post-resolution scope.

Q7 — EU Battery Regulation delegated-act date. Current answer B (18 Aug 2026, threshold-determination delegated act per Regulation 2023/1542 consolidated text). Four of ten canonical models answer B; six answer A (18 Feb 2025, the EV compliance deadline — not the threshold-act deadline).
Q47 — GWP version for N₂O characterisation. The rubric must specify whether AR5 (265), AR6 (273), or both are acceptable.
Q48 — "kWh of compute delivered" interpretation. Utilisation-weighted useful compute vs time-weighted operating output. Both are arithmetically correct but yield different per-kWh values (619 g vs 535 g).
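
To show why the Q47 ambiguity matters numerically, the sketch below converts the same N₂O mass to CO₂e under each characterisation factor quoted above. The 2.0 kg mass and the function name are hypothetical and are not taken from the question itself.

```python
# GWP100 characterisation factors for N2O, as quoted in the Q47 item above.
GWP100_N2O = {"AR5": 265, "AR6": 273}


def n2o_to_co2e(mass_kg, report):
    """Convert a mass of N2O (kg) to kg CO2e using the chosen IPCC report."""
    return mass_kg * GWP100_N2O[report]


mass_kg = 2.0  # hypothetical N2O emission, for illustration only

for report in GWP100_N2O:
    print(report, n2o_to_co2e(mass_kg, report), "kg CO2e")

# Relative gap between the two factors, independent of the mass chosen.
relative_gap = (GWP100_N2O["AR6"] - GWP100_N2O["AR5"]) / GWP100_N2O["AR5"]
print(f"AR6 is {relative_gap:.1%} higher than AR5")
```

A gap of roughly 3% is large enough to flip a numerically graded answer from correct to incorrect, which is why the rubric has to name the accepted version or versions.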

Reproducibility

From a clean clone of the repository:

pip install -r requirements.txt
cp .env.example .env
# add your OpenRouter key to .env before running the commands below

# generate a response set
python -m runner.run --part A --model anthropic/claude-opus-4.7 --label Opus_4_7

# grade it with the second judge
python -m runner.grade \
    --responses-glob "raw_responses/Part_A_Opus_4_7_*" \
    --rubric rubric/answer_key_partA_Q1-Q50.md \
    --judge-model deepseek/deepseek-v4-pro \
    --out evaluations/auto_grade_repro.md

Tolerance: the composite from a fresh run should land within ±0.10 of the published number for that model. Drift beyond ±0.10 indicates that the prompt preamble, runner defaults, or judge parameters have changed, and the leaderboard must be regenerated.
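
The tolerance rule can be checked mechanically. The helper below is a hypothetical sketch, not part of the repo's runner, and the composite values are made up. It rounds the gap before comparing so that floating-point noise does not fail a run that sits exactly on the ±0.10 boundary.

```python
TOLERANCE = 0.10  # the published +/-0.10 reproducibility band


def within_tolerance(fresh, published, tolerance=TOLERANCE):
    """Return True if a fresh composite reproduces the published one.

    The gap is rounded to 6 decimals so a difference of exactly 0.10 is
    not rejected because of binary floating-point representation error.
    """
    return round(abs(fresh - published), 6) <= tolerance


# Hypothetical composites, for illustration only.
print(within_tolerance(3.80, 3.75))  # gap 0.05
print(within_tolerance(3.85, 3.75))  # gap 0.10, on the boundary
print(within_tolerance(3.62, 3.75))  # gap 0.13
```

A failed check would point to a changed prompt preamble, runner default, or judge parameter, which is the trigger for regenerating the leaderboard.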

Full report · PDF

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
