Methodology
How questions are written, how responses are graded, how composites are computed, what rubric items are still open, and how to reproduce any score on this site.
The seven criteria
Each response set is scored on seven cognitive-skill criteria, each on an anchored 1–5 scale.
| Criterion | What it tests |
|---|---|
| Problem framing | Identifying the question type, scope, stakeholder perspective, and implicit issues. |
| Framework knowledge | ISO 14040 / 14044 / 14025 / 14067, EN 15804+A2, EN 16908, PAS 2050, PEF, PACT v3, GHG Protocol Product, sector-specific PCRs. |
| Regulatory knowledge | EU Battery Regulation, CBAM, ESPR, EUDR; delegated acts; compliance dates; legal-admissibility distinctions. |
| Mathematical reasoning | Functional unit, allocation arithmetic, CFF coefficients, GWP characterisation, sensitivity & uncertainty. |
| Domain knowledge | Practitioner-grade detail — ecoinvent vintages, PEFCR defaults, sector chemistry, country defaults. |
| Critical reasoning | Spot-the-error, internal-contradiction catching, framework-conflation diagnosis. |
| Epistemic discipline | Honest uncertainty handling, declining to invent missing data, separating recalled from verified information. |
Full 1–5 level definitions per criterion are in analysis/evaluation_matrix_and_legend.md.
Declaration-review weighting
The headline composite uses weights tuned for what a verifier reviewing an EPD or PCF declaration actually needs: quantitative correctness and safe epistemics over standards recall.
| Criterion | Weight |
|---|---|
| Mathematical reasoning | 25% |
| Epistemic discipline | 20% |
| Domain knowledge | 20% |
| Critical reasoning | 15% |
| Framework knowledge | 10% |
| Regulatory knowledge | 5% |
| Problem framing | 5% |
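
As a minimal sketch of how these weights could combine seven per-criterion scores into one number, the snippet below takes a weighted mean on the 1–5 scale. The function name `weighted_composite` and the example scores are hypothetical, and any rescaling the site applies before publishing a composite is not shown here.

```python
# Declaration-review weights from the table above, as fractions of 1.
WEIGHTS = {
    "Mathematical reasoning": 0.25,
    "Epistemic discipline": 0.20,
    "Domain knowledge": 0.20,
    "Critical reasoning": 0.15,
    "Framework knowledge": 0.10,
    "Regulatory knowledge": 0.05,
    "Problem framing": 0.05,
}


def weighted_composite(scores):
    """Return the weighted mean of per-criterion scores (1-5 scale).

    Assumption: the composite is a plain weighted mean; the published
    numbers may use a different scale or extra normalisation.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight


# Hypothetical scores, for illustration only.
example_scores = {
    "Mathematical reasoning": 4,
    "Epistemic discipline": 3,
    "Domain knowledge": 4,
    "Critical reasoning": 5,
    "Framework knowledge": 2,
    "Regulatory knowledge": 3,
    "Problem framing": 4,
}
print(round(weighted_composite(example_scores), 2))
```

Because mathematical reasoning, epistemic discipline, and domain knowledge together carry 65% of the weight, a weak score on any of them moves the composite far more than a weak score on regulatory knowledge or problem framing.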
Two-judge design
Each response set is graded by two independent judge models against the same rubric.
- Primary: Claude Opus 4.7, inline (hand-graded).
- Second judge: DeepSeek V4 Pro via the repo's runner/grade.py.
Inter-judge agreement is in analysis/judge_agreement.md. Rank order is preserved exactly on Part A and only approximately on Parts B and C (Spearman ρ = 1.00 / 0.73 / 0.65 on Parts A / B / C). Absolute composites diverge most on Part B (mean Δ = −1.7 between judges). This is partly an artefact of grading method: Opus inline samples Q51–Q70, while DeepSeek grades all 50 Part-B questions per pass.
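The rank-agreement statistic quoted above can be illustrated with a small, dependency-free sketch. It computes Spearman's ρ as the Pearson correlation of the two rank vectors, with tied values sharing their mean rank. The function names and the two score lists are hypothetical, and this is not necessarily how analysis/judge_agreement.md was produced.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one judge gives all-equal scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical composites for five models from two judges.
judge_a = [78.0, 71.5, 66.0, 60.2, 55.1]
judge_b = [74.0, 70.0, 61.0, 63.5, 50.0]
print(spearman_rho(judge_a, judge_b))
```

In this made-up example the two judges swap one adjacent pair of models, which already pulls ρ below 1.00. Values such as 0.73 or 0.65 therefore indicate several rank swaps rather than an identical ordering.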
Anonymisation boundary
The canonical 10-model matrix (the per-criterion heatmap on the homepage) is blinded as Models 1–10 to avoid bias in commentary, matching the evaluations/evaluation_model_N.md reports in the repo. The cross-part leaderboard uses real model names, matching the evaluations/auto_grade_*.md sweep files. No identity-mapping file is kept in the repo. The full report unlocks the mapping where the source files already allow it.
Open rubric items
Three rubric items are flagged for verification. Composites on this site are computed under the current rubric answers; the open items are surfaced so a citing reader can specify pre- or post-resolution scope.
- Q7 — EU Battery Regulation delegated-act date. Current answer B (18 Aug 2026, threshold-determination delegated act per Regulation 2023/1542 consolidated text). Four of ten canonical models answer B; six answer A (18 Feb 2025, the EV compliance deadline — not the threshold-act deadline).
- Q47 — GWP version for N₂O characterisation. The rubric still has to specify whether AR5 (265), AR6 (273), or both are acceptable.
- Q48 — "kWh of compute delivered" interpretation. Utilisation-weighted useful compute vs time-weighted operating output. Both readings are arithmetically correct but yield different per-kWh values (619 g vs 535 g).
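To show why the Q47 item matters for grading, here is a small sketch of the characterisation step under each factor. The GWP values are the two quoted in the item above; the function name and the 2.0 kg emission figure are hypothetical and are not taken from the question itself.

```python
# 100-year GWP factors for N2O as quoted in the Q47 open item.
GWP100_N2O = {"AR5": 265, "AR6": 273}


def n2o_to_co2e(mass_n2o_kg, report):
    """Convert a mass of N2O (kg) to kg CO2e under the chosen IPCC report."""
    return mass_n2o_kg * GWP100_N2O[report]


mass_n2o_kg = 2.0  # hypothetical emission, for illustration only
for report in GWP100_N2O:
    print(report, n2o_to_co2e(mass_n2o_kg, report), "kg CO2e")

# Relative gap between the two factors, independent of the mass chosen.
gap = (GWP100_N2O["AR6"] - GWP100_N2O["AR5"]) / GWP100_N2O["AR5"]
print(f"AR6 is {gap:.1%} higher than AR5")
```

A gap of about 3% can separate two otherwise correct numerical answers, so a rubric that accepts only one factor would mark down a response that correctly applied the other.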
Reproducibility
From a clean clone of the repository, with an OpenRouter key added to .env once that file has been copied from .env.example:
pip install -r requirements.txt
cp .env.example .env
# generate a response set
python -m runner.run --part A --model anthropic/claude-opus-4.7 --label Opus_4_7
# grade it with the second judge
python -m runner.grade \
--responses-glob "raw_responses/Part_A_Opus_4_7_*" \
--rubric rubric/answer_key_partA_Q1-Q50.md \
--judge-model deepseek/deepseek-v4-pro \
--out evaluations/auto_grade_repro.md
Tolerance: the composite from a fresh run lands within ±0.10 of the published number for that model. Drift outside ±0.10 means the prompt preamble, runner defaults, or judge parameters have changed and the leaderboard must be regenerated.
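The tolerance rule can be expressed as a short check. This is a sketch only: the function name and the example composites are hypothetical, and reading the composite out of evaluations/auto_grade_repro.md is left out because that file's layout is not described here.

```python
TOLERANCE = 0.10  # the ±0.10 band stated above


def within_tolerance(fresh, published, tol=TOLERANCE):
    """True if a fresh composite is within ±tol of the published one.

    A tiny epsilon absorbs floating-point error at the boundary, so a
    difference of exactly 0.10 counts as inside the band.
    """
    return abs(fresh - published) <= tol + 1e-9


# Hypothetical composites, for illustration only.
print(within_tolerance(3.78, 3.70))  # inside the band
print(within_tolerance(3.85, 3.70))  # outside the band: regenerate
```

A check like this could run after each reproduction pass, with a failure treated as a signal to review the prompt preamble, runner defaults, and judge parameters before trusting the leaderboard.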
Get the full report
All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.