An LLM benchmark for LCA, PCF, and EPD methodology

Which large language models can be trusted with declaration-review-grade environmental work, held to the same standard an EPD verifier would apply to a senior practitioner?

111 questions · 3 parts · 7 anchored criteria · 2 independent judges · v1 · refresh 2026-05-22 · CC-BY 4.0

Ranking

Sorted by the declaration-review weighted composite. Weights, in percent: mathematical reasoning 25 · epistemic discipline 20 · domain knowledge 20 · critical reasoning 15 · framework knowledge 10 · regulatory knowledge 5 · problem framing 5.

#	Model	Weighted	Unweighted	Parts (Opus / DeepSeek)	Recommendation
1	Claude Opus 4.7 (anthropic/claude-opus-4.7)	4.95	—	A B C	Verifier-grade
2	GPT-5.5 (openai/gpt-5.5)	4.88	—	A B C	Verifier-grade
3	Kimi K2.6 (canonical: Model 5)	4.60	4.64	A B C	Verifier-grade
4	DeepSeek V4 Pro	4.48	—	A B C	Practitioner-grade
5	Gemini 3.1 Pro (google/gemini-3.1-pro)	4.40	—	A B C	Practitioner-grade
6	Anthropic Sonnet 4.6	4.35	—	A B C	Practitioner-grade
7	OpenAI GPT 5.4 thinking	4.10	—	A B C	Practitioner-grade
8	GLM 5.1	3.83	—	A B C	Practitioner-grade
9	Gemma 4 31B	3.70	—	A B C	Practitioner-grade
10	Mistral Large 2512	3.63	—	A B C	Practitioner-grade
11	Anthropic Haiku 4.5	3.50	—	A B C	Practitioner-grade
12	OpenAI GPT 5.4 (non-thinking)	3.50	—	A B C	Practitioner-grade
13	Qwen 3.5 397B	3.50	—	A B C	Practitioner-grade
14	Mistral Small 2603	3.38	—	A B C	Not for declaration-review
15	Gemma 4 26B A4B	3.08	—	A B C	Not for declaration-review
16	Qwen 3.5 9B	2.38	—	A B C	Not for declaration-review
17	Nemotron Nano 9B v2	1.80	—	A B C	Not for declaration-review

Coverage — every model × every part × every judge

Each cell is one (model, part, judge) result. We show what we have and what we don't, by design.

Cohort	Model	A·Opus	A·DeepSeek	B·Opus	B·DeepSeek	C·Opus	C·DeepSeek
large-open-weight	DeepSeek V4 Pro	4.48	4.00	4.50	3.55	4.65	4.45
large-open-weight	GLM 5.1	3.83	2.90	3.85	2.05	4.00	4.30
large-open-weight	Kimi K2.6	4.85	4.85	4.75	2.05	4.55	4.90
large-open-weight	Mistral Large 2512	3.63	2.40	3.30	1.75	3.45	4.45
large-open-weight	Qwen 3.5 397B	3.50	hung	3.65	2.05	3.65	4.00
latest-frontier	Claude Opus 4.7	4.95	3.75	4.95	3.75	5.00	4.40
latest-frontier	Gemini 3.1 Pro	4.40	2.85	4.60	2.10	4.00	3.40
latest-frontier	GPT-5.5	4.88	3.85	4.75	3.25	4.80	4.40
legacy-frontier	Anthropic Haiku 4.5	3.50	2.95	4.10	2.25	4.10	3.75
legacy-frontier	Anthropic Sonnet 4.6	4.35	3.40	3.95	3.95	4.65	4.60
legacy-frontier	OpenAI GPT 5.4 (non-thinking)	3.50	2.30	3.20	2.30	4.00	4.35
legacy-frontier	OpenAI GPT 5.4 thinking	4.10	3.80	4.30	2.30	4.45	4.05
small-open-weight	Gemma 4 26B A4B	3.08	1.95	3.20	2.00	3.50	3.50
small-open-weight	Gemma 4 31B	3.70	3.35	3.40	2.40	3.95	2.75
small-open-weight	Mistral Small 2603	3.38	1.75	3.55	1.75	3.70	4.60
small-open-weight	Nemotron Nano 9B v2	1.80	1.10	2.00	1.00	2.15	1.80
small-open-weight	Qwen 3.5 9B	2.38	1.55	2.40	1.35	1.80	1.85

Cells: composite value where graded; "hung" where the judge timed out; blank where not run. Cohort gaps (latest-frontier × Part C; legacy × Part C) are by design — see the repository gap ledger.
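Anyone reading this table programmatically has to tell the three cell states apart. The following is a minimal sketch in Python, assuming the cells arrive as raw strings; `parse_cell` and the status labels `graded`, `hung`, and `not_run` are hypothetical names chosen for illustration and are not taken from the benchmark repository.

```python
from typing import Optional, Tuple


def parse_cell(raw: str) -> Tuple[str, Optional[float]]:
    """Classify one coverage cell as graded, hung, or not run."""
    text = raw.strip()
    if text == "":
        return ("not_run", None)    # blank cell: this combination was not run
    if text.lower() == "hung":
        return ("hung", None)       # the judge timed out
    return ("graded", float(text))  # composite value on the 1-5 scale


# Qwen 3.5 397B, Part A, as it appears in the table above.
row = {"A·Opus": "3.50", "A·DeepSeek": "hung"}
parsed = {column: parse_cell(value) for column, value in row.items()}
print(parsed)
```

Keeping the status separate from the value means a timed-out judge is never silently treated as a zero or averaged into a composite.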

Methodology in one paragraph

Every response set is scored on seven cognitive-skill criteria (problem framing, framework knowledge, regulatory knowledge, mathematical reasoning, domain knowledge, critical reasoning, and epistemic discipline), each on an anchored 1–5 scale. Two independent judge models grade the same responses against the same rubric: Claude Opus 4.7 inline (primary) and DeepSeek V4 Pro via the repo's grading runner (second judge). Rank agreement between the judges is perfect on Part A and weaker on Parts B and C (Spearman ρ = 1.00 / 0.73 / 0.65 on Parts A / B / C). Inter-judge agreement, weighting rationale, and the verification status of each rubric item are documented in the methodology.
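For readers who want to check rank agreement themselves, here is a minimal Python sketch of Spearman's ρ, computed as the Pearson correlation of average ranks (ties share the mean of their positions). The function names are illustrative, and this is not the repository's own grading code. The inputs are the B·Opus and B·DeepSeek columns of the coverage table; the page does not say which cohort or score granularity the published ρ values were computed on, so the result need not match them.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Part B composites in coverage-table row order (17 models).
b_opus = [4.50, 3.85, 4.75, 3.30, 3.65, 4.95, 4.60, 4.75, 4.10,
          3.95, 3.20, 4.30, 3.20, 3.40, 3.55, 2.00, 2.40]
b_deepseek = [3.55, 2.05, 2.05, 1.75, 2.05, 3.75, 2.10, 3.25, 2.25,
              3.95, 2.30, 2.30, 2.00, 2.40, 1.75, 1.00, 1.35]

rho_b = spearman_rho(b_opus, b_deepseek)
print(f"Part B rank agreement: {rho_b:.2f}")
```

A value of 1.00 means both judges order the models identically, whatever the absolute gap between their scores.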

Per-criterion sample — and the full report

The seven-criterion matrix below covers the citation-safe Part-A cohort (10 models, anonymised). The full mapping to real names, the per-question grids, and the same matrix for the wider 17-model real-name cohort are in the full report.

Criterion	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 7	Model 8	Model 9	Model 10
Problem framing	4.0	4.0	4.5	4.5	5.0	4.0	3.5	2.5	4.0	2.0
Framework knowledge	4.0	4.5	4.5	4.0	4.5	4.0	3.5	2.0	4.5	2.0
Regulatory knowledge	3.5	3.5	3.5	3.5	4.5	3.5	3.0	2.0	4.0	2.0
Mathematical reasoning	4.0	2.5	4.5	4.5	4.5	4.0	3.0	2.0	2.5	1.5
Domain knowledge	3.0	3.0	4.5	3.5	4.5	3.5	3.0	2.5	3.5	2.0
Critical reasoning	3.5	3.5	4.0	4.0	5.0	3.5	3.0	3.0	4.0	1.5
Epistemic discipline	4.0	2.5	5.0	3.5	4.5	3.5	3.0	2.5	3.0	2.0
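The weights listed in the Ranking section can be applied to a single column of this matrix. The sketch below is in Python and assumes the composite is a plain weighted average of the seven criterion scores; the function names and dictionary keys are illustrative. Under that assumption, the Model 5 column gives 4.6 weighted and 4.64 unweighted, which agrees with the 4.60 and 4.64 shown for Kimi K2.6 (canonical: Model 5) in the ranking.

```python
# Weights in percent, as listed in the Ranking section.
WEIGHTS = {
    "math": 25,
    "epistemic": 20,
    "domain": 20,
    "critical": 15,
    "framework": 10,
    "regulatory": 5,
    "framing": 5,
}


def weighted_composite(scores):
    """Weighted average of the seven criterion scores (assumed formula)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight


def unweighted_mean(scores):
    """Plain arithmetic mean of the seven criterion scores."""
    return sum(scores[c] for c in WEIGHTS) / len(WEIGHTS)


# Model 5 column from the matrix above.
model_5 = {
    "framing": 5.0,
    "framework": 4.5,
    "regulatory": 4.5,
    "math": 4.5,
    "domain": 4.5,
    "critical": 5.0,
    "epistemic": 4.5,
}

print(round(weighted_composite(model_5), 2))  # 4.6
print(round(unweighted_mean(model_5), 2))     # 4.64
```

Because mathematical reasoning and epistemic discipline carry 45 of the 100 weight points between them, a model that is strong on framing but weak on those two criteria scores lower on the weighted composite than on the unweighted mean.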

Get the full report

All 111 questions, the complete rubric with acceptable variations and common errors, per-model verdicts across the canonical cohort, the seven-criterion matrix in full, and the methodology paper text. CC-BY 4.0. Delivered as PDF within 60 seconds.
