Last refresh: 2026-05-17 · 17 models graded · 3 open rubric items · DeepSeek judge coverage: 98%

An LLM benchmark for LCA, PCF, and EPD methodology

Which Large Language Models can be trusted with declaration-review-grade environmental work — the same standard an EPD verifier would hold a senior practitioner to?

111 questions · 3 parts · 7 anchored criteria · 2 independent judges · v1 · refresh 2026-05-17 · CC-BY 4.0

Ranking

Sorted by the declaration-review weighted composite (math 25 · epistemic 20 · domain 20 · critical 15 · framework 10 · regulatory 5 · framing 5).

# · Model · Weighted · Unweighted · Parts (Opus / DeepSeek) · Recommendation
#1 · Claude Opus 4.7 · Weighted 4.95 · Verifier-grade
#2 · GPT-5.5 · Weighted 4.88 · Verifier-grade
#3 · Kimi K2.6 · Weighted 4.60 · Unweighted 4.64 · Verifier-grade
#4 · DeepSeek V4 Pro · Weighted 4.48 · Practitioner-grade
#5 · Gemini 3.1 Pro · Weighted 4.40 · Practitioner-grade
#6 · Anthropic Sonnet 4.6 · Weighted 4.35 · Practitioner-grade
#7 · OpenAI GPT 5.4 thinking · Weighted 4.10 · Practitioner-grade
#8 · GLM 5.1 · Weighted 3.83 · Practitioner-grade
#9 · Gemma 4 31B · Weighted 3.70 · Practitioner-grade
#10 · Mistral Large 2512 · Weighted 3.63 · Practitioner-grade
#11 · Anthropic Haiku 4.5 · Weighted 3.50 · Practitioner-grade
#12 · OpenAI GPT 5.4 (non-thinking) · Weighted 3.50 · Practitioner-grade
#13 · Qwen 3.5 397B · Weighted 3.50 · Practitioner-grade
#14 · Mistral Small 2603 · Weighted 3.38 · Not for declaration-review
#15 · Gemma 4 26B A4B · Weighted 3.08 · Not for declaration-review
#16 · Qwen 3.5 9B · Weighted 2.38 · Not for declaration-review
#17 · Nemotron Nano 9B v2 · Weighted 1.80 · Not for declaration-review

Coverage — every model × every part × every judge

Each cell is one (model, part, judge) result. We show what we have and what we don't, by design.

Cohort             Model                          A·Opus      A·DeepSeek  B·Opus      B·DeepSeek  C·Opus      C·DeepSeek
large-open-weight  DeepSeek V4 Pro                4.48        4.00        4.50        3.55        4.65        4.45
large-open-weight  GLM 5.1                        3.83        2.90        3.85        2.05        4.00        4.30
large-open-weight  Kimi K2.6                      4.85        4.85        4.75        2.05        4.55        4.90
large-open-weight  Mistral Large 2512             3.63        2.40        3.30        1.75        3.45        4.45
large-open-weight  Qwen 3.5 397B                  3.50        hung        3.65        2.05        3.65        4.00
latest-frontier    Claude Opus 4.7                4.95        3.75        4.95        3.75        5.00        4.40
latest-frontier    Gemini 3.1 Pro                 4.40        2.85        4.60        2.10        4.00        3.40
latest-frontier    GPT-5.5                        4.88        3.85        4.75        3.25        4.80        4.40
legacy-frontier    Anthropic Haiku 4.5            3.50        2.95        4.10        2.25        4.10        3.75
legacy-frontier    Anthropic Sonnet 4.6           4.35        3.40        3.95        3.95        4.65        4.60
legacy-frontier    OpenAI GPT 5.4 (non-thinking)  3.50        2.30        3.20        2.30        4.00        4.35
legacy-frontier    OpenAI GPT 5.4 thinking        4.10        3.80        4.30        2.30        4.45        4.05
small-open-weight  Gemma 4 26B A4B                3.08        1.95        3.20        2.00        3.50        3.50
small-open-weight  Gemma 4 31B                    3.70        3.35        3.40        2.40        3.95        2.75
small-open-weight  Mistral Small 2603             3.38        1.75        3.55        1.75        3.70        4.60
small-open-weight  Nemotron Nano 9B v2            1.80        1.10        2.00        1.00        2.15        1.80
small-open-weight  Qwen 3.5 9B                    2.38        1.55        2.40        1.35        1.80        1.85

Cells: composite value where graded; "hung" where the judge timed out; blank where not run. Cohort gaps (latest-frontier × Part C; legacy × Part C) are by design — see the repository gap ledger.
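The cell conventions can be made concrete with a short sketch. It assumes that "judge coverage" means graded cells divided by all (model, part) cells for that judge; that definition is an assumption, not something stated here, although it does reproduce the 98% DeepSeek figure in the page header. The names `DEEPSEEK_CELLS`, `classify`, and `coverage` are hypothetical.

```python
# DeepSeek-judge cells (Parts A, B, C) copied from the coverage table.
# "hung" marks a judge timeout; an empty string would mark "not run".
DEEPSEEK_CELLS = {
    "DeepSeek V4 Pro": ["4.00", "3.55", "4.45"],
    "GLM 5.1": ["2.90", "2.05", "4.30"],
    "Kimi K2.6": ["4.85", "2.05", "4.90"],
    "Mistral Large 2512": ["2.40", "1.75", "4.45"],
    "Qwen 3.5 397B": ["hung", "2.05", "4.00"],
    "Claude Opus 4.7": ["3.75", "3.75", "4.40"],
    "Gemini 3.1 Pro": ["2.85", "2.10", "3.40"],
    "GPT-5.5": ["3.85", "3.25", "4.40"],
    "Anthropic Haiku 4.5": ["2.95", "2.25", "3.75"],
    "Anthropic Sonnet 4.6": ["3.40", "3.95", "4.60"],
    "OpenAI GPT 5.4 (non-thinking)": ["2.30", "2.30", "4.35"],
    "OpenAI GPT 5.4 thinking": ["3.80", "2.30", "4.05"],
    "Gemma 4 26B A4B": ["1.95", "2.00", "3.50"],
    "Gemma 4 31B": ["3.35", "2.40", "2.75"],
    "Mistral Small 2603": ["1.75", "1.75", "4.60"],
    "Nemotron Nano 9B v2": ["1.10", "1.00", "1.80"],
    "Qwen 3.5 9B": ["1.55", "1.35", "1.85"],
}


def classify(cell):
    """Map a raw table cell to 'graded', 'hung', or 'not run'."""
    if cell == "":
        return "not run"
    if cell == "hung":
        return "hung"
    float(cell)  # raises ValueError if the cell is not a number
    return "graded"


def coverage(cells_by_model):
    """Share of cells that carry a graded composite (assumed definition)."""
    cells = [cell for row in cells_by_model.values() for cell in row]
    graded = sum(1 for cell in cells if classify(cell) == "graded")
    return graded / len(cells)


print(f"{coverage(DEEPSEEK_CELLS):.0%}")  # 98% (50 of 51 cells graded)
```

Keeping "hung" and "not run" as distinct states, rather than collapsing both to a missing value, preserves the difference between a judge failure and a planned gap.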

Methodology in one paragraph

Every response set is scored on seven cognitive-skill criteria (problem framing, framework knowledge, regulatory knowledge, mathematical reasoning, domain knowledge, critical reasoning, and epistemic discipline), each on an anchored 1–5 scale. Two independent judge models grade the same responses against the same rubric: Claude Opus 4.7 inline (primary) and DeepSeek V4 Pro via the repo's grading runner (second judge). Rank order is largely preserved across judges, most strongly on Part A (Spearman ρ = 1.00 / 0.73 / 0.65 on Parts A / B / C). Inter-judge agreement, weighting rationale, and the verification status of each rubric item are documented in the methodology.
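For readers who want to check rank agreement between the two judges themselves, here is a minimal sketch of Spearman's ρ computed as the Pearson correlation of average ranks. The cohort and tie handling behind the published figures are not stated here, so this example is not expected to reproduce them; it uses only the Part C scores of the five large-open-weight models from the coverage table. The function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)  # undefined if either input is constant


# Part C, large-open-weight cohort, in table order:
# DeepSeek V4 Pro, GLM 5.1, Kimi K2.6, Mistral Large 2512, Qwen 3.5 397B
opus_c = [4.65, 4.00, 4.55, 3.45, 3.65]
deepseek_c = [4.45, 4.30, 4.90, 4.45, 4.00]

print(f"rho = {spearman_rho(opus_c, deepseek_c):.2f}")  # rho = 0.41
```

A "hung" or blank cell has no rank, so any model with a missing score for either judge would need to be dropped from both lists before calling `spearman_rho`.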

Per-criterion sample — and the full report

The seven-criterion matrix for the citation-safe Part-A cohort (10 models, anonymised). Full mapping to real names, per-question grids, and the same matrix for the wider 17-model real-name cohort are in the full report.

Criterion               Model 1   Model 2   Model 3   Model 4   Model 5   Model 6   Model 7   Model 8   Model 9   Model 10
Problem framing         4.0       4.0       4.5       4.5       5.0       4.0       3.5       2.5       4.0       2.0
Framework knowledge     4.0       4.5       4.5       4.0       4.5       4.0       3.5       2.0       4.5       2.0
Regulatory knowledge    3.5       3.5       3.5       3.5       4.5       3.5       3.0       2.0       4.0       2.0
Mathematical reasoning  4.0       2.5       4.5       4.5       4.5       4.0       3.0       2.0       2.5       1.5
Domain knowledge        3.0       3.0       4.5       3.5       4.5       3.5       3.0       2.5       3.5       2.0
Critical reasoning      3.5       3.5       4.0       4.0       5.0       3.5       3.0       3.0       4.0       1.5
Epistemic discipline    4.0       2.5       5.0       3.5       4.5       3.5       3.0       2.5       3.0       2.0
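The per-criterion scores can be turned into the two composites with a short sketch. It assumes the weighted composite is a plain weighted mean, using the percentage weights listed under Ranking, and that the unweighted composite is the simple mean of the seven criteria; neither formula is spelled out here, so treat both as assumptions. The example input is the Model 5 column from the matrix above, and the dictionary keys are hypothetical short names.

```python
# Percentage weights from the ranking description; they sum to 100.
WEIGHTS = {
    "math": 25,
    "epistemic": 20,
    "domain": 20,
    "critical": 15,
    "framework": 10,
    "regulatory": 5,
    "framing": 5,
}


def weighted_composite(scores):
    """Weighted mean of the seven criterion scores (assumed formula)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


def unweighted_composite(scores):
    """Simple mean of the seven criterion scores (assumed formula)."""
    return sum(scores[name] for name in WEIGHTS) / len(WEIGHTS)


# "Model 5" column of the per-criterion matrix.
model_5 = {
    "framing": 5.0,
    "framework": 4.5,
    "regulatory": 4.5,
    "math": 4.5,
    "domain": 4.5,
    "critical": 5.0,
    "epistemic": 4.5,
}

print(round(weighted_composite(model_5), 2))    # 4.6
print(round(unweighted_composite(model_5), 2))  # 4.64
```

Because mathematical reasoning, epistemic discipline, and domain knowledge carry 65 of the 100 weight points, a model that is strong on framing but weak on those three will score lower on the weighted composite than on the unweighted one.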