An LLM benchmark for LCA, PCF, and EPD methodology
Which Large Language Models can be trusted with declaration-review-grade environmental work — the same standard an EPD verifier would hold a senior practitioner to?
Ranking
Sorted by the declaration-review weighted composite (weights in percent: math 25 · epistemic 20 · domain 20 · critical 15 · framework 10 · regulatory 5 · framing 5).
| # | Model | Weighted | Unweighted | Parts (Opus / DeepSeek) | Recommendation |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (`anthropic/claude-opus-4.7`) | 4.95 | — | A B C | Verifier-grade |
| 2 | GPT-5.5 (`openai/gpt-5.5`) | 4.88 | — | A B C | Verifier-grade |
| 3 | Kimi K2.6 (canonical: Model 5) | 4.60 | 4.64 | A B C | Verifier-grade |
| 4 | DeepSeek V4 Pro | 4.48 | — | A B C | Practitioner-grade |
| 5 | Gemini 3.1 Pro (`google/gemini-3.1-pro`) | 4.40 | — | A B C | Practitioner-grade |
| 6 | Anthropic Sonnet 4.6 | 4.35 | — | A B C | Practitioner-grade |
| 7 | OpenAI GPT 5.4 thinking | 4.10 | — | A B C | Practitioner-grade |
| 8 | GLM 5.1 | 3.83 | — | A B C | Practitioner-grade |
| 9 | Gemma 4 31B | 3.70 | — | A B C | Practitioner-grade |
| 10 | Mistral Large 2512 | 3.63 | — | A B C | Practitioner-grade |
| 11 | Anthropic Haiku 4.5 | 3.50 | — | A B C | Practitioner-grade |
| 12 | OpenAI GPT 5.4 (non-thinking) | 3.50 | — | A B C | Practitioner-grade |
| 13 | Qwen 3.5 397B | 3.50 | — | A B C | Practitioner-grade |
| 14 | Mistral Small 2603 | 3.38 | — | A B C | Not for declaration-review |
| 15 | Gemma 4 26B A4B | 3.08 | — | A B C | Not for declaration-review |
| 16 | Qwen 3.5 9B | 2.38 | — | A B C | Not for declaration-review |
| 17 | Nemotron Nano 9B v2 | 1.80 | — | A B C | Not for declaration-review |
Coverage — every model × every part × every judge
Each cell is one (model, part, judge) result. We show what we have and what we don't, by design.
| Cohort | Model | A·Opus | A·DeepSeek | B·Opus | B·DeepSeek | C·Opus | C·DeepSeek |
|---|---|---|---|---|---|---|---|
| large-open-weight | DeepSeek V4 Pro | 4.48 | 4.00 | 4.50 | 3.55 | 4.65 | 4.45 |
| large-open-weight | GLM 5.1 | 3.83 | 2.90 | 3.85 | 2.05 | 4.00 | 4.30 |
| large-open-weight | Kimi K2.6 | 4.85 | 4.85 | 4.75 | 2.05 | 4.55 | 4.90 |
| large-open-weight | Mistral Large 2512 | 3.63 | 2.40 | 3.30 | 1.75 | 3.45 | 4.45 |
| large-open-weight | Qwen 3.5 397B | 3.50 | hung | 3.65 | 2.05 | 3.65 | 4.00 |
| latest-frontier | Claude Opus 4.7 | 4.95 | 3.75 | 4.95 | 3.75 | 5.00 | 4.40 |
| latest-frontier | Gemini 3.1 Pro | 4.40 | 2.85 | 4.60 | 2.10 | 4.00 | 3.40 |
| latest-frontier | GPT-5.5 | 4.88 | 3.85 | 4.75 | 3.25 | 4.80 | 4.40 |
| legacy-frontier | Anthropic Haiku 4.5 | 3.50 | 2.95 | 4.10 | 2.25 | 4.10 | 3.75 |
| legacy-frontier | Anthropic Sonnet 4.6 | 4.35 | 3.40 | 3.95 | 3.95 | 4.65 | 4.60 |
| legacy-frontier | OpenAI GPT 5.4 (non-thinking) | 3.50 | 2.30 | 3.20 | 2.30 | 4.00 | 4.35 |
| legacy-frontier | OpenAI GPT 5.4 thinking | 4.10 | 3.80 | 4.30 | 2.30 | 4.45 | 4.05 |
| small-open-weight | Gemma 4 26B A4B | 3.08 | 1.95 | 3.20 | 2.00 | 3.50 | 3.50 |
| small-open-weight | Gemma 4 31B | 3.70 | 3.35 | 3.40 | 2.40 | 3.95 | 2.75 |
| small-open-weight | Mistral Small 2603 | 3.38 | 1.75 | 3.55 | 1.75 | 3.70 | 4.60 |
| small-open-weight | Nemotron Nano 9B v2 | 1.80 | 1.10 | 2.00 | 1.00 | 2.15 | 1.80 |
| small-open-weight | Qwen 3.5 9B | 2.38 | 1.55 | 2.40 | 1.35 | 1.80 | 1.85 |
Cells: composite value where graded; "hung" where the judge timed out; blank where not run. Cohort gaps (latest-frontier × Part C; legacy × Part C) are by design — see the repository gap ledger.
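The three cell states described above can be kept distinct in code, so that a timed-out judge is never confused with a run that was not attempted, or silently counted as zero. The following is a minimal Python sketch; the names `Cell` and `parse_cell` are illustrative assumptions and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One (model, part, judge) result from the coverage table."""
    status: str              # "graded", "hung", or "not_run"
    score: Optional[float]   # composite on the 1-5 scale, set only when graded


def parse_cell(raw: str) -> Cell:
    """Map a raw table cell to one of the three documented states."""
    text = raw.strip()
    if text == "":
        return Cell("not_run", None)      # blank: the combination was not run
    if text.lower() == "hung":
        return Cell("hung", None)         # the judge timed out
    return Cell("graded", float(text))    # a graded composite value


# Qwen 3.5 397B row from the coverage table (A·Opus through C·DeepSeek).
row = ["3.50", "hung", "3.65", "2.05", "3.65", "4.00"]
cells = [parse_cell(c) for c in row]
graded = [c.score for c in cells if c.status == "graded"]
print(len(graded), "of", len(cells), "cells graded")
```

Keeping `score` as `None` for the non-graded states forces any downstream average to filter on `status` first, instead of absorbing a placeholder number.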
Methodology in one paragraph
Every response set is scored on seven cognitive-skill criteria (problem framing, framework knowledge, regulatory knowledge, mathematical reasoning, domain knowledge, critical reasoning, and epistemic discipline), each on an anchored 1–5 scale. Two independent judge models grade the same responses against the same rubric: Claude Opus 4.7 inline (primary) and DeepSeek V4 Pro via the repo's grading runner (second judge). Inter-judge rank agreement is Spearman ρ = 1.00 / 0.73 / 0.65 on Parts A / B / C, so rank order is fully preserved on Part A and only partially on Parts B and C. Inter-judge agreement, weighting rationale, and the verification status of each rubric item are documented in methodology.
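For readers who want to check rank agreement between the two judges themselves, Spearman's ρ can be computed as the Pearson correlation of the two rank vectors. The sketch below is a dependency-free Python version that gives tied scores their mean rank; the function names are illustrative, and it is not the repository's own implementation. The usage lines take the three latest-frontier Part A rows from the coverage table purely to show the call. A three-model subset is far too small to say anything about agreement, and it does not reproduce the published coefficients, which are computed over the page's own cohorts.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending from 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one judge gives all-equal scores")
    return cov / (var_x * var_y) ** 0.5


# Part A, latest-frontier rows: Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5.
opus_scores = [4.95, 4.40, 4.88]
deepseek_scores = [3.75, 2.85, 3.85]
print(round(spearman_rho(opus_scores, deepseek_scores), 2))
```

Cells marked "hung" have no score and would need to be dropped pairwise before calling `spearman_rho`.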
Per-criterion sample — and the full report
The seven-criterion matrix for the citation-safe Part-A cohort (10 models, anonymised). Full mapping to real names, per-question grids, and the same matrix for the wider 17-model real-name cohort are in the full report.
| Criterion | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9 | Model 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Problem framing | 4.0 | 4.0 | 4.5 | 4.5 | 5.0 | 4.0 | 3.5 | 2.5 | 4.0 | 2.0 |
| Framework knowledge | 4.0 | 4.5 | 4.5 | 4.0 | 4.5 | 4.0 | 3.5 | 2.0 | 4.5 | 2.0 |
| Regulatory knowledge | 3.5 | 3.5 | 3.5 | 3.5 | 4.5 | 3.5 | 3.0 | 2.0 | 4.0 | 2.0 |
| Mathematical reasoning | 4.0 | 2.5 | 4.5 | 4.5 | 4.5 | 4.0 | 3.0 | 2.0 | 2.5 | 1.5 |
| Domain knowledge | 3.0 | 3.0 | 4.5 | 3.5 | 4.5 | 3.5 | 3.0 | 2.5 | 3.5 | 2.0 |
| Critical reasoning | 3.5 | 3.5 | 4.0 | 4.0 | 5.0 | 3.5 | 3.0 | 3.0 | 4.0 | 1.5 |
| Epistemic discipline | 4.0 | 2.5 | 5.0 | 3.5 | 4.5 | 3.5 | 3.0 | 2.5 | 3.0 | 2.0 |
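As a worked check of how the composite relates to this matrix, the sketch below applies the ranking weights to the Model 5 column. It assumes the composite is a plain weighted arithmetic mean of the seven criterion scores; the page does not spell out the formula, but under that assumption the result matches the 4.60 weighted and 4.64 unweighted values listed for Kimi K2.6 (canonical: Model 5) in the ranking. The function names are illustrative.

```python
# Weights in percent, as listed in the ranking section.
WEIGHTS = {
    "Mathematical reasoning": 25,
    "Epistemic discipline": 20,
    "Domain knowledge": 20,
    "Critical reasoning": 15,
    "Framework knowledge": 10,
    "Regulatory knowledge": 5,
    "Problem framing": 5,
}

# Model 5 column of the seven-criterion matrix.
MODEL_5 = {
    "Problem framing": 5.0,
    "Framework knowledge": 4.5,
    "Regulatory knowledge": 4.5,
    "Mathematical reasoning": 4.5,
    "Domain knowledge": 4.5,
    "Critical reasoning": 5.0,
    "Epistemic discipline": 4.5,
}


def weighted_composite(scores: dict) -> float:
    """Weighted arithmetic mean of the criterion scores (assumed formula)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


def unweighted_composite(scores: dict) -> float:
    """Simple mean of the seven criterion scores."""
    return sum(scores[name] for name in WEIGHTS) / len(WEIGHTS)


print(f"{weighted_composite(MODEL_5):.2f}")    # 4.60
print(f"{unweighted_composite(MODEL_5):.2f}")  # 4.64
```

Dividing by the sum of the weights rather than a hard-coded 100 keeps the function correct if the weights are later expressed as fractions.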
Get the full report
All 111 questions, the complete rubric with acceptable variations and common errors, per-model verdicts across the canonical cohort, the seven-criterion matrix in full, and the methodology paper text. CC-BY 4.0. Delivered as PDF within 60 seconds.