Last refresh: 2026-05-17 · 17 models graded · 3 open rubric items · DeepSeek judge coverage: 98%

About the LCA Benchmark

A research site for LLM methodology in environmental life-cycle assessment — what we measure, how we measure it, and which models meet a declaration-review bar.

The question this benchmark answers

Which large language models can be trusted with the kind of judgment an EPD verifier or a senior LCA practitioner applies across product carbon footprints, life-cycle assessments, and environmental product declarations? "Trusted" here means that a model is correct on the central methodology, careful with uncertainty, honest about what it doesn't know, and able to spot the methodological errors a junior analyst would miss.

What's measured

111 questions across three self-contained parts. Part A is 50 questions across five types (MCQ, scenario, standards-conflict, spot-the-error, quantitative). Part B is 50 longer practitioner artefacts — LCI extracts, EPD excerpts, verifier findings to draft, client disputes. Part C is 11 questions on topic gaps surfaced against the UNEP / Donaldson et al. expert benchmark. Everything is graded by two judges — Claude Opus 4.7 inline and DeepSeek V4 Pro via the repo's runner/grade.py — so inter-judge agreement can be reported alongside the headline numbers.
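As one way to make "inter-judge agreement" concrete, the sketch below computes the exact-agreement rate and unweighted Cohen's kappa for two judges scoring the same items on a 1–5 scale. This page does not say which agreement statistic the benchmark reports, so the choice of kappa, the function names, and the sample scores are all assumptions for illustration; this is not the logic in runner/grade.py.

```python
from collections import Counter


def exact_agreement(a, b):
    """Fraction of items on which both judges gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(a)
    p_o = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same label at random,
    # given each judge's own label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores for eight items (assumption, not benchmark data).
judge_a = [5, 4, 4, 2, 3, 5, 1, 4]
judge_b = [5, 4, 3, 2, 3, 4, 1, 4]

agreement = exact_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
```

On these made-up scores the judges agree on 6 of 8 items (0.75), and kappa comes out at 33/49, roughly 0.67. Unweighted kappa treats a 4-vs-5 disagreement the same as a 1-vs-5 one; for ordinal 1–5 scores a weighted variant may be a better fit.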

Who publishes this

The LCA Benchmark is Verdatir Research's evaluation surface for LLM methodology in environmental life-cycle assessment. Every model is scored against the same rubric, by the same two judges, and we publish the methodology, the prompts, the answer keys, the raw model responses, and the grading reports in full — including for any model Verdatir itself uses internally.

Our intent is for the site to remain a living reference: when new frontier models are released, we re-run the benchmark and republish; when a rubric item resolves (Q7, Q47, Q48 are currently open), we update the leaderboard and surface the diff in a changelog.

Related work

The closest public benchmark is Donaldson, Balaji, Oriekezie, Kumar & Patouillard, "Expert benchmark of LLMs for LCA tasks" (UNEP, 2025), with toolkit at tur-ium/unep-life-cycle-assessment-ml-paper-evaluation. Methodology differs on three axes:

| Axis | Donaldson et al. (UNEP, 2025) | This benchmark |
| --- | --- | --- |
| Prompts | 24 LCA prompts, broad methodology, single-shot | 100 questions in Parts A+B (declaration-review focus) + 11 in Part C (gap-filling) |
| Grader | Domain experts via Zooniverse, Krippendorff α | Opus 4.7 inline + DeepSeek V4 Pro; inter-judge agreement reported |
| Output | Median accuracy + median explanation scores with SEM | Seven anchored 1–5 criteria with a declaration-review weighted composite |
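To illustrate what a weighted composite over seven anchored 1–5 criteria can look like, here is a minimal sketch. The criterion names (c1 to c7) and the weights are placeholders invented for this example; the benchmark's actual criteria and declaration-review weights are defined in its published rubric and are not reproduced here.

```python
def weighted_composite(scores, weights):
    """Weighted mean of 1-5 criterion scores; the result stays on the 1-5 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside the 1-5 range")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[k] * weights[k] for k in scores) / total_weight


# Placeholder criteria and weights (assumptions, not the real rubric).
weights = {"c1": 3, "c2": 2, "c3": 2, "c4": 1, "c5": 1, "c6": 1, "c7": 1}
scores = {"c1": 4, "c2": 5, "c3": 3, "c4": 4, "c5": 2, "c6": 5, "c7": 3}

composite = weighted_composite(scores, weights)
```

With these placeholder values the composite is 42/11, about 3.82. Dividing by the weight total keeps the composite on the same 1–5 scale as the individual criteria, so it can be read against the same anchors.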

Part C was specifically written to cover topic gaps surfaced against the UNEP prompts — attributional vs consequential LCA, study-vs-study conflict diagnosis, social-LCA scope misuse, unit-trap numeracy, multi-criteria comparative trade-offs, and a regenerative-agriculture data-collection deliverable. The vendored upstream prompt set (with commit SHA and SHA-256) is at prompts/unep/ in the repo.
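Pinning a vendored file by SHA-256 lets a reader confirm that a local copy matches what was graded. The sketch below shows one way to run that check with Python's standard library. The helper names are hypothetical, and the actual file names and the way the digest is recorded under prompts/unep/ are not described on this page, so the path and expected digest must be supplied by the reader.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_digest(path, expected_hex):
    """True if the file's digest equals the recorded one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A caller would pass the path of a vendored file and the digest recorded for it; a False result means the local copy differs from the pinned version.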

Citation

Data and methodology text on lcabench.verdatir.com are licensed under CC-BY 4.0.

LCA Benchmark — lcabench.verdatir.com (2026). CC-BY 4.0.
Repository: https://github.com/Ketan-Verdatir/LCA-bench

When citing, please note (a) the rubric version used (before or after resolution of Q7, Q47, and Q48), and (b) whether your scope is Parts A+B only or includes Part C.

Contact

For methodology questions, dataset access, or a model-submission request, open an issue on the GitHub repository or reply to any LCA Benchmark email.

Get the full report

All 111 questions, the complete rubric, per-model verdicts, and the methodology paper. Delivered as PDF within 60 seconds. CC-BY 4.0.
