About the LCA Benchmark
A research site for LLM methodology in environmental life-cycle assessment — what we measure, how we measure it, and which models meet a declaration-review bar.
The question this benchmark answers
Which Large Language Models can be trusted with the same kind of judgment an EPD verifier or a senior LCA practitioner applies — across product carbon footprints, life-cycle assessments, and environmental product declarations? "Trusted" here means: correct on the central methodology, careful with uncertainty, honest about what it doesn't know, and able to spot the methodological errors a junior analyst would miss.
What's measured
111 questions across three self-contained parts. Part A is 50 questions across five types (MCQ, scenario, standards-conflict, spot-the-error, quantitative). Part B is 50 longer practitioner artefacts: LCI extracts, EPD excerpts, verifier findings to draft, and client disputes. Part C is 11 questions on topic gaps surfaced against the UNEP / Donaldson et al. expert benchmark. Everything is graded by two judges, Claude Opus 4.7 inline and DeepSeek V4 Pro via the repo's runner/grade.py, so inter-judge agreement can be reported alongside the headline numbers.
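
Because both judges score the same responses, agreement can be computed directly from paired scores. The sketch below is illustrative only: this page does not say which agreement statistic the benchmark reports, so raw agreement and unweighted Cohen's kappa are assumptions here, as are the function names and the sample scores.

```python
from collections import Counter


def exact_agreement(a, b):
    """Share of items on which both judges gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(a)
    p_observed = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same score independently.
    p_expected = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both judges used a single identical score throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical 1-5 scores from two judges on six responses (not benchmark data).
judge_a = [5, 4, 4, 2, 3, 5]
judge_b = [5, 4, 3, 2, 3, 4]

print(round(exact_agreement(judge_a, judge_b), 3))  # 0.667
print(round(cohen_kappa(judge_a, judge_b), 3))      # 0.556
```

For ordinal 1-5 scores, a weighted kappa or Krippendorff's α would penalise a one-point disagreement less than a four-point one; the unweighted form is used here only to keep the example short.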
Who publishes this
The LCA Benchmark is Verdatir Research's evaluation surface for LLM methodology in environmental life-cycle assessment. Every model is scored against the same rubric, by the same two judges, and we publish the methodology, the prompts, the answer keys, the raw model responses, and the grading reports in full — including for any model Verdatir itself uses internally.
Our intent is for the site to remain a living reference: when new frontier models are released, we re-run the benchmark and republish; when a rubric item resolves (Q7, Q47, Q48 are currently open), we update the leaderboard and surface the diff in a changelog.
Related work
The closest public benchmark is Donaldson, Balaji, Oriekezie, Kumar & Patouillard, "Expert benchmark of LLMs for LCA tasks" (UNEP, 2025), with its toolkit at tur-ium/unep-life-cycle-assessment-ml-paper-evaluation. The methodology differs on three axes:
| Axis | Donaldson et al. (UNEP, 2025) | This benchmark |
|---|---|---|
| Prompts | 24 LCA prompts, broad methodology, single-shot | 100 questions in Parts A+B (declaration-review focus) plus 11 in Part C (gap-filling) |
| Grader | Domain experts via Zooniverse, Krippendorff's α | Claude Opus 4.7 inline + DeepSeek V4 Pro; inter-judge agreement reported |
| Output | Median accuracy + median explanation scores with SEM | Seven anchored 1–5 criteria with a declaration-review weighted composite |
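
A weighted composite over seven 1–5 criteria can be sketched as a weighted mean. The criterion names (`c1` to `c7`), the weights, and the scores below are all hypothetical placeholders; the actual criteria and declaration-review weights are defined in the published rubric, not here.

```python
def weighted_composite(scores, weights):
    """Weighted mean of anchored 1-5 criterion scores; the result stays on the 1-5 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score {score} is outside the 1-5 range")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in scores) / total_weight


# Hypothetical weights: two criteria count more than the other five.
weights = {"c1": 3, "c2": 2, "c3": 1, "c4": 1, "c5": 1, "c6": 1, "c7": 1}
# Hypothetical scores for one model response.
scores = {"c1": 4, "c2": 5, "c3": 3, "c4": 4, "c5": 2, "c6": 5, "c7": 3}

composite = weighted_composite(scores, weights)
print(composite)  # 3.9
```

Dividing by the total weight keeps the composite on the same 1–5 scale as the individual criteria, so it can be read against the same anchors.
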
Part C was specifically written to cover topic gaps surfaced against the UNEP prompts: attributional vs consequential LCA, study-vs-study conflict diagnosis, social-LCA scope misuse, unit-trap numeracy, multi-criteria comparative trade-offs, and a regenerative-agriculture data-collection deliverable. The vendored upstream prompt set (with commit SHA and SHA-256) is at prompts/unep/ in the repo.
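
A recorded SHA-256 lets a reader confirm that a vendored file matches the upstream copy. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and how the repo actually stores its recorded digest is not specified on this page.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(path, recorded_hex):
    """Compare a file's digest with a recorded value, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

To use it, pass the path of a file under prompts/unep/ together with the digest recorded alongside it; a `False` result means the local copy differs from the version that was vendored.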
Citation
Data and methodology text on lcabench.verdatir.com are licensed under CC-BY 4.0.
LCA Benchmark — lcabench.verdatir.com (2026). CC-BY 4.0.
Repository: https://github.com/Ketan-Verdatir/LCA-bench
When citing, please note (a) the rubric version used (before or after resolution of Q7, Q47, and Q48), and (b) whether your scope is Parts A+B only or includes Part C.
Contact
Methodology questions, dataset access, or a model-submission request — open an issue on the GitHub repository or reply to any LCA Benchmark email.