Commentary
From benchmarks to science: the case for eval reporting standards
March 12, 2026
CONSORT, PRISMA, TRIPOD: medicine's reporting standards took decades to build and are now basic requirements for publication. To our knowledge, nothing comparable exists for large language model (LLM) evaluations. That gap seems worth examining, since those evaluations now inform decisions about which models get deployed and which get flagged as unsafe.
What LLM evaluations are and why they matter
LLM evaluations ("evals") are structured tests that measure model capabilities and behaviors. They compare models, track progress over time, and help decide whether a model is ready to deploy. In health research specifically, evals have been proposed for assessing LLM performance on clinical reasoning, literature synthesis, and risk-of-bias assessment (see our earlier review of LLM evals in health research).
Burns et al. (2023) framed a central challenge in AI alignment as empirical: how can humans reliably supervise AI systems that may eventually outperform them? Evals are the proposed mechanism. The quality of those evals determines, at least in part, how much that oversight is worth.
Statistical rigor in current eval practice: an apparent gap
Miller (2024), writing from Anthropic, argues that benchmark results are routinely reported without standard errors, confidence intervals, or power analyses. Clustered standard errors, he finds, can be more than three times larger than naive estimates, meaning published confidence intervals may be far too narrow. The paper offers practical formulas for correcting this. Whether those corrections are being applied in practice is, to our knowledge, undocumented.
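Miller's paper supplies the full estimators; to illustrate the mechanism rather than reproduce his method, the sketch below (our own, using simulated data) shows how a naive standard error and a cluster-robust one diverge once questions drawn from the same source passage stop counting as independent evidence. It omits the finite-cluster corrections a careful analysis would apply.

```python
import numpy as np

def naive_se(scores):
    """Standard error of mean accuracy, treating every question as independent."""
    scores = np.asarray(scores, dtype=float)
    return scores.std(ddof=1) / np.sqrt(len(scores))

def clustered_se(scores, clusters):
    """Cluster-robust standard error of mean accuracy.

    Residuals are summed within each cluster before squaring, so several
    questions about the same passage no longer count as independent evidence.
    (Finite-cluster corrections are omitted to keep the sketch short.)
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)
    resid = scores - scores.mean()
    cluster_sums = np.array(
        [resid[clusters == c].sum() for c in np.unique(clusters)]
    )
    return np.sqrt((cluster_sums ** 2).sum()) / len(scores)

# Synthetic benchmark: 50 passages, 20 questions each, with correctness driven
# largely by how hard the passage is rather than by the individual question.
rng = np.random.default_rng(0)
passages = np.repeat(np.arange(50), 20)
hard = rng.random(50) < 0.3
p_correct = np.where(hard[passages], 0.30, 0.95)
scores = rng.binomial(1, p_correct)

print(f"accuracy      {scores.mean():.3f}")
print(f"naive SE      {naive_se(scores):.4f}")
print(f"clustered SE  {clustered_se(scores, passages):.4f}")  # several times larger
```

The gap between the two numbers is the amount by which a naively computed confidence interval overstates the benchmark's precision.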
Apollo Research (2024) raises a related concern. Trivial changes, such as rephrasing a prompt or altering its formatting, can shift benchmark scores substantially. The authors describe current evals as "more art than science" and call for interdisciplinary standards along the lines of what aviation or clinical research have developed. Whether that characterization holds across the whole field is hard to assess from the outside, but the concern seems credible.
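That sensitivity is cheap to measure. The sketch below (ours; `run_benchmark` is a hypothetical stand-in for whatever eval harness is actually in use) simply reruns the same benchmark under several prompt templates and reports how far the headline number moves on formatting alone, the kind of robustness check a reporting standard could ask authors to include.

```python
from statistics import mean, stdev
from typing import Callable, Sequence

def format_sensitivity(
    templates: Sequence[str],
    score_fn: Callable[[str], float],
) -> dict:
    """Score the same benchmark under each prompt template and summarise
    how much the headline number moves with formatting alone."""
    scores = [score_fn(t) for t in templates]
    return {
        "per_template": scores,
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }

# `run_benchmark` is faked here so the sketch runs on its own; in practice it
# would wrap the real eval harness and return benchmark accuracy.
def run_benchmark(template: str) -> float:
    return 0.70 + 0.05 * template.count("\n")  # pretend formatting shifts accuracy

templates = [
    "Question: {q}\nAnswer:",
    "Q: {q} A:",
    "{q}\n\nRespond with the answer only.",
]
print(format_sensitivity(templates, run_benchmark))
```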
Reporting standards in medicine and economics
The medical model is worth looking at, even if the analogy is imperfect. CONSORT and PRISMA emerged through years of community effort, driven by evidence that unreported methods led to results that couldn't be reproduced or trusted. They are now standard requirements at most high-impact journals.
Kolbinger et al. (2024) identified 26 AI-specific reporting guidelines in medicine published between 2009 and 2023. They vary widely in quality and scope, and the authors conclude that no universal standard exists even within medical AI, let alone across AI research. Still, 26 guidelines represent sustained community work. As far as we are aware, nothing comparable exists for AI safety evaluations.
Two recent efforts attempt to extend this to LLM research. Gallifant et al. (2025) published TRIPOD-LLM in Nature Medicine, a 19-item checklist for LLM studies in biomedical research developed by expert consensus. An ISPOR working group separately introduced ELEVATE-GenAI (Fleurence et al. 2025), a 10-domain framework for reporting LLM use in health economics. Both are domain-specific. To our knowledge, no equivalent framework exists for general LLM capability or safety evaluations.
Implications for oversight and safety
Evaluations are how humans maintain oversight of increasingly capable AI systems. The quality of those evaluations is the substance of the oversight, not a technical detail. An underpowered eval that returns a confident-looking number can give false reassurance. A benchmark sensitive to prompt formatting may be measuring prompt craft as much as underlying capability.
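The "underpowered" worry is easy to make concrete. Using the standard two-proportion power calculation (a simplification that ignores the clustering and paired-question designs Miller discusses), a sketch like the following gives a rough sense of how many questions are needed before a three-point accuracy gap between two models is distinguishable from noise:

```python
from math import ceil
from statistics import NormalDist

def questions_needed(p1: float, p2: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate questions per model to detect accuracy p1 vs p2 with a
    two-sided z-test on independent samples (no clustering correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# A 3-point gap (80% vs 83%) needs on the order of 2,600 questions per model;
# an eval with only a few hundred items cannot reliably see a difference this size.
print(questions_needed(0.80, 0.83))
```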
This concern seems particularly relevant in health research. LLMs are being proposed for tasks with direct clinical consequences: diagnostic reasoning, evidence synthesis, risk stratification (see LLM evals in health research). Whether those proposals rest on evaluations rigorous enough to justify clinical use is not always clear from published reports.
Signs of progress and remaining questions
The field appears to recognize the problem. Miller (2024) and Apollo Research (2024) both treat eval methodology as a solvable statistical and engineering challenge. TRIPOD-LLM and ELEVATE-GenAI suggest that domain-specific reporting standards can be built when a community decides to build them.
Who will do that for AI safety and capability evaluations outside medicine and economics is, as far as we are aware, still an open question. So is whether the pace of model development will leave time for that consensus to form. The history of reporting standards in clinical research suggests this kind of work starts later than it should, and takes longer than anyone expects.