
Brief Review

LLM evals in epidemiology, biostatistics, and health research: a brief review

March 9, 2026

Background

Large language models (LLMs) are now evaluated across many health-related tasks, but the literature is uneven. Bedi et al. (2025) systematically reviewed 519 studies and found that 44.5% focused on medical knowledge or licensing-style questions, 84.2% focused on question answering, 95.4% used accuracy as the main metric, and only 5.0% used real patient-care data.

That matters for epidemiology, biostatistics, and health research. In these fields, practical value depends less on exam-like recall and more on valid reasoning, transparent methods, reproducibility, and reliable error detection.

Objective

To briefly review what has actually been evaluated, where the published evidence is strongest, what recent preprints add, and which limits still prevent strong claims about real-world readiness.

Methods

Narrative review.

Findings

| Study | Eval (domain) | Methods | Key results |
| --- | --- | --- | --- |
| Jin et al. (2019, preprint) | PubMedQA (biomedical research QA) | Dataset from PubMed abstracts with yes/no/maybe questions; 1k expert-annotated, 61.2k unlabeled, and 211.3k generated instances. | Best reported model reached 68.1% accuracy versus 78.0% for a single human; useful for standardizing comparison, but still an abstract-based QA task. |
| Arora et al. (2025, preprint) | HealthBench (general health interactions) | Benchmark of 5,000 multi-turn conversations scored with physician-authored rubrics spanning 48,562 criteria across open-ended health scenarios. | Moves beyond multiple-choice evaluation; top reported score on the main benchmark was 60%, leaving substantial room for improvement. |
| Harris et al. (2025, preprint) | PubHealthBench (public-health information) | More than 8,000 questions from 687 current UK government guidance documents; evaluates both multiple-choice and free-form responses across 24 LLMs. | Latest proprietary models exceeded 90% on multiple-choice questions, but no model exceeded 75% on free-form answers. |
| Lu et al. (2025, preprint) | StatEval (statistics) | Benchmark with 13,817 foundational problems and 2,374 research-level proof tasks; built with a scalable pipeline and human validation. | Even strong closed models stayed below 57% on research-level problems, suggesting that statistical reasoning remains difficult. |
| Bonzi et al. (2025, preprint) | CareMedEval (biomedical critical appraisal) | Dataset of 534 questions based on 37 scientific articles, derived from authentic medical-student exams and tested under varying context conditions. | Models did not exceed an exact match rate of 0.5; questions about study limitations and statistical analysis were especially difficult. |
| He et al. (2025, preprint) | CONSORT adherence evaluation (trial reporting quality) | Zero-shot evaluation on a gold-standard set of 150 published randomized trials; three-class classification of CONSORT adherence. | Best macro F1 was 0.634 (see the sketch below the table), with only fair agreement with experts; models detected compliant items more reliably than omissions or non-applicable items. |
| Li et al. (2025, preprint) | Meta-analysis data extraction benchmark (evidence synthesis) | Compared 3 LLMs across statistical results, risk-of-bias assessments, and study characteristics in 3 medical domains using 4 prompting strategies. | Models showed high precision but poor recall; customized prompts improved recall by up to 15%, supporting selective supervised automation rather than full extraction. |
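For readers less familiar with the metric reported for the CONSORT adherence task, macro F1 averages per-class F1 scores so that each adherence class counts equally regardless of how common it is. The sketch below computes it by hand for a three-class task; the class names and the gold/predicted labels are invented for illustration and are not drawn from the benchmark.

```python
# Minimal sketch: macro F1 for a three-class reporting-adherence task.
# Class names and the gold/predicted labels are illustrative only.
CLASSES = ["compliant", "omitted", "not_applicable"]

def macro_f1(gold, pred):
    """Average per-class F1 so each class counts equally, regardless of size."""
    f1_scores = []
    for c in CLASSES:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

gold = ["compliant", "omitted", "compliant", "not_applicable", "omitted"]
pred = ["compliant", "compliant", "compliant", "not_applicable", "omitted"]
print(f"macro F1 = {macro_f1(gold, pred):.3f}")  # 0.822 on this toy example
```

Because rare classes (such as omitted items) weigh as much as common ones, a model that mostly predicts "compliant" can post a respectable accuracy yet a modest macro F1, which is consistent with the pattern the study describes.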

What has actually been evaluated?

Most published evaluations still measure constrained tasks: multiple-choice questions, short-form question answering, or rubric-based judgments rather than end-to-end research workflows (Bedi et al., 2025; Lee et al., 2024).

Foundational biomedical benchmarks such as PubMedQA helped standardize comparison, but they mainly test answer selection from article abstracts rather than full evidence synthesis, causal inference, or statistical workflow execution (Jin et al., 2019, preprint).

Where the published evidence is strongest

The published evidence shows that current evaluation practice remains narrow. Bedi et al. (2025) documented a benchmark-heavy literature with limited real-world data, and Lee et al. (2024) found no clear evaluation framework and frequent reliance on small query counts, single measurements, and limited secondary analyses.

Tam et al. (2024) reviewed 142 studies and proposed the QUEST framework for human evaluation, emphasizing quality of information, understanding and reasoning, expression style and persona, safety and harm, and trust and confidence.

For biostatistical practice specifically, the survey by Grambow et al. (2025) is informative because it is closer to real work. Among 69 respondents from three biostatistics units at two academic medical centers, 44/69 (63.8%) reported using LLMs, but 29/41 (70.7%) reported significant errors, including incorrect code, statistical misinterpretation, and hallucinated functions.

What the preprints suggest

Recent preprints suggest a useful shift toward more domain-specific evaluation, but they should be interpreted cautiously. Arora et al. (2025, preprint) introduced HealthBench, a broad health benchmark designed to measure performance and safety in more open-ended health interactions. Harris et al. (2025, preprint) introduced PubHealthBench for UK government public-health information and highlighted a gap between multiple-choice and free-form answering, which is relevant to epidemiology and public-health communication.
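To make the rubric-based approach concrete, the sketch below scores a single response by summing the points for criteria a grader judges met and normalizing by the points available. The criteria, point values, and grader judgments are hypothetical and only loosely mimic the physician-authored rubrics described for HealthBench.

```python
# Minimal sketch of rubric-based scoring: each criterion carries a point
# value, a grader (human or model) marks it met or not, and the response
# score is the share of available points earned. All criteria are made up.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int  # positive for desirable behavior; could be negative for harms

def score_response(criteria, met_flags):
    """Normalize earned points by the maximum achievable points."""
    earned = sum(c.points for c, met in zip(criteria, met_flags) if met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, earned / max_points) if max_points else 0.0

rubric = [
    Criterion("Asks about symptom duration before advising", 5),
    Criterion("Recommends urgent care for red-flag symptoms", 10),
    Criterion("Avoids stating an unsupported definitive diagnosis", 5),
]
met = [True, True, False]  # grader judgments for one model response
print(f"rubric score = {score_response(rubric, met):.2f}")  # 0.75
```

The design choice worth noting is that partial credit is possible and graded against open-ended behavior, which is why rubric scores tend to sit well below the near-ceiling accuracies seen on multiple-choice benchmarks.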

Other preprints extend evaluation into specialized research tasks. Lu et al. (2025, preprint) introduced StatEval for statistics questions. Bonzi et al. (2025, preprint) introduced CareMedEval for biomedical critical appraisal and reasoning grounded in scientific papers. He et al. (2025, preprint) studied whether LLMs can identify CONSORT adherence in randomized trials. Li et al. (2025, preprint) benchmarked meta-analysis data extraction and reported that automation remains challenging, with performance that may be acceptable for some low-risk subtasks but not for unsupervised extraction of all key study data.
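The precision/recall pattern behind the extraction findings can be illustrated with a toy comparison of extracted (field, value) pairs against a gold standard; all field names and values below are invented and are not taken from the benchmark.

```python
# Toy sketch: precision and recall for structured data extraction, treating
# each (field, value) pair as one item. Field names and values are hypothetical.
gold = {
    ("study_1", "sample_size"): "250",
    ("study_1", "effect_estimate"): "0.82",
    ("study_1", "ci_lower"): "0.70",
    ("study_1", "ci_upper"): "0.96",
}
extracted = {
    ("study_1", "sample_size"): "250",
    ("study_1", "effect_estimate"): "0.82",
    ("study_1", "ci_lower"): "0.07",   # transcription error
}

correct = sum(1 for k, v in extracted.items() if gold.get(k) == v)
precision = correct / len(extracted)   # of what was extracted, how much is right
recall = correct / len(gold)           # of what should be extracted, how much was found
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.67, 0.50
```

A model that extracts only the fields it is confident about will look strong on precision while missing items entirely, which is the high-precision, low-recall profile that motivates supervised rather than fully automated extraction.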

Main limits of current evals

Four limits recur across the literature. First, task selection is narrow: question answering and exam-style prompts dominate, while protocol design, causal reasoning, bias assessment, statistical troubleshooting, and full evidence synthesis are less often tested (Bedi et al., 2025; Lee et al., 2024).

Second, evaluation metrics are narrow: accuracy is common, but calibration, uncertainty, fairness, safety, and downstream decision quality are much less often measured (Bedi et al., 2025). Third, study designs are often weak for deployment claims: many studies use small numbers of questions, few repeated runs, and limited external validation (Lee et al., 2024).
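As one example of a metric beyond accuracy, calibration can be summarized with expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy within confidence bins. The confidences, correctness labels, and bin edges below are invented purely to illustrate the computation.

```python
# Illustrative sketch: accuracy versus expected calibration error (ECE)
# for a set of answers with self-reported confidences. All numbers are made up.
import numpy as np

confidences = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55])
correct = np.array([1, 1, 0, 1, 0, 1])  # 1 = answer judged correct

accuracy = correct.mean()

# ECE: bin predictions by confidence, compare each bin's mean confidence
# with its empirical accuracy, and weight the gap by the bin's share of items.
bin_edges = np.array([0.5, 0.7, 0.9, 1.0])
ece = 0.0
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    mask = (confidences > lo) & (confidences <= hi)
    if mask.any():
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap

print(f"accuracy = {accuracy:.2f}, ECE = {ece:.2f}")
```

A model can be moderately accurate yet well calibrated (low ECE), or highly accurate yet overconfident on the items it gets wrong; accuracy alone cannot distinguish the two, which matters when downstream decisions depend on knowing when to defer to a human.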

Fourth, benchmark success may overstate readiness. Constrained formats can look strong even when open-ended answers, statistical interpretation, or workflow integration remain brittle (Harris et al., 2025, preprint; Grambow et al., 2025).

Interpretation

The most defensible near-term use case is assisted work with human verification, not autonomous epidemiologic or biostatistical reasoning. The evaluation question should shift from “can the model answer this benchmark item?” to “does the workflow remain valid when the model is used here, by this team, under these constraints?”

Limitations of this brief review

This is a brief, narrative review rather than a systematic review. It is meant to map the current evaluation landscape, not to estimate pooled performance. The literature is also moving quickly, and several domain-specific sources are still preprints.

Conclusion

The published evaluation literature is still dominated by narrow benchmarks, while the most relevant domain-specific evaluations are only beginning to emerge. For now, the evidence supports cautious, supervised use for bounded tasks, especially drafting, coding assistance, and information retrieval, but it does not support strong claims of robust, unsupervised reliability in core research workflows.