Commentary
Can LLMs assess risk of bias in medical research?
April 13, 2026
We just completed a study asking whether large language models (LLMs) can reproduce expert risk-of-bias judgments in medical research. The preprint is in preparation. Here is a preview of what we found.
Background
Risk-of-bias (RoB) assessment is central to systematic reviewing. For each included study, trained reviewers examine flaws in design, conduct, and reporting that could systematically distort the results. The work is important but time-intensive, and LLMs have been proposed as a way to reduce that burden.
We wanted to know how well two LLMs could reproduce expert RoB labels from our earlier systematic review of COVID-19 contact tracing studies, and whether performance improved as we added more guidance to the prompt.
What we did
We tested a weaker and a stronger Gemini model against expert consensus labels for 14 observational studies. Each model assessed all 14 studies under four cumulative prompt conditions: a bare task instruction, then criteria definitions, then training material, then a worked example. The study was pre-registered on the Open Science Framework before data collection (osf.io/8grbe).
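To make the cumulative design concrete, here is a minimal sketch of how four such conditions might be assembled. The block contents, condition names, and the build_prompt helper are illustrative placeholders, not the actual prompt text or pipeline code used in the study.

```python
# Illustrative sketch only: the guidance text and helper below are
# placeholders, not the study's actual prompts or code.

TASK_INSTRUCTION = "Assess the risk of bias of the study below, criterion by criterion."
CRITERIA_DEFINITIONS = "Definitions of each risk-of-bias criterion in the rubric."
TRAINING_MATERIAL = "Guidance notes of the kind used to train human reviewers."
WORKED_EXAMPLE = "A fully worked assessment of a comparable study."

# Each condition adds one block of guidance on top of the previous one.
CONDITIONS = {
    "bare": [TASK_INSTRUCTION],
    "plus_criteria": [TASK_INSTRUCTION, CRITERIA_DEFINITIONS],
    "plus_training": [TASK_INSTRUCTION, CRITERIA_DEFINITIONS, TRAINING_MATERIAL],
    "plus_example": [TASK_INSTRUCTION, CRITERIA_DEFINITIONS,
                     TRAINING_MATERIAL, WORKED_EXAMPLE],
}

def build_prompt(condition: str, study_text: str) -> str:
    """Concatenate a condition's guidance blocks with the study text."""
    return "\n\n".join(CONDITIONS[condition] + [study_text])
```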
A preview of the results
The headline numbers look encouraging. For both models, overall agreement with expert labels improved substantially as we added guidance: from matching well under half of the expert labels in the no-guidance baseline to much higher agreement under the fullest prompt condition. The stronger model showed higher overall agreement under every condition.
The criterion-level results are more interesting. Overall agreement measures only whether the model matched the final label. Criterion-level agreement measures whether it reproduced the individual judgments that produce the label — a more demanding test. At this level, the picture shifts. The apparent ranking of the two models changed depending on which prompt condition we used. One model improved with more guidance; the other did not. And both models remained below a simple baseline that does not read the study text at all.
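To illustrate the difference between the two measures, here is a minimal sketch of how each can be computed, assuming each assessment is stored as a dict of criterion-level judgments alongside an overall label. This is a generic illustration, not the study's analysis code.

```python
def overall_agreement(model_overall, expert_overall):
    """Share of studies where the model's overall label matches the expert label."""
    matches = sum(m == e for m, e in zip(model_overall, expert_overall))
    return matches / len(expert_overall)

def criterion_agreement(model_judgments, expert_judgments, criteria):
    """Share of individual criterion judgments, pooled across studies, that match."""
    total, matched = 0, 0
    for m, e in zip(model_judgments, expert_judgments):  # one dict per study
        for c in criteria:
            total += 1
            matched += int(m[c] == e[c])
    return matched / total
```

Two models can arrive at the same final label through different criterion-level mistakes, which is why the two measures can rank them differently.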
One instruction we changed mid-study shifted performance enough to warrant a separate sensitivity analysis. Prompt details, it turns out, behave like part of the intervention — they shape measured performance, and can change which model appears to do better. The preprint has all the numbers and a full discussion of what this means.
What it means
These findings do not support replacing human reviewers with LLMs for RoB assessment. They do suggest that prompt design and rubric specification deserve more attention than they typically receive in this literature. Researchers building LLM-assisted review pipelines should report prompt conditions explicitly and evaluate performance at the criterion level, not just on the overall label.
Limitations
The test set is small: 14 single-arm observational studies assessed with one rubric, using two models from one provider. Confidence intervals are wide throughout, and the findings should be generalized with caution.
The full study is coming soon as a preprint
Complete results — criterion-level tables, condition contrasts, per-criterion error patterns, and a full discussion — are in the preprint, currently in preparation. The study is pre-registered on the Open Science Framework at osf.io/8grbe.
AI disclosure
Claude Sonnet 4.6 was used to assist in drafting and editing this article. All final decisions were made by the author.