Practical Guide
How to design, run, and report an AI evaluation step-by-step
March 15, 2026
Engineers use AI evaluations ("evals") to inform model selection, deployment decisions, and safety claims.
While some guidelines have been published, no single universal standard yet governs AI evals, and teams approach them in inconsistent ways, which can leave results hard to interpret or compare.
For example, Miller (2024) reviewed published LLM evaluations and found three recurring problems: point estimates reported without confidence intervals, no pre-specified pass/fail threshold, and standard errors that ignore item clustering and therefore understate uncertainty.
This guide aims to offer a practical synthesis of current guidance to improve the practice of evals in the AI community.
Methods
We reviewed five source classes covering AI evaluation design, execution, and reporting. Sources were identified through targeted searches of official guidance, standards documents, framework documentation, and methodological papers conducted through March 2026, supplemented by consultation with ChatGPT Pro.
The first is venue requirements: the ACL Rolling Review Responsible NLP Research Checklist and the NeurIPS Paper Checklist, both of which define minimum reporting standards for published evaluations. The second is standards-body and statistical guidance, primarily National Institute of Standards and Technology (NIST) AI 800-3 (Keller et al., 2026), which introduces estimand language and explicit uncertainty to AI evaluation, and the NIST statistical handbook's guidance on confidence intervals for proportions.
The third is operational documentation from widely used evaluation frameworks: OpenAI's evaluation best practices, LangSmith's evaluation concepts, and the UK AI Safety Institute's autonomous systems evaluation standard. The fourth is domain-specific reporting and human evaluation frameworks, including MI-CLEAR-LLM (Park et al., 2024) for healthcare LLM evaluations and QUEST (Tam et al., 2024) for human evaluation quality. The fifth is selected methodological papers on judge validation, uncertainty estimation, and evaluation reliability, principally Miller (2024), Jung et al. (2025), and Haldar and Hockenmaier (2025).
The guide covers custom AI evaluations designed to support research or deployment claims. It does not cover all benchmarking traditions or capability evaluation practice. Where a recommendation comes from a formal checklist or standards body, it is noted as such. Where it reflects operational guidance or pragmatic synthesis, we say so.
Table 1. Source classes used in this guide.
| Step group | Representative sources |
|---|---|
| Decision and estimand (1, 2, 4, 5) | OpenAI (n.d.), NIST, Miller (2024) |
| Statistical design (5, 12) | NIST, Miller (2024) |
| Reference standard and judge (6, 9) | QUEST / Tam et al. (2024), MI-CLEAR-LLM / Park et al. (2024), Jung et al. (2025) |
| Scoring and judge (9) | OpenAI (n.d.), LangSmith (n.d.), Jung et al. (2025) |
| Reporting (14) | MI-CLEAR-LLM / Park et al. (2024), ACL Rolling Review, NeurIPS |
| Monitoring (15) | OpenAI (n.d.), LangSmith (n.d.), UK AISI |
Results
The following 15 steps cover the full evaluation lifecycle, from stating the decision question through post-deployment monitoring.
Table 2. The 15-step evaluation lifecycle.
| # | Step | Description | Example |
|---|---|---|---|
| 1 | The decision first | State the decision question before anything else. "How good is this model?" is not a decision question. | "Should we use this model to screen study reports for risk of bias in our systematic review pipeline?" |
| 2 | Replacement or assistant? | Decide early whether the model replaces or supports expert judgment. This changes the threshold, error weighting, and monitoring plan. | A replacement needs near-expert accuracy. An assistant can succeed at lower accuracy if it reliably saves reviewer time. |
| 3 | Freeze the system under test | Record model name, version, date, access mode, full prompt text, session structure, and decoding parameters. | GPT-4o via API, queried 2026-03-10, temperature 0, fresh session per item, no retrieval. |
| 4 | Define the estimand | State what population the test items represent. Decide whether you are validating one system or comparing two; these require different analyses. | Fixed performance on 50 held-out study reports (single-system); performance difference on the same 50 reports (A/B). |
| 5 | Pass/fail rule and sample size | Pre-specify the threshold that constitutes success. Check whether the CI width at your expected n is narrow enough for the decision. | Target ≥85% agreement. On 50 binary items, observed agreement near the target may still produce a CI wide enough to leave the deployment decision uncertain. |
| 6 | Build the reference standard | Have experts label cases before running the model. Adjudicate disagreements with a pre-specified process. | Two reviewers independently label 50 study reports for risk-of-bias domains; a third adjudicates disagreements. |
| 7 | Assemble the case set; keep dev separate | Split into pilot, development, and held-out test sets. Never tune prompts on test items. | 10-item pilot for rubric design, 14-item dev set for prompt refinement, 26-item locked holdout. |
| 8 | Write the rubric and output schema | Define judgment categories and required output fields before running. Structured output (JSON or fixed table) is more reliable than free prose. | JSON with fields: judgment (yes / no / unclear), rationale, supporting_quote, confidence_flag. |
| 9 | Choose the scoring method | Reference-based scoring compares output to expert labels. LLM-as-judge requires validation against held-out expert labels before use. | LLM judge validated on 20 expert-labeled items; items below confidence threshold routed to human review. |
| 10 | Pilot and lock | Run 10–50 curated items to catch rubric ambiguity, schema brittleness, and construct mis-mapping before touching the holdout. Then lock the protocol. | Pilot reveals "unclear" judgment applied inconsistently; rubric revised and re-piloted before holdout is opened. |
| 11 | Run under controlled conditions | Process all items under identical settings. Log item-level outputs, not only aggregated scores. | Fresh session per item, temperature 0, all raw model outputs saved to JSON before any aggregation. |
| 12 | Analyze as estimates, not scores | Report confidence intervals. Use Wilson score intervals for proportions. Use paired item-level analysis for A/B. Apply clustered SEs when items nest. | 82% agreement (95% CI: 69–90%, Wilson) on 50 items, not "82% accuracy." |
| 13 | Error analysis and robustness | Break errors down by subgroup, check for directional bias, test prompt-wording sensitivity. | Model misses confounding bias in 6/8 observational studies but performs well on RCTs. |
| 14 | Report so others can repeat it | Disclose model version, query date, full prompt, session structure, split plan, rubric, scoring method, pass/fail threshold, and CI method. | Prompt text and split plan published in supplementary materials; pre-registered on OSF before holdout was opened. |
| 15 | After deployment, keep evaluating | Log production inputs/outputs, audit samples periodically, and re-run the evaluation after any model or prompt change. | Monthly 5% random audit; full holdout re-run triggered whenever the prompt is updated. |
1. The decision first
Many evaluations become hard to interpret before any data are collected, because nobody wrote down what question the evaluation was supposed to answer. "How good is this model?" is not a decision question.
Common evaluation purposes include: comparing models for research, making a go/no-go deployment decision, demonstrating safety assurance, supporting procurement, monitoring a deployed system, and measuring human uplift (whether access to the model changes what a person can actually do). Each purpose implies different sampling, different metrics, and a different bar for what counts as sufficient certainty.
Stating the decision question also forces you to name what a positive or negative result would mean. If a result above some threshold would change a decision and a result below it would not, that threshold should be on paper before data collection begins. OpenAI (n.d.) recommends setting pass/fail criteria in advance, not after reviewing results.
2. Replacement or assistant?
Decide early whether the model is being evaluated as a replacement for expert judgment or as support for it. That choice affects everything downstream: what counts as success, how errors are weighted, and what monitoring looks like after deployment.
A replacement system needs accuracy close to expert level, because it acts without a human backstop. An assistant system can succeed at lower absolute accuracy if it reliably reduces a reviewer's time or improves consistency. Error direction matters too. In medical or safety settings, false negatives and false positives carry different costs, and that asymmetry belongs in the pass/fail rule before data collection, not in the discussion section afterward.
In a risk-of-bias evaluation, for instance, a system that consistently misses confounding bias may be worse than one that over-calls it, depending on how the output gets used. Writing down the tolerable error profile before running makes that judgment explicit and auditable.
3. Freeze the system under test
An evaluation needs a stable target. LLMs change across versions. Access modes vary across API, browser, web-search-enabled, and retrieval-augmented configurations. Prompt wording can shift outputs materially even when the underlying model has not changed. Park et al. (2024) note that small changes in wording, punctuation, or session structure can produce results different enough to make a study non-reproducible.
Before running, record: the model name, version, and date of querying; whether web access or retrieval augmentation was on; the full text of every prompt; session structure (fresh session per item, or continuous); temperature and any other decoding parameters; and the number of attempts per item, with a pre-specified aggregation rule.
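As a minimal sketch, the frozen configuration can live in a version-controlled JSON record. Every field name and value below is an illustrative placeholder for your own setup, not a recommended setting:

```python
import json

# Illustrative run record for "freeze the system under test"; the field
# names are one possible convention, and all values are placeholders.
run_config = {
    "model": "gpt-4o",
    "query_date": "2026-03-10",
    "access_mode": "api",              # api / browser / web-search / retrieval-augmented
    "web_access": False,
    "prompt_file": "prompts/risk_of_bias_v3.txt",  # full prompt text stored alongside
    "session_structure": "fresh_per_item",
    "temperature": 0.0,
    "decoding": {"top_p": 1.0, "max_tokens": 1024},
    "attempts_per_item": 1,
    "aggregation_rule": "single_attempt",
}

def freeze_config(config: dict) -> str:
    """Serialize the configuration deterministically so it can be
    committed to version control before the first holdout run."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Committing this record before the first holdout run makes any later protocol deviation visible in the diff.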
4. Define the estimand (the quantity you are trying to measure)
Most evaluations report accuracy or agreement on a fixed set of test items (each item is one unit scored: a study report, abstract, question, or passage). That is a useful number. It is not the same as expected performance on the population the evaluation is meant to represent.
Keller et al. (2026) draw an analogy to survey sampling: a fixed test set is one particular convenience sample, and the uncertainty in that estimate does not account for what a different sample of items might show. They describe mixed-model approaches that can estimate generalized performance with uncertainty decomposed across item and model variance.
The practical implication is simpler: say what population the test items are supposed to represent, and say honestly whether they were drawn from that population or assembled by convenience. If the result is meant to generalize, the reported uncertainty should reflect item sampling, not only measurement noise within the items used.
A second design choice is single-system validation versus A/B comparison. Single-system validation asks whether performance clears a threshold. A/B comparison asks which system does better on the same items. For A/B, Miller (2024) recommends analyzing paired item-level differences rather than comparing two separate confidence intervals. The paired approach reduces variance and is the right estimand when both models are scored on the same items.
5. Set the pass/fail rule and consider sample size
Setting a pass/fail rule in advance is probably the most frequently skipped step in AI evaluation. A number alone does not support a claim. A number measured against a pre-stated threshold does.
Sample size is part of the same question. Miller (2024) treats power analysis as a design step, not an afterthought. A Wilson confidence interval on 50 binary items at 80% agreement runs roughly ±11 percentage points at 95% confidence. Whether that width is acceptable depends on the decision: fine for an internal development tool, probably not enough for a published claim about clinical usefulness.
Writing down the required precision before collecting data forces a realistic look at whether the case set is large enough for the claim. If the evaluation compares two models, the sample needed to detect a given paired difference is often smaller than what two independent single-model evaluations would require.
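The width check can be done before any data exist. A minimal sketch of the Wilson score interval, used at design time to ask whether the planned n is large enough for the decision:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Design-time check: at 40/50 (80% agreement), the interval spans
# roughly 67% to 89%, i.e. about +/-11 percentage points.
lo, hi = wilson_interval(40, 50)
```

If that width is too wide for the claim, the case set needs to grow before the protocol is locked, not after.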
6. Build the reference standard
For any evaluation asking whether a model's output matches an authoritative judgment, the reference standard is the foundation. A noisy or poorly defined reference produces agreement numbers that cannot be interpreted.
Build the reference standard before running the model, using a structured adjudication process when raters disagree. The QUEST framework (Tam et al., 2024) organizes human evaluation into planning, adjudication, and scoring review stages, treating reference-standard creation as a design decision rather than something to sort out later.
The documentation should answer three things: who labeled the cases and at what level of expertise; how disagreements were resolved; and what happens when adjudication fails. If unresolvable disagreements are common, the construct is probably underspecified, and the rubric needs revision before the evaluation runs.
Require the model to produce evidence supporting its judgment, not only a label. Labels can agree for the wrong reasons. Evidence review lets you check whether the model was right in the way that actually matters.
7. Assemble the case set and keep development separate from testing
Tuning prompts on the same items used to report final performance is the most common failure mode in AI evaluations. It inflates apparent accuracy and makes results impossible to reproduce. ACL Rolling Review asks explicitly whether the held-out test set was kept separate from any tuning. Park et al. (2024) list test-set independence as a minimum reporting item for healthcare LLM evaluations.
A disciplined split has three parts: a pilot subset for prompt design and rubric clarification; a development subset for wording and schema refinement; and a held-out test set that nobody touches until the protocol is locked. For small case sets of 30–50 items, reserving even 5–10 for the pilot is enough. The number matters less than the rule.
Leakage can also come from training data contamination (the evaluation items may have appeared in the model's pre-training data) or from prompt optimization tools that indirectly overfit to held-out performance. Neither is always avoidable, but both should be reported.
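A seeded split keeps the pilot/dev/holdout assignment reproducible and auditable. A sketch using the 10/14/26 split from the table, with hypothetical item IDs; the seed value is arbitrary:

```python
import random

def split_cases(case_ids: list[str], n_pilot: int, n_dev: int, seed: int = 20260310):
    """Deterministic pilot / dev / holdout split; the fixed seed makes
    the assignment reproducible by anyone with the item list."""
    rng = random.Random(seed)
    shuffled = case_ids[:]
    rng.shuffle(shuffled)
    pilot = shuffled[:n_pilot]
    dev = shuffled[n_pilot:n_pilot + n_dev]
    holdout = shuffled[n_pilot + n_dev:]
    return pilot, dev, holdout

cases = [f"report_{i:03d}" for i in range(50)]  # hypothetical item IDs
pilot, dev, holdout = split_cases(cases, n_pilot=10, n_dev=14)
```

Publishing the seed alongside the split plan lets others verify that no test item was touched during development.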
8. Write the rubric and output schema before running
LLMs are more reliable on classification against explicit criteria than on open-ended grading (OpenAI, n.d.). Writing the rubric and output schema before running forces precise construct definition and makes scoring checkable.
The output schema should be structured: a fixed-format response, such as JSON or a named-field table, rather than free prose. At minimum, fields should include the judgment (from a closed set), a brief rationale, and a supporting evidence quote. If an LLM judge will be used downstream, add a confidence flag.
The evidence quote requirement is also a validity check. If the model cannot produce a quote that supports its judgment, that judgment deserves lower confidence regardless of whether it matches the reference label.
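A schema validator can run on every response before scoring. This sketch assumes the four-field JSON schema described above; the checks shown are illustrative, not exhaustive:

```python
import json

ALLOWED_JUDGMENTS = {"yes", "no", "unclear"}
REQUIRED_FIELDS = {"judgment", "rationale", "supporting_quote", "confidence_flag"}

def validate_output(raw: str) -> list[str]:
    """Return a list of schema violations for one model response (empty = valid)."""
    errors = []
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["top-level JSON value is not an object"]
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if record.get("judgment") not in ALLOWED_JUDGMENTS:
        errors.append("judgment outside closed set")
    if not str(record.get("supporting_quote", "")).strip():
        errors.append("empty supporting quote")
    return errors
```

Running this at pilot time surfaces schema brittleness before the holdout is opened.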
9. Choose the scoring method
Whether a reference standard exists determines the scoring approach. Reference-based evaluation compares output to known ground truth. It is the right choice when the goal is measuring accuracy against an authoritative judgment. Reference-free evaluation judges properties such as format compliance, response style, or safety without a gold label (LangSmith, n.d.).
LLM-as-judge is common now when human judgment is expensive and exact-match scoring does not work. The discipline here is treating the judge as a measurement instrument that needs its own validation. Jung et al. (2025) propose a "trust or escalate" approach: validate the judge against held-out expert labels, estimate confidence per item, and send low-confidence cases to humans. That is more useful than looking for a universal kappa cutoff, which does not exist in the literature. Haldar and Hockenmaier (2025) and related work report low intra-rater reliability from LLM judges across multiple settings, so the validation step should not be skipped.
A practical criterion: require the LLM judge to match human-human agreement on the same rubric, estimated from a validation set of expert-labeled items. If it does not, revise the rubric, use a stronger judge, or switch to human scoring.
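That criterion can be made executable. A sketch using raw percent agreement (a chance-corrected statistic such as kappa may be preferable when label prevalence is skewed); all inputs are label lists over the same validation items:

```python
def percent_agreement(a: list[str], b: list[str]) -> float:
    """Raw agreement rate between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def judge_clears_bar(expert_a, expert_b, reference, judge) -> bool:
    """Require judge-vs-reference agreement to reach human-human agreement
    on the same rubric, estimated from the expert-labeled validation set."""
    return percent_agreement(judge, reference) >= percent_agreement(expert_a, expert_b)
```

A judge that fails this bar should trigger rubric revision, a stronger judge, or a fallback to human scoring, as described above.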
10. Pilot and lock
The pilot is where rubric ambiguity, schema brittleness, and construct mis-mapping get caught. As a rule of thumb, run at least 10–20 curated items covering typical cases and known edge cases. LangSmith (n.d.) recommends starting with 5–10 examples for a quick check and scaling to 20 or more for thorough coverage. For tasks with multiple domains, subgroups, or study designs, a stratified set of 20–50 items is more likely to surface problems before they reach the holdout.
Start with rubric ambiguity: can two raters apply the judgment categories consistently? Then check for construct mis-mapping (does the output correspond to what the rubric intends?), schema brittleness (does it produce malformed outputs?), and whether the model can produce supporting evidence quotes at all. Run the same item across multiple fresh sessions to measure stochasticity. Verify that the item set covers the subgroups and study types relevant to the claim. Work out the adjudication rule for rater disagreements before the holdout is touched.
After the pilot, lock the prompt, output schema, and scoring plan. Changes made after seeing the holdout are protocol deviations. Report them as such.
11. Run under controlled conditions and log everything
Reproducibility requires consistent execution. Process each item under the same conditions: same session structure (fresh per item is recommended), same decoding parameters, same output schema.
For stochastic models, repeated runs reveal within-item variance. If multiple runs per item are used, specify the aggregation rule before running, not after: majority vote, mean score, or whatever else is appropriate. Log item-level outputs, not only aggregated scores, so that specific failure modes can be investigated later.
OpenAI (n.d.) recommends logging everything during development because logs generate future evaluation cases and surface failures that disappear in aggregate summaries. UK AISI's evaluation standard takes the same position: model outputs and decision steps should be easily observable in logs for any individual item.
12. Analyze results as estimates, not scores
Report confidence intervals, not only point estimates. The NeurIPS Paper Checklist requires authors to state confidence intervals for estimated performance and describe the source of variability, the method, and the underlying assumptions. AI evaluation practice frequently falls short of this.
For binary outcomes over 30–100 independent items, Wilson score intervals are a practical default for single proportions (NIST). Clopper-Pearson works when conservative coverage is needed or proportions are near 0 or 1. Bootstrap is the choice when the metric is not a simple proportion or mean.
For A/B comparison on the same items, analyze paired item-level differences rather than comparing two separate single-model intervals. Miller (2024) shows that naive standard errors can be substantially wrong when evaluation items cluster (meaning several items come from the same source and are therefore not independent). If items group naturally (multiple domains per paper, multiple questions per passage, translated variants of the same item), independence-based standard errors understate uncertainty even in small datasets. Clustered standard errors or cluster bootstrap are the appropriate fix. With few clusters, ordinary cluster-robust asymptotics can be unreliable; bootstrap or other small-sample corrections are safer (Cameron et al., 2008).
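For the paired, clustered case, a cluster bootstrap resamples whole clusters rather than individual items. A minimal sketch, assuming per-item score differences (model A minus model B on the same item) and a cluster label per item, such as the source paper:

```python
import random

def cluster_bootstrap_ci(diffs, clusters, n_boot=2000, seed=0):
    """Percentile 95% CI for a mean paired difference, resampling whole
    clusters so that within-cluster correlation is respected."""
    rng = random.Random(seed)
    by_cluster = {}
    for d, c in zip(diffs, clusters):
        by_cluster.setdefault(c, []).append(d)
    labels = list(by_cluster)
    means = []
    for _ in range(n_boot):
        # Draw clusters with replacement; keep all items in each drawn cluster.
        sample = [d for c in rng.choices(labels, k=len(labels)) for d in by_cluster[c]]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```

The percentile method shown is the simplest variant; with very few clusters, a BCa or other small-sample correction is safer.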
State explicitly whether the result comes from one run, an average across repeated runs, or another aggregation, and whether it is expected to generalize beyond the specific items tested.
13. Error analysis and robustness
A headline score is not enough. Failure mode analysis tells you whether errors are random or patterned. Patterned errors can signal construct problems that aggregate accuracy numbers hide entirely.
Break errors down by subgroup first: study type, domain, language, difficulty. Then look at direction: are false positives and false negatives roughly balanced, or is the model consistently biased one way? For expert tasks, check whether labels match and whether the supporting evidence actually justifies them. Test prompt-wording sensitivity too. A result that shifts substantially when a synonym is substituted was not robust to begin with.
For a risk-of-bias evaluation of observational studies, the pattern of expert judgments in the source review may suggest where model errors will concentrate. If detection bias appeared in nearly every included study, that domain is the first place to look.
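Subgroup breakdown is mechanical once item-level outputs are logged. A sketch assuming each record carries a subgroup tag, a model label, and a reference label (field names are illustrative):

```python
from collections import defaultdict

def error_breakdown(results):
    """Tally error rates by subgroup from item-level records.

    results: iterable of dicts with keys 'subgroup', 'model_label',
    and 'reference_label'.
    """
    counts = defaultdict(lambda: {"n": 0, "errors": 0})
    for r in results:
        group = counts[r["subgroup"]]
        group["n"] += 1
        if r["model_label"] != r["reference_label"]:
            group["errors"] += 1
    return {k: {**v, "error_rate": v["errors"] / v["n"]} for k, v in counts.items()}
```

A table of per-subgroup error rates is often where patterned failures, like the observational-study example above, first become visible.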
14. Report so others can repeat it
A result that others cannot reproduce is hard to interpret and hard to trust. The minimum disclosure set, drawing from Park et al. (2024), ACL Rolling Review, and the NeurIPS Paper Checklist, includes: model name and version; query date; whether web access or retrieval was enabled; the full prompt text; session structure; temperature and decoding parameters; number of attempts and aggregation rule; the case-set source and split plan; the reference standard and adjudication process; the rubric and output schema; the scoring method; the primary endpoint and pass/fail threshold; the uncertainty estimation method; and known failure modes and limits of generalization.
Pre-registration is worth doing for any evaluation meant to support a published claim. OSF and AsPredicted both support time-stamped registration before data collection. Registered components for LLM-based risk-of-bias evaluations already exist on OSF (see an OSF risk-of-bias registration example) and can serve as templates. The registration should lock: the claim, the case-source and sampling rules, the split rules, the model and version, the prompt text, the primary endpoint, the uncertainty method, the escalation rule for low-confidence items, and the error-analysis plan.
15. After deployment, the evaluation continues
Pre-deployment evaluations are controlled by design. They cannot account for the inputs, populations, and context shifts that accumulate once a system is actually in use.
OpenAI (n.d.) and LangSmith (n.d.) both treat post-deployment monitoring as a standard part of the evaluation lifecycle. Production failures should be logged, reviewed, and folded back into the offline evaluation set. Any change to the model or prompt should trigger re-evaluation.
A minimum monitoring setup includes: version-pinned logging of prompts and outputs; periodic sampled human review; alert thresholds for agreement failures, format failures, or citation errors; a process for reviewing severe failures; and a documented requirement to re-evaluate after any model or prompt change. The audit rate is a design choice. Logging, monitoring, and re-evaluation on change are the current standard.
Limitations
This guide synthesizes current guidance rather than conducting a systematic review of the evaluation literature. The source selection reflects guidance most directly relevant to custom LLM evaluations supporting research or deployment claims. It does not cover all benchmarking traditions, leaderboard methodology, or capability evaluation practice.
Recommendations described as best practice reflect convergence across multiple independent source classes. Recommendations described as rules of thumb reflect practical synthesis where no formal consensus exists. Readers should expect the guidance to evolve as the field matures.
Conclusion: An eval is a study, not a demo
An evaluation is most useful when it is designed as a study. The claim determines the estimand. The estimand determines sampling, scoring, and what uncertainty needs to be reported. And the result is only useful to others if they could understand the process and repeat it.
The current standards landscape is still fragmented. No single document governs all AI evaluations. But the practical guidance now converges on a short list: define the decision first, separate development from held-out testing, freeze the system under test, validate the scoring method, log item-level behavior, report uncertainty as estimates, and continue monitoring after deployment.
Better evaluations will not come from yet another checklist. They will come from applying this measurement discipline consistently.