
Commentary

Autoresearch meets biology: curing diseases autonomously

March 12, 2026

Andrej Karpathy built a loop that does science overnight. Right now it trains language models. Soon something like it could be screening cancer drugs.

Who is Andrej Karpathy?

Andrej Karpathy is one of the clearest thinkers in AI. He co-founded OpenAI, led Tesla's Autopilot team through its most consequential years, and has spent the time since building minimal, educational systems that strip away abstraction to reveal how things actually work. His projects — nanoGPT, micrograd, llm.c — follow a recognizable philosophy: find the irreducible core of a problem, implement it cleanly, and run it.

His latest project, autoresearch, applies that same philosophy to science itself.

What he built

Autoresearch is a closed loop in which an AI agent conducts machine learning experiments without human supervision. The agent modifies a training script, runs it for exactly five minutes, checks whether the model improved, keeps or discards the change, and repeats. At roughly twelve experiments per hour, an overnight run yields around a hundred iterations — all unattended.
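The loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in — `run_experiment` simulates a five-minute training run and `propose_change` simulates the agent editing a script; neither reflects autoresearch's actual code.

```python
import random

def run_experiment(params: dict) -> float:
    """Stand-in for a five-minute training run; returns the metric
    (lower is better). Purely illustrative: the score is the distance
    from a hidden optimum, plus measurement noise."""
    optimum = {"lr": 3e-4, "layers": 6}
    return (abs(params["lr"] - optimum["lr"]) * 1e3
            + abs(params["layers"] - optimum["layers"])
            + random.uniform(0, 0.05))

def propose_change(params: dict) -> dict:
    """Stand-in for the agent modifying the training script."""
    new = dict(params)
    if random.random() < 0.5:
        new["lr"] *= random.choice([0.5, 2.0])
    else:
        new["layers"] = max(1, new["layers"] + random.choice([-1, 1]))
    return new

random.seed(0)
params = {"lr": 1e-3, "layers": 2}
best = run_experiment(params)
for _ in range(100):                  # roughly one overnight run at ~12/hour
    candidate = propose_change(params)
    score = run_experiment(candidate)
    if score < best:                  # keep the change only if it helped
        params, best = candidate, score
```

The structure is the point: bounded action space (two knobs), fast feedback (one function call), unambiguous metric (one number). Every substitution discussed below swaps out the two stand-in functions while keeping this skeleton.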

No dramatic results have been published yet. That is not really the point. The contribution is the structure: a working demonstration that the scientific loop — hypothesis, experiment, measurement, revision — can be automated end to end when three conditions hold: the action space is bounded, the feedback is fast, and the metric is unambiguous.

Why the structure matters more than the result

Science has always been slow not because experiments are hard to run, but because the human steps surrounding them are slow: deciding what to try next, waiting for results, interpreting them, writing up, discussing, revising. Autoresearch removes most of those delays from the loop. The agent does not sleep, does not get distracted, and does not need to schedule a meeting to decide whether to keep a change.

The fixed five-minute budget is a key design choice. It forces the agent to explore broadly rather than overfit to any single run. It also makes results comparable across very different architectural choices. Constraint, it turns out, is what makes autonomous iteration tractable.

The real question autoresearch raises is not whether AI can tune a training script. It is whether the same loop can be transplanted into domains where the editable file is not Python, the metric is not bits per byte, and the experiment takes longer than five minutes. Biology is the obvious candidate.

What if the experiment is biological?

The analogy maps cleanly. The training script becomes a lab protocol: which compounds to test, at what dose, in which cell line, for how long. The performance metric becomes a biological readout: cell viability at 72 hours, tumor growth inhibition in an organoid, median lifespan in C. elegans, or an epigenetic clock score. The five-minute compute run becomes a plate-reader assay or an automated imaging pass — each of which can return a clean number in hours rather than days.

Most of the automation infrastructure already exists in high-throughput screening labs. Liquid-handling robots (Opentrons, Hamilton) can execute protocols from structured files. Automated confocal imagers can quantify morphology without a human at the microscope. Plate readers return viability numbers in minutes. Robotic incubators maintain conditions and hand off samples on schedule. What has been missing is not the hardware — it is the loop: something to look at the results, propose the next perturbation, and queue the next run.
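What that missing loop might look like, in sketch form: the hardware layer is reduced to stubs (a real integration would target something like the Opentrons Python API, not shown here), and the dose-response curve, compound name, and agent policy are all invented for illustration.

```python
import random

def execute_protocol(protocol: dict) -> None:
    """Stub for the robot layer: in a real lab this would compile the
    protocol into liquid-handler and incubator instructions."""
    pass

def read_plate(protocol: dict) -> float:
    """Stub for the plate reader: returns a viability score in [0, 1].
    Hypothetical dose-response curve with measurement noise."""
    ic50 = 1.0                        # unknown in reality; the loop must find it
    dose = protocol["dose_uM"]
    viability = 1.0 / (1.0 + (dose / ic50) ** 2)
    return max(0.0, min(1.0, viability + random.gauss(0, 0.03)))

def propose_next(protocol: dict, viability: float) -> dict:
    """Stand-in for the agent: escalate the dose while cells tolerate it,
    back off once viability collapses."""
    new = dict(protocol)
    new["dose_uM"] *= 2.0 if viability > 0.5 else 0.75
    return new

random.seed(1)
protocol = {"compound": "compound_A", "dose_uM": 0.1,
            "cell_line": "organoid_P01", "duration_h": 72}
for cycle in range(20):               # each cycle = one assay turnaround
    execute_protocol(protocol)
    viability = read_plate(protocol)
    protocol = propose_next(protocol, viability)
```

The protocol is just a structured file, which is precisely the form the existing robots already consume; the loop's only genuinely new pieces are `read_plate` feeding back into `propose_next`.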

That loop is exactly what autoresearch demonstrates.

The imagined lab — and what is genuinely hard

Picture a concrete version. A C. elegans longevity screen: the agent cycles through NAD+ precursors, senolytics, mTOR inhibitors, and rapamycin analogs, reading out median lifespan from an automated worm tracker every few days. A cancer organoid screen: patient-derived tumor tissue grown in 96-well plates, with the agent testing drug combinations and reading viability at 72 hours. An aging intervention discovery pipeline: the metric is a methylation-based epigenetic clock score, measured from lysate after each treatment cycle. In each case, the agent sees a number, decides what to try next, and queues the run — overnight, unattended, iteratively.

None of this is science fiction at the component level. The honest challenge is connecting the components, and there are real obstacles.

Biology is noisier than compute. The same protocol run twice on the same cell line can yield meaningfully different numbers. An agent optimizing a single readout without replicates will chase noise. Statistical power requires running conditions in triplicate at minimum, which compresses the iteration count from a hundred overnight to perhaps thirty. The agent needs to know this, and to design experiments accordingly.

The search space is also combinatorially larger. A training script has a few dozen architectural choices. A drug combination screen has thousands of compounds, each with its own dose-response curve, each potentially interacting with every other. Without smart priors — which an LLM with biological knowledge can plausibly supply — random search will not get far.

Capital is a real constraint. A serious automated screening lab costs somewhere between half a million and two million dollars to equip, before reagents. That is not a weekend project, but it is within reach of a well-funded research group or a startup willing to treat the lab itself as the product.

And feedback loops are slower. Nothing in biology matches a five-minute GPU run, but some assays come closer than you might expect. ATP-based viability assays return results in under an hour. Impedance measurements are continuous. Short-term transcriptomic readouts using fast RNA extraction protocols can run in a few hours. The feedback is not as tight as a GPU, but it is tighter than the academic grant cycle.

Where this is going

A semi-automated version of this lab — one where the agent proposes and a human executes — is achievable today. The bottleneck is not technology; it is the willingness to treat the loop itself as the core intellectual contribution rather than any single experimental result.

A fully autonomous version, where the agent queues runs without human approval, is probably five to ten years away for most biological questions, and closer for well-characterized systems with fast, reliable readouts.

Karpathy's framing in autoresearch is worth sitting with: "give an AI agent a small but real setup and let it experiment autonomously overnight." The word "real" is doing a lot of work. Not a simulation, not a benchmark — a real environment with real consequences. For machine learning, that means a training script and a GPU. For biology, it means a protocol, a robot, and a plate reader.

The lab that never sleeps is not a metaphor. It is an engineering project. Autoresearch is a working proof of concept for one layer of it. The rest is a matter of connecting the hardware that already exists.