The AI Breakdown

Autoresearch, Agent Loops and the Future of Work

25 min episode · 2 min read

Topics

Science & Discovery

AI-Generated Summary

Key Takeaways

  • The Agentic Loop Structure: Karpathy's autoresearch uses three files: fixed infrastructure, one editable training script the agent modifies, and a plain-English markdown strategy document the human writes. The human never touches code, only the memo. In 83 overnight experiments, 15 improvements drove validation loss from 0.9979 down to 0.9697 automatically (a sketch of the loop follows this list).
  • The Five-Minute Clock as Equalizer: Setting a fixed five-minute budget per experiment — regardless of what the agent changes — creates a level comparison across all runs. This converts open-ended research into a scored game. Running overnight yields roughly 100 experiments. The constraint forces comparable evaluation and eliminates runaway compute from poorly scoped iterations.
  • The Ralph Wiggum Loop Pattern: Developer Geoffrey Huntley's Ralph Wiggum technique predates autoresearch: feed a coding agent a prompt, loop its output back as input, terminate when the context fills, then spin up a fresh agent that reads externalized state from git commits and a progress file. Memory lives in files, not context windows, making the system self-healing across sessions (see the second sketch after this list).
  • Loop Readiness Criteria: Agentic loops work best where five conditions hold — a scorable metric exists, iterations run fast and cheap, the environment is bounded, bad attempts cost minutes not months, and the agent can leave persistent traces. Code generation, ad bid optimization, and algorithmic trading sit at the high-readiness end; therapy and political negotiation sit at the opposite extreme.
  • New High-Value Human Skills: As loops automate execution, human value shifts to arena design (writing the strategy document), evaluator construction (defining what "better" means as a scalar score), and problem decomposition. A practical self-assessment: identify any repeated task where you already know what improvement looks like, then test whether that judgment can be encoded as an agent-readable scoring function (the third sketch below shows one).
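
To make the loop concrete, here is a minimal sketch in Python. The `agent` command, the file names (`strategy.md`, `train.py`, `out/val_loss.txt`), and the snapshot-and-revert mechanics are hypothetical stand-ins rather than Karpathy's actual implementation; only the shape (edit one file, run under a fixed clock, score, keep or revert) comes from the episode.

```python
import shutil
import subprocess

TIME_BUDGET_S = 300   # the fixed five-minute clock, identical for every run
N_EXPERIMENTS = 100   # roughly what one overnight session yields

shutil.copy("train.py", "best_train.py")   # snapshot of the current champion
best_loss = float("inf")

for _ in range(N_EXPERIMENTS):
    # Hypothetical agent CLI: reads the human-written strategy memo and
    # rewrites the single editable file; the fixed infrastructure is never
    # passed in, so the agent cannot touch it.
    subprocess.run(["agent", "--memo", "strategy.md", "--edit", "train.py"],
                   check=False)

    try:
        # Same wall-clock budget regardless of what changed, so every
        # experiment is scored on a level field.
        subprocess.run(["python", "train.py"], timeout=TIME_BUDGET_S, check=True)
        loss = float(open("out/val_loss.txt").read())   # assumed output path
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
            OSError, ValueError):
        # A bad attempt costs five minutes, nothing more: revert and move on.
        shutil.copy("best_train.py", "train.py")
        continue

    if loss < best_loss:   # keep only improvements
        best_loss = loss
        shutil.copy("train.py", "best_train.py")
    else:
        shutil.copy("best_train.py", "train.py")
```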
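
The Ralph Wiggum pattern admits a similarly small sketch, again with a hypothetical `agent` command and file names (`PROMPT.md`, `progress.md`). The point is that each pass starts a fresh agent whose only memory is what earlier sessions left on disk and in git.

```python
import subprocess

PROMPT_FILE = "PROMPT.md"       # the fixed task prompt (hypothetical name)
PROGRESS_FILE = "progress.md"   # externalized memory the agent updates itself

while True:
    # Rebuild context from persistent traces, not from a carried-over
    # conversation: recent git history plus the progress file.
    git_log = subprocess.run(
        ["git", "log", "--oneline", "-10"],
        capture_output=True, text=True,
    ).stdout
    prompt = (
        open(PROMPT_FILE).read()
        + "\n\nRecent commits:\n" + git_log
        + "\nProgress so far:\n" + open(PROGRESS_FILE).read()
    )

    # Fresh agent session every pass; it works until its context fills,
    # committing to git and updating progress.md as it goes.
    result = subprocess.run(
        ["agent", "--prompt", prompt],
        capture_output=True, text=True,
    )

    if "TASK COMPLETE" in result.stdout:   # assumed completion sentinel
        break
```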
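
Finally, the self-assessment in the last item reduces to one question: can your sense of "better" be written as a function that returns a single number? A hypothetical evaluator, here scoring attempts to speed up a query without changing its result:

```python
import time

def score(run_query, baseline_rows) -> float:
    """Hypothetical evaluator: 'better' means the same rows, faster.

    run_query is a zero-argument callable produced by the agent's latest
    attempt; baseline_rows is the known-correct result it must reproduce.
    """
    start = time.perf_counter()
    rows = run_query()
    elapsed = time.perf_counter() - start
    if rows != baseline_rows:
        return float("-inf")   # correctness gate: wrong answers never win
    return -elapsed            # higher score means a faster run
```

By the episode's readiness criteria, any task that can be scored this way is a candidate for a loop.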

What It Covers

Andrej Karpathy's autoresearch project, a three-file GitHub repo where an AI agent autonomously runs LLM training experiments in five-minute loops and keeps only improvements, signals a new work primitive: agentic loops that apply across business functions wherever outcomes can be scored objectively.

Notable Moment

Karpathy described the current single-threaded loop as just a seed: the real vision is thousands of AI agents collaborating asynchronously across branching research directions, with existing tools like GitHub already straining under assumptions built for human-paced, single-master-branch workflows.
