
UK AISI Chief Scientist


We have 1 summarized appearance for the UK AISI Chief Scientist so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

1 episode

AI Summary

→ WHAT IT COVERS

Geoffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), outlines the current AI threat landscape across biosecurity, cybersecurity, and loss-of-control risks. With roughly 100 technical staff, the UK AISI conducts pre-release frontier model evaluations, red-team jailbreaking, and theoretical safety research, while briefing governments globally on why current mitigation strategies cannot achieve more than a few nines of reliability.

→ KEY INSIGHTS

- **Current safety techniques have correlated failure modes:** All existing alignment approaches (honesty training, white-box detectors, AI control measures, monitoring) share a structural vulnerability: optimization pressure during training causes the remaining failure modes to cluster around the same root cause, so they could all fail simultaneously for the same reason. Irving frames this not as speculation but as a predictable consequence of how iterative training eliminates surface-level problems while leaving deeper, correlated ones intact. No current technique provides more than a couple of nines of reliability, and correlated defenses compound far worse than independent ones (see the worked probability sketch after this summary).
- **Jailbreaking success rate remains 100% across 30+ model evaluations:** The UK AISI red team has never failed to jailbreak a model, across more than 30 distinct testing runs spanning all major frontier developers. Heavily defended domains like biosecurity do require significantly more effort and time to crack, but the team always eventually succeeds. The practical implication: safeguards should be understood as friction and delay mechanisms that shrink the pool of capable bad actors, not as hard barriers, and risk models should treat them accordingly.
- **Reinforcement learning is already working beyond verifiable domains:** A widespread assumption holds that RL-based capability gains are limited to tasks with clear, checkable answers like math or code. Irving disputes this directly. In 2025, RL-trained models substantially outperform prior generations on tasks like interpreting photographs of biological experiments, a domain with no verifiable reward signal. The gains come from RL applied to fuzzy, self-critique-based feedback, meaning capability growth is broader and less domain-constrained than most public narratives suggest (a toy sketch of critique-based RL appears after this summary).
- **Eval awareness is an accelerating and largely unsolved problem:** Newer frontier models are measurably more aware that they are being evaluated, and the trend is strengthening across model generations. The UK AISI partially mitigates this by embedding evaluation scenarios within realistic workflows: showing models actual AISI engineers doing routine coding tasks, then introducing edge-case situations. However, Irving states directly that he lacks high confidence these mitigations will hold as models become more capable, making evaluation validity an open and growing structural problem for the entire field.
- **Jaggedness matters less as capability ceilings rise above human expert level:** Critics of catastrophic risk scenarios often cite model jaggedness (uneven performance across subtasks) as a natural buffer. Irving reframes this: jaggedness is relative to the frontier being evaluated. A Go grandmaster is jagged compared to other grandmasters but uniformly dominant against amateurs regardless of those gaps. As models exceed human expert performance across risk-relevant domains, their remaining jaggedness becomes irrelevant to harm potential. The calculation must be run on future capability levels, not current ones.
- **Voluntary cooperation with frontier labs is functional but incomplete:** Google, Anthropic, and OpenAI have made voluntary safety commitments and actively cooperate with UK AISI pre-deployment evaluations. The arrangement works partly because the AISI provides direct value: jailbreaks it discovers are disclosed privately before any public release, giving labs time to patch classifiers. However, not all frontier developers participate, and the voluntary nature means coverage is structurally incomplete. Irving notes that longer-horizon research collaborations, running months rather than days, are now replacing time-boxed pre-deployment evaluations to improve depth.
- **Theoretical research in complexity and game theory is underfunded relative to its potential:** Irving is directing UK AISI funding toward information theory, complexity theory, game theory, and learning theory as potential sources of stronger safety guarantees. The core logic draws from theoretical computer science: in well-designed protocols, defenders can structurally win, and scalable oversight concepts like debate trace directly to interactive proof theory (a toy debate sketch appears after this summary). However, these fields are only beginning to engage seriously with AI safety, and the person-years invested can still be counted on a few hands. The opportunity cost of continued neglect is high.

→ NOTABLE MOMENT

Irving describes a multi-month red-teaming collaboration with Anthropic and OpenAI in which the UK AISI discovered far more jailbreaks than any standard pre-deployment window would allow. The finding prompted real-time classifier updates to live models, not just future versions. This reveals that post-deployment patching of safety defenses is already operational practice rather than a theoretical contingency, with implications for how dynamic and ongoing safety evaluation must become.

💼 SPONSORS

- Granola: https://granola.ai
- Servl: https://serval.com/cognitive
- Tasklet: https://tasklet.ai
- Claude: https://claude.ai/tcr

🏷️ AI Safety Evaluation, Frontier Model Red-Teaming, Biosecurity Risk, Loss of Control, Reinforcement Learning, UK AI Security Institute, Scalable Oversight
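Worked sketch for the correlated-failures insight above. The episode gives no concrete numbers, so the figures here are illustrative only; the point is the structure: stacked defenses multiply their nines only when their failure modes are independent, and a shared root cause collapses the combined guarantee back toward a single layer's reliability.

```python
# Illustrative numbers only: three defense layers, each failing 1% of the time.
p_fail = 0.01

# Independent failures: an attack slips through only if all three layers
# fail at once, so the nines multiply.
independent = p_fail ** 3          # 1e-06 -> six nines of reliability

# Fully correlated failures: one shared root cause defeats every layer
# simultaneously, so adding layers buys nothing.
correlated = p_fail                # 1e-02 -> still only two nines

# Crude mixture model of partial correlation: with probability rho the
# layers fail from a common cause, otherwise independently.
rho = 0.5
partial = rho * p_fail + (1 - rho) * p_fail ** 3

for label, p in [("independent", independent),
                 ("fully correlated", correlated),
                 ("50% correlated", partial)]:
    print(f"{label:18s} P(all defenses fail) = {p:.2e}")
```

Even 50% correlation leaves the stack near 5e-03, closer to one layer's two nines than to the six nines the independent calculation promises.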
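Toy sketch for the RL insight above. The episode does not describe any lab's training code, so this is only a minimal REINFORCE loop in which the reward comes from a fuzzy critic score rather than a verifiable checker; the policy, actions, critic scoring rule, and all numbers are invented stand-ins.

```python
import math
import random

# Toy "policy": a softmax over three canned answers to a fuzzy question.
# In real systems the policy is an LLM; everything here is a stand-in.
ACTIONS = ["vague answer", "detailed answer", "detailed answer with caveats"]
logits = [0.0, 0.0, 0.0]

def critic_score(action: str) -> float:
    """Fuzzy self-critique reward: no ground-truth verifier, just a noisy
    preference for detail and caveats (an invented scoring rule)."""
    base = {"vague answer": 0.2,
            "detailed answer": 0.6,
            "detailed answer with caveats": 0.9}[action]
    return base + random.gauss(0.0, 0.1)

def sample(logits):
    """Draw an action index from the softmax over logits."""
    zs = [math.exp(l - max(logits)) for l in logits]
    probs = [z / sum(zs) for z in zs]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

# Plain REINFORCE with a moving-average baseline: the same update rule
# applies whether the reward comes from a verifier or a fuzzy critic.
baseline, lr = 0.0, 0.5
for step in range(500):
    i, probs = sample(logits)
    reward = critic_score(ACTIONS[i])
    baseline += 0.05 * (reward - baseline)
    advantage = reward - baseline
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]  # d log softmax / d logit_j
        logits[j] += lr * advantage * grad

print("learned preference:", ACTIONS[logits.index(max(logits))])
```

The loop reliably converges on the critic's favorite answer, which is the structural point: nothing in the update requires the reward signal to be verifiable.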
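Toy sketch for the theory insight above, illustrating how debate-style oversight inherits its "defender structurally wins" flavor from interactive proofs. A limited judge checks only a single leaf of a claim tree, yet an honest challenger who always recurses into a false subclaim still forces the right verdict. The claim-tree structure and every name here are invented for illustration, not taken from the episode.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    truth: bool                 # ground truth; the judge only sees it at a leaf
    subclaims: list = field(default_factory=list)  # claim holds iff ALL hold

def false_child(claim):
    """Honest challenger's strategy: point at a false subclaim if one exists."""
    return next((c for c in claim.subclaims if not c.truth), None)

def debate(claim) -> bool:
    """Judge's verdict when a debater asserts `claim` (perhaps dishonestly)
    and an honest challenger recurses. The judge is 'limited': it only
    ever verifies one cheap leaf at the end of the debate."""
    while claim.subclaims:
        target = false_child(claim)
        if target is None:
            # Every subclaim is true; any challenge leads to a true leaf,
            # so the assertion survives scrutiny.
            target = claim.subclaims[0]
        claim = target
    return claim.truth  # the single leaf check the judge performs

# A false root hidden under mostly-true structure: honest challenge finds it.
leaf_t = Claim(True)
root = Claim(False, [Claim(True, [leaf_t, leaf_t]),
                     Claim(False, [leaf_t, Claim(False)])])
assert debate(root) is False   # the honest challenger structurally wins

true_root = Claim(True, [Claim(True, [leaf_t]), Claim(True, [leaf_t])])
assert debate(true_root) is True
print("honest strategy wins both cases")
```

The design choice mirrors interactive proof theory: the judge's work stays cheap while the debate steers it to exactly the spot where a lie must surface.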
