Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Episode: 138 min
Read time: 4 min
Topics: Science & Discovery, Economics & Policy
AI-Generated Summary
Key Takeaways
- ✓ Current safety techniques have correlated failure modes: all existing alignment approaches (honesty training, white-box detectors, AI control measures, monitoring) share a structural vulnerability. Optimization pressure during training causes the remaining failure modes to cluster around the same root cause, which means they could all fail simultaneously for the same reason. Irving frames this not as speculation but as a predictable consequence of iterative training, which eliminates surface-level problems while leaving deeper, correlated ones intact. No current technique provides more than a couple of nines of reliability (a toy probability calculation follows this list).
- ✓ Jailbreaking success rate remains 100% across 30+ model evaluations: the AISI red team has never failed to jailbreak a model, across more than 30 distinct testing runs spanning all major frontier developers. Heavily defended domains like biosecurity require significantly more effort and time to crack, but the team always eventually succeeds. The practical implication: safeguards should be understood as friction and delay mechanisms that shrink the pool of capable bad actors, not as hard barriers, and should be modeled accordingly (a sketch of the attacker-pool framing also follows this list).
- ✓ Reinforcement learning is already working beyond verifiable domains: a widespread assumption holds that RL-based capability gains are limited to tasks with clear, checkable answers like math or code. Irving disputes this directly. In 2025, RL-trained models substantially outperform prior generations on tasks like interpreting photographs of biological experiments, a domain with no verifiable reward signal. The gains come from RL applied to fuzzy, self-critique-based feedback (a minimal training-loop sketch follows the list), meaning capability growth is broader and less domain-constrained than most public narratives suggest.
- ✓ Eval awareness is an accelerating and largely unsolved problem: newer frontier models are measurably more aware that they are being evaluated, and the trend is strengthening with each model generation. AISI partially mitigates this by embedding evaluation scenarios within realistic workflows: showing models actual AISI engineers doing routine coding tasks, then introducing edge-case situations. However, Irving states directly that he lacks high confidence these mitigations will hold as models become more capable, making evaluation validity an open and growing structural problem for the entire field.
- ✓ Jaggedness matters less as capability ceilings rise above human expert level: critics of catastrophic risk scenarios often cite model jaggedness (uneven performance across subtasks) as a natural buffer. Irving reframes this: jaggedness is relative to the frontier being evaluated. A Go grandmaster looks jagged compared to other grandmasters but is uniformly dominant against amateurs regardless of those gaps (a short Elo calculation below makes this concrete). As models exceed human expert performance across risk-relevant domains, their remaining jaggedness becomes irrelevant to harm potential. The calculation must be run on future capability levels, not current ones.
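To make the correlated-failure takeaway concrete, here is a toy probability calculation; the layer count and failure rates are illustrative assumptions, not AISI measurements:

```python
# Toy model: why correlated failure modes cap stacked-safeguard reliability.
# All numbers are illustrative, not AISI measurements.

n_layers = 4   # e.g. honesty training, white-box probe, control, monitoring
p_fail = 0.01  # each layer alone fails ~1% of the time ("two nines")

# If failures were independent, stacking layers compounds reliability:
p_all_fail_independent = p_fail ** n_layers
print(f"independent: P(all fail) = {p_all_fail_independent:.0e}")  # 1e-08

# If training pressure pushes every layer's residual failures toward the
# same root cause, the layers fail together: stacking buys almost nothing.
p_shared_cause = 0.01                   # probability the common root cause fires
p_all_fail_correlated = p_shared_cause  # one event defeats every layer at once
print(f"correlated:  P(all fail) = {p_all_fail_correlated:.0e}")   # 1e-02
```

Under independence, four roughly two-nines layers would compound to about eight nines; under a shared root cause, the stack never exceeds the reliability of the common assumption, which is the shape of Irving's concern.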
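The "friction, not barrier" framing from the jailbreaking takeaway can also be sketched numerically. Everything below is an assumption for illustration (the log-normal effort distribution, the two-hour median budget); the point is only that raising the required effort screens out low-effort actors even when a determined team always gets through:

```python
# Toy attacker-pool model for "safeguards as friction". Assumed numbers only.
import math

def fraction_deterred(median_budget_hours: float, required_hours: float) -> float:
    """Fraction of attackers whose effort budget falls below the required effort,
    assuming budgets are log-normally distributed (sigma=1) around the median."""
    sigma = 1.0
    z = (math.log(required_hours) - math.log(median_budget_hours)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # log-normal CDF

median_budget = 2.0  # assumed: the median attacker gives up after ~2 hours
for required in (0.1, 10.0, 100.0):
    print(f"jailbreak takes ~{required:>5.1f} h -> "
          f"{fraction_deterred(median_budget, required):.0%} of attackers deterred")
```

The red team in the tail of the distribution always succeeds, but each added hour of required effort removes a slice of the capable-actor pool, which is exactly how the takeaway says safeguards should be modeled.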
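The claim that RL works on fuzzy, self-critique-based feedback can be illustrated with a minimal REINFORCE loop over a toy categorical policy. The `self_critique` scorer here is a stand-in assumption; in real systems a model grades its own free-form outputs, but the training signal has the same shape: a fuzzy scalar rather than a verifiable answer.

```python
# Minimal sketch of RL from fuzzy self-critique feedback (REINFORCE on a
# toy categorical policy). Illustrative only; not any lab's training stack.
import math, random

candidates = ["vague answer", "detailed answer", "wrong answer"]
logits = [0.0, 0.0, 0.0]

def self_critique(answer: str) -> float:
    """Assumed fuzzy reward: no verifiable ground truth, just graded preference."""
    return {"vague answer": 0.3, "detailed answer": 0.9, "wrong answer": 0.1}[answer]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

lr, baseline = 0.5, 0.0
for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    reward = self_critique(candidates[i])
    baseline += 0.1 * (reward - baseline)  # running-average reward baseline
    advantage = reward - baseline
    for j in range(len(candidates)):       # REINFORCE gradient on the logits
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

# The policy concentrates on the answer the critique prefers, with no
# checkable ground truth anywhere in the loop.
print({c: round(p, 2) for c, p in zip(candidates, softmax(logits))})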
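Irving's Go analogy can also be made quantitative with the standard Elo expected-score formula; the ratings below are illustrative, not from the episode:

```python
# Elo illustration of why jaggedness is relative to the evaluation frontier.

def win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

gm_strong_subtask = 2800  # the grandmaster's strongest area
gm_weak_subtask = 2600    # a "jagged" weak spot, 200 points lower
amateur = 1500

# Against a 2800 peer, the weak spot matters badly.
print(f"vs 2800 peer, weak area : {win_prob(gm_weak_subtask, 2800):.0%}")   # ~24%
# Against an amateur, the same jaggedness is invisible: both areas dominate.
print(f"vs amateur, strong area : {win_prob(gm_strong_subtask, amateur):.1%}")
print(f"vs amateur, weak area   : {win_prob(gm_weak_subtask, amateur):.1%}")
```

A 200-point jagged dip costs the grandmaster dearly against peers yet barely registers against an amateur (99.8% vs 99.9% expected score); the same relativity applies once models sit far above human experts in risk-relevant domains.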
What It Covers
Geoffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), outlines the current AI threat landscape across biosecurity, cybersecurity, and loss-of-control risks. With roughly 100 technical staff, AISI conducts pre-release frontier model evaluations, red-team jailbreaking, and theoretical safety research, while briefing governments globally on why current mitigation strategies cannot achieve more than a few nines of reliability.
Key Questions Answered
- • Voluntary cooperation with frontier labs is functional but incomplete: Google, Anthropic, and OpenAI have made voluntary safety commitments and are actively cooperating with AISI pre-deployment evaluations. The arrangement works partly because AISI provides direct value: jailbreaks it discovers are disclosed privately before any public release, giving labs time to patch classifiers. However, not all frontier developers participate, and the voluntary nature means coverage is structurally incomplete. Irving notes that longer-horizon research collaborations, running months rather than days, are now replacing time-boxed pre-deployment evaluations to improve depth.
- • Theoretical research in complexity and game theory is underfunded relative to its potential: Irving is directing AISI funding toward information theory, complexity theory, game theory, and learning theory as potential sources of stronger safety guarantees. The core logic draws from theoretical computer science: in well-designed protocols, defenders can structurally win. Scalable oversight concepts like debate trace directly to interactive proof theory (a minimal debate sketch follows this list). However, these fields are only beginning to engage seriously with AI safety, so the person-years invested so far can still be counted on a few hands. The opportunity cost of continued neglect is high.
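To show the interactive-proof lineage of debate, here is a minimal sketch; it is a generic illustration of the bisection idea behind debate protocols, not AISI's actual research code:

```python
# Minimal debate-as-interactive-proof sketch: a weak judge supervises two
# strong debaters over a long computation by bisecting their disagreement.

values = list(range(1, 1001))  # a "long computation": summing 1..1000

def honest_prefix_sum(i: int) -> int:
    return sum(values[:i])

def dishonest_prefix_sum(i: int) -> int:
    # The dishonest debater inflates every claim past the halfway point by 5.
    return honest_prefix_sum(i) + (5 if i > 500 else 0)

def judge_checks_one_step(i: int, claim_before: int, claim_after: int) -> bool:
    """The judge's only power: verify a single addition."""
    return claim_before + values[i] == claim_after

# The debaters disagree on the total. The judge bisects, comparing their
# (cheap-to-state) prefix claims, until the disagreement spans one step.
lo, hi = 0, len(values)  # invariant: claims agree at lo, disagree at hi
while hi - lo > 1:
    mid = (lo + hi) // 2
    if honest_prefix_sum(mid) == dishonest_prefix_sum(mid):
        lo = mid  # they agree up to mid; the disagreement lies beyond it
    else:
        hi = mid  # the disagreement already appears by mid

# One cheap check at the pinpointed step settles who was lying.
honest_ok = judge_checks_one_step(lo, honest_prefix_sum(lo), honest_prefix_sum(hi))
dishonest_ok = judge_checks_one_step(lo, dishonest_prefix_sum(lo), dishonest_prefix_sum(hi))
print(f"step {lo}: honest consistent={honest_ok}, dishonest consistent={dishonest_ok}")
```

The judge makes O(log n) cheap comparisons plus one primitive check and still reliably catches the dishonest debater; the hope Irving points to is that this "defender structurally wins" property can be extended to fuzzier domains.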
Notable Moment
Irving describes a multi-month red-teaming collaboration with Anthropic and OpenAI in which AISI discovered far more jailbreaks than any standard pre-deployment window would allow. The findings prompted real-time classifier updates to live models, not just future versions. This reveals that post-deployment patching of safety defenses is already operational practice rather than a theoretical contingency, with implications for how dynamic and ongoing safety evaluation must become.