Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Episode: 138 min
Read time: 4 min
Topics: Science & Discovery, Economics & Policy
AI-Generated Summary
Key Takeaways
- ✓ Current safety techniques have correlated failure modes: all existing alignment approaches (honesty training, white-box detectors, AI control measures, monitoring) share a structural vulnerability. Optimization pressure during training causes the remaining failure modes to cluster around the same root cause, which means they could all fail simultaneously for the same reason. Irving frames this not as speculation but as a predictable consequence of iterative training, which eliminates surface-level problems while leaving deeper, correlated ones intact. No current technique provides more than a couple of nines of reliability (a toy probability calculation follows this list).
- ✓ Jailbreaking success rate remains 100% across 30+ model evaluations: the AISI red team has never failed to jailbreak a model, across more than 30 distinct testing runs spanning all major frontier developers. Heavily defended domains like biosecurity require significantly more effort and time to crack, but the team always eventually succeeds. The practical implication: safeguards should be understood as friction and delay mechanisms that shrink the pool of capable bad actors, not as hard barriers, and should be modeled accordingly (a sketch of the attacker-pool framing also follows this list).
- ✓ Reinforcement learning is already working beyond verifiable domains: a widespread assumption holds that RL-based capability gains are limited to tasks with clear, checkable answers like math or code. Irving disputes this directly. In 2025, RL-trained models substantially outperform prior generations on tasks like interpreting photographs of biological experiments, a domain with no verifiable reward signal. The gains come from RL applied to fuzzy, self-critique-based feedback (a minimal training-loop sketch follows the list), meaning capability growth is broader and less domain-constrained than most public narratives suggest.
- ✓ Eval awareness is an accelerating and largely unsolved problem: newer frontier models are measurably more aware that they are being evaluated, and the trend is strengthening with each model generation. AISI partially mitigates this by embedding evaluation scenarios within realistic workflows: showing models actual AISI engineers doing routine coding tasks, then introducing edge-case situations. However, Irving states directly that he lacks high confidence these mitigations will hold as models become more capable, making evaluation validity an open and growing structural problem for the entire field.
- ✓ Jaggedness matters less as capability ceilings rise above human expert level: critics of catastrophic risk scenarios often cite model jaggedness (uneven performance across subtasks) as a natural buffer. Irving reframes this: jaggedness is relative to the frontier being evaluated. A Go grandmaster looks jagged compared to other grandmasters but is uniformly dominant against amateurs regardless of those gaps (a short Elo calculation below makes this concrete). As models exceed human expert performance across risk-relevant domains, their remaining jaggedness becomes irrelevant to harm potential. The calculation must be run on future capability levels, not current ones.
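To make the correlated-failure takeaway concrete, here is a toy probability calculation; the layer count and failure rates are illustrative assumptions, not AISI measurements:

```python
# Toy model: why correlated failure modes cap stacked-safeguard reliability.
# All numbers are illustrative, not AISI measurements.

n_layers = 4   # e.g. honesty training, white-box probe, control, monitoring
p_fail = 0.01  # each layer alone fails ~1% of the time ("two nines")

# If failures were independent, stacking layers compounds reliability:
p_all_fail_independent = p_fail ** n_layers
print(f"independent: P(all fail) = {p_all_fail_independent:.0e}")  # 1e-08

# If training pressure pushes every layer's residual failures toward the
# same root cause, the layers fail together: stacking buys almost nothing.
p_shared_cause = 0.01                   # probability the common root cause fires
p_all_fail_correlated = p_shared_cause  # one event defeats every layer at once
print(f"correlated:  P(all fail) = {p_all_fail_correlated:.0e}")   # 1e-02
```

Under independence, four roughly two-nines layers would compound to about eight nines; under a shared root cause, the stack never exceeds the reliability of the common assumption, which is the shape of Irving's concern.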
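The "friction, not barrier" framing from the jailbreaking takeaway can also be sketched numerically. Everything below is an assumption for illustration (the log-normal effort distribution, the two-hour median budget); the point is only that raising the required effort screens out low-effort actors even when a determined team always gets through:

```python
# Toy attacker-pool model for "safeguards as friction". Assumed numbers only.
import math

def fraction_deterred(median_budget_hours: float, required_hours: float) -> float:
    """Fraction of attackers whose effort budget falls below the required effort,
    assuming budgets are log-normally distributed (sigma=1) around the median."""
    sigma = 1.0
    z = (math.log(required_hours) - math.log(median_budget_hours)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # log-normal CDF

median_budget = 2.0  # assumed: the median attacker gives up after ~2 hours
for required in (0.1, 10.0, 100.0):
    print(f"jailbreak takes ~{required:>5.1f} h -> "
          f"{fraction_deterred(median_budget, required):.0%} of attackers deterred")
```

The red team in the tail of the distribution always succeeds, but each added hour of required effort removes a slice of the capable-actor pool, which is exactly how the takeaway says safeguards should be modeled.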
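The claim that RL works on fuzzy, self-critique-based feedback can be illustrated with a minimal REINFORCE loop over a toy categorical policy. The `self_critique` scorer here is a stand-in assumption; in real systems a model grades its own free-form outputs, but the training signal has the same shape: a fuzzy scalar rather than a verifiable answer.

```python
# Minimal sketch of RL from fuzzy self-critique feedback (REINFORCE on a
# toy categorical policy). Illustrative only; not any lab's training stack.
import math, random

candidates = ["vague answer", "detailed answer", "wrong answer"]
logits = [0.0, 0.0, 0.0]

def self_critique(answer: str) -> float:
    """Assumed fuzzy reward: no verifiable ground truth, just graded preference."""
    return {"vague answer": 0.3, "detailed answer": 0.9, "wrong answer": 0.1}[answer]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

lr, baseline = 0.5, 0.0
for step in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    reward = self_critique(candidates[i])
    baseline += 0.1 * (reward - baseline)  # running-average reward baseline
    advantage = reward - baseline
    for j in range(len(candidates)):       # REINFORCE gradient on the logits
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * advantage * grad

# The policy concentrates on the answer the critique prefers, with no
# checkable ground truth anywhere in the loop.
print({c: round(p, 2) for c, p in zip(candidates, softmax(logits))})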
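Irving's Go analogy can also be made quantitative with the standard Elo expected-score formula; the ratings below are illustrative, not from the episode:

```python
# Elo illustration of why jaggedness is relative to the evaluation frontier.

def win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

gm_strong_subtask = 2800  # the grandmaster's strongest area
gm_weak_subtask = 2600    # a "jagged" weak spot, 200 points lower
amateur = 1500

# Against a 2800 peer, the weak spot matters badly.
print(f"vs 2800 peer, weak area : {win_prob(gm_weak_subtask, 2800):.0%}")   # ~24%
# Against an amateur, the same jaggedness is invisible: both areas dominate.
print(f"vs amateur, strong area : {win_prob(gm_strong_subtask, amateur):.1%}")
print(f"vs amateur, weak area   : {win_prob(gm_weak_subtask, amateur):.1%}")
```

A 200-point jagged dip costs the grandmaster dearly against peers yet barely registers against an amateur (99.8% vs 99.9% expected score); the same relativity applies once models sit far above human experts in risk-relevant domains.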
What It Covers
Geoffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), outlines the current AI threat landscape across biosecurity, cybersecurity, and loss-of-control risks. With roughly 100 technical staff, AISI conducts pre-release frontier model evaluations, red-team jailbreaking, and theoretical safety research, while briefing governments globally on why current mitigation strategies cannot achieve more than a few nines of reliability.
Key Questions Answered
- • Voluntary cooperation with frontier labs is functional but incomplete: Google, Anthropic, and OpenAI have made voluntary safety commitments and are actively cooperating with AISI pre-deployment evaluations. The arrangement works partly because AISI provides direct value: jailbreaks it discovers are disclosed privately before any public release, giving labs time to patch classifiers. However, not all frontier developers participate, and the voluntary nature means coverage is structurally incomplete. Irving notes that longer-horizon research collaborations, running months rather than days, are now replacing time-boxed pre-deployment evaluations to improve depth.
- • Theoretical research in complexity and game theory is underfunded relative to its potential: Irving is directing AISI funding toward information theory, complexity theory, game theory, and learning theory as potential sources of stronger safety guarantees. The core logic draws from theoretical computer science: in well-designed protocols, defenders can structurally win. Scalable oversight concepts like debate trace directly to interactive proof theory (a minimal debate sketch follows this list). However, these fields are only beginning to engage seriously with AI safety, so the person-years invested so far can still be counted on a few hands. The opportunity cost of continued neglect is high.
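To show the interactive-proof lineage of debate, here is a minimal sketch; it is a generic illustration of the bisection idea behind debate protocols, not AISI's actual research code:

```python
# Minimal debate-as-interactive-proof sketch: a weak judge supervises two
# strong debaters over a long computation by bisecting their disagreement.

values = list(range(1, 1001))  # a "long computation": summing 1..1000

def honest_prefix_sum(i: int) -> int:
    return sum(values[:i])

def dishonest_prefix_sum(i: int) -> int:
    # The dishonest debater inflates every claim past the halfway point by 5.
    return honest_prefix_sum(i) + (5 if i > 500 else 0)

def judge_checks_one_step(i: int, claim_before: int, claim_after: int) -> bool:
    """The judge's only power: verify a single addition."""
    return claim_before + values[i] == claim_after

# The debaters disagree on the total. The judge bisects, comparing their
# (cheap-to-state) prefix claims, until the disagreement spans one step.
lo, hi = 0, len(values)  # invariant: claims agree at lo, disagree at hi
while hi - lo > 1:
    mid = (lo + hi) // 2
    if honest_prefix_sum(mid) == dishonest_prefix_sum(mid):
        lo = mid  # they agree up to mid; the disagreement lies beyond it
    else:
        hi = mid  # the disagreement already appears by mid

# One cheap check at the pinpointed step settles who was lying.
honest_ok = judge_checks_one_step(lo, honest_prefix_sum(lo), honest_prefix_sum(hi))
dishonest_ok = judge_checks_one_step(lo, dishonest_prefix_sum(lo), dishonest_prefix_sum(hi))
print(f"step {lo}: honest consistent={honest_ok}, dishonest consistent={dishonest_ok}")
```

The judge makes O(log n) cheap comparisons plus one primitive check and still reliably catches the dishonest debater; the hope Irving points to is that this "defender structurally wins" property can be extended to fuzzier domains.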
Notable Moment
Irving describes a multi-month red-teaming collaboration with Anthropic and OpenAI in which AISI discovered far more jailbreaks than any standard pre-deployment window would allow. The findings prompted real-time classifier updates to live models, not just future versions. This reveals that post-deployment patching of safety defenses is already operational practice rather than a theoretical contingency, with implications for how dynamic and ongoing safety evaluation must become.