Tool mentioned on podcasts
Anthropic interpretability tools
by Anthropic
Mentioned on 1 episode by 1 guest across our covered podcasts.
Who mentioned it
- Jeffrey Ladish (Recommended)
“Ladish argues interpretability tools — specifically Anthropic's work tracing blackmail behavior to specific training stages — represent the only technically grounded path toward verifying whether model motivations actually match stated values.”
Mentioned on: Cognitive Revolution