Cognitive Revolution

Radically Better Reasoning: Elicit's Andreas Stuhlmüller & Jungwon Byun on World Models for Research

June 17, 2026

106 min episode · 3 min read

Topics

Remote Work, Startups, Design & UX

AI-Generated Summary

Published Jun 18, 2026

Key Takeaways

✓ Process Guarantee via DSL: Elicit built a proprietary domain-specific language that compiles reasoning into discrete microservices, so the identical analytical process applies to document 5 and document 9,999 in a batch. In a test where Claude, ChatGPT, and Elicit each analyzed 100 toxicology papers, only Elicit could verify that all 100 were actually processed; the other models admitted mid-conversation that they had not completed the task.
✓ LLM Probability Instability: Current frontier models produce unreliable confidence estimates because no coherent internal world model backs their stated probabilities. Asked to estimate clinical trial failure rates, models shift their percentages significantly when the prompt merely mentions base rates or adds contextual framing, a shift a domain expert would resist. Elicit addresses this with structured scaffolding rather than relying on the model's raw verbalization of uncertainty.
✓ World Models as External Continual Learning: Rather than storing evolving knowledge in model weights, Elicit is building structured external representations (graph-based causal diagrams, SQL tables, and other heterogeneous knowledge formats) that models can update incrementally as new papers arrive. This makes a model's understanding of a complex body of evidence inspectable by humans and other AIs, and enables consistent counterfactual and intervention-based reasoning across thousands of data points.
✓ Evidence Quality Beyond Metadata: Elicit evaluates research quality from content and methodology rather than relying solely on citation counts or journal impact factor, proxies that miss landmark papers such as foundational CRISPR work published in lower-tier journals. Researchers can specify domain-appropriate quality thresholds, such as minimum sample sizes or study designs, and Elicit applies those criteria uniformly across all retrieved sources rather than defaulting to surface-level heuristics.
✓ Automated Engineering via "The Line": Elicit's internal software pipeline, The Line, automates the full engineering cycle, from Slack feature request through spec writing, implementation, video testing, code review, and production deployment. It currently merges 30 to 50 issues per week, with no human intervention on simple tasks. The system identifies when human escalation is needed, for example for incomplete specs or high-complexity changes, and routes accordingly.
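
To make the process-guarantee idea from the first takeaway concrete, here is a minimal Python sketch. It is not Elicit's DSL, which the summary does not describe in detail; the names `BatchReport`, `run_batch`, and `word_count_step` are hypothetical. The sketch only illustrates the general pattern: apply one fixed step to every document, record a result or a failure per document ID, and check that every input ID is accounted for.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BatchReport:
    """Per-document outcomes for one batch run (hypothetical structure)."""
    results: dict[str, str] = field(default_factory=dict)
    failures: dict[str, str] = field(default_factory=dict)

    def is_complete(self, expected_ids: list[str]) -> bool:
        # Every expected document must appear as either a result or a failure.
        seen = set(self.results) | set(self.failures)
        return seen == set(expected_ids)


def run_batch(docs: dict[str, str], analyze: Callable[[str], str]) -> BatchReport:
    """Apply the same analysis step to every document, one at a time."""
    report = BatchReport()
    for doc_id, text in docs.items():
        try:
            report.results[doc_id] = analyze(text)
        except Exception as exc:
            # A failure is recorded explicitly rather than silently skipped.
            report.failures[doc_id] = repr(exc)
    return report


def word_count_step(text: str) -> str:
    """Stand-in for a real analysis step; rejects empty input."""
    if not text.strip():
        raise ValueError("empty document")
    return f"{len(text.split())} words"
```

Calling `run_batch(docs, word_count_step)` and then `report.is_complete(list(docs))` gives a yes/no answer to "was every document handled?", which is the kind of check the summary says a chat session could not provide.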

What It Covers

Elicit cofounders Andreas Stuhlmüller and Jungwon Byun explain how their AI research platform serves seven of the top 20 life sciences companies by combining frontier reasoning models with a custom domain-specific language that guarantees systematic process execution at scale, and why externalized "world models" represent the next frontier for reliable causal and counterfactual scientific analysis.

Key Questions Answered

• Token Spend as Headcount Substitute: Andreas spends approximately $2,000 per week on API tokens running multi-model orchestration pipelines that cross-check outputs across Claude, GPT, and Gemini, and finds that the cross-checking improves results enough to justify the multiplied cost. For enterprise life sciences customers, Elicit's pricing displaces existing services spend rather than competing with software budgets, which makes the ROI framing more favorable than raw token cost comparisons suggest.
• Certificates of Reasoning over Chain-of-Thought Monitoring: Rather than supervising hidden chain-of-thought tokens, Elicit advocates verifiable reasoning certificates embedded in outputs, analogous to mathematical proofs, that allow downstream checking without access to the model's internal reasoning steps. Tool call logs already provide partial certificates: if a model never reads a paper's methodology section before summarizing its conclusions, that gap is detectable and auditable without inspecting reasoning tokens directly.
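
The tool-call-log example in the last item can be sketched as a small audit. The log schema here is an assumption for illustration only: `ToolCall`, the `read_section` tool name, and the `methods` section label are hypothetical and are not taken from Elicit's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One entry in a hypothetical tool call log."""
    tool: str      # e.g. "read_section"
    paper_id: str
    section: str   # e.g. "methods", "abstract"


def unread_methods(log: list[ToolCall], summarized: list[str]) -> list[str]:
    """Return summarized papers whose methods section was never read."""
    read = {
        call.paper_id
        for call in log
        if call.tool == "read_section" and call.section == "methods"
    }
    # Preserve the order of the summarized list in the audit output.
    return [paper_id for paper_id in summarized if paper_id not in read]
```

A non-empty return value flags summaries that were produced without the methodology ever being opened, using only the log and no access to reasoning tokens.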

Notable Moment

When Stuhlmüller tested Elicit on a personal case involving a friend's cancer treatment, filtering roughly 5,000 relevant papers revealed a core limitation: even million-token context windows cannot produce coherent causal reasoning from raw literature at that scale. The realization that structured external representations, not larger contexts, are required for reliable medical decision support motivated the entire world models research direction.

Books, tools, and gear mentioned in this episode

Tools

• ChatGPT, by OpenAI
• Gemini, by Google
• Claude, by Anthropic
• Elicit
• Mercury (episode sponsor)
