Radically Better Reasoning: Elicit's Andreas Stuhlmüller & Jungwon Byun on World Models for Research
Episode: 106 min
Read time: 3 min
Topics: Remote Work, Startups, Design & UX
AI-Generated Summary
Key Takeaways
- Process Guarantee via DSL: Elicit built a proprietary domain-specific language that compiles reasoning into discrete microservices, ensuring the identical analytical process applies to document number 5 and document 9,999 in a batch. When they tested Claude, ChatGPT, and Elicit on analyzing 100 toxicology papers, only Elicit could verify all 100 were actually processed; the other models admitted mid-conversation they had not completed the task.
- LLM Probability Instability: Current frontier models produce unreliable confidence estimates because they lack coherent internal world models backing their stated probabilities. When asked to estimate clinical trial failure rates, models shift their percentage significantly if you simply mention base rates or add contextual framing, behavior a domain expert would resist. Elicit addresses this through structured scaffolding rather than relying on raw model verbalization of uncertainty.
- World Models as External Continual Learning: Rather than storing evolving knowledge in model weights, Elicit is building structured external representations (combining graph-based causal diagrams, SQL tables, and heterogeneous knowledge formats) that models can update incrementally as new papers arrive. This approach makes a model's understanding of complex evidence bodies inspectable by humans and other AIs, enabling consistent counterfactual and intervention-based reasoning across thousands of data points.
- Evidence Quality Beyond Metadata: Elicit evaluates research quality from content and methodology rather than relying solely on citation counts or journal impact factor, proxies that miss landmark papers like foundational CRISPR work published in lower-tier journals. Researchers can specify domain-appropriate quality thresholds, such as minimum sample sizes or study designs, and Elicit applies those criteria uniformly across all retrieved sources rather than defaulting to surface-level heuristics.
- Automated Engineering via "The Line": Elicit's internal software pipeline called The Line automates the full engineering cycle (from Slack feature request through spec writing, implementation, video testing, code review, and production deployment), currently merging 30 to 50 issues per week without human intervention on simple tasks. The system self-identifies when human escalation is needed, such as incomplete specs or high-complexity changes, and routes accordingly.
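The process guarantee in the first takeaway can be illustrated with a small sketch. This is not Elicit's DSL or its implementation, which the summary does not describe. It is a minimal Python illustration, under assumed names (`Result`, `run_batch`, `coverage_report`), of the general idea: apply one fixed step to every document in a batch, then produce a report that proves which documents were actually processed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    """Outcome of applying the analysis step to one document (hypothetical schema)."""
    doc_id: str
    ok: bool
    output: str


def run_batch(docs: dict[str, str], step: Callable[[str], str]) -> list[Result]:
    """Apply the same step to every document; record failures instead of skipping them."""
    results: list[Result] = []
    for doc_id, text in docs.items():
        try:
            results.append(Result(doc_id, True, step(text)))
        except Exception as exc:  # a failed document is still accounted for
            results.append(Result(doc_id, False, f"error: {exc}"))
    return results


def coverage_report(docs: dict[str, str], results: list[Result]) -> dict:
    """Compare the expected document ids against the ids that have a result."""
    seen = {r.doc_id for r in results}
    return {
        "expected": len(docs),
        "processed": sum(r.ok for r in results),
        "failed": sorted(r.doc_id for r in results if not r.ok),
        "missing": sorted(set(docs) - seen),
    }


if __name__ == "__main__":
    # Stand-in batch of 100 papers; the step here is a placeholder for a model call.
    papers = {f"paper-{i:03d}": f"abstract {i}" for i in range(100)}
    report = coverage_report(papers, run_batch(papers, lambda text: text.upper()))
    print(report)
```

The point of the design is that completeness is checked by code over explicit ids rather than asserted by a model in conversation, so an unprocessed or failed document shows up in `missing` or `failed`.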
What It Covers
Elicit cofounders Andreas Stuhlmüller and Jungwon Byun explain how their AI research platform serves seven of the top 20 life sciences companies by combining frontier reasoning models with a custom domain-specific language that guarantees systematic process execution at scale, and why externalized "world models" represent the next frontier for reliable causal and counterfactual scientific analysis.
Key Questions Answered
- Token Spend as Headcount Substitute: Andreas spends approximately $2,000 per week on API tokens running multi-model orchestration pipelines that cross-check outputs across Claude, GPT, and Gemini, finding that model cross-checking improves results enough to justify the cost multiplication. For enterprise life sciences customers, Elicit's pricing displaces existing services spend rather than competing with software budgets, making the ROI framing more favorable than raw token cost comparisons suggest.
- Certificates of Reasoning over Chain-of-Thought Monitoring: Rather than supervising hidden chain-of-thought tokens, Elicit advocates for verifiable reasoning certificates embedded in outputs, analogous to mathematical proofs, that allow downstream checking without requiring access to internal model reasoning steps. Tool call logs already provide partial certificates: if a model never reads a paper's methodology section before summarizing its conclusions, that gap is detectable and auditable without needing to inspect reasoning tokens directly.
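The tool-call-log example in the last point can be sketched as a simple audit. The log schema below (`ToolCall` with `tool`, `paper_id`, and `section` fields, and the tool names `read_section` and `summarize`) is an assumption made for illustration, not a format described in the episode. The sketch flags any summary produced before the paper's methodology section was read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One entry in a hypothetical agent tool-call log."""
    tool: str          # assumed tool names: "read_section" or "summarize"
    paper_id: str
    section: str = ""  # only meaningful for "read_section"


def unsupported_summaries(log: list[ToolCall]) -> list[str]:
    """Return paper ids summarized before their methodology section was read."""
    read_methods: set[str] = set()
    flagged: list[str] = []
    for call in log:  # the log is assumed to be in chronological order
        if call.tool == "read_section" and call.section == "methodology":
            read_methods.add(call.paper_id)
        elif call.tool == "summarize" and call.paper_id not in read_methods:
            flagged.append(call.paper_id)
    return flagged


if __name__ == "__main__":
    log = [
        ToolCall("read_section", "p1", "methodology"),
        ToolCall("summarize", "p1"),
        ToolCall("read_section", "p2", "abstract"),
        ToolCall("summarize", "p2"),  # methodology never read
    ]
    print(unsupported_summaries(log))
```

The check needs only the ordered log of tool calls, which is why it works as a partial certificate: nothing about the model's internal reasoning tokens has to be inspected.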
Notable Moment
When Stuhlmüller tested Elicit on a personal case involving a friend's cancer treatment, filtering roughly 5,000 relevant papers revealed a core limitation: even million-token context windows cannot produce coherent causal reasoning from raw literature at that scale. This motivated the entire world models research direction — the realization that structured external representations, not larger contexts, are required for reliable medical decision support.