Cognitive Revolution

Radically Better Reasoning: Elicit's Andreas Stuhlmüller & Jungwon Byun on World Models for Research

106 min episode · 3 min read

Topics

Remote Work, Startups, Design & UX

AI-Generated Summary

Key Takeaways

  • Process Guarantee via DSL: Elicit built a proprietary domain-specific language that compiles reasoning into discrete microservices, ensuring the identical analytical process applies to document number 5 and document 9,999 in a batch. When they tested Claude, ChatGPT, and Elicit on analyzing 100 toxicology papers, only Elicit could verify all 100 were actually processed — the other models admitted mid-conversation they had not completed the task.
  • LLM Probability Instability: Current frontier models produce unreliable confidence estimates because they lack coherent internal world models backing their stated probabilities. When asked to estimate clinical trial failure rates, models shift their percentage significantly if you simply mention base rates or add contextual framing — behavior a domain expert would resist. Elicit addresses this through structured scaffolding rather than relying on raw model verbalization of uncertainty.
  • World Models as External Continual Learning: Rather than storing evolving knowledge in model weights, Elicit is building structured external representations — combining graph-based causal diagrams, SQL tables, and heterogeneous knowledge formats — that models can update incrementally as new papers arrive. This approach makes a model's understanding of complex evidence bodies inspectable by humans and other AIs, enabling consistent counterfactual and intervention-based reasoning across thousands of data points.
  • Evidence Quality Beyond Metadata: Elicit evaluates research quality from content and methodology rather than relying solely on citation counts or journal impact factor — proxies that miss landmark papers like foundational CRISPR work published in lower-tier journals. Researchers can specify domain-appropriate quality thresholds, such as minimum sample sizes or study designs, and Elicit applies those criteria uniformly across all retrieved sources rather than defaulting to surface-level heuristics.
  • Automated Engineering via "The Line": Elicit's internal software pipeline, The Line, automates the full engineering cycle, from Slack feature request through spec writing, implementation, video testing, code review, and production deployment. It currently merges 30 to 50 issues per week, handling simple tasks without human intervention. The system identifies when human escalation is needed, for example for incomplete specs or high-complexity changes, and routes the issue accordingly.
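
The first and fourth takeaways describe applying one fixed process and one set of quality criteria to every document, then confirming that no document was skipped. The following is a minimal Python sketch of that idea. All names (`Paper`, `QualityCriteria`, `run_batch`, `missing_ids`) and fields are hypothetical and chosen for illustration; this is not Elicit's DSL or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one retrieved paper."""
    paper_id: str
    sample_size: int
    study_design: str


@dataclass(frozen=True)
class QualityCriteria:
    """Hypothetical researcher-specified thresholds."""
    min_sample_size: int
    allowed_designs: frozenset


def meets_criteria(paper: Paper, criteria: QualityCriteria) -> bool:
    # The same check runs for every paper, regardless of its batch position.
    return (
        paper.sample_size >= criteria.min_sample_size
        and paper.study_design in criteria.allowed_designs
    )


def run_batch(papers, criteria):
    """Apply the criteria to each paper; return (paper_id, passed) records."""
    return [(p.paper_id, meets_criteria(p, criteria)) for p in papers]


def missing_ids(papers, records):
    """Return ids of input papers that have no recorded result."""
    processed = {paper_id for paper_id, _ in records}
    return sorted(p.paper_id for p in papers if p.paper_id not in processed)


papers = [
    Paper("p1", sample_size=250, study_design="rct"),
    Paper("p2", sample_size=40, study_design="rct"),
    Paper("p3", sample_size=900, study_design="case_report"),
]
criteria = QualityCriteria(
    min_sample_size=100,
    allowed_designs=frozenset({"rct", "cohort"}),
)

records = run_batch(papers, criteria)
# An empty list here means every input paper has a recorded result.
unprocessed = missing_ids(papers, records)
```

Keeping the completeness check (`missing_ids`) separate from the processing step means it can audit any set of results, including output from a system that silently stopped partway through a batch.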

What It Covers

Elicit cofounders Andreas Stuhlmüller and Jungwon Byun explain how their AI research platform serves seven of the top 20 life sciences companies by combining frontier reasoning models with a custom domain-specific language that guarantees systematic process execution at scale. They also explain why externalized "world models" represent the next frontier for reliable causal and counterfactual scientific analysis.

Key Questions Answered

  • Token Spend as Headcount Substitute: Andreas spends approximately $2,000 per week on API tokens running multi-model orchestration pipelines that cross-check outputs across Claude, GPT, and Gemini. He finds that cross-checking improves results enough to justify the multiplied cost. For enterprise life sciences customers, Elicit's pricing displaces existing services spend rather than competing with software budgets, making the ROI framing more favorable than raw token cost comparisons suggest.
  • Certificates of Reasoning over Chain-of-Thought Monitoring: Rather than supervising hidden chain-of-thought tokens, Elicit advocates for verifiable reasoning certificates embedded in outputs — analogous to mathematical proofs — that allow downstream checking without requiring access to internal model reasoning steps. Tool call logs already provide partial certificates: if a model never reads a paper's methodology section before summarizing its conclusions, that gap is detectable and auditable without needing to inspect reasoning tokens directly.
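
The tool-call-log check described in the last point can be sketched in a few lines of Python. The log format (dicts with `tool`, `paper_id`, and `section` keys) and the tool names `read_section` and `summarize` are assumptions made for illustration; the episode summary does not specify how Elicit records tool calls.

```python
def find_unsupported_summaries(tool_calls, required_section="methodology"):
    """Return ids of papers summarized before the required section was read.

    `tool_calls` is an ordered list of hypothetical log entries.
    """
    sections_read = set()  # (paper_id, section) pairs seen so far
    flagged = []
    for call in tool_calls:
        if call["tool"] == "read_section":
            sections_read.add((call["paper_id"], call["section"]))
        elif call["tool"] == "summarize":
            # Audit step: the required read must appear earlier in the log.
            if (call["paper_id"], required_section) not in sections_read:
                flagged.append(call["paper_id"])
    return flagged


log = [
    {"tool": "read_section", "paper_id": "p1", "section": "abstract"},
    {"tool": "read_section", "paper_id": "p1", "section": "methodology"},
    {"tool": "summarize", "paper_id": "p1"},
    {"tool": "read_section", "paper_id": "p2", "section": "abstract"},
    {"tool": "summarize", "paper_id": "p2"},
]

# p2 was summarized without its methodology section ever being read.
flagged = find_unsupported_summaries(log)
```

The check needs only the ordered log of tool calls, not the model's reasoning tokens, which is the property the certificate framing relies on.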

Notable Moment

When Stuhlmüller tested Elicit on a personal case involving a friend's cancer treatment, filtering roughly 5,000 relevant papers revealed a core limitation: even million-token context windows cannot produce coherent causal reasoning from raw literature at that scale. This motivated the entire world models research direction — the realization that structured external representations, not larger contexts, are required for reliable medical decision support.
