#316 Robbie Goldfarb: Why the Future of AI Depends on Better Judgment

January 23, 2026

63 min episode · 3 min read

Robbie Goldfarb

Topics

Artificial Intelligence

AI-Generated Summary

Published Jan 25, 2026

Key Takeaways

✓ Expert thought process extraction: Form.ai maps how experts reason through complex questions by asking them to verbalize their approach rather than just label data. For a political question, an expert might say they would first cross-reference reliable sources, and each such step becomes a link in a chain of reasoning. Form.ai builds agentic systems that mirror these thought-process graphs and tests how well they generalize across scenarios before deploying the resulting judges at scale.
✓ Consequence mapping methodology: Instead of traditional good-or-bad labeling, Form.ai asks experts to predict the outcomes of AI conversations: what emotions users would feel, what actions they would take, what family members would say. This experimental approach provides richer data for training judges and reveals the reasoning behind expert evaluations, which is particularly valuable in sensitive domains like mental health, where clinical nuance matters.
✓ Four-dimension evaluation framework for political content: Form.ai assesses political AI responses on bias (which breaks into multiple subcategories), factuality, source selection (credibility, balance, accurate attribution), and tone and language (avoiding inflammatory phrasing that mirrors user anger). Each combination of topic, user intent, and evaluation dimension requires its own judge, so the result is a matrix of specialized evaluators rather than a one-size-fits-all assessment.
✓ Trust gap blocking AI adoption: KPMG research shows 80-plus percent of people feel optimistic about AI improving their lives, yet only 40-something percent trust it. This gap is the critical barrier to realizing AI's potential. Form.ai addresses it by publishing its expert network on its website, so users can see exactly who shaped model training, in contrast to approaches that rely on internal engineering teams or scaled labelers.
✓ Dangerous expectation of omniscient AI: ChatGPT established a problematic norm in which users expect authoritative answers to any question without context-gathering dialogue. Real expertise requires back-and-forth: doctors do not prescribe after one statement. AI models need to recognize when they lack sufficient context to respond responsibly, but this conflicts with engagement metrics, since users migrate to models that answer immediately over those that ask clarifying questions.
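
The "matrix of specialized evaluators" in the political-content takeaway can be pictured as a lookup keyed by topic, user intent, and evaluation dimension. The Python sketch below is only an illustration under stated assumptions: the four dimension names come from the summary, while the topics, intents, class name, function names, and placeholder judge labels are all hypothetical and are not Form.ai's actual system.

```python
from dataclasses import dataclass
from itertools import product

# The four evaluation dimensions named in the summary.
DIMENSIONS = ("bias", "factuality", "source_selection", "tone_language")

# Hypothetical topics and user intents, invented purely for illustration.
TOPICS = ("elections", "immigration")
INTENTS = ("seeking_facts", "seeking_validation")


@dataclass(frozen=True)
class JudgeKey:
    """One cell of the matrix: a (topic, intent, dimension) combination."""
    topic: str
    intent: str
    dimension: str


def build_judge_matrix(topics, intents, dimensions):
    """Return one placeholder judge label per (topic, intent, dimension) cell."""
    return {
        JudgeKey(t, i, d): f"judge[{t}/{i}/{d}]"
        for t, i, d in product(topics, intents, dimensions)
    }


def judges_for(matrix, topic, intent):
    """Select the judges for one conversation: one per evaluation dimension."""
    return [
        label
        for key, label in matrix.items()
        if key.topic == topic and key.intent == intent
    ]


matrix = build_judge_matrix(TOPICS, INTENTS, DIMENSIONS)
print(len(matrix))  # 2 topics x 2 intents x 4 dimensions = 16 judges
print(judges_for(matrix, "elections", "seeking_facts"))
```

In this sketch the number of judges grows multiplicatively with each added topic or intent, which is one way to read why the summary describes a matrix of specialized evaluators rather than a single general judge.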

What It Covers

Robbie Goldfarb, founder of Form.ai, explains how his company scales expert judgment to evaluate and improve AI systems in contentious domains like healthcare and politics. Form.ai builds transparent networks of credible experts—including Fareed Zakaria and Neil Ferguson—then creates AI judges that capture their reasoning processes to assess models for bias, accuracy, and clinical nuance.

Key Questions Answered

• Mental health scale reveals urgent need: OpenAI reported over one million weekly conversations in which users demonstrated suicidal intent, illustrating the massive scale at which people turn to AI for mental health support. Current models lack clinical nuance: the National Eating Disorders Association shut down its Tessa chatbot after it produced adverse effects on users. Form.ai partners with Cleveland Clinic and Mount Sinai to embed medical expertise into health-related AI evaluations.

Notable Moment

Goldfarb recounts a podcaster who insisted that GPT-3 had proved mind-body separation, citing controversial Arizona experiments. The episode shows how question phrasing biases LLM responses: the same model gave opposite answers depending on how the question was framed. Because these models are probability distributions rather than truth engines, treating their outputs as authoritative rather than probabilistic creates a dangerous sycophancy problem.

More from Eye on AI

#338 Amith Singhee: Can India Catch Up in AI? IBM's Amith Singhee on What It Will Take

#337 Debdas Sen: Why AI Without ROI Will Die (Again)

#336 Professor Mausam: Why India Is Losing the AI Race and What It Will Take to Catch Up

#335 Sriram Raghavan: Why IBM Is Betting Everything on Small AI Models

#334 Abhishek Singh: The $1.2 Billion Plan to Turn India Into an AI Superpower