The Vergecast

How to train your data

June 25, 2026

26 min episode · 2 min read

Alex Reisner

Topics

Productivity, Leadership, Artificial Intelligence

AI-Generated Summary

Published Jun 26, 2026

Key Takeaways

✓ Training data defines model capability: A model's output is determined almost entirely by what it trains on, not just by its architecture. A music model trained on 1950s jazz produces 1950s jazz; one trained on hip-hop produces hip-hop. Reisner argues the most accurate name for any AI model is a description of its training dataset, not a product name like GPT or Claude.
✓ Competitive secrecy masks legal exposure: AI companies cite competitive advantage as the reason training data stays secret, but a second motive is equally significant: much of the data was scraped without creator consent. Authors, musicians, and video creators frequently discover their work was used only after the fact, and companies prefer to avoid that conversation rather than address it proactively.
✓ Data laundering via nonprofits and universities: AI companies routinely route data acquisition through academic institutions and nonprofits, such as Common Crawl and the European organization LAION, which holds 12 million songs scraped from YouTube. This lets companies claim distance from the scraping while funding those organizations directly and benefiting from the resulting datasets.
✓ YouTube is the dominant scraping target due to weak technical barriers: YouTube appears repeatedly across AI training datasets because download tools work reliably and easily, and because it lacks the digital rights management protections used by platforms like Spotify. YouTube states that scraping violates its terms of service but has taken no effective technical or legal action to stop it over multiple years.
✓ Synthetic data does not replace human-generated content: Research on "model collapse" demonstrates that training AI models on their own outputs causes rapid quality degradation. Reisner says AI companies deliberately use vague language when claiming synthetic data works. The next actual data frontier is paying human creators directly: one company has already paid creators over $10 million to produce content specifically for AI training.

What It Covers

Atlantic staff writer Alex Reisner joins The Vergecast to examine AI training data — what it is, where it comes from, and why companies guard it so closely. The conversation covers Common Crawl, YouTube as a data source, music datasets, synthetic data limitations, and the emerging paid creator economy for AI training.

Notable Moment

Reisner flatly contradicts a widely held industry assumption: when asked whether synthetic data represents the next training frontier, he says the evidence shows the opposite. Models trained on their own outputs degrade quickly because AI functions as an averaging machine, and human-generated content contains qualities that AI outputs simply do not replicate.
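
The "averaging machine" point can be illustrated with a toy simulation. The sketch below is not from the episode or from any research it mentions. It assumes a deliberately simplified stand-in for a model: a Gaussian fitted to data, where each new generation is fitted only to samples drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=20, seed=0):
    """Toy illustration: refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation
    (index 0 is the fit to the original "human" data).
    """
    rng = random.Random(seed)

    # Generation 0: "human" data, assumed here to come from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    mean, std = statistics.fmean(data), statistics.pstdev(data)
    stds = [std]

    for _ in range(generations):
        # Each new "model" sees only samples produced by the previous model.
        data = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean, std = statistics.fmean(data), statistics.pstdev(data)
        stds.append(std)

    return stds


if __name__ == "__main__":
    stds = simulate_collapse()
    for gen in (0, 10, 50, 100, 300):
        print(f"generation {gen:3d}: fitted std = {stds[gen]:.6f}")
```

In this toy setup the fitted spread tends to shrink across generations, because each fit slightly underestimates the variety in its input and the errors compound. Real generative models are far more complex, so this is only an intuition for why output trained on output can lose diversity, not a reproduction of the research Reisner refers to.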
