The Vergecast

Millions of books died so Claude could live

88 min episode · 3 min read

Topics

Books & Authors

AI-Generated Summary

Key Takeaways

  • AI Training Data Acquisition: Anthropic first downloaded pirated shadow libraries like LibGen, then launched Project Panama to digitize physical books through destructive scanning. The company hired Tom Turvey from Google Books, purchased hundreds of thousands of used books at bulk prices from warehouses like Better World Books, sliced off the spines with hydraulic cutting machines, and rapidly scanned the loose pages to produce training data for Claude.
  • Books as Quality Training Material: AI companies prioritize books over other content sources because published works are higher-quality, vetted material with coherent sentence structure and fact-checking. Anthropic viewed books as a competitive advantage for catching up with larger rivals like OpenAI and Google. Claude's reputation as the best writing chatbot may stem from this book-heavy training approach.
  • Legal Fair Use Paradox: Two judges ruled that training AI models on books constitutes fair use, but companies still face liability for how they acquired the books in the first place. Anthropic settled for $1.5 billion over books it downloaded from pirate libraries but never used in commercial models, while the training process itself was deemed legally acceptable. The counterintuitive result is that the use can be lawful even when the acquisition was not.
  • Theatrical Revenue Decline Drivers: CivicScience surveyed 2,000 moviegoers and found that lack of interest in the types of movies available ranked as the top reason people avoid theaters, with cost ranking second. The average moviegoer now sees fewer films per month than in the early 1990s, while the supply of theatrical releases has steadily decreased, so it is unclear whether releasing more films would increase attendance.
  • Nostalgia Screening Strategy: Studios could fill theatrical gaps by re-releasing beloved films like Mean Girls or The Nightmare Before Christmas, which perform well in limited runs at minimal re-release cost. This approach mirrors streaming's use of catalog content like Friends to maintain subscriber lifetime value. It lets exhibitors cover operating costs while studios reserve expensive new productions for proven blockbuster opportunities.

What It Covers

The Vergecast examines how Anthropic and other AI companies train models on millions of books, from shadow-library downloads to the destructive scanning of Anthropic's Project Panama. The episode also explores Netflix's theatrical strategy amid the Warner Bros. Discovery acquisition, asking whether movie theaters can survive on nostalgia screenings and alternative programming rather than traditional releases.

Key Questions Answered

  • IKEA Smart Home Thread Problems: IKEA's $6 Bilresa buttons bring Thread to the mass market but expose failures across the ecosystem. Google Home still does not support Matter buttons despite years of requests, and Amazon Thread networks cannot merge with other Thread border routers. Initial pairing issues and network disconnections plague IKEA's first wave of Thread devices, forcing users to troubleshoot across multiple platforms.

Notable Moment

Will Oremus discovered internal documents showing an Anthropic executive previously downloaded the entire LibGen pirated book library while at OpenAI, then repeated the same process after cofounding Anthropic. The documents included browser screenshots with torrent sites open and LibGen partially downloaded, demonstrating how AI companies systematically used piracy as their starting point before developing physical book scanning operations.
