Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quantities. Alex Reisner, a staff writer at The Atlantic who has been investigating training data, explains how AI companies get all this data, why they'd really prefer you not know what's in it, and whether training data could ever be a fair trade.
00:00 Intro
01:02 90 Seconds on The Verge
03:18 Why Training Data Matters
08:43 Common Crawl and Filtering
11:51 Academia and Data Laundering
15:37 YouTube as Data Mine
20:01 Synthetic Data Myth
21:59 Paying Creators for Data
23:13 Wrap Up and Credits
If you buy something from a Verge link, Vox Media may receive a commission without exerting any influence on editorial content. For more information about our ethics policy, visit: