How to train your data | The Vergecast

Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quantities. Alex Reisner, a staff writer at The Atlantic who has been investigating training data, explains how AI companies get all this data, why they'd really prefer you not know what's in it, and whether training data could ever be a fair trade.

00:00 Intro
01:02 90 Seconds on The Verge
03:18 Why Training Data Matters
08:43 Common Crawl and Filtering
11:51 Academia and Data Laundering
15:37 YouTube as Data Mine
20:01 Synthetic Data Myth
21:59 Paying Creators for Data
23:13 Wrap Up and Credits

If you buy something from a Verge link, Vox Media may receive a commission without exerting any influence on editorial content.