Apple, NVIDIA, and Anthropic Used YouTube Transcripts Without Permission to Train AI Models
An investigation by Proof News has revealed that some of the world's largest tech companies, including Apple, NVIDIA, and Anthropic, trained their AI models on transcripts from more than 173,000 YouTube videos without permission. The transcripts, drawn from over 48,000 YouTube channels, were part of a dataset created by the nonprofit EleutherAI.
What Happened?
The investigation found that the dataset included transcripts from popular YouTube creators like Marques Brownlee and MrBeast, as well as major news publishers such as The New York Times, the BBC, and ABC News. Although the dataset contained no actual video or images, the subtitles from these videos served as training data for AI models.
Marques Brownlee responded on X, expressing concern that his content had been used without permission and noting that the issue is likely to recur as AI technology evolves.
YouTube’s Stance
A Google spokesperson reiterated previous statements by YouTube CEO Neal Mohan, emphasizing that using YouTube data to train AI models without permission violates the platform’s terms of service.
Lack of Transparency
The investigation highlights a broader issue with AI companies not being transparent about the sources of their training data. Earlier this month, artists and photographers criticized Apple for not disclosing the data sources for its generative AI, Apple Intelligence.
YouTube, the world's largest video platform, holds a wealth of transcripts, audio, video, and images, making it a prime target for AI training datasets. This has raised significant ethical and legal questions. For instance, OpenAI CTO Mira Murati declined to say whether YouTube videos were used to train Sora, the company's AI video-generation tool. Alphabet CEO Sundar Pichai has also acknowledged that using YouTube data for AI training without permission would breach the platform's terms of service.
The report underscores the need for clearer guidelines and greater transparency around AI training data, especially when that data consists of content created by others and used without their consent.