OpenAI has acknowledged the essential role of copyrighted material in the development of AI tools such as ChatGPT.
The company made the statement in its submission to the UK House of Lords Communications and Digital Select Committee inquiry into large language models, arguing that copyrighted content is crucial to developing and improving models like ChatGPT.
AI models like ChatGPT and the image generator DALL-E gain their capabilities through training on extensive datasets that may include content scraped from the public internet without explicit permission from rights holders. While some of OpenAI’s training content is licensed, the acknowledgment draws attention to a practice that has long been standard in academic machine learning research. The commercialization of deep learning models, however, has intensified scrutiny of this approach and fueled debate about the ethical and legal aspects of using data without explicit authorization.
“Because copyright today covers virtually every sort of human expression, including blogposts, photographs, forum posts, scraps of software code, and government documents, it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.
Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”
