OpenAI has been accused by multiple parties of training its AI models on copyrighted content without permission. A recent paper from the AI Disclosures Project suggests that OpenAI may have trained its GPT-4o model on paywalled content from O’Reilly Media, which has no licensing agreement with the company. The paper reports that GPT-4o shows significantly stronger recognition of O’Reilly’s paywalled books than OpenAI’s earlier model, GPT-3.5 Turbo.
The researchers used DE-COP, a type of membership inference attack, which tests whether a model can reliably distinguish a human-written excerpt from AI-generated paraphrases of the same excerpt. If it can, that suggests the model may have seen the text during training. Their findings indicate that GPT-4o likely recognizes content from paywalled O’Reilly books published before its training cutoff date. However, the authors caution that their methodology is not foolproof and acknowledge that OpenAI could have obtained such excerpts from user interactions.
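The scoring behind a quiz-style membership test of this kind can be sketched roughly as follows. This is an illustrative sketch only, not the paper's code: it assumes each quiz mixes one human-written excerpt with several paraphrases, and the names `build_quiz`, `guess_rate`, `chance_rate`, and the `guess` callback (a stand-in for asking a model to pick an option) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A quiz is (shuffled options, index of the human-written excerpt).
Quiz = Tuple[List[str], int]


def build_quiz(original: str, paraphrases: Sequence[str], rng: random.Random) -> Quiz:
    """Mix the human-written excerpt in with its paraphrases in random order."""
    labeled = [(original, True)] + [(p, False) for p in paraphrases]
    rng.shuffle(labeled)
    options = [text for text, _ in labeled]
    answer = next(i for i, (_, is_original) in enumerate(labeled) if is_original)
    return options, answer


def guess_rate(quizzes: Sequence[Quiz], guess: Callable[[List[str]], int]) -> float:
    """Fraction of quizzes in which `guess` picks the human-written excerpt.

    `guess` is a hypothetical stand-in for a model call that returns the
    index of the option it believes is the original text.
    """
    if not quizzes:
        raise ValueError("need at least one quiz")
    hits = sum(1 for options, answer in quizzes if guess(options) == answer)
    return hits / len(quizzes)


def chance_rate(num_options: int) -> float:
    """Expected guess rate if the model picks an option uniformly at random."""
    return 1.0 / num_options
```

A guess rate well above `chance_rate` on excerpts published before the training cutoff, but not on later ones, would be the kind of signal such a test looks for. On its own it does not prove how the text reached the model.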
The paper does not address whether newer models such as GPT-4.5 were trained on O’Reilly’s content. OpenAI has advocated for looser copyright restrictions while seeking high-quality training data, and it has licensing deals with various content providers. The findings add to the ongoing legal challenges OpenAI faces over its data sourcing practices. OpenAI has yet to respond to inquiries about the allegations.