News Flash | O'Reilly Accuses OpenAI of "Stealing Books" to Train GPT-4o; AI's Data Black Box Caught in Another Copyright Storm

Core Viewpoint
- OpenAI is accused of using copyrighted content from O'Reilly Media's paywalled books to train its AI models without authorization, raising concerns about copyright infringement and data sourcing practices [1][2][5].

Group 1: Allegations and Findings
- A new paper from the AI Disclosure Project suggests that OpenAI likely used O'Reilly Media's paywalled books to train its GPT-4o model, with no licensing agreement in place between the two entities [2].
- The paper indicates that GPT-4o recognizes O'Reilly's paywalled content at a significantly higher rate than the earlier GPT-3.5 Turbo model [2][3].
- Researchers analyzed 13,962 excerpts from 34 O'Reilly books and found that GPT-4o's recognition rate for paywalled content was notably higher than that of older OpenAI models [3].

Group 2: Methodology and Limitations
- The study employed a method called DE-COP, designed to detect copyrighted content in language model training data. It tests whether the model can distinguish human-written text from AI-generated rewrites of the same text [2][3].
- The authors acknowledge that their findings do not constitute definitive proof, as OpenAI could have obtained excerpts through user interactions with ChatGPT [4].
- The research did not evaluate OpenAI's latest models, such as GPT-4.5 and reasoning models, which may not have been trained on paywalled O'Reilly content [4].

Group 3: Industry Context and Practices
- OpenAI has advocated for relaxed restrictions on using copyrighted data for model development and has sought higher-quality training data [4].
- The company has entered into licensing agreements with various publishers and social networks for some of its training data, and it provides a mechanism for copyright holders to opt out of having their content used [4].
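
The DE-COP test described under Group 2 can be pictured as a multiple-choice quiz: the model sees one original excerpt mixed in with AI-generated rewrites and is asked to pick the original. The following is only a minimal sketch of that idea, not the paper's implementation. The `choose` callable is a hypothetical stand-in for a real model query, `TOY_ITEMS` is invented placeholder text rather than O'Reilly content, and the scoring here is a plain accuracy figure compared against random guessing.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One quiz item: (original human-written excerpt, AI-generated rewrites of it).
QuizItem = Tuple[str, List[str]]


def guess_rate(
    items: Sequence[QuizItem],
    choose: Callable[[List[str]], int],
    seed: int = 0,
) -> float:
    """Fraction of items where `choose` picks the original excerpt.

    `choose` is a hypothetical stand-in for a model call: it receives the
    shuffled options and returns the index of the one it believes is original.
    """
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)  # fixed seed so option order is reproducible
    correct = 0
    for original, rewrites in items:
        options = [original, *rewrites]
        rng.shuffle(options)  # hide the original's position
        picked = choose(options)
        if options[picked] == original:
            correct += 1
    return correct / len(items)


def chance_rate(n_options: int) -> float:
    """Expected guess rate for a model that picks uniformly at random."""
    return 1.0 / n_options


# Invented toy data for illustration only.
TOY_ITEMS: List[QuizItem] = [
    ("alpha original", ["alpha rewrite 1", "alpha rewrite 2", "alpha rewrite 3"]),
    ("beta original", ["beta rewrite 1", "beta rewrite 2", "beta rewrite 3"]),
]

# A toy "model" that has memorized every original excerpt.
KNOWN = {original for original, _ in TOY_ITEMS}


def memorizing_model(options: List[str]) -> int:
    for index, text in enumerate(options):
        if text in KNOWN:
            return index
    return 0  # fall back to the first option if nothing is recognized


if __name__ == "__main__":
    rate = guess_rate(TOY_ITEMS, memorizing_model)
    print(f"guess rate {rate:.2f} vs chance {chance_rate(4):.2f}")
```

A guess rate well above `chance_rate` on paywalled excerpts is the kind of signal the researchers report for GPT-4o. As the authors themselves note, such a signal does not prove how the text reached the model.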