AI Training Data Copyright
Nvidia Sued: Has Training Large Models on Pirated Data Become an Unwritten Industry Rule?
Xin Lang Cai Jing· 2026-02-08 09:51
Core Viewpoint
- Nvidia is facing a class-action lawsuit over copyright infringement related to its use of data from "shadow libraries" to train its AI models, specifically the NeMo Megatron framework, which allegedly incorporates copyrighted works without permission [3][18]

Group 1: Lawsuit Details
- The lawsuit was filed by five authors who claim Nvidia used a dataset from illegal "shadow libraries" to develop its next-generation language model [3][18]
- Nvidia filed a motion on January 31, 2026, arguing that the plaintiffs failed to provide sufficient evidence of infringement and asserting that its actions fall under "fair use" [4][18]
- A hearing is scheduled for April 2, 2026, to review Nvidia's motion [4]

Group 2: Competitive Pressure
- Internal records indicate that Nvidia, under competitive pressure from OpenAI, acquired millions of pirated books from shadow libraries in order to showcase its technology at its 2023 developer conference [19][20]
- The lawsuit alleges that Nvidia provided clients with tools and scripts to facilitate downloading of pirated datasets [19]

Group 3: Data Sources
- Nvidia's NeMo Megatron models were reportedly trained on The Pile dataset, which includes a subset called Books3 sourced from the shadow library Bibliotik and containing approximately 190,000 books [21][22]
- Nvidia is also accused of collaborating directly with the largest shadow library, Anna's Archive, to access millions of pirated books totaling around 500TB of data [24][22]

Group 4: Industry Context
- The rise of AI has led to increased litigation over training-data copyright, with companies such as OpenAI, Anthropic, and Meta facing similar lawsuits [20][28]
- The intensifying competitive landscape and the need for high-quality training data drove Nvidia toward shadow libraries, which offer easy access to vast amounts of text [21][27]

Group 5: Legal Precedents
- Previous cases have produced significant settlements; Anthropic agreed to pay at least $1.5 billion to settle a copyright infringement lawsuit, potentially a record for copyright damages [20][28]
- Courts have also ruled on the fair use of copyrighted works for AI training, with some decisions holding that such use can qualify as fair use under certain conditions [29][30]
Writers Sue Six AI Giants Including OpenAI
36Kr · 2025-12-23 11:56
Core Viewpoint
- A group of writers led by Pulitzer Prize winner John Carreyrou has filed a class-action lawsuit against six major AI companies (OpenAI, Google, Meta, Anthropic, xAI, and Perplexity AI), accusing them of "willful infringement" by training their models on pirated books [1]

Group 1: Allegations and Legal Context
- The lawsuit centers on a "dual infringement chain": the six companies allegedly downloaded millions of pirated books from illegal shadow libraries such as LibGen and Z-Library, then used the works to train large language models for commercial purposes, creating an illegal closed loop of "pirated acquisition - model training - commercial monetization" [1]
- If the jury finds the infringement willful, each infringing work could result in damages of up to $150,000 [2]
- OpenAI has faced at least 14 copyright lawsuits, making it a frequent target in the industry [2]

Group 2: Previous Legal Issues
- The New York Times has previously sued Microsoft and OpenAI for copyright infringement, claiming that millions of its articles were used to train AI models such as Microsoft Copilot and ChatGPT, and seeking billions of dollars in damages as well as the destruction of any AI models trained on its copyrighted materials [2]
- In June 2025, OpenAI announced it was appealing a court order in The New York Times litigation requiring indefinite retention of consumer data, arguing that the demand violates its user privacy commitments [2]
- The New York Times also issued a "cease and desist" notice to Perplexity AI, demanding that it stop accessing and using Times content [2]

Group 3: Industry-Wide Implications
- Google received a cease-and-desist letter from Disney for allegedly copying a large volume of copyrighted works for AI development [3]
- Meta has faced multiple infringement warnings from Hollywood studios regarding its model-training data [3]
- Anthropic was notably ordered to pay $1.5 billion in a settlement for using pirated books to train its Claude model, with a court ruling that "pirated data is not subject to fair use" [3]
- The U.S. District Court for the Northern District of California has accepted 25 AI copyright cases, over half of similar cases nationwide, and its rulings could set key precedents for the legality of AI training data [3]
New Study Adds Hard Evidence to OpenAI Copyright Dispute: Training-Data "Memorization" Tracing Technique May Prove Key in Lawsuits
Huan Qiu Wang · 2025-04-06 02:36
Core Insights
- A joint study by the University of Washington, the University of Copenhagen, and Stanford University provides new evidence regarding OpenAI's alleged unauthorized use of copyrighted content to train its AI models [1][3]
- The research introduces a novel method for identifying the training-data sources of AI models served via API, potentially escalating legal disputes between OpenAI and copyright holders [1][3]

Group 1: Research Findings
- The research team developed a technique that analyzes specific patterns in AI-generated content to trace the sources of its training data [3]
- The method can detect whether models such as OpenAI's have "memorized" unique segments of copyrighted works, overcoming limitations of traditional copyright-detection techniques [3]
- The findings give copyright holders a new legal tool to demonstrate infringement by OpenAI's models more precisely [3]

Group 2: Legal Implications
- Since 2023, OpenAI has faced multiple class-action lawsuits from copyright holders, including writers and programmers, accusing the company of using copyrighted works without permission to train its AI models [3]
- OpenAI has defended itself by citing the "fair use" doctrine, but plaintiffs argue that U.S. copyright law contains no exemption for AI training data [3]
- The study's results pose a significant challenge to OpenAI's defense, as copyright holders may use the technique to prove that their works were directly used in training [3]

Group 3: Industry Impact
- The research team emphasizes that the technology is not intended for speculative "fishing" enforcement, but rather to provide objective evidence in copyright disputes [4]
- Widespread adoption of the technique could force greater transparency about the training-data sources of AI companies such as OpenAI [4]
- That shift may disrupt the existing compliance framework for AI training, as companies have long relied on vast amounts of scraped data for model training [4]
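The core intuition behind the memorization-tracing research described above can be sketched as a simple probe: mask a rare, high-surprisal word in a candidate passage and ask the model to fill the blank. A model that recovers such a word exactly is unlikely to be guessing from context alone, which hints the passage was seen during training. The sketch below is purely illustrative (the helper names, toy frequency table, and stubbed "models" are all hypothetical; the study's actual statistical method is more sophisticated):

```python
import re

def rarest_word(passage, corpus_freq):
    """Return the lowest-frequency word in the passage; rare words carry
    the most signal, since a model is unlikely to guess them from context."""
    words = re.findall(r"[A-Za-z']+", passage)
    return min(words, key=lambda w: corpus_freq.get(w.lower(), 0))

def memorization_probe(passage, complete_fn, corpus_freq):
    """Mask the rarest word and ask the model (via complete_fn, standing in
    for an API call) to fill the blank. An exact match on a high-surprisal
    word is weak evidence the passage appeared in the training data."""
    target = rarest_word(passage, corpus_freq)
    prompt = passage.replace(target, "____", 1)
    guess = complete_fn(prompt)
    return guess.strip().lower() == target.lower()

# Toy demonstration with stubbed "APIs": one model has memorized the
# passage, the other falls back to a generic completion.
corpus_freq = {"the": 1000, "ship": 40, "sailed": 30, "past": 200,
               "zanzibar": 1, "at": 900, "dawn": 25}
passage = "The ship sailed past Zanzibar at dawn"

def memorizing_model(prompt):
    return "Zanzibar" if prompt == "The ship sailed past ____ at dawn" else ""

def naive_model(prompt):
    return "London"
```

In practice a single probe proves nothing; the evidentiary value comes from aggregating many such probes across a work and comparing the model's hit rate against what chance and context would predict.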