Nvidia Sued: Has Training Large Models on Pirated Data Become an Unspoken Industry Rule?

Core Viewpoint - Nvidia is facing a class-action lawsuit alleging copyright infringement over its use of data from "shadow libraries" to train its AI models, specifically the NeMo Megatron framework, whose training data allegedly included copyrighted works used without permission [3][18].

Group 1: Lawsuit Details
- The lawsuit was filed by five authors who claim Nvidia used a dataset from illegal "shadow libraries" to develop its next-generation language model [3][18].
- Nvidia filed a motion on January 31, 2026, arguing that the plaintiffs failed to provide sufficient evidence of infringement and asserting that its actions fall under "fair use" [4][18].
- A hearing is scheduled for April 2, 2026, to review Nvidia's motion [4].

Group 2: Competitive Pressure
- Internal records indicate that Nvidia faced competitive pressure from OpenAI, prompting it to acquire millions of pirated books from shadow libraries to showcase its technology at the 2023 developer conference [19][20].
- The lawsuit also alleges that Nvidia provided tools and scripts to clients to facilitate the downloading of pirated datasets [19].

Group 3: Data Sources
- Nvidia's NeMo Megatron models were reportedly trained on The Pile dataset, which includes a subset called Books3 sourced from the shadow library Bibliotik, containing approximately 190,000 books [21][22].
- Nvidia is accused of directly collaborating with the largest shadow library, Anna's Archive, to access millions of pirated books, totaling around 500TB of data [24][22].

Group 4: Industry Context
- The rise of AI has led to increased litigation over training data copyright issues, with other companies such as OpenAI, Anthropic, and Meta also facing similar lawsuits [20][28].
- Competition has intensified, and Nvidia's need for high-quality training data reportedly drove it to engage with shadow libraries, which offer easier access to vast amounts of data [21][27].

Group 5: Legal Precedents
- Previous cases have ended in significant settlements, such as Anthropic agreeing to pay at least $1.5 billion to settle a copyright infringement lawsuit, potentially setting a record for copyright damages [20][28].
- Courts have ruled on the fair use of copyrighted works for AI training, with some cases determining that such use can qualify as fair use under certain conditions [29][30].