Core Viewpoint
- Meta has been internally discussing the use of copyrighted works obtained through questionable means for training its AI models, raising significant legal and ethical concerns in ongoing copyright disputes [1][2].

Group 1: Internal Discussions and Strategies
- Internal communications reveal that Meta employees, including senior management, acknowledged the potential legal issues of using copyrighted materials for AI training [3][4].
- A Meta research engineer suggested acquiring ebooks at retail prices instead of negotiating licensing deals, indicating a willingness to take risks in the face of legal challenges [5][6].
- Discussions included the possibility of using Libgen, a site known for providing access to copyrighted works, despite its legal controversies, highlighting concerns about competitiveness in the AI sector [7][8][9].

Group 2: Legal Mitigations and Data Sources
- Meta's strategy to mitigate legal risks involved removing clearly marked pirated data from training sets and not publicly disclosing the use of such datasets [10][11].
- The company has also been tuning its models to avoid generating responses that could reveal the use of copyrighted materials, indicating a proactive approach to legal compliance [11].
- There are indications that Meta may have scraped data from platforms like Reddit for model training, which could lead to further legal scrutiny as Reddit plans to charge for data access [11][12].

Group 3: Legal Proceedings and Implications
- The ongoing case Kadrey v. Meta has seen multiple amendments, with allegations that Meta cross-referenced pirated books with licensed works to evaluate potential licensing agreements [14].
- To bolster its defense, Meta has engaged two Supreme Court litigators, underscoring the high stakes involved in the legal proceedings [15].
Court filings show Meta staffers discussed using copyrighted content for AI training