AI Model Training

Just In: DeepSeek's Latest Post Fully Discloses V3/R1 Training Details, Packed with Information
36Kr · 2025-09-01 12:06
Core Viewpoint - Following the implementation of the "Identification Measures for AI-Generated Synthetic Content" by the Cyberspace Administration of China, DeepSeek has responded by marking all AI-generated content with an "AI-generated" label and by disclosing details of its V3/R1 model training process [1][2].

Group 1: Compliance with New Regulations
- DeepSeek has announced that all AI-generated content will be clearly labeled as "AI-generated" to comply with the new regulations [2].
- The company states that users are strictly prohibited from maliciously deleting, altering, or concealing these labels, and from using AI to create or spread false information [2].

Group 2: Technical Disclosure
- DeepSeek has released a document titled "Model Principles and Training Methods," which describes its technical approach [4].
- Model training is divided into a pre-training phase and an optimization training phase, with stages that include data collection and model fine-tuning [6][17].

Group 3: Model Training Details
- The latest DeepSeek V3-0324 model has 685 billion parameters in total, which are optimized through gradient descent during training [15].
- In the pre-training phase, the model learns general language understanding and generation capabilities from publicly available internet data and licensed third-party data. DeepSeek says no personal information is intentionally used [21].
- The optimization training phase involves constructing and annotating question-answer pairs. Some of this data may be based on user input, which is protected through encryption and anonymization [22][23].

Group 4: Model Deployment and Functionality
- Once training is complete, the model enters the inference phase, where it generates text and performs various tasks based on user input [25].
- DeepSeek emphasizes that the model does not store the original training data; it generates responses based on its learned understanding of language structure and semantics [27].
- The company has open-sourced its models, which users can freely download and deploy under the permissive MIT license [28].

Group 5: Addressing Limitations and Risks
- DeepSeek acknowledges the limitations of AI, including "hallucination," where the model generates incorrect or misleading content [30][31].
- The company is applying technical measures to reduce the hallucination rate, including high-quality training data and alignment strategies, although complete elimination is not currently feasible [32].
- DeepSeek has established internal risk management protocols and user rights: users can opt out of having their data used for model training and can delete their historical data [37][38].
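
The summary above notes that model parameters are optimized through gradient descent. As a rough illustration of that update rule only, here is a toy sketch on a one-parameter loss. It is not DeepSeek's training code: the function name `gradient_descent`, the learning rate, and the step count are assumptions chosen for the example.

```python
from typing import Callable


def gradient_descent(
    grad: Callable[[float], float],
    x0: float,
    lr: float = 0.1,
    steps: int = 100,
) -> float:
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Core update: move the parameter a small step downhill.
        x = x - lr * grad(x)
    return x


# Toy loss f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 6))
```

Training a large language model applies the same idea to billions of parameters at once, with gradients computed by backpropagation over batches of text.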
News Flash | O'Reilly Accuses OpenAI of "Stealing Books" to Train GPT-4o; AI's Data Black Box Caught in Another Copyright Storm
Z Potentials · 2025-04-02 03:17
Core Viewpoint - OpenAI is accused of using copyrighted content from O'Reilly Media's paywalled books to train its AI models without authorization, raising concerns about copyright infringement and data sourcing practices [1][2][5].

Group 1: Allegations and Findings
- A new paper from the AI Disclosure Project suggests that OpenAI likely used O'Reilly Media's paywalled books to train its GPT-4o model, although the two parties have no licensing agreement [2].
- The paper reports that GPT-4o recognizes O'Reilly's paywalled content significantly better than the earlier GPT-3.5 Turbo model does [2][3].
- The researchers analyzed 13,962 excerpts from 34 O'Reilly books and found that GPT-4o's recognition rate for paywalled content was notably higher than that of older OpenAI models [3].

Group 2: Methodology and Limitations
- The study used DE-COP, a method designed to detect copyrighted content in language model training data by testing whether the model can distinguish human-written text from AI-generated rewrites of it [2][3].
- The authors acknowledge that their findings are not definitive proof, since OpenAI could have obtained the excerpts through users pasting them into ChatGPT [4].
- The research did not evaluate OpenAI's latest models, such as GPT-4.5 and its reasoning models, which may not have been trained on paywalled O'Reilly content [4].

Group 3: Industry Context and Practices
- OpenAI has advocated for looser restrictions on using copyrighted data for model development and has sought higher-quality training data [4].
- The company has signed licensing agreements with various publishers and social networks for some of its training data, and it provides a mechanism for copyright holders to opt out of having their content used [4].
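
As summarized above, DE-COP tests whether a model can tell an original human-written passage apart from AI-generated rewrites. The following is a minimal sketch of that quiz-style scoring under stated assumptions. The names `recognition_rate`, `choose`, and `memorizing_chooser`, and the sample passages, are hypothetical and invented for this example. The stand-in chooser replaces a real model call, and the paper's actual prompts and statistics are not reproduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def recognition_rate(
    items: Sequence[Tuple[str, Sequence[str]]],
    choose: Callable[[List[str]], int],
    seed: int = 0,
) -> float:
    """Fraction of quizzes in which `choose` picks the original passage."""
    rng = random.Random(seed)
    correct = 0
    for original, rewrites in items:
        # Mix the original with its rewrites so position gives nothing away.
        options = [original, *rewrites]
        rng.shuffle(options)
        if options[choose(options)] == original:
            correct += 1
    return correct / len(items)


# Hypothetical sample data: (original passage, AI-generated rewrites).
items = [
    ("original passage one", ["rewrite A of one", "rewrite B of one"]),
    ("original passage two", ["rewrite A of two", "rewrite B of two"]),
]
seen = {original for original, _ in items}


def memorizing_chooser(options: List[str]) -> int:
    # Stand-in for a model that was exposed to the originals (an assumption).
    for index, text in enumerate(options):
        if text in seen:
            return index
    return 0


rate = recognition_rate(items, memorizing_chooser)
chance = 1 / 3  # three options per quiz in this sample data
print(rate, chance)
```

A rate well above the chance level suggests, but does not prove, that the model was exposed to the originals, which matches the authors' own caveat about their findings.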