Large Model Training Data Issues

Ads inserted into code, with Tencent's Codebuddy and others taking the blame? The DeepSeek "极你太美" incident: can other models escape it either?
AI前线 · 2025-08-27 05:42
Core Viewpoint
- The article discusses a bug in the DeepSeek V3.1 model that causes unexpected tokens, particularly the character "极", to appear in generated code, leading to user frustration and confusion [2][4][15].

Group 1: Bug Discovery and User Reactions
- Users reported issues with Tencent's Codebuddy and ByteDance's Trae, where the DeepSeek model introduced unexpected tokens into the code, prompting some to uninstall the applications [2][4].
- Users humorously nicknamed the bug the "极你太美" incident, highlighting how widespread the issue was [8].
- Some users noted that the bug was reproducible on official APIs but less frequent on third-party platforms [7][8].

Group 2: Technical Analysis of the Bug
- Developers have speculated that the bug originates from the DeepSeek V3.1 model itself, with suggestions that it may be linked to pre-training data or the model's architecture [15][19].
- Various hypotheses were proposed regarding the cause of the bug, including token continuity issues, data contamination during training, and problems with multi-token prediction [15][20].
- The presence of the character "极" in outputs has been attributed to the model's training data, which may have included noisy or unclean data [19][20].

Group 3: Broader Implications and Community Response
- The article emphasizes the importance of data quality in model training, suggesting that flaws in the training process can lead to significant issues in model outputs [20].
- Developers and users expressed a collaborative spirit in addressing the bug, indicating a community-driven approach to problem-solving in AI development [20].
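As a rough illustration of how a tool consuming model output might guard against the stray-token symptom described above, the following minimal sketch flags lines of generated code where the character "极" appears in an otherwise ASCII-only line. The heuristic, the function name `find_suspect_lines`, and the sample input are assumptions made for illustration. They are not taken from the article or from any vendor's actual fix.

```python
SUSPECT = "极"  # the character reported in the article; adjust as needed


def find_suspect_lines(code: str, suspect: str = SUSPECT) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs where `suspect` looks out of place.

    Heuristic (an assumption, not the article's method): a line is flagged
    when it contains `suspect` and every other character on it is ASCII.
    Lines that legitimately contain other non-ASCII text, such as Chinese
    comments or string literals, are skipped.
    """
    hits = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        if suspect not in line:
            continue
        # Remove the suspect character and check what remains on the line.
        remainder = line.replace(suspect, "")
        if remainder.isascii():
            hits.append((lineno, line))
    return hits


if __name__ == "__main__":
    # Hypothetical generated snippet with one stray token on line 2.
    sample = 'import time\ntime.sleep(极1)\nprint("极限")  # 注释\n'
    for lineno, line in find_suspect_lines(sample):
        print(f"line {lineno}: {line}")
```

A check like this can only surface candidates for human review. It will miss a stray token on a line that already contains other Chinese text, and it will flag legitimate lines whose only non-ASCII character is "极".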