Dataset Contamination

程序员的那些事· 2025-08-26 12:35

Core Viewpoint - The recent release of DeepSeek V3.1 improves reasoning efficiency and memory usage, but it also introduces an unexpected defect: stray tokens such as "极" and "extreme" appear at random during text generation [1][2][25].

Group 1: Version Improvements
- DeepSeek V3.1 features a hybrid reasoning architecture that improves reasoning efficiency by 20%-50% and supports 128K long-context processing [1].
- The update adopts the UE8M0 FP8 parameter precision format, resulting in a 75% reduction in memory usage [1].
- The model is now compatible with next-generation domestic (Chinese-made) chips, reducing reliance on imported GPUs [1].

Group 2: User Feedback and Issues
- Users report that the V3.1 model randomly inserts unexpected tokens such as "极" and "extreme" during text generation [2][12].
- The issue has been observed on third-party APIs such as VolcEngine and on the official DeepSeek website, with a higher occurrence rate on third-party platforms [12][15].
- Developers are puzzled that the model does not stop emitting these tokens even when prompted to [3][12].

Group 3: Technical Analysis
- Some technical analysts suggest that the token "极" (token ID: 2577) may be a residue of the training datasets, which would point to a flaw in the data cleaning process [25][26].
- Because of how "极" appeared in the training data, the model may have learned to treat it as a semantic boundary marker, which would explain why it surfaces at random in outputs [25][26].
- The issue reflects a broader concern that large models may not genuinely understand language and may instead be learning statistical patterns from their data [27][28].
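
For developers who want to screen model output for the stray token described above, the following is a minimal sketch, not DeepSeek's or any vendor's actual fix. It rests on an assumption that is not in the report: that a "极" with no Chinese character on either side (for example, one embedded in code or English text) is more likely to be a stray insertion than a "极" inside a Chinese sentence. The names `SUSPECT_CHARS`, `is_cjk`, and `find_stray_tokens` are hypothetical.

```python
# Hypothetical watch list: the single-character token named in the report.
SUSPECT_CHARS = ("极",)


def is_cjk(ch: str) -> bool:
    """Return True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return len(ch) == 1 and "\u4e00" <= ch <= "\u9fff"


def find_stray_tokens(text: str, suspects=SUSPECT_CHARS):
    """Return (index, context) pairs for suspect characters that have no
    CJK neighbor on either side.

    Assumption (heuristic only): a suspect character isolated from other
    Chinese text is treated as a likely stray insertion.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch not in suspects:
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if not is_cjk(prev_ch) and not is_cjk(next_ch):
            # Keep up to 10 characters of context on each side for review.
            context = text[max(0, i - 10): i + 11]
            hits.append((i, context))
    return hits


if __name__ == "__main__":
    sample = "timeout = 5 * time.Se极cond"
    for index, context in find_stray_tokens(sample):
        print(f"possible stray token at {index}: {context!r}")
```

A check like this only flags candidates for human review. It cannot catch the "extreme" variant, which is indistinguishable from ordinary English text without further context, and it will miss a stray "极" that happens to land next to legitimate Chinese text.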