Garbage in, garbage out
Binge enough junk and AI gets dumber too: "the most unsettling paper of the year"
36Kr · 2025-11-17 00:36
Core Insights
- The article discusses the phenomenon of "Brain Rot" in AI: exposure to low-quality data can lead to irreversible cognitive decline in large language models (LLMs) [1][4][11].

Group 1: Research Findings
- A recent study found that training LLMs on low-value Twitter data produced a 23% decrease in reasoning ability and a 30% decline in long-context memory [4][11].
- The study introduced the "LLM Brain Rot Hypothesis," asking whether LLMs, like humans, suffer cognitive decline when exposed to low-quality data [5][11].
- "Garbage data" was defined as non-malicious low-quality content, such as short, highly popular tweets, and categorized along two dimensions: engagement and semantic quality (a filtering sketch follows this summary) [5][11].

Group 2: Methodology
- The researchers trained four different LLMs on garbage data and on control data, matching token counts so that differences could not be attributed to data volume (see the token-matching sketch below) [7][11].
- A battery of cognitive benchmarks assessed the models' capabilities: ARC for reasoning, RULER for memory and multitasking, and TRAIT for personality traits (an evaluation-harness sketch also follows) [9][10][11].

Group 3: Implications for the Industry
- The study stresses data quality during the pre-training phase, arguing that the industry should treat data selection as a safety issue rather than relying only on post-training alignment [23].
- It recommends routine cognitive assessments for LLMs to catch capability degradation caused by exposure to low-quality data [23].
- The findings indicate that "popularity" can be a more effective signal of low data quality than text length, arguing for the exclusion of short, highly viral content from future training datasets [23].
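The M1 "engagement" split described above lends itself to a simple heuristic: a tweet that is both short and highly popular goes into the garbage bucket, everything else into the control bucket. The sketch below is a minimal illustration; the record fields, the 30-token cutoff, and the popularity threshold are all invented for illustration, not the study's exact criteria.

```python
# Toy M1 split: short + viral == "garbage"; thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    likes: int
    retweets: int

def is_m1_garbage(tweet: Tweet,
                  max_tokens: int = 30,
                  min_popularity: int = 500) -> bool:
    """Classify a tweet as M1 garbage if it is both short and highly engaged."""
    length = len(tweet.text.split())           # crude whitespace token count
    popularity = tweet.likes + tweet.retweets  # simple engagement score
    return length <= max_tokens and popularity >= min_popularity

corpus = [
    Tweet("u won't BELIEVE this 🔥🔥", likes=12_000, retweets=4_000),
    Tweet("A long, careful thread about measurement error in survey data "
          "and why the headline result probably does not replicate.",
          likes=40, retweets=3),
]
garbage = [t for t in corpus if is_m1_garbage(t)]
control = [t for t in corpus if not is_m1_garbage(t)]
print(len(garbage), len(control))  # -> 1 1
```

Matching token counts between the two resulting sets is what lets the study attribute downstream differences to data quality rather than data volume. A minimal sketch of such budget matching, assuming whitespace splitting stands in for the model's real tokenizer:

```python
# Sample documents from a pool until a fixed token budget is (nearly) met,
# so garbage and control training sets carry the same amount of data.
import random

def sample_to_budget(pool: list[str], token_budget: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    picked, used = [], 0
    for doc in shuffled:
        n = len(doc.split())
        if used + n > token_budget:
            continue  # skip documents that would overshoot the budget
        picked.append(doc)
        used += n
    return picked

garbage_pool = ["lol this is wild"] * 1000
control_pool = ["a measured discussion of the evidence and its limits"] * 1000
budget = 2000
garbage_set = sample_to_budget(garbage_pool, budget)
control_set = sample_to_budget(control_pool, budget)
print(sum(len(d.split()) for d in garbage_set),
      sum(len(d.split()) for d in control_set))  # both close to 2000
```

The benchmark battery can then be pictured as a harness that runs each checkpoint over several suites and reports per-suite scores. The sketch below assumes a model is just a prompt-to-answer callable and uses invented stand-in items; the real ARC, RULER, and TRAIT loaders and scoring rules are not shown.

```python
# Generic eval harness in the spirit of the benchmarks named above.
from typing import Callable

Model = Callable[[str], str]

def accuracy(model: Model, items: list[tuple[str, str]]) -> float:
    """Fraction of items where the model's answer matches the reference."""
    correct = sum(model(q).strip() == a for q, a in items)
    return correct / len(items)

def evaluate(model: Model, suites: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    return {name: accuracy(model, items) for name, items in suites.items()}

# Stand-in items; real suites would come from the ARC / RULER / TRAIT datasets.
suites = {
    "ARC":   [("2, 4, 8, 16, ?", "32")],
    "RULER": [("Recall the passkey hidden 10k tokens earlier: ...", "7391")],
}
toy_model: Model = lambda prompt: "32" if "16" in prompt else "unknown"
print(evaluate(toy_model, suites))  # {'ARC': 1.0, 'RULER': 0.0}
```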
Scroll enough short videos and AI gets dumber too! "The most unsettling paper of the year"
量子位 (QbitAI) · 2025-11-16 07:20
Core Insights
- The article discusses the phenomenon of "Brain Rot" in AI: exposure to low-quality data can lead to lasting cognitive decline in large language models (LLMs) [2][13][26].
- Even after retraining with high-quality data, the damage caused by low-quality data cannot be fully repaired, suggesting a permanent cognitive shift [4][26][27].

Research Findings
- The study introduces the "LLM Brain Rot Hypothesis," asking whether LLMs, like humans, suffer cognitive decline when exposed to low-quality data [8][13].
- Two dimensions define "garbage data": M1 captures engagement metrics (short, high-traffic content), while M2 captures semantic quality (clickbait and conspiracy theories) [11][12].
- The models tested showed a 23% decline in reasoning ability and a 30% decrease in long-context memory after exposure to garbage data [6][14].

Cognitive Impact
- The LLMs exhibited decline akin to human "Brain Rot," with significant negative effects on safety and personality traits, driven mainly by M1 data [14][19].
- A dose-effect relationship was observed: the larger the share of garbage data in training, the greater the cognitive damage (see the dose-sweep sketch at the end of this summary) [15].

Repair Attempts
- Attempts to repair the damage through external feedback and large-scale fine-tuning were unsuccessful; the models never regained baseline performance (a recovery-gap sketch follows) [25][26].
- Unlike humans, who can mitigate cognitive decline through various means, the LLMs could not effectively self-correct [24][27].

Industry Implications
- The findings stress data quality during the pre-training phase, arguing that the industry should treat data selection as a safety issue [28].
- Routine cognitive assessments for LLMs, using benchmarks such as ARC and RULER, are recommended to guard against long-term exposure to low-quality data [29].
- Excluding short, high-engagement content from training datasets should be prioritized to protect model performance (see the filter sketch below) [29].
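The dose-effect claim corresponds to a simple experimental sweep: train otherwise-identical models on mixtures with an increasing garbage share and track a benchmark score. In this sketch, `train` and `score` are stubs, and the downward slope is hard-coded purely to illustrate the reported shape of the relationship, not the actual numbers.

```python
# Dose-effect sweep sketch: vary the garbage share, measure one score.
def train(garbage_ratio: float) -> dict:
    return {"garbage_ratio": garbage_ratio}  # stand-in for a trained checkpoint

def score(model: dict) -> float:
    # Pretend benchmark: clean baseline 0.70, degrading with dose (illustrative).
    return 0.70 - 0.25 * model["garbage_ratio"]

for ratio in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"garbage share {ratio:.0%}: score {score(train(ratio)):.2f}")
```

The repair result can be summarized with a single residual-gap metric: how much of the lost capability is still missing after clean fine-tuning. The sketch below uses placeholder scores; only the metric itself, not the numbers, reflects the article's finding that the gap does not close.

```python
# Residual-gap metric for the repair experiment; all scores are placeholders.
def recovery_gap(baseline: float, damaged: float, repaired: float) -> float:
    """Fraction of the lost performance NOT recovered by clean fine-tuning."""
    lost = baseline - damaged
    regained = repaired - damaged
    return 1.0 - (regained / lost if lost > 0 else 1.0)

baseline, damaged, repaired = 0.70, 0.49, 0.58  # illustrative scores only
print(f"residual gap: {recovery_gap(baseline, damaged, repaired):.0%}")
# -> residual gap: 57%
```

The closing recommendation amounts to a pre-training filter keyed on popularity rather than length alone. A minimal sketch, assuming documents carry an `engagement` field from platform metadata and using invented thresholds:

```python
# Pre-training filter sketch: drop only the short-and-viral bucket.
def keep_for_pretraining(doc: dict,
                         min_tokens: int = 30,
                         viral_threshold: int = 1_000) -> bool:
    short = len(doc["text"].split()) < min_tokens
    viral = doc.get("engagement", 0) >= viral_threshold
    return not (short and viral)  # exclude short, high-engagement content

docs = [
    {"text": "ratio'd. that's it. that's the tweet.", "engagement": 52_000},
    {"text": "We analyze long-context degradation across four model "
             "families and report token-matched comparisons.", "engagement": 12},
]
kept = [d for d in docs if keep_for_pretraining(d)]
print(len(kept))  # -> 1
```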