Transformer
Throw Away RoPE and AI Reads Long Context Better! Transformer Co-Author's Team Open-Sources a New Pre-Training Method for Large Models
量子位· 2026-01-13 09:50
Core Insights
- The article discusses DroPE, a new technique developed by a research team led by Llion Jones, one of the core authors of the Transformer architecture, to address the challenges of long-text processing in large models [1][24].
- DroPE allows seamless zero-shot context expansion without expensive long-context training, requiring less than 1% of the pre-training budget for model recalibration [2].

Group 1: Technology Overview
- DroPE can be seen as a method that discards positional embeddings to extend context [5].
- During the pre-training phase, the technique uses RoPE (Rotary Positional Encoding) as a temporary training tool to keep training stable and efficient [12][13].
- During the inference phase, DroPE discards positional embeddings and performs a brief recalibration at the original context length, unlocking the model's long-context extrapolation capability (see the sketch after this summary) [15][16].

Group 2: Performance Metrics
- Experiments on various models, including a 5M-parameter model, the SmolLM family (360M/1.7B), and the 7B-parameter Llama2-7B, showed significant improvements [17].
- On the LongBench benchmark, DroPE improved the average score of the base SmolLM by over 10 times [18].
- On the NIAH task evaluation, the DroPE model's recall reached 74.92%, significantly surpassing traditional RoPE scaling methods [19].

Group 3: Comparative Analysis
- A comparative table shows that DroPE outperforms other methods across tasks, achieving an average score of 30.52 on the LongBench benchmark [20].
- Even on the large-scale Llama2-7B model, DroPE delivered exceptional performance on long-context question answering and summarization tasks using only 0.5% of the pre-training budget for recalibration [20].

Group 4: Company Background
- The team behind DroPE, Sakana AI, was co-founded by Llion Jones and former Google senior scientist David Ha [24].
- Sakana AI gained attention for creating the first AI scientist capable of generating complete academic papers, which has positioned the company prominently in the AI landscape [26].
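To make the train-with-RoPE, infer-without-RoPE mechanism concrete, below is a minimal PyTorch sketch of the idea as summarized above. The function names (`build_rope_cache`, `causal_self_attention`) and the single `use_rope` flag are illustrative assumptions, not the paper's implementation; in this sketch, DroPE's brief recalibration would amount to continuing training for a short while with `use_rope=False` at the original context length, betting that the causal mask alone carries enough order information.

```python
# Minimal sketch of the DroPE idea: RoPE is a temporary scaffold during
# pre-training, then switched off ("dropped") for recalibration and
# long-context inference. Names and the flag are illustrative assumptions.
import torch
import torch.nn.functional as F

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # Standard rotary-embedding angles: one frequency per pair of dims.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate each (even, odd) dim pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def causal_self_attention(q, k, v, use_rope: bool = True):
    # Pre-training phase: use_rope=True, RoPE keeps optimization stable.
    # DroPE recalibration/inference: use_rope=False, positions are dropped
    # and only the causal mask carries order information.
    if use_rope:
        cos, sin = build_rope_cache(q.shape[-2], q.shape[-1])
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = k = v = torch.randn(1, 4, 16, 32)          # (batch, heads, seq, head_dim)
pretrain_out = causal_self_attention(q, k, v, use_rope=True)
drope_out = causal_self_attention(q, k, v, use_rope=False)
```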
Yang Zhilin Reveals Kimi's Pre-Training Strategy: Improve Token Efficiency, Achieve Long Context
Xin Lang Cai Jing· 2026-01-10 12:09
Core Insights
- The article focuses on strategies for pre-training AI models, emphasizing token efficiency and long context as the critical levers for performance on complex tasks [2][6].

Group 1: Token Efficiency
- Token efficiency is crucial because the reasoning or training of agents is fundamentally a search process; better pre-training shrinks the search space and strengthens prior knowledge [3][7].
- Its importance is highlighted by the need for AI to build complex systems, such as an operating system, without enumerating every possible token combination, most of which would be meaningless or incorrect [7].

Group 2: Long Context
- The Transformer architecture shows significant advantages in long-context scenarios: experiments indicate that LSTM performance falls below the Transformer's once context length exceeds 1,000 tokens, underscoring the importance of context length in model design [2][6].
- In the current agentic era, many tasks require long contexts to execute complex instructions, making architectures that lose less positional information technically more capable [2][6].

Group 3: Aesthetic Considerations in AI
- Developing AI models is not just a technical challenge but also an aesthetic one: creating a model reflects a worldview and values, akin to the concept of "taste" as articulated by influential figures like Steve Jobs [3][7].
- Each model generates unique, non-interchangeable tokens; the intelligence produced by different roles (e.g., a CEO vs. a designer) differs significantly, so the space of possible "tastes" grows exponentially [4][8].
New Paper from ds
小熊跑的快· 2026-01-04 11:31
Core Viewpoint
- The article discusses advances in deep learning models, focusing on the mHC (Manifold-Constrained Hyper-Connections) method, which enhances information flow between layers in large models while maintaining computational efficiency and stability [1][2].

Group 1: Traditional Models and Innovations
- Traditional models break problems into smaller units, converting them into vectors processed through many Transformer layers, where information can attenuate and noise can accumulate, risking the loss of critical signal [1].
- ResNet (2015) introduced residual connections, which add information from earlier layers to the current layer's output, improving information retention [1].
- A 2024 paper from ByteDance introduced Hyper-Connections (HC), which widen the residual path into multiple parallel channels for information exchange, but risk signal amplification and loss during training [1][2].

Group 2: mHC Methodology
- mHC strengthens the HC structure by constraining the mixing weights so that every row and every column sums to one, preserving the total amount of information while still allowing flexible redistribution (see the Sinkhorn-Knopp sketch after this summary) [2].
- This approach greatly reduces numerical instability and the risk of gradient explosion in large-scale training; a 27-billion-parameter mHC model surpasses traditional models with more parameters [2].

Group 3: Engineering Optimizations
- mHC is an engineering-style optimization that does not fundamentally alter the Transformer architecture; it improves the internal structure rather than making drastic changes [5].
- The method is reportedly compatible with hardware optimizations, reducing cross-node data transfer and improving single-card compute performance [3].
- There are indications that a new model, potentially named ds V4, will be released, featuring a smaller size with active parameters below 37 billion but a wider architecture [4].
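A minimal sketch of the row/column constraint described above, assuming the Sinkhorn-Knopp normalization that the related mHC coverage names: alternately normalizing the rows and columns of a positive matrix drives it toward a doubly stochastic one, so mixing redistributes signal across channels without changing the total. The function name and iteration count are illustrative.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 30) -> torch.Tensor:
    """Project a real n x n matrix to an approximately doubly stochastic one."""
    m = logits.exp()                            # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)      # columns sum to 1
    return m

mix = sinkhorn_knopp(torch.randn(4, 4))
print(mix.sum(dim=1))   # ~[1., 1., 1., 1.]
print(mix.sum(dim=0))   # ~[1., 1., 1., 1.]  -> total signal is conserved
```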
Signed by Liang Wenfeng, a DeepSeek Paper Ignites the AI Community: the mHC Architecture Arrives! Netizens: the Engineering Difficulty Is Hell-Level
AI前线· 2026-01-02 06:00
Core Insights
- DeepSeek has introduced a new network architecture called mHC (Manifold-Constrained Hyper-Connections) aimed at addressing the numerical instability and signal explosion seen in large-scale model training while retaining the performance gains of hyper-connections [2][5][6].

Problem Addressed by the Architecture
- Traditional Transformer networks rely on residual connections for stable signal transmission, which is crucial for training deep models. Hyper-Connections (HC), however, suffer instability because their connection matrices are unconstrained, causing signal explosion and gradient issues during large-scale training [6][7].
- The mHC architecture introduces geometric constraints by projecting the residual mapping space onto a specific manifold, keeping the connection matrix within the set of doubly stochastic matrices; this restores the identity-mapping property and stabilizes signal norms [6][10].

Technical Implementation
- The research team uses the Sinkhorn-Knopp algorithm for the projection constraint, optimizing the connection matrix while keeping system overhead low enough to maintain training efficiency [11][12].
- During training, the model learns an ordinary real-valued matrix that is projected to an approximately doubly stochastic matrix before each forward pass, ensuring connections stay on a safe manifold (a sketch of this step follows this summary) [12].

Experimental Results
- Experiments show that mHC avoids the training-convergence problems common with plain HC while matching or improving performance across tasks at parameter scales of 3 billion, 9 billion, and 27 billion [12][15].

Broader Implications
- mHC's significance lies not in replacing the Transformer paradigm but in offering a scalable theoretical and engineering framework for exploring complex residual topologies; it shows the value of explicitly constraining model structures within geometrically well-behaved spaces to address stability systematically [12][14].
- The approach opens avenues for more complex multi-stream, multi-path network designs that balance expressiveness with controllable trainability [12][14].
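A sketch of the projection-before-forward step described above, under stated assumptions: the module learns an unconstrained real-valued matrix, projects it with a few Sinkhorn-Knopp iterations to an approximately doubly stochastic mixing matrix, and applies it across the parallel residual streams. The class name `ConstrainedStreamMix`, the near-identity initialization, and the tensor layout are our illustrative choices, not DeepSeek's code.

```python
import torch
import torch.nn as nn

class ConstrainedStreamMix(nn.Module):
    """Learn an unconstrained matrix; project it before every forward pass."""

    def __init__(self, n_streams: int, sinkhorn_iters: int = 10):
        super().__init__()
        # Initialized near the identity so training starts close to a
        # plain residual connection (identity-mapping behavior).
        self.raw = nn.Parameter(torch.eye(n_streams) * 2.0)
        self.sinkhorn_iters = sinkhorn_iters

    def project(self) -> torch.Tensor:
        m = self.raw.exp()                       # positive entries
        for _ in range(self.sinkhorn_iters):     # Sinkhorn-Knopp projection
            m = m / m.sum(dim=1, keepdim=True)
            m = m / m.sum(dim=0, keepdim=True)
        return m                                 # ~doubly stochastic

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, dim), the widened residual flow.
        mix = self.project()                     # projected each forward pass
        return torch.einsum("ij,jbsd->ibsd", mix, streams)

layer = ConstrainedStreamMix(n_streams=4)
streams = torch.randn(4, 2, 16, 64)
mixed = layer(streams)                           # same shape, norm-stable mix
```

Projecting from learned parameters on each forward pass keeps gradients flowing to the unconstrained matrix while guaranteeing the matrix actually applied stays near the safe manifold.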
Just Out: Signed by Liang Wenfeng, DeepSeek's New Year's Day Paper Opens a New Chapter in Architecture
机器之心· 2026-01-01 08:22
Core Viewpoint
- DeepSeek has introduced Manifold-Constrained Hyper-Connections (mHC), a new architecture that addresses the instability of traditional hyper-connections in large-scale model training while preserving their significant performance gains [1][3][4].

Group 1: Introduction of mHC
- The mHC framework extends the traditional Transformer's single residual stream into a multi-stream parallel architecture, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][4].
- The core objective of mHC is to retain the performance gains from widening the residual stream while eliminating training instability and excessive memory consumption [4][6].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional residual connections ensure stable signal transmission through identity mapping, but the width of their information channel is limited [3][6].
- Recent methods like Hyper-Connections (HC) improve performance but introduce significant training instability and extra memory-access overhead [3][6].

Group 3: Methodology of mHC
- mHC projects the residual connection space onto a specific manifold to restore the identity-mapping property while optimizing the infrastructure for efficiency [4][9].
- The Sinkhorn-Knopp algorithm projects the connection matrix onto the Birkhoff polytope, ensuring stable signal propagation (the math block after this summary spells out why) [4][10].

Group 4: Experimental Validation
- Empirical results show that mHC resolves the stability issues and scales well in large-scale training: on a 27-billion-parameter model it increases training time by only 6.7% while delivering significant performance improvements [4][29].
- In benchmark tests, mHC consistently outperformed both baseline models and HC across downstream tasks, indicating its effectiveness in large-scale pre-training [30][31].

Group 5: Infrastructure Design
- DeepSeek tailored infrastructure for mHC, including kernel fusion, selective recomputation, and enhanced communication strategies, to minimize memory overhead and improve computational efficiency [17][21][23].
- Design choices such as reordering operations and a mixed-precision strategy contribute to mHC's overall efficiency [17][18].
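The stability claim can be made precise with standard matrix facts; the block below is a sketch in our own notation (the paper's exact formulation may differ).

```latex
% H: the n x n connection matrix mixing the n parallel residual streams.
% Doubly stochastic constraint (rows and columns each sum to one):
H \mathbf{1} = \mathbf{1}, \qquad \mathbf{1}^{\top} H = \mathbf{1}^{\top}, \qquad H_{ij} \ge 0.
% Birkhoff--von Neumann: such an H is a convex combination of permutation
% matrices P_k, i.e. a point of the Birkhoff polytope:
H = \sum_{k} \lambda_k P_k, \qquad \lambda_k \ge 0, \qquad \sum_{k} \lambda_k = 1.
% Stability consequences: the uniform signal is preserved exactly, and
% \|H\|_2 \le \sqrt{\|H\|_1 \, \|H\|_\infty} = 1, so repeated mixing across
% layers can neither amplify nor inject signal energy.
```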
Two Years After Cutting 12,000 Jobs, Google Goes Back for "Old Faces": 20% of Newly Hired AI Engineers Are Former Employees
猿大侠· 2025-12-25 04:09
Core Viewpoint
- Google is strategically reclaiming its position in AI by re-hiring former employees: roughly 20% of new AI software engineers in 2025 are ex-employees, a significant increase over previous years [1][4].

Group 1: Employee Rehiring Strategy
- The trend is not coincidental: Google laid off about 12,000 employees in early 2023, roughly 6% of its total workforce [4].
- Google maintained connections with former employees, creating a talent pool that can be reactivated as competition in generative AI intensifies [4].
- Former employees are returning for the substantial computing resources and competitive compensation that AI development demands [5][6].

Group 2: Cultural and Structural Changes
- Google has notably changed its internal culture and organization: taking more risks, accelerating product release schedules, and cutting management layers by over one-third [8].
- It has also adopted unconventional recruiting practices, including re-hiring former employees and involving co-founder Sergey Brin in recruiting key AI talent [8].

Group 3: Competitive Landscape and Market Response
- Google initially struggled in generative AI, lagging behind competitors such as OpenAI and Meta, which rapidly gained market share [11][12].
- Starting in 2024, Google shifted strategy, increasing investment in AI infrastructure and stabilizing its product line around the Gemini series, including the recent release of Gemini 3 [12].
- The market has responded positively: Alphabet's stock rose over 60% in 2025, outperforming other tech giants [13].
We Just Made a World-Model Learning Roadmap for Beginners...
自动驾驶之心· 2025-12-25 03:24
Core Viewpoint
- The article distinguishes world models from end-to-end models in autonomous driving, clarifying that a world model is not a specific technology but a category of models with certain capabilities. It emphasizes the industry trend toward using world models for closed-loop simulation to address the high cost of corner cases in autonomous driving [2].

Course Overview
- The course on world models in autonomous driving is structured into six chapters, covering the introduction, background knowledge, general world models, video-generation-based models, OCC-based models, and job-related insights for the industry [5][6][7][8][9].

Chapter Summaries
- **Chapter 1: Introduction to World Models** This chapter outlines the relationship between world models and end-to-end autonomous driving, discussing the development history and current applications of world models, as well as streams such as pure simulation, simulation plus planning, and generating sensor inputs [5].
- **Chapter 2: Background Knowledge** This chapter covers foundational knowledge for world models, including scene representation, Transformer technology, and BEV perception, all crucial for the later chapters [6].
- **Chapter 3: General World Models** Focuses on popular general world models such as Marble from Li Fei-Fei's team and Genie 3 from DeepMind, discussing their core technologies and design philosophies [7].
- **Chapter 4: Video Generation-Based World Models** This chapter delves into video-generation algorithms, starting with GAIA-1 and GAIA-2 and extending to recent works like UniScene and OpenDWM, covering both classic and cutting-edge advances in this area [8].
- **Chapter 5: OCC-Based World Models** Concentrates on OCC generation algorithms, discussing three major papers and a practical project, and emphasizing how these methods extend to vehicle trajectory planning [9].
- **Chapter 6: World Model Job Topics** This chapter shares practical insights from the instructor's experience, addressing industry applications, pain points, and interview preparation for world-model positions [9].

Learning Outcomes
- The course aims to provide a comprehensive understanding of world models in autonomous driving, equipping participants with knowledge comparable to one year of experience as a world-model algorithm engineer [10].
Class Starts Next Week! We Designed a Learning Roadmap for Autonomous-Driving World Models...
自动驾驶之心· 2025-12-24 09:22
Core Viewpoint
- The article distinguishes world models from end-to-end models in autonomous driving, emphasizing that world models are a means to achieve end-to-end autonomous driving rather than a specific technology [2].

Summary by Sections
- **Chapter 1: Introduction to World Models** This chapter provides an overview of the relationship between world models and end-to-end autonomous driving, covering the development history and current applications of world models. It introduces the main types, including pure simulation, simulation plus planning, and models that generate sensor inputs and perception results, along with their industry applications and relevant datasets [5].
- **Chapter 2: Background Knowledge of World Models** The second chapter covers the foundations needed to understand world models, starting from scene representation and expanding to technologies like the Transformer and BEV perception. It highlights key technical terms that recur in job interviews about world models [6][11].
- **Chapter 3: Discussion on General World Models** This chapter centers on general world models and recent popular work in autonomous driving, including models from Li Fei-Fei's team (Marble), DeepMind (Genie 3), and Meta (JEPA). It also discusses the widely discussed VLA + world model algorithms and Tesla's latest world-model simulator shared at ICCV [7].
- **Chapter 4: Video Generation-Based World Models** The fourth chapter focuses on video-generation algorithms, currently the most researched direction in academia and industry. It covers classic works like GAIA-1 and GAIA-2 from Wayve and recent advances such as UniScene and OpenDWM, giving a comprehensive view of the field's progress [8].
- **Chapter 5: OCC-Based World Models** This chapter explains three major papers and a practical project on OCC generation algorithms; these methods extend readily to vehicle trajectory planning and contribute to end-to-end solutions [9].
- **Chapter 6: World Model Job Topics** The final chapter shares practical insights from the instructor's years of experience, addressing how world models are applied in industry, existing pain points, and how to prepare for related job interviews, focusing on what companies prioritize [10].

Course Outcomes
- The course aims to advance understanding of end-to-end autonomous driving, equipping participants with world-model technologies, including video-generation and OCC-generation methods, and preparing them for roles in the autonomous-driving industry [10][13].
Google Co-Founder's Rare Reflection: We Underestimated the Transformer and the Risks of AI Coding; "When Code Is Wrong, the Cost Is Higher"
AI前线· 2025-12-21 05:32
Group 1
- The article emphasizes the rapid advances in AI, particularly in code generation, while highlighting the associated risks and challenges, as noted by Sergey Brin [2][3][20].
- Brin pointed out that AI-written code can contain significant errors, making AI better suited to creative tasks where mistakes are less critical [2][38].
- He reflected on Google's initial hesitation around generative AI and its underestimation of how important scaling compute and algorithms would be [2][22][24].

Group 2
- The discussion included a historical overview of Google's founding, emphasizing the creative, experimental environment at Stanford that fostered innovation [4][6][10].
- Brin noted that Google's early days lacked a clear direction, with many ideas tested without strict limitations [6][9].
- He highlighted the importance of a strong academic foundation in shaping Google's culture and approach to research and development [12][13].

Group 3
- Brin discussed the competitive AI landscape, noting that investments in AI infrastructure have reached hundreds of billions of dollars as companies race to lead in this space [21][22].
- He acknowledged that while Google has contributed substantially to AI, it missed opportunities in the past through insufficient investment and fear of releasing products prematurely [22][23][24].
- The conversation also touched on AI's evolving nature, with Brin expressing uncertainty about its future capabilities and its potential to surpass human abilities [27][29][30].

Group 4
- Brin emphasized the need to balance computational power with algorithmic advances, stating that algorithmic progress has outpaced scaling in recent years [3][55].
- He argued that deep technology and foundational research are crucial for maintaining a competitive edge in AI [24][25].
- The discussion concluded with reflections on the future role of universities, given how rapidly technology is changing education and knowledge dissemination [41][42].
Why Won't AGI Arrive? This Researcher Explains AI's "Physical Limits" in Depth
36Kr· 2025-12-17 11:43

Group 1
- The article examines skepticism about the realization of Artificial General Intelligence (AGI), arguing that current market optimism may be misplaced because computation is physically constrained [1][4].
- Tim Dettmers argues that computation is fundamentally bound by physical laws: advances in intelligence are limited by energy, bandwidth, storage, manufacturing, and cost [3][4].
- Dettmers's key judgments: the success of Transformer models is not coincidental but an optimal engineering choice under current physical constraints, and further improvements yield diminishing returns [4][6].

Group 2
- Discussions of AGI often overlook the physical realities of computation, leading to misconceptions about unlimited scaling of intelligence [5][9].
- As systems mature, linear improvements require exponentially increasing resource investments, producing diminishing returns (a worked form of this claim follows this summary) [10][16].
- The performance gains from GPUs, which historically drove AI advances, are nearing their physical and engineering limits, suggesting a necessary shift in focus [18][22].

Group 3
- Dettmers suggests the current trajectory of AI development may be approaching stagnation; Gemini 3 in particular could signal a limit to the effectiveness of scaling [33][36].
- The cost structure of scaling has changed: costs that were once linear are now exponential, so further scaling may be unsustainable without new breakthroughs [35][36].
- True AGI must include the ability to perform economically meaningful tasks in the real world, which is heavily constrained by physical limits [49][50].

Group 4
- The notion of "superintelligence" may be flawed, since it assumes an unlimited capacity for self-improvement, which is infeasible given physical resource constraints [56][58].
- The future of AI will be shaped by economic viability and practical applications rather than the pursuit of an idealized AGI [59][60].
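The diminishing-returns argument can be formalized with the standard power-law scaling assumption; the derivation below is our illustration of the claim, not an equation from the article.

```latex
% Standard power-law scaling assumption: loss L falls as a power of compute C.
L(C) = a \, C^{-b}, \qquad a, b > 0.
% Reducing the loss by a constant factor k < 1 multiplies the required compute:
L(C') = k \, L(C) \;\Longrightarrow\; C' = k^{-1/b} \, C.
% Hence m equal-sized quality steps cost C_m = C_0 \, k^{-m/b}: gains that
% feel linear demand exponentially growing compute, energy, and capital.
```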
Group 1 - The article discusses the skepticism surrounding the realization of Artificial General Intelligence (AGI), emphasizing that current optimism in the market may be misplaced due to physical constraints on computation [1][4]. - Tim Dettmers argues that computation is fundamentally bound by physical laws, meaning that advancements in intelligence are limited by energy, bandwidth, storage, manufacturing, and cost [3][4]. - Dettmers identifies several key judgments regarding AGI: the success of Transformer models is not coincidental but rather an optimal engineering choice under current physical constraints, and further improvements yield diminishing returns [4][6]. Group 2 - The article highlights that discussions about AGI often overlook the physical realities of computation, leading to misconceptions about the potential for unlimited scaling of intelligence [5][9]. - It is noted that as systems mature, linear improvements require exponentially increasing resource investments, which can lead to diminishing returns [10][16]. - The article points out that the performance gains from GPUs, which have historically driven AI advancements, are nearing their physical and engineering limits, suggesting a shift in focus is necessary [18][22]. Group 3 - Dettmers suggests that the current trajectory of AI development may be approaching a stagnation point, particularly with the introduction of Gemini 3, which could signal a limit to the effectiveness of scaling [33][36]. - The cost structure of scaling has changed, with past linear costs now becoming exponential, indicating that further scaling may not be sustainable without new breakthroughs [35][36]. - The article emphasizes that true AGI must encompass the ability to perform economically meaningful tasks in the real world, which is heavily constrained by physical limitations [49][50]. Group 4 - The discussion includes the notion that the concept of "superintelligence" may be flawed, as it assumes unlimited capacity for self-improvement, which is not feasible given the physical constraints of resources [56][58]. - The article argues that the future of AI will be shaped by economic viability and practical applications rather than the pursuit of an idealized AGI [59][60].