Transformer
Wang Guan Again: A 27M-Parameter Small Model Beats o3-mini! The Post-2000s Founder Who Turned Down Musk Really Is Different
Sou Hu Cai Jing· 2025-08-10 04:21
Core Insights
- The article discusses the Hierarchical Reasoning Model (HRM) developed by Sapient Intelligence, which achieves performance superior to much larger models despite having far fewer parameters [3][5][18]
- HRM uses techniques inspired by how the brain works, allowing it to perform complex reasoning tasks efficiently without relying on traditional pre-training methods [4][12][14]

Model Performance
- With only 27 million parameters, HRM surpassed larger models such as o3-mini-high and Claude 3.7 across various tests, achieving 40.3% accuracy on the ARC-AGI challenge [16][18]
- On extreme Sudoku tasks, HRM demonstrated near-perfect accuracy, while traditional models struggled significantly [16][18]

Technical Innovations
- HRM employs a dual-layer cyclical module design that mimics the brain's hierarchical processing and time-scale separation, combining global direction with efficient local execution (a minimal sketch follows this summary) [4][7]
- The model incorporates a layered convergence mechanism to avoid premature convergence, allowing it to adaptively set new goals after each high-level update [9][11]
- It uses approximate gradient techniques to reduce memory usage and computational cost, in line with biological learning patterns [12]
- A deep supervision mechanism is integrated, allowing periodic evaluation and adjustment during learning, which helps correct deviations promptly [13][14]

Developer Background
- The model's creator, Wang Guan, is a young entrepreneur who previously declined offers from major tech figures such as Elon Musk, aiming instead to revolutionize AI architecture [20][22]
- Wang co-founded Sapient Intelligence in 2024, focusing on models with advanced reasoning and planning capabilities [22]
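The dual-timescale recurrent design and the segment-wise training signal described above can be illustrated with a short sketch. This is a minimal, hypothetical rendering in PyTorch, not Sapient Intelligence's released code: the GRU cells, module sizes, number of inner steps, and number of supervision segments are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TwoTimescaleReasoner(nn.Module):
    def __init__(self, d_in=128, d_low=256, d_high=256, n_out=10):
        super().__init__()
        self.low = nn.GRUCell(d_in + d_high, d_low)   # fast module: local execution
        self.high = nn.GRUCell(d_low, d_high)         # slow module: global planning
        self.head = nn.Linear(d_low, n_out)

    def forward(self, x, z_low, z_high, inner_steps=4):
        # The low-level module runs several steps conditioned on the current high-level state.
        for _ in range(inner_steps):
            z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
        # The high-level module then updates once, setting a new "goal" for the next cycle.
        z_high = self.high(z_low, z_high)
        return self.head(z_low), z_low, z_high

model = TwoTimescaleReasoner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 128)                      # stand-in batch of puzzle encodings
target = torch.randint(0, 10, (8,))
z_low, z_high = torch.zeros(8, 256), torch.zeros(8, 256)

# Deep supervision with a segment-local (approximate) gradient: supervise after every
# high-level cycle and detach the carried states instead of backpropagating through
# the whole unrolled trajectory.
for segment in range(3):
    logits, z_low, z_high = model(x, z_low, z_high)
    loss = nn.functional.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    z_low, z_high = z_low.detach(), z_high.detach()
```

The two ideas mirrored here are the slow module updating once per several fast-module steps, and the detach between supervised segments, which stands in for the approximate-gradient and deep-supervision mechanisms described in the summary.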
The "Autonomous Driving Heart" Technical Discussion Group Is Here!
自动驾驶之心· 2025-07-29 07:53
Core Viewpoint
- The article announces a leading communication platform for autonomous driving technology in China, covering industry, academic, and career development topics [1].

Group 1
- The platform, named "Autonomous Driving Heart," aims to facilitate discussion and exchange among professionals across fields related to autonomous driving technology [1].
- The technical discussion group covers a wide range of topics, including large models, end-to-end systems, VLA, BEV perception, multi-modal perception, occupancy, online mapping, 3DGS, multi-sensor fusion, transformers, point cloud processing, SLAM, depth estimation, trajectory prediction, high-precision maps, NeRF, planning control, model deployment, autonomous driving simulation testing, product management, hardware configuration, and AI job exchange [1].
- Interested individuals are encouraged to join the community by adding the WeChat assistant and providing their company/school, nickname, and research direction [1].
Grok4 Goes Viral Across the Internet and Passes the Bouncing-Ball Programming Test; Epic Founder: This Is AGI
猿大侠· 2025-07-12 01:45
Core Viewpoint
- The article discusses the rapid adoption and impressive capabilities of Elon Musk's Grok4 AI model, highlighting its performance in various tests and comparisons with other models such as OpenAI's o3.

Group 1: Performance Highlights
- Grok4 successfully passed the hexagonal ball programming test, showcasing its ability to understand physical laws [2][12].
- In a comprehensive evaluation, Grok4 outperformed o3 in all eight tasks, including complex legal reasoning and code translation [18][20][23].
- Tim Sweeney, founder of Epic Games, praised Grok4 as a form of Artificial General Intelligence (AGI) after it provided deep insights on a previously unseen problem [9][10].

Group 2: User Interactions and Applications
- Users have engaged with Grok4 in creative ways, such as visualizing mathematical concepts and generating SVG graphics, demonstrating its versatility [25][32].
- A user named Dan created a visualization of Euler's identity with minimal interaction, indicating Grok4's efficiency in generating complex outputs [26][31].
- The article mentions a high-level application called "Expert Conductor," which simulates an expert collaboration environment, further showcasing Grok4's potential in problem-solving [54][56].

Group 3: Community Engagement
- Readers are encouraged to share their innovative uses of Grok4, indicating growing community interest and engagement with the AI model [66].
- Various users have reported their experiences and findings, contributing to a collaborative exploration of Grok4's capabilities [12][66].
"Tokens Are Bullshit": Mamba Author Makes a Disruptive Claim, Exposing a Deep Flaw in Transformers
机器之心· 2025-07-09 09:52
Core Viewpoint
- The article discusses the trade-offs between State Space Models (SSMs) and Transformers, arguing that tokenization is a limitation SSMs can overcome, leading to better computational efficiency and modeling capability [1][3][61].

Group 1: State Space Models (SSM)
- SSMs are defined as a modern version of recurrent neural networks (RNNs) with key features that allow them to match the language modeling performance of Transformers [8][10].
- A significant characteristic of SSMs is that the hidden state dimension is larger than the input and output dimensions, allowing more context to be stored (a minimal sketch of the recurrence follows this summary) [9][10].
- The state update function must be expressive enough to accurately encode and retrieve the necessary information, which selective SSMs achieve through dynamic, input-dependent transfer matrices [11][12].
- Mamba, a specific SSM, integrates parallelization and memory management techniques to enhance computational efficiency [13][14].
- The article highlights that SSMs can outperform Transformers in language modeling when computational resources are matched [53][56].

Group 2: Transformers
- Transformers excel at tasks requiring fine-grained operations on individual tokens, but their quadratic complexity limits efficiency [82][86].
- The article argues that Transformers carry an inductive bias that affects their modeling capability, making them sensitive to the resolution and semantic content of the data [83][85].
- Despite their strengths, Transformers are not the ultimate solution for all modeling tasks, and significant work remains to be done in the field [89].

Group 3: Tokenization
- Tokenization is a critical step in language modeling, but it limits a model's ability to understand fine-grained language details [39][40].
- The article posits that removing tokenization could improve model performance and better align with the essence of deep learning, which aims to minimize manual feature engineering [44][45].
- Without tokenization, models could learn more effective patterns directly from raw data, enhancing their capabilities [46][52].
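To make the points about an oversized hidden state and input-dependent (selective) transfer matrices concrete, here is a naive reference scan for a selective state-space recurrence. It is a sketch under assumed shapes and parameterization (a diagonal state matrix, softplus step size, randomly initialized projections), not Mamba's optimized fused kernel.

```python
import torch
import torch.nn.functional as F

def selective_ssm_scan(x, W_delta, W_B, W_C, A_log):
    # x: (batch, seq_len, d_model); the hidden state carries d_state channels per model
    # dimension, so the state (batch, d_model, d_state) is larger than the input/output stream.
    B_, L, D = x.shape
    N = A_log.shape[-1]
    h = torch.zeros(B_, D, N)
    ys = []
    for t in range(L):
        xt = x[:, t]                                   # (B, D)
        delta = F.softplus(xt @ W_delta)               # (B, D) input-dependent step size
        Bt = xt @ W_B                                  # (B, N) input-dependent input matrix
        Ct = xt @ W_C                                  # (B, N) input-dependent output matrix
        A_bar = torch.exp(delta.unsqueeze(-1) * (-torch.exp(A_log)))   # (B, D, N) decay per step
        h = A_bar * h + delta.unsqueeze(-1) * Bt.unsqueeze(1) * xt.unsqueeze(-1)
        ys.append((h * Ct.unsqueeze(1)).sum(-1))       # contract the state back to (B, D)
    return torch.stack(ys, dim=1)

D, N = 64, 16
x = torch.randn(2, 32, D)
y = selective_ssm_scan(
    x,
    W_delta=torch.randn(D, D) * 0.02,
    W_B=torch.randn(D, N) * 0.02,
    W_C=torch.randn(D, N) * 0.02,
    A_log=torch.zeros(D, N),
)
print(y.shape)  # torch.Size([2, 32, 64])
```

The loop makes the recurrence explicit; real implementations replace it with a parallel scan plus careful memory management, which is the efficiency point the summary attributes to Mamba.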
A Transformer Blind Spot: With Just 500 Post-Training Steps, Recurrent Models Break the 256k Length-Generalization Limit
机器之心· 2025-07-08 04:09
Core Insights
- The article discusses the advantages of linear recurrent models, such as Mamba, and linear attention mechanisms in handling long sequences, which is crucial for long-context reasoning tasks [1][2].
- It highlights the performance improvements of recurrent models over time, indicating that they can now compete with Transformers on various tasks despite previous limitations [3].
- A significant finding is that recurrent models struggle to generalize beyond their training lengths, leading to performance drops on longer sequences [4][6].

Group 1
- The article presents a solution to the generalization issue in recurrent models through simple training interventions, allowing them to generalize to sequences up to 256k tokens long with just 500 additional training steps [7].
- The research emphasizes that recurrent models possess untapped potential rather than inherent flaws [7][8].
- The authors propose the "Unexplored States Hypothesis" to explain why recurrent models fail to generalize in length: during training they only ever learn from a limited subset of the possible states [13][14].

Group 2
- The article outlines four training interventions that improve length generalization by altering the initial state of the model [19].
- These interventions are Random Noise, Fitted Noise, State Passing, and Truncated Backpropagation Through Time (TBTT), each designed to expose the model to a broader range of state distributions (a sketch of State Passing follows this summary) [19][20].
- The findings reveal that the State Passing and TBTT mechanisms effectively enable length generalization, achieving results with only 0.02% of the original pre-training budget [23][24].

Group 3
- The article discusses the performance of these interventions on various long-context tasks, demonstrating their ability to enhance length generalization [31].
- Specific tasks include the BABILong benchmark, password retrieval, and synthetic copying, where the interventions significantly improved model performance [32][35][39].
- The results indicate that models trained with these interventions can effectively exploit relationships between tokens beyond the training context length [36][39].

Group 4
- The article introduces the concept of "Effective Remembrance" to measure how strongly a model's predictions depend on earlier tokens, with the goal that models focus on recent context rather than distant tokens [44][50].
- It shows that State Passing improves effective memory, allowing models to prioritize recent tokens in their predictions [51][52].
- This adjustment is crucial for text modeling, ensuring that earlier tokens do not disproportionately influence the model's output [52].
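As a concrete illustration of the State Passing intervention (combined with TBTT-style detaching), the sketch below carries the final recurrent state of one training chunk into the next instead of resetting it to zeros. The tiny model class, chunk sizes, random data, and optimizer settings are placeholders for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    def __init__(self, vocab=256, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.emb(tokens), state)
        return self.head(h), state

model = TinyRecurrentLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
state = None  # carried across chunks instead of being reset each step

for step in range(500):  # a short post-training budget, echoing the reported 500 steps
    tokens = torch.randint(0, 256, (4, 128))   # stand-in training chunks
    logits, state = model(tokens, state)
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, 256), tokens[:, 1:].reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    # State Passing: keep the final state as the next chunk's initial state, but detach
    # it so gradients do not flow across chunk boundaries (the TBTT-style truncation).
    state = state.detach()
```

The effect is that training exposes the recurrence to states reachable after arbitrarily long prefixes, rather than only states reachable from the zero initialization, which is the intuition behind the "Unexplored States Hypothesis" above.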
Meta's New Attention Mechanism Pushes Past the Transformer's Ceiling, and It Uses OpenAI's Open-Source Technology
量子位· 2025-07-07 09:35
Core Viewpoint
- Meta, drawing on OpenAI's open-source technology and having recruited a large number of OpenAI employees, has developed a new architecture called the 2-Simplicial Transformer, which improves how efficiently training data is used for large models [1][2][26].

Group 1: New Architecture and Methodology
- The 2-Simplicial Transformer modifies standard attention mechanisms to improve data efficiency, addressing the data bottleneck in current large model development [2][4].
- The core method extends standard dot-product attention to a trilinear function, allowing better expression of complex tasks (a naive sketch follows this summary) [3][6].
- A new key vector, K', is introduced so that attention computations can capture richer relationships [9][10].

Group 2: Performance and Scalability
- Experimental results indicate that the 2-Simplicial Transformer outperforms traditional Transformers on mathematical, programming, and reasoning tasks, especially as the parameter count increases [4][19].
- Its scaling exponent is higher than that of traditional Transformers, meaning performance improves more rapidly with additional parameters and data, an advantage in data-limited scenarios [20][22].
- Across various tasks, the 2-Simplicial Transformer shows better performance metrics than traditional Transformers, particularly in larger models [18][21].

Group 3: Implementation and Challenges
- The implementation uses Triton, a GPU programming framework that allows efficient computation without requiring extensive CUDA experience [11][12].
- Despite these advantages, the computational complexity and latency of the 2-Simplicial Transformer remain high, indicating the need for further optimization before production use [22].
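A naive reference for the trilinear scoring idea is sketched below: a second key projection K' enters the score, and attention is normalized jointly over pairs of positions. The value combination, scaling factor, and shapes are assumptions for illustration; the actual architecture relies on an optimized, windowed Triton kernel rather than this cubic-memory einsum.

```python
import torch

def trilinear_attention(q, k, k2, v, v2):
    # q, k, k2, v, v2: (batch, seq, dim); k2/v2 stand in for the extra K'/V' projections.
    d = q.shape[-1]
    # logits[b, i, j, l] = sum_d q[b,i,d] * k[b,j,d] * k2[b,l,d]
    logits = torch.einsum("bid,bjd,bld->bijl", q, k, k2) / (d ** 0.5)  # scaling is illustrative
    B, n = q.shape[0], q.shape[1]
    # Softmax over the joint (j, l) pair of positions rather than a single key position.
    attn = torch.softmax(logits.reshape(B, n, n * n), dim=-1).reshape(B, n, n, n)
    # Values are combined per (j, l) pair; an elementwise product is assumed here.
    pair_values = torch.einsum("bjd,bld->bjld", v, v2)
    return torch.einsum("bijl,bjld->bid", attn, pair_values)

q = k = k2 = v = v2 = torch.randn(2, 16, 32)
out = trilinear_attention(q, k, k2, v, v2)
print(out.shape)  # torch.Size([2, 16, 32])
```

The cubic cost of the joint softmax is exactly the latency concern noted above, which is why the published work restricts the trilinear term to local windows and implements it in Triton.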
DeepSeek Technical Deep Dive (3): The Evolution of MoE
自动驾驶之心· 2025-07-06 08:44
Core Viewpoint
- The article traces DeepSeek's evolution in the context of Mixture-of-Experts (MoE) models, highlighting innovations and improvements from DeepSeekMoE (V1) to DeepSeek V3 while maintaining the MoE technology route [1].

Summary by Sections

1. Development History of MoE
- MoE was first introduced in 1991 with the paper "Adaptive Mixtures of Local Experts," and its framework has remained consistent over the years [2].
- Google has been a key player in the development of MoE, particularly with the release of "GShard" in 2020, which scaled models to 600 billion parameters [5].

2. DeepSeek's Work

2.1. DeepSeek-MoE (V1)
- DeepSeek V1, released in January 2024, addressed two main issues: knowledge mixing and redundancy among experts [15].
- The architecture introduced fine-grained expert segmentation and shared expert isolation to enhance specialization and reduce redundancy [16].

2.2. DeepSeek V2 MoE Upgrade
- V2 introduced a device-limited routing mechanism to control communication costs by ensuring that activated experts are distributed across a limited number of devices [28].
- A communication balance loss was added to address potential congestion at the receiving end of the communication [29].

2.3. DeepSeek V3 MoE Upgrade
- V3 maintained the fine-grained expert and shared expert designs while upgrading the gating network from Softmax to Sigmoid to improve scoring differentiation among experts (a minimal gating sketch follows this summary) [36][38].
- The auxiliary loss for load balancing was eliminated to reduce its negative impact on the main model, replaced by a dynamic bias for load balancing [40].
- A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the sequence level [42].

3. Summary of DeepSeek's Innovations
- The evolution of DeepSeek MoE has focused on balancing general knowledge and specialized knowledge through shared and fine-grained experts, while also addressing load balancing through various auxiliary losses [44].
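The V3-style gating changes described above (Sigmoid scoring, always-active shared experts, and a bias-based balancing term in place of an auxiliary loss) can be sketched as follows. The expert counts, top-k value, single-layer expert MLPs, and the bias update rule are illustrative assumptions, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class SketchMoE(nn.Module):
    def __init__(self, d=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.routed = nn.ModuleList(nn.Linear(d, d) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))
        self.gate = nn.Linear(d, n_routed, bias=False)
        # A balancing bias that only affects expert selection; it is not trained by the main loss.
        self.register_buffer("load_bias", torch.zeros(n_routed))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d)
        scores = torch.sigmoid(self.gate(x))                       # Sigmoid scoring, not Softmax
        topk = torch.topk(scores + self.load_bias, self.top_k, dim=-1).indices
        shared_out = sum(e(x) for e in self.shared)                # shared experts see every token
        routed_rows = []
        for t in range(x.shape[0]):
            sel = scores[t, topk[t]]
            weights = sel / sel.sum()                              # normalize the chosen scores
            routed_rows.append(
                sum(w * self.routed[int(i)](x[t]) for w, i in zip(weights, topk[t]))
            )
        return shared_out + torch.stack(routed_rows)

    @torch.no_grad()
    def update_load_bias(self, expert_load, target_load, lr=0.001):
        # Bias-based balancing: nudge under-loaded experts up and over-loaded ones down,
        # instead of adding an auxiliary balancing loss to the main objective.
        self.load_bias += lr * torch.sign(target_load - expert_load)

moe = SketchMoE()
y = moe(torch.randn(8, 256))
print(y.shape)  # torch.Size([8, 256])
```

The token-by-token loop keeps the routing logic readable; production MoE layers instead dispatch tokens to experts in batches across devices, which is where the device-limited routing and communication balancing discussed above come in.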
So the Scaling Law Can Still Be Optimized? Meta's Move Saves Tokens and Boosts Efficiency
机器之心· 2025-07-06 03:49
Core Insights
- The article discusses advancements in AI, focusing on the evolution of the Transformer model and the introduction of the 2-simplicial Transformer, which improves token efficiency and model scalability [1][4][10].

Group 1: Transformer and AI Development
- The paper "Attention Is All You Need" marked a significant turning point in AI development, establishing the Transformer as the foundational paradigm for current language models [1].
- The citation count for this paper is approaching 190,000, indicating its profound impact on the field [2].
- The ongoing challenge in AI is acquiring a sufficient quantity of high-quality tokens and utilizing them efficiently, necessitating further upgrades to the Transformer model [3].

Group 2: 2-Simplicial Transformer
- Meta's recent research introduced a rotationally invariant trilinear attention mechanism with representational capacity comparable to the 2-simplicial Transformer, potentially altering the coefficients in the Scaling Law [4][10].
- The 2-simplicial Transformer, derived from Clift et al. (2019), generalizes the dot-product attention mechanism to a trilinear form, enhancing its scalability under token constraints [11][19].
- Experimental results indicate that the 2-simplicial Transformer can more effectively approximate the irreducible entropy of natural language than traditional dot-product attention Transformers [11].

Group 3: Scaling Law and Model Performance
- The Scaling Law describes how loss decreases with the total number of model parameters and tokens, suggesting that larger models approach the irreducible loss of the natural text distribution as both grow (a compact restatement follows this summary) [13][15].
- Hoffmann et al. (2022) found that the optimal number of parameters and dataset size should scale proportionally with the computational budget, with estimated scaling exponents of about 0.49 for parameters and 0.5 for tokens [17][18].
- The 2-simplicial Transformer exhibits a steeper scaling slope than the dot-product attention Transformer, indicating a higher exponent in its Scaling Law [50].

Group 4: Experimental Results
- Experiments across various models revealed that the 2-simplicial attention mechanism did not provide benefits for models with fewer than 2 billion active parameters [45].
- Performance metrics across different model sizes showed slight improvements or declines when comparing the 2-simplicial Transformer to traditional Transformers, with variations in performance percentages noted [43][44].
- The study estimated the differences in scaling coefficients between the 2-simplicial and dot-product attention mechanisms, highlighting the potential for improved efficiency in larger models [46][49].
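For reference, the parametric form of the Scaling Law discussed above can be written compactly. The symbols follow the usual Chinchilla-style convention: $L$ is loss, $N$ the parameter count, $D$ the token count, $E$ the irreducible entropy of the text distribution, $C$ the compute budget, and $A$, $B$, $\alpha$, $\beta$ fitted constants.

```latex
\begin{align}
  L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \\
  N^{*}(C) &\propto C^{a}, \qquad D^{*}(C) \propto C^{b},
  \qquad a \approx 0.49,\; b \approx 0.5
\end{align}
```

A change in attention mechanism that raises the effective exponents shifts how quickly $L$ falls for a fixed token budget, which is the sense in which the 2-simplicial Transformer is said to "optimize" the Scaling Law.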
X @Avi Chawla
Avi Chawla· 2025-07-04 06:48
AI Tools & Platforms
- RAGFlow is a linked resource [1]
- Xpander is a linked resource [1]
- Transformer Lab is a linked resource [1]
- Llama Factory is a linked resource [1]
- LangFlow is a linked resource [1]
- AutoAgent is a linked resource [1]
ICML 2025 | Breaking the Residual-Connection Bottleneck: Caiyun Technology & BUPT Propose the MUDDFormer Architecture to Evolve the Transformer Further!
机器之心· 2025-06-27 08:06
Core Viewpoint
- The article introduces Multiway Dynamic Dense (MUDD) connections as an effective alternative to residual connections in Transformers, significantly improving cross-layer information transfer in deep learning models [1][4].

Background
- Residual connections, introduced by Kaiming He in ResNet, have become foundational in deep learning and Transformer LLMs, but they still limit how efficiently information can flow across layers [1][7].
- MUDD connections dynamically establish cross-layer connections based on the current hidden state, addressing issues such as representation collapse and information overload in the residual stream [7][8].

Model Architecture
- The MUDDFormer architecture builds independent dynamic connections for the different information streams (Q, K, V, R), enhancing the model's ability to gather relevant information from previous layers (a minimal sketch follows this summary) [10][13].
- The dynamic connections let the model adaptively decide, per token, how much information to extract from each previous layer based on the current context [11][13].

Experimental Evaluation
- MUDDPythia, a 2.8-billion-parameter model, matches the performance of much larger models (6.9 billion and 12 billion parameters) with only a 0.23% increase in parameters and a 0.4% increase in computation [4][18].
- MUDDFormer outperforms baseline models such as Transformer++ across various model sizes, demonstrating significant gains in computational efficiency [15][17].

Downstream Task Assessment
- On downstream tasks, MUDDPythia achieves higher accuracy in 0-shot and 5-shot evaluations than equivalent Pythia models, indicating enhanced in-context learning capabilities [18][20].
- The model achieves a 2.4x efficiency gain over the 6.9-billion-parameter Pythia and a 4.2x efficiency gain over the 12-billion-parameter Pythia in specific evaluations [18][20].

Conclusion
- MUDDFormer improves on residual connections by establishing independent dynamic cross-layer connections for the different information streams, enhancing cross-layer interaction and in-context learning in Transformers [25].
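The "independent dynamic connections per stream" idea can be illustrated with a small sketch: per-token weights over all previous layers' hidden states are predicted from the current hidden state, producing a separate mixed input for each of the Q, K, V, and R streams. The projection, the softmax normalization of the layer weights, and the shapes are assumptions made for illustration, not the MUDDFormer implementation.

```python
import torch
import torch.nn as nn

class DynamicDenseAggregate(nn.Module):
    def __init__(self, d_model, n_prev_layers, streams=("Q", "K", "V", "R")):
        super().__init__()
        self.streams = streams
        self.n_prev = n_prev_layers
        # One mixing weight per (stream, previous layer), predicted from the current state.
        self.weight_proj = nn.Linear(d_model, len(streams) * n_prev_layers)

    def forward(self, current, history):
        # current: (batch, seq, d); history: list of n_prev tensors, each (batch, seq, d)
        stacked = torch.stack(history, dim=-2)                        # (B, T, n_prev, d)
        w = self.weight_proj(current)                                 # (B, T, streams * n_prev)
        w = w.view(*current.shape[:2], len(self.streams), self.n_prev)
        w = torch.softmax(w, dim=-1)                                  # mix over previous layers
        mixed = torch.einsum("btsl,btld->bstd", w, stacked)           # one mixture per stream
        return {s: mixed[:, i] for i, s in enumerate(self.streams)}   # inputs for Q/K/V/R

agg = DynamicDenseAggregate(d_model=64, n_prev_layers=5)
hist = [torch.randn(2, 10, 64) for _ in range(5)]
stream_inputs = agg(hist[-1], hist)
print({name: t.shape for name, t in stream_inputs.items()})  # each (2, 10, 64)
```

Compared with a plain residual stream, which adds layer outputs with fixed unit weights, this kind of aggregation lets each token weight earlier layers differently for each stream, which is the cross-layer flexibility the article credits for MUDDFormer's gains.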