Is DiT mathematically and formally wrong? Saining Xie responds: don't do science in your head
机器之心· 2025-08-20 04:26
Core Viewpoint
- The article covers criticisms of the DiT model, including claims of architectural flaws, alongside a new method called TREAD that significantly improves training efficiency and image-generation quality compared to DiT [1][4][6].

Group 1
- A recent post on X claims that DiT has architectural defects, sparking significant discussion [1].
- Applied to the DiT backbone, the TREAD method achieves 14×/37× training speedups as measured by FID, while also improving generation quality [2][6].
- The post argues that DiT's FID plateaus too early during training, suggesting "latent architectural defects" that prevent the model from learning further from data [4].

Group 2
- TREAD uses a "token routing" mechanism to raise training efficiency without altering the model architecture: a partial token set is saved and routed around part of the network to reduce computational cost (a toy sketch follows this summary) [6].
- Saining Xie, co-author of the original DiT paper, acknowledges parts of the criticism and emphasizes experimental validation over purely theoretical assertions [28][33].
- Xie also points out that DiT's architecture has some genuine weaknesses, particularly its use of post-layer normalization, which is known to be unstable for tasks with large numerical-range variation [13][36].

Group 3
- The article notes that DiT relies on a simple MLP network to process critical conditioning data, which limits its expressive power [16].
- Xie argues the real issue with DiT lies in its sd-vae component, which is inefficient and has been overlooked for a long time [36].
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, where existing models are continuously questioned and improved [38].
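To make the routing idea concrete, here is a toy PyTorch sketch of TREAD-style token routing under stated assumptions: the layer indices, keep ratio, and class names are invented for illustration, and TREAD's actual routing and re-insertion scheme differs in its details.

```python
import torch
import torch.nn as nn

class RoutedBackbone(nn.Module):
    """Toy DiT-style backbone where, during training, a random subset of
    tokens passes through the middle blocks while the rest are saved and
    re-inserted afterwards (a TREAD-like routing sketch)."""

    def __init__(self, dim=256, depth=8, route_start=2, route_end=6, keep_ratio=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.route_start, self.route_end = route_start, route_end
        self.keep_ratio = keep_ratio

    def forward(self, x):  # x: (batch, tokens, dim)
        B, T, D = x.shape
        for i, blk in enumerate(self.blocks):
            if i == self.route_start and self.training:
                # keep a random subset of tokens for the routed span
                n_keep = max(1, int(T * self.keep_ratio))
                keep = torch.rand(B, T, device=x.device).argsort(dim=1)[:, :n_keep]
                saved = x  # full token set, re-inserted after the routed span
                x = x.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))
            if i == self.route_end and self.training:
                # scatter the processed subset back into the full sequence
                full = saved.clone()
                full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), x)
                x = full
            x = blk(x)
        return x

model = RoutedBackbone().train()
print(model(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The compute saving comes from the middle blocks seeing only half the tokens; at inference time the routing is disabled and the full sequence flows through every block.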
The starting point of end-to-end VLA: a chat about large language models and CLIP
自动驾驶之心· 2025-08-19 07:20
Core Viewpoint
- The article discusses the development and significance of end-to-end (E2E) algorithms in autonomous driving, emphasizing how advanced technologies such as large language models (LLMs), diffusion models, and reinforcement learning (RL) enhance the capabilities of autonomous systems [21][31].

Summary by Sections
Section 1: Overview of End-to-End Autonomous Driving
- The first chapter gives a comprehensive overview of the evolution of end-to-end algorithms, explaining the transition from modular pipelines to end-to-end solutions and weighing the advantages and challenges of the different paradigms [40].
Section 2: Background Knowledge
- The second chapter covers the technical stack around end-to-end systems, detailing LLMs, diffusion models, and reinforcement learning, which are crucial for understanding the future job market in this field [41][42].
Section 3: Two-Stage End-to-End Systems
- The third chapter examines two-stage end-to-end systems, exploring their emergence, advantages, and disadvantages, and reviewing notable works such as PLUTO and CarPlanner [42][43].
Section 4: One-Stage End-to-End and VLA
- The fourth chapter covers one-stage end-to-end systems, including perception-based methods and the latest advances in VLA (Vision-Language-Action) models, which are pivotal to the ultimate goals of autonomous driving [44][50].
Section 5: Practical Application and RLHF Fine-Tuning
- The fifth chapter centers on a major project on RLHF (Reinforcement Learning from Human Feedback) fine-tuning, with practical guidance on building pre-training and reinforcement-learning modules applicable to VLA-related algorithms [52].
Course Structure and Learning Outcomes
- The course aims to equip participants with a solid understanding of end-to-end autonomous driving technologies, covering the essential frameworks and methodologies and preparing them for industry roles [56][57].
Musk: Google is most likely to lead the AI industry
36Kr· 2025-08-15 01:21
Core Viewpoint
- Elon Musk called Google the most likely leader in the artificial-intelligence industry, citing its significant compute and data advantages, while noting that this situation may change in the coming years [1].

Group 1: Google's Position in AI
- Google is a cornerstone of the AI field: its groundbreaking 2017 paper "Attention Is All You Need" introduced the Transformer, the architecture underpinning large language models such as ChatGPT [1].
- Google has invested in AI startups including Anthropic and Safe Superintelligence, holding a 14% stake in Anthropic [1].
- On its recent Q2 earnings call, Google announced a $10 billion increase in this year's capital expenditures, bringing the total to $85 billion, to meet growing demand for chips and AI products [2].

Group 2: Musk's Disputes and xAI
- Musk's comments coincided with escalating tensions with OpenAI CEO Sam Altman, including Musk's threat to sue Apple for promoting OpenAI's ChatGPT over his own xAI's Grok chatbot [3].
- Musk and Altman co-founded OpenAI in 2015, but their relationship soured after Musk left the board in 2018 over diverging views [3].
- Musk established xAI in July 2023; the company has since raised over $12 billion across multiple funding rounds [3].
- Tesla is set to hold a shareholder vote on whether to invest in xAI, though the timing of the vote has not been specified [4].
- Musk said that if the decision were solely his, Tesla would already have invested in xAI [5].
Wang Guan again: a 27M small model surpasses o3-mini! The post-2000s founder who turned down Musk really is different
Sohu Caijing· 2025-08-10 04:21
Core Insights
- The article covers the Hierarchical Reasoning Model (HRM) from Sapient Intelligence, which achieves performance superior to much larger models with far fewer parameters [3][5][18].
- HRM uses techniques inspired by brain function, allowing it to perform complex reasoning tasks efficiently without relying on traditional large-scale pre-training [4][12][14].

Model Performance
- With only 27 million parameters, HRM surpassed larger models such as o3-mini-high and Claude 3.7 in various tests, achieving 40.3% accuracy on the ARC-AGI challenge [16][18].
- On extreme Sudoku tasks, HRM demonstrated near-perfect accuracy where traditional models struggled badly [16][18].

Technical Innovations
- HRM's two recurrent modules mimic the brain's hierarchical processing and separation of timescales: a slow high-level module sets global direction while a fast low-level module handles local execution (see the sketch after this summary) [4][7].
- A hierarchical convergence mechanism avoids premature convergence: each high-level update adaptively sets a fresh goal for the low-level module [9][11].
- Approximate gradient techniques reduce memory usage and computation, in line with biologically plausible learning [12].
- An integrated deep supervision mechanism periodically evaluates and adjusts the model during learning, correcting deviations promptly [13][14].

Developer Background
- HRM's creator, Wang Guan, is a young entrepreneur who previously declined offers from major tech figures such as Elon Musk, aiming instead to rethink AI architecture [20][22].
- Wang co-founded Sapient Intelligence in 2024, focusing on models with advanced reasoning and planning capabilities [22].
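The two-timescale recurrence can be sketched as follows. This is a minimal illustration of the idea described above, not HRM's actual architecture: the module types (GRU cells), sizes, and update counts are all assumptions.

```python
import torch
import torch.nn as nn

class TwoTimescaleReasoner(nn.Module):
    """Toy hierarchical recurrence: a fast low-level RNN runs several steps
    per single update of a slow high-level RNN; each high-level update then
    seeds a fresh local context (avoiding premature convergence)."""

    def __init__(self, dim=64, t_low=4, n_high=3):
        super().__init__()
        self.low = nn.GRUCell(dim, dim)   # fast module: local execution
        self.high = nn.GRUCell(dim, dim)  # slow module: global direction
        self.t_low, self.n_high = t_low, n_high

    def forward(self, x):  # x: (batch, dim), a single encoded puzzle
        h_high = torch.zeros_like(x)
        for _ in range(self.n_high):
            h_low = h_high                        # high level sets a new local goal
            for _ in range(self.t_low):
                h_low = self.low(x, h_low)        # fast inner loop converges locally
            h_high = self.high(h_low, h_high)     # one slow outer update
        return h_high

model = TwoTimescaleReasoner()
print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```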
The Autonomous Driving Heart technical discussion groups are here!
自动驾驶之心· 2025-07-29 07:53
Core Viewpoint
- The article announces a leading Chinese communication platform for autonomous-driving technology, covering industry, academia, and career development [1].

Group 1
- The platform, "Autonomous Driving Heart" (自动驾驶之心), facilitates discussion and exchange among professionals across fields related to autonomous-driving technology [1].
- Its technical discussion groups cover a wide range of topics, including large models, end-to-end systems, VLA, BEV perception, multi-modal perception, occupancy, online mapping, 3DGS, multi-sensor fusion, Transformers, point-cloud processing, SLAM, depth estimation, trajectory prediction, high-precision maps, NeRF, planning and control, model deployment, autonomous-driving simulation and testing, product management, hardware configuration, and AI job exchange [1].
- Interested readers can join the community by adding the WeChat assistant and providing their company/school, nickname, and research direction [1].
The internet is going wild over Grok4: it passed the bouncing-ball programming test; Epic founder: this is AGI
猿大侠· 2025-07-12 01:45
Core Viewpoint
- The article discusses the rapid adoption and impressive capabilities of Elon Musk's Grok4 model, highlighting its performance in various tests and comparisons with other models such as OpenAI's o3.

Group 1: Performance Highlights
- Grok4 passed the "ball bouncing inside a rotating hexagon" programming test, showing that it can respect basic physical laws (a sketch of the task follows this summary) [2][12].
- In a comprehensive evaluation, Grok4 outperformed o3 on all eight tasks, including complex legal reasoning and code translation [18][20][23].
- Tim Sweeney, founder of Epic Games, praised Grok4 as a form of Artificial General Intelligence (AGI) after it provided deep insights on a problem it had never seen [9][10].

Group 2: User Interactions and Applications
- Users have engaged Grok4 in creative ways, such as visualizing mathematical concepts and generating SVG graphics, demonstrating its versatility [25][32].
- A user named Dan produced a visualization of Euler's identity with minimal interaction, indicating Grok4's efficiency at generating complex outputs [26][31].
- A higher-level application called "Expert Conductor" simulates an expert-collaboration environment, further showcasing Grok4's potential in problem-solving [54][56].

Group 3: Community Engagement
- The article encourages readers to share their own innovative uses of Grok4, indicating growing community interest and engagement with the model [66].
- Various users have reported their experiences and findings, contributing to a collaborative exploration of Grok4's capabilities [12][66].
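For readers unfamiliar with the test, below is a rough headless sketch of what the task asks a model to write: a ball under gravity bouncing inside a spinning hexagon, with the walls treated as moving rigid surfaces. All constants and the collision handling here are illustrative assumptions, not Grok4's actual output.

```python
import numpy as np

# Ball bouncing inside a rotating hexagon: gravity pulls the ball down,
# the hexagon spins at OMEGA rad/s, and collisions reflect the ball's
# velocity relative to the moving wall with restitution REST.
R, BALL_R, G, OMEGA, DT, REST = 1.0, 0.05, 9.8, 1.0, 0.002, 0.9
pos, vel, theta = np.array([0.0, 0.3]), np.array([0.5, 0.0]), 0.0

def hexagon(theta):
    angles = theta + np.arange(6) * np.pi / 3
    return np.stack([R * np.cos(angles), R * np.sin(angles)], axis=1)

for step in range(5000):
    theta += OMEGA * DT
    vel = vel + np.array([0.0, -G]) * DT
    pos = pos + vel * DT
    verts = hexagon(theta)
    for i in range(6):
        a, b = verts[i], verts[(i + 1) % 6]
        edge = b - a
        n = np.array([-edge[1], edge[0]])
        n /= np.linalg.norm(n)
        if np.dot(n, a) > 0:            # make the normal point inward
            n = -n
        d = np.dot(n, pos - a)          # signed distance inward from the wall
        if d < BALL_R:
            contact = pos - d * n
            wall_v = OMEGA * np.array([-contact[1], contact[0]])  # rigid rotation
            rel = vel - wall_v
            if np.dot(rel, n) < 0:      # only if moving into the wall
                rel = rel - (1 + REST) * np.dot(rel, n) * n
                vel = rel + wall_v
            pos = pos + (BALL_R - d) * n  # push the ball back inside
    if step % 1000 == 0:
        print(f"t={step * DT:.2f}s  pos=({pos[0]:+.3f}, {pos[1]:+.3f})")
```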
"Tokens are bullshit": Mamba author drops a subversive take, exposing deep flaws in Transformers
机器之心· 2025-07-09 09:52
Core Viewpoint
- The article examines the trade-offs between State Space Models (SSMs) and Transformers, arguing that tokenization is a limitation SSMs can remove, yielding better computational efficiency and modeling capability [1][3][61].

Group 1: State Space Models (SSMs)
- An SSM is presented as a modernized recurrent neural network (RNN) whose key features let it match Transformers in language-modeling performance [8][10].
- A defining property of an SSM is that its hidden-state dimension is larger than its input and output dimensions, giving it more room to store context [9][10].
- The state-update function must be expressive enough to accurately encode and retrieve the necessary information; selective SSMs achieve this with input-dependent (dynamic) transition matrices (a minimal recurrence sketch follows this summary) [11][12].
- Mamba, a specific SSM, integrates parallelization and memory-management techniques to enhance computational efficiency [13][14].
- With matched compute budgets, SSMs can outperform Transformers on language-modeling tasks [53][56].

Group 2: Transformers
- Transformers excel at fine-grained operations on individual tokens but suffer quadratic complexity, limiting their efficiency [82][86].
- The article argues that Transformers carry an inductive bias that shapes their modeling: they are sensitive to the resolution and semantic content of the data [83][85].
- Despite their strengths, Transformers are not the final answer for every modeling task, and significant work remains in the field [89].

Group 3: Tokenization
- Tokenization is a standard step in language modeling, but it limits a model's grasp of language details [39][40].
- Removing tokenization could improve model performance and aligns with the essence of deep learning: minimizing manual feature engineering [44][45].
- Without tokenization, models could learn more effective patterns directly from raw data, enhancing their capabilities [46][52].
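A minimal sketch of the recurrence described above, assuming a fixed stable diagonal transition and input-dependent write/read vectors; this is not Mamba's exact parameterization (no discretization step, gating, or convolution), just the selective-SSM shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, T = 4, 16, 32          # hidden state larger than I/O, as described

A = rng.uniform(0.8, 0.99, d_state)                # stable diagonal transition
W_B = rng.normal(0, 0.1, (d_state, d_in))          # produces input-dependent B_t
W_C = rng.normal(0, 0.1, (d_state, d_in))          # produces input-dependent C_t

def selective_ssm(x):
    """h_t = A * h_{t-1} + B(x_t);  y_t = <C(x_t), h_t>.
    The write vector B_t folds x_t in directly for simplicity; the point is
    that how the state is written and read depends on the current input."""
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                      # sequential scan; Mamba parallelizes this
        B_t = W_B @ x_t                # input-dependent write into the state
        C_t = W_C @ x_t                # input-dependent readout of the state
        h = A * h + B_t                # state carries a compressed running context
        ys.append(np.dot(C_t, h))      # scalar output per step, for brevity
    return np.array(ys)

y = selective_ssm(rng.normal(size=(T, d_in)))
print(y.shape)  # (32,)
```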
A Transformer blind spot: with only 500 post-training steps, recurrent models break the 256k length-generalization limit
机器之心· 2025-07-08 04:09
Core Insights
- The article highlights the advantages of linear recurrent models, such as Mamba, and linear attention mechanisms in handling long sequences, which is crucial for long-context reasoning tasks [1][2].
- Recurrent models have improved over time to the point of competing with Transformers on many tasks, despite earlier limitations [3].
- A key finding is that recurrent models struggle to generalize beyond their training length, with performance dropping on longer sequences [4][6].

Group 1
- The paper addresses this generalization failure with simple training interventions: just 500 additional training steps let recurrent models generalize to sequences up to 256k tokens [7].
- The takeaway is that recurrent models have untapped potential rather than inherent flaws [7][8].
- The authors propose the "Unexplored States Hypothesis" to explain the failure: during training, recurrent models see only a limited subset of the states they would reach on long sequences [13][14].

Group 2
- Four training interventions improve length generalization by altering the model's initial state [19].
- They are Random Noise, Fitted Noise, State Passing, and Truncated Backpropagation Through Time (TBTT), each designed to expose the model to a broader range of state distributions (a State Passing sketch follows this summary) [19][20].
- State Passing and TBTT effectively enable length generalization, at only 0.02% of the original pre-training budget [23][24].

Group 3
- The interventions improve length generalization across a range of long-context tasks [31].
- On the BABILong benchmark, password retrieval, and synthetic copying tasks, the interventions significantly improved model performance [32][35][39].
- Models trained with these interventions can exploit relationships between tokens beyond the training context length [36][39].

Group 4
- The article introduces "Effective Remembrance" to measure how strongly earlier tokens still influence a model's current prediction; ideally, models should weight recent context over distant tokens [44][50].
- State Passing improves effective memory, letting models prioritize recent tokens in their predictions [51][52].
- This adjustment is crucial for text modeling, ensuring that earlier tokens do not disproportionately influence the model's output [52].
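Here is a sketch of the State Passing intervention under stated assumptions (the GRU, data, and training loop are stand-ins, not the paper's setup): the final recurrent state of each batch is detached and reused as the next batch's initial state, so training visits non-zero initial states like those reached deep into long sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 8)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

state = None                                  # zeros only for the very first batch
for step in range(5):
    x = torch.randn(4, 16, 8)                 # (batch, seq, features) dummy data
    y = torch.randn(4, 16, 8)
    out, state = rnn(x, state)                # start from the carried-over state
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = state.detach()                    # pass state across batch boundaries
    print(f"step {step}: loss {loss.item():.4f}")
```

Detaching the state is what distinguishes this from full backpropagation through time: later batches see realistic initial states without gradients flowing across batch boundaries.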
Meta's new attention mechanism breaks the Transformer ceiling, using OpenAI's open-source technology
量子位· 2025-07-07 09:35
Core Viewpoint
- Meta, leveraging OpenAI's open-source technology and a wave of hires from OpenAI, has developed a new architecture called the 2-Simplicial Transformer, which improves how efficiently training data is used in large models [1][2][26].

Group 1: New Architecture and Methodology
- The 2-Simplicial Transformer modifies standard attention mechanisms to use data more efficiently, addressing the data bottleneck in current large-model development [2][4].
- The core method extends standard dot-product attention to a trilinear function, allowing better expression of complex tasks (a dense-form sketch follows this summary) [3][6].
- A new key vector, K', is introduced so attention can capture richer three-way relationships [9][10].

Group 2: Performance and Scalability
- Experimental results show the 2-Simplicial Transformer outperforming traditional Transformers on mathematical, programming, and reasoning tasks, especially as model parameters increase [4][19].
- Its scaling exponent is superior to that of traditional Transformers, meaning performance improves more rapidly with additional parameters and data, an advantage in data-limited scenarios [20][22].
- Across tasks, the 2-Simplicial Transformer posts better performance metrics than traditional Transformers, particularly at larger model sizes [18][21].

Group 3: Implementation and Challenges
- The implementation uses Triton, OpenAI's open-source GPU programming framework, which enables efficient computation without extensive CUDA experience [11][12].
- Despite its advantages, the computational complexity and latency of the 2-Simplicial Transformer remain high, indicating a need for further optimization before production use [22].
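A dense, naive sketch of the trilinear attention described above: the logits follow the trilinear form with the extra key K', while the pairwise value combination shown here is an assumption, and the real implementation is a tiled Triton kernel rather than this O(n³) tensor.

```python
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Toy dense 2-simplicial attention: each query attends over *pairs*
    of positions (j, k), with logits <q_i, k1_j, k2_k> and values combined
    elementwise per pair. Shapes: (batch, seq, dim) throughout."""
    logits = torch.einsum("bid,bjd,bkd->bijk", q, k1, k2) / q.shape[-1] ** 0.5
    b, n = q.shape[0], q.shape[1]
    # softmax jointly over all (j, k) pairs for each query position i
    attn = logits.reshape(b, n, n * n).softmax(-1).reshape(b, n, n, n)
    pair_v = torch.einsum("bjd,bkd->bjkd", v1, v2)   # elementwise pair values
    return torch.einsum("bijk,bjkd->bid", attn, pair_v)

q = torch.randn(2, 5, 8)
out = two_simplicial_attention(q, *[torch.randn(2, 5, 8) for _ in range(4)])
print(out.shape)  # torch.Size([2, 5, 8])
```

The cubic logit tensor makes clear why a tiled GPU kernel is essential: materializing all (i, j, k) scores is only feasible for tiny sequence lengths.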
DeepSeek technical deep dive (3): the evolution of MoE
自动驾驶之心· 2025-07-06 08:44
Core Viewpoint
- The article traces DeepSeek's evolution of Mixture-of-Experts (MoE) models, highlighting the innovations and improvements from DeepSeekMoE (V1) through DeepSeek V3 while maintaining the MoE technology route [1].

Summary by Sections
1. Development History of MoE
- MoE was first introduced in 1991 with the paper "Adaptive Mixtures of Local Experts," and its basic framework has remained consistent over the years [2].
- Google has been a key player in modern MoE development, notably with the 2020 release of "GShard," which scaled models to 600 billion parameters [5].

2. DeepSeek's Work
2.1. DeepSeek-MoE (V1)
- DeepSeek V1, released in January 2024, addressed two main issues: knowledge hybridity and knowledge redundancy among experts [15].
- Its architecture introduced fine-grained expert segmentation and shared-expert isolation to enhance specialization and reduce redundancy [16].
2.2. DeepSeek V2 MoE Upgrade
- V2 introduced a device-limited routing mechanism to control communication costs by ensuring each token's activated experts sit on a limited number of devices [28].
- A communication balance loss was added to relieve potential congestion on the receiving side of the communication [29].
2.3. DeepSeek V3 MoE Upgrade
- V3 kept the fine-grained and shared-expert designs while upgrading the gating network from softmax to sigmoid for better score differentiation among experts (a routing sketch follows this summary) [36][38].
- The auxiliary load-balancing loss was eliminated to avoid its negative impact on the main objective, replaced by a dynamically adjusted per-expert bias [40].
- A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the sequence level [42].

3. Summary of DeepSeek's Innovations
- DeepSeek MoE's evolution balances general knowledge (shared experts) against specialized knowledge (fine-grained experts) while addressing load balancing through progressively lighter-weight mechanisms [44].
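A sketch of V3-style routing as summarized above, assuming sigmoid affinities, a bias that affects only expert selection, and renormalization over the chosen experts; the exact normalization and the bias-update rule are simplified assumptions.

```python
import torch

def v3_style_router(x, expert_centroids, bias, top_k=8):
    """Sigmoid-gated top-k routing with a selection-only bias (the
    'auxiliary-loss-free' load-balancing idea): the bias steers which
    experts are picked, but gate values use the raw affinities.
    x: (tokens, dim); expert_centroids: (n_experts, dim); bias: (n_experts,)."""
    affinity = torch.sigmoid(x @ expert_centroids.T)          # (tokens, n_experts)
    selected = (affinity + bias).topk(top_k, dim=-1).indices  # bias shifts selection only
    gates = torch.zeros_like(affinity).scatter(
        1, selected, affinity.gather(1, selected))
    gates = gates / gates.sum(-1, keepdim=True)               # normalize over chosen experts
    return gates, selected

x = torch.randn(4, 16)
centroids = torch.randn(64, 16)
bias = torch.zeros(64)
gates, selected = v3_style_router(x, centroids, bias)
print(gates.shape, selected.shape)  # torch.Size([4, 64]) torch.Size([4, 8])

# Outside the gradient path, the bias would be nudged each step:
# decreased for overloaded experts, increased for underloaded ones.
```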