Transformer

Musk: Google Is Most Likely to Become the Leader of the AI Industry
36Ke· 2025-08-15 01:21
In a rare move, Musk praised his rival Google on X, saying it is currently the most likely company to lead the artificial intelligence industry. On Thursday (August 14), Tesla CEO Elon Musk offered rare praise for his AI competitor Google. Musk posted on the social platform X that "Google currently has the biggest compute (and data) advantage, so it is currently the most likely to be the leader of the industry." But he added, "That could change within a few years... For the foreseeable future the big AI companies will continue to thrive, and so will xAI (Musk's AI company). There is so much to do!" Google's position is "solid". Google has long been regarded as a mainstay of artificial intelligence. In 2017, Google Research published a paper titled "Attention Is All You Need". That groundbreaking work introduced the world to the concept of the Transformer, the technology underpinning large language models such as ChatGPT. In July 2023 Musk founded his own AI startup, xAI, and launched its chatbot Grok later that year. In 2024 xAI went on to raise more than $12 billion across its Series A, B, and C funding rounds. On the question of investing in xAI, Musk said on X last month that his electric vehicle company Tesla would ask shareholders to vote on whether to invest in xAI, though he did not ...
Wang Guan Again: A 27M Small Model Beats o3-mini! The Post-00s Founder Who Turned Down Musk Really Is Different
Sou Hu Cai Jing· 2025-08-10 04:21
Wen Le, 量子位 | WeChat official account QbitAI. A 27M small model beats o3-mini-high and DeepSeek-R1, and its reasoning does not rely on chain-of-thought. The developer is Wang Guan, founder of Sapient Intelligence, the post-00s Tsinghua alumnus who turned down Musk and set out to challenge the Transformer. With just 27 million parameters, the model precisely overtakes today's large models. No remedial pre-training and no chain-of-thought scratchpad: with only 1,000 training samples it handles extreme Sudoku and 30x30 mazes cleanly. The 27M model is Sapient's newly released open-source, reproducible Hierarchical Reasoning Model (HRM), which imitates the brain's hierarchical processing and multi-timescale operation to overcome the computational limits of the standard Transformer. So how does a model this small manage it? The core is a brain-inspired two-level recurrent module design. HRM's strong performance stems from the clever design of its five core techniques. The first is hierarchical recurrent modules with timescale separation. Inspired by the hierarchical processing and temporal separation mechanisms of the brain's cortical regions, HRM pairs two cooperating recurrent modules: a high-level module responsible for slow-paced abstract planning and a low-level module handling fast-paced detailed computation. Without explicit supervision of the intermediate process, reasoning is completed in a single pass. The two operate at different time ...
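To make the two-timescale idea concrete, here is a minimal sketch of a hierarchical recurrent core: a fast low-level cell runs several steps per single step of a slow high-level cell. This is not Sapient's HRM code; the cell types (GRU cells), the dimensions, and the number of fast steps per slow cycle are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoTimescaleRecurrentCore(nn.Module):
    """Minimal sketch of a hierarchical recurrent core: a fast low-level
    worker paired with a slow high-level planner (illustrative, not HRM itself)."""

    def __init__(self, input_dim: int, hidden_dim: int, fast_steps_per_slow_step: int = 4):
        super().__init__()
        self.fast_steps = fast_steps_per_slow_step
        # Low-level module: fast, detail-oriented updates, conditioned on the input and the current plan.
        self.low = nn.GRUCell(input_dim + hidden_dim, hidden_dim)
        # High-level module: slow, abstract updates driven by the low-level summary.
        self.high = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x: torch.Tensor, slow_cycles: int = 8) -> torch.Tensor:
        batch = x.size(0)
        h_low = x.new_zeros(batch, self.low.hidden_size)
        h_high = x.new_zeros(batch, self.high.hidden_size)
        for _ in range(slow_cycles):
            # Several fast low-level steps per slow cycle (timescale separation).
            for _ in range(self.fast_steps):
                h_low = self.low(torch.cat([x, h_high], dim=-1), h_low)
            # One slow high-level step, integrating the low-level result.
            h_high = self.high(h_low, h_high)
        return h_high  # abstract state fed to a task-specific prediction head

# Usage: a single forward pass runs the whole iterative reasoning loop.
core = TwoTimescaleRecurrentCore(input_dim=81, hidden_dim=128)  # e.g. a flattened 9x9 Sudoku encoding
print(core(torch.randn(2, 81)).shape)  # torch.Size([2, 128])
```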
The "Autonomous Driving Heart" Technical Discussion Group Is Here!
自动驾驶之心· 2025-07-29 07:53
Core Viewpoint - The article emphasizes the establishment of a leading communication platform for autonomous driving technology in China, focusing on industry, academic, and career development aspects [1]. Group 1 - The platform, named "Autonomous Driving Heart," aims to facilitate discussions and exchanges among professionals in various fields related to autonomous driving technology [1]. - The technical discussion group covers a wide range of topics including large models, end-to-end systems, VLA, BEV perception, multi-modal perception, occupancy, online mapping, 3DGS, multi-sensor fusion, transformers, point cloud processing, SLAM, depth estimation, trajectory prediction, high-precision maps, NeRF, planning control, model deployment, autonomous driving simulation testing, product management, hardware configuration, and AI job exchange [1]. - Interested individuals are encouraged to join the community by adding a WeChat assistant and providing their company/school, nickname, and research direction [1].
Grok4 Goes Viral Across the Internet and Passes the Bouncing-Ball Programming Test; Epic Founder: This Is AGI
猿大侠· 2025-07-12 01:45
Core Viewpoint - The article discusses the rapid adoption and impressive capabilities of Elon Musk's Grok4 AI model, highlighting its performance in various tests and comparisons with other models like OpenAI's o3. Group 1: Performance Highlights - Grok4 successfully passed the hexagonal ball programming test, showcasing its ability to understand physical laws [2][12]. - In a comprehensive evaluation, Grok4 outperformed o3 in all eight tasks, including complex legal reasoning and code translation [23][18][20]. - Tim Sweeney, founder of Epic Games, praised Grok4 as a form of Artificial General Intelligence (AGI) after it provided deep insights on a previously unseen problem [9][10]. Group 2: User Interactions and Applications - Users have engaged with Grok4 in creative ways, such as visualizing mathematical concepts and generating SVG graphics, demonstrating its versatility [25][32]. - A user named Dan was able to create a visualization of Euler's identity with minimal interaction, indicating Grok4's efficiency in generating complex outputs [31][26]. - The article mentions a high-level application called "Expert Conductor," which simulates an expert collaboration environment, further showcasing Grok4's potential in problem-solving [54][56]. Group 3: Community Engagement - The article encourages readers to share their innovative uses of Grok4, indicating a growing community interest and engagement with the AI model [66]. - Various users have reported their experiences and findings, contributing to a collaborative exploration of Grok4's capabilities [12][66].
"Tokens Are Bullshit": Mamba Author Advances a Disruptive View, Exposing Deep Flaws in the Transformer
机器之心· 2025-07-09 09:52
Core Viewpoint - The article discusses the trade-offs between State Space Models (SSM) and Transformers, arguing that tokenization is a limitation that SSM can overcome, leading to better computational efficiency and modeling capabilities [1][3][61]. Group 1: State Space Models (SSM) - SSM is defined as a modern version of recurrent neural networks (RNN) with key features that allow it to match the language modeling performance of Transformers [8][10]. - A significant characteristic of SSM is that its hidden state dimension is greater than the input and output dimensions, allowing for better context storage [9][10]. - The model's state update function must be expressive enough to accurately encode and retrieve necessary information, which is achieved through dynamic transfer matrices in selective SSM [11][12]. - Mamba, a specific SSM, integrates parallelization and memory management techniques to enhance computational efficiency [13][14]. - The article highlights that SSMs can outperform Transformers in language modeling tasks when computational resources are matched [53][56]. Group 2: Transformers - Transformers excel in tasks requiring fine-grained operations on individual tokens, but they suffer from quadratic complexity, limiting their efficiency [82][86]. - The article argues that Transformers have an inductive bias that affects their modeling capabilities, making them sensitive to the resolution and semantic content of the data [83][85]. - Despite their strengths, Transformers are not the ultimate solution for all modeling tasks, and there is still significant work to be done in the field [89]. Group 3: Tokenization - Tokenization is a critical step in language modeling, but it introduces limitations in understanding language details [39][40]. - The article posits that removing tokenization could lead to better model performance and aligns with the essence of deep learning, which aims to minimize manual feature engineering [44][45]. - The author suggests that without tokenization, models could learn more effective patterns directly from raw data, enhancing their capabilities [46][52].
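For readers who want a concrete picture of "a recurrent model whose state is wider than its input/output and whose update depends on the input", here is a naive, unoptimized sketch of a selective state-space style layer. It is not Mamba's implementation (which relies on a structured state-space parameterization and fused parallel scans); the gating parameterization and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinySelectiveSSM(nn.Module):
    """Naive recurrent sketch of a selective state-space layer: the hidden state
    is larger than the I/O width, and the state transition and injection are
    functions of the current input (the "selective" part)."""

    def __init__(self, d_model: int = 64, d_state: int = 256):
        super().__init__()
        assert d_state > d_model  # key point: state dimension exceeds the I/O dimension
        self.d_state = d_state
        self.to_decay = nn.Linear(d_model, d_state)   # input-dependent "forget" gate
        self.to_inject = nn.Linear(d_model, d_state)  # input-dependent state injection
        self.out = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.d_state)
        outputs = []
        for t in range(seq_len):
            xt = x[:, t]
            decay = torch.sigmoid(self.to_decay(xt))             # per-step, per-channel decay in (0, 1)
            h = decay * h + (1.0 - decay) * self.to_inject(xt)   # input-dependent state update
            outputs.append(self.out(h))
        return torch.stack(outputs, dim=1)

# Usage
layer = TinySelectiveSSM()
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```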
A Transformer Blind Spot: With Only 500 Post-Training Steps, Recurrent Models Break the 256k Length-Generalization Limit
机器之心· 2025-07-08 04:09
Core Insights - The article discusses the advantages of linear recurrent models, such as Mamba, and linear attention mechanisms in handling long sequences, which is crucial for long-context reasoning tasks [1][2] - It highlights the performance improvements of recurrent models over time, indicating that they can now compete with Transformers in various tasks, despite previous limitations [3] - A significant finding is that recurrent models struggle with generalization beyond training lengths, leading to performance drops when faced with longer sequences [4][6] Group 1 - The article presents a solution to the generalization issue in recurrent models through simple training interventions, allowing them to generalize to sequences up to 256k in length with just 500 additional training steps [7] - The research emphasizes that recurrent models possess untapped potential rather than inherent flaws [7][8] - The authors propose the "Unexplored States Hypothesis" to explain why recurrent models fail to generalize in length, indicating that they only learn from a limited subset of possible states during training [13][14] Group 2 - The article outlines four training interventions to improve length generalization by altering the initial state of the model [19] - These interventions include Random Noise, Fitted Noise, State Passing, and Truncated Backpropagation Through Time (TBTT), each designed to expose the model to a broader range of state distributions [20][19] - The findings reveal that State Passing and TBTT mechanisms effectively enable length generalization, achieving results with only 0.02% of the original pre-training budget [23][24] Group 3 - The article discusses the performance of these interventions in various long-context tasks, demonstrating their ability to enhance length generalization [31] - Specific tasks mentioned include the BABILong benchmark, password retrieval, and synthetic copying tasks, where the interventions significantly improved model performance [32][35][39] - The results indicate that models trained with these interventions can effectively utilize relationships between tokens beyond the training context length [36][39] Group 4 - The article introduces the concept of "Effective Remembrance" to measure how well a model retains information from previous tokens, aiming for models to focus on recent context rather than distant tokens [44][50] - It shows that State Passing improves effective memory, allowing models to prioritize recent tokens in their predictions [51][52] - This adjustment is crucial for text modeling, ensuring that earlier tokens do not disproportionately influence the model's output [52]
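As a concrete reading of the State Passing intervention described above, the sketch below carries the detached final recurrent state of one training batch into the next as its initial state, with occasional resets, so training exposes the model to non-zero starting states. The toy model, its (tokens, state) interface, and the reset probability are hypothetical; this is not the paper's code.

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    """Stand-in recurrent language model with an explicit recurrent state (hypothetical)."""
    def __init__(self, vocab: int = 100, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.emb(tokens), state)
        return self.head(h), state

model = TinyRecurrentLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

state = None
for step in range(4):                                  # toy training batches
    tokens = torch.randint(0, 100, (2, 32))
    targets = torch.randint(0, 100, (2, 32))
    if state is None or torch.rand(()) > 0.9:          # occasional reset back to the zero state
        state = None
    logits, state = model(tokens, state)
    loss = loss_fn(logits.reshape(-1, 100), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = state.detach()                             # State Passing: carry the final state into the next batch
```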
Meta's New Attention Mechanism Breaks Through the Transformer's Ceiling, Using OpenAI's Open-Source Technology
量子位· 2025-07-07 09:35
Yu Yang, 量子位 | WeChat official account QbitAI. After poaching a large batch of OpenAI employees, Meta has now used OpenAI's technology to deliver a new breakthrough; talk about adding insult to injury (doge). The new architecture is called the 2-Simplicial Transformer. The point is to modify standard attention so the Transformer uses its training data more efficiently, breaking through the data bottleneck constraining current large-model development. The core method builds on Triton, OpenAI's open-source GPU programming language, to generalize standard dot-product attention to a trilinear function. Experimental results show that, at equal parameter counts and data volumes, the new architecture outperforms the traditional Transformer on mathematics, programming, and reasoning tasks. Moreover, the 2-Simplicial Transformer's scaling exponent is higher than the traditional Transformer's, which means that as parameters grow, models built on the new architecture improve faster, making it better suited to data-limited settings. Trilinear attention: the core mechanism of the traditional Transformer is dot-product attention, which is computationally cheap but limited in expressive power on complex tasks (such as logical reasoning and mathematical operations). To address this, Meta's study focuses on extending dot-product attention from a bilinear operation to a trilinear one. Put simply, a third vector is introduced when computing attention, to increase the model's ability to capture complex patterns ...
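The "third vector" idea can be illustrated with a naive trilinear attention function: each query scores pairs of key positions through a three-way product instead of the usual bilinear query-key dot product. The real 2-Simplicial Transformer relies on a fused Triton kernel and a windowed formulation to keep this tractable; the scaling and, in particular, how values are combined over the key pair below are assumptions made only for illustration.

```python
import torch

def trilinear_attention(q, k1, k2, v, scale=None):
    """Naive O(n^3) sketch of trilinear attention: each query attends over
    pairs of key positions via a three-way product q * k1 * k2, instead of
    the usual bilinear q * k score."""
    n, d = q.shape[-2], q.shape[-1]
    scale = scale if scale is not None else d ** -0.5
    # scores[i, j, l] = sum_d q[i, d] * k1[j, d] * k2[l, d]
    scores = torch.einsum('id,jd,ld->ijl', q, k1, k2) * scale
    weights = torch.softmax(scores.reshape(n, n * n), dim=-1).reshape(n, n, n)
    # Values are indexed by the key pair (j, l); here, as an assumption,
    # the pair value is the elementwise product of the two value rows.
    pair_values = torch.einsum('jd,ld->jld', v, v)
    return torch.einsum('ijl,jld->id', weights, pair_values)

# Usage
q = k1 = k2 = v = torch.randn(8, 16)   # sequence length 8, head dimension 16
print(trilinear_attention(q, k1, k2, v).shape)  # torch.Size([8, 16])
```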
DeepSeek Technical Deep Dive (3): The Evolution Path of MoE
自动驾驶之心· 2025-07-06 08:44
Core Viewpoint - The article discusses the evolution of DeepSeek in the context of Mixture-of-Experts (MoE) models, highlighting innovations and improvements from DeepSeekMoE (V1) to DeepSeek V3, while maintaining a focus on the MoE technology route [1]. Summary by Sections 1. Development History of MoE - MoE was first introduced in 1991 with the paper "Adaptive Mixtures of Local Experts," and its framework has remained consistent over the years [2]. - Google has been a key player in the development of MoE, particularly with the release of "GShard" in 2020, which scaled models to 600 billion parameters [5]. 2. DeepSeek's Work 2.1. DeepSeek-MoE (V1) - DeepSeek V1 was released in January 2024, addressing two main issues: knowledge mixing and redundancy among experts [15]. - The architecture introduced fine-grained expert segmentation and shared expert isolation to enhance specialization and reduce redundancy [16]. 2.2. DeepSeek V2 MoE Upgrade - V2 introduced a device-limited routing mechanism to control communication costs by ensuring that activated experts are distributed across a limited number of devices [28]. - A communication balance loss was added to address potential congestion issues at the receiving end of the communication [29]. 2.3. DeepSeek V3 MoE Upgrade - V3 maintained the fine-grained expert and shared expert designs while upgrading the gating network from Softmax to Sigmoid to improve scoring differentiation among experts [36][38]. - The auxiliary loss for load balancing was eliminated to reduce its negative impact on the main model, replaced by a dynamic bias for load balancing [40]. - A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the sequence level [42]. 3. Summary of DeepSeek's Innovations - The evolution of DeepSeek MoE has focused on balancing general knowledge and specialized knowledge through shared and fine-grained experts, while also addressing load balancing through various auxiliary losses [44].
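To ground the V3-era routing changes, here is a small sketch of sigmoid gating with top-k selection and a non-learned per-expert bias that only affects which experts are chosen, nudged to rebalance load in place of an auxiliary loss. The expert count, the bias update rule, and the omission of shared experts are illustrative assumptions, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Sketch of V3-style routing: sigmoid affinity scores plus a non-learned
    per-expert bias that influences only expert selection (not the combination
    weights), adjusted toward balanced load instead of using an auxiliary loss."""

    def __init__(self, d_model: int = 64, n_experts: int = 16, top_k: int = 4, bias_step: float = 1e-3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.register_buffer("select_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.bias_step = bias_step

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model)
        scores = torch.sigmoid(self.gate(x))                                # sigmoid instead of softmax scoring
        _, idx = torch.topk(scores + self.select_bias, self.top_k, dim=-1)  # bias affects selection only
        weights = torch.gather(scores, -1, idx)                             # combination weights use raw scores
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Dynamic bias update toward balanced load (would only run during training in a real system).
        load = torch.zeros_like(self.select_bias)
        load.scatter_add_(0, idx.reshape(-1), torch.ones(idx.numel()))
        target = idx.numel() / load.numel()
        self.select_bias += self.bias_step * torch.sign(target - load)
        return idx, weights

# Usage
router = SigmoidTopKRouter()
expert_ids, expert_weights = router(torch.randn(32, 64))
print(expert_ids.shape, expert_weights.shape)  # torch.Size([32, 4]) torch.Size([32, 4])
```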
So the Scaling Law Can Still Be Optimized? Meta's Trick Saves Tokens and Boosts Efficiency
机器之心· 2025-07-06 03:49
Report by 机器之心. Editor: Panda. In 2017, the paper "Attention Is All You Need" became a major watershed in AI development, and the Transformer it proposed remains the foundational paradigm of today's mainstream language models. Especially after the scaling law of Transformer-based language models was verified experimentally, the field moved into the fast lane. Today the paper's citation count is closing in on 190,000, and the Transformer and the attention mechanism themselves have seen many improvements and innovations, such as "Multi-Token Attention" and "Multi-matrix Factorization Attention", which we covered recently. As AI keeps developing, a key challenge now is how to obtain enough high-quality tokens, or alternatively how to use those tokens more efficiently. Doing so requires further upgrades to the Transformer. The study builds on a generalization of RoPE to trilinear functions, while the 2-simplicial Transformer itself traces back to the 2019 work by Clift et al., "Logic and the 2-Simplicial Tran ...
X @Avi Chawla
Avi Chawla· 2025-07-04 06:48
Links:
- RAGFlow: https://t.co/LtSlnL8yxe
- Xpander: https://t.co/wohmbgRpPi
- Transformer Lab: https://t.co/E9AJqvNTh1
- Llama Factory: https://t.co/7ixKJrhQRv
- LangFlow: https://t.co/zWay03JAyY
- AutoAgent: https://t.co/ZTcsz2Rqxv
https://t.co/VEY6gXtBYp ...