机器之心
Adam's stability plus Muon's speed? Huawei Noah's Ark Lab open-sources ROOT to resolve the "have-it-both-ways" dilemma in large-model training
机器之心· 2025-11-27 04:09
Core Viewpoint
- The article discusses the evolution of optimizers in large language model (LLM) training, highlighting the introduction of ROOT (Robust Orthogonalized Optimizer) by Huawei Noah's Ark Lab as a solution that combines the speed of Muon and the stability of Adam, addressing the limitations of existing optimizers in handling large-scale training and noise robustness [2][50].

Group 1: Optimizer Evolution
- The early optimizer, SGD (Stochastic Gradient Descent), established the basic paradigm for neural network training but struggled with convergence speed and stability in high-dimensional loss landscapes [6][7].
- Adam and AdamW emerged as the de facto standards for training deep learning models, significantly improving convergence efficiency but revealing numerical instability once model parameters exceed one billion [7][8].
- Muon, a matrix-aware optimizer, attempted to address these issues by treating weight matrices as a whole, yet it faced challenges with robustness and sensitivity to noise [11][13].

Group 2: ROOT Optimizer Features
- ROOT improves the robustness of orthogonalized optimizers by introducing adaptive coefficients for the Newton-Schulz iteration, tailored to specific matrix dimensions, overcoming the dimensional fragility of fixed-coefficient methods [26][29].
- The optimizer employs a soft-thresholding mechanism to filter out gradient noise, effectively separating normal and abnormal gradient components, which improves training stability (see the sketch after this summary) [30][33].
- ROOT's design aims to balance speed and stability, making it suitable for large-scale, non-convex real-model training scenarios [20][21].

Group 3: Performance Validation
- In extensive experiments, ROOT demonstrated superior convergence, achieving a training loss of 2.5407 in a 10B-token pre-training experiment and outperforming Muon [41][42].
- ROOT achieved an average score of 60.12 across multiple downstream tasks, surpassing both AdamW (59.05) and Muon (59.59) [43].
- The optimizer also showed strong cross-modal generalization, reaching 88.44% Top-1 accuracy on the CIFAR-10 dataset, significantly higher than Muon's 84.67% [46][47].

Group 4: Future Implications
- ROOT is positioned to potentially usher in a new era of optimizers, addressing the growing complexity and scale of future language models and enhancing the reliability and efficiency of AI system training [49][51].
- The open-source release of ROOT's code is expected to encourage further research and application in training trillion-parameter models, reinforcing Huawei's commitment to innovation in AI [52].
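A minimal sketch of the two mechanisms named in Group 2, with hypothetical function names and an illustrative threshold: the quintic Newton-Schulz iteration below uses Muon's well-known fixed coefficients, since the summary does not give ROOT's dimension-adaptive schedule, and the soft-thresholding step is the standard shrinkage operator.

```python
import torch

def soft_threshold(g: torch.Tensor, tau: float) -> torch.Tensor:
    # Shrink every entry toward zero by tau; entries below tau in
    # magnitude (treated as noise-dominated) are zeroed outright.
    return torch.sign(g) * torch.clamp(g.abs() - tau, min=0.0)

def newton_schulz(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz orthogonalization with Muon's fixed
    # coefficients; ROOT instead adapts these coefficients to the matrix
    # dimensions, a schedule this sketch does not reproduce.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)          # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                        # keep the Gram matrix small
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def root_style_step(grad: torch.Tensor, tau: float = 1e-3) -> torch.Tensor:
    # Denoise first, then orthogonalize the cleaned gradient matrix.
    return newton_schulz(soft_threshold(grad, tau))

update = root_style_step(torch.randn(256, 512))
print(update.shape)  # torch.Size([256, 512])
```

In a full optimizer these steps would run on the momentum buffer of each 2-D weight before the parameter update, as in Muon; ROOT's actual per-shape coefficient adaptation is specified in the paper.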
NeurIPS 2025 awards announced: Qwen wins a Best Paper award, Faster R-CNN wins the Test of Time Award
机器之心· 2025-11-27 03:00
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advances across AI research areas [1][4].

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" discusses the limitations of large language models in generating diverse content and introduces Infinity-Chat, a dataset of 26,000 diverse user queries for studying model diversity [5][6][9].
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals the impact of gated attention mechanisms on model performance and stability, demonstrating significant improvements in the Qwen3-Next model (see the sketch after this summary) [11][16].
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" shows that increasing network depth to 1024 layers can enhance performance in self-supervised reinforcement learning tasks, with improvements of 2x to 50x [17][18].
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" identifies mechanisms that prevent diffusion models from memorizing training data, establishing a link between training dynamics and generalization [19][21][22].

Group 2: Best Paper Runner-Up
- Paper 1: "Optimal Mistake Bounds for Transductive Online Learning" solves a 30-year-old problem in learning theory, establishing optimal mistake bounds for transductive online learning [28][30][31].
- Paper 2: "Superposition Yields Robust Neural Scaling" argues that representation superposition is the primary mechanism governing neural scaling laws, supported by multiple experiments [32][34].

Group 3: Special Awards
- The Test of Time Award went to "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on modern object detection frameworks since its publication in 2015 [36][40].
- The Sejnowski-Hinton Prize was awarded to "Random synaptic feedback weights support error backpropagation for deep learning," which contributed significantly to understanding biologically plausible learning rules in neural networks [43][46][50].
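For orientation, here is a minimal single-head sketch of output-gated attention. It illustrates the general mechanism the gated-attention paper studies, not the Qwen3-Next implementation; the gate placement, the sigmoid choice, and all names and shapes are assumptions for this sketch.

```python
import math
import torch

def gated_attention(q, k, v, x, w_gate):
    # q, k, v: (seq, d_head); x: (seq, d_model) layer input;
    # w_gate: (d_model, d_head) gate projection (hypothetical name).
    scores = q @ k.T / math.sqrt(q.shape[-1])
    attn_out = torch.softmax(scores, dim=-1) @ v
    # Sigmoid gate on the attention OUTPUT: adds a non-linearity after
    # value mixing, induces sparsity (gates can saturate near zero), and
    # lets a head mute itself instead of dumping probability mass on a
    # "sink" token such as the first position.
    gate = torch.sigmoid(x @ w_gate)
    return gate * attn_out

seq, d_model, d_head = 8, 32, 16
x = torch.randn(seq, d_model)
q = k = v = torch.randn(seq, d_head)
w_gate = torch.randn(d_model, d_head)
print(gated_attention(q, k, v, x, w_gate).shape)  # torch.Size([8, 16])
```

One intuition for the "attention-sink-free" claim: since a head can drive its gate toward zero, it no longer needs to park unwanted probability mass on a sink token.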
Is the era of general-purpose brain-computer interfaces coming? Cross-scale brain foundation model CSBrain truly reads brain signals
机器之心· 2025-11-27 03:00
Core Insights
- The brain-computer interface (BCI) is seen as the ultimate interface connecting human intelligence with artificial intelligence, with a focus on high-precision brain-signal decoding so that general AI models can understand complex brain activity [2].
- Current BCI systems are limited to task-specific deep learning models, which lack generalizability and cross-task transfer, resulting in isolated "specialist" applications [2][3].
- The brain foundation model CSBrain aims to address these challenges by integrating cross-scale structural perception into the model design [5][6].

Group 1: Challenges in Brain-Computer Interfaces
- The BCI field has primarily relied on task-specific deep learning models, which perform well on specific datasets but struggle to adapt to diverse brain signals [2].
- The unique cross-scale spatiotemporal structure of brain signals challenges traditional modeling paradigms, which fail to capture the inherent neural structure [3][5].

Group 2: CSBrain Model Innovations
- CSBrain introduces two core modules: Cross-scale Spatiotemporal Tokenization (CST) and Structured Sparse Attention (SSA) [6][7].
- CST extracts multi-scale temporal and spatial features from EEG signals, balancing neural representation capacity and computational efficiency through a dimension-allocation strategy [6].
- SSA captures long-range temporal dependencies and models inter-region interactions while reducing attention complexity from O(N²) to O(N·k) (see the sketch after this summary) [7].

Group 3: Experimental Results and Performance
- CSBrain was validated on 11 representative brain-decoding tasks and 16 public datasets, achieving state-of-the-art performance with an average improvement of 3.35% over current models [12].
- On high-challenge tasks, CSBrain showed a 5.2% accuracy gain in motor-imagery tasks and a 7.6% improvement in epilepsy-detection metrics [12].
- The results confirm the effectiveness of CSBrain's cross-scale modeling paradigm and pre-trained brain foundation model, supporting a range of BCI applications [12][14].

Group 4: Future Prospects
- As data scale and computational power grow, brain foundation models are expected to play a larger role in broader brain-AI integration scenarios, accelerating next-generation brain-computer interface applications [14].
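A minimal sketch of the kind of structured sparse attention SSA describes, assuming a precomputed neighbor index per token; the actual CSBrain sparsity pattern (temporal windows plus brain-region groupings) is defined in the paper, so the index construction here is illustrative.

```python
import math
import torch

def structured_sparse_attention(q, k, v, neighbor_idx):
    # q, k, v: (N, d); neighbor_idx: (N, kk) long tensor giving, for each
    # token, the kk positions it may attend to (e.g., a temporal window
    # plus tokens from the same cortical region). Cost is O(N*kk) instead
    # of the O(N^2) of dense attention.
    N, d = q.shape
    k_sel = k[neighbor_idx]                                        # (N, kk, d)
    v_sel = v[neighbor_idx]                                        # (N, kk, d)
    scores = (k_sel @ q.unsqueeze(-1)).squeeze(-1) / math.sqrt(d)  # (N, kk)
    w = torch.softmax(scores, dim=-1)
    return (w.unsqueeze(-1) * v_sel).sum(dim=1)                    # (N, d)

N, d, kk = 128, 64, 16
q, k, v = (torch.randn(N, d) for _ in range(3))
neighbor_idx = torch.randint(0, N, (N, kk))   # stand-in sparsity pattern
print(structured_sparse_attention(q, k, v, neighbor_idx).shape)  # [128, 64]
```

With kk fixed (say 16 or 32 neighbors), the score tensor is N x kk rather than N x N, which is exactly the O(N·k) cost quoted above.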
Xiaomi open-sources MiMo-Embodied, the first cross-domain embodied foundation model, with SOTA results on 29 benchmarks
机器之心· 2025-11-26 09:19
Core Insights
- The article discusses the development of MiMo-Embodied, a foundation model that integrates autonomous driving and embodied intelligence, marking a significant advance in AI research [5][46].
- The model addresses the fragmentation between autonomous driving and embodied AI, which have traditionally been treated as separate domains lacking a unified cognitive framework [4][9].

Group 1: Model Development and Architecture
- MiMo-Embodied is the first open-source model to successfully merge autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) results across 17 embodied-intelligence benchmarks and 12 autonomous-driving benchmarks [5][19].
- The model is built on Xiaomi's self-developed MiMo-VL architecture, which decomposes physical interaction into six core dimensions, enhancing both environmental perception and decision-making [11][12].

Group 2: Training Strategy
- A four-stage progressive training strategy was designed to integrate diverse cross-domain data while avoiding catastrophic forgetting, which is crucial for the model's performance (a schematic of the curriculum follows this summary) [13][14]. The stages are:
  1. Establishing foundational knowledge with general and embodied data [14].
  2. Injecting autonomous-driving knowledge through mixed supervision while retaining embodied data [14][15].
  3. Enhancing logical reasoning with Chain-of-Thought (CoT) techniques [15].
  4. Refining the model through reinforcement learning (RL) to improve output precision [16].

Group 3: Performance Metrics
- MiMo-Embodied achieved record-breaking performance in affordance prediction, task planning, and spatial understanding, demonstrating robust embodied-intelligence capabilities [19][22][25].
- In autonomous-driving benchmarks, the model excelled at environmental perception, state prediction, and driving planning, generating coherent and contextually appropriate driving decisions [27][28][30].

Group 4: Real-World Applications
- The model's practical utility was validated in embodied navigation and manipulation tasks, where it performed exceptionally well at identifying and locating objects across household scenarios [33][34].
- In autonomous-driving trajectory planning, MiMo-Embodied significantly outperformed competing models in both the imitation-learning and reinforcement-learning phases, indicating effectiveness in complex driving situations [38][39].

Group 5: Conclusion and Future Implications
- MiMo-Embodied marks a new phase in embodied-intelligence research, showing that cognitive logic in the physical world is unified across applications [46].
- The work lays the groundwork for general Vision-Language-Action (VLA) models, moving toward the vision of a single brain applicable to diverse embodied forms [46].
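The four-stage curriculum in Group 2 can be written down as a small data structure. A schematic sketch only: the corpus names are hypothetical, and the article gives neither mixture ratios nor loss weights.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    name: str
    data_sources: List[str]   # which corpora are mixed in this stage
    objective: str            # "sft" (supervised) or "rl" (reinforcement)

# Hypothetical reconstruction of the four-stage curriculum described above.
CURRICULUM = [
    Stage("foundation",        ["general_vl", "embodied"],         "sft"),
    # Keep embodied data in stage 2 to avoid catastrophic forgetting.
    Stage("driving_injection", ["autonomous_driving", "embodied"], "sft"),
    Stage("cot_reasoning",     ["chain_of_thought_traces"],        "sft"),
    Stage("rl_refinement",     ["reward_labeled_rollouts"],        "rl"),
]

for stage in CURRICULUM:
    print(f"{stage.name}: mix={stage.data_sources}, objective={stage.objective}")
```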
Major new research from Saining Xie and Jaakkola's teams: data-free Flow Map distillation
机器之心· 2025-11-26 09:19
Core Insights
- The article discusses recent advances in AI communication methods, particularly a new paradigm called "Cache-to-Cache" communication, which allows machines to exchange information without verbal language, enhancing the efficiency of AI interactions [1].
- Another significant piece of research highlights the concept of "Thought Communication," enabling agents to share latent thoughts internally, resembling telepathic collaboration [3].
- A joint study from MIT and NYU proposes a method that eliminates the need for data by sampling from the prior distribution, achieving impressive performance in flow-map distillation [4][5].

Group 1: AI Communication Innovations
- The "AI transmission" research showcases a model in which machines communicate through caches, bypassing natural language, and has drawn significant attention in the tech community [1].
- The "Thought Communication" concept, introduced in a NeurIPS 2025 Spotlight paper, emphasizes internal thought sharing among agents, pushing the boundaries of AI collaboration [3].

Group 2: Data-Free AI Models
- The joint MIT/NYU research introduces a method for flow-map distillation that does not rely on external data, with remarkable results [4][5].
- The study sets a new record for generation quality on ImageNet, indicating a shift toward exploiting internal representations rather than explicit data [5].
- The proposed "FreeFlow" framework emphasizes a data-free paradigm, aligning the student with the prior distribution to avoid the risks of teacher-data mismatch (see the sketch after this summary) [21][30].
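As a rough illustration of what "distillation by sampling only from the prior" can look like, here is the simplest scheme in this family: draw noise from the prior, integrate the teacher's probability-flow ODE to produce a target, and regress a one-step student flow map onto it. All names, the toy networks, and the Euler integrator are assumptions for illustration; FreeFlow's actual objective is a more refined, prior-anchored formulation described in the paper, and this naive teacher-sampling baseline is only a point of reference.

```python
import torch
import torch.nn as nn

class TinyVelocity(nn.Module):
    """Stand-in for a (pretrained) velocity field v(x, t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)
    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def distill_step(student, teacher, batch=64, dim=16, nfe=32):
    z = torch.randn(batch, dim)              # prior sample -- no data used
    x = z.clone()
    ts = torch.linspace(1.0, 0.0, nfe + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):      # Euler ODE solve, t: 1 -> 0
        with torch.no_grad():
            x = x + (t1 - t0) * teacher(x, t0.expand(batch))
    pred = student(z, torch.ones(batch))     # one-step flow map at t = 1
    return ((pred - x) ** 2).mean()          # regress onto the teacher target

teacher, student = TinyVelocity(16), TinyVelocity(16)
print(distill_step(student, teacher).item())
```

The key property shared with the paper is that every training signal originates from prior samples and the teacher itself, so there is no dataset whose mismatch with the teacher could bias the student.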
How many years until next-pixel prediction works? Google: five is enough
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the potential of next-pixel prediction for image recognition and generation, highlighting its scalability challenges compared to natural language processing tasks [6][21].
- While next-pixel prediction is a promising approach, it requires significantly more computational resources than language modeling, with a token-per-parameter ratio 10-20 times higher [6][15][26].

Group 1: Next-Pixel Prediction
- Next-pixel prediction can be learned end-to-end without labeled data, making it a form of unsupervised learning [3][4].
- The study indicates that compute-optimal next-pixel prediction requires a higher token-to-parameter ratio than text: at least 400 tokens per parameter for pixel models versus roughly 20 for language models (a back-of-the-envelope budget follows this summary) [6][15].
- The research poses three core questions: how to evaluate model performance, whether scaling laws are consistent with downstream tasks, and how scaling trends vary across image resolutions [7][8].

Group 2: Experimental Findings
- Experiments at a fixed resolution of 32×32 pixels reveal that the optimal scaling strategy is highly task-dependent, with image generation requiring a larger token-to-parameter ratio than classification [18][22].
- As image resolution increases, model size must grow faster than data size to remain compute-optimal, indicating that computational capacity, not data availability, is the primary bottleneck [18][26].
- While scaling trends for next-pixel prediction can be predicted using frameworks established for language models, the optimal scaling strategies differ significantly across tasks [21][22].

Group 3: Future Outlook
- The article predicts that next-pixel modeling will become feasible within the next five years, given training compute that is growing four to five times annually [8][26].
- Despite current challenges, the path toward pixel-level modeling remains viable and could reach competitive performance in the future [26].
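The quoted ratios admit a quick budget check. A back-of-the-envelope sketch, assuming one token per pixel at 32×32 (our simplification; the paper's exact tokenization may differ):

```python
# Compute-optimal data budgets under the ratios quoted above
# (~20 tokens/parameter for text, >=400 for pixels).
params = 1e9                        # a 1B-parameter model
text_tokens  = 20  * params         # 2e10  -> ~20B text tokens
pixel_tokens = 400 * params         # 4e11  -> ~400B pixel tokens

tokens_per_image = 32 * 32          # 1,024 tokens per 32x32 image (assumption)
images = pixel_tokens / tokens_per_image
print(f"~{images:.1e} images of 32x32 needed")   # ~3.9e+08 images
```

The 20x gap in tokens per parameter is why the article frames pixel-level modeling as compute-bound rather than data-bound.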
Yunfan Meetup registration | Introverts welcome: meet like-minded friends at NeurIPS in San Diego
机器之心· 2025-11-26 07:07
Group 1
- The year 2025 is expected to be a significant year for the rapid development of AI, transforming human-computer interaction and making AI a new partner for exploring the world [2].
- The pace of technological iteration in AI is accelerating, with new research proposed, tested, and iterated rapidly, leading to a non-linear accumulation of knowledge [2].
- NeurIPS, one of the most influential academic conferences in AI, received 21,575 valid submissions this year with an acceptance rate of 24.52%, and will be held in San Diego from December 2 to 7, 2025 [2].

Group 2
- The "Yunfan · NeurIPS 2025 AI Talent Meetup" is an informal event aimed at facilitating face-to-face exchanges among researchers, engineers, and entrepreneurs in a relaxed atmosphere [3].
- The meetup is organized by Machine Heart (机器之心) and the Shanghai Artificial Intelligence Laboratory, inviting participants to share ideas and discuss recent hot topics and research directions [3].

Group 3
- The meetup will take place on December 3, 2025, from 17:30 to 20:30 local time in San Diego, with a capacity of 80 participants [4].
- The event is designed for a range of attendees, including researchers interested in AI trends, job seekers, and innovators looking for opportunities and partners [4].

Group 4
- The agenda includes registration, an introduction to the Shanghai AI Laboratory, presentations on the Polaris & Xingqi program, an AI Talent Show, and a dinner with interactive sessions [6].
Breaking the bottleneck of vision-language-action models: QDepth-VLA gives robots more precise 3D spatial perception
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the significant potential of Vision-Language-Action (VLA) models for robotic manipulation and introduces QDepth-VLA, which enhances 3D spatial perception and reasoning through quantized depth prediction [2][4][34].

Group 1: Model Limitations and Challenges
- Despite advances in semantic understanding and instruction following, VLA models struggle with spatial perception, particularly in fine-grained or long-horizon multi-step tasks, leading to positioning errors and operational failures [5][6].
- The gap between 2D visual-semantic understanding and 3D spatial perception has prompted researchers to explore ways of injecting 3D information into VLA models, in three main categories: direct injection of 3D features, 3D feature projection, and auxiliary 3D visual prediction tasks [5][6].

Group 2: QDepth-VLA Methodology
- QDepth-VLA combines quantized depth prediction with a hybrid attention structure, allowing the model to maintain semantic consistency while strengthening 3D spatial perception and action decision-making [8][34].
- The method has three main components: high-precision depth annotation using Video-Depth-Anything, a Depth Expert module for structured depth-token prediction, and a hybrid attention mechanism that manages information flow across modalities (an illustrative depth-quantization sketch follows this summary) [11][13][14].

Group 3: Experimental Validation
- QDepth-VLA was evaluated comprehensively in simulation (Simpler and LIBERO) and in real-world settings, showing significant gains across object-manipulation and multi-step tasks [18][19].
- In the Simpler simulation, QDepth-VLA improved the average success rate by 8.5% and 3.7% over the baseline model Open π0 [20].
- In the LIBERO simulation, QDepth-VLA outperformed the 3D-CAVLA model by approximately 2.8% [26].
- Real-world experiments showed superior pick-and-place performance, with a 20% improvement on basic tasks and a 10% gain on more challenging scenarios [30].

Group 4: Ablation Studies
- Ablations indicate that the depth supervision and hybrid attention mechanisms are crucial to QDepth-VLA's performance, with success rates dropping significantly when either component is removed [31][32].

Group 5: Future Directions
- Future research will focus on strengthening spatial understanding, with potential developments in future spatial-structure prediction and more efficient depth-representation learning [35][36].
- Integration of enhanced 3D geometric perception and action consistency into CASBOT's product line is anticipated, supporting applications in both domestic and industrial settings [35][36].
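To make "quantized depth prediction" concrete, here is a minimal sketch of turning a metric depth map into discrete token ids. Log-spaced bins are a common choice (finer resolution near the camera); whether QDepth-VLA bins depth this way or learns a VQ codebook is not stated in the summary, and the bin count and range are illustrative.

```python
import math
import torch

def quantize_depth(depth_m: torch.Tensor, num_bins: int = 128,
                   d_min: float = 0.1, d_max: float = 10.0) -> torch.Tensor:
    # Map continuous depth (meters) to token ids in [0, num_bins).
    d = depth_m.clamp(d_min, d_max)
    u = (d.log() - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return (u * (num_bins - 1)).round().long()

depth = torch.rand(224, 224) * 9.9 + 0.1   # fake depth map, 0.1-10 m
tokens = quantize_depth(depth)
print(tokens.min().item(), tokens.max().item())  # ids within [0, 127]
```

Once depth is a grid of discrete ids, a Depth Expert module can predict those ids with an ordinary cross-entropy head, which is what makes the auxiliary task compatible with a token-based VLA backbone.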
A 36-year-old convolution conjecture is solved, with a sole Chinese author; AI may benefit
机器之心· 2025-11-26 05:12
Core Viewpoint
- The article discusses a significant mathematical breakthrough by Yuansi Chen, who solved the Talagrand convolution conjecture, a problem that had remained open for 36 years, with implications for modern computer science and machine learning [3][10].

Group 1: Background and Importance
- The Talagrand convolution conjecture, posed in 1989, is one of the most important open problems in probability theory and functional analysis, concerning the regularizing properties of the heat semigroup applied to L₁ functions on the Boolean hypercube [10].
- The conjecture predicts that applying the smoothing operator to any L₁ function significantly improves tail decay, which matters for theoretical computer science, discrete mathematics, and statistical physics [10][21].

Group 2: Key Findings
- Chen's proof shows that for any non-negative function f on the Boolean hypercube, the probability that the smoothed function exceeds a threshold decays at a rate better than the Markov inequality, with a bound involving a log log factor (made explicit below) [6][11].
- The result gives a positive answer to whether the tail probability vanishes as η approaches infinity, a significant improvement over previous methods [13][21].

Group 3: Methodology
- The core of Chen's method is a coupling between two Markov jump processes constructed via a "perturbed reverse heat process," a major methodological advance in discrete stochastic analysis [15][20].
- The proof combines several innovative techniques, including total-variation control and a multi-stage Duhamel formula, to achieve dimension-free bounds [20][21].

Group 4: Implications for Future Research
- The remaining log log η factor is a clear target for future work; improved coupling distances or alternative perturbation designs could potentially eliminate it [21][25].
- The work enlarges the toolbox for handling probability distributions on high-dimensional discrete spaces and connects to current AI trends, particularly score-based generative models [23][24].
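Written out, the proved bound has the following shape. This is a hedged reconstruction: the summary names the log log factor but not its exact power, and the normalization E[f] = 1 is the standard convention rather than something stated in the article.

```latex
% T_t: heat semigroup on the hypercube; C_t: dimension-free constant.
% The power of the log log factor is written O(1) because the summary
% does not specify it.
\[
  f \colon \{-1,1\}^n \to [0,\infty), \quad \mathbb{E}[f] = 1,
  \qquad
  \Pr\bigl(T_t f \ge \eta\bigr)
  \;\le\;
  \frac{C_t \,(\log\log\eta)^{O(1)}}{\eta\,\sqrt{\log\eta}}
  \qquad (\eta \to \infty).
\]
```

For comparison, Markov's inequality alone gives only Pr(T_t f ≥ η) ≤ 1/η; Talagrand's conjectured gain is the extra √(log η) in the denominator, and Chen's result achieves it up to the log log correction, exactly the residual factor flagged as a future target in Group 4.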
Google's TPU strikes back at Nvidia; the founders become the world's second- and third-richest people overnight
机器之心· 2025-11-26 05:12
Core Viewpoint
- Google's stock price has surged significantly, driven by advances in artificial intelligence, particularly the launch of the Gemini 3 model and a potential AI-chip deal with Meta [2][9][11].

Stock Performance
- As of November 25, Alphabet's stock price reached $326, a 2.4% gain and an all-time high. The stock has risen more than 11.5% over the past five trading days and 22% over the last month [2].
- Alphabet's market capitalization is approximately $3.84 trillion, making it the third-largest company globally, behind only Nvidia and Apple [2].

Wealth Impact
- The surge has significantly increased the founders' wealth: Larry Page and Sergey Brin now rank as the second- and third-richest individuals globally, surpassing Jeff Bezos [5].

AI Breakthroughs
- The core drivers of the rally are two major AI advances: the impressive performance of the Gemini 3 model and a potential deal to supply Google's AI chips to Meta [9][11].
- Gemini 3 has received wide acclaim for its speed and capabilities, outperforming OpenAI's GPT-5 on several benchmarks [9][10].

AI Chip Developments
- Google's latest TPU, "Ironwood," is reported to be its most powerful and energy-efficient custom chip to date, with a potential multi-billion-dollar deal for Meta to use it in data centers [10][11].
- Such a deal could allow Google to capture about 10% of Nvidia's annual revenue, establishing a competitive position in the AI hardware market [11].

Cloud Computing and AI Demand
- Google's cloud AI infrastructure head indicated that the company needs to double its computing power every six months to meet explosive demand for AI services, targeting a 1000-fold increase in compute over the next 4-5 years [12].

Competitive Landscape
- Nvidia has responded to concerns that Google's AI chips could erode its market dominance, asserting that its technology remains a generation ahead [14][15].
- Despite the growing attention on Google's chips, Nvidia still holds over 90% of the AI-chip market [15].

Strategic Shifts
- Google's turnaround in the AI race is attributed to the launch of Gemini 3, which restored market confidence and drew industry leaders back to its products [19][20].
- The company has been promoting its TPUs through cloud services, which may pose a long-term threat to Nvidia's market position [22].

Legal and Financial Developments
- A recent antitrust ruling allowed Google to keep its search business structure intact, easing concerns about disruption to its revenue streams [23].
- Warren Buffett's Berkshire Hathaway has invested approximately $4.3 billion in Alphabet, signaling strong confidence in the company's future [24].

Search Business Resilience
- Google's search advertising revenue grew 15% in the third quarter, indicating that the core business remains robust despite the rise of AI technologies [25].