机器之心

Just now: Musk releases Grok 4! First across the board, with the annual fee soaring to 20,000+
机器之心· 2025-07-10 06:07
Report by the 机器之心 editorial team

Postdoctoral-level in every discipline. xAI's long-gestating next-generation large model, Grok 4, has finally been released, and its capabilities exceed our imagination. Around 12:00 noon Beijing time today, the long-awaited xAI launch event began. Musk appeared in the livestream and opened with: "This is the best AI in the world; let us show you."

Musk said that Grok 4 gets a perfect score on the SAT (the US college-admissions exam) every time, without seeing the questions in advance, and that it scores near-perfect on the GRE in every subject, surpassing the level of every graduate student in the world. Grok 4's greatest strength is its reasoning ability, which he claims has already surpassed human-level reasoning. Musk believes Grok 4 could make new scientific discoveries within the year.

Thanks to increased compute and reinforcement-learning training, Grok 4's reasoning ability is 10x that of its predecessor. From Grok 2 to Grok 4 the technical paradigms differ: next-token prediction, pre-training compute, pre-training + RL, and RL compute, respectively. Pre-training compute grew 10x from Grok 2 to Grok 3; Grok 3 reasoning introduced RL fine-tuning for the first time, bringing deep reasoning capability; and Grok 4 reasoning's reinforcement learning again grew 10 ...
Humanoid robots making hamburgers go viral! ViTacFormer, new from Berkeley and others, makes robot manipulation as steady as an old hand's
机器之心· 2025-07-10 06:07
Core Viewpoint
- The article discusses advances in humanoid robots, focusing on the ViTacFormer framework, which integrates visual and tactile information for dexterous manipulation tasks and shows potential to revolutionize kitchen automation and other complex tasks [1][4][24].

Group 1: Technology and Innovation
- The ViTacFormer framework is designed to enhance precision, stability, and continuous control in dexterous manipulation by combining visual and tactile data with a predictive mechanism for future tactile feedback [4][11].
- The system uses a dual-arm robot setup equipped with advanced tactile sensors and cameras to gather real-time data during operation, giving a comprehensive picture of contact dynamics [13][14].
- ViTacFormer employs a cross-modal attention mechanism and an autoregressive tactile prediction branch, enabling the model to anticipate future contact states and thereby improve action generation and overall task performance [9][11][24] (a sketch follows this summary).

Group 2: Experimental Validation
- ViTacFormer was evaluated on a range of short-horizon dexterous manipulation tasks, demonstrating a significant improvement in success rates, with an average increase of over 50% compared to existing methods [22][24].
- In a long-horizon task simulating the complete process of making a hamburger, ViTacFormer ran continuously for approximately 2.5 minutes with an overall success rate exceeding 80%, highlighting its effectiveness on complex, multi-stage tasks [28].
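To make the cross-modal attention and tactile-prediction idea concrete, here is a minimal PyTorch sketch under stated assumptions: the `ViTacFusion` class, its dimensions, and the 32-D action head are illustrative inventions, not the authors' released architecture.

```python
# A minimal sketch (not the authors' code) of cross-modal attention between
# vision and tactile tokens, plus an autoregressive head that predicts the
# next tactile state. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ViTacFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Tactile tokens attend to vision tokens, and vice versa.
        self.tac_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_tac = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tactile_head = nn.Linear(dim, dim)    # predicts next tactile token
        self.action_head = nn.Linear(2 * dim, 32)  # 32-D action, an assumption

    def forward(self, vis: torch.Tensor, tac: torch.Tensor):
        # vis: (B, Nv, dim) vision tokens; tac: (B, Nt, dim) tactile tokens.
        tac_fused, _ = self.tac_from_vis(tac, vis, vis)
        vis_fused, _ = self.vis_from_tac(vis, tac, tac)
        next_tac = self.tactile_head(tac_fused[:, -1])  # future contact state
        ctx = torch.cat([vis_fused.mean(1), tac_fused.mean(1)], dim=-1)
        action = self.action_head(ctx)                  # next action
        return action, next_tac

model = ViTacFusion()
action, next_tac = model(torch.randn(2, 196, 256), torch.randn(2, 16, 256))
```

Supervising `next_tac` against the actually observed future tactile tokens is what gives the predictive branch its anticipatory effect during action generation.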
Reward models finally enter a new era of pre-training! POLAR from Shanghai AI Lab and Fudan opens a new scaling paradigm
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the limitations of current reward-modeling methods in reinforcement learning, particularly in the context of large language models (LLMs), and introduces a new paradigm called POLAR that aims to make reward modeling scalable and generalizable [2][3][5].

Group 1: Current Reward Modeling Methods
- Preference-based reward modeling relies on high-quality preference data, which is costly and hard to scale, and it struggles with generalization and is susceptible to reward hacking [3][4].
- Rule-based verifier methods provide accurate reward signals for verifiable tasks but fail to extend to more general scenarios such as open-domain dialogue and complex interactions [3][4].

Group 2: Introduction of POLAR
- POLAR, developed by a team from Shanghai AI Lab and Fudan University, uses policy discriminative learning to decouple reward modeling from absolute preferences, allowing efficient scaling and strong generalization [5][9].
- POLAR's training measures the "distance" between a candidate policy and the optimal policy, providing a relative reward signal that does not depend on human-annotated preferences [9][10].

Group 3: Training Methodology
- POLAR's pre-training corpus is constructed through automated data synthesis, sampling from LLM pre-training data and drawing on a large pool of models for trajectory sampling [14][15].
- The pre-training objective uses a Bradley-Terry loss (sketched below) to assign higher rewards to trajectories generated by similar policies, effectively modeling differences between policy distributions [14][15].

Group 4: Performance and Generalization
- POLAR demonstrates superior performance in preference evaluation, outperforming state-of-the-art reward models by significant margins across tasks, including STEM [33].
- In reinforcement fine-tuning (RFT) experiments, models fine-tuned with POLAR improve by 9.0% on average over their initial results, highlighting its effectiveness in enhancing LLM capabilities [34].

Group 5: Scaling Effects
- POLAR exhibits scaling laws similar to LLM next-token prediction: more compute yields better reward-model performance [35].
- Validation loss decreases as a power law in model parameters and training compute, suggesting the potential for building more powerful, more generalizable reward models [35].

Conclusion
- POLAR represents a novel, scalable approach to reward modeling, offering new possibilities for LLM post-training and addressing key challenges in reinforcement learning [37].
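The Bradley-Terry objective named above is a standard pairwise loss; the sketch below shows one way it could be applied to pairs of trajectory rewards. The `bradley_terry_loss` helper and the random example scores are assumptions for illustration, not POLAR's released code.

```python
# Minimal sketch of a Bradley-Terry pairwise loss for reward-model
# pre-training: the trajectory from a policy closer to the reference
# policy ("chosen") should receive the higher scalar reward.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected); maximizing its
    # log-likelihood gives the loss -logsigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: scalar rewards for a batch of 4 trajectory pairs.
r_chosen = torch.randn(4, requires_grad=True)
r_rejected = torch.randn(4, requires_grad=True)
loss = bradley_terry_loss(r_chosen, r_rejected)
loss.backward()
```

Because the "chosen/rejected" labels come from which sampling policy produced each trajectory, not from human annotation, the objective scales with synthetic data.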
A new breakthrough in unified VLA architecture: autoregressive world models lead embodied intelligence
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses UniVLA, a new unified Vision-Language-Action (VLA) model architecture that integrates visual, language, and action signals for improved decision-making in embodied-intelligence tasks [4][5][13].

Group 1: Model Architecture and Mechanism
- UniVLA is built on a fully discrete, autoregressive mechanism that natively models visual, language, and action signals, and incorporates world-model training to learn temporal information and causal logic from large-scale video [5][9][14].
- The framework converts visual, language, and action signals into discrete tokens and arranges them as interleaved multimodal temporal sequences for unified modeling [9][10] (a sketch follows this summary).

Group 2: Performance and Benchmarking
- UniVLA sets new state-of-the-art (SOTA) records on major embodied-intelligence benchmarks such as CALVIN, LIBERO, and SimplerEnv, demonstrating strong performance advantages [18][21].
- On the CALVIN benchmark, UniVLA achieves an average score of 95.5%, significantly outperforming previous models [19].

Group 3: Training Efficiency and Generalization
- The world-model post-training stage significantly enhances downstream decision-making without relying on extensive action data, learning efficiently from large amounts of video alone [14][15].
- The model supports unified training for diverse tasks, including visual understanding, video generation, and action prediction, showcasing its versatility and data scalability [10][24].

Group 4: Future Directions
- The article suggests exploring deeper integration of the UniVLA framework with multimodal reinforcement learning to enhance its perception, understanding, and decision-making in open-world scenarios [24].
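As an illustration of how discrete vision, language, and action tokens can be interleaved into one autoregressive sequence, here is a small sketch; the separator token ids and the `interleave` helper are hypothetical, not UniVLA's actual vocabulary or API.

```python
# Sketch (assumptions, not UniVLA's code) of building an interleaved
# multimodal token sequence: the instruction once up front, then per
# timestep the discrete vision tokens followed by the action tokens,
# separated by special marker tokens.
from typing import List

VIS, LANG, ACT = 50000, 50001, 50002  # assumed special separator token ids

def interleave(vision: List[List[int]], language: List[int],
               actions: List[List[int]]) -> List[int]:
    seq = [LANG] + language        # task instruction
    for v_t, a_t in zip(vision, actions):
        seq += [VIS] + v_t         # discrete visual tokens for step t
        seq += [ACT] + a_t         # discrete action tokens for step t
    return seq                     # fed to a single autoregressive model

seq = interleave(vision=[[11, 12], [13, 14]],
                 language=[7, 8, 9],
                 actions=[[21], [22]])
print(seq)  # [50001, 7, 8, 9, 50000, 11, 12, 50002, 21, 50000, 13, 14, 50002, 22]
```

One vocabulary and one next-token objective then cover understanding, video generation, and action prediction, which is what makes the unified training described above possible.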
ICML 2025 | Giving AI an "intelligence upgrade plug-in"! D-MoLE from Alibaba Security and Tsinghua University lets models evolve dynamically through continual learning
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses D-MoLE (Dynamic Mixture of Curriculum LoRA Experts), a new framework aimed at enhancing the continual-adaptation capability of multimodal large language models (MLLMs) as task requirements evolve, while preserving existing knowledge [4][12][13].

Research Background
- Multimodal large language models (MLLMs) combine modalities such as visual and textual data, showing strong capability in handling multimodal information [3].
- A significant challenge in practice is catastrophic forgetting, where models lose previously acquired knowledge when fine-tuned on new tasks [4].

Key Challenges
- Continual multimodal instruction tuning (CMIT) is needed so that MLLMs can adapt to new tasks while retaining past knowledge [4][12].
- Two main challenges are task architecture conflicts and modality imbalance, where different tasks depend to different degrees on particular model layers and modalities [4][7].

Proposed Solution
- D-MoLE dynamically adjusts the model architecture to task requirements, introducing additional parameter modules (LoRA experts, sketched after this summary) as needed [10][13].
- It incorporates a gradient-based continual curriculum strategy to balance updates across modalities, ensuring more equitable optimization [10][12].

Methodology
- D-MoLE consists of two core components: a dynamic layer-wise expert allocator and a gradient-based inter-modal continual curriculum mechanism [16][22].
- The dynamic allocator identifies the layers most critical for adapting to a new task and allocates LoRA experts accordingly, while the curriculum mechanism adjusts the update ratio between the language model and the modality encoders based on task difficulty [22][24].

Experimental Results
- D-MoLE was evaluated on a benchmark of nine datasets spanning visual question answering, image captioning, and visual grounding [27].
- The framework significantly outperformed baseline methods, improving average performance by approximately 15.08% and reducing backward transfer (BWT) from -21.31% to -1.49% [29].

General Capability Assessment
- D-MoLE maintained strong general multimodal capability, outperforming traditional methods on various evaluation benchmarks [30][31].

Training Efficiency
- Despite its new mechanisms, D-MoLE's total training time was comparable to traditional methods, demonstrating efficient training through selective parameter updates [36].

Business Application
- D-MoLE can enhance Alibaba's multimodal security-audit models, enabling rapid adaptation to different platform rules without extensive retraining, thereby reducing operational costs and improving flexibility [38][39].
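Below is a minimal sketch of the LoRA building block that a framework like D-MoLE allocates per layer; `LoRALinear`, its rank and scaling, and the wrapping example are an illustrative re-implementation of standard LoRA, not the authors' code.

```python
# Sketch of a LoRA expert: a frozen base linear layer plus a trainable
# low-rank update scale * (B @ A), so each new task adds only a small
# number of parameters to the layers selected for it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# A dynamic allocator would wrap only the layers judged critical for the
# incoming task, e.g.:
layer = nn.Linear(512, 512)
adapted = LoRALinear(layer)          # new expert for the new task
out = adapted(torch.randn(2, 512))
```

Because `B` is initialized to zero, adding an expert leaves the model's behavior unchanged until training begins, which is what lets new tasks be absorbed without disturbing old ones.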
At 47 he changed research direction, and in one stroke cracked the biggest unsolved problem in sphere packing
机器之心· 2025-07-10 04:26
Selected from Quanta Magazine. Author: Joseph Howlett. Compiled by 机器之心; editor: 泽南.

Now, in a paper posted in April, "Lattice packing of spheres in high dimensions using a stochastically evolving ellipsoid," the mathematician Boaz Klartag has broken the previous record by a significant margin with a new method. Some researchers even believe his result may be close to optimal.

Paper: https://arxiv.org/pdf/2504.05042

A newcomer to the field, Klartag achieved his packing by reviving an old technique that experts had abandoned decades ago; the method works in every arbitrarily high dimension. The work touches on several long-standing debates about the nature of optimal packings in high-dimensional space: should they be ordered or disordered, and just how densely can spheres be packed?

In mathematics, the search for optimal patterns never ends, and the sphere-packing problem is no exception: it asks how to pack spheres into a (high-dimensional) box as efficiently as possible. The problem has fascinated mathematicians for centuries and has important applications in cryptography, telecommunications, and other fields.

It looks simple but is in fact subtle. In the early 17th century, the astronomer and mathematician Johannes Kepler ...
Real scientific research skills: a collective failing grade! The new SFE benchmark deals mainstream multimodal LLMs a heavy blow
机器之心· 2025-07-09 09:52
Core Insights
- The article discusses advances in Artificial Intelligence for Science (AI4S) and introduces the Scientists' First Exam (SFE) to evaluate the capabilities of multimodal large language models (MLLMs) in scientific domains [1][3][12].

Group 1: AI4S and SFE Overview
- AI4S has made significant progress in transforming scientific research through innovative tools, but becoming a truly revolutionary tool requires a comprehensive approach that integrates specialized knowledge [1].
- SFE aims to systematically assess the scientific cognitive abilities of MLLMs across disciplines, addressing the limitations of existing evaluations that focus mainly on knowledge recall [2][3].

Group 2: SFE Evaluation Framework
- SFE introduces a three-tier evaluation framework: signal perception (L1), attribute understanding (L2), and comparative reasoning (L3), covering five scientific fields with 66 high-value tasks [4][10][12].
- The evaluation reveals that mainstream models perform well on traditional benchmarks but struggle markedly on high-level scientific tasks, with state-of-the-art models scoring around 30 [4][18].

Group 3: Performance Insights
- Closed-source MLLMs outperform open-source models by 6-8% on average, with notable differences on specific tasks [20].
- Materials science is the strongest area for model performance, while astronomy is more challenging due to the complexity of its data [22][23].

Group 4: Model Development and Trends
- Recent models show significant improvement on high-level reasoning tasks, while progress on understanding tasks remains limited, indicating a shift toward enhanced reasoning capability [25][26].
- Scaling model size does not always correlate with improved scientific capability, suggesting the need to scale scientific data in balance with model size [31][32].

Group 5: Future Directions and Ecosystem
- The SciPrismaX platform aims to build a rigorous, dynamic evaluation ecosystem for AI in science, incorporating multiple assessment dimensions and community collaboration [33][36].
"Tokens are bullshit": Mamba's author advances a disruptive view, exposing deep flaws in Transformers
机器之心· 2025-07-09 09:52
Core Viewpoint
- The article discusses the trade-offs between state space models (SSMs) and Transformers, arguing that tokenization is a limitation SSMs can overcome, leading to better computational efficiency and modeling capability [1][3][61].

Group 1: State Space Models (SSMs)
- An SSM is a modern version of the recurrent neural network (RNN), with key features that let it match the language-modeling performance of Transformers [8][10].
- A defining characteristic of SSMs is that the hidden-state dimension is larger than the input and output dimensions, allowing more context to be stored [9][10].
- The state-update function must be expressive enough to accurately encode and retrieve the necessary information, which selective SSMs achieve through input-dependent (dynamic) transition matrices [11][12] (a toy version is sketched after this summary).
- Mamba, a specific SSM, combines parallelization and memory-management techniques to improve computational efficiency [13][14].
- When computational resources are matched, SSMs can outperform Transformers on language-modeling tasks [53][56].

Group 2: Transformers
- Transformers excel at tasks requiring fine-grained operations on individual tokens, but their quadratic complexity limits efficiency [82][86].
- Transformers carry an inductive bias that affects their modeling capability, making them sensitive to the resolution and semantic content of the data [83][85].
- Despite their strengths, Transformers are not the ultimate solution for every modeling task, and significant work remains in the field [89].

Group 3: Tokenization
- Tokenization is a critical step in language modeling, but it limits a model's grasp of linguistic detail [39][40].
- Removing tokenization could improve model performance and aligns with the essence of deep learning, which aims to minimize manual feature engineering [44][45].
- Without tokenization, models could learn more effective patterns directly from raw data, enhancing their capabilities [46][52].
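To ground the "dynamic transition matrices" point, here is a toy selective-SSM recurrence in PyTorch; the gating parameterization and the `SelectiveSSM` class are simplified assumptions for illustration, not Mamba's actual hardware-aware, parallel-scan implementation.

```python
# Toy selective-SSM recurrence: the per-step decay a_t and input gate b_t
# both depend on the current input, so the state can selectively store or
# forget information. A sequential loop is used for clarity.
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    def __init__(self, d_in: int = 16, d_state: int = 64):
        super().__init__()
        # The hidden state (d_state) is larger than the input (d_in),
        # as the article notes.
        self.to_decay = nn.Linear(d_in, d_state)  # input-dependent A_t
        self.to_gate = nn.Linear(d_in, d_state)   # input-dependent B_t
        self.in_proj = nn.Linear(d_in, d_state)
        self.out_proj = nn.Linear(d_state, d_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        B, T, _ = x.shape
        h = x.new_zeros(B, self.out_proj.in_features)
        ys = []
        for t in range(T):
            a_t = torch.sigmoid(self.to_decay(x[:, t]))  # decay in (0, 1)
            b_t = torch.sigmoid(self.to_gate(x[:, t]))   # input gate
            h = a_t * h + b_t * self.in_proj(x[:, t])    # selective update
            ys.append(self.out_proj(h))
        return torch.stack(ys, dim=1)

y = SelectiveSSM()(torch.randn(2, 10, 16))  # (2, 10, 16)
```

Because the state is a fixed-size summary rather than a growing cache, each step costs the same regardless of sequence length, in contrast to a Transformer's quadratic attention.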
We spent 49 yuan on the Chinese version of Lovart: how strong is a design agent that assembles dozens of models?
机器之心· 2025-07-09 09:52
Core Viewpoint
- The article covers the launch and hands-on evaluation of Xingliu Agent, a domestic counterpart of Lovart focused on design and creative content generation, reviewing its features, user experience, and the team behind it [2][3][82].

Group 1: Product Overview
- Xingliu Agent generates a range of creative content, including images, videos, logos, posters, and 3D models, drawing on top models such as F.1, Kling, Qwen, and hailuo02 [4][3].
- Users can access the platform without an invitation; initial trials are free, with a points system for further usage [5][6].

Group 2: User Experience
- The interface is user-friendly, divided into sections for tools, generator settings, output previews, and AI dialogue [9][8].
- The article walks through creating a high-end fashion poster, noting both successful outputs and problems with Chinese text generation [12][19].
- The platform can also create emoji packs and brand logos, though it struggled with text accuracy and layout [25][31].

Group 3: Performance Evaluation
- Xingliu Agent's strengths lie in generating aesthetically pleasing images, particularly in photographic styles, while its weaknesses are detail accuracy and text generation [33][40].
- Its 3D-model and video generation produced mixed results in quality and efficiency [50][46].

Group 4: Pricing and Recommendations
- Xingliu Agent offers three subscription plans, the cheapest at 49 yuan per month for a limited number of tasks, so points can run out quickly [79].
- For high-quality, detailed output, users may prefer specialized AI tools; Xingliu Agent is suited to quick, less detailed creation [81].

Group 5: Team Background
- The team behind Xingliu Agent shares roots with Lovart and is led by Wang Haofan, who has a history of developing successful AI projects [82][86].
- The founding team includes members from prestigious institutions and companies, indicating a strong technical foundation [94][95].
ICCV 2025 | UniOcc: A unified dataset and benchmark platform for occupancy prediction and reasoning in autonomous driving
机器之心· 2025-07-09 07:10
A team from the University of California, Riverside (UC Riverside), the University of Michigan, the University of Wisconsin–Madison, and Texas A&M University presents UniOcc at ICCV 2025, the first unified benchmark framework for semantic occupancy grid construction and forecasting in autonomous driving.

UniOcc merges multi-source data from the real world (nuScenes, Waymo) and simulation (CARLA, OpenCOOD), unifies voxel formats and semantic labels, introduces voxel-level forward and backward motion-flow annotations for the first time, and supports cooperative multi-vehicle occupancy prediction and reasoning. To move beyond the limits of pseudo-label evaluation, UniOcc designs several ground-truth-free metrics that measure object-shape plausibility and temporal consistency (a toy version is sketched below). Experiments on multiple SOTA models demonstrate clear advantages in exploiting motion-flow information, cross-domain generalization, and cooperative prediction. UniOcc is fully open source and supports occupancy forecasting, long-horizon prediction, dynamic tracking, and other tasks, aiming to build a standardized platform for perception research and to push autonomous driving toward a more multimodal, more generalizable stage.

Paper title: UniOc ...
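As a sketch of what a ground-truth-free temporal-consistency metric over flow-annotated voxel grids might look like, here is a toy NumPy version; the data layout and the `temporal_consistency` function are assumptions for illustration, not UniOcc's released API.

```python
# Toy sketch (assumed data layout, not UniOcc's code): warp the occupied
# voxels at time t by their per-voxel forward motion flow and measure how
# often they land on occupied voxels at t+1 -- no ground truth needed.
import numpy as np

def temporal_consistency(occ_t: np.ndarray, flow_t: np.ndarray,
                         occ_t1: np.ndarray) -> float:
    # occ_t, occ_t1: (X, Y, Z) integer semantic labels, 0 = free space.
    # flow_t: (X, Y, Z, 3) per-voxel displacement in voxel units.
    idx = np.argwhere(occ_t > 0)                       # occupied voxels at t
    warped = np.rint(idx + flow_t[tuple(idx.T)]).astype(int)
    inside = np.all((warped >= 0) & (warped < occ_t.shape), axis=1)
    warped = warped[inside]                            # drop out-of-grid voxels
    hits = occ_t1[tuple(warped.T)] > 0                 # still occupied at t+1?
    return float(hits.mean()) if len(hits) else 1.0

occ_t = np.zeros((8, 8, 4), dtype=int); occ_t[2, 2, 1] = 5  # one class-5 voxel
flow = np.zeros((8, 8, 4, 3)); flow[2, 2, 1] = [1, 0, 0]    # it moves +x
occ_t1 = np.zeros_like(occ_t); occ_t1[3, 2, 1] = 5
print(temporal_consistency(occ_t, flow, occ_t1))  # 1.0
```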