机器之心
Real-Robot RL! The Most Capable VLA Model Yet, π*0.6, Arrives, and Robots Are Running a Coffee Shop in the Office
机器之心· 2025-11-18 03:30
Core Insights - Physical Intelligence (PI) has developed a new robot foundation model, π*0.6, that significantly improves the success rate and efficiency of embodied-intelligence tasks [2][3][6] - The company secured over $400 million in funding in 2024 at a valuation exceeding $2 billion, positioning it as a leading player in the embodied-intelligence sector [3] Group 1: Model Development and Capabilities - π*0.6 uses a "Vision-Language-Action" (VLA) framework, trained on extensive robot perception and action data, enabling it to generalize and perform tasks in unseen environments [3][9] - The model has demonstrated success rates above 90% across a range of tasks, together with substantial gains in throughput [6][34] - The Recap method, which combines training on demonstrations, corrective guidance, and learning from autonomous experience, has been pivotal to these gains [9][19] Group 2: Performance Metrics and Applications - After incorporating real-world execution experience, the model more than doubled its throughput and raised its success rate on challenging tasks such as making espresso coffee [27][29] - Physical Intelligence tested the model in three real-world applications: making espresso drinks, folding various types of clothing, and assembling packaging boxes, achieving success rates above 90% in these tasks [25][34] - The model's architecture allows it to handle diverse prompts and conditions, improving its adaptability in real-world scenarios [22][23] Group 3: Learning Methodology - The Recap method addresses the credit-assignment challenge in reinforcement learning, allowing the model to learn from both successful and unsuccessful actions [14][20] - Training begins with offline reinforcement learning for pre-training, followed by task-level fine-tuning on demonstration data and real-world feedback [25][36] - The combination of expert demonstrations, corrective guidance, and autonomous experience is expected to further improve the model's learning efficiency and performance; a sketch of this recipe follows [37]
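One way to picture the Recap recipe above is value-function learning plus policy extraction conditioned on whether each transition beat the value baseline, so good and bad experience both contribute supervision. Below is a minimal, hypothetical PyTorch sketch of that idea; the module names, MSE losses, and binary advantage token are illustrative assumptions on our part, not Physical Intelligence's actual implementation.

```python
# Hypothetical sketch: advantage-conditioned policy extraction from mixed
# robot experience (demos + corrections + autonomous rollouts).
# All names are illustrative; this is NOT Physical Intelligence's code.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Estimates expected return V(s) from an observation embedding."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

class ConditionedPolicy(nn.Module):
    """Action head conditioned on a binary 'advantage' token, so the model
    learns from both good and bad transitions without imitating the bad ones."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

    def forward(self, obs, adv_token):
        return self.net(torch.cat([obs, adv_token[:, None].float()], dim=-1))

def recap_style_update(policy, value, batch, gamma=0.99):
    # 1) Fit the value function on observed returns (credit assignment).
    v_loss = ((value(batch["obs"]) - batch["return"]) ** 2).mean()
    # 2) Label each transition by whether it beat the value baseline.
    with torch.no_grad():
        adv = batch["reward"] + gamma * value(batch["next_obs"]) - value(batch["obs"])
        good = adv > 0
    # 3) Behavior-clone all data, conditioned on the advantage label;
    #    at deployment, conditioning on good=1 extracts the improved policy.
    pred = policy(batch["obs"], good)
    pi_loss = ((pred - batch["action"]) ** 2).mean()
    return v_loss + pi_loss
```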
Huawei Noah's Ark Lab Releases ScaleNet: A New General Paradigm for Model Scaling
机器之心· 2025-11-18 03:30
Core Insights - The article discusses the challenges of scaling models in AI, particularly the high cost of training large-scale models and the need for efficient model-expansion methods [2][3][4]. - ScaleNet is introduced as a method that expands models effectively while remaining parameter-efficient, demonstrating significant performance gains on both visual and language tasks [5][20]. Research Motivation - The high computational cost of training large-scale models has led researchers to explore methods like progressive training, which reuses weights from smaller models to initialize larger ones. However, these methods often introduce new independent parameters, increasing storage requirements and slowing optimization [4]. Core Methodology - ScaleNet combines two key techniques: layer-wise weight sharing and lightweight adapters; a sketch follows this summary [6][7]. - Layer-wise weight sharing lets new layers share parameters with existing layers of the pre-trained model, improving parameter efficiency and accelerating learning [8]. - A lightweight adapter is attached to each shared layer to provide unique adjustments, so that while knowledge is shared, each layer can still learn specialized functions, preserving model capacity and performance [11]. Experimental Results and Analysis - In visual-model evaluations, ScaleNet outperformed baseline methods in accuracy at similar parameter counts across architectures such as DeiT and Swin [14]. - For instance, ScaleNet achieved 76.46% Top-1 accuracy with 6.45 million parameters on DeiT-Tiny, versus 75.01% for the baseline [15]. - ScaleNet also demonstrated superior training efficiency, requiring only 100 epochs and 15.8 hours to reach 81.13% accuracy on DeiT-Small, compared with 300 epochs and 47.3 hours for direct training [16]. Generalization to Language Models - Applied to the Llama-3.2-1B language model, ScaleNet achieved an average improvement of 0.92% across common-sense reasoning benchmarks, indicating cross-modal applicability [17][18]. - The method also showed stable improvements on downstream visual tasks such as object detection and semantic segmentation, further confirming its generalization ability [19]. Conclusion - ScaleNet provides an efficient, cost-effective path for expanding pre-trained models, significantly improving training efficiency and model performance on both visual and language tasks. This work contributes to building larger, stronger, and more economical AI models, promoting sustainable growth in the field [20].
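The two ingredients named above, layer-wise weight sharing and lightweight adapters, compose naturally: each expanded layer reuses a pre-trained block's weights while a small per-layer adapter supplies its unique behavior. Here is a hedged PyTorch sketch under our own assumptions (bottleneck adapters initialized to the identity, a uniform expansion factor); ScaleNet's real design may differ in detail.

```python
# Hypothetical sketch of the two ScaleNet ingredients described above.
# Names and dimensions are illustrative.
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter: a small fraction of a block's parameters."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so each adapter starts as the identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.down(x).relu())

class SharedBlock(nn.Module):
    """Wraps ONE pre-trained block; every expanded 'virtual' layer shares
    its weights but applies a distinct adapter."""
    def __init__(self, pretrained_block: nn.Module, dim: int):
        super().__init__()
        self.block = pretrained_block   # shared parameters (not copied)
        self.adapter = Adapter(dim)     # per-layer unique parameters

    def forward(self, x):
        return self.adapter(self.block(x))

def scale_up(pretrained_blocks, dim, expand_factor=2):
    # A 12-layer model becomes a 24-layer model with almost no new block
    # weights: each original block appears expand_factor times, and only
    # the tiny adapters differ between the copies.
    layers = []
    for blk in pretrained_blocks:
        for _ in range(expand_factor):
            layers.append(SharedBlock(blk, dim))
    return nn.Sequential(*layers)
```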
Teaching Large Models "Telepathy": A Multi-Agent Collaboration Paradigm Based on Thought Communication
机器之心· 2025-11-17 23:40
What would happen if multiple large models could read each other's minds? In the NeurIPS 2025 Spotlight paper Thought Communication in Multiagent Collaboration, researchers from CMU, Meta AI, and MBZUAI propose a brand-new mode of collaboration that lets models share "thoughts" directly, rather than relying on language alone. The work introduces the concept of Thought Communication, in which agents exchange latent thoughts at the internal representation level, enabling a kind of "telepathic" cooperation. On the theory side, the researchers establish the first identifiability theory of latent thoughts for multi-agent systems, proving that shared and private thoughts can be recovered from model states even in a nonparametric setting. On the implementation side, they build on this theory to propose the general framework ThoughtComm, which automatically extracts, routes, and injects these latent thoughts, enabling direct communication beyond language. Results show that this "thought-level communication" is not only theoretically sound but also substantially improves collaboration efficiency and reasoning ability in practice. Paper title: Thought Communication in Multiagent Collaboration Paper link: https:/ ...
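As a toy illustration of the extract-route-inject loop described above, the sketch below passes a latent "thought" from one agent's hidden state into another's. The linear extractor and additive injection are stand-ins we chose for brevity; the paper's identifiability machinery and ThoughtComm's actual architecture are considerably more involved.

```python
# Toy sketch of thought-level communication between two agents, assuming
# access to each model's hidden states. The linear extractor/injector here
# merely stands in for ThoughtComm's learned extraction/routing machinery.
import torch
import torch.nn as nn

class ThoughtChannel(nn.Module):
    def __init__(self, hidden_dim: int, thought_dim: int = 64):
        super().__init__()
        self.extract = nn.Linear(hidden_dim, thought_dim)  # state -> latent thought
        self.inject = nn.Linear(thought_dim, hidden_dim)   # thought -> residual update

    def send(self, sender_state: torch.Tensor) -> torch.Tensor:
        return self.extract(sender_state)

    def receive(self, receiver_state: torch.Tensor, thought: torch.Tensor) -> torch.Tensor:
        # Inject the peer's thought additively, leaving the receiver's own
        # ("private") representation intact.
        return receiver_state + self.inject(thought)

# Usage: share agent A's thought with agent B before B decodes its reply.
channel = ThoughtChannel(hidden_dim=4096)
state_a = torch.randn(1, 4096)   # last-layer hidden state of agent A
state_b = torch.randn(1, 4096)   # hidden state of agent B
state_b = channel.receive(state_b, channel.send(state_a))
```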
Just In: Musk's Grok 4.1 Quietly Released! Its General Capabilities Crush All Other Models
机器之心· 2025-11-17 23:40
Core Insights - xAI has announced the release of Grok 4.1, which is now available to all users across various platforms including the Grok website, X, and mobile applications [1][3] - Grok 4.1 shows significant improvements in real-world usability, particularly in creativity, emotional interaction, and collaborative engagement [4][6] - The model has enhanced capabilities in understanding subtle intentions and maintaining coherent personality traits while retaining the intelligence and reliability of its predecessor [4][6] Performance Metrics - Grok 4.1 has achieved a 64.78% probability of being preferred by users in comparative evaluations against previous models [6] - In the LMArena Text Arena leaderboard, Grok 4.1's reasoning mode (quasarflux) ranks first with an Elo score of 1483, outperforming the highest non-xAI model by 31 points [13] - The non-reasoning mode (tensor) ranks second with an Elo score of 1465, demonstrating superior performance even without reasoning capabilities [13][14] Emotional Intelligence - Grok 4.1 was tested on the EQ-Bench3, which evaluates emotional intelligence through challenging role-play scenarios [17] - The results indicate that Grok 4.1's reasoning and non-reasoning modes ranked first and second respectively in emotional intelligence assessments [18] Creative Writing - xAI evaluated Grok 4.1's performance on the Creative Writing v3 benchmark, which involved generating responses to 32 different writing prompts [23] - The model has shown a significant reduction in hallucination rates for factual queries during its post-training phase, indicating improved reliability in information retrieval [27] Technical Details - For more technical details regarding Grok 4.1, a model card is available at the provided link [29]
MiniOneRec, the First Fully Open-Source Generative Recommendation Framework: A Lightweight Reproduction of the Industrial-Grade OneRec!
机器之心· 2025-11-17 09:00
Core Viewpoint - The article covers the launch of MiniOneRec, the first complete end-to-end open-source framework for generative recommendation, which validates the generative-recommendation Scaling Law and provides a comprehensive training and research platform for the community [2][4]. Group 1: Generative Recommendation Framework - MiniOneRec has drawn significant attention in the recommendation community since its release on October 28; all code, datasets, and model weights are open-sourced, and reproduction requires only 4-8 A100 GPUs [6]. - The framework offers a one-stop, lightweight implementation of generative recommendation, including a rich toolbox for SID (Semantic ID) construction that integrates advanced quantization algorithms [9]. - The framework demonstrates a clear advantage in parameter efficiency: training and evaluation loss keep decreasing as model size grows from 0.5 billion to 7 billion parameters [8][10]. Group 2: Performance Validation - Researchers validated the generative-recommendation Scaling Law on public datasets, showcasing the paradigm's efficiency in parameter utilization [7]. - MiniOneRec significantly outperforms both traditional and prior generative recommendation approaches, leading the TIGER model by roughly 30 percentage points on metrics such as HitRate@K and NDCG@K [23]. Group 3: Innovations in Recommendation - The framework introduces a full-process SID alignment strategy, which significantly enhances generative recommendation by incorporating world knowledge from large models [13][15]. - MiniOneRec employs a reinforcement learning strategy tailored to recommendation, including a constrained-decoding sampling strategy to improve the diversity of generated items (see the sketch after this summary) and a ranking reward to sharpen the ordering signal [17][21]. Group 4: Future Outlook - The article asks whether generative recommendation will become the new paradigm for recommender systems, highlighting two routes: a reformist approach that integrates generative architectures into existing systems, and a revolutionary approach that aims to replace traditional pipelines entirely [25][26]. - Both routes have demonstrated the practical value of the generative paradigm, and some major companies have already realized tangible benefits from deploying it [27].
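The constrained decoding referenced in Group 3 is commonly implemented with a prefix trie over valid SIDs that masks the logits, so the model can only emit token sequences corresponding to real catalog items. The sketch below shows that generic technique under our own assumptions; it is not MiniOneRec's exact code.

```python
# Illustrative sketch of constrained decoding over semantic IDs (SIDs).
# Hypothetical code, not MiniOneRec's actual implementation.
import torch

def build_trie(valid_sids):
    """valid_sids: list of token-id tuples, one per catalog item."""
    trie = {}
    for sid in valid_sids:
        node = trie
        for tok in sid:
            node = node.setdefault(tok, {})
    return trie

def constrained_step(logits, generated, trie):
    """Masks logits so the next token must extend a valid SID prefix.
    Assumes decoding halts once a complete SID has been emitted."""
    node = trie
    for tok in generated:          # walk the trie along what we've emitted
        node = node[tok]
    mask = torch.full_like(logits, float("-inf"))
    mask[list(node.keys())] = 0.0  # only valid continuations survive
    return logits + mask

# Usage with a toy vocabulary: items (1,5,7) and (1,6,2) share prefix 1.
trie = build_trie([(1, 5, 7), (1, 6, 2)])
logits = torch.randn(10)
next_logits = constrained_step(logits, generated=[1], trie=trie)  # only 5 or 6 allowed
```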
Only $0.30 and 26 Minutes! CudaForge: A Disruptive Low-Cost CUDA Optimization Framework
机器之心· 2025-11-17 09:00
The authors of this paper are from the University of Minnesota: Zijian Zhang (co-first author), Rong Wang (co-first author), Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding. The performance of CUDA code is critical to today's model training and inference, yet hand-writing optimized CUDA kernels demands deep expertise and considerable time. Meanwhile, LLMs have achieved many successes in the code domain in recent years, prompting exploration of how to use LLMs to write optimized CUDA kernels. Existing approaches, however, face numerous problems, such as high training and inference costs, poor kernel performance, and blind exploration caused by the lack of hardware feedback. So, for LLM-based CUDA code generation, can we design a simple yet effective method that generates reliable, efficient CUDA kernels at low cost? The University of Minnesota team proposes a new method, CudaForge: a simple, efficient, and low-cost multi-agent workflow for CUDA kernel generation and optimization. Inspired by the actual development process of human experts, the workflow comprises key stages including writing the initial kernel, correctness testing, hardware-feedback analysis, and iterative refinement. Experimental results show that CudaForge achieves SOTA results on KernelBench Levels 1-3, surpassing all existing ...
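To make the workflow concrete, here is a hypothetical sketch of such a generate-test-profile-refine loop. The `llm()` helper, the prompts, and the use of nvcc plus Nsight Compute (`ncu`) for hardware feedback are our assumptions; CudaForge's actual agents, prompts, and tooling may differ.

```python
# Minimal sketch of a CudaForge-style iterative loop, assuming an `llm()`
# helper that calls any chat model, plus shell access to nvcc / a profiler.
# Function names and prompts are illustrative, not the paper's exact ones.
import subprocess

def compile_and_test(kernel_src: str) -> tuple[bool, str]:
    """Compile the candidate kernel and run a correctness harness."""
    with open("kernel.cu", "w") as f:
        f.write(kernel_src)
    build = subprocess.run(["nvcc", "kernel.cu", "-o", "kernel"],
                           capture_output=True, text=True)
    if build.returncode != 0:
        return False, build.stderr          # compiler errors become feedback
    run = subprocess.run(["./kernel"], capture_output=True, text=True)
    return run.returncode == 0, run.stdout

def optimize_kernel(task: str, llm, max_iters: int = 10) -> str:
    kernel = llm(f"Write a CUDA kernel for: {task}")
    for _ in range(max_iters):
        ok, feedback = compile_and_test(kernel)
        if ok:
            # Once correct, keep iterating on performance: feed profiler
            # metrics (occupancy, memory throughput, ...) back to the model.
            feedback = subprocess.run(["ncu", "--csv", "./kernel"],
                                      capture_output=True, text=True).stdout
        kernel = llm(f"Improve this kernel.\nFeedback:\n{feedback}\n\n{kernel}")
    return kernel
```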
After Genuinely Trying Alibaba's "Qianwen APP", Why Do We Call It "China's ChatGPT"?
机器之心· 2025-11-17 04:23
Core Viewpoint - Alibaba has launched a new application called Qianwen APP, which aims to compete in the C-end AI application market, positioning itself as "China's ChatGPT" [3][5][55]. Group 1: Product Positioning and Strategy - Qianwen APP is designed to be a personal AI assistant for users, integrating various daily tasks such as knowledge Q&A, search, content creation, code generation, and shopping into one platform [5][55]. - The app is seen as a significant move following Alibaba's investment of 380 billion yuan in AI infrastructure earlier this year, indicating the company's commitment to AI development [5][55]. - Qianwen APP represents Alibaba's first attempt to directly connect its strongest models to users in a personal assistant format, enhancing user experience and perception of model capabilities [6][13]. Group 2: Model Capabilities - The Qwen model family, particularly Qwen3-Max, boasts over 1 trillion parameters and 36 trillion tokens of pre-training data, achieving breakthroughs in various capabilities such as Chinese and English understanding, complex instruction following, and programming [6][12]. - Qwen3-Max has demonstrated top-tier performance in coding and tool-calling capabilities, ranking highly in global benchmarks [6][12][7]. - The Qwen model family covers a wide range of modalities, including text, vision, speech, video, and code, and is recognized as one of the most popular open-source models globally [8][12]. Group 3: User Experience and Features - The Qianwen APP features a minimalist design, focusing on user-friendliness and ease of use, which is crucial for a general-purpose AI assistant [15][53]. - The app excels in visual recognition capabilities, accurately identifying various objects and providing detailed information, which enhances user interaction [19][25]. - Qianwen APP's professional Q&A capabilities are tailored for high-value fields such as finance, technology, and academia, ensuring depth and accuracy in responses [32][53]. Group 4: Competitive Landscape - The launch of Qianwen APP positions Alibaba in direct competition with global AI applications, particularly OpenAI's ChatGPT, as it aims to establish itself as a leading AI assistant in the market [55][56]. - The app's capabilities and design reflect a strategic shift towards creating a "super AI assistant," aligning with global trends in AI development [55][56].
VinciCoder: A Unified Multimodal Code Generation Framework with Visual-Feedback Reinforcement Learning; Data, Code, and Model Weights Open-Sourced
机器之心· 2025-11-17 04:23
Core Insights - The article discusses the limitations of traditional supervised fine-tuning (SFT) in multimodal code generation and introduces VinciCoder, a unified model that leverages visual reinforcement learning (ViRL) to enhance visual fidelity and code executability [2][6][22] - VinciCoder employs a two-phase strategy combining large-scale SFT with coarse-to-fine ViRL to address the challenges faced by existing models in generating diverse code from various visual inputs [2][7][22] Limitations of Traditional SFT - Traditional SFT suffers from a "visual gap" between training objectives and final tasks, leading to issues such as local optimization that fails to ensure global code executability and a lack of visual feedback during training [6][13] - The absence of visual feedback is critical, as minor code modifications can lead to significant changes in rendered images, highlighting the need for a mechanism that provides global visual feedback [6][7] VinciCoder's Approach - VinciCoder's innovation lies in shifting the reward mechanism from the text domain to the visual domain, utilizing a large-scale SFT to build foundational code capabilities, followed by a ViRL phase to optimize visual fidelity and executability [7][12] - The training framework consists of a "1.6M large-scale SFT phase" and a "42k coarse-to-fine ViRL phase," enabling strong code understanding and high-fidelity visual alignment [7][12] Large-Scale SFT and Code Optimization - The research team created a large-scale SFT corpus containing 1.6 million image-code pairs, which includes a new task of "visual code optimization" where the model corrects defective code to align with target images [10][12] Coarse-to-Fine ViRL Framework - VinciCoder introduces a coarse-to-fine visual reward mechanism that directly derives reward signals from visual outputs, addressing the lack of "visual-code" feedback in traditional SFT [12][14] - The framework evaluates visual similarity at both global (coarse) and local (fine) levels, enhancing the model's ability to generate accurate code [14] Experimental Results - VinciCoder demonstrated superior performance across multiple multimodal code generation benchmarks, outperforming both open-source and closed-source models, establishing new state-of-the-art (SOTA) standards [16][18] - The model's performance in challenging tasks, such as Image-to-SVG and chemical formula generation, rivals that of top closed-source models, showcasing its effectiveness [16][18] Research Significance and Future Applications - The research presents a new paradigm for multimodal code generation, emphasizing the importance of visual feedback in guiding code generation processes [19][20] - VinciCoder's success illustrates the potential of reinforcement learning to bridge the gap between visual and code modalities, paving the way for future developments in generalized multimodal intelligence [20][22]
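The coarse-to-fine reward described above can be approximated as a weighted sum of a whole-image similarity (coarse) and averaged patch-level similarities (fine) between the rendered output and the target image. The sketch below assumes images are arrays normalized to [0, 1] and uses a simple L1 similarity; VinciCoder's actual renderer and metrics are not specified here.

```python
# Hedged sketch of a coarse-to-fine visual reward for generated code.
# Assumes pred/target are same-shape float arrays in [0, 1].
import numpy as np

def global_reward(pred: np.ndarray, target: np.ndarray) -> float:
    """Coarse signal: whole-image similarity in [0, 1]."""
    return 1.0 - float(np.abs(pred - target).mean())

def local_reward(pred, target, grid: int = 4) -> float:
    """Fine signal: average similarity over a grid of patches, so small
    local errors (a wrong bar, a missing label) are not washed out."""
    h, w = pred.shape[:2]
    scores = []
    for i in range(grid):
        for j in range(grid):
            ps = pred[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            ts = target[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            scores.append(1.0 - np.abs(ps - ts).mean())
    return float(np.mean(scores))

def visual_reward(render_fn, code: str, target: np.ndarray,
                  w_global: float = 0.5, w_local: float = 0.5) -> float:
    try:
        pred = render_fn(code)      # execute/render the generated code
    except Exception:
        return 0.0                  # non-executable code earns no reward
    return w_global * global_reward(pred, target) + w_local * local_reward(pred, target)
```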
ChatGPT: Goodbye, "Em Dash"
机器之心· 2025-11-17 04:23
Core Viewpoint - The article discusses the peculiarities of AI-generated text, particularly the frequent use of em dashes, which has become a hallmark of AI writing styles, leading to a perception that texts with excessive em dashes are likely AI-generated [6][7]. Group 1: AI Writing Characteristics - AI models, such as ChatGPT, tend to overuse em dashes in their outputs, which has led to the term "ChatGPT style" being coined to describe this phenomenon [6]. - Users have begun to avoid using em dashes in their own writing to prevent being mistaken for AI-generated content [6][18]. - OpenAI's CEO, Sam Altman, announced that users can now instruct ChatGPT to avoid using em dashes, which he referred to as a "small but happy victory" [7]. Group 2: User Experience and Feedback - Despite the update, users quickly reported that em dashes still appeared in ChatGPT's responses, indicating that the issue may not be fully resolved [9]. - Testing showed that when instructed not to use em dashes, ChatGPT complied and did not include them in its responses [10]. - The article highlights other AI tendencies, such as the frequent inclusion of English terms in parentheses and the use of quotation marks around abstract concepts, which can also detract from the overall readability of the text [15][17].
Solving Tesla's "Sparse Supervision" Problem: DriveVLA-W0 Uses World Models to Amplify the Data Scaling Law in Autonomous Driving
机器之心· 2025-11-17 04:23
Core Insights - The article discusses the transition of VLA models in autonomous driving from academic research to practical applications, highlighting the challenge of "supervision deficit" [2][5][8] - A new research paper titled "DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving" proposes a solution to this challenge by introducing world models as a means to provide dense self-supervised signals [6][10][12] Group 1: Supervision Deficit - VLA models face a "supervision deficit" where high-dimensional visual input is paired with low-dimensional sparse supervisory signals, leading to wasted representational capacity [8][9] - The research team found that performance of VLA models saturates quickly with increased data under sparse action supervision, diminishing the effects of Data Scaling Law [9][22] Group 2: World Models as a Solution - The introduction of world models allows the model to predict future images, providing a richer and denser learning signal compared to relying solely on sparse actions [11][15][16] - This approach fundamentally alleviates the supervision deficit issue, enabling better learning of complex dynamics in driving environments [16][18] Group 3: Amplifying Data Scaling Law - The core contribution of the research is the discovery that world models significantly amplify the effects of Data Scaling Law, showing a steeper performance improvement with increased data compared to baseline models [18][21] - In experiments with up to 70 million frames, the world model reduced collision rates by 20.4%, demonstrating a qualitative leap in performance that surpasses merely stacking action data [24] Group 4: Efficiency and Real-World Application - The research also addresses the high latency issue in VLA models by proposing a lightweight MoE "action expert" architecture, which reduces inference latency to 63.1% of the baseline VLA without sacrificing performance [26][27] - This design enhances the feasibility of real-time deployment of VLA models in autonomous driving applications [27][29]
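Conceptually, the world-model fix described above amounts to attaching a dense future-frame prediction head next to the sparse action head and training both on a shared backbone, so every pixel of the predicted frame supervises the representation. The sketch below is an illustrative toy (flattened frames, MSE losses, made-up dimensions and loss weights), not DriveVLA-W0's actual architecture.

```python
# Illustrative sketch of adding a world-model objective to a VLA policy:
# the same backbone predicts both the action and the next camera frame,
# so dense pixel supervision augments the sparse action labels.
# Module names and loss weights are assumptions, not the paper's values.
import torch
import torch.nn as nn

class DrivingVLA(nn.Module):
    def __init__(self, feat_dim=512, act_dim=2, frame_dim=3*64*64):
        super().__init__()
        # Frames are flattened to vectors here purely for brevity.
        self.backbone = nn.Sequential(nn.Linear(frame_dim, feat_dim), nn.ReLU())
        self.action_head = nn.Linear(feat_dim, act_dim)    # sparse supervision
        self.world_head = nn.Linear(feat_dim, frame_dim)   # dense supervision

    def forward(self, frame):
        z = self.backbone(frame)
        return self.action_head(z), self.world_head(z)

def joint_loss(model, frame, action_gt, next_frame, w_world=1.0):
    action_pred, next_frame_pred = model(frame)
    action_loss = ((action_pred - action_gt) ** 2).mean()
    # World-model term: every pixel of the future frame supervises the
    # representation, offsetting the low-dimensional action signal.
    world_loss = ((next_frame_pred - next_frame) ** 2).mean()
    return action_loss + w_world * world_loss
```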