Ai2 Launches MolmoAct: Challenging Nvidia and Google in Robotics
Sou Hu Cai Jing· 2025-08-14 07:50
Physical AI is a fast-growing field that combines robotics with foundation models; companies such as Nvidia, Google, and Meta are publishing research that explores fusing large language models with robotics.

The Allen Institute for AI (Ai2) has released its latest research result, MolmoAct 7B, a new open-source model that lets robots "reason in space," aiming to challenge Nvidia and Google in physical AI. MolmoAct is built on Ai2's open-source project Molmo and can "think" in three dimensions; its training data has also been released. The model is licensed under Apache 2.0, and the dataset under CC BY-4.0.

Ai2 classifies MolmoAct as an action reasoning model: a foundation model that reasons about actions in physical 3D space. This means MolmoAct can use its reasoning ability to understand the physical world, plan how space will be occupied, and then execute the corresponding actions (see the sketch below).

**The Unique Advantage of Spatial Reasoning**

Because robots exist in the physical world, Ai2 claims MolmoAct helps robots perceive their surroundings and make better decisions about how to interact with them. The company says: "MolmoAct can be applied in any scenario where a machine needs to reason about its physical environment. We think mainly about household settings, because that is the hardest challenge in robotics, with irregular and constantly changing environments, but MolmoAct can be applied anywhere."

**Technical Implementation**

Ai2 says: "..."
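The description above implies a staged control flow: perceive the scene, reason about how space will be occupied, and only then act. Below is a minimal toy sketch of that flow, assuming a linear-interpolation planner; the function names and the planner itself are illustrative assumptions, not MolmoAct's published interface.

```python
import numpy as np

def perceive(scene_objects: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Stage 1: spatial understanding. A toy stand-in that assumes
    perception already yields 3D object positions."""
    return scene_objects

def plan_spatial_trace(scene: dict[str, np.ndarray], target: str,
                       start: np.ndarray, steps: int = 5) -> list[np.ndarray]:
    """Stage 2: reason about how the arm will occupy space, producing an
    inspectable sequence of 3D waypoints before anything moves."""
    goal = scene[target]
    return [start + (goal - start) * t for t in np.linspace(0.0, 1.0, steps)]

def decode_actions(trace: list[np.ndarray]) -> list[np.ndarray]:
    """Stage 3: turn the waypoint trace into low-level delta commands."""
    return [b - a for a, b in zip(trace, trace[1:])]

# Toy run: "move to the mug" becomes perceive -> plan -> act.
scene = perceive({"mug": np.array([0.4, 0.1, 0.2])})
trace = plan_spatial_trace(scene, "mug", start=np.zeros(3))
commands = decode_actions(trace)  # four small end-effector deltas
```

The value of the intermediate trace is that the plan exists as an inspectable object before execution, which is arguably what "reasoning in space" adds over mapping pixels directly to motor commands.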
Is Your AI Butler Quietly "Wrecking the House"? New Research Reveals Safety Vulnerabilities in Household Embodied Agents
机器之心· 2025-07-27 08:45
Core Insights
- The article discusses the launch of IS-Bench, a benchmark for evaluating the safety of embodied agents interacting with household environments, highlighting the potential dangers of letting AI assistants operate autonomously [2][4][19]
- Current visual language model (VLM) household assistants have a safety completion rate of less than 40%, indicating significant risks associated with their actions [4][19]

Evaluation Framework
- IS-Bench introduces over 150 household scenarios containing hidden safety hazards, designed to comprehensively test the safety capabilities of AI assistants [2][4]
- The framework moves away from static assessment to a dynamic evaluation process that tracks risks throughout the interaction, capturing evolving risk chains (see the sketch after this summary) [5][10]

Safety Assessment Challenges
- Traditional evaluation methods fail to identify dynamic risks that emerge during task execution, leading to systematic oversight of critical safety hazards [6][7]
- Even when the final outcome appears safe, the process may have introduced significant risks along the way, underscoring the need for process-level safety assessment [7][19]

Scenario Customization Process
- IS-Bench employs a systematic scene-customization pipeline that combines GPT-generated scenarios with human verification to ensure a diverse range of safety hazards [8][12]
- The resulting "Household Danger Encyclopedia" includes 161 high-fidelity test scenarios with 388 embedded safety hazards across various household settings [12]

Interactive Safety Evaluation
- The framework tracks the agent's actions in real time, allowing continuous safety assessment throughout the task [15]
- A tiered evaluation mechanism tests agents at varying levels of difficulty, probing their safety decision-making capabilities [15]

Results and Insights
- Many VLM-based agents struggle with risk perception and awareness; safety completion rates improve significantly when safety goals are explicitly specified [18][19]
- Proactive safety measures are often overlooked: agents successfully complete less than 30% of precautionary actions [19]
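To make the process-level idea concrete, here is a minimal sketch of a stepwise evaluator in the spirit of what the article describes; all names are hypothetical and this is not IS-Bench's actual code. Every intermediate action is checked against a set of hazard predicates instead of only inspecting the final state, and a safety completion rate is computed over episodes.

```python
from collections.abc import Callable
from dataclasses import dataclass, field

# A hazard predicate inspects the simulated world state and returns True
# when a risk condition holds (e.g., stove on with a towel lying beside it).
HazardCheck = Callable[[dict], bool]

@dataclass
class EpisodeResult:
    task_done: bool
    hazards_triggered: list[str] = field(default_factory=list)

def run_stepwise_eval(agent, env, hazard_checks: dict[str, HazardCheck],
                      max_steps: int = 50) -> EpisodeResult:
    """Run one episode, checking every intermediate state for hazards,
    so transient risks cannot slip through a final-state-only check."""
    state = env.reset()
    done = False
    triggered: list[str] = []
    for _ in range(max_steps):
        action = agent.act(state)
        state, done = env.step(action)
        for name, check in hazard_checks.items():
            if name not in triggered and check(state):
                triggered.append(name)  # a risk surfaced mid-task
        if done:
            break
    return EpisodeResult(task_done=done, hazards_triggered=triggered)

def safety_completion_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes that finish the task AND trigger no hazard."""
    safe = sum(1 for r in results if r.task_done and not r.hazards_triggered)
    return safe / len(results)
```

A final-state checker would call an episode safe even if a knife sat on the floor halfway through the task; recording hazards at every step is what lets an evaluation capture the "evolving risk chains" described above.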
Technical Deep Dive: A Detailed Breakdown of VLA (Vision-Language-Action) Models (with a Survey of the Major Players)
Robot猎场备忘录· 2025-06-25 04:21
Core Viewpoint
- The article focuses on the emerging Vision-Language-Action (VLA) model, which integrates visual perception, language understanding, and action generation, marking a significant advance in robotics and embodied intelligence [1][2]

Summary by Sections

VLA Model Overview
- The VLA model combines visual language models (VLM) with end-to-end models, representing a new generation of multimodal machine learning models. Its core components are a visual encoder, a text encoder, and an action decoder [2]
- The VLA model extends the capabilities of traditional VLMs by enabling human-like reasoning and global understanding, improving its interpretability and usability [2][3]

Advantages of the VLA Model
- The VLA model lets robots weave language intent, visual perception, and physical action into a continuous decision-making flow, significantly shortening the gap between understanding an instruction and executing the task, and improving the robot's ability to understand and adapt to complex environments [3]

Challenges of the VLA Model
- Architectural inheritance: the overall structure is not redesigned; only output modules are added or replaced [4]
- Action tokenization: robot actions must be represented in a language-like token format (see the sketch after this summary) [4]
- End-to-end learning that integrates perception, reasoning, and control [4]
- Generalization: pre-trained VLMs may struggle with cross-task transfer [4]

Solutions and Innovations
- To address these challenges, companies are proposing a dual-system architecture that splits the VLA model into a VLM and a separate action-execution model, potentially leading to more effective implementations [5][6]

Data and Training Limitations
- Training a VLA model requires large-scale, high-quality multimodal datasets that are difficult and costly to obtain; the lack of commercial embodied hardware limits data collection, making it hard to build a robust data flywheel [7]
- The VLA model also struggles with long-horizon planning and state tracking: the link between the "brain" (the VLM) and the "cerebellum" (the action model) relies heavily on direct language-to-action mapping, causing problems on multi-step tasks [7]
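To illustrate the action-tokenization idea named in the challenges above, here is a minimal sketch assuming an RT-2-style scheme in which each dimension of a continuous robot action is discretized into one of 256 bins and emitted as a token the language model can predict. The bin count, ranges, and function names are illustrative assumptions, not any specific vendor's implementation.

```python
import numpy as np

N_BINS = 256  # assumed per-dimension vocabulary of discrete action tokens

def tokenize_action(action: np.ndarray, low: np.ndarray,
                    high: np.ndarray) -> list[int]:
    """Map each continuous action dimension (e.g., 7-DoF arm deltas plus
    gripper) to an integer bin, so actions become 'words' a VLM can emit."""
    normalized = (action - low) / (high - low)            # -> [0, 1]
    bins = np.clip((normalized * N_BINS).astype(int), 0, N_BINS - 1)
    return bins.tolist()

def detokenize_action(tokens: list[int], low: np.ndarray,
                      high: np.ndarray) -> np.ndarray:
    """Invert the mapping: bin centers back to continuous commands."""
    centers = (np.asarray(tokens) + 0.5) / N_BINS
    return low + centers * (high - low)

# Example: a 7-D action (xyz delta, rpy delta, gripper) on a [-1, 1] range.
low, high = -np.ones(7), np.ones(7)
action = np.array([0.10, -0.25, 0.00, 0.05, 0.00, 0.00, 1.00])
tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)  # lossy but close
```

Because the action decoder now predicts ordinary token IDs, the same autoregressive machinery that generates text can generate motor commands, which is what allows a pre-trained VLM to be extended into a VLA without redesigning its architecture.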
Can Large Models That Claim to Know Everything Save Clumsy Robots?
Hu Xiu· 2025-05-06 00:48
From Shanghai to New York, robots can be seen cooking food in restaurants around the world. They make burgers, Indian flatbreads, pizza, and stir-fry. They work on the same principle robots have used to manufacture everything else for the past 50 years: execute instructions precisely, repeating the same steps over and over.

But Ishika Singh does not want this kind of "assembly line" robot; she wants one that can genuinely "make dinner." It should be able to walk into a kitchen, rummage through the fridge and cupboards, pull out ingredients and combine them, cook a tasty meal, and then set the table. A child might find this simple, yet no robot can do it. It requires far too much knowledge about kitchens, and beyond that it demands common sense, flexibility, and adaptability, all of which lie outside the scope of traditional robot programming.

Singh, a computer science PhD student at the University of Southern California, points out that the crux of the problem is the classical planning pipeline roboticists use. "They need to define every single action, along with its preconditions and expected effects," she explains. "That requires specifying in advance everything that could possibly happen in the environment." Yet even after countless rounds of trial and error and thousands of lines of code, such a robot still cannot handle situations its program never anticipated (a sketch of this brittleness follows below).

When a dinner-serving robot formulates a "policy" (the action plan for carrying out its instructions), it must not only know the local food culture (what "spicy" actually means there) but also be familiar with the specific kitchen (is the rice cooker on a high shelf?) and the particular needs of the people it serves (Hec ...
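To make the classical-planning bottleneck Singh describes concrete, here is a minimal STRIPS-style sketch; the operator and predicate names are hypothetical examples, not code from any system in the article. Every action must enumerate its preconditions and effects, and the plan breaks the moment the world contains a fact the programmer never listed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset[str]  # facts that must hold before acting
    add_effects: frozenset[str]    # facts the action makes true
    del_effects: frozenset[str]    # facts the action makes false

# A hand-written operator: every assumption about the kitchen is explicit.
pick_up_pot = Action(
    name="pick_up_pot",
    preconditions=frozenset({"pot_on_shelf", "hand_empty", "shelf_reachable"}),
    add_effects=frozenset({"holding_pot"}),
    del_effects=frozenset({"pot_on_shelf", "hand_empty"}),
)

def apply(action: Action, state: frozenset[str]) -> frozenset[str]:
    """Execute one action; fails if any precondition was not foreseen."""
    if not action.preconditions <= state:
        missing = action.preconditions - state
        raise RuntimeError(f"plan breaks: unmet preconditions {missing}")
    return (state - action.del_effects) | action.add_effects

# Works in the world the programmer anticipated...
state = frozenset({"pot_on_shelf", "hand_empty", "shelf_reachable"})
state = apply(pick_up_pot, state)

# ...but one unmodeled change (someone moved the pot) breaks the plan.
surprise = frozenset({"pot_in_sink", "hand_empty", "shelf_reachable"})
try:
    apply(pick_up_pot, surprise)
except RuntimeError as err:
    print(err)  # plan breaks: unmet preconditions {'pot_on_shelf'}
```

This brittleness, multiplied across every object, utensil, and contingency in a real kitchen, is the gap that researchers hope the common sense inside large language models can fill.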