MolmoAct 7B
Ai2 launches MolmoAct model: challenging Nvidia and Google in robotics
Sou Hu Cai Jing · 2025-08-14 07:50
Core Insights
- The article discusses the rapid development of physical AI, which combines robotics with foundation models; companies such as Nvidia, Google, and Meta have released research exploring the integration of large language models with robotics [2][4]

Group 1: MolmoAct Overview
- The Allen Institute for Artificial Intelligence (Ai2) has released MolmoAct 7B, an open-source model designed to let robots "reason in space," aiming to challenge Nvidia and Google in the physical AI domain [2]
- MolmoAct is classified as an action reasoning model: it enables a foundation model to reason about actions in three-dimensional physical space, improving a robot's understanding of the physical world and the quality of its interaction decisions [2][3]

Group 2: Unique Advantages
- Ai2 claims that MolmoAct possesses three-dimensional spatial reasoning capabilities, unlike traditional vision-language-action (VLA) models, which cannot think or reason spatially; this makes MolmoAct more efficient and generalizable [2][6]
- The model is particularly suited to dynamic, irregular environments such as homes, where robotics faces significant challenges [2]

Group 3: Technical Implementation
- MolmoAct understands the physical world through "spatial location perception tokens," which are pre-trained and extracted using a vector-quantized variational autoencoder, allowing the model to convert video data into tokens [3][7]
- These tokens let the model estimate distances between objects and predict a series of "image space" waypoints, which are then decoded into specific action outputs [3]

Group 4: Performance Metrics
- Benchmark tests indicate that MolmoAct 7B achieves a task success rate of 72.1%, surpassing models from Google, Microsoft, and Nvidia [3][8]
- The model can adapt to various embodiments, such as robotic arms or humanoid robots, with minimal fine-tuning required [8]

Group 5: Industry Trends
- Developing more intelligent robots with spatial awareness has been a long-term goal for many developers and computer scientists, and the advent of large language models has accelerated progress toward it [4][5]
- Companies such as Google and Meta are also exploring similar technologies; Google's SayCan, for example, helps robots reason about tasks and determine action sequences [4]
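The tokenization step described under Technical Implementation, converting continuous visual features into discrete tokens with a vector-quantized variational autoencoder, rests on nearest-neighbor lookup against a learned codebook. The minimal sketch below illustrates only that quantization step; the codebook values, feature vectors, and function name are invented for the example and are not from MolmoAct's actual implementation.

```python
import numpy as np

def quantize_features(features, codebook):
    """Map continuous feature vectors to discrete token ids via
    nearest-neighbor lookup in a codebook, the core quantization
    step of a VQ-VAE. features: (N, D), codebook: (K, D)."""
    # Squared Euclidean distance from every feature to every code vector.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The token id for each feature is the index of its closest code.
    return dists.argmin(axis=1)

# Toy example: 4 patch embeddings quantized against a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
tokens = quantize_features(features, codebook)
print(tokens.tolist())  # [0, 1, 2, 1]
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; here it is fixed so the lookup itself is easy to follow. A model like the one the article describes would then treat these discrete ids as tokens in its sequence, alongside language and action tokens.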