Vision-Language-Action Models (VLA)
New Alibaba Research Unifies VLA and World Models
QbitAI (量子位) · 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework that integrates Vision-Language-Action (VLA) models with world models, proposed by Alibaba DAMO Academy, Hupan Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms standalone action models and standalone world models, showing that the two enhance each other [2]
Model Overview
- The framework combines three independent tokenizers that encode images, text, and actions. The image tokenizer is a VQ-GAN model with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes each dimension of a continuous robot action into 256 intervals, so that one action is represented by 7 tokens [8]
Model Design
- WorldVLA employs an autoregressive action world model to unify the understanding and generation of actions and images [4]
- The model addresses limitations of existing VLA models and world models by using an understanding of environmental physics to make action generation more accurate [5][14]
Training and Performance
- WorldVLA is trained jointly on action-model data and world-model data, which strengthens its action generation [13]
- Performance is positively correlated with image resolution: 512x512 input yields significant improvements over 256x256 [21][23]
Benchmark Results
- WorldVLA outperforms discrete OpenVLA models even without pre-training, which validates its architectural design [19]
- The model generates coherent and physically plausible states across a range of scenarios, outperforming pure world models [31][32]
Mutual Enhancement
- The world model improves the action model by predicting how the environment state changes in response to the current action, which is crucial for tasks that require precision [25]
- Conversely, the action model improves the world model's visual understanding, which supports better visual generation [17][30]
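The tokenizer figures above can be made concrete with a minimal sketch. It is not WorldVLA's code: the function names `discretize_action`, `decode_action`, and `image_token_count` are hypothetical, the binning is assumed to be uniform over a per-dimension range `[low, high]`, and the "compression ratio of 16" is assumed to mean a 16x downsampling along each image side. Only the constants 256, 7, and 16 come from the summary.

```python
from typing import List, Sequence

NUM_BINS = 256    # from the summary: 256 intervals per action dimension
ACTION_DIMS = 7   # from the summary: one action is represented by 7 tokens


def discretize_action(action: Sequence[float],
                      low: Sequence[float],
                      high: Sequence[float]) -> List[int]:
    """Map a continuous 7-dim action to 7 integer tokens in [0, 255].

    Assumption: uniform bins over [low, high] for each dimension.
    """
    if not (len(action) == len(low) == len(high) == ACTION_DIMS):
        raise ValueError(f"expected {ACTION_DIMS} values per argument")
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)      # keep the value inside the range
        frac = (clipped - lo) / (hi - lo)      # normalise to [0, 1]
        # frac == 1.0 would give bin 256, so cap at the last valid bin.
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def decode_action(tokens: Sequence[int],
                  low: Sequence[float],
                  high: Sequence[float]) -> List[float]:
    """Map tokens back to continuous values using each bin's centre."""
    return [lo + (t + 0.5) * (hi - lo) / NUM_BINS
            for t, lo, hi in zip(tokens, low, high)]


def image_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Number of image tokens, assuming `downsample`x reduction per side."""
    return (height // downsample) * (width // downsample)


if __name__ == "__main__":
    low, high = [-1.0] * ACTION_DIMS, [1.0] * ACTION_DIMS
    action = [-1.0, 1.0, 0.0, 0.25, -0.5, 0.9, 0.1]
    tokens = discretize_action(action, low, high)
    print(tokens)
    print(decode_action(tokens, low, high))
    print(image_token_count(256, 256), image_token_count(512, 512))
```

Under these assumptions the round-trip error per dimension is at most half a bin width, and a 512x512 image costs four times as many tokens as a 256x256 one, which is the trade-off behind the resolution result reported in the summary.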
Humanoid Robot Training Ground of Over 10,000 Square Meters Opens in Beijing
Huan Qiu Wang Zi Xun· 2025-09-25 10:04
Core Insights
- The humanoid robot training facility in Shijingshan, Beijing, has officially begun operations, marking a significant development in China's humanoid robot industry and providing a model for training facilities nationwide [1][7]
- The facility aims to accelerate the evolution of humanoid robots' "embodied intelligence" and promote their large-scale application in sectors such as automotive manufacturing and logistics, laying a solid foundation for a trillion-dollar industry [1][7]
Group 1: Training Facility Overview
- The training center spans over 10,000 square meters and replicates 16 detailed scenarios across four categories: industrial manufacturing, smart home, elderly care services, and 5G integration [3]
- The humanoid robot "Kuavo," which stands 1.66 meters tall, is training in various scenarios and achieves a success rate of over 95% in tasks such as empty box retrieval, material sorting, weighing, packaging, and product boxing [3][4]
- The facility's data comes entirely from real machine operations, which addresses industry problems of poor data quality, high acquisition costs, and difficult migration [3][4]
Group 2: Data Quality and Standardization
- The facility aims to overcome the bottlenecks in data quality and accessibility that have long held back the humanoid robot industry, moving from a "cottage industry" model to standardized, large-scale data production [4][5]
- High-quality, large-scale training data is essential to the performance of vision-language-action (VLA) models, which enable robots to work across platforms and scenarios [5]
- Real machine data is crucial for bridging the gap between theoretical models and practical applications, because synthetic data cannot fully replicate real-world interactions and environmental dynamics [5]
Group 3: Ecosystem Development
- The training center has established an ecosystem model that integrates training, application, incubation, and public education, with the aim of creating a national public data service platform for embodied intelligence [6]
- Collaborations with universities and research institutions support entrepreneurship and the development of application scenarios, and also provide high-quality data services [6]
- The facility will host the "First Embodied Intelligence Operational Task Challenge & Startup Camp," fostering innovation through a "competition-incubation" mechanism [6]
Group 4: Future Implications
- The establishment of this training facility marks a new phase of large-scale, standardized development in China's humanoid robot industry [7]
- The training center will broaden robots' skills, enabling them to perform tasks more effectively in settings such as factories, logistics parks, and elderly care institutions [7]
- As more robots "graduate" from the facility, they are expected to become more common in everyday settings, easing the integration of intelligent robots into industries and households [7]
Shanghai Jiao Tong University's Lu Cewu: How to Solve Robot Generalization and Robustness
Core Insights
- The article focuses on advances in robotics, particularly embodied intelligence, and on the challenges and opportunities facing the industry [1][7]
Group 1: Robotics Development
- The key challenges in developing robotic intelligence lie less in chip computing power than in iterating on embodied model architectures and data loops [1][2]
- The "digital gene" framework proposed by the company aims to improve robots' understanding and execution, so that they can interpret and act on instructions more effectively [3][4]
- The company has demonstrated significant progress in robotic applications, such as a robot serving in an ice cream shop, which shows it can perform complex tasks autonomously [6]
Group 2: Market Dynamics
- Investment in the robotics industry is surging, with many companies seeking funding to demonstrate their commercial potential [7][8]
- Despite the increased interest, Chinese embodied-intelligence startups raise far less than their American counterparts, with a reported gap of nearly 12 times in private AI investment [7]
- The industry competes on both talent and funding, which makes it hard for startups to set a technology strategy and validate it under financial constraints [8]
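Lingbao Robotics Team Keeps Breaking New Ground in Embodied Intelligence, Making Robots "Clever and Dexterous" (Sci-Tech Viewpoint · Frontline Innovation)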
Ren Min Ri Bao· 2025-07-27 22:23
Group 1
- The core message emphasizes the importance of technological innovation in advancing China's modernization and global competitiveness [1]
- The article introduces a series of reports titled "Frontline Innovation," which focuses on the experiences and observations of researchers working in scientific innovation [1]
Group 2
- Lingbao Robotics, founded in 2023, develops general-purpose humanoid robots and embodied intelligence products, with a focus on practical applications [3][4]
- The company uses a vision-language-action (VLA) model that lets robots learn skills through imitation, which significantly improves the efficiency of skill acquisition [4][5]
- Lingbao's robots can perform precise tasks, such as assembling computer components with a precision of 0.3 mm [3][4]
Group 3
- Lingbao Robotics is working on flexible automation for the shoe manufacturing industry, addressing the high costs and low adaptability of traditional production lines [6][7]
- The company has developed a system that lets robots learn to perform tasks in dynamic environments, reducing training time to about one hour [7]
- Lingbao's humanoid robot, CASBOT 01, features a bionic hand capable of executing complex tasks, combining embodied intelligence with precise manipulation [8]
Group 4
- Embodied intelligence is advancing rapidly in China, with a growing variety of tactile sensors and related technologies being integrated into the industry [9]
- Lingbao Robotics emphasizes collaboration between academia and industry: it applies the latest research findings to product development and also contributes to academic research [9]
Learning End-to-End Large Models, but Still Unclear on the Difference Between VLM and VLA...
Heart of Autonomous Driving (自动驾驶之心) · 2025-06-19 11:54
Core Insights
- The article emphasizes the growing importance of large vision-language models (VLM) in intelligent driving, highlighting their potential for practical application and production deployment [2][4]
Group 1: VLM and VLA
- A VLM (Vision-Language Model) focuses on foundational capabilities such as detection, question answering, spatial understanding, and reasoning [4]
- A VLA (Vision-Language-Action) model is more action-oriented. In autonomous driving it targets trajectory prediction, which requires a deep understanding of human-like reasoning and perception [4]
- The article recommends learning VLM first and then expanding to VLA, because a VLM can predict trajectories through diffusion models, which improves its action capabilities in uncertain environments [4]
Group 2: Community and Resources
- The article invites readers to join a knowledge-sharing community that offers video courses, hardware, and coding materials related to autonomous driving [4]
- The community aims to build a network of professionals in intelligent driving and embodied intelligence, with a target of 10,000 members within three years [4]
Group 3: Technical Directions
- The article outlines four cutting-edge technical directions in the industry: vision-language models, world models, diffusion models, and end-to-end autonomous driving [5]
- It links to resources and papers covering advances in these areas [6][31]
Group 4: Datasets and Applications
- The article lists datasets that are important for training and evaluating autonomous driving models, covering pedestrian detection, object tracking, and scene understanding [19][20]
- It discusses language-enhanced systems in autonomous driving, showing how natural language processing can improve vehicle navigation and interaction [20][21]
Group 5: Future Trends
- The article highlights the potential of large models to significantly shape the future of autonomous driving, particularly in decision-making and control systems [24][25]
- It suggests that integrating language models with driving systems could lead to more intuitive, human-like vehicle behavior [24][25]
Embodied Intelligence: A Scientific Expedition That Requires Humility and Patience
Robot Hunting Ground Memo (Robot猎场备忘录) · 2025-05-20 05:01
Core Viewpoints
- Embodied intelligence is injecting new research vitality into robotics and has the potential to break through performance limits [1]
- The development of embodied intelligence relies on breakthroughs in specific scientific problems and should not dismiss the contributions of traditional robotics [2]
- General intelligence cannot exist without a focus on specific tasks, because expertise in particular areas drives advances in broader capabilities [3]
Group 1: Interdisciplinary Collaboration
- Embodied intelligence is a cross-disciplinary product that requires collaboration with fields such as materials science, biomechanics, and design aesthetics [2]
- Breakthroughs often occur at the intersection of disciplines, which is why diverse scientific contributions matter [2]
Group 2: Technology Evolution
- Technological evolution is not a complete replacement of old systems. It is a process of accumulation in which foundational technologies continue to support new advances [5]
- The current trend of vision-language-action models may soon be replaced by more efficient alternatives, so continuous innovation is needed [5]
Group 3: Realistic Expectations for AGI
- Viewing embodied intelligence as the sole path to artificial general intelligence (AGI) is a dangerous oversimplification. AGI development depends on many conditions and on interdisciplinary knowledge [6]
- The complexity of embodied systems calls for collaboration across fields rather than reliance on a few "genius" individuals [6]
Group 4: Current State of Embodied Intelligence
- Embodied intelligence is still in its early stages, with significant challenges remaining in hardware and algorithm development [7]
- Current humanoid robots are not yet fully autonomous and often require human intervention, which shows that the technology is still evolving [7]
Group 5: VLA Technology Pathway
- Vision-language-action (VLA) models may not be the most efficient approach, because in learning processes manipulation skills often precede language capabilities [9]
- Many current VLA models are resource-intensive and may be replaced by more efficient solutions in the future [9]
Group 6: Balancing Short-term and Long-term Goals
- Combining learning-based and model-based approaches is seen as more practical in the short term, while pure learning methods may represent the long-term future of robotics [10]
- Successful robotic solutions in industry often rely on model-based methods because of their stability and reliability [10]
Group 7: Humanoid Robots and Practicality
- The design of humanoid robots is driven by emotional projection and environmental adaptability, but specialized non-humanoid forms may be more efficient in many applications [11]
- There is concern about over-investment in humanoid robots at the expense of practical, economically viable solutions [11]
Group 8: Building Technical Barriers
- True competitive advantage in technology comes from extensive practical experience and meticulous attention to detail, not from innovative algorithms alone [12]
- Long-term technical barriers are built through consistent effort and iterative improvement of engineering practice [12]
Group 9: Vision and Practicality
- Scientific research requires both grand visions and grounded practice, and embodied intelligence combines idealistic aspirations with real-world challenges [13]
- Foundational theories such as control theory remain critical to the safety and functionality of robotic systems [13]