VLA Models

Wang Xingxing on Models
36Kr · 2025-08-12 08:05
Core Insights
- The founder of Yushu Technology, Wang Xingxing, challenges the perception that the company focuses solely on robot hardware, emphasizing the importance of models, algorithms, and data in robotics [1][2]
- Wang expresses skepticism toward the current VLA (Vision-Language-Action) approach, arguing that existing data quality and quantity are insufficient for effective real-world interaction [1][2]
- Yushu is exploring video-driven models for robotics, which Wang believes may develop faster and have a higher probability of convergence than the VLA approach [3]

Group 1: Model and Algorithm Focus
- Yushu's model team is large relative to the company's size, but still smaller than those of major AI companies, indicating a cautious yet significant investment in model development [2]
- Wang believes that headcount in model development does not directly correlate with the quality of outcomes, suggesting that smaller teams can also innovate effectively [2]
- The company is not dismissing the VLA model entirely but is cautious about over-relying on data accumulation for training [2]

Group 2: Robotics Application and Future Vision
- Current public perception may suggest that Yushu's robots are primarily for entertainment, but internally the focus is on developing robots capable of practical tasks [5][6]
- Wang argues that practical deployment of robots in factories and homes is currently unrealistic, and that performance demonstrations are more feasible for now [6]
- The vision for future robotics centers on multifunctional capabilities rather than single-task operation, with a potential timeline of 2-5 years for a "ChatGPT moment" in robotics [7][8]

Group 3: Computational Needs
- Wang anticipates the need for low-cost, large-scale, distributed computing clusters in robotics to address computational challenges [4]
- He suggests that factories running multiple robots could benefit from on-site distributed server clusters that reduce communication latency [4]
Yushu Technology's Wang Xingxing: Advancing a Compliant, Steady IPO Process; VLA Is a Relatively Simplistic Architecture
Robot猎场备忘录· 2025-08-12 00:03
Core Viewpoints
- The humanoid robot industry is at a stage where the technology is not yet mature enough for large-scale, complex tasks, but annual shipments of humanoid robots are expected to double, with potential breakthroughs leading to significant increases in output in the next 2-3 years [4][5][6]
- Competition in the humanoid robot sector extends beyond products and markets to include founder interviews and public speaking engagements [4]
- The race to complete an IPO is critical for companies like Yushu Technology and Zhiyuan Robotics, as being first to go public can provide substantial funding support [5][6]

Industry Insights
- Hardware for humanoid robots is currently adequate but requires further improvement for larger scale, lower cost, and higher reliability [7]
- The biggest challenge in the humanoid robot sector is the AI model rather than data, and better model architectures are needed to enhance performance [7]
- The commercial viability of humanoid robots is questioned, as many companies focus on entertainment rather than practical applications [10][11]

Company Strategies
- Yushu Technology focuses on educational and research applications, while Zhiyuan Robotics and others emphasize strong AI capabilities [10][11]
- Yushu's commercial logic involves leveraging impressive robotic performances and low pricing to secure orders quickly, but sustainability remains a concern [10][15]
- Software-focused companies often announce high revenue figures but lack transparency regarding order numbers and actual product deliveries [11]

Market Dynamics
- The humanoid robot market is divided between "hardware-focused" companies like Yushu Technology and "software-focused" companies like Zhiyuan Robotics, leading to different commercialization strategies [10][12]
- Many humanoid robot startups are struggling with effective commercialization and face challenges in scaling production and real-world application [12][15]
- The industry is shifting toward self-developed foundational models, with leading startups like Figure AI taking the lead [13]
One Suite for All of VLA R&D! Tencent-Backed Humanoid Robot Startup Lands Another Major Technical Breakthrough, Pushing Open the Door to General-Purpose Robots!
Robot猎场备忘录· 2025-08-08 09:33
Core Viewpoint
- The article highlights significant technological advances by the humanoid robot startup Stardust Intelligence, particularly its self-developed AI system DuoCore and the launch of the first full-body mobile manipulation model, DuoCore-WB, which enhances the practical application of humanoid robots in real-world scenarios [2][3]

Group 1: Technological Breakthroughs
- Stardust Intelligence's DuoCore system has received a major update, giving robots a dual intelligence mode that combines instinctive responses with deep thinking and enables intelligent planning and operation in complex environments [3]
- The DuoCore system employs a highly anthropomorphic knowledge transfer mechanism, improving learning efficiency and allowing skills to transfer across scenarios without starting from scratch [4]
- The DuoCore-WB model uses a simplified imitation learning framework, allowing robots to learn complex tasks from a small number of high-quality demonstrations and achieving an average task success rate of 80% on challenging household tasks [16][24]

Group 2: Product Overview
- The Astribot Suite is a comprehensive robot learning kit comprising a high-performance robot platform (Astribot S1), an intuitive teleoperation scheme, and an efficient whole-body manipulation policy [8]
- The Astribot S1 robot is designed for general tasks and features a unique rope-driven design that mimics human muscle tissue, allowing flexible and precise movements [11]
- The S1 offers 7 degrees of freedom per arm, a maximum speed exceeding 10 m/s, and a load capacity of 10 kg, exceeding typical adult male capabilities [13]

Group 3: Market Position and Future Prospects
- Stardust Intelligence aims to become a leading provider of AI robot assistants, with a vision of enabling billions of people to have AI robot assistants and a focus on human-machine coexistence and collaboration [25]
- The company has completed five rounds of financing, with the latest round raising several hundred million yuan, indicating strong investor confidence, particularly from major tech firms [31][32]
- The company is actively pursuing commercialization, having announced pre-sales of the Astribot S1 and collaborations with leading universities and enterprises on practical applications [33]
Success Rate Up 57%: The Latest in VLA+RL! CO-RFT: Efficient Fine-Tuning of VLA Models (Beihang, Tsinghua, et al.)
具身智能之心· 2025-08-07 00:03
Core Insights
- The article presents Chunked RL, a new reinforcement learning framework designed specifically for fine-tuning Vision-Language-Action (VLA) models, which show great potential for real-world robotic control [4][8]
- The proposed CO-RFT algorithm delivers significant improvements over traditional supervised fine-tuning, achieving a 57% higher success rate and a 22.3% reduction in cycle time in real-world environments [4][29]

Section Summaries

Introduction
- VLA models integrate perception and language understanding for embodied control, showing promise for developing general policies for real-world robotic control [6]
- The main challenge in fine-tuning VLA models is the dependence on the quality and quantity of task-specific data, which limits generalization to out-of-distribution (OOD) scenarios [6][7]

Methodology
- Chunked RL is a novel reinforcement learning framework that incorporates action chunking to improve sample efficiency and stability, making it particularly suited to VLA models [8][12]
- The CO-RFT algorithm has two phases: imitation learning to initialize the backbone network and policy, followed by offline RL with action chunking to optimize the pre-trained policy (see the sketch after this summary) [16][18]

Experimental Analysis
- Experiments were conducted on a robotic platform with six dexterous manipulation tasks, evaluating CO-RFT against traditional methods [20][23]
- Results show that CO-RFT significantly outperforms supervised fine-tuning (SFT), with a 57% higher success rate and a 22.3% lower average cycle time across tasks [29][30]

Position Generalization
- CO-RFT exhibits strong position generalization, achieving a 44.3% success rate at previously unseen locations and outperforming SFT by 38% in OOD scenarios [4][29]

Importance of Data Diversity
- Data diversity is crucial to CO-RFT's performance: models trained on diverse datasets generalize significantly better than those trained on fixed datasets [32][33]
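To make the two-phase recipe concrete, here is a minimal sketch of imitation-learning initialization followed by offline TD learning over action chunks. The network shapes, the simple MLPs, and all names (ChunkPolicy, ChunkCritic, bc_loss, td_loss) are illustrative assumptions, not the CO-RFT implementation; the point is only that the critic scores whole chunks, so bootstrapping happens once per chunk rather than once per low-level step.

```python
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, CHUNK = 32, 7, 8  # assumed dims; CHUNK = actions per chunk

class ChunkPolicy(nn.Module):
    """Maps a state to a whole chunk of future actions (action chunking)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM * CHUNK),
        )

    def forward(self, s):
        return self.net(s).view(-1, CHUNK, ACT_DIM)

class ChunkCritic(nn.Module):
    """Q(s, chunk): scores an entire action chunk, so the TD backup happens
    once per chunk instead of once per low-level control step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACT_DIM * CHUNK, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, s, chunk):
        return self.net(torch.cat([s, chunk.flatten(1)], dim=-1))

policy, critic = ChunkPolicy(), ChunkCritic()
opt = torch.optim.Adam(list(policy.parameters()) + list(critic.parameters()), lr=3e-4)

def bc_loss(s, demo_chunk):
    # Phase 1: imitation learning initializes the policy on demonstration chunks.
    return ((policy(s) - demo_chunk) ** 2).mean()

def td_loss(s, chunk, reward_sum, s_next, gamma=0.99):
    # Phase 2: offline RL. reward_sum is the discounted reward accumulated over
    # the chunk's steps; bootstrapping uses the policy's proposed next chunk.
    with torch.no_grad():
        target = reward_sum + (gamma ** CHUNK) * critic(s_next, policy(s_next))
    return ((critic(s, chunk) - target) ** 2).mean()

# One illustrative update on random stand-in data:
s, demo = torch.randn(16, STATE_DIM), torch.randn(16, CHUNK, ACT_DIM)
loss = bc_loss(s, demo) + td_loss(s, demo, torch.randn(16, 1), torch.randn(16, STATE_DIM))
opt.zero_grad(); loss.backward(); opt.step()
```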
VLA-OS: Shao Lin's Team at NUS Probes the Secrets of Task Reasoning in Robot VLA Models
具身智能之心· 2025-08-01 16:02
Core Viewpoint
- The article discusses a research study by a team from the National University of Singapore on the VLA-OS framework, which systematically analyzes and dissects task planning and reasoning in Vision-Language-Action (VLA) models, aiming to inform the next generation of general-purpose robotic VLA models [2][4]

Group 1: VLA-OS Overview
- VLA-OS is a structured framework that includes a clean codebase, multimodal task planning datasets, and standardized training pipelines for VLA models [4][5]
- The framework aims to unify the various VLA paradigms and facilitate controlled experiments that identify effective task planning representations and paradigms [19][20]

Group 2: VLA Model Paradigms
- The article outlines two main approaches for integrating task reasoning into VLA models: Integrated-VLA, which combines task planning and policy learning in one model, and Hierarchical-VLA, which separates these functions into different models (sketched below) [10][12]
- Current VLA models vary considerably in architecture, training method, and task planning representation, which complicates performance comparisons [13][15]

Group 3: Experimental Findings
- The research distills 14 key findings from over 100 experiments, highlighting the advantages of visual planning representations over language-based ones and the superior performance of Hierarchical-VLA compared to Integrated-VLA [34][35]
- Findings indicate that Integrated-VLA benefits from implicit task planning, while Hierarchical-VLA demonstrates better generalization [51][52]

Group 4: Recommendations for Future Research
- The article suggests prioritizing visual representation planning and goal-image planning, with language planning as a supplementary approach [68]
- It emphasizes the importance of task planning pre-training and the need for efficient training mechanisms that avoid gradient conflicts between planning and action outputs [73]
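The following toy sketch illustrates the data-flow difference between the two paradigms. The stub types and callables are assumptions for illustration only, not the VLA-OS codebase: Integrated-VLA emits planning tokens and actions from one forward pass, while Hierarchical-VLA feeds a separate low-level policy with the planner's output.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    text: str          # language plan, e.g. "grasp the cup, then move to the sink"
    goal_image: bytes  # visual plan representation (favored by the study's findings)

def integrated_vla(obs, instruction, model):
    # One network with joint heads: planning tokens act as an auxiliary
    # objective alongside the action output of the same forward pass.
    plan, action = model(obs, instruction)
    return action

def hierarchical_vla(obs, instruction, planner, policy):
    # The planner reasons at low frequency; a separate policy conditions on
    # its output and runs at control frequency.
    plan = planner(obs, instruction)
    return policy(obs, plan)

# Stub callables stand in for real networks, just to show the interfaces.
toy_plan = Plan(text="move to the cup", goal_image=b"")
toy_model = lambda obs, ins: (toy_plan, [0.0] * 7)
toy_planner = lambda obs, ins: toy_plan
toy_policy = lambda obs, plan: [0.0] * 7
print(integrated_vla(None, "pick up the cup", toy_model))
print(hierarchical_vla(None, "pick up the cup", toy_planner, toy_policy))
```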
VLA + Reinforcement Learning Will Give Rise to More Powerful Systems!
具身智能之心· 2025-07-31 00:04
Core Viewpoint
- The article discusses advances in robotic models, focusing on the development of RT-2 and RT-X, which extend robots' ability to execute tasks through visual language models and diverse datasets [5][10][11]

Group 1: RT-2 and Its Capabilities
- RT-2 is introduced as a foundational robot model that can answer visual questions and execute tasks from language instructions, showcasing the potential of remotely accessible robot models [5][7]
- By casting robot control as question answering, the model can carry out a range of basic language instructions effectively (see the tokenization sketch after this summary) [7][8]

Group 2: RT-X Dataset and Its Impact
- The RT-X dataset, assembled by DeepMind, comprises data from 34 research labs and 22 robot types, providing a diverse training corpus for robotic models [10]
- Models trained on RT-X outperform specialized models by approximately 50% across tasks, indicating the advantages of cross-embodiment models [11]

Group 3: Evolution of VLA Models
- The first-generation VLA model, RT-2, is noted for its simplicity, while second-generation models use continuous action distributions for improved performance on complex tasks [14][15]
- Second-generation VLA models incorporate specialized mechanisms for generating continuous actions, enhancing their control capabilities [17][18]

Group 4: π0 and π0.5 Models
- The π0 model, based on a 3-billion-parameter language model, is designed to handle a variety of tasks, including folding clothes, demonstrating its adaptability across environments [18][23]
- The newer π0.5 model targets long-horizon tasks in new environments, integrating high-level reasoning to manage complex instructions [28][30]

Group 5: Future Directions and Reinforcement Learning
- Future VLA models are expected to integrate reinforcement learning techniques to enhance robustness and performance, moving beyond imitation learning [34][39]
- The combination of VLA and DLA (Deep Learning Architecture) is proposed to create a more effective system, leveraging expert data to improve generalist capabilities [44][46]
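A minimal sketch of the first-generation "control as text" idea described above: each action dimension is quantized into integer bins that the language model emits as tokens, and decoding inverts the quantization. The bin count and normalized action range here are assumptions; RT-2's actual vocabulary and binning details may differ.

```python
import numpy as np

N_BINS = 256           # assumed vocabulary of action bins
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list[int]:
    """Quantize a continuous action vector into token ids, one per dimension."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return bins.astype(int).tolist()

def tokens_to_action(tokens: list[int]) -> np.ndarray:
    """Invert the quantization when decoding the tokens the model generates."""
    return np.array(tokens) / (N_BINS - 1) * (HIGH - LOW) + LOW

# Round trip is exact up to quantization error of half a bin width.
a = np.array([0.12, -0.53, 0.90])
assert np.allclose(tokens_to_action(action_to_tokens(a)), a, atol=0.01)
```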
PI Co-Founder and Robotics Legend Explains in Detail How VLA + Reinforcement Learning Yields More Powerful Systems
具身智能之心· 2025-07-30 06:03
Core Viewpoint
- The article discusses advances in robotic models, focusing on the development of RT-2 and RT-X, which improve robots' ability to execute complex tasks through better datasets and model architectures [6][12][44]

Group 1: RT-2 and RT-X Models
- RT-2 is introduced as a foundational robot model that uses a visual language model to process image-based commands and execute tasks [8][10]
- The RT-X dataset, assembled by DeepMind, comprises data from 34 research labs and 22 robot types, showcasing a diverse range of robotic capabilities [13][26]
- Cross-embodiment models trained on RT-X outperform specialized models by approximately 50% across tasks, indicating the advantages of generalization in robot learning [13][29]

Group 2: Evolution of VLA Models
- First-generation VLA models like RT-2 rely on simple question-answer structures for robot control, while the second generation adopts continuous action distributions for better performance [16][19]
- Second-generation VLA models such as π0 pair a large language model with an action expert module to handle complex tasks, generating action sequences over time [22][24]
- The π0.5 model is designed for long-horizon tasks, integrating high-level reasoning to execute complex instructions in new environments [36][40]

Group 3: Integration of Reinforcement Learning
- Future VLA models are expected to incorporate reinforcement learning techniques to enhance robustness and performance, moving beyond imitation learning (one common recipe is sketched below) [44][49]
- Integrating reinforcement learning with VLA aims to create a more effective training process, allowing robots to learn from both expert data and real-world interaction [56][60]
- Current research focuses on developing stable, effective end-to-end training processes that leverage reinforcement learning to improve VLA capabilities [60]
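As a concrete example of learning from both expert data and interaction, here is a hedged sketch of advantage-weighted regression, one standard recipe for combining imitation with reinforcement signals. It is not the specific method from the talk; the small stand-in networks and all names are assumptions.

```python
import torch
import torch.nn as nn

# Small stand-in networks; a real system would use the VLA backbone instead.
policy = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 7))
value = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def awr_update(states, actions, returns, beta=1.0):
    # Advantage = observed return minus the value baseline.
    with torch.no_grad():
        adv = returns - value(states).squeeze(-1)
        weights = torch.exp(adv / beta).clamp(max=20.0)  # cap exploding weights
    # Weighted behavior cloning: actions that beat the baseline count more,
    # so expert data and successful interaction data reinforce each other.
    loss = (weights * ((policy(states) - actions) ** 2).mean(-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(awr_update(torch.randn(64, 32), torch.randn(64, 7), torch.randn(64)))
```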
Domestic Humanoid Robot Hardware and Applications Accelerating Toward Deployment
2025-07-14 00:36
Summary of the Conference Call on the Domestic Humanoid Robot Industry

Industry Overview
- The domestic humanoid robot industry is accelerating deployment, with contracts awarded to companies like Zhiyuan and Yushu totaling 124 million yuan, indicating growing market demand for humanoid robot applications [1][2]
- The humanoid robot supply chain is advancing steadily, with over 80 domestic companies, primarily startups spun out of universities, focusing on application scenarios such as logistics, household chores, inspection, and textiles [1][3][4]

Key Developments
- Zhiyuan and Yushu won a procurement project for humanoid and biped robots from China Mobile Hangzhou, with a total contract value of 124 million yuan, highlighting the rapid deployment of robots in the domestic market [2]
- The standard version of the Tiangong Walker is priced at approximately 300,000 yuan, with production and orders expected to exceed 1,000 units in 2025 [2]

Application Scenarios
- The application of humanoid robots in inspection, logistics, and textiles is promising, with robots able to replace human labor in high-risk tasks such as high-altitude inspections, improving safety [3][10][11]
- In logistics, humanoid robots are expected to work alongside unmanned logistics vehicles to automate factories, enhancing efficiency and reducing human error [12][14]

Company Highlights
- UBTECH showcased the Walker S2, which features a swappable battery, and has begun taking small-scale industrial orders, indicating strong market acceptance [5]
- Yushu demonstrated advanced motion control capabilities, including climbing and dancing, with its products achieving world-leading standards [6]
- Zhiyuan introduced multiple commercial products and is actively collecting data to iterate on its technology, planning to gather 500,000 data points weekly to support full-scale deployment [7]

Competitive Landscape
- Domestic companies are making significant progress in VLA model development, establishing a shared data layer and collaborating with partners to build resource platforms [8]
- The domestic humanoid robot supply chain is outperforming international competitors in application depth and capital expenditure, with a focus on practical applications [9]

Future Prospects
- The outlook for humanoid robots in the textile industry is promising, as they can replace manual operations in labor-intensive tasks, with technological advances enabling better handling of flexible materials [16]
- The overall humanoid robot market is expected to grow, with expanding applications across sectors including logistics and inspection as companies continue to innovate and improve their products [10][17]
Latest from EmbodyX! VOTE: A General Framework for Ensemble Voting and Optimization to Accelerate VLA Models, with a 35x Throughput Speedup!
具身智能之心· 2025-07-13 09:48
Core Insights
- The article discusses the limitations of existing VLA models in generalizing to new objects and unfamiliar environments, motivating the development of a more efficient action prediction method called VOTE [4][6][9]

Group 1: Background and Motivation
- Building a universal robotic policy that can handle diverse tasks and real-world interactions has been a core focus of robotics research [6]
- VLA models perform well in familiar environments but struggle to generalize to unseen scenarios, prompting the exploration of methods that improve robustness [7][8]

Group 2: VOTE Methodology
- VOTE is a lightweight VLA model that optimizes trajectories with an ensemble voting strategy, significantly improving inference speed and reducing computational cost (a sketch of the voting idea follows this summary) [9][14]
- The model eliminates the need for additional visual modules and diffusion techniques, relying solely on the VLM backbone and introducing a special token <ACT> to streamline action prediction [9][18]
- The action sampling technique uses an ensemble voting mechanism that aggregates predictions from previous steps, improving stability and robustness [22][23]

Group 3: Performance and Evaluation
- Experimental results show that VOTE achieves state-of-the-art performance, with a 20% higher average success rate on the LIBERO task suite and a 3% improvement over CogACT on the SimplerEnv WidowX robot [9][28]
- The model achieves a 35-fold throughput increase on edge devices such as the NVIDIA Jetson Orin, demonstrating its suitability for real-time applications [9][31]
- VOTE outperforms existing models, reaching a throughput of 42 Hz on edge platforms while maintaining minimal memory overhead [31][32]
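The voting step can be pictured as follows. Because each inference call emits a chunk of future actions, several recent chunks overlap any given timestep; aggregating the overlapping predictions yields the executed action. This is a minimal sketch under stated assumptions, not the authors' implementation: the mean stands in for whatever voting rule the paper uses, and all names and shapes are illustrative.

```python
from collections import deque
import numpy as np

CHUNK = 8  # actions predicted per inference step (assumed)

class ActionVoter:
    def __init__(self, max_chunks: int = CHUNK):
        self.history = deque(maxlen=max_chunks)  # (start_step, chunk) pairs

    def add_chunk(self, start_step: int, chunk: np.ndarray):
        """chunk has shape (CHUNK, act_dim) and was predicted at start_step."""
        self.history.append((start_step, chunk))

    def action_for(self, t: int) -> np.ndarray:
        # Gather every stored chunk whose horizon covers timestep t, then
        # aggregate; averaging the overlaps is the "vote" in this sketch.
        overlapping = [c[t - s] for s, c in self.history if 0 <= t - s < len(c)]
        return np.mean(overlapping, axis=0)

voter = ActionVoter()
for step in range(3):
    voter.add_chunk(step, np.random.randn(CHUNK, 7))
print(voter.action_for(2).shape)  # (7,), aggregated over 3 overlapping chunks
```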
A New Paradigm for VLA Inference! Consistency Model CEED-VLA Delivers a 4x Speedup!
机器之心· 2025-07-13 04:58
Core Viewpoint
- The article discusses advances in Vision-Language-Action (VLA) models, focusing on the CEED-VLA model, which significantly improves inference speed while maintaining high task success rates in robotic applications [2][8][24]

Group 1: VLA Model Overview
- VLA models have become a key research direction in robotics thanks to their strong multimodal understanding and generalization capabilities [2]
- Despite these advances, VLA models face significant inference-speed bottlenecks, especially in high-frequency, high-precision tasks [2]

Group 2: Proposed Solutions
- A consistency distillation training strategy allows the model to predict multiple correct action tokens simultaneously, increasing decoding speed [4]
- A mixed-label supervision mechanism is designed to mitigate error accumulation during the distillation process [4][9]
- An early-exit decoding strategy addresses the inefficiency of Jacobi decoding, improving average inference efficiency by relaxing the convergence conditions (sketched below) [5][10]

Group 3: Experimental Results
- The proposed methods achieve more than 4x inference acceleration across multiple baseline models while maintaining high task success rates in both simulated and real-world robotic tasks [8][18]
- The CEED-VLA model shows a significant increase in manipulation task success rates, exceeding 70%, owing to the improved inference speed and control frequency [24]
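A toy sketch of Jacobi decoding with an early-exit rule, under stated assumptions: the step_fn stands in for one parallel forward pass of the model, and the convergence test and iteration budget are illustrative, not the paper's exact criteria. All token positions are refined in parallel, and iteration stops as soon as the block stabilizes instead of running to a strict fixed point.

```python
import numpy as np

def jacobi_decode(step_fn, n_tokens, max_iters=16, patience=1):
    """step_fn(tokens) -> refined tokens: one parallel refinement pass."""
    tokens = np.zeros(n_tokens, dtype=int)  # arbitrary initialization
    stable = 0
    for it in range(1, max_iters + 1):
        new_tokens = step_fn(tokens)        # refine every position at once
        stable = stable + 1 if np.array_equal(new_tokens, tokens) else 0
        tokens = new_tokens
        if stable >= patience:              # early exit: the block has settled
            return tokens, it
    return tokens, max_iters                # relaxed: accept the current block

# Toy fixed-point map standing in for a parallel forward pass of the model.
target = np.array([3, 1, 4, 1, 5, 9, 2, 6])
toy_step = lambda toks: np.where(toks == target, toks, (toks + 1) % 10)
decoded, iters = jacobi_decode(toy_step, len(target))
print(decoded, "decoded in", iters, "parallel passes")
```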