Application of Multimodal and Embodied Large Models in Humanoid Robots
SIASUN (SZ:300024) 2025-05-14 15:19

Summary of Key Points from the Conference Call

Industry Overview
- The focus of the humanoid robot industry has shifted toward applying AI and large-model capabilities to meet user demand. Deep integration of hardware and models in everyday scenarios is expected within 3-5 years [1][3]
- The development of humanoid robots can be divided into three stages: an initial focus on core components, then the establishment of a hardware architecture, and finally deep integration of hardware and models for widespread application [3]

Core Insights and Arguments
- AI Agents act as the "brain" of embodied robots. They handle task decision-making and reasoning, and they improve task-execution efficiency through applications tailored to different scenarios [1][8]
- The mainstream framework for the embodied-robot brain has five layers: the physical layer, the training layer, the data layer, the model layer (including LLM, VLM, and VLA), and the application layer [1][9]
- Adding 3D spatial perception is essential for better spatial modeling and perception, which is vital for achieving AGI [1][19]
- Industry predominantly uses a hierarchical embodied large-model architecture so that hardware upgrades do not force the software to be retrained. Academia, by contrast, favors an end-to-end approach [1][17]

Technological Developments
- Google's RT series models significantly advanced VLA model development, although they have not been open-sourced. Open-source models from Stanford and Berkeley have accelerated the industry's growth [1][10][12]
- Figure's Helix architecture, released in February 2025, differs fundamentally from end-to-end VLA models. It uses a layered system that allows hardware to be upgraded cost-effectively [1][14][15]
- The VLA model is currently simple. It is trained on encoded text, visual, and action inputs, similar to Tesla's approach to autonomous driving [1][16]

Challenges and Future Directions
- Current VLA models face several challenges: insufficient data, weak task generalization, and significant performance degradation under lighting changes [1][18]
- Establishing industry standards for humanoid robots is important, because standards will shape market development and safety certification [1][24]
- Future work on intelligent large models will focus on data collection and training to improve the generalization of VLA models, with the potential for unified foundational VLA models [1][27]

Additional Insights
- Competition among end-device manufacturers will hinge on two factors: optimizing foundational large models, and unique advantages in scene-data training that enable better hardware integration [1][2][27]
- The core of the VRM model's interaction capabilities comprises voice recognition, voice output, and expression management. These are crucial for improving robot interaction [1][26]
- Data collection in humanoid robotics is evolving toward human sensory-perception data, which improves design richness and reduces the Sim-to-Real gap [1][23]
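The hierarchical architecture that industry favors can be illustrated with a minimal Python sketch. The sketch assumes a high-level planner that emits hardware-agnostic sub-tasks and a separate, swappable controller for each hardware revision.

All class names (`SubTask`, `RuleBasedPlanner`, `ArmV1Controller`, `ArmV2Controller`, `HierarchicalRobot`) are hypothetical. The toy rule-based planner only stands in for an LLM/VLM "brain". Nothing here reflects the internals of Helix or any other vendor's system.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class SubTask:
    """Hardware-agnostic step emitted by the high-level 'brain' (hypothetical)."""
    skill: str   # e.g. "grasp" or "place"
    target: str  # object or location the skill acts on


class Planner(Protocol):
    def plan(self, instruction: str) -> List[SubTask]: ...


class Controller(Protocol):
    def execute(self, task: SubTask) -> str: ...


class RuleBasedPlanner:
    """Toy stand-in for an LLM/VLM planner: 'move X to Y' -> grasp X, place at Y."""

    def plan(self, instruction: str) -> List[SubTask]:
        words = instruction.split()
        obj, dest = words[1], words[-1]
        return [SubTask("grasp", obj), SubTask("place", dest)]


class ArmV1Controller:
    """Hypothetical low-level policy for one hardware revision."""

    def execute(self, task: SubTask) -> str:
        return f"v1:{task.skill}({task.target})"


class ArmV2Controller:
    """Hypothetical low-level policy for an upgraded hardware revision."""

    def execute(self, task: SubTask) -> str:
        return f"v2:{task.skill}({task.target})"


class HierarchicalRobot:
    """Composes a planner with a controller; only the controller is hardware-specific."""

    def __init__(self, planner: Planner, controller: Controller) -> None:
        self.planner = planner
        self.controller = controller

    def run(self, instruction: str) -> List[str]:
        return [self.controller.execute(t) for t in self.planner.plan(instruction)]


planner = RuleBasedPlanner()  # shared, unchanged, across hardware revisions
print(HierarchicalRobot(planner, ArmV1Controller()).run("move cup to shelf"))
# ['v1:grasp(cup)', 'v1:place(shelf)']
print(HierarchicalRobot(planner, ArmV2Controller()).run("move cup to shelf"))
# ['v2:grasp(cup)', 'v2:place(shelf)']
```

Both robots share the same planner object. Upgrading from `ArmV1Controller` to `ArmV2Controller` therefore touches only the low-level layer, which is the property the call attributes to hierarchical designs. An end-to-end model would instead fold both layers into one set of weights.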
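The point about training on encoded text, visual, and action inputs can be made concrete with one common way of turning continuous robot actions into tokens: uniform binning of each action dimension. Google's RT-2, for example, uses 256 bins per dimension.

This is a minimal sketch that assumes a known per-dimension range `[low, high]`. The function names and the 256-bin default are illustrative. The call does not say which action encoding the models it discusses actually use.

```python
from typing import List, Sequence


def encode_actions(actions: Sequence[float], low: float, high: float,
                   num_bins: int = 256) -> List[int]:
    """Map each continuous action value in [low, high] to a bin index (token id)."""
    width = (high - low) / num_bins
    tokens = []
    for a in actions:
        a = min(max(a, low), high)             # clip out-of-range values
        idx = int((a - low) / width)
        tokens.append(min(idx, num_bins - 1))  # a == high belongs to the last bin
    return tokens


def decode_actions(tokens: Sequence[int], low: float, high: float,
                   num_bins: int = 256) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


tokens = encode_actions([-1.0, 0.0, 1.0], low=-1.0, high=1.0)
print(tokens)  # [0, 128, 255]
print(decode_actions(tokens, low=-1.0, high=1.0))
# [-0.99609375, 0.00390625, 0.99609375]
```

Decoding returns bin centres, so the round-trip error per dimension is at most half a bin width, which is (high - low) / 512 here. A finer grid trades a larger action vocabulary for higher precision.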
