Core Insights
- Yushu Technology (Unitree) announced the launch of the open-source model UnifoLM-VLA-0, designed for general humanoid robot operations [1][3]

Group 1: Model Overview
- UnifoLM-VLA-0 is a vision-language-action (VLA) large model in the UnifoLM series, aimed at overcoming the limitations of traditional vision-language models (VLMs) in physical interaction [3]
- Through continued pre-training on robot operation data, the model evolves from general "text-image understanding" into an "embodied brain" with physical common sense [3]

Group 2: Performance and Capabilities
- The model demonstrates significantly enhanced spatial reasoning and reliable multimodal perception across a variety of task scenarios [3]
- Continued pre-training integrates text instructions with 2D/3D spatial details, strengthening the model's spatial perception, particularly for manipulation tasks that demand strong instruction understanding and spatial awareness [3]
- The model has been validated on real robots, completing 12 complex manipulation tasks to a high standard with a single policy [3]
Unitree announces open-sourcing of the VLA large model UnifoLM-VLA-0
Huanqiu Wang News · 2026-01-30 03:29