Outperforming GPT and Google: Beijing Humanoid Robot Innovation Center Open-Sources the World's Strongest Embodied VLM
具身智能之心 · 2025-11-17 00:47

Core Viewpoint
- The article covers the launch of Pelican-VL 1.0, an embodied vision-language model (VLM) that is claimed to outperform GPT-5 and Google's Gemini series, and frames it as a showcase of China's strength in embodied intelligence [1][3].

Group 1: Overview of Pelican-VL
- Pelican-VL is described as a "visual-language brain" for robots, and its open-source release is expected to significantly accelerate progress in embodied intelligence [3].
- The core team behind Pelican-VL is composed entirely of women, underscoring the contribution of female talent to China's technology R&D [7].

Group 2: Innovation Center and Team
- The Beijing Humanoid Robot Innovation Center, established in November 2023, is China's first provincial-level humanoid robot innovation center, jointly founded by companies including Xiaomi Robotics and UBTECH [5].
- The center's earlier results include the "Tian Gong" series, billed as the world's first full-size electrically driven humanoid robot, capable of running at 12 km/h and adapting to a variety of complex terrains [5].

Group 3: Core Technology - DPPO
- Pelican-VL's performance breakthrough is attributed to its DPPO (Deliberate Practice Policy Optimization) training paradigm, which the team presents as a world first and which reaches better performance with far less data [8][9].
- Whereas comparable models are reported to need 1 million to 5 million training samples, Pelican-VL needs only about 200,000, which the article describes as 1/10 to 1/50 of the data volume of similar models [8][9].

Group 4: Training Methodology
- DPPO mimics human deliberate practice: a closed loop of observation, practice, error correction, and improvement [9].
- Each training round has two phases: reinforcement-learning exploration to surface weaknesses, followed by targeted supervised fine-tuning on exactly those weaknesses [12] (a conceptual sketch of this loop appears at the end of this summary).

Group 5: Training Compute and Model Versions
- Training ran on a dedicated cluster of more than 1,000 A800 GPUs; a complete model checkpoint takes over 50,000 A800 GPU-hours to train [15].
- The model ships in two versions: a lightweight 7B-parameter model for local deployment and a 72B-parameter model for complex, cloud-side tasks, trading deployment flexibility against peak capability [23] (see the deployment sketch at the end of this summary).

Group 6: Data Quality and Performance Metrics
- The training data was curated from 12 domains into a high-quality dataset containing millions of tokens, deliberately including numerous "failure cases" so the model can learn from errors [24] (see the curation sketch at the end of this summary).
- On benchmarks spanning visual understanding and action planning, Pelican-VL is reported to outperform GPT-5 by 15.79% and Google Gemini by 19.25% [25].

Group 7: VLA System Integration
- Pelican-VL serves as the "brain" of a Vision-Language-Action (VLA) system, coordinating the visual, language, and action modules needed to execute complex tasks [29][30].
- This integration lets Pelican-VL understand highly abstract, composite instructions and carry them out in real-world scenarios [30]; a minimal orchestration sketch follows.
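The article does not publish Pelican-VL's internal interfaces, so the following is only a minimal Python sketch of how a VLM "brain" can coordinate a VLA loop: the model decomposes an abstract instruction into atomic steps, and a low-level controller executes them. Every name here (VLABrain, Step, controller.execute) is a hypothetical placeholder, not Pelican-VL's actual API.

```python
# Illustrative VLA orchestration loop; all interfaces are assumptions.
from dataclasses import dataclass


@dataclass
class Step:
    """One atomic action decoded from the VLM's plan, e.g. 'grasp cup'."""
    skill: str
    target: str


class VLABrain:
    def __init__(self, vlm):
        self.vlm = vlm  # any callable: (image, prompt text) -> plan text

    def plan(self, image, instruction: str) -> list[Step]:
        # The VLM turns a raw image plus an abstract instruction
        # ("tidy the desk") into an ordered list of concrete sub-steps.
        raw = self.vlm(image, f"Decompose into atomic steps: {instruction}")
        steps = []
        for line in raw.splitlines():
            skill, _, target = line.partition(" ")
            if skill:
                steps.append(Step(skill=skill, target=target))
        return steps

    def act(self, steps: list[Step], controller) -> None:
        # The action module (low-level controller) executes each sub-step;
        # in a real system, perception feedback would trigger re-planning.
        for step in steps:
            controller.execute(step.skill, step.target)
```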
Group 8: Open Source Impact
- The open-source release is expected to lower the barrier to adopting embodied intelligence, letting small and medium-sized enterprises build intelligent robots without large upfront investment [34].
- It also encourages development of a complete industrial chain, fostering a rich application ecosystem around Pelican-VL and expanding the boundaries of embodied intelligence applications [34].
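Below is a conceptual sketch of the DPPO-style deliberate-practice loop described in Groups 3 and 4. The article gives only the high-level recipe (RL exploration, weakness identification, targeted supervised fine-tuning), so the functions below, including the model.attempt and model.finetune interfaces and the weakness-mining heuristic, are illustrative assumptions rather than the published algorithm.

```python
# Conceptual deliberate-practice round: explore, diagnose, fine-tune on gaps.


def rl_explore(model, tasks, n_rollouts=8):
    """Roll the current policy out on each task and record failures."""
    failures = []
    for task in tasks:
        for _ in range(n_rollouts):
            ok, trace = model.attempt(task)  # assumed interface
            if not ok:
                failures.append((task, trace))
    return failures


def mine_weaknesses(failures, top_k=100):
    """Rank tasks by failure count and keep the most error-prone ones."""
    counts = {}
    for task, _ in failures:
        counts[task] = counts.get(task, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top_k]


def dppo_round(model, tasks, label_fn):
    # Phase 1: RL-style exploration surfaces what the model gets wrong.
    failures = rl_explore(model, tasks)
    weak_tasks = mine_weaknesses(failures)
    # Phase 2: targeted SFT on corrected demonstrations of the weak tasks;
    # label_fn is an assumed oracle that supplies gold answers.
    sft_data = [(task, label_fn(task)) for task in weak_tasks]
    model.finetune(sft_data)  # assumed interface
    return weak_tasks
```

Focusing the supervised phase only on mined weaknesses, rather than the whole corpus, is one plausible mechanism behind the data-efficiency claim in Group 3.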
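A minimal sketch of the two-tier deployment from Group 5, assuming Hugging Face transformers-style loading. The checkpoint IDs pelican-vl-7b and pelican-vl-72b are placeholders, and the actual release may require a model class other than AutoModelForCausalLM.

```python
# Two-tier deployment sketch: a small edge model and a large cloud model.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

EDGE_ID = "pelican-vl-7b"    # placeholder: lightweight, fits on a local GPU
CLOUD_ID = "pelican-vl-72b"  # placeholder: served remotely for hard tasks


def load_model(deploy_target: str):
    """Pick the checkpoint by deployment target."""
    model_id = EDGE_ID if deploy_target == "edge" else CLOUD_ID
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half precision to cut memory use
        device_map="auto",           # shard across available devices
    )
    return processor, model
```

The edge/cloud split is a common pattern: route routine perception to the local 7B model and escalate long-horizon planning to the 72B model.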
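Finally, a toy sketch of the curation policy described in Group 6: filter per-domain data for quality but deliberately retain a share of failure cases, since those are what the deliberate-practice loop learns from. The record schema (quality, success, domain fields) and the thresholds are invented for illustration.

```python
# Toy multi-domain curation keeping informative failure cases.


def curate(samples, min_quality=0.8, failure_ratio=0.2):
    """Keep high-quality samples per domain, plus a share of failures."""
    by_domain = {}
    for s in samples:
        if s["quality"] >= min_quality:  # drop low-quality samples outright
            by_domain.setdefault(s["domain"], []).append(s)
    curated = []
    for domain, items in by_domain.items():  # the article cites 12 domains
        successes = [s for s in items if s["success"]]
        failures = [s for s in items if not s["success"]]
        # Failure cases are kept on purpose: they are the raw material the
        # error-correction phase of training fine-tunes on.
        curated += successes + failures[: int(len(successes) * failure_ratio)]
    return curated
```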