具身智能之心
The world-model dispute between Fei-Fei Li and LeCun
具身智能之心· 2025-11-15 16:03
Core Viewpoint
- The article discusses the competition among three major players in the AI industry, Fei-Fei Li, LeCun, and Google, over the development of world models, highlighting their distinct technological approaches and the implications for artificial general intelligence (AGI) [2][22][39].

Group 1: Fei-Fei Li's Marble
- Fei-Fei Li's company, World Labs, has launched its first commercial world model, Marble, considered to have significant commercial potential because it can generate persistent, downloadable 3D environments [5][21].
- Marble features a native AI world editor called Chisel, allowing users to create and modify worlds with simple prompts, which is particularly useful for VR and game developers [7][9].
- However, some experts argue that Marble resembles a 3D rendering model rather than a true world model, as it focuses on visual representation without incorporating the underlying physical laws needed for robotic training [10][20].

Group 2: LeCun's JEPA
- LeCun's approach to world models, exemplified by JEPA, draws on control theory and cognitive science rather than 3D graphics, focusing on abstract representations that let robots predict how the environment will change (see the sketch after this summary) [22][25].
- JEPA is designed to train robots by capturing essential world states without generating visually appealing images, making it more suitable for robotic training [27][29].
- This model contrasts sharply with Marble, as it prioritizes understanding the structure of the world over visual fidelity [39].

Group 3: Google's Genie 3
- Google DeepMind's Genie 3, launched in August, generates interactive video environments from prompts, showing improvements in long-term consistency and event triggering [31][34].
- Despite these advances, Genie 3 remains fundamentally a video-generation model and lacks the deeper grasp of physical laws that LeCun's JEPA aims for [35][36].
- Genie 3's visual quality and resolution are also limited compared with Marble, which offers high-precision, exportable 3D assets [38].

Group 4: Comparative Analysis
- The three world models, Marble, Genie 3, and JEPA, represent different paradigms: Marble focuses on visual representation, Genie 3 on dynamic video generation, and JEPA on understanding the underlying structure of the world [39].
- This creates a "world model pyramid," where models become increasingly abstract and more aligned with AI's cognitive processes as one moves up the hierarchy [47][48].
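The JEPA idea described in Group 2, predicting abstract world states rather than pixels, can be made concrete with a small sketch. This is an illustrative toy in PyTorch, not Meta's implementation; the network sizes, the EMA target encoder, and all names are assumptions made for the example.

```python
# Minimal sketch of a JEPA-style objective (illustrative only): predict the *latent*
# representation of the next observation, never its pixels.
import copy
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # The target encoder is an EMA copy; gradients never flow through it.
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

    def loss(self, obs, action, next_obs):
        z = self.encoder(obs)                                   # abstract state of "now"
        z_next_pred = self.predictor(torch.cat([z, action], dim=-1))
        with torch.no_grad():
            z_next_tgt = self.target_encoder(next_obs)          # abstract state of "later"
        return nn.functional.mse_loss(z_next_pred, z_next_tgt)  # predict in latent space, not pixels

    @torch.no_grad()
    def update_target(self, tau=0.99):
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(tau).add_((1 - tau) * p)

model = TinyJEPA()
obs, act, nxt = torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64)
print(model.loss(obs, act, nxt).item())
```

The point of the sketch is the contrast with Marble and Genie 3: the loss never touches pixels, so the model is free to discard visual detail and keep only the state needed to predict change.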
Our autonomous driving, embodied intelligence, and large model community has reached 7,500 members!
具身智能之心· 2025-11-15 16:03
Our Knowledge Planet community across all platforms has reached 7,500 members! In a little over two years we have built a fairly solid track record. Many readers may not yet know about our media matrix: the team has incubated four IPs, 自动驾驶之心, 具身智能之心, 大模型之心Tech, and 3D视觉之心, each with its own paid community and private-domain channels. We hope to reach nearly 10,000 members within the next two years, building a hub for exchange and technical sharing that beginners and more advanced learners visit regularly.

[Truncated table of Knowledge Planet topics and links, including: well-known autonomous driving teams at domestic universities (https://t.zsxq.com/hlVJZ); algorithm advancement; planning and control (https://t.zsxg.com/USyyN); BackBone roundup (https://t.zsxq.com/melQb); autonomous driving simulation; ...]
Ultra-large-parameter embodied VLM open-sourced: the DPPO training paradigm, a ceiling for model cost-effectiveness!
具身智能之心· 2025-11-15 16:03
Editor: 机器之心

Recently, a domestically developed open-source embodied VLM has risen to the top of the field, and since 2025 R&D in embodied intelligence appears to have entered an explosive phase. On November 13, the Beijing Humanoid Robot Innovation Center officially open-sourced its embodied VLM, Pelican-VL 1.0. According to the announcement, the model comes in 7B and 72B parameter scales and is billed as "the largest open-source embodied multimodal brain model". Official materials state that its core advantage lies in deeply integrating massive data with an adaptive learning mechanism: it was trained on a cluster of 1,000+ A800 GPUs, with a single checkpoint run consuming more than 50,000 A800 GPU-hours, and the team distilled hundreds of millions of tokens of high-quality metadata from raw data as the training foundation. Performance improves by 20.3% over the baseline and exceeds open-source models of the same class by 10.6%. In testing, its average performance surpasses closed-source models such as GPT-5 and Google Gemini, making it the strongest open-source multimodal model for embodied tasks to date. DPPO imitates human metacognitive learning ...
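The reported training scale can be sanity-checked with a quick back-of-envelope calculation. The figures come from the article above; the assumption that the full cluster runs at once, with no downtime, is mine.

```python
# Back-of-envelope check on the reported training scale (numbers from the article;
# full-cluster utilization is an assumption for illustration).
gpu_hours_per_checkpoint = 50_000      # ">50,000 A800 GPU-hours" per checkpoint run
cluster_gpus = 1_000                   # "1000+ A800 GPUs"
wall_clock_hours = gpu_hours_per_checkpoint / cluster_gpus
print(f"~{wall_clock_hours:.0f} h (~{wall_clock_hours / 24:.1f} days) wall-clock per checkpoint")
# -> roughly 50 h, about 2 days, if the whole cluster is dedicated to one run.
```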
Peking University and collaborators use a "hierarchical cerebellum + simulated avatar" to put the G1 to work zero-shot
具身智能之心· 2025-11-14 16:03
Core Insights
- The article introduces the DemoHLM framework developed by a research team from Peking University and BeingBeyond, which addresses the challenges of humanoid robot loco-manipulation by generating vast amounts of training data from a single human demonstration in a simulated environment [1][3][20].

Group 1: Challenges in Humanoid Robot Loco-Manipulation
- Humanoid robot loco-manipulation faces three main challenges: reliance on extensive real-world remote operation data, limited generalization across tasks, and difficulty transferring simulation-trained policies to real-world applications [3][5].
- Existing methods either remain confined to simulated environments or require hundreds of hours of real data, making them impractical for complex real-world scenarios [3].

Group 2: Innovations of DemoHLM
- DemoHLM features a dual-engine approach combining hierarchical control and single-demonstration data generation, ensuring stability in full-body movements while minimizing the data cost of generalization [6][20].
- The hierarchical control architecture separates motion control from task decision-making, enhancing both flexibility and stability [7].
- The single-demonstration data generation process creates thousands of diverse training trajectories from just one simulated demonstration, significantly improving data efficiency and generalization (see the sketch after this summary) [8][20].

Group 3: Experimental Validation
- The framework was tested in both simulated environments and on a real Unitree G1 robot, demonstrating significant improvements in task success rates as the amount of synthetic data increased [9][11].
- For instance, the success rate on the "PushCube" task improved from 52.4% to 89.3% with more data, showcasing the effectiveness of the data generation pipeline [11].
- The framework's adaptability was confirmed across various behavior cloning algorithms, with high success rates achieved on multiple tasks [14][16].

Group 4: Industry Implications and Future Directions
- DemoHLM lowers the cost of training humanoid robots, reducing the requirement from hundreds of hours of real operation to just hours of simulated demonstration, thereby lowering the barriers to industrial application [17][20].
- The framework's ability to generalize across different tasks without task-specific designs accelerates the transition of robots from laboratory settings to real-world environments [23].
- Future research will focus on simulation-to-reality discrepancies and on improving performance in complex scenarios through mixed training with real data and multi-modal perception [19][23].
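The "one simulated demo, thousands of trajectories" idea from Group 2 can be illustrated with a toy sketch. Everything here (poses represented as 2-D points, the tracking stand-in, the success check) is a simplification invented for the example, not the authors' code.

```python
# Toy sketch of single-demonstration data generation in simulation, in the spirit of
# DemoHLM as summarized above. All details are illustrative assumptions.
import random

def retarget(demo_waypoints, old_obj, new_obj):
    """Shift demo waypoints by the change in object position."""
    dx, dy = new_obj[0] - old_obj[0], new_obj[1] - old_obj[1]
    return [(x + dx, y + dy) for x, y in demo_waypoints]

def generate_dataset(demo_waypoints, demo_obj=(0.5, 0.0), num_rollouts=1000):
    dataset = []
    for seed in range(num_rollouts):
        rng = random.Random(seed)
        obj = (demo_obj[0] + rng.uniform(-0.1, 0.1),        # randomized object pose per rollout
               demo_obj[1] + rng.uniform(-0.1, 0.1))
        traj = []
        for wp in retarget(demo_waypoints, demo_obj, obj):
            obs = {"object": obj, "target": wp}              # what the high-level policy would see
            action = wp                                       # command handed to the low-level controller
            traj.append((obs, action))
        success = rng.random() > 0.1                          # stand-in for a simulator success check
        if success:                                           # keep only successful rollouts
            dataset.append(traj)
    return dataset                                            # supervision for behavior cloning

demos = generate_dataset([(0.4, 0.0), (0.5, 0.0), (0.5, 0.2)])
print(len(demos), "successful synthetic trajectories from one demonstration")
```

The hierarchical split matters here: the high-level policy only needs to output commands, while a separately trained whole-body controller keeps the humanoid balanced while tracking them, which is what lets a single demonstration scale up safely in simulation.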
SemanticVLA: Semantic-Aligned Pruning and Augmentation for Efficient Robotic Manipulation
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses significant advances in vision-language-action models for robotic manipulation, highlighting the challenges posed by dynamic, cluttered environments that hinder the deployment of existing models [2][4].

Research Background
- Vision-language-action models have made notable progress in robotic manipulation by using pre-trained vision-language models to map language to action end to end. However, two main bottlenecks limit their deployment in real-world scenarios: low computational efficiency and weak task-grounding capability [2].

Key Innovations
- Introduction of a semantic-guided dual-visual pruner that removes visual redundancy through instruction-aware token filtering and geometry-aware aggregation while maintaining semantic alignment (see the sketch after this summary) [3].

Main Work
- Overall Framework Design: The framework processes real-time visual observations, robot state (e.g., joint angles, end-effector pose), and natural language instructions to predict future action sequences. It employs two parallel paths for visual input processing, culminating in an end-to-end pipeline from observation to action [4].
- Visual Perception Redundancy: A generic visual encoder processes all pixels uniformly, leading to background interference and environmental noise, which raises computational cost and dilutes attention on task-critical cues [5].
- Semantic Complementary Layered Fusion: A semantic complementary layered fusion mechanism integrates dense patch features with sparse semantic tokens, improving the alignment of instruction semantics with spatial structure [5].
- Semantic Conditioned Action Coupler: The design reconstructs the visual-to-action mapping, improving the efficiency and interpretability of action decoding by representing actions as semantically coherent types [5].

Experimental Results
- Efficiency Advantages: The model reduces training cost by 3.0x and inference latency by 2.7x, and compresses visual tokens by 8-16x, significantly improving throughput [14].
- Real-World Performance: On long-horizon tasks, the model's success rate reaches 77.8%, surpassing the OpenVLA-OFT model by 22.2% and demonstrating strong generalization [14].
- Ablation Studies: The dual-pruning combination in the SD-Pruner raises success rates by 2.1%-5.2%, achieving the best performance-efficiency balance at an 8x sparsification ratio [16].
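The instruction-aware token filtering mentioned under Key Innovations can be pictured with a short sketch: score each visual token against the instruction embedding and keep only the most relevant ones. The dimensions, scoring rule, and keep ratio below are assumptions for illustration, not the paper's implementation.

```python
# Sketch of instruction-aware visual-token filtering, the kind of operation the
# SD-Pruner summarized above performs (illustrative assumptions throughout).
import torch

def prune_tokens(visual_tokens, instruction_emb, keep_ratio=1 / 8):
    """
    visual_tokens:   (B, N, D) patch features from the vision encoder
    instruction_emb: (B, D)    pooled embedding of the language instruction
    Returns the keep_ratio * N tokens most relevant to the instruction.
    """
    scores = torch.einsum("bnd,bd->bn", visual_tokens, instruction_emb)  # relevance per token
    k = max(1, int(visual_tokens.shape[1] * keep_ratio))                 # e.g. 8x compression
    idx = scores.topk(k, dim=1).indices                                  # indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return torch.gather(visual_tokens, 1, idx)                           # (B, k, D)

tokens = torch.randn(2, 256, 512)   # 256 patch tokens per image
instr = torch.randn(2, 512)         # one instruction embedding per sample
kept = prune_tokens(tokens, instr)
print(kept.shape)                   # torch.Size([2, 32, 512]): far fewer tokens reach the LLM
```

Because the downstream language model's cost grows with the number of visual tokens it attends over, an 8-16x reduction of this kind is where the reported training and latency savings come from.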
The brother from the bunk below Lei Jun's is founding an embodied intelligence robotics startup
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses the career transition of Cui Baoqiu, a former Xiaomi executive, who is now venturing into embodied intelligence and robotics after leaving Xiaomi [2][6][12].

Group 1: Career Transition and Vision
- Cui Baoqiu, known as a "technical guru" at Xiaomi, is now focused on building household service robots, a shift from his previous role of building AI platforms [2][4][5].
- His vision has evolved from "connecting everything" to "transforming the physical world," aiming to create an AI that can think, move, and interact with humans [4][7].
- Prior to his current venture, he served as Chief Technical Advisor to a RISC-V chip company, indicating a strategic move toward foundational technology [8][10].

Group 2: Background and Achievements at Xiaomi
- Cui joined Xiaomi in 2012 at the invitation of Lei Jun and played a crucial role in establishing Xiaomi's AI and cloud platform team [14][29].
- He was instrumental in promoting Xiaomi's "AIoT" strategy, which initially focused on connecting devices such as smart speakers and cameras [7][29].
- Under his leadership, Xiaomi launched significant AI products, including the AI assistant "Xiao Ai," reflecting the culmination of his earlier predictions about AI capabilities [30][32].

Group 3: Industry Trends and Implications
- The article highlights a broader trend in which former executives from major tech companies are now focused on giving AI a physical embodiment, since software alone is insufficient to unlock AI's full potential [42][44].
- This shift toward embodied intelligence is seen as the next phase of AI's evolution, with many former tech leaders entering the robotics space [42][47].
- Competition in this sector is intensifying, with significant investment flowing into startups focused on general-purpose robotics and embodied intelligence [45][48].
Open boxes and fold towels! Deploy pi0 on your own robotic arm from scratch!
具身智能之心· 2025-11-14 04:00
Core Viewpoint
- The article introduces the Imeta-Y1, a lightweight and cost-effective robotic arm designed for beginners and researchers in embodied intelligence, emphasizing its accessibility and ease of use for algorithm validation and project development [3][4][6].

Product Features
- The Imeta-Y1 robotic arm has a compact structure and modular interfaces, making it suitable as a learning platform for embedded AI and robotics [7].
- It offers a full-process open-source toolchain and code examples, enabling a seamless transition from data collection to model deployment [4][17].
- The arm provides dual-language interfaces (Python and C++) and is compatible with ROS1 and ROS2, allowing users to get started quickly regardless of programming background (a hypothetical usage sketch follows this summary) [4][18][19].

Technical Specifications
- The arm weighs 4.2 kg, has a rated load of 3 kg and 6 degrees of freedom, a working radius of 612.5 mm, and a repeat positioning accuracy of ±0.1 mm [9][19].
- It runs on a 24 V supply and communicates over CAN, with various external interfaces for power and communication [9][19].
- Joint motion ranges and maximum speeds are specified, ensuring precise control across applications [9][19].

Development and Support
- The company provides a comprehensive open-source SDK, including drivers, API interfaces, sample code, and documentation, supporting rapid application development [26][32].
- The product supports multi-modal data fusion and is compatible with major frameworks such as TensorFlow and PyTorch, enabling end-to-end deployment of intelligent algorithms [17][32].
- The company guarantees timely after-sales support, with a 24-hour response commitment for customer inquiries [19][44].

Testing and Reliability
- The robotic arm undergoes rigorous hardware testing, including accuracy calibration, durability, load performance, and stability verification, to ensure reliability and safety across application scenarios [35][39][42].
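To give a feel for the kind of workflow a Python SDK plus CAN control implies, here is a purely hypothetical usage sketch. The class, method names, and parameters below are invented stand-ins, not the vendor's actual API; the real interface ships with the open-source SDK mentioned above.

```python
# Hypothetical sketch of driving a 6-DoF arm over a Python SDK (all names invented).
import time

class ArmClient:                       # stand-in for a vendor-provided client class
    def __init__(self, can_interface="can0"):
        self.can_interface = can_interface
    def connect(self):
        print(f"connected over {self.can_interface}")
    def move_joints(self, joint_angles_deg, speed_pct=20):
        print(f"moving to {joint_angles_deg} at {speed_pct}% speed")
    def open_gripper(self):
        print("gripper opened")

arm = ArmClient(can_interface="can0")                     # CAN is the control bus per the spec above
arm.connect()
arm.move_joints([0, -30, 60, 0, 45, 0], speed_pct=20)     # one target angle per joint, 6 DoF
time.sleep(1.0)
arm.open_gripper()
```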
The annual embodied intelligence gathering! Full agenda released, with three concurrent workshops on VLA, world models, and RL
具身智能之心· 2025-11-14 01:02
Group 1
- The 2025 China Embodied Intelligent Robot Conference (EAIRCon 2025) will be held on November 19 in Shenzhen, focusing on the wave of embodied intelligent robots [2][4].
- The conference will feature a main forum, specialized forums, workshops, and an exhibition area, under the theme "Embodied Intelligence Awakens" [2][4].
- Nearly 40 guests will deliver speeches, reports, and dialogues to comprehensively analyze the new wave of the robot revolution driven by embodied intelligence [2][5].

Group 2
- The main forum will open with a keynote address by the CEO of Zhiyi Technology, followed by a report from a prominent professor on the challenges and advances in humanoid robots [6][7].
- Keynote speakers include executives and scientists from leading companies and research institutions, discussing topics such as human-level intelligence in robots and the transition from open source to standardization in humanoid robotics [8][9].

Group 3
- The specialized forum will delve into industrialization opportunities and innovations in humanoid robots, featuring discussions on implementation paths for embodied intelligent robots [15][16].
- Workshops will cover topics such as robot imitation learning, embodied world models, and VLA (Vision-Language-Action) large models, with participation from young scholars and industry experts [24][25][26].

Group 4
- The conference aims to address the key capabilities required for humanoid robots, including physical interaction and skill evolution, to enable practical applications [20][21].
- The event will also explore the integration of AI and robotics, emphasizing the importance of real-world data and multi-modal information for the development of embodied intelligence [21][22].
HKUST and collaborators propose WMPO: a world-model-based VLA policy optimization framework
具身智能之心· 2025-11-14 01:02
Core Insights
- The article introduces WMPO (World Model-based Policy Optimization), a framework developed by the Hong Kong University of Science and Technology and the ByteDance Seed team, which improves sample efficiency, task performance, generalization, and lifelong learning for VLA (Vision-Language-Action) models through pixel-level video generation [5][25].

Research Background and Pain Points
- Existing solutions struggle to balance scalability and effectiveness: human intervention requires continuous supervision, and adapting simulators to diverse scenarios is costly [4].
- Traditional latent-space world models are misaligned with web-scale pre-trained visual features and fail to fully leverage pre-trained knowledge [4][6].

Core Framework Design
- WMPO generates trajectories in an "imagination" space using a high-fidelity pixel-level world model, replacing real environment interactions and supporting stronger on-policy reinforcement learning (a minimal sketch follows after this summary) [5][11].
- The iterative process follows "imagined trajectory generation → trajectory sampling and evaluation → policy update" [5].

Key Modules
- **Generative World Model**: Simulates the dynamics between the robot and the environment, generating visual trajectories aligned with VLA pre-trained features [8].
- **Lightweight Reward Model**: Automatically judges the success or failure of imagined trajectories, providing sparse reward signals and avoiding complex reward shaping [9].
- **On-Policy Policy Optimization (GRPO)**: Adapts Group Relative Policy Optimization to sparse-reward settings, balancing stability and scalability [10].

Core Innovations
- **Pixel-Space Priority**: Generates trajectories directly in pixel space, matching VLA pre-trained visual features and maximizing the value of pre-trained knowledge [11].
- **Trajectory Generation Logic**: Predicts action chunks from initial frames and language instructions, generating subsequent frames iteratively [12].
- **Dynamic Sampling Strategy**: Generates multiple imagined trajectories from the initial state and filters out all-success or all-failure groups to ensure effective training samples [12].

Experimental Validation and Key Results
- In simulation, WMPO outperformed baseline methods (GRPO, DPO) across four fine-manipulation tasks, achieving an average success rate of 47.1% with a rollout budget of 128 and 57.6% with a budget of 1,280, demonstrating superior sample efficiency [13][14].
- In the real world, WMPO achieved a 70% success rate on a "block insertion" task, significantly higher than baseline policies [15].

Emergent Behaviors
- WMPO exhibits self-correction, autonomously adjusting its actions in response to failure states, unlike baseline policies that continue erroneous actions until timeout [17].

Generalization Ability
- WMPO achieved an average success rate of 29.6% in out-of-distribution scenarios, outperforming all baselines, indicating that it learns general manipulation skills rather than spurious visual cues [19][20].

Lifelong Learning
- WMPO showed stable performance improvement through iterative trajectory collection, while DPO suffered from instability and required more expert demonstrations [23].

Conclusion and Significance
- WMPO establishes a new paradigm for VLA optimization by integrating world models with on-policy reinforcement learning, addressing the high cost and low sample efficiency of real-environment interaction. It enhances performance, generalization, and lifelong learning, paving the way for scalable applications in general robotic manipulation [25].
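The "imagined rollout → sparse reward → group-relative update" loop described above can be sketched in a few lines. The three callables (policy, world model, reward model), the group size, and the stub classes are illustrative assumptions so the sketch runs; this is not the authors' implementation.

```python
# Sketch of a WMPO-style optimization step (illustrative assumptions throughout).
import torch

def wmpo_step(policy, world_model, reward_model, init_frame, instruction,
              group_size=8, horizon=16):
    rewards, logprob_sums = [], []
    for _ in range(group_size):                        # a group of imagined trajectories
        frame, logprobs = init_frame, []
        for _ in range(horizon):
            action, logp = policy.sample(frame, instruction)
            frame = world_model.predict(frame, action)  # pixel-level "imagined" next frame
            logprobs.append(logp)
        rewards.append(reward_model.success(frame, instruction))   # sparse 0/1 reward
        logprob_sums.append(torch.stack(logprobs).sum())
    rewards = torch.tensor(rewards, dtype=torch.float32)
    if rewards.std() == 0:                             # dynamic sampling: skip all-success or
        return None                                    # all-failure groups (no learning signal)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative baseline
    return -(advantages * torch.stack(logprob_sums)).mean()          # policy-gradient surrogate

class _Stub:                                           # minimal stand-ins so the sketch runs
    def sample(self, frame, instr):
        return torch.zeros(1), torch.randn(1, requires_grad=True).sum()
    def predict(self, frame, action): return frame
    def success(self, frame, instr): return float(torch.rand(1) > 0.5)

loss = wmpo_step(_Stub(), _Stub(), _Stub(), torch.zeros(3, 64, 64), "insert the block")
print(loss)
```

Because every rollout happens inside the world model, the policy can be updated on-policy as often as the budget allows, which is where the reported gains over real-environment fine-tuning come from.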
Leading embodied intelligence companies are now investing in other companies...
具身智能之心· 2025-11-14 01:02
Core Viewpoint
- The article discusses the investment activities of various companies in the embodied intelligence sector, highlighting their strategies to secure key technologies and supply chains for competitive advantage [2][3][4].

Group 1: Company Investments
- Zhiyuan Robotics has been actively preparing for an IPO while investing in over 30 companies, focusing on upstream key technologies, product supply chains, and downstream markets [3].
- Xinghai Map has recently invested in Jianzhixinchuang (Beijing) Robot Technology Co., Ltd., which provides a one-stop "data + deployment" service [6].
- Zhujidi Power has invested in Shanghai Wujitech, which develops high-performance motors and dexterous hands [7].
- Songyan Power has invested in Silicon-based Wisdom (Beijing) Robot Co., Ltd., which focuses on companion and elderly-care robots [8].