具身智能之心
From Scratch: Reproducing pi0 and pi0.5 with a Low-Cost Robotic Arm
具身智能之心· 2025-12-25 01:41
Core Viewpoint
- The article emphasizes the growing industry demand for VLA (Vision-Language-Action) algorithms and highlights the challenges practitioners face in data collection and model optimization [2][4].

Group 1: Industry Demand and Challenges
- Demand for VLA expertise is significant, as reflected in the volume of job postings and research papers in the field [2].
- Practitioners often struggle with VLA because data collection is complex and hardware-dependent, leading to frustration over wasted time and ineffective model training [2][4].
- Many companies in the embodied intelligence sector are committed to real-robot data, but its quality can be suboptimal, complicating the training process [2][4].

Group 2: Educational Initiatives
- The article introduces a practical course, developed in collaboration with industry experts, aimed at flattening the VLA learning curve [5].
- The curriculum spans hardware, data collection, VLA algorithms, and real-world applications, and is designed for effective hands-on learning [8][9].
- Participants receive an SO-100 robotic arm, supporting hands-on practice with the concepts taught [9].

Group 3: Course Structure and Content
- The course is organized into nine chapters, covering topics from VLA basics to advanced model deployment and evaluation [11][12][13][14][15][16][17][18].
- Key focus areas include data acquisition, model training, simulation environments, and the integration of VLA with world models [8][9][11][12][13][14][15][16][17].
- Upon completion, learners are expected to have the skills of an algorithm engineer with 1-2 years of experience [25].
Fall 2027 Joint PhD Recruitment in Embodied AI | Prof. He Wang (PKU) x Prof. Li Yi (Tsinghua) x Dr. Zhizheng Zhang (Galbot)
具身智能之心· 2025-12-25 01:41
The team of Prof. He Wang at the Center on Frontiers of Computing Studies, School of Computer Science, Peking University, the team of Prof. Li Yi at Tsinghua University's Institute for Interdisciplinary Information Sciences (IIIS), and the team of Dr. Zhizheng Zhang at Galbot have officially launched a joint recruitment program for PhD students entering in 2027.

For the Fall 2027 intake, the joint program offers more than ten PhD slots across PKU's School of Computer Science, Tsinghua's IIIS and AI Institute, the BAAI-Institute of Automation (CAS) joint program, the Shanghai Qi Zhi Institute-SJTU joint program, the Shanghai Qi Zhi Institute-ShanghaiTech joint program, and Zhongguancun Academy, with dedicated additional slots for students from Hong Kong, Macao, Taiwan, and overseas.

Outstanding third-year undergraduates, second-year master's students, and other lower-year students will be selected for a research internship winter camp under the joint program, hosted at Galbot's Beijing headquarters. Internship performance will serve as the direct basis for 2027 PhD admissions.

This is a golden pairing of top academic advisors with a 20-billion-RMB embodied-intelligence unicorn; top students worldwide are invited to join and help define the future of general-purpose robots.

Editor: 具身智能之心. This article is shared for academic purposes only; in case of infringement, contact us for removal.

Dr. Li Yi | Tsinghua University. Dr. Li Yi currently serves at Tsinghua's Institute for Interdisciplinary Inf ...
From 2D Perception to 3D Prediction: GeoPredict Rebuilds the Geometric Reasoning of VLA Models
具身智能之心· 2025-12-25 01:41
Authors: Jingjing Qian et al. Editor: 具身智能之心.

In robot manipulation, Vision-Language-Action (VLA) models have achieved cross-task generalization by drawing on the semantic and visual priors of large-scale pretraining data. They have, however, long been constrained by a 2D-centric, reactive decision paradigm that struggles with complex tasks demanding precise 3D spatial reasoning and long-horizon physical consistency.

GeoPredict, a framework proposed by a joint team from CUHK-Shenzhen, Hunan University, Li Auto, and others, takes "predictive kinematics + 3D Gaussian geometry" as its dual core. Through an architecture of trajectory-level motion prediction, 3D Gaussian scene modeling, and training-time supervision with lightweight inference, it is the first to inject future-aware geometric priors into a continuous-action VLA model, breaking through the spatial-reasoning bottleneck of conventional methods.

Paper title: GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Preci ...
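The "supervise at training time, stay lightweight at inference" pattern described above — an auxiliary future-geometry head that adds a loss term during training and is skipped at deployment — can be sketched schematically. This is a minimal illustration of the general design, not GeoPredict's actual modules; the arithmetic stand-ins for the heads, the `aux_weight` value, and the targets are all made up.

```python
# Schematic of auxiliary supervision: a geometry head is trained alongside
# the action head but skipped at inference. The linear stand-ins below are
# NOT the paper's networks; only the wiring pattern is the point.

def forward(features, train=True):
    action = 2.0 * features          # action head (always used)
    if train:
        geometry = features + 1.0    # auxiliary future-geometry head
        return action, geometry
    return action, None              # inference: geometry head skipped, no extra cost

def loss(features, action_target, geometry_target, aux_weight=0.5):
    """Joint training objective: action loss plus weighted geometry loss."""
    action, geometry = forward(features, train=True)
    l_action = (action - action_target) ** 2
    l_geom = (geometry - geometry_target) ** 2
    return l_action + aux_weight * l_geom

print(forward(1.0, train=False))                            # (2.0, None)
print(loss(1.0, action_target=2.0, geometry_target=3.0))    # 0.5
```

The auxiliary head shapes the shared features during training, yet deployment pays only for the action head.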
A Deep Dive into Embodied Data Routes: The "Four Little Dragons" Landscape Has Taken Shape
具身智能之心· 2025-12-24 10:04
Core Viewpoint
- The development of embodied intelligence in 2025 has centered on a closed loop of data collection, model training, data scaling, and model optimization, with data remaining the key lever for future progress [1][5].

Group 1: Data Routes
- The industry is not converging on a single optimal solution but is advancing along four distinct data routes in parallel, each addressing different constraints and stages [3].
- These four routes have produced a competitive landscape dubbed the "Four Little Dragons of Embodied Data," with key players including Zhiyuan, Galaxy, Tashi, and Luming [4][34].

Group 2: Data Route Descriptions
- Teleoperated real robots: provides the most authentic data but is also the most expensive and slowest route, requiring real robots and trained operators, which makes it hard to scale [8][12][14].
- Simulation data: offers high efficiency and scalability, but the sim-to-real domain gap limits its effectiveness in real-world applications [16][18][20].
- Human video: cost-effective with broad scenario coverage, but lacks critical feedback signals and is not a primary data source for bootstrapping initial capabilities [22][25].
- UMI data: decouples real interaction data from any specific robot, enabling more versatile and scalable collection, and is becoming foundational infrastructure for embodied data [27][30][31].

Group 3: Industry Practices
- In teleoperated real-robot data, Tesla is advancing its remote operation system, while Zhiyuan Robotics is deepening its focus on real bodies and closed task loops [35].
- On the simulation route, Galaxy General is expanding synthetic data scale through computational power and simulation engines [35].
- On the human-video route, Tashi is building large-scale human behavior video datasets to broaden semantic coverage [35].
- The UMI route is represented by Luming Robotics, which has made significant strides in scaling and engineering UMI data collection systems [35][39].

Group 4: Future Implications
- As the industry shifts from proving feasibility to continuous evolution, the ability to consistently produce high-quality real data will become increasingly critical [37].
- The four routes are not mutually exclusive; each plays a distinct role in the overall ecosystem, together clarifying the path forward for embodied intelligence [38][40].
- Time accumulation matters, particularly for the UMI route, which depends heavily on early choices and sustained investment [41][42].
- The "Four Little Dragons" framing is a structural description of the industry today; future success will depend on which routes and teams can sustain operational continuity and data advantages [44][45].
X0, the Latest Work from Hongyang Li's Team: Robot Manipulation at Ultra-Low Cost and High Efficiency
具身智能之心· 2025-12-24 04:01
Core Insights
- The article emphasizes achieving 100% reliability in robotic manipulation tasks through a deliberate strategy rather than merely increasing data scale [2][4].

Methodology
- The proposed methodology consists of three interconnected stages, each critical to success: data collection, model training, and real-world inference [2].
- The approach focuses on pattern consistency, model algorithms, and leveraging phase advantages to optimize the transition from perception to action [3].

Pattern Consistency
- The article defines the effective action distribution for specific tasks and highlights the need for dynamic alignment among human demonstrations, learned policies, and real-world execution [8][10].
- It identifies inconsistencies in traditional imitation learning, such as distribution shifts and deployment biases, that can lead to task failures [11][12].

Model Algorithms
- The Model Arithmetic (MA) method trains on new data subsets and merges models, avoiding the high cost of retraining on full datasets [27][30].
- MA successfully integrates different learned manifolds, exceeding the performance of models trained on the full dataset [30].

Phase Advantages
- Estimating advantage signals directly as a modeling objective improves the reliability of state transitions in long-horizon tasks [31][35].
- The proposed Direct+Stage method enhances the stability and smoothness of progress accumulation in robotic tasks [37].

Performance Improvements
- Enhanced data collection and online policy-recovery trajectories significantly improve recovery capabilities, yielding higher success rates and reduced retry costs [21].
- Spatiotemporal enhancements have increased throughput and task completion rates [23][26].

Conclusion
- Not all robotic data holds equal value; the ability to quickly evaluate and select high-quality foundational policies is crucial for fast research iteration [41].
- Re-examining fundamental concepts in reinforcement learning could yield further gains in robotic manipulation [41].
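The Model Arithmetic idea summarized above — train on a new data subset, then merge checkpoints in parameter space instead of retraining on the full dataset — can be illustrated with simple weight arithmetic. This is a generic sketch of parameter-space merging, not the paper's exact MA formulation; the interpolation weight `alpha` and the toy state dicts are assumptions for illustration.

```python
# Minimal sketch of checkpoint merging by parameter arithmetic:
# merged = base + alpha * (finetuned - base). This is a generic
# "task vector" style merge, not the paper's exact MA method.

def merge_checkpoints(base, finetuned, alpha=0.5):
    """Interpolate two state dicts (name -> list of weights)."""
    merged = {}
    for name, w_base in base.items():
        # Task vector: how the subset-finetuned weights moved from base.
        delta = [f - b for f, b in zip(finetuned[name], w_base)]
        merged[name] = [b + alpha * d for b, d in zip(w_base, delta)]
    return merged

base = {"layer1": [1.0, 2.0], "layer2": [0.0, 0.0]}
finetuned = {"layer1": [3.0, 2.0], "layer2": [1.0, -1.0]}
merged = merge_checkpoints(base, finetuned, alpha=0.5)
print(merged["layer1"])  # [2.0, 2.0]
print(merged["layer2"])  # [0.5, -0.5]
```

The appeal is cost: merging is a single pass over the weights, versus a full retraining run over the combined dataset.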
具身智能之心 Kicks Off New Year Giveaways (Courses / Embodied Hardware / Research Coaching, and More)
具身智能之心· 2025-12-24 04:01
Group 1
- The article announces a promotional period from December 24 to January 5, with discounts on various courses and community memberships [1].
- All embodied courses are available at a 25% discount; new members joining the Knowledge Planet community get 40% off, and existing members can renew at 50% off [3].
- High-cost embodied research robotic arms are discounted by up to 1,500 RMB, the first time this year such a discount has been offered [3].

Group 2
- A 1-on-1 job coaching service is currently available at a discounted rate [4].
- Readers are encouraged to add a WeChat contact for details on research-paper guidance and other offerings [6].
This Year's VLA+RL Work Is Queuing Up for Acceptance...
具身智能之心· 2025-12-24 00:25
Core Insights
- The article emphasizes the role of Reinforcement Learning (RL) in enhancing the generalization of Vision-Language-Action (VLA) models, with some experiments reporting improvements of up to 42.6% on out-of-distribution tasks [2].

Group 1: VLA and RL Integration
- VLA models currently rely on RL to overcome their limitations in real-world out-of-distribution scenarios, where imitation learning alone proves insufficient [2].
- Recent advances in VLA+RL frameworks have produced significant breakthroughs, with several notable papers published this year [2].
- Tooling for VLA+RL is evolving; the article recommends resources such as Rlinf, which supports a growing number of methods [2].

Group 2: Notable Research Papers
- The article summarizes representative VLA+RL papers from the past two years and their contributions to the field [5].
- Key papers include "NORA-1.5," a VLA model trained with world-model and action-based preference rewards, and "Balancing Signal and Variance," on adaptive offline RL post-training for VLA flow models [5][10].
- Other significant works include "ReinboT," which enhances robot visual-language manipulation through RL, and "WMPO," which optimizes VLA policies based on world models [8][10].

Group 3: Future Research Directions
- The article suggests aligning future research with these advances in VLA and RL, and encourages collaboration and consultation for those interested in the area [3].
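The VLA+RL recipe sketched above — start from an imitation-learned policy, then post-train it with reward signals — can be illustrated at toy scale with a REINFORCE update on a two-action softmax policy. Everything here (the toy reward, learning rate, step count) is an assumption for illustration; real frameworks such as Rlinf fine-tune full vision-language-action networks, not two logits.

```python
import math
import random

# Toy REINFORCE post-training sketch: a softmax policy over two discrete
# actions is nudged toward the rewarded action. The initial logits play the
# role of an imitation-learning prior that favors the WRONG action.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(logits, reward_fn, lr=0.5, steps=200, seed=0):
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices([0, 1], weights=probs)[0]  # sample an action
        r = reward_fn(a)
        # Policy gradient for softmax logits: d log pi(a)/d logit_i = 1[i==a] - probs[i]
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)

# Reward only action 1; the imitation prior (initial logits) favors action 0.
probs = reinforce([1.0, 0.0], reward_fn=lambda a: 1.0 if a == 1 else 0.0)
print(probs)  # probability mass has shifted onto the rewarded action
```

The same mechanism, scaled up, is what lets RL post-training recover behaviors that imitation data alone under-covers.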
An In-Depth Look at the Three Technical Paradigms for Embedding World Models in Embodied Systems
具身智能之心· 2025-12-24 00:25
Core Insights
- The article discusses integrating world models into embodied intelligent systems, emphasizing the shift from reactive to predictive capabilities [1][3][8].

Summary by Sections

Introduction to World Models
- Embodied intelligent systems traditionally relied on a reactive "perception-action" loop and lacked predictive capabilities. World models allow these systems to "imagine" future scenarios [1][3].

Research Overview
- A comprehensive survey from a research team including Tsinghua University and Harbin Institute of Technology categorizes existing work into three paradigms based on architectural integration [3][5].

Paradigm Classification
- The relationship between world models (WM) and policy models (PM) is described as a "coupling strength spectrum," ranging from weak to strong dependency [11].
- Three categories are identified — Modular, Sequential, and Unified architectures — each with distinct characteristics in gradient flow and information dependency [12].

Modular Architecture
- WM and PM are independent, with no gradient flow between them. The WM acts as a simulator, predicting future states from current observations and candidate actions [16].

Sequential Architecture
- Two stages: the WM predicts future states, and the PM executes actions based on those predictions. Complex tasks are decomposed into goal generation and goal-conditioned execution [17][18].

Unified Architecture
- WM and PM are integrated into a single end-to-end network, trained and optimized jointly. The system predicts future states and generates actions without explicitly separating simulation from decision-making [19][21].

Future Directions
- The article outlines open research directions, including the choice of representation spaces for world models, the generation of structured intentions, and the need for unified world-policy model paradigms to improve decision-making efficiency [22][24].
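The three paradigms described above differ mainly in how the world model (WM) and policy model (PM) are wired together. A toy numeric sketch of the wiring — the linear "dynamics" and "policies" below are made-up stand-ins, not learned networks from the survey:

```python
# Toy sketch of the three WM/PM integration paradigms. Only the wiring
# pattern is the point; all functions are hand-written stand-ins.

def world_model(state, action):
    """Stand-in dynamics: predict the next state."""
    return state + action

def modular(state, goal, candidates):
    """Modular: WM is an independent simulator; the policy side scores
    candidate actions by rolling each through the WM (no gradient coupling)."""
    return min(candidates, key=lambda a: abs(world_model(state, a) - goal))

def sequential(state, goal):
    """Sequential: stage 1, the WM proposes an intermediate predicted state
    (subgoal); stage 2, a goal-conditioned PM emits the action reaching it."""
    subgoal = state + (goal - state) / 2   # WM stage
    return subgoal - state                 # PM stage

def unified(state, goal):
    """Unified: one end-to-end module jointly emits (predicted state, action)."""
    action = (goal - state) / 2
    return world_model(state, action), action

print(modular(0.0, 1.0, [-0.5, 0.25, 1.0]))  # 1.0 — best candidate under the WM
print(sequential(0.0, 1.0))                   # 0.5
print(unified(0.0, 1.0))                      # (0.5, 0.5)
```

In the modular case the WM could be swapped out without touching the policy; in the unified case prediction and action share one computation, which is exactly the coupling-strength spectrum the survey describes.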
Training 100 Million Gaussians on a Single GPU, Reconstructing a 25 km² City: the 3DGS Memory Wall Broken by a CPU "Plug-In"
具身智能之心· 2025-12-24 00:25
Want to reconstruct a city with 3D Gaussian Splatting (3DGS)? In the past, that usually meant an expensive GPU cluster. Researchers have now given a different answer: a single RTX 4090, plus enough CPU memory, can also complete city-scale 3D reconstruction.

A research team from New York University presented a system called CLM (CPU-offloaded Large-scale 3DGS training) at ASPLOS 2026. By offloading the parameters that consume the most GPU memory during 3DGS training to CPU memory, CLM enables a single consumer-grade GPU to train models with over a hundred million Gaussians, significantly lowering the hardware bar for large-scene neural rendering.

The scaling bottleneck of 3DGS

Thanks to its high rendering quality and very high rendering speed, 3DGS has become a major technical route in neural rendering. But when researchers try to apply it to complex scenes such as city blocks or large indoor spaces, the problem quickly surfaces: GPU memory becomes the most immediate, and hardest to resolve, bottleneck. A high-fidelity 3DGS model typically contains tens of millions to over a hundred million Gaussians, each carrying parameters for position, shape, color, and opacity ...
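The core idea described above — keep the bulk of Gaussian parameters in large, cheap host memory and stage only the working set onto the GPU each iteration — can be sketched conceptually. This is an illustration of the offloading pattern, not the ASPLOS system itself; here plain dicts stand in for CPU and GPU memory, and a constant additive update stands in for the real CUDA rasterizer and optimizer step.

```python
# Conceptual sketch of CPU-offloaded training: the full parameter table
# lives in host memory; only the Gaussians visible in the current view are
# staged into device memory, updated, and written back. Dicts stand in for
# the two memories; the real system runs a CUDA rasterizer in step 2.

class OffloadedGaussianStore:
    def __init__(self, num_points):
        # Full parameter table resides in (large, cheap) host memory.
        self.cpu = {i: 0.0 for i in range(num_points)}

    def train_step(self, visible_ids, update=0.1):
        # 1. Stage only the working set into (scarce) device memory.
        gpu = {i: self.cpu[i] for i in visible_ids}
        # 2. Update on-device (stand-in for rasterize + backprop + optimizer).
        for i in gpu:
            gpu[i] += update
        # 3. Write updated parameters back to host memory.
        self.cpu.update(gpu)
        return len(gpu)  # number of device-resident points this step

store = OffloadedGaussianStore(num_points=1_000_000)
resident = store.train_step(visible_ids=range(1000))
print(resident)      # only 1000 of 1,000,000 points touched device memory
print(store.cpu[0])  # updated and written back to host memory
```

The trade is bandwidth for capacity: each step pays a host-device transfer for the visible subset, but peak device memory scales with the view's working set rather than the whole scene.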
MIT Team Proposes OpenTouch: First Synchronized Modeling of Vision, Touch, and Hand Pose in Real-World Settings
具身智能之心· 2025-12-24 00:25
Core Insights
- The article discusses the OPENTOUCH framework, which enables full-hand tactile data collection in real-world environments, addressing the inability of existing single-modal systems to capture critical tactile information [3][4][6].

Group 1: Challenges in Tactile Perception
- The framework identifies four core challenges in tactile perception: missing modal information, poor adaptability to real-world environments, difficult multi-modal synchronization, and low annotation efficiency [6][7][8][9].

Group 2: Technical Design of OPENTOUCH
- OPENTOUCH consists of a three-layer technical loop: a hardware perception system, large-scale data collection, and benchmark testing [11].
- The first layer is a low-cost, robust hardware kit for high-precision multi-modal capture, featuring a full-hand tactile sensing glove and a hand-pose tracking glove [12].
- The second layer builds a large-scale multi-modal dataset covering real-life scenarios, addressing data scarcity [13].
- The third layer establishes a benchmark suite for cross-modal retrieval and tactile classification tasks, validating effective multi-modal integration [15].

Group 3: Performance Validation
- OPENTOUCH employs a three-tier validation scheme: cross-modal performance, ablation studies, and real-world applications [18].
- The multi-modal fusion models show significant gains over single-modal and linear baselines, with notable results on cross-sensory retrieval and tactile classification [20][21].

Group 4: Future Directions and Limitations
- While OPENTOUCH is a breakthrough for full-hand tactile research, open areas remain: expanding the tactile dimensions captured, improving hardware durability, and raising annotation accuracy under challenging conditions [28][29].
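Cross-modal retrieval of the kind benchmarked above — matching a tactile recording to the corresponding visual clip through a shared embedding space — reduces at query time to nearest-neighbor search under cosine similarity. A minimal sketch with made-up 2-D embeddings and hypothetical item names; OPENTOUCH learns such embeddings from synchronized multi-modal data rather than hand-writing them:

```python
import math

# Minimal cosine-similarity retrieval sketch: given a tactile-query
# embedding, return the closest visual embedding in a shared space.
# The 2-D vectors and gallery keys below are invented for illustration.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query, gallery):
    """Return the gallery key whose embedding best matches the query."""
    return max(gallery, key=lambda k: cosine(query, gallery[k]))

visual_gallery = {              # hypothetical visual-clip embeddings
    "grasp_mug": [1.0, 0.1],
    "press_button": [0.0, 1.0],
    "slide_drawer": [-1.0, 0.2],
}
tactile_query = [0.9, 0.2]      # hypothetical embedding of a tactile recording
print(retrieve(tactile_query, visual_gallery))  # grasp_mug
```

The hard part in practice is upstream of this loop: training the encoders so that synchronized tactile and visual signals land near each other in the shared space.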