We actually did so much in 2025.....
具身智能之心· 2025-12-27 10:03
Group 1
- The core viewpoint of the article highlights the growth and evolution of the embodied intelligence industry, with an increase in B-end partnerships and a shift toward more specialized C-end content [1][2]
- The industry has seen a significant increase in participation, with the ability to recruit candidates with around one year of experience, indicating a maturing talent pool [1]
- The company has established nearly 40 embodied groups and grown its paid community to over 2,000 members, showcasing its value in cultivating professionals and researchers in the field [3]

Group 2
- The company is offering various discounts on embodied courses, including 25% off all courses and 40% off for new members joining the knowledge community [7]
- Additional promotions include significant discounts on high-cost embodied research robotic arms and complimentary high-quality courses for purchases over a certain amount [7]
- The company also provides personalized project and job-application guidance, further supporting the development of professionals in the industry [7]
Preparing to launch open embodied rankings across multiple dimensions: models, robot bodies, data collection, open-source contributions, and more
具身智能之心· 2025-12-27 10:03
Group 1
- The article discusses recent outreach from various embodied-intelligence companies and institutions seeking to create multiple rankings along different dimensions [1]
- The rankings aim to be objective and fair, inviting organizations to contribute materials and data [2]
- The rankings will cover several dimensions, and companies are encouraged to participate and provide references [3]

Group 2
- The specific ranking dimensions include: embodied foundation models, robot body (hardware) sales, competition champions, open-source contributions, and data-collection service providers [5]
RLinf, nearing 2k stars, ships yet another update: real-robot reinforcement learning, using your robot like a GPU
具身智能之心· 2025-12-26 03:38
Core Insights
- The article discusses advancements in the RLinf framework, particularly the release of RLinf v0.2, which supports real-world reinforcement learning and aims to enhance the capabilities of embodied intelligence systems [3][5]

Group 1: RLinf v0.2 Features
- RLinf v0.2 lets users treat robots as flexible resources similar to GPUs, deploying workers on robots simply by specifying their IP and port [3][6]
- The framework supports heterogeneous software and hardware cluster configurations, accommodating the diverse requirements of real-world reinforcement learning [8][10]
- RLinf v0.2 introduces a fully asynchronous off-policy algorithm design that decouples inference and training nodes, significantly improving training efficiency [11][14]

Group 2: Experimental Results
- The initial version of RLinf v0.2 was tested with a Franka robotic arm on two tasks, Charger and Peg Insertion, with both converging within 1.5 hours [12][15]
- Success rates were strong: after training, Peg Insertion achieved over 100 consecutive successes and Charger over 50 [15][18]
- The training process was documented on video, showing two Franka robotic arms operating simultaneously in different locations [16][23]

Group 3: Development Philosophy
- The RLinf team emphasizes the collaborative evolution of algorithms and infrastructure, aiming to build a new research ecosystem for embodied intelligence [20]
- The team draws members from institutions including Tsinghua University and Peking University, spanning backgrounds in infrastructure, algorithms, and robotics [20]
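The "use your robot like a GPU" idea above, where a worker is deployed simply by pointing at an IP and port, can be sketched as a device-pool registry. All names below (`RobotWorker`, `RobotPool`, the addresses) are hypothetical illustrations, not the actual RLinf API.

```python
from dataclasses import dataclass

# Hypothetical sketch: real robots as addressable workers, analogous to
# selecting a GPU by device index. Not the actual RLinf interface.

@dataclass(frozen=True)
class RobotWorker:
    name: str
    ip: str
    port: int

class RobotPool:
    """Registry that hands out robot workers like a device pool."""
    def __init__(self):
        self._workers = {}

    def register(self, name, ip, port):
        self._workers[name] = RobotWorker(name, ip, port)

    def acquire(self, name):
        # A real system would open a connection to ip:port here;
        # this sketch just returns the descriptor.
        return self._workers[name]

pool = RobotPool()
pool.register("franka_lab_a", "10.0.0.12", 5555)
pool.register("franka_lab_b", "10.0.0.37", 5555)
worker = pool.acquire("franka_lab_b")
print(worker.ip, worker.port)
```

The point of the abstraction is that training code never cares whether a worker lives on a GPU node or a physical arm two buildings away; it only holds a handle.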
Taking stock: we actually did so much in 2025.....
具身智能之心· 2025-12-26 03:38
Group 1
- The core viewpoint of the article highlights the growth and evolution of the embodied intelligence industry, with an increase in B-end partnerships and a shift toward more specialized C-end content [1][2]
- The industry has seen a significant increase in participation, with the ability to recruit candidates with around one year of experience, indicating a maturing talent pool [1]
- The company has established nearly 40 embodied groups and grown its paid community to over 2,000 members, showcasing its value in cultivating professionals and researchers in the field [3]

Group 2
- The company is offering various discounts on embodied courses, including 25% off all courses and 40% off for new members joining the knowledge community [7]
- Additional promotions include a maximum discount of 1,500 on high-cost embodied research robotic arms and a complimentary high-quality course for purchases over 3,000 [7]
- The company also provides personalized project and job-application guidance at discounted rates, further supporting the professional development of its community [7]
Fudan proposes a new end-to-end autonomous driving framework, setting a new NAVSIM SOTA
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a modular approach to a unified paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive models in mimicking human driving intuition [1][2]

Group 1: WAM-Diff Framework
- The WAM-Diff framework, developed by Fudan University and Yiwang Intelligence, introduces a discrete masked-diffusion model for VLA autonomous driving planning, integrating a sparse mixture-of-experts (MoE) architecture and online reinforcement learning (GSPO) [2][4]
- WAM-Diff achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, scoring 91.0 PDMS and 89.7 EPDMS, demonstrating the potential of non-autoregressive generation in complex driving scenarios [2][16][18]

Group 2: Technical Innovations
- WAM-Diff employs hybrid discrete action tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens that share a vocabulary with driving commands [5]
- The framework uses masked diffusion for generation, enabling parallel prediction of all token positions, which improves inference efficiency and allows global optimization [5][9]

Group 3: Decoding Strategies
- WAM-Diff explores three decoding strategies: causal, reverse-causal, and random; the reverse-causal strategy yields the best closed-loop metrics, matching the "end-to-begin" planning intuition [9][20]
- This confirms that establishing long-term driving intentions before detailing immediate actions significantly improves planning consistency and safety [9][20]

Group 4: MoE and GSPO Integration
- The MoE architecture within WAM-Diff includes 64 lightweight experts, dynamically activated based on driving context, enhancing model capacity and adaptability while controlling computational cost [12]
- The GSPO algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences for safety, compliance, and comfort [12][14]

Group 5: Experimental Results
- In extensive experiments on the NAVSIM benchmark, WAM-Diff outperformed several leading models, achieving 91.0 PDMS and 89.7 EPDMS, indicating robustness in balancing safety and compliance [16][18]
- On NAVSIM-v2, which adds stricter metrics for traffic-rule adherence and comfort, the model improved by 5.2 points over the previous best, showcasing its capability in realistic driving scenarios [18]

Group 6: Conclusion
- WAM-Diff represents a significant advance in autonomous driving planning, moving toward a discrete, structured, closed-loop approach and emphasizing both "how to generate" and "what to generate" in the VLA era [25]
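The three decoding strategies can be pictured as different schedules for revealing masked trajectory tokens. The sketch below is purely illustrative (not the WAM-Diff implementation); "reverse-causal" reveals the far-future end of the trajectory first, matching the "end-to-begin" intuition.

```python
import random

# Sketch of the three token-unmasking schedules compared in the paper.
# The schedule gives the order in which trajectory-token positions are
# revealed during masked-diffusion decoding. Illustrative only.

def unmask_schedule(num_tokens, strategy, seed=0):
    positions = list(range(num_tokens))
    if strategy == "causal":          # near future -> far future
        return positions
    if strategy == "reverse-causal":  # far future -> near future
        return positions[::-1]
    if strategy == "random":
        rng = random.Random(seed)
        rng.shuffle(positions)
        return positions
    raise ValueError(strategy)

print(unmask_schedule(5, "reverse-causal"))  # [4, 3, 2, 1, 0]
```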
Whole-body manipulation! Astribot releases an asynchronous fast-slow VLA policy: end-to-end training and 3× the inference speed of comparable models
具身智能之心· 2025-12-26 00:55
Paper: Astribot: Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Paper link: https://arxiv.org/pdf/2512.20188
Core highlights: a genuinely asynchronous fast-slow dual pathway, a modality-aligned bridge buffer, whole-body action tokenization, end-to-end joint training, and 3× the inference speed of comparable models.

Root of the problem: three core challenges for large-model-driven robotic manipulation. The design of DuoCore-FS stems from a precise reading of the pain points of existing VLA systems, and three core challenges form the starting point for its technical breakthrough:
- Frequency-coupling bottleneck: traditional VLA systems bind VLM inference and action generation to the same frequency, so the low inference speed of large models (typically <15 Hz, especially at 3B+ parameters) directly limits the response rate of whole-body control and cannot meet the real-time demands of multi-joint, dynamic scenes.
- Whole-body control representation challenge: ...
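The frequency-decoupling idea behind an asynchronous fast-slow policy can be sketched as a fast control loop that always reads the latest latent from a shared buffer, while a slow VLM pathway refreshes that buffer at its own rate. Names and rates below are assumptions for illustration, not DuoCore-FS code.

```python
# Illustrative sketch of the asynchronous fast-slow idea: a slow VLM
# pathway refreshes a shared latent buffer at a low rate, while a fast
# action pathway reads the most recent latent at a higher control rate,
# never blocking on the large model. Rates chosen for illustration.

SLOW_HZ, FAST_HZ = 5, 30          # VLM vs. action-head frequencies
DURATION_S = 2

latent_buffer = {"latent": None, "version": 0}
actions = []

for tick in range(DURATION_S * FAST_HZ):      # simulate fast-loop ticks
    if tick % (FAST_HZ // SLOW_HZ) == 0:      # slow pathway fires
        latent_buffer["latent"] = f"plan@{tick}"
        latent_buffer["version"] += 1
    # fast pathway always acts on the latest available latent
    actions.append((tick, latent_buffer["version"]))

print(len(actions), latent_buffer["version"])  # 60 fast steps, 10 slow updates
```

In a real system the two pathways would run in separate threads or processes; the single loop here just makes the 6:1 update ratio visible.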
Breaking through the 2D-3D gap! Peking University proposes VIPA-VLA: video unlocks precise robot manipulation
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses a new approach to robot learning that addresses the challenge of aligning 2D visual information with 3D spatial understanding, a significant limitation of existing vision-language-action (VLA) models [3][6][41]
- The research introduces a novel pre-training paradigm that uses human demonstration videos to enhance robots' spatial perception, allowing them to infer 3D spatial relationships from 2D visual inputs [4][40]

Research Background
- Current VLA models are limited by reliance on expensive robot datasets and the lack of explicit 3D spatial modeling, which hampers their ability to accurately map physical actions [6][7]
- Human demonstration videos offer a solution, providing diverse scenarios and inherent visual-physical correspondences that serve as valuable supervision signals for robot learning [7][8]

Hand3D Dataset
- The Hand3D dataset, comprising Hand3D-visual and Hand3D-action components, is described as a "3D spatial textbook" for robots, enabling them to learn visual-physical alignment [8][9]
- The dataset draws on nine heterogeneous human manipulation datasets, ensuring a wide variety of scenes and tasks [8][9]

Model Architecture: VIPA-VLA
- VIPA-VLA features a dual-encoder architecture that integrates semantic visual features with 3D spatial features, improving the model's grasp of both scene semantics and spatial structure [15][20]
- A cross-attention fusion layer combines the two feature streams, allowing the model to learn 3D relationships from 2D inputs [17][20]

Training Process
- Training proceeds in three phases: 3D visual pre-training, 3D action pre-training, and post-training for task adaptation, ensuring a gradual acquisition of 3D capabilities [21][22]
- The first phase aligns semantic and spatial features, while the second teaches the model to predict 3D motion tokens from visual-language inputs [22][23]

Experimental Results
- VIPA-VLA outperformed existing baselines across tasks, achieving success rates of 92.4% in single-view and 96.8% in dual-view settings on the LIBERO benchmark [27][28]
- On the RoboCasa benchmark, VIPA-VLA reached a 45.8% success rate, surpassing other models, particularly on tasks requiring precise 3D positioning [30]
- The model performed strongly on real-world tasks, achieving a 60% success rate on the Wipe-Board task, significantly higher than competing models [31][34]

Significance and Future Directions
- The research presents a paradigm for robot learning that reduces reliance on costly robot data and improves generalization by leveraging human demonstration videos [40][41]
- Future work aims to combine this pre-training paradigm with robot-data pre-training and to expand the Hand3D dataset to more complex human-robot interaction tasks [40][41]
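The cross-attention fusion step, where semantic tokens query spatial tokens, can be sketched with a minimal single-head attention in plain Python. The shapes, the query/key roles, and the toy vectors are illustrative assumptions about the dual-encoder design, not the actual VIPA-VLA code.

```python
import math

# Minimal single-head cross-attention sketch: semantic tokens (queries)
# attend over 3D spatial tokens (keys/values). Illustrative only.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """queries: list of d-dim vectors; keys/values: aligned lists."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # how much each spatial token matters
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

sem = [[1.0, 0.0]]                    # one semantic token
spa_k = [[1.0, 0.0], [0.0, 1.0]]      # two spatial keys
spa_v = [[10.0, 0.0], [0.0, 10.0]]    # matching spatial values
out = cross_attend(sem, spa_k, spa_v)
print(out)  # weighted toward the first spatial value
```

The fused vector leans toward the spatial token whose key matches the semantic query, which is the mechanism by which 2D semantics pick up 3D structure.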
From $100 billion to $25 trillion: the embodied market enters a new order of magnitude
具身智能之心· 2025-12-25 09:30
A research report recently shared in our embodied community: Morgan Stanley projects that the global embodied intelligence market will reach $25 trillion by 2050. For comparison, in 2025 this market is roughly $100 billion, covering autonomous vehicles, robots of various forms, drones, and more. That implies the embodied intelligence market will grow 250-fold over the next 25 years.

Within that $25 trillion market:
- humanoid robots are projected at $7.5 trillion by 2050;
- autonomous vehicles at $5.6 trillion;
- service robots at $5 trillion;
- aerial drones at $4.7 trillion;
- other non-humanoid robots at roughly $2.2 trillion.

In recent months we have received a great deal of information from investors. As things stand, the global embodied intelligence market will keep expanding in 2026, and the investment outlook is very promising.

We hope everyone will keep following our embodied community, where new content is continually updated. The community is also actively preparing research-report coverage, and students looking to enter or advance in the embodied field are very welcome to join. Over nearly a year of building, the community has completed sections for technical-roadmap sharing, livestreams, Q&A, job hunting, competitions, and more, closing the loop across industry, academia, job seeking, and discussion. We are committed to cultivating more outstanding talent for the industry and providing more opportunities. Our New Year newcomer discount, the biggest of the year, is now open as we close out the year; new students are welcome to scan ...
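The headline figures imply a specific compound growth rate, which is worth sanity-checking: 250× over 25 years works out to roughly 25% per year.

```python
# Quick sanity check of the report's numbers: growing a ~$100B market
# in 2025 to $25T by 2050 is a 250x increase.
market_2025 = 100e9          # ~$100 billion
market_2050 = 25e12          # $25 trillion
years = 2050 - 2025

multiple = market_2050 / market_2025
cagr = multiple ** (1 / years) - 1
print(f"{multiple:.0f}x over {years} years = {cagr:.1%} per year")
```

So the projection assumes the sector compounds at about 24.7% annually for a quarter century.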
The first 3DGS-based VLN embodied learning dataset: Manycore Tech (群核科技) and Zhejiang University open-source SAGE-3D
具身智能之心· 2025-12-25 04:01
Core Insights
- The article discusses advancements in embodied intelligence, focusing on the SAGE-3D dataset and its implications for vision-language navigation (VLN). It highlights the transition of 3DGS from a mere rendering tool to a functional navigation environment incorporating semantic and physical attributes, enabling robots to understand and interact with their surroundings effectively [2][3][30]

Group 1: 3DGS Technology and Its Limitations
- Embodied data is recognized as a core asset in robotics, with the ability to generate high-quality data being crucial for competitive advantage [2]
- 3DGS generates realistic 3D point-cloud models from real scenes but lacks essential physical information such as area, size, and geometric structure, limiting its use in navigation tasks [2][9]
- The SAGE-3D dataset addresses these limitations by providing a navigable environment with physical collision detection, allowing robots to interpret complex instructions and navigate safely [3][10]

Group 2: SAGE-3D Dataset and Its Features
- SAGE-3D consists of two main components: the InteriorGS dataset, with 1,000 finely annotated indoor scenes and over 554,000 object instances, and SAGE-Bench, a VLN benchmark with 2 million trajectory-instruction pairs [13][14]
- The dataset supports a hierarchical instruction-generation framework that combines high-level semantic goals with low-level action commands, enhancing the robot's ability to follow complex instructions [18][22]
- SAGE-3D's hybrid 3DGS representation allows high-fidelity rendering while embedding physical properties, letting robots interact with the environment without issues such as mesh penetration [22][30]

Group 3: Performance and Evaluation
- Models trained on SAGE-3D, such as NaVILA-SAGE, show superior VLN performance, achieving a success rate of 0.46, significantly higher than traditional models [21][23]
- SAGE-Bench introduces new evaluation metrics, such as continuous success rates and collision penalties, that capture the nuances of navigation performance for a more comprehensive assessment of model capabilities [27][29]
- The dataset shows strong generalization: models trained exclusively on it outperform baselines in unseen scenarios, indicating effectiveness in real-world applications [26]

Group 4: Future Implications
- SAGE-3D redefines the application boundaries of 3DGS, paving the way for more complex outdoor scenarios and multi-robot collaboration [30][31]
- Integrating semantic and physical capabilities into 3DGS not only enhances robot navigation but also supports the development of more sophisticated embodied intelligence systems [31]
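A collision-penalized, continuous success metric of the kind SAGE-Bench describes could look like the sketch below. The formula, radius, and penalty weight are illustrative assumptions, not the benchmark's actual definitions.

```python
# Hypothetical sketch of a collision-penalized navigation score combining
# continuous success (partial credit for getting close to the goal) with
# a per-collision penalty. Weights are illustrative assumptions only.

def nav_score(reached_goal, goal_distance_m, collisions,
              success_radius_m=3.0, collision_penalty=0.1):
    # Continuous success: full credit on success, linear falloff with
    # distance otherwise, floored at zero.
    closeness = max(0.0, 1.0 - goal_distance_m / success_radius_m)
    base = 1.0 if reached_goal else closeness
    return max(0.0, base - collision_penalty * collisions)

print(nav_score(True, 0.0, 0))    # clean success
print(nav_score(False, 1.5, 2))   # halfway there, two collisions
```

Compared with a binary success rate, this style of metric distinguishes an agent that nearly reaches the goal cleanly from one that bulldozes furniture on the way.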
Facing VLA's "Achilles' heel": TeleAI improves embodied inference stability
具身智能之心· 2025-12-25 01:41
Core Insights
- The article discusses the rapid development of Vision-Language-Action (VLA) models in embodied intelligence, highlighting instability during the inference phase as a challenge that hinders real-world application [1][3]
- A new framework, TACO (Test-time Anti-exploration via pseudo-COunts), is introduced to address this instability, with experiments showing significant improvements in task success rates [1][4]

Group 1: VLA Model Challenges
- VLA models are extremely sensitive to initial noise during inference; even after fine-tuning, success rates can swing from 0% to 80% [4][5]
- The instability is attributed to two factors: retention of redundant action patterns from diverse training data, and the multimodal nature of fine-tuning datasets, which may encode suboptimal strategies [6][8]

Group 2: TACO Framework
- TACO applies an "anti-exploration" principle from offline reinforcement learning, constraining generated actions to the successful patterns of the fine-tuning dataset and avoiding irrelevant action patterns [10][12]
- The framework includes a coupled pseudo-count estimator that uses the VLA model's internal representation to validate actions without additional training resources [12][13]

Group 3: Performance Improvements
- TACO raises the average success rate of the π0 model from 32.2% to 41.3% in simulated environments, with notable gains on challenging tasks [24][26]
- In real-robot experiments, TACO increased the average success rate from 40% to 56%, with individual tasks improving by up to 25% [32][34]

Group 4: Technical Mechanisms
- TACO's two-stage reasoning process generates diverse action candidates and validates them through pseudo-counts, ensuring high fidelity in action representation [18][19]
- A shared observation key-value cache significantly reduces computational cost, allowing efficient real-time operation [21][22]

Group 5: Future Directions
- TACO not only addresses practical issues but also opens new perspectives for VLA research, with plans to extend it to more complex multi-task scenarios and enhance long-term planning capabilities [38][39]
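The two-stage "generate candidates, then pick the best-supported one" idea can be sketched with a toy pseudo-count. The binning estimator below is an illustrative stand-in for the paper's coupled estimator over VLA internal representations; all names and numbers are assumptions.

```python
import random

# Toy sketch of TACO's anti-exploration selection: sample several action
# candidates, score each by a pseudo-count of how often similar actions
# appear in the fine-tuning data, and execute the best-supported one.
# The 1-D binning pseudo-count is an illustrative stand-in only.

def pseudo_count(action, dataset_actions, bin_width=0.25):
    bucket = round(action / bin_width)
    return sum(1 for a in dataset_actions
               if round(a / bin_width) == bucket)

def select_action(candidates, dataset_actions):
    # Stage 2: validate candidates, preferring in-distribution actions.
    return max(candidates, key=lambda a: pseudo_count(a, dataset_actions))

rng = random.Random(0)
dataset = [rng.gauss(1.0, 0.1) for _ in range(200)]  # in-distribution mode
candidates = [0.2, 1.02, 2.5]   # only the middle one is well supported
chosen = select_action(candidates, dataset)
print(chosen)  # 1.02
```

The out-of-distribution candidates get pseudo-counts near zero, so the policy is steered back onto action patterns the fine-tuning data actually contains, which is the stabilizing effect described above.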