具身智能之心
Various Aspects of Real-Robot Deployment for VLA and RL
具身智能之心· 2026-01-10 01:03
Core Viewpoint
- The article discusses the deployment of VLA models and advances in AI chip technology, focusing on the capabilities and roadmap of DiGua Robot's AI chips for consumer robotics and embodied-intelligence scenarios [3][4].

Group 1: AI Chip Development
- DiGua Robot offers AI chips with computing power ranging from 5 TOPS to 560 TOPS; the RDK S100 provides 80 to 120 TOPS, and the newly released RDK S600 offers 560 TOPS, with products launching next year [4].
- Deploying models on chips typically requires quantization, converting floating-point models to fixed-point models, which improves efficiency and reduces power consumption [5].
- The RDK S600 is optimized for deploying large models, achieving two to three times the throughput of mainstream chips on models of around 7 billion parameters [4][5].

Group 2: Model Size and Performance
- Current models are primarily in the range of 3 billion to 7 billion parameters, which is deemed sufficient given the limited training data available [6].
- There is a trend toward developing larger models, but deploying them on edge devices is not always necessary, as smaller models can still perform adequately on specific tasks [7].
- Model size also matters for reinforcement learning (RL): larger foundation models can potentially enhance RL capabilities [7][8].

Group 3: Lightweight Model Approaches
- Recent lightweight VLA work focuses on engineering optimizations rather than merely shrinking model size, emphasizing optimization of operators and compilation strategies [10][11].
- Model distillation, quantization, and operator-level optimization are crucial for improving deployment efficiency and training speed [12][13].
- Foundation models in the embodied-intelligence space evolve rapidly, so deployment and training processes must adapt quickly to keep pace [14].
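The quantization step mentioned above, converting a floating-point model to fixed-point, can be illustrated with a minimal symmetric per-tensor int8 scheme. This is a generic sketch for intuition only, not DiGua Robot's actual toolchain:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: float32 -> (int8 values, scale)."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 values and scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
# Rounding error is bounded by half a quantization step.
assert err <= s / 2 + 1e-6
```

Storing int8 values instead of float32 cuts memory traffic by 4x, which is where most of the efficiency and power savings on edge chips come from.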
With a Real "Feel" and Greater Efficiency: Daimon's DM-EXton2 Offers a Breakthrough for High-Quality Embodied Data Collection
具身智能之心· 2026-01-09 09:00
In 2026, the race for high-quality data collection in embodied intelligence has begun. On the path to teaching robots truly "dexterous" manipulation, datasets that fuse real perceptual signals from human interaction with the physical world, such as touch and applied force, have become an indispensable core "means of production": such data enables robots to manipulate precisely, obtain genuine interaction feedback, and generalize fine manipulation skills to more scenarios. Yet the core bottleneck for commercializing embodied intelligence at scale is precisely the lack of this high-quality physical-interaction data. On January 6, 2026, local time in Las Vegas, Nevada, CES 2026, hailed as the "global tech gala" and an "industry bellwether," opened as scheduled. Daimon Robotics made a striking debut with DM-EXton2, the world's first teleoperation data-collection system with force/tactile feedback, giving robot manipulation a more precise "sense of touch," providing foundational support for robots to become general-purpose and autonomous, and offering the industry a scalable path forward. As a bridge between human and robot intelligence, a teleoperation data-collection system remotely controls a robot while synchronously recording multi-dimensional data from human operation, including motion, force control, and touch, supplying high-quality "teaching material" for robot self-learning. Against the backdrop of rapid AI progress, the core bottleneck in robot evolution has shifted from compute and algorithms to physical ...
For a Cost of Only 2,000 Yuan: Reproducing All Kinds of VLA Tasks
具身智能之心· 2026-01-09 00:55
Core Viewpoint
- The article discusses the challenges beginners face in VLA (Vision-Language-Action) tasks due to high costs and the complexity of data collection and model training, and introduces a comprehensive course that addresses these issues and teaches practical skills to aspiring professionals in the field [3][5][9].

Group 1: Challenges in VLA Tasks
- Many students are frustrated by the high cost of robotic arms and sensors, which can exceed 15,000 yuan, making it difficult for self-learners or those without equipment to work on VLA tasks [3].
- Open-source low-cost robotic arms are available, but many beginners struggle to achieve good results because of difficulties in data collection and model training [4].
- Students waste significant time troubleshooting data collection, model training, and deployment, particularly with complex models such as π0, π0.5, and GR00T [5].

Group 2: Course Offerings
- The "Embodied Intelligence Heart" platform has reproduced methods such as ACT, GR00T, π0, and π0.5 using the SO-100 arm and LeRobot, to help students who lack access to expensive equipment and do not know where to start [8].
- A comprehensive hands-on VLA course has been developed in collaboration with industry experts, focusing on real-world applications and job readiness [9][14].
- The course covers robotic-arm hardware, data collection, VLA algorithms, evaluation, simulation, deployment of mainstream VLA models, and a range of real-world experiments [14][15].

Group 3: Course Details and Requirements
- Students who purchase the course receive an SO-100 robotic arm, including both the teaching and execution arms, delivered directly to them [18].
- The course targets individuals seeking practical experience and projects in the VLA field, including those transitioning from traditional computer vision, robotics, or autonomous driving [25].
- It requires foundational knowledge of Python and PyTorch, as well as experience with real-robot debugging and data collection [25].
Stanford's Latest Whole-Body Motion Control Approach, Generalizing Across Terrains!
具身智能之心· 2026-01-09 00:55
Core Insights
- The article discusses challenges and advances in humanoid robot locomotion, emphasizing that multi-limb coordination is needed to navigate complex environments effectively [2][3][5].

Research Background and Core Challenges
- Traditional humanoid locomotion research focuses primarily on legged locomotion, but real-world scenarios require using additional body parts for stability and support [2].
- The research identifies two main challenges: planning and robustly controlling contact-rich motion in complex environments, and switching flexibly between skills across different terrains [3][5].

Core Methodology
- A hierarchical framework combining physics-based keyframe animation and reinforcement learning is proposed, consisting of four main components: keyframe generation, policy training, skill selection, and hierarchical execution [4][5].

Keyframe Motion Generation
- The study uses a GUI tool built on the MuJoCo physics engine to create keyframe animations that encode human movement knowledge while addressing physical realism and manual tuning costs [7].
- Keyframes are inherently open-loop, so reinforcement learning is needed to develop adaptive motion-tracking policies [8].

Motion Tracking Strategies
- Policies are grouped into three types, ensuring seamless transitions between four standard postures (standing, crawling, prone, supine) [9].
- The training reward includes terms for tracking accuracy, energy efficiency, and preventing premature episode termination [10].

Visual Skill Classifier
- A visual skill classifier autonomously selects the appropriate movement skill from environmental perception, with skills categorized as movement, transition, or terrain-specific [11].

Hierarchical Policy Execution
- The framework separates visual planning from low-level control, improving robustness and real-time responsiveness [12].

Experimental Validation
- Data collection involved real-world testing with a robot equipped with dual fisheye cameras, and a ResNet classifier was trained to balance computational efficiency and geometric feature capture [15].
- The system demonstrated zero-shot transfer success across various obstacle configurations, validating the effectiveness of the motion-tracking policies [18][23].

Conclusion and Future Directions
- The research presents a hybrid keyframe-plus-reinforcement-learning framework, achieving humanoid mobility over complex terrain and demonstrating zero-shot transfer capabilities [28].
- Future work may automate keyframe design, improve motion quality through better interpolation methods, and optimize contact-dynamics modeling to improve performance in contact-rich tasks [28].
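A reward combining tracking accuracy, energy efficiency, and survival, as listed above, is commonly written as a weighted sum. The term forms, weights, and names below are illustrative assumptions, not the paper's actual reward:

```python
import numpy as np

def tracking_reward(q, q_ref, tau, alive, w_track=1.0, w_energy=0.01, w_alive=0.5):
    """Illustrative motion-tracking reward: accuracy + efficiency + survival.

    q, q_ref : current and keyframe-reference joint positions (radians)
    tau      : applied joint torques
    alive    : False once the robot falls (episode would terminate early)
    """
    r_track = np.exp(-np.sum((q - q_ref) ** 2))   # peaks at 1 when tracking is exact
    r_energy = -w_energy * np.sum(tau ** 2)       # penalize actuation effort
    r_alive = w_alive if alive else 0.0           # bonus for not terminating
    return w_track * r_track + r_energy + r_alive

# Perfect tracking with zero torque yields the maximum reward of 1.5.
q = np.zeros(12); q_ref = np.zeros(12); tau = np.zeros(12)
r = tracking_reward(q, q_ref, tau, alive=True)
```

The exponential tracking term keeps the reward bounded, and the alive bonus discourages the policy from "giving up" early, matching the premature-termination concern above.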
From BAAI, HKUST, and Others: RoboMirror Lets Robots First "Understand" a Video, Then Precisely Replicate Every Action
具身智能之心· 2026-01-09 00:55
Core Insights
- The article introduces RoboMirror, a new paradigm in embodied intelligence that allows robots to understand and imitate human actions from video input without relying on traditional motion capture or pose estimation [3][5][6].

Industry Pain Points
- Traditional robotic imitation has been limited to mechanical replication, suffering from high latency, significant errors, and failure in first-person-perspective scenarios [3][5].
- Because robots do not understand the intent behind actions, learning and execution are inefficient [5][6].

RoboMirror Framework
- RoboMirror is a two-stage framework that transforms video input into robot motion, emphasizing understanding before imitation [6][12].
- The first stage uses a vision-language model (VLM) to extract motion intent from video; the second stage employs a teacher-student policy architecture for precise action execution [6][10].

Performance Metrics
- RoboMirror achieved a task success rate of 0.99 on the Nymeria dataset, well above the 0.92 baseline [17].
- The joint position error (MPJPE) was reduced by nearly 50% compared to baseline methods, indicating more accurate generated actions [17].
- End-to-end time from video input to action execution dropped from 9.22 seconds to 1.84 seconds, roughly an 80% improvement [17].

Real-World Application
- The article highlights demonstrations of RoboMirror accurately understanding and replicating actions from video input in real-world scenarios [25][27].
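MPJPE, the joint-error metric cited above, is the mean Euclidean distance between predicted and ground-truth joint positions. A minimal reference implementation (the array shapes are assumptions about how such data is laid out, not RoboMirror's code):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth 3D joint positions.

    pred, gt : arrays of shape (num_frames, num_joints, 3), in meters
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

gt = np.zeros((2, 17, 3))
pred = gt.copy()
pred[..., 0] = 0.05                     # every joint off by 5 cm along x
assert abs(mpjpe(pred, gt) - 0.05) < 1e-9
```

Halving MPJPE, as reported, means the generated joint trajectories land on average twice as close to the reference motion.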
Morgan Stanley Forecasts Robot-Hardware Revenue Alone Reaching $25 Trillion
具身智能之心· 2026-01-08 09:30
Core Insights
- The article discusses major tech and AI companies' ongoing push into embodied intelligence, arguing that their foray into the physical realm is not just beginning but has been progressing along a "digital-to-physical" evolution path [1][4].

Group 1: Industry Trends
- Major tech companies like Google, OpenAI, and Apple are transitioning from digital products to physical applications, with a focus on chatbots from 2022 to 2024 and plans to expand into wearable and mobile devices after 2025 [1].
- The global robotics market is projected to reach $25 trillion in hardware revenue by 2050, with growth expected to surpass $500 billion by 2030 and approximately $9 trillion by 2040 [4].

Group 2: Market Potential
- Supporting this market will require a vast array of components and computational resources. For instance, selling 1.4 billion robots by 2050 would require 5.7 billion cameras valued at $277 billion, 27 billion motors worth $2.5 trillion, and 1.7 million tons of rare-earth magnets valued at $167 billion [7].

Group 3: Competitive Landscape
- China has established a leading position in robotics and physical AI, with the report highlighting that the gap between China and Western countries continues to widen thanks to strong policy support and guidance [8].
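The per-robot implications of the component figures above follow from simple division (the arithmetic here is mine, not the report's):

```python
# Component figures from the forecast: 1.4B robots by 2050.
robots = 1.4e9
cameras, cam_value = 5.7e9, 277e9       # units and total value in USD
motors, motor_value = 27e9, 2.5e12

cams_per_robot = cameras / robots        # ~4.1 cameras per robot
motors_per_robot = motors / robots       # ~19.3 motors per robot
avg_cam_price = cam_value / cameras      # ~$48.6 per camera
avg_motor_price = motor_value / motors   # ~$92.6 per motor
```

So the forecast implicitly assumes roughly four cameras and twenty motors per robot, at mid-double-digit average unit prices.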
At CES 2026, Chinese Embodied-Intelligence Companies Wow Overseas Audiences...
具身智能之心· 2026-01-08 09:30
Group 1: CES 2026 Overview
- CES 2026 in Las Vegas showcased over 4,000 exhibitors, with 942 Chinese companies participating, approximately 22% of the total, making China the second-largest exhibiting country globally [4].
- This year's focus was on AI and hardware, with a notable increase in the presence of Chinese brands [3].

Group 2: Embodied Intelligence Sector
- Among the 38 humanoid-robot exhibitors, 21 were from China, representing 55% of the total, indicating a shift toward showcasing engineering delivery capabilities and real-world operational abilities [6].
- Chinese company UniX AI gained significant media attention at CES, being featured in major global news outlets shortly after its presentation [6].
- Unitree showcased its H1 humanoid robot, which has 360-degree perception capability and a 15 kg load capacity suitable for industrial inspections [7].
- Galaxy General demonstrated a collaborative operation between humanoid and quadruped robots, introducing a new outdoor inspection model with enhanced environmental adaptability and an 8-hour battery life [9].

Group 3: Autonomous Driving Sector
- Companies like Geely, Great Wall, Leap Motor, and New Stone presented their products at CES, indicating that autonomous driving is becoming a consumer product [11].

Group 4: AI Glasses and Consumer Products
- Among the 23 AI-glasses brands, nearly 70% were Chinese, including Alibaba, XREAL, and Thunderbird [15].
- Established domestic brands like TCL and Hisense showcased their strengths, while emerging brands like MOVA and Yingshi also made their presence felt [17].
Open-Sourcing 10,000 Hours of Embodied-Intelligence Data: What Is This Company After?
具身智能之心· 2026-01-08 04:23
Core Viewpoint
- The article emphasizes the importance of high-quality, large-scale, and diverse datasets for advancing embodied intelligence, highlighting the release of the "10Kh RealOmni-Open DataSet" by JianZhi Robotics as a significant milestone for the industry [1][4][16].

Dataset Overview
- The "10Kh RealOmni-Open DataSet" consists of over 10,000 hours of data and nearly one million clips, making it the largest and most generalized open dataset in the field [1][4].
- The dataset focuses on 10 common household tasks, ensuring that each skill has over 10,000 clips, which provides both scale and depth of skill coverage [4][5].

Data Quality and Specifications
- Recordings are captured at a resolution of 1600x1296 pixels and a frame rate of 30 fps, ensuring clarity and detail in the captured actions [4][5].
- Trajectories achieve centimeter-level precision, with advanced IMU hardware and cloud-based reconstruction techniques pushing accuracy to sub-centimeter levels [4][12].

Skill and Task Coverage
- The dataset prioritizes tasks that cannot be completed with a single hand in real scenarios, with 99.2% of the clips involving "two-handed, long-range tasks," providing a realistic representation of household activities [5][7].
- The average clip length is 1 minute 37 seconds, capturing the complete process of each task rather than static snapshots, which aids in understanding action logic and causality [5][7].

Data Collection Methodology
- The data was collected from 3,000 real households, ensuring a rich variety of scenarios and natural human interactions and addressing the limitations of traditional data-collection methods [7][9].
- JianZhi Robotics runs a comprehensive data-production chain, allowing rapid accumulation: nearly one million clips were collected in just two months [9][11].

Technological Infrastructure
- The Gen DAS Gripper is a key component in the data-collection process, enabling quick deployment without extensive site preparation [11].
- The Gen Matrix data platform processes and cleans the collected data, achieving high-precision trajectory reconstruction and synchronization of heterogeneous data sources [13].

Future Directions
- Open-sourcing this dataset is seen as a way to accelerate innovation in embodied intelligence by filling data gaps, standardizing formats, and lowering research barriers [16].
- JianZhi Robotics plans to continue enhancing its data infrastructure and releasing more datasets and services, fostering a positive cycle of data sharing, model optimization, and practical application [16].
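The published capture specs translate directly into per-clip frame and pixel budgets; a quick back-of-envelope check (plain arithmetic on the article's stats, nothing from the dataset's actual schema):

```python
# Capture specs quoted in the article.
fps = 30
width, height = 1600, 1296
avg_clip_s = 60 + 37                 # average clip length: 1 min 37 s

frames_per_clip = avg_clip_s * fps   # frames in an average clip
pixels_per_frame = width * height    # ~2.1 megapixels per frame
```

An average clip therefore carries 2,910 frames of roughly 2.1-megapixel imagery, which is why the article stresses cloud-side processing for reconstruction and cleaning.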
Packed with Detail: The Industry's First Open-Source Real-Robot Dexterous-Manipulation Dataset Tackles Robots' "Can See It but Can't Feel It" Problem
具身智能之心· 2026-01-08 04:23
Core Viewpoint
- The article emphasizes the significance of the newly open-sourced, high-quality tactile manipulation dataset for dexterous robotic hands, which addresses the industry's urgent need for accurate physical-interaction data and is expected to drive advancements in humanoid robotics in 2026 [6][7][42].

Group 1: Challenges in Dexterous Manipulation
- Dexterous manipulation is challenging for three main reasons: the lack of mature hardware products, difficulties in model training that relies solely on visual data, and a scarcity of high-quality tactile data [6][9].
- The key limitation is that physical properties like force and material cannot be perceived effectively through visual data alone, leading to the problem of "seeing but not feeling" [9][24].

Group 2: Open-Sourced Dataset Details
- The dataset consists of 800 high-quality tactile manipulation samples, providing a continuous multimodal learning resource that connects vision, force, touch, and action [10][11].
- It includes real-world scenarios such as fruit grasping, package sorting, and material loading, ensuring a realistic representation of complex operational environments [9][12].
- It features multimodal data with added tactile and six-axis force measurements, enhancing the robot's ability to perceive the physical attributes of objects [9][11].

Group 3: Technical Advancements
- The dataset introduces five key enhancements: arrayed tactile data, higher-dimensional force-control data, 3D spatial information, synchronized visual and tactile perception, and a broad range of real-world scenarios to prevent overfitting [15][16].
- Tactile data is collected using a 6×12×5 sensor array, allowing the robotic hand to accurately sense material properties and contact states, while the six-axis force data provides high precision [15][20].

Group 4: Impact on Robotics
- The dataset is expected to improve the success rate of robotic manipulation by enabling real-time perception and adjustment of grasps based on object shape and applied force [25][26].
- Integrating tactile and visual data lets robots move beyond the limits of pure visual perception, enhancing operational stability in complex environments [26][27].
- The dataset's broad coverage across household, logistics, and consumer-goods settings will help robots adapt to different materials and shapes [27][31].

Group 5: Future Prospects
- Open-sourcing this dataset is anticipated to catalyze development across the entire embodied-intelligence industry chain, fostering innovation and application in robotics [40][41].
- The Leju OpenLET community was established as a collaborative platform for developers and researchers, accelerating the development and industrial application of embodied-intelligence technologies [43].
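One synchronized multimodal sample like those described above could be modeled as follows. Only the 6×12×5 tactile array and the six-axis wrench come from the article; the field names, camera resolution, and joint count are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TactileSample:
    """One synchronized frame of a vision-force-touch-action record (hypothetical schema)."""
    rgb: np.ndarray        # camera image, e.g. (H, W, 3) uint8
    tactile: np.ndarray    # 6x12x5 taxel array from the hand's tactile sensors
    wrench: np.ndarray     # six-axis force/torque: (Fx, Fy, Fz, Tx, Ty, Tz)
    joint_cmd: np.ndarray  # hand joint-position command (the action label)

sample = TactileSample(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    tactile=np.zeros((6, 12, 5), dtype=np.float32),
    wrench=np.zeros(6, dtype=np.float32),
    joint_cmd=np.zeros(16, dtype=np.float32),
)
assert sample.tactile.shape == (6, 12, 5)
```

Keeping all four modalities in one time-aligned record is what enables the synchronized visual-tactile training the article highlights.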
SOP: A New Paradigm for Online Evolution in Embodied Intelligence, Built for Large-Scale Real-World Deployment
具身智能之心· 2026-01-08 04:23
Core Viewpoint
- The article discusses the development and implementation of SOP (Scalable Online Post-training), a system designed for the scalable deployment and intelligent operation of general-purpose robots in real-world environments, emphasizing the need for continuous evolution and learning in robotic systems [2][3][23].

Group 1: SOP Overview
- SOP is the industry's first system to systematically integrate online learning, a distributed architecture, and multi-task versatility for real-world robot deployment [2].
- Its core goal is to enable distributed, continuous online learning for robots in real-world settings [5].

Group 2: Challenges in Real-World Deployment
- General-purpose robots must meet high task-specialization requirements while leveraging existing VLA pre-trained models, which often require post-training to improve task success rates [3][4].
- Current mainstream VLA post-training methods are limited by offline, single-machine, serial data collection, hindering efficient and continuous real-world learning [3].

Group 3: SOP Framework Design
- SOP redefines VLA post-training from "offline, single-machine, sequential" to "online, cluster, parallel," creating a low-latency closed-loop system [6].
- Multiple robots execute tasks in parallel, with cloud-based centralized online updates and immediate model-parameter feedback [6][9].

Group 4: Key Advantages of SOP
- Distributed multi-robot parallel exploration enlarges state-space exploration, significantly improving state-action coverage [12].
- Because all robots operate on the latest low-latency policy, distribution bias is mitigated and online training becomes more stable and consistent [13].
- SOP improves performance while maintaining generalization, avoiding degradation into single-task specialists [14].

Group 5: Experimental Evaluation
- SOP significantly improves performance metrics, with a 33% overall performance increase in complex scenarios and a 114% throughput increase on specific tasks such as folding clothes [16][18].
- More robots improve learning efficiency: a four-robot setup achieved a 92.5% success rate, 12% higher than a single robot [19][20].
- SOP demonstrates stable effectiveness across different pre-training scales, with performance improvements correlating with the quality of VLA pre-training [21].

Group 6: Deployment and Evolution
- SOP allows robots to adapt and improve in new environments, transforming them from fixed-performance products into evolving entities capable of continuous learning [23].
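The "online, cluster, parallel" loop described above can be sketched as a toy parameter server: robots collect rollouts against the latest policy version, the cloud updates centrally and pushes parameters back. Everything below (the class, the averaging update, the synthetic gradient) is an illustrative assumption, not SOP's actual algorithm:

```python
import numpy as np

class ParameterServer:
    """Toy centralized learner: aggregates worker gradients, bumps the version."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.version = 0

    def update(self, grads, lr=0.1):
        self.params -= lr * np.mean(grads, axis=0)  # average across robots
        self.version += 1

    def pull(self):
        return self.params.copy(), self.version     # robots fetch the latest policy

def rollout_gradient(params, rng):
    # Stand-in for a real robot rollout: the gradient pulls the policy toward
    # an all-ones "target", plus a little exploration noise.
    return (params - np.ones_like(params)) + 0.01 * rng.standard_normal(params.shape)

rng = np.random.default_rng(0)
server = ParameterServer(dim=4)
for step in range(200):
    # Four robots collect in parallel against the same latest policy version.
    params, _ = server.pull()
    grads = [rollout_gradient(params, rng) for _ in range(4)]
    server.update(grads)

assert server.version == 200
assert np.allclose(server.params, np.ones(4), atol=0.05)
```

Averaging gradients from several robots per update is what enlarges state-action coverage, and pulling the latest version before every rollout is what keeps the data on-policy, mirroring the two advantages the article lists.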