具身智能之心
FlowVLA: cracking the "physical distortion" problem of VLA models and taking robot world modeling up another level
具身智能之心· 2025-08-29 00:03
Core Viewpoint
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models and introduces FlowVLA, a new framework that addresses these issues by implementing a Visual Chain of Thought (Visual CoT) principle, enhancing the model's ability to predict future frames through structured physical reasoning rather than mere pixel replication [5][8][36].

Group 1: Background and Current State
- VLA models, particularly those pre-trained as world models, show significant potential in general robotics, primarily through large autoregressive Transformers that learn environmental dynamics from vast video data [6][7].
- Existing models face critical flaws, including task confusion leading to prediction failures, inefficient knowledge transfer between passive observation and active control, and entangled learning of dynamics and appearance [7].

Group 2: Contributions of FlowVLA
- FlowVLA introduces a new learning framework that emphasizes structured physical reasoning by requiring the model to infer motion dynamics before predicting future frames [8][10].
- The model unifies appearance and motion reasoning within a single autoregressive Transformer, maintaining parameter efficiency and architectural simplicity [9][10].
- Experimental results validate FlowVLA's superior performance across various robot manipulation benchmarks, demonstrating enhanced sample efficiency and bridging the gap between pre-training and policy fine-tuning [10][20].

Group 3: Research Content
- The Visual CoT reasoning process decomposes frame prediction into a causal chain of "current frame → optical flow → future frame," allowing the model to separate dynamics learning from appearance learning (a minimal sketch of this idea follows this summary) [12][14].
- The two-phase training paradigm consists of a pre-training phase focused on world-model learning and a fine-tuning phase for adapting to control tasks [15][16].

Group 4: Experimental Analysis
- FlowVLA outperforms existing methods on the LIBERO dataset across all task sets, particularly excelling in long-horizon tasks, showcasing its robust understanding of physical dynamics [20][21].
- On the SimplerEnv dataset, FlowVLA demonstrates strong adaptability to visual domain shifts, achieving significant performance improvements in tasks where other models struggle [22][23].
- The model's sample efficiency is validated: it requires only one-third of the training steps to reach peak performance compared to baseline models, with a 55% higher peak success rate in low-data scenarios [30][32].

Group 5: Key Component Validation
- Ablation studies on the LIBERO-10 benchmark highlight the importance of the Visual CoT structure, the flow loss, and the interleaved sequence format, confirming their critical roles in the model's performance [33][34].

Group 6: Comparison with Related Work
- FlowVLA distinguishes itself from traditional VLA models by prioritizing dynamic understanding and establishing a robust world model before adapting to control tasks, thus laying a solid foundation for physical knowledge [35].
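A minimal sketch of the interleaved "frame → flow → frame" idea described in Group 3, assuming a discrete tokenizer for both frames and optical flow and a single causal Transformer. All module names, sizes, and the loss arrangement below are illustrative assumptions, not FlowVLA's released code.

```python
import torch
import torch.nn as nn

# Toy sketch of the Visual CoT idea: one causal Transformer over an interleaved
# token sequence  v_t, f_t, v_{t+1}, f_{t+1}, ...  so that motion (optical-flow
# tokens f) must be predicted before the next frame (tokens v). All sizes and
# module names are illustrative, not FlowVLA's actual implementation.

VOCAB, DIM = 1024, 256

class ToyWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)

def interleave(frame_tokens, flow_tokens):
    """Build the interleaved sequence v_t, f_t, v_{t+1}, f_{t+1}, ..."""
    chunks = []
    for v, f in zip(frame_tokens, flow_tokens):
        chunks += [v, f]
    return torch.cat(chunks, dim=1)

# Toy batch: 3 timesteps, each frame / flow map quantized to 8 discrete tokens.
frames = [torch.randint(0, VOCAB, (2, 8)) for _ in range(3)]
flows  = [torch.randint(0, VOCAB, (2, 8)) for _ in range(3)]
seq = interleave(frames, flows)                       # shape (2, 48)

model = ToyWorldModel()
logits = model(seq[:, :-1])                           # next-token prediction
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
# The reported "flow loss" could be read as extra weight on the flow-token
# positions of this same objective; that detail is an assumption here.
print(seq.shape, float(loss))
```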
A conversation with LimX Dynamics' Zhang Wei: building a robot is easy; the key is putting it to use
具身智能之心· 2025-08-29 00:03
Edited by QbitAI (量子位)

"Let there be no robot that is hard to put into real-world use."

After LimX Dynamics (逐际动力) described its positioning and mission to QbitAI in these terms, it became easier to see why the company attracted investment, or at least why it became Alibaba's first embodied-intelligence robotics investment.

In fact, founder Zhang Wei (张巍) has more often preferred the analogy of Nvidia, because Nvidia provides an underlying platform that makes innovation possible across gaming, automobiles, robotics, and other fields. LimX Dynamics was founded and has grown with the same ambition: to provide a robot platform on which developers in different fields can build their own application solutions and products.

Zhang Wei said, "The hardware body of a humanoid robot is very easy to build, easier than building an airplane or a car." What really blocks robots from being deployed is the brain, and getting a robot to be controlled well is the capability of making the "cerebellum" AI-driven (小脑AI化).

In the interview with QbitAI, Zhang Wei started from the very beginning: consensus views and non-consensus ones, about LimX Dynamics and about the embodied-intelligence industry at large. This time he covered nearly all of it. If you want to know ...
The Embodied Intelligence Heart technology exchange group has been established!
具身智能之心· 2025-08-28 08:36
Group 1
- The establishment of the Embodied Intelligence Heart Technology Exchange Group focuses on various advanced technologies including VLA, VLN, remote operation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1]
- Interested individuals can add the assistant's WeChat AIDriver005 to join the community [2]
- To expedite the group entry process, it is advised to include a note with the institution/school, name, and research direction [3]
Helping you land offers: this "Whampoa Military Academy" of the embodied-intelligence field is no ordinary place...
具身智能之心· 2025-08-28 08:36
Recently, more and more people have been sharing good news with Embodied Intelligence Heart: verbal offers landed during autumn recruitment, and successful moves from autonomous driving into embodied intelligence through experienced hiring. From what we understand, the corresponding salaries and prospects are very good.

Beyond that, many embodied-robot companies have commissioned us to develop more tutorials and features based on the EDU version of their robot hardware. This is already in preparation, and we will gradually publish these tutorials in our embodied community to help the industry develop.

The "Embodied Intelligence Heart Knowledge Planet" is the embodied community we have been maintaining. It brings together videos, articles, learning roadmaps, Q&A, and job-hunting exchange into a comprehensive community of nearly 2,000 members; we hope to grow to nearly 10,000 within the next two years, building a hub for exchange and technical sharing that many beginners and advanced learners visit regularly.

Inside the community we regularly answer all kinds of practical questions: How do I use a device? How do I collect data effectively? How do I deploy VA/VLA models? Is the capture background too complex, or is the data too dirty? Quick answers make it easy to apply solutions to your own projects.

A community that can solve problems when people need help most is undoubtedly valuable. The Embodied Intelligence Heart Knowledge Planet (the first full-stack embodied-intelligence community in China) has now closed the loop across industry, academia, job hunting, and Q&A exchange. Whatever problem comes up, a solution gets shared; whichever research direction is at the frontier, ideas keep coming; and job ...
Embodied Intelligence Heart is recruiting instructors for B-end and C-end training~
具身智能之心· 2025-08-28 01:20
Group 1
- The article announces the recruitment of teachers for embodied intelligence training, targeting both B-end (business) and C-end (consumer) training services, with compensation above industry standards [1]
- The training covers various advanced topics including VLA, VLN, remote operation, Diffusion Policy, reinforcement learning, sim2real, multimodal large models, simulation, motion control, and target navigation [2]
- B-end training is aimed at enterprises, universities, and research institutions, while C-end training focuses on students and job seekers, with responsibilities including curriculum design and material preparation [3]

Group 2
- Candidates are required to have a doctoral degree or higher (including those currently enrolled), with a preference for those who have published two papers in A-level or Q1 journals/conferences, or who have two years of industry experience [3]
- Interested individuals can add a specified WeChat contact for further inquiries [4]
Stanford proposes the RTR framework: a robotic arm helps train humanoid robots on real hardware
具身智能之心· 2025-08-28 01:20
Core Insights
- The article discusses the emerging focus on motion control of humanoid robots as a key application area for reinforcement learning (RL) algorithms, emphasizing the "Sim-to-Real" paradigm and the challenges of transferring learned behaviors from simulation to real-world environments [1][2].

Group 1: Current Challenges and Innovations
- Current methods primarily rely on domain randomization to train general control models across diverse simulated environments, aiming for zero-shot transfer to real-world dynamics [1][2].
- Recent efforts have begun to explore fine-tuning models with limited real-world data after simulation pre-training, with notable contributions from institutions such as NVIDIA and CMU [2].
- The inherent instability of humanoid robots poses significant risks during real-world training, making direct reinforcement learning in these environments a longstanding challenge [2].

Group 2: Proposed Solutions
- The article introduces an approach inspired by human learning, in which a "teacher" robotic arm guides a "student" humanoid robot through online reinforcement learning [3][5].
- The teacher arm serves multiple roles: providing safety, assisting with resets after failures, collecting training data, and structuring the learning process through curriculum learning [5][7].

Group 3: RTR System Overview
- The proposed system, named RTR (Robot-Trains-Robot), highlights the importance of physical assistance from the teacher robot for effective real-world learning [7][9].
- To address the high cost of real-world data collection, a novel RL algorithm is introduced that optimizes a low-dimensional latent variable related to environmental dynamics, significantly enhancing sample efficiency (a rough sketch of this adaptation step follows this summary) [7][9].

Group 4: Methodology and Experimental Validation
- The RTR system comprises hardware and algorithmic components, featuring a UR5 robotic arm as the teacher and a ToddlerBot humanoid as the student [9][10].
- The Sim-to-Real process is divided into three stages: training adaptable policies in simulation, optimizing a general latent variable, and performing online fine-tuning in the real world [10][12].
- Experimental results demonstrate the effectiveness of the RTR system on tasks such as walking and swinging, showing significant improvements in learning efficiency and performance compared to traditional methods [14][18].

Group 5: Future Implications
- The RTR framework not only addresses current limitations in humanoid robot training but also introduces a physical-assistance paradigm that could be applied to larger humanoid robots and other complex robotic systems [16][19].
- The findings suggest that integrating teacher robots can make the learning process more efficient and stable, which is crucial for advancing real-world applications of humanoid robotics [16][17].
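As a rough illustration of the latent-variable adaptation step mentioned in Group 3, the sketch below searches a low-dimensional dynamics latent z for a frozen latent-conditioned policy using a simple cross-entropy-method loop. The rollout function, the choice of optimizer, and all names are stand-in assumptions (real rollouts would run on the physical robot, with the teacher arm handling safety and resets); this is not the paper's actual algorithm or API.

```python
import numpy as np

# Sketch: adapt only a low-dimensional dynamics latent z in the real world,
# keeping the pre-trained latent-conditioned policy pi(a | s, z) frozen.
# rollout_return() stands in for "run one episode on the physical robot and
# return its reward"; here it is simulated so the script runs end to end.

LATENT_DIM = 4
rng = np.random.default_rng(0)

def rollout_return(z: np.ndarray) -> float:
    # Placeholder: episode return peaks when z matches the (unknown) real dynamics.
    true_z = np.array([0.5, -0.2, 0.1, 0.8])
    return -float(np.sum((z - true_z) ** 2)) + rng.normal(0.0, 0.05)

def adapt_latent(iters: int = 10, pop: int = 16, elite: int = 4) -> np.ndarray:
    """Cross-entropy-method search over z. Sample-efficient in real-world terms
    because only LATENT_DIM numbers are optimized, not the policy weights."""
    mu, sigma = np.zeros(LATENT_DIM), np.ones(LATENT_DIM)
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=(pop, LATENT_DIM))
        scores = np.array([rollout_return(z) for z in candidates])
        best = candidates[np.argsort(scores)[-elite:]]    # keep the elites
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu

z_star = adapt_latent()
print("adapted latent:", np.round(z_star, 2))
# A final stage would then fine-tune the policy itself online, conditioned on z_star.
```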
EgoTwin: a world model achieves, for the first time, joint embodied "video + action" generation in the same frame, precisely aligned in time and space
具身智能之心· 2025-08-28 01:20
Core Viewpoint
- The article discusses the EgoTwin framework, which enables simultaneous generation of first-person-perspective videos and human actions with precise alignment in both time and space, and which has attracted significant attention in the AR/VR, embodied intelligence, and wearable-device sectors [2][5].

Summary by Sections

Introduction
- The EgoTwin framework was developed collaboratively by institutions including the National University of Singapore, Nanyang Technological University, Hong Kong University of Science and Technology, and Shanghai Artificial Intelligence Laboratory [2].

Key Highlights
- EgoTwin combines first-person-perspective video generation with human action generation, addressing challenges such as aligning camera trajectories with head movement and establishing a causal loop between observation and action [8].
- It is trained on a large dataset of 170,000 segments of first-person multimodal real-world scenes, yielding significant performance improvements, including a 48% reduction in trajectory error and a 125% increase in hand-visibility F-score [8].

Method Innovations
- EgoTwin introduces three core technologies (a toy illustration of the second follows this summary):
1. A head-centric action representation that directly provides the head's 6D pose, reducing alignment errors [12].
2. A bidirectional causal attention mechanism that allows causal interactions between action tokens and video tokens [12].
3. An asynchronous diffusion mechanism that ensures synchronization while allowing independent noise addition and removal on different timelines [12].

Technical Implementation
- The model employs a three-channel diffusion architecture, optimizing computational efficiency by reusing only the necessary layers for the action branch [13].
- Training proceeds in three phases: initial training of the action VAE, followed by alignment training, and finally joint fine-tuning of all three modalities [21].

Data and Evaluation
- EgoTwin is trained and tested on the Nymeria dataset, which includes 170,000 five-second video clips covering various daily actions [17].
- A comprehensive evaluation system measures generation quality and consistency across the three modalities, using metrics such as I-FID, FVD, and CLIP-SIM [17].

Quantitative Experiments
- EgoTwin outperforms baseline methods across all nine evaluation metrics, demonstrating significant improvements in trajectory alignment and hand score, while also enhancing the fidelity and consistency of generated videos and actions [18][19].

Generation Modes
- The framework supports three generation modes: T2VM (text to video and motion), TM2V (text and motion to video), and TV2M (text and video to motion), showcasing its versatility [24].
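For the second of the three core technologies listed under "Method Innovations", the snippet below shows one plausible way a "bidirectional causal" attention mask between video tokens and action tokens could be laid out: within a timestep the two modalities attend to each other freely, while across timesteps attention remains causal. The mask layout is an illustrative assumption, not EgoTwin's published implementation.

```python
import torch

# One possible reading of "bidirectional causal attention": tokens from the video
# stream and the action stream attend to each other in both directions within the
# same timestep, but never to tokens from future timesteps.

def bidirectional_causal_mask(n_steps: int, vid_per_step: int, act_per_step: int):
    per_step = vid_per_step + act_per_step
    total = n_steps * per_step
    step_of = torch.arange(total) // per_step        # timestep index of each token
    # allowed[i, j] is True when token i may attend to token j.
    allowed = step_of[:, None] >= step_of[None, :]   # causal across steps,
    return allowed                                   # bidirectional within a step

mask = bidirectional_causal_mask(n_steps=3, vid_per_step=2, act_per_step=1)
print(mask.int())
# Additive bias form, as used by torch.nn.functional.scaled_dot_product_attention:
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))
```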
Recruitment begins! The Bund Summit (外滩大会) robot vocational-skills exhibition competition awaits your challenge
具身智能之心· 2025-08-28 01:20
Core Viewpoint
- The article emphasizes the importance of embodied intelligence in addressing human safety and operational efficiency in hazardous environments, highlighting the potential of robots to perform tasks in dangerous situations such as deep mining, firefighting, and emergency rescue [2][4].

Group 1: Industry Events and Competitions
- The "Artificial Intelligence Hardware Innovation Competition" will feature a live robot skills performance event, inviting partners in the embodied intelligence industry to participate and gain media exposure and collaboration opportunities [4][5].
- The competition will include various challenge areas such as hazardous-environment navigation, precision tasks, and emergency rescue operations, with evaluation criteria based on task difficulty, accuracy, efficiency, and autonomy [5].

Group 2: Community and Educational Resources
- The "Embodied Intelligence Heart" community offers comprehensive support for academic and research endeavors, including guidance for top-tier conferences and journals, as well as assistance with thesis and competition preparation [7].
- The community serves as a platform for developers and researchers in embodied intelligence, covering technical areas such as datasets, simulation platforms, and advanced learning models, with resources including over 30 learning paths and 40 open-source projects [7][10].
Nvidia's general-purpose robot chip is here: 7.5x more AI compute, already adopted by Unitree and Galaxy General
具身智能之心· 2025-08-27 00:04
Core Viewpoint
- Nvidia has launched its new robot-specific chip, Jetson Thor, which significantly increases the computing power available to humanoid robots and other form factors, aiming to support advanced embodied-intelligence algorithms [3][11].

Group 1: Product Features
- Jetson Thor features a GPU with AI compute of up to 2070 FP4 TFLOPS, 7.5 times that of its predecessor Jetson Orin, at a power consumption of 130 W and with 3.5 times better energy efficiency [3][7].
- Memory capacity has doubled to 128 GB, with a memory bandwidth of 273 GB/s [3][7].
- The chip is designed for generative-AI model inference, supporting next-generation "physical AI" agents that can run in real time on the edge, minimizing reliance on cloud computing [7][10].

Group 2: Software and Ecosystem
- Jetson Thor supports all major generative-AI frameworks and inference models, enabling developers to experiment locally and run inference efficiently [8][10].
- The product line includes a developer kit priced at $3,499 (approximately 25,000 RMB) and a production-grade module priced at $2,999 (approximately 21,400 RMB) for bulk orders [11].

Group 3: Market Impact and Partnerships
- Major robotics companies, including Unitree Robotics (宇树) and Galaxy General Robotics (银河通用), have announced plans to integrate Jetson Thor into their products, highlighting its significance for the robotics industry [13][14].
- Nvidia's strategy focuses on supporting the robotics and autonomous-driving markets, projected to be worth trillions of dollars, while continuing to provide foundational AI infrastructure [18][17].
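The compute and efficiency multiples quoted above can be cross-checked with a few lines of arithmetic. The script below only rearranges the published Thor figures; the implied Jetson Orin numbers it prints are back-calculated estimates, not official specifications.

```python
# Consistency check on the Jetson Thor figures quoted above. Only the Thor
# numbers (2070 FP4 TFLOPS, 130 W, 7.5x compute, 3.5x efficiency) come from the
# article; the Orin values below are implied estimates, not official specs.

thor_tflops = 2070.0           # FP4 TFLOPS
thor_power_w = 130.0           # W

orin_tflops_implied = thor_tflops / 7.5              # ~276 TFLOPS-equivalent
thor_perf_per_watt = thor_tflops / thor_power_w      # ~15.9 TFLOPS/W
orin_perf_per_watt = thor_perf_per_watt / 3.5        # ~4.5 TFLOPS/W (implied)
orin_power_implied = orin_tflops_implied / orin_perf_per_watt  # ~61 W (implied)

print(f"implied Orin compute:    {orin_tflops_implied:.0f} TFLOPS")
print(f"Thor efficiency:         {thor_perf_per_watt:.1f} TFLOPS/W")
print(f"implied Orin efficiency: {orin_perf_per_watt:.1f} TFLOPS/W")
print(f"implied Orin power:      {orin_power_implied:.0f} W")
```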
Career switch complete: landed an offer for an embodied-intelligence role!
具身智能之心· 2025-08-27 00:04
Recently, more and more people have been sharing good news with Feng Ge (峰哥): verbal offers landed during autumn recruitment, and successful moves from autonomous driving into embodied intelligence through experienced hiring.

Beyond that, many embodied-robot companies have commissioned us to develop more tutorials and features for the EDU version of their hardware. This is already in preparation, and we will gradually publish these tutorials in our embodied community to help the industry develop.

The "Embodied Intelligence Heart Knowledge Planet" currently brings together videos, articles, learning roadmaps, Q&A, and job-hunting exchange; it is a comprehensive embodied community of nearly 2,000 members. We hope to grow to nearly 10,000 within the next two years, building a hub for exchange and technical sharing that many beginners and advanced learners visit regularly.

The community also regularly answers all kinds of practical questions: How do I use a device? How do I collect data effectively? How do I deploy VA/VLA models? Is the capture background too complex, or is the data too dirty? Quick answers make it easy to apply solutions to your own projects.

A community that can solve problems when people need help most is undoubtedly valuable. The Embodied Intelligence Heart Knowledge Planet (the first full-stack embodied-intelligence community in China) has now closed the loop across industry, academia, job hunting, and Q&A exchange. Whatever problem comes up, a solution gets shared; whichever research direction is at the frontier, ideas keep coming; and job openings are relayed to members at the first opportunity! Beyond the questions above, we have also compiled many ...