具身智能之心
Career switch: got an offer for an embodied intelligence role!
具身智能之心· 2025-08-27 00:04
Recently, more and more readers have been sharing good news with Feng Ge: verbal offers landed during autumn recruiting, and successful moves from autonomous driving into embodied intelligence roles. In addition, many embodied robotics companies have asked us to build more tutorials and features around their EDU-edition hardware. This is already in preparation, and we plan to gradually publish these tutorials in our embodied intelligence community to help push the industry forward.

The "Embodied Intelligence Heart Knowledge Planet" (具身智能之心知识星球) combines video, articles, learning roadmaps, Q&A, and job-hunting discussion into one comprehensive embodied intelligence community, now close to 2,000 members. We aim to reach nearly 10,000 members within the next two years and to build a hub for discussion and technical sharing that beginners and advanced learners alike visit regularly. The community also answers practical questions: How do I use a given device? How do I collect data effectively? How do I deploy VA/VLA models? Is the capture background too cluttered, or is the data simply dirty? Quick answers make it easy to apply solutions to real projects. A community that can solve problems exactly when people need help most is undeniably valuable. The Knowledge Planet (the first full-stack embodied intelligence community in China) has already closed the loop across industry, academia, job hunting, and Q&A. Whatever problem comes up, a solution gets shared; whichever research direction is at the frontier, we keep supplying ideas; and job openings are passed to members as soon as they appear. Beyond the questions above, we have also compiled many ...
3x faster, CoT reasoning boosts VLA! ECoT-Lite: several mechanisms for integrating embodied robot reasoning to improve policies
具身智能之心· 2025-08-27 00:04
Core Insights
- The article discusses the development of efficient training strategies for embodied reasoning in robotics, specifically focusing on the ECoT (Embodied Chain-of-Thought) framework and its lightweight variant, ECoT-Lite, which enhances policy generalization without the need for extensive additional data collection [3][8][30].

Group 1: Motivation and Background
- The need for robots to generalize across diverse real-world scenarios has been a long-standing focus in robotics, with architectures like RT-X and RT-1 showing improved generalization through extensive training on diverse datasets [2].
- Traditional methods to enhance policy generalization involve collecting more robot data, often through tedious human teleoperation [3].

Group 2: ECoT Framework
- ECoT improves policy performance by breaking down robot action prediction into a series of reasoning steps, such as identifying object locations and planning sub-tasks, which significantly enhances generalization to new scenes and tasks without requiring additional demonstration data [3][4][5].
- Despite its promise, ECoT incurs significant costs, including the need for detailed reasoning annotations in the training data and slower inference due to the extended reasoning steps [3][5].

Group 3: ECoT-Lite Development
- ECoT-Lite introduces simpler and lighter alternatives to ECoT, focusing on better representation learning, improved learning processes, and enhanced expressiveness while avoiding the drawbacks of conventional chain-of-thought reasoning [6][8].
- ECoT-Lite achieves state-of-the-art performance on widely used benchmarks like LIBERO, surpassing traditional VLA models by 10-19% while increasing inference speed from 1-1.2 Hz to over 3.5 Hz [8].

Group 4: Experimental Results
- Experiments show that ECoT-Lite significantly improves performance across tasks, reaching roughly 90% accuracy on LIBERO-90, higher than previous state-of-the-art results [54][56].
- Reasoning dropout and reasoning pre-training were found to be particularly effective, with reasoning dropout providing a speed advantage while maintaining high performance (a minimal sketch of this idea follows this summary) [58][92].

Group 5: Implications and Recommendations
- The findings suggest that while full ECoT is the most performant method, it is also the slowest, making the ECoT-Lite variants more practical for real-time applications [90].
- Recommendations: use full ECoT for maximum performance, reasoning dropout when covering fewer task domains, and reasoning pre-training for more diverse tasks or when unpaired reasoning data is available [92].
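As one plausible reading of the reasoning-dropout strategy mentioned above, here is a minimal sketch (the function name, token format, and dropout scheme are illustrative assumptions, not the paper's actual implementation): during training, the intermediate reasoning text is randomly omitted so the policy also learns to map observations straight to actions, which is what enables fast, reasoning-free inference at deployment.

```python
import random

# Hypothetical training-sample builder illustrating "reasoning dropout" for a
# chain-of-thought VLA policy: with probability dropout_p the intermediate
# reasoning text is dropped, so the model also learns to predict actions directly.
def build_training_sequence(instruction: str, reasoning: str, action: str,
                            dropout_p: float = 0.5) -> str:
    keep_reasoning = random.random() > dropout_p
    if keep_reasoning:
        # Full embodied chain-of-thought: instruction -> reasoning -> action tokens.
        return f"{instruction}\n<reasoning>{reasoning}</reasoning>\n<action>{action}</action>"
    # Dropped reasoning: the policy must predict the action directly,
    # which is what allows skipping the reasoning chain at inference time.
    return f"{instruction}\n<action>{action}</action>"

# Example usage with made-up data:
seq = build_training_sequence(
    instruction="put the red block in the bowl",
    reasoning="the red block is at the left edge of the table; move gripper left, grasp, lift",
    action="[dx=-0.12, dy=0.03, dz=0.05, grip=close]",
)
print(seq)
```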
3 months! Thoroughly master embodied "brain + cerebellum" algorithms
具身智能之心· 2025-08-27 00:04
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focusing on embodied intelligence, which emphasizes the interaction and adaptation of intelligent agents within physical environments, enabling them to perceive, understand tasks, execute actions, and learn from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli and driving advancements in embodied intelligence technologies [3].
- Major domestic companies like Huawei are launching initiatives such as the "Global Embodied Intelligence Industry Innovation Center" in collaboration with firms like Leju Robotics and Dazhu Robotics to develop key technologies for embodied intelligence [5].
- JD.com has been investing in companies like Zhiyuan Robotics and Qianxun Intelligent since May 2025 to enhance efficiency and service capabilities in logistics and home-service scenarios [5].
- Internationally, companies like Tesla and Figure AI are advancing applications in industrial and logistics robotics, while U.S. investment firms are supporting companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages, from low-level perception to high-level task understanding and generalization, aiming to enhance robots' capabilities in real-world environments [6].
- The first stage focused on grasp pose detection, enabling robots to predict suitable end-effector poses for static object manipulation, but lacked the context modeling needed for complex tasks [6].
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations, yet faced challenges in generalization and performance in multi-target scenarios [6].
- The third stage, emerging in 2023, used Diffusion Policy methods to improve stability and generalization by modeling action trajectories (a minimal sketch contrasting these two formulations follows this summary) [7].
- The fourth stage, starting in 2025, explores integrating VLA models with reinforcement learning and tactile sensing to overcome limitations in feedback and future-prediction capabilities [8].

Product and Market Development
- The evolution from grasp pose detection to behavior cloning and VLA models signifies a shift towards intelligent agents capable of handling general tasks in open environments, leading to the emergence of products like humanoid robots and robotic arms across industries such as healthcare and logistics [9].
- As embodied intelligence transitions from research to deployment, the demand for engineering and system capabilities is increasing, necessitating higher engineering standards [12].
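To make the second- and third-stage formulations above concrete, here is a hedged, minimal sketch of the two training objectives (the network sizes, the deliberately simplified noise schedule, and all variable names are assumptions for illustration, not any specific paper's setup): behavior cloning regresses the expert action directly, while a diffusion policy learns to denoise a corrupted action conditioned on the observation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7          # e.g. image features -> 7-DoF arm action (assumed sizes)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
denoiser = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 64), nn.ReLU(),
                         nn.Linear(64, act_dim))

obs = torch.randn(16, obs_dim)         # batch of observations (dummy data)
expert_act = torch.randn(16, act_dim)  # expert demonstration actions (dummy data)

# Stage 2: behavior cloning -- regress the expert action directly.
bc_loss = nn.functional.mse_loss(policy(obs), expert_act)

# Stage 3: diffusion policy -- learn to denoise a corrupted action; the loss is on
# the predicted noise, conditioned on the observation and the noise level.
t = torch.rand(16, 1)                          # diffusion "time" / noise level
noise = torch.randn_like(expert_act)
noisy_act = expert_act + t * noise             # simplified forward corruption
pred_noise = denoiser(torch.cat([obs, noisy_act, t], dim=-1))
dp_loss = nn.functional.mse_loss(pred_noise, noise)

print(f"BC loss: {bc_loss.item():.3f}, diffusion-policy loss: {dp_loss.item():.3f}")
```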
If only I had published a few more papers in my second year of grad school, it wouldn't have come to this...
具身智能之心· 2025-08-26 04:45
Core Viewpoint
- The article emphasizes the importance of high-quality research papers for graduate students, especially those facing challenges in job hunting or pursuing doctoral studies. It suggests that students seek professional assistance to enhance their research capabilities and output [1].

Group 1: Research Challenges and Solutions
- Many graduate students struggle to produce satisfactory research papers due to a lack of guidance from their advisors, leading to confusion in topic selection and paper structure [1].
- The article introduces a professional tutoring service aimed at helping students navigate the research process and improve their paper-writing skills [1][8].

Group 2: Tutoring Service Overview
- The tutoring service is backed by a team of over 300 experts in fields like autonomous driving and embodied intelligence, with a 96% acceptance rate for the students they have guided [5].
- The structured 12-week program covers defining research topics, literature review, experimental design, drafting, and submission [4].

Group 3: Target Audience and Benefits
- The service is designed for students facing challenges such as a lack of guidance or an unclear research framework, and for those looking to strengthen their academic profiles for job applications or further studies [9][10].
- Successful participants may receive recommendations from prestigious institutions and internship opportunities at leading tech companies [15].
¥25,000! NVIDIA launches the "strongest brain" for robots: AI compute jumps 750%, with 128GB of memory, and Unitree is already using it
具身智能之心· 2025-08-26 04:45
Core Viewpoint
- NVIDIA has launched Jetson Thor, a new robotic computing platform that significantly enhances AI computing power and efficiency, marking a leap towards the era of physical AI and general-purpose robotics [1][6][22].

Group 1: Product Features
- Jetson Thor delivers 2070 TFLOPS of AI compute, 7.5 times more than its predecessor, Jetson Orin, while achieving a 3.5-fold improvement in energy efficiency [1][5].
- The platform includes 128GB of memory, an unprecedented configuration for edge computing devices [2].
- It supports running multiple AI models simultaneously on edge devices, enhancing robots' ability to interact with, and even change, the physical world [5][6].

Group 2: Technical Specifications
- The GPU is based on the Blackwell architecture, featuring up to 2560 CUDA cores and fifth-generation Tensor Cores, with support for Multi-Instance GPU (MIG) technology [16].
- The CPU is a 14-core Arm Neoverse V3AE, designed for real-time control and task management, with significant performance improvements over previous generations [16].
- Memory and bandwidth are upgraded to 128GB of 256-bit LPDDR5X with 273GB/s of memory bandwidth, supporting large Transformer inference and high-concurrency video encoding [16].

Group 3: Market Adoption
- A significant number of Chinese companies, including Union Medical, Wanji Technology, and UBTECH, are among the first to adopt the Jetson Thor platform [19].
- Boston Dynamics is integrating Jetson Thor into its Atlas humanoid robot, giving it on-board computing power previously available only in servers [20].
- Agility Robotics plans to use Jetson Thor as the core computing unit for its sixth-generation Digit robot, aimed at logistics tasks in warehouse and manufacturing environments [21].

Group 4: Development and Simulation
- NVIDIA emphasizes a three-computer system for achieving physical AI: a DGX system for training AI, the Omniverse platform for simulation, and Jetson Thor as the robot's "brain" [22].
- Continuous training, simulation, and deployment cycles allow a robot's capabilities to keep improving even after deployment [24].
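As a rough back-of-the-envelope reading of those numbers (the 275 TOPS figure for Jetson AGX Orin, the 1-byte-per-parameter model size, and the memory-bandwidth-bound decoding assumption are our own assumptions, not from the article), a short calculation shows where the roughly 7.5x compute claim comes from and what the 273GB/s bandwidth implies for on-device LLM decoding:

```python
# Back-of-the-envelope numbers based on the specs quoted above.
thor_compute_tflops = 2070      # Jetson Thor AI compute (per the article)
orin_compute_tops = 275         # assumed Jetson AGX Orin peak (not from the article)
print(f"Compute ratio vs. Orin: {thor_compute_tflops / orin_compute_tops:.1f}x")  # ~7.5x

# For a memory-bandwidth-bound transformer decoder, each generated token must
# stream (roughly) all model weights once, so bandwidth caps decode speed.
mem_bandwidth_gbs = 273         # LPDDR5X bandwidth (per the article)
model_size_gb = 70              # hypothetical: a 70B-parameter model at ~1 byte/param
tokens_per_sec = mem_bandwidth_gbs / model_size_gb
print(f"Rough decode ceiling for a {model_size_gb}GB model: ~{tokens_per_sec:.1f} tokens/s")
```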
How do VLA models built on large VLMs push robotic manipulation forward, step by step?
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article discusses the transformative impact of large Vision-Language Models (VLMs) on robotic manipulation, enabling robots to understand and execute complex tasks through natural language instructions and visual cues [3][4][5].

Group 1: VLA Model Development
- The emergence of Vision-Language-Action (VLA) models, driven by large VLMs, allows robots to interpret visual details and human instructions and to convert this understanding into executable actions [4][5].
- The article traces the evolution of VLA models, categorizing them into monolithic and hierarchical architectures, and identifies key challenges and future directions for the field [9][10][11].

Group 2: Research Contributions
- The research from Harbin Institute of Technology (Shenzhen) provides a comprehensive survey of VLA models, detailing their definitions, core architectures, and integration with reinforcement learning and learning from human video [5][9][10].
- The survey aims to unify terminology and modeling assumptions in the VLA field, addressing fragmentation across disciplines such as robotics, computer vision, and natural language processing [17][18].

Group 3: Technical Advancements
- VLA models leverage the capabilities of large VLMs, including open-world generalization, hierarchical task planning, knowledge-enhanced reasoning, and rich multimodal integration [13][64].
- The article outlines the limitations of traditional robotic methods and how VLA models overcome them, enabling robots to handle unstructured environments and vague instructions effectively [16][24].

Group 4: Future Directions
- The article emphasizes the need for advances in 4D perception and memory mechanisms to strengthen VLA models' ability to execute long-horizon tasks [5][16].
- It also discusses the importance of developing unified frameworks for VLA models to improve their adaptability across tasks and environments [17][66].
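To illustrate the monolithic VLA architecture mentioned above, here is a deliberately tiny, hypothetical sketch (module choices, sizes, and names are assumptions; real VLA models build on a pretrained VLM backbone rather than these stand-ins): a vision encoder and a text encoder produce a fused representation that an action head maps to a continuous control command.

```python
import torch
import torch.nn as nn

# Minimal, illustrative monolithic VLA policy: image + instruction -> action.
# All module choices and dimensions are assumptions made for this sketch.
class TinyVLA(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=256, act_dim=7):
        super().__init__()
        self.vision_encoder = nn.Sequential(        # stands in for a ViT backbone
            nn.Conv2d(3, 16, 8, stride=8), nn.ReLU(), nn.Flatten(),
            nn.LazyLinear(vis_dim))
        self.text_encoder = nn.Embedding(10_000, txt_dim)  # stands in for an LLM
        self.action_head = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, image, token_ids):
        v = self.vision_encoder(image)                       # (B, vis_dim)
        t = self.text_encoder(token_ids).mean(dim=1)         # (B, txt_dim), mean-pooled
        return self.action_head(torch.cat([v, t], dim=-1))   # (B, act_dim), e.g. 7-DoF delta

policy = TinyVLA()
action = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 10_000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```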
The VLA and VLN technical discussion groups are here!
具身智能之心· 2025-08-26 00:03
Group 1
- The establishment of multiple VLA and VLN related communities by the company aims to facilitate discussions on developments in academia, industry, and product implementations [1].
- The company encourages individuals interested in VLA/VLN to join the community by adding a specific assistant on WeChat [2].
Latest from the University of California! What does it do? Teach VLA models to refuse impossible tasks
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article discusses the development and performance of the VLA model, focusing on its ability to handle false-premise instructions in robotic tasks through the proposed IVA framework, which enhances the model's robustness in interpreting and responding to user commands [4][10].

Group 1: Problem Statement and Solution
- The VLA model excels in various robotic tasks by relying on multimodal inputs, but it struggles with false-premise instructions: commands that reference non-existent objects or conditions [6][10].
- The IVA framework is introduced to address this issue, enabling the model to detect unexecutable commands, clarify or correct them through language, and associate reasonable alternatives with perception and action [4][10].

Group 2: Research Gaps and Contributions
- Current research primarily focuses on the success rate of executing correct commands, neglecting the handling of ambiguous or unexecutable instructions [6][10].
- The core contributions of this work include the IVA framework itself, the construction of a large-scale dataset for training, and validation of the model's performance across eight robotic tasks, demonstrating significant improvements in detecting false premises and executing valid commands [10][25].

Group 3: Experimental Results
- The IVA framework achieved a false-premise detection accuracy of 97.56% and a 50.78% increase in successful responses under false-premise scenarios compared to baseline models [5][25].
- Across tasks, IVA outperformed the LLARVA model in overall success rate and false-premise detection rate, with only minor reductions in success rate for true-premise commands [25][28].

Group 4: Limitations and Future Directions
- The training dataset is limited to a simulated environment, which may not fully represent real-world human-robot interactions, and its distribution of false premises may not align with how often they occur in practice [26][27].
- The IVA framework currently lacks the ability to handle complex, multi-turn clarifications and may struggle with longer, more ambiguous user commands in real-world scenarios [27][28].
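A simplified, rule-based stand-in for the detect / clarify / act behavior described above (the dataclass, function name, and detection rule are illustrative assumptions; in IVA these decisions are produced by the learned model itself, not hand-written rules):

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    visible_objects: set  # objects the perception module currently sees

def handle_instruction(instruction: str, target_object: str, scene: SceneState) -> str:
    if target_object not in scene.visible_objects:
        # False premise detected: the referenced object is not in the scene,
        # so respond in language with a clarification and reasonable alternatives.
        alternatives = ", ".join(sorted(scene.visible_objects))
        return (f"I can't find a '{target_object}' here. "
                f"Did you mean one of: {alternatives}?")
    # Valid premise: hand the instruction to the low-level policy for execution.
    return f"Executing: {instruction}"

scene = SceneState(visible_objects={"red block", "bowl", "spoon"})
print(handle_instruction("pick up the green mug", "green mug", scene))
print(handle_instruction("pick up the red block", "red block", scene))
```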
Real-world scenes can mass-produce "danger" too! VLM + diffusion models build extreme tests
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article discusses the development of SafeMVDrive, a framework designed to generate high-fidelity, multi-view safety-critical driving videos for testing autonomous driving systems in extreme scenarios, addressing the challenges of real-world data collection and simulation limitations [7][11][30].

Group 1: Safety Testing Challenges
- Current autonomous driving systems struggle to avoid accidents in high-risk scenarios such as nighttime construction sites and sudden obstacles, indicating a need for improved reliability in these situations [2][3].
- Extreme scenarios are infrequent in real-world driving, making data collection difficult, while existing simulators lack the realism required for effective testing [5][6].

Group 2: SafeMVDrive Framework
- SafeMVDrive combines a Vision-Language Model (VLM) for vehicle selection with a two-stage trajectory generation process to create high-fidelity safety-critical videos for testing [7][10].
- The framework addresses two main challenges: accurately selecting safety-critical vehicles and ensuring the generalization of multi-view video generation models [9][10].

Group 3: Innovations in Vehicle Selection and Trajectory Generation
- The VLM-based vehicle selector utilizes visual information to identify potentially dangerous vehicles, improving on traditional heuristic methods [19][31].
- The two-stage trajectory generation process first simulates collision trajectories and then transforms them into avoidance trajectories, preserving the safety-critical features while keeping the resulting video realistic [20][22][23].

Group 4: Video Generation and Evaluation
- SafeMVDrive employs a multi-view video generation module to convert avoidance trajectories into high-fidelity videos, ensuring both safety-criticality and visual realism [25][26].
- The framework significantly enhances the coverage and diversity of safety-critical scenarios compared to existing methods, demonstrating superior performance in generating challenging test data [28][30].

Group 5: Performance Metrics
- SafeMVDrive shows improved sample-level and scene-level collision-rate metrics, indicating its effectiveness in generating realistic and challenging driving scenarios [29][30].
- The VLM vehicle selector achieves a balance of precision and recall, ensuring that the selected vehicles align with real traffic logic, which is crucial for effective simulation [32].
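A high-level, hypothetical skeleton of the pipeline summarized above (every function name here is a placeholder for a learned component, including the VLM selector and the multi-view video diffusion model; this is a structural sketch, not SafeMVDrive's actual code):

```python
# Structural sketch of the two-stage SafeMVDrive-style pipeline; each step is a
# placeholder standing in for a learned model in the actual framework.

def select_adversarial_vehicle(scene_views, ego_state):
    """VLM selector: pick the surrounding vehicle most likely to create a
    safety-critical interaction, based on multi-view visual context."""
    ...

def generate_collision_trajectory(scene, adversary):
    """Stage 1: simulate a trajectory in which the adversary would collide with
    the ego vehicle, establishing the safety-critical geometry."""
    ...

def convert_to_evasion_trajectory(collision_traj, scene):
    """Stage 2: relax the collision trajectory into a near-miss / avoidance
    trajectory so the scene stays plausible for video generation."""
    ...

def render_multiview_video(scene, trajectory):
    """Final step: a multi-view video diffusion model turns the trajectory into
    high-fidelity camera footage for closed-loop testing."""
    ...

def safemvdrive_pipeline(scene_views, scene, ego_state):
    adversary = select_adversarial_vehicle(scene_views, ego_state)
    collision_traj = generate_collision_trajectory(scene, adversary)
    evasion_traj = convert_to_evasion_trajectory(collision_traj, scene)
    return render_multiview_video(scene, evasion_traj)
```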