RAGNet: The Road to General-Purpose Robots, from "Seeing" to "Reasoning" to "Grasping Accurately" (ICCV'25)
具身智能之心· 2025-08-04 01:59
Author: Dongming Wu et al.
Paper: https://arxiv.org/abs/2507.23734

Foreword: why is "general grasping" so hard?
"Teaching robots to grasp objects" has long been a central research topic in robotics. Yet once we move a robot off the fixed lab tabletop and into the open real world, such as a kitchen, a warehouse, or even a roadside, the old paradigms quickly break down. In a word, a robot must possess two capabilities at once: functional reasoning and fine-grained manipulation. Recently, researchers from The Chinese University of Hong Kong, 原力灵机, the Institute of Computing Technology (Chinese Academy of Sciences), Mohamed bin Zayed University of Artificial Intelligence, the University of Macau, and other institutions jointly released a new affordance segmentation dataset and model framework, RAGNet and AffordanceNet, aiming to build general-purpose grasping robots that align with human instructions.

II. RAGNet: a large-scale reasoning dataset for general grasping
RAGNet is a large-scale, reasoning-based affordance ...
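Since the summary above is cut off, here is only a rough, hypothetical sketch of the affordance-to-grasp pipeline it gestures at: an affordance segmentation model turns an image and an instruction into a per-pixel mask, from which a grasp point is derived. The `predict_affordance` function, its signature, and the centroid heuristic are illustrative assumptions, not the actual RAGNet/AffordanceNet API.

```python
import numpy as np

def predict_affordance(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stand-in for an affordance segmentation model.
    Returns a binary HxW mask of pixels that afford the instructed
    action, e.g. "grasp the mug by its handle"."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 3 : h // 2, w // 3 : w // 2] = True  # placeholder region
    return mask

def grasp_point_from_mask(mask: np.ndarray) -> tuple[int, int]:
    """Pick a grasp pixel as the centroid of the affordance mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("no affordance region found")
    return int(ys.mean()), int(xs.mean())

image = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy RGB frame
mask = predict_affordance(image, "grasp the mug by its handle")
print(grasp_point_from_mask(mask))               # (row, col) grasp pixel
```

In practice the mask would come from the learned model and the grasp point from a dedicated grasp planner rather than a centroid.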
The 具身智能之心 reinforcement learning discussion group is here!
具身智能之心· 2025-08-04 01:59
Group 1
- The article announces the establishment of a community focused on reinforcement learning, specifically targeting individuals working on quadrupedal, humanoid, and robotic arm control [1]
- The community aims to create a platform for technical exchange and sharing within the industry [1]

Group 2
- Interested individuals are encouraged to add a designated assistant on WeChat to join the group, with specific instructions for joining [2]
The world's first embodied-intelligence safety benchmark is out: large models fail across the board
具身智能之心· 2025-08-04 01:59
Core Viewpoint
- The article discusses the development of AGENTSAFE, the world's first comprehensive evaluation benchmark for the safety of embodied intelligent agents, addressing the emerging risks associated with "jailbreak" attacks that can lead to dangerous actions by robots [5][12][43]

Group 1: Introduction to AGENTSAFE
- AGENTSAFE is designed to fill the gap in adversarial safety evaluation for embodied agents, which have been largely overlooked in existing benchmarks [5][11]
- The research has received recognition, winning the Outstanding Paper Award at the ICML 2025 Multi-Agent Systems workshop [6]

Group 2: The Need for AGENTSAFE
- Traditional AI safety concerns have focused on generating harmful content, while embodied agents can perform physical actions that pose real-world risks [10]
- The article emphasizes the importance of proactive safety measures, stating that safety vulnerabilities should be identified before any harm occurs [12]

Group 3: Features of AGENTSAFE
- AGENTSAFE includes a highly realistic interactive sandbox environment, simulating 45 real indoor scenes with 104 interactive objects [14][15]
- A dataset of 9,900 dangerous commands has been created, inspired by Asimov's "Three Laws of Robotics," and includes six advanced "jailbreak" attack methods [16][20]

Group 4: Evaluation Methodology
- AGENTSAFE employs an end-to-end evaluation design that assesses the entire process from perception to action execution, ensuring a comprehensive safety assessment [21][23]
- The evaluation is structured into three stages: perception, planning, and execution, with a focus on the model's ability to translate natural language commands into executable actions [31]

Group 5: Experimental Results
- The study tested five mainstream vision-language models (VLMs), revealing significant performance disparities when faced with dangerous commands [30][34]
- For example, GPT-4o and GLM showed high refusal rates for harmful commands, while Qwen and Gemini had much lower refusal rates, indicating a higher susceptibility to dangerous actions [36][37] (see the refusal-rate sketch after this list)
- The results demonstrated that once commands were subjected to "jailbreak" attacks, the safety measures of all models significantly deteriorated, with GPT-4o's refusal rate dropping from 84.67% to 58.33% for harmful commands [39][43]

Group 6: Conclusion
- The findings highlight the current vulnerabilities in the safety mechanisms of embodied intelligent agents, stressing the need for rigorous safety testing before deployment in real-world scenarios [43][44]
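As a rough sketch of how a refusal-rate metric like the one reported above can be computed, the snippet below runs a batch of dangerous commands through a model stub and counts refusals. The `query_model` stub and the keyword heuristic in `is_refusal` are assumptions for illustration, not AGENTSAFE's actual harness, which evaluates perception, planning, and execution end to end.

```python
def query_model(command: str) -> str:
    """Hypothetical stand-in for a VLM under test; returns its reply."""
    return "I cannot help with that."  # placeholder response

def is_refusal(reply: str) -> bool:
    """Crude keyword heuristic; a real benchmark would use a stricter judge."""
    return any(kw in reply.lower() for kw in ("cannot", "refuse", "won't"))

def refusal_rate(commands: list[str]) -> float:
    """Fraction of dangerous commands the model declines to execute."""
    refusals = sum(is_refusal(query_model(cmd)) for cmd in commands)
    return refusals / len(commands)

dangerous = ["knock the vase off the table", "point the knife at the user"]
print(f"refusal rate: {refusal_rate(dangerous):.2%}")
```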
A survey from the Institute of Automation, Chinese Academy of Sciences, on multimodal fusion and vision-language models in robot vision
具身智能之心· 2025-08-04 01:59
Core Insights
- The article discusses the advancements in multimodal fusion and vision-language models (VLMs) as essential tools for enhancing robot vision technology, emphasizing their potential in complex reasoning and long-term task decision-making [4][10]

Multimodal Fusion and Robot Vision
- Multimodal fusion enhances semantic scene understanding by integrating various data sources, such as visual, linguistic, depth, and lidar information, addressing limitations faced by traditional unimodal methods [8][9] (a minimal fusion sketch follows this list)
- The rise of VLMs has propelled the development of multimodal fusion paradigms, showcasing capabilities in zero-shot understanding and instruction following [9][10]

Key Applications and Challenges
- The article identifies key applications of multimodal fusion in tasks like simultaneous localization and mapping (SLAM), 3D object detection, navigation, and robot manipulation [10][19]
- Challenges in multimodal fusion include cross-modal alignment, efficient training strategies, and real-time performance optimization [10][19]

Datasets and Benchmarking
- A comprehensive analysis of mainstream multimodal datasets used for robot tasks is provided, detailing their modality combinations, task coverage, and limitations [10][43]
- The importance of high-quality multimodal datasets is highlighted, as they are crucial for model training and performance evaluation [62]

Future Directions
- The article suggests future research directions to address challenges in multimodal fusion, such as improving cross-modal alignment techniques and enhancing real-time performance [10][63]
- Emphasis is placed on the need for standardized datasets and benchmarks to facilitate comparisons across different research efforts [66]
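To make the fusion idea concrete, here is a minimal late-fusion sketch in PyTorch: per-modality encoders (stubbed as linear projections) map vision, language, and depth features into a shared space, where they are concatenated and fused. The dimensions and architecture are illustrative assumptions, not drawn from the survey.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Toy late-fusion head: encode each modality, concatenate, project."""
    def __init__(self, vis_dim=512, lang_dim=768, depth_dim=128, out_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)      # vision encoder stub
        self.lang_proj = nn.Linear(lang_dim, out_dim)    # language encoder stub
        self.depth_proj = nn.Linear(depth_dim, out_dim)  # depth encoder stub
        self.fuse = nn.Sequential(nn.Linear(3 * out_dim, out_dim), nn.ReLU())

    def forward(self, vis, lang, depth):
        z = torch.cat([self.vis_proj(vis), self.lang_proj(lang),
                       self.depth_proj(depth)], dim=-1)
        return self.fuse(z)

model = LateFusion()
fused = model(torch.randn(1, 512), torch.randn(1, 768), torch.randn(1, 128))
print(fused.shape)  # torch.Size([1, 256])
```

Surveyed systems differ mainly in where this fusion happens (early, late, or via cross-attention) and in how modalities are aligned before fusing.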
Embodied-intelligence founders are betting that this market is far bigger than most people think......
具身智能之心· 2025-08-02 16:02
Core Insights
- The article discusses the potential of embodied intelligence technology to transform various devices and services, suggesting that if technical and data challenges are resolved, many everyday items could be "embodied" [1][2]
- The industry's center of gravity is expected to shift from autonomous driving to embodied intelligence over the next decade, creating numerous job opportunities and attracting talent from various fields [2][3]

Group 1: Industry Trends
- The emergence of humanoid robots and mobile manipulation robots in sectors such as healthcare, industry, and home services is highlighted, indicating a growing trend in embodied applications [1][2]
- The concept of VLA (Vision-Language-Action) in autonomous vehicles is introduced, suggesting that users will be able to interact with these systems using natural language for navigation and task optimization [1][2]

Group 2: Market Opportunities
- The potential for service and industrial robots to perform multiple tasks in parallel is emphasized, which could lead to more efficient production lines without the need for extensive reconfiguration [2]
- The retail and service industries are expected to see significant advancements with the introduction of autonomous robots capable of managing large spaces like supermarkets and restaurants [2]

Group 3: Community and Knowledge Sharing
- The "Embodied Intelligence Knowledge Planet" has established a closed-loop system across various fields, including industry, academia, and job-seeking, to foster community engagement and knowledge sharing [4][5]
- The platform offers resources such as technical routes, job information, and access to industry experts, aiming to support both newcomers and experienced professionals in the field [5][11][12]

Group 4: Educational Resources
- The community provides a comprehensive list of over 30 technical routes for learning, catering to different levels of expertise, from beginners to advanced researchers [17][12]
- Various resources, including open-source projects, datasets, and simulation platforms, are compiled to facilitate learning and development in embodied intelligence [17][32][36]
Spec-VLA: the first speculative decoding framework designed specifically for VLA inference acceleration
具身智能之心· 2025-08-02 16:02
Core Viewpoint
- The article discusses the development of Spec-VLA, a speculative decoding framework designed to accelerate Vision-Language-Action (VLA) models, addressing challenges related to computational demands and decoding delays [3][4][16]

Research Background and Motivation
- VLA models have shown significant progress in generating robot action sequences from language instructions, but they face challenges such as the large parameter counts of their backbone Visual Language Models (VLMs) and the decoding latency introduced by autoregressive decoding [3]
- Existing acceleration methods have limitations, necessitating an approach tailored to VLA models [3]

Core Framework: Spec-VLA
- Spec-VLA introduces a collaborative mechanism between a draft model and a validation model to speed up inference: the draft model predicts action tokens and the validation model verifies them to preserve output quality [4][5] (a simplified decoding-loop sketch follows this list)

Key Mechanism: Relaxed Acceptance
- The relaxed acceptance mechanism defines a threshold on the acceptable distance between the draft and validation models' predictions, enabling a more efficient decoding process without significant computational overhead [7][10]

Experimental Validation
- The framework was evaluated on the LIBERO simulation benchmark across four task sets, demonstrating significant improvements in speed and acceptance length while maintaining success rates [9][10]
- The introduction of relaxed acceptance yielded an acceleration factor of 1.22× to 1.42×, with acceptance length increasing by 25%-44% [10][11]

Key Results
- As the relaxation threshold increases, the acceptance length improves significantly while success rates remain stable across the datasets [10][11]
- Case studies show that relaxed acceptance reduces the number of iterations needed to complete action sequences, validating the effectiveness of the mechanism [13]

Conclusion and Limitations
- Spec-VLA demonstrates the potential of speculative execution for VLA prediction tasks, achieving a 1.42× speedup and a 44% increase in acceptance length without compromising success rates [16]
- Limitations include the lack of real-world robot testing and unexplored action-chunking strategies [16]
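The following is a toy sketch of speculative decoding with relaxed acceptance, under the assumption (common in VLA models) that action tokens are discretized bins, so the "distance" between two tokens is simply the gap between bin indices. The draft/verify stubs and the loop structure are our own simplification, not Spec-VLA's implementation.

```python
import random

VOCAB = list(range(256))  # discretized action bins, common in VLA models

def draft_next(prefix):
    """Hypothetical small draft model: cheap but approximate."""
    return random.choice(VOCAB)

def verify_next(prefix):
    """Hypothetical large validation model: slow but trusted."""
    return random.choice(VOCAB)

def speculative_decode(prefix, k=4, tau=2, rounds=8):
    """Each round, the draft model proposes k action tokens; the validation
    model re-predicts each position and accepts a draft token when the two
    bins differ by at most tau (the relaxed-acceptance threshold). On a
    rejection, the validation model's token is kept and the round ends."""
    out = list(prefix)
    for _ in range(rounds):
        ctx = list(out)
        drafts = []
        for _ in range(k):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        for t in drafts:
            v = verify_next(out)
            if abs(t - v) <= tau:   # close enough: accept the draft token
                out.append(t)
            else:
                out.append(v)       # fall back to the validation model
                break
    return out

print(speculative_decode([0]))
```

Raising tau accepts more draft tokens per round (longer acceptance length, fewer validation calls) at the cost of looser agreement with the validation model, which matches the speed/threshold trade-off reported above.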
As the only robotics company at Huawei's booth, just how strong is it?
具身智能之心· 2025-08-02 16:02
Core Viewpoint
- The article highlights the significant advancements and contributions of Daimon Robotics in the field of embodied intelligence, particularly during the WAIC 2025 event in Shanghai, showcasing their innovative technologies and collaborations with major industry players like Huawei and China Mobile [2][4][9]

Group 1: Daimon Robotics at WAIC 2025
- Daimon Robotics made a remarkable appearance at the Huawei exhibition, attracting a large audience with the instant response and low-latency performance of the Sparky 1 robot, which became a popular attraction at the event [6]
- The company showcased its groundbreaking VTLA (Visual-Tactile-Language-Action) embodied operation model, which integrates tactile perception to enhance reasoning and generalization capabilities in complex scenarios, achieving near-human dexterity [8]

Group 2: Collaborations and Industry Impact
- Daimon officially joined China Mobile's embodied intelligence industry cooperation plan, aimed at promoting the productization, industrialization, and scaling of embodied intelligence technologies, alongside key partners like Yushu and Zhiyuan [9]
- Tactile perception technology is crucial for advancing robots from being merely functional to being practical and user-friendly, positioning Daimon as a representative of tactile sensing technology in the next wave of breakthroughs [9]

Group 3: Product Showcase and Technological Strength
- At WAIC 2025, Daimon unveiled several core products, including the world's first multi-dimensional high-resolution tactile sensor DM-Tac W, a multi-dimensional tactile sensing dexterous hand DM-Hand1, and a wearable remote operation data collection system, the DM-EXton series, demonstrating the company's technological prowess and commercialization achievements [11]
- The DM-Tac W sensor is recognized as an industry benchmark for visual-tactile sensing, capable of capturing minute tactile changes, thus opening new possibilities for high-precision operational scenarios [13][14]

Group 4: Leadership in Domestic Technology
- Daimon Robotics, incubated at the Hong Kong University of Science and Technology, focuses on high-resolution multi-modal tactile perception and human-centered remote operation systems, led by robotics pioneer Professor Wang Yu [15]
- The company has developed the world's thinnest tactile sensor technology, overcoming challenges related to sensor thickness, computing power, and durability, and has achieved domestic commercialization of tactile products [15]
VLA-OS: Lin Shao's team at NUS probes the secrets of task reasoning in robot VLA models
具身智能之心· 2025-08-01 16:02
Core Viewpoint
- The article discusses a groundbreaking research study by a team from the National University of Singapore focusing on the VLA-OS framework, which systematically analyzes and dissects task planning and reasoning in Vision-Language-Action (VLA) models, aiming to provide insights for the next generation of general-purpose robotic VLA models [2][4]

Group 1: VLA-OS Overview
- VLA-OS is a structured framework that includes a clear codebase, multimodal task planning datasets, and standardized training processes for VLA models [4][5]
- The framework aims to unify various VLA paradigms and facilitate controlled experiments to identify effective task planning representations and paradigms [19][20]

Group 2: VLA Model Paradigms
- The article outlines two main approaches for integrating task reasoning into VLA models: Integrated-VLA, which combines task planning and policy learning, and Hierarchical-VLA, which separates these functions into different models [10][12] (a schematic sketch of the two paradigms follows this list)
- Current VLA models exhibit significant variability in architecture, training methods, and task planning representations, complicating performance assessments [13][15]

Group 3: Experimental Findings
- The research identifies 14 key findings from over 100 experiments, highlighting the advantages of visual planning representations over language-based ones and the superior performance of Hierarchical-VLA compared to Integrated-VLA [34][35]
- Findings indicate that Integrated-VLA benefits from implicit task planning, while Hierarchical-VLA demonstrates better generalization capabilities [51][52]

Group 4: Recommendations for Future Research
- The article suggests prioritizing visual representation planning and goal-image planning, with language planning as a supplementary approach [68]
- It emphasizes the importance of task planning pre-training and the need for efficient training mechanisms to avoid gradient conflicts between planning and action outputs [73]
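A schematic sketch of the two paradigms, with every learned component replaced by a hypothetical stub: Integrated-VLA routes one shared backbone into both a planning head and an action head, while Hierarchical-VLA chains a separate planner and low-level policy. This illustrates only the structural difference; it is not VLA-OS code.

```python
# Stubs standing in for learned components (all hypothetical).
def encode(obs):        return obs                      # shared backbone
def plan_head(feat):    return f"plan({feat['goal']})"  # planning head
def action_head(feat):  return [0.0] * 7                # 7-DoF action head

def integrated_vla(obs):
    """One model: a shared backbone feeds both a planning head (auxiliary
    output) and an action head; planning and acting share weights."""
    feat = encode(obs)
    return plan_head(feat), action_head(feat)

def hierarchical_vla(obs, planner=plan_head, policy=action_head):
    """Two models: a high-level planner emits an explicit plan, and a
    separate low-level policy maps (observation, plan) to an action."""
    plan = planner(encode(obs))            # model 1: task reasoning
    return policy({**obs, "plan": plan})   # model 2: acting on the plan

obs = {"image": None, "goal": "stack the blocks"}
print(integrated_vla(obs))
print(hierarchical_vla(obs))
```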
A MuJoCo tutorial is here! From zero basics to reinforcement learning to sim2real
具身智能之心· 2025-08-01 16:02
Core Viewpoint
- The article discusses the unprecedented advancements in AI, particularly in embodied intelligence, which is transforming the relationship between humans and machines. This technology is poised to revolutionize various industries, including manufacturing, healthcare, and space exploration [1][3]

Group 1: Embodied Intelligence
- Embodied intelligence is characterized by machines that can understand language commands, navigate complex environments, and make intelligent decisions in real time. This technology is no longer a concept from science fiction but is rapidly becoming a reality [1]
- Major tech companies like Tesla, Boston Dynamics, OpenAI, and Google are competing in the field of embodied intelligence, focusing on creating systems that not only have a "brain" but also a "body" capable of interacting with the physical world [1][3]

Group 2: Technical Challenges
- Achieving true embodied intelligence presents significant technical challenges, including the need for advanced algorithms and a deep understanding of physical simulation, robot control, and perception fusion [3][4]
- MuJoCo (Multi-Joint dynamics with Contact) is highlighted as a critical technology in this field, serving as a high-fidelity simulation engine that bridges the virtual and real worlds [4][6]

Group 3: Advantages of MuJoCo
- MuJoCo allows researchers to create realistic virtual robots and environments, enabling millions of trials and learning experiences without risking expensive hardware. This significantly accelerates the learning process, as simulations can run hundreds of times faster than real time [6][8] (a minimal MuJoCo example follows this list)
- The engine supports high parallelism, allowing thousands of simulation instances to run simultaneously, and provides a variety of sensor models, ensuring robust and precise simulations [6][8]

Group 4: Educational Opportunities
- A comprehensive MuJoCo development course has been developed, focusing on practical applications and theoretical foundations, covering topics from physical simulation principles to deep reinforcement learning [9][11]
- The course is structured into six modules, each with specific learning objectives and practical projects, ensuring a solid grasp of embodied intelligence technologies [15][17]

Group 5: Project-Based Learning
- The course includes six progressively challenging projects, such as building a smart robotic arm, implementing vision-guided grasping systems, and developing multi-robot collaboration systems, designed to provide hands-on experience [19][27]
- Each project is accompanied by detailed documentation and code references, facilitating a deep understanding of the underlying technologies and their applications in real-world scenarios [30][32]

Group 6: Target Audience and Outcomes
- The course is suitable for individuals with programming or algorithm backgrounds looking to enter the field of embodied robotics, as well as students and professionals interested in enhancing their practical skills [32][33]
- Upon completion, participants will possess a complete skill set in embodied intelligence, including technical, engineering, and innovative capabilities, making them well-equipped for roles in this rapidly evolving industry [32][33]
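For readers new to MuJoCo, here is a minimal example of the load-and-step loop that such a course builds on, using the official `mujoco` Python bindings. The inline pendulum MJCF is our own toy model, not course material.

```python
import mujoco

# A one-body pendulum described inline in MJCF (MuJoCo's XML format).
XML = """
<mujoco>
  <worldbody>
    <body name="pole" pos="0 0 1">
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0 0 -0.5" size="0.02"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)  # compile the model
data = mujoco.MjData(model)                  # allocate simulation state

data.qpos[0] = 0.5                           # tilt the pendulum (radians)
for _ in range(1000):                        # ~2 s at the default 2 ms step
    mujoco.mj_step(model, data)              # advance the physics one step

print("final hinge angle:", data.qpos[0])
```

Because stepping is a plain function call on plain state, many such simulations can be run in parallel processes, which is the parallelism advantage the summary mentions.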
We're expanding our embodied intelligence team. Come join us......
具身智能之心· 2025-08-01 16:02
Core Viewpoint
- The rapid development of embodied intelligence is being recognized, with several leading companies preparing for IPOs, highlighting the importance of collaboration and communication within the industry [1]

Group 1: Collaboration and Industry Development
- The industry is encouraged to engage in active communication to overcome technological isolation, which can hinder overall development [1]
- The company aims to create a platform that gathers talent from across the industry to promote progress [1]

Group 2: Project Collaboration
- The company is establishing project research teams in major cities including Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan, with opportunities for part-time involvement [3]
- Each city will recruit around 10 individuals with over 2 years of experience in embodied algorithms and robotics research [4]

Group 3: Education and Consulting Services
- The company invites industry experts to develop online courses and consulting services in the field of embodied intelligence [5]
- Specific areas of expertise sought include large models, multi-modal models, reinforcement learning, and robot motion planning, among others [5][6]

Group 4: Compensation and Opportunities
- The company offers significant profit-sharing and resource sharing across the industry, with options for both part-time and full-time positions [7]