具身智能之心
Embodied intelligence welcomes a data revolution! Itstone Intelligent (它石智航) releases the WIYH dataset, half a year ahead of Tesla Optimus
具身智能之心· 2025-10-11 10:00
Core Insights
- The article highlights the launch of the world's first large-scale real-world embodied VLTA (Vision-Language-Tactile-Action) multimodal dataset, World In Your Hands (WIYH), by Itstone Intelligent (它石智航), marking a significant advance for the embodied intelligence industry [1][6]
- The WIYH dataset aims to address the challenges of data quality and availability in training large models, which have traditionally relied on inconsistent internet data and limited simulation data [1][3]

Summary by Sections

Dataset Features
- The WIYH dataset is characterized by four main features:
  1. **Realism**: Data is collected from actual embodied tasks, aligning with real-world applications [3]
  2. **Richness**: It spans multiple industries and operational skills, enhancing the model's transfer and generalization capabilities [3]
  3. **Comprehensiveness**: It includes multimodal data covering vision, language, touch, and action, facilitating pre-training alignment [3]
  4. **Volume**: The dataset's scale is comparable to that of large language model training corpora, underpinning the future potential of embodied intelligence [3][4]

Unique Advantages
- The WIYH dataset offers three unique advantages:
  1. **Modal Integrity**: It synchronously captures visual, tactile, and action data using proprietary collection equipment, ensuring precise temporal and spatial alignment (an illustrative sample schema follows this summary) [4]
  2. **Data Annotation**: High-precision annotations are produced with the company's cloud-based foundation model, covering ground-truth labels at several granularities to provide comprehensive supervision signals [4]
  3. **Collection Environment**: Data is gathered in real-life operational settings, significantly enhancing authenticity, diversity, and generalization while cutting collection costs by an order of magnitude [4]

Future Implications
- The establishment of the WIYH dataset marks the creation of a human-centric embodied data paradigm, enabling the pre-training of embodied AI models for real-world applications [6]
- The dataset is expected to support the transition from single-task applications to models with general manipulation capabilities, laying a solid foundation for the integration of embodied robots into various industries [6]
- The company plans to make the WIYH dataset publicly available by December 2025, inviting research institutions and partners to collaborate in building a thriving ecosystem for embodied intelligence [6]
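To make the modal-integrity point concrete, here is a minimal sketch of what one temporally aligned VLTA sample could look like in code. The schema is purely illustrative: the field names, shapes, and the `VLTAFrame`/`VLTAEpisode` classes are assumptions for exposition, not the published WIYH format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class VLTAFrame:
    """One synchronized Vision-Language-Tactile-Action sample (illustrative schema)."""
    timestamp_ns: int        # shared clock so all modalities align in time
    rgb: np.ndarray          # e.g. (H, W, 3) uint8 egocentric image
    tactile: np.ndarray      # e.g. (num_taxels,) float32 fingertip pressures
    action: np.ndarray       # e.g. (dof,) float32 wrist pose + joint targets
    instruction: str = ""    # natural-language task description

@dataclass
class VLTAEpisode:
    task_label: str          # annotation produced by a labeling model (hypothetical field)
    frames: List[VLTAFrame] = field(default_factory=list)

    def is_aligned(self, tolerance_ns: int = 5_000_000) -> bool:
        """Check that consecutive frames are evenly spaced within a tolerance."""
        gaps = np.diff([f.timestamp_ns for f in self.frames])
        if len(gaps) == 0:
            return True
        return bool(np.all(np.abs(gaps - np.median(gaps)) <= tolerance_ns))
```

A loader built on such a schema would let vision, touch, and action streams be batched together for pre-training without per-modality resampling, which is the practical payoff of collecting the modalities on a shared clock.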
Live share tonight! The first survey on self-evolving agents: how do we move toward artificial superintelligence?
具身智能之心· 2025-10-11 04:00
Core Insights
- The article discusses the emerging paradigm of self-evolving agents in artificial intelligence, emphasizing the shift from static models to dynamic agents capable of real-time learning and adaptation [1][6]
- Despite growing interest from academia and industry, the field lacks systematic organization and top-level design, with most research treating evolution as a subset of the overall agent framework [1][6]
- The article identifies three fundamental questions that remain unanswered: What parts of the agent should evolve? When does evolution occur? How is evolution implemented? [1][6]

Summary by Sections

Self-evolution in Agents
- The article outlines the areas where self-evolution occurs within agents, highlighting the need for clarity about these components [5][6]

Timing of Self-evolution
- It addresses when self-evolution takes place, which is crucial for the development of effective intelligent agents [5][6]

Implementation of Self-evolution
- The article discusses how self-evolution can be realized, focusing on the methodologies and frameworks that can facilitate this process [5][6]

Event Announcement
- An upcoming live session featuring Gao Huanang, a PhD student at Tsinghua University, will delve deeper into the topic of self-evolving agents [2][6]
Being-VL's visual BPE route: truly unifying "seeing" and "speaking"
具身智能之心· 2025-10-11 00:02
Core Insights
- The article discusses the limitations of traditional multimodal models, particularly how CLIP-style encoders prematurely align visual representations to the text space, which can produce hallucinations when queries probe details that have little language grounding [1][5]
- A new method called Being-VL is proposed, which uses a visual BPE (Byte Pair Encoding) to improve the alignment and joint modeling of visual and textual data [1][2]

Group 1: Being-VL Methodology
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE that weighs both co-occurrence frequency and spatial consistency, and finally unifying visual and text tokens into a single sequence for modeling (a toy sketch of the merge step follows this summary) [2][5]
- The Priority-Guided Encoding approach combines frequency and spatial consistency to build a visual token set that is more meaningful both semantically and structurally [7][8]

Group 2: Training Strategy
- Training is divided into three stages: initial alignment of the visual token embeddings, selective fine-tuning of the LLM, and full fine-tuning on complex reasoning and instruction data [9][15]
- A curriculum learning strategy moves gradually from basic tasks to more complex ones, strengthening the model's grasp of cross-modal interactions [9][12]

Group 3: Experimental Results
- Experiments indicate that discretizing images and then applying visual BPE improves reliability on detail-sensitive tasks and reduces hallucinations compared with traditional methods [12][16]
- Introducing visual BPE significantly improves performance and robustness, showing that folding stable visual patterns into semantically meaningful tokens enables better reasoning [12][19]

Group 4: Tokenization and Efficiency
- The study examines how the size of the BPE token vocabulary affects training efficiency, suggesting that a balanced size optimizes both expressiveness and training cost [19][20]
- Overly large vocabularies can lead to sparse token distributions and diminishing returns on compute, indicating a need for careful scaling in future applications [19][20]
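As a rough illustration of the priority-guided visual BPE idea described above, the sketch below scores adjacent pairs of VQ-GAN code indices by co-occurrence frequency plus a spatial-consistency bonus and picks the next pair to merge. The scoring formula, the orientation-based consistency measure, and the `lam` weight are assumptions for exposition, not Being-VL's exact formulation.

```python
from collections import Counter
import numpy as np

def pair_statistics(grid: np.ndarray):
    """Count horizontally and vertically adjacent VQ-token pairs in one image grid."""
    horiz = Counter(zip(grid[:, :-1].ravel().tolist(), grid[:, 1:].ravel().tolist()))
    vert = Counter(zip(grid[:-1, :].ravel().tolist(), grid[1:, :].ravel().tolist()))
    return horiz, vert

def select_merge(grids, lam: float = 0.5):
    """Pick the pair to merge by co-occurrence frequency plus a spatial-consistency bonus.

    Consistency here is the fraction of a pair's occurrences sharing one orientation,
    an illustrative stand-in for a spatial-consistency term.
    """
    horiz_total, vert_total = Counter(), Counter()
    for g in grids:
        h, v = pair_statistics(g)
        horiz_total += h
        vert_total += v
    scores = {}
    for pair in set(horiz_total) | set(vert_total):
        freq = horiz_total[pair] + vert_total[pair]
        consistency = max(horiz_total[pair], vert_total[pair]) / freq
        scores[pair] = freq * (1.0 + lam * consistency)  # priority-guided score
    return max(scores, key=scores.get)

# Usage: grids are 2D arrays of discrete VQ-GAN code indices, one per image.
grids = [np.random.randint(0, 64, size=(16, 16)) for _ in range(8)]
print("next merge:", select_merge(grids))
```

Repeating this selection, replacing each merged pair with a new token id, grows a visual vocabulary whose entries correspond to stable, spatially coherent patterns rather than raw codebook indices.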
Purpose-built eyes for embodied intelligence: SLAMTEC's Aurora S fully integrated AI spatial perception system arrives!
具身智能之心· 2025-10-11 00:02
Today, SLAMTEC (思岚科技) officially released its new-generation fully integrated AI spatial perception system, Aurora S. Unlike a traditional camera, Aurora S is a "spatial intelligence perception system" that integrates AI algorithms with the accompanying compute, aiming to give embodied robots powerful out-of-the-box spatial perception and to sharply lower the barrier to integration and development.

SLAMTEC's new AI spatial perception system Aurora S

I. From "sensor" to "perception system"
Aurora S's biggest innovation is its high degree of integration. It ships with SLAMTEC's self-developed deep-learning AI-VSLAM algorithm, providing full-3D mapping and localization together with end-to-end neural-network stereo depth estimation and semantic recognition, while all the required compute hardware is packed into a compact body weighing only 238 grams.

What does this mean for developers?
- A far lower barrier to entry: no extra compute to configure and no need to develop complex vision algorithms from scratch.
- Faster time to market: out-of-the-box high-precision 3D perception, mapping, and semantic understanding let developers focus on innovation in upper-layer robot applications.
- Simpler system design: the all-in-one design greatly simplifies the robot's structural design and power management.

Aurora S delivers more than improved technical specs; it is a shift in development paradigm: making complex spatial perception as simple as using an ordinary camera.

II. Why Aurora S is " ...
Embodied robots have given reinforcement learning many new application scenarios!
具身智能之心· 2025-10-11 00:02
Main functions and application scenarios of reinforcement learning
For embodied robots, whether humanoid or quadruped, one task that cannot be avoided is gait control, and it is a hurdle that must be cleared on the way to general embodiment. The dominant approach today is reinforcement learning: the humanoid robots of companies such as Unitree (宇树) and Zhiyuan (智元) mostly rely on RL to learn the corresponding tasks, including high-difficulty skills such as climbing stairs, climbing hills, running, dancing, and somersaults, which lets the products adapt to rescue, surveying, and hazardous-environment scenarios.

Beyond locomotion, VLA+RL approaches for robotic arms are increasingly popular in academia; RL makes robot execution more efficient, smooth, and fluid.

However, reinforcement learning covers a great deal of material and depends heavily on research experience. The field is large and the content complex; many beginners simply do not know how to get started, and publishing a paper is harder still. Producing a paper that meets the bar requires concentrated work on methodology, experimental results, and writing, and a slip in any one of them can draw a low score from reviewers. Without a complete learning system, you will stumble at every turn and struggle to get started, eventually ...
具身智能之心 1-on-1 paper tutoring is here!
具身智能之心· 2025-10-10 03:14
Core Viewpoint
- The article promotes a comprehensive thesis guidance service that addresses the challenges students face in research and writing, particularly in advanced fields such as multimodal models and robotics.

Group 1: Thesis Guidance Service
- The service offers one-on-one customized guidance in cutting-edge research areas such as multimodal large models, visual-language navigation, and embodied intelligence [1][2]
- It provides full-process support from topic selection to experimental design, coding, writing, and submission strategy, aimed at producing high-quality research outcomes quickly [2]
- The guidance is provided by a team of experienced mentors from institutions such as CMU, Stanford, and MIT, with expertise in top-tier conferences [1][3]

Group 2: Dual Perspective Approach
- The service emphasizes both academic publication and practical application, focusing on the real-world value of research, such as improving the robustness of robotic grasping and optimizing navigation in real time [3]
- The first 10 students to consult can be matched with dedicated mentors free of charge for in-depth analysis and tailored publication advice [4]
Figure AI officially releases its new humanoid robot: what eye-catching designs does it bring?
具身智能之心· 2025-10-10 03:14
The following article is from 机器觉醒时代 (Machine Awakening Era), by 机械偃甲, an account focused on the embodied robot track that follows and analyzes the next era's frontier of silicon-based intelligence, from technical breakthroughs to product deployment and from industry news to future outlooks.

In May 2022, serial entrepreneur Brett Adcock founded the humanoid robot company Figure in Silicon Valley.

On September 16, 2025, Figure announced the completion of its Series C financing. The round exceeded $1 billion and lifted the company's post-money valuation to $39 billion; the funds will mainly be used to accelerate large-scale deployment of general-purpose humanoid robots in real-world scenarios. Reaching a Series C only three years after founding, and a $39 billion valuation after the round, makes Figure currently the world's highest-valued humanoid robot unicorn.

On October 9, 2025, Figure released its third-generation humanoid robot, Figure 03. The robot stands about 1.68 m tall, weighs 60 kg, runs for up to 5 hours on a charge, carries a payload of 20 kg, and moves at up to 1.2 m/ ...
Qwen is getting into robotics: Lin Junyang officially announces an embodied intelligence team
具身智能之心· 2025-10-10 00:02
Core Insights
- Qwen, an open-source model leader, is moving into robotics by forming a dedicated embodied intelligence team, signaling a shift from virtual to physical applications [2][10]
- The establishment of this team aligns with Alibaba Cloud's broader strategy to support the embodied intelligence sector, which is gaining traction among global tech giants [10][12]

Summary by Sections

Qwen's Transition to Robotics
- Qwen has officially announced the formation of a small robotics and embodied intelligence team, aiming to leverage multimodal foundation models for long-horizon reasoning and real-world applications [2][10]
- This move is expected to strengthen the model's capabilities in real-world scenarios, addressing complexities such as feedback and uncertainty [10]

Market Dynamics and Investment Trends
- Recent investments in the robotics sector, such as the nearly 1 billion yuan A+ round for a robotics company led by Alibaba Cloud, highlight the growing interest in embodied intelligence [7][10]
- The global robotics market is projected to reach $7 trillion by 2050, attracting significant capital from various sources, including government funds [14]

Competitive Landscape
- Major players like NVIDIA and SoftBank are making substantial investments in robotics, with NVIDIA's CEO highlighting the potential for AI and robotics to drive trillions of dollars of long-term growth [11][12]
- SoftBank's acquisition of ABB's robotics business for $5.4 billion signals a strategic move to combine artificial superintelligence with robotics [12][13]

Technological Advancements
- Qwen's recent model updates, such as Qwen3-VL, focus on fine-grained visual understanding and 3D perception, providing a solid foundation for embodied intelligence applications [8][10]
- The integration of generative AI with robotics is expected to fundamentally change human-machine interaction, marking a significant evolution in the field [10]
Not black magic! HKUST, Tsinghua, and others team up to tear open the reasoning black box: RL makes AI think like humans
具身智能之心· 2025-10-10 00:02
Core Insights
- The article discusses recent research by teams from the Hong Kong University of Science and Technology, the University of Waterloo, and Tsinghua University, which reveals that large language models (LLMs) learn reasoning in a human-like manner by separating high-level strategy planning from low-level execution [3][10][12]

Group 1: Reinforcement Learning and LLMs
- Reinforcement learning (RL) enhances the reasoning capabilities of LLMs, although the underlying mechanisms had not been clearly understood until now [2][5]
- The research highlights the importance of RL in enabling models to exhibit reflective behaviors during interaction with the RL environment [7][10]
- Two significant experimental clues are identified, the "length scaling effect" and the "aha moment," indicating that LLMs learn to use more thinking time to solve reasoning tasks [8][9][10]

Group 2: Learning Dynamics
- The study outlines a two-phase learning dynamic in LLMs during RL training: the first phase consolidates basic execution skills, while the second phase shifts toward exploring high-level planning strategies [14][22]
- In the first phase, the model focuses on mastering low-level operations, marked by a decrease in the uncertainty of execution tokens [23][24]
- In the second phase, the model actively expands its library of planning strategies, which correlates with improved reasoning accuracy and longer solution chains [28][30]

Group 3: HICRA Algorithm
- The research introduces a new algorithm called HICRA (Hierarchy-Aware Credit Assignment), which emphasizes the learning of planning tokens over execution tokens to enhance reasoning capabilities (a minimal sketch of this reweighting idea follows this summary) [18][42]
- HICRA consistently outperforms mainstream methods like GRPO, particularly when the model has a solid foundation in execution skills [20][45]
- Experimental results show that HICRA yields significant improvements over GRPO on various reasoning benchmarks, indicating its effectiveness in optimizing planning tokens [46][47]

Group 4: Insights on Token Dynamics
- The study finds that the observed phenomena, such as "aha moments" and "length scaling," are not random but indicative of a structured learning process [33][35]
- Overall token-level entropy decreases as the model becomes more predictable at executing low-level steps, while the semantic entropy of planning tokens increases, reflecting the model's exploration of new strategies [39][40]
- The findings suggest that the key to enhancing reasoning capabilities lies in improving planning abilities rather than merely optimizing execution details [20][41]
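A minimal sketch of the hierarchy-aware credit-assignment idea is shown below: per-token advantages from a GRPO/PPO-style estimator are reweighted so that tokens flagged as planning tokens receive amplified credit. The function name, the mask-based planning-token detection, and the `alpha` factor are illustrative assumptions rather than the paper's exact HICRA formulation.

```python
import torch

def hierarchy_aware_advantages(advantages: torch.Tensor,
                               planning_mask: torch.Tensor,
                               alpha: float = 2.0) -> torch.Tensor:
    """Reweight per-token advantages so planning tokens receive more credit.

    advantages:    (batch, seq_len) advantages from a GRPO/PPO-style estimator
    planning_mask: (batch, seq_len) 1.0 where a token is treated as a high-level
                   planning token (e.g. connectives such as "first", "instead",
                   "let's verify"), else 0.0
    alpha:         amplification factor for planning tokens (illustrative value)
    """
    weights = 1.0 + (alpha - 1.0) * planning_mask  # execution tokens keep weight 1.0
    return advantages * weights

# Usage with a policy-gradient loss: the weighted advantages replace the raw ones,
# steering optimization pressure toward strategy-level tokens.
adv = torch.randn(2, 8)
mask = torch.zeros(2, 8)
mask[:, 0] = 1.0  # pretend the first token of each sequence is a planning token
weighted_adv = hierarchy_aware_advantages(adv, mask)
```

The design intent matches the article's claim: once low-level execution is reliable, most remaining gains come from exploring better plans, so credit is deliberately concentrated on the tokens that express those plans.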
DemoGrasp: how does a single demonstration achieve universal dexterous-hand grasping?
具身智能之心· 2025-10-10 00:02
Core Insights
- The article introduces DemoGrasp, a novel method for universal dexterous grasping that lets robots learn grasping strategies from a single demonstration [2][3][6]

Group 1: Methodology
- DemoGrasp uses a simple and efficient reinforcement learning framework that enables any dexterous hand to learn a universal grasping strategy from just one successful grasping demonstration [6]
- The method edits the trajectory of robot actions to adapt to new objects and poses, determining where and how to grasp through adjustments to the wrist and hand joint angles (a sketch of this editing step follows this summary) [2][3]

Group 2: Performance and Validation
- In simulation experiments, DemoGrasp achieved a 95% success rate using the Shadow hand on objects from the DexGraspNet dataset, outperforming existing methods [2]
- The method demonstrated strong transferability, achieving an average success rate of 84.6% on six unseen object datasets despite being trained on only 175 objects [2]

Group 3: Applications and Capabilities
- The learned policy successfully grasped 110 previously unseen real-world objects, including small and thin items, and is robust to variations in spatial position, background, and lighting [3]
- DemoGrasp supports both RGB and depth inputs and can be extended to language-guided grasping tasks in cluttered environments [3]
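The trajectory-editing idea can be sketched as follows: a single recorded grasp demonstration is adapted to a new object by applying a wrist-pose offset and finger-joint deltas chosen by a learned policy. The uniform per-step edit, the function signature, and the numbers in the usage example are illustrative assumptions, not DemoGrasp's exact implementation.

```python
import numpy as np

def edit_demo_trajectory(demo_wrist_poses: np.ndarray,
                         demo_joint_angles: np.ndarray,
                         wrist_offset: np.ndarray,
                         joint_delta: np.ndarray):
    """Adapt a single recorded grasp demonstration to a new object.

    demo_wrist_poses:  (T, 4, 4) homogeneous wrist poses from the one demonstration
    demo_joint_angles: (T, num_joints) finger joint angles from the demonstration
    wrist_offset:      (4, 4) rigid transform chosen by the policy for the new object
    joint_delta:       (num_joints,) per-joint adjustment chosen by the policy

    Both edits are applied uniformly over the trajectory; this uniform edit is an
    illustrative simplification of the policy-driven demo editing described above.
    """
    new_wrist = wrist_offset[None] @ demo_wrist_poses     # re-pose the whole approach
    new_joints = demo_joint_angles + joint_delta[None]    # widen or narrow the grasp shape
    return new_wrist, new_joints

# Usage: a policy observing the new object outputs (wrist_offset, joint_delta);
# the edited trajectory is then executed or scored in simulation.
T, J = 50, 22
poses = np.tile(np.eye(4), (T, 1, 1))
joints = np.zeros((T, J))
offset = np.eye(4)
offset[:3, 3] = [0.02, 0.0, 0.05]                         # hypothetical 2 cm / 5 cm shift
new_poses, new_joints = edit_demo_trajectory(poses, joints, offset, np.full(J, 0.1))
```

Framing grasping as "choose how to edit one demo" keeps the RL action space small, which is consistent with the article's emphasis on a simple and efficient training framework.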