具身智能之心
Open Source! Zhiyuan Robotics Officially Releases the First Embodied Intelligence Operating System Reference Framework: "Zhiyuan Lingqu OS"
具身智能之心· 2025-07-26 10:45
Core Insights
- The article highlights the launch of the "Lingqu OS" open-source initiative by Zhiyuan Robotics at WAIC 2025, aiming to build an open ecosystem for embodied intelligence [1][3][4]

Group 1: Event Overview
- WAIC 2025 took place on July 26 at the Shanghai Expo Center, focusing on the themes of technology, cooperation, and inclusivity in AI development [1]
- Zhiyuan Robotics' CTO, Peng Zhihui, represented embodied intelligence and showcased the Lingxi X2 humanoid robot, emphasizing the transition from tools to partners in human-robot collaboration [2][3]

Group 2: Human-Robot Interaction
- The dialogue between Zhiyuan Robotics and Lingxi X2 addressed critical questions about the role of robots as tools or partners, highlighting the importance of understanding in human-robot collaboration [2][3]
- Lingxi X2 demonstrated advanced capabilities with smooth movements and high-quality autonomous responses, showcasing the potential for deeper human-robot interactions [2]

Group 3: Open-Source Initiative
- The "Lingqu OS" open-source plan aims to improve the integration of current robotic systems and promote breakthroughs in new technologies for embodied intelligence [3][4]
- The initiative will adopt a "layered open-source, co-build and share" model, providing a stable and efficient framework for distributed real-time communication and hardware abstraction (a conceptual sketch of such layering follows below) [4]
- The open-source plan is set to begin in Q4 of this year, with the goal of fostering collaboration within the industry to overcome challenges in intelligence enhancement and cloud-edge integration [4]

Group 4: Industry Impact
- Zhiyuan Robotics aims to lead the industry towards collaborative development and commercial scalability for embodied intelligence, leveraging the WAIC 2025 platform to showcase its capabilities [5]
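Since Lingqu OS itself will only begin open-sourcing in Q4, no concrete API is available yet. Purely as a conceptual illustration of the "hardware abstraction plus distributed communication" layering mentioned above, here is a minimal Python sketch; every name in it (MessageBus, ActuatorDriver, MotionLayer, the topic string) is hypothetical and not taken from Zhiyuan's framework.

```python
# Hypothetical sketch of a layered hardware-abstraction + pub/sub pattern.
# None of these names come from Lingqu OS; they are illustrative only.
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Minimal in-process publish/subscribe bus (stand-in for a distributed
    real-time communication layer)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        for callback in self._subscribers[topic]:
            callback(message)


class ActuatorDriver(ABC):
    """Hardware abstraction layer: upper layers only see this interface."""

    @abstractmethod
    def send_command(self, joint_positions: List[float]) -> None: ...


class SimulatedArmDriver(ActuatorDriver):
    def send_command(self, joint_positions: List[float]) -> None:
        print(f"[sim] moving joints to {joint_positions}")


class MotionLayer:
    """Planning/control layer that is decoupled from concrete hardware."""

    def __init__(self, bus: MessageBus, driver: ActuatorDriver) -> None:
        self.driver = driver
        bus.subscribe("motion/goal", self.on_goal)

    def on_goal(self, message: dict) -> None:
        self.driver.send_command(message["joint_positions"])


if __name__ == "__main__":
    bus = MessageBus()
    MotionLayer(bus, SimulatedArmDriver())
    bus.publish("motion/goal", {"joint_positions": [0.0, 0.5, -0.3]})
```

The point of the layering is that MotionLayer never touches a concrete driver class, so swapping simulated hardware for real hardware only requires a new ActuatorDriver implementation.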
University of Virginia Proposes Moving Out: Seamless Human-Robot Collaboration in the Physical World!
具身智能之心· 2025-07-25 07:11
Core Insights
- The article emphasizes the need for a benchmark that simulates physical interactions and diverse collaboration scenarios to enhance the adaptability and generalization capabilities of intelligent agents in human-robot collaboration [3][6].

Group 1: Key Innovations
- Introduction of the Moving Out benchmark, a physically-grounded human-robot collaboration environment that simulates various collaborative modes influenced by physical properties and constraints [8].
- Design of two evaluation tasks aimed at assessing the adaptability of intelligent agents to human behavioral diversity and their ability to generalize to unknown physical properties [10][11].
- Proposal of the BASS method, which enhances collaboration performance in physical environments through behavior augmentation, simulation, and action selection [13][14].

Group 2: Experimental Results
- The BASS method demonstrated superior performance in both AI-AI and human-robot collaboration compared to baseline methods such as MLP, GRU, and Diffusion Policy [15][18].
- Evaluation metrics included Task Completion Rate (TCR), Normalized Final Distance (NFD), Waiting Time (WT), and Action Consistency (AC), with BASS showing significant improvements in these areas (a hedged aggregation sketch follows below) [16][17].
- User studies indicated that BASS significantly outperformed Diffusion Policy in terms of usefulness and physical understanding, reducing issues like object handover failures and delays in assistance [18].

Group 3: Related Work
- Existing human-AI collaboration research has limitations, and Moving Out addresses these by providing a physically-grounded environment, diverse collaboration modes, and continuous state-action spaces [19][21].
- Previous works often focused on discrete environments with limited physical attributes or lacked independent task division, highlighting the need for more comprehensive evaluation methods that consider physical interactions [21].
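The summary names four evaluation metrics but not their exact formulas. As a hedged illustration only, the sketch below aggregates per-episode logs into TCR/NFD/WT/AC-style numbers; the Episode fields and the normalization choices are assumptions and may differ from the definitions in the Moving Out paper.

```python
# Plausible sketch of aggregating the four metrics named above (TCR, NFD, WT,
# AC) from logged episodes. Field names and formulas here are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    completed: bool          # did the pair finish the task in time
    final_distance: float    # distance of objects from their targets at the end
    initial_distance: float  # same distance at the start (for normalization)
    waiting_steps: int       # steps an agent spent idle waiting for its partner
    total_steps: int
    consistent_actions: int  # steps where the agent kept to its committed action


def summarize(episodes: List[Episode]) -> dict:
    n = len(episodes)
    return {
        "TCR": sum(e.completed for e in episodes) / n,
        "NFD": sum(e.final_distance / max(e.initial_distance, 1e-6)
                   for e in episodes) / n,
        "WT": sum(e.waiting_steps / e.total_steps for e in episodes) / n,
        "AC": sum(e.consistent_actions / e.total_steps for e in episodes) / n,
    }


if __name__ == "__main__":
    demo = [Episode(True, 0.2, 2.0, 5, 100, 90),
            Episode(False, 1.5, 2.5, 20, 100, 70)]
    print(summarize(demo))
```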
Classes Officially Begin! An Embodied Intelligence Goal-Oriented Navigation Algorithms and Hands-On Practice Tutorial Is Here
具身智能之心· 2025-07-25 07:11
Core Viewpoint
- Goal-Oriented Navigation empowers robots to autonomously complete navigation tasks based on goal descriptions, marking a significant shift from traditional visual language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, relying on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-Oriented Navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized across various verticals, including delivery, healthcare, and hospitality, with companies like Meituan and Aethon deploying autonomous delivery robots [3].

Group 2: Technological Evolution
- The evolution of Goal-Oriented Navigation can be categorized into three generations (a schematic sketch of the second-generation modular pipeline follows below):
  1. First Generation: end-to-end methods focusing on reinforcement learning and imitation learning, achieving breakthroughs in Point Navigation and closed-set image navigation tasks [5].
  2. Second Generation: modular methods that explicitly construct semantic maps, breaking tasks into exploration and goal localization [5].
  3. Third Generation: integration of large language models (LLMs) and visual language models (VLMs) to enhance knowledge reasoning and open-vocabulary target matching [7].

Group 3: Challenges in Learning
- Learning Goal-Oriented Navigation is challenging because it requires knowledge from multiple domains, including natural language processing, computer vision, and reinforcement learning [9].
- The fragmented nature of the knowledge and the sheer volume of literature can overwhelm beginners, making it difficult to extract frameworks and understand development trends [9].

Group 4: Course Offering
- A new course has been developed to address these challenges, focusing on quick entry, building a domain framework, and combining theory with practice [10][11][12].
- The course includes a comprehensive curriculum covering semantic navigation frameworks, the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [13][16][17][19][20][23].
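To make the "second generation" modular idea above concrete, here is a schematic Python sketch of an explore-then-localize loop over a coarse semantic map. It is not Habitat code or the course's codebase; the SemanticMap class, the random exploration stub, and the perception stub are all illustrative assumptions.

```python
# Schematic sketch of a modular goal-navigation loop: explore, update a
# semantic map, and switch to goal-directed planning once the target is seen.
import random
from typing import Optional, Tuple

import numpy as np


class SemanticMap:
    """Coarse 2D grid with one semantic class id per cell (0 = unknown)."""

    def __init__(self, size: int = 100) -> None:
        self.size = size
        self.grid = np.zeros((size, size), dtype=np.int32)

    def update(self, position: Tuple[int, int], observed_class: int) -> None:
        self.grid[position] = observed_class

    def locate(self, target_class: int) -> Optional[Tuple[int, int]]:
        cells = np.argwhere(self.grid == target_class)
        return tuple(cells[0]) if len(cells) else None


def navigate(target_class: int, max_steps: int = 50) -> None:
    semantic_map = SemanticMap()
    position = (50, 50)
    for step in range(max_steps):
        # perception stub: pretend an onboard detector labels the current cell
        observed_class = random.choice([0, 0, 0, target_class])
        semantic_map.update(position, observed_class)

        goal = semantic_map.locate(target_class)
        if goal is not None:
            print(f"step {step}: target seen at {goal}, switching to path planning")
            return
        # exploration stub: move to a random neighboring cell (clamped to the grid)
        position = tuple(
            min(max(p + random.choice([-1, 0, 1]), 0), semantic_map.size - 1)
            for p in position
        )
    print("exploration budget exhausted without finding the target")


if __name__ == "__main__":
    navigate(target_class=3)
```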
A Developer Perk! One Machine Covers Humanoid Motion Control, Reinforcement Learning, and VLN/VLA
具身智能之心· 2025-07-25 07:11
Core Viewpoint
- TRON1 is a cutting-edge research platform designed for educational and scientific purposes, featuring a modular design that supports multiple robotic forms and algorithms, catering to diverse research needs [1].

Group 1: Product Features
- TRON1 supports humanoid gait development and is well suited to reinforcement learning research, with the EDU version allowing external camera integration for navigation and perception tasks [6][24].
- The platform supports C++ and Python for development, making it accessible for users without C++ knowledge (a hypothetical Python control-loop sketch follows below) [6].
- It features a "three-in-one" modular design that allows quick switching between bipedal, point-foot, and wheeled locomotion [1].

Group 2: Technical Specifications
- The platform is compatible with major simulation platforms like NVIDIA Isaac, Mujoco, and Gazebo, enhancing validation efficiency and lowering research barriers [9].
- TRON1 can be equipped with a robotic arm for various mobile manipulation tasks, supporting both single-arm and dual-foot configurations [11].
- It integrates LiDAR and depth cameras for 3D mapping, localization, navigation, and dynamic obstacle avoidance [13].

Group 3: Hardware and Performance
- The TRON1 standard version and EDU version share similar mechanical parameters, with a load capacity of approximately 10 kg and a maximum speed of 5 m/s in wheeled locomotion [26].
- The platform is powered by an 8-core Arm Cortex-A78AE CPU and an NVIDIA Ampere architecture GPU with AI computing power of 157 TOPS (sparse) and 78 TOPS (dense) [16][19].
- The battery supports a maximum power of 1000 W, with a runtime of over 2 hours under rated conditions [26].

Group 4: User Support and Development
- Comprehensive user manuals and development guides are provided, ensuring ease of use and support for new users [29][33].
- The platform comes with one year of after-sales service following acceptance, with paid maintenance and parts support available thereafter [40].
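The article only states that TRON1 can be programmed in C++ or Python; it does not document the SDK. The sketch below is therefore a hypothetical high-level velocity-command loop: RobotClient, connect, and send_velocity are invented names standing in for whatever interface the vendor actually ships.

```python
# Hypothetical sketch of a high-level Python control loop for a platform like
# TRON1. The SDK names below are invented for illustration, not the vendor API.
import math
import time


class RobotClient:
    """Stand-in for a vendor SDK client (here it only prints the commands)."""

    def connect(self, address: str) -> None:
        print(f"connected to {address}")

    def send_velocity(self, vx: float, vy: float, yaw_rate: float) -> None:
        print(f"cmd: vx={vx:.2f} m/s, vy={vy:.2f} m/s, yaw={yaw_rate:.2f} rad/s")


def gentle_weave(client: RobotClient, duration_s: float = 5.0, hz: float = 10.0) -> None:
    """Drive forward while slowly oscillating the yaw rate."""
    steps = int(duration_s * hz)
    for i in range(steps):
        t = i / hz
        client.send_velocity(vx=0.5, vy=0.0, yaw_rate=0.8 * math.sin(t))
        time.sleep(1.0 / hz)


if __name__ == "__main__":
    client = RobotClient()
    client.connect("192.168.1.10")
    gentle_weave(client)
```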
Preparing to Expand the Embodied Intelligence Team and Bring Some People Together to Build Something.......
具身智能之心· 2025-07-25 07:11
Core Insights
- The field of embodied intelligence is rapidly developing, with several star companies preparing for IPOs, indicating a growing market and investment opportunities [1]
- Collaboration and communication within the industry are essential for overcoming early-stage challenges and fostering overall development [1]
- The company aims to create a platform that gathers talent from the entire industry to promote progress [1]

Group 1: Project Collaboration
- The company is establishing research teams in major cities such as Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan, seeking to recruit around 10 individuals per city [3]
- Candidates are expected to have over 2 years of experience in embodied algorithms and robotics research [3]

Group 2: Education and Consulting Services
- The company invites industry experts to develop online courses and consulting services in the field of embodied intelligence [4]
- Areas of focus include large models, multi-modal models, reinforcement learning, and robotics, among others [4][5]

Group 3: Compensation and Recruitment
- The company offers significant profit-sharing and resource sharing across the industry, with options for part-time or full-time positions [6]
- A preference for candidates with a PhD or equivalent experience in research and development is noted [5]
NVIDIA's Latest! ThinkAct: Few-Shot Adaptation and Long-Horizon Planning for Complex Embodied Tasks
具身智能之心· 2025-07-24 09:53
Core Insights
- The article introduces ThinkAct, a dual-system framework designed to enhance the reasoning capabilities of multi-modal large language models (MLLMs) in physical environments by connecting high-level reasoning with low-level action execution [4][9][12]
- ThinkAct aims to address the limitations of existing VLA models, which struggle with long-term planning and adapting to complex tasks, by utilizing reinforced visual latent planning [4][6][9]

Group 1: Framework and Methodology
- ThinkAct employs a structured approach to VLA reasoning tasks, where the model receives visual observations and textual instructions to predict actions, effectively linking abstract planning with low-level control (a toy dual-system rollout sketch follows below) [12][21]
- The framework utilizes reinforcement learning to enhance the reasoning capabilities of MLLMs, encouraging them to generate low-level actions after reasoning through the task [13][19]
- A novel action-aligned visual feedback mechanism is introduced to capture long-term goals and encourage visual associations during the planning process [14][18]

Group 2: Performance Evaluation
- ThinkAct demonstrates superior performance in various robotic manipulation tasks, achieving a top success rate of 84.4% on the LIBERO benchmark and outperforming models such as DiT-Policy and CoT-VLA [25][26]
- In the SimplerEnv evaluation, ThinkAct outperformed baseline action models by significant margins, achieving overall scores of 71.5%, 65.1%, and 43.8% across different settings [25]
- The framework also excels in embodied reasoning tasks, showing advantages in long-term and multi-step planning capabilities, as evidenced by its performance on the EgoPlan-Bench2 and RoboVQA benchmarks [26][27]

Group 3: Qualitative Insights
- The article provides qualitative examples illustrating ThinkAct's reasoning process and execution in tasks, showcasing its ability to decompose instructions into meaningful sub-goals and visualize planning trajectories [30][31]
- The framework's reinforcement learning adjustments significantly enhance its reasoning capabilities, allowing it to better understand tasks and environments compared to cold-start models [31][32]

Group 4: Adaptability and Error Correction
- ThinkAct demonstrates effective few-shot adaptation capabilities, successfully generalizing to unseen environments and new skills with minimal demonstration samples [35][37]
- The framework can detect execution errors and perform self-correction, using structured reasoning to reconsider the task and generate corrective plans when faced with failures [37][38]
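As a toy illustration of the dual-system split described above (slow multi-modal reasoning producing a latent plan, fast action model consuming it at every control step), here is a schematic Python sketch. The module names, the 16-dimensional plan latent, and the replanning interval are arbitrary assumptions, not ThinkAct's actual architecture or training code.

```python
# Schematic sketch of a dual-system rollout: a slow reasoner refreshes a latent
# plan every few steps, a fast actor conditions on it at every control step.
# Toy illustration only, not ThinkAct's implementation.
import numpy as np


class SlowReasoner:
    """Stand-in for the multi-modal LLM that produces a visual latent plan."""

    def plan(self, observation: np.ndarray, instruction: str) -> np.ndarray:
        # pretend to reason over the image + instruction and emit a plan latent
        rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
        return rng.standard_normal(16)


class FastActor:
    """Stand-in for the low-level action model conditioned on the plan latent."""

    def act(self, observation: np.ndarray, plan_latent: np.ndarray) -> np.ndarray:
        # toy policy: project observation and plan into a 7-DoF action
        return np.tanh(observation[:7] * 0.01 + plan_latent[:7] * 0.1)


def rollout(instruction: str, steps: int = 20, replan_every: int = 5) -> None:
    reasoner, actor = SlowReasoner(), FastActor()
    plan_latent = None
    for t in range(steps):
        observation = np.random.rand(64)  # stand-in for an image embedding
        if t % replan_every == 0:         # slow system runs at low frequency
            plan_latent = reasoner.plan(observation, instruction)
        action = actor.act(observation, plan_latent)
        print(f"t={t:02d} action[:3]={np.round(action[:3], 3)}")


if __name__ == "__main__":
    rollout("put the red block into the bowl")
```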
Embodied Intelligence Companies Pulled in Two Directions! Large Funding Rounds on One Side, Unfilled Positions on the Other.......
具身智能之心· 2025-07-24 09:53
It's surreal! Embodied intelligence has a flood of open positions on one side and companies that "can't hire anyone" on the other......

Recently, members of our knowledge planet community have come to me to vent: Feng, why do so many embodied intelligence companies, which clearly have money (more funding than they can spend) and plenty of publicly posted positions, keep interviewing without extending offers while insisting they can't find people???

Having lived through the full autonomous driving development cycle, I find the answer simple. Companies have money in their pockets, but they no longer dare to spend it freely; they stay cautious and budget carefully for the long haul. This industry cycle will still be a long one. Companies that spend recklessly and without a plan will die quickly, and the shakeout will play out within the next 2-3 years.

Many embodied intelligence companies' products (including robot bodies, algorithms, and data) are still immature, a point we have analyzed in detail in the 具身智能之心 knowledge planet. As a result, researchers with strong results are the ones every company races to recruit, for example in humanoid robot stability, data scale, effective data usage, and generalization. The inflection point for breakthroughs in underlying technology is not yet in sight, so everyone wants to stockpile supplies for the coming winter. For job seekers, this means you need both solid technical skills and a research direction that closely matches embodied intelligence.

The 具身智能之心 knowledge planet, as the largest embodied intelligence technology community in China, has continuously supplied the industry and individuals with talent and with industrial and academic information. It now covers nearly all mainstream embodied intelligence companies and most well-known research institutions at home and abroad. If you want first-hand insight into the industry, job hunting, and industry pain points, you are welcome to join us. An earnest ...
Zebra-CoT: A Pioneering Visual Chain-of-Thought Dataset Arrives, Lifting Multi-Modal Reasoning Accuracy by 13%
具身智能之心· 2025-07-24 09:53
Core Viewpoint
- The article discusses the development of Zebra-CoT, a large-scale and diverse dataset aimed at enhancing visual reasoning capabilities in multi-modal models, addressing the weak performance of existing visual CoT approaches and the lack of high-quality training data [3][4].

Dataset Construction
- Zebra-CoT consists of 182,384 samples, providing logically interleaved text-image reasoning trajectories across four main task categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic and strategy games (a hypothetical sample schema follows below) [6][12].
- The dataset overcomes limitations of existing datasets by offering a diverse range of tasks and ensuring high-quality text reasoning data, unlike previous datasets that focused on single tasks or lacked clear reasoning structures [6][18].

Task Coverage
- The dataset covers four major task categories:
  - Scientific reasoning includes geometry, physics, chemistry, and algorithm problems [9].
  - 2D visual reasoning encompasses visual search and visual puzzles [9].
  - 3D visual reasoning involves multi-hop object counting and robot planning [9].
  - Visual logic and strategy games feature chess, checkers, mazes, and more [9].

Data Sources and Processing
- Real-world data is sourced from online resources, ensuring high-quality problem extraction and addressing issues of logical connections between modalities [10].
- Synthetic data is generated using templates and visual language models (VLMs) to enhance reasoning diversity and expressiveness [10].

Model Fine-tuning and Performance
- Fine-tuning the Anole-7B model on Zebra-CoT improved accuracy from 4.2% to 16.9%, a fourfold increase, with notable improvements on visual logic benchmarks [14].
- The Bagel-7B model, after fine-tuning, was able to generate high-quality interleaved visual reasoning chains, showcasing the dataset's effectiveness for developing multi-modal reasoning capabilities [14].

Limitations
- Despite its strengths, the dataset relies on template generation for synthetic data, which may limit the diversity and expressiveness of the text reasoning [18].
- Some sub-tasks within the dataset have small sample sizes, potentially affecting model performance in those areas [18].
- Fine-tuning results vary across tasks, with some showing insignificant or even decreased performance, indicating a need for further optimization [18].
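The summary describes interleaved text-image reasoning trajectories but not the dataset's concrete schema. The sketch below shows one plausible way such a sample could be represented and flattened into a training sequence with image placeholders; all field names and the <image> token are assumptions for illustration, not the real Zebra-CoT format.

```python
# Hypothetical sketch of an interleaved text-image reasoning sample and how it
# could be flattened into a training sequence. Not the real Zebra-CoT schema.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageStep:
    path: str          # path to an intermediate visual-reasoning image


@dataclass
class TextStep:
    text: str          # a textual reasoning step


@dataclass
class Sample:
    task_category: str                      # e.g. "2D visual reasoning"
    question: str
    trajectory: List[Union[TextStep, ImageStep]]
    answer: str


def to_training_sequence(sample: Sample, image_token: str = "<image>") -> str:
    """Flatten the interleaved trajectory into one sequence with image placeholders."""
    parts = [sample.question]
    for step in sample.trajectory:
        parts.append(step.text if isinstance(step, TextStep) else image_token)
    parts.append(f"Answer: {sample.answer}")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = Sample(
        task_category="visual logic and strategy games",
        question="White to move: find the mating move.",
        trajectory=[TextStep("Consider checks first."),
                    ImageStep("boards/step1.png"),
                    TextStep("Qh7 is defended, so try Qg8+.")],
        answer="Qg8#",
    )
    print(to_training_sequence(demo))
```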
The 具身智能之心 Job-Hunting Exchange Group Is Here!!!
具身智能之心· 2025-07-23 15:16
At the request of our many followers, we have officially started running a job-hunting community for embodied intelligence. The community mainly discusses the embodied intelligence industry, companies, product R&D, job hunting, and job switching. If you want to meet more peers in the industry and get first-hand industry news, you are welcome to join us! Scan the QR code on WeChat to add our assistant and be invited into the group; note your nickname + embodied job hunting. The 具身智能之心 job-hunting and industry exchange group has been established! ...
The Perception Module Embodied Intelligence Cannot Do Without! The Best-Value 3D Laser Scanner Is Here
具身智能之心· 2025-07-23 09:48
Core Viewpoint
- GeoScan S1 is presented as the most cost-effective 3D laser scanner in China, featuring lightweight design, one-click operation, and centimeter-level precision for real-time 3D scene reconstruction [1][5].

Group 1: Product Features
- The GeoScan S1 can generate point clouds at a rate of 200,000 points per second, with a maximum measurement distance of 70 meters and 360° coverage, supporting large scenes over 200,000 square meters [1][28][31].
- It integrates multiple sensors, including a high-precision IMU and RTK, enabling it to handle complex indoor and outdoor environments effectively [33][46].
- The device supports various data export formats such as PCD, LAS, and PLY, and operates on Ubuntu 20.04, compatible with ROS (a minimal ROS capture sketch follows below) [22].

Group 2: System Specifications
- The system has a relative accuracy of better than 3 cm and an absolute accuracy of better than 5 cm [22].
- The device dimensions are 14.2 cm x 9.5 cm x 45 cm, weighing 1.3 kg without the battery and 1.9 kg with the battery, with a power input range of 13.8 V to 24 V [22].
- It features a battery capacity of 88.8 Wh, providing approximately 3 to 4 hours of operational time [22][25].

Group 3: Software and Usability
- The GeoScan S1 offers a user-friendly interface with simple operation, allowing for quick scanning and immediate data export without complex setups [5][42].
- It includes a 3D Gaussian data collection module for high-fidelity scene restoration, enabling the digital replication of real-world environments [52].
- The software supports both offline and online rendering, enhancing usability for various applications [5][61].

Group 4: Market Position and Pricing
- The company offers multiple versions of the GeoScan S1, including a basic version priced at 19,800 yuan and a 3DGS offline version at 67,800 yuan, catering to diverse customer needs [61][64].
- The product is positioned as having the best price-performance ratio in the industry, integrating multiple sensors and advanced features [5][61].
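Because the scanner runs Ubuntu 20.04 with ROS support and exports PCD among other formats, a minimal capture script is easy to imagine. The ROS 1 sketch below subscribes to a point-cloud topic and writes one frame to a PCD file with Open3D; the topic name /geoscan/points is an assumption, since the article does not list the device's actual topics.

```python
# Minimal ROS 1 (rospy) sketch: grab one point-cloud message and save it as a
# PCD file. The topic name "/geoscan/points" is an assumption.
import numpy as np
import open3d as o3d
import rospy
from sensor_msgs import point_cloud2
from sensor_msgs.msg import PointCloud2


def save_one_frame(msg: PointCloud2) -> None:
    # read x/y/z fields, skipping NaN returns from the lidar
    points = np.array(list(point_cloud2.read_points(
        msg, field_names=("x", "y", "z"), skip_nans=True)))
    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(points)
    o3d.io.write_point_cloud("frame.pcd", cloud)
    rospy.loginfo("saved %d points to frame.pcd", len(points))
    rospy.signal_shutdown("done")


if __name__ == "__main__":
    rospy.init_node("geoscan_pcd_saver")
    rospy.Subscriber("/geoscan/points", PointCloud2, save_one_frame, queue_size=1)
    rospy.spin()
```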