具身智能之心
In an interview, I was asked what the "brain and cerebellum" algorithms of embodied intelligence are...
具身智能之心· 2025-10-08 02:49
Core Insights
- The article discusses the evolution and current state of embodied intelligence, focusing on the "brain and cerebellum" division of labor in robotics: the brain handles perception and planning, while the cerebellum is responsible for execution [3][10].

Technical Evolution
- The development of embodied intelligence has progressed through several stages, from grasp pose detection to behavior cloning, and now to diffusion policies and VLA models, marking a shift from low-level perception to high-level understanding and generalization [7][10].
- The first stage focused on grasp pose detection, using point clouds or images for static object manipulation, but lacked context modeling for complex tasks [7].
- The second stage introduced behavior cloning, letting robots learn from expert demonstrations, but it struggled to generalize and to handle multi-target scenarios [7].
- The third stage, emerging in 2023, introduced diffusion policy methods that improve stability and generalization by modeling whole action sequences [8] (an illustrative sketch follows this summary).
- The fourth stage, anticipated in 2024, emphasizes integrating VLA models with reinforcement learning and world models, enhancing robots' predictive capabilities and multi-modal perception [9][10].

Current Trends and Applications
- Combining VLA with reinforcement learning improves trial-and-error learning and self-improvement on long-horizon tasks, while combining it with world models enables future prediction and better planning [10].
- The industry is seeing a surge of products around humanoid robots, robotic arms, and quadrupeds, serving sectors such as industry, home, dining, and medical rehabilitation [10].
- As embodied intelligence moves from research to deployment, demand for engineering skills is growing, particularly in simulation and policy training [14].

Educational Initiatives
- The article outlines a structured curriculum offering comprehensive coverage of embodied intelligence algorithms for both beginners and advanced learners [11][20].
- The course pairs practical projects with supervision to improve learning outcomes, covering modules such as diffusion policy, VLA, and tactile sensing [11][14].
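For the diffusion policy stage mentioned above, the sketch below illustrates the basic idea of denoising an action sequence from Gaussian noise. The toy noise-prediction MLP, 16-step action horizon, 7-DoF action space, observation embedding size, and DDPM schedule are all illustrative assumptions, not details from the article.

```python
# Minimal diffusion-policy-style sampler (illustrative sketch only).
# Assumptions: a toy noise-prediction MLP, a 16-step action horizon, 7-DoF actions,
# a 32-dim observation embedding, and a linear DDPM noise schedule.
import torch

T = 50                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

H, A, OBS = 16, 7, 32                     # horizon, action dim, obs-embedding dim (assumed)
eps_model = torch.nn.Sequential(          # stand-in for a conditioned U-Net / transformer
    torch.nn.Linear(H * A + OBS + 1, 256), torch.nn.ReLU(), torch.nn.Linear(256, H * A)
)

@torch.no_grad()
def sample_actions(obs_emb: torch.Tensor) -> torch.Tensor:
    """Denoise a full H x A action sequence conditioned on an observation embedding."""
    a = torch.randn(1, H * A)                                        # start from pure noise
    for t in reversed(range(T)):
        t_feat = torch.full((1, 1), t / T)
        eps = eps_model(torch.cat([a, obs_emb, t_feat], dim=-1))     # predict injected noise
        a = (a - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)            # stochastic reverse step
    return a.view(H, A)

actions = sample_actions(torch.zeros(1, OBS))                        # dummy observation embedding
```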
Latest from Princeton! VLM2VLA: fine-tuning a VLM into a VLA while avoiding catastrophic forgetting
具身智能之心· 2025-10-07 10:00
Core Insights
- The article discusses the catastrophic forgetting problem that arises when fine-tuning Vision-Language Models (VLMs) into Vision-Language-Action models (VLAs) for robotic control, highlighting the mismatch between pre-training and fine-tuning data distributions [2][4].

Group 1: Catastrophic Forgetting
- Catastrophic forgetting occurs when the model loses its original reasoning and multimodal understanding capabilities during action-generation training [2].
- The root cause is the distribution mismatch between internet-scale pre-training data (primarily image-text pairs) and the low-dimensional action vectors used for robotic fine-tuning [2].

Group 2: VLM2VLA Approach
- VLM2VLA addresses the mismatch by converting low-dimensional actions into natural language descriptions, aligning the fine-tuning data with the pre-training data [3][4].
- The method fine-tunes with low-rank adaptation (LoRA), minimizing modifications to the VLM backbone and avoiding catastrophic forgetting [4].

Group 3: Hierarchical Action Representation
- The VLM2VLA framework decomposes action prediction into a three-level reasoning process, using natural language descriptions at every level [6].
- High-level subtask prediction generates intermediate tasks from the initial observation and the overall task instruction [6].
- Mid-level motion planning produces spatially grounded movement descriptions, while low-level action generation creates executable action sequences with language annotations [6].

Group 4: Data Reconstruction Pipeline
- VLM2VLA uses Gemini 2.5 to automatically reconstruct raw robot trajectory datasets into language-annotated datasets compatible with VLM pre-training formats [9].
- The reconstruction pipeline provides context, decomposes trajectories into subtasks, and standardizes the format to align with VLM data [9].

Group 5: Efficient Fine-Tuning Strategy
- The Gemma-3-12B-IT model is fine-tuned with LoRA on its linear layers, without altering the VLM architecture or requiring joint training with internet-scale data [12][13].
- Key training parameters include a LoRA rank of 16, a learning rate of 5e-5, and an effective batch size of 8 [12][13] (a configuration sketch follows this summary).

Group 6: Experimental Validation
- Experiments address three core questions against baseline models: whether multimodal understanding is retained, whether robotic manipulation performance is competitive, and whether knowledge generalizes to new scenarios [14][15].
- VLM2VLA performs competitively on both in-distribution and out-of-distribution tasks, demonstrating its hierarchical reasoning capabilities [17][19].

Group 7: Limitations and Future Directions
- The model still faces reasoning latency and needs larger-scale language-annotated robot datasets to improve generalization [19].
- Future work may optimize decoding strategies, extend language annotation to dexterous actions, and integrate verification capabilities into the VLM itself [19][22].
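The Group 5 hyperparameters map naturally onto a standard peft LoRA setup; a minimal sketch follows. The Hugging Face model ID and loading class, the target module names, and lora_alpha are assumptions; only the rank, learning rate, and effective batch size come from the summary.

```python
# LoRA fine-tuning sketch reflecting the reported hyperparameters
# (rank 16, learning rate 5e-5, effective batch size 8).
# The model ID/class, target modules, and lora_alpha below are assumptions.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")    # assumed HF ID

lora_cfg = LoraConfig(
    r=16,                                                     # LoRA rank (from the summary)
    lora_alpha=32,                                            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)    # only the low-rank adapters become trainable

training_args = TrainingArguments(
    output_dir="vlm2vla-lora",
    learning_rate=5e-5,                    # from the summary
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,         # 2 x 4 = effective batch size of 8
    num_train_epochs=1,
)
```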
Getting ready to head back and grind...
具身智能之心· 2025-10-07 10:00
Group 1
- The number of embodied intelligence companies in China is approaching 200, indicating intense competition and potential market saturation [1].
- Companies are adopting different strategies: some focus on integrated applications while others prioritize core research and development, which may help them survive market challenges [1].
- The community built around embodied intelligence aims to provide a platform for knowledge sharing, job referrals, and academic guidance, addressing the high trial-and-error costs newcomers face [2][4].

Group 2
- The community has established a closed loop spanning industry, academia, and job exchange, offering solutions to problems and connections to job opportunities [4].
- A compilation of over 30 technical routes helps users find benchmarks and learning pathways efficiently [4].
- The community invites industry experts to join discussions and share insights on the latest developments in embodied intelligence [4][12].

Group 3
- The community offers learning paths for beginners and advanced users alike, including reinforcement learning and multi-modal large model understanding [12][13].
- Members have access to a job referral mechanism for timely connections with target companies [12][20].
- The community has compiled extensive resources, including open-source projects, datasets, and technical learning routes, to support knowledge acquisition and project development [12][29][35].
New SOTA! JanusVLN: dual implicit memory decouples semantics and space, significantly cutting computation and inference overhead
具身智能之心· 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, an innovative framework for Vision-Language Navigation (VLN) that addresses the limitations of existing methods through a Dual Implicit Memory paradigm, which decouples visual semantics from spatial geometry [2][19].

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information distortion and loss from relying on textual cognitive maps, low computational and reasoning efficiency from storing historical image frames, and memory inflation leading to "memory explosion" [3][5].

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, effectively separating semantic memory from spatial geometric memory [7][19].
- The framework uses a pre-trained 3D visual geometry model (VGGT) to derive spatial geometric information from a single RGB video stream, strengthening the model's spatial perception [8][19].
- A hybrid incremental update strategy maintains a fixed-size memory, significantly improving reasoning efficiency by avoiding redundant computation [8][11] (an illustrative buffer sketch follows this summary).

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and the hybrid incremental update strategy [10][11].
- The dual-encoder architecture pairs a 2D visual semantic encoder with a 3D spatial geometric encoder, which together provide comprehensive scene understanding [11].

Experimental Results
- JanusVLN was evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15].
- The framework excels at spatial reasoning tasks, successfully completing complex navigation challenges [18][21].

Quantitative Analysis
- JanusVLN improves success rate (SR) substantially, outperforming advanced methods that rely on expensive inputs by 10.5 to 35.5 percentage points [21].
- Compared with other SOTA methods that use RGB input with explicit memory, JanusVLN achieves a 3.6 to 10.8 percentage-point SR improvement, validating the Dual Implicit Memory paradigm [21].
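The summary does not spell out the hybrid incremental update rule, so the sketch below only illustrates the general shape of a fixed-size, incrementally updated memory: a few early anchor frames kept permanently plus a sliding window of recent frames. The buffer sizes and the anchor-plus-window split are assumptions, not JanusVLN's actual strategy.

```python
# Illustrative fixed-size memory with an incremental update.
# Assumption: keep a few early "anchor" frames plus a sliding window of recent
# frames; only the newest frame is encoded at each step. JanusVLN's real hybrid
# update rule may differ.
from collections import deque
import numpy as np

class ImplicitMemory:
    def __init__(self, num_anchor: int = 4, window: int = 12):
        self.anchors: list = []                    # early frames, kept for the whole episode
        self.recent: deque = deque(maxlen=window)  # sliding window of recent frames
        self.num_anchor = num_anchor

    def update(self, frame_feat: np.ndarray) -> None:
        """Insert one new frame feature without re-encoding past frames."""
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(frame_feat)
        else:
            self.recent.append(frame_feat)         # oldest recent frame drops out

    def read(self) -> np.ndarray:
        """Fixed-size memory exposed to the policy (anchors + recent window)."""
        items = self.anchors + list(self.recent)
        return np.stack(items) if items else np.empty((0, 0))

# In a dual-memory setup, one buffer would hold 2D semantic features and a second
# would hold 3D geometric features (e.g. from the VGGT encoder).
semantic_mem, spatial_mem = ImplicitMemory(), ImplicitMemory()
for step in range(40):
    feat = np.random.rand(256)          # stand-in for a per-frame feature vector
    semantic_mem.update(feat)
    spatial_mem.update(feat)
print(semantic_mem.read().shape)        # (16, 256): 4 anchors + 12-frame window
```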
The first robotic arm for embodied research: my advisor recommended this one, easy to use and great value for money!
具身智能之心· 2025-10-07 03:03
Core Viewpoint
- The article introduces the Imeta-y1, a lightweight and cost-effective robotic arm designed for embodied research, addressing the need for affordable yet high-quality hardware among researchers and practitioners [2][4].

Product Overview
- The Imeta-y1 robotic arm is designed for educational, research, and light industrial applications, featuring high-precision motion control, low power consumption, and an open software and hardware architecture [4][5].
- It supports seamless simulation-to-real integration and ships with a comprehensive open-source SDK and toolchain, so users can quickly move through algorithm validation, data collection, model training, and application deployment [4][15].

Technical Specifications
- The arm weighs 4.2 kg, carries a rated load of 3 kg, and has 6 degrees of freedom, with a working radius of 612.5 mm and a repeat positioning accuracy of ±0.1 mm [7][17].
- It operates at 24 V, is controlled from a PC, and communicates over CAN [7][17].

Product Advantages
- The product offers a complete toolchain from data collection to model training and inference deployment, supporting multi-modal data fusion and compatibility with mainstream frameworks such as TensorFlow and PyTorch [15][30].
- It provides URDF models for real-time interaction between simulation environments such as Gazebo and the physical device, significantly reducing development risk and debugging cost [20][30].

Development Support
- The SDK supports C++ and Python, so developers can get started quickly regardless of language preference [16][24].
- The product supports both ROS1 and ROS2 development, with planned future upgrades and responsive customer service [17][24] (a minimal ROS 2 usage sketch follows this summary).

Testing and Reliability
- The arm undergoes a rigorous hardware testing process, including precision calibration, durability, load performance, and stability verification, ensuring reliability and safety across application scenarios [33][40].

After-Sales Service
- The company commits to delivery within 1-2 weeks and provides timely after-sales support, with a six-month warranty for faults not caused by user damage [42][43].
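Since ROS1/ROS2 support and URDF models are advertised, a minimal ROS 2 sketch of commanding such a 6-DoF arm through a joint-trajectory topic is shown below. The topic name and joint names are hypothetical, and the vendor's own SDK API may differ; consult the shipped SDK and URDF for the real interface.

```python
# Minimal ROS 2 sketch for sending a joint trajectory to a 6-DoF arm.
# The topic "/imeta_y1/joint_trajectory" and the joint names are hypothetical.
import rclpy
from rclpy.node import Node
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from builtin_interfaces.msg import Duration

def main() -> None:
    rclpy.init()
    node = Node("imeta_y1_demo")
    pub = node.create_publisher(JointTrajectory, "/imeta_y1/joint_trajectory", 10)

    msg = JointTrajectory()
    msg.joint_names = [f"joint_{i}" for i in range(1, 7)]   # hypothetical joint names
    point = JointTrajectoryPoint()
    point.positions = [0.0, -0.5, 0.8, 0.0, 0.6, 0.0]       # radians, illustrative pose
    point.time_from_start = Duration(sec=2)
    msg.points = [point]

    pub.publish(msg)
    rclpy.spin_once(node, timeout_sec=0.5)                  # let the message flush
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```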
A "blind" robot's 30-second parkour debut stuns, with no vision at all!
具身智能之心· 2025-10-07 03:03
Core Insights
- The article discusses advances in humanoid robotics, focusing on Amazon's FAR (Frontier AI for Robotics) team and their new technique, OmniRetarget, which enables robots to perform complex tasks without visual sensors [9][49].

Group 1: OmniRetarget Technology
- OmniRetarget allows reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments, achieving zero-shot transfer from simulation to humanoid robots [12][29].
- The technique uses an interaction mesh to model spatial and contact relationships among the robot, objects, and terrain, improving data efficiency and reducing data collection cost [15][25] (a generic interaction-mesh sketch follows this summary).
- OmniRetarget outperforms other motion retargeting methods on key criteria such as hard constraints, object interaction, terrain interaction, and data augmentation [16][40].

Group 2: Experimental Results
- The research team demonstrated OmniRetarget's broad capabilities, including natural object manipulation and terrain interaction, achieving a 79.1% success rate on augmented datasets [39][42].
- In comparative tests, OmniRetarget showed superior kinematic quality metrics, such as penetration and contact preservation, outperforming baseline methods [41][42].
- Its high-quality retargeted motions directly improve downstream reinforcement learning policy success rates by more than 10% over baselines [42].

Group 3: Team and Background
- Amazon's recently established FAR team is led by prominent robotics researchers, including members from the renowned company Covariant [43][44].
- The team aims to revolutionize automation with humanoid robotics, marking Amazon's first significant foray into this area [49][50].
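The interaction-mesh idea mentioned above is commonly formalized as preserving Laplacian coordinates over a Delaunay mesh built from robot, object, and terrain keypoints. The rough sketch below illustrates that generic formulation only; it is not OmniRetarget's exact objective, and the keypoint sets and energy weighting are assumptions.

```python
# Rough sketch of an interaction-mesh-style deformation energy: build a Delaunay
# mesh over robot/object/terrain keypoints and compare Laplacian coordinates
# between the source (human demo) and retargeted (robot) configurations.
# Generic illustration only, not OmniRetarget's actual objective.
import numpy as np
from scipy.spatial import Delaunay

def laplacian_coords(points: np.ndarray, neighbors: list) -> np.ndarray:
    """Each point minus the centroid of its mesh neighbors."""
    return np.stack([p - points[list(nbrs)].mean(axis=0) for p, nbrs in zip(points, neighbors)])

def interaction_mesh_energy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Sum of squared Laplacian-coordinate differences (lower = relations preserved)."""
    tri = Delaunay(src)                              # tetrahedralize the source keypoints
    neighbors = [set() for _ in range(len(src))]
    for simplex in tri.simplices:                    # collect mesh adjacency
        for i in simplex:
            neighbors[i].update(j for j in simplex if j != i)
    l_src = laplacian_coords(src, neighbors)
    l_tgt = laplacian_coords(tgt, neighbors)
    return float(((l_src - l_tgt) ** 2).sum())

# Example: 20 random 3-D keypoints (human + object) vs. a slightly perturbed robot layout.
np.random.seed(0)
src = np.random.rand(20, 3)
tgt = src + 0.01 * np.random.randn(20, 3)
print(interaction_mesh_energy(src, tgt))
```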
具身智能之心 is recruiting partners! Course development, training, paper mentoring, and more
具身智能之心· 2025-10-06 02:35
Core Viewpoint
- The article emphasizes the importance of collaboration in building a platform that continuously adds value to the industry, inviting influential figures to join a range of initiatives [1].

Group 1: Collaboration Opportunities
- The company seeks partners to develop courses and provide paper mentoring that benefit beginners and advance the industry, targeting both individual (C-end) learners and enterprise training [2][3].
- There is an initiative to build a cost-effective, user-friendly research platform for embodied intelligence that is accessible to developers and easy for beginners to use [4][5].
- The company aims to offer consulting and training for both B-end and C-end clients in areas such as embodied data, robot hardware, algorithms, and deployment, supporting industry upgrading and talent development [6][7].

Group 2: Recruitment and Compensation
- The company is looking for people with engineering experience in the field or a PhD or above, including top-tier professionals, for both full-time and part-time roles [7].
- Competitive compensation is offered, along with access to industry resources for those who join [8].
We offer the most professional platform and operations team! We are recruiting operations staff
具身智能之心· 2025-10-06 02:35
Core Viewpoint
- The company has evolved from a small workshop into a platform with significant technical depth and breadth, reflecting growing industry demand for embodied intelligence and related technologies [1].

Group 1: Team and Operations
- Over more than two years the team has built four key IPs: Embodied Intelligence, Autonomous Driving, 3D Vision, and Large Model Tech, with a combined online following of nearly 360,000 across platforms [1].
- The company is hiring for full-time and part-time operations and sales roles to support its expanding business lines [2].

Group 2: Job Responsibilities and Requirements
- The operations role covers managing course progress, improving platform engagement, planning commercialization projects, and creating content about the AI industry [4].
- The sales role involves producing promotional materials for online and hardware products and liaising with hardware manufacturers and academic/enterprise clients [5][6].
- Candidates for both roles should have strong execution and communication skills and a background in computer science, AI, or robotics; familiarity with social media operations is a plus [12].

Group 3: Growth Opportunities
- The company offers exposure to top-tier operations teams, providing opportunities to learn operational techniques and sales strategies and grow quickly [7].
- Employees engage with cutting-edge content in autonomous driving, embodied intelligence, 3D vision, and large models, broadening their technical perspective [8].
- There are opportunities for further academic pursuits, such as research and doctoral study, that can support personal development [9].
What are the applications of reinforcement learning to robotic arms, quadrupeds, and humanoids?
具身智能之心· 2025-10-05 16:03
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) in developing embodied intelligent robots, highlighting its applications in complex tasks and the challenges newcomers face in the field [3][4][10].

Group 1: Reinforcement Learning Applications
- Reinforcement learning is crucial for gait control in humanoid and quadruped robots, enabling tasks such as climbing stairs, running, and dancing [3][9].
- The VLA+RL approach for robotic arms is gaining popularity in academia, improving the efficiency and smoothness of robot operation [4][9].

Group 2: Challenges in Learning and Research
- The complexity and breadth of reinforcement learning make it hard for beginners to enter the field, often leading to frustration and abandoned studies [6][10].
- Without a systematic learning path, aspiring researchers repeat mistakes and miss opportunities [7][10].

Group 3: Educational Offerings
- To address these challenges, the company has launched a 1v6 small-class paper mentoring program in reinforcement learning, aimed at graduate students and others who need paper guidance [7][8].
- The program includes 14 weeks of concentrated online guidance followed by 8 weeks of follow-up support, covering paper idea confirmation, project implementation, experimental guidance, and writing polish [10][12].

Group 4: Course Structure and Content
- The course covers topics including paper direction and venue analysis, reinforcement learning fundamentals, simulation environments, and writing guidance [10][18].
- Students can work on specific ideas around quadruped robots, humanoid robots, and robotic arms, following a structured path toward a paper suitable for submission to top conferences [19][30].

Group 5: Expected Outcomes
- Participants are expected to produce a paper draft that meets the requirements of a target conference or journal, with support through writing and submission [29][34].
- The course emphasizes a complete research cycle: methodology, engineering, evaluation, writing, submission, and follow-up maintenance [36].
With just one demonstration, can a robot grasp everything like a human hand? DemoGrasp raises the ceiling for dexterous grasping
具身智能之心· 2025-10-04 13:35
Core Viewpoint
- The article discusses the DemoGrasp framework, which enables robots to perform dexterous grasping from a single demonstration, overcoming long-standing challenges in robotic manipulation [2][20].

Group 1: Traditional Challenges in Robotic Grasping
- Traditional reinforcement learning methods struggle with high-dimensional action spaces, require complex reward functions, and often generalize poorly [1][2].
- Robots trained in simulation often fail in the real world because of imprecise physical parameters and environmental variation [1][2].

Group 2: Introduction of DemoGrasp
- DemoGrasp, developed jointly by Peking University, Renmin University of China, and BeingBeyond, uses a single successful demonstration to reformulate the grasping task [2][4].
- The framework significantly improves performance in both simulated and real environments, marking a breakthrough in robotic grasping [2][4].

Group 3: Core Design of DemoGrasp
- The core design consists of three components: demonstration trajectory editing, single-step reinforcement learning (RL), and vision-guided sim-to-real transfer [4][10].
- Robots optimize "editing parameters" instead of exploring new actions from scratch, greatly reducing the dimensionality of the action space [6][7] (a toy sketch follows this summary).

Group 4: Performance Results
- DemoGrasp outperforms existing methods in simulation, achieving a 95.5% success rate on seen categories and 94.4% on unseen categories [10].
- The framework adapts to six different robot embodiments without hyperparameter tuning, achieving an average success rate of 84.6% on unseen datasets [11].

Group 5: Real-World Performance
- In real-world tests, DemoGrasp achieved an overall success rate of 86.5% across 110 unseen objects, handling a wide range of everyday items [14].
- It successfully grasps small and thin objects such as coins and cards, which traditional methods struggle with because of collision issues [14].

Group 6: Limitations and Future Directions
- DemoGrasp still has limitations in functional grasping tasks and highly cluttered scenes [17][19].
- Future improvements may include segmenting demonstration trajectories for finer-grained decisions and integrating visual feedback for dynamic scene adjustment [19][20].
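To make the "editing parameters" idea concrete, the toy sketch below lets a single-step policy output a low-dimensional edit (a wrist-position offset plus a finger-closing scale) that reshapes one recorded demonstration. This 4-D parameterization and the placeholder demo arrays are illustrative assumptions, not DemoGrasp's actual design.

```python
# Toy illustration of demonstration-trajectory editing: the policy outputs a small
# edit vector applied to a recorded demo instead of raw per-step actions.
# The 4-D edit [dx, dy, dz, finger_scale] is an assumed parameterization.
import numpy as np

T_STEPS = 50
demo_wrist = np.zeros((T_STEPS, 3))                                        # placeholder: recorded wrist positions
demo_fingers = np.tile(np.linspace(0.0, 1.0, T_STEPS)[:, None], (1, 16))   # placeholder: 16-joint finger closing

def edit_demo(edit: np.ndarray) -> tuple:
    """Apply one edit to the whole demo: translate the wrist path, scale finger closing."""
    wrist = demo_wrist + edit[:3]                                          # shift the approach trajectory
    fingers = np.clip(demo_fingers * edit[3], 0.0, 1.0)                    # scale how far the fingers close
    return wrist, fingers

# A single-step RL policy only has to choose this 4-D vector per object/scene,
# instead of exploring a long-horizon, high-dimensional action space.
edit = np.array([0.02, -0.01, 0.0, 0.9])
wrist_traj, finger_traj = edit_demo(edit)
print(wrist_traj.shape, finger_traj.shape)    # (50, 3) (50, 16)
```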