Latest from Princeton! VLM2VLA: Fine-Tuning a VLM into a VLA Without Catastrophic Forgetting
具身智能之心· 2025-10-07 10:00
Core Insights
- The article examines catastrophic forgetting when fine-tuning Vision-Language Models (VLMs) into Vision-Language-Action (VLA) models for robotic control, highlighting the mismatch between pre-training and fine-tuning data distributions [2][4].

Group 1: Catastrophic Forgetting
- Catastrophic forgetting occurs when the model loses its original reasoning and multimodal understanding capabilities during action-generation training [2].
- The root cause is the distribution mismatch between internet-scale pre-training data (primarily image-text pairs) and the low-dimensional action vectors used for robotic fine-tuning [2].

Group 2: VLM2VLA Approach
- VLM2VLA addresses the distribution mismatch by converting low-dimensional actions into natural-language descriptions, aligning the fine-tuning data with the pre-training data [3][4].
- The method fine-tunes with low-rank adaptation (LoRA), minimizing modifications to the VLM backbone and avoiding catastrophic forgetting [4].

Group 3: Hierarchical Action Representation
- The VLM2VLA framework decomposes action prediction into a three-level reasoning process, using natural-language descriptions at every level [6].
- High-level subtask prediction generates intermediate tasks from initial observations and the overall task instruction [6].
- Mid-level motion planning produces spatially grounded movement descriptions, while low-level action generation emits executable action sequences with language annotations [6].

Group 4: Data Reconstruction Pipeline
- VLM2VLA uses Gemini 2.5 to automatically reconstruct raw robotic trajectory datasets into language-annotated datasets compatible with VLM pre-training formats [9].
- The reconstruction pipeline provides context, decomposes trajectories into subtasks, and standardizes the format to align with VLM data [9].

Group 5: Efficient Fine-Tuning Strategy
- The Gemma-3-12B-IT model is fine-tuned with LoRA on linear layers only, without altering the VLM architecture or requiring joint training with internet-scale data [12][13].
- Key training parameters: LoRA rank 16, learning rate 5e-5, and an effective batch size of 8 (a configuration sketch follows this summary) [12][13].

Group 6: Experimental Validation
- Experiments address three core questions, comparing VLM2VLA with baselines on retention of multimodal understanding, competitiveness in robotic manipulation, and generalization of knowledge to new scenarios [14][15].
- VLM2VLA performs competitively on both in-distribution and out-of-distribution tasks, demonstrating its hierarchical reasoning capabilities [17][19].

Group 7: Limitations and Future Directions
- The model still faces reasoning latency and needs larger-scale language-annotated robotic datasets to improve generalization [19].
- Future work may optimize decoding strategies, extend language annotation to dexterous actions, and integrate verification capabilities into the VLM itself [19][22].
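For a concrete picture of the reported recipe, below is a minimal sketch of how such a LoRA setup might be expressed with HuggingFace transformers and PEFT. The rank, learning rate, effective batch size, and linear-layer targeting come from the article; the model-loading class, lora_alpha, dropout, and output directory are assumptions, and the actual training script is not published.

```python
# Minimal sketch of the LoRA fine-tuning setup described above (assumptions
# noted inline). Requires a recent transformers release with Gemma 3 support.
import torch
from transformers import AutoModelForImageTextToText, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-12b-it",            # backbone named in the article
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                         # LoRA rank, from the article
    lora_alpha=32,                # assumed scaling factor
    target_modules="all-linear",  # "LoRA on linear layers" per the article
    lora_dropout=0.05,            # assumed
)
model = get_peft_model(model, lora_config)  # backbone weights stay frozen

args = TrainingArguments(
    output_dir="vlm2vla-lora",          # assumed
    learning_rate=5e-5,                 # from the article
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size 8
    bf16=True,
)
# A Trainer would then be built with the language-annotated trajectory data.
```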
Time to get back to the grind......
具身智能之心· 2025-10-07 10:00
Group 1
- The number of embodied-intelligence companies in China has approached 200, indicating intense competition and potential market saturation [1].
- Companies are adopting different strategies: some focus on integrated applications while others prioritize core R&D, which may help them survive market challenges [1].
- The community built around embodied intelligence aims to provide a platform for knowledge sharing, job referrals, and academic guidance, addressing the high trial-and-error costs newcomers face [2][4].

Group 2
- The community has established a closed loop across industry, academia, and job exchanges, offering solutions to problems and connecting members with job opportunities [4].
- A compilation of over 30 technical routes helps users find benchmarks and learning pathways efficiently [4].
- The community invites industry experts to join discussions and share insights on the latest developments in the embodied-intelligence sector [4][12].

Group 3
- The community offers learning paths for beginners and advanced users alike, covering reinforcement learning and multimodal large-model understanding [12][13].
- Members have access to a job-referral mechanism, ensuring timely connections with target companies [12][20].
- The community has compiled extensive resources, including open-source projects, datasets, and technical learning routes, to support knowledge acquisition and project development [12][29][35].
New SOTA! JanusVLN: Dual Implicit Memory Decouples Semantics and Space, Sharply Cutting Compute and Inference Overhead
具身智能之心· 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, an innovative Vision-Language Navigation (VLN) framework that addresses the limitations of existing methods with a Dual Implicit Memory paradigm, decoupling visual semantics from spatial geometry [2][19].

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information distortion and loss from reliance on textual cognitive maps, low computational and reasoning efficiency from storing historical image frames, and unbounded memory growth leading to "memory explosion" [3][5].

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, cleanly separating semantic memory from spatial-geometric memory [7][19].
- The framework uses a pre-trained 3D visual geometry model (VGGT) to derive spatial-geometric information from a single RGB video stream, strengthening the model's spatial perception [8][19].
- A hybrid incremental update strategy keeps the memory at a fixed size, significantly improving inference efficiency by avoiding redundant computation (a minimal sketch follows this summary) [8][11].

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and a hybrid incremental update strategy [10][11].
- The dual-encoder architecture pairs a 2D visual-semantic encoder with a 3D spatial-geometric encoder to provide comprehensive scene understanding [11].

Experimental Results
- JanusVLN was evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15].
- The framework excels at spatial-reasoning tasks, successfully completing complex navigation challenges [18][21].

Quantitative Analysis
- JanusVLN improves success rate (SR) substantially, outperforming advanced methods that rely on expensive inputs by 10.5 to 35.5 percentage points [21].
- Against other SOTA methods using RGB input with explicit memory, JanusVLN gains 3.6 to 10.8 percentage points in SR, validating the Dual Implicit Memory paradigm [21].
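To make the fixed-size memory idea concrete, here is a minimal sketch of a dual implicit memory with a hybrid incremental update: a few anchor frames are kept permanently while recent frames live in a sliding window, so per-step cost stays constant. The window sizes and the deque-based design are illustrative assumptions, not JanusVLN's actual implementation.

```python
# Sketch of a fixed-size dual implicit memory with hybrid incremental update.
from collections import deque

class DualImplicitMemory:
    def __init__(self, n_initial: int = 4, n_recent: int = 12):
        self.initial = []                      # long-horizon anchor frames
        self.recent = deque(maxlen=n_recent)   # sliding window, O(1) update
        self.n_initial = n_initial

    def update(self, semantic_kv, geometric_kv):
        """Cache per-frame features from the 2D semantic and 3D geometric
        encoders instead of re-encoding the full history each step."""
        entry = (semantic_kv, geometric_kv)
        if len(self.initial) < self.n_initial:
            self.initial.append(entry)
        else:
            self.recent.append(entry)  # oldest recent frame is evicted

    def read(self):
        # Memory size is bounded by n_initial + n_recent regardless of
        # episode length, which is what keeps inference cost flat.
        return self.initial + list(self.recent)
```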
Your first robotic arm for embodied research: my advisor recommended this one, easy to use and great value!
具身智能之心· 2025-10-07 03:03
Core Viewpoint
- The article introduces the Imeta-y1, a lightweight, cost-effective robotic arm for embodied research, addressing the need for affordable yet high-quality hardware among researchers and practitioners [2][4].

Product Overview
- The Imeta-y1 is designed for education, research, and light industrial applications, featuring high-precision motion control, low power consumption, and an open software and hardware architecture [4][5].
- It supports seamless simulation-to-real transfer and ships with a comprehensive open-source SDK and toolchain, letting users quickly move through algorithm validation, data collection, model training, and deployment [4][15].

Technical Specifications
- The arm weighs 4.2 kg, carries a rated load of 3 kg, and offers 6 degrees of freedom, with a working radius of 612.5 mm and repeat positioning accuracy of ±0.1 mm [7][17].
- It runs on 24 V, is driven from a PC controller, and communicates over CAN [7][17].

Product Advantages
- A complete toolchain covers data collection through model training and inference deployment, supports multimodal data fusion, and is compatible with mainstream frameworks such as TensorFlow and PyTorch [15][30].
- URDF models enable real-time interaction between simulation environments such as Gazebo and the physical device, significantly reducing development risk and debugging cost [20][30].

Development Support
- The SDK supports C++ and Python, so developers can get started quickly regardless of language preference (a hypothetical usage sketch follows this summary) [16][24].
- Both ROS1 and ROS2 development are supported, with planned upgrades and responsive customer service [17][24].

Testing and Reliability
- The arm undergoes rigorous hardware testing, including precision calibration, durability, load performance, and stability verification, ensuring reliability and safety across application scenarios [33][40].

After-Sales Service
- The company commits to delivery within 1-2 weeks and provides timely after-sales support, including a six-month warranty covering defects not caused by user damage [42][43].
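As a rough illustration of what driving such an arm from Python can look like, the sketch below shows a hypothetical control loop. Every name in it (the imeta_sdk package, the Arm class, and its methods) is invented for illustration only; refer to the vendor's actual SDK and documentation.

```python
# Hypothetical control-loop sketch; all identifiers below are invented
# for illustration and do not correspond to the real Imeta-y1 SDK.
import time

import imeta_sdk  # hypothetical package name

arm = imeta_sdk.Arm(interface="can0")   # CAN-bus communication per the spec
arm.enable()

home = [0.0] * 6                        # one target per joint (6 DoF)
arm.move_joints(home, speed=0.2)        # hypothetical joint-space command

while not arm.reached_target():         # hypothetical status query
    time.sleep(0.01)                    # poll at 100 Hz

arm.disable()
```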
"Blind" robot stuns in a 30-second parkour debut, completely without vision!
具身智能之心· 2025-10-07 03:03
Core Insights
- The article covers advances in humanoid robotics, focusing on Amazon's FAR (Frontier AI for Robotics) team and its new technology, OmniRetarget, which enables robots to perform complex tasks without visual sensors [9][49].

Group 1: OmniRetarget Technology
- OmniRetarget lets reinforcement-learning policies learn long-horizon loco-manipulation skills in complex environments, achieving zero-shot transfer from simulation to humanoid robots [12][29].
- The technology uses an interaction mesh to model spatial and contact relationships between the robot, objects, and terrain, improving data efficiency and reducing data-collection cost (a sketch of the idea follows this summary) [15][25].
- OmniRetarget outperforms other motion-retargeting methods on key axes: hard constraints, object interaction, terrain interaction, and data augmentation [16][40].

Group 2: Experimental Results
- The research team demonstrated OmniRetarget's broad capabilities, including natural object manipulation and terrain interaction, reaching a 79.1% success rate on augmented datasets [39][42].
- In comparative tests, OmniRetarget led on kinematic quality metrics such as penetration and contact preservation, outperforming baseline methods [41][42].
- Its high-quality retargeted motions directly lift downstream RL policy success rates by more than 10% over baselines [42].

Group 3: Team and Background
- Amazon's recently established FAR team is led by prominent robotics scholars, including alumni of the well-known company Covariant [43][44].
- The team aims to transform automation through humanoid robotics, marking Amazon's first major push into this area [49][50].
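The interaction-mesh idea can be sketched compactly: connect key points on the robot, object, and terrain into a volumetric mesh, then score a retargeted pose by how much it distorts each vertex's Laplacian coordinate (its offset from the centroid of its mesh neighbors). The uniform-weight Laplacian and squared-error energy below are simplifying assumptions, not OmniRetarget's exact formulation.

```python
# Sketch of an interaction-mesh distortion energy (simplified assumptions).
import numpy as np
from scipy.spatial import Delaunay

def laplacian_coords(points: np.ndarray, tri: Delaunay) -> np.ndarray:
    """Offset of each vertex from the mean of its mesh neighbors."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for simplex in tri.simplices:        # tetrahedra over all key points
        for i in simplex:
            neighbors[i].update(j for j in simplex if j != i)
    lap = np.zeros_like(points)
    for i, nbrs in enumerate(neighbors):
        lap[i] = points[i] - points[list(nbrs)].mean(axis=0)
    return lap

def interaction_mesh_energy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Distortion of spatial/contact relations between source and target."""
    tri = Delaunay(src)                  # connectivity fixed by the source
    return float(np.sum((laplacian_coords(src, tri)
                         - laplacian_coords(tgt, tri)) ** 2))

# Usage: src = human-demo key points, tgt = retargeted robot key points,
# both shaped (N, 3); minimizing this energy preserves interaction structure.
```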
具身智能之心 is recruiting partners! Course development, training, paper tutoring, and more
具身智能之心· 2025-10-06 02:35
The second half of the year is nearly here, and it feels like this year's plans won't all get done; reality has diverged quite a bit from what we expected at the start of the year, simply because there is so much worth doing. We want to build a platform that delivers lasting value to the industry, but running a community depends on everyone's strong support. 具身智能之心 hopes to contribute its share as the field develops, and we want to invite more people to join us.

A few people can only do so much, so we sincerely invite experts with real influence in the embodied AI field to collaborate with us across course development, paper tutoring, consulting services, corporate training, joint curriculum building, hardware R&D, and more.

Collaboration Areas

1) Course development / paper tutoring: build courses with us that benefit more beginners and push the industry forward, covering consumer courses, corporate training, and university curriculum building.

2) Hardware R&D: build easy-to-use, cost-effective embodied research platforms with us, so that every developer can afford them and every beginner can get started smoothly.

3) Consulting and training services: take on B-side and C-side consulting with us on embodied data, robot hardware, algorithms, and deployment, supporting industrial upgrading and talent development.

Requirements

We expect you to have solid engineering experience in the field, or a PhD-or-above title (top-conference authors welcome). Both full-time and part-time are fine~

Compensation

We offer industry-competitive pay (details via private message), and you will also gain access to our industry resources.

Contact Us

If you currently work at a company ...
The most professional platform and operations team! We are recruiting operations staff~
具身智能之心· 2025-10-06 02:35
Core Viewpoint
- The company has grown from a small workshop into a platform with significant technical depth and breadth, reflecting rising industry demand for embodied intelligence and related technologies [1].

Group 1: Team and Operations
- Over more than two years, the team has built four key IPs: Embodied Intelligence, Autonomous Driving, 3D Vision, and Large Model Tech, with a combined online following of nearly 360,000 across platforms [1].
- The company is hiring full-time and part-time staff in operations and sales to support its expanding business lines [2].

Group 2: Job Responsibilities and Requirements
- The operations role covers managing course progress, boosting platform engagement, planning commercialization projects, and creating AI-industry content [4].
- The sales role involves producing promotional materials for online and hardware products and liaising with hardware manufacturers and academic/enterprise clients [5][6].
- Candidates for both roles need strong execution and communication skills and a background in computer science, AI, or robotics; familiarity with social-media operations is a plus [12].

Group 3: Growth Opportunities
- The company offers exposure to top-tier operations teams, with opportunities to learn operational techniques and sales strategies for rapid personal growth [7].
- Employees engage with cutting-edge content in autonomous driving, embodied intelligence, 3D vision, and large models, broadening their technical perspective [8].
- There are opportunities for further academic pursuits, such as research and doctoral studies, supporting personal development [9].
What are the applications of reinforcement learning for robotic arms, quadrupeds, and humanoids?
具身智能之心· 2025-10-05 16:03
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) for embodied intelligent robots, highlighting its applications across complex tasks and the challenges newcomers face in the field [3][4][10].

Group 1: Reinforcement Learning Applications
- RL is crucial for gait control in humanoid and quadruped robots, enabling tasks such as climbing stairs, running, and dancing (see the reward-shaping sketch after this summary) [3][9].
- The VLA+RL approach for robotic arms is gaining traction in academia, improving the efficiency and smoothness of robot operation [4][9].

Group 2: Challenges in Learning and Research
- The complexity and breadth of RL make the field hard for beginners to enter, often leading to frustration and abandoned studies [6][10].
- Without a systematic learning path, aspiring researchers repeat mistakes and miss opportunities [7][10].

Group 3: Educational Offerings
- To address these challenges, the company has launched a 1-on-6 small-group paper-coaching class in reinforcement learning, aimed at graduate students and others who need publication guidance [7][8].
- The course runs 14 weeks of concentrated online guidance followed by 8 weeks of follow-up support, covering paper-idea confirmation, project implementation, experimental guidance, and writing refinement [10][12].

Group 4: Course Structure and Content
- Topics include paper direction and venue analysis, RL fundamentals, simulation environments, and writing guidance [10][18].
- Students can work on concrete ideas involving quadruped robots, humanoid robots, and robotic arms, following a structured path toward a paper suitable for submission to top conferences [19][30].

Group 5: Expected Outcomes
- Participants are expected to produce a paper draft that meets the requirements of a target conference or journal, with support through the writing and submission process [29][34].
- The course spans the full research cycle: methodology, engineering, evaluation, writing, submission, and maintenance [36].
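As a flavor of what RL-based gait work involves, here is a generic sketch of a locomotion reward: track a commanded velocity while penalizing torque and body-height deviation. The terms and weights are common-practice assumptions, not taken from any specific paper in this digest.

```python
# Illustrative reward shaping for quadruped/humanoid gait training.
# Weights and the nominal body height are generic assumptions.
import numpy as np

def locomotion_reward(base_lin_vel, cmd_vel, joint_torques, base_height,
                      target_height=0.32):
    # Track the commanded planar velocity (the core locomotion objective).
    vel_err = np.sum((base_lin_vel[:2] - cmd_vel[:2]) ** 2)
    r_track = np.exp(-4.0 * vel_err)
    # Penalize energy use and deviation from the nominal body height.
    r_energy = -1e-4 * np.sum(np.square(joint_torques))
    r_height = -5.0 * (base_height - target_height) ** 2
    return r_track + r_energy + r_height
```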
With just one demonstration, can a robot grasp anything like a human hand? DemoGrasp raises the ceiling for dexterous grasping
具身智能之心· 2025-10-04 13:35
Getting a robot to grasp objects dexterously with multiple fingers sounds simple, but it has long been one of the hardest problems in robotic manipulation. Imagine everything from picking up a phone or holding a cup to lifting a paper-thin sticky note or pinching a button less than 3 cm across: actions humans take for granted are, for a robot, high-difficulty challenges at every step.

To teach robots grasping skills, traditional reinforcement learning methods typically rely on trial and error in a high-degree-of-freedom (DoF) action space. They demand carefully designed reward functions and training curricula, and they generalize poorly: a policy that learns to grasp cups tends to forget how to grasp cards. Worse still, a "grasping expert" trained in simulation breaks down in real scenes. Stripped of privileged information such as precise physical parameters and object contact points, left with only RGB or depth camera input, and disturbed by lighting and background changes, its success rate falls off a cliff.

Small, thin objects are the nightmare case for these methods: coins slip between the fingers, cards offer no purchase, and picking them up without collision is almost like asking the ...
Suddenly realized there are already nearly 200 embodied AI companies......
具身智能之心· 2025-10-03 12:02
Core Viewpoint
- The article discusses the growing number of embodied-intelligence companies in China, noting that competition among nearly 200 players may lead to market saturation and a "cutthroat" environment [1].

Group 1: Industry Overview
- Nearly 200 companies now work on embodied intelligence, including robotics and related research, indicating a crowded market with highly similar products and business models [1].
- Companies are pursuing different strategies: some pair applications with their core technologies, while others prioritize foundational research and leave application validation to developers [1].
- A rich technical stack matters for survival; only companies capable of practical deployment will remain viable in the long term [1].

Group 2: Community and Support
- The "Embodied Intelligence Heart Knowledge Planet" aims to build a large community for beginners and advanced learners alike, providing job referrals, academic guidance, and problem-solving support [3].
- The community has established a closed-loop system across industry, academia, and job exchanges, facilitating knowledge sharing and collaboration [5].
- Members get access to over 30 technical routes, numerous open-source projects, and connections with industry leaders for mentorship and advice [5][15].

Group 3: Educational Resources
- The community provides comprehensive learning paths and resources for newcomers, including technical stacks and project proposals for those already engaged in research [9][11].
- Forums and live discussions share insights on the latest developments in the embodied-intelligence industry [7].
- Curated resources include datasets, research papers, and technical documentation that support learning and development in the field [20][26][30].