DiffusionNFT: A New Paradigm for Diffusion Reinforcement Learning with a 25x Boost in Training Efficiency
具身智能之心· 2025-10-09 00:04
Edited by 机器之心. A team led by Prof. Jun Zhu at Tsinghua University, together with NVIDIA's Deep Imagination research group and Stefano Ermon's team at Stanford, has proposed a new reinforcement learning (RL) paradigm for diffusion models: Diffusion Negative-aware FineTuning (DiffusionNFT). The method is the first to break the basic assumptions existing RL approaches make about diffusion models: it optimizes directly on the forward (noising) process, removing the dependence on likelihood estimation and on any particular sampler, while markedly improving training efficiency and generation quality. Co-first authors Kaiwen Zheng and Huayu Chen are PhD students in the Department of Computer Science at Tsinghua University.
Paper title: DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Paper link: https://arxiv.org/abs/2509.16117
Code repository: https://github.com/NVla ...
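DiffusionNFT's distinguishing move is to optimize on the forward (noising) process rather than on reverse-process likelihoods. For context, here is a minimal NumPy sketch of the standard diffusion forward process q(x_t | x_0); this is the textbook DDPM formulation that the forward process refers to, not DiffusionNFT's actual training objective, and all names and schedule values here are illustrative.

```python
import numpy as np

def forward_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

# A toy linear beta schedule over T steps (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t decreases toward 0

rng = np.random.default_rng(0)
x0 = np.ones(4)
xt, eps = forward_noise(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
```

At small t the sample stays close to the clean data; at large t it approaches pure Gaussian noise, which is what makes the forward process a fixed, sampler-independent object to optimize against.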
We Are Looking for Partners in the Embodied Intelligence Field...
具身智能之心· 2025-10-08 02:49
Core Viewpoint
- The company is seeking collaboration with global practitioners in the embodied intelligence field to enhance capabilities in areas such as technical services, training, course development, and research guidance [1].

Group 1: Collaboration Opportunities
- There is increasing demand from partners and small companies for the company to empower them through solutions, data collection, technology upgrades, and corporate training [1].
- The company is inviting outstanding partners to join in driving significant industry progress [1].

Group 2: Compensation and Resources
- The company will offer high compensation and abundant industry resources to collaborators [2].

Group 3: Focus Areas
- Key focus areas for collaboration include, but are not limited to: VLA, VLN, Diffusion Policy, Reinforcement Learning, VLA+RL, remote operation, motion capture, sim2real, multimodal large models, simulation, motion control, end-to-end systems, and 3D perception [3].

Group 4: Job Description
- The positions are primarily aimed at embodied course development, solution research and development, hardware development, and training collaboration, targeting both B-end (enterprises, universities, research institutes) and C-end (students, job seekers) audiences [4].

Group 5: Contact Information
- Interested parties can add WeChat oooops-life for further inquiries [5].
A Roundup of Domestic and International Companies Working on Embodied Perception!
具身智能之心· 2025-10-08 02:49
Core Insights
- The article surveys the emerging field of embodied intelligence, highlighting the development of general-purpose robotic brain systems and multi-modal perception and decision-making systems, which are attracting significant attention from both capital and industry [2][3].

Domestic Companies
- **Xinghai Map**: Founded in 2023, focuses on developing a "general embodied large model" using real-world data to create robots with fine manipulation capabilities. The company has completed 8 rounds of financing [6].
- **WALL-A Model**: Launched in October 2024 as the largest-parameter general embodied intelligence operation model globally, integrating visual, language, and motion control signals [6].
- **Wall-OSS**: An open-source embodied intelligence foundation model with strong generalization and reasoning capabilities [6].
- **UBTECH**: Established in 2012, a leader in humanoid robot commercialization with comprehensive in-house research capabilities [10].
- **Thinker Model**: A multi-modal large model with 10 billion parameters, expected to achieve top rankings in three international benchmark tests by 2025, enhancing robots' perception and task planning in complex environments [10].
- **Zhiyuan Robotics**: Founded in February 2023, aims to create world-class general embodied intelligent robot products [12].
- **Genie Operator-1**: Released in March 2025, it integrates multi-modal large models with mixture-of-experts technology, improving task success rates by 32% over comparable models on the market [12].
- **Galaxy General**: Founded in May 2023, focuses on multi-modal large models driven by synthetic data [14].
- **VLA Model**: Described as the world's first general embodied large model, built on a "brain + cerebellum" collaborative framework [14].
- **Qianxun Intelligent**: Established in 2024, specializes in AI and robotics with a strong technical foundation [16].
- **Spirit V1 VLA Model**: The first AI model to tackle long-horizon manipulation of deformable objects, supporting multi-task generalization [16].
- **Star Motion Era**: A new tech company incubated by Tsinghua University, focusing on general artificial intelligence applications [18].
- **ERA-42 Model**: The first end-to-end native embodied large model in China, capable of learning over 100 dynamic tasks through video training [18].

International Companies
- **Figure AI**: Focuses on developing embodied intelligence large models and related infrastructure for various industries [20].
- **Noematrix Brain**: Combines advanced algorithms and data support for comprehensive capabilities in instruction reasoning and task planning [20].
- **Physical Intelligence**: A startup established in January 2023 that aims to create advanced intelligent software for robots [24].
- **π0 Model**: Released on October 31, 2024, a foundation model for robots that achieves fine control through pre-training and fine-tuning [24].
- **Google DeepMind**: Merged with Google Brain in 2023, focusing on general artificial intelligence research [22].
- **Gemini Robotics**: A VLA model that allows robots to perform complex tasks without task-specific training, enhancing their adaptability to environmental changes [22].
- **NVIDIA**: A leading GPU design company that has expanded into AI solutions [24].
- **Eureka System**: Based on GPT-4, it can automatically train robots for complex actions and optimize reinforcement learning processes [24].
A Summary of VLA Foundation Models and Large-Scale Training Tasks
具身智能之心· 2025-10-08 02:49
Core Insights
- The article summarizes several research papers on Vision-Language-Action (VLA) models and their training strategies, highlighting advances in embodied intelligence and robotics [2][3][5][7][9][11][13][15][17][19].

Group 1: Training Strategies and Model Improvements
- "Training strategies for efficient embodied reasoning" discusses the use of Chain-of-Thought (CoT) reasoning to enhance the performance and generalization of VLA models, achieving a threefold increase in reasoning speed compared to standard methods [3].
- "CAST: Counterfactual labels improve instruction following in vision-language-action models" introduces a method to generate counterfactual labels, which significantly improves the instruction-following capabilities of VLA models, with a 27% increase in navigation task success rates [5].
- "RoboBrain: A unified brain model for robotic manipulation" presents a new dataset, ShareRobot, which enhances the planning and trajectory prediction capabilities of robots, leading to state-of-the-art performance on various tasks [7].

Group 2: Dataset Development and Evaluation
- The "DROID" dataset is introduced as a large-scale, diverse dataset for robot manipulation, containing 76,000 demonstration trajectories collected over 350 hours, which improves the performance and generalization of trained policies [9].
- "ViSA-Flow" proposes a framework for learning from large-scale video data, achieving state-of-the-art performance in robot skill learning, particularly in low-data scenarios [11].
- The "CORTEXBENCH" benchmark evaluates pre-trained visual representations for embodied AI, revealing that no single representation excels across all tasks, but task-specific adaptation can lead to significant performance improvements [13].

Group 3: Generalist Robot Policies and Learning Frameworks
- "Effective tuning strategies for generalist robot manipulation policies" identifies key factors influencing the performance of Generalist Manipulation Policies (GMPs) during fine-tuning, establishing a new benchmark for future research [15].
- The "CACTI" framework focuses on scalable multi-task learning in robotic systems, demonstrating effective training across various kitchen tasks in both real and simulated environments [17].
- "R3M: A universal visual representation for robot manipulation" shows that pre-trained visual representations can enable data-efficient learning in real-world environments, improving task success rates by over 20% compared to training from scratch [19].
In an Interview, I Was Asked What the Embodied "Brain and Cerebellum" Algorithms Are...
具身智能之心· 2025-10-08 02:49
Core Insights
- The article discusses the evolution and current state of embodied intelligence, focusing on the "brain and cerebellum" division of labor in robotics, where the brain handles perception and planning while the cerebellum is responsible for execution [3][10].

Technical Evolution
- The development of embodied intelligence has progressed through several stages, starting from grasp pose detection, moving to behavior cloning, and now to diffusion policy and VLA models, indicating a shift from low-level perception to high-level understanding and generalization [7][10].
- The first stage focused on grasp pose detection using point clouds or images for static object manipulation, but lacked context modeling for complex tasks [7].
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations, but faced challenges in generalization and performance in multi-target scenarios [7].
- The third stage, emerging in 2023, introduced diffusion policy methods that enhance stability and generalization by modeling action sequences [8].
- The fourth stage, anticipated in 2024, emphasizes the integration of VLA models with reinforcement learning and world models, enhancing robots' predictive capabilities and multi-modal perception [9][10].

Current Trends and Applications
- Integrating VLA with reinforcement learning improves robots' trial-and-error capabilities and self-improvement on long-horizon tasks, while combining VLA with world models enables future prediction and better planning [10].
- The industry is witnessing a surge of products built around humanoid robots, robotic arms, and quadrupedal robots, serving sectors such as industry, home, dining, and medical rehabilitation [10].
- As embodied intelligence transitions from research to deployment, demand for engineering capabilities is growing, particularly skills in simulation and policy training [14].

Educational Initiatives
- The article outlines a structured curriculum aimed at providing comprehensive knowledge of embodied intelligence algorithms, catering to both beginners and advanced learners [11][20].
- The course includes practical applications and supervision to enhance learning outcomes, covering modules such as diffusion policy, VLA, and tactile sensing [11][14].
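The "third stage" described above, diffusion policy, treats control as iteratively denoising an entire action sequence rather than regressing a single action. The sketch below is a toy illustration only: the dummy lambda stands in for a trained, observation-conditioned noise-prediction network, and the update rule is deliberately simplified (real implementations use the full DDPM/DDIM posterior step).

```python
import numpy as np

def denoise_actions(noise_pred_fn, horizon, action_dim, steps, rng):
    """Refine a noisy action sequence into a trajectory by repeated denoising.

    noise_pred_fn(actions, t) stands in for a trained noise-prediction
    network; here it is just a placeholder.
    """
    actions = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in range(steps, 0, -1):
        eps_hat = noise_pred_fn(actions, t)
        actions = actions - (1.0 / steps) * eps_hat  # toy update, not the true posterior
    return actions

rng = np.random.default_rng(0)
# Dummy predictor that treats the current sample itself as the noise estimate,
# so each step simply shrinks the sequence toward zero.
traj = denoise_actions(lambda a, t: a, horizon=16, action_dim=7, steps=10, rng=rng)
```

Predicting a whole horizon of actions at once is what gives diffusion policies their smoothness and multi-modality advantages over per-step behavior cloning.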
New from Princeton! VLM2VLA: Fine-Tuning a VLM into a VLA While Avoiding Catastrophic Forgetting
具身智能之心· 2025-10-07 10:00
Core Insights
- The article discusses the catastrophic forgetting problem that arises when fine-tuning Visual Language Models (VLMs) into Visual Language Action models (VLAs) for robotic control, highlighting the mismatch between pre-training and fine-tuning data distributions [2][4].

Group 1: Catastrophic Forgetting
- Catastrophic forgetting occurs when the model loses its original reasoning and multimodal understanding capabilities during action-generation training [2].
- The root cause is the distribution mismatch between internet-scale pre-training data (primarily image-text pairs) and the low-dimensional action vectors used for robotic fine-tuning [2].

Group 2: VLM2VLA Approach
- VLM2VLA addresses the distribution mismatch by converting low-dimensional actions into natural language descriptions, aligning the fine-tuning data with the pre-training data [3][4].
- The method employs low-rank adaptation (LoRA) for fine-tuning, minimizing modifications to the VLM backbone and avoiding catastrophic forgetting [4].

Group 3: Hierarchical Action Representation
- The VLM2VLA framework decomposes action prediction into a three-level reasoning process, using natural language descriptions at every level [6].
- High-level subtask prediction generates intermediate tasks based on initial observations and the overall task instruction [6].
- Mid-level motion planning produces spatially grounded movement descriptions, while low-level action generation creates executable action sequences with language annotations [6].

Group 4: Data Reconstruction Pipeline
- VLM2VLA uses Gemini 2.5 to automatically reconstruct raw robot trajectory datasets into language-annotated datasets, ensuring compatibility with VLM pre-training formats [9].
- The reconstruction process involves providing context, decomposing trajectories into subtasks, and standardizing the format to align with VLM data [9].

Group 5: Efficient Fine-Tuning Strategy
- The Gemma-3-12B-IT model is fine-tuned with LoRA on its linear layers, without altering the VLM architecture or requiring joint training with internet-scale data [12][13].
- Key training parameters include a LoRA rank of 16, a learning rate of 5e-5, and an effective batch size of 8 [12][13].

Group 6: Experimental Validation
- Experiments focus on three core questions, comparing VLM2VLA with baseline models on retention of multimodal understanding, competitive robotic manipulation performance, and the ability to generalize knowledge to new scenarios [14][15].
- VLM2VLA demonstrates competitive performance on both in-distribution and out-of-distribution tasks, showcasing its hierarchical reasoning capabilities [17][19].

Group 7: Limitations and Future Directions
- The model currently faces challenges such as inference latency and the need for larger-scale language-annotated robot datasets to enhance generalization [19].
- Future improvements may include optimizing decoding strategies, expanding language annotation to dexterous actions, and integrating verification capabilities into the VLM itself [19][22].
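The LoRA recipe above (rank 16, frozen backbone) can be pictured as a low-rank additive update on each linear layer. Here is a minimal NumPy sketch, not the actual Gemma-3/VLM2VLA training code; note that with B initialized to zero the adapter starts as an exact no-op, which is the standard LoRA initialization.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=16):
    """Output of frozen base weight W plus trainable low-rank update (alpha/r) * x A B."""
    return x @ W.T + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 16               # rank 16, matching the article's recipe
x = rng.standard_normal((2, d_in))        # a small batch of inputs
W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
A = rng.standard_normal((d_in, r)) * 0.01 # trainable down-projection
B = np.zeros((r, d_out))                  # trainable up-projection, zero-initialized

y = lora_forward(x, W, A, B, r=r)         # identical to the base layer at init
```

Only A and B (d_in*r + r*d_out parameters per layer) receive gradients, which is why the backbone's pre-trained behavior, and hence its multimodal understanding, is largely preserved.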
Getting Ready to Go Back to the Grind...
具身智能之心· 2025-10-07 10:00
Group 1
- The number of embodied intelligence companies in China has approached 200, indicating a high level of competition and potential market saturation [1].
- Companies are adopting different strategies, with some focusing on integrated applications while others prioritize core research and development, which may help them survive market challenges [1].
- The community built around embodied intelligence aims to provide a platform for knowledge sharing, job referrals, and academic guidance, addressing the high trial-and-error costs newcomers face [2][4].

Group 2
- The community has established a closed loop across industry, academia, and job exchange, providing solutions to problems and connecting members with job opportunities [4].
- A comprehensive resource of over 30 technical routes has been compiled, helping users find benchmarks and learning pathways efficiently [4].
- The community invites industry experts to engage in discussions and provide insights on the latest developments in the embodied intelligence sector [4][12].

Group 3
- The community offers a variety of learning paths for beginners and advanced users, including reinforcement learning and multi-modal large model understanding [12][13].
- Members have access to a job-referral mechanism, ensuring timely connections with desired companies [12][20].
- The community has compiled a wealth of resources, including open-source projects, datasets, and technical learning routes, to facilitate knowledge acquisition and project development [12][29][35].
New SOTA! JanusVLN: Dual Implicit Memory Decouples Semantics and Space, Significantly Cutting Computation and Inference Overhead
具身智能之心· 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, an innovative framework for Vision-Language Navigation (VLN) that addresses the limitations of existing methods with a Dual Implicit Memory paradigm, decoupling visual semantics from spatial geometry [2][19].

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information distortion and loss due to reliance on textual cognitive maps, low computational and inference efficiency from storing historical image frames, and unbounded memory growth leading to "memory explosion" issues [3][5].

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, effectively separating semantic memory from spatial-geometric memory [7][19].
- The framework uses a pre-trained 3D visual geometry model (VGGT) to derive spatial-geometric information from a single RGB video stream, enhancing the model's spatial perception capabilities [8][19].
- A hybrid incremental update strategy maintains a fixed-size memory, significantly improving inference efficiency by avoiding redundant computation [8][11].

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and the hybrid incremental update strategy [10][11].
- The dual-encoder architecture pairs a 2D visual-semantic encoder with a 3D spatial-geometric encoder, working together to provide comprehensive scene understanding [11].

Experimental Results
- JanusVLN was evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15].
- The framework demonstrates superior performance on spatial reasoning tasks, successfully completing complex navigation challenges [18][21].

Quantitative Analysis
- JanusVLN shows significant improvements in success rate (SR), outperforming advanced methods that rely on expensive inputs by 10.5 to 35.5 percentage points [21].
- Compared to other SOTA methods using RGB input with explicit memory, JanusVLN achieves a 3.6 to 10.8 percentage-point SR improvement, validating the effectiveness of the Dual Implicit Memory paradigm [21].
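The hybrid incremental update can be illustrated with a simple bounded cache that pins a few initial observations and keeps a sliding window over recent ones. In JanusVLN the memory holds implicit neural states (cached key/value features), not raw frames; the sketch below, with all names invented, only shows how the memory size stays fixed as the episode grows.

```python
from collections import deque

class DualWindowMemory:
    """Bounded memory: pin the first n_init entries, slide over the rest."""

    def __init__(self, n_init=2, n_recent=3):
        self.n_init = n_init
        self.initial = []                     # pinned early observations
        self.recent = deque(maxlen=n_recent)  # sliding window of recent ones

    def add(self, item):
        if len(self.initial) < self.n_init:
            self.initial.append(item)
        else:
            self.recent.append(item)  # oldest recent item is evicted automatically

    def snapshot(self):
        return self.initial + list(self.recent)

mem = DualWindowMemory(n_init=2, n_recent=3)
for step in range(10):
    mem.add(step)
# After 10 steps the memory still holds only 5 items: [0, 1, 7, 8, 9].
```

Because the cache never grows with episode length, each navigation step touches a constant amount of memory, which is the property that avoids the "memory explosion" problem described above.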
Your First Robotic Arm for Embodied Research: The One My Advisor Recommended, Easy to Use and Great Value!
具身智能之心· 2025-10-07 03:03
Core Viewpoint
- The article introduces the Imeta-y1, a lightweight and cost-effective robotic arm designed for embodied research, addressing the need for affordable yet high-quality hardware among researchers and practitioners [2][4].

Product Overview
- The Imeta-y1 robotic arm is designed for educational, research, and light industrial applications, featuring high-precision motion control, low power consumption, and an open software and hardware architecture [4][5].
- It supports seamless simulation-to-real transfer, providing a comprehensive open-source SDK and toolchain for users to quickly implement algorithm validation, data collection, model training, and application deployment [4][15].

Technical Specifications
- Weight of 4.2 kg, rated load of 3 kg, 6 degrees of freedom, working radius of 612.5 mm, and repeat positioning accuracy of ±0.1 mm [7][17].
- It operates at 24 V, uses a PC controller, and communicates over CAN [7][17].

Product Advantages
- A complete toolchain covers data collection through model training and inference deployment, supporting multi-modal data fusion and compatibility with mainstream frameworks such as TensorFlow and PyTorch [15][30].
- URDF models enable real-time interaction between simulation environments such as Gazebo and the physical device, significantly reducing development risk and debugging cost [20][30].

Development Support
- The SDK supports C++ and Python, allowing developers to get started quickly regardless of language preference [16][24].
- The product supports both ROS1 and ROS2 development, with plans for future upgrades and responsive customer service [17][24].

Testing and Reliability
- The robotic arm undergoes rigorous hardware testing, including precision calibration, durability, load performance, and stability verification, ensuring reliability and safety across application scenarios [33][40].

After-Sales Service
- The company commits to delivering products within 1-2 weeks and provides timely after-sales support, with a six-month warranty covering non-human-caused damage [42][43].
A "Blind" Robot Stuns in Its 30-Second Parkour Debut, Completely Without Vision!
具身智能之心· 2025-10-07 03:03
Core Insights
- The article discusses advances in humanoid robotics, focusing on Amazon's FAR (Frontier AI for Robotics) team and its new technology, OmniRetarget, which enables robots to perform complex tasks without visual sensors [9][49].

Group 1: OmniRetarget Technology
- OmniRetarget allows reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments, achieving zero-shot transfer from simulation to humanoid robots [12][29].
- The technology uses an interaction mesh to model spatial and contact relationships among the robot, objects, and terrain, enhancing data efficiency and reducing data collection costs [15][25].
- OmniRetarget outperforms other motion retargeting methods in key areas such as hard constraints, object interaction, terrain interaction, and data augmentation [16][40].

Group 2: Experimental Results
- The research team demonstrated OmniRetarget's broad capabilities, including natural object manipulation and terrain interaction, achieving a 79.1% success rate on augmented datasets [39][42].
- In comparative tests, OmniRetarget showed superior kinematic quality metrics, such as penetration and contact preservation, outperforming baseline methods [41][42].
- Its high-quality retargeted motions directly improve downstream reinforcement learning policy success rates by over 10% compared to baseline methods [42].

Group 3: Team and Background
- Amazon's recently established FAR team is led by prominent scholars from the robotics field, including researchers from the renowned company Covariant [43][44].
- The team aims to revolutionize automation in humanoid robotics, marking Amazon's first significant foray into this area [49][50].