具身智能之心
Just in: the ICCV Best Paper is out, and Jun-Yan Zhu's team takes the crown with brick building blocks
具身智能之心· 2025-10-23 00:03
Core Insights
- The article covers the recent International Conference on Computer Vision (ICCV) held in Hawaii, highlighting the award-winning research papers and their contributions to the field of computer vision [2][5][24].

Group 1: Award Winners
- The Best Paper Award went to a research team from Carnegie Mellon University (CMU) for the paper "Generating Physically Stable and Buildable Brick Structures from Text," led by the noted AI scholar Jun-Yan Zhu [3][7][11].
- The Best Student Paper Award went to a paper from the Technion, "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models," which introduces a novel image-editing method [28][30].

Group 2: Conference Statistics
- ICCV, one of the top three computer vision conferences, is held biennially. This year's edition received 11,239 valid submissions and accepted 2,699 papers, resulting in a 24% acceptance rate, a significant increase from the previous conference [5].

Group 3: Research Contributions
- The CMU paper presents BrickGPT, the first method capable of generating physically stable and interconnected brick assembly models from text prompts. The work includes a large dataset of over 47,000 brick structures and 28,000 unique 3D objects with detailed descriptions [11][13].
- The Technion's FlowEdit proposes a new image-editing approach that bypasses the traditional image-to-noise inversion process, achieving higher-fidelity edits by establishing a direct mapping path between the source and target image distributions [32][34].

Group 4: Methodology and Results
- BrickGPT uses an autoregressive large language model trained on the brick-structure dataset, incorporating validity checks and a physics-aware rollback mechanism to keep generated designs stable (a sketch of this loop follows below) [13][19].
- Experimental results show that BrickGPT outperforms baseline models in terms of validity and stability, achieving a 100% validity rate and 98.8% stability for generated structures [20][22].
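To make the "validity checks plus physics-aware rollback" description above concrete, here is a minimal sketch of what such a generation loop can look like. It is not the authors' code; the callables propose_next_brick, passes_validity_checks, and is_physically_stable are hypothetical placeholders standing in for the language model and the paper's checks.

```python
def generate_structure(prompt, propose_next_brick, passes_validity_checks,
                       is_physically_stable, max_bricks=200, max_retries=5,
                       max_steps=2000):
    # Hypothetical generation loop: sample bricks autoregressively, reject
    # invalid candidates, and roll back additions that make the build unstable.
    structure = []
    for _ in range(max_steps):              # overall budget so the loop always ends
        if len(structure) >= max_bricks:
            break
        brick = None
        for _ in range(max_retries):
            candidate = propose_next_brick(prompt, structure)
            if candidate is None:           # model signals the structure is complete
                return structure
            if passes_validity_checks(candidate, structure):
                brick = candidate
                break
        if brick is None:                   # no valid candidate found: back up one brick
            if not structure:
                break
            structure.pop()
            continue
        structure.append(brick)
        if not is_physically_stable(structure):
            structure.pop()                 # physics-aware rollback of the addition
    return structure
```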
星际硅途 releases the FoldPlanet-500 dataset, opening a new era for intelligent clothes-folding robots
具身智能之心· 2025-10-23 00:03
The following article is from 星际硅途 (author: 星际硅途), the official subscription account of Shanghai 星际硅途 Technology Co., Ltd.

Imagine this: morning sunlight fills the bedroom, you have just taken the clean laundry out of the dryer, and a dexterous robot arm gets to work methodically, precisely identifying each garment type and smoothly executing folding motions. Within a few minutes, the pile of clothes becomes neat stacks of small squares, perfectly stored in the wardrobe.

None of this is out of reach. 星际硅途 expects robots capable of highly complex clothes folding to reach real-world deployment within a few years. The key foundation for that vision is a high-quality, structured, learnable dataset of real, generalizable clothes-folding motions.

Today 星际硅途 launches FoldPlanet-500 ("Folding Planet"), an in-the-wild human clothes-folding dataset.

2. Multimodal data, precisely aligned
① Visual perception: every folding task comes with multi-view, high-resolution, temporally clear video and image sequences, precisely capturing the key states and motion trajectories of each step.
② Motion capture: a full-body 31-node motion-capture setup accurately records the trajectories and posture changes of all human joints, producing high-precision motion data suited to coordinated torso-and-limb control and providing a core motion reference for embodied-intelligence training.
③ Semantic annotation: every folding task is accompanied by detailed, step-by-step, quantifiable ...
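As a purely illustrative sketch of what one aligned sample in such a multimodal folding dataset could look like (the actual FoldPlanet-500 schema is not specified in the announcement, and every field name below is hypothetical):

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class FoldingSample:
    # Hypothetical record layout for one folding episode; the real
    # FoldPlanet-500 schema is not described in the excerpt above.
    garment_type: str            # e.g. "t-shirt", "trousers"
    video_paths: List[str]       # multi-view, time-synchronized videos
    mocap: np.ndarray            # shape (T, 31, 3): 31 body nodes per frame
    step_labels: List[str]       # step-by-step semantic annotations

sample = FoldingSample(
    garment_type="t-shirt",
    video_paths=["view_front.mp4", "view_top.mp4"],
    mocap=np.zeros((600, 31, 3)),
    step_labels=["flatten garment", "fold left sleeve", "fold right sleeve"],
)
```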
Zhiyuan Robotics debuts at IROS 2025: the international challenge wraps up successfully, and the full product lineup wins fans with hands-on demos
具身智能之心· 2025-10-22 12:00
Core Insights
- The article highlights the successful hosting of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) in Hangzhou, focusing on the integration of artificial intelligence and robotics technology [2]
- Zhiyuan Robotics showcased its full range of products, demonstrating its leadership in embodied intelligence technology and ecosystem development [2][12]

Product Demonstrations
- Zhiyuan Robotics presented its product lineup, including the Qiling series, Lingxi X2, and Expedition A2, showcasing their capabilities in various industrial applications [4]
- The Qiling G1 robot performed fully automated logistics operations without human intervention, collaborating with Demar Technology's intelligent sorting robot [4]
- The newly launched Qiling G2 robot features advanced capabilities with dual 7-degree-of-freedom robotic arms, achieving high-precision tasks and expanding humanoid robot applications [4]
- The Lingxi X2 demonstrated high maneuverability and interactive capabilities, while the Expedition A2 showcased its end-to-end operational capabilities in a simulated environment [6]

International Challenge
- The inaugural "AgiBot World Challenge @ IROS 2025," co-hosted by Zhiyuan Robotics and OpenDriveLab, attracted 431 teams from 23 countries, with a total prize pool of $560,000 [9]
- The Manipulation track featured intense competition, with 11 teams advancing to the finals, showcasing their skills on the Zhiyuan Robotics platform [9]
- The World Model track focused on AI's ability to predict the physical world, leading to several technological breakthroughs [11]

Strategic Direction
- Zhiyuan Robotics is driving the large-scale application of embodied intelligence technology through a dual strategy of "technology + ecosystem," aiming to empower various industries and promote human-robot collaboration [12]
Unitree's newest robot unveiled: 1.8 meters tall, dances and does kung fu, but the looks are hard to describe
具身智能之心· 2025-10-22 06:02
Core Viewpoint
- The article discusses the launch of Unitree's new humanoid robot, H2, highlighting its design, features, and public reception.

Group 1: Product Features
- The H2 robot stands 180 cm tall and weighs 70 kg, making it 23 kg heavier than its predecessor, H1 [2][14].
- H2 features a bionic face, which aims to make it appear more human-like [5][6].
- The robot has 31 degrees of freedom, enhancing its movement capabilities compared to previous models [14][24].

Group 2: Public Reception
- The design of H2's face has received mixed reviews, with many users finding it unsettling, reminiscent of the NS-5 robot from the movie "I, Robot" [8][10].
- Despite the advanced features, some viewers commented on the robot's dance performance, describing it as lacking emotional expression and comparing it to a "zombie" [25][26].
- Overall, while there is excitement about the new robot, users are curious about its practical applications, such as household chores [38].

Group 3: Performance Demonstrations
- H2 showcased its capabilities through dance, martial arts, and runway walking, demonstrating improved agility and coordination [20][28][36].
- The robot's martial arts performance was noted to be impressive, indicating advancements in stability and coordination technology [33].
- The final presentation included a reference to Leonardo da Vinci's "Vitruvian Man," symbolizing the ideal human proportions [40].
Stop reinventing the wheel! 原力灵机 open-sources Dexbotic: a one-stop VLA toolbox for embodied intelligence
具身智能之心· 2025-10-22 06:02
Core Insights
- The article discusses the rapid development of embodied VLA (Vision-Language-Action) models and the challenges faced by individual developers and small research teams in creating and maintaining a unified open-source framework for these models [4][7][29].

Group 1: VLA Development Challenges
- The current VLA development landscape is fragmented, with various teams using different deep learning frameworks and model architectures, leading to inefficiencies in model comparison and performance evaluation [4][7].
- Existing VLA models often do not leverage the capabilities of the latest LLMs (Large Language Models), which limits the potential of the "embodied brain" [4][7].
- There is a pressing need for a mature, unified open-source VLA framework to address these challenges, which led to the creation of Dexbotic [4][7].

Group 2: Dexbotic Framework Features
- Dexbotic integrates mainstream pre-trained models for manipulation and navigation policies, supporting both cloud and local training, making it user-friendly and ready to use [2][4].
- The framework introduces the Dexdata format to unify data from different sources, significantly reducing storage costs and simplifying data preparation for developers (a generic illustration of this idea follows below) [9][10].
- Dexbotic's architecture consists of three layers, a data layer, a model layer, and an experimental layer, improving the efficiency of algorithm comparison and model iteration by over 50% [11][24].

Group 3: Performance Improvements
- Dexbotic's pre-trained models show significant performance improvements across tasks, with DB-CogACT achieving an 18.2% increase in average success rate over the original CogACT model [21][22].
- The framework has also demonstrated strong performance on real-world tasks, with UR5e achieving a 100% success rate on specific tasks [29].

Group 4: Open Source and Community Engagement
- Dexbotic aims to foster collaboration and innovation in embodied intelligence by providing an open-source platform that lets developers contribute and share their work [30][32].
- The initiative encourages participation from both academic and industrial partners to advance embodied intelligence technologies [30][32].
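The Dexdata bullet above names a unified data format without specifying it. The sketch below only illustrates the general pattern such a format enables, normalizing episodes logged by different stacks into one record so downstream training code sees a single schema; the source names and field names are hypothetical, and this is not Dexbotic's actual Dexdata specification.

```python
def to_unified_episode(source_name, raw_episode):
    # Illustrative normalizer: map episodes recorded by different pipelines
    # into one common record. The target fields are hypothetical placeholders,
    # not the real Dexdata spec.
    if source_name == "sim_logger_a":           # hypothetical source format A
        steps = [{"image": s["rgb"], "state": s["qpos"], "action": s["cmd"]}
                 for s in raw_episode["frames"]]
        instruction = raw_episode["task_text"]
    elif source_name == "teleop_logger_b":      # hypothetical source format B
        steps = [{"image": s["camera_front"], "state": s["robot_state"],
                  "action": s["delta_ee_pose"]} for s in raw_episode["steps"]]
        instruction = raw_episode["language_instruction"]
    else:
        raise ValueError(f"unknown source: {source_name}")
    return {"instruction": instruction, "steps": steps, "source": source_name}
```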
RLinf-VLA: a unified and efficient framework for VLA+RL training
具身智能之心· 2025-10-22 06:02
Core Insights
- The article presents RLinf-VLA, a unified and efficient framework for training Vision-Language-Action (VLA) models with reinforcement learning (RL), addressing the limitations of existing models that rely on supervised fine-tuning [2][53]
- The framework significantly improves training efficiency and generalization, achieving high success rates across simulation tasks and outperforming traditional supervised methods in real-world applications [5][53]

Framework Design
- The RLinf-VLA framework integrates multiple simulators, algorithms, and VLA architectures, optimizing resource allocation through flexible execution modes and system-level enhancements [4][53]
- It supports three GPU allocation strategies, colocated, disaggregated, and hybrid, and lets users switch modes via configuration files, reducing system customization costs [10][11]

Model Compatibility
- The framework supports LoRA for parameter-efficient tuning, reducing memory consumption and accelerating training while maintaining performance [12]
- It is compatible with OpenVLA and its extension OpenVLA-OFT, which have shown strong performance on various robotic manipulation benchmarks [12][22]

Multi-Simulator Support
- The framework emphasizes the role of simulators in RL, using ManiSkill and LIBERO as the primary simulators to cover diverse tasks [13]
- It provides a unified interface across simulators, simplifying task implementation, and supports multiple RL algorithms, initially PPO and GRPO [13][14]

Algorithm Design
- The framework incorporates flexible advantage-function and log-probability calculations, allowing block-level and action-level definitions to be combined (see the sketch after this summary) [14][15]
- It supports optimization strategies such as trajectory length normalization and effective action masking to improve training stability and performance [19][20]

Experimental Results
- RLinf-VLA demonstrated significant performance improvements, with success rates increasing by 45% to 70% on various tasks compared to baseline models [22][24]
- On LIBERO tasks, the framework achieved an average success rate of 98.11%, showcasing its capability for large-scale multi-task reinforcement learning [28]

High-Efficiency Performance
- Efficiency is evaluated in terms of throughput, with substantial improvements in training speed across different GPU configurations [30][35]
- The hybrid allocation mode outperformed traditional methods, demonstrating the benefits of pipeline overlapping for resource utilization [35][37]

Real-World Deployment
- RLinf-VLA was successfully deployed in real-world environments, showing superior zero-shot generalization compared to supervised fine-tuning strategies [51][53]
- The experiments indicate that RL-trained models adapt better to real-world tasks, achieving higher success rates in object manipulation [51]

Conclusion
- RLinf-VLA represents a significant advance in embodied intelligence, providing a robust foundation for future research and development in VLA training [53]
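As a concrete illustration of the advantage computation, action masking, and length normalization mentioned under Algorithm Design, here is a minimal generic sketch (standard generalized advantage estimation plus a masked PPO clipped loss). It is not RLinf-VLA's implementation; the masking and normalization conventions are assumptions.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Standard generalized advantage estimation for one trajectory
    # (arrays of shape (T,)). Generic implementation, not RLinf-VLA's code.
    adv = np.zeros(len(rewards))
    running, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
        next_value = values[t]
    return adv

def masked_ppo_loss(logp_new, logp_old, advantages, action_mask, clip=0.2):
    # PPO clipped surrogate with an action mask: padded or invalid action steps
    # contribute nothing, and the loss is averaged over the number of valid
    # steps (one possible reading of "trajectory length normalization").
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    per_step = -np.minimum(unclipped, clipped) * action_mask
    return per_step.sum() / max(action_mask.sum(), 1.0)
```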
Say goodbye to one-sided models: UniVid unifies video understanding and generation
具身智能之心· 2025-10-22 06:02
Editor丨机器之心

On the video-generation and video-understanding tracks, models usually specialize: some focus on generating video, others on understanding it (question answering, classification, retrieval, and so on). Recently, the open-source project UniVid proposed a "fusion" direction: merge understanding and generation into one, using a single unified model that can both comprehend video and generate it.

This is like asking one brain to handle both "recognizing what is in a picture" and "painting a new picture": understand a piece of text plus the content of an existing video, then "draw" a new, coherent video. Technically, this is extremely challenging.

What problem does UniVid try to solve? UniVid attempts to fuse video "understanding" and "generation" into a truly general Unified Video Model, a video multimodal model that can both understand and generate.

Core innovation
1. Unified structure: Adapter-based Unified Architecture (a generic illustration of the adapter pattern follows below)
Paper title: UniVid: The Open-Sourc ...
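The "Adapter-based Unified Architecture" heading above is only named before the excerpt cuts off. The sketch below shows the generic adapter pattern such designs typically build on, a small trainable bottleneck attached to a frozen backbone block with separate adapters per task; it is a generic PyTorch illustration, not UniVid's actual architecture, and the module names are hypothetical.

```python
import torch.nn as nn

class Adapter(nn.Module):
    # Generic bottleneck adapter: a small trainable module added on top of a
    # frozen backbone layer, with a residual connection.
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

class AdaptedBlock(nn.Module):
    # Wraps one frozen backbone block and routes its output through a
    # task-specific adapter (e.g., one for understanding, one for generation).
    def __init__(self, backbone_block, dim):
        super().__init__()
        self.backbone_block = backbone_block
        for p in self.backbone_block.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.adapters = nn.ModuleDict({
            "understand": Adapter(dim),
            "generate": Adapter(dim),
        })

    def forward(self, hidden, task="understand"):
        hidden = self.backbone_block(hidden)
        return self.adapters[task](hidden)
```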
Ask-to-Clarify: resolving instruction ambiguity and generating actions end to end for real embodied tasks
具身智能之心· 2025-10-22 03:04
Core Insights
- The article presents the Ask-to-Clarify framework, which enhances embodied agents' ability to interact with humans by resolving instruction ambiguity through multi-turn dialogue [2][4][41].

Framework Design
- A new collaborative task for embodied agents is introduced, requiring them to ask questions to clarify ambiguous instructions before executing tasks. This combines a vision-language model (VLM) for questioning with a diffusion model for action generation [6][10].
- The framework consists of two main components: a collaborative module for human interaction and an action module for generating specific actions. A connection module is designed to ensure smooth integration between these components [42][46].

Training Strategy
- A two-phase "knowledge isolation" training strategy is proposed: the first phase trains the model to handle ambiguous instructions, and the second phase preserves this capability while strengthening action generation [8][15].
- In the first phase, a dataset of interactive dialogue is constructed to train the collaborative component, allowing it to ask questions when faced with ambiguous instructions [16][17].
- The second phase uses a hierarchical framework for end-to-end action generation, ensuring that the model retains its ability to clarify instructions while learning to generate actions [18][19].

Inference Process
- During inference, the framework engages in dialogue with users to clarify instructions and then executes the inferred correct actions. A signal detector routes the process between questioning and executing based on the task state (a sketch of this loop follows below) [22][23].
- The model uses specific signal markers to indicate whether an instruction is ambiguous, guiding its response accordingly [22][23].

Experimental Validation
- The framework was tested in real-world scenarios, demonstrating its ability to clarify ambiguous instructions and reliably generate actions. Various experiments assessed its performance, including ablation studies on the training strategy and the connection module [24][25][41].
- The results show that Ask-to-Clarify significantly outperforms baseline models in handling ambiguous instructions and executing tasks accurately [29][30][35].

Robustness Testing
- Robustness was evaluated under challenging conditions, such as low-light environments and the presence of distractors. The framework consistently outperformed baselines in these scenarios, showcasing its practical applicability [37][39][40].
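The signal-detector routing described under Inference Process can be pictured as a small control loop. The sketch below is a hypothetical illustration; the function names, the AMBIGUOUS/CLEAR tags, and the environment interface are placeholders, not the paper's implementation.

```python
def run_episode(instruction, vlm, action_policy, env, max_turns=5):
    # Hypothetical ask-then-act loop. vlm(instruction, observation, history)
    # is assumed to return a response tagged either "AMBIGUOUS" (with a
    # clarifying question) or "CLEAR" (with a resolved instruction).
    history = []
    for _ in range(max_turns):
        obs = env.observe()
        response = vlm(instruction, obs, history)
        if response.tag == "AMBIGUOUS":
            # Route to the human: ask the clarifying question, record the answer.
            answer = env.ask_user(response.question)
            history.append((response.question, answer))
        else:
            # Instruction resolved: hand the grounded instruction to the
            # action module (e.g., a diffusion policy) and execute it.
            actions = action_policy(response.resolved_instruction, obs)
            return env.execute(actions)
    return None  # gave up after too many clarification turns
```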
The 具身智能之心 robot motion control group is now live~
具身智能之心· 2025-10-22 03:04
Group 1
- The establishment of the Embodied Intelligence Robotics Motion Control Group, focusing on research directions such as humanoid and quadruped robots, is announced [1]
- The group is interested in topics including VLA, reinforcement learning, WBC, and MPC [1]

Group 2
- An invitation is extended for individuals to join the group by adding a WeChat assistant with specific membership details [2]
- The blog author encourages communication regarding industry and academic discussions through personal WeChat [4]
How are reinforcement learning and VLA combined? An analysis of several representative works, and the remaining challenges
具身智能之心· 2025-10-22 03:04
Core Insights
- The article discusses the integration of reinforcement learning (RL) with Vision-Language-Action (VLA) models to enhance robotic capabilities, enabling robots to understand visual and linguistic instructions while optimizing their actions through trial and error [2][8].

Group 1: VLA and Reinforcement Learning Integration
- The combination of VLA models and RL allows robots to interpret tasks and adjust their actions based on feedback, improving their performance in complex environments (a generic sketch of this training pattern follows below) [2][3].
- The GRAPE framework enhances the generalization of robotic policies by aligning preferences, breaking complex tasks into manageable stages, and optimizing actions through RL, yielding success-rate gains of 51.79% on seen tasks and 58.20% on unseen tasks [6][7].

Group 2: Addressing Generalization Challenges
- VLA models struggle to generalize to unfamiliar scenarios; the VLA-RL framework models robotic operation as a multi-turn dialogue and achieves higher success rates on 40 complex tasks than pure imitation learning [8][10].
- The ReWiND framework generates flexible reward functions from language descriptions, allowing robots to adapt to new tasks with learning that is twice as fast in simulation and five times faster in real-world settings [12][14].

Group 3: Fine-Tuning Strategies
- The ConRFT framework combines offline and online fine-tuning, achieving an average success rate of 96.3% across eight real-world tasks and significantly outperforming traditional supervised learning [15][18].
- The Dual-Actor framework uses a pre-trained VLA model to master basic actions before fine-tuning through RL, enabling robots to perform complex assembly tasks with higher success rates [20][22].

Group 4: Safety and Efficiency
- Safety mechanisms are integrated into the RL process to prevent collisions and damage during exploration, ensuring a secure and efficient learning environment [23][24].
- The article emphasizes the importance of designing efficient multi-modal encoders to address the challenge of integrating visual, linguistic, and action data without information loss [27][28].
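As a generic illustration of the rollout-reward-update pattern shared by the works summarized above, with a simple safety gate in the spirit of Group 4, here is a minimal policy-gradient fine-tuning step for a pretrained VLA policy. Every name and interface below is an assumption for the sketch, not code from GRAPE, VLA-RL, ReWiND, ConRFT, or Dual-Actor.

```python
import torch

def rl_finetune_step(vla_policy, value_fn, optimizer, env, instruction,
                     is_unsafe, horizon=50, gamma=0.99):
    # Generic policy-gradient fine-tuning step for a pretrained VLA policy:
    # roll out, collect rewards, compute advantages against a value baseline,
    # and update. Interfaces (env, is_unsafe, vla_policy.sample) are assumed.
    obs = env.reset(instruction)
    log_probs, rewards, values = [], [], []
    for _ in range(horizon):
        action, log_prob = vla_policy.sample(obs, instruction)
        if is_unsafe(obs, action):          # safety gate: stop the rollout early
            log_probs.append(log_prob)
            rewards.append(torch.tensor(-1.0))
            values.append(value_fn(obs, instruction))
            break
        obs, reward, done = env.step(action)
        log_probs.append(log_prob)
        rewards.append(torch.as_tensor(reward, dtype=torch.float32))
        values.append(value_fn(obs, instruction))
        if done:
            break

    # Discounted returns and a simple value-baseline advantage.
    returns, running = [], torch.tensor(0.0)
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.stack(returns)
    values = torch.stack(values).squeeze(-1)
    advantages = returns - values.detach()

    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```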