具身智能之心
No more lopsided models: UniVid unifies video understanding and generation
具身智能之心· 2025-10-22 06:02
Editor: 机器之心

In the video generation and understanding arena, models typically specialize: some focus on video generation, others on video understanding (question answering, classification, retrieval, and so on). Recently, an open-source project, UniVid, proposed a "fusion" direction: merging understanding and generation into one, so that a single unified model can both comprehend video and generate it.

It is like handing both "recognizing what is in a picture" and "drawing a new picture" to the same brain: understand a piece of text plus the content of an existing video, then "draw" a new, coherent video. Technically, this is extremely challenging.

What problem does UniVid tackle? UniVid attempts to fuse video "understanding" and "generation" into a truly general Unified Video Model: a multimodal video model that can both understand and generate.

Core innovation 1. Unified structure: Adapter-based Unified Architecture. Paper title: UniVid: The Open-Sourc ...
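As a rough illustration of what an adapter-based design means, here is a minimal bottleneck adapter in NumPy: a small down-project/up-project module added on top of a frozen backbone's hidden states. The dimensions, ReLU nonlinearity, and near-zero initialization are generic assumptions about adapter modules in general, not UniVid's actual implementation.

```python
import numpy as np

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    h = np.maximum(x @ W_down, 0.0)   # down-projection + ReLU
    return x + h @ W_up               # up-projection with residual connection

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 4
x = rng.standard_normal((2, d_model))          # two backbone hidden states
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.01
W_up = rng.standard_normal((d_bottleneck, d_model)) * 0.01
y = adapter(x, W_down, W_up)
assert y.shape == x.shape  # adapter preserves the backbone's hidden size
```

With near-zero initial weights the adapter starts close to the identity, which is why such modules can be bolted onto a pretrained backbone without disrupting it.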
Ask-to-Clarify: Resolving instruction ambiguity and generating actions end-to-end for real-world embodied tasks
具身智能之心· 2025-10-22 03:04
Core Insights
- The article presents the Ask-to-Clarify framework, aimed at enhancing embodied agents' ability to interact with humans by resolving instruction ambiguity through multi-turn dialogue [2][4][41].

Framework Design
- A new collaborative task for embodied agents is introduced, requiring them to ask questions to clarify ambiguous instructions before executing tasks. This combines a vision-language model (VLM) for questioning with a diffusion model for action generation [6][10].
- The framework consists of two main components: a collaborative module for human interaction and an action module for generating specific actions. A connection module ensures smooth integration between the two [42][46].

Training Strategy
- A two-phase "knowledge isolation" training strategy is proposed. The first phase trains the model to handle ambiguous instructions; the second preserves that capability while strengthening action generation [8][15].
- In the first phase, an interactive-dialogue dataset is constructed to train the collaborative component to ask questions when faced with ambiguous instructions [16][17].
- The second phase uses a hierarchical framework for end-to-end action generation, ensuring the model retains its clarification ability while learning to generate actions [18][19].

Inference Process
- During inference, the framework converses with users to clarify instructions and then executes the inferred correct actions. A signal detector routes the process between questioning and executing based on the task state [22][23].
- The model emits specific signal markers to indicate whether an instruction is ambiguous, guiding its response accordingly [22][23].

Experimental Validation
- The framework was tested in real-world scenarios, demonstrating its ability to clarify ambiguous instructions and reliably generate actions. Experiments included ablation studies on the training strategy and the connection module [24][25][41].
- The Ask-to-Clarify framework significantly outperformed baseline models in handling ambiguous instructions and executing tasks accurately [29][30][35].

Robustness Testing
- Robustness was evaluated under challenging conditions such as low-light environments and the presence of distractors; the framework consistently outperformed baselines in these scenarios, showcasing its practical applicability [37][39][40].
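The signal-marker routing described above can be sketched as a tiny dispatcher: the model's text output carries a marker, and a detector routes between asking and acting. The marker strings `[ASK]` and `[ACT]` are hypothetical placeholders; the paper's actual tokens and detector are not specified here.

```python
# Hypothetical marker strings; the paper's actual signal tokens may differ.
ASK_MARKER = "[ASK]"
ACT_MARKER = "[ACT]"

def route(model_output: str) -> str:
    """Dispatch between asking a clarifying question and executing an action,
    based on the signal marker at the start of the model's output."""
    if model_output.startswith(ASK_MARKER):
        return "question:" + model_output[len(ASK_MARKER):].strip()
    if model_output.startswith(ACT_MARKER):
        return "action:" + model_output[len(ACT_MARKER):].strip()
    return "action:" + model_output.strip()  # no marker: treat as executable

assert route("[ASK] Which cup do you mean?") == "question:Which cup do you mean?"
assert route("[ACT] pick up the red cup") == "action:pick up the red cup"
```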
The 具身智能之心 robot motion control group has launched
具身智能之心· 2025-10-22 03:04
Group 1
- The establishment of the Embodied Intelligence Robotics Motion Control Group is announced, focusing on research directions such as humanoid and quadruped robots [1]
- The group covers topics including VLA, reinforcement learning, WBC, and MPC [1]

Group 2
- Individuals are invited to join the group by adding the WeChat assistant with their membership details [2]
- The blog author welcomes industry and academic discussion via personal WeChat [4]
How are reinforcement learning and VLA combined? An analysis of several representative works, and the open challenges
具身智能之心· 2025-10-22 03:04
Core Insights
- The article discusses integrating reinforcement learning (RL) with Vision-Language-Action (VLA) models to enhance robotic capabilities, enabling robots to understand visual and linguistic instructions while optimizing their actions through trial and error [2][8].

Group 1: VLA and Reinforcement Learning Integration
- Combining VLA models with RL lets robots interpret tasks and adjust their actions based on feedback, improving performance in complex environments [2][3].
- The GRAPE framework improves the generalization of robotic policies by aligning preferences, decomposing complex tasks into manageable stages, and optimizing actions through RL, raising success rates by 51.79% on seen tasks and 58.20% on unseen tasks [6][7].

Group 2: Addressing Generalization Challenges
- VLA models struggle to generalize to unfamiliar scenarios; the VLA-RL framework models robotic operation as a multi-turn dialogue and achieves higher success rates on 40 complex tasks than pure imitation learning [8][10].
- The ReWiND framework generates flexible reward functions from language descriptions, letting robots adapt to new tasks with learning that is twice as fast in simulation and five times faster in real-world applications [12][14].

Group 3: Fine-Tuning Strategies
- The ConRFT framework combines offline and online fine-tuning, achieving an average success rate of 96.3% across eight real-world tasks, a significant improvement over traditional supervised learning [15][18].
- The Dual-Actor framework uses a pre-trained VLA model to master basic actions before fine-tuning through RL, improving the robot's success rate on complex assembly tasks [20][22].

Group 4: Safety and Efficiency
- Safety mechanisms are integrated into the RL process to prevent collisions and damage during robotic exploration, ensuring a secure and efficient learning environment [23][24].
- The article stresses the importance of efficient multi-modal encoders to address the information loss that can occur when integrating visual, linguistic, and action data [27][28].
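ReWiND's idea of deriving rewards from language descriptions can be illustrated, in heavily simplified form, with an embedding-similarity reward. ReWiND's actual reward model is learned; this cosine-similarity stand-in only shows the shape of the interface: a scalar reward computed from a state embedding and a language-goal embedding.

```python
import numpy as np

def language_reward(state_emb, goal_emb):
    """Cosine similarity between a state embedding and the embedding of a
    language goal description, used as a dense shaped reward. This is a
    common recipe for illustration only; ReWiND learns its reward model."""
    num = float(state_emb @ goal_emb)
    den = float(np.linalg.norm(state_emb) * np.linalg.norm(goal_emb)) + 1e-8
    return num / den

goal = np.array([1.0, 0.0, 0.0])                  # embedded goal description
assert language_reward(np.array([1.0, 0.0, 0.0]), goal) > 0.99   # on-goal state
assert language_reward(np.array([-1.0, 0.0, 0.0]), goal) < -0.99 # off-goal state
```

An RL loop would then call `language_reward` at each step instead of a hand-written task reward, which is what makes language-specified tasks cheap to set up.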
IROS 2025 AIR4S Workshop: AI + robotics is reshaping the future of science
具身智能之心· 2025-10-21 07:20
Core Insights
- The article discusses the integration of embodied AI and robotics in scientific research, highlighting the shift from human-led exploration to AI-robot collaboration in scientific discovery [4][5].

Group 1: Workshop Overview
- The IROS 2025 AIR4S Workshop, themed "Embodied AI and Robotics for Future Scientific Discovery," is scheduled for October 24, 2025, in Hangzhou, China [6][20].
- The workshop explores how AI and robots can participate in the entire research process, from literature review to hypothesis generation, experimental execution, data analysis, and publication [5][6].

Group 2: Expert Participation
- Notable experts from academia and industry will participate, including representatives from Unitree Robotics, MIT, Stanford University, Tencent Robotics X, and the University of Tokyo, discussing embodied AI and its applications in scientific discovery [7][9].

Group 3: Research Papers and Innovations
- The workshop received 17 papers covering cutting-edge topics such as AI for Science, robotic scientists, and laboratory automation; a novel AI Review mechanism will be introduced to assist the paper review process [13][14].
- Integrating AI into scientific evaluation aims to make research assessment more efficient and intelligent [14].

Group 4: Community and Support
- The workshop is supported by organizations including NOKOV, Frontiers in Robotics and AI, and Lumina, promoting the cross-disciplinary development of embodied AI and research automation [15].
- Scholars, students, and industry partners interested in the intersection of AI, robotics, and scientific discovery are encouraged to join the IROS 2025 AIR4S community for updates and discussions [17].
See you in Hangzhou! 具身智能之心 sponsors IROS for the first time and will present awards on site
具身智能之心· 2025-10-21 01:30
Core Viewpoint
- The RoboSense Challenge 2025 aims to systematically evaluate robots' perception and understanding capabilities in real-world scenarios, addressing the difficulties traditional perception algorithms face in complex environments [1].

Group 1: Event Overview
- The challenge is organized by multiple prestigious institutions, including the National University of Singapore, Nanyang Technological University, and the University of Michigan, among others [4][5].
- It is an officially recognized competition at the IROS 2025 conference, which will take place in Hangzhou, China [5].

Group 2: Challenge Objectives
- The primary goal is to develop socially intelligent autonomous navigation robots that can move safely and efficiently through dynamic indoor environments without disrupting human activities [8][10].
- Entrants build a perception and navigation system based on RGBD vision and odometry, with robots operating without maps or privileged information [9].

Group 3: Challenge Difficulties
- Key challenges include dynamic behavior modeling, social rule encoding, and uncertainty handling in unpredictable environments [12].
- Evaluation considers not only success rates and path efficiency but also social compliance indicators and collision statistics [12].

Group 4: Recommended Directions
- Suggested approaches include transformer-based social trajectory prediction modules, behavior classifiers for risk assessment, and graph neural networks for multi-target structural modeling [15].
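For the trajectory-prediction direction, a useful reference point is the constant-velocity baseline that learned social predictors are typically compared against. The function below is that baseline, not anything prescribed by the challenge itself.

```python
import numpy as np

def predict_constant_velocity(positions, horizon):
    """Constant-velocity pedestrian forecast: extrapolate the last observed
    displacement. The standard baseline for social trajectory prediction."""
    v = positions[-1] - positions[-2]          # last observed displacement
    steps = np.arange(1, horizon + 1)[:, None]
    return positions[-1] + steps * v           # (horizon, 2) future points

obs = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])  # observed (x, y) track
pred = predict_constant_velocity(obs, horizon=3)
assert pred.shape == (3, 2)
assert np.allclose(pred[-1], [2.5, 0.0])  # keeps moving at 0.5 per step
```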
The value of open source for robotics far exceeds expectations: 唐文斌 in deep conversation with a Hugging Face co-founder
具身智能之心· 2025-10-21 00:03
Core Insights
- The article discusses the challenges in robotics, particularly the gap between simulation and real-world application, and introduces RoboChallenge.ai as a standardized evaluation platform for embodied intelligence [2][42][51].

Group 1: Current Challenges in Robotics
- Many models perform well in simulation but fail in real-world scenarios, a significant pain point in robotics research [2][42].
- The field needs a unified, open, and reproducible evaluation system; current benchmarks are primarily simulation-based [50][44].

Group 2: Introduction of RoboChallenge.ai
- RoboChallenge.ai is launched as an open, standardized platform for evaluating robotic models in real-world environments, allowing researchers to remotely test their models on physical robots [6][51].
- The platform lets users control local models through an API, facilitating remote testing without the need to upload models [8][53].

Group 3: Importance of Open Source in Robotics
- Open source is a crucial driver of progress in AI and robotics, enabling collaboration and innovation across global teams [10][19].
- Open source may matter even more in robotics than in large language models (LLMs), because applying a model requires access to hardware [20][22].

Group 4: Future Directions and Community Involvement
- The next three to five years are expected to bring significant evolution in embodied intelligence research, with robots capable of executing longer and more complex tasks [82].
- Community participation is encouraged, with the expectation that diverse contributions will enhance data availability and model robustness [66][68].
No retraining needed: an HKU team proposes the GPC framework for robot "policy composition"
具身智能之心· 2025-10-21 00:03
Editor: 机器之心

First author 曹嘉航 is a PhD student at the University of Hong Kong and a former intern at the Beijing Humanoid Robot Innovation Center; co-first author 黄翊泽 is an undergraduate at Shanghai Jiao Tong University; the corresponding advisor is Andrew F. Luo, assistant professor at the University of Hong Kong.

In robot learning, improving a generative-model-based control policy usually means spending heavily on additional data collection and model training, which greatly limits how quickly robot capabilities can iterate and upgrade. Facing this performance bottleneck, how can the potential of existing policies be further unlocked without adding training burden?

The University of Hong Kong team proposes the GPC (General Policy Composition) framework, a novel training-free solution to this challenge. At test time, GPC composes multiple pre-trained models into a "composed policy" whose performance surpasses any single parent policy. As a plug-and-play general framework, GPC can flexibly fuse different architectures (such as Diffusion-base ...
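One common reading of test-time composition for diffusion-based policies is combining the parents' noise (score) predictions at each denoising step. The convex-combination rule and fixed step size below are illustrative assumptions for a toy sketch, not GPC's exact formulation.

```python
import numpy as np

def composed_denoise_step(x, policies, weights, t):
    """One denoising step of a composed policy: a convex combination of each
    parent policy's noise prediction. Illustrative only; the actual GPC
    composition rule may differ."""
    eps = sum(w * p(x, t) for p, w in zip(policies, weights))
    return x - 0.1 * eps  # toy update with a fixed step size

# Two toy "parent policies" whose noise predictions pull in opposite directions.
p1 = lambda x, t: x - 1.0   # drives samples toward +1
p2 = lambda x, t: x + 1.0   # drives samples toward -1
x = np.zeros(3)
out = composed_denoise_step(x, [p1, p2], [0.5, 0.5], t=0)
assert np.allclose(out, 0.0)  # equal weights: the opposing pulls cancel
```

The appeal of this kind of rule is that it needs only forward passes through frozen parents, which is what makes the approach training-free.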
No more "expert monopoly": AdaMoE cracks the efficiency-accuracy dilemma of VLA models
具身智能之心· 2025-10-21 00:03
Core Viewpoint
- The article discusses the AdaMoE architecture, which enhances the performance of Vision-Language-Action (VLA) models in robotic control by decoupling expert selection from weight distribution, leading to improved success rates in both simulation and real-world tasks [1][24].

Research Background: The Three Dilemmas of VLA Models
- Traditional VLA models face three main dilemmas:
  1. Performance is hard to improve because training costs are high: collecting precise robotic data is resource-intensive [2].
  2. Real-time control is challenging: dense models require all parameters to be activated, slowing response times [3].
  3. Naive Mixture of Experts (MoE) is inefficient: conflicts among experts hinder effective task execution [5].

Core Design: The Decoupling Magic of AdaMoE
- AdaMoE's innovation lies in separating expert selection from contribution weighting, allowing each component to focus on its strength rather than trying to solve all problems simultaneously [6].

Key Designs of AdaMoE
- Design 1: Reuses pre-trained weights to significantly reduce training costs, fine-tuning specialized skills rather than relearning basic actions [8].
- Design 2: Combines "sparse activation" with dual-module decoupling to balance capacity and efficiency while preventing conflicts among experts [9][10].

Key Findings: Advantages of Decoupling
- Extensive experiments yielded four conclusions highlighting AdaMoE's superiority:
  1. Experts specialize effectively in their tasks without interference, improving performance [13].
  2. Decoupled responsibilities outperform traditional coupled routing [15].
  3. Fewer, more specialized experts yield better results than a larger number of overlapping experts [19].
  4. Real-world scenarios benefit more from decoupling than simulated environments, with significant improvements in task success rates [22].

Experimental Results: Validation of AdaMoE
- AdaMoE demonstrated superior performance across various benchmarks, achieving an average success rate of 96.0% and outperforming traditional models and other architectures [23].

Conclusion: The Breakthrough Significance of AdaMoE
- AdaMoE improves performance while allowing VLA models to operate without excessive resource demands, underscoring the value of clear task specialization for robots and humans alike [24][26].
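The selection/weighting decoupling can be sketched as a toy MoE layer in which one set of logits picks WHICH top-k experts fire and a separate set of logits decides HOW MUCH each chosen expert contributes. All names and the exact normalization are illustrative assumptions, not the paper's code.

```python
import numpy as np

def decoupled_moe(x, experts, select_logits, weight_logits, k=2):
    """Decoupled MoE routing: `select_logits` choose which experts fire
    (top-k); separate `weight_logits` set their mixture weights (softmax
    over the selected subset only). Sketch of the idea, not AdaMoE itself."""
    idx = np.argsort(select_logits)[-k:]           # expert selection
    w = np.exp(weight_logits[idx])
    w = w / w.sum()                                # weight distribution
    return sum(wi * experts[i](x) for wi, i in zip(w, idx))

experts = [lambda x: x * 1.0, lambda x: x * 2.0, lambda x: x * 3.0]
sel = np.array([0.1, 5.0, 5.0])     # selector picks experts 1 and 2
wts = np.array([0.0, 0.0, 0.0])     # equal weights among the selected
y = decoupled_moe(np.array([1.0]), experts, sel, wts, k=2)
assert np.isclose(y[0], 2.5)        # (2.0 + 3.0) / 2
```

In a coupled router the same logits would have to do both jobs; splitting them means a weak-but-selected expert can still receive a small weight instead of being forced to dominate.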
Last spot available! Applications of reinforcement learning to humanoids, quadrupeds, robotic arms, and more
具身智能之心· 2025-10-21 00:03
Core Insights
- Reinforcement learning (RL) remains a significant field, with growing applications in robotics, including humanoid and quadrupedal robots, as well as in product optimization across various industries [1][2][3]
- RL's complexity poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][6][9]

Group 1: Importance of Reinforcement Learning
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capabilities [2]
- Companies like Yushu (Unitree) and Zhiyuan use RL so humanoid robots can perform complex actions such as climbing stairs, running, and dancing, enabling applications in rescue and hazardous environments [2][8]

Group 2: Challenges in Learning and Research
- RL's breadth and intricacy make the field hard for beginners to enter, often leading to frustration and abandoned learning [5][9]
- A paper that survives peer review requires proficiency in methodology, experimental results, and writing; a misstep in any of these can draw low scores from reviewers [5][6]

Group 3: Educational Initiatives
- To lower the entry barrier to RL research, a specialized 1v6 mentoring course has been launched, targeting graduate students and others who need guidance in paper writing [6][7]
- The course includes weekly live sessions, project implementation, experimental guidance, and writing refinement, aiming to help participants produce a draft suitable for submission to top conferences and journals [7][9][15]

Group 4: Course Structure and Content
- The course spans 14 weeks of intensive online training followed by 8 weeks of maintenance support, focusing on various aspects of RL and robotics [9][15]
- Key topics include foundational RL concepts, simulation environments, sim2real techniques, and writing guidance, with a structured approach to ensure participants achieve measurable milestones [15][19][20]