WorldVLA: A World Model Enabling Bidirectional Vision-Action Enhancement, with Significantly Improved Grasping Accuracy
具身智能之心· 2025-06-30 12:17
Core Insights
- The article introduces WorldVLA, an autoregressive action world model that unifies action and image understanding and generation, outperforming standalone action models and world models [3][6][8].

Group 1: WorldVLA Overview
- WorldVLA combines a vision-language-action (VLA) model and a world model in a single framework, with the two components reinforcing each other to improve performance [3][6].
- The model uses three separate tokenizers for images, text, and actions that share a single vocabulary, unifying cross-modal understanding and generation [6][14].
- An attention-mask strategy is proposed to mitigate error propagation in action-sequence generation, significantly improving performance on action-chunk generation tasks [7][31].

Group 2: Model Architecture and Training
- The architecture consists of an action model and a world model: the action model generates actions from image observations and language instructions, while the world model predicts future states from observed sequences and actions [11][13].
- Training mixes action-model data with world-model data to strengthen action generation, with the world model contributing a better understanding of environmental physics [15][20].
- The loss function combines the cross-entropy losses of both models, rebalancing their contributions to account for the disparity in token counts [20].

Group 3: Experimental Results
- WorldVLA achieves a 4% higher success rate on grasping tasks than comparable action models and a 10% lower Fréchet Video Distance (FVD) than standard world models [7][26].
- Performance improves at higher image resolutions, which is crucial for tasks requiring high operational precision [26].
- Integrating the world model substantially improves the action model by giving it a better grasp of the underlying physical dynamics [28].

Group 4: Attention Mask and Performance
- The proposed attention mask enables parallel generation of multiple actions, reducing dependence on previously generated actions and alleviating error accumulation (a minimal sketch of this masking idea follows this summary) [19][31].
- Performance is optimized by using two historical image frames as input, balancing task success rate against computational cost [32].

Group 5: Pre-training and Future Potential
- Pre-training the action model on world-model data significantly improves grasping performance, highlighting the potential of general world knowledge to boost specific robotic tasks [35].
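The masking idea above can be illustrated in a few lines of PyTorch. This is a minimal sketch under the assumptions that each action occupies one token and that action tokens should attend to the image/text prefix but not to each other; it is not the authors' released implementation.

```python
import torch

def build_action_attention_mask(prefix_len: int, num_actions: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend), shape (L, L).

    Prefix tokens (image + text) use ordinary causal attention; each
    action token attends to the full prefix and to itself only, so an
    error in one generated action cannot propagate into the next.
    """
    total = prefix_len + num_actions
    # Standard causal (lower-triangular) mask as the base.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Cut attention between distinct action tokens, keep self-attention.
    mask[prefix_len:, prefix_len:] = torch.eye(num_actions, dtype=torch.bool)
    return mask

print(build_action_attention_mask(prefix_len=4, num_actions=3).int())
```

Because no action token depends on an earlier action token under this mask, all action positions can in principle be decoded in parallel in a single forward pass, which is what enables the parallel action-chunk generation described above.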
Major Livestream! CVPR Champion Solution BridgeVLA Boosts Real-Robot Performance by 32%
具身智能之心· 2025-06-30 12:17
Core Viewpoint
- The article emphasizes the shift of live streaming and content acquisition toward embodied intelligence, highlighting the importance of knowledge sharing and community engagement in the digital landscape [1].

Group 1
- The transition of live-streaming platforms toward more interactive, intelligent content delivery is discussed, indicating a trend toward personalized user experiences [1].
- The role of community-driven platforms in enhancing user engagement and content quality is highlighted, suggesting that companies should focus on building strong user communities [1].
- The potential for embodied intelligence to revolutionize content creation and consumption is explored, with implications for future business models in the industry [1].

Group 2
- The article outlines the competitive landscape of the live-streaming industry, noting key players and their strategies for content acquisition and user retention [1].
- It provides insights into user-behavior trends, indicating a growing preference for interactive and immersive content experiences [1].
- The impact of technological advances on content delivery and user engagement is analyzed, suggesting that companies must adapt to stay relevant in a rapidly evolving market [1].
UCLA Proposes PEVA: The World-Model Era for Embodied Agents
具身智能之心· 2025-06-30 03:47
Author丨Yutong Bai et al.  Editor丨具身智能之心
This article is shared for academic purposes only; please contact us for removal in case of infringement.

Background and Motivation

This paper addresses a fundamental challenge for embodied agents: understanding the relationship between physical action and visual perception. Humans actively reshape their first-person visual input through whole-body movements (e.g., turning, reaching), which is critical for environment interaction and long-horizon planning. Existing world models (such as velocity-controlled navigation models) have significant limitations that hinder an agent's ability to interact physically in real-world scenes. This work proposes the PEVA model, the first to condition first-person video prediction on whole-body 3D pose, providing a more physically grounded simulation environment for embodied intelligence. The content comes from 具身智能之心知识星球, the country's first full-stack embodied-AI learning community, where you can exchange ideas with nearly 200 companies and institutions.

Core Innovations

1. Structured whole-body action representation
Key breakthrough: actions are defined as a 48-dimensional vector fusing global body motion (pelvis displacement) with local joint rotations (Euler-angle changes for 15 upper-body joints), preserving hierarchical relations through the kinematic tree (a toy construction of this vector follows below).
1. Oversimplified action representation: most models use low- ...
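To make the 48-dimensional layout concrete, here is a toy construction in NumPy. The field order (pelvis displacement first, then per-joint Euler deltas) is an assumption inferred from the description above, not the authors' code.

```python
import numpy as np

NUM_UPPER_BODY_JOINTS = 15  # per the paper's description

def build_peva_action(pelvis_disp: np.ndarray, joint_euler_deltas: np.ndarray) -> np.ndarray:
    """Pack a whole-body action: 3 pelvis values + 15 joints x 3 Euler deltas = 48 dims."""
    assert pelvis_disp.shape == (3,)
    assert joint_euler_deltas.shape == (NUM_UPPER_BODY_JOINTS, 3)
    return np.concatenate([pelvis_disp, joint_euler_deltas.reshape(-1)])

action = build_peva_action(np.zeros(3), np.zeros((NUM_UPPER_BODY_JOINTS, 3)))
print(action.shape)  # (48,) = 3 + 15 * 3
```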
The Essential Tech Stack for Getting Started in Embodied AI: From Zero to Reinforcement Learning and Sim2Real
具身智能之心· 2025-06-30 03:47
Core Insights
- The article emphasizes that the field of AI is at a transformative juncture, particularly with the rise of embodied intelligence, which allows machines to understand and interact with the physical world [1][2].

Group 1: Embodied Intelligence
- Embodied intelligence refers to AI systems that have not only a "brain" but also a "body" capable of perceiving and altering the physical environment [1].
- Major tech companies such as Tesla, Boston Dynamics, OpenAI, and Google are actively developing technologies in this field [1].
- Its potential impact spans industries including manufacturing, healthcare, and space exploration [1].

Group 2: Technical Challenges
- Achieving true embodied intelligence poses unprecedented technical challenges, requiring advanced algorithms and a deep understanding of physical simulation, robot control, and perception fusion [2][4].
- MuJoCo (Multi-Joint dynamics with Contact) is highlighted as a critical technology in this domain: a high-fidelity simulation engine that bridges the virtual and real worlds [4][6].

Group 3: MuJoCo's Role
- MuJoCo lets researchers create realistic virtual robots and environments, enabling millions of trials and learning experiences without risking expensive hardware (a minimal usage sketch follows this summary) [6].
- Simulation can run hundreds of times faster than real time, significantly accelerating the learning process [6].
- MuJoCo has become a standard tool in academia and industry, with major companies using it for robotics research [7].

Group 4: Practical Training
- A comprehensive MuJoCo development course has been built around practical applications and the theoretical foundations of embodied intelligence [8][9].
- The course is organized into six modules, each with specific learning objectives and practical projects, ensuring a solid grasp of the technology [10][12].
- Projects range from basic robotic-arm control to complex multi-agent systems, providing hands-on experience with real-world applications [14][21].

Group 5: Target Audience and Outcomes
- The course is designed for people with programming or algorithm backgrounds entering embodied robotics, as well as students and professionals seeking to strengthen their practical skills [27][28].
- Upon completion, participants will have a complete embodied-intelligence skill set, including proficiency in MuJoCo, reinforcement learning, and real-world application of simulation techniques [27][28].
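To make MuJoCo's "simulation engine" role concrete, here is a minimal sketch using the official `mujoco` Python bindings. The free-falling box model is purely illustrative (not course material): load a model from XML, step the physics, and read back state.

```python
import mujoco

XML = """
<mujoco>
  <worldbody>
    <body name="box" pos="0 0 1">
      <freejoint/>
      <geom type="box" size=".1 .1 .1" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# Simulate one second; each mj_step advances by model.opt.timestep.
while data.time < 1.0:
    mujoco.mj_step(model, data)

# For a free joint, qpos = [x, y, z, qw, qx, qy, qz]; qpos[2] is height.
print(f"t={data.time:.3f}s, box height={data.qpos[2]:.3f} m")
```

Because stepping is a plain function call with no rendering in the loop, such simulations can be batched and run far faster than real time, which is what makes the million-trial training regimes described above practical.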
HKUST | End-to-End LiDAR-Based Omnidirectional Obstacle Avoidance for Quadruped Robots (Unitree G1/Go2 + PPO)
具身智能之心· 2025-06-29 09:51
Core Viewpoint
- The article discusses the Omni-Perception framework developed by a team from the Hong Kong University of Science and Technology, which enables quadruped robots to navigate complex dynamic environments by directly processing raw LiDAR point-cloud data for omnidirectional obstacle avoidance [2][4].

Group 1: Omni-Perception Framework Overview
- The framework consists of three main modules: the PD-RiskNet perception network, a high-fidelity LiDAR simulation tool, and a risk-aware reinforcement-learning policy [4].
- The system takes raw LiDAR point clouds as input, extracts environmental risk features with PD-RiskNet, and outputs joint control signals, forming a complete closed control loop [5].

Group 2: Advantages of the Framework
- Using spatiotemporal information directly avoids the losses of point-cloud-to-grid/map conversion, preserving the precise geometric relationships in the raw data [7].
- Reinforcement learning provides dynamic adaptability, allowing the robot to optimize avoidance strategies for previously unseen obstacle shapes [7].
- Computational efficiency improves by cutting intermediate processing steps relative to traditional SLAM-plus-planning pipelines [7].

Group 3: PD-RiskNet Architecture
- PD-RiskNet employs a hierarchical risk-perception network that processes near-field and far-field point clouds differently to capture local and global environmental features [8].
- Near-field processing uses farthest point sampling (FPS) to reduce point density while retaining key geometric features (a sketch of FPS follows this summary), and gated recurrent units (GRUs) to capture local dynamic changes [8].
- Far-field processing uses average down-sampling to suppress noise and extract spatiotemporal features from the distant environment [8].

Group 4: Reinforcement Learning Strategy
- Obstacle avoidance is modeled as an infinite-horizon discounted Markov decision process, with a state space that includes the robot's kinematic information and a history of LiDAR point-cloud sequences [10].
- The action space outputs target joint positions directly, so the policy learns the mapping from raw sensor input to control signals without complex inverse kinematics [11].
- The reward function combines obstacle-avoidance and distance-maximization terms to encourage the robot to seek open paths, while penalizing deviations from target speeds [13][14].

Group 5: Simulation and Real-World Testing
- The framework was validated against real LiDAR data collected with the Unitree G1 robot, showing high consistency between simulated and real point clouds in distribution and structural integrity [21].
- The simulation tool showed significant rendering-efficiency advantages: rendering time grows linearly with the number of environments, whereas traditional methods grow exponentially [22].
- Across tests, the framework achieved a 100% success rate in static-obstacle scenarios and outperformed traditional methods in dynamic environments [26][27].
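Here is a minimal NumPy sketch of farthest point sampling, the near-field down-sampling step PD-RiskNet reportedly uses. The seeding choice and distance metric are assumptions; the paper's exact variant may differ.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Select k points that maximize mutual spacing. points: (N, 3) -> (k, 3)."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    # Distance from every point to its nearest already-selected point.
    dist = np.full(n, np.inf)
    selected[0] = 0  # arbitrary seed; a random seed also works
    for i in range(1, k):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = int(np.argmax(dist))  # farthest from all chosen so far
    return points[selected]

cloud = np.random.rand(1024, 3)  # stand-in for a near-field LiDAR scan
sampled = farthest_point_sampling(cloud, 64)
print(sampled.shape)  # (64, 3)
```

Unlike uniform random down-sampling, FPS keeps points spread across the whole scan, so thin or distant obstacles are less likely to vanish from the reduced cloud.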
The H2 Window for CCF-A/B Conferences Is Narrowing — Is There Still Time to Publish an Embodied AI Paper?
具身智能之心· 2025-06-29 09:51
Core Viewpoint
- The article emphasizes the importance of timely submission of research papers to key conferences, particularly for researchers in autonomous driving and embodied AI, and highlights the challenge of producing high-quality submissions under time constraints [1].

Group 1: Pain Points Addressed
- The program targets students who lack guidance from mentors, have fragmented knowledge, and need a clear understanding of the research process [3][4].
- It aims to help students build research thinking, become familiar with research workflows, and master both classic and cutting-edge algorithms [3].

Group 2: Phases of Guidance
- Topic selection: mentors help students brainstorm ideas or provide direct suggestions based on their needs [5].
- Experiments: mentors guide experimental design, model building, parameter tuning, and validation of an idea's feasibility [7][12].
- Writing: mentors support students in crafting compelling research papers that stand out to reviewers [9][13].

Group 3: Course Structure and Duration
- Total guidance runs 3 to 18 months depending on the target publication's tier, with specific core-guidance and maintenance periods for each category [22][26].
- For CCF-A / SCI Q1 targets, core guidance comprises 9 sessions; for CCF-B / SCI Q2 and CCF-C / SCI Q3, 7 sessions each [22].

Group 4: Additional Support and Resources
- The program includes personalized communication with mentors through dedicated groups for idea discussion and course-related questions [24].
- Students receive comprehensive training in paper-submission methods, literature-review techniques, and experimental-design methodology [23][28].
New Survey from the Institute of Automation, Chinese Academy of Sciences: What VLA Model Post-Training Shares with Human Motor Learning
具身智能之心· 2025-06-29 09:51
Core Viewpoint
- The article examines post-training strategies for Vision-Language-Action (VLA) models through the lens of human motor-skill learning, arguing that robots, like humans practicing a skill, need a post-training phase to adapt to specific tasks and environments [4][5][9].

Summary by Sections
1. Introduction to VLA Models
- VLA models integrate visual perception, language understanding, and action generation, enabling robots to interact effectively with their environment. Their out-of-the-box performance is often insufficient for complex real-world applications, so a post-training phase is needed to refine their capabilities [8][9].
2. Post-Training Strategies
- The article organizes VLA post-training strategies along three dimensions — environment perception, embodiment (body awareness), and task understanding — mirroring the key components of human motor learning and enabling targeted improvement of specific model capabilities [10][12].
3. Environmental Perception Enhancement
- Strategies include improving perception of and adaptation to varied operating environments, using environmental cues to inform actions, and optimizing visual encoding for task-specific scenarios [12][13].
4. Body Awareness and Control
- These strategies build internal models that predict body-state changes, improving the model's control of robot motion through feedback mechanisms inspired by human motor control [14].
5. Task Understanding and Planning
- Decomposing complex tasks into manageable steps, as in human learning, sharpens the model's understanding of task objectives and improves operational planning [14].
6. Multi-Component Integration
- Just as effective human skill acquisition synchronizes multiple learning components, VLA models benefit from integrating strategies across dimensions to optimize overall performance [14].
7. Challenges and Future Trends
- Challenges remain in enabling robots to learn and adapt like humans; key directions include improving kinematic models, optimizing action-output structures, and enhancing human-robot interaction through expert-knowledge integration [16][17][18].
8. Continuous Learning and Generalization
- Current VLA models often struggle to retain previously learned skills; future research should pursue lifelong-learning algorithms and stronger generalization in open environments [22].
9. Safety and Explainability
- The article underscores the importance of safety and explainability in robotic decision-making, advocating research into interpretable AI and safety mechanisms for reliable operation in diverse scenarios [22].
The 具身智能之心 sim2real Discussion Group Is Here!
具身智能之心· 2025-06-28 07:58
The 具身智能之心 sim2real discussion group has launched! We discuss the sim2real and sim2real2sim pipelines commonly used in industry, across robotic arms, dual-arm systems, quadrupeds, humanoids, and other task domains — anyone interested is welcome to join the conversation! Scan the QR code to join. The group is for discussion and sharing only; any advertising will result in blacklisting and removal. If the group is full, add WeChat oooops-life with the note "sim2real" to be invited.
Tsinghua Post-90s PhD Student's Kitchen Robot Raises Tens of Millions, Wins Beijing's First Embodied-AI Food-Service License
具身智能之心· 2025-06-28 07:48
Author丨量子位  Editor丨量子位

A robot that works the kitchen has raised tens of millions of yuan!

享刻智能 has officially announced the completion of a Pre-A funding round worth tens of millions of yuan, with an impressive investor lineup: 世纪长河科技集团 and 启迪之星 co-led the round, with 网龙天映创投, 广华创投, and other institutions following on.

Founder 陈震 is a serial entrepreneur in robotics, holding a bachelor's in computer science from Beihang University and a master's in computer science from Tsinghua; he is currently a PhD student at Tsinghua University's Future Laboratory. In 2020, his visual navigation and localization company 速感科技 was acquired outright by JS Global Lifestyle, the parent company of Joyoung, after which he served as general manager of the Shark Ninja robotics R&D center. Three years on, this veteran founder has set out again, targeting kitchen service robots. Last September, 享刻智能's LAVA robot obtained Beijing's first food-business license for an embodied intelligent robot, making it the country's first "licensed" AI chef.

A thousand units on order, overseas expansion accelerating

The team's LAVA robot can fry a plate of fries or assemble a burger in two minutes, and is slated to learn to make ice cream and mix drinks. Most impressively, it can visually recognize different ingredients and judge cooking times on its own, ...
Data, Algorithms, and the Robot Platform: Beginners Can't Get Around Any of Them
具身智能之心· 2025-06-28 07:48
Hardware: well-funded labs can afford robot platforms costing 200,000-300,000 RMB; students on tight budgets rely on 3D-printing their own robotic arms or buying cost-effective hardware platforms, or even work entirely in simulation, which constrains their research.

Our embodied-AI community has shared extensively across these three modules — data-collection solutions, robot platforms, simulation, and algorithms — and we also recommend several cost-effective robotic-arm platforms to support research.

The community's goal is to become a gathering place of ten thousand members within three years, and we warmly welcome outstanding students to join us (many researchers at the frontier of embodied AI already have)! Together with multiple embodied-AI companies we have built a complete bridge spanning academia, products, and recruiting, and internally our education offering has essentially closed the loop (courses + hardware + Q&A). The community also surfaces many of the latest industry perspectives and technical write-ups. What do today's robot platforms look like, and where do they fall short? How can the success rate and yield of data collection be improved? How can sim2real be made more effective? These are the questions we keep watching.

Getting started in embodied AI comes down to three elements: data + algorithms + the robot platform. Honestly, many students only understand the algorithms — and often only vaguely! Data collection takes real experience: even with teleoperation and retargeting pipelines, many people fail to collect genuinely useful data. The robot platform is even further out of reach for many students, so a cost-effective platform plus simulation is usually the first step.

Data: teleoperated collection depends on the robot platform and is costly. But preprocessing ...