具身智能之心
Recruiting for one-on-one paper mentoring in VLA / reinforcement learning / VLN!
具身智能之心· 2025-08-18 06:00
Core Viewpoint
- The article announces one-on-one mentoring for papers on embodied intelligence, specifically in VLA, reinforcement learning, and sim2real, targeting conferences such as CVPR, ICCV, ECCV, ICLR, CoRL, ICML, and ICRA [1].
Group 1
- The mentoring is aimed at students interested in submitting to major conferences in embodied intelligence [1].
- Three mentoring slots are currently available [1].
- The mentors are active researchers in embodied intelligence with innovative ideas [1].
Group 2
- Interested individuals can inquire by adding a designated WeChat contact or scanning a QR code for consultation [2].
Nearly 2,000 members: this embodied intelligence community has quietly amassed so much......
具身智能之心· 2025-08-18 06:00
Core Insights
- The community "Embodied Intelligence Heart Knowledge Planet" aims to provide a comprehensive platform for technical exchange in embodied intelligence, covering academic research, industry applications, and job opportunities [3][18][19].
Community Development
- The community has organized multiple roundtable discussions on data collection and embodied ontology, with plans to expand into algorithm technologies [1][3].
- It currently has nearly 2,000 members and aims to grow to around 10,000 within the next two years, creating a hub for exchange and technical sharing [1][3].
Technical Resources
- Over 30 technical routes, including benchmarks and learning paths, have been compiled so members can find information quickly [4].
- Resources include open-source projects, datasets, and simulation platforms for embodied intelligence, serving both beginners and advanced researchers [18][32][38].
Job Opportunities
- A job referral mechanism with several leading companies in the field gives members timely access to openings [10][19].
- Members can receive recommendations for positions related to embodied intelligence, connecting them with potential employers [19].
Educational Support
- Tailored learning paths are provided for newcomers, along with industry frameworks and project proposals for those already engaged in research [13][15].
- Regular live sessions and forums cover the latest developments in the embodied intelligence industry, keeping members current on emerging trends and challenges [4][74].
Networking and Collaboration
- Members are encouraged to ask questions and share insights on topics including career choices and research directions [77].
- Contributions from industry leaders and experts enhance the learning experience and give members direct access to frontline knowledge [4][18].
VLA+RL or pure RL? Tracing the development of reinforcement learning through 200+ papers
具身智能之心· 2025-08-18 00:07
Core Insights
- The article provides a comprehensive analysis of the intersection of reinforcement learning (RL) and visual intelligence, tracing the evolution of strategies and key research themes in visual reinforcement learning [5][17][25].
Group 1: Key Themes in Visual Reinforcement Learning
- Over 200 representative studies are organized into four pillars: multimodal large language models, visual generation, unified model frameworks, and vision-language-action models [5][17].
- Each pillar is examined for algorithm design, reward engineering, and benchmark progress, highlighting trends and open challenges in the field [5][17][25].
Group 2: Reinforcement Learning Techniques
- Techniques discussed include Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), used to improve training stability and efficiency [15][16].
- Reward models, including those based on human feedback and verifiable rewards, play a central role in guiding the training of visual RL agents [10][12][21].
Group 3: Applications in Visual and Video Reasoning
- Applications of RL span visual reasoning tasks including 2D and 3D perception, image reasoning, and video reasoning, showing how these methods improve task performance [18][19][20].
- Specific studies use RL to strengthen capabilities in complex visual tasks such as object detection and spatial reasoning [18][19][20].
Group 4: Evaluation Metrics and Benchmarks
- New evaluation metrics tailored to large-model visual RL are needed, combining traditional metrics with preference-based assessments [31][35].
- The article surveys benchmarks supporting training and evaluation in the visual domain, emphasizing the role of human preference data in shaping reward models [40][41].
Group 5: Future Directions and Challenges
- Key challenges include balancing depth and efficiency in reasoning processes, with future research directions suggested to address them [43][44].
- Adaptive strategies and hierarchical reinforcement learning are highlighted as routes to stronger vision-language-action agents [43][44].
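GRPO's appeal over PPO, as surveyed above, is that it scores each sampled response relative to its group instead of training a separate value network. A minimal sketch of the group-relative advantage computation (illustrative only, not any specific paper's implementation):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its group's mean and spread,
    removing the need for a learned value (critic) network."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate answers to one prompt, scored by a reward model;
# above-average answers get positive advantages, below-average negative.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

In full GRPO these advantages then weight the same clipped policy-ratio objective PPO uses; only the critic is gone.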
Tutorials on VLA, VLA + tactile sensing, VLA + RL, embodied world models, and more are here!
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focused on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].
Industry Analysis
- Over the past two years, numerous star teams have emerged in embodied intelligence, producing highly valued companies such as Xinghaitu, Galaxy General, and Zhujidongli that are advancing the technology [3].
- Major domestic companies including Huawei, JD.com, Tencent, Ant Group, and Xiaomi are investing and collaborating to build a robust ecosystem, while international players such as Tesla and investment firms back companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].
Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks for lack of context modeling [6].
  - The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by Vision-Language-Action (VLA) models that integrate visual perception, language understanding, and action generation [7].
  - The fourth stage, starting in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].
Product and Market Development
- These technologies have yielded products such as humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems skills is growing, requiring training on platforms such as MuJoCo, IsaacGym, and PyBullet for policy training and simulation testing [23].
Educational Initiatives
- A comprehensive curriculum covers the full "brain + cerebellum" technology route of embodied intelligence, including practical applications and advanced topics, aimed at both beginners and those deepening their knowledge [10][20].
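The second-stage behavior cloning described above is, at its core, supervised regression from observations to expert actions. A toy sketch with a linear policy and a single demonstration (the shapes and values are invented for illustration):

```python
def behavior_cloning_step(weights, obs, expert_action, lr=0.1):
    """One supervised update: move a linear policy's predicted action
    toward the expert's demonstrated action (MSE gradient step)."""
    pred = sum(w * x for w, x in zip(weights, obs))
    err = pred - expert_action
    return [w - lr * err * x for w, x in zip(weights, obs)]

# Fit to a single (observation -> expert action) demonstration.
w = [0.0, 0.0]
for _ in range(200):
    w = behavior_cloning_step(w, obs=[1.0, 2.0], expert_action=3.0)
cloned_action = w[0] * 1.0 + w[1] * 2.0  # converges toward the expert's 3.0
```

The weakness named in the article follows directly: the policy only imitates states the expert visited, so it has no signal for recovering in unseen or multi-target situations.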
The 具身智能之心 dexterous hand and tactile perception discussion group is here!
具身智能之心· 2025-08-18 00:07
Group 1
- A community focused on dexterous hand and tactile perception technology has been established, inviting anyone working on control, algorithms, hardware, or VTLA for dexterous hands to join [1].
- The community aims to discuss industry and academic developments as well as engineering implementation [1].
Group 2
- To join, add the assistant on WeChat and mention "dexterous hand" along with your nickname [2].
The NeurIPS 2025 MARS Multi-Agent Embodied Intelligence Challenge officially launches!
具身智能之心· 2025-08-18 00:07
Core Insights
- The article discusses challenges and advances in multi-agent embodied intelligence, emphasizing the need for efficient collaboration among robotic systems on complex real-world tasks [3][4].
Group 1: Challenges in Embodied Intelligence
- Single agents are insufficient for complex, dynamic task scenarios, necessitating high-level collaboration among multiple embodied agents [3].
- The MARS Challenge addresses these challenges by inviting researchers worldwide to explore both high-level planning and low-level control in multi-agent systems [4].
Group 2: MARS Challenge Overview
- The challenge features two complementary tracks, planning and control, to evaluate the capabilities of intelligent agents on complex tasks [4][12].
- Results and awards will be announced at the NeurIPS 2025 SpaVLE Workshop [4].
Group 3: Track 1 - Multi-Agent Embodied Planning
- Track 1 focuses on high-level task planning and role assignment for heterogeneous robots, using the ManiSkill platform and the RoboCasa dataset [5][6].
- Participants use vision-language models to select appropriate robot combinations and produce high-level action sequences from natural language instructions [5][8].
Group 4: Track 2 - Multi-Agent Control Strategy Execution
- Track 2 emphasizes cooperative execution of complex tasks, requiring real-time interaction with dynamic environments [12].
- The RoboFactory simulation environment is used to develop and evaluate cooperative strategies, with participants designing deployable control models [12][13].
Group 5: Timeline and Participation
- A warm-up round starts on August 18, 2025; the official competition runs from September 1 to October 31, 2025 [25].
- Participants from robotics, computer vision, natural language processing, and related fields are encouraged to join and showcase their creativity and technology [26].
The diffusion world model LaDi-WM greatly improves robot manipulation success rates and cross-scene generalization
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The article presents LaDi-WM (Latent Diffusion-based World Model), a world model that improves robotic manipulation through predictive strategies, addressing the challenge of accurately predicting future states in robot-object interactions [1][5][28].
Group 1: LaDi-WM Overview
- LaDi-WM uses pre-trained vision foundation models to build latent representations capturing both geometric and semantic features, aiding policy learning and cross-task generalization [1][5][10].
- The framework has two main phases, world model learning and policy learning, which iteratively optimize action outputs based on predicted future states [9][12].
Group 2: Methodology
- World model learning extracts geometric representations with DINOv2 and semantic representations with SigLIP, then applies an interactive diffusion process to improve dynamic prediction accuracy [10][12].
- Policy training takes the world model's future predictions as additional inputs, guiding the model to better action predictions and reducing output distribution entropy over iterations [12][22].
Group 3: Experimental Results
- In virtual experiments on the LIBERO-LONG benchmark, LaDi-WM reached a 68.7% success rate with only 10 training trajectories, outperforming prior methods by a significant margin [15][16].
- On the CALVIN D-D benchmark, it completed task chains with an average length of 3.63, indicating robust long-horizon capability [17][21].
- Real-world experiments showed a 20% increase in success rates on tasks such as stacking bowls and operating drawers, validating LaDi-WM in practical scenarios [25][26].
Group 4: Scalability and Generalization
- Increasing the world model's training data reduced prediction errors and improved policy performance [18][22].
- The world model generalizes across environments, guiding policy learning to better performance than models trained only in the target environment [20][21].
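The iterative refinement described in Group 2 alternates imagination and re-planning: the world model rolls the current latent forward under a candidate action, and the policy re-predicts conditioned on that imagined future. The interfaces below are assumptions for illustration, not LaDi-WM's actual API:

```python
def refine_action(policy, world_model, obs_latent, action, n_iters=3):
    """Alternate imagination and re-planning: predict the future latent
    the candidate action would produce, then let the policy re-decide
    conditioned on both the current and the imagined future latent."""
    for _ in range(n_iters):
        future_latent = world_model(obs_latent, action)
        action = policy(obs_latent, future_latent)
    return action

# Toy 1-D stand-ins: additive "dynamics" and a policy that damps the future.
toy_world_model = lambda z, a: z + a
toy_policy = lambda z, f: 0.5 * (1.0 - f)
refined = refine_action(toy_policy, toy_world_model, obs_latent=0.0, action=0.0)
```

The entropy reduction the article reports is consistent with this loop: each pass conditions the policy on a more informed prediction, narrowing its output distribution.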
Sun Yat-sen & Tsinghua: A survey of embodied intelligence systems based on large models
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- The article provides a comprehensive overview of embodied intelligence systems based on large models, highlighting their applications, challenges, and future directions in domains such as home services, healthcare, education, and industry [6][39].
Summary by Sections
Perception and Understanding
- Embodied intelligence systems use sensors such as cameras and microphones to receive raw data and interpret it into environmental awareness. Large models excel at processing multimodal input, integrating text, images, and audio to capture relationships and extract high-dimensional features for understanding the world [5][6].
- Multimodal models such as GPT-4V encode images and text into a shared vector space, improving perception and comprehension of user instructions [9].
Control Levels
- Control is categorized into demand, task, planning, and action levels, each with representative works demonstrating the application of large models [6][11].
System Architecture
- Architectures include end-to-end Transformer designs and combinations of frozen-parameter large models with foundation models, allowing flexible optimization without sacrificing generalization [21][29].
Data Sources
- Training data comes from simulators, imitation learning, and video learning, with simulators providing a controlled environment for rapid data collection and testing [31][32].
Challenges
- Key challenges include scarce real-world data, slow inference, and the need for multi-agent collaboration in complex tasks [39][40].
Future Development Directions
- Future work involves improving data collection, optimizing large models for faster inference, enhancing multi-agent collaboration, and expanding applications across fields [41][44].
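The shared vector space mentioned under Perception and Understanding works by having separate encoders embed images and text into one space where cosine similarity measures cross-modal relatedness. A toy sketch with made-up 3-D embeddings (the vectors and captions are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Pretend encoders have already mapped one image and two captions into
# the same 3-D space; the nearest caption is the model's reading of the image.
image_vec = [0.9, 0.1, 0.0]
captions = {"a red apple": [1.0, 0.0, 0.1], "an open door": [0.0, 1.0, 0.9]}
best_caption = max(captions, key=lambda c: cosine(image_vec, captions[c]))
```

Real systems learn the two encoders jointly so that matching image-text pairs land close together; the retrieval step itself is just this nearest-neighbor comparison.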
Still can't get your foot in the embodied AI door? Others here have already pulled ahead on the curve......
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- The article emphasizes the value of a community that provides solutions to problems in embodied intelligence, facilitating knowledge sharing and job opportunities for its members [3][17].
Group 1: Community and Support
- The Embodied Intelligence Knowledge Planet forms a closed loop spanning industry, academia, job seeking, and Q&A exchange [3][17].
- Members share solutions to problems encountered in their work, such as data collection and model deployment [3][4].
- Resources include over 30 technical routes, open-source projects, and job postings from leading companies in the field [4][11][31].
Group 2: Educational Resources
- Learning paths and technical stacks are compiled for beginners, along with industry frameworks and project plans for those already engaged in research [12][14].
- Topics covered include robot simulation, data collection platforms, and the challenges of deploying VLA (Vision-Language-Action) models [4][9].
- Members can access a range of academic papers, industry reports, and books on robotics and embodied intelligence [24][27][29].
Group 3: Networking and Job Opportunities
- A job referral mechanism with multiple companies in the embodied intelligence sector connects job seekers and employers [11][18].
- Members are encouraged to engage with industry leaders through forums and live discussions, growing their professional networks [18][77].
- The community supports discussion of career choices and research directions, ensuring members receive timely advice and insights [79].
ICCV 2025 | HERMES: The first world model to unify 3D scene understanding and generation
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- HERMES presents a unified framework for autonomous driving that integrates understanding and generation tasks, addressing the challenge of accurately predicting future scenarios while comprehensively understanding the current environment [6][10][26].
Group 1: Introduction to HERMES
- HERMES enhances the capabilities of autonomous vehicles by combining deep environmental understanding with accurate future scene prediction [6][9].
- The framework overcomes the traditional separation of understanding and generation tasks in existing models, which limits their effectiveness in real-world driving [7][10].
Group 2: Methodology of HERMES
- HERMES pairs a Driving World Model (DWM) for future scene generation with a Large Language Model (LLM) for scene understanding, creating synergy between the two [12][14].
- A Bird's-Eye View (BEV) representation encodes high-resolution images efficiently while preserving spatial relationships and semantic detail [15].
- A World Queries mechanism bridges understanding and generation, letting the model apply contextual knowledge to better predictions [16].
Group 3: Training and Optimization
- HERMES is trained by jointly optimizing a language modeling loss and a point cloud generation loss, balancing performance across tasks [18][20].
- The end-to-end training approach yields high accuracy in both understanding and generating future scenarios [20].
Group 4: Experimental Results
- HERMES outperforms existing models on both scene understanding and future generation, reducing future point cloud error by 32.4% relative to comparable models [22].
- It improves natural language generation metrics, with an 8% gain in CIDEr over dedicated understanding models [22].
Group 5: Future Outlook
- HERMES lays a foundation for exploring more complex perception tasks, moving toward a general driving model with comprehensive physical-world understanding [26][27].
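The joint optimization in Group 3 amounts to combining the two task losses into one training signal; a minimal sketch (the equal weighting is an assumption for illustration, not HERMES's published scheme):

```python
def joint_loss(lm_loss, pc_loss, alpha=0.5):
    """Blend the understanding objective (language modeling loss) with
    the generation objective (future point cloud loss) into one signal,
    so gradients from both tasks shape the shared backbone."""
    return alpha * lm_loss + (1.0 - alpha) * pc_loss

total = joint_loss(lm_loss=2.0, pc_loss=4.0)
```

Choosing alpha (or a schedule for it) is how such frameworks keep one task from dominating; the article's claim of balanced performance implies some such weighting is tuned.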