Galaxea Team Release: Large-Scale, High-Quality Open-World Dataset and the G0 Dual-System VLA Model
具身智能之心· 2025-09-04 01:04
Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale and diverse collection of robot behaviors recorded in real human living and working environments, addressing the scarcity of high-quality open-world robot data and the insufficient generalization of existing models [3][5][6].

Dataset Overview
- The dataset consists of 500 hours of data and 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 operational skills, with detailed sub-task instructions labeled at 2 Hz [8][12].
- Data was collected using the Galaxea R1 Lite mobile dual-arm robot, which has 23 degrees of freedom and is equipped with RGB cameras for global scene perception and fine operation sensing [5][6].

Data Diversity and Coverage
- The dataset includes data from 11 physical sites across 50 unique scenarios, covering residential, retail, dining, and office environments, thus avoiding the limitations of existing datasets confined to controlled laboratory settings [6][12].
- The distribution of tasks shows a balance between basic actions and specialized skills, with residential scenes making up 50.8% and office scenes 33.2% of the dataset [11][12].

G0 Dual-System Framework
- The G0 framework couples a "slow thinking" visual-language model (G0-VLM) with a "fast execution" visual-language-action model (G0-VLA), employing a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The training phases include cross-entity pre-training, single-entity pre-training, and task-specific fine-tuning, which significantly enhance the model's performance [21][30].

Model Performance Evaluation
- The G0-VLA model demonstrated superior performance in benchmark tasks such as desktop organization and microwave operation, with G0-Full achieving the highest average task progress scores [39][47].
- The study found that single-entity pre-training is essential for effective model adaptation, as cross-entity pre-training can lead to negative transfer due to significant differences between the training and target robot entities [39][46].

Key Findings
- The G0-VLM model outperformed mainstream visual-language models in instruction accuracy, achieving 83.3% in desktop organization and 78.2% in bed-making tasks, highlighting the importance of domain-specific fine-tuning [42][47].
- The dataset's design and the dual-system framework effectively address the challenges of real-world robot task execution, providing a robust foundation for future advancements in embodied intelligence [17][19].
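To make the numbers above concrete, here is a minimal sketch of what a single demonstration record with 2 Hz sub-task labels could look like in code. The field names and types are illustrative assumptions made for this note, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubTaskAnnotation:
    """One fine-grained instruction label; the summary states these come at 2 Hz."""
    start_s: float      # annotation window start, seconds from trajectory start
    end_s: float        # annotation window end
    instruction: str    # e.g. "pick up the mug with the left gripper" (illustrative)

@dataclass
class Demonstration:
    """One of the ~100,000 demonstration trajectories (field names are assumptions)."""
    task_category: str                  # one of ~150 task categories
    scene_type: str                     # residential / retail / dining / office
    skill: str                          # one of ~58 operational skills
    joint_states: List[List[float]]     # per-step readings for the 23-DoF R1 Lite
    rgb_frames: List[bytes]             # global and wrist RGB camera images
    sub_tasks: List[SubTaskAnnotation] = field(default_factory=list)

# Example: a 10-second clip annotated every 0.5 s (2 Hz) would carry ~20 sub-task labels.
```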
1v1 Paper Tutoring for VLA Topics Is Here, with Coaching Through to Acceptance
具身智能之心· 2025-09-03 10:00
具身智能之心's 1v1 paper tutoring is here! Five slots are currently open for VLA-related topics, mainly targeting A- and B-tier conferences and Q1/Q2 journals, with coaching through to acceptance. Target venues: CVPR, ICCV, ECCV, ICLR, CoRL, ICML, ICRA, RSS, and others. Interested students can add WeChat oooops-life for a consultation, or scan the QR code directly and note "embodied paper tutoring inquiry". Tutors: all are active researchers in embodied intelligence, each with more than 10 top-conference publications and concrete research ideas. ...
Galaxea Team Release: Large-Scale, High-Quality Open-World Robot Dataset and the G0 Dual-System VLA Model
具身智能之心· 2025-09-03 03:23
Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale and diverse collection of robot behaviors recorded in real human living and working environments, addressing the scarcity of high-quality open-world robot data and the insufficient generalization capabilities of existing models [2][5][6].

Dataset Overview
- The Galaxea Open-World Dataset is the first large-scale robot behavior dataset collected in real-life scenarios, addressing the issues of existing datasets that are limited to controlled environments and inconsistent robot embodiments [5][17].
- Data collection was conducted using the Galaxea R1 Lite mobile dual-arm robot, which features 23 degrees of freedom and is equipped with RGB cameras for global scene perception and fine operation sensing [8][6].
- The dataset includes 500 hours of data and 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 operational skills, with detailed sub-task instructions labeled at 2 Hz [8][12].

Model Framework
- The G0 dual-system framework couples a "slow thinking" visual-language model (G0-VLM) with a "fast execution" visual-language-action model (G0-VLA), using a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The training phases include cross-entity pre-training, single-entity pre-training, and task-specific fine-tuning, which are designed to balance general knowledge with adaptation to the specific robot [21][27].

Performance Evaluation
- The G0-VLA model demonstrated superior performance in benchmark tasks such as desktop organization, microwave operation, bed making, and block building, with G0-VLM achieving an instruction accuracy of 78.2% in bed making and 83.3% in desktop organization [42][47].
- The study found that single-entity pre-training is essential for effective model performance, as cross-entity pre-training can lead to negative transfer due to significant differences between the training and target robot entities [39][46].

Key Findings
- The dataset's design emphasizes real-world adaptability and model-training friendliness, ensuring that the collected data reflects the complexities of human environments [6][17].
- The G0 model's architecture is inspired by Kahneman's dual-system theory, where System 2 (slow thinking) is responsible for planning and System 1 (fast execution) handles real-time reactions, allowing a balance between planning rationality and execution timeliness [19][21].
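The slow/fast split described above is essentially a two-rate control loop. The sketch below shows one plausible wiring under stated assumptions: `vlm_infer`, `vla_infer`, `get_observation`, and `send_action` are hypothetical callables, and the loop rates and single-slot queue are illustrative choices, not details taken from the paper.

```python
import queue
import time

plan_queue = queue.Queue(maxsize=1)  # holds the most recent sub-task instruction

def slow_planner(vlm_infer, get_observation, period_s=1.0):
    """System 2: re-plan at a low rate and publish the latest sub-task instruction."""
    while True:
        sub_task = vlm_infer(get_observation())   # e.g. "place the cup on the tray"
        try:
            if plan_queue.full():
                plan_queue.get_nowait()           # drop the stale plan
        except queue.Empty:
            pass
        plan_queue.put(sub_task)
        time.sleep(period_s)

def fast_controller(vla_infer, get_observation, send_action, period_s=0.05):
    """System 1: react at a high rate, conditioned on the most recent plan."""
    current_plan = "idle"
    while True:
        try:
            current_plan = plan_queue.get_nowait()  # pick up a new plan if one arrived
        except queue.Empty:
            pass
        send_action(vla_infer(get_observation(), current_plan))
        time.sleep(period_s)

# In practice the two loops would run in separate threads or processes, e.g.
# threading.Thread(target=slow_planner, args=(vlm_infer, get_observation), daemon=True).start()
```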
MemoryVLA: Giving Robots a Hippocampus for Long-Horizon Manipulation Tasks
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article discusses the development of MemoryVLA, a cognitive-memory-action framework inspired by human memory systems, aimed at improving robotic manipulation tasks that require long-term temporal dependencies [3][7].

Group 1: Current Issues in VLA Models
- Existing Vision-Language-Action (VLA) models primarily rely on the current observation, leading to poor performance on long-horizon, temporally dependent tasks [2][7].
- Cognitive science indicates that humans rely on a memory system, involving transient neural activity and the hippocampus, to manage tasks that unfold over time [7].

Group 2: MemoryVLA Framework
- MemoryVLA is designed to give robots a memory system, drawing inspiration from human cognitive mechanisms [3][7].
- The framework includes a pre-trained Vision-Language Model (VLM) that encodes observations into perceptual and cognitive tokens, which are stored in a Perceptual-Cognitive Memory Bank [3].
- Working memory retrieves relevant entries from the memory bank and merges them with the current tokens to adaptively update the memory [3].

Group 3: Importance of Memory in Robotics
- The article emphasizes the necessity of memory in robotic tasks, explaining that it improves decision-making and action sequencing in complex environments [3][7].
- A memory-conditioned diffusion action expert uses the tokens to generate action sequences with temporal awareness [3].
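The retrieve-and-merge pattern described in Group 2 can be illustrated with a minimal sketch. Cosine-similarity retrieval, the fixed capacity, and the scalar gate below are assumptions standing in for MemoryVLA's learned components, not its actual design.

```python
import numpy as np

class MemoryBank:
    """Stores past perceptual/cognitive tokens and returns the most relevant ones."""

    def __init__(self, dim=512, capacity=256):
        self.entries = np.empty((0, dim), dtype=np.float32)
        self.capacity = capacity

    def write(self, token: np.ndarray) -> None:
        # Append the newest token and keep only the most recent `capacity` entries.
        self.entries = np.vstack([self.entries, token[None, :]])[-self.capacity:]

    def retrieve(self, query: np.ndarray, k=8) -> np.ndarray:
        """Cosine-similarity top-k retrieval (an assumed, common choice)."""
        if len(self.entries) == 0:
            return np.zeros((0, query.shape[-1]), dtype=np.float32)
        sims = self.entries @ query / (
            np.linalg.norm(self.entries, axis=1) * np.linalg.norm(query) + 1e-8
        )
        top = np.argsort(sims)[::-1][:k]
        return self.entries[top]

def merge_with_working_memory(current: np.ndarray, retrieved: np.ndarray,
                              gate: float = 0.5) -> np.ndarray:
    """Blend current tokens with retrieved history; a learned gate would replace `gate`."""
    if len(retrieved) == 0:
        return current
    return gate * current + (1.0 - gate) * retrieved.mean(axis=0)
```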
Now Hiring | Langyi Robotics Launches Its 2026 Global Campus Recruitment!
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article highlights the advancements and market position of Langyi Robotics, a company specializing in embodied-intelligence and spatial-intelligence solutions, emphasizing its innovative navigation technology and significant market share in the humanoid robot sector [2][3].

Company Overview
- Langyi Robotics focuses on developing next-generation embodied-intelligence and spatial-intelligence solutions, aiming to push the boundaries of robot perception and navigation technology [2].
- The company has launched the world's first embodied perception and navigation module, enabling humanoid robots to achieve fully autonomous movement, obstacle avoidance, advanced spatial reasoning, and generalized environmental interaction [2].

Market Position
- The company holds a market share of 80% and has served numerous leading humanoid robot manufacturers [3].
- Research and development accounts for 85% of the company's operations, indicating a strong focus on innovation and technology advancement [3].

Team and Expertise
- The core team members come from universities such as Huazhong University of Science and Technology, Zhejiang University, and the University of Electronic Science and Technology, with over ten years of experience in core algorithms for spatial intelligence [4].
- The company has secured tens of millions in investment from several leading institutions, including Inno Angel Fund, Jiada Capital, and Qiji Chuangtan [4].

Recruitment and Opportunities
- Langyi Robotics is actively recruiting for full-time and internship positions, targeting 2026 graduates and current students in relevant fields [9].
- The company offers competitive compensation packages, including fixed salaries, performance bonuses, and equity incentives for core talent [5].
- Opportunities for professional growth include a mentorship system with industry experts and participation in core business projects [5].
Just Enrolled, and My Advisor Asked Me to Research Embodied Intelligence from Scratch......
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article discusses advancements in embodied-intelligence algorithms and their market potential, emphasizing the need for further research and development in this field [1][2].

Group 1: Technological Advancements
- Embodied algorithms have improved global perception capabilities, transitioning from traditional pipeline solutions to end-to-end models [1].
- Navigation tasks have shifted from mapping-and-localization pipelines to target-driven navigation using large-model-based, map-free solutions [2].

Group 2: Market Potential and Challenges
- The market size and capacity for embodied intelligence are larger than in other fields, but many unresolved issues remain and require collective effort [2].
- Because the field is so young, newcomers lack systematic approaches and clear learning pathways [2].

Group 3: Educational Initiatives
- The company has developed several courses in the embodied-intelligence field to address the lack of structure and guidance for learners [2].
- A community has been established to facilitate learning and collaboration among people interested in embodied intelligence [2].

Group 4: Course and Community Highlights
- The learning path is systematized to help users avoid common pitfalls and get started quickly [6].
- The program includes a variety of practical robotics projects, combining simulation with real-machine practice [6].
- Live interaction with industry experts and researchers is available, along with permanent access to recorded sessions and shared source code [7].
Did Scaling Laws Originate in 1993? OpenAI's President: The Foundations of Deep Learning Have Been Revealed
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article discusses the historical development and significance of the Scaling Law in artificial intelligence, emphasizing its foundational role in relating model performance to computational resources [2][34][43].

Group 1: Historical Context
- The Scaling Law's origins are debated, with claims that it was first proposed by OpenAI in 2020 or discovered by Baidu in 2017 [2].
- Recent discussions attribute the earliest exploration of the Scaling Law to Bell Labs, dating back to 1993 [3][5].
- The Bell Labs paper demonstrated the relationship between model size, dataset size, and classifier performance, highlighting the long-standing nature of these findings [5][9].

Group 2: Key Findings of the Research
- The NeurIPS paper from Bell Labs outlines a method for efficiently predicting classifier suitability, which is crucial for resource allocation in AI model training [12].
- The authors established that as training data increases, the model's error rate follows a predictable logarithmic pattern, reinforcing the Scaling Law's validity [12][16].
- The research indicates that after training on 12,000 patterns, new networks significantly outperform older ones, showcasing the benefits of scaling [16].

Group 3: Contributions of Authors
- The paper features five notable authors, including Corinna Cortes and Vladimir Vapnik, both of whom have made significant contributions to machine learning and statistical theory [18][19][27].
- Corinna Cortes has over 100,000 citations and is recognized for her work on support vector machines and the MNIST dataset [21][22].
- Vladimir Vapnik, with over 335,000 citations, is known for his foundational work in statistical learning theory [27].

Group 4: Broader Implications
- The article suggests that the Scaling Law is not a sudden insight but rather the cumulative result of interdisciplinary research spanning decades, from psychology to neural networks [34][43].
- The evolution of the Scaling Law reflects a broader scientific journey, with contributions from various fields and researchers, ultimately leading to its current understanding in deep learning [43].
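For readers unfamiliar with the shape of such curves, the block below shows the generic functional form that modern scaling-law fits typically use. The symbols and the exact form are illustrative only and are not taken from the 1993 paper.

```latex
% Illustrative only: a generic error-vs-data scaling form, not the expression
% reported in the 1993 Bell Labs paper.
\[
  E(n) \;\approx\; a + b\,n^{-\alpha}, \qquad \alpha > 0,
\]
% so the reducible part of the error falls on a straight line in log-log axes:
\[
  \log\bigl(E(n) - a\bigr) \;\approx\; \log b - \alpha \log n .
\]
```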
XDog: A Low-Cost Embodied Research Platform with a Quadruped Robot Dog + Single Arm (Including VLA / Reinforcement Learning / Simulation / sim2real Tutorials)
具身智能之心· 2025-09-02 02:00
Core Viewpoint
- Xdog is a low-cost, multifunctional quadruped robot dog and robotic arm development platform designed for embodied-AI developers, with a comprehensive curriculum for research and learning in robotics [1][2].

Hardware Overview
- Xdog integrates functionalities such as voice control, sim2real, real2sim, target recognition and tracking, autonomous robotic-arm grasping, and reinforcement-learning gait control, covering most of the technology stack for embodied lower-limb control [2][5].
- The robot dog measures 25 cm x 20 cm x 30 cm and weighs 7.0 kg, with a maximum speed of 7.2 km/h and a maximum rotation speed of 450 degrees per second [3][11].
- The main control chip is the Allwinner H616, featuring a quad-core 1.6 GHz CPU, 4 GB RAM, and 32 GB storage [4][5].
- The robotic arm can reach a maximum height of 0.85 m and has a grasping range of 0.4 m around its base [7].

Software and Functionality
- The system supports several control methods, including voice control via TCP, keyboard control, visual control, and reinforcement learning for autonomous movement [15][17].
- Development is based on ROS1 with Python as the primary programming language, and at least an RTX 2080 Ti-class GPU is recommended for inference [16][24].
- The platform includes a comprehensive curriculum covering topics from basic ROS knowledge to advanced reinforcement-learning principles and practical applications [22][23].

Team and Support
- The project is led by a team of experienced instructors responsible for project advancement, technical support, and course development [22].
- After-sales service is provided for one year after delivery, with video and source-code access granted immediately upon hardware receipt [26].

Delivery and Consultation
- Delivery is completed within three weeks after payment [25].
- For further inquiries, prospective customers can consult the assistant via WeChat [27].
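Since the platform is stated to run ROS1 with Python, a keyboard-control mode would typically look something like the sketch below. The `/cmd_vel` topic name, the speed values, and the key bindings are assumptions for illustration, not XDog's documented interface.

```python
#!/usr/bin/env python
# Minimal keyboard-to-velocity bridge of the kind a "keyboard control" mode implies.
# A real teleop node would put the terminal into raw mode (termios) instead of
# requiring Enter after each key.
import sys
import rospy
from geometry_msgs.msg import Twist

KEY_BINDINGS = {
    "w": (0.5, 0.0),    # forward m/s, yaw rad/s
    "s": (-0.5, 0.0),   # backward
    "a": (0.0, 1.0),    # turn left
    "d": (0.0, -1.0),   # turn right
    "x": (0.0, 0.0),    # stop
}

def main():
    rospy.init_node("xdog_keyboard_teleop")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(10)  # 10 Hz command stream
    linear, angular = 0.0, 0.0
    while not rospy.is_shutdown():
        key = sys.stdin.read(1)
        if key in KEY_BINDINGS:
            linear, angular = KEY_BINDINGS[key]
        msg = Twist()
        msg.linear.x = linear
        msg.angular.z = angular
        pub.publish(msg)
        rate.sleep()

if __name__ == "__main__":
    main()
```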
Still Grinding on End-to-End Models? Embodied-R1 Takes a Different Route: SOTA Performance with "Pointing" + Reinforcement Learning!
具身智能之心· 2025-09-02 00:03
Core Insights
- The article discusses the development of Embodied-R1, a new model designed to bridge the "seeing-to-doing gap" in robotics, a long-standing challenge in the field [2][32].
- The model introduces a novel intermediate representation called "pointing," which translates complex operational instructions into visual points, enhancing the robot's ability to understand and execute tasks [3][10].

Group 1: Challenges in Robotics
- The "seeing-to-doing gap" is primarily caused by data scarcity and morphological heterogeneity, which hinder effective knowledge transfer in robotics [2].
- Existing visual-language-action (VLA) models struggle in new environments and often lose their zero-shot operational capabilities [2][10].

Group 2: Embodied-R1 Model Overview
- Embodied-R1 is a 3-billion-parameter model that uses "pointing" as an intuitive intermediate representation, defining four key capabilities: REG (representational understanding), RRG (spatial region pointing), OFG (functional part pointing), and VTG (visual trajectory generation) [10][12].
- The model demonstrated superior performance across 11 spatial reasoning and pointing benchmarks, achieving a 56.2% success rate in the SIMPLEREnv simulation and 87.5% across eight real-world tasks without fine-tuning [10][27].

Group 3: Training Methodology
- The model employs a two-phase training curriculum, focusing first on spatial reasoning and then on embodied pointing capabilities, using a dataset of 200,000 samples [15][16].
- Reinforcement fine-tuning (RFT) is introduced to address the "multi-solution dilemma" in pointing tasks, allowing the model to develop a generalized understanding rather than memorizing specific answers [17][19].

Group 4: Performance Metrics
- Embodied-R1 outperforms other models across benchmarks, achieving state-of-the-art (SOTA) results on the REG, RRG, OFG, and VTG tasks [29][30].
- The model's trajectory-generation quality is the best among all compared models, which is crucial for reliable robot execution [29].

Group 5: Robustness and Adaptability
- The model exhibits strong robustness to visual disturbances, maintaining performance under challenging conditions such as poor lighting and background changes [31].
- This adaptability is attributed to the "pointing" representation, which strengthens the robustness of the robot's policy [31].

Group 6: Conclusion
- Embodied-R1 marks a significant step toward closing the long-standing "seeing-to-doing gap" in robotics, offering a promising pathway for developing more capable and generalizable embodied AI systems [32].
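One way to picture how RFT sidesteps the multi-solution dilemma is with a region-based reward: any predicted point inside the target region earns full reward, so the policy is not forced to memorize a single canonical answer. The function below is an illustrative toy under that assumption, not the paper's actual reward.

```python
import numpy as np

def pointing_reward(pred_xy, target_mask: np.ndarray) -> float:
    """Return 1.0 if the predicted (x, y) point lands anywhere inside the target region.

    Because every pixel of the region is an acceptable answer, a region-based reward
    rewards a generalized notion of "where", not one memorized coordinate.
    """
    x, y = int(round(pred_xy[0])), int(round(pred_xy[1]))
    h, w = target_mask.shape
    if 0 <= y < h and 0 <= x < w and target_mask[y, x]:
        return 1.0
    return 0.0

# Toy check: a 4x4 mask whose right half is the valid region.
mask = np.zeros((4, 4), dtype=bool)
mask[:, 2:] = True
assert pointing_reward((3, 1), mask) == 1.0   # inside the region
assert pointing_reward((0, 0), mask) == 0.0   # outside the region
```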
Latest from Mu Yao's Team! Discrete Diffusion Brought into VLA, Enabling Precise Action Modeling and Consistency Training
具身智能之心· 2025-09-02 00:03
Core Viewpoint
- The article introduces the Discrete Diffusion VLA model, which integrates discrete diffusion into the Vision-Language-Action (VLA) framework, improving the efficiency and accuracy of robotic action decoding [4][7].

Group 1: Background and Problem Statement
- VLA models enable robots to understand visual and language inputs and execute corresponding action sequences. Current VLA frameworks typically adapt large pre-trained vision-language models (VLMs) by adding an action-generation head [4].
- Existing decoding methods fall into two categories: autoregressive (AR) methods, which generate actions sequentially, and continuous-diffusion methods, which treat action trajectories as continuous signals [4][6].

Group 2: Proposed Solution
- Discrete Diffusion VLA incorporates discrete diffusion into action decoding, using a single Transformer to unify the visual, language, and action modalities without additional training modules [6][12].
- The model employs an "easy first, hard later" adaptive decoding strategy, enabling parallel decoding of actions and error correction, which significantly improves accuracy [12][18].

Group 3: Performance Metrics
- On the LIBERO benchmark with the Franka Panda robotic arm, the model achieved a 96.3% success rate, outperforming traditional AR and continuous-diffusion models [2][12].
- The Google robot reached a 71.2% visual-matching rate, while the WidowX robot achieved a 49.3% overall success rate in real-simulation transfer scenarios, showcasing the model's robustness [2][25].

Group 4: Experimental Results
- Discrete Diffusion VLA consistently outperformed the benchmarks, with an average success rate of 96.3% across tasks, surpassing the closest model, OpenVLA-OFT, by 0.8% [21][22].
- The model's performance on visual matching and variant aggregation was also superior, with an overall average success rate of 64.1% across diverse scenarios [23][24].

Group 5: Ablation Studies
- Ablation studies showed that the adaptive decoding strategy significantly improves performance, with the "max confidence" ordering reaching a 97.4% success rate and outperforming the other strategies [27].
- The temperature-scheduling method used in the model also proved effective, reaching a 97.4% success rate and validating the synergy between temperature adjustment and adaptive decoding [28].
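The "easy first, hard later" adaptive decoding idea can be sketched as iterative unmasking: predict all action tokens in parallel, commit the positions the model is most confident about, and repeat on the rest. The sketch below is a simplified illustration under stated assumptions (no re-masking for error correction and no temperature schedule), not the paper's decoder; `predict_logits` is a hypothetical callable standing in for the unified Transformer.

```python
import numpy as np

def adaptive_unmask_decode(predict_logits, num_tokens, steps=4, mask_id=-1):
    """Illustrative confidence-ordered parallel decoding for discretized action tokens.

    `predict_logits(tokens)` must return an array of shape (num_tokens, vocab)
    given the partially masked token sequence.
    """
    tokens = np.full(num_tokens, mask_id, dtype=np.int64)
    per_step = int(np.ceil(num_tokens / steps))   # how many positions to commit per step
    for _ in range(steps):
        masked = np.where(tokens == mask_id)[0]
        if masked.size == 0:
            break
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs[masked].max(axis=-1)                      # confidence per masked slot
        commit = masked[np.argsort(conf)[::-1][:per_step]]    # most confident slots first
        tokens[commit] = probs[commit].argmax(axis=-1)         # commit their best tokens
    return tokens
```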