Did Scaling Laws Originate in 1993? OpenAI's President: The Roots of Deep Learning Revealed
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article discusses the historical development and significance of the Scaling Law in artificial intelligence, emphasizing its foundational role in relating model performance to computational resources [2][34][43].

Group 1: Historical Context
- The Scaling Law's origins are debated, with claims that it was first proposed by OpenAI in 2020 or discovered by Baidu in 2017 [2].
- Recent discussions trace the earliest exploration of the Scaling Law to Bell Labs, dating back to 1993 [3][5].
- The Bell Labs paper demonstrated the relationship between model size, dataset size, and classifier performance, highlighting how long-standing these findings are [5][9].

Group 2: Key Findings of the Research
- The NeurIPS paper from Bell Labs outlines a method for efficiently predicting classifier suitability, which is crucial for resource allocation in AI model training [12].
- The authors established that as training data increases, the error rate of models follows a predictable logarithmic pattern, reinforcing the Scaling Law's validity [12][16].
- The research indicates that after training on 12,000 patterns, new networks significantly outperform older ones, showcasing the benefits of scaling [16].

Group 3: Contributions of Authors
- The paper features five notable authors, including Corinna Cortes and Vladimir Vapnik, both of whom have made significant contributions to machine learning and statistical theory [18][19][27].
- Corinna Cortes has over 100,000 citations and is recognized for her work on support vector machines and the MNIST dataset [21][22].
- Vladimir Vapnik, with over 335,000 citations, is known for his foundational work in statistical learning theory [27].

Group 4: Broader Implications
- The article suggests that the Scaling Law is not a sudden insight but a cumulative result of interdisciplinary research spanning decades, from psychology to neural networks [34][43].
- The evolution of the Scaling Law reflects a broader scientific journey, with contributions from many fields and researchers, ultimately leading to its current understanding in deep learning [43].
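The "predictable logarithmic pattern" is the classic scaling-law form: test error falls roughly as a power law in training-set size, E(n) ≈ a · n^(−b), a straight line on log-log axes. A minimal sketch of fitting and extrapolating such a curve on synthetic data (the constants are assumptions for illustration, not the Bell Labs numbers):

```python
import numpy as np

# Synthetic error measurements following E(n) = a * n**(-b) plus small noise.
rng = np.random.default_rng(0)
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
true_a, true_b = 2.0, 0.35                       # assumed constants for the demo
err = true_a * n ** (-true_b) * np.exp(rng.normal(0, 0.02, n.size))

# Fit log E = log a - b log n by least squares (linear in log-log space).
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
a_hat, b_hat = np.exp(intercept), -slope

# Extrapolate: predicted error if training were scaled to 64k patterns.
pred_64k = a_hat * 64_000 ** (-b_hat)
print(f"fitted b = {b_hat:.3f}, predicted error at 64k patterns = {pred_64k:.4f}")
```

This is exactly the kind of extrapolation that makes scaling laws useful for resource allocation: a few cheap small-scale runs predict the payoff of a much larger one.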
XDog: A Low-Cost Embodied Research Platform, Quadruped Robot Dog Plus Single Arm (with VLA / Reinforcement Learning / Simulation / sim2real Tutorials)
具身智能之心· 2025-09-02 02:00
Core Viewpoint
- XDog is a low-cost, multifunctional quadruped robotic dog and robotic arm development platform designed for embodied developers, featuring a comprehensive curriculum for research and learning in robotics [1][2].

Hardware Overview
- XDog integrates advanced functionality such as voice control, sim2real, real2sim, target recognition and tracking, autonomous robotic-arm grasping, and reinforcement-learning gait control, covering most of the technology stack for embodied lower-limb control [2][5].
- The robotic dog measures 25 cm x 20 cm x 30 cm and weighs 7.0 kg, with a maximum speed of 7.2 km/h and a maximum rotation speed of 450 degrees per second [3][11].
- The main control chip is the Allwinner H616, featuring a quad-core 1.6 GHz CPU, 4 GB RAM, and 32 GB storage [4][5].
- The robotic arm can reach a maximum height of 0.85 m and has a grasping range of 0.4 m around its base [7].

Software and Functionality
- The system supports several control methods, including voice control via TCP, keyboard control, visual control, and reinforcement learning for autonomous movement [15][17].
- Development is based on ROS1 with Python as the primary programming language; a GPU of at least RTX 2080 Ti class is recommended for inference [16][24].
- The platform includes a comprehensive curriculum covering topics from basic ROS knowledge to advanced reinforcement-learning principles and practical applications [22][23].

Team and Support
- The project is led by a team of experienced instructors responsible for project advancement, technical support, and course development [22].
- After-sales service is provided for one year post-delivery, with video and source-code access granted immediately upon hardware receipt [26].

Delivery and Consultation
- Delivery is completed within three weeks after payment [25].
- For further inquiries, prospective customers can consult the assistant via WeChat [27].
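Since the platform exposes TCP-based control with Python on the ROS1 side, a velocity-command sender can be very small. The sketch below is an illustration only: the JSON field names, port number, and newline framing are assumptions, not XDog's actual protocol; only the speed envelope (7.2 km/h ≈ 2.0 m/s linear, 450°/s ≈ 7.85 rad/s yaw) comes from the specs above.

```python
import json
import socket

XDOG_PORT = 9000  # hypothetical port; the real one is in the platform docs


def make_cmd(vx: float, vy: float, wz: float) -> bytes:
    """Encode a velocity command as a newline-delimited JSON frame.

    Field names and framing are illustrative assumptions. Linear speed is
    clamped to 2.0 m/s (7.2 km/h) and yaw rate to 7.85 rad/s (450 deg/s),
    matching the hardware limits quoted above.
    """
    clamp = lambda v, lim: max(-lim, min(lim, v))
    msg = {
        "vx": clamp(vx, 2.0),
        "vy": clamp(vy, 2.0),
        "wz": clamp(wz, 7.85),
    }
    return (json.dumps(msg) + "\n").encode()


def send_cmd(host: str, cmd: bytes) -> None:
    """Push one command frame to the robot over TCP."""
    with socket.create_connection((host, XDOG_PORT), timeout=1.0) as s:
        s.sendall(cmd)


frame = make_cmd(3.0, 0.0, 0.5)  # 3.0 m/s request gets clamped to 2.0
print(frame)
```

Clamping at the sender keeps out-of-envelope requests from ever reaching the gait controller, which is a cheap safety layer regardless of the real wire format.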
Still Grinding on End-to-End Models? Embodied-R1 Takes a Different Path: SOTA Performance via "Pointing" Plus Reinforcement Learning!
具身智能之心· 2025-09-02 00:03
Core Insights
- The article discusses Embodied-R1, a new model designed to bridge the "seeing-to-doing gap" in robotics, a long-standing challenge in the field [2][32].
- The model introduces a novel intermediate representation called "pointing," which translates complex operational instructions into visual points, enhancing the robot's ability to understand and execute tasks [3][10].

Group 1: Challenges in Robotics
- The "seeing-to-doing gap" is primarily caused by data scarcity and morphological heterogeneity, which hinder effective knowledge transfer in robotics [2].
- Existing vision-language-action (VLA) models struggle in new environments, often losing zero-shot operational capabilities [2][10].

Group 2: Embodied-R1 Model Overview
- Embodied-R1 is a 3-billion-parameter model that uses "pointing" as an intuitive intermediate representation, defining four key capabilities: REG (representational understanding), RRG (spatial region pointing), OFG (functional part pointing), and VTG (visual trajectory generation) [10][12].
- The model demonstrated superior performance on 11 spatial-reasoning and pointing benchmarks, achieving a 56.2% success rate in the SimplerEnv simulation and an impressive 87.5% across eight real-world tasks without fine-tuning [10][27].

Group 3: Training Methodology
- The model follows a two-phase training curriculum, focusing first on spatial reasoning and then on embodied pointing capabilities, using a large dataset of 200,000 samples [15][16].
- Reinforcement fine-tuning (RFT) is introduced to address the "multi-solution dilemma" in pointing tasks, allowing the model to develop a generalized understanding rather than memorizing specific answers [17][19].

Group 4: Performance Metrics
- Embodied-R1 outperforms other models across benchmarks, achieving state-of-the-art (SOTA) results on REG, RRG, OFG, and VTG tasks [29][30].
- The model's trajectory-generation quality is the best among all compared models, which is crucial for reliable robot execution [29].

Group 5: Robustness and Adaptability
- The model is strongly robust to visual disturbances, maintaining performance under challenging conditions such as poor lighting and background changes [31].
- This adaptability is attributed to the "pointing" representation, which improves the robustness of the robot's policy [31].

Group 6: Conclusion
- Embodied-R1 marks a significant advance toward closing the long-standing "seeing-to-doing gap" in robotics, offering a promising pathway to more powerful and generalizable embodied AI systems [32].
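The "multi-solution dilemma" arises because many distinct pixels are equally correct answers to a pointing query. One simple verifiable-reward design that sidesteps it is to reward any prediction falling inside the annotated target region; this is an illustrative assumption, not Embodied-R1's exact reward definition:

```python
import numpy as np


def pointing_reward(pred_xy, region_mask):
    """Verifiable reward for a pointing prediction.

    Any pixel inside the target region earns full reward, so the model is
    free to generalize over the whole set of valid answers instead of
    memorizing one annotated point. (Illustrative sketch, not the paper's
    exact reward.)
    """
    x, y = int(round(pred_xy[0])), int(round(pred_xy[1]))
    h, w = region_mask.shape
    if not (0 <= x < w and 0 <= y < h):
        return 0.0  # off-image predictions get no reward
    return 1.0 if region_mask[y, x] else 0.0


# Toy example: 10x10 image whose valid region is the 4x4 block at (3..6, 3..6).
mask = np.zeros((10, 10), dtype=bool)
mask[3:7, 3:7] = True
print(pointing_reward((4.2, 5.0), mask), pointing_reward((0.0, 0.0), mask))
```

Because the reward is computed from the mask rather than a single ground-truth coordinate, reinforcement fine-tuning can credit every valid solution equally.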
Latest from Mu Yao's Team! Discrete Diffusion Brought into VLA, Supporting Precise Action Modeling and Consistency Training
具身智能之心· 2025-09-02 00:03
Core Viewpoint
- The article discusses the Discrete Diffusion VLA model, which integrates discrete diffusion into the Vision-Language-Action (VLA) framework, improving the efficiency and accuracy of robotic action decoding [4][7].

Group 1: Background and Problem Statement
- A VLA model lets a robot understand visual and language inputs and execute corresponding action sequences. Current VLA frameworks typically adapt large pre-trained vision-language models (VLMs) by adding an action-generation head [4].
- Existing decoding methods fall into two categories: autoregressive (AR) methods, which generate actions sequentially, and continuous-diffusion methods, which treat action trajectories as continuous signals [4][6].

Group 2: Proposed Solution
- The Discrete Diffusion VLA model incorporates discrete diffusion into action decoding, using a single Transformer to unify the visual, language, and action modalities without additional training modules [6][12].
- The model employs a "first easy, then difficult" adaptive decoding strategy, enabling parallel decoding of actions with error correction and significantly improving accuracy [12][18].

Group 3: Performance Metrics
- On the LIBERO benchmark with the Franka Panda robotic arm, the model achieved a 96.3% success rate, outperforming traditional AR and continuous-diffusion models [2][12].
- The Google robot reached a visual-matching rate of 71.2%, while the WidowX robot achieved a 49.3% overall success rate in real-to-sim transfer scenarios, demonstrating the model's robustness [2][25].

Group 4: Experimental Results
- The Discrete Diffusion VLA model consistently outperformed the baselines, with an average success rate of 96.3% across tasks, surpassing the closest model, OpenVLA-OFT, by 0.8% [21][22].
- Its performance on visual matching and variant aggregation was also superior, with an overall average success rate of 64.1% across diverse scenarios [23][24].

Group 5: Ablation Studies
- Ablations showed that the adaptive decoding strategy contributes significantly, with the "max confidence" variant yielding a 97.4% success rate, outperforming the other strategies [27].
- The temperature-scheduling method also proved effective, reaching a 97.4% success rate and validating the synergy between temperature adjustment and adaptive decoding [28].
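The "first easy, then difficult" strategy can be pictured as iterative parallel unmasking: at each round, commit the masked action-token positions the model is most confident about, then re-predict the rest with that new context. The sketch below is an illustrative reconstruction of the idea, not the paper's exact algorithm; `toy_model` and the fixed unmasking schedule are assumptions.

```python
import numpy as np


def adaptive_decode(model, seq_len, steps=4):
    """Decode action tokens in parallel, most-confident positions first.

    `model(tokens)` returns a (seq_len, vocab) matrix of per-position
    probabilities; -1 marks a still-masked slot. Each round commits the
    easiest (highest-confidence) remaining positions.
    """
    tokens = np.full(seq_len, -1)
    per_step = int(np.ceil(seq_len / steps))
    for _ in range(steps):
        masked = np.flatnonzero(tokens == -1)
        if masked.size == 0:
            break
        probs = model(tokens)
        conf = probs.max(axis=1)          # confidence of each position's best guess
        best = probs.argmax(axis=1)
        commit = masked[np.argsort(-conf[masked])][:per_step]
        tokens[commit] = best[commit]     # hard decisions, hardest tokens wait
    return tokens


# Toy "model": always 90% confident in a fixed target sequence.
target = np.array([2, 0, 1, 3])
def toy_model(tokens):
    probs = np.full((4, 4), 0.1 / 3)
    probs[np.arange(4), target] = 0.9
    return probs

decoded = adaptive_decode(toy_model, 4)
print(decoded)  # recovers the target sequence
```

In the real model each round re-runs the Transformer, so late, hard decisions benefit from the easy tokens already committed, which is where the error-correction effect comes from.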
China's Largest Embodied Intelligence Community Is Enrolling for the Back-to-School Season!
具身智能之心· 2025-09-02 00:03
Group 1
- The article introduces a community focused on embodied intelligence, providing a platform for technical exchange and problem-solving among members from academia and industry [1][5][17].
- The community has nearly 2,000 members and aims to grow to 10,000 over the next two years, offering resources such as video content, learning paths, and job exchange [4][22].
- Members work across research areas including VLA, reinforcement learning, and simulation, with a focus on practical applications and collaboration [1][4][5].

Group 2
- The community curates a collection of over 30 technical routes, giving both beginners and advanced learners quick access to benchmarks, reviews, and learning paths [5][17].
- Regular events such as roundtable discussions and live streams share insights on the latest developments in the embodied-intelligence industry [5][22].
- Job-referral arrangements with multiple leading companies in the field improve employment opportunities for members [7][22].

Group 3
- The article outlines research topics and technical areas including data collection, VLA models, multi-sensor fusion, and robot operating systems, highlighting the community's focus on cutting-edge research [6][8][12].
- Specific frameworks and models are discussed, such as RoboTwin 2.0 for data generation and evaluation, and various VLA frameworks aimed at improving generalization and safety [6][7][12].
- The community also emphasizes collaboration and knowledge sharing among members to address challenges in embodied intelligence [5][17][22].
Shanghai Jiao Tong University's Comprehensive Survey of Sensing, Social, and Motion Intelligence in Embodied Navigation
具身智能之心· 2025-09-02 00:03
Core Insights
- The article presents the TOFRA framework, which decomposes the embodied-navigation process into five key stages: Transition, Observation, Fusion, Reward-policy construction, and Action execution, providing a structured lens for embodied-navigation research [2][14].
- It systematically integrates findings from computer vision, classical robotics, and bionics in the context of embodied navigation, highlighting how these fields complement one another in sensing intelligence, social intelligence, and motion intelligence [2][3].
- The article identifies four core challenges for the field: adaptive spatiotemporal scale, joint optimization, system integrity, and data-task generalization, which guide future research directions [2][3].

Group 1: Research Background
- Embodied Artificial Intelligence (EAI) emphasizes self-perception and interaction with humans or the environment as a pathway to Artificial General Intelligence (AGI) [2].
- The defining feature of embodied navigation is egocentric perception and distributed computing, in contrast to traditional navigation that relies on predefined maps or external localization [2][3].

Group 2: Intelligence Types
- Sensing intelligence: achieved through multimodal egocentric perception, enabling spatial cognition without complete reliance on pre-built global maps [3][4].
- Social intelligence: the ability to understand high-level semantic instructions from humans, supporting complex task execution beyond predefined waypoints [10][11].
- Motion intelligence: the ability to perform flexible, adaptive physical interaction in complex environments, not limited to fixed paths [10][11].

Group 3: TOFRA Framework
- Transition (T): predicting the next state from internal sensors, using methods ranging from dynamics modeling to end-to-end neural networks [14][20].
- Observation (O): how robots perceive the environment through external sensors, forming an understanding of the external world [27][28].
- Fusion (F): combining internal state predictions with external perceptions into an optimal state estimate, using classical Bayesian methods and neural networks [45][48].

Group 4: Action Execution
- In action execution, the robot uses motion skills to carry out the action sequences generated by the policy, from basic skills to complex skill combinations [60][61].
- The article traces the evolution of action execution from basic motion skills to complex combinations and morphological cooperation, highlighting advances in motion intelligence [60][68].

Group 5: Application Scenarios
- The TOFRA framework is applied to three typical navigation scenarios: embodied autonomous driving, indoor navigation, and complex-terrain navigation, detailing how to integrate the framework's stages into efficient navigation systems [74][75][76].
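The Transition → Observation → Fusion loop has a minimal classical instance in a one-dimensional Kalman filter: dead-reckon the next state from internal sensing, then correct it with an external measurement, weighted by the Kalman gain. All numbers below are illustrative:

```python
def kalman_step(x, P, u, z, q=0.01, r=0.25):
    """One predict-update cycle of a 1-D Kalman filter.

    x, P : current state estimate and its variance
    u    : motion increment from internal sensors (Transition)
    z    : position reading from an external sensor (Observation)
    q, r : assumed process and measurement noise variances
    """
    x_pred = x + u                       # Transition: dead-reckoned motion
    P_pred = P + q                       # uncertainty grows while predicting
    K = P_pred / (P_pred + r)            # gain weighs prediction vs. observation
    x_new = x_pred + K * (z - x_pred)    # Fusion: correct with the measurement
    P_new = (1 - K) * P_pred             # fused estimate is more certain
    return x_new, P_new


# Robot commands 1.0 m per step; external fixes arrive with some noise.
x, P = 0.0, 1.0
for u, z in [(1.0, 1.1), (1.0, 2.05), (1.0, 2.9)]:
    x, P = kalman_step(x, P, u, z)
print(round(x, 3), round(P, 3))
```

The survey's point is that this same predict-fuse pattern generalizes: the Bayesian update can be replaced or augmented by learned (neural) fusion while the TOFRA stage structure stays the same.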
具身智能之心 Partner Recruitment Is Here! Multiple Directions: Embodied Data Collection / Algorithms / Simulation / Hardware
具身智能之心· 2025-09-01 10:00
Course Lecturer Recruitment
具身智能之心 is now recruiting course lecturers! If you work in large models / multimodal large models, Diffusion, VLA, VLA+RL, sim2real, end-to-end methods, embodied interaction, vision-language navigation, reinforcement learning, robot motion planning, robot frameworks, grasp-point prediction and pose estimation, navigation and mapping, tactile sensing, large-model deployment and quantization-aware inference, robot simulation, or related directions, we welcome you to join us.
Main duties: developing embodied-intelligence video courses and answering questions in the community groups.
Compensation is generous (add us on WeChat below for details); beyond cash incentives, we share industry-wide embodied resources and job openings.

Research Mentoring Instructors
Generous compensation, above the industry level: publish papers and earn extra income at the same time!

Robot Hardware Development Partners
If you are working on software or hardware for robotic-arm grasping systems, bipedal robots, quadruped robots, wheeled robots, large-model deployment, or similar, and hope to advance embodied education together with us, please get in touch.
We will offer partner status to jointly build a larger embodied-education scene and push the industry forward.

Contact Us
Recruitment of research mentoring instructors in embodied-intelligence directions has begun! If you work in diffusion policy, VLA, VLA plus reinforcement learning, sim2real, reinforcement learning, embodied simulation, embodied perception, embodied interaction, vision-language navigation, object-goal navigation, tactile sensing, large models / multimodal large models, large-model quantization, robotic-arm grasping, pose estimation, large-model deployment ...
Latest from Mu Yao's Team! Discrete Diffusion VLA: Discrete Diffusion Brought into VLA, Supporting Precise Action Modeling and Consistency Training
具身智能之心· 2025-09-01 10:00
Core Insights
- The article discusses the Discrete Diffusion VLA model, which integrates discrete diffusion into the Vision-Language-Action (VLA) framework, improving the efficiency and accuracy of robotic action decoding [4][7][8].

Group 1: Model Overview
- The Discrete Diffusion VLA model addresses the limitations of existing VLA frameworks by using a single Transformer to unify the visual, language, and action modalities, eliminating the need for additional training modules [6][12].
- The model achieves an average success rate of 96.3% on the LIBERO benchmark with the Franka Panda robotic arm, outperforming traditional autoregressive and continuous-diffusion models [2][8][21].

Group 2: Performance Metrics
- Across environments, the model performed strongly: 96.3% on LIBERO, 64.1% on SimplerEnv-Fractal, and 49.3% in real-to-sim transfer scenarios [2][8][25].
- Its visual-matching rate reached 71.2%, significantly higher than competitors, indicating robustness to scene changes [23][24].

Group 3: Innovation and Contributions
- A "first easy, then difficult" adaptive decoding strategy enables parallel decoding and error correction, improving accuracy within a unified architecture [7][11].
- The training process aligns with existing VLM frameworks, allowing seamless integration and optimization without specialized training pipelines [12][14].

Group 4: Experimental Validation
- Extensive experiments validated the model across scenarios, with significant gains over baselines, including a 19.8% improvement over traditional autoregressive models [21][27].
- Ablation studies confirmed the effectiveness of the decoding strategy and temperature selection, with the "maximum confidence" adaptive strategy yielding the highest success rates [27][28].
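One way temperature selection can interact with confidence-based adaptive decoding is to anneal the softmax temperature across decoding rounds: early commitments are made under a softer distribution, later ones under a sharper, near-deterministic one. The linear schedule and endpoint values below are assumptions for illustration, not the paper's settings.

```python
import numpy as np


def temperature_schedule(step, total, t_start=1.0, t_end=0.1):
    """Linearly anneal softmax temperature across decoding rounds.

    Higher temperature early (softer, more exploratory confidence),
    lower temperature late (sharper commitments). Schedule shape and
    endpoints are illustrative assumptions.
    """
    frac = step / max(total - 1, 1)
    return t_start + frac * (t_end - t_start)


def softmax(logits, t):
    z = (logits - logits.max()) / t      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


logits = np.array([2.0, 1.0, 0.5])
early = softmax(logits, temperature_schedule(0, 5))  # round 0: t = 1.0
late = softmax(logits, temperature_schedule(4, 5))   # final round: t = 0.1
print(early.max(), late.max())
```

Under this kind of schedule, the confidence scores that drive "maximum confidence" unmasking are deliberately conservative at first and decisive at the end, which is one plausible reading of the synergy the ablations report.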
RLinf Goes Open Source! The First Large-Scale Reinforcement Learning Framework for Embodied Intelligence Unifying Rendering, Training, and Inference
具身智能之心· 2025-09-01 04:02
Core Viewpoint
- The article discusses the launch of RLinf, a large-scale reinforcement-learning framework for embodied intelligence, highlighting its innovative design and its role in advancing AI from perception to action [2][5].

Group 1: Framework Overview
- RLinf is a flexible, scalable framework designed for embodied intelligence, integrating its components to optimize end-to-end performance [5].
- The "inf" in the name stands for both "infrastructure" and "infinite" scaling, reflecting the adaptability of the system design [7].
- RLinf's hybrid execution model achieves over 120% system speedup compared with traditional frameworks, with VLA model performance improvements of 40%-60% [7][12].

Group 2: Execution Modes
- RLinf supports three execution modes: collocated, disaggregated, and hybrid, letting users configure components to their needs [15][17].
- The hybrid mode combines the advantages of shared and separated execution, minimizing system idle time and improving efficiency [12][15].

Group 3: Communication and Scheduling
- The framework includes an adaptive communication library designed for reinforcement learning, optimizing data exchange between components [19][22].
- An automated scheduling module minimizes resource idleness and adapts dynamically to the user's training flow, achieving rapid scaling [23][24].

Group 4: Performance Metrics
- RLinf delivers significant gains on embodied-intelligence tasks, reaching success rates of 80%-90% in specific scenarios versus 30%-50% for previous models [24][26].
- The framework also achieves state-of-the-art (SOTA) performance on mathematical-reasoning tasks across multiple datasets, demonstrating its versatility [29][30].

Group 5: Documentation and Community Engagement
- Comprehensive documentation and API support are provided to ease adoption and understanding of the framework [32][34].
- The RLinf team invites collaboration and encourages users to explore the framework, and is recruiting for several research and engineering positions [33][34].
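The three execution modes boil down to how components (renderer/simulator, trainer, inference engine) are mapped onto GPU pools: everything time-shared (collocated), dedicated pools with pipelined hand-offs (disaggregated), or a mix (hybrid). The sketch below is hypothetical; the component names and placement rules are illustrative, not RLinf's actual API.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    renderer: list   # GPUs running the simulator / renderer
    trainer: list    # GPUs running gradient updates
    inference: list  # GPUs serving policy rollouts


def plan(mode: str, gpus: list) -> Placement:
    """Map components to GPU pools for a given execution mode (illustrative)."""
    if mode == "collocated":
        # Every component time-shares all GPUs; simple, but phases serialize.
        return Placement(gpus, gpus, gpus)
    if mode == "disaggregated":
        # Dedicated thirds; phases overlap as a pipeline, but pools can idle.
        third = len(gpus) // 3
        return Placement(gpus[:third], gpus[third:2 * third], gpus[2 * third:])
    if mode == "hybrid":
        # Renderer and inference share one pool, training gets its own,
        # trading off the idle time of the two pure modes.
        half = len(gpus) // 2
        return Placement(gpus[:half], gpus[half:], gpus[:half])
    raise ValueError(f"unknown mode: {mode}")


p = plan("hybrid", list(range(8)))
print(p)
```

Exposing the mode as a single configuration switch, as sketched here, is what lets the same training script be retargeted from a shared workstation to a disaggregated cluster.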
Latest Survey! A Review of Multimodal Fusion and VLM Methods in Embodied Robotics
具身智能之心· 2025-09-01 04:02
Core Insights
- The article discusses the transformative impact of multimodal fusion and Vision-Language Models (VLMs) on robot vision, enabling robots to evolve from simple mechanical executors into intelligent partners that understand and interact with complex environments [3][4][5].

Multimodal Fusion in Robot Vision
- Multimodal fusion integrates data such as RGB images, depth information, LiDAR point clouds, language, and tactile signals, significantly improving robots' perception and understanding of their surroundings [3][4][9].
- The main fusion strategies have evolved from early explicit concatenation to implicit collaboration within unified architectures, improving feature extraction and task prediction [10][11].

Applications of Multimodal Fusion
- Semantic scene understanding is crucial for robots to recognize objects and their relationships; multimodal fusion greatly improves accuracy and robustness in complex environments [9][10].
- 3D object detection, vital for autonomous systems, combines data from cameras, LiDAR, and radar to improve environmental understanding [16][19].
- Embodied navigation lets robots explore and act in real environments, with goal-oriented, instruction-following, and dialogue-based navigation methods [24][26][27][28].

Vision-Language Models (VLMs)
- VLMs have advanced significantly, enabling robots to understand spatial layouts, object properties, and semantic information while executing tasks [46][47].
- VLMs have evolved from basic models into sophisticated systems capable of multimodal understanding and interaction, broadening their applicability across tasks [53][54].

Future Directions
- The article identifies key challenges in deploying VLMs on robotic platforms, including sensor heterogeneity, semantic discrepancies, and the need for real-time performance optimization [58].
- Future research may focus on structured spatial modeling, improved system interpretability, and cognitive VLM architectures with long-term learning capabilities [58][59].
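The shift the survey describes, from explicit concatenation toward implicit collaboration inside unified architectures, starts from two classic baselines that are easy to contrast. The sketch below uses random features and linear "heads" purely for illustration; the shapes and weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
rgb_feat = rng.normal(size=64)    # e.g. output of an image encoder
depth_feat = rng.normal(size=32)  # e.g. output of a depth/LiDAR encoder

# Early (explicit) fusion: concatenate modality features, one joint predictor.
W_joint = rng.normal(size=(10, 64 + 32))
early_logits = W_joint @ np.concatenate([rgb_feat, depth_feat])

# Late fusion: independent per-modality predictors, combined at decision level.
W_rgb = rng.normal(size=(10, 64))
W_depth = rng.normal(size=(10, 32))
late_logits = 0.5 * (W_rgb @ rgb_feat + W_depth @ depth_feat)

print(early_logits.shape, late_logits.shape)
```

Early fusion lets the predictor exploit cross-modal correlations but couples the modalities tightly; late fusion degrades gracefully when one sensor fails. The unified-architecture approaches the survey highlights (e.g. cross-attention inside a shared Transformer) sit between these two extremes.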