BeingBeyond
SOTA at the Lowest Image Resolution! Full-Stack Open-Source Embodied Model Released: A General-Purpose Brain Forged from 35,000 Hours of Data
量子位· 2026-01-23 12:09
Core Insights
- The article discusses the breakthrough of the Being-H0.5 model in the field of embodied intelligence, addressing the challenges posed by data isolation and the "Matthew Effect" in the industry [1][3][39]
- Being-H0.5 is the VLA model with the largest training corpus to date (35,000 hours), enabling cross-robot zero-shot skill transfer and showing remarkable generalization [2][3][30]

Data and Model Development
- The Being-H0.5 model integrates 35,000 hours of data, including 14,000 hours of robot data and 16,000 hours of human data, across 30 robot types, allowing rapid adaptation and stable execution regardless of hardware configuration [2][8]
- The UniHand-2.0 dataset, an iteration of UniHand-1.0, comprises over 35,000 hours of high-quality data, marking a significant advance in cross-domain data integration [8][9]

Training Paradigms
- The model employs a human-centric learning paradigm, aligning human intent with robotic actions through a unified token sequence that captures physical interaction signals [20][39]
- A unified action space framework is established to bridge the dimensional gap between heterogeneous hardware, enabling joint training and knowledge sharing [16][17]

Architectural Innovations
- The Mixture-of-Flow (MoF) architecture decouples action experts, learning universal motion primitives while ensuring precise execution on specific robot types [22][23]
- Mechanisms such as manifold-preserving gating and universal async chunking improve robustness and adaptability across different hardware [23][24]

Performance and Validation
- Extensive real-world testing on various robot types showed that Being-H0.5 can perform complex tasks, achieving success rates competitive with specialized models [28][30][35]
- In quantitative evaluations the model surpasses existing VLA models, reaching an average success rate of 98.9% on specific tasks [35][36]

Open Source and Future Directions
- The BeingBeyond team commits to a full-stack open-source approach, providing not only pre-trained models but also complete training frameworks and evaluation tools to foster community innovation [37][38]
- The vision is to establish Being-H0.5 as foundational infrastructure for the embodied-intelligence sector, enabling rapid development without extensive data collection [39]
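The article does not detail how the unified action space bridges heterogeneous hardware. One common way to realize such a framework is to pad every robot's action vector into a shared maximum-dimensional space with a per-embodiment validity mask, so that loss and attention only apply to real joints. The sketch below is illustrative only; the robot names and dimensions are invented, not Being-H0.5's actual layout:

```python
import numpy as np

# Hypothetical per-robot action dimensions; the real Being-H0.5 layout is
# not given in the article, so these numbers are illustrative only.
ROBOT_DOF = {"arm_7dof": 7, "biman_14dof": 14, "hand_23dof": 23}
UNIFIED_DIM = max(ROBOT_DOF.values())  # shared action width across embodiments

def to_unified(action: np.ndarray, robot: str):
    """Pad a robot-specific action into the shared space, with a validity mask."""
    dof = ROBOT_DOF[robot]
    assert action.shape == (dof,)
    padded = np.zeros(UNIFIED_DIM, dtype=np.float32)
    padded[:dof] = action
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    mask[:dof] = True  # loss/attention would only apply where mask is True
    return padded, mask

def from_unified(padded: np.ndarray, robot: str):
    """Recover the robot-specific action by dropping the padded tail."""
    return padded[:ROBOT_DOF[robot]]

a = np.linspace(-1, 1, 7).astype(np.float32)
u, m = to_unified(a, "arm_7dof")
assert np.allclose(from_unified(u, "arm_7dof"), a) and m.sum() == 7
```

Because every embodiment lives in the same fixed-width space, batches from a 7-DoF arm and a 23-DoF hand can be mixed in one training step, which is the joint-training property the summary describes.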
The Dawn of Embodied Foundation Models: The World's Strongest Cross-Embodiment VLA Is Here!
具身智能之心· 2026-01-20 00:33
Core Viewpoint
- The emergence of the Being-H0.5 model is disrupting the established logic of the embodied-intelligence industry, showcasing remarkable cross-embodiment generalization in visual-language-action tasks regardless of hardware differences [3]

Group 1: Industry Trends
- Competition in the embodied-intelligence sector is intensifying, with companies focusing on a limited market of embodiments, where shipment volume directly drives data accumulation and algorithm performance [1]
- The Being-H0.5 model integrates data from nearly all mainstream robot configurations globally, demonstrating its ability to adapt and execute tasks effectively across different embodiments [3]

Group 2: Data Collection and Training
- The UniHand-2.0 dataset, created by BeingBeyond, is the largest training dataset in the world, comprising over 14,000 hours of robot operation data and 16,000 hours of human video data, totaling over 400 billion training tokens [6]
- Unlike previous studies that focused on specific robot configurations, UniHand-2.0 merges data from over 30 different hardware configurations, addressing the significant differences in state and action spaces among robots [8][10]
- The human-centric training paradigm leverages vast amounts of human video, which carries rich physical and spatial priors, enabling better generalization across tasks [11][14]

Group 3: Model Architecture and Performance
- Being-H0.5 features a specialized mixture-of-experts design that decouples multimodal understanding from action generation while keeping them coupled through a shared attention mechanism [17]
- Extensive real-world experiments on various robot configurations demonstrated exceptional cross-embodiment and complex-task execution, with success rates of 98.9% and 54% on widely used benchmarks [18]

Group 4: Industry Impact
- The introduction of Being-H0.5 is a significant benefit for most embodied-intelligence companies: it reduces the need for heavy investment in data-collection centers and allows different configurations to be adapted using human-centric learning as a natural data source [19]
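The "decoupled experts, coupled through shared attention" idea above can be sketched in miniature: every token in the mixed sequence passes through one shared self-attention layer (the coupling), then routes to a modality-specific expert MLP (the decoupling). This toy NumPy illustration shows only the structure; all sizes, weights, and the routing rule are invented, not Being-H0.5's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden width (illustrative; the article gives no sizes)

# Shared attention weights: every token, regardless of modality, attends
# over the full mixed sequence -- the coupling the summary describes.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
# Separate expert MLPs: one for understanding tokens, one for action tokens.
W_understand = rng.standard_normal((D, D)) * 0.1
W_action = rng.standard_normal((D, D)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixed_block(x, is_action):
    """x: (T, D) mixed sequence; is_action: (T,) bool routing flags."""
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(D)) @ (x @ Wv)  # shared
    h = x + attn
    # Route each token to its expert: understanding vs. action generation.
    out = np.where(is_action[:, None], h @ W_action, h @ W_understand)
    return h + np.tanh(out)

x = rng.standard_normal((6, D))
is_action = np.array([False, False, False, True, True, True])
y = mixed_block(x, is_action)
assert y.shape == (6, D)
```

The design point is that action tokens still see vision-language context through the shared attention, while their generation pathway is trained separately from the understanding pathway.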
Grasp Anything from a Single Demonstration: Peking University Team's Breakthrough in Universal Grasping Adapts to All Dexterous-Hand Embodiments
36Kr· 2025-10-29 08:55
Core Insights
- The article discusses the introduction of the DemoGrasp framework, a novel approach to robotic grasping that addresses challenges in traditional reinforcement learning (RL) methods, particularly high-dimensional action spaces and complex reward functions [1][4][6]

Group 1: Framework Overview
- DemoGrasp is designed to enhance the efficiency of grasping tasks by starting from a single successful demonstration trajectory, which is then edited to adapt to various objects and poses [4][8]
- The framework transforms the multi-step Markov Decision Process (MDP) into a single-step MDP based on trajectory editing, significantly improving learning efficiency and transfer to real robots [4][6]

Group 2: Learning Process
- The learning process edits the trajectory of a successful grasp to accommodate new objects, adjusting wrist and finger positions to fit unseen items [8][12]
- DemoGrasp trains the policy network in a simulation environment with thousands of parallel worlds, reaching over 90% success after 24 hours of training on a single RTX 4090 GPU [8][10]

Group 3: Performance Metrics
- In experiments on the DexGraspNet dataset, DemoGrasp outperformed existing methods, achieving a visual-policy success rate of 92% with only a 1% generalization gap between training and test sets [10][13]
- The framework adapted across various robot embodiments, achieving an average success rate of 84.6% on 175 different objects without adjusting training hyperparameters [14][15]

Group 4: Real-World Application
- In real-world tests, DemoGrasp grasped 110 unseen objects with success rates exceeding 90% for regular-sized items and 70% for challenging flat and small objects [15][16]
- The framework supports complex grasping in cluttered environments, maintaining an 84% success rate for single-instance real-world grasps despite significant variations in lighting and object placement [16][17]
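The trajectory-editing idea in the summary can be made concrete: rather than learning a new action at every timestep, the policy proposes one edit (a wrist shift and a finger adjustment) that is applied to the whole demonstration. The parameterization below is a guess for illustration; DemoGrasp's actual state and edit representation is not given in the text:

```python
import numpy as np

# A demo trajectory: T steps of [wrist xyz (3) + finger joints (12)].
# Shapes are illustrative; DemoGrasp's exact parameterization isn't in the text.
T, N_FINGER = 20, 12
demo = np.zeros((T, 3 + N_FINGER), dtype=np.float32)
demo[:, 2] = np.linspace(0.3, 0.05, T)           # wrist descends toward object
demo[:, 3:] = np.linspace(0.0, 1.0, T)[:, None]  # fingers close gradually

def edit_trajectory(demo, wrist_delta, finger_offset):
    """One 'action' = one edit applied uniformly to the demonstration."""
    edited = demo.copy()
    edited[:, :3] += wrist_delta    # shift wrist path toward the new object pose
    edited[:, 3:] += finger_offset  # rescale the grasp to the new object shape
    return edited

# A trained policy would output these from an observation; here they are fixed.
new_traj = edit_trajectory(demo, np.array([0.02, -0.01, 0.0]), 0.1)
assert new_traj.shape == demo.shape
```

Because the edit vector is tiny compared with a full per-timestep action sequence, the RL problem shrinks dramatically, which is consistent with the reported 24-hour training on one GPU.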
Grasp Anything from a Single Demonstration: Peking University Team's Breakthrough in Universal Grasping Adapts to All Dexterous-Hand Embodiments
量子位· 2025-10-29 05:11
Core Insights
- The article discusses the challenges of traditional reinforcement learning (RL) in high-dimensional action spaces for robotic grasping and introduces the DemoGrasp framework as a solution [1][2][4]

Group 1: DemoGrasp Framework
- DemoGrasp is a simple and efficient learning method for general robotic grasping, initialized from a single successful demonstration trajectory [2][4]
- The framework transforms the multi-step Markov Decision Process (MDP) into a single-step MDP by editing demonstration trajectories, enhancing learning efficiency and transfer to real robots [4][7]

Group 2: Learning Process
- The learning process edits the robot's actions in the demonstration trajectory to adapt to different objects and poses, focusing on wrist and finger adjustments [9][16]
- DemoGrasp uses a simulation environment with thousands of parallel worlds to train the policy network, which outputs editing parameters based on observations [10][11]

Group 3: Training Efficiency
- Training is notably efficient: a single RTX 4090 GPU reaches over 90% success in just 24 hours thanks to the compact action space [12]
- The framework adapts to various robotic hands without adjusting training hyperparameters, achieving an average success rate of 84.6% across 175 objects [20]

Group 4: Performance Metrics
- DemoGrasp outperforms existing methods on the DexGraspNet dataset, achieving a visual-policy success rate of 92% with minimal generalization gap [17][18]
- In real-world tests, DemoGrasp grasped 110 unseen objects, maintaining over 90% success on regular objects and 70% on challenging flat and small objects [21][22]

Group 5: Future Directions
- The framework aims to support more complex tasks such as functional grasping and tool use, with potential for real-time adjustment and error recovery in future work [25][26]
- DemoGrasp can be integrated with multimodal large models for autonomous grasping in open environments [27]
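The single-step MDP structure described above can be sketched as an environment whose episode ends after one decision: the policy sees the scene, emits edit parameters, the edited demo is rolled out in full, and a single success reward comes back. Everything below (the toy success criterion, the observation) is invented for illustration, not DemoGrasp's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)

class SingleStepGraspEnv:
    """Sketch of a single-step MDP: one action (an edit vector), one full
    simulated rollout, one terminal reward. Details are illustrative only."""

    def reset(self):
        # Randomized scene: the object is displaced from the demo's pose.
        self.object_pose = rng.uniform(-0.05, 0.05, size=3)
        return self.object_pose  # observation

    def step(self, edit_params):
        # "Roll out" the edited demonstration; success here is a toy
        # criterion: the wrist edit must compensate the displacement.
        success = np.linalg.norm(edit_params - self.object_pose) < 0.02
        reward = float(success)
        return None, reward, True, {}  # episode ends after one decision

env = SingleStepGraspEnv()
obs = env.reset()
_, r, done, _ = env.step(obs)  # a perfect policy echoes the displacement
assert done and r == 1.0
```

Collapsing the horizon to one step removes credit assignment across timesteps entirely, which is why thousands of parallel worlds can train the edit policy so quickly.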
An Arm, a Hand, and Active Vision Too? BeingBeyond D1, the World's First Desktop-Grade Dexterous-Hand Robotic Arm, Is Released
具身智能之心· 2025-10-13 00:02
As embodied intelligence flourishes, universities and research institutions increasingly need robot platforms that combine performance with affordability. Traditional industrial robotic arms, however, are not only expensive (often several hundred thousand RMB) but also burdened by complex development, difficult maintenance, and a lack of accompanying algorithms and models, severely limiting the pace of research innovation.

To break through these limitations, BeingBeyond has officially released the world's first desktop-grade dexterous-hand robotic arm, the D1. It integrates three core capabilities (robotic arm + dexterous hand + active vision system) into one highly integrated, cost-effective platform, making embodied intelligence truly ready to use.

Beyond its hardware, the D1 ships with the in-house VLA model Being-H0, covering the complete pipeline from data collection and model training to deployment. Out-of-the-box, open-source, and flexible, it gives researchers a one-stop, low-barrier platform for embodied-intelligence research.

Flexible modular design: powerful and endlessly extensible

The D1 arm is not merely "dexterous"; it is an all-round platform built for research. It adopts a highly modular architecture with 19 degrees of freedom (6 arm + 2 head + 11 hand), of which 14 are active and 5 are passively linked, covering the full pipeline from perception to manipulation. Decoupled modules and standardized interfaces make it plug-and-play and easy to swap, an ideal choice for research and teaching.

Robotic arm module ...
Being-VL's Visual BPE Approach: Truly Unifying "Seeing" and "Saying"
具身智能之心· 2025-10-11 00:02
Core Insights
- The article discusses the limitations of traditional multimodal models, particularly how CLIP-style encoders prematurely align visual representations to text space, leading to potential hallucinations when details are queried that do not depend strongly on language [1][5]
- A new method called Being-VL is proposed, which uses visual BPE (Byte Pair Encoding) to improve the alignment and modeling of visual and textual data [1][2]

Group 1: Being-VL Methodology
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE that weighs both co-occurrence frequency and spatial consistency, and unifying visual and text tokens into a single sequence for modeling [2][5]
- The Priority-Guided Encoding approach combines frequency and spatial consistency to build a visual token set that is more meaningful both semantically and structurally [7][8]

Group 2: Training Strategy
- Training proceeds in three stages: initial alignment of visual token embeddings, selective fine-tuning of the LLM, and full fine-tuning on complex reasoning and instruction data [9][15]
- A curriculum-learning strategy gradually moves from basic tasks to more complex ones, strengthening the model's cross-modal understanding [9][12]

Group 3: Experimental Results
- Experiments indicate that discretizing images and then applying visual BPE improves reliability on detail-sensitive tasks and reduces hallucinations compared with traditional methods [12][16]
- Visual BPE significantly enhances performance and robustness, showing that folding stable visual patterns into tokens enables better reasoning [12][19]

Group 4: Tokenization and Efficiency
- The study highlights the impact of BPE token size on training efficiency, suggesting that a balanced token size optimizes both expressiveness and training efficiency [19][20]
- Larger token sizes may lead to sparse distributions and diminishing returns on compute, indicating a need for careful scaling in future applications [19][20]
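The Priority-Guided Encoding idea (score candidate merges by frequency and spatial consistency, rather than frequency alone as in text BPE) can be illustrated on a toy grid of VQ token ids. The scoring formula and the consistency measure below are invented stand-ins for illustration, not Being-VL's published formulation:

```python
from collections import Counter

import numpy as np

# A 2D grid of VQ token ids (what a VQ-GAN would emit); values illustrative.
grid = np.array([[1, 2, 1, 2],
                 [1, 2, 1, 2],
                 [3, 3, 3, 3]])

def pair_scores(grid, alpha=0.5):
    """Score adjacent token pairs by frequency and spatial consistency.
    Mirrors the idea of Priority-Guided Encoding, not its exact formula."""
    horiz = Counter(zip(grid[:, :-1].ravel(), grid[:, 1:].ravel()))
    vert = Counter(zip(grid[:-1, :].ravel(), grid[1:, :].ravel()))
    scores = {}
    for pair in set(horiz) | set(vert):
        freq = horiz[pair] + vert[pair]
        # Crude consistency: the pair keeps appearing in one dominant direction.
        consistency = max(horiz[pair], vert[pair]) / freq
        scores[pair] = freq * (alpha + (1 - alpha) * consistency)
    return scores

scores = pair_scores(grid)
best = max(scores, key=scores.get)
# (1, 2) occurs 4 times, always horizontally: frequent AND spatially
# consistent, so it wins the merge and becomes one new visual token.
assert best == (1, 2)
```

Iterating this merge, as text BPE does, grows a vocabulary whose tokens correspond to stable visual patterns rather than arbitrary patches, which is what the summary credits for the reduced hallucinations.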
Being-VL's Visual BPE Approach: Truly Unifying "Seeing" and "Saying"
机器之心· 2025-10-09 02:24
Core Insights
- The article discusses the limitations of traditional multimodal models, particularly how CLIP-style encoders prematurely align visual representations with text space, leading to potential hallucinations on detailed queries that do not depend on language [2][6]
- A new method called Being-VL is proposed, which takes a post-alignment approach: images are first represented discretely and only then aligned with text, preserving visual structure and reducing information loss [2][3]

Being-VL Implementation
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE that weighs both co-occurrence frequency and spatial consistency, and unifying visual and text tokens into a single sequence for modeling [3][10]
- The visual BPE tokenizer prioritizes both frequency and spatial consistency to build a token set that is more meaningful semantically and structurally, independent of text [8][9]

Training Strategy
- The training process is divided into three stages [12]:
  1. **Embedding Alignment**: only the new visual token embeddings are trained while other parameters are frozen, preserving existing language capabilities
  2. **Selective Fine-tuning**: a portion of the LLM layers is unfrozen to enable cross-modal interaction at lower representation levels
  3. **Full Fine-tuning**: all layers are unfrozen for comprehensive training on complex reasoning and instruction data [10][12]

Experimental Results
- Experiments indicate that discretizing images, applying visual BPE, and modeling them jointly with text improves reliability on detail-sensitive queries and reduces hallucinations compared with traditional methods [14][16]
- The study highlights the importance of gradual training: combining progressive unfreezing with curriculum learning significantly outperforms single-stage training [10][14]

Visual BPE Token Activation
- Visualizing embedding weights shows that visual BPE yields a more balanced weight distribution between text and visual tokens, indicating a reduced modality gap and improved cross-modal attention [16][19]

Token Size and Training Efficiency
- The research explores the impact of BPE token size on training efficiency, finding an optimal balance in resource-limited scenarios, while larger token sizes may bring diminishing returns due to sparsity [19][20]

Development and Summary
- The evolution from Being-VL-0 to Being-VL-0.5 reflects enhancements to the unified modeling framework, incorporating priority-guided encoding and a structured training approach [20][24]
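The three-stage progressive unfreezing described above reduces to a schedule over parameter groups. The module names below are invented for illustration (Being-VL's actual code layout may differ), but the freeze/unfreeze logic is the one the summary describes:

```python
# Parameter groups of a hypothetical Being-VL-style model; names invented.
PARAMS = {
    "visual_token_embeddings": False,
    "llm.layers.lower": False,
    "llm.layers.upper": False,
    "lm_head": False,
}

def set_stage(stage: int) -> dict:
    """Return which parameter groups are trainable in each stage."""
    trainable = dict.fromkeys(PARAMS, False)
    trainable["visual_token_embeddings"] = True  # stage 1+: new embeddings only
    if stage >= 2:
        trainable["llm.layers.lower"] = True     # stage 2: selective unfreeze
    if stage >= 3:
        trainable["llm.layers.upper"] = True     # stage 3: full fine-tune
        trainable["lm_head"] = True
    return trainable

s1, s3 = set_stage(1), set_stage(3)
assert s1 == {"visual_token_embeddings": True, "llm.layers.lower": False,
              "llm.layers.upper": False, "lm_head": False}
assert all(s3.values())
```

In a real training loop each flag would set `requires_grad` on the matching module; freezing everything but the new embeddings in stage 1 is what protects the pretrained language capabilities the summary mentions.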
The 2025 Beijing Zhiyuan Conference Concludes | Huang Tiejun: Building Physical Agents, Brain-Inspired Methods Open a New Paradigm for Embodied Intelligence
机器人圈· 2025-06-08 01:38
Core Viewpoint
- The seventh Beijing Zhiyuan Conference highlighted the rapid growth of embodied intelligence, showcasing advances in AI and robotics across forums and discussions featuring leading experts and companies [1][3]

Group 1: Conference Highlights
- The conference featured over 180 reports across 20 forums, covering key topics such as multimodal AI, deep reasoning, and AI safety [1]
- Notable attendees included four Turing Award winners, over 30 AI company founders and CEOs, and more than 100 young scientists from around the world [1]

Group 2: Embodied Intelligence Developments
- Embodied intelligence has emerged as a core area of integration between AI and robotics, with a dedicated all-day forum introduced at this year's conference [3]
- Discussions covered the current state and future of embodied intelligence, with insights from leading researchers and company founders [3]

Group 3: Technical Insights
- Tsinghua University professor Sun Fuchun emphasized the importance of world models and immersive digital-physical systems for embodied intelligence [5]
- Researcher Zhao Mingguo highlighted the need for brain-like algorithms to replace traditional controllers in humanoid robots [7]
- Wang He, CTO of Galaxy General, advocated using synthetic data to train models toward zero-shot generalization [9]

Group 4: Data Challenges and Solutions
- The scarcity of data for humanoid robots was addressed, with suggestions to use internet video to pre-train models on human motion [13]
- High costs and difficulties in robot data collection were noted, with recommendations to use video data to enhance training [15]

Group 5: Commercialization and Industry Challenges
- The current limitations of humanoid robots in basic capabilities were discussed, emphasizing the need for improvements in terrain adaptability and stability before moving to higher-level applications [20]
- Key pain points include insufficient data quality and quantity, as well as misalignment between academic research and industrial application [22]

Group 6: Future Outlook
- Zhiyuan Institute chairman Huang Tiejun expressed ambitions to create sophisticated physical intelligent agents, with a vision that embodied intelligence could surpass human capabilities by 2045 [23]