Generalization
A look into why VLA models generalize poorly...
具身智能之心· 2025-08-20 00:03
Generalist robot policies trained on large-scale datasets (such as Open X-Embodiment, abbreviated OXE) perform strongly across a wide range of tasks, yet they often fail to generalize beyond the distribution of their training data. This work investigates the root causes of this limited generalization and identifies shortcut learning, that is, reliance on task-irrelevant features, as the key factor holding it back. Through comprehensive theoretical and empirical analysis, the authors trace shortcut learning to two main causes: (1) limited diversity within individual sub-datasets; and (2) significant distribution differences across sub-datasets, which fragment the overall dataset. Both issues stem from the inherent structure of large-scale datasets like OXE, which are typically assembled from multiple sub-datasets collected independently in different environments and on different robot embodiments. The findings offer practical guidance for improving robot data-collection strategies so as to reduce shortcut learning and strengthen the generalization of generalist robot policies. In addition, for settings where collecting new large-scale data is impractical, the paper shows that carefully chosen robot data-augmentation strategies can effectively reduce shortcut learning in existing offline datasets (a toy example follows this summary), improving the generalization of generalist policies (such as ...) in both simulation and the real world. Paper title: Shortcut Learning in Generali ...
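A toy illustration of the data-augmentation idea, assuming torchvision-style image transforms. This is a minimal, hypothetical sketch rather than the paper's actual recipe: it randomizes task-irrelevant factors (framing, color, lighting) in robot camera observations so that a policy cannot use them as shortcut features.

```python
# Minimal sketch, not the paper's implementation. Assumes torchvision is
# installed and observations are float image tensors with values in [0, 1].
import torch
from torchvision import transforms

# Randomize task-irrelevant factors: framing (crop) and appearance (color
# jitter, grayscale). These are the kinds of cues a policy can latch onto
# as shortcuts instead of attending to task-relevant content.
obs_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # vary camera framing
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),       # vary appearance
    transforms.RandomGrayscale(p=0.1),
])

def augment_observations(images: torch.Tensor) -> torch.Tensor:
    """Apply the augmentations independently to each frame in a
    (B, C, H, W) batch of camera observations."""
    return torch.stack([obs_augment(img) for img in images])
```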
Is chain-of-thought an illusion? Re-examining large-model reasoning from a data-distribution perspective; Musk replies, Grok gets rattled
机器之心· 2025-08-14 09:11
Core Viewpoint
- The research suggests that Chain-of-Thought (CoT) reasoning in large language models (LLMs) may not represent true reasoning but rather a replication of patterns learned from training data, leading to fragility when faced with out-of-distribution tasks [2][10][37].

Data Distribution Perspective on CoT
- The effectiveness of CoT is attributed to the "structured inductive bias" learned within the training distribution, indicating that the reasoning chains are merely reproductions of common patterns rather than genuine logical deductions [13][37].
- A theoretical framework is introduced to quantify the relationship between training and testing distributions, highlighting how distribution shifts impact reasoning performance [15] (a toy sketch of this gap measurement follows this summary).

Experimental Findings on Generalization
- In "task generalization," the model shows nearly 100% accuracy within the training distribution, but accuracy drops to 0.01% under slight distribution shifts, indicating a lack of true generalization [23].
- Supervised fine-tuning on a small amount of new data can restore performance, but this only expands the existing distribution boundary without enhancing abstract generalization capabilities [24].
- In "length generalization," even minor changes in input sequence length significantly affect model performance, demonstrating a tendency to generate reasoning chains matching the lengths seen in training [26].
- The model is highly sensitive to format changes, with even minor alterations in input prompts leading to complete reasoning failures [28].

Universal Sensitivity to Distribution Shifts
- The sensitivity to distribution shifts holds across different sampling temperatures and model sizes, indicating that the issue is not isolated to specific models [31].

Practical Implications
- In high-risk fields such as healthcare and finance, relying on CoT for robust reasoning is cautioned against, since misleading reasoning chains can be more dangerous than outright incorrect answers [34].
- Current evaluation methods that depend on validation sets closely aligned with training distributions may overestimate model robustness, necessitating stricter out-of-distribution testing [35].
- While supervised fine-tuning can quickly raise performance on specific tasks, it does not equip models with true abstract reasoning capabilities [36].
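A toy way to quantify the kind of gap the article describes, offered as an assumption-laden sketch rather than the paper's benchmark or code. `model_predict` is a hypothetical callable mapping a prompt string to the model's answer string; the gap between in-distribution and shifted accuracy is the quantity that reportedly collapses from near 100% to near zero.

```python
# Hypothetical sketch for measuring a distribution-shift gap; not the paper's code.
from typing import Callable, Iterable, Tuple

def exact_match_accuracy(model_predict: Callable[[str], str],
                         examples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, gold_answer) pairs the model answers exactly."""
    pairs = list(examples)
    correct = sum(model_predict(prompt).strip() == gold.strip()
                  for prompt, gold in pairs)
    return correct / max(len(pairs), 1)

def generalization_gap(model_predict: Callable[[str], str],
                       in_distribution: Iterable[Tuple[str, str]],
                       shifted: Iterable[Tuple[str, str]]) -> float:
    """In-distribution accuracy minus shifted accuracy. A large positive gap
    is the signature of pattern replication rather than abstract reasoning."""
    return (exact_match_accuracy(model_predict, in_distribution)
            - exact_match_accuracy(model_predict, shifted))
```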
ByteDance releases a new VLA model, with a companion robot that doubles as a household-chores helper
Sou Hu Cai Jing· 2025-07-23 16:51
Core Insights
- ByteDance's Seed team has launched a new VLA model, GR-3, which supports strong generalization, long-horizon tasks, and flexible object manipulation with dual-arm operations [2][4].
- GR-3 is designed to understand abstract language instructions and can adapt efficiently to new tasks with minimal human data, in contrast to previous models that required extensive training [2][7].
- The accompanying robot, ByteMini, is a versatile dual-arm mobile robot designed specifically to work with GR-3, featuring 22 degrees of freedom and advanced sensory capabilities [4][5].

Model Features
- GR-3 performs complex tasks with high robustness and success rates, faithfully following step-by-step human instructions [4][5].
- The model is trained on a combination of data from teleoperated robots, human VR trajectory data, and publicly available vision-language data, enhancing its learning capabilities [7] (a toy sketch of such source mixing follows this summary).
- GR-3's architecture is a 4-billion-parameter end-to-end model that integrates vision-language and action-generation modules [7].

Performance Highlights
- In tasks such as table organization, GR-3 achieves high success rates and can accurately interpret and respond to complex instructions, even when given invalid commands [4][5].
- The model excels at coordinated dual-arm operations, manipulating deformable objects and recognizing various clothing arrangements [5].
- GR-3's generalization ability allows it to handle previously unseen objects and comprehend abstract concepts during tasks, showcasing its adaptability [5][7].

Future Plans
- The Seed team plans to scale up the model and its training data while incorporating reinforcement-learning methods to further enhance generalization [7].
- Generalization is identified as a key metric for evaluating VLA models, crucial for enabling robots to adapt quickly to dynamic real-world scenarios [7].
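A rough sketch of mixed-source batch sampling over the three data sources named above (teleoperated robot data, human VR trajectories, public vision-language data). The function, source names, and weights are illustrative assumptions; ByteDance has not released this code.

```python
# Illustrative only: fixed-weight sampling across heterogeneous data sources,
# so that no single source dominates a training batch.
import random
from typing import Any, Dict, List, Optional

def sample_mixed_batch(sources: Dict[str, List[Any]],
                       weights: Dict[str, float],
                       batch_size: int,
                       rng: Optional[random.Random] = None) -> List[Any]:
    """Draw a batch in which each example's source is chosen with the given
    probabilities, then an example is drawn uniformly from that source."""
    rng = rng or random.Random(0)
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(sources[name]))
    return batch

# Placeholder weights, not the paper's ratios:
# batch = sample_mixed_batch(
#     {"teleop": teleop_data, "human_vr": vr_data, "vision_language": vl_data},
#     {"teleop": 0.5, "human_vr": 0.2, "vision_language": 0.3},
#     batch_size=64)
```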
Qwen & Tsinghua team overturn conventional wisdom: reinforcement learning for large models with only 20% of key tokens beats training on all tokens
量子位· 2025-06-05 10:28
Core Insights
- The article discusses a recent breakthrough by the LeapLab team from Tsinghua University, revealing that training on only the 20% of tokens with the highest entropy can significantly enhance reinforcement-learning results for large models, outperforming the use of all tokens [1][6].

Group 1: Research Findings
- The team achieved new state-of-the-art (SOTA) records with the Qwen3-32B model, scoring 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for models with fewer than 600 billion parameters trained directly from the base model [2].
- Extending the maximum response length from 20k to 29k raised the AIME'24 score to 68.1 [4].
- The research challenges the classic Pareto principle: in large-model reinforcement learning, the 80% of low-entropy tokens can be discarded without detrimental effects, and keeping them may even have adverse impacts [5][6].

Group 2: Token Analysis
- The study reveals a distinctive entropy distribution during chain-of-thought reasoning, where over 50% of tokens have an entropy value below 0.01, while only 20% exceed 0.672 [9][10].
- High-entropy tokens serve as "logical connectors" in reasoning, while low-entropy tokens are often deterministic components, such as affixes or parts of mathematical expressions [11].
- Experiments showed that increasing the sampling temperature at high-entropy tokens improves reasoning performance, while lowering it decreases performance, underscoring the importance of maintaining high entropy at critical positions [13].

Group 3: Training Methodology
- By focusing solely on the top 20% of high-entropy tokens during reinforcement-learning training (see the sketch after this summary), the Qwen3-32B model saw significant performance improvements, with AIME'24 scores increasing by 7.71 points and AIME'25 by 11.04 points, alongside an average response-length increase of approximately 1,378 tokens [15][17].
- Similar performance gains were observed for the Qwen3-14B model, while the Qwen3-8B model maintained stable performance [16].
- Conversely, training with the 80% of low-entropy tokens led to a sharp decline in model performance, indicating their minimal contribution to reasoning capabilities [18].

Group 4: Implications and Generalization
- The findings suggest that high-entropy tokens facilitate exploration of different reasoning paths, while low-entropy tokens may restrict this exploration due to their determinism [20].
- The advantage of training on high-entropy tokens becomes more pronounced with larger models, with the 32B model showing the most significant improvements [22].
- Models trained with high-entropy tokens also performed well on out-of-domain tasks, indicating a potential link between high-entropy tokens and generalization [22].

Group 5: Reinforcement Learning Insights
- The research indicates that reinforcement learning with verifiable rewards (RLVR) does not completely overhaul the base model but rather fine-tunes it, maintaining a high overlap of 86.67% in high-entropy token positions even after extensive training [24][25].
- Tokens with higher initial entropy show greater entropy increases during RLVR training, while low-entropy tokens remain largely unchanged [25].
- The article suggests that high-entropy tokens may explain why reinforcement learning can generalize better than supervised fine-tuning, which tends to lead to memorization and overfitting [26][27].
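A minimal sketch of the token-selection step described under Training Methodology, stated as an assumption rather than the team's released code: compute per-token entropy from the policy's logits, keep the top 20% highest-entropy positions in each sequence, and restrict the policy-gradient loss to those positions.

```python
# Assumed sketch of high-entropy-token filtering for RL training; not the
# authors' implementation. logits: (B, T, V); actions: (B, T) long tensor of
# sampled token ids; advantages: (B, T) or broadcastable to it.
import torch
import torch.nn.functional as F

def high_entropy_mask(logits: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """True at the top `keep_ratio` fraction of positions by entropy, per sequence."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)        # (B, T)
    k = max(1, int(entropy.shape[-1] * keep_ratio))
    cutoff = torch.topk(entropy, k, dim=-1).values[..., -1:]    # per-sequence threshold
    return entropy >= cutoff

def masked_policy_loss(logits: torch.Tensor,
                       actions: torch.Tensor,
                       advantages: torch.Tensor,
                       keep_ratio: float = 0.2) -> torch.Tensor:
    """Advantage-weighted negative log-likelihood restricted to high-entropy
    positions; the roughly 80% low-entropy tokens contribute nothing."""
    mask = high_entropy_mask(logits, keep_ratio).float()
    token_logp = F.log_softmax(logits, dim=-1).gather(
        -1, actions.unsqueeze(-1)).squeeze(-1)                   # (B, T)
    return -(advantages * token_logp * mask).sum() / mask.sum().clamp(min=1.0)
```

In a full GRPO setup the advantages would come from group-normalized rollout rewards and padding positions would also be masked; the sketch omits those details.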
Robotic "filial children" for the elderly-care dilemma: the technical path is clear, non-humanoid forms will come first
Core Viewpoint
- The article discusses the potential of humanoid robots in addressing growing elderly-care needs in the context of an aging population, highlighting advancements in technology and the evolving landscape of the robotics industry [1][3][20].

Industry Overview
- China's population is aging rapidly, with projections indicating that by the end of 2024 there will be 310 million people aged 60 and above, accounting for 22% of the total population [3][20].
- The concept of "elderly care robots" encompasses various forms, including exoskeletons and humanoid robots, with humanoid robots dominating popular perception [4][21].

Technological Advancements
- Recent breakthroughs in robotics include improvements in bionic joints, motion-control algorithms, and cognitive decision-making frameworks, all essential for developing humanoid robots [1][6].
- The introduction of international standards for elderly-care robots aims to guide design, manufacturing, testing, and certification, promoting healthy industry development [7][9].

Market Dynamics
- The humanoid-robot market is expected to grow significantly, with estimates suggesting that by 2035 the global market could reach $38 billion and the Chinese market could expand to 500 billion yuan [20][24].
- Current humanoid robots are priced between roughly 99,000 and 199,000 yuan, with prices expected to fall as the technology matures [14][17].

Future Outlook
- Experts predict that humanoid robots capable of providing companionship and care for the elderly may enter households within the next three to ten years, although some believe it could take longer [18][21].
- The industry is shifting toward consumer markets, with companies exploring opportunities in home care and rehabilitation, indicating growth potential for the elderly-care robotics sector [22][23].
Generalization soars 47%! The first reward paradigm for intent detection, a new approach to intent recognition in the era of exploding AI tools
机器之心· 2025-05-16 04:39
Core Viewpoint
- The rapid development of large language models (LLMs) and the explosion of integrable tools have significantly enhanced the convenience of AI assistants in daily life, but intent detection and generalization remain critical challenges [1][2].

Group 1: Research and Methodology
- Tencent's PCG social-line research team applied reinforcement learning (RL), specifically the Group Relative Policy Optimization (GRPO) algorithm combined with Reward-based Curriculum Sampling (RCS), to improve intent-detection tasks [2] (a rough sketch of RCS follows this summary).
- Models trained with RL exhibit significantly better generalization than those trained with supervised fine-tuning (SFT), particularly when handling unseen intents and cross-lingual tasks [4].
- Introducing an explicit thought process during RL training further enhances generalization on complex intent-detection tasks [5].

Group 2: Experimental Results
- The GRPO method outperformed SFT in generalization across datasets including MultiWOZ 2.2 and the team's self-built Chinese dataset, TODAssistant [17].
- GRPO achieved performance comparable to SFT on the MultiWOZ 2.2 dataset, indicating its effectiveness for intent detection [14].
- Combining GRPO with RCS further improved accuracy, especially in the second phase of curriculum learning [19].

Group 3: Future Directions
- The research team plans to explore more efficient online data-filtering methods for the RCS approach [24].
- The team intends to investigate multi-intent recognition, as current experiments focus primarily on single-intent scenarios [25].
- The team aims to extend the research to more complex task-oriented dialogue tasks beyond intent recognition [26].
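A rough sketch of the Reward-based Curriculum Sampling idea, offered as an assumption rather than Tencent's implementation: prompts are scored by the current policy's average rollout reward and split into curriculum phases from easiest to hardest, so later GRPO phases concentrate on harder, lower-reward intents.

```python
# Assumed illustration of reward-based curriculum phases; not the paper's code.
from typing import Callable, List

def reward_curriculum_phases(prompts: List[str],
                             rollout_reward: Callable[[str], float],
                             n_phases: int = 2) -> List[List[str]]:
    """Order prompts from highest to lowest rollout reward (easy to hard) and
    split them into `n_phases` roughly equal groups for staged RL training."""
    ordered = sorted(prompts, key=rollout_reward, reverse=True)
    phase_size = (len(ordered) + n_phases - 1) // n_phases
    return [ordered[i:i + phase_size] for i in range(0, len(ordered), phase_size)]
```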