Supervised Learning
What Exactly Makes the "Strongest Embodied VLA Model" So Strong?
36Kr· 2025-11-20 07:38
Core Insights
- The core contribution of the π*0.6 model lies in its introduction of a more intuitive learning method called RECAP, which allows robots to learn from their mistakes rather than merely imitating correct actions [3][8][24]
- The model achieves success rates above 90% on tasks such as making espresso, folding clothes, and assembling packaging boxes, showcasing its practical capabilities [1][20]

Group 1: RECAP Methodology
- RECAP consists of three main phases: offline reinforcement learning (RL) on diverse demonstration data, fine-tuning with human guidance, and online execution in which robots learn from sparse rewards and expert corrections [10][20]
- The method uses a value function to evaluate actions and an advantage-conditioned policy update, allowing efficient learning from both successful and unsuccessful experiences [13][16][42] (a toy sketch follows this summary)

Group 2: Model Architecture and Performance
- The π*0.6 model builds on previous versions, expanding its backbone from Gemma (2.6 billion parameters) to Gemma 3 (4 billion parameters) and growing the Action Expert to 860 million parameters [20]
- On challenging tasks, RECAP roughly doubles throughput (successful task completions per hour) and cuts failure rates by approximately 50% compared with models trained only by supervised fine-tuning [20]

Group 3: Learning from Mistakes
- The RECAP approach emphasizes learning from errors, enabling robots to recover from mistakes through expert intervention and self-correction, which is crucial for real-world deployment [24][28]
- By using a value function to assess action quality, the model can identify key steps and sources of error, improving its ability to adapt in complex environments [39][41]
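The advantage-conditioned recipe summarized in Group 1 can be illustrated with a toy sketch: fit a value function on collected episodes, label each action by whether its return beats the value estimate, and train the policy conditioned on that label so that, at deployment, conditioning on "better than expected" steers behavior toward the good experience. The snippet below is a minimal numpy sketch under these assumptions; the function names and the Monte-Carlo critic are illustrative, not Physical Intelligence's actual implementation.

```python
# Minimal sketch of advantage-conditioned labeling in the spirit of RECAP.
# All names here are illustrative assumptions, not the real π*0.6 API.
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_labels(rewards, states, value_fn, gamma=0.99):
    """Binary flag per step: did this action do better than the critic expected?"""
    returns = monte_carlo_returns(rewards, gamma)
    baselines = np.array([value_fn(s) for s in states])
    advantages = returns - baselines
    return (advantages > 0).astype(np.int64)  # 1 = better than expected, 0 = worse

if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]          # toy sparse-reward episode
    states = [0, 1, 2]
    value_fn = lambda s: 0.5           # stand-in critic
    print(advantage_labels(rewards, states, value_fn))
```

Per the summary above, the value function in the full system is trained on demonstrations, autonomous rollouts, and expert corrections; the policy is conditioned on the advantage flag during training and on the "good" flag at deployment.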
Li Auto's Adjustment of Its Autonomous Driving Sub-Departments from 3 to 11 Is a Secondary Contradiction
理想TOP2· 2025-09-22 16:56
Core Viewpoints
- The role of Li Xiang in Li Auto's autonomous driving can be closely compared to Elon Musk's role in Tesla's autonomous driving: expanding resources, ensuring continuous investment, and being able to understand AI fundamentals and participate in technical discussions [1][2][3]
- The main contradiction in Li Auto's autonomous driving development lies in the global AI industry's stage of development, the matching of various production factors, and Li Xiang's own capabilities [1][5]

Group 1: Resource Management
- Li Xiang's core functions include expanding resources, ensuring sustained investment, and being able to make critical judgments about the company's long-term direction and technology roadmap [3][4]
- The adjustment of Li Auto's autonomous driving sub-departments from 3 to 11 is a secondary contradiction within the broader issue of resource matching [2]

Group 2: Iteration and Development
- Li Auto is expected to go through multiple high-quality, rapid iterations in the next 1-12 months thanks to a clear iteration direction [2][6]
- Improving simulation data quality and leveraging the computing power of vehicles already on the road are seen as crucial for developing autonomous driving capabilities [6][7]

Group 3: AI and Organizational Structure
- Successfully implementing physical AI is essential for Li Auto to excel in autonomous driving, which requires a leader who can make key judgments and adapt the organizational structure accordingly [6][8]
- Having talent aligned with future needs matters more than past achievements; the right fit is more critical than resumes [11]
A Generative Perspective Reshapes Supervised Learning: Labels Are Not Just Answers, but Learning Guides | ICML 2025
量子位· 2025-06-24 13:36
Core Viewpoint
- A new supervised-learning paradigm called Predictive Consistency Learning (PCL) is introduced, which redefines the role of labels as auxiliary references rather than just standard answers for comparison [1][5].

Group 1: Training Process Overview
- PCL aims to capture complex label representations by progressively decomposing label information, allowing the model to predict complete labels from partial label hints [5][6].
- Training maps noisy labels back to true labels, with noise levels controlled by time steps, while keeping predictions consistent across different noise levels [7][8].

Group 2: Noise Process
- The noise process for discrete labels is modeled with a categorical distribution, while continuous labels follow a Gaussian diffusion model that introduces noise progressively [9][11] (illustrated in the sketch after this summary).
- When labels are too complex, PCL introduces Gaussian noise directly into the latent embedding space, matching the continuous-label noise process [11].

Group 3: Testing Process Overview
- After training, the model can predict efficiently by sampling from a random noise distribution, achieving results that surpass traditional supervised learning even without label hints [14][28].
- A multi-step inference strategy refines predictions: previous predictions are perturbed with noise and serve as hints for subsequent predictions [14][28].

Group 4: Information Theory Perspective
- PCL proposes a structured learning process that captures information gradually, allowing the model to learn from noisy labels while minimizing its dependence on them [15][18].
- The model's objective is to minimize dependence on the noise condition, keeping predictions consistent across varying noise levels [19].

Group 5: Experimental Results
- PCL demonstrates significant accuracy gains over traditional supervised learning across tasks including image segmentation, graph-based prediction, and language modeling [20][25][30].
- In image segmentation, PCL outperforms traditional methods in single-step prediction and continues to improve with additional prediction steps [22][28].
- More inference steps can capture finer detail but also risk error accumulation, so the number of steps must be balanced [26][28].
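For the discrete-label case described in Groups 2 and 3, the corruption and iterative-refinement loop can be sketched in a few lines: with probability tied to the time step, a label is replaced by a random class, and at test time the model's own prediction is re-noised and fed back as the hint for the next step. The snippet below is a minimal numpy sketch; the linear noise schedule, the function names, and the stand-in model are assumptions for illustration, not the paper's reference implementation.

```python
# Minimal sketch of PCL-style categorical label noising and multi-step inference.
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(labels, t, num_classes):
    """Categorical noising: with probability t, replace a label by a uniform class.

    t in [0, 1] plays the role of the time step: t=0 keeps the clean label,
    t=1 yields a pure-noise hint.
    """
    mask = rng.random(labels.shape) < t
    random_labels = rng.integers(0, num_classes, size=labels.shape)
    return np.where(mask, random_labels, labels)

def multi_step_inference(model, x, num_classes, steps=4):
    """Iterative refinement: each prediction, re-noised, becomes the next hint."""
    hint = rng.integers(0, num_classes, size=x.shape[0])  # start from pure noise
    for k in range(steps):
        t = 1.0 - (k + 1) / steps          # anneal the noise level toward 0
        pred = model(x, hint, t)           # model predicts the complete label
        hint = corrupt_labels(pred, t, num_classes)
    return pred

# Toy usage with a stand-in two-class "model" that ignores the hint:
toy_model = lambda x, hint, t: (x > 0).astype(int)
x = np.array([-1.0, 0.5, 2.0])
print(multi_step_inference(toy_model, x, num_classes=2))
```

During training, the model would be asked to predict the clean labels from (input, noisy hint, time step) so that its predictions stay consistent across noise levels; the inference loop above then anneals the noise toward zero.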
Microsoft VP Holds "Class" on X with a Running Series on Everything RL: A Must-Read for LLM Practitioners
机器之心· 2025-05-26 01:28
Core Viewpoint
- The article covers the educational series on artificial intelligence initiated by Nando de Freitas, focusing on reinforcement learning (RL) and its applications to large language models (LLMs) [1][2].

Summary by Sections

Introduction to AI Education
- Nando de Freitas aims to educate readers on AI through a series of posts on X, starting with reinforcement learning and gradually covering diffusion and flow matching [1][2].

Learning Types
- The series notes that the debate over unsupervised learning, supervised learning, and reinforcement learning has no final verdict [8][19].
- Supervised learning is described as basic imitation, requiring high-quality expert data to learn effectively [9].
- Reinforcement learning focuses on selective imitation, allowing agents to learn from suboptimal experiences and improve on them [10][11].

Distributed Reinforcement Learning Systems
- Modern distributed RL systems consist of two main components, Actors and Learners: Actors interact with the environment and collect data, while Learners update the policy network based on that data [23][24].
- Measuring operation durations and communication bandwidth in such systems is emphasized as essential [24][27].

Offline Reinforcement Learning
- Offline RL has unique value in scenarios such as post-training LLMs, where it can leverage historical data for learning [28][29].

Single-step and Multi-step RL
- The series distinguishes single-step from multi-step RL problems: single-step focuses on an immediate action, while multi-step involves planning over a series of interactions [35][39].
- Multi-step RL is noted to be harder, particularly because of the credit-assignment problem, where multiple decisions jointly determine the outcome [40][41].

Policy Gradient and Techniques
- Policy gradient methods are discussed, including baseline subtraction to reduce the variance of reward signals [49][56].
- The significance of a KL-divergence term for staying close to the supervised fine-tuned policy during post-training is also covered [69].

Importance Sampling and PPO
- Importance sampling is introduced as a way to correct off-policy sample bias, with Proximal Policy Optimization (PPO) as the key technique for constraining policy updates [73][78] (see the sketch after this summary).
- The integration of these techniques in training models like DeepSeek-R1 is highlighted, showcasing the complexity of modern RL pipelines [81].

Future Directions
- Freitas plans to expand the discussion from single-step to multi-step RL, indicating that the series is still developing [82].
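The pieces listed under "Policy Gradient and Techniques" and "Importance Sampling and PPO" compose into one post-training objective: baseline-subtracted rewards for variance reduction, an importance-sampling ratio between the new and data-collecting policies, PPO-style clipping of that ratio, and a KL term that keeps the policy near the supervised fine-tuned (SFT) model. The sketch below shows these pieces for a single-step (bandit-style) batch; the shapes, clipping epsilon, and KL coefficient are illustrative assumptions, and the KL term is a crude per-sample estimate rather than an exact divergence.

```python
# Minimal sketch of a PPO-style post-training loss with baseline subtraction
# and a KL penalty toward the SFT policy. Values are illustrative only.
import numpy as np

def ppo_style_loss(logp_new, logp_old, logp_sft, rewards, clip_eps=0.2, kl_coef=0.1):
    """All inputs are 1-D arrays: per-sample log-probs and scalar rewards."""
    advantages = rewards - rewards.mean()                 # baseline subtraction reduces variance
    ratio = np.exp(logp_new - logp_old)                   # importance-sampling correction
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -np.minimum(unclipped, clipped).mean()  # pessimistic PPO surrogate
    kl_penalty = (logp_new - logp_sft).mean()             # rough estimate of KL(new || SFT)
    return policy_loss + kl_coef * kl_penalty

# Toy call with made-up numbers, just to show the shapes involved:
logp_new = np.array([-1.0, -0.8, -1.2])
logp_old = np.array([-1.1, -0.9, -1.0])
logp_sft = np.array([-1.2, -1.0, -1.1])
rewards  = np.array([ 1.0,  0.0,  0.5])
print(ppo_style_loss(logp_new, logp_old, logp_sft, rewards))
```

In a multi-step setting the scalar rewards would be replaced by per-step advantages, which is where the credit-assignment difficulties mentioned above come in.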
Rejected 11 Years Ago, DSN Wins a Test of Time Award; Author Saining Xie: Rejection ≠ Academic Death Sentence
量子位· 2025-05-06 04:24
Core Viewpoint
- The article discusses the recognition of the paper "Deeply-Supervised Nets" (DSN) with a Test of Time Award at AISTATS 2025, highlighting its long-term impact on deep learning and computer vision despite being rejected by NeurIPS over a decade ago [1][5][21].

Group 1: Paper Background and Development
- DSN was submitted in September 2014 and aimed to address issues in deep learning related to hidden-layer feature learning and classification performance [2][12].
- The intermediate-layer supervision proposed in DSN has been developed further in subsequent works by author Saining Xie, such as REPA and U-REPA, showing an evolution from single-model optimization to cross-model knowledge transfer [3][4].

Group 2: Technical Contributions
- DSN addresses three major pain points of traditional convolutional neural networks (CNNs): vanishing gradients, feature robustness, and training efficiency [14][15].
- Introducing auxiliary classifiers at hidden layers strengthens gradient signals, improves the discriminative power of shallow features, and accelerates training convergence; the cited results include roughly 30% faster convergence for ResNet-50 on CIFAR-10 and a 2.1% gain in Top-1 accuracy [15][17] (a toy sketch of the idea follows this summary).

Group 3: Recognition and Impact
- The paper has been cited over 3,000 times on Google Scholar, indicating its significant influence on the field [18].
- The AISTATS Test of Time Award recognizes the paper as seminal work that laid the foundation for subsequent research, comparable to the impact of GANs and Seq2Seq models in their respective areas [22][23].

Group 4: Personal Reflections and Insights
- Saining Xie reflects on the initial NeurIPS rejection, emphasizing the importance of perseverance in an academic career and the value of a strong support system [25][26].
- The article encourages researchers to view rejections as opportunities for improvement, citing other significant works that were initially rejected but later recognized [30][31].
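The deep-supervision idea in Group 2 (companion losses from auxiliary classifiers attached to hidden layers) is easy to show in a few lines of PyTorch. The sketch below is illustrative only: the layer sizes, the single auxiliary head, and the 0.3 companion-loss weight are assumptions, not the configuration from the DSN paper.

```python
# Minimal sketch of deep supervision: an auxiliary classifier on a hidden layer
# adds a companion loss, giving early layers a direct gradient path.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.main_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        return self.main_head(h2), self.aux_head(h1)  # final logits + hidden-layer logits

def dsn_loss(main_logits, aux_logits, targets, aux_weight=0.3):
    # Companion loss on the hidden layer supplies the extra gradient signal.
    return F.cross_entropy(main_logits, targets) + aux_weight * F.cross_entropy(aux_logits, targets)

# Toy forward/backward pass:
model = DeeplySupervisedCNN()
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
loss = dsn_loss(*model(x), y)
loss.backward()
```

The auxiliary head is typically discarded at inference time; it exists only to shape training, which is why the technique improves convergence without changing the deployed architecture.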