The Autoregressive Paradigm

A Pure VLA Survey Is Here! From VLMs to Diffusion to Reinforcement Learning
自动驾驶之心 · 2025-09-30 16:04
1. Introduction. Robotics has long been an important area of scientific research. Early robots relied mainly on pre-programmed instructions and hand-designed control policies to decompose and execute tasks. Such methods were typically applied to simple, repetitive work such as factory assembly lines and logistics sorting. In recent years, the rapid development of artificial intelligence has enabled researchers to exploit deep learning's feature-extraction and trajectory-prediction capabilities across multimodal data such as images, text, and point clouds. By combining perception, detection, tracking, and localization, researchers decompose robotic tasks into multiple stages to meet execution requirements, driving progress in embodied intelligence and autonomous driving. However, most robots still exist as isolated agents: they are usually designed for specific tasks and lack effective interaction with humans and the external environment.

To overcome these limitations, researchers have begun introducing large language models (LLMs) and vision-language models (VLMs) into robotic manipulation to achieve more precise and flexible control. Modern manipulation methods typically rely on vision-language generation paradigms (such as autoregressive or diffusion models), combined with large-scale datasets and advanced fine-tuning strategies. The survey refers to these methods as VLA foundation models; they significantly improve the quality of robotic manipulation. Fine-grained action control over generated content gives users greater flexibility, unlocking VLA's practical potential in task execution. Title: Pure Vision Language Action (VLA) ...
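The autoregressive generation paradigm mentioned above can be sketched as next-token prediction over a discrete action vocabulary: each action token is chosen conditioned on the observation and all previously emitted tokens. The `policy` function below is a hypothetical stand-in for a real VLM backbone; the vocabulary size, `EOS` token, and greedy decoding are illustrative assumptions, not details from the survey.

```python
import numpy as np

VOCAB = 8   # toy action-token vocabulary size (assumed)
EOS = 0     # hypothetical end-of-sequence token

def policy(obs: np.ndarray, prefix: list[int]) -> np.ndarray:
    """Toy logits; a real VLA would run a VLM over image + text + prefix."""
    rng = np.random.default_rng(42 + len(prefix))
    return rng.normal(size=VOCAB)

def decode_actions(obs: np.ndarray, max_len: int = 16) -> list[int]:
    """Greedy autoregressive decoding of a discrete action sequence."""
    tokens: list[int] = []
    for _ in range(max_len):
        logits = policy(obs, tokens)
        nxt = int(np.argmax(logits))   # pick the next action token
        if nxt == EOS:                 # stop when the model emits EOS
            break
        tokens.append(nxt)
    return tokens

obs = np.zeros((4, 4))                 # placeholder "observation"
actions = decode_actions(obs)
print(actions)
```

Diffusion-based VLA heads would instead denoise a continuous action chunk in a fixed number of steps rather than emitting tokens one at a time.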
A Pure VLA Survey Is Here! From VLMs to Diffusion to Reinforcement Learning
具身智能之心 · 2025-09-30 04:00
Author: Dapeng Zhang et al.
The Image Tokenizer Rebels! Huawei's Selftok: An Autoregressive Core That Unifies Diffusion Models and Triggers Autonomous Pixel Reasoning
机器之心 · 2025-05-17 06:00
Core Viewpoint
- The article discusses the breakthrough of Huawei's Pangu multimodal generation team in transforming visual data into discrete tokens, aiming to replicate the success of large language models (LLMs) in the visual domain through a novel approach called Selftok [1][5].

Group 1: Selftok Breakthrough
- Selftok integrates autoregressive (AR) principles into visual tokenization, converting pixel streams into discrete sequences that respect causal relationships [1][3].
- The initial Selftok paper was recognized as a Best Paper Candidate at CVPR 2025, highlighting its significance in the field [3].

Group 2: Industry Consensus and Challenges
- The current industry consensus is that LLMs face a language-data bottleneck, while non-language data such as images and videos holds significant development potential [5].
- A unified multimodal architecture is seen as key to unlocking stronger emergent capabilities in AI, with the main challenge being the conversion of continuous visual signals into discrete tokens [5].

Group 3: Advantages of Discrete Visual Tokens
- The article argues for abandoning spatial priors in favor of discrete visual tokens, which maintain high accuracy while avoiding the pitfalls of continuous representations [6].
- Continuous representations are criticized for poor prediction stability, increased complexity in reinforcement learning, and limited decoupling capability [6].

Group 4: Selftok Architecture
- Selftok's architecture consists of an encoder, a quantizer, and a decoder, using a dual-stream structure to enhance computational efficiency while maintaining reconstruction quality [18][20].
- The quantizer employs a unique mechanism to address traditional training imbalances, achieving a unified treatment of diffusion processes and autoregressive modeling [20].
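The discretization step shared by encoder-quantizer-decoder tokenizers can be illustrated with a generic nearest-neighbour vector quantizer. This is only a minimal sketch of the general technique: Selftok's actual quantizer differs (the article describes a mechanism that unifies diffusion and AR training), and the codebook size and latent dimension below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 code vectors of dimension 4 (assumed)

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map each latent row to the index of its nearest codebook entry."""
    # squared distances between every latent and every code vector
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)          # discrete token ids

def dequantize(tokens: np.ndarray) -> np.ndarray:
    """Decoder input: replace each token id with its code vector."""
    return codebook[tokens]

latents = rng.normal(size=(6, 4))     # toy encoder output: 6 latent vectors
tokens = quantize(latents)
recon = dequantize(tokens)
print(tokens.shape, recon.shape)      # (6,) (6, 4)
```

Once pixels are reduced to such token ids, the same next-token machinery used for text applies unchanged, which is the unification the article is after.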
Group 5: Training and Optimization
- Selftok's pre-training phase aligns multimodal data inputs to transition the model from an LLM to a vision-language model (VLM) [24].
- The model is optimized with reinforcement learning (RL) algorithms, using two types of reward functions that evaluate the attributes and spatial relationships of the generated images [25][27].

Group 6: Experimental Results
- Selftok achieves state-of-the-art (SoTA) results across reconstruction metrics on ImageNet, demonstrating superior detail preservation compared to other tokenizers [28].
- In benchmark evaluations, Selftok-Zero significantly outperformed models such as GPT-4o, scoring 92 on the GenEval benchmark [29][30].
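The two-part RL reward described in Group 5 can be sketched as a weighted sum of an attribute score and a spatial-relation score. Everything below is a hypothetical placeholder: the real system evaluates rendered images with learned or vision-based checkers, not dictionary comparisons, and the function names and weights are assumptions for illustration.

```python
def attribute_reward(pred_attrs: dict, target_attrs: dict) -> float:
    """Fraction of requested attributes (e.g. color, count) that match."""
    hits = sum(pred_attrs.get(k) == v for k, v in target_attrs.items())
    return hits / max(len(target_attrs), 1)

def spatial_reward(pred_rel: str, target_rel: str) -> float:
    """1.0 if the predicted spatial relation (e.g. 'left-of') matches."""
    return float(pred_rel == target_rel)

def total_reward(pred: dict, target: dict,
                 w_attr: float = 0.5, w_spat: float = 0.5) -> float:
    """Weighted combination of the two reward terms."""
    return (w_attr * attribute_reward(pred["attrs"], target["attrs"])
            + w_spat * spatial_reward(pred["rel"], target["rel"]))

pred = {"attrs": {"color": "red", "count": 2}, "rel": "left-of"}
target = {"attrs": {"color": "red", "count": 3}, "rel": "left-of"}
print(total_reward(pred, target))  # 0.75
```

A scalar reward of this shape is what standard RL algorithms would then maximize over the model's generations.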