Vision-Language Models (VLM)
NVIDIA's Jim Fan: robotics is still in a state of chaos, and even its direction of development may be wrong
硬AI · 2025-12-29 14:24
Jim Fan says hardware reliability has become the biggest obstacle to software iteration, that the absence of industry standards has left the evaluation landscape in disarray, and that the currently dominant vision-language-action (VLA) technical route "feels wrong": its pretraining approach, built on vision-language models (VLM), is fundamentally misaligned with what robots actually need. He says he is instead betting on video world models as the alternative.

Author | Zhao Ying  Editor | 硬AI

Recently, Jim Fan, head of NVIDIA's robotics business and co-lead of the GEAR lab, published a long post on social media sharply criticizing the state of the robotics industry. In his view, despite notable progress in hardware, the industry as a whole remains chaotic in software iteration, standards setting, and the choice of technical route.

Jim Fan argues that the dominant VLA route "feels wrong": its VLM-based pretraining is fundamentally misaligned with robots' actual needs, and he is betting on video world models as the alternative.

The statement has drawn attention across the industry. Against the backdrop of rapid progress elsewhere in AI, these foundational problems underscore how far robotics still is from commercial deployment, which may affect investors' valuation expectations for related companies.

Jim Fan summed up three lessons he learned in robotics in 2025, covering hardware reliability, industry standards, and the technical route, offering a first-hand view of the industry's current bottlenecks ...
NVIDIA's Jim Fan: robotics is still in a state of chaos, and even its direction of development may be wrong
Hua Er Jie Jian Wen · 2025-12-29 03:47
(The lead paragraphs of this item repeat the 硬AI article above verbatim; the unique content follows.)

Hardware reliability has become the biggest obstacle to software iteration

The absence of industry standards has left the evaluation system in disarray

Jim Fan called the state of benchmarking in robotics an "epic disaster." Unlike the large-language-model field, which has converged on shared standards such as MMLU and SWE-Bench, the robotics industry has reached no consensus on hardware platforms, task definitions, scoring criteria, or simulator versus real-world setups ...
World models and VLA are gradually converging toward unification
自动驾驶之心· 2025-12-11 03:35
Core Viewpoint
- The integration of Vision-Language-Action (VLA) and World Model (WM) technologies is becoming increasingly evident, suggesting a trend toward unification rather than opposition in autonomous driving [3][5][7].

Group 1: Technology Trends
- VLA and WM are complementary technologies: VLA focuses on abstract reasoning and WM on physical perception, and both are essential for achieving advanced artificial general intelligence (AGI) [4].
- Recent academic work has demonstrated the feasibility of combining VLA and WM, with projects such as DriveVLA-W0 showcasing successful joint training [4].
- The future training pipeline for Level 4 (L4) autonomous systems is expected to incorporate VLA, reinforcement learning (RL), and WM, indicating that all three components are necessary [5].

Group 2: Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community provides a comprehensive platform for learning and sharing knowledge in the autonomous driving sector, with over 4,000 members and plans to expand to nearly 10,000 [10][28].
- The community offers a variety of resources, including video content, learning routes, and Q&A sessions, aimed at both beginners and advanced practitioners [10][12].
- A compilation of over 40 technical routes and numerous autonomous-driving datasets is available, giving newcomers and experienced professionals quicker access to essential information [29][48].

Group 3: Job Opportunities and Networking
- The community has established a job-referral mechanism with various autonomous driving companies, letting members connect with potential employers easily [22].
- Regular discussions and insights from industry leaders give members valuable perspectives on career development and industry trends [14][107].
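The joint-training idea above (the article names DriveVLA-W0 but gives no implementation details) can be sketched in a few lines: one shared encoder feeds both a VLA-style action head and a world-model prediction head, and the two losses are summed. This is a toy pure-Python illustration under my own assumptions; every function name and number is a stand-in, not DriveVLA-W0's actual code.

```python
def shared_encode(obs):
    # Stand-in for a shared vision backbone: here just a scaled copy.
    return [0.5 * x for x in obs]

def action_head(z):
    # VLA branch: latent -> a single steering-like action (toy mean readout).
    return sum(z) / len(z)

def world_model_head(z):
    # WM branch: predict the next observation from the latent (toy doubling).
    return [2.0 * x for x in z]

def joint_loss(obs, next_obs, expert_action, wm_weight=0.5):
    # One shared representation feeds both heads; the losses are summed,
    # which is the structural point of VLA + WM joint training.
    z = shared_encode(obs)
    l_vla = (action_head(z) - expert_action) ** 2  # imitation (action) loss
    l_wm = sum((p - t) ** 2
               for p, t in zip(world_model_head(z), next_obs)) / len(next_obs)
    return l_vla + wm_weight * l_wm

loss = joint_loss(obs=[1.0, 2.0], next_obs=[1.0, 2.0], expert_action=1.5)
```

In a real system the two heads would share gradients through the backbone, so the world-model objective regularizes the representation the action head plans from.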
New from SJTU! An end-to-end & VLA survey: a unified perspective under a generalized paradigm
自动驾驶之心· 2025-12-11 00:05
Core Viewpoint
- The article discusses the evolution of autonomous driving technology, emphasizing the need for a unified perspective on the various paradigms, including end-to-end (E2E), VLM-centric, and hybrid approaches, to enhance understanding and performance in complex driving scenarios [2][4][14].

Group 1: Introduction and Background
- Traditional modular approaches in autonomous driving have led to information loss and error accumulation due to task fragmentation, prompting a shift toward data-driven end-to-end architectures [5][10].
- The article introduces a comprehensive review titled "Survey of General End-to-End Autonomous Driving: A Unified Perspective," which aims to bridge the gap in understanding between different paradigms [3][4].

Group 2: Paradigms of Autonomous Driving
- General End-to-End (GE2E) is defined as any model that processes raw sensor inputs into planning trajectories or control actions, regardless of whether it includes visual-language models (VLM) [4][14].
- The three main paradigms unified under GE2E are:
  - Conventional End-to-End (E2E), which relies on structured scene representation for precise trajectory planning [9][17].
  - VLM-centric End-to-End, which uses pre-trained visual-language models to enhance generalization and reasoning in complex scenarios [11][33].
  - Hybrid End-to-End, which combines the strengths of both to balance high-level semantic understanding with low-level control precision [12][39].

Group 3: Performance Comparison
- In open-loop tests, the hybrid paradigm outperformed the others, demonstrating the importance of world knowledge in handling long-tail scenarios [54].
- Traditional E2E methods still dominate in numerical trajectory-prediction accuracy, indicating their robustness in structured environments [54].
- In closed-loop tests, traditional methods maintain a stronghold, particularly in complex driving tasks, while VLA methods show potential but require further refinement in fine-grained trajectory control [55][56].

Group 4: Data and Learning Strategies
- The evolution of datasets from geometric annotations to semantically rich datasets is crucial for training models capable of logical reasoning and understanding complex traffic contexts [46][48].
- Chain-of-Thought (CoT) annotations in datasets support advanced reasoning tasks, moving beyond simple input-output mappings [47].

Group 5: Model Architecture and Details
- The article provides a detailed comparison of mainstream model architectures, including their inputs, backbone networks, intermediate tasks, and output forms, to clarify the distinctions among paradigms [57].
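The GE2E definition above (raw sensor frames in, a trajectory out, regardless of internals) can be sketched as one shared interface with a stub per paradigm. The class names, the toy caption, and the straight-line trajectory below are my own illustration, not code from the survey.

```python
class GE2EDriver:
    """The survey's GE2E contract: raw sensor frames -> planned trajectory."""
    def plan(self, sensor_frames: list) -> list:
        raise NotImplementedError

class ConventionalE2E(GE2EDriver):
    # Structured scene representation -> numeric trajectory (toy straight line).
    def plan(self, sensor_frames):
        return [(float(t), 0.0) for t in range(1, 4)]

class VLMCentricE2E(GE2EDriver):
    # A VLM reasons about the scene in language first, then a trajectory
    # is decoded from that reasoning (both steps stubbed here).
    def plan(self, sensor_frames):
        caption = "clear road ahead"  # stand-in for VLM scene reasoning
        return [(float(t), 0.0) for t in range(1, 4)] if "clear" in caption else []

class HybridE2E(GE2EDriver):
    # Semantic understanding from the VLM branch, numeric refinement from
    # the conventional branch (refinement step left as a stub).
    def plan(self, sensor_frames):
        coarse = VLMCentricE2E().plan(sensor_frames)
        return [(x, y) for (x, y) in coarse]
```

All three satisfy the same `plan` contract, which is what lets the survey compare them under a single "generalized" paradigm.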
A new take on RL: Fudan uses games to boost VLMs' general reasoning, matching the performance of geometric data
36Kr · 2025-10-22 02:17
Core Insights
- Fudan University's NLP lab developed Game-RL, which uses games to enrich visual elements and generate multimodal, verifiable reasoning data, enhancing the reasoning capabilities of vision-language models (VLM) [1][28]
- The innovative Code2Logic method systematically synthesizes game-task data, creating the GameQA dataset, which demonstrates the advantages of game data in complex reasoning training [1][28]

Game-RL and Code2Logic
- Game-RL constructs multimodal, verifiable game tasks for VLM reinforcement training, addressing the limits of existing RL approaches that focus mainly on geometric or chart reasoning [1][28]
- The Code2Logic method leverages game code to systematically generate reasoning data, in three core steps: game code construction, task and QA template design, and data engine construction [11][8]

GameQA Dataset
- The GameQA dataset comprises 4 cognitive-ability categories, 30 games, 158 reasoning tasks, and 140,000 question-answer pairs, with tasks split into three difficulty levels [13][15]
- GameQA's diverse game tasks give a competitive edge in training models for general reasoning, matching the performance of traditional geometric datasets despite fewer training samples [19][20]

Training Outcomes
- Training with GameQA improved four open-source VLMs on seven out-of-domain general visual-language reasoning benchmarks, with Qwen2.5-VL-7B showing an average improvement of 2.33% [17][18]
- GameQA's cognitive diversity and reasoning complexity demonstrate its generalizability and transferability, making it a valuable resource for enhancing VLM capabilities [20][19]

Scaling Effects
- Increasing the GameQA dataset to 20,000 samples produced consistent improvements on general reasoning benchmarks [21][24]
- Expanding the variety of games used in training enhances out-of-domain generalization, underscoring the importance of diverse training data [22][24]

Conclusion
- The research introduces Game-RL and the Code2Logic method, expanding the reinforcement-training domain for VLMs into gaming scenarios, and validates that Game-RL can enhance general reasoning capabilities [28][1]
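The three Code2Logic steps summarized above (game code, QA template, data engine) can be sketched as a toy pipeline. The grid game, the template wording, and all function names are hypothetical stand-ins of my own, not the paper's actual code; the point is only that each answer is computed by the program, so every QA pair is automatically verifiable.

```python
import random

def play_grid_game(seed):
    # Step 1: game code that deterministically produces a game state.
    rng = random.Random(seed)
    return {"coins": rng.randint(1, 5), "moves": rng.randint(1, 5)}

# Step 2: a QA template over fields of that state.
QA_TEMPLATE = ("After {moves} moves the player holds {coins} coins. "
               "How many coins per move on average?")

def data_engine(n):
    # Step 3: run the game n times and emit (question, verifiable answer) pairs.
    pairs = []
    for seed in range(n):
        state = play_grid_game(seed)
        question = QA_TEMPLATE.format(**state)
        answer = state["coins"] / state["moves"]  # computed, hence checkable
        pairs.append((question, answer))
    return pairs

pairs = data_engine(3)
```

Because the answer is derived from the same code that rendered the question, a reward model for RL can score responses exactly, with no human labeling.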
48 executive changes in the auto industry in one month: a new round of transformation is beginning ...
自动驾驶之心· 2025-09-25 03:45
Group 1
- The automotive industry is undergoing a new round of transformation, with significant executive changes at companies including Li Auto, BYD, and Changan Automobile [1]
- The autonomous driving sector is evolving rapidly, with focus shifting from traditional methods to new algorithms and models, demanding continuous learning and adaptation [2][3]
- The community is actively discussing the future of autonomous driving, exploring new article styles and hosting online events with industry leaders [3][6]

Group 2
- The community has built platforms for autonomous driving, embodied intelligence, and large models, aiming to find new opportunities amid constant change [3][4]
- A comprehensive resource within the community offers over 40 technical routes and answers practical questions about autonomous driving [5][8]
- The community provides a collaborative environment for beginners and advanced practitioners alike, facilitating knowledge sharing and networking [10][14]

Group 3
- Learning resources include video tutorials and structured learning paths for newcomers to the field [11][13]
- Regular discussions and Q&A sessions address industry questions, such as entry points into end-to-end autonomous driving and the applicability of multi-sensor fusion [17][19]
- The community aims to grow its membership significantly over the next two years, strengthening its role as a hub for technical exchange and career opportunities [3][19]
PhysicalAgent: a foundation world-model framework toward general cognitive robots
具身智能之心· 2025-09-22 00:03
Core Viewpoint
- The article discusses PhysicalAgent, a robotic control framework designed to overcome key limitations in the current robot-manipulation field, specifically the robustness and generalizability of vision-language-action (VLA) models and world-model-based methods [2][3].

Group 1: Key Bottlenecks and Solutions
- Current VLA models require task-specific fine-tuning, leading to a significant drop in robustness when switching robots or environments [2].
- World-model-based methods depend on specially trained predictive models, limiting their generalizability because training data must be carefully curated [2].
- PhysicalAgent aims to integrate iterative reasoning, diffusion video generation, and closed-loop execution to achieve cross-modal, cross-task general manipulation [2].

Group 2: Framework Design Principles
- The design keeps perception and reasoning modules independent of specific robot embodiments, requiring only lightweight skeletal-detection models for different robots [3].
- Video generation models have inherent advantages from pre-training on vast multimodal datasets, enabling quick integration without local training [5].
- The framework aligns with human-like reasoning, generating visual representations of actions from textual instructions alone [5].
- The architecture demonstrates cross-modal adaptability by generating different manipulation tasks for various robot embodiments without retraining [5].

Group 3: VLM as the Cognitive Core
- The VLM serves as the cognitive core of the framework, driving a multi-step process of instruction, environment interaction, and execution [6].
- The innovative step is redefining action generation as conditional video synthesis rather than direct control-policy learning [6].
- The robot adaptation layer is the only part requiring robot-specific tuning, converting generated action videos into motor commands [6].

Group 4: Experimental Validation
- Two experiments validated the framework's cross-modal generalization and iterative execution robustness [8].
- The first experiment compared the framework against task-specific baselines and tested generalization across robot embodiments [9].
- The second experiment assessed iterative execution on physical robots, demonstrating the effectiveness of the "Perceive → Plan → Reason → Act" pipeline [12].

Group 5: Key Results
- The framework achieved an 80% final success rate across tasks for both the bimanual UR3 and the humanoid G1 robots [13][16].
- First-attempt success rates were 30% for UR3 and 20% for G1, with an average of 2.25 and 2.75 iterations needed for success, respectively [16].
- The iterative correction process significantly improved task-completion rates, with the share of unfinished tasks dropping sharply after the first few iterations [16].
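The iterative "Perceive → Plan → Reason → Act" loop reported above can be sketched as a retry loop whose success rate climbs with iteration count (the article reports 80% after retries versus 20-30% on the first attempt). The function below is an illustrative assumption of mine, not PhysicalAgent's implementation; the perceive/plan/reason stages are stubbed as comments.

```python
def run_episode(attempt_succeeds, max_iters=4):
    """Closed-loop manipulation sketch: retry until success or budget runs out.

    attempt_succeeds is a toy outcome model mapping iteration number -> bool;
    it stands in for the real execute-and-verify step.
    """
    for i in range(1, max_iters + 1):
        # Perceive: observe the current scene state           (stubbed)
        # Plan:     VLM proposes the action as generated video (stubbed)
        # Reason:   VLM checks the plan against the goal       (stubbed)
        # Act:      execute on the robot and verify the outcome
        if attempt_succeeds(i):
            return i      # number of iterations the task needed
    return None           # task unfinished within the budget

# Toy outcome model: the attempt only works once enough corrections accumulate.
iters_needed = run_episode(lambda i: i >= 3)
```

The structural point matches the reported numbers: first-attempt success is rare, but because each failure feeds a fresh perception back into planning, most episodes finish within two to three iterations.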
Thinking of switching to embodied AI, but still hesitating ...
自动驾驶之心· 2025-09-12 16:03
Core Viewpoint
- The article discusses ongoing developments and challenges in the autonomous driving industry, emphasizing community engagement and knowledge sharing among professionals and enthusiasts [1][5].

Group 1: Community Engagement
- The "Autonomous Driving Heart Knowledge Planet" is a comprehensive community for sharing knowledge, resources, and job opportunities related to autonomous driving, aiming to grow to nearly 10,000 members within the next two years [5][15].
- The community has over 4,000 members and offers video content, learning routes, and Q&A sessions for both beginners and advanced practitioners [5][11].

Group 2: Technical Discussions
- Key topics include the transition from rule-based systems to end-to-end learning, the potential of embodied intelligence versus intelligent driving, and which companies currently excel in smart-driving technology [2][3][19].
- The community has compiled over 40 technical routes covering aspects of autonomous driving including perception, simulation, and planning and control [15][27].

Group 3: Industry Trends
- The article highlights ongoing shifts in the industry, such as the exploration of end-to-end algorithms and the importance of data loops in enhancing autonomous driving capabilities [2][19].
- On employment, hardware-related positions are discussed as more stable than the rapidly evolving software roles in the sector [2][19].

Group 4: Learning Resources
- Structured learning paths for newcomers include comprehensive guides on technical stacks and practical applications in autonomous driving [11][15].
- Members can access datasets, open-source projects, and insights from industry leaders to support their learning and career development [27][28].
Fei-Fei Li's answer: after large models, where do Agents go next?
36Kr · 2025-09-04 08:28
Core Insights
- The latest paper by Fei-Fei Li delineates the boundaries of, and establishes paradigms for, the currently trending Agent field, with major players like Google, OpenAI, and Microsoft aligning their strategies with the proposed capability stack [1][4]
- The paper introduces a comprehensive cognitive-loop architecture encompassing perception, cognition, action, learning, and memory, forming a dynamically iterating system for intelligent agents, which is not only a technological integration but a systematic vision for the future of AGI [1][5]
- Large models are identified as the core engine driving Agents, while environmental interaction is crucial for addressing hallucination and bias, emphasizing the need for real or simulated feedback to calibrate against reality and to incorporate ethical and safety mechanisms [1][3][11]

Summary by Sections

1. Agent AI's Core: A New Cognitive Architecture
- The paper presents a novel Agent AI paradigm that is a forward-looking consideration of the development path toward AGI, rather than a mere assembly of existing technologies [5]
- It defines five core modules: Environment and Perception, Cognition, Action, Learning, and Memory, which together create a complete, interactive cognitive loop for intelligent agents [5][10]

2. How Large Models Drive Agent AI
- The Agent AI framework is made possible by the maturity of large foundation models, particularly LLMs and VLMs, which serve as the basis for Agents' cognitive capabilities [11][12]
- LLMs and VLMs have internalized vast amounts of common and specialized knowledge, enabling Agents to perform zero-shot planning effectively [12]
- The paper highlights the challenge of hallucination, where models may generate inaccurate content, and proposes environmental interaction as a key anchor to mitigate it [13]

3. Application Potential of Agent AI
- The paper explores significant application potential in three cutting-edge fields: gaming, robotics, and healthcare [14][19]
- In gaming, Agent AI can transform NPC behavior, allowing meaningful interactions and dynamic adjustment to player actions, enhancing immersion [15]
- In robotics, Agent AI lets users issue natural-language commands so robots can autonomously plan and execute complex tasks [17]
- In healthcare, Agent AI can serve as a medical chatbot for preliminary consultations and provide diagnostic suggestions, particularly in resource-limited settings [19][21]

4. Conclusion
- The paper acknowledges that Agent AI is still in its early stages and faces challenges in achieving deep integration across modalities and domains [22]
- It emphasizes the need for standardized evaluation metrics to guide development and measure technological progress in the field [22]
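The five-module cognitive loop described above can be sketched in a few lines, with the action's outcome fed back into memory, which in turn conditions the next round of cognition; that feedback is the paper's proposed anchor against hallucination. All module implementations below are toy stand-ins of my own for illustration, not the paper's architecture.

```python
def perceive(env):
    # Perception module: read the environment's observable state.
    return {"obs": env["state"]}

def cognize(obs, memory):
    # Cognition module: form a plan; past experience (memory) shapes it.
    return {"plan": obs["obs"] + len(memory)}

def act(env, decision):
    # Action module: change the environment, return the observed outcome.
    env["state"] = decision["plan"]
    return env["state"]

def learn(memory, outcome):
    # Learning + Memory modules: store the outcome for future cognition.
    memory.append(outcome)

def cognitive_loop(steps=3):
    env, memory = {"state": 0}, []
    for _ in range(steps):
        obs = perceive(env)
        decision = cognize(obs, memory)
        outcome = act(env, decision)
        learn(memory, outcome)
    return env["state"], memory
```

Note the loop is not a straight pipeline: because memory grows each step, the same observation leads to different decisions over time, which is what makes the architecture a dynamic iterative system rather than a fixed input-output mapping.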
Our 4,000-member autonomous driving community is enrolling for the back-to-school season!
自动驾驶之心· 2025-09-02 03:14
Core Viewpoint
- The article describes the establishment of a comprehensive community focused on autonomous driving technology, aiming to provide valuable resources and networking opportunities for both beginners and advanced learners [1][3][12].

Group 1: Community Structure and Offerings
- The community covers nearly 40 cutting-edge technology directions in autonomous driving, including multimodal large models, VLM, VLA, closed-loop simulation, world models, and sensor fusion [1][3].
- Members come from leading autonomous driving companies, top academic laboratories, and traditional robotics firms, creating a complementary dynamic between industry and academia [1][12].
- The community has over 4,000 members and aims to grow to nearly 10,000 within two years, serving as a hub for technical sharing and communication [3][12].

Group 2: Learning and Development Resources
- The community provides video content, articles, learning paths, and Q&A sessions to assist members in their learning journey [3][12].
- Nearly 40 technical routes are organized for members, covering autonomous driving from entry level to advanced topics [3][12].
- Members get practical answers to common questions, such as how to start with end-to-end autonomous driving and the learning paths for multimodal large models [3][12].

Group 3: Networking and Career Opportunities
- The community facilitates job referrals and connections with various autonomous driving companies, enhancing members' employment opportunities [8][12].
- Regular discussions with industry leaders and experts explore trends, technology directions, and mass-production challenges [4][12].
- Members are encouraged to discuss academic and engineering questions with each other, fostering a collaborative environment [12][54].

Group 4: Technical Focus Areas
- Extensive resources cover technical areas including 3DGS, NeRF, world models, and VLA, with insight into the latest research and applications [12][27][31].
- Specific learning paths are available for different aspects of autonomous driving, such as perception, simulation, and planning and control [12][13].
- A detailed overview of relevant open-source projects and datasets aids members in practical applications [24][25].