Vision-Language Models (VLM)
48 executive changes in the auto industry in one month; a new round of transformation is beginning...
自动驾驶之心· 2025-09-25 03:45
Group 1
- The automotive industry is undergoing a new round of transformation, with significant executive changes at companies including Li Auto, BYD, and Changan Automobile [1]
- The autonomous driving sector is evolving rapidly, with focus shifting from traditional methods to new algorithms and models, underscoring the need for continuous learning and adaptation [2][3]
- The community is actively discussing the future of autonomous driving, exploring new article styles and hosting online events with industry leaders [3][6]

Group 2
- The community has built platforms for autonomous driving, embodied intelligence, and large models, aiming to uncover new opportunities amid constant change [3][4]
- A comprehensive resource within the community offers over 40 technical routes and answers practical questions about autonomous driving [5][8]
- The community provides a collaborative environment for both beginners and advanced practitioners, facilitating knowledge sharing and networking [10][14]

Group 3
- Learning resources include video tutorials and structured learning paths for newcomers to autonomous driving [11][13]
- Regular discussions and Q&A sessions address industry questions, such as entry points into end-to-end autonomous driving and the applicability of multi-sensor fusion [17][19]
- The community aims to grow its membership significantly over the next two years, strengthening its role as a hub for technical exchange and career opportunities in the autonomous driving sector [3][19]
PhysicalAgent: A Foundational World Model Framework Toward General-Purpose Cognitive Robots
具身智能之心· 2025-09-22 00:03
Core Viewpoint
- The article presents PhysicalAgent, a robotic control framework designed to overcome key limitations in current robot manipulation, specifically the robustness and generalizability of vision-language-action (VLA) models and world-model-based methods [2][3]

Group 1: Key Bottlenecks and Solutions
- Current VLA models require task-specific fine-tuning, causing a significant drop in robustness when switching robots or environments [2]
- World-model-based methods depend on specially trained predictive models, which limits their generalizability because training data must be carefully curated [2]
- PhysicalAgent integrates iterative reasoning, diffusion video generation, and closed-loop execution to achieve cross-modal and cross-task general manipulation [2]

Group 2: Framework Design Principles
- Perception and reasoning modules remain independent of specific robot embodiments; each new robot requires only a lightweight skeletal detection model [3]
- Video generation models, pre-trained on vast multimodal datasets, can be integrated quickly without local training [5]
- The framework mirrors human-like reasoning, generating visual representations of actions from textual instructions alone [5]
- The architecture demonstrates cross-modal adaptability, generating manipulation plans for different robot embodiments without retraining [5]

Group 3: VLM as the Cognitive Core
- A VLM serves as the cognitive core of the framework, orchestrating a multi-step process of instruction understanding, environment interaction, and execution [6]
- The key innovation is recasting action generation as conditional video synthesis rather than direct control-policy learning [6]
- The robot adaptation layer is the only part requiring robot-specific tuning; it converts generated action videos into motor commands [6]

Group 4: Experimental Validation
- Two experiments validated the framework's cross-modal generalization and the robustness of iterative execution [8]
- The first experiment compared the framework against task-specific baselines and tested its ability to generalize across robot embodiments [9]
- The second experiment assessed iterative execution on physical robots, demonstrating the effectiveness of the "Perceive→Plan→Reason→Act" pipeline (a toy version is sketched at the end of this entry) [12]

Group 5: Key Results
- The framework achieved an 80% final success rate across tasks on both the bimanual UR3 and the humanoid G1 robot [13][16]
- First-attempt success rates were 30% for the UR3 and 20% for the G1, with an average of 2.25 and 2.75 iterations to success, respectively [16]
- Iterative correction substantially improved task completion, with the proportion of unfinished tasks dropping sharply after the first few iterations [16]
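The iterative "Perceive→Plan→Reason→Act" loop lends itself to a compact sketch. The following is a minimal illustration under stated assumptions: every component (`vlm_plan`, `generate_action_video`, `video_to_commands`, `verify`, `DummyRobot`) is a hypothetical stub standing in for the VLM, the diffusion video generator, and the per-robot adaptation layer; it is not the authors' released code.

```python
"""Minimal sketch of a PhysicalAgent-style closed loop.

All names below are hypothetical stand-ins: a real system would back them
with a VLM, a video diffusion model, and a per-robot adaptation layer.
"""
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Outcome:
    success: bool
    critique: str  # VLM explanation of what went wrong, fed to the next try


def vlm_plan(instruction: str, image: Any, critique: str) -> str:
    """Reason over the scene and prior feedback; return a textual plan."""
    return f"plan({instruction}; revised after: {critique or 'first attempt'})"


def generate_action_video(plan: str, image: Any) -> Any:
    """Action generation as conditional video synthesis, not policy learning."""
    return {"frames": [], "conditioned_on": plan}


def video_to_commands(video: Any) -> List[Any]:
    """Robot adaptation layer: the only robot-specific component. Runs
    lightweight skeletal detection on the generated video and retargets
    the motion to motor commands."""
    return []


def verify(instruction: str, image: Any) -> Outcome:
    """VLM check of whether the new scene satisfies the instruction.
    This stub always requests a retry."""
    return Outcome(success=False, critique="object not yet grasped")


def run_task(instruction: str, robot: Any, max_iters: int = 4) -> bool:
    critique = ""
    for _ in range(max_iters):                         # iterative correction
        image = robot.observe()                        # Perceive
        plan = vlm_plan(instruction, image, critique)  # Plan / Reason
        video = generate_action_video(plan, image)
        robot.execute(video_to_commands(video))        # Act
        outcome = verify(instruction, robot.observe())
        if outcome.success:
            return True
        critique = outcome.critique                    # retry with feedback
    return False


class DummyRobot:
    def observe(self) -> Any:
        return None

    def execute(self, commands: List[Any]) -> None:
        pass


if __name__ == "__main__":
    run_task("stack the red cube on the blue cube", DummyRobot())
```

The reported numbers (20-30% first-attempt success versus 80% final success) suggest most of the framework's value comes from this retry-with-critique path rather than from one-shot generation.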
Thinking about jumping to embodied intelligence, but still hesitating...
自动驾驶之心· 2025-09-12 16:03
Core Viewpoint
- The article discusses ongoing developments and challenges in the autonomous driving industry, emphasizing community engagement and knowledge sharing among professionals and enthusiasts in the field [1][5]

Group 1: Community Engagement
- The "Autonomous Driving Heart Knowledge Planet" is a comprehensive community for sharing knowledge, resources, and job opportunities related to autonomous driving, aiming to grow to nearly 10,000 members within the next two years [5][15]
- The community has over 4,000 members and offers video content, learning routes, and Q&A sessions for both beginners and advanced practitioners [5][11]

Group 2: Technical Discussions
- Key topics include the transition from rule-based systems to end-to-end learning, the prospects of embodied intelligence versus intelligent driving, and which companies currently excel in smart-driving technology [2][3][19]
- The community has compiled over 40 technical routes covering perception, simulation, planning and control, and other aspects of autonomous driving [15][27]

Group 3: Industry Trends
- The article highlights ongoing shifts such as the exploration of end-to-end algorithms and the importance of data loops for improving autonomous driving capability [2][19]
- Employment is a recurring theme, including the relative stability of hardware-related positions compared with rapidly evolving software roles in the sector [2][19]

Group 4: Learning Resources
- Structured learning paths for newcomers include comprehensive guides to the relevant technical stacks and practical applications in autonomous driving [11][15]
- Members can access datasets, open-source projects, and insights from industry leaders to support their learning and career development [27][28]
Fei-Fei Li's Answer: After Large Models, Where Do Agents Go?
36Kr· 2025-09-04 08:28
Core Insights
- Fei-Fei Li's latest paper delineates the boundaries and establishes paradigms for the currently booming Agent field, with major players such as Google, OpenAI, and Microsoft aligning their strategies with the proposed capability stack [1][4]
- The paper introduces a comprehensive cognitive-loop architecture spanning perception, cognition, action, learning, and memory, forming a dynamically iterating system for intelligent agents; this is not merely a technology integration but a systematic vision for the future of AGI [1][5]
- Large models are identified as the core engine driving Agents, while environmental interaction is crucial for addressing hallucination and bias; real or simulated feedback is needed to calibrate against reality and to incorporate ethics and safety mechanisms [1][3][11]

Summary by Sections
1. Agent AI's Core: A New Cognitive Architecture
- The paper presents a novel Agent AI paradigm as a forward-looking roadmap toward AGI rather than a mere assembly of existing technologies [5]
- It defines five core modules: environment and perception, cognition, action, learning, and memory, which together form a complete, interactive cognitive loop (a toy loop is sketched at the end of this entry) [5][10]

2. How Large Models Drive Agent AI
- The Agent AI framework is enabled by the maturity of large foundation models, particularly LLMs and VLMs, which supply the cognitive capabilities of Agents [11][12]
- LLMs and VLMs have internalized vast amounts of common and specialized knowledge, enabling Agents to perform zero-shot planning effectively [12]
- The paper highlights the challenge of "hallucination," where models generate inaccurate content, and proposes environmental interaction as a key anchor to mitigate it [13]

3. Application Potential of Agent AI
- The paper explores Agent AI's significant application potential in three frontier fields: gaming, robotics, and healthcare [14][19]
- In gaming, Agent AI can transform NPC behavior, enabling meaningful interactions and dynamic adjustment to player actions for greater immersion [15]
- In robotics, Agent AI lets users issue natural-language commands, with robots autonomously planning and executing complex tasks [17]
- In healthcare, Agent AI can serve as a medical chatbot for preliminary consultations and diagnostic suggestions, particularly in resource-limited settings [19][21]

4. Conclusion
- The paper acknowledges that Agent AI is still at an early stage and faces challenges in achieving deep integration across modalities and domains [22]
- It calls for standardized evaluation metrics to guide development and measure technological progress in the field [22]
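To make the five-module loop concrete, here is a minimal sketch. The class names (`Memory`, `Agent`, `ToyEnv`, `EchoLLM`) are inventions for exposition, not an interface defined by the paper:

```python
"""Minimal sketch of the five-module Agent AI cognitive loop:
perception, cognition, action, learning, and memory. All names are
illustrative assumptions, not an API from the paper."""


class Memory:
    def __init__(self):
        self.episodes = []

    def recall(self, observation):
        return self.episodes[-5:]      # naive recency-based retrieval

    def store(self, episode):
        self.episodes.append(episode)  # Learning: accumulate experience


class EchoLLM:
    """Stand-in for the LLM/VLM cognitive core."""

    def decide(self, observation, context):
        return f"act_on({observation!r})"


class ToyEnv:
    def observe(self):
        return "scene"

    def act(self, action):
        # Environmental feedback is the anchor against hallucination:
        # the next decision is conditioned on what actually happened.
        return f"result_of({action})"


class Agent:
    def __init__(self, llm, memory):
        self.llm = llm
        self.memory = memory

    def step(self, env):
        observation = env.observe()                     # Perception
        context = self.memory.recall(observation)       # Memory
        action = self.llm.decide(observation, context)  # Cognition
        feedback = env.act(action)                      # Action
        self.memory.store((observation, action, feedback))  # Learning
        return feedback


if __name__ == "__main__":
    agent = Agent(EchoLLM(), Memory())
    env = ToyEnv()
    for _ in range(3):
        print(agent.step(env))
```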
A 4,000-member autonomous driving community is enrolling for the new school season!!!
自动驾驶之心· 2025-09-02 03:14
Core Viewpoint
- The article describes a comprehensive community focused on autonomous driving technology, providing valuable resources and networking opportunities for both beginners and advanced learners in the field [1][3][12]

Group 1: Community Structure and Offerings
- The community covers nearly 40 cutting-edge technology directions in autonomous driving, including multimodal large models, VLM, VLA, closed-loop simulation, world models, and sensor fusion [1][3]
- Members come from leading autonomous driving companies, top academic laboratories, and traditional robotics firms, creating a complementary dynamic between industry and academia [1][12]
- The community has over 4,000 members and aims to reach nearly 10,000 within two years, serving as a hub for technical sharing and communication [3][12]

Group 2: Learning and Development Resources
- Resources include video content, articles, learning paths, and Q&A sessions to support members' learning journeys [3][12]
- Nearly 40 technical routes have been organized for members, spanning entry-level to advanced topics in autonomous driving [3][12]
- Members get practical answers to common questions, such as how to get started with end-to-end autonomous driving and the learning paths for multimodal large models [3][12]

Group 3: Networking and Career Opportunities
- The community facilitates job referrals and connections with autonomous driving companies, improving members' employment prospects [8][12]
- Regular discussions with industry leaders and experts cover trends, technological directions, and mass-production challenges [4][12]
- Members are encouraged to discuss academic and engineering questions with one another, fostering a collaborative environment [12][54]

Group 4: Technical Focus Areas
- The community has compiled extensive resources on areas including 3DGS, NeRF, world models, and VLA, with insights into the latest research and applications [12][27][31]
- Dedicated learning paths cover perception, simulation, planning and control, and other aspects of autonomous driving [12][13]
- A detailed overview of relevant open-source projects and datasets helps members with practical applications [24][25]
The NeurIPS 2025 MARS Multi-Agent Embodied Intelligence Challenge officially launches!
具身智能之心· 2025-08-18 00:07
Core Insights
- The article discusses challenges and advances in multi-agent embodied intelligence, emphasizing the need for efficient collaboration among robotic systems to tackle complex tasks in real-world environments [3][4]

Group 1: Challenges in Embodied Intelligence
- Single agents are insufficient for complex, dynamic task scenarios, which demand high-level collaboration among multiple embodied agents [3]
- The MARS Challenge addresses these challenges by inviting researchers worldwide to explore both the high-level planning and low-level control capabilities of multi-agent systems [4]

Group 2: MARS Challenge Overview
- The challenge features two complementary tracks, planning and control, to evaluate agents' capabilities on complex tasks [4][12]
- Results and awards will be announced at the NeurIPS 2025 SpaVLE Workshop [4]

Group 3: Track 1 - Multi-Agent Embodied Planning
- Track 1 focuses on high-level task planning and role assignment for heterogeneous robots, built on the ManiSkill platform and the RoboCasa dataset [5][6]
- Participants use vision-language models to select suitable robot combinations and produce high-level action sequences from natural-language instructions (an illustrative planner interface is sketched at the end of this entry) [5][8]

Group 4: Track 2 - Multi-Agent Control Strategy Execution
- Track 2 emphasizes the cooperative execution of complex tasks, requiring real-time interaction with dynamic environments [12]
- The RoboFactory simulation environment is used to develop and evaluate cooperative strategies, with participants designing deployable control models [12][13]

Group 5: Timeline and Participation
- A warm-up round starts on August 18, 2025; the official competition runs from September 1 to October 31, 2025 [25]
- Participants from robotics, computer vision, natural language processing, and related fields are encouraged to join and showcase their creativity and technology [26]
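As a rough illustration of what a Track 1 submission produces (a robot selection plus a high-level action sequence), here is a sketch. The data shapes and the canned plan are assumptions for exposition; the actual interface is defined by the challenge's ManiSkill/RoboCasa starter kit, and a real entry would query a VLM with the instruction plus scene images:

```python
"""Illustrative shape of a Track 1 output: assign heterogeneous robots and
emit a high-level action sequence from a natural-language instruction.
The interface and canned plan are assumptions, not the challenge API."""
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    robot: str   # which robot executes this step
    action: str  # high-level action, e.g. "pick(bowl)"


def plan(instruction: str, available_robots: List[str]) -> List[PlanStep]:
    # Stand-in for a VLM call: pick a robot combination, sequence actions.
    if "hand over" in instruction and len(available_robots) >= 2:
        left, right = available_robots[:2]
        return [
            PlanStep(left, "pick(object)"),
            PlanStep(left, "move_to(handover_pose)"),
            PlanStep(right, "grasp(object)"),
            PlanStep(left, "release(object)"),
        ]
    return [PlanStep(available_robots[0], f"execute({instruction})")]


if __name__ == "__main__":
    for step in plan("hand over the mug", ["arm_left", "arm_right"]):
        print(step)
```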
Frontier Approaches in Autonomous Driving: A Survey of Work from End-to-End to VLA
自动驾驶之心· 2025-08-10 03:31
Core Viewpoint
- The article surveys advances in end-to-end (E2E) and VLA (vision-language-action) algorithms in the autonomous driving industry, highlighting their potential to enhance driving capability through unified perception-to-control modeling, despite their higher technical complexity [1][5]

Summary by Sections
End-to-End Algorithms
- End-to-end approaches fall into single-stage and two-stage methods; the latter focus more on joint prediction, with perception outputs feeding trajectory planning and prediction (the interface difference is sketched at the end of this entry) [3]
- Single-stage end-to-end models include methods such as UniAD, DiffusionDrive, and Drive-OccWorld, each emphasizing different aspects; production systems are likely to combine their strengths [3][37]

VLA Algorithms
- VLA extends large models' capabilities to improve scene understanding in production models, with internal discussions covering language models as interpreters and algorithm summaries for both modular and unified end-to-end VLA [5][45]
- The community has compiled over 40 technical routes, giving quick access to industry applications, benchmarks, and learning pathways [7]

Community and Resources
- The community is a platform for knowledge exchange among members from renowned universities and leading autonomous driving companies, offering open-source projects, datasets, and learning routes [19][35]
- A comprehensive technical stack and roadmap is available for beginners and advanced researchers, covering the major aspects of autonomous driving technology [12][15]

Job Opportunities and Networking
- Job referral channels have been established with multiple autonomous driving companies, and members are encouraged to connect and share opportunities [10][17]
- Regular discussions on industry trends, research directions, and practical applications foster a collaborative environment for learning and professional growth [20][83]
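The single-stage/two-stage distinction comes down to whether planning consumes explicit perception outputs or the whole sensor-to-trajectory mapping is learned in one network. A minimal sketch follows, with all functions as illustrative stubs (methods such as UniAD supervise intermediate heads inside one model, which the single-stage stub glosses over):

```python
"""Sketch of the interface difference between two-stage and single-stage
end-to-end driving stacks. All functions are illustrative stubs."""
import numpy as np


def perceive(images: np.ndarray):
    """Stand-in perception: object detections and lane estimates."""
    return np.zeros((10, 4)), np.zeros((4, 2))


def plan_from_perception(detections, lanes) -> np.ndarray:
    """Stand-in planner trained on explicit perception outputs."""
    return np.zeros((8, 2))  # 8 future waypoints (x, y)


def unified_model(images: np.ndarray) -> np.ndarray:
    """Stand-in single network mapping sensors directly to a trajectory."""
    return np.zeros((8, 2))


def two_stage(images: np.ndarray) -> np.ndarray:
    # Perception first; planning/prediction consume its explicit outputs.
    detections, lanes = perceive(images)
    return plan_from_perception(detections, lanes)


def single_stage(images: np.ndarray) -> np.ndarray:
    # One model end to end; intermediate representations stay internal.
    return unified_model(images)
```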
DriveBench: Are VLMs in Autonomous Driving Really Reliable? (ICCV'25)
自动驾驶之心· 2025-08-07 23:32
Core Insights
- The article discusses advances in vision-language models (VLMs) and their potential application to autonomous driving, focusing on the reliability and interpretability of VLM-generated driving decisions [3][5]

Group 1: DriveBench Overview
- DriveBench is a benchmark dataset for evaluating VLM reliability across 17 settings, comprising 19,200 frames and 20,498 question-answer pairs [3]
- The framework covers four core autonomous driving tasks: perception, prediction, planning, and behavior, and incorporates 15 types of out-of-distribution (OoD) scenarios to systematically stress-test VLMs in complex driving environments (a schematic evaluation loop is sketched at the end of this entry) [7][9]

Group 2: Presentation Details
- A live presentation by Shaoyuan Xie, a PhD student at the University of California, Irvine, will cover the empirical study of VLMs and their readiness for autonomous driving [9]
- The talk will give an overview of VLMs in autonomous driving, the DriveBench reliability assessment, and future prospects for VLM applications in industry [9]
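Schematically, a DriveBench-style robustness evaluation iterates over question-answer pairs per task, under clean and corrupted inputs. The sketch below assumes exact-match scoring and hypothetical `ask_vlm`/corruption callables; the benchmark's actual data format and metrics may differ:

```python
"""Sketch of a DriveBench-style robustness sweep: accuracy per
(task, corruption) cell. Field names, exact-match scoring, and the
ask_vlm/corruption callables are assumptions, not the benchmark's API."""
from collections import defaultdict


def evaluate(samples, ask_vlm, corruptions):
    """samples: dicts with 'task', 'frame', 'question', 'answer'.
    corruptions: mapping of corruption name -> frame transform."""
    conditions = [("clean", lambda f: f)] + list(corruptions.items())
    hits = defaultdict(list)
    for s in samples:
        for name, corrupt in conditions:
            pred = ask_vlm(corrupt(s["frame"]), s["question"])
            hits[(s["task"], name)].append(pred == s["answer"])
    return {key: sum(v) / len(v) for key, v in hits.items()}


if __name__ == "__main__":
    samples = [{"task": "perception", "frame": "img0",
                "question": "Is there a pedestrian ahead?", "answer": "yes"}]
    corruptions = {"fog": lambda f: f + "+fog"}   # toy OoD transform
    print(evaluate(samples, lambda f, q: "yes", corruptions))
```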
4,000 members: we've built a truly full-stack autonomous driving community!
自动驾驶之心· 2025-08-03 00:33
Core Viewpoint
- The article discusses the current state and prospects of the autonomous driving industry, highlighting the shift toward embodied intelligence and large models while questioning whether traditional autonomous driving technologies are becoming obsolete [2][3]

Group 1: Industry Perspectives
- Some professionals have moved away from autonomous driving, believing the technology stack has become homogenized, with only end-to-end and large-model approaches remaining viable [2]
- Those still watching the field are reluctant to leave their current high-paying jobs and lack reliable connections in the embodied intelligence sector [3]
- Many remain committed to autonomous driving, viewing it as the most promising path toward general embodied intelligence [3]

Group 2: Industry Challenges
- The current state of mass production is seen as somewhat chaotic, with new solutions rushed to market before existing ones are fully refined [3]
- The article suggests the earlier hype around autonomous driving had an upside, allowing a more focused effort to solidify mass-production capability [3]

Group 3: Future Directions
- The future of mass production is expected to be unified, multimodal, and end-to-end, requiring full-stack talent versed in perception, planning, prediction, and large models [3]
- The community aims to bridge academia and industry, facilitating communication and collaboration to advance the field [3][6]

Group 4: Community Initiatives
- The "Autonomous Driving Heart" knowledge platform has built a comprehensive ecosystem for sharing academic and industrial insights, including job opportunities and technical resources [5][12][14]
- The platform has organized over 40 technical routes and numerous open-source projects to help both newcomers and experienced professionals [5][15][16]

Group 5: Educational Resources
- A well-structured entry-level technical stack and roadmap is provided for beginners, along with industry frameworks and project plans for active researchers [10][12]
- Continuous job postings and opportunity sharing round out the community's aim of building a complete autonomous driving ecosystem [14]
Resource Roundup | VLM - World Models - End-to-End
自动驾驶之心· 2025-07-12 12:00
Core Insights
- The article discusses advances and applications of vision-language models (VLMs) and large language models (LLMs) in autonomous driving and intelligent transportation systems [1][2]

Summary by Sections
Overview of Vision-Language Models
- Vision-language models are becoming increasingly important in autonomous driving, enabling better understanding of and interaction between visual data and language [4][10]

Recent Research and Developments
- Several recent papers at venues such as CVPR and NeurIPS focus on improving VLM performance through techniques including behavior alignment, efficient pre-training, and enhanced compositionality [5][7][10]

Applications in Autonomous Driving
- Integrating LLMs and VLMs is expected to enhance tasks across autonomous driving, including object detection, scene understanding, and planning [10][13]

World Models in Autonomous Driving
- World models are being developed to improve the representation and prediction of driving scenarios, with works such as DrivingGPT and DriveDreamer enhancing scene understanding and video generation capabilities [10][13]

Knowledge Distillation and Transfer Learning
- Techniques such as knowledge distillation and transfer learning are being explored to optimize vision-language models in multi-task settings (a generic distillation loss is sketched at the end of this entry) [8][9]

Community and Collaboration
- A growing community of researchers and companies is advancing autonomous driving technology, with numerous resources and collaborative platforms for knowledge sharing and innovation [17][19]
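Since several of the surveyed works distill larger vision-language teachers into smaller students, the generic distillation objective is worth spelling out. Below is a minimal PyTorch sketch of the standard soft-target loss; this is the textbook formulation, not any single paper's variant:

```python
"""Generic knowledge-distillation loss: blend a temperature-softened KL
term against the teacher with ordinary cross-entropy on hard labels."""
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term's magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


if __name__ == "__main__":
    s = torch.randn(4, 10)            # student logits for 4 samples
    t = torch.randn(4, 10)            # teacher logits
    y = torch.randint(0, 10, (4,))    # hard labels
    print(distillation_loss(s, t, y).item())
```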