Vision-Language Models (VLM)
Training-Free! Using Bayesian Methods Instead of Fine-Tuning VLMs Achieves SOTA on Robot Manipulation Tasks!
具身智能之心· 2025-12-03 03:47
Core Insights
- The article discusses advancements in Vision-Language Models (VLM) and introduces T²-VLM, a novel framework that generates temporally consistent rewards for robotic tasks without requiring training [2][5]

Group 1: VLM and T²-VLM Overview
- VLMs have significantly improved performance on embodied tasks such as goal decomposition and visual understanding, but providing precise rewards for robotic manipulation remains challenging due to the lack of domain-specific knowledge in pre-training datasets and high computational costs [2]
- T²-VLM tracks the state changes of sub-goals derived from VLMs to generate accurate rewards, strengthening long-horizon decision-making and improving failure-recovery performance through reinforcement learning [2]

Group 2: Methodology and Results
- T²-VLM queries the VLM before each interaction to establish spatially aware sub-goals and initial completion estimates, then uses a Bayesian tracking algorithm to dynamically update the estimated completion state of each sub-goal [2]
- Extensive experiments show that T²-VLM achieves state-of-the-art performance on two robotic manipulation benchmarks while reducing computational cost and delivering superior reward accuracy [2]

Group 3: Live Session Details
- A live session is scheduled for December 3rd, 19:30-20:30, covering the background of real-robot reinforcement learning, the current state of VLM-based reward generation research, and reflections on the T²-VLM method [5][6]
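The Bayesian tracking step described above can be sketched as a simple belief update: each sub-goal carries a probability of being complete, and each noisy VLM check moves that belief via Bayes' rule. This is a minimal illustration, not the paper's actual algorithm; the observation likelihoods, threshold, and reward definition below are assumptions.

```python
# Minimal sketch of Bayesian tracking of sub-goal completion from noisy VLM
# observations. The hit rate / false-alarm rate values are illustrative
# assumptions, not values from the T²-VLM paper.

def bayes_update(prior: float, observed_done: bool,
                 p_obs_given_done: float = 0.9,
                 p_obs_given_not_done: float = 0.2) -> float:
    """Posterior P(sub-goal done | VLM observation) via Bayes' rule."""
    if observed_done:
        num = p_obs_given_done * prior
        den = num + p_obs_given_not_done * (1.0 - prior)
    else:
        num = (1.0 - p_obs_given_done) * prior
        den = num + (1.0 - p_obs_given_not_done) * (1.0 - prior)
    return num / den

def reward_from_beliefs(beliefs, threshold: float = 0.8) -> float:
    """Dense reward: fraction of sub-goals currently believed complete."""
    return sum(b >= threshold for b in beliefs) / len(beliefs)

# One tracking step: the VLM reports sub-goal 0 as done, sub-goal 1 as not done.
beliefs = [0.5, 0.5]                        # initial completion estimates
beliefs[0] = bayes_update(beliefs[0], True)
beliefs[1] = bayes_update(beliefs[1], False)
```

Because beliefs persist across interactions, a single contradictory VLM answer only nudges the estimate rather than flipping it, which is one way to obtain the temporally consistent rewards the article emphasizes.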
VLMs Can "Self-Evolve" Too! The RL Self-Evolution Framework VisPlay Cracks Visual Reasoning
具身智能之心· 2025-12-02 09:30
Core Insights
- The article introduces VisPlay, a self-evolving reinforcement learning framework for Vision-Language Models (VLM) that enables self-improvement from vast amounts of unlabeled image data [2][3][18]

Group 1: Challenges in VLM
- VLMs have made significant progress on perception tasks but struggle with complex visual reasoning due to their reliance on high-quality labeled data [5]
- Traditional methods such as supervised fine-tuning and reinforcement learning hit a bottleneck: the cost and speed of manual labeling cannot keep up with evolving model demands [5][4]

Group 2: VisPlay Framework
- VisPlay addresses these challenges with a self-evolution mechanism that lets models learn autonomously from unlabeled images [7][8]
- The framework splits the VLM into two roles: a "Questioner" that generates challenging visual questions, and a "Reasoner" that answers them based on the images [10][12]

Group 3: Reward Mechanism
- VisPlay employs a reward mechanism combining a Difficulty Reward and a Diversity Reward to raise the quality of generated questions and answers [10][11]
- This design mitigates common failure modes of self-evolving models, such as low answer quality and high question redundancy, yielding significant capability gains [11]

Group 4: Experimental Results
- VisPlay has been tested on mainstream VLMs such as Qwen2.5-VL and MiMo-VL across eight benchmark datasets, showing consistent and significant accuracy gains [15][17]
- The framework demonstrates strong generalization, particularly on unseen complex reasoning combinations, and effectively reduces "hallucinations" in VLMs [17][18]
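The two-term reward for the Questioner can be sketched as follows. The article only states that a Difficulty Reward and a Diversity Reward are combined; the concrete formulas here (peak difficulty at ~50% Reasoner success, token-Jaccard novelty against past questions, and the weight `w_div`) are illustrative assumptions.

```python
# Illustrative sketch of a combined Difficulty + Diversity reward for a
# question-generating "Questioner". Only the two-term structure comes from
# the article; the formulas and weight are assumptions.

def difficulty_reward(reasoner_success_rate: float, target: float = 0.5) -> float:
    """Highest when the Reasoner answers correctly about half the time,
    i.e. the question is neither trivial nor impossible."""
    return 1.0 - 2.0 * abs(reasoner_success_rate - target)

def diversity_reward(question: str, history: list) -> float:
    """Penalize questions that overlap heavily (token Jaccard) with history."""
    q = set(question.lower().split())
    if not history:
        return 1.0
    max_overlap = max(
        len(q & set(h.lower().split())) / max(1, len(q | set(h.lower().split())))
        for h in history
    )
    return 1.0 - max_overlap

def questioner_reward(success_rate: float, question: str,
                      history: list, w_div: float = 0.5) -> float:
    return difficulty_reward(success_rate) + w_div * diversity_reward(question, history)
```

A reward shaped like this pushes the Questioner away from both trivially easy and unanswerable questions while discouraging near-duplicates, matching the failure modes (low answer quality, high redundancy) the article says VisPlay mitigates.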
An Illustrated Guide to the Qwen3-VL Multimodal Model
自动驾驶之心· 2025-11-29 02:06
Core Insights
- The article walks through Qwen3-VL, a vision-language model (VLM) that takes both text and images as input, emphasizing its architecture and implementation details [3][4]

Group 1: Model Overview
- Qwen3-VL is an autoregressive model designed to handle multimodal inputs, specifically text and images [3]
- The implementation comprises configuration files, modeling files, and processing files for images and videos [5][6]

Group 2: Source Code Analysis
- The source code is organized into several classes, including Qwen3VLVisionMLP, Qwen3VLVisionPatchEmbed, and Qwen3VLForConditionalGeneration, each serving a specific function within the model [6][12]
- The Qwen3VLProcessor class converts input images into pixel values, reusing the Qwen2-VL image processor for this task [7][10]

Group 3: Image Processing
- Image processing involves resizing, normalizing, and preparing images for the model, ultimately returning pixel values that serve as input [8][9]
- Images are processed in batches, grouped by size for efficient resizing and normalization [9]

Group 4: Model Execution Flow
- The Qwen3VLForConditionalGeneration class is the model's entry point, where input pixel values and text input IDs are processed to generate outputs [15][16]
- The forward method integrates image and text features, embedding the image features into the input token sequence [21][22]

Group 5: Vision Encoder
- The vision encoder of Qwen3-VL is custom-built rather than borrowed from existing models like CLIP, and uses a 3D convolution to convert images into hidden states [35][37]
- The encoder incorporates attention mechanisms and position encoding to strengthen its processing of visual data [40][41]

Group 6: Final Outputs
- The final output combines the processed image and text features, which are forwarded to the language model for further processing [33][34]
- This architecture integrates visual and textual data, enabling the model to generate coherent outputs from multimodal inputs [44]
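The merge step in Group 4 — embedding image features into the text token sequence — can be sketched in a few lines: the vision encoder's outputs replace the embeddings at the placeholder `<image>` token positions. This is a simplified illustration; the placeholder token id, shapes, and list-based embeddings below are assumptions, not Qwen3-VL's actual implementation.

```python
# Sketch: image embeddings produced by the vision encoder replace the
# placeholder positions in the text embedding sequence. The token id and
# toy shapes are illustrative assumptions.

IMAGE_TOKEN_ID = 151655  # assumed placeholder id, not Qwen3-VL's actual value

def merge_image_features(input_ids, text_embeds, image_embeds,
                         image_token_id=IMAGE_TOKEN_ID):
    """Replace each <image> placeholder's embedding with the next image embedding."""
    assert sum(t == image_token_id for t in input_ids) == len(image_embeds), \
        "need exactly one image embedding per placeholder token"
    merged, img_iter = [], iter(image_embeds)
    for tok, emb in zip(input_ids, text_embeds):
        merged.append(next(img_iter) if tok == image_token_id else emb)
    return merged

# Toy sequence: BOS, two image slots, one text token (embeddings as short lists).
ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 42]
text = [[0.0] * 4 for _ in ids]
imgs = [[1.0] * 4, [2.0] * 4]
merged = merge_image_features(ids, text, imgs)
```

The merged sequence is then what the language model consumes, which is why the number of placeholder tokens the processor emits must exactly match the number of visual feature vectors the encoder produces.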
Outperforming GPT and Google: Beijing Humanoid Robot Innovation Center Open-Sources the World's Strongest Embodied VLM
具身智能之心· 2025-11-17 00:47
Core Viewpoint
- The article highlights the launch of Pelican-VL 1.0, a cutting-edge embodied vision-language model (VLM) claimed to surpass the GPT-5 and Google Gemini series, showcasing China's technological strength in embodied intelligence [1][3]

Group 1: Overview of Pelican-VL
- Pelican-VL is described as a "visual language brain," and its open-source release significantly advances embodied intelligence technology [3]
- The core team behind Pelican-VL is composed entirely of women, emphasizing the contribution of female talent to China's technological research and development [7]

Group 2: Innovation Center and Team
- The Beijing Humanoid Robot Innovation Center, established in November 2023, is China's first provincial-level humanoid robot innovation center, formed by companies including Xiaomi Robotics and UBTECH [5]
- The center's notable results include the "Tian Gong" series, the world's first full-size electric humanoid robot, capable of running at 12 km/h and adapting to various complex terrains [5]

Group 3: Core Technology - DPPO
- Pelican-VL's performance breakthrough is attributed to its pioneering DPPO (Deliberate Practice Policy Optimization) training paradigm, which achieves better performance with significantly less data [8][9]
- Traditional models require 1 million to 5 million data points for training, while Pelican-VL needs only 200,000, a data efficiency of 1/10 to 1/50 compared to similar models [8][9]

Group 4: Training Methodology
- DPPO mimics human learning: a closed loop of observation, practice, error correction, and improvement [9]
- Training consists of two key phases: reinforcement learning exploration and targeted supervised fine-tuning focused on identified weaknesses [12]

Group 5: Performance Comparison
- Training used a dedicated computing cluster of over 1,000 A800 GPUs, with a complete model checkpoint requiring more than 50,000 A800 GPU-hours [15]
- The model ships in two versions: a lightweight 7B-parameter model for local deployment and a 72B-parameter model for cloud-based complex task processing, balancing flexibility and performance [23]

Group 6: Data Quality and Performance Metrics
- The training data was meticulously curated from 12 domains, yielding a high-quality dataset that includes millions of tokens and numerous "failure cases" for effective learning [24]
- Performance tests show Pelican-VL outperforming GPT-5 by 15.79% and Google Gemini by 19.25% across dimensions including visual understanding and action planning [25]

Group 7: VLA System Integration
- Pelican-VL serves as the "brain" of a Vision-Language-Action (VLA) system, integrating vision, language, and action modules to execute complex tasks [29][30]
- This integration lets Pelican-VL understand and execute highly abstract composite instructions, enhancing its operational capabilities in real-world scenarios [30]

Group 8: Open Source Impact
- The open-source release is expected to lower the barrier to adopting embodied intelligence, enabling small and medium enterprises to develop intelligent robots without significant upfront investment [34]
- It also encourages full industrial-chain development, fostering a rich application ecosystem around Pelican-VL and expanding the boundaries of embodied intelligence applications [34]
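The two-phase loop attributed to DPPO — explore with RL to surface weaknesses, then fine-tune on data targeting exactly those weaknesses — can be sketched as a skeleton. Everything below is a toy stand-in under that assumed structure; the function bodies are placeholders and do not reflect the actual DPPO implementation.

```python
# Skeleton of the observe -> practice -> correct -> improve loop the article
# attributes to DPPO. All function bodies are toy placeholders/assumptions;
# only the two-phase structure comes from the text.

def rollout_succeeds(model, task):
    """Toy stand-in for an RL rollout: 'succeeds' on already-mastered tasks."""
    return task in model["mastered"]

def supervised_finetune(model, data):
    """Toy stand-in for targeted SFT: failure cases become mastered skills."""
    return {"mastered": model["mastered"] | set(data)}

def dppo_round(model, tasks):
    # Phase 1: RL exploration -- roll out the model and record where it fails.
    failures = [t for t in tasks if not rollout_succeeds(model, t)]
    # Phase 2: targeted supervised fine-tuning on the identified weaknesses.
    return supervised_finetune(model, failures), failures

model = {"mastered": {"pick"}}
model, misses = dppo_round(model, ["pick", "place", "pour"])
```

The point of the skeleton is the data flow: each round spends compute only on tasks the model currently fails, which is consistent with the article's claim that DPPO needs far fewer training samples than uniform training.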
University of Pennsylvania! MAESTRO: A Zero-Shot General-Purpose Robot Framework Built on VLMs
具身智能之心· 2025-11-05 00:02
Core Insights
- MAESTRO is a modular robotic framework centered on Vision-Language Models (VLM), achieving zero-shot manipulation performance without extensive training data while remaining scalable and debuggable [2][5][22]

Group 1: Innovation and Design
- Mainstream robotics development relies on large-scale "observation-action" datasets, which are costly and limited, hindering progress [4]
- MAESTRO takes a differentiated approach, using VLMs to avoid dependence on robot-specific data and integrating mature specialized tools for reliable low-level operation [6][5]
- The framework runs a closed-loop interaction mechanism, continuously monitoring environmental feedback to adjust actions in real time, forming an adaptive cycle of perception, action, and learning [5][6]

Group 2: Core Module Toolset
- The modular design follows six principles, addressing diverse robotic operational needs including perception, control, and geometry [8]
- Key modules include:
  - Perception: improves the accuracy of visual information through a hierarchical approach [10]
  - Control: integrates Cartesian control and collision-free motion planning for safety [10]
  - Geometry & Linear Algebra: provides tools for spatial reasoning [10]
  - Image Editing: improves visual grounding capabilities [10]
  - Mobile Operation Extensions: adapts to mobile-robot scenarios with navigation and active perception tools [10]

Group 3: Evolution Mechanism
- MAESTRO records past task-execution code and outcomes to provide contextual examples for the VLM, optimizing code generation and improving performance after only a handful of real-world trials [12]

Group 4: Experimental Results and Performance Analysis
- On desktop manipulation, MAESTRO significantly outperformed existing VLA models on six of seven tasks, particularly those requiring semantic reasoning and long-term memory [17]
- On mobile manipulation, MAESTRO achieved high completion rates, with specific tasks scoring 96.0±8.9 and 93.3±14.9 [17]
- Its evolution capability was highlighted by a door-opening task whose completion rate rose from 35% to 85.0±7.4 after three iterations [17]

Group 5: Key Module Ablation Analysis
- Removing the advanced perception modules drastically reduced task completion rates, indicating the importance of precise perception for complex operations [20]
- Removing the geometry modules also hurt performance, underscoring the necessity of spatial reasoning tools [20]

Group 6: Future Directions
- MAESTRO is positioned as an effective alternative to large-scale robot-training pipelines; future work targets faster VLM inference, stronger low-level control, and more stable reasoning in complex scenarios [22]
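The evolution mechanism — recording past execution code and outcomes, then surfacing successful ones as in-context examples for the VLM — can be sketched as a small experience memory. The data layout and the word-overlap retrieval rule below are illustrative assumptions, not MAESTRO's actual design.

```python
# Sketch of an experience memory for the evolution mechanism described above:
# store (task, code, outcome) triples and retrieve successful past codes as
# in-context examples. Retrieval rule and field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    code: str
    success: bool

@dataclass
class ExperienceMemory:
    episodes: list = field(default_factory=list)

    def record(self, task: str, code: str, success: bool) -> None:
        self.episodes.append(Episode(task, code, success))

    def examples_for(self, task: str, k: int = 2) -> list:
        """Up to k successful past codes whose tasks share words with `task`."""
        words = set(task.lower().split())
        scored = [(len(words & set(e.task.lower().split())), e)
                  for e in self.episodes if e.success]
        scored.sort(key=lambda s: -s[0])
        return [e.code for overlap, e in scored[:k] if overlap > 0]

mem = ExperienceMemory()
mem.record("open the door", "door_plan_v1", True)
mem.record("open the drawer", "drawer_plan_fail", False)
mem.record("wipe the table", "wipe_plan_v1", True)
```

Filtering out failed episodes before retrieval is the key design choice: the VLM is conditioned only on code that actually worked, which is consistent with the reported jump from 35% to 85.0±7.4 after a few iterations.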
Lessons from Switching Industries into a Major Autonomous Driving Company
自动驾驶之心· 2025-11-04 00:03
Core Insights
- The article emphasizes the importance of seizing opportunities and continuous learning in the rapidly evolving field of autonomous driving [1][4]
- It highlights the creation of a comprehensive community platform, "Autonomous Driving Heart Knowledge Planet," aimed at facilitating knowledge sharing and career development in the autonomous driving sector [4][16]

Group 1: Career Development
- Transitioning into the autonomous driving industry can succeed through dedication and preparation, as illustrated by the experience of a professional who switched careers and excelled in various roles [1]
- Continuous learning and adapting to industry trends are crucial for career advancement, as demonstrated by that professional's progression from algorithm evaluation to advanced safety algorithms [1]

Group 2: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" has over 4,000 members and aims to grow to nearly 10,000 in two years, providing a platform for discussion, technical sharing, and job opportunities [4][16]
- The community offers a variety of resources, including video content, learning pathways, and Q&A sessions, supporting both beginners and advanced learners [7][10]

Group 3: Technical Learning and Networking
- The community organizes discussions with industry experts on topics including entry points for end-to-end autonomous driving and multi-sensor fusion [8][20]
- Members have access to over 40 technical pathways and numerous datasets relevant to autonomous driving [10][36]

Group 4: Job Opportunities
- The community facilitates job referrals and connections with leading companies in the autonomous driving sector, improving members' chances of securing positions [11][12]
- Regular updates on job openings and industry trends help members stay informed about potential career advancements [21][93]
World Models == VQA? Robots Don't Need to Imagine Images; Predicting Semantics Is Enough
机器之心· 2025-10-28 00:41
Core Insights
- The article questions whether world models need precise future predictions at the pixel level, asking if detailed visual representations are essential for decision-making [1][6]
- It introduces the Semantic World Model (SWM), which predicts semantic information about future outcomes rather than generating visual frames [9][18]

Summary by Sections

World Models and Their Limitations
- World models enable AI to learn the dynamics of the world and predict future events from current states [6]
- Traditional models often generate realistic images yet miss the critical semantic details needed for decision-making [7][8]

Semantic World Model (SWM)
- SWM reframes world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual data [8][9]
- SWM uses a vision-language model (VLM) to answer questions about future actions and their semantic effects [9][11]

Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert trajectories, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model effectively [22]

Experimental Results
- SWM answered questions about future outcomes with high accuracy and showed generalization to new scenarios [17]
- In multi-task simulations, SWM significantly outperformed baseline models, achieving success rates of 81.6% on LangTable and 76% on OGBench [30][34]

Generalization and Robustness
- SWM retains the generalization capabilities of the underlying VLM, improving performance even under new object combinations and background changes [39][41]
- The model's attention focuses on task-relevant information, indicating an ability to generalize across scenarios [41]
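The SAQA idea above can be made concrete with a minimal record and prompt format: instead of predicting future pixels, the model answers a semantic question about what an action would cause. The field names and prompt template below are assumptions for illustration, not the paper's actual schema.

```python
# Sketch of a State-Action-Question-Answer (SAQA) record and how a semantic
# world model might be queried VQA-style. Field names and the prompt template
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SAQARecord:
    state_image: str   # path/id of the current observation
    action: str        # candidate action, e.g. "push the red block left"
    question: str      # semantic query about the future
    answer: str        # ground-truth semantic outcome

def make_vqa_prompt(rec: SAQARecord) -> str:
    """Format one record as a VQA-style prompt for a VLM-based world model."""
    return (f"Observation: <image:{rec.state_image}>\n"
            f"If the robot executes: {rec.action}\n"
            f"Question: {rec.question}\nAnswer:")

rec = SAQARecord("frame_0421", "push the red block left",
                 "will the red block touch the blue bowl?", "yes")
prompt = make_vqa_prompt(rec)
```

Training on (prompt, answer) pairs like this turns the world model into a language-space predictor, which is why SWM can tolerate low-quality trajectories: only the semantic outcome of each action needs to be labeled, not the visual future.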
After Hosting Several Online Exchange Sessions, I Realized Everyone Is Still Quite Lost
自动驾驶之心· 2025-10-24 00:04
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community called "Autonomous Driving Heart Knowledge Planet," aimed at providing a platform for knowledge sharing and networking in the autonomous driving industry and addressing the challenges newcomers face [1][3][14]

Group 1: Community Development
- The community has grown to over 4,000 members and aims to reach nearly 10,000 within two years, providing a space for technical sharing among beginners and advanced learners [3][14]
- It integrates videos, articles, learning paths, Q&A, and job exchange, making it a comprehensive hub for autonomous driving enthusiasts [3][5]

Group 2: Learning Resources
- The community has organized over 40 technical learning paths covering topics such as end-to-end autonomous driving, multi-modal large models, and data annotation practice, significantly reducing research ramp-up time [5][14]
- Members can access a variety of video tutorials and beginner courses covering essential topics in autonomous driving technology [9][15]

Group 3: Industry Insights
- Industry experts are regularly invited to discuss trends, technological advances, and production challenges in autonomous driving, fostering a serious, content-driven environment [6][14]
- Members are encouraged to engage with industry leaders for insights on job opportunities and career development [10][18]

Group 4: Networking Opportunities
- The community connects members with various autonomous driving companies and offers resume-forwarding services to help members secure job placements [10][12]
- Members can freely ask questions about career choices and research directions, receiving guidance from experienced professionals in the field [87][89]
Execution Is the Lifeblood of Autonomous Driving Today
自动驾驶之心· 2025-10-17 16:04
Core Viewpoint
- The article discusses the evolving landscape of China's autonomous driving industry, highlighting shifting competitive dynamics and increasing investment in autonomous driving technologies as a core focus of AI development [1][2]

Industry Trends
- The sector has changed significantly over the past two years, with new players entering the market and existing companies focusing on improving execution [1]
- Before 2022 the industry enjoyed a flourishing period in which companies with standout technologies could thrive; it has since shifted to a more competitive environment that emphasizes addressing weaknesses [1]
- Companies still active in the market are progressively enhancing their hardware, software, AI capabilities, and engineering implementation in order to survive and excel [1]

Future Outlook
- By 2025 the industry is expected to enter a "calm period," in which unresolved technical challenges around L3, L4, and Robotaxi will continue to present opportunities for professionals [2]
- The article emphasizes comprehensive skill sets, suggesting that those with a short-term profit mindset may not endure in the long run [2]

Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community provides a comprehensive platform for learning and sharing in the autonomous driving field, with over 4,000 members and a goal of nearly 10,000 within the next two years [4][17]
- The community offers video content, learning pathways, Q&A sessions, and job exchange opportunities, catering to both beginners and advanced learners [4][6][18]
- Members can access detailed technical routes and practical solutions for various autonomous driving challenges, significantly reducing the time needed for research and learning [6][18]

Technical Focus Areas
- The community has compiled over 40 technical routes covering areas such as end-to-end learning, multi-modal models, and various simulation platforms [18][39]
- There is strong emphasis on practical applications, with resources for data processing, 4D labeling, and engineering practice in autonomous driving [12][18]

Job Opportunities
- The community facilitates job opportunities by connecting members with openings at leading autonomous driving companies, providing a platform for resume submissions and internal referrals [13][22]
Suddenly Noticed: The New Players Are Lining Up for IPOs......
自动驾驶之心· 2025-10-06 04:05
Group 1
- The article highlights a surge of IPO activity in the autonomous driving sector, signaling a significant shift in the industry landscape as new players enter the market [1][2]
- Key events include the acquisition of Shenzhen Zhuoyu Technology by China First Automobile Works, Wayve's partnership with NVIDIA for a $500 million investment, and multiple companies filing for IPOs or completing strategic investments [1]
- Competition in the autonomous driving field is intense, and many companies are pivoting toward embodied AI in response to market saturation [1][2]

Group 2
- Comprehensive skill sets matter for professionals remaining in the industry, as the market is expected to undergo significant restructuring [2]
- The article mentions the "Autonomous Driving Heart Knowledge Planet" community platform, which provides resources and networking opportunities for people interested in the field [3][19]
- The community offers video tutorials, technical discussions, and job-placement assistance, catering to both beginners and experienced professionals [4][11][22]

Group 3
- The community has gathered over 4,000 members and aims to expand to nearly 10,000 within two years, focusing on knowledge sharing and technical collaboration [3][19]
- It provides structured learning paths and resources on topics including end-to-end learning, multi-sensor fusion, and real-time applications [19][39]
- The platform also hosts discussions on industry trends, job opportunities, and technical challenges, fostering a collaborative environment for knowledge exchange [20][91]