Visual-Language Models (VLM)
Direct Resume Referral! Pony.ai Is Hiring Multimodal Large Model Interns
自动驾驶之心· 2025-11-30 02:02
Core Viewpoint
- The article covers Pony.ai's intern recruitment, emphasizing the skills needed for perception algorithm development and optimization in the autonomous driving industry [2][6].

Group 1: Responsibilities
- The role involves developing perception capabilities driven by scene descriptions and natural language instructions on top of Visual-Language Models (VLMs), with the goal of deploying these models in real-world scenarios [2].
- Responsibilities also include developing and optimizing perception algorithms based on camera, LiDAR, and multi-modal fusion, covering object detection, semantic/instance segmentation, object tracking, and 3D reconstruction [6].

Group 2: Qualifications
- Candidates with internship experience in the autonomous driving industry, or those able to commit to a six-month internship, will receive preference [3].
- A bachelor's degree or higher in computer science or a related field is required, along with proficiency in deep learning and computer vision algorithms [6].
- Familiarity with CNN-based image detection, tracking, and recognition pipelines, as well as strong programming skills in C/C++ or Python, is essential [6].
- Candidates with strong research ability, such as first-author papers at CCF Class-A conferences or journals, will be prioritized [6].
- Knowledge of deep learning frameworks such as PyTorch and experience with parallel computing or CUDA programming are advantageous [6].
Teaching VLMs to "Hold a World in Mind": VAGEN Uses Multi-Turn RL to Turn Visual Intelligence into a "World Model" Reasoning Machine
机器之心· 2025-10-25 03:20
Core Insights
- The article discusses the limitations of Visual-Language Models (VLMs) in complex visual tasks: because their perception of the world is partial and noisy, they tend to act impulsively rather than deliberately [2][6].
- The VAGEN framework addresses this by teaching VLMs to construct an internal world model before taking actions, enforcing a more structured thinking process [3][12].

Group 1: VAGEN Framework
- VAGEN enforces a structured "thinking template" with two core steps: State Estimation (describing the currently observed state) and Transition Modeling (predicting the outcome of the next action) [7][11]; a minimal sketch of such a template is given after this summary.
- The framework uses reinforcement learning (RL) to reward this structured thinking process, showing that the "World Modeling" strategy significantly outperforms both "No Think" and "Free Think" baselines [12][32].

Group 2: Internal Monologue and Reward Mechanism
- The research examines which format works best for the agent's internal monologue and finds that the optimal representation depends on the nature of the task [13][14].
- VAGEN introduces two key components in its reward mechanism: a World Modeling Reward, which provides immediate feedback after each thought process, and Bi-Level GAE for efficient credit assignment [18][20]; both are sketched below.

Group 3: Performance Results
- VAGEN-Full, built on a 3B VLM, achieved an overall score of 0.82 across five diverse tasks, outperforming a range of other models including GPT-5 [27][30].
- The results show that VAGEN-Full not only surpasses the untrained base model but also exceeds several proprietary models, demonstrating its effectiveness in strengthening VLM capabilities [30][32].
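The summary above describes VAGEN's two-step thinking template and its turn-level World Modeling Reward only at a high level. The sketch below shows one plausible realization in Python; the tag names (`<state>`, `<prediction>`, `<action>`), the parsing logic, and the reward weights are assumptions for illustration, not VAGEN's actual implementation.

```python
import re

# Hypothetical tag layout for the two-step "thinking template": the agent must
# first estimate the current state, then model the transition it expects its
# action to cause, and only then emit the action. Tag names are assumed and
# may differ from the paper's actual format.
RESPONSE_FORMAT = (
    "<state> what the agent currently observes </state>\n"
    "<prediction> what the scene should look like after the action </prediction>\n"
    "<action> the action to execute </action>"
)


def parse_turn(response: str):
    """Extract the three fields; return None if the template is violated."""
    fields = {}
    for tag in ("state", "prediction", "action"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match is None:
            return None  # a missing step means no world-modeling reward this turn
        fields[tag] = match.group(1).strip()
    return fields


def world_modeling_reward(fields, grade_state, grade_prediction):
    """Toy turn-level reward giving immediate feedback after each thought step.

    grade_state and grade_prediction stand in for whatever comparison against
    the environment's ground truth the real system uses (e.g. an LLM judge or
    string matching); here they are callables returning a score in [0, 1].
    """
    if fields is None:
        return 0.0
    return 0.5 * grade_state(fields["state"]) + 0.5 * grade_prediction(fields["prediction"])
```

During RL training, a turn-level reward of this kind would be added to the task reward at the end of every turn, which is what makes the structured thinking directly optimizable rather than merely prompted.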
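Bi-Level GAE is described above only as an efficient credit-assignment scheme operating at two granularities. Below is a simplified two-pass sketch, assuming a coarse GAE pass over turn-level rewards followed by a fine, within-turn pass over tokens; the function names, discount settings, and the way turn advantages are fed to tokens are assumptions, not the paper's exact formulation.

```python
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard generalized advantage estimation over one sequence."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def bi_level_advantages(turn_rewards: List[float],
                        turn_values: List[float],
                        token_values_per_turn: List[List[float]],
                        gamma: float = 0.99, lam: float = 0.95) -> List[List[float]]:
    """Two-level credit assignment sketch: a coarse pass over turns, then a
    fine pass that spreads each turn's advantage across its own tokens."""
    turn_adv = gae(turn_rewards, turn_values, gamma, lam)
    token_adv = []
    for adv, token_values in zip(turn_adv, token_values_per_turn):
        if not token_values:
            token_adv.append([])
            continue
        # Treat the turn-level advantage as a terminal reward inside the turn,
        # so earlier tokens receive discounted credit for the whole turn.
        token_rewards = [0.0] * (len(token_values) - 1) + [adv]
        token_adv.append(gae(token_rewards, token_values, gamma, lam))
    return token_adv
```

The point of the two-level split is that multi-turn rollouts are long: assigning credit first at the turn level and only then at the token level keeps the reward signal from the World Modeling Reward from being diluted across thousands of tokens.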