Vision-Language Models (VLM)
Training-Free! Using Bayesian Methods to "Fine-Tune" a VLM Achieves SOTA on Robot Manipulation Tasks!
具身智能之心· 2025-12-03 03:47
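Recent advances in vision-language models (VLMs) have markedly improved performance on embodied tasks such as goal decomposition and visual understanding. However, providing precise rewards for robot manipulation tasks without fine-tuning the VLM remains challenging, mainly because pretraining datasets lack domain-specific robotics knowledge and because high computational cost hinders real-time use. To address this, the researchers propose T²-VLM, a novel, training-free, temporally consistent framework that generates precise rewards by tracking changes in the state of VLM-derived subgoals.
The work first queries the VLM before each round of interaction to establish spatially aware subgoals and an initial completion estimate. It then applies a Bayesian tracking algorithm that uses the subgoals' hidden states to dynamically update goal completion status, producing structured rewards for a reinforcement learning (RL) agent. The method strengthens long-horizon decision-making and, with the help of RL, improves failure recovery. Extensive experiments show that T²-VLM achieves state-of-the-art performance on two robot manipulation benchmarks, delivering strong reward accuracy while reducing computational cost. The authors believe the method not only advances reward generation but also contributes to the broader field of embodied AI.
Livestream time: 12.3 / 19:30-20:30 ...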
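The Bayesian tracking step in the summary above can be illustrated with a minimal sketch. This is not the T²-VLM implementation: it assumes each subgoal has a binary hidden state (complete or not), that a cheap per-step detector emits a noisy binary observation, and that the reward is the change in total completion belief. The class name `SubgoalTracker` and the parameters `p_detect`, `p_false_alarm`, `p_complete`, and `p_undo` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubgoalTracker:
    """Binary Bayes filter over hidden subgoal completion states (illustrative only)."""
    beliefs: list               # P(subgoal i is complete), initialised from a VLM estimate
    p_detect: float = 0.9       # assumed P(observe "done" | subgoal complete)
    p_false_alarm: float = 0.1  # assumed P(observe "done" | subgoal not complete)
    p_complete: float = 0.05    # assumed per-step chance an incomplete subgoal completes
    p_undo: float = 0.02        # assumed per-step chance a complete subgoal is undone

    def predict(self):
        # Transition step: push each belief through the two-state Markov chain.
        self.beliefs = [
            b * (1.0 - self.p_undo) + (1.0 - b) * self.p_complete
            for b in self.beliefs
        ]

    def update(self, observations):
        # Measurement step: Bayes' rule with one noisy binary observation per subgoal.
        new_beliefs = []
        for b, z in zip(self.beliefs, observations):
            like_done = self.p_detect if z else 1.0 - self.p_detect
            like_not = self.p_false_alarm if z else 1.0 - self.p_false_alarm
            numerator = like_done * b
            new_beliefs.append(numerator / (numerator + like_not * (1.0 - b)))
        self.beliefs = new_beliefs

    def step(self, observations):
        """Run predict + update; return the reward (change in total completion belief)."""
        before = sum(self.beliefs)
        self.predict()
        self.update(observations)
        return sum(self.beliefs) - before


# Initial completion estimates, as if returned by one VLM query before the episode.
tracker = SubgoalTracker(beliefs=[0.1, 0.1])
reward_progress = tracker.step([True, False])   # first subgoal observed as done
reward_regress = tracker.step([False, False])   # first subgoal no longer observed as done
```

Under these assumptions the reward is positive when evidence of progress arrives and negative when a previously completed subgoal appears undone, which is the kind of signal an RL agent could use for failure recovery. The VLM is queried once up front; the per-step work is a few multiplications.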
VLMs Can "Self-Evolve" Too! The RL Self-Evolution Framework VisPlay Tackles Hard Visual Reasoning Problems
具身智能之心· 2025-12-02 09:30
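Editor: 机器之心
The recent work VisPlay is the first to propose a self-evolving reinforcement learning framework that lets a VLM evolve and improve its capabilities using only large amounts of unlabeled image data. VisPlay splits the base VLM into two roles, a "Questioner" and a "Reasoner", which co-evolve through an iterative self-evolution mechanism. It combines the GRPO algorithm with novel diversity and difficulty rewards to balance question complexity against answer quality.
Title: VisPlay: Self-Evolving Vision-Language Models from Images
Experiments show that VisPlay delivers sustained performance gains on mainstream models such as Qwen2.5-VL and MiMo-VL, with especially notable gains in visual reasoning, compositional generalization, and hallucination reduction, pointing to a scalable, low-cost path for evolving multimodal intelligence.
Introduction: In the vision-language model field, improving complex reasoning usually depends on costly human-annotated data or heuristic rewards. This is expensive and hard to scale. ...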
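For readers unfamiliar with GRPO, its central idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative normalization. The reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and VisPlay's diversity and difficulty rewards are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one generated question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one. If all responses score the same, every advantage is zero and the group contributes no learning signal.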
An Illustrated Guide to the Qwen3-VL Multimodal Model
自动驾驶之心· 2025-11-29 02:06
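阿杰 | A ten-year engineering veteran who has worked in depth on big-data modeling, backend architecture design, and algorithm optimization, and has handled systems with tens of millions of users. He shares hands-on technical know-how, post-mortems, and industry trend analysis.
Author | 阿杰不敲代码时 Source | 阿杰不敲代码时 Original link: 图解Qwen3-VL多模态模型
Not long ago I wrote an article about VLMs. Whether because of the content or the title, it did not seem to draw much interest. Still, if you want to understand the internals of Qwen3-VL and your foundations are shaky or missing, that article is worth reading. I assume most readers have read it before coming here, so I will not repeat that background. Readers who already have the background can skip the first article and start with this one. If anything here is wrong, corrections and criticism are welcome. A vision-language model (VLM) is an autoregressive AI model that takes text and images as input. This article also walks through the source code in detail to see how the Qwen3-VL model ...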
Outperforming GPT and Google: Beijing Humanoid Robot Innovation Center Open-Sources the World's Strongest Embodied VLM
具身智能之心· 2025-11-17 00:47
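Author: 咖啡不加糖 Editor: 焉知机器人
On November 14, 2025, the Beijing Embodied Intelligent Robot Innovation Center (北京具身智能机器人创新中心) officially released Pelican-VL 1.0, an embodied vision-language model (VLM). It is claimed to outperform comparable GPT-5 models and the Google Gemini series and, billed as "the world's largest open-source embodied multimodal large model", it showcases China's technical strength in embodied intelligence.
Embodied intelligence, put simply, is technology that lets robots perceive the world, make decisions, and carry out actions as humans do. The VLM serves as the robot's "eyes" and "central brain": it turns the image information it sees into understandable language instructions and then plans concrete action steps.
Figure: Pelican-VL 1.0 ("pelican" is 塘鹅 or 鹈鹕 in Chinese) can be downloaded from both Hugging Face and ModelScope.
Pelican-VL 1.0 is described as a "vision-language brain", and its open-sourcing has given a strong push to embodied intelligence technology.
1. The Beijing Humanoid Robot Innovation Center and Pelican-VL ...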
University of Pennsylvania! MAESTRO: A VLM-Based Zero-Shot General-Purpose Robot Framework
具身智能之心· 2025-11-05 00:02
Core Insights
- MAESTRO is a modular robotic framework centered around Vision Language Models (VLM), achieving zero-shot operational performance without extensive training data, while offering scalability and debuggability [2][5][22]
Group 1: Innovation and Design
- Current mainstream robotics development relies on large-scale "observation-action" datasets, which are costly and limited, hindering progress [4]
- MAESTRO adopts a differentiated approach, utilizing VLM to avoid dependency on robot-specific data and integrating mature specialized tools for enhanced low-level operations [6][5]
- The framework employs a closed-loop interaction mechanism, continuously monitoring environmental feedback to adjust actions in real-time, forming an adaptive cycle of perception, action, and learning [5][6]
Group 2: Core Module Toolset
- The modular design adheres to six principles, addressing diverse robotic operational needs, including perception, control, and geometry [8]
- Key modules include:
  - Perception: Enhances visual information accuracy through a hierarchical approach [10]
  - Control: Integrates Cartesian control and collision-free motion planning for safety [10]
  - Geometry & Linear Algebra: Provides tools for spatial reasoning [10]
  - Image Editing: Improves visual grounding capabilities [10]
  - Mobile Operation Extensions: Adapts to mobile robot scenarios with navigation and active perception tools [10]
Group 3: Evolution Mechanism
- MAESTRO records past task execution codes and outcomes to provide contextual examples for VLM, optimizing code generation and enhancing performance after minimal real-world trials [12]
Group 4: Experimental Results and Performance Analysis
- MAESTRO demonstrated superior performance in desktop operations, significantly outperforming existing VLA models in six out of seven tasks, particularly in semantic reasoning and long-term memory tasks [17]
- In mobile operations, MAESTRO achieved high completion rates, with specific tasks scoring 96.0±8.9 and 93.3±14.9 [17]
- The evolution capability was highlighted by improving task completion from 35% to 85.0±7.4 after three iterations in a door-opening task [17]
Group 5: Key Module Ablation Analysis
- Removing advanced perception modules drastically reduced task completion rates, indicating the importance of precise perception for complex operations [20]
- The absence of geometry modules also negatively impacted performance, underscoring the necessity of spatial reasoning tools [20]
Group 6: Future Directions
- MAESTRO's framework is positioned as an effective alternative to large-scale robotic training paths, with future enhancements aimed at optimizing VLM inference speed, improving low-level control capabilities, and increasing reasoning stability in complex scenarios [22]
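The evolution mechanism in the MAESTRO summary (recording past execution code and outcomes, then feeding them back to the VLM as contextual examples) can be sketched roughly as follows. This is an illustration under assumptions, not MAESTRO's code: the `Attempt` and `AttemptMemory` names, the prompt wording, the robot-program strings, and all scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    task: str
    code: str     # program the VLM generated for this attempt (hypothetical strings below)
    score: float  # task completion in [0, 100], judged after execution


class AttemptMemory:
    """Stores past attempts and surfaces the best ones as in-context examples."""

    def __init__(self):
        self._attempts = []

    def record(self, attempt):
        self._attempts.append(attempt)

    def examples_for(self, task, k=2):
        # Keep only attempts at the same task, best score first.
        same_task = [a for a in self._attempts if a.task == task]
        return sorted(same_task, key=lambda a: a.score, reverse=True)[:k]

    def build_prompt(self, task, k=2):
        parts = [f"Task: {task}"]
        for a in self.examples_for(task, k):
            parts.append(f"Previous attempt (score {a.score:.1f}):\n{a.code}")
        parts.append("Write improved code for the task.")
        return "\n\n".join(parts)


memory = AttemptMemory()
memory.record(Attempt("open the door", "grasp(handle); pull()", 35.0))
memory.record(Attempt("open the door", "grasp(handle); rotate(handle); pull()", 80.0))
memory.record(Attempt("stack blocks", "pick(a); place_on(b)", 90.0))
prompt = memory.build_prompt("open the door", k=1)
```

Because the memory lives entirely in the prompt, improvement after a few trials requires no weight updates, which is consistent with the zero-shot framing in the summary. A real system would likely also include failure descriptions alongside the scores.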
Lessons from Switching Careers into a Major Autonomous Driving Company
自动驾驶之心· 2025-11-04 00:03
Core Insights
- The article emphasizes the importance of seizing opportunities and continuous learning in the rapidly evolving field of autonomous driving [1][4]
- It highlights the creation of a comprehensive community platform, "Autonomous Driving Heart Knowledge Planet," aimed at facilitating knowledge sharing and career development in the autonomous driving sector [4][16]
Group 1: Career Development
- Transitioning to the autonomous driving industry can be successful through dedication and preparation, as illustrated by the experience of a professional who switched careers and excelled in various roles [1]
- Continuous learning and adapting to industry trends are crucial for career advancement, as demonstrated by the professional's progression from algorithm evaluation to advanced safety algorithms [1]
Group 2: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" has over 4,000 members and aims to grow to nearly 10,000 in two years, providing a platform for discussion, technical sharing, and job opportunities [4][16]
- The community offers a variety of resources, including video content, learning pathways, and Q&A sessions, to support both beginners and advanced learners in the autonomous driving field [7][10]
Group 3: Technical Learning and Networking
- The community organizes discussions with industry experts on various topics, including entry points for end-to-end autonomous driving and the integration of multi-sensor fusion [8][20]
- Members have access to a wealth of technical routes and resources, including over 40 technical pathways and numerous datasets relevant to autonomous driving [10][36]
Group 4: Job Opportunities
- The community facilitates job referrals and connections with leading companies in the autonomous driving sector, enhancing members' chances of securing positions in the industry [11][12]
- Regular updates on job openings and industry trends are provided, helping members stay informed about potential career advancements [21][93]
World Model == VQA? Robots Don't Need to Imagine Frames; Predicting Semantics Is Enough
机器之心· 2025-10-28 00:41
Core Insights
- The article discusses the necessity of precise future predictions in world models for AI, questioning whether detailed visual representations are essential for decision-making [1][6]
- It introduces the concept of the Semantic World Model (SWM), which focuses on predicting semantic information about future outcomes rather than generating visual frames [9][18]
Summary by Sections
World Models and Their Limitations
- World models enable AI to learn the dynamics of the world and predict future events based on current states [6]
- Traditional models often generate realistic images but may miss critical semantic details necessary for decision-making [7][8]
Semantic World Model (SWM)
- SWM redefines world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual data [8][9]
- SWM utilizes a visual language model (VLM) to answer questions about future actions and their semantic effects [9][11]
Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert data, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model effectively [22]
Experimental Results
- SWM demonstrated high accuracy in answering future outcome questions and showed generalization capabilities in new scenarios [17]
- In multi-task simulations, SWM significantly improved performance compared to baseline models, achieving success rates of 81.6% in LangTable and 76% in OGBench [30][34]
Generalization and Robustness
- SWM retains the generalization capabilities of the underlying VLM, showing improvements in performance even with new object combinations and background changes [39][41]
- The model's attention mechanisms focus on task-relevant information, indicating its ability to generalize across different scenarios [41]
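To make the State-Action-Question-Answer idea concrete, here is a minimal sketch of how such samples might be cut from a trajectory. Only the SAQA name and its four components come from the summary. The fixed `horizon`, the dictionary states, the yes/no answers, and the oracle functions are assumptions for illustration, and a real system would use images rather than dictionaries.

```python
from dataclasses import dataclass


@dataclass
class SAQA:
    state: dict    # observation at time t (a stand-in for an image)
    actions: list  # actions taken from t up to t + horizon
    question: str
    answer: str


def build_saqa(trajectory, horizon, questions):
    """Turn one trajectory into State-Action-Question-Answer samples.

    trajectory: list of (state, action) pairs
    questions:  mapping from question text to a function that answers it
                from the future state (an oracle, e.g. available in simulation)
    """
    samples = []
    for t in range(len(trajectory) - horizon):
        state, _ = trajectory[t]
        actions = [a for _, a in trajectory[t:t + horizon]]
        future_state, _ = trajectory[t + horizon]
        for text, oracle in questions.items():
            answer = "yes" if oracle(future_state) else "no"
            samples.append(SAQA(state, actions, text, answer))
    return samples


# Toy 1-D trajectory: a gripper moves right toward a block at x = 3.
trajectory = [({"gripper_x": x}, "move_right") for x in range(5)]
questions = {"Is the gripper touching the block?": lambda s: s["gripper_x"] == 3}
dataset = build_saqa(trajectory, horizon=2, questions=questions)
```

Each sample asks about the outcome of an action sequence rather than asking the model to render the future frame. Because the labels come from what actually happened, the same procedure works on non-expert trajectories, in line with the summary's point about low-quality data.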
After Hosting Several Online Sessions, I Found That People Are Still Too Lost
自动驾驶之心· 2025-10-24 00:04
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community called "Autonomous Driving Heart Knowledge Planet," aimed at providing a platform for knowledge sharing and networking in the autonomous driving industry, addressing the challenges faced by newcomers in the field [1][3][14].
Group 1: Community Development
- The community has grown to over 4,000 members and aims to reach nearly 10,000 within two years, providing a space for technical sharing and communication among beginners and advanced learners [3][14].
- The community integrates various resources including videos, articles, learning paths, Q&A, and job exchange, making it a comprehensive hub for autonomous driving enthusiasts [3][5].
Group 2: Learning Resources
- The community has organized over 40 technical learning paths, covering topics such as end-to-end autonomous driving, multi-modal large models, and data annotation practices, significantly reducing the time needed for research [5][14].
- Members can access a variety of video tutorials and courses tailored for beginners, covering essential topics in autonomous driving technology [9][15].
Group 3: Industry Insights
- The community regularly invites industry experts to discuss trends, technological advancements, and production challenges in autonomous driving, fostering a serious content-driven environment [6][14].
- Members are encouraged to engage with industry leaders for insights on job opportunities and career development within the autonomous driving sector [10][18].
Group 4: Networking Opportunities
- The community facilitates connections between members and various autonomous driving companies, offering resume forwarding services to help members secure job placements [10][12].
- Members can freely ask questions regarding career choices and research directions, receiving guidance from experienced professionals in the field [87][89].
Execution Is the Lifeblood of Autonomous Driving Today
自动驾驶之心· 2025-10-17 16:04
Core Viewpoint
- The article discusses the evolving landscape of the autonomous driving industry in China, highlighting the shift in competitive dynamics and the increasing investment in autonomous driving technologies as a core focus of AI development [1][2].
Industry Trends
- The autonomous driving sector has undergone significant changes over the past two years, with new players entering the market and existing companies focusing on improving execution capabilities [1].
- The industry experienced a flourishing period before 2022, when companies with standout technologies could thrive, but has since transitioned into a more competitive environment that emphasizes addressing weaknesses [1].
- Companies that remain active in the market are progressively enhancing their hardware, software, AI capabilities, and engineering implementation to survive and excel [1].
Future Outlook
- By 2025, the industry is expected to enter a "calm period," where unresolved technical challenges in areas like L3, L4, and Robotaxi will continue to present opportunities for professionals in the field [2].
- The article emphasizes the importance of comprehensive skill sets for individuals in the autonomous driving sector, suggesting that those with a short-term profit mindset may not endure in the long run [2].
Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community has been established to provide a comprehensive platform for learning and sharing knowledge in the autonomous driving field, featuring over 4,000 members and aiming to grow to nearly 10,000 in the next two years [4][17].
- The community offers a variety of resources, including video content, learning pathways, Q&A sessions, and job exchange opportunities, catering to both beginners and advanced learners [4][6][18].
- Members can access detailed technical routes and practical solutions for various autonomous driving challenges, significantly reducing the time needed for research and learning [6][18].
Technical Focus Areas
- The community has compiled over 40 technical routes related to autonomous driving, covering areas such as end-to-end learning, multi-modal models, and various simulation platforms [18][39].
- There is a strong emphasis on practical applications, with resources available for data processing, 4D labeling, and engineering practices in autonomous driving [12][18].
Job Opportunities
- The community facilitates job opportunities by connecting members with openings in leading autonomous driving companies, providing a platform for resume submissions and internal referrals [13][22].
Suddenly Noticed: New Players Are Going Public All at Once......
自动驾驶之心· 2025-10-06 04:05
Group 1
- The article highlights a surge in IPO activities within the autonomous driving sector, indicating a significant shift in the industry landscape with new players entering the market [1][2]
- Key events include the acquisition of Shenzhen Zhuoyu Technology by China First Automobile Works, Wayve's partnership with NVIDIA for a $500 million investment, and multiple companies filing for IPOs or completing strategic investments [1]
- The article discusses the intense competition in the autonomous driving field, suggesting that many companies are pivoting towards embodied AI as a response to market saturation [1][2]
Group 2
- The article emphasizes the importance of comprehensive skill sets for professionals remaining in the autonomous driving industry, as the market is expected to undergo significant restructuring [2]
- It mentions the creation of a community platform, "Autonomous Driving Heart Knowledge Planet," aimed at providing resources and networking opportunities for individuals interested in the field [3][19]
- The community offers a variety of learning resources, including video tutorials, technical discussions, and job placement assistance, catering to both beginners and experienced professionals [4][11][22]
Group 3
- The community has gathered over 4,000 members and aims to expand to nearly 10,000 within two years, focusing on knowledge sharing and technical collaboration [3][19]
- It provides structured learning paths and resources for various topics in autonomous driving, including end-to-end learning, multi-sensor fusion, and real-time applications [19][39]
- The platform also facilitates discussions on industry trends, job opportunities, and technical challenges, fostering a collaborative environment for knowledge exchange [20][91]