Vision-Language Models

Robots Teach Themselves New Skills by "Watching Videos": NovaFlow Extracts Action Flow from Generated Videos for Zero-Shot Manipulation
机器之心· 2025-10-09 02:24
The co-first authors of this article are Hongyu Li (PhD student at Brown University) and Lingfeng Sun (researcher at the Robotics and AI Institute, PhD from UC Berkeley). The corresponding author, Jiahui Fu, is a researcher at the Robotics and AI Institute with a PhD from MIT. George Konidaris is an associate professor at Brown University. Building general-purpose robots that can perform diverse tasks in new environments without any task-specific training has long been a holy grail of robotics. In recent years, with the rapid progress of large language models (LLMs) and vision-language models (VLMs), many researchers have pinned their hopes on vision-language-action (VLA) models, expecting them to replicate the remarkable generalization achieved by LLMs and VLMs. The reality, however, falls short of the ideal: the end-to-end training paradigm of VLA models demands massive amounts of robot-specific "vision-language-action" data. Unlike the web-scale data readily available to LLMs and VLMs, robot data is extremely costly and difficult to collect, creating a severe "data bottleneck." Is it possible to bypass this bottleneck and let robots learn new skills without relying on expensive "first-hand" data? Recently, a team from Brown University and the Robotics and AI ...
RoboDexVLM: General Dexterous Robot Manipulation with a VLM-Based Hierarchical Architecture
具身智能之心· 2025-09-26 00:04
RoboDexVLM is a novel robot task-planning and grasp-detection framework for collaborative manipulators equipped with dexterous hands. Existing methods typically focus on simplified and constrained manipulation tasks, overlooking the complexity of grasping diverse objects in a long-horizon manner. In contrast, RoboDexVLM exploits the dexterous hand's ability to grasp objects of different shapes and sizes while executing tasks from natural-language instructions. Its core components are as follows. First, a robust task planner with a task-level recovery mechanism, which uses a vision-language model so the system can parse and execute open-vocabulary instructions to complete long-horizon tasks (a minimal sketch of this planning loop appears after this summary). Second, a language-guided dexterous grasp-perception algorithm based on robot kinematics and formal methods, designed for zero-shot dexterous manipulation across diverse objects and instructions. Comprehensive experiments validate the effectiveness, adaptability, and robustness of RoboDexVLM in handling long-horizon scenarios and performing dexterous grasps. These results highlight the framework's ability to operate in complex environments and demonstrate its potential for open-vocabulary dexterous manipulation. Paper title: RoboDexVLM: Visual Language Model-Enabled Task Planning an ...
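Below is a minimal, hypothetical sketch of what a VLM-driven planner with task-level recovery could look like, loosely following the description above. The skill names, plan format, and recovery strategy are illustrative assumptions, not RoboDexVLM's actual implementation.

```python
# Hypothetical sketch of a VLM-driven task planner with task-level recovery.
# Skill names, the plan format, and the recovery logic are placeholders.
from typing import List

def plan_with_vlm(prompt: str, scene_description: str) -> List[str]:
    """Ask a VLM to decompose an open-vocabulary instruction into primitive skills.
    Stubbed with a fixed plan; a real system would query a vision-language model."""
    return ["move_to mug", "grasp mug", "move_to shelf", "place mug"]

def execute_skill(step: str) -> bool:
    """Execute one primitive skill on the robot; returns False on failure. Stubbed."""
    print(f"executing: {step}")
    return True

def run_task(instruction: str, scene: str, max_replans: int = 2) -> bool:
    """Plan, execute, and re-plan at the task level whenever a step fails."""
    prompt = instruction
    for _ in range(max_replans + 1):
        plan = plan_with_vlm(prompt, scene)
        if all(execute_skill(step) for step in plan):
            return True  # every step succeeded
        # Task-level recovery: ask the VLM for a fresh plan that accounts for the failure.
        prompt = f"{instruction} (recover: the previous attempt failed partway)"
    return False

if __name__ == "__main__":
    run_task("put the mug on the shelf", "a mug and a shelf are on the table")
```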
Brand-New Open-Source Model Reproduces o3-Style Visual Reasoning, Achieving Deep Thinking Without Extensive Training
量子位· 2025-09-15 03:59
Core Viewpoint
- The article discusses the development of Mini-o3, an advanced visual language model (VLM) that enables multi-round visual reasoning, significantly improving upon previous models by allowing for deep reasoning across dozens of steps [1][2][15].

Group 1: Model Development
- Mini-o3 is developed by a collaboration between ByteDance and the University of Hong Kong, designed to perform long-cycle visual search without extensive training resources [13].
- The model can extend its reasoning capabilities from a training limit of 6 rounds to dozens during testing, showcasing its advanced multi-modal reasoning abilities [2][15].

Group 2: Key Design Features
- Mini-o3 incorporates three critical design elements: the VisualProbe dataset for exploratory reasoning, an iterative data collection process for diverse reasoning strategies, and a super-round masking strategy to balance training efficiency with testing scalability [17][19][34].
- The VisualProbe dataset consists of thousands of visual search challenges specifically designed for deep reasoning tasks, enhancing the model's training [17][38].

Group 3: Training Phases
- The training of Mini-o3 occurs in two phases: a cold-start supervised fine-tuning (SFT) phase to activate multi-round tool usage, and a reinforcement learning (RL) phase to optimize interaction rounds [19][25].
- The cold-start SFT phase utilizes a small number of manually constructed samples to generate diverse reasoning trajectories, resulting in approximately 6000 cold-start reasoning paths [24][46].

Group 4: Performance Evaluation
- Mini-o3 outperforms existing models in visual search tasks, achieving the best performance across various benchmarks, including VisualProbe, V*Bench, and HR-Bench [43][44].
- The model's performance is attributed to its ability to maintain complex and deep reasoning trajectories, with significant improvements noted in challenging tasks [44][48].

Group 5: Experimental Insights
- Experiments indicate that removing RL data leads to a performance drop of about 8.6 points on VisualProbe-Hard, highlighting the importance of challenging RL samples for encouraging complex reasoning [45].
- The super-round masking technique effectively enhances RL performance, particularly in multi-round interaction scenarios, by stabilizing the training process and enabling extended reasoning during testing (a minimal sketch of this masking idea follows the summary) [48].

Group 6: Conclusion and Future Directions
- The technical framework of Mini-o3 provides practical guidance for the development of multi-round interactive multi-modal models and their applications in reinforcement learning [52].
- The research team has made all related code open-source, promoting further exploration and development in this field [53].
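To make the super-round masking idea concrete, here is a minimal, hypothetical sketch: rollouts that are truncated because they hit the training round cap are excluded from the policy-gradient loss, so the model is not penalized for needing more rounds than the cap allows. The loss form and all names are illustrative assumptions, not Mini-o3's released code.

```python
# Hypothetical sketch of "super-round masking" in an RL loss: trajectories that
# reach the maximum number of interaction rounds without terminating are masked
# out of the surrogate loss instead of being scored as failures.
import torch

def masked_policy_loss(logprobs, advantages, hit_round_cap):
    """logprobs, advantages: (batch,) tensors, one entry per trajectory.
    hit_round_cap: (batch,) bool tensor, True if the rollout was truncated
    because it reached the round cap during training."""
    keep = (~hit_round_cap).float()             # 1 for completed rollouts, 0 for truncated ones
    per_traj = -(logprobs * advantages) * keep  # REINFORCE-style surrogate, masked
    denom = keep.sum().clamp(min=1.0)           # avoid dividing by zero if all are truncated
    return per_traj.sum() / denom

# Toy usage: the third rollout hit the round cap and is excluded from the loss.
lp = torch.tensor([-1.2, -0.8, -2.0])
adv = torch.tensor([0.5, -0.3, 1.0])
cap = torch.tensor([False, False, True])
print(masked_policy_loss(lp, adv, cap))
```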
How Can Beyond-Visual-Range VLA Be Achieved for Autonomous Driving? XPeng's NavigScene Takes a Novel Path!
自动驾驶之心· 2025-09-04 23:33
Core Viewpoint
- The article discusses the limitations of current autonomous driving systems in bridging the gap between local perception and global navigation, highlighting the introduction of NavigScene as a solution to enhance navigation capabilities in autonomous vehicles [3][4].

Group 1: Research and Development
- Autonomous driving systems have made significant progress in local visual information processing, but they struggle to integrate the broader navigation context used by human drivers [4][9].
- NavigScene is introduced as a navigation-guided natural language dataset that simulates a human-like driving environment within autonomous systems [5][9].
- The development of three complementary paradigms utilizing NavigScene aims to improve reasoning, preference optimization, and the integration of vision-language-action models [5][9].

Group 2: Methodologies
- Navigation-guided reasoning enhances visual language models by incorporating navigation context into prompting methods [5].
- Navigation-guided preference optimization is a reinforcement learning approach that improves visual language model responses by establishing preference relationships based on navigation-related information [5].
- The navigation-guided vision-language-action model integrates navigation guidance and visual language models with traditional end-to-end driving models through feature fusion (see the sketch after this summary) [5].

Group 3: Event and Engagement
- A live session is scheduled to discuss the advancements and methodologies related to NavigScene, emphasizing its role in overcoming the limitations of current autonomous driving systems [4][9].
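As a concrete illustration of the feature-fusion paradigm, here is a minimal, hypothetical sketch that concatenates a navigation-text embedding with pooled visual features before a small driving head. The dimensions, MLP design, and all names are illustrative assumptions rather than NavigScene's actual model.

```python
# Hypothetical sketch of fusing a navigation embedding with visual features
# before an end-to-end driving head. Shapes and the fusion MLP are assumptions.
import torch
import torch.nn as nn

class NavFusionHead(nn.Module):
    def __init__(self, vis_dim=256, nav_dim=128, out_dim=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + nav_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),  # e.g. predicted (steering, acceleration)
        )

    def forward(self, vis_feat, nav_feat):
        # vis_feat: (B, vis_dim) pooled camera features from the driving backbone
        # nav_feat: (B, nav_dim) embedding of the navigation / route instruction text
        return self.fuse(torch.cat([vis_feat, nav_feat], dim=-1))

head = NavFusionHead()
action = head(torch.randn(4, 256), torch.randn(4, 128))
print(action.shape)  # torch.Size([4, 2])
```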
Baidu Vision Technology Department Is Hiring for Multimodal Perception and Understanding (Experienced Hires / Campus / Interns)
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article focuses on recruitment opportunities in the field of video understanding and artificial intelligence, highlighting the responsibilities and requirements for various positions within the company [2][4][5].

Recruitment Responsibilities
- The company is looking for candidates to engage in cutting-edge algorithm research and development for video understanding, specifically targeting tasks such as video question answering, video summarization, temporal action localization, and event detection [2].
- Responsibilities also include building large-scale, high-quality multimodal datasets, distributed training of large models, and collaborating with business teams on practical application and innovation [2].

Job Requirements
- Candidates should possess a master's or doctoral degree in computer science, artificial intelligence, electronic information, automation, or a related field [4].
- Publication experience at top AI conferences or journals is preferred, particularly in areas such as computer vision and multimodal learning [5].

Advantages of Joining
- The company offers a supportive environment with ample hiring capacity for new graduates, interns, and experienced hires, along with competitive salaries and benefits such as mentorship and participation in significant projects [6].

Community and Resources
- The article also mentions a community platform for job seekers in autonomous driving and robotics, providing resources like interview questions, industry reports, and salary negotiation tips [7][19].
Apple's FastVLM Vision-Language Model Opens for Trial: Video Caption Generation Up to 85x Faster
Huan Qiu Wang Zi Xun· 2025-09-02 04:07
Core Insights
- Apple has released a visual language model called FastVLM, which is now available on the Hugging Face platform [1][2].

Group 1: Model Features
- FastVLM offers near-instant high-resolution image processing and can increase video subtitle generation speed by 85 times [2].
- The model is over three times smaller than similar models, enhancing its usability [2].

Group 2: User Experience
- Users can load the lightweight FastVLM-0.5B version directly in their browser, with a loading time of a few minutes on a 16GB M2 Pro MacBook Pro [2].
- Once loaded, the model accurately describes the user's appearance, the room behind them, and surrounding objects [2].

Group 3: Application Potential
- FastVLM runs locally in the browser, ensuring that data never leaves the device; it can even operate offline [2].
- This presents significant potential in wearable devices and assistive technology, where lightweight, low-latency performance is crucial [2].
Say Goodbye to High Latency! SJTU's Prune2Drive: A Token-Pruning Tool for Autonomous-Driving VLMs, 6x Speedup with Performance Preserved
自动驾驶之心· 2025-08-28 23:32
Core Viewpoint
- The article discusses the Prune2Drive framework developed by Shanghai Jiao Tong University and Shanghai AI Lab, which achieves a 6.4x acceleration in visual token processing while reducing performance by only 3%, using a pruning method that eliminates 90% of visual tokens [2][3][25].

Group 1: Research Background and Challenges
- Visual Language Models (VLMs) provide a unified framework for perception, reasoning, and decision-making in autonomous driving, enhancing scene understanding and reducing error propagation [2].
- Deploying VLMs in real driving scenarios faces significant computational challenges due to high-resolution images from multiple cameras, leading to increased inference latency and memory consumption [3].
- Existing token pruning methods are limited in adapting to multi-view scenarios, often neglecting spatial-semantic diversity and the varying contributions of different camera views [4].

Group 2: Prune2Drive Framework
- Prune2Drive introduces a Token-wise Farthest Point Sampling (T-FPS) mechanism, which maximizes the semantic and spatial coverage of multi-view tokens rather than relying solely on individual token significance [6].
- T-FPS uses cosine distance to measure semantic similarity between tokens, ensuring that the selected tokens are non-redundant and semantically rich (a minimal sketch of this selection step follows the summary) [10][11].
- A view-adaptive pruning controller optimizes the pruning ratio for each view, allocating the token budget according to how much each view contributes to driving decisions [11][12].

Group 3: Experimental Design and Results
- Experiments were conducted on two multi-view VLM benchmark datasets (DriveLM, DriveLMM-o1) to validate the performance retention and efficiency gains of Prune2Drive against baseline methods [16].
- Even with a 90% token reduction, the framework maintained a risk-assessment accuracy of 68.34, outperforming several baseline models [22].
- Prune2Drive achieved a 6.4x speedup on the DriveMM model and a 2.64x speedup on the DriveLMM-o1 model [25].

Group 4: Key Findings and Advantages
- Prune2Drive effectively captures critical information in driving scenarios, outperforming other methods at accurately identifying key objects across views [26].
- The framework is plug-and-play, requiring no retraining of VLMs, and is compatible with efficient implementations such as Flash Attention [31].
- It balances performance and efficiency, achieving substantial reductions in computational load while preserving essential semantic information [31].
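To illustrate the T-FPS selection step, here is a minimal, hypothetical sketch of farthest-point sampling in cosine-distance space over one view's tokens: it greedily keeps tokens that are maximally dissimilar to those already kept, so the retained set covers the semantic space instead of ranking tokens individually. The seeding choice, shapes, and function name are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical sketch of token-wise farthest point sampling (T-FPS) with cosine distance.
import torch
import torch.nn.functional as F

def tfps_select(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) visual token embeddings for one camera view.
    Returns indices of the `keep` tokens chosen by farthest-point sampling."""
    t = F.normalize(tokens, dim=-1)            # cosine distance = 1 - dot product of unit vectors
    selected = [0]                             # seed with the first token (could be random)
    min_dist = 1.0 - t @ t[0]                  # distance of every token to the selected set
    for _ in range(keep - 1):
        nxt = int(torch.argmax(min_dist))      # token farthest from the current selection
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, 1.0 - t @ t[nxt])
    return torch.tensor(selected)

# Toy usage: keep 10% of 640 tokens from one view.
idx = tfps_select(torch.randn(640, 256), keep=64)
print(idx.shape)  # torch.Size([64])
```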
Mass-Producing "Risk" from Real-World Scenes! VLM + Diffusion Models Enable Extreme Stress Testing
具身智能之心· 2025-08-26 00:03
Editor | 新智元. Recently, 懂车帝's《懂车智炼场》program ran safety-critical scenario tests on the NOA driver-assistance functions of mass-produced autonomous driving systems. The results show that in high-risk scenarios such as a construction site at night, an accident involving vehicles ahead on the highway, or a vehicle suddenly pulling out from behind an obstacle, no system under test was able to avoid accidents entirely. Such safety-critical scenarios are rare on real roads, but when they do occur they can lead to casualties or serious traffic accidents. To improve the reliability of autonomous driving systems in these situations, extensive testing across diverse, high-risk safety-critical scenarios is required. However, such extreme scenarios are extremely hard to collect in the real world: they occur rarely, carry high risk, and are difficult to obtain at scale. In simulation, similar scenarios can be mass-produced, but existing simulators still lag behind reality in visual fidelity, making them hard to use directly for stress-testing real-domain end-to-end systems. To address this, a research team from Zhejiang University and Harbin Institute of Technology (Shenzhen) proposed SafeMVDrive, the first real-domain multi-view safety-critical driving video generation framework. It combines a VLM key-vehicle selector with a two-stage trajectory generation ...
Junpu Intelligent Gradually Diversifies Its Business; Embodied Intelligence Robot Business Achieves Breakthrough Progress
Zheng Quan Ri Bao Wang· 2025-08-23 04:13
Core Insights
- Junpu Intelligent achieved a revenue of 1.032 billion yuan in the first half of 2025, with a backlog of orders amounting to 3.464 billion yuan, indicating stable business development [1].
- The company secured new orders worth 1.112 billion yuan, representing a year-on-year growth of 20.22%, with non-automotive orders in the medical and high-end consumer goods sectors reaching 445 million yuan, accounting for approximately 40% of total new orders [1].

Group 1: Medical Sector Developments
- In the medical health sector, Junpu Intelligent won a project for the production line of continuous glucose monitoring (CGM) sensors for an internationally leading diagnostic equipment manufacturer, with an annual design capacity of 15 million units [1].
- The company established a strategic partnership with a leading domestic medical enterprise to jointly develop key platform cam technology for insulin injection pens [1].
- The acquisition of its first fully automated production line project for insulin injection pens and automatic injectors signifies market recognition of Junpu Intelligent's technological strength in intelligent manufacturing of high-value medical consumables [1].

Group 2: High-End Consumer Goods Innovations
- In the high-end consumer goods sector, Junpu Intelligent's self-developed "multi-blade intelligent assembly process" was successfully applied to a razor-blade assembly order for an international brand [1].
- The company received an order for a flexible assembly line for high-end electric toothbrush drive units, which drew high praise from the client [1].

Group 3: Robotics Advancements
- Junpu Intelligent's humanoid robot "Jarvis 2.0" completed a multimodal upgrade, integrating AI models such as large language models (LLM) and visual language models (VLM), enabling multilingual dialogue, voice command control, and vision-guided object handling [2].
- The "Jarvis Lightweight 1.0" version has been officially delivered to Tsinghua University and other institutions for research and teaching purposes [2].
- The joint venture between Junpu Intelligent's Ningbo Junpu Artificial Intelligence and Humanoid Robot Research Institute and Zhiyuan Robotics has officially commenced operations, with its first mass-production pilot line already in production [2].
- By the end of June, the joint venture had received over 28 million yuan in orders for humanoid robot production and sales, with three models of embodied intelligent robots currently in production [2].
Helped Yet Another Student Land a VLA Algorithm Role......
具身智能之心· 2025-08-22 16:03
Yesterday afternoon, a student with a pretty solid background, at a C9 university and about to enter the third year of a master's program, came to Brother Feng to vent while going through fall recruiting: a labmate had landed a VLA algorithm position (at a particularly well-funded embodied AI company), "and I'm afraid it's too late for me to switch......" The two of them had started out together in traditional robotics, doing SLAM-related work. "Then, I don't know what projects he did, but he moved so fast and passed interviews at several companies. My labmate only recommended your community to me in the past couple of days; the curriculum is very complete, I'm just worried it's a bit late." Throughout August, students kept reaching out to Brother Feng, either having just received verbal offers or worried that it was too late to switch into embodied AI. Although fall recruiting is almost here, the same saying still applies: "It is never too late." The top priority is to fill in a complete embodied AI roadmap as soon as possible, especially data collection, algorithms, and simulation. If you do not have strong skills in independent learning and tracking down answers, you can join our embodied AI community, the 具身智能之心 Knowledge Planet, currently the largest and most comprehensive embodied AI learning platform in China. It combines video, articles, learning roadmaps, Q&A, and job-hunting exchange into one comprehensive embodied AI community with nearly 2,000 members, and we aim to grow to nearly 10,000 within the next two years, building a hub for exchange and technical sharing that many beginners and more advanced students visit regularly. The community also regularly answers all kinds of practical questions: How do you operate the equipment? How do you collect data effectively? How do you deploy VA and VLA models? Is the capture background too complex, or is the data rather dirt ...