Vision-Language Models
Learning by Playing? Game Code-Driven Data Synthesis Improves General Reasoning in Multimodal Large Models
机器之心· 2025-07-04 08:59
Core Insights
- The article presents a novel approach called Code2Logic, which utilizes game code to synthesize multimodal reasoning data, enhancing the reasoning capabilities of visual language models (VLMs) [47][48].
- The research indicates that training AI using game scenarios can significantly improve its performance in geometric and graphical reasoning tasks [1][24].

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, prompting the need for a cost-effective method to generate such data [4].
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13].

Methodology
- The Code2Logic method involves three core steps: generating game code using large language models (LLMs), designing question-answer templates from the game code, and constructing an automated data engine to generate Q&A instances (a minimal sketch of this template-driven generation follows this summary) [13][14][15].
- The GameQA dataset created through this method encompasses 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18].

Training and Performance
- Training on GameQA data leads to significant performance improvements in both in-domain and out-of-domain tasks, demonstrating the generalization capabilities of models trained with this dataset [24][25].
- The study reveals that models trained with GameQA outperform those trained on traditional geometric reasoning datasets, indicating the cognitive diversity and reasoning complexity inherent in game data [28][29].

Scaling Effects
- The research identifies two scaling effects: increased game variety enhances out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38].
- These findings suggest that the diversity and scalability of GameQA contribute to stronger generalization in reasoning tasks [39].

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45].
- The study emphasizes the need for further improvements in models' abilities to handle complex reasoning tasks effectively [46].
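A minimal sketch of the template-driven data engine described above, built around a toy, hand-written box-pushing rule. The game, template text, and field names are illustrative assumptions, not the actual Code2Logic/GameQA implementation, which also renders each game state into an image paired with the question:

```python
import random

def simulate_box_push(width=5, height=5, seed=0):
    """A tiny hand-written game rule (the 'ground-truth logic'): place a box
    on a grid, pick a push direction, and compute where it ends up."""
    rng = random.Random(seed)
    x, y = rng.randrange(1, width - 1), rng.randrange(1, height - 1)
    dx, dy, name = rng.choice([(1, 0, "right"), (-1, 0, "left"),
                               (0, 1, "down"), (0, -1, "up")])
    return {"box": (x, y), "direction": name, "result": (x + dx, y + dy)}

QA_TEMPLATE = {
    "question": ("The box is at {box} on a {w}x{h} grid. "
                 "If it is pushed one step {direction}, where does it end up?"),
    "answer": "It ends up at {result}.",
}

def make_qa(seed, width=5, height=5):
    """Instantiate the template from a simulated game state, yielding a Q&A
    pair whose answer is guaranteed correct by the game code itself."""
    state = simulate_box_push(width, height, seed)
    question = QA_TEMPLATE["question"].format(box=state["box"], w=width,
                                              h=height, direction=state["direction"])
    answer = QA_TEMPLATE["answer"].format(result=state["result"])
    return {"question": question, "answer": answer}

# The engine scales by sweeping seeds and, in the real pipeline, games and task templates.
for s in range(3):
    print(make_qa(s))
```

Sweeping seeds, games, and task templates is the property the article credits for GameQA's scale of 30 games and 140,000 Q&A pairs.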
What Exactly Is Goal-Oriented Navigation, This Year's Hot Topic? What Are the Routes from Target Search to Target Reaching?
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- Goal-Oriented Navigation empowers robots to autonomously complete navigation tasks based on goal descriptions, marking a significant shift from traditional visual language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, relying on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-Oriented Navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized in various verticals, including delivery, healthcare, and hospitality, enhancing service efficiency [3].

Group 2: Technological Evolution
- The evolution of Goal-Oriented Navigation can be categorized into three generations:
  - First Generation: End-to-end methods focusing on reinforcement learning and imitation learning, achieving breakthroughs in Point Navigation and closed-set image navigation tasks [5].
  - Second Generation: Modular methods that explicitly construct semantic maps, breaking tasks into exploration and goal localization (see the sketch after this summary) [5].
  - Third Generation: Integration of large language models (LLMs) and visual language models (VLMs) to enhance knowledge reasoning and open-vocabulary target matching [7].

Group 3: Challenges and Learning Path
- The complexity of embodied navigation, particularly Goal-Oriented Navigation, necessitates knowledge from multiple fields, making it challenging for newcomers to enter the domain [9].
- A new course has been developed to address these challenges, focusing on quick entry, building a research framework, and combining theory with practice [10][11][12].

Group 4: Course Structure
- The course will cover the theoretical foundations and technical lineage of Goal-Oriented Navigation, including task definitions and evaluation benchmarks [15].
- It will also delve into the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][20][22].
- A significant project will focus on the reproduction of VLFM algorithms and their deployment in real-world scenarios [24].
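As referenced in Group 2, here is a minimal sketch of a second-generation modular pipeline: exploration and goal localization on an explicit semantic map, followed by path planning. The grid world, object layout, and greedy planner are illustrative assumptions, not any specific benchmark or codebase:

```python
import heapq

GRID = 8
OBJECTS = {(6, 7): "chair", (2, 5): "plant"}          # hidden world state

def visible(pose, radius=2):
    """Detections within a small square field of view around the agent."""
    px, py = pose
    return {label: cell for cell, label in OBJECTS.items()
            if abs(cell[0] - px) <= radius and abs(cell[1] - py) <= radius}

def next_step(start, goal):
    """Greedy best-first search on the empty grid (a stand-in for a real
    planner); returns the first cell to move to on the way to goal."""
    frontier, came = [(0, start)], {start: None}
    while frontier:
        _, cur = heapq.heappop(frontier)
        if cur == goal:
            while came[cur] not in (None, start):
                cur = came[cur]
            return cur
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID and nxt not in came:
                came[nxt] = cur
                heuristic = abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1])
                heapq.heappush(frontier, (heuristic, nxt))
    return start

def navigate(goal_label, pose=(0, 0), max_steps=60):
    semantic_map = {}                                  # label -> cell, built online
    for step in range(max_steps):
        semantic_map.update(visible(pose))             # perception updates the map
        target = semantic_map.get(goal_label)
        if target == pose:
            return step                                # goal localization succeeded
        # head for the goal if it is on the map, otherwise explore toward the far corner
        pose = next_step(pose, target if target else (GRID - 1, GRID - 1))
    return -1

print("steps to reach the chair:", navigate("chair"))
```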
Latest from Shanghai Jiao Tong University! DyNaVLM: A Zero-Shot, End-to-End Navigation Framework
具身智能之心· 2025-06-22 10:56
Core Viewpoint
- The article discusses the development of DyNaVLM, a zero-shot, end-to-end navigation framework that integrates vision-language models (VLM) to enhance navigation capabilities in dynamic environments, overcoming limitations of traditional methods [4][5].

Group 1: Introduction and Optimization Goals
- Navigation is a fundamental capability in autonomous agents, requiring spatial reasoning, real-time decision-making, and adaptability to dynamic environments. Traditional methods face challenges in generalization and scalability due to their modular design [4].
- The advancement of VLMs offers new possibilities for navigation by integrating perception and reasoning within a single framework, although their application in embodied navigation is limited by spatial granularity and contextual reasoning capabilities [4].

Group 2: Core Innovations of DyNaVLM
- **Dynamic Action Space Construction**: DyNaVLM introduces a dynamic action space that allows robots to determine navigation goals based on visual information and language instructions, enhancing movement flexibility in complex environments [6].
- **Collaborative Graph Memory Mechanism**: Inspired by retrieval-augmented generation (RAG), this mechanism enhances memory management for better navigation performance [8].
- **No-Training Deployment Mode**: DyNaVLM can be deployed without task-specific fine-tuning, reducing deployment costs and improving generalization across different environments and tasks [8].

Group 3: System Architecture and Methodology
- **Problem Formalization**: The system takes inputs such as target descriptions and RGB-D observations to determine appropriate actions, maintaining a memory function to extract spatial features [11].
- **Memory Manager**: This component connects the VLM and graph-structured memory, capturing spatial relationships and semantic object information [12].
- **Action Proposer and Selector**: The action proposer simplifies the continuous search space into discrete candidates, while the selector generates final navigation actions based on geometric candidates and contextual memory (a minimal sketch of this loop follows this summary) [14][15].

Group 4: Experimental Evaluation
- **Simulation Environment Evaluation**: DyNaVLM achieved a success rate (SR) of 45.0% and a path-length-weighted success rate (SPL) of 0.232 in ObjectNav benchmarks, outperforming previous VLM frameworks [19][22].
- **Real-World Evaluation**: DyNaVLM demonstrated superior performance in real-world settings, particularly in tasks requiring the identification of multiple targets, showcasing its robustness and efficiency in dynamic environments [27].
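A minimal sketch of the proposer/selector loop with a graph memory, as referenced above. The class names and the distance-based scoring stub are illustrative assumptions; in DyNaVLM itself the final choice is made by the VLM, conditioned on the current image, the instruction, and memory retrieved RAG-style:

```python
from dataclasses import dataclass, field

def _dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

@dataclass
class GraphMemory:
    """Graph-structured memory: nodes are observed objects, edges record
    pairwise distances (a stand-in for richer spatial/semantic relations)."""
    nodes: dict = field(default_factory=dict)   # object label -> (x, y)
    edges: list = field(default_factory=list)   # (label_a, label_b, distance)

    def add_observation(self, label, pos):
        for other, other_pos in self.nodes.items():
            self.edges.append((label, other, _dist(pos, other_pos)))
        self.nodes[label] = pos

def propose_actions(pose, step=1.0):
    """Action proposer: collapse the continuous search space into a handful
    of discrete candidate waypoints around the current pose."""
    x, y = pose
    return [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]

def select_action(candidates, memory, goal_label):
    """Selector stub: pick the candidate closest to the remembered goal.
    In the real system this decision is made by the VLM."""
    goal = memory.nodes.get(goal_label)
    if goal is None:
        return candidates[0]          # goal not in memory yet: keep exploring
    return min(candidates, key=lambda c: _dist(c, goal))

memory = GraphMemory()
memory.add_observation("sofa", (3.0, 1.0))
memory.add_observation("television", (4.0, 2.5))
pose = (0.0, 0.0)
for _ in range(5):                    # a few control steps toward the goal
    pose = select_action(propose_actions(pose), memory, "television")
print("final pose:", pose)
```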
Wanma Technology 20250612
2025-06-12 15:07
Summary
- Wanma Technology entered the vehicle telematics (车联网) business by acquiring Youfang Technology (有方科技); telematics revenue grew from RMB 50 million in 2021 to RMB 260 million in 2024, profits rose significantly, and the company has built a complete data closed-loop toolchain and an intelligent-driving compute center.
- Domestic telematics penetration is around 80%, while overseas penetration is below 30%. As intelligent driving raises data demand, both domestic and overseas markets have substantial room to grow; Robotaxi in particular imposes higher requirements for real-time data monitoring and technology, with a marked increase in per-vehicle value.
- Youka Technology (优卡科技) offers two solutions, the "Blue Ocean" (蓝海) global vehicle connectivity platform and a cloud-based autonomous-driving data closed loop, supporting 14 million vehicles; customers include Geely, SAIC, Dongfeng, and Li Auto, and it supports Robotaxi companies' business deployments worldwide.
- Robotaxi is regarded as the "crown jewel" of the telematics industry; Goldman Sachs forecasts an annualized growth rate of 96% for China's Robotaxi market. Regular operations are already underway in Beijing, Wuhan, and Guangzhou, as well as Hong Kong and Dubai, and Tesla is about to launch a related service.
- Robotaxi operations place extremely high demands on network quality, spanning operational safety, user interaction, compliance, autonomous-driving data collection, and operations and maintenance, requiring high-definition maps, vehicle-road coordination, remote rescue of stranded vehicles, and support for massive data volumes.
- ...data monitoring demand is high, with correspondingly higher requirements for technology and data volume; in terms of per-vehicle value ...
CICC "Research Reports in Seconds" (《秒懂研报》) | Intelligent Driving: A New Era Leading the Transformation of Mobility
中金点睛· 2025-05-24 08:32
Group 1: Core Viewpoints
- The article discusses the rapid development and potential of intelligent driving technology, highlighting its transformative impact on urban mobility and the automotive industry [1][2][3].

Group 2: Technology Engine Behind Intelligent Driving
- The end-to-end architecture is a significant innovation in intelligent driving, reducing data annotation difficulty and optimizing data processing through unique algorithms, which enhances vehicle responsiveness to road conditions [2][3].
- The introduction of visual language models and cloud models improves the system's ability to handle complex scenarios, akin to equipping vehicles with sharper "eyes" [3].

Group 3: Current Development of Intelligent Driving
- The high-speed Navigation on Autopilot (NOA) feature is expected to be scaled up in 2024, becoming standard on intelligent driving vehicles priced above 200,000 yuan [5].
- The penetration rate of urban NOA is projected to reach 6.5% in 2024, driven by increased consumer acceptance and reduced costs, expanding its availability to more consumers [7].

Group 4: Business Model of Intelligent Driving
- L2++ intelligent driving software faces challenges in charging fees due to low consumer willingness to pay, leading most automakers to offer the systems as standard equipment in order to accumulate users and data [11].
- Some leading automakers are exploring buyout or subscription payment models, with promotional activities to attract customers [11][12].

Group 5: Benefits of Urban NOA
- Urban NOA is expected to drive sales of highly configured, high-margin models, as consumers are likely to prefer higher-end vehicles once the technology gains market acceptance [13][14].
- The overlap in technology requirements between Robotaxi and urban NOA is anticipated to enhance intelligent driving system capabilities, potentially leading to a shift toward mobility services by 2025 [15].

Group 6: Globalization of the Intelligent Driving Industry
- China's late start in intelligent driving is offset by rapid development, with domestic companies gaining advantages in technology and production experience, positioning them favorably in the global market [16].
- Collaborations between joint-venture automakers and domestic intelligent driving companies are expected to facilitate access to international projects and opportunities for global expansion [16][17].
The Race and Hidden Battle in Intelligent Assisted Driving: Self-Development vs. Partnership Camps, with Widening Divergence in Feature Capability
Bei Ke Cai Jing· 2025-05-22 10:37
Core Insights
- The article discusses the advancements and competitive landscape of the assisted driving industry, highlighting various companies' self-developed systems and strategies [1][4].

Group 1: Company Developments
- Li Auto has launched its new generation dual-system intelligent driving solution, focusing on upgrading driving capabilities and synchronizing updates for smart electric vehicles [3].
- NIO's intelligent assisted driving system has reportedly avoided over 3.5 million collision risks, accumulating a total driving mileage of approximately 4.94 billion kilometers as of May 15, 2025 [3].
- Chery's Hawk 500 has achieved widespread adoption of assisted driving features, with the Hawk 700 targeting mid-to-high-end models and the Hawk 900 positioned as a flagship [3].
- GAC Group's GSD intelligent driving assistance system has accumulated 5 million user driving scenarios and over 40 million kilometers of high-level autonomous driving data [3].

Group 2: Industry Trends
- BYD and XPeng are recognized as leaders in self-developed intelligent driving systems, with BYD's high-end system named "Tianshen Eye" [4].
- Bosch's China president has expressed skepticism about the self-development model, suggesting that mid-level intelligent driving should become standard and that costs could be better managed through supply chain partnerships [4].
- Huawei is positioned as a top player in the intelligent driving system market, with plans for 10 brands from 7 automakers to adopt its solutions, potentially exceeding 500,000 vehicles [4][5].
- Huawei's collaboration models include component supply, Huawei Inside (HI) partnerships, and deep cooperation with automakers, with the latter being the most integrated approach [5].

Group 3: Strategic Partnerships
- SAIC Group has publicly stated its intention to maintain control over core technologies while also choosing to collaborate with Huawei [6].
- The partnerships with Huawei have led to increased sales for collaborating automakers, but questions remain about their ability to independently develop high-quality vehicles [6].
85x Faster: Apple Open-Sources FastVLM, a Vision-Language Model That Runs Directly on iPhone
机器之心· 2025-05-16 16:31
Core Viewpoint
- Apple has open-sourced FastVLM, an efficient vision-language model that can run directly on iPhones, significantly enhancing visual understanding capabilities [2][6].

Group 1: Model Features and Performance
- FastVLM addresses size and speed issues, achieving an 85-fold increase in the speed of the first token output compared to traditional models [6].
- The model uses a new hybrid visual encoder, FastViTHD, which combines convolutional layers and transformer modules, reducing the number of visual tokens needed for image processing by 16 times compared to traditional ViT and 4 times compared to FastViT (a minimal sketch of this token-reduction idea follows this summary) [6][16].
- FastVLM is available in three parameter sizes: 0.5B, 1.5B, and 7B, each with stage 2 and stage 3 fine-tuning weights [7].

Group 2: Technical Innovations
- The research emphasizes the importance of image resolution in VLM performance, particularly for text- and data-dense tasks, while also addressing the challenges of high-resolution image processing [12][13].
- FastViTHD is specifically designed to enhance VLM efficiency when processing high-resolution images, achieving significant improvements in accuracy and latency compared to existing methods [16][33].
- The model architecture includes five stages, with a total parameter count of 125.1M, which is smaller than most mainstream ViT architectures while maintaining competitive performance [36][37].

Group 3: Efficiency and Optimization
- FastVLM demonstrates superior performance in accuracy-latency trade-offs, outperforming previous models such as ViT and FastViT under various conditions [46][47].
- The model's design allows for dynamic input-resolution adjustments, optimizing performance based on the specific task and hardware capabilities [48][49].
- FastVLM's performance surpasses traditional token pruning methods, achieving lower visual token counts while maintaining higher accuracy [50][51].
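A minimal sketch of the token-reduction idea referenced above: convolutional downsampling stages shrink the feature map before a small transformer emits the visual tokens handed to the language model. Layer sizes and module structure are illustrative assumptions, not Apple's actual FastViTHD architecture:

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Toy hybrid encoder: convolutional stages aggressively downsample the
    image before a small transformer produces the visual tokens passed to
    the language model. Hypothetical sizes, not FastViTHD."""

    def __init__(self, embed_dim=768, num_heads=12, depth=4):
        super().__init__()
        chans = [3, 64, 128, 256, embed_dim]
        # Four stride-2 conv stages: a 1024x1024 image becomes a 64x64 grid.
        self.conv_stages = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.GELU(),
            )
            for i in range(4)
        ])
        # Extra stride-4 pooling shrinks the grid another 16x
        # (64x64 = 4096 tokens -> 16x16 = 256 tokens).
        self.pool = nn.Conv2d(embed_dim, embed_dim, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                        # (B, 3, H, W)
        feats = self.pool(self.conv_stages(images))   # (B, C, H/64, W/64)
        tokens = feats.flatten(2).transpose(1, 2)     # (B, N, C) visual tokens
        return self.transformer(tokens)

encoder = HybridVisionEncoder()
visual_tokens = encoder(torch.randn(1, 3, 1024, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 768])
```

Handing the language model a few hundred visual tokens instead of several thousand is the kind of reduction that shortens time-to-first-token on high-resolution inputs.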
365 Days of a Hundred Models Racing: Hugging Face's Annual Review Reveals the VLM Capability Curve and Its Inflection Points | Jinqiu Select
锦秋集· 2025-05-16 15:42
Core Insights
- The article discusses the rapid evolution of visual language models (VLMs) and highlights the emergence of smaller yet powerful multimodal architectures, showcasing advancements in capabilities such as multimodal reasoning and long-video understanding [1][3].

Group 1: New Model Trends
- The article introduces the concept of "any-to-any" models, which can input and output various modalities (images, text, audio) by aligning different modalities [5][6].
- New models like Qwen 2.5 Omni and DeepSeek Janus-Pro-7B exemplify the latest advancements in multimodal capabilities, enabling seamless input and output across different modalities [6][10].
- The trend of smaller, high-performance models (Smol Yet Capable) is gaining traction, promoting local deployment and lightweight applications [7][15].

Group 2: Reasoning Models
- Reasoning models are emerging in the VLM space, capable of solving complex problems, with notable examples including Qwen's QVQ-72B-preview and Moonshot AI's Kimi-VL-A3B-Thinking [11][12].
- These models are designed to handle long videos and various document types, showcasing their advanced reasoning capabilities [14].

Group 3: Multimodal Safety Models
- The need for multimodal safety models is emphasized; they filter inputs and outputs to prevent harmful content, with Google launching ShieldGemma 2 as a notable example [31][32].
- Meta's Llama Guard 4 is highlighted as a dense multimodal safety model that can filter outputs from visual language models [34].

Group 4: Multimodal Retrieval-Augmented Generation (RAG)
- The development of multimodal RAG is discussed, which enhances the retrieval process for complex documents, allowing for better integration of visual and textual data [35][38].
- Two main architectures for multimodal retrieval are introduced: DSE models and ColBERT-like models, each with a distinct approach to processing and returning relevant information (a minimal sketch of ColBERT-style scoring follows this summary) [42][44].

Group 5: Multimodal Intelligent Agents
- The article highlights the emergence of vision-language-action models (VLA) that can interact with physical environments, with examples like π0 and GR00T N1 showcasing their capabilities [21][22].
- Recent advancements in intelligent agents, such as ByteDance's UI-TARS-1.5, demonstrate the ability to navigate user interfaces and perform tasks in real time [47][54].

Group 6: Video Language Models
- The challenges of video understanding are addressed, with models like Meta's LongVU and Qwen2.5VL demonstrating advanced capabilities in processing video frames and understanding temporal relationships [55][57].

Group 7: New Benchmark Testing
- The article discusses the emergence of new benchmarks like MMT-Bench and MMMU-Pro, aimed at evaluating VLMs across a variety of multimodal tasks [66][67][68].
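A minimal sketch of the ColBERT-style (late-interaction) scoring used by the second retrieval architecture mentioned in Group 4: each query-token embedding is matched against every page-patch embedding, and the per-token maxima are summed. The embedding shapes and random tensors are illustrative stand-ins for a real multimodal retriever's outputs:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb, page_emb):
    """ColBERT-style MaxSim: each query-token embedding picks its best-matching
    page-patch embedding, and the per-token maxima are summed.
    query_emb: (Q, D), page_emb: (P, D), both L2-normalized."""
    sim = query_emb @ page_emb.T           # (Q, P) cosine similarities
    return sim.max(dim=1).values.sum()     # MaxSim over patches, summed over query tokens

def rank_pages(query_emb, page_embs):
    """Return page indices sorted from most to least relevant."""
    scores = torch.stack([late_interaction_score(query_emb, p) for p in page_embs])
    return scores.argsort(descending=True)

# Random embeddings stand in for the outputs of a multimodal late-interaction
# retriever (a ColPali-style model would produce per-patch page embeddings).
torch.manual_seed(0)
query = F.normalize(torch.randn(8, 128), dim=-1)      # 8 query-token embeddings
pages = [F.normalize(torch.randn(196, 128), dim=-1)   # 196 patch embeddings per page
         for _ in range(3)]
print(rank_pages(query, pages))
```

By contrast, DSE-style models collapse each page into a single dense vector, trading this fine-grained matching for cheaper storage and lookup.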
Apple Releases FastVLM, an Ultra-Fast Vision-Language Model That Runs on iPhone; Kunlun Wanwei Announces Open-Sourcing of the Matrix-Game Large Model | AIGC Daily
创业邦· 2025-05-13 23:52
1. [Kunlun Wanwei announces the official open-sourcing of the Matrix-Game large model] On May 13, according to Kunlun Wanwei, the company officially open-sourced the (17B+) Matrix-Game large model, the interactive video generation model within the Matrix-Zero world model. Matrix-Game is the Matrix series' first formal release in the direction of interactive world generation and the industry's first open-source 10B+ spatial intelligence model; it is an interactive world foundation model for game-world modeling, designed for high-quality generation and precise control in open-ended environments. (Yicai)
2. [Baixing Intelligence (百型智能) launches China's first vertical agent for the foreign-trade industry] Baixing Intelligence has launched Zoe, an "AI foreign-trade salesperson" and China's first vertical agent for the foreign-trade industry. Zoe can break tasks down according to a company's goals and independently complete the entire foreign-trade customer-acquisition pipeline, from market analysis, prospecting, and precise screening to outreach and conversion follow-up, with a conversion rate reportedly more than 10 times that of traditional manual approaches. (Cailian Press)
3. [Volcano Engine releases the Doubao video generation model Seedance 1.0 lite] Volcano Engine released the Doubao Seedance 1.0 lite video generation model and the Doubao 1.5 visual deep-thinking model, and upgraded the Doubao music model, providing a more comprehensive model matrix and richer agent tools to help enterprises connect their business to agent applications. Officials stated that the newly released Doubao video generation model ...
ICML 2025 | New SOTA for Long-Video Understanding! Ant Group & Renmin University Open-Source ViLAMP-7B, Capable of Processing 3-Hour Videos on a Single GPU
机器之心· 2025-05-12 09:06
The first author of this work is Cheng Chuanqi, a master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, currently interning at Ant Research; his main research area is multimodal large models. Guan Jian, an associate researcher at Ant Research, is co-first author.

With vision-language models (VLMs) making breakthrough progress, the challenge of long-video understanding has become increasingly important. Take standard-definition video at the standard 24 fps: just a few minutes of footage produces over a million visual tokens, far beyond the 4K-128K context limits of mainstream large language models. For film-length video content, the shortcomings of traditional solutions become even more apparent: coarse frame-sampling strategies often miss key-frame information, while feature-fusion methods reduce data dimensionality but inevitably damage semantic completeness.

Recently, a research team from Ant Group and Renmin University of China proposed an innovative solution: the large vision-language model ViLAMP (Video-Language Model with Mixed Precision), which processes ultra-long videos efficiently. The core of the method is its distinctive "mixed-precision" strategy: key content in the video is analyzed at high precision, while secondary content is aggressively compressed, much as a human viewer focuses on key scenes and only quickly scans transitional spatio-temporal information.

Paper title: Scaling Vi ...
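A minimal sketch of the mixed-precision idea described above, under the assumption that keyframes keep their full patch-token grid while all other frames are mean-pooled to a single token. This illustrates the token-budgeting principle, not the paper's actual ViLAMP algorithm:

```python
import torch

def mixed_precision_tokens(frame_feats, keyframe_idx):
    """Toy mixed-precision compression for long video: keyframes keep their
    full patch-token grid, every other frame is mean-pooled to one token.
    frame_feats: (T, P, D) patch features for T frames with P patches each."""
    keep = set(keyframe_idx.tolist())
    out = []
    for t, patches in enumerate(frame_feats):
        if t in keep:
            out.append(patches)                        # (P, D) kept at full precision
        else:
            out.append(patches.mean(0, keepdim=True))  # (1, D) heavily compressed
    return torch.cat(out, dim=0)

# Toy example: 30 seconds of 24 fps video = 720 frames, 196 patches per frame.
T, P, D = 720, 196, 256
feats = torch.randn(T, P, D)
keyframes = torch.arange(0, T, 48)   # assume one keyframe every two seconds
tokens = mixed_precision_tokens(feats, keyframes)
print(tokens.shape)                  # 15*196 + 705*1 = 3645 tokens vs. 141,120 uncompressed
```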