视觉语言模型（VLMs） - filings, earnings calls, financial reports, news - Reportify

视觉语言模型（VLMs）

Search documents

速递｜成立两年估值6亿美元：AI文档Reducto完成7500万美元B轮融资，月收入七倍增长

Z Potentials· 2025-10-16 03:03

Core Insights - Reducto combines traditional Optical Character Recognition (OCR) technology with emerging AI techniques to enhance document understanding, attracting significant investment interest from firms like Andreessen Horowitz [2][3]. Funding and Valuation - Reducto recently raised $75 million in Series B funding, tripling its valuation to $600 million, with total funding reaching $108 million and a cash reserve exceeding $100 million [3][4]. Technology and Application - The company integrates traditional OCR with Visual Language Models (VLMs) to better interpret complex documents in sectors like finance, healthcare, law, and insurance [4][5]. - Reducto's technology addresses the limitations of traditional OCR by effectively handling intricate document formats that often confuse standard software [5][6]. Performance and Growth - The company claims its software is more accurate than traditional OCR solutions, securing clients such as legal AI startup Harvey and data annotation company Scale AI [9][10]. - Reducto's monthly revenue has increased sevenfold compared to the previous year, indicating strong growth [11].

光学字符识别技术（OCR）

视觉语言模型（VLMs）

Artificial Intelligence

光学字符识别技术（OCR）

视觉语言模型（VLMs）

Artificial Intelligence

普林斯顿大学最新！VLM2VLA：将 VLM 微调为 VLA，并避免灾难性遗忘

具身智能之心· 2025-10-07 10:00

Core Insights - The article discusses the catastrophic forgetting problem in the context of fine-tuning Visual Language Models (VLMs) into Visual Language Action Models (VLAs) for robotic control, highlighting the mismatch between pre-training and fine-tuning data distributions [2][4]. Group 1: Catastrophic Forgetting - Catastrophic forgetting occurs when the model loses its original reasoning and multimodal understanding capabilities during the action generation training process [2]. - The root cause of this issue is the distribution mismatch between the internet-scale pre-training data (primarily image-text pairs) and the low-dimensional action vector data used for robotic fine-tuning [2]. Group 2: VLM2VLA Approach - VLM2VLA aims to address the distribution mismatch by converting low-dimensional actions into natural language descriptions, aligning the fine-tuning data with the pre-training data [3][4]. - The method employs low-rank adaptation (LoRA) for fine-tuning, minimizing modifications to the VLM backbone and avoiding catastrophic forgetting [4]. Group 3: Hierarchical Action Representation - The VLM2VLA framework decomposes action prediction into a three-level reasoning process, utilizing natural language descriptions at all levels [6]. - High-level subtask prediction generates intermediate tasks based on initial observations and overall task instructions [6]. - Mid-level motion planning produces spatially oriented movement descriptions, while low-level action generation creates executable action sequences with language annotations [6]. Group 4: Data Reconstruction Pipeline - VLM2VLA utilizes Gemini 2.5 to automatically reconstruct raw robotic trajectory datasets into language-annotated datasets, ensuring compatibility with VLM pre-training formats [9]. - The reconstruction process involves providing context, decomposing trajectories into subtasks, and standardizing the format to align with VLM data [9]. Group 5: Efficient Fine-Tuning Strategy - The fine-tuning of the Gemma-3-12B-IT model is conducted using LoRA on linear layers without altering the VLM architecture or requiring joint training with internet-scale data [12][13]. - Key training parameters include a LoRA rank of 16, learning rate of 5e-5, and an effective batch size of 8 [12][13]. Group 6: Experimental Validation - Experiments focus on three core questions comparing VLM2VLA with baseline models, assessing the retention of multimodal understanding, competitive robotic manipulation performance, and the ability to generalize knowledge to new scenarios [14][15]. - VLM2VLA demonstrates competitive performance in both in-distribution and out-of-distribution tasks, showcasing its hierarchical reasoning capabilities [17][19]. Group 7: Limitations and Future Directions - The model currently faces challenges such as reasoning delays and the need for larger-scale robotic language-annotated datasets to enhance generalization capabilities [19]. - Future improvements may include optimizing decoding strategies, expanding language annotation for dexterous actions, and integrating validation capabilities within the VLM itself [19][22].

灾难性遗忘

视觉语言模型（VLMs）

视觉语言动作模型（VLAs）

Gemma-3-12B-IT模型

灾难性遗忘

视觉语言模型（VLMs）

视觉语言动作模型（VLAs）

Gemma-3-12B-IT模型

纯血VLA综述来啦！从VLM到扩散，再到强化学习方案

具身智能之心· 2025-09-30 04:00

Core Insights - The article discusses the evolution and potential of Vision Language Action (VLA) models in robotics, emphasizing their integration of perception, language understanding, and action generation to enhance robotic capabilities [11][17]. Group 1: Introduction and Background - Robotics has traditionally relied on pre-programmed instructions and control strategies, limiting their adaptability in dynamic environments [2][11]. - The emergence of VLA models marks a significant advancement in embodied intelligence, combining visual perception, language understanding, and executable actions into a unified framework [11][12]. Group 2: VLA Methodologies - VLA methods are categorized into four paradigms: autoregressive, diffusion, reinforcement learning, and hybrid/specialized methods, each with unique strategies and mechanisms [8][10]. - The article highlights the importance of high-quality datasets and realistic simulation platforms for the development and evaluation of VLA models [16][18]. Group 3: Challenges and Future Directions - Key challenges identified include data limitations, reasoning speed, and safety concerns, which need to be addressed to advance VLA models and general robotics [10][17]. - Future research directions focus on enhancing the robustness and generalization of VLA models in real-world applications, emphasizing the need for efficient training paradigms and safety assessments [44][47].

视觉-语言-动作（VLA）模型

大语言模型（LLMs）

视觉语言模型（VLMs）

自回归范式

视觉-语言-动作（VLA）模型

大语言模型（LLMs）

视觉语言模型（VLMs）

自回归范式

港科&理想最新！OmniReason: 时序引导的VLA决策新框架

自动驾驶之心· 2025-09-10 23:33

Core Insights - The article discusses the development of the OmniReason framework, a novel Vision-Language-Action (VLA) model designed to enhance spatiotemporal reasoning in autonomous driving by integrating dynamic 3D environment modeling and decision-making processes [2][6][8]. Data and Framework - OmniReason-Data consists of two large-scale VLA datasets: OmniReason-nuScenes and OmniReason-Bench2Drive, which provide dense spatiotemporal annotations and natural language explanations, ensuring physical realism and temporal coherence [2][6][8]. - The OmniReason-Agent architecture incorporates a sparse temporal memory module for persistent scene context modeling and an explanation generator for human-interpretable decision-making, effectively capturing spatiotemporal causal reasoning patterns [2][7][8]. Performance and Evaluation - Extensive experiments on open-loop planning tasks and visual question answering (VQA) benchmarks demonstrate that the proposed methods achieve state-of-the-art performance, establishing new capabilities for interpretable and time-aware autonomous vehicles operating in complex dynamic environments [3][8][25][26]. - The OmniReason-Agent shows competitive results in open-loop planning with an average L2 error of 0.34 meters, matching the top method ORION, while achieving a new record for violation rate at 3.18% [25][26]. Contributions - The introduction of comprehensive VLA datasets emphasizes causal reasoning based on spatial and temporal contexts, setting a new benchmark for interpretability and authenticity in autonomous driving research [8]. - The design of a template-based annotation framework ensures high-quality, interpretable language-action pairs suitable for diverse driving scenarios, reducing hallucination phenomena and providing rich multimodal reasoning information [8][14][15]. Related Work - The article reviews the evolution of datasets for autonomous driving, highlighting the shift from single-task annotations to comprehensive scene understanding, and discusses the limitations of existing visual language models (VLMs) in dynamic environments [10][11].

视觉 - 语言 - 动作（VLA）框架

端到端学习

大型语言模型（LLMs）

视觉语言模型（VLMs）

OmniReason框架

视觉 - 语言 - 动作（VLA）框架

端到端学习

大型语言模型（LLMs）

视觉语言模型（VLMs）

OmniReason框架

最新综述！多模态融合与VLM在具身机器人领域中的方法盘点

具身智能之心· 2025-08-31 02:33

Core Viewpoint - The article discusses the advancements in multimodal fusion and vision-language models (VLMs) in robot vision, emphasizing their role in enhancing robots' perception and understanding capabilities in complex environments [4][5][56]. Multimodal Fusion in Robot Vision Tasks - Semantic scene understanding is a critical task in visual systems, where multimodal fusion significantly improves accuracy and robustness by integrating additional information such as depth and language [9][11]. - Current mainstream fusion strategies include early fusion, mid-level fusion, and late fusion, evolving from simple concatenation to more sophisticated interactions within a unified architecture [10][12][16]. Applications of Multimodal Fusion - In autonomous driving, 3D object detection is crucial for accurately identifying and locating pedestrians, vehicles, and obstacles, with multimodal fusion enhancing environmental understanding [15][18]. - The design of multimodal fusion involves addressing when to fuse, what to fuse, and how to fuse, with various strategies impacting performance and computational efficiency [16][17]. Embodied Navigation - Embodied navigation allows robots to explore and act in real environments, focusing on autonomous decision-making and dynamic adaptation [23][25][26]. - Three representative methods include goal-directed navigation, instruction-following navigation, and dialogue-based navigation, showcasing the evolution from perception-driven to interactive understanding [25][26][27]. Visual Localization and SLAM - Visual localization determines a robot's position, which is challenging in dynamic environments; recent methods leverage multimodal fusion to improve performance [28][30]. - SLAM (Simultaneous Localization and Mapping) has evolved from geometric-driven to semantic-driven approaches, integrating various sensor data for enhanced adaptability [30][34]. Vision-Language Models (VLMs) - VLMs have progressed significantly, focusing on semantic understanding, 3D object detection, embodied navigation, and robot operation, with various fusion methods being explored [56][57]. - Key innovations in VLMs include large-scale pre-training, instruction fine-tuning, and structural optimization, enhancing their capabilities in cross-modal reasoning and task execution [52][53][54]. Future Directions - Future research should focus on structured spatial modeling, improving system interpretability and ethical adaptability, and developing cognitive VLM architectures for long-term learning [57][58].

多模态融合

视觉语言模型（VLMs）

跨模态对齐

多模态融合

视觉语言模型（VLMs）

跨模态对齐

Kitchen-R ：高层任务规划与低层控制联合评估的移动操作机器人基准

具身智能之心· 2025-08-25 00:04

Core Viewpoint - The article introduces the Kitchen-R benchmark, a unified evaluation framework for task planning and low-level control in embodied AI, addressing the existing fragmentation in current benchmarks [4][6][8]. Group 1: Importance of Benchmarks - Benchmarks are crucial in various fields such as natural language processing and computer vision for assessing model progress [7]. - In robotics, simulator-based benchmarks like Behavior-1K are common, providing model evaluation and training capabilities [7]. Group 2: Issues with Existing Benchmarks - Current benchmarks for high-level language instruction and low-level robot control are fragmented, leading to incomplete assessments of integrated systems [8][9]. - High-level benchmarks often assume perfect execution of atomic tasks, while low-level benchmarks rely on simple single-step instructions [9]. Group 3: Kitchen-R Benchmark Features - Kitchen-R fills a critical gap in embodied AI research by providing a comprehensive testing platform that closely simulates real-world scenarios [6][8]. - It includes a digital twin kitchen environment and over 500 language instructions, supporting mobile ALOHA robots [9][10]. - The benchmark supports three evaluation modes: independent evaluation of planning modules, independent evaluation of control strategies, and critical full system integration evaluation [9][10]. Group 4: Evaluation Metrics - Kitchen-R is designed with offline independent evaluation and online joint evaluation metrics to ensure comprehensive system performance measurement [16][20]. - Key metrics include Exact Match (EM) for task planning accuracy and Mean Squared Error (MSE) for trajectory prediction accuracy [20][21]. Group 5: Baseline Methods - Kitchen-R provides two baseline methods: a VLM-driven task planning baseline and a Diffusion Policy low-level control baseline [43][49]. - The VLM planning baseline enhances planning accuracy through contextual examples and constrained generation [47][48]. - The Diffusion Policy baseline integrates visual features and robot states to predict future actions [49][52]. Group 6: Future Directions - Kitchen-R can expand to include more complex scenarios, such as multi-robot collaboration and dynamic environments, promoting the application of language-guided mobile manipulation robots in real-world settings [54].

大语言模型（LLMs）

视觉语言模型（VLMs）

机器人任务规划

机器人低层控制

Kitchen-R基准

大语言模型（LLMs）

视觉语言模型（VLMs）

机器人任务规划

机器人低层控制

Kitchen-R基准

中科院自动化所机器人视觉中的多模态融合与视觉语言模型综述

具身智能之心· 2025-08-04 01:59

Core Insights - The article discusses the advancements in multimodal fusion and vision-language models (VLMs) as essential tools for enhancing robot vision technology, emphasizing their potential in complex reasoning and long-term task decision-making [4][10]. Multimodal Fusion and Robot Vision - Multimodal fusion enhances semantic scene understanding by integrating various data sources, such as visual, linguistic, depth, and lidar information, addressing limitations faced by traditional unimodal methods [8][9]. - The rise of VLMs has propelled the development of multimodal fusion paradigms, showcasing capabilities in zero-shot understanding and instruction following [9][10]. Key Applications and Challenges - The article identifies key applications of multimodal fusion in tasks like simultaneous localization and mapping (SLAM), 3D object detection, navigation, and robot manipulation [10][19]. - Challenges in multimodal fusion include cross-modal alignment, efficient training strategies, and real-time performance optimization [10][19]. Data Sets and Benchmarking - A comprehensive analysis of mainstream multimodal datasets used for robot tasks is provided, detailing their modality combinations, task coverage, and limitations [10][43]. - The importance of high-quality multimodal datasets is highlighted, as they are crucial for model training and performance evaluation [62]. Future Directions - The article suggests future research directions to address challenges in multimodal fusion, such as improving cross-modal alignment techniques and enhancing real-time performance [10][63]. - Emphasis is placed on the need for standardized datasets and benchmarks to facilitate comparisons across different research efforts [66].

多模态融合

视觉语言模型（VLMs）

机器人视觉

Artificial Intelligence

RoboMamba模型

多模态融合

视觉语言模型（VLMs）

机器人视觉

Artificial Intelligence

RoboMamba模型

让 VLMs 更适配机器人：小型VLMs也能展现出强大的视觉规划能力

具身智能之心· 2025-07-15 13:49

Core Insights - The article discusses the potential of large language models (LLMs) in robotic program planning, highlighting their ability to generate coherent action sequences but also noting their limitations in providing the necessary sensory details for physical execution [3][4] - It introduces a new framework called SelfReVision, which enhances the performance of small visual language models (VLMs) through self-distillation without external supervision, aiming to improve their planning capabilities in real-world scenarios [4][9] Research Background - LLMs show promise in generating action sequences but often lack the precision required for robotic tasks due to their reliance on human-centric training data [3] - Visual language models (VLMs) can potentially address these limitations, but existing methods either require specialized simulation environments or are costly to train and deploy [3] Methodology - SelfReVision is proposed as a self-improvement framework that allows small VLMs to enhance their performance through iterative self-critique and revision [4][6] - The framework operates in three stages: critique, revise, and verify, enabling models to generate and refine plans based on self-assessment [4][10] Experimental Setup - Two types of experiments were conducted to evaluate the planning capabilities of SelfReVision: image-based program planning and entity-agent tasks [11] - Evaluation metrics included coverage, ordering, completeness, overall quality, and a new metric called image groundedness [12] Key Results - SelfReVision significantly outperformed baseline models across various metrics, achieving an average win rate of 68% on the PLACES dataset and 72% on the SIMULATION dataset [13] - Larger models benefited more from SelfReVision, with an average gain of 74% for models with 12 billion parameters or more [13] Comparison with Other Methods - SelfReVision demonstrated clear advantages over other methods like Best-of-N and PaliGemma, with improvements of 60% in most settings compared to modest gains from Best-of-N [17] - When compared to GPT-4o, SelfReVision's plans had at least a 25% higher win rate for models with 12 billion parameters or more, indicating its effectiveness in enhancing smaller models [17] Ablation Studies - The complete Criticize-Revise-Verify (CRV) process showed the strongest performance, with average win rates of 68.3% on the PLACES dataset and 71.9% on the SIMULATION dataset [18] - Variants of the process showed significant performance drops, emphasizing the importance of the verification step in filtering out suboptimal revisions [18] Application in Entity-Agent Tasks - SelfReVision was tested in challenging scenarios, showing a 26% improvement for the Gemma 12B model and a 17% improvement for the Gemma 27B model in block manipulation tasks [21] - In hierarchical tasks, SelfReVision plans led to a 70% success rate in generating trajectories, surpassing the 61% success rate of baseline models [21]

大语言模型（LLMs）

视觉语言模型（VLMs）

SelfReVision框架

大语言模型（LLMs）

视觉语言模型（VLMs）

SelfReVision框架

AI Lab最新InternSpatia：VLM空间推理数据集，显著提升模型能力

具身智能之心· 2025-06-24 14:09

Core Insights - The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning tasks, highlighting the need for improved datasets and methodologies to enhance performance in various scenarios [3][12]. Dataset Limitations - The existing InternSpatial dataset has three main limitations: 1. Limited scene diversity, focusing primarily on indoor and outdoor environments, lacking diverse contexts like driving and embodied navigation [3]. 2. Restricted instruction formats, only supporting natural language or region masks, which do not encompass the variety of queries found in real-world applications [3]. 3. Lack of multi-view supervision, with over 90% of data focusing on single-image reasoning, failing to model spatiotemporal relationships across views [3]. Evaluation Benchmark - The InternSpatial-Bench evaluation benchmark includes 6,008 QA pairs across five tasks, assessing position comparison, size comparison, rotation estimation, object counting, and existence estimation [7]. - The benchmark also introduces 1,000 additional QA pairs for multi-view rotation angle prediction [7]. Data Engine Design - The data engine employs a three-stage automated pipeline: 1. Annotation generation using existing annotations or SAM2 for mask generation [9]. 2. View alignment to construct a standard 3D coordinate system [9]. 3. Template-based QA generation with predefined task templates [9]. Experimental Results - Spatial reasoning performance has improved, with InternVL-Spatial-8B showing a 1.8% increase in position comparison accuracy and a 17% increase in object counting accuracy compared to its predecessor [10]. - The model's performance across various tasks demonstrates significant enhancements, particularly in multi-view tasks [10]. Instruction Format Robustness - Current models exhibit a 23% accuracy drop when using the <box> format, while training with InternSpatial reduces the gap between different formats to within 5% [12]. - However, the automated QA generation struggles to replicate the complexity of natural language, indicating a need for further refinement [12].

视觉语言模型（VLMs）

InternSpatia数据集

InternSpatial-Bench评估基准

视觉语言模型（VLMs）

InternSpatia数据集

InternSpatial-Bench评估基准

FindingDory：具身智能体记忆评估的基准测试

具身智能之心· 2025-06-22 10:56

Group 1 - The core issue in embodied intelligence is the lack of long-term memory, which limits the ability to process multimodal observational data across time and space [3] - Current visual language models (VLMs) excel in planning and control tasks but struggle with integrating historical experiences in embodied environments [3][5] - Existing video QA benchmarks fail to adequately assess tasks requiring fine-grained reasoning, such as object manipulation and navigation [5] Group 2 - The proposed benchmark includes a task architecture that allows for dynamic environment interaction and memory reasoning validation [4][6] - A total of 60 task categories are designed to cover spatiotemporal semantic memory challenges, including spatial relations, temporal reasoning, attribute memory, and multi-target recall [7] - Key technical innovations include a programmatic expansion of task complexity through increased interaction counts and a strict separation of experience collection from interaction phases [9][6] Group 3 - Experimental results reveal three major bottlenecks in VLM memory capabilities across 60 tasks, including failures in long-sequence reasoning, weak spatial representation, and collapse in multi-target processing [13][14][16] - The performance of native VLMs declines as the number of frames increases, indicating ineffective utilization of long contexts [20] - Supervised fine-tuning models show improved performance by leveraging longer historical data, suggesting a direction for VLM refinement [25] Group 4 - The benchmark represents the first photorealistic embodied memory evaluation framework, covering complex household environments and allowing for scalable assessment [26] - Future directions include memory compression techniques, end-to-end joint training to address the split between high-level reasoning and low-level execution, and the development of long-term video understanding [26]

视觉语言模型（VLMs）

长时序推理

多目标处理

帧采样悖论

视觉语言模型（VLMs）

长时序推理

多目标处理

帧采样悖论