Vision-Language Models (VLM)
AI Confidently Making Things Up About Images? "Pull and Push" Helps Models See Both Completely and Accurately | Microsoft x Tsinghua
量子位· 2026-02-08 04:46
Contributed by the BiPS team to 量子位 | Official account QbitAI. As the reasoning capabilities of vision-language models (VLMs) keep improving, a subtle problem has surfaced: many errors come not from faulty reasoning, but from "seeing wrong." On complex visual tasks, a model can correctly identify objects, understand the question, and even produce a complete reasoning chain, yet still reach a confident but wrong answer because it latched onto the wrong visual evidence. Existing methods typically "point the way" at inference time, for example by generating visual prompts or calling external tools to temporarily align the evidence. While effective, these strategies face clear limitations: the form of the visual cues is restricted, they are highly task-specific, and they add substantial inference overhead. More importantly, they raise a fundamental question: if a model always needs external reminders to know "where to look," does it truly understand the visual world? To address this, Microsoft Research Asia and Tsinghua University propose BiPS (Bi-directional Perceptual Shaping), which reshapes how the model "looks at images" at the source. Rather than prompting attention regions ad hoc at inference time, BiPS teaches the model during training which visual details must be attended to, and which can be ignored, for a given question. By systematically aligning questions with visual evidence, BiPS drives the model to internalize a core ability: looking at the image with the question in mind. As a result, at inference time the model automatically focuses on the key regions and details that actually determine the answer, without any extra prompting. Experiments show that this ...
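The summary does not spell out BiPS's training objective, but the headline's "pull and push" idea can be sketched as a simple contrastive shaping loss over region features: pull question-relevant regions toward the question embedding, push irrelevant ones below a similarity margin. Everything below (function name, margin, cosine similarity) is a hypothetical illustration, not the paper's actual loss.

```python
import numpy as np

def pull_push_loss(question_emb, region_feats, relevant_ids, margin=0.5):
    """Hypothetical 'pull-push' shaping loss (illustrative, not from the paper):
    pull question-relevant region features toward the question embedding,
    push irrelevant regions below a cosine-similarity margin."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Pull term: relevant regions should have similarity near 1.
    pull = sum(1.0 - cos(question_emb, region_feats[i]) for i in relevant_ids)
    # Push term: irrelevant regions are penalized only above the margin.
    irrelevant = [i for i in range(len(region_feats)) if i not in relevant_ids]
    push = sum(max(0.0, cos(question_emb, region_feats[i]) - margin)
               for i in irrelevant)
    return pull + push
```

With perfectly aligned evidence the loss vanishes; when the model attends to the wrong region, both terms fire, which is the intuition behind shaping perception during training rather than prompting it at inference.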
Less is More! Max-V1: A Compact yet Powerful Vision-Language Model for Autonomous Driving (Fudan & CAS)
自动驾驶之心· 2025-10-08 09:04
Paper authors | Sheng Yang et al. Editor | 自动驾驶之心. The large-model community has recently begun rethinking conventional views on scaling laws; for agent tasks, an SJTU team's "LIMI: Less is More for Agency" argued that more data does not necessarily mean stronger AI capability. That rethinking has now reached autonomous driving: do driving VLA/VLM systems really need massive data, or should redundancy be stripped away to distill the truly essential information? Today 自动驾驶之心 shares Max-V1, from a team at Fudan University and the Chinese Academy of Sciences: a new single-stage, end-to-end autonomous driving framework. Max-V1 reconceptualizes autonomous driving as a generalized language task and formalizes trajectory planning as "next waypoint prediction." Background and main contributions: human driving is inherently a sequential decision process in which every action depends on real-time understanding of the surrounding scene. This dynamic interplay between perception and action closely resembles natural language generation, which likewise produces highly correlated output sequences. From this ...
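The "next waypoint prediction" framing can be made concrete with a minimal autoregressive rollout: each waypoint is generated conditioned on the trajectory so far, exactly as a language model conditions on previous tokens. The toy constant-velocity policy below is a stand-in assumption, not Max-V1's learned model.

```python
import numpy as np

def rollout(predict_next, start, horizon):
    """Autoregressive trajectory generation: planning as 'next waypoint
    prediction', analogous to next-token prediction in language models."""
    traj = [np.asarray(start, dtype=float)]
    for _ in range(horizon):
        # Each new waypoint conditions on all waypoints generated so far.
        traj.append(predict_next(traj))
    return np.stack(traj)

def constant_velocity(history):
    """Toy stand-in for the learned policy: repeat the last displacement."""
    if len(history) < 2:
        return history[-1] + np.array([1.0, 0.0])
    return history[-1] + (history[-1] - history[-2])
```

Swapping `constant_velocity` for a VLM head that decodes the next waypoint from camera tokens plus the waypoint history recovers the single-stage, end-to-end structure described above.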
DeepSeek: Major Breaking News!
券商中国· 2025-09-29 11:16
Core Viewpoint - DeepSeek has launched its updated model DeepSeek-V3.2-Exp, which significantly reduces API costs for developers by over 50% due to lower service costs associated with the new model [1][9]. Model Release and Features - The DeepSeek-V3.2-Exp model was officially released on September 29 and is available on the Hugging Face platform, marking an important step towards the next generation architecture [3]. - This version introduces the DeepSeek Sparse Attention (DSA) mechanism, which optimizes training and inference efficiency for long texts while maintaining model output quality [5][8]. - The model supports a maximum context length of 160K, enhancing its capability for handling extensive data [4]. Cost Structure and API Pricing - The new pricing structure for the DeepSeek API includes a cost of 0.2 yuan per million tokens for cache hits and 2 yuan for cache misses, with output priced at 3 yuan per million tokens, reflecting a significant reduction in costs for developers [9]. Open Source and Community Engagement - DeepSeek has made the DeepSeek-V3.2-Exp model fully open source on platforms like Hugging Face and ModelScope, along with related research papers [11]. - The company has retained API access for the previous version, V3.1-Terminus, to allow developers to compare performance, with the same pricing structure maintained until October 15, 2025 [11]. Upcoming Developments - There are indications that the new model GLM-4.6 from Z.ai will be released soon, which is expected to offer greater context capabilities [15][16].
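The pricing figures quoted above make the developer-facing savings easy to compute. The helper below is a simple sketch using only the per-million-token prices stated in the article (0.2 yuan for cache-hit input, 2 yuan for cache-miss input, 3 yuan for output); the function name and cache-hit-ratio parameter are illustrative.

```python
def api_cost_yuan(input_tokens, output_tokens, cache_hit_ratio):
    """Estimate DeepSeek-V3.2-Exp API cost from the per-million-token
    prices quoted in the article: 0.2 yuan (cache-hit input),
    2 yuan (cache-miss input), 3 yuan (output)."""
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    return (hit_tokens * 0.2 + miss_tokens * 2.0
            + output_tokens * 3.0) / 1_000_000
```

For example, a fully cache-hit one-million-token prompt costs 0.2 yuan, while one million uncached input tokens plus one million output tokens cost 5 yuan, illustrating why cache behavior dominates the bill for long-context workloads.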
Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing 3D Feature Extraction for VLA Models, with Clear Robustness Gains (SOTA)
具身智能之心· 2025-08-15 00:05
Core Insights - The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10]. Group 1: Background and Motivation - The advancement of robotic operations requires intelligent interaction and precise physical control in real-world environments. Recent VLA models have gained attention for their ability to follow instructions and execute actions [7]. - Current VLA models primarily rely on 2D visual inputs, neglecting the rich geometric information inherent in the 3D physical world, which limits their spatial perception capabilities [8]. Group 2: GeoVLA Framework - GeoVLA employs a visual-language model (VLM) to process images and language instructions, extracting fused visual-language embeddings. It converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings [3][10][12]. - The framework consists of three key components: VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D enhanced action expert (3DAE) for generating action sequences [12][13]. Group 3: Performance Evaluation - GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks, achieving state-of-the-art results. It demonstrated significant robustness in real-world tasks requiring high adaptability and spatial awareness [15][27]. - In LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming other models like CogACT (93.2%) and OpenVLA-OFT (95.3%) [27]. - In the ManiSkill2 benchmark, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27]. Group 4: Ablation Studies - Ablation studies indicated that the PEN encoder outperformed traditional encoders, achieving a success rate of 97.7% compared to 95.8% for MLP and 95.2% for PointNet [30]. 
- The use of static routing in the MoE architecture improved performance, demonstrating the effectiveness of the design in leveraging multimodal information [30][20]. Group 5: Real-World Experiments - Real-world experiments showcased GeoVLA's robustness and generalization capabilities across various 3D manipulation tasks, maintaining high performance despite changes in camera perspective, height, and object size [36][34]. - GeoVLA achieved an average success rate of 86.3% across basic and 3D perception tasks, outperforming other models by significant margins [36].
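The summary says GeoVLA converts depth maps into point clouds before its point embedding network. The standard way to do that is pinhole back-projection, sketched below; the camera intrinsics are placeholder values, and this is generic preprocessing rather than GeoVLA's exact pipeline.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud with a pinhole
    camera model (fx, fy: focal lengths; cx, cy: principal point).
    This is the generic preprocessing step upstream of a point
    embedding network like GeoVLA's PEN."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Flatten H x W x 3 into an N x 3 point cloud.
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Because the resulting points live in metric camera space, features extracted from them are invariant to where objects appear in the 2D image, which is consistent with the robustness to camera-perspective and height changes reported in the real-world experiments.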