Vision-Language Models (VLM)
Less is More! Max-V1: A Compact yet Powerful Vision-Language Model for Autonomous Driving (Fudan & CAS)
自动驾驶之心· 2025-10-08 09:04
Paper authors | Sheng Yang et al. Editor | 自动驾驶之心

The large-model community has recently begun to rethink the conventional view of scaling laws. An SJTU team made the point for agent tasks with "LIMI: Less is More for Agency": more data does not necessarily yield stronger AI capability. That reflection has now reached autonomous driving. Do autonomous-driving VLA/VLM models really need massive data, or should redundancy be stripped away to distill the truly critical information?

The work shared today, from a team at Fudan University and the Chinese Academy of Sciences, proposes Max-V1, a new one-stage end-to-end autonomous driving framework. Max-V1 reconceptualizes autonomous driving as a generalized language task and formalizes trajectory planning as "next waypoint prediction".

Background and Main Contributions - Human driving is inherently a sequential decision-making process in which every action depends on real-time understanding of the surrounding scene. This dynamic interplay between perception and action closely resembles natural language generation, which likewise produces highly correlated output sequences. From this ...
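To make the "next waypoint prediction" framing concrete, here is a minimal sketch, assuming waypoints are quantized onto a grid and decoded one token at a time, the way a language model predicts the next word. The grid parameters, tokenizer, and `model.next_token` interface are illustrative assumptions, not details from the Max-V1 paper.

```python
import numpy as np

# Hypothetical discretization: map (x, y) waypoints onto a coarse grid so that
# each waypoint becomes one token, mirroring next-token prediction in LMs.
GRID_SIZE = 0.5   # meters per cell (assumed value)
GRID_RANGE = 64   # cells per axis (assumed value)

def waypoint_to_token(x: float, y: float) -> int:
    """Quantize a continuous waypoint into a discrete token id."""
    ix = int(np.clip(round(x / GRID_SIZE) + GRID_RANGE // 2, 0, GRID_RANGE - 1))
    iy = int(np.clip(round(y / GRID_SIZE) + GRID_RANGE // 2, 0, GRID_RANGE - 1))
    return ix * GRID_RANGE + iy

def token_to_waypoint(token: int) -> tuple[float, float]:
    """Invert the quantization back to grid-center coordinates."""
    ix, iy = divmod(token, GRID_RANGE)
    return (ix - GRID_RANGE // 2) * GRID_SIZE, (iy - GRID_RANGE // 2) * GRID_SIZE

def plan_trajectory(model, scene_tokens: list[int], horizon: int = 8) -> list[tuple[float, float]]:
    """Autoregressively decode a trajectory as a sequence of waypoint tokens.

    `model.next_token(context)` is a stand-in for one VLM decoding step;
    it is not an API from the paper.
    """
    context = list(scene_tokens)
    waypoints = []
    for _ in range(horizon):
        tok = model.next_token(context)      # greedy "next waypoint prediction"
        waypoints.append(token_to_waypoint(tok))
        context.append(tok)
    return waypoints
```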
DeepSeek: Major Breaking News!
券商中国· 2025-09-29 11:16
Core Viewpoint
- DeepSeek has launched its updated model DeepSeek-V3.2-Exp, cutting API costs for developers by more than 50% thanks to the lower serving costs of the new model [1][9].

Model Release and Features
- DeepSeek-V3.2-Exp was officially released on September 29 and is available on the Hugging Face platform, marking an important step toward the next-generation architecture [3].
- This version introduces the DeepSeek Sparse Attention (DSA) mechanism, which optimizes training and inference efficiency on long texts while maintaining model output quality [5][8].
- The model supports a maximum context length of 160K, strengthening its ability to handle long inputs [4].

Cost Structure and API Pricing
- The new DeepSeek API pricing is 0.2 yuan per million input tokens on cache hits and 2 yuan on cache misses, with output priced at 3 yuan per million tokens, a significant cost reduction for developers [9] (see the cost sketch after this summary).

Open Source and Community Engagement
- DeepSeek has made DeepSeek-V3.2-Exp fully open source on platforms such as Hugging Face and ModelScope, along with related research papers [11].
- API access to the previous version, V3.1-Terminus, is retained so developers can compare performance, with the same pricing maintained until October 15, 2025 [11].

Upcoming Developments
- There are indications that Z.ai's new GLM-4.6 model will be released soon and is expected to offer a larger context window [15][16].
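For reference, the quoted prices translate directly into a per-request cost estimate. A minimal sketch using only the numbers above (0.2 yuan per million cached input tokens, 2 yuan per million uncached input tokens, 3 yuan per million output tokens); the function name and example request are illustrative, not part of the DeepSeek API.

```python
def estimate_cost_yuan(cached_input_tokens: int,
                       uncached_input_tokens: int,
                       output_tokens: int) -> float:
    """Estimate a DeepSeek-V3.2-Exp API bill from the quoted per-million-token prices."""
    PRICE_CACHE_HIT = 0.2   # yuan per 1M input tokens on cache hit
    PRICE_CACHE_MISS = 2.0  # yuan per 1M input tokens on cache miss
    PRICE_OUTPUT = 3.0      # yuan per 1M output tokens
    return (cached_input_tokens * PRICE_CACHE_HIT
            + uncached_input_tokens * PRICE_CACHE_MISS
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# Example: a long-context request with 100K cached tokens, 20K fresh tokens,
# and 4K generated tokens costs roughly 0.02 + 0.04 + 0.012 = 0.072 yuan.
print(f"{estimate_cost_yuan(100_000, 20_000, 4_000):.3f} yuan")
```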
Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing 3D Feature Extraction in VLA Models, with Clear Robustness Gains (SOTA)
具身智能之心· 2025-08-15 00:05
Core Insights
- The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10].

Group 1: Background and Motivation
- Advancing robotic manipulation requires intelligent interaction and precise physical control in real-world environments. Recent VLA models have drawn attention for their ability to follow instructions and execute actions [7].
- Current VLA models rely primarily on 2D visual inputs and neglect the rich geometric information inherent in the 3D physical world, which limits their spatial perception [8].

Group 2: GeoVLA Framework
- GeoVLA employs a vision-language model (VLM) to process images and language instructions and extract fused visual-language embeddings. It converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings [3][10][12].
- The framework consists of three key components: the VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D enhanced action expert (3DAE) for generating action sequences [12][13] (a toy sketch follows this summary).

Group 3: Performance Evaluation
- GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks, achieving state-of-the-art results, and demonstrated strong robustness in real-world tasks requiring high adaptability and spatial awareness [15][27].
- On LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming models such as CogACT (93.2%) and OpenVLA-OFT (95.3%) [27].
- On ManiSkill2, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27].

Group 4: Ablation Studies
- Ablation studies showed that the PEN encoder outperformed conventional encoders, reaching a 97.7% success rate versus 95.8% for an MLP and 95.2% for PointNet [30].
- Static routing in the MoE architecture further improved performance, demonstrating the effectiveness of the design in leveraging multimodal information [30][20].

Group 5: Real-World Experiments
- Real-world experiments showcased GeoVLA's robustness and generalization across diverse 3D manipulation tasks, maintaining high performance despite changes in camera perspective, height, and object size [36][34].
- GeoVLA achieved an average success rate of 86.3% across basic and 3D perception tasks, outperforming other models by significant margins [36].
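To illustrate how the three components described above might fit together, here is a toy PyTorch sketch that fuses a vision-language embedding with a point-cloud embedding and regresses an action sequence. All class names, shapes, and pooling choices are assumptions for illustration; they are not taken from the GeoVLA paper.

```python
import torch
import torch.nn as nn

class PointEmbeddingNetwork(nn.Module):
    """Toy stand-in for GeoVLA's PEN: embeds an (N, 3) point cloud into one vector."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # (B, N, 3) -> (B, dim)
        return self.mlp(points).max(dim=1).values  # max-pool over points

class ActionExpert(nn.Module):
    """Toy stand-in for the 3D enhanced action expert: fuses VL and 3D embeddings."""
    def __init__(self, vl_dim: int = 256, pt_dim: int = 256,
                 horizon: int = 8, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vl_dim + pt_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vl_emb: torch.Tensor, pt_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vl_emb, pt_emb], dim=-1)
        return self.head(fused).view(-1, self.horizon, self.action_dim)

# Usage with placeholder inputs: a real system would obtain vl_emb from a VLM
# processing images plus the instruction, and points from a lifted depth map.
pen, expert = PointEmbeddingNetwork(), ActionExpert()
vl_emb = torch.randn(2, 256)           # assumed fused visual-language embedding
points = torch.randn(2, 1024, 3)       # point cloud converted from depth
actions = expert(vl_emb, pen(points))  # (2, 8, 7) action sequence
print(actions.shape)
```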