Vision-Language Models (VLM)
AI Confidently Making Things Up About Images? "Pull and Push" Helps Models See Both Completely and Accurately | Microsoft x Tsinghua
量子位· 2026-02-08 04:46
Contributed by the BiPS team to 量子位 | Official account QbitAI. As the reasoning capabilities of vision-language models (VLMs) keep improving, a subtle problem has surfaced: many errors come not from faulty reasoning, but from "seeing wrong." On complex visual tasks, a model can correctly identify objects, understand the question, and even produce a complete reasoning chain, yet still reach a confident but wrong answer because it latched onto the wrong visual evidence. Existing methods typically "point the way" at inference time, for example by generating visual prompts or calling external tools to temporarily align the evidence. While effective, these strategies face clear limitations: the form of the visual cues is restricted, they are highly task-specific, and they add substantial inference overhead. More importantly, they raise a fundamental question: if a model always needs external reminders to know "where to look," does it truly understand the visual world? To address this, Microsoft Research Asia and Tsinghua University propose BiPS (Bi-directional Perceptual Shaping), which reshapes how the model "looks at images" at the source. Rather than prompting attention regions ad hoc at inference time, BiPS teaches the model during training which visual details must be attended to, and which can be ignored, for a given question. By systematically aligning questions with visual evidence, BiPS drives the model to internalize a core ability: looking at the image with the question in mind. As a result, at inference time the model automatically focuses on the key regions and details that actually determine the answer, without any extra prompting. Experiments show that this ...
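The summary does not spell out BiPS's training objective, but the headline's "pull and push" idea can be sketched as a simple contrastive shaping loss over region features: pull question-relevant regions toward the question embedding, push irrelevant ones below a similarity margin. Everything below (function name, margin, cosine similarity) is a hypothetical illustration, not the paper's actual loss.

```python
import numpy as np

def pull_push_loss(question_emb, region_feats, relevant_ids, margin=0.5):
    """Hypothetical 'pull-push' shaping loss (illustrative, not from the paper):
    pull question-relevant region features toward the question embedding,
    push irrelevant regions below a cosine-similarity margin."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Pull term: relevant regions should have similarity near 1.
    pull = sum(1.0 - cos(question_emb, region_feats[i]) for i in relevant_ids)
    # Push term: irrelevant regions are penalized only above the margin.
    irrelevant = [i for i in range(len(region_feats)) if i not in relevant_ids]
    push = sum(max(0.0, cos(question_emb, region_feats[i]) - margin)
               for i in irrelevant)
    return pull + push
```

With perfectly aligned evidence the loss vanishes; when the model attends to the wrong region, both terms fire, which is the intuition behind shaping perception during training rather than prompting it at inference.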
Less is More! Max-V1: A Compact yet Powerful Vision-Language Model for Autonomous Driving (Fudan & CAS)
自动驾驶之心· 2025-10-08 09:04
Paper authors | Sheng Yang et al. Editor | 自动驾驶之心. The large-model community has recently begun rethinking conventional views on scaling laws; for agent tasks, an SJTU team's "LIMI: Less is More for Agency" argued that more data does not necessarily mean stronger AI capability. That rethinking has now reached autonomous driving: do driving VLA/VLM systems really need massive data, or should redundancy be stripped away to distill the truly essential information? Today 自动驾驶之心 shares Max-V1, from a team at Fudan University and the Chinese Academy of Sciences: a new single-stage, end-to-end autonomous driving framework. Max-V1 reconceptualizes autonomous driving as a generalized language task and formalizes trajectory planning as "next waypoint prediction." Background and main contributions: human driving is inherently a sequential decision process in which every action depends on real-time understanding of the surrounding scene. This dynamic interplay between perception and action closely resembles natural language generation, which likewise produces highly correlated output sequences. From this ...
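The "next waypoint prediction" framing can be made concrete with a minimal autoregressive rollout: each waypoint is generated conditioned on the trajectory so far, exactly as a language model conditions on previous tokens. The toy constant-velocity policy below is a stand-in assumption, not Max-V1's learned model.

```python
import numpy as np

def rollout(predict_next, start, horizon):
    """Autoregressive trajectory generation: planning as 'next waypoint
    prediction', analogous to next-token prediction in language models."""
    traj = [np.asarray(start, dtype=float)]
    for _ in range(horizon):
        # Each new waypoint conditions on all waypoints generated so far.
        traj.append(predict_next(traj))
    return np.stack(traj)

def constant_velocity(history):
    """Toy stand-in for the learned policy: repeat the last displacement."""
    if len(history) < 2:
        return history[-1] + np.array([1.0, 0.0])
    return history[-1] + (history[-1] - history[-2])
```

Swapping `constant_velocity` for a VLM head that decodes the next waypoint from camera tokens plus the waypoint history recovers the single-stage, end-to-end structure described above.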
DeepSeek: Major Breaking News!
券商中国· 2025-09-29 11:16
Core Viewpoint - DeepSeek has launched its updated model DeepSeek-V3.2-Exp, which significantly reduces API costs for developers by over 50% due to lower service costs associated with the new model [1][9]. Model Release and Features - The DeepSeek-V3.2-Exp model was officially released on September 29 and is available on the Hugging Face platform, marking an important step towards the next generation architecture [3]. - This version introduces the DeepSeek Sparse Attention (DSA) mechanism, which optimizes training and inference efficiency for long texts while maintaining model output quality [5][8]. - The model supports a maximum context length of 160K, enhancing its capability for handling extensive data [4]. Cost Structure and API Pricing - The new pricing structure for the DeepSeek API includes a cost of 0.2 yuan per million tokens for cache hits and 2 yuan for cache misses, with output priced at 3 yuan per million tokens, reflecting a significant reduction in costs for developers [9]. Open Source and Community Engagement - DeepSeek has made the DeepSeek-V3.2-Exp model fully open source on platforms like Hugging Face and ModelScope, along with related research papers [11]. - The company has retained API access for the previous version, V3.1-Terminus, to allow developers to compare performance, with the same pricing structure maintained until October 15, 2025 [11]. Upcoming Developments - There are indications that the new model GLM-4.6 from Z.ai will be released soon, which is expected to offer greater context capabilities [15][16].
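The pricing figures quoted above make the developer-facing savings easy to compute. The helper below is a simple sketch using only the per-million-token prices stated in the article (0.2 yuan for cache-hit input, 2 yuan for cache-miss input, 3 yuan for output); the function name and cache-hit-ratio parameter are illustrative.

```python
def api_cost_yuan(input_tokens, output_tokens, cache_hit_ratio):
    """Estimate DeepSeek-V3.2-Exp API cost from the per-million-token
    prices quoted in the article: 0.2 yuan (cache-hit input),
    2 yuan (cache-miss input), 3 yuan (output)."""
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    return (hit_tokens * 0.2 + miss_tokens * 2.0
            + output_tokens * 3.0) / 1_000_000
```

For example, a fully cache-hit one-million-token prompt costs 0.2 yuan, while one million uncached input tokens plus one million output tokens cost 5 yuan, illustrating why cache behavior dominates the bill for long-context workloads.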
Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing 3D Feature Extraction for VLA Models, with Clear Robustness Gains (SOTA)
具身智能之心· 2025-08-15 00:05
Core Insights - The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10]. Group 1: Background and Motivation - The advancement of robotic operations requires intelligent interaction and precise physical control in real-world environments. Recent VLA models have gained attention for their ability to follow instructions and execute actions [7]. - Current VLA models primarily rely on 2D visual inputs, neglecting the rich geometric information inherent in the 3D physical world, which limits their spatial perception capabilities [8]. Group 2: GeoVLA Framework - GeoVLA employs a visual-language model (VLM) to process images and language instructions, extracting fused visual-language embeddings. It converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings [3][10][12]. - The framework consists of three key components: VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D enhanced action expert (3DAE) for generating action sequences [12][13]. Group 3: Performance Evaluation - GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks, achieving state-of-the-art results. It demonstrated significant robustness in real-world tasks requiring high adaptability and spatial awareness [15][27]. - In LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming other models like CogACT (93.2%) and OpenVLA-OFT (95.3%) [27]. - In the ManiSkill2 benchmark, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27]. Group 4: Ablation Studies - Ablation studies indicated that the PEN encoder outperformed traditional encoders, achieving a success rate of 97.7% compared to 95.8% for MLP and 95.2% for PointNet [30]. 
- The use of static routing in the MoE architecture improved performance, demonstrating the effectiveness of the design in leveraging multimodal information [30][20]. Group 5: Real-World Experiments - Real-world experiments showcased GeoVLA's robustness and generalization capabilities across various 3D manipulation tasks, maintaining high performance despite changes in camera perspective, height, and object size [36][34]. - GeoVLA achieved an average success rate of 86.3% across basic and 3D perception tasks, outperforming other models by significant margins [36].
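The summary says GeoVLA converts depth maps into point clouds before its point embedding network. The standard way to do that is pinhole back-projection, sketched below; the camera intrinsics are placeholder values, and this is generic preprocessing rather than GeoVLA's exact pipeline.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud with a pinhole
    camera model (fx, fy: focal lengths; cx, cy: principal point).
    This is the generic preprocessing step upstream of a point
    embedding network like GeoVLA's PEN."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # Flatten H x W x 3 into an N x 3 point cloud.
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Because the resulting points live in metric camera space, features extracted from them are invariant to where objects appear in the 2D image, which is consistent with the robustness to camera-perspective and height changes reported in the real-world experiments.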