视觉语言模型
Search documents
连发Nature、Cancer Cell:上海交大团队利用AI增强罕见病及癌症诊断
生物世界· 2026-03-02 08:00
编辑丨王多鱼 排版丨水成文 2026 年 2 月 18 日, 上海交通大学人工智能学院/上海人工智能实验室 谢伟迪 , 上海交通大学医学院附 属新华医院 孙锟 、 余永国 , 上海交通大学人工智能学院/上海人工智能实验室 张娅 作为共同通讯作者 , 在 Nature 上发表研究论文 【1】 , 开发了 全球首个 AI 智能体罕见病循证推理诊断系统 —— DeepRare ,首次在罕见病诊断的准确性上,超越了拥有十年以上经验的临床专家。这项研究不仅推进了 罕见病诊断的发展,为全球 三亿罕见病患者带来了实实在在的希望,更是 AI 在医疗领域的一个里程碑, 展示了大语言模型驱动的 AI 智能体系统如何重塑当前的临床工作流程。 病理学诊断 在临床癌症诊断应用中仍是金标准。过去十年间,计算机视觉领域深度学习技术的进步极大地 推动了计算病理学的发展,催生了基于全监督或弱监督的专门模型。尽管这些方法前景可观,但它们通常 受限于标注成本高昂和标注数据稀疏,以及在不同数据集上的泛化能力有限。为解决这些局限性, 自监督 学习 (SSL) 策略应运而生,成为一种有前景的替代方案,它能让模型在大量未标注的病理图像上进行预 训练,作为 ...
路标文字可“劫持”自动驾驶汽车与无人机 具身智能面临“视觉攻击”风险
Ke Ji Ri Bao· 2026-01-28 01:56
Core Insights - The research highlights a new threat to autonomous systems, specifically "visual attacks" that can hijack decision-making processes of self-driving cars and drones through malicious text embedded in the environment [1][2][3] - The study emphasizes the urgent need for the industry to establish new safety standards and protective mechanisms to counteract these vulnerabilities [1][2] Group 1: Research Findings - Researchers from the University of California, Santa Cruz, have demonstrated that attackers can manipulate autonomous systems by embedding specific text in physical objects like road signs and posters, leading to dangerous behaviors [2][3] - A framework named "CHAI" was developed to test this concept, which optimizes attack text using generative AI and adjusts visual attributes such as color, size, and position to enhance the effectiveness of the attack [2] - In tests, the CHAI attack successfully interfered with navigation judgments of self-driving vehicles, achieving a manipulation success rate of up to 95.5% in simulated drone scenarios [2] Group 2: Implications for the Industry - The findings indicate that such attacks are feasible in the physical world, posing a real threat to the safety of intelligent systems as AI becomes more integrated into physical environments [2] - The research serves as a warning for the industry to consider the broader implications of AI safety and to conduct more proactive studies to strengthen the security foundation of these technologies [3]
具身智能面临“视觉攻击”风险
Ke Ji Ri Bao· 2026-01-28 01:19
Core Insights - The research highlights a new threat to autonomous systems, specifically "visual attacks" that can hijack decision-making processes of self-driving cars and drones through malicious text embedded in the environment [1][2] Group 1: Research Findings - Scientists from the University of California, Santa Cruz, have revealed that attackers can manipulate autonomous systems by embedding specific text information in physical environments, leading to dangerous behaviors [1] - The study introduces the concept of "environmental indirect prompts," where malicious text can be placed on road signs or posters to mislead AI systems that rely on visual language models [2] - A framework named "CHAI" was designed to demonstrate command hijacking of embodied AI, optimizing attack text using generative AI and adjusting visual attributes to enhance attack effectiveness [2] Group 2: Experimental Results - The CHAI attack framework was tested in three scenarios: autonomous driving, emergency landings of drones, and target searches, showing a success rate of up to 95.5% in manipulating autonomous systems [2] - In real-world tests, misleading images successfully interfered with the navigation judgments of test vehicles, confirming the feasibility of such attacks in physical environments [2] Group 3: Industry Implications - The findings serve as a warning for the industry, emphasizing the need for new safety standards and protective mechanisms as AI becomes increasingly integrated into physical systems [1][2][3] - The research calls for more comprehensive considerations and proactive studies to ensure the safety of AI technologies as they are deployed in real-world applications [3]
DeepSeek-OCR 2发布:让AI像人一样“读懂”复杂文档
Feng Huang Wang· 2026-01-27 11:58
Core Insights - DeepSeek team released the paper "DeepSeek-OCR 2: Visual Causal Flow" and open-sourced the DeepSeek-OCR 2 model, which features an innovative DeepEncoder V2 structure that dynamically adjusts the processing order of visual information based on image semantics [1][2] - The new model aims to align machine processing more closely with human visual reading logic, addressing limitations in traditional visual language models that process images in a fixed grid order [1] Model Performance - DeepSeek-OCR 2 achieved an overall score of 91.09% on the OmniDocBench v1.5 benchmark, representing a 3.73% improvement over its predecessor [2] - The model demonstrated enhanced accuracy in reading order, with the edit distance decreasing from 0.085 to 0.057, indicating a better understanding of document content structure [2]
阶跃星辰10B视觉语言SOTA模型开源
Di Yi Cai Jing· 2026-01-20 10:59
Core Insights - The company announced the open-source release of Step3-VL-10B, which utilizes only 10 billion parameters [1] - Step3-VL-10B achieves state-of-the-art (SOTA) performance across various benchmark tests, including visual perception, logical reasoning, mathematical competitions, and general dialogue [1] Summary by Categories Product Development - Step3-VL-10B is developed with a parameter count of 10 billion, showcasing advancements in AI model efficiency [1] Performance Metrics - The model reaches SOTA levels in multiple areas, indicating its competitive edge in the AI landscape [1]
π0-FAST正式集成到LeRobot中!pytorch版本来了
具身智能之心· 2026-01-15 00:32
Core Viewpoint - The article discusses the introduction of π0-FAST, a new model by the pi team that integrates visual language model capabilities with FAST (Frequency Domain Action Sequence Tokenization) action encoding technology, significantly improving training speed and precision for complex robotic tasks [1][4]. Group 1 - π0-FAST enhances the training of high-precision operational tasks, achieving a training speed increase of up to 5 times compared to traditional diffusion model methods [1]. - The model addresses the limitations of traditional action encoding methods, which struggle with complex dexterous skill tasks requiring precise control and high-frequency response [3]. - The implementation of π0-FAST has been integrated into the LeRobot framework, which now supports multiple models including π0, π0.5, and π0-FAST, as well as the domestic model WALL-OSS [2][7]. Group 2 - The original π0-FAST implementation was based on the JAX framework, but it has been restructured using PyTorch, incorporating cross-entropy loss objectives, FAST tokenization schemes, and inference optimization techniques such as KV caching [6]. - π0-FAST generates dense action token sequences that can be predicted in a self-regressive manner, aligning its prediction method with that of language tokens, thus solving the challenges faced by traditional methods [4].
π0-FAST正式集成到LeRobot中!pytorch版本来了
具身智能之心· 2026-01-14 09:00
Core Viewpoint - The article discusses the introduction of π0-FAST, a new model by the pi team that integrates visual language model capabilities with FAST (Frequency Domain Action Sequence Tokenization) action encoding technology, significantly improving training speed and precision for complex robotic tasks [1][4]. Group 1 - π0-FAST enhances the training of high-precision operational tasks, achieving a training speed increase of up to 5 times compared to traditional diffusion model methods [1]. - The model addresses the limitations of traditional action encoding methods, which struggle with complex dexterous skill tasks requiring precise control and high-frequency responses [3]. - The integration of π0-FAST into the LeRobot framework allows for improved action sequence compression and self-regressive prediction of dense action tokens, aligning its prediction method with that of language tokens [4]. Group 2 - The original π0-FAST implementation was based on the JAX framework, but it has been restructured using PyTorch, incorporating cross-entropy loss objectives, FAST tokenization schemes, and inference optimization techniques [6]. - The LeRobot framework now supports multiple models, including π0, π0.5, and π0-FAST, as well as the domestic model WALL-OSS [7].
智源&港科大等出品!RoboMirror:让机器人先 “读懂” 视频,再精准复刻每一个动作
具身智能之心· 2026-01-09 00:55
Core Insights - The article introduces RoboMirror, a new paradigm in embodied intelligence that allows robots to understand and imitate human actions from video input without relying on traditional motion capture or pose estimation methods [3][5][6]. Industry Pain Points - Traditional robotic imitation has been limited to mechanical replication, facing challenges such as high latency, significant errors, and failure in first-person perspective scenarios [3][5]. - The lack of understanding in robots prevents them from interpreting the intent behind actions, leading to inefficiencies in learning and execution [5][6]. RoboMirror Framework - RoboMirror operates on a two-stage framework that transforms video input into robotic motion, emphasizing understanding before imitation [6][12]. - The first stage involves using a visual language model (VLM) to extract motion intent from videos, while the second stage employs a teacher-student policy architecture for precise action execution [6][10]. Performance Metrics - RoboMirror achieved a task success rate of 0.99 on the Nymeria dataset, significantly higher than the baseline of 0.92 [17]. - The joint position error (MPJPE) was reduced by nearly 50% compared to baseline methods, indicating improved accuracy in generated actions [17]. - The end-to-end processing time from video input to action execution was reduced from 9.22 seconds to 1.84 seconds, marking an approximately 80% improvement in efficiency [17]. Real-World Application - The article highlights successful demonstrations of RoboMirror's capabilities in real-world scenarios, showcasing its ability to accurately understand and replicate actions from video input [25][27].
博世最新一篇长达41页的自动驾驶轨迹规划综述
自动驾驶之心· 2025-12-05 00:03
Core Insights - The article discusses the advancements and applications of foundation models (FMs) in trajectory planning for autonomous driving, highlighting their potential to enhance understanding and decision-making in complex driving scenarios [4][5][11]. Background Overview - Foundation models are large-scale models that learn representations from vast amounts of data, applicable to various downstream tasks, including language and vision [4]. - The study emphasizes the importance of FMs in the autonomous driving sector, particularly in trajectory planning, which is deemed the core task of driving [8][11]. Research Contributions - A classification system for methods utilizing FMs in autonomous driving trajectory planning is proposed, analyzing 37 existing methods to provide a structured understanding of the field [11][12]. - The research evaluates the performance of these methods in terms of code and data openness, offering practical references for reproducibility and reusability [12]. Methodological Insights - The article categorizes methods into two main types: FMs customized for trajectory planning and FMs that guide trajectory planning [16][19]. - Customized FMs leverage pre-trained models, adapting them for specific driving tasks, while guiding FMs enhance existing trajectory planning models through knowledge transfer [19][20]. Application of Foundation Models - FMs can enhance trajectory planning capabilities through various approaches, including fine-tuning existing models, utilizing chain-of-thought reasoning, and enabling language and action interactions [9][19]. - The study identifies 22 methods focused on customizing FMs for trajectory planning, detailing their functionalities and the importance of prompt design in model performance [20][32]. Challenges and Future Directions - The article outlines key challenges in deploying FMs in autonomous driving, such as reasoning costs, model size, and the need for suitable datasets for fine-tuning [5][12]. - Future research directions include addressing the efficiency, robustness, and transferability of models from simulation to real-world applications [12][14]. Comparative Analysis - The study contrasts its findings with existing literature, noting that while previous reviews cover various aspects of autonomous driving, this research specifically focuses on the application of FMs in trajectory planning [13][14]. Data and Model Design - The article discusses the importance of data curation for training FMs, emphasizing the need for structured datasets that include sensor data and trajectory pairs [24][28]. - It also highlights different model design strategies, including the use of existing visual language models and the combination of visual encoders with large language models [27][29]. Language and Action Interaction - The research explores models that incorporate language interaction capabilities, detailing how these models utilize visual question-answering datasets to enhance driving performance [38][39]. - It emphasizes the significance of training datasets and evaluation metrics in assessing the effectiveness of language interaction in trajectory planning [39][41].
“中文AI三大顶会”已有两家报导了理想近期AI进展
理想TOP2· 2025-11-09 14:59
Core Insights - The article discusses the rising prominence of Li Auto in the autonomous driving sector, particularly its recent advancements presented at the ICCV 2025 conference, where it introduced a new paradigm for autonomous driving that integrates world models with reinforcement learning [1][2][4]. Group 1: Company Developments - Li Auto's research and development in autonomous driving began in 2021, evolving from initial BEV solutions to more advanced systems [5]. - The company has significantly invested in AI, with nearly half of its R&D budget allocated to this area, indicating a strong commitment to integrating AI into its vehicle technology [2]. - Li Auto's recent presentation at ICCV 2025 highlighted its innovative approach, which combines synthetic data to address rare scenarios, leading to a notable improvement in human takeover mileage (MPI) [2][4]. Group 2: Industry Reception - The reception of Li Auto's advancements has been overwhelmingly positive, with many industry observers praising its research and development efforts, positioning it as a model for Chinese automotive companies [2][4]. - Articles from major Chinese AI platforms like Quantum Bit and Machine Heart have garnered significant attention, with one article achieving over 39,000 reads, reflecting the growing interest in Li Auto's developments [1][2]. Group 3: Competitive Landscape - Li Auto is recognized as a leading player in the Chinese autonomous driving space, with a notable presence in discussions surrounding AI and autonomous vehicle technology [22]. - The company aims to differentiate itself not just as an automotive manufacturer but as a competitive AI entity, aligning its goals with broader AI advancements and the five stages of AI development as defined by OpenAI [18][19].