Multimodal Large Models
UCloud Completes Integration of DeepSeek-OCR-2
Xin Lang Cai Jing· 2026-01-28 06:20
On January 28, UCloud completed the integration of DeepSeek-OCR-2. Reportedly, DeepSeek's newly open-sourced DeepSeek-OCR-2 adapts its architecture around DeepEncoder V2, abandons the classic CLIP visual branch in favor of an LLM as the visual encoder, and proposes a Visual Causal Flow paradigm to address the mismatch between semantics and sequence order that multimodal large models often exhibit on complex tables or non-linear text. Specifically, traditional vision-language models (VLMs) carry an inherent inductive bias: raster scanning with fixed absolute position encodings (left to right, top to bottom). This runs counter to the human visual mechanism of "jump-scanning driven by semantic logic": when reading a document, the eye follows the logical flow, scanning a table by row or column and jumping automatically across multi-column layouts. ...
The Attention Mechanism in Multimodal Large Models Hides a "Deception" That One Formula Can Correct
36 Ke· 2026-01-27 08:15
Is attention really reliable? In recent years, Vision-Language Models (VLMs) have made remarkable progress on multimodal understanding tasks, especially visual question answering, image understanding, and video understanding. These models typically use language-to-vision attention to measure the relevance between visual tokens and the text, and perform visual token pruning on that basis to reduce inference cost and improve efficiency. A long-overlooked question, however, is whether attention itself can really serve as a reliable indicator of "semantic importance". In a recent study, Zeng Dan's team at Shanghai University systematically analyzed the attention behavior of mainstream VLMs and found a key but easily missed phenomenon: attention is not determined by semantics alone, but is significantly shaped by structural biases. Pruning visual tokens directly on this biased attention often, without intending to, retains unimportant visual regions while discarding the key information that actually helps task understanding. Beyond position bias, the team also observed a subtler problem: abnormally high attention on padding regions. In many VLMs, padding is unavoidable because input image sizes vary, yet these regions contain no semantically useful informa ...
The Attention Mechanism in Multimodal Large Models Hides a "Deception" That One Formula Can Correct | Shanghai University × Nankai
QbitAI· 2026-01-27 02:33
Core Insights
- The article discusses the reliability of attention mechanisms in Vision-Language Models (VLMs), highlighting that attention may not be a trustworthy indicator of semantic importance due to structural biases [2][12]

Group 1: Attention Mechanism Issues
- Attention is influenced by structural biases, such as position bias, which favors later tokens in a sequence, leading to potential misinterpretation during visual token pruning [3][5]
- The phenomenon of "padding attention sink" is identified, where padding areas receive disproportionately high attention, misleading pruning strategies [5][6]

Group 2: Proposed Solutions
- The research team from Shanghai University suggests a debiasing approach to correct attention biases without introducing new pruning methods or additional training processes [6][12]
- By modeling the overall trends of attention biases, the team effectively reduces irrelevant positional factors, enhancing the semantic relevance of attention [6][12]

Group 3: Experimental Results
- The debiasing strategy was integrated as a plug-and-play module into various mainstream attention-based visual token pruning methods, showing consistent performance improvements across multiple tasks [7][10]
- Experimental results indicate that pruning models with the debiasing correction achieved stable performance enhancements, particularly under aggressive token compression [10][12]

Group 4: Conclusion
- The findings emphasize that attention is not inherently equivalent to semantic importance, and ignoring inherent structural biases can mislead pruning strategies, affecting overall model performance [12]
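The article does not reproduce the correction formula itself, but the idea it describes (model the structural trend of attention, remove it, and exclude padding before pruning) can be sketched as follows. This is a minimal illustrative sketch, not the team's actual method: the linear-trend bias model, the function names `debias_attention` and `prune_topk`, and the array shapes are all assumptions made for illustration.

```python
import numpy as np

def debias_attention(attn, positions, valid_mask):
    """Remove a fitted positional trend from attention scores and
    exclude padding, so that top-k pruning keeps semantically
    relevant tokens rather than structurally favored ones.

    attn: (N,) language-to-vision attention per visual token
    positions: (N,) token index in raster-scan order
    valid_mask: (N,) False for padding tokens
    """
    # Fit a linear trend of attention vs. position (a stand-in for the
    # position bias that favors certain tokens), using non-padding tokens only.
    coeffs = np.polyfit(positions[valid_mask], attn[valid_mask], deg=1)
    trend = np.polyval(coeffs, positions)
    corrected = attn - trend
    # Padding regions carry no semantics: remove them from pruning decisions.
    corrected = corrected.copy()
    corrected[~valid_mask] = -np.inf
    return corrected

def prune_topk(attn, positions, valid_mask, k):
    """Keep the k visual tokens with the highest debiased attention."""
    corrected = debias_attention(attn, positions, valid_mask)
    return np.argsort(corrected)[::-1][:k]
```

Here a linear fit stands in for whatever bias model the paper actually uses; the point is only that pruning on corrected scores keeps the semantically salient token instead of a padding "attention sink" that dominates the raw scores.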
RMB 1.177 Billion in Capital Bets on the Leading Truck "New Force"; the L2 Step-Up Route Is the First to Be Proven in Commercial Vehicles!
QbitAI· 2026-01-27 02:33
Jia Haonan, reporting from Aofeisi. QbitAI | WeChat official account QbitAI. Even on a hardcore track widely regarded as high in technical barriers and hard to commercialize, some players manage to grow against the cycle. Investors in this round include Puhua Capital, ABC Impact (a Temasek-affiliated investment firm), Sunwoda, Qianhai Haotian, Hantang Real Estate, Linyi Guoke, the Changxing Chuangqiang Fund, Shandong Guokong Capital, Lenovo Capital, the Greater Bay Area Fund, Guangyue Investment, and Hongshan Fund. State-owned, foreign, and industrial capital all joined, seizing the window to catch the last train before DeepWay's listing. A constantly expanding circle of investors is the norm at DeepWay (Shenxiang): over the past five years it closed five rounds within its Series A and three within its Series B... The publicly traceable total is RMB 1.98 billion; adding this RMB 1.177 billion brings the cumulative figure past RMB 3 billion. On one hand, this shows how sought-after the autonomous-driving truck company is. In five years, DeepWay has built annual revenue of several billion yuan by selling new-energy heavy trucks, and its acceleration is even harder to ignore: deliveries in a single quarter of 2025 exceeded the whole of 2024. But what investors are betting on is not just vehicle sales, since that logic alone cannot support DeepWay's valuation and potential as the "first public autonomous-driving truck stock" for open public roads. Ground-up design of new-energy heavy trucks, once the only commercial-vehicle company authorized to use Baidu's Apollo technology, full in-house development of the three-electric (battery, motor, electronic control) stack... Looking through Dee ...
Reading the Trends! 2026 Overview of China's Intelligent Design Industry, Its Value Chain, and Market Status: Policy and Technology Jointly Drive the Intelligent Design Revolution as Intelligent Design Enters a New Era of Real-Time Iteration [Chart]
Chan Ye Xin Xi Wang· 2026-01-27 01:22
Core Insights
- The article emphasizes the strategic importance of artificial intelligence (AI) in China's manufacturing sector, highlighting the transformative impact of generative AI and multimodal large models on intelligent design processes [1][7].

Industry Overview
- Intelligent design utilizes modern information technology to simulate human cognitive activities, enabling systems to undertake complex tasks and assist designers in decision-making [6].
- The design process consists of three main levels: conventional design, associative design, and evolutionary design [6].

Market Size
- The market size of China's intelligent design industry is projected to reach approximately 6.724 billion yuan in 2024, reflecting a year-on-year growth of 20.70% [1][7].

Industry Chain
- The upstream of the intelligent design industry chain includes AI chips, servers, specialized design databases, data annotation, algorithmic large models, and development platforms [6].
- The midstream focuses on system integration and service, while the downstream applications span manufacturing, construction, healthcare, consumer electronics, autonomous driving, digital twins, and space design [6].

Key Companies
- Alibaba Group leads the industry with its LuBan AI design platform, achieving automated material generation and promoting multimodal design innovation [8].
- Zhongwang Software offers an All-in-One CAx solution for integrated design, simulation, and manufacturing, enhancing autonomous capabilities [8].
- Tianfu Software specializes in industrial simulation, developing software that utilizes AI acceleration algorithms to overcome simulation time bottlenecks [8].

Industry Development Trends
1. The role of designers is shifting from creating static interfaces to defining dynamic "generative rules" and architecting AI agents [10].
2. Intelligent design will expand from visual and interactive aspects to encompass full sensory experiences and accurate physical world modeling [11].
3. The industry process is being restructured from linear stages to an end-to-end ecosystem that connects consumer creativity directly to manufacturing [12].
Unisound Launches ShanHai.ZhiYin 2.0, Reshaping the Paradigm of Human-Machine Interaction
Zhi Tong Cai Jing· 2026-01-26 01:22
Core Insights
- The company is accelerating its "One Foundation, Two Wings" technology strategy amid the rise of intelligent agents, recently launching the "ShanHai.ZhiYin" model 2.0 after upgrading the "ShanHai.ZhiYi" 5.0 medical model [1]

Group 1: Model Capabilities
- The "ShanHai.ZhiYin" model 2.0 focuses on three major capability evolutions: understanding professional terminology and local dialects, expressing warmth and emotional connection, and achieving extreme responsiveness [1]
- In terms of "understanding," the model's ASR (Automatic Speech Recognition) capabilities have demonstrated leading performance on both public test sets and proprietary full-scenario test sets, surpassing mainstream domestic open-source and closed-source speech models and reaching the highest industry standards [1]
- For "expression," ShanHai.ZhiYin-TTS (Text-to-Speech) is built around a "highly human-like and creatively diverse" core, currently supporting 12 dialects (including Cantonese, Sichuanese, and Shanghainese) and 10 foreign languages, with the ability to switch among 12 styles of Mandarin [1]
- The model also overcomes challenges in smooth full-duplex interaction, enabling real-time interruptions, immediate responses, and coherent follow-up questions, making human-machine dialogue as fluid as conversation between close friends [1]

Group 2: Technological Foundation
- The capabilities of the ShanHai.ZhiYin 2.0 model are underpinned by the company's proprietary "ShanHai.Atlas" intelligent computing foundation, which deeply integrates the general multimodal model base with the Atlas architecture, serving as the foundation for professional intelligent agents and the core of perceptual AI [1]
Mingming Hen Mang Opens Its IPO Today with an Offer Price of No More Than HKD 236.6; Netflix Proposes an All-Cash Acquisition of Warner Bros.
Sou Hu Cai Jing· 2026-01-21 02:06
Group 1
- Hunan Mingming Hen Mang Commercial Chain Co., Ltd. has officially launched its global offering, planning to list on the Hong Kong Stock Exchange on January 28, with a share price not exceeding HKD 236.6 [2]
- The company plans to issue 14.1011 million shares, with approximately 12.6909 million shares for international offering and about 1.4102 million shares for public offering in Hong Kong, estimating a net amount of approximately HKD 3.124 billion from the offering at a median price of HKD 233.10 per share [2]
- Netflix has adjusted its acquisition proposal for Warner Bros. to an all-cash offer of USD 82.7 billion, with a cash price of USD 27.75 per share, receiving unanimous support from Warner Bros. Discovery's board [2]

Group 2
- Yupan Intelligent has completed a Pre-IPO+ round of financing amounting to approximately RMB 513 million, with investments from various entities including Wenzhou Cangnan Shanhai Industrial Group and Crewstone International [2]
- Nature Select has recently completed a new financing round exceeding USD 30 million, with investments from Alibaba, Ant Group, and several venture capital firms [3]
- The company "Today Yixiu" has announced the completion of a seed round financing of several tens of millions, with investors including Hillhouse Ventures and Yunjiu Capital, planning to launch a series of hardware and software products later this year [4]

Group 3
- Potensic has launched the AI-powered ATOM series of drones, featuring smart functions and compliance with global regulations, aimed at enhancing user experience in flight and control [4]
- Tesla's second-generation humanoid robot figurine will be available for sale on January 21, priced at RMB 199, consisting of over 40 independent parts and designed to closely resemble the second-generation humanoid robot [5]
Beating GPT and Gemini: MOSI AI, a Startup Team Incubated by Fudan × Chuangzhi, Releases a New Speech Model
Ji Qi Zhi Xin· 2026-01-20 10:19
Core Viewpoint
- The article highlights the breakthrough capabilities of the MOSS-Transcribe-Diarize model developed by MOSI AI, which excels in multi-speaker automatic speech recognition (ASR) and outperforms existing models like GPT-4o and Gemini in complex audio environments [1][2][9].

Group 1: Model Capabilities
- MOSS-Transcribe-Diarize can handle overlapping speech and chaotic dialogue scenarios effectively, demonstrating a significant improvement in transcription accuracy [1][5].
- The model supports a long context window of 128K, allowing it to process audio inputs of up to 90 minutes, showcasing its robustness in complex environments [1][9].
- It achieves state-of-the-art (SOTA) performance across various benchmarks, including the AISHELL-4, Podcast, and Movies datasets, particularly excelling in challenging audio conditions [2][16][19].

Group 2: Technical Innovations
- The model employs a unified end-to-end multimodal architecture that integrates speech recognition, speaker attribution, and timestamp prediction, addressing the classic SATS (Speaker Attribution and Timestamped Speech) challenge [8][12].
- MOSS-Transcribe-Diarize utilizes a combination of real-world dialogue audio and synthetic data for training, enhancing its robustness against overlapping speech and acoustic variations [13][14].
- The architecture allows for direct output of text with speaker labels and precise timestamps, improving accuracy through semantic information utilization [12][14].

Group 3: Competitive Advantage
- In benchmark tests, MOSS-Transcribe-Diarize significantly outperformed competitors like GPT-4o and Gemini 3 Pro on metrics such as Character Error Rate (CER) and optimal-permutation Character Error Rate (cpCER), particularly on long audio inputs [16][19].
- The model maintains speaker consistency in long dialogues, reducing performance degradation caused by speaker attribution errors [16].
- It demonstrates superior performance in various scenarios, including real-world meetings, podcasts, and complex film dialogues, proving its versatility and effectiveness [19][21].

Group 4: Future Directions
- MOSI AI aims to continue advancing multimodal intelligence, focusing on enabling AI to understand complex real-world contexts and achieve natural, coherent, and reliable interactions [24].
- The company has a strategic vision to develop technologies that enhance real-time dialogue interaction and robust speech understanding, positioning itself as a leader in the AI field [24].
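As a point of reference for the metrics named above: CER is the character-level edit distance between a hypothesis transcript and the reference, normalized by reference length, and cpCER additionally minimizes over speaker-to-speaker assignments. A minimal sketch of plain CER follows; this is not MOSI AI's evaluation code, and the function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edits needed to turn hyp into ref,
    normalized by the reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

cpCER extends this by computing CER under every assignment of hypothesis speakers to reference speakers and keeping the best one, so a correct transcript with swapped speaker labels is not over-penalized.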
[Global Call] Awakening Millennia of Civilization with AI! The Tanyuan Program NextGen Digital Revitalization Track: Five Cultural Scenarios Open for Your Challenge
Tencent Research Institute· 2026-01-20 09:53
Core Viewpoint
- The article emphasizes the integration of advanced technologies like AI to revitalize cultural heritage and enhance public engagement with historical narratives and experiences [2][56].

Group 1: Cultural Revitalization through Technology
- The initiative aims to create immersive experiences that allow users to interact with cultural heritage, such as AI-generated historical narratives and personalized experiences [2][5].
- The "NextGen" plan by Tencent focuses on leveraging cutting-edge technologies to address the challenges of cultural heritage revitalization, aiming to create new forms of expression and engagement [5][56].

Group 2: Specific Topics and Challenges
- The program identifies three main topics for innovation:
  1. Development of multi-modal intelligent agents for cultural content generation [5].
  2. Creation of immersive interactive experiences that combine sensory data and emotional computing [6].
  3. Human-machine collaboration for the transmission and development of traditional crafts through digital means [7].

Group 3: Specific Cultural Scenarios
- Five specific cultural scenarios have been outlined for technological application:
  1. "Cloud Residence Intelligent Companion" for enhancing public understanding of historical texts [8][9].
  2. "Hangzhou West Lake Experience" focusing on personalized immersive tourism experiences [15][16].
  3. "Dawenkou Culture Interactive Experience" to facilitate understanding of ancient pottery techniques [19].
  4. "Bridge Wisdom Transmission" for teaching traditional wooden bridge construction techniques [29].
  5. "Cantonese Lion Dance Digital Activation" to enhance interaction and experience in traditional performances [36].

Group 4: Collaboration and Support
- The initiative invites global technology teams to collaborate with cultural institutions to propose innovative solutions, with funding and resources available for selected projects [43][52].
- The project will undergo a structured process from proposal submission to implementation, ensuring thorough evaluation and support [48][50].
The Deciding Factor for AI Applications in 2026: Multimodality, from AI Video to Robots
2026-01-20 03:54
Summary of Conference Call on AI Applications and Multimodal Models

Company and Industry Overview
- The conference call focused on AI applications and advancements in multimodal models, particularly in the context of the computer and technology industry, with specific emphasis on companies like Google, OpenAI, and domestic players like ByteDance and Minimax [1][21][30].

Core Points and Arguments
1. **Transition to AI 2.0**: The industry is entering a 2.0 phase characterized by a focus on scalable AI application scenarios, particularly in multimodal models [1][3].
2. **Key Growth Areas**: Two primary areas identified for growth are AI in finance and taxation, and AI video applications, with a notable emphasis on the latter due to its larger global market potential [2][3].
3. **Rapid Growth in AI Video**: There has been explosive growth in AI-generated short dramas and videos, with expectations of significant increases in production and quality over the next year [3][21].
4. **Technological Advancements**: The evolution of large models is shifting from text-based to multimodal capabilities, with significant developments in dynamic understanding and generation [5][20].
5. **Emergence of World Models**: The concept of world models is gaining traction and could enhance applications in robotics and autonomous driving, although it is still in the experimental phase [18][28].

Additional Important Insights
1. **Cost Reduction in AI Video Production**: The cost of producing high-quality AI videos has significantly decreased, with estimates suggesting costs for 1080P-quality videos are now in the range of 1,000 to 3,000 yuan [23][30].
2. **Domestic Model Development**: Domestic models are expected to catch up with international counterparts by mid-2024, with companies like ByteDance and Minimax leading the charge [22][27].
3. **Investment Opportunities**: Key investment opportunities identified include companies involved in AI video production, such as Zhaochi and Kunlun Wanwei, as well as those developing AI tools and platforms [25][30].
4. **Market Growth Projections**: The AI video market is projected to experience exponential growth, with estimates suggesting it could exceed 1 trillion yuan, driven by both supply and demand factors [24][30].
5. **Focus on Multimodal Applications**: The emphasis on multimodal applications is expected to drive significant advancements in AI technologies, particularly in video generation and understanding [29][30].

This summary encapsulates the key discussions and insights from the conference call, highlighting the transformative potential of AI applications and the strategic focus on multimodal models within the industry.