Multimodal Large Models

海天瑞声 (SpeechOcean) 20250610
2025-06-10 15:26
海天瑞声 (SpeechOcean) 20250610 Summary

- Meta's investment in Scale AI aims to secure high-quality data and to expand into defense and other markets in support of its AI commercialization, and Meta also values Scale AI's customer base and its positioning in government, political, and military sectors.
- Scale AI's revenue is growing fast, projected to reach $2 billion in 2025, with its valuation doubling to $27.6 billion, driven mainly by U.S. military and government orders.
- 海天瑞声 believes the spread of AI applications and the development of multimodal large models are lifting the market's ceiling; demand for visual data has surged, with visual revenue accounting for 49% of the total in Q1 2025.
- In 2025, 海天瑞声 is ramping up its data accumulation business and expanding overseas; its data delivery base in the Philippines provides low-cost capacity, and its content moderation business contributes cash flow.
- 海天瑞声 strengthens its competitiveness through R&D innovation, AI-assisted annotation, and synthetic data, while tracking demand for new data types.
- Growth of domestic large models has driven 海天瑞声's cooperation with China Mobile and other central state-owned enterprises; benefiting from the "沿投联动" (investment-procurement linkage) mechanism, its orders have grown markedly.
- Through a "3+1" model, 海天瑞声 participates in local governments' data industrialization projects, providing data governance, annotation, and related services, and adopts on-premises deployment to ensure compliance.

Q&A

What is the logic behind Meta's investment in Scale AI?

Meta's investment in Scale AI reflects two main considerations. First, data processing remains critical to AI training. Scale AI has ...
Apple's AI Is a No-Show, While AI Recorders, AI Toys, and Other "New Domestic Goods" Catch Fire First
Nan Fang Du Shi Bao· 2025-06-10 08:41
Group 1: Industry Trends
- The "2025 High-Quality Consumption Brand TOP100" initiative focuses on nine key sectors including beauty economy, sports and outdoor, food and health, smart consumer electronics, pet economy, experience economy, interest consumption, cross-border expansion, and consumption technology [2]
- AI and hardware integration is emerging as a significant trend across various sectors, with companies launching AI-enabled products that are breaking traditional market boundaries [2][3]
- The global AI hardware market is witnessing rapid growth, with notable products like AI recorders and AI glasses gaining traction [3][5]

Group 2: AI Hardware Developments
- The AI recorder Plaud Note has achieved significant success, with nearly 700,000 units shipped globally and annual revenue of $100 million, reflecting tenfold growth over two years [5][11]
- AI glasses are becoming increasingly popular, with companies like Thunderbird and Rokid announcing new products that leverage AI for enhanced user experiences [7][8]
- AI technology is enhancing the functionality of household appliances, with smart kitchen devices seeing over 30% sales growth in 2024 [20][21]

Group 3: Consumer Insights
- A survey indicated that over 30% of consumers are motivated to purchase products that incorporate AI technology, with more than half feeling a sense of upgrade when encountering AI-enabled Chinese brands [22]
- The integration of AI in household appliances is shifting the industry from passive response to proactive service, creating interconnected smart home ecosystems [22][23]
AI Spontaneously Develops Human-Level Cognition! Chinese Scientists Reveal That Multimodal Large Models Exhibit Emergent Human-Like Object Concept Representations
Huan Qiu Wang· 2025-06-10 02:09
Researchers extracted 66 "mental dimensions" from massive amounts of large-model behavioral data and assigned semantic labels to each. These dimensions proved highly interpretable and correlated significantly with neural activity patterns in the brain's category-selective regions (such as the FFA for faces, the PPA for scenes, and the EBA for bodies).

The study also compared how consistent multiple models' behavioral choice patterns are with humans' ("human consistency"). Multimodal large models (such as Gemini_Pro_Vision and Qwen2_VL) scored higher on consistency. The study further revealed that humans tend to combine visual features and semantic information when making judgments, whereas large models tend to rely on semantic labels and abstract concepts. The work indicates that large language models are not "stochastic parrots": they internally hold human-like understandings of real-world concepts.

The findings were published in Nature Machine Intelligence under the title "Human-like object concept representations emerge naturally in multimodal large language models". (Qingshan)

So, can large language models (LLMs) develop human-like object concept representations from language and multimodal data? Recently, the Chinese Academy of Sciences ...
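The "human consistency" figure mentioned above measures how often a model's choice matches a human's on the same trial, typically in a triplet odd-one-out task. Here is a minimal sketch of how such a score could be computed; the data, function names, and cosine-similarity decision rule are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def odd_one_out(embeddings, triplet):
    """Pick the odd item: the one outside the most similar pair (assumed rule)."""
    i, j, k = triplet
    sim = lambda a, b: np.dot(embeddings[a], embeddings[b]) / (
        np.linalg.norm(embeddings[a]) * np.linalg.norm(embeddings[b]))
    pair_sims = {(i, j): sim(i, j), (i, k): sim(i, k), (j, k): sim(j, k)}
    closest_pair = max(pair_sims, key=pair_sims.get)
    return ({i, j, k} - set(closest_pair)).pop()

def human_consistency(model_emb, human_choices, triplets):
    """Fraction of triplets where the model's odd-one-out matches the human choice."""
    matches = [odd_one_out(model_emb, t) == c
               for t, c in zip(triplets, human_choices)]
    return float(np.mean(matches))

# Toy usage with random embeddings for 10 hypothetical objects.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
triplets = [(0, 1, 2), (3, 4, 5)]
human = [2, 5]
print(human_consistency(emb, human, triplets))
```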
Shengshu Technology CEO Luo Yihang: From Models to Production, How Multimodal AI Makes Video Creation More Efficient
硬AI· 2025-06-09 14:07
Luo Yihang, CEO of Beijing Shengshu Technology, delivered a keynote titled "Multimodal Generation: From Models to Production", centered on multimodal large models, in particular the opportunities and challenges of video generation in industrial deployment, and shared the solutions and results of Shengshu Technology (Vidu).

Highlights from the talk:

Multimodal large models are reaching an inflection point for large-scale production deployment. First, technology is iterating very quickly: audio and video generation models are improving rapidly in quality, speed, and cost. Second, industry demand is especially strong. Third, deployment of video-content applications across many industries is accelerating.

From this year onward, four conditions must be met simultaneously: creative content, content quality, generation efficiency, and production cost. Once content quality is better than what traditional methods deliver, then on efficiency and cost, in my view, efficiency must improve by at least a hundredfold over the traditional approach.

At Shengshu we focus on multimodal generation, currently mainly video generation, including audio-video, and in the future we will extend to 3D narrative spaces and beyond. For now we focus on professional and enterprise users, working to bring the model into 8 major industries and 30 major scenarios.

Vidu 2.0 greatly improved speed, achieving generation in as little as 5 seconds. Vidu Q1 goes further, adding a high-definition version, first-and-last-frame control, and anime-oriented generation, along with deeper work on sound effects and audio. Since Vidu's launch, the share of professional creation has grown ...
Chinese Scientists Reveal the Concept Representation Mechanism of Multimodal Large Models
Xin Hua She· 2025-06-09 09:32
Traditional AI research has focused on object-recognition accuracy while rarely asking whether models truly "understand" what objects mean. He Huiguang said: "Current AI can tell cat pictures from dog pictures, but the essential difference between this 'recognition' and humans' 'understanding' of cats and dogs remains to be revealed."

Starting from classic theories in cognitive neuroscience, the team designed an innovative paradigm combining computational modeling, behavioral experiments, and brain science, and built a "concept map" of large AI models.

He Huiguang explained that the team extracted 66 "mental dimensions" from massive large-model behavioral data and assigned semantic labels to them. The study found these dimensions to be highly interpretable and significantly correlated with neural activity patterns in the brain's category-selective regions. The study also compared multiple models' behavioral-choice consistency with humans; multimodal large models proved more consistent. In addition, the study revealed that humans tend to combine visual features and semantic information when making decisions, whereas large models tend to rely on semantic labels and abstract concepts. The research indicates that large language models internally hold human-like understandings of real-world concepts. (Reporter: Song Chen)

On June 9, reporters learned from the Institute of Automation of the Chinese Academy of Sciences that a joint team from the institute and the CAS Center for Excellence in Brain Science and Intelligence Technology had published the research in Nature Machine Intelligence, confirming for the first time that multimodal large language models can spontaneously form object concept representation systems highly similar to those of humans. This offers a new path for the cognitive science of artificial intelligence and, for building human-like ...
Why Are Floating Losses as Deep as the Sea, While Floating Gains Are Downed in One Gulp?
Ge Long Hui· 2025-06-09 01:34
Last week the Dow made history, breaking above 40,000 for the first time, up 1.24% for the week. The Nasdaq rose 2.1% to a new high, and the S&P 500 rose 1.5% to a new high. Tech stocks mostly climbed: Microsoft gained 1.5%, Apple 3.7%, and Nvidia 2.9%, all rising for a fourth straight week. Morgan Stanley sees AI servers becoming a money magnet: Dell rose 12.6% and Super Micro 11.2% for the week. Doesn't that seem odd? As the hedge leg against U.S. equities, Hong Kong stocks rose instead of falling, which suggests the safe-haven money driving this round of Hong Kong gains comes mostly from other emerging markets. The Hang Seng Tech Index rose 3.79% for the week, better than U.S. stocks, and even the Hang Seng Index rose 3.11%, also better. What does that tell us? For a top student, going from 95 to 98 takes relentless effort; for a weak student, going from 25 to 30 is not that hard: lucking into two more correct multiple-choice answers will do it.

Judging from position changes at Bridgewater and Hillhouse, there is no sign of U.S. capital aggressively adding to Chinese assets. In Q1, Hillhouse's HHLR still held Pinduoduo as its largest position, opened a position in AMD, and trimmed Baidu, Alibaba, Beike, and JD.com. Bridgewater added to Google, Nvidia, Apple, Meta, and Amazon in Q1 while trimming Pinduoduo.

Even without the blessing of U.S. capital, Hong Kong stocks still rallied, and the Hang Seng's gain was not small; this is likely tied to expectations that the dividend tax will be scrapped. For some long-term money, patience spans the cyclical swings of share prices, and dividend yield is the key parameter. For the same stock, buying in Hong Kong comes at a discount and delivers a higher dividend yield. Isn't that attractive? Last week, Hong Kong Exchanges ...
Focus on Multimodality: The ChatGPT Moment Hasn't Arrived. Have Large Models "Slowed Down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights
- The emergence of multi-modal models, such as Emu3, signifies a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3]
- The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation still lag behind expectations [1][5]
- The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8]

Multi-Modal Model Development
- Emu3, released by Zhiyuan Research Institute, is a native multi-modal model that incorporates various data types from the beginning of its training process, unlike traditional models that focus on language first [3][4]
- The current learning path for multi-modal models often leads to a decline in performance as they transition from strong language capabilities to integrating other modalities [3][4]
- The development of multi-modal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4]

Video Generation Challenges
- Video generation technology is currently at a transitional phase, comparable to the evolution from GPT-2 to GPT-3, indicating substantial room for improvement [5][6]
- Key issues in video generation include narrative coherence, stability, and controllability, which are essential for producing high-quality content [6]
- The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to enhance video generation capabilities [6]

Commercialization and Market Growth
- The multi-modal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8]
- The integration of traditional computer vision models with large models is seen as a potential pathway for commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8]
- Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and ultimately delivering direct results to users by 2025 [8]
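For readers sanity-checking the growth figures in the last group, the conventional definition of CAGR over n years is given below; this generic formula is an editorial aid, not something stated in the article:

```latex
\mathrm{CAGR} = \left( \frac{V_{\mathrm{end}}}{V_{\mathrm{start}}} \right)^{1/n} - 1
```

Equivalently, the end value equals the start value multiplied by (1 + CAGR)^n over the period.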
Multimodal Models Take On the Beijing and Hangzhou Subway Maps! o3 Scores Notably Well, but Still Trails Humans
量子位· 2025-06-07 05:02
Contributed by the ReasonMap team; 量子位 | WeChat official account QbitAI

In recent years, large language models (LLMs) and multimodal large models (MLLMs) have made breakthrough progress on scene understanding and complex reasoning tasks. Yet a key question is still worth asking: can MLLMs really "read" an image? In particular, when facing structurally complex, detail-dense images, do they possess fine-grained visual understanding and spatial reasoning, say, when challenged with a high-resolution subway map?

To answer this, a team from Westlake University, the National University of Singapore, Zhejiang University, and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. As it turns out, the Beijing and Hangzhou subway maps stumped a large swath of models.

ReasonMap is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly subway maps), designed to evaluate large models' ability to understand fine-grained, structured spatial information in images.

The results show that current mainstream open-source multimodal models hit a clear performance bottleneck on ReasonMap, especially on cross-line route planning, where visual confusion and missed stations are common. Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models on multiple dimensions, but still fall clearly short of human performance.

Across subway maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...
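The excerpt does not describe ReasonMap's exact scoring protocol, so the following is only a hedged illustration of the kind of check a transit-map route-planning benchmark can run: a hypothetical scorer that validates a predicted station sequence against known line data. The map, station names, and function names are invented for this sketch.

```python
# Hypothetical route checker in the spirit of a transit-map benchmark.
subway = {
    "Line 1": ["A", "B", "C", "D"],
    "Line 2": ["C", "E", "F"],
}

def adjacent(map_lines, s1, s2):
    """True if two stations are neighbors on some line."""
    for stations in map_lines.values():
        for a, b in zip(stations, stations[1:]):
            if {a, b} == {s1, s2}:
                return True
    return False

def route_is_valid(map_lines, route, origin, destination):
    """Check that a predicted route starts and ends correctly and only uses real edges."""
    if not route or route[0] != origin or route[-1] != destination:
        return False
    return all(adjacent(map_lines, a, b) for a, b in zip(route, route[1:]))

# A cross-line trip from A to F requires transferring at C.
print(route_is_valid(subway, ["A", "B", "C", "E", "F"], "A", "F"))  # True
print(route_is_valid(subway, ["A", "C", "F"], "A", "F"))            # False: skips stations
```

A real benchmark would add partial credit, transfer counting, and parsing of free-form answers, but the core correctness check reduces to edge validation like this.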
Foresight 2025: A Panorama of China's Multimodal Large Model Industry in 2025 (with Market Status, Competitive Landscape, Development Trends, and More)
Sou Hu Cai Jing· 2025-06-06 14:09
Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFlytek (002230.SZ); Wondershare (300624.SZ); 360 (601360.SH); Kunlun Tech (300418.SZ); CloudWalk (688327.SH); TRS (300229.SZ); and others

Core data in this article: number of model filings; pricing models; market size; regional shares; etc.

Industry Overview

1. Definition and Characteristics

Multimodality refers to methods and techniques that integrate and process two or more different types of information or data. In machine learning and artificial intelligence, the data types involved typically include, but are not limited to, text, images, video, audio, and sensor data. The goal of a multimodal system is to use information from multiple modalities to improve task performance, deliver a richer user experience, or obtain more comprehensive analytical results. Multimodal Large Language Models (MLLMs) are a class of models that combine large language models (Large Language Models, ...
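As a toy illustration of what "integrating and processing two or more data types" means at the interface level, a multimodal sample can be modeled as one record that bundles several modalities; this generic sketch is not tied to any particular MLLM.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class MultimodalSample:
    """One training/inference record that bundles several modalities."""
    text: Optional[str] = None
    image_path: Optional[str] = None      # e.g. a JPEG frame
    audio_path: Optional[str] = None      # e.g. a WAV clip
    sensor: Optional[List[float]] = None  # e.g. IMU readings

sample = MultimodalSample(
    text="What is the dog doing?",
    image_path="dog.jpg",
)
# A multimodal system fuses whichever modalities are present to solve the task.
```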
Ten Thousand Frames of Video Understanding on a Single GPU! Zhiyuan Research Institute (BAAI) Open-Sources Video-XL-2, a Lightweight Model for Ultra-Long Video Understanding
量子位· 2025-06-04 05:21
Core Viewpoint
- The article discusses the release of Video-XL-2, a new generation of long video understanding model developed by Zhiyuan Research Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of open-source models in processing and understanding long video content [1][3]

Technical Overview
- Video-XL-2 is designed with three core components: Visual Encoder, Dynamic Token Synthesis (DTS), and Large Language Model (LLM) [4][6]
- The model utilizes SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [6][11]
- The training strategy involves a four-stage progressive training design to build robust long video understanding capabilities [8][10]

Performance Improvements
- Video-XL-2 shows superior performance in long video understanding tasks, achieving leading levels on benchmarks such as MLVU, Video-MME, and LVBench compared to existing open-source models [9][15]
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the length of videos it can handle [19][23]
- It can encode 2,048 frames of video in just 12 seconds, demonstrating remarkable speed and efficiency [24][28]

Application Potential
- Video-XL-2 has high application potential in various real-world scenarios, including film content analysis, plot understanding, and anomaly detection in surveillance videos [28][30]
- Specific examples of its application include answering questions about movie scenes and detecting unexpected events in surveillance footage [30][32]
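The three-component layout described above (visual encoder, DTS compressor, LLM) can be sketched as a single forward pass. The code below is a schematic reconstruction based only on this summary: the module internals, dimensions, compression ratio, and interfaces are assumptions, not Video-XL-2's actual implementation.

```python
import torch
import torch.nn as nn

class DTSCompressor(nn.Module):
    """Hypothetical stand-in for Dynamic Token Synthesis: fuse per-frame
    features and compress them into fewer tokens (assumed behavior)."""
    def __init__(self, dim, compress_ratio=4):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=compress_ratio,
                                  stride=compress_ratio)  # temporal downsampling

    def forward(self, frame_feats):           # (batch, frames, dim)
        x = frame_feats.transpose(1, 2)       # (batch, dim, frames)
        x = self.temporal(x)                  # (batch, dim, frames // ratio)
        return x.transpose(1, 2)              # (batch, frames // ratio, dim)

class LongVideoPipeline(nn.Module):
    """Schematic encoder -> compressor -> LLM flow; not the real model."""
    def __init__(self, encoder, llm, dim=1152):
        super().__init__()
        self.encoder = encoder                # e.g. a SigLIP-like image encoder
        self.dts = DTSCompressor(dim)
        self.llm = llm                        # any decoder accepting embeddings

    def forward(self, frames):                # (batch, frames, C, H, W)
        b, f = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, f, -1)
        visual_tokens = self.dts(feats)       # far fewer tokens than frames
        return self.llm(inputs_embeds=visual_tokens)

# Toy usage with stub modules (dimensions chosen arbitrarily for the demo).
enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(1152))
llm = lambda inputs_embeds: inputs_embeds.mean(dim=1)  # stand-in "LLM"
model = LongVideoPipeline(enc, llm)
out = model(torch.randn(1, 16, 3, 32, 32))    # 16 frames -> 4 visual tokens
```

Compressing frame tokens before they reach the language model is the usual lever for fitting thousands of frames into a fixed context budget, which is consistent with the single-GPU, 10,000-frame claim above.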