Multimodal Large Models
Haitian Ruisheng 20250610
2025-06-10 15:26
Haitian Ruisheng 20250610 Summary: Meta's investment in Scale AI aims to secure high-quality data and expand into markets such as defense, supporting the commercialization of its AI business; Meta also values Scale AI's customer base and its footprint in government and military sectors. Scale AI's revenue is growing rapidly, projected to reach $2 billion in 2025, with its valuation doubling to $27.6 billion, driven mainly by U.S. military and government orders. Haitian Ruisheng believes the spread of AI applications and the development of multimodal large models are expanding the market; demand for visual data has surged, with visual revenue accounting for 49% of revenue in Q1 2025. In 2025 Haitian Ruisheng is ramping up its data accumulation business and expanding overseas; its Philippine data delivery base provides low-cost capacity, and the content moderation business contributes cash flow. The company strengthens its competitiveness through R&D innovation, AI-assisted annotation, and synthetic data, while tracking demand for new data types. The growth of domestic large models has driven cooperation between Haitian Ruisheng and central state-owned enterprises such as China Mobile; benefiting from the investment linkage mechanism, orders have grown significantly. Through a "3+1" model, Haitian Ruisheng participates in local governments' data industrialization projects, providing services such as data governance and annotation, and adopts a local-deployment strategy to ensure compliance. Q&A What is the logic behind Meta's investment in Scale AI? Meta's investment in Scale AI reflects two main considerations. First, data processing remains crucial in AI training. Scale AI has ...
Apple's AI is a no-show, while AI recorders, AI toys, and other "new domestic products" catch fire first
Nan Fang Du Shi Bao· 2025-06-10 08:41
Group 1: Industry Trends - The "2025 High-Quality Consumption Brand TOP100" initiative focuses on nine key sectors including beauty economy, sports and outdoor, food and health, smart consumer electronics, pet economy, experience economy, interest consumption, cross-border expansion, and consumption technology [2] - AI and hardware integration is emerging as a significant trend across various sectors, with companies launching AI-enabled products that are breaking traditional market boundaries [2][3] - The global AI hardware market is witnessing rapid growth, with notable products like AI recorders and AI glasses gaining traction [3][5] Group 2: AI Hardware Developments - The AI recorder Plaud Note has achieved significant success, with nearly 700,000 units shipped globally and an annual revenue of $100 million, reflecting a tenfold growth over two years [5][11] - AI glasses are becoming increasingly popular, with companies like Thunderbird and Rokid announcing new products that leverage AI for enhanced user experiences [7][8] - AI technology is enhancing the functionality of household appliances, with smart kitchen devices seeing over 30% sales growth in 2024 [20][21] Group 3: Consumer Insights - A survey indicated that over 30% of consumers are motivated to purchase products that incorporate AI technology, with more than half feeling a sense of upgrade when encountering AI-enabled Chinese brands [22] - The integration of AI in household appliances is shifting the industry from passive response to proactive service, creating interconnected smart home ecosystems [22][23]
AI spontaneously forms human-level cognition! Chinese scientists reveal that multimodal large models show emergent human-like object concept representations
Huan Qiu Wang· 2025-06-10 02:09
The researchers extracted 66 "mental dimensions" from massive amounts of large-model behavioral data and assigned semantic labels to these dimensions. The study found that these dimensions are highly interpretable and correlate significantly with neural activity patterns in category-selective regions of the brain (such as the FFA, which processes faces; the PPA, which processes scenes; and the EBA, which processes bodies).

The study also compared multiple models' consistency with humans (human consistency) in behavioral choice patterns. The results show that multimodal large models (such as Gemini_Pro_Vision and Qwen2_VL) perform better on consistency. In addition, the study revealed that humans tend to combine visual features with semantic information when making decisions, whereas large models tend to rely on semantic labels and abstract concepts. The research indicates that large language models are not "stochastic parrots"; internally they hold something resembling a human understanding of real-world concepts.

The findings were published in Nature Machine Intelligence under the title "Human-like object concept representations emerge naturally in multimodal large language models". (Qingshan)

So can large language models (LLMs) develop human-like object concept representations from language and multimodal data? Recently, the Chinese Academy of Sciences ...
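The human-consistency comparison described above is typically run on forced-choice behavioral tasks. As an illustration only (the study's exact task and metric are not detailed here; a triplet odd-one-out design and a simple agreement rate are assumed), a minimal sketch:

```python
import numpy as np

def odd_one_out(emb, triplet):
    """Choose the odd item: the one left out of the most similar pair."""
    i, j, k = triplet
    pair_sim = {i: float(emb[j] @ emb[k]),   # if (j, k) are most alike, i is odd
                j: float(emb[i] @ emb[k]),
                k: float(emb[i] @ emb[j])}
    return max(pair_sim, key=pair_sim.get)

def human_consistency(emb, triplets, human_choices):
    """Fraction of triplets where the embedding's choice matches the human's."""
    hits = sum(odd_one_out(emb, t) == h for t, h in zip(triplets, human_choices))
    return hits / len(triplets)

# Toy demo: 4 objects in a 2-D "mental space"; objects 0 and 1 are close,
# objects 2 and 3 are close. All names and values are invented.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
triplets = [(0, 1, 2), (0, 2, 3), (1, 2, 3)]
print(odd_one_out(emb, (0, 1, 2)))  # 2: objects 0 and 1 form the similar pair
```

The same agreement rate can be computed for each model's embedding, which is one simple way to rank models by how closely their choices track human ones.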
Shengshu Technology CEO Luo Yihang: From models to production, how multimodal AI makes video creation more efficient
硬AI· 2025-06-09 14:07
Core Insights - The article emphasizes that multimodal large models are at a critical turning point for large-scale production, driven by rapid technological iterations and strong industry demand [1][3][5]. Group 1: Industry Trends - The rapid iteration of audio and video generation models has significantly improved effectiveness, speed, and cost [6][9]. - There is a strong demand in the industry for video content production, addressing traditional pain points such as long cycles, high costs, and the need for specialized skills [7][9]. - The pace of industry adoption for video content-related applications is accelerating, with many sectors actively exploring and implementing solutions [7][9]. Group 2: Company Focus and Solutions - The company focuses on four key conditions for scaling video generation: content creativity, quality, efficiency, and cost reduction, aiming for at least a hundredfold improvement in efficiency compared to traditional methods [3][9]. - The company’s product, Vidu, has seen a 300% increase in professional creation since its launch, with significant growth in generation volume, payment, and usage duration [4][22]. - Vidu 2.0 can generate content in just 5 seconds, with enhancements in quality and features such as high-definition versions and audio effects [10][11]. Group 3: Market Applications - The company targets eight major industries and thirty application scenarios, focusing on professional and enterprise users [9][10]. - Applications of Vidu span various sectors, including internet advertising, animation, e-commerce, and education, with 80% of professional users in demanding scenarios [4][22]. - The company has successfully collaborated with notable partners, including Sony Pictures, to produce high-quality promotional content efficiently [20][21]. 
Group 4: User Engagement and Community - The Vidu platform has over 30 million users across more than 200 countries, with a strong community for sharing creative ideas and inspirations [11][12]. - Daily, users engage in millions of creative expressions, contributing to a vibrant ecosystem of content creation [12][22]. - The company aims to empower users by enhancing their creative capabilities while maintaining efficiency and cost-effectiveness in content production [23].
Chinese scientists reveal the concept representation mechanism of multimodal large models
Xin Hua She· 2025-06-09 09:32
Traditional AI research has focused on object-recognition accuracy while rarely asking whether models truly "understand" what objects mean. He Huiguang said: "Current AI can tell cat pictures from dog pictures, but the essential difference between this kind of 'recognition' and the way humans 'understand' cats and dogs remains to be revealed."

Starting from classic theories in cognitive neuroscience, the research team designed an innovative paradigm combining computational modeling, behavioral experiments, and brain science, and built a "concept map" of large AI models.

He Huiguang explained that the team extracted 66 "mental dimensions" from massive large-model behavioral data and assigned semantic labels to them. The study found these dimensions to be highly interpretable and significantly correlated with neural activity patterns in category-selective brain regions. The team also compared several models' consistency with humans in behavioral choice patterns, finding that multimodal large models perform better on consistency. In addition, the study revealed that humans tend to combine visual features and semantic information when making decisions, whereas large models tend to rely on semantic labels and abstract concepts. The research shows that large language models internally hold something resembling a human understanding of real-world concepts. (Reporter Song Chen)

On June 9 the reporter learned from the Institute of Automation, Chinese Academy of Sciences, that a joint team from the institute and the CAS Center for Excellence in Brain Science and Intelligence Technology has published the study in Nature Machine Intelligence, confirming for the first time that multimodal large language models can spontaneously form object concept representation systems highly similar to humans'. This offers a new path for the cognitive science of AI, and for building human-like ...
Why do paper losses run as deep as the sea, while paper gains are gulped down in one bite?
Ge Long Hui· 2025-06-09 01:34
Market Performance - The Dow Jones Industrial Average surpassed 40,000 points for the first time in its history, with a weekly increase of 1.24% [1] - The Nasdaq index rose by 2.1% to reach a new high, while the S&P 500 increased by 1.5%, also achieving a new high [1] - Technology stocks performed well, with Microsoft up 1.5%, Apple up 3.7%, and Nvidia up 2.9%, marking four consecutive weeks of gains [1] Hong Kong Market Dynamics - Despite the lack of significant foreign investment, the Hong Kong stock market has seen gains, with the Hang Seng Index rising by 3.11% [3] - The Hang Seng Tech Index increased by 3.79%, outperforming U.S. markets [3] - The rise in Hong Kong stocks is attributed to expectations of the cancellation of the dividend tax, making stocks more attractive for long-term investors [3] Corporate Earnings - Major internet companies such as Tencent, Baidu, and JD.com reported better-than-expected earnings, leading to stock price increases [3] - Alibaba's earnings fell short of expectations, resulting in a decline in its stock price [3] U.S. Tariff Implications - New U.S. tariffs will affect $18 billion worth of goods from China, including steel, aluminum, semiconductors, and batteries [6] - The tariff on electric vehicles will increase from 25% to 100%, reflecting the rapid growth of China's automotive export sector [6] Export Data - Belgium leads in vehicle exports from China with 175,437 units, showing an 11.3% increase year-on-year [7] - The UK and the Philippines follow, with exports of 125,314 and 115,423 vehicles, respectively, reflecting significant growth rates [7] Investment Strategy - The semiconductor and chip sectors are identified as cyclical industries with potential for significant returns, especially during bear markets [8] - The importance of maintaining a long-term perspective on investments is emphasized, suggesting that investors should focus on price trends rather than entry costs [8]
Focus on multimodality: with the "ChatGPT moment" yet to arrive, did large models "slow down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights - The emergence of multi-modal models, such as Emu3, signifies a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3] - The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation are still lagging behind expectations [1][5] - The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8] Multi-Modal Model Development - Emu3, released by Zhiyuan Research Institute, is a native multi-modal model that incorporates various data types from the beginning of its training process, unlike traditional models that focus on language first [3][4] - The current learning path for multi-modal models often leads to a decline in performance as they transition from strong language capabilities to integrating other modalities [3][4] - The development of multi-modal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4] Video Generation Challenges - Video generation technology is currently at a transitional phase, comparable to the evolution from GPT-2 to GPT-3, indicating that there is substantial room for improvement [5][6] - Key issues in video generation include narrative coherence, stability, and controllability, which are essential for producing high-quality content [6] - The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to enhance video generation capabilities [6] Commercialization and Market Growth - The multi-modal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8] - The integration of 
traditional computer vision models with large models is seen as a potential pathway for commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8] - Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and ultimately delivering direct results to users by 2025 [8]
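Growth-rate figures like those quoted above follow the standard compound annual growth rate formula; a minimal check with hypothetical endpoint values (the article's own base-year figure is not given, so the numbers below are illustrative only, not a restatement of its data):

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end/start)^(1/years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

# Hypothetical endpoints: a market growing from 2.4 to 6.3 (billion USD) over 2 years.
rate = cagr(2.4, 6.3, 2)
print(f"{rate:.1%}")  # about 62% per year
```

A flat market yields a CAGR of zero, and doubling over two years corresponds to roughly 41% per year, which is a quick sanity check on any quoted rate.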
Multimodal models take on the Beijing and Hangzhou subway maps! o3 scores notably well, but still trails humans
量子位· 2025-06-07 05:02
Contributed by the ReasonMap team. QbitAI | WeChat official account QbitAI

In recent years, large language models (LLMs) and multimodal large models (MLLMs) have made breakthrough progress in scene understanding and complex reasoning tasks. Yet a key question is still worth asking: can MLLMs really "read" an image? In particular, when facing structurally complex, detail-dense images, do they possess fine-grained visual understanding and spatial reasoning abilities, say, when challenged with a high-resolution subway map?

To answer this, a team from Westlake University, the National University of Singapore, Zhejiang University, and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. As it turns out, the Beijing and Hangzhou subway maps stumped a large share of models.

This is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly subway maps), designed specifically to evaluate large models' ability to understand fine-grained, structured spatial information in images. The results show that current mainstream open-source multimodal models hit a clear performance bottleneck on ReasonMap, frequently exhibiting visual confusion or skipping stations in cross-line route planning. Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models on multiple dimensions, yet still fall clearly short of human performance. Across subway maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...
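The failure modes above (visual confusion, skipped stations on cross-line routes) hint at what checking a route answer involves. A hedged sketch, not ReasonMap's actual scoring code: represent the transit network as a graph and verify that a model's proposed route starts and ends correctly and only moves between adjacent stations. The station names and network are invented.

```python
from collections import defaultdict

# Toy network: two lines sharing the transfer station "C".
edges = [("A", "B"), ("B", "C"), ("C", "D"),   # Line 1
         ("X", "C"), ("C", "Y"), ("Y", "Z")]   # Line 2
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def route_is_valid(route, start, goal):
    """A proposed route must start/end correctly and only hop between adjacent stations."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in adj[a] for a, b in zip(route, route[1:]))

print(route_is_valid(["A", "B", "C", "Y", "Z"], "A", "Z"))  # True: transfers at C
print(route_is_valid(["A", "B", "Y", "Z"], "A", "Z"))       # False: skips station C
```

A check like this catches exactly the "skipped station" errors the benchmark reports, since a hop between non-adjacent stations invalidates the whole route.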
Foresight 2025: "Panorama of China's Multimodal Large Model Industry in 2025" (with market status, competitive landscape, development trends, and more)
Sou Hu Cai Jing· 2025-06-06 14:09
Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFlytek (002230.SZ); Wondershare (300624.SZ); 360 Security Technology (601360.SH); Kunlun Tech (300418.SZ); CloudWalk (688327.SH); TRS (300229.SZ); and others.

Core data in this article: number of registered models; pricing models; market size; regional shares; etc.

Industry overview

1. Definition and characteristics

Multimodality refers to methods and techniques that integrate and process two or more different types of information or data. In machine learning and artificial intelligence, the data types involved typically include, but are not limited to, text, images, video, audio, and sensor data. The purpose of a multimodal system is to use information from multiple modalities to improve task performance, deliver a richer user experience, or obtain more comprehensive data analysis. Multimodal Large Language Models (MLLMs) are a class of models that combine Large Language Models ( ...
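The definition above, one model consuming several data types, is often realized by projecting each modality into a shared token space before a single language model processes the combined sequence. A minimal sketch with made-up dimensions (no real model is loaded; all shapes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # shared hidden size (illustrative)

def project(features, w):
    """Map modality-specific features into the shared token space."""
    return features @ w

# Made-up encoder outputs: 4 image-patch features (dim 6), 5 text-token features (dim 3).
image_feats = rng.normal(size=(4, 6))
text_feats = rng.normal(size=(5, 3))
w_image = rng.normal(size=(6, d_model))  # learned projections in a real MLLM
w_text = rng.normal(size=(3, d_model))

tokens = np.concatenate([project(image_feats, w_image),
                         project(text_feats, w_text)])
print(tokens.shape)  # (9, 8): one fused sequence for the language model
```

Once every modality lands in the same space, the language model can attend across image and text tokens uniformly, which is what lets one model serve several input types.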
A single GPU handles 10,000-frame video understanding! Zhiyuan Research Institute open-sources Video-XL-2, a lightweight ultra-long video understanding model
量子位· 2025-06-04 05:21
Core Viewpoint - The article discusses the release of Video-XL-2, a new generation of long video understanding model developed by Zhiyuan Research Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of open-source models in processing and understanding long video content [1][3]. Technical Overview - Video-XL-2 is designed with three core components: Visual Encoder, Dynamic Token Synthesis (DTS), and Large Language Model (LLM) [4][6]. - The model utilizes SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [6][11]. - The training strategy involves a four-stage progressive training design to build robust long video understanding capabilities [8][10]. Performance Improvements - Video-XL-2 shows superior performance in long video understanding tasks, achieving leading levels on benchmarks such as MLVU, Video-MME, and LVBench compared to existing open-source models [9][15]. - The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the length of videos it can handle [19][23]. - It can encode 2048 frames of video in just 12 seconds, demonstrating remarkable speed and efficiency [24][28]. Application Potential - Video-XL-2 has high application potential in various real-world scenarios, including film content analysis, plot understanding, and anomaly detection in surveillance videos [28][30]. - Specific examples of its application include answering questions about movie scenes and detecting unexpected events in surveillance footage [30][32].
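The efficiency claims above (10,000 frames on one GPU, 2048 frames encoded in 12 seconds) rest on compressing per-frame visual tokens before they reach the LLM, which is what a DTS-style module does. A back-of-the-envelope sketch with assumed numbers (the actual per-frame token count and compression ratio of Video-XL-2 are not given here, so the figures below are illustrative only):

```python
def llm_token_budget(num_frames: int, tokens_per_frame: int, compression_ratio: int) -> int:
    """Visual tokens reaching the LLM after a DTS-style compression step."""
    return num_frames * tokens_per_frame // compression_ratio

# Assumed values: 196 patch tokens per frame, 16x compression.
raw = 10_000 * 196
compressed = llm_token_budget(10_000, 196, 16)
print(raw, compressed)  # 1,960,000 raw tokens vs 122,500 after compression
```

Even with generous assumptions, the uncompressed sequence would exceed most LLM context windows by an order of magnitude, which is why aggressive token compression, not just a faster encoder, is what makes ten-thousand-frame inputs feasible on one GPU.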