Multimodal Large Models
Text-to-Image Enters the R1 Era: CUHK MMLab Releases T2I-R1, Letting AI Painting "Reason First, Then Draw"
QbitAI (量子位)· 2025-05-13 04:45
Core Viewpoint - The article discusses the introduction of T2I-R1, the first reinforcement learning-based reasoning-enhanced text-to-image model developed by the MMLab team at the Chinese University of Hong Kong, which significantly improves image generation through a dual-level Chain of Thought (CoT) reasoning framework [2][27]. Group 1: Model Development - The T2I-R1 model builds on previous work in image generation with CoT, focusing on integrating semantic understanding and image generation [6][8]. - T2I-R1 introduces a dual-level CoT reasoning framework, consisting of Semantic-level CoT and Token-level CoT, to enhance the quality of generated images [12][16]. - The model utilizes BiCoT-GRPO, a reinforcement learning method that optimally coordinates the two levels of CoT, allowing for efficient training and improved image generation [21][23]. Group 2: Performance and Evaluation - T2I-R1 demonstrates improved performance, achieving a 13% and 19% increase in benchmarks T2I-CompBench and WISE, respectively, compared to baseline models [33]. - The model effectively generates images that align with human expectations by reasoning through the underlying intent of image prompts, showcasing enhanced robustness in unusual scenarios [29][30]. - The evaluation method incorporates multiple visual expert models to provide a comprehensive quality assessment of generated images, ensuring reliable results [32]. Group 3: Future Implications - The framework of T2I-R1 is expected to extend to more complex tasks such as video generation and 3D content synthesis, contributing to the evolution of generative AI towards more intelligent and creative systems [36].
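The summary above mentions a GRPO-style reinforcement learning method and a reward built from multiple visual expert models. The following is a minimal sketch of the general group-relative advantage idea only, not the authors' BiCoT-GRPO implementation: the function names, the three "expert" scores per image, the simple averaging of expert scores, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def ensemble_reward(expert_scores):
    """Average the scores of several (hypothetical) visual expert models."""
    return mean(expert_scores)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its own group.

    Every sample generated for the same prompt is compared with the group
    mean and scaled by the group standard deviation, so no separate value
    network is needed. Population std is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four images sampled for one prompt, each scored by three hypothetical experts.
scores_per_image = [
    [0.9, 0.8, 0.7],
    [0.4, 0.5, 0.6],
    [0.2, 0.3, 0.1],
    [0.6, 0.6, 0.6],
]
rewards = [ensemble_reward(s) for s in scores_per_image]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's average receive a positive advantage and are reinforced, while those below it are discouraged. How T2I-R1 shares this signal between its semantic-level and token-level CoT is not specified in the summary and is not modeled here.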
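ICML 2025 | New SOTA in Long-Video Understanding! Ant Group & Renmin University Open-Source ViLAMP-7B, Which Can Process 3-Hour Videos on a Single GPU
Synced (机器之心)· 2025-05-12 09:06
The first author of this work is Cheng Chuanqi, a master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, who is currently interning at Ant Group's technology research institute (蚂蚁技术研究院) and whose main research area is multimodal large models. Guan Jian, an associate researcher at the same institute, is co-first author. As Vision-Language Models (VLMs) make breakthrough progress, the challenge of long-video understanding is becoming increasingly important. Taking standard-definition video at the standard 24 frames per second as an example, just a few minutes of footage produce more than a million visual tokens, far beyond the 4K-128K context limit of mainstream large language models. For film-length video, the shortcomings of conventional solutions become more pronounced: coarse frame-sampling strategies often miss key-frame information, while feature-fusion methods reduce data dimensionality but inevitably damage semantic integrity. Recently, a research team from Ant Group and Renmin University presented an innovative solution. They propose ViLAMP (Video-Language Model with Mixed Precision), a vision-language large model that processes ultra-long videos efficiently. The core of the method is its distinctive "mixed precision" strategy: key content in the video is analyzed at high precision while secondary content is heavily compressed, much as a human viewer focuses on key scenes and only quickly scans transitional spatio-temporal information. Paper title: Scaling Vi ...
Current State of Production and Daily-Life Applications in China's Multimodal Large Model Industry, 2025: Multimodal Large Models Help Raise the Quality of Production and Daily Life [Photo Gallery]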
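The token-count claim in the excerpt above can be checked with simple arithmetic. The sketch below assumes a hypothetical 196 visual tokens per frame (for example, a 14x14 patch grid); the excerpt does not state the real per-frame figure, so the function names and the default values are illustrative only.

```python
def visual_tokens(minutes, fps=24, tokens_per_frame=196):
    """Total visual tokens for a clip, with every frame encoded separately.

    tokens_per_frame=196 is a hypothetical value, not taken from the article.
    """
    frames = int(minutes * 60 * fps)
    return frames * tokens_per_frame

def seconds_to_fill(context_tokens, fps=24, tokens_per_frame=196):
    """Seconds of video that fit in a context window of the given size."""
    return context_tokens / (fps * tokens_per_frame)

five_minute_clip = visual_tokens(5)         # 7,200 frames * 196 tokens
fills_128k = seconds_to_fill(128 * 1024)    # seconds until a 128K window is full
```

Under these assumptions a five-minute clip already yields about 1.4 million tokens, and a 128K context fills in under half a minute of video, which is why uniform per-frame encoding does not scale to hour-long content.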
Qian Zhan Wang· 2025-05-12 08:11
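Source: Qianzhan Industry Research Institute (前瞻产业研究院). Application scenarios such as intelligent marketing, teaching assistance, 3D modeling, and intelligent driving are important areas of production and daily life, and are areas where multimodal large models can currently enter and deliver targeted value. According to research data from CCID Sichuan (赛迪四川), in 2024 intelligent marketing accounted for 9.5% of the model scenarios of China's top 20 AI multimodal large model companies, while teaching assistance, 3D modeling, and intelligent driving each accounted for about 4.8%. Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFLYTEK (002230.SZ); Wondershare Technology (万兴科技, 300624.SZ); 360 Security Technology (三六零, 601360.SH); Kunlun Tech (昆仑万维, 300418.SZ); CloudWalk Technology (688327.SH); TRS (拓尔思, 300229.SZ), among others. Core data in this article: share of application scenarios. Multimodal large model scenarios related to production and daily life. Multimodal large models help optimize intelligent marketing strategies. The intelligent marketing industry uses artificial intelligence, big data, machine learning, and multimodal technology to optimize ad delivery, customer relationship management, and content marketing in automated, personalized ways. Intelligent marketing not only helps brands reach customers more efficiently, but can also adjust marketing strategies dynamically, improve user experience, and drive brand growth. Intelligent marketing is a new form of marketing that applies AI technology to upgrade the entire digital marketing chain. ...
CloudWalk Technology's "Congrong Multimodal Large Model" Leads Globally; Partnership with Huawei Ascend Drives Solution Deployment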
news flash· 2025-05-12 05:48
Core Insights - CloudWalk Technology's self-developed "Congrong Multimodal Large Model" scored 65.5 in the OpenCompass evaluation, ranking among the top three globally and surpassing models such as Google's Gemini 1.5 Pro [1] - The model has set world records ten times in specific areas such as cross-modal tracking and 3D facial recognition [1] - Leveraging this technological advantage, the company has partnered with Huawei Ascend to launch an integrated intelligent machine solution, which has been implemented in several benchmark projects, including smart logistics scheduling at Tianjin Port and energy management for State Grid Shandong [1] - These implementations have improved enterprises' operational efficiency by more than 20% [1]
Von Neumann Institute Established, Adding an AI Paradigm to Shenzhen-Hong Kong Tech Cooperation
21 Shi Ji Jing Ji Bao Dao· 2025-05-09 09:45
Core Insights - Hong Kong has established the Von Neumann Institute to integrate embodied intelligence, generative AI, and advanced supercomputing technologies, aiming to promote interdisciplinary collaboration and commercialize research outcomes [1][2] - The institute focuses on five key AI areas: multimodal AI systems, enhancing AI reasoning capabilities, robotics intelligence, AI-driven 3D understanding, and healthcare service reform through large models [2][3] - The institute aims to build a talent pipeline in AI through educational initiatives, targeting over 100 PhD students and engaging with local schools to foster innovation [3] Group 1 - The Von Neumann Institute is the first "full-chain practical" AI research institute in the Greater Bay Area, bridging basic research and industrial application [2] - The institute's approach includes establishing specialized laboratories and joint industry-university collaborations to accelerate the transition from theoretical research to product development [2][3] - The leadership of Jia Jiaya, who has experience as a scientist, entrepreneur, and educator, positions the institute to become a model of "top-tier research and grounded industry" [3] Group 2 - Jia Jiaya emphasizes Hong Kong's role as an "innovation brain" due to its international capital flow, top-tier research resources, and global talent hub, while Shenzhen acts as the "industrial driver" [4] - The integration of large models and visual sensor hardware by Simo Technology showcases a demand-driven approach to innovation, with applications in major factories like Tesla and BYD [4][5] - The collaboration between Hong Kong and Shenzhen has created a rapid coordination mechanism, allowing for quick transitions from research to production [5]
(Economic Watch) Industry Insiders Say: Culture and Tourism Will Be the First Sector to Embrace AI
Zhong Guo Xin Wen Wang· 2025-05-08 15:09
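China News Service, Shanghai, May 8 (Reporter Zheng Yingying): The "AI + Culture and Tourism Ecosystem Growth Plan" of Shanghai's Xuhui District was launched on the 8th at 模速空间. Industry insiders attending the event said the culture and tourism sector is more tolerant when it comes to embracing AI technology. "Application scenarios in fields such as industry demand very high accuracy, but culture and tourism scenarios are relatively tolerant of this kind of new technology. For example, performing robots sometimes fall over, and people are actually willing to accept that," said Jin Chengsi, a partner at Shanghai Hunban Technology Co., Ltd. (上海魂伴科技有限责任公司, "Hunban Technology"). He believes culture and tourism scenarios are likely to be the first to see AI applications deployed. (Photo caption: Hunban Technology demonstrates humanoid robot applications at the 2025 Shanghai Longhua Temple Fair, held in April 2025. Photo by China News Service reporter Zheng Yingying.) At the 2025 Shanghai Longhua Temple Fair in April this year, Hunban Technology "showed off" its robots, drawing crowds of local residents and tourists. This made a deep impression on Jin Chengsi: "The robot's performance at the time was actually not as cool as in earlier videos, but residents and tourists still felt it was better and more real than what they had seen in online videos. Some elderly residents at the scene, after seeing a real-life humanoid robot, even hoped it could help with elder care in the future." This led him to think that what may matter more is giving more residents the chance to learn about and interact with robots. "We need to take robot products out of the laboratory and into the public square to understand what residents need, even if the robots embarrass themselves. Only then can we know what we are lacking." Yan Yijun, vice president of public affairs at Shanghai Xiyu Technology Co., Ltd. (上海稀宇科技有限公司), is also optimistic about the culture and tourism field ...
Guotai Haitong | Electronics: From "Able to Move" to "Agile", Robot Intelligence Enters a New Chapter
Guotai Haitong Securities Research (国泰海通证券研究)· 2025-05-08 13:18
Investment advice. Humanoid robots are developing rapidly, and embodied intelligence is the core driver of their commercialization. Improvements in robot intelligence and real-time control performance will drive growing demand for perception performance, computing power, and communication efficiency, and on-device sensing, drive-control, and communication chips stand to benefit fully. The arrival of embodied intelligence opens up room for humanoid robots to grow, with broad application prospects ahead, lifting the results of complete-robot manufacturers. Report overview: Embodied intelligence is the core of humanoid robot commercialization; multimodality and reinforcement learning are accelerating the evolution of intelligence, perception sensors are being iterated and renewed, EtherCAT enables high-speed communication, and on-device computing power continues to upgrade. This article is excerpted from "From 'Able to Move' to 'Agile', Robot Intelligence Enters a New Chapter", published on May 8, 2025. Shu Di, qualification certificate number: S0880521070002. The market space exceeds one trillion; achieving embodied intelligence is the commercial ...
Guotai Haitong: Embodied Intelligence Drives Humanoid Robot Commercialization; Algorithm Breakthroughs Among the Catalysts for Sector Gains
Zhitong Finance (智通财经网)· 2025-05-08 07:56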
Group 1 - The core viewpoint is that embodied intelligence is the key to the commercialization of humanoid robots, with a market space exceeding one trillion yuan, and the intelligent level of humanoid robots in China is expected to evolve significantly by 2045 [1] - Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to human society, with potential applications in manufacturing, social services, and hazardous operations [1] - The market scale for humanoid robots is currently below ten billion yuan, but as intelligent levels progress towards embodied intelligence, the market is expected to expand significantly [1] Group 2 - Multi-modal large models and reinforcement learning are enhancing operational control performance, with significant advancements in communication and computing power to support real-time control [2] - Major companies like NVIDIA and Tesla are integrating multi-modal perception to improve robot interaction and decision-making accuracy, while the development of embodied reasoning models is expected to enhance performance in complex environments [2] - The adoption of pure visual solutions and advanced sensors is anticipated to lower hardware costs and improve perception sensitivity, with EtherCAT emerging as a mainstream communication protocol due to its high real-time performance [2]
[Industry Outlook] Analysis of the Global and Chinese Multimodal Large Model Industry, 2025-2030
Sou Hu Cai Jing· 2025-05-07 03:45
Core Insights - The multi-modal large model industry focuses on deep learning models capable of processing, understanding, and generating various types of data, including text, images, audio, and video, enabling complex and intelligent tasks [1] - The industry has a wide application potential across various sectors such as natural language processing, image recognition, speech recognition, intelligent driving, and medical imaging diagnosis [1] Industry Overview - The multi-modal large model industry chain is complex, encompassing hardware facilities, software development, and various model types, including CLIP, BLIP, and LLaMA, among others [1] - The industry is divided into three layers: the foundational layer (hardware and basic software), the model layer (various types of multi-modal large models), and the application layer (industry-specific applications) [1] Cost Structure - The training costs for mainstream domestic large models range from tens of millions to hundreds of millions of dollars, with major companies like Baidu, Alibaba, and Tencent investing over $200 million [3][5] - Startups like Kimi and DeepSeek have managed to reduce training costs to between $30 million and $60 million through technological optimizations [3] - Cloud hosting costs are significantly influenced by model scale, with major companies leveraging their own cloud platforms to reduce costs [3] Development History - The global large model industry has evolved through several phases: early exploration (1956-2005), rapid growth (2006-2019), the rise of large models (2020-2022), and the current phase of widespread application starting in 2023 [6] Computational Demand - The demand for computational power in AI is increasing, with larger models requiring exponentially more computational resources; for instance, the GPT-3 model requires 3640 PF-days of computation and at least 10,000 GPUs [9] - As model parameters increase, the computational investment needed grows significantly, influenced by model architecture, optimization efficiency, and hardware capabilities [9]
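To put the "3640 PF-days" figure in more familiar units: one PF-day is 10^15 floating-point operations per second sustained for 24 hours. The sketch below converts PF-days to total FLOPs and estimates wall-clock time on a cluster. The sustained 30 TFLOP/s per GPU is a hypothetical value chosen only for illustration, as are the function names; neither comes from the article.

```python
SECONDS_PER_DAY = 86_400
PFLOPS = 1e15  # one petaFLOP per second

def pf_days_to_flops(pf_days):
    """Total floating-point operations represented by a PF-day budget."""
    return pf_days * PFLOPS * SECONDS_PER_DAY

def training_days(pf_days, num_gpus, sustained_flops_per_gpu):
    """Wall-clock days on a cluster, assuming perfect scaling across GPUs."""
    cluster_pflops = num_gpus * sustained_flops_per_gpu / PFLOPS
    return pf_days / cluster_pflops

total_flops = pf_days_to_flops(3640)            # roughly 3.14e23 FLOPs
days = training_days(3640, 10_000, 30e12)       # hypothetical 30 TFLOP/s per GPU
```

With these assumed numbers, 10,000 GPUs would need about 12 days. Real runs take longer because utilization and scaling are never perfect, which is where the architecture, optimization efficiency, and hardware factors named above come in.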
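[Investment Perspective] Insights 2025: Analysis of Investment, Financing, and Industry Funds in China's Multimodal Large Model Industry (with Financing Events, Investment Types, and M&A)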
Qian Zhan Wang· 2025-05-06 08:08
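Source: Qianzhan Industry Research Institute (前瞻产业研究院). Major companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFLYTEK (002230.SZ); 360 Security Technology (三六零, 601360.SH); CloudWalk Technology (688327.SH), among others. Core data in this article: financing scale of representative multimodal large model companies; investment scale of representative multimodal large model companies. Investment and financing surge from 2025. As of April 2025, there had been nearly 50 investment and financing events in multimodal large models. Funding peaked in 2021 at RMB 1.91 billion, even though there were only 5 investment events that year. A new investment cycle began in 2024, with 11 events totaling RMB 516 million. In the first four months of 2025 there were 17 events totaling RMB 1.6 billion, and investment in the multimodal large model theme is expected to grow explosively. Companies can secure multiple funding rounds. According to IT Juzi (IT桔子), investment and financing in the multimodal large model industry regained momentum starting in 2025. The main financing events are as follows:

| Date | Company | Region | 在不同分 | Amount | Financing amount | Investor |
| --- | --- | --- | --- | --- | --- | --- |
| 2025/4/9 | 爱芯元智 | Ningbo | 人工智 ...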