Multimodal Large Models

Guotai Haitong: Embodied intelligence deployment opens growth space for humanoid robots
智通财经网· 2025-05-14 06:43
Core Insights
- The rapid development of humanoid robots is driven by embodied intelligence, which is crucial for commercial viability [1]
- The market for humanoid robots is projected to exceed one trillion yuan by 2045, while the current market size is under ten billion yuan [1]

Group 1: Market Potential
- Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to applications in manufacturing, social services, and hazardous operations [1]
- According to the "Humanoid Robot Industry Development Research Report (2024)", the overall intelligence level of humanoid robots in China will remain at Level 1 from 2024 to 2028, with only a few products exploring Level 2 [1]
- The evolution toward embodied intelligence is expected to break the limitations of specific scenarios and tasks, leading to comprehensive coverage across industries [1]

Group 2: Technological Advancements
- Multi-modal large models are key to enhancing human-robot interaction efficiency and situational understanding, with companies like NVIDIA and Tesla actively integrating multi-modal perception [2]
- Reinforcement learning is anticipated to become the primary paradigm for motion algorithms, enabling efficient learning of walking and running gaits through reward functions [2]
- The integration of pure-vision solutions, six-dimensional force sensors, and electronic skin is expected to set a standard for sensing solutions, significantly improving perception sensitivity [2]

Group 3: Communication and Computing
- Real-time control requires efficient communication protocols and robust hardware computing power, with EtherCAT expected to become the mainstream communication protocol due to its high real-time performance and low latency [2]
- As robot intelligence evolves toward embodied intelligence, demand for edge computing power is projected to continue growing, driving performance upgrades in edge-side chips [2]
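The report's point about reward functions driving gait learning can be made concrete with a toy example. The sketch below is illustrative only; the terms and weights are assumptions for exposition, not taken from the report: a locomotion policy trained by reinforcement learning typically maximizes a scalar reward that pays for forward progress, charges for energy use, and heavily penalizes falls.

```python
def gait_reward(forward_velocity: float, energy_used: float, fell_over: bool) -> float:
    """Toy locomotion reward: pay for speed, charge for energy, punish falls.

    The weights (1.0, 0.05, 10.0) are illustrative assumptions only.
    """
    reward = 1.0 * forward_velocity - 0.05 * energy_used
    if fell_over:
        reward -= 10.0  # a fall dominates everything else in the episode
    return reward

# Walking forward at 1 m/s with modest energy use earns a positive reward;
# the same step ending in a fall is strongly negative.
```

In practice the reward is evaluated at every simulation step and the policy is updated to maximize its discounted sum, which is how behaviors like stable walking emerge without hand-coded gait trajectories.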
ByteDance's strongest multimodal model lands on Volcano Engine! Seed1.5-VL sweeps 38 SOTA results with 20B activated parameters
机器之心· 2025-05-14 04:36
Core Insights
- ByteDance has launched an advanced visual-language multimodal model, Seed 1.5-VL, showcasing significant improvements in multimodal understanding and reasoning capabilities [1][2][3]

Group 1: Model Features
- Seed 1.5-VL demonstrates enhanced visual localization and reasoning, with the ability to quickly and accurately identify various elements in images and videos [3][4]
- The model can process a single image and a prompt to identify and classify multiple objects, providing precise coordinates [4]
- It can analyze video footage to answer specific questions, showcasing its advanced video understanding capabilities [5]

Group 2: Performance Metrics
- Despite having only 20 billion activation parameters, Seed 1.5-VL performs comparably to Gemini 2.5 Pro, achieving state-of-the-art results in 38 of 60 public evaluation benchmarks [6]
- The inference cost is competitive, with input priced at 0.003 yuan per 1,000 tokens and output at 0.009 yuan per 1,000 tokens [7]

Group 3: Practical Applications
- Developers can access Seed 1.5-VL through an API, enabling the creation of AI visual assistants, inspection systems, and interactive agents [7]
- The model's capabilities extend to complex tasks such as identifying emotions in images and solving visual puzzles, demonstrating its versatility [17][20]

Group 4: Technical Architecture
- Seed 1.5-VL consists of three core components: a visual encoding module (SeedViT), a multi-layer perceptron (MLP) adapter, and a large language model (Seed1.5-LLM) [27]
- The model has undergone a distinctive training process, including multimodal pre-training and reinforcement learning strategies, enhancing its performance while reducing inference costs [29][30]

Group 5: Industry Impact
- The advancements presented at the Shanghai event indicate that ByteDance is building a comprehensive AI ecosystem, integrating technologies from video generation to deep visual understanding [32]
- The emergence of Seed 1.5-VL signifies a step toward a true multimodal intelligent era, reshaping interactions with visual data [32][33]
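At the published prices (0.003 yuan per 1,000 input tokens, 0.009 yuan per 1,000 output tokens), per-call API cost is simple arithmetic. A minimal sketch; the token counts in the usage comment are hypothetical, not from the article:

```python
# Published Seed 1.5-VL prices, converted to yuan per single token.
PRICE_IN_PER_TOKEN = 0.003 / 1000   # input:  0.003 yuan per 1,000 tokens
PRICE_OUT_PER_TOKEN = 0.009 / 1000  # output: 0.009 yuan per 1,000 tokens

def inference_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in yuan of one API call at the published per-token prices."""
    return input_tokens * PRICE_IN_PER_TOKEN + output_tokens * PRICE_OUT_PER_TOKEN

# Hypothetical call: a 4,000-token image-plus-prompt input and a 500-token
# answer costs 0.012 + 0.0045 = 0.0165 yuan.
cost = inference_cost(4000, 500)
```

Even image-heavy requests, which consume thousands of visual tokens, stay well under a fen per call at these rates, which is what makes the pricing competitive for high-volume visual assistants.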
Text-to-image enters the R1 era: CUHK MMLab releases T2I-R1, letting AI painting "reason first, then draw"
量子位· 2025-05-13 04:45
Core Viewpoint
- The article discusses the introduction of T2I-R1, the first reinforcement learning-based reasoning-enhanced text-to-image model, developed by the MMLab team at the Chinese University of Hong Kong, which significantly improves image generation through a dual-level Chain of Thought (CoT) reasoning framework [2][27]

Group 1: Model Development
- The T2I-R1 model builds on previous work in image generation with CoT, focusing on integrating semantic understanding and image generation [6][8]
- T2I-R1 introduces a dual-level CoT reasoning framework, consisting of Semantic-level CoT and Token-level CoT, to enhance the quality of generated images [12][16]
- The model utilizes BiCoT-GRPO, a reinforcement learning method that coordinates the two levels of CoT, allowing for efficient training and improved image generation [21][23]

Group 2: Performance and Evaluation
- T2I-R1 demonstrates improved performance, achieving 13% and 19% gains on the T2I-CompBench and WISE benchmarks, respectively, compared to baseline models [33]
- The model effectively generates images that align with human expectations by reasoning through the underlying intent of image prompts, showcasing enhanced robustness in unusual scenarios [29][30]
- The evaluation method incorporates multiple visual expert models to provide a comprehensive quality assessment of generated images, ensuring reliable results [32]

Group 3: Future Implications
- The framework of T2I-R1 is expected to extend to more complex tasks such as video generation and 3D content synthesis, contributing to the evolution of generative AI toward more intelligent and creative systems [36]
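BiCoT-GRPO builds on the GRPO family of reinforcement-learning methods, whose distinguishing step is scoring a group of sampled generations per prompt and normalizing each sample's reward against its own group's statistics instead of training a separate value model. A minimal sketch of that group-relative advantage step follows; the function name and interface are illustrative, not the authors' code, and the full policy update is out of scope:

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each reward within its sample group.

    Samples scoring above the group mean get positive advantage (their
    token probabilities are pushed up); below-mean samples get negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples tied: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

In T2I-R1's setting the rewards would come from the visual expert models that score each generated image, so better-scored images within a group reinforce both the semantic-level plan and the token-level generation that produced them.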
ICML 2025 | New SOTA in long-video understanding! Ant & RUC open-source ViLAMP-7B, handling 3-hour videos on a single GPU
机器之心· 2025-05-12 09:06
The first author of this work is Cheng Chuanqi, a master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, currently interning at Ant Technology Research Institute; his main research area is multimodal large models. Guan Jian, an associate researcher at Ant Technology Research Institute, is co-first author.

As vision-language models (VLMs) achieve breakthrough progress, the challenge of long-video understanding grows ever more important. Taking standard-definition video at the standard 24 fps as an example, just a few minutes of footage produces over a million visual tokens, far exceeding the 4K-128K context limits of mainstream large language models. For film-length video content, the shortcomings of traditional solutions become even more pronounced: coarse frame-sampling strategies often miss key-frame information, while feature-fusion methods reduce data dimensionality but inevitably damage semantic completeness.

Recently, a research team from Ant Group and Renmin University of China delivered an innovative solution. They propose the vision-language model ViLAMP (Video-Language Model with Mixed Precision), which processes ultra-long videos efficiently. The core of the method is its unique "mixed precision" strategy: key content in the video is analyzed at high precision, while secondary content is aggressively compressed, much as a human viewer focuses on key scenes and only skims transitional material.

Paper title: Scaling Vi ...
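The token-count claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 196 visual tokens per frame (a common 14x14 ViT patch grid; the actual per-frame count depends on the encoder and resolution):

```python
FPS = 24                 # standard frame rate cited in the article
TOKENS_PER_FRAME = 196   # assumption: a 14x14 ViT patch grid per frame

def visual_tokens(minutes: float) -> int:
    """Rough visual-token count for a clip of the given length in minutes."""
    frames = int(FPS * 60 * minutes)
    return frames * TOKENS_PER_FRAME

# A 4-minute clip: 5,760 frames x 196 tokens = 1,128,960 tokens,
# already far beyond a 128K-token context window.
four_minutes = visual_tokens(4)
```

A 3-hour film at the same rate lands in the tens of millions of tokens, which is why naive frame-by-frame encoding is infeasible and a mixed-precision scheme like ViLAMP's is needed.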
The state of production and daily-life applications in China's multimodal large model industry, 2025: multimodal large models drive production and life toward higher quality [Charts]
Qian Zhan Wang· 2025-05-12 08:11
Reprinted from: Qianzhan Industry Research Institute

Intelligent marketing, teaching assistance, 3D modeling, and intelligent driving are important application scenarios in production and daily life, and are also areas where multimodal large models can currently gain a foothold and deliver targeted value. According to data from CCID Sichuan, in 2024 intelligent marketing accounted for 9.5% of the model scenarios of China's top 20 AI multimodal large model companies, while teaching assistance, 3D modeling, and intelligent driving each accounted for roughly 4.8%.

Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFlytek (002230.SZ); Wondershare (300624.SZ); 360 Security Technology (601360.SH); Kunlun Tech (300418.SZ); CloudWalk (688327.SH); TRS (300229.SZ); and others.

Core data in this article: share of application scenarios

Production- and life-related scenarios for multimodal large models

Multimodal large models help intelligent marketing optimize strategy

The intelligent marketing industry uses artificial intelligence, big data, machine learning, and multimodal technology to optimize ad placement, customer relationship management, and content marketing through automated, personalized approaches. Intelligent marketing not only helps brands reach customers more efficiently but also dynamically adjusts marketing strategies, improves user experience, and drives brand growth.

Intelligent marketing is a new form of marketing that applies AI technology to intelligently upgrade the entire digital-marketing pipeline. ...
CloudWalk's "Congrong" multimodal large model leads globally; cooperation with Huawei Ascend drives solution deployment
news flash· 2025-05-12 05:48
CloudWalk's self-developed "Congrong" multimodal large model scored 65.5 on the OpenCompass evaluation, ranking in the global top three and surpassing models such as Google's Gemini 1.5 Pro, and has set world records 10 times in sub-fields such as cross-modal tracking and 3D face recognition. Building on this technical advantage, the intelligent all-in-one machine solution jointly launched by the company and Huawei Ascend has already been deployed in benchmark projects including smart logistics scheduling at Tianjin Port (600717) and energy management for State Grid Shandong, helping enterprises improve operational efficiency by more than 20%. (36Kr)
Von Neumann Institute established: Shenzhen-Hong Kong tech cooperation adds a new AI paradigm
21 Shi Ji Jing Ji Bao Dao· 2025-05-09 09:45
Core Insights
- Hong Kong has established the Von Neumann Institute to integrate embodied intelligence, generative AI, and advanced supercomputing technologies, aiming to promote interdisciplinary collaboration and commercialize research outcomes [1][2]
- The institute focuses on five key AI areas: multimodal AI systems, enhancing AI reasoning capabilities, robotics intelligence, AI-driven 3D understanding, and healthcare service reform through large models [2][3]
- The institute aims to build a talent pipeline in AI through educational initiatives, targeting over 100 PhD students and engaging with local schools to foster innovation [3]

Group 1
- The Von Neumann Institute is the first "full-chain practical" AI research institute in the Greater Bay Area, bridging basic research and industrial application [2]
- The institute's approach includes establishing specialized laboratories and joint industry-university collaborations to accelerate the transition from theoretical research to product development [2][3]
- The leadership of Jia Jiaya, who has experience as a scientist, entrepreneur, and educator, positions the institute to become a model of "top-tier research and grounded industry" [3]

Group 2
- Jia Jiaya emphasizes Hong Kong's role as an "innovation brain" due to its international capital flow, top-tier research resources, and global talent hub, while Shenzhen acts as the "industrial driver" [4]
- The integration of large models and visual sensor hardware by Simo Technology showcases a demand-driven approach to innovation, with applications in major factories like Tesla and BYD [4][5]
- The collaboration between Hong Kong and Shenzhen has created a rapid coordination mechanism, allowing for quick transitions from research to production [5]
(Economic Observation) Industry insiders: the culture and tourism sector will be the first to embrace AI
Zhong Guo Xin Wen Wang· 2025-05-08 15:09
China News Service, Shanghai, May 8 (Reporter Zheng Yingying) — Shanghai Xuhui District's "AI + Culture and Tourism Ecosystem Growth Plan" was launched on the 8th at MoSu Space. Industry participants at the event argued that the culture and tourism sector is more tolerant when it comes to embracing AI technology.

"Application scenarios in fields like industry demand very high accuracy, but culture and tourism scenarios are quite forgiving of this kind of new technology. For example, robot performers sometimes fall over, and audiences can actually accept that," said Jin Chengsi, partner at Shanghai Hunban Technology Co., Ltd. (Hunban Technology). He believes culture and tourism scenarios are likely to be among the first to see AI applications land.

Hunban Technology demonstrated a humanoid robot application at the 2025 Shanghai Longhua Temple Fair held in April 2025. (Photo by China News Service reporter Zheng Yingying)

Hunban Technology showed off a robot at this year's Longhua Temple Fair in April, drawing crowds of Shanghai residents and tourists. That struck Jin Chengsi deeply: "The robot's performance was actually less flashy than in past videos, yet visitors still felt it was better and more real than what they had seen in online videos. Some elderly visitors, after seeing a real humanoid robot, even hoped it could help with elder care in the future."

This made him think that perhaps what matters more is giving more residents the chance to learn about and interact with robots. "We need to take robot products out of the lab and into the public square, to understand what people need, even if the robot embarrasses itself. Only then can we know where we fall short."

Yan Yijun, vice president of public affairs at Shanghai Xiyu Technology Co., Ltd., is also optimistic about the culture and tourism sector ...
Guotai Haitong | Electronics: From "able to move" to "agile", robot intelligence enters a new chapter
国泰海通证券研究· 2025-05-08 13:18
Investment advice. Humanoid robots are developing rapidly, and embodied intelligence is the core factor driving commercialization. Improvements in robot intelligence and real-time control performance will drive growing demand for perception, computing power, and communication efficiency, and edge-side sensing, drive-control, and communication chips will benefit in full. Embodied intelligence deployment opens growth space for humanoid robots with broad application prospects, lifting the results of complete-machine manufacturers.

Report summary: Embodied intelligence is the core of humanoid robot commercialization; multimodality and reinforcement learning are accelerating the evolution of intelligence; perception sensing is undergoing iterative innovation; EtherCAT enables high-speed communication; and edge computing power continues to upgrade.

Excerpted from the report "From 'Able to Move' to 'Agile': Robot Intelligence Enters a New Chapter," published May 8, 2025. Analyst: Shu Di, qualification certificate no. S0880521070002.

The market space exceeds one trillion yuan, and achieving embodied intelligence is central to commercial ...
Guotai Haitong: Embodied intelligence drives humanoid robot commercialization; algorithm breakthroughs among the catalysts for sector gains
智通财经网· 2025-05-08 07:56
Group 1
- The core viewpoint is that embodied intelligence is the key to the commercialization of humanoid robots, with a market space exceeding one trillion yuan, and the intelligence level of humanoid robots in China is expected to evolve significantly by 2045 [1]
- Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to human society, with potential applications in manufacturing, social services, and hazardous operations [1]
- The market scale for humanoid robots is currently below ten billion yuan, but as intelligence levels progress toward embodied intelligence, the market is expected to expand significantly [1]

Group 2
- Multi-modal large models and reinforcement learning are enhancing operational control performance, with significant advancements in communication and computing power to support real-time control [2]
- Major companies like NVIDIA and Tesla are integrating multi-modal perception to improve robot interaction and decision-making accuracy, while the development of embodied reasoning models is expected to enhance performance in complex environments [2]
- The adoption of pure-vision solutions and advanced sensors is anticipated to lower hardware costs and improve perception sensitivity, with EtherCAT emerging as a mainstream communication protocol due to its high real-time performance [2]