Multimodal Large Models
Guotai Haitong: The Arrival of Embodied Intelligence Opens Up Growth Potential for Humanoid Robots
Zhitong Finance · 2025-05-14 06:43
Core Insights
- The rapid development of humanoid robots is driven by embodied intelligence, which is crucial for commercial viability [1]
- The market for humanoid robots is projected to exceed one trillion yuan by 2045, while the current market size is under ten billion yuan [1]

Group 1: Market Potential
- Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to applications in manufacturing, social services, and hazardous operations [1]
- According to the "Humanoid Robot Industry Development Research Report (2024)", the overall intelligence level of humanoid robots in China will remain at Level 1 from 2024 to 2028, with only a few products exploring Level 2 [1]
- The evolution toward embodied intelligence is expected to break the limitations of specific scenarios and tasks, leading to comprehensive coverage across industries [1]

Group 2: Technological Advancements
- Multimodal large models are key to enhancing human-robot interaction efficiency and situational understanding, with companies like NVIDIA and Tesla actively integrating multimodal perception [2]
- Reinforcement learning is anticipated to become a primary paradigm for motion algorithms, enabling efficient learning of gaits and running through reward functions [2]
- The integration of pure visual solutions, six-dimensional force sensors, and electronic skin is expected to set a standard for sensory solutions, significantly improving perception sensitivity [2]

Group 3: Communication and Computing
- Real-time control requires efficient communication protocols and robust hardware computing power, with EtherCAT expected to become the mainstream communication protocol due to its high real-time performance and low latency [2]
- As robot intelligence evolves toward embodied intelligence, demand for edge computing power is projected to keep growing, driving performance upgrades in edge-side chips [2]
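The report's point that reinforcement learning can teach gaits through reward functions can be illustrated with a minimal sketch. The reward terms below (speed tracking, energy penalty, upright-posture bonus) and all coefficients are illustrative assumptions, not the formulation of any system cited in the report.

```python
def gait_reward(forward_velocity, torques, torso_pitch,
                target_velocity=1.0, energy_coef=0.001, posture_coef=0.5):
    """Illustrative locomotion reward: track a target speed, penalize
    actuator effort, and reward an upright torso. Coefficients are
    arbitrary placeholders chosen for this sketch."""
    velocity_term = -abs(forward_velocity - target_velocity)
    energy_term = -energy_coef * sum(t * t for t in torques)
    posture_term = -posture_coef * abs(torso_pitch)
    return velocity_term + energy_term + posture_term

# A policy walking at the target speed with low effort scores higher
# than one that crawls, strains its motors, and leans forward.
good = gait_reward(1.0, [0.1, 0.1], 0.0)
bad = gait_reward(0.2, [2.0, 2.0], 0.4)
```

A policy-gradient learner would then adjust its parameters to maximize the expected sum of such rewards over an episode; the point of the sketch is only that the desired gait is specified indirectly, through scoring, rather than programmed step by step.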
ByteDance's Strongest Multimodal Model Lands on Volcano Engine! Seed1.5-VL Sweeps 38 SOTA Results with Just 20B Activated Parameters
Jiqizhixin (Machine Heart) · 2025-05-14 04:36
Core Insights
- ByteDance has launched an advanced vision-language multimodal model, Seed1.5-VL, showcasing significant improvements in multimodal understanding and reasoning capabilities [1][2][3]

Group 1: Model Features
- Seed1.5-VL demonstrates enhanced visual localization and reasoning, quickly and accurately identifying various elements in images and videos [3][4]
- The model can process a single image and a prompt to identify and classify multiple objects, providing precise coordinates [4]
- It can analyze video footage to answer specific questions, showcasing advanced video-understanding capabilities [5]

Group 2: Performance Metrics
- Despite having only 20 billion activated parameters, Seed1.5-VL performs comparably to Gemini 2.5 Pro, achieving state-of-the-art results on 38 of 60 public evaluation benchmarks [6]
- Inference cost is competitive, with input priced at 0.003 yuan per 1,000 tokens and output at 0.009 yuan per 1,000 tokens [7]

Group 3: Practical Applications
- Developers can access Seed1.5-VL through an API, enabling the creation of AI visual assistants, inspection systems, and interactive agents [7]
- The model's capabilities extend to complex tasks such as identifying emotions in images and solving visual puzzles, demonstrating its versatility [17][20]

Group 4: Technical Architecture
- Seed1.5-VL consists of three core components: a visual encoding module (SeedViT), a multi-layer perceptron (MLP) adapter, and a large language model (Seed1.5-LLM) [27]
- The model underwent a distinctive training process, including multimodal pre-training and reinforcement-learning strategies, improving performance while reducing inference cost [29][30]

Group 5: Industry Impact
- The advancements presented at the Shanghai event indicate that ByteDance is building a comprehensive AI ecosystem, integrating technologies from video generation to deep visual understanding [32]
- The emergence of Seed1.5-VL signifies a step toward a true multimodal intelligent era, reshaping how users interact with visual data [32][33]
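The three-component layout described above (vision encoder, MLP adapter, language model) is a common vision-language-model pattern. The toy dimensions, class name, and random projections below are illustrative assumptions for a sketch, not Seed1.5-VL's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyVLM:
    """Schematic VLM pipeline: encode image patches, project them into
    the LLM's embedding space via an MLP adapter, then prepend them to
    the text token embeddings. All dimensions are arbitrary."""

    def __init__(self, vision_dim=64, llm_dim=128):
        # Two-layer MLP adapter weights (random stand-ins for trained ones).
        self.w1 = rng.standard_normal((vision_dim, 256)) * 0.02
        self.w2 = rng.standard_normal((256, llm_dim)) * 0.02

    def encode_image(self, image_patches):
        # Stand-in for a ViT encoder: one feature vector per patch.
        return image_patches  # shape (num_patches, vision_dim)

    def adapt(self, vision_features):
        # MLP adapter: vision feature space -> LLM embedding space.
        hidden = np.maximum(vision_features @ self.w1, 0.0)  # ReLU
        return hidden @ self.w2

    def build_llm_input(self, image_patches, text_embeddings):
        # The LLM consumes visual tokens and text tokens as one sequence.
        visual_tokens = self.adapt(self.encode_image(image_patches))
        return np.concatenate([visual_tokens, text_embeddings], axis=0)

model = ToyVLM()
patches = rng.standard_normal((16, 64))   # 16 visual patch features
text = rng.standard_normal((8, 128))      # 8 text token embeddings
llm_input = model.build_llm_input(patches, text)  # shape (24, 128)
```

The design choice the sketch highlights is that only the small adapter must bridge the two modality spaces, which is one reason such architectures can reuse a pretrained vision encoder and a pretrained LLM largely unchanged.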
Text-to-Image Enters the R1 Era: CUHK MMLab Releases T2I-R1, Teaching AI Painting to "Reason First, Then Draw"
QbitAI · 2025-05-13 04:45
Core Viewpoint
- The article introduces T2I-R1, the first reinforcement-learning-based, reasoning-enhanced text-to-image model, developed by the MMLab team at the Chinese University of Hong Kong, which significantly improves image generation through a dual-level Chain of Thought (CoT) reasoning framework [2][27]

Group 1: Model Development
- T2I-R1 builds on previous work on image generation with CoT, focusing on integrating semantic understanding and image generation [6][8]
- T2I-R1 introduces a dual-level CoT reasoning framework, consisting of semantic-level CoT and token-level CoT, to enhance the quality of generated images [12][16]
- The model uses BiCoT-GRPO, a reinforcement-learning method that jointly optimizes the two levels of CoT, enabling efficient training and improved image generation [21][23]

Group 2: Performance and Evaluation
- T2I-R1 achieves improvements of 13% and 19% on the T2I-CompBench and WISE benchmarks, respectively, compared with baseline models [33]
- By reasoning about the underlying intent of image prompts, the model generates images that better match human expectations and shows greater robustness in unusual scenarios [29][30]
- The evaluation incorporates multiple visual expert models to provide a comprehensive quality assessment of generated images, ensuring reliable results [32]

Group 3: Future Implications
- The T2I-R1 framework is expected to extend to more complex tasks such as video generation and 3D content synthesis, contributing to the evolution of generative AI toward more intelligent and creative systems [36]
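The dual-level structure above can be sketched as a two-stage pipeline: a semantic-level step that reasons about the prompt's intent before anything is drawn, followed by a token-level step that emits image tokens conditioned on that plan. The stub functions, field names, and placeholder token arithmetic below are invented for illustration and do not reflect T2I-R1's real components.

```python
def semantic_cot(prompt):
    """Semantic-level CoT: produce a global textual plan of the image
    before any image tokens are generated. This stub just structures
    the prompt; a real system would run a language model here."""
    return {"subject": prompt, "layout": "centered", "style": "photorealistic"}

def token_cot(plan, num_tokens=16):
    """Token-level CoT: emit discrete image tokens step by step,
    each step conditioned on the plan (and, in a real model, on the
    tokens emitted so far). Tokens here are placeholder integers."""
    seed = sum(ord(c) for c in plan["subject"]) % 97
    tokens = []
    for step in range(num_tokens):
        tokens.append((seed + 31 * step) % 1024)  # stand-in for sampling
    return tokens

def generate(prompt):
    plan = semantic_cot(prompt)          # level 1: reason about intent
    return plan, token_cot(plan)         # level 2: emit image tokens

plan, tokens = generate("a red bird on a snowy branch")
```

In the described training setup, a reinforcement-learning objective (BiCoT-GRPO in the article) would score the final image and update both stages jointly, so planning quality and token-level rendering improve together rather than being trained in isolation.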
ICML 2025 | New SOTA in Long-Video Understanding! Ant Group and Renmin University Open-Source ViLAMP-7B, Processing 3-Hour Videos on a Single GPU
Jiqizhixin (Machine Heart) · 2025-05-12 09:06
The first author of this work is Cheng Chuanqi, a master's student at the Gaoling School of Artificial Intelligence, Renmin University of China, currently interning at Ant Technology Research Institute; his main research area is multimodal large models. Guan Jian, an associate researcher at Ant Technology Research Institute, is co-first author.

As vision-language models (VLMs) achieve breakthrough progress, the challenge of long-video understanding grows ever more important. Consider a standard-definition video at the standard 24 fps: just a few minutes of footage produces over a million visual tokens, far beyond the 4K-128K context limits of mainstream large language models. For film-length video content, the shortcomings of traditional solutions become even more apparent: coarse frame-sampling strategies often miss key-frame information, while feature-fusion methods reduce data dimensionality but inevitably damage semantic completeness.

Recently, a research team from Ant Group and Renmin University proposed an innovative solution: the vision-language model ViLAMP (Video-Language Model with Mixed Precision), which processes ultra-long videos efficiently. The core of the method is its distinctive "mixed precision" strategy: key content in the video is analyzed at high precision, while secondary content is aggressively compressed, much as a human viewer focuses on key scenes and only skims transitional material.

Paper title: Scaling Vi ...
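The mixed-precision idea above can be sketched as a token-budget allocator: frames that change sharply from their predecessor are treated as keyframes and kept at a high token budget, while the rest are aggressively compressed. The threshold, budgets, and change metric below are illustrative assumptions, not ViLAMP's actual mechanism.

```python
import numpy as np

def mixed_precision_budget(frames, threshold=10.0, key_tokens=196, other_tokens=4):
    """Toy mixed-precision allocator: assign a per-frame token budget.
    A frame whose mean absolute pixel change from the previous frame
    exceeds `threshold` counts as a keyframe (high budget); otherwise
    it is compressed to a few tokens. All numbers are placeholders."""
    budgets = [key_tokens]  # the first frame is always a keyframe
    for prev, cur in zip(frames, frames[1:]):
        change = float(np.abs(cur - prev).mean())
        budgets.append(key_tokens if change > threshold else other_tokens)
    return budgets

# Synthetic clip: five static frames, then a scene cut at frame 5.
static = [np.zeros((8, 8)) for _ in range(5)]
after_cut = [np.full((8, 8), 100.0) for _ in range(5)]
frames = static + after_cut

budgets = mixed_precision_budget(frames)
total = sum(budgets)                 # far below uniform 10 * 196 tokens
```

Only frames 0 and 5 (the scene cut) receive the full budget here, so the total token count collapses relative to uniform high-precision encoding, which is the lever that lets a fixed LLM context window cover hours of video.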
The State of Production and Daily-Life Applications in China's Multimodal Large Model Industry in 2025: Multimodal Models Drive Higher Quality in Work and Life [Charts]
Qian Zhan Wang· 2025-05-12 08:11
Source: Qianzhan Industry Research Institute

Smart marketing, teaching assistance, 3D modeling, and intelligent driving are important areas of production and daily life, and they are also fields where multimodal large models can currently gain a foothold and deliver precise value. According to research data from CCID Sichuan, in 2024 smart marketing accounted for 9.5% of the model scenarios among China's top 20 AI multimodal large-model companies, while teaching assistance, 3D modeling, and intelligent driving each accounted for about 4.8%.

Major listed companies in the industry: Alibaba (09988.HK, BABA.US); Baidu (09888.HK, BIDU.US); Tencent (00700.HK, TCEHY); iFlytek (002230.SZ); Wondershare (300624.SZ); 360 (601360.SH); Kunlun Tech (300418.SZ); CloudWalk (688327.SH); TRS (300229.SZ), among others.

Core data in this article: application-scenario shares.

Multimodal large models in production and daily-life scenarios

Multimodal large models help smart marketing optimize strategy

The smart-marketing industry uses artificial intelligence, big data, machine learning, and multimodal technology to optimize ad placement, customer relationship management, and content marketing through automated, personalized approaches. Smart marketing not only helps brands reach customers more efficiently, but also dynamically adjusts marketing strategy, improves user experience, and drives brand growth.

Smart marketing is a new form of marketing that applies AI technology to intelligently upgrade the entire digital-marketing pipeline. ...
CloudWalk's "Congrong Multimodal Large Model" Leads Globally; Partnership with Huawei Ascend Advances Solution Deployment
news flash· 2025-05-12 05:48
Core Insights
- CloudWalk Technology's self-developed "Congrong Multimodal Large Model" scored 65.5 in the OpenCompass evaluation, ranking among the top three globally and surpassing models such as Google's Gemini 1.5 Pro [1]
- The model has broken world records ten times in specific areas such as cross-modal tracking and 3D facial recognition [1]
- Leveraging this technological advantage, the company has partnered with Huawei Ascend to launch an integrated intelligent-machine solution, already deployed in several benchmark projects, including smart logistics scheduling at Tianjin Port and energy management for State Grid Shandong [1]
- These deployments have improved enterprises' operational efficiency by more than 20% [1]
Von Neumann Institute Established: Shenzhen-Hong Kong Tech Cooperation Gains a New AI Paradigm
21st Century Business Herald · 2025-05-09 09:45
Core Insights
- Hong Kong has established the Von Neumann Institute to integrate embodied intelligence, generative AI, and advanced supercomputing, aiming to promote interdisciplinary collaboration and commercialize research outcomes [1][2]
- The institute focuses on five key AI areas: multimodal AI systems, enhanced AI reasoning capabilities, robotics intelligence, AI-driven 3D understanding, and healthcare-service reform through large models [2][3]
- The institute aims to build an AI talent pipeline through educational initiatives, targeting over 100 PhD students and engaging with local schools to foster innovation [3]

Group 1
- The Von Neumann Institute is the first "full-chain practical" AI research institute in the Greater Bay Area, bridging basic research and industrial application [2]
- Its approach includes establishing specialized laboratories and joint industry-university collaborations to accelerate the transition from theoretical research to product development [2][3]
- The leadership of Jia Jiaya, who has experience as a scientist, entrepreneur, and educator, positions the institute to become a model of "top-tier research and grounded industry" [3]

Group 2
- Jia Jiaya emphasizes Hong Kong's role as an "innovation brain," given its international capital flows, top-tier research resources, and global talent hub, while Shenzhen acts as the "industrial driver" [4]
- Simo Technology's integration of large models and visual-sensor hardware showcases a demand-driven approach to innovation, with applications in major factories such as Tesla and BYD [4][5]
- The collaboration between Hong Kong and Shenzhen has created a rapid coordination mechanism, allowing quick transitions from research to production [5]
(Economic Observation) Industry Insiders: Culture and Tourism Will Be the First Sector to Embrace AI
China News Service · 2025-05-08 15:09
Shanghai, May 8 (China News Service, reporter Zheng Yingying). Shanghai's Xuhui District launched its "AI + Culture and Tourism Ecosystem Growth Plan" on May 8 at MoSu Space. Industry participants at the event argued that the culture and tourism sector is especially open to embracing AI technology.

"Application scenarios in industry and similar fields demand very high accuracy, but culture-and-tourism scenarios are quite forgiving of this kind of new technology. For example, robot performers sometimes fall over, and audiences can actually accept that," said Jin Chengsi, partner at Shanghai Hunban Technology Co., Ltd. (Hunban Technology). He believes culture and tourism scenarios are likely to be among the first where AI applications take hold.

Hunban Technology demonstrated a humanoid robot application at the 2025 Shanghai Longhua Temple Fair, held in April 2025. Photo by China News Service reporter Zheng Yingying.

At the Longhua Temple Fair this April, Hunban Technology "showed off" a robot, drawing crowds of Shanghai residents and tourists. This struck Jin Chengsi deeply: "The robot's performance that day was actually less flashy than in the usual videos, yet visitors still felt it was better and more real than what they had seen in online videos. Some elderly residents, after seeing a real humanoid robot, even hoped it could one day help with elder care."

This made him reflect that perhaps what matters more is giving more members of the public a chance to learn about and interact with robots. "We need to take robot products out of the laboratory and into the public square, to understand what people need, even if the robot embarrasses itself. Only then can we know where we fall short."

Yan Yijun, vice president of public affairs at Shanghai Xiyu Technology Co., Ltd., is also optimistic about the culture and tourism field ...
Guotai Haitong | Electronics: From "Mobile" to "Agile," Robot Intelligence Enters a New Chapter
Guotai Haitong Securities Research · 2025-05-08 13:18
Investment recommendation: Humanoid robots are developing rapidly, and embodied intelligence is the core driver of commercialization. Improvements in robot intelligence and real-time control performance will drive demand for perception, computing power, and communication efficiency, fully benefiting edge-side sensing, drive-control, and communication chips. The arrival of embodied intelligence opens up growth potential for humanoid robots with broad application prospects, lifting the results of complete-machine makers.

Report summary: Embodied intelligence is the core of humanoid robot commercialization; multimodal models and reinforcement learning accelerate the evolution of intelligence; perception sensing is iterating rapidly; EtherCAT enables high-speed communication; and edge-side computing power keeps upgrading.

This article is excerpted from "From 'Mobile' to 'Agile,' Robot Intelligence Enters a New Chapter," published May 8, 2025. Shu Di, qualification certificate no. S0880521070002.

The market space exceeds one trillion yuan, and achieving embodied intelligence is the core of commercial ...
Guotai Haitong: Embodied Intelligence Drives Humanoid Robot Commercialization; Algorithm Breakthroughs Among the Catalysts for Sector Gains
Zhitong Finance · 2025-05-08 07:56
Group 1
- The core view is that embodied intelligence is the key to commercializing humanoid robots, with a market space exceeding one trillion yuan; the intelligence level of humanoid robots in China is expected to evolve significantly by 2045 [1]
- Humanoid robots possess human-like perception, body structure, and movement, making them highly adaptable to human society, with potential applications in manufacturing, social services, and hazardous operations [1]
- The current humanoid robot market is below ten billion yuan, but as intelligence advances toward embodied intelligence, the market is expected to expand significantly [1]

Group 2
- Multimodal large models and reinforcement learning are enhancing operational control performance, with significant advances in communication and computing power to support real-time control [2]
- Major companies such as NVIDIA and Tesla are integrating multimodal perception to improve robot interaction and decision-making accuracy, while the development of embodied reasoning models is expected to enhance performance in complex environments [2]
- The adoption of pure visual solutions and advanced sensors is expected to lower hardware costs and improve perception sensitivity, with EtherCAT emerging as the mainstream communication protocol thanks to its high real-time performance [2]