Multimodal Fusion
AAAI 2026 Oral | University of Technology Sydney and PolyU break the "one-size-fits-all" mold: how can federated recommendation deliver personalized image-text fusion?
机器之心· 2025-11-25 04:09
As recommendation systems move toward multimodality, how can data privacy and personalized image-text understanding be reconciled? Professor Guodong Long's team at the University of Technology Sydney, together with the teams of Professor Qiang Yang and Professor Chengqi Zhang at The Hong Kong Polytechnic University, proposes a new framework, FedVLR. The work tackles the heterogeneity problem of multimodal fusion in federated settings and has been accepted as an Oral Presentation at AAAI 2026, a top-tier AI conference. In today's recommendation systems, using multimodal signals such as images and text to support decisions is standard practice. But once this need meets federated learning, the privacy-preserving computing paradigm that requires data to stay on-device, things get far more complicated. Existing federated recommenders typically face a dilemma: either abandon heavy multimodal processing for the sake of privacy and rely on ID features alone, or adopt a crude "one-size-fits-all" fusion strategy that assumes every user weighs images and text the same way. Reality is harsher: users' fusion preferences are inherently highly heterogeneous. When buying clothes, a user may rely on visual impact; when picking electronics, detailed spec text may matter most. In a federated environment where the data is invisible, such preference differences are extremely hard to capture. To break this bottleneck, Professor Guodong Long's team at the University of Technology Sydney, together with Dean Qiang Yang of PolyU's advanced AI research institute and Dean Chengqi Zhang of the PolyU Shenzhen Research Institute, introduce the FedVLR framework. Its core insight ...
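The excerpt stops before describing FedVLR's actual mechanism, but the problem it names, per-user heterogeneity in image-text fusion, can be illustrated with a toy per-user gate. Everything below (the `fuse` function, the gate logits, the two-dimensional embeddings) is a hypothetical sketch, not the paper's method:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse(img_emb, txt_emb, gate_logits):
    """Blend image and text embeddings with per-user gate weights."""
    w_img, w_txt = softmax(gate_logits)
    return [w_img * i + w_txt * t for i, t in zip(img_emb, txt_emb)]

# A hypothetical fashion shopper whose local gate leans toward visuals:
fused = fuse([1.0, 0.0], [0.0, 1.0], gate_logits=[2.0, 0.0])
```

In a federated setting, one could keep each user's `gate_logits` on-device while only shared parameters are aggregated on the server; whether FedVLR works this way is not stated in the excerpt.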
Google's "Banana" aces a handwritten exam, Karpathy is hooked, and ChatGPT is left speechless under scrutiny
36Kr · 2025-11-24 06:56
Google's Nano Banana Pro has arrived and become yet another viral phenomenon. This round, users have gone wild: handwritten exam papers answered perfectly, god-tier infographics in seconds, cinema-grade storyboards, cross-century costume changes... Last week, Google declared its return to the throne with two back-to-back launches: the double blast of Gemini 3 Pro and Nano Banana Pro, whose shockwaves the AI world is still absorbing. With this move, Google pulled off a precise and elegant strategic strike. PyTorch creator Soumith Chintala offered high praise: "Gemini 3 feels closer than ever to a GPT-4 moment." Even Salesforce CEO Marc Benioff has switched from ChatGPT straight to Gemini 3. Beyond that, Nano Banana Pro's image-generation prowess has left industry heavyweights stunned: eight Silicon Valley giants in one ultra-realistic frame, indistinguishable from a photo; precise reasoning from a single set of coordinates; movie featurettes generated in one click... [Embedded post from @EHuanglu: "@NanoBanana is crazy. Generate an image at the coordinates ..."]
In-depth analysis | From arena to market: the Zhongguancun Embodied AI Robot Application Competition decodes new paths for industrial transformation
机器人大讲堂· 2025-11-23 00:00
In November 2025, the conclusion of the second Zhongguancun Embodied AI Robot Application Competition was more than a technical showcase drawing 157 top teams from around the world; it was a milestone marking the move of China's embodied AI industry from "lab prototypes" to "industry-grade applications." Themed "Embodied Intelligence, Applied Future" and anchored in the message that "labor is most glorious," the event staged practical contests across home services, industrial manufacturing, safety response, and other scenarios, amounting to a full inspection of the field's development. With "embodied intelligence" written into the Government Work Report for the first time, and the "AI+" initiative naming it a core engine of new quality productive forces, the technical routes, scenario orientation, and ecosystem models outlined by the competition are becoming a key to decoding the transformation of China's embodied AI industry.
▍Event evolution: from technical showmanship to practical deployment, a mirror of the industry
If the first Zhongguancun bionic robot competition in 2024 was a "collective debut" for embodied AI technology, the comprehensively upgraded second edition in 2025 clearly shows the industry's deep shift from "technology demonstration" to "practical deployment." The track design, contest content, and judging mechanisms all serve as a vivid mirror of industrial development. Breaking from the first edition's focus on bionic-technology display, this edition centers on "real-scenario labor-skill contests" and sets up three core tracks, covering both industry needs and academic frontiers. The Embodied AI Model Capability Challenge targets the algorithmic core, with dual-direction contests for the "brain" and the "cerebellum" ...
Meituan's "all-round breakthrough": RoboTron-Mani + RoboData achieve general-purpose robot manipulation
具身智能之心· 2025-11-11 03:48
Core Insights
- The article discusses the development of RoboTron-Mani, a universal robotic operation strategy that overcomes the limitations of existing models by integrating 3D perception and multi-modal fusion, enabling cross-platform and cross-scenario operations [1][3][21].

Group 1: Challenges in Robotic Operations
- Current robotic operation solutions face a "dual bottleneck": either lacking 3D perception capabilities or suffering from data set issues that hinder cross-platform training [2][3].
- Traditional multi-modal models focus on 2D image understanding, which limits their ability to interact accurately with the physical world [2][3].
- Single data set training leads to weak generalization, requiring retraining for different robots or scenarios, which increases data collection costs [2][3].

Group 2: RoboTron-Mani and RoboData
- RoboTron-Mani is designed to address the challenges of 3D perception and data modality issues, achieving full-link optimization from data to model [3][21].
- The architecture of RoboTron-Mani includes a visual encoder, 3D perception adapter, feature fusion decoder, and multi-modal decoder, allowing it to process various input types and produce multi-modal outputs [5][7][9][10].
- RoboData integrates nine mainstream public datasets, containing 70,000 task sequences and 7 million samples, addressing key pain points of traditional datasets by completing missing modalities and aligning spatial and action representations [11][12][15][16].

Group 3: Experimental Results and Performance
- RoboTron-Mani has demonstrated superior performance across multiple datasets, achieving a success rate of 91.7% on the LIBERO dataset, surpassing the best expert model [18][21].
- The model shows an average improvement of 14.8%-19.6% in success rates compared to the general model RoboFlamingo across four simulated datasets [18][21].
- Ablation studies confirm the necessity of key components, with the absence of the 3D perception adapter significantly reducing success rates [19][22].

Group 4: Future Directions
- Future enhancements may include the integration of additional modalities such as touch and force feedback to improve adaptability in complex scenarios [23].
- There is potential for optimizing model efficiency, as the current 4 billion parameter model requires 50 hours of training [23].
- Expanding real-world data integration will help reduce the domain transfer gap from simulation to real-world applications [23].
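The four architectural stages named in the summary (visual encoder, 3D perception adapter, feature fusion decoder, multi-modal decoder) can be sketched as a toy pipeline. This is a minimal stand-in under our own assumptions: every operation below (flattening, concatenation, averaging) is an invented placeholder, not the model's actual layers:

```python
class RoboTronSketch:
    """Toy pipeline mirroring the four stages described above.
    All internals are invented placeholders, not the paper's layers."""

    def encode_visual(self, image):
        # Stand-in for the visual encoder: scale pixels to [0, 1] features.
        return [p / 255.0 for row in image for p in row]

    def adapt_3d(self, feats, depth):
        # Stand-in for the 3D perception adapter: append scaled depth cues.
        return feats + [d / 10.0 for d in depth]

    def fuse(self, feats, text_feats):
        # Stand-in for the feature fusion decoder: simple concatenation.
        return feats + text_feats

    def decode_action(self, fused):
        # Stand-in for the multi-modal decoder: one scalar "action".
        return sum(fused) / len(fused)

model = RoboTronSketch()
feats = model.adapt_3d(model.encode_visual([[255, 0], [0, 255]]), [5.0])
action = model.decode_action(model.fuse(feats, [0.5]))
```

The real system maps such fused features to robot control outputs; the scalar here only marks where that decoding step sits in the flow.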
Ding Ning of Xi'an Jiaotong University: large models are "intelligent infrastructure," and the fusion of capital and technology is reshaping the AI landscape
Core Insights
- The rapid development of large models is driven by capital investment and industry collaboration, where capital acts as a magnifier for technology and technology serves as a multiplier for capital [1][4].

Group 1: Industry Trends
- The current phase of AI is characterized by a shift towards "multimodal fusion," where models are evolving from single-modal (text only) to integrating images, speech, and code [2][3].
- The emergence of ChatGPT at the end of 2022 marked a turning point in AI development, initiating competition in the large model industry [2].
- The mainstream large models are primarily based on the Transformer architecture, with a transition in training methods from "pre-training + supervised fine-tuning" to continuous learning and parameter-efficient fine-tuning [3].

Group 2: Capital and Technology Dynamics
- The high initial costs of training large models include computing power, data, algorithms, and talent, making capital investment essential for developing high-quality foundational models [4].
- Without technological insights and research accumulation, capital alone cannot effectively drive industrial upgrades [4].
- As of 2023, China leads globally in the number of AI-related patents, accounting for 69% of the total, while the country also produces 41% of the world's AI research papers [4].

Group 3: Future Outlook
- Future trends in AI development include multimodal integration, parallel advancements in large-scale and lightweight models, embodied intelligence, and exploration of artificial general intelligence (AGI) [5].
- The concept of superintelligence, which refers to systems surpassing the smartest humans, remains a theoretical discussion and a potential future direction for AI development [5].
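The shift toward parameter-efficient fine-tuning mentioned above can be made concrete with one well-known instance, LoRA-style low-rank updates (our choice of example; the article does not name a specific method). For a weight matrix of size d_in × d_out, full fine-tuning updates every entry, while a rank-r update trains only two thin factor matrices:

```python
def full_finetune_params(d_in, d_out):
    # Full fine-tuning: every entry of the weight matrix is trainable.
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    # Low-rank update W + A @ B: only A (d_in x r) and B (r x d_out) train.
    return d_in * rank + rank * d_out

d = 4096                                  # hypothetical hidden size
full = full_finetune_params(d, d)         # 16,777,216 trainable weights
lora = lora_params(d, d, rank=8)          # 65,536 trainable weights
reduction = full / lora                   # 256x fewer trainable parameters
```

This is the arithmetic behind "using less" in such training regimes: the pre-trained weights stay frozen, and only a tiny fraction of parameters is updated per task.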
Looking ahead to 2025! China's text-to-speech industry: development history, industry chain, current status, competitive landscape, and trend analysis — as a key component of human-computer interaction, application demand keeps expanding [chart]
Chan Ye Xin Xi Wang· 2025-11-10 00:59
Core Insights
- The text-to-speech (TTS) technology is becoming a crucial part of social development, enhancing information accessibility and providing equal opportunities for special groups [1][10].
- The market size of China's TTS technology industry is projected to reach 18.76 billion yuan in 2024, reflecting a year-on-year increase of 22.77% [1][11].
- The industry is experiencing a shift from early mechanical simulations to advanced AI-driven systems capable of generating human-like speech [1][11].

Industry Overview
- TTS technology converts text into speech, allowing users to hear content without reading, thus breaking the limitations of information transmission [4][10].
- The technology's core value lies in enabling human-machine interaction through natural speech [4][10].

Technical Mechanism
- The TTS process involves three main components: text preprocessing, speech synthesis, and speech output [5][6].
- Text preprocessing includes tasks like word segmentation and semantic understanding, while speech synthesis uses complex algorithms to generate speech signals [5][6].

Industry Chain
- The TTS industry chain consists of upstream (hardware and algorithm support), midstream (core technology), and downstream (application fields like education, finance, and media) [8][10].
- In education, TTS technology is used for personalized learning experiences, aiding students with reading disabilities [8][10].

Market Dynamics
- The network audio-visual industry, a key segment of new media, is increasingly utilizing TTS technology for content creation, with the user base expected to reach 1.091 billion by 2024 [9][10].

Competitive Landscape
- The TTS industry is characterized by international technology leadership and domestic market focus, with major players like Google and Microsoft in high-end markets, while domestic companies excel in Chinese language applications [11][12].
- Key domestic companies include iFlytek, Baidu, and Yunzhisheng, with competition expected to intensify around edge computing and ethical technology [11][12].

Future Trends
- The industry is moving towards human-like expression and long-scene adaptability, with emotional expression becoming a core breakthrough point [14][15].
- Multi-modal integration is anticipated to enhance TTS capabilities, allowing for collaborative content production across various media [15][16].
- As the industry grows, regulatory frameworks will strengthen, focusing on data privacy and voice copyright protection [16].
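The three-stage pipeline described above (text preprocessing → speech synthesis → speech output) can be sketched as simple function composition. The sketch is illustrative only: the tokenization rules and the zero-valued "frames" are placeholders, not a real synthesizer:

```python
import re

def preprocess(text):
    # Text preprocessing stage: normalize digits to words and split
    # into tokens (stand-ins for segmentation / semantic analysis).
    digits = {"0": "zero", "1": "one", "2": "two", "3": "three"}
    tokens = re.findall(r"[A-Za-z]+|\d", text)
    return [digits.get(t, t.lower()) for t in tokens]

def synthesize(tokens):
    # Speech synthesis stage: emit one fixed-length acoustic "frame"
    # per token (a real system would run an acoustic model + vocoder).
    return [[0.0] * 4 for _ in tokens]

def speak(text):
    # Output stage: frames would be handed to an audio sink here.
    return synthesize(preprocess(text))

frames = speak("Chapter 2")
```

Production systems differ mainly in the middle stage, where neural acoustic models replace the placeholder frames; the overall preprocessing-synthesis-output decomposition is the same.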
A weather vane from the Wuzhen Summit: AI applications race into the new "spatial intelligence" track
Core Insights
- The 2025 World Internet Conference in Wuzhen focuses on building an open, cooperative, and secure digital future, emphasizing the construction of a community in cyberspace [2][3].
- The conference highlights the rapid development and application of large models in various industries, showcasing advancements in artificial intelligence and its integration into daily life and industrial processes [3][4].

Group 1: AI and Industry Applications
- The theme of this year's "Internet Light" Expo is "AI Coexistence, Intelligent Future," featuring over 1,000 cutting-edge AI technology products from more than 600 global companies [4].
- Large models have evolved from single-modal capabilities to multi-modal integration, enabling applications that can understand and create across various formats such as text, voice, and visuals [5].
- The integration of AI in healthcare is exemplified by AI-driven health consultations that provide immediate professional advice based on user data [4][5].

Group 2: Open Source and Collaboration
- The trend towards open-source development is gaining traction, allowing for collaborative innovation and lower barriers for small enterprises and research institutions to participate in the AI ecosystem [6][7].
- The "Direct to Wuzhen" global internet competition introduced an open-source project track, attracting over 600 developers to participate in various challenges [6].

Group 3: Digital Twin Technology
- The application of digital twin technology in industrial settings is advancing, with companies like Qunhe Technology showcasing platforms that replicate real-world industrial environments in a digital space [10][11].
- The digital twin platform enhances human-robot collaboration and allows for real-time monitoring and predictive analytics, significantly reducing trial-and-error costs in production [11][12].
- The emphasis on open ecosystems and continuous innovation is seen as crucial for embedding AI capabilities across various sectors, moving beyond mere technological barriers to fostering collaborative industrial environments [12].
Ding Ning: large models are "intelligent infrastructure," and the fusion of capital and technology is reshaping the AI landscape
"We are now in the fourth industrial revolution — an intelligence revolution marked by artificial intelligence and big data," said Ding Ning, professor at the College of Artificial Intelligence, Xi'an Jiaotong University. "Looking back at the previous three industrial revolutions, the key technologies all became necessities of people's work and life. It is foreseeable that after the fourth, AI too is very likely to become an indispensable core technology of the future world." Ding Ning earned his bachelor's and master's degrees at Xi'an Jiaotong University and his PhD at Keio University in Japan, worked at Alibaba for several years, and returned to academia in 2023 to research large models, human-computer interaction, natural language processing, and speech processing.
AI is entering a stage of "multimodal fusion"
On the afternoon of October 28, the "Scientists Meet Investors" closed-door seminar, Xi'an Jiaotong University session, was held at XJTU's Innovation Port campus, co-hosted by Shaanxi Kekong Investment Fund, the XJTU National Technology Transfer Center, and 21st Century Business Herald, supported by China Merchants Bank Xi'an Branch, with the 21st Century Venture Capital Research Institute providing think-tank support. Multimodal capability means AI no longer merely understands text, but can perceive and generate information from different worlds. Professor Ding believes that fine-tuned large models, built on high-quality pre-trained models with parameter-efficient fine-tuning, can be widely embedded across research, manufacturing, education, healthcare, and finance. Mainstream large models are still based on the Transformer architecture, but training methods are evolving from "pre-training + supervised fine-tuning" toward continual learning and parameter-efficient fine-tuning, that is, using less ...
Ding Ning: large models are "intelligent infrastructure," and the fusion of capital and technology is reshaping the AI landscape
21st Century Business Herald, Zhao Na reporting from Xi'an.
Specifically, the upfront costs of training large models are extremely high, spanning compute, data, algorithms, and talent. Without capital involvement, it is hard to produce a high-quality foundation model; but without technical insight and accumulated R&D, capital alone cannot truly drive industrial upgrading. In international comparison, the United States still leads in top companies, compute centers, and ecosystems, while China has leapt to the global front in papers and granted patents. According to data Professor Ding disclosed on site: by 2023, China accounted for 41% of the world's AI papers, and its share of global AI patents had grown rapidly, reaching 69% as of 2023. On the other hand, compute remains a key bottleneck constraining China's AI development, and model "hallucination ...
Large model special report: 2025 research report on the development of China's large model industry
Sou Hu Cai Jing· 2025-11-03 16:20
Core Insights
- The report highlights the rapid growth and strategic importance of the large model industry in China, projecting a market size of approximately 294.16 billion yuan in 2024, with expectations to exceed 700 billion yuan by 2026 [1][25][28].
- The CBDG four-dimensional model (Consumer, Business, Device, Government) is identified as a new paradigm for understanding the ecosystem and competitive dynamics of the large model industry in China [5][40].
- Key players such as iFlytek, ByteDance, and Alibaba are leveraging their unique strengths to build competitive advantages in the large model space, focusing on different market segments and user engagement strategies [7][10][30].

Industry Overview
- The large model industry is positioned as a strategic core of AI development, driving innovation and transformation across various sectors [14][21].
- The industry is characterized by a shift from single-point algorithm innovation to a comprehensive intelligent ecosystem, with a focus on multi-modal capabilities and intelligent agents [16][25].
- The competitive landscape is evolving from technology and product-centric competition to a more holistic, ecosystem-based competition, emphasizing capabilities in ecological construction, technological research, industry empowerment, commercial monetization, and innovation expansion [22][40].

Market Dynamics
- The multi-modal large model market in China is projected to reach 156.3 billion yuan in 2024, with significant applications in digital humans, gaming, and advertising [26][30].
- The report indicates a growing trend towards the integration of multi-modal capabilities, moving from traditional text processing to interactions involving images, voice, and video [25][30].
- The commercialization of large models is entering a systematic phase, with companies exploring diverse monetization strategies such as API calls, model licensing, and industry-specific solutions [28][30].

Competitive Landscape
- iFlytek is focusing on deepening its engagement in the government and business sectors, establishing a leading market share in large model solutions for state-owned enterprises [7][10].
- ByteDance is leveraging its consumer traffic and data to create a closed-loop ecosystem, enhancing user engagement and retention [7][10].
- Alibaba is transforming its Quark platform into an AI toolset to improve user stickiness and differentiate itself in the market [7][10].

Future Trends
- The future of large models is expected to drive AI from multi-modal cognition towards embodied intelligence, becoming a key link between the virtual and physical worlds [17][25].
- The industry is anticipated to witness a shift towards ecological collaboration, with value increasingly concentrated in application service layers [22][25].
- Governance will focus on safety, trustworthiness, and a uniquely Chinese path to international competition and cooperation [22][25].
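The two figures quoted by the report (roughly 294.16 billion yuan in 2024, more than 700 billion yuan by 2026) imply a compound annual growth rate that is easy to check:

```python
def implied_cagr(start, end, years):
    # Compound annual growth rate implied by two market-size figures.
    return (end / start) ** (1 / years) - 1

# Figures from the report: ~294.16B yuan (2024) -> >700B yuan (2026).
growth = implied_cagr(294.16, 700.0, 2)   # roughly 0.54, i.e. ~54% per year
```

A sustained growth rate above fifty percent per year is the quantitative claim behind the report's "rapid growth" framing.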