Multimodal Fusion
Is It True That Both China and the US Are Ultimately Heading into the AI Era?
Sou Hu Cai Jing· 2025-10-08 20:55
The development trajectories of China and the US clearly point toward an AI era, an inevitable outcome of technological iteration and industrial upgrading, but their paths and priorities differ markedly.

Technology landscape
- Foundational innovation: the US keeps its edge in basic algorithms, large-model architectures (e.g., the original BERT framework), and core patents; its research ecosystem emphasizes breakthroughs at the foundations.
- Application deployment: China, leveraging its enormous user base, mobile-internet legacy (mobile payments, e-commerce), and supply-chain coordination, is moving faster on scenario-driven applications (AI agents, multimodal interaction), and in some areas the user experience already surpasses the US. For example, Tencent's WeChat AI assistant "Yuanbao" integrates seamlessly into the social ecosystem, and ByteDance's Doubao model ranks in the global first tier for reasoning capability; agent technology is breaking past scenario boundaries and accelerating industry automation.

Industrial ecosystem and policy drivers
- US strategy: entrench technological dominance, containing rivals through export controls, standards-setting, and alliances. The new 2025 policy favors deregulation and open source, aiming to cement "golden age" leadership, though with a marked political slant (praising Trump-era policy, disparaging Biden-era regulation).
- China's path: play to its manufacturing base and data scale, focusing on fusing AI with the real economy ("AI + physical industry"). Zhang Yaqin argues China will become the world's largest AI application market within five years, driven by the continuity of its mature mobile ecosystem and supply-chain synergies.

Core differences and future competitive focus

| Dimension | US | China |
| --- | --- | --- |
| Innovation focus | Foundational theory, general-purpose large models | Scenario applications, engineering |
| ... | | |
Non-Implantable Brain-Computer Interface + Apple Vision Pro
思宇MedTech· 2025-10-04 14:33
On October 1, 2025, Cognixion, headquartered in Santa Barbara, California, announced the formal launch of a clinical study exploring the combination of its EEG-based, non-implantable brain-computer interface (BCI) with Apple Vision Pro. The study will evaluate how Cognixion's Nucleus biosensing hub and advanced EEG electrode system, combined with Vision Pro accessibility features such as eye tracking and gaze control, can give patients an entirely new, natural mode of interaction: communication and device control through EEG signals, eye movements, and head pose, with no surgery required.

Product and technology
Cognixion's in-house BCI platform, Axon-R, is a wearable, non-invasive neural interface with advanced EEG measurement and feedback capabilities, able to precisely capture and decode brain activity through visual stimulation and neurofeedback. In this study, Cognixion will pair the platform with Vision Pro's spatial computing and accessibility features.

Unlike the mostly implantable BCIs on the market today (e.g., Synchron, Neuralink), Cognixion's focus is "no surgery, wearable, everyday-usable," which is easier to roll out in clinical and home settings.

Clinical study design
Recruitment has begun and will run through April 2026; the study's primary ...
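To make the interaction model concrete, here is a minimal sketch of the "EEG intent + gaze target" fusion idea described above. Everything in it (the dataclasses, thresholds, and command format) is a hypothetical illustration, not Cognixion's or Apple's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: fusing a non-invasive EEG intent signal with
# headset gaze tracking to trigger a hands-free "select" action.
# None of these names correspond to Cognixion or Apple APIs.

@dataclass
class GazeSample:
    target_id: str      # UI element currently fixated
    dwell_ms: float     # how long the gaze has rested on it

@dataclass
class EEGDecision:
    intent: str         # e.g. "confirm" or "rest"
    confidence: float   # classifier confidence in [0, 1]

def fuse(gaze: GazeSample, eeg: EEGDecision,
         min_dwell_ms: float = 300.0,
         min_confidence: float = 0.8) -> str | None:
    """Emit a selection only when gaze dwell and EEG intent agree.

    Requiring both signals mitigates the 'Midas touch' problem of
    gaze-only interfaces, where every fixation becomes a click.
    """
    if (gaze.dwell_ms >= min_dwell_ms
            and eeg.intent == "confirm"
            and eeg.confidence >= min_confidence):
        return f"select:{gaze.target_id}"
    return None

# Example: a 450 ms fixation plus a confident "confirm" brain signal.
print(fuse(GazeSample("keyboard_key_A", 450.0), EEGDecision("confirm", 0.92)))
```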
The Current State of the AI Cloud Computing Industry
2025-09-26 02:29
Summary of Key Points from Conference Call Records

Industry Overview
- The AI cloud computing industry in China is currently dominated by Alibaba Cloud, which holds roughly 33-35% market share, making it the leading player domestically and the fourth largest globally [2][3]
- Other major players include Huawei Cloud (13% market share), Volcano Engine (close to 14%), Tencent, and Baidu [2]

Core Insights and Arguments
- **Technological advancements**: Alibaba Cloud has built a MaaS 2.0 service matrix spanning data annotation, model retraining, and hosting services, setting it apart from competitors [1][3]
- **Token demand growth**: Token demand is expected to surge from 30% to 90% penetration over the next few years, driven by major internet companies rebuilding their products around AI [1][4]
- **Pricing trends**: In Q3 2025, mainstream model token prices fell 30%-50% from Q1, while Alibaba's new Qwen3-Max model commands a premium, indicating pricing power [1][6]
- **User engagement**: Average session duration for the AI chatbot Doubao rose from 13 minutes to 30 minutes, reflecting deeper user engagement [1][6]

Future Investments and Strategies
- **CAPEX plans**: Alibaba plans RMB 380 billion of CAPEX over the next three years, focused on global data center construction, AI server procurement, and network equipment upgrades, particularly in Asia and Europe [1][10]
- **Infrastructure development**: The company aims to build data centers in regions such as Thailand, Mexico, Brazil, and France, targeting areas with a high concentration of Chinese enterprises [10]

Emerging Technologies and Products
- **New model launches**: Alibaba Cloud introduced seven large models, including the flagship Qwen3-Max, which exceeds a trillion parameters and is positioned against GPT-5 [1][7]
- **Multi-modal capabilities**: Qwen3-Omni is billed as China's first fully multi-modal model, capable of handling text, audio, and visual tasks [7]

Market Dynamics
- **Shift in revenue structure**: Cloud vendors' revenue is expected to shift from traditional IaaS services toward PaaS, SaaS, and AI-driven products, lifting profit margins [3]
- **Token consumption**: Daily token consumption in China is roughly 90 trillion tokens, with Alibaba accounting for nearly 18 trillion, a significant market presence (the implied arithmetic is sketched after this summary) [20]

Competitive Landscape
- **Comparison with competitors**: Alibaba's architecture resembles Google's, pairing self-developed chips with intelligent applications, while competitors like Volcano Engine and Baidu lag in technological depth [2][3]
- **Collaboration with NVIDIA**: The partnership with NVIDIA centers on "Physical AI," adding advanced simulation and machine-learning capabilities to Alibaba's cloud offerings [13][14]

Additional Insights
- **Vertical AI applications**: Vertical AI applications are emerging rapidly across industries, with significant growth in AI programming and data-analysis services [8]
- **Consumer market applications**: AI is reaching consumer markets through AI search, virtual social interaction, and digital content generation [9]

Conclusion
- The AI cloud computing industry is poised for rapid growth, driven by technological advances, rising token demand, and strategic investment by leaders like Alibaba Cloud. The competitive landscape is evolving, with a clear shift toward multi-modal AI applications and stronger user-engagement metrics.
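As referenced in the token consumption bullet, the quoted figures imply some simple arithmetic; the sketch below only reproduces it (the inputs are the values from the call summary; nothing here is additional data):

```python
# Back-of-the-envelope math implied by the figures above.
daily_tokens_cn = 90e12        # ~90 trillion tokens/day nationwide
daily_tokens_alibaba = 18e12   # Alibaba's reported share

share = daily_tokens_alibaba / daily_tokens_cn
print(f"Alibaba share of daily token traffic: {share:.0%}")   # 20%

# Effect of the reported 30-50% Q1->Q3 price cut on revenue per token,
# with the Q1 price normalized to 1.0:
price_q1 = 1.0
for cut in (0.30, 0.50):
    print(f"{cut:.0%} cut -> price index {price_q1 * (1 - cut):.2f}")
```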
How Do You Inject Human-Like Reasoning into One-Stage End-to-End Driving? HKUST's OmniScene Proposes a New Paradigm...
自动驾驶之心· 2025-09-25 23:33
How do you inject human-like reasoning into a one-stage end-to-end system?

Human vision converts 2D observations into egocentric 3D scene understanding, a capability that underpins comprehension of complex scenes and adaptive behavior. Today's autonomous driving systems still lack it: mainstream methods rely largely on depth-based 3D reconstruction rather than genuine scene understanding.

To address this limitation, a team from HKUST, Li Auto, and Tsinghua proposes OmniScene, a new human-like framework. First, the paper introduces the OmniScene Vision-Language Model (OmniVLM), a VLM framework that combines surround-view perception with temporal fusion for comprehensive 4D scene understanding. Second, through a teacher-student OmniVLM architecture and knowledge distillation, textual representations are embedded into 3D instance features for semantic supervision, both enriching feature learning and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behavior, yielding a perception-understanding-action architecture closer to human cognition.

The paper also proposes a hierarchical fusion strategy (HFS) to address the imbalance of modality contributions during multimodal fusion. HFS adaptively calibrates the relative importance of geometric and semantic features at multiple abstraction levels, exploiting the complementary information of the visual and textual modalities (a minimal gated-fusion sketch follows this excerpt). This learnable, dynamic fusion mechanism allows heterogeneous information to be mined more finely and effectively.

The paper evaluates OmniScene on the nuScenes dataset ...
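The article describes HFS only at a high level. As a rough illustration of "learnable, per-level calibration of geometric vs. semantic features," here is a minimal gated-fusion sketch in PyTorch; it is an assumption-heavy paraphrase of the idea, not the paper's actual HFS module:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal sketch of learnable geometric/semantic fusion at one level.

    A gate predicted from both inputs weights each modality per channel,
    loosely mirroring the idea of adaptively calibrating the relative
    importance of geometry and semantics. Not the paper's actual HFS.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, geom: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([geom, sem], dim=-1))  # per-channel weight
        return g * geom + (1.0 - g) * sem

# One fusion module per abstraction level, e.g. three feature levels,
# each fusing 100 instance tokens of width 256 for a batch of 2.
levels = nn.ModuleList([GatedFusion(256) for _ in range(3)])
feats = [lvl(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
         for lvl in levels]
print([f.shape for f in feats])
```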
A Survey of 313 VLA Papers, Plus a 1,661-Character Condensed Version
理想TOP2· 2025-09-25 13:33
Source: 自动驾驶之心 (author: Dapeng Zhang et al.), an autonomous-driving developer community covering computer vision, perception fusion, BEV, deployment, localization, planning and control, and domain solutions.

Condensed version: the emergence of VLA (Vision-Language-Action) models marks a paradigm shift in robotics, from traditional policy-based control to general-purpose robotics. A VLA recasts the vision-language model (VLM) from a passive sequence generator into an agent capable of active manipulation and decision-making in complex, dynamic environments. The survey offers a clear taxonomy and systematic review: VLA methods fall into four main categories, autoregression-based, diffusion-based, reinforcement-learning-based, and hybrid/specialized.

Autoregression-based models (a minimal decoding sketch follows this excerpt)
Core idea: treat the action sequence as a time-dependent process and generate actions step by step.
Innovations and developments:
- Generalist agents: unified multimodal Transformers (e.g., Gato, RT-1/RT-2, PaLM-E) that generalize across tasks.
- Reasoning and planning: combining large language models (LLMs) with chain-of-thought and hierarchical planning to handle long-horizon, complex tasks.
- Trajectory generation: directly mapping language instructions ...
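As referenced above, here is a minimal sketch of what "autoregression-based" action generation means in practice: a toy decoder predicts the next discrete action token from the tokens generated so far. The model is a stand-in for illustration, not the interface of Gato, RT-1/RT-2, or PaLM-E:

```python
import torch
import torch.nn as nn

# Minimal sketch of autoregressive action decoding: each discrete action
# token is predicted from the actions generated so far; a real VLA would
# also condition on vision and language context.

class TinyActionDecoder(nn.Module):
    def __init__(self, vocab: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))
        return self.head(h[:, -1])        # logits for the next action token

model = TinyActionDecoder()
tokens = torch.tensor([[1]])              # start-of-sequence action token
with torch.no_grad():
    for _ in range(8):                    # greedily decode 8 action tokens
        next_tok = model(tokens).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens)                             # the decoded discrete action sequence
```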
$200M ARR: How Did ElevenLabs, the Best-Monetizing Company in AI Voice, Achieve Rapid Growth?
Founder Park· 2025-09-16 13:22
Valued at $6.6 billion, ElevenLabs took 20 months to reach its first $100 million of ARR and only 10 months to add the second. The AI audio unicorn is arguably the fastest-growing AI startup in Europe.

As the voice modality becomes a key interface between people and technology, competition in AI voice is fierce: Murf.ai, Play.ht, WellSaid Labs... and with tech giants OpenAI, Google, and Microsoft closing in, it was far from easy for ElevenLabs to break out. During early fundraising, nearly every investor it approached said no; to validate market demand, it sent thousands of outreach emails to YouTubers, one by one, and got only a handful of positive replies.

How did ElevenLabs grow from a small company into an AI voice unicorn? In a podcast conversation, CEO Mati Staniszewski revisited the journey and the lessons learned: once R&D reaches a certain stage, technology inevitably commoditizes; a research edge alone is not enough, and product strength has to carry the company. 11 ...
Wang Xingxing's Latest Remarks
财联社· 2025-09-11 08:54
Source: 科创板日报 (author: 黄心怡), a new mainstream outlet followed by the tech-innovation community, supervised and sponsored by Shanghai United Media Group and produced by 界面财联社.

Wang Xingxing also noted that multimodal fusion is still unsatisfactory today: while pure language models and multimodal models perform impressively, using language, images, or generated video to control robots still faces major challenges.

"How to align a robot's motion with video and language models is genuinely difficult. My view is that the hardware is good enough for now; the biggest problem is that the capability of the AI models themselves falls short. We can't truly put the hardware to work; for example, AI still struggles to control dexterous hands well."

Even so, Wang remains optimistic about the future. When he started the company in 2016, he said, he never imagined technology would reach this point. "We can be more aggressive in how we view AI models, treating them as all-purpose tools: learn what's new, and forget the past entirely if need be; relying too heavily on past experience distorts future decisions."

Wang also pointed to talent and management as challenges for tech companies: "Top talent is extremely scarce, and management is another major difficulty; adding people can actually lower efficiency."

"AI has advanced very quickly in recent years. I'm fortunate to get another chance to seize this AI era and push AI to actually get real work done."

As for his earlier remark that "the bigger challenge for humanoid robots is not a lack of data, but the robot foundation mo ...
New Survey: Multimodal Fusion and VLM Methods for Embodied Robotics
具身智能之心· 2025-09-01 04:02
Core Insights
- The article discusses the transformative impact of multimodal fusion and vision-language models (VLMs) on robot vision, enabling robots to evolve from simple mechanical executors into intelligent partners capable of understanding and interacting with complex environments [3][4][5]

Multimodal Fusion in Robot Vision
- Multimodal fusion integrates data types such as RGB images, depth information, LiDAR point clouds, language, and tactile data, significantly enhancing robots' perception and understanding of their surroundings [3][4][9]
- The main fusion strategies have evolved from early explicit concatenation to implicit collaboration within unified architectures, improving feature extraction and task prediction [10][11]

Applications of Multimodal Fusion
- Semantic scene understanding is crucial for recognizing objects and their relationships; multimodal fusion greatly improves accuracy and robustness in complex environments [9][10]
- 3D object detection is vital for autonomous systems, combining camera, LiDAR, and radar data to enhance environmental understanding [16][19]
- Embodied navigation lets robots explore and act in real environments, spanning goal-oriented, instruction-following, and dialogue-based methods (a toy step-loop sketch follows this summary) [24][26][27][28]

Vision-Language Models (VLMs)
- VLMs have advanced significantly, enabling robots to understand spatial layouts, object properties, and semantic information while executing tasks [46][47]
- VLMs have evolved from basic models to more sophisticated systems capable of multimodal understanding and interaction, broadening their applicability across tasks [53][54]

Future Directions
- Key challenges in deploying VLMs on robotic platforms include sensor heterogeneity, semantic discrepancies, and the need for real-time performance optimization [58]
- Future research may focus on structured spatial modeling, improving system interpretability, and developing cognitive VLM architectures for long-term learning [58][59]
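As referenced in the navigation bullet above, here is a toy step loop showing the shape of goal-oriented or instruction-following navigation: observe, let a policy pick an action conditioned on the instruction, repeat until "stop". The environment and policy below are dummies, not any surveyed system's API:

```python
import random

# Hypothetical sketch of an instruction-following navigation loop: a policy
# consumes the current observation plus the language instruction and emits
# one discrete action per step. All names here are illustrative only.

ACTIONS = ("forward", "turn_left", "turn_right", "stop")

class RandomPolicy:
    """Stand-in for a learned VLM policy; picks actions at random."""
    def act(self, obs: dict, instruction: str) -> str:
        return random.choice(ACTIONS)

class ToyEnv:
    """Stand-in environment returning dummy RGB-D observations."""
    def reset(self) -> dict:
        return {"rgb": None, "depth": None}
    def step(self, action: str) -> dict:
        return {"rgb": None, "depth": None}

def navigate(env, policy, instruction: str, max_steps: int = 50) -> int:
    obs = env.reset()
    for step in range(max_steps):
        action = policy.act(obs, instruction)
        if action == "stop":          # the agent declares the goal reached
            return step
        obs = env.step(action)
    return max_steps

steps = navigate(ToyEnv(), RandomPolicy(), "go to the red chair")
print(f"episode ended after {steps} steps")
```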
New Survey: Multimodal Fusion and VLM Methods for Embodied Robotics
具身智能之心· 2025-08-31 02:33
Core Viewpoint
- The article discusses advancements in multimodal fusion and vision-language models (VLMs) in robot vision, emphasizing their role in enhancing robots' perception and understanding in complex environments [4][5][56]

Multimodal Fusion in Robot Vision Tasks
- Semantic scene understanding is a critical task for visual systems; multimodal fusion significantly improves accuracy and robustness by integrating additional information such as depth and language [9][11]
- Mainstream fusion strategies fall into early, mid-level, and late fusion, evolving from simple concatenation to more sophisticated interactions within a unified architecture (see the sketch after this summary) [10][12][16]

Applications of Multimodal Fusion
- In autonomous driving, 3D object detection is crucial for accurately identifying and locating pedestrians, vehicles, and obstacles, with multimodal fusion enhancing environmental understanding [15][18]
- Designing a fusion scheme means deciding when to fuse, what to fuse, and how to fuse, with each choice trading performance against computational efficiency [16][17]

Embodied Navigation
- Embodied navigation lets robots explore and act in real environments, emphasizing autonomous decision-making and dynamic adaptation [23][25][26]
- Three representative method families (goal-directed, instruction-following, and dialogue-based navigation) trace the evolution from perception-driven to interactive understanding [25][26][27]

Visual Localization and SLAM
- Visual localization determines a robot's position, which is challenging in dynamic environments; recent methods leverage multimodal fusion to improve performance [28][30]
- SLAM (Simultaneous Localization and Mapping) has evolved from geometry-driven to semantics-driven approaches, integrating diverse sensor data for greater adaptability [30][34]

Vision-Language Models (VLMs)
- VLM research spans semantic understanding, 3D object detection, embodied navigation, and robot manipulation, with a range of fusion methods being explored [56][57]
- Key innovations include large-scale pre-training, instruction fine-tuning, and structural optimization, strengthening cross-modal reasoning and task execution [52][53][54]

Future Directions
- Future research should focus on structured spatial modeling, improving system interpretability and ethical adaptability, and developing cognitive VLM architectures for long-term learning [57][58]
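As referenced above, here is a minimal contrast between the early- and late-fusion strategies the survey names; mid-level fusion (e.g., cross-attention between intermediate features) sits between the two. Feature widths and heads are arbitrary illustrations:

```python
import torch
import torch.nn as nn

# Minimal contrast between early and late fusion for one RGB feature
# and one depth feature over a batch of 4 samples.

rgb = torch.randn(4, 128)     # per-sample RGB features
depth = torch.randn(4, 128)   # per-sample depth features

# Early fusion: concatenate low-level features, then learn a joint head.
early_head = nn.Linear(256, 10)
logits_early = early_head(torch.cat([rgb, depth], dim=-1))

# Late fusion: independent unimodal heads, then combine the predictions.
head_rgb, head_depth = nn.Linear(128, 10), nn.Linear(128, 10)
logits_late = 0.5 * head_rgb(rgb) + 0.5 * head_depth(depth)

print(logits_early.shape, logits_late.shape)  # torch.Size([4, 10]) twice
```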
MiniMax Kicks Off Its IPO: Why Are Companies Like miHoYo Repeatedly Betting on AI?
36Ke· 2025-08-27 13:09
Core Insights
- MiniMax, an AI unicorn, has submitted a prospectus for a Hong Kong IPO, with an expected valuation exceeding $4 billion, drawing attention from both the gaming and AI industries [1]
- Major gaming companies such as miHoYo are increasingly investing in AI, driven by the need to enhance game-development efficiency and reduce costs [1][3]

Financing History
- MiniMax has raised significant funding across multiple rounds, including nearly $300 million in Series C in July 2025, $600 million in Series B in May 2024, and over $250 million in Series A in June 2023 [1]

AI Integration in Gaming
- AI in games has evolved from basic NPCs to sophisticated systems capable of autonomous decision-making, enhancing user experience and operational efficiency [2][3]
- miHoYo's recent title "Honkai: Star Rail" incorporates AI in areas including character behavior control and dialogue generation, showcasing AI's potential in game design [3]

Industry Trends
- The gaming industry is seeing a surge in AI applications: per the GDC 2025 report, 52% of developers use AI tools in game development [13]
- Major players like Tencent and NetEase pursue a dual strategy of in-house development plus investment in AI technologies to enhance their gaming offerings [5][6]

MiniMax's Position
- MiniMax is recognized as a leading company in the domestic AI model sector, focused on multi-modal AI model development since its founding in 2021 [6][8]
- The company's products leverage multi-modal capabilities and have achieved significant market recognition [8][10]

Cost Reduction and Efficiency
- Development costs keep rising, with top-tier games costing $90-200 million to build, pushing the industry to adopt AI to lower costs and improve efficiency [15]
- AI is transforming NPCs from mere functional tools into interactive characters, deepening player engagement (a toy dialogue sketch follows this summary) [17][19]

Legal and Regulatory Challenges
- MiniMax currently faces a copyright-infringement lawsuit from iQIYI, with potentially significant implications for the AI industry's use of copyrighted material in model training [22][23]
- The rise of AI-driven products has raised content-regulation concerns, particularly for applications targeting minors, underscoring the need for robust content-moderation systems [24]
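As referenced in the NPC discussion above, here is a toy sketch of the shift the article describes, from scripted NPC lines to LLM-generated dialogue with memory. `generate_reply` is a hypothetical stand-in for a hosted model call, not a miHoYo or MiniMax API:

```python
import random

# Toy contrast between a scripted NPC and a generative NPC with memory.
# `generate_reply` stands in for an LLM call; everything is illustrative.

SCRIPTED_LINES = ["Welcome, traveler.", "Fine weather today."]

def scripted_npc(_player_utterance: str) -> str:
    return random.choice(SCRIPTED_LINES)      # fixed pool, no memory

def generate_reply(history: list[str], player_utterance: str) -> str:
    # Stand-in for a model call conditioned on the dialogue history.
    return f"(model reply to '{player_utterance}', {len(history)} turns seen)"

def generative_npc(history: list[str], player_utterance: str) -> str:
    history.append(player_utterance)          # persistent dialogue memory
    reply = generate_reply(history, player_utterance)
    history.append(reply)
    return reply

history: list[str] = []
print(scripted_npc("Who are you?"))
print(generative_npc(history, "Who are you?"))
print(generative_npc(history, "What did I just ask?"))
```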