Who still dares to say Google has fallen behind? In 2025, it staged an impressive comeback
机器之心· 2025-12-24 09:30
Core Insights
- Google has made significant advancements in artificial intelligence in 2025, transforming AI from a mere tool into a collaborative partner capable of complex tasks such as coding and scientific research [1][18].

Group 1: AI Model Developments
- The release of Gemini 3 in November is considered Google's peak achievement, showcasing substantial improvements in model reasoning, multimodal understanding, and operational efficiency [10].
- Gemini 3 Pro achieved a 37.5% score on Humanity's Last Exam, outperforming competitors like Claude Sonnet 4.5 and GPT-5.1 [12].
- The Gemini 3 Flash model was introduced for developers and enterprises, offering enhanced performance at a lower cost compared to its predecessor, Gemini 3 Pro [13].

Group 2: Hardware Innovations
- The seventh-generation TPU, Ironwood, was launched, boasting a memory bandwidth of 7.2 TB/s and a single-chip memory capacity of 192 GB, providing 42.5 exaflops of AI computing power [34].
- Google's Quantum Echoes algorithm achieved verifiable quantum supremacy, solving specific problems 13,000 times faster than the fastest supercomputers [31][33].

Group 3: Applications and Collaborations
- Google has integrated AI capabilities across its product matrix, enhancing user experiences in search, mobile devices, and productivity tools [20].
- The introduction of Antigravity marks a shift in software development, allowing for deeper collaboration between AI and developers [18].

Group 4: Breakthroughs in Science and Research
- AlphaFold celebrated its fifth anniversary, having predicted the structures of over 200 million proteins and contributed to the Nobel Prize in Chemistry [26].
- The launch of AlphaGenome enables high-resolution DNA sequence analysis, aiding in the understanding of genetic diseases [28].

Group 5: Creative Media and Content Generation
- The Veo 3 model revolutionized video generation by synchronizing audio with visual elements, effectively ending the "silent film era" of AI-generated content [22].
- Nano Banana Pro introduced advanced capabilities in image generation, including high-fidelity text rendering and complex prompt construction [23].

Group 6: Educational Innovations
- Google has developed AI models like LearnLM, which incorporate educational principles to enhance user learning experiences [52].
After hands-on testing of MiniMax M2.1, we finally understand the technical confidence behind its IPO prospectus
机器之心· 2025-12-24 07:40
Editor | Panda

Over the past few days, the center of attention in China's AI industry has undoubtedly been MiniMax. On December 21, MiniMax (Xiyu Technology) formally filed its prospectus with the Hong Kong Stock Exchange, and the string of figures it disclosed immediately set off a wave of discussion: more than US$1 billion in cash reserves, revenue for the first nine months of 2025 up 174.7% year over year, and, while maintaining heavy R&D spending, an adjusted net loss held to US$186 million.

Before the noise in the capital markets had died down, on December 23 MiniMax played another technology card: it officially launched the MiniMax M2.1 model.

MiniMax (official) @MiniMax AI:
MiniMax-M2.1 is now live
- Multi-language Coding (beyond Python): SOTA across Rust, Java, Go, C++, Kotlin, Obj-C, TS & JS, scoring 72.5% ...
Facing VLA's "Achilles' heel" head-on: TeleAI uses "anti-exploration" to improve the stability of embodied reasoning
机器之心· 2025-12-24 07:40
Core Insights
- The article discusses the rapid development of Vision-Language-Action (VLA) models in embodied intelligence, highlighting their unprecedented generalization capabilities but also addressing the critical issue of instability during the reasoning phase [2][3][4].
- A novel framework named TACO (Test-time Anti-exploration via pseudo-Counts) is introduced to tackle the reasoning instability in VLA models, providing a solid theoretical foundation and practical solutions [2][8].

Group 1: VLA Model Challenges
- VLA models, despite their impressive average performance, exhibit extreme sensitivity to initial noise during inference, leading to success rates that can fluctuate between 0% and 80% for the same model [4][6].
- The instability is attributed to two main factors: the retention of redundant action patterns from diverse pre-training data and the multimodal nature of fine-tuning datasets, which may include suboptimal strategies [7][6].

Group 2: TACO Framework
- TACO draws inspiration from the "anti-exploration" principle in offline reinforcement learning, aiming to constrain generated actions to successful patterns within the fine-tuning dataset [9][11].
- The framework includes three key components, among them a Coupled Pseudo-Count Estimator that utilizes the VLA model's internal representation, ensuring efficient validation without additional training [11][12].

Group 3: Implementation and Results
- TACO employs a two-stage reasoning process: generating diverse action candidates and validating them through pseudo-counts, which are calculated using a trained CFN (a minimal sketch of this selection loop follows this summary) [17][18].
- The implementation of a Shared Observation Key-Value Cache significantly reduces computational costs, allowing for efficient real-time operation with minimal latency [20][21].

Group 4: Experimental Validation
- Comprehensive evaluations across multiple simulated benchmarks and a dual-arm robot platform demonstrate TACO's effectiveness, with average success rates improving by 16% in real-world tasks [22][32].
- Specific tasks, such as "organizing paper and pens," showed a remarkable 25% increase in success rates, highlighting TACO's ability to filter out suboptimal behaviors [32][33].

Group 5: Future Directions
- TACO not only addresses practical challenges but also opens new perspectives for VLA research, suggesting potential expansions into more complex multi-task scenarios and integration with world models for enhanced long-term planning capabilities [35].
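To make the two-stage idea concrete, here is a minimal sketch of test-time anti-exploration via pseudo-counts. It assumes hypothetical interfaces (`vla_policy.sample`, `vla_policy.encode`, `cfn.pseudo_count`) standing in for the VLA model and the Coin Flipping Network; it is an illustration of the selection principle, not the released TACO code.

```python
import numpy as np

def taco_select_action(vla_policy, cfn, observation, num_candidates=16):
    """Two-stage test-time selection sketch: sample diverse action candidates
    from the VLA policy, then keep the one with the highest pseudo-count,
    i.e., the one most consistent with patterns seen during fine-tuning.

    `vla_policy.sample`, `vla_policy.encode`, and `cfn.pseudo_count` are
    assumed interfaces, not the actual TACO API.
    """
    # Stage 1: generate diverse action candidates from the VLA model.
    candidates = [vla_policy.sample(observation) for _ in range(num_candidates)]

    # Stage 2: anti-exploration -- score each candidate by its pseudo-count,
    # estimated by a CFN operating on the VLA's internal representation.
    features = [vla_policy.encode(observation, a) for a in candidates]
    counts = np.array([cfn.pseudo_count(f) for f in features])

    # Prefer high-count (familiar, in-distribution) actions; low-count
    # candidates correspond to the unstable, out-of-distribution behaviors
    # that the anti-exploration step is meant to filter out.
    return candidates[int(np.argmax(counts))]
```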
No more "blind retouching": how does JarvisEvo give an agent human-like "visual reflection"?
机器之心· 2025-12-24 03:41
Moreover, traditional reinforcement learning often relies on a static reward model. As training goes on, the model easily learns how to "please" this fixed scorer, leading to reward hacking: the scores go up, but the aesthetics do not actually improve.

To break this deadlock, JarvisEvo was created. It is not merely an automated tool-user wired into Adobe Lightroom; it is a bold exploration of how an agent can truly self-evolve through "introspection".

On the road to general artificial intelligence, we have kept asking one question: do existing image-editing agents really "understand" photo retouching?

Most LLM/VLM-based agents are, in essence, "blind commanders". They can fluently write retouching code or call APIs, but before pressing Enter they cannot see the changes on the canvas, nor can they, like a human designer, stare at the screen, frown, and say: "The contrast on this one is pushed too high; pull it back a bit." This split between perception and decision-making directly causes "instruction hallucination", i.e., the model blindly "fills in the blanks". Lacking visual feedback, the model often imagines the next operation out of thin air, producing results far from what the user intended (a minimal closed-loop sketch of the alternative appears after this excerpt).

Core paradigm shift:
Paper title: JarvisEvo: Towards a Self-Evolving Photo Edit ...
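The "visual reflection" idea contrasts with the open-loop behavior described above: instead of emitting an edit and moving on, the agent inspects the rendered result before deciding its next step. Below is a minimal closed-loop sketch under assumed interfaces; `vlm_agent.propose_edit`, `vlm_agent.critique`, and the `apply_edit` renderer are hypothetical placeholders, not JarvisEvo's actual API.

```python
def edit_with_visual_reflection(vlm_agent, apply_edit, image, instruction, max_rounds=5):
    """Closed-loop editing sketch: propose an edit, look at the rendered
    result, and revise. All helpers are hypothetical placeholders, not
    JarvisEvo's actual implementation.
    """
    current = image
    for _ in range(max_rounds):
        # Propose an edit (e.g., Lightroom-style parameter adjustments)
        # conditioned on BOTH the instruction and the current canvas.
        edit = vlm_agent.propose_edit(current, instruction)
        candidate = apply_edit(current, edit)  # hypothetical renderer

        # "Visual reflection": critique the rendered result instead of
        # blindly trusting the plan that produced it.
        verdict = vlm_agent.critique(candidate, instruction)
        if verdict.satisfied:
            return candidate

        # Fold the critique back into the next round's instruction.
        instruction = f"{instruction}\nReviewer note: {verdict.feedback}"
        current = candidate
    return current
```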
ByteDance built an AI phone, DingTalk built an AI host
机器之心· 2025-12-24 03:41
Core Viewpoint
- The article discusses the recent developments in AI hardware, particularly focusing on the launch of the DingTalk Real AI host, which aims to redefine the role of devices in enterprise settings by allowing AI to operate independently while maintaining control and security [1][2][4][5].

Group 1: AI Hardware Developments
- The launch of the Doubao AI phone generated significant attention, showcasing the potential for AI to automate various tasks, although it faced challenges with app compatibility due to security measures [1].
- DingTalk introduced the DingTalk Real AI host, which serves as a physical execution platform for AI agents, surprising many in the industry [2][10].
- Both Doubao and DingTalk aim to redefine device roles from user-operated apps to AI-operated agents, shifting the focus to user demands rather than manual operations [4].

Group 2: DingTalk Real's Functionality
- DingTalk Real operates with an intelligent agent operating system (Agent OS), allowing AI agents to work continuously and access internal data securely [11][12].
- The design of DingTalk Real enables agents to operate in a controlled environment, mitigating security risks associated with granting high permissions to AI [13][16].
- The hardware is designed to be online 24/7, allowing AI to perform tasks without interrupting user devices, while still keeping critical decision-making in human hands [17].

Group 3: AgentOS and Collaboration
- AgentOS serves as a unified task scheduling and collaboration hub for AI agents, ensuring they are effectively integrated into daily workflows (a rough scheduling sketch follows this summary) [21][23].
- The architecture of AgentOS includes a core layer for task management and governance, along with a user-friendly interface for seamless interaction with AI agents [24][27].
- The system allows for the deployment of various AI agents, including a general-purpose agent named "Wukong," which can autonomously plan and execute tasks [28][30].

Group 4: Market Position and Future Outlook
- DingTalk Real is positioned as a practical solution for enterprises seeking to integrate AI into their processes while addressing security and compliance concerns [38].
- The article highlights the potential for DingTalk to create a collaborative ecosystem of AI agents, enhancing operational efficiency across various business functions [39].
- As DingTalk develops its AI capabilities, it aims to establish a significant presence in the enterprise-level AI market, with plans for further integration and collaboration with developers [40].
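As a rough mental model of the "unified task scheduling hub with human-in-the-loop control" described above, the sketch below shows one way such a hub could queue tasks, route them to registered agents, and defer sensitive actions to a human approver. All class and method names here are illustrative assumptions, not DingTalk's AgentOS API.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable, Dict

@dataclass
class Task:
    kind: str                  # e.g. "summarize_meeting", "draft_report"
    payload: dict
    needs_approval: bool = False

class AgentHub:
    """Illustrative scheduling hub: routes tasks to registered agents and
    keeps critical decisions with a human approver. Not the AgentOS API."""

    def __init__(self, approver: Callable[[Task], bool]):
        self.agents: Dict[str, Callable[[dict], dict]] = {}
        self.queue: Queue = Queue()
        self.approver = approver   # critical decisions stay in human hands

    def register(self, kind: str, agent: Callable[[dict], dict]) -> None:
        self.agents[kind] = agent

    def submit(self, task: Task) -> None:
        self.queue.put(task)

    def run(self) -> None:
        while not self.queue.empty():
            task = self.queue.get()
            if task.needs_approval and not self.approver(task):
                continue                    # human rejected: skip execution
            handler = self.agents.get(task.kind)
            if handler is not None:
                handler(task.payload)       # agent executes on the host device
```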
From "able to perform" to "performing better": KlingAvatar2.0 gives digital humans a vivid soul
机器之心· 2025-12-24 03:41
Core Insights
- The article discusses the significant advancements in the KlingAvatar2.0 technology, which enhances digital avatars' ability to express emotions and interact more naturally, moving beyond basic performance to a more lifelike representation [1][17].

Group 1: Technological Innovations
- KlingAvatar2.0 introduces a spatiotemporal cascading framework that allows for coherent long video generation, addressing issues of quality degradation in traditional AI tools (a rough sketch of the cascade appears after this summary) [4][5].
- The system generates a low-resolution "blueprint video" to capture global semantics and actions, which is then refined into high-resolution, temporally coherent segments [5][7].
- A collaborative reasoning director system, comprising three AI experts, transforms vague instructions into detailed storylines, effectively managing multimodal conflicts [8].

Group 2: Character Control and Performance
- The technology employs identity-specific multi-role control, ensuring that each digital character is accurately represented with its own voice and actions, avoiding confusion in multi-character scenarios [9][11].
- The performance metrics show a significant improvement in expressiveness, with KlingAvatar2.0 achieving a 43.2% overall enhancement compared to competitors like HeyGen and OmniHuman-1.5 [14][16].
- The emotional expression capabilities have been refined, allowing for natural facial expressions that convey complex emotions, and the overall motion quality has been enhanced to synchronize with the audio [15][16].

Group 3: Industry Implications
- The continuous evolution of digital human technology is lowering creative barriers and raising production standards across various sectors, including e-commerce, entertainment, online education, and corporate services [18].
- The advancements in KlingAvatar2.0 signify a leap in AI's understanding of human expressive arts, transforming technology from a mere tool into a medium for creative expression and emotional communication [18].
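A coarse way to picture the spatiotemporal cascade: one low-resolution "blueprint" pass fixes global semantics and motion for the whole clip, then overlapping segments are refined to high resolution while conditioning on the blueprint and on the previous segment's tail for temporal coherence. The sketch below is an illustrative skeleton with hypothetical model interfaces (`blueprint_model.generate`, `refiner_model.refine`), not Kling's implementation.

```python
def cascaded_avatar_generation(blueprint_model, refiner_model,
                               audio, prompt, segment_len=48, overlap=8):
    """Spatiotemporal cascade sketch (hypothetical interfaces):
    1) a low-res blueprint pass captures global semantics and motion;
    2) each high-res segment is refined conditioned on its blueprint slice
       and on the overlapping tail of the previous segment, which is what
       keeps long videos temporally coherent.
    """
    # Stage 1: low-resolution blueprint covering the whole clip.
    blueprint = blueprint_model.generate(audio=audio, prompt=prompt)

    frames, prev_tail = [], None
    for start in range(0, len(blueprint), segment_len - overlap):
        chunk = blueprint[start:start + segment_len]
        # Stage 2: high-resolution refinement of this segment.
        segment = refiner_model.refine(
            blueprint_chunk=chunk,
            audio=audio,          # keeps lip and body motion synced to speech
            context=prev_tail,    # overlap with the previous refined segment
        )
        keep_from = overlap if prev_tail is not None else 0
        frames.extend(segment[keep_from:])   # drop frames already emitted
        prev_tail = segment[-overlap:]
    return frames
```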
Broadcasting fights back! Inside 多彩新媒's "no cash-burning" AI survival playbook
机器之心· 2025-12-24 03:41
Core Insights
- The traditional broadcasting industry is facing a profound survival crisis due to increasing external competition, with smart voice device penetration exceeding 68% and short video platforms occupying an average of 2.8 hours of user time daily, leading to a structural shift in user attention [1].
- The average ARPU for provincial IPTV users is projected to be below 15 yuan in 2024, a 22% decline compared to three years ago, indicating a significant revenue drop [1].
- The essence of user loss and revenue decline stems from the traditional broadcasting service model's inability to meet new user demands for television [1].
- National policies are increasingly supportive of the broadcasting industry's intelligent transformation, emphasizing media integration and digital cultural strategies [1].

Strategic Choices
- The company has chosen to abandon a "big and comprehensive" approach, focusing instead on three core areas to address current challenges: maintaining "entry rights," enhancing "service value," and seeking "new growth" through AIGC tools [4][5].
- This pragmatic strategy reflects a typical mindset of budget-constrained operators, prioritizing core business stability while exploring key areas with controlled investments [5].

Technical Path
- The technical implementation emphasizes a layered architecture and "lightweight" practices, integrating mature external capabilities rather than building a complete AI system from scratch [7].
- The architecture focuses on core needs of IPTV, including safety, service, and efficiency, with plans to utilize lightweight models to control costs and develop industry-specific AI models [7].

Scene Implementation
- The AI practices of the company will be evaluated based on specific scene implementations, focusing on internal efficiency, user service upgrades, and ecosystem expansion [9].
- Internally, the company plans to enhance content production efficiency through tools like intelligent posters and coding, aiming for full-process automation [10].
- User experience will shift from passive viewing to active usage, with features like natural language commands for content search and tailored services for specific demographics [12][15].

Ecosystem Collaboration
- The company emphasizes an "internal and external linkage" strategy for ecosystem building, leveraging external resources to quickly establish service capabilities [20].
- The approach includes collaborating with major content providers and integrating local services to enhance user engagement and service diversity [20].

Value Outlook
- The AI transformation is expected to yield not only direct commercial returns but also serve as a practical reference for other industry players, with plans to accumulate valuable data assets for future growth [22].
- The initiative aims to establish new standards for intelligent broadcasting, facilitating a shift from content distribution to service operation across local broadcasting entities [22].

Conclusion
- The intelligent transformation of the broadcasting industry is a systemic change focused on redefining survival space and value, with the company's approach providing a pragmatic path for regional operators seeking to adapt [23][25].
When a world model is more than "video", how should it be evaluated? WorldLens proposes a new practical evaluation framework
机器之心· 2025-12-23 09:36
But a sharper question follows: when a model is called a "world model", what capabilities do we actually expect it to have?

Relying only on video metrics such as LPIPS and FVD, or on subjective impressions like "sharp / smooth / looks like a real video", easily keeps the discussion at the level of "does it look like video". What actually determines whether the model can serve simulation, planning, data synthesis, and closed-loop decision-making are properties those video metrics struggle to reach: whether the geometry is self-consistent, whether multiple views agree, whether the temporal dynamics are stable, whether the behavior is executable, whether downstream tasks can use it, and whether humans judge its physics and safety to be reasonable.

Recently, the WorldBench team built WorldLens, a new, systematic evaluation framework for world models. It is reportedly the first framework in the field to evaluate existing open-source world models along five dimensions at once: Generation, Reconstruction, Action-Following, Downstream tasks, and Human Preference (a generic sketch of this per-axis reporting idea appears after this excerpt). The evaluation EvalKit is now publicly available.

Paper link: https://arxiv.org/abs/2512.10958

Progress in generative world models across robotics, autonomous driving, and AIGC is plain to see: from single-view, dashcam-style video synthesis to controllable, multi-view, long-horizon ...
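One way to think about such a multi-dimensional benchmark is as a set of per-axis scorers whose results are reported side by side rather than collapsed into a single video-quality number. The sketch below is a generic illustration with placeholder scorer functions; WorldLens's actual metrics and protocol are defined in the paper and the released EvalKit.

```python
from typing import Callable, Dict

# Hypothetical per-dimension scorers; WorldLens defines its own metrics for
# generation, reconstruction, action-following, downstream utility, and
# human preference.
Scorer = Callable[[object], float]

def evaluate_world_model(model_outputs: object,
                         scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Report one score per evaluation axis instead of a single FVD/LPIPS
    number, so failure modes (e.g., good pixels but broken geometry or
    unusable actions) stay visible."""
    return {axis: scorer(model_outputs) for axis, scorer in scorers.items()}

if __name__ == "__main__":
    dummy = lambda _: 0.0   # placeholder scorer for illustration only
    report = evaluate_world_model(
        model_outputs=None,
        scorers={
            "generation": dummy,
            "reconstruction": dummy,
            "action_following": dummy,
            "downstream": dummy,
            "human_preference": dummy,
        },
    )
    print(report)
```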
Top talent everywhere! Traveling the globe with 机器之心 to gather at leading AI academic conferences
机器之心· 2025-12-23 09:36
In 2025, AI kept accelerating. From multimodal large models to the evolution of agent systems, from breakthroughs in foundational theory to deepening industrial applications, every leap in technology has been reshaping the outline of the future. Against the backdrop of an explosion of academic output, reading alone can no longer keep pace with the speed of iteration. We firmly believe that no matter how powerful an algorithm is, it still needs person-to-person connection; no matter how cutting-edge a breakthrough is, it still needs face-to-face conversation.

This year, carrying that belief, we set out. From the turning seasons of Beijing to the osmanthus fragrance of Jiangnan, from late-night conversations in Singapore to the summer breeze of Vienna, from the academic calm of Vancouver to the starlight over the San Diego coast... Around AI academic conferences such as ICLR, CVPR, ACL, ICML, IROS, EMNLP, and NeurIPS, we crossed 8 cities and held 11 events.

Across a map of shifting time zones, we found a shared frequency and wrote down these memories and numbers belonging to 2025:

2025, highlights
From in-depth paper interpretations to lively conversations at talent dinners, the two event series, "Paper Sharing Sessions" and "Talent Meetups", ran through the whole year at home and abroad, aiming to build an AI exchange ecosystem with warmth, depth, and value:

2026, setting out again
The old chapter is written; a new one awaits. The successful close of 2025 is the starting point of an even better journey in 2026. We have already drafted plans covering ICLR, CVPR, ACL, IC ...
Say goodbye to costly remastering! HKUST (Guangzhou) and Kuaishou Kling release a new single-step inference approach for stereo video conversion
机器之心· 2025-12-23 07:06
Core Viewpoint
- The article discusses the increasing demand for 3D video content driven by advancements in VR headsets, smart glasses, and 3D cinemas, highlighting the challenges in producing 3D content due to high costs and complex processes [2].

Group 1: Challenges in 3D Content Production
- Traditional 3D content production is hindered by high costs, as exemplified by the $18 million investment and 300 engineers required for the 3D re-release of "Titanic" [2].
- Existing automated methods for converting 2D to 3D, such as "Monocular-to-Stereo," often yield unsatisfactory results, with conversion times ranging from 15 to 70 minutes for just 5 seconds of video [2].
- The "Depth-Warp-Inpaint" (DWI) method, commonly used in 2D to 3D conversion, suffers from three major flaws: error propagation, depth ambiguity, and format inconsistency [8][9][15].

Group 2: Introduction of StereoPilot
- Kuaishou's Keling team and Hong Kong University of Science and Technology (Guangzhou) have developed StereoPilot, a new model that converts 5 seconds of 2D video into high-quality 3D video in just 11 seconds, outperforming existing state-of-the-art methods [3][23].
- StereoPilot addresses the limitations of DWI by effectively handling complex reflective scenes, which traditional methods struggle with [13][33].

Group 3: Data and Model Structure
- The team created the UniStereo dataset, the first large-scale dataset containing both Parallel and Converged formats, which includes 58,000 5-second videos from real-world sources and 48,000 from high-quality 3D films [24][28].
- The model structure of StereoPilot includes a Domain Switcher for format flexibility and a Cycle Consistency Loss to ensure geometric alignment between generated views (a generic sketch of such a loss appears after this summary) [30][34].

Group 4: Performance Comparison
- In quantitative comparisons, StereoPilot significantly outperforms other methods like StereoDiffusion and Mono2Stereo across all key metrics, achieving a PSNR of 27.735 and a processing time of just 11 seconds [31].
- Visual comparisons show that StereoPilot produces more accurate disparity and higher visual quality, particularly in complex scenes [33].

Group 5: Conclusion
- StereoPilot represents a breakthrough in rapid, high-quality 2D to 3D video conversion, offering new possibilities for VR/AR content creation and film restoration while clarifying the standards for training and evaluation in the field [43].
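The cycle-consistency idea can be sketched in a few lines: the right view synthesized from the left view should, when mapped back, reproduce the left view, which penalizes geometric drift between the two eyes. This is a generic formulation with assumed generator interfaces, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(left, generator_l2r, generator_r2l):
    """Generic cycle-consistency sketch for stereo synthesis (assumed
    interfaces, not StereoPilot's exact training code).

    left:           (B, C, H, W) source (left-eye) frames
    generator_l2r:  maps left-eye frames to synthesized right-eye frames
    generator_r2l:  maps right-eye frames back to the left view
    """
    right_hat = generator_l2r(left)          # synthesize the right view
    left_cycle = generator_r2l(right_hat)    # map it back to the left view
    # Penalize disagreement between the original and reconstructed left view,
    # which keeps the two generated views geometrically aligned.
    return F.l1_loss(left_cycle, left)
```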