机器之心
Making Diffusion Models Interpretable Without Degrading Quality, Opening a New Approach to Image Editing
机器之心· 2025-12-16 02:31
Core Viewpoint
- The article discusses the emergence of TIDE (Temporal-Aware Sparse Autoencoders) as a significant advancement in making diffusion models interpretable without sacrificing their generative quality [3][17].

Group 1: Background and Challenges
- Over the past three years, diffusion models have dominated the image generation field, with architectures like DiT pushing the limits of image quality [2].
- Despite the growth in explainability research for LLMs, the internal semantics and causal pathways of diffusion models remain largely opaque, making them a "black box" [2].
- Existing attempts at explainability often lead to a noticeable decline in performance, making the pursuit of interpretable diffusion models seem impractical [2].

Group 2: Introduction of TIDE
- TIDE is introduced as the first truly temporal-aware framework for diffusion transformers, aiming to reveal the internal mechanisms of these models without compromising their generative capabilities [3][5].
- The framework emphasizes the importance of the temporal aspect of the diffusion process, which unfolds progressively over time [6].

Group 3: Mechanism and Functionality of TIDE
- TIDE aligns semantics along the time dimension, allowing for a clearer presentation of the diffusion model's internal processes, such as the emergence of structure from noise and the gradual formation of semantics [7].
- The sparse autoencoder in TIDE enables lossless reconstruction in the feature space, maintaining the stability of the diffusion trajectory while being "observed" [7][10].

Group 4: Performance and Results
- TIDE decomposes diffusion features into controllable semantic factors, enhancing image editing capabilities by allowing direct manipulation along clear semantic directions (illustrated in the sketch after this summary) [8][10].
- The impact of TIDE on generative quality is minimal, with FID and sFID changes of less than 0.1%, demonstrating its ability to be interpretable without degrading quality [10][14].
- TIDE shows significant improvements in semantic binding and understanding of spatial relationships, with multiple metrics indicating optimal performance [12].

Group 5: Implications and Future Directions
- TIDE represents a new research paradigm, suggesting that diffusion models can be interpretable with the right perspective [19].
- Future developments may include more controllable and robust diffusion editing systems, a unified understanding of generative models, and advances in causal and semantic theory research [21][22].
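The sparse-autoencoder decomposition and "edit along a semantic direction" idea mentioned above can be sketched as follows. This is a minimal illustration only: the layer sizes, the top-k sparsity level, the latent index, and the editing strength are assumptions, not values taken from the TIDE paper.

```python
# Illustrative top-k sparse autoencoder over diffusion features (not TIDE's actual code).
# Dimensions, k, and the editing strength are assumptions for this sketch.
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 1152, d_latent: int = 16384, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Keep only the k largest activations per token: sparse, interpretable factors.
        topk = torch.topk(z, self.k, dim=-1)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

sae = TopKSparseAutoencoder()
features = torch.randn(4, 256, 1152)   # stand-in for diffusion-transformer block activations
recon = sae(features)                   # reconstruction fed back into the model

# "Editing" here means nudging one latent factor before decoding.
z = sae.encode(features)
z[..., 123] += 5.0                      # hypothetical semantic direction #123
edited = sae.decode(z)
```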
Goodbye to Hand-Crafted Prompts: A Former Meituan Executive's Startup Wants the Physical World Itself to Become the AI Prompt
机器之心· 2025-12-16 02:31
Core Viewpoint
- The article discusses the emergence of Looki, an AI hardware product aiming to enhance human-machine interaction by transforming real-world scenarios into contextual data, moving from passive responses to proactive engagement [1][2].

Group 1: Looki's Purpose and Technology
- Looki aims to fill the gap in large models' "sensory intelligence" by converting real-time visual and auditory signals into contextual data, thus driving AI to think and serve users more effectively [4][28].
- The Looki L1 device, weighing only 30g, is designed to operate as a multi-modal perception system, capturing the physical world continuously and efficiently [6][8].
- The founders of Looki, with backgrounds in autonomous driving and smart hardware, leverage their expertise to adapt perception algorithms from driving to everyday life [9][10].

Group 2: Data Management and User Context
- Looki employs a "data flywheel" approach to create a personalized context for users, transforming raw data into structured memories that the AI can efficiently access (a toy sketch of such a memory record follows this summary) [12][15].
- The system addresses two major challenges in multi-modal models: understanding long-sequence data and managing context explosion, while ensuring privacy and security in data handling [14].

Group 3: Transition to Proactive AI
- The article highlights a shift from manual prompts to proactive AI, where enriched context allows for anticipatory actions by the AI, marking a transition from chatbots to agentic AI [17][18].
- Looki's capabilities include automatic video editing, identifying significant moments in users' lives, and providing insights based on accumulated data, thus evolving into a second brain for users [20][24][30].

Group 4: Future Vision
- Looki envisions its hardware as a data interface that evolves beyond its current form, aiming to address the data hunger of the physical world and help users accumulate valuable personal data assets [29].
- The ultimate goal is for AI to possess a sense of presence, transforming it from a mere tool into an integral part of users' daily lives [30][31].
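How "structured memories" of this kind might be organized can be sketched in a purely illustrative way. The record fields, the keyword-based recall, and all example values below are assumptions for the sketch, not Looki's actual design.

```python
# Purely illustrative sketch of a structured "memory" store built from raw
# perception events; field names and retrieval logic are assumptions, not Looki's design.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryRecord:
    timestamp: datetime
    modality: str              # e.g. "vision" or "audio"
    summary: str               # compressed description instead of raw frames
    tags: list[str] = field(default_factory=list)

class MemoryStore:
    def __init__(self):
        self.records: list[MemoryRecord] = []

    def add(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def recall(self, keyword: str, limit: int = 5) -> list[MemoryRecord]:
        # Retrieve only the few records relevant to a query instead of the whole
        # raw stream; one simple way to keep the model's context small.
        hits = [r for r in self.records
                if keyword in r.summary or keyword in r.tags]
        return sorted(hits, key=lambda r: r.timestamp, reverse=True)[:limit]

store = MemoryStore()
store.add(MemoryRecord(datetime(2025, 12, 15, 9, 30), "vision",
                       "Parked the car on level B2, spot 217", tags=["car", "parking"]))
print(store.recall("parking"))
```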
AAAI 2026 | Can Video Large Language Models Be Trusted? A Comprehensive Evaluation of 23 Mainstream Models
机器之心· 2025-12-15 10:00
Core Insights
- The article discusses the development of Trust-videoLLMs, a comprehensive evaluation benchmark for video large language models, addressing challenges in authenticity, safety, fairness, robustness, and privacy [3][6][13].

Evaluation Framework
- Trust-videoLLMs includes a systematic, multi-layered, and scalable evaluation system with five core dimensions (a toy scoring sketch follows this summary):
  - Truthfulness: video description, temporal understanding, event reasoning, and hallucination suppression
  - Robustness: noise interference, temporal disturbance, adversarial attacks, and modality conflict
  - Safety: harmful content identification, harmful instruction rejection, deepfake detection, and jailbreak attack defense
  - Fairness: stereotype identification, occupational bias, and time sensitivity analysis
  - Privacy: privacy content recognition, celebrity privacy protection, and self-inference of privacy [6][9].

Evaluation Tasks
- The evaluation tasks cover three main aspects, including contextual reasoning, temporal reasoning, video description, event understanding, and hallucination in videos, among others [8][11].

Model Assessment
- The evaluation encompasses 23 mainstream video large language models, including 5 commercial models and 18 open-source models, with varying parameter scales and architectural designs [10][12].

Key Findings
- Model size does not equate to stronger performance, as larger models do not necessarily outperform smaller ones [16].
- Closed-source models, such as Claude and Gemini 1.5, demonstrate superior safety, privacy protection, and multi-modal alignment compared to open-source models [17].
- Video context significantly impacts safety: harmful text prompts paired with relevant videos increase the likelihood of generating harmful content [18].
- Fairness issues are prevalent, with models showing biases related to gender, age, and skin color; closed-source models perform better here due to data cleaning and ethical constraints [19].
- Privacy protection is a double-edged sword: stronger models can better identify privacy content but also risk inferring private information [20].

Open-source Tools and Data
- To promote the development of trustworthy video large models, the team has open-sourced a large-scale video dataset containing 6,955 videos covering multiple scenes and tasks, along with a unified evaluation toolbox [24].
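As a purely illustrative sketch of how a multi-dimension benchmark like this aggregates results: per-task scores roll up into one score per trust dimension per model. The dimension names come from the summary above; the task scores and the unweighted-mean rule are assumptions, not the Trust-videoLLMs protocol.

```python
# Toy aggregation of per-task scores into trust dimensions.
# Scores and the unweighted-mean rule are assumptions for illustration.
from statistics import mean

results = {
    "model-A": {
        "truthfulness": {"video_description": 0.71, "temporal_understanding": 0.64,
                         "event_reasoning": 0.58, "hallucination_suppression": 0.66},
        "safety": {"harmful_content_id": 0.80, "instruction_rejection": 0.75,
                   "deepfake_detection": 0.52, "jailbreak_defense": 0.69},
    },
}

def dimension_scores(model_results: dict) -> dict:
    # One number per dimension: the unweighted mean of its task scores.
    return {dim: round(mean(tasks.values()), 3) for dim, tasks in model_results.items()}

for model, per_dim in results.items():
    print(model, dimension_scores(per_dim))
# e.g. model-A {'truthfulness': 0.648, 'safety': 0.69}
```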
Major Update to Thinking Machines' First Product: K2 Thinking and Qwen3-VL Can Now Be Fine-Tuned
机器之心· 2025-12-15 10:00
Core Insights
- The article discusses updates to Tinker, an API developed by Thinking Machines Lab, aimed at simplifying the fine-tuning of language models for developers and researchers [1][4].
- Tinker has removed its waitlist, allowing all users to access the platform directly, which marks a significant shift in the accessibility of AI model training [1][4].
- The article highlights three major updates to Tinker: enhanced reasoning capabilities, a new inference interface compatible with the OpenAI API (see the sketch after this summary), and the introduction of visual input support through new models [1][4].

Group 1: Tinker Overview
- Tinker allows developers to focus solely on training data and algorithms, while it manages infrastructure aspects like scheduling and resource management, significantly lowering the barrier to entry for model training [4].
- The platform now supports fine-tuning of the Kimi K2 model, which has a trillion parameters and was previously accessible only to top-tier labs [4].
- Tinker's visual input capabilities enable users to handle images and visual content in various applications, further broadening its usability [1][4].

Group 2: Model Performance and Comparisons
- Tinker has been tested with the Qwen3-VL-235B-A22B model on several image classification benchmarks, including Caltech-101, Stanford Cars, Oxford Flowers, and Oxford Pets [4][5].
- The performance of Qwen3-VL-235B-A22B was compared to DINOv2, a self-supervised vision transformer, and showed superior results in small-sample scenarios thanks to its larger model size and integrated language knowledge [7].
- Qwen3-VL's ability to combine language and visual understanding allows it to adapt more easily to visual tasks beyond image classification [7].
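Because the new inference interface is described as OpenAI-API-compatible, a client call could look like the sketch below. The base URL, model name, and environment variable are placeholders I am assuming, not documented Tinker values; only the standard OpenAI Python SDK usage itself is real.

```python
# Hypothetical client call against an OpenAI-compatible endpoint.
# The base_url, model id, and TINKER_API_KEY variable are assumptions,
# not documented Tinker values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://example-tinker-endpoint/v1",   # placeholder endpoint
    api_key=os.environ["TINKER_API_KEY"],            # placeholder env var
)

response = client.chat.completions.create(
    model="my-finetuned-qwen3-vl",                   # placeholder fine-tuned model id
    messages=[{"role": "user", "content": "Classify this flower species."}],
)
print(response.choices[0].message.content)
```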
NeurIPS 2025 | Point and Shoot: A Controllable Adversarial Example Generator Arrives!
机器之心· 2025-12-15 08:10
Recently, at NeurIPS (the Conference on Neural Information Processing Systems), one of the most influential top-tier academic conferences in artificial intelligence, Tsinghua University and Ant Digital Technologies (蚂蚁数科) jointly proposed a new adversarial attack generation framework named Dual-Flow.

In short, Dual-Flow is a system that learns "universal perturbation patterns" from massive amounts of image data. It does not depend on the target model's architecture and requires no gradients, yet it can mount black-box attacks against many models and many classes. Its core idea is a dual-flow structure that pairs forward perturbation modeling with conditional reverse optimization, achieving high transferability and high success rates for adversarial examples while keeping the visual difference extremely small.

It can be thought of as a "controllable adversarial example generator": simply specify the image class to attack (such as dogs or humans), and the model automatically generates realistic and effective attack images for that class, posing an unprecedented challenge to AI security.

Research Background and Significance

Paper title: Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow Optimization
Authors: Yixiao Chen, Shikun Sun, Jianshu Li, Ruoyu Li, Zhe Li, Junliang ...
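A minimal sketch of how one might measure the success rate of such a class-targeted, black-box attack is shown below. The generator interface, the victim model, the perturbation budget, and the random inputs are all assumptions for illustration, not the Dual-Flow implementation.

```python
# Illustrative targeted-attack evaluation loop (not the Dual-Flow code).
# `generator` stands for a hypothetical class-conditional perturbation generator;
# the victim model, epsilon budget, and random data are assumptions.
import torch
import torchvision.models as models

victim = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def targeted_success_rate(generator, images, target_class: int, eps: float = 8 / 255) -> float:
    """Fraction of perturbed images that the black-box victim labels as `target_class`."""
    with torch.no_grad():
        delta = generator(images, target_class).clamp(-eps, eps)   # keep the perturbation small
        adv = (images + delta).clamp(0, 1)
        preds = victim((adv - mean) / std).argmax(dim=1)
    return (preds == target_class).float().mean().item()

# Usage with a stand-in generator that returns a zero perturbation:
dummy_generator = lambda x, c: torch.zeros_like(x)
batch = torch.rand(8, 3, 224, 224)
print(targeted_success_rate(dummy_generator, batch, target_class=207))   # 207 = golden retriever
```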
Domestic Chips Can Now Run Real-Time AI Video Generation: SenseTime's Seko 2.0 Reveals the Technology Behind It
机器之心· 2025-12-15 08:10
Core Insights
- The article discusses the competitive landscape of video generation models, highlighting the advancements made by various tech companies, including Google, Runway, and Kuaishou, while questioning whether these models are ready to serve as productivity tools [2][9].
- SenseTime's Seko 2.0 is introduced as a significant advancement, enabling AI short-drama creation with minimal human input and effectively allowing a single person to manage the production [2][4][7].

Group 1: Industry Developments
- Major tech companies are racing to release enhanced versions of video generation models before the end of the year, with Google launching Veo 3.1 and Runway introducing Gen-4.5 [2].
- SenseTime's Seko 2.0 has been successfully deployed in over a hundred short-drama studios, showcasing its capability to generate scripts, storyboards, and videos rapidly [7][9].

Group 2: Technical Challenges
- The article outlines the "impossible triangle" of video generation, where efficiency, cost, and quality are at odds, making it difficult for AI video generation models to meet commercial demands [11][13].
- Current models, even at the Sora 2 level, require several minutes to generate just 10 seconds of video, which hampers the rapid iteration and real-time feedback essential for industrial production [11][12].

Group 3: Innovations in Video Generation
- SenseTime's LightX2V framework is highlighted as a breakthrough in real-time video generation, achieving generation times of under 5 seconds for 5-second videos, significantly faster than current industry standards (a latency-budget sketch follows this summary) [16][17].
- The framework employs Phased DMD technology, which enhances video quality and consistency while maintaining high generation speeds [19][20].

Group 4: Engineering and Optimization
- LightX2V incorporates a comprehensive optimization strategy across five dimensions: model, scheduling, computation, storage, and communication, enabling low-cost, real-time video generation [31][32].
- The framework's architecture allows for efficient use of consumer-grade GPUs, achieving real-time generation with a memory requirement of less than 8 GB [36][37].

Group 5: Domestic Chip Adaptation
- SenseTime's Seko 2.0 has achieved full compatibility with domestic AI chips, offering a cost-effective alternative to NVIDIA chips while maintaining comparable video quality [39][40].
- The strategic support for domestic AI ecosystems is emphasized, marking a significant step for China's AI industry toward core technological independence [42].
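A back-of-the-envelope check of what "real time" means here, using the figures quoted above (a 5-second clip generated in under 5 seconds, versus several minutes for 10 seconds of video). The helper below is only an illustrative calculation; the 300-second baseline is an assumed reading of "several minutes".

```python
# Back-of-the-envelope real-time check using the figures quoted in the summary.
# A pipeline is "real time" when it generates video at least as fast as it plays back.

def realtime_factor(clip_seconds: float, generation_seconds: float) -> float:
    """>= 1.0 means generation keeps up with playback; < 1.0 means it falls behind."""
    return clip_seconds / generation_seconds

# Baseline quoted in the article: several minutes (assumed ~300 s) for 10 s of video.
print(realtime_factor(10, 300))   # ~0.033x, far from real time
# LightX2V figure quoted: a 5-second clip in under 5 seconds.
print(realtime_factor(5, 5))      # 1.0x, the real-time threshold
```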
Veo Does Far More Than Generate Video: DeepMind Is Using It to Simulate the Entire Robot World
机器之心· 2025-12-15 08:10
Core Insights
- The article discusses the development of generalist robots capable of performing various tasks from natural language instructions, highlighting significant challenges in real-world evaluation and safety assessment [1][3].

Group 1: Challenges in Robot Evaluation
- Real-world evaluation is costly and time-consuming, requiring extensive hardware experiments across various scenarios, including extreme and out-of-distribution environments [1].
- Safety assessment is particularly challenging because potentially unsafe behaviors cannot be tested repeatedly in real environments, making traditional evaluation methods difficult to apply [1].

Group 2: Limitations of Traditional Simulation
- Traditional physics simulators have limitations in realism, diversity, setup cost, and visual consistency, which hinder their effectiveness for robot evaluation [2].

Group 3: Advancements in Video Modeling
- Cutting-edge video models offer an alternative path to world simulation, addressing many challenges in robot policy evaluation, though they face difficulties such as generating artifacts under closed-loop conditions and simulating contact dynamics [3].

Group 4: Introduction of the Veo-Based Evaluation System
- The article introduces a video-model-based robot policy evaluation system developed by Google DeepMind's Gemini Robotics team, which supports comprehensive evaluation needs, including in-distribution and out-of-distribution assessments [4][5].
- The system builds on the advanced video generation model Veo, achieving high-fidelity visual realism and fine-grained control responses without the need for real physical setups [5].

Group 5: Experimental Validation
- Over 1,600 real-world experiments validated the effectiveness of the video model's predictions across eight generalist policy checkpoints and five tasks, demonstrating a strong correlation between predicted and actual success rates (a correlation sketch follows this summary) [5][26].
- The system's ability to predict performance across different robot policies was tested, confirming its reliability for ranking policies by performance [24][26].

Group 6: Safety Testing Capabilities
- The Veo-based world model can be used for safety red-team testing, allowing potentially unsafe behaviors in a policy to be identified without real-world risk [31].
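A minimal sketch of how agreement between world-model-predicted and real success rates can be quantified. The eight checkpoint values below are invented for illustration; only the use of Pearson and Spearman correlation is standard practice, not something taken from the paper.

```python
# Toy check of how well world-model-predicted success rates track real ones.
# The per-checkpoint values below are invented for illustration.
from scipy.stats import pearsonr, spearmanr

predicted_success = [0.82, 0.64, 0.71, 0.55, 0.90, 0.40, 0.77, 0.60]  # from the video world model
real_success      = [0.78, 0.60, 0.74, 0.50, 0.88, 0.45, 0.72, 0.58]  # from hardware rollouts

r, _ = pearsonr(predicted_success, real_success)
rho, _ = spearmanr(predicted_success, real_success)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
# High rank correlation is what matters if the simulator is used to rank policies.
```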
AAAI 2026 | Revolutionizing the Film Dubbing Pipeline: AI Learns the "Director-Actor" Dubbing Collaboration Model for the First Time
机器之心· 2025-12-15 01:44
Core Viewpoint
- The article discusses the limitations of AI voice dubbing, particularly its lack of emotional depth, and introduces a new framework called Authentic-Dubber that incorporates director-actor interaction to enhance emotional expression in AI-generated voiceovers [2][3][19].

Group 1: AI Dubbing Limitations
- AI voice dubbing often lacks the "human touch", as it skips the crucial director-actor interaction that brings emotional depth to performances [2][3].
- Current AI models simplify the dubbing process by having AI "actors" read scripts without the guidance of a director, resulting in a lack of emotional resonance [2][3].

Group 2: Authentic-Dubber Framework
- The Authentic-Dubber framework, developed by a team led by Professor Liu Rui, introduces a director role into AI dubbing, simulating the emotional transmission mechanisms found in traditional dubbing workflows [4].
- The system aims to teach AI to "understand first, then express", moving beyond mere imitation of sounds toward more nuanced emotional delivery [4].

Group 3: Mechanisms of Authentic-Dubber
- The framework includes a multi-modal reference material library that serves as an emotional guide for the AI, integrating emotional cues such as scene atmosphere and facial expressions [7].
- A retrieval-augmented strategy allows the AI to quickly access emotionally relevant reference clips, mimicking how actors internalize emotional cues under a director's guidance (a toy retrieval sketch follows this summary) [11].
- The system employs a progressive, graph-structured speech generation method to ensure the final output has rich emotional layers, enhancing the overall quality of the dubbing [13].

Group 4: Experimental Validation
- In tests on the V2C-Animation dataset, Authentic-Dubber significantly outperformed all mainstream baseline models in emotional accuracy (EMO-ACC) [14].
- Subjective evaluations by human listeners gave Authentic-Dubber the highest scores in emotional matching (MOS-DE) and emotional authenticity (MOS-SE) [15].
- The system demonstrated quantifiable advantages in emotional expression, with spectral analysis showing distinct acoustic features for different emotions [16].

Group 5: Significance of the Research
- The research elevates the competitive dimension of AI dubbing from mere synchronization to emotional resonance, indicating a deeper understanding of complex emotions by AI [19].
- By simulating key interactions in human collaboration, the framework represents a significant step toward AI voiceovers that can truly "inject soul" into characters [19].
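A toy sketch of the retrieval step described above, based only on the general idea of embedding similarity: score candidate reference clips against a query scene embedding and keep the closest ones. The embedding size, the reference library, and the scoring rule are assumptions, not the Authentic-Dubber implementation.

```python
# Toy cosine-similarity retrieval of emotionally relevant reference clips.
# Embedding size and library contents are invented; only the retrieval idea is illustrated.
import numpy as np

rng = np.random.default_rng(0)
reference_library = {                      # clip id -> (emotion tag, embedding)
    f"clip_{i}": (tag, rng.normal(size=256))
    for i, tag in enumerate(["anger", "joy", "sadness", "fear", "neutral"] * 4)
}

def retrieve(query_embedding: np.ndarray, top_k: int = 3):
    """Return the top_k reference clips whose embeddings best match the query scene."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cid, tag, cosine(query_embedding, emb))
              for cid, (tag, emb) in reference_library.items()]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:top_k]

query = rng.normal(size=256)               # stand-in for a multi-modal scene embedding
for clip_id, emotion, score in retrieve(query):
    print(clip_id, emotion, round(score, 3))
```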
Is RL a "Philosopher's Stone" or an "Excavator"? CMU Answers with Controlled Experiments
机器之心· 2025-12-15 01:44
Core Insights
- Recent advancements in reinforcement learning (RL) have significantly improved the reasoning capabilities of language models [1].
- The true extent to which post-training expands model reasoning capabilities, rather than merely uncovering existing potential, remains unclear [2].
- A key challenge is the lack of controllability in modern training processes, with large-scale pre-training corpora being opaque and mid-training often insufficiently studied [2].

Group 1: Research Framework and Methodology
- Researchers from Carnegie Mellon University developed a controllable synthetic data framework based on GSM-Infinite to quantitatively analyze the causal impact of pre-training, mid-training, and RL on the generalization of model reasoning [2][5].
- The framework allows reasoning structure to be decoupled from surface context, enabling precise quantification of reasoning complexity and testing whether models genuinely learn reasoning logic or merely memorize specific text patterns [10][12].

Group 2: Key Findings on Training Interactions
- The effectiveness of RL depends on the "capability margin": RL can enhance reasoning abilities only when tasks are challenging yet still within the model's exploration range (a toy task-filtering sketch follows this summary) [16][17].
- Pre-training utilized 10 billion tokens focused on basic reasoning primitives, while mid-training serves as a bridge that aligns the model's internal representations for RL readiness [20].
- A minimal amount of target-context data during pre-training can significantly enhance cross-context generalization during RL post-training [22].

Group 3: Training Efficiency and Performance
- Mid-training is crucial for computational efficiency, with findings indicating that combining mid-training with RL yields better performance than using RL alone [26][27].
- Introducing process-level rewards can mitigate reward hacking and improve reasoning fidelity, particularly on complex reasoning tasks [29][30].

Group 4: Practical Guidelines for Training
- RL data design should target the model's capability margin, avoiding tasks that are too easy or too difficult [31].
- Pre-training strategies should ensure at least 1% coverage of atomic capabilities in long-tail domains to provide interfaces for RL [32].
- Computational resources should be allocated dynamically according to task difficulty, with more RL for tackling hard problems and more mid-training for stability [33].
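A toy sketch of what "targeting the capability margin" could look like in practice: filter candidate RL tasks by the base model's estimated pass rate and keep only those that are neither trivial nor hopeless. The pass rates and thresholds below are assumptions for illustration, not values from the paper.

```python
# Toy "capability margin" filter for RL task selection.
# Pass rates and thresholds are invented; the idea is to keep tasks the base
# model sometimes solves (not always, not never), since those drive RL learning.
import random

def estimate_pass_rate(solve_fn, task, n_samples: int = 32) -> float:
    """Fraction of sampled attempts that solve the task."""
    return sum(solve_fn(task) for _ in range(n_samples)) / n_samples

def in_capability_margin(pass_rate: float, low: float = 0.05, high: float = 0.8) -> bool:
    return low <= pass_rate <= high

# Usage with a stand-in solver that succeeds with a task-specific probability.
tasks = {"easy": 0.95, "medium": 0.4, "hard": 0.1, "impossible": 0.0}
solver = lambda task: random.random() < tasks[task]

selected = [t for t in tasks
            if in_capability_margin(estimate_pass_rate(solver, t))]
print(selected)   # typically ['medium', 'hard']
```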
SIGGRAPH Asia 2025 | Recovering 200 FPS Detail from Ordinary 30 FPS Cameras: A 4D Reconstruction Solution Arrives
机器之心· 2025-12-14 04:53
Hardware Innovation: Asynchronous Capture Lets the Cameras "Shoot in Staggered Shifts"

The first author of this paper, Yutian Chen (陈羽田), is a second-year PhD student at MMLab, The Chinese University of Hong Kong, working on 3D reconstruction and generation under the supervision of Prof. Tianfan Xue (薛天帆). Homepage: https://yutian10.github.io

When the robe of a martial-arts master in a period drama sweeps into a stunning 0.01-second arc as he flips through the air, when a VR player wants to reach out and grab an opponent's sword blade "frozen in mid-air", and when the crown-like splash of a drop of milk in a viral TikTok video needs to be replayed from every angle in 360°, a hard problem in 3D vision emerges: how can ordinary cameras "freeze" the fleeting high-speed world into a digital 4D space-time that can be repeatedly dissected, transmitted, and interacted with?

However, constrained by hardware cost and data transmission bandwidth, the vast majority of today's 4D capture arrays top out at around 30 FPS, whereas traditional high-speed photography typically requires 120 FPS or more. Simply upgrading the camera hardware is not only expensive but also drives an explosive growth in data throughput, making large-scale deployment impractical. Another line of attack is to "interpolate frames" at the reconstruction stage. Recently, dynamic scene reconstruction methods such as 4D Gaussian Splatting have been able to synthesize continuous frames from sparse temporal inputs for simple motions, effectively raising the frame rate, but for non-linear, complex motions such as swinging cloth or high-speed rotation, the intermediate ...
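The "staggered shooting" idea in the heading can be illustrated with a small calculation: if several 30 FPS cameras are triggered with evenly spaced phase offsets, the array as a whole samples time much more densely than any single camera. The sketch below only illustrates that scheduling arithmetic; the array size of 7 is an assumption, not the paper's actual capture setup.

```python
# Illustrative phase-offset schedule for asynchronous capture.
# With N cameras at base_fps each, evenly staggered triggers give an
# effective temporal sampling rate of N * base_fps for the whole array.

def staggered_offsets(num_cameras: int, base_fps: float):
    """Per-camera trigger offsets (in seconds) within one frame interval."""
    frame_interval = 1.0 / base_fps
    return [i * frame_interval / num_cameras for i in range(num_cameras)]

base_fps = 30
num_cameras = 7                      # an assumed array size
offsets = staggered_offsets(num_cameras, base_fps)
print([round(o * 1000, 2) for o in offsets])                 # offsets in milliseconds
print(f"effective sampling rate = {num_cameras * base_fps} FPS")   # 210 FPS with 7 cameras
```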