SIGGRAPH Asia 2025 | Clone cinematic camera moves in one click: CUHK & Kuaishou Kling team release CamCloneMaster
机器之心· 2025-10-22 06:32
The first author of this work is Yawen Luo, a first-year PhD student at MMLab, The Chinese University of Hong Kong, advised by Prof. Tianfan Xue and working on video generation. Personal homepage: https://luo0207.github.io/yawenluo/

As a video creator, have you ever dreamed of recreating the physics-defying rotating shot from Inception, or the classic tracking shot at the bow of the Titanic? In AI video generation, these ideas, which depend on precise camera motion, are often surprisingly hard to realize. A straightforward approach is to first extract camera parameters from a reference video with a camera pose estimation model, then use those parameters as a control condition to guide video generation. But this seemingly easy route is riddled with pitfalls: dynamic objects and complex occlusions in real scenes often make the estimated camera parameters biased or simply wrong, so the generated camera moves can diverge sharply from what was intended. To address this pain point, The Chinese University of Hong Kong and Kuaishou's Kling team jointly propose CamCloneMaster, a new framework for camera-controllable video generation. It introduces a new "reference-as-control" paradigm: the user only needs to provide a reference video, and the model directly "clones" its camera motion and applies it to new content, removing the dependence on camera parameters altogether. The work has been accepted to SIGGRAPH Asia 2025, a top computer graphics venue, and its training and test code and the high-quality rendered dataset CamClo ...
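To make the contrast above concrete, here is a minimal sketch of the two conditioning interfaces, under stated assumptions: every function name below is a hypothetical placeholder and does not reflect the released CamCloneMaster code.

```python
# Hypothetical sketch contrasting the two control paradigms described above.
# All function names are placeholders, not the released CamCloneMaster API.

def generate_with_estimated_poses(reference_video, prompt, pose_estimator, generator):
    """Parameter-based pipeline: estimate camera poses, then condition on them.
    Dynamic objects and occlusions in the reference clip can corrupt the
    estimated poses, and those errors propagate into the generated video."""
    poses = pose_estimator(reference_video)            # per-frame camera parameters
    return generator(prompt, camera_condition=poses)

def generate_by_cloning_reference(reference_video, prompt, clone_model):
    """Reference-as-control paradigm: condition directly on the reference clip
    and clone its camera motion, with no explicit pose estimation step."""
    return clone_model(prompt, camera_reference=reference_video)
```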
New CVPR 2026 rules: compute costs must be disclosed, and highly efficient, highly transparent papers can earn three recognition awards
机器之心· 2025-10-22 03:30
Core Viewpoint
- CVPR has introduced a pilot program called the "Compute Resource Report Form (CRF)" to enhance transparency and fairness in AI research by requiring authors to report the computational resources used in their studies, starting from CVPR 2026 [2][4].

Group 1: CRF Implementation
- All paper submitters must provide a detailed report of the computational resources used, including GPU and CPU usage, training time, and model efficiency (a rough GPU-hour tally is sketched after this summary) [7].
- Submission of the CRF is mandatory but will not affect the paper acceptance decision, as the data will be reviewed by an independent committee [6][16].
- The CRF requires basic information about hardware, computation time, and performance results compared to the strongest benchmarks [7][31].

Group 2: Recognition Awards
- Papers demonstrating outstanding computational efficiency and/or transparency may qualify for recognition awards, including the Efficient CVPR Badge, CVPR Compute Star Award, and CVPR Compute Transparency Champion Award [9][10].
- Awards will be evaluated based on objective metrics and will ensure fair assessment across similar task categories [10][27].

Group 3: Submission Process
- Authors are encouraged to review a pre-filled example form to understand how to complete each section of the CRF [11][19].
- Completing the mandatory sections typically takes 10-15 minutes, while optional sections may require additional time [28].
- Authors should save the original PDF of the filled form without flattening it, to retain the data fields needed for processing [12][20].

Group 4: FAQs and Clarifications
- The CRF aims to provide insight into the actual computational costs of different methods and is a transparency experiment rather than a judgment mechanism [15][18].
- High computational resource usage will not incur penalties, as significant advancements often require substantial resources [17].
- Submitting anonymized Weights & Biases logs is optional but can improve the chances of receiving a recognition award [26].
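As a rough illustration of the kind of bookkeeping the CRF asks for, the sketch below tallies GPU-hours and an approximate energy figure for a set of runs. The run list, the 700 W per-GPU power figure, and the printed labels are assumptions for illustration only, not the official CRF schema.

```python
# Minimal sketch of tallying compute figures before filling in the CRF:
# total GPU-hours across runs and a rough energy estimate. The runs, the
# 700 W per-GPU assumption, and the labels are illustrative placeholders.

runs = [
    # (num_gpus, wall_clock_hours) per training or ablation run
    (8, 72.0),   # main training run
    (8, 12.5),   # ablation A
    (4, 6.0),    # ablation B
]

gpu_hours = sum(n * h for n, h in runs)          # 700.0 GPU-hours
assumed_gpu_power_kw = 0.7                       # ~700 W per GPU (assumption)
energy_kwh = gpu_hours * assumed_gpu_power_kw

print(f"Total GPU-hours: {gpu_hours:.1f}")       # 700.0
print(f"Approx. energy:  {energy_kwh:.1f} kWh")  # 490.0
```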
Just announced: the ICCV best papers are out, and Jun-Yan Zhu's team takes the top prize with toy bricks
机器之心· 2025-10-22 03:30
Core Insights
- At ICCV (International Conference on Computer Vision), the best paper and best student paper awards were announced on October 22, 2025, highlighting significant advancements in computer vision research [1][2][4].

Group 1: Best Paper
- The best paper award went to a research team from Carnegie Mellon University (CMU), led by notable AI scholar Jun-Yan Zhu, for the paper "Generating Physically Stable and Buildable Brick Structures from Text" [6][9].
- The paper introduces BrickGPT, a novel method that generates physically stable and interconnected brick assembly models from text prompts, marking a significant advancement in the field [9][11].
- The team built a large-scale dataset of stable brick structures, comprising over 47,000 models and 28,000 unique 3D objects with detailed text descriptions, to train the model [11][10].

Group 2: Methodology and Results
- The method discretizes a brick structure into a sequence of text tokens and trains a large language model to predict the next brick to add, ensuring physical stability through validity checks and a rollback mechanism (a generation loop in this style is sketched after this summary) [10][17].
- Experimental results indicate that BrickGPT achieved a 100% validity rate and a 98.8% stability rate, outperforming various baseline models in both effectiveness and stability [20][18].
- The approach generates diverse and aesthetically pleasing brick structures that align closely with the input text prompts, demonstrating high fidelity in design [11][20].

Group 3: Best Student Paper
- The best student paper award went to a paper from the Technion, "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models," which bypasses traditional image-editing paths to improve image fidelity [25][28].
- FlowEdit establishes a direct mapping path between the source and target image distributions, resulting in lower transfer costs and better preservation of the original image structure during editing [31][27].
- The method was validated on advanced T2I flow models, achieving state-of-the-art results across various complex editing tasks, showcasing its efficiency and superiority [31].

Group 4: Other Awards and Recognitions
- The Helmholtz Prize was awarded for contributions to computer vision benchmarks, recognizing two significant papers, including "Fast R-CNN" by Ross Girshick, which improved detection speed and accuracy [36][38].
- The Everingham Prize recognized teams for their contributions to 3D modeling and multimodal AI, including the development of the SMPL model and the VQA dataset [41][43].
- Distinguished Researcher Awards were given to David Forsyth and Michal Irani for their impactful contributions to the field of computer vision [50][52].
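The next-brick prediction loop with validity checks and rollback described in Group 2 can be sketched as follows. This is a hedged illustration of the general technique, not the released BrickGPT code; `propose_next_brick`, the brick data format, and the rollback policy are assumptions.

```python
# Hedged sketch of autoregressive brick generation with validity checks and
# rollback, as summarized above. Helper names and data formats are assumptions,
# not the BrickGPT implementation.

def generate_structure(model, prompt, max_bricks=200, max_retries=20):
    structure = []                  # accepted bricks so far (the token sequence)
    retries = 0
    while len(structure) < max_bricks:
        brick = model.propose_next_brick(prompt, structure)   # LLM next-token step
        if brick is None:                                     # end-of-structure token
            break
        if is_valid(structure + [brick]):                     # physics/collision check
            structure.append(brick)
            retries = 0
        else:
            retries += 1                                      # reject and resample
            if retries > max_retries and structure:           # rollback: undo last brick
                structure.pop()
                retries = 0
    return structure

def is_valid(bricks):
    """Simplified placeholder for the paper's checks: here it only rejects
    overlapping placements; the real analysis also verifies connectivity
    and physical stability."""
    cells = [c for b in bricks for c in b["occupied_cells"]]
    return len(cells) == len(set(cells))
```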
BAAI open-sources EditScore: unlocking unlimited possibilities for online reinforcement learning in image editing
机器之心· 2025-10-22 03:30
As multimodal large models continue to evolve, instruction-guided image editing has made remarkable progress. However, existing models still struggle to follow complex, fine-grained text instructions; users often need multiple attempts and manual filtering, and stable, high-quality "one-shot" editing remains out of reach. Reinforcement learning (RL) offers a highly promising path for models to self-improve and strengthen instruction following. Yet its application to image editing has long been limited by a core bottleneck: the lack of a reward model that can accurately assess editing quality and provide high-fidelity feedback. Without a reliable reward signal, a model cannot effectively judge the quality of its own outputs and thus cannot optimize itself efficiently. To tackle this problem, the VectorSpace Lab team at the Beijing Academy of Artificial Intelligence (BAAI) recently released EditScore, a new family of high-fidelity reward models. The work directly addresses the challenge above, aiming to provide precise, reliable reward signals for instruction-guided image editing and to pave the way for deeper applications of reinforcement learning in AIGC, truly unlocking its potential. EditScore is BAAI's next major exploration toward more general and more controllable generative AI, following its unified image generation model series OmniGen. To promote ...
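To show where a reward model like EditScore slots in, here is a minimal sketch of best-of-N selection and one online RL step driven by a learned reward signal. The `editor`, `reward_model`, and `rl_update` interfaces are hypothetical placeholders assumed for illustration, not BAAI's released API.

```python
# Minimal sketch of using a learned reward model as the feedback signal for
# instruction-guided image editing, either via best-of-N selection or one
# online RL step. All interfaces below are assumed placeholders.

def best_of_n_edit(editor, reward_model, image, instruction, n=8):
    """Sample N candidate edits and keep the one the reward model scores highest."""
    candidates = [editor.edit(image, instruction) for _ in range(n)]
    scores = [reward_model.score(image, instruction, c) for c in candidates]
    return candidates[scores.index(max(scores))]

def online_rl_step(editor, reward_model, batch, rl_update):
    """One policy-improvement step: reward-model scores stand in for human labels."""
    rollouts = []
    for image, instruction in batch:
        edited = editor.edit(image, instruction)
        reward = reward_model.score(image, instruction, edited)
        rollouts.append((image, instruction, edited, reward))
    rl_update(editor, rollouts)     # e.g., a policy-gradient style update
```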
One photo, one 3D "you": the Institute of Computing Technology and collaborators propose HumanLift for high-fidelity digital human reconstruction
机器之心· 2025-10-21 23:20
Core Insights
- The article discusses the development of a new technology called HumanLift, which enables the reconstruction of high-quality, realistic 3D digital humans from a single reference image, addressing challenges in 3D consistency and detail accuracy [2][4][25].

Part 1: Background
- Traditional methods for single-image digital human reconstruction are categorized into explicit and implicit approaches, each with limitations in handling complex clothing and achieving realistic textures [8].
- Recent advancements in generative models and neural implicit rendering have improved the connection between 2D images and 3D space, yet challenges remain in high-fidelity 3D human modeling due to data scarcity and complexity in human poses and clothing [8][9].

Part 2: Algorithm Principles
- HumanLift aims to create a 3D digital representation that captures realistic appearance and fine details from a single image, utilizing a two-stage process (a schematic sketch follows this summary) [11].
- The first stage generates realistic multi-view images from a single photo using a 3D-aware multi-view human generation method, incorporating a backbone network based on a video generation model [13][14].
- The second stage reconstructs the 3D representation using the generated multi-view images, optimizing parameters based on a Gaussian-mesh representation [15][17].

Part 3: Effectiveness Demonstration
- HumanLift demonstrates its capability by generating multi-view RGB and normal images from real-world photographs, achieving photo-realistic results and maintaining spatial consistency [20].
- Ablation studies confirm the importance of facial enhancement and SMPL-X pose optimization in improving detail quality and rendering accuracy [21][22][23].

Part 4: Conclusion
- The development of HumanLift represents a significant advancement in single-image full-body digital human reconstruction, overcoming traditional limitations and providing a user-friendly solution for high-quality 3D modeling [25].
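Part 2 above describes a two-stage pipeline; the sketch below lays it out schematically. The function names and the `init_pose` argument are hypothetical placeholders, not the HumanLift code.

```python
# Schematic of the two-stage pipeline summarized above: (1) 3D-aware multi-view
# generation from one photo, (2) optimization of a Gaussian-mesh 3D human from
# those views. Function names are assumed placeholders.

def humanlift_pipeline(reference_image, multiview_generator, gaussian_optimizer):
    # Stage 1: generate consistent multi-view RGB (and normal) images,
    # using a backbone adapted from a video generation model per the summary.
    views = multiview_generator(reference_image)

    # Stage 2: fit the Gaussian-mesh representation to the generated views,
    # with SMPL-X pose optimization and facial enhancement noted as important.
    avatar = gaussian_optimizer(views, init_pose="smplx")
    return avatar
```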
Just announced: OpenAI releases ChatGPT Atlas, an AI browser built on Chromium
机器之心· 2025-10-21 23:20
Report from 机器之心. Editor: Panda.

Just now, OpenAI released its AI browser, ChatGPT Atlas.

This comes as no surprise: Sam Altman has repeatedly expressed OpenAI's interest in browsers, and he even voiced potential interest in acquiring Chrome when Google looked like it might be forced to sell it, saying bluntly, "If Chrome really ends up for sale, we should take a look." Perhaps because he cares about it so much, Altman personally took part in the Atlas launch. Now that Google no longer faces the risk of being forced to sell Chrome, it makes complete sense for OpenAI to ship its own browser, a move that will no doubt intensify its competition with Google (Chrome) and Microsoft (Edge).

ChatGPT Atlas

ChatGPT Atlas is currently available only for macOS and is free for Free, Plus, Pro, and Go users; anyone interested can already download it here: https://chatgpt.com/atlas OpenAI also said that Windows, iOS, and Android versions are coming. At its core, ChatGPT Atlas connects ChatGPT to the user's browser, so that "ChatGPT can see the page you are on, and through Ask ...
An "astonishing consensus" between academia and industry on embodied intelligence? Meituan held an academic annual meeting at IROS
机器之心· 2025-10-21 09:32
Group 1
- The core theme of the IROS conference this year is "Robotics for Better Life," focusing on the integration of embodied intelligence and retail services, which aligns with Meituan's vision and business philosophy [5][3]
- Meituan's drone delivery service has achieved initial scale and has begun global operations, being the only company approved by the Civil Aviation Administration of China to operate drones in all cities, including night flights [5][3]
- The conference featured significant academic contributions from Meituan's robotics research institute, with six papers presented, highlighting the practical value of research in areas like drone flight and visual semantic recognition [7][10]

Group 2
- Professor Xi Ning emphasized the need to move away from traditional robotic planning paradigms to more effectively utilize artificial intelligence methods [13][11]
- The GAT (Generative Adversarial Tri-model) proposed by Professor Xi aims to combine physical models with data-driven models to enhance task planning and control in robotics [18][21]
- The concept of Non-Vector Space Models was introduced to improve perception and measurement processes in robotics, moving beyond traditional vector data approaches [28][24]

Group 3
- CEO Wang Qian of Zivariable Robotics discussed the necessity of an end-to-end model for embodied intelligence, highlighting the complexity of physical interactions [43][40]
- The importance of data quality and diversity over sheer quantity was stressed, with a focus on experiential learning as a means to overcome data limitations in embodied intelligence [52][53]
- The roundtable discussion at the conference explored the foundational principles of embodied intelligence, emphasizing the integration of physical laws with data-driven learning [57][60]

Group 4
- There is a consensus among both academia and industry regarding the complexity of the physical world and the need to combine physical and data-driven approaches in building embodied intelligence models [63][62]
- The outdated notion of "large models + automation = embodied intelligence" is being challenged, with a call for the development of genuine embodied intelligence models [64][61]
- The closing remarks from the roundtable participants highlighted the importance of dreams, curiosity, and the pursuit of knowledge in advancing the field of embodied intelligence [65][66][67]
How was Doubao forged? ByteDance releases a paper on ByteRobust, its in-house training system for 10,000-GPU clusters
机器之心· 2025-10-21 09:32
Core Insights
- The article discusses the challenges and advancements in training large language models (LLMs), particularly focusing on ByteDance's robust training infrastructure, ByteRobust, which aims to minimize training interruptions and enhance fault diagnosis and recovery efficiency [3][7][25].

Group 1: Training Infrastructure and Challenges
- The core infrastructure for LLM training is GPUs, with training scales reaching tens of thousands of GPUs, leading to increased training times and frequent hardware failures [1][2].
- ByteDance's training of a 175B parameter model utilized 12,288 GPUs, while a 405B parameter model, LLaMA 3, required 16,384 NVIDIA H100 GPUs and took 54 days to pre-train [1].
- Faults such as CUDA errors and task hangs occur frequently, with Meta reporting hardware failures approximately every 2.78 hours during training on 16,000 GPUs [1][2].

Group 2: ByteRobust Overview
- ByteRobust is designed to achieve a high effective training time ratio (ETTR) by efficiently diagnosing and handling events during LLM training (a worked ETTR example follows this summary) [7][25].
- The infrastructure consists of two main components: a control plane for event management and a data plane for monitoring and diagnostics [8][10].

Group 3: Control Plane and Data Plane Functions
- The control plane coordinates robust event handling strategies, including anomaly detection and fault localization, while the data plane integrates monitoring, diagnostics, and checkpoint management [10][11].
- The Robust Controller in the control plane manages an automated fault mitigation framework, utilizing real-time monitoring for most events [10][12].

Group 4: Fault Tolerance Mechanisms
- ByteRobust emphasizes rapid fault isolation over precise fault localization to minimize GPU idling during large-scale training [13][14].
- The automated fault tolerance framework includes real-time checks, in-depth diagnostics, and mechanisms for quick recovery from transient faults [19][20].

Group 5: Performance Metrics and Results
- ByteRobust has been deployed for over a year, effectively reducing event detection time and resolving incidents through its automated framework [25].
- In a three-month period, ByteRobust identified 38,236 explicit faults and 5,948 implicit faults across 778,135 LLM training tasks [26].
- The system achieved a maximum ETTR of 97% during intensive model training using 9,600 GPUs, demonstrating significant improvements in recovery speed with warm standby and hot update mechanisms [28][35].

Group 6: Model Training Insights
- ByteDance's experiments showed that the warm standby and hot update mechanisms improved recovery speeds by up to 10.87 times and 11.04 times, respectively [28].
- The effective checkpoint mechanism implemented in ByteRobust incurs less than 0.9% overhead, facilitating faster fault switching [31].
- The training of dense models and MoE models revealed that while dense models had higher performance optimizations, MoE training introduced additional complexities that could lead to increased manual restarts [38].
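ETTR, the headline metric above, is simply the fraction of allocated cluster time that goes to productive training. A small worked example follows, with numbers chosen only for illustration (they are not ByteDance's measurements).

```python
# Worked example of the effective training time ratio (ETTR): productive
# training time divided by total allocated time. Numbers are illustrative.

allocated_hours = 24 * 30                  # one month of allocated cluster time
lost_to_faults_hours = 12.0                # detection, diagnosis, restarts
lost_to_checkpoint_hours = 9.6             # checkpoint save/load overhead

effective_hours = allocated_hours - lost_to_faults_hours - lost_to_checkpoint_hours
ettr = effective_hours / allocated_hours
print(f"ETTR = {ettr:.3f}")                # 0.970, i.e., 97% effective time
```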
Tsinghua and Kuaishou propose AttnRL: letting large models explore with "attention"
机器之心· 2025-10-21 09:32
Core Insights
- The article discusses the advancements in reinforcement learning (RL), particularly focusing on Process-Supervised RL (PSRL) and the introduction of a new framework called AttnRL, which enhances exploration efficiency and performance in reasoning models [3][4][9].

Group 1: Challenges in Traditional Methods
- Traditional PSRL methods assign equal reward signals to all tokens, neglecting the fine-grained quality during the reasoning process [7].
- Existing PSRL approaches face significant bottlenecks in exploration efficiency and training costs, leading to high computational expenses [4][10].

Group 2: Introduction of AttnRL
- AttnRL introduces an innovative exploration method by utilizing attention mechanisms to guide the reasoning process, allowing the model to branch from high-attention steps [9][12].
- The framework employs Attention-based Tree Branching (ATB), which analyzes the reasoning sequence and calculates Forward Context Influence (FCI) scores to determine the most impactful steps for branching (an FCI-style sketch follows this summary) [13][16].

Group 3: Adaptive Sampling Mechanisms
- AttnRL incorporates two adaptive sampling mechanisms: difficulty-aware exploration and dynamic batch adjustment, optimizing the learning process by focusing on challenging problems while reducing computational load on simpler ones [20][22].
- The training process is streamlined to a One-Step Off-Policy approach, significantly reducing sampling costs compared to previous PSRL methods [23].

Group 4: Experimental Results
- AttnRL demonstrates superior performance across various mathematical reasoning benchmarks, achieving average accuracy rates of 57.2% for 1.5B models and 68.7% for 7B models, outperforming baseline methods like GRPO and TreeRL [28].
- The framework shows improved efficiency in sampling, with a higher effective ratio and better performance in fewer training steps compared to traditional methods [29][31].

Group 5: Future Outlook
- The introduction of attention scores in PSRL exploration decisions opens new avenues for enhancing model interpretability and RL research, suggesting that efficiency and intelligence can coexist through more effective exploration strategies [34].
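The Forward Context Influence idea in Group 2 (score each reasoning step by how much attention later tokens pay back to it, then branch rollouts from the highest-scoring steps) can be sketched as below. The aggregation here, a plain mean over the received attention, is an assumption; the exact formula in the AttnRL paper may differ.

```python
# Hedged sketch of an FCI-style score: for each reasoning step, average the
# attention that later tokens direct back at it, then branch from the top
# steps. The exact aggregation used by AttnRL may differ from this mean.
import numpy as np

def fci_scores(attention: np.ndarray, step_spans: list[tuple[int, int]]) -> list[float]:
    """attention: (seq_len, seq_len) causal attention weights.
    step_spans: [start, end) token spans, one per reasoning step."""
    scores = []
    for start, end in step_spans:
        later = attention[end:, start:end]     # attention from tokens after the step
        scores.append(float(later.mean()) if later.size else 0.0)
    return scores

def pick_branch_steps(attention, step_spans, top_k=2):
    """Return indices of the top-k highest-FCI steps to branch rollouts from."""
    scores = fci_scores(attention, step_spans)
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
```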
DeepSeek's new model is wild: the entire AI community is digging into the vision route, and Karpathy drops the pretense
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses the groundbreaking release of the DeepSeek-OCR model, which compresses 1000 words into 100 visual tokens while maintaining a high accuracy of 97% [1]
- This model addresses the long-context efficiency issue in large language models (LLMs) and suggests a paradigm shift where visual inputs may be more effective than textual inputs [1][5]

Group 1: Model Features and Performance
- DeepSeek-OCR can process 200,000 pages of data daily using a single NVIDIA A100 GPU [1]
- The model's compression efficiency is ten times better than traditional text tokens, allowing for a significant reduction in the number of tokens needed to represent information (see the arithmetic after this summary) [9]
- The model eliminates the need for tokenizers, which have been criticized for their complexity and inefficiency [6]

Group 2: Community Reception and Expert Opinions
- The open-source nature of DeepSeek-OCR has led to widespread validation and excitement within the AI community, with over 4000 stars on GitHub shortly after its release [2][1]
- Experts like Andrej Karpathy have praised the model, highlighting its potential to redefine how LLMs process inputs [3][5]
- The model has sparked discussions about the efficiency of visual tokens compared to text tokens, with some researchers noting that visual representations may offer better performance in certain contexts [9][11]

Group 3: Implications for Future Research
- The article suggests that the use of visual tokens could significantly expand the effective context length of models, potentially allowing for the integration of extensive internal documents into prompts [12][13]
- There are references to previous research that laid the groundwork for similar concepts, indicating that while DeepSeek-OCR is innovative, it is part of a broader trend in the field [18][20]
- The potential for combining DeepSeek-OCR with other recent advancements, such as sparse attention mechanisms, is highlighted as a promising avenue for future exploration [11][12]
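A quick back-of-the-envelope check on the compression claim above, using the figures quoted in the post (about 1000 words represented by about 100 visual tokens) and assuming roughly one text token per word, which is a simplification made only for illustration:

```python
# Back-of-the-envelope arithmetic for the compression claim: ~1000 words cost
# ~100 visual tokens vs. roughly 1000 text tokens (assuming ~1 token per word,
# a simplification), i.e., the ~10x figure quoted above.

words = 1000
text_tokens = words * 1          # rough assumption: ~1 text token per word
visual_tokens = 100              # as reported for DeepSeek-OCR

print(text_tokens / visual_tokens)        # 10.0x compression

# The same ratio applied to a fixed context budget (illustrative 128k tokens):
context_window = 128_000
print(context_window // visual_tokens)    # ~1280 such pages as visual tokens
print(context_window // text_tokens)      # ~128 pages as plain text
```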