机器之心
Can AI really understand the physical world? FysicsWorld: filling the gap in omni-modal interaction and physical-perception evaluation
机器之心· 2025-12-28 04:44
Core Insights
- The article discusses the rapid paradigm shift in multimodal large language models, focusing on the development of unified full-modal models capable of processing and generating information across various modalities, including language, vision, and audio [2][4]
- The driving force behind this shift is the complexity of the real physical world, where humans have historically relied on multimodal information to understand and interact with their environment [3]
- A new benchmark called FysicsWorld has been introduced to evaluate models' capabilities in understanding, generating, and reasoning across multiple modalities in real-world scenarios [4][10]

Summary by Sections

Introduction to Multimodal Models
- Multimodal models are evolving from simple combinations of visual and textual data to more complex integrations that include audio and other sensory modalities [12]
- There is a growing expectation for these models to accurately understand and interact with complex real-world environments [12]

FysicsWorld Benchmark
- FysicsWorld is the first unified benchmark designed to assess models' abilities in multimodal tasks, covering 16 tasks that span various real-world scenarios [6][10]
- The benchmark includes a cross-modal complementarity screening strategy to ensure that tasks require genuine multimodal integration, avoiding reliance on single-modal shortcuts [8][23]

Evaluation Framework
- The evaluation framework of FysicsWorld is comprehensive, covering tasks from basic perception to high-level interactions, ensuring a thorough assessment of models' capabilities [15][17]
- The benchmark aims to address the limitations of existing evaluation systems, which often focus on text-centric outputs and lack real-world applicability [16]

Performance Insights
- Initial evaluations using FysicsWorld reveal significant performance gaps among current models, particularly in tasks requiring deep cross-modal reasoning and interaction in real-world contexts [31]
- The results indicate that while models have made progress in basic multimodal tasks, they still struggle with complex scenarios that require robust integration of multiple sensory inputs [31][34]

Future Directions
- The article emphasizes the need for further advancements in cross-modal integration, dynamic environment understanding, and physical constraint reasoning to achieve true full-modal intelligence [35]
- FysicsWorld serves as a critical tool for researchers to map and improve models' capabilities in real-world multimodal interactions [36]
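The cross-modal complementarity screening idea can be illustrated with a minimal sketch: given a probe that answers a question from a chosen subset of modalities, an item is kept only if no single modality suffices while the full combination does. This is an illustration under assumptions, with a hypothetical `solve(item, modalities)` probe interface and invented field names; it is not the benchmark's actual pipeline.

```python
from typing import Callable, Dict, List

# Hypothetical item schema, e.g. {"question": str, "modalities": {"video": ..., "audio": ...}, "answer": str}
Item = Dict

def is_complementary(item: Item, solve: Callable[[Item, List[str]], str]) -> bool:
    """Keep an item only if no single modality suffices but the full set does."""
    modal_names = list(item["modalities"].keys())

    # Reject items with a single-modal shortcut: any one modality already answers correctly.
    for name in modal_names:
        if solve(item, [name]) == item["answer"]:
            return False

    # Require that combining all modalities actually recovers the correct answer.
    return solve(item, modal_names) == item["answer"]

def screen(pool: List[Item], solve: Callable[[Item, List[str]], str]) -> List[Item]:
    """Filter a candidate pool down to genuinely multimodal items."""
    return [item for item in pool if is_complementary(item, solve)]
```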
An AI-generated email made the father of the Go language swear out loud
机器之心· 2025-12-28 04:44
Core Viewpoint
- The article discusses the backlash from Rob Pike, a prominent programmer and co-creator of the Go language, against an AI-generated email that expressed gratitude for his contributions to the tech field, highlighting his frustrations with AI's impact on programming and the environment [1][5][8]

Group 1: AI and Programming Community Reactions
- Rob Pike's anger stemmed from the realization that the email was generated by AI, which he deemed "AI garbage" [5]
- Other prominent figures in programming, like Guido van Rossum, also received similar emails, indicating a broader issue within the community regarding AI-generated content [5]
- The general sentiment among programmers reflects a growing disdain for AI-generated code, with many feeling that it leads to a degradation of their foundational skills [13][14]

Group 2: Environmental and Social Concerns
- Pike expressed concerns about the environmental impact of AI, citing the significant hardware resources wasted and the societal disruptions caused by AI technologies [8]
- There is a perception that AI models exploit data from individuals without providing any compensation, raising ethical questions about data usage [8]

Group 3: Adaptation to AI in Programming
- The article notes a sense of panic among programmers regarding the rapid advancement of AI, with some feeling left behind as AI tools become more capable [16]
- Despite the fear, there are suggestions within the community to embrace AI programming tools to gain experience and adapt to the changing landscape [22]
- Boris Cherny, creator of Claude Code, shared data showing extensive AI-generated contributions, indicating a shift in how programming tasks are approached [18]
Beyond compression, should the Visual Tokenizer also understand the world?
机器之心· 2025-12-28 01:30
Core Insights
- The article discusses the evolution of the Visual Tokenizer and its significance in understanding the world, suggesting that the next step in its development is to enhance its ability to comprehend high-level semantics rather than just focusing on pixel-level reconstruction [5][6][9]

Group 1: Visual Tokenizer Research
- MiniMax and researchers from Huazhong University of Science and Technology have released a new study on Visual Tokenizer Pre-training (VTP), which has sparked significant interest in the industry [6]
- Traditional visual generation models typically involve a two-step process: compressing images with a tokenizer (such as a VAE) and then training a generative model in the latent space [6]
- The study indicates that improving the performance of generative models can be achieved not only by scaling the main model but also by enhancing the tokenizer [6][8]
- The research reveals that focusing solely on pixel-level reconstruction can lead to a decline in downstream generative quality, as traditional tokenizers tend to favor low-level pixel information over high-level semantic representation [7][8]
- VTP proposes that introducing semantic understanding into tokenizer pre-training can make latent representations more sensitive to high-level semantics without overly memorizing pixel details [8][9]

Group 2: VTP Framework and Findings
- The VTP framework integrates image-text contrastive learning (as in CLIP), self-supervised learning (as in DINOv2), and a traditional reconstruction loss to optimize the latent space of visual tokenizers [9][10]
- The framework retains a lightweight reconstruction loss for visual fidelity while introducing two semantics-oriented tasks: a self-supervised loss based on DINOv2 and a contrastive loss based on CLIP [9][10]
- Experimental results show a strong positive correlation between the semantic quality of the latent space (measured by zero-shot classification accuracy) and generative performance (measured by FID) [11]
- The largest VTP model (approximately 700 million parameters) achieved a zero-shot classification accuracy of 78.2% on ImageNet, with a reconstruction fidelity (rFID) of 0.36, comparable to specialized representation learning models [11][12]
- Replacing the tokenizer in standard diffusion model training with VTP led to a 65.8% reduction in FID relative to the baseline and a fourfold increase in convergence speed [12][13]
- This indicates that investing more computational resources in tokenizer pre-training can significantly enhance downstream generative quality without increasing the complexity of the generative model [13]
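As a rough illustration of the kind of multi-objective tokenizer pre-training the summary describes, the sketch below combines a lightweight reconstruction loss with two semantics-oriented terms: a DINOv2-style feature-matching loss against a frozen teacher and a CLIP-style image-text contrastive loss. The module interfaces, pooling, temperature, and loss weights are assumptions made for illustration (features are assumed to share one dimension); this is not VTP's actual code.

```python
import torch
import torch.nn.functional as F

def vtp_style_loss(tokenizer, decoder, text_encoder, dino_teacher,
                   images, texts, w_rec=1.0, w_dino=0.5, w_clip=0.5):
    """Sketch of a tokenizer objective mixing reconstruction with semantic terms.
    tokenizer, decoder, text_encoder, dino_teacher are assumed callables whose
    feature outputs share the same dimension; the weights are illustrative."""
    latents = tokenizer(images)                     # (B, N, D) latent tokens

    # 1. Lightweight pixel reconstruction keeps visual fidelity.
    recon = decoder(latents)
    loss_rec = F.mse_loss(recon, images)

    # 2. DINOv2-style term: match pooled latents to a frozen self-supervised teacher.
    with torch.no_grad():
        teacher_feat = dino_teacher(images)         # (B, D)
    student_feat = latents.mean(dim=1)              # simple mean pooling for the sketch
    loss_dino = 1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()

    # 3. CLIP-style contrastive term: align pooled latents with paired text embeddings.
    txt = F.normalize(text_encoder(texts), dim=-1)  # (B, D)
    img = F.normalize(student_feat, dim=-1)
    logits = img @ txt.t() / 0.07                   # fixed temperature for the sketch
    targets = torch.arange(images.size(0), device=images.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    return w_rec * loss_rec + w_dino * loss_dino + w_clip * loss_clip
```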
Farewell to "single-threaded" thinking: BIGAI proposes the NPR framework, giving agents a natively parallel reasoning brain
机器之心· 2025-12-27 04:01
In recent years, large language models have improved rapidly at writing long, fluent text. But when tasks escalate to genuinely complex reasoning scenarios, ones that require exploring several directions at once, reflecting on and cross-checking intermediate results, and aggregating and weighing multiple threads of evidence, traditional Chain-of-Thought reasoning starts to struggle: it is easily biased by early judgments, explores too narrowly, self-corrects poorly, and its sequential generation is inherently inefficient.

The latest work from the Language Interaction Lab (NLCo) at the Beijing Institute for General Artificial Intelligence (BIGAI), the Native Parallel Reasoner (NPR), targets exactly this bottleneck: it lets an agent spawn and maintain multiple candidate reasoning paths within a single round of thinking, branching and aggregating at key points, and finally piecing the clues together like a puzzle to synthesize the best answer.

More importantly, NPR's contribution is not merely an engineering trick for parallel generation. It proposes a three-stage training paradigm of self-distillation plus parallel reinforcement learning, together with a dedicated parallel reasoning engine, with the goal of turning parallel reasoning from a bolt-on add-on into a native cognitive capability of the model.

Research on language agents has already shifted attention from extending a single chain of thought to multi-step deep reasoning. Deeper reasoning is exciting, but what future superintelligence really needs is the ability to explore multiple possibilities in parallel, more broadly ...
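To make the branch-and-aggregate idea concrete, here is a minimal sketch of parallel path exploration followed by aggregation. It treats the model as a black-box `generate` function and aggregates by simple answer voting; NPR's self-distillation and parallel RL training, and its dedicated parallel inference engine, are not reflected here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def parallel_reason(question: str,
                    generate: Callable[[str], str],       # black-box LLM call (assumed)
                    extract_answer: Callable[[str], str],  # pulls the final answer from a trace
                    n_branches: int = 4) -> str:
    """Spawn several candidate reasoning paths, then aggregate by majority vote."""
    prompts = [
        f"{question}\n\nExplore solution path #{i + 1} independently and state a final answer."
        for i in range(n_branches)
    ]

    # Branch: run the candidate paths concurrently (naive thread-level parallelism here).
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        traces: List[str] = list(pool.map(generate, prompts))

    # Aggregate: collect each path's final answer and pick the most common one.
    answers = [extract_answer(t) for t in traces]
    return Counter(answers).most_common(1)[0][0]
```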
AI heavyweight Karpathy is anxious: as a programmer, I have never felt so far behind
机器之心· 2025-12-27 04:01
Core Insights
- The article discusses the transformative impact of AI on the programming profession, highlighting a shift in which programmers write less code by hand and instead focus on integrating various tools [4][9]
- It emphasizes the need for programmers to adapt to new technologies and methodologies to avoid falling behind, suggesting that those who leverage AI tools effectively can achieve significant productivity gains [4][5]

Group 1: Industry Transformation
- Andrej Karpathy expresses a feeling of being left behind as a programmer, noting that the profession is undergoing a fundamental restructuring due to AI advancements [4]
- The emergence of a new programmable abstraction layer requires mastery of concepts such as agents, prompts, and workflows, which are essential for navigating the evolving landscape [4]
- The rapid evolution of AI tools is likened to a powerful alien tool distributed without instructions, leading to a seismic shift in the industry [4]

Group 2: Programmer Adaptation
- Experienced engineers are finding themselves needing to relearn and adjust their expectations of AI capabilities as new models continuously improve [6][8]
- A specific example describes a senior engineer who had to remind himself to use AI tools like Claude for debugging, which can outperform traditional methods [8]
- The article notes that newcomers to the field may adapt more quickly to AI tools because they carry fewer preconceptions about what the tools can do [8]

Group 3: Community Reactions
- The article highlights a range of reactions from the programming community, with some expressing anxiety about falling behind while others adopt a more relaxed approach, viewing the changes as opportunities for creativity [11][12]
- Some industry experts emphasize the importance of depth in specific areas rather than spreading oneself too thin across multiple languages or fields [12]
- A notable sentiment is that AI is not replacing programmers but rather changing the nature of programming languages and practices [13]

Group 4: AI Development Trends
- The article cites data indicating that the Epoch Capabilities Index (ECI), which measures AI capabilities, has been growing at nearly double the rate of the previous two years, an acceleration of roughly 90% since April 2024 [19][20]
- This rapid advancement in AI technology is anticipated to continue, potentially leading to unprecedented developments by 2026 [20][23]
SIGGRAPH Asia 2025 | When video generation truly "sees a person clearly": a unified framework for multi-view identity consistency, realistic lighting, and controllable cameras
机器之心· 2025-12-27 04:01
The first author, Yuancheng Xu, is a research scientist at Netflix Eyeline, focusing on the research and development of foundational AI models across multimodal understanding, reasoning, interaction, and generation, with an emphasis on controllable video generation and its applications in film and television production. He received his Ph.D. from the University of Maryland, College Park in 2025.

The last author, Ning Yu, is a senior research scientist at Netflix Eyeline, leading R&D on video generation AI for film and television production. He previously worked at Salesforce, NVIDIA, and Adobe, and holds a joint Ph.D. from the University of Maryland and the Max Planck Institute. He has been a finalist for the Qualcomm Fellowship and the CSAW Europe Best Paper Award, and has received the Amazon Twitch Fellowship, a Microsoft fellowship, and an SPIE Best Student Paper award. He serves as an area chair for top conferences including CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR, and as an action editor of TMLR.

In film and virtual production, "seeing a person clearly" has never meant seeing a single frame clearly. Through camera motion and changing light, a director lets the audience gradually build a complete understanding of a character across different viewpoints and lighting conditions. Yet in much of the current research on customizing video generation models, this most basic fact is often overlooked.

The overlooked core problem: Multi-view Ident ...
Musk's Christmas gift: every image on X can now be AI-edited with one click, and artists worldwide are furious
机器之心· 2025-12-27 02:45
Core Viewpoint
- The introduction of Grok AI's image editing capabilities on the X platform marks a significant shift towards generative creative platforms, allowing users to edit images directly and even convert static images into short videos, which could disrupt traditional content creation and artistic professions [2][3][11]

Group 1: New Features and Capabilities
- The X platform has added an "Edit Image" option to all images, powered by the Grok AI model, enabling users to edit any image they see on the platform [2]
- Grok AI can transform static images into 6-15 second videos, automatically animating elements like blinking and background movement, enhancing user engagement [3]
- The new editing tools have led to a surge in user-generated content, with many users experimenting with the new features [11]

Group 2: Impact on Creators
- The new features are seen as a threat to artists, as their original works can be easily edited by others without consent, leading to concerns about the devaluation of artistic labor [11][22]
- Prominent artists have expressed their dissatisfaction, indicating that they may stop sharing their work on the platform due to the lack of control over how their images are used [22]
- The platform's updated terms of service allow user-generated content to be used for machine learning purposes, raising further concerns among creators about the exploitation of their work [22]

Group 3: User Concerns and Reactions
- Users have raised alarms about the potential misuse of the AI editing feature, which allows anyone to edit images of real people, including personal photos, without consent [19]
- There is currently no option to disable the AI editing feature, leading to frustration among users who feel their privacy and rights are compromised [22]
- Suggestions have been made to upload images in formats like GIF to prevent editing, although this may reduce image quality [23]
Top journal TPAMI | A major data update for multimodal video understanding: MeViSv2 released
机器之心· 2025-12-26 04:35
The field of multimodal video understanding has just received a major update: the MeViSv2 dataset, jointly built by Fudan University, Shanghai University of Finance and Economics, and Nanyang Technological University, has been officially released and accepted by the top journal IEEE TPAMI.

As one of the most representative datasets in the field, MeViSv2 challenges existing models' multimodal processing abilities around complex motion reasoning. It contains 2,006 videos, 8,171 target objects, and 33,072 text/audio expressions, and evolves toward native multimodality through the addition of 150,000 seconds of audio data.

The dataset fully supports four core tasks (RVOS, RMOT, AVOS, and RMEG) and further introduces mechanisms such as "no-target expressions" and "motion reasoning", aiming to push the ceiling of models' logical reasoning and robustness. The dataset, code, and evaluation platform are all open.

Figure 1: MeViS examples. Expressions in MeViS focus primarily on motion attributes, so the target object cannot be identified from a single frame alone. The latest MeViSv2 additionally provides motion reasoning and no-target expressions, and supplies a corresponding audio recording for every text expression.

Paper: MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation, TPAMI 20 ...
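As a rough picture of what a referring-expression record in a dataset like MeViSv2 might look like, the sketch below defines a minimal annotation structure covering the features named above: a motion-centric text expression, an optional paired audio recording, and a possibly empty set of target object IDs for no-target expressions. The field names are illustrative assumptions, not the dataset's actual schema; consult the released dataset for the real format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReferringExpression:
    """One expression referring to zero or more objects in a video (hypothetical schema)."""
    video_id: str
    text: str                          # motion-centric description, e.g. "the bird that takes off last"
    audio_path: Optional[str] = None   # paired spoken recording of the expression, if provided
    target_object_ids: List[int] = field(default_factory=list)  # empty => no-target expression
    requires_motion_reasoning: bool = False

    @property
    def is_no_target(self) -> bool:
        # No-target expressions describe an object that is absent from the video;
        # a robust model should predict an empty mask set rather than guess.
        return len(self.target_object_ids) == 0
```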
Do agents "remember the treats but not the lessons"? Huawei Noah's Ark Lab & CUHK release SCOPE: self-evolving prompts that double the HLE success rate
机器之心· 2025-12-26 04:35
Published via 机器之心

When an agent hits a tool-call error, the error log often already contains the solution: the correct parameter format, valid API usage, or even a directly usable alternative. A static prompt, however, gives the agent no way to "learn a lesson" from this feedback, so it falls into an "error loop": it acknowledges the failure yet repeats the same action.

The SCOPE framework, jointly released by Huawei's Noah's Ark Lab and The Chinese University of Hong Kong, aims to solve this problem.

SCOPE's core idea is that since an agent is invoked repeatedly, its prompt can keep evolving during execution. By automatically distilling guidance rules from execution trajectories, SCOPE lets the agent learn from its mistakes and consolidates that experience into the prompt, achieving self-evolution.

Two major failure modes of agents

The research team analyzed agent execution logs on the GAIA and DeepSearch benchmarks and identified two typical failure modes:

Paper: "SCOPE: Prompt Evolution for Enhancing Agent Effectiveness"
Paper link: https://arxiv.org/abs/2512.15374
Code: https://github.com/Jarvis ...
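The prompt self-evolution loop described above can be sketched as follows: after each episode, a reflection step distills guidance rules from the execution trajectory (especially tool-call errors), and those rules are appended to the system prompt for subsequent calls. The function names and the agent/LLM interfaces are assumptions made for illustration; SCOPE's actual rule-extraction prompts and filtering live in the linked repository.

```python
from typing import Callable, Dict, List

def evolve_prompt(base_prompt: str,
                  tasks: List[str],
                  run_agent: Callable[[str, str], Dict],      # (prompt, task) -> {"trajectory": str, "success": bool}
                  distill_rules: Callable[[str], List[str]],  # trajectory -> guidance rules (assumed LLM call)
                  max_rules: int = 20) -> str:
    """Accumulate guidance rules distilled from execution trajectories into the prompt."""
    rules: List[str] = []
    for task in tasks:
        prompt = base_prompt + "\n\nLessons from earlier runs:\n" + "\n".join(f"- {r}" for r in rules)
        result = run_agent(prompt, task)

        # Failed episodes are the main source of lessons: error logs usually contain
        # the correct parameter format or a usable alternative tool call.
        if not result["success"]:
            for rule in distill_rules(result["trajectory"]):
                if rule not in rules:
                    rules.append(rule)
            rules = rules[-max_rules:]   # keep the rule list bounded

    return base_prompt + "\n\nLessons from earlier runs:\n" + "\n".join(f"- {r}" for r in rules)
```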
A DeepSeek moment for video generation: Tsinghua & Shengshu open-source a framework with a 200x speedup, earning 2k GitHub stars in a week
机器之心· 2025-12-26 04:35
Core Insights
- The article discusses the launch of TurboDiffusion, an open-source framework developed by Tsinghua University's TSAIL team and Shengshu Technology, which dramatically accelerates video generation, cutting the time needed to produce a video from minutes to seconds [1][3][7]

Group 1: Technological Breakthrough
- TurboDiffusion marks a shift from the traditional render-and-wait workflow toward real-time generation, addressing the high inference latency that has limited the practical use of video generation models [3][7]
- The framework achieves roughly 200x acceleration for high-quality video generation, producing a 5-second 720p video in just 24 seconds on a single RTX 5090 GPU [26][43]
- The speedup comes from four core techniques: SageAttention and Sparse-Linear Attention for mixed attention acceleration, efficient step distillation, and W8A8 linear-layer quantization, which together raise generation efficiency without compromising quality [13][20][21]

Group 2: Implementation and Performance
- Mixed attention acceleration combines SageAttention and Sparse-Linear Attention (SLA), which optimize the attention mechanism for faster processing [14][17]
- Efficient step distillation reduces the number of sampling steps required for video generation from about 100 to as few as 3 or 4 while maintaining high video quality [20]
- W8A8 linear-layer quantization compresses the model size by about 50% and uses INT8 Tensor Cores to speed up linear-layer computation [21]

Group 3: Industry Impact
- TurboDiffusion lowers the computational barrier to high-end video creation, making it accessible to individual creators on consumer-grade GPUs [51]
- The framework enables near real-time video generation, supporting creative exploration by giving instant feedback on prompt adjustments [52]
- These advances are expected to enable applications that require immediate feedback, such as AI video live streaming and AR/VR content rendering [52]
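To give a feel for the W8A8 idea mentioned above (weights and activations both quantized to 8-bit integers, with linear layers executed on INT8 hardware paths), here is a minimal per-tensor symmetric quantization sketch. It simulates the low-precision matmul in plain PyTorch rather than calling real INT8 Tensor Core kernels, and the scaling scheme is an illustrative assumption, not TurboDiffusion's implementation.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor quantization to int8; returns (int8 tensor, scale)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor = None):
    """Simulated W8A8 linear layer: int8 weights and activations, then rescale.
    A real deployment would accumulate in int32 on INT8 Tensor Cores."""
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(weight)
    # Simulate the integer matmul in float; int8 products sum exactly within float32 here.
    acc = qx.float() @ qw.float().t()
    out = acc * (sx * sw)
    if bias is not None:
        out = out + bias
    return out

# Quick check: the quantized layer should roughly match the fp32 reference.
x = torch.randn(4, 64)
w = torch.randn(32, 64)
ref = x @ w.t()
approx = w8a8_linear(x, w)
print((ref - approx).abs().mean().item())  # small quantization error expected
```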