ACL Fellows 2025 Announced: Westlake University's Yue Zhang and UIUC's Heng Ji Among the Inductees
机器之心· 2025-12-13 08:31
Core Viewpoint
- The ACL has announced the list of 2025 ACL Fellows, recognizing significant contributions in the field of Natural Language Processing (NLP) [1].

Group 1: Overview of ACL Fellows
- A total of 11 scholars have been selected as ACL Fellows in 2025, with notable inclusions of two Chinese scholars: Heng Ji from the University of Illinois Urbana-Champaign and Yue Zhang from Westlake University [1].

Group 2: Heng Ji's Contributions
- Heng Ji is recognized for her important contributions in information extraction, multimodal and multilingual knowledge extraction, and "AI for Science" [6].
- She holds multiple positions at the University of Illinois, including Professor of Computer Science and Director of the Amazon-Illinois Interactive Dialogue Experience AI Center [7].
- Her research interests focus on NLP, particularly multimedia multilingual information extraction and knowledge-enhanced large language models [8].

Group 3: Yue Zhang's Contributions
- Yue Zhang is acknowledged for his contributions to structured prediction and generalization in NLP, as well as his service to the NLP community and education [12].
- He has held various academic positions, including a tenure as an Associate Professor at the Singapore University of Technology and Design [11].
- His research interests include NLP and underlying machine learning algorithms, with a focus on the differences between neural language models and human cognition [13].

Group 4: Other Notable Fellows
- Rada Mihalcea is recognized for her contributions in NLP, multimodal processing, and computational social science, including the development of the TextRank algorithm [16].
- Mohit Bansal is acknowledged for his work in question-answering systems, scientific applications, and multimodal AI [20].
- Saif Mohammad is recognized for his pioneering contributions in knowledge-based NLP and commonsense reasoning [31].
- Lori Levin is acknowledged for her work in computational emotion science and responsible NLP [36].
- Alexander Koller is recognized for foundational contributions in computational semantics and neural-symbolic architectures [43].
NeurIPS 2025 | Goodbye, Full-Dataset Scans! Zhejiang University's COIDO Cracks the "High-Cost" Problem of Multimodal Data Selection
机器之心· 2025-12-13 08:31
Core Insights
- The article introduces COIDO (Coupled Importance-Diversity Optimization), a framework designed to optimize data selection for visual instruction tuning in multi-modal large language models (MLLMs) [4][9][23].
- COIDO aims to reduce the computational cost of data selection while retaining high-quality data, addressing the limitation of existing methods that often require a full pass over the dataset [12][23].

Group 1: Motivation and Background
- The rapid growth of datasets such as LLaVA-665K has led to significant computational overhead and redundancy when fine-tuning MLLMs on full datasets [8].
- Existing data selection methods face two main issues: high selection cost, and the decoupling of importance from diversity during selection [12][9].

Group 2: Methodology
- COIDO introduces a lightweight scoring mechanism that is trained on a small sample (e.g., 20%) of the full dataset and generalizes without a full data traversal [14].
- Its core innovation is the coupled optimization of importance and diversity within a unified training framework, rather than treating them as separate phases [14].
- The importance loss is a reweighted cross-entropy loss, while the diversity loss uses spectral clustering and minimizes variance among clusters to keep the selected data diverse [14][15].

Group 3: Experimental Results
- Using only 20% of the data, COIDO achieves state-of-the-art results, reaching 98.2% of full-data fine-tuning performance across various benchmarks [20][21].
- The framework demonstrates strong generalization and transferability, outperforming models trained from scratch on new datasets [21].

Group 4: Conclusion
- COIDO presents a new paradigm for multi-modal data selection, challenging the notion that selection must be costly and providing a pathway for efficient fine-tuning of MLLMs [23][24].
- Its low computational cost and high-quality data selection make it a valuable tool for researchers with limited resources [23].
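To make the coupled objective concrete, here is a minimal sketch of how an importance term (reweighted cross-entropy) and a diversity term (variance of mean scores across clusters) could be combined into one training objective. The function names, the softmax weighting, the `lam` trade-off, and the hand-assigned clusters are illustrative assumptions; in COIDO itself the clusters come from spectral clustering over features, and the exact loss forms are those of the paper.

```python
import math

def importance_loss(scores, ce_losses):
    # Assumed form: softmax over the lightweight scorer's outputs rescales
    # each sample's cross-entropy loss, so high-scoring samples dominate.
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return sum(e / z * l for e, l in zip(exps, ce_losses))

def diversity_loss(scores, cluster_ids):
    # Variance of the mean selection score across clusters; low variance
    # means the selector does not concentrate on a single cluster.
    groups = {}
    for s, c in zip(scores, cluster_ids):
        groups.setdefault(c, []).append(s)
    means = [sum(g) / len(g) for g in groups.values()]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)

def coido_objective(scores, ce_losses, cluster_ids, lam=0.5):
    # Coupled objective: one scorer is trained under both terms at once,
    # rather than filtering by importance first and diversifying later.
    return importance_loss(scores, ce_losses) + lam * diversity_loss(scores, cluster_ids)
```

With uniform scores the importance term reduces to the plain mean cross-entropy, and the diversity penalty is zero when every cluster receives the same average score.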
Saining Xie's REPA Gets a Major Upgrade, with Fewer Than 4 Lines of Code
机器之心· 2025-12-13 04:59
Core Insights
- The article argues that spatial structure matters more than global semantic information for representation alignment in generative models, specifically diffusion models [1][3][42].

Group 1: Research Findings
- A joint team from Adobe Research, the Australian National University, and New York University conducted an empirical analysis across 27 different visual encoders and model sizes [2].
- The unexpected result: spatial structure, rather than global performance, drives the generative performance of target representations [3][8].
- The study introduces Spatial Self-Similarity to quantify spatial structure, measuring the clarity of "texture" and "relationships" in feature maps [15][17].

Group 2: iREPA Methodology
- The team developed a simple method, iREPA, that speeds up convergence across various visual encoders and training variants [5][20].
- iREPA's core modifications: replace the MLP projection layer with a convolutional layer to better preserve local spatial relationships, and introduce a spatial normalization layer to enhance spatial contrast [20][21][22].

Group 3: Performance Improvements
- iREPA demonstrated significant convergence-speed improvements across various diffusion transformers and visual encoders, showing robustness and general applicability [26][27].
- As model size increases, the performance gains from iREPA also increase, aligning with the "Scaling Law" trend [34].
- Visual quality improved as well: iREPA-generated images exhibit better object outlines, texture details, and overall structural coherence than standard REPA [36].

Group 4: Conclusion
- The research emphasizes that understanding spatial relationships between pixels is more crucial for generative models than a single metric like ImageNet accuracy [42].
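The two ideas above can be illustrated with a small sketch: a cosine self-similarity map over patch features (a plausible stand-in for the paper's Spatial Self-Similarity measure) and a per-position standardization (a plausible stand-in for the spatial normalization layer). The exact formulas used by iREPA are not reproduced here, so treat both functions as assumptions for illustration.

```python
import math

def spatial_self_similarity(patches):
    # patches: list of per-patch feature vectors from a visual encoder.
    # Returns the cosine-similarity matrix; its off-diagonal structure is
    # the kind of "texture"/"relationship" pattern being measured.
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    u = [unit(p) for p in patches]
    return [[sum(a * b for a, b in zip(p, q)) for q in u] for p in u]

def spatial_norm(feature_map):
    # feature_map: H x W grid of feature vectors. Standardizing each spatial
    # position's vector keeps per-location scale from washing out spatial
    # contrast (an assumed, simplified form of the normalization layer).
    out = []
    for row in feature_map:
        new_row = []
        for v in row:
            mu = sum(v) / len(v)
            sd = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v)) + 1e-6
            new_row.append([(x - mu) / sd for x in v])
        out.append(new_row)
    return out
```

Identical patches yield similarity 1, orthogonal patches 0, so repeated local structure shows up directly as bands in the similarity matrix.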
AAAI 2026 Oral | No More One-Size-Fits-All! AdaMCoT Teaches Large Models to Size Up the Task and Dynamically Pick the Best Language to Think In
机器之心· 2025-12-13 04:59
Multilingual large language models (MLLMs) often face a dilemma on multilingual tasks: answer directly in the original language, or translate into a high-resource language and reason there? In fact, different languages carry different "specialties" inside the model. English, for example, may be stronger at logic and better suited to scientific reasoning, while Chinese or Indonesian may beat English on tasks involving specific cultural context or rhyming.

How can a model automatically choose the "handiest" reasoning path for each task? A research team led by Nancy F. Chen and Ai Ti Aw at Singapore's Agency for Science, Technology and Research (A*STAR), together with Prof. Roy Ka-Wei Lee's team at the Singapore University of Technology and Design (SUTD), has introduced the AdaMCoT (Adaptive Multilingual Chain-of-Thought) framework. The core of AdaMCoT is to treat "which language to think in" itself as an optimizable decision variable: the model adaptively routes among, and composes chain-of-thought across, multiple languages, then maps the reasoning result back to the target language, significantly improving cross-lingual factual reasoning accuracy and consistency. The work has been accepted as an Oral paper in the AAAI 2026 main track.

Research Background and Pain Points

Existing cross-lingual reasoning methods typically suffer from "path dependence": either they reason directly with no processing, which easily produces hallucinations in low-resource languages, or they force everything to be ...
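The "language as a decision variable" idea can be sketched as a tiny router that scores candidate reasoning languages against task features and then reasons in the winner before mapping back. The affinity table, feature names, and the `reason`/`translate` callables below are made-up placeholders, not AdaMCoT's learned routing policy.

```python
def route_language(task_features, lang_affinity):
    # Pick the reasoning language whose (assumed, hand-written) affinity
    # profile best matches the task's feature weights.
    def score(lang):
        return sum(w * task_features.get(feat, 0.0)
                   for feat, w in lang_affinity[lang].items())
    return max(lang_affinity, key=score)

# Illustrative affinities only; AdaMCoT learns this routing end-to-end.
LANG_AFFINITY = {
    "en": {"scientific": 0.9, "cultural": 0.2},
    "zh": {"scientific": 0.5, "cultural": 0.9},
    "id": {"scientific": 0.3, "cultural": 0.8},
}

def adaptive_cot(question, task_features, lang_affinity, reason, translate):
    # route -> reason in the chosen language -> map the answer back
    lang = route_language(task_features, lang_affinity)
    answer = reason(question, lang)
    return translate(answer, target="original")
```

Under this toy table, a purely scientific question routes to English while a culture-heavy one routes to Chinese, matching the intuition in the article.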
GPT-5.2, 24 Hours After Launch: A Flood of Bad Reviews!
机器之心· 2025-12-13 04:59
机器之心 Report
Editor: Yang Wen

Netizens complain that GPT-5.2 "lacks a human touch."

X is flooded with scathing reviews of GPT-5.2.

Yesterday, on OpenAI's tenth anniversary, the company unveiled its latest flagship model series, GPT-5.2, officially billed as "the most powerful model series to date for professional knowledge work." GPT-5.2 also set new SOTA results across numerous benchmarks.

| Benchmark | Domain | GPT-5.2 Thinking | GPT-5.1 Thinking |
| --- | --- | --- | --- |
| GDPval (wins or ties) | Knowledge work tasks | 70.9% | 38.8% (GPT-5) |
| SWE-Bench Pro (public) | Software engineering | 55.6% | 50.8% |
| SWE-bench Verified | Software engineering | 80.0% | 76.3% |
| GPQA Diamond (no tools) | Science questions | 92.4% | 88.1% |
| Ch ... | | | |
With 2026 Around the Corner, Have World Models Actually Become More "World"?
机器之心· 2025-12-13 02:30
Core Viewpoint
- Runway's recent launch of GWM Worlds and GWM Robotics pushes video generation toward an interactive "world simulation" paradigm, reigniting debate over what "world models" are: interfaces for creation and interaction, simulators for training and evaluation, or cognitive frameworks for reasoning and decision-making [1].

Group 1: Evolution of World Models
- Over the past two years, world models have come to be regarded as on par with LLMs in the AGI landscape, transitioning from a narrow reinforcement-learning definition to a broader understanding that includes generative modeling [4].
- Initially, world models were seen as an agent's internal model of its environment, predicting future states from current conditions and actions to enable internal simulation and decision-making [5].
- The engineering perspective defines a world model as a combination of three capabilities: compressing high-dimensional perception into usable representations, predicting future states over time, and using those predictions for planning and decision-making [6].
- By 2024, the understanding of world models expanded to general world-evolution modeling, with a trend from language generation to image generation, and ultimately to 3D and world generation [6].
- The boundaries of the concept have become more ambiguous, with ongoing debates about the nature of representations, the incorporation of physical laws, and the organization of input relationships [6].

Group 2: Industry Layout and Trends
- Major companies are investing in world models, raising the question of whether they are enhancing their "data engines" or building new frameworks for "spatiotemporal cognition" [3].
- In February 2024, OpenAI described the video generation model Sora as a "world simulator," emphasizing its ability to learn the three-dimensional structure and physical laws of the real world [6].
- Concurrently, LeCun introduced V-JEPA, which predicts masked video segments in an abstract representation space, gaining training efficiency by discarding unpredictable information [6].
- The discourse has shifted from whether to develop world models to how to model them, with debate over abstracting up from the pixel level versus operating directly in abstract spaces [7].
- There is recognition that existing approaches may capture only partial physical laws; a coherent world model may require representations of isolated objects and a priori laws of change across space and time [7].

Group 3: Definition and Ambiguity of World Models
- By 2025, world models are positioned alongside LLMs, with companies like Google DeepMind, Meta, and Nvidia shifting focus from pure LLMs to world models, aiming for "Physical AI + superintelligence" amid stagnating LLM progress [8].
- What distinguishes world models from existing generative AI is the goal of constructing internal representations of environments, including physical, temporal, and spatial dimensions, for planning and decision-making [9].
- The term "world model" has become ambiguous, referring to latent states within systems, game-like simulators for training agents, or any content pipeline capable of generating navigable 3D scenes [9].
- A November 2025 analysis from Entropy Town categorized world models into three technical routes, interface, simulator, and cognitive framework, highlighting the field's ongoing ambiguity [9].
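The engineering definition cited above (compress perception, predict forward, plan with the predictions) can be written down as a minimal interface. The 1-D integer dynamics and the greedy planner below are toy assumptions purely to show the three capabilities fitting together, not any company's actual world model.

```python
class ToyWorldModel:
    """Minimal world-model interface: encode observations into a state,
    predict the next state under an action, and support planning by
    internal simulation. The 1-D dynamics are a toy assumption."""

    def encode(self, obs):
        # "Compression" is trivial here; real models map pixels to latents.
        return obs

    def predict(self, state, action):
        # Imagined next state; real models learn this transition.
        return state + action

def plan(wm, obs, action_sets, goal):
    # Greedy planning inside the model: at each step, pick the action whose
    # imagined outcome lands closest to the goal, without touching the
    # real environment.
    state = wm.encode(obs)
    chosen = []
    for actions in action_sets:
        best = min(actions, key=lambda a: abs(wm.predict(state, a) - goal))
        state = wm.predict(state, best)
        chosen.append(best)
    return chosen, state
```

The point of the sketch is the separation of concerns: `encode`/`predict` are the model, and `plan` consumes only imagined rollouts, which is exactly what distinguishes a world model from a plain generator.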
Goodbye "Blind Confidence": CCD Sets a New SOTA for Diffusion Language Model Inference
机器之心· 2025-12-13 01:13
To address this, researchers from Huawei's 小艺 (Celia) Hong Kong team, City University of Hong Kong, and the University of Hong Kong have jointly proposed a new Coherent Contextual Decoding (CCD) algorithm. It fully exploits the context augmentation available during the diffusion process, theoretically correcting the "short-sightedness" of conventional DLM inference strategies, and further adopts an adaptive decoding scheme that simultaneously achieves a 3.48x speedup and a 3.9% performance gain across multiple open-source DLMs. The scheme not only fits any-order generation but also improves the semi-autoregressive block-wise decoding setting; the era of efficient inference for diffusion language models may have arrived.

Research Background

Since the start of this year, open-source diffusion language models, led by Dream and LLaDA, have shone, matching the general capabilities of same-size autoregressive LLMs and demonstrating DLMs' advantages on global planning and bidirectional context understanding tasks.

Diffusion Language Models are widely known for their distinctive "global planning" and parallel decoding abilities, making them one of the new paradigms in the LLM field. However, in any-order decoding mode they typically suffer from slow inference and incoherent generation.

Paper title: Beyond Confidence: Adaptive an ...
Apple Withdraws Its RLAX Paper at Lightning Speed: It Used Google TPUs and Alibaba's Qwen, and Ruoming Pang Is Among the Authors
机器之心· 2025-12-13 01:13
Core Viewpoint
- The article discusses Apple's recently withdrawn paper on RLAX, a scalable reinforcement learning framework that uses Google TPUs and other cloud services, highlighting the company's AI-infrastructure engineering capabilities despite recent personnel changes [1][35].

Group 1: Paper Overview
- The paper, "RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs," was submitted on December 6 and quickly withdrawn after being made public [1][7].
- RLAX is designed for efficient execution of advanced reinforcement learning algorithms on large-scale distributed TPU clusters [12].

Group 2: Technical Contributions
- RLAX employs a parameter-server architecture that logically separates the training, inference, and validation components, enhancing resource-allocation flexibility [14].
- The framework supports preemptive scheduling, enabling immediate resource recovery for higher-priority tasks without crashing the training process [15].
- RLAX addresses key challenges in post-training reinforcement learning, offering programmable configuration options for managing on-policy and off-policy RL [16].

Group 3: Experimental Results
- In experiments, RLAX improved the pass@8 accuracy of the QwQ-32B model by 12.8% in just 12 hours and 48 minutes using 1024 TPU v5p chips [24].
- The framework's development drew on Google's TPUs, Amazon's AWS Lambda for testing, and a Chinese open-source model, a collaborative mix of technologies [26].

Group 4: Author Background
- The paper's authors include Kelvin Zou, who has transitioned to Meta, and Cheng Leong, a long-time Apple employee, indicating a shift in talent within the AI sector [8][9].
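The parameter-server split described above can be sketched in-process: a trainer pushes versioned weight snapshots, rollout workers pull the latest, and a staleness rule governs the on-policy/off-policy trade-off. Class and function names, and the staleness rule itself, are assumptions for illustration, not RLAX's API.

```python
class ParameterServer:
    # Toy parameter server: the trainer pushes versioned weights; inference
    # (rollout) workers pull the latest snapshot. In RLAX the analogous
    # components run on separate TPU slices; here everything is in-process.
    def __init__(self, weights):
        self.version = 0
        self.weights = dict(weights)

    def push(self, weights):
        self.version += 1
        self.weights = dict(weights)

    def pull(self):
        return self.version, dict(self.weights)

def should_resync(server_version, local_version, staleness_limit):
    # An assumed on-/off-policy knob: how many versions behind a rollout
    # worker may run before it must pull fresh weights. staleness_limit=0
    # approximates on-policy RL; larger values permit off-policy rollouts.
    return server_version - local_version > staleness_limit
```

Separating the store of truth (the server) from the consumers is also what makes preemption survivable: a preempted worker loses only its local copy and can re-pull the latest version when rescheduled.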
HKU's Open-Source ViMax Takes Off: AI That Writes, Directs, and Stars in Its Own Videos
机器之心· 2025-12-12 10:06
Group 1
- The core idea of the article is the introduction of ViMax, an AI framework that automates the entire video production process, allowing anyone to create videos without extensive skills or equipment [2][3].
- ViMax represents a significant shift in AI video production from "fragment generation" to "systematic creation," indicating a fundamental change in creative processes [3].

Group 2
- The framework utilizes a multi-agent collaboration model, with different AI agents handling specific tasks such as screenwriting, shot planning, visual asset creation, quality assessment, and overall coordination [9][10][11][12][13].
- ViMax employs a recursive narrative-decomposition strategy to manage the complexity of long-form video storytelling, breaking scripts into manageable units while maintaining logical coherence and emotional continuity [15][16].

Group 3
- To address visual consistency across shots, ViMax implements a graph-based tracking mechanism that identifies and maintains dependencies among visual elements, ensuring coherent character and scene representation [19][20].
- The system also introduces a transition-video generation technique to maintain spatial geometric consistency when capturing multiple angles of the same scene [21].

Group 4
- ViMax's quality-control mechanism generates multiple versions of content and uses a visual language model for evaluation, ensuring high-quality outputs through iterative refinement [24][25].
- The framework is designed to be adaptable, with future enhancements expected in computational efficiency, interactive editing capabilities, cultural-diversity support, and audio production integration [29].
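Recursive narrative decomposition can be sketched as a splitter that keeps halving a unit of story "beats" until each piece is small enough to generate, while preserving beat order (a proxy for narrative continuity). The data shape and the halving rule are illustrative assumptions, not ViMax's actual decomposition strategy.

```python
def decompose(script_unit, max_len):
    # Recursively split a narrative unit into sub-units of at most max_len
    # beats; order is preserved so the story's through-line survives the
    # split. Real decomposition would split at scene/emotional boundaries.
    beats = script_unit["beats"]
    if len(beats) <= max_len:
        return [script_unit]
    mid = len(beats) // 2
    left = {"title": script_unit["title"] + "/a", "beats": beats[:mid]}
    right = {"title": script_unit["title"] + "/b", "beats": beats[mid:]}
    return decompose(left, max_len) + decompose(right, max_len)
```

Each leaf unit is then small enough to hand to a generation agent, and concatenating the leaves in order reconstructs the full beat sequence.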
One Prompt and the Schlock Rolls In: OpenAI Signs "Appearance Fee" Deals for 200+ Top Disney IPs
机器之心· 2025-12-12 10:06
机器之心 Report
机器之心 Editorial Department

The AI copyright war is no longer about locking IP away from AI entirely; it is about negotiating a suitable "appearance fee." Guess how long before Judy and Nick start swearing on screen?

The moment the news broke, netizens erupted.

For the next three years, any Sora user can freely remix these top Disney IP characters. Far from making trouble, Disney has chosen to open the floodgates itself.

Here is what happened. Disney just officially announced a $1 billion investment in OpenAI, along with a three-year partnership agreement licensing Sora to use its IP for generating short-video content. Overnight, OpenAI gained legal rights to more than 200 internationally recognized top IPs, including Disney classics such as Mickey, Minnie, Cinderella, and The Little Mermaid, and beloved Pixar IPs such as Toy Story, Inside Out, and Big Hero 6.

Of course, the license covers only animated or illustrated versions and does not include the likeness or voice of any live-action actor (after all, that would be too hard and too thorny).

OpenAI got the license and pocketed a fresh $1 billion investment. So what is in it for Disney? First, $1 billion is not much for Disney, whose annual revenue exceeds $90 billion. Second, with a stake in OpenAI, Disney can bring these characters to the platforms where Gen Z and Gen Alpha gather. And then there are the productivity tools. ...