ICML 2025 | Latest Advances in Multimodal Understanding and Generation: HKUST and Snap Research Release ThinkDiff, Giving Diffusion Models a Brain
机器之心· 2025-07-16 04:21
Core Viewpoint
- The article introduces ThinkDiff, a new method for multimodal understanding and generation that enables diffusion models to perform reasoning and creative tasks with minimal training data and computational resources [3][36].

Group 1: Introduction to ThinkDiff
- ThinkDiff is a collaboration between the Hong Kong University of Science and Technology and Snap Research, aimed at giving diffusion models reasoning capabilities with limited data [3].
- The method lets diffusion models understand the logical relationships between images and text prompts, leading to high-quality image generation [7].

Group 2: Algorithm Design
- ThinkDiff transfers the reasoning capabilities of large vision-language models (VLMs) to diffusion models, combining the strengths of both for improved multimodal understanding [7].
- The architecture aligns VLM-generated tokens with the diffusion model's decoder, so the diffusion model inherits the VLM's reasoning abilities [15].

Group 3: Training Process
- Training uses a vision-language pretraining task that aligns the VLM with an LLM decoder, facilitating the transfer of multimodal reasoning capabilities [11][12].
- A masking strategy is employed during training so the alignment network learns to recover semantics from incomplete multimodal information (a minimal sketch follows this summary) [15].

Group 4: Variants of ThinkDiff
- ThinkDiff has two variants: ThinkDiff-LVLM, which aligns large-scale VLMs with diffusion models, and ThinkDiff-CLIP, which aligns CLIP with diffusion models for stronger text-image composition [16].

Group 5: Experimental Results
- ThinkDiff-LVLM significantly outperforms existing methods on the CoBSAT benchmark, demonstrating high accuracy and quality in multimodal understanding and generation [18].
- Training is notably efficient: ThinkDiff-LVLM reaches its best results with only 5 hours of training on 4 A100 GPUs, while competing methods require far more resources [20][21].

Group 6: Comparison with Other Models
- ThinkDiff-LVLM shows capabilities comparable to commercial models such as Gemini in everyday image reasoning and generation tasks [25].
- The method also extends to multimodal video generation by adapting the diffusion decoder to produce high-quality videos from input images and text [34].

Group 7: Conclusion
- ThinkDiff represents a significant advance in multimodal understanding and generation, providing a unified model that excels in both quantitative and qualitative evaluations and contributes to research and industrial applications [36].
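The masked-alignment idea in Group 3 can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions, not the authors' released code: `vlm` and `llm_decoder` stand in for frozen pretrained components (the decoder is assumed to take a HuggingFace-style `inputs_embeds`/`labels` interface), and all dimensions and names are invented for the example.

```python
import torch
import torch.nn as nn

class AlignmentNetwork(nn.Module):
    """Maps VLM output tokens into the feature space shared with the
    diffusion decoder (dimensions here are illustrative)."""
    def __init__(self, vlm_dim: int = 4096, shared_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(vlm_tokens)

def pretraining_step(vlm, aligner, llm_decoder, images, text_ids,
                     mask_ratio: float = 0.3):
    """One vision-language pretraining step: randomly mask VLM tokens,
    align the rest, and ask the LLM decoder to reconstruct the caption,
    forcing the aligner to recover semantics from incomplete input.
    `vlm` and `llm_decoder` are assumed frozen pretrained models."""
    with torch.no_grad():                       # VLM stays frozen
        vlm_tokens = vlm(images, text_ids)      # (B, T, vlm_dim)
    keep = torch.rand(vlm_tokens.shape[:2],
                      device=vlm_tokens.device) > mask_ratio
    masked = vlm_tokens * keep.unsqueeze(-1)    # zero a random token subset
    aligned = aligner(masked)                   # (B, T, shared_dim)
    out = llm_decoder(inputs_embeds=aligned, labels=text_ids)
    return out.loss                             # caption-reconstruction loss
```

At inference, the same aligner feeds the VLM's reasoning tokens to the diffusion decoder instead of the LLM decoder, which is what lets the diffusion model inherit the VLM's reasoning.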
Building the World's First Reinforcement Learning Cloud Platform: How Did 九章云极 Do It?
机器之心· 2025-07-16 04:21
Core Viewpoint
- The article discusses the paradigm shift in AI from passive language models to autonomous decision-making agents, highlighting reinforcement learning (RL) as a key technology driving the transition toward artificial general intelligence (AGI) [1][2].

Reinforcement Learning and Its Challenges
- Reinforcement learning is becoming central to achieving a closed loop of perception, decision-making, and action in AI [2].
- Current RL methods demand high-frequency data interaction and large-scale computing resources, which traditional cloud platforms struggle to accommodate [2][8].

AgentiCTRL Platform Launch
- In June 2025, the company launched AgentiCTRL, the first industrial-grade RL cloud platform capable of scheduling heterogeneous computing resources at scale [3].
- AgentiCTRL enhances model inference capabilities, improves end-to-end training efficiency by 500%, and reduces overall costs by 60% compared with traditional RL solutions [4][22].

Systematic Reconstruction for RL
- The company restructured the RL training process from the ground up, moving beyond simple GPU scaling to a system design that includes resource scheduling and fault tolerance [8][9].
- AgentiCTRL simplifies RL training, allowing users to launch jobs with minimal code and significantly improving development efficiency [11][12].

Serverless Architecture and Resource Management
- AgentiCTRL integrates a serverless architecture that allocates resources elastically, maximizing utilization and reducing training costs (a toy autoscaling sketch follows this summary) [15][16].
- The platform is the first to support RL training at the "ten-thousand-card" (10,000-GPU) scale, addressing communication bottlenecks and synchronization challenges in distributed systems [17].

Performance Validation and Cost Efficiency
- The platform has demonstrated a 37% reduction in training time, a 25% increase in GPU utilization, and a 90% decrease in manual intervention [19].
- Overall costs can fall by up to 60%, making RL more accessible and cost-effective [22][39].

Strategic Vision and Ecosystem Development
- The company aims to build comprehensive cloud-native infrastructure for intelligent agents, positioning RL as a core capability rather than a mere cloud service module [27][28].
- Its strategy includes establishing the "AI-STAR Enterprise Ecosystem Alliance" to foster collaboration and investment in RL applications across industries [33].

Future Implications
- The successful rollout of AgentiCTRL signals a shift in the AI infrastructure landscape, where RL becomes a standard component of AI systems rather than a specialized tool [41].
- By mastering the training-feedback-deployment loop for intelligent agents, the company is positioned to lead in the next generation of AI ecosystems [33][41].
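To make the serverless-elasticity claim concrete, here is a toy autoscaling rule in Python. It is a generic sketch of the idea (size the rollout fleet to the current backlog, release everything beyond it), not AgentiCTRL's scheduler; the function name, parameters, and numbers are all invented for illustration.

```python
def target_workers(pending_rollouts: int, per_worker_throughput: int,
                   min_workers: int = 1, max_workers: int = 10_000) -> int:
    """Toy serverless-style autoscaling: each scheduling interval, size
    the rollout-worker fleet to the queued work so idle GPUs are not
    held. Illustrative only; not the platform's actual scheduler."""
    needed = -(-pending_rollouts // per_worker_throughput)  # ceiling division
    return max(min_workers, min(needed, max_workers))

# Example: 100,000 queued rollouts, each worker clears ~32 per interval.
print(target_workers(100_000, 32))  # -> 3125 workers this interval
print(target_workers(0, 32))        # -> 1 (scaled down to the floor)
```

The interesting engineering, per the article, is everything this sketch hides: scheduling heterogeneous accelerators, tolerating worker failures mid-episode, and keeping distributed policy weights synchronized at ten-thousand-card scale.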
Breaking | Jason Wei, Pioneering Author of Chain of Thought, Reportedly Joins Meta; 机器之心 Exclusively Confirms His Slack Account Is Gone
机器之心· 2025-07-16 02:22
Core Viewpoint
- Meta continues to recruit top talent from OpenAI, with prominent researchers Jason Wei and Hyung Won Chung reportedly leaving OpenAI to join Meta [1][2][4].

Group 1: Talent Acquisition
- Jason Wei and Hyung Won Chung, both prominent researchers at OpenAI, are confirmed to be leaving for Meta, with their Slack accounts already deactivated [2][4].
- Jason Wei is recognized as a key author of the Chain-of-Thought (CoT) concept, which has significantly influenced the large-model field [4][6].
- Hyung Won Chung has been a core contributor to OpenAI's projects, including the o1 model, and has a strong background in large language models [4][29].

Group 2: Contributions and Impact
- Jason Wei led early work on instruction tuning and contributed to research on the emergent capabilities of large models, accumulating over 77,000 citations on Google Scholar [16][21].
- Hyung Won Chung played a critical role in major projects such as PaLM and BLOOM at Google, and later contributed to the o1 series models at OpenAI [26][40].
- Both researchers have been influential in advancing AI systems' capabilities, particularly in reasoning and information retrieval [38][40].

Group 3: Community Reaction
- Following the news of their potential move to Meta, the online community expressed excitement and congratulations toward Jason Wei, indicating strong interest in the pair's career transition [9][10].
MIRIX Reshapes AI's Multimodal Long-Term Memory: 410% Above Gemini, 99.9% Memory Savings, App Launched Alongside
机器之心· 2025-07-15 08:29
MIRIX, a new system led by a UCSD and NYU team, is redefining the memory landscape for AI.

Over the past decade, we have watched large language models sweep the globe, powering everything from writing assistants to code generators. Yet even the strongest models share a fundamental weakness: they do not remember you.

To address this, Yu Wang, a PhD student at the University of California, San Diego (UCSD), and Xi Chen, a professor at New York University (NYU), jointly released and open-sourced MIRIX, the world's first truly multimodal, multi-agent AI memory system (a minimal sketch of the pattern follows below).

MIRIX's results are striking. On ScreenshotVQA, a challenging benchmark requiring deep multimodal understanding, MIRIX achieves 35% higher accuracy than traditional RAG methods while cutting storage overhead by 99.9%; compared with long-context methods, it scores 410% higher with 93.3% less overhead. On the LOCOMO long-conversation task, MIRIX reaches 85.4%, significantly surpassing all existing baselines and setting a new performance bar.

Paper title: MIRIX: Multi-Agent Memory System for LLM-Based Agents
Paper link: https://arxiv.org/abs/2507.07957
Official website: http ...
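The paper's exact architecture is not reproduced in this summary; the Python below is only a minimal sketch of the multi-agent memory pattern the name describes: specialized memory stores behind a router. All class names and the keyword-overlap retrieval are illustrative assumptions, not MIRIX's code.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    content: str
    modality: str  # e.g. "text" or "screenshot"

class MemoryAgent:
    """One specialized store (e.g. episodic, semantic, procedural).
    Names and retrieval logic are illustrative, not MIRIX's code."""
    def __init__(self, name: str):
        self.name = name
        self.entries: list[MemoryEntry] = []

    def write(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def read(self, query: str, k: int = 3) -> list[MemoryEntry]:
        # Toy retrieval: keyword overlap stands in for learned embeddings.
        score = lambda e: len(set(query.split()) & set(e.content.split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

class MemoryRouter:
    """Routes writes and reads across specialized agents, which is the
    general shape of a multi-agent memory system."""
    def __init__(self, agents: dict[str, MemoryAgent]):
        self.agents = agents

    def write(self, kind: str, entry: MemoryEntry) -> None:
        self.agents[kind].write(entry)

    def recall(self, query: str) -> dict[str, list[MemoryEntry]]:
        return {name: agent.read(query) for name, agent in self.agents.items()}

router = MemoryRouter({k: MemoryAgent(k) for k in ("episodic", "semantic")})
router.write("episodic", MemoryEntry("user opened the budget spreadsheet", "screenshot"))
print(router.recall("budget spreadsheet"))
```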
Musk's Anime "Companion" in Grok Has Conquered the Entire Internet
机器之心· 2025-07-15 08:29
Core Viewpoint
- The article discusses the launch of a new "Smart Companion" feature in the Grok app, which allows users to interact with AI avatars in a more personalized manner, aiming to attract users interested in anime, virtual companions, and advanced AI interactions [2][11].

Group 1: New Feature Launch
- The "Smart Companion" feature is based on the recently released Grok 4 model and has generated significant user interest [2][11].
- Paid users of SuperGrok can access the new AI chatbot avatars, enhancing the app's interactive capabilities [3][12].
- Users can enable the companion feature through the app settings, although the process is currently seen as somewhat complex [5][7].

Group 2: User Experience and Feedback
- Initial user feedback indicates that the feature, while subtly introduced, is a smart move for enhancing personalized interactions within the app [12].
- Some users expressed dissatisfaction with the design of the avatars, particularly criticizing the character Ani for being overly provocative and not suitable for a premium AI service [12][15].
- Testing of the avatar Ani revealed a mix of engaging and awkward interactions, with a focus on creating a flirtatious atmosphere [18][20].

Group 3: Market Context and Comparisons
- The article highlights the growing trend of AI companions, referencing previous successful models like Character.AI, which allowed for emotional companionship and role-playing [30][45].
- The emergence of platforms like SillyTavern demonstrates a shift toward more customizable and unrestricted AI interactions, appealing to users seeking deeper engagement [34][40].
- The demand for emotional companionship through AI reflects broader societal needs for connection and the potential for AI to address feelings of loneliness [45][47].

Group 4: Gaming Integration
- Grok is also venturing into the gaming sector, showcasing impressive capabilities in generating playable games through AI [49][50].
- The development process for games has been streamlined, allowing developers to create complete games using simple prompts, marking a potential shift in game development practices [52][55].
ICML 2025 Outstanding Papers Announced: 8 Awards, with Nanjing University Researchers on the List
机器之心· 2025-07-15 05:37
Core Insights
- The article reports the best paper awards announced at ICML 2025, highlighting the conference's significance in the AI research community [3][4].
- A total of 8 papers were awarded: 6 outstanding papers and 2 outstanding position papers, with notable contributions from researchers at Nanjing University [4].

Submission Statistics
- ICML received 12,107 valid paper submissions this year, accepting 3,260 for an acceptance rate of 26.9% [5].
- Submissions rose significantly from 9,653 in 2024, indicating growing interest in the AI field [5].

Outstanding Papers
- Paper 1: "Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions" compares masked diffusion models (MDMs) with autoregressive models (ARMs), showing that adaptive token decoding can significantly boost MDM performance [10][12][13].
- Paper 2: Investigates the impact of predictive technologies on welfare distribution in the context of fairness, providing a framework for policymakers to make principled decisions [17][19].
- Paper 3: Introduces CollabLLM, a training framework that enhances collaboration between humans and large language models, achieving an 18.5% improvement in task performance [22][26][27].
- Paper 4: Proposes a minimal algorithmic task to quantify the creative limits of current language models, arguing that multi-token methods outperform next-token prediction [28][32][34].
- Paper 5: Discusses conformal prediction from a Bayesian perspective, offering a practical alternative for uncertainty quantification in high-risk scenarios (a textbook sketch of split conformal prediction follows this summary) [35][39].
- Paper 6: Addresses score matching with missing data, providing methods to handle incomplete datasets effectively [40][44].

Outstanding Position Papers
- Position Paper 1: Advocates a dual feedback mechanism in peer review processes to enhance accountability and quality in AI conference submissions [49][51][53].
- Position Paper 2: Emphasizes the need to prioritize the impact of AI on the future of work, suggesting comprehensive support for transitions in labor markets affected by AI [54][56][58].
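For readers unfamiliar with the technique Paper 5 revisits, here is a minimal textbook sketch of standard split conformal prediction for classification, which guarantees roughly (1 - alpha) coverage regardless of the underlying model. This is the classical recipe, not the awarded paper's code, and the toy data at the end is invented.

```python
import numpy as np

def split_conformal_set(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification.

    cal_probs:  (n, K) predicted class probabilities on calibration data
    cal_labels: (n,)   true calibration labels
    test_probs: (m, K) predicted probabilities on new inputs
    Returns a boolean (m, K) mask: set membership for each test point.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores
    # (requires n large enough that the quantile level stays <= 1).
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, level, method="higher")
    # Include every class whose nonconformity falls below the threshold.
    return test_probs >= 1.0 - q

# Toy example with a 3-class model whose first class dominates.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet([5, 1, 1], size=100)
sets = split_conformal_set(cal_p, np.zeros(100, dtype=int), cal_p[:5])
print(sets)  # each row is the prediction set for one test point
```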
Led by a Central State-Owned Enterprise! This Open-Source AI Community Aims to Run Large Models on Every "Chinese Chip"
机器之心· 2025-07-15 05:37
Core Viewpoint
- The article discusses the challenges of adapting large models to domestic Chinese chips and the need for a collaborative platform to bridge the gap between model development and chip compatibility [2][3][35].

Group 1: Model Adaptation Challenges
- Deploying a large model successfully requires clearing three hurdles: adapting the inference engine, adapting the computing platform, and adapting the upper-layer scheduling for business system integration [9][10].
- Tools supporting large-model inference and adaptation are plentiful but fragmented; the real challenge is connecting and coordinating these scattered tools and experiences [11].

Group 2: Collaboration Initiatives
- The "Model Inference Adaptation Collaboration Plan," launched by the Modelers community, gathers developers, algorithm teams, chip manufacturers, and inference tool partners to build an open-source collaborative ecosystem [5][30].
- The community upgraded its "Mirror Center" to a "Tool Center," elevating the toolchain to the same importance as model libraries and datasets [13][14].

Group 3: Community Engagement and Development
- A "Collaboration Space" lets any user submit pull requests (PRs) contributing documentation, adaptation code, and optimized inference configurations [20][29].
- The collaboration mechanism aggregates dispersed adaptation efforts onto a unified platform, allowing easy downloading and secondary development [29].

Group 4: Industry Partnerships
- The community collaborates with various domestic computing power manufacturers to provide developers with hardware, tools, and technical support [31].
- The initiative also integrates a diverse ecosystem of adaptation and inference software, helping developers quickly master the adaptation toolchain [32].

Group 5: Future Prospects
- The "Adaptation Plan" will remain open to more chip manufacturers, model developers, and individual developers, with a focus on standardizing adaptation technology [34].
- If successful, this collaborative mechanism could fix the critical "coordination shortfall" in the domestic chip ecosystem and enable the systematic deployment of models on chips [35].
Scoring Without Doing Anything? Big Problems Surface in AI Agent Benchmarks
机器之心· 2025-07-15 05:37
Core Viewpoint
- Existing benchmarks for evaluating AI agents are fundamentally flawed, leading to significant misjudgments of their capabilities and necessitating more rigorous testing standards [5][7][23].

Group 1: Importance of Benchmark Testing
- Benchmark testing plays a foundational role in assessing the strengths and limitations of AI systems, guiding both research and industry development [2].
- As AI agents transition from research prototypes to real-world applications, effective evaluation benchmarks become critical [3].

Group 2: Current Issues with AI Benchmarks
- Current AI agent benchmarks are not yet reliable; many allow misleadingly high scores without actual capability [5][6].
- A study involving researchers from several prestigious universities identified common failure modes in existing benchmarks and proposes a checklist to minimize the potential for "gaming" the tests [7][23].

Group 3: Challenges in Benchmark Design
- AI agent tasks often involve real-world scenarios without standard answers, making benchmark design and evaluation more complex than traditional AI tests [4][11].
- Two validity criteria are proposed: task validity (the task can only be solved with the targeted capability) and outcome validity (the evaluation accurately reflects task completion) [12][15].

Group 4: Findings from the ABC Checklist
- The ABC checklist, distilled from 17 widely used AI benchmarks, contains 43 items covering outcome validity and task validity [17][18].
- Applying the checklist revealed that 7 of 10 benchmarks contained tasks that AI agents could exploit, and 7 of 10 failed outcome-validity standards [23].

Group 5: Specific Benchmark Failures
- SWE-bench failed to detect errors in AI-generated code due to insufficient unit-test coverage [24][27].
- KernelBench's reliance on random tensor values may overlook critical errors in generated code, while τ-bench allowed a "no-operation" agent to achieve a 38% success rate (a minimal null-agent check is sketched below) [28][31].
- OSWorld's outdated evaluation methods, which relied on obsolete website elements, underestimated agent performance by 28% [32][33].

Group 6: Future Directions
- The ABC aims to provide a practical evaluation framework that helps benchmark developers identify potential issues and raise the rigor of their assessments [36].
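The τ-bench finding suggests a cheap diagnostic any benchmark author can run: score an agent that does nothing. Below is a minimal Python sketch of that check; `run_task` is an assumed harness function of type (task, agent) -> bool, not part of any named benchmark's API.

```python
def null_agent(observation):
    """An agent that always does nothing. If a benchmark awards it a
    non-trivial score, outcome validity is suspect; the article reports
    such a no-op agent reaching 38% on τ-bench."""
    return "no_op"

def null_agent_check(benchmark_tasks, run_task, threshold=0.05):
    """Run the null agent over every task and flag the benchmark when
    its do-nothing success rate exceeds a small threshold. `run_task`
    is an assumed harness: (task, agent) -> bool success."""
    successes = sum(run_task(task, null_agent) for task in benchmark_tasks)
    rate = successes / len(benchmark_tasks)
    if rate > threshold:
        print(f"WARNING: null agent scores {rate:.0%}; "
              f"outcome checks may be gameable.")
    return rate
```

This probes outcome validity only; task validity (can the task be solved without the targeted capability, e.g. by memorization?) needs separate, task-specific auditing.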
A Survey of Robot Embodied Intelligence Driven by Physical Simulators and World Models: 38 Pages, 400+ References, from Nanjing University and 7 Other Institutions
机器之心· 2025-07-15 05:37
Core Insights
- The article emphasizes the significance of "embodied intelligence" in the pursuit of artificial general intelligence (AGI), highlighting the need for intelligent agents that perceive, reason, and act in the physical world [5].
- The integration of physical simulators and world models is identified as a promising pathway to more capable robots, enabling the transition from mere action execution to genuine cognition [5].

1. Introduction to Embodied Intelligence
- Embodied intelligence focuses on agents that autonomously perceive, predict, and execute actions in complex environments, moving toward AGI [5].
- Combining physical simulators with world models is crucial for developing robust embodied intelligence [5].

2. Key Contributions
- The paper systematically reviews advances in learning embodied intelligence through the integration of physical simulators and world models, analyzing their complementary roles in enhancing agents' autonomy, adaptability, and generalization [5].

3. Robot Capability Classification
- A five-level capability classification (IR-L0 to IR-L4) is proposed, covering autonomy, task handling, environmental adaptability, and social cognition (encoded as a small enum after this summary) [9][10]:
  - IR-L0: basic execution with no environmental perception
  - IR-L1: rule-based response in closed environments
  - IR-L2: perceptual adaptation with basic path planning
  - IR-L3: human-like collaboration with emotion recognition
  - IR-L4: full autonomy with self-generated goals and ethical decision-making [15]

4. Review of Core Robot Technologies
- The survey reviews the latest advances in legged locomotion, manipulation control, and human-robot interaction [11][16].

5. Comparative Analysis of Physical Simulators
- Mainstream simulators (Webots, Gazebo, MuJoCo, Isaac Gym/Sim) are compared comprehensively on physical simulation capability, rendering quality, and sensor support [12][18][19].

6. Advances in World Models
- The paper discusses representative world-model architectures and their applications, such as trajectory prediction in autonomous driving and simulation-to-reality calibration for articulated robots [13][20].
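Because the IR levels form an ordered scale, they can be encoded as an ordered enum so systems can be compared programmatically. The encoding below is illustrative only; the labels paraphrase the survey's descriptions.

```python
from enum import IntEnum

class IRLevel(IntEnum):
    """The survey's five-level robot capability ladder as an ordered
    enum (encoding is illustrative; labels paraphrase the article)."""
    L0_BASIC_EXECUTION = 0        # no environmental perception
    L1_RULE_BASED = 1             # rule-based response, closed environments
    L2_PERCEPTUAL_ADAPTATION = 2  # perception plus basic path planning
    L3_HUMANLIKE_COLLABORATION = 3  # collaboration, emotion recognition
    L4_FULL_AUTONOMY = 4          # self-generated goals, ethical decisions

# Ordering makes capability comparisons direct:
assert IRLevel.L2_PERCEPTUAL_ADAPTATION < IRLevel.L3_HUMANLIKE_COLLABORATION
```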
Zhai Guangtao of Shanghai Jiao Tong University / Shanghai AI Lab: When Evaluation No Longer Matters, AGI Will Have Been Achieved
机器之心· 2025-07-15 03:20
Core Viewpoint
- The article discusses the challenges and limitations of current AI evaluation systems, emphasizing that a perfect evaluation system would equate to achieving artificial general intelligence (AGI) [3][20].

Evaluation System Challenges
- The primary issue with evaluation systems is "data contamination": publicly available benchmark tests are often included in the training data of subsequent models, undermining the diagnostic value of evaluations [5][6].
- The "atomization of capabilities" in evaluations leads to a fragmented understanding of intelligence; breaking complex skills into isolated tasks may not reflect a model's true capabilities in real-world applications [7][8].
- There is a significant disconnect in embodied intelligence, where models perform well in simulated environments but poorly in real-world scenarios, highlighting the need for more realistic evaluation frameworks [9].

Evaluation Framework and Methodology
- The article proposes a "human-centered evaluation" approach, focusing on how models enhance human task efficiency and experience rather than merely comparing model performance against benchmarks [12][13].
- The "EDGE" framework (Evolving, Dynamic, Granular, Ecosystem) is introduced, aiming to create a responsive evaluation system that adapts to AI advancements [13].
- The team is developing a high-quality internal question bank to mitigate data contamination, planning to gradually open-source questions to ensure reproducibility [15].

Future Directions and Goals
- The concept of "training-evaluation integration" is emphasized, where evaluations inform training processes, creating a feedback loop that aligns model development with human preferences [16][17].
- The ultimate goal is to establish a comprehensive evaluation framework that encompasses the many facets of AI, guiding the industry toward a more value-driven and human-centered development path [22][23].
- The article concludes that the success of AI evaluation systems lies in their eventual obsolescence: achieving AGI would mean that self-evaluation capabilities become inherent to the AI itself [24].