SAM 3 Surfaces at ICLR 2026, the Next Step for Segment Anything: Teaching Models to Understand "Concepts"
机器之心· 2025-10-13 04:21
Core Insights
- The article discusses the release of a new paper, "SAM 3: Segment Anything with Concepts," believed to be the continuation of Meta's "Segment Anything" series, following SAM 1 and SAM 2 [1][3][4].

Group 1: Overview of SAM 3
- SAM 3 introduces a new task, Promptable Concept Segmentation (PCS): given a text or image-exemplar prompt, the model predicts instance and semantic masks for all matching objects while keeping instance identities consistent across video frames [8][12] (see the interface sketch after this summary).
- The model focuses on atomic visual concepts, understanding simple noun phrases such as "red apple" or "striped cat" as segmentation prompts [8][12].
- SAM 3 improves on its predecessors in promptable visual segmentation and establishes a new standard for PCS [18].

Group 2: Performance Metrics
- SAM 3 achieves at least a 2x improvement over previous systems on the newly proposed SA-Co benchmark [13].
- On the LVIS dataset, SAM 3 reaches a zero-shot mask average precision of 47.0, surpassing the previous best of 38.5 [13].
- The model processes an image with over 100 objects in just 30 milliseconds on a single H200 GPU [14].

Group 3: Methodology and Data
- SAM 3 employs a dual encoder-decoder transformer architecture, pairing a detector with a tracker and a memory module for video applications [20].
- The team built a scalable human-machine collaborative data engine and annotated a high-quality training dataset with 4 million unique phrases and 520 million masks [21].
- The PCS benchmark includes 124K images and 1.7K videos covering 214K unique concepts, a substantial expansion over existing benchmarks [25].

Group 4: Comparative Analysis
- SAM 3 outperforms previous models on instance segmentation, box detection, and semantic segmentation across multiple datasets [27][28].
- In open-vocabulary semantic segmentation experiments, SAM 3 exceeded strong baseline models [29].
- The model also demonstrated superior object-counting accuracy and segmentation capability compared with other models [33].
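SAM 3's code and API are not public, so the interface below is purely hypothetical: a minimal Python sketch of what the PCS input/output contract described above could look like. `ConceptPrompt`, `ConceptSegmentation`, and `segment_video` are invented names for illustration only; the real model's types will differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

import numpy as np

@dataclass
class ConceptPrompt:
    """A PCS prompt: a short noun phrase and/or image exemplars of the concept."""
    phrase: Optional[str] = None  # e.g. "red apple" or "striped cat"
    exemplars: List[np.ndarray] = field(default_factory=list)  # example image crops

@dataclass
class ConceptSegmentation:
    """Per-frame output: one binary mask and one persistent ID per matched instance."""
    masks: List[np.ndarray]   # H x W boolean arrays, one per instance
    instance_ids: List[int]   # stable across frames, so identities persist in video

def segment_video(
    frames: List[np.ndarray],
    prompt: ConceptPrompt,
    model: Callable[[np.ndarray, ConceptPrompt], ConceptSegmentation],
) -> List[ConceptSegmentation]:
    """Run PCS over a clip. Unlike per-frame detection, the model is expected
    to keep instance_ids consistent across frames (the tracker + memory role)."""
    return [model(frame, prompt) for frame in frames]
```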
Large Models Chase the Stars: GPT and Gemini Take Gold at the International Astronomy Olympiad
机器之心· 2025-10-13 04:21
Core Insights
- The article discusses the remarkable advances of large language models (LLMs) such as GPT-5 and Gemini 2.5 Pro, which achieved gold-medal performance at the International Olympiad on Astronomy and Astrophysics (IOAA) [4][18].

Group 1: AI Model Performance
- GPT-5 and Gemini 2.5 Pro excelled at the IOAA, demonstrating strong reasoning and problem-solving in astronomy and astrophysics [4][12].
- On the theory exams, GPT-5 averaged 84.2% and Gemini 2.5 Pro averaged 85.6%, outperforming other models by 7 to 25 percentage points [12][13].
- Both models reached gold-medal status, with GPT-5 scoring 86.8% on the 2025 exam, 89.6% on the 2023 exam, and 93.0% on the 2022 exam, consistently outperforming the best human participants [18][19].

Group 2: Evaluation Framework
- The study introduces a more rigorous framework for assessing LLMs in scientific research, emphasizing complex reasoning and problem-solving over simple knowledge recall [9][10].
- The IOAA was chosen as a benchmark for its ecological validity: it covers a wide range of astronomical topics and requires multi-step reasoning [9][10].

Group 3: Error Analysis
- The models showed a marked performance gap across question types, with higher accuracy on physics/mathematics problems (67-91%) than on geometric/spatial problems (49-78%) [26].
- Common errors included conceptual misunderstandings and geometric-reasoning failures, indicating fundamental difficulty in achieving deep physical understanding [25][26].
More Ammunition for "Fine-Tuning Is Dead": Google Extends the AI Self-Evolution Paradigm, Learning from Both Successes and Failures
机器之心· 2025-10-12 08:02
Core Insights
- The article discusses "Agentic Context Engineering," which allows language models to self-improve without fine-tuning, an idea that has drawn attention from the academic community [1].
- Google's earlier "ReasoningBank" work presents a similar idea: an innovative memory framework for agent systems that extracts and organizes memory items from the agent's own experiences [1][3].

Summary by Sections

ReasoningBank Overview
- ReasoningBank captures effective strategies from successes and important lessons from failures, turning them into actionable principles in a closed-loop process [1][3].
- The framework stores structured memory items, each with a title, description, and content, allowing agents to interact with their environment and build new memory items from past experiences [5][7] (see the sketch after this summary).

Key Components of ReasoningBank
- Memory structure: memory items are distilled from past experiences, abstracting away low-level execution details while retaining transferable reasoning patterns [7].
- Integration with agents: agents equipped with ReasoningBank draw on a curated pool of transferable strategies to guide decision-making, enhancing adaptability to unseen queries [7].

Memory-Aware Test-Time Scaling (MaTTS)
- MaTTS integrates ReasoningBank with test-time scaling, generating diverse explorations that provide contrastive signals for better memory synthesis [8][9].
- Two complementary implementations of MaTTS are introduced, parallel scaling and sequential scaling, which strengthen memory-guided planning [9].

Experimental Results
- Extensive experiments on challenging benchmarks, including WebArena and SWE-Bench-Verified tasks, show that ReasoningBank outperforms baseline methods, with effectiveness gains of up to 34.2% and 16.0% fewer interaction steps [11].
- The results indicate that ReasoningBank significantly improves both resolve rate and efficiency compared with memory-free agents [13][14].

Overall Impact
- The combination of ReasoningBank and MaTTS emerges as a key recipe for memory-based experience scaling, demonstrating superior performance across a range of tasks [14][15].
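To make the memory-item structure concrete, here is a minimal Python sketch of the closed loop described above. The retrieval and distillation logic are deliberately naive stand-ins (keyword overlap instead of embedding similarity, a string template instead of an LLM call); only the title/description/content schema follows the article.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """A ReasoningBank-style entry: a transferable strategy, not a raw trajectory."""
    title: str        # short handle, e.g. "verify filters before submitting a search"
    description: str  # one-line summary of when the strategy applies
    content: str      # the actionable principle distilled from past experience

class ReasoningBank:
    def __init__(self):
        self.items = []

    def retrieve(self, query: str, k: int = 3):
        """Naive keyword-overlap retrieval; a real system would use embeddings."""
        overlap = lambda m: len(
            set(query.lower().split()) & set(m.description.lower().split())
        )
        return sorted(self.items, key=overlap, reverse=True)[:k]

    def distill(self, trajectory: str, succeeded: bool) -> None:
        """Closed loop: turn both successes and failures into new memory items.
        A real system would call an LLM here to abstract the lesson."""
        kind = "strategy that worked" if succeeded else "pitfall to avoid"
        self.items.append(MemoryItem(
            title=f"{kind} (auto-distilled)",
            description=trajectory[:80],
            content=trajectory,
        ))
```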
Silicon Valley CEOs Sound the AI Alarm, "Unemployment Will Soar to 20% Within 5 Years," Yet 95% of AI Projects Lose Money
机器之心· 2025-10-12 04:05
机器之心 Report
Editor: Yang Wen

The current "AI threatens jobs" narrative is more a warning extrapolated from technology trends than an established fact, but that is no reason to take AI's long-term impact lightly.

Recently, claims that "AI will put people out of work" have grown louder, casting a further shadow over already anxious workers.

Anthropic CEO Dario Amodei predicts a "doomsday" for white-collar employment: "AI could replace entry-level white-collar jobs at scale within the next five years, and unemployment could soar to between 10% and 20%, especially in industries such as law, finance, and consulting."

Goodwill's CEO says he is preparing for an AI-driven wave of Gen Z unemployment and believes the youth jobs crisis has already arrived.

Stability AI co-founder Emad Mostaque claims mass unemployment will arrive next year: "AI can complete complex work without errors, which puts many jobs at risk of replacement. The unemployment problem will hit multiple industries at once and is likely to intensify over the next one to two years."

Even Jad Tarifi, founder of Google's first generative AI team, says that ever-improving AI capabilities may soon make advanced degrees in law or medicine pointless.

The core argument of the paper is that the widespread adoption of AGI will cause human labor to ...
Is the Threat of LLM Jailbreak Attacks Systematically Overestimated? A New Jailbreak Evaluation Paradigm Based on Decompositional Scoring
机器之心· 2025-10-12 04:05
Core Viewpoint
- The article introduces JADES, a new framework for evaluating jailbreak attacks developed by researchers from CISPA, Flexera, and Xi'an Jiaotong University. It aims to deliver more accurate assessments by replacing traditional holistic evaluation with a decompositional scoring mechanism [4][5][6].

Current Limitations of Jailbreak Assessment
- Accurately evaluating jailbreak attacks is difficult because harmful questions are open-ended, making a unified success criterion hard to establish [10].
- Existing automated methods suffer from two core flaws: misaligned proxy indicators that produce false positives, and holistic evaluation strategies that obscure the details of responses [11][12].

JADES Framework
- JADES automates the analytic scoring logic used by human experts, achieving granular and reliable assessments through a multi-agent collaborative pipeline of four nodes [12]:
1. **Question Decomposition Node**: breaks a harmful question into weighted sub-questions [12].
2. **Response Preprocessing Node**: cleans the raw jailbreak response to reduce complexity [16].
3. **Sub-Question Pairing Node**: extracts the sentences relevant to each sub-question from the cleaned response [17].
4. **Evaluation Node**: scores each sub-answer on a five-point Likert scale and aggregates the weighted scores into an overall success judgment [18] (a worked scoring example follows this summary).

Performance Evaluation
- The researchers built a benchmark dataset, JailbreakQR, consisting of 400 pairs of harmful questions and jailbreak responses, to validate JADES [20].
- JADES revealed that previous assessment methods systematically overestimate jailbreak success rates: under JADES, the success rate of LAA attacks on GPT-3.5-Turbo drops from 93% to 69% [24].
- In binary classification, JADES reached 98.5% agreement with human evaluators; in the harder ternary classification, it maintained 86.3% accuracy [26].
- A new metric, the ratio of Success Rate to Attack Success Rate (SR/ASR), showed that fewer than 0.25 of cases were fully successful, suggesting that many attacks labeled as successful are only partially successful [27].

Conclusion
- JADES establishes a transparent, reliable, and auditable standard for jailbreak assessment, exposing systemic biases in current evaluation methods and giving the field a more effective tool [28].
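The aggregation step lends itself to a small worked example. The sketch below is illustrative, not the paper's exact rule: it assumes Likert scores of 1-5 per sub-question, weights that sum to 1, and an arbitrary 0.8 threshold for counting a jailbreak as fully successful.

```python
def decompositional_score(sub_scores, weights, scale=5, threshold=0.8):
    """Aggregate per-sub-question Likert scores (1..scale) into one normalized
    score in [0, 1], then threshold it. Weights encode each sub-question's
    importance within the original harmful question."""
    assert len(sub_scores) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    total = sum(w * (s - 1) / (scale - 1) for s, w in zip(sub_scores, weights))
    return total, total >= threshold

# Three weighted sub-questions: two answered well, one barely addressed.
score, success = decompositional_score(sub_scores=[5, 4, 2], weights=[0.5, 0.3, 0.2])
print(f"aggregate={score:.3f}, fully successful: {success}")  # 0.775, False
```

This is the effect the SR/ASR metric surfaces: a response can look like a successful jailbreak holistically while scoring as only partial success once decomposed.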
Qwen3 Turned into a Diffusion Language Model? It Runs Without Training from Scratch, Setting a Record at 30B Parameters
机器之心· 2025-10-12 04:05
Core Insights
- The article discusses the development of RND1-Base, the largest open-source diffusion language model (DLM) to date, built to overcome the training-efficiency and scalability limits facing traditional autoregressive (AR) models [2][3][6].

Group 1: Model Development
- RND1-Base is a 30-billion-parameter sparse MoE model with 3 billion active parameters, converted from a pre-trained AR model (Qwen3-30B-A3B) and trained on 500 billion tokens to acquire full diffusion behavior [6].
- The research team from Radical Numerics demonstrates that scaling diffusion language models beyond 8 billion parameters is both feasible and effective [9].

Group 2: Performance Evaluation
- RND1 was tested on benchmarks including MMLU, ARC-C, RACE, and BBH, showing stable performance that surpasses existing models such as Dream-7B and LLaDA-8B while retaining the strengths of its AR foundation [7].
- RND1 was not compared against the newest LLaDA variant (LLaDA-MoE-7B-A1B), so further comparisons are needed to determine which model is stronger [9].

Group 3: Training Methodology
- The research identifies the key levers in autoregressive-to-diffusion (A2D) conversion, including initialization strategies, hierarchical learning rates, and critical batch sizes, which contribute to scalability and stability [10].
- A simpler recipe, Simple Continuous Pretraining (SCP), matches the performance of more complex A2D conversion pipelines while effectively retaining the AR pre-training knowledge [13][14].

Group 4: Training Efficiency
- A2D conversion performs better with larger batch sizes, indicating that diffusion language models can exploit large batches during continued pre-training [15][17].
- The recipe replaces causal masks with bidirectional masks at initialization and then continues pre-training under a masked diffusion objective [18] (see the sketch after this summary).

Group 5: Company Vision
- Radical Numerics aims to build an automated AI research platform that recursively improves itself, with RND1 as one of the first tangible outcomes of that vision [20].
- The founding team comprises members from top institutions such as DeepMind and Stanford, focusing on hybrid architectures and other innovative technologies [21].
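As a rough illustration of the Group 4 recipe, here is a generic masked-diffusion training step in PyTorch. It is a sketch under assumptions: `model` is presumed to already run with bidirectional attention (the causal mask swapped out), and the loss-weighting terms used in real masked-diffusion objectives are omitted for brevity; RND1's actual objective may differ.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_step(model, tokens, mask_token_id):
    """One continued-pretraining step of a masked diffusion LM objective:
    corrupt tokens at a random rate, then train the (bidirectional) model
    to recover the originals at exactly the masked positions."""
    batch, seq_len = tokens.shape
    # Per-sequence masking rate t ~ U(0, 1); each token is masked with prob t.
    t = torch.rand(batch, 1, device=tokens.device)
    is_masked = torch.rand(batch, seq_len, device=tokens.device) < t
    corrupted = torch.where(
        is_masked, torch.full_like(tokens, mask_token_id), tokens
    )
    logits = model(corrupted)  # (batch, seq_len, vocab); no causal mask inside
    # Cross-entropy only over masked positions (usual 1/t weighting omitted).
    return F.cross_entropy(logits[is_masked], tokens[is_masked])
```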
How Will RL Improve the Generalization of Embodied VLA Models? A Tsinghua Team's NeurIPS 2025 Paper Analyzes the Generalization Gap Between RL and SFT
机器之心· 2025-10-12 02:41
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) large models for embodied intelligence, highlighting how current supervised fine-tuning (SFT) methods generalize poorly to new environments and tasks, and emphasizing the advantages of Reinforcement Learning (RL) for improving VLA generalization [2][4].

Group 1: Research Findings
- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how RL and SFT improve robustness across visual, semantic, and execution challenges [4].
- Experiments show that RL algorithms such as Proximal Policy Optimization (PPO) significantly improve robustness in semantic understanding and task execution, while matching SFT in visually varied scenarios [4][11].

Group 2: RL Methodology
- The team tested three RL algorithms: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). PPO outperformed DPO and GRPO on multi-step decision tasks, owing to the partially observable Markov decision process (POMDP) nature of robotic tasks [9][11].
- Three key innovations made PPO training on VLA models efficient (a minimal sketch follows this summary): a shared Actor-Critic architecture that reduces memory usage by 45% and increases training speed by 35%; a warm-up strategy using 140 high-quality trajectories that improves convergence speed by 50%; and reducing PPO training to a single epoch, which shortens training time significantly [13][15].

Group 3: Comparison of SFT and RL
- Probing the data-scale limits of SFT, the team found performance saturates at around 16,000 demonstration trajectories. RL, by contrast, delivered a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [18][19].
- A comprehensive evaluation benchmark dissects the generalization gap between SFT and RL along visual, semantic, and execution dimensions, with RL showing clear advantages in semantic understanding and execution robustness [21][23].

Group 4: Practical Implications
- The study underscores RL's core value for building truly generalizable embodied agents, increasingly important as robotic applications grow more complex and variable. The team has open-sourced RLinf, a large-scale RL framework for embodied intelligence, to support further research [25].
- Case-level visual analysis revealed deeper differences: RL maintains task stability under noise and handles unseen objects effectively, whereas SFT tends to get stuck in repetitive actions [26].
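The shared Actor-Critic idea is simple to express in code. Below is a minimal PyTorch sketch under assumptions: `backbone` stands in for the VLA model, and the toy MLP, dimensions, and action count are invented for illustration. It shows how one shared forward pass feeds both a policy head and a scalar value head, which is why the critic adds almost no memory.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic share one backbone (in the paper, the VLA model itself),
    so the critic adds only a tiny value head instead of a second full network."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_actions: int):
        super().__init__()
        self.backbone = backbone                               # shared representation
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # actor
        self.value_head = nn.Linear(hidden_dim, 1)             # critic

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)  # one forward pass serves both heads
        return self.policy_head(h), self.value_head(h).squeeze(-1)

# Toy usage: an MLP stands in for the VLA backbone.
net = SharedActorCritic(
    nn.Sequential(nn.Linear(16, 64), nn.ReLU()), hidden_dim=64, num_actions=7
)
logits, value = net(torch.randn(2, 16))  # action logits (2, 7), state values (2,)
```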
He Once Turned Down $1.5 Billion: Super-Talent Andrew Tulloch Returns to Meta, and Thinking Machines Lab Loses a Co-Founder
机器之心· 2025-10-12 02:41
Core Viewpoint
- Meta's aggressive recruitment strategy, most visibly its high-profile pursuit of Andrew Tulloch, underscores the company's push to strengthen its AI capabilities despite earlier rejections of lucrative offers [1][11].

Group 1: Recruitment and Offers
- Mark Zuckerberg's recruiting efforts included a dramatic offer to Andrew Tulloch reportedly exceeding $1 billion, which was initially declined [2][11].
- Tulloch, a prominent figure in AI with a strong academic record and extensive experience at Meta and OpenAI, would be a valuable asset to any tech company [7][8].
- Despite turning down the initial offer, Tulloch ultimately decided to join Meta, marking a shift in his career path [5][12].

Group 2: Background of Andrew Tulloch
- Tulloch graduated with top honors in mathematics from the University of Sydney and later earned a master's degree from the University of Cambridge [7].
- He spent over 11 years at Meta, contributing significantly to its machine learning systems and advertising platforms [7].
- After leaving Meta, he played a key role at OpenAI, working on advanced models including GPT-4 and GPT-4.5 [9][11].

Group 3: Implications for Meta
- Tulloch's return comes amid internal management changes at Meta, raising questions about how his expertise will shape the company's AI initiatives [12].
From Components to Systems: How Should Agent Evaluation Be Done?
机器之心· 2025-10-12 01:27
Core Insights
- The article traces the evolution of Agent Evaluation in AI, emphasizing the need for new assessment benchmarks as AI systems transition from passive language models (LLMs) to autonomous agents capable of planning and interacting with digital environments [3][4][5].

Group 1: Agent Evaluation Challenges
- Evaluating agents is complex because it requires measuring end-to-end success rates, reliability, and efficiency in dynamic environments, unlike traditional LLM evaluations that focus on static outputs [5][6] (a minimal harness sketch follows this summary).
- Agent evaluation must account for interactions with the environment and the emergent properties those interactions produce, not just the quality of text output [7][8].

Group 2: Evolution of Evaluation Paradigms
- Each generation of Agent Evaluation benchmarks reflects the growing complexity and application scope of AI systems, with successive generations designed to address the limitations of the previous ones [9][10].
- The article compares the evaluation generations, highlighting the shift from static assessments of LLMs to dynamic evaluations of agents operating in real-world scenarios [10][11].

Group 3: Key Evaluation Frameworks
- New frameworks such as GAIA, MCP-universe, MCPMark, and MCP-AgentBench have emerged to address the unique challenges of Agent Evaluation, focusing on dynamic interaction and real-time task execution [8][10].
- An agent's core value lies in its autonomy, planning capability, and environment interaction, so evaluation methods must measure these action-oriented competencies [11].
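To ground those action-oriented metrics, here is a minimal harness sketch in Python. It assumes a hypothetical `run_agent(task)` callable that executes one episode in the environment and returns `(succeeded, steps)`; the metric definitions (success on any trial, success on every trial, mean interaction steps) are common choices rather than those of any specific benchmark.

```python
from statistics import mean

def evaluate_agent(run_agent, tasks, trials=3):
    """Compute end-to-end success rate, reliability (success on *every*
    repeated trial, which penalizes flaky agents), and efficiency (mean
    interaction steps) over a task suite."""
    any_success, all_success, steps_used = [], [], []
    for task in tasks:
        outcomes = [run_agent(task) for _ in range(trials)]
        any_success.append(any(ok for ok, _ in outcomes))
        all_success.append(all(ok for ok, _ in outcomes))
        steps_used.extend(s for _, s in outcomes)
    return {
        "success_rate": mean(map(float, any_success)),
        "reliability": mean(map(float, all_success)),
        "mean_steps": mean(steps_used),
    }
```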
The Storm Returns: OpenAI Accused of Using Police to Pressure AI Regulation Advocates, While Musk Quips That It Is "Built on Lies"
机器之心· 2025-10-11 08:06
Core Viewpoint
- The article examines the controversy over OpenAI's legal actions against Nathan Calvin, an advocate for AI regulation, highlighting the implications of California's recently passed SB 53 bill and OpenAI's response to criticism over transparency and governance [1][2][3].

Group 1: Legal Actions and Controversy
- Nathan Calvin, a lawyer and member of the Encode organization, received a subpoena from OpenAI demanding private communications involving California legislators and former OpenAI employees [2][3].
- The subpoena is tied to SB 53, which requires large AI developers to disclose their safety protocols and update them regularly, effective September 30 [3][4].
- OpenAI's actions are perceived as an attempt to intimidate critics and to investigate possible funding from Elon Musk, a vocal opponent of the company [4][5].

Group 2: Reactions and Implications
- Calvin voiced his anger at OpenAI's tactics, saying the company is using legal means to suppress dissent and control the narrative around AI governance [4][5].
- Other organizations, including the Midas Project, report similar experiences with OpenAI, suggesting a broader pattern of legal pressure on transparency advocates [5].
- OpenAI's Chief Strategy Officer defended the subpoenas as necessary to protect the company's interests amid its litigation with Musk, questioning the motives behind Encode's support for Musk [7][8].