机器之心

Why do large models struggle to become "mathematicians"? Stanford and others reveal structural weaknesses in rigorous proofs
机器之心· 2025-06-22 04:26
Core Insights
- The article discusses the challenges and innovations in formalizing mathematical proofs, focusing on inequality problems and the limitations of current large language models (LLMs) in producing rigorous reasoning [1][27][38].

Group 1: Inequality Proofs and Formalization
- Inequality problems are ideal subjects for testing the rigor of mathematical reasoning because of their clear structure and logical simplicity [1].
- Current formal systems such as Lean and Coq demand such precision of expression that they are difficult to apply at scale, especially to middle- and high-school-level problems [1][5].
- A new approach from research teams at Stanford, UC Berkeley, and MIT decomposes inequality proving into two informal but verifiable sub-tasks: bound estimation and relation prediction [2][7].

Group 2: IneqMath Dataset
- The IneqMath dataset is the first benchmark for Olympiad-level inequality proofs, consisting of 1,252 training problems, 200 test problems, and 100 validation problems [12].
- The training set spans 83 theorem types across 29 theorem categories, enabling model fine-tuning [12][13].
- Each problem in the dataset has a unique correct answer, which makes results straightforward to verify [10].

Group 3: Evaluation Framework
- The research team developed an LLM-as-Judge framework comprising five automated reviewers that assess the logical rigor of an LLM's reasoning process [20][23].
- The framework evaluates whether a model merely guessed the correct answer or followed a sound logical chain at each step [23][24].
- The evaluation system aligns closely with human annotations, achieving an F1 score of 0.93, indicating that it is both reliable and scalable [24].

Group 4: Findings on LLM Performance
- The study found that while LLMs such as GPT-4 can guess answers accurately, they often fail to maintain logical rigor throughout their reasoning [27][30].
- Final-answer accuracy can be high while overall reasoning correctness remains low: some models drop from 71.5% to 6% once logical rigor is evaluated [29].
- Increasing model size or reasoning time does not significantly improve reasoning quality, suggesting that scaling alone is insufficient for producing logically sound proofs [30][32].

Group 5: Improvement Strategies
- The research identified effective improvement strategies, such as self-improvement via a critic and theorem augmentation, which raise accuracy by roughly 5% and 10% respectively [42].
- The IneqMath leaderboard encourages community participation, letting researchers submit models for evaluation on both final-answer accuracy and reasoning rigor [36][37].
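Why are the two sub-tasks "verifiable"? Because each reduces to an answer that can be checked mechanically rather than a free-form proof. The snippet below is a minimal sketch of that idea, not the authors' code: it spot-checks a claimed constant for a toy bound-estimation instance (the AM-GM inequality) by random sampling; `check_bound` and its tolerances are our own illustrative choices.

```python
import random

def check_bound(expr, bound, n_vars, trials=10_000, seed=0):
    """Numerically spot-check a claimed bound expr(x) >= bound on the
    positive orthant. A counterexample disproves the claim; surviving
    every trial only suggests (does not prove) that the bound holds."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(1e-3, 10.0) for _ in range(n_vars)]
        if expr(*xs) < bound - 1e-9:
            return False, xs  # counterexample found
    return True, None

# Toy bound-estimation instance: by AM-GM, a + b >= C * sqrt(a * b)
# holds for all a, b > 0 exactly when C <= 2, so the unique answer is C = 2.
ratio = lambda a, b: (a + b) / (a * b) ** 0.5
print(check_bound(ratio, 2.0, n_vars=2))  # (True, None): C = 2 survives
print(check_bound(ratio, 2.1, n_vars=2))  # (False, [a, b]): C = 2.1 falsified
```

A verdict like this can be computed for every candidate answer, which is what makes answer-level checking scalable even though it is weaker than a formal proof.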
Sam Altman's reminder to founders: whatever ChatGPT is going to do next, steer clear of it
机器之心· 2025-06-22 04:26
Reported by 机器之心. Editor: +0

Y Combinator recently hosted its AI Startup School event in San Francisco, inviting many of the most influential founders and experts in AI for on-stage conversations and talks. Andrej Karpathy's talk from the event went viral earlier, and now the latest interview with OpenAI CEO Sam Altman is also online.

Video: https://www.youtube.com/watch?v=V979Wd1gmTU

In the interview, Altman retraces the full journey from the hardships of his early startups to the creation of OpenAI. He shares his thinking on ambition, responsibility, and how to keep moving forward under global scrutiny, and offers deep insights on early pivotal decisions, future technology opportunities, product form factors, and his personal leadership philosophy.

The conversation offers a direct and comprehensive window into the present and future of AI, and into the thinking of one of its core drivers.

We have organized the interview into the following key questions, preserving the original meaning while giving readers a clearer structure.

What does the industry's future look like?

AI's evolution never stops, and the form of interaction is bound to iterate. Here, Sam Altman sketches an exciting technical roadmap, predicting AI's evolution from a Q&A tool into an always-on intelligent agent. He not only looks ahead to GPT-5 ...
The open-source MetaQuery is here! OpenUni rivals BLIP3-o-8B with 1.1B parameters, with data and code fully open-sourced
机器之心· 2025-06-22 04:26
Core Viewpoint
- OpenUni, developed by Nanyang Technological University's S-Lab and SenseTime, is an open-source counterpart of MetaQuery that matches the performance of an 8B model with only 1.1B parameters, releasing all code, weights, and data as open-source resources [1][18].

Architecture and Design
- OpenUni's architecture is deliberately simple, using only 6 connector layers versus MetaQuery's 24, which significantly reduces complexity [5].
- OpenUni combines 256 learnable queries that extract conditioning information from user instructions, a frozen InternVL that preserves understanding capability, 6 ViT-style transformer connector layers, and a SANA diffusion model for efficient image generation [5][6].

Performance Metrics
- OpenUni-B achieves a GenEval score of 0.84, comparable to the BLIP3-o-8B model, while OpenUni-L reaches 0.86, making it the best-performing open-source unified model [15][18].
- On DPG-Bench, OpenUni-L-1024 scores 83.08, surpassing all MetaQuery and BLIP3-o variants [15].

Training Strategy
- Training proceeds in two phases: pre-training on 23 million image-text pairs, then fine-tuning on 60,000 image-text pairs [7][9].
- The diffusion model is frozen during pre-training and unfrozen during fine-tuning to improve generation quality [8][9].

Open Source Contribution
- OpenUni releases complete open-source resources, including model weights, training code, and a 23-million-pair dataset, facilitating community research and innovation [19][20].
- The project aims to offer the research community a clear, reproducible, and extensible baseline implementation [18].
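To make the connector design concrete, here is a hedged PyTorch sketch of the bridge described above. Only the 256 learnable queries and the 6 connector layers come from the summary; the cross-attention formulation, hidden widths, and head counts are our own guesses, not OpenUni's released code.

```python
import torch
import torch.nn as nn

class OpenUniStyleConnector(nn.Module):
    """Hypothetical connector: learnable queries cross-attend to frozen-VLM
    hidden states through a small transformer stack, producing conditioning
    tokens for a diffusion decoder. Widths and head counts are guesses."""

    def __init__(self, vlm_dim=2048, cond_dim=1152, n_queries=256, n_layers=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vlm_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=vlm_dim, nhead=16,
                                           batch_first=True)
        self.connector = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(vlm_dim, cond_dim)  # match the diffusion width

    def forward(self, vlm_hidden):  # (batch, seq, vlm_dim) from the frozen VLM
        q = self.queries.expand(vlm_hidden.size(0), -1, -1)
        return self.proj(self.connector(q, vlm_hidden))  # (batch, 256, cond_dim)

cond = OpenUniStyleConnector()(torch.randn(2, 77, 2048))
print(cond.shape)  # torch.Size([2, 256, 1152])
```

Freezing the VLM (and, during pre-training, the diffusion model) leaves only a small stack like this to train, which is a plausible reading of how the design stays so light.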
From RLHF and PPO to GRPO and training reasoning models: the reinforcement learning primer you need
机器之心· 2025-06-22 04:26
Core Insights
- Reinforcement Learning (RL) has become an essential technology in AI, particularly for large language models (LLMs) [1].
- The Unsloth team has released a comprehensive reinforcement learning tutorial covering concepts from RLHF to GRPO, accessible to beginners and advanced users alike [2][3].

Group 1: Understanding Reinforcement Learning
- The goal of reinforcement learning is to increase the likelihood of "good" outcomes and reduce the likelihood of "bad" ones [8][10].
- The key components of RL are the environment, the agent, actions, and reward functions, which together define the learning process [9][14].
- RLHF (Reinforcement Learning from Human Feedback) gained popularity through OpenAI's implementation, which trains agents to produce outputs that humans judge useful [16][19].

Group 2: GRPO and Its Advantages
- GRPO (Group Relative Policy Optimization) is a method for training reasoning models; it differs from PPO (Proximal Policy Optimization) by removing the value model and relying on custom reward functions [22][24].
- GRPO estimates a baseline by sampling multiple outputs for each question and averaging their rewards, which guides the optimization [27][28].
- The approach yields significant memory savings and extends beyond coding and mathematics to tasks such as email automation and legal applications [30].

Group 3: Training with Unsloth
- Unsloth provides a detailed guide for training reasoning models with GRPO, requiring as little as 5GB of VRAM to locally train models of up to 1.5 billion parameters [44].
- Training generates multiple answer variants for each question, scores them with a reward function, and updates the model weights accordingly; a toy sketch of these pieces follows below [45][57].
- Effective training requires a well-designed reward function and sufficient data, with at least 500 rows recommended for good results [49][50].

Group 4: Reward Functions and Validators
- Reward functions and validators play complementary roles in evaluating model outputs: the former assigns scores based on correctness and quality, while the latter verifies the accuracy of the outputs [46][56].
- Example reward functions reward correct answers and penalize incorrect or overly verbose responses [61].
- Reward function design is critical, as a poorly constructed one can inadvertently degrade model performance [57].
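To ground the last two groups, here is a minimal, self-contained sketch (our own toy, not the Unsloth tutorial's code) of a correctness reward, a verbosity penalty, and GRPO's group-relative advantage, which scores each sampled answer against its group's mean reward instead of a learned value model.

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """+2.0 for the right final answer, a small penalty otherwise
    (illustrative values, not the tutorial's shipped defaults)."""
    match = re.search(r"answer\s*[:=]\s*(-?\d+)", completion.lower())
    return 2.0 if match and match.group(1) == gold else -0.5

def brevity_reward(completion: str, max_words: int = 200) -> float:
    """Mildly penalize overly verbose responses."""
    return -0.001 * max(0, len(completion.split()) - max_words)

def grpo_advantages(rewards: list) -> list:
    """GRPO's core trick: score each sampled answer against the group's
    mean reward, so no separate value model is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled answer variants for one question whose gold answer is "42":
group = ["Step by step... Answer: 42", "Answer: 41",
         "Answer: 42 " + "padding " * 300, "I am not sure."]
rewards = [correctness_reward(c, "42") + brevity_reward(c) for c in group]
print(grpo_advantages(rewards))  # the correct, concise answer scores highest
```

In a real run these per-answer rewards would feed GRPO's policy update; the point of the toy is only that correct and concise completions end up with the highest relative advantage.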
Under a probabilistic-statistical mechanism, does LLM reasoning really "understand the world"?
机器之心· 2025-06-21 06:32
Group 1
- The article discusses whether LLMs (Large Language Models) truly "understand the world" or whether their reasoning is merely a form of pattern matching, highlighting the industry debate over the nature of LLM reasoning capabilities [1][3][4].
- It references a paper from Apple arguing that current reasoning models do not genuinely think but instead engage in pattern matching, which has sparked renewed discussion in the AI community [3][4].
- The article notes that true reasoning involves understanding causal relationships, as various researchers have emphasized, and that LLMs lack the causal framework necessary for deep, flexible reasoning [5][6][7].

Group 2
- The article also explores why enterprises are increasing their spending on generative AI, noting a shift from building in-house solutions to purchasing third-party AI applications [1][2].
- It outlines an evaluation framework for selecting AI models, including the key factors that influence procurement decisions given the characteristics of traditional software purchasing [1][2].
Moonshot AI "tunes up" its strongest agent yet, setting a new SOTA on "Humanity's Last Exam"
机器之心· 2025-06-21 05:06
Core Viewpoint
- Kimi-Researcher is an advanced autonomous agent trained with end-to-end reinforcement learning; it shows significant improvements in multi-step reasoning and search, achieving state-of-the-art performance on several benchmarks [2][4][3].

Group 1: Performance Metrics
- Kimi-Researcher achieved a Pass@1 score of 26.9% and a Pass@4 accuracy of 40.17% on "Humanity's Last Exam," a substantial improvement over its initial score of 8.6% [3][4].
- On the xbench-DeepSearch subtask, Kimi-Researcher reached an average Pass@1 of 69%, outperforming other models equipped with search tools [4].

Group 2: Training Methodology
- The agent is trained with end-to-end reinforcement learning, so a single model learns planning, perception, and tool use without hand-written rules [14][24].
- Training uses a reward mechanism based on final outcomes, which keeps the preference signal consistent in dynamic environments [24].

Group 3: Context Management and Efficiency
- Kimi-Researcher employs a context-management mechanism that retains key information while discarding irrelevant documents, allowing more than 50 iterations within a single trajectory [27][30].
- Training efficiency is improved by introducing a gamma decay factor, which encourages the discovery of shorter, more efficient exploration paths (see the sketch below) [25].

Group 4: Tool Utilization and Task Design
- Training tasks are designed to necessitate specific tools, teaching the agent when and how to use multiple tools effectively in complex environments [21].
- Kimi-Researcher can conduct academic research, legal and policy analysis, clinical evidence review, and corporate financial analysis, showcasing its versatility [11][8].

Group 5: Infrastructure and Scalability
- A scalable, asynchronous rollout system improves the efficiency of agent interactions and reward calculations, significantly boosting operational performance [34][32].
- The infrastructure supports dynamic resource allocation and fault tolerance, ensuring high availability in production environments [34].
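The gamma decay factor in Group 3 is easy to illustrate. The sketch below is our reading of the idea, not Moonshot AI's code: a single outcome reward at the end of a trajectory is discounted backward, so each step of a short successful trajectory earns more credit than a step of a long one, nudging the agent toward efficient paths.

```python
def discounted_outcome_rewards(n_steps: int, outcome: float, gamma: float = 0.99):
    """Propagate a single end-of-trajectory outcome reward backward with
    gamma decay: step t of an n-step trajectory receives the outcome times
    gamma raised to the number of steps remaining after it."""
    return [outcome * gamma ** (n_steps - 1 - t) for t in range(n_steps)]

print(discounted_outcome_rewards(3, 1.0))      # short success: [0.9801, 0.99, 1.0]
print(discounted_outcome_rewards(50, 1.0)[0])  # long success: first step earns ~0.61
```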
Three large models collaborate over 1,000 iterations and discover equations like human scientists
机器之心· 2025-06-21 05:06
Core Viewpoint
- The article discusses DrSR (Dual Reasoning Symbolic Regression), an innovative framework from researchers at the Institute of Automation, Chinese Academy of Sciences, that enables large models to analyze data, reflect on failures, and optimize models the way scientists do [2][14][56].

Group 1: Framework and Mechanism
- DrSR employs a dual-path reasoning mechanism that integrates "data insights" and "experience summaries" to guide large models in scientific equation discovery [16][28].
- The framework comprises three virtual scientists (a data scientist, a theoretical scientist, and an experimental scientist), each contributing to a collaborative mechanism for efficient scientific equation discovery [3][7].

Group 2: Performance and Results
- Across interdisciplinary modeling tasks, DrSR demonstrates superior generalization, outperforming existing methods in both accuracy and efficiency [4][30].
- In modeling a nonlinear damped oscillation system, DrSR achieved 99.94% accuracy, significantly surpassing all baseline methods [31].

Group 3: Learning and Adaptation
- DrSR's process forms a closed loop: data analysis → prompt guidance → equation generation → evaluation and scoring → experience summarization, letting the model accumulate knowledge and refine its approach (see the skeleton below) [28].
- The experience-driven strategy helps the model avoid common failure structures, so it generates a higher proportion of valid equations than other methods [37].

Group 4: Robustness and Generalization
- DrSR is robust to noise and out-of-distribution (OOD) data, maintaining a low normalized mean squared error (NMSE) across tasks [40][41].
- Performance remains stable under different Gaussian noise levels, showcasing its generalization advantages [41].

Group 5: Future Directions
- DrSR has been integrated into the ScienceOne platform to provide efficient, interpretable scientific modeling services, with plans to strengthen its reasoning capability and cross-task generalization [57].
- Future improvements will extend DrSR to multi-modal scientific modeling scenarios and incorporate continuous-learning mechanisms [61].
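The closed loop in Group 3 can be written down as a skeleton. The sketch below is our paraphrase of that pipeline, not the authors' code: the LLM roles, candidate equation skeletons, and scoring are all stubbed out, where a real system would call a model and fit each candidate to data.

```python
import random

def drsr_loop_sketch(n_iters=5, seed=0):
    """Skeleton of the closed loop: data analysis -> prompt guidance ->
    equation generation -> evaluation and scoring -> experience summary.
    Every 'scientist' role here is a stub for an LLM call."""
    rng = random.Random(seed)
    insights, experience = "initial data insight", []
    best = (float("inf"), None)
    for _ in range(n_iters):
        prompt = f"insights: {insights}; lessons: {experience[-3:]}"  # prompt guidance
        _ = prompt  # would condition the generation call below
        candidate = rng.choice(["a*x + b", "a*x**2 + b", "a*exp(b*x)"])  # generation (stub)
        score = rng.random()  # evaluation and scoring (stub for fit error, e.g. NMSE)
        if score < best[0]:
            best = (score, candidate)
            experience.append(f"structures like {candidate} worked")  # experience summary
        else:
            experience.append(f"avoid structures like {candidate}")
        insights = f"residual analysis after trying {candidate}"  # data analysis
    return best

print(drsr_loop_sketch())  # (best error so far, best equation skeleton)
```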
ICML 2025 Oral | An old NAS tree blooms anew: NUS proposes the agentic supernet, slashing costs by 55%
机器之心· 2025-06-21 04:36
Core Insights
- The article introduces the "Agentic Supernet" concept, which dynamically customizes agent teams to task difficulty, outperforming existing methods by up to 11.82% while cutting inference costs to 45% of theirs [4][38].

Group 1: Challenges in Multi-Agent Systems
- Current multi-agent systems often rely on cumbersome manual configuration and prompt engineering, which is inefficient [6].
- Automated optimization methods tend to produce overly complex systems that waste resources on simple tasks [7].
- No single solution excels across all tasks, which leads to task conflicts [7].

Group 2: Paradigm Shift
- The paper proposes a shift from searching for a single optimal agent architecture to optimizing a probability distribution over candidate agent architectures [10].
- The Agentic Supernet serves as a vast repository of foundational capabilities from which components are dynamically selected and combined to suit each task [12][20].

Group 3: MaAS Framework
- The MaAS framework rests on three strategies: defining a universal blueprint, intelligent scheduling, and self-evolution [15].
- The first step builds a comprehensive Agentic Supernet containing all possible agent capabilities [16].
- The intelligent scheduler (controller) dynamically selects the most suitable skill modules for each task, ensuring efficient resource allocation [21][25].

Group 4: Performance and Cost Efficiency
- MaAS delivers superior performance across multiple benchmarks, with an average score of 83.59%, outperforming 14 baseline methods [32].
- Its inference cost averages only 45% of existing systems', and training costs are also substantially lower [33][34].
- The framework generalizes strongly, adapting effectively across tasks and domains [38].
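To make the "distribution over architectures" idea in Group 2 concrete, here is a toy sketch (our own illustration, not the MaAS code; the operator names are hypothetical): the controller holds layer-wise probabilities over operators and samples a per-query pipeline, and an early-exit operator is what lets easy queries stay cheap.

```python
import random

def sample_architecture(probs_per_layer, rng=None):
    """Sample one per-query agent pipeline from the controller's layer-wise
    distribution over operators; 'early_exit' terminates the pipeline so
    easy queries avoid spending more agent calls."""
    rng = rng or random.Random(0)
    arch = []
    for probs in probs_per_layer:  # one dict of operator -> probability per layer
        op = rng.choices(list(probs), weights=list(probs.values()))[0]
        if op == "early_exit":
            break
        arch.append(op)
    return arch

# A distribution a trained controller might assign to an easy query:
# heavy on direct answering and early exit, light on expensive operators.
easy_query = [{"io": 0.6, "cot": 0.1, "multi_agent_debate": 0.05,
               "tool_use": 0.05, "early_exit": 0.2}] * 3
print(sample_architecture(easy_query))  # e.g. ['io'] or ['io', 'cot']
```

Training would then push probability mass toward operators that earn reward per unit cost, which is one plausible way a supernet "self-evolves" per Group 3.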
A world-model version of The Sims: AI avatars give street speeches to canvass votes, and GPT-4o wins the election
机器之心· 2025-06-21 04:36
Reported by 机器之心. Editors: 泽南, 杨文

A real-world simulator.

When world models become highly evolved, what are the "people" inside them doing?

Some give street speeches and draw sizable audiences, and children play with robot dogs:

Some commit crimes in the open street and police move in to arrest them, while others propose marriage in front of a crowd:

On Friday, researchers from the University of Massachusetts Amherst (UMass Amherst), Johns Hopkins University, and Carnegie Mellon University presented a remarkable study: Virtual Community.

Virtual Community combines real-world geospatial data with generative models to create a socially grounded, interactive, and scalable open-world setting for many different types of agents.

The work was submitted last night and immediately drew the attention of prominent figures in the AI community; NYU assistant professor Saining Xie said it is highly significant for agent research.

Virtual Community provides a unified framework for simulating the rich social and physical interactions of humans and robots within a community. It is built on a general-purpose physics engine and grounded in real-world 3D scenes.

The authors implemented a virtual-character simulation framework for the human agents, while the robot simulation is largely inherited from Genesis.

Virtual Community supports generating agent communities grounded in 3D scenes by populating the environment with LLM-powered agents configured with robots, human character profiles, and social relationship networks.

This ...
Foreign media: Apple internally discussing a Perplexity acquisition, a $14 billion deal that would be its largest ever?
机器之心· 2025-06-21 04:36
Core Viewpoint
- Apple is considering acquiring AI startup Perplexity to enhance its search capabilities and reduce its reliance on Google [1][2][6].

Group 1: Acquisition Discussions
- Apple executives have held internal talks about a potential acquisition of Perplexity, though discussions are at an early stage and no formal offer has been made [2][9].
- Perplexity's team, with backgrounds at top AI labs such as OpenAI and Google, is a significant draw for Apple given their AI expertise [2][3].
- Earlier acquisition talks between Meta and Perplexity ended without an agreement after Perplexity withdrew from negotiations [3][4].

Group 2: Strategic Implications
- Integrating Perplexity's AI-driven search into Apple's Safari browser could help the company move away from its long-standing partnership with Google, worth roughly $20 billion annually [5][6].
- The rise of AI search options such as Perplexity and ChatGPT is eroding traditional search engine usage, particularly among younger users [6][7].
- Perplexity's latest valuation reached $14 billion, which would make the deal Apple's largest acquisition on record if it happened [8].

Group 3: Company Statements
- Perplexity has stated publicly that it is not aware of any ongoing merger discussions with Apple [9].
- Perplexity's Chief Business Officer expressed skepticism about the likelihood of either an acquisition or a partnership resembling Meta's deal with Scale AI [10].