机器之心
Is Chain-of-Thought an Illusion? Re-examining Large-Model Reasoning from a Data-Distribution Perspective; Musk Replies, Grok Loses Its Cool
机器之心· 2025-08-14 09:11
Core Viewpoint
- The research suggests that Chain-of-Thought (CoT) reasoning in large language models (LLMs) may not represent true reasoning but rather a replication of patterns learned from training data, leading to fragility when faced with out-of-distribution tasks [2][10][37].

Data Distribution Perspective on CoT
- The effectiveness of CoT is attributed to the "structured inductive bias" learned within the training distribution, indicating that the reasoning chains are merely reproductions of common patterns rather than genuine logical deductions [13][37].
- A theoretical framework is introduced to quantify the relationship between training and testing distributions, highlighting how distribution shifts degrade reasoning performance [15].

Experimental Findings on Generalization
- In "task generalization," the model reaches nearly 100% accuracy within the training distribution, but accuracy collapses to 0.01% under even slight distribution shifts, indicating a lack of true generalization [23].
- Supervised fine-tuning on a small amount of new data can restore performance, but this only expands the existing distribution boundary without conferring abstract generalization [24].
- In "length generalization," even minor changes in input sequence length significantly degrade performance; the model tends to generate reasoning chains matching the lengths seen in training [26].
- The model is highly sensitive to format changes: even minor alterations to the input prompt can cause complete reasoning failure [28].

Universal Sensitivity to Distribution Shifts
- Sensitivity to distribution shifts holds across sampling temperatures and model sizes, indicating that this issue is not isolated to specific models [31].
Practical Implications
- In high-risk fields such as healthcare and finance, reliance on CoT for robust reasoning is cautioned against, as plausible-looking but misleading reasoning chains can be more dangerous than outright incorrect answers [34].
- Evaluation methods that depend on validation sets closely aligned with the training distribution may overestimate model robustness, necessitating stricter out-of-distribution testing [35].
- While supervised fine-tuning can quickly boost performance on specific tasks, it does not equip models with true abstract reasoning capability [36].
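The collapse under distribution shift described above can be illustrated with a deliberately crude stand-in: a lookup table that only reproduces training patterns. This toy is an assumption for illustration, not the paper's experimental setup, but it captures the claimed behavior of near-perfect in-distribution accuracy and near-zero accuracy just outside it:

```python
def evaluate_shift(train_pairs, test_pairs):
    """Toy illustration: a learner that only reproduces patterns seen
    in training answers perfectly in-distribution but collapses under
    even a small distribution shift. (A lookup table stands in for
    the model; this is an assumption for illustration.)"""
    lookup = dict(train_pairs)                 # the "training distribution"
    correct = sum(lookup.get(x) == y for x, y in test_pairs)
    return correct / len(test_pairs)

# Train on sums of small numbers, test on slightly larger ones.
train = [((a, b), a + b) for a in range(10) for b in range(10)]
shifted = [((a, b), a + b) for a in range(10, 20) for b in range(10)]

in_dist_acc = evaluate_shift(train, train)     # perfect within distribution
ood_acc = evaluate_shift(train, shifted)       # collapses just outside it
```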
Grind LeetCode for 100 Hours and Learn to Get Referrals: An OpenAI Employee Teaches You How to Land an Offer
机器之心· 2025-08-14 09:11
Compiled by the 机器之心 editorial team. OpenAI has led wave after wave of advances in AI, and many people no doubt wonder: how did the researchers behind these innovations get through their interviews? This is especially true now that OpenAI has become one of the world's most closely watched AI companies, drawing résumés from countless top candidates; joining the team is genuinely difficult. Recently, Bas van Opheusden, a research scientist who joined OpenAI less than two months ago, shared his job-hunting experience in an interview guide that runs to eight pages. According to LinkedIn, Bas van Opheusden joined OpenAI this July as a researcher and holds a PhD from New York University. In the guide he covers mindset, preparation strategy, coding skills, and more, sharing his lessons learned and advice. OpenAI new hire shares interview tips. The following is from the original document: Original link: https://docs.google.com/document/d/1ZV73D2vgaj2yu_tjN3TVOP6QVLWVPXJB2rrqSZQxYtI/edit?tab=t.0 Opheusden stresses that the first priority is protecting your physical and mental health. The interview process is stressful: a conversation of just 30 minutes can change your life dramatically, for better or worse, and the process ...
Verbose Responses Cut by 80%: DeepSeek's GRPO Gets a Disruptive Improvement as Microsoft's GFPO Arrives
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article discusses the introduction of a new reinforcement learning algorithm called Group Filtered Policy Optimization (GFPO), which aims to enhance the efficiency of reasoning models by significantly reducing unnecessary token length during inference while maintaining accuracy [2][3][9].

Summary by Sections

Introduction to GFPO
- GFPO trades additional computation at training time for savings at test time, achieving up to an 80% reduction in token length during inference [3][5].

Background on GRPO
- Group Relative Policy Optimization (GRPO) is a simplified variant of the Proximal Policy Optimization (PPO) algorithm that does not require a value model for baseline advantage estimation [7][8].
- GRPO relies on a single scalar reward signal, making it hard to optimize multiple response attributes simultaneously and driving response lengths up [8][9].

Mechanism of GFPO
- GFPO optimizes for desired response attributes by sampling a larger candidate response group and filtering it on specific characteristics [11].
- The advantages of the retained responses are normalized by their mean and standard deviation, so only the most relevant responses inform policy updates [13][14].

Adaptive Difficulty in GFPO
- An adaptive variant allocates more training signal to harder problems, dynamically adjusting the number of retained responses by problem difficulty [21][22].

Experimental Findings
- Sampling more responses is important for reducing response lengths effectively [28].
- Token-efficiency optimization yields significant length reductions of 70.9% to 84.6% across benchmarks while maintaining accuracy [31].
- GFPO effectively mitigates out-of-distribution length inflation while slightly improving accuracy [32].
- The adaptive-difficulty variant outperforms the Shortest-k algorithm in length reduction across multiple benchmarks [31][40].

Conclusion
- GFPO substantially reduces unnecessary response length during reasoning and validation, achieving a 94.4% reduction in excess length for answers and a 66.7% reduction for validation steps on specific benchmarks [44].
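The filtering-and-normalization mechanism described above can be sketched in a few lines. This is a minimal illustration under assumed inputs (per-response rewards and token lengths, with reward-per-token as the filtering metric), not Microsoft's actual implementation:

```python
import statistics

def gfpo_advantages(rewards, lengths, k):
    """Sketch of GFPO's core step: sample a large candidate group, keep
    only the k responses best on the filtering metric (here: reward per
    token), then normalize advantages over the retained subset.
    Non-retained responses receive zero advantage."""
    # Rank candidates by the filtering metric (token efficiency).
    efficiency = [r / l for r, l in zip(rewards, lengths)]
    order = sorted(range(len(rewards)), key=lambda i: efficiency[i], reverse=True)
    retained = order[:k]

    # Normalize rewards of the retained subset by its mean and std.
    subset = [rewards[i] for i in retained]
    mean, std = statistics.mean(subset), statistics.stdev(subset)

    advantages = [0.0] * len(rewards)
    for i in retained:
        advantages[i] = (rewards[i] - mean) / (std + 1e-8)
    return advantages

# Eight sampled responses: similar rewards, widely varying lengths.
adv = gfpo_advantages(rewards=[1.0, 0.9, 1.0, 0.2, 0.8, 1.0, 0.1, 0.9],
                      lengths=[120, 80, 300, 150, 90, 60, 200, 400],
                      k=4)
```

Filtering before normalization is the key design choice: long but correct responses are simply never retained, so the policy update only ever sees the short, high-reward candidates.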
A Thousand Teams Compete! The First "Qizhi Cup" Algorithm Competition Concludes Successfully, Accelerating Real-World AI Applications
机器之心· 2025-08-14 04:57
Core Viewpoint
- Artificial intelligence is transitioning from theoretical exploration to large-scale application, becoming a new engine for high-quality economic and social development in China [1].

Group 1: Event Overview
- The "Qizhi Cup" algorithm innovation application challenge was officially launched on May 20, 2025, by Qiyuan Laboratory, aiming to promote the practical application of intelligent algorithms [1].
- The competition attracted 1,022 teams from universities, research institutions, and technology companies, with three teams winning across the different tracks [2][20].

Group 2: Competition Tracks
- The competition featured three main tracks: "Robust Instance Segmentation of Satellite Remote Sensing Images," "Drone Ground Target Detection for Embedded Platforms," and "Adversarial Challenges for Multimodal Large Models" [4][20].
- Each track targeted a core capability: robust perception, lightweight deployment, or adversarial defense [4].

Group 3: Track Summaries
Robust Instance Segmentation of Satellite Remote Sensing Images
- This track aimed at precise segmentation of complex targets in high-resolution remote sensing images, addressing challenges such as occlusion and domain differences [6].
- The champion team from South China University of Technology used an optimized Co-DETR model, enhancing feature learning through multi-task training [8][9].
Drone Ground Target Detection for Embedded Platforms
- This track required algorithms to achieve high recognition accuracy while running efficiently on resource-constrained platforms [9][21].
- The winning team, "Duan Yan Wu Ping," achieved high precision under hardware limitations by moving from YOLOv11 to a Transformer-based Co-DETR model [10][12].
Adversarial Challenges for Multimodal Large Models
- This track evaluated models on accuracy, robustness, and resistance to attacks in visible-light remote sensing scenarios [14].
- The winning team from Sun Yat-sen University developed a robust and reliable model using a systematic optimization approach [16][18].

Group 4: Industry Implications
- The "Qizhi Cup" serves as a platform for integrating cutting-edge algorithms with practical applications, emphasizing the adaptability and engineering feasibility of models in dynamic environments [20][21].
- The competition fosters AI talent development, deepening participants' understanding of business and data while bridging the gap between theory and engineering [23].
ICCV 2025 | HERMES: The First World Model Unifying 3D Scene Understanding and Generation
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article discusses advances in autonomous driving technology, emphasizing the need for a unified model that both understands the current environment and predicts future scenarios effectively [7][10][30].

Research Background and Motivation
- Recent progress in autonomous driving requires vehicles to deeply understand the current environment and accurately predict future scenarios in order to navigate safely and efficiently [7].
- The separation of "understanding" and "generation" in mainstream solutions limits effective decision-making in real-world driving scenarios [8][10].

Method: HERMES Unified Framework
- HERMES proposes a unified framework in which a shared large language model (LLM) drives both understanding and generation tasks simultaneously [13][30].
- The framework addresses challenges such as efficiently ingesting high-resolution images and integrating world knowledge with predictive capability [11][12].

HERMES Core Design
- HERMES adopts Bird's-Eye View (BEV) as the unified scene representation, efficiently encoding multiple camera images while preserving spatial relationships and semantic detail [18].
- World Queries connect understanding to future prediction, strengthening the model's ability to generate accurate future scenarios [19][20].

Joint Training and Optimization
- HERMES is trained jointly with two optimization objectives: a language-modeling loss for understanding tasks and a point-cloud generation loss for accurate future prediction [21][22][23].

Experimental Results and Visualization
- HERMES delivers superior performance on scene understanding and future generation tasks on datasets such as nuScenes and OmniDrive-nuScenes [26].
- The model generates coherent future point clouds and accurately describes driving scenes, showcasing its comprehensive capabilities [27].

Summary and Future Outlook
- HERMES presents a new paradigm for autonomous-driving world models, effectively bridging the gap between 3D scene understanding and future generation [30].
- Compared with traditional models, it shows significant improvements in prediction accuracy and understanding tasks, validating the effectiveness of unified modeling [31].
Just Launched: The Web's Most Capable Agent Model for Image-and-Text Research Is Here; After Trying It, I Uninstalled My Browser
机器之心· 2025-08-14 04:57
Core Viewpoint
- The article emphasizes the rapid development and open-sourcing of domestic AI models in China, particularly the advances made by Kunlun Wanwei in multi-modal AI and intelligent agents [1][47].

Group 1: Open-source Models and Developments
- In July, the Chinese AI community released an impressive 33 open-source models, with major players such as Kunlun Wanwei, Alibaba, and Tencent participating [1].
- In August, Kunlun Wanwei continued to release significant models, including the second-generation reward model Skywork-Reward-V2 and the multi-modal understanding model Skywork-R1V3 [1].
- Kunlun Wanwei launched a week-long technology release event showcasing models across multi-modal AI applications [1].

Group 2: Skywork Deep Research Agent
- On August 14, Kunlun Wanwei released an upgraded Skywork Deep Research Agent with enhanced multi-modal information retrieval and generation capabilities [3].
- The agent achieved 27.8% accuracy in conventional reasoning mode and 38.7% in its proprietary "parallel thinking" mode, setting a new industry SOTA record [4].
- It also excelled on the GAIA benchmark, surpassing all competitors in complex task performance [6].

Group 3: Multi-modal Capabilities
- The agent integrates multi-modal retrieval and understanding, processing images and charts to improve the completeness and accuracy of research reports [12].
- It can generate detailed reports with rich visual content, including graphs and charts, while citing all data sources [21][22].
- The system employs technologies such as MM-Crawler for efficient data collection and a multi-agent architecture for task execution [29][30].

Group 4: Technological Innovations
- Skywork Deep Research Agent V2 incorporates several key enhancements, including high-quality data synthesis, end-to-end reinforcement learning, and efficient parallel reasoning [40].
- Its architecture supports dynamic task management and collaboration among multiple agents, improving adaptability and efficiency [44].
- Innovations in data-quality standards and complex problem-solving strategies enhance the agent's learning and reasoning capabilities [41][42].

Group 5: Industry Trends and Future Outlook
- The AI industry's focus is shifting from building singular powerful models to open-source collaboration and practical application deployment [47].
- Companies that can build comprehensive toolchains and application ecosystems on top of open-source models are likely to gain a competitive edge [49].
- Kunlun Wanwei's recent releases signal its commitment to advancing multi-modal AI and securing a strong position in the global AI competition [50].
HKU Joins Forces with Moonshot AI and Others to Open-Source OpenCUA: Anyone Can Build Their Own Computer-Use Agent
机器之心· 2025-08-14 01:26
Core Viewpoint
- The article discusses the launch of OpenCUA, an open-source framework for developing computer-use agents (CUAs), whose flagship model OpenCUA-32B achieved a 34.8% success rate on the OSWorld-Verified benchmark, surpassing GPT-4o [1][37].

Group 1: OpenCUA Framework
- The framework consists of tools for capturing human-computer interactions, a large-scale dataset called AgentNet, and a workflow that converts demonstrations into "state-action" pairs with reasoning [6][9].
- It aims to expand data collection across different computer environments and user scenarios, minimizing restrictions on user interactions to enhance scalability [11][12].

Group 2: AgentNet Tool and Dataset
- The AgentNet Tool is a cross-platform application that records user interactions on Windows, macOS, and Ubuntu, capturing screen videos and metadata as real-world computer-usage demonstrations [13][15].
- The AgentNet dataset contains 22,625 manually annotated computer-usage tasks spanning more than 140 applications and 190 websites, averaging 18.6 steps per task, reflecting task complexity [23][20].

Group 3: OpenCUA Model
- The OpenCUA model integrates reflective long-chain reasoning and cross-domain data, enabling it to perform computer-operation tasks in real desktop environments [29][30].
- Model variants, including OpenCUA-7B and OpenCUA-32B, were evaluated against multiple benchmarks and outperformed existing models [35][37].

Group 4: Experimental Results
- OpenCUA-32B achieved the highest performance among open-source models with a 34.8% average success rate on the OSWorld-Verified benchmark, significantly closing the gap with proprietary agents [37][38].
- Performance improved with the scale of training data, indicating strong potential for further gains at test time [45][49].

Group 5: Conclusion
- OpenCUA fills a critical gap in computer-use agent development by providing a comprehensive open-source stack: annotation infrastructure, data-processing pipelines, diverse datasets, efficient training strategies, and systematic evaluation benchmarks [50].
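The demonstration-to-"state-action" conversion described above might look roughly like the following sketch. The field names and event schema are illustrative assumptions, not OpenCUA's actual data format:

```python
from dataclasses import dataclass

@dataclass
class StateActionStep:
    """Sketch of one 'state-action' training example: a screen
    observation, the model's reflective reasoning, and the concrete
    GUI action. (Field names are assumptions for illustration.)"""
    screenshot: str   # observation: screen capture at this step
    reasoning: str    # natural-language chain of thought for the step
    action: str       # executable action, e.g. a click or keystroke

def demo_to_pairs(events):
    """Convert a recorded demonstration (a list of raw events) into
    state-action steps, pairing each screen state with the action
    taken from it; missing reasoning is left empty for later synthesis."""
    return [StateActionStep(e["screen"], e.get("thought", ""), e["action"])
            for e in events]

steps = demo_to_pairs([
    {"screen": "step_000.png", "thought": "Open the browser first.",
     "action": "click(64, 1040)"},
    {"screen": "step_001.png", "action": "type('weather forecast')"},
])
```

Keeping reasoning as an explicit field per step is what lets the training workflow attach reflective long-chain reasoning to otherwise raw click streams.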
Cracking the RL Training Challenge for "Long-Horizon Agents": Tencent's RLVMR Framework Lets a 7B Model "Think" on Par with GPT-4o
机器之心· 2025-08-14 01:26
Core Viewpoint
- The article discusses RLVMR, a framework developed by Tencent's Hunyuan AI Digital Human team that enhances AI agents' reasoning by rewarding the quality of their thought processes rather than just the outcomes, addressing inefficiencies in long-horizon tasks and improving generalization [4][26].

Group 1: Challenges in Current AI Agents
- Many AI agents succeed at tasks through luck and inefficient trial and error rather than effective reasoning [2].
- Low-efficiency exploration arises when agents take meaningless actions, resulting in high training costs and low reasoning efficiency [2].
- Generalization is fragile because strategies learned through guessing lack a logical foundation and break down on new tasks [3].

Group 2: RLVMR Framework Introduction
- RLVMR introduces a meta-reasoning approach that rewards good thinking processes, enabling end-to-end reinforcement learning of reasoning on long-horizon tasks [4][6].
- Agents label their own cognitive states, enhancing self-awareness and making their thought processes trackable [7].
- A lightweight verification rule evaluates the quality of the agent's thinking in real time, immediately rewarding good reasoning and penalizing ineffective habits [8].

Group 3: Experimental Results
- The RLVMR-trained 7B model achieved an 83.6% success rate on the most challenging L2 generalization tasks in ALFWorld and ScienceWorld, outperforming all previous state-of-the-art models [11].
- The number of actions required to solve tasks in complex environments fell by up to 28.1%, indicating more efficient problem-solving paths [13].
- Training converged faster with more stable strategies, significantly alleviating ineffective exploration [13].

Group 4: Insights from RLVMR
- A reflection mechanism lets agents identify problems and adjust strategies instead of blindly retrying, sharply reducing repeated actions and raising task success rates [19].
- Rewarding good reasoning habits establishes a flexible problem-solving framework that generalizes to unseen tasks [20][21].
- The two-phase recipe of cold-start SFT followed by reinforcement learning aligns with cognitive principles, suggesting that teaching agents how to think before letting them learn from mistakes is more efficient [22][24].

Group 5: Conclusion and Future Outlook
- RLVMR represents a paradigm shift from outcome-oriented to process-oriented training, effectively addressing low-efficiency exploration and fragile generalization in long-horizon tasks [26].
- The ultimate goal is AI agents capable of independent thinking and rational decision-making, moving beyond mere shortcut-seeking behavior [26][27].
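A lightweight, verifiable process reward of the kind described above could be sketched as follows. The tag names, rules, and reward values here are illustrative assumptions, not the paper's actual reward design:

```python
def meta_reasoning_reward(trajectory):
    """Toy sketch of RLVMR-style process rewards: each step carries a
    self-assigned cognitive-state tag plus an action, and simple
    verifiable rules score the thinking process rather than only the
    final outcome. (All tags and reward values are assumptions.)"""
    reward = 0.0
    prev_action, prev_failed = None, False
    for step in trajectory:
        tag, action, failed = step["tag"], step["action"], step["failed"]
        if action == prev_action and prev_failed:
            reward -= 1.0   # penalize blindly retrying a failed action
        elif prev_failed and tag == "reflect":
            reward += 0.5   # reward reflecting after a failure
        if tag == "plan" and prev_action is None:
            reward += 0.5   # reward planning before the first action
        prev_action, prev_failed = action, failed
    return reward

# Plans first, reflects after a failure: earns process reward.
traj = [
    {"tag": "plan",    "action": "open drawer",  "failed": True},
    {"tag": "reflect", "action": "look around",  "failed": False},
    {"tag": "explore", "action": "open cabinet", "failed": False},
]
```

Because each rule checks only the tagged trajectory itself, the reward is verifiable without a learned judge, which is what keeps it cheap enough for end-to-end RL.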
The US Computer Science Job Market Has Cratered: Elite Graduates Send 5,000 Applications with No Response, Faring Worse Than Biology or Art History Majors; Even McDonald's Turns Them Away
机器之心· 2025-08-13 09:29
Core Viewpoint
- The article highlights the paradox of high unemployment among computer science graduates despite the booming AI industry, suggesting that AI may be displacing entry-level jobs in technology [1][2][3].

Employment Situation
- Recent data from the New York Federal Reserve puts unemployment rates for computer science and computer engineering graduates at 6.1% and 7.5%, respectively, significantly higher than the 3% rate for biology and art history graduates [2][3].
- This trend challenges the long-held belief that STEM fields, particularly computer science, guarantee better job prospects [3].

Job Market Dynamics
- AI tools are reshaping the job market: demand for entry-level software engineers is shrinking as companies increasingly adopt AI programming assistants [18].
- Many graduates face unprecedented pressure in their job search, with reports of applicants submitting thousands of resumes without securing an interview [14][18].

Graduate Experiences
- Personal accounts illustrate the harsh reality: one graduate applied for over 5,700 tech jobs and received only 13 interview opportunities [15][18].
- Many graduates are now considering alternative career paths, including blue-collar jobs, as the tech industry grows more competitive and automated [12][18].

Educational Trends
- The number of computer science graduates has surged, with over 170,000 graduating last year, more than double the 2014 figure [20].
- The job market has not kept pace with this influx, producing a stark contrast between promised high salaries and the current employment landscape [20][21].

Industry Outlook
- Computer science, once seen as a "golden ticket," has lost its luster, leaving many graduates feeling deceived by the industry's earlier assurances [21][22].
Goodbye, Transformer; Reshaping the Machine-Learning Paradigm: Shanghai Jiao Tong University Unveils the First "Brain-Like" Large Model
机器之心· 2025-08-13 09:29
Core Viewpoint
- The article introduces BriLLM, a new language model inspired by human brain mechanisms that aims to overcome the limitations of traditional Transformer-based models: high computational demands, lack of interpretability, and context-size restrictions [3][8].

Group 1: Limitations of Current Models
- Current Transformer-based models face three main issues: high computational requirements, black-box opacity, and context-size limitations [6][8].
- The self-attention mechanism in Transformers has O(n²) time and space complexity, so computational cost grows quadratically with input length [7].
- The internal logic of Transformers lacks transparency, making the model's decision-making process hard to understand [7][8].

Group 2: Innovations of BriLLM
- BriLLM introduces a new learning mechanism called SiFu (Signal Fully-connected Flowing), which replaces traditional prediction operations with signal transmission, mimicking the way neural signals propagate in the brain [9][13].
- The model architecture is a directed graph in which every node is interpretable, unlike traditional models that offer limited interpretability only at the input and output layers [9][19].
- BriLLM supports unlimited context processing without increasing model parameters, allowing efficient handling of long sequences [15][16].

Group 3: Model Specifications
- BriLLM comes in two versions, BriLLM-Chinese and BriLLM-English, with non-sparse model sizes of 16.90 billion parameters for both languages [21].
- The sparse Chinese model has 2.19 billion parameters and the sparse English model 0.96 billion, a parameter reduction of approximately 90% [21].
- The design allows integration of multiple modalities, enabling the model to process not just language but also visual and auditory inputs [25][26].

Group 4: Future Prospects
- The team aims to develop a multi-modal brain-inspired AGI framework integrating perception and motion [27].
- BriLLM has been selected for funding under Shanghai Jiao Tong University's "SJTU 2030" plan, which supports groundbreaking research projects [27].
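To make the "signal transmission instead of prediction" idea concrete, here is a deliberately rough toy: tokens as nodes of a directed graph, with the next token chosen as the neighbor reached by the strongest signal along an edge. The graph, energy values, and selection rule are all illustrative assumptions and not BriLLM's actual mechanism:

```python
def sifu_next_token(graph, edge_energy, current):
    """Toy sketch of the SiFu idea: a signal at the current node flows
    to whichever outgoing neighbor carries the strongest edge energy;
    that neighbor becomes the next token. (Graph, energies, and the
    argmax rule are assumptions for illustration.)"""
    neighbors = graph.get(current, [])
    if not neighbors:
        return None
    return max(neighbors, key=lambda n: edge_energy.get((current, n), 0.0))

# Tiny token graph with learned-looking edge energies.
graph = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"]}
edge_energy = {("the", "cat"): 0.9, ("the", "dog"): 0.4, ("cat", "sat"): 1.0}

path = ["the"]
while (nxt := sifu_next_token(graph, edge_energy, path[-1])) is not None:
    path.append(nxt)
```

In a graph formulation like this, every node corresponds to a token and can be inspected directly, which is the interpretability property the article attributes to BriLLM.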