机器之心
Has NVIDIA's AI started taking over entire projects? SATLUTION's self-evolving codebase tops the SAT Competition
机器之心· 2025-09-11 03:36
Core Viewpoint - The article discusses the emergence of AI frameworks capable of developing complex software, specifically highlighting NVIDIA Research's SATLUTION, which has demonstrated superior performance in solving SAT problems compared to human-designed solvers [1][3][5].

Group 1: SATLUTION Framework
- SATLUTION is the first framework that extends the code evolution capabilities of LLMs from "algorithm kernels" to "complete codebases," handling complex projects with thousands of lines of C/C++ code [3][4].
- The framework coordinates LLM agents under strict correctness verification and distributed runtime feedback to iteratively optimize the SAT solver's codebase [4][9].

Group 2: Performance and Results
- In the 2025 SAT Competition, the SATLUTION-evolved solver outperformed the human-designed champions, achieving lower PAR-2 scores, which indicate better performance [5][7].
- SATLUTION demonstrated a clear and robust performance improvement trajectory over 70 evolution cycles, surpassing the human-designed solvers by roughly the 50th iteration [19][21].

Group 3: Evolution Process
- The system operates with a dual-agent architecture: a planning agent that strategizes modifications and a coding agent that implements them [10].
- A dynamic rules system guides the evolution process, encoding domain knowledge and constraints to keep it efficient and stable [11][12].

Group 4: Validation and Evaluation
- Each new solver version undergoes a rigorous two-stage validation process: compilation tests, then correctness verification against known benchmarks [14][15].
- Validated solvers are evaluated in a distributed manner across 800 CPU nodes, providing near real-time performance feedback [15].

Group 5: Cost Efficiency
- The entire SATLUTION self-evolution experiment cost under $20,000, far less than the cost of the months or years of expert effort typically required to develop competitive SAT solvers [21].
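The PAR-2 metric mentioned above is the SAT Competition's standard scoring rule: a solved instance contributes its wall-clock runtime, while an unsolved instance is penalized at twice the time limit, so lower scores are better. A minimal sketch of the computation (the 5000-second limit and the sample runtimes are illustrative, not figures from the article):

```python
def par2_score(runtimes, timeout=5000.0):
    """PAR-2: solved instances count their runtime; unsolved ones count 2x the limit.

    `runtimes` holds wall-clock seconds per benchmark instance, with None
    marking an instance the solver failed to solve within the limit.
    """
    total = 0.0
    for t in runtimes:
        if t is None or t >= timeout:
            total += 2.0 * timeout  # penalized: not solved within the limit
        else:
            total += t
    return total / len(runtimes)

# Lower is better: one timeout out of four instances dominates the average.
print(round(par2_score([120.0, 430.5, None, 88.2]), 3))  # → 2659.675
```

This is why a solver that times out slightly less often can beat one that is faster on easy instances: every unsolved instance costs a flat 2× the limit.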
Large-model agents can do more than write code: they can be trained as white-hat hackers
机器之心· 2025-09-11 03:36
Core Insights
- The article discusses the emerging role of AI in cybersecurity, particularly through the development of large models that can simulate attacks and identify vulnerabilities without causing harm [2][3]
- Amazon AWS AI's Q Developer team has introduced two innovative methods for training cybersecurity models, Cyber-Zero and CTF-Dojo, marking a shift from general tasks to the front line of cybersecurity [3][9]

Summary by Sections

Cyber-Zero
- Cyber-Zero focuses on generating high-quality training data without relying on real runtime environments, utilizing existing knowledge and language modeling to create behavior trajectories [10][11]
- The method extracts key steps from public CTF competition writeups, allowing the model to simulate interactions between attackers and defenders in a safe, text-based environment [11][13]
- Experimental results indicate that Cyber-Zero can produce diverse and effective training data, achieving comparable or superior performance in vulnerability detection and attack-path reasoning relative to data generated in real environments [13][15]

CTF-Dojo
- CTF-Dojo provides a real operational environment in which AI models execute commands, interact with systems, and discover vulnerabilities, complementing the virtual training offered by Cyber-Zero [16][19]
- The team developed CTF-Forge, a tool that automates the setup of CTF challenges, significantly reducing the time and labor required to create a stable operating environment for large language models [16][19]
- The CTF-Dojo dataset includes 658 independent task instances from top-tier CTF competitions, covering categories such as web vulnerabilities and binary exploitation [19][21]

Performance Evaluation
- Models trained with CTF-Dojo demonstrated significant improvements on benchmark tasks, with the best-performing model achieving an absolute gain of 11.6% on the EnIGMA+ benchmark [22][24]
- The results highlight the scalability and effectiveness of using real execution data to enhance the performance of cybersecurity models, suggesting a pathway for AI to approach human-level capabilities in ethical hacking [24][26]

Future Implications
- The integration of Cyber-Zero and CTF-Dojo creates a comprehensive training and operational framework for AI in cybersecurity, addressing both data generation and practical application challenges [26][27]
- Potential applications of AI white-hat hackers include automated code scanning, vulnerability discovery, and personalized training in educational settings, indicating a broad future impact [27][28]
- However, the dual-use nature of this technology raises concerns about potential misuse for offensive purposes, necessitating discussion of how to balance accessibility with security [28][29]
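The Cyber-Zero idea above, deriving behavior trajectories from public writeups instead of a live runtime, can be illustrated with a toy sketch. The real pipeline uses an LLM to expand writeup steps into simulated attacker/defender turns; the function below only extracts numbered steps to show the shape of the resulting training data (the parsing rule and field names are invented for illustration):

```python
import re

def writeup_to_trajectory(writeup: str):
    """Turn a numbered CTF writeup into per-turn action records.

    A stand-in for writeup-derived trajectory generation: each numbered step
    becomes one 'turn' a model could be trained to reproduce, with no real
    environment involved.
    """
    steps = re.findall(r"^\d+\.\s*(.+)$", writeup, flags=re.MULTILINE)
    return [{"turn": i, "action": s} for i, s in enumerate(steps, 1)]

demo = """1. Run strings on the binary to find the hidden prompt.
2. Overflow the 64-byte buffer to overwrite the return address.
3. Jump to the win() function and read the flag."""

for step in writeup_to_trajectory(demo):
    print(step["turn"], step["action"])
```

The appeal of this approach, as the article notes, is that the "environment" is entirely textual, so no vulnerable system ever has to be stood up during data generation.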
Just now: Thinking Machines Lab publishes its first long-form article, revealing the truth about nondeterminism in LLM inference
机器之心· 2025-09-11 03:36
Core Viewpoint - The article discusses the challenges of achieving reproducibility in large language models (LLMs) due to the lack of batch invariance, which leads to nondeterministic outputs even under controlled conditions [10][41][46].

Group 1: Introduction to the Issue
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, published its first article addressing nondeterminism in LLM inference [1][3].
- The blog aims to cover a wide range of topics related to their research, including numerical computation and prompt engineering [3].

Group 2: Understanding Nondeterminism
- Reproducibility is a cornerstone of scientific progress, yet obtaining consistent results from LLMs is challenging [10].
- Even with the temperature parameter set to 0, LLM APIs can still produce nondeterministic outputs [11].
- The nondeterminism is attributed to floating-point non-associativity and concurrency, which affect the order of operations in GPU computations [13][30].

Group 3: The Root Cause of Nondeterminism
- The article argues that the common assumption linking concurrency and floating-point operations to nondeterminism does not fully explain the issue [14][30].
- Floating-point non-associativity yields different results depending on the order of operations, especially in parallel computations [19][26].
- The actual implementation of the kernels used in LLM inference contributes to the observed nondeterministic behavior [27][30].

Group 4: Batch Invariance
- The lack of batch invariance is identified as the key factor causing nondeterminism in LLM outputs [41][46].
- Changes in batch size can lead to different results for the same input, which is counterintuitive for mathematical functions [43].
- The article emphasizes that making kernels batch-invariant is crucial for achieving consistent outputs in LLM inference [46].

Group 5: Solutions for Achieving Determinism
- The article outlines strategies for implementing batch invariance in key operations such as RMSNorm, matrix multiplication, and attention [49][60][71].
- By ensuring that these operations do not depend on batch size, LLM inference can produce consistent results [46][81].
- The authors demonstrate deterministic inference using their batch-invariant kernel library [82].

Group 6: Performance Considerations
- Initial performance tests indicate that while the batch-invariant kernels are not yet fully optimized, they do not cause catastrophic performance declines [89].
- The article highlights the importance of maintaining performance while achieving deterministic outputs in LLMs [88].

Group 7: Implications for Reinforcement Learning
- Deterministic inference enables true on-policy reinforcement learning by ensuring consistent outputs between training and inference [90].
- This consistency is essential for effective training and sampling in reinforcement learning environments [90].

Group 8: Conclusion
- The article advocates a proactive approach to understanding and addressing the sources of nondeterminism in LLMs, encouraging the community to strive for reproducibility in AI systems [93].
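The floating-point non-associativity at the heart of the argument is easy to reproduce in plain Python. The batch-splitting example below is a toy analogue of a reduction whose result depends on how the work is grouped, not a reconstruction of the blog's actual GPU kernels:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)  # 0.0 -- the 0.1 is absorbed when added to 1e20 first
print(a + (b + c))  # 0.1

# A toy "batch-variant" reduction: the same ten numbers summed in one pass
# versus in two halves (as a different batch split might group them) disagree
# in the last bit -- exactly the kind of divergence batch-invariant kernels
# are meant to rule out.
vals = [0.1] * 10
one_pass = sum(vals)                      # single left-to-right reduction
two_halves = sum(vals[:5]) + sum(vals[5:])  # reduction split into two chunks
print(one_pass == two_halves)  # False
```

The point the article makes is that in production inference the "grouping" is set by how requests are batched together, so a user's output can depend on what other requests happen to share the batch.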
Gen Z takes the lead! Competitors from nearly 20 countries battle at the Bund Conference as champions emerge in three AI innovation tracks
机器之心· 2025-09-10 11:30
Core Insights
- The 2025 Inclusion Bund Conference AI Innovation Competition successfully concluded in Shanghai, showcasing a total of 80 awarded projects that demonstrate both technological foresight and market potential [1][4]
- The competition attracted over 8,000 teams and nearly 20,000 participants from around 20 countries, with a significant representation of Gen Z participants [1][4]

Group 1: Competition Overview
- This year's AI Innovation Competition doubled in scale compared to the previous year, with Gen Z participants making up more than half of the competitors, highlighting the event's youthful and international character [4]
- The competition featured three main events: the AI Hardware Innovation Competition, the AFAC Financial Intelligence Innovation Competition, and the 2025 Global AI Defense Challenge, emphasizing the integration of AI across various sectors [1][4][10]

Group 2: AI Hardware Innovation
- The AI Hardware Innovation Competition was introduced this year, focusing on marketable AI hardware innovations, with standout projects including AI digital healthcare and AI elderly-assistance devices [5][7]
- Investors showed strong interest in the AI hardware projects, recognizing their potential to move from concept to practical applications in consumer electronics and healthcare [7]

Group 3: Financial Intelligence and AI Security
- The AFAC Financial Intelligence Innovation Competition aimed to address real industry challenges, producing solutions for issues like fraud prevention and credit assessment through advanced technologies [10]
- The 2025 Global AI Defense Challenge focused on AI security, introducing a competitive format that leverages AI to counteract AI threats, and launched the first multimodal AI security benchmark dataset [13][10]

Group 4: Future Directions
- The Shanghai Municipal Science and Technology Commission emphasized the importance of fostering innovation and practical capabilities among participants to enhance Shanghai's status as a global innovation hub [15]
CoRL 2025 | HKU's InfoBodied AI team debuts a new paradigm for embodied representation, building a task-adaptive perception framework
机器之心· 2025-09-10 11:30
Core Viewpoint - The article introduces HyperTASR, a novel framework for task-aware scene representation in embodied intelligence, enabling robots to dynamically adjust their perception based on task relevance, akin to human cognitive processes [5][12].

Group 1: Research Background and Challenges
- In embodied intelligence, policy learning relies heavily on scene representation, but existing methods often use task-agnostic feature extraction, leading to inefficiencies [4][18].
- Traditional approaches do not adapt to different tasks, so irrelevant information enters policy learning, hampering both efficiency and generalization [18][20].

Group 2: Innovations and Contributions
- The HyperTASR framework enables task-aware scene representation, letting robots focus on task-relevant environmental features during execution [8][12].
- It introduces a hypernetwork-driven representation transformation mechanism that dynamically generates adaptive parameters based on the task specification and task progress [9][20].
- It is compatible with various policy learning architectures, allowing integration without significant modifications and enhancing performance [10][26].

Group 3: Experimental Validation
- Significant improvements were observed in both simulation (RLBench) and real-world environments, establishing new state-of-the-art (SOTA) results for single-view manipulation tasks [11][29].
- In simulation, integrating HyperTASR with GNFactor exceeded baseline methods by 27%, and integrating it with 3D Diffuser Actor achieved over 80% success in single-view operation [29][31].
- Real-world experiments demonstrated strong generalization, reaching a 51.1% success rate with only 15 demonstration samples per task [32][33].
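A hypernetwork in the sense used above is a network whose output is the parameters of another transform. The toy numpy sketch below only illustrates that idea: conditioning a generated feature transform on a task embedding and a progress scalar. The dimensions, the single-layer form, and the fixed random weights are placeholders, not HyperTASR's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, CTX = 8, 16, 5  # feature dim, hidden dim, task-context dim (made up)

# The hypernetwork's own (normally learned) weights, fixed here for the sketch.
W1 = rng.standard_normal((HID, CTX)) * 0.1
W2 = rng.standard_normal((FEAT * FEAT, HID)) * 0.1

def hypernet(task_emb, progress):
    """Map (task embedding, task progress) to the weights of a per-task
    linear transform applied to task-agnostic scene features."""
    ctx = np.concatenate([task_emb, [progress]])
    h = np.tanh(W1 @ ctx)
    return (W2 @ h).reshape(FEAT, FEAT)  # generated transform parameters

task_emb = rng.standard_normal(CTX - 1)  # stands in for an encoded task spec
scene = rng.standard_normal(FEAT)        # task-agnostic scene features

# The same scene features are transformed differently as the task progresses,
# which is the "task-aware, progress-aware" behavior described above.
early = hypernet(task_emb, 0.1) @ scene
late = hypernet(task_emb, 0.9) @ scene
print(np.allclose(early, late))  # False: representation adapts with progress
```

The design point is that the policy network downstream never changes; only the generated transform does, which is why the mechanism can plug into different policy learning architectures.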
A new milestone for Google AI: a system that can "do research", using LLMs + tree search to write expert-level software
机器之心· 2025-09-10 08:14
Core Viewpoint - The article discusses a groundbreaking AI system developed by Google that assists researchers in writing expert-level empirical software, integrating large language models with traditional tree search methods to enhance efficiency in scientific research [2][4][36].

Group 1: AI System Overview
- The AI system can automatically write and optimize the software programs needed for scientific tasks, surpassing human capabilities in various fields such as genomics and public health [2][4].
- It transforms one-time code generation into iterative, search-driven software evolution guided by quantifiable goals [4][36].

Group 2: Methodology and Components
- The system focuses on "scorable scientific tasks," which can be quantified through accuracy, error rates, or benchmark rankings, covering a wide range of scientific applications [14].
- Three core components work in synergy:
  1. LLM-based code mutation, which continuously rewrites and optimizes existing candidate code [15].
  2. Tree search navigation, systematically exploring the space of software solutions using a variant of the PUCT algorithm inspired by AlphaZero [16].
  3. Integration of research ideas from various sources, including expert knowledge and academic papers [17].

Group 3: Achievements Across Scientific Fields
- In genomics, the system identified 40 new methods for single-cell RNA sequencing, outperforming all published methods on the OpenProblems leaderboard; its best method improved on the previous best by 14% [19].
- For geospatial analysis, the system's top solutions for satellite image segmentation significantly exceeded recent academic results, achieving an average intersection-over-union score greater than 0.80 [22].
- In neuroscience, the system generated a model predicting neural activity in zebrafish brains that outperformed all baseline models and trained significantly faster [26].
- The system also excelled at time-series prediction across 28 datasets, creating a unified prediction library adaptable to various datasets [27].

Group 4: Technical Innovations
- A key innovation is the systematic integration and intelligent reorganization of research ideas, allowing the system to analyze the core principles of different methods and synthesize instructions for creating hybrid methods [31].

Group 5: Conclusion and Implications
- The research indicates that AI can not only automate but also systematically exceed human performance in developing scientific software across multiple fields [36].
- The system has the potential to fundamentally change the way scientific software is developed, making advanced analytical tools more accessible to researchers and expanding the boundaries of scientific exploration [37].
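For reference, the AlphaZero-style PUCT rule mentioned in the methodology balances a candidate's estimated value against a prior-weighted exploration bonus when choosing which node to expand next. A minimal sketch (the constant c and all candidate statistics are made-up illustrative numbers, not values from the paper, which says only that it uses a PUCT variant):

```python
import math

def puct(q, prior, parent_visits, child_visits, c=1.5):
    """PUCT selection score: exploit the mean value q, and explore in
    proportion to the prior and to how under-visited the child is."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Pick which candidate code mutation to expand next; each child is a scored
# program variant in the search tree.
children = [
    {"name": "variant_a", "q": 0.62, "prior": 0.5, "visits": 10},
    {"name": "variant_b", "q": 0.55, "prior": 0.3, "visits": 2},
    {"name": "variant_c", "q": 0.40, "prior": 0.2, "visits": 0},
]
parent_visits = sum(ch["visits"] for ch in children)
best = max(children,
           key=lambda ch: puct(ch["q"], ch["prior"], parent_visits, ch["visits"]))
print(best["name"])  # → variant_c: unvisited, so its exploration bonus dominates
```

The exploration term is what lets the search revisit promising but under-tested program variants instead of greedily refining the current best score.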
NVIDIA's next-generation GPU arrives: Rubin CPX handles millions of tokens in a single inference pass; netizens: "this is a beast"
机器之心· 2025-09-10 08:14
Reported by the 机器之心 editorial team

At Tuesday's AI Infrastructure Summit, NVIDIA announced a new GPU called Rubin CPX (Rubin Context GPUs), designed for long-context inference over one million tokens.

For users, this means better performance on long-context tasks such as software development and video generation.

In software development, for example, an AI system must be able to reason over an entire codebase and understand repository-level code structure to assist developers effectively. Likewise, long-video and research applications require sustained coherence and memory across millions of tokens.

Now, with the release of Rubin CPX, these problems become tractable.

The new GPU (Rubin CPX) will be paired with the NVIDIA Vera CPU and Rubin GPU to form the new NVIDIA Vera Rubin NVL144 CPX platform. This integrated NVIDIA MGX system delivers 8 exaflops of AI compute in a single rack, 7.5 times the AI performance of the NVIDIA GB300 NVL72 system, along with 100 TB of high-speed memory and 1.7 PB/s (petabytes per second) of memory bandwidth.

At the same time, NVIDIA will also, for the existing V ...
Anyone can build a personal Agent: Shanghai Jiao Tong University open-sources a full-stack toolchain for on-device Agents, beating GPT-5 in real-world scenarios!
机器之心· 2025-09-10 07:31
Opening your phone and letting an AI Agent automatically handle chores like ordering takeout, booking hotels, and shopping online is becoming a new paradigm for smartphone interaction.

Just now, this landscape gained a new challenger.

A team from the IPADS lab at Shanghai Jiao Tong University has officially open-sourced MobiAgent, a full-stack "bundle" for mobile agents.

APP: https://github.com/IPADS-SAI/MobiAgent/releases/download/v1.0/Mobiagent.apk

A personal agent that can autonomously handle most daily tasks is moving from science fiction into reality.

Yet the last mile toward "hands-free" is not an easy one. Efficiently training and deploying agent models on phones has long seemed the preserve of a few large companies, from the acquisition of high-quality operation data ...

A complete guide to raising an Agent: three steps

To teach an AI to use a phone, it first has to see how humans operate one. MobiAgent's first core contribution is an AI-assisted, agile data collection "pipeline". In the past, preparing "textbooks" (annotated data) for AI was expensive and slow. Now, MobiAgent uses a lightweight tool to record all human operation traces on the phone: taps, swipes, text input, and more. For some simple tasks, this recording ...
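The operation traces described above (taps, swipes, text input paired with a task description) are the basic training unit for a phone-using agent. A toy sketch of what such a record might look like; the field names and schema are invented for illustration, not MobiAgent's actual data format:

```python
import json

# One recorded human demonstration: a sequence of timestamped UI actions.
trace = [
    {"t": 0.0, "action": "tap", "target": "search_box"},
    {"t": 1.2, "action": "type", "target": "search_box", "text": "hotel in Shanghai"},
    {"t": 3.5, "action": "tap", "target": "search_button"},
    {"t": 5.0, "action": "swipe", "direction": "up"},
]

def to_training_example(task, trace):
    """Pair a natural-language task with its recorded action sequence --
    the kind of (instruction, demonstration) pair an agent model is
    fine-tuned on."""
    return {"task": task, "steps": trace}

example = to_training_example("Book a hotel in Shanghai", trace)
print(json.dumps(example, indent=2)[:80])
```

Recording such traces with a lightweight on-device tool, rather than annotating screenshots by hand, is what makes the data collection "agile" in the sense the article describes.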
AI making things up: is someone finally doing something about it?
机器之心· 2025-09-10 04:05
Editors: +0, Zhang Qian

机器之心 report

Imagine if large AI models such as ChatGPT could mark the parts they are unsure about as they generate. Would that make you far more comfortable trusting their answers?

Last weekend, a paper released by OpenAI set the community ablaze. It systematically traced the roots of hallucination, arguing that the problem lies in the rewards: standard training and evaluation procedures reward guessing rather than rewarding a model for admitting uncertainty. Possibly because OpenAI recognized this problem and found a targeted fix, GPT-5's hallucination rate has dropped sharply.

As large models are deployed ever more deeply in high-stakes domains such as medical consultation and legal advice, the hallucination problem will only become thornier, so many researchers are pushing in this direction. Besides tracing the causes of hallucination as OpenAI did, many are working on hallucination detection. Existing detection techniques, however, hit bottlenecks in practice: they typically apply only to short factual queries, or require expensive external resources for verification.

Addressing this challenge, a new study from ETH Zurich and MATS proposes a low-cost, scalable detection method that identifies "hallucination tokens" in long-form content in real time, and has been applied successfully to models as large as 70B parameters.

Paper title: Real-Time Detection of ...
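The excerpt does not spell out the detector's internals, but a common recipe for this kind of low-cost, real-time detection is a lightweight probe scored over each token's hidden state. The numpy sketch below shows that general shape under stated assumptions: the probe weights are random stand-ins for learned ones, the dimensions are made up, and the paper's actual method may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64                              # toy hidden-state width

# Probe parameters -- normally learned on labeled hallucination data;
# random here purely so the sketch runs.
w = rng.standard_normal(HIDDEN) * 0.1
b = -0.5

def flag_hallucinated(hidden_states, threshold=0.5):
    """hidden_states: (num_tokens, HIDDEN) array of per-token activations.
    Returns indices of tokens whose estimated hallucination probability
    exceeds the threshold. One dot product per token, so cheap enough to
    run while the model is still streaming output."""
    logits = hidden_states @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return np.flatnonzero(probs > threshold)

tokens = rng.standard_normal((12, HIDDEN))  # stand-in for real activations
print(flag_hallucinated(tokens))
```

Because the probe reads states the model already computes, it adds almost no overhead, which is what makes token-level flagging feasible even for 70B-scale models.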
In the first year of AI applications, this benchmark competition showcases the speed and ambition of Chinese innovation
机器之心· 2025-09-10 04:05
机器之心 original; editor: Wu Xin

A collective rehearsal of the future of financial intelligence, it witnessed entrepreneurs' sprints and reflected an industry's evolution.

AI in 2025 is running a two-track marathon. On one track, the foundations of large models keep evolving, nowhere near their ceiling; on the other, scenario-driven applications are erupting en masse. a16z's latest global top-100 GenAI application list sends a clear signal: in "how AI transforms industries," Chinese players have already shown a world-leading edge.

Meanwhile, the State Council's "AI+" action plan has added fuel to the fire. The scope of AI empowerment is expanding from pilots in new productive forces to society at large, viewed as a core engine of future modernization.

This pulse was on full display at the AFAC2025 Financial Intelligence Innovation Competition. A benchmark event for financial intelligence held for three consecutive years, it has become a gathering point for AI startup teams at home and abroad. Over the three-month competition, 11 teams stood out in the startup track; their winning projects largely target real financial pain points, span breakthroughs in underlying technology and complex systems engineering, are highly deployable, and show marked cross-disciplinary innovation.

"China's speed of application deployment is world-leading," said another judge, Chief of Staff and board director at xcube.co, of the Singapore FinTech Festival and GFT ...