机器之心
Has NVIDIA's AI started taking over entire projects? SATLUTION's self-evolving codebase tops the SAT Competition
机器之心· 2025-09-11 03:36
Core Viewpoint - The article discusses the emergence of AI frameworks capable of developing complex software, specifically highlighting NVIDIA Research's SATLUTION, which has demonstrated superior performance in solving SAT problems compared to human-designed solvers [1][3][5].

Group 1: SATLUTION Framework
- SATLUTION is the first framework that extends the code evolution capabilities of LLMs from "algorithm kernels" to "complete codebases," handling complex projects with thousands of lines of C/C++ code [3][4].
- The framework coordinates LLM agents under strict correctness verification and distributed runtime feedback to iteratively optimize the SAT solver's codebase [4][9].

Group 2: Performance and Results
- In the 2025 SAT Competition, the SATLUTION-evolved solver outperformed the human-designed champions, achieving lower PAR-2 scores, which indicate better performance [5][7].
- SATLUTION demonstrated a clear and robust performance improvement trajectory over 70 evolution cycles, surpassing the human-designed solvers by roughly the 50th iteration [19][21].

Group 3: Evolution Process
- The system operates with a dual-agent architecture: a planning agent that strategizes modifications and a coding agent that implements them [10].
- A dynamic rules system guides the evolution process, encoding domain knowledge and constraints to keep it efficient and stable [11][12].

Group 4: Validation and Evaluation
- Each new solver version undergoes a rigorous two-stage validation process: compilation tests, then correctness verification against known benchmarks [14][15].
- Validated solvers are evaluated in a distributed manner across 800 CPU nodes, providing near real-time performance feedback [15].

Group 5: Cost Efficiency
- The entire SATLUTION self-evolution experiment cost under $20,000, far less than the cost of the months or years of expert effort typically required to develop competitive SAT solvers [21].
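The PAR-2 metric mentioned above is the SAT Competition's standard scoring rule: a solved instance contributes its wall-clock runtime, while an unsolved instance is penalized at twice the time limit, so lower scores are better. A minimal sketch of the computation (the 5000-second limit and the sample runtimes are illustrative, not figures from the article):

```python
def par2_score(runtimes, timeout=5000.0):
    """PAR-2: solved instances count their runtime; unsolved ones count 2x the limit.

    `runtimes` holds wall-clock seconds per benchmark instance, with None
    marking an instance the solver failed to solve within the limit.
    """
    total = 0.0
    for t in runtimes:
        if t is None or t >= timeout:
            total += 2.0 * timeout  # penalized: not solved within the limit
        else:
            total += t
    return total / len(runtimes)

# Lower is better: one timeout out of four instances dominates the average.
print(round(par2_score([120.0, 430.5, None, 88.2]), 3))  # → 2659.675
```

This is why a solver that times out slightly less often can beat one that is faster on easy instances: every unsolved instance costs a flat 2× the limit.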
Large-model agents can do more than write code: they can be trained as white-hat hackers
机器之心· 2025-09-11 03:36
Core Insights
- The article discusses the emerging role of AI in cybersecurity, particularly through the development of large models that can simulate attacks and identify vulnerabilities without causing harm [2][3]
- Amazon AWS AI's Q Developer team has introduced two innovative methods for training cybersecurity models, Cyber-Zero and CTF-Dojo, marking a shift from general tasks to the front line of cybersecurity [3][9]

Summary by Sections

Cyber-Zero
- Cyber-Zero focuses on generating high-quality training data without relying on real runtime environments, utilizing existing knowledge and language modeling to create behavior trajectories [10][11]
- The method extracts key steps from public CTF competition writeups, allowing the model to simulate interactions between attackers and defenders in a safe, text-based environment [11][13]
- Experimental results indicate that Cyber-Zero can produce diverse and effective training data, achieving comparable or superior performance in vulnerability detection and attack-path reasoning relative to data generated in real environments [13][15]

CTF-Dojo
- CTF-Dojo provides a real operational environment in which AI models execute commands, interact with systems, and discover vulnerabilities, complementing the virtual training offered by Cyber-Zero [16][19]
- The team developed CTF-Forge, a tool that automates the setup of CTF challenges, significantly reducing the time and labor required to create a stable operating environment for large language models [16][19]
- The CTF-Dojo dataset includes 658 independent task instances from top-tier CTF competitions, covering categories such as web vulnerabilities and binary exploitation [19][21]

Performance Evaluation
- Models trained with CTF-Dojo demonstrated significant improvements on benchmark tasks, with the best-performing model achieving an absolute gain of 11.6% on the EnIGMA+ benchmark [22][24]
- The results highlight the scalability and effectiveness of using real execution data to enhance the performance of cybersecurity models, suggesting a pathway for AI to approach human-level capabilities in ethical hacking [24][26]

Future Implications
- The integration of Cyber-Zero and CTF-Dojo creates a comprehensive training and operational framework for AI in cybersecurity, addressing both data generation and practical application challenges [26][27]
- Potential applications of AI white-hat hackers include automated code scanning, vulnerability discovery, and personalized training in educational settings, indicating a broad future impact [27][28]
- However, the dual-use nature of this technology raises concerns about potential misuse for offensive purposes, necessitating discussion of how to balance accessibility with security [28][29]
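The Cyber-Zero idea above, deriving behavior trajectories from public writeups instead of a live runtime, can be illustrated with a toy sketch. The real pipeline uses an LLM to expand writeup steps into simulated attacker/defender turns; the function below only extracts numbered steps to show the shape of the resulting training data (the parsing rule and field names are invented for illustration):

```python
import re

def writeup_to_trajectory(writeup: str):
    """Turn a numbered CTF writeup into per-turn action records.

    A stand-in for writeup-derived trajectory generation: each numbered step
    becomes one 'turn' a model could be trained to reproduce, with no real
    environment involved.
    """
    steps = re.findall(r"^\d+\.\s*(.+)$", writeup, flags=re.MULTILINE)
    return [{"turn": i, "action": s} for i, s in enumerate(steps, 1)]

demo = """1. Run strings on the binary to find the hidden prompt.
2. Overflow the 64-byte buffer to overwrite the return address.
3. Jump to the win() function and read the flag."""

for step in writeup_to_trajectory(demo):
    print(step["turn"], step["action"])
```

The appeal of this approach, as the article notes, is that the "environment" is entirely textual, so no vulnerable system ever has to be stood up during data generation.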
Just now: Thinking Machines Lab publishes its first long-form article, revealing the truth about nondeterminism in LLM inference
机器之心· 2025-09-11 03:36
Core Viewpoint - The article discusses the challenges of achieving reproducibility in large language models (LLMs) due to the lack of batch invariance, which leads to nondeterministic outputs even under controlled conditions [10][41][46].

Group 1: Introduction to the Issue
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, published its first article addressing nondeterminism in LLM inference [1][3].
- The blog aims to cover a wide range of topics related to their research, including numerical computation and prompt engineering [3].

Group 2: Understanding Nondeterminism
- Reproducibility is a cornerstone of scientific progress, yet obtaining consistent results from LLMs is challenging [10].
- Even with the temperature parameter set to 0, LLM APIs can still produce nondeterministic outputs [11].
- The nondeterminism is attributed to floating-point non-associativity and concurrency, which affect the order of operations in GPU computations [13][30].

Group 3: The Root Cause of Nondeterminism
- The article argues that the common assumption linking concurrency and floating-point operations to nondeterminism does not fully explain the issue [14][30].
- Floating-point non-associativity yields different results depending on the order of operations, especially in parallel computations [19][26].
- The actual implementation of the kernels used in LLM inference contributes to the observed nondeterministic behavior [27][30].

Group 4: Batch Invariance
- The lack of batch invariance is identified as the key factor causing nondeterminism in LLM outputs [41][46].
- Changes in batch size can lead to different results for the same input, which is counterintuitive for mathematical functions [43].
- The article emphasizes that making kernels batch-invariant is crucial for achieving consistent outputs in LLM inference [46].

Group 5: Solutions for Achieving Determinism
- The article outlines strategies for implementing batch invariance in key operations such as RMSNorm, matrix multiplication, and attention [49][60][71].
- By ensuring that these operations do not depend on batch size, LLM inference can produce consistent results [46][81].
- The authors demonstrate deterministic inference using their batch-invariant kernel library [82].

Group 6: Performance Considerations
- Initial performance tests indicate that while the batch-invariant kernels are not yet fully optimized, they do not cause catastrophic performance declines [89].
- The article highlights the importance of maintaining performance while achieving deterministic outputs in LLMs [88].

Group 7: Implications for Reinforcement Learning
- Deterministic inference enables true on-policy reinforcement learning by ensuring consistent outputs between training and inference [90].
- This consistency is essential for effective training and sampling in reinforcement learning environments [90].

Group 8: Conclusion
- The article advocates a proactive approach to understanding and addressing the sources of nondeterminism in LLMs, encouraging the community to strive for reproducibility in AI systems [93].
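The floating-point non-associativity at the heart of the argument is easy to reproduce in plain Python. The batch-splitting example below is a toy analogue of a reduction whose result depends on how the work is grouped, not a reconstruction of the blog's actual GPU kernels:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)  # 0.0 -- the 0.1 is absorbed when added to 1e20 first
print(a + (b + c))  # 0.1

# A toy "batch-variant" reduction: the same ten numbers summed in one pass
# versus in two halves (as a different batch split might group them) disagree
# in the last bit -- exactly the kind of divergence batch-invariant kernels
# are meant to rule out.
vals = [0.1] * 10
one_pass = sum(vals)                      # single left-to-right reduction
two_halves = sum(vals[:5]) + sum(vals[5:])  # reduction split into two chunks
print(one_pass == two_halves)  # False
```

The point the article makes is that in production inference the "grouping" is set by how requests are batched together, so a user's output can depend on what other requests happen to share the batch.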
Gen Z takes the lead! Competitors from nearly 20 countries battle at the Bund Conference as champions emerge in three AI innovation tracks
机器之心· 2025-09-10 11:30
Core Insights
- The 2025 Inclusion Bund Conference AI Innovation Competition successfully concluded in Shanghai, showcasing a total of 80 awarded projects that demonstrate both technological foresight and market potential [1][4]
- The competition attracted over 8,000 teams and nearly 20,000 participants from around 20 countries, with a significant representation of Gen Z participants [1][4]

Group 1: Competition Overview
- This year's AI Innovation Competition doubled in scale compared to the previous year, with Gen Z participants making up more than half of the competitors, highlighting the event's youthful and international character [4]
- The competition featured three main events: the AI Hardware Innovation Competition, the AFAC Financial Intelligence Innovation Competition, and the 2025 Global AI Defense Challenge, emphasizing the integration of AI across various sectors [1][4][10]

Group 2: AI Hardware Innovation
- The AI Hardware Innovation Competition was introduced this year, focusing on marketable AI hardware innovations, with standout projects including AI digital healthcare and AI elderly-assistance devices [5][7]
- Investors showed strong interest in the AI hardware projects, recognizing their potential to move from concept to practical applications in consumer electronics and healthcare [7]

Group 3: Financial Intelligence and AI Security
- The AFAC Financial Intelligence Innovation Competition aimed to address real industry challenges, producing solutions for issues like fraud prevention and credit assessment through advanced technologies [10]
- The 2025 Global AI Defense Challenge focused on AI security, introducing a competitive format that leverages AI to counteract AI threats, and launched the first multimodal AI security benchmark dataset [13][10]

Group 4: Future Directions
- The Shanghai Municipal Science and Technology Commission emphasized the importance of fostering innovation and practical capabilities among participants to enhance Shanghai's status as a global innovation hub [15]
CoRL 2025 | HKU's InfoBodied AI team debuts a new paradigm for embodied representation, building a task-adaptive perception framework
机器之心· 2025-09-10 11:30
Core Viewpoint - The article introduces HyperTASR, a novel framework for task-aware scene representation in embodied intelligence, enabling robots to dynamically adjust their perception based on task relevance, akin to human cognitive processes [5][12].

Group 1: Research Background and Challenges
- In embodied intelligence, policy learning relies heavily on scene representation, but existing methods often use task-agnostic feature extraction, leading to inefficiencies [4][18].
- Traditional approaches do not adapt to different tasks, so irrelevant information enters policy learning, hampering both efficiency and generalization [18][20].

Group 2: Innovations and Contributions
- The HyperTASR framework enables task-aware scene representation, letting robots focus on task-relevant environmental features during execution [8][12].
- It introduces a hypernetwork-driven representation transformation mechanism that dynamically generates adaptive parameters based on the task specification and task progress [9][20].
- It is compatible with various policy learning architectures, allowing integration without significant modifications and enhancing performance [10][26].

Group 3: Experimental Validation
- Significant improvements were observed in both simulation (RLBench) and real-world environments, establishing new state-of-the-art (SOTA) results for single-view manipulation tasks [11][29].
- In simulation, integrating HyperTASR with GNFactor exceeded baseline methods by 27%, and integrating it with 3D Diffuser Actor achieved over 80% success in single-view operation [29][31].
- Real-world experiments demonstrated strong generalization, reaching a 51.1% success rate with only 15 demonstration samples per task [32][33].
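A hypernetwork in the sense used above is a network whose output is the parameters of another transform. The toy numpy sketch below only illustrates that idea: conditioning a generated feature transform on a task embedding and a progress scalar. The dimensions, the single-layer form, and the fixed random weights are placeholders, not HyperTASR's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, CTX = 8, 16, 5  # feature dim, hidden dim, task-context dim (made up)

# The hypernetwork's own (normally learned) weights, fixed here for the sketch.
W1 = rng.standard_normal((HID, CTX)) * 0.1
W2 = rng.standard_normal((FEAT * FEAT, HID)) * 0.1

def hypernet(task_emb, progress):
    """Map (task embedding, task progress) to the weights of a per-task
    linear transform applied to task-agnostic scene features."""
    ctx = np.concatenate([task_emb, [progress]])
    h = np.tanh(W1 @ ctx)
    return (W2 @ h).reshape(FEAT, FEAT)  # generated transform parameters

task_emb = rng.standard_normal(CTX - 1)  # stands in for an encoded task spec
scene = rng.standard_normal(FEAT)        # task-agnostic scene features

# The same scene features are transformed differently as the task progresses,
# which is the "task-aware, progress-aware" behavior described above.
early = hypernet(task_emb, 0.1) @ scene
late = hypernet(task_emb, 0.9) @ scene
print(np.allclose(early, late))  # False: representation adapts with progress
```

The design point is that the policy network downstream never changes; only the generated transform does, which is why the mechanism can plug into different policy learning architectures.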
A new milestone for Google AI: a system that can "do research", using LLMs + tree search to write expert-level software
机器之心· 2025-09-10 08:14
Core Viewpoint - The article discusses a groundbreaking AI system developed by Google that assists researchers in writing expert-level empirical software, integrating large language models with traditional tree search methods to enhance efficiency in scientific research [2][4][36].

Group 1: AI System Overview
- The AI system can automatically write and optimize the software programs needed for scientific tasks, surpassing human capabilities in various fields such as genomics and public health [2][4].
- It transforms one-time code generation into iterative, search-driven software evolution guided by quantifiable goals [4][36].

Group 2: Methodology and Components
- The system focuses on "scorable scientific tasks," which can be quantified through accuracy, error rates, or benchmark rankings, covering a wide range of scientific applications [14].
- Three core components work in synergy:
  1. LLM-based code mutation, which continuously rewrites and optimizes existing candidate code [15].
  2. Tree search navigation, systematically exploring the space of software solutions using a variant of the PUCT algorithm inspired by AlphaZero [16].
  3. Integration of research ideas from various sources, including expert knowledge and academic papers [17].

Group 3: Achievements Across Scientific Fields
- In genomics, the system identified 40 new methods for single-cell RNA sequencing, outperforming all published methods on the OpenProblems leaderboard; its best method improved on the previous best by 14% [19].
- For geospatial analysis, the system's top solutions for satellite image segmentation significantly exceeded recent academic results, achieving an average intersection-over-union score greater than 0.80 [22].
- In neuroscience, the system generated a model predicting neural activity in zebrafish brains that outperformed all baseline models and trained significantly faster [26].
- The system also excelled at time-series prediction across 28 datasets, creating a unified prediction library adaptable to various datasets [27].

Group 4: Technical Innovations
- A key innovation is the systematic integration and intelligent reorganization of research ideas, allowing the system to analyze the core principles of different methods and synthesize instructions for creating hybrid methods [31].

Group 5: Conclusion and Implications
- The research indicates that AI can not only automate but also systematically exceed human performance in developing scientific software across multiple fields [36].
- The system has the potential to fundamentally change the way scientific software is developed, making advanced analytical tools more accessible to researchers and expanding the boundaries of scientific exploration [37].
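For reference, the AlphaZero-style PUCT rule mentioned in the methodology balances a candidate's estimated value against a prior-weighted exploration bonus when choosing which node to expand next. A minimal sketch (the constant c and all candidate statistics are made-up illustrative numbers, not values from the paper, which says only that it uses a PUCT variant):

```python
import math

def puct(q, prior, parent_visits, child_visits, c=1.5):
    """PUCT selection score: exploit the mean value q, and explore in
    proportion to the prior and to how under-visited the child is."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Pick which candidate code mutation to expand next; each child is a scored
# program variant in the search tree.
children = [
    {"name": "variant_a", "q": 0.62, "prior": 0.5, "visits": 10},
    {"name": "variant_b", "q": 0.55, "prior": 0.3, "visits": 2},
    {"name": "variant_c", "q": 0.40, "prior": 0.2, "visits": 0},
]
parent_visits = sum(ch["visits"] for ch in children)
best = max(children,
           key=lambda ch: puct(ch["q"], ch["prior"], parent_visits, ch["visits"]))
print(best["name"])  # → variant_c: unvisited, so its exploration bonus dominates
```

The exploration term is what lets the search revisit promising but under-tested program variants instead of greedily refining the current best score.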
NVIDIA's next-generation GPU arrives: Rubin CPX handles millions of tokens in a single inference pass; netizens: "this is a beast"
机器之心· 2025-09-10 08:14
Reported by the 机器之心 editorial team

At Tuesday's AI Infrastructure Summit, NVIDIA announced a new GPU called Rubin CPX (Rubin Context GPUs), designed for long-context inference over one million tokens.

For users, this means better performance on long-context tasks such as software development and video generation.

In software development, for example, an AI system must be able to reason over an entire codebase and understand repository-level code structure to assist developers effectively. Likewise, long-video and research applications require sustained coherence and memory across millions of tokens.

Now, with the release of Rubin CPX, these problems become tractable.

The new GPU (Rubin CPX) will be paired with the NVIDIA Vera CPU and Rubin GPU to form the new NVIDIA Vera Rubin NVL144 CPX platform. This integrated NVIDIA MGX system delivers 8 exaflops of AI compute in a single rack, 7.5 times the AI performance of the NVIDIA GB300 NVL72 system, along with 100 TB of high-speed memory and 1.7 PB/s (petabytes per second) of memory bandwidth.

At the same time, NVIDIA will also, for the existing V ...
Anyone can build a personal Agent: Shanghai Jiao Tong University open-sources a full-stack toolchain for on-device Agents, beating GPT-5 in real-world scenarios!
机器之心· 2025-09-10 07:31
Opening your phone and letting an AI Agent automatically handle chores like ordering takeout, booking hotels, and shopping online is becoming a new paradigm for smartphone interaction.

Just now, this landscape gained a new challenger.

A team from the IPADS lab at Shanghai Jiao Tong University has officially open-sourced MobiAgent, a full-stack "bundle" for mobile agents.

APP: https://github.com/IPADS-SAI/MobiAgent/releases/download/v1.0/Mobiagent.apk

A personal agent that can autonomously handle most daily tasks is moving from science fiction into reality.

Yet the last mile toward "hands-free" is not an easy one. Efficiently training and deploying agent models on phones has long seemed the preserve of a few large companies, from the acquisition of high-quality operation data ...

A complete guide to raising an Agent: three steps

To teach an AI to use a phone, it first has to see how humans operate one. MobiAgent's first core contribution is an AI-assisted, agile data collection "pipeline". In the past, preparing "textbooks" (annotated data) for AI was expensive and slow. Now, MobiAgent uses a lightweight tool to record all human operation traces on the phone: taps, swipes, text input, and more. For some simple tasks, this recording ...
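The operation traces described above (taps, swipes, text input paired with a task description) are the basic training unit for a phone-using agent. A toy sketch of what such a record might look like; the field names and schema are invented for illustration, not MobiAgent's actual data format:

```python
import json

# One recorded human demonstration: a sequence of timestamped UI actions.
trace = [
    {"t": 0.0, "action": "tap", "target": "search_box"},
    {"t": 1.2, "action": "type", "target": "search_box", "text": "hotel in Shanghai"},
    {"t": 3.5, "action": "tap", "target": "search_button"},
    {"t": 5.0, "action": "swipe", "direction": "up"},
]

def to_training_example(task, trace):
    """Pair a natural-language task with its recorded action sequence --
    the kind of (instruction, demonstration) pair an agent model is
    fine-tuned on."""
    return {"task": task, "steps": trace}

example = to_training_example("Book a hotel in Shanghai", trace)
print(json.dumps(example, indent=2)[:80])
```

Recording such traces with a lightweight on-device tool, rather than annotating screenshots by hand, is what makes the data collection "agile" in the sense the article describes.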
AI making things up: is someone finally doing something about it?
机器之心· 2025-09-10 04:05
Editors: +0, Zhang Qian

机器之心 report

Imagine if large AI models such as ChatGPT could mark the parts they are unsure about as they generate. Would that make you far more comfortable trusting their answers?

Last weekend, a paper released by OpenAI set the community ablaze. It systematically traced the roots of hallucination, arguing that the problem lies in the rewards: standard training and evaluation procedures reward guessing rather than rewarding a model for admitting uncertainty. Possibly because OpenAI recognized this problem and found a targeted fix, GPT-5's hallucination rate has dropped sharply.

As large models are deployed ever more deeply in high-stakes domains such as medical consultation and legal advice, the hallucination problem will only become thornier, so many researchers are pushing in this direction. Besides tracing the causes of hallucination as OpenAI did, many are working on hallucination detection. Existing detection techniques, however, hit bottlenecks in practice: they typically apply only to short factual queries, or require expensive external resources for verification.

Addressing this challenge, a new study from ETH Zurich and MATS proposes a low-cost, scalable detection method that identifies "hallucination tokens" in long-form content in real time, and has been applied successfully to models as large as 70B parameters.

Paper title: Real-Time Detection of ...
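The excerpt does not spell out the detector's internals, but a common recipe for this kind of low-cost, real-time detection is a lightweight probe scored over each token's hidden state. The numpy sketch below shows that general shape under stated assumptions: the probe weights are random stand-ins for learned ones, the dimensions are made up, and the paper's actual method may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64                              # toy hidden-state width

# Probe parameters -- normally learned on labeled hallucination data;
# random here purely so the sketch runs.
w = rng.standard_normal(HIDDEN) * 0.1
b = -0.5

def flag_hallucinated(hidden_states, threshold=0.5):
    """hidden_states: (num_tokens, HIDDEN) array of per-token activations.
    Returns indices of tokens whose estimated hallucination probability
    exceeds the threshold. One dot product per token, so cheap enough to
    run while the model is still streaming output."""
    logits = hidden_states @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return np.flatnonzero(probs > threshold)

tokens = rng.standard_normal((12, HIDDEN))  # stand-in for real activations
print(flag_hallucinated(tokens))
```

Because the probe reads states the model already computes, it adds almost no overhead, which is what makes token-level flagging feasible even for 70B-scale models.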
In the first year of AI applications, this benchmark competition showcases the speed and ambition of Chinese innovation
机器之心· 2025-09-10 04:05
机器之心 original; editor: Wu Xin

A collective rehearsal of the future of financial intelligence, it witnessed entrepreneurs' sprints and reflected an industry's evolution.

AI in 2025 is running a two-track marathon. On one track, the foundations of large models keep evolving, nowhere near their ceiling; on the other, scenario-driven applications are erupting en masse. a16z's latest global top-100 GenAI application list sends a clear signal: in "how AI transforms industries," Chinese players have already shown a world-leading edge.

Meanwhile, the State Council's "AI+" action plan has added fuel to the fire. The scope of AI empowerment is expanding from pilots in new productive forces to society at large, viewed as a core engine of future modernization.

This pulse was on full display at the AFAC2025 Financial Intelligence Innovation Competition. A benchmark event for financial intelligence held for three consecutive years, it has become a gathering point for AI startup teams at home and abroad. Over the three-month competition, 11 teams stood out in the startup track; their winning projects largely target real financial pain points, span breakthroughs in underlying technology and complex systems engineering, are highly deployable, and show marked cross-disciplinary innovation.

"China's speed of application deployment is world-leading," said another judge, Chief of Staff and board director at xcube.co, of the Singapore FinTech Festival and GFT ...