机器之心
10x inference speedup: Ant Group open-sources dInfer, the industry's first high-performance inference framework for diffusion language models
机器之心· 2025-10-13 09:24
Core Insights
- Ant Group has launched dInfer, the industry's first high-performance inference framework for diffusion large language models (dLLMs), achieving over 10 times the inference speed of Fast-dLLM [2][29]
- dInfer has set a new performance milestone, reaching a throughput of 1011 tokens per second in single-batch inference and surpassing highly optimized autoregressive (AR) models [29]

Group 1: dInfer Framework
- dInfer is designed to support various dLLM architectures, including LLaDA, LLaDA-MoE, and LLaDA-MoE-TD, emphasizing modularity and scalability [9][20]
- The framework integrates four core modules: Model, KV Cache Manager, Iteration Manager, and Decoder, allowing developers to customize and combine optimization strategies [11][13]
- dInfer addresses three core challenges of dLLM inference: high computational cost, KV-cache invalidation, and the complexity of parallel decoding [12][19]

Group 2: Performance Enhancements
- dInfer employs a "Vicinity KV-Cache Refresh" strategy that selectively recomputes KV caches near the decoding region, reducing computational cost while maintaining generation quality [15][17]
- The framework brings the forward computation speed of dLLMs to parity with AR models through a series of system-level optimizations [18]
- It introduces hierarchical and credit decoding algorithms that maximize the number of tokens decoded in parallel without additional training [19][20]

Group 3: Performance Metrics
- In tests on 8 NVIDIA H800 GPUs, dInfer achieved an average inference speed of 681 tokens per second, 10.7 times faster than Fast-dLLM [29]
- Combined with trajectory distillation, dInfer's average inference speed rose to 847 tokens per second, more than 3 times the performance of AR models [24][29]
- dInfer's performance on code generation tasks set a record, demonstrating significant speed advantages in latency-sensitive scenarios [29]

Group 4: Open Source and Community Engagement
- The release of dInfer marks a significant step toward practical efficiency for diffusion language models, inviting global developers and researchers to collaborate on a more efficient and open AI ecosystem [25][28]
- The complete code, technical reports, and experimental configurations for dInfer v0.1 have been open-sourced [27][28]
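The "Vicinity KV-Cache Refresh" strategy above can be sketched as a simple cache-scheduling policy: recompute cached keys/values only for positions near the block currently being decoded, and reuse possibly stale entries everywhere else. This is an illustrative toy, not dInfer's actual implementation; the function name and the symmetric-window semantics are assumptions.

```python
def vicinity_refresh_plan(seq_len: int, block_start: int, block_end: int, vicinity: int):
    """Split positions into those whose KV entries are recomputed this step
    (within `vicinity` tokens of the active decoding block) and those whose
    cached entries are reused as-is."""
    lo = max(0, block_start - vicinity)
    hi = min(seq_len, block_end + vicinity)
    refresh = list(range(lo, hi))            # recompute: near the active block
    reuse = [p for p in range(seq_len) if p < lo or p >= hi]  # keep stale cache
    return refresh, reuse
```

Shrinking `vicinity` trades recomputation for speed, which is the cost/quality dial the summary describes.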
Changing the reinforcement-learning paradigm: Meta's new work echoes Sutton's "era of experience" prediction
机器之心· 2025-10-13 06:37
Core Insights
- The article discusses the transition from the data era to the experience era in AI, emphasizing that AI agents must learn from interactions with their environment rather than relying solely on curated data [1][2]
- Meta's research introduces a new paradigm called "early experience," in which agents learn from their own actions and the resulting states, generating supervisory signals without external rewards [2][3]

Group 1: Early Experience Paradigm
- The "early experience" paradigm bridges imitation learning and reinforcement learning, enabling agents to learn both from curated data and from their own experiences in the environment [2][3]
- Meta's implementation improved task-completion success rates by 9.6% and out-of-distribution generalization by 9.4%, a significant advance in AI training methodology [3][25]

Group 2: Methodologies
- Two strategies were explored within the early-experience framework: implicit world modeling and self-reflection [3][18]
- Implicit world modeling uses collected states to predict future states, letting agents internalize environmental dynamics without separate modules [10][12]
- Self-reflection has agents compare expert actions with their own generated actions and produce explanations that improve decision-making and learning [13][14]

Group 3: Experimental Results
- Benchmark tests showed that early-experience methods outperformed traditional imitation learning across scenarios, with both implicit world modeling and self-reflection yielding notable improvements [21][22]
- In out-of-distribution evaluations, early-experience methods significantly reduced performance gaps, demonstrating effective adaptation to unseen environments [23]

Group 4: Conclusion
- The findings suggest that starting training with early experience raises the performance ceiling of subsequent reinforcement-learning phases, acting as a bridge between the data and experience eras [25][26]
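A minimal sketch of how the two early-experience strategies turn an agent's own rollouts into supervision without external rewards. All names here are hypothetical; this only mirrors the data-construction idea described above, not Meta's code.

```python
def world_model_examples(rollout):
    """Implicit world modeling: from (state, action, next_state) triples
    gathered by the agent itself, build examples that train the policy to
    predict the resulting state, internalizing environment dynamics."""
    return [{"input": (s, a), "target": s_next} for s, a, s_next in rollout]

def self_reflection_examples(transitions, explain):
    """Self-reflection: pair the expert action with the agent's own
    alternative action and an explanation of why the expert choice is
    preferable; the explanation becomes extra training signal."""
    return [
        {"state": s, "expert": e, "alternative": o, "rationale": explain(s, e, o)}
        for s, e, o in transitions
    ]
```

In practice `explain` would itself be a language model; here it is just a callable so the data flow is visible.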
LLaVA-OneVision-1.5 fully open-sourced: pretraining the 8B model takes only 4 days and $16,000
机器之心· 2025-10-13 06:37
LLaVA began by using low-cost alignment to bridge a vision encoder and a large language model; LLaVA-1.5 strengthened understanding with larger, cleaner data and high-resolution inputs; LLaVA-NeXT extended to OCR, math, and multi-scenario tasks. The line then branched into LLaVA-NeXT-Video for temporal video and multi-frame reasoning, and LLaVA-NeXT-Interleave for interleaved multi-image-text inputs and cross-image joint reasoning, before converging in LLaVA-OneVision as a unified interface covering images, documents, charts, multi-image inputs, and video, balancing quality and efficiency.

LLaVA, proposed in 2023, efficiently connected open-source vision encoders with large language models through low-cost alignment, bringing "look, understand, converse" multimodal capability to the open ecosystem, clearly narrowing the gap with top closed-source models, and marking an important milestone for the open-source multimodal paradigm.

Although the interfaces and architectures for multimodal alignment are converging, a truly reproducible open-source path still differs from weights-only releases. Qwen2.5-VL and InternVL3.5 set high baselines in OCR, document understanding, math, and cross-image reasoning, but their complete data inventories, cleaning and mixing ratios, and alignment/sampling and training schedules are only partially disclosed, making end-to-end reproduction difficult. Molmo, with a cleaner data pipeline and refined design, on multiple evaluations ...
NeurIPS 2025 Spotlight | GeoSVR: the new potential of sparse voxels for high-precision 3D surface reconstruction beyond the 3DGS family
机器之心· 2025-10-13 04:21
Core Viewpoint
- The article introduces GeoSVR (Geometric Sparse Voxel Reconstruction), an explicit geometric optimization framework that surpasses existing methods in geometric accuracy, detail capture, and completeness when reconstructing surfaces from multi-view images [2][32]

Methodology
- The core of GeoSVR harnesses sparse voxels through two main designs:
  1. Voxel-Uncertainty Depth Constraint, which models per-voxel uncertainty and weights depth constraints accordingly to improve geometric accuracy [8][10]
  2. Sparse Voxel Surface Regularization, which applies several regularization strategies to maintain global consistency and prevent overfitting [14][22]

Experimental Results
- GeoSVR significantly outperforms existing methods across multiple datasets, achieving a notably better Chamfer distance than state-of-the-art methods with a training time of only 0.8 hours, versus more than 12 hours for previous methods [24][30]
- On the DTU dataset, GeoSVR achieved a mean Chamfer distance of 0.32, demonstrating superior geometric precision and reconstruction quality [23][30]
- On the Mip-NeRF 360 dataset, GeoSVR achieved an F1-score of 0.56, the highest-precision result among current methods [27]

Significance and Future Outlook
- GeoSVR demonstrates the potential of sparse voxels for high-quality surface reconstruction, laying a foundation for applications in robotic perception, autonomous driving, digital twins, and virtual reality [32][33]
- Future research will focus on scaling up scene reconstruction and supporting complex light-path conditions [33]
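The Voxel-Uncertainty Depth Constraint can be illustrated as an uncertainty-weighted depth loss: where the voxel geometry is already confident, an external depth prior is down-weighted; where it is uncertain, the prior dominates. This is a toy stand-in, not GeoSVR's actual formulation; the weight scheme and the L1 penalty are assumptions.

```python
def uncertainty_weighted_depth_loss(pred_depth, prior_depth, uncertainty):
    """Mean L1 depth error, with each pixel's prior weighted by its voxel
    uncertainty in [0, 1]: high uncertainty -> lean on the depth prior,
    low uncertainty -> trust the reconstruction already in place."""
    assert len(pred_depth) == len(prior_depth) == len(uncertainty)
    total = sum(u * abs(d - p) for d, p, u in zip(pred_depth, prior_depth, uncertainty))
    return total / len(pred_depth)
```

With all uncertainties at zero the prior contributes nothing, which is the intended behavior for well-constrained regions.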
Unbinding MoE: new "Expert-as-a-Service" inference architecture released, with ultra-fine-grained scaling cutting costs by 37.5%
机器之心· 2025-10-13 04:21
Core Viewpoint
- The article examines the challenges of large-language-model inference under the Mixture-of-Experts (MoE) architecture and introduces the Expert-as-a-Service (EaaS) model to improve efficiency, scalability, and robustness [2][4][25]

Group 1: Challenges in MoE Inference
- The inference cost of large language models has grown rapidly, prompting the need for cost-reduction strategies [2]
- Existing MoE frameworks face scalability problems because they require large-scale synchronous communication, which wastes resources [2]
- MoE systems have low fault tolerance: a single node failure can force the entire serving cluster to restart, interrupting service [3]
- Load imbalance arises because expert activation is dynamically sparse, leaving some GPU nodes overloaded while others sit idle [4]

Group 2: Introduction of EaaS
- EaaS recasts MoE inference as a microservices-style architecture, allowing flexible scheduling and independent scaling of expert services [7]
- The architecture decouples the expert layers from the attention layers, enabling asynchronous processing and better pipeline utilization [10]
- EaaS employs a dynamic batching mechanism and a custom communication library built on InfiniBand GPUDirect Async (IBGDA) to minimize communication latency and kernel-launch overhead [14]

Group 3: Performance and Scalability
- EaaS shows better scalability and fault tolerance than traditional MoE inference systems, maintaining throughput even when GPU nodes fail [15][20]
- The system supports fine-grained resource allocation, letting cloud providers adjust compute dynamically based on real-time load [18]
- EaaS can save up to 37.5% of GPU resources while matching the performance of static architectures [18]

Group 4: Future Potential
- EaaS shows significant potential for cloud-based large-model inference and model-as-a-service (MaaS) scenarios, fitting multi-tenant environments and continuous delivery [25]
- Its modular design allows independent upgrades and maintenance, so the system can evolve with changing model scales and application demands [25]
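The microservice framing above implies one basic routing step: tokens are grouped by the experts that must process them, so each expert service receives a single batched request rather than per-token calls. Below is a minimal sketch of that grouping; the names and the top-k routing format are assumptions, and real systems additionally batch across requests and time windows.

```python
from collections import defaultdict

def route_tokens_to_experts(token_ids, topk_experts):
    """Group token ids by selected expert, producing one batch per expert
    service; a token routed to k experts appears in k batches."""
    batches = defaultdict(list)
    for tok, experts in zip(token_ids, topk_experts):
        for e in experts:
            batches[e].append(tok)
    return dict(batches)
```

Because each expert's batch is independent, an expert service can be scaled, restarted, or relocated without touching the attention layer, which is the decoupling the article credits for EaaS's fault tolerance.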
SAM 3 appears at ICLR 2026. The next step after segmenting anything: teaching the model to understand "concepts"
机器之心· 2025-10-13 04:21
Core Insights
- The article discusses a newly released paper, "SAM 3: Segment Anything with Concepts," believed to be the continuation of Meta's "Segment Anything" series following SAM 1 and SAM 2 [1][3][4]

Group 1: Overview of SAM 3
- SAM 3 introduces a new task, Promptable Concept Segmentation (PCS): given text or image exemplars, the model predicts instance and semantic masks for all matching objects while maintaining identity consistency across video frames [8][12]
- The model targets atomic visual concepts, understanding simple noun phrases such as "red apple" or "striped cat" for segmentation [8][12]
- SAM 3 improves on its predecessors in promptable visual segmentation and establishes a new standard for PCS [18]

Group 2: Performance Metrics
- SAM 3 achieves at least a 2x improvement on the newly proposed SA-Co benchmark over previous systems [13]
- On the LVIS dataset, SAM 3 reached a zero-shot mask average precision of 47.0, surpassing the previous best of 38.5 [13]
- The model processes an image containing over 100 objects in 30 milliseconds on a single H200 GPU [14]

Group 3: Methodology and Data
- SAM 3 uses a dual encoder-decoder transformer architecture, pairing a detector with a tracker and memory module for video applications [20]
- The team built a scalable human-machine collaborative data engine, annotating a high-quality training set with 4 million unique phrases and 520 million masks [21]
- The PCS benchmark comprises 124K images and 1.7K videos with 214K unique concepts, a large expansion in concept count over existing benchmarks [25]

Group 4: Comparative Analysis
- SAM 3 outperforms previous models on instance segmentation, box detection, and semantic segmentation across multiple datasets [27][28]
- In open-vocabulary semantic segmentation experiments, SAM 3 exceeded strong baseline models [29]
- The model also demonstrated superior object-counting accuracy and segmentation capability compared with other models [33]
Large models reach for the stars: GPT and Gemini take gold at the International Olympiad on Astronomy and Astrophysics
机器之心· 2025-10-13 04:21
Core Insights
- The article reports remarkable advances in large language models (LLMs) such as GPT-5 and Gemini 2.5 Pro, which achieved gold-medal performance on the International Olympiad on Astronomy and Astrophysics (IOAA) [4][18]

Group 1: AI Model Performance
- GPT-5 and Gemini 2.5 Pro excelled at the IOAA, demonstrating strong reasoning and problem-solving in astronomy and astrophysics [4][12]
- On the theory exams, GPT-5 averaged 84.2% and Gemini 2.5 Pro 85.6%, outperforming other models by 7 to 25 percentage points [12][13]
- Both models reached gold-medal level, with GPT-5 scoring 86.8% on the 2025 exam, 89.6% on 2023, and 93.0% on 2022, consistently outperforming the best human participants [18][19]

Group 2: Evaluation Framework
- The study introduces a more rigorous framework for assessing LLMs in scientific research, focusing on complex reasoning and problem-solving rather than simple knowledge recall [9][10]
- The IOAA was chosen as a benchmark for its ecological validity: it covers a wide range of astronomical topics and demands multi-step reasoning [9][10]

Group 3: Error Analysis
- The models showed a marked performance gap across question types, with higher accuracy on physics/mathematics problems (67-91%) than on geometric/spatial problems (49-78%) [26]
- Common errors included conceptual misunderstandings and geometric-reasoning failures, pointing to fundamental difficulty in achieving deep physical understanding [25][26]
More weight for "fine-tuning is dead": Google extends the AI self-evolution paradigm, learning both from successes and from failures
机器之心· 2025-10-12 08:02
Core Insights
- The article discusses the concept of "Agentic Context Engineering," which lets language models self-improve without fine-tuning, an idea drawing attention from the academic community [1]
- Google's earlier work, "ReasoningBank," presents a similar idea: an innovative memory framework for agent systems that extracts and organizes memory items from the agent's own experiences [1][3]

Summary by Sections

ReasoningBank Overview
- ReasoningBank captures effective strategies from successes and important lessons from failures, distilling them into actionable principles in a closed-loop process [1][3]
- The framework stores structured memory items, each with a title, description, and content; agents interact with their environment and build new memory items from past experience [5][7]

Key Components of ReasoningBank
- Memory structure: memory items are distilled from past experiences, abstracting away low-level execution details while retaining transferable reasoning patterns [7]
- Integration with agents: agents equipped with ReasoningBank draw on a curated pool of transferable strategies to guide decision-making, improving adaptability to unseen queries [7]

Memory-Aware Test-Time Scaling (MaTTS)
- MaTTS integrates ReasoningBank with test-time scaling, generating diverse explorations that provide comparative signals for better memory synthesis [8][9]
- Two complementary implementations of MaTTS are introduced, parallel scaling and sequential scaling, strengthening the memory-synthesis loop [9]

Experimental Results
- Extensive experiments on challenging benchmarks, including WebArena and SWE-Bench-Verified tasks, show that ReasoningBank outperforms baseline methods, improving effectiveness by up to 34.2% while reducing interaction steps by 16.0% [11]
- The results indicate that ReasoningBank significantly improves both resolve rate and efficiency compared with memory-free agents [13][14]

Overall Impact
- The synergy between ReasoningBank and MaTTS is highlighted as the key mechanism for memory-based experience scaling, demonstrating superior performance across tasks [14][15]
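The structured memory items described above (title, description, content) can be sketched together with a retriever that surfaces relevant items for a new task. The overlap-based scoring here is a deliberately naive stand-in for whatever retrieval ReasoningBank actually uses; only the item structure comes from the article.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # one-line name of the strategy or lesson
    description: str  # when the principle applies
    content: str      # the actionable principle distilled from experience

def retrieve(bank, query, k=1):
    """Rank memory items by word overlap between the query and each item's
    title/description; return the top-k items to condition the agent on."""
    q = set(query.lower().split())
    def score(item):
        words = f"{item.title} {item.description}".lower().split()
        return sum(1 for w in words if w in q)
    return sorted(bank, key=score, reverse=True)[:k]
```

After a task finishes, the outcome (success or failure) would be distilled into a new `MemoryItem` and appended to the bank, closing the loop the article describes.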
Silicon Valley CEOs trumpet the AI threat ("unemployment will surge to 20% within 5 years"), yet 95% of AI projects lose money
机器之心· 2025-10-12 04:05
机器之心 report. Editor: 杨文

The current "AI threatens jobs" narrative is largely a warning extrapolated from technology trends rather than an established fact, but that is no reason to downplay AI's long-term impact.

Recently, claims that "AI will put people out of work" have been everywhere, casting a further shadow over already anxious workers.

Anthropic CEO Dario Amodei predicts a "doomsday" for white-collar employment: "AI may replace entry-level white-collar jobs at scale within the next five years, and unemployment could surge to between 10% and 20%, especially in industries such as law, finance, and consulting."

Goodwill's CEO says he is preparing for an AI-driven wave of Gen Z unemployment and believes the youth unemployment crisis has already arrived.

Stability AI co-founder Emad Mostaque claims mass unemployment will arrive next year: "AI can complete complex work without errors, which puts many jobs at risk of replacement. Unemployment will hit multiple industries at once and may worsen within the next one to two years."

Even Jad Tarifi, founder of Google's first generative AI team, says that ever-improving AI capabilities may soon make advanced degrees in law or medicine pointless.

The core argument of the paper in question is that the widespread adoption of AGI will cause human labor to ...
Is the threat of LLM jailbreak attacks systematically overestimated? A new jailbreak-evaluation paradigm based on decompositional scoring
机器之心· 2025-10-12 04:05
Core Viewpoint
- The article introduces JADES, a new framework for evaluating jailbreak attacks developed by researchers from CISPA, Flexera, and Xi'an Jiaotong University; it aims for more accurate assessment by replacing traditional holistic judging with a decompositional scoring mechanism [4][5][6]

Current Limitations of Jailbreak Assessment
- Accurately evaluating jailbreak attacks is hard because harmful questions are open-ended, making a unified standard of success difficult to establish [10]
- Existing automated methods suffer from two core flaws: misaligned proxy indicators that produce false positives, and holistic evaluation strategies that obscure the details of responses [11][12]

JADES Framework
- JADES automates the analytic scoring logic used by human experts, ensuring granular and reliable assessment through a multi-agent collaborative pipeline of four nodes [12]:
  1. Question Decomposition Node: breaks a harmful question into weighted sub-questions [12]
  2. Response Preprocessing Node: cleans the original jailbreak response to reduce complexity [16]
  3. Sub-Question Pairing Node: extracts the sentences of the cleaned response relevant to each sub-question [17]
  4. Evaluation Node: scores each sub-answer on a five-point Likert scale and aggregates the weighted scores to decide overall success [18]

Performance Evaluation
- The researchers built a benchmark dataset, JailbreakQR, with 400 pairs of harmful questions and jailbreak responses to validate JADES [20]
- JADES revealed that previous assessment methods systematically overestimated jailbreak success rates; the success rate of the LAA attack on GPT-3.5-Turbo dropped from 93% to 69% under JADES [24]
- In binary classification, JADES achieved 98.5% agreement with human evaluators, and it maintained 86.3% accuracy on a more challenging ternary classification [26]
- A new metric, Success Rate/Attack Success Rate (SR/ASR), showed that the proportion of fully successful cases was below 0.25, indicating that many attacks labeled successful were only partially successful [27]

Conclusion
- JADES establishes a transparent, reliable, and auditable standard for jailbreak assessment, exposing systemic biases in current evaluation methods and giving the field a more effective tool [28]
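The aggregation step in the Evaluation Node can be sketched as a weighted average of per-sub-question Likert scores mapped to [0, 1]. The source confirms weighted sub-questions and a five-point Likert scale; the exact aggregation formula and the 0.5 success threshold below are assumptions for illustration.

```python
def decompositional_score(sub_scores, weights, likert_max=5, threshold=0.5):
    """Map each sub-answer's 1..likert_max Likert score to [0, 1], take the
    weight-normalized average, and call the jailbreak successful only if
    the aggregate clears the threshold."""
    assert len(sub_scores) == len(weights) and sum(weights) > 0
    agg = sum(w * (s - 1) / (likert_max - 1) for s, w in zip(sub_scores, weights))
    agg /= sum(weights)
    return agg, agg >= threshold
```

Under such a scheme, a response that fully answers only a low-weight sub-question scores low overall, which matches the SR/ASR finding that many attacks previously labeled successful were only partial.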