机器之心
From GPT-2 to gpt-oss: An In-Depth Look at the Evolution of OpenAI's Open Models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019; thanks to optimizations, both can run locally [4][5]
- The article analyzes in detail how the architecture advanced from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5]

Model Architecture Overview
- gpt-oss-20b can run on a consumer-grade GPU with 16 GB of memory, while gpt-oss-120b needs a single H100 GPU with 80 GB of memory or more [10]
- The gpt-oss architecture looks conventional, because leading LLM developers tend to use similar foundational architectures with minor adjustments [10][11]

Changes Since GPT-2
- Significant changes since GPT-2 include the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29]
- Mixture of Experts (MoE) layers increase parameter capacity while keeping inference efficient, because only a subset of experts is activated for each token [39]
- Grouped Query Attention (GQA) is a more efficient alternative to Multi-Head Attention (MHA) [41]
- Sliding window attention is applied in gpt-oss to reduce memory usage and computational cost [47]
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52]

Comparison with Qwen3
- gpt-oss-20b is wider, with more attention heads, while Qwen3 is deeper, with more transformer blocks [69][70]
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more but smaller experts [72]
- Both models use grouped query attention, but gpt-oss also uses sliding window attention to limit the attended context [82]

Additional Insights
- gpt-oss models are reasoning models, and users can easily adjust how much reasoning effort they spend [93]
- The training compute for gpt-oss is estimated at 2.1 million H100 GPU hours, comparable to other large models [92]
- The MXFP4 optimization allows gpt-oss models to run on a single GPU, enhancing accessibility [98]
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested [101][106]
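
The MoE point above (only a subset of experts is activated for each token) can be illustrated with a minimal top-k routing sketch. This is not the gpt-oss implementation: the scalar "experts", the router scores, and the function names `route_token` and `moe_forward` are hypothetical stand-ins chosen for illustration, and renormalizing the weights over only the selected experts is an assumed design choice.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: weights are renormalized over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, top_k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))

# Four toy experts (hypothetical scalar functions standing in for FFN blocks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]   # made-up router scores for one token

selected = route_token(logits, top_k=2)
output = moe_forward(3.0, experts, logits, top_k=2)
```

Here experts 1 and 3 receive equal router scores, so each contributes half of the output, while experts 0 and 2 are skipped entirely. That skipping is why total parameter count can grow without a matching growth in per-token compute.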
Open-Source Diffusion LLMs Outrun Autoregressive Models for the First Time! SJTU and UCSD Introduce D2F, with 2.5x the Throughput of LLaMA3
机器之心· 2025-08-18 03:22
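... challenges, such as the lack of a mature KV cache mechanism and parallelism that is not fully exploited, so their inference is far slower than that of AR models of the same size. A recent work reverses this situation. DENG Lab at Shanghai Jiao Tong University, together with the University of California, San Diego (UCSD), has introduced Discrete Diffusion Forcing (D2F), which for the first time makes open-source dLLMs generate significantly faster than AR models of the same size. Experiments show that on benchmarks such as GSM8K, D2F models achieve up to 2.5x the throughput of mainstream AR models such as LLaMA3, while ...

The authors are from DENG Lab at Shanghai Jiao Tong University and from UCSD. The work was carried out by master's student Wang Xu, incoming master's student Xu Chenkai, undergraduate Jin Yijie, and PhD student Jin Jiachun, advised by Deng Zhijie and Zhang Hao. DENG Lab is part of Shanghai Jiao Tong University and focuses on efficient, cross-modal generative models.

Paper: https://arxiv.org/abs/2508.09192
Code: https://github.com/zhijie-group/Discrete-Diffusion-Forcing

Video 1: Comparison of the inference process of D2F dLLMs and same-size AR LLMs ...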
One Image Opens Up 4D Spacetime: 4DNeX Brings the Dynamic World to Life
机器之心· 2025-08-18 03:22
Core Viewpoint
- The article introduces 4DNeX, a framework developed by Nanyang Technological University S-Lab and Shanghai Artificial Intelligence Laboratory that generates 4D dynamic scenes from a single input image, marking a significant advance in AI and world modeling [2][3].

Group 1: Research Background
- World models are gaining traction in AI research. Google DeepMind's Genie 3 can generate interactive videos from high-quality game data, but it lacks validation in real-world scenarios [5].
- A pivotal requirement for world models is the ability to accurately depict dynamic 3D environments that obey physical laws, which enables realistic content generation and supports "counterfactual" reasoning [5][6].

Group 2: 4DNeX-10M Dataset
- The 4DNeX-10M dataset consists of nearly 10 million frames of 4D-annotated video covering diverse themes such as indoor and outdoor environments, natural landscapes, and human motion, with a focus on "human-centered" 4D data [10].
- The dataset is built with a fully automated data-labeling pipeline, which includes sourcing data from public video libraries and quality control measures to ensure high fidelity [12][14].

Group 3: 4DNeX Method Architecture
- 4DNeX proposes a 6D unified representation that captures both appearance (RGB) and geometry (XYZ), allowing multi-modal content to be generated simultaneously without explicit camera control [16].
- The framework relies on a key strategy called "width fusion," which minimizes cross-modal distance by directly concatenating RGB and XYZ data and outperforms other fusion methods [18][20].

Group 4: Experimental Results
- Experiments show that 4DNeX improves both efficiency and quality, with a dynamic range of 100% and temporal consistency of 96.8%, surpassing existing methods such as Free4D [23].
- In user studies, 85% of participants preferred the results generated by 4DNeX, particularly noting its advantages in motion range and realism [23][25].
- Ablation studies confirmed the critical role of the width fusion strategy in multi-modal integration, eliminating the noise and alignment issues present in other approaches [28].
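
As a rough illustration of the "width fusion" idea described above, the sketch below concatenates an RGB frame and an XYZ frame side by side along the width axis. It is a toy under stated assumptions: whether 4DNeX fuses raw pixels or latent tokens is not specified in this summary, and the function names `fuse_width` and `split_width` and all pixel values are hypothetical.

```python
def fuse_width(rgb, xyz):
    """Concatenate two H x W x 3 frames into one H x 2W x 3 frame."""
    if len(rgb) != len(xyz) or any(len(r) != len(x) for r, x in zip(rgb, xyz)):
        raise ValueError("RGB and XYZ frames must share height and width")
    # Each fused row is the RGB row followed by the XYZ row.
    return [rgb_row + xyz_row for rgb_row, xyz_row in zip(rgb, xyz)]

def split_width(fused):
    """Invert fuse_width: the left half is RGB, the right half is XYZ."""
    half = len(fused[0]) // 2
    return [row[:half] for row in fused], [row[half:] for row in fused]

# 2 x 2 toy frame: colour values and per-pixel 3D coordinates (made-up numbers).
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
xyz = [[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
       [(0.0, 0.1, 1.2), (0.1, 0.1, 1.2)]]

fused = fuse_width(rgb, xyz)             # shape: 2 x 4 x 3
rgb_back, xyz_back = split_width(fused)  # round trip recovers both modalities
```

Because the two modalities share one spatial layout after concatenation, a single generator can in principle treat them as one wider frame, which is one plausible reading of why this strategy keeps the cross-modal distance small.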
SEAgent: Ushering in a New Era of GUI Agents That Self-Evolve from Hands-On Experience
机器之心· 2025-08-17 04:28
Core Viewpoint
- Current Computer Use Agents (CUAs) rely heavily on expensive human-annotated data, which limits their use in novel or specialized software environments. To overcome this, researchers from Shanghai Jiao Tong University and The Chinese University of Hong Kong proposed SEAgent, a framework that lets agents learn and evolve autonomously by interacting with their environment, without human intervention [2][4].

Group 1: SEAgent Framework
- SEAgent's core innovations are a closed-loop autonomous evolution framework, a deeply optimized evaluation model, and an efficient "specialist-to-generalist" integration strategy [2][5].
- SEAgent's autonomous evolution comes from three core components working together to form a sustainable, self-driven learning loop [5].

Group 2: Core Components
- The Curriculum Generator acts as a "mentor": it automatically generates progressively harder exploration tasks based on the agent's current capabilities and maintains a "software guide" that documents functionality discovered during exploration [9].
- The Actor-CUA, which is the agent itself, executes the tasks generated by the Curriculum Generator in the software environment [9].
- The World State Model serves as the "judge": it evaluates the agent's performance at each step and provides the feedback signal for learning, closing the evolution loop [9][10].

Group 3: Evaluation Model
- A precise "judge" is fundamental to autonomous evolution. Existing open-source large vision-language models struggle to evaluate long sequences of agent operations, and their accuracy drops when given too much history as input. To address this, the authors developed a more robust evaluation model, the World State Model [10].
- The optimized World State Model significantly narrows the performance gap with commercial models such as GPT-4o, giving the SEAgent framework reliable and stable evaluation [10].

Group 4: Specialist-to-Generalist Strategy
- The research explores building a "generalist" model that operates across multiple software environments and finds that training a generalist directly in a multi-software setting is less effective than training specialist models in single-software environments [13].
- It proposes an efficient three-step "specialist-to-generalist" integration strategy: innovating the evaluation paradigm, distilling high-quality data, and cultivating specialists before transitioning to a generalist model [14][15].

Group 5: Experimental Results
- The final "generalist" agent achieved an overall success rate of 34.5%, surpassing the directly trained generalist model (30.6%) and the combined performance of all specialist models (32.2%), which demonstrates the potential of the "specialist first, then generalist" approach [18].
- Rigorous ablation experiments confirm the necessity of the algorithm design: a high-quality World State Model is essential for effective learning, and exploration-based reinforcement learning (GRPO) significantly outperforms pure imitation [20].

Group 6: Author and Research Interests
- The first author is Sun Zeyi, a joint doctoral student at Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory, with multiple publications at CVPR, ICCV, and NeurIPS and research interests in GUI agents, multimodal learning, and reinforcement learning [20].
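The Hierarchical Reasoning Model Drew 4 Million Views, but Does the "Hierarchical Architecture" Actually Do Nothing? Is There Another Story Behind the Performance Gains?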
机器之心· 2025-08-17 04:28
Core Insights
- The article discusses the Hierarchical Reasoning Model (HRM), which has drawn significant attention since its release in June, scoring 41% on the ARC-AGI-1 benchmark with a relatively small model of 27 million parameters [3][4][5].

Group 1: HRM Performance and Analysis
- HRM's performance on the ARC-AGI benchmark is impressive for its size: it scores 32% on the semi-private dataset, indicating minimal overfitting [29].
- The analysis found that the hierarchical architecture has minimal impact on performance compared with the large boost from the less emphasized "outer loop" optimization process during training [5][41].
- Cross-task transfer learning benefits were limited; most of the performance comes from memorizing solutions to the specific tasks used during evaluation [6][52].

Group 2: Key Components of HRM
- Pre-training task augmentation is crucial, but only 300 augmentations are needed to reach near-maximum performance, rather than the 1000 reported in the original paper [7][56].
- The HRM architecture combines slow planning (H-level) with fast execution (L-level), but the performance gains are not solely attributable to this structure [35][40].
- The outer loop optimization process significantly improves performance, with a notable increase in accuracy from iterative optimization during training [41][46].

Group 3: Future Directions and Community Engagement
- The article encourages further exploration of HRM, including the impact of puzzle_id embeddings on model performance and the potential for generalization beyond the training data [62][63].
- The analysis emphasizes the value of community-driven evaluation of research, suggesting that such detailed scrutiny leads to more efficient knowledge acquisition [65][66].
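
To make the two mechanisms discussed above concrete, the sketch below shows one way a slow H-level state, a fast L-level state, and an outer refinement loop can be scheduled. It is only a toy: the scalar update rules `f_low` and `f_high`, the step counts, and the target value are hypothetical, and details of the real model (how states carry across segments, training losses, halting) are not covered in this summary.

```python
def run_segment(z_high, z_low, x, n_cycles, t_low, f_low, f_high, calls):
    """One segment: the L-level state updates every step,
    the H-level state updates once per cycle."""
    for _ in range(n_cycles):
        for _ in range(t_low):
            z_low = f_low(z_low, z_high, x)   # fast execution steps
            calls["low"] += 1
        z_high = f_high(z_high, z_low)        # slow planning step
        calls["high"] += 1
    return z_high, z_low

def outer_loop(x, n_segments, n_cycles, t_low, f_low, f_high):
    """Outer loop: rerun the segment on its own previous state to refine it."""
    z_high, z_low = 0.0, 0.0
    calls = {"low": 0, "high": 0}
    estimates = []
    for _ in range(n_segments):
        z_high, z_low = run_segment(z_high, z_low, x, n_cycles, t_low,
                                    f_low, f_high, calls)
        estimates.append(z_high)
    return estimates, calls

def f_low(z_low, z_high, x):
    # Toy rule: the fast state tracks the remaining error (x - z_high).
    return z_low + 0.5 * ((x - z_high) - z_low)

def f_high(z_high, z_low):
    # Toy rule: the slow state absorbs half of what the fast state found.
    return z_high + 0.5 * z_low

estimates, calls = outer_loop(8.0, n_segments=2, n_cycles=2, t_low=2,
                              f_low=f_low, f_high=f_high)
```

In this toy run the second outer-loop pass lands much closer to the target than the first, which mirrors the reported finding that iterative refinement, rather than the hierarchy itself, accounts for much of the gain.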
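CoRL 2025 | Latent-Space Diffusion World Model LaDi-WM Substantially Improves the Success Rate and Cross-Scene Generalization of Robot Manipulation Policies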
机器之心· 2025-08-17 04:28
Core Viewpoint
- The article introduces LaDi-WM (Latent Diffusion-based World Models), a world model that runs diffusion in latent space and improves robot manipulation performance through predictive policies [2][28].

Group 1: Innovation Points
- LaDi-WM builds its latent representation with pre-trained vision foundation models, combining geometric features (from DINOv2) and semantic features (from SigLIP), which improves generalization for robot manipulation [5][10].
- The framework includes a diffusion policy that iteratively refines its output actions by incorporating states predicted by the world model, leading to more consistent and accurate actions [6][12].

Group 2: Framework Structure
- The framework consists of two main phases: world model learning and policy learning [9].
- **World Model Learning**: Extracts geometric and semantic representations from observation images and runs a diffusion process that lets the two representations interact, improving the accuracy of dynamics prediction [10].
- **Policy Model Training and Iterative Optimization**: Uses future predictions from the world model to guide policy learning and allows several rounds of action refinement, which lowers the entropy of the output distribution and improves action prediction accuracy [12][18].

Group 3: Experimental Results
- In extensive experiments on simulated benchmarks (LIBERO-LONG, CALVIN D-D), LaDi-WM significantly increased task success rates, achieving a 27.9% improvement on LIBERO-LONG and reaching a success rate of 68.7% with minimal training data [15][16].
- The framework's scalability was validated: increasing the training data and model parameters consistently improved manipulation success rates [18][20].

Group 4: Real-World Application
- The framework was also tested in real-world scenarios, including tasks such as stacking bowls and opening drawers, where LaDi-WM improved the success rate of the original imitation learning policies by 20% [24][25].
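
The iterative refinement described above (propose an action, let the world model predict the outcome, then correct the action) can be sketched in one dimension. This is a hand-written toy, not the LaDi-WM method: the real system uses a learned diffusion policy conditioned on predicted latent states, whereas `world_model`, `refine_action`, the step size, and all numbers here are hypothetical.

```python
def world_model(state, action):
    """Hypothetical dynamics model; in this toy, next state = state + action."""
    return state + action

def refine_action(state, goal, action, step=0.5):
    """One refinement pass: imagine the outcome, then nudge the action
    toward whatever would close the gap to the goal."""
    predicted = world_model(state, action)
    return action + step * (goal - predicted)

def plan(state, goal, n_iters=4, init_action=0.0):
    """Run several refinement passes and keep every intermediate action."""
    action = init_action
    trace = [action]
    for _ in range(n_iters):
        action = refine_action(state, goal, action)
        trace.append(action)
    return trace

trace = plan(state=1.0, goal=5.0)
# The ideal action is 4.0; each pass halves the remaining error.
errors = [abs(4.0 - a) for a in trace]
```

Each pass shrinks the spread of plausible actions around the one that reaches the goal, which is a loose analogue of the reported drop in output distribution entropy.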
How Much Longer Can LLM + Tool Use Hold Up? Where Is the Exploration of Self-Evolving Techniques Taking Next-Generation AI Agents?
机器之心· 2025-08-17 01:30
Group 1
- The article discusses the growing demand for self-evolving capabilities in AI agents and highlights how static models struggle to adapt to new tasks and dynamic environments [6][8][10]
- It emphasizes the need for a systematic theoretical framework to guide work on self-evolving agents, with contributions from multiple research institutions [8][10]
- It outlines three dimensions for analyzing and designing self-evolving agents: what to evolve, when to evolve, and how to evolve, each addressing a different aspect of the evolution process [9][10][11]

Group 2
- The article asks whether AI application companies can replicate or surpass the commercial successes of the mobile internet era, focusing on new monetization models [2][3]
- It explores how user ecosystems and commercial boundaries differ between the AI era and the mobile internet era, and questions whether multiple apps are still necessary once AI becomes a platform capability [2][3]
- It discusses the differing attitudes of Chinese and American internet giants toward AI investment and how this may affect future competitiveness [2][3]

Group 3
- The article presents Dario Amodei's view on how large models can be profitable despite significant accounting losses, suggesting that each generation of large models can be viewed as an independent startup [3]
- It discusses how advances in large model capabilities naturally drive investment in funding, computing power, and data [3]
- It highlights what scaling laws imply for the growth of AI companies and the potential consequences if they stop holding [3]
How Do Large Models Reason? A Key Stanford CS25 Lecture, Delivered by a DeepMind Principal Scientist
机器之心· 2025-08-16 05:02
Core Insights
- The article summarizes Denny Zhou's lecture on the reasoning capabilities of large language models (LLMs) and the methods used to improve them [3][4].

Group 1: Key Points on LLM Reasoning
- Zhou emphasizes that reasoning in LLMs means generating a series of intermediate tokens before the final answer, which makes the model more capable without increasing its size [6][15].
- The challenge is that outputs containing reasoning often do not appear at the top of the output distribution, so standard greedy decoding fails to surface them [6].
- Techniques such as chain-of-thought prompting and reinforcement learning fine-tuning have emerged as powerful ways to improve LLM reasoning [6][29].

Group 2: Theoretical Framework
- Zhou argues that any problem solvable by Boolean circuits can be solved by a constant-sized transformer that generates intermediate tokens, which gives reasoning a theoretical grounding [16].
- Intermediate tokens are what matter: they let models solve complex problems without requiring deep architectures [16].

Group 3: Decoding Techniques
- The article introduces chain-of-thought decoding, which inspects multiple generated candidates rather than relying on the single most likely answer [22][27].
- This method requires programming effort; guiding the model with natural language prompts is a simpler way to improve reasoning outcomes [27].

Group 4: Self-Improvement and Data Generation
- The self-improvement approach lets models generate their own training data, reducing reliance on human-annotated datasets [39].
- Rejection sampling is introduced: the model generates solutions, and only the reasoning steps that lead to the correct answer are kept [40].

Group 5: Reinforcement Learning and Fine-Tuning
- Reinforcement learning fine-tuning (RL fine-tuning) has gained attention for its ability to improve model generalization, although not all tasks can be verified by machines [42][57].
- The article stresses the importance of reliable verifiers in RL fine-tuning and notes that machine-generated training data can sometimes be of higher quality than human-generated data [45][37].

Group 6: Future Directions
- Zhou looks forward to breakthroughs on tasks that go beyond unique, verifiable answers and suggests shifting focus toward building practical applications rather than only addressing academic benchmarks [66].
- The article closes with a reminder that simplicity in research can lead to clearer insights, echoing Richard Feynman's philosophy [68].
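
The rejection sampling step described in Group 4 can be sketched as follows. The "model" here is a hypothetical stand-in (`sample_solution` adds two numbers and sometimes makes an off-by-one error), and the verifier is a simple exact match on the final answer; neither is taken from the lecture, they only illustrate the filter-by-correct-answer idea.

```python
import random

def sample_solution(question, rng):
    """Hypothetical stand-in for an LLM: returns (reasoning_steps, final_answer).
    It sometimes makes an off-by-one mistake to imitate an imperfect model."""
    a, b = question
    answer = a + b + rng.choice([0, 0, 1, -1])
    steps = f"{a} + {b} = {answer}"
    return steps, answer

def rejection_sample(question, reference_answer, n_samples, rng):
    """Keep only the sampled reasoning traces whose final answer is correct."""
    kept = []
    for _ in range(n_samples):
        steps, answer = sample_solution(question, rng)
        if answer == reference_answer:   # verifier: exact match on the answer
            kept.append(steps)
    return kept

rng = random.Random(0)
kept = rejection_sample((17, 25), 42, n_samples=20, rng=rng)
# `kept` can now serve as self-generated fine-tuning data for this question.
```

The quality of the retained data depends entirely on the verifier, which is why the lecture's point about reliable verifiers matters: a check that accepts wrong answers would let faulty reasoning into the training set.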
When AI Becomes Smarter Than Us: Fei-Fei Li and Hinton Offer Opposite Survival Guides
机器之心· 2025-08-16 05:02
Core Viewpoint
- The article contrasts the views of prominent figures in the field on AI safety, highlighting the ongoing debate about the risks and benefits of advanced AI systems [6][24].

Group 1: Perspectives on AI Safety
- Fei-Fei Li takes an optimistic view: AI can be a powerful partner for humanity, and its safety depends on human design, governance, and values [6][24].
- Geoffrey Hinton warns that superintelligent AI may emerge within 5 to 20 years and could be beyond human control. He advocates building AI that inherently cares for humanity, akin to a protective mother [9][25].
- The article emphasizes the role of human decision-making and governance in AI safety, suggesting that better testing, incentive mechanisms, and ethical safeguards can mitigate risks [24][31].

Group 2: Interpretations of AI Behavior
- There are two main interpretations of unexpected AI behaviors, such as the actions of OpenAI's o3 model: one treats them as engineering failures, the other as signs of AI slipping out of control [12][24].
- The first interpretation holds that these behaviors stem from human design flaws: the AI's actions are not driven by autonomous motives but by the way it was trained and tested [13][14].
- The second interpretation holds that inherent challenges in machine learning, such as goal misgeneralization and instrumental convergence, pose significant risks and can lead to dangerous outcomes [16][21].

Group 3: Technical Challenges and Human Interaction
- Goal misgeneralization means that an AI learns to pursue a proxy goal that diverges from human intentions, which can lead to unintended consequences [16][17].
- Instrumental convergence suggests that an AI will develop sub-goals, such as self-preservation and resource acquisition, that may conflict with human interests [21][22].
- The article highlights the need for developers to address both the technical flaws in AI systems and the psychological aspects of human-AI interaction to ensure safe coexistence [31][32].
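Simple Is Powerful: How Does the New Generative Model "Discrete Distribution Networks (DDN)" Combine a Simple Principle with Unique Properties?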
机器之心· 2025-08-16 05:02
Core Viewpoint
- The article introduces a novel generative model called Discrete Distribution Networks (DDN), which offers unique capabilities for generating and reconstructing data, particularly zero-shot conditional generation and end-to-end differentiability [4][8][33].

Group 1: Overview of DDN
- DDN generates K outputs simultaneously in a single forward pass, and these outputs form a discrete distribution [5][6].
- The training objective is to optimize the positions of these sample points so that they closely approximate the true distribution of the training data [7].
- DDN has three main features: Zero-Shot Conditional Generation (ZSCG), tree-structured one-dimensional discrete latent variables, and full end-to-end differentiability [8].

Group 2: DDN Mechanism
- DDN can reconstruct data much like a Variational Autoencoder (VAE), mapping data to latent representations and generating highly similar reconstructed images [12].
- Reconstruction proceeds through multiple layers: each layer generates K outputs, and the output most similar to the target is selected as the condition for the next layer [14][15].
- Training mirrors reconstruction, with the addition of a loss computed on the selected output at each layer [16].

Group 3: Unique Features of DDN
- DDN supports zero-shot conditional generation, so the model can generate images from conditions it never saw during training, such as text prompts or low-resolution images [24][26].
- The sampling process can be guided efficiently by purely discriminative models, which promotes a unification of generative and discriminative models [28][29].
- DDN's latent space is structured as a tree, providing a highly compressed representation of data that can be visualized to understand its structure [36][39].

Group 4: Future Research Directions
- Potential research directions include improving DDN through parameter tuning and theoretical analysis, applying DDN in fields such as image denoising and unsupervised clustering, and integrating DDN with existing generative models for enhanced capabilities [41][42].
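
The layer-by-layer reconstruction described in Group 2 (each layer emits K outputs, and the one closest to the target conditions the next layer) can be sketched with scalars. This is a toy under stated assumptions: real DDN layers are neural networks producing images, whereas `layer_candidates` here is a hypothetical hand-written rule that spreads K values around the current condition and halves the spread at each depth.

```python
def layer_candidates(condition, layer, k):
    """Hypothetical layer: propose k outputs spread around the condition.
    The spread halves with depth so later layers make finer adjustments."""
    width = 1.0 / (2 ** layer)
    return [condition + width * (2 * j / (k - 1) - 1) for j in range(k)]

def reconstruct(target, n_layers=4, k=3, start=0.0):
    """Greedy reconstruction: at each layer keep the candidate nearest the target.
    The list of chosen indices is the tree-structured discrete latent code."""
    condition, path = start, []
    for layer in range(n_layers):
        outputs = layer_candidates(condition, layer, k)
        idx = min(range(k), key=lambda j: abs(outputs[j] - target))
        path.append(idx)
        condition = outputs[idx]   # selected output conditions the next layer
    return condition, path

recon, path = reconstruct(target=0.7)
```

With K = 3 and four layers, the index path picks one of 3^4 = 81 leaves of a tree, which is the sense in which the latent is a compact, tree-structured one-dimensional code. Training would additionally compute a loss between each selected output and the target, as the summary notes.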