KDD 2025 | UoMo Arrives: The First Wireless Network Traffic Forecasting Model, One Framework for Three Task Types
机器之心· 2025-08-18 05:15
Have you ever imagined a mobile network that can "see the future," sensing user demand before it arrives? At this year's ACM KDD 2025, a team from Tsinghua University's Department of Electronic Engineering, together with China Mobile, released UoMo, the first universal traffic forecasting model for mobile networks. UoMo handles both short-term and long-term forecasting, and can even generate traffic distributions for entirely new regions with no historical data. It combines cutting-edge diffusion models with a Transformer architecture, and it understands urban geography and crowd flows, making network planning and optimization smarter and more precise.
Title: UoMo: A Universal Model of Mobile Traffic Forecasting for Wireless Network Optimization
Authors: Haoye Chai (柴浩野), Shiyuan Zhang (张诗源), Xiaoqian Qi (齐效乾), Baohua Qiu (邱宝华), Yong Li (李勇)
Institutions: Tsinghua University; China Mobile
Paper: https://dl.acm.org/doi/10.1145/3711896.3737272
Data and code: https://github.com/tsinghua-fib-lab/UoMo
Why Build UoMo ...
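The three task types UoMo unifies (short-term forecasting, long-term forecasting, and generation for regions with no history) can be pictured as different masking patterns over a traffic series that a single model learns to reconstruct. A minimal NumPy sketch; the function name, horizon, and mask layout are illustrative assumptions, not the authors' API:

```python
import numpy as np

def make_mask(seq_len: int, task: str, horizon: int = 6) -> np.ndarray:
    """Build a boolean mask over a traffic time series (True = to be predicted).

    Illustrative sketch: the three task types differ only in which
    timesteps the model must reconstruct.
    """
    mask = np.zeros(seq_len, dtype=bool)
    if task == "short_term":       # predict a short future window
        mask[-horizon:] = True
    elif task == "long_term":      # predict a long future window
        mask[-(seq_len // 2):] = True
    elif task == "generation":     # no history: everything is generated
        mask[:] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask
```

Under this view, one backbone trained on masked reconstruction covers all three tasks by varying only the mask.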
NextStep-1: An Exploration of the Autoregressive Paradigm in Image Generation
机器之心· 2025-08-18 05:15
Core Insights
- The article discusses the development of NextStep-1, a new autoregressive model for image generation that operates directly in continuous visual space, avoiding the information loss associated with discretization [2][3][4]
- The model utilizes a lightweight Flow Matching Head, which simplifies the architecture and allows for end-to-end training without reliance on external diffusion models [4][5]
- The exploration aims to provide a new perspective in the multimodal generation field, emphasizing the potential for creating efficient and high-fidelity generative models [26][33]

Technical Framework
- NextStep-1 is built on a powerful Transformer backbone with 14 billion parameters, complemented by a Flow Matching Head with 157 million parameters for generating continuous image patches [7][8]
- The model generates images autoregressively by producing patches sequentially, which bypasses the bottleneck of discretization [8]
- The architecture is designed to be simple and pure, demonstrating that a streamlined autoregressive model can be constructed without sacrificing continuity [4][26]

Key Discoveries
- The team identified that the Transformer acts as the main creator, while the Flow Matching Head serves as an efficient sampler, with the size of the Flow Matching Head having minimal impact on image quality [12]
- Two critical techniques were discovered for stability and quality: channel-wise normalization to stabilize token statistics, and the counterintuitive finding that adding more noise during training can enhance image quality [14][16]

Performance Evaluation
- NextStep-1 has been rigorously evaluated against industry benchmarks, achieving results competitive with state-of-the-art diffusion models [21][22]
- The model's performance metrics include GenEval scores of 0.63/0.737 and a DPG-Bench score of 85.28, indicating strong image-generation capability [21][22]

Limitations and Future Directions
- The model faces challenges related to stability during generation, particularly when expanding the latent-space dimensions, which can lead to occasional failures [27][29]
- The autoregressive nature of the model introduces latency issues, particularly in sequential decoding, which affects overall performance [28]
- Future work will focus on optimizing the Flow Matching Head, accelerating the autoregressive backbone, and improving convergence efficiency, especially for high-resolution image generation [34][35]
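Of the two stabilization techniques mentioned above, channel-wise normalization is simple to show concretely. A minimal sketch, assuming tokens form a [num_tokens, channels] array; the paper's exact formulation may differ:

```python
import numpy as np

def channel_norm(tokens: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each channel of continuous image tokens to zero mean, unit std.

    tokens: [num_tokens, channels]. Statistics are computed per channel,
    keeping every channel on a comparable scale during training.
    """
    mean = tokens.mean(axis=0, keepdims=True)
    std = tokens.std(axis=0, keepdims=True)
    return (tokens - mean) / (std + eps)
```

The point is that without discrete codebooks, the continuous token statistics themselves must be kept well-conditioned for the Flow Matching Head to sample from.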
Robots Slacking Off Too? Unitree G1 Slumps on the Couch Watching Videos After the Race; Netizens: It Enjoys Life Better Than Humans
机器之心· 2025-08-18 05:15
Core Viewpoint
- The article highlights the impressive performance of the Unitree G1 robot at the 2025 World Humanoid Robot Games, showcasing its capabilities in both competition and leisure, while also addressing operational challenges faced during the events [3][5][50]

Group 1: Competition Performance
- The Unitree G1 achieved remarkable results, including a standout 100-meter sprint in which it demonstrated exceptional speed [5][10]
- In the 1500-meter final, the H1 robot caused a stir by colliding with an opponent's operator yet went on to win the race, prompting humorous online commentary about its "strategy" [18][21][25]
- The G1 excelled in obstacle courses, showcasing its balance and coordination while successfully navigating stairs and complex environments [35][47]

Group 2: Operational Challenges
- The collision during the 1500-meter race was attributed to a loss of control during an operator handover, highlighting the need for improved operational protocols [25][28]
- The H1's run was performed under manual control, raising concerns about operator fatigue and prompting plans for fully autonomous participation in future events [29]
- While the robots displayed impressive capabilities, there remains room for improvement in their algorithms and operational efficiency [31][50]
From GPT-2 to gpt-oss: A Deep Dive into the Evolution of OpenAI's Open Models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released its first open-weight models since GPT-2 in 2019, gpt-oss-120b and gpt-oss-20b, which can run locally thanks to various optimizations [4][5]
- The article provides a detailed analysis of the architectural advancements from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5]

Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of memory, while gpt-oss-120b requires a single H100 GPU with 80 GB of memory or more [10]
- The architecture of the gpt-oss models appears conventional, as leading LLM developers often use similar foundational architectures with minor adjustments [10][11]

Changes Since GPT-2
- The article highlights significant changes from GPT-2, including the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29]
- Mixture of Experts (MoE) layers increase parameter capacity while maintaining efficiency by activating only a subset of experts for each token [39]
- Grouped Query Attention (GQA) is introduced as a more efficient alternative to Multi-Head Attention (MHA) [41]
- Sliding-window attention is applied in gpt-oss to reduce memory usage and computational cost [47]
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52]

Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 has a deeper architecture with more transformer blocks [69][70]
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more, smaller experts [72]
- Both models use grouped query attention, but gpt-oss additionally applies sliding-window attention to limit context size [82]

Additional Insights
- The gpt-oss models are designed for inference, allowing users to easily control the inference workload [93]
- The training compute for gpt-oss is estimated at 2.1 million H100 GPU hours, comparable to other large models [92]
- The MXFP4 optimization allows gpt-oss models to run on a single GPU, enhancing accessibility [98]
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested [101][106]
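Among the changes listed, RMSNorm is the easiest to contrast with the LayerNorm it replaces: it skips mean subtraction and the bias term, saving one reduction pass per normalization. A minimal NumPy sketch with the learnable scale omitted for brevity:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by the root-mean-square of the activations.

    No mean subtraction and no bias, which is cheaper than LayerNorm
    at large scale while stabilizing training similarly well.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm for comparison: subtract the mean, divide by the std."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Both normalize the last axis per token; RMSNorm simply trusts that the mean is already near zero and only controls the scale.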
Open-Source Diffusion LLMs Beat Autoregressive Models for the First Time! SJTU and UCSD Release D2F, with 2.5× the Throughput of LLaMA3
机器之心· 2025-08-18 03:22
Core Insights
- The article discusses the introduction of Discrete Diffusion Forcing (D2F), a new approach that significantly enhances the inference speed of open-source diffusion large language models (dLLMs), achieving up to 2.5 times the throughput of autoregressive (AR) models on benchmarks like GSM8K [2][6][22]

Group 1: Challenges and Solutions
- Existing dLLMs face challenges such as the lack of a complete KV-cache mechanism and insufficient parallelism, resulting in slower inference than AR models [2][8]
- D2F addresses these challenges with a hybrid autoregressive-diffusion paradigm, optimizing the model architecture, training method, and inference strategy [11][12]

Group 2: D2F Design Features
- D2F adopts block-level causal attention to ensure compatibility with KV caching, allowing KV states to be reused and reducing computational redundancy [12][15]
- The model employs asymmetric distillation and structured noise scheduling to efficiently transfer knowledge from a pre-trained teacher model to the D2F student, enhancing its parallel decoding ability [18]

Group 3: Inference Mechanism
- D2F introduces a pipelined parallel decoding algorithm that maintains a dynamic decoding window with semi-activated and fully-activated block states to trade off throughput and quality [20][21]
- The model achieves speedups of up to 50 times over the original dLLMs while maintaining comparable average performance [22]

Group 4: Performance Metrics
- D2F demonstrates a superior performance-efficiency trade-off and adapts to different scenarios by adjusting decoding parameters, achieving over four times the throughput of AR models on some tasks [25]
- Comparative tests show D2F-LLaDA reaching a throughput of 52.5 tokens per second, a 7.3× increase over baseline methods [23]

Group 5: Future Directions
- The success of D2F points to a promising path for further research on parallel decoding, with potential future work including real-time serving and hybrid parallel processing [28]
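D2F's block-level causal attention (bidirectional within a block, causal across blocks, so a finished block's KV states can be cached) reduces to a simple mask. A sketch, with the block size as an illustrative parameter:

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: token i may attend to token j iff
    j's block index <= i's block index.

    Within a block attention is bidirectional (diffusion-style);
    across blocks it is causal, which is what makes the KV states
    of completed blocks reusable, as in AR decoding.
    """
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]
```

For example, with `seq_len=8, block_size=4`, token 0 attends to token 3 (same block) but not to token 4 (a later block), while every token in block 1 attends to all of block 0.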
One Image Unlocks 4D Spacetime: 4DNeX Brings the Dynamic World to Life
机器之心· 2025-08-18 03:22
Core Viewpoint
- The article introduces 4DNeX, a groundbreaking framework developed by Nanyang Technological University's S-Lab and the Shanghai Artificial Intelligence Laboratory that generates 4D dynamic scenes from a single input image, marking a significant advance in AI world modeling [2][3]

Group 1: Research Background
- World models are gaining traction in AI research; Google DeepMind's Genie 3 can generate interactive videos from high-quality game data but lacks validation in real-world scenarios [5]
- A pivotal milestone for world models is accurately depicting dynamic 3D environments that obey physical laws, enabling realistic content generation and supporting "counterfactual" reasoning [5][6]

Group 2: 4DNeX-10M Dataset
- The 4DNeX-10M dataset contains nearly 10 million frames of 4D-annotated video covering diverse themes such as indoor and outdoor environments, natural landscapes, and human motion, with a focus on "human-centered" 4D data [10]
- The dataset is built with a fully automated labeling pipeline, including sourcing from public video libraries and quality-control measures to ensure high fidelity [12][14]

Group 3: 4DNeX Method Architecture
- 4DNeX proposes a unified 6D representation capturing both appearance (RGB) and geometry (XYZ), enabling simultaneous multi-modal generation without explicit camera control [16]
- The framework employs a key strategy called "width fusion," which minimizes cross-modal distance by directly concatenating RGB and XYZ data, outperforming other fusion schemes [18][20]

Group 4: Experimental Results
- Experiments show 4DNeX achieves significant gains in both efficiency and quality, with a dynamic range of 100% and temporal consistency of 96.8%, surpassing existing methods such as Free4D [23]
- User studies indicate that 85% of participants preferred 4DNeX's generated results, particularly noting its advantages in motion range and realism [23][25]
- Ablation studies confirmed the critical role of the width-fusion strategy in multi-modal integration, eliminating the noise and alignment issues seen in other approaches [28]
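The width-fusion strategy, which places RGB and XYZ maps side by side so a single backbone sees both modalities, amounts to a concatenation along the width axis. A sketch with assumed [T, H, W, 3] shapes:

```python
import numpy as np

def width_fuse(rgb: np.ndarray, xyz: np.ndarray) -> np.ndarray:
    """Fuse appearance (RGB) and geometry (XYZ) frames into one tensor
    by concatenating along width: two [T, H, W, 3] maps -> [T, H, 2W, 3].

    Keeping the two maps spatially adjacent keeps the cross-modal
    distance small inside the backbone's attention windows, which is
    the intuition behind width fusion.
    """
    assert rgb.shape == xyz.shape, "RGB and XYZ maps must be aligned"
    return np.concatenate([rgb, xyz], axis=2)
```

The alternative fusions the paper compares against would correspond to concatenating along other axes (channels, time, batch) instead.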
SEAgent: A New Era of GUI Agents That Self-Evolve from Hands-On Experience
机器之心· 2025-08-17 04:28
Core Viewpoint
- The development of current computer-using agents (CUAs) is heavily reliant on expensive human-annotated data, which limits their application in novel or specialized software environments. To overcome this limitation, researchers from Shanghai Jiao Tong University and The Chinese University of Hong Kong proposed SEAgent, a new framework that allows agents to learn and evolve autonomously through interaction with their environment, without human intervention [2][4]

Group 1: SEAgent Framework
- SEAgent's core innovations are a closed-loop autonomous-evolution framework, a deeply optimized evaluation model, and an efficient "specialist-to-generalist" integration strategy [2][5]
- SEAgent's capacity for autonomous evolution comes from three core components working in concert, forming a sustainable, self-driven learning loop [5]

Group 2: Core Components
- The Curriculum Generator acts as a "mentor," automatically generating progressively harder exploration tasks matched to the agent's current capability, and maintaining a "software guide" documenting new functionality discovered during exploration [9]
- The Actor, the CUA itself, executes the tasks generated by the Curriculum Generator in the software environment [9]
- The World State Model serves as the "judge," evaluating the agent's performance at each step and providing the feedback signal for learning, closing the evolution loop [9][10]

Group 3: Evaluation Model
- A precise "judge" is fundamental to autonomous evolution. Existing open-source large vision-language models struggle to evaluate long sequences of agent operations, with accuracy degrading as historical input grows; to address this, a more robust evaluation model, the World State Model, was developed [10]
- The optimized World State Model significantly narrows the performance gap with commercial models such as GPT-4o, giving the SEAgent framework reliable and stable evaluation [10]

Group 4: Specialist-to-Generalist Strategy
- The research explores building a "generalist" model that operates across multiple software environments, finding that training a generalist directly in multi-software settings is less effective than training specialist models in single software environments [13]
- A three-step "specialist-to-generalist" integration strategy is proposed: innovating the evaluation paradigm, distilling high-quality data, and cultivating specialists before transitioning to a generalist model [14][15]

Group 5: Experimental Results
- The final "generalist" agent achieved an overall success rate of 34.5%, surpassing directly trained generalist models (30.6%) and the combined performance of all specialist models (32.2%), demonstrating the potential of the "specialist first, then generalist" approach [18]
- Rigorous ablation experiments confirm the necessity of the algorithm design, showing that a high-quality World State Model is essential for effective learning and that exploration-based reinforcement learning (GRPO) significantly outperforms pure imitation [20]

Group 6: Author and Research Interests
- The first author is Sun Zeyi, a joint doctoral student of Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory, with publications in CVPR, ICCV, and NeurIPS; his research interests include GUI agents, multimodal learning, and reinforcement learning [20]
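The closed loop formed by the three components can be sketched in a few lines; all three callables below are hypothetical stand-ins, not SEAgent's actual interfaces:

```python
def self_evolve(curriculum, actor, judge, rounds: int = 3):
    """Sketch of SEAgent's self-driven learning loop:
      1. the Curriculum Generator proposes a task at the agent's level,
      2. the Actor executes it in the software environment,
      3. the World State Model judges the trajectory, and the reward
         signal feeds the next curriculum round, closing the loop.
    """
    history = []
    for _ in range(rounds):
        task = curriculum(history)        # progressively harder tasks
        trajectory = actor(task)          # agent acts in the environment
        reward = judge(task, trajectory)  # step-wise evaluation signal
        history.append((task, reward))    # informs the next curriculum
    return history
```

In the real system the reward would drive a policy update (e.g. GRPO) on the Actor; here the loop only records the signal to keep the sketch self-contained.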
The Hierarchical Reasoning Model That Drew 4 Million Views: Does the "Hierarchical Architecture" Not Actually Work? Is the Performance Gain Explained by Something Else?
机器之心· 2025-08-17 04:28
Core Insights
- The article discusses the Hierarchical Reasoning Model (HRM), which has gained significant attention since its release in June, achieving a score of 41% on the ARC-AGI-1 benchmark with a relatively small model of 27 million parameters [3][4][5]

Group 1: HRM Performance and Analysis
- HRM's performance on the ARC-AGI benchmark is impressive given its model size, with a score of 32% on the semi-private dataset indicating minimal overfitting [29]
- The analysis revealed that the hierarchical architecture's impact on performance is minimal compared to the significant boost from the less-emphasized "outer loop" optimization process used during training [5][41]
- Cross-task transfer learning benefits were found to be limited, with most performance derived from memorizing solutions to the specific tasks used during evaluation [6][52]

Group 2: Key Components of HRM
- Pre-training task augmentation is crucial, with only 300 augmentations needed to reach near-maximum performance, contrary to the 1,000 augmentations reported in the original paper [7][56]
- The HRM architecture combines slow planning (H-level) and fast execution (L-level), but the performance gains are not solely attributable to this structure [35][40]
- The outer-loop optimization process significantly enhances performance, with a notable increase in accuracy observed under iterative optimization during training [41][46]

Group 3: Future Directions and Community Engagement
- The article encourages further exploration of various aspects of HRM, including the impact of puzzle_id embeddings on model performance and the potential for generalization beyond training data [62][63]
- The analysis emphasizes the importance of community-driven evaluations of research, suggesting that such detailed scrutiny can lead to more efficient knowledge acquisition [65][66]
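The "outer loop" that the analysis credits with most of the gain is an iterative refinement scheme: the model's prediction is fed back as input and refined repeatedly, independent of the H-level/L-level split. A sketch where `step_fn` stands in for one full forward pass (an assumed interface, not HRM's actual code):

```python
def outer_loop_refine(step_fn, x0, iters: int = 8):
    """Sketch of outer-loop iterative refinement: repeatedly apply the
    model to its own previous prediction for a fixed number of steps.

    Each pass sees the current best guess and emits an improved one;
    the claim under test in the article is that this loop, not the
    hierarchical H/L structure, drives most of HRM's accuracy.
    """
    x = x0
    for _ in range(iters):
        x = step_fn(x)  # one full forward pass refining the prediction
    return x
```

As a toy illustration, a contractive `step_fn` converges to its fixed point under this loop, which is the behavior the refinement scheme relies on.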
CoRL 2025 | Latent-Space Diffusion World Model LaDi-WM Substantially Boosts the Success Rate and Cross-Scene Generalization of Robot Manipulation Policies
机器之心· 2025-08-17 04:28
Core Viewpoint
- The article discusses the introduction of LaDi-WM (Latent Diffusion-based World Models), a novel world model that uses latent-space diffusion to enhance robot manipulation performance through predictive strategies [2][28]

Group 1: Innovation Points
- LaDi-WM employs a latent-space representation built from pre-trained vision foundation models, integrating geometric features (from DINOv2) and semantic features (from Siglip), which improves generalization in robotic manipulation [5][10]
- The framework includes a diffusion policy that iteratively optimizes output actions by incorporating predicted states from the world model, yielding more consistent and accurate actions [6][12]

Group 2: Framework Structure
- The framework consists of two main phases: world model learning and policy learning [9]
- World Model Learning: extracts geometric and semantic representations from observation images and runs a diffusion process that lets the two representations interact, improving dynamic prediction accuracy [10]
- Policy Model Training and Iterative Optimization: uses the world model's future predictions to guide policy learning, allowing multiple rounds of action optimization that reduce the entropy of the output distribution and improve action prediction accuracy [12][18]

Group 3: Experimental Results
- In extensive experiments on virtual benchmarks (LIBERO-LONG, CALVIN D-D), LaDi-WM significantly increased success rates for robotic tasks, achieving a 27.9% improvement on LIBERO-LONG and reaching a 68.7% success rate with minimal training data [15][16]
- The framework's scalability was validated: increasing training data and model parameters consistently improved success rates in robotic manipulation [18][20]

Group 4: Real-World Application
- The framework was also tested in real-world scenarios, including stacking bowls and opening drawers, where LaDi-WM improved the success rate of the original imitation-learning policies by 20% [24][25]
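The iterative policy optimization described above (the world model imagines the future state an action would produce, and the policy refines the action against that prediction) can be sketched as follows; both callables are hypothetical stand-ins for the trained networks:

```python
def iterative_policy(policy, world_model, obs, iters: int = 3):
    """Sketch of a world-model-guided policy refinement loop:
      - the policy proposes an action from the observation,
      - the world model predicts the future latent state that action
        would lead to,
      - the prediction is fed back so the next proposal is better
        informed, repeated for a fixed number of iterations.
    """
    action = policy(obs, pred=None)          # initial proposal
    for _ in range(iters):
        pred = world_model(obs, action)      # imagined future state
        action = policy(obs, pred=pred)      # refine using the prediction
    return action
```

Each iteration narrows the action distribution, which matches the article's note that repeated optimization reduces output-distribution entropy.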
How Much Longer Can LLM + Tool Use Last? Where Is the Next Generation of AI Agents Headed in Its Exploration of Self-Evolving Techniques?
机器之心· 2025-08-17 01:30
Group 1
- The article discusses the increasing demand for self-evolving capabilities in AI agents, highlighting the limitations of static models in adapting to new tasks and dynamic environments [6][8][10]
- It emphasizes the need for a systematic theoretical framework to guide the exploration of self-evolving agents, with contributions from multiple research institutions [8][10]
- The article outlines three key dimensions for analyzing and designing self-evolving agents: what to evolve, when to evolve, and how to evolve, each addressing a different aspect of the evolution process [9][10][11]

Group 2
- The article raises questions about the ability of AI application companies to replicate or surpass the commercial successes of the mobile internet era, focusing on new monetization models [2][3]
- It explores the differences in user ecosystems and commercial boundaries between the AI era and the mobile internet era, questioning the necessity of multiple apps as AI becomes a platform capability [2][3]
- The article discusses the differing attitudes of Chinese and American internet giants toward AI investment and how this may shape future competitiveness [2][3]

Group 3
- The article presents insights from Dario Amodei on the profitability of large models despite significant accounting losses, suggesting that each generation of large models can be viewed as an independent startup [3]
- It discusses the natural drive toward greater funding, computing power, and data investment that accompanies advances in large-model capability [3]
- The article highlights the implications of the Scaling Law for AI enterprise growth and the potential consequences if it were to fail [3]