Huawei Unveils SWE-Lego, a Software-Engineering Code Agent That Unlocks Peak SFT Training Performance
机器之心· 2026-01-13 04:08
"软工任务要改多文件、多轮工具调用,模型怎么学透?高质量训练数据稀缺,又怕轨迹含噪声作弊?复杂 RL 训练成本高,中小团队望而却步?" 华为研究团队推出 SWE-Lego , 仅基于监督微调(SFT)的软件工程代码智能体,无需复杂 RL 流程,在 SWE-bench Verified 基准中斩获同等规模开源模型 SOTA,甚至超越部分更大规模闭源模型!项目已开源,代码、模型和 全部数据一键获取 ! SWE-Lego 具有 三大创新,包括数据、训练和测试时扩展。 1. 混合数据集构建: 3. 测试时扩展策略(TTS): 引言 在软件工程领域,Code Agent 需要处理复杂的任务:修复 bug、重构代码、理解大型代码库。这些任务要求 Code Agent 具备 长序列推理、多文件操作和工具使用 等能力。现有的训练方法通常需要复杂的训练范式,比如强化学习(RL)或者 RL 和 SFT 的迭代组合。 这些方法虽然有效,但计算成本高,训练过程复杂。能否用更简单的方法达到同样的效果? 华为的研究团队提出了 SWE-Lego,一个仅基于监督微调(SFT)的软工代码模型的解决方案 。在 SWE-bench Verifie ...
OpenAI's First Hardware: AI Earbuds Aiming for 50 Million Units This Year
机器之心· 2026-01-13 04:08
Core Viewpoint
- OpenAI is venturing into hardware with a new audio product named "Sweetpea," aimed at competing directly with Apple's AirPods and marking a significant shift in its business strategy [1][3].

Group 1: Product Details
- "Sweetpea" is designed to replace AirPods, with an initial production target of 40-50 million units in its first year; Apple's AirPods sell roughly 60-70 million units annually [3].
- The product features a distinctive pebble-like industrial design in metal and includes two capsule-like units worn behind the ear [3].
- It will use a smartphone-grade processor built on a 2nm process, likely Samsung's Exynos, enabling on-device AI processing, and will include a custom chip for Siri control [3][4].

Group 2: Development and Strategy
- OpenAI's hardware initiative is a bold entry into the wearable-AI market, following its $6.5 billion acquisition of the hardware startup io, founded by former Apple chief designer Jony Ive [7].
- Integration of io's team is expected to be completed by July 2025, with a focus on creating a new type of computing device that takes AI interaction beyond the smartphone screen [7].

Group 3: Market Positioning and Expectations
- Concerns have been raised about high BOM (bill of materials) costs due to the premium materials and specifications, but the device is expected to offer superior functionality compared to existing products [4].
- Consumer anticipation is high, with many expecting the product to revolutionize the audio-device market [6].
A Dark Horse in the Top 10 of Large-Model Contract Wins: Zhongguancun KJ's Application-First Strategy
机器之心· 2026-01-13 02:33
Core Insights
- The article highlights a significant shift in the Chinese large-model industry, with application projects accounting for nearly 60% of the market, indicating a transition from technical competition to value validation in commercial scenarios [1][3][25].
- In 2025, large-model-related bidding projects reached 7,539, with a disclosed amount of 29.52 billion yuan, increases of 396% in project count and 356% in value over 2024 [1][3].
- The report emphasizes industry-specific knowledge and high-quality private data as the key competitive advantages in the evolving market landscape [19][20].

Market Trends
- Application projects dominated the bidding landscape at 58% of all projects, peaking at 63% in November 2025 [1][5].
- Their quarterly share rose from 44% in Q1 to 61% in Q3, stabilizing at 60.5% in Q4 [5].
- Computing projects took the largest monetary share at 52.9% but only 27% of project count, indicating a preference for procuring computing power and existing models directly, then building applications on top [5].

Industry Distribution
- The top five industries by project count were education, government, telecommunications, energy, and finance, with the government sector leading monetary share at approximately 40% [5].
- The financial sector showed a notable shift from computing investment to application deployment in the second half of 2025 [5].

Vendor Landscape
- Major players in the bidding market included general large-model vendors such as iFlytek, Baidu, Volcano Engine, and Alibaba Cloud, alongside specialized vendors like Zhongguancun KJ focused on niche markets [6][11].
- Zhongguancun KJ ranked fourth among financial-industry large-model vendors, showcasing its deep industry expertise and successful project implementations [13].

Case Studies
- Zhongguancun KJ's collaboration with China Shipbuilding Group produced a large model for the shipbuilding industry, integrating a vast knowledge base and enhancing operational efficiency [11][12].
- In finance, Zhongguancun KJ has served over 500 leading financial institutions, creating a comprehensive financial intelligent-agent matrix that embeds AI capabilities into core business processes [13][14].

Future Outlook
- The market is expected to enter a "deep water zone" in 2026, where return on investment (ROI) becomes the critical metric for evaluating AI projects [18].
- The relationship between specialized vendors and general platforms is anticipated to evolve from competition to collaboration, fostering a symbiotic ecosystem [22][23].
Just In: Liang Wenfeng Co-Authors an Open-Sourced "Memory" Module, Bringing DeepSeek V4 into Sharper Focus
机器之心· 2026-01-13 00:12
Core Insights
- DeepSeek has introduced a new research paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, focusing on enhancing large language models (LLMs) through a novel approach to memory and computation [1][2].

Group 1: Research Background and Problem Statement
- Current large language models primarily rely on Mixture of Experts (MoE) for sparsity, known as "conditional computation," but lack an inherent knowledge-retrieval mechanism, leading to inefficient simulation of retrieval behavior [2][8].
- DeepSeek proposes "conditional memory" as a complementary approach to MoE, introducing a new module called Engram to address this limitation [3][8].

Group 2: Engram Module and Its Implementation
- The Engram module has been made available on GitHub, allowing for community engagement and further development [4].
- Engram modernizes classic n-gram embeddings to achieve knowledge retrieval in O(1) time complexity, enhancing the efficiency of memory access; a toy sketch follows this summary [8][10].
- The module separates static knowledge storage from the dynamic computation process, refining the overall Transformer architecture [12][14].

Group 3: Performance and Efficiency
- DeepSeek has scaled Engram to 27 billion parameters, demonstrating significant performance improvements over pure-MoE baselines under equivalent parameter and FLOPs budgets [10][37].
- Engram shows notable gains on knowledge-retrieval tasks, such as +3.4 on MMLU and +4.0 on CMMLU, along with enhanced general reasoning capabilities [10][37].
- The architecture allows efficient memory access without additional performance overhead, supporting prefetching from host memory at runtime [11][18].

Group 4: Sparsity Distribution and Optimal Allocation
- DeepSeek formalized a U-shaped scaling rule to characterize the optimal trade-off between neural computation (MoE) and static memory (Engram) [9][22].
- Allocating approximately 20%-25% of the sparse-parameter budget to Engram yields optimal performance, confirming the structural complementarity of the two modules [27][29].

Group 5: Experimental Results
- Four models were trained under identical conditions: Dense-4B, MoE-27B, Engram-27B, and Engram-40B [34][35].
- Sparse architectures consistently outperformed the dense model across benchmarks, with Engram-27B achieving significant improvements over MoE-27B on multiple tasks [37].
- Engram-40B further reduced pre-training loss and improved performance on most benchmarks, indicating that memory capacity has not yet saturated [38].

Group 6: Long-Context Training
- Engram's architecture shows structural advantages on long-context tasks, with significant gains in global context retention [40][41].
- Controlled experiments show Engram outperforming MoE on complex retrieval tasks, evidence of an inherent architectural advantage [45].
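The O(1)-lookup idea is easy to picture in a toy form: hash each bigram of input tokens into a fixed-size embedding table, then gate the retrieved vector into the hidden state. Everything below, from the bigram order and table size to the rolling hash and sigmoid gate, is an illustrative assumption about hashed n-gram memory in general, not Engram's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNGramMemory(nn.Module):
    """Toy conditional-memory layer: bigram hashing turns 'retrieval' into a
    single embedding lookup per position, O(1) regardless of table size."""

    def __init__(self, d_model: int, table_size: int = 100_000):
        super().__init__()
        self.table_size = table_size
        self.memory = nn.Embedding(table_size, d_model)  # static knowledge store
        self.gate = nn.Linear(d_model, d_model)          # lets the model ignore hash collisions

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); hidden: (batch, seq, d_model)
        prev = F.pad(token_ids[:, :-1], (1, 0), value=0)   # token at t-1 (0 at t=0)
        bigram_ids = (prev * 1_000_003 + token_ids) % self.table_size
        mem = self.memory(bigram_ids)                      # the O(1) lookup
        g = torch.sigmoid(self.gate(hidden))               # content-conditioned gate
        return hidden + g * mem                            # inject retrieved memory

layer = HashedNGramMemory(d_model=64)
tokens = torch.randint(0, 50_000, (2, 16))
out = layer(tokens, torch.randn(2, 16, 64))  # shape: (2, 16, 64)
```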
One Model Unifies 4D World Generation and Reconstruction: HKUST's One4D Framework Arrives
机器之心· 2026-01-13 00:12
Group 1
- The core idea of the article is the introduction of One4D, a unified framework for 4D generation and reconstruction that addresses the limitations of existing video diffusion models by enabling simultaneous output of RGB videos and geometric pointmaps [4][32].
- One4D aims to enhance the capabilities of video generation models by integrating both appearance (RGB) and geometry (Pointmap/Depth/Camera Trajectory) within a single framework, thus facilitating the transition towards a 4D world model [32][33].
- The framework employs two key innovations: Decoupled LoRA Control (DLC) for reducing cross-modal interference, and Unified Masked Conditioning (UMC) for handling various input types seamlessly; a schematic sketch of UMC follows this summary [10][17].

Group 2
- One4D supports three types of input: single image to 4D generation, sparse video frames to 4D generation and reconstruction, and complete video to 4D reconstruction [9].
- The training of One4D utilizes a large-scale dataset combining synthetic and real data to ensure both geometric accuracy and visual diversity, achieving effective results with 34,000 videos trained on 8 NVIDIA H800 GPUs over 5,500 steps [20].
- User studies indicate that One4D outperforms existing methods in consistency, dynamic quality, aesthetics, depth quality, and overall 4D coherence, with significant improvements across metrics [21][22].

Group 3
- With sparse video frames, One4D can generate the missing RGB frames and complete the geometric sequence even under extreme sparsity, showcasing its capability for dynamic 4D scene generation [30][31].
- One4D also excels in full-video 4D reconstruction, outperforming dedicated reconstruction methods on benchmark datasets such as Sintel and Bonn, indicating robust performance across tasks [25][26].
- The framework's camera-trajectory estimation is validated on datasets such as Sintel and TUM, further proving its effectiveness in unified generation and reconstruction tasks [28][29].
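As a schematic reading of UMC (an assumption about the mechanism, not the authors' code): all three input regimes reduce to the same conditioning tensor, differing only in which frames a boolean mask marks as known.

```python
import torch

def unified_masked_conditioning(frames: torch.Tensor, known: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) RGB sequence (contents ignored where unknown);
    known: (T,) bool, True where a frame is actually provided.
    Returns (T, C+1, H, W): masked frames plus a mask channel."""
    T, _, H, W = frames.shape
    mask = known.float().view(T, 1, 1, 1).expand(T, 1, H, W)
    cond = frames * mask                   # zero out the unknown frames
    return torch.cat([cond, mask], dim=1)  # mask channel tells the model what is given

T = 16
frames = torch.randn(T, 3, 64, 64)

# Single image -> 4D generation: only the first frame is known.
known_image = torch.zeros(T, dtype=torch.bool); known_image[0] = True

# Sparse frames -> 4D: every fourth frame is known. Full video: all frames known.
known_sparse = torch.zeros(T, dtype=torch.bool); known_sparse[::4] = True
known_full = torch.ones(T, dtype=torch.bool)

cond = unified_masked_conditioning(frames, known_sparse)  # same op for all three modes
```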
With Geometric Constraints, VLMs Cross the Cognitive Gap in Spatial Reasoning
机器之心· 2026-01-12 06:35
Core Insights
- The article discusses the "Semantic-to-Geometric Gap" in existing visual language models (VLMs), which struggle with precise spatial reasoning tasks, leading to incorrect answers to spatial queries [2][6].

Group 1: Problem Identification
- The "Semantic-to-Geometric Gap" arises because VLMs compress rich pixel information into abstract semantic features, losing the high-fidelity geometric detail necessary for accurate spatial reasoning [7].
- VLMs lack the ability to form precise geometric imaginations, which hampers their performance in complex spatial-reasoning scenarios [7].

Group 2: Proposed Solution
- A research team from Beihang University and Shanghai AI Lab introduced the Geometrically-Constrained Agent (GCA), which employs a new paradigm of "formalizing constraints before deterministic computation" to enhance spatial reasoning capabilities [4].
- GCA does not rely on massive data fine-tuning; instead, it uses formal task constraints to shift VLMs from "fuzzy intuition" to "precise solving," creating a verifiable geometric bridge for spatial reasoning [4].

Group 3: Performance Improvement
- GCA improved model performance by nearly 50% on the challenging MMSI-Bench, establishing a new state of the art (SOTA) for spatial reasoning [4][14].
- GCA reaches an average accuracy of 65.1%, surpassing existing training-based and tool-integrated methods, particularly on complex spatial-reasoning tasks [15].

Group 4: Generalizability and Versatility
- GCA is a training-free, general reasoning paradigm that can empower various foundation models, achieving an average relative performance improvement of about 37% on MMSI-Bench [16].
- With GCA integrated, the Gemini-2.5-Pro model's accuracy rose from 36.9% to 55.0% [16].

Group 5: Methodology
- GCA works in two stages: formalizing tasks from "fuzzy instructions" into "precise rules," then performing deterministic geometric calculations within the established constraints, as sketched after this summary [9][12].
- The framework includes intelligent tool scheduling and binding, ensuring seamless integration of perception and computation tools for reliable spatial reasoning [20].

Group 6: Conclusion and Implications
- GCA represents a new paradigm of "language-defined constraints, geometrically executed," transforming vague spatial queries into constrained mathematical problems, enhancing reasoning accuracy, and moving machines closer to possessing "geometric intuition" [24].
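The two-stage pattern can be pictured in code: stage one formalizes a fuzzy question into named geometric quantities, and stage two answers it by exact computation rather than visual intuition. The interface below is an illustrative assumption, not the paper's actual tool schema.

```python
from dataclasses import dataclass
import math

@dataclass
class NearestQuery:
    """Stage 1 output: 'which chair is nearer the door?' formalized into named
    3D points, hypothetically recovered by the agent's perception tools."""
    reference: tuple[float, float, float]
    candidates: dict[str, tuple[float, float, float]]

def solve(query: NearestQuery) -> str:
    """Stage 2: deterministic geometric computation within the constraints."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(query.candidates,
               key=lambda name: dist(query.candidates[name], query.reference))

q = NearestQuery(
    reference=(0.0, 0.0, 0.0),  # the door
    candidates={"chair_a": (1.2, 0.0, 3.1), "chair_b": (0.8, 0.0, 1.9)},
)
print(solve(q))  # chair_b: exact arithmetic replaces fuzzy visual intuition
```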
A Change of Tune: Fresh from Bashing AI, the Father of Linux Ships His First Vibe Coding Project
机器之心· 2026-01-12 06:35
Core Viewpoint
- Linus Torvalds has embraced "Vibe Coding" with the launch of his new project "AudioNoise," which utilizes AI technology for audio processing, marking a significant shift in his programming approach [3][10][30].

Group 1: Project Overview
- The "AudioNoise" project, released on GitHub, gained 1.4k stars within five days, showcasing its popularity [10][11].
- The project is related to guitar effects and aims to simulate audio effects using AI, specifically through a Python visualization tool [6][12].
- Torvalds' previous project, "GuitarPedal," served as a foundation for "AudioNoise," focusing on learning about analog circuits and audio processing [12][14].

Group 2: Programming Approach
- Torvalds initially used traditional programming methods but later adopted a more streamlined approach using Google Antigravity, which he refers to as "Vibe Coding" [8][17].
- He expressed satisfaction with the results of using AI tools, noting that the outcomes were better than his manual coding efforts [18][20].
- Despite his positive experience on this project, Torvalds remains cautious about AI in production environments, emphasizing the importance of understanding code logic [30][31].

Group 3: Industry Reactions
- The programming community has reacted with a mix of enthusiasm and skepticism, highlighting a shift in attitudes toward AI-generated code [22][30].
- Notable figures in the tech industry, including the creator of Antigravity, expressed admiration for Torvalds' decision to incorporate AI into his work [23][24].
- Torvalds' previous criticisms of AI-generated code have sparked discussion about the implications of AI in software development, particularly regarding quality and accountability [28][30].
In 2026, the Second Half of Large-Model Training Belongs to the "Reinforcement Learning Cloud"
机器之心· 2026-01-12 05:01
Core Insights
- The article discusses the transition in AI model development from scaling laws based on increasing parameters and training data to a focus on reinforcement learning (RL) and post-training scaling, indicating a paradigm shift in AI capabilities [1][4][10].

Group 1: Scaling Law and Model Development
- By the end of 2024, discussions in Silicon Valley and Beijing highlighted concerns that scaling laws were hitting a wall, as newer flagship models like Orion did not show the expected marginal benefits from increased parameters and data [1].
- Ilya Sutskever's remark suggested a shift from an era of scaling to one of miracles and discoveries, signaling skepticism about the sustainability of the pre-training approach [3].
- By early 2025, OpenAI's o1 model introduced reinforcement reasoning, demonstrating that test-time scaling could yield higher intelligence, while DeepSeek R1 successfully replicated this technology in an open-source manner [4][6].

Group 2: Reinforcement Learning and Infrastructure
- The focus of computational power is shifting from pre-training scaling to post-training and test-time scaling, emphasizing deep reasoning capabilities over mere parameter size [8].
- The emergence of DeepSeek R1 revealed that deep reasoning, driven by reinforcement learning, matters more for model evolution than simply increasing parameters [4][6].
- The industry is calling for new computational infrastructure to support dynamic exploration and reasoning, as existing cloud architectures struggle to meet these demands [11][12].

Group 3: Agentic RL and Its Implications
- Nine Chapters Cloud has positioned itself as a leader in defining "reinforcement learning cloud" infrastructure, which is essential for the evolving AI landscape [12][14].
- The Agentic RL platform, launched in mid-2025, is billed as the first industrial-grade reinforcement-learning cloud platform, significantly enhancing training efficiency and reducing costs [15][19].
- Agentic RL aims to evolve general models into expert models capable of complex decision-making and control, addressing real-world challenges across industries [20][22].

Group 4: Real-World Applications and Economic Impact
- The deployment of a large-scale AI center in Huangshan within 48 days exemplifies Nine Chapters Cloud's engineering capabilities and operational efficiency [41][43].
- The Huangshan model is projected to generate significant economic benefits, with an estimated increase of at least 200 million yuan in annual service-industry value [48].
- The integration of AI capabilities into urban management and tourism demonstrates the potential for AI infrastructure to drive economic growth and enhance operational efficiency [50][51].

Group 5: Future Vision and Market Position
- Nine Chapters Cloud aims to establish itself as a key player in the independent AI cloud sector, advocating an open ecosystem that does not compete with clients [54][60].
- The company emphasizes defining standards for next-generation infrastructure, moving beyond traditional cloud services toward enabling the rapid evolution of intelligent agents [63][66].
- The future of cloud computing is envisioned as an "evolution era," focused on enhancing the capabilities of intelligent agents rather than merely providing computational resources [68][69].
AAAI 2026 Oral | Kuaishou Proposes CroPS, a New "Retrieval Data Engine" That Breaks the Search Filter Bubble
机器之心· 2026-01-12 05:01
Core Insights
- The article discusses the introduction of a new retrieval data engine called CroPS (Cross-Perspective Positive Samples) by Kuaishou's search team, aimed at improving short-video search by addressing the limitations of traditional self-reinforcing training paradigms that rely heavily on historical click data [2][10].

Group 1: Problem Identification
- Current vector-retrieval models in the industry often depend on historical user-interaction data, leading to a self-reinforcing cycle that narrows search results and limits exposure to diverse content [6].
- This mechanism results in significant sample bias: high-quality long-tail content is systematically excluded from positive samples, causing the model's retrieval scope to become conservative and repetitive [6][7].
- Users experience a lack of novelty in search results, making it difficult to satisfy exploratory needs [7].

Group 2: CroPS Framework
- CroPS introduces a multi-dimensional positive-sample enhancement engine that utilizes user query behavior, recommendation-system feedback, and knowledge from large language models (LLMs) to enrich the semantic space [11].
- The framework captures user-intent continuity by analyzing query rewrites, allowing the system to correct semantic biases by incorporating successful interactions from related queries [12].
- It breaks down barriers between search and recommendation systems, enabling the retrieval model to leverage diverse content that users may not have actively searched for [15].
- CroPS employs LLMs to generate high-quality synthetic samples when existing content does not cover certain queries, effectively expanding the model's knowledge base [16][17].

Group 3: Hierarchical Labeling and Loss Function
- The Hierarchical Label Assignment (HLA) strategy addresses the reliability differences among positive samples from various sources, allowing the model to prioritize more relevant samples during training [19].
- The H-InfoNCE loss function enhances the model's ability to distinguish high-priority from low-priority samples, aligning learning objectives with HLA's hierarchical logic; a sketch follows this summary [23][28].

Group 4: Experimental Results
- Offline experiments showed that CroPS improved recall by 9.5% on user-click datasets and 7.1% on user query-change datasets over the strongest baseline [30].
- In large-scale A/B testing, CroPS led to significant business growth, with a 40.9% increase in ratio rank and a 44.3% increase in ratio show for dense models [31].
- The click-through rate (CTR) increased by 0.869%, and the long playback rate (LPR) rose by 0.483%, indicating improved content relevance and quality [36].

Group 5: Future Directions
- The Kuaishou search team plans to explore integrating CroPS with generative retrieval methods to further leverage the potential of large language models in the search process [34].
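As a hedged sketch of what a hierarchy-aware InfoNCE can look like: each positive carries a priority level, and every level is contrasted against everything ranked below it, so a click-positive must outscore a cross-perspective positive, which in turn must outscore the negatives. The level semantics, temperature, and loss composition are illustrative assumptions, not Kuaishou's implementation.

```python
import torch
import torch.nn.functional as F

def h_infonce(query: torch.Tensor, docs: torch.Tensor, levels: torch.Tensor,
              temperature: float = 0.05) -> torch.Tensor:
    """query: (d,); docs: (N, d); levels: (N,) ints where 0 marks a negative
    and larger values mark more trusted positive sources."""
    sims = F.cosine_similarity(query.unsqueeze(0), docs) / temperature  # (N,)
    loss = sims.new_zeros(())
    for lv in levels.unique():
        if lv == 0:
            continue                      # negatives are never anchors
        pos = sims[levels == lv]          # positives at this priority level
        below = sims[levels < lv]         # lower-priority positives + negatives
        # Each level-lv positive must outscore everything ranked below it.
        logits = torch.cat([pos.unsqueeze(1),
                            below.unsqueeze(0).expand(len(pos), -1)], dim=1)
        loss = loss + F.cross_entropy(logits, torch.zeros(len(pos), dtype=torch.long))
    return loss

q = F.normalize(torch.randn(64), dim=0)
d = F.normalize(torch.randn(10, 64), dim=1)
# 2 = clicked result, 1 = cross-perspective positive (e.g. query rewrite), 0 = negative
lv = torch.tensor([2, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(h_infonce(q, d, lv))
```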
Top AI Models Lose to Three-Year-Olds: The BabyVision Benchmark Exposes Hard Flaws in Multimodal Models
机器之心· 2026-01-12 05:01
Core Viewpoint
- The article discusses the limitations of current large models in visual understanding, emphasizing that while they excel in language and text reasoning, their visual capabilities remain underdeveloped, akin to those of a three-year-old child [3][4][49].

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and various research teams, has launched a new multimodal understanding evaluation set called BabyVision to assess the visual capabilities of AI models [3][4].
- BabyVision aims to create a new paradigm for AI training, evaluation, and application in real-world scenarios, focusing on generating measurable and iterable visual capabilities [4][49].

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks were given to children of different ages (3, 6, 10, and 12 years) and to top multimodal models [7].
- The evaluation strictly controls language dependency, requiring answers to be derived solely from visual information [8].

Group 3: Results and Findings
- Most models score significantly below the average performance of three-year-old children; the best model, Gemini3-Pro-Preview, reaches only 49.7%, still 20 percentage points below the performance of six-year-olds [15][21].
- Human participants scored an impressive 94.1% accuracy on the BabyVision-Full test, highlighting the substantial gap between human and model performance [20][21].

Group 4: Challenges Identified
- The study identifies four core challenges in visual reasoning for AI models: observing non-verbal details, maintaining visual tracking, lacking spatial imagination, and difficulty with visual pattern induction [27][30][36][39].
- These challenges indicate a systemic lack of foundational visual capability in current models, rather than isolated deficiencies [23].

Group 5: Future Directions
- The article suggests that transitioning visual-reasoning tasks to visual operations, as demonstrated in BabyVision-Gen, may help bridge the gap in visual understanding [42].
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking visual understanding into 22 measurable atomic capabilities [49].