机器之心
AAAI 2026 | AP2O-Coder gives large models an "error notebook" to drill problems by type, the way humans do
机器之心· 2026-01-14 05:37
Core Insights
- The article discusses the development of the Adaptive Progressive Preference Optimization (AP2O) method and its framework, AP2O-Coder, aimed at improving code generation and error correction in large language models (LLMs) [3][5][6].

Group 1: Existing Challenges and AP2O-Coder Design
- Current offline preference optimization methods face three main challenges: lack of error-type awareness, insufficient training focus, and weak dynamic adaptation [5][12].
- AP2O-Coder addresses these challenges with a systematic learning process modeled on human error-correction strategies, combining error analysis with targeted optimization [6][8].

Group 2: AP2O-Coder Framework and Mechanism
- The AP2O-Coder framework consists of four key steps: code generation evaluation, error diagnosis analysis, progressive preference optimization, and adaptive error replay (see the sketch after this summary) [10][11][14].
- Code generation evaluation establishes an initial training dataset by generating candidate answers for programming tasks and labeling them as pass or fail [10].
- Error diagnosis analysis uses programming-language-specific tools to identify and categorize errors, creating a structured "error book" for targeted optimization [11].
- Progressive preference optimization corrects errors in a structured order, prioritizing error types according to model size [13].
- Adaptive error replay regularly evaluates model performance and adjusts the training-data distribution to focus on current weaknesses [14].

Group 3: Experimental Validation and Results
- The research team conducted systematic validation on six mainstream LLMs, achieving performance improvements of 2.8% to 3.4% on the EvalPlus benchmark, even for large models [16][18].
- AP2O-Coder demonstrated a significant reduction in error occurrence rates and improved generalization across various models [22][29].
- The method also showed enhanced sample efficiency, requiring only 4% to 60% of the preference data used by traditional methods to reach optimal performance [25].

Group 4: Adaptability of General LLMs
- AP2O-Coder is effective not only for code-specific LLMs but also for adapting general LLMs to coding tasks, as evidenced by significant performance improvements in models such as Qwen3 and Llama3 [28].
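To make the "error book" and adaptive-replay steps concrete, here is a minimal sketch of the bookkeeping they imply: failed generations are bucketed by diagnosed error type, and the sampler for the next round of preference optimization is biased toward the types the latest evaluation shows are still weak. The error-type names, data shapes, and re-weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
import random

# Hypothetical error taxonomy; the paper derives categories from
# language-specific diagnostic tools.
ERROR_TYPES = ["SyntaxError", "TypeError", "AssertionError", "TimeoutError"]

def build_error_book(failed_samples):
    """Group (task, bad_completion, error_type) triples by error type."""
    book = {etype: [] for etype in ERROR_TYPES}
    for task, completion, etype in failed_samples:
        book.setdefault(etype, []).append((task, completion))
    return book

def replay_weights(eval_errors):
    """Error types still frequent in the latest evaluation get
    proportionally more replay probability."""
    counts = Counter(eval_errors)
    total = sum(counts.values()) or 1
    return {etype: counts.get(etype, 0) / total for etype in ERROR_TYPES}

def sample_replay_batch(error_book, weights, batch_size):
    """Draw preference-training examples biased toward current weaknesses."""
    population, probs = [], []
    for etype, items in error_book.items():
        for item in items:
            population.append(item)
            probs.append(weights.get(etype, 0.0))
    if not any(probs):  # everything passed: fall back to uniform replay
        probs = [1.0] * len(population)
    return random.choices(population, weights=probs,
                          k=min(batch_size, len(population)))
```

Each training round would re-run the evaluation, recompute the weights, and resample, which is the "adaptive error replay" loop described above.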
5 million views: 1X puts the "world model" to real use in its robot NEO
机器之心· 2026-01-14 01:39
Core Viewpoint
- The article discusses advancements in the home humanoid robot NEO, particularly the introduction of its new brain, the 1X World Model, which enables NEO to learn and perform tasks more autonomously by understanding the physical world through video training [3][4][11].

Group 1: Technological Advancements
- NEO has evolved from merely executing pre-programmed actions to "imagining" tasks: it generates a video of successful task completion in its mind before executing it [4][6].
- The 1X World Model (1XWM) integrates video pre-training, allowing NEO to generalize to new objects, movements, and tasks without extensive prior training data [11][21].
- The model is built on a 14-billion-parameter generative video model that underwent a multi-stage training process to adapt to NEO's physical characteristics [16][18].

Group 2: Training and Evaluation
- Training uses 900 hours of first-person human video data to align the model with human-like operational behavior, followed by fine-tuning on 70 hours of robot data [18][19].
- Evaluation shows that 1XWM can perform tasks it has never encountered before, with generated videos closely matching real-world execution [24][30].
- High-quality captions and first-person data are emphasized as key to improving video-generation quality and task success rates; detailed descriptions enhance the model's performance [39][40].

Group 3: Practical Applications
- NEO has been tested on a variety of tasks, including those requiring complex interactions and coordination, demonstrating its ability to adapt and learn from video pre-training [28][30].
- Performance on both in-distribution and out-of-distribution tasks shows a stable success rate, although some fine-manipulation tasks remain challenging [30][32].
- The quality of generated videos can be linked to task success rates, allowing video generation to be improved through iterative testing and selection, as sketched below [32][39].
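The "generate, then select" idea in the last bullet suggests a simple loop: the world model renders a candidate rollout video for each action plan, a quality score stands in for expected task success, and the best-scoring plan is executed. The sketch below assumes hypothetical `world_model.rollout`, `score_video`, and `robot.execute` interfaces; 1X's actual pipeline is not published in this article.

```python
def select_and_execute(world_model, robot, task_prompt, candidate_plans, score_video):
    """Pick the action plan whose imagined rollout video scores best, then run it."""
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:
        # "Imagine" the outcome: predict a video of the robot carrying out `plan`.
        video = world_model.rollout(task_prompt, plan)
        score = score_video(video)  # proxy for expected task success
        if score > best_score:
            best_plan, best_score = plan, score
    robot.execute(best_plan)
    return best_plan, best_score
```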
Beyond the "black box": Liu Yong's team at Renmin University of China publishes a survey on the theory and mechanisms of large language models
机器之心· 2026-01-14 01:39
Core Insights
- The article discusses the rapid growth of Large Language Models (LLMs) and the paradigm shift in artificial intelligence, highlighting the paradox of their practical success versus limited theoretical understanding [2][5][6].
- A unified lifecycle-based classification is proposed that organizes LLM theoretical research into six stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation [2][7][10].

Group 1: Lifecycle Stages
- **Data Preparation Stage**: Focuses on optimizing data utilization, quantifying how data features affect model capabilities, and analyzing data-mixing strategies, deduplication, and the relationship between memorization and model performance [11][18].
- **Model Preparation Stage**: Evaluates architectural capabilities theoretically, probing the limits of the Transformer structure and designing new architectures from an optimization perspective [11][21].
- **Training Stage**: Investigates how simple learning objectives can give rise to complex emergent capabilities, analyzing the essence of Scaling Laws and the benefits of pre-training [11][24].

Group 2: Advanced Theoretical Insights
- **Alignment Stage**: Explores the mathematical feasibility of robust alignment, analyzing the dynamics of Reinforcement Learning from Human Feedback (RLHF) and the challenges of achieving "Superalignment" [11][27].
- **Inference Stage**: Decodes how frozen-weight models simulate learning at test time, analyzing prompt engineering and in-context learning mechanisms [11][30].
- **Evaluation Stage**: Theoretically defines and measures complex human values, discussing the effectiveness of benchmark tests and the reliability of LLM-as-a-Judge [11][33].

Group 3: Challenges and Future Directions
- The article identifies frontier challenges such as the mathematical boundaries of safety guarantees, the implications of synthetic data, and the risks associated with data pollution [11][18][24].
- It emphasizes the need for a structured roadmap to move LLM research from engineering heuristics toward a rigorous scientific discipline, addressing the theoretical gaps that remain [2][35].
A vision model that both understands semantics and restores detail: NTU & SenseTime propose the Prism Hypothesis
机器之心· 2026-01-13 10:04
Core Insights
- The article introduces the Prism Hypothesis and Unified Autoencoding (UAE), which aim to harmonize semantic and pixel representations by resolving the conflict between semantic understanding and detail reconstruction [2][5][10].

Background
- Achieving both semantic understanding and detail restoration in visual foundation models is challenging: many systems are forced to combine two separate representations, which lowers training efficiency and causes interference [3][4].

Key Concepts
- The Prism Hypothesis posits that a representation of the world's information must allow both shared semantics and the retention of fine-grained details [4][5].
- Semantic encoders (e.g., DINOv2, CLIP) excel at abstract information, while pixel encoders (e.g., the SD-series VAEs) are better at reconstructing details such as textures and edges [5][10].

Methodology
- Unified Autoencoding (UAE) synthesizes both representations by structuring the learning of multi-frequency latent variables, separating the roles of semantics and details [11][13].
- The method involves four components (a sketch of the frequency-band decomposition follows this summary):
  1. **Unified Encoder**: Initialized from a semantic model to transition into a unified latent space [14].
  2. **Residual Split Flow**: Employs FFT-based frequency-band projection and iterative residual splitting to decompose latent variables into multiple frequency bands [15].
  3. **Frequency Band Modulator**: Perturbs only the high-frequency details and integrates them for the decoder [16].
  4. **Semantic-wise Loss**: Applies semantic constraints only to the lowest frequency bands, leaving the higher frequencies free to learn details [17].

Experimental Results
- UAE demonstrates superior reconstruction quality on ImageNet and MS-COCO, achieving PSNR = 33.08, SSIM = 0.94, and rFID = 0.16 on ImageNet, and PSNR = 32.84, SSIM = 0.94, and rFID = 0.17 on MS-COCO [19][20].
- Compared with the RAE baseline, UAE shows higher PSNR/SSIM and reduces rFID by over 90% [20].
- In conditional generation on ImageNet, UAE achieves gFID = 1.68 and IS = 301.6 [25].
- For semantic understanding, UAE reaches 83.0% Top-1 accuracy on ImageNet-1K, matching RAE's performance [26][27].
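To illustrate the Residual Split Flow idea, here is a minimal sketch (assuming PyTorch): a 2D FFT low-pass projection peels off the lowest frequency band of a latent, and the residual is split again at progressively higher cutoffs. The cutoff values and tensor shapes are illustrative assumptions; the paper's exact band boundaries and normalization are not reproduced here.

```python
import torch

def lowpass(x, cutoff):
    """Keep only spatial frequencies below `cutoff` (fraction of Nyquist)
    in a (B, C, H, W) latent."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    _, _, h, w = x.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=x.device),
        torch.linspace(-1, 1, w, device=x.device),
        indexing="ij",
    )
    mask = ((yy**2 + xx**2).sqrt() <= cutoff).to(x.dtype)  # radial low-pass mask
    return torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real

def residual_split(latent, cutoffs=(0.25, 0.5, 0.75)):
    """Decompose a latent into low-to-high frequency bands by iteratively
    splitting off a low band and carrying the residual forward."""
    bands, residual = [], latent
    for c in cutoffs:
        low = lowpass(residual, c)
        bands.append(low)          # a semantic loss would constrain bands[0] only
        residual = residual - low  # higher-frequency detail continues onward
    bands.append(residual)         # final band: the highest-frequency details
    return bands                   # sum(bands) reconstructs the original latent
```

Because the bands sum back to the original latent, constraining only the lowest band with a semantic loss leaves the remaining bands free to carry texture and edge detail, which is the division of labor the method describes.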
Meet us at AAAI 2026 | Shanghai AI Lab 北极星 X 星启 Meetup (registration now open)
机器之心· 2026-01-13 10:04
From January 20 to 27, 2026, the 40th AAAI Conference on Artificial Intelligence (AAAI 2026) will be held in Singapore. During the conference, the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) will host the "北极星 X 星启交流会暨云帆 AI Talent Meetup", where lab experts in related fields will meet in person with peers from around the world for in-depth exchange and discussion.

AAAI paper authors; professors and postdocs working across artificial intelligence, the natural sciences, engineering, and other interdisciplinary fields; and innovators from industry, academia, and research are all warmly invited to explore frontier technologies together. To date, the "北极星" meetup series has been held in China, the United States, Singapore, Canada, and elsewhere, connecting thousands of AI talents with global opportunities.

1 Registration
The meetup is invitation-only. Please scan the QR code below or click "Read the original" at the end of this article to submit your registration. Approved applicants will receive an invitation email. Seats are limited, so please register soon.
Deadline: January 19, 12:00 p.m.
Inquiries: luochen@pjlab.org.cn

2 Event Information
Highlights:
Time: January 22, 17:30-20:30 (Singapore time)
Venue: central Singapore
Agenda at a glance
Top academic talks: Shanghai AI Lab scientists will share innovations and deliver keynotes on frontier topics to spark research inspiration.
Direct line to the lab: with the Shanghai AI Lab team ...
No cloud, no rented GPUs: how to elegantly fine-tune Qwen-VL-30B locally
机器之心· 2026-01-13 04:08
Core Viewpoint
- The article discusses the challenges of, and solutions for, deploying a 30B-parameter multimodal AI model locally, emphasizing the need for a powerful yet compact computing solution that balances memory and processing capability [1][12][51].

Model Selection
- A 30B-parameter model is identified as the optimal choice for understanding complex data, outperforming smaller models while remaining more manageable than larger ones [2][3].
- The article highlights the deceptive nature of the "30B parameters" label, noting that high-resolution image processing significantly increases actual memory requirements [4][6].

Hardware Requirements
- Substantial memory is essential: 24GB of VRAM is insufficient for fine-tuning a 30B model without sacrificing performance [10][12].
- The Lenovo ThinkStation PGX is introduced as a compact solution with 128GB of unified memory, enabling efficient processing without the constraints of traditional setups [19][21].

Performance and Efficiency
- The ThinkStation PGX's architecture shares memory between CPU and GPU, letting developers run large models without running out of memory [25][26].
- The article details a successful fine-tuning run that reduced training loss from 4.03 to 1.06, demonstrating the system's effectiveness; a hedged sketch of such a memory-conscious setup follows this summary [34].

Advantages of the Lenovo ThinkStation PGX
- The PGX is positioned as the only desktop solution capable of comfortably running 30B multimodal models, a unique advantage in the market [38].
- Its design incorporates advanced cooling to manage high power consumption, ensuring stable performance during extended tasks [41].

Market Position and Pricing
- The ThinkStation PGX is priced at 31,999 yuan for the 1TB version and 36,999 yuan for the 4TB version, a cost-effective alternative to high-end GPUs or cloud instances [51][52].
- For developers facing memory constraints, the PGX represents a worthwhile investment, offering a seamless experience without the typical configuration headaches [52][53].
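The article does not publish its training script, but a memory-conscious local fine-tune of a model this size typically pairs bf16 weights with LoRA adapters, so only a small fraction of the 30B parameters receives gradients and optimizer state. Below is a minimal sketch using Hugging Face `transformers` and `peft`; the model identifier, target modules, and hyperparameters are illustrative assumptions, not the article's recipe.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen-VL-30B"  # hypothetical checkpoint name, for illustration only
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves weight memory versus fp32
    device_map="auto",           # lets unified memory hold what the GPU cannot
)
processor = AutoProcessor.from_pretrained(model_id)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the full model
```

With adapters this small, optimizer state adds only a few gigabytes on top of the ~60GB of bf16 weights, which is what makes a 128GB unified-memory machine plausible for the job.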
Huawei unveils SWE-Lego, a software-engineering code agent that unlocks peak SFT training performance
机器之心· 2026-01-13 04:08
"软工任务要改多文件、多轮工具调用,模型怎么学透?高质量训练数据稀缺,又怕轨迹含噪声作弊?复杂 RL 训练成本高,中小团队望而却步?" 华为研究团队推出 SWE-Lego , 仅基于监督微调(SFT)的软件工程代码智能体,无需复杂 RL 流程,在 SWE-bench Verified 基准中斩获同等规模开源模型 SOTA,甚至超越部分更大规模闭源模型!项目已开源,代码、模型和 全部数据一键获取 ! SWE-Lego 具有 三大创新,包括数据、训练和测试时扩展。 1. 混合数据集构建: 3. 测试时扩展策略(TTS): 引言 在软件工程领域,Code Agent 需要处理复杂的任务:修复 bug、重构代码、理解大型代码库。这些任务要求 Code Agent 具备 长序列推理、多文件操作和工具使用 等能力。现有的训练方法通常需要复杂的训练范式,比如强化学习(RL)或者 RL 和 SFT 的迭代组合。 这些方法虽然有效,但计算成本高,训练过程复杂。能否用更简单的方法达到同样的效果? 华为的研究团队提出了 SWE-Lego,一个仅基于监督微调(SFT)的软工代码模型的解决方案 。在 SWE-bench Verifie ...
OpenAI's first hardware: AI earbuds targeting 50 million units sold this year
机器之心· 2026-01-13 04:08
Core Viewpoint
- OpenAI is venturing into hardware with a new audio product named "Sweetpea," aimed squarely at Apple's AirPods, signaling a significant shift in its business strategy [1][3].

Group 1: Product Details
- "Sweetpea" is designed to replace AirPods, with an initial production target of 40-50 million units in its first year; Apple's AirPods sell roughly 60-70 million units annually [3].
- The product features a distinctive pebble-like industrial design made from metal, with two capsule-like units worn behind the ear [3].
- It will use a smartphone-grade processor on a 2nm process, likely Samsung's Exynos, enabling on-device AI processing, and will include a custom chip for Siri control [3][4].

Group 2: Development and Strategy
- OpenAI's hardware push is a bold attempt to enter the wearable-AI market, following its $6.5 billion acquisition of the hardware startup io, founded by former Apple chief designer Jony Ive [7].
- Integration of io's team was expected to be completed by July 2025, with a focus on creating a new type of computing device that extends AI interaction beyond the smartphone screen [7].

Group 3: Market Positioning and Expectations
- Concerns have been raised about high BOM (Bill of Materials) costs due to the premium materials and specifications, but the device is expected to offer functionality superior to existing products [4].
- The product has generated excitement among consumers, many of whom expect it to revolutionize the audio-device market [6].
The dark horse among the top 10 large-model bid winners: how Zhongguancun KJ (中关村科金) tackles applications
机器之心· 2026-01-13 02:33
Core Insights
- The article highlights a significant shift in China's large-model industry: application projects now account for nearly 60% of the market, marking a transition from technical competition to value validation in commercial scenarios [1][3][25].
- In 2025, large-model-related bidding projects reached 7,539, with a disclosed value of 29.52 billion yuan, increases of 396% and 356% over 2024 [1][3].
- The report emphasizes industry-specific knowledge and high-quality private data as the key competitive advantages in the evolving market [19][20].

Market Trends
- Application projects dominated the bidding landscape at 58% of all projects, peaking at 63% in November 2025 [1][5].
- Their share rose quarter by quarter, from 44% in Q1 to 61% in Q3, stabilizing at 60.5% in Q4 [5].
- Computing projects took the largest monetary share at 52.9% but only 27% of project count, indicating a preference for procuring compute and existing models directly while building applications on top [5].

Industry Distribution
- The top five industries by project count were education, government, telecommunications, energy, and finance, with the government sector leading in monetary share at roughly 40% [5].
- The financial sector showed a notable shift from computing investment to application deployment in the second half of 2025 [5].

Vendor Landscape
- Major bidders included general large-model vendors such as iFlytek, Baidu, Volcano Engine, and Alibaba Cloud, alongside specialized vendors like Zhongguancun KJ that focus on niche markets [6][11].
- Zhongguancun KJ ranked fourth among large-model vendors in the financial industry, reflecting deep domain expertise and successful project delivery [13].

Case Studies
- Zhongguancun KJ's collaboration with China Shipbuilding Group produced a large model for the shipbuilding industry, integrating a vast knowledge base and improving operational efficiency [11][12].
- In finance, Zhongguancun KJ has served more than 500 leading financial institutions, building a comprehensive financial intelligent-agent matrix that embeds AI capabilities into core business processes [13][14].

Future Outlook
- The market is expected to enter a "deep-water zone" in 2026, where return on investment (ROI) becomes a critical metric for evaluating AI projects [18].
- The relationship between specialized vendors and general platforms is expected to evolve from competition toward collaboration, forming a symbiotic ecosystem [22][23].
Breaking: Liang Wenfeng's name is on a newly open-sourced "memory" module, revealing more details about DeepSeek V4
机器之心· 2026-01-13 00:12
Core Insights
- DeepSeek, in collaboration with Peking University, has released a new research paper, "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," focused on enhancing large language models (LLMs) through a novel approach to memory and computation [1][2].

Group 1: Research Background and Problem Statement
- Current large language models obtain sparsity primarily through Mixture of Experts (MoE), known as "conditional computation," but they lack an inherent knowledge-retrieval mechanism and must simulate retrieval behavior inefficiently [2][8].
- DeepSeek proposes "conditional memory" as a complement to MoE, introducing a new module called Engram to address this limitation [3][8].

Group 2: The Engram Module and Its Implementation
- The Engram module is available on GitHub, enabling community engagement and further development [4].
- Engram modernizes classic n-gram embeddings to achieve knowledge retrieval in O(1) time, making memory access efficient (see the sketch after this summary) [8][10].
- The module separates static knowledge storage from dynamic computation, refining the overall Transformer architecture [12][14].

Group 3: Performance and Efficiency
- DeepSeek scaled Engram to 27 billion parameters, showing significant gains over a pure-MoE baseline under matched parameter and FLOPs budgets [10][37].
- Engram delivers notable improvements on knowledge-retrieval tasks, such as +3.4 on MMLU and +4.0 on CMMLU, along with stronger general reasoning [10][37].
- The architecture permits efficient memory access without additional performance overhead, supporting prefetching from host memory at runtime [11][18].

Group 4: Sparsity Distribution and Optimal Allocation
- DeepSeek formalizes a U-shaped expansion rule characterizing the optimal trade-off between neural computation (MoE) and static memory (Engram) [9][22].
- The research indicates that allocating roughly 20%-25% of the sparse-parameter budget to Engram yields optimal performance, confirming the structural complementarity of the two modules [27][29].

Group 5: Experimental Results
- Four models were trained under identical conditions: Dense-4B, MoE-27B, Engram-27B, and Engram-40B [34][35].
- The sparse architectures consistently outperformed the dense model across benchmarks, with Engram-27B achieving significant improvements over MoE-27B on multiple tasks [37].
- Engram-40B further reduced pre-training loss and improved performance on most benchmarks, indicating that memory capacity has not yet saturated [38].

Group 6: Long-Context Training
- Engram's architecture shows structural advantages on long-context tasks, with significant gains in global context retention [40][41].
- Controlled experiments show Engram outperforming MoE on complex retrieval tasks, underscoring its inherent architectural advantage [45].
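To make the "O(1) lookup" idea concrete, here is a minimal sketch (assuming PyTorch) of a hashed n-gram memory: each position's trailing n-gram of token IDs is hashed into a large static embedding table, and the retrieved vector is fused with the token's hidden state alongside the Transformer's usual computation. The hashing scheme, table size, and fusion rule are illustrative assumptions, not DeepSeek's Engram implementation.

```python
import torch
import torch.nn as nn

class NGramMemory(nn.Module):
    """Toy 'conditional memory': hashed n-gram lookup into a static table."""

    def __init__(self, table_size=1_000_000, dim=1024, n=3):
        super().__init__()
        self.n = n
        self.table_size = table_size
        self.table = nn.Embedding(table_size, dim)  # static knowledge store

    def ngram_ids(self, token_ids):
        """Hash each position's trailing n-gram into a table slot, O(1) per token."""
        ids = torch.zeros_like(token_ids)
        for k in range(self.n):
            shifted = torch.roll(token_ids, shifts=k, dims=1)
            shifted[:, :k] = 0                 # pad positions before sequence start
            ids = ids * 1000003 + shifted      # rolling polynomial hash
        return ids % self.table_size

    def forward(self, token_ids, hidden):
        """Retrieve n-gram memory vectors and fuse them with hidden states."""
        mem = self.table(self.ngram_ids(token_ids))  # (B, T, dim) lookup
        return hidden + mem  # a real design would gate/project before fusing
```

Because the lookup depends only on the input token IDs, not on intermediate activations, the table can live in host memory and be fetched ahead of the forward pass, consistent with the prefetching property described above.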