机器之心
What does the next-generation AI teacher look like? 学而思 (Xueersi) takes it from an L2 "assistant" to an L3 "teacher"
机器之心· 2025-09-28 00:32
机器之心 report. Editor: +0.

Autonomous driving has an L1-L5 grading scheme; now education AI has its own version.

For a long time, however, this kind of high-frequency interaction and personalized guidance was a "luxury" that only a small number of students could enjoy.

The arrival of AI is changing all of this. An AI study companion not only responds around the clock; it also creates a space where students need not fear being judged, so they can experiment boldly and ask follow-up questions on their own initiative. More importantly, it can scale up heuristic interaction and personalized feedback, making "teaching each student according to their aptitude" genuinely possible.

Global tech giants have clearly set their sights here: from OpenAI to Google, their AI application interfaces now all ship a learning section.

Today the "second half of AI" is a consensus, and real-world deployment is becoming the key to what comes next. Education, a foundational cornerstone of human development, has become a frontier for AI integration and innovation.

Many people have probably had this experience: in class, a question hovers on the tip of your tongue, but for fear of sounding "too stupid" you stay silent; or before you have understood the previous material, the teacher has already jumped to the next knowledge point.

(Image caption: the ChatGPT learning section.)

This is a long-standing frustration in education: in large lecture-style classes, an individual's line of thinking is often drowned out by a uniform teaching pace. Teachers want to attend to every student's confusion, but their hands are tied. As the constructivism proposed by the Swiss psychologist Jean Piaget pointed out long ago: knowledge ...
Letting large models synthesize checkers: UIUC team digs up more than 90 long-hidden vulnerabilities in the Linux kernel
机器之心· 2025-09-28 00:32
Core Insights
- The paper introduces KNighter, a system that transforms static analysis by synthesizing checkers using large language models (LLMs), successfully identifying 92 long-standing vulnerabilities in the Linux kernel [3][11][16]
- KNighter utilizes historical patch data to distill defect patterns and repair intentions, allowing the model to generate structured, maintainable, and compilable static analysis checkers [11][21]

Background and Pain Points
- Traditional static analysis tools require manual rule creation, which is time-consuming and difficult to maintain, often covering only limited predefined patterns [7]
- Directly scanning large codebases with LLMs poses challenges due to context limitations and high computational costs [7]

Methodology
- KNighter's approach involves breaking down the task of creating a static analysis checker into manageable steps, allowing the model to analyze defect patterns and program states before generating the checker framework [11]
- The synthesized checkers can be integrated into continuous integration (CI) pipelines for long-term use and iterative upgrades as new patches are introduced [12][20]

Experimental Results
- The research team validated KNighter's effectiveness on the Linux kernel, where the synthesized checkers identified 92 vulnerabilities, with 77 confirmed by maintainers and 57 fixed, including 30 that received CVE identifiers [16]
- This method is more cost-effective and stable compared to direct LLM code scanning, as the generated checkers can be reused and provide precise alerts with clear state transitions [16]

Practical Recommendations
- The synthesized checkers can be integrated into version control systems and CI processes, facilitating code review and evolution [19]
- Organizations can trigger KNighter's pattern mining and checker generation automatically with each patch merge, gradually building a comprehensive rule library [20]
- Starting with high-risk scenarios, such as resource management and error propagation, can help in generating initial seed checkers before expanding to other subsystems [20]
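The workflow summarized above (distill a defect pattern from a historical patch, synthesize a checker, then keep only checkers that behave correctly on the patch itself) can be pictured as a small orchestration loop. The sketch below is a hypothetical Python illustration, not KNighter's actual code: the prompts, the `llm_complete` placeholder, and the build/run helpers are assumptions based on the summary.

```python
# Hypothetical sketch of a KNighter-style checker-synthesis loop.
# The prompts, llm_complete, and checker_flags are illustrative placeholders;
# the real system targets Clang Static Analyzer checkers and the Linux
# kernel build, which are omitted here.
from dataclasses import dataclass


@dataclass
class Patch:
    description: str   # commit message explaining the fix
    pre_code: str      # buggy code before the patch
    post_code: str     # fixed code after the patch


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError


def distill_pattern(patch: Patch) -> str:
    # Step 1: have the model summarize the defect pattern and the program
    # states involved before any checker code is written.
    prompt = (
        "Summarize the defect pattern fixed by this patch and the program "
        f"states involved:\n{patch.description}\n--- before ---\n"
        f"{patch.pre_code}\n--- after ---\n{patch.post_code}"
    )
    return llm_complete(prompt)


def synthesize_checker(pattern: str) -> str:
    # Step 2: have the model emit a structured, compilable checker
    # for that pattern.
    return llm_complete(f"Write a compilable static-analysis checker for:\n{pattern}")


def checker_flags(checker_src: str, code: str) -> bool:
    """Placeholder: compile the checker and run it over a code snippet."""
    raise NotImplementedError


def validate(checker_src: str, patch: Patch) -> bool:
    # Step 3: keep a checker only if it flags the pre-patch code and stays
    # silent on the patched code: a basic sanity filter before CI use.
    return checker_flags(checker_src, patch.pre_code) and not checker_flags(
        checker_src, patch.post_code
    )
```

Checkers that pass this filter would then be the ones worth wiring into a CI pipeline and re-running as new patches land, matching the iterative-upgrade loop the summary describes.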
The era of specification alignment: GPT-5 leads by a wide margin, drawing clearer safety and behavioral boundaries
机器之心· 2025-09-27 06:18
Core Viewpoint
- The article discusses the concept of Specification Alignment in large models, emphasizing the need for these models to adhere to both safety and behavioral specifications in various contexts, thereby ensuring user safety while meeting diverse behavioral requirements [3][9][30]

Group 1: Specification Alignment
- Specification Alignment is introduced as a new concept requiring large models to comply with both safety specifications (safety-spec) and behavioral specifications (behavioral-spec) in different scenarios [3][9]
- Safety specifications define the boundaries that models must not cross, such as avoiding violent content in children's stories or refusing to generate malicious code [9][10]
- Behavioral specifications guide how models should operate, reflecting user or organizational preferences, such as including educational morals in stories or providing multiple travel plans [9][10]

Group 2: SpecBench and Evaluation
- The research team developed SpecBench, the first benchmark for evaluating specification alignment, covering five application scenarios, 103 specifications, and 1,500 prompts [6][15]
- A new metric, Specification Alignment Rate (SAR), was introduced to assess models' adherence to specifications, emphasizing the principle of "safety first, then utility" [16][30]
- Testing revealed that most models exhibited significant gaps in specification alignment, with GPT-5 showing a clear lead across all scenarios, attributed to OpenAI's safe-completion training [23][24]

Group 3: Test-time Deliberation
- The article presents Test-time Deliberation (TTD) as a flexible approach to achieve specification alignment, allowing models to reflect on specifications during inference without altering model parameters [18][21]
- The Align3 method, part of TTD, effectively integrates safety and behavioral specifications into the reasoning process, enhancing model reliability [21][27]
- Experimental results indicate that TTD methods, including Align3, significantly improve specification alignment while maintaining lower computational costs compared to other methods [27][28]

Group 4: Future Outlook
- Specification alignment is identified as a critical academic challenge and a key threshold for large models to integrate into society and industry [30]
- Future models must balance safety and practicality while adapting to increasingly diverse and personalized specifications [30]
- The ongoing development of SpecBench and methods like Align3 represents the initial steps toward achieving more capable and responsible AI systems [30][31]
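The "safety first, then utility" principle behind SAR can be made concrete with a small scoring function. The snippet below is a hedged sketch of one plausible scoring rule, not SpecBench's official metric definition; how behavioral specifications are weighted in the real benchmark is an assumption here.

```python
# Hedged sketch of a "safety first, then utility" alignment score.
# The actual SAR definition in SpecBench may aggregate differently.
from typing import Callable, List

Spec = Callable[[str], bool]  # returns True if the response satisfies the spec


def response_score(response: str,
                   safety_specs: List[Spec],
                   behavioral_specs: List[Spec]) -> float:
    # Safety first: a single violated safety spec zeroes out the score,
    # no matter how useful the response is otherwise.
    if any(not ok(response) for ok in safety_specs):
        return 0.0
    # Then utility: credit the fraction of behavioral specs satisfied.
    if not behavioral_specs:
        return 1.0
    return sum(ok(response) for ok in behavioral_specs) / len(behavioral_specs)


def specification_alignment_rate(responses: List[str],
                                 safety_specs: List[Spec],
                                 behavioral_specs: List[Spec]) -> float:
    # Average the per-response scores over the whole prompt set.
    return sum(response_score(r, safety_specs, behavioral_specs)
               for r in responses) / len(responses)
```

Under a rule like this, a perfectly helpful response that crosses a safety boundary still scores zero on that prompt, which is the ordering the "safety first, then utility" principle implies.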
OpenAI studies large models' contribution to GDP: in three major industries they can already replace humans, and OpenAI admits its own model trails Claude
机器之心· 2025-09-27 06:13
Core Viewpoint
- The article discusses the introduction of GDPval, a new evaluation method by OpenAI that assesses AI model performance on economically valuable real-world tasks, indicating that AI is nearing human-level performance in various industries [1][3][22]

Group 1: Evaluation Methodology
- GDPval uses GDP as a key economic indicator and extracts tasks from critical occupations in the top nine industries contributing to the GDP [3][16]
- The evaluation includes 1,320 professional tasks, with a golden open-source subset of 220 tasks, designed and reviewed by experienced professionals [18][22]
- Tasks are based on real work outcomes, ensuring the evaluation's realism and diversity compared to other benchmarks [18][19]

Group 2: Model Performance
- The evaluation results show that leading models like Claude Opus 4.1 and GPT-5 are approaching or matching the quality of human experts in various tasks [4][9]
- Claude Opus 4.1 excels in aesthetic tasks, while GPT-5 performs better in accuracy-related tasks [9][10]
- Performance improvements have been significant, with task completion speed being approximately 100 times faster and costs being 100 times lower than human experts [13]

Group 3: Industry Impact
- AI has reached or surpassed human-level capabilities in sectors such as government, retail, and wholesale [7]
- The early results from GDPval suggest that AI can complete some repetitive tasks faster and at a lower cost than human experts, potentially transforming the job market [21]
- OpenAI aims to democratize access to these tools, enabling workers to adapt to changes and fostering economic growth through AI integration [21]

Group 4: Future Developments
- OpenAI plans to expand GDPval to include more occupations, industries, and task types, enhancing interactivity and addressing more ambiguous tasks [22]
- The ongoing improvements in the evaluation method indicate a commitment to better measure the progress of diverse knowledge work [22]
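Since GDPval grades model deliverables blind against work by experienced professionals, the headline numbers reduce to aggregated pairwise judgments. The snippet below is only an illustration of how such judgments might be tallied; the grade labels and the "wins plus half of ties" convention are assumptions, not OpenAI's published grading code.

```python
# Illustrative tally of blinded pairwise grades (model deliverable vs. human
# expert deliverable). Labels and the tie convention are assumptions.
from collections import Counter


def win_or_tie_rate(grades: list[str]) -> float:
    """grades: one of 'model', 'human', or 'tie' per task, judged blind."""
    counts = Counter(grades)
    # Count a tie as half a win so the rate stays between 0 and 1.
    return (counts["model"] + 0.5 * counts["tie"]) / len(grades)


example = ["model", "human", "tie", "model", "human", "tie"]
print(f"Model wins or ties on {win_or_tie_rate(example):.0%} of tasks")
```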
Can AI "shoot" a good film? Five shorts screened at the Busan International Film Festival, and the answer is unexpected
机器之心· 2025-09-27 06:13
Core Viewpoint
- The article discusses the technological advancements in AI-generated films, highlighting the successful creation of the first fully AI-generated short film "Nine Heavens" by a young team from Hong Kong, which has been recognized at the Busan International Film Festival [2][5][40]

Group 1: AI in Film Production
- The team at ManyMany Creations Limited aimed to create a 15-minute narrative short film entirely generated by AI, which they successfully accomplished with "Nine Heavens" [2][3]
- "Nine Heavens" is notable for its reliance on subtle micro-expressions to convey the protagonist's emotional journey, showcasing AI's capability in narrative storytelling [5][6]
- The film was part of a larger initiative called the "Future Image Plan," which aims to explore AI's role in filmmaking [5][18]

Group 2: AI Technology and Tools
- The production utilized advanced AI models from platforms like Jiemeng AI and Volcano Engine, which have significantly improved the quality and realism of AI-generated images and videos [17][18]
- The article mentions the evolution of AI tools, such as Seedream 4.0, which allows for multi-image fusion, enabling creators to generate detailed storyboards and videos from simple descriptions [23][25]
- The integration of AI in film production has led to a reduction in production time and costs, with "Nine Heavens" being produced in a fraction of the time compared to traditional methods [25][26]

Group 3: Industry Trends and Future Outlook
- Major film companies, like Bona Film Group, are embracing AI technologies, establishing dedicated AI production centers to explore new creative workflows [19][20]
- The shift towards AI in filmmaking is seen as a way to democratize the industry, allowing non-professionals to create high-quality content with minimal resources [30][31]
- Despite the advancements, challenges remain in achieving consistent quality in longer scenes, indicating that human intervention is still necessary in the production process [40][47]

Group 4: Creative Freedom and Expression
- AI tools have provided unprecedented creative freedom, allowing filmmakers to experiment with character designs and settings without the constraints of traditional production processes [32][33]
- The article emphasizes that while AI can generate content, the essence of storytelling and artistic expression remains rooted in human creativity and perspective [48][49]
Boosted by priors and posteriors, can large models hold up against the real-world "spillover" of reasoning-based prediction?
机器之心· 2025-09-27 01:30
This article comes from the PRO member newsletter; follow "机器之心PRO会员" at the end of the article for more in-depth topic coverage.

Introduction: FutureX, a dynamic evaluation benchmark recently released by ByteDance and collaborators, puts large models in front of prediction-style "exams" in which the answers are unknown, the data is updated dynamically, and results are verified in a closed loop. The work separates a model's predictive ability from its memory, and probes how models perform under long-horizon reasoning, execution robustness, and uncertainty. Meanwhile, the real-world results of large models in scenarios such as financial forecasting and disease assessment are still being refined, and researchers in the field are looking for new mechanisms that can bridge the gap between reasoning and execution.

Table of contents
- When reasoning "takes command" in real-world scenarios such as financial forecasting, can the model stay in control and land in practice?
- Whose reasoning-based prediction is strongest, with prior and posterior routes "each showing their strengths"? Which directions have past model-prediction techniques pushed in? Can prior memory and posterior reflection mechanisms bring new breakthroughs for model prediction?

01 FutureX arrives: from long-horizon reasoning to real-world prediction, have large models "held up"?
1. At present, most benchmarks for evaluating large language models rely on pre-existing, fixed datasets.
2. This kind of evaluation works reasonably well for measuring a model's factual knowledge or simple reasoning over known datasets, but it struggles to gauge a model's true reasoning strength when the model must make predictions about a dynamic real world.
① Static benchmarks typically deal with cases where a solution already exists ...
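The core of a closed-loop prediction benchmark of the kind described above is that answers simply do not exist at prediction time and are scored only after reality resolves them. The sketch below is a generic illustration of that protocol under assumed data structures; it is not FutureX's actual schema or harness.

```python
# Generic sketch of a closed-loop "predict the future" evaluation:
# collect predictions before the outcome is known, score after resolution.
# The record structure below is an assumption, not FutureX's schema.
import datetime as dt
from dataclasses import dataclass


@dataclass
class PredictionRecord:
    question: str
    deadline: dt.datetime        # the model must answer before this time
    prediction: str | None = None
    outcome: str | None = None   # filled in only after the event resolves

    def score(self) -> float | None:
        if self.outcome is None:
            return None          # unresolved: cannot be scored or memorized
        return float(self.prediction == self.outcome)


def rolling_accuracy(records: list[PredictionRecord]) -> float:
    scored = [s for r in records if (s := r.score()) is not None]
    return sum(scored) / len(scored) if scored else 0.0
```

Because the answer is unknown when the model commits to a prediction, a high rolling accuracy cannot come from memorized training data, which is exactly the separation of predictive ability from recall that the excerpt emphasizes.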
Agentic coding performance hits a new high: the new KAT model series makes the SWE-Bench leaderboard
机器之心· 2025-09-26 10:35
Core Insights
- The article discusses the launch of two groundbreaking models in the Code Intelligence field by the Kwaipilot team: the open-source 32B-parameter model KAT-Dev-32B and the closed-source flagship model KAT-Coder, showcasing their strong performance and capabilities in coding tasks [2][26]

Model Performance
- KAT-Dev-32B achieved a 62.4% solution rate on SWE-Bench Verified, ranking 5th among all open-source models of various sizes [2]
- KAT-Coder demonstrated an impressive 73.4% solution rate on the same benchmark, comparable to top global closed-source models [2][11]

Model Accessibility
- KAT-Dev-32B is available on the Hugging Face platform for further research and development [7]
- The API key for KAT-Coder can be requested on the "Kuaishou Wanqing" enterprise-level model service and development platform, allowing users to access coding tools directly [7]

Training Innovations
- The KAT series models underwent several innovative training phases, including Mid-Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and large-scale Agentic Reinforcement Learning (RL) [9][12]
- Mid-Training focused on enhancing the model's capabilities related to "LLM-as-Agent," improving tool usage, multi-turn interaction, and instruction adherence [10][12]
- SFT involved collecting real demand-delivery trajectories annotated by human engineers to enhance end-to-end delivery capabilities [13]
- RFT introduced ground truth for trajectory exploration, improving the efficiency and stability of the reinforcement learning phase [15]

Advanced Techniques
- The team implemented entropy-based tree pruning to efficiently learn from non-linear trajectory histories and maximize throughput while minimizing costs [19]
- The SeamlessFlow framework was developed to manage trajectory trees and ensure high-throughput training by decoupling RL training from the agent's internal logic [21][22]

Emergent Capabilities
- Post-training analysis revealed two significant emergent phenomena: dialogue rounds dropped by 32% compared to the SFT model, and the model learned to call multiple tools in parallel [33][35]
- The model's efficiency preference and parallel-calling capability were attributed to the implicit optimization pressure from the trajectory-tree structure [33]

Future Prospects
- The Kwaipilot team aims to explore the frontiers of code intelligence, including enhancing tool integration, expanding language support, and developing collaborative coding systems [35]
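Of the training details above, the entropy-based tree pruning is the most concrete to picture. The sketch below shows one plausible reading of the idea, stated as an assumption rather than the team's implementation: trajectories sharing a prefix are stored as a tree, and branch points where the rollouts were nearly deterministic (low entropy over the children actually taken) are collapsed so that training compute concentrates on genuine decision points.

```python
# Hedged sketch: pruning a trajectory tree by the entropy at its branch points.
# The node layout and the fixed threshold are illustrative assumptions only.
import math
from dataclasses import dataclass, field


@dataclass
class TrajNode:
    token_span: str
    children: list["TrajNode"] = field(default_factory=list)
    visit_counts: list[int] = field(default_factory=list)  # per-child visits

    def branch_entropy(self) -> float:
        total = sum(self.visit_counts)
        if total == 0 or len(self.visit_counts) < 2:
            return 0.0
        probs = [c / total for c in self.visit_counts if c > 0]
        return -sum(p * math.log(p) for p in probs)


def prune(node: TrajNode, min_entropy: float = 0.1) -> None:
    """Collapse near-deterministic branch points; keep high-entropy ones."""
    if len(node.children) > 1 and node.branch_entropy() < min_entropy:
        # The rollouts almost always agreed here, so keep only the
        # most-visited child: there was no real decision to learn from.
        best = max(range(len(node.children)), key=lambda i: node.visit_counts[i])
        node.children = [node.children[best]]
        node.visit_counts = [node.visit_counts[best]]
    for child in node.children:
        prune(child, min_entropy)
```

Whether this matches the team's exact criterion is not stated in the summary; the sketch is only meant to show the general mechanism of spending compute where rollouts actually diverge.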
IEEE TPAMI 2025 | Peking University proposes a distribution-driven lifelong learning paradigm that uses structural modeling to tackle catastrophic forgetting
机器之心· 2025-09-26 10:35
Recently, Assistant Professor Zhou Jiahuan and Professor Peng Yuxin of Peking University's Wangxuan Institute of Computer Technology jointly published a new research result in the major international AI journal IEEE TPAMI: DKP++ (Distribution-aware Knowledge Aligning and Prototyping for Non-exemplar Lifelong Person Re-Identification). Targeting catastrophic forgetting in lifelong learning, the work proposes a distribution-modeling-guided framework for knowledge aligning and prototyping that not only effectively strengthens the retention of historical knowledge but also improves the model's cross-domain learning ability.

The first author is Zhou Jiahuan, assistant professor at the Wangxuan Institute of Computer Technology, Peking University; the corresponding author is Peng Yuxin, professor at the same institute. The paper has been accepted by IEEE TPAMI, and the code has been open-sourced.

Person re-identification (ReID) aims to match and associate images of the same pedestrian across camera views, locations, and time based on visual features. The technology has broad practical value in multi-camera surveillance, intelligent transportation systems, urban public-safety management, and large-scale image and video retrieval. In real-world environments, however, as collection sites, capture devices, and temporal conditions keep changing, the distribution of pedestrian images ...
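Because the setting is non-exemplar (no old images may be stored), methods in this family typically keep only a compact distributional summary of each past identity. The sketch below illustrates that general idea in a hedged way: Gaussian prototypes (a feature mean and covariance per identity) stand in for raw exemplars, and pseudo-features sampled from them regularize the new model. This is a generic illustration of distribution-aware prototyping, not the exact DKP++ formulation from the paper.

```python
# Generic sketch of distribution-aware prototypes for non-exemplar lifelong ReID.
# A Gaussian (mean, covariance) per identity replaces stored images; sampled
# pseudo-features are replayed to restrain forgetting. The exact DKP++ losses
# and alignment terms are in the paper and not reproduced here.
import numpy as np


class GaussianPrototypeMemory:
    def __init__(self) -> None:
        self.prototypes: dict[int, tuple[np.ndarray, np.ndarray]] = {}
        self.rng = np.random.default_rng()

    def update(self, identity: int, features: np.ndarray) -> None:
        """features: (num_samples, feature_dim) embeddings for one identity."""
        mean = features.mean(axis=0)
        cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        self.prototypes[identity] = (mean, cov)

    def sample_pseudo_features(self, identity: int, n: int) -> np.ndarray:
        """Draw pseudo-features from the stored distribution for replay."""
        mean, cov = self.prototypes[identity]
        return self.rng.multivariate_normal(mean, cov, size=n)


def alignment_penalty(new_feats: np.ndarray, old_pseudo_feats: np.ndarray) -> float:
    """Toy alignment term: penalize drift between old and new feature statistics."""
    return float(np.linalg.norm(new_feats.mean(axis=0) - old_pseudo_feats.mean(axis=0)))
```

Replaying from distributions rather than stored images is what keeps such a method "non-exemplar" while still giving the updated model a signal about how old identities were embedded.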
JD's AI delivers "results": deep application is already the present, and a trillion-scale ecosystem is aimed at the future
机器之心· 2025-09-26 10:35
Core Insights
- JD's AI model "JoyAI" has transitioned into deep industry applications, showcasing its capabilities across various sectors and daily life [2][3][31]
- The company emphasizes that understanding industry scenarios is crucial for leading in AI applications, moving beyond mere technical advancements [33][34]

Product Launches
- JD has upgraded its AI model brand to "JoyAI," covering a range from 3 billion to 750 billion parameters, and introduced three major AI products: JoyAI LiveHuman, JoyAI LiveTTS, and the 京犀 App [6][10][11]
- The 京犀 App is positioned as a next-generation shopping and lifestyle service platform, capable of understanding user needs and facilitating transactions through voice commands [11][13][14]
- "他她它" is JD's first digital assistant product, designed to provide a wide range of services and engage users in a more human-like interaction [15][16]

Technological Advancements
- JoyAI's architecture includes innovations such as sparse MOE training and self-competitive algorithms, enhancing reasoning speed by 1.8 times compared to traditional methods [7][9]
- The model achieved a score of 76.3 on the Rbench0924 evaluation, ranking first in China and second globally for reasoning capabilities [9]

Industry Applications
- JD's AI is being integrated into various sectors, including retail, logistics, industrial, and healthcare, enhancing efficiency and trust in supply chain operations [21][22][27]
- The new AI architecture "Oxygen" aims to revolutionize e-commerce by providing personalized shopping experiences through advanced recommendation systems [24][27]

Strategic Vision
- JD's approach combines self-developed technology with investments and ecosystem partnerships to penetrate the embodied intelligence field, focusing on practical applications rather than just technological prowess [20][31]
- The company plans to invest significantly over the next three years to build a trillion-scale AI ecosystem, emphasizing sustainable development and real value creation for industries [38]
Three years of animation school outdone by AI in seconds: OpenAI wants to make a movie, but Hollywood isn't buying it
机器之心· 2025-09-26 08:26
Core Viewpoint
- OpenAI is positioning itself to disrupt Hollywood by demonstrating that generative AI can produce animated films more quickly and cost-effectively than traditional methods [21][26]

Group 1: OpenAI's Animation Project
- OpenAI is backing an animated film titled "Critterz," which aims to showcase the capabilities of generative AI in film production [21]
- The film's production timeline is targeted to be reduced from the traditional three years to approximately nine months, with a budget of under $30 million, significantly lower than typical animation costs [23]
- The film is set to premiere globally in 2026, with hopes of debuting at the Cannes Film Festival [25]

Group 2: Technology and Collaboration
- The production involves collaboration with human artists for character sketches, which will be integrated with OpenAI's tools, including the latest GPT-5 and image generation models [23][28]
- OpenAI's approach combines human creativity with AI assistance, aiming to mitigate copyright concerns that have arisen in the industry [28]

Group 3: Industry Implications
- If successful, "Critterz" could accelerate the adoption of AI technologies in Hollywood, lowering creative barriers for more creators [26]
- Despite the potential benefits, the entertainment industry remains cautious about fully embracing AI due to fears of job displacement for actors and writers, as well as intellectual property issues [27][28]