Peking University team releases the first systematic survey of large language model psychometrics: evaluation, validation, enhancement
机器之心· 2025-05-27 04:11
Background

As the capabilities of large language models (LLMs) iterate rapidly, traditional evaluation methods can no longer keep up. How can the "mental" characteristics of LLMs, such as values, personality, and social intelligence, be assessed scientifically? How can a more comprehensive and reliable AI evaluation system be built? The latest survey from Professor Song Guojie's team at Peking University (63 pages, with 500 references) is the first attempt to organize the answers systematically.

These challenges align closely with the core questions psychometrics has long studied: how to scientifically quantify and understand complex, abstract psychological traits such as knowledge, skills, personality, and values. By turning these traits into quantifiable data, psychometrics supports decision-making in fields such as education, healthcare, business, and governance.

Introducing psychometric theory, tools, and principles into the evaluation of large language models offers a new methodological path for systematically understanding and improving the "mental" capabilities of AI, and has driven the development of the interdisciplinary field of LLM Psychometrics. This direction helps define the capability boundaries of artificial intelligence more comprehensively and scientifically.

Main Content

This survey is the first systematic review of research progress in LLM psychometrics; its structure is shown in the figure below. It covers the differences between psychometrics and LLM benchmarks, and the renewal of evaluation principles.

Paper title: Large Language Model Psychometrics: A Systematic Review of Evalu ...
While traditional clouds are still "selling iron," the next-generation cloud is already "refining steel": how Volcano Engine's xLLM squeezes two cards' worth of performance out of one
机器之心· 2025-05-27 04:11
机器之心 report. Editor: Panda

Large models keep getting smarter, yet enterprises seem to be getting more anxious.

Model performance is advancing by leaps and bounds, and AI keeps mastering new skills, from writing copy to building agents. But as soon as real production deployment begins, the questions arrive: why do inference costs keep rising? Why does compute investment keep growing while results fail to scale in proportion?

Today's reasoning models are already capable of serving complex business scenarios. To make them fast enough in production, however, enterprises often have to pile on GPUs just to meet metrics such as TPOT (average time per output token) and TPS (tokens per second); a simple way to compute these two metrics is sketched after this summary. In other words, after clearing the bar of model performance, enterprises discover that large-model deployment has another towering barrier: inference efficiency.

To meet this demand, cloud vendors have uniformly turned to "selling iron," that is, offering more, newer, and more expensive cards. But is the real problem their customers face really that "the cards are not numerous or powerful enough"?

Volcano Engine's answer: it is not that the cards are too few or too weak; they simply have not been "refined" properly. This cloud platform, which has already raised the banner of "AI cloud native," has carved out its own path in "refining steel." Its xLLM large language model inference framework delivers what can fairly be called extreme performance, supporting large-scale deployment with low latency and high throughput: using the same GPU cards, the compute cost is only ...
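For reference, the two serving metrics named above can be computed directly from request logs. The sketch below is a generic illustration with hypothetical field names; it is not part of xLLM:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timing record for one inference request (all times in seconds)."""
    t_start: float          # request received
    t_first_token: float    # first output token emitted
    t_done: float           # last output token emitted
    output_tokens: int      # number of tokens generated

def tpot(trace: RequestTrace) -> float:
    """Average time per output token during decoding (TPOT)."""
    decoded = max(trace.output_tokens - 1, 1)
    return (trace.t_done - trace.t_first_token) / decoded

def tps(traces: list[RequestTrace]) -> float:
    """Aggregate output tokens per second (TPS) over a batch of requests."""
    total_tokens = sum(t.output_tokens for t in traces)
    window = max(t.t_done for t in traces) - min(t.t_start for t in traces)
    return total_tokens / window
```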
Two-year-old Llama: 11 of the original 14 authors have already left! Mistral is the biggest winner
机器之心· 2025-05-27 03:23
机器之心 report, by the 机器之心 editorial team

They were all top talent at Meta.

Meta's open-source Llama model family helped the company set its AI strategy and, to some extent, reshaped the large-model landscape worldwide. Unexpectedly, however, within just a few years most of Llama's original creators have moved elsewhere.

Llama was introduced to the world in the landmark 2023 paper "LLaMA: Open and Efficient Foundation Language Models," which listed 14 authors.

Paper: https://arxiv.org/pdf/2302.13971

A media review this week found that in just two years, 11 of those authors have left. The remaining three are research scientist Hugo Touvron, research engineer Xavier Martinet, and technical program lead Faisal Azhar.

Meta's talent drain has benefited Mistral the most. Mistral is an AI startup headquartered in Paris, France, founded by former Meta researchers Guillaume Lample and Timothée Lacroix (two core architects of Llama and also founding ...
Abandoning autoregression! A Chinese team builds LLaDA-V, a pure-diffusion multimodal large model, setting a new SOTA for understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B on pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance on multimodal understanding tasks compared to existing hybrid autoregressive-diffusion models, validating the effectiveness of an MLLM architecture built on a powerful language diffusion model [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks such as MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, yielding a robust training and inference process [13][15].
- The architecture follows the classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features and the MLP projector maps them into LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies (a rough sketch of this objective follows this summary) [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- The ongoing advancement of language diffusion models is expected to play an increasingly significant role, further pushing the boundaries of multimodal AI [16].
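The "mask only the responses" objective described in Group 3 can be illustrated with a minimal sketch. This is our own rough approximation of a LLaDA-style masked-diffusion loss, not LLaDA-V's actual training code: the `model` is a stand-in callable that maps token ids to vocabulary logits, `MASK_ID` is a hypothetical mask-token id, and the 1/t reweighting and normalization are one reasonable choice among several.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id; replace with the tokenizer's real id

def masked_diffusion_loss(model, input_ids, response_mask):
    """One training step of a LLaDA-style masked-diffusion objective.

    input_ids:     (B, L) full multi-turn dialogue, prompts + responses
    response_mask: (B, L) bool, True only on assistant-response tokens
    Only response tokens are eligible for masking; prompt (and visual)
    tokens stay visible, which is how multi-turn dialogue is supported.
    """
    b, l = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)   # diffusion time ~ U(0, 1)
    masked = (torch.rand(b, l, device=input_ids.device) < t) & response_mask

    noisy_ids = torch.where(masked, torch.full_like(input_ids, MASK_ID), input_ids)
    logits = model(noisy_ids)                                        # (B, L, V), predicts originals

    # Cross-entropy only on masked positions, reweighted by 1/t as in masked diffusion.
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    loss = (ce * masked / t).sum() / response_mask.sum().clamp(min=1)
    return loss
```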
Nine top researchers present over three consecutive evenings: a deep dive into the foundational research behind Huawei's Pangu large models
机器之心· 2025-05-26 10:59
Core Viewpoint
- The rapid development of large language models (LLMs) has made them a cornerstone of general artificial intelligence systems, but the increase in model capabilities has led to significant growth in computational and storage demands, presenting a challenge for achieving both high performance and efficiency in AI [1][2].

Group 1: Technological Advancements
- Huawei's Noah's Ark Lab has developed Pangu Ultra, a general language model with over 100 billion parameters, surpassing previous models such as Llama 405B and Mistral Large 2 in various evaluations [2].
- The lab also introduced the sparse language model Pangu Ultra MoE, achieving long-term stable training on over 6,000 Ascend NPUs [2].

Group 2: Key Research Presentations
- A series of sharing sessions from May 28 to May 30 will cover breakthroughs in quantization, pruning, MoE architecture optimization, and KV optimization, aimed at developers and researchers interested in large models [3][4].

Group 3: Specific Research Contributions
- CBQ: a post-training quantization framework that addresses the high computational and storage costs of LLMs, achieving significant performance improvements in ultra-low-bit quantization [6].
- SlimLLM: a structured pruning method that effectively reduces the computational load of LLMs while maintaining accuracy, demonstrating advanced performance on LLaMA benchmark tests [8].
- KnowTrace: an iterative retrieval-augmented generation framework that enhances multi-step reasoning by tracking knowledge triplets, outperforming existing methods on multi-hop question answering [10].

Group 4: Further Innovations
- Pangu Embedded: a flexible language model that alternates between fast and deep thinking, designed to optimize inference efficiency while maintaining high accuracy [14].
- Pangu-Light: a pruning framework that stabilizes and optimizes performance after aggressive structural pruning, achieving significant model compression and inference acceleration [16].
- ESA: an efficient selective attention method that reduces computational overhead during inference by exploiting the sparsity of attention matrices [18].

Group 5: MoE Model Developments
- Pangu Pro MoE: a native MoE model with 72 billion parameters, designed to balance load across devices and enhance inference efficiency through various optimization techniques [21].
- PreMoe: an expert-routing optimization for MoE models that dynamically loads experts based on task-specific requirements, improving inference efficiency by over 10% while maintaining model capability [24].

Group 6: KV Optimization Techniques
- KVTuner: a hardware-friendly algorithm for KV memory compression that achieves near-lossless quantization without retraining, significantly improving inference speed (a generic illustration of KV-cache quantization follows this summary) [26].
- TrimR: an efficient reflection-compression algorithm that identifies redundant reflections in LLMs, leading to a 70% improvement in inference efficiency across various models [26].
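The digest does not describe KVTuner's algorithm in detail, but the basic operation it builds on, quantizing the cached keys and values so they occupy fewer bits during decoding, can be sketched generically. The code below is a plain asymmetric per-token quantizer for illustration only; it is not KVTuner, which additionally tunes layer-wise mixed-precision configurations according to the summary above.

```python
import torch

def quantize_kv(x: torch.Tensor, n_bits: int = 8):
    """Asymmetric per-token quantization of a K or V cache tensor.

    x: (num_tokens, num_heads, head_dim) cached keys or values.
    Returns integer codes plus per-token scale/zero-point, so the cache can
    be stored in n_bits and dequantized on the fly during attention.
    """
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=(-2, -1), keepdim=True)
    x_max = x.amax(dim=(-2, -1), keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / qmax
    zero = (-x_min / scale).round()
    codes = ((x / scale) + zero).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, zero

def dequantize_kv(codes, scale, zero):
    """Recover an approximate float cache for the attention computation."""
    return (codes.float() - zero) * scale
```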
Hands-on with Veo3, the model stunning the world: unbeatable audio-video sync, and the high price has its reasons
机器之心· 2025-05-26 09:40
Core Viewpoint
- The article discusses the impressive capabilities of Google's new AI model, Veo3, which can generate synchronized video and audio content, raising questions about the future of content creation and the potential impact on traditional media industries like Hollywood [4][5][50].

Group 1: Veo3 Capabilities
- Veo3 can generate videos with synchronized audio, including environmental sounds, background music, and dialogue, achieving a high level of realism [5][6].
- Users have shared various videos generated by Veo3 on social media, showcasing its ability to create lifelike performances that challenge traditional actors [7][12].
- The model has been tested with different prompts, producing impressive results in various scenarios, including ASMR and game streaming videos [13][26].

Group 2: User Experience and Access
- Google has provided access to Veo3 through its Gemini platform, with different user tiers offering varying levels of functionality [19][15].
- Users have reported that the model performs better with English prompts compared to Chinese ones, indicating a potential area for improvement [49].

Group 3: Limitations and Challenges
- Despite its strengths, Veo3 struggles with complex scenarios, such as gymnastics videos, where it fails to accurately depict intricate movements [31][33].
- The model has shown some limitations in generating realistic interactions and transitions between scenes, particularly in more dynamic settings [50].

Group 4: Industry Implications
- The advancements in AI-generated content, like those seen with Veo3, pose significant questions for the entertainment industry, particularly regarding the future of acting and content creation [51].
- The article emphasizes the need for the industry to adapt to these technological advancements rather than simply dismissing them as threats [51].
Resonating with Gemini Diffusion! The first diffusion-based "Diffusion Chain of Lateral Thought" is here
机器之心· 2025-05-26 09:40
Core Viewpoint
- The article introduces a novel reasoning paradigm called the "Diffusion Chain of Lateral Thought" (DCoLT), which enhances the reasoning capabilities of large models by treating intermediate results of the reverse diffusion process as steps in the reasoning process and optimizing the correctness of the final output through reinforcement learning (a generic sketch of this idea follows this summary) [1][34].

Group 1: Introduction of the Concept
- The "Diffusion Chain of Lateral Thought" is a new reasoning paradigm proposed by Professor Qi Guojun's team at the MAPLE Lab of Westlake University, emphasizing the importance of divergent thinking in large-model training and inference [1][6].
- This method allows for non-linear generation of responses, contrasting with traditional linear reasoning chains and thereby encouraging more creative and exploratory reasoning paths [1][7].

Group 2: Application and Results
- The method has been successfully applied to two representative diffusion language models, showing significant improvements on mathematical reasoning and code generation tasks and surpassing existing models [2][30].
- The team trained the "Ordered Mask Generation Diffusion Language Model" (LLaDOU) on top of the LLaDA model, achieving superior performance on complex reasoning tasks compared to other diffusion language models [2][31].

Group 3: Experimental Validation
- Experiments demonstrated that the DCoLT approach outperformed traditional methods such as Chain of Thought (CoT) and DoT on tasks including Sudoku solving and mathematical reasoning, achieving 57.0% accuracy on the GSM8K-Aug dataset [30].
- The LLaDOU model achieved 88.1% accuracy on mathematical reasoning tasks, significantly higher than other models, indicating the effectiveness of the proposed reasoning paradigm [32].

Group 4: Theoretical Implications
- The research highlights that traditional autoregressive models are not the only choice for generating answers, suggesting that optimizing the order of token generation can lead to more effective reasoning processes [2][34].
- The findings provide important insights into the training and inference of foundational large models, advocating for a shift from linear to non-linear reasoning paradigms in AI [2][6].
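The core idea above, treating each reverse-diffusion step as a reasoning action and rewarding only the correctness of the final answer, can be sketched with a generic REINFORCE-style update. Everything in the sketch is a hypothetical stand-in interface (`init_masked`, `denoise_step`, `commit`, `decode`); it is not the authors' LLaDOU implementation.

```python
import torch

def dcolt_style_update(model, prompts, gold_answers, n_steps: int = 16):
    """Outcome-rewarded policy gradient over reverse-diffusion reasoning steps.

    Each denoising step is treated as an action; a whole trajectory is
    sampled, and only the correctness of the final decoded answer is rewarded.
    """
    states = model.init_masked(prompts)                # responses start fully masked
    step_log_probs = []
    for _ in range(n_steps):
        dist = model.denoise_step(states)              # distribution over what to unmask
        action = dist.sample()                         # tokens/positions chosen this step
        step_log_probs.append(dist.log_prob(action).sum(dim=-1))
        states = model.commit(states, action)          # write the chosen tokens in place

    reward = (model.decode(states) == gold_answers).float()   # outcome-only reward
    baseline = reward.mean()                                   # simple variance reduction
    trajectory_log_prob = torch.stack(step_log_probs).sum(dim=0)
    loss = -((reward - baseline).detach() * trajectory_log_prob).mean()
    return loss
```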
Accepted at ACL 2025 with high scores | Emotionally expressive speech technology: Logic Intelligence's path to a breakthrough in low-resource-language TTS
机器之心· 2025-05-26 01:28
This work was completed jointly by Beijing Deep Logic Intelligence Technology Co., Ltd. (北京深度逻辑智能科技有限公司) and the EIT-NLP Lab of the Eastern Institute of Technology, Ningbo (宁波东方理工).

Text-to-speech (TTS) technology has advanced by leaps and bounds over the past decade, evolving from early concatenative synthesis and statistical parametric models to today's deep neural networks and advanced architectures such as diffusion models and GANs. It now achieves near-human naturalness and emotional expressiveness, broadly empowering scenarios such as intelligent assistants, accessible reading, and immersive entertainment.

This prosperity, however, has been largely confined to resource-rich major languages such as English and Mandarin. For the more than one thousand low-resource languages worldwide, scarce corpora and complex linguistic properties such as unspaced scripts and rich tonal systems create enormous challenges in data collection, text front-end processing, and acoustic modeling, so high-quality TTS has long failed to materialize for them. Cracking the "low-resource-language dilemma" is both a frontier research topic and a key to digital inclusion and multilingual cultural communication.

Facing this challenge, the Logic Intelligence team proposed a solution for low-resource-language TTS and applied it to Thai TTS synthesis; the work has been officially accepted to the ACL 2025 Industry Track.

The work proposes an innovative, data-optimization-driven acoustic modeling framework. By building a systematic Thai dataset across multiple dimensions (speech, text, phonemes, syntax) and combining it with advanced acoustic modeling techniques, it achieves high-quality TTS synthesis under limited resources.

In addition, the framework supports zero-shot voice cloning, demonstrating excellent cross- ...
Amazing, my computer is working on its own! Hire an "AI superhuman" for less than 1 yuan, and the Office trio gets completely outclassed
机器之心· 2025-05-26 01:28
Core Viewpoint
- The article highlights the emergence of the "Skywork Super Agents" by Kunlun Wanwei as a groundbreaking product in the AI agent space, showcasing its advanced capabilities and potential to revolutionize content creation and productivity tools in the workplace [5][6][64].

Group 1: Product Features
- Skywork integrates five expert-level AI agents, enabling users to generate professional documents, spreadsheets, presentations, podcasts, and web pages seamlessly [6].
- It offers a universal AI agent capable of producing multimodal content, including music, music videos, promotional videos, picture books, and audiobooks [7].
- Skywork excels in benchmark tests, outperforming competitors like Manus and OpenAI in various assessments, including GAIA and SimpleQA [9][11].
- The product is the first globally to provide an open-source deep research agent framework, allowing developers to participate in defining AI agents [14].
- It features three major MCP interfaces for document generation, data analysis, and presentation creation, establishing itself as a core "AI operating system" for developers [15].

Group 2: User Experience and Functionality
- Skywork's user interface allows for easy interaction, enabling users to generate scripts, analyze data, create presentations, and develop web pages with simple prompts [19][26][31][33].
- The platform supports visual data analysis, generating structured and informative sheets with visual representations like pie charts and bar graphs [30].
- It provides a robust PPT generation feature, producing visually appealing and informative presentations based on user prompts [32].
- Skywork can create playable web games and podcasts, demonstrating its versatility in content generation [35][37].

Group 3: Competitive Advantage
- Skywork distinguishes itself through its task collaboration, multimodal generation, and high credibility of results, addressing pain points faced by competitors [44].
- The integration of document, spreadsheet, and presentation tools enhances productivity for users, allowing for detailed and organized content generation [45].
- It offers flexible export formats, including PPTX, PDF, HTML, and Google Slides, catering to various user needs [50].

Group 4: Technological Innovations
- Skywork employs self-developed technologies, including a deep research model and an agent workflow framework, to enhance its performance and capabilities [61][63].
- The platform's ability to break down complex tasks into manageable components allows for efficient processing and execution [62].
- It incorporates a personal knowledge base feature, enabling users to upload various document formats and create a sustainable content cycle [58].

Group 5: Market Implications
- The launch of Skywork signifies a strategic breakthrough for Kunlun Wanwei, positioning it competitively against international players in the AI agent market [66].
- The article suggests that the rise of AI agents like Skywork may lead to a significant transformation in workplace productivity, potentially automating many tasks currently performed by humans [67].
A Microsoft VP "holds class" on X, posting a continuously updated series on everything RL: required reading for LLM practitioners
机器之心· 2025-05-26 01:28
Core Viewpoint
- The article discusses the educational series on artificial intelligence initiated by Nando de Freitas, focusing on reinforcement learning (RL) and its applications in large language models (LLMs) [1][2].

Summary by Sections

Introduction to AI Education
- Nando de Freitas aims to educate readers on AI through a series of posts on X, starting with reinforcement learning and gradually covering diffusion and flow-matching technologies [1][2].

Learning Types
- The article notes that there is no final verdict on unsupervised learning, supervised learning, and reinforcement learning [8][19].
- Supervised learning is described as basic imitation, requiring high-quality expert data for effective learning [9].
- Reinforcement learning focuses on selective imitation, allowing agents to learn from suboptimal experiences and improve their performance [10][11].

Distributed Reinforcement Learning Systems
- Modern distributed RL systems consist of two main components, Actors and Learners: Actors interact with the environment and collect data, while Learners update the policy network based on this data [23][24].
- The importance of measuring operation durations and communication bandwidth in such systems is emphasized [24][27].

Offline Reinforcement Learning
- Offline RL has unique value in scenarios such as the post-training of LLMs, where it can leverage historical data for learning [28][29].

Single-step and Multi-step RL
- The article differentiates between single-step and multi-step RL problems, with single-step focusing on immediate actions and multi-step involving planning over a series of interactions [35][39].
- The complexity of multi-step RL is noted, particularly the credit-assignment problem, where multiple decisions jointly determine the outcome [40][41].

Policy Gradient and Techniques
- Policy gradient methods are discussed, including the use of baseline subtraction to reduce the variance of the reward signal [49][56].
- The article also covers the significance of KL divergence in keeping the policy close to the supervised fine-tuning strategy during post-training [69].

Importance Sampling and PPO
- Importance sampling is introduced as a method to correct off-policy sample bias, with Proximal Policy Optimization (PPO) as a key technique for managing policy updates (a minimal sketch combining these ingredients follows this summary) [73][78].
- The integration of these techniques in training models like DeepSeek-R1 is highlighted, showcasing the complexity of modern RL systems [81].

Future Directions
- Freitas plans to expand the discussion from single-step to multi-step RL, indicating ongoing developments in the field [82].
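The ingredients named above (baseline subtraction, a KL term toward the SFT policy, importance sampling, and PPO's clipped update) combine into a single token-level loss. The sketch below is a standard, generic formulation assembled for illustration; it is not taken from de Freitas's posts, and the coefficients are arbitrary defaults.

```python
import torch

def ppo_token_loss(logp_new, logp_old, logp_sft, advantages,
                   clip_eps: float = 0.2, kl_coef: float = 0.05):
    """Clipped PPO surrogate with a KL penalty toward the SFT policy.

    logp_new:   log-probs of sampled tokens under the policy being updated
    logp_old:   log-probs under the policy that generated the samples
    logp_sft:   log-probs under the frozen supervised fine-tuned reference
    advantages: reward minus a baseline (e.g. a value estimate or group mean)
    All tensors share the same shape (one entry per sampled token).
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.minimum(unclipped, clipped).mean()

    # Sample-based KL estimate that keeps the policy close to the SFT model.
    kl = (logp_new - logp_sft).mean()
    return policy_loss + kl_coef * kl
```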