机器之心
All large models learn physics: a study from Peking University's physics department stuns the AI community
机器之心· 2025-12-16 08:55
Editors: +0, 泽南, Panda

LLM agents are impressive, and they are becoming a powerful paradigm for solving complex problems. So far, though, that success has remained largely a matter of empirical engineering practice: we know agents work well, but we often cannot say why they behave as they do at the macroscopic level. Can we find a theoretical framework that understands and unifies the macroscopic dynamics of agents, the way physics describes the natural world?

To open this black box, researchers from Peking University's School of Physics, the Center for High Energy Physics, and the Beijing Computational Science Research Center recently joined forces, crossing disciplines to borrow from physics the classic ...

Paper title: Detailed balance in large language model-driven agents
Paper link: https://arxiv.org/pdf/2512.10047

In short, the team experimentally measured the transition probabilities between LLM-generated states and found, statistically, a detailed balance phenomenon in those transitions. This suggests that LLM generation may be realized not by learning general rule sets and strategies, but by implicitly learning a class of latent potential functions, ones that may carry over across different LLM architectures and prompt templates.
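The detailed balance condition at the center of the paper is easy to state concretely: a transition kernel P satisfies detailed balance with respect to a stationary distribution pi when pi(x)P(x,y) = pi(y)P(y,x) for every pair of states, which holds exactly when pi(x) is proportional to exp(-U(x)) for some potential U. A minimal toy sketch of that relationship (the states, numbers, and Metropolis construction below are illustrative inventions, not the paper's measurements or code):

```python
import numpy as np

# Toy potential over 4 "generation states"; pi(x) ∝ exp(-U(x)).
U = np.array([0.0, 1.0, 0.5, 2.0])
pi = np.exp(-U) / np.exp(-U).sum()

# Build a reversible (Metropolis) transition matrix with respect to pi.
n = len(U)
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            P[x, y] = (1.0 / (n - 1)) * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()

# Detailed balance: pi(x) P(x,y) == pi(y) P(y,x) for all pairs.
flux = pi[:, None] * P
print(np.allclose(flux, flux.T))  # True

# Recover the potential (up to an additive constant) from transition ratios:
# U(y) - U(x) = -log( P(x,y) / P(y,x) ) whenever both transitions occur.
dU = -np.log(P[0, 1] / P[1, 0])
print(round(dU, 6))  # 1.0 (= U[1] - U[0])
```

The last step is the direction the paper exploits in reverse: measuring transition ratios between generated states lets one read off an implicit potential.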
Is NVIDIA the new king of open source? Nemotron 3's new mixture-of-experts architecture boosts inference efficiency 4x
机器之心· 2025-12-16 08:55
机器之心 editorial team. NVIDIA's in-house large model has just received a major version update. Early this morning Beijing time, NVIDIA released the Nemotron 3 family of open models in three sizes: Nano, Super, and Ultra. NVIDIA argues that as enterprises move from single-model chatbots toward multi-agent AI systems that work in concert, developers face challenges such as high communication overhead, context drift, and stubbornly high inference costs. At the same time, models capable of supporting complex workflow automation must be transparent and interpretable enough to earn the trust of developers and enterprises. Nemotron 3 Nano is already live on Hugging Face and is currently the most compute-cost-efficient model in the family, optimized for tasks such as software debugging, content summarization, AI-assistant workflows, and information retrieval, significantly lowering inference costs. The model adopts a distinctive hybrid MoE architecture that delivers notable gains in both efficiency and scalability. Nemotron 3 Nano has 31.6 billion total parameters and 3.2 billion active parameters (3.6 billion including the embedding layer). In each forward pass it activates fewer than half as many parameters as the previous-generation Nemotron 2 Nano, yet achieves higher accuracy. Compared with Nemotron 2 Nano, Nemotron 3 Nano achieves up to 4x the To ...
A clinical head-to-head win over ChatGPT-5: a Chinese team builds the first OCT imaging AI system
机器之心· 2025-12-16 04:11
Core Viewpoint - The CA-GPT system, a specialized AI for percutaneous coronary intervention (PCI), significantly outperforms the general-purpose model ChatGPT-5 in decision-making for cardiac interventions, marking a breakthrough for AI in medicine [1][3]. Group 1: Performance Comparison - In a clinical study involving 96 patients and 160 lesions, the CA-GPT system achieved a median decision score of 5.0 versus ChatGPT-5's 3.0, a statistically significant advantage (P<0.001) [11]. - CA-GPT selected the correct stent diameter 90.3% of the time, while ChatGPT-5 managed only 63.9%, below even the 72.2% accuracy of junior physicians [11]. - For stent length selection, CA-GPT's accuracy was 80.6% versus ChatGPT-5's 54.2% [11]. Group 2: Clinical Impact - The CA-GPT system can analyze OCT images and generate structured reports in under 20 seconds, cutting interpretation time by over 95% compared with traditional methods [10]. - The system's high stability and accuracy stem from its architecture, which combines small models, big data, and a large model to pair precise analysis with logical reasoning [19][21]. - The CA-GPT system aims to democratize medical expertise, giving junior doctors in remote areas access to top-tier decision support and helping bridge the gap in medical resource distribution [25][26]. Group 3: Technological Framework - The CA-GPT system integrates 13 core functions for structured analysis of OCT images, enabling rapid and accurate decision-making [21]. - It uses the DeepSeek framework for logical reasoning over the precise quantitative data produced by the smaller models, enhancing the reliability of its recommendations [21]. - The system is linked to a knowledge base of over 1 million cardiovascular papers and guidelines, grounding AI decisions in expert consensus [21].
Group 4: Future Implications - The introduction of the CA-GPT system marks a significant milestone for Chinese medical technology, showcasing the ability to define standards in high-end intravascular imaging rather than merely following Western advancements [30]. - This development represents a pivotal moment for AI in healthcare, emphasizing the importance of combining deep learning precision with reasoning capabilities to address real clinical challenges [30][31].
56x faster generative policies: Xi'an Jiaotong University proposes EfficientFlow, a step toward efficient embodied intelligence
机器之心· 2025-12-16 04:11
Core Insights - The article discusses the development of a new generative policy learning method called EfficientFlow, which addresses two major challenges in embodied AI: reliance on large-scale demonstration data and slow inference times [2][3]. Group 1: Technical Highlights - EfficientFlow integrates equivariant modeling with efficient flow matching, significantly improving data efficiency and reducing the number of iterations required for inference, achieving state-of-the-art (SOTA) performance across multiple robotic operation benchmarks [2][19]. - The method introduces an acceleration regularization term in the loss function to encourage smoother and faster trajectory generation, inspired by physical intuition that smooth movements typically have low acceleration [6][19]. - The model employs equivariant networks that allow it to generalize learned actions across different orientations, effectively multiplying the data utility by enabling the model to learn from a single perspective and apply it to various rotations [11][19]. Group 2: Inference Efficiency - EfficientFlow demonstrates remarkable inference efficiency, achieving near-equivalent performance to existing SOTA methods with significantly fewer data and iterations. For instance, it reaches close to the performance of EquiDiff with 100 iterations in just 1 step, resulting in a 56-fold increase in single-step inference speed and nearly 20 times faster for 5-step inference [19]. - The model incorporates a time consistency strategy to ensure coherent action sequences during execution, utilizing overlapping predictions to maintain continuity in behavior [15][19]. - Periodic resets are implemented to enhance the model's ability to explore diverse behaviors while maintaining time consistency, ensuring minimal additional overhead during inference [17][19].
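The acceleration regularization described above can be sketched in a few lines: alongside the standard flow-matching objective, one penalizes the finite-difference acceleration of the learned velocity field along the interpolation path, encouraging smoother trajectories. Everything below (the toy linear velocity field, the coefficient `lam`, the step `delta`) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny "velocity field" v(x, t) = W @ [x, t]; stands in for the policy network.
W = rng.normal(scale=0.1, size=(2, 3))

def velocity(x, t):
    return W @ np.array([x[0], x[1], t])

def efficientflow_loss(x0, x1, t, lam=0.1, delta=1e-2):
    """Flow-matching loss plus a finite-difference acceleration penalty."""
    xt = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                  # its constant target velocity
    v = velocity(xt, t)
    fm = np.sum((v - v_target) ** 2)    # standard flow-matching term
    # Acceleration along the path, estimated by a forward difference:
    xt2 = (1 - (t + delta)) * x0 + (t + delta) * x1
    accel = (velocity(xt2, t + delta) - v) / delta
    return fm + lam * np.sum(accel ** 2)

x0, x1 = rng.normal(size=2), rng.normal(size=2)
loss = efficientflow_loss(x0, x1, t=0.5)
print(loss >= 0.0)  # True
```

A field with near-zero acceleration is close to a straight-line flow, which is what makes 1-step inference viable.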
Alimama releases MUSE: using multimodality to tame ultra-long behavior sequences of 100,000+ actions, and open-sources the Taobao-MM dataset
机器之心· 2025-12-16 04:11
Core Insights - The article discusses the limitations of current recommendation systems, which often suffer from "short-term amnesia" due to computational and storage constraints, leading to the neglect of valuable long-tail data [1][3] - MUSE (Multimodal Search-based framework) is introduced as a solution to enhance user interest modeling by leveraging multimodal information, effectively acting as a "digital hippocampus" for recommendation systems [1][4] - The framework has been successfully implemented in Alibaba's advertising system, demonstrating a significant CTR increase of 12.6% [6][36] Summary by Sections Background and Evolution - The evolution of CTR modeling has transitioned from short-term behavior analysis to long-term behavior modeling, but improvements have plateaued as historical behavior length increases [2][3] - Users accumulate extensive behavior sequences, often exceeding one million actions, but current models typically utilize only a few thousand recent actions due to limitations in processing and storage [3][4] MUSE Framework - MUSE focuses on reorganizing user behavior data through multimodal information to improve the quality and usability of lifelong interest modeling [6][20] - The framework consists of two main components: GSU (General Search Unit) for initial retrieval and ESU (Exact Search Unit) for detailed modeling, both enhanced by multimodal embeddings [20][24] Implementation and Results - MUSE has been fully deployed in Alibaba's advertising system, capable of modeling user behavior sequences of up to 100,000 actions, with ongoing improvements to extend this to millions [6][36] - The implementation has shown that using high-quality multimodal embeddings significantly enhances retrieval and modeling accuracy, leading to improved business outcomes [6][36] Engineering Considerations - The design of MUSE allows for controlled latency despite the complexity of handling long sequences and multimodal data, primarily by decoupling the GSU from the main processing path [31][36] - The system's architecture emphasizes efficient data retrieval and processing, minimizing the impact of network and storage delays on overall performance [36][39]
Industry Implications - MUSE offers valuable insights for industries involved in advertising, content recommendation, and e-commerce, suggesting a shift towards integrating multimodal embeddings and enhancing user interest modeling [37][39] - The framework encourages a reevaluation of existing systems, advocating for a focus on quality embeddings and efficient data handling to unlock new performance improvements [45][47]
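The GSU/ESU split described above amounts to a cheap top-k retrieval pass over the full lifelong sequence, followed by exact attention over the survivors. A minimal sketch with random stand-in embeddings (the dimensions, k, and dot-product scoring are assumptions for illustration, not Alimama's production code):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16
# Multimodal embeddings of one user's lifelong behaviors (100k actions).
behaviors = rng.normal(size=(100_000, d))
query = rng.normal(size=d)                  # embedding of the candidate item/ad

# GSU: cheap retrieval, keep only the top-k most similar behaviors.
k = 256
scores = behaviors @ query
topk = np.argpartition(scores, -k)[-k:]
retrieved = behaviors[topk]

# ESU: exact modeling, target attention over the retrieved subsequence.
att = retrieved @ query / np.sqrt(d)
att = np.exp(att - att.max())
att /= att.sum()
user_interest = att @ retrieved             # attention-pooled interest vector

print(user_interest.shape)  # (16,)
```

Because the expensive attention touches only k items instead of 100,000, the GSU can run off the critical serving path, which is how latency stays controlled.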
Making diffusion models interpretable without degrading quality, opening a new approach to image editing
机器之心· 2025-12-16 02:31
Core Viewpoint - The article discusses the emergence of TIDE (Temporal-Aware Sparse Autoencoders) as a significant advancement in making diffusion models interpretable without sacrificing their generative quality [3][17]. Group 1: Background and Challenges - Over the past three years, diffusion models have dominated the image generation field, with architectures like DiT pushing the limits of image quality [2]. - Despite the growth in explainability research for LLMs, the internal semantics and causal pathways of diffusion models remain largely opaque, making them a "black box" [2]. - Existing attempts at explainability often lead to a noticeable decline in performance, making the pursuit of interpretable diffusion models seem impractical [2]. Group 2: Introduction of TIDE - TIDE is introduced as the first truly temporal-aware framework for diffusion transformers, aiming to reveal the internal mechanisms of these models without compromising their generative capabilities [3][5]. - The framework emphasizes the importance of the temporal aspect of the diffusion process, which unfolds progressively over time [6]. Group 3: Mechanism and Functionality of TIDE - TIDE aligns semantics along the time dimension, allowing for a clearer presentation of the diffusion model's internal processes, such as the emergence of structure from noise and the gradual formation of semantics [7]. - The sparse autoencoder in TIDE enables lossless reconstruction in the feature space, maintaining the stability of the diffusion trajectory while being "observed" [7][10]. Group 4: Performance and Results - TIDE decomposes diffusion features into controllable semantic factors, enhancing image editing capabilities by allowing direct manipulation along clear semantic directions [8][10]. - The impact of TIDE on generative quality is minimal, with FID and sFID changes being less than 0.1%, demonstrating its ability to be interpretable without degrading quality [10][14]. 
- TIDE shows significant improvements in semantic binding and understanding of spatial relationships, with multiple metrics indicating optimal performance [12]. Group 5: Implications and Future Directions - TIDE represents a new research paradigm, suggesting that diffusion models can be interpretable with the right perspective [19]. - Future developments may include more controllable and robust diffusion editing systems, unified understanding of generative models, and advancements in causal and semantic theory research [21][22].
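A sparse autoencoder of the kind TIDE builds on can be sketched in a few lines: an overcomplete encoder, a top-k sparsity constraint, and a decoder that reconstructs the original feature. The weights below are random and untrained (a trained SAE is what makes the reconstruction near-lossless); the sizes and tied decoder are illustrative assumptions, not TIDE's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k = 32, 256, 8          # feature dim, dictionary size, active latents
W_enc = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)
W_dec = W_enc.T.copy()        # tied decoder, for the sketch only

def sae(x):
    """Encode a diffusion feature into a k-sparse code, then reconstruct."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)      # ReLU pre-codes
    keep = np.argsort(z)[-k:]                   # top-k sparsity
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = z[keep]
    return z_sparse, W_dec @ z_sparse

x = rng.normal(size=d)
z, x_hat = sae(x)
print(int((z != 0).sum()) <= k)  # True: at most k active latents
```

Each active latent is a candidate semantic factor; editing then amounts to nudging `z_sparse` along one latent before decoding, and the temporal-aware part of TIDE ties which latents fire to the diffusion timestep.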
Farewell to hand-crafted prompts: a former Meituan executive's startup wants the physical world itself to be the AI's prompt
机器之心· 2025-12-16 02:31
Core Viewpoint - The article discusses the emergence of Looki, an AI hardware aiming to enhance human-machine interaction by transforming real-world scenarios into contextual data, moving from passive responses to proactive engagement [1][2]. Group 1: Looki's Purpose and Technology - Looki aims to fill the gap in large models' "sensory intelligence" by converting real-time visual and auditory signals into contextual data, thus driving AI to think and serve users more effectively [4][28]. - The Looki L1 device, weighing only 30g, is designed to operate as a multi-modal perception system, capturing the physical world continuously and efficiently [6][8]. - The founders of Looki, with backgrounds in autonomous driving and smart hardware, leverage their expertise to adapt perception algorithms from driving to everyday life [9][10]. Group 2: Data Management and User Context - Looki employs a "data flywheel" approach to create a personalized context for users, transforming raw data into structured memories that the AI can efficiently access [12][15]. - The system addresses two major challenges in multi-modal models: understanding long-sequence data and managing context explosion, ensuring privacy and security in data handling [14]. Group 3: Transition to Proactive AI - The article highlights a shift from manual prompts to proactive AI, where enriched context allows for anticipatory actions by the AI, marking a transition from chatbots to agentic AI [17][18]. - Looki's capabilities include automatic video editing, identifying significant moments in users' lives, and providing insights based on accumulated data, thus evolving into a second brain for users [20][24][30]. Group 4: Future Vision - Looki envisions its hardware as a data interface that evolves beyond its current form, aiming to address the data hunger of the physical world and help users accumulate valuable personal data assets [29]. 
- The ultimate goal is for AI to possess a sense of presence, transforming it from a mere tool into an integral part of users' daily lives [30][31].
AAAI 2026 | How trustworthy are video large language models? A comprehensive evaluation of 23 mainstream models
机器之心· 2025-12-15 10:00
Core Insights - The article discusses the development of Trust-videoLLMs, a comprehensive evaluation benchmark for video large language models, addressing challenges in authenticity, safety, fairness, robustness, and privacy [3][6][13]. Evaluation Framework - Trust-videoLLMs includes a systematic, multi-layered, and scalable evaluation system with five core dimensions: - Truthfulness: Video description, temporal understanding, event reasoning, and hallucination suppression - Robustness: Noise interference, temporal disturbance, adversarial attacks, and modality conflict - Safety: Harmful content identification, harmful instruction rejection, deepfake detection, and jailbreak attack defense - Fairness: Stereotype identification, occupational bias, and time sensitivity analysis - Privacy: Privacy content recognition, celebrity privacy protection, and self-inference of privacy [6][9]. Evaluation Tasks - The evaluation tasks cover three main aspects, including contextual reasoning, temporal reasoning, video description, event understanding, and hallucination in videos, among others [8][11]. Model Assessment - The evaluation encompasses 23 mainstream video large language models, including 5 commercial models and 18 open-source models, with varying parameter scales and architectural designs [10][12]. Key Findings - Model size does not equate to stronger performance, as larger models do not necessarily outperform smaller ones [16]. - Closed-source models, such as Claude and Gemini1.5, demonstrate superior safety, privacy protection, and multi-modal alignment compared to open-source models [17]. - Video context significantly impacts safety, as harmful text prompts paired with relevant videos increase the likelihood of generating harmful content [18]. - Fairness issues are prevalent, with models showing biases related to gender, age, and skin color, where closed-source models perform better due to data cleaning and ethical constraints [19]. 
- Privacy protection is a double-edged sword; stronger models can better identify privacy content but also risk inferring private information [20].
Open-source Tools and Data - To promote the development of trustworthy video large models, the team has open-sourced a large-scale video dataset containing 6,955 videos covering multiple scenes and tasks, along with a unified evaluation toolbox [24].
A major update to Thinking Machines' first product: K2 Thinking and Qwen3-VL can now be fine-tuned
机器之心· 2025-12-15 10:00
Core Insights - The article discusses the launch of Tinker, an API developed by Thinking Machines Lab, aimed at simplifying the fine-tuning of language models for developers and researchers [1][4] - Tinker has removed its candidate list, allowing all users to access the platform directly, which marks a significant shift in accessibility for AI model training [1][4] - The article highlights three major updates to Tinker: enhanced reasoning capabilities, a new inference interface compatible with OpenAI API, and the introduction of visual input support through new models [1][4] Group 1: Tinker Overview - Tinker allows developers to focus solely on training data and algorithms, while it manages infrastructure aspects like scheduling and resource management, significantly lowering the barrier to entry for model training [4] - The platform now supports fine-tuning of the Kimi K2 model, which has a trillion parameters, previously accessible only to top-tier labs [4] - Tinker’s visual input capabilities enable users to handle images and visual content in various applications, further broadening its usability [1][4] Group 2: Model Performance and Comparisons - Tinker has been tested with the Qwen3-VL-235B-A22B model on several image classification benchmarks, including Caltech-101, Stanford Cars, Oxford Flowers, and Oxford Pets [4][5] - The performance of Qwen3-VL-235B-A22B was compared to DINOv2, a self-supervised visual transformer, showing superior results in small sample scenarios due to its larger model size and integrated language knowledge [7] - The ability of Qwen3-VL to combine language and visual understanding allows for easier adaptation to various visual tasks beyond image classification [7]
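Few-shot comparisons of frozen embedding models like the Qwen3-VL vs. DINOv2 benchmarks above are commonly run as simple probes over precomputed features; a nearest-centroid probe is the minimal version. The embeddings below are random stand-ins, and the whole setup is a generic sketch, not Tinker's evaluation code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen embeddings (e.g., from a vision backbone):
# 5 classes, 4 labeled examples each ("few-shot"), embedding dim 64.
n_classes, shots, d = 5, 4, 64
centers = rng.normal(size=(n_classes, d))
support = centers[:, None] + 0.3 * rng.normal(size=(n_classes, shots, d))

# Nearest-centroid probe: one prototype per class from the support set.
protos = support.mean(axis=1)

def classify(x):
    """Predict the class whose prototype is nearest in embedding space."""
    return int(np.argmin(((protos - x) ** 2).sum(axis=1)))

test_x = centers[2] + 0.3 * rng.normal(size=d)
print(classify(test_x))
```

Better embeddings put same-class points closer together, so this probe's accuracy is a direct read-out of embedding quality in the small-sample regime the article describes.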
NeurIPS 2025 | Hit wherever you aim: a controllable adversarial example generator is here!
机器之心· 2025-12-15 08:10
Core Viewpoint - The article discusses the introduction of a novel adversarial attack generation framework called Dual-Flow, developed by Tsinghua University and Ant Group, which can generate effective adversarial samples without relying on the target model structure or gradients, posing significant challenges to AI security [2][5]. Group 1: Dual-Flow Framework - Dual-Flow learns "universal perturbation patterns" from vast image datasets, enabling it to launch black-box attacks across various models and categories [2][5]. - The framework employs a "forward perturbation modeling - conditional backward optimization" dual-flow structure, achieving high transferability and success rates for adversarial samples while maintaining low visual differences [2][5][8]. - It acts as a "controllable adversarial sample generator," allowing users to specify target image categories for automatic generation of realistic and effective attack images [2][5]. Group 2: Limitations of Traditional Methods - Traditional methods face two major limitations: instance-specific attacks, which have high success rates but are limited to single images and lack transferability [6], and instance-agnostic attacks, which have limited transferability and lower success rates when targeting multiple models or categories [7][8]. Group 3: Innovations of Dual-Flow - The core innovation of Dual-Flow lies in its forward and backward flow structure, which generates more natural, concealed, and structured perturbations compared to traditional pixel-level noise methods, while maintaining high transferability [9][22]. - Dual-Flow's unified framework supports multi-target and instance-agnostic attack capabilities, allowing a single generator to cover multiple categories and models, significantly reducing costs and enhancing practicality [10][22]. 
Group 4: Experimental Results - Experimental results on the ImageNet NeurIPS validation set indicate that Dual-Flow demonstrates strong transferability in both single-target and multi-target attacks, with average success rates significantly higher than traditional methods in black-box environments [17][18]. - Even against adversarially trained models, Dual-Flow maintains high success rates, showcasing its generality and powerful attack capabilities in real-world scenarios [19][22]. - The technology has been integrated into Ant Group's identity security products, optimizing capabilities for adversarial sample generation and detection, thereby enhancing the robustness of defense systems against adversarial samples [24].
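At inference time, a controllable generator of this kind reduces to: condition on a target class, emit a perturbation, clip it to the attack budget, and add it to the image. The sketch below uses a fixed per-class pattern as a stand-in for the trained conditional generator (the names, sizes, and 8/255 budget are illustrative assumptions, not Dual-Flow's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

h, w, n_classes = 8, 8, 10
eps = 8 / 255                                   # L-inf perturbation budget

# Stand-in for a trained conditional generator: one learned perturbation
# "pattern" per target class (a real generator conditions a network instead).
patterns = rng.normal(scale=0.05, size=(n_classes, h, w))

def attack(image, target_class):
    """Apply a class-conditioned perturbation, clipped to the budget."""
    delta = np.clip(patterns[target_class], -eps, eps)
    return np.clip(image + delta, 0.0, 1.0)     # keep a valid image

clean = rng.uniform(size=(h, w))
adv = attack(clean, target_class=3)
print(float(np.abs(adv - clean).max()) <= eps + 1e-12)  # True: within budget
```

The "controllable" part is exactly the `target_class` argument: one generator covers every class and image, which is what distinguishes this setting from per-instance gradient attacks.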