机器之心
After five years, Transformers v5 finally arrives
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the release of v5.0.0rc0, the first release candidate of the Transformers library, marking the transition from version 4 to version 5 after a five-year technical cycle [2]
- The library's usage has grown dramatically: daily downloads rose from 20,000 at the time of v4's release to over 3 million today, and total installations have surpassed 1.2 billion [2]
- The core focus of the v5 update is simplicity, pre-training, interoperability with high-performance inference engines, and making quantization a core feature [2][3]

Evolution and Features
- v5 establishes PyTorch as the sole core backend and emphasizes four key dimensions of evolution: extreme simplicity, a shift from fine-tuning to pre-training, interoperability with high-performance inference engines, and enhanced quantization capabilities [2]
- The team aims for a clean and clear model integration approach, promoting broader standardization and stronger generality [4]
- Over the past five years, on average 1-3 new models have been added weekly, with the goal of becoming the only trusted source for model definitions [4]

Modular Design and Tools
- Hugging Face has advanced a modular design approach, simplifying maintenance and speeding up integration while fostering community collaboration [6]
- The introduction of the AttentionInterface provides a centralized abstraction layer for attention mechanisms, streamlining the management of common auxiliary functions (see the sketch below) [8]
- Tools are being developed to identify similarities between new models and existing architectures, aiming to automate conversion of new models into the Transformers format [9][10]

Training Enhancements
- v5 expands support for pre-training, with redesigned model initialization and support for optimized forward- and backward-propagation operators [15][16]
- Hugging Face continues to collaborate closely with fine-tuning tools in the Python ecosystem and ensures compatibility with tools in the JAX ecosystem [17]

Inference Improvements
- Inference is a key focus of the v5 update, introducing dedicated kernels, cleaner default settings, new APIs, and optimized support for inference engines [18][19]
- v5 aims to complement specialized inference engines rather than replace them, ensuring compatibility with engines such as vLLM, SGLang, and TensorRT-LLM [21]

Local Deployment and Quantization
- The team collaborates with popular inference engines so that Transformers can be used as a backend, increasing the value of every model added to Transformers [23]
- Quantization is positioned as a core capability of Transformers, ensuring compatibility with major functionality and providing a reliable framework for training and inference [27]
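For readers who want to see what the AttentionInterface abstraction looks like in practice, here is a minimal sketch of registering a custom attention backend. It assumes a recent transformers release where the interface is available; exact signatures vary between versions, and the backend name and model id are only illustrative.

```python
from typing import Optional

import torch
from transformers import AttentionInterface, AutoModelForCausalLM


def my_eager_attention(
    module: torch.nn.Module,
    query: torch.Tensor,                 # (batch, heads, q_len, head_dim)
    key: torch.Tensor,                   # (batch, kv_heads, kv_len, head_dim)
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    # Expand KV heads for grouped-query attention models.
    groups = getattr(module, "num_key_value_groups", 1)
    if groups > 1:
        key = key.repeat_interleave(groups, dim=1)
        value = value.repeat_interleave(groups, dim=1)

    # Plain scaled dot-product attention; a real backend would call an optimized kernel here.
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask[..., : key.shape[-2]]
    attn_weights = torch.softmax(attn_weights, dim=-1)
    attn_weights = torch.nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights


# Register once, then select the backend by name when loading any supported model.
AttentionInterface.register("my_eager", my_eager_attention)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",                 # example model id; any causal LM works
    attn_implementation="my_eager",
)
```

Because the backend is looked up by name at load time, optimized kernels (Flash Attention variants, paged attention from serving engines) can be swapped in without touching model code, which is part of what makes the engine interoperability described above practical.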
New open-source release from Huawei: a diffusion language model breaks through 32K context and unlocks "slow thinking"
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the paradigm shift in text generation from auto-regressive models to diffusion language models, highlighting the limitations of long-sequence training and the recent advances made by Huawei with the openPangu-R-7B-Diffusion model [1][14]

Model Performance
- openPangu-R-7B-Diffusion achieved new state-of-the-art (SOTA) records on multiple benchmarks, demonstrating superior performance in general capabilities, mathematical reasoning, and code generation compared to peer models [2][3]
- On the MMLU benchmark, openPangu-R-7B-Diffusion scored 81.66, surpassing LLaDA 2.0-mini-preview by 9.17 points [2]
- On mathematical reasoning (MATH), it reached 84.26, a significant lead over comparable models [3]

Architectural Innovations
- The model incorporates an innovative causal attention mask architecture that allows seamless migration from auto-regressive to BlockDiffusion models, addressing the architectural adaptation challenge (see the sketch below) [5][7]
- By retaining causal attention characteristics, the model reduces adaptation costs and maximizes compatibility with the pre-trained knowledge of auto-regressive models [8][10]

Training and Inference Efficiency
- The training strategy of openPangu-R-7B-Diffusion optimizes the BlockDiffusion approach, improving the model's efficiency [10]
- The model offers dual-mode decoding, allowing users to balance generation quality and speed through different sampling settings [15]

Conclusion
- The release of openPangu-R-7B-Diffusion marks a significant advance in the ability of diffusion models to handle complex long texts, showing that they can deliver both speed and depth [14]
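To make the block-diffusion idea concrete, below is a minimal sketch of the block-wise attention mask such models commonly use: tokens attend bidirectionally within their own block and causally to all earlier blocks. The exact mask and block size used by openPangu-R-7B-Diffusion are not given in the summary, so this layout is an illustrative assumption.

```python
import torch


def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask where True = attention allowed (rows are queries, columns are keys).
    Tokens attend to every token in earlier blocks and to all tokens in their own block."""
    block_id = torch.arange(seq_len) // block_size
    # query's block index >= key's block index covers both "same block" and "earlier block"
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)


mask = block_diffusion_mask(seq_len=8, block_size=4)
print(mask.int())
# Block 0 (positions 0-3) sees only itself; block 1 (positions 4-7) sees blocks 0 and 1 in full,
# so generation proceeds block by block while attention stays causal at the block level.
```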
NeurIPS 2025 | CAKE: A new LLM-driven recipe for Bayesian optimization that makes black-box optimization smarter and more efficient
机器之心· 2025-12-02 06:47
Core Insights
- The article presents Context-Aware Kernel Evolution (CAKE), a new method for Bayesian optimization that uses large language models (LLMs) to dynamically design Gaussian process (GP) kernel functions during the optimization process [5][6][14]

Group 1: Methodology
- CAKE reframes kernel design as an evolutionary process, using LLMs to generate new kernel functions based on the observations collected so far [17]
- The system maintains a population of kernel functions and applies genetic operations such as crossover and mutation to evolve them [19]
- BIC-Acquisition Kernel Ranking (BAKER) ranks kernel functions by combining model fit with sampling potential, balancing optimization and exploration (see the sketch below) [21][22]

Group 2: Experimental Results
- CAKE was tested against three families of baselines: fixed (a single SE or M5 kernel), adaptive (random or BIC-based selection), and compositional methods [25]
- In hyperparameter optimization tasks, CAKE achieved the highest final accuracy across all tested machine learning models and showed high sample efficiency, especially in the early stages of optimization [27]
- In dynamic simulation tasks, CAKE outperformed all baselines, showing robustness to environmental changes and achieving high scores on challenging tasks [28]

Group 3: Advantages and Future Directions
- CAKE offers strong interpretability, producing human-readable explanations of the kernel structures generated during optimization [34][37]
- The framework is expected to evolve further by incorporating a more general kernel-function syntax and by extending its core ideas to other machine learning tasks, such as SVM and kernel PCA [42]
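As a concrete illustration of the model-fit half of BAKER, the sketch below ranks a small population of Gaussian process kernels by BIC using scikit-learn; the LLM-driven crossover/mutation step and the acquisition term are omitted, and the kernel population and toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern


def bic_score(gpr: GaussianProcessRegressor, n_points: int) -> float:
    """Bayesian Information Criterion of a fitted GP (lower is better)."""
    k = gpr.kernel_.theta.size                       # number of kernel hyperparameters
    return -2.0 * gpr.log_marginal_likelihood_value_ + k * np.log(n_points)


# Toy observations standing in for evaluations of the black-box objective.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(20, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(20)

# A small kernel "population"; in CAKE an LLM proposes, crosses over, and mutates
# candidates like these instead of using a hard-coded list.
population = {"SE": RBF(), "M5": Matern(nu=2.5), "SE*M5": RBF() * Matern(nu=2.5)}

scores = {}
for name, kernel in population.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    scores[name] = bic_score(gpr, n_points=len(X))

for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: BIC = {s:.2f}")
```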
Is architectural decoupling necessary for unified multimodal models? The new AIA loss says no
机器之心· 2025-12-02 05:07
Core Insights
- The rapid development of unified understanding-and-generation models has been hampered by conflicts between visual understanding and generation tasks [2]
- Researchers from CUHK MMLab and Meituan believe that unified models will eventually match single-task models in performance, but they question whether the current approach of decoupling architectures is truly the right path [2][3]

Unified Model Intent
- The original intent of unified models is to enhance single-task performance through a transparent, well-grounded process of interleaved text-and-image reasoning [3]
- Examples include generating corresponding images while navigating mazes, or drawing auxiliary lines while solving mathematical problems [3]

Architecture Decoupling Issues
- Models like BAGEL require complex pipelines to achieve interleaved reasoning, leading to significant computational overhead and potential information loss [3]
- Despite current performance gains, the researchers warn that these issues may become more pronounced as research progresses [3]

AIA Introduction
- To understand why architecture decoupling improves performance, and to find ways to improve model performance without it, CUHK MMLab and Meituan introduced AIA [5]

Research Findings
- Regardless of how models are decoupled, understanding and generation tasks exhibit a negative correlation at the same network layer [8]
- This indicates that decoupling does not fundamentally resolve the conflict between the two tasks [8]

AIA Loss Design
- The AIA loss explicitly constrains the interaction patterns of unified models during training, using the cross-modal interaction patterns of single-task models as the learning target (see the sketch below) [10]

AIA Effectiveness
- Experiments on Emu3 and Janus-Pro showed that AIA improves model performance without additional tricks, narrowing the gap with more heavily decoupled models [12]

AIA Training Sensitivity
- The AIA loss converged stably across a wide range of weight settings during training, particularly for Emu3, whose pre-trained knowledge is weaker [17]
- By contrast, Janus-Pro's strong pre-trained knowledge made it more sensitive to the AIA loss weight [17]

AIA Advantages
- Introducing the AIA loss mitigates common data-ratio issues, achieving better results with a 1:1 ratio of generation to understanding data and indicating a collaborative optimization effect [19]

Unified Model Training Path
- Dynamically allocating task weights during unified training may be the correct behavior for unified models, suggesting that task conflict is a natural characteristic rather than a problem to avoid [21]
- An alternative approach removes task-differentiation cues to force the model to learn a truly unified space, though this increases training difficulty [22]

Future Outlook
- AIA is an initial step in analyzing the principles of unified model training, and the authors call for more researchers to explore this field [24]
- The theory and architecture of unified models are still immature, requiring collaborative exploration [24]
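The exact form of the AIA loss is not reproduced in this summary, but the sketch below illustrates the general idea of constraining cross-modal interaction: a layer-wise statistic of text-to-image attention mass in the unified model is pulled toward the same statistic measured on single-task reference models. The MSE form, the choice of statistic, and all tensor shapes are illustrative assumptions, not the published formulation.

```python
import torch


def cross_modal_mass(attn: torch.Tensor, text_idx: torch.Tensor, image_idx: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, seq, seq) attention weights of one layer.
    Returns the mean attention mass flowing from text queries to image keys."""
    return attn[:, :, text_idx][:, :, :, image_idx].sum(dim=-1).mean()


def aia_style_loss(unified_attns, reference_attns, text_idx, image_idx) -> torch.Tensor:
    """Pull the unified model's layer-wise cross-modal statistic toward the reference's."""
    loss = torch.zeros(())
    for a_u, a_r in zip(unified_attns, reference_attns):
        target = cross_modal_mass(a_r, text_idx, image_idx).detach()
        loss = loss + (cross_modal_mass(a_u, text_idx, image_idx) - target) ** 2
    return loss / len(unified_attns)


# Toy usage: 2 layers, 10 tokens of which the first 6 are text and the last 4 are image.
B, H, S = 1, 4, 10
text_idx, image_idx = torch.arange(0, 6), torch.arange(6, 10)
rand_attn = lambda: torch.softmax(torch.randn(B, H, S, S), dim=-1)
unified = [rand_attn() for _ in range(2)]
reference = [rand_attn() for _ in range(2)]
print(aia_style_loss(unified, reference, text_idx, image_idx))
```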
The Qianwen App just packed Google's and OpenAI's paid-only tricks into your phone, and for free?
机器之心· 2025-12-02 05:07
Core Insights
- The article discusses the significant updates to the Qianwen App, which integrates two advanced visual models, Qwen-Image and Wan 2.5, making them accessible to ordinary users without technical expertise [1][4][36]

Group 1: Qwen-Image Model
- Qwen-Image is recognized for its strong visual-logic understanding, allowing it to accurately interpret complex spatial relationships and geometric structures, outperforming many existing models [8][9][65]
- The model excels at maintaining identity consistency during image editing, which is crucial for users seeking reliable results in complex scenarios [18][32]
- Qwen-Image has shown impressive performance on multi-image fusion tasks, allowing seamless integration of different visual elements while preserving their unique characteristics [29][32]

Group 2: Wan 2.5 Model
- Wan 2.5 represents a breakthrough in AI video generation, enabling native audio-visual synchronization, which enhances the user experience by eliminating the need for separate audio processing [34][68]
- The model can generate videos that include original music and dialogue, showcasing its ability to understand and integrate multiple modalities [43][70]
- Wan 2.5's architecture allows it to process text, image, video, and audio signals simultaneously, facilitating complex creative tasks that were previously challenging [68][70]

Group 3: User Accessibility and Integration
- The integration of these models into the Qianwen App eliminates barriers for users, allowing them to create high-quality visual and audio content without coding skills or expensive hardware [4][75]
- The app serves as a comprehensive platform for multi-modal generation, enabling users to move smoothly from image creation to video production within a single interface [45][47]
- This development reflects Alibaba's long-term investment in building a robust ecosystem of multi-modal generative models, positioning it as a leader in the AI creative tools market [72][74]
The mystery video model topping the leaderboards has just been identified: it was "David" all along
机器之心· 2025-12-02 00:17
Core Insights
- Runway's Gen-4.5 has emerged as the leading state-of-the-art (SOTA) video generation model, setting new industry standards in motion quality, prompt adherence, and visual realism [1][3][8]

Model Performance
- Gen-4.5 achieved an ELO score of 1247, surpassing competitors such as Veo 3/3.1, Kling 2.5, and Sora 2 Pro, with unprecedented visual realism and creative-control capabilities [3][6][8]
- The model maintains speed and efficiency while delivering significant quality improvements, making advanced video generation accessible to creators at every scale [8][20]

Key Features
- Precise prompt adherence: Gen-4.5 demonstrates exceptional physical accuracy and visual detail, accurately portraying object motion, fluid dynamics, and intricate surface details [11][12]
- Expressive characters: the model can depict nuanced emotions and lifelike facial detail, enhancing character representation [14]
- Stylized control and visual consistency: it supports a wide range of aesthetic styles, from photorealism to stylized animation, while maintaining a coherent visual language [16][18]

Deployment and Limitations
- Gen-4.5 is built on NVIDIA architecture, with training efficiency and inference speed optimized through collaboration with NVIDIA [20]
- Despite its advances, Gen-4.5 exhibits limitations common to video generation models, such as causal-reasoning issues and object-permanence challenges [21][22]
NVIDIA unveils a reasoning VLA: Alpamayo-R1 makes autonomous-driving AI better at thinking
机器之心· 2025-12-02 00:17
Group 1
- The core challenge in autonomous driving is not just perception but understanding the reasoning behind the actions the model takes [1]
- Traditional end-to-end systems struggle with rare but critical scenarios, which can lead to accidents [1][2]
- NVIDIA's Alpamayo-R1 introduces a reasoning capability that allows the vehicle to infer causal relationships before making decisions [1][6]

Group 2
- Alpamayo-R1 is built on a new dataset called Chain of Causation (CoC), which records not only the actions taken but also the reasons for those actions [2][3]
- The model employs a diffusion-based trajectory decoder to generate feasible driving trajectories under real-time constraints [5]
- A multi-stage training strategy is used: first basic mapping from vision to action, then supervised fine-tuning on CoC data, and finally reinforcement learning for optimization [6][15]

Group 3
- Alpamayo-R1 shows significant performance improvements, particularly in long-tail scenarios where traditional models often fail [6][20]
- The model's input consists of multi-camera and temporal observations, allowing for integrated multi-modal semantic understanding [8]
- The CoC dataset employs a human-machine collaborative annotation mechanism, resulting in improved planning accuracy and reduced error rates [10][11]

Group 4
- Training proceeds in three phases: supervised fine-tuning, CoC supervision, and reinforcement-learning-based post-training optimization [15][17]
- The model incorporates a multi-dimensional reward mechanism to enhance reasoning accuracy and action consistency (see the sketch below) [17]
- The design of AR1 represents a shift from "black box" to "white box" autonomous driving, enabling the model to explain its decisions [19][20]

Group 5
- The significance of Alpamayo-R1 lies not only in performance gains but also in closing the loop between AI reasoning and physical action [20][21]
- By explaining its decisions, the model aims to ensure safety and build trust in autonomous driving [21]
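As an illustration of what a multi-dimensional reward of the kind described above might look like, the sketch below combines separate critic scores for reasoning quality, reasoning-action consistency, and trajectory quality into a single scalar; the dimensions, weights, and scoring functions are assumptions made for illustration, not the published Alpamayo-R1 reward.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    reasoning_quality: float = 0.4     # is the causal explanation well-formed and grounded
    action_consistency: float = 0.4    # does the planned trajectory follow from the stated reasoning
    trajectory_quality: float = 0.2    # smoothness / feasibility of the trajectory itself


def combined_reward(scores: dict, w: RewardWeights = RewardWeights()) -> float:
    """scores maps each dimension name to a value in [0, 1] produced by a separate critic."""
    return (w.reasoning_quality * scores["reasoning_quality"]
            + w.action_consistency * scores["action_consistency"]
            + w.trajectory_quality * scores["trajectory_quality"])


print(combined_reward({"reasoning_quality": 0.9, "action_consistency": 0.8, "trajectory_quality": 0.7}))
```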
AAAI 2026 | The first encrypted fingerprinting / watermarking scheme for large models that resists end-to-end attacks
机器之心· 2025-12-01 09:30
Core Insights
- The article presents iSeal, an encrypted fingerprinting solution designed to protect the intellectual property of large language models (LLMs) against advanced attacks [2][3][5]

Research Background
- Training a large language model often costs millions of dollars, making the model weights valuable intellectual property; researchers typically assert ownership with model fingerprinting techniques that embed triggers producing characteristic responses [6][7]
- Existing fingerprinting methods assume the verifier faces a black-box API, which is unrealistic: advanced attackers can steal model weights outright and deploy them locally, gaining end-to-end control [7][10]

iSeal Overview
- iSeal is the first encrypted fingerprinting scheme designed for end-to-end model-theft scenarios; it introduces encryption mechanisms that resist collusion-based unlearning and response-manipulation attacks, achieving a 100% verification success rate across 12 mainstream LLMs [3][12]

Methodology and Innovations
- iSeal's framework transforms fingerprint verification into a secure encrypted interaction protocol, built on three components:
  - **Encrypted Fingerprinting and External Encoder**: an encrypted fingerprint-embedding mechanism and an external encoder decouple fingerprints from model weights, preventing attackers from reverse-engineering the fingerprints [15]
  - **Confusion & Diffusion Mechanism**: fingerprint features are bound to the model's core reasoning capabilities so they cannot be separated, resisting attempts to erase individual fingerprints [15]
  - **Similarity-based Dynamic Verification**: a similarity-based verification strategy with error correction identifies fingerprint signals even when attackers manipulate outputs through paraphrasing or synonym replacement (see the sketch below) [15][18]

Experimental Results
- In experiments on models such as LLaMA and OPT, iSeal maintained a 100% verification success rate even under advanced attacks, while traditional fingerprinting methods failed after minor fine-tuning [17][18]
- The results show that iSeal's design prevents attackers from breaking the overall verification structure by erasing parts of the fingerprint [17][21]

Ablation Studies
- Ablation studies confirmed the necessity of iSeal's key components: without freezing the encoder, or when a learned encoder is used instead, the verification success rate drops to near zero [20][21]
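The sketch below illustrates the similarity-based dynamic verification idea: each trigger's observed response is compared to the expected response in embedding space rather than by exact match, and ownership is claimed only if enough triggers pass. The embedding choice, threshold, and quorum are illustrative assumptions, not iSeal's actual parameters.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def verify_fingerprint(observed_embs, expected_embs, sim_threshold=0.85, quorum=0.7) -> bool:
    """Each pair holds embeddings of the observed and expected response for one trigger.
    Paraphrasing or synonym replacement perturbs the embedding only slightly, so matches
    survive word-level edits, and the quorum tolerates a few fully manipulated responses."""
    passes = [cosine(o, e) >= sim_threshold for o, e in zip(observed_embs, expected_embs)]
    return sum(passes) / len(passes) >= quorum


# Toy usage: 10 triggers whose responses the attacker has lightly paraphrased (small noise).
rng = np.random.default_rng(0)
expected = [rng.standard_normal(16) for _ in range(10)]
observed = [e + 0.05 * rng.standard_normal(16) for e in expected]
print(verify_fingerprint(observed, expected))   # True
```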
A 13-year-old builds a startup with "vibe coding", meets Sam Altman and visits a16z; his summer puts grown-ups to shame
机器之心· 2025-12-01 09:30
Core Viewpoint
- The article highlights the emergence of young entrepreneurs in the tech industry, exemplified by 13-year-old Michael Goldstein, who is leveraging AI tools like ChatGPT to innovate and create startups, reflecting a shift in how the younger generation engages with technology and entrepreneurship [2][6][34]

Group 1: Young Entrepreneurs and AI
- Michael Goldstein represents a new wave of tech-savvy youth actively participating in the startup ecosystem, showcasing a blend of creativity and technical skills [4][6]
- The influence of AI, particularly tools like ChatGPT, is noted as a significant factor empowering today's youth to pursue entrepreneurial ventures, in contrast with previous trends dominated by social media platforms [6][34]
- Goldstein's approach to coding, termed "vibe coding", emphasizes conceptual understanding over traditional coding skills, allowing him to create an AI startup despite limited coding experience [8][10]

Group 2: Startup Journey and Challenges
- Goldstein's entrepreneurial journey includes seeking advice from industry leaders like Sam Altman, indicating the importance of mentorship and networking in the startup landscape [10][32]
- After facing challenges with his initial project, Goldstein pivoted to a new venture focused on AI design, demonstrating adaptability in the face of entrepreneurial hurdles [10][11]
- His AI design tool, Kodo, aims to assist users in creating visual content, although it currently faces limitations in execution and in understanding user prompts [16][18][24]

Group 3: Societal Perspectives and Concerns
- The rise of young entrepreneurs has sparked debate about children engaging in tech entrepreneurship, with some expressing concern over the potential loss of childhood experiences [32]
- There is growing concern about the risks of children using AI technologies, highlighted by legislative discussions around restricting minors' access to AI [32][34]
- While Silicon Valley is enthusiastic about young innovators, a counter-narrative emphasizes the need for a balanced approach to youth engagement with technology [32][34]
NeurIPS 2025 | DePass: Unified feature attribution via single-forward-pass decomposition
机器之心· 2025-12-01 04:08
Core Viewpoint
- The article introduces DePass, a unified feature-attribution framework that improves the interpretability of large language models (LLMs) by precisely attributing model outputs to internal computations [3][11]

Group 1: Introduction of DePass
- DePass was developed by a research team from Tsinghua University and Shanghai AI Lab to address the limitations of existing attribution methods, which are often computationally expensive and lack a unified analysis framework [3][6]
- The framework decomposes the hidden states of a single forward pass into additive components, enabling precise attribution of model behavior without modifying the model structure [7][11]

Group 2: Implementation Details
- In the attention module, DePass freezes the attention scores and applies the linear transformations to the decomposed hidden states, distributing the information flow exactly across components (see the sketch below) [8]
- For the MLP module, it treats the neurons as a key-value store, effectively partitioning the contributions of different components to the same token [9]

Group 3: Experimental Validation
- DePass is validated on token-level, model-component-level, and subspace-level attribution tasks, demonstrating its effectiveness across all three [11][13]
- In token-level experiments, removing the most critical tokens identified by DePass significantly decreased model output probabilities, indicating that it captures the evidence driving predictions [11][14]

Group 4: Comparison with Existing Methods
- Existing attribution methods, such as noise ablation and gradient-based methods, struggle to provide fine-grained explanations and often incur high computational costs [12]
- DePass outperforms traditional importance metrics at identifying significant components, showing higher sensitivity and completeness in its attribution results [15]

Group 5: Applications and Future Potential
- DePass can track the contributions of specific input tokens to particular semantic subspaces, enhancing the model's controllability and interpretability [13][19]
- The framework is expected to serve as a general-purpose tool in mechanistic interpretability research, facilitating exploration across various tasks and models [23]
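The additive decomposition behind DePass's attention-side attribution can be verified in a few lines: with attention scores held fixed, the value/output path is linear, so each output position is an exact sum of per-source-token contributions. The single-head, bias-free setup below is a simplified illustration of that principle, not the released DePass implementation.

```python
import torch

torch.manual_seed(0)
seq, d = 5, 8
x = torch.randn(seq, d)                        # hidden states entering the attention block
W_v = torch.randn(d, d) / d ** 0.5             # value projection (single head, no bias)
W_o = torch.randn(d, d) / d ** 0.5             # output projection
A = torch.softmax(torch.randn(seq, seq), -1)   # attention weights, treated as frozen constants

# Standard computation: out[j] = sum_i A[j, i] * ((x[i] @ W_v) @ W_o)
out = (A @ (x @ W_v)) @ W_o

# Decomposed computation: contribution of source token i to output position j.
contrib = A.unsqueeze(-1) * ((x @ W_v) @ W_o).unsqueeze(0)   # shape (j, i, d)
assert torch.allclose(contrib.sum(dim=1), out, atol=1e-5)    # per-token contributions add up exactly
```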