机器之心
OpenAI's new paper dissects the internal mechanisms of language models: explaining model behavior with "sparse circuits"
机器之心· 2025-11-14 09:30
Core Insights
- The article discusses OpenAI's new research on training smaller, sparser models that are easier to interpret, addressing the "black box" nature of large language models [1][12][26]

Group 1: Model Transparency and Interpretability
- Most large language models operate as "black boxes," making it difficult for even experts to understand their internal processes [1]
- Enhancing model transparency can help analyze and explain issues like hallucinations and unstable behavior in language models [1][12]
- OpenAI's research aims to isolate small circuits within sparse models that are responsible for specific tasks, providing unprecedented insights into language model operations [7][12]

Group 2: Sparse Model Training Methodology
- OpenAI's approach involves training models with sparse weights, limiting connections between neurons to simplify the model's structure (a minimal sketch of this idea follows this summary) [14][26]
- The research shows that training larger and sparser models can lead to simpler and more interpretable circuits, which can effectively perform specific tasks [17][19]
- The study highlights a specific task where the model must choose the correct type of quotation marks in Python code, demonstrating the model's ability to isolate and execute simple behaviors [19][22]

Group 3: Future Directions and Challenges
- OpenAI acknowledges that while this research is a step towards understanding model computations, there is still a long way to go [26]
- Future efforts will focus on scaling these techniques to larger models and explaining more complex behaviors [26]
- OpenAI is exploring two pathways to improve the efficiency of training sparse models: extracting sparse circuits from existing dense models and developing more efficient interpretability-guided training techniques [26]
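Below is a minimal, hypothetical sketch in PyTorch of the weight-sparsity idea described above: a linear layer whose connections are restricted by a fixed binary mask, so each neuron takes part in only a few pathways. The layer sizes, mask density, and toy objective are illustrative assumptions, not OpenAI's actual training setup.

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Linear layer whose connections are limited by a fixed binary mask."""
    def __init__(self, d_in, d_out, density=0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_out))
        # Only `density` of the connections may ever be nonzero (hypothetical value).
        self.register_buffer("mask", (torch.rand(d_out, d_in) < density).float())

    def forward(self, x):
        # Masked weights never receive gradient, so the sparsity pattern is preserved.
        return x @ (self.weight * self.mask).t() + self.bias

model = nn.Sequential(SparseLinear(64, 256), nn.ReLU(), SparseLinear(256, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 64)
loss = (model(x) - x).pow(2).mean()  # toy reconstruction objective
loss.backward()
opt.step()
```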
Baidu unveils its secret weapon: a self-evolving AI that delivers optimal solutions beyond human reach
机器之心· 2025-11-14 09:30
Core Insights
- The article discusses the rapid evolution of AI from mere executor to inventor, highlighting the introduction of Baidu's FM Agent, a self-evolving intelligent agent capable of solving complex problems autonomously [1][6][30]

Group 1: AI Capabilities and Innovations
- FM Agent can autonomously generate and optimize algorithms, significantly reducing the time required for tasks that would take human experts days or even weeks [4][8]
- The system combines large language models with evolutionary search algorithms to tackle real-world problems, demonstrating a leap from executing commands to discovering solutions independently (a schematic of such a loop is sketched after this summary) [6][8]
- The agent's performance has been validated in various benchmarks, achieving a medal rate of 43.56% on MLE-Bench, outperforming the human median by 51.56% [13]

Group 2: Technical Features
- FM Agent employs four core capabilities: automated machine learning pipelines, combinatorial optimization, GPU kernel generation, and mathematical problem solving [13][14]
- The system operates through a workflow that includes cold-start initialization, adaptive diversity sampling, and a distributed asynchronous infrastructure built on the Ray framework [12][14]

Group 3: Industry Applications
- FM Agent has shown effectiveness in multiple sectors, including finance, urban traffic optimization, and large-scale engineering projects, providing solutions that are faster and more efficient than traditional methods [18][25]
- The agent can abstract real-world problems into mathematical algorithms, continuously iterating and optimizing solutions against clear evaluation metrics [18][20]

Group 4: Future Implications
- The emergence of FM Agent signals a shift toward a new paradigm in which humans define problems and AI executes solutions, potentially transforming productivity across industries [22][30]
- Baidu's FM Agent has already attracted over 1,000 enterprises for testing, indicating strong interest and potential for widespread application in sectors like transportation, energy, and finance [32][33]
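The loop below is a minimal, hypothetical sketch of the LLM-plus-evolutionary-search pattern described above: candidates are scored against an objective, the best are kept, and an LLM proposes mutated variants. `llm_propose` and `evaluate` are placeholders, not Baidu's FM Agent APIs.

```python
import random

def llm_propose(parent: str) -> str:
    """Placeholder for an LLM call that rewrites or mutates a candidate solution."""
    return parent + f" tweak#{random.randint(0, 999)}"

def evaluate(candidate: str) -> float:
    """Placeholder for the task's evaluation metric (higher is better)."""
    return random.random()

def evolve(seed: str, generations: int = 10, population: int = 8) -> str:
    pool = [(evaluate(seed), seed)]
    for _ in range(generations):
        # Keep the best candidates, ask the LLM for variants, and re-score.
        parents = [c for _, c in sorted(pool, reverse=True)[:population]]
        children = [llm_propose(p) for p in parents]
        pool += [(evaluate(c), c) for c in children]
        pool = sorted(pool, reverse=True)[:population]
    return pool[0][1]

best = evolve("initial heuristic for the scheduling problem")
print(best)
```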
2025 Baoshan Intelligent Robot Industry Conference and Carnival: join us for a technology feast
机器之心· 2025-11-14 09:30
Core Insights - The "2025 Baoshan Intelligent Robot Industry Conference and Carnival" will take place on November 21-22, 2025, featuring industry forums, startup roadshows, exhibitions, and interactive experiences [2][9] - The event will unveil Baoshan District's three-year action plan for robotics and a series of industrial service platforms [2] - Four major forums will cover the entire industry chain, focusing on macro trends, technological breakthroughs, core components, and innovation ecosystems [2] Industry Forums - The main forum will gather top experts to discuss the macro blueprint and trends in the industry [2] - The embodied intelligence and robotics forum will focus on humanoid robots, service robots, and the practical applications of embodied data systems [2] - The core technology and components forum will address breakthroughs in domestic core components such as joint modules, sensors, and servo drives [2] - The innovation and talent forum will facilitate cross-industry dialogues to activate talent and innovation synergies [2] Startup Roadshow - The afternoon of the second day will feature a roadshow showcasing cutting-edge robotics startups, including projects in core areas like body, simulation technology, and machine vision [4] - Leading investment institutions will be present to support quality startup projects, fostering partnerships for industrial implementation [4] Exhibition Highlights - The exhibition will showcase nearly 30 leading companies in the robotics and components sector, featuring humanoid robots, industrial robots, sensors, and servo drives [6] - Attendees will experience the latest achievements and strengths of China's intelligent robotics technology [6] Interactive Experiences - The event will offer various interactive experiences, including performances by robots, martial arts competitions, and AI interactions [8] - Attendees can enjoy unique experiences such as AI calligraphy and personalized coffee brewing, highlighting the potential of technology in daily life [8]
NeurIPS Spotlight | GHAP: turning 3DGS "pruning" into "reconstructing a smaller Gaussian world"
机器之心· 2025-11-14 09:30
Core Viewpoint
- The article presents a novel approach to 3D Gaussian Splatting (3DGS) compression by framing it as Gaussian mixture model simplification, which effectively reduces redundancy while preserving geometric detail [4][28]

Summary by Sections

Introduction
- 3DGS is a popular method for 3D scene modeling that uses large numbers of Gaussian primitives to create high-quality 3D representations; however, the redundancy among these primitives limits storage and rendering speed [4]

Methodology
- The proposed Gaussian-Herding-across-Pens (GHAP) method treats the entire 3DGS scene as a Gaussian mixture model and globally reconstructs a smaller mixture, maintaining the geometric structure while reducing the number of Gaussian primitives (a generic illustration of mixture simplification follows this summary) [8][9]
- GHAP employs a two-stage process: it first simplifies the geometric information (position/covariance), then refines the appearance attributes (opacity/color); this decoupling enhances stability [9][19]

Experimental Results
- GHAP was compared with various pruning-based and end-to-end compression methods; results indicate that it consistently outperforms the other baselines while approaching the performance of fully trained end-to-end methods [20][24]
- At a 10% retention rate, GHAP maintains high visual fidelity across different models and scenes, demonstrating its effectiveness in preserving the original geometric structure [23][24]

Conclusion
- GHAP offers a new perspective on 3DGS compression, focusing on Gaussian mixture model simplification to retain geometric detail; it is designed to scale to large 3DGS scenes and is compatible with most existing 3DGS frameworks [28]
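As a generic illustration of "reconstructing a smaller Gaussian world" (not the paper's GHAP algorithm), the snippet below draws samples from a dense set of Gaussians and refits a much smaller mixture to them with scikit-learn, so far fewer components still cover the same geometry; appearance attributes would be refined in a second stage.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a dense 3DGS scene: 500 Gaussian centers in 3D, each with a small spread.
big_means = rng.uniform(-1, 1, size=(500, 3))
samples = np.concatenate([m + 0.02 * rng.standard_normal((20, 3)) for m in big_means])

# "Reconstruct a smaller Gaussian world": keep only 10% as many components.
small_gmm = GaussianMixture(n_components=50, covariance_type="full", random_state=0)
small_gmm.fit(samples)

# small_gmm.means_ / small_gmm.covariances_ play the role of the simplified geometry;
# opacity and color would be re-optimized against the simplified mixture afterwards.
print(small_gmm.means_.shape)  # (50, 3)
```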
FDA dual anchors: a new perspective on model knowledge transfer, from parameter space to input space
机器之心· 2025-11-14 01:33
Core Insights
- The article introduces FDA (Model Merging with Functional Dual Anchors), a novel framework for merging expert models derived from task-specific fine-tuning of a shared foundation model, aiming to integrate their capabilities into a single model without accessing the original training data [2][4]

Group 1: FDA Framework Overview
- FDA represents the task knowledge embedded in model parameters using a set of dual synthetic input points, allowing knowledge to be integrated efficiently through the gradients those points induce in input space (a simplified sketch of this idea follows this summary) [4][10]
- Unlike traditional methods that rely on arithmetic operations in parameter space, FDA shifts the knowledge-integration process to the input space, providing a new perspective on model merging [4][9]
- The framework is designed to scale to large neural networks, demonstrating superior performance and scalability on both vision and natural-language models [4][12]

Group 2: Performance and Robustness
- Experimental results indicate that FDA significantly outperforms traditional task-vector methods, achieving an average performance of 87.26 in multi-task scenarios compared to 73.94 for task vectors, an improvement of nearly 18% [14]
- FDA exhibits flexible knowledge-modeling capability, with average performance gains of approximately 5.10% on ViT-B/16 and about 13% on RoBERTa-Large, showcasing its adaptability across architectures [15]

Group 3: Algorithm Implementation
- The FDA algorithm consists of two main phases: constructing FDA samples for each downstream task, then updating parameters based on the constructed FDA [17][19]
- Two practical initialization strategies are proposed for FDA construction: linear weight sampling and scaled Gaussian sampling, which help optimize the initial points effectively [18]

Group 4: Knowledge Encoding and Mechanisms
- FDA captures task-related dominant representation directions while suppressing redundant or noisy components, consistent with the low-rank structure typically observed in task-specific knowledge in parameter space [22]
- During optimization, FDA's high-energy subspace aligns with the high-energy subspace of the real data, indicating a connection between the knowledge encoded in FDA and the actual task data [23]
- The parameter updates induced by FDA gradually align with those induced by real data, demonstrating its robustness and effectiveness in capturing task-related knowledge [24]
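The sketch below is a simplified, hypothetical reading of the dual-anchor idea: synthesize a small set of input points (here initialized with scaled Gaussian noise), optimize them to expose where the base and expert models disagree, then update the base model using only the gradients those anchors induce. It is an interpretation for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 32
base = nn.Linear(d, d)
expert = nn.Linear(d, d)                              # stands in for a task-fine-tuned expert
anchors = nn.Parameter(torch.randn(16, d) * 0.1)      # scaled-Gaussian-style initialization

# Phase 1 (schematic): move the anchors toward inputs where the two models differ most.
anchor_opt = torch.optim.Adam([anchors], lr=1e-2)
for _ in range(100):
    disagreement = F.mse_loss(base(anchors), expert(anchors))
    anchor_opt.zero_grad()
    (-disagreement).backward()                        # maximize disagreement
    anchor_opt.step()

# Phase 2: merge the expert's knowledge by matching its outputs on the anchors alone.
model_opt = torch.optim.SGD(base.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.mse_loss(base(anchors.detach()), expert(anchors.detach()))
    model_opt.zero_grad()
    loss.backward()
    model_opt.step()
```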
LeCun's last paper at Meta? He is co-first author on LeJEPA, completing the theoretical puzzle of JEPAs
机器之心· 2025-11-14 01:33
Core Viewpoint
- The article discusses the development of LeJEPA, a new self-supervised learning framework that addresses the limitations of existing Joint Embedding Predictive Architectures (JEPAs) by providing a solid theoretical foundation and eliminating reliance on heuristic methods [4][5][8]

Group 1: Theoretical Foundation
- The research team established that the optimal embedding distribution for JEPAs is an isotropic Gaussian distribution, which minimizes downstream prediction risk across various tasks [5]
- A novel distribution-matching objective called Stochastic Isotropic Gaussian Regularization (SIGReg) was introduced to efficiently enforce the embedding to conform to the ideal isotropic Gaussian distribution [6][8]
- LeJEPA combines the predictive objective of JEPA with SIGReg, resulting in a statistically optimal solution that mitigates representation collapse (a schematic sketch of this combined objective follows this summary) [8][9]

Group 2: Practical Implementation
- LeJEPA demonstrates simplicity, robustness, and high performance due to its principled theoretical design, which eliminates the need for complex heuristics like gradient stopping and teacher-student networks [9][11]
- The implementation of LeJEPA requires only about 50 lines of code in PyTorch, making it user-friendly and easy to deploy [11][19]

Group 3: Experimental Validation
- LeJEPA was tested across over 10 datasets and 60 architectures, achieving or surpassing state-of-the-art results, such as 79% accuracy on ImageNet-1K with ViT-H/14 [10]
- The framework showed superior performance on domain-specific datasets, outperforming DINOv2-based transfer learning and indicating its capability for in-domain pre-training [10][33]

Group 4: Stability and Scalability
- LeJEPA maintains stability across different hyperparameters and architectures, with recommended settings yielding competitive performance even with small batch sizes [24][26]
- The framework's design is architecture-agnostic, allowing it to learn high-quality representations across various model types [26][27]

Group 5: Semantic Structure Emergence
- Semantic structure emerges from LeJEPA's self-supervised learning without explicit supervision, as evidenced by attention patterns that correspond to object boundaries and salient regions [41][43]
- The attention maps demonstrate temporal consistency, enabling unsupervised video segmentation and indicating that the learned features capture both spatial semantics and temporal structure [43]
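The snippet below is a schematic sketch of the recipe described above: a JEPA-style prediction loss combined with a regularizer that pushes random 1-D projections of the embeddings toward a standard Gaussian. The simple moment-matching penalty stands in for the paper's SIGReg objective and is not its exact statistical test.

```python
import torch
import torch.nn.functional as F

def isotropy_penalty(z: torch.Tensor, num_directions: int = 64) -> torch.Tensor:
    """Push the 1-D marginals of z along random directions toward N(0, 1)."""
    dirs = F.normalize(torch.randn(num_directions, z.shape[-1], device=z.device), dim=-1)
    proj = z @ dirs.t()                                    # (batch, num_directions)
    mean, var = proj.mean(dim=0), proj.var(dim=0, unbiased=False)
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

def lejepa_style_loss(pred_z, target_z, lam=0.5):
    """JEPA prediction loss between context/target embeddings plus the isotropy term."""
    return F.mse_loss(pred_z, target_z.detach()) + lam * isotropy_penalty(pred_z)

pred_z = torch.randn(128, 256, requires_grad=True)         # predicted embeddings
target_z = torch.randn(128, 256)                           # target-encoder embeddings
loss = lejepa_style_loss(pred_z, target_z)
loss.backward()
```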
RAE + VAE? Pre-trained representations power diffusion-model tokenizers, accelerating the shift from pixel compression to semantic extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses VFM-VAE, introduced by Xi'an Jiaotong University and Microsoft Research Asia, which, like RAE (Diffusion Transformers with Representation Autoencoders), uses frozen pre-trained visual representations to enhance the performance of diffusion models in image generation [2][6][28]

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of a VAE with RAE, systematically studying how compressed pre-trained visual representations affect the structure and performance of LDM systems [2][6]
- Using a frozen visual foundation model as the tokenizer significantly accelerates convergence and improves generation quality, marking an evolution from pixel compression to semantic representation (a hypothetical sketch of this setup follows this summary) [2][6]

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle with semantic alignment under perturbation, while maintaining high consistency between the latent space and the foundation model's features is crucial for robustness and convergence efficiency [8][19]
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256×256 versus 5.14 for the distillation route, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, improving training efficiency by roughly 10x [23][24]

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify consistency between the latent space and the foundation model's features, which is essential for evaluating the impact on downstream generation performance [7][19]
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based tokenizers, indicating a more stable alignment between the latent space and the foundation model's features [19][21]

Group 4: Future Directions
- The article concludes with the potential for further exploration of the latent space in multimodal generation and complex visual understanding, aiming to complete the transition from pixel compression to semantic representation [29]
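The sketch below illustrates the setup described above under stated assumptions: a frozen visual encoder supplies features, a small VAE head turns them into a probabilistic latent, and only the head and decoder are trained. `FrozenVFM` is a toy stand-in for a real foundation model such as DINOv2, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFM(nn.Module):
    """Toy stand-in for a frozen visual foundation model (a simple patch encoder)."""
    def __init__(self, dim=768):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # placeholder patchifier
        for p in self.parameters():
            p.requires_grad_(False)                     # the foundation model stays frozen

    def forward(self, x):
        return self.backbone(x)                         # (B, dim, H/16, W/16)

class VFMVAEHead(nn.Module):
    """Probabilistic (VAE-style) latent on top of frozen features, plus a pixel decoder."""
    def __init__(self, dim=768, latent=32):
        super().__init__()
        self.vfm = FrozenVFM(dim)
        self.to_mu = nn.Conv2d(dim, latent, 1)
        self.to_logvar = nn.Conv2d(dim, latent, 1)
        self.decoder = nn.Sequential(
            nn.Conv2d(latent, dim, 1), nn.GELU(),
            nn.ConvTranspose2d(dim, 3, kernel_size=16, stride=16),
        )

    def forward(self, x):
        h = self.vfm(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization
        recon = self.decoder(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return F.mse_loss(recon, x) + 1e-4 * kl

loss = VFMVAEHead()(torch.randn(2, 3, 256, 256))
loss.backward()
```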
Fierce competition! A new-generation Arena leaderboard dedicated to coding arrives, with a domestic model at the top
机器之心· 2025-11-13 10:03
Core Insights
- The article highlights the rapid advancements in large-model programming, emphasizing the competitive landscape among model vendors as they enhance coding capabilities and develop new tools [2][3]
- The introduction of the Code Arena by LMArena marks a significant evolution in evaluating the coding capabilities of large models, focusing on real-world application development rather than just code generation [4][6]

Model Performance
- The new Code Arena ranks the domestic model GLM-4.6 at the top, alongside Claude and GPT-5, showcasing its superior coding abilities [6][10]
- GLM-4.6 has demonstrated a success rate of 94.9% in code-modification tasks, closely trailing Anthropic's Claude Sonnet 4.5 at 96.2% [11]
- The performance gap between open-source models and top proprietary models has narrowed significantly, from 5-10 percentage points to mere basis points, indicating a rapid convergence in capabilities [14]

Industry Trends
- There is a noticeable shift among users toward GLM-4.6 for daily tasks, reflecting its growing acceptance and recognition in the AI-programming community [15]
- Cerebras has decided to adopt GLM-4.6 as its default recommended model, phasing out the previous model, underscoring the model's rising prominence in the industry [16]
- The article emphasizes the remarkable acceleration of domestic models, transitioning from catching up to leading the market, particularly in the open-source ecosystem [17][18]
Next-generation object detection: 3B-parameter MLLM Rex-Omni surpasses Grounding DINO for the first time, unifying 10+ vision tasks
机器之心· 2025-11-13 08:26
Core Insights
- The article discusses the breakthrough of the Rex-Omni model, which surpasses traditional coordinate-regression detectors in target-localization accuracy, addressing long-standing criticisms of multimodal large language models (MLLMs) [2][4]

Group 1: Model Design and Innovations
- Rex-Omni integrates all visual perception tasks into a unified "next point prediction" framework, using an efficient 4-token coordinate encoding and a two-stage training process that combines SFT with GRPO-based reinforcement learning (a hedged sketch of such a coordinate tokenizer follows this summary) [4][11]
- The model's output format uses quantized coordinates and special tokens, allowing various geometric outputs to be represented efficiently [13][14]
- Rex-Omni employs multiple data engines (Grounding, Referring, Pointing, and OCR) to generate high-quality training signals, enhancing its semantic understanding and spatial reasoning [16][17]

Group 2: Training Methodology
- The two-stage approach of SFT (Supervised Fine-Tuning) followed by GRPO (Group Relative Policy Optimization) reinforcement learning with geometry-aware rewards is crucial for achieving high localization accuracy and correcting behavioral deficiencies [19][21]
- GRPO's geometric reward functions let the model learn from its own generated sequences, significantly improving performance with only a small number of additional training steps [19][21]

Group 3: Performance Evaluation
- In zero-shot evaluations on core detection benchmarks such as COCO and LVIS, Rex-Omni demonstrates superior performance, achieving an F1-score that surpasses traditional models like Grounding DINO [20][22]
- The model excels at dense and small-object detection, achieving the highest F1@mIoU among MLLMs and showcasing refined spatial-localization capability [27][28]
- Rex-Omni's unified framework allows it to handle diverse visual perception tasks effectively, outperforming traditional open-set detectors in referring object detection [31][34]

Group 4: Conclusion and Future Implications
- Rex-Omni represents a significant advance for MLLMs in visual perception, proving that they can overcome geometric and behavioral limitations to achieve precise geometric perception alongside robust language understanding [45]
- The model sets a new performance benchmark in the MLLM field and points to a promising direction for next-generation object detection models [45]
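A hedged sketch of the 4-token coordinate idea follows: each bounding box becomes four discrete coordinate tokens obtained by quantizing (x0, y0, x1, y1) relative to the image size, so localization reduces to next-token prediction. The 1000-bin vocabulary is an illustrative assumption, not necessarily Rex-Omni's exact tokenizer.

```python
NUM_BINS = 1000  # assumed size of the coordinate vocabulary

def box_to_tokens(box, img_w, img_h):
    """Quantize an absolute-pixel box (x0, y0, x1, y1) into 4 coordinate tokens."""
    x0, y0, x1, y1 = box
    q = lambda v, size: min(int(v / size * NUM_BINS), NUM_BINS - 1)
    return [q(x0, img_w), q(y0, img_h), q(x1, img_w), q(y1, img_h)]

def tokens_to_box(tokens, img_w, img_h):
    """Map 4 coordinate tokens back to approximate pixel coordinates (bin centers)."""
    tx0, ty0, tx1, ty1 = tokens
    d = lambda t, size: (t + 0.5) / NUM_BINS * size
    return (d(tx0, img_w), d(ty0, img_h), d(tx1, img_w), d(ty1, img_h))

tokens = box_to_tokens((120, 40, 480, 360), img_w=640, img_h=480)
print(tokens)                              # [187, 83, 750, 750]
print(tokens_to_box(tokens, 640, 480))     # approximately recovers the original box
```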
Come build dreams this weekend! Fei-Fei Li's world model officially opens, with upgraded capabilities and a free tier
机器之心· 2025-11-13 08:26
Core Insights
- The article discusses the launch of Marble, a multi-modal generative world model developed by Fei-Fei Li's "Spatial Intelligence" team, which is now available for public use, allowing users to create 3D worlds easily [3][4]

Features and Capabilities
- Marble has undergone significant upgrades since its preview release, now supporting more generation methods, deeper editing capabilities, and a wider range of output formats, making it suitable for professional applications such as game development, film effects, architectural design, and robotic simulation [4]
- The platform offers both a free version and a membership version, which differ in the number of worlds that can be generated, the range and depth of editing features, and commercial licensing [6]

Multi-Modal Input
- Marble's core upgrade is its heavy multi-modal capability, allowing users to input various types of information, including multiple images, to create more refined 3D worlds [7][12]
- Users can provide different reference images for different areas of the world, enabling a more cohesive 3D space [7]

Editing and Iteration
- Marble allows for iterative creation, where users can modify the generated world post-creation, including object removal, local adjustments, and structural reconfiguration [12][20]
- The platform supports input from multiple real-world photos or short video clips to inspire virtual-world creation, with seamless transitions between views [14]

Expansion and Detail Enhancement
- Users can expand specific areas of the generated world to fill in missing details and enhance clarity, particularly in areas that were less defined during initial generation [23][24]
- The platform also allows multiple worlds to be combined based on user-defined relationships, facilitating the construction of larger spaces [25]

Output and Rendering
- Marble enables users to export created worlds in various formats, including high-fidelity Gaussian splat representations and triangle meshes, ensuring compatibility with industry-standard tools [27][28]
- Users can render worlds as videos with pixel-level control over camera movements and pacing, enhancing the creative process [31]

Collaborative Exploration
- The company has launched Marble Labs to collaborate with artists, designers, and engineers to explore new creative paradigms and best practices [36]
- Marble is positioned as a step towards "spatial intelligence," with future plans to enhance interactivity and expand applications in simulation and robotics [37]