机器之心
When recommender systems truly "understand you": Kuaishou team presents new work TagCF at NeurIPS 2025
机器之心· 2025-11-27 04:09
Core Insights
- The article discusses the development of a new recommendation system framework called TagCF, which aims to enhance user understanding in addition to content understanding, moving from "knowing what" to "understanding why" [2][43].

Group 1: Research Background and Motivation
- The research highlights a gap in traditional recommendation systems, which often focus solely on content without understanding user identities and roles [2][5].
- The TagCF framework was developed in collaboration with Kuaishou's algorithm team, the foundational model and application department, and Wuhan University [2][3].

Group 2: Methodology and Framework
- TagCF introduces two new tasks: User Role Identification, which models user characteristics and social roles, and Behavioral Logic Modeling, which explores the logical relationships between user roles and item topics [9][10].
- The framework consists of three main modules: a video content understanding platform based on MLLM, a behavioral logic graph exploration platform, and a downstream recommendation system enhancement [16][18][22].

Group 3: Experimental Results
- Experiments showed that user role-based modeling statistically outperformed traditional topic modeling, leading to more stable and effective recommendations [7][40].
- The TagCF framework demonstrated significant improvements in recommendation accuracy and diversity, with TagCF-it and TagCF-ut models achieving notable performance metrics [34][36].

Group 4: Challenges and Solutions
- The implementation faced challenges such as uncontrolled tag expansion and the need for precise scoring mechanisms [23][24].
- Solutions included constructing a cover set of high-frequency tags to ensure stability and generalizability in industrial applications [25][41].

Group 5: Conclusion and Future Directions
- The article concludes that the TagCF framework represents a significant advancement in recommendation systems by integrating user understanding with content understanding, thus bridging the gap between statistical and symbolic modeling [43][45].
- Future work will focus on refining the tag-logic system and exploring its applications across various business scenarios, including e-commerce and search [44][45].
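The cover-set idea mentioned under "Challenges and Solutions" in the TagCF summary can be illustrated with a small sketch. The summary does not describe how TagCF actually builds its cover set, so the frequency-coverage rule below, the function name `build_cover_set`, and the `coverage` parameter are all assumptions made for illustration only.

```python
from collections import Counter


def build_cover_set(item_tags, coverage=0.9):
    """Pick the most frequent tags until they account for `coverage`
    of all tag occurrences (hypothetical rule, not TagCF's own)."""
    counts = Counter(tag for tags in item_tags for tag in tags)
    total = sum(counts.values())
    cover, covered = [], 0
    # Sort by descending frequency; break ties by name for determinism.
    for tag, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= coverage:
            break
        cover.append(tag)
        covered += n
    return cover


if __name__ == "__main__":
    # Toy tag lists; the tag names are made up.
    items = [["cooking", "parent"], ["cooking", "student"], ["cooking"], ["gamer"]]
    print(build_cover_set(items, coverage=0.8))
```

Capping the vocabulary this way keeps rare, possibly noisy tags out of the downstream model, which is one plausible reading of how a cover set limits uncontrolled tag expansion.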
PartCrafter, the first part-decomposed 3D generation model, arrives and racks up 2k GitHub stars
机器之心· 2025-11-27 04:09
Core Insights
- The article discusses the introduction of PartCrafter, a structured 3D generation model that allows for the creation of editable 3D models from a single 2D image, enhancing control and interpretability in 3D content creation [2][9][32].

Group 1: Research Background and Motivation
- Traditional 3D generation models operate as "black boxes," generating objects as indivisible wholes, which limits the ability to edit individual components [8].
- Existing methods rely on a two-stage "segmentation-reconstruction" pipeline, which is time-consuming and prone to errors, taking over 20 minutes for processing [8][9].
- PartCrafter aims to create an end-to-end structured 3D generation system that directly generates complex 3D mesh models from a single 2D image, addressing the editing challenges of current methods [9].

Group 2: Methodology
- PartCrafter employs a compositional latent space, assigning independent latent variables to different components of a 3D object, allowing for a modular representation [15].
- The model incorporates a local-global denoising transformer architecture to ensure both component independence and overall structural consistency during generation [16][17].

Group 3: Dataset Construction
- The research team constructed a large-scale dataset specifically for part-level generation tasks, containing approximately 130,000 3D objects, with around 100,000 having precise multi-part annotations [19].
- The dataset was curated with strict quality standards, ensuring high-quality material textures and reasonable part counts [19].

Group 4: Experimental Results
- In quantitative results, PartCrafter outperformed the HoloPart model, generating high-fidelity, part-separable 3D meshes in about 34 seconds, compared to HoloPart's longer processing time and lower accuracy [23][24].
- In qualitative assessments, PartCrafter demonstrated the ability to generate clear geometric structures and rich details, allowing users to control the granularity of part segmentation [27][30].

Group 5: Conclusion and Future Outlook
- The introduction of PartCrafter signifies a pivotal shift in 3D generation technology from a holistic approach to a structured one, enhancing interpretability and controllability [32].
- This capability to directly generate editable components broadens the application scope of 3D AIGC technology in fields such as gaming, virtual reality, and industrial design [32].
Adam's stability plus Muon's speed? Huawei Noah's Ark Lab open-sources ROOT to resolve the "want both" dilemma in large-model training
机器之心· 2025-11-27 04:09
Core Viewpoint
- The article discusses the evolution of optimizers in large language model (LLM) training, highlighting the introduction of ROOT (Robust Orthogonalized Optimizer) by Huawei Noah's Ark Lab as a solution that combines the speed of Muon and the stability of Adam, addressing the limitations of existing optimizers in handling large-scale training and noise robustness [2][50].

Group 1: Optimizer Evolution
- The early optimizer, SGD (Stochastic Gradient Descent), established the basic paradigm for neural network training but struggled with convergence speed and stability in high-dimensional loss landscapes [6][7].
- Adam and AdamW emerged as the de facto standards for training deep learning models, significantly improving convergence efficiency but revealing numerical instability issues when model parameters exceed one billion [7][8].
- Muon, a matrix-aware optimizer, attempted to address these issues by treating weight matrices as a whole, yet it faced challenges related to robustness and sensitivity to noise [11][13].

Group 2: ROOT Optimizer Features
- ROOT enhances the robustness of orthogonalized optimizers by introducing adaptive coefficients for the Newton-Schulz iteration, tailored to specific matrix dimensions, thus overcoming the dimensional fragility seen in fixed-coefficient methods [26][29].
- The optimizer employs a soft-thresholding mechanism to filter out gradient noise, effectively separating normal and abnormal gradient components, which improves stability during training [30][33].
- ROOT's design aims to provide a balance between speed and stability, making it suitable for large-scale, non-convex real model training scenarios [20][21].

Group 3: Performance Validation
- In extensive experiments, ROOT demonstrated superior convergence capabilities, achieving a training loss of 2.5407 in a 10B token pre-training experiment, outperforming Muon [41][42].
- ROOT achieved an average score of 60.12 across multiple downstream tasks, surpassing both AdamW (59.05) and Muon (59.59), indicating its competitive edge [43].
- The optimizer also showed strong cross-modal generalization capabilities, achieving a Top-1 accuracy of 88.44% on the CIFAR-10 dataset, significantly higher than Muon's 84.67% [46][47].

Group 4: Future Implications
- ROOT is positioned to potentially usher in a new era of optimizers, addressing the increasing complexity and scale of future language models, thereby enhancing the reliability and efficiency of AI system training [49][51].
- The open-source release of ROOT's code is expected to encourage further research and application in training trillion-parameter models, reinforcing Huawei's commitment to innovation in AI [52].
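The two ingredients named in the ROOT summary, Newton-Schulz orthogonalization and soft-thresholding, can be sketched in a few lines of plain Python. This is not ROOT's implementation: the sketch uses the textbook cubic Newton-Schulz coefficients (1.5 and -0.5) rather than ROOT's dimension-adaptive ones, and the way the gradient is split into a "base" and an "outlier" part is an assumption about how soft-thresholding could be applied. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(m, steps=10, a=1.5, b=-0.5):
    """Approximate the orthogonal polar factor of `m`.

    Uses the classical cubic iteration X <- a*X + b*(X X^T X) with fixed
    textbook coefficients (assumption: ROOT instead adapts the
    coefficients to the matrix dimensions).
    """
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in m for v in row)) or 1.0
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[a * u + b * w for u, w in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


def soft_threshold(values, tau):
    """Shrink each value toward zero by `tau`; values within [-tau, tau] become 0."""
    out = []
    for v in values:
        if v > tau:
            out.append(v - tau)
        elif v < -tau:
            out.append(v + tau)
        else:
            out.append(0.0)
    return out


def split_outliers(values, tau):
    """Split values into a bounded base part and an outlier part (illustrative)."""
    outliers = soft_threshold(values, tau)
    base = [v - o for v, o in zip(values, outliers)]
    return base, outliers


if __name__ == "__main__":
    print(newton_schulz([[2.0, 0.0], [0.0, 1.0]]))
    print(split_outliers([3.0, -0.5, -2.0], tau=1.0))
```

The iteration pushes every singular value of the normalized matrix toward 1, which is why an orthogonalized update has a uniform scale in every direction; the base part returned by `split_outliers` is simply the input clipped to `[-tau, tau]`.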
NeurIPS 2025 awards announced: Qwen wins Best Paper, Faster R-CNN wins the Test of Time Award
机器之心· 2025-11-27 03:00
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advancements in various AI research areas [1][4].

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" discusses the limitations of large language models in generating diverse content and introduces Infinity-Chat, a dataset with 26,000 diverse user queries for studying model diversity [5][6][9].
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals the impact of gated attention mechanisms on model performance and stability, demonstrating significant improvements in the Qwen3-Next model [11][16].
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" shows that increasing network depth to 1024 layers can enhance performance in self-supervised reinforcement learning tasks, achieving performance improvements of 2x to 50x [17][18].
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" identifies mechanisms that prevent diffusion models from memorizing training data, establishing a link between training dynamics and generalization capabilities [19][21][22].

Group 2: Best Paper Runner-Up
- Paper 1: "Optimal Mistake Bounds for Transductive Online Learning" solves a 30-year-old problem in learning theory, establishing optimal mistake bounds for transductive online learning [28][30][31].
- Paper 2: "Superposition Yields Robust Neural Scaling" argues that representation superposition is the primary mechanism governing neural scaling laws, supported by multiple experiments [32][34].

Group 3: Special Awards
- The Test of Time Award was given to the paper "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on modern object detection frameworks since its publication in 2015 [36][40].
- The Sejnowski-Hinton Prize was awarded for the paper "Random synaptic feedback weights support error backpropagation for deep learning," which contributed significantly to understanding biologically plausible learning rules in neural networks [43][46][50].
Is the era of general-purpose brain-computer interfaces coming? Cross-scale brain foundation model CSBrain truly reads brain signals
机器之心· 2025-11-27 03:00
Core Insights
- Brain-Computer Interface (BCI) is seen as the ultimate interface connecting human intelligence with artificial intelligence, with a focus on high-precision brain signal decoding to enable general AI models to understand complex brain activities [2].
- Current BCI systems are limited to task-specific deep learning models, which lack generalizability and cross-task transfer capabilities, resulting in isolated "specialist" applications [2][3].
- The introduction of a brain foundation model, CSBrain, aims to address these challenges by integrating cross-scale structural perception into the model design [5][6].

Group 1: Challenges in Brain-Computer Interfaces
- The BCI field has primarily relied on task-specific deep learning models, which perform well on specific datasets but struggle with adaptability to diverse brain signals [2].
- The unique cross-scale spatiotemporal structure of brain signals presents challenges for traditional modeling paradigms, which fail to capture the inherent neural structure [3][5].

Group 2: CSBrain Model Innovations
- CSBrain introduces two core innovative modules: Cross-scale Spatiotemporal Tokenization (CST) and Structured Sparse Attention (SSA) [6][7].
- CST extracts multi-scale temporal and spatial features from EEG signals, balancing neural representation capability and computational efficiency through a dimension allocation strategy [6].
- SSA captures long-range temporal dependencies and models inter-region interactions while reducing computational complexity from O(N²) to O(N·k) [7].

Group 3: Experimental Results and Performance
- CSBrain was validated across 11 representative brain decoding tasks and 16 public datasets, achieving state-of-the-art performance with an average improvement of 3.35% over current models [12].
- In high-challenge tasks, CSBrain showed a 5.2% accuracy improvement in motor imagery tasks and a 7.6% enhancement in epilepsy detection metrics [12].
- The experimental results confirm the effectiveness of CSBrain's cross-scale modeling paradigm and pre-trained brain foundation model, supporting various BCI applications [12][14].

Group 4: Future Prospects
- As data scale and computational power increase, brain foundation models are expected to play a larger role in broader brain-AI integration scenarios, accelerating the application of next-generation brain-computer interfaces [14].
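To make the O(N²) to O(N·k) reduction in the CSBrain summary concrete, here is a minimal sketch of one way sparse attention can cut the number of score computations. The summary does not say how SSA chooses which tokens attend to each other, so the fixed local window used here is an assumption, and `windowed_attention` is a hypothetical name rather than anything from CSBrain.

```python
import math


def windowed_attention(q, k, v, window):
    """Softmax attention where position i only sees positions within
    `window` steps of i. Returns the outputs and the number of
    query-key scores computed (at most N * (2*window + 1))."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, n_scores = [], 0
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(lo, hi)]
        n_scores += len(scores)
        # Subtract the max before exponentiating for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        outputs.append([
            sum(w * v[j][d] for w, j in zip(weights, range(lo, hi))) / z
            for d in range(len(v[0]))
        ])
    return outputs, n_scores


if __name__ == "__main__":
    n = 8
    q = [[1.0, 0.0] for _ in range(n)]
    k = [[0.5, 0.5] for _ in range(n)]
    v = [[1.0, 2.0] for _ in range(n)]
    _, sparse_scores = windowed_attention(q, k, v, window=1)
    _, dense_scores = windowed_attention(q, k, v, window=n)
    print(sparse_scores, dense_scores)
```

With a window that spans the whole sequence the function degenerates to dense attention, so the two score counts give a direct comparison of the quadratic and the linear-in-N regimes.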
Xiaomi open-sources MiMo-Embodied, the first cross-domain embodied foundation model, with SOTA results on 29 benchmarks
机器之心· 2025-11-26 09:19
Core Insights
- The article discusses the development of MiMo-Embodied, a foundational model that integrates autonomous driving and embodied intelligence, marking a significant advancement in AI research [5][46].
- The model addresses the fragmentation between autonomous driving and embodied AI, which have traditionally been treated as separate domains, leading to a lack of a unified cognitive framework [4][9].

Group 1: Model Development and Architecture
- MiMo-Embodied is the first open-source model successfully merging autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) results across 17 benchmarks in embodied intelligence and 12 in autonomous driving [5][19].
- The model is built on Xiaomi's self-developed MiMo-VL architecture, which decomposes physical interactions into six core dimensions, enhancing both environmental perception and decision-making capabilities [11][12].

Group 2: Training Strategy
- A four-stage progressive training strategy was designed to effectively integrate diverse cross-domain data while avoiding catastrophic forgetting, which is crucial for the model's performance [13][14].
- The training phases include:
  1. Establishing foundational knowledge with general and embodied data [14].
  2. Injecting autonomous driving knowledge through mixed supervision while retaining embodied data [14][15].
  3. Enhancing logical reasoning capabilities using Chain-of-Thought (CoT) techniques [15].
  4. Refining the model through reinforcement learning (RL) to improve output precision [16].

Group 3: Performance Metrics
- MiMo-Embodied achieved record-breaking performance in key areas such as affordance prediction, task planning, and spatial understanding, demonstrating its robust capabilities in embodied intelligence [19][22][25].
- In autonomous driving benchmarks, the model excelled in environmental perception, state prediction, and driving planning, showcasing its ability to generate coherent and contextually appropriate driving decisions [27][28][30].

Group 4: Real-World Applications
- The model's practical utility was validated in embodied navigation and operation tasks, where it performed exceptionally well in identifying and locating objects in various household scenarios [33][34].
- In autonomous driving trajectory planning, MiMo-Embodied significantly outperformed competing models in both imitation learning and reinforcement learning phases, indicating its effectiveness in complex driving situations [38][39].

Group 5: Conclusion and Future Implications
- The introduction of MiMo-Embodied signifies a new phase in embodied intelligence research, proving that cognitive logic in the physical world is unified across different applications [46].
- This work lays the groundwork for developing general Vision-Language-Action (VLA) models, moving towards the vision of a single brain applicable to various embodied forms [46].
Major study from Saining Xie's and Jaakkola's teams: data-free flow map distillation
机器之心· 2025-11-26 09:19
Core Insights
- The article discusses recent advancements in AI communication methods, particularly focusing on a new paradigm called "Cache-to-Cache" communication, which allows machines to exchange information without verbal language, enhancing efficiency in AI interactions [1].
- Another significant study highlights the concept of "Thought Communication," enabling agents to share latent thoughts internally, resembling telepathic collaboration [3].
- A joint study from MIT and NYU proposes a method that eliminates the need for data by sampling from prior distributions, achieving impressive performance in flow map distillation [4][5].

Group 1: AI Communication Innovations
- The "AI Transmission" research showcases a model where machines communicate through caches, bypassing traditional language, which has garnered significant attention in the tech community [1].
- The "Thought Communication" concept introduced in a NeurIPS 2025 Spotlight paper emphasizes internal thought sharing among agents, pushing the boundaries of AI collaboration [3].

Group 2: Data-Free AI Models
- The joint research from MIT and NYU introduces a method that allows flow map distillation without relying on external data, achieving remarkable results [4][5].
- The study demonstrates that the new approach sets a new record for generation quality on ImageNet, indicating a shift towards utilizing internal representations rather than explicit data [5].
- The proposed "FreeFlow" framework emphasizes a paradigm shift towards data-free methodologies, ensuring alignment with prior distributions to avoid risks associated with teacher-data mismatches [21][30].
How many more years until next-pixel prediction is feasible? Google: five is enough
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the potential of next-pixel prediction in image recognition and generation, highlighting its scalability challenges compared to natural language processing tasks [6][21].
- It emphasizes that while next-pixel prediction is a promising approach, it requires significantly more computational resources than language models, with a token-per-parameter ratio that is 10-20 times higher [6][15][26].

Group 1: Next-Pixel Prediction
- Next-pixel prediction can be learned in an end-to-end manner without the need for labeled data, making it a form of unsupervised learning [3][4].
- The study indicates that achieving optimal performance in next-pixel prediction requires a higher token-parameter ratio compared to text token learning, with a minimum of 400 for pixel models versus 20 for language models [6][15].
- The research identifies three core questions regarding the evaluation of model performance, the consistency of scaling laws with downstream tasks, and the variation of scaling trends across different image resolutions [7][8].

Group 2: Experimental Findings
- Experiments conducted at a fixed resolution of 32×32 pixels reveal that the optimal scaling strategy is highly dependent on the target task, with image generation requiring a larger token-parameter ratio than classification tasks [18][22].
- As image resolution increases, the model size must grow faster than the data size to maintain optimal scaling, indicating that computational capacity is the primary bottleneck rather than data availability [18][26].
- The study shows that while the scaling trends for next-pixel prediction can be predicted using established frameworks from language models, the optimal scaling strategies differ significantly between tasks [21][22].

Group 3: Future Outlook
- The article predicts that next-pixel modeling will become feasible within the next five years due to the rapid growth of training computational power, which is expected to increase by four to five times annually [8][26].
- It concludes that despite the current challenges, the path towards pixel-level modeling remains viable and could achieve competitive performance in the future [26].
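The figures quoted in the next-pixel summary lend themselves to a quick back-of-the-envelope check. The sketch below assumes that the four-to-five-fold annual compute growth compounds at a constant rate, and it uses a hypothetical one-billion-parameter model purely as an example; the helper names are made up for illustration.

```python
def compute_growth(annual_factor, years):
    """Total compute multiplier after `years` of constant annual growth."""
    return annual_factor ** years


def optimal_tokens(n_params, tokens_per_param):
    """Training tokens implied by a given token-per-parameter ratio."""
    return n_params * tokens_per_param


if __name__ == "__main__":
    # Five years of 4x to 5x annual growth.
    print(compute_growth(4, 5), compute_growth(5, 5))

    # Hypothetical 1B-parameter model at the two ratios quoted in the summary.
    n_params = 10**9
    pixel_tokens = optimal_tokens(n_params, 400)
    text_tokens = optimal_tokens(n_params, 20)
    print(pixel_tokens, text_tokens, pixel_tokens // text_tokens)
```

Under these assumptions, five years of growth multiplies available compute by roughly three orders of magnitude (4^5 = 1024 up to 5^5 = 3125), while the 400-versus-20 ratio means a pixel model of a given size needs about 20 times as many training tokens as a language model of the same size.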
Yunfan Meetup registration | Introverts welcome: meet like-minded peers at NeurIPS in San Diego
机器之心· 2025-11-26 07:07
Group 1
- The year 2025 is expected to be a significant year for the rapid development of AI, transforming human-computer interaction and becoming a new partner for exploring the world [2].
- The pace of technological iteration in AI is accelerating, with new research being proposed, tested, and iterated rapidly, leading to a non-linear accumulation of knowledge [2].
- NeurIPS, one of the most influential academic conferences in the AI field, received 21,575 valid submissions this year, with an acceptance rate of 24.52%, and will be held in San Diego from December 2 to 7, 2025 [2].

Group 2
- The "Yunfan · NeurIPS 2025 AI Talent Meetup" is an informal event aimed at facilitating face-to-face exchanges among researchers, engineers, and entrepreneurs in a relaxed atmosphere [3].
- The meetup is organized by Machine Heart and Shanghai Artificial Intelligence Laboratory, inviting participants to share ideas and discuss recent hot topics and research directions [3].

Group 3
- The meetup will take place on December 3, 2025, from 17:30 to 20:30 local time in San Diego, with a capacity of 80 participants [4].
- The event is designed for various attendees, including researchers interested in AI trends, job seekers, and innovators looking for opportunities and partners [4].

Group 4
- The agenda for the meetup includes registration, an introduction to the Shanghai AI Laboratory, presentations on the Polaris & Xingqi program, an AI Talent Show, and a dinner with interactive sessions [6].
Breaking the bottleneck of vision-language-action models: QDepth-VLA gives robots more precise 3D spatial perception
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the significant potential of Vision-Language-Action (VLA) models in robotic manipulation, highlighting the introduction of QDepth-VLA, which enhances 3D spatial perception and reasoning capabilities through Quantized Depth Prediction [2][4][34].

Group 1: Model Limitations and Challenges
- Despite advancements in semantic understanding and instruction following, VLA models struggle with spatial perception, particularly in fine-grained or long-duration multi-step tasks, leading to positioning errors and operational failures [5][6].
- The gap between 2D visual semantic understanding and 3D spatial perception has prompted researchers to explore various methods to integrate 3D information into VLA models, categorized into three main approaches: direct injection of 3D features, 3D feature projection, and auxiliary 3D visual prediction tasks [5][6].

Group 2: QDepth-VLA Methodology
- QDepth-VLA introduces a mechanism that combines Quantized Depth Prediction with a hybrid attention structure, allowing the model to maintain semantic consistency while enhancing 3D spatial perception and action decision-making [8][34].
- The method consists of three main components: high-precision depth annotation using Video-Depth-Anything, a Depth Expert module for structured depth token prediction, and a hybrid attention mechanism to manage information flow across modalities [11][13][14].

Group 3: Experimental Validation
- Comprehensive evaluations of QDepth-VLA were conducted in both simulated environments (Simpler and LIBERO) and real-world settings, demonstrating significant performance improvements in various object manipulation and multi-step tasks [18][19].
- In the Simpler simulation, QDepth-VLA achieved an average success rate increase of 8.5% and 3.7% compared to the baseline model Open π0 [20].
- In the LIBERO simulation, QDepth-VLA outperformed the 3D-CAVLA model by approximately 2.8% [26].
- Real-world experiments showed QDepth-VLA's superior performance in pick-and-place tasks, with a 20% improvement in basic tasks and a 10% enhancement in more challenging scenarios [30].

Group 4: Ablation Studies
- Ablation studies indicated that the depth supervision and hybrid attention mechanisms are crucial for QDepth-VLA's high performance, with significant drops in success rates when these components were removed [31][32].

Group 5: Future Directions
- Future research will focus on enhancing the model's spatial understanding capabilities, with potential developments in future spatial structure prediction and more efficient depth representation learning [35][36].
- The integration of enhanced 3D geometric perception and action consistency into CASBOT's product line is anticipated, supporting various applications in both domestic and industrial settings [35][36].
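The phrase "quantized depth" in the QDepth-VLA summary can be illustrated with a toy example that turns a continuous depth map into discrete tokens. The summary does not specify the tokenizer, and the real system may well use a learned codebook, so the uniform binning below, the depth range, and the function names `quantize_depth` and `dequantize_depth` are assumptions for illustration only.

```python
def quantize_depth(depth_map, d_min, d_max, n_bins):
    """Map each depth value to an integer bin index in [0, n_bins - 1]
    using uniform bins over [d_min, d_max] (hypothetical scheme)."""
    tokens = []
    for row in depth_map:
        out = []
        for d in row:
            d = min(max(d, d_min), d_max)  # clamp out-of-range depths
            idx = int((d - d_min) / (d_max - d_min) * n_bins)
            out.append(min(idx, n_bins - 1))  # d == d_max falls in the last bin
        tokens.append(out)
    return tokens


def dequantize_depth(tokens, d_min, d_max, n_bins):
    """Recover an approximate depth map using each bin's centre value."""
    width = (d_max - d_min) / n_bins
    return [[d_min + (t + 0.5) * width for t in row] for row in tokens]


if __name__ == "__main__":
    depth = [[0.0, 0.5, 1.0], [2.0, 3.9, 4.0]]  # toy depths in metres
    tokens = quantize_depth(depth, d_min=0.0, d_max=4.0, n_bins=4)
    print(tokens)
    print(dequantize_depth(tokens, d_min=0.0, d_max=4.0, n_bins=4))
```

Predicting bin indices turns depth estimation into a classification-style target, which sits more naturally next to the discrete tokens a VLA model already handles than a per-pixel regression would.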