机器之心
One model unifies 4D world generation and reconstruction: HKUST's One4D framework arrives
机器之心· 2026-01-13 00:12
Group 1
- The core idea of the article is the introduction of One4D, a unified framework for 4D generation and reconstruction that addresses the limitations of existing video diffusion models by enabling simultaneous output of RGB videos and geometric pointmaps [4][32].
- One4D aims to enhance the capabilities of video generation models by integrating both appearance (RGB) and geometry (pointmap/depth/camera trajectory) within a single framework, facilitating the transition toward a 4D world model [32][33].
- The framework employs two key innovations: Decoupled LoRA Control (DLC) for reducing cross-modal interference and Unified Masked Conditioning (UMC) for handling various input types seamlessly [10][17].

Group 2
- One4D supports three types of input: single image to 4D generation, sparse video frames to 4D generation and reconstruction, and complete video to 4D reconstruction [9].
- One4D is trained on a large-scale dataset combining synthetic and real data to ensure both geometric accuracy and visual diversity, achieving effective results after training on 34,000 videos for 5,500 steps on 8 NVIDIA H800 GPUs [20].
- User studies indicate that One4D outperforms existing methods in consistency, dynamic quality, aesthetics, depth quality, and overall 4D coherence, with significant improvements across metrics [21][22].

Group 3
- Given sparse video frames, One4D can generate the missing RGB frames and complete the geometric sequence even under extreme sparsity, demonstrating its capability for dynamic 4D scene generation [30][31].
- One4D also excels at full-video 4D reconstruction, outperforming dedicated reconstruction methods on benchmark datasets such as Sintel and Bonn, indicating robust performance across tasks [25][26].
- The framework's camera trajectory estimation capabilities are validated through evaluations on datasets like Sintel and TUM, further proving its effectiveness in unified generation and reconstruction tasks [28][29].
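The Unified Masked Conditioning idea above (one conditioning format covering single-image, sparse-frame, and full-video inputs) can be illustrated with a toy sketch. The function name, tensor layout, and mask convention below are assumptions for illustration, not One4D's actual implementation:

```python
import numpy as np

def make_umc_condition(frames, observed_idx, num_frames):
    # Hypothetical UMC-style conditioning: a frame tensor plus a binary
    # mask marking which frames are observed. The three input regimes
    # differ only in how many indices appear in observed_idx.
    c, h, w = frames.shape[1:]
    cond = np.zeros((num_frames, c, h, w), dtype=frames.dtype)
    mask = np.zeros((num_frames, 1, h, w), dtype=frames.dtype)
    cond[observed_idx] = frames
    mask[observed_idx] = 1.0  # 1 = observed frame, 0 = to be generated
    return np.concatenate([cond, mask], axis=1)

video = np.random.rand(2, 3, 8, 8).astype(np.float32)
# single image -> [0]; sparse frames -> e.g. [0, 8]; full video -> all indices
cond = make_umc_condition(video, [0, 8], num_frames=16)
print(cond.shape)  # (16, 4, 8, 8)
```

Under this framing, "image to 4D," "sparse frames to 4D," and "video to 4D" become one task with different mask densities, which is presumably what lets a single model handle all three.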
With geometric constraints introduced, VLMs cross the cognitive gap in "spatial reasoning"
机器之心· 2026-01-12 06:35
Core Insights
- The article discusses the "Semantic-to-Geometric Gap" in existing Visual Language Models (VLMs), which struggle with precise spatial reasoning tasks, leading to incorrect answers in spatial queries [2][6].

Group 1: Problem Identification
- The "Semantic-to-Geometric Gap" arises because VLMs compress rich pixel information into abstract semantic features, losing the high-fidelity geometric details necessary for accurate spatial reasoning [7].
- VLMs lack the ability to form precise geometric imaginations, which hampers their performance in complex spatial reasoning scenarios [7].

Group 2: Proposed Solution
- A research team from Beihang University and Shanghai AI Lab introduced the Geometrically-Constrained Agent (GCA), which employs a new paradigm of "formalizing constraints before deterministic computation" to enhance spatial reasoning capabilities [4].
- GCA does not rely on massive data fine-tuning but instead uses formal task constraints to shift VLMs from "fuzzy intuition" to "precise solving," creating a verifiable geometric bridge for spatial reasoning [4].

Group 3: Performance Improvement
- GCA improved model performance by nearly 50% on the challenging MMSI-Bench test, establishing a new state of the art (SOTA) in spatial reasoning [4][14].
- The average accuracy achieved by GCA is 65.1%, surpassing existing training-based and tool-integrated methods, particularly in complex spatial reasoning tasks [15].

Group 4: Generalizability and Versatility
- GCA is a training-free universal reasoning paradigm that can empower various foundational models, achieving an average relative performance improvement of about 37% on MMSI-Bench [16].
- The GCA framework demonstrated exceptional performance, with the Gemini-2.5-Pro model's accuracy rising from 36.9% to 55.0% after integration [16].

Group 5: Methodology
- GCA's approach involves two stages: formalizing tasks from "fuzzy instructions" into "precise rules," then performing deterministic geometric calculations within the established constraints [9][12].
- The framework includes intelligent tool scheduling and binding, ensuring seamless integration of perception and computation tools to achieve reliable spatial reasoning [20].

Group 6: Conclusion and Implications
- GCA represents a new paradigm of "language-defined constraints and geometric execution," effectively transforming vague spatial queries into constrained mathematical problems, enhancing reasoning accuracy and moving machines closer to possessing "geometric intuition" [24].
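The two-stage recipe in Group 5 (formalize the fuzzy query into precise rules, then compute deterministically) can be sketched in a few lines. Everything here — the predicate names, the smaller-x-means-left convention, the toy coordinates — is a hypothetical illustration of the paradigm, not the GCA codebase:

```python
import numpy as np

def formalize(query):
    # Stage 1: map a fuzzy natural-language relation to a formal predicate.
    # (A real system would use the VLM itself for this step.)
    relations = {"left of": "left_of", "closer to the camera": "closer"}
    for phrase, pred in relations.items():
        if phrase in query:
            return pred
    raise ValueError("unsupported relation")

def evaluate(pred, a, b, camera=np.zeros(3)):
    # Stage 2: deterministic geometry on 3D points from a perception tool,
    # replacing the model's "fuzzy intuition" with an exact computation.
    if pred == "left_of":
        return a[0] < b[0]  # assumed convention: smaller x = further left
    if pred == "closer":
        return np.linalg.norm(a - camera) < np.linalg.norm(b - camera)

a, b = np.array([-1.0, 0.0, 4.0]), np.array([2.0, 0.0, 9.0])
print(evaluate(formalize("is the cup left of the plate"), a, b))  # True
```

The point of the paradigm is that once the query is formalized, the answer is a verifiable computation rather than a guess, which is why it needs no additional fine-tuning data.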
A change of tune: fresh off bashing AI, the father of Linux launches his first vibe-coding project
机器之心· 2026-01-12 06:35
Core Viewpoint
- Linus Torvalds has embraced "vibe coding" with the launch of his new project "AudioNoise," which utilizes AI technology for audio processing, marking a significant shift in his programming approach [3][10][30].

Group 1: Project Overview
- The "AudioNoise" project, released on GitHub, has gained 1.4k stars within five days, showcasing its popularity [10][11].
- The project is related to guitar effects and aims to simulate audio effects using AI, specifically through a Python visualization tool [6][12].
- Torvalds' previous project, "GuitarPedal," served as a foundation for "AudioNoise," focusing on learning about analog circuits and audio processing [12][14].

Group 2: Programming Approach
- Torvalds initially used traditional programming methods but later adopted a more streamlined approach, using Google Antigravity for coding in what he refers to as "vibe coding" [8][17].
- He expressed satisfaction with the results of using AI tools, noting that the outcomes were better than his manual coding efforts [18][20].
- Despite his positive experience with AI on this project, Torvalds maintains a cautious stance regarding AI in production environments, emphasizing the importance of understanding code logic [30][31].

Group 3: Industry Reactions
- The programming community has reacted with a mix of enthusiasm and skepticism to Torvalds' use of AI, highlighting a shift in attitudes toward AI-generated code [22][30].
- Notable figures in the tech industry, including the creator of Antigravity, have expressed admiration for Torvalds' decision to incorporate AI into his work [23][24].
- Torvalds' previous criticisms of AI-generated code have sparked discussions about the implications of AI in software development, particularly in relation to quality and accountability [28][30].
In 2026, the second half of large-model training belongs to the "reinforcement learning cloud"
机器之心· 2026-01-12 05:01
Core Insights
- The article discusses the transition in AI model development from scaling laws based on increasing parameters and training data to a focus on reinforcement learning (RL) and post-training scaling, indicating a paradigm shift in AI capabilities [1][4][10].

Group 1: Scaling Law and Model Development
- By the end of 2024, discussions in Silicon Valley and Beijing highlighted concerns that scaling laws were hitting a wall, as newer flagship models like Orion did not show the expected marginal benefits from increased parameters and data [1].
- Ilya Sutskever's remark suggested a shift from an era of scaling to one of miracles and discoveries, indicating skepticism about the sustainability of the pre-training approach [3].
- By early 2025, OpenAI's o1 model introduced reinforcement reasoning, demonstrating that test-time scaling could lead to higher intelligence, while DeepSeek R1 successfully replicated this technology in an open-source manner [4][6].

Group 2: Reinforcement Learning and Infrastructure
- The focus of computational power is shifting from pre-training scaling to post-training and test-time scaling, emphasizing deep reasoning capabilities over mere parameter size [8].
- The emergence of DeepSeek R1 revealed that deep reasoning, driven by reinforcement learning, is more critical to model evolution than simply increasing parameters [4][6].
- The industry is calling for new computational infrastructure to support this shift toward dynamic exploration and reasoning, as existing cloud architectures struggle to meet these demands [11][12].

Group 3: Agentic RL and Its Implications
- Nine Chapters Cloud has positioned itself as a leader in defining "reinforcement learning cloud" infrastructure, which is essential for the evolving AI landscape [12][14].
- The Agentic RL platform, launched in mid-2025, is the first industrial-grade reinforcement learning cloud platform, significantly enhancing training efficiency and reducing costs [15][19].
- Agentic RL aims to evolve general models into expert models capable of complex decision-making and control, addressing real-world challenges across industries [20][22].

Group 4: Real-World Applications and Economic Impact
- The successful implementation of a large-scale AI center in Huangshan within 48 days exemplifies Nine Chapters Cloud's engineering capabilities and operational efficiency [41][43].
- The Huangshan model is projected to generate significant economic benefits, with an estimated increase of at least 200 million yuan in annual service-industry value [48].
- The integration of AI capabilities into urban management and tourism demonstrates the potential for AI infrastructure to drive economic growth and enhance operational efficiency [50][51].

Group 5: Future Vision and Market Position
- Nine Chapters Cloud aims to establish itself as a key player in the independent AI cloud sector, advocating an open ecosystem that does not compete with clients [54][60].
- The company emphasizes the importance of defining standards for next-generation infrastructure, moving beyond traditional cloud services to focus on enabling the rapid evolution of intelligent agents [63][66].
- The future of cloud computing is envisioned as an "evolution era," focused on enhancing the capabilities of intelligent agents rather than merely providing computational resources [68][69].
AAAI 2026 Oral | Kuaishou proposes CroPS, a new "retrieval data engine" that breaks the search filter bubble
机器之心· 2026-01-12 05:01
Core Insights
- The article discusses the introduction of a new retrieval data engine called CroPS (Cross-Perspective Positive Samples) by Kuaishou's search team, aimed at improving short-video search by addressing the limitations of traditional self-reinforcing training paradigms that rely heavily on historical click data [2][10].

Group 1: Problem Identification
- Current vector retrieval models in the industry often depend on historical user interaction data, leading to a self-reinforcing cycle that narrows search results and limits exposure to diverse content [6].
- This mechanism results in significant sample bias: high-quality long-tail content is systematically excluded from positive samples, causing the model's retrieval scope to become conservative and repetitive [6][7].
- Users experience a lack of novelty in search results, making it difficult to satisfy exploratory needs [7].

Group 2: CroPS Framework
- CroPS introduces a multi-dimensional positive-sample enhancement engine that utilizes user query behavior, recommendation-system feedback, and knowledge from large language models (LLMs) to enrich the semantic space [11].
- The framework captures user intent continuity by analyzing query rewrites, allowing the system to correct semantic biases by incorporating successful interactions from related queries [12].
- It breaks down barriers between search and recommendation systems, enabling the retrieval model to leverage diverse content that users may not have actively searched for [15].
- CroPS employs LLMs to generate high-quality synthetic samples when existing content does not cover certain queries, effectively expanding the model's knowledge base [16][17].

Group 3: Hierarchical Labeling and Loss Function
- The Hierarchical Label Assignment (HLA) strategy addresses the reliability differences among positive samples from various sources, allowing the model to prioritize more relevant samples during training [19].
- The H-InfoNCE loss function enhances the model's ability to distinguish between high-priority and low-priority samples, aligning learning objectives with the hierarchical logic of HLA [23][28].

Group 4: Experimental Results
- Offline experiments showed that CroPS improved recall by 9.5% on user-click datasets and 7.1% on user query-change datasets compared with the strongest baseline [30].
- In large-scale A/B testing, CroPS led to significant business growth, with a 40.9% increase in ratio rank and a 44.3% increase in ratio show for dense models [31].
- The click-through rate (CTR) increased by 0.869%, and the long playback rate (LPR) rose by 0.483%, indicating improved content relevance and quality [36].

Group 5: Future Directions
- The Kuaishou search team plans to explore integrating CroPS with generative retrieval methods to further leverage the potential of large language models in search [34].
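As a rough sketch of how an H-InfoNCE-style objective could weight positives by their HLA tier, consider the toy function below. The weighting scheme, signature, and priority values are assumptions made for illustration, not Kuaishou's published formulation:

```python
import numpy as np

def h_infonce(q, positives, priorities, negatives, tau=0.1):
    # Hypothetical hierarchy-weighted InfoNCE: each positive contributes
    # an InfoNCE term weighted by its source priority (e.g. clicks >
    # query rewrites > recommendation feedback > LLM-synthetic), so the
    # more reliable positives dominate the gradient.
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, pos, neg = norm(q), norm(positives), norm(negatives)
    pos_logits = pos @ q / tau                       # similarity to each positive
    neg_logits = neg @ q / tau                       # similarity to each negative
    log_den = np.logaddexp.reduce(np.concatenate([pos_logits, neg_logits]))
    w = priorities / priorities.sum()                # normalized tier weights
    return -(w * (pos_logits - log_den)).sum()

rng = np.random.default_rng(0)
q = rng.normal(size=8)
loss = h_infonce(q, rng.normal(size=(3, 8)),
                 np.array([3.0, 2.0, 1.0]), rng.normal(size=(5, 8)))
print(loss)
```

With uniform priorities this reduces to averaging plain InfoNCE over the positives; the hierarchy only changes how much each positive's term counts.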
Top AI models lose to three-year-olds: the BabyVision benchmark exposes a hard flaw in multimodal models
机器之心· 2026-01-12 05:01
Core Viewpoint
- The article discusses the limitations of current large models in visual understanding, emphasizing that while they excel at language and text reasoning, their visual capabilities remain underdeveloped, akin to those of a three-year-old child [3][4][49].

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and various research teams, has launched a new multimodal understanding evaluation set called BabyVision to assess the visual capabilities of AI models [3][4].
- BabyVision aims to create a new paradigm for AI training, evaluation, and application in real-world scenarios, focusing on generating measurable and iterative visual capabilities [4][49].

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks are given to children of different ages (3, 6, 10, and 12 years) and to top multimodal models [7].
- The evaluation strictly controls language dependency, requiring answers to be derived solely from visual information [8].

Group 3: Results and Findings
- The results reveal that most models score significantly below the average performance of three-year-old children; the best model, Gemini3-Pro-Preview, achieves only 49.7%, still 20 percentage points below the performance of six-year-olds [15][21].
- Human participants scored 94.1% accuracy on the BabyVision-Full test, highlighting the substantial gap between human and model performance [20][21].

Group 4: Challenges Identified
- The study identifies four core challenges in visual reasoning for AI models: observing non-verbal details, maintaining visual tracking, lacking spatial imagination, and difficulty with visual pattern induction [27][30][36][39].
- These challenges indicate a systemic lack of foundational visual capabilities in current models, rather than isolated deficiencies [23].

Group 5: Future Directions
- The article suggests that transitioning visual reasoning tasks to visual operations, as demonstrated in BabyVision-Gen, may help bridge the gap in visual understanding [42].
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking visual understanding down into 22 measurable atomic capabilities [49].
Praised by Jim Fan! Qianxun Intelligent's world-No. 1 Spirit v1.5 is officially open-sourced!
机器之心· 2026-01-12 01:20
Core Insights
- The article discusses significant advances in embodied intelligence, particularly highlighting the emergence of the Spirit v1.5 model from Qianxun Intelligent, which has surpassed previous models such as Pi0.5 in performance [3][4][15].

Group 1: Key Developments in Embodied Intelligence
- 2025 marked a breakthrough year for embodied intelligence, with hardware advances and the development of foundational models defining the intelligence ceiling of the technology [3].
- The Spirit v1.5 model was open-sourced on January 12, 2026, and achieved the top position on RoboChallenge's Table30 ranking, outperforming Pi0.5 [4][8][15].
- The open-source release of Spirit v1.5 includes model weights, inference code, and usage examples, allowing public verification and community innovation [6][33].

Group 2: Performance Metrics and Evaluation
- Spirit v1.5 scored 66.09 on the RoboChallenge leaderboard, while Pi0.5 scored 61.84, indicating a significant performance improvement [11][14].
- The RoboChallenge platform focuses on real-world physical testing, with a task set designed to challenge multiple dimensions of model capability, including precise 3D positioning and multi-stage tasks [15].

Group 3: Data Utilization and Training Paradigms
- Spirit v1.5's success is attributed to a fundamental restructuring of the robot pre-training data paradigm, moving away from "clean data" toward a more diverse and open-ended data collection strategy [18][20].
- The model's training involved continuous skill flows and internalized error-correction capabilities, allowing it to adapt dynamically to unexpected challenges [21][22].

Group 4: Implications for the Industry
- The open-source nature of Spirit v1.5 represents a significant shift in the industry, providing a competitive alternative to closed-source models and enabling broader access to high-performance robotics technology [35][39].
- The model's development and open-source release are seen as a pivotal moment for Chinese teams in the global AI landscape, transitioning from followers to leaders in defining core technological paths [41][42].
Sakana sets AIs "hunting" each other, and they begin to evolve convergently
机器之心· 2026-01-11 10:03
Core Insights
- The article discusses the collaboration between Sakana AI and MIT on a new research project called Digital Red Queen (DRQ), which explores self-evolving assembly code in a competitive programming environment [2][3].
- The research utilizes the classic programming game "Core War" to create a dynamic adversarial environment where AI programs, referred to as "warriors," evolve by competing against each other [3][7].

Group 1: Research Methodology
- The DRQ method allows AI programs to evolve by continuously adapting to changing opponents rather than static benchmarks, leading to the generation of robust and versatile "warriors" [3][8].
- The study positions "Core War" as a sandbox for examining the dynamics of artificial systems in competitive environments, such as cybersecurity [7][8].

Group 2: Evolutionary Dynamics
- The research reveals that the dynamic adversarial process encourages models to develop increasingly general strategies, demonstrating a phenomenon known as convergent evolution, where different programs exhibit similar high-performance behaviors [8][26].
- As the DRQ runs increase, the warriors become more robust and generalizable, indicating a trend toward phenotypic convergence, where behaviors become similar despite differing underlying implementations [29][30].

Group 3: Implications and Future Work
- The findings suggest that the DRQ algorithm, combined with the "Core War" environment, could provide valuable insights into the nature of adversarial competition and the evolution of AI systems in real-world scenarios [34].
- Future research may explore richer settings that allow multiple agents to co-evolve simultaneously, better simulating real-world dynamics where large populations adapt in parallel [35].
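The Red Queen dynamic described above — scoring each candidate against a continuously changing population instead of a fixed benchmark — can be caricatured with a toy evolutionary loop. This is purely illustrative: the "warriors" here are numbers in a trivial dominance game, not Redcode programs, and the selection scheme is an assumption:

```python
import random

def battle(a, b):
    # Toy stand-in for a Core War match: higher value wins.
    return 1 if a > b else (0 if a == b else -1)

def red_queen_step(population, mutate):
    # Score every warrior against the *current* population -- the
    # benchmark moves each generation, which is the Red Queen effect.
    scored = sorted(
        ((sum(battle(w, o) for j, o in enumerate(population) if j != i), w)
         for i, w in enumerate(population)),
        reverse=True,
    )
    survivors = [w for _, w in scored[: len(scored) // 2]]
    return survivors + [mutate(w) for w in survivors]

random.seed(0)
pop = [random.random() for _ in range(8)]
best_start = max(pop)
for _ in range(20):
    pop = red_queen_step(pop, lambda w: min(1.0, w + random.uniform(-0.05, 0.1)))
print(max(pop) >= best_start)  # True: elites are kept, so the best never regresses
```

Even in this caricature the population climbs toward the same narrow band of high values from different starting points, a crude analogue of the phenotypic convergence the paper reports.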
Challenging GRPO: NVIDIA proposes GDPO, specialized for multi-reward optimization
机器之心· 2026-01-11 04:00
Machine Heart Editorial Team

GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement learning algorithms in the industry thanks to their efficiency and simplicity.

But as language models grow more capable, user expectations are shifting: models should not only answer correctly but also exhibit behavior aligned with diverse human preferences across a variety of scenarios. To that end, reinforcement learning training pipelines have begun to incorporate multiple reward signals, each corresponding to a different preference, jointly guiding the model toward the desired behavior.

A new NVIDIA paper, however, argues that GRPO may not be the best choice for multi-reward optimization. Specifically, in multi-reward settings, GRPO normalizes different reward combinations into identical advantage values, which weakens the training signal and lowers reward levels.

To address this, the authors propose a new policy optimization method: Group reward-Decoupled normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO avoids different rewards being blended and "flattened out," preserving their relative differences more faithfully, making multi-reward optimization more accurate while significantly improving the stability of training.

Paper title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-re ...
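The contrast between the two normalization schemes can be made concrete with a small numerical sketch. This is a paraphrase of the idea, assuming GRPO-style group-relative advantages; it is not the paper's reference implementation:

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO-style: sum the reward signals first, then normalize the
    # combined scalar within the group -- different reward mixtures
    # can collapse to the same advantage value.
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    # GDPO-style (as described here): normalize each reward channel
    # separately within the group, then combine, preserving the
    # relative differences between the individual signals.
    norm = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return norm.sum(axis=1)

# One group of 4 rollouts scored by 2 reward signals (toy values).
group = np.array([[1.0, 0.2],
                  [0.0, 0.9],
                  [1.0, 0.8],
                  [0.0, 0.1]])
print(grpo_advantages(group))
print(gdpo_advantages(group))
```

In the GDPO version, a rollout that is mediocre on one reward but excellent on another keeps a distinct advantage profile, instead of being averaged into the single normalized scalar that GRPO would assign.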
Is federated learning no longer secure? HKU's new TPAMI paper digs into the inner workings of gradient inversion attacks
机器之心· 2026-01-11 04:00
Core Viewpoint
- Federated learning (FL) is not as secure as previously thought, as gradient inversion attacks (GIA) can potentially compromise privacy by reconstructing private training data from shared gradient information [3][5].

Group 1: Background and Importance of the Study
- Federated learning allows clients to collaboratively train models without sharing raw data, but recent studies indicate that "not sharing data" does not equate to "absolute security" [5].
- Attackers can utilize GIA to reconstruct private data such as facial images and medical records, highlighting the need for a systematic classification and analysis of these attacks [5][6].

Group 2: Classification of GIA Methods
- The research categorizes existing GIA methods into three main types [9]:
  1. Optimization-based attacks (OP-GIA)
  2. Generation-based attacks (GEN-GIA)
  3. Analysis-based attacks (ANA-GIA)

Group 3: Theoretical Contributions
- The study presents significant theoretical advances, including:
  - Theorem 1: establishes a linear relationship between the reconstruction error of OP-GIA and the square root of batch size and image resolution, indicating that larger batch sizes and higher resolutions make attacks more difficult [11].
  - Proposition 1: reveals that the similarity of gradients during model training affects the difficulty of data recovery, with more similar gradients making recovery harder [13].

Group 4: Experimental Findings
- Extensive experiments were conducted on datasets including CIFAR-10/100, ImageNet, and CelebA, covering various attack types and model architectures [15].
- Key findings:
  - OP-GIA is practical but limited by batch size and resolution, with its threat significantly reduced in practical FedAvg scenarios.
  - GEN-GIA can generate high-quality images but relies heavily on pre-trained generators and specific activation functions, making it less effective when those conditions are not met.
  - ANA-GIA can achieve precise data recovery but is easily detectable by clients, limiting its practical application [25].

Group 5: Defense Guidelines
- The authors propose a three-phase defense pipeline that enhances security without complex encryption [22]:
  1. Network design phase
  2. Training protocol phase
  3. Client verification phase, in which clients validate model architecture and parameters to prevent malicious modifications

Group 6: Summary and Practical Implications
- This research serves as a comprehensive examination of existing GIA methods and provides practical guidelines for enhancing the security of federated learning systems, emphasizing that while privacy risks are real, they can be effectively managed through thoughtful design and protocols [24].
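The core leakage that makes such attacks possible can be shown in a few lines for the simplest case, a linear layer with a bias: the shared gradient factorizes so that the input can be read off exactly. This is a standard textbook-style illustration of analysis-based leakage on a toy single-sample model, not the paper's method:

```python
import numpy as np

# For a linear layer y = W x + b with squared-error loss, the shared
# gradients are dL/dW = delta * x^T and dL/db = delta, so an attacker
# recovers the private input exactly as x = (dL/dW row) / (dL/db entry).

rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 4)), rng.normal(size=2)
x_true = rng.normal(size=4)          # the client's private input
y = np.array([1.0, 0.0])

logits = W @ x_true + b
delta = logits - y                   # dL/dlogits for 0.5*||logits - y||^2
grad_W = np.outer(delta, x_true)     # what the client uploads to the server
grad_b = delta

x_rec = grad_W[0] / grad_b[0]        # attacker's exact reconstruction
print(np.allclose(x_rec, x_true))    # True
```

Real models and batched FedAvg updates break this clean factorization, which is exactly why the survey's findings distinguish the easy analytic case from the much harder practical settings.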