The robotic hand truly "comes alive": 银河通用 and Tsinghua unveil DexNDM, reshaping dexterous manipulation with neural dynamics
机器之心· 2025-11-06 03:28
Core Insights
- The article discusses the development of DexNDM, a method aimed at solving the sim-to-real challenge in dexterous robotic manipulation, particularly at achieving stable in-hand rotation of diverse objects [2][5][24]

Group 1: Background and Challenges
- Highly dexterous teleoperation of complex tools, such as using a screwdriver or hammer, has been a long-standing challenge in robotics [4]
- Traditional direct-mapping teleoperation methods are limited to simple tasks and cannot handle complex manipulations requiring fine motor skills [4]
- A semi-autonomous teleoperation paradigm is proposed, which breaks complex tasks down into stable atomic skills that robots can execute autonomously [4]

Group 2: DexNDM Methodology
- DexNDM is designed to learn general and stable atomic skills for in-hand rotation, covering a wide range of scenarios including challenging elongated and small objects [5][19]
- The method uses a joint-wise neural dynamics model to bridge the gap between simulated and real-world dynamics, improving data efficiency and generalization across different hand-object interactions [19][20]

Group 3: Achievements and Capabilities
- DexNDM achieves unprecedented continuous rotation of objects under challenging wrist postures, outperforming previous methods [9][13]
- The system allows operators to guide dexterous hands through complex tasks such as tightening screws and assembling furniture, showcasing its robustness and adaptability [7][15]
- The method's flexibility enables stable execution regardless of the wrist orientation or rotation axis required [14][15]

Group 4: Data Collection and Training
- An automated data collection system, termed the "Chaos Box," gathers diverse real-world interaction data with minimal human intervention [21]
- A residual policy network is trained to compensate for the dynamics gap between simulation and reality, improving real-world performance [23] (a minimal sketch of this pattern follows below)

Group 5: Conclusion and Future Outlook
- DexNDM represents a significant advance on the sim-to-real challenge, achieving dexterous manipulation skills previously deemed impossible [24]
- The authors believe this is just the beginning, with dexterous hands poised to play a crucial role in the future of humanoid robotics [25]
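The article describes the joint-wise dynamics model and the residual policy only at a high level. As a rough illustration of the pattern, the PyTorch sketch below pairs a tiny per-joint dynamics model with a residual correction applied on top of a frozen simulator-trained base policy; all class names, network sizes, and the 0.1 correction scale are hypothetical, not DexNDM's actual design.

```python
import torch
import torch.nn as nn

class JointDynamicsModel(nn.Module):
    """Per-joint ("joint-wise") dynamics model: predicts each joint's next
    position from its current position and commanded action independently,
    which keeps the model small and data-efficient."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # q, a: (batch, num_joints); the same weights apply to every joint
        x = torch.stack([q, a], dim=-1)      # (batch, num_joints, 2)
        return q + self.net(x).squeeze(-1)   # residual next-state prediction

class ResidualPolicy(nn.Module):
    """A learned correction on top of a frozen simulator-trained policy,
    compensating for the sim-to-real dynamics gap."""
    def __init__(self, base_policy: nn.Module, obs_dim: int, act_dim: int):
        super().__init__()
        self.base = base_policy
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep the sim-trained policy frozen
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.base(obs) + 0.1 * self.residual(obs)  # small bounded correction

# Hypothetical usage for a 12-joint hand with a 48-dim observation.
base = nn.Sequential(nn.Linear(48, 128), nn.ReLU(), nn.Linear(128, 12))
policy = ResidualPolicy(base, obs_dim=48, act_dim=12)
action = policy(torch.randn(1, 48))
```

Only the residual head is trained on real interaction data, which is why this pattern is data-efficient: the frozen base already solves the task in simulation, and the correction only has to absorb the dynamics mismatch.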
NeurIPS 2025 Spotlight | Is that video in your feed real? Using the laws of physics to expose Sora's lies
机器之心· 2025-11-05 06:30
Core Viewpoint
- The article discusses the development of a physics-driven spatiotemporal modeling framework for detecting AI-generated videos, emphasizing the need for a robust detection method that leverages physical consistency rather than superficial features [6][47]

Group 1: Research Background
- The rise of generative AI has driven significant advances in video synthesis, but detecting such videos faces new challenges due to the complex spatial and temporal dependencies inherent in video data [7]
- Existing detection methods often focus on superficial inconsistencies, which are less effective against high-quality generated videos that obscure these artifacts [7][8]
- The core dilemma in AI video detection is how to build a framework that is robust to unknown generative models by understanding the physical evolution laws of natural videos [8]

Group 2: Proposed Methodology
- The article introduces the Normalized Spatiotemporal Gradient (NSG) statistic, which quantifies physical inconsistencies in generated videos by comparing NSG distributions between real and generated videos [3][18] (an illustrative computation appears after this summary)
- NSG-VD is proposed as a general-purpose video detection approach that models the distribution of natural videos without relying on any specific generative model, demonstrating strong detection performance across varied scenarios [3][28]

Group 3: Experimental Validation
- The NSG-VD framework was evaluated on the GenVideo benchmark, which spans 10 different generative models, showing superior performance over existing baselines [40]
- Trained on mixed data from Kinetics-400 (real videos) and Pika (generated videos), NSG-VD achieved an average recall of 88.02% and an F1 score of 90.87%, significantly outperforming the previous best method, DeMamba [40]
- Even with a training set of only 1,000 generated videos, NSG-VD remained robust, achieving a recall of 82.14% on Sora-generated videos, indicating high data efficiency [41]

Group 4: Theoretical Foundations
- The theoretical framework of NSG-VD is grounded in probability flow conservation and the continuity equation, which describe the transport of conserved quantities in physical systems [13][14]
- The NSG statistic captures the relationship between spatial probability gradients and temporal density changes, providing a unified consistency measure across different video scenarios [20][28]

Group 5: Future Directions
- The article suggests that future work will focus on refining the physical models used in NSG-VD, optimizing computational efficiency, and exploring real-time detection applications [48]
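The summary grounds NSG-VD in the continuity equation, \(\partial_t \rho + \nabla \cdot (\rho v) = 0\), which ties temporal density change to spatial transport. As a loose illustration (not the paper's exact statistic), the sketch below computes a per-frame ratio of temporal to spatial gradients with finite differences; the function name and the normalization are assumptions.

```python
import numpy as np

def nsg_statistic(frames: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Illustrative normalized spatiotemporal gradient: for each consecutive
    frame pair, the ratio of the temporal derivative to the spatial gradient
    magnitude. In videos obeying a continuity-like transport law these two
    quantities stay in a consistent relationship; generated videos tend to
    deviate.

    frames: (T, H, W) grayscale video with values in [0, 1].
    Returns one scalar statistic per frame transition, shape (T-1,)."""
    dt = np.diff(frames, axis=0)                    # temporal gradient
    gy, gx = np.gradient(frames[:-1], axis=(1, 2))  # spatial gradients
    spatial_mag = np.sqrt(gx ** 2 + gy ** 2)
    nsg = np.abs(dt) / (spatial_mag + eps)          # normalized per-pixel ratio
    return nsg.mean(axis=(1, 2))

video = np.random.rand(16, 64, 64)   # stand-in for a decoded grayscale clip
stats = nsg_statistic(video)         # shape (15,), one value per frame pair
```

Detection would then compare the distribution of such statistics for a candidate video against reference statistics collected from real videos.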
Embodied intelligence steps straight into the Scaling Law era! A 10B+ foundation model and 270,000 hours of real-world data
机器之心· 2025-11-05 06:30
Core Viewpoint
- The article discusses the breakthrough achieved by the AI robotics startup Generalist with its new embodied foundation model, GEN-0, which is designed for multimodal training on high-fidelity physical interaction data and aims to scale robotic intelligence through data and compute [2][5]

Group 1: GEN-0 Model Features
- GEN-0 is built to capture human-level reflexes and physical common sense, with a parameter count exceeding 10 billion [3][4]
- A core feature of GEN-0 is "Harmonic Reasoning," which lets the model think and act simultaneously and seamlessly, crucial for real-world physical systems [5]
- The model exhibits strong scaling laws: more pre-training data and compute predictably improve performance across a wide range of tasks [6][10] (a toy power-law fit is sketched below)

Group 2: Data and Training Insights
- Generalist pre-trained GEN-0 on over 270,000 hours of diverse real-world operational data, with the dataset growing at roughly 10,000 hours per week [23][24]
- The company emphasizes that data quality and diversity matter more than sheer quantity; different data mixes yield models with different characteristics [33]
- Scaling experiments revealed that smaller models exhibit "ossification," while larger models keep improving, highlighting the importance of model capacity for absorbing complex sensorimotor data [10][11]

Group 3: Applications and Future Directions
- GEN-0 has been successfully tested on a variety of robotic platforms, including humanoid robots with different degrees of freedom [6]
- The company is building the largest and most diverse real-world operational dataset to extend GEN-0's capabilities across a wide range of tasks and environments [28]
- Generalist aims to build robust infrastructure to support the large-scale data collection and processing required for training large robotic models [31]
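The article reports that GEN-0 follows predictable scaling laws but publishes no raw curves here. The toy fit below shows what "predictable" means operationally: fitting a saturating power law L(n) = a·n^(-α) + c to (data, loss) points and extrapolating. All numbers are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented (hours of pretraining data, validation loss) points for illustration;
# the article reports strong scaling laws for GEN-0 but gives no raw values.
hours = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 2.7e5])
loss = np.array([1.92, 1.58, 1.30, 1.11, 0.95, 0.85])

def power_law(n, a, alpha, c):
    # L(n) = a * n**(-alpha) + c: a saturating power law with irreducible loss c
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, hours, loss, p0=(10.0, 0.3, 0.5))
print(f"fitted exponent alpha = {alpha:.3f}, floor c = {c:.3f}")
print(f"extrapolated loss at 1M hours: {power_law(1e6, a, alpha, c):.3f}")
```

A fit like this is what lets a lab predict the return on the next 10,000 hours of data before collecting it.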
In a digital-life "petri dish," AI unexpectedly learns to fight, form alliances, and grab territory
机器之心· 2025-11-05 04:15
Core Concept
- The article discusses a new artificial-life simulation system called PD-NCA (Petri Dish Neural Cellular Automata), in which multiple NCA agents compete and evolve in a shared environment with self-replication as their primary goal [2][5]

Group 1: PD-NCA Overview
- PD-NCA differs significantly from traditional NCA frameworks: each NCA has independent neural network parameters that are continuously optimized during the simulation [3]
- Agents interact through differentiable attack and defense channels, exhibiting a dynamic mix of competition and cooperation [5][6]
- The system gives rise to emergent behaviors such as cyclic dynamics, territorial defense, and spontaneous cooperation [7]

Group 2: Simulation Mechanics
- The simulation runs on a discrete grid where each cell carries attack channels, defense channels, and hidden state [12]
- Each step proceeds through four stages: Processing, Competition, Normalization, and State Update [13] (see the sketch after this summary)
- A static background environment maintains competitive pressure, ensuring agents must constantly adapt to survive [16][17]

Group 3: Learning and Optimization
- Each agent's optimization objective is to maximize its territory, i.e., its overall survival rate across the grid [29]
- The learning mechanism lets agents balance offensive expansion against defensive territory optimization, producing complex emergent behaviors [30][31]
- Introducing learning markedly enriches and sustains emergent behavior compared with a non-learning baseline [37][38]

Group 4: Experimental Findings
- Experiments indicate that the number of NCA agents, the grid size, and the learning process are the critical factors behind complex dynamics and diverse behaviors in PD-NCA [38]
- The study examines the impact of grid size on NCA behavior, showing how dynamics vary as the grid grows from 16x16 to 196x196 [39]
- Attempts to encourage longer hypercycle structures show that, despite modifications to the loss function, stable long hypercycles were rarely observed [43]
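The four-stage update is described only in prose. Below is a minimal differentiable sketch of one PD-NCA-style step, with hypothetical channel conventions (channel 0 = attack, channel 1 = defense) and a soft occupancy rule standing in for the paper's actual competition mechanics.

```python
import torch
import torch.nn.functional as F

def pd_nca_step(state: torch.Tensor, agents: list) -> torch.Tensor:
    """One illustrative PD-NCA update on a shared grid.

    state:  (A, C, H, W) per-agent channel stacks; channel 0 = attack and
            channel 1 = defense are conventions invented for this sketch.
    agents: A small conv nets, one independently trained parameter set each.
    Stages follow the article: Processing, Competition, Normalization, Update."""
    # 1) Processing: each agent proposes new channel values from its own slice.
    proposals = torch.stack([net(state[i:i + 1]).squeeze(0)
                             for i, net in enumerate(agents)])
    # 2) Competition: each agent's attack is opposed by everyone's total defense.
    attack, defense = proposals[:, 0], proposals[:, 1]
    pressure = attack - defense.sum(dim=0, keepdim=True)   # (A, H, W)
    occupancy = F.softmax(pressure, dim=0)  # soft, differentiable cell ownership
    # 3) Normalization: scale each agent's channels by its cell ownership,
    #    bounding total mass per cell so no agent blows up.
    proposals = proposals * occupancy.unsqueeze(1)
    # 4) State update: blend normalized proposals into the shared grid.
    return 0.9 * state + 0.1 * proposals

# Hypothetical setup: 3 agents with 8 channels each on a 32x32 grid.
agents = [torch.nn.Conv2d(8, 8, kernel_size=3, padding=1) for _ in range(3)]
state = torch.rand(3, 8, 32, 32)
state = pd_nca_step(state, agents)
```

Because every stage is differentiable, each agent can be trained online to maximize its territory (its mean cell ownership) with its own optimizer, matching the article's description of per-agent continuous optimization.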
"Taming" masked diffusion language models with more consistent trajectories and fewer decoding steps, greatly boosting their reasoning performance and efficiency
机器之心· 2025-11-05 04:15
Core Insights
- The article discusses the rapid advances in diffusion large language models (LLMs), highlighting their potential as strong competitors to traditional autoregressive LLMs [2][7]
- A recent paper from a collaborative research team proposes an efficient decoding strategy combined with reinforcement learning for masked diffusion large language models (MDLMs), significantly improving their reasoning performance and efficiency [2][21]

Group 1: Problem Identification
- Masked diffusion language models such as LLaDA are comparable in capability to autoregressive models, but full diffusion-style decoding remains less effective than block-wise decoding [7][9]
- MDLM decoding often falls into a trap: <EOS> tokens generated too early terminate the sequence prematurely and degrade performance [14][15]

Group 2: Proposed Solutions
- The research team introduces an early rejection mechanism for <EOS> tokens, suppressing their confidence during early decoding steps to prevent premature termination [15]
- A power-increasing decoding-step scheduler reduces the number of inference steps from O(L) to O(log L), accelerating reasoning [15][16] (both mechanisms are sketched below)

Group 3: Consistency Trajectory Optimization
- The team proposes a consistency trajectory grouping strategy (CJ-GRPO) to resolve inconsistencies between rollout and optimization trajectories, improving training stability and effectiveness [16]
- Combining the early rejection mechanism, the increasing-step scheduler, and CJ-GRPO keeps performance comparable to baseline methods while greatly reducing decoding steps [16][24]

Group 4: Experimental Results
- Extensive experiments show the proposed methods outperform baselines on mathematical reasoning and planning tasks, with improvements of up to 2-4x on some benchmarks [23][24]
- CJ-GRPO combined with EOSER and ASS remains competitive in low-step inference settings, balancing speed and quality [24]

Group 5: Future Directions
- The article suggests exploring hybrid reasoning modes that combine the strengths of diffusion and autoregressive models to meet diverse task requirements [26]
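Neither mechanism's exact formulation is given in the summary, but both are simple to sketch. Below, a power-increasing schedule unmasks exponentially more tokens per step (hence O(log L) steps), and an <EOS> penalty that decays over the decoding process stands in for the early rejection mechanism; the `base`, `penalty`, and linear decay shape are assumptions, not the paper's exact choices.

```python
import torch

def power_schedule(seq_len: int, base: int = 2) -> list:
    """Power-increasing step scheduler: unmask exponentially more tokens per
    step (1, 2, 4, ...), so a length-L sequence decodes in O(log L) steps
    instead of the O(L) steps of one-token-per-step decoding."""
    steps, remaining, k = [], seq_len, 1
    while remaining > 0:
        n = min(k, remaining)
        steps.append(n)
        remaining -= n
        k *= base
    return steps  # e.g. power_schedule(64) -> [1, 2, 4, 8, 16, 32, 1]

def suppress_eos(logits: torch.Tensor, step: int, total_steps: int,
                 eos_id: int, penalty: float = 10.0) -> torch.Tensor:
    """Early rejection of <EOS>: subtract a penalty from the <EOS> logit that
    decays linearly to zero over the decoding process, so early steps cannot
    commit to premature termination."""
    decay = 1.0 - step / max(total_steps - 1, 1)  # 1.0 at start, 0.0 at the end
    out = logits.clone()
    out[..., eos_id] -= penalty * decay
    return out
```

The two compose naturally: at each scheduled step, apply `suppress_eos` to the logits, then unmask the scheduled number of highest-confidence positions.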
Bilingual in Chinese and English, first place in 29 benchmarks, pixel-level understanding: 360's FG-CLIP2 tops the charts as the world's strongest image-text cross-modal model
机器之心· 2025-11-05 04:15
Core Viewpoint
- The article discusses advances in AI visual understanding, focusing on FG-CLIP 2, a new model from 360 that significantly improves fine-grained recognition and spatial understanding over previous models [10][11][21]

Group 1: Model Performance
- FG-CLIP 2 achieved top results across eight categories and 29 benchmarks, surpassing Google and Meta and making it the strongest vision-language model currently available [11][26]
- On English tasks, FG-CLIP 2 scored an average of 81.10, well above Meta CLIP 2 (72.71), Google SigLIP 2 (71.87), and OpenAI CLIP (64.10) [30][34]
- The model shows a remarkable grasp of spatial relationships and fine detail, such as distinguishing cat breeds by fur texture and position [18][19]

Group 2: Data Quality and Training
- The core of FG-CLIP 2's capability is its high-quality dataset, FineHARD, comprising 500 million image-text pairs designed to strengthen fine-grained semantic understanding [36][37]
- Training follows a two-stage strategy: first establish global understanding, then focus on fine detail, letting the model progress from general recognition to pixel-level understanding [42][49]
- FG-CLIP 2 also adopts a data-adaptive resolution strategy, improving image-processing efficiency and accuracy [54][55]

Group 3: Applications and Impact
- FG-CLIP 2 has been integrated into business applications including advertising image matching, IoT camera intelligent retrieval, and content moderation, serving as foundational technology for these services [57]
- Its detailed image search and content-generation supervision abilities enhance its utility in e-commerce, security, and media management [58]
- 360 aims to make FG-CLIP 2 a core capability for AI development across industries, positioning itself as a leader in the AI landscape [60][61]
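FG-CLIP 2 belongs to the CLIP family, so retrieval-time scoring is standard dual-encoder cosine similarity; the fine-grained gains come from how the embeddings are trained, not from a new scoring rule. A minimal sketch, with hypothetical embedding dimensions:

```python
import torch
import torch.nn.functional as F

def contrastive_scores(image_emb: torch.Tensor, text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Dual-encoder scoring shared by the CLIP family: cosine similarity
    between L2-normalized image and text embeddings, divided by a temperature.
    Returns a (num_images, num_texts) similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.T / temperature

# Hypothetical usage: rank candidate fine-grained captions for one image.
img = torch.randn(1, 512)        # stand-in image embedding (dim assumed)
captions = torch.randn(3, 512)   # stand-in caption embeddings
best = contrastive_scores(img, captions).argmax(dim=-1)  # index of best match
```

This shared interface is what makes a drop-in upgrade possible: applications like advertising image matching or IoT camera retrieval swap in the stronger encoders and keep the same scoring path.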
Tsinghua and Peking University jointly launch Motion Transfer, rivaling Gemini Robotics, letting robots learn skills end-to-end directly from human data
机器之心· 2025-11-05 04:15
Core Insights
- The article discusses the release of Gemini Robotics 1.5 by Google DeepMind, highlighting its Motion Transfer mechanism (MT), which allows skill transfer between different robot embodiments without retraining [2]
- A collaborative team from Tsinghua University, Peking University, Wuhan University, and Shanghai Jiao Tong University has developed a new paradigm for zero-shot action transfer from humans to robots, releasing a comprehensive technical report and open-source code [3]

MotionTrans Framework
- MotionTrans is an end-to-end, zero-shot RGB-to-action skill-transfer framework that enables robots to learn human skills without any robot demonstrations [8]
- The framework includes a self-developed human data collection system built on VR devices, capturing first-person video, head movement, wrist poses, and hand actions [9]

Implementation of MotionTrans
- The framework enables zero-shot transfer: robots learn tasks like pouring water and unplugging devices from human VR data alone, achieving a 20% average success rate across 13 tasks [12][17]
- Fine-tuning with a small amount of robot data (5-20 samples) raises the success rate to roughly 50% and 80%, respectively [20]

Data and Training Techniques
- The team built a large-scale human-robot dataset with over 3,200 trajectories spanning 15 tasks, demonstrating that the framework can learn from human data alone [14][16]
- The approach uses techniques such as hand retargeting and unified action normalization to bridge the gap between human and robot actions [10][13] (a toy normalization sketch follows)

Results and Contributions
- MotionTrans shows that even advanced end-to-end models can unlock new skills with zero robot demonstrations, shifting the perception of human data from a supplementary to a primary role [25]
- The team has open-sourced all data, code, and models to support future research [26]
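The summary names "unified action normalization" without details. One plausible reading, sketched below, is per-dimension standardization computed over pooled human and robot actions so that both embodiments share a single action space; the dimensions and statistics are hypothetical.

```python
import numpy as np

class ActionNormalizer:
    """One plausible form of unified action normalization: per-dimension
    standardization with statistics pooled over human (VR) and robot actions,
    so both embodiments are expressed in one shared action space."""
    def __init__(self, actions: np.ndarray, eps: float = 1e-8):
        # actions: (num_samples, action_dim), pooled across both embodiments
        self.mean = actions.mean(axis=0)
        self.std = actions.std(axis=0) + eps

    def normalize(self, a: np.ndarray) -> np.ndarray:
        return (a - self.mean) / self.std

    def denormalize(self, a: np.ndarray) -> np.ndarray:
        return a * self.std + self.mean

# Hypothetical dimensions: wrist pose + hand joints flattened to 26 dims.
human_actions = np.random.randn(3200, 26)   # retargeted human VR actions
robot_actions = np.random.randn(500, 26)    # robot teleop actions
norm = ActionNormalizer(np.concatenate([human_actions, robot_actions], axis=0))
```

Pooling the statistics is the key design choice: a policy trained on normalized human actions then emits values in the same range the robot's denormalizer expects.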
An AI space race? NVIDIA's H100 just reached orbit, and now Google's Project Suncatcher wants to send TPUs to space too
机器之心· 2025-11-05 00:18
Core Insights
- Google has launched Project Suncatcher, a space-based, scalable AI infrastructure concept designed to run AI on solar energy; the Sun emits over 100 trillion times more power than humanity's total electricity production [8][11][29]
- The project envisions a constellation of satellites equipped with Tensor Processing Units (TPUs) and free-space optical communication links to scale machine learning capabilities in space [7][9][10]

Project Overview
- Project Suncatcher is a major exploratory initiative: a solar-powered satellite constellation intended to expand the computational scale of machine learning in space [7][8]
- The first satellite launch is scheduled for early 2027, in collaboration with Planet, to test the feasibility of the proposed system [3][29]

Technical Challenges
- The project faces several engineering challenges, including thermal management, high-bandwidth inter-satellite communication, and on-orbit system reliability [28][29]
- Achieving data-center-scale inter-satellite links is crucial, requiring connections that support tens of terabits per second [13][14]
- The satellites would operate in a dawn-dusk sun-synchronous low Earth orbit to maximize solar energy collection [13][21]

TPU Radiation Tolerance
- Google's Trillium TPU has undergone radiation testing, demonstrating resilience to total ionizing dose (TID) and single-event effects (SEEs), making it suitable for space applications [21][22]

Economic Viability
- Historical trends suggest launch costs may fall below $200 per kilogram by the mid-2030s, making space-based data centers economically feasible [23][24]
- At that point, the operating costs of space-based data centers could become comparable to the energy costs of terrestrial counterparts [24]

Future Directions
- Initial analysis indicates that the core concept of space-based machine learning compute is not blocked by fundamental physics or insurmountable economic barriers [28]
- The next milestone is launching two prototype satellites to validate Google's models and TPU hardware in space [29][30]
Making AI video generation both long and fast: Rolling Forcing achieves real-time generation at the minute scale
机器之心· 2025-11-05 00:18
Core Insights
- The article presents a breakthrough in real-time long-video generation: a new method called Rolling Forcing, developed by researchers from Nanyang Technological University and Tencent ARC Lab [2][4][12]

Group 1: Challenges in Real-Time Video Generation
- Real-time long-video generation faces an "impossible triangle": high quality, long-horizon consistency, and real-time speed are difficult to achieve simultaneously [8]
- The core challenges are the need for sequential, low-latency frame generation, the difficulty of suppressing error accumulation while maintaining consistency, and the limitations of autoregressive frame-by-frame generation [10][11]

Group 2: Rolling Forcing Methodology
- Rolling Forcing introduces a "sliding window" approach that processes the frames inside a window in parallel, enabling real-time generation while correcting errors as they arise [12][14] (an illustrative generation loop appears below)
- The method rests on three key innovations:
  1. A sliding window for joint denoising, optimizing multiple frames simultaneously [14]
  2. An attention sink mechanism that caches the initial frames as global anchors to preserve long-term consistency [14]
  3. An efficient training algorithm that conditions on self-generated historical frames to simulate real inference scenarios [14]

Group 3: Experimental Results
- Rolling Forcing shows significant improvements over existing methods, generating at 16 frames per second (fps) while keeping error accumulation low [17][20]
- In qualitative comparisons, Rolling Forcing maintains high fidelity over long videos, avoiding the color drift and detail degradation that afflict other models [20][21]

Group 4: Future Directions
- Future research may optimize memory mechanisms for retaining key information, improve training efficiency to reduce computational cost, and minimize interaction latency for ultra-low-latency applications [25]
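Putting the three pieces together, a toy generation loop might look like the following: a window of frames at staggered noise levels is denoised jointly, the cleanest frame is emitted on each shift, and the first outputs are frozen as an attention-sink context. The `model` callable, shapes, and step counts are all placeholders, not the paper's implementation.

```python
import torch

def rolling_forcing_generate(model, num_frames: int, window: int = 8,
                             sink_frames: int = 2, steps_per_shift: int = 4):
    """Toy rolling-window loop: frames in the window are denoised jointly at
    staggered noise levels; each shift emits the oldest (cleanest) frame and
    appends a fresh fully-noised one. The first `sink_frames` outputs are kept
    as an attention-sink context for long-term consistency."""
    sink = None
    buffer = torch.randn(window, 3, 64, 64)       # fully noised initial window
    levels = torch.linspace(0.0, 1.0, window)     # oldest ~clean, newest ~noise
    out = []
    for _ in range(num_frames):
        for _ in range(steps_per_shift):
            buffer = model(buffer, levels, context=sink)   # joint denoising
        emitted, buffer = buffer[0], buffer[1:]   # pop the cleanest frame
        out.append(emitted)
        if sink is None and len(out) >= sink_frames:
            sink = torch.stack(out[:sink_frames])  # freeze global anchor frames
        buffer = torch.cat([buffer, torch.randn(1, 3, 64, 64)])  # push noise in
    return torch.stack(out)

# Stand-in denoiser so the sketch runs end to end (not a real video model).
dummy = lambda frames, levels, context=None: frames * 0.95
video = rolling_forcing_generate(dummy, num_frames=16)   # (16, 3, 64, 64)
```

The window is what buys back the impossible triangle: frames still stream out one at a time (real time), but each one has been revisited several times alongside its neighbors (quality) while attending to the frozen anchors (consistency).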
Do multimodal large models understand physical tools? PhysToolBench offers a benchmark for measuring that understanding
机器之心· 2025-11-04 08:52
Core Insights
- The article discusses PhysToolBench, a benchmark for evaluating how well multimodal large models understand physical tools, highlighting the need for these models to improve at recognizing, understanding, and creating tools [2][22]

Summary by Sections

PhysToolBench Introduction
- PhysToolBench organizes the understanding of physical tools into three levels: recognizing tools, understanding tools, and creating tools [2][5]
- The benchmark comprises over 1,000 image-text pairs in which models must identify the appropriate tool for a given task from visual input [5]

Evaluation Criteria
- The evaluation covers 32 of the latest multimodal large models, spanning proprietary, open-source, and embodied-intelligence-specific models [7]
- The assessment is structured into three difficulty tiers: Easy (tool recognition), Medium (tool understanding), and Hard (tool creation) [8][6] (a minimal evaluation loop is sketched below)

Model Performance
- The top-performing model, GPT-5, scored 62.15% overall, yet many models scored below 50% at the higher difficulty tiers, a significant gap versus human performance [13]
- Proprietary models generally outperformed open-source ones, with larger models showing stronger capabilities [13]

Specific Findings
- Models struggled to recognize and understand tools, particularly to judge whether a tool is usable, which poses potential safety risks [18]
- The research indicates that reasoning capabilities, especially vision-centric reasoning, are crucial for using physical tools effectively [19][22]

Future Directions
- The findings suggest that improving the understanding, application, and creation of complex physical tools is essential for progress toward general intelligence in AI [22]
- The article encourages further exploration and development in this area, providing links to the paper, code, and datasets for interested readers [23]
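Benchmarks of this shape reduce to a straightforward per-tier accuracy loop over image-text samples. A minimal sketch, where the dataset fields and the `model.answer` API are assumptions rather than PhysToolBench's actual format:

```python
from collections import defaultdict

def evaluate(model, dataset):
    """Accuracy per difficulty tier. Each sample is assumed to be a dict with
    'image', 'task', 'choices', 'answer', and a 'tier' that is one of
    'easy', 'medium', or 'hard'; model.answer(...) is a hypothetical
    multiple-choice API returning the chosen tool."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample in dataset:
        pred = model.answer(image=sample["image"],
                            question=sample["task"],
                            choices=sample["choices"])
        total[sample["tier"]] += 1
        correct[sample["tier"]] += int(pred == sample["answer"])
    return {tier: correct[tier] / total[tier] for tier in total}
```

Reporting per-tier rather than overall accuracy is what exposes the article's headline finding: models that look competent on Easy recognition collapse below 50% on the Medium and Hard tiers.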