机器之心
AAAI 2026 | Teaching Video Diffusion Models to "Understand Scientific Phenomena": Generating an Entire Physical Evolution from the Initial Frame
机器之心· 2025-11-15 01:37
Core Insights
- The article discusses the limitations of existing video generation models such as Stable Diffusion and CogVideoX in accurately simulating scientific phenomena, highlighting their tendency to produce physically implausible results [2][3]
- A new framework proposed by a research team from Dongfang University and Shanghai Jiao Tong University aims to enable video diffusion models to learn "latent scientific knowledge," allowing them to generate scientifically accurate video sequences from a single initial frame [3][4]

Methodology
- The proposed method consists of three main steps: latent knowledge extraction, pseudo-language prompt generation, and knowledge-guided video generation [8]
- The first step extracts "latent scientific knowledge" from a single initial image, which is crucial for inferring the subsequent dynamic evolution [9]
- The second step generates pseudo-language prompts by leveraging the CLIP model's cross-modal alignment, allowing the model to "understand" the underlying structural rules in the initial image [10]
- The third step integrates these pseudo-language prompts into existing video diffusion models, enabling them to simulate scientific phenomena while adhering to physical laws [11]

Experimental Results
- The research team conducted extensive experiments on fluid dynamics simulation data and real typhoon observations, demonstrating that the new model generates videos that are not only visually superior but also more scientifically accurate [13][18]
- The model was tested on fluid simulation scenarios including Rayleigh-Bénard convection, cylinder flow, DamBreak, and DepthCharge, as well as real satellite data from four typhoon events [13][18]
- Quantitative evaluations showed significant improvements in physical-consistency metrics, with the new model outperforming mainstream methods in all tested scenarios [18]

Future Implications
- This research represents a meaningful exploration of generative AI in scientific modeling, suggesting that AI can evolve from mere visual generation to understanding and simulating physical processes [19][20]
- Potential applications extend to meteorological forecasting, fluid simulation, and Earth system modeling, positioning AI as a valuable tool for scientists [20]
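The pseudo-prompt step can be sketched in miniature: a learned projection maps one image embedding into a short sequence of token embeddings living in the text encoder's space, which then condition the video model. A minimal numpy sketch, with all dimensions, the random "learned" projection, and the normalization chosen purely for illustration (the paper's actual projection and CLIP weights are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_TXT, N_TOKENS = 512, 768, 8  # hypothetical embedding sizes

# A learned projection (here: random, as a stand-in) that maps one image
# embedding into N_TOKENS pseudo-prompt token embeddings in the text space.
W_proj = rng.standard_normal((N_TOKENS * D_TXT, D_IMG)) * 0.02

def pseudo_prompt(image_emb: np.ndarray) -> np.ndarray:
    """Map a single image embedding to a (N_TOKENS, D_TXT) pseudo-prompt."""
    flat = W_proj @ image_emb
    tokens = flat.reshape(N_TOKENS, D_TXT)
    # L2-normalize each token, mirroring CLIP's normalized embedding space.
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)

image_emb = rng.standard_normal(D_IMG)
prompt = pseudo_prompt(image_emb)  # would be fed to the diffusion model's
                                   # text-conditioning pathway
```

The point of the shape: the video model never sees a real sentence, only a prompt-shaped tensor distilled from the initial frame.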
He Came from a "Second-Tier" School and Was Bad at Math, Yet Became the Father of PyTorch and a Meta VP
机器之心· 2025-11-15 01:37
Core Insights
- The article highlights the inspiring journey of Soumith Chintala, the creator of PyTorch, emphasizing his resilience and determination despite numerous setbacks in his academic and professional life [2][22]

Group 1: Early Life and Education
- Soumith Chintala struggled with mathematics, which kept him out of India's top universities and led him to a second-tier institution, VIT [5]
- He improved his academic performance and achieved a GRE score of 1420, impressive at the time [6]
- Despite his efforts, he was rejected by 12 U.S. universities when applying for a master's program [7]

Group 2: Career Challenges
- After being rejected by 15 universities, he eventually received an offer from the University of Southern California, and later a waitlist offer from New York University [8][9]
- Soumith faced multiple rejections from companies, including DeepMind, and initially worked as a test engineer at Amazon [10][12]
- He also encountered significant visa challenges, but eventually secured a waiver to stay in the U.S. [13]

Group 3: Breakthrough and PyTorch Development
- During his time at Facebook AI Research (FAIR), he solved a critical issue that senior engineers could not, a significant turning point in his career [18]
- He played a key role in the development of PyTorch, which became a widely used deep learning framework [20]
- Despite its initial success, he faced internal challenges at Meta over PyTorch's future, but the project was ultimately preserved [20]

Group 4: Reflection and Future
- Soumith's story illustrates the importance of resilience and the belief that perseverance can lead to success regardless of current circumstances [22][32]
- He expressed gratitude toward the mentors and family who supported him throughout his journey [25][28]
- As he prepares to leave Meta for new ventures, there is anticipation about his future contributions to the field [34]
No Supervision Signal Needed: Self-Play Lets Deep-Search Agents Evolve on Their Own
机器之心· 2025-11-15 01:37
Core Insights
- The article discusses rising interest in search-based agents and the challenge of pushing their capabilities toward human-level performance [2]
- A new method, Search Self-Play (SSP), lets agents evolve through self-play without human annotation [5][21]
- SSP shows significant improvements across open-domain question-answering benchmarks, demonstrating its effectiveness in enhancing agent capabilities [17]

Method Overview
- In the SSP framework, a single large language model acts as both "Proposer" and "Solver," and adversarial training dynamically raises task difficulty as the model's capabilities improve [7][10]
- Training proceeds in three stages (problem generation, collaborative verification, and adversarial solving), ensuring that generated questions are solvable and unique [9][10]

Experimental Results
- SSP was evaluated on seven open-domain question-answering benchmarks, consistently outperforming baseline methods [16][17]
- Notably, the Qwen2.5-7B-Base model gained an average of 26.4 points, including a 40.4-point improvement on TriviaQA [17]
- SSP also proved effective for instruction-tuned models, improving their performance by an average of 8.0 points [17]

Implications and Future Directions
- The SSP paradigm marks a shift toward self-competition among models, potentially enabling superhuman performance without human supervision [21][22]
- The article suggests this self-play training method could become standard in future large-model training, since it allows rapid capability gains beyond the limits of human annotation [21]
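The three-stage loop (propose, verify, solve) can be illustrated with a toy self-play curriculum in which one "model" plays both Proposer and Solver on verifiable arithmetic questions. Everything here (the task, the skill counter, the difficulty rule) is an illustrative stand-in for the dynamic, not the SSP training procedure itself:

```python
import random

random.seed(0)

def propose(difficulty: int):
    """Proposer: create an n-term addition question with a verified answer."""
    terms = [random.randint(1, 9) for _ in range(difficulty)]
    question = "+".join(map(str, terms))
    answer = sum(terms)  # collaborative verification: the answer is checkable
    return question, answer

def solve(question: str, skill: int):
    """Solver: succeeds only if its 'skill' covers the question's length."""
    terms = list(map(int, question.split("+")))
    return sum(terms) if skill >= len(terms) else None

skill, difficulty = 2, 2
history = []
for step in range(20):
    q, gold = propose(difficulty)
    pred = solve(q, skill)
    if pred == gold:
        skill += 1       # the solver improves on success...
        difficulty += 1  # ...and the proposer raises difficulty to match
    history.append((difficulty, skill))

final_difficulty, final_skill = history[-1]
```

The key property mirrored here is the co-evolving curriculum: because the same agent sets and solves the tasks, difficulty tracks capability with no human in the loop.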
SIGGRAPH Asia 2025 | Making 3D Scene Generation as Flexible and Controllable as "Writing Code"
机器之心· 2025-11-14 10:32
Core Viewpoint
- The rapid development of generative AI is pushing the boundaries of its creative capabilities, particularly in 3D scene generation, but existing methods face significant limitations in logical consistency and spatial relationships [2][3]

Group 1: Procedural Scene Programs (PSP) Framework
- The PSP framework, developed by research teams from Brown University and UC San Diego, has the AI generate executable scripts for 3D scene construction rather than directly outputting geometric parameters [3][8]
- This paradigm lets the AI "write" the logic of scene generation, yielding output that is highly editable, reusable, and structurally controllable [3][9]

Group 2: Components of PSP
- The system has two key components:
  1. A Procedural Scene Description Language (PSDL) for defining the rules of the generated world [9][10]
  2. A Program Search module for automatic detection and correction of geometric errors after execution [9][13]
- PSDL expresses spatial relationships through programming logic, strengthening the model's ability to define scene structure and layout [10][11]

Group 3: Error Correction Mechanism
- The Program Search module identifies structural and geometric inconsistencies, using a symbolic correction mechanism that needs fewer iterations to fix errors than traditional methods [13][14]
- The system corrects most errors with an average of about 7 program modifications, significantly improving the physical consistency of generated scenes [14]

Group 4: Performance Comparison
- Across 70 open-world scene prompts, PSP outperformed traditional methods, achieving a human preference rate of 82.9% against DeclBase and 94.3% against Holodeck, while also generating scenes faster, averaging about 38 seconds [16][17]
- An automated evaluation corroborated these findings, with preference rates of 77.1% against DeclBase and 90.0% against Holodeck, aligning with the human assessments [18]

Group 5: Significance and Future Outlook
- By integrating programming logic with AI's imaginative capabilities, PSP improves the controllability and interpretability of 3D content generation [20][21]
- The framework provides a new foundation for constructing virtual environments, game levels, and intelligent visual settings, marking a significant advance in AI-generated content [21]
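The "scene as code" idea plus symbolic repair can be miniaturized: a scene program declares placements, and a search loop applies small program edits until geometric conflicts disappear. This 1-D toy (the `Box`, `scene_program`, and `repair` names are hypothetical, not PSDL) mirrors the detect-and-correct flow described above:

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x: float      # left edge along a 1-D shelf, for simplicity
    width: float

def overlaps(a: Box, b: Box) -> bool:
    return a.x < b.x + b.width and b.x < a.x + a.width

def scene_program():
    """A 'scene as code': declarative placements that may conflict."""
    return [Box("table", 0.0, 4.0),
            Box("lamp", 3.5, 1.0),    # deliberately collides with the table
            Box("chair", 3.8, 1.5)]

def repair(scene, max_edits=10):
    """Symbolic correction: nudge objects rightward until no overlaps remain."""
    edits, changed = 0, True
    while changed and edits < max_edits:
        changed = False
        for i, a in enumerate(scene):
            for b in scene[i + 1:]:
                if overlaps(a, b):
                    b.x = a.x + a.width  # one program edit: place b after a
                    edits += 1
                    changed = True
    return scene, edits

scene, n_edits = repair(scene_program())
```

Like the Program Search module, the repair loop edits the *program state* symbolically rather than re-running a generative model, which is why few iterations suffice.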
An Ultra-Large Embodied VLM Goes Open Source: Pioneering the DPPO Training Paradigm and Setting a Cost-Performance Ceiling, from the Beijing Humanoid Robot Innovation Center
机器之心· 2025-11-14 10:32
Core Insights
- The article highlights the launch of Pelican-VL 1.0, an open-source embodied-intelligence VLM billed as the largest of its kind in the industry, covering 7B and 72B parameter scales [1][4]

Group 1: Model Performance and Training
- Pelican-VL achieves a 20.3% performance improvement over baseline models and surpasses comparable open-source models by 10.6% [4][11]
- The model was trained on a cluster of more than 1,000 A800 GPUs, with a single checkpoint consuming over 50,000 A800 GPU-hours [4]
- Training uses a novel "Deliberate Practice Policy Optimization" (DPPO) paradigm that mimics human metacognitive learning to strengthen the model's capabilities [8][10]

Group 2: Capabilities and Applications
- Pelican-VL shows significant advances in multimodal understanding and reasoning, processing visual and textual inputs to perform complex tasks [12][13]
- The model excels at spatial-temporal reasoning, understanding action sequences and making predictions in dynamic scenarios [13]
- It demonstrates strong embodied-interaction capabilities, generating detailed action plans for robotic tasks such as object manipulation and navigation [13]

Group 3: Industry Implications
- Because Pelican-VL is open source, other labs and companies can customize its training, accelerating practical applications of VLMs in robotics [23]
- Its development addresses the scarcity of high-quality embodied data and the lack of evaluation benchmarks, paving the way for future advances in the field [23]
- Pelican-VL is a significant step toward robots that not only recognize objects but also make informed decisions about how to interact with their environment [23][28]
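The article gives no algorithmic detail of DPPO beyond the "deliberate practice" framing, so purely as a loose analogy: deliberate practice can be read as allocating more training budget to whichever skills currently show the highest error. A deterministic toy allocator (the skill names and error rates are invented for illustration):

```python
# Hypothetical skill categories with the model's current error rates.
error_rate = {"grasping": 0.6, "navigation": 0.2, "stacking": 0.5}

def practice_round(budget=100):
    """'Deliberate practice': spend more examples where error is highest."""
    total = sum(error_rate.values())
    for skill, err in list(error_rate.items()):
        n = int(budget * err / total)             # examples allotted to this skill
        error_rate[skill] = max(0.05, err - 0.001 * n)  # practice reduces error

before = dict(error_rate)
for _ in range(10):
    practice_round()
```

The behavior to notice: the weakest skill receives the most practice and improves fastest, a curriculum shape the "metacognitive" framing suggests, without claiming this is DPPO's actual objective.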
OpenAI's New Paper Dissects the Inner Workings of Language Models: Explaining Model Behavior with "Sparse Circuits"
机器之心· 2025-11-14 09:30
Core Insights
- The article discusses OpenAI's new research on training smaller, sparser models that are easier to interpret, addressing the "black box" nature of large language models [1][12][26]

Group 1: Model Transparency and Interpretability
- Most large language models operate as "black boxes," making it difficult even for experts to understand their internal processes [1]
- Greater model transparency can help analyze and explain issues such as hallucinations and unstable behavior in language models [1][12]
- OpenAI's research aims to isolate small circuits within sparse models that are responsible for specific tasks, offering unprecedented insight into how language models operate [7][12]

Group 2: Sparse Model Training Methodology
- OpenAI's approach trains models with sparse weights, limiting the connections between neurons to simplify the model's structure [14][26]
- The research shows that training larger, sparser models yields simpler, more interpretable circuits that can still perform specific tasks effectively [17][19]
- The study highlights a task in which the model must choose the correct type of quotation mark in Python code, demonstrating that simple behaviors can be isolated and executed by small circuits [19][22]

Group 3: Future Directions and Challenges
- OpenAI acknowledges that while this research is a step toward understanding model computation, there is still a long way to go [26]
- Future work will focus on scaling these techniques to larger models and explaining more complex behaviors [26]
- OpenAI is exploring two paths to make sparse-model training more efficient: extracting sparse circuits from existing dense models, and developing more efficient interpretability-guided training techniques [26]
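Weight-sparse training can be shown in miniature: constrain a toy linear model so that only K connections may be nonzero, by hard-thresholding after each gradient step. This is generic iterative hard thresholding on a least-squares problem, used here as a stand-in for OpenAI's actual (unpublished-in-this-summary) sparsity method; the point is only that sparsity forces the model to route the task through few connections, which are then easy to read off:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target depends on only 3 of 10 input features.
X = rng.standard_normal((400, 10))
true_w = np.zeros(10)
true_w[[2, 5, 8]] = [1.5, -2.0, 1.0]
y = X @ true_w

w = np.zeros(10)
K = 3  # sparsity budget: at most K nonzero weights (connections)

for _ in range(300):
    grad = X.T @ (X @ w - y) / len(X)   # least-squares gradient
    w -= 0.1 * grad
    # Project back onto the sparse set: zero all but the K largest weights.
    keep = np.argsort(np.abs(w))[-K:]
    mask = np.zeros_like(w)
    mask[keep] = 1.0
    w *= mask

# The surviving connections form a tiny, directly inspectable "circuit".
support = set(int(i) for i in np.flatnonzero(w))
```

In the dense analogue all ten weights would be nonzero and the mechanism would be smeared across them; the sparsity constraint is what makes the circuit legible.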
Baidu Unveils a Secret Weapon: A Self-Evolving AI Delivers Optimal Solutions Beyond Human Reach
机器之心· 2025-11-14 09:30
Core Insights
- The article discusses AI's rapid evolution from mere executor to inventor, highlighting Baidu's FM Agent, a self-evolving intelligent agent capable of solving complex problems autonomously [1][6][30]

Group 1: AI Capabilities and Innovations
- FM Agent can autonomously generate and optimize algorithms, dramatically shortening tasks that would take human experts days or even weeks [4][8]
- The system combines large language models with evolutionary search algorithms to tackle real-world problems, a leap from executing commands to discovering solutions independently [6][8]
- Its performance has been validated on multiple benchmarks, including a medal rate of 43.56% on MLE-Bench, outperforming the human median by 51.56% [13]

Group 2: Technical Features
- FM Agent rests on four core capabilities: automated machine-learning pipelines, combinatorial optimization, GPU kernel generation, and mathematical problem solving [13][14]
- Its workflow includes cold-start initialization, adaptive diversity sampling, and a distributed asynchronous infrastructure built on the Ray framework [12][14]

Group 3: Industry Applications
- FM Agent has proven effective across sectors including finance, urban traffic optimization, and large-scale engineering projects, delivering solutions faster and more efficiently than traditional methods [25][18]
- The agent abstracts real-world problems into mathematical formulations, then iterates and optimizes solutions against clear evaluation metrics [18][20]

Group 4: Future Implications
- FM Agent signals a shift toward a paradigm in which humans define problems and AI executes solutions, with the potential to transform productivity across industries [22][30]
- More than 1,000 enterprises have already signed up to test FM Agent, indicating strong interest and potential for widespread application in sectors such as transportation, energy, and finance [32][33]
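The evolutionary-search core (propose variants of the current best solution, score them against a clear metric, keep improvements) can be shown without any LLM at all. In this generic sketch, `mutate` stands in for the LLM's proposal step and `score` for the evaluation metric; the objective is an arbitrary toy benchmark, not one of FM Agent's tasks:

```python
import random

random.seed(0)

def score(x):
    """Evaluation metric the agent optimizes (toy: peak at all-threes)."""
    return -sum((xi - 3.0) ** 2 for xi in x)

def mutate(x, step=0.5):
    """Stand-in for an LLM proposing a modified candidate solution."""
    return [xi + random.uniform(-step, step) for xi in x]

best = [0.0] * 4
best_score = score(best)
for generation in range(500):
    # Propose several variants of the incumbent, keep any improvement
    # (FM Agent's diversity sampling and async scheduling are omitted).
    for candidate in (mutate(best) for _ in range(8)):
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
```

The "self-evolving" property is just this loop run at scale: as long as the metric is well-defined, the search improves solutions with no human guidance.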
The 2025 Baoshan Intelligent Robot Industry Conference and Carnival Invites You to a Feast of Technology
机器之心· 2025-11-14 09:30
Core Insights
- The "2025 Baoshan Intelligent Robot Industry Conference and Carnival" will take place on November 21-22, 2025, featuring industry forums, startup roadshows, exhibitions, and interactive experiences [2][9]
- The event will unveil Baoshan District's three-year action plan for robotics along with a series of industrial service platforms [2]
- Four major forums will cover the entire industry chain, focusing on macro trends, technological breakthroughs, core components, and innovation ecosystems [2]

Industry Forums
- The main forum gathers top experts to discuss the industry's macro blueprint and trends [2]
- The embodied intelligence and robotics forum focuses on humanoid robots, service robots, and practical applications of embodied data systems [2]
- The core technology and components forum addresses breakthroughs in domestic core components such as joint modules, sensors, and servo drives [2]
- The innovation and talent forum facilitates cross-industry dialogue to activate talent and innovation synergies [2]

Startup Roadshow
- The afternoon of the second day features a roadshow of cutting-edge robotics startups, including projects in core areas such as robot bodies, simulation technology, and machine vision [4]
- Leading investment institutions will be present to support quality startup projects and foster partnerships for industrial implementation [4]

Exhibition Highlights
- The exhibition showcases nearly 30 leading companies in robotics and components, featuring humanoid robots, industrial robots, sensors, and servo drives [6]
- Attendees can experience the latest achievements of China's intelligent robotics technology [6]

Interactive Experiences
- The event offers interactive experiences including robot performances, martial arts competitions, and AI interactions [8]
- Attendees can also enjoy AI calligraphy and personalized coffee brewing, highlighting technology's potential in daily life [8]
NeurIPS Spotlight | GHAP: Turning 3DGS "Pruning" into "Reconstructing a Smaller Gaussian World"
机器之心· 2025-11-14 09:30
Core Viewpoint
- The article presents a novel approach to 3D Gaussian Splatting (3DGS) compression that frames it as Gaussian mixture model simplification, effectively reducing redundancy while preserving geometric detail [4][28]

Summary by Sections

Introduction
- 3DGS is a popular 3D scene modeling method that uses large numbers of Gaussian primitives to build high-quality 3D representations; however, redundancy among the Gaussians limits storage and rendering speed [4]

Methodology
- The proposed Gaussian-Herding-across-Pens (GHAP) method treats the entire 3DGS model as a Gaussian mixture and globally reconstructs a smaller mixture, preserving the geometric structure while reducing the number of Gaussians [8][9]
- GHAP uses a two-stage process: it first simplifies the geometric information (position/covariance) and then refines the appearance features (opacity/color); this decoupling improves stability [9][19]

Experimental Results
- GHAP was compared with a range of pruning-based and end-to-end compression methods; it consistently outperforms the other baselines while approaching the performance of full-sample end-to-end methods [20][24]
- At a 10% retention rate, GHAP maintains high visual fidelity across different models and scenes, demonstrating its effectiveness at preserving the original geometric structure [23][24]

Conclusion
- GHAP offers a new perspective on 3DGS compression, focusing on Gaussian mixture simplification to retain geometric detail; it scales to large 3DGS scenes and is compatible with most existing 3DGS frameworks [28]
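The two-stage idea (simplify geometry first, then refit appearance) can be illustrated with a plain Gaussian-mixture reduction: opacity-weighted k-means merges component means, and each kept component then pools the opacity mass of the ones it absorbs. A numpy toy, not GHAP's actual herding objective; covariances and colors are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "scene": 300 Gaussian components with 3-D means and opacities.
means = np.concatenate(
    [rng.normal(c, 0.1, (100, 3)) for c in ((0, 0, 0), (5, 0, 0), (0, 5, 0))])
opacity = rng.uniform(0.2, 1.0, 300)

def simplify(means, opacity, k=3, iters=20):
    """Stage 1: merge geometry via opacity-weighted k-means on the means."""
    centers = means[rng.choice(len(means), k, replace=False)]
    for _ in range(iters):
        dists = ((means[:, None] - centers[None]) ** 2).sum(-1)
        assign = np.argmin(dists, axis=1)
        for j in range(k):
            member_w = opacity[assign == j]
            if len(member_w):
                centers[j] = np.average(means[assign == j], axis=0,
                                        weights=member_w)
    # Stage 2: refit appearance -- pool opacity mass into each kept Gaussian.
    new_opacity = np.array([opacity[assign == j].sum() for j in range(k)])
    return centers, new_opacity, assign

centers, new_opacity, assign = simplify(means, opacity)
```

The decoupling matters: the cluster assignment settles the geometry before any appearance quantity is touched, so the appearance refit cannot drag components away from the scene's structure.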
FDA Dual Anchors: A New Perspective on Model Knowledge Transfer, from Parameter Space to Input Space
机器之心· 2025-11-14 01:33
Core Insights
- The article introduces FDA (Model Merging with Functional Dual Anchors), a novel framework for merging expert models derived from task-specific fine-tuning of a shared foundation model, integrating their capabilities into a single model without access to the original training data [2][4]

Group 1: FDA Framework Overview
- FDA represents the task knowledge embedded in model parameters with a set of dual synthetic input points, enabling efficient knowledge integration through the gradients these points induce [4][10]
- Unlike traditional methods that rely on arithmetic in parameter space, FDA shifts knowledge integration into the input space, offering a new perspective on model merging [4][9]
- The framework scales to large neural networks, demonstrating superior performance and scalability on both vision and natural language models [4][12]

Group 2: Performance and Robustness
- Experiments show FDA significantly outperforms traditional task-vector methods, averaging 87.26 in multi-task scenarios versus 73.94 for task vectors, an improvement of nearly 18% [14]
- FDA exhibits flexible knowledge modeling, with average performance gains of roughly 5.10% on ViT-B/16 and about 13% on RoBERTa-Large, showing adaptability across architectures [15]

Group 3: Algorithm Implementation
- The FDA algorithm has two main phases: constructing FDA samples for each downstream task, then updating parameters based on the constructed FDA [17][19]
- Two practical initialization strategies are proposed for FDA construction: linear weight sampling and scaled Gaussian sampling, which provide effective starting points for optimization [18]

Group 4: Knowledge Encoding and Mechanisms
- FDA captures the dominant task-related representation directions while suppressing redundant or noisy components, consistent with the low-rank structure typically observed in task-specific knowledge in parameter space [22]
- During optimization, FDA's high-energy subspace aligns with the high-energy subspace of the real data, suggesting a connection between the knowledge encoded in FDA and the actual task data [23]
- The parameter updates induced by FDA progressively align with those induced by real data, demonstrating FDA's robustness and effectiveness at capturing task-related knowledge [24]
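The dual-anchor idea in its simplest possible form: for a linear "expert," a handful of synthetic inputs plus the expert's outputs on them fully determine its parameters, so a base model can absorb the expert's knowledge by fitting those input-space anchors alone, with no original training data. A toy linear sketch (FDA's real construction optimizes the anchors for deep networks; here random probes suffice because the model is linear):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 6
w_expert = rng.standard_normal(D)   # a fine-tuned expert we cannot keep around

# Dual anchors: synthetic inputs that carry the expert's knowledge.
# D linearly independent probes pin down a D-dimensional linear map exactly.
anchors = rng.standard_normal((D, D))
targets = anchors @ w_expert        # record expert behavior, then discard it

# Knowledge integration happens in input space: fit the merged model so its
# outputs match the expert's on the anchors alone (no training data needed).
w_merged, *_ = np.linalg.lstsq(anchors, targets, rcond=None)

recovery_error = float(np.linalg.norm(w_merged - w_expert))
```

In the deep, multi-expert setting the anchors cannot determine the weights exactly, which is why FDA optimizes them to capture each task's dominant directions; the linear case just shows why input-space points can stand in for parameter-space knowledge at all.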