Is 3D Vision Over-Engineered? ByteDance's Depth Anything 3 Arrives, Earning Praise from Saining Xie
机器之心· 2025-11-15 09:23
机器之心 report. Editors: 泽南, 杨文

Now, a simple Transformer trained with a depth-ray representation is all you need. This work argues that most of today's 3D vision research is over-engineered.

On Friday, the hottest topic in the AI community was a new paper on 3D modeling. After more than a year of exploration, a team at ByteDance released Depth Anything 3 (DA3), extending monocular depth estimation to arbitrary-view scenarios and giving computers spatial perception that rivals human ability. In pursuit of minimal modeling, the DA3 work rests on two key insights, and with just this approach it improves on the current state of the art (SOTA) by 44% on pose estimation and 25% on geometry estimation.

Is 3D vision really this simple? Saining Xie, assistant professor of computer science at New York University and a well-known AI researcher, remarked that papers are a bit like movies: the first installment is usually the best, and sequels tend to be more complicated yet no better. That, he noted, does not apply at all to the Depth Anything series; Bingyikang's team manages to make things simpler and more scalable every time.

Paper: https://arxiv.org/abs/2511.10647
Project page: https://depth-anything-3.github.io
Code: htt ...
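To make the "depth-ray" interface concrete: the model consumes an image and emits, for every pixel, a depth value plus a camera-ray parameterization. The sketch below shows one plausible shape of that interface in PyTorch; the patch size, widths, and 7-channel output layout (1 depth + 6 ray parameters) are illustrative assumptions, not the DA3 architecture.

```python
# Minimal sketch of a depth-ray prediction interface (illustrative only;
# module sizes and head design are assumptions, not the DA3 architecture).
import torch
import torch.nn as nn

class DepthRayTransformer(nn.Module):
    def __init__(self, patch=16, dim=384, layers=6):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        # Per-patch outputs: 1 depth value + 6 ray values (origin xyz, direction xyz).
        self.head = nn.Linear(dim, patch * patch * (1 + 6))

    def forward(self, images):                                  # (B, 3, H, W)
        B, _, H, W = images.shape
        tokens = self.embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        out = self.head(self.encoder(tokens))                   # (B, N, p*p*7)
        h, w = H // self.patch, W // self.patch
        out = out.view(B, h, w, self.patch, self.patch, 7)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, 7, H, W)
        return out[:, :1], out[:, 1:]                           # depth, rays

model = DepthRayTransformer()
depth, rays = model(torch.randn(2, 3, 224, 224))
print(depth.shape, rays.shape)  # (2, 1, 224, 224) and (2, 6, 224, 224)
```

The appeal of such a design, which echoes the paper's minimal-modeling thesis, is that pose and geometry both fall out of one dense prediction task with no specialized multi-view machinery.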
NeurIPS 2025 | When AI Learns to "Trade Stocks": Reproducing Emergent Financial-Market Phenomena with a Thousand Virtual Investors
机器之心· 2025-11-15 09:23
Core Insights
- The article discusses TwinMarket, a scalable behavioral and social simulation platform for financial markets driven by large language models (LLMs), aiming to replicate human-like decision-making and social interactions in trading environments [2][4].

Group 1: Traditional Market Simulation Limitations
- Traditional market simulation methods rely on preset rules, leading to three fundamental limitations: behavioral homogeneity, lack of social interaction, and black-box cognitive processes [5][6].
- These models often assume a "standard investor," failing to capture the heterogeneity of real market participants [6].
- Social-media influence and the complexity of information dissemination are inadequately modeled in traditional frameworks [6].

Group 2: TwinMarket's Innovations
- TwinMarket introduces the Belief-Desire-Intention (BDI) cognitive framework, marking a paradigm shift from rule-based to cognitive-reasoning models [7][10].
- The BDI framework allows AI agents to reflect on their decisions, improving through cognitive updates rather than gradient descent [12].

Group 3: Data-Driven Simulation Environment
- TwinMarket is grounded in real data, initializing user profiles from the trading records of 639 investors covering 11,965 transactions [15][19].
- The platform incorporates various data sources, including stock recommendations and news articles, to simulate a realistic trading environment [20].

Group 4: Micro and Macro Behavioral Insights
- The simulation shows that wealth inequality naturally emerges and widens within a fair virtual market, with the Gini coefficient increasing over time [25][26].
- Frequent trading correlates with poorer returns, reflecting human behavioral biases such as overconfidence and emotional decision-making [27].

Group 5: Stylized Facts Validation
- TwinMarket successfully replicates four stylized facts of real markets: fat-tailed return distributions, the leverage effect, the volume-price relationship, and volatility clustering [31][32][33][34].
- The simulation captures collective behavior driving market volatility, demonstrating how individual biases can amplify into macro-level crises [36].

Group 6: Scalability and Practical Applications
- TwinMarket exhibits strong scalability, maintaining high correlation with real market price movements even in large-scale experiments with 1,000 agents [44][46].
- The platform serves as a valuable tool for understanding complex socio-economic systems, allowing researchers to test theories and evaluate regulatory impacts in a controlled environment [52][56].

Group 7: Future Directions
- Future work aims to enrich market mechanisms and introduce macroeconomic interactions, extending the simulation to a wider range of financial ecosystems [64][65].
- Cross-disciplinary applications, including political and public-health simulations, are also envisioned [66].
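To ground the BDI description above, here is a minimal sketch of what one perceive-deliberate-act step of an LLM-driven trading agent could look like. The class layout, prompt wording, and stub LLM are assumptions for illustration, not TwinMarket's actual API.

```python
# Minimal sketch of a Belief-Desire-Intention (BDI) trading-agent step.
# All names and the prompt are illustrative assumptions, not TwinMarket's code.
from dataclasses import dataclass, field

@dataclass
class BDIAgent:
    name: str
    beliefs: dict = field(default_factory=dict)      # e.g. prices, peer sentiment
    desires: list = field(default_factory=list)      # e.g. "grow wealth"
    intentions: list = field(default_factory=list)   # concrete orders to execute

    def perceive(self, market_state, social_feed):
        # Update beliefs from prices and peer posts (a cognitive update, no gradients).
        self.beliefs["last_prices"] = market_state
        self.beliefs["peer_sentiment"] = social_feed

    def deliberate(self, llm):
        # Ask the LLM to turn beliefs + desires into an intention (an order).
        prompt = (f"Beliefs: {self.beliefs}\nDesires: {self.desires}\n"
                  "Decide one action: BUY, SELL, or HOLD, with a one-line reason.")
        self.intentions.append(llm(prompt))

    def act(self):
        return self.intentions.pop(0) if self.intentions else "HOLD"

def fake_llm(prompt):  # stand-in for a real LLM call
    return "HOLD: peer sentiment is mixed and prices are flat."

agent = BDIAgent("investor_42", desires=["grow wealth", "avoid large drawdowns"])
agent.perceive({"AAPL": 187.2}, ["bullish post", "bearish post"])
agent.deliberate(fake_llm)
print(agent.act())
```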
EMNLP 2025 | 通研院 (BIGAI) Demystifies MoE Interpretability, Boosting Context Faithfulness!
机器之心· 2025-11-15 06:23
Core Insights
- The article discusses the integration of mechanistic interpretability with Mixture-of-Experts (MoE) models, highlighting the importance of understanding the underlying mechanisms to enhance model performance and explainability [4][5][6].

Group 1: Mechanistic Interpretability and MoE
- Many teams work on MoE models, but few focus on mechanistic interpretability, making this a rare and valuable area of research [4].
- The article proposes a method called "Router Lens & CEFT" aimed at improving context faithfulness in language models, which has been accepted to EMNLP 2025 [7][9].
- The research identifies experts within MoE models that are particularly adept at utilizing contextual information, termed "Context-Faithful Experts" [14][18].

Group 2: Context Faithfulness and Expert Specialization
- Context faithfulness refers to the model's ability to generate responses based strictly on the provided context, avoiding irrelevant information [10].
- The study confirms the existence of context-faithful experts within MoE models, demonstrating that adjusting expert activation can significantly enhance context utilization [18][20].
- The Router Lens method identifies these experts by calibrating routing behavior to reflect their true capabilities [16].

Group 3: Performance Improvements and Efficiency
- The CEFT method, which fine-tunes only the identified context-faithful experts, can achieve or exceed the performance of full-parameter fine-tuning while significantly reducing the number of trainable parameters [41][44].
- CEFT requires training only 500 million parameters versus 6.9 billion for full fine-tuning, a 13.8× reduction in parameter count [44].
- CEFT demonstrates superior resistance to catastrophic forgetting compared to full fine-tuning, as evidenced by performance metrics across various benchmarks [46].

Group 4: Future Applications and Research Directions
- The Router Lens method can be applied to identify and analyze other types of experts, such as those specialized in reasoning or programming [50].
- It can also help debug MoE models by locating poorly performing or misleading experts [51].
- Combining Router Lens with other interpretability techniques could further illuminate expert behavior and knowledge distribution within models [51].
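The CEFT recipe above boils down to selective unfreezing: train only the experts that Router Lens flags as context-faithful. Below is a minimal sketch, assuming the MoE's expert weights are addressable by a name pattern like `experts.<i>.` (a hypothetical convention, not the paper's code).

```python
# Sketch of CEFT-style selective fine-tuning: freeze everything, then unfreeze
# only the experts identified as context-faithful. The "experts.<i>." naming
# pattern is an assumption about the model's module layout.
import torch.nn as nn

def apply_ceft(model: nn.Module, faithful_expert_ids: set) -> int:
    trainable = 0
    for name, param in model.named_parameters():
        # A parameter is trainable iff it belongs to one of the selected experts.
        param.requires_grad = any(f"experts.{i}." in name for i in faithful_expert_ids)
        trainable += param.numel() if param.requires_grad else 0
    return trainable

# Toy MoE layer (a router plus 8 expert MLPs) just to exercise the helper.
class ToyMoE(nn.Module):
    def __init__(self, dim=32, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

moe = ToyMoE()
print(apply_ceft(moe, {2, 5}))  # only experts 2 and 5 remain trainable
```

The parameter savings the article reports follow directly from this pattern: the trainable set shrinks to the few selected experts while the backbone stays frozen.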
When AI Redefines "Research Impact": A Reflection on and Reshaping of CSRankings
机器之心· 2025-11-15 06:23
Core Viewpoint
- The article discusses the evolution of academic ranking systems, emphasizing the shift from quantity-based metrics, such as the number of published papers, to quality-based assessments that reflect true academic impact and influence [2][12].

Group 1: Issues with Current Ranking Systems
- Traditional ranking systems like USNews rely on subjective surveys, while CSRankings uses objective metrics like publication counts, leading to a competition focused on quantity rather than quality [2][3].
- Relying on citation counts to measure academic influence has its own drawbacks, as not all citations indicate significant contributions to the field [3][4].

Group 2: New Approaches to Measuring Impact
- A new academic ranking system has been developed by researchers from Oregon State University and the University of California, Santa Cruz, utilizing large language models (LLMs) to assess the impact of academic papers [5][7].
- The LLM analyzes top AI conference papers from 2020-2025 to identify the five most important references cited by each paper, aiming to uncover the foundational works that drive innovation in the field [7][8].

Group 3: Implementation of the New Ranking System
- The new system maps the identified key references back to their authors and institutions, assigning academic influence points based on how often a paper is cited as a key reference by new research [10][12].
- This approach rewards institutions that contribute to groundbreaking discoveries and foundational research, shifting the focus from mere publication counts to genuine academic influence [12][13].

Group 4: Results and Rankings
- The resulting rankings highlight institutions that have significantly impacted their fields, offering a more nuanced view of academic contributions [12][14].
- The article provides specific rankings of institutions based on their impact scores, illustrating the effectiveness of this new methodology in recognizing true academic excellence [16][21].
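The scoring rule described above is simple to state: each time the LLM picks a paper as one of a new paper's five key references, the cited paper's institutions each earn one influence point. A toy sketch with hypothetical data (trimmed to two references per paper for brevity):

```python
# Sketch of the influence-point aggregation: key-reference citations award
# points to the cited paper's institutions. All data here is hypothetical.
from collections import Counter

# paper -> key references selected by the LLM (illustrative)
key_refs = {
    "paperA_2025": ["attention_2017", "resnet_2016"],
    "paperB_2025": ["attention_2017", "clip_2021"],
}
# cited paper -> institutions of its authors (illustrative)
institutions = {
    "attention_2017": ["Google"],
    "resnet_2016": ["Microsoft"],
    "clip_2021": ["OpenAI"],
}

scores = Counter()
for refs in key_refs.values():
    for ref in refs:
        for inst in institutions.get(ref, []):
            scores[inst] += 1  # one influence point per key-reference citation

print(scores.most_common())  # [('Google', 2), ('Microsoft', 1), ('OpenAI', 1)]
```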
From "Behavioral Data" to "AI Memory": Which Route Is More Likely to Achieve AI's "Lifetime Memory" of Users?
机器之心· 2025-11-15 02:30
Core Viewpoint
- The article discusses the ongoing competition in the AI industry around long-term memory systems, highlighting the different approaches companies take to enhance user experience and product differentiation in the AI landscape [1].

Group 1: From "Behavior Data" to "AI Memory"
- Current AI products, such as assistants and virtual companions, primarily operate on a one-time interaction basis, which diminishes user trust and engagement [4].
- Long-term memory should be a core design element from the outset, rather than an afterthought, as emphasized by Artem Rodichev of Ex-human [4].
- Effective memory systems must balance retaining significant events, updating based on user interactions, and giving users control over memory management [4].
- The true challenge in product differentiation lies not in replicating features but in how products learn and adapt through memory [4].
- Mainstream personal-assistant systems categorize memory into short-term, mid-term, and long-term layers, deepening the understanding of user behavior over time [4].
- The interconnectedness of these memory layers creates a "behavioral compounding" effect, making this contextual depth difficult for competitors to replicate [4].
- Companies are making strategic choices about what to remember, for whom, and for how long, aiming to build a competitive edge through unique memory systems [4].

Group 2: Routes to Achieve AI's "Lifetime Memory"
- Various product routes have emerged around AI long-term memory, each emphasizing a different strategic narrative: privacy, cost efficiency, speed, or depth of integration [5].
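As one concrete reading of the three-layer design described above, the sketch below shows how short-, mid-, and long-term stores could interact, with overflow promotion and user-controlled deletion. The capacities and promotion rules are assumptions for illustration, not any vendor's implementation.

```python
# Sketch of a three-layer memory store (short/mid/long term). Promotion rules
# and layer capacities are illustrative assumptions.
import time

class LayeredMemory:
    def __init__(self, short_cap=20, mid_cap=200):
        self.short, self.mid, self.long = [], [], []
        self.short_cap, self.mid_cap = short_cap, mid_cap

    def remember(self, event: str, important: bool = False):
        self.short.append((time.time(), event))
        if important:
            self.long.append(event)            # significant events persist
        if len(self.short) > self.short_cap:   # overflow rolls into mid-term
            self.mid.append(self.short.pop(0))
        if len(self.mid) > self.mid_cap:
            self.mid.pop(0)                    # mid-term memory decays

    def forget(self, event: str):
        # User-controlled deletion across all layers.
        self.long = [e for e in self.long if e != event]
        self.mid = [(t, e) for t, e in self.mid if e != event]
        self.short = [(t, e) for t, e in self.short if e != event]

mem = LayeredMemory(short_cap=2)
for e in ["opened app", "asked about flights", "booked Paris trip"]:
    mem.remember(e, important=(e == "booked Paris trip"))
print(len(mem.short), len(mem.mid), mem.long)  # 2 1 ['booked Paris trip']
```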
AAAI 2026 | Teaching Video Diffusion Models to "Understand Scientific Phenomena": Generating the Entire Physical Evolution from an Initial Frame
机器之心· 2025-11-15 01:37
Core Insights
- The article discusses the limitations of existing video generation models like Stable Diffusion and CogVideoX in accurately simulating scientific phenomena, highlighting their tendency to produce physically implausible results [2][3].
- A new framework proposed by a research team from Dongfang University and Shanghai Jiao Tong University aims to enable video diffusion models to learn "latent scientific knowledge," allowing them to generate scientifically accurate video sequences from a single initial frame [3][4].

Methodology
- The proposed method consists of three main steps: latent knowledge extraction, pseudo-language prompt generation, and knowledge-guided video generation [8].
- The first step extracts "latent scientific knowledge" from a single initial image, which is crucial for inferring the subsequent dynamic evolution [9].
- The second step generates pseudo-language prompts by leveraging the CLIP model's cross-modal alignment capabilities, allowing the model to "understand" the underlying structural rules in the initial image [10].
- The third step integrates these pseudo-language prompts into existing video diffusion models, enabling them to simulate scientific phenomena while adhering to physical laws [11].

Experimental Results
- The research team conducted extensive experiments on fluid-dynamics simulation data and real typhoon observation data, showing that the new model generates videos that are not only visually superior but also more scientifically accurate [13][18].
- The model was tested on various fluid-simulation scenarios, including Rayleigh-Bénard Convection, Cylinder Flow, DamBreak, and DepthCharge, as well as real satellite data from four typhoon events [13][18].
- Quantitative evaluations showed significant improvements in physical-consistency metrics, with the new model outperforming mainstream methods in all tested scenarios [18].

Future Implications
- This research represents a meaningful exploration of generative AI in scientific modeling, suggesting that AI can evolve from mere visual generation toward understanding and simulating physical processes [19][20].
- Potential applications extend to meteorological forecasting, fluid simulation, and Earth-system modeling, positioning AI as a valuable tool for scientists [20].
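The second step above (pseudo-language prompts) can be pictured as: embed the initial frame with CLIP, then project that embedding into a handful of vectors that stand in for text-encoder output when conditioning the diffusion model's cross-attention. A minimal sketch using Hugging Face's CLIP; the projection layer and token count are assumptions, not the paper's design.

```python
# Sketch of pseudo-language prompting: a CLIP image embedding is mapped to
# "pseudo tokens" that replace text embeddings as diffusion conditioning.
# The projection and num_tokens=8 are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

num_tokens = 8
text_dim = clip.config.text_config.hidden_size
to_pseudo_prompt = nn.Linear(clip.config.projection_dim, num_tokens * text_dim)

image = Image.new("RGB", (224, 224))  # stand-in for the real initial frame
pixels = proc(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=pixels)  # (1, projection_dim)

pseudo_prompt = to_pseudo_prompt(img_emb).view(1, num_tokens, text_dim)
# pseudo_prompt now plays the role of text-encoder output when fed to a video
# diffusion model's cross-attention layers.
print(pseudo_prompt.shape)  # torch.Size([1, 8, 512])
```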
From a "Second-Tier" College and Weak at Math: How He Became the Father of PyTorch and a Meta VP
机器之心· 2025-11-15 01:37
Core Insights
- The article highlights the inspiring journey of Soumith Chintala, the creator of PyTorch, emphasizing his resilience and determination despite numerous setbacks in his academic and professional life [2][22].

Group 1: Early Life and Education
- Soumith Chintala struggled with mathematics, which hindered his admission to top universities in India, leading him to attend a second-tier institution, VIT [5].
- He improved his academic performance and achieved a GRE score of 1420, which was impressive at the time [6].
- Despite his efforts, he was rejected by 12 U.S. universities when applying for a master's program [7].

Group 2: Career Challenges
- After being rejected by 15 universities, he eventually received an offer from the University of Southern California, and later a waitlist offer from New York University [8][9].
- He faced multiple rejections from companies, including DeepMind, and initially worked as a test engineer at Amazon [10][12].
- He also ran into significant visa troubles that compounded his professional struggles, but eventually secured a waiver to stay in the U.S. [13].

Group 3: Breakthrough and PyTorch Development
- At Facebook AI Research (FAIR), he solved a critical issue that senior engineers could not, a significant turning point in his career [18].
- He played a key role in the development of PyTorch, which became a widely used deep learning framework [20].
- Despite PyTorch's success, he faced internal challenges at Meta over the project's future, but it was ultimately preserved [20].

Group 4: Reflection and Future
- Soumith's story illustrates the importance of resilience and the belief that perseverance can lead to success, regardless of current circumstances [22][32].
- He expressed gratitude to the mentors and family who supported him throughout his journey [25][28].
- As he prepares to leave Meta for new ventures, anticipation surrounds his future contributions to the field [34].
No Supervision Signals Required! A Self-Play Mechanism Lets Deep Search Agents Self-Evolve
机器之心· 2025-11-15 01:37
Core Insights
- The article discusses the rising interest in search-based agents and the challenge of pushing their capabilities toward human-level performance [2].
- A new method called Search Self-Play (SSP) is proposed, allowing agents to evolve through self-play without the need for human annotation [5][21].
- SSP shows significant improvements on a range of open-domain question-answering benchmarks, demonstrating its effectiveness in enhancing agent capabilities [17].

Method Overview
- The SSP framework uses a single large language model acting as both the "Proposer" and the "Solver," engaging in adversarial training so that task difficulty rises dynamically as the model's capabilities improve [7][10].
- The training process consists of three stages: problem generation, collaborative verification, and adversarial solving, ensuring that generated questions are solvable and unique [9][10].

Experimental Results
- SSP was evaluated across seven open-domain question-answering benchmarks, consistently outperforming baseline methods [16][17].
- Notably, the Qwen2.5-7B-Base model gained 26.4 points on average, including a remarkable 40.4-point improvement on TriviaQA [17].
- SSP also proved effective for instruction-tuned models, improving their performance by an average of 8.0 points [17].

Implications and Future Directions
- The SSP paradigm represents a shift toward models competing against themselves, potentially reaching superhuman performance without human supervision [21][22].
- The article suggests this self-play training method could become standard in future large-model training, as it enables rapid capability gains beyond the limits of human annotation [21].
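The three-stage loop above (propose, co-verify, adversarially solve) can be sketched as follows; the stub model, reward rule, and retrieval function are illustrative assumptions, not the paper's implementation.

```python
# Sketch of one Search Self-Play (SSP) round: one model acts as both proposer
# and solver. All class and function names are illustrative assumptions.
import random

class StubModel:
    def propose(self, docs):
        return "Which doc mentions X?", docs[0]      # question + reference answer
    def verify(self, question, answer, docs):
        return answer in docs                        # answer must be evidence-backed
    def solve(self, question, search_tool):
        return random.choice(search_tool(question))  # solver searches on its own

def search_tool(query):
    return ["doc_1", "doc_2"]                        # stand-in retrieval

def ssp_round(model, search_tool):
    docs = search_tool("seed topic")
    question, reference = model.propose(docs)        # 1. problem generation
    if not model.verify(question, reference, docs):  # 2. collaborative verification
        return None                                  #    drop unanswerable questions
    prediction = model.solve(question, search_tool)  # 3. adversarial solving
    return question, prediction, float(prediction == reference)

print(ssp_round(StubModel(), search_tool))
```

Because proposer and solver share weights, any gain on one side immediately hardens the other side's task, which is the mechanism behind the dynamically rising difficulty the article describes.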
SIGGRAPH Asia 2025 | Making 3D Scene Generation as Flexible and Controllable as "Writing Code"
机器之心· 2025-11-14 10:32
Core Viewpoint
- The rapid development of generative AI is pushing the boundaries of its creative capabilities, particularly in 3D scene generation, but existing methods face significant limitations in logical consistency and spatial relationships [2][3].

Group 1: Procedural Scene Programs (PSP) Framework
- The PSP framework, developed by research teams from Brown University and UC San Diego, has AI generate executable scripts that construct 3D scenes rather than directly outputting geometric parameters [3][8].
- This new paradigm lets AI "write" the logic of scene generation, yielding output that is highly editable, reusable, and structurally controllable [3][9].

Group 2: Components of PSP
- The system consists of two key components: a Procedural Scene Description Language (PSDL) for defining the rules of the generated world [9][10], and a Program Search module for automatically detecting and correcting geometric errors after execution [9][13].
- PSDL expresses spatial relationships through programming logic, enhancing the model's ability to define scene structure and layout [10][11].

Group 3: Error Correction Mechanism
- The Program Search module identifies inconsistencies in structure and geometry, employing a symbolic correction mechanism that needs fewer iterations to fix errors than traditional methods [13][14].
- The system can correct most errors within an average of about 7 program edits, significantly improving the physical consistency of generated scenes [14].

Group 4: Performance Comparison
- On 70 open-world scene prompts, PSP outperformed traditional methods, achieving human preference rates of 82.9% against DeclBase and 94.3% against Holodeck, while also generating scenes faster, averaging about 38 seconds [16][17].
- An automated evaluation corroborated these findings, with preference rates of 77.1% against DeclBase and 90.0% against Holodeck, in line with the human assessments [18].

Group 5: Significance and Future Outlook
- By joining programming logic with AI's generative imagination, PSP improves the controllability and interpretability of 3D content generation [20][21].
- The framework provides a new foundation for constructing virtual environments, game levels, and intelligent visual settings, marking a significant advance in AI-generated content [21].
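To illustrate the "scene as program" idea and the roughly seven-edit repair loop reported above, here is a toy 1-D analogue: spatial relations are executable functions, and a search loop applies one symbolic correction per iteration until no constraint is violated. The primitives are invented for illustration and are not the paper's actual PSDL.

```python
# Toy analogue of a procedural scene program with search-based repair.
# The DSL primitives here are illustrative assumptions, not the real PSDL.
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    x: float
    w: float  # 1-D footprint for brevity; the real system is 3-D

def on_right_of(a: Obj, b: Obj, gap=0.1):
    a.x = b.x + b.w / 2 + a.w / 2 + gap  # a spatial relation as executable logic

def overlaps(a: Obj, b: Obj):
    return abs(a.x - b.x) < (a.w + b.w) / 2

def build_scene():
    desk = Obj("desk", x=0.0, w=1.5)
    chair = Obj("chair", x=0.0, w=0.5)
    lamp = Obj("lamp", x=0.0, w=0.2)   # starts on top of the desk: a geometry error
    on_right_of(chair, desk)
    return [desk, chair, lamp]

def repair(scene, max_edits=7):
    # Program-search-style loop: find a violated constraint, apply one edit.
    for _ in range(max_edits):
        bad = [(a, b) for a in scene for b in scene if a is not b and overlaps(a, b)]
        if not bad:
            return scene
        a, b = bad[0]
        on_right_of(a, b)  # one symbolic correction per iteration
    return scene

for obj in repair(build_scene()):
    print(obj)  # all objects end up overlap-free within a few edits
```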
Ultra-Large-Parameter Embodied VLM Goes Open Source: A First-of-Its-Kind DPPO Training Paradigm and a Ceiling for Model Cost-Effectiveness, from 北京人形 (Beijing Humanoid Robot Innovation Center)
机器之心· 2025-11-14 10:32
Core Insights
- The article highlights the launch of Pelican-VL 1.0, an open-source embodied-intelligence VLM billed as the largest of its kind in the industry, covering 7B and 72B parameter scales [1][4].

Group 1: Model Performance and Training
- Pelican-VL achieves a 20.3% performance improvement over baseline models, surpassing similar open-source models by 10.6% [4][11].
- The model was trained on a cluster of more than 1,000 A800 GPUs, with a single checkpoint consuming over 50,000 A800 GPU-hours [4].
- Training uses a novel "Deliberate Practice Policy Optimization" (DPPO) paradigm, which mimics human metacognitive learning to strengthen model capabilities [8][10].

Group 2: Capabilities and Applications
- Pelican-VL demonstrates significant advances in multimodal understanding and reasoning, effectively processing visual and textual inputs to perform complex tasks [12][13].
- The model excels at spatial-temporal reasoning, allowing it to understand action sequences and make predictions in dynamic scenarios [13].
- It shows strong embodied-interaction capabilities, generating detailed action plans for robotic tasks such as object manipulation and navigation [13].

Group 3: Industry Implications
- Pelican-VL's open-source release lets other labs and companies customize training, accelerating the practical application of VLMs in robotics [23].
- The model's development addresses the scarcity of high-quality embodied data and the lack of evaluation benchmarks, paving the way for future advances in the field [23].
- Pelican-VL is a significant step toward robots that not only recognize objects but also make informed decisions about how to interact with their environment [23][28].
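One plausible schematic reading of the "deliberate practice" idea described above: evaluate, isolate weak skills, practice on exactly those, and repeat. Everything below (the stub VLM, skill names, update rule) is an assumption for illustration only, not Pelican-VL's training code.

```python
# Schematic "deliberate practice" loop: evaluate -> isolate weaknesses ->
# targeted practice -> repeat. StubVLM and all methods are illustrative
# assumptions, not the DPPO implementation.
class StubVLM:
    def __init__(self):
        self.skill = {"grasping": 0.4, "navigation": 0.7, "counting": 0.5}
    def evaluate(self, task):
        return self.skill[task]          # score on a held-out embodied task
    def finetune(self, tasks):
        for t in tasks:                  # practice improves only weak skills
            self.skill[t] = min(1.0, self.skill[t] + 0.2)

def deliberate_practice(model, tasks, rounds=5, threshold=0.6):
    for _ in range(rounds):
        weak = [t for t in tasks if model.evaluate(t) < threshold]
        if not weak:                     # stop once no skill is below the bar
            break
        model.finetune(weak)             # update (e.g., RL or SFT) on weak skills
    return model

model = deliberate_practice(StubVLM(), ["grasping", "navigation", "counting"])
print(model.skill)  # all skills now at or above the 0.6 bar
```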