机器之心
DeepSeek open-sources a new base model: not V4, but V3.1-Base
机器之心· 2025-08-20 00:15
Core Viewpoint - DeepSeek has upgraded its online model to version V3.1, extending the context length to 128k and updating the user interface, which has generated significant attention in the AI community [1][3].

Summary by Sections

Model Release
- DeepSeek has released a new model, DeepSeek-V3.1-Base, on Hugging Face, where it rose to 4th place on the platform's trending-models list within hours of launch [1].
- The model keeps a parameter count similar to its predecessor, DeepSeek-V3, and uses a mixture-of-experts (MoE) architecture [1].

Community Reaction
- Reactions on social media are mixed: some users are excited, viewing the release as a precursor to the future DeepSeek-V4 and DeepSeek-R2, while others feel the update does not meet the company's previous standards [3].
ICCV 2025 | RobustSplat: transient-robust 3DGS reconstruction by decoupling densification from dynamics
机器之心· 2025-08-19 09:45
Core Viewpoint - The article discusses RobustSplat, a method that addresses the difficulty 3D Gaussian Splatting (3DGS) has with dynamic objects by introducing a delayed Gaussian growth strategy and a scale-cascade mask guidance method to reduce rendering artifacts caused by transient objects [2][21].

Research Motivation
- The motivation stems from the dual role of Gaussian densification in 3DGS: it enhances scene detail but also risks overfitting to dynamic areas, leading to artifacts and scene distortion. The goal is to balance static-structure representation against suppression of dynamic interference [6][8].

Methodology
- **Transient Mask Estimation**: A mask MLP with two linear layers outputs pixel-wise transient masks, distinguishing transient from static regions [9].
- **Feature Selection**: DINOv2 features are chosen for their balance of semantic consistency, noise resistance, and computational efficiency, outperforming alternatives such as Stable Diffusion and SAM features [10].
- **Supervision Design**: An image residual loss is combined with a feature cosine-similarity loss to optimize the mask MLP, improving recognition of dynamic areas [12].
- **Delayed Gaussian Growth Strategy**: This core strategy postpones densification so that static scene structure is optimized first, reducing the risk of misclassifying static areas as transient [13].
- **Scale-Cascade Mask Guidance**: Transient masks are first estimated from low-resolution features, then training transitions to high-resolution supervision for more accurate mask predictions [14].

Experimental Results
- Experiments on the NeRF On-the-go and RobustNeRF datasets show that RobustSplat outperforms baselines such as 3DGS, SpotLessSplats, and WildGaussians across metrics including PSNR, SSIM, and LPIPS [16][21].

Summary
- RobustSplat effectively reduces rendering artifacts caused by transient objects through these strategies, demonstrating superior performance in complex scenes with dynamic elements while preserving detail [19][21].
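The supervision design above pairs an image residual term with a feature cosine-similarity term to supervise the transient mask. A minimal NumPy sketch of how such a combined mask-supervision loss could be computed; this is an illustrative reimplementation, not the authors' code, and the weighting `lambda_feat` and helper names are assumptions:

```python
import numpy as np

def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-pixel cosine similarity between two (H, W, C) feature maps."""
    dot = np.sum(feat_a * feat_b, axis=-1)
    norm = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1)
    return dot / (norm + eps)

def mask_supervision_loss(pred_mask, rendered, target, feat_rendered, feat_target,
                          lambda_feat=0.5):
    """Combined supervision for a transient-mask prediction (illustrative).

    pred_mask: (H, W) in [0, 1], 1 = transient.
    rendered/target: (H, W, 3) images; feat_*: (H, W, C) feature maps.
    Pixels with a large photometric residual should be marked transient;
    pixels whose features stay similar should be marked static.
    """
    residual = np.abs(rendered - target).mean(axis=-1)          # (H, W)
    res_norm = residual / (residual.max() + 1e-8)               # normalize to [0, 1]
    sim = cosine_similarity_map(feat_rendered, feat_target)     # (H, W)
    # Pull the mask toward the normalized photometric residual ...
    photo_term = np.mean((pred_mask - res_norm) ** 2)
    # ... and penalize marking pixels transient where features still agree.
    feat_term = np.mean(pred_mask * np.clip(sim, 0.0, 1.0))
    return photo_term + lambda_feat * feat_term
```

A mask that matches the high-residual region scores a lower loss than its inverse, which is the behavior the paper's supervision is designed to encourage.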
Richard Sutton, father of reinforcement learning, unveils the OaK architecture in his latest talk: an eight-step vision toward superintelligence
机器之心· 2025-08-19 09:45
Core Viewpoint - Richard Sutton, the father of reinforcement learning and 2024 ACM Turing Award winner, presented a vision for achieving artificial general intelligence (AGI) and superintelligence through the OaK architecture, which is grounded in experiential learning and lays out a clear roadmap for AI development [2][4].

Group 1: OaK Architecture Overview
- OaK is not a complete algorithm but a vision that breaks the goal down into eight necessary steps, highlighting current gaps and potential development paths [2][6].
- Sutton emphasizes a simple, general agent architecture that learns from experience rather than relying on pre-defined domain knowledge [10][13].

Group 2: Key Concepts in the OaK Architecture
- The architecture centers on "open-ended abstraction," allowing the agent to continuously develop its conceptual framework and understanding of the world without being limited by predefined knowledge [13][28].
- Sutton distinguishes two phases: design time (before deployment) and runtime (during operation), advocating learning from experience at runtime so the agent can adapt to the world's complexity [18][20].

Group 3: Learning and Decision-Making
- Agents should learn solely from runtime experience, since the world's complexity cannot be fully anticipated or pre-defined [30][31].
- Because of that vast complexity, the agent's knowledge is inherently approximate, which makes learning and planning at runtime essential [37][38].

Group 4: Reinforcement Learning and the Reward Hypothesis
- The reinforcement learning framework defines the agent's goal as maximizing a scalar reward signal, which is central to its learning process [42][47].
- Sutton posits that even a simple reward signal can give rise to intelligent behavior in a sufficiently complex environment [51].

Group 5: Common Agent Model
- The common model of an intelligent agent comprises perception, a value function, a reactive policy, and a transition model, interconnected to support learning and planning [58][61].
- This model is the foundation that OaK extends with higher-level abstractions and multiple value functions for different subproblems [67][72].

Group 6: Implementation Steps of the OaK Architecture
- Implementing OaK involves eight parallel steps, including learning reward-maximizing policies, generating new state features, and constructing the corresponding subproblems [82][85].
- Each step depends on achieving reliable continual deep learning and the ability to generate and evaluate new features [86][90].

Group 7: Future Directions and Challenges
- While some steps are already feasible, significant challenges remain, particularly reliable continual learning in nonlinear deep networks [89][96].
- The architecture aims for a system that evolves through an open-ended cycle of exploration and learning, ultimately improving the agent's ability to abstract and generalize from experience [160].
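The reward hypothesis above, that maximizing a single scalar reward signal is enough to drive learning, underlies standard RL algorithms. As a concrete illustration (not part of OaK itself; the chain environment and hyperparameters are invented for this sketch), here is tabular Q-learning on a five-state chain where the only reward is 1 at the right end:

```python
import random

def q_learning_chain(n_states=5, episodes=1000, alpha=0.5, gamma=0.9,
                     eps=0.1, seed=0):
    """Tabular Q-learning on a chain MDP.

    Actions: 0 = step left (floor at state 0), 1 = step right.
    The scalar reward is 1.0 only on entering the rightmost state.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[state][action]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < eps:               # epsilon-greedy exploration
                a = rng.randrange(2)
            else:                                # greedy, ties broken toward right
                a = 1 if q[s][1] >= q[s][0] else 0
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0   # the scalar reward signal
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = q_learning_chain()
# Greedy policy over the non-terminal states after training.
policy = [1 if q[s][1] >= q[s][0] else 0 for s in range(4)]
```

Nothing in the update rule mentions the goal explicitly; the "move right" behavior emerges purely from bootstrapping the scalar reward, which is the point of the hypothesis.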
X-SAM: from "segment anything" to "any segmentation": a unified multimodal large model for image segmentation, achieving SoTA on 20+ image segmentation datasets
机器之心· 2025-08-19 06:33
Core Viewpoint - The article discusses the development of X-SAM, a unified multimodal large language model for image segmentation that extends existing models with pixel-level understanding and interaction through visual prompts [4][26].

Background and Motivation
- The Segment Anything Model (SAM) excels at dense segmentation-mask generation but is limited by its reliance on a single input mode, hindering its applicability across segmentation tasks [4].
- Multimodal large language models (MLLMs) have shown promise in tasks like image description and visual question answering but are fundamentally restricted in pixel-level visual tasks, which limits the development of generalized models [4].

Method Design
- X-SAM introduces a unified framework that extends the segmentation paradigm from "segment anything" to "any segmentation" by incorporating visual grounded segmentation (VGS) tasks [4].
- The model employs a dual-projector architecture to strengthen image understanding and a segmentation connector that supplies rich multi-scale information for segmentation tasks [11][12].
- A three-stage progressive training strategy (segmentor fine-tuning, alignment pre-training, and mixed fine-tuning) optimizes performance across diverse image segmentation tasks [16][22].

Experimental Results
- X-SAM was evaluated on more than 20 segmentation datasets, achieving state-of-the-art performance across seven different image segmentation tasks [19].
- Its metrics show significant improvements over existing models across a range of segmentation tasks, demonstrating versatility and effectiveness [20][21].

Summary and Outlook
- X-SAM represents a significant advance in image segmentation, laying a foundation for future research in video segmentation and the integration of temporal information [26].
- Future directions include extending the model to video segmentation tasks, potentially advancing video understanding technologies [26].
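A three-stage progressive strategy like the one above can be pictured as a schedule in which each stage trains a different subset of modules. The sketch below is purely illustrative; the module names and which modules each stage unfreezes are assumptions for exposition, not X-SAM's actual recipe:

```python
# Hypothetical staged-training schedule: each stage trains only some modules.
STAGES = [
    ("segmentor_finetune", {"segmentor"}),
    ("alignment_pretrain", {"projectors"}),
    ("mixed_finetune",     {"segmentor", "projectors", "llm"}),
]

ALL_MODULES = ["image_encoder", "segmentor", "projectors", "llm"]

def trainable_flags(stage_name):
    """Return a {module: trainable?} map for the given stage."""
    for name, modules in STAGES:
        if name == stage_name:
            return {m: (m in modules) for m in ALL_MODULES}
    raise ValueError(f"unknown stage: {stage_name}")
```

In a real training loop one would iterate over `STAGES` in order, freezing or unfreezing parameters according to these flags before each stage.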
Seven years on, OpenAI compares five generations of GPT, yet netizens miss the "wild" original
机器之心· 2025-08-19 06:33
Core Viewpoint - The article traces the evolution of the GPT series from GPT-1 to GPT-5, highlighting significant improvements in understanding complex queries and generating coherent responses [2][5][49].

Group 1: Evolution of the GPT Models
- GPT-1 produced awkward, often nonsensical responses, showing limited understanding of complex questions [2][11][12].
- GPT-5, in contrast, provides well-structured, coherent, and contextually relevant answers, a significant leap in performance over its predecessors [5][20][49].
- Accounts from OpenAI personnel reflect how profoundly the models' capabilities have changed over seven years of development [6][49].

Group 2: Specific Comparisons
- Asked about the feasibility of annual full-body MRI scans for cancer detection, GPT-1 responded illogically and confusingly, while GPT-2 offered a slightly better but still unhelpful answer [11][12].
- GPT-4 answered more reliably but without a personal touch, whereas GPT-5 not only addressed the question effectively but also engaged with the asker's emotions, resembling a conversation with a knowledgeable doctor [20][21][49].
- GPT-5's responses are not only accurate but also considerate, indicating a shift toward more human-like interaction [20][21][49].

Group 3: User Reactions
- Opinions on the different GPT models vary, with some users nostalgic for GPT-1's wild, unpredictable character, suggesting it had a unique appeal [50][51].
- Some comments suggest GPT-1 felt more like a "true AGI" than its successors, indicating a divergence in user preferences [53][54].
Sketch3DVE: sketch-driven free editing of 3D scene videos
机器之心· 2025-08-19 02:43
Core Viewpoint - The article discusses the development of Sketch3DVE, a novel method for 3D scene video editing that lets users manipulate videos with simple sketches, enhancing creativity and personalization in video content creation [3][22].

Part 1: Background
- Recent advances in video generation models have significantly improved text-to-video and image-to-video generation, with growing attention to precise control over camera trajectories given its important application prospects [6].
- Existing methods fall into two categories: one feeds camera parameters directly into the model; the other builds explicit 3D representations from a single image to render novel-view images [8][9].
- Despite these advances, editing real videos with significant camera motion remains a challenge, since editing must preserve the original motion patterns and local features while synthesizing new content [8][9].

Part 2: Algorithm Principles
- The user selects the first frame of a 3D scene video, marks the editing area with a mask, and draws a sketch specifying the geometry of the new object [12].
- The system applies the MagicQuill image-editing algorithm to the first frame to produce the edited result, and uses the DUSt3R algorithm for 3D reconstruction of the entire input video [13].
- A 3D mask propagation algorithm accurately transfers the mask from the first frame to subsequent frames, keeping it consistent across viewpoints [14].
- A final video generation model integrates the edited image, multi-view videos, and the original input video to produce a scene-edited video with precise 3D consistency [14].

Part 3: Effect Demonstration
- The method produces high-quality 3D scene video edits, supporting operations such as adding, removing, and replacing objects while maintaining good 3D consistency [16].
- It handles complex scenarios involving shadows and reflections, producing plausible editing results thanks to training on real video datasets [17].
- Users can also edit the first frame with image-completion methods, demonstrating the system's versatility in generating realistic 3D scene video edits [19].
- Sketch3DVE offers an effective solution to traditional model-insertion challenges, enabling personalized 3D object generation and high-fidelity scene video editing without requiring extensive expertise [22].
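The 3D mask propagation step above can be understood as lifting masked pixels into 3D using depth, then reprojecting those points with each subsequent frame's camera. A minimal pinhole-camera sketch of that idea; the intrinsics and the frame-to-frame pose convention here are assumptions, not the paper's implementation:

```python
import numpy as np

def lift_to_3d(mask, depth, K):
    """Back-project masked pixels into camera-space 3D points.

    mask: (H, W) bool; depth: (H, W); K: 3x3 pinhole intrinsics.
    Returns an (N, 3) array of points for the True pixels.
    """
    vs, us = np.nonzero(mask)
    z = depth[vs, us]
    x = (us - K[0, 2]) * z / K[0, 0]
    y = (vs - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def project_mask(points, K, R, t, shape):
    """Rasterize 3D points into another frame's mask via pose (R, t)."""
    cam = points @ R.T + t                     # transform into the target camera
    uv = cam @ K.T                             # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    mask = np.zeros(shape, dtype=bool)
    ok = (u >= 0) & (u < shape[1]) & (v >= 0) & (v < shape[0]) & (cam[:, 2] > 0)
    mask[v[ok], u[ok]] = True
    return mask
```

With an identity pose the reprojected mask reproduces the original, which is a handy sanity check; in the real pipeline `R`, `t`, and `depth` would come from the DUSt3R reconstruction.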
A Tsinghua IIIS professor walks you through training agents with reinforcement learning
机器之心· 2025-08-19 02:43
Core Viewpoint - The article discusses the significance of Agentic Reinforcement Learning (Agentic RL) for training general intelligent agents, highlighting the ASearcher project, the AReaL team's initiative to build an end-to-end search agent with this technology [1][2].

Summary by Sections

Agentic RL Challenges
- The main difficulty in Agentic RL is long-horizon tool use, which requires complex interactions with diverse environments [11].

ASearcher Project
- ASearcher leverages fully asynchronous RL to unlock long-horizon tool use for agents, allowing up to 128 complex environment interactions [2][11].

AReaL-Lite
- AReaL-Lite is introduced as a lightweight development framework that enables rapid Agentic RL training and simplifies the coding process [11].

Hands-on Training
- A hands-on session will walk participants through multi-turn search-agent training in a Jupyter Notebook; a GPU server with at least 4 cards is required [11].

Guest Speakers
- The session features notable speakers including Professor Wu Yi of Tsinghua University and core members of the AReaL and ASearcher projects, highlighting their expertise in the field [11].
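Long-horizon tool use of the kind described above reduces to an agent loop that interleaves policy calls with tool calls under a budget of environment interactions (128 in ASearcher's case). A minimal sketch of such a loop with a stubbed policy and search tool; the function names and stub logic are illustrative and are not AReaL or ASearcher APIs:

```python
def run_search_agent(question, policy, search_tool, max_interactions=128):
    """Multi-turn loop: at each turn the policy issues a search query or answers."""
    trajectory = [("question", question)]
    for _ in range(max_interactions):
        action, payload = policy(trajectory)
        if action == "answer":
            trajectory.append(("answer", payload))
            return payload, trajectory
        result = search_tool(payload)        # one environment interaction
        trajectory.append(("observation", result))
    return None, trajectory                  # interaction budget exhausted

# Stub policy: search twice, then answer with the last observation.
def stub_policy(traj):
    n_obs = sum(1 for kind, _ in traj if kind == "observation")
    if n_obs < 2:
        return "search", f"query {n_obs}"
    return "answer", traj[-1][1]

def stub_search(query):
    return f"result for {query}"

answer, traj = run_search_agent("who wrote X?", stub_policy, stub_search)
```

In an RL setting, the recorded `trajectory` is exactly what gets scored and fed back to train the policy; asynchronous training decouples rollout collection like this from gradient updates.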
An open-source Genie 3-style world model arrives: real-time, long-duration interaction, runs on a single GPU, built by a Chinese company
机器之心· 2025-08-19 02:43
Core Viewpoint - The article discusses the launch of the open-source interactive world model Matrix-Game 2.0 by Kunlun Wanwei, which demonstrates significant advances in real-time interactive generation and simulation of complex environments, rivaling proprietary models such as Google DeepMind's Genie 3 [1][3][11].

Group 1: Product Overview
- Matrix-Game 2.0 is an open-source model with 1.8 billion parameters that runs on a single GPU and generates virtual environments at 25 FPS [12][36].
- Users can upload images and interact with the generated virtual world via keyboard controls, moving and changing perspective in real time [19][40].
- The model is noted for simulating realistic environments, including complex terrain and dynamic elements, enhancing user immersion [8][21].

Group 2: Technical Innovations
- The model adopts a novel visual-driven interactive world-modeling approach, moving away from traditional language-based prompts toward visual understanding and learning of physical laws [35][40].
- Matrix-Game 2.0 integrates an autoregressive diffusion generation mechanism, which helps produce longer videos while minimizing content drift and error accumulation [42][45].
- The data pipeline used for training includes over 1.2 million video clips, with an accuracy rate exceeding 99% [37][38].

Group 3: Market Impact and Future Prospects
- The emergence of Matrix-Game 2.0 signals a shift in the world-model landscape, indicating that such technologies are moving toward practical applications in fields including gaming and robotics [57][59].
- World models also have potential as training environments for AI, addressing challenges such as data scarcity and generalization in embodied intelligence [57][58].
- Kunlun Wanwei's continued open-source efforts are expected to accelerate the practical deployment of world models, enhancing their utility across sectors [54][59].
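The diffusion generation mechanism described above extends videos chunk by chunk, conditioning each new chunk on frames already produced to limit drift. A toy sketch of that autoregressive outer loop; this is a pure-Python stand-in with a stub generator, not the model itself:

```python
def generate_long_video(step_fn, seed_frames, n_chunks=4, chunk_len=8,
                        context_len=2):
    """Autoregressive chunked generation: each chunk is conditioned on
    the last `context_len` frames already produced."""
    frames = list(seed_frames)
    for _ in range(n_chunks):
        context = frames[-context_len:]          # conditioning window
        frames.extend(step_fn(context, chunk_len))
    return frames

# Stub generator: each new frame is the previous frame + 1
# (stands in for one denoising pass of a diffusion model).
def stub_step(context, chunk_len):
    last = context[-1]
    return [last + i + 1 for i in range(chunk_len)]

video = generate_long_video(stub_step, seed_frames=[0])
```

Because each chunk only ever sees a short context window, errors cannot accumulate through an unbounded history, which is the motivation the article attributes to the mechanism.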
A new image-to-video trick goes viral overseas: draw a couple of strokes on an image and it comes alive, finally leaving text prompts behind
机器之心· 2025-08-19 02:43
Report by 机器之心. Editors: Du Wei, Yang Wen

Now, AI can understand what you draw.

Higgsfield AI is an interesting company. It ships new features every few days, maintains a relentless presence on X, and was at one point rumored to be in acquisition talks with Meta, though nothing ultimately came of them.

The company focuses on AI video generation and is best known for cinematic camera-control technology. Three months ago it went viral with AI camera-move video generation, which we covered at the time: "One photo, over 70 kinds of blockbuster camera moves! This AI tool deals photographers an iron fist."

A few days ago, it released two more features, Draw-to-Video and Product-to-Video.

The former needs only an uploaded static image: draw shapes, text, or arrows on it, and it generates cinematic video footage. The feature blew up abroad on release, topping 5.3 million views on X within just four days.

The latter generates polished, cinema-grade ad videos for free through simple drag-and-drop operations. It has so far drawn 1.6 million views on X.

According to The Information, Meta Platforms has been seeking partnerships with startups developing AI video generation and editing models, and had discussed a potential acquisition with video-generation startup Higgsfield, but these talks are currently ...
Dario Amodei: Paper losses? Large models still make money!
机器之心· 2025-08-18 09:22
Group 1
- Dario Amodei's core argument is that accounting losses do not equate to business failure: each generation of AI model should be viewed as an independent profit unit to understand the true health of the business [1][5][8].
- Amodei suggests the future AI market will likely consist of three to six major players with cutting-edge technology and substantial capital, emphasizing that both technology and capital are essential [5][6].
- He challenges the traditional view that rising R&D expenses signal a worsening business; instead, each model can be seen as a startup with a large upfront investment that becomes profitable over its lifecycle [8][9][10].

Group 2
- Amodei illustrates this with a financial model: a company spends $100 million to train a model in 2023, that model generates $200 million in revenue in 2024, the company then invests $1 billion in the next-generation model, and that model brings in $2 billion in 2025 [6][7].
- The timing of training a new model should be driven not by the calendar but by data from the previous model, highlighting the importance of data-driven decision-making [10][11].
- He introduces the concept of a "capitalistic impulse": leaps in model capability naturally draw in capital, compute, and data, thereby amplifying economic value [13].

Group 3
- Amodei asserts that as long as the Scaling Law remains effective, this embedded venture-capital cycle will continue to drive growth and profitability, positioning the company among the market's top players [12][11].
- He also notes that existing AI interfaces have yet to unlock the models' full potential, indicating a gap in interface design that needs to be addressed [4].
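The model-as-startup framing above can be made concrete with a small cashflow table. The figures below are assumptions in the spirit of Amodei's example, not his exact numbers: each model costs 10x its predecessor to train and earns back twice its cost over the following year, so the company books a bigger accounting loss every year even though every individual model is profitable:

```python
def yearly_cashflow(n_years=4, first_cost=100e6, cost_growth=10.0,
                    return_multiple=2.0):
    """Toy per-year company P&L under the model-as-startup framing.

    In year t the company trains one model (cost grows 10x per year);
    the model trained in year t earns return_multiple * its cost in year t+1.
    """
    rows = []
    for t in range(n_years):
        train_cost = first_cost * cost_growth ** t
        revenue = first_cost * cost_growth ** (t - 1) * return_multiple if t > 0 else 0.0
        rows.append({"year": t, "train_cost": train_cost,
                     "revenue": revenue, "net": revenue - train_cost})
    return rows

rows = yearly_cashflow()
# Company-level view: the annual net is negative every year ...
annual_net = [r["net"] for r in rows]
# ... yet each completed model, viewed on its own, is profitable:
per_model_profit = [rows[t + 1]["revenue"] - rows[t]["train_cost"]
                    for t in range(len(rows) - 1)]
```

Separating the two views this way is exactly Amodei's point: the yearly accounts aggregate one model's payoff with the next model's much larger training bill.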