机器之心
Just now: Yann LeCun officially announces his departure to start a company, targeting Advanced Machine Intelligence (AMI)
机器之心· 2025-11-20 02:07
Core Viewpoint
- Yann LeCun, a Turing Award winner, has announced his departure from Meta to start a new company focused on Advanced Machine Intelligence (AMI), aiming to revolutionize AI by building systems that understand the physical world, possess long-term memory, reason, and plan complex actions [1][8][14].

Group 1: Company Transition
- LeCun's new venture will continue his research on "world models," which he believes are essential for AI to truly understand the physical world [8][27].
- Meta will act as a partner to LeCun's new company, supporting the AMI initiative, whose interests overlap with Meta's business but also extend into other areas [8][28].
- The departure marks a significant shift in the AI landscape, as LeCun leaves a position he helped establish at Meta's FAIR (Facebook AI Research) amid internal cultural conflicts and strategic misalignments [17][27].

Group 2: Research Focus
- The goal of the new company is to drive a major revolution in AI, focusing on systems that can understand the physical world and plan actions without extensive trial and error [8][24].
- LeCun has long been a critic of large language models (LLMs), arguing that they lack true understanding of the physical world; he aims instead to develop AI that can reason and plan using world models [19][27].
- Recent research contributions include JEPA (Joint Embedding Predictive Architecture), which aims to create organized and actionable high-dimensional embedding spaces and is seen as a potential pathway to world models (a rough sketch follows this summary) [25][27].

Group 3: Industry Impact
- LeCun's move to entrepreneurship at age 65 opens a new phase of exploration in AI, stepping away from the constraints of a corporate environment to pursue foundational scientific challenges [14][27].
- The departure of LeCun, alongside other key figures such as Soumith Chintala, marks the end of an era for Meta AI and highlights the ongoing evolution of the AI research community [28].
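To make the JEPA idea above concrete: a joint-embedding predictive architecture encodes a context view and a target view and learns to predict the target's embedding from the context's, so the training signal lives in representation space rather than pixel space. Below is a minimal, hedged sketch of that objective; the module sizes, the frozen target branch, and the choice of views are illustrative assumptions, not LeCun's actual implementation.

```python
# Minimal JEPA-style objective (illustrative assumptions; not an official implementation).
import copy
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim=768, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return self.net(x)

context_encoder = Encoder()
target_encoder = copy.deepcopy(context_encoder)     # frozen target branch (EMA copy in practice)
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))

def jepa_loss(context_view, target_view):
    z_ctx = context_encoder(context_view)            # embedding of the visible context
    with torch.no_grad():
        z_tgt = target_encoder(target_view)          # embedding of the masked / future target
    z_pred = predictor(z_ctx)                        # predict the target embedding from context
    return nn.functional.mse_loss(z_pred, z_tgt)     # loss in embedding space, not pixel space

loss = jepa_loss(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
print(float(loss))
```

In practice the target encoder is typically updated as an exponential moving average of the context encoder, and the views come from masking or temporal offsets; those details are the parts a real system has to get right and are omitted here.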
A key step toward artificial general intelligence? DeepMind makes a big move with SIMA 2, the strongest AI agent for 3D worlds
机器之心· 2025-11-20 02:07
Core Viewpoint
- Google DeepMind has launched SIMA 2, a general AI agent capable of autonomous gaming, reasoning, and continuous learning in virtual 3D environments, marking a significant step towards general artificial intelligence [2][3][6].

Group 1: SIMA 2 Overview
- SIMA 2 represents a major leap from its predecessor, SIMA, evolving from a passive instruction follower into an interactive gaming companion that can autonomously plan and reason in complex environments [6][10].
- The integration of the Gemini model enhances SIMA 2's capabilities, allowing it to understand user intentions, formulate plans, and execute actions through a multi-step cognitive chain (a schematic agent-loop sketch follows this summary) [15][20].

Group 2: Performance and Capabilities
- SIMA 2 can understand and execute complex instructions with higher success rates, even in unfamiliar scenarios, showcasing its ability to generalize across tasks and environments [24][30].
- The agent demonstrates self-improvement, learning through trial and error and using feedback from the Gemini model to sharpen its skills without additional human-generated data [35][39].

Group 3: Future Implications
- SIMA 2's ability to operate across varied gaming environments serves as a critical testing ground for general intelligence, enabling the agent to master skills and engage in complex reasoning [41][43].
- The research highlights SIMA 2's potential to contribute to robotics and physical AI applications, as it learns skills that future AI assistants will need in the physical world [43].
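The "multi-step cognitive chain" described above boils down, at a high level, to an observe-plan-act loop. The sketch below shows only that generic loop shape; `plan_next_action` is a hypothetical stand-in for a call to a reasoning model such as Gemini, and none of the names correspond to a real SIMA 2 API.

```python
# Generic observe-plan-act agent loop (illustrative only; not the SIMA 2 interface).
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    kind: str        # e.g. "move", "interact", "say"
    argument: str

def plan_next_action(instruction: str, history: List[str], observation: str) -> Action:
    # Hypothetical stand-in for a reasoning-model call: given the user's goal, the
    # interaction history, and the current game observation, return the next
    # low-level action. A trivial rule replaces real planning in this sketch.
    if not history:
        return Action("move", f"toward an object relevant to: {instruction}")
    return Action("say", "goal appears complete")

def run_agent(instruction: str, max_steps: int = 10) -> List[Action]:
    history: List[str] = []
    actions: List[Action] = []
    for step in range(max_steps):
        observation = f"frame_{step}"                # placeholder for a rendered game frame
        action = plan_next_action(instruction, history, observation)
        actions.append(action)
        history.append(f"{action.kind}:{action.argument}")
        if action.kind == "say":                     # crude stopping condition for the sketch
            break
    return actions

print(run_agent("collect wood near the camp"))
```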
NeurIPS 2025 | In-context meta-learning enables cross-subject brain activity prediction without fine-tuning
机器之心· 2025-11-19 04:07
Core Insights
- The article presents BraInCoRL, a novel brain encoding model that uses meta-learning and in-context learning to predict brain responses to visual stimuli with minimal data requirements [3][32].
- The model addresses the limitations of traditional visual encoding models, which require extensive data collection for each individual, making them costly and difficult to deploy in clinical settings [6][32].

Background and Innovation
- The research highlights significant functional differences in the human higher visual cortex across individuals, necessitating brain encoding models that can represent these differences effectively [2][6].
- BraInCoRL predicts brain responses using only a small number of example images and their corresponding brain activity data, eliminating the need for model fine-tuning [3][32].

Methodology
- The BraInCoRL framework treats each voxel as an independent function mapping visual stimuli to neural responses, leveraging meta-learning and in-context learning to improve data efficiency and generalization [7][10].
- During training, the model learns shared structure in visual cortex responses across multiple subjects; at test time, it can instantiate a subject-specific voxel encoder from just a few image-response pairs (a simplified sketch of this in-context step follows this summary) [11][20].

Experimental Results
- BraInCoRL demonstrates high data efficiency, matching the variance explained by models trained on thousands of images while using only 100 context images [20][22].
- The model performs robustly across datasets and scanning protocols, confirming its cross-device and cross-protocol generalization [22][23].
- Semantic clustering visualizations reveal clear functional organization within the visual cortex, with distinct areas for faces, scenes, and other categories [26][27].

Conclusion
- BraInCoRL introduces in-context learning to computational neuroscience, creating a data-efficient, interpretable, and language-interactive framework for visual cortex encoding [32].
- This innovation significantly lowers the barrier to building individualized brain encoding models, paving the way for applications in clinical neuroscience and other data-limited scenarios [32].
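The in-context prediction step described above can be pictured as a shared transformer that consumes a handful of (image feature, voxel response) support pairs plus a query image feature and emits the predicted response for that voxel, with no gradient updates at test time. This is a simplified sketch under assumed tensor shapes and architecture choices; it is not the authors' released code.

```python
# Illustrative in-context voxel encoder (assumed shapes; not the official BraInCoRL code).
# Support set: K example images (as feature vectors) with their measured voxel responses.
# Query: a new image feature. Output: predicted response of the same voxel.
import torch
import torch.nn as nn

class InContextVoxelEncoder(nn.Module):
    def __init__(self, img_dim=512, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)
        self.resp_proj = nn.Linear(1, d_model)       # embed the scalar voxel response
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, support_imgs, support_resps, query_img):
        # support_imgs: (B, K, img_dim), support_resps: (B, K), query_img: (B, img_dim)
        sup = self.img_proj(support_imgs) + self.resp_proj(support_resps.unsqueeze(-1))
        qry = self.img_proj(query_img).unsqueeze(1)  # the query carries no response token
        tokens = torch.cat([sup, qry], dim=1)        # (B, K+1, d_model)
        out = self.backbone(tokens)
        return self.head(out[:, -1, :]).squeeze(-1)  # read the prediction off the query token

model = InContextVoxelEncoder()
pred = model(torch.randn(2, 16, 512), torch.randn(2, 16), torch.randn(2, 512))
print(pred.shape)  # torch.Size([2])
```

At test time only the forward pass runs: the support pairs act as the "context" that adapts the prediction to a new subject, which is what removes the need for fine-tuning.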
NeurIPS 2025 Spotlight | The University of Hong Kong proposes a label-free method for enhancing ViT dense representations
机器之心· 2025-11-19 04:07
Core Insights
- The article introduces PH-Reg, a method for enhancing Vision Transformers (ViTs) that removes artifacts from dense features without requiring data labels, improving model performance on fine-grained tasks [2][6][19].

Group 1: Methodology
- PH-Reg applies a test-time-augmentation denoising strategy to the dense features of the teacher model, producing a student model whose dense features are artifact-free (a simplified sketch of this idea follows this summary) [2][11].
- The self-distillation framework of PH-Reg enhances the student architecture with minimal intrusion, updating only specific components while preserving the core information of the pre-trained ViT [11][20].
- The method is designed to be plug-and-play, requiring no retraining and enabling efficient artifact removal from existing pre-trained models such as CLIP and DINOv2 [19][22].

Group 2: Experimental Results
- In semantic segmentation across eight benchmark datasets, PH-Reg outperformed mainstream methods such as MaskCLIP and SCLIP on seven of them, demonstrating its robustness and effectiveness [13][21].
- For the CLIP model, the method improved mean Intersection over Union (mIoU) by 5.04% on VOC21 and 3.64% on ADE20K [21].
- Training time is reduced by over 58.9% compared with prior approaches: 9,000 minutes in total, versus the 21,908 minutes required by DVT [17][22].

Group 3: Advantages
- PH-Reg's core advantage is its independence from gradient-based neural field learning, allowing a single-stage distillation process that minimizes storage and compute requirements [22].
- All distillation targets are computed on the fly, with no need for additional storage, in contrast to DVT's requirement of 1.4 TB of neural field feature data [22].
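The test-time-augmentation denoising idea above can be sketched roughly as: run the frozen teacher on several augmented copies of an image, undo the augmentations in feature space, average the dense feature maps to suppress position-dependent artifacts, and regress the student's dense features onto that average. The code below is a simplified illustration under assumed details (random shifts as the only augmentation, generic `teacher`/`student` callables); it is not the PH-Reg implementation.

```python
# Simplified self-distillation with test-time-augmentation denoising
# (illustrative assumptions only; not the PH-Reg code).
import torch
import torch.nn.functional as F

def tta_denoised_teacher_features(teacher, image, n_aug=8, stride=16, max_shift_tokens=2):
    """Average the teacher's dense features over randomly shifted copies of the image.

    Artifacts tied to absolute token positions tend to cancel out once each shifted
    feature map is shifted back and averaged, while real image content is preserved.
    """
    acc = None
    for _ in range(n_aug):
        # pixel shifts are whole multiples of the feature stride, so they can be
        # undone exactly in feature space after the teacher forward pass
        dx = stride * int(torch.randint(-max_shift_tokens, max_shift_tokens + 1, ()).item())
        dy = stride * int(torch.randint(-max_shift_tokens, max_shift_tokens + 1, ()).item())
        shifted = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
        with torch.no_grad():
            feat = teacher(shifted)                               # (B, C, h, w) dense features
        feat = torch.roll(feat, shifts=(-dy // stride, -dx // stride), dims=(2, 3))
        acc = feat if acc is None else acc + feat
    return acc / n_aug

def distill_step(student, teacher, image, optimizer):
    target = tta_denoised_teacher_features(teacher, image)        # artifact-suppressed target
    loss = F.mse_loss(student(image), target)                     # regress student dense features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-in networks, just to make the sketch executable end to end.
teacher = torch.nn.Conv2d(3, 8, kernel_size=16, stride=16)
student = torch.nn.Conv2d(3, 8, kernel_size=16, stride=16)
opt = torch.optim.SGD(student.parameters(), lr=1e-3)
print(distill_step(student, teacher, torch.randn(2, 3, 224, 224), opt))
```

Because the denoised targets are computed on the fly from the frozen teacher, nothing has to be precomputed or stored, which matches the storage advantage the summary attributes to the single-stage design.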
Realsee releases spatial foundation model Argus 1.0, supporting panoramic images and other diverse inputs, an industry first!
机器之心· 2025-11-19 04:07
Core Viewpoint
- The article discusses the emergence of Argus 1.0, a groundbreaking spatial model by Realsee, which aims to recreate the real world in a 3D interactive format, contrasting with AI-generated virtual worlds [2][4].

Group 1: Introduction of Argus 1.0
- Argus 1.0 is the world's first spatial model that supports panoramic image input and infers spatial depth, representing a significant shift from virtual generation to real-world replication [2][6].
- The model processes single or multiple panoramic images to derive camera poses, depth maps, and point clouds with millisecond-level speed [2][6].

Group 2: Foundation of Argus 1.0
- The development of Argus 1.0 is rooted in Realsee's extensive experience in spatial digitization since its establishment in 2017, driven by a "digital space-algorithm-industry application" flywheel [6][14].
- Realsee has accumulated over 53 million sets of digital space data, covering more than 4.4 billion square meters globally, forming the largest real space database [7][8].

Group 3: Technical Innovations
- Argus 1.0 represents a transition from single-view depth estimation to multi-view consistency, utilizing a Transformer architecture trained on nearly one million sets of real high-definition spatial data [16][24].
- The model is the first in the industry to support panoramic images as input, significantly enhancing the efficiency of VR content production [17][21].

Group 4: Quality and Performance
- Argus 1.0 achieves high-quality output due to its unique high-precision, scale-aware, pixel-aligned real database, allowing it to handle challenging scenarios like glass and mirrors effectively [24][29].
- The model's inference efficiency reaches millisecond level, making it the first real-time panoramic global reconstruction system [22][23].

Group 5: Future Directions and Industry Impact
- Argus 1.0 is a key component in Realsee's "spatial intelligence" framework, which outlines a four-layer theory from digitization to intelligence [30][34].
- The company plans to release Argus 2.0 and subsequent versions to further enhance real-time rendering capabilities and support advanced applications in various industries [36][38].
- Realsee aims to open a dataset of 10,000 sets of indoor housing data to foster innovation in the spatial intelligence sector, addressing the significant gap in high-quality spatial data [39][40].
Topping the open-source SOTA! Shanghai Jiao Tong University & Xiaohongshu's LoopTool achieves "data evolution" for tool-calling tasks
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the evolution of large language models (LLMs) from merely "speaking" to "doing" through the integration of external tools, emphasizing the need for high-quality, diverse training data to improve model performance across tasks [1][5][35].

Group 1: LoopTool Framework
- The Shanghai Jiao Tong University and Xiaohongshu team developed LoopTool, an autonomous, model-aware, iterative data evolution framework that closes the data-model optimization loop for tool-calling tasks [2][35].
- LoopTool uses the open-source Qwen3-32B model as both data generator and discriminator; the resulting 8B model surpasses the 32B generator itself in tool-calling performance [2][35].
- The framework achieves state-of-the-art (SOTA) results on the public benchmarks BFCL-v3 and ACEBench, validating the generality and effectiveness of closed-loop iterative optimization across model sizes [2][35].

Group 2: Methodology
- LoopTool's core idea is an automated closed loop of data generation, label correction, and model training, driven by feedback on model performance (a schematic sketch of this loop follows this summary) [7][35].
- The process begins with seed data construction, generating high-quality, diverse seed datasets with semantic and constraint trees to ensure consistency and semantic integrity [9][10].
- Each iteration then runs four modules: GRPO training for tool calling, greedy capability probing to identify valuable samples, judgment-guided label verification to correct mismatched labels, and error-driven data expansion to create new, harder samples [11][12][13][15][17].

Group 3: Experimental Results
- LoopTool-8B achieved an overall accuracy of 74.93% on BFCL-v3, ranking first among all 8B models, a gain of +8.59 percentage points over the original Qwen3-8B [20][23].
- LoopTool-32B reached an overall accuracy of 79.32%, also ranking first, with superior performance in both single-turn and multi-turn scenarios [20][21].
- Iterative training showed continuous improvement, in contrast to static training, which plateaued or declined as the data distribution drifted away from the model's capabilities [23].

Group 4: Generalization and Downstream Tasks
- LoopTool not only strengthens tool calling but also improves general reasoning and complex task handling, as shown by its performance across a range of general tasks [30][31].
- The model shows significant gains on instruction following and code generation, indicating that closed-loop data evolution benefits broader model capabilities [30][31].
- In practical applications, the enhanced tool-use ability tackles real-world problems effectively, showcasing its utility in scenarios such as API management and complex task execution [32][33].
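The four iterative modules listed above (train, probe, verify labels, expand hard cases) form a data-model loop. Below is a highly schematic sketch of that control flow under assumed interfaces; every function body is a placeholder, and none of this reflects the actual LoopTool code or prompts.

```python
# Schematic closed-loop data evolution for tool calling (illustrative only;
# not the LoopTool implementation - all helpers are assumed placeholders).
from typing import Dict, List

Sample = Dict[str, str]   # e.g. {"query": ..., "tools": ..., "label": ...}

def train_with_grpo(model, data: List[Sample]):
    """Placeholder: fine-tune the tool-calling model (LoopTool uses GRPO here)."""
    return model

def probe_hard_samples(model, data: List[Sample]) -> List[Sample]:
    """Placeholder: greedy decoding over the pool; keep samples the model gets wrong,
    since those carry the most training signal for the next round."""
    return [s for i, s in enumerate(data) if i % 3 == 0]   # stand-in selection rule

def verify_labels(judge, samples: List[Sample]) -> List[Sample]:
    """Placeholder: a judge model re-checks disputed labels and corrects mismatches."""
    return samples

def expand_from_errors(generator, hard: List[Sample]) -> List[Sample]:
    """Placeholder: synthesize new, harder variants around observed failure modes."""
    return hard   # a real system would generate fresh samples here

def looptool_style_iteration(model, judge, generator, pool: List[Sample], rounds: int = 3):
    for _ in range(rounds):
        model = train_with_grpo(model, pool)                 # 1. train on the current pool
        hard = probe_hard_samples(model, pool)               # 2. probe for valuable samples
        hard = verify_labels(judge, hard)                     # 3. correct noisy labels
        pool = pool + expand_from_errors(generator, hard)     # 4. grow the pool with hard cases
    return model, pool

model, pool = looptool_style_iteration("model", "judge", "generator",
                                        [{"query": "q", "tools": "t", "label": "y"}] * 6)
print(len(pool))
```

The point of the sketch is only the feedback structure: the data pool evolves in step with the model, which is what the summary credits for avoiding the plateau seen with static training data.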
ConsistEdit is here: a training-free new paradigm for high-precision, high-consistency visual editing
机器之心· 2025-11-19 02:09
Core Insights
- The article covers advances in training-free visual editing, focusing on ConsistEdit, an approach designed for the Multi-Modal Diffusion Transformer (MM-DiT) architecture that addresses key challenges in visual generation [5][7][34].

Research Background
- Current visual editing methods suffer from two main pain points: the difficulty of balancing editing strength against consistency with the source image, and the lack of fine-grained control over editing strength [5].

Key Findings
- Three critical observations about the MM-DiT architecture are highlighted:
  1. Editing only the visual tokens yields stable editing results, whereas modifying the text tokens can cause distortions [9].
  2. All layers of MM-DiT retain structural information, so edits can act on all attention layers rather than only the last few [11].
  3. Controlling the Q/K tokens precisely maintains structural consistency, while the V tokens mainly influence content and texture, enabling decoupled control of structure and texture [15].

Method Design
- ConsistEdit introduces three core operations (a conceptual sketch follows this summary):
  1. Visual-only attention control, maintaining strong consistency while following text instructions [19].
  2. Mask-guided attention fusion, cleanly separating edited and non-edited regions [20].
  3. Differentiated control of the Q/K/V tokens, enabling smooth transitions from complete structure preservation to free structural modification [21].

Experimental Validation
- ConsistEdit is compared against five mainstream methods on the PIE-Bench dataset, demonstrating advantages in both image and video editing tasks [22].

Generalization
- ConsistEdit adapts to various MM-DiT variants, including Stable Diffusion 3 and others, showcasing its versatility across different models [31].

Application Prospects
- The high consistency and fine-grained control of ConsistEdit make it suitable for a wide range of visual creation scenarios, from static images to dynamic videos, expanding interactive creative possibilities [34].
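A rough way to picture the Q/K/V control described above: during the edit pass, attention layers reuse the source image's queries and keys (which carry structure) inside the region that should stay consistent, while values (which carry appearance) come from the edit pass. The snippet below is a conceptual sketch with assumed tensor shapes; it is not the ConsistEdit code and omits the MM-DiT specifics.

```python
# Conceptual sketch of structure-preserving attention control (assumed shapes;
# not the ConsistEdit implementation).
import torch

def consistency_controlled_attention(q_src, k_src, v_src, q_edit, k_edit, v_edit, mask):
    """Blend attention inputs from a source pass and an edit pass.

    q_*, k_*, v_*: (B, N, D) visual-token projections from the two diffusion passes.
    mask: (B, N, 1) with 1 where the image should stay consistent with the source.

    Inside the preserved region we reuse the source Q/K (structure), while V comes
    from the edit pass (texture/content), giving decoupled structure/texture control.
    """
    q = mask * q_src + (1 - mask) * q_edit
    k = mask * k_src + (1 - mask) * k_edit
    v = v_edit
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v

B, N, D = 1, 64, 32
out = consistency_controlled_attention(
    *(torch.randn(B, N, D) for _ in range(6)),
    mask=torch.ones(B, N, 1),
)
print(out.shape)  # torch.Size([1, 64, 32])
```

Relaxing the mask (or blending Q/K only partially) is one way such a scheme could trade structure preservation against editing freedom, in the spirit of the "smooth transition" described above.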
Just now: the father of PyTorch joins TML at lightning speed! Barely a day after leaving Meta, he joins the $50-billion-valuation unicorn
机器之心· 2025-11-19 02:09
Core Viewpoint
- Soumith Chintala, the creator of PyTorch, has joined Thinking Machines Lab (TML), a startup valued at $50 billion, indicating a shift towards new ventures and innovation in AI [2][4].

Group 1: Chintala's Transition
- Chintala officially left Meta on November 17 and joined TML shortly after, reflecting his desire to move beyond PyTorch and explore new opportunities [4].
- His LinkedIn profile currently lists him as a "Technician," leaving the specifics of his new projects at TML unclear [3].
- Chintala expressed a strong wish to avoid being tied to a single project for decades, as he mentioned in his farewell letter [10].

Group 2: Career Background
- Chintala's career has been marked by significant challenges, including being rejected by 12 universities and facing multiple job rejections, including three from DeepMind [12].
- He eventually joined FAIR, where he led the development of PyTorch, which now holds over 90% usage in the AI field and supports training at an unprecedented scale [12][14].
- His departure from Meta to TML signifies a bold career move, showcasing his evolution from a struggling engineer to a leading figure in AI [14].

Group 3: Future of PyTorch
- Concerns about PyTorch's future have arisen following Chintala's departure; however, he has ensured that the team is resilient and capable of decision-making without his direct involvement [16].
- Chintala stated that the project no longer relies on him, emphasizing its strength and the foundational role it plays in redefining intelligence [16][17].
- He believes that AI is most effective when it is accessible and open-source, hinting at his future aspirations at TML [17].
Kaiming He's major new work: Just image Transformers bring denoising models back to basics
机器之心· 2025-11-19 02:09
Core Insights
- The article discusses the relationship between image generation and denoising diffusion models, emphasizing that high-quality image generation relies on diffusion models [1].
- It questions whether denoising diffusion models truly "denoise," noting that the field has shifted from predicting the clean image to predicting the noise itself [2][5].
- The research proposes returning to direct prediction of clean data, which allows networks with seemingly insufficient capacity to operate effectively in high-dimensional spaces [7][8].

Group 1: Denoising Diffusion Models
- Denoising diffusion models do not "denoise" in the classical sense, since they predict the noise or other noisy quantities instead of the clean image [5][6].
- Under the manifold assumption, natural images lie on a low-dimensional manifold while noise is off-manifold, so predicting clean data and predicting noise are fundamentally different tasks [4][6].
- The study introduces a model that directly predicts clean data, which could enhance the performance of diffusion models [7].

Group 2: Just Image Transformers (JiT)
- The paper presents the "Just image Transformers" (JiT) architecture, which uses plain large-patch pixel-level Transformers to build powerful generative models without tokenizers or pre-training (a minimal sketch follows this summary) [11].
- JiT achieves competitive pixel-space image generation on ImageNet, with FID scores of 1.82 at 256x256 resolution and 1.78 at 512x512 resolution [12].
- The architecture is self-contained and applicable to other domains involving natural data, such as protein and molecular data [12].

Group 3: Model Performance and Design
- JiT divides images into non-overlapping patches, allowing effective processing of high-dimensional pixel data [14].
- The choice of prediction target matters greatly: x-prediction (predicting the clean data) yields the best results across the loss functions tested [21][23].
- Increasing the number of hidden units is not necessary: JiT operates effectively at higher resolutions without additional modifications [28][31].

Group 4: Scalability and Generalization
- JiT maintains similar computational cost across resolutions while achieving strong performance, demonstrating its scalability [42][44].
- The network design can be decoupled from the observed data dimension, allowing flexibility in model architecture [31].
- Introducing bottleneck structures in the network can further improve performance by encouraging the model to learn intrinsic low-dimensional representations [33].

Group 5: Conclusion and Future Implications
- The findings on x-prediction follow naturally from the limitation that neural networks are better suited to modeling data than to modeling noise [51].
- The proposed "Diffusion + Transformer" paradigm has the potential to serve as a foundational method in fields where obtaining tokenizers is challenging [52].
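Since the summary above describes the architecture only verbally, here is a minimal sketch of the "large-patch pixel transformer that predicts clean data" idea under assumed sizes; it is not the paper's code, and training and sampling details (noise schedule, timestep/class conditioning, positional embeddings) are omitted.

```python
# Minimal "pixel patches in, clean image out" transformer (illustrative assumptions;
# not the official JiT code).
import torch
import torch.nn as nn

class TinyJiT(nn.Module):
    def __init__(self, patch=16, d_model=384, n_heads=6, n_layers=6):
        super().__init__()
        self.patch = patch
        patch_dim = 3 * patch * patch                    # raw pixel patch, no tokenizer
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.unembed = nn.Linear(d_model, patch_dim)     # predict the clean pixels directly

    def forward(self, noisy_img):
        B, C, H, W = noisy_img.shape
        p = self.patch
        # split the noisy image into non-overlapping p x p pixel patches
        x = noisy_img.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.blocks(self.embed(x))
        out = self.unembed(tokens)                                # x-prediction: clean patches
        # fold the predicted patches back into an image
        out = out.reshape(B, H // p, W // p, C, p, p).permute(0, 3, 1, 4, 2, 5)
        return out.reshape(B, C, H, W)

model = TinyJiT()
x0_hat = model(torch.randn(2, 3, 256, 256))   # denoised estimate of the clean image
print(x0_hat.shape)  # torch.Size([2, 3, 256, 256])
```

Training such a model would regress `x0_hat` against the clean image for noisy inputs drawn across the diffusion schedule; the key contrast with noise-prediction is only what the final linear layer is asked to output.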
Gemini 3 arrives late at night: edging out GPT-5.1, the Google era of large models has come
机器之心· 2025-11-18 18:19
Core Insights
- Gemini 3 has been highly anticipated in the AI community, with significant excitement leading up to its release [2][3].
- Google defines Gemini 3 as a crucial step towards AGI, claiming the strongest multimodal understanding and interaction capabilities in the world [11][12].
- The model has set new SOTA marks in reasoning and multimodal capabilities, outperforming competitors such as Claude Sonnet 4.5 and GPT-5.1 [13][14].

Performance Metrics
- Gemini 3 Pro achieved a record Elo score of 1501 on the LMArena Leaderboard, surpassing previous models and competitors across benchmarks [13].
- On Humanity's Last Exam, it scored 37.5% without tools and 45.8% with search and code execution, showcasing its academic reasoning capabilities [15].
- It also scored 31.1% on the ARC-AGI-2 visual reasoning puzzles and 91.9% on GPQA Diamond [15].

Interaction and Usability
- Gemini 3 Pro improves interaction quality, giving concise, direct responses rather than excessive flattery [16].
- It serves as a true thinking partner, offering new ways to understand information and express ideas [17][18].
- The Deep Think mode further enhances reasoning and multimodal understanding, achieving impressive scores on challenging AI benchmarks [19][21].

Learning and Development Capabilities
- Gemini 3 integrates text, images, and video to support seamless learning experiences [23].
- It can generate interactive study materials and analyze performance in activities such as sports [25].
- The model excels at zero-shot generation, significantly improving developer efficiency and enabling the creation of rich, interactive web interfaces [28][29].

Planning and Long-term Management
- Gemini 3 demonstrated superior long-horizon planning capabilities, as evidenced by its performance on the Vending-Bench 2 test [32][36].
- It maintains consistent decision-making and tool usage over extended tasks, achieving a higher return on investment [33].

Market Position and Future Outlook
- Gemini 3 is now fully accessible to users and developers across various platforms, with tiered pricing based on context length [38][40].
- The introduction of Google Antigravity as a new development platform extends the collaborative capabilities of AI in software development [43].
- Market confidence in Gemini is reflected in user engagement metrics, with 2 billion monthly active users and 650 million monthly app users reported [52].