机器之心
Parallel diffusion architecture pushes the limit: 5-minute AI video generation that challenges OpenAI and Google?
机器之心· 2025-11-20 09:35
Core Insights
- CraftStory has launched its Model 2.0 video generation system, capable of producing expressive, human-centered videos up to five minutes long, addressing the long-standing "video duration" challenge in the AI video generation industry [1][3][5]

Company Overview
- CraftStory was founded by Victor Erukhimov, a key contributor to the widely used computer vision library OpenCV, who previously co-founded Itseez, a company acquired by Intel in 2016 [3][9]
- The company aims to deliver significant commercial value to businesses struggling to scale video production for training, marketing, and customer education [3][5]

Technology and Innovation
- The breakthrough in video duration is attributed to CraftStory's parallel diffusion architecture, which departs fundamentally from traditional models that require larger networks and more resources as videos grow longer [5][6]
- CraftStory's system processes all segments of a five-minute video simultaneously, avoiding the accumulation of flaws that occurs when segments are generated sequentially [6][7]
- The training data includes high-quality footage captured by professional studios, ensuring clarity even in fast-moving scenes, in contrast to the motion blur often found in standard video [6][7]

Product Features
- Model 2.0 is a "video-to-video" conversion model that lets users upload their own videos or use preset ones, maintaining character identity and emotional nuance over longer sequences [7][8]
- The system can generate a 30-second low-resolution video in approximately 15 minutes and features advanced lip-syncing and gesture-alignment algorithms [7][8]

Market Position and Future Directions
- CraftStory recently closed a $2 million funding round, which, while modest compared to larger competitors, reflects the company's belief that success does not depend solely on massive funding [9]
- The company targets the B2B market, focusing on how software companies can create effective training and product videos rather than consumer creative tools [9]
- Future developments include a "text-to-video" model that will let users generate long-form content directly from scripts, as well as support for mobile camera footage [9]
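The parallel-segment idea described above can be illustrated with a minimal sketch. Everything below (the segment count, the `denoise_step` placeholder, the latent shapes) is an assumption for illustration, not CraftStory's published implementation: instead of denoising segment k only after segment k-1 is finished, all segments are advanced through each diffusion step together, so errors cannot compound along the timeline.

```python
import numpy as np

# Hypothetical setup: 10 segments of 30 s each -> a 5-minute video.
NUM_SEGMENTS, LATENT_DIM, NUM_STEPS = 10, 256, 50

def denoise_step(latents: np.ndarray, step: int) -> np.ndarray:
    """Placeholder for one diffusion denoising step applied jointly to all
    segment latents; a real model would condition each segment on its
    neighbours (e.g. via cross-segment attention) inside this call."""
    noise_scale = 1.0 - (step + 1) / NUM_STEPS
    return latents * 0.98 + np.random.randn(*latents.shape) * 0.01 * noise_scale

def generate_parallel(num_segments: int = NUM_SEGMENTS) -> np.ndarray:
    # All segments start from noise and are refined *together* at every step,
    # rather than segment k waiting for segment k-1 to finish (the sequential
    # scheme in which artifacts accumulate from one segment into the next).
    latents = np.random.randn(num_segments, LATENT_DIM)
    for step in range(NUM_STEPS):
        latents = denoise_step(latents, step)
    return latents  # one latent per 30-second segment, decoded downstream

if __name__ == "__main__":
    video_latents = generate_parallel()
    print(video_latents.shape)  # (10, 256)
```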
This Saturday: join the NeurIPS 2025 paper sharing conference — last call for registration
机器之心· 2025-11-20 06:35
Core Insights
- The evolution of AI is transitioning from "capability breakthroughs" to "system construction" in 2025, focusing on reliability, interpretability, and sustainability [2]
- NeurIPS, a leading academic conference in AI and machine learning, received 21,575 submissions this year with an acceptance rate of 24.52%, indicating growing interest in AI research [2]
- The conference will take place from December 2 to 7, 2025, in San Diego, USA, with a new official venue in Mexico City, reflecting the diversification of the global AI academic ecosystem [2]

Event Overview
- The "NeurIPS 2025 Paper Sharing Conference" is designed for domestic AI talent, featuring keynote speeches, paper presentations, roundtable discussions, poster exchanges, and corporate interactions [3]
- The event is scheduled for November 22, 2025, from 09:00 to 17:30 at the Crowne Plaza Hotel in Zhongguancun, Beijing [5][6]

Keynote Speakers and Topics
- Morning keynote by Qiu Xipeng of Fudan University on "Contextual Intelligence: Completing the Key Puzzle of AGI" [8][14]
- Afternoon keynote by Fan Qi of Nanjing University on "From Frames to Worlds: Long Video Generation for World Models" [10][17]

Paper Presentations
- Presentations will cover topics such as data mixing in knowledge acquisition, multimodal adaptation for large language models, and scalable data generation frameworks [9][30]
- Presenters include doctoral students from Tsinghua University and Renmin University, showcasing cutting-edge AI research [9][30]

Roundtable Discussion
- A roundtable discussion will explore whether world models will become the next frontier in AI, featuring industry experts and academics [10][20]
AI finally learns to "read minds," boosting models such as DeepSeek R1 and OpenAI o3
机器之心· 2025-11-20 06:35
Core Insights
- The article discusses MetaMind, a framework designed to enhance AI's social reasoning capabilities by integrating metacognitive principles from psychology, allowing AI to better understand human intentions and emotions [7][24][47]

Group 1: Introduction and Background
- Human communication often carries meanings that go beyond the literal words spoken, requiring an understanding of implied intentions and emotional states [5]
- The ability to infer others' mental states, known as Theory of Mind (ToM), is a fundamental aspect of social intelligence that develops in children around the age of four [5][6]

Group 2: Challenges in AI Social Intelligence
- Traditional large language models (LLMs) struggle with the ambiguity and indirectness of human communication, often producing mechanical responses [6]
- Previous attempts to enhance AI's social behavior have not succeeded in imparting the layered psychological reasoning that humans possess [6][26]

Group 3: MetaMind Framework
- MetaMind employs a three-stage metacognitive multi-agent system to simulate human social reasoning, inspired by the concept of metacognition [10][17]
- The first stage uses a Theory-of-Mind agent that generates hypotheses about the user's mental state based on their statements [12]
- The second stage features a Moral Agent that applies social norms to filter the hypotheses generated in the first stage, ensuring contextually appropriate interpretations [14][15]
- The third stage includes a Response Agent that generates and validates the final response, ensuring it aligns with the inferred user intentions and emotional context [16][17]

Group 4: Social Memory Mechanism
- The framework incorporates a dynamic social memory that records long-term user preferences and emotional patterns, allowing for personalized interactions [19][20]
- This social memory helps the AI maintain consistent emotional tone and content across multiple interactions, addressing the disjointed responses common in traditional models [20][23]

Group 5: Performance and Benchmarking
- MetaMind has demonstrated significant performance improvements across benchmarks, including ToMBench and social cognition tasks, achieving human-level performance in some areas [27][28]
- For instance, GPT-4's average psychological-reasoning accuracy improved from approximately 74.8% to 81.0% with the integration of MetaMind [28][31]
(A minimal sketch of the three-stage pipeline follows this summary.)

Group 6: Practical Applications
- These advances in AI social intelligence have implications for customer service, virtual assistants, and educational tools, enabling more empathetic and context-aware interactions [47][48]
- The framework's ability to adapt to cultural norms and individual user preferences positions it as a valuable tool for enhancing human-AI interaction in diverse settings [47][48]

Group 7: Conclusion and Future Directions
- MetaMind represents a shift in AI design philosophy, focusing on aligning AI reasoning processes with human cognitive patterns rather than merely increasing model size [49]
- The prospect of AI understanding not just spoken words but also unspoken emotions and intentions marks a significant step toward general artificial intelligence [49]
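A minimal sketch of the three-stage pipeline summarized above, assuming a generic `llm()` call; the agent names follow the article, but the prompt wording, data structures, and filtering logic are illustrative assumptions rather than the authors' released code:

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (OpenAI, DeepSeek, etc.)."""
    raise NotImplementedError("plug in your model client here")

@dataclass
class SocialMemory:
    # Long-term record of user preferences and emotional patterns,
    # fed back into every stage for personalized, consistent replies.
    entries: list[str] = field(default_factory=list)

    def summarize(self) -> str:
        return "; ".join(self.entries[-5:]) or "no prior history"

def theory_of_mind_agent(utterance: str, memory: SocialMemory) -> str:
    # Stage 1: hypothesize the user's hidden mental state (intent, emotion).
    return llm(f"User said: {utterance!r}. Known history: {memory.summarize()}. "
               "List plausible hypotheses about their intention and emotion.")

def moral_agent(hypotheses: str, context: str) -> str:
    # Stage 2: filter hypotheses against social norms and the current context.
    return llm(f"Given context {context!r}, keep only the hypotheses that are "
               f"socially appropriate and plausible:\n{hypotheses}")

def response_agent(utterance: str, filtered: str) -> str:
    # Stage 3: draft a reply and validate it against the inferred intent/emotion.
    draft = llm(f"Reply empathetically to {utterance!r}, consistent with: {filtered}")
    verdict = llm(f"Does this reply respect the inferred state ({filtered})? "
                  f"Answer yes/no: {draft}")
    if verdict.strip().lower().startswith("yes"):
        return draft
    return llm(f"Revise the reply {draft!r} to better match: {filtered}")

def metamind_respond(utterance: str, context: str, memory: SocialMemory) -> str:
    hypotheses = theory_of_mind_agent(utterance, memory)
    filtered = moral_agent(hypotheses, context)
    reply = response_agent(utterance, filtered)
    memory.entries.append(f"user: {utterance} | inferred: {filtered}")
    return reply
```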
Segmenting everything is not enough — now reconstruct everything in 3D: SAM 3D is here
机器之心· 2025-11-20 02:07
Core Insights
- Meta has launched significant updates with the introduction of SAM 3D and SAM 3, enhancing 3D understanding of images [1][2]

Group 1: SAM 3D Overview
- SAM 3D is the latest addition to the SAM series, comprising two models that convert static 2D images into detailed 3D reconstructions [2][5]
- SAM 3D Objects focuses on object and scene reconstruction, while SAM 3D Body specializes in human shape and pose estimation [5][28]
- Meta has made the model weights and inference code for SAM 3D and SAM 3 publicly available [7]

Group 2: SAM 3D Objects
- SAM 3D Objects introduces a novel technical approach for robust, realistic 3D reconstruction and object pose estimation from a single natural image [11]
- The model can generate detailed 3D shapes, textures, and scene layouts from everyday photos, overcoming challenges such as small objects and occlusion [12][13]
- Meta annotated nearly 1 million images, generating approximately 3.14 million 3D meshes, using a scalable data engine for efficient data collection [17][22]

Group 3: SAM 3D Body
- SAM 3D Body addresses accurate 3D human pose and shape reconstruction from a single image, even in complex scenarios [28]
- The model supports interactive input, allowing users to guide and control predictions for improved accuracy [29]
- A high-quality training dataset of around 8 million images was created to improve the model's performance across various 3D benchmarks [31]

Group 4: SAM 3 Capabilities
- SAM 3 introduces promptable concept segmentation, enabling the model to identify and segment instances of specific concepts based on text or example images [35]
- The architecture of SAM 3 builds on previous AI advances, utilizing the Meta Perception Encoder for enhanced image recognition and object detection [37]
- SAM 3 achieves a twofold improvement in concept segmentation performance over existing models, with fast inference even for images containing many detection targets [39]
Jensen Huang opens GTC: the "AI-XR Scientist" has arrived!
机器之心· 2025-11-20 02:07
Core Insights
- The article discusses LabOS, a groundbreaking AI platform that integrates AI with XR (extended reality) to enhance scientific research and experimentation, marking a new era of human-machine collaboration in science [2][4]

Group 1: LabOS Overview
- LabOS is described as the world's first Co-Scientist platform combining multi-modal perception, self-evolving agents, and XR technology, creating a seamless connection between AI computational reasoning and real-time human-robot collaboration in experiments [4]
- The platform aims to significantly reduce the time and cost of scientific research, with claims that tasks that previously took years can now be completed in weeks and costs reduced from millions of dollars to thousands [4]

Group 2: AI's Evolution in Laboratories
- Traditional scientific AI systems, such as AlphaFold, operate in a purely digital realm and cannot perform physical experiments, which limits research efficiency and reproducibility [6]
- LabOS represents a significant advance by integrating abstract intelligence with physical operations, enabling AI to function as a collaborating scientist in real laboratory settings [6][8]

Group 3: World Model and Visual Understanding
- The complexity of laboratory environments places high demands on AI visual understanding, motivating the LabSuperVision (LSV) benchmark, which assesses AI models' laboratory perception and reasoning capabilities [13]
- LabOS has been trained to decode visual input from XR glasses, with a 235-billion-parameter version achieving over 90% accuracy in error detection, a significant improvement in visual reasoning [13][16]

Group 4: Applications in Biomedical Research
- LabOS has demonstrated its capabilities in three advanced biomedical research cases, including the autonomous discovery of new cancer immunotherapy targets, showcasing end-to-end research from hypothesis generation to experimental validation [22][25]
- The platform also supports mechanism research into biological processes, such as identifying core regulatory genes, and improves reproducibility in complex wet-lab experiments by capturing expert operations and forming standardized digital processes [29][32]

Group 5: Future Implications
- The emergence of LabOS signals a fundamental shift in the paradigm of scientific discovery, aiming to expand the boundaries of science through collaboration with AI [32]
- Integrating AI and robotics into laboratories is expected to accelerate scientific discovery, transforming the traditional reliance on individual expertise into a more scalable and reproducible scientific process [32]
Breaking: Yann LeCun officially announces his departure to found a startup targeting Advanced Machine Intelligence (AMI)
机器之心· 2025-11-20 02:07
Core Viewpoint
- Yann LeCun, a Turing Award winner, has announced his departure from Meta to start a new company focused on Advanced Machine Intelligence (AMI), aiming to revolutionize AI with systems that understand the physical world, possess long-term memory, reason, and plan complex actions [1][8][14]

Group 1: Company Transition
- LeCun's new venture will continue his research on "world models," which he believes are essential for AI to truly understand the physical world [8][27]
- Meta will act as a partner to LeCun's new company, supporting the AMI initiative, whose interests overlap with Meta's business but also extend into other areas [8][28]
- The departure marks a significant shift in the AI landscape, as LeCun leaves FAIR (Facebook AI Research), the lab he helped establish at Meta, amid internal cultural conflicts and strategic misalignment [17][27]

Group 2: Research Focus
- The new company's goal is to drive a major revolution in AI, focusing on systems that can understand the physical world and plan actions without extensive trial and error [8][24]
- LeCun has long criticized large language models (LLMs), arguing that they lack true understanding of the physical world, and he aims to develop AI that can reason and plan using world models [19][27]
- His recent research contributions include the JEPA line of work, which aims to create organized, actionable high-dimensional embedding spaces and is seen as a potential pathway to world models [25][27]

Group 3: Industry Impact
- LeCun's move to entrepreneurship at age 65 signals a new phase of exploration in AI, away from the constraints of corporate environments and toward foundational scientific challenges [14][27]
- The departure of LeCun, alongside other key figures such as Soumith Chintala, marks the end of an era for Meta AI and highlights the ongoing evolution of the AI research community [28]
A key step toward artificial general intelligence? DeepMind unveils SIMA 2, its strongest AI agent for 3D worlds
机器之心· 2025-11-20 02:07
Core Viewpoint
- Google DeepMind has launched SIMA 2, a general AI agent capable of autonomous gameplay, reasoning, and continuous learning in virtual 3D environments, marking a significant step toward general artificial intelligence [2][3][6]

Group 1: SIMA 2 Overview
- SIMA 2 represents a major leap over its predecessor SIMA, evolving from a passive instruction follower into an interactive gaming companion that can autonomously plan and reason in complex environments [6][10]
- Integration of the Gemini model underpins SIMA 2's capabilities, allowing it to understand user intentions, formulate plans, and execute actions through a multi-step cognitive chain [15][20]

Group 2: Performance and Capabilities
- SIMA 2 can understand and execute complex instructions with higher success rates, even in unfamiliar scenarios, demonstrating generalization across different tasks and environments [24][30]
- The agent shows self-improvement: it learns through trial and error and uses feedback from the Gemini model to refine its skills without additional human-generated data [35][39]
(A minimal sketch of this perceive-plan-act loop follows this summary.)

Group 3: Future Implications
- SIMA 2's ability to operate across varied game environments serves as a critical testing ground for general intelligence, enabling the agent to master skills and engage in complex reasoning [41][43]
- The research highlights SIMA 2's potential contribution to robotics and physical AI applications, as it learns skills essential for future AI assistants in the physical world [43]
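A minimal sketch of the perceive-plan-act loop with model-generated feedback described above; the function names, the `gemini()` placeholder, and the environment interface are illustrative assumptions, not DeepMind's SIMA 2 implementation:

```python
from typing import Any

def gemini(prompt: str) -> str:
    """Placeholder for a call to a reasoning model (e.g. Gemini via its API)."""
    raise NotImplementedError

def perceive(game_state: Any) -> str:
    """Convert raw observations (screen pixels, HUD text) into a description."""
    return str(game_state)

def act(action: str, game: Any) -> Any:
    """Send a keyboard/mouse-level action to the game and return the new state."""
    return game.step(action)  # assumed environment interface

def agent_episode(goal: str, game: Any, max_steps: int = 20) -> list[str]:
    transcript = []
    state = game.reset()
    for _ in range(max_steps):
        observation = perceive(state)
        # Multi-step cognitive chain: interpret the goal, plan, pick one action.
        plan = gemini(f"Goal: {goal}\nObservation: {observation}\n"
                      "Think step by step, then output the single next action.")
        state = act(plan, game)
        # Self-improvement signal: the model itself scores the attempt, so no
        # extra human-generated data is needed for the next training round.
        feedback = gemini(f"Goal: {goal}\nAction taken: {plan}\n"
                          f"New observation: {perceive(state)}\n"
                          "Did this make progress? Give a score 0-10 and a tip.")
        transcript.append(f"{plan} | {feedback}")
    return transcript  # replayed later as self-generated training data
```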
NeurIPS 2025 | In-context meta-learning enables cross-subject brain activity prediction without fine-tuning
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses BraInCoRL, a novel brain encoding model that uses meta-learning and in-context learning to predict brain responses to visual stimuli with minimal data requirements [3][32]
- The model addresses the limitations of traditional visual encoding models, which require extensive data collection for each individual, making them costly and difficult to deploy in clinical settings [6][32]

Background and Innovation
- The research highlights significant functional differences in the human higher visual cortex across individuals, necessitating brain encoding models that can effectively represent these differences [2][6]
- BraInCoRL predicts brain responses from only a small number of example images and their corresponding brain activity data, eliminating the need for model fine-tuning [3][32]

Methodology
- The BraInCoRL framework treats each voxel as an independent function mapping visual stimuli to neural responses, leveraging meta-learning and in-context learning to enhance data efficiency and generalization [7][10]
- During training, the model learns shared structure of visual cortex responses across multiple subjects; at test time, it can instantiate a subject-specific voxel encoder from just a few image-brain response pairs [11][20]

Experimental Results
- BraInCoRL demonstrates high data efficiency, matching the variance explained by models trained on thousands of images while using only 100 context images [20][22]
- The model performs robustly across datasets and scanning protocols, confirming its cross-device and cross-protocol generalization [22][23]
- Semantic clustering visualizations reveal clear functional organization within the visual cortex, with distinct areas for faces, scenes, and other categories [26][27]

Conclusion
- BraInCoRL brings in-context learning to computational neuroscience, creating a data-efficient, interpretable, and language-interactive framework for visual cortex encoding [32]
- This innovation significantly lowers the barrier to building individualized brain encoding models, paving the way for applications in clinical neuroscience and other data-limited scenarios [32]
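A minimal sketch of the in-context voxel-encoding idea described above, assuming precomputed image embeddings and a generic transformer encoder; the shapes, layer sizes, and names are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class InContextVoxelEncoder(nn.Module):
    """Meta-learned encoder: given K (image embedding, voxel response) context
    pairs from a new subject, predict that voxel's response to query images
    without any gradient-based fine-tuning."""

    def __init__(self, img_dim: int = 512, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # Each context token packs an image embedding plus its measured response.
        self.context_proj = nn.Linear(img_dim + 1, d_model)
        self.query_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, ctx_imgs, ctx_resp, query_imgs):
        # ctx_imgs: (B, K, img_dim), ctx_resp: (B, K), query_imgs: (B, Q, img_dim)
        ctx = self.context_proj(torch.cat([ctx_imgs, ctx_resp.unsqueeze(-1)], -1))
        qry = self.query_proj(query_imgs)
        tokens = self.backbone(torch.cat([ctx, qry], dim=1))
        return self.readout(tokens[:, ctx.size(1):]).squeeze(-1)  # (B, Q)

# Usage: 100 context image-response pairs for one voxel of a new subject.
model = InContextVoxelEncoder()
ctx_imgs = torch.randn(1, 100, 512)   # stand-in for image embeddings
ctx_resp = torch.randn(1, 100)        # measured responses for this voxel
query = torch.randn(1, 10, 512)       # new stimuli to predict
predicted = model(ctx_imgs, ctx_resp, query)
print(predicted.shape)  # torch.Size([1, 10])
```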
NeurIPS 2025 Spotlight | University of Hong Kong proposes a label-free method for enhancing ViT dense representations
机器之心· 2025-11-19 04:07
Core Insights
- The article introduces PH-Reg, a novel method for enhancing Vision Transformers (ViTs) by removing artifacts from dense features without requiring data labeling, thereby improving performance on fine-grained tasks [2][6][19]

Group 1: Methodology
- PH-Reg employs a test-time augmentation denoising strategy to remove artifacts from the teacher model's dense features, yielding a student model that outputs artifact-free dense features [2][11]
- The self-distillation framework enhances the student architecture with minimal intrusion, restricting updates to specific components while preserving the core knowledge of the pre-trained ViT [11][20]
- The method is designed to be plug-and-play, requiring no retraining and enabling efficient artifact removal from existing pre-trained models such as CLIP and DINOv2 [19][22]

Group 2: Experimental Results
- In semantic segmentation across eight benchmark datasets, PH-Reg outperformed mainstream methods such as MaskCLIP and SCLIP on seven of them, demonstrating its robustness and effectiveness [13][21]
- For the CLIP model, the method improved mean Intersection over Union (mIoU) by 5.04% on VOC21 and by 3.64% on ADE20K [21]
- Training time is reduced by over 58.9% compared with traditional methods, at 9,000 minutes in total versus the 21,908 minutes required by DVT [17][22]

Group 3: Advantages
- PH-Reg's core advantage lies in its independence from gradient-based neural-field learning, allowing a single-stage distillation process that minimizes storage requirements and computational resources [22]
- All distillation targets can be computed on the fly without additional storage, in contrast to DVT's requirement of 1.4 TB for neural-field feature data [22]
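A minimal sketch of the test-time-augmentation denoising plus self-distillation idea summarized above; the augmentation set, loss, and backbone interface are illustrative assumptions, not the released PH-Reg code:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(images: torch.Tensor, shift: int) -> torch.Tensor:
    """One cheap test-time augmentation: a circular pixel shift."""
    return torch.roll(images, shifts=(shift, shift), dims=(-2, -1))

def deaugment(features: torch.Tensor, shift: int, patch: int = 16) -> torch.Tensor:
    """Undo the shift in feature space (features assumed to be B x C x H x W)."""
    return torch.roll(features, shifts=(-shift // patch, -shift // patch), dims=(-2, -1))

@torch.no_grad()
def denoised_teacher_features(teacher: nn.Module, images: torch.Tensor) -> torch.Tensor:
    # Averaging features over augmented views cancels position-dependent
    # artifacts while keeping the content signal -- this is the distillation target.
    shifts = [0, 16, 32, 48]
    feats = [deaugment(teacher(augment(images, s)), s) for s in shifts]
    return torch.stack(feats).mean(dim=0)

def distill_step(student: nn.Module, teacher: nn.Module,
                 images: torch.Tensor, opt: torch.optim.Optimizer) -> float:
    target = denoised_teacher_features(teacher, images)
    pred = student(images)           # student learns artifact-free dense features
    loss = F.mse_loss(pred, target)  # single-stage, label-free distillation loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage sketch (placeholders): the student starts as a copy of the frozen
# teacher, and only selected parameters would be unfrozen in practice.
# teacher = ...  # any pre-trained ViT wrapper returning B x C x H x W features
# student = copy.deepcopy(teacher)
# opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
```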
Realsee releases spatial large model Argus 1.0, supporting panoramic images and other diverse inputs — an industry first!
机器之心· 2025-11-19 04:07
Core Viewpoint
- The article discusses Argus 1.0, a groundbreaking spatial model from Realsee that aims to recreate the real world in a 3D interactive format, in contrast to AI-generated virtual worlds [2][4]

Group 1: Introduction of Argus 1.0
- Argus 1.0 is described as the world's first spatial model that supports panoramic image input and infers spatial depth, representing a shift from virtual generation to real-world replication [2][6]
- The model processes single or multiple panoramic images to derive camera poses, depth maps, and point clouds at millisecond-level speed [2][6]

Group 2: Foundation of Argus 1.0
- The development of Argus 1.0 builds on Realsee's extensive experience in spatial digitization since its founding in 2017, driven by a "digital space - algorithm - industry application" flywheel [6][14]
- Realsee has accumulated over 53 million sets of digital space data covering more than 4.4 billion square meters globally, forming the largest database of real spaces [7][8]

Group 3: Technical Innovations
- Argus 1.0 marks a transition from single-view depth estimation to multi-view consistency, using a Transformer architecture trained on nearly one million sets of real high-definition spatial data [16][24]
- The model is the first in the industry to support panoramic images as input, significantly improving the efficiency of VR content production [17][21]

Group 4: Quality and Performance
- Argus 1.0 achieves high-quality output thanks to its high-precision, scale-aware, pixel-aligned real-world database, handling challenging scenarios such as glass and mirrors effectively [24][29]
- Inference runs at millisecond-level speed, making Argus 1.0 the first real-time panoramic global reconstruction system [22][23]

Group 5: Future Directions and Industry Impact
- Argus 1.0 is a key component of Realsee's "spatial intelligence" framework, which outlines a four-layer theory spanning digitization to intelligence [30][34]
- The company plans to release Argus 2.0 and subsequent versions to further enhance real-time rendering and support advanced applications across industries [36][38]
- Realsee also plans to open a dataset of 10,000 sets of indoor housing data to foster innovation in the spatial intelligence sector and address the significant shortage of high-quality spatial data [39][40]
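A minimal sketch of the input/output interface implied by the description above (panoramas in; camera poses, depth maps, and a point cloud out); the class and function names are hypothetical and not Realsee's published API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpatialReconstruction:
    camera_poses: np.ndarray   # (N, 4, 4) world-from-camera transforms
    depth_maps: np.ndarray     # (N, H, W) per-panorama depth in meters
    point_cloud: np.ndarray    # (P, 3) fused 3D points

def reconstruct_from_panoramas(panoramas: list[np.ndarray]) -> SpatialReconstruction:
    """Hypothetical wrapper around a feed-forward spatial model: panoramas in,
    poses/depth/points out in a single forward pass (no per-scene optimization)."""
    n = len(panoramas)
    h, w = panoramas[0].shape[:2]
    poses = np.tile(np.eye(4), (n, 1, 1))          # placeholder identity poses
    depths = np.ones((n, h, w), dtype=np.float32)  # placeholder unit depth
    points = np.zeros((n * 1000, 3), dtype=np.float32)
    return SpatialReconstruction(poses, depths, points)

# Usage: two 2:1 equirectangular panoramas of the same room.
panos = [np.zeros((512, 1024, 3), dtype=np.uint8) for _ in range(2)]
result = reconstruct_from_panoramas(panos)
print(result.camera_poses.shape, result.depth_maps.shape, result.point_cloud.shape)
```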