机器之心

2D Images as Intermediaries, SOTA 3D Scene Generation with Zero Training: NVIDIA & Cornell Propose a New Text-Driven Pipeline
机器之心· 2025-06-12 03:23
The first author of this paper, 顾泽琪, is a fourth-year computer science PhD student at Cornell University, advised by Prof. Abe Davis and Prof. Noah Snavely, with research focused on generative AI and multimodal large models. The project was completed during an internship at NVIDIA.

Imagine you are a game designer building scenes for a fantasy RPG. You need to create an "elven treehouse village": towering ancient trees with treehouses, glowing mushroom lamps, translucent gauze tents... In a traditional workflow this could take weeks: hand-model every 3D asset, adjust each one's position and materials, then test the lighting over and over. In a word: hard.

Core contribution: a training-free intelligent 3D scene factory. ArtiScene's core innovation is a fully automated pipeline that needs no additional training, elegantly combining state-of-the-art text-to-image generation with 3D reconstruction. It consists of five steps:

1. A 2D image as the "design blueprint". The system first uses a diffusion model to generate an isometric view of the scene. This viewpoint is common in architectural illustrations because it shows an object's length, width, and height at the same time and is unaffected by where objects sit in the scene. Compared with generating 3D directly, this approach leverages the more mature 2D generation stack to ensure a sensible layout and visual appeal.

This predicament epitomizes the state of today's 3D content creation. Traditional 3D design software such as ...
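Only the first of the five steps is described in this excerpt. As a minimal sketch of what that step could look like with an off-the-shelf text-to-image pipeline, the snippet below generates an isometric "design blueprint" image; the model choice and prompt wording are assumptions for illustration, not ArtiScene's actual configuration.

```python
import torch
from diffusers import DiffusionPipeline

# Any capable text-to-image model can stand in here; the model ArtiScene actually uses
# is not specified in this excerpt.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Isometric-view prompt for the scene described above (wording is an assumption).
prompt = (
    "isometric view of an elven treehouse village, towering ancient trees with treehouses, "
    "glowing mushroom lamps, translucent gauze tents, clean background"
)
image = pipe(prompt).images[0]
image.save("scene_blueprint.png")  # the 2D "design blueprint" consumed by later pipeline stages
```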
SIGGRAPH 2025 Awards Announced: ShanghaiTech and Xiamen University Among the Best Paper Winners
机器之心· 2025-06-12 03:23
Core Points
- The SIGGRAPH conference, organized by ACM SIGGRAPH since 1974, is a leading event in the field of graphics and imaging technology, covering areas such as animation, simulation, rendering, and machine learning [2][3].

Group 1: Best Paper Awards
- This year, five best papers were awarded, with significant contributions from domestic institutions including ShanghaiTech University, Huazhong University of Science and Technology, Xiamen University, and Tsinghua University [5].
- Paper 1: "Shape Space Spectra" focuses on the eigenanalysis of differential operators and introduces a shape-space eigenanalysis method applicable in fields such as sound synthesis and elastic dynamics simulation [6][8].
- Paper 2: "CAST: Component-Aligned 3D Scene Reconstruction From an RGB Image" presents a novel method for reconstructing a 3D scene from a single RGB image, addressing challenges in quality and domain limitations [9][13].
- Paper 3: "TokenVerse: Versatile Multi-Concept Personalization in Token Modulation Space" introduces a method for multi-concept personalization using pre-trained text-to-image diffusion models, allowing for seamless integration of complex visual elements [18][21].
- Paper 4 discusses variance-reduction techniques for Monte Carlo integration, introducing a ratio control variate to improve estimation accuracy (a brief control-variate example follows this summary) [25].
- Paper 5: "Transformer IMU Calibrator" presents a dynamic calibration method for inertial motion-capture systems, breaking the static assumption in IMU calibration and expanding its application scenarios [26].

Group 2: Honorable Mentions
- Several papers received honorable mentions, including works from institutions such as the University of California, San Diego, and Google, covering a range of advances in graphics and imaging technology [27][28].
- Notable mentions include "Lifting the Winding Number" and "A Monte Carlo Rendering Framework for Simulating Optical Heterodyne Detection," showcasing innovative approaches in their respective fields [30].

Group 3: Test of Time Award
- The Test of Time Award was established to recognize impactful research from 2013-2015, with four papers selected for their significant contributions to the field [32].
- Awarded papers include "Unified Particle Physics for Real-Time Applications," which introduced a unified dynamics framework for real-time visual effects, and "Learning Visual Similarity for Product Design With Convolutional Neural Networks," which helped shape later research directions in computer graphics [33][34].
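The ratio control variate of Paper 4 is only named above, not explained. As background, here is a tiny NumPy example of the classic additive control variate it refines: pair the integrand with a correlated quantity whose mean is known exactly, then subtract the correlated noise to cut the estimator's variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Estimate I = E[f(U)] with f(x) = exp(x), U ~ Uniform(0, 1); the exact value is e - 1.
u = rng.random(n)
f = np.exp(u)
plain = f.mean()                      # plain Monte Carlo estimate

# Control variate g(x) = x, whose mean (0.5) is known exactly.
g = u
cov = np.cov(f, g)
beta = cov[0, 1] / cov[1, 1]          # variance-minimizing coefficient
cv = (f - beta * (g - 0.5)).mean()    # control-variate estimate

print(plain, cv)                                  # both ≈ 1.71828 (= e - 1)
print(f.var(), (f - beta * (g - 0.5)).var())      # corrected samples have far lower variance
```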
CVPR 2025 | A New Paradigm for Unified Multimodal Learning Arrives, with Data, Models, and Code All Open-Sourced
机器之心· 2025-06-12 00:53
The first author, 杜恒辉, is a second-year master's student at Renmin University of China, advised by Associate Professor 胡迪; his main research interests are audio-visual scene understanding and reasoning with multimodal large models, long-video understanding, and related topics. The authors are from Renmin University of China, Tsinghua University, and the Tencent PCG AI Technology Center in Beijing.

We humans live in a world full of visual and audio information. In recent years, much work has used these two modalities to strengthen models' understanding of audio-visual scenes, giving rise to many different kinds of tasks, each demanding a different level of capability from the model.

Most prior work focuses on completing a single task; by contrast, we humans have a general perceptual understanding of the complex world around us. How to design a model with human-like, general understanding of audio-visual scenes is therefore an extremely important question on the road to AGI.

The current mainstream learning paradigm is to build a large-scale multi-task instruction-tuning dataset and then run instruction tuning on it directly. But is this paradigm actually optimal for multi-task learning?

A recent CVPR 2025 paper from the GeWu-Lab at the Gaoling School of Artificial Intelligence, Renmin University of China, in collaboration with Tsinghua University and the Tencent PCG AI Technology Center in Beijing, argues that this mainstream paradigm overlooks the heterogeneity of multimodal data and the complex relationships among tasks, and that simply training all tasks jointly can cause them to interfere with one another. To effectively achieve explicit inter-task ...
Just Now: LeCun Appears on Camera Himself as Meta Releases a New World Model!
机器之心· 2025-06-12 00:53
Reported by the 机器之心 editorial team.

Meta has been making big moves recently.

A few days ago, foreign media reported that Mark Zuckerberg is assembling an expert team called the "superintelligence team" to pursue artificial general intelligence, and has been offering nine-figure compensation to recruit talent for it.

Just now, Meta made another move, releasing V-JEPA 2 (Video Joint Embedding Predictive Architecture 2), a world model trained on video. It delivers state-of-the-art environment understanding and prediction, and can perform zero-shot planning and robot control in new environments.

Meta says that on the road to advanced machine intelligence (AMI), the key is to build AI systems that perceive the world the way humans do, can plan how to carry out unfamiliar tasks, and adapt efficiently to constantly changing environments.

This time, Meta's Chief AI Scientist Yann LeCun appears on camera himself to explain how world models differ from other AI models.

He says a world model is an abstract digital twin of reality that an AI can consult to understand the world and predict the consequences of its actions. Unlike understanding language, a world model lets a machine understand the physical world and plan courses of action to complete tasks without millions of trials, because the world model provides a basic understanding of how the world works. AI that can reason and plan with a world model will have broad ...
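V-JEPA 2's actual architecture and training recipe are not described in this excerpt. As a rough illustration of the joint-embedding-predictive idea behind it (predict the representation of a future observation from the current one and an action, rather than predicting pixels), here is a toy PyTorch sketch; every module size, name, and the use of random tensors as stand-in data are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an observation (e.g. a video frame) to a latent embedding."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def forward(self, obs):
        return self.net(obs)

class Predictor(nn.Module):
    """Predicts the next latent state from the current latent state and an action."""
    def __init__(self, latent_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

obs_dim, action_dim, latent_dim = 1024, 8, 128
enc, pred = Encoder(obs_dim, latent_dim), Predictor(latent_dim, action_dim)
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=1e-4)

# One toy training step on random tensors standing in for (frame_t, action_t, frame_t+1).
obs_t = torch.randn(32, obs_dim)
action_t = torch.randn(32, action_dim)
obs_next = torch.randn(32, obs_dim)

z_pred = pred(enc(obs_t), action_t)
with torch.no_grad():                     # target embedding; real JEPA training uses an EMA target encoder
    z_target = enc(obs_next)

loss = nn.functional.mse_loss(z_pred, z_target)
opt.zero_grad()
loss.backward()
opt.step()
```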
10% of the Training Data Beats 100%: A Major Breakthrough for Robot Learning
机器之心· 2025-06-11 03:54
Core Viewpoint
- The ViSA-Flow framework represents a revolutionary approach to robot skill learning, significantly enhancing learning efficiency in data-scarce situations by extracting semantic action flows from large-scale human videos [4][36].

Group 1: Research Background and Challenges
- Traditional robot imitation learning methods require extensive, meticulously curated datasets, which are costly to collect, creating a bottleneck for developing robots capable of diverse real-world tasks [7].
- Humans exhibit remarkable abilities to learn new skills through observation, focusing on semantically relevant components while filtering out irrelevant background information [8].

Group 2: Key Innovations
- The core innovation of the ViSA-Flow framework is the introduction of Semantic Action Flow as an intermediate representation, capturing the essential spatiotemporal features of operator-object interactions while remaining unaffected by surface visual differences [11].
- Key components of the framework include (a code sketch of these components follows this summary):
  1. Semantic entity localization, using pre-trained vision-language models to describe and locate the operator and task-relevant objects [11].
  2. Hand-object interaction tracking, which maintains stable segmentation across frames [12].
  3. Flow-conditioned feature encoding, which generates rich feature vectors while preserving visual context [13].

Group 3: Experimental Evaluation
- In the CALVIN benchmark tests, ViSA-Flow outperformed all baseline methods while using only 10% of the annotated robot trajectories (1,768), achieving a 31.4% success rate at completing five consecutive tasks, nearly double that of the next-best method [19].
- The average sequence length of 2.96 further demonstrates ViSA-Flow's effectiveness on long-horizon manipulation tasks [20].

Group 4: Ablation Studies
- Ablation studies indicate that removing semantic entity localization significantly reduces performance, while omitting the temporal tracking stage decreases the average success length [26].
- The full ViSA-Flow model achieved an 89.0% task-completion success rate, showcasing its robustness [21].

Group 5: Real-World Experiments
- Real-world evaluations of ViSA-Flow covered single-stage and long-horizon manipulation tasks, demonstrating that it maintains performance across varying task complexities [23][30].
- The model's focus on the operator and task-relevant objects allows the spatial support to shift smoothly as the scene changes [31].

Group 6: Technical Advantages and Limitations
- Advantages include data efficiency, cross-domain generalization, long-horizon stability, and semantic consistency during task execution [40].
- Limitations include the absence of explicit 3D geometric modeling, reliance on pre-trained components, and potential difficulty with tasks requiring precise physical interaction [40].

Group 7: Future Directions
- Future developments may include integrating physical modeling, reducing reliance on pre-trained components, combining the framework with reinforcement learning algorithms, and expanding the pre-training datasets [40].

Group 8: Significance and Outlook
- ViSA-Flow represents a significant breakthrough in robot learning, proving the feasibility of extracting semantic representations from large-scale human videos for skill acquisition [36].
- The framework bridges the gap between observing human demonstrations and robot execution, paving the way for more intelligent and efficient robot learning systems [37].
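The summary names ViSA-Flow's components but not how they are implemented. Below is a minimal structural sketch of the idea, with locate_entities standing in for a pre-trained vision-language grounding model and the "flow" feature reduced to box-center displacements; all function names and the feature design are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class EntityTrack:
    name: str                  # e.g. "hand" or "red mug"
    boxes: List[np.ndarray]    # one (x1, y1, x2, y2) box per frame

def locate_entities(frame: np.ndarray, prompts: List[str]) -> List[np.ndarray]:
    """Stand-in for a pre-trained vision-language grounding model that returns one
    bounding box per text prompt (the real system would call such a model here)."""
    rng = np.random.default_rng(int(frame.sum()) % 2**32)
    return [rng.uniform(0, 100, size=4) for _ in prompts]

def semantic_action_flow(frames: List[np.ndarray], prompts: List[str]) -> np.ndarray:
    """Sketch of a ViSA-Flow-style intermediate representation: track the manipulator
    and task-relevant objects across frames and summarize their motion, ignoring
    background appearance."""
    tracks = [EntityTrack(p, []) for p in prompts]
    for frame in frames:
        boxes = locate_entities(frame, prompts)       # 1. semantic entity localization
        for track, box in zip(tracks, boxes):
            track.boxes.append(box)                   # 2. per-entity tracking across frames
    feats = []
    for track in tracks:                              # 3. a crude "flow" feature:
        centers = np.stack([(b[:2] + b[2:]) / 2 for b in track.boxes])
        feats.append(np.diff(centers, axis=0))        #    box-center displacement per step
    return np.concatenate(feats, axis=-1)             # conditioning signal for the policy

# Toy usage: 5 random frames, tracking the hand and one task object.
frames = [np.random.rand(64, 64, 3) for _ in range(5)]
print(semantic_action_flow(frames, ["hand", "red mug"]).shape)   # (4, 4)
```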
The "Next-Token" Paradigm Is Changing! Reinforcement Learning Pre-Training Has Just Arrived
机器之心· 2025-06-11 03:54
Core Viewpoint
- The article discusses the emerging importance of Reinforcement Learning (RL) in enhancing AI model capabilities, particularly through a new paradigm called Reinforcement Pre-Training (RPT), which redefines next-token prediction as a reasoning task [3][10][24].

Summary by Sections

Introduction
- Yann LeCun previously viewed reinforcement learning as a minor component in AI, but its significance for enhancing models is growing [3].

RPT Overview
- RPT transforms the next-token prediction task into a reasoning process, allowing models to receive verifiable rewards for correct predictions (a toy sketch of this reward rule follows this summary) [6][25].
- This method leverages vast amounts of unannotated text data for general reinforcement learning without requiring domain-specific labeled answers [9][26].

Advantages of RPT
- RPT offers inherent scalability and generality by utilizing large unannotated datasets for training [28].
- It minimizes the risk of reward hacking by using direct, rule-based reward signals [29].
- The internal reasoning process during pre-training allows for deeper understanding and generalization beyond mere token memorization [30].
- RPT enhances prediction accuracy by allocating more computational resources to each prediction step [31].

Experimental Results
- RPT outperforms baseline methods in next-token prediction accuracy across various difficulty levels [40][41].
- The performance of RPT-14B is comparable to that of larger models, indicating its effectiveness in capturing complex reasoning signals [43].
- RPT's accuracy improves reliably with increased training computation, demonstrating its scaling characteristics [45].
- Models pre-trained with RPT achieve higher performance ceilings when further trained with RLVR, showcasing the transfer of learned reasoning patterns to downstream tasks [47].

Zero-Shot Performance
- RPT-14B consistently surpasses R1-Distill-Qwen-14B across all benchmark tests, even outperforming larger models in next-token prediction [49].

Reasoning Mode Analysis
- The reasoning process of RPT-14B differs qualitatively from that of R1-Distill-Qwen-14B, indicating a more thoughtful approach rather than simple pattern matching [51].
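The excerpt states RPT's core mechanism, having the model reason before committing to a next-token guess that is scored with a rule-based, verifiable reward, but gives no implementation details. The toy sketch below shows just that reward rule over a corpus, with generate_with_reasoning standing in for a hypothetical model call that returns a reasoning trace plus a committed guess.

```python
from typing import Callable, List, Tuple

def rpt_rewards(
    tokens: List[str],
    generate_with_reasoning: Callable[[List[str]], Tuple[str, str]],
) -> List[float]:
    """For each prefix of the corpus, ask the model to reason and then commit to a
    next-token guess; the reward is rule-based and verifiable: 1.0 if the guess equals
    the corpus's actual next token, 0.0 otherwise."""
    rewards = []
    for i in range(1, len(tokens)):
        prefix, target = tokens[:i], tokens[i]
        _reasoning, guess = generate_with_reasoning(prefix)  # hypothetical model call
        rewards.append(1.0 if guess == target else 0.0)
    return rewards

# Toy usage with a stand-in "model" that always guesses "the".
corpus = "the cat sat on the mat".split()
print(rpt_rewards(corpus, lambda prefix: ("...reasoning...", "the")))  # [0.0, 0.0, 0.0, 1.0, 0.0]
```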
Mistral's First Strong Reasoning Model: Embracing Open Source, with 10x Faster Inference
机器之心· 2025-06-11 03:54
Core Viewpoint
- Mistral AI has launched a new series of large language models (LLMs) named Magistral, showcasing strong reasoning capabilities and the ability to tackle complex tasks [4].

Group 1: Model Overview
- The launch includes two versions: Magistral Medium, a proprietary model for enterprise clients, and Magistral Small, an open-source version with 24 billion parameters [5].
- The open-source version is available under the Apache 2.0 license, allowing free use and commercialization [5].

Group 2: Performance Metrics
- In benchmark tests, Magistral Medium scored 73.6% on AIME2024, rising to 90% with majority voting over 64 samples [6].
- Magistral Small achieved 70.7% and 83.3% on the same measures [6].
- The model also excelled on demanding tests such as GPQA Diamond and LiveCodeBench [7].

Group 3: Technical Features
- Magistral Medium demonstrates programming capabilities, such as generating code to simulate gravity and friction [10].
- The model maintains high-fidelity reasoning across multiple languages, including English, French, Spanish, German, Italian, Arabic, Russian, and Chinese [11].
- With Flash Answers in Le Chat, Magistral Medium can achieve up to 10 times the token throughput of most competitors, enabling large-scale real-time reasoning and user feedback [14].

Group 4: Learning Methodology
- Mistral employs a proprietary, scalable reinforcement learning pipeline, relying on its own models and infrastructure rather than existing implementations [15].
- The model's design principle is to reason in the same language as the user, minimizing code-switching and enhancing performance on reasoning tasks [16][17].

Group 5: Market Positioning
- Magistral Medium is being integrated into major cloud platforms, starting with Amazon SageMaker, with Azure AI, IBM WatsonX, and Google Cloud Marketplace to follow [20].
- Pricing is set at $2 per million input tokens and $5 per million output tokens, significantly higher than the previous Mistral Medium 3, which charged $0.4 and $2 respectively (a quick cost comparison follows this summary) [21].
- Despite the price increase, Magistral Medium's pricing remains competitive: cheaper than OpenAI's latest models and on par with Gemini 2.5 Pro [22].
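To make the pricing gap concrete, here is a quick cost calculation at the per-million-token rates quoted above; the example workload is an arbitrary assumption.

```python
def cost_usd(input_tokens: int, output_tokens: int, in_rate: float, out_rate: float) -> float:
    """Cost in USD given per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Assumed workload: 10M input tokens and 2M output tokens.
print(cost_usd(10_000_000, 2_000_000, 2.0, 5.0))   # Magistral Medium: 30.0
print(cost_usd(10_000_000, 2_000_000, 0.4, 2.0))   # Mistral Medium 3:  8.0
```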
Just Now: OpenAI Officially Releases o3-pro! An Excited Altman Updates His Blog: "The Gentle Singularity"
机器之心· 2025-06-11 00:24
Core Insights
- OpenAI has launched o3-pro, a new model that reportedly shows significant improvements over its predecessor, o3, particularly in science, education, programming, data analysis, and writing [5][9][22].

Performance Evaluation
- Benchmark results indicate that o3-pro has a clear advantage over o3, with higher ratings in clarity, comprehensiveness, instruction adherence, and accuracy [9][11].
- The model has been evaluated under a strict "4/4 reliability" criterion, demonstrating outstanding performance [11][13].
- On the ARC-AGI semi-private evaluation set, o3-pro's performance was similar to o3's, but at a higher cost [14].

Features and Capabilities
- o3-pro supports both text and image input modalities, with a 200k context window and a maximum output of 100k tokens [18].
- The model's knowledge cutoff is June 1, 2024, meaning it lacks information from the past year but can use tools to gather additional context [18].
- API pricing for o3-pro is set at $20 per million input tokens and $80 per million output tokens, which is 87% cheaper than o1-pro but still considered expensive [22].

User Feedback
- Early user tests show that o3-pro is faster and more accurate than previous models, with notable improvements on programming tasks [29][34].
- Some users expressed disappointment, indicating that not all expectations were met [37].

Future Outlook
- Sam Altman's blog post discusses the potential of AI to significantly enhance productivity and scientific progress, suggesting that the future may hold unprecedented advancements [40][44].
- The blog emphasizes the importance of making superintelligence widely accessible and affordable, while also addressing the need for societal discussion of the implications of such technology [59][60].
Spatio-Temporal Compression! Cambridge Proposes the MTLA Attention Mechanism: 5x Faster Inference, GPU Memory Cut to 1/8
机器之心· 2025-06-11 00:24
Core Insights
- The article discusses the significance of the Transformer architecture in the context of large language models, emphasizing its irreplaceable role despite challenges related to computational complexity and efficiency [1][2][5].

Group 1: Transformer Architecture and Challenges
- The self-attention mechanism of the Transformer, while powerful at modeling long-range dependencies, suffers from quadratic computational complexity, which has motivated research into alternatives [1].
- During inference, the KV cache grows linearly with sequence length, becoming a critical efficiency bottleneck as model sizes increase [1][2].

Group 2: Innovations in KV Cache Management
- The MLA mechanism proposed by the DeepSeek team compresses the KV cache in a latent space, significantly improving inference efficiency, especially in low-resource scenarios [2][7].
- Multi-head Temporal Latent Attention (MTLA) combines temporal and latent-space compression, addressing the redundancy that builds up in the KV cache as sequences grow longer [2][9].

Group 3: Comparison of Attention Mechanisms
- Current models often use Grouped-Query Attention (GQA) to reduce KV cache size by letting groups of query heads share KV heads, striking a balance between efficiency and performance (a back-of-the-envelope cache-size comparison follows this summary) [5].
- MTLA outperforms existing methods such as GQA and MQA by compressing both the spatial and temporal dimensions of the KV cache while maintaining model performance [9][20].

Group 4: Performance and Future Potential
- MTLA demonstrates superior performance across various tasks, achieving more than 5x faster inference and cutting GPU memory usage to roughly one-eighth compared with standard MHA [20].
- The potential of MTLA for large-scale deployment is significant, especially as the demand for efficient KV cache management grows with increasing model sizes and sequence lengths [23][24].
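To make the bottleneck concrete, the back-of-the-envelope calculation below shows how large a plain multi-head-attention KV cache gets at long context and how much GQA saves by sharing KV heads; the model shape is an assumed example, and MTLA's additional temporal compression of the cache is not modeled here.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed model shape: 32 layers, 32 query heads of dim 128, fp16, 32k-token context.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_768)  # one KV head per query head
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=32_768)  # 4 query heads share each KV head

print(mha / 2**30, gqa / 2**30)  # ≈ 16.0 GiB vs ≈ 4.0 GiB for a single sequence
```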
A Full Rematch on the Gaokao Math Paper! One Problem Stumps Every Large Model; Newcomer Gemini Takes the Crown, with Doubao and DeepSeek Tied for Second
机器之心· 2025-06-10 17:56
Reported by 机器之心. Editors: 杨文, +0

AI takes on the full set of Gaokao math problems!

Picking up where we left off: as soon as the Gaokao math exam ended, we spent the night testing six large-model products on the 14 newest objective questions, posing them the way an ordinary user would, via screenshots. Some readers questioned the rigor of that evaluation, so this time we add the free-response questions and rerun the whole test.

The contestants this round are Doubao-1.5-thinking-vision-pro, DeepSeek R1, Qwen3-235b, hunyuan-t1-latest, 文心 X1 Turbo, and o3, plus the much-anticipated newcomer Gemini 2.5 Pro. Last time we tested through the web interfaces; this time every model except o3 is called through its API.

For the questions, we again use the 2025 New Curriculum Standard Paper I for mathematics: 14 objective questions worth 73 points in total and 5 free-response questions worth 77 points. Because Question 6 involves an image, we set it aside and evaluate the multimodal models on it separately by uploading a screenshot of the problem. All other text questions are converted to LaTeX and fed to each model one by one. As before, no system prompt is used, web search is disabled, and the models output their answers directly.

(Note: Question 17 also involves an image, but its text is clear enough that the image is not needed to answer it, so ...
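The article does not include its evaluation code; below is a minimal sketch of the stated protocol, the LaTeX-formatted question sent as the sole user message, with no system prompt and no web search, written against an OpenAI-compatible chat-completions client. The base URL, model id, and sample question are placeholders, not the vendors or exam items actually used.

```python
from openai import OpenAI

# Placeholder endpoint and key; each vendor's actual base URL, key, and model id differ.
client = OpenAI(base_url="https://example-provider/v1", api_key="YOUR_KEY")

# Stand-in question in LaTeX; the real run feeds each converted exam question separately.
question_latex = r"已知集合 $A=\{x \mid x^2-3x+2=0\}$,求集合 $A$ 的元素个数。"

resp = client.chat.completions.create(
    model="some-reasoning-model",                               # placeholder model id
    messages=[{"role": "user", "content": question_latex}],     # no system prompt, per the protocol
)
print(resp.choices[0].message.content)
```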