Workflow
机器之心
icon
Search documents
人均「95后」,账上超十亿美金,MiniMax叩响港股大门
机器之心· 2025-12-21 17:22
Core Viewpoint - The rapid pace of IPOs for AI startups, exemplified by MiniMax, highlights the growing significance and potential of artificial general intelligence (AGI) in the market [2][5]. Company Overview - MiniMax, founded in December 2021 and headquartered in Shanghai, focuses on developing multi-modal artificial intelligence technologies [4]. - The company is recognized for its foundational models MiniMax M1 and M2, as well as AI-native products like Hailuo AI and Xingye [4]. Market Position and User Base - MiniMax is poised to set a record as the fastest AI company from establishment to IPO [5]. - The company has over 2.12 billion personal users across more than 200 countries and regions, with over 100,000 enterprises and developers [9]. - The average monthly active users of AI-native products surged from 3.14 million in 2023 to 27.62 million in the first nine months of 2025, indicating strong user engagement [9]. Financial Performance - For the first nine months of 2025, MiniMax reported revenues of $53.44 million, a year-on-year increase of approximately 174.7% [9]. - The revenue primarily comes from AI-native product subscriptions and enterprise services, with a gross margin of 69.4% for B2B services [10]. - The adjusted net loss for the same period was $186 million, reflecting a slight increase of 8.6% despite significant revenue growth [22]. Research and Development - MiniMax has a strong focus on R&D, with expenditures reaching $180 million in the first nine months of 2025, accounting for 337.4% of total revenue [22]. - The company has made significant advancements in multi-modal AI technologies, launching several models that have gained international recognition [13][15]. Leadership and Organizational Structure - The executive board of MiniMax is notably young, with an average age of 32, reflecting the company's tech-driven and innovative culture [29][30]. - The leadership team is deeply involved in R&D and business operations, aligning with the company's focus on long-term technological investment and efficiency [31]. Future Plans - MiniMax plans to use approximately 70% of the funds raised from the IPO for R&D over the next five years, focusing on model upgrades and AI-native product development [34]. - The company aims to enhance social productivity and individual quality of life through its vision of "Intelligence with Everyone" [37].
遥遥无期的AGI是画大饼吗?两位教授「吵起来了」
机器之心· 2025-12-21 04:21
Core Viewpoint - The article discusses the limitations of achieving Artificial General Intelligence (AGI) due to physical and resource constraints, emphasizing that scaling alone is not sufficient for significant advancements in AI [3][20][32]. Group 1: Limitations of AGI - Tim Dettmers argues that AGI will not happen because computation is fundamentally physical, and there are inherent limitations in hardware improvements and scaling laws [8][10][12]. - The article highlights that as transistor sizes shrink, while computation becomes cheaper, memory access becomes increasingly expensive, leading to inefficiencies in processing power [11][17]. - The concept of "superintelligence" is critiqued as a flawed notion, suggesting that improvements in intelligence require substantial resources, and thus, any advancements will be gradual rather than explosive [28][29][30]. Group 2: Hardware and Scaling Challenges - The article points out that GPU advancements have plateaued, with significant improvements in performance per cost ceasing around 2018, leading to diminishing returns on hardware investments [16][17]. - Scaling AI models has become increasingly costly, with the need for linear improvements requiring exponential resource investments, indicating a nearing physical limit to scaling benefits [20][22]. - The efficiency of current AI infrastructure is heavily reliant on large user bases to justify the costs of deployment, which poses risks for smaller players in the market [21][22]. Group 3: Divergent Approaches in AI Development - The article contrasts the U.S. approach of "winner-takes-all" in AI development with China's focus on practical applications and productivity enhancements, suggesting that the latter may be more sustainable in the long run [23][24]. - It emphasizes that the core value of AI lies in its utility and productivity enhancement rather than merely achieving higher model capabilities [24][25]. Group 4: Future Directions and Opportunities - Despite the challenges, the article suggests that there are still significant opportunities for improvement in AI systems through better hardware utilization and innovative model designs [39][45][67]. - It highlights the potential for advancements in training efficiency and inference optimization, indicating that current models are not yet fully optimized for existing hardware capabilities [41][43][46]. - The article concludes that the path to more capable AI systems is not singular, and multiple avenues exist for achieving substantial improvements in performance and utility [66][69].
震撼,英伟达新模型能打遍几乎所有游戏
机器之心· 2025-12-21 04:21
Core Viewpoint - The article introduces Nvidia's latest open-source model, NitroGen, which is capable of playing over 1,000 different games using AI-generated controls, showcasing significant advancements in gaming automation and cross-game adaptability [5][6][8]. Group 1: Model Overview - NitroGen is designed to play a wide variety of game genres, including RPGs, platformers, and racing games, by directly processing game video frames to generate controller signals [6][8]. - The model supports fine-tuning for new games, allowing it to adapt quickly without starting from scratch, demonstrating its potential for cross-game generalization [8]. - The architecture of NitroGen is based on the GR00T N1.5 framework, which was originally designed for robotics but has been adapted for gaming applications with minimal modifications [12]. Group 2: Key Components - NitroGen consists of three core components: a multi-game intelligent agent, a universal simulator, and a large-scale dataset of gaming videos [15][16][17]. - The multi-game intelligent agent can generate controller commands from game observations, enabling zero-shot gameplay across various titles [15]. - The universal simulator standardizes interactions across different games using the Gymnasium API, facilitating large-scale training and evaluation [16]. - The dataset comprises 40,000 hours of publicly available gaming videos, covering over 1,000 games, and includes automatically generated action labels [17][24]. Group 3: Data Collection and Processing - The data collection process involved extracting player actions from videos with "input overlays," which present real-time controller inputs [18][19]. - The research team utilized advanced techniques to match key points and segment the controller displays from the videos, ensuring the model learns without "cheating" [21]. - The dataset features a diverse distribution of game types, with action RPGs making up 34.9% of the total video duration, followed by platformers at 18.4% [26]. Group 4: Performance and Results - NitroGen has demonstrated strong performance across various game types, including 3D action games and 2D platformers, achieving non-trivial task completion rates [28][30]. - The model showed a significant improvement in task success rates when fine-tuned for new games, with up to a 52% relative increase compared to models trained from scratch [32]. - The research indicates that NitroGen is a foundational step towards creating general-purpose embodied agents capable of interacting with complex environments [35][36].
挑战WorldLabs:Visionary,一个全面超越Marble底层渲染器的WebGPU渲染平台
机器之心· 2025-12-21 04:21
Core Insights - The article discusses the development of Visionary, a new rendering platform that utilizes WebGPU and ONNX to enhance the visualization and interaction of World Models in web environments, overcoming limitations faced by previous technologies like SparkJS [2][10][27]. Group 1: Challenges in Current Technologies - The existing World Model visualization methods, particularly those relying on WebGL, face significant limitations in rendering dynamic and complex scenes due to CPU sorting bottlenecks [6][7][8]. - Current solutions like SparkJS are primarily designed for static or pre-computed Gaussian rendering, making them inadequate for real-time inference of dynamic 3D Gaussian Splatting (3DGS) and Neural Avatars [7][8]. Group 2: Visionary's Innovations - Visionary is positioned as a native web rendering substrate that integrates GPU computation and rendering directly into browsers, replacing the older WebGL framework [10][25]. - It introduces a Gaussian Generator Contract that standardizes the output of various 3DGS and 4DGS methods into ONNX format, allowing for dynamic generation and updating of Gaussian attributes in real-time [11][13]. Group 3: Performance and Quality Improvements - Experimental data indicates that Visionary significantly outperforms SparkJS in rendering efficiency, particularly in scenes with millions of Gaussian points, by shifting sorting and preprocessing tasks to the GPU [18][21]. - Visionary employs frame-by-frame GPU global sorting to eliminate visual artifacts seen in other solutions, ensuring accurate rendering of transparency even in complex multi-model scenarios [21][24]. Group 4: Applications and Future Directions - Visionary serves as a unified platform for researchers, creators, and industries, enabling quick reproduction and comparison of 3DGS variants, as well as facilitating editing and rendering directly in the browser [24][25]. - The development team views Visionary as a foundational step towards a comprehensive World Model framework, with future explorations planned in areas such as physical interaction enhancement and spatial intelligence [26][28].
相机运动误差降低40%!DualCamCtrl:给视频生成装上「深度相机」,让运镜更「听话」
机器之心· 2025-12-21 04:21
本研究的共同第一作者是来自于香港科技大学(广州)EnVision Research 的张鸿飞(研究助理)和陈康豪(博士研究生),两位研究者均师从陈颖聪教 授。 你的生成模型真的「懂几何」吗?还是只是在假装对齐相机轨迹? 当前众多视频生成模型虽宣称具备「相机运动控制」能力,但其控制信号通常仅依赖于相机位姿。虽近期工作通过逐像素射线方向(Ray Condition)编码 了运动信息,但由于模型仍需隐式推断三维结构,本质上仍缺乏对场景的显式几何理解。这一局限性导致了相机运动的不一致——模型受限于外观与结构两 种表征信息的耦合,无法充分捕捉场景的底层几何特征。 鉴于上述挑战, 来自香港科技大学、复旦大学等机构的研究团队提出了一种全新的端到端几何感知扩散模型框架 DualCamCtrl 。 该研究针对现有方法在 场景理解与几何感知方面的不足,创新性地设计了一个「双分支扩散架构」,能够同步生成与镜头运动一致的 RGB 与深度序列。进一步地,为实现 RGB 与深度两种模态的高效协同,DualCamCtrl 提出了语义引导互对齐机制(Semantic Guided Mutual Alignment),该机制以语义信息为指导, ...
近两百万人围观的Karpathy年终大语言模型清单,主角是它们
机器之心· 2025-12-21 03:01
Core Insights - 2025 is a pivotal year for the evolution of large language models (LLMs), marked by significant paradigm shifts and advancements in the field [2][36] - The emergence of Reinforcement Learning from Verifiable Rewards (RLVR) is transforming LLM training processes, leading to enhanced capabilities without necessarily increasing model size [10][11] - The industry is witnessing a new layer of LLM applications, exemplified by tools like Cursor, which organize and deploy LLM capabilities in specific verticals [16][17] Group 1: Reinforcement Learning and Model Training - The introduction of RLVR allows models to learn in verifiable environments, enhancing their problem-solving strategies through self-optimization [10] - The majority of capability improvements in 2025 stem from extended RL training rather than increased model size, indicating a new scaling law [11][12] - OpenAI's models, such as o1 and o3, exemplify the practical application of RLVR, showcasing a significant qualitative leap in performance [12] Group 2: Understanding LLM Intelligence - The industry is beginning to grasp the unique nature of LLM intelligence, which differs fundamentally from human intelligence, leading to a jagged distribution of capabilities [14][15] - The concept of "vibe coding" emerges, allowing non-engineers to create complex programs, thus democratizing programming and reshaping software development roles [25][29] - The introduction of tools like Claude Code signifies a shift towards LLM agents that can operate locally, enhancing user interaction and productivity [19][22] Group 3: User Interaction and GUI Development - The development of GUI applications like Google Gemini's "Nano Banana" indicates a trend towards more intuitive and visually engaging interactions with LLMs [31][34] - The integration of text, images, and knowledge within a single model represents a significant advancement in how LLMs can communicate and operate [34] - The industry is at the cusp of a new interaction paradigm, moving beyond traditional web-based AI to more integrated and user-friendly applications [23][30] Group 4: Future Outlook - The potential of LLMs remains largely untapped, with the industry only beginning to explore their capabilities [38][39] - Continuous and rapid advancements are expected, alongside the recognition of the extensive work still required to fully realize the potential of LLM technology [40][41]
AI一旦开始「内卷」,会变成什么样?腾讯混元和上交联合揭秘多智能体「饥饿游戏」
机器之心· 2025-12-21 03:01
在多智能体系统的想象中,我们常常看到这样一幅图景: 多个 AI 智能体分工协作、彼此配合,像一个高效团队一样攻克复杂任务,展现出超越单体智能的 "集体智慧"。 但一个关键问题常常被忽略: 当这些智能体不再只是 "同事",而是被迫变成 "竞品",甚至是 "对手",会发生什么? 腾讯混元数字人团队与上海交通大学的最新研究,给出了一个颇为刺眼的回答: 当面临极端竞争压力时,LLM 多智能体系统会出现严重的 "过度竞争" 行为,沉迷互踩、内卷和博弈,直接拖垮整体任务表现。 换句话说, 当我们把 AI 扔进一场 "饥饿游戏" ,它们会开始变坏。 论文链接:https://arxiv.org/abs/2509.26126 项目地址:https://github.com/Tencent/DigitalHuman/tree/main/HATE 「饥饿游戏」式辩论: 只有一个能活下来 这项研究设计了一个高风险、零和博弈的辩论环境,让智能体在 "合作完成任务" 与 "避免被淘汰" 之间做出选择。 为了让竞争足够残酷,系统给每个智能体植入了清晰的 "生存本能" 提示: 只会有一名胜者,其余全部被移除。 整个框架可以理解为一场 AI ...
从 Gen0 的精细操作到 RTC 的持续工作,具身智能 Just needs execution?
机器之心· 2025-12-21 01:30
Group 1 - The article discusses the advancements in embodied intelligence, highlighting the need for execution in humanoid robots to effectively serve humans, despite significant training hours and scaling laws [1][5] - It emphasizes the rapid improvement in humanoid robots' capabilities, such as parkour, dancing, and basketball, while noting the lack of real-world deployment in service roles [6][7] - The article mentions that the number of humanoid robot companies and funding is increasing, but skepticism remains regarding their market integration [6][7] Group 2 - Morgan Stanley estimates that by 2050, the number of humanoid robots could exceed 1 billion, creating a market valued at $5 trillion, although achieving this goal is uncertain [7] - The article points out that the future focus may shift towards deploying fewer robots capable of performing multiple tasks rather than many robots for single tasks [8] - Despite challenges in large-scale commercial deployment, significant technical progress has been made in areas such as fine manipulation, long-range tasks, and continuous operation [8][9] Group 3 - The article highlights the achievements in fine manipulation, with DexterityGen demonstrating a 10-100 times improvement in stability for robotic hands using reinforcement learning [9] - The Generalist AI Gen0 model, trained for 270,000 hours, showcases a wide range of operational skills applicable across different robotic platforms [9]
人人都是导演:CineCtrl首个实现视频生成中的相机运镜与摄影效果统一控制
机器之心· 2025-12-20 07:00
论文名称 :Generative Photographic Control for Scene-Consistent Video Cinematic Editing 论文链接 : https://arxiv.org/abs/2511.12921 项目主页 : https://huiqiang-sun.github.io/cinectrl/ 开源代码 : https://github.com/huiqiang-sun/CineCtrl 图 1 CineCtrl 摄影效果与相机运动的精细控制 背景 仅凭一段普通视频,能否像专业导演一样,在后期随意改变相机轨迹,同时精细调整变焦、光圈散景、曝光度甚至图像色温? 现有视频生成模型往往难以兼顾「运镜」与「摄影美学」的精确控制。为此,华中科技大学、南洋理工大学、商汤科技和上海人工智能实验室团队推出了 CineCtrl。作为首个统一的视频摄影控制 V2V 框架,CineCtrl 通过解耦交叉注意力机制,摆脱了多控制信号共同控制的效果耦合问题,实现了对视频相机 外参轨迹与摄影效果的独立、精细、协调控制。 为了便于用户进行更加直观的摄影效果控制,CineCtrl 将控制信号归一 ...
LeCun的JEPA已进化为视觉-语言模型,1.6B参数比肩72B Qwen-VL
机器之心· 2025-12-20 07:00
Core Insights - The article discusses the advancements in the Joint Embedding Predictive Architecture (JEPA) with the introduction of the visual-language model VL-JEPA, developed by a collaborative team from Meta, Hong Kong University of Science and Technology, Sorbonne University, and New York University [2][3]. Group 1: Model Overview - VL-JEPA is the first non-generative model based on the joint embedding predictive architecture that can perform general domain visual-language tasks in real-time [3]. - Unlike traditional visual-language models (VLMs) that generate tokens in an autoregressive manner, VL-JEPA predicts continuous embeddings of target text, focusing on task-relevant semantics while ignoring superficial language variations [4][13]. Group 2: Model Efficiency - The model transforms expensive token generation learning into more efficient latent space semantic prediction, which simplifies the target distribution and enhances the learning process [11][16]. - VL-JEPA can produce continuous target semantic embedding streams with very low latency due to its non-autoregressive nature, making it particularly beneficial for real-time applications like action tracking and scene recognition [17]. Group 3: Performance Comparison - In a comparative study, VL-JEPA demonstrated consistent higher performance in zero-shot description generation and classification while using approximately half the trainable parameters compared to traditional token-generating VLMs, indicating improved learning efficiency [20]. - The selective decoding strategy implemented in VL-JEPA reduced the number of decoding operations by about 2.85 times while maintaining overall output quality as measured by average CIDEr scores [22]. Group 4: Training Phases and Results - The VL-JEPA model undergoes two training phases, with the first phase producing VL-JEPA_BASE, which outperformed models like CLIP and SigLIP2 in average classification accuracy and retrieval recall across eight datasets [23][24]. - The second phase, which involves domain-specific training data, significantly enhances the classification performance of the model, resulting in VL-JEPA_SFT, which approaches the performance of specialized models [25][28]. Group 5: Application and Demonstration - The article includes demonstrations of VL-JEPA's capabilities, such as real-time robot state tracking, showcasing its practical applications in various fields [29].