机器之心
From Gen0's fine manipulation to RTC's continuous operation: does embodied intelligence just need execution?
机器之心· 2025-12-21 01:30
Group 1
- The article discusses advances in embodied intelligence, arguing that despite enormous training hours and favorable scaling laws, humanoid robots still need reliable execution before they can effectively serve humans [1][5]
- It notes the rapid improvement in humanoid robots' capabilities, such as parkour, dancing, and basketball, while pointing out the lack of real-world deployment in service roles [6][7]
- The number of humanoid robot companies and the funding flowing into them keep growing, but skepticism remains regarding their market integration [6][7]

Group 2
- Morgan Stanley estimates that by 2050 the number of humanoid robots could exceed 1 billion, creating a market valued at $5 trillion, although achieving this goal is uncertain [7]
- The article points out that the future focus may shift toward deploying fewer robots capable of performing multiple tasks rather than many robots each limited to a single task [8]
- Despite the challenges of large-scale commercial deployment, significant technical progress has been made in areas such as fine manipulation, long-horizon tasks, and continuous operation [8][9]

Group 3
- The article highlights achievements in fine manipulation, with DexterityGen demonstrating a 10-100x improvement in stability for robotic hands using reinforcement learning [9]
- The Generalist AI Gen0 model, trained on 270,000 hours of data, showcases a wide range of manipulation skills applicable across different robotic platforms [9]
Everyone is a director: CineCtrl is the first to unify control of camera motion and photographic effects in video generation
机器之心· 2025-12-20 07:00
Paper: Generative Photographic Control for Scene-Consistent Video Cinematic Editing
Paper link: https://arxiv.org/abs/2511.12921
Project page: https://huiqiang-sun.github.io/cinectrl/
Code: https://github.com/huiqiang-sun/CineCtrl

Figure 1: Fine-grained control of photographic effects and camera motion with CineCtrl

Background

Given only an ordinary video, can you, like a professional director, freely change the camera trajectory in post-production while finely adjusting zoom, aperture bokeh, exposure, and even color temperature? Existing video generation models struggle to provide precise control over camera movement and photographic aesthetics at the same time. To address this, a team from Huazhong University of Science and Technology, Nanyang Technological University, SenseTime, and Shanghai AI Laboratory has introduced CineCtrl. As the first unified video-to-video (V2V) framework for cinematographic control, CineCtrl uses a decoupled cross-attention mechanism to remove the coupling of effects that arises when multiple control signals act jointly, achieving independent, fine-grained, and coordinated control over the camera's extrinsic trajectory and photographic effects. To make photographic-effect control more intuitive for users, CineCtrl normalizes the control signals ...
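To illustrate the decoupled cross-attention idea, here is a minimal sketch, assuming two parallel cross-attention branches whose outputs are added back to the video tokens; the module layout, dimensions, and token counts are assumptions for illustration, not CineCtrl's actual implementation.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttentionBlock(nn.Module):
    """Sketch: camera-trajectory tokens and photographic-effect tokens condition
    the video latents through separate cross-attention branches, so the two
    control signals do not interfere inside a single shared attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.camera_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.photo_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, camera_tokens, photo_tokens):
        # Each branch attends from the video latents to one control signal only.
        cam_out, _ = self.camera_attn(video_tokens, camera_tokens, camera_tokens)
        pho_out, _ = self.photo_attn(video_tokens, photo_tokens, photo_tokens)
        return self.norm(video_tokens + cam_out + pho_out)

# Toy shapes: 2 clips, 16 video tokens, 4 camera-pose tokens, 3 effect tokens.
x, cam, pho = torch.randn(2, 16, 512), torch.randn(2, 4, 512), torch.randn(2, 3, 512)
print(DecoupledCrossAttentionBlock()(x, cam, pho).shape)  # torch.Size([2, 16, 512])
```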
LeCun's JEPA has evolved into a vision-language model: 1.6B parameters rivaling the 72B Qwen-VL
机器之心· 2025-12-20 07:00
Editor: Panda

LeCun's Joint Embedding Predictive Architecture (JEPA) has taken a new step forward. Recently, a joint team from Meta, the Hong Kong University of Science and Technology, Sorbonne University, and New York University built a vision-language model on top of JEPA: VL-JEPA. According to author Pascale Fung, VL-JEPA is the first non-generative model based on a joint embedding predictive architecture that can perform general-domain vision-language tasks in real time. Unlike conventional vision-language models (VLMs), which generate tokens autoregressively, VL-JEPA predicts continuous embeddings of the target text. By learning in an abstract representation space, the model can focus on task-relevant semantics while ignoring the variability of surface linguistic form.

Paper title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Paper link: https://arxiv.org/abs/2512.10942

The paper has four co-first authors: Delong Chen, Mustafa Shukor, Théo Moutakanni, and Willy Chung. Yann LeCun, the proposer of JEPA and a Turing Award laureate, is also on the author list. Current V ...
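The contrast with autoregressive decoding can be made concrete with a minimal sketch, assuming a pooled vision feature is mapped to a single continuous text embedding and trained against embeddings from a frozen text encoder; the dimensions, pooling, and cosine loss below are illustrative assumptions, not VL-JEPA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingPredictor(nn.Module):
    """Non-generative sketch: predict one continuous embedding of the target
    text instead of decoding it token by token."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, 1024), nn.GELU(), nn.Linear(1024, text_dim)
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        pooled = vision_features.mean(dim=1)          # pool patch features
        return F.normalize(self.proj(pooled), dim=-1)

def embedding_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Regression in embedding space (cosine distance) rather than
    # token-level cross-entropy as in autoregressive VLMs.
    target = F.normalize(target, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Toy batch: 4 images (196 patch features each); targets from a frozen text encoder.
pred = EmbeddingPredictor()(torch.randn(4, 196, 768))
print(embedding_loss(pred, torch.randn(4, 512)).item())
```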
Anthropic unveils a new technique: removing dangerous AI capabilities via parameter isolation rather than data deletion
机器之心· 2025-12-20 04:45
机器之心 editorial team

In recent years, the capabilities of large language models have advanced rapidly, but with them have come increasingly thorny dual-use risks. When a model learns from massive amounts of public internet data, it acquires not only language and reasoning skills but also, inevitably, exposure to highly sensitive and potentially dangerous knowledge domains such as CBRN (chemical, biological, radiological, nuclear) weapons production and software vulnerability exploitation.

To counter this, researchers typically add safety measures such as refusal mechanisms during post-training, hoping to block misuse of these capabilities. In practice, however, these defenses do not hold up against attackers who deliberately work around them: the model's very strength leaves it in a delicate, fragile balance between being protected and being bypassed.

This has pushed researchers to explore interventions at the pretraining stage, preventing the model from acquiring dangerous capabilities at the source. The current standard practice is data filtering: identifying and removing harmful content before training. This approach faces several challenges, which together force an unavoidable trade-off: either accept dangerous content, or lose a great deal of valuable general knowledge through over-aggressive cleaning.

To address this, Anthropic proposes SGTM (Selective Gradient Masking), a fundamentally different paradigm: rather than trying to perfectly classify and remove dangerous data before training, it localizes dangerous knowledge into dedicated parameter regions of the model during training.

Method

SGTM builds on Gradient Rout ...
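The core mechanism can be sketched in a few lines, assuming a toy model whose parameters are split into a "general" block and a reserved "isolated" block, with a per-batch flag marking dangerous data; this is an illustrative approximation of gradient masking, not Anthropic's actual SGTM implementation.

```python
import torch
import torch.nn as nn

class SplitLinear(nn.Module):
    """Toy layer with a general weight block and a reserved isolated block."""

    def __init__(self, dim: int = 64, isolated_dim: int = 8):
        super().__init__()
        self.general = nn.Linear(dim, dim - isolated_dim)
        self.isolated = nn.Linear(dim, isolated_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.general(x), self.isolated(x)], dim=-1)

def masked_step(model, optimizer, x, y, loss_fn, flagged: bool) -> float:
    """On flagged (dangerous) batches, zero the gradients of the general block so
    the update only touches the isolated parameters; clean batches update everything."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if flagged:
        for p in model.general.parameters():
            if p.grad is not None:
                p.grad.zero_()
    optimizer.step()
    return loss.item()

model = SplitLinear()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
masked_step(model, opt, torch.randn(16, 64), torch.randn(16, 64), nn.MSELoss(), flagged=True)
```

Trained this way, the dangerous capability is (ideally) concentrated in the isolated block, which can then be ablated or withheld at deployment; how cleanly that separation holds is exactly what such a method must demonstrate empirically.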
Layout control + identity consistency: Zhejiang University's ContextGen sets a new SOTA for layout-anchored multi-instance generation
机器之心· 2025-12-20 04:45
As diffusion models have continued to evolve, image generation has matured considerably. Yet in multi-instance image generation (MIG), a key area with a large number of user scenarios, existing methods still face a core bottleneck: how to achieve spatial layout control over multiple objects and good identity preservation for each of them at the same time.

Mainstream methods rarely manage both: layout-to-image models guided by text and layout struggle to deliver a high degree of per-instance customization and frequently suffer from missing instances and attribute leakage, while mainstream subject-driven methods run into severe identity confusion and loss of detail as the number of subjects grows.

Comparison examples between ContextGen and mainstream SOTA methods, along with usage examples of ContextGen.

To tackle this obstacle to highly customized image generation, the ReLER team at Zhejiang University has released ContextGen, a new framework built on the Diffusion Transformer (DiT) that aims to reliably accomplish image-guided multi-instance generation through in-context learning. ContextGen introduces an entirely new context ...
Playing until breakdown: the "Green Pepper Simulator" game goes viral, and I leveled up all the way to academician in an AI world
机器之心· 2025-12-20 04:45
Core Viewpoint
- The article discusses the sudden popularity of a game called "Green Pepper Simulator," which simulates the academic career path of a university lecturer, reflecting both the challenges and absurdities of academic life [2][18]

Group 1: Game Overview
- "Green Pepper Simulator" progresses through academic years, where players start with limited resources and must make decisions that affect their career ratings [2]
- Players can experience various outcomes, from failing to pass evaluations to achieving prestigious titles like professor or even Nobel Prize winner [3][13]

Group 2: Player Experiences
- Some players have shared their experiences, highlighting the game's realistic portrayal of academic pressures, such as managing student projects and publishing papers [10][11]
- The game has sparked discussions among players, with some providing tips for success, such as focusing on student recruitment and strategic project applications [18][19]

Group 3: Development and Features
- The game was developed as a side project by independent developers, utilizing advanced models for enhanced gameplay [7][8]
- Players are randomly assigned identities and must navigate a simulated academic environment, including applying for projects and managing student interactions [22][29]
From the "Gold Brick Theory" to "The Messy Inbox": how an a16z partner sees moats in the AI era
机器之心· 2025-12-20 02:30
Group 1
- The core argument of the article is that software is transitioning from being an "auxiliary tool" to an "executive entity," marking a paradigm shift in its commercial attributes [4][7][12]
- In the past, software was strictly defined as a tool dependent on human operation, with its value released only through human input [4][5]
- The emergence of AI has transformed software into a digital workforce capable of independent task execution, thus changing how businesses evaluate software value [7][8][11]

Group 2
- The traditional pricing model based on per-user subscriptions is becoming obsolete, necessitating a fundamental adjustment in monetization strategies for entrepreneurs [12][13]
- The proposed "Goldilocks Zone" pricing strategy aims to find an optimal arbitrage space between software costs and human labor costs, ensuring pricing is significantly lower than hiring real employees while still being higher than traditional software subscription fees [15][16][17]
- Entrepreneurs are advised to leverage the "Gold Brick Theory" to identify structural gaps that giants strategically overlook, shifting the focus from homogeneous model capabilities to deep understanding of specific industry contexts [18]
Do large models "get more wrong the more they think"? A Renmin University of China & Tencent team uses information theory to show when to think and when not to
机器之心· 2025-12-19 06:38
Core Insights
- The article discusses the inefficiencies in the reasoning capabilities of large models, highlighting the need for a more effective approach to reasoning in AI systems [4][10][46]
- The proposed solution, Adaptive Think, allows models to automatically stop reasoning when they reach a sufficient level of confidence, thus improving efficiency and accuracy (a minimal sketch of this stopping rule follows this summary) [7][28][45]

Group 1: Inefficiencies in Current Models
- Current large models exhibit a tendency to overthink, leading to longer reasoning chains that often introduce noise and decrease accuracy [3][19]
- Research indicates that longer reasoning chains do not necessarily yield better results, as they can lead to diminishing returns and increased computational costs [19][20][36]
- The study employs information-theoretic metrics such as entropy and mutual information to evaluate the reasoning efficiency of models [6][12]

Group 2: Adaptive Think Mechanism
- The Adaptive Think strategy enables models to self-monitor their reasoning process, terminating when confidence is sufficiently high [28][29]
- Experimental results show that Adaptive Think significantly reduces token consumption while maintaining or improving accuracy across various tasks [33][36]
- The mechanism allows for dynamic adjustment of reasoning depth based on task difficulty, enhancing both speed and reliability [31][45]

Group 3: Experimental Findings
- In tests on the GSM8K dataset, Adaptive Think reduced average token usage by over 40% while improving accuracy by 0.93% compared to traditional methods [33]
- The approach demonstrated effectiveness across multiple reasoning tasks, with notable improvements in efficiency for common-sense reasoning tasks [36][37]
- The findings suggest that many models can achieve correct answers with fewer reasoning steps, challenging the notion that longer reasoning is inherently better [38][46]
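A minimal sketch of a confidence-gated stopping rule of this kind, assuming confidence is read off as the entropy of the model's current answer distribution after each reasoning step; the threshold, step interface, and toy model below are illustrative assumptions, not the paper's actual Adaptive Think procedure.

```python
import torch
import torch.nn.functional as F

def answer_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the current answer distribution."""
    probs = F.softmax(logits, dim=-1)
    return float(-(probs * probs.clamp_min(1e-12).log()).sum())

def adaptive_stop(step_fn, max_steps: int = 8, entropy_threshold: float = 0.3):
    """Reason step by step, stopping early once the answer distribution is
    confident (low entropy) instead of always running to max_steps."""
    trace = []
    for step in range(max_steps):
        logits = step_fn(step)            # logits over candidate answers after this step
        h = answer_entropy(logits)
        trace.append((step, h))
        if h < entropy_threshold:         # confident enough -> stop thinking
            break
    return trace

# Toy stand-in for a model: evidence for one answer accumulates with each step.
def fake_step(step: int) -> torch.Tensor:
    logits = torch.zeros(4)
    logits[2] = 1.5 * (step + 1)
    return logits

for step, h in adaptive_stop(fake_step):
    print(f"step {step}: entropy = {h:.3f}")
```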
The Mamba authors' team proposes SonicMoE: a token-rounding trick speeds up MoE training by nearly 2x
机器之心· 2025-12-19 06:38
Core Insights
- The MoE (Mixture of Experts) architecture has become the standard way to scale language models without significantly increasing computational cost, with a clear trend toward higher expert granularity and sparsity, which improve model quality per unit of FLOPs [1][2]

MoE Model Trends
- Recent open-source models such as DeepSeek V3, Kimi K2, and Qwen3 MoE use finer-grained expert designs and higher sparsity, greatly increasing total parameter count while keeping the number of active parameters roughly fixed [1][2]
- A table of recent models shows varying parameter counts, expert activation ratios, and expert granularities; for example, Mixtral 8x22B is listed with 131 billion parameters and a 25% expert activation ratio [2]

Hardware Efficiency Challenges
- The pursuit of extreme granularity and sparsity in MoE designs causes significant hardware efficiency problems, which motivated SonicMoE, a solution tailored to NVIDIA Hopper and Blackwell architecture GPUs [3]
- SonicMoE shows clear performance advantages, with a 43% speedup in forward propagation and up to 115% in backward propagation compared with existing baselines [3]

Memory and IO Bottlenecks
- In fine-grained MoE models, activation memory grows linearly with the number of active experts, increasing memory pressure during forward and backward propagation [4]
- The lower arithmetic intensity of smaller, more dispersed experts leads to more frequent IO accesses, pushing model training into a memory-bound regime [4]

Efficient Algorithms
- SonicMoE introduces a way to compute routing gradients without caching activation values, cutting backward-propagation memory usage by 45% for fine-grained models [4]
- The design overlaps computation with IO, effectively hiding the high IO latency associated with fine-grained MoE [4]

Token Rounding Strategy
- The token-rounding method optimizes how tokens are distributed to experts, minimizing the compute wasted on tile-quantization effects and thereby improving training efficiency without compromising model quality (a minimal sketch follows this summary) [4][20][26]

Performance Metrics
- SonicMoE reaches a training throughput of 213 billion tokens per day on 64 H100 GPUs, comparable to the efficiency of 96 H100 GPUs running ScatterMoE [6]
- Activation memory stays constant even as expert granularity increases, with efficiency improvements ranging from 0.20x to 1.59x over existing baselines [9][15]

Open Source Contribution
- The team has open-sourced the relevant kernel code, giving the large-model community a robust tool for accelerating high-performance MoE training [7]
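To see why token rounding helps, here is a small sketch of the tile-quantization effect it targets, assuming a GEMM tile of 128 tokens per expert and a simple round-down rule; the tile size and rounding policy are illustrative assumptions, not SonicMoE's actual kernel logic.

```python
import torch

TILE = 128  # illustrative tile size for the experts' grouped matmul

def tile_padding(counts: torch.Tensor, tile: int = TILE) -> int:
    """Padding tokens needed if each expert's token batch is padded to a full tile."""
    return int(((tile - counts % tile) % tile).sum())

def round_down_to_tiles(counts: torch.Tensor, tile: int = TILE) -> torch.Tensor:
    """Toy rounding rule: trim each expert's token count down to a tile multiple
    (e.g., by dropping its lowest-scoring routed tokens), so no tile is left
    partially filled and no compute is spent on padding."""
    return (counts // tile) * tile

# Toy router output: token counts assigned to 4 experts in one training step.
counts = torch.tensor([300, 130, 257, 45])
print("padding without rounding:", tile_padding(counts))       # compute wasted on padding
rounded = round_down_to_tiles(counts)
print("rounded counts:", rounded.tolist(), "padding:", tile_padding(rounded))
```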
Dissecting CANN: when Huawei decides to open the "black box" of compute
机器之心· 2025-12-19 06:38
Core Viewpoint
- The article discusses Huawei's recent announcement regarding the open-sourcing of its Ascend CANN software, which aims to lower the barriers for AI tool development and foster a new AI computing ecosystem [2][30]

Group 1: CANN Open Source and Developer Empowerment
- CANN, which stands for Compute Architecture for Neural Networks, serves as a bridge between AI training frameworks and the underlying AI chips, allowing developers to utilize computing power without needing to understand chip details [2][5]
- The open-source nature of CANN has garnered significant attention in the industry, as it empowers developers to define computing capabilities and customize their AI models [2][6]
- CANN supports seamless integration with major AI frameworks such as PyTorch, TensorFlow, MindSpore, and PaddlePaddle, enhancing developer flexibility [5][6]

Group 2: Development Paths Offered by CANN
- CANN provides three development paths for different types of developers (a generic kernel sketch for the Python path appears after this summary):
  1. For those familiar with Python, CANN integrates with the Triton ecosystem, allowing easy migration of existing code [9]
  2. For system-level programmers seeking high performance, Ascend C offers low-level resource management capabilities [10]
  3. For developers looking for ease of use, the CATLASS operator template library simplifies the creation of matrix multiplication operators [11][13]
- The MLAPO fusion operator, part of the CATLASS library, significantly reduces computation time and enhances performance in large models [15]

Group 3: Architectural Innovations
- CANN's architecture features a layered, decoupled approach that allows independent evolution of components, reducing integration complexity for developers [21][22]
- The decoupling enables developers to selectively upgrade specific components based on their needs, facilitating easier customization and integration [23][29]
- CANN has transitioned from a monolithic software structure to a modular one, with independent components for various functionalities, enhancing flexibility and performance [24][26]

Group 4: Open Source Community and Growth
- The open-source initiative of CANN is actively progressing, with over 27 sub-projects and a total of more than 3,700 stars on its repositories [35]
- The community-driven approach invites developers to contribute, thereby expanding the ecosystem and enhancing the technology's value through collaborative efforts [31][32]
- CANN's repositories include a variety of core libraries and tools, providing developers with ready-to-use resources for AI application development [16][36]
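For a sense of what the Python-level path works with, here is a small, standard Triton kernel of the kind such code typically looks like. It uses only the public Triton API and is not CANN-specific; whether and how it maps onto Ascend hardware depends on CANN's Triton integration, which is not shown here.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage on a device with a working Triton backend:
# x = torch.rand(4096, device="cuda"); y = torch.rand(4096, device="cuda")
# assert torch.allclose(add(x, y), x + y)
```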