机器之心

The ultimate question for embodied AI: build a "human", or build "productivity"?
机器之心· 2025-06-25 04:06
机器之心 report. Editor: Wu Xin. At Huawei Developer Conference 2025 (HDC 2025), Huawei unveiled the CloudRobo embodied-AI platform. CloudRobo can be seen as a "technology base" for embodied intelligence: by letting "strong intelligence" in the cloud empower the machine body, it sidesteps two pain points of on-device intelligence, slow progress and high deployment cost, and charts a route to embodied-AI adoption that spans the widest range of machines at the fastest pace. "Huawei Cloud's goal is to turn every connected machine body into an embodied-AI robot," said Zhang Ping'an, CEO of Huawei Cloud. By forgoing the "body" and instead offering cloud-side enablement, Huawei Cloud has chosen a strategy that fits its own strengths, and one that also brings a fresh perspective to embodied AI: what embodied intelligence ultimately pursues is not a particular body "form factor" or how intelligent the body itself is, but, from the end-state view of "being more useful", making every machine "embodied-intelligent", from humanoids to mobile robots to trucks, and accelerating their real use in the physical world. This end-state thinking greatly widens the space for industrializing embodied AI and points toward a potentially efficiency-optimal path to commercialization. Industrial practice already confirms the route's feasibility: in industrial spray painting, CloudRobo helped EFORT (埃夫特) robotic arms adapt quickly to new spraying tasks; in semiconductor manufacturing, it enabled Youibot (优艾智合) logistics robots to sync with production systems in real time, update task plans, and complete material handling and transport. Its partners include Youibot, EFORT, and others ...
ICML 2025 Oral | From "shallow alignment" to "deliberate reasoning": Tsinghua leads the next step up the ladder of large-model safety
机器之心· 2025-06-25 04:06
Co-first authors of this work include: Zhang Yichi, a third-year PhD student in the Department of Computer Science at Tsinghua University advised by Prof. Zhu Jun, researching multimodal large models and LLM safety; he has published multiple papers at top venues including CVPR, NeurIPS, and ICML, and led the development of MultiTrust, the first comprehensive trustworthiness benchmark for multimodal large models. And Zhang Siyuan, a first-year master's student in the same department advised by associate researcher Su Hang, working on LLM safety and alignment algorithms. The corresponding authors are assistant professor Dong Yinpeng of Tsinghua's College of AI and Prof. Zhu Jun of the Department of Computer Science. Other collaborators come from Beihang University, RealAI, Alibaba Security, Baichuan Intelligence, and other institutions. As large language models (LLMs) accelerate into high-stakes applications such as law, medicine, and finance, "safety alignment" is no longer optional; it is a challenge every model developer and AI practitioner must confront head-on. Yet today's widely adopted alignment methods often just make the model mechanically reply "Sorry, I can't fulfill your request" when a risky prompt is detected. This seemingly "safe" mechanism is in fact brittle. An ICLR 2025 Outstanding Paper first named this class of methods "Shallow Alignment" [1]: the model's predictive distribution shifts effectively only at the beginning of the reply, while the model never truly understands the underlying risk semantics. Once a jailbreak prompt is repackaged, the ...
Just now: the first embodied Gemini that can run locally on a robot has arrived
机器之心· 2025-06-25 00:46
Core Viewpoint - The article discusses the launch of Gemini Robotics On-Device, a new visual-language-action (VLA) model by Google DeepMind, designed for robots to operate efficiently without continuous internet connectivity [1][2]. Group 1: Product Overview - Gemini Robotics On-Device is the first VLA model that can be directly deployed on robots, enhancing their ability to adapt to new tasks and environments [2][4]. - The model is optimized for efficient operation on robotic hardware, showcasing strong general flexibility and task generalization capabilities [4][12]. - It can operate in environments with no data network, making it suitable for latency-sensitive applications [5]. Group 2: Developer Tools - Google will release the Gemini Robotics SDK, allowing developers to evaluate the model's performance in their specific tasks and environments [7]. - Developers can test the model in DeepMind's MuJoCo physics simulator, requiring only 50 to 100 demonstrations to adapt to new tasks [7][21]. Group 3: Performance and Adaptability - Gemini Robotics On-Device has demonstrated strong performance in various dexterous tasks, such as unzipping bags and folding clothes, all executed directly on the robot [12][16]. - The model shows significant advantages over previous local robot models, especially in challenging out-of-distribution tasks and complex multi-step instructions [15][16]. - It can be fine-tuned for improved performance and can adapt to different robotic platforms, including the Franka FR3 and Apollo humanoid robots [25][26]. Group 4: Updates and Changes - Alongside the new model, Google DeepMind has reduced the free usage limits for its Gemini 2.5 Flash and Gemini 2.0 Flash models, which may not be well-received by free users [30][32]. - The company has also announced the launch of new image generation models, Imagen 4 and Imagen 4 Ultra, in its AI Studio and Gemini API [33].
How can a lifelike 3D digital human run on a phone in real time? MNN-TaoAvatar is now open source!
机器之心· 2025-06-25 00:46
Core Viewpoint - TaoAvatar is a breakthrough 3D digital human technology developed by Alibaba's Taobao Meta Technology team, enabling real-time rendering and AI dialogue on mobile and XR devices, providing users with a realistic virtual interaction experience [1][8]. Group 1: Technology Overview - TaoAvatar utilizes advanced 3D Gaussian splatting technology to create lifelike full-body avatars that capture intricate facial expressions and gestures, as well as details like clothing folds and hair movement [8]. - The technology significantly reduces the cost and increases the efficiency of digital human modeling, facilitating large-scale applications [9]. - MNN-TaoAvatar is an open-source 3D digital human application that integrates multiple leading AI technologies, allowing natural voice interaction with digital humans on mobile devices [10]. Group 2: Performance Metrics - The application runs efficiently on mobile devices, with key performance metrics for various models as follows: - ASR (Automatic Speech Recognition): Model size 281.65M, RTF: 0.18 - LLM (Large Language Model): Model size 838.74M, pre-fill speed: 165 tokens/s, decode speed: 41.16 tokens/s - TTS (Text-to-Speech): Model size 1.34GB, RTF: 0.58 - A2BS (Audio-to-BlendShape): Model size 368.71MB, RTF: 0.34 - NNR (Rendering Output): Model size 138.40MB, rendering frame rate: 60 FPS [16][17][18]. Group 3: Development and Optimization - MNN-TaoAvatar is built on the MNN engine, which supports various algorithm modules, enhancing the performance of AI applications in real-time scenarios [23][30]. - The MNN-LLM module demonstrates superior CPU performance, with pre-fill speed improved by 8.6 times compared to llama.cpp and decoding speed improved by 2.3 times [34]. - The MNN-NNR rendering engine employs optimizations such as data synchronization and scheduling to ensure efficient rendering, achieving smooth output at 60 FPS even with lower frequency updates [40][45].
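The RTF (real-time factor) figures above have a simple reading: RTF is processing time divided by the duration of the audio handled, so any value below 1.0 keeps up with real time. A minimal sketch (the function name and sample timings are illustrative, not from the MNN codebase):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.
    RTF < 1.0 means the stage runs faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative: transcribing a 10 s utterance in 1.8 s of compute
# corresponds to the ASR's reported RTF of 0.18.
asr_rtf = real_time_factor(1.8, 10.0)
```

By this measure every stage listed above (ASR 0.18, TTS 0.58, A2BS 0.34) sits below 1.0, which is what makes the end-to-end on-device voice loop feel real-time.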
Group 4: Hardware Requirements - Recommended hardware for MNN-TaoAvatar includes devices with Qualcomm Snapdragon 8 Gen 3 or equivalent CPU, at least 8GB of RAM, and 5GB of storage for model files [51].
It teaches the lessons, nails the exam predictions, and tailors your study plan: the only one competing where it really counts
机器之心· 2025-06-24 14:07
Core Viewpoint - The article emphasizes that the true capability of AI learning machines in personalized education is not just the model itself, but a comprehensive system built over 20 years by iFLYTEK, which includes various innovative features and functionalities [2][5][66]. Group 1: AI Learning Machine Performance - iFLYTEK's X1 model has gained attention after performing well in various media-organized tests, ranking first among domestic AI in high school Chinese and English composition tests [7][8]. - The X1 model is noted for its deep reasoning capabilities and is the only fully domestically trained deep reasoning model in the industry, with a lightweight design of only 70 billion parameters [9][10]. Group 2: AI Personalized Learning Features - The "AI Precision Learning" feature has been upgraded to provide personalized learning paths based on quick assessments of students' weak points, recommending targeted resources and exercises [16][18]. - A new "AI 1-on-1 Interactive Consultation Planning" function allows the system to assess a child's knowledge mastery through dialogue, generating personalized learning paths [20][21]. Group 3: Enhanced AI Tutoring Capabilities - The AI tutoring system now employs a Socratic method, guiding students through questions rather than providing direct answers, which encourages critical thinking [24][41]. - The upgraded AI tutoring supports more subjects and grades, including elementary mathematics and middle school language arts [27][28]. Group 4: Interactive Learning Experience - iFLYTEK's AI interactive courses are designed to engage students actively, with new features like AI storybook reading sessions for younger children [32][33]. - The AI learning machine will soon introduce additional features, including AI textbooks and mental health support tools [34]. 
Group 5: Foundation of AI Capabilities - The effectiveness of iFLYTEK's AI learning machine is attributed to its foundational AI capabilities, particularly the SocraticLM model, which enhances the teaching process through structured problem-solving [37][41]. - iFLYTEK has invested over 20 years in educational technology, collaborating with various educational authorities to build a comprehensive understanding of educational standards and data [62][63]. Group 6: Future Developments - iFLYTEK is focused on developing domestically controlled AI models to avoid reliance on external technologies, aligning with national strategies for AI education [67][68]. - The upcoming upgrades for the X1 model are anticipated to further enhance its capabilities in personalized education, indicating a promising future for AI in the education sector [69].
ToMAP: giving large models "mind-reading" to build smarter AI persuaders
机器之心· 2025-06-24 14:07
Core Viewpoint - The article introduces ToMAP, a new persuasion model that integrates Theory of Mind (ToM) mechanisms to enhance the persuasive capabilities of AI, addressing the limitations of current large language models in understanding opponents' perspectives and adapting strategies accordingly [4][19]. Summary by Sections Introduction to Persuasion - Persuasion is a complex communication process that influences beliefs, attitudes, and behaviors, and serves as a test for advanced large language models [2]. Limitations of Current Models - Top-tier large models can generate coherent persuasive text but lack mental perception, which hinders their ability to effectively persuade [3][4]. ToMAP Model Overview - ToMAP introduces two key mental modules: the Refutation Predictor and the Attitude Predictor, enabling AI to anticipate opposing viewpoints and assess the opponent's attitude dynamically [9][19]. Refutation Predictor - The Refutation Predictor simulates human-like anticipation of counterarguments, allowing the model to address concerns proactively. It can identify common objections, such as "cooking is troublesome" or "the taste is bad" in discussions about vegetarian recipes [9][10]. Attitude Predictor - The Attitude Predictor evaluates the opponent's stance towards counterarguments, determining whether they are firmly against, neutral, or persuaded. This module uses dialogue history and arguments to dynamically assess the opponent's attitude [9][11]. Training Methodology - ToMAP employs reinforcement learning (RL) to train the model through numerous dialogues, rewarding it based on a "persuasiveness score" that measures attitude changes before and after interactions [11][19]. Experimental Results - The model was tested across various datasets, showing that ToMAP significantly outperforms baseline models and even larger models like GPT-4o, demonstrating its effectiveness despite having fewer parameters [14][20]. 
Performance Insights - ToMAP maintains a low level of repetition while increasing the diversity of outputs, indicating effective use of the mental modules. It also shows a higher depth of thought compared to baseline models, favoring rational strategies over emotional appeals [15][16]. Long-term Persuasiveness - Unlike baseline models that plateau or decline in effectiveness over extended dialogues, ToMAP continues to improve its persuasiveness, showcasing its adaptability and diverse argumentation [17][20]. Conclusion - ToMAP represents a significant advancement in AI persuasion frameworks, integrating social cognition features that allow for a more human-like understanding of opponents' cognitive structures and attitudes [20][21].
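The "persuasiveness score" reward described above can be sketched as a simple attitude delta; the [0, 1] scale and the function below are assumptions for illustration, not ToMAP's actual implementation:

```python
def persuasiveness_reward(attitude_before: float, attitude_after: float) -> float:
    """RL reward = shift of the persuadee's attitude toward the persuader's stance.
    Hypothetical scale: 0.0 = firmly opposed, 0.5 = neutral, 1.0 = persuaded."""
    return attitude_after - attitude_before

# A persuadee moving from roughly neutral (0.4) to mostly persuaded (0.9)
# earns a positive reward; a backfire (0.6 -> 0.3) earns a negative one,
# steering the policy away from strategies that harden opposition.
```

Because the reward is computed from before/after attitude estimates rather than a fixed script, the model is paid for actual attitude change, which is what lets its persuasiveness keep improving over long dialogues instead of plateauing.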
Cache Me If You Can: how does Danqi Chen's team "catch" the critical cache and free up LLM memory?
机器之心· 2025-06-24 14:07
Core Viewpoint - The research by Chen Danqi's team at Princeton University introduces a unified metric called "KV Footprint" to measure the efficiency of key-value (KV) cache usage in language models, particularly for long-context tasks, addressing the challenges of memory consumption during the pre-fill and decoding stages [10][12][15]. Group 1 - The emergence of technologies like "long thinking chains" has created new workloads requiring models to generate thousands of tokens [2]. - Most language models are based on the Transformer architecture, which requires storing the attention states of all previous tokens in a KV cache, leading to linear memory growth with input length [3][5]. - The KV cache is crucial for fast inference, but its size can reach up to 42GB when processing long prompts, such as those with 128K tokens [5]. Group 2 - Previous works have proposed methods to evict parts of the KV pairs from memory to achieve "sparse attention," but comparing these methods fairly has been challenging [6][20]. - The research defines "Key KV Footprint" as the minimum KV footprint achievable while maintaining at least 90% performance relative to a full attention mechanism, ensuring that comparisons are meaningful [12][27]. Group 3 - The study reveals that previous KV eviction methods suffer from high peak memory issues, particularly with post-fill eviction methods that are incompatible with pre-fill eviction [13]. - The team developed PruLong, an end-to-end optimization method that learns which attention heads need to retain full KV cache and which do not, achieving a 12% reduction in KV footprint while maintaining performance on challenging recall tasks [15][36]. Group 4 - The research examines various efficient long-context methods and discusses their fit within the KV footprint framework, highlighting trade-offs and different sparsity concepts [28]. 
- The study categorizes KV entries as active, inactive, or evicted, defining KV occupancy as the number of non-evicted attention entries across all time steps [24][26]. Group 5 - PruLong optimizes the attention heads by minimizing the next token prediction loss, which aligns better with the usage of these models in text generation [37]. - The method utilizes natural long-context data for training, contrasting with previous approaches that relied on synthetic data, thus enhancing its applicability in real-world scenarios [39].
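The memory numbers above follow from the standard KV-cache size formula: every token stores one key and one value per layer per KV head. The configuration below is a hypothetical 70B-class model with grouped-query attention (80 layers, 8 KV heads of dimension 128, fp16), chosen only to show how a 128K-token prompt lands in the ~42 GB range cited above:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Factor of 2: both keys and values are cached for every token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config with grouped-query attention, fp16 cache:
size = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 42.9 GB for this illustrative config
```

Since the size grows linearly in `seq_len`, eviction methods that shrink the count of non-evicted KV entries shrink this footprint directly, which is exactly what the "KV Footprint" metric is designed to measure.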
Everyone knows video can't be "Photoshopped"? CVPR work from Boxin Shi's team at Peking University and OpenBayes: effortlessly swap clothes or add a corgi in a video
机器之心· 2025-06-24 09:31
Published on 机器之心; 机器之心 editorial team. Video is among the most information-dense and emotionally expressive media, faithfully reproducing the complexity and detail of the real world. For the same reason, it is also among the hardest digital content to edit. In a traditional video-editing pipeline, adjusting or replacing a subject, a scene, or colors, or removing an object, often means manually annotating countless frames, painting masks, and carefully grading color. Even seasoned post-production teams struggle to keep edited content temporally consistent in complex scenes. In recent years, the rapid iteration of generative AI, especially diffusion models and multimodal large models, has opened a new approach to video editing. From early rule-based effects tools, to object detection and automatic segmentation, to text-instructed video generation and repainting, AI has improved both the efficiency and the controllability of video editing, yet challenges remain in precision-critical scenarios: many current zero-shot methods cause flicker when processing consecutive frames, and scenes with complex backgrounds or multiple targets can produce misalignment, blur, or semantic drift. To address this, the Camera Intelligence Lab at Peking University (Boxin Shi's team), together with OpenBayes (贝式计算) and associate professor Li Si's team at the Pattern Recognition Lab, School of Artificial Intelligence, BUPT, proposed VIRES, a video instance repainting method guided jointly by sketches and text, supporting repainting, replacement, generation, and removal of video subjects. The method leverages a text-to-video ...
A modder hand-mods a consumer-grade 5090 and beats the behemoth RTX Pro 6000 in one stroke
机器之心· 2025-06-24 06:46
Core Viewpoint - The modified ASUS ROG Astral LC RTX 5090 surpasses the performance of the $10,000 RTX Pro 6000 after a shunt mod, demonstrating significant potential in high-performance graphics cards [1][21]. Summary by Sections Shunt Mod Overview - Shunt mod is a high-risk hardware modification method used to bypass power and current limits in high-performance graphics cards and motherboards [2]. - The modification allows for increased power limits, potentially enhancing performance but poses risks to the GPU's lifespan [7][8]. Der8auer's Experiment - Der8auer, a renowned hardware modder, aimed to unlock power limits on the RTX 5090 to assess performance improvements [5][6]. - He conducted baseline tests to gather initial data on performance, noise, and temperature before the modification [10]. Performance Metrics - After the shunt mod, the power consumption of the RTX 5090 increased from 660W to 720W, with GPU frequency rising to 2,950MHz [17]. - The FPS improved from 146 to 152, successfully outperforming the RTX Pro 6000 [17]. - The GPU temperature remained around 60°C under load, while memory temperature reached 80°C, indicating effective thermal management [18]. Technical Details - The modification involved altering resistance values near the power connector to trick the control circuit into accepting higher power inputs [12]. - Theoretically, the mod allows the GPU to handle approximately 30% more power without detection [14]. Conclusion - The modified RTX 5090 slightly outperformed the RTX Pro 6000 but exhibited significantly higher power consumption compared to the unmodified version [21].
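The trick behind the shunt mod is Ohm's law: the card infers current from the voltage drop across a tiny sense (shunt) resistor, I = V / R. Soldering a second resistor in parallel lowers the effective resistance, so the control circuit under-reads the true current. The resistor values below are illustrative, not the actual values on the 5090 board:

```python
def parallel(r1: float, r2: float) -> float:
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

R_STOCK = 0.005            # hypothetical 5 mOhm factory shunt
R_ADDED = 0.015            # hypothetical 15 mOhm resistor soldered on top
r_effective = parallel(R_STOCK, R_ADDED)      # 3.75 mOhm

true_current = 100.0                          # amps actually flowing
v_drop = true_current * r_effective           # voltage the sense circuit sees
sensed_current = v_drop / R_STOCK             # controller still assumes 5 mOhm
# The controller reads ~75 A for a true 100 A, so roughly a third more power
# fits under the same sensed limit -- consistent with the ~30% figure above.
```

This also explains the risk noted in the article: the power-delivery components really do carry the extra current, even though the card's own telemetry never sees it.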
A new reinforcement-learning finding: no math samples needed; game-only training sharply boosts AI reasoning
机器之心· 2025-06-24 06:46
Core Viewpoint - The research introduces a groundbreaking method called ViGaL (Visual Game Learning), which enhances multi-modal reasoning capabilities in AI models through game training, without the need for extensive mathematical training samples [5][11][24]. Group 1: Research Findings - The study demonstrates that training AI models on simple games like Snake can significantly improve their performance in mathematical reasoning and multi-disciplinary tasks, achieving an average accuracy increase of 2.9% on mathematical benchmarks and 2.0% on multi-disciplinary reasoning tasks [11][15]. - The research team utilized a 7B parameter model, Qwen2.5-VL, and found that reinforcement learning through game play outperformed traditional methods that relied on mathematical or multi-disciplinary data [11][15]. - The findings suggest that game training can lead to stronger cross-domain generalization, allowing models to transfer skills learned in gaming to complex reasoning tasks in mathematics and other fields [7][11]. Group 2: Game Design and Training Methodology - The research involved two complementary training games: Snake, which focuses on path planning and spatial navigation, and a custom-designed 3D rotation game that enhances spatial geometric understanding [18][19]. - The design philosophy of the games is complementary, with Snake improving 2D coordinate-related mathematical performance and the rotation game targeting angle and length reasoning [20]. - Joint training on both games proved to be more effective than training on either game alone, showcasing the potential for diverse gaming tasks to enhance AI performance [20]. Group 3: Implications and Future Directions - The success of ViGaL indicates a potential new trend in AI training, suggesting that well-designed games could serve as synthetic tasks to develop multi-modal reasoning capabilities when high-quality human data is scarce [22][23]. 
- This game-based training paradigm offers unique advantages over traditional methods, emphasizing the importance of cultivating underlying general reasoning abilities rather than solely focusing on direct task learning [23]. - The research highlights that allowing AI to "play games" may be more effective than conventional training methods, especially as challenges arise in scaling traditional approaches [24].
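ViGaL's central idea, that verifiable game outcomes can replace labeled math data as the RL signal, can be illustrated with a toy Snake-style reward function (entirely hypothetical; the paper's environments and reward shaping are richer than this):

```python
def snake_reward(ate_food: bool, crashed: bool, step_penalty: float = -0.01) -> float:
    """Game events are automatically verifiable, so no human math annotations
    are needed to produce an RL training signal."""
    if crashed:
        return -1.0          # hit a wall or the snake's own body
    if ate_food:
        return 1.0           # successful path planning to the food square
    return step_penalty      # mild pressure toward shorter routes
```

Rewards like these are cheap, unambiguous, and generated in unlimited quantity by the game itself, which is what makes games attractive as synthetic training tasks when high-quality human data is scarce.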