Google Open-Sources Coral NPU, a Full-Stack Platform That Lets Large Models Run Around the Clock on a Smartwatch
机器之心· 2025-10-16 04:51
Core Viewpoint
- Google is actively advancing its AI capabilities through collaborations and new product launches, including a potential cancer therapy prediction model, an updated video generation tool, and the introduction of the Coral NPU for low-power AI applications [1][4][35].

Group 1: Coral NPU Overview
- The Coral NPU is positioned as a full-stack, open-source platform aimed at addressing performance, fragmentation, and privacy challenges in deploying AI on low-power edge devices [4][6].
- It is designed to enable continuous AI operation on wearable devices, integrating AI directly into users' personal environments [4][30].
- The architecture is based on the RISC-V instruction set architecture, optimized for low power consumption while providing significant performance capabilities [9][13].

Group 2: Technical Challenges
- The performance gap arises from the need for advanced machine learning models that require more computational resources than what edge devices can provide [6].
- Fragmentation costs hinder the optimization of machine learning models across diverse proprietary processors, complicating consistent performance [6].
- User trust is essential, necessitating a focus on privacy and security for personal data in AI applications [6][32].

Group 3: Technical Details
- The Coral NPU architecture includes a scalar core for data flow management, a vector execution unit for large data sets, and a matrix execution unit for neural network operations [22].
- It supports seamless integration with modern compilers and machine learning frameworks like TensorFlow, JAX, and PyTorch [21][25].
- The development tools provided simplify programming for machine learning models, ensuring a consistent experience across various hardware [23].

Group 4: Target Applications
- The Coral NPU is designed for ultra-low power, always-on edge AI applications, particularly in environmental sensing systems [30].
- Potential use cases include hardware-enforced privacy, context awareness, audio processing, image processing, and user interaction [31][34].
- Google has partnered with Synaptics to implement the Coral NPU architecture in their new AI-native IoT processors, enhancing edge AI capabilities [35].

Group 5: Future Outlook
- Google aims to build a foundational layer for personal AI through the Coral NPU, fostering a vibrant ecosystem for developers [37].
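The summary above mentions compiler and framework integration (TensorFlow, JAX, PyTorch), but the article contains no code. As a rough, hypothetical illustration of what targeting a low-power edge NPU from a standard framework usually looks like, the sketch below exports a tiny Keras model as a fully int8-quantized TFLite flatbuffer. The model, the calibration data, and the assumption that an edge-NPU toolchain such as Coral NPU's consumes TFLite-style artifacts are illustrative assumptions, not details from the article.

```python
# Hypothetical sketch: export an int8 model for a low-power edge NPU.
# Assumes a TFLite-style toolchain; Coral NPU specifics are not from the article.
import numpy as np
import tensorflow as tf

# A tiny stand-in for an always-on sensor/audio classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data():
    # Calibration samples drive post-training int8 quantization.
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("sensor_classifier_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

The int8-only constraint mirrors the usual requirement of ultra-low-power accelerators, which typically execute fixed-point matrix and vector operations rather than full floating point.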
From Masked Generation to "Remasking" Training: RemeDi Teaches Diffusion Language Models to Self-Correct and Reflect
机器之心· 2025-10-16 02:20
Diffusion language models have recently attracted wide attention as an alternative to autoregressive text generation. To let the model keep revising and refining intermediate results during generation, Professor Guo-Jun Qi's team at the MAPLE Lab of Westlake University trained a diffusion language model with remasking capability (Remasking-enabled Diffusion Language Model, RemeDi 9B). Across the multi-step denoising process, through remasking SFT and RL training, RemeDi outputs an unmasking confidence for every token, allowing it to identify uncertain positions among the content already generated in the sequence and remask them, thereby correcting errors and improving text quality; it surpasses existing diffusion language models across the board. The model also supports variable-length generation, breaking the fixed-length limitation of existing medium- and large-scale diffusion language models and making the model's capabilities more flexible.

Background

Diffusion language models have become a strong alternative to autoregressive language models. This family of methods first defines a forward process that gradually corrupts text into noise, then trains the model to learn the reverse process of recovering clean text from that noise. Among these methods, mask-based diffusion language models are currently the mainstream approach. This scheme requires the model to learn, during training, to recover the masked ...
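The remasking mechanism is described above only in prose. Below is a minimal, hypothetical Python sketch of the general idea: after each denoising step, the least-confident already-generated tokens are hidden again so later steps can revise them. The model interface (returning per-token logits plus an unmasking confidence) and the remasking rule are assumptions; this does not reproduce RemeDi's exact training or sampling procedure.

```python
# Illustrative sketch of remask-enabled diffusion decoding. The model interface
# and the remasking rule are assumptions, not RemeDi's exact procedure.
import torch

def remasking_decode(model, seq, mask_id, steps=8, remask_ratio=0.1):
    """seq: LongTensor of shape [T], initialized to all mask_id tokens."""
    for step in range(steps):
        logits, unmask_conf = model(seq)       # assumed outputs: [T, V] and [T]
        pred = logits.argmax(dim=-1)

        # Fill every currently masked position with the model's prediction.
        masked = seq == mask_id
        seq = torch.where(masked, pred, seq)
        if step == steps - 1:
            break                              # final step: keep everything filled

        # Remask: among tokens generated in earlier steps, re-hide the least
        # confident ones so subsequent steps can correct them.
        previously_generated = ~masked
        if previously_generated.any():
            k = max(1, int(remask_ratio * int(previously_generated.sum())))
            conf = unmask_conf.masked_fill(~previously_generated, float("inf"))
            low_conf_idx = conf.topk(k, largest=False).indices
            seq[low_conf_idx] = mask_id
    return seq
```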
Apple Releases the M5 Chip, and the Happiest People Are the M1 Holdouts
机器之心· 2025-10-16 02:20
机器之心 Report | Editors: Zhang Qian, Zenan

The Neural Accelerators in the M5 boost the performance of GPU-based AI workloads running on-device in apps such as LM Studio. Apps like Draw Things, which build on Apple's built-in frameworks and APIs, see significant performance gains on the new 14-inch MacBook Pro and iPad Pro.

If you were about to buy a new Mac, you might want to hold off a little longer.

Last night, Apple unveiled its next-generation in-house chip, the M5, with across-the-board upgrades in AI compute, graphics performance, and power efficiency. Compared with the previous-generation M4, the M5 delivers more than 4x the peak GPU AI compute and, for the first time, integrates a Neural Accelerator into every GPU core. The chip will debut in the new 14-inch MacBook Pro, iPad Pro, and Apple Vision Pro, all three of which are now available for pre-order.

AI is the headline feature

The M5 is built on a third-generation 3-nanometer process (N3P) and adopts an all-new 10-core GPU architecture (note: although the M5 still uses TSMC's 3nm technology, compared with the N3E node used by the M4, the N3P node used by the M5 ...
Young People Are Scaring Their Parents with AI-Generated "Homeless Intruders", Drawing 8.1 Million Onlookers: This Prank Went Too Far
机器之心· 2025-10-16 02:20
Core Viewpoint
- The article discusses the trend of using AI-generated images of homeless individuals as pranks, particularly targeting parents, leading to significant anxiety and panic among them [3][18][25].

Group 1: Prank Mechanics
- Young people are using AI tools like Google Gemini to create realistic images of homeless people in their homes, which they then send to their parents to elicit reactions [11][12].
- The pranks often involve sending multiple images showing the supposed homeless person engaging in various activities, such as eating or using personal items, which escalates the panic of the parents [4][6][10].

Group 2: Reactions and Consequences
- Parents typically react with alarm, often attempting to contact their children or even calling the police out of fear for their safety [4][19][22].
- The phenomenon has gained significant traction on social media, with videos receiving millions of views and likes, indicating a widespread interest in such pranks [10][12].

Group 3: Ethical Considerations
- The article raises concerns about the ethical implications of these pranks, highlighting that they can cause real distress and anxiety, particularly for older individuals who may not be familiar with AI technology [18][25].
- There is a warning that prolonged pranking could lead to unnecessary police involvement, wasting resources and potentially causing serious consequences [19][22].
Just In: Google's Veo 3.1 Gets a Major Update, Taking On Sora 2 Head-On
机器之心· 2025-10-16 00:51
Core Insights
- Google has released its latest AI video generation model, Veo 3.1, which enhances audio, narrative control, and visual quality compared to its predecessor, Veo 3 [2][3].
- The new model introduces native audio generation capabilities, allowing users to better control the emotional tone and narrative pacing of videos during the creation phase [10].

Enhanced Audio and Narrative Control
- Veo 3.1 improves support for dialogue, environmental sound effects, and other audio elements, allowing for a more immersive video experience [5].
- Core functionalities in Flow, such as "Frames to Video" and "Ingredients to Video," now support native audio generation, enabling users to create longer video clips that can extend beyond the original 8 seconds to 30 seconds or even longer [6][9].

Richer Input and Editing Capabilities
- The model accepts various input types, including text prompts, images, and video clips, and supports up to three reference images to guide the final output [12].
- New features like "Insert" and "Remove" allow for more precise editing, although not all functionalities are immediately available through the Gemini API [13].

Multi-Platform Deployment
- Veo 3.1 is accessible through several existing Google AI services and is currently in a preview phase, available only in the paid tier of the Gemini API [15][16].
- The pricing structure remains consistent with the previous Veo model, charging only after successful video generation, which aids in budget predictability for enterprise teams [16][21].

Technical Specifications and Output Control
- The model supports video output at 720p or 1080p resolution with a frame rate of 24 frames per second [18].
- Users can upload product images to maintain visual consistency throughout the video, simplifying the creative production process for branding and advertising [19].

Creative Applications
- Google's Flow platform serves as an AI-assisted movie creation tool, while the Gemini API is aimed at developers looking to integrate video generation features into their applications [20].
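For developers, a Veo call through the Gemini API is a long-running operation: submit a prompt, poll until the job finishes, then download the clip. The hypothetical sketch below follows the google-genai Python SDK's general pattern; the model identifier, polling interval, and download helpers are assumptions and should be checked against the current API reference before use.

```python
# Hypothetical sketch of generating a clip with Veo via the Gemini API.
# Model name, polling, and download helpers are assumptions; consult the
# official google-genai documentation for the current interface and pricing tier.
import time
from google import genai

client = genai.Client()  # reads the API key from the environment

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",   # assumed identifier for the preview model
    prompt="A slow dolly shot through a rain-soaked neon alley, ambient city sounds",
)

# Video generation is asynchronous: poll the long-running operation until done.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo_clip.mp4")        # 720p/1080p at 24 fps, per the article
```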
No More "Guessing Coordinates": Shuicheng Yan's Team and Collaborators Release the PaDT Multimodal Large Model, Achieving Truly Multimodal Representation Output
机器之心· 2025-10-16 00:51
Core Insights
- The article discusses the advancements in Multimodal Large Language Models (MLLMs) and introduces a new paradigm called Patch-as-Decodable Token (PaDT) to address the limitations of existing models in tasks requiring fine spatial understanding [2][6].

Group 1: PaDT Overview
- PaDT proposes a revolutionary approach by dividing images into multiple visual patches and allowing the model to generate corresponding Visual Reference Tokens (VRTs) directly [3].
- It enables seamless alternation between text tokens and visual tokens at both input and output stages, making the model's description of image content as natural as describing text [4].
- The model can directly indicate image targets in generated sentences rather than guessing coordinates [5].

Group 2: Limitations of Traditional MLLMs
- Traditional MLLMs output detection box coordinates in string format, leading to inconsistencies, semantic disconnection, and weak image-text associations [8].
- The output format can vary, making it difficult to parse targets, and numbers can be split into separate tokens, disrupting spatial continuity [8].
- The reliance on coordinate tokens, which lack inherent semantic meaning, results in challenges such as hallucination and repetition in generated outputs [8].

Group 3: PaDT Mechanism
- PaDT introduces VRTs derived from the visual patch embeddings of the input image, creating a dynamic embedding table that integrates both text and visual information [11].
- This design avoids the pitfalls of traditional methods that depend on global visual codebooks, which can confuse similar objects and generate non-existent patches [13].
- The lightweight PaDT Decoder, consisting of three bidirectional attention blocks, transforms VRTs into structured visual outputs like bounding boxes and segmentation masks [15].

Group 4: Performance Metrics
- PaDT Pro (3B) achieved a remarkable average accuracy of 93.6 in the RefCOCO/+/g referring expression comprehension task, surpassing the 78B InternVL3 model, which scored 91.4 [21][22].
- In the COCO open vocabulary detection task, traditional MLLMs typically have a mean Average Precision (mAP) below 20, while PaDT Pro (3B) raised it to 38.2, nearly doubling the performance [21][24].
- The model also demonstrated strong performance in the Referring Image Captioning (RIC) task, significantly improving the CIDEr-D score from 0.386 to 1.450 [24].

Group 5: Implications and Future Directions
- PaDT's success stems from its deep understanding of the visual capability bottlenecks in MLLMs, allowing for native alignment between visual patches and generated tokens [31].
- The dynamic embedding mechanism ensures strong binding of VRTs to the current image, preventing cross-image confusion [31].
- The model exhibits robust multitasking capabilities, outperforming single-task models by seamlessly switching tasks through prompt changes [33].
- The introduction of PaDT marks a significant step towards achieving true multimodal intelligence, allowing for more natural interactions between different modalities [35].
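To make the "dynamic embedding table" idea concrete, here is a small, hypothetical PyTorch sketch: the current image's patch embeddings are appended to the text vocabulary so the decoder can score text tokens and this image's Visual Reference Tokens jointly. Class names, shapes, and the scoring rule are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the patch-as-decodable-token idea (interfaces assumed;
# this is not the authors' implementation).
import torch
import torch.nn as nn

class PaDTStyleHead(nn.Module):
    def __init__(self, hidden_dim, text_vocab_size):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab_size, hidden_dim)

    def dynamic_table(self, patch_embeds):
        # patch_embeds: [num_patches, hidden_dim] from the vision encoder of the
        # *current* image, appended to the text vocabulary for this sample only.
        return torch.cat([self.text_embed.weight, patch_embeds], dim=0)

    def logits(self, hidden_states, patch_embeds):
        # hidden_states: [seq_len, hidden_dim] from the LLM decoder.
        # Scores text tokens and this image's visual reference tokens jointly.
        table = self.dynamic_table(patch_embeds)
        return hidden_states @ table.t()   # [seq_len, text_vocab + num_patches]
```

Predicted indices at or above the text vocabulary size would then refer to patches of the current image, which a lightweight decoder (three bidirectional attention blocks in the paper) converts into bounding boxes or segmentation masks.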
The First Multi-Round LLM Router Arrives: Router-R1 Teaches Large Models to "Think, Route, and Aggregate"
机器之心· 2025-10-15 10:44
Core Insights
- The article discusses the introduction of Router-R1, a novel multi-round LLM Router framework that enables large language models (LLMs) to not only answer questions but also think, schedule, and coordinate with other models to achieve a balance between performance and cost [3][26].

Group 1: Background and Motivation
- The rapid growth of LLMs has led to over a hundred different models, each with unique strengths, such as logic reasoning or knowledge retrieval [6].
- Current AI applications primarily rely on single model inference, which can lead to inefficiencies and inaccuracies depending on the complexity of the questions posed [6][8].

Group 2: Router-R1 Framework
- Router-R1 innovatively transforms the router into a reasoning-capable policy LLM, allowing it to engage in a "think-select-aggregate" process, thus enabling multi-round routing iterations [8][26].
- The framework utilizes reinforcement learning to optimize the performance-cost trade-off, formalizing the multi-round routing process as a sequential decision-making problem [10][26].

Group 3: Reward Mechanisms
- Router-R1 employs three types of reward functions:
  - Format Reward ensures the output adheres to specific format constraints [10].
  - Final Outcome Reward measures the correctness of the generated answer against a standard [11].
  - Cost Reward introduces a cost constraint mechanism that considers the model's parameter size and output token count [15][16].

Group 4: Performance Evaluation
- The research team evaluated Router-R1 across seven QA benchmarks, demonstrating superior performance in both single-hop and multi-hop reasoning tasks [19].
- Router-R1 outperformed existing models, achieving the highest accuracy across all datasets when performance was prioritized over cost [21].

Group 5: Implications and Future Trends
- Router-R1 represents a shift towards a new paradigm of collaborative multi-model systems, allowing for dynamic balancing of performance and cost while maintaining high-quality outputs [26].
- The adoption of LLM Router mechanisms in future models, such as GPT-5, indicates a trend towards multi-model collaboration as a foundational infrastructure in the LLM ecosystem [26].
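The three reward terms can be pictured as a single scalar combining format compliance, answer correctness, and a cost penalty. The Python sketch below is only a hedged illustration; the tag format, exact-match criterion, weighting, and cost formula are assumptions rather than the paper's definitions.

```python
# Hedged sketch of a Router-R1-style composite reward (weights and formulas assumed).
import re

def router_reward(output: str, answer: str, gold: str,
                  routed_param_sizes_b, output_tokens: int,
                  alpha: float = 1.0, beta: float = 0.1) -> float:
    # Format reward: the rollout must follow the expected think/answer structure.
    fmt = 1.0 if re.search(r"<think>.*</think>.*<answer>.*</answer>", output, re.S) else 0.0
    # Final outcome reward: exact match against the reference answer.
    outcome = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    # Cost reward: penalize routing to large models and producing long outputs.
    # routed_param_sizes_b: parameter counts (in billions) of the models called this episode.
    cost = sum(routed_param_sizes_b) * output_tokens * 1e-4
    return fmt + alpha * outcome - beta * cost
```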
Embodied Intelligence Reaches Its ImageNet Moment: RoboChallenge Opens the First Large-Scale Real-Robot Benchmark
机器之心· 2025-10-15 10:44
Website: https://robochallenge.ai
Paper: https://robochallenge.ai/robochallenge_techreport.pdf
GitHub: https://github.com/RoboChallenge/RoboChallengeInference
Hugging Face: https://huggingface.co/RoboChallengeAI

Published by 机器之心 (Editorial Department)

The world's first large-scale, multi-task real-robot benchmark platform

Robots are gradually entering the real world, yet the field still lacks a unified, open, and reproducible benchmarking methodology, making it hard to measure technical progress or fairly compare different methods. Changing this requires a large-scale, multi-task real-robot test suite for embodied intelligence, so that researchers can validate and compare robot algorithms in a unified environment, covering everything from basic tasks to complex real-world application scenarios.

Recently, RoboChallenge was launched: the world's first large-scale, multi-task benchmark in which real robots execute manipulation tasks in real physical environments. By building an open, fair, and reproducible "real-world exam hall" on a scientific evaluation system, and by overcoming key challenges such as performance validation in real environments, standardized test conditions, and a publicly accessible test platform, RoboChallenge ...
ICCV 2025 | FDAM: Farewell to Blurry Vision, a Circuit-Theory-Inspired Plug-and-Play Method That Restores High-Definition Detail to Vision Transformers
机器之心· 2025-10-15 07:33
To address the problem that Vision Transformers (ViTs), because of their inherent low-pass filtering behavior, lose fine detail in deep networks, we propose Frequency Dynamic Attention Modulation (FDAM), a plug-and-play module inspired by circuit theory. It cleverly "inverts" attention to generate high-frequency compensation and dynamically rescales the feature spectrum, substantially improving performance on dense prediction tasks such as segmentation and detection at almost no extra computational cost and achieving SOTA results.

The work comes from research teams at Beijing Institute of Technology, RIKEN AIP, and the University of Tokyo.

Paper: https://arxiv.org/abs/2507.12006
Author homepage: https://linwei-chen.github.io
Lab homepage: https://ying-fu.github.io
Code: https://github.com/Linwei-Chen/FDAM

Research background: why is this an important problem?

As shown in the figure above, in a standard ViT, high-frequency information decays rapidly to zero as network depth increases. Fixing this fundamental flaw and unlocking the full potential of ViTs on high-definition vision tasks is a key bottleneck the field urgently needs to break through.

Limitations of existing methods

Prior work has tried to mitigate the "over-smoothing" problem of ViTs, for example through regulariz...
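Because attention averages over tokens, its output is essentially the low-pass component of the features; "inverting" it means treating the residual as a high-frequency signal that can be re-amplified. The PyTorch sketch below illustrates only this intuition, with assumed shapes and a simple learnable gain; it is not the paper's FDAM formulation.

```python
# Rough sketch of high-frequency compensation by "inverting" attention
# (NOT the exact FDAM module from the paper; shapes and the gain are assumptions).
import torch
import torch.nn as nn

class HighFreqCompensation(nn.Module):
    def __init__(self, dim, init_scale=0.5):
        super().__init__()
        # Learnable per-channel gain on the high-frequency residual.
        self.scale = nn.Parameter(torch.full((dim,), init_scale))

    def forward(self, x, attn_out):
        # attn_out = softmax(QK^T / sqrt(d)) V smooths across tokens, acting as a
        # low-pass filter, so the residual x - attn_out is dominated by high frequencies.
        high_freq = x - attn_out
        return attn_out + self.scale * high_freq   # re-inject amplified detail
```

Because the module only reuses tensors the attention block has already computed, the extra cost is a subtraction, a broadcast multiply, and an addition, which is consistent with the "almost no extra computational cost" claim above.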
Sign Up | Raise a Glass at IROS 2025: Share a Drink with the Field's Big Names You Pick!
机器之心· 2025-10-15 07:33
Core Insights
- The article discusses the upcoming IROS 2025 conference in Hangzhou, which is a significant event in the robotics field, bringing together top scholars and covering a wide range of topics from theoretical research to practical applications [2].
- A special closed-door event, TalentAI50 Meetup, will be held during the conference, aimed at young talents who are shaping the future of robotics and AI [3][13].

Event Details
- The TalentAI50 Meetup will feature prominent young scholars from leading universities, including Hong Kong University, Shanghai Jiao Tong University, Tsinghua University, and Zhejiang University, among others [5].
- The event is designed to foster informal discussions without traditional presentations, encouraging networking and collaboration among participants [7].
- The Meetup is limited to 50 attendees, with registration open to authors of papers presented at IROS 2025 [7][9].

Schedule and Logistics
- The event is scheduled for October 22, from 18:00 to 21:00, at a venue near the Hangzhou International Expo Center [8].
- The agenda includes a check-in period, an opening talk, interactive experiences, and a dinner with free networking opportunities [9].