Apple Loses Another AI Executive: Tsinghua Alumnus Ke Yang Joins Meta
机器之心· 2025-10-16 07:34
机器之心 reports: According to Bloomberg, Ke Yang, head of Apple's AKI (Answers, Knowledge and Information) team, has left the company to join Meta Superintelligence Labs, where he will work on turning AI into consumer products. Yet another Apple executive has departed, and the timing is surprising: Ke Yang was appointed head of the team only weeks ago. The AKI team he led is relatively new, assembled earlier this year, and is chiefly responsible for Apple's internal ChatGPT-like AI search project, which aims to ...
Recursive Language Models Arrive: Viral New Work from Chinese Researchers at MIT Makes Extending Model Context Cheap and Simple
机器之心· 2025-10-16 07:34
Core Insights
- The article discusses a limitation of current mainstream large language models (LLMs): as context length grows, performance degrades, a failure known as "context rot" [2][26].
- Researchers from MIT propose a new approach called Recursive Language Models (RLMs) to address this by breaking long contexts into manageable parts and processing them recursively [4][6].

Group 1: RLM Concept and Implementation
- RLMs treat the input context as a variable, allowing the main model to decompose it and interact with it recursively [8][14].
- In the practical implementation, an RLM runs inside a Python REPL environment: the user prompt is stored in a variable and processed iteratively, yielding significant performance improvements (a sketch of the idea follows this summary) [5][17].
- The RLM framework lets the root language model manage context flexibly, avoiding the pitfall of traditional models that must read the entire context at once [23][16].

Group 2: Performance Results
- On the OOLONG benchmark, an RLM built on GPT-5-mini answered more than twice as many questions correctly as GPT-5 (an improvement of over 114%), at a lower average cost per query [28][30].
- RLMs showed no performance degradation even on contexts exceeding 10 million tokens, outperforming methods such as ReAct + retrieval [34][35].
- The framework handles large contexts efficiently, maintaining performance without additional fine-tuning or architectural changes [35][39].

Group 3: Future Implications
- The researchers believe RLMs could become a powerful paradigm for reasoning and context management in LLMs, potentially changing how models handle extensive data [6][7].
- As base LLM capabilities improve, RLMs are expected to scale effectively, potentially managing even larger contexts in the future [37][40].
- The approach holds that language models should autonomously decide how to decompose and process tasks, in contrast to hand-designed agent pipelines [40][41].
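The REPL mechanism described in Group 1 is easy to picture in code. Below is a minimal sketch of the recursive-decomposition idea, assuming a hypothetical llm() helper for base-model calls and a naive fixed-size chunking policy; the paper's actual implementation, in which the model itself writes REPL commands to slice and inspect the context, is more flexible than this.

```python
# Minimal sketch of recursive context decomposition: the full context lives
# in a variable that is split and queried rather than read end-to-end.
# The llm() stub and the chunking policy are hypothetical illustrations.

def llm(prompt: str) -> str:
    """Stand-in for a call to a base language model (hypothetical)."""
    raise NotImplementedError("wire up a real model API here")

def rlm(query: str, context: str, chunk_size: int = 8000) -> str:
    # Base case: the context is small enough to read directly.
    if len(context) <= chunk_size:
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

    # Recursive case: split the stored context variable into chunks,
    # answer the query against each chunk, then merge the partial answers.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [rlm(query, chunk, chunk_size) for chunk in chunks]

    merged = "\n".join(f"- {p}" for p in partials)
    return llm(f"Partial answers from context chunks:\n{merged}\n\n"
               f"Combine these into one final answer to: {query}")
```

Because each sub-call sees only a bounded slice, the root model never ingests the whole context at once, which is the property the authors credit for avoiding context rot.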
ICCV 2025 | Zhejiang University, CUHK, and Others Propose EgoAgent: A Unified First-Person Perception-Action-Prediction Agent
机器之心· 2025-10-16 04:51
Core Insights
- The article discusses the development of EgoAgent, a first-person joint predictive agent model that learns visual representation, human action, and world-state prediction simultaneously, inspired by human cognitive learning mechanisms [2][5][21]
- EgoAgent breaks the traditional separation of perception, control, and prediction in AI, allowing for a more integrated learning approach [6][21]

Group 1: Model Overview
- EgoAgent is designed to simulate the continuous interaction between the human brain, body, and environment, enabling AI to learn through experience rather than observation alone [5][6]
- The model employs a core architecture called JEAP (Joint Embedding-Action-Prediction) that allows joint learning of the three tasks within a unified Transformer framework [6][8]

Group 2: Technical Mechanisms
- EgoAgent utilizes an interleaved "state-action" joint prediction approach, encoding first-person video frames and 3D human actions into a single unified sequence (a toy sketch of this interleaving follows this summary) [8][10]
- The model features a collaborative mechanism between a Predictor and an Observer, enhancing its self-supervised learning capabilities over time [8][10]

Group 3: Performance and Results
- EgoAgent demonstrates superior performance on key tasks, significantly outperforming existing models in first-person world-state prediction, 3D human motion prediction, and visual representation [12][13][15]
- For instance, the 300-million-parameter EgoAgent improved Top-1 accuracy by 12.86% and mAP by 13.05% over the latest first-person visual representation model [13]

Group 4: Future Applications
- The model has broad application prospects, particularly in robotics and AR/VR, enhancing scene perception and interaction capabilities in complex environments [21]
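To make the interleaved sequence in Group 2 concrete, here is a toy PyTorch sketch that alternates frame and action embeddings in a single Transformer input. All dimensions, projections, and layer counts are illustrative assumptions, not the published JEAP architecture.

```python
# Toy interleaved "state-action" encoder: frame embeddings and action
# embeddings alternate as s_1, a_1, s_2, a_2, ... in one Transformer input.
import torch
import torch.nn as nn

class InterleavedStateAction(nn.Module):
    def __init__(self, frame_dim=768, action_dim=63, d_model=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)    # project visual features
        self.action_proj = nn.Linear(action_dim, d_model)  # project 3D pose/actions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, frames, actions):
        # frames:  (batch, T, frame_dim)  -- per-frame visual embeddings
        # actions: (batch, T, action_dim) -- per-step 3D human actions
        f = self.frame_proj(frames)
        a = self.action_proj(actions)
        b, t, d = f.shape
        # Interleave along time: stack to (b, T, 2, d), flatten to (b, 2T, d).
        seq = torch.stack([f, a], dim=2).reshape(b, 2 * t, d)
        return self.encoder(seq)  # joint representation for prediction heads
```

The point of the interleaving is that each action token attends to the states before it and each state token to the actions before it, which is what lets one model serve perception, control, and prediction at once.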
Beating NVIDIA with Just Three to Five Samples: China's First Ultra-Few-Shot Embodied Model Debuts and Wins a Top-Conference Championship
机器之心· 2025-10-16 04:51
Published by 机器之心 (editorial team). China's first few-shot general-purpose embodied manipulation foundation model has been released, bridging the gap between vision-language models and robot manipulation.

Is embodied intelligence finally about to break free of its "data shackles"?

Compared with natural language and vision, data in embodied intelligence is inherently scarce. Real-world robot manipulation involves complex physical interaction, real-time feedback, and changing environments, making data collection costly, inefficient, and hard to scale. As a result, datasets covering hundreds of thousands or millions of physical interactions are rare.

Moreover, while today's vision-language-action (VLA) models already possess strong semantic understanding, at the manipulation level they still depend on large-scale annotated data to compensate for limited generalization.

Enabling embodied robots to learn quickly, execute accurately, and transfer flexibly from very few samples has therefore become the key to moving them out of the lab and into industrial production and human-robot collaboration.

Main experimental results:

Recently, the Chinese general embodied-intelligence startup 中科第五纪 (FiveAges) officially released its new-generation embodied manipulation foundation model, FiveAges Manipulator-1 (FAM-1). Its core architecture builds on the team's NeurIPS 2025 paper "BridgeVLA: Bridging the Gap between Large Vision-Language Model and 3D Robot ...
"Price-Performance King" Claude Haiku 4.5 Arrives: Faster, at One-Third the Cost of Sonnet 4
机器之心· 2025-10-16 04:51
Core Viewpoint
- Anthropic has launched a new lightweight model, Claude Haiku 4.5, pitched as "cheaper and faster" while remaining competitive with the larger Claude Sonnet 4 [2][4].

Model Performance and Cost Efficiency
- Claude Haiku 4.5 offers coding performance comparable to Claude Sonnet 4 at a significantly lower price: $1 per million input tokens and $5 per million output tokens, one-third the cost of Claude Sonnet 4 (a quick cost comparison follows this summary) [2][4].
- Its inference speed is more than double that of Claude Sonnet 4 [2][4].
- On specific benchmarks, Claude Haiku 4.5 outperformed Claude Sonnet 4, scoring 50.7% on OSWorld and 96.3% on AIME 2025, versus Sonnet 4's 42.2% and 70.5%, respectively [4][6].

User Experience and Feedback
- Early users, such as Guy Gur-Ari from Augment Code, reported that Claude Haiku 4.5 achieved 90% of Sonnet 4.5's performance, with impressive speed and cost-effectiveness [7].
- Jeff Wang, CEO of Windsurf, noted that Haiku 4.5 blurs the traditional trade-off between quality, speed, and cost, representing a new direction for model development [10].

Safety and Consistency
- Claude Haiku 4.5 has undergone extensive safety and consistency evaluations, showing a lower incidence of concerning behaviors than its predecessor, Claude Haiku 3.5, and improved consistency over Claude Sonnet 4.5 [14][15].
- On these assessments, it is considered Anthropic's "safest model to date" [15].

Market Position and Future Outlook
- Anthropic has released three major AI models within two months, signaling an aggressive competitive strategy [16].
- The company targets $9 billion in annual revenue by year-end, with more ambitious goals for the following year, potentially reaching $20 billion to $26 billion [18].
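The pricing claim is easy to verify with a little arithmetic. The sketch below assumes Sonnet 4's published rates of $3 / $15 per million input / output tokens, a figure from Anthropic's price list rather than from this article.

```python
# Cost comparison for a sample workload at the per-million-token rates above.
haiku_in, haiku_out = 1.00, 5.00      # $ per million tokens (from the article)
sonnet_in, sonnet_out = 3.00, 15.00   # $ per million tokens (assumed)

def cost(rate_in, rate_out, in_tokens, out_tokens):
    """Dollar cost for a workload measured in tokens."""
    return rate_in * in_tokens / 1e6 + rate_out * out_tokens / 1e6

# Example workload: 2M input tokens, 500K output tokens.
haiku = cost(haiku_in, haiku_out, 2_000_000, 500_000)     # $4.50
sonnet = cost(sonnet_in, sonnet_out, 2_000_000, 500_000)  # $13.50
print(f"Haiku 4.5: ${haiku:.2f}  Sonnet 4: ${sonnet:.2f}  ratio: {haiku / sonnet:.2f}")
# -> ratio 0.33, matching the article's "one-third the cost" claim
```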
Google Open-Sources Full-Stack Platform Coral NPU, Enabling Large Models to Run Around the Clock on a Watch
机器之心· 2025-10-16 04:51
Core Viewpoint
- Google is actively advancing its AI capabilities through collaborations and new product launches, including a potential cancer-therapy prediction model, an updated video generation tool, and the introduction of the Coral NPU for low-power AI applications [1][4][35].

Group 1: Coral NPU Overview
- The Coral NPU is positioned as a full-stack, open-source platform aimed at addressing performance, fragmentation, and privacy challenges in deploying AI on low-power edge devices [4][6].
- It is designed to enable continuous AI operation on wearable devices, integrating AI directly into users' personal environments [4][30].
- The architecture is based on the RISC-V instruction set, optimized for low power consumption while providing significant performance capabilities [9][13].

Group 2: Technical Challenges
- The performance gap arises because advanced machine learning models demand more computational resources than edge devices can provide [6].
- Fragmentation costs hinder optimizing machine learning models across diverse proprietary processors, complicating consistent performance [6].
- User trust is essential, necessitating a focus on privacy and security for personal data in AI applications [6][32].

Group 3: Technical Details
- The Coral NPU architecture includes a scalar core for data-flow management, a vector execution unit for large data sets, and a matrix execution unit for neural-network operations [22].
- It supports seamless integration with modern compilers and machine learning frameworks such as TensorFlow, JAX, and PyTorch (a sketch of a typical int8 edge-deployment flow follows this summary) [21][25].
- The development tools provided simplify programming machine learning models, ensuring a consistent experience across hardware [23].

Group 4: Target Applications
- The Coral NPU is designed for ultra-low-power, always-on edge AI applications, particularly environmental sensing systems [30].
- Potential use cases include hardware-enforced privacy, context awareness, audio processing, image processing, and user interaction [31][34].
- Google has partnered with Synaptics to implement the Coral NPU architecture in their new AI-native IoT processors, enhancing edge AI capabilities [35].

Group 5: Future Outlook
- Google aims to build a foundational layer for personal AI through the Coral NPU, fostering a vibrant developer ecosystem [37].
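The framework integration mentioned in Group 3 typically ends in a quantized model artifact. As a minimal sketch, here is the standard TensorFlow Lite post-training full-integer quantization flow that low-power accelerators commonly consume; this is generic TFLite usage, not Coral NPU's own toolchain, which the article does not detail.

```python
# Quantize a tiny always-on sensing model to int8 TFLite, the kind of
# artifact an edge NPU runtime would ingest. The model and data are
# random stand-ins for illustration.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data():
    # Calibration samples for full-integer quantization.
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("sensing_model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```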
From Masked Generation to "Remask" Training: RemeDi Teaches Diffusion Language Models Self-Correction and Reflection
机器之心· 2025-10-16 02:20
Core Insights
- The article introduces RemeDi, a diffusion language model developed by the MAPLE lab at Westlake University, which incorporates a "remask" mechanism for self-reflection and optimization during text generation [2][26].
- RemeDi surpasses existing diffusion language models by identifying and correcting errors in generated text through per-token confidence prediction [8][27].

Group 1: Model Features
- RemeDi's "remask" capability allows it to flag incorrect tokens and correct them by leveraging context from subsequent generation steps (a toy decode loop follows this summary) [5][25].
- The model supports variable-length generation, breaking the fixed-length limitation of traditional diffusion models and enhancing flexibility in text generation [9][27].
- RemeDi employs a dual-stream architecture: a Token Prediction Stream (TPS) predicts token distributions, while an Unmasking Policy Stream (UPS) outputs a confidence score for each token [10][8].

Group 2: Training Methodology
- The training process consists of two phases: supervised fine-tuning (Remask SFT) and reinforcement learning (Remask RL) [12][17].
- During Remask SFT, the model learns to recover masked tokens while also identifying incorrect tokens that need to be remasked [13][12].
- The Remask RL phase optimizes the model's generation trajectories against final outcomes, raising the probability of producing correct final answers [17][20].

Group 3: Experimental Results
- RemeDi demonstrates significant performance improvements on mathematical reasoning, code generation, and general knowledge question answering compared to other diffusion language models [22][27].
- Combining Remask SFT with Remask RL further enhances performance, yielding superior results across various benchmarks [22][24].
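The remask mechanism can be pictured as a decode loop. Below is a toy sketch assuming a hypothetical model_step() stub that returns the TPS's token proposals and the UPS's per-token confidences; the fixed threshold here is an illustration, not RemeDi's actual schedule.

```python
# Toy remask decoding loop: propose tokens, then re-mask low-confidence
# positions so later steps can revise them instead of freezing them forever.
import numpy as np

MASK = -1  # sentinel id for a masked position

def model_step(tokens):
    """Stand-in for the dual-stream model: returns (proposed token ids,
    per-token confidence scores). Hypothetical stub."""
    raise NotImplementedError("wire up a real diffusion LM here")

def remask_decode(length, steps=10, threshold=0.5):
    tokens = np.full(length, MASK, dtype=np.int64)  # start fully masked
    for step in range(steps):
        proposals, confidence = model_step(tokens)
        tokens = proposals.copy()
        if step < steps - 1:
            # Remask: positions the UPS deems unreliable are re-opened
            # for revision in the remaining steps.
            tokens[confidence < threshold] = MASK
    return tokens
```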
After Apple's M5 Chip Launch, the Happiest People Are M1 Holdouts
机器之心· 2025-10-16 02:20
Core Insights
- Apple has launched its new in-house chip, the M5, which significantly upgrades AI computing, graphics performance, and energy efficiency over the previous M4 [1][31].

AI as the Core Highlight
- The M5's peak GPU AI computing performance is over 4 times that of the M4 and over 6 times that of the M1, with each GPU core now equipped with a Neural Accelerator [5][6].
- Applications like Draw Things and webAI will see faster generation speeds on the M5 architecture [7][10].

Graphics Performance Enhancement
- The new GPU architecture provides up to 30% higher graphics performance than the M4 and 2.5 times that of the M1, with a 45% improvement in ray-traced scenes [13][17].
- The M5 also enhances micro-OLED display performance on Apple Vision Pro, rendering roughly 10% more pixels and supporting refresh rates up to 120Hz [15].

CPU and Neural Engine Improvements
- The M5 CPU consists of 10 cores (4 performance and 6 efficiency), with multithreaded performance up to 15% higher than the M4 [19].
- The new 16-core Neural Engine accelerates AI inference and works in conjunction with the CPU and the GPU's Neural Accelerators [20][23].

Memory Bandwidth and Unified Memory
- Unified memory bandwidth rises to 153GB/s, approximately 30% higher than the M4 and more than double that of the M1 (a quick arithmetic check follows this summary) [27][29].
- The higher bandwidth allows larger AI models to run locally and improves the performance of multithreaded applications and creative software [29][30].

Considerations for Potential Buyers
- Despite the improvements, the M5 does not match the performance of the M4 Pro or M4 Max, so consumers with higher performance needs may want to wait for future releases [32][36].
- Upcoming chips such as the M6 and M5 Pro/Max are anticipated to bring significant innovations, including a 2nm process and a touchscreen design, expected in the second half of next year [36][38].
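A quick arithmetic check of the bandwidth claims, assuming Apple's published figures of 120 GB/s for the M4 and 68.25 GB/s for the M1 unified memory (numbers from Apple's spec sheets, not from this article):

```python
# Bandwidth ratios for the M5 against earlier chips.
m5, m4, m1 = 153.0, 120.0, 68.25  # GB/s

print(f"M5 vs M4: {m5 / m4:.2f}x")  # -> 1.27x, i.e. ~30% higher
print(f"M5 vs M1: {m5 / m1:.2f}x")  # -> 2.24x, i.e. more than double
```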
Young People Are Scaring Their Parents with AI-Generated Homeless Men, Drawing 8.1 Million Viewers: This Prank Went Too Far
机器之心· 2025-10-16 02:20
Core Viewpoint
- The article discusses the trend of using AI-generated images of homeless individuals as pranks, particularly targeting parents, causing significant anxiety and panic [3][18][25].

Group 1: Prank Mechanics
- Young people use AI tools like Google Gemini to create realistic images of homeless people in their homes, which they then send to their parents to elicit reactions [11][12].
- The pranks often involve sending multiple images showing the supposed homeless person engaging in various activities, such as eating or using personal items, which escalates the parents' panic [4][6][10].

Group 2: Reactions and Consequences
- Parents typically react with alarm, often attempting to contact their children or even calling the police out of fear for their safety [4][19][22].
- The phenomenon has gained significant traction on social media, with videos receiving millions of views and likes [10][12].

Group 3: Ethical Considerations
- The article raises concerns about the ethical implications of these pranks, which can cause real distress and anxiety, particularly for older individuals unfamiliar with AI technology [18][25].
- It warns that prolonged pranking could lead to unnecessary police involvement, wasting resources and potentially causing serious consequences [19][22].
Just Now: Google's Veo 3.1 Gets a Major Update, Going Head-to-Head with Sora 2
机器之心· 2025-10-16 00:51
Core Insights
- Google has released its latest AI video generation model, Veo 3.1, which enhances audio, narrative control, and visual quality compared to its predecessor, Veo 3 [2][3].
- The new model introduces native audio generation, giving users finer control over a video's emotional tone and narrative pacing during creation [10].

Enhanced Audio and Narrative Control
- Veo 3.1 improves support for dialogue, environmental sound effects, and other audio elements, allowing for a more immersive video experience [5].
- Core Flow features such as "Frames to Video" and "Ingredients to Video" now support native audio generation, and clips can be extended beyond the original 8 seconds to 30 seconds or even longer [6][9].

Richer Input and Editing Capabilities
- The model accepts various input types, including text prompts, images, and video clips, and supports up to three reference images to guide the final output [12].
- New "Insert" and "Remove" features allow more precise editing, though not all functionality is immediately available through the Gemini API [13].

Multi-Platform Deployment
- Veo 3.1 is accessible through several existing Google AI services and is currently in preview, available only in the Gemini API's paid tier (a hedged API sketch follows this summary) [15][16].
- The pricing structure matches the previous Veo model, charging only after successful video generation, which aids budget predictability for enterprise teams [16][21].

Technical Specifications and Output Control
- The model outputs 720p or 1080p video at 24 frames per second [18].
- Users can upload product images to maintain visual consistency throughout a video, simplifying creative production for branding and advertising [19].

Creative Applications
- Google's Flow platform serves as an AI-assisted movie creation tool, while the Gemini API targets developers integrating video generation into their own applications [20].
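For developers, the Gemini API path mentioned above looks roughly like the following. This is a hedged sketch based on the google-genai Python SDK's long-running video-generation interface; the model id string is an assumption, so check the current documentation before relying on it.

```python
# Generate a short clip via the Gemini API and poll until it completes.
# Assumes the google-genai package and a GEMINI_API_KEY in the environment;
# the "veo-3.1-generate-preview" model id is illustrative.
import time
from google import genai

client = genai.Client()

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model id
    prompt="A slow dolly shot through a rain-soaked neon market at night",
)

# Video generation is asynchronous: poll the long-running operation.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("market_night.mp4")
```

Since billing applies only to successfully generated videos, a polling loop like this also marks the point at which a request actually incurs cost.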