机器之心
Latest result from Gao Yang's team at 千寻智能: a vision-only VLA approach learns strong spatial generalization from limited data
机器之心· 2025-09-29 02:52
Imagine just learning to drive: on the practice course, we rehearse fixed maneuvers over and over, braking at a certain spot, turning the wheel at a certain point. Over time these actions become "conditioned memories", and once the environment changes we fumble. Recently, researchers at 千寻智能 noticed a similar phenomenon in imitation-learning-based visuomotor policies and examined it in depth in the paper "Do You Need Proprioceptive States in Visuomotor Policies?".

Paper link: https://arxiv.org/abs/2509.18644
Project page: https://statefreepolicy.github.io

The paper proposes a strategy called the State-free Policy. Compared with a State-based Policy, the robot exhibits strong spatial generalization even when the table height, robot position, and target objects are all strictly fixed in the training data. For example:

- In a pen-picking task, it generalizes across table heights (the standard table height is 80 cm);
- In a clothes-folding task, the robot still completes the task well even when the arm is placed far from its standard position;
- When a whole-body robot fetches a drink from a fridge, it adapts even when the fridge has been moved.

In fact ...
A developer spends a month replicating DeepMind's world model: 3 million parameters suffice for a real-time interactive pixel game
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses the development of TinyWorlds, a minimal world model inspired by DeepMind's Genie 3, capable of generating playable pixel-style environments with only 3 million parameters [1][9][32].

Group 1: Understanding World Models
- World models are a type of neural network that simulates the physical world by generating video, showing emergent capabilities when trained on large-scale video data [5][7].
- The challenge lies in the need for frame-by-frame action labels during training, which limits the use of unannotated video data from the internet [5][6].
- Genie 1's solution was to train an action tokenizer to infer action labels, enabling the use of vast amounts of unannotated video for training [5][6].

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gaming videos, which determine the range of environments the model can generate [11][12].

Group 3: Architecture and Tokenization Strategy
- TinyWorlds employs a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17].
- The architecture combines spatial attention, temporal attention, and a feed-forward network to extract higher-level features [21][22].
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts actions between frames, allowing training on unannotated data [24][26].

Group 4: Training the World Generator
- The dynamics model serves as the system's "brain", predicting future frames from video and actions; performance improves significantly as model size increases [30][32].
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32].
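The factorized attention described above (spatial attention within each frame, then temporal attention across frames, then a feed-forward layer) can be sketched in plain NumPy. This is a minimal illustration under our own assumptions; the function names and the toy feed-forward stand-in are ours, not TinyWorlds' actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head self-attention over the second-to-last axis of x: (..., N, D).
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)        # (..., N, N)
    return softmax(scores) @ x

def space_time_block(video):
    # video: (T, S, D), i.e. T frames, S spatial patches, D channels.
    x = video + attention(video)                         # spatial: mix patches within each frame
    x = x + attention(x.swapaxes(0, 1)).swapaxes(0, 1)   # temporal: mix frames at each patch
    return x + np.tanh(x)                                # toy stand-in for the feed-forward layer

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 16, 8))                     # 4 frames, 16 patches, 8-dim tokens
out = space_time_block(frames)
print(out.shape)                                         # (4, 16, 8)
```

Factorizing attention this way costs O(T·S²) + O(S·T²) rather than O((T·S)²) for full space-time attention, one reason such blocks stay affordable at tiny parameter counts.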
This is what next-generation recommender systems look like: Meta's new RecoWorld research, from "guessing what you like" to "following your instructions"
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses the evolution of recommendation systems, highlighting the limitations of traditional systems that rely on past data and lack real-time interaction with users [2][9].
- Meta's new approach, RecoWorld, introduces a dual-view architecture that allows multi-round interactions between users and the recommendation system, aiming to enhance user retention [3][4].

Group 1: RecoWorld Overview
- RecoWorld features a unique dual-view architecture that simulates user interactions and allows the recommendation system to adjust its content dynamically based on user feedback [4][12].
- The system uses a user simulator that mimics real user behavior, providing feedback such as complaints or likes, which informs the recommendation system's adjustments [13][14].
- This design enables a dynamic feedback loop in which user instructions lead to system adjustments, fostering a two-way dialogue between users and the recommendation system [18].

Group 2: Mechanism and Functionality
- The core mechanism of RecoWorld is a "virtual duet" in which simulated users interact with the recommendation system, helping it learn how to retain users effectively [12][16].
- The user simulator can perform various actions such as clicking, skipping, or liking, and its decisions are influenced by environmental factors and past interactions [14][16].
- The ultimate goal of RecoWorld is to optimize long-term user retention by maximizing session duration and minimizing session gaps, which correlates with daily active users (DAU) [16].

Group 3: Future Implications
- RecoWorld represents foundational infrastructure for recommendation-system research, akin to OpenAI's Gym for reinforcement learning, allowing safe experimentation with new algorithms [21].
- The shift from one-way recommendations to interactive systems signifies a transformation in which users can direct the algorithm, enhancing content personalization [22][24].
- Future recommendation systems are envisioned to be more intelligent and responsive, capable of understanding user preferences and adapting in real time [25][24].
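The simulator-in-the-loop mechanism described above (a simulated user reacts, the recommender adjusts, retention is the objective) can be caricatured in a few lines of Python. Everything here, including the class names, the boredom/patience heuristic, and the score update, is our own toy construction, not Meta's RecoWorld code.

```python
class UserSimulator:
    """Mimics a user with a hidden topic preference, emitting feedback each round."""
    def __init__(self, preferred_topic, patience=3):
        self.preferred_topic = preferred_topic
        self.patience = patience   # consecutive misses before the user leaves
        self.boredom = 0

    def react(self, item_topic):
        if item_topic == self.preferred_topic:
            self.boredom = 0
            return "like"
        self.boredom += 1
        return "leave" if self.boredom >= self.patience else "skip"

class Recommender:
    """Adjusts per-topic scores from feedback, closing the loop the article describes."""
    def __init__(self, topics):
        self.scores = {t: 1.0 for t in topics}

    def pick(self):
        return max(self.scores, key=self.scores.get)

    def update(self, topic, feedback):
        self.scores[topic] += 1.0 if feedback == "like" else -0.5

user = UserSimulator(preferred_topic="sports")
rec = Recommender(["news", "sports", "music"])
session = 0                        # session length doubles as the retention signal
while session < 20:
    topic = rec.pick()
    feedback = user.react(topic)
    rec.update(topic, feedback)
    session += 1
    if feedback == "leave":
        break
print(session, rec.pick())         # 20 sports
```

A real system would train against many heterogeneous simulated users, but even this toy exhibits the objective: keep `session` long by discovering what the simulated user wants before its patience runs out.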
OpenAI accused of deception: user inputs may be secretly routed to a new model, GPT-5-Chat-Safety
机器之心· 2025-09-28 07:05
Core Viewpoint
- The release of GPT-5 has led to significant user dissatisfaction, particularly due to OpenAI's removal of the model selector in ChatGPT, which has sparked online petitions from users demanding the return of the GPT-4o model [1][2].

Group 1
- OpenAI has reinstated the GPT-4o model for ChatGPT Plus users, but issues persist regarding the routing of emotionally charged content to a hidden model called GPT-5-Chat-Safety without user notification [2][3].
- Users have reported that any content deemed "risky", even slightly emotional, is rerouted to the GPT-5-Chat-Safety model, which is not publicly acknowledged by OpenAI [3][4].
- The GPT-5-Chat-Safety model is criticized as inferior to GPT-5, providing shorter and less engaging responses and treating conversations as stories rather than genuine interactions [3][4].

Group 2
- Concerns have been raised about the ethical implications of rerouting user conversations to a model designed for crisis response, especially when most affected dialogues do not involve emergencies [4][6].
- Users have expressed outrage over what they perceive as deceptive practices by OpenAI, arguing that the lack of transparency regarding model changes constitutes a form of fraud [12][19].
- The incident has ignited discussions about AI model transparency and user rights, highlighting the challenge OpenAI faces in maintaining user trust amid rapid technological advancement [29].
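What users describe amounts to a hidden dispatch layer sitting in front of the model selector. The sketch below is purely illustrative: the keyword heuristic and routing logic are our assumptions, since OpenAI has not published how GPT-5-Chat-Safety is triggered; a real classifier would be a model, not a word list.

```python
# Illustrative only: OpenAI has not disclosed its actual routing criteria.
RISK_MARKERS = {"sad", "lonely", "hurt", "scared"}

def route(message: str) -> str:
    """Return the model that would actually serve this message."""
    if set(message.lower().split()) & RISK_MARKERS:
        return "gpt-5-chat-safety"   # hidden fallback, per user reports
    return "gpt-5"                   # the model the user selected

print(route("Explain transformers"))         # gpt-5
print(route("I feel sad and lonely today"))  # gpt-5-chat-safety
```

The transparency complaint is that such a `route` step can return a different model while the interface still labels the reply with the user's chosen one; nothing in the exchange reveals that the swap happened.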
Can ordinary people "train models" now? I fed Xiaohongshu copywriting to openPangu-Embedded-1B and turned it into a personal copywriting expert in a few steps
机器之心· 2025-09-28 07:05
Core Viewpoint
- The article emphasizes the potential of smaller AI models, specifically openPangu-Embedded-1B, to be trained effectively for specific applications, demonstrating that high performance can be achieved without relying on massive models [3][23].

Group 1: Model Introduction and Capabilities
- openPangu-Embedded-1B is a lightweight AI model that can be trained easily with limited resources, making it accessible to ordinary users [3][11].
- Despite its smaller size, the 1B model shows competitive performance against larger models such as Qwen3-1.7B [3][23].

Group 2: Training Process
- The training process involves three simple steps: preparing the dataset, loading the model, and fine-tuning it on the specific data [9][10].
- The training dataset can be sourced from open academic resources such as Hugging Face, which simplifies data collection [9][11].

Group 3: Application and Results
- The article presents a case study in which the model was fine-tuned to generate content in the distinctive style of Xiaohongshu (Little Red Book), showcasing its adaptability [5][19].
- Fine-tuning produced a significant improvement in the model's ability to generate engaging, stylistically appropriate content aligned with the platform's tone [19][21].

Group 4: Advantages of Smaller Models
- Smaller models like openPangu-Embedded-1B have low hardware requirements, making them accessible to a broader audience and easing concerns about computing power [27].
- Efficient training and the ability to customize the model with personal data let users define the model's style and knowledge boundaries [27].
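To see why even a small style dataset can steer a model, here is a deliberately tiny stand-in for the three-step recipe (prepare the dataset, load the model, fine-tune), using a bigram counter in place of openPangu-Embedded-1B. The corpora and all names are invented for illustration; real fine-tuning adjusts weights by gradient descent rather than counts.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Step 2 stand-in: 'load' a model by counting bigrams."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def next_word(model, word):
    # Most likely continuation under the current counts.
    return model[word].most_common(1)[0][0]

base = ["this product is reliable", "this gadget is reliable"]   # base behavior
style = ["this product is amazing"] * 3                          # step 1: tiny style dataset

model = train(base)
print(next_word(model, "is"))    # reliable
for line in style:               # step 3: fine-tune by folding in the style counts
    words = line.split()
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
print(next_word(model, "is"))    # amazing
```

The effect is analogous to what the article reports: a handful of in-style examples is enough to shift the model's most likely continuation toward the target tone.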
Abandon CoT? Why the agentic era needs implicit reasoning more
机器之心· 2025-09-28 07:05
Group 1
- The article discusses the limitations of Chain of Thought (CoT) reasoning in AI, highlighting its inability to break the "1Hz" barrier and suggesting that implicit reasoning may be a more suitable approach for agentic AI [7][8][10].
- Recent studies indicate that CoT may not represent true reasoning but rather structured pattern matching, which can degrade performance on tasks requiring inductive reasoning [9][10].
- The high computational cost and time consumption of explicit reasoning make it less viable for real-time applications, necessitating a shift toward implicit reasoning that can adapt to varying task complexity [10][11].

Group 2
- Implicit reasoning is gaining traction because it allows faster processing at lower cost, making it better suited to real-time AI applications than the traditional "Think-before-Speaking" (TbS) model [11][12].
- The article emphasizes the need for AI agents to dynamically adjust reasoning depth and speed based on task difficulty, a key capability for future AI development [10][11].
- Challenges remain for implicit reasoning, particularly in high-stakes scenarios where accuracy and verifiability are paramount, such as legal document analysis and medical diagnostics [13][14].
"From follower to leader, how far is the road?" We talked with front-line CANN developers
机器之心· 2025-09-28 04:50
Core Viewpoint
- The article discusses the transformation of the AI industry, emphasizing that competition has shifted from hardware capability to a battle over software, developers, and ecosystem building, with Huawei's Ascend and its heterogeneous computing architecture CANN at the forefront of this change [1][4].

Summary by Sections

CANN Open Source Announcement
- Huawei's rotating chairman Xu Zhijun announced that the CANN hardware enabling will be fully open-sourced by December 30, 2025 [2].

Significance of CANN Open Source
- Open-sourcing CANN represents a profound self-revolution in domestic AI infrastructure, aiming to break the closed model traditionally dominated by hardware manufacturers and embrace a more open, community-driven future [4][19].
- The ecosystem's success relies on attracting academic innovation and providing developers with a stable, universal, and efficient foundational tool [5][18].

Developer Perspectives on CANN
- Developers describe CANN's evolution as a challenging journey, with early versions requiring low-level programming skills that hindered productivity [10][11].
- The introduction of the Ascend C programming language marked a significant improvement, aligning more closely with mainstream programming practices [15].

Challenges Faced by Developers
- Early developers faced high technical barriers and a lack of stable architecture, leading to a difficult development environment [11][13].
- Systemic issues persisted, such as the inability to reproduce model accuracy across different frameworks due to a lack of transparency in the underlying systems [17].

The Role of Open Source
- Open-sourcing CANN is seen as a way to break down technical barriers and empower developers by providing transparency and control over the platform [21][23].
- The open-source model aims to foster a vibrant community where developers can contribute and innovate, moving away from reliance on a few official experts [29].

Ecosystem Empowerment
- Open source provides unprecedented opportunities for deep integration between academia and industry, allowing researchers to address real-world problems and convert solutions into academic contributions [26].
- The shift from users to contributors is expected to cultivate a new generation of developers who can engage in high-quality projects [28].

Future Outlook for CANN
- The current focus is on matching CUDA's capabilities while fostering original innovation within the CANN ecosystem [44].
- Huawei has committed substantial resources, including 1,500 petaflops of computing power and 30,000 development boards annually, to support the open-source community [45].
Both RLHF and RLVR: Danqi Chen's team's latest work extends reasoning ability toward general intelligence
机器之心· 2025-09-28 04:50
A month ago we reported that Danqi Chen (陈丹琦), a Tsinghua Yao Class alumna and Princeton professor, appeared to be joining Thinking Machines Lab. Some reports suggested that after a year of leave she would depart Princeton and join Thinking Machines Lab full-time.

Recently, Chen's team at Princeton released new work showing that the RLVR paradigm remains effective beyond verifiable domains. They propose Reinforcement Learning with Model-rewarded Thinking (RLMT), which integrates explicit chain-of-thought reasoning into general chat models.

Paper title: Language Models that Think, Chat Better
Paper link: https://arxiv.org/abs/2509.20357v1

Report by 机器之心; editor: 冷猫.

Thinking about the consequences of one's own actions, and correcting them when necessary, is a core feature of human intelligence. As is well known, large language models traditionally follow a multi-stage training paradigm: first pretraining on large-scale text corpora, then supervised fine-tuning to learn instruction following, and finally reinforcement learning to align with human preferences. This approach has indeed produced powerful conversational AI systems, but a key limitation remains: the reasoning ability acquired through reinforcement learning with verifiable rewards (RLVR) in domains such as math and programming, ...
Accepted at NeurIPS: Genesis pioneers a new paradigm for multimodal generation without OCC guidance, reaching SOTA on video and LiDAR metrics
机器之心· 2025-09-28 04:50
Huazhong University of Science and Technology and Xiaomi EV propose Genesis, the industry's first joint image/point-cloud multimodal generation framework that requires no OCC guidance. Given only a scene description and a layout (including lane lines and 3D boxes), the method generates realistic image and point-cloud videos. To guide generation with structured semantics, the paper introduces DataCrafter, a VLM-based data annotation module that provides scene-level and instance-level descriptions. Extensive experiments on the nuScenes benchmark show that Genesis reaches SOTA on both video and LiDAR metrics.

Paper link: https://arxiv.org/abs/2506.07497
GitHub: xiaomi-research/genesis
Paper title: Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Genesis adopts a two-stage architecture. In the first stage, conditioned on perspective-projected layouts and scene descriptions, a DiT-based diffusion model learns the surround-view features encoded by a 3D variational autoencoder. In the second stage, the multi-view video sequences from the first stage are mapped into the bird's-eye-view feature space and combined with the scene description and ...
Understanding 鲸智百应 in one article: an enterprise AI operating system that drives organizational evolution, taking enterprises from "using AI" to "being AI"
机器之心· 2025-09-28 04:50
Released via 机器之心; 机器之心 editorial department.

The six dimensions of "unified cognition, intelligent execution, decision hub, memory evolution, agent factory, and AI governance" take enterprises completely out of the tool mindset of "using AI" and turn them into "AI-native organizations".

Walk into any large or mid-sized enterprise and "system hopping" is already routine: employees switch among five or more business systems every day to get work done; 80% of production data lies dormant in ERP, CRM, and OA silos where it cannot be used; AI tools remain stuck at "Q&A-style assistance" rather than "end-to-end execution" ... Core assets that should drive business iteration have become "data islands" that are visible but unusable, and enterprise digitalization has long been trapped in "tool stacking rather than value reconstruction". One enterprise CTO's remark is representative: "Every system is professional on its own, yet for a complex business task we cannot even piece together a complete analysis report."

At the 2025 Apsara Conference (云栖大会), while most players were still focused on "agents", 浩鲸科技 officially launched 鲸智百应, whose positioning as an "enterprise AI operating system" opens a differentiated gap. According to 杨名, director of 浩鲸科技 and president of its cloud intelligence business, 鲸智百应 is not a simple stacking of features; rather, across the six dimensions of "unified cognition, intelligent execution, decision hub, memory evolution, agent factory, and AI governance", it takes enterprises completely out of the "using AI" tool mindset and makes them organizations with the capacity to perceive, think, and act ...