机器之心

Token crisis solved? Diffusion models get 3x more out of data than autoregressive models, with performance still climbing after 480 epochs of retraining
机器之心· 2025-08-10 04:31
机器之心 report. Editor: Du Wei.
Diffusion language models (DLMs) are exceptionally strong data learners. Is the token crisis finally about to disappear? Recently, National University of Singapore AI researcher Jinjie Ni and his team took a key step toward resolving the token crisis. One of the challenges in the continued development of large language models (LLMs) is that the supply of high-quality training text (tokens) is close to exhaustion and has become a key bottleneck on further performance gains. In addition, new sources of high-quality data are scarce and costly to acquire, and become even scarcer after deduplication. As model scale keeps growing and the required data volume multiplies according to scaling laws, this produces a crisis of "not enough high-quality tokens to train on". In response, the team pretrained diffusion language models (DLMs) and autoregressive (AR) models from scratch, at scales up to 8 billion parameters, 480 billion tokens, and 480 epochs. The study reports three key findings. The team also dissected serious methodological flaws in the concurrent study "Diffusion Beats Autoregressive in Data-Constrained Settings", with the aim of jointly raising the standards of open review. Jinjie Ni described the team's conclusions and methodology in detail on X; next ...
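As a rough, back-of-the-envelope illustration of the scaling-law pressure mentioned above (this is not a figure from the article), the snippet below contrasts the roughly 20-tokens-per-parameter Chinchilla heuristic with the total token count obtained by repeating a fixed corpus over many epochs; the corpus size used here is an arbitrary example, not the study's actual setup.

```python
# Back-of-the-envelope illustration (not from the article) of the
# scaling-law pressure on data: the Chinchilla heuristic of roughly
# 20 training tokens per parameter versus the total tokens seen when a
# fixed corpus is repeated for many epochs.
params = 8e9                              # 8B-parameter model
chinchilla_unique_tokens = 20 * params    # ~1.6e11 unique tokens "wanted"
corpus_unique_tokens = 1e10               # example: a fixed 10B-token corpus
epochs = 480
total_tokens_seen = corpus_unique_tokens * epochs
print(f"wanted ~{chinchilla_unique_tokens:.1e} unique tokens; "
      f"saw {total_tokens_seen:.1e} tokens via {epochs} epochs of repetition")
```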
Tencent's Zhang Zhengyou: the three "real questions" embodied intelligence must answer
机器之心· 2025-08-10 04:31
Core Viewpoint
- Tencent has launched the Tairos platform for embodied intelligence, aiming to provide a modular support system for the development and application of large models, development tools, and data services [2][3].
Group 1: Platform Development
- The Tairos platform is a culmination of over seven years of research by Tencent's Robotics X Lab, which has developed various robotic prototypes to explore full-stack robotic technologies [2][3].
- The establishment of the Tairos platform reflects Tencent's response to current industry challenges and its strategic positioning for future ecosystems [2][3].
Group 2: Architectural Choices
- The debate between end-to-end and layered architectures in embodied intelligence is ongoing, with a preference for layered architecture due to its efficiency and practicality [4][5].
- Layered architecture allows for the integration of human prior knowledge into model structures, enhancing training efficiency and reducing data dependency [6][7].
Group 3: Knowledge Feedback Mechanism
- The SLAP³ architecture proposed by Tencent includes multi-modal perception models, planning models, and action models, with dynamic collaboration and information flow between layers based on task complexity [7][11].
- A memory bank captures unique interaction data from the action model, which can be used to update the perception and planning models, creating a feedback loop for continuous learning [11][12].
Group 4: Evolution of Models
- The architecture is designed for continuous iteration, allowing for the adjustment of prior knowledge as new insights are gained, similar to the evolution of the Transformer architecture [12][15].
- The goal is to transition towards a more efficient and native multi-modal intelligence form, despite current limitations in data availability and model exploration [15][16].
Group 5: Innovation and Commercialization
- The influx of talent and capital into the embodied intelligence field is beneficial, but there is a need for balance between short-term commercial gains and long-term technological goals [23][24].
- Companies must maintain a clear vision of their ultimate objectives and have the courage to forgo immediate commercial opportunities to focus on foundational scientific challenges [25].
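Group 3 above describes the layered SLAP³ stack and its memory-bank feedback loop in prose only. The Python sketch below shows, under assumed class and method names, how a perception, planning, and action stack with such a feedback loop could be wired together; it illustrates the layered-architecture idea, not Tencent's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical sketch of a layered (perception -> planning -> action) agent
# with a memory bank feeding interaction data back to the upper layers.
# All class and method names are illustrative, not Tencent's API.

@dataclass
class MemoryBank:
    records: List[dict] = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.records.append(record)

    def sample_for_update(self, k: int = 32) -> List[dict]:
        # Return the most recent interactions for offline model updates.
        return self.records[-k:]

class PerceptionModel:
    def encode(self, observation: Any) -> dict:
        # Multimodal encoding (vision, proprioception, language) would go here.
        return {"scene": observation}

class PlanningModel:
    def plan(self, scene: dict, goal: str) -> List[str]:
        # Decompose the goal into subtasks using the encoded scene.
        return [f"step toward: {goal}"]

class ActionModel:
    def act(self, subtask: str) -> dict:
        # Low-level control; returns the interaction outcome.
        return {"subtask": subtask, "success": True}

def run_episode(goal: str, observation: Any, memory: MemoryBank) -> None:
    perception, planner, actor = PerceptionModel(), PlanningModel(), ActionModel()
    scene = perception.encode(observation)
    for subtask in planner.plan(scene, goal):
        outcome = actor.act(subtask)
        # Interaction data captured by the action layer is stored so the
        # perception/planning layers can later be updated from it.
        memory.store({"scene": scene, "subtask": subtask, "outcome": outcome})

memory = MemoryBank()
run_episode("pick up the cup", observation="rgb_frame_0", memory=memory)
print(len(memory.sample_for_update()))
```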
The missing piece for unified understanding and generation? Tencent releases X-Omni: reinforcement learning revives discrete autoregressive generation and renders long-text images with ease
机器之心· 2025-08-10 04:31
Core Insights
- The article discusses the advancements in image generation technology, particularly focusing on the X-Omni model developed by Tencent's team, which significantly enhances the quality of autoregressive image generation through reinforcement learning [2][4][5].
Group 1: Model Development
- The X-Omni model utilizes reinforcement learning to improve the aesthetic quality of generated images and its ability to follow complex instructions, showcasing superior performance in rendering long texts [5][6].
- The model architecture is based on discrete tokens and employs a diffusion decoder to generate images, allowing for a unified approach to visual understanding and generation [6][11].
Group 2: Reinforcement Learning Approach
- The reinforcement learning process incorporates a comprehensive reward model that evaluates image generation quality from multiple dimensions, including human aesthetic preferences and text-image semantic alignment [9][12].
- The introduction of the GRPO reinforcement learning method enhances the model's image generation capabilities, demonstrating that RL optimization surpasses traditional supervised fine-tuning methods [8][19].
Group 3: Performance Evaluation
- The X-Omni model outperforms existing models in various benchmarks, achieving high scores in both text rendering and instruction-following capabilities, with scores of 0.901 in English and 0.895 in Chinese for text rendering [13][14].
- In instruction-following assessments, X-Omni achieved an overall score of 87.65, indicating its effectiveness in understanding and executing complex prompts [14].
Group 4: Unique Findings
- Unlike traditional autoregressive models that rely heavily on classifier-free guidance (CFG) to enhance generation quality, X-Omni can produce high-quality images without CFG, demonstrating a high degree of integration between visual and language generation mechanisms [17].
- The research highlights the unique advantages of reinforcement learning in image generation, providing more comprehensive and efficient optimization signals compared to conventional methods [19].
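The summary above describes a multi-dimensional reward plus GRPO optimization without giving code. Below is a minimal Python sketch of how such a composite reward and group-relative advantages might be computed; the scorer names, weights, and dummy values are assumptions for illustration, not X-Omni's actual configuration.

```python
import statistics
from typing import Callable, Dict

# Hypothetical sketch: several scoring dimensions (aesthetic preference,
# text-image alignment, fidelity of rendered text) combined into one scalar,
# plus the group-relative advantage normalization used by GRPO-style updates.

def composite_reward(image, prompt,
                     scorers: Dict[str, Callable],
                     weights: Dict[str, float]) -> float:
    return sum(weights[name] * scorer(image, prompt)
               for name, scorer in scorers.items())

def grpo_advantages(group_rewards):
    # Advantage of each sample relative to its prompt group (GRPO-style).
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0
    return [(r - mu) / sigma for r in group_rewards]

# Dummy scorers standing in for learned reward models.
scorers = {
    "aesthetic": lambda img, p: 0.8,   # human-preference score
    "alignment": lambda img, p: 0.7,   # text-image semantic alignment
    "ocr":       lambda img, p: 0.9,   # accuracy of text rendered in the image
}
weights = {"aesthetic": 0.3, "alignment": 0.4, "ocr": 0.3}

r = composite_reward(None, "a poster with the slogan 'X-Omni'", scorers, weights)
print(r, grpo_advantages([r, 0.5, 0.9]))
```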
After 40 years, the limit of Dijkstra's algorithm is broken again: Tsinghua's Ran Duan team wins the STOC Best Paper award with a faster shortest-path algorithm
机器之心· 2025-08-10 04:00
Core Viewpoint
- The article discusses groundbreaking research by a team from Tsinghua University that presents a new algorithm for finding shortest paths in directed graphs, which significantly improves upon the traditional Dijkstra algorithm by eliminating unnecessary sorting steps, thus reducing computational complexity [9][13][17].
Summary by Sections
Dijkstra Algorithm Overview
- Dijkstra's algorithm, introduced in 1956, is a classic method for finding the shortest path from a source node to all other nodes in a graph, widely used in applications such as network routing and map navigation [11].
- The algorithm operates by repeatedly selecting the unsettled node with the smallest tentative distance and updating the distances of its adjacent nodes until all shortest paths are found [11][17].
New Research Breakthrough
- The new algorithm breaks the O(m + n log n) time barrier achieved by Dijkstra's algorithm with Fibonacci-heap priority queues on sparse graphs, demonstrating that Dijkstra is not the optimal approach to the single-source shortest path (SSSP) problem [17][18].
- The research introduces a deterministic O(m log^(2/3) n) time algorithm for SSSP in directed graphs with non-negative real edge weights, marking a significant advancement in the field [17][18].
Methodology and Implementation
- The new approach focuses on calculating distances without fully sorting them, using a layered recursive method to group nodes and performing detailed shortest-path calculations only on key nodes [13][14].
- The algorithm employs a divide-and-conquer strategy, reducing the size of the frontier set of nodes and thus minimizing the overhead of maintaining a globally ordered set of nodes [22][24].
Technical Details
- The algorithm is designed for constant-degree graphs and operates in the comparison-addition model, where each operation takes unit time [17][19].
- It introduces a bounded multi-source shortest path (BMSSP) subproblem, allowing efficient distance calculations without sorting all nodes [24][27].
Conclusion
- This research not only improves the efficiency of shortest-path computation but also opens new avenues for further exploration in graph algorithms, potentially impacting the many fields that rely on efficient routing and pathfinding [9][13].
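For reference, here is the textbook Dijkstra routine the overview above describes, as a small Python sketch. The binary heap effectively keeps the frontier ordered by tentative distance, which is exactly the per-node ordering cost the new O(m log^(2/3) n) algorithm is designed to avoid. This is the classic algorithm, not the Tsinghua team's new one.

```python
import heapq
from typing import Dict, List, Tuple

# Classic Dijkstra with a binary heap. Assumes non-negative edge weights.
def dijkstra(graph: Dict[str, List[Tuple[str, float]]], source: str) -> Dict[str, float]:
    dist = {source: 0.0}
    heap = [(0.0, source)]                       # (tentative distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                             # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd                     # relax edge (u, v)
                heapq.heappush(heap, (nd, v))
    return dist

g = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 2.0)], "b": []}
print(dijkstra(g, "s"))   # {'s': 0.0, 'a': 1.0, 'b': 3.0}
```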
Embodied intelligence amid a data deadlock: who will break through first?
机器之心· 2025-08-10 01:30
机器之心 PRO · Member Newsletter, Week 32 --- This week we unpack 2 AI & Robotics industry developments worth a closer look ---
1. Embodied intelligence amid a data deadlock: who will break through first? Is real-world data necessarily the unavoidable path to general-purpose robots? Is synthetic data forever limited to "padding out volume"? Can teleoperation, currently the most direct way to collect data, strike a sustainable balance between control efficiency and scalability? Does large-scale deployment of Sim2Real require a "standardized simulation" platform? In multimodal teleoperation systems, does the fusion of language + gestures + touch mean that technology is actively lowering the bar for human operators? ...
2. OpenAI board chair: "billing by token" is dead wrong; the market will ultimately choose "paying for outcomes". Why does Bret Taylor call "applied AI" the way forward for founders? How will "long-tail agent companies" replace traditional SaaS? What is the fundamental flaw of "billing by token"? Why will the AI market ultimately settle on "pay for outcomes"? How does an outcome-oriented business model accommodate today's AI shortcomings? How has Bret Taylor's business model fared in practice at Sierra? What is the new paradigm for AI programming? ...
The full issue of this newsletter contains 2 in-depth analyses + 30 quick updates on key AI & Robotics developments, ...
GPT-5 has too many problems; Altman and his team respond to everything, and the botched charts were because they were "too tired"
机器之心· 2025-08-09 06:02
机器之心 report · 机器之心 editorial team.
The higher the expectations beforehand, the greater the disappointment afterward: that is probably how most people in the industry felt after watching GPT-5's loudly trailed, high-profile launch. Perhaps during internal testing OpenAI really did believe GPT-5 was its strongest model yet, but once it entered the real world that no longer seemed to be the case. One X user found GPT-5 helpless in front of what is arguably elementary-school math and quipped: which school awarded the officially touted "PhD-level" intelligence? And it is not just math: since GPT-5's release, social media has been flooded with cases of GPT-5 "slipping up" on logic and coding tasks. Between the early hype, the sloppy chart errors during the livestream, and users' disappointment after trying it, GPT-5 was met less with flowers and applause than with complaints and skepticism, and OpenAI co-founder and CEO Sam Altman seemed unable to sit still any longer, conceding that there were indeed some problems with GPT-5's rollout. Shortly after the launch, in an AMA on Reddit r/ChatGPT, Sam Altman and core members of the GPT-5 team answered users' questions, ranging from the embarrassing "chart crime" at the launch event to users' complaints that GPT ...
ARPO: Agentic Reinforced Policy Optimization lets agents explore one step further at critical moments
机器之心· 2025-08-09 06:02
Core Viewpoint
- The article introduces a novel method called Agentic Reinforced Policy Optimization (ARPO), designed to enhance the performance of large language models (LLMs) in multi-round interactions by addressing the challenges of uncertainty and exploration during tool usage [3][41].
Group 1: Research Motivation and Background
- The emergence of Agentic Reinforcement Learning (RL) is driven by the need for LLMs to engage in dynamic multi-round interactions with external tools, moving from static problem-solving to a more interactive agent-environment reasoning paradigm [8].
- Existing Agentic RL methods often underestimate the value of multi-round interactions due to sparse rewards and overuse of tools, leading to a lack of fine-grained exploration of tool usage [8][41].
- The study identifies a significant increase in entropy (uncertainty) after tool calls, indicating an opportunity for exploration that current methods do not fully leverage [14][16].
Group 2: ARPO Methodology
- ARPO introduces an entropy-driven adaptive rollout strategy that enhances exploration during high-entropy tool usage phases, allowing for more diverse reasoning paths [11][20].
- The method includes four key steps: initialization of global rollout, monitoring entropy changes, adaptive branching based on entropy, and defining termination conditions for the rollout process [24][27].
- ARPO incorporates advantage attribution estimation to help the model better internalize the value differences in tool usage at each step [28][30].
Group 3: Experimental Results
- ARPO outperforms existing sample-level RL methods, achieving better performance with only half the tool call budget across 13 challenging benchmarks, demonstrating its efficiency in training multi-round reasoning agents [21][41].
- The method shows consistent improvements in performance metrics such as Pass@3 and Pass@5, particularly in dynamic, multi-round tasks [37][39].
- In comparative tests, ARPO achieves higher accuracy than GRPO and DAPO in various tasks, including deep search and knowledge-intensive reasoning [41][42].
Group 4: Future Directions
- Future research may explore the application of ARPO in multi-modal tasks, expanding its capabilities beyond text-based reasoning to include images and videos [42].
- There is potential for integrating a broader range of external tools to enhance complex task performance through optimized tool usage strategies [42].
- The scalability and real-time deployment of ARPO in larger models and dynamic environments could further improve its practical value and cost-effectiveness [42].
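As a rough illustration of the entropy-monitoring step described in Group 2 above, the sketch below computes token-level entropy before and after a tool call and branches extra rollouts when the increase crosses a threshold. The threshold, branching factor, and probability values are illustrative assumptions, not ARPO's actual settings.

```python
import math
from typing import List

# Minimal sketch: measure policy token-entropy after a tool response and,
# if it spikes relative to a baseline, branch additional partial rollouts
# from that point in the trajectory.

def token_entropy(probs: List[float]) -> float:
    # Shannon entropy of a next-token distribution (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(entropy_after_tool: float, baseline_entropy: float,
                  threshold: float = 0.3) -> bool:
    # Branch when uncertainty rises noticeably after the tool response.
    return (entropy_after_tool - baseline_entropy) > threshold

baseline = token_entropy([0.8, 0.15, 0.05])   # before the tool call
after_tool = token_entropy([0.4, 0.3, 0.3])   # after the tool response
if should_branch(after_tool, baseline):
    n_branches = 2   # sample extra partial rollouts from this prefix
    print(f"high entropy after tool call -> branch {n_branches} rollouts")
```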
ICCV 2025 | A new backdoor attack takes aim at Scaffold federated learning: NTU and 0G Labs reveal security vulnerabilities in centralized training
机器之心· 2025-08-09 03:59
Core Viewpoint
- The article introduces BadSFL, a novel backdoor attack method specifically designed for the Scaffold Federated Learning (SFL) framework, highlighting its effectiveness, stealth, and persistence compared to existing methods [2][39].
Group 1: Background on Federated Learning and Scaffold
- Federated Learning (FL) allows distributed model training while protecting client data privacy, but its effectiveness is heavily influenced by the distribution of training data across clients [6][10].
- In non-IID scenarios, where data distribution varies significantly among clients, traditional methods like FedAvg struggle, leading to poor model convergence [7][10].
- Scaffold was proposed to address these challenges by using control variates to correct client updates, improving model convergence in non-IID settings [7][12].
Group 2: Security Vulnerabilities in Scaffold
- Despite its advantages, Scaffold introduces new security vulnerabilities, particularly against malicious clients that can exploit the model update mechanism to inject backdoor behaviors [8][9].
- The reliance on control variates in Scaffold creates a new attack surface, allowing attackers to manipulate these variates to guide benign clients' updates towards malicious objectives [9][16].
Group 3: BadSFL Attack Methodology
- BadSFL operates by subtly altering control variates to steer benign clients' local gradient updates in a "poisoned" direction, enhancing the persistence of backdoor attacks [2][9].
- The attack utilizes a GAN-based data poisoning strategy to enrich the attacker's dataset, maintaining high accuracy for both normal and backdoor samples while remaining covert [2][11].
- BadSFL demonstrates superior persistence, maintaining attack effectiveness for over 60 rounds, which is three times longer than existing benchmark methods [2][32].
Group 4: Experimental Results
- Experiments conducted on MNIST, CIFAR-10, and CIFAR-100 datasets show that BadSFL outperforms four other known backdoor attacks in terms of effectiveness and persistence [32][33].
- In the initial 10 rounds of training, BadSFL achieved over 80% accuracy on backdoor tasks while maintaining around 60% accuracy on primary tasks [34].
- Even after the attacker ceases to upload malicious updates, BadSFL retains backdoor functionality significantly longer than benchmark methods, demonstrating its robustness [37][38].
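To make the attack surface concrete, the sketch below shows a simplified SCAFFOLD-style client update in Python, where each local gradient step is corrected by control variates; these are exactly the quantities a malicious client could bias to steer benign clients' corrected updates. This is a toy reconstruction of the standard SCAFFOLD client step with illustrative hyperparameters, not the paper's code or the attack itself.

```python
import numpy as np

# Simplified SCAFFOLD client update: local gradient steps are corrected by
# (global control variate - local control variate), and the local control
# variate is refreshed after training ("option II" style refresh).

def scaffold_client_update(x_global, c_global, c_local, grad_fn,
                           lr=0.1, local_steps=5):
    y = x_global.copy()
    for _ in range(local_steps):
        g = grad_fn(y)
        # Corrected step: client drift relative to the global direction removed.
        y -= lr * (g - c_local + c_global)
    # Refresh the client's control variate from the net progress made locally.
    c_local_new = c_local - c_global + (x_global - y) / (local_steps * lr)
    delta_x, delta_c = y - x_global, c_local_new - c_local
    return delta_x, delta_c, c_local_new

# Toy quadratic objective: gradient of 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0])
grad_fn = lambda w: w - target

x, c_g, c_l = np.zeros(2), np.zeros(2), np.zeros(2)
dx, dc, c_l = scaffold_client_update(x, c_g, c_l, grad_fn)
print(dx, dc)
```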
Users tear into GPT-5 and plead "give me back GPT-4o"; Altman relents
机器之心· 2025-08-09 03:59
机器之心 report · 机器之心 editorial team.
For people who had grown used to the older models, this change is genuinely hard to take, and many users want these "old friends" back as soon as possible, especially GPT-4o.
4o is back; is it available on your end yet?
After a long wait, GPT-5 finally arrived, but people do not seem satisfied with the model. For those who already have access to GPT-5, this is what the page looks like now: the previous models have all disappeared. The reason is that, as part of the GPT-5 launch, OpenAI removed the model picker from ChatGPT. That drop-down menu used to gather OpenAI's confusingly named lineup of models and let users switch between them for different needs, for example choosing GPT-4o for complex tasks or the more efficient o4-mini for lighter work, as well as switching between generations, say from last year's GPT-4o to the newer GPT-4.1. This is how it used to look. With the new release, however, OpenAI has made GPT-5 the default model in ChatGPT and automatically assigns users different sub-versions depending on the task type. To vent their frustration, many people turned to memes, funny and resigned in equal measure.
Source: https://x.com/pengkeshen281/ ...
Shanghai AI Lab, Zhejiang University EagleLab and others propose RRVF: exploiting the "asymmetry of verification" to learn visual reasoning from images alone
机器之心· 2025-08-09 03:59
Core Insights
- The article discusses the concept of "Asymmetry of Verification," which posits that verifying the quality of a solution is often easier than creating one from scratch, thus reshaping the future of AI [3][4].
- The RRVF (Reasoning-Rendering-Visual-Feedback) framework exemplifies how to leverage this principle to tackle complex visual reasoning challenges [4][19].
Summary by Sections
Research Background
- The research was conducted by a team from Shanghai AI Lab, Zhejiang University EagleLab, and Shanghai Chuangzhi Academy, focusing on multimodal large models and reasoning [2].
Verification Asymmetry
- The principle of verification asymmetry suggests that tasks with objective truths and quick verification can be efficiently solved by AI through iterative guess-and-check methods [3].
RRVF Framework
- RRVF operates without expensive image-text paired data, allowing models to self-validate in a closed-loop system [9][11].
- The framework consists of three main components: Iterative Visual Reasoning, Visual Feedback, and Visual Judge, which collectively enhance the model's learning process [11][12][13].
Experimental Results
- RRVF demonstrated superior performance compared to traditional supervised fine-tuning (SFT), achieving a code execution rate of 97.83% without any standard code answers [21].
- The 7B model trained with RRVF outperformed the 72B model that provided feedback, showcasing a self-learning effect [22].
- RRVF maintained high performance on unseen datasets, indicating strong generalization capabilities [23].
Implications for AI Development
- The findings suggest that the future bottleneck in AI development may lie in designing efficient verification environments rather than solely in model size [23].
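The closed loop described above (reason, render, judge, feed back) can be summarized in a few lines of Python. The sketch below is a hypothetical rendering of that cycle, with placeholder functions standing in for the model, the renderer, and the visual judge; it is not the RRVF implementation, and the accept threshold and iteration count are assumptions.

```python
from typing import Callable, Tuple

# Toy sketch of a reasoning -> rendering -> visual-feedback loop:
# the model proposes code from a target image, the code is rendered,
# a visual judge scores the match, and the feedback drives the next try.

def rrvf_loop(target_image, propose_code: Callable, render: Callable,
              judge: Callable, max_iters: int = 4,
              accept_score: float = 0.95) -> Tuple[str, float]:
    feedback = ""
    best_code, best_score = "", 0.0
    for _ in range(max_iters):
        code = propose_code(target_image, feedback)      # reasoning step
        rendered = render(code)                          # rendering step
        score, feedback = judge(rendered, target_image)  # visual feedback step
        if score > best_score:
            best_code, best_score = code, score
        if score >= accept_score:
            break
    return best_code, best_score

# Stand-ins so the loop runs end to end.
propose = lambda img, fb: "plot(x, y)" if fb else "plot(x)"
render = lambda code: f"<render of {code}>"
judge = lambda out, img: (0.97, "") if "x, y" in out else (0.6, "y-axis missing")
print(rrvf_loop("target.png", propose, render, judge))
```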