机器之心
At the Crossroads of Embodied Intelligence, This Forum Dug Deep into Data, Models, and Infra
机器之心· 2025-09-29 02:52
Core Viewpoint
- The field of embodied intelligence is receiving unprecedented attention, yet key issues remain unresolved, including data scarcity and divergent technical approaches [1][2][3]

Group 1: Data and Technical Approaches
- The industry is divided into two camps: the "real machine" camp, which relies on real-world data collection, and the "synthetic" camp, which believes synthetic data is sufficient for model training [5][12]
- Galaxy General, representing the synthetic camp, argues that achieving generalization in embodied intelligence models requires trillions of data points, a scale unattainable through real-world collection alone [8][9]
- The "real machine" camp challenges the notion that real-world data is prohibitively expensive, arguing that with sufficient investment data collection can be scaled effectively [12][14]

Group 2: Model Architecture
- Discussions of model architecture reveal a divide between end-to-end and layered approaches: some experts advocate a unified model while others support a hierarchical structure [15][19]
- The layered architecture is seen as more consistent with biological evolution, while the end-to-end approach is criticized for potential error amplification [19][20]
- The debate extends to VLA (Vision-Language-Action) models versus world models, with some experts arguing that VLA is currently more promising because of its data efficiency [21][22]

Group 3: Industry Trends and Infrastructure
- A scaling law in embodied intelligence is beginning to emerge, indicating that expanding model and data scale could be effective [24]
- Deployment of embodied intelligence technologies is accelerating, with companies sharing their experience in human-robot interaction and industrial applications [24][29]
- Cloud service providers, particularly Alibaba Cloud, are highlighted as crucial in supporting the infrastructure needs of embodied intelligence companies, especially as they transition to mass production [29][31]

Group 4: Alibaba Cloud's Role
- Alibaba Cloud has been preparing for the exponential growth in data and compute demands of embodied intelligence, building capabilities for large-scale data processing and model training [33][35]
- The company offers a comprehensive suite of cloud-based solutions supporting both real and synthetic data production, improving efficiency and reducing cost [35][36]
- Alibaba Cloud's dual position as a model provider, together with its engineering capabilities, is seen as a significant advantage in the rapidly evolving embodied intelligence landscape [37][41]
Latest Result from Gao Yang's Team at 千寻智能 (Spirit AI): A Vision-Only VLA Approach Learns Strong Spatial Generalization from Limited Data
机器之心· 2025-09-29 02:52
Imagine just having learned to drive: on the practice course, we rehearse specific actions over and over, braking at a certain spot, turning the wheel at a certain point. Over time these actions become "conditioned memories," and the moment the environment changes, we fumble. Recently, researchers at 千寻智能 (Spirit AI) observed a similar phenomenon in imitation-learning-based visuomotor policies and examined it in depth in the paper "Do You Need Proprioceptive States in Visuomotor Policies?".

Paper: https://arxiv.org/abs/2509.18644
Project page: https://statefreepolicy.github.io

The researchers propose a State-free Policy. Compared with a State-based Policy, the robot exhibits strong spatial generalization even when table height, robot position, and target objects are all strictly fixed in the training data. For example:

- In the pen-grasping task, it generalizes across table heights (the standard table height is 80 cm);
- In the clothes-folding task, the robot still completes the task even when the arm is far from its standard position;
- When a whole-body robot fetches a drink from a refrigerator, it adapts even after the refrigerator is moved.

In fact ...
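The contrast between the two policy types can be sketched with toy linear policies. All dimensions, weights, and the "state shifted by 0.5" perturbation below are invented for illustration; the paper's actual policies are learned visuomotor networks:

```python
import numpy as np

rng = np.random.default_rng(0)
W_vis = rng.normal(size=(7, 16))   # maps a 16-dim visual feature to a 7-DoF action
W_state = rng.normal(size=(7, 4))  # maps a 4-dim proprioceptive state to the same action

def state_based_policy(visual_feat, proprio_state):
    # Conditions on both vision and proprioception, so it can latch onto
    # the fixed training-time state values ("conditioned memories").
    return W_vis @ visual_feat + W_state @ proprio_state

def state_free_policy(visual_feat):
    # Conditions on vision only: a test-time proprioceptive shift
    # cannot perturb the action, by construction.
    return W_vis @ visual_feat

visual = rng.normal(size=16)
train_state = np.zeros(4)       # state held fixed during training
test_state = train_state + 0.5  # e.g. the table is raised at test time

drift = np.linalg.norm(state_based_policy(visual, test_state)
                       - state_based_policy(visual, train_state))
print(drift > 0)  # the state-based action shifts when the state shifts
```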
A Developer Spent a Month Recreating DeepMind's World Model: 3 Million Parameters Is Enough for a Real-Time Interactive Pixel Game
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses the development of TinyWorlds, a minimal world model inspired by DeepMind's Genie 3, capable of generating playable pixel-style environments with only 3 million parameters [1][9][32]

Group 1: Understanding World Models
- World models are neural networks that simulate the physical world by generating video, showing emergent capabilities when trained on large-scale video data [5][7]
- The challenge is that training requires frame-by-frame action labels, which rules out most unannotated video from the internet [5][6]
- Genie 1's solution was to train an action tokenizer that infers action labels, unlocking vast amounts of unannotated video for training [5][6]

Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gaming videos, which determine the range of environments the model can generate [11][12]

Group 3: Architecture and Tokenization Strategy
- TinyWorlds employs a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17]
- Each block combines spatial attention, temporal attention, and a feed-forward network to extract higher-level features [21][22]
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts the action between frames, enabling training on unannotated data [24][26]

Group 4: Training the World Generator
- The dynamics model is the system's "brain," predicting future frames from video and actions, with performance improving markedly as model size grows [30][32]
- Despite its 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32]
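The factorized space-time attention idea can be sketched in NumPy: attend over patch tokens within each frame, then over time steps at each patch location, then apply a feed-forward network. This is a single-head sketch without learned projections; the tensor shapes and weights are simplified assumptions, not TinyWorlds' actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, batched over the leading axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def space_time_block(x):
    """One factorized block over x of shape (T, P, d):
    T frames, P patch tokens per frame, d channels."""
    x = x + attention(x, x, x)                        # spatial: mixes the P tokens of each frame
    xt = np.swapaxes(x, 0, 1)                         # (P, T, d): per-location time series
    x = x + np.swapaxes(attention(xt, xt, xt), 0, 1)  # temporal: mixes the T steps per location
    return x + np.tanh(x @ W1) @ W2                   # feed-forward network

rng = np.random.default_rng(0)
d, hidden = 8, 32
W1 = rng.normal(scale=0.1, size=(d, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, d))

tokens = rng.normal(size=(4, 16, d))  # 4 frames, 16 patch tokens each
out = space_time_block(tokens)
print(out.shape)  # (4, 16, 8)
```

Factorizing attention this way keeps the cost at O(T·P² + P·T²) rather than O((T·P)²) for full attention over all video tokens, which is what makes the approach tractable at small parameter counts.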
This Is What the Next Generation of Recommender Systems Looks Like: Meta's Latest RecoWorld Research, from "Guessing What You Like" to "Taking Your Instructions"
机器之心· 2025-09-28 10:29
Core Insights
- The article traces the evolution of recommender systems, highlighting the limitations of traditional systems that rely on historical data and lack real-time interaction with users [2][9]
- Meta's new approach, RecoWorld, introduces a dual-view architecture enabling multi-round interaction between users and the recommender system, aiming to improve user retention [3][4]

Group 1: RecoWorld Overview
- RecoWorld's dual-view architecture simulates user interactions and lets the recommender adjust its content dynamically based on user feedback [4][12]
- A user simulator mimics real user behavior, providing feedback such as complaints or likes that informs the recommender's adjustments [13][14]
- The design enables a dynamic feedback loop in which user instructions lead to system adjustments, fostering a two-way dialogue between users and the recommender [18]

Group 2: Mechanism and Functionality
- The core mechanism is a "virtual duet" in which simulated users interact with the recommender, teaching it how to retain users effectively [12][16]
- The user simulator can click, skip, or like, and its decisions are shaped by environmental factors and past interactions [14][16]
- The ultimate goal is to optimize long-term retention by maximizing session duration and minimizing session gaps, which correlate with daily active users (DAU) [16]

Group 3: Future Implications
- RecoWorld provides foundational infrastructure for recommender-system research, akin to OpenAI's Gym for reinforcement learning, allowing safe experimentation with new algorithms [21]
- The shift from one-way recommendation to interactive systems marks a transformation in which users can direct the algorithm, deepening content personalization [22][24]
- Future recommender systems are envisioned as more intelligent and responsive, able to understand user preferences and adapt in real time [25][24]
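The retention loop can be sketched as a toy exchange between a simulated user and a recommender policy. The topics, the "complain:" feedback format, and the adjustment rule are all invented for illustration; RecoWorld's simulator is a learned model, not a lookup:

```python
def simulated_user(preferred, item):
    # Toy stand-in for RecoWorld's user simulator: instead of silently
    # skipping, it emits instruction-style feedback the system can act on.
    return "like" if item == preferred else "complain:" + preferred

def run_session(policy, preferred, max_steps=10):
    # Session length is the retention signal being optimized: the system
    # adjusts to complaints mid-session instead of losing the user.
    length = 0
    for _ in range(max_steps):
        feedback = simulated_user(preferred, policy["current"])
        if feedback == "like":
            length += 1
        else:
            policy["current"] = feedback.split(":", 1)[1]  # multi-round adjustment
    return length

policy = {"current": "tech"}          # initial recommendation misses the mark
session_len = run_session(policy, preferred="travel")
print(policy["current"], session_len)  # travel 9
```

The point of the toy: because feedback is an instruction rather than a silent skip, one round of correction converts the remaining nine steps into positive engagement, which is the "maximize session duration" objective in miniature.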
OpenAI Accused of Fraud: User Inputs May Be Secretly Routed to the New Model GPT-5-Chat-Safety
机器之心· 2025-09-28 07:05
Core Viewpoint
- The release of GPT-5 has led to significant user dissatisfaction, particularly over OpenAI's removal of the model selector in ChatGPT, which sparked online petitions demanding the return of GPT-4o [1][2]

Group 1
- OpenAI has reinstated GPT-4o for ChatGPT Plus users, but issues persist: emotionally charged content is routed to a hidden model called GPT-5-Chat-Safety without notifying users [2][3]
- Users report that any content deemed "risky," even mildly emotional, is rerouted to GPT-5-Chat-Safety, a model OpenAI has not publicly acknowledged [3][4]
- GPT-5-Chat-Safety is criticized as inferior to GPT-5, giving shorter and less engaging responses and treating conversations as stories rather than genuine interactions [3][4]

Group 2
- Concerns have been raised about the ethics of rerouting user conversations to a model designed for crisis response when most affected dialogues involve no emergency [4][6]
- Users have expressed outrage over what they see as deceptive practices, arguing that the lack of transparency about model changes constitutes a form of fraud [12][19]
- The incident has reignited discussion of AI model transparency and user rights, underscoring the challenge OpenAI faces in maintaining user trust amid rapid technological change [29]
Ditch CoT? Why the Agentic Era Needs Implicit Reasoning More
机器之心· 2025-09-28 07:05
Group 1
- The article examines the limitations of Chain of Thought (CoT) reasoning in AI, highlighting its inability to break the "1 Hz" barrier and suggesting that implicit reasoning may be a better fit for Agentic AI [7][8][10]
- Recent studies indicate that CoT may not represent true reasoning but rather structured pattern matching, which can degrade performance on tasks requiring inductive reasoning [9][10]
- The high computational cost and latency of explicit reasoning make it impractical for real-time applications, motivating a shift toward implicit reasoning that can adapt to varying task complexity [10][11]

Group 2
- Implicit reasoning is gaining traction because it processes faster at lower cost, making it better suited to real-time AI applications than the traditional "Think-before-Speaking" (TbS) model [11][12]
- The article emphasizes that AI agents need to adjust their reasoning depth and speed dynamically according to task difficulty, a key capability for future AI development [10][11]
- Challenges remain for implicit reasoning, particularly in high-stakes scenarios where accuracy and verifiability are paramount, such as legal document analysis and medical diagnostics [13][14]
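The idea of dynamically matching reasoning depth to task difficulty can be illustrated with a toy early-exit loop, where "refinement" stands in for latent computation. The scalar update and tolerance thresholds are invented; real implicit-reasoning systems iterate in hidden states, not on numbers:

```python
def refine(estimate, target):
    # One hidden refinement step: the estimate moves toward the answer
    # without emitting any intermediate text (unlike CoT).
    return estimate + 0.5 * (target - estimate)

def implicit_reasoner(target, tolerance, max_steps=20):
    """Toy adaptive-depth loop: keep refining a latent estimate until it
    is close enough, so easy tasks exit early and hard tasks get more compute."""
    estimate, steps = 0.0, 0
    while abs(target - estimate) > tolerance and steps < max_steps:
        estimate = refine(estimate, target)
        steps += 1
    return estimate, steps

_, easy_steps = implicit_reasoner(target=1.0, tolerance=0.1)    # loose requirement
_, hard_steps = implicit_reasoner(target=1.0, tolerance=0.001)  # strict requirement
print(easy_steps, hard_steps)  # 4 10
```

The compute spent scales with how demanding the task is, rather than paying a fixed, verbose CoT cost on every query, which is the property the article argues real-time agents need.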
Can Ordinary People Do "Model Alchemy" Too? I Fed Xiaohongshu Copy to openPangu-Embedded-1B and Turned It into a Dedicated Copywriting Master in a Few Steps!
机器之心· 2025-09-28 07:05
Core Viewpoint
- The article emphasizes the potential of smaller AI models, specifically openPangu-Embedded-1B, to be trained effectively for specific applications, demonstrating that high performance does not require massive models [3][23]

Group 1: Model Introduction and Capabilities
- openPangu-Embedded-1B is a lightweight model that can be trained easily with limited resources, making it accessible to ordinary users [3][11]
- Despite its smaller size, the 1B model is competitive with larger models such as Qwen3-1.7B [3][23]

Group 2: Training Process
- Training involves three simple steps: preparing the dataset, loading the model, and fine-tuning it on the task-specific data [9][10]
- Training data can be sourced from open academic resources such as Hugging Face, which simplifies data collection [9][11]

Group 3: Application and Results
- A case study fine-tunes the model to generate content in the distinctive style of Xiaohongshu (Little Red Book), showcasing its adaptability [5][19]
- Fine-tuning significantly improved the model's ability to produce engaging, stylistically appropriate content aligned with the platform's tone [19][21]

Group 4: Advantages of Smaller Models
- Smaller models like openPangu-Embedded-1B have low hardware requirements, widening access and easing concerns about compute [27]
- Efficient training and the ability to customize the model with personal data let users define its style and knowledge boundaries [27]
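The fine-tuning step (step 3 above) is, at its core, supervised next-token prediction on style data. That objective can be illustrated with a tiny character-level stand-in; the corpus, model, and training loop below are invented toys, not the actual openPangu-Embedded-1B pipeline:

```python
import numpy as np

# Tiny character-level stand-in: fit a next-token predictor to a target style.
corpus = "must-try! this recipe is so easy, save it now! " * 20
vocab = sorted(set(corpus))
ids = np.array([vocab.index(c) for c in corpus])
V, d = len(vocab), 16

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))  # token embeddings
W = rng.normal(scale=0.1, size=(d, V))  # output head

def sft_step(lr=1.0):
    # One full-batch step of supervised fine-tuning: cross-entropy loss on
    # next-token prediction, with gradients from the softmax shortcut.
    global E, W
    x, y = ids[:-1], ids[1:]
    logits = E[x] @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    p[np.arange(len(y)), y] -= 1.0  # dL/dlogits (times len(y))
    p /= len(y)
    gE = np.zeros_like(E)
    np.add.at(gE, x, p @ W.T)       # scatter-add gradients into embedding rows
    W -= lr * (E[x].T @ p)
    E -= lr * gE
    return loss

first = sft_step()
for _ in range(200):
    last = sft_step()
print(first > last)  # True: the model fits the target style corpus
```

A real run swaps this toy for a pretrained 1B checkpoint and a curated dataset, but the loss being minimized is the same: make the model's next-token distribution match the style of the fine-tuning text.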
"How Far Is the Road from Follower to Leader?" We Talked with Front-Line CANN Developers
机器之心· 2025-09-28 04:50
Core Viewpoint
- The article discusses the transformation of the AI industry, emphasizing that competition has shifted from hardware capability to a battle over software, developers, and ecosystem building, with Huawei's Ascend and its heterogeneous computing architecture CANN at the forefront of this change [1][4]

Summary by Sections

CANN Open Source Announcement
- Huawei's rotating chairman Xu Zhijun announced that CANN's hardware-enablement layer will be fully open-sourced by December 30, 2025 [2]

Significance of the CANN Open Source
- Open-sourcing CANN represents a profound self-revolution in domestic AI infrastructure, breaking the closed model traditionally dominated by hardware vendors in favor of a more open, community-driven future [4][19]
- The ecosystem's success depends on attracting academic innovation and providing developers with a stable, universal, and efficient foundational tool [5][18]

Developer Perspectives on CANN
- Developers describe CANN's evolution as a challenging journey: early versions required low-level programming skills, which hindered productivity [10][11]
- The introduction of the Ascend C programming language marked a significant improvement, aligning more closely with mainstream programming practice [15]

Challenges Faced by Developers
- Early developers faced high technical barriers and an unstable architecture, making for a difficult development environment [11][13]
- Systemic issues persisted, such as model accuracy that could not be reproduced across frameworks because the underlying systems lacked transparency [17]

The Role of Open Source
- Open-sourcing CANN is seen as a way to break down technical barriers and empower developers by giving them transparency and control over the platform [21][23]
- The open-source model aims to foster a vibrant community where developers contribute and innovate, moving away from reliance on a few official experts [29]

Ecosystem Empowerment
- Open source creates unprecedented opportunities for deep integration between academia and industry, letting researchers tackle real-world problems and turn solutions into academic contributions [26]
- The shift from users to contributors is expected to cultivate a new generation of developers engaged in high-quality projects [28]

Future Outlook for CANN
- The current focus is on matching CUDA's capabilities while fostering original innovation within the CANN ecosystem [44]
- Huawei has committed significant resources, including 1,500 petaflops of computing power and 30,000 development boards annually, to support the open-source community [45]
Both RLHF and RLVR: The Latest Work from Danqi Chen's Team Extends Reasoning Ability to General Intelligence
机器之心· 2025-09-28 04:50
Core Insights
- The article introduces Reinforcement Learning with Model-rewarded Thinking (RLMT), a method that integrates explicit reasoning into general chat models, improving their performance on open-ended tasks [5][7][26]

Summary by Sections

Introduction
- The article highlights recent work by Danqi Chen's group at Princeton University, which developed RLMT to bridge the gap between specialized reasoning capabilities and general conversational ability in AI [2][5]

Methodology
- RLMT combines aspects of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) to optimize language models for open-ended tasks [6][11]
- Training follows two routes: supervised fine-tuning (SFT) to teach the desired reasoning format, and a zero-training route that applies RLMT directly to base models without prior warm-up [12][14]

Results
- Models trained with RLMT outperform non-thinking baselines on open-ended reasoning tasks, particularly on chat and creative-writing benchmarks [18][26]
- Comparative results show RLMT models surpassing other models, including GPT-4o and Claude-3.7-Sonnet, on various chat benchmarks [19][20]

Conclusion
- RLMT extends the advantages of explicit reasoning from specialized domains to general conversational AI, suggesting its potential to reshape language model training methodologies [26][29]
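The core idea, optimizing a policy against a learned preference reward rather than a verifiable-answer checker, can be sketched as a deterministic policy-gradient toy. The candidate responses, reward values, and update rule are invented for illustration; RLMT itself trains a full language model over sampled reasoning traces:

```python
import math

RESPONSES = ["terse reply", "reasoned reply", "rambling reply"]

def reward_model(response):
    # Stand-in for a learned preference reward: it scores whole responses,
    # unlike RLVR-style rewards that check a verifiable final answer.
    return {"terse reply": 0.2, "reasoned reply": 1.0, "rambling reply": 0.1}[response]

logits = [0.0, 0.0, 0.0]  # policy over the candidate responses
for _ in range(300):
    z = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / z for l in logits]
    expected_r = sum(p * reward_model(r) for p, r in zip(probs, RESPONSES))
    for j in range(3):    # exact gradient ascent on expected reward
        logits[j] += 0.5 * probs[j] * (reward_model(RESPONSES[j]) - expected_r)

best = max(range(3), key=lambda j: logits[j])
print(RESPONSES[best])  # reasoned reply
```

The gradient pushes probability mass toward responses the reward model prefers, which is how a preference signal, with no ground-truth answer to verify, can still shape open-ended behavior such as chat and creative writing.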
Accepted at NeurIPS: Genesis Pioneers a New OCC-Guidance-Free Paradigm for Multimodal Generation, Reaching SOTA on Video and LiDAR Metrics
机器之心· 2025-09-28 04:50
Core Insights
- The article discusses the Genesis framework, a multimodal image-point cloud generation algorithm developed by Huazhong University of Science and Technology and Xiaomi Auto, which requires no occupancy (OCC) guidance to generate realistic driving-scene data [2][4]

Group 1: Genesis Framework Overview
- Genesis employs a two-stage architecture: the first stage uses a perspective-projection layout and scene descriptions to learn 3D features, while the second converts multi-view video sequences into a bird's-eye-view feature space [4]
- The framework introduces DataCrafter, a data annotation module based on vision-language models (VLMs), to provide structured semantic information that guides the generation process [10][13]

Group 2: Challenges in Current Driving Scene Generation
- Existing methods focus primarily on single-modal generation, either RGB video or LiDAR point clouds, which limits deep collaboration and consistent expression between the visual and geometric modalities [7][8]
- The high cost of obtaining OCC labels in real-world driving scenarios restricts the industrial application of existing multimodal generation models [8]

Group 3: DataCrafter Module
- DataCrafter filters training data and extracts structured semantic information, ensuring that only high-quality segments are used for training while providing detailed semantic guidance for the generation tasks [13][18]
- The module evaluates video segments on visual attributes such as clarity, structural coherence, and aesthetic quality, retaining only those above a set threshold [15]

Group 4: Video Generation Model
- The video generation model integrates scene-layout information and language descriptions through attention mechanisms, enhancing the semantic expression of dynamic scenes [19]
- Innovations include using YOLOv8x-Pose to detect pedestrian poses, which are then projected across views to improve the realism of generated driving scenarios [19]

Group 5: Performance Metrics
- On the nuScenes dataset, Genesis achieved a multi-frame FVD of 83.10 and a multi-frame FID of 14.90 without initial-frame conditions, outperforming previous methods [26]
- For LiDAR generation, Genesis achieved a Chamfer distance of 0.611 at 1-second prediction, surpassing the previous best by 21% [27]

Group 6: Downstream Task Evaluation
- Data generated by Genesis was evaluated on downstream perception tasks, improving mean Average Precision (mAP) and NuScenes Detection Score (NDS) across various settings [30]
- Jointly generating camera and LiDAR modalities yielded the highest gains, demonstrating the complementary advantages of multimodal generation [30]
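Chamfer distance, the LiDAR metric cited above, can be implemented directly in NumPy. This is a generic sketch: papers differ on whether they use squared distances or average rather than sum the two directional terms, so the exact convention below is an assumption:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
scan = rng.uniform(-10, 10, size=(256, 3))            # stand-in for a LiDAR sweep
perturbed = scan + rng.normal(scale=0.1, size=scan.shape)

print(chamfer_distance(scan, scan))             # 0.0 for identical point sets
print(chamfer_distance(scan, perturbed) > 0.0)  # True: noise increases the distance
```

Because the metric takes the nearest neighbor in each direction, it rewards point clouds that both cover the ground-truth geometry and avoid spurious points, which is why it is a standard measure for generated LiDAR.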