World Models
What Did the "Grandfather" of Chinese Cars Look Like? 70 Years of Change, Gone in an Instant!
电动车公社· 2025-07-02 15:59
Core Viewpoint - The article traces the evolution of the Chinese automotive industry from manual craftsmanship to its position as the world's largest producer and exporter of automobiles, and covers current advances in technology and culture within the sector [1].

Group 1
- The Beijing Automobile Museum serves as a platform for reflecting on the history of Chinese automotive development and its cultural roots [1].
- The article stresses the significance of national-level models for understanding the progress of China's automotive industry [1].
- It looks ahead to the future of new energy vehicles and the direction of automotive culture in China [1].

Group 2
- The article introduces recent vehicle launches, specifically the XPeng G7, as a sign of ongoing innovation in the market [3].
- It discusses the new national standards for batteries, regulatory changes that could affect the industry [3].
- It explores the concept of world models and the underlying logic of AI and intelligent driving, pointing to a shift toward advanced technology in automotive operations [3].
RoboScape: A Physics-Informed Embodied World Model with 68.3% Higher Action Controllability
具身智能之心· 2025-07-02 10:18
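Authors: Yu Shang et al. | Editor: 具身智能之心

The root cause is that existing models rely too heavily on fitting visual tokens and lack awareness of physical knowledge. Previous attempts to integrate physical knowledge fall into three categories: physics-prior regularization (limited to narrow domains such as human motion or rigid-body dynamics), knowledge distillation from physics simulators (a cascaded pipeline that is computationally complex), and material-field modeling (restricted to object-level modeling and hard to apply to scene-level generation). How to integrate physical knowledge within a unified, efficient framework has therefore become the core problem to be solved.

Core Method

Problem definition: focusing on robot manipulation scenarios, the work learns an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions, which the article expresses as a formula.

Research Background and Core Problem

In embodied intelligence, world models act as powerful simulators that can generate realistic robot videos and ease data scarcity, but existing models show significant limitations in physical awareness. In contact-rich robot scenarios in particular, because they cannot model 3D geometry and motion dynamics, the generated videos often show unrealistic object deformation or ...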
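The problem definition above (predicting the next visual observation from past observations and robot actions) can be written compactly. The following is only a sketch under assumed notation, since this excerpt does not preserve the article's own symbols: the names $\mathcal{W}_{\theta}$, $o$, $a$, and $t$ are illustrative choices, not taken from the source.

```latex
% Assumed notation (not from the source):
%   \mathcal{W}_{\theta} : the learned embodied world model (dynamics function)
%   o_{1:t}              : visual observations up to the current step t
%   a_{1:t}              : robot actions up to the current step t
%   \hat{o}_{t+1}        : the predicted next visual observation
\hat{o}_{t+1} = \mathcal{W}_{\theta}\left(o_{1:t},\, a_{1:t}\right)
```

Read this way, the world model is an action-conditioned next-frame predictor, which is why the headline metric is action controllability.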
Xiaomi Experienced & Campus Hiring | Algorithm Researcher, Autonomous Driving and Robot Embodied Intelligence (VLA Track)
具身智能之心· 2025-07-01 12:07
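Job Description

We are looking for an outstanding researcher/scientist to join our frontier exploration team and help define and build the "brain" of next-generation autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities and has excellent spatial perception and spatial reasoning.

Core responsibilities include:
- Frontier algorithm research and development: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build a World Model that understands the complex 3D world and performs long-horizon, multi-step task planning.
- Core model capabilities: lead breakthroughs in the following key capabilities:
  - Multimodal scene understanding: fuse vision, language, radar, and other sources of information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combined with spatial reasoning about the physical world, generate safe, reasonable, and explainable action sequences.
  - Learning and adaptation mechanisms: study reinforcement learning (RL), imitation learning (IL), and self-supervised learning in depth, so that the model keeps learning and evolving from massive data and from interaction with the environment.
- Technical vision and roadmap: lead the construction of a generalizable, efficient embodied foundation model, providing core ... for technical evolution over the next 1-3 years ...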
"Commercialization in Three Years": How Will Hello (哈啰) Make Robotaxi Work?
Core Insights
- The article discusses the competitive landscape of the Robotaxi industry, highlighting the shift from technology development to commercialization and scaling [1].
- Hello (哈啰) enters the Robotaxi market backed by its user data and local operational experience, as well as a significant investment partnership with Ant Group and CATL [2][6].
- The company aims to achieve commercialization within three years, focusing first on the domestic market before expanding internationally [9][15].

Company Strategy
- Hello plans to compete through differentiation, creating a multi-layered, accessible operational platform that integrates various car manufacturers and technology partners [4].
- The platform will let partners share resources, reducing operational costs and lowering the barriers for cities to implement Robotaxi services [4].
- The company emphasizes data acquisition, particularly long-tail data, to improve model training for autonomous driving [5].

Investment and Partnerships
- The joint venture with Ant Group and CATL involves an initial investment of over 3 billion yuan, aimed at advancing L4 autonomous driving technology [2][6].
- Ant Group will contribute AI infrastructure and algorithm research, while CATL will provide battery technology and operational support [7].

Technical Development
- Hello acknowledges the challenges in developing L4 technology, particularly in acquiring functional cases and long-tail data [9].
- The company is exploring a dual approach to technology, combining AI-driven methods with traditional sensor technologies such as LiDAR for greater reliability [13][14].

Market Positioning
- The company positions itself as a latecomer with unique advantages, using the maturity of the industry to make targeted investments [3].
- Hello aims to create a commercially viable L4 product that is not only technologically sound but also economically feasible for consumers [8][12].
In AI's Second Half, Large Models Should Talk Less and Do More
Hu Xiu· 2025-07-01 01:33
Core Insights
- The article discusses the rapid advancements in AI models in China, particularly highlighting the performance improvements of DeepSeek and other models over the past year [1][3][5].
- The establishment of the "Fangsheng" benchmark testing system aims to standardize AI model evaluations and address issues of cheating in rankings [2][44].
- The competitive landscape of AI models is characterized by frequent updates and rapid changes in rankings, with Chinese models increasingly dominating the top positions [4][5][8].

Group 1: AI Model Performance
- DeepSeek has shown significant performance improvements, moving from a lower ranking in April 2024 to becoming the top model by December 2024 [1].
- The current landscape features approximately six Chinese models in the top ten, indicating a strong domestic presence in AI development [3].
- The frequency of updates has increased, leading to shorter durations for models to maintain top positions, with rankings changing as often as every few days [5][7].

Group 2: Benchmark Testing
- The "Fangsheng" benchmark testing system was introduced to provide a standardized method for evaluating AI models, addressing the lack of consistency in existing tests [2][44].
- The testing framework includes a diverse set of questions, focusing on real-world applications rather than traditional academic assessments [43][46].
- The system aims to enhance the practical capabilities of AI models, ensuring they can effectively contribute to the economy [44][53].

Group 3: Future of AI and Agents
- The concept of Agents, which operate on top of AI models, is gaining traction, allowing for more autonomous and intelligent functionalities [20][21].
- Future developments may lead to the emergence of specialized Agents for various tasks, potentially transforming individual productivity and collaboration with AI [25][26].
- The integration of databases and knowledge repositories with AI models is essential for improving accuracy and reducing misinformation [17][19].

Group 4: Industry Implications
- The advancements in AI models and the establishment of benchmark testing are expected to drive significant changes in various industries, enhancing operational efficiency and innovation [35][52].
- Companies are encouraged to focus on the practical applications of AI, moving beyond mere content generation to deeper analytical capabilities [52][53].
- The competitive landscape remains fluid, with no single company holding a definitive advantage, as multiple players vie for user engagement and market share [28].
Small-Group Discussion with a Leading Robotaxi Expert
2025-07-01 00:40
Summary of Key Points from the Conference Call

Industry Overview
- The conference call primarily discusses the L4 autonomous driving industry, focusing on various companies and their technological approaches, including Tesla, Vivo, Baidu, and Pony [1][2][6][7].

Core Insights and Arguments
- Current Autonomous Driving Models: The mainstream approach for autonomous driving combines local end-to-end two-stage models, utilizing CNN and LLM for perception and prediction, while planning and control rely on rule-based methods to ensure safety [1][2].
- Tesla's Technology: Tesla employs a pure end-to-end visual model, which offers fast response times and excels in complex scenarios. However, it faces challenges such as complex training processes and difficulties in data labeling, leading to potential dangerous behaviors in unseen data [3][4].
- Domestic L4 Systems: Domestic L4 autonomous driving systems outperform Tesla in driving comfort, safety in complex road conditions, and path planning in sharp turns. Companies like Baidu and Pony enhance perception capabilities through multi-sensor fusion, making them more suitable for complex domestic traffic environments [6][7].
- LiDAR Necessity: LiDAR is deemed essential for L4 autonomous driving, especially in low visibility conditions, as it effectively identifies object shapes, addressing the shortcomings of pure visual systems [9].
- Cost and Performance of Chips: The performance and stability of chips are critical for L4 functionality. While domestic chips are improving, they still lag behind Nvidia in peak performance and ecosystem support. However, U.S. sanctions are driving a trend towards domestic alternatives, significantly reducing costs [12][13].
- Testing and Simulation: L4 companies utilize extensive testing and simulation technologies to address common issues, moving away from solely relying on real-world testing, which is labor-intensive and limited [14].

Additional Important Points
- Regulatory Environment: The operation of Robotaxi services requires prior data submission to government authorities for area approval, indicating a structured regulatory framework [17][18].
- Challenges in Scaling: The high cost of individual vehicles, regulatory restrictions, and the need for infrastructure development are significant barriers to scaling operations for companies like Pony and WeRide [16].
- Talent Acquisition: Companies are focusing on recruiting high-end talent from both domestic and international sources, with a strong emphasis on graduates from top Chinese universities [25][26].
- Future Technological Iterations: While no major technological shifts are expected in the short term, the integration of large language models into autonomous driving systems is anticipated to significantly enhance capabilities [28].

This summary encapsulates the key discussions and insights from the conference call, highlighting the current state and future prospects of the L4 autonomous driving industry.
AI Expert Pours Cold Water on Altman: Pure LLMs Have Never Truly Understood the World, and Building AGI on Them Is Hopeless
36Kr · 2025-06-30 09:29
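Key points:

June 29: OpenAI CEO Sam Altman is full of optimism, believing that the dawn of artificial general intelligence is close at hand. His view has acted like a shot in the arm, firing up his many followers and filling them with visions of a coming age of intelligence. American cognitive scientist and AI expert Gary Marcus, however, has thrown cold water on that enthusiasm.

Marcus recently published a long essay, "Generative AI's crippling and widespread failure to induce robust models of the world," which resonated strongly in academic and technology circles. The essay opens with an absurd AI-generated video, in which a chess player moves the opponent's piece sideways across several squares, and uses it to introduce his deepest criticism of current generative AI: these models can "imitate thinking," but they have never built a stable, reliable understanding of the world.

This is not the first time someone has pointed out serious flaws in the reasoning of large language models. Apple's research paper "The Illusion of Thinking," published this month, systematically documented cases in which large language models frequently err in logical reasoning and mathematical calculation. However, as Marcus ...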
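LeCun Releases Latest World Model: First Coherent 16-Second Scene Prediction, Embodied Intelligence Masters the First-Person View! And, Contradicting His Own Stance, It Uses a VAE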
量子位· 2025-06-30 06:38
Core Viewpoint - Yann LeCun, a prominent figure in AI and deep learning, is focusing on developing a new model called PEVA, which aims to enhance embodied agents' predictive capabilities, allowing them to anticipate actions similarly to humans [2][10].

Group 1: PEVA Model Development
- The PEVA model enables embodied agents to learn predictive abilities, achieving coherent scene predictions for up to 16 seconds [2][6].
- The model integrates structured action representation with 48-dimensional kinematic data of human joints and a conditional diffusion Transformer [3][20].
- PEVA utilizes first-person perspective video and full-body pose trajectories as inputs, moving away from abstract control signals [4][12].

Group 2: Technical Innovations
- The model addresses computational efficiency and delay issues in long-sequence action prediction through random time jumps and cross-historical frame attention [5][24].
- PEVA captures both "overall movement" and "fine joint movements" using high-dimensional structured data, which traditional models fail to represent accurately [16][18].
- The architecture employs a hierarchical tree structure for motion encoding, ensuring translation and rotation invariance [25].

Group 3: Performance Metrics
- PEVA outperforms baseline models in various tasks, showing lower LPIPS and FID values, indicating higher visual similarity and better generation quality [33][35].
- In single-step predictions, PEVA's LPIPS value is 0.303, and FID is 62.29, demonstrating its effectiveness compared to the CDiT baseline [33][35].
- The model's ability to predict visual changes within 2 seconds and generate coherent videos for up to 16 seconds marks a significant advancement in embodied AI [40].

Group 4: Practical Applications
- PEVA can intelligently plan actions by evaluating multiple options and selecting the most appropriate sequence, mimicking human trial-and-error planning [42].
- The model's capabilities could lead to more efficient robotic systems, such as vacuum cleaners that can anticipate obstacles and navigate more effectively [51].
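The planning behavior described above (roll several candidate action sequences through the world model, then keep the best one) can be illustrated with a toy sketch. This is not PEVA's implementation: `rollout`, `plan`, `toy_world_model`, and `toy_score` are hypothetical names, and a one-dimensional integer position stands in for predicted video frames.

```python
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def rollout(world_model: Callable[[State, Action], State],
            state: State,
            actions: Sequence[Action]) -> State:
    """Apply the world model step by step and return the predicted final state."""
    for action in actions:
        state = world_model(state, action)
    return state


def plan(world_model: Callable[[State, Action], State],
         score: Callable[[State], float],
         state: State,
         candidates: List[Sequence[Action]]) -> Sequence[Action]:
    """Return the candidate whose imagined outcome scores highest."""
    return max(candidates,
               key=lambda actions: score(rollout(world_model, state, actions)))


# Toy stand-ins (assumptions for illustration only): the "world" is a position
# on a line, an action is a signed step, and the score is closeness to a goal.
GOAL = 5


def toy_world_model(position: int, step: int) -> int:
    return position + step


def toy_score(position: int) -> float:
    return -abs(position - GOAL)


candidates = [[1, 1], [2, 3], [4, -1]]
best = plan(toy_world_model, toy_score, 0, candidates)
print(best)  # the sequence whose predicted end position is closest to GOAL
```

In a real system the score would compare predicted observations with a goal observation rather than integers, but the select-by-imagined-outcome loop is the same.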
AI Starts "Freely Playing on the Computer"! Jilin University Proposes the "ScreenExplorer" Agent
机器之心· 2025-06-27 04:02
Core Viewpoint - The article discusses the development of a vision-language model (VLM) agent named ScreenExplorer, which is designed to autonomously explore and interact within open graphical user interface (GUI) environments, marking a significant step towards achieving general artificial intelligence (AGI) [2][3][35].

Group 1: Breakthroughs and Innovations
- The research introduces three core breakthroughs in the training of VLM agents for GUI exploration [6].
- A real-time interactive online reinforcement learning framework is established, allowing the VLM agent to interact with a live GUI environment [8][11].
- The introduction of a "curiosity mechanism" addresses the sparse feedback issue in open GUI environments, motivating the agent to explore diverse interface states [10][12].

Group 2: Training Methodology
- The training involves a heuristic and world model-driven reward system that encourages exploration by providing immediate rewards for diverse actions [12][24].
- The GRPO algorithm is utilized for reinforcement learning training, calculating the advantage of actions based on rewards obtained [14][15].
- The training process allows for multiple parallel environments to synchronize reasoning, execution, and recording, enabling "learning by doing" [15].

Group 3: Experimental Results
- Initial experiments show that without training, the Qwen2.5-VL-3B model fails to interact effectively with the GUI [17].
- After training, the model demonstrates improved capabilities, successfully opening applications and navigating deeper into pages [18][20].
- The ScreenExplorer models outperform general models in exploration diversity and interaction effectiveness, indicating a significant advancement in autonomous GUI interaction [22][23].

Group 4: Skill Emergence and Conclusion
- The training process leads to the emergence of new skills, such as cross-modal translation and complex reasoning abilities [29][34].
- The research concludes that ScreenExplorer effectively enhances GUI interaction capabilities through a combination of exploration rewards, world models, and GRPO reinforcement learning, paving the way for more autonomous agents and progress towards AGI [35].
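The GRPO step mentioned above, computing each action's advantage from the rewards obtained, is commonly done by normalizing rewards within a group of rollouts. The sketch below shows that group-relative normalization only. It is not ScreenExplorer's code: the function name, the `eps` constant, the use of the population standard deviation, and the sample rewards are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Rollouts that beat the group average get a positive advantage and the
    rest get a negative one, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical exploration rewards from four parallel GUI environments.
rollout_rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rollout_rewards)
print(advantages)
```

Because advantages are relative to the group, a batch in which every rollout earns the same reward yields zero advantage everywhere, which is one reason a diversity-seeking exploration reward matters in sparse-feedback GUI environments.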
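New Breakthrough in Embodied World Models: Horizon Robotics & GigaAI Propose a Geometry-Consistent Video World Model to Enhance Robot Policy Learning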
机器之心· 2025-06-26 04:35
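In recent years, as artificial intelligence has evolved from perceptual intelligence toward decision-making intelligence, world models have become an important research direction in robotics. A world model aims to let an agent model its environment and predict future states, enabling more efficient planning and decision-making.

At the same time, embodied data has drawn a surge of attention. Current embodied algorithms depend heavily on large-scale real-robot demonstration data, and collecting that data is costly and time-consuming, which severely limits scalability and generalization. Simulation platforms offer a relatively low-cost way to generate data, but significant visual and dynamics differences between simulated environments and the real world (the sim-to-real gap) make policies trained in simulation hard to transfer directly to real robots, limiting their practical effectiveness. How to efficiently acquire, generate, and use high-quality embodied data has therefore become one of the core challenges in robot learning today.

Project page: https://horizonrobotics.github.io/robot_lab/robotransfer/

Imitation learning has become one of the important methods in robot manipulation. By having a robot "imitate" behavior demonstrated by experts, effective policy models can be built quickly for complex tasks. However, such methods usually depend on large amounts of high-quality real-robot ...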