世界模型

某智驾公司一言难尽的融资。。。
自动驾驶之心· 2025-07-12 12:00
Core Viewpoint
- The article describes a convoluted financing maneuver between an autonomous driving company and a leading automotive manufacturer, highlighting the funding pressures and competitive dynamics of the industry.
Group 1: Financing Strategy
- The company has struggled to secure funding: its valuation is close to that of top-tier autonomous driving firms, but it has only a limited number of mass-production projects to justify it [3][4].
- It approached a leading automotive manufacturer for investment, which agreed on the condition that the funds be reinvested into one of the manufacturer's struggling parts subsidiaries [4].
- The maneuver lets the manufacturer present the round as external funding, improving its public image while channeling needed capital to its subsidiary [4].
Group 2: Industry Competition
- The autonomous driving market is highly competitive: companies strong in both algorithms and mass production win projects and funding, while those weak in either struggle to obtain both [5].
- For the company in question, improving algorithm performance and production delivery matters more than complex investment maneuvers with major clients [5].
字节藏了一手“牌”
虎嗅APP· 2025-07-12 09:27
Core Viewpoint
- The article discusses the emerging trend of "emotional large models" in AI, highlighting their potential to enhance user interaction by understanding and responding to human emotions, transforming AI from a mere tool into an emotional companion [3][5][6].
Group 1: Emotional Large Models Overview
- "Emotional large models" differ from traditional chatbots by focusing on the user's emotional experience, analyzing tone, pauses, and expressions to generate emotionally appropriate responses [5][6].
- Their technical evolution follows two paths: adding multimodal affective-computing capabilities on top of general-purpose models, and building specialized generative models focused on emotional understanding [7][8].
Group 2: Market Potential and Growth
- The emotional AI companion market is expected to grow explosively: active users increased 30-fold from 2018 to 2023, and the global market size is projected to rise from $30 million in 2023 to $150 billion by 2030, a compound annual growth rate of 236% [8][9].
- Character.AI shows strong engagement, with mobile downloads exceeding 34.32 million and web visits reaching 310 million in a single month [9].
Group 3: Technical Aspects and Implementation
- Compared with traditional models, emotional large models require more NLP expertise and a different computational profile, with training demanding 30%-50% more compute to remain effective [10].
- Development of emotional models in China trails international counterparts by roughly one year, with progress in multimodal learning and mixture-of-experts architectures [10].
Group 4: Industry Applications and Innovations
- Companies are launching AI companions and toys, such as Miko's AI partner and Curio's AI toys for children, signaling the integration of emotional AI into consumer products [12].
- ByteDance plans to leverage emotional large models to double the monthly active users of its product "Doubao" by 2025, focusing on entertainment, social interaction, and personalized services [14].
Group 5: Future Directions and Challenges
- The trend is expected to accelerate consumer-robot upgrades, with global shipments projected to reach 47 million units in 2024 and a compound growth rate above 20% over the next five years [16].
- Challenges remain, including non-linear growth in compute demand, long-term memory, and data privacy, which could act as barriers to entry or as protective moats for businesses [16].
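The response loop these summaries describe, estimating the user's emotional state from conversational cues and then conditioning the reply on it, can be sketched minimally. Everything below (the keyword cues, the style table, the function names) is an illustrative toy, not any vendor's actual method; a production system would extract tone and pause features from audio and video rather than keywords.

```python
# Toy emotion-conditioned response loop; all cues and styles are
# invented for illustration.
CUES = {
    "sad": {"lonely", "miss", "cry"},
    "angry": {"hate", "unfair", "furious"},
}

def detect_emotion(utterance):
    """Keyword overlap stands in for real multimodal emotion
    recognition over tone, pauses, and expressions."""
    words = set(utterance.lower().split())
    for emotion, cues in CUES.items():
        if words & cues:
            return emotion
    return "neutral"

def reply_style(utterance):
    """Pick a response style conditioned on the detected emotion;
    a real system would pass this to a generative model."""
    styles = {"sad": "comforting", "angry": "de-escalating",
              "neutral": "informative"}
    return styles[detect_emotion(utterance)]
```

The point of the sketch is the control flow: emotion estimation sits in front of generation, so the same prompt can yield differently styled replies.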
字节藏了一手“牌”
Hu Xiu· 2025-07-12 07:27
Core Insights
- ByteDance is focusing on "emotional large models," offering API access and AI dialogue solutions to enterprises, a strategic shift toward enhancing users' emotional experience in AI interactions [1][2][4].
- "Emotional large models" are seen as a significant AI trend, moving from mere tools to emotional companions and opening new application scenarios [5][7].
Group 1: Emotional Large Models Overview
- "Emotional large models" differ from traditional chatbots by emphasizing emotional understanding and user experience, using voice tone, pauses, and expressions to generate appropriate responses [3][4].
- Their technical evolution follows two paths: adding multimodal affective-computing capabilities to general-purpose large models, and building generative models dedicated to emotional applications [5][6].
Group 2: Market Trends and Growth Potential
- The AI companionship market is expected to grow explosively: active users increased 30-fold from 2018 to 2023, and the global market size is projected to rise from $30 million to $150 billion between 2023 and 2030, a CAGR of 236% [7].
- Character.AI exemplifies the potential of "emotional large models" with interactive AI character experiences, reflected in substantial mobile downloads and web traffic [8][10].
Group 3: Technical Aspects and Challenges
- "Emotional large models" require more NLP expertise and have different parameter and compute requirements than traditional models, with training taking 30%-50% more computational power [10][11].
- Domestic "emotional large models" currently trail international counterparts by roughly one year [11].
Group 4: ByteDance's Strategic Positioning
- ByteDance plans to leverage various vertical large models to double the monthly active users of its product Doubao by 2025, focusing on entertainment, social, and gaming scenarios [14].
- Integrating "emotional large models" with hardware such as smart speakers and AI companions is part of ByteDance's strategy to enhance user interaction and experience [14][15].
具身数采方案一览!遥操作和动捕的方式、难点和挑战(2w字干货分享)
自动驾驶之心· 2025-07-10 12:40
Core Viewpoint
- The article discusses the significance of teleoperation (遥操作) for embodied intelligence, covering its historical roots and its contemporary role in robotics and data collection [3][15][17].
Group 1: Understanding Teleoperation
- Teleoperation is not a new concept; it has existed for decades, primarily in military and aerospace applications [8][10].
- Practical examples include surgical robots and remote-controlled excavators [8][10].
- Ideal teleoperation involves spatial separation, letting operators control robots from a distance; the value lies precisely in that separation [10][15].
Group 2: Teleoperation Experience
- Various types of teleoperation experience were shared, with a focus on the comfort of different methods [19][20].
- The most comfortable method identified is pure vision-based inverse kinematics (IK), which allows greater freedom of movement than rigid control systems [30][28].
Group 3: Future of Teleoperation
- Visions for future systems call for a complete control loop covering both human-to-machine and machine-to-human interaction [33][34].
- Pure virtual and pure physical solutions were both explored; future systems may integrate the two for the best user experience [37][39].
Group 4: Data Collection and Its Importance
- Teleoperation is crucial for data collection, which in turn is essential for training robots to imitate human actions [55][64].
- The concept of "borrowing the false to cultivate the true" was introduced: teleoperation is valuable as a means to the real end of better data collection for robotics [64][65].
Group 5: Implications for Robotics
- The emerging "robot cockpit" concept points toward more intuitive robot control, integrating multiple functions into a cohesive interface [67][70].
- Controlling a robot's many joints remains difficult, calling for innovative hardware and interaction design to manage complex operations [68][70].
Group 6: Motion Capture and Its Challenges
- Motion capture systems are essential for teleoperation but face challenges in precision and setup complexity [93][95].
- Human adaptability matters: users can adjust effectively to a variety of input methods [80][81].
Group 7: ALOHA System Innovations
- The ALOHA system is a significant innovation in teleoperation, combining a minimal hardware configuration with an end-to-end algorithmic framework [102][104].
- It has prompted the industry to rethink robot design and operational paradigms, suggesting long-term impact [103][104].
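The "pure visual inverse kinematics" the article singles out depends on solving IK from a tracked end-effector pose. As a self-contained illustration of what an IK solve looks like, here is the closed-form solution for a planar 2-link arm; the link lengths and the elbow-down branch choice are assumptions for the example, not any product's solver.

```python
import math

def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Analytic IK for a planar 2-link arm: return joint angles
    (theta1, theta2) in radians that place the end effector at
    (x, y), using the elbow-down solution branch."""
    r2 = x * x + y * y
    # Law of cosines gives the elbow angle from the target distance.
    c2 = (r2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target out of reach")
    theta2 = math.acos(c2)
    k1 = l1 + l2 * math.cos(theta2)
    k2 = l2 * math.sin(theta2)
    theta1 = math.atan2(y, x) - math.atan2(k2, k1)
    return theta1, theta2

def forward(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics, useful to verify an IK solution."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y
```

Vision-based teleoperation chains a pose tracker in front of a solver like this (in practice a numerical one over many joints), which is why it feels less constraining than rigid master-slave hardware.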
VLA统一架构新突破:自回归世界模型引领具身智能
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article presents UniVLA, a new unified Vision-Language-Action (VLA) model architecture that integrates visual, language, and action signals for improved decision-making in embodied intelligence tasks [4][5][13].
Group 1: Model Architecture and Mechanism
- UniVLA is built on a fully discrete, autoregressive mechanism that natively models visual, language, and action signals, incorporating world-model training to learn temporal structure and causal logic from large-scale video [5][9][14].
- The framework converts visual, language, and action signals into discrete tokens and arranges them into interleaved multimodal temporal sequences for unified modeling [9][10].
Group 2: Performance and Benchmarking
- UniVLA sets new state-of-the-art (SOTA) records on major embodied intelligence benchmarks such as CALVIN, LIBERO, and SimplerEnv, demonstrating strong performance advantages [18][21].
- On the CALVIN benchmark, UniVLA achieves an average score of 95.5%, significantly outperforming previous models [19].
Group 3: Training Efficiency and Generalization
- The world-model post-training stage substantially improves downstream decision-making without relying on extensive action data, learning efficiently from video alone [14][15].
- The model supports unified training across visual understanding, video generation, and action prediction, showcasing its versatility and data scalability [10][24].
Group 4: Future Directions
- The article suggests integrating the UniVLA framework more deeply with multimodal reinforcement learning to strengthen its perception, understanding, and decision-making in open-world scenarios [24].
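The interleaved token layout described above can be sketched in a few lines. The ordering shown (language prompt first, then per-timestep visual tokens followed by action tokens) and the token values are assumptions for illustration, not UniVLA's actual tokenizer output.

```python
def build_sequence(text_tokens, frame_tokens, action_tokens):
    """Flatten language, per-frame visual, and per-step action tokens
    into one sequence for unified autoregressive (next-token)
    modeling across modalities."""
    seq = list(text_tokens)
    for vis, act in zip(frame_tokens, action_tokens):
        seq.extend(vis)   # discrete visual tokens for timestep t
        seq.extend(act)   # discrete action tokens for timestep t
    return seq
```

For example, `build_sequence([1, 2], [[10, 11], [12, 13]], [[20], [21]])` yields `[1, 2, 10, 11, 20, 12, 13, 21]`: one flat stream a single autoregressive transformer can model, which is what lets the same network serve understanding, video generation, and action prediction.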
筹备了半年!端到端与VLA自动驾驶小班课来啦(一段式/两段式/扩散模型/VLA等)
自动驾驶之心· 2025-07-09 12:02
Core Viewpoint
- End-to-end autonomous driving is the core algorithm direction for the next generation of mass-produced intelligent driving, marking a significant shift toward more integrated and efficient systems [1][3].
Group 1: End-to-End Autonomous Driving Overview
- End-to-end approaches fall into single-stage and two-stage categories; single-stage methods model vehicle planning and control directly from sensor data, avoiding the error accumulation of modular pipelines [1][4].
- The emergence of UniAD set off a new wave of competition in the sector, with many algorithms developing rapidly in response to its success [1][3].
Group 2: Challenges in Learning and Development
- Rapid technical progress has made earlier educational resources outdated, creating a need for updated learning paths covering multimodal large models, BEV perception, reinforcement learning, and more [3][5].
- Beginners face knowledge fragmented across many fields, making it difficult to extract frameworks and follow development trends [3][6].
Group 3: Course Structure and Content
- The course on end-to-end and VLA autonomous driving addresses these challenges with a structured learning path combining practical applications and theoretical foundations [5][7].
- The curriculum covers the history and evolution of end-to-end algorithms, the background needed to understand current technologies, and practical work with a range of models [8][9].
Group 4: Key Technologies and Innovations
- The course highlights significant advances in both two-stage and single-stage end-to-end methods, including notable algorithms such as PLUTO and DiffusionDrive at the research frontier [4][10][12].
- Integrating vision-language-action (VLA) large models into end-to-end systems is emphasized as a critical direction, with companies actively exploring next-generation mass-production solutions [13][14].
Group 5: Expected Outcomes and Skills Development
- On completion, participants are expected to reach a level comparable to one year of experience as an end-to-end autonomous driving algorithm engineer, mastering the main methodologies and key technologies [22][23].
- The course aims to equip participants to apply what they learn to real-world projects, enhancing their employability in the autonomous driving sector [22][23].
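The single-stage vs. two-stage distinction is architectural: whether an explicit, hand-defined intermediate scene representation sits between perception and planning. A deliberately toy sketch of that contrast; the obstacle logic and function names are invented for illustration and bear no relation to any real driving stack.

```python
# Two-stage: an explicit intermediate representation couples the
# stages, so perception errors propagate into planning.
def perceive(frame):
    """Stage 1: build an explicit scene representation (here, just
    which sensor readings count as obstacles)."""
    return {"obstacles": [v for v in frame if v > 0.5]}

def plan(scene):
    """Stage 2: plan from the intermediate representation only."""
    speed = 0.0 if scene["obstacles"] else 1.0
    return [speed] * 3  # toy 3-waypoint speed profile

def two_stage(frame):
    return plan(perceive(frame))

# Single-stage: map raw sensor input directly to a trajectory, with
# no hand-defined interface in between (in practice, one network
# trained end to end).
def single_stage(frame):
    speed = 0.0 if any(v > 0.5 for v in frame) else 1.0
    return [speed] * 3
```

In the modular version, a miss in `perceive` is unrecoverable downstream; the single-stage mapping avoids that fixed interface, which is the error-accumulation argument the summary makes.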
「世界模型」也被泼冷水了?邢波等人揭开五大「硬伤」,提出新范式
机器之心· 2025-07-09 07:10
Report by 机器之心; editors: 泽南, +0

Today's world models deserve some criticism.

Large language models (LLMs) produce output by predicting the next word of a conversation, and the resulting dialogue, reasoning, and even creative abilities approach human-level intelligence. Yet models like ChatGPT still fall visibly short of true AGI. If we could perfectly simulate every possible future in an environment, would that be enough to create powerful AI? Consider humans: unlike ChatGPT, human ability is organized into distinct concrete skills and deep, complex capabilities.

An example of simulated reasoning: a person (perhaps acting out of self-interest) helps someone who is crying by mentally simulating several possible outcomes.

Humans can perform a wide range of complex tasks, all on the same underlying cognitive architecture. Could a single AI system likewise accomplish all of them?

Paper: Critiques of World Models
Paper link: https://arxiv.org/abs/2507.05169

The researchers identify five key aspects of building and training a world model: 1) identifying and preparing training data that contains information about the target world; 2) adopting a general representation space for latent world states, whose meaning can be richer than the directly observed data; 3) designing architectures that can reason effectively over those representations; 4) choosing objective functions that correctly guide model training; ...
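The design choices the paper enumerates (data, latent representation space, architecture, objective) all parameterize one common loop: encode observations into a latent state, predict transitions there, and score the predictions. A deliberately trivial sketch of that loop, with a 1-D identity encoder and additive dynamics invented purely for illustration:

```python
# Minimal world-model rollout skeleton; every function here is a toy
# stand-in for a learned component.
def encode(obs):
    """Map a raw observation to a latent state (identity here; a real
    model learns a richer representation than the raw data)."""
    return obs

def transition(z, action):
    """Predict the next latent state; toy additive dynamics."""
    return z + action

def rollout(obs, actions):
    """Simulate forward entirely in latent space, never returning to
    raw observations: the core of simulation-based reasoning."""
    z = encode(obs)
    states = []
    for a in actions:
        z = transition(z, a)
        states.append(z)
    return states
```

The paper's critique targets each slot in this skeleton: what data trains `encode`, how expressive the latent space is, what architecture implements `transition`, and what objective supervises the rollout.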
具身智能论文速递 | 强化学习、VLA、VLN、世界模型等~
具身智能之心· 2025-07-08 12:54
Core Insights
- The article covers advances in Vision-Language-Action (VLA) models via reinforcement learning (RL), specifically the Proximal Policy Optimization (PPO) algorithm, which significantly enhances these models' generalization capabilities [2][4].
Group 1: VLA Model Enhancements
- Applying PPO raised task success rates in out-of-distribution (OOD) scenarios by 42.6% [2].
- Semantic-understanding success rates on unseen objects improved from 61.5% to 75.0% [2].
- In dynamic-interference scenarios, success rates surged from 28.6% to 74.5% [2].
Group 2: Research Contributions
- A rigorous benchmark was established to evaluate how VLA fine-tuning methods affect generalization across visual, semantic, and execution dimensions [4].
- PPO was found to outperform other RL algorithms such as GRPO and DPO for VLA fine-tuning, with discussion of adapting these algorithms to VLA's particular needs [4].
- An efficient PPO-based fine-tuning scheme was developed, using a shared actor-critic backbone network, VLA model warm-up, and a small number of PPO training iterations [4].
- RL fine-tuning generalized better than supervised fine-tuning (SFT) on semantic understanding and embodied execution, while maintaining comparable visual robustness [4].
Group 3: NavMorph Model
- NavMorph, a self-evolving world model for vision-and-language navigation in continuous environments, achieved a 47.9% success rate in unseen environments [13][15].
- It combines a World-aware Navigator, which infers dynamic representations of the environment, with a Foresight Action Planner, which optimizes navigation strategies through predictive modeling [15].
- On mainstream VLN-CE benchmark datasets, NavMorph significantly improved the performance of leading models, validating its adaptability and generalization advantages [15].
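At the core of the PPO fine-tuning described above is the clipped surrogate objective, which limits how far each update can move the policy from the one that collected the data. A minimal per-sample version; the epsilon value and sign convention are the standard textbook ones, not figures from the paper.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for one (state, action) sample,
    written as a loss to minimize. `ratio` is pi_new(a|s)/pi_old(a|s)
    and `advantage` comes from the critic (in the scheme the article
    describes, a critic sharing the VLA actor's backbone)."""
    unclipped = ratio * advantage
    # Clamp the ratio to [1-eps, 1+eps] before weighting the
    # advantage, then take the pessimistic (lower) objective.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)
```

The clipping is what makes PPO stable enough to fine-tune a large pretrained policy in relatively few iterations: once the ratio leaves the trust interval, the gradient through the clipped term vanishes.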
写了两万字综述 - 视频未来帧合成:从确定性到生成性方法
自动驾驶之心· 2025-07-08 12:45
Core Insights
- The article surveys Future Frame Synthesis (FFS): generating future frames from existing content, emphasizing the synthesis aspect and broadening the scope of video frame prediction [2][5].
- It traces the transition from deterministic methods to generative approaches, underscoring the growing importance of generative models for realistic and diverse predictions [5][10].
Group 1: Introduction to FFS
- FFS aims to generate future frames from a sequence of historical frames, or even a single context frame; this learning objective is seen as a core component of building world models [2][3].
- The key challenge is designing models that balance complex scene dynamics and temporal coherence while minimizing inference latency and resource consumption [2][3].
Group 2: Methodological Approaches
- Early FFS methods followed two main designs: pixel-based methods, which struggle with objects appearing and disappearing, and methods that generate future frames from scratch, which often lack high-level semantic context [3][4].
- The article categorizes FFS methods into deterministic, stochastic, and generative paradigms, each representing a different modeling approach [8][9].
Group 3: Challenges in FFS
- Long-standing challenges include designing algorithms that balance low-level pixel fidelity with high-level scene understanding, and the lack of reliable metrics for perceptual quality and stochasticity [11][12].
- The scarcity of high-quality, high-resolution datasets limits current video synthesis models' ability to handle diverse and unseen scenarios [18][19].
Group 4: Datasets and Their Importance
- Video synthesis models depend heavily on the diversity, quality, and characteristics of their training datasets; higher-dimensional datasets provide greater variability and stronger generalization [21][22].
- The article summarizes widely used video synthesis datasets, highlighting their scale and available supervision signals [21][24].
Group 5: Evaluation Metrics
- Traditional low-level metrics such as PSNR and SSIM tend to reward blurry predictions, prompting researchers to explore metrics that align better with human perception [12][14].
- Recent comprehensive evaluation suites such as VBench and FVMD assess video generation models along multiple axes, including perceptual quality and motion consistency [14][15].
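Group 5's point about PSNR follows directly from its definition: it is a monotone function of mean squared error, so a blurry average of plausible futures can outscore any single sharp one. A minimal implementation over flat lists of pixel values (the 8-bit `max_val` default is an assumption of the example):

```python
import math

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length flat
    pixel lists; higher is better, infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because only the per-pixel squared error enters the score, a prediction that hedges between two possible futures minimizes expected MSE while looking blurry, which is exactly why the survey turns to perceptual and distribution-level metrics.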
独家对话「杭州六小龙」云深处CEO:人形机器人进家干活还要10年
36氪· 2025-07-08 09:18
The following article is from 智能涌现 (author: 苏建勋, 富充), a 36氪 account covering the industrial revolution emerging in the AI era.

Give a robot a "world model," and it no longer needs so much data.

Interview | 苏建勋, 杨轩; Text | 富充; Editor | 苏建勋
Source | 智能涌现 (ID: AIEmergence); Cover image | company official

Amid the bustle of the embodied intelligence industry, Hangzhou-based 云深处科技 (Deep Robotics) is remarkably low-key. Even after media listed it among the "Hangzhou Six Little Dragons" early this year, founder 朱秋国 rarely appears in public. His personal video channel contains nothing about himself; its most-liked video shows the company's quadruped robot "绝影" climbing stairs and running over obstacles.

"This is awesome — three robot dogs running with different movements, autonomously deciding which strategy to use," one viewer commented excitedly. 朱秋国 did not reply.

True to the company's name ("deep in the clouds"), 朱秋国 believes climbing the technology curve requires quiet persistence before the sudden clarity arrives, in the spirit of the Tang poem: "far up the cold mountain the stone path slants; deep in the white clouds there are homes." A colleague told us: "朱老师 likes to put the company and its products forward and step back himself."

Yet even with its preference for staying out of sight, eight-year-old 云深处 has been pushed to center stage by today's wave of embodied intelligence. 智能涌现 has learned that 云深处科技 announced the completion of a new funding round of nearly 500 million yuan. This round ...