Autonomous Driving Perception in the End-to-End Era
自动驾驶之心· 2025-12-05 00:03
**Core Insights**
- The article discusses the resurgence of end-to-end (E2E) perception in the autonomous driving industry, highlighting its impact on the field and the shift from traditional modular approaches to more integrated solutions [4][5][9].

**Group 1: End-to-End Revival**
- End-to-end is not a new technology; it was initially hoped that a neural network could map camera images directly to output trajectories, but stability and safety were issues [9].
- The traditional pipeline of localization, perception, planning, and control has been the mainstream approach, but advances in BEV perception and Transformer architectures have revived end-to-end methods [9].
- Companies are now exploring various one-stage and two-stage solutions, with a focus on neural-network-based planning modules [9].

**Group 2: Perception Benefits in End-to-End**
- In traditional frameworks, perception aimed to gather as much accurate scene information as possible for planning, but this modular design limited its ability to serve planning's actual needs [11].
- Current mainstream end-to-end solutions still follow this approach, treating the various perception tasks as auxiliary losses [13].
- The key advantage of end-to-end is the shift from exhaustive perception to "planning-oriented" perception, enabling a more efficient, demand-driven approach [14][15].

**Group 3: Navigation-Guided Perception**
- The article introduces a navigation-guided perception model, which argues that perception should be guided by navigation information, much as human drivers focus on the scene elements relevant to their driving intent [16][18].
- A Scene Token Learner (STL) module is proposed to efficiently extract scene features from BEV characteristics, integrating navigation information to enhance perception [18][19].
- The SSR framework demonstrates that only 16 self-supervised queries can effectively represent the perception information needed for planning, significantly reducing complexity compared to traditional methods [22].

**Group 4: World Models and Implicit Supervision**
- The article discusses the potential of world models to replace traditional perception tasks by providing implicit supervision for scene representation [23][21].
- The SSR framework aims to deepen scene understanding through self-supervised learning, predicting future BEV features to improve scene-query comprehension [20][21].
- The design allows efficient trajectory planning while maintaining the consistency needed for model convergence during training [20].

**Group 5: Performance Metrics**
- The SSR framework outperforms various state-of-the-art (SOTA) methods in both efficiency and performance, achieving significant improvements on metrics such as L2 distance and collision rate [24].
- Its design reduces the number of queries needed for effective scene representation, showcasing scalability and efficiency [22][24].
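The navigation-guided scene-token idea above (a handful of learnable queries cross-attending to BEV features, biased by driving intent) can be sketched in a few lines. This is a minimal NumPy illustration of the mechanism, not the SSR paper's implementation; all shapes, names, and the additive navigation bias are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scene_token_learner(bev_feats, nav_embed, queries):
    """A small set of learnable queries cross-attends to flattened BEV
    features, biased by a navigation embedding (illustrative sketch)."""
    q = queries + nav_embed                                  # bias queries by driving intent
    attn = softmax(q @ bev_feats.T / np.sqrt(q.shape[-1]))   # (16, H*W) attention over BEV cells
    return attn @ bev_feats                                  # (16, d) compact scene tokens

rng = np.random.default_rng(0)
d = 64
bev = rng.normal(size=(50 * 50, d))   # flattened 50x50 BEV grid (hypothetical size)
queries = rng.normal(size=(16, d))    # 16 queries, matching the SSR result above
nav = rng.normal(size=(d,))           # stand-in embedding of e.g. "turn left"
tokens = scene_token_learner(bev, nav, queries)
print(tokens.shape)  # (16, 64)
```

The point of the sketch is the bottleneck: downstream planning sees only the 16 tokens, so perception capacity is spent where the navigation signal directs it rather than on exhaustive scene reconstruction.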
ByteDance On-Device AI Progress Exchange
2025-12-04 15:36
**Summary of ByteDance's AI and Mobile Strategy Conference Call**

**Company Overview**
- **Company**: ByteDance
- **Focus**: AI development, hardware ecosystem expansion, and mobile technology

**Key Points on AI Strategy**
- ByteDance is focusing on three main areas in AI: general AI (AGI), embodied intelligence, and world models, managed by four teams: the C team, Follow team, Stone team, and Cici team [1][2]
- The C and Follow teams are responsible for 80% of product and model development, with over 1,200 and 1,000 personnel respectively [2]
- The company aims to generate revenue primarily from B-end business by providing AI solutions, custom development, and private deployment, rather than direct monetization of C-end traffic [1][7]

**Financial Projections and Capital Expenditure**
- ByteDance expects capital expenditure to reach CNY 160 billion in 2025, with CNY 90 billion allocated to GPU purchases, primarily from NVIDIA (75%) and domestic suppliers [1][5][6]
- Total computing power is quoted as equivalent to 1.1 million H100 GPU cards, with current total computing power of 147.5 billion FLOPS [7]
- For 2026, projected capital expenditure is CNY 220 billion, with 70% for GPU purchases and 30% for building supercomputing centers [5]

**AI Mobile Phone Development**
- ByteDance plans to launch an AI phone in collaboration with ZTE and Nubia, targeting production of over 1 million units by Q1 or Q2 of 2026 [1][9][11]
- The AI phone targets the global AI mobile market, projected to reach 80 million units in 2026, with a target market share of 5% (500,000 units) [3][15]
- The phone will use a Snapdragon 8 chip with 400 TOPS of computing power, and the production model is expected to reach 800 TOPS [3][25]

**Competitive Landscape and Market Positioning**
- Volcano Engine, ByteDance's cloud service, aims to differentiate itself from Alibaba Cloud by focusing on diverse AI processing solutions and computing services, with expected revenue exceeding CNY 50 billion in 2025 [8]
- The AI phone is part of a broader strategy to enhance user experience and integrate AI into daily life, aiming to shift user habits from touch to voice interaction [9][24]

**Technical Challenges and User Feedback**
- ByteDance faces several technical challenges, including weak semantic understanding, high latency in edge models, and issues with cross-application operations [16][18]
- User feedback highlights concerns over semantic understanding, multi-turn dialogue coherence, and hardware resource consumption [18]
- The company is actively addressing over 3,400 bugs and releasing updates every two days [18]

**Future Outlook**
- ByteDance's AI assistant aims to reshape the mobile operating system's traffic entry points, potentially disrupting existing platforms by providing services without app installation [27]
- The competitive landscape for AI phones remains uncertain, with major players like Alibaba, Xiaomi, Huawei, and Tencent also vying for market share [28]

**Conclusion**
- ByteDance is strategically positioning itself in the AI and mobile markets through significant investment, innovative product development, and a focus on B-end revenue, while navigating technical challenges and competitive dynamics in the industry.
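The capital-expenditure split quoted above is simple arithmetic; a quick sanity check (all base figures are the ones quoted in the call summary, in CNY billions):

```python
# Figures as quoted in the call summary (CNY billions); the derived splits
# are just arithmetic on those quotes, not additional sourced numbers.
capex_2025, gpu_2025 = 160, 90
other_2025 = capex_2025 - gpu_2025     # implied non-GPU spend for 2025

capex_2026 = 220
gpu_2026 = capex_2026 * 70 // 100      # 70% earmarked for GPU purchases
dc_2026 = capex_2026 - gpu_2026        # remaining 30% for supercomputing centers

print(other_2025, gpu_2026, dc_2026)   # 70 154 66
```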
We Stand at the Center of the Surging Waves | Join 拾象
海外独角兽· 2025-12-04 11:41
About Us: We are a team obsessed with AI and foundation models. In the fall of 2022, we saw the spark of AI in Silicon Valley and have focused exclusively on AI research ever since. Focusing on researching and investing in AI has brought us decent results: we manage over USD 1.5 billion in AUM, including a USD 500 million long-horizon fund currently deploying, with coordinated primary- and secondary-market strategies and enough ammunition to seize AI opportunities. We have invested in and watched 6 portfolio companies grow from billions or tens of billions of dollars into hundred-billion-dollar companies. That is the meaning behind the name 拾象 ("picking up elephants"): we study only the world's most important technological shifts and invest in companies with elephant-scale potential. Several CEOs and leaders of hundred-billion-dollar businesses provide us with insight, helping our portfolio succeed at AI and globalization. Through 海外独角兽 and our AI discussion community, we keep debating the important questions, have helped and influenced Chinese founders in both China and the US, and have earned valuable trust among AI practitioners. Now we invite you to join us in global AI investing, to capture the big opportunities together and become the best pitchers in AI. We are a young (average age under 30), flat, high-talent-density team that values high trust and low ego, with extremely transparent information internally and a lively discussion culture. The qualities we especially like: for AI ...
The 8th GAIR Global Artificial Intelligence and Robotics Conference: Agenda Officially Announced
雷峰网· 2025-12-04 10:04
**Core Insights**
- The article emphasizes the transformative impact of AI on education, industry paradigms, and computational frameworks, highlighting the upcoming GAIR 2025 conference as a pivotal event for discussing these changes [2][22].

**Event Overview**
- The GAIR 2025 conference will take place on December 12–13, 2025, at the Sheraton Hotel in Shenzhen, featuring a new agenda and deeper industry discussions [2][22].
- The conference will offer 20 free tickets to loyal readers, available on a first-come, first-served basis [2][22].

**Conference Agenda Highlights**
- Specialized sessions will cover topics such as the redefinition of education through AI, paradigm shifts across fields, and advances in AI computing power [7][17][25].
- Notable speakers include prominent figures from academia and industry, such as Zhao Wei, Guo Yike, and Kazuhiro Kosuge, who will present on various AI-related topics [27][31][32].

**Key Themes**
- The conference themes focus on AI in education, paradigm reconstruction, world models, and AI chips and computing power [25][22].
- The event aims to gather over 50 academicians, 300 young scholars, and 1,000 industry elites to explore the future of AI [25].
The World Is Too Small for All These World Models
36Kr· 2025-12-04 09:29
World models have become as chaotic as the world itself.

OpenAI points at Sora-generated videos and calls them a "world simulator"; Yann LeCun points at Sora and calls it pixel hallucination, insisting a real world model should be an "abstract brain that predicts the future"; Google DeepMind says Genie3 is an "interactive, general-purpose world model"; and Fei-Fei Li says "spatial intelligence" is the right answer.

The real world is singular and objective, yet everyone in the AI community seems to be building a "world model" of their own.

Although their definitions point in opposite directions, these loudly feuding heavyweights agree on one basic judgment: large language models will sooner or later hit a ceiling, and world models are the necessary road to AGI.

Large language models went through parameter inflation after GPT-3.5; world models, before their technical routes have even converged, are first going through conceptual inflation.

A "world model" is a catch-all basket: anything can be thrown in

The chaos around "world models" is rooted in the fact that the term names a goal, namely giving AI the ability to understand the laws of the external world and predict how the world will change, rather than a concrete technical path.

The first thing to fall into chaos was the concept itself.

The idea behind world models can be traced back to 1943, when cognitive scientist Kenneth Craik proposed the "mental model": the brain makes predictions by building a miniature model of the external world. In other words, each of us carries a mental model that can not only process what we currently see ...
Crushing π0.5: Fudan Team Debuts the First Closed-Loop "World Model + Embodied Training + Reinforcement Learning" Framework
机器之心· 2025-12-04 08:18
**Core Viewpoint**
- The Vision–Language–Action (VLA) paradigm is becoming a crucial technological pathway for robots to achieve general operational intelligence, enabling simultaneous processing of visual perception and language instructions and the generation of continuous control signals [2].

**Group 1: Challenges in Current VLA Approaches**
- Most current VLA methods rely heavily on imitation learning, which can lead to error accumulation and task failure under distribution shift or changed task forms [3][11].
- Running online reinforcement learning (RL) on real robots is costly and requires extensive human intervention and monitoring, making large-scale deployment impractical [12].
- Traditional physics engines struggle to balance realism, scene diversity, and engineering usability, complicating the use of RL in simulated environments [13].

**Group 2: ProphRL Framework**
- The research team proposed the ProphRL framework, which uses a large-scale pre-trained world model called Prophet as a video-level simulator to optimize VLA policies with online RL algorithms [4].
- This approach sharply reduces real-world interaction costs while maintaining physical credibility, facilitating the practical deployment of large-model VLA policies [4].

**Group 3: Experimental Results**
- ProphRL improved success rates by 5–17% across various VLA models on public benchmarks, with real-robot experiments showing a substantial success-rate increase of 24–30% [8].
- The Prophet model achieved leading visual fidelity and action consistency across multiple datasets, generalizing to new scenes and tasks with minimal fine-tuning [31].

**Group 4: Innovations in RL Algorithms**
- The research introduced FA-GRPO and FlowScale, RL algorithms tailored to flow-based action heads, which improve training stability and performance by reorganizing gradient signals and balancing contributions from different steps [26][27].
- A video-language reward model was developed to judge task success from the entire trajectory, moving away from manually designed geometric distances [26].

**Group 5: Real-World Validation**
- The ProphRL framework was validated on real robots, achieving significant improvements in task success rates across complex tasks, indicating the effectiveness of integrating a world model with RL in practical applications [38].
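The "world model as video-level simulator" loop described above can be shown schematically: the policy rolls out inside the learned model rather than a physics engine, and a trajectory-level reward model scores the whole rollout. This is a toy Python sketch with stand-in components; none of the names or signatures come from the ProphRL code.

```python
def world_model_rollout(policy, world_model, reward_model, init_obs, horizon=8):
    """ProphRL-style loop (schematic): act inside a learned world model
    instead of the real world, then score the full trajectory."""
    obs, traj = init_obs, []
    for _ in range(horizon):
        action = policy(obs)             # VLA policy proposes an action
        obs = world_model(obs, action)   # world model predicts the next observation
        traj.append((obs, action))
    return traj, reward_model(traj)      # trajectory-level reward (e.g. a video-language model)

# Toy stand-ins so the sketch runs end to end.
policy = lambda obs: obs * 0.5
world_model = lambda obs, act: obs + act
reward_model = lambda traj: float(len(traj))   # placeholder "success" score
traj, reward = world_model_rollout(policy, world_model, reward_model, init_obs=1.0)
print(len(traj), reward)  # 8 8.0
```

In the real framework the reward would feed an RL update such as FA-GRPO; the sketch only shows why no real-robot interaction is needed during policy optimization.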
From LLM to World Model: Why Do We Need Spatial Intelligence That Can Understand and Operate on the World?
海外独角兽· 2025-12-03 12:05
Compiled by: Haozhen, Gemini

LLMs' language understanding and generation abilities have shown astonishingly broad applicability, but as LLMs develop, one fact becomes increasingly clear: language alone is not enough to support true intelligence.

Viewed more fundamentally, humans have never processed the world through words alone; our full cognitive system is built jointly from vision, spatial perception, physical intuition, and the ability to act. Language is only a "lossy compression" of the three-dimensional world: it records conclusions but omits process; it expresses structure but hides dynamics. True intelligence comes from continuously interacting with the world and continuously reasoning and acting in space.

For exactly this reason, building spatial intelligence and world models that can "understand and operate on the world" has become the key direction after LLMs.

In 2024, Fei-Fei Li, Justin Johnson, and other scholars founded World Labs, which released Marble, a 3D world generation model, this November. The team is trying to break models out of the "text-only" limitation, giving them the ability to localize, reason, simulate, generate, and even execute tasks in three-dimensional environments. This implies not only a new technical route but also a new yardstick of AI value: from language to the world, from description to interaction, from static cognition to dynamic intelligence.

This article compiles Fei-Fei Li and Justin Johnson ...
Track Divergence Intensifies: AI's Strongest Opportunities Arrive in 2026
36Kr· 2025-12-03 08:57
No longer the patchwork fixes of "AI+", but AI-native rebuilding of systems' underlying logic; no longer confined to generation and understanding in the digital world, but physical AI closing the action loop between the virtual and the real; no longer single modalities fighting alone, but multimodal technology fusing everything; and world models taking AI from "responding from data" to "anticipating from rules."

This transformation of technical architecture, application forms, and cognitive reach has already arrived. Who will become the strongest opportunity, reshaping industries and defining the future?

AI-native sparks a revolution in the foundations of systems and applications

When the iteration speed of algorithms and models outpaces the industry's imagination, and AI leaps from a tool behind the screen to a "participant" permeating reality, 2026 will become a key watershed in AI development.

If "AI+" means "patching" or "bolting" AI features onto existing systems, AI-native means making AI the underlying design logic and capability hub of the system: a system born for AI and growing because of AI, driving an all-round reshaping of technical architecture, business processes, organizational roles, and modes of value creation.

This change is not simple feature stacking; it rebuilds the development paradigm around generative AI, making intelligence a native property of applications rather than an add-on capability. Moving from "AI+" to "AI-native" is becoming the key direction for AI's future development.

| Dimension | Traditional "AI+" architecture | AI-native architecture |
| --- | --- | --- |
| Design starting point | Existing business processes | AI capability boundary |
| Data ... | | |
潮声 | AI Is Sometimes "Dumber" Than Humans: What Piece Is Missing from the AI Puzzle?
Sohu Caijing· 2025-12-03 00:35
**Core Insights**
- The current era of artificial intelligence, dominated by large language models and image classifiers, has reached its limits; AI with spatial intelligence is seen as the next frontier for breaking through this bottleneck [2][11][24]

**Group 1: AI Limitations**
- AI can be split into two types, "talking" intelligence and "doing" intelligence, with the former strong at text output but often failing at practical tasks [6][11]
- Examples of AI failures include generating unrealistic images and videos, highlighting the lack of common sense and physical understanding in current models [7][10]

**Group 2: Spatial Intelligence**
- Spatial intelligence, a concept originating in educational psychology, involves perceiving, understanding, and manipulating spatial information, and is crucial for human development and creativity [12][15]
- Current AI systems lack a deep, common-sense understanding of the physical world, which directly affects the quality of their outputs [11][17]

**Group 3: World Models**
- The concept of world models, inspired by human cognitive abilities, is emerging as a key focus of AI development, aiming to enable machines to understand and interact with the physical world [19][23]
- Recent advances in world models include new products and technologies from companies like NVIDIA and Google DeepMind, indicating growing interest and investment in this area [22][23]

**Group 4: Future Challenges**
- Building AI that can operate like humans presents significant challenges, including the complexity and uncertainty of the real world, limitations of existing data, and the inherent constraints of physical laws [23][24]
Heavily Backed by Huawei, a Leading Embodied-Intelligence Robotics Startup Releases and Open-Sources the "Strongest" Embodied World Model!
Robot猎场备忘录· 2025-12-03 00:03
Heavily backed by Huawei, leading Chinese general embodied-intelligence company GigaAI (极佳视界) has released and open-sourced the industry-leading embodied world model GigaWorld-0!

On December 2, 2025, GigaAI, a head startup in the Physical AI field, released and open-sourced GigaWorld-0. For the first time anywhere, world-model-generated data accounts for 90% of the training data, embodied VLA large-model performance jumps by 300%, and the full-stage training and inference code has been open-sourced alongside the release.

GigaWorld-0 is a world-model framework GigaAI built specifically for VLA training, and the industry's first world model trained end to end in FP8 precision, marking a new, energy-efficient stage for world-model training.

GigaWorld-0 is built from two collaborating components:

Internet data is uneven in quality, and simulation data struggles to generalize across scenes. For humanoid robots, the biggest obstacle on the road to "embodied intelligence" is not the algorithms themselves, but how to obtain large-scale, real ...
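The 90% generated-data share described above implies a training mix of roughly nine world-model samples for every real sample. Here is a minimal sketch of such a mixed sampler; the pools and sampling scheme are pure assumptions for illustration, not GigaWorld-0's actual pipeline.

```python
import random

def mixed_batch(real_pool, generated_pool, batch_size=32, gen_ratio=0.9):
    """Draw a batch that is ~90% world-model-generated and ~10% real data
    (illustrative sampler; not GigaWorld-0's actual code)."""
    n_gen = int(batch_size * gen_ratio)
    batch = random.choices(generated_pool, k=n_gen) + \
            random.choices(real_pool, k=batch_size - n_gen)
    random.shuffle(batch)        # avoid ordering bias within the batch
    return batch

real = [("real", i) for i in range(100)]   # stand-in real demonstrations
gen = [("gen", i) for i in range(1000)]    # stand-in world-model rollouts
batch = mixed_batch(real, gen)
n_gen = sum(1 for tag, _ in batch if tag == "gen")
print(len(batch), n_gen)  # 32 28
```

Fixing the ratio per batch (rather than sampling it stochastically) keeps the scarce real data present in every gradient step.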