Trajectory-Conditioned Sparse Query
Li Auto Drops BEV and Tokenization, Using Sparse Occupancy (OCC) Attention Directly for 4D World Model Prediction
理想TOP2· 2025-12-16 12:44
Core Insights - The article discusses SparseWorld-TC, a model released by Li Auto, and emphasizes its shift from hand-structured pipelines to a more data-driven approach that improves 3D spatial representation and prediction [1].

Group 1: De-quantized Structure - The model replaces discrete tokens with a sparse occupancy representation and operates directly in continuous 3D coordinate space, which improves inference speed and scene reconstruction fidelity [2].

Group 2: Removal of Spatial Mediators - Li Auto's approach drops Bird's Eye View (BEV) projection, which imposes geometric constraints and creates a bottleneck in information flow. Instead, trajectory-conditioned sparse queries extract information directly from multi-view image features [3].

Group 3: Elimination of Temporal Serial Structures - The model adopts a feed-forward, full-attention architecture that outputs multiple future frames in parallel in a single inference pass, significantly improving prediction accuracy and speed over traditional autoregressive methods [4].

Group 4: Inspiration from GPT - The model draws on GPT-style attention mechanisms, aiming to learn 3D spatial physics without the limits of discrete tokenization: the representation keeps continuous physical attributes while still participating efficiently in attention computation [5].
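To make the ideas in Groups 1 to 3 more concrete, here is a minimal NumPy sketch of how trajectory-conditioned sparse queries might read image features directly and emit all future frames in one pass. It is not the SparseWorld-TC implementation. The function `predict_future_occupancy`, the `params` weight names, the tensor shapes, and the random weights are all assumptions made for illustration. The sketch uses a single cross-attention step and leaves out the query-to-query attention that a full-attention model would also include.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def predict_future_occupancy(img_feats, anchors, query_feats, trajectory, params):
    """Hypothetical single-pass predictor (illustrative only).

    img_feats:   (N, D) flattened multi-view image features, no BEV grid in between
    anchors:     (Q, 3) continuous 3D positions of the sparse queries
    query_feats: (Q, D) query embeddings
    trajectory:  (T, 3) planned ego waypoints, one per future frame
    """
    T, Q, D = trajectory.shape[0], anchors.shape[0], query_feats.shape[1]

    # Condition every query on the waypoint of its target frame -> (T, Q, D).
    traj_emb = trajectory @ params["w_traj"]   # (T, D)
    pos_emb = anchors @ params["w_pos"]        # (Q, D)
    queries = query_feats[None] + pos_emb[None] + traj_emb[:, None]

    # All T*Q queries attend to the image features together, so the future
    # frames come out in parallel rather than one autoregressive step at a time.
    q = queries.reshape(T * Q, D) @ params["w_q"]
    k = img_feats @ params["w_k"]
    v = img_feats @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (T*Q, N)
    out = (attn @ v).reshape(T, Q, D)

    # Decode continuous outputs: a 3D offset from each anchor (no discrete
    # token vocabulary) and an occupancy probability per query.
    points = anchors[None] + out @ params["w_off"]                      # (T, Q, 3)
    occ_prob = 1.0 / (1.0 + np.exp(-(out @ params["w_occ"])[..., 0]))   # (T, Q)
    return points, occ_prob, attn


# Toy sizes and random weights, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
N, D, Q, T = 48, 16, 10, 4
shapes = {
    "w_traj": (3, D), "w_pos": (3, D),
    "w_q": (D, D), "w_k": (D, D), "w_v": (D, D),
    "w_off": (D, 3), "w_occ": (D, 1),
}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

points, occ_prob, attn = predict_future_occupancy(
    img_feats=rng.normal(size=(N, D)),
    anchors=rng.uniform(-20.0, 20.0, size=(Q, 3)),
    query_feats=rng.normal(size=(Q, D)),
    trajectory=np.cumsum(rng.normal(size=(T, 3)), axis=0),
    params=params,
)
print(points.shape, occ_prob.shape)
```

In this sketch the output for frame `t` does not depend on the output for frame `t - 1`, which is the property the article contrasts with autoregressive prediction. The predicted points stay as floating-point coordinates throughout, with no quantization into tokens.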