Core Viewpoint
- The article emphasizes the transition from 2D Vision Transformers (ViT) to 3D ViT, highlighting how processing continuous video streams gives autonomous vehicles a better understanding of the physical world and faster response times [1][2].

Group 1: Transition to 3D ViT
- A traditional 2D ViT processes one frame at a time, which limits the information it captures to what a single image contains. A 3D ViT processes video clips, integrating spatial and temporal data for richer feature extraction [1][2].
- The shift to 3D ViT is not merely a change of viewing perspective from 2D to 3D. It changes the dimensions along which features are extracted: height, width, and time [2].

Group 2: Technical Advancements
- The new chip architecture, referred to as a data flow architecture, connects layers directly on the silicon, minimizing reads and writes to external memory and thereby reducing latency [2].
- The company's self-developed chip is designed to be data-driven rather than instruction-driven, which allows higher parallelism, and its hardware and software were designed together from the outset [4].

Group 3: Implications for Autonomous Vehicles
- Faster on-chip processing requires the vehicle control system to respond correspondingly faster, which led to the development of a fully drive-by-wire chassis for the L9 model to match the enhanced processing capabilities [3].
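
The difference between per-frame and per-clip feature extraction described under Group 1 can be illustrated with token-shape arithmetic. The sketch below is a minimal illustration, not the company's implementation: it assumes the common "tubelet" style of video tokenization, and the function names, patch size, tubelet depth, and frame dimensions are all hypothetical values chosen for the example.

```python
def patch_tokens_2d(height, width, channels, patch):
    """Return (token count, values per token) for ONE frame cut into
    patch x patch squares, as a 2D ViT would see it."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    num_tokens = (height // patch) * (width // patch)
    token_dim = patch * patch * channels
    return num_tokens, token_dim


def tubelet_tokens_3d(frames, height, width, channels, patch, tube):
    """Return (token count, values per token) for a CLIP cut into
    tube x patch x patch blocks, so each token spans height, width, and time."""
    if frames % tube:
        raise ValueError("frame count must be divisible by the tubelet depth")
    tokens_per_frame, dim_2d = patch_tokens_2d(height, width, channels, patch)
    num_tokens = (frames // tube) * tokens_per_frame
    token_dim = tube * dim_2d  # each token now also covers `tube` time steps
    return num_tokens, token_dim


# Hypothetical sizes: 224x224 RGB frames, 16x16 patches, 8-frame clip, depth-2 tubelets.
frame_tokens = patch_tokens_2d(224, 224, 3, 16)
clip_tokens = tubelet_tokens_3d(8, 224, 224, 3, 16, 2)
print("2D, one frame:", frame_tokens)
print("3D, one clip: ", clip_tokens)
```

In the 2D case each token describes a spatial patch of a single frame, so motion has to be inferred later from separate frames. In the 3D case each token already contains pixels from consecutive frames, which is one way to read the summary's point that the feature extraction dimensions become height, width, and time.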
Dayu Explains the Underlying Logic Behind the Li Auto L9's Fully Drive-by-Wire Chassis