VLA / World Model / WA / End-to-End: A Divergence in Promotion, Not in Technical Route
理想TOP2 · 2025-10-25 05:21

Core Viewpoints
- Many people are unaware that there is no universally accepted definition of VLA, world model, or end-to-end [1]
- Leading autonomous driving companies' explorations have far more in common than the differences portrayed online suggest; the core divergence is promotional, not a divergence in technical route [1][2]
- Language plays a significant role in autonomous driving, particularly in long-horizon reasoning, value alignment in user interaction, and understanding the world [1]
- Those who believe that predicting the next token is more than fitting a probability distribution are more likely to accept that language can understand the world [1]

Group 1: VLA/World Model/End-to-End
- VLA, world models, and end-to-end approaches all require the ability to generate road video that looks real, take visual information as input, and ultimately control the vehicle's actions [2]
- The real distinction lies in whether language participates, how deeply, and in what architectural form; the language-related tokens of the future may be an LLM's text tokens or photon tokens [2]
- The narrative that VLA and world models are competing technical routes is misleading: both must generate a model of the world and understand the physical world [4]

Group 2: End-to-End Definitions
- The definition of end-to-end is contested; some hold that it requires a single core framework whose input and output are clearly specified [5]
- Tesla takes visual input and outputs a trajectory rather than direct control signals, which raises the question of what its end-to-end definition really covers [5][6]
- Outputting a precise trajectory is arguably a more effective design than outputting raw control signals directly [6]

Group 3: Tesla's Approach and Future Directions
- Given Tesla's history and style, its definition of end-to-end should not be treated as having universally accepted exclusivity [7]
- In the long run, AI model inputs and outputs may predominantly be photons, which could significantly reduce computational load [10]
- The ideal VLA model has visual or multimodal input, language participation, and ultimately directs actions in the broad sense [11]

Group 4: Understanding Language and AI Potential
- Views on LLMs diverge fundamentally over what it means to predict the next token [12]
- Those who see next-token prediction as more than mere statistics are more inclined to recognize the potential of LLMs and AI [12][19]
- Predicting the next token well implies some understanding of the underlying reality that generated the token, a deeper question than it first appears [18]
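Group 2's distinction between outputting a trajectory and outputting direct control signals can be made concrete with a toy sketch. Everything here is hypothetical (the type names, the placeholder logic, the pure-pursuit-style heuristic); real systems use large neural networks, not these stand-ins. The point is architectural: a trajectory output commits the model only to *where* to go, and a separate, inspectable controller decides *how* to actuate.

```python
# Toy sketch of "vision in, trajectory out" vs. direct control output.
# All names and logic are illustrative placeholders, not any company's design.
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[List[float]]        # stand-in for a camera image
Waypoint = Tuple[float, float]   # (x, y) in the vehicle frame, meters

@dataclass
class ControlSignal:
    steering: float  # radians
    throttle: float  # 0..1

def end_to_end_trajectory(frames: List[Frame]) -> List[Waypoint]:
    """The model's output is a trajectory: where to be over the next seconds."""
    # Placeholder policy: drive straight ahead, one waypoint per future second.
    return [(float(t), 0.0) for t in range(1, 6)]

def tracking_controller(traj: List[Waypoint], speed: float) -> ControlSignal:
    """A separate controller converts the trajectory into actuation commands."""
    x, y = traj[0]
    steering = y / max(x, 1e-6)       # crude heading-error heuristic
    throttle = min(1.0, speed / 30.0)
    return ControlSignal(steering=steering, throttle=throttle)

traj = end_to_end_trajectory(frames=[[[0.0]]])
cmd = tracking_controller(traj, speed=15.0)
print(traj[0], cmd)
```

Splitting the system this way keeps the learned component's output human-checkable (a trajectory can be plotted and validated against the scene), whereas raw steering/throttle outputs are harder to audit, which is one reading of why the article prefers precise-trajectory outputs.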
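The mechanics behind Group 4's dispute are uncontroversial: a language model's "prediction" of the next token is a full probability distribution over the vocabulary (a softmax over logits), not a single guess. The disagreement is over whether fitting that distribution well requires modeling the reality that generated the text. A minimal sketch of the mechanics, with hypothetical logits:

```python
# Next-token prediction is a distribution over the vocabulary, not one token.
import math
from typing import Dict

def next_token_distribution(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over per-token logits."""
    m = max(logits.values())                              # subtract max for stability
    exp = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exp.values())
    return {tok: e / z for tok, e in exp.items()}

# Hypothetical logits after a prefix like "the car slows down because the light is"
logits = {"red": 4.0, "green": 1.0, "broken": 0.5}
probs = next_token_distribution(logits)
print(probs)
```

Whether assigning high probability to "red" here reflects statistics over co-occurring words or a compressed model of traffic lights is exactly the "more than a probability distribution" question the article raises; the code only shows that the output object itself is a distribution.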