How Many More Years Does Next-Pixel Prediction Need? Google: Five Is Enough

Core Insights
- The article discusses the potential of next-pixel prediction in image recognition and generation, highlighting its scalability challenges compared to natural language processing tasks [6][21].
- It emphasizes that while next-pixel prediction is a promising approach, it requires significantly more computational resources than language models, with a token-per-parameter ratio that is 10-20 times higher [6][15][26].

Group 1: Next-Pixel Prediction
- Next-pixel prediction can be learned in an end-to-end manner without the need for labeled data, making it a form of unsupervised learning [3][4].
- The study indicates that achieving optimal performance in next-pixel prediction requires a higher token-parameter ratio compared to text token learning, with a minimum of 400 for pixel models versus 20 for language models [6][15].
- The research identifies three core questions regarding the evaluation of model performance, the consistency of scaling laws with downstream tasks, and the variation of scaling trends across different image resolutions [7][8].

Group 2: Experimental Findings
- Experiments conducted at a fixed resolution of 32×32 pixels reveal that the optimal scaling strategy is highly dependent on the target task, with image generation requiring a larger token-parameter ratio than classification tasks [18][22].
- As image resolution increases, the model size must grow faster than the data size to maintain optimal scaling, indicating that computational capacity is the primary bottleneck rather than data availability [18][26].
- The study shows that while the scaling trends for next-pixel prediction can be predicted using established frameworks from language models, the optimal scaling strategies differ significantly between tasks [21][22].
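To make the token-per-parameter figures above concrete, here is a rough Python sketch of what a ratio of about 20 (language models) versus at least 400 (pixel models) implies for a model of a given size. The function names and the 1-billion-parameter example are hypothetical. The compute estimate uses the common rule of thumb C ≈ 6·N·D from the language-model scaling literature, which is an assumption here and not something the article states.

```python
# Illustrative arithmetic only. The two ratios are the figures quoted in the
# summary; everything else (names, model size, the 6*N*D rule) is assumed.

LANGUAGE_TOKENS_PER_PARAM = 20   # quoted ratio for language models
PIXEL_TOKENS_PER_PARAM = 400     # quoted minimum ratio for pixel models


def tokens_needed(num_params: int, tokens_per_param: int) -> int:
    """Training tokens implied by a given token-per-parameter ratio."""
    return num_params * tokens_per_param


def approx_training_flops(num_params: int, num_tokens: int) -> int:
    """Rough training cost using the C ~ 6 * N * D approximation (assumed)."""
    return 6 * num_params * num_tokens


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical 1B-parameter model
    for name, ratio in [("language", LANGUAGE_TOKENS_PER_PARAM),
                        ("pixel", PIXEL_TOKENS_PER_PARAM)]:
        d = tokens_needed(n, ratio)
        print(f"{name}: {d:.2e} tokens, ~{approx_training_flops(n, d):.2e} FLOPs")
```

Under these assumptions, the same parameter count needs 20 times as many training tokens for pixels as for text, which matches the upper end of the 10-20 times range quoted above.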
Group 3: Future Outlook
- The article predicts that next-pixel modeling will become feasible within the next five years due to the rapid growth of training computational power, which is expected to increase by four to five times annually [8][26].
- It concludes that despite the current challenges, the path towards pixel-level modeling remains viable and could achieve competitive performance in the future [26].
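The five-year outlook rests on compounding. As a minimal sketch, assuming the quoted four-to-five-times annual growth compounds evenly every year (a simplification, and the function name is hypothetical), the cumulative increase in training compute works out as follows:

```python
def cumulative_compute_growth(annual_factor: int, years: int) -> int:
    """Total growth multiple after compounding annual_factor for `years` years."""
    return annual_factor ** years


if __name__ == "__main__":
    years = 5
    low = cumulative_compute_growth(4, years)    # 4x per year
    high = cumulative_compute_growth(5, years)   # 5x per year
    print(f"After {years} years: between {low}x and {high}x more compute")
```

Under that assumption, five years of growth yields roughly three orders of magnitude more training compute (4^5 = 1024 to 5^5 = 3125 times), which is the kind of headroom the article's prediction relies on.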