Test-Time Training (TTT)
EVOLVE-VLA: Test-Time Training for VLA Models to Break Through the Imitation Learning Bottleneck
具身智能之心 · 2025-12-18 00:07
Group 1
- The core challenge facing existing Vision-Language-Action (VLA) models is the limitation of the supervised fine-tuning (SFT) paradigm, which contrasts with human learning, where practice and feedback drive improvement [2][3]
- The proposed solution is a test-time training (TTT) framework that lets VLA models keep learning through environmental interaction, addressing the absence of oracle reward signals during deployment (a toy sketch of such a loop follows this summary) [4][6]

Group 2
- The framework's innovations include a test-time autonomous feedback mechanism that uses a pre-trained progress estimator (VLAC) to provide dense feedback signals, together with strategies for taming the noisy signals inherent in that estimator [4][6]
- The method models robot manipulation tasks as a Markov Decision Process (MDP) and incorporates cumulative progress estimation and progressive horizon expansion to improve learning robustness (see the second sketch below) [6][7]

Group 3
- Experiments show that EVOLVE-VLA reaches an average success rate of 95.8%, a 6.5% improvement over the SFT baseline, with especially large gains on long-horizon tasks [16][18]
- In low-data settings, EVOLVE-VLA improves the success rate by 17.7%, reaching 61.3% with only a single demonstration, underscoring its potential to cut data-collection costs [19][20]

Group 4
- The framework generalizes across tasks, reaching a 20.8% success rate on zero-shot task transfer after autonomous exploration, a notable step forward in task adaptability [22]
- Qualitative analysis reveals emergent capabilities absent from the demonstration data, such as error recovery and state adaptation, demonstrating the model's flexibility [25][27]

Group 5
- The study notes limitations such as misalignment between progress estimates and the environment's success criteria, which can lead to reward hacking or misjudged outcomes [33]
- Future directions include optimizing the reward model for better alignment, speeding up real-time deployment, and strengthening zero-shot generalization [34]
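To make the TTT loop from Groups 1 and 2 concrete, here is a minimal, self-contained toy sketch in Python. It is not the paper's implementation: the 1-D reach environment, the Gaussian policy, and the NoisyProgressEstimator class (standing in for a learned progress model such as VLAC) are all illustrative assumptions. What it does demonstrate is the core idea of replacing the unavailable oracle reward with the change in estimated task progress.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyReachEnv:
    """1-D stand-in task: drive an end-effector from 0 toward a goal at 1."""
    def reset(self):
        self.x, self.t = 0.0, 0
        return self.x

    def step(self, action):
        self.x = float(np.clip(self.x + action, 0.0, 1.0))
        self.t += 1
        done = self.x >= 0.99 or self.t >= 30
        return self.x, done

class NoisyProgressEstimator:
    """Stand-in for a learned progress model (the paper uses VLAC):
    returns task progress in [0, 1] corrupted by estimation noise."""
    def estimate(self, obs):
        return float(np.clip(obs + rng.normal(0.0, 0.05), 0.0, 1.0))

class GaussianPolicy:
    """Step sizes drawn from N(mu, sigma); REINFORCE updates mu."""
    def __init__(self, mu=0.02, sigma=0.05, lr=1e-5):
        self.mu, self.sigma, self.lr = mu, sigma, lr

    def act(self, obs):
        return rng.normal(self.mu, self.sigma)

    def update(self, trajectory):
        # REINFORCE on the progress-shaped return: larger steps that
        # yielded more estimated progress pull mu upward.
        ret = sum(r for _, _, r in trajectory)
        grad = sum((a - self.mu) / self.sigma ** 2 for _, a, _ in trajectory)
        self.mu += self.lr * ret * grad

def run_episode(env, policy, estimator):
    obs, done = env.reset(), False
    prev_progress, trajectory = 0.0, []
    while not done:
        action = policy.act(obs)
        obs, done = env.step(action)
        # Dense reward = change in estimated progress, standing in for
        # the oracle success reward that is unavailable at deployment.
        progress = estimator.estimate(obs)
        trajectory.append((obs, action, progress - prev_progress))
        prev_progress = progress
    policy.update(trajectory)  # the "training" in test-time training
    return env.x

env, policy, estimator = ToyReachEnv(), GaussianPolicy(), NoisyProgressEstimator()
for _ in range(50):
    final_x = run_episode(env, policy, estimator)
print(f"end-effector position after 50 test-time episodes: {final_x:.2f}")
```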
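The summary also credits the framework with cumulative progress estimation and progressive horizon expansion as defenses against estimator noise. The sketch below shows one plausible reading of each idea; the class names, the running-max smoothing rule, and the success-rate thresholds are assumptions, not the paper's actual design.

```python
from collections import deque

class CumulativeProgress:
    """Assumed noise-taming rule: raw progress estimates can jitter
    downward, so track the best value seen and only reward genuinely
    new progress (a running-max reading of 'cumulative')."""
    def __init__(self):
        self.best = 0.0

    def update(self, raw_progress: float) -> float:
        gain = max(0.0, raw_progress - self.best)
        self.best = max(self.best, raw_progress)
        return gain  # dense per-step reward

class HorizonSchedule:
    """Assumed curriculum: start with short rollouts and lengthen the
    allowed horizon only once the recent success rate clears a bar,
    so early training is not dominated by long, noisy failures."""
    def __init__(self, start=10, step=10, max_horizon=60,
                 threshold=0.8, window=20):
        self.horizon = start
        self.step, self.max_horizon = step, max_horizon
        self.threshold, self.recent = threshold, deque(maxlen=window)

    def record(self, success: bool) -> int:
        self.recent.append(success)
        if (len(self.recent) == self.recent.maxlen
                and sum(self.recent) / len(self.recent) >= self.threshold):
            self.horizon = min(self.horizon + self.step, self.max_horizon)
            self.recent.clear()  # re-measure at the new horizon
        return self.horizon

# Example: the schedule expands from 10 to 20 steps after a window of
# consistently successful short rollouts.
schedule = HorizonSchedule()
for success in [True] * 20:
    horizon = schedule.record(success)
print(f"horizon after 20 successes: {horizon}")
```

A running max is only one way to read "cumulative progress estimation"; an exponential moving average or a learned value baseline would serve the same noise-damping role.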