FoundationMotion
No manual annotation required: a lightweight model rivals 72B models in motion understanding, as NVIDIA, MIT, and others jointly release FoundationMotion
机器之心 · 2026-01-11 02:17
Core Insights
- The rapid development of video models faces challenges in understanding complex physical movements and spatial dynamics, leading to inaccuracies in interpreting object motion [2][6]
- A significant issue is the lack of high-quality motion data, as existing datasets are either too small or heavily reliant on expensive manual annotation [3][12]
- FoundationMotion, developed by researchers from MIT, NVIDIA, and UC Berkeley, offers an automated data pipeline that requires no manual labeling and significantly improves motion understanding in video models [4][13]

Data Generation Process
- FoundationMotion runs a four-step automated data generation process, starting with precise extraction of motion from videos using advanced detection and tracking models [16]
- The system then translates these trajectories into a format that language models can read, improving the model's ability to comprehend object movements [17]
- Finally, it uses GPT-4o-mini to automatically generate high-quality annotations and questions, yielding a dataset of roughly 500,000 motion-understanding entries (a hedged sketch of such a pipeline appears at the end of this digest) [18]

Model Performance
- The data generated by FoundationMotion was used to fine-tune several open-source video models, including NVILA-Video-15B and Qwen2.5-7B, producing significant performance gains (see the data-formatting sketch at the end of this digest) [21]
- The fine-tuned models surpassed larger models such as Gemini-2.5 Flash and Qwen2.5-VL-72B on multiple motion-understanding benchmarks, demonstrating the impact of high-quality data [26]

Broader Implications
- FoundationMotion's contribution extends beyond benchmark numbers, as understanding object motion is crucial for safety and decision-making in autonomous driving and robotics [24]
- The system provides a cost-effective, scalable way for AI to build an intuitive understanding of the physical world through large-scale video analysis [25]
- This advance is seen as foundational for building true embodied intelligence, strengthening both physical perception and general video understanding [26][27]
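The article describes the data pipeline only at a high level, but the stages it names (track objects, serialize trajectories into text, prompt GPT-4o-mini for annotations and questions) can be sketched roughly as below. All function names, the Trajectory schema, the per-second sampling, and the prompt wording are illustrative assumptions, not the authors' actual code; only the use of gpt-4o-mini for annotation comes from the article.

```python
# Hypothetical sketch of a FoundationMotion-style data pipeline.
# Names and formats are assumptions; only "gpt-4o-mini" is from the article.
from dataclasses import dataclass
from openai import OpenAI  # pip install openai

@dataclass
class Trajectory:
    object_label: str                                   # e.g. "red car", from a detector
    boxes: list[tuple[float, float, float, float]]      # per-frame (x1, y1, x2, y2)
    fps: float

def trajectory_to_text(traj: Trajectory) -> str:
    """Serialize a tracked trajectory into plain text a language model can read."""
    # Sample box centers once per second to keep the prompt short.
    step = max(1, int(traj.fps))
    centers = [
        (round((x1 + x2) / 2, 1), round((y1 + y2) / 2, 1))
        for x1, y1, x2, y2 in traj.boxes[::step]
    ]
    return f"{traj.object_label}: box centers per second = {centers}"

def generate_motion_qa(trajs: list[Trajectory]) -> str:
    """Ask gpt-4o-mini to turn trajectory text into motion question-answer pairs."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    scene = "\n".join(trajectory_to_text(t) for t in trajs)
    prompt = (
        "Given these object trajectories (image coordinates over time), write "
        "question-answer pairs about each object's direction, speed change, "
        "and motion relative to other objects.\n\n" + scene
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

In an actual pipeline the trajectories would come from the detection and tracking models mentioned above, and the generated pairs would presumably be filtered before joining the roughly 500,000-entry dataset.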
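The article also does not specify how the generated data is fed into fine-tuning, so the sketch below shows one common way to package motion QA pairs as a chat-style JSONL file for supervised fine-tuning of an open video model such as the NVILA-Video-15B or Qwen2.5-7B models named above. The record schema, field names, and file name are assumptions; adapt them to whatever format your fine-tuning framework expects.

```python
# Hedged sketch: package generated motion QA pairs into chat-style JSONL for SFT.
import json

def write_sft_jsonl(samples: list[dict], path: str) -> None:
    """samples: [{"video": "clip_0001.mp4", "question": ..., "answer": ...}, ...]"""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            record = {
                "video": s["video"],  # path to the source clip
                "messages": [
                    {"role": "user", "content": s["question"]},
                    {"role": "assistant", "content": s["answer"]},
                ],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage with a single hypothetical entry:
write_sft_jsonl(
    [{"video": "clip_0001.mp4",
      "question": "Which direction does the red car move?",
      "answer": "It moves from the left edge of the frame toward the right edge."}],
    "foundationmotion_sft.jsonl",
)
```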