Beating NVIDIA with Just Three to Five Samples: China's First Ultra-Few-Shot Embodied Model Debuts

Core Insights
- The article covers a breakthrough in embodied intelligence: the release of FAM-1, the first general-purpose few-shot embodied manipulation model, by the Chinese startup FiveAges. The model bridges the gap between vision-language models (VLMs) and 3D robotic manipulation [2][5][18].

Data Scarcity and Challenges
- Data is far scarcer in embodied intelligence than in natural language processing or computer vision. Real-world robot manipulation involves complex physical interaction and real-time feedback, which makes data collection costly and inefficient [3].
- Current vision-language-action (VLA) models rely heavily on large-scale labeled data to compensate for their weak generalization in practical applications [4].

FAM-1 Model Overview
- FAM-1 is built on a novel architecture called BridgeVLA, which enables efficient knowledge transfer and spatial modeling between large vision-language models and 3D robotic control [5][7].
- The model makes significant advances in few-shot learning, cross-scene adaptation, and complex task understanding. It needs only 3-5 robot samples per task to reach a 97% success rate, surpassing state-of-the-art (SOTA) models [5][14].

Technical Innovations
- The model consists of two core modules, Knowledge-driven Pretraining (KP) and 3D Few-shot Fine-tuning (FF), which improve its ability to generalize across tasks and environments [9][12].
- The KP module builds a knowledge base from large volumes of image and video data to improve the model's understanding of manipulation contexts. The FF module aligns the outputs of the VLM and the VLA using 3D heatmaps, significantly reducing the dependency on labeled data [9][12].

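The heatmap-based alignment described under Technical Innovations can be pictured with a small sketch. This is not FiveAges' implementation, and the article gives no architectural detail beyond "3D heatmaps". The sketch assumes a dense voxel grid over the robot workspace, where the model assigns a score to every voxel and the target position is read off as the centre of the highest-scoring voxel. All names here (`make_heatmap`, `decode_heatmap`, `workspace_min`, `voxel_size`) are hypothetical.

```python
import math
from itertools import product


def make_heatmap(target, shape, sigma=1.0):
    """Build a Gaussian 3D heatmap as nested lists, peaked at voxel index `target`.

    Stands in for a model's predicted heatmap (assumption: one score per voxel).
    """
    tx, ty, tz = target
    return [
        [
            [
                math.exp(-((x - tx) ** 2 + (y - ty) ** 2 + (z - tz) ** 2) / (2 * sigma ** 2))
                for z in range(shape[2])
            ]
            for y in range(shape[1])
        ]
        for x in range(shape[0])
    ]


def decode_heatmap(heatmap, workspace_min, voxel_size):
    """Return the metric (x, y, z) centre of the highest-scoring voxel."""
    nx, ny, nz = len(heatmap), len(heatmap[0]), len(heatmap[0][0])
    # Argmax over all voxel indices.
    best = max(
        product(range(nx), range(ny), range(nz)),
        key=lambda idx: heatmap[idx[0]][idx[1]][idx[2]],
    )
    # Convert the voxel index to workspace coordinates (voxel centre).
    return tuple(workspace_min[i] + (best[i] + 0.5) * voxel_size for i in range(3))


# Hypothetical workspace: an 8 x 8 x 4 grid of 10 cm voxels.
heatmap = make_heatmap(target=(2, 5, 1), shape=(8, 8, 4))
position = decode_heatmap(heatmap, workspace_min=(0.0, -0.4, 0.0), voxel_size=0.1)
# position is approximately (0.25, 0.15, 0.15), up to floating-point rounding
```

Predicting a spatial heatmap rather than regressing raw coordinates keeps the action output in the same spatial form as the visual input, which is one plausible reason such a design could need fewer labeled demonstrations. The article itself does not spell out the mechanism.
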
Experimental Results
- FAM-1 outperformed SOTA models on several international benchmarks, achieving an average success rate of 88.2% on tasks such as "Insert Peg" and "Open Drawer," an improvement of more than 30% in average success rate over competing models [11].
- In real-world deployments, FAM-1 reached a 97% success rate on basic tasks using only 3-5 samples, demonstrating robustness to a range of environmental challenges [15].

Future Directions
- FiveAges aims to improve the generalization, reliability, and adaptability of its foundation models for manipulation scenarios, promote their adoption in industrial settings, and develop general-purpose models for navigation tasks [20].
- The company is also exploring self-supervised learning from unlabeled videos of human manipulation, which could further lower the barrier to applying these models in robotics [19].