Lightweight VLA Model Evo-1: SOTA with Only 0.77B Parameters, Tackling Low-Cost Training and Real-Time Deployment
具身智能之心·2025-11-12 04:00

Core Insights
- The article presents Evo-1, a lightweight Vision-Language-Action (VLA) model that integrates perception, language, and control, aiming to cut computational cost and improve deployment efficiency without relying on large-scale robot-data pre-training [3][5][6]

Industry Pain Points
- Existing VLA models carry parameter counts in the billions, leading to high computational cost, heavy GPU memory consumption, and low control frequencies [4]
- Their reliance on extensive robot datasets for training is labor-intensive and costly, further complicating deployment in real-time interactive tasks [4]

Evo-1 Methodology and Performance
- Evo-1 combines a unified vision-language backbone with a two-stage training paradigm to strengthen multimodal perception and understanding while keeping the model at only 0.77 billion parameters [5][6]
- The model achieved state-of-the-art benchmark results, surpassing previous models by 12.4% and 6.9% on MetaWorld and RoboTwin, respectively, and reaching a 94.8% success rate on LIBERO [6][18]
- In real-world evaluations, Evo-1 achieved a 78% success rate, outperforming the other baseline models while using only 2.3 GB of memory and sustaining an inference frequency of 16.4 Hz [22][20]

Model Architecture
- Evo-1 uses InternVL3-1B, pre-trained in a native multimodal paradigm, as its backbone, enabling efficient feature fusion and cross-modal alignment [10]
- A cross-modulation diffusion transformer predicts continuous control actions from the multimodal embeddings produced by the backbone [11]
- An integration module aligns the fused vision-language representations with the robot's proprioceptive state, so that multimodal features feed seamlessly into downstream control [12]
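The article does not spell out how the cross-modulation diffusion transformer conditions action prediction on the fused embedding. One common realization of this kind of conditioning is FiLM/adaLN-style modulation, where the vision-language embedding is projected to a per-channel scale and shift applied to the normalized action tokens. A minimal NumPy sketch under that assumption (all shapes, names, and the pooled-embedding interface are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64   # hidden width of the action expert (hypothetical)
T = 8    # action-chunk length (hypothetical)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Hypothetical learned projection: VL embedding -> per-channel (scale, shift)
W_mod = rng.normal(0, 0.02, (D, 2 * D))

def cross_modulate(action_tokens, vl_embedding):
    """Condition noisy action tokens on the fused vision-language embedding
    by modulating their normalized activations (FiLM/adaLN-style)."""
    scale, shift = np.split(vl_embedding @ W_mod, 2, axis=-1)
    return layer_norm(action_tokens) * (1 + scale) + shift

noisy_actions = rng.normal(size=(T, D))  # diffusion noise over an action chunk
vl_embed = rng.normal(size=(D,))         # pooled multimodal embedding
out = cross_modulate(noisy_actions, vl_embed)
print(out.shape)  # (8, 64)
```

In a full diffusion transformer this modulation would sit inside each block and be iterated over denoising steps; the sketch only shows the conditioning mechanism itself.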
Training Process
- The two-stage training process first aligns the action expert while the vision-language backbone is frozen, then applies a global fine-tuning phase that optimizes the entire architecture [13][14]
- This approach preserves the semantic integrity of the vision-language model while adapting it to diverse action-generation needs, effectively improving the model's generalization [14]

Ablation Studies
- Several integration strategies between the vision-language model and the action expert were evaluated, confirming that the proposed design maintains performance [24]
- Compared with a single-stage baseline, the two-stage training paradigm better retains semantic attention patterns, yielding sharper focus on task-relevant regions [25]
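The two-stage recipe described above amounts to a freeze/unfreeze schedule over parameter groups. A toy sketch of that schedule (the module names and grouping are hypothetical illustrations, not taken from the paper):

```python
# Toy model: map each parameter group to a trainable flag.
model = {
    "vl_backbone.proj":   {"trainable": True},
    "vl_backbone.attn":   {"trainable": True},
    "action_expert.dit":  {"trainable": True},
    "action_expert.head": {"trainable": True},
}

def set_stage(model, stage):
    """Stage 1: freeze the VL backbone and train only the action expert.
    Stage 2: unfreeze everything for global fine-tuning."""
    for name, p in model.items():
        frozen = (stage == 1) and name.startswith("vl_backbone")
        p["trainable"] = not frozen
    return [n for n, p in model.items() if p["trainable"]]

stage1 = set_stage(model, 1)  # only the action expert updates
stage2 = set_stage(model, 2)  # the whole architecture is optimized
print(stage1)                 # ['action_expert.dit', 'action_expert.head']
print(len(stage2))            # 4
```

In a real PyTorch implementation the same effect is achieved by setting `requires_grad = False` on the backbone parameters during stage 1 and restoring it for stage 2.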