Major news! Unveiling the "Sora" of the robotics world and the "hardware secret" behind completing 1,000+ tasks
机器人大讲堂 · 2025-05-16 09:49
Core Viewpoint
- The article discusses the launch of the VPP (Video Prediction Policy) model, an open-source AIGC generative robot model developed by Tsinghua University's ISRLab and Xingdong Jiyuan, which is seen as a significant advancement in robotics, enabling generative AI to cross from the digital world into the physical world [1][2].

Group 1: VPP Model and Its Capabilities
- The VPP model's core advantages include video prediction strategies, high-frequency execution, cross-embodiment learning, and reliability, which together drive innovation in the robotics field [1].
- The model is designed to perform over 1,000 tasks, showcasing its versatility and potential for real-world applications [2].

Group 2: Xingdong XHAND1's Role
- Xingdong XHAND1, a fully direct-drive humanoid dexterous hand platform, serves as the core hardware for testing and deploying the VPP model [2].
- The all-joint direct-drive design of Xingdong XHAND1 provides higher dexterity, faster speeds, and greater strength, enabling it to execute commands from the VPP model with precision [4][5].
- The hand achieves a maximum load of 25 kg and a grip force of 80 N, significantly surpassing comparable devices and making it suitable for heavy-duty tasks [5].

Group 3: Key Features of Xingdong XHAND1
- The hand's back-drivability strengthens the robustness of the VPP model by letting the hand yield and adjust its posture in response to external forces, improving fault tolerance and adaptability in complex environments [7]; a conceptual compliance sketch follows this summary.
- High power density and durability come from advanced motors and integrated joint modules, providing reliable performance during prolonged operation [8].
- The hand supports full-chain teleoperation with various devices, allowing precise capture of human hand motion and enriching the training data available to the VPP model [11].

Group 4: Performance Validation
- Xingdong XHAND1 achieved a 67% success rate on real-world dexterous-hand tasks, validating its reliability and effectiveness in practical applications [26].
- Compared with similar dexterous hands, Xingdong XHAND1 stands out for its direct-drive design and robustness, making it a superior choice for training and validating the VPP model [29].

Group 5: Future Prospects
- Integrating Xingdong XHAND1 with more robot models will enable iterative hardware-algorithm upgrades and expand application boundaries [29].
- The company aims to foster an open-source ecosystem that attracts developers to build diverse application libraries, promoting rapid advances in humanoid robotics [29].
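Back-drivability is a passive property of direct-drive joints rather than an algorithm, but its effect on the policy can be pictured as a simple admittance rule: when an external torque is sensed, the commanded posture yields in that direction. The Python sketch below is a conceptual illustration only; the joint count, gains, and function names are hypothetical and are not taken from the article or from XHAND1's actual controller.

```python
import numpy as np

# Conceptual admittance-style sketch of "yielding under external force".
# NOT XHAND1's controller: joint count, gains, and names are hypothetical.

NUM_JOINTS = 12          # assumed joint count, for illustration only
COMPLIANCE_GAIN = 0.02   # rad of yield per N*m of sensed external torque (hypothetical)
TORQUE_DEADBAND = 0.05   # ignore sensed torques below this threshold (N*m)

def comply(joint_targets: np.ndarray, external_torque: np.ndarray) -> np.ndarray:
    """Shift joint targets in the direction of sensed external torque."""
    torque = np.where(np.abs(external_torque) > TORQUE_DEADBAND, external_torque, 0.0)
    return joint_targets + COMPLIANCE_GAIN * torque

if __name__ == "__main__":
    targets = np.zeros(NUM_JOINTS)     # nominal grasp posture (rad)
    sensed = np.zeros(NUM_JOINTS)
    sensed[3] = 1.5                    # one finger joint is pushed by the environment
    print(comply(targets, sensed)[3])  # that joint yields by 0.03 rad; the rest stay put
```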
AI News Roundup: NVIDIA's Llama-Nemotron models deliver strong results, Xiaomi's Mi-BRAG intelligence engine debuts
China Post Securities · 2025-05-14 13:08
Quantitative Models and Construction Methods

1. Model Name: Llama-Nemotron
- **Model Construction Idea**: The Llama-Nemotron model aims to enhance inference capabilities and reduce memory usage without sacrificing performance[12][13]
- **Model Construction Process**:
  - **Stage 1: Neural Architecture Search (NAS)**: Optimizes from the Llama 3 model to accelerate inference, using block-level local distillation and mixed-integer programming (MIP) solvers to select the most efficient configuration[14]
  - **Stage 2: Vertical Compression and FFN Fusion**: Introduces FFN fusion technology, identifying and replacing consecutive FFN blocks to reduce sequential depth and improve computational efficiency[14]
  - **Stage 3: Knowledge Distillation and Continued Pre-training**: Conducts knowledge distillation and continued pre-training to improve model quality and recover any quality lost to block replacement[15]
  - **Stage 4: Supervised Fine-Tuning (SFT)**: Uses mixed instruction data and reasoning trajectories from strong teacher models for supervised fine-tuning[15]
  - **Stage 5: Large-Scale Reinforcement Learning**: Trains the model with large-scale reinforcement learning, particularly on complex mathematical and STEM datasets[15]
- **Model Evaluation**: The model is designed to enhance inference efficiency and reduce memory usage while maintaining high performance[13][16]

Model Backtesting Results
- **Llama-Nemotron Model**:
  - **HumanEval 0-shot**: 92.1%[53]
  - **LiveCodeBench (v6) 0-shot**: 30.3%[53]
  - **MultiPL-E average 0-shot**: 81.4%[53]
  - **ArenaHard 0-shot**: 97.1%[53]
  - **IfEval 0-shot**: 89.4%[53]
  - **Math500 Instruct 0-shot**: 91.0%[53]
  - **GPQA Diamond 5-shot CoT**: 57.1%[53]
  - **MMLU Pro 5-shot CoT**: 77.2%[53]
  - **RULER 32K**: 96.0%[53]
  - **RULER 128K**: 90.2%[53]
  - **MMMU 0-shot**: 66.1%[53]
  - **DocVQA 0-shot**: 95.3%[53]
  - **AI2D 0-shot**: 93.7%[53]
  - **ChartQA 0-shot**: 82.6%[53]

Quantitative Factors and Construction Methods

1. Factor Name: Mi-BRAG
- **Factor Construction Idea**: The Mi-BRAG system addresses the high cost of knowledge updates, the lack of insight into proprietary knowledge bases, and the data-leakage risks of traditional large models[25]
- **Factor Construction Process** (a minimal retrieval-with-traceability sketch appears after the backtesting results below):
  - **Full-Format Compatibility**: Integrates an intelligent parsing engine to handle document formats such as PDF, Word, and Excel[27]
  - **Full-Modal Parsing**: Accurately parses complex images, tables, and mixed content[27]
  - **Multilingual Q&A**: Supports document parsing and interactive Q&A in major languages[27]
  - **Fine-Grained Traceability**: Uses dynamic traceability technology to mark the source document and citation location for each generated result[27]
- **Factor Evaluation**: The system serves as an intelligent knowledge hub for various application scenarios, improving product intelligence and user experience[28]

Factor Backtesting Results
- **Mi-BRAG Factor**:
  - **SuperCLUE-RAG Generation Capability Ranking**: Ranked first in April 2025[31]
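Of the Mi-BRAG features listed above, fine-grained traceability is the one a short sketch can make concrete: each retrieved chunk carries its source document and location, and those identifiers are returned alongside the generated answer. The Python below is a minimal illustration under that assumption only; the class and function names are hypothetical, retrieval is a toy word-overlap ranking, and the generation step is a stub rather than Mi-BRAG's actual engine.

```python
from dataclasses import dataclass

# Minimal sketch of retrieval-augmented generation with fine-grained traceability.
# NOT Xiaomi's implementation: all names here are hypothetical stand-ins.

@dataclass
class Chunk:
    doc_id: str      # which source document the text came from
    location: str    # locator inside that document, e.g. page/paragraph
    text: str

def retrieve(query: str, index: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Toy lexical retrieval: rank chunks by word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(index, key=lambda c: -len(words & set(c.text.lower().split())))
    return scored[:top_k]

def answer_with_citations(query: str, index: list[Chunk]) -> dict:
    """Return a (stubbed) answer plus the doc id and location of every chunk used."""
    hits = retrieve(query, index)
    answer = " ".join(c.text for c in hits)  # stand-in for an LLM generation step
    citations = [{"doc_id": c.doc_id, "location": c.location} for c in hits]
    return {"answer": answer, "citations": citations}

if __name__ == "__main__":
    index = [
        Chunk("manual.pdf", "p.3 para.2", "Knowledge updates are applied nightly."),
        Chunk("faq.docx", "Q7", "Proprietary data never leaves the private deployment."),
    ]
    print(answer_with_citations("How are knowledge updates handled?", index))
```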
2. Factor Name: VPP (Video Prediction Policy)
- **Factor Construction Idea**: VPP generates robot actions from text instructions, leveraging AIGC video diffusion models for predictive visual representation and action learning[36][39]
- **Factor Construction Process** (a minimal structural sketch follows the backtesting results below):
  - **Stage 1**: Uses video diffusion models to learn predictive visual representations[36]
  - **Stage 2**: Employs a Video Former and a DiT-based diffusion policy for action learning[36]
- **Factor Evaluation**: VPP significantly enhances the generalization ability of humanoid robots by learning from human actions and reducing dependence on high-quality robot data[36][40]

Factor Backtesting Results
- **VPP Factor**:
  - **Calvin ABC-D Task Average Length**: 4.33[42]
  - **Real-World Dexterous Hand Task Success Rate**: 67%[42]
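The two stages above map onto a small module sketch: a Video Former that pools predictive frame features into a fixed set of tokens, and a DiT-style denoising head that predicts the noise added to an action chunk conditioned on those tokens. The code below is a structural illustration only, with hypothetical dimensions and module names; it is not the released VPP implementation, and the upstream video-diffusion features are replaced by random tensors.

```python
import torch
import torch.nn as nn

# Structural sketch of a two-stage "predictive representation -> action denoising"
# pipeline in the spirit of the VPP description above. Dimensions, module names,
# and the random stand-in for video-diffusion features are hypothetical.

class VideoFormer(nn.Module):
    """Compress per-frame predictive features into a fixed set of tokens."""
    def __init__(self, feat_dim=512, num_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats):                  # (B, T, feat_dim)
        q = self.queries.expand(frame_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, frame_feats, frame_feats)
        return tokens                                # (B, num_queries, feat_dim)

class ActionDenoiser(nn.Module):
    """DiT-style head: predict the noise on an action chunk, conditioned on tokens.
    (Diffusion timestep conditioning is omitted for brevity.)"""
    def __init__(self, feat_dim=512, action_dim=12, horizon=8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, feat_dim)
        self.block = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.out = nn.Linear(feat_dim, action_dim)
        self.horizon = horizon

    def forward(self, noisy_actions, cond_tokens):   # (B, horizon, A), (B, Q, feat_dim)
        x = torch.cat([cond_tokens, self.action_proj(noisy_actions)], dim=1)
        x = self.block(x)
        return self.out(x[:, -self.horizon:])        # noise estimate per action step

if __name__ == "__main__":
    B, T, D = 2, 4, 512
    frame_feats = torch.randn(B, T, D)               # stand-in for video-diffusion features
    tokens = VideoFormer()(frame_feats)
    noise_hat = ActionDenoiser()(torch.randn(B, 8, 12), tokens)
    print(noise_hat.shape)                           # torch.Size([2, 8, 12])
```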