Core Insights
- The article covers advances in humanoid robotics, focusing on the RoboPerform framework, which lets robots perform expressive movements in sync with audio input, overcoming earlier limitations in audio-motion coupling [3][6][7].

Industry Pain Points
- The traditional multi-stage pipeline for generating robot movements from audio loses information at each stage, so robots lag in responsiveness and expressiveness during tasks such as dancing or public speaking [6][7].
- Weak coupling between audio signals and joint movements has produced robots that often fall out of sync with the audio, resulting in awkward, uncoordinated actions [6][7].

Breakthrough Solutions
- RoboPerform introduces a unified audio-motion generation framework that removes the need for motion retargeting, letting the robot interpret audio directly and generate appropriate movements [7][8].
- The framework rests on a dual representation of "content" (the core task) and "style" (the rhythm and emotion conveyed by the audio), enabling more natural, fluid movement [7][8].

Technical Innovations
- Training follows a three-stage approach of alignment, distillation, and generation, which enables direct mapping from audio to robot motion [11][12].
- A mixture-of-experts strategy (ΔMoE) supports diverse, precise movement adaptation across scenarios, letting the robot handle both dynamic dances and natural gestures [13][14].
- Real-time performance is achieved with a latency of just 5.3 milliseconds per single-action inference, a significant improvement in responsiveness over traditional methods [14][22].

Performance Validation
- RoboPerform achieved top-1 retrieval accuracy of 66.7% on music-motion tasks and 64.6% on speech-motion tasks, demonstrating accurate alignment of movement with audio cues [17].
- In motion-tracking precision, RoboPerform achieved success rates of up to 99% in both simulation and real-world tests, outperforming traditional methods [18][19].
- The framework's deployment time of roughly 1.2 seconds meets the stringent requirements for real-time control of humanoid robots [14][22].

Practical Applications
- The Unitree G1 robot executed fluid dance movements and natural gestures in response to audio input, validating the practical utility of the RoboPerform framework [22][24].
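The article does not give implementation details of ΔMoE, but the general mixture-of-experts pattern it builds on can be sketched as follows: a router scores a small set of specialist networks from the input features and blends their outputs by the resulting weights, so different experts can specialize in, say, dance dynamics versus speech gestures. Everything here (class names, dimensions, the plain linear experts) is an illustrative assumption, not the paper's architecture.

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class Expert:
    """One specialist: a linear map from an audio-feature vector to a joint-motion delta."""
    def __init__(self, in_dim, out_dim):
        self.w = [[random.gauss(0, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

class MoEHead:
    """Gated mixture: a router scores each expert from the input, then blends outputs."""
    def __init__(self, in_dim, out_dim, n_experts):
        self.experts = [Expert(in_dim, out_dim) for _ in range(n_experts)]
        self.gate_w = [[random.gauss(0, 0.1) for _ in range(in_dim)]
                       for _ in range(n_experts)]

    def __call__(self, x):
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.gate_w]
        weights = softmax(scores)          # convex combination over experts
        outs = [e(x) for e in self.experts]
        # Weighted blend of the experts' per-joint outputs.
        return [sum(w * o[j] for w, o in zip(weights, outs))
                for j in range(len(outs[0]))]

head = MoEHead(in_dim=8, out_dim=4, n_experts=3)
audio_feat = [0.5] * 8        # stand-in for a fused audio/style embedding
delta = head(audio_feat)      # per-joint motion delta for this timestep
print(len(delta))             # 4
```

Because the router's weights depend on the input, the same head can route rhythmic music features toward one expert and speech-prosody features toward another, which is one plausible reading of how a single model covers both dance and gesture scenarios.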
It can dance and give speeches! RoboPerform lets humanoid robots understand sound and unlock dual skills on the fly
具身智能之心·2026-01-07 07:02