Ask-to-Clarify: Resolving Instruction Ambiguity and Generating Actions End-to-End for Real-World Embodied Tasks

Core Insights
- The article presents the Ask-to-Clarify framework, which aims to improve how embodied agents interact with humans by resolving ambiguous instructions through multi-turn dialogue [2][4][41].

Framework Design
- A new collaborative task for embodied agents is introduced: the agent must ask questions to clarify an ambiguous instruction before executing it. The framework pairs a vision-language model (VLM) for questioning with a diffusion model for action generation [6][10].
- The framework has two main components: a collaborative module for human interaction and an action module for generating low-level actions. A connection module is designed to integrate the two smoothly [42][46].

Training Strategy
- A two-phase "knowledge isolation" training strategy is proposed. The first phase trains the model to handle ambiguous instructions; the second phase preserves that capability while strengthening action generation [8][15].
- In the first phase, an interactive dialogue dataset is constructed to train the collaborative component, so that it asks questions when faced with ambiguous instructions [16][17].
- The second phase trains the hierarchical framework for end-to-end action generation, ensuring that the model retains its ability to clarify instructions while learning to generate actions [18][19].

Inference Process
- During inference, the framework converses with the user to clarify the instruction and then executes the action it has inferred to be correct. A signal detector routes the process between questioning and executing based on the task state [22][23].
- The model emits specific signal markers to indicate whether an instruction is ambiguous, and the response is routed accordingly [22][23].

Experimental Validation
- The framework was tested in real-world scenarios, demonstrating that it can clarify ambiguous instructions and reliably generate actions. The experiments include ablation studies on the training strategy and the connection module [24][25][41].
- The results showed that Ask-to-Clarify significantly outperformed baseline models in handling ambiguous instructions and executing tasks accurately [29][30][35].

Robustness Testing
- Robustness was evaluated under challenging conditions, such as low-light environments and the presence of distractor objects. The framework consistently outperformed baseline models in these scenarios, indicating practical applicability [37][39][40].
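
The routing step described under Inference Process (ask a question while the instruction is ambiguous, hand off to action generation once it is clear) can be illustrated with a small sketch. This is not the authors' implementation: the marker strings `<AMBIGUOUS>` and `<CLEAR>`, the function names, and the `max_turns` limit are all hypothetical, since the summary does not name the actual signal markers, and the VLM and the diffusion policy are replaced by plain Python stand-ins.

```python
from typing import Callable, List

# Hypothetical marker strings; the article does not name the real tokens.
AMBIGUOUS = "<AMBIGUOUS>"
CLEAR = "<CLEAR>"


def detect_signal(vlm_output: str) -> str:
    """Map the leading marker of a model output to a route: 'ask' or 'act'."""
    text = vlm_output.strip()
    if text.startswith(AMBIGUOUS):
        return "ask"
    if text.startswith(CLEAR):
        return "act"
    raise ValueError(f"no signal marker found in: {vlm_output!r}")


def strip_marker(vlm_output: str) -> str:
    """Remove the leading marker so only the question or instruction remains."""
    text = vlm_output.strip()
    for marker in (AMBIGUOUS, CLEAR):
        if text.startswith(marker):
            return text[len(marker):].strip()
    return text


def run_episode(
    instruction: str,
    vlm: Callable[[List[str]], str],
    action_policy: Callable[[str], List[str]],
    ask_user: Callable[[str], str],
    max_turns: int = 3,  # assumed safety limit, not taken from the article
) -> List[str]:
    """Alternate between questioning and acting until the instruction is clear."""
    dialogue = [instruction]
    for _ in range(max_turns):
        output = vlm(dialogue)
        payload = strip_marker(output)
        if detect_signal(output) == "act":
            # Instruction is resolved: hand it to the action generator.
            return action_policy(payload)
        # Instruction is still ambiguous: ask, then record the exchange.
        answer = ask_user(payload)
        dialogue.extend([payload, answer])
    raise RuntimeError("instruction still ambiguous after max_turns")


# Toy stand-ins for the collaborative module and the action module.
def toy_vlm(dialogue: List[str]) -> str:
    text = " ".join(dialogue).lower()
    for color in ("red", "green"):
        if color in text:
            return f"{CLEAR} pick up the {color} cup"
    return f"{AMBIGUOUS} Which cup do you mean?"


def toy_policy(resolved_instruction: str) -> List[str]:
    return [f"execute: {resolved_instruction}"]


if __name__ == "__main__":
    print(run_episode("Pick up the cup", toy_vlm, toy_policy, lambda q: "the red one"))
```

Keeping the detector as a pure function of the model output means the questioning loop and the action generator never call each other directly, which mirrors the separation between the collaborative module and the action module in the summary above.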