Chain-of-Thought (思维链)

Peking University Finds a Solution to the Multimodal AI "Internal Friction" Problem Identified by Zhang Xiangyu (张祥雨)
36Ke · 2025-09-19 10:52
In June this year, StepFun (阶跃星辰) chief scientist Zhang Xiangyu described in an interview the biggest difficulty he has encountered in model training over the past two years: inside multimodal AI there has long been a "civil war." Specifically, when training unified multimodal models, visual "understanding" and "generation" can coexist but rarely cooperate, and they often work against each other. During joint training, improving one capability can even degrade the other.

This runs contrary to our intuition. For a human, the deeper the understanding of a picture, the more refined the painting is likely to be. But in multimodal models, understanding and generation have not formed effective "information gain" or "mutual reinforcement."

Zhang Xiangyu's explanation is that image generation is simply too complex, requiring intricate spatial planning, physical common sense, and semantic reasoning. Transformer models, however powerful, can execute only a limited number of logical reasoning steps in a single forward pass. Asking a model to take the instruction "draw an astronaut on the moon riding a bicycle with square wheels" and generate, in one shot, an image satisfying every physical, geometric, and semantic constraint is simply too hard.

During training, this single-pass inference makes the gradient signal too coarse: the resulting understanding model cannot give the generation model effective guidance, and conversely, failures of the generation module cannot effectively help the understanding module improve.

Zhang Xiangyu's proposed remedy is therefore that multimodal models should, like language reasoning, introduce a "Chain-of-Thought" (思维链). Let the model ...
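To make the "reason first, then generate" idea concrete, here is a minimal, self-contained sketch of the control flow it implies. The classes and method names below are toy stand-ins invented for illustration, not the actual models or APIs discussed in the article; the only point is that an explicit textual plan is produced between the instruction and the image synthesis, instead of everything happening in a single forward pass.

```python
# A minimal sketch of "reason first, then generate" for a text-to-image
# request, contrasting with one-shot generation. ToyUnderstander and
# ToyGenerator are hypothetical stand-ins, not the article's models.

class ToyUnderstander:
    def plan(self, prompt: str) -> list[str]:
        # A real understanding model would emit spatial layout, physical
        # constraints, and semantics; here we return fixed illustrative steps.
        return [
            f"parse the scene described by: {prompt!r}",
            "lay out objects and their spatial relations",
            "check physical constraints (e.g. square wheels vs. ground contact)",
        ]


class ToyGenerator:
    def generate(self, prompt: str, plan: list[str]) -> str:
        # A real generator would render pixels conditioned on the plan;
        # here we just describe what would be rendered.
        return f"<image of {prompt!r}, rendered from {len(plan)} planned steps>"


def chain_of_thought_generate(prompt: str) -> str:
    """Decompose the instruction into explicit reasoning steps, then
    condition generation on those steps instead of a single forward pass."""
    plan = ToyUnderstander().plan(prompt)
    return ToyGenerator().generate(prompt, plan)


print(chain_of_thought_generate(
    "an astronaut on the moon riding a bicycle with square wheels"))
```

The design point is that the plan consists of explicit tokens the generator is conditioned on, so during training, credit for each constraint can flow through its own reasoning step rather than through one coarse end-to-end gradient, which is exactly the failure mode Zhang Xiangyu describes.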
An End-to-End Model! GraphCoT-VLA: A VLA Model for Manipulation Tasks with Ambiguous Instructions
具身智能之心 · 2025-08-13 00:04
Core Viewpoint
- The article introduces GraphCoT-VLA, an end-to-end model designed to improve robot manipulation under ambiguous instructions and open-world conditions, with significantly higher task success rates and faster response times than existing methods [3][15][37].

Group 1: Introduction and Background
- The VLA (Vision-Language-Action) paradigm has become central to robotic manipulation, integrating perception, understanding, and action to interpret and execute natural-language commands [5].
- Existing VLA models struggle with ambiguous language instructions and unknown environmental states, which limits their effectiveness in real-world applications [3][8].

Group 2: GraphCoT-VLA Model
- GraphCoT-VLA addresses these limitations with a structured Chain-of-Thought (CoT) reasoning module that improves understanding of ambiguous instructions and strengthens task planning [3][15].
- The model maintains a real-time updatable 3D pose-object graph that captures the spatial configuration of the robot's joints and the topological relationships among objects in three-dimensional space, enabling richer interaction modeling [3][9] (see the graph sketch after this summary).

Group 3: Key Contributions
- A novel CoT architecture enables dynamic observation analysis, interpretation of ambiguous instructions, generation of failure feedback, and prediction of future object states and robot actions [15][19].
- A dropout-based mixed reasoning strategy balances rapid inference against deep reasoning, preserving real-time performance [15][27] (see the second sketch below).

Group 4: Experimental Results
- Experiments show GraphCoT-VLA significantly outperforming existing methods in task success rate and action fluidity, especially under ambiguous instructions [37][40].
- In the "food preparation" task, GraphCoT-VLA improved accuracy by 10% over the best baseline; in the "outfit selection" task, it outperformed the leading model by 18.33% [37][38].

Group 5: Ablation Studies
- Adding the pose-object graph improved success rates by up to 18.33% and made action generation more accurate and fluid [40].
- The CoT module markedly improved the model's ability to interpret and respond to ambiguous instructions, with stronger task planning and future-action prediction [41].
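As a reading aid for Group 2, here is a minimal sketch of the kind of data structure a "3D pose-object graph" suggests: nodes for robot joints and scene objects with 3D positions, and edges for spatial relations, rebuilt on each new observation. The field names and the distance-threshold edge rule are assumptions chosen for illustration, not the paper's actual construction.

```python
# A toy real-time-updatable pose-object graph: joints and objects are nodes
# with 3D coordinates; edges connect nodes closer than a radius. The 0.3 m
# threshold and the "kind" labels are illustrative assumptions.

from dataclasses import dataclass, field
import math


@dataclass
class Node:
    name: str
    kind: str  # "joint" or "object"
    xyz: tuple[float, float, float]


@dataclass
class PoseObjectGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def update(self, observations: list[Node], radius: float = 0.3) -> None:
        """Refresh node poses from the latest observation, then reconnect
        edges between nodes within `radius` meters of each other."""
        for node in observations:
            self.nodes[node.name] = node
        self.edges.clear()
        names = sorted(self.nodes)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if math.dist(self.nodes[a].xyz, self.nodes[b].xyz) <= radius:
                    self.edges.add((a, b))


graph = PoseObjectGraph()
graph.update([
    Node("gripper", "joint", (0.40, 0.10, 0.25)),
    Node("cup", "object", (0.45, 0.12, 0.20)),
    Node("plate", "object", (0.90, -0.30, 0.05)),
])
print(graph.edges)  # {('cup', 'gripper')}: only the cup is within reach
```

A graph like this gives the CoT module a compact, symbolic view of "what is near what," which is the kind of structure the summary credits with better interaction modeling.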
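Likewise, for Group 3's dropout-based mixed reasoning strategy, the sketch below shows one plausible reading: during training, the chain-of-thought is randomly dropped from the policy's input, so the same policy learns to act both with deep reasoning and without it, enabling a fast path at inference time. The drop probability and function names are assumptions, not the paper's exact recipe.

```python
# A toy "mixed reasoning" input builder: with probability p_drop the CoT
# tokens are omitted, so the policy cannot become dependent on them.
# p_drop=0.5 and all strings here are illustrative assumptions.

import random


def build_policy_input(observation: str, cot: str, p_drop: float = 0.5) -> str:
    """Randomly omit the CoT during training; at inference time, include it
    only when latency allows (deep path) and skip it otherwise (fast path)."""
    if random.random() < p_drop:
        return observation  # fast path: act from the observation alone
    return observation + "\n[reasoning]\n" + cot  # deep path: condition on CoT


random.seed(0)
obs = "rgb-d frame; instruction: 'get me something to drink'"
cot = "instruction is ambiguous -> nearest drinkable object is the cup -> grasp cup"
for _ in range(3):
    print("---\n" + build_policy_input(obs, cot))
```

Trained this way, a single policy can trade reasoning depth for response time at deployment, which matches the real-time claim the summary attributes to this strategy.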