The "Agents" You Heard About Nonstop at WAIC: It's Time to Study Them Systematically
机器之心· 2025-08-04 07:05
Core Insights
- The article emphasizes the shift in perception of AI large models from simple chatbots to intelligent agents capable of proactive thinking, planning, and task execution [1][2].

Group 1: LLM and Its Capabilities
- Standard LLMs generate text responses from a given prompt; this versatility is one of their major strengths [5].
- Integrating reasoning and external API interactions into LLMs is crucial for building advanced AI agents [6].

Group 2: Tool Utilization
- Teaching LLMs to integrate and use external tools, such as calculators, calendars, and search engines, has become a hot topic in AI research [7].
- LLMs can act as "commanders" that coordinate various specialized tools to solve problems effectively [8].

Group 3: Reasoning Models
- Reasoning has been a core focus of LLM research: breaking complex problems into smaller tasks and deciding which tools to use are essential capabilities [21][23].
- The Chain of Thought (CoT) method improves LLM reasoning by guiding the model to generate an explicit reasoning process before producing its final output [24][25].

Group 4: ReAct Framework
- The ReAct framework lets LLM-driven agents autonomously decompose and solve complex problems through a sequential process that interleaves reasoning and action [41].
- The framework expands the action space to include language itself as a form of action, enabling agents to "think" in addition to executing actions [46][49].

Group 5: Applications and Performance
- ReAct has been applied to knowledge-intensive reasoning tasks and decision-making scenarios, demonstrating its effectiveness in varied contexts [63][64].
- In performance comparisons, ReAct consistently outperforms the other models evaluated, highlighting the importance of reasoning during action execution [77].
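The Thought/Action/Observation cycle that ReAct interleaves can be sketched as a minimal loop. This is an illustrative toy, not the article's implementation: `fake_llm` is a scripted stand-in for a real model, and the `calculator` tool and the `Action: tool[argument]` format are assumptions for the demo.

```python
def calculator(expr: str) -> str:
    # Toy tool; a real agent would sandbox or restrict evaluation.
    return str(eval(expr))

TOOLS = {"calculator": calculator}

def fake_llm(context: str) -> str:
    # Hypothetical stand-in for an LLM: scripted output for the demo question.
    if "Observation: 391" in context:
        return "Final Answer: 391"
    return "Thought: I should multiply the two numbers.\nAction: calculator[17 * 23]"

def react(question: str, max_steps: int = 5) -> str:
    """Alternate reasoning ('Thought') and tool calls ('Action'),
    feeding each tool result back into the context as an 'Observation'."""
    context = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(context)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[argument]" from the model's output.
        action_line = next(l for l in step.splitlines() if l.startswith("Action:"))
        name, arg = action_line.removeprefix("Action: ").split("[", 1)
        observation = TOOLS[name](arg.rstrip("]"))
        context += f"\n{step}\nObservation: {observation}"
    return "gave up"

print(react("What is 17 * 23?"))  # → 391
```

The key point the article makes is visible in the loop: "Thought" lines are actions in language space that change only the context, while "Action" lines invoke external tools whose observations ground the next reasoning step.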
Group 6: Future of AI Agents
- Building reliable AI agent systems is crucial: current systems can fail if any single step in the sequential problem-solving process goes wrong [114].
- Ongoing research aims to enhance the capabilities and reliability of AI agents, pointing to significant advances in the near future [115].
Fine-Grained Visual Reasoning Chains Come to Mathematics: Accuracy Jumps 32% as CUHK MMLab Breaks the Multimodal Math Reasoning Bottleneck
量子位· 2025-06-16 10:30
Core Viewpoint
- The article discusses MINT-CoT, a new visual reasoning framework designed to enhance multimodal mathematical reasoning by addressing the limitations of traditional Chain of Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning
- Traditional CoT methods struggle to integrate visual information in mathematical contexts because of three main bottlenecks:
  1. Coarse-grained image region selection: most methods rely on bounding boxes, which may include irrelevant information [4].
  2. Visual encoders that are not trained on mathematical images, leading to poor perception of mathematical content [5].
  3. Over-reliance on external functionality, which raises training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework
- MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight visually interleaved CoT reasoning method designed specifically for mathematical reasoning. It introduces an Interleave Token that dynamically selects relevant visual tokens during the reasoning process, enabling true "text-visual joint reasoning" [9].
- The MINT-CoT dataset contains 54,000 visually interleaved reasoning samples, with aligned information between reasoning steps and the corresponding image tokens [11].

Group 3: Training Strategy
- A three-stage training strategy strengthens the framework's visually interleaved reasoning capabilities:
  1. Text-only CoT fine-tuning, to establish a foundation of general reasoning formats.
  2. Interleaved-modality CoT fine-tuning, to teach the model to insert visual content at appropriate points during reasoning.
  3. Interleaved-modality CoT reinforcement learning, to optimize visual content selection and reasoning strategies [13].
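The Interleave Token's fine-grained selection, in contrast to bounding boxes, can be sketched as a similarity test between the token's hidden state and individual image-patch embeddings. This is a toy illustration of the idea only: the embedding dimension, similarity threshold, and random vectors are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # embedding dimension (illustrative)
patch_tokens = rng.normal(size=(16, d))   # 16 image-patch embeddings
# Hidden state of the Interleave Token; here constructed to resemble patch 3.
interleave_h = patch_tokens[3] + 0.1 * rng.normal(size=d)

def select_visual_tokens(h, patches, threshold=0.5):
    # Cosine similarity between the Interleave Token state and each patch;
    # patches above the threshold are spliced into the reasoning sequence.
    sims = patches @ h / (np.linalg.norm(patches, axis=1) * np.linalg.norm(h))
    return np.where(sims > threshold)[0]  # indices of selected patches

selected = select_visual_tokens(interleave_h, patch_tokens)
print(selected)  # the chosen patch tokens would join the CoT as visual evidence
```

Selecting individual patch tokens rather than a rectangular region is what lets the method attach exactly the relevant visual evidence (e.g. one vertex of a geometry figure) to each reasoning step.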
Group 4: Experimental Results
- The MINT-CoT-7B model, built on the multimodal large model Qwen-VL-7B, delivers superior performance on mathematical visual reasoning tasks, with significant gains over baseline models: +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, setting a new benchmark in the field [16].