The "AI agents" you heard about over and over at WAIC: it's time to study them systematically
机器之心 · 2025-08-04 07:05
Excerpted from Deep (Learning) Focus. Author: Cameron R. Wolfe. Compiled by 机器之心.

At this year's World Artificial Intelligence Conference (WAIC), agents were the undisputed protagonist. From consumer products to enterprise applications, nearly every exhibiting AI vendor made a point of presenting its agent strategy.

This reflects an important shift: people no longer treat large language models as mere chatbots. Instead, they want models that, like a person, can think proactively, make plans, and use a variety of tools to complete tasks. This is a key direction for moving large models into real applications.

For AI practitioners, then, it is time to learn about "agents" systematically.

Conveniently, we found a very comprehensive blog post on the topic. Its author is Cameron R. Wolfe, a senior research scientist at Netflix who holds a Ph.D. from Rice University. Starting from the basic LLM, he gradually introduces tool use, reasoning, and autonomous planning, and analyzes the underlying logic of AI agents in depth.

Blog post: https://cameronrwolfe.substack.com/p/ai-agents

Below is the detailed content of the post.

LLMs and their capabilities

The input-output characteristics of a standard LLM

The behavior of a standard LLM is shown above: given a text prompt, the LLM generates a text response. In many ways, the generality of the LLM is its greatest ...
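The shift the article describes, from a chatbot that only answers to an agent that thinks, acts, and observes in a loop, can be sketched in a few lines. This is a minimal illustrative sketch, not any specific framework's API: `fake_llm`, `run_agent`, and the `Action:`/`Observation:` format are all hypothetical names standing in for a real model call and a real tool-use protocol.

```python
# Minimal agent-loop sketch: an LLM decides between calling a tool and
# giving a final answer; tool results are fed back as observations.
# `fake_llm` is a deterministic stub standing in for a real model call.

def calculator(expression: str) -> str:
    """A toy tool the agent can invoke."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_llm(prompt: str) -> str:
    """Stub LLM: call the tool first, then answer once it sees the result."""
    if "Observation:" not in prompt:
        return "Action: calculator(2 + 3)"
    return "Final Answer: 5"

def run_agent(question: str, llm=fake_llm, max_steps: int = 5) -> str:
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        # Parse a tool call of the form "Action: tool(args)"
        name, args = reply.removeprefix("Action: ").rstrip(")").split("(", 1)
        observation = TOOLS[name](args)
        # Append the action and its result so the next LLM call sees them
        prompt += f"\n{reply}\nObservation: {observation}"
    return "no answer"

print(run_agent("What is 2 + 3?"))  # -> 5
```

The loop itself is the point: the LLM stays a pure text-in, text-out function, and agency comes from the scaffolding that routes its outputs to tools and feeds results back in.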
Fine-grained visual reasoning chains come to mathematics: accuracy jumps 32% as CUHK MMLab breaks the multimodal mathematical reasoning bottleneck
量子位 · 2025-06-16 10:30
Core Viewpoint

The article introduces MINT-CoT, a new visual reasoning framework designed to enhance multimodal mathematical reasoning by addressing the limitations of traditional Chain of Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning

Traditional CoT methods struggle to integrate visual information in mathematical contexts due to three main bottlenecks:
1. Coarse-grained image region selection: most methods rely on bounding boxes that may include irrelevant information [4].
2. Visual encoders that are not trained to understand mathematical images, leading to poor perception of mathematical content [5].
3. Over-reliance on external functionality, which increases training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework

MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight, visually interleaved CoT reasoning method designed specifically for mathematical reasoning scenarios. Its key innovation is an Interleave Token that dynamically selects relevant visual tokens during the reasoning process, enabling true "text-visual joint reasoning" [9].

The MINT-CoT dataset consists of 54,000 visually interleaved reasoning samples, providing aligned information between reasoning steps and the corresponding image tokens [11].

Group 3: Training Strategy

A three-stage training strategy strengthens the framework's visually interleaved reasoning capabilities:
1. Text-only CoT fine-tuning, to establish a foundation of general reasoning formats.
2. Interleaved-modality CoT fine-tuning, to teach the model to insert visual content at appropriate points during reasoning.
3. Interleaved-modality CoT reinforcement learning, to optimize visual content selection and reasoning strategies [13].
Group 4: Experimental Results

The MINT-CoT-7B model, based on the multimodal large model Qwen-VL-7B, demonstrates superior performance on mathematical visual reasoning tasks, with significant improvements over baseline models: +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, establishing a new benchmark in the field [16].
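The core idea behind the Interleave Token described above, selecting individual visual tokens relevant to the current reasoning step rather than cropping a coarse bounding box, can be illustrated with a simple similarity-based selection. This is only a conceptual sketch under assumed mechanics (cosine similarity, top-k selection); the summary does not specify MINT-CoT's actual selection mechanism, and `select_visual_tokens` is a hypothetical name.

```python
import numpy as np

# Illustrative sketch (NOT MINT-CoT's actual implementation): at an
# interleave step, score every visual token against the current reasoning
# state and pick the top-k, instead of selecting a coarse image region.

def select_visual_tokens(state: np.ndarray, visual_tokens: np.ndarray,
                         k: int = 4) -> np.ndarray:
    """Return indices of the k visual tokens most similar to the text state.

    state:         (d,)   hidden state of the current reasoning step
    visual_tokens: (n, d) embeddings of image patches / visual tokens
    """
    # Cosine similarity between the state and every visual token
    state_n = state / np.linalg.norm(state)
    tok_n = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    scores = tok_n @ state_n
    # The highest-scoring tokens get interleaved into the reasoning sequence
    return np.argsort(scores)[::-1][:k]

# Toy usage: 8 visual tokens in 3-D; the state is aligned with token 2,
# so token 2 should rank first.
rng = np.random.default_rng(0)
vis = rng.normal(size=(8, 3))
state = vis[2] * 5.0
print(select_visual_tokens(state, vis, k=1))
```

The fine-grained point is that selection happens per token and per reasoning step, which is what lets text and visual evidence interleave instead of the model committing to one image region up front.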