Long-Term Memory Reasoning

ByteDance Seed Open-Sources a Long-Term-Memory Multimodal Agent That Can Hear and See Like a Human
量子位 (QbitAI) · 2025-08-18 06:55
Core Insights
- The article covers the launch of M3-Agent, a new multimodal agent framework from ByteDance Seed that processes real-time visual and auditory input, builds and updates long-term memory, and develops semantic memory over time [2][7].

Group 1: M3-Agent Features
- M3-Agent perceives its surroundings in a human-like way, by both hearing and seeing, and is released free and open-source [2].
- It is evaluated on M3-Bench, a new long-video question-answering benchmark developed jointly by ByteDance Seed, Zhejiang University, and Shanghai Jiao Tong University to measure how effective an agent's memory is and how well it reasons over that memory [2][22].

Group 2: Performance Metrics
- Experiments show that M3-Agent significantly outperforms the baselines, including commercial models such as Gemini-1.5-Pro and GPT-4o, across multiple benchmarks [3][30].
- On M3-Bench-robot, M3-Agent's accuracy is 6.3% higher than that of the strongest baseline, MA-LMM; on M3-Bench-web and VideoMME-long, it beats the strongest baseline, Gemini-GPT4o-Hybrid, by 7.7% and 5.3% respectively [34][35].

Group 3: Memory and Reasoning Capabilities
- M3-Agent runs two parallel processes: a memory process that continuously perceives real-time multimodal input to build and update long-term memory, and a control process that interprets external instructions and reasons over the stored memories to carry out tasks [8][9].
- The memory process produces two types of memory: event memory, which records specific events observed in the video, and semantic memory, which derives general knowledge from video segments [11][12].

Group 4: Benchmarking and Evaluation
- M3-Bench consists of two subsets: M3-Bench-robot, with 100 real-world videos recorded from a robot's first-person perspective, and M3-Bench-web, with 920 videos drawn from various online sources [26].
- The benchmark tests an agent's ability to recall past observations and to reason over its memory, using several question types: multi-detail, multi-hop, and cross-modal reasoning, plus general knowledge extraction [24][27].

Group 5: Conclusion
- The results indicate that M3-Agent excels at maintaining character consistency, improving its understanding of humans, and effectively integrating multimodal information [36].
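
The two-process design summarized under Group 3 can be illustrated with a small Python sketch. This is not M3-Agent's code: the names `MemoryStore`, `memory_process`, `control_process`, `CLIPS`, and `run_demo` are hypothetical, the clips arrive pre-annotated with an event and its derived facts (in the real system a model would produce both from raw video and audio), and retrieval is a toy keyword match rather than learned reasoning. The sketch only shows how one worker can keep writing event and semantic memory while another component reads from the same store.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Long-term memory shared by both processes (hypothetical layout)."""
    events: list[str] = field(default_factory=list)      # event memory: what was observed
    semantics: list[str] = field(default_factory=list)   # semantic memory: general facts
    lock: threading.Lock = field(default_factory=threading.Lock)


def memory_process(stream: queue.Queue, store: MemoryStore) -> None:
    """Consume clips until a None sentinel arrives, updating both memory types."""
    while True:
        clip = stream.get()
        if clip is None:
            break
        with store.lock:
            store.events.append(clip["event"])
            for fact in clip["facts"]:
                if fact not in store.semantics:  # keep each general fact once
                    store.semantics.append(fact)


def control_process(question: str, store: MemoryStore) -> list[str]:
    """Return stored memories that share a word with the question (toy retrieval)."""
    words = {w.strip("?.,").lower() for w in question.split()}
    with store.lock:
        memories = store.semantics + store.events
    return [m for m in memories if words & {w.lower() for w in m.split()}]


# Assumption: pre-annotated clips stand in for a perception model's output.
CLIPS = [
    {"event": "Alice pours coffee in the kitchen", "facts": ["Alice likes coffee"]},
    {"event": "Bob waters the plants", "facts": []},
    {"event": "Alice drinks coffee at her desk", "facts": ["Alice likes coffee"]},
]


def run_demo() -> tuple[MemoryStore, list[str]]:
    store = MemoryStore()
    stream: queue.Queue = queue.Queue()
    worker = threading.Thread(target=memory_process, args=(stream, store))
    worker.start()
    for clip in CLIPS:
        stream.put(clip)
    stream.put(None)   # sentinel: no more clips
    worker.join()      # wait so the demo's answer is deterministic
    return store, control_process("What does Alice drink?", store)


if __name__ == "__main__":
    store, hits = run_demo()
    print(hits)
```

The demo joins the memory thread before querying only to make the result repeatable; in a setup closer to the one described, the control process would query while the memory process is still running, which is why the store is guarded by a lock.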