Long-horizon reasoning

Alibaba enters embodied AI: Qwen assembles an internal squad, led by Tongyi Qianwen's technical head
量子位· 2025-10-09 07:03
Hengyu, reporting from Aofeisi. QbitAI | WeChat official account QbitAI

The embodied intelligence squad belongs to Qwen (Tongyi Qianwen) under Alibaba, the core department responsible for developing Alibaba's flagship foundation models. The team has long handled the development, open-sourcing, and commercial application of the Qwen model series.

The news was made public in a post by Tongyi Qianwen's technical head Lin Junyang (Justin Lin), who wrote: "The Qwen team has formed a brand-new embodied intelligence squad!"

Foreign media commented that the move marks Alibaba's clearest exploration of physical AI systems to date. Alibaba thus joins OpenAI, Google, and other large-model giants in announcing its entry into the embodied intelligence race.

Jensen Huang has said that NVIDIA sees a "multi-trillion-dollar" long-term growth opportunity in AI and robotics. Clearly, Alibaba has no intention of passing up this long-term opportunity.

"Toward the real world": Qwen builds an embodied intelligence team

Multimodal foundation models are now evolving into foundation agents that can use tools and memory and carry out long-horizon reasoning via reinforcement learning. They should move from the virtual world into the real world!

If earlier large models were about "understanding" the world, the goal of embodied intelligence is to let models "participate" in it. From Lin Junyang's post it is clear that Qwen has already begun pushing its multimodal models toward this new embodied stage. ...
FindingDory: a benchmark for evaluating embodied agent memory
具身智能之心· 2025-06-22 10:56
Group 1
- The core issue in embodied intelligence is the lack of long-term memory, which limits the ability to process multimodal observational data across time and space [3]
- Current visual language models (VLMs) excel in planning and control tasks but struggle with integrating historical experiences in embodied environments [3][5]
- Existing video QA benchmarks fail to adequately assess tasks requiring fine-grained reasoning, such as object manipulation and navigation [5]

Group 2
- The proposed benchmark includes a task architecture that allows for dynamic environment interaction and memory reasoning validation [4][6]
- A total of 60 task categories are designed to cover spatiotemporal semantic memory challenges, including spatial relations, temporal reasoning, attribute memory, and multi-target recall [7]
- Key technical innovations include a programmatic expansion of task complexity through increased interaction counts and a strict separation of experience collection from interaction phases [9][6]

Group 3
- Experimental results reveal three major bottlenecks in VLM memory capabilities across the 60 tasks: failures in long-sequence reasoning, weak spatial representation, and collapse in multi-target processing [13][14][16]
- The performance of native VLMs declines as the number of frames increases, indicating ineffective utilization of long contexts [20]
- Supervised fine-tuning models show improved performance by leveraging longer historical data, suggesting a direction for VLM refinement [25]

Group 4
- The benchmark represents the first photorealistic embodied memory evaluation framework, covering complex household environments and allowing for scalable assessment [26]
- Future directions include memory compression techniques, end-to-end joint training to address the split between high-level reasoning and low-level execution, and the development of long-term video understanding [26]
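The strict separation of experience collection from the interaction phase mentioned above can be sketched as a two-phase evaluation loop: the agent first records a trajectory of observations, and memory questions are then answered from that frozen log alone, with no further environment access. Everything below (class names, the query format, the toy trajectory) is an illustrative assumption, not the benchmark's actual API.

```python
# Minimal sketch of a FindingDory-style two-phase memory evaluation.
# Hypothetical names throughout; not the benchmark's real interface.
from dataclasses import dataclass, field


@dataclass
class Observation:
    step: int       # timestep in the experience trajectory
    obj: str        # object seen at this step
    location: str   # where it was seen


@dataclass
class ExperienceLog:
    frames: list = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        self.frames.append(obs)


def collect_experience(trajectory):
    """Phase 1: record observations only; no questions are asked yet."""
    log = ExperienceLog()
    for step, (obj, loc) in enumerate(trajectory):
        log.record(Observation(step, obj, loc))
    return log


def answer_memory_query(log: ExperienceLog, obj: str):
    """Phase 2: answer 'where did you last see <obj>?' from the frozen
    log only; new interaction with the environment is not allowed."""
    for obs in reversed(log.frames):
        if obs.obj == obj:
            return obs.location
    return None


log = collect_experience([("mug", "kitchen"),
                          ("keys", "hallway"),
                          ("mug", "desk")])
print(answer_memory_query(log, "mug"))   # most recent sighting wins → desk
print(answer_memory_query(log, "keys"))  # → hallway
```

Separating the two phases this way is what lets the benchmark scale task difficulty programmatically: lengthening the phase-1 trajectory (more interactions, more frames) stresses the model's memory without changing the phase-2 question format.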