Long-horizon reasoning

Alibaba enters embodied AI: Qwen assembles an internal squad, led by Tongyi Qianwen's technical head
量子位· 2025-10-09 07:03
Hengyu, reporting from Aofeisi. QbitAI | WeChat official account QbitAI

The embodied intelligence squad belongs to Qwen (Tongyi Qianwen) under Alibaba, the core department responsible for developing Alibaba's flagship foundation models. The team has long handled the development, open-sourcing, and commercial application of the Qwen model series.

The news was made public in a post by Tongyi Qianwen's technical head Lin Junyang (Justin Lin), who wrote: "The Qwen team has formed a brand-new embodied intelligence squad!"

Foreign media commented that the move marks Alibaba's clearest exploration of physical AI systems to date. Alibaba thus joins OpenAI, Google, and other large-model giants in announcing its entry into the embodied intelligence race.

Jensen Huang has said that NVIDIA sees a "multi-trillion-dollar" long-term growth opportunity in AI and robotics. Clearly, Alibaba has no intention of passing up this long-term opportunity.

"Toward the real world": Qwen builds an embodied intelligence team

Multimodal foundation models are now evolving into foundation agents that can use tools and memory and carry out long-horizon reasoning via reinforcement learning. They should move from the virtual world into the real world!

If earlier large models were about "understanding" the world, the goal of embodied intelligence is to let models "participate" in it. From Lin Junyang's post it is clear that Qwen has already begun pushing its multimodal models toward this new embodied stage. ...
FindingDory: a benchmark for evaluating embodied agent memory
具身智能之心· 2025-06-22 10:56
Group 1
- The core issue in embodied intelligence is the lack of long-term memory, which limits the ability to process multimodal observational data across time and space [3]
- Current visual language models (VLMs) excel in planning and control tasks but struggle with integrating historical experiences in embodied environments [3][5]
- Existing video QA benchmarks fail to adequately assess tasks requiring fine-grained reasoning, such as object manipulation and navigation [5]

Group 2
- The proposed benchmark includes a task architecture that allows for dynamic environment interaction and memory reasoning validation [4][6]
- A total of 60 task categories are designed to cover spatiotemporal semantic memory challenges, including spatial relations, temporal reasoning, attribute memory, and multi-target recall [7]
- Key technical innovations include a programmatic expansion of task complexity through increased interaction counts and a strict separation of experience collection from interaction phases [9][6]

Group 3
- Experimental results reveal three major bottlenecks in VLM memory capabilities across the 60 tasks: failures in long-sequence reasoning, weak spatial representation, and collapse in multi-target processing [13][14][16]
- The performance of native VLMs declines as the number of frames increases, indicating ineffective utilization of long contexts [20]
- Supervised fine-tuning models show improved performance by leveraging longer historical data, suggesting a direction for VLM refinement [25]

Group 4
- The benchmark represents the first photorealistic embodied memory evaluation framework, covering complex household environments and allowing for scalable assessment [26]
- Future directions include memory compression techniques, end-to-end joint training to address the split between high-level reasoning and low-level execution, and the development of long-term video understanding [26]
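The strict separation of experience collection from the interaction phase mentioned above can be sketched as a two-phase evaluation loop: the agent first records a trajectory of observations, and memory questions are then answered from that frozen log alone, with no further environment access. Everything below (class names, the query format, the toy trajectory) is an illustrative assumption, not the benchmark's actual API.

```python
# Minimal sketch of a FindingDory-style two-phase memory evaluation.
# Hypothetical names throughout; not the benchmark's real interface.
from dataclasses import dataclass, field


@dataclass
class Observation:
    step: int       # timestep in the experience trajectory
    obj: str        # object seen at this step
    location: str   # where it was seen


@dataclass
class ExperienceLog:
    frames: list = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        self.frames.append(obs)


def collect_experience(trajectory):
    """Phase 1: record observations only; no questions are asked yet."""
    log = ExperienceLog()
    for step, (obj, loc) in enumerate(trajectory):
        log.record(Observation(step, obj, loc))
    return log


def answer_memory_query(log: ExperienceLog, obj: str):
    """Phase 2: answer 'where did you last see <obj>?' from the frozen
    log only; new interaction with the environment is not allowed."""
    for obs in reversed(log.frames):
        if obs.obj == obj:
            return obs.location
    return None


log = collect_experience([("mug", "kitchen"),
                          ("keys", "hallway"),
                          ("mug", "desk")])
print(answer_memory_query(log, "mug"))   # most recent sighting wins → desk
print(answer_memory_query(log, "keys"))  # → hallway
```

Separating the two phases this way is what lets the benchmark scale task difficulty programmatically: lengthening the phase-1 trajectory (more interactions, more frames) stresses the model's memory without changing the phase-2 question format.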