SIGCOMM 2025 | Redefining the Personalized Video Experience: Kuaishou and Tsinghua Jointly Propose the LingXi System
机器之心· 2025-09-04 04:11
Core Viewpoint
- Kuaishou and Tsinghua University's Sun Lifeng team have developed the LingXi system, a groundbreaking personalized optimization system for adaptive video streaming, which has been accepted at the prestigious ACM SIGCOMM 2025 conference [2][4].

Group 1: Background and Motivation
- The transition from traditional Quality of Service (QoS) to personalized Quality of Experience (QoE) is highlighted, emphasizing the limitations of existing QoS optimization methods in enhancing user experience [6].
- A large-scale A/B test demonstrated that traditional QoS metrics do not translate into improved user experience, indicating a saturation of optimization paths [7][14].
- The study identifies "buffering" as the primary negative factor affecting user experience, necessitating a focus on this aspect for effective QoE optimization [15][23].

Group 2: System Design and Components
- The LingXi system is designed as a dynamic optimization module compatible with existing Adaptive Bitrate (ABR) algorithms, ensuring seamless integration without disrupting user experience [31][34].
- The system comprises three core components:
  1. Online Bayesian Optimization (OBO) for dynamic parameter exploration [34].
  2. Monte Carlo sampling for simulating future decisions based on historical data [35].
  3. A hybrid exit-rate predictor for accurately quantifying user experience [36][38].

Group 3: Experimental Results
- A 10-day large-scale A/B test on the Kuaishou platform showed significant improvements in both QoE and QoS metrics, validating the effectiveness of the LingXi system [40][46].
- The system particularly benefits low-bandwidth users, reducing buffering time by approximately 15% in scenarios with bandwidth below 2000 kbps [52][58].
- The analysis of user sensitivity to buffering revealed a clear negative correlation between buffering sensitivity and the parameters assigned by the system, demonstrating the system's ability to adapt to individual user needs [56].

Group 4: Conclusion
- The successful implementation of the LingXi system marks a significant evolution in adaptive video streaming optimization, shifting from static system-level goals to personalized strategies for diverse user experiences [57][58].
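The Monte Carlo component above can be illustrated with a small sketch: bootstrap future throughput samples from a user's bandwidth history and estimate the expected rebuffering a candidate bitrate would cause. The function name, buffer model, and parameters below are illustrative assumptions, not LingXi's actual implementation.

```python
import random

def estimate_rebuffer(history_kbps, bitrate_kbps, chunk_s=2.0,
                      horizon_chunks=10, n_samples=500,
                      start_buffer_s=4.0, seed=0):
    """Monte Carlo estimate of expected stall time (seconds) over a short
    horizon, resampling future throughput from historical observations."""
    rng = random.Random(seed)
    total_stall = 0.0
    for _ in range(n_samples):
        buf, stall = start_buffer_s, 0.0
        for _ in range(horizon_chunks):
            bw = rng.choice(history_kbps)          # bootstrap one future sample
            dl = chunk_s * bitrate_kbps / bw       # download time for one chunk
            if dl > buf:                           # buffer drains: playback stalls
                stall += dl - buf
                buf = 0.0
            else:
                buf -= dl
            buf += chunk_s                         # chunk adds playable seconds
        total_stall += stall
    return total_stall / n_samples
```

An optimizer such as LingXi's OBO would then favor the ABR parameter whose simulated futures minimize expected stalls for that particular user.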
Giving Embodied Agents a "Sense of Space": Tsinghua and Beihang Jointly Propose a Brain-Inspired Spatial Cognition Framework That Handles Navigation, Reasoning, and Even Making Breakfast
机器之心· 2025-09-04 03:27
Core Viewpoint
- The article discusses the innovative BSC-Nav framework developed by Tsinghua University and Beihang University, which enhances embodied intelligence in robots by integrating a structured spatial memory system inspired by biological cognition, enabling robots to perform complex navigation and interaction tasks autonomously [4][11][42].

Group 1: BSC-Nav Overview
- BSC-Nav is the first unified framework inspired by the spatial cognition mechanisms of the biological brain, providing advanced navigation capabilities and enabling higher-level spatial perception and interaction tasks [7][8].
- The framework addresses the limitations of existing AI models in physical environments, particularly their short-term memory and poor generalization in dynamic settings [8][11].

Group 2: Memory Components
- BSC-Nav incorporates three key memory components: the Landmark Memory Module, the Cognitive Map Module, and the Working Memory Module, which collectively replicate human spatial cognition [12][17][18].
- The Landmark Memory Module identifies and records significant objects in the environment, while the Cognitive Map Module creates a global cognitive map based on observed features [16][17].
- The Working Memory Module allows the robot to retrieve and reconstruct relevant spatial memories for task execution, enhancing its reasoning and generalization capabilities [18][19].

Group 3: Performance Validation
- Extensive experiments in the Habitat simulation environment demonstrated BSC-Nav's superior performance across four major navigation tasks, achieving new state-of-the-art results [20][24].
- In object navigation tasks, BSC-Nav achieved a success rate of 78.5%, surpassing the previous best method by 24% [24].
- The framework also excelled in complex instruction navigation and active embodied question answering, showcasing its ability to understand and execute intricate tasks [25][28][31].

Group 4: Real-World Application
- BSC-Nav was tested in a real-world environment, achieving over an 80% navigation success rate across various tasks and demonstrating strong generalization capabilities [35][38].
- The robot successfully performed complex operations, including the multi-step task of preparing breakfast, highlighting its practical applicability [38][43].

Group 5: Future Directions
- The research emphasizes that the evolution of embodied intelligence may not rely solely on computational power but can be significantly enhanced through effective memory systems [41][42].
- Future plans include expanding the memory framework to more dynamic environments and complex cognitive tasks, aiming for further advancements in embodied AI [42].
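The three memory stores can be pictured with a toy data structure. The class, fields, and retrieval rule below are illustrative assumptions for intuition only, not the paper's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class SpatialMemory:
    """Toy analogue of BSC-Nav's three memory stores (illustrative only)."""
    landmarks: dict = field(default_factory=dict)  # landmark memory: label -> [(x, y), ...]
    cog_map: set = field(default_factory=set)      # cognitive map: observed grid cells

    def observe(self, label, pos):
        """Record a significant object and mark its cell on the global map."""
        self.landmarks.setdefault(label, []).append(pos)
        self.cog_map.add(pos)

    def recall(self, label, near):
        """Working-memory retrieval: return the stored instance of `label`
        closest to the query position `near`, or None if never observed."""
        cands = self.landmarks.get(label, [])
        if not cands:
            return None
        return min(cands,
                   key=lambda p: (p[0] - near[0]) ** 2 + (p[1] - near[1]) ** 2)
```

A navigation policy would query `recall` to turn an instruction like "go to the mug" into a concrete goal coordinate on the cognitive map.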
Tesla's Next-Generation Gold Optimus Prototype Appears? A Pair of "Fake Hands" Becomes the Biggest Criticism
机器之心· 2025-09-04 03:27
Core Viewpoint
- Tesla's humanoid robot, Optimus, is being positioned as a revolutionary physical intelligence agent, with a high price range of $200,000 to $500,000, as highlighted by Salesforce CEO Marc Benioff [1][10].

Group 1
- The interaction between Benioff and Optimus showcased the robot's capabilities, although it was noted that Optimus walked somewhat slowly but steadily [8].
- There is a significant public reaction to the high price of Optimus, with expectations that mass production could lower the price to around $20,000 to $30,000 [10].
- Optimus has evolved since its debut in December 2023, demonstrating advanced flexibility, intelligence, and human-robot interaction capabilities, including various actions like dancing and object recognition [15].

Group 2
- The design of Optimus's hands has drawn attention, appearing very human-like but seemingly more ornamental than functional [12].
- There are mixed reviews regarding Optimus's performance, with some users finding it noisy and cumbersome, while others criticized the voice integration as overly artificial and delayed [16][18].
- A comparison was made between Tesla's Optimus and Figure's robot, with Figure showcasing a more polished demonstration, while Optimus's performance seemed less refined and more spontaneous [22][26].
Just In: Google Releases Six Authentic Prompt Techniques for Nano Banana; Beginners, Take Note
机器之心· 2025-09-03 08:33
Core Viewpoint
- Google's Nano Banana has gained popularity among users for its creative applications in generating images from text prompts, showcasing the model's versatility and potential in various creative fields [2][8].

Group 1: Image Generation Techniques
- Users can create photorealistic images by providing detailed prompts that include camera angles, lighting, and environmental descriptions, which guide the model to produce realistic effects [12][13].
- The model allows for text-to-image generation, image editing through text prompts, multi-image composition, iterative optimization, and text rendering for clear visual communication [16].
- Specific templates for different styles, such as stickers, logos, product photography, minimalist designs, and comic panels, are provided to help users effectively utilize the model [18][21][25][30][34].

Group 2: User Experience and Challenges
- Despite its capabilities, users have reported challenges with the model, such as returning identical images during editing and inconsistencies compared to other models like Qwen and Kontext Pro [39].
- Users are encouraged to share their unique insights and techniques for using Nano Banana in the comments section, fostering a community of knowledge sharing [40].
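The photorealistic-prompt advice above amounts to filling a fixed template with subject, camera, lighting, and environment details. The helper below is a hypothetical sketch of that pattern; the wording is illustrative and not an official Google template:

```python
def photoreal_prompt(subject, camera, lighting, environment):
    """Assemble a detailed text-to-image prompt from the four ingredients
    the article highlights (template wording is an assumption)."""
    return (f"A photorealistic photo of {subject}, shot from a {camera}, "
            f"lit by {lighting}, set in {environment}. "
            f"High detail, natural colors.")
```

Keeping each ingredient in its own slot makes it easy to iterate on one dimension (say, lighting) while holding the rest of the prompt fixed.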
ICCV 2025 | Weakly Supervised Dynamic Scene Graph Generation Based on Temporal-Enhanced Relation-Sensitive Knowledge Transfer
机器之心· 2025-09-03 08:33
Core Viewpoint
- The article discusses a new method for weakly supervised dynamic scene graph generation, highlighting the limitations of existing object detection quality in dynamic scenes and proposing a temporal-enhanced relation-aware knowledge transfer approach to improve detection performance and scene graph generation quality [2][5][8].

Method Introduction
- The proposed method, TRKT, addresses the performance bottleneck in weakly supervised dynamic scene graph generation by enhancing object detection quality through a temporal-aware and relation-sensitive knowledge transfer mechanism [5][10].
- TRKT utilizes attention maps generated from object and relation decoders to optimize external object detectors, thereby improving the quality of generated scene graphs [8][10].

Knowledge Transfer Mechanism
- The method consists of two main components: relation-aware knowledge mining and a dual-stream fusion module [10][15].
- Relation-aware knowledge mining generates attention maps that highlight object and interaction areas, while the dual-stream fusion module combines these attention maps with external detection results to refine object localization and confidence scores [10][19].

Experimental Results
- The proposed method shows significant improvements over existing methods, with increases of 13.0% in average precision (AP) and 1.3% in average recall (AR) for object detection [25].
- In dynamic scene graph generation tasks, the method outperforms baseline models, achieving performance improvements across all evaluation metrics [25][26].

Ablation Studies
- Ablation experiments demonstrate the effectiveness of individual components, with the confidence boosting module (CBM) and localization refinement module (LRM) contributing average precision improvements of 1.2% and 2.0%, respectively [28].
- The integration of these modules leads to a combined average precision increase of 2.8%, indicating that enhancements in bounding box accuracy and confidence scores complement each other [28].

Visualization Results
- Visual comparisons show that the proposed method generates more complete and accurate scene graphs than baseline models, benefiting from the introduced temporal-enhanced relation-aware knowledge and dual-stream fusion module [31].
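The confidence-boosting idea can be caricatured in a few lines: raise a detection's score in proportion to the relation-aware attention that falls inside its bounding box. The function name, box format, and fusion rule below are simplifying assumptions, not TRKT's actual formulation:

```python
def boost_confidence(score, attn_map, box, alpha=0.5):
    """Raise a detector confidence `score` by the mean attention inside
    `box` = (x0, y0, x1, y1), clamped to 1.0.
    `attn_map` is a 2D list indexed as attn_map[y][x]."""
    x0, y0, x1, y1 = box
    region = [attn_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    mean_attn = sum(region) / len(region)
    return min(1.0, score + alpha * mean_attn)
```

Intuitively, a box that the relation decoder attends to strongly is likely to contain an interacting object, so its detection deserves more trust.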
Anthropic Admitted Its Model Got Dumber but Still Let It Slack Off? Claude Code User Trust Is Collapsing
机器之心· 2025-09-03 08:33
Core Viewpoint
- The article discusses the phenomenon of perceived "dumbing down" of AI models, particularly focusing on OpenAI's and Anthropic's models, highlighting user experiences and the acknowledgment of quality degradation by model providers [1][3][6].

Group 1: Perception of AI Model Quality
- Users often express concerns about the decline in performance of AI models, leading to the belief that models are being "dumbed down" [1][2].
- Aidan McLaughlin from OpenAI noted that the misconception of models being weakened is more common than expected, suggesting it may be a psychological phenomenon [3].

Group 2: Anthropic's Acknowledgment of Quality Issues
- Anthropic publicly admitted to a quality degradation incident with its Claude Opus 4.1 model, which occurred from August 25 to August 28, 2025, affecting user experience [5][6].
- The degradation was attributed to an update to the inference stack, which has since been rolled back, but users continued to report issues even after the rollback [7][8].

Group 3: User Reactions and Comparisons
- Users have expressed dissatisfaction with Claude Code, noting a significant decline in its performance compared to previous versions, leading many to switch to GPT-5 [8][12].
- Complaints include the inability of Claude Opus 4.1 to perform tasks that earlier models could handle, with some users labeling it as "useless" [12][13].
- The article highlights a shift in user preference toward GPT-5, with developers finding it more effective for coding tasks [13].
Diffusion Language Models Actually Lock In Their Final Answer Long Before Final Decoding
机器之心· 2025-09-03 04:33
Machine Heart report. Editor: Chen Ping

With the rapid development of diffusion language models (DLMs) across many fields, they have become a strong alternative to autoregressive (AR) models. Compared with AR models, the main advantages of DLMs include, but are not limited to, efficient parallel decoding and a flexible generation order.

Despite this acceleration potential, DLM inference in practice remains slower than AR models, because DLMs lack a KV-cache mechanism and fast parallel decoding brings a significant drop in output quality.

In this work, researchers from The Hong Kong Polytechnic University, Dartmouth College, and other institutions attempt to accelerate DLM inference from a different angle, starting from a long-overlooked yet highly promising phenomenon: early answer convergence.

Paper title: Diffusion Language Models Know the Answer Before Decoding

Through in-depth analysis, the researchers observe that under both semi-autoregressive remasking and random remasking, a very high proportion of samples can already be decoded correctly at an early stage of decoding. The trend is especially pronounced under random remasking: on GSM8K and MMLU, for example, only half of the refinement steps are needed to decode 97% and 99% of samples correctly, respectively.

Inspired by this finding, the study proposes Prophet, a training-free fast decoding strategy designed specifically to exploit early answer convergence. Pr ...
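The early-answer-convergence observation suggests a simple early-exit rule, sketched below on an abstract stream of decoded sequences. The stability criterion here is an illustrative stand-in, not Prophet's actual stopping rule:

```python
def early_exit_decode(steps, patience=2):
    """Stop iterative refinement once the decoded sequence has repeated for
    `patience` consecutive steps; otherwise run all steps.
    `steps` is an iterable of decoded token lists, one per refinement step.
    Returns (final_sequence, index_of_last_step_used)."""
    prev, stable, i = None, 0, -1
    for i, seq in enumerate(steps):
        if seq == prev:
            stable += 1
            if stable >= patience:
                return seq, i            # converged early: skip remaining steps
        else:
            stable = 0
        prev = seq
    return prev, i                       # never stabilized: use the last step
```

When most samples converge at half the refinement steps, a rule like this roughly halves the decoding cost on those samples with no retraining.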
Unitree Robotics Official Announcement: IPO Filing Within the Year, Possibly Targeting the STAR Market
机器之心· 2025-09-03 04:33
Machine Heart report. Machine Heart editorial team.

Unitree's path to going public has finally moved another step forward.

On the evening of September 2, Hangzhou Unitree Technology Co., Ltd. ("Unitree Robotics") released a statement saying it expects to submit its listing application documents to a stock exchange in the fourth quarter of this year, immediately drawing wide attention.

The full announcement reads as follows:

Since its founding, Unitree Robotics has always been a "civilian robotics company." The company is actively preparing for an initial public offering (IPO). According to the IPO plan, the company expects to submit filing documents to a stock exchange between October and December 2025, at which point its relevant operating data will be formally disclosed.

Next, a brief overview of the company's product revenue structure, taking 2024 as an example (the exact figures are subject to the information disclosed in subsequent IPO filing documents): quadruped robots, humanoid robots, and component products accounted for roughly 65%, 30%, and 5% of sales, respectively. About 80% of quadruped robots are used in research, education, and consumer applications, with the remaining 20% in industrial uses such as inspection and firefighting. Humanoid robots are used entirely in research, education, and consumer applications.

Since its founding, Unitree Robotics has been committed to applying high-performance general-purpose robots across different civilian industries, and has explicitly stated and restricted the relevant uses on its official website and in product manuals, cooperation agreements, and various other documents.

All parties are hereby reminded to exercise careful judgment and not to mistake other companies' robot products ...
Starting with Recreating a Magic Trick, RoboMirage Opens a New World of Robot Simulation
机器之心· 2025-09-03 04:33
Published on Machine Heart.

RoboMirage has the following core feature:

1. An extensible contact-modeling framework compatible with all object types. It supports diverse contacts among rigid bodies, 1D/2D/3D deformable bodies, articulated structures, and various robot end effectors; provides strongly coupled simulation; is compatible with future differentiable simulation and high-precision training needs; and allows user-defined extensions, offering a flexible underlying architecture that adapts to diverse scenarios.

RoboScience

On the development path of embodied intelligence, how to obtain massive amounts of high-quality data is a core problem the industry cannot avoid.

If large language models rely on internet-scale corpora, the growth of embodied intelligence likewise requires large-scale interaction experience. In the real world, collecting such data is extremely costly: hardware such as robot arms is expensive to deploy, with a single unit costing tens of thousands of yuan, and hard to scale; data collection depends on experienced operators and is time-consuming. In simulation, by contrast, agents can run unlimited trial-and-error at lower cost and higher efficiency, rapidly accumulating large-scale interaction experience.

For this reason, simulators have become an important supporting tool for embodied intelligence over the past few years, giving rise to a number of excellent open-source and commercial platforms. They have allowed robotics, reinforcement learning, and agent research to advance rapidly, laying the field's foundation.

But as research deepens, the industry has raised higher requirements for data: higher physical fidelity, to ensure the data matches the real world; richer interaction types, covering rigid ...
The Most Complete Survey of Speech Separation Is Here: Tsinghua and Other Teams Analyze 200+ Papers, Systematically Dissecting the "Cocktail Party Problem"
机器之心· 2025-09-03 04:33
Core Viewpoint
- The article discusses the revolutionary advancements in the field of speech separation, particularly addressing the "cocktail party problem" through the development of deep neural networks (DNNs) [2].

Group 1: Overview of Speech Separation
- Speech separation has become crucial for enhancing speech clarity in complex acoustic environments and serves as a preprocessing method for other speech processing tasks [2].
- Researchers from various institutions conducted a comprehensive survey of over 200 representative papers, analyzing the latest research methods across multiple dimensions including deep learning methods, model architectures, evaluation metrics, datasets, and future challenges [2].

Group 2: Problem Definition
- The authors categorize speech separation tasks into known- and unknown-speaker separation based on whether the number of speakers is fixed or variable, highlighting the challenges associated with each scenario [6].
- The need for dynamic output channel determination and the balance between separation quality and termination timing are emphasized as significant challenges in unknown-speaker scenarios [6].

Group 3: Learning Paradigms
- The article compares supervised and unsupervised learning methods, detailing the advantages and limitations of each approach in the context of speech separation [10].
- Supervised learning is currently the most mature paradigm, utilizing paired mixed audio and clean source audio for training, while unsupervised methods explore training models directly on unlabelled mixed audio [12].

Group 4: Model Architectures
- The core components and evolution of speech separation models are summarized: encoder, separation network, and decoder [14].
- Architectures based on RNNs, CNNs, and Transformers are discussed, showcasing their strengths in capturing long-term dependencies and local feature extraction [17][18].

Group 5: Evaluation Metrics
- A comprehensive evaluation metric system, including both subjective and objective metrics, is necessary for assessing model performance [19].
- The article compares various metrics, highlighting the trade-offs between subjective evaluations that reflect human experience and objective metrics that are efficient but may focus on different aspects [20].

Group 6: Datasets
- The article summarizes publicly available datasets for speech separation research, categorizing them based on single-channel and multi-channel formats [22].
- Understanding the coverage and difficulty of these datasets aids researchers in selecting appropriate datasets for algorithm evaluation and identifying gaps in current research [22].

Group 7: Performance Comparison
- The authors present a comparison of different models' performance on standard datasets, illustrating the progress in speech separation technology over recent years [24].
- Notable improvements in performance metrics, such as SDR, are highlighted, with advanced architectures achieving SDR levels around 20 dB [24][25].

Group 8: Tools and Platforms
- The article introduces various open-source tools and platforms that facilitate the development and application of speech separation tasks, comparing their functionalities and limitations [28].
- These tools provide convenient interfaces for researchers to replicate results and build prototype systems, accelerating the transition from research to application [28].

Group 9: Challenges and Future Directions
- The article discusses current challenges in the field, including long-duration audio processing, mobile and embedded applications, real-time speech separation, and the rise of generative methods [32][33].
- The integration of pre-training techniques and the focus on target-speaker extraction are also identified as key areas for future exploration [33].
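For context on the SDR figures quoted above: SDR is a decibel ratio of signal energy to residual-error energy, in its simplest non-scale-invariant form SDR = 10 log10(||s||^2 / ||s - s_hat||^2). The sketch below is a plain-Python illustration of that formula, not the full BSS Eval decomposition used in the literature:

```python
import math

def sdr(ref, est):
    """Signal-to-distortion ratio in dB between a reference signal `ref`
    and an estimate `est` of the same length (simplest form)."""
    signal_energy = sum(r * r for r in ref)
    error_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(signal_energy / error_energy)
```

A higher SDR means the estimate is closer to the clean source; a separator reaching roughly 20 dB leaves an error energy about 100x smaller than the signal energy.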