机器之心
New SOTA for Multimodal Bug Repair: TUM's GUIRepair Tops the SWE-bench Multimodal Leaderboard
机器之心· 2025-09-16 00:22
Automatically repairing real-world software defects has long been a goal of the automated program repair research community, yet automatically resolving visual software defects remains an underexplored area. With the SWE-bench team's recent release of SWE-bench Multimodal, a new benchmark for multimodal issue repair, the problem has drawn wide attention from researchers, and solving such multimodal issues effectively poses a key challenge to existing repair systems. To address this scenario, the Software Engineering & AI team at the Technical University of Munich presents a new study, GUIRepair ("Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair"). The work has reached first place on the SWE-bench Multimodal leaderboard, opening a promising path for multimodal automated software repair, and the paper has been accepted at ASE 2025, a top academic conference in software engineering.

Research motivation: why study "visual software issues"? In software engineering, automated program repair (Automated Program Re ...
From "Lip-Syncing" to "Performing": Kling AI's Newly Evolved Digital Human, with Its Technology Now Public
机器之心· 2025-09-15 12:19
Core Viewpoint
- The article discusses the advancements made by Kuaishou's Keling team in creating a new digital human generation paradigm through the Kling-Avatar project, which allows for expressive and natural performances in long videos, moving beyond simple lip-syncing to full-body expressions and emotional engagement [2][31].

Group 1: Technology and Framework
- Kling-Avatar utilizes a two-stage generative framework powered by a multimodal large language model, enabling the transformation of audio, visual, and textual inputs into coherent storylines for video generation [6][10].
- A multimodal director module organizes inputs into a structured narrative, extracting voice content and emotional trajectories from audio, identifying human features and scene elements from images, and integrating user text prompts into actions and emotional expressions [8][10].
- The system first generates a blueprint video that outlines the overall rhythm, style, and key expression nodes, which is then used to create high-quality sub-segment videos (see the sketch after this list) [12][28].

Group 2: Data and Training
- The Keling team collected thousands of hours of high-quality video data from various sources, including speeches and dialogues, to train multiple expert models for assessing video quality across several dimensions [14].
- A benchmark of 375 reference image-audio-text prompt pairs was created to evaluate digital human video generation methods, providing a challenging test of multimodal instruction following [14][23].

Group 3: Performance and Results
- Kling-Avatar demonstrated superior performance in a comparative evaluation against advanced products such as OmniHuman-1 and HeyGen, achieving higher scores in overall effectiveness, lip-sync accuracy, visual quality, control response, and identity consistency [16][24].
- Generated lip movements were highly synchronized with the audio, and facial expressions adapted naturally to vocal variations, even during complex phonetic sounds [25][26].
- Kling-Avatar generates long videos efficiently because multiple segments are produced in parallel from a single blueprint video, maintaining quality and coherence throughout [28].

Group 4: Future Directions
- The Keling team aims to continue exploring high-resolution video generation, fine-grained motion control, and complex multi-turn instruction understanding, striving to imbue digital humans with a genuine and captivating presence [31].
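To make the two-stage design concrete, here is a minimal, runnable Python sketch of the control flow only. The `Storyline` fields, the toy `direct`/`render_*` stand-ins, and the parallelism granularity are all illustrative assumptions, not Kling-Avatar's implementation:

```python
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor

@dataclass
class Storyline:
    transcript: str   # spoken content extracted from the audio track
    emotions: list    # coarse emotion labels along the timeline
    actions: list     # user text prompts resolved into concrete actions

def direct(audio: str, image: str, prompt: str) -> Storyline:
    """Stage 1 (toy stand-in): an MLLM 'director' fuses the three inputs."""
    return Storyline(transcript=audio, emotions=["calm", "excited"], actions=[prompt])

def render_blueprint(story: Storyline) -> list:
    """Blueprint video: keyframes fixing global rhythm, style, expression peaks."""
    return [f"key[{e}]" for e in story.emotions] + ["key[end]"]

def render_segment(pair: tuple) -> str:
    """Stage 2: each sub-segment is conditioned on two adjacent keyframes."""
    start, end = pair
    return f"clip({start} -> {end})"

def generate(audio: str, image: str, prompt: str) -> list:
    story = direct(audio, image, prompt)
    keys = render_blueprint(story)
    # Adjacent keyframe pairs are independent, so segments can render in parallel,
    # which is what makes single-blueprint long-video generation efficient.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(render_segment, zip(keys, keys[1:])))

print(generate("hello world", "portrait.png", "wave at the camera"))
```

The design point the sketch captures is that global coherence lives in the blueprint keyframes, so the expensive per-segment rendering never has to see the whole video.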
The Native Entry Point for Digital Life: Ant Group Releases gPass, a New Technology Framework for AI Glasses
机器之心· 2025-09-15 12:19
Core Viewpoint
- Ant Group has launched the world's first trusted connection technology framework for smart glasses, named gPass, which aims to create a secure, interactive, and connected ecosystem for AI glasses and intelligent agents [2][10].

Group 1: AI Glasses Value Proposition
- AI glasses are positioned as the "AI native entry," fundamentally transforming digital service models across three dimensions:
  1. Service forms evolve from "flat perception" to "spatial cognition," integrating digital information with physical environments [4].
  2. Interaction methods shift from "linear commands" to "sensory interaction," enabling more natural and efficient communication between users and devices [4].
  3. Experience modes transition from "vertical scenarios" to "lifestyle services," providing personalized and proactive services based on environmental awareness [4].

Group 2: Industry Challenges
- The AI glasses ecosystem currently faces three major challenges:
  1. Fragmentation of end-to-end hardware and software capabilities, with weak infrastructure and a lack of unified standards [6].
  2. A severe shortage of native applications for AI glasses to meet diverse user needs [7].
  3. Difficulty in upgrading mobile internet services to intelligent services, hindering technology adoption [7].

Group 3: gPass Technology Framework
- gPass aims to address these ecosystem pain points by providing secure and trusted service connections for all partners in the AI glasses supply chain, including developers, manufacturers, and service providers [8][10].
- The framework features three core capabilities:
  1. **Security**: gPass ensures trusted identity verification through biometric authentication and establishes secure communication channels for data transmission [10].
  2. **Interaction**: It incorporates multimodal understanding technologies, enabling seamless payment and interaction through voice and iris recognition [11].
  3. **Connection**: gPass facilitates multi-device connectivity, allowing AI glasses to interact with smartphones, smart cars, and other devices, ensuring smooth service delivery [12].

Group 4: Application and Future Prospects
- gPass has already been deployed by brands such as Rokid, Xiaomi, Quark, and Thunderbird, enabling features like "look-to-pay" [15].
- Future applications in healthcare, travel, and other sectors are anticipated, providing users with seamless and privacy-conscious experiences [15].
- Ant Group aims for gPass to act as an "accelerator" for the AI glasses industry, fostering collaboration across the supply chain to deliver mature and user-friendly products to consumers [15].
OpenVision 2: A Generative Pre-trained Vision Encoder Built on "Less Is More"
机器之心· 2025-09-15 12:19
Core Insights
- The article discusses OpenVision 2, a generative visual pre-training model that simplifies the training process while maintaining strong performance and significantly improving training efficiency [2][21].

Group 1: OpenVision 2 Overview
- OpenVision 2, proposed by researchers from UCSC, Apple, and UCB, represents a new direction in generative visual pre-training, improving training efficiency while scaling to 1 billion parameters [2][21].
- The model removes the text encoder and contrastive learning used by its predecessor, OpenVision, eliminating that training-pipeline complexity and focusing solely on the "image → description" generation objective [9][21].

Group 2: Performance and Efficiency
- Experimental results show that OpenVision 2 performs comparably to or better than OpenAI's CLIP and Google's SigLIP on various multimodal benchmarks, particularly excelling at OCR and text-related tasks [14][21].
- Training time is reduced by 1.5 to 2 times and memory usage is cut by nearly half, allowing larger batch sizes and more efficient training [14][16].

Group 3: Key Innovations
- OpenVision 2 randomly drops about 2/3 of the visual tokens during pre-training, which reduces the computational burden on the text decoder and improves training efficiency (see the sketch below) [10][22].
- The model relies on high-quality synthetic descriptions as the sole supervision signal, which aligns closely with downstream tasks and reduces the "goal misalignment" between pre-training and application [21][22].

Group 4: Community Impact
- The research challenges the long-standing dominance of contrastive learning, demonstrating that powerful visual encoders can be trained within a generative framework and paving the way for future multimodal foundation models [21][22].
- Over 25 models of varying scales and configurations have been open-sourced, providing a reproducible and scalable resource base for both academia and industry [21].
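As an illustration of the token-dropping trick, the following PyTorch sketch keeps a random ~1/3 of visual tokens per image before they reach the caption decoder. The tensor shapes and keep ratio are assumptions for illustration; this is not the authors' code:

```python
import torch

def drop_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 1 / 3) -> torch.Tensor:
    """Randomly keep ~keep_ratio of visual tokens before the caption decoder.

    tokens: (batch, num_tokens, dim) output of the vision encoder.
    The dropped tokens never enter the decoder, so its attention cost shrinks
    roughly in proportion to keep_ratio; a fresh subset is sampled every step.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independent random subset per image: random scores -> argsort -> top slice
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

feats = torch.randn(8, 576, 1024)        # e.g. 576 patch tokens per image (assumed)
print(drop_visual_tokens(feats).shape)   # torch.Size([8, 192, 1024])
```

Because captioning is a many-to-one target, the model can still be supervised effectively from a sparse subset of tokens, which is what buys the reported compute savings.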
Generating Images with Optics at Nearly Zero Power: A Study First-Authored by a Zhejiang University Alumnus Published in Nature
机器之心· 2025-09-15 04:00
Core Viewpoint
- The article discusses the development of an ultra-low-power AI image generator based on optics, which consumes far less energy than traditional AI models [1][3].

Group 1: Technology Overview
- The optical generative model is inspired by diffusion models: a digital encoder, which consumes minimal energy, generates a static noise pattern [2][11].
- A spatial light modulator (SLM) imprints the noise pattern onto a laser beam, and a second SLM decodes it into the final image [2][3].
- Unlike traditional AI, which relies on millions of computational operations, the optical system performs all core tasks with light, so image formation itself consumes almost no energy (a toy numerical analogy follows below) [3][11].

Group 2: Applications and Potential
- The technology has broad application prospects, including generating images and videos for VR and AR displays, as well as for wearable devices such as smartphones and AI glasses [6][9].
- The optical generative model can produce monochrome or color images matching target data distributions, showcasing its versatility [11][12].

Group 3: Experimental Results
- Initial experiments on the MNIST and Fashion-MNIST datasets achieved FID scores of 131.08 and 180.57, respectively, indicating that the generated images align well with the target distributions [22].
- High-resolution experiments generating Van Gogh-style artworks demonstrated the model's ability to produce both monochrome and color images of high quality [24][28].
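As a loose numerical analogy only (not the paper's actual optics), one can picture the pipeline as a tiny digital encoder producing a phase mask, followed by "free" optical decoding, idealized below as a single Fourier transform with intensity detection. The random-matrix encoder here is untrained and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the learned digital encoder: in the real system both
# this mapping and the decoding SLM are optimized, while the light propagation
# itself consumes essentially no electrical power.
W = rng.standard_normal((64 * 64, 16))

def encode(z: np.ndarray) -> np.ndarray:
    """Digital encoder: latent noise -> 64x64 phase pattern in [0, 2*pi)."""
    return np.mod((W @ z).reshape(64, 64), 2 * np.pi)

def optical_decode(phase: np.ndarray) -> np.ndarray:
    """Idealized optics: the SLM imprints exp(i*phase) on a plane wave; a lens
    performs a Fourier transform 'for free'; the sensor records intensity."""
    field = np.exp(1j * phase)
    return np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2

image = optical_decode(encode(rng.standard_normal(16)))
print(image.shape, float(image.max()))
```

The point of the analogy is where the energy goes: only the small `encode` step needs electronics, while the heavy transform is done by light passing through the modulators.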
Goodbye to ROS Hassle, an Easy-to-Use and Easy-to-Learn Robot Learning System: Huawei Noah's Open-Source Python Framework for Robot Learning
机器之心· 2025-09-15 04:00
Figure 1: The overall framework of Ark

In recent years, robotics has achieved remarkable breakthroughs on the hardware side: the DARPA Robotics Challenge and the first humanoid-robot free-combat exhibition both showcased striking progress. Robot autonomy, however, still lags well behind the pace of machine learning.

The key bottleneck behind this gap is software. Existing robotics stacks have a steep learning curve, still rely heavily on C/C++ for low-level development, and suffer from fragmented toolchains and complex hardware integration. By contrast, the ecosystem driving modern AI is Python-centric, well documented, and easy to use: a stark contrast.

To address these challenges, researchers from Huawei Noah's Ark Lab, TU Darmstadt, University College London, Imperial College London, and the University of Oxford jointly released Ark, a Python-based robot development framework that supports rapid prototyping and convenient deployment of new algorithms on both simulated and real robot systems.

Ark is deeply compatible with mainstream machine learning workflows: it can collect and preprocess data from simulation or from physical robots, and it supports policy training with state-of-the-art imitation learning methods such as ACT and Diffusion Policy. The framework adopts an OpenAI Gym-style main interface, which greatly lowers the barrier to entry for machine learning researchers and eases integration and experimentation ...
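The Gym-style contract the article refers to looks like the loop below, written against the standard `gymnasium` package. Ark's own environment classes and ids may differ, so treat the environment name as a placeholder:

```python
import gymnasium as gym

# A standard Gym-style loop; an Ark environment would slot in where
# "CartPole-v1" appears (Ark's actual environment ids are not shown here).
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

for _ in range(200):
    action = env.action_space.sample()   # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

The appeal of this interface is that swapping simulation for real hardware, or a random policy for an imitation-learned one, changes only which `env` and `action` objects are constructed, not the loop itself.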
Cutting the KV Cache Budget to 1.5%: They Used an Evolutionary Algorithm to Slash Large-Model Memory Usage
机器之心· 2025-09-14 05:16
Core Insights
- EvolKV achieves superior performance with only 1.5% of the full KV cache budget, significantly reducing inference costs for large language models [1][11][25].
- Traditional KV cache methods struggle with long input texts, which inflate storage requirements and slow processing [3][4].

KV Cache Optimization
- Existing KV cache compression methods rely primarily on heuristics, which may not optimally retain task-relevant information [4][9].
- EvolKV introduces an evolutionary framework that adaptively allocates KV cache budgets across transformer layers, optimizing directly for downstream task performance [6][10].

Performance Improvements
- Across benchmarks, EvolKV consistently outperforms baseline methods, achieving up to a 13% improvement on the Needle-in-a-Haystack benchmark and maintaining high accuracy on the GSM8K dataset [11][30][25].
- The method adapts well across diverse tasks, remaining competitive even under reduced cache budgets [25][29].

Experimental Results
- Comprehensive experiments on Mistral-7B-Instruct and Llama-3-8B-Instruct show that EvolKV outperforms all baseline methods across multiple KV cache budget configurations [22][24].
- In the LongBench evaluation, EvolKV consistently achieved the highest average performance, even surpassing the full model under certain configurations [22][25].

Evolutionary Algorithm Mechanism
- The evolutionary algorithm generates candidate solutions and evaluates their fitness by downstream task performance, which guides the optimization (see the sketch below) [13][14].
- The optimization proceeds over groups of layers for efficiency, yielding more stable optimization dynamics [16][17].

Cache Budget Allocation
- EvolKV allocates KV cache budgets in a dynamic, task-driven way, so the distribution matches the functional contributions of different transformer layers [10][19].
- A mechanism adjusts the total KV cache budget to ensure fairness in evaluation [20].
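A minimal sketch of the search idea as described: mutate per-layer budget vectors, renormalize to a fixed total, and keep the fittest. The fitness function below is a toy proxy standing in for "run the compressed model on the downstream task", all constants are assumptions, and EvolKV's group-wise optimization is omitted:

```python
import random

NUM_LAYERS, TOTAL_BUDGET = 32, 4096   # hypothetical depth and total KV slots

def normalize(budgets):
    """Rescale a candidate so per-layer allocations sum to the fixed total."""
    s = sum(budgets)
    return [max(1, round(b * TOTAL_BUDGET / s)) for b in budgets]

def fitness(budgets):
    """Toy proxy; the real EvolKV fitness is measured task performance."""
    mid = NUM_LAYERS // 2
    return -sum(abs(i - mid) * (b - TOTAL_BUDGET / NUM_LAYERS) ** 2
                for i, b in enumerate(budgets))

def evolve(generations=50, population=16, sigma=8.0):
    best = normalize([1.0] * NUM_LAYERS)      # start from a uniform allocation
    for _ in range(generations):
        # Gaussian mutation of the incumbent, then renormalize each child
        children = [normalize([max(1.0, b + random.gauss(0, sigma)) for b in best])
                    for _ in range(population)]
        best = max(children + [best], key=fitness)
    return best

print(evolve()[:8])   # first eight per-layer budgets of the best candidate found
```

Because fitness is evaluated on the end task rather than on attention statistics, the search can discover non-uniform layer allocations that heuristics would miss, which is the paper's central claim.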
Hands-On with Meituan's First AI Agent: Experiencing the Joy of "Lazy Ordering"
机器之心· 2025-09-14 05:16
Core Viewpoint
- The article discusses the launch of Meituan's AI Agent "Xiao Mei," which simplifies the food-ordering process through natural language commands, improving user experience and efficiency in local services [2][3][27].

Group 1: Product Features
- "Xiao Mei" lets users place orders without navigating complex interfaces, using simple voice commands to find restaurants and order food [5][12].
- The agent learns user preferences over time, providing personalized recommendations for different demographics, such as the elderly and children [5][27].
- Users can request specific meal types, such as low-carb options, and receive suggestions tailored to their dietary needs and past orders [14][27].

Group 2: User Experience
- The ordering process requires fewer steps than traditional apps, saving users time and effort [8][12].
- "Xiao Mei" can also assist with hotel bookings and travel recommendations, showing versatility beyond food ordering [15][19].
- The agent proactively plans meals for the week, improving efficiency for users with regular dietary habits [25][27].

Group 3: Technical Insights
- The product's capabilities are attributed to Meituan's proprietary Longcat model, which is optimized for local-service scenarios and can dynamically activate parameters as needed for efficient processing [31][32].
- The model leverages extensive local data to provide accurate, personalized responses, addressing the fragmented user demands and real-time requirements of the local lifestyle sector [30][32].

Group 4: Market Context
- The article highlights a competitive landscape in which international companies focus AI agents on productivity, while domestic firms like Meituan emphasize consumer and lifestyle applications [39].
- The AI agent market is expected to grow significantly, with projections of $13 billion by the end of 2025, nearly double the 2024 figure [38][39].
LLaSO Debuts: Logic Intelligence Releases the World's First Fully Open-Source Speech LLM Framework, Defining a New Baseline for LSLM Research
机器之心· 2025-09-14 05:16
Paper title: LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Riding the wave of large language models (LLMs), multimodal AI has advanced rapidly, and the vision-language (LVLM) field in particular has settled into a mature research paradigm. In sharp contrast, progress on large speech-language models (LSLMs) has been fragmented and slow.

The field has long been plagued by fragmented architectures, opaque training data, and missing evaluation standards, making fair comparison across studies difficult and severely hindering reproducibility and systematic community progress. Many works release model weights, but the keys to their success (training data and configuration details) are often kept under wraps.

To break this deadlock, Beijing Deep Logic Intelligence Technology Co., Ltd. has released LLaSO, the first fully open, end-to-end research framework for speech-language models. LLaSO aims to give the community a unified, transparent, and reproducible infrastructure. Its contribution is a complete package: a full suite of open-source data, benchmarks, and models, intended to accelerate community-driven innovation in the LSLM field.

Paper: https://arxiv.org/abs/2508.1 ...
I'd Pay for This Tab Key Alone: Cursor Uses Online Reinforcement Learning to Optimize Code Suggestions. Does It Now Have a Moat?
机器之心· 2025-09-14 03:07
机器之心 report. Editor: +0

Cursor Tab is one of Cursor's core features: by analyzing a developer's coding behavior, it predicts and suggests the next edits, which the developer can accept with a single press of the Tab key.

It also faces a problem common to AI assistants, however: it can be "overly eager". Sometimes its suggestions are not merely useless but actively interrupt the developer's train of thought. The crux is not just getting the AI to write better code, but teaching it to read the room: offer help at exactly the right moment, and stay quiet otherwise.

To that end, Cursor used online reinforcement learning to train an entirely new Tab model. The model treats every user interaction (accepting or rejecting a suggestion) as a reinforcement signal and uses it directly for online optimization. Driven by more than 400 million requests per day, the model learns continuously and at high frequency from real-world feedback.

Cursor has made the new Tab model the default. Compared with the old model, it makes 21% fewer suggestions, yet the acceptance rate of the suggestions it does make is 28% higher. The goal is a better coding experience, and Cursor plans to deepen this line of work.

Cursor's rollout strategy is distinctive and efficient: it deploys new models to users several times a day (every 1.5-2 hours), training and optimizing rapidly on live data. This contrasts with the mainstream ...
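Cursor has not published its training objective beyond this description, but accept/reject feedback maps naturally onto a one-step policy gradient over a "show vs. stay quiet" action. The sketch below is a hypothetical toy with made-up context features and a simulated user, not Cursor's system:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)   # weights of a tiny "should I suggest?" policy

def p_show(x: np.ndarray) -> float:
    """Probability of showing a suggestion given context features x."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def update(x: np.ndarray, shown: bool, reward: float, lr: float = 0.1) -> None:
    """One-step policy gradient (REINFORCE) on the show/stay-quiet action."""
    global w
    grad_logp = (1 - p_show(x)) * x if shown else -p_show(x) * x
    w += lr * reward * grad_logp

# Simulated interaction stream: features might encode pause length, cursor
# position, recent edit type, etc. (all hypothetical). Reward is +1 for an
# accepted suggestion and -1 for a rejected one.
for _ in range(10_000):
    x = rng.standard_normal(4)
    if rng.random() < p_show(x):                           # policy chose to show
        accepted = rng.random() < 0.5 + 0.3 * np.tanh(x[0])  # toy user behavior
        update(x, shown=True, reward=1.0 if accepted else -1.0)

print(w)   # the policy learns to show suggestions where acceptance is likely
```

The asymmetry in the reported numbers (21% fewer suggestions, 28% higher acceptance) is exactly what such an objective encourages: suppressing low-expected-reward suggestions costs nothing, while rejected ones are penalized.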