机器之心
Code and Multimodal Retrieval Sweep the SOTA Rankings! BAAI's BGE Embedding Models Land a Triple Release, Now Fully Open
机器之心· 2025-05-20 04:58
Published by Machine Heart (Machine Heart editorial team). Retrieval augmentation plays an important role in code and multimodal scenarios, and embedding models are a core component of any retrieval-augmentation stack. To address this need, the Beijing Academy of Artificial Intelligence (BAAI) recently teamed up with several universities to develop three embedding models: the code embedding model BGE-Code-v1, the multimodal embedding model BGE-VL-v1.5, and the visualized-document embedding model BGE-VL-Screenshot. These models deliver the best results to date on code and multimodal retrieval, topping the major benchmarks in these areas, including CoIR, Code-RAG, MMEB, and MVRB, by a clear margin.

Since its release in August 2023, BGE has become the first Chinese-developed AI model to top the Hugging Face leaderboard, as well as the global download champion among all models released in 2023. All three models, BGE-Code-v1, BGE-VL-v1.5, and BGE-VL-Screenshot, are now fully open to the community, supporting related research and industrial applications.

BGE-Code-v1:
BGE-VL-v1.5:
BGE-VL-Screenshot:

The general-purpose embedding model series BGE, developed under BAAI's leadership, aims to provide an efficient, one-stop solution for vector representation and semantic retrieval across data types. It has released multiple versions covering Chinese-English and multilingual retrieval as well as reranking models, and continues to refresh ...
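As a rough illustration of how such an embedding model is typically used for code retrieval, here is a minimal Python sketch. It assumes the checkpoint is published on Hugging Face under a name like BAAI/bge-code-v1 and is loadable with the sentence-transformers library; the model id, the query, and the snippets are illustrative assumptions, not taken from the BGE documentation.

```python
# Minimal retrieval sketch: embed a natural-language query and candidate code
# snippets, then rank the snippets by cosine similarity.
# Assumption: the checkpoint id "BAAI/bge-code-v1" and its compatibility with
# sentence-transformers are hypothetical here; check the official model card.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-code-v1")  # hypothetical model id

query = "read a CSV file and return the column names"
snippets = [
    "import csv\nwith open(path) as f:\n    header = next(csv.reader(f))",
    "def add(a, b):\n    return a + b",
]

# Encode query and candidates into dense vectors, then score by cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
snippet_embs = model.encode(snippets, normalize_embeddings=True)
scores = util.cos_sim(query_emb, snippet_embs)[0]

for snippet, score in sorted(zip(snippets, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}\t{snippet[:40]!r}")
```

The same pattern applies to the multimodal models, with image or screenshot inputs replacing the code snippets.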
ICML 2025 Spotlight | Are Multimodal Large Models Showing Their Weak Spots? The EMMA Benchmark Takes a Deep Look at Multimodal Reasoning
机器之心· 2025-05-20 04:58
"Three point charges +Q, -2Q, and +3Q are placed at equal distances. Which vector best describes the direction of the net electric force acting on the +Q charge?"

We can solve this problem easily by sketching a quick force diagram. Yet even advanced multimodal large language models such as GPT-4o can misapply the basic principle that like charges repel and misjudge the direction of a repulsive force (for example, judging the repulsive force of +3Q on +Q as pointing to the lower right rather than the correct upper left).

This seemingly simple physics problem exposes a fatal flaw in multimodal large models: current MLLMs still cannot perform complex multimodal reasoning that requires deep fusion of vision and text! A new study introduces the EMMA benchmark, which acts as a revealing mirror, showing that even top MLLMs fall significantly short on this key capability. The work has been accepted as a spotlight at ICML 2025, and the code and data are fully open source!

Several models and methods have already had their multimodal reasoning ability evaluated on EMMA. The study finds that even the most advanced model, Gemini-2.5-pro-exp-03-25, as well as the o3/o4-mini models that can call visual tools, still trail human experts on EMMA by more than 20%!

Title: Can MLLMs Reason in Multi ...
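To make the kind of reasoning the benchmark probes concrete, here is a small worked computation of the net Coulomb force on +Q. The geometry (an equilateral triangle with unit sides) is an assumption for illustration, not the figure used in the paper, so the resulting direction is illustrative only.

```python
# Worked example of the superposition reasoning behind the EMMA physics item:
# net force on +Q from -2Q (attractive) and +3Q (repulsive), using Coulomb's law
# F = k*q1*q2/r^2 along the line between the charges.
# Assumption: the three charges sit on an equilateral triangle with unit sides.
import numpy as np

k, Q = 1.0, 1.0  # normalized units
positions = {
    "+Q": np.array([0.0, 0.0]),
    "-2Q": np.array([1.0, 0.0]),
    "+3Q": np.array([0.5, np.sqrt(3) / 2]),
}
charges = {"+Q": Q, "-2Q": -2 * Q, "+3Q": 3 * Q}

target = "+Q"
net = np.zeros(2)
for name, pos in positions.items():
    if name == target:
        continue
    r_vec = positions[target] - pos          # vector from source charge to +Q
    r = np.linalg.norm(r_vec)
    # Positive charge product -> repulsion (pushes +Q away from the source);
    # negative product -> attraction (pulls +Q toward the source).
    net += k * charges[target] * charges[name] / r**2 * (r_vec / r)

print("net force on +Q:", net)  # direction of the resulting vector
```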
ICRA 2025 | A Generalizable Multi-Robot Long-Horizon Task Planning Framework Cracks the Task Allocation Problem: Success Rate +105%, Efficiency +36%
机器之心· 2025-05-20 04:58
In May 2025, a joint team from the University of California, Riverside (UC Riverside) and Pennsylvania State University (Penn State) presented their latest work, LaMMA-P (Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner), at ICRA 2025, a top robotics conference.

Technical highlight: fusing language models with classical planning algorithms to support long-horizon collaboration in general heterogeneous multi-robot systems.

LaMMA-P is the first to deeply integrate large language models with a PDDL planner, solving the automatic decomposition and allocation of long-horizon tasks in heterogeneous multi-robot systems and substantially raising the intelligence of collaborative multi-robot planning. Validated through extensive simulation experiments on a new benchmark dataset, the approach improves task success rate by 105% and execution efficiency by 36% over SMART-LLM, the previous state-of-the-art method, a breakthrough in complex long-horizon task planning that offers a new solution for heterogeneous multi-robot teams tackling complex tasks.

Facing complex long-horizon tasks and heterogeneous multi-robot systems, LaMMA-P pioneers the combination of the semantic understanding of large language models with the rigor of a PDDL planner, which not only addresses the limitations of traditional methods ...
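The summary above describes coupling an LLM's task decomposition with a classical PDDL planner. The sketch below shows one way such a pipeline could be wired up; the prompt format, the call_llm helper, and the planner command line are hypothetical placeholders, not LaMMA-P's actual interfaces.

```python
# Hypothetical sketch of an LLM-plus-PDDL pipeline in the spirit of LaMMA-P:
# the LLM splits a long-horizon instruction into per-robot subtasks expressed
# as PDDL problem files, and a classical planner solves each one.
# call_llm() and the "fast-downward" invocation are placeholders for whatever
# LLM API and planner an actual system would use.
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns a PDDL problem definition."""
    raise NotImplementedError("wire this to your LLM provider")

def decompose(instruction: str, robots: list[str]) -> dict[str, str]:
    """Ask the LLM for one PDDL problem per robot, keyed by robot name."""
    problems = {}
    for robot in robots:
        prompt = (
            f"Task: {instruction}\n"
            f"Robot: {robot} (capabilities known from the domain file)\n"
            "Write a PDDL problem file containing only the subgoals this robot "
            "should handle."
        )
        problems[robot] = call_llm(prompt)
    return problems

def plan(domain_file: str, problem_pddl: str) -> str:
    """Run a classical planner (placeholder command) on one subproblem."""
    with tempfile.NamedTemporaryFile("w", suffix=".pddl", delete=False) as f:
        f.write(problem_pddl)
        problem_file = f.name
    result = subprocess.run(
        ["fast-downward", domain_file, problem_file, "--search", "astar(lmcut())"],
        capture_output=True, text=True,
    )
    return result.stdout  # the plan, or planner diagnostics on failure
```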
AI-Generated Videos Keep Violating the Laws of Physics? PhyT2V, New Work from a University of Pittsburgh Team, Boosts Physical Realism 2.3x Without Retraining the Model!
机器之心· 2025-05-19 04:03
Core Viewpoint - The article discusses the advancement of Text-to-Video (T2V) generation technology, emphasizing the transition from focusing on visual quality to ensuring physical consistency and realism through the introduction of the PhyT2V framework, which enhances existing T2V models without requiring retraining or extensive external data [2][3][26].

Summary by Sections

Introduction to PhyT2V - PhyT2V is a framework developed by a research team at the University of Pittsburgh, aimed at improving the physical consistency of T2V generation by integrating large language models (LLMs) for iterative self-refinement [2][3][8].

Current State of T2V Technology - Recent T2V models, such as Sora, Pika, and CogVideoX, have shown significant progress in generating complex and realistic scenes, but they struggle with adhering to real-world physical rules and common sense [5][7].

Limitations of Existing Methods - Current methods for enhancing T2V models often rely on data-driven approaches or fixed physical categories, which limits their generalizability, especially in out-of-distribution scenarios [10][12][18].

PhyT2V Methodology - PhyT2V employs a three-step iterative process:
1. Identifying physical rules and main objects from user prompts [12].
2. Detecting semantic mismatches between generated videos and prompts using video captioning models [13].
3. Generating corrected prompts based on identified physical rules and mismatches [14][18].

Advantages of PhyT2V - PhyT2V offers several advantages over existing methods:
- It does not require any model structure modifications or additional training data, making it easy to implement [18].
- It provides a feedback loop for prompt correction based on real generated results, enhancing the optimization process [18].
- It demonstrates strong cross-domain applicability, particularly in various physical scenarios [18].

Experimental Results - The framework has been tested on multiple T2V models, showing significant improvements in physical consistency (PC) and semantic adherence (SA) scores, with the CogVideoX-5B model achieving up to 2.2 times improvement in PC and 2.3 times in SA [23][26].

Conclusion - PhyT2V represents a novel, data-independent approach to T2V generation, ensuring that generated videos comply with real-world physical principles without the need for additional model retraining, marking a significant step towards creating more realistic T2V models [26].
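Based on the three-step loop described above, here is a schematic of how the iterative prompt refinement might look in code. The llm, generate_video, and caption_video calls are hypothetical stand-ins for an LLM API, a T2V model, and a video captioning model; this is a sketch of the idea, not the PhyT2V codebase.

```python
# Schematic of a PhyT2V-style refinement loop (hypothetical interfaces):
# 1) extract physical rules and main objects from the prompt,
# 2) caption the generated video and find mismatches with the prompt,
# 3) rewrite the prompt using the rules and the mismatch report, then repeat.
def refine_prompt(llm, generate_video, caption_video, prompt: str, rounds: int = 3):
    for _ in range(rounds):
        # Step 1: physical rules and main objects implied by the prompt.
        rules = llm(f"List the main objects and the physical rules that govern "
                    f"this scene:\n{prompt}")

        # Generate a video with the current prompt and describe it.
        video = generate_video(prompt)
        caption = caption_video(video)

        # Step 2: semantic mismatch between what was asked and what was produced.
        mismatch = llm(f"Prompt: {prompt}\nVideo caption: {caption}\n"
                       f"Describe what the video gets wrong relative to the prompt.")

        # Step 3: corrected prompt encoding the rules and fixing the mismatch.
        prompt = llm(f"Rewrite the prompt so the video obeys these rules:\n{rules}\n"
                     f"and fixes these problems:\n{mismatch}\n"
                     f"Original prompt: {prompt}")
    return prompt
```

Because the loop only rewrites prompts, it can sit in front of any existing T2V model without touching its weights, which is the point the summary emphasizes.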
The AI Edifice Needs a New Foundation!
机器之心· 2025-05-19 04:03
Core Viewpoint - The article discusses the critical importance of data in the AI era, emphasizing the transition from traditional data infrastructure to an integrated data foundation that supports both AI and data processing [1][4][6].

Group 1: Importance of Data in AI
- High-quality data is becoming increasingly scarce, particularly human-generated data, while new data generated by technologies like generative AI is surging [4].
- IDC predicts that global data generation will reach 393.9 ZB by 2028, growing at an average annual rate of nearly 28% from 147 ZB in 2024 [4][5].
- The challenges posed by data fragmentation, scalability, and real-time analysis capabilities are critical for the success of AI applications [4][6].

Group 2: Evolution of Data Infrastructure
- The concept of data infrastructure is evolving from merely supporting AI to becoming an integral part of AI workflows, termed "Data×AI" [6].
- OceanBase aims to transition from a traditional database to an integrated data foundation that can handle mixed workloads and support AI applications [2][9].

Group 3: Challenges in Data Management
- Data fragmentation is a significant issue, especially in complex industries like finance and healthcare, where data is dispersed across various systems [7].
- Multi-modal data processing is complicated due to the unique structures and characteristics of different data types, necessitating advanced data alignment and synchronization capabilities [7][8].
- Evaluating data quality is increasingly difficult due to the diversity and dynamism of data sources, requiring a robust and adaptable quality assessment system [8].

Group 4: OceanBase's Strategic Direction
- OceanBase has made significant advancements in data processing capabilities, including distributed storage and multi-modal data handling [9][11].
- The company is focusing on four key areas: becoming a knowledge base, breaking down data silos, serving as a reliable AI advisor, and managing traffic fluctuations effectively [14].
- OceanBase has introduced a new RAG service, PowerRAG, which streamlines the process of identifying, segmenting, and embedding documents for AI applications [17][20].

Group 5: Market Position and Future Outlook
- OceanBase has established itself as a leading open-source database, with over a million downloads and more than 50,000 deployments [21].
- The company is confident in its "Data×AI" strategy, believing that those who can effectively integrate data and AI will become the foundational data providers in the AI era [24][25].
- The database industry is evolving alongside AI, with OceanBase positioning itself to support the next generation of data infrastructure [26].
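The document flow attributed to PowerRAG in Group 4 (identify, segment, and embed documents) follows the standard RAG pattern. Below is a generic Python sketch of that pattern, using an in-memory index and a sentence-transformers embedder for illustration; it is not OceanBase's or PowerRAG's API.

```python
# Generic RAG indexing/retrieval sketch (not the PowerRAG API):
# split documents into chunks, embed them, and retrieve the chunks closest
# to a query embedding.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap, the simplest segmentation."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(vectors)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3):
    q = embedder.encode(query, normalize_embeddings=True)
    scores = vectors @ q                      # cosine similarity (normalized vectors)
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]

chunks, vectors = build_index(["OceanBase is a distributed database ..."])
print(retrieve("what kind of database is OceanBase?", chunks, vectors))
```

In a production setting the in-memory arrays would be replaced by the database's own vector storage and indexing, which is where an integrated "Data×AI" foundation fits in.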
Index-AniSora: Bilibili Open-Sources an Animation Generation Model, Scoring Multiple SOTA Results and Acceptance at IJCAI 2025
机器之心· 2025-05-19 04:03
Bilibili has open-sourced the animation video generation model Index-AniSora, which supports one-click generation of shots in a wide range of anime styles, including TV anime, Chinese original animation, manga adaptations, VTuber content, anime PVs, and guichu (parody) videos!

Prompt: the character in the frame raises an arm, and the gas on the arm flows. Guide frame: first frame.

The work is built on AniSora, proposed by Bilibili, and has been accepted at IJCAI 2025. The proposed AniSora system is the first technical framework built specifically for anime video generation, comprehensively improving the efficiency and quality of animation content production. Turn a favorite manga into animation with one click, with support for many niche art styles and richer results; say goodbye to "PPT-style animation".

Prompt: a person runs forward quickly, so fast that the figure blurs. Guide frame: first frame.

Paper title: AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era
Paper: https://arxiv.org/abs/2412.10255
Project page: https://github.com/bilibili/Index-anisora

Prompt: the man on the left presses his lips tightly together, his face etched with anger and determination. His ...
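The examples above pair a text prompt with a guide first frame, i.e. image-conditioned video generation. A minimal sketch of that calling pattern is below; the pipeline object and its arguments are hypothetical placeholders, since the actual interface is defined in the bilibili/Index-anisora repository linked above.

```python
# Hypothetical calling pattern for first-frame-guided anime video generation.
# The pipeline argument and its keyword names are placeholders; consult the
# bilibili/Index-anisora repository for the real interface.
from PIL import Image

def generate_clip(pipeline, prompt: str, first_frame_path: str, num_frames: int = 49):
    first_frame = Image.open(first_frame_path).convert("RGB")
    # The prompt describes the motion; the first frame fixes character and style.
    return pipeline(prompt=prompt, image=first_frame, num_frames=num_frames)

# Example usage with one of the prompts from the article:
# video = generate_clip(pipeline,
#                       "the character raises an arm and the gas on it flows",
#                       "guide_frame.png")
```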
"AI Hackers" Are Coming: How Does Agentic AI Become the New Guardian?
机器之心· 2025-05-19 02:36
Core Viewpoint - The rapid development of AI technology has led to increasingly complex threats in cybersecurity, giving rise to new forms of attacks such as AI-driven phishing and deepfake scams, necessitating a shift towards AI-based defense mechanisms [2][3][4][24].

Group 1: AI-Driven Cybersecurity Threats
- Generative AI is reshaping the precision of online scams, enabling attackers to create personalized phishing emails by training AI models on publicly available social data, significantly increasing the success rate of attacks [4].
- Deepfake technology has advanced to the point where attackers can impersonate individuals in video calls, leading to significant financial losses, as demonstrated by a case where a financial officer was tricked into transferring 3.8 million yuan [4].
- Automated attacks and vulnerability exploitation have become more prevalent, with AI enabling rapid scanning of system vulnerabilities and executing zero-day attacks, as evidenced by a massive DDoS attack that caused millions in losses [5].

Group 2: AI in Cyber Defense
- The industry consensus is shifting towards using AI to combat AI-driven threats, marking a transition in security paradigms [7].
- Current defensive strategies can be categorized into three main areas: AI model security enhancement, industry-specific defensive applications, and macro-level government and international collaboration [8].
- AI model security focuses on strengthening the inherent safety of models, with companies like Anthropic developing classifiers to prevent AI from generating harmful content [9].

Group 3: Industry Applications and Innovations
- Industry-specific applications are emerging, such as financial institutions utilizing AI risk control models to build anti-fraud barriers and open-source ecosystems employing intelligent vulnerability hunting technologies for rapid threat response [10].
- Companies like Cisco are showcasing solutions that can intercept sensitive data queries in real-time, enhancing compliance and management [10].
- The introduction of AI security assistants, such as Microsoft's Security Copilot, demonstrates the potential for AI to assist security teams in detecting and responding to threats more efficiently [13].

Group 4: Advanced AI Security Solutions
- The "Wuxiang" security AI product represents a significant advancement, transitioning from passive response to autonomous decision-making in threat detection and response [15][25].
- This system employs a dual-engine architecture to ensure dynamic correction capabilities during complex tasks, significantly reducing response times from days to minutes [16][22].
- The ability of "Wuxiang" to autonomously analyze alerts and generate comprehensive attack reports showcases its effectiveness in enhancing operational efficiency and accuracy in cybersecurity [17][23].

Group 5: Future of Cybersecurity
- The evolution of AI technology presents dual challenges, with attackers leveraging AI for automated and personalized attacks while defenders must innovate to enhance detection and response capabilities [24].
- The emergence of high-level AI security systems is expected to fundamentally reshape the cybersecurity landscape, emphasizing the need for organizations to seize this opportunity for transformation [27].
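Group 4 describes an agentic security system that autonomously triages alerts and drafts attack reports. A schematic of such an agent loop is sketched below; the tool names and the split into a planner engine and a reviewer engine are illustrative assumptions, not the actual architecture of the "Wuxiang" product.

```python
# Schematic agentic alert-triage loop (illustrative, not the "Wuxiang" product):
# a planner engine decides the next investigative action, tools gather evidence,
# and a reviewer engine checks and corrects the plan before each step.
def triage_alert(planner_llm, reviewer_llm, tools: dict, alert: dict, max_steps: int = 8):
    evidence = [f"alert: {alert}"]
    for _ in range(max_steps):
        # Planner proposes the next action, e.g. "query_logs host-17".
        action = planner_llm("Given the evidence so far, name the next tool call "
                             "or say FINISH:\n" + "\n".join(evidence))
        # Reviewer provides the dynamic-correction pass over the proposed action.
        action = reviewer_llm(f"Check this action for mistakes and fix it if "
                              f"needed, otherwise repeat it verbatim:\n{action}")
        if action.strip() == "FINISH":
            break
        tool_name, _, arg = action.partition(" ")
        result = tools[tool_name](arg) if tool_name in tools else f"unknown tool {tool_name}"
        evidence.append(f"{action} -> {result}")
    # Final product: a human-readable incident report assembled from the evidence.
    return planner_llm("Write an attack report from this evidence:\n" + "\n".join(evidence))
```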
Just In! Peking University Alumna Lilian Weng's Latest Blog Post: Why We Think
机器之心· 2025-05-18 04:25
Core Insights - The article discusses advancements in utilizing "thinking time" during model inference, aiming to enhance the reasoning capabilities of AI models like GPT, Claude, and Gemini [2][3][16].

Group 1: Thinking Mechanisms
- The concept of "thinking time" is analogous to human cognitive processes, where complex problems require reflection and analysis before arriving at a solution [6].
- Daniel Kahneman's dual process theory categorizes human thinking into fast (System 1) and slow (System 2) modes, emphasizing the importance of slower, more deliberate thought for accurate decision-making [12].

Group 2: Computational Resources
- In deep learning, neural networks can be characterized by the computational and storage resources they utilize during each forward pass, impacting their performance [8].
- The efficiency of models can be improved by allowing them to perform more computations during inference, particularly through strategies like Chain of Thought (CoT) prompting [8][18].

Group 3: Chain of Thought (CoT) and Learning Strategies
- CoT prompting significantly enhances the success rate of solving mathematical problems, with larger models benefiting more from extended "thinking time" [16].
- Early research focused on supervised learning from human-written reasoning paths, evolving into reinforcement learning strategies that improve CoT reasoning capabilities [14][41].

Group 4: Test-Time Computation Strategies
- Two main strategies for improving generation quality are parallel sampling and sequential revision, each with distinct advantages and challenges [19][20].
- Parallel sampling is straightforward but relies on the model's ability to generate correct answers in one go, while sequential revision allows for targeted corrections but is slower [20][21].

Group 5: Reinforcement Learning Applications
- Recent studies have successfully employed reinforcement learning to enhance reasoning capabilities in language models, particularly in STEM-related tasks [41][46].
- The training process often involves a cold-start phase followed by reasoning-oriented reinforcement learning, optimizing performance through structured feedback [42][43].

Group 6: External Tools and Integration
- Utilizing external tools, such as code interpreters or APIs, can enhance the reasoning process by offloading certain computational tasks [52][56].
- The ReAct method combines external operations with reasoning trajectories, allowing models to incorporate external knowledge into their inference paths [56][57].

Group 7: Model Interpretability and Trustworthiness
- The article highlights the importance of model interpretability, particularly through CoT, which allows for monitoring and understanding model behavior [59].
- However, there are concerns regarding the fidelity of CoT outputs, as biases and errors can affect the reliability of the reasoning process [62][64].

Group 8: Adaptive Computation and Token Utilization
- Adaptive computation time allows models to dynamically adjust the number of computation steps during inference, enhancing their reasoning capabilities [81].
- Introducing special tokens, such as thinking tokens, can provide additional processing time and improve model performance on complex tasks [85][89].
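Group 4 contrasts parallel sampling with sequential revision as test-time computation strategies. The sketch below shows both in schematic form; sample, score, and revise are hypothetical stand-ins for a model's sampling call, a verifier or reward model, and a self-revision prompt.

```python
# Two test-time compute strategies from the post, in schematic form.
# sample(), score(), and revise() are placeholders for a model sampling call,
# a verifier/reward model, and a self-revision prompt respectively.

def parallel_sampling(sample, score, prompt: str, n: int = 8) -> str:
    """Best-of-N: draw N independent answers and keep the highest-scoring one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

def sequential_revision(sample, revise, prompt: str, rounds: int = 3) -> str:
    """Draft once, then repeatedly ask the model to correct its own answer."""
    answer = sample(prompt)
    for _ in range(rounds):
        answer = revise(prompt, answer)   # targeted correction of the last draft
    return answer
```

The trade-off noted in the summary is visible in the structure: the parallel variant needs no iteration but each sample must already be good, while the sequential variant can fix specific mistakes at the cost of serial latency.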
ICML 2025 Spotlight | Studying Adversarial Image Perturbations with Fourier Decomposition, Code Now Open Source
机器之心· 2025-05-18 04:25
The authors of this paper are from the University of Chinese Academy of Sciences and the Institute of Computing Technology, Chinese Academy of Sciences. The first author, Gaozheng Pei, is a second-year PhD student at the University of Chinese Academy of Sciences; the co-corresponding authors are Associate Professor Ke Ma and Professor Qingming Huang of the University of Chinese Academy of Sciences.

Adversarial purification aims to restore adversarial images to their original clean versions at test time. Existing diffusion-based purification strategies try to drown the adversarial perturbation in isotropic noise through the forward process and then recover the clean image through the reverse process. However, these strategies cannot decouple clean pixels from adversarial perturbations in the time domain (i.e., pixel space), so destroying the perturbation inevitably damages the semantic information of the original clean image as well.

This paper therefore moves from the time domain to the frequency domain. Specifically, it uses Fourier decomposition to split an image into an amplitude spectrum and a phase spectrum and examines how adversarial perturbations are distributed: the results show that adversarial perturbations tend to corrupt the high-frequency parts of both the amplitude and phase spectra. Based on this observation, the paper proposes injecting the low-frequency information of the original sample as a prior during the reverse process of the diffusion model to guide the generation of the clean sample. This approach not only removes the adversarial perturbation effectively but also largely preserves the semantic content and structural information of the original image, keeping the purified image as semantically similar to the clean sample as possible.

Paper title: Diffusion-based Adversarial Purification from the Perspective of the F ...
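The frequency-domain analysis described above rests on splitting an image into amplitude and phase spectra and keeping only a low-frequency band as a prior. A minimal NumPy sketch of that decomposition follows; the circular cutoff and the way the prior would be injected into the diffusion reverse process are simplifications, not the paper's exact method.

```python
# Fourier decomposition of an image into amplitude and phase spectra, plus a
# low-frequency extraction of the kind used as a prior for purification.
# The circular low-pass mask and cutoff value are illustrative simplifications.
import numpy as np

def amplitude_phase(img: np.ndarray):
    """2D FFT of a grayscale image, returned as (amplitude, phase) spectra."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    return np.abs(spec), np.angle(spec)

def low_frequency_component(img: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Keep only frequencies within `cutoff` * image radius of the DC term."""
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = dist <= cutoff * min(h, w) / 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

img = np.random.rand(64, 64)            # stand-in for a clean input image
amp, phase = amplitude_phase(img)
prior = low_frequency_component(img)    # low-frequency prior for the reverse process
```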
ICML 2025 | How Can "Autocomplete" Deliver a 3x Speedup for 100K-Token Generation?
机器之心· 2025-05-18 04:25
Core Viewpoint - The article discusses the challenges of generating ultra-long texts in the era of complex large models and introduces TokenSwift, a new inference acceleration framework that significantly improves efficiency while maintaining output quality [1][27][29].

Group 1: Challenges in Long Text Generation
- Traditional autoregressive methods generate one token at a time, leading to performance degradation as sequence lengths increase to 100,000 tokens or more [4][5].
- The main bottlenecks include model redundancy, KV cache inflation, and semantic repetition, which hinder the efficiency and diversity of generated outputs [9][19].

Group 2: TokenSwift Framework
- TokenSwift proposes a lightweight and efficient framework that restructures traditional autoregressive inference by introducing a mechanism based on multi-token drafting, parallel validation, and dynamic cache updates [7][11].
- The framework allows for the parallel generation of multiple candidate tokens, significantly reducing model reload frequency and I/O time while ensuring semantic relevance [12][17].

Group 3: Key Technical Innovations
- The n-gram heuristic completion mechanism utilizes historical fragments to enhance the accuracy of token drafting, ensuring high semantic relevance [14].
- A tree-structured parallel validation module assesses the drafted tokens against standard autoregressive paths, ensuring lossless output quality [15][17].
- Dynamic KV management and repetition penalties are implemented to mitigate cache inflation and enhance output diversity, respectively [19][26].

Group 4: Performance Evaluation
- Extensive experiments on various mainstream models demonstrate that TokenSwift achieves acceleration ratios exceeding 3 times while maintaining output quality consistent with original models [21][22].
- The acceleration effect becomes more pronounced with longer sequences, reducing generation time for 100K token tasks from nearly 5 hours to 1.5 hours [22].

Group 5: Conclusion and Future Implications
- TokenSwift is not a new model but a universal acceleration strategy that can be integrated into existing models like LLaMA and Qwen, offering strong compatibility and deployment convenience [28].
- The framework's lossless guarantee for inference quality positions it as a robust technical support for future applications in multi-turn reasoning, code generation, and agent planning [29].
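The draft-then-verify mechanism summarized in Group 3 (n-gram based drafting followed by validation against the standard autoregressive path) can be sketched as follows. The model_next_token call and the simple n-gram table are illustrative stand-ins, not the TokenSwift implementation, and the verification here is shown token by token for clarity, whereas a real system would validate all drafted positions in a single parallel forward pass.

```python
# Schematic draft-and-verify decoding loop in the spirit of TokenSwift:
# propose several tokens at once from an n-gram table built over the text so
# far, then accept only the prefix that the model itself would have produced.
# model_next_token() is a placeholder for a real greedy decoding step; a real
# implementation validates all drafted positions in one parallel forward pass.
def ngram_draft(tokens: list[int], table: dict, k: int = 4) -> list[int]:
    """Propose up to k tokens by repeatedly following the n-gram table."""
    draft = []
    context = tuple(tokens[-3:])
    for _ in range(k):
        nxt = table.get(context)
        if nxt is None:
            break
        draft.append(nxt)
        context = (*context[1:], nxt)
    return draft

def verify(model_next_token, tokens: list[int], draft: list[int]) -> list[int]:
    """Accept drafted tokens only while they match the model's own next token."""
    accepted = []
    for tok in draft:
        expected = model_next_token(tokens + accepted)
        if tok != expected:
            break
        accepted.append(tok)
    # Always make progress: if nothing matched, fall back to one model step.
    if not accepted:
        accepted.append(model_next_token(tokens))
    return accepted
```

Because every accepted token is one the base model would have emitted anyway, the output distribution is unchanged, which is what the "lossless" claim in the summary refers to.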