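AAAI 2026 | University of Electronic Science and Technology of China Proposes OWL: Mitigating Object Hallucination in Multimodal Large Models via Dual-Path Attention Intervention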
机器之心 (Jiqizhixin) · 2025-11-28 08:05
Core Insights
- The article discusses growing interest in mitigating object hallucination in large vision-language models (LVLMs) and introduces Owl, a framework that uses causal dual-path attention intervention to address the problem [2][4].
Group 1: Problem Identification
- Existing methods focus on either visual or textual attention in isolation, overlooking the critical imbalance in cross-modal attention interaction [5].
- Cross-modal dependencies during decoding are not quantified, so existing interventions are coarse-grained and lack theoretical guidance [5].
Group 2: Proposed Solution
- The paper builds a structural causal model that treats the decomposed visual and textual attention as key mediating variables and shows how confounding factors distort attention and lead to hallucinations [4].
- It proposes a new metric, VTACR, that quantifies the model's dependence on the visual and textual modalities at each decoding layer, providing a measurable signal for fine-grained attention intervention [7].
Group 3: Methodology
- Owl uses dual-path attention intervention: it constructs a visual-enhancement path and a textual-enhancement path and combines them with a contrastive decoding strategy to dynamically correct attention bias [8][10].
- At inference time, the framework splits the language decoder's attention weights into visual and textual components and adjusts them according to the VTACR distribution, strengthening attention to image tokens while damping the influence of the textual history [10].
Group 4: Experimental Results
- Owl was evaluated on three representative LVLMs (LLaVA-1.5, MiniGPT-4, and Shikra) and compared against a range of baseline methods [12].
- On the CHAIR benchmark with LLaVA-1.5, Owl reduced sentence-level hallucination by 17.6% and instance-level hallucination by 21.4% while producing longer outputs, indicating that it mitigates hallucination without sacrificing content richness [13].
- On five visual question answering (VQA) tasks, Owl achieved comparable or better performance, including a 7.6% gain on VizWiz, suggesting that it may also improve the model's understanding of complex visual scenes [14].
- GPT-4V-assisted evaluations showed that, for LLaVA-1.5, correctness improved by 20.1% and detailedness by 11.3%, indicating that the generated content is both more faithful to the images and richer in information [16].
Group 5: Visual Evidence
- The paper presents typical hallucination cases in which Owl suppresses the errors, keeping the generated output consistent with the actual image content [18].
- Visualizations show Owl acting like a precise editor during generation, suppressing "hallucination words" and prioritizing "correct words" [18][19].
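To make the dual-path idea concrete, here is a minimal single-layer sketch in Python. It is an illustration under stated assumptions, not the paper's code: VTACR is assumed to be the ratio of attention mass on image tokens to attention mass on text tokens; the helper names (`vtacr`, `reweight`, `toy_logits`, `owl_style_step`), the boost factor `gamma`, the contrast weight `alpha`, and the gating `threshold` are all hypothetical; and a toy weighted-sum readout stands in for the transformer decoder. The visual path boosts image-token attention, the textual path boosts text-token attention, and the two logit vectors are combined in the common contrastive-decoding form (1 + alpha) * vis - alpha * txt.

```python
import math
from typing import List, Sequence

Vector = List[float]


def vtacr(attn: Sequence[float], is_image: Sequence[bool]) -> float:
    """Assumed VTACR: attention mass on image tokens divided by mass on text tokens."""
    vis = sum(a for a, img in zip(attn, is_image) if img)
    txt = sum(a for a, img in zip(attn, is_image) if not img)
    return vis / txt if txt > 0 else math.inf


def reweight(attn: Sequence[float], is_image: Sequence[bool],
             image_scale: float, text_scale: float) -> Vector:
    """Scale the visual and textual components separately, then renormalize to sum to 1."""
    scaled = [a * (image_scale if img else text_scale) for a, img in zip(attn, is_image)]
    total = sum(scaled)
    return [s / total for s in scaled]


def toy_logits(attn: Sequence[float], values: Sequence[Sequence[float]]) -> Vector:
    """Toy stand-in for the decoder: logits are the attention-weighted sum of value rows."""
    vocab_size = len(values[0])
    return [sum(a * v[k] for a, v in zip(attn, values)) for k in range(vocab_size)]


def owl_style_step(attn, is_image, values, gamma=2.0, alpha=1.0, threshold=1.0) -> Vector:
    """Hypothetical dual-path intervention with contrastive decoding (not the paper's code)."""
    base = toy_logits(attn, values)
    if vtacr(attn, is_image) >= threshold:
        return base  # assumed gating: intervene only when visual reliance is low
    # Visual-enhancement path: boost attention to image tokens by gamma.
    vis_path = toy_logits(reweight(attn, is_image, gamma, 1.0), values)
    # Textual-enhancement path: boost attention to text tokens by gamma.
    txt_path = toy_logits(reweight(attn, is_image, 1.0, gamma), values)
    # Contrastive decoding: amplify the visual path and subtract the text-biased path.
    return [(1 + alpha) * v - alpha * t for v, t in zip(vis_path, txt_path)]


if __name__ == "__main__":
    # Two image tokens and two text tokens; vocabulary index 0 is grounded in the image,
    # index 1 is favored by the textual history (illustrative numbers only).
    attn = [0.1, 0.1, 0.4, 0.4]
    is_image = [True, True, False, False]
    values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    base = toy_logits(attn, values)
    out = owl_style_step(attn, is_image, values)
    print(vtacr(attn, is_image))                       # 0.25
    print(base.index(max(base)), out.index(max(out)))  # 1 0
```

In this toy run the text tokens dominate (VTACR = 0.25), so the plain readout prefers the text-supported token, while the contrastive combination switches to the image-supported one. A real implementation would operate per layer and per attention head inside the decoder, and the paper's exact VTACR definition and scaling rule may differ from the assumptions here.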
FutureHouse Co-founder: AI Scientist Is Not "Fully Automated Research"
海外独角兽 (Overseas Unicorn) · 2025-06-26 12:25
Group 1
- FutureHouse is an AI lab focused on "AI for Science," aiming to build AI systems that can autonomously pose questions, plan experiments, and iterate on hypotheses [3][4][5].
- The lab has launched four AI research agents: Crow (general-purpose), Falcon (automated literature review), Owl (research agent), and Phoenix (experimental agent); they can access full-text scientific literature and assess the quality of information [3][4].
- FutureHouse's approach centers on scientific automation: turning laboratories into "black box laboratories" and building a software pipeline for research [4][5].
Group 2
- FutureHouse is building a research API and focuses on automating scientific research through non-traditional mechanisms [19][22].
- The founders aim to tackle "moonshot" challenges that require sustained investment and a commercial strategy, with AI-driven scientific automation at the core [22][23].
- The ChemCrow project combines language models with tools to carry out a complete scientific discovery process, demonstrating the value of scientific literature [23][24].
Group 3
- FutureHouse's research agents are built with a clear separation between agents and environments, and memory is integrated into the agents to improve performance [29][30].
- Agents interact with their environments through language, observations, and actions, so different agents and environments can be combined flexibly [29][30].
- Full-text search and filtering for relevant information are crucial to the research agents' performance [32][33].
Group 4
- FutureHouse believes AI will not fully replace humans in scientific research and argues for a semi-autonomous approach [46][47].
- Because biological systems are complex, human oversight is required: AI cannot conduct experiments independently without human-defined frameworks [47][48].
- The lab is exploring modular approaches to drug discovery and literature research that integrate human researchers into the scientific process [51].
Group 5
- AI technologies such as AlphaFold and ESM-3 are expected to significantly improve experimental efficiency, potentially raising hit rates tenfold or more [53].
- Integrating computational predictions with experimental validation is becoming increasingly important in biological research [53][54].
- Despite these advances, the complexity of biological systems means that experimental measurement remains the most reliable way to understand biological mechanisms [55][56].
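The agent/environment separation described for FutureHouse's research agents can be illustrated with a small Python sketch. This is not FutureHouse's code or API: the names `Environment`, `KeywordSearchEnv`, `MemoryAgent`, and `rollout`, and the keyword-search task, are hypothetical. The sketch only shows the two properties the summary mentions: memory lives inside the agent, and agent and environment exchange nothing but text actions and observations, so either side can be swapped for another implementation of the same interface.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class Environment(Protocol):
    """Any environment exposing these two methods can be paired with any agent."""
    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]: ...


@dataclass
class KeywordSearchEnv:
    """Hypothetical toy environment: keyword search over a tiny in-memory corpus."""
    keyword: str
    corpus: List[str]

    def reset(self) -> str:
        return f"Task: find a document mentioning '{self.keyword}'."

    def step(self, action: str) -> Tuple[str, bool]:
        # Actions are plain text; only "search: <term>" is understood here.
        if action.startswith("search:"):
            term = action[len("search:"):].strip().lower()
            hits = [doc for doc in self.corpus if term in doc.lower()]
            return f"hits: {hits}", bool(hits)
        return "unknown action", False


@dataclass
class MemoryAgent:
    """Hypothetical toy agent; its memory is stored in the agent, not the environment."""
    memory: List[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        self.memory.append(observation)
        # Recover the target keyword from the first remembered observation (the task).
        term = self.memory[0].split("'")[1]
        return f"search: {term}"


def rollout(agent: MemoryAgent, env: Environment, max_steps: int = 5) -> str:
    """Generic loop: observation -> action -> observation, until the env reports done."""
    obs = env.reset()
    for _ in range(max_steps):
        obs, done = env.step(agent.act(obs))
        if done:
            agent.memory.append(obs)
            break
    return obs


if __name__ == "__main__":
    env = KeywordSearchEnv(
        keyword="hallucination",
        corpus=["Dual-path attention reduces object hallucination", "Protein structure prediction"],
    )
    agent = MemoryAgent()
    print(rollout(agent, env))
```

Because `rollout` depends only on `reset` and `step`, a different environment (for example, one backed by a real full-text search service) could replace the toy one without any change to the agent, which is the kind of flexible pairing the interview describes.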
Tencent Research Institute AI Express 20250506
腾讯研究院 (Tencent Research Institute) · 2025-05-05 10:05
Group 1: Generative AI Developments
- DeepSeek released DeepSeek-Prover-V2 in 671B and 7B sizes; it strengthens mathematical reasoning through recursive problem decomposition and reinforcement learning and sets multiple new records [1].
- Anthropic introduced new integration features for Claude that connect it to popular applications such as Jira, and expanded its research capabilities [1].
- Google's NotebookLM now generates podcast-style audio in 50 languages, with local accents and source tracing for the content [2].
Group 2: Competitive AI Applications
- Meta released a standalone AI app to compete with ChatGPT; it uses users' social data to personalize its service and integrates with Meta's social product ecosystem [3].
- Apple is partnering with Anthropic on a "vibe coding" software platform, built on the Claude Sonnet model, for internal code writing [4].
- Midjourney launched Omni-Reference, which keeps characters and objects highly consistent across generations from a single reference image [5].
Group 3: Advanced AI Research and Risks
- FutureHouse introduced four AI research agents that outperform leading models and human PhDs in literature-search accuracy, improving research efficiency [6].
- MIT research estimates a greater than 90% probability of losing control of AI even under ideal oversight mechanisms, underscoring the difficulty of managing superintelligent AI [7].
- Physical Intelligence stresses that collecting diverse robot data is essential for reliable real-world operation and expects future robots to take many different forms [8].