Workflow
Gemini Deep Research
icon
Search documents
当Search Agent遇上不靠谱搜索结果,清华团队祭出自动化红队框架SafeSearch
机器之心· 2025-10-16 07:34
Core Insights - The article discusses the vulnerabilities of large language model (LLM)-based search agents, emphasizing that while they can access real-time information, they are susceptible to unreliable web sources, which can lead to the generation of unsafe outputs [2][7][26]. Group 1: Search Agent Vulnerabilities - A real-world case is presented where a developer lost $2,500 due to a search error involving unreliable code from a low-quality GitHub page, highlighting the risks associated with trusting search results [4]. - The research identifies that 4.3% of nearly 9,000 search results from Google were deemed suspicious, indicating a prevalence of low-quality websites in search results [11]. - The study reveals that search agents are not as robust as expected, with a significant percentage of unsafe outputs generated when exposed to unreliable search results [12][26]. Group 2: SafeSearch Framework - The SafeSearch framework is introduced as a method for automated red-teaming to assess the safety of LLM-based search agents, focusing on five types of risks including harmful outputs and misinformation [14][21]. - The framework employs a multi-stage testing process to generate high-quality test cases, ensuring comprehensive coverage of potential risks [16][19]. - SafeSearch aims to enhance transparency in the development of search agents by providing a quantifiable and scalable safety assessment tool [37]. Group 3: Evaluation and Results - The evaluation of various search agent architectures revealed that the impact of unreliable search results varies significantly, with the GPT-4.1-mini model showing a 90.5% susceptibility in a search workflow scenario [26][36]. - Different LLMs exhibit varying levels of resilience against risks, with GPT-5 and GPT-5-mini demonstrating superior robustness compared to others [26][27]. - The study concludes that effective filtering methods can significantly reduce the attack success rate (ASR), although they cannot eliminate risks entirely [36][37]. Group 4: Implications and Future Directions - The findings underscore the importance of systematic evaluation in ensuring the safety of search agents, as they are easily influenced by low-quality web content [37]. - The article suggests that the design of search agent architectures can significantly affect their security, advocating for a balance between performance and safety in future developments [36][37]. - The research team hopes that SafeSearch will become a standardized tool for assessing the safety of search agents, facilitating their evolution in both performance and security [37].
Alibaba-backed Moonshot releases new Kimi AI model that beats ChatGPT, Claude in coding — and it costs less
CNBC· 2025-07-14 07:30
Core Insights - The latest Chinese generative AI model, Kimi K2, has been launched as a low-cost, open-source alternative to OpenAI's ChatGPT, focusing on coding capabilities [2][6][11] - Kimi K2 claims to outperform OpenAI's GPT-4.1 and Anthropic's Claude Opus 4 in coding benchmarks, making it a competitive player in the AI market [6][14] - The model is available for free and offers significantly lower token costs compared to its competitors, making it attractive for budget-sensitive deployments [7][8] Company Developments - Moonshot, the Alibaba-backed startup, released Kimi K2, which is positioned as a disruptive force in the AI industry [2][11] - The company has previously open-sourced other AI models and has gained popularity as an alternative to ChatGPT in China [11] - Initial reviews of Kimi K2 have been positive, although some users reported issues with hallucinations, a common problem in generative AI [10] Market Context - The launch of Kimi K2 comes amid increasing competition in the AI space, particularly from Chinese companies like ByteDance and Tencent, as well as Baidu's revamped AI tools [11] - OpenAI's delay in releasing its first open-source model has created an opportunity for competitors like Moonshot to gain traction [3][13] - The AI market is witnessing a shift towards open-source models, with Kimi K2 being a notable example of this trend [2][6]
X @Demis Hassabis
Demis Hassabis· 2025-07-02 13:10
AI Model Performance - Gemini Deep Research, upgraded to 2.5 Pro, demonstrates significant improvement in performance [1] - Gemini excels at synthesizing coherent overviews of complex topics, surpassing Claude and ChatGPT in this aspect [1] Competitive Analysis - Claude and ChatGPT perform well in acting as analysts and constructing arguments [1] - Gemini is considered the best at providing a coherent overview of complex topics compared to Claude and ChatGPT [1]
北大发布学术搜索评测ScholarSearch:难倒一众DeepResearch的“开卷考试”
量子位· 2025-06-26 14:11
Core Viewpoint - The article discusses the limitations of current large language models (LLMs) in academic research, highlighting the need for improved information retrieval capabilities and the introduction of the ScholarSearch dataset by Peking University to evaluate these models [1][15]. Group 1: ScholarSearch Dataset - ScholarSearch is the first dataset specifically designed to assess the complex information retrieval capabilities of LLMs in academic research, containing 223 challenging academic search questions and their answers [1][5]. - The dataset aims to provide a comprehensive and rigorous evaluation of LLMs' retrieval, information integration, and reasoning abilities [5][12]. - All questions in ScholarSearch are derived from real academic research scenarios, ensuring that the evaluation reflects the actual challenges faced by researchers [11]. Group 2: Evaluation Results - The evaluation results indicate that existing models perform poorly in academic search tasks, with top pure reasoning models like GPT-4.1 and DeepSeek-R1 achieving an accuracy rate below 9% [1][15]. - Models with browsing capabilities show significant improvements in accuracy; for instance, GPT-4o-mini's accuracy increased by over four times compared to its non-searching version [2][15]. - Despite improvements, even the most advanced search-enhanced models, such as GPT-4o-search-preview, only achieve an accuracy of 18.83%, indicating a gap in their ability to handle complex academic inquiries [3][16]. Group 3: Methodology and Standards - The methodology for creating the ScholarSearch dataset involved rigorous screening to ensure that questions could not be answered correctly by existing models without extensive information retrieval [6][7]. - A dual negative screening standard was applied to ensure that questions required deep and broad information retrieval capabilities, thus maintaining the dataset's challenge level [6][8]. - The dataset covers a wide range of disciplines, including both science and engineering as well as social sciences and humanities, ensuring comprehensive evaluation [12].