o4 mini

Large Models Hit Truly Hard Problems: 500 Questions Tested, o3 Pro Passes Only 15%
机器之心· 2025-09-14 03:07
Core Insights
- The article discusses the development of a new benchmark called UQ (Unsolved Questions) to evaluate the capabilities of large language models, focusing on unsolved problems that reflect real-world challenges [2][3][5]
- UQ consists of 500 challenging questions sourced from the Stack Exchange community, designed to assess the reasoning, factual accuracy, and browsing capabilities of models [3][8]
- The study highlights the limitations of existing benchmarks, which often prioritize difficulty over real-world applicability, and proposes a continuous evaluation method through community validation [1][5]
Group 1
- UQ is a test set of 500 unsolved questions covering various topics, including computer science, mathematics, and history, aimed at evaluating model performance in a realistic context [3][8]
- The selection process for UQ involved multiple filtering stages, reducing an initial pool of approximately 3 million questions to 500 through rule-based, model-based, and manual reviews [10][11]
- The best-performing model in the UQ validation passed on only 15% of the questions, indicating the high difficulty level of the benchmark [5][7]
Group 2
- The UQ validation process employs a composite verification strategy that leverages the strengths of different models to assess candidate answers without requiring standard answers [14][26]
- The study found that using a composite validator significantly reduces self-bias and over-optimism in model evaluations, a common issue when models assess their own performance [24][25][26]
- Results showed that a stronger answer-generation model is not necessarily a better answer validator, highlighting the complexity of model capabilities [27][28]
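The summary does not spell out how the composite validator combines model verdicts, so the following is only a minimal sketch of one plausible form: several independent validators vote on a candidate answer, and the answer is accepted only if enough of them agree. The function `composite_accept`, the `min_votes` parameter, and the three toy judges are hypothetical names introduced here for illustration; in practice each judge would wrap a call to a different model rather than a string check.

```python
from typing import Callable, Optional, Sequence

# A validator receives (question, candidate_answer) and returns True to accept.
Validator = Callable[[str, str], bool]


def composite_accept(
    question: str,
    answer: str,
    validators: Sequence[Validator],
    min_votes: Optional[int] = None,
) -> bool:
    """Accept an answer only if enough validators independently accept it.

    With min_votes=None, every validator must agree (unanimous voting),
    which is the most conservative setting against over-optimistic judges.
    """
    if not validators:
        raise ValueError("at least one validator is required")
    required = len(validators) if min_votes is None else min_votes
    votes = sum(1 for validate in validators if validate(question, answer))
    return votes >= required


# Hypothetical stand-ins for model-backed judges (assumptions, not UQ's validators).
def lenient_judge(question: str, answer: str) -> bool:
    return True  # mimics an over-optimistic judge that accepts everything


def nonempty_judge(question: str, answer: str) -> bool:
    return bool(answer.strip())  # rejects blank answers


def strict_judge(question: str, answer: str) -> bool:
    return "because" in answer.lower()  # crude proxy for "gives a justification"


judges = [lenient_judge, nonempty_judge, strict_judge]
unanimous = composite_accept("Why is the sky blue?", "It just is.", judges)
majority = composite_accept("Why is the sky blue?", "It just is.", judges, min_votes=2)
print(unanimous, majority)  # prints: False True
```

Requiring unanimity lets a single strict judge veto an answer that a lenient judge would wave through, which is one way a composite scheme can dampen the over-optimism described above; lowering `min_votes` trades that strictness for recall.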
Gilat Becomes First to Market with AI-Powered Network Management System
Globenewswire· 2025-09-11 11:01
Core Insights
- Gilat Satellite Networks Ltd. has announced the AI transformation of its Network Management System (NMS) by integrating Model Context Protocol (MCP), with new AI capabilities available immediately [1][2]
Group 1: AI Integration and Capabilities
- The new NMS-MCP acts as a gateway between the NMS and AI agents, supporting authentication, licensing, and secure communication to ensure compliance and operational integrity [2]
- AI models from the GPT Series 4, 5, and 5 mini, as well as o3, o4, o4 mini, and Claude Sonnet 4, are available for interfacing with the Total-NMS [2]
- The integration is seen as a critical business multiplier for customers, enabling rapid innovation and simplified network management [2]
Group 2: Company Overview
- Gilat Satellite Networks is a leading global provider of satellite-based broadband communications with over 35 years of experience [3]
- The company develops and delivers technology solutions for satellite, ground, and new space connectivity, focusing on critical connectivity across commercial and defense applications [3]
- Gilat's portfolio includes cloud-based platforms, high-performance satellite terminals, and integrated ground systems for various markets [4]
Group 3: Product Applications
- Gilat's products support multiple applications including government and defense, broadband access, cellular backhaul, and critical infrastructure, meeting stringent service level requirements [5]
- The company offers integrated solutions for multi-orbit constellations, Very High Throughput Satellites (VHTS), and Software-Defined Satellites (SDS) [4]
Is Manus Now Valued at RMB 3.6 Billion?
投中网· 2025-04-27 06:35
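A top Silicon Valley VC has invested too.
Author: Liu Yanqiu | Source: 投中网
The marked improvement in model reasoning has made agents the hottest AI investment theme of 2025. In this wave, Manus became the first agent to go viral in China, and it arguably kicked off the "year of the agent."
The company has recently made a new move. According to foreign media citing people familiar with the matter, Butterfly Effect, the company behind Manus AI, has raised a funding round led by the US venture firm Benchmark, totaling $75 million (about RMB 546 million). Manus had previously raised more than $10 million from investors including Tencent, ZhenFund, and Sequoia China. The new round increased Manus AI's valuation roughly fivefold, to nearly $500 million (about RMB 3.644 billion).
I asked the Manus team to confirm this information; as of publication, there had been no response.
In March this year, Manus released a general-purpose AI agent, still in closed beta, that can independently handle tasks such as résumé screening, itinerary planning, and stock analysis, and the company claimed it outperformed OpenAI's recently launched Deep Research on multiple metrics. It has also recently launched a subscription service priced at $39 per month, with a premium ...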