Large Language Models
GPT-5 coding benchmark plot twist: a failing grade on the surface, but 63.1% of tasks were never submitted; counting everything, its score is roughly double Claude's
36Kr · 2025-09-22 11:39
Core Insights
- Scale AI's new software engineering benchmark, SWE-BENCH PRO, shows that leading models such as GPT-5, Claude Opus 4.1, and Gemini 2.5 have low resolution rates, with none exceeding 25% [1][11]
- The benchmark is significantly harder than its predecessor, SWE-Bench-Verified, on which average accuracy reached 70% [4][11]
- The new benchmark aims to eliminate data contamination and better reflect real-world software engineering challenges by using previously unseen tasks [4][7]

Benchmark Details
- SWE-BENCH PRO contains 1,865 problems drawn from a diverse set of code repositories, organized into three subsets: public, commercial, and held-out [7]
- The public subset consists of 731 problems from 11 public repositories, while the commercial subset draws on 276 startup repositories [7]
- The benchmark excludes trivial edits and focuses on complex tasks requiring multi-file modifications, raising the rigor of the assessment [7][4]

Testing Methodology
- The evaluation process incorporates a "human in the loop" approach, augmenting problem statements with additional context and requirements [8][9]
- Each task is assessed in a containerized environment so that models are tested under controlled, reproducible conditions [10]
- Testing includes fail2pass and pass2pass checks to verify that the problem is resolved and that existing functionality is preserved (a hedged harness sketch follows below) [10]

Model Performance
- Resolution rates for the top models are: GPT-5 at 23.3%, Claude Opus 4.1 at 22.7%, and Gemini 2.5 at 13.5% [13][14]
- Even the best-performing models scored below 20% on the commercial subset, indicating limited ability to solve real-world business problems [13][11]
- The analysis highlights that programming-language difficulty and repository-to-repository variation significantly affect model performance [15]

Failure Analysis
- Common failure modes include semantic-understanding errors, syntax errors, and incorrect solutions, with GPT-5 showing a high non-response rate of 63.1% [16][17]
- Claude Opus 4.1 struggles most with semantic understanding, while Gemini 2.5 shows failures spread fairly evenly across multiple dimensions [17][16]
- QWEN3 32B, an open-source model, has the highest tool-error rate, underscoring the importance of integrated tool usage for effective performance [17]
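To make the fail2pass / pass2pass mechanics concrete, below is a minimal, hypothetical harness sketch in Python. The container workflow, the `/repo` checkout path, and the pytest invocation are assumptions for illustration only, not Scale AI's actual evaluation code.

```python
import subprocess

def evaluate_patch(image: str, patch_path: str,
                   fail2pass: list[str], pass2pass: list[str]) -> bool:
    """Apply a model-generated patch inside a fresh container, then run both test groups.

    fail2pass: tests that failed before the patch and must pass afterwards (bug is fixed).
    pass2pass: tests that already passed and must keep passing (no regressions).
    """
    def ok(cmd: list[str]) -> bool:
        return subprocess.run(cmd, capture_output=True).returncode == 0

    # Start a throwaway container from the task's pinned environment image (hypothetical setup).
    container = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        subprocess.run(["docker", "cp", patch_path, f"{container}:/tmp/fix.patch"], check=True)
        if not ok(["docker", "exec", "-w", "/repo", container, "git", "apply", "/tmp/fix.patch"]):
            return False  # the patch does not even apply cleanly
        # Every fail2pass test must now pass, and every pass2pass test must still pass.
        return all(ok(["docker", "exec", "-w", "/repo", container, "pytest", "-q", t])
                   for t in fail2pass + pass2pass)
    finally:
        subprocess.run(["docker", "rm", "-f", container], capture_output=True)
```

A task only counts as resolved when both groups succeed, which is why partial fixes or regressions score zero under this kind of protocol.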
Apple plays to a traditional strength: three visual modalities are finally unified
机器之心· 2025-09-22 10:27
Core Insights
- The article discusses Apple's recent product releases and the ongoing conversation about the hardware advances in the new phones [1]
- It notes that Apple has yet to introduce any groundbreaking AI applications, with Apple Intelligence still lagging in the domestic market [2]
- It also points to a worrying loss of talent from Apple's AI and hardware teams, suggesting a less optimistic outlook for the company [3]

AI Research and Development
- Despite its struggles in the large-model domain, Apple has a strong track record in computer vision research [4]
- The article highlights a major pain point in building vision-related large models: images, videos, and 3D assets have traditionally required separate handling because their data dimensions and representations differ [4][5]
- Apple's research team proposes ATOKEN, a unified tokenizer for vision, which addresses this core limitation by processing all major visual modalities in one model while preserving reconstruction quality and semantic understanding [5][6][8]

ATOKEN Architecture
- ATOKEN's key innovation is a shared sparse 4D latent space in which every visual modality is represented as feature-coordinate pairs (a hedged sketch of this representation follows below) [11]
- The architecture uses a pure Transformer framework rather than traditional convolutional methods, together with a four-stage progressive training curriculum that adds modalities without degrading single-modality performance [15][16][19]
- The training stages are image pre-training, video dynamics modeling, integration of 3D geometry, and discrete tokenization via finite scalar quantization [19][20]

Performance Metrics
- ATOKEN reports leading results across evaluation metrics, combining high-quality image reconstruction with semantic understanding [21][23]
- In image tokenization, it achieves 0.21 rFID at 16×16 compression on ImageNet, outperforming the UniTok method [23]
- For video, it reaches 3.01 rFVD and 33.11 PSNR on the DAVIS dataset, competitive with specialized video models [24]
- For 3D assets, it achieves 28.28 PSNR on the Toys4k dataset, surpassing dedicated 3D tokenizers [29]

Conclusion
- The results suggest that a next generation of multimodal AI systems built on unified visual tokenization is becoming a reality, with ATOKEN showing capability in both generative and understanding tasks [26][27]
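The "feature-coordinate pair" idea and finite scalar quantization can be illustrated with a short, conceptual PyTorch sketch. This is only a reading of the description above, not Apple's ATOKEN code; the function names, coordinate convention, and quantization levels are assumptions.

```python
import torch

def to_sparse_4d_tokens(features: torch.Tensor, coords_4d: torch.Tensor):
    """Represent any visual input as (4D coordinate, feature) pairs in one shared space.

    features:  (N, D) patch/voxel embeddings from a shared Transformer encoder.
    coords_4d: (N, 4) coordinates (x, y, z, t); an image fixes z = t = 0, a video varies t,
               and a 3D asset varies z, so one sparse 4D latent space covers all three modalities.
    """
    return list(zip(coords_4d.unbind(0), features.unbind(0)))

def fsq_quantize(latent: torch.Tensor, levels=(8, 8, 8, 5, 5, 5)) -> torch.Tensor:
    """Finite scalar quantization: bound each latent channel and snap it to a small fixed
    grid, yielding discrete tokens without a learned codebook (straight-through trick omitted)."""
    half = (torch.tensor(levels, dtype=latent.dtype) - 1) / 2
    z = torch.tanh(latent) * half   # squash each channel into its level range
    return torch.round(z) / half    # snap to the nearest grid point

# Example: 256 "patches" with 6-channel latents placed at random integer 4D coordinates.
feats = torch.randn(256, 6)
coords = torch.randint(0, 16, (256, 4))
tokens = to_sparse_4d_tokens(fsq_quantize(feats), coords)
```

The point of the sketch is the unification: once every modality reduces to sparse coordinate-feature pairs, a single tokenizer and downstream model can serve images, video, and 3D without per-modality pipelines.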
36Kr Evening Report | Cathay Pacific resumes its Seattle route with five direct round-trip flights per week; Musk says SpaceX may put 95% of the world's total payload into orbit next year
36Kr · 2025-09-22 08:49
Group 1: Airline Industry
- Cathay Pacific will resume direct flights to Seattle starting March 30, 2026, making it the airline's ninth passenger destination in North America [1]
- By summer 2026, Cathay Pacific plans to offer over 110 round-trip flights to North America, including destinations such as Boston, Chicago, Dallas, Los Angeles, New York, San Francisco, Seattle, Toronto, and Vancouver [1]

Group 2: Space Industry
- Elon Musk said SpaceX could launch 95% of the world's total payload into orbit next year, with a projected 98% by 2027 [2]
- In the second quarter, SpaceX reportedly launched 88.5% of satellites and held an 86% share of global payload by weight [2]

Group 3: E-commerce
- Taobao will launch its Double 11 shopping festival simultaneously in 20 countries, backing it with a 1 billion yuan marketing budget to help 100,000 merchants double their overseas sales [3]
- Douyin e-commerce reported 49% year-on-year GMV growth for its shelf platform, with over 511 million new e-commerce authors and 536 million new merchants earning income through the platform [4]

Group 4: Telecommunications
- VodafoneThree has selected Ericsson and Nokia for a network contract worth £2 billion (approximately $2.69 billion) covering network technology across the UK [6]

Group 5: Financial Services
- JD Industrials has received approval from the China Securities Regulatory Commission for its Hong Kong IPO and plans to issue up to 253,309,800 ordinary shares [7]
- The Financial Regulatory Bureau reported that its mechanism to support small and micro enterprises has issued loans totaling 22 trillion yuan, with 9.4 trillion yuan extended under the no-repayment renewal policy [11]

Group 6: Investment and Mergers
- Pfizer is nearing a deal to acquire weight-loss drug developer Metsera for $7.3 billion, with a cash offer of $47.50 per share plus additional performance-based payments [9]
A $2.7 billion homecoming: Google's most expensive "defector" and Transformer author reveals the next step toward AGI
36Kr · 2025-09-22 08:48
Core Insights
- The article centers on the hardware requirements for large language models (LLMs) as discussed by Noam Shazeer at the Hot Chips 2025 conference, emphasizing the need for more computational power, memory capacity, and network bandwidth to improve AI performance [1][5][9]

Group 1: Hardware Requirements for LLMs
- LLMs need more raw compute, measured in FLOPS, to improve performance and support larger models [23]
- Greater memory capacity and bandwidth are crucial, since insufficient bandwidth limits model flexibility and performance (a back-of-envelope sketch follows below) [24][26]
- Network bandwidth is often overlooked but is essential for moving data efficiently between chips during training and inference [27][28]

Group 2: Design Considerations
- Low-precision computing benefits LLMs, delivering more FLOPS without significantly hurting model quality [30][32]
- Determinism is vital for reproducibility in machine learning experiments, as inconsistent results hinder debugging and development [35][39]
- Overflow and precision loss in low-precision arithmetic must be addressed to keep model training stable [40]

Group 3: Future of AI and Hardware
- AI will continue to progress even if hardware advances stall, driven by software innovation [42]
- The potential for Artificial General Intelligence (AGI) remains, contingent on leveraging existing hardware effectively [42][44]
- The article stresses the importance of a supportive environment for people as AI reshapes the job landscape, and the need for society to adapt to technological change [56]
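One concrete way to see why memory bandwidth and low precision matter is a back-of-envelope decoding bound: every generated token has to stream each weight from memory at least once. The sketch below is illustrative arithmetic with assumed numbers, not figures from the talk.

```python
def decode_tokens_per_second(params_billions: float, bytes_per_param: float,
                             bandwidth_tb_per_s: float) -> float:
    """Upper bound for single-stream autoregressive decoding:
    tokens/s <= memory bandwidth / model size in bytes."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / model_bytes

# A hypothetical 70B-parameter model on a chip with ~3 TB/s of HBM bandwidth:
print(decode_tokens_per_second(70, 1.0, 3.0))  # 8-bit weights -> ~43 tokens/s ceiling
print(decode_tokens_per_second(70, 0.5, 3.0))  # 4-bit weights -> ~86 tokens/s ceiling
# Halving weight precision doubles the bandwidth-limited ceiling, which is one reason
# low-precision formats are attractive, provided overflow and precision loss are managed.
```

The same logic explains the emphasis on network bandwidth: once a model is sharded across chips, inter-chip links rather than on-chip FLOPS often become the binding constraint.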
US stock movers | Baidu up over 3% pre-market as Haitong International raises its valuation, setting a $188 target price
Ge Long Hui· 2025-09-22 08:40
Group 1
- Baidu's Hong Kong-listed shares rose more than 3%, and its US-listed shares were up more than 3% in pre-market trading [1]
- Haitong International switched its valuation method for Baidu from price-to-earnings (PE) to sum-of-the-parts (SOTP), citing the new CFO's strategy of "unlocking hidden assets" [1]
- Baidu is reshaping its traditional business around the large language model (LLM) wave and aims to overtake competitors in the cloud market through a range of measures [1]

Group 2
- Specific measures include adjusting the traditional search business, enriching AI SaaS products, providing cost-effective and reliable cloud infrastructure, and building an open foundation-model ecosystem [1]
- The valuation applies a 45% discount, arriving at a total value of about $64 billion, or a target price of $188 per ADR, corresponding to a projected 22x PE for fiscal year 2025 (a worked arithmetic sketch follows below) [1]
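As a worked illustration of the sum-of-the-parts arithmetic, the sketch below plugs in hypothetical segment values. Only the 45% discount, the roughly $64 billion total, and the $188-per-ADR target come from the report summary; the segment split and the implied ADR count (back-derived from those two figures) are assumptions.

```python
def sotp_target_per_adr(segment_values_busd: dict[str, float],
                        discount: float, adr_count_m: float) -> float:
    """Sum-of-the-parts: value each business line separately, apply a holding-company
    discount, and divide by the ADR count to get a per-ADR target price."""
    gross = sum(segment_values_busd.values())   # total value in $bn before discount
    equity = gross * (1 - discount)             # $bn after the discount
    return equity * 1e9 / (adr_count_m * 1e6)   # dollars per ADR

# Hypothetical segment split (in $bn), chosen only so the totals match the article:
segments = {"core_search_ads": 70.0, "ai_cloud": 35.0, "autonomous_driving": 6.0, "other": 5.3}
print(sotp_target_per_adr(segments, 0.45, 340))  # ~ $188 per ADR from a ~$64bn discounted total
```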
Meituan releases efficient reasoning model LongCat
Huan Qiu Wang· 2025-09-22 08:09
Core Insights
- LongCat-Flash-Thinking enhances the autonomous tool-calling capabilities of agents and expands formal theorem proving abilities, becoming the first domestic large language model with both "deep thinking + tool calling" and "non-formal + formal" reasoning capabilities [3]
- The new model shows significant advantages in handling high-complexity tasks such as mathematics, coding, and agent tasks [3]
- LongCat-Flash-Thinking is fully open-sourced on HuggingFace and GitHub, allowing users to experience it on the official website [3]
Meituan releases efficient reasoning model LongCat-Flash-Thinking, focused on high-complexity tasks
Huan Qiu Wang· 2025-09-22 08:02
Core Insights
- LongCat-Flash-Thinking enhances the autonomous tool-calling capabilities of agents and expands formal theorem proving abilities, becoming the first domestic large language model with both "deep thinking + tool calling" and "non-formal + formal" reasoning capabilities [3]
- The new model shows significant advantages in handling high-complexity tasks such as mathematics, coding, and agent tasks [3]
- LongCat-Flash-Thinking is fully open-sourced on HuggingFace and GitHub, allowing users to experience it on the official website [3]
AI's small applications are everywhere, but the industry faces a big predicament
Hu Xiu· 2025-09-22 07:07
Group 1
- The article describes initial skepticism toward AI progress, as new versions from many major companies fell short of expectations, raising concerns about future development [1]
- After taking part in discussions on AI adoption, however, the author turns more optimistic: AI is being widely adopted across industries and is quietly transforming the world [2]
- A distinction is drawn between frontier AI such as large language models and autonomous driving, and more mature technologies such as voice and image recognition, which are no longer considered groundbreaking [3][4]

Group 2
- AI is portrayed as a tool accessible to everyone, able to improve efficiency and outcomes in specific scenarios and thereby demonstrate the value of technological progress [4][5]
- Numerous practical applications are highlighted, such as automatic meeting transcription and structured processing of customer interactions, which significantly strengthen digital capabilities [6][7]
- Although some professionals dismiss these applications, they are recognized as valuable and memorable, showing AI's ability to meet user needs and win acceptance [8]

Group 3
- While AI is a hot topic in the tech industry, many projects still struggle to turn a profit, indicating that the AI industry has not yet stabilized [23][24]
- On the supply side, AI applications often lack economies of scale: backend systems require extensive customization, making solutions hard to standardize and productize [25]
- Users expect AI applications to be inexpensive, and while many AI offerings are currently subsidized, the sustainability of that model is in doubt, as companies may eventually have to charge for their services [27]

Group 4
- Opinions on AI's future diverge, with some favoring continued investment in generative AI while others seek to commercialize existing technologies and create tangible value [28][31]
- The article suggests that without industry consensus, this fragmented approach to investment may produce suboptimal outcomes for everyone involved [32]
- Despite the challenges, the gradual integration of AI is improving the overall digital landscape, benefiting both individuals and organizations [33]
Meituan (03690) releases efficient reasoning model LongCat-Flash-Thinking
Zhi Tong Cai Jing Wang· 2025-09-22 06:40
According to the official introduction, the model not only strengthens agents' ability to call tools autonomously but also extends formal theorem proving, making it the first domestic large language model to combine "deep thinking + tool calling" with "non-formal + formal" reasoning capabilities. LongCat-Flash-Thinking holds especially pronounced advantages on ultra-high-complexity tasks such as mathematics, coding, and agent tasks.

Zhitong Finance APP learned that on September 22, Meituan (03690) released its efficient reasoning model LongCat-Flash-Thinking. Meituan said that, based on AIME25 test data, LongCat-Flash-Thinking demonstrates more efficient agentic tool calling under this framework, saving 64.5% of tokens compared with not using tool calls while maintaining 90% accuracy. The model is now fully open-sourced on HuggingFace and GitHub.

Comprehensive evaluations show that LongCat-Flash-Thinking reaches the state-of-the-art (SOTA) level among open-source models worldwide on reasoning tasks spanning logic, mathematics, coding, and agents. ...
Meituan releases efficient reasoning model; performance on some tasks approaches GPT-5
Xin Lang Ke Ji· 2025-09-22 06:10
Core Insights
- Meituan has officially released its efficient reasoning model LongCat-Flash-Thinking, which maintains the speed characteristic of its predecessor, the LongCat model, while achieving state-of-the-art (SOTA) performance on reasoning tasks across domains such as logic, mathematics, code, and intelligent agents, with some tasks nearing the performance of the closed-source model GPT5-Thinking [1]
- The model enhances the autonomous tool-calling capabilities of intelligent agents and expands its formal theorem proving abilities, making it the first domestic large language model to combine "deep thinking + tool calling" with "non-formal + formal" reasoning capabilities [1]
- The new model demonstrates significant advantages in handling high-complexity tasks, particularly in mathematics, code, and intelligent agent tasks [1]
- LongCat-Flash-Thinking is fully open-sourced on platforms including HuggingFace and GitHub and is available to try on the official website [1]