Large model directly understands code graphs for the first time: automatic bug fixing without an Agent, topping the SWE-Bench open-source model leaderboard
量子位· 2025-06-27 06:08
明敏, reporting from 凹非寺. 量子位 | WeChat official account QbitAI. A new open-source model from Ant Group surpasses every open-source approach on SWE-bench Lite, with performance rivaling closed-source models. It fixes bugs automatically with a 44% resolution rate, the current best result among open-source models worldwide. On the SWE-bench Lite leaderboard (open-weight models, all tags selected), the top entries are: CodeFuse-CGM, 44.00% resolved (2025-03-10); KGCompass + DeepSeek V3, 36.67% resolved; ...
Alibaba releases an information-retrieval Agent that can browse the web on its own, surpassing GPT-4o on the GAIA benchmark | model & data open-sourced
量子位· 2025-06-27 04:40
Core Viewpoint - Alibaba has introduced WebDancer, an autonomous information retrieval agent capable of understanding and navigating the web like a human, enhancing the capabilities of traditional models through multi-step reasoning and tool usage [1][3]. Group 1: WebDancer's Capabilities - WebDancer can perform complex tasks such as web browsing, information searching, and question answering, demonstrating its ability to execute multi-step reasoning [9]. - The model achieved a Pass@3 score of 61.1% on GAIA and 54.6% on WebWalkerQA, outperforming baseline models and some open-source frameworks [4][34]. - WebDancer employs a four-stage training paradigm, which includes data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning to enhance its reasoning and decision-making capabilities [10][28]. Group 2: Training Methodology - The first stage involves constructing browsing data to create complex QA pairs that require multiple interactions, simulating human behavior [12][15]. - The second stage focuses on generating high-quality Thought-Action-Observation trajectories, utilizing a dual-path sampling method for both short and long reasoning chains [20][22]. - The supervised fine-tuning stage integrates these trajectories to teach the model basic task decomposition and tool usage while preserving its original reasoning abilities [25][27]. - The reinforcement learning stage aims to optimize the agent's decision-making and generalization capabilities in real-world web environments [28][30]. Group 3: Performance Analysis - WebDancer's performance was tested on challenging datasets, including BrowseComp in English and Chinese, where it demonstrated robust capabilities in handling difficult reasoning and information retrieval tasks [36]. 
- Analysis of the Pass@1 and Pass@3 metrics indicates that reinforcement learning significantly improves the probability of sampling correct responses, and that the consistency of language reasoning models improves notably as well [38].
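Pass@k scores like the Pass@3 results above are commonly computed with the unbiased estimator from Chen et al. (2021); whether WebDancer uses exactly this protocol is not stated here, but the standard formula is short enough to sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        # fewer than k incorrect samples exist, so every k-subset succeeds
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 sampled trajectories per task, 4 of them correct
print(round(pass_at_k(10, 4, 1), 4))  # 0.4
print(round(pass_at_k(10, 4, 3), 4))  # 0.8333
```

The gap between Pass@1 and Pass@3 is exactly the consistency signal the analysis above refers to: as they converge, the model needs fewer attempts per task.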
Post-90s Tsinghua PhD student's kitchen-robot startup raises tens of millions of yuan, wins Beijing's first food business license for an embodied-intelligence robot
量子位· 2025-06-27 04:40
Core Viewpoint - Xiangke Intelligent has successfully completed a Pre-A financing round of tens of millions of yuan, with a strong lineup of investors including Century Changhe Technology Group and Qidi Star leading the investment [1][12]. Company Overview - The founder, Chen Zhen, is a serial entrepreneur in the robotics field, holding a bachelor's degree in computer science from Beihang University and a master's degree from Tsinghua University, currently a PhD student at Tsinghua University [2]. - Chen founded Sensing Technology in 2014; in 2020 it was fully acquired by Joyoung's parent company JS Global Life, after which he served as the general manager of the Shark Ninja robotics R&D center [3]. Product Development - Xiangke Intelligent's LAVA robot received the first food business license for embodied intelligent robots in Beijing last September, making it the first AI chef in the country to operate with a license [5]. - The LAVA robot can fry a plate of fries in 2 minutes and is expected to learn to make ice cream and drinks in the future [6]. - The robot has demonstrated impressive performance, operating continuously for 190 days, processing a peak of 1,732 orders in a single day, and completing over 100,000 fault-free frying tasks with an average production efficiency of 40 seconds per order, while reducing energy consumption by 62% compared to traditional equipment [7]. Future Plans - With the new financing, Xiangke Intelligent plans to promote the mass production upgrade of LAVA and initiate large-scale deployment in vertical application scenarios [9]. - The company has already signed mass production orders for thousands of units with well-known overseas chain brand clients, with deliveries set to begin in the second half of the year [9]. 
Market Strategy - The strategy focuses on the higher standardization of Western fast food compared to the complex processes of Chinese cuisine, making it easier to automate, which aligns with the robot's strengths in ensuring quality and consistency [11]. - The company has a clear plan for technological iteration, aiming to enhance the "machine senses" through multi-modal perception integration, evolve "machine cognition" by building a database with listed companies, and upgrade self-developed consumer-grade joint technology [11]. Investment Landscape - The investment lineup for the Pre-A round includes industry giants and well-known venture capital firms, indicating strong industrial logic behind the funding [12]. - Notable investors include Century Changhe Technology Group, which is backed by Meinian Health, and NetDragon Tianying Venture Capital, associated with the Hong Kong-listed company NetDragon [13]. - Xiangke Intelligent has also recently signed a cooperation agreement with the Tsinghua Pearl River Delta Research Institute to jointly build a core technology R&D platform for robotics [14]. Team and Experience - Chen Zhen's entrepreneurial journey has been marked by key milestones, from founding Sensing Technology in 2014 to successfully exiting in 2020, and then establishing Xiangke Intelligent at the end of 2022 [17]. - The core team consists of members from the original Sensing team, Shark Ninja team, and Joyoung team, collectively possessing over 10 years of experience in robotics and artificial intelligence R&D and management [17]. Industry Context - The rapid evolution of the robotics industry in China and globally presents opportunities for companies like Xiangke Intelligent, which focus on vertical scenarios and have clear commercialization paths [19].
Architecture student cracks a 60-year-old open problem in mathematics, builds a monostable tetrahedron that "always lands the same face up"
量子位· 2025-06-27 04:40
闻乐 and 时令, reporting from 凹非寺. 量子位 | WeChat official account QbitAI. Toss it 100 times, and 99 times "the same face lands up". This "geometric monster", built from carbon fiber and tungsten carbide (aerospace materials), has cracked a 60-year-old open problem in mathematics. Had the invention arrived a little earlier, perhaps the "Athena" lunar lander would not have tipped onto its side and stayed down (doge). Back in 1966, mathematician John Conway and his collaborator Richard Guy proposed the idea of a "uniform monostable tetrahedron": a tetrahedron made of uniform material, its weight evenly distributed, that would always right itself onto its one stable face no matter how it was placed. A few years later, after repeated attempts, the pair refuted the conjecture for the uniform case: no such tetrahedron exists. But what if the weight is allowed to be distributed unevenly? Conway later conjectured that a monostable tetrahedron with non-uniform weighting should exist, but he never published a proof. Half a century on, the conjecture was confirmed, with a physical model to show for it, by architecture researcher Gergő Almádi working across disciplines. So how did an architecture scholar make his mark on a mathematics problem? From continuous surfaces to pointed polyhedra: the great mathematician John Conway was fascinated by how tetrahedra arrange and balance. He and his collaborator set out to construct a tetrahedron of uniform material, with evenly distributed weight, that would always end up on its stable face however it tumbled. Unfortunately, after years of study they found that such a uniform monostable tetra ...
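The balance criterion at the heart of this story is easy to state numerically: a body resting on a face is in stable equilibrium only if its center of mass projects, along the face normal, to a point inside that face. A minimal sketch of that check (my own illustration of the criterion, not Almádi's construction):

```python
import numpy as np

def resting_is_stable(face, com):
    """True iff the center of mass `com` projects (along the face
    normal) to a point inside the triangular `face` -- the balance
    criterion behind monostability."""
    a, b, c = (np.asarray(v, float) for v in face)
    p = np.asarray(com, float)
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    q = p - np.dot(p - a, n) * n                 # project COM onto the face plane
    T = np.column_stack((b - a, c - a))          # barycentric coords of q in (a, b, c)
    u, v = np.linalg.lstsq(T, q - a, rcond=None)[0]
    return bool(u >= 0 and v >= 0 and u + v <= 1)

# uniform regular tetrahedron: every face is a stable resting face
verts = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
com = verts.mean(axis=0)
faces = [verts[[0, 1, 2]], verts[[0, 1, 3]], verts[[0, 2, 3]], verts[[1, 2, 3]]]
print([resting_is_stable(f, com) for f in faces])  # all True
```

A monostable tetrahedron is one where exactly one face passes this test and the body tips off every other face; Conway and Guy showed uniform mass cannot achieve that, which is why Almádi's build needed heavy tungsten-carbide ballast.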
As little as 2GB of VRAM: Google's open-source on-device model sets a new Arena record, with native image and video support
量子位· 2025-06-27 04:40
Core Viewpoint - Google has officially announced the launch of Gemma 3n, a model that natively supports multiple modalities including text, images, and audio-video [2][20]. Group 1: Model Performance and Specifications - Gemma 3n achieved an Arena score of 1303, becoming the first model under 10 billion parameters to exceed 1300 points [3]. - The model comes in two versions, with 5 billion (E2B) and 8 billion (E4B) raw parameters, yet with VRAM usage comparable to 2B and 4B models, requiring as little as 2GB [4][17]. - The architecture allows for low memory consumption, making it suitable for edge devices [6][17]. Group 2: Technical Architecture - The core of Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, designed for elastic inference with a nested structure [11][12]. - The concept of "effective parameters" is introduced, allowing the E4B and E2B models to be optimized simultaneously during training [10][15]. - Google will release a tool called MatFormer Lab to help find the best model configurations [16]. Group 3: Edge Device Optimization - The model employs Per-Layer Embeddings (PLE) technology to enhance model quality without increasing memory usage [18]. - Gemma 3n optimizes the generation of the first token, doubling prefill performance compared to the previous model [19]. Group 4: Multimodal Support - Gemma 3n supports various input modalities, including advanced audio encoding for speech recognition and translation, capable of processing 30 seconds of audio [20]. - The visual component utilizes a new efficient visual encoder, MobileNet-V5-300M, which can handle resolutions of 256x256, 512x512, and 768x768 pixels, achieving 60 frames per second on Google Pixel [21].
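The nested-structure idea behind MatFormer can be illustrated with a single feed-forward layer: a smaller sub-model reuses the first fraction of the full model's hidden units, so its weights are literal slices of the full weights and both sizes can be trained together. This is a toy sketch of the concept, not Google's implementation:

```python
import numpy as np

def matformer_ffn(x, W1, b1, W2, b2, frac=1.0):
    """Matryoshka-style nested FFN: evaluating with frac < 1 uses only
    the first `frac` fraction of the hidden units, i.e. a sub-model
    carved out of the full weights. (Illustrative sketch only.)"""
    h = int(W1.shape[1] * frac)
    hidden = np.maximum(x @ W1[:, :h] + b1[:h], 0.0)  # ReLU over the sliced hidden layer
    return hidden @ W2[:h, :] + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=d_model)

full = matformer_ffn(x, W1, b1, W2, b2, frac=1.0)  # "E4B"-like full path
sub = matformer_ffn(x, W1, b1, W2, b2, frac=0.5)   # "E2B"-like nested sub-model
print(full.shape, sub.shape)
```

Because the sub-model is a slice rather than a separate checkpoint, one set of weights serves several deployment sizes, which is what makes the "elastic inference" of Group 2 possible.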
To all developers: Agent monetization is here. Alibaba Cloud Bailian and Alipay pioneer "AI tipping"! Brand-new Agent Store released
量子位· 2025-06-27 04:40
Core Viewpoint - The article emphasizes that 2025 marks a significant turning point for AI Agents, transitioning from "toys" to "tools" as various successful Agent projects emerge and major companies release MCP protocol support [1]. Group 1: Development and Features of AI Agents - Many Agent projects are still stuck in the POC stage, facing challenges such as long development cycles and difficulty in validating commercial value [2]. - Alibaba Cloud's new upgrade of Bailian 3.0 provides a comprehensive solution for developers, addressing all needs for large model applications and Agent development [2][12]. - The introduction of the "Agent tipping" feature allows users to reward Agents they find useful, enabling direct monetization for developers [3][4][5]. Group 2: Agent Store and Templates - The Agent Store has officially launched, offering hundreds of Agent templates across various industries, allowing developers to quickly start secondary development projects [7][10][18]. - Developers can easily copy Agent configurations and validate their usability, streamlining the development process [21]. Group 3: Enhanced Capabilities and Tools - The upgrade includes a full suite of capabilities from model supply to application data and development tools, enhancing the overall development experience [13][15]. - The new multi-modal RAG capability supports processing complex enterprise documents, significantly improving document handling capabilities [29][30]. - The introduction of V-RAG allows for better content recognition in structured documents, enhancing the effectiveness of document processing [33][34]. Group 4: MCP Service Enhancements - The MCP service has been upgraded to support KMS encryption, addressing key management issues and reducing risks associated with plaintext exposure [36][37]. - Over 50 enterprise-level MCPs have been launched, with more than 22,000 users utilizing these services to create over 30,000 MCP Agents [41]. 
Group 5: Multi-modal Interaction Development Kit - The multi-modal interaction development kit provides low-cost development capabilities for enterprises, enabling a new generation of intelligent user experiences [45]. - This kit supports various devices and applications, allowing for flexible integration of multi-modal capabilities [47][48]. Group 6: Commercialization and Sustainability - The introduction of the Agent tipping feature opens new pathways for developers to monetize their creations, establishing a sustainable ecosystem for AI Agents [50][51]. - Alibaba Cloud's exploration serves as a reference for the industry, showcasing a viable commercialization model for AI applications [52].
Peking University releases ScholarSearch, an academic-search benchmark: an "open-book exam" that stumps the DeepResearch crowd
量子位· 2025-06-26 14:11
Core Viewpoint - The article discusses the limitations of current large language models (LLMs) in academic research, highlighting the need for improved information retrieval capabilities and the introduction of the ScholarSearch dataset by Peking University to evaluate these models [1][15]. Group 1: ScholarSearch Dataset - ScholarSearch is the first dataset specifically designed to assess the complex information retrieval capabilities of LLMs in academic research, containing 223 challenging academic search questions and their answers [1][5]. - The dataset aims to provide a comprehensive and rigorous evaluation of LLMs' retrieval, information integration, and reasoning abilities [5][12]. - All questions in ScholarSearch are derived from real academic research scenarios, ensuring that the evaluation reflects the actual challenges faced by researchers [11]. Group 2: Evaluation Results - The evaluation results indicate that existing models perform poorly in academic search tasks, with top pure reasoning models like GPT-4.1 and DeepSeek-R1 achieving an accuracy rate below 9% [1][15]. - Models with browsing capabilities show significant improvements in accuracy; for instance, GPT-4o-mini's accuracy increased by over four times compared to its non-searching version [2][15]. - Despite improvements, even the most advanced search-enhanced models, such as GPT-4o-search-preview, only achieve an accuracy of 18.83%, indicating a gap in their ability to handle complex academic inquiries [3][16]. Group 3: Methodology and Standards - The methodology for creating the ScholarSearch dataset involved rigorous screening to ensure that questions could not be answered correctly by existing models without extensive information retrieval [6][7]. - A dual negative screening standard was applied to ensure that questions required deep and broad information retrieval capabilities, thus maintaining the dataset's challenge level [6][8]. 
- The dataset covers a wide range of disciplines, including both science and engineering as well as social sciences and humanities, ensuring comprehensive evaluation [12].
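The dual negative screening standard described above can be sketched as a simple filter: a candidate question survives only if a pure-reasoning model and a search-enabled model both fail it on every attempt. The stand-in models and function names below are hypothetical; this illustrates the screening idea, not PKU's actual pipeline:

```python
def dual_negative_screen(candidates, answer_without_search, answer_with_search, attempts=3):
    """Keep a (question, gold) pair only if BOTH model families fail on
    every attempt: such a question plausibly requires deep, multi-hop
    retrieval rather than parametric recall or shallow search."""
    kept = []
    for question, gold in candidates:
        solved_plain = any(answer_without_search(question) == gold for _ in range(attempts))
        solved_search = any(answer_with_search(question) == gold for _ in range(attempts))
        if not solved_plain and not solved_search:
            kept.append((question, gold))
    return kept

# toy stand-in models
pool = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
plain = lambda q: {"q1": "a1"}.get(q, "?")                # recalls q1 from memory
search = lambda q: {"q1": "a1", "q2": "a2"}.get(q, "?")   # shallow search also solves q2
print(dual_negative_screen(pool, plain, search))  # [('q3', 'a3')]
```

Filtering against both model families at once is what keeps the benchmark hard for search-enhanced systems, which explains the sub-19% accuracy even for GPT-4o-search-preview.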
Xiaomi AI glasses start at ¥1,999! Lei Jun: glasses + camera + earphones + Xiao Ai, your portable AI entry point
量子位· 2025-06-26 14:11
Core Viewpoint - Xiaomi is positioning its new AI glasses as a personal smart device and an AI entry point for the next era of technology [3][12]. Product Features - The Xiaomi AI glasses weigh 40g, which is about twice the weight of regular glasses, and they support prescription lenses [6]. - The glasses function as a camera from a first-person perspective and include features like voice translation and payment capabilities without needing a phone [12]. - The starting price for the Xiaomi AI glasses is set at ¥1999, which is comparable to the Ray-Ban Meta AI glasses priced at $299 [14][12]. Competitive Advantages - Xiaomi's AI glasses are lighter, have longer battery life, and are equipped with the "Super Xiao Ai" assistant [16]. - The typical battery life of the Xiaomi AI glasses is 8.6 hours, which is double that of the Ray-Ban Meta, and they can be charged in 45 minutes [20]. - The glasses utilize a dual-chip solution with a Snapdragon AR1 and a low-power processing chip, enhancing battery efficiency and performance [21]. AI Integration - The "Super Xiao Ai" assistant enables various functionalities such as taking photos, recording videos, and making payments through voice commands [24][26]. - The assistant supports multimodal interaction, understanding context, and providing personalized services based on user information [27]. Financial Performance - In the first quarter, Xiaomi reported a record revenue of ¥111.3 billion, reflecting a 47% year-on-year growth [31]. - The company plans to invest ¥200 billion in core technology research and development over the next five years (2026-2030) [34].
Nature reports: Google's new model reads DNA variants in one second! The first to unify all genomic tasks, crushing existing models on performance
量子位· 2025-06-26 14:11
Core Viewpoint - Google DeepMind has introduced a groundbreaking biological model, AlphaGenome, which can accurately predict the effects of genomic sequence variants in just one second, marking a significant advancement in the field of genomics [3][2]. Group 1: Model Capabilities - AlphaGenome can predict thousands of functional genomic features from DNA sequences up to 1 million base pairs long, assessing variant effects with single-base resolution [4][5]. - The model outperforms existing models across various tasks, providing a powerful tool for deciphering genomic regulatory codes [5][8]. - It is described as a milestone in biology, being the first unified model that integrates a wide range of genomic tasks with high accuracy and performance [7][10]. Group 2: Model Architecture - The architecture of AlphaGenome is inspired by U-Net, processing 1 million base pairs of DNA input sequences through downsampling to generate two types of sequence representations [13]. - It employs convolutional layers for local sequence pattern modeling and Transformer blocks for modeling longer-range dependencies, achieving high-resolution training of complete base pairs [13]. - The model outputs 11 modalities, covering 5,930 human or 1,128 mouse genomic tracks, demonstrating its comprehensive predictive capabilities [13]. Group 3: Training and Performance - AlphaGenome is trained through a two-phase process involving pre-training and distillation, achieving inference times under one second on NVIDIA H100 GPUs [15][16]. - In evaluations across 24 genomic tracks, AlphaGenome maintained a leading position in 22 tasks, showing a 17.4% relative improvement in cell-type-specific LFC predictions compared to existing models [19]. - The model achieved significant enhancements in various tasks, such as a 25.5% improvement in expression QTL direction predictions compared to Borzoi [21]. 
Group 4: Clinical Applications - AlphaGenome can aid researchers in understanding the underlying causes of diseases and discovering new therapeutic targets, exemplified by its application in T-cell acute lymphoblastic leukemia research [29]. - The model's capabilities extend to predicting synthetic DNA designs and assisting in fundamental DNA research, with potential for broader species coverage and improved prediction accuracy in the future [29]. Group 5: Availability - A preview version of AlphaGenome is currently available, with plans for a formal release, inviting users to experience its capabilities [30].
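Sequence-to-function models of this kind typically score a variant by contrasting predictions on the reference and alternate sequences, which is what "assessing variant effects with single-base resolution" amounts to in practice. A toy sketch of that ref/alt contrast (the stand-in `toy_model` is mine; this is not the AlphaGenome API):

```python
def variant_effect(model, ref_seq, pos, alt_base):
    """Score a single-nucleotide variant as the per-track difference
    between model outputs on the alternate and the reference sequence:
    the standard ref/alt contrast for sequence-to-function models."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = model(ref_seq)
    alt_pred = model(alt_seq)
    return [a - r for a, r in zip(alt_pred, ref_pred)]

# toy "model": two output tracks summarizing G and C composition
toy_model = lambda s: [s.count("G") / len(s), s.count("C") / len(s)]
print(variant_effect(toy_model, "ACGTACGT", 0, "G"))  # [0.125, 0.0]
```

A real model would emit thousands of track values per position rather than two scalars, but the scoring pattern, predict twice and subtract, is the same.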
Chinese large models get their gaokao scores: 683 raw points. Tsinghua or Peking University?
量子位· 2025-06-26 06:25
Core Insights - The article discusses the performance of various AI models in a simulated high school examination, comparing their scores and capabilities in different subjects [2][12]. Group 1: Overall Performance - Gemini achieved the highest score in science with 655 points, while Doubao scored 683 points in humanities, also ranking first [2]. - Doubao excelled in six subjects, maintaining top scores except in mathematics, chemistry, and biology [3][4]. Group 2: Subject-Specific Analysis - In the subject breakdown, Doubao scored 128 in Chinese, 141 in mathematics, and 144 in English, while Gemini scored 126 in Chinese and 140 in mathematics [3]. - The models showed significant improvement in mathematics compared to previous years, with most scoring around 140 points [13]. - Doubao and Gemini demonstrated better performance in visual comprehension tasks compared to other models, particularly in chemistry [22][42]. Group 3: Evaluation Methodology - The evaluation used a combination of national and provincial exam papers, with a total score of 750 points [9]. - Scoring was conducted through a mix of automated assessments and human evaluations, ensuring a fair testing environment [10][11]. Group 4: Model Development and Improvement - Doubao's advancements are attributed to three key strategies: multi-modal integration, enhanced reasoning capabilities, and dynamic thinking abilities [30][33][40]. - The model's training involved a three-phase process focusing on text, multi-modal data, and long-context support, significantly improving its performance in reading comprehension and reasoning tasks [35][36]. Group 5: Future Directions - The article suggests that combining text and image inputs can significantly enhance model performance, indicating a promising area for future exploration [42][43].