量子位

A Mini Programming Agent in 100 Lines of Code: Fixes 65% of Real-World Project Bugs, Works with Any LLM
量子位· 2025-07-27 11:57
Core Viewpoint
- The article discusses the launch of mini-SWE-agent, a lightweight programming agent that operates with only 100 lines of code, designed to solve 65% of the problems on the SWE-bench benchmark, while being compatible with various language models and easy to deploy locally [2][3][18].

Group 1: Project Overview
- mini-SWE-agent is an open-source project developed by the same team behind SWE-bench and SWE-agent, focused on simplifying the process of fixing code bugs in real GitHub projects [2][7].
- The architecture of mini-SWE-agent is significantly simplified, requiring only about 200 lines of code in total, and eliminates complex dependencies [14][10].
- The agent operates through the operating system's Bash environment, executing commands without any specialized tool interface, which makes it compatible with any language model [14][18].

Group 2: Performance and Features
- Despite its lightweight design, mini-SWE-agent maintains a performance level comparable to the original SWE-agent, solving approximately 65% of the problems on the SWE-bench Verified subset [3][18].
- The agent supports various runtime environments, including Docker and other virtualization platforms, facilitating easy deployment across different systems [16][18].
- It includes tools for batch inference and trajectory browsing, aiding users in large-scale evaluation and decision analysis [18].

Group 3: User Guidance and Applications
- mini-SWE-agent is recommended for users seeking quick local execution, simplified control flow, and stable evaluation environments, making it well suited to fine-tuning or reinforcement learning experiments [20].
- For users requiring a highly configurable toolchain and complex state management, the more feature-rich SWE-agent is suggested [20].
- The design philosophy of mini-SWE-agent emphasizes readability, convenience, and ease of extension, making it accessible to everyday developers [21][22].
LLM Privacy and Fairness Show a "Seesaw" Effect, and the Optimal Balancing Rule Has Just Been Found | Renmin University & Shanghai AI Lab
量子位· 2025-07-27 11:57
Core Insights
- The research from Renmin University and Shanghai AI Lab reveals that enhancing privacy protection in large language models (LLMs) can lead to a significant drop in fairness, with a decline of up to 45% [1][8].
- The study identifies a "seesaw effect" caused by coupled neurons that encode both fairness and privacy, leading to conflicts during model optimization [1][10].

Group 1: Ethical Challenges in LLMs
- The concept of "alignment tax" describes the trade-off where optimizing for alignment-related goals often sacrifices other foundational capabilities like general knowledge and reasoning [3].
- As LLMs are increasingly integrated into critical sectors such as healthcare, finance, and education, ensuring models maintain fairness and privacy has become essential [4][5].
- Users expect LLMs to protect privacy while also ensuring fairness, but achieving both simultaneously is challenging [7].

Group 2: SPIN Methodology
- The SPIN method is introduced as a training-free solution that involves precisely suppressing 0.00005% of key neurons to enhance both fairness and privacy [2][12].
- The approach involves three steps: identifying critical neurons, locating coupled neurons that impact both fairness and privacy, and implementing suppression to decouple their effects [13][15][16].
- SPIN demonstrates significant improvements in fairness and privacy metrics across various models, outperforming traditional fine-tuning methods [17][18][19].

Group 3: Performance and Robustness
- SPIN allows for zero-cost deployment, requiring only a one-time neuron scan, and operates without additional computational costs during inference [20].
- The method shows resilience even when trained on harmful data, maintaining stable improvements in fairness and privacy [26][31].
- SPIN's effectiveness is validated through various benchmark tests, indicating that it can enhance model performance without sacrificing intelligence [21][22].

Group 4: Broader Implications
- The principles behind SPIN can be extended to address other ethical conflicts in AI, such as balancing safety and utility [37].
- The research highlights the importance of understanding neuron-level interactions to create more responsible AI systems [12][37].
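Mechanically, suppression of this kind amounts to zeroing the activations of the selected neurons at inference time while leaving every weight untouched, which is why it is training-free and reversible. The toy below illustrates the idea on a two-layer MLP standing in for one transformer block; the suppressed indices are arbitrary placeholders, not the product of the paper's actual neuron-scanning procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP; W1's output units play the role of the "neurons"
# that a SPIN-style scan would select for suppression.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, suppressed=()):
    """Run the toy block, optionally zeroing selected hidden neurons.

    No weight is modified, so the intervention is training-free and can
    be removed at any time. The indices here are arbitrary; the paper
    selects them with a one-time importance scan.
    """
    h = np.maximum(x @ W1, 0.0)        # hidden activations
    h = h.copy()
    h[..., list(suppressed)] = 0.0     # deactivate the coupled neurons
    return h @ W2, h

x = rng.normal(size=(2, 8))
baseline, _ = forward(x)
edited, hidden = forward(x, suppressed=[3, 7])
```

In a real model the same effect is typically achieved with a forward hook on the chosen layer, so deployment cost is indeed one scan plus a near-free masking operation per forward pass.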
Embodied Intelligence Gets a Heavyweight: Ten Years of Multimodal Groundwork and World Models Pave the Way for SenseTime's "Wuneng"
量子位· 2025-07-27 11:57
Core Viewpoint
- SenseTime officially announced its entry into the field of embodied intelligence with the launch of the "Wuneng" embodied intelligence platform at the WAIC 2025 large model forum [1][2].

Group 1: SenseTime's Technological Advancements
- SenseTime introduced the "Riri Xin V6.5" multimodal reasoning model, which features a unique image-text interleaved thinking chain that significantly enhances cross-modal reasoning accuracy [3][4].
- The new model outperforms Gemini 2.5 Pro in multimodal reasoning capabilities across multiple datasets, showcasing its competitive edge [8].
- Compared to its predecessor, Riri Xin 6.0, the V6.5 model has improved performance by 6.99% while reducing reasoning costs to only 30% of the previous version, resulting in a fivefold increase in cost-effectiveness [10].

Group 2: Transition to Embodied Intelligence
- SenseTime's shift towards embodied intelligence is a natural progression from its expertise in visual perception and multimodal capabilities to physical-world interactions [12][13].
- The company has accumulated over ten years of industry experience, particularly in autonomous driving, which has provided valuable data and world-model experience for the development of embodied intelligence [13].
- The "Wuneng" platform integrates the general capabilities of the Riri Xin multimodal model with the experience of building and utilizing world models, aiming to create an ecosystem for embodied intelligence [14].

Group 3: World Model Capabilities
- The "KAIWU" world model supports the generation of multi-perspective videos and can maintain temporal consistency for up to 150 seconds, utilizing a database of over 100,000 3D assets [16][18].
- It can understand occlusion and layering spatially, as well as temporal changes and motion patterns, allowing for realistic object representation [17][20].
- The platform can simultaneously process people, objects, and environments, creating a 4D representation of the real world [21].

Group 4: Industry Collaboration and Data Utilization
- SenseTime is pursuing a "soft and hard collaboration" strategy, partnering with various humanoid robot and logistics platform manufacturers to pre-install its models, enhancing the multimodal perception and reasoning capabilities of hardware [29].
- The company is addressing the common industry challenge of data scarcity by generating synthetic data in virtual environments and using real-world samples for calibration [32][33].
- The integration of first-person and third-person perspectives in training enhances the model's ability to learn from human demonstrations while executing tasks from its own sensory input [26][35].

Group 5: Future Outlook and Competitive Edge
- SenseTime is establishing a self-reinforcing data ecosystem through large-scale simulations, real data feedback from hardware, and the fusion of different perspectives, which is expected to drive continuous model upgrades [39].
- The company is positioned to lead the future of embodied intelligence by leveraging multimodal capabilities and hardware collaboration to build a competitive moat in the industry [40].
Running a 10-Billion-Parameter-Class LLM Smoothly on Hundred-Yuan Hardware: SJTU & Zenergize AI Open-Source an Edge-Native LLM
量子位· 2025-07-27 09:01
Core Viewpoint
- The next battleground for AI is shifting from the cloud to mobile devices, emphasizing the need for local computation to ensure user privacy and data security [2][3].

Group 1: Industry Trends
- Major smartphone manufacturers like Apple, Huawei, Samsung, Xiaomi, and OPPO are integrating large models into mobile devices, indicating a competitive landscape for edge AI [2].
- The challenges of running AI smoothly on local devices are significant, as evidenced by Apple's delayed launch of its core AI features [2][3].

Group 2: Technological Innovations
- A new collaboration between Shanghai Jiao Tong University and the startup Zenergize AI has led to the development of the SmallThinker series, which is designed specifically for edge computing [4].
- The SmallThinker models, including SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, are optimized for local CPU inference without relying on high-end GPUs, achieving impressive performance metrics [5][23].

Group 3: Model Architecture
- SmallThinker employs a unique architecture that allows for efficient inference on devices with limited computational resources, avoiding the need for traditional model compression techniques [6][8].
- The model features three core technological characteristics: expert knowledge activation, preemptive expert routing to minimize I/O overhead, and a hybrid sparse attention mechanism that reduces memory usage by 76% [9][12][17].

Group 4: Performance Metrics
- In extreme memory-constrained scenarios (1GB RAM), the SmallThinker-4B-A0.6B model achieves a speed of 19.91 tokens/s, significantly outperforming competitors like Qwen3-1.7B [26][27].
- On standard PC configurations (8GB RAM), the SmallThinker-21B-A3B model demonstrates a speed of 20.30 tokens/s, doubling the performance of Qwen3-30B-A3B [29].

Group 5: Future Directions
- The development team plans to enhance the model's capabilities by scaling up with more high-quality data and aims to create a personal AI assistant that operates entirely on individual devices [32][33].
- The vision is to integrate AI seamlessly into daily life, providing a secure, private, and intelligent experience for users [34].
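The "expert knowledge activation" characteristic above is the sparse mixture-of-experts pattern: a router selects a few experts per token, so only a fraction of the parameters is ever computed, and on a flash-backed edge device only those experts' weights need fetching. The top-k routing sketch below is generic, not SmallThinker's implementation; all names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Router plus a bank of expert FFNs (reduced to single matrices for brevity).
router_w = rng.normal(size=(DIM, NUM_EXPERTS))
experts = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]

def moe_forward(x):
    """Route one token through its top-k experts only.

    The router's decision is available before any expert weight is
    touched, so an edge runtime can prefetch just the chosen experts
    from flash while other compute proceeds -- the intuition behind
    "preemptive expert routing" (the actual prefetch is omitted here).
    """
    logits = x @ router_w
    chosen = np.argsort(logits)[-TOP_K:]            # ids of selected experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate = gate / gate.sum()                        # softmax over the selected
    y = sum(g * (x @ experts[i]) for g, i in zip(gate, chosen))
    return y, chosen

x = rng.normal(size=DIM)
y, chosen = moe_forward(x)
```

With TOP_K of 8 experts active, only a quarter of the expert parameters participate per token, which is how a 21B-parameter model can activate roughly 3B (the "A3B" in its name).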
AI Godfather Hinton in Dialogue with Shanghai AI Lab's Zhou Bowen: Multimodal Chatbots Already Have Consciousness; Making AI Smart and Making AI Kind Are Two Different Things
量子位· 2025-07-26 15:56
Core Viewpoint
- Geoffrey Hinton, known as the "father of artificial intelligence," visited Shanghai, China, for discussions on AI advancements, emphasizing the intersection of AI and scientific discovery [1][2][3].

Group 1: Hinton's Visit and Discussions
- Hinton's visit included a public dialogue with Zhou Bowen, director of the Shanghai Artificial Intelligence Laboratory, focusing on cutting-edge AI research [2][3].
- The dialogue covered topics such as multimodal large models, subjective experience, and training "kind" superintelligence [3][9].
- Hinton's presence was met with enthusiasm, as attendees applauded and recorded the event, highlighting his significance in the AI field [2].

Group 2: AI and Scientific Discovery
- Zhou Bowen presented the "SAGE" framework, which integrates foundational models, fusion layers, and evaluation layers to elevate AI from a tool to an engine for scientific discovery [3].
- Hinton noted that AI has the potential to significantly advance scientific research, citing examples like protein folding and weather prediction, where AI outperforms traditional methods [16][17].

Group 3: Perspectives on AI Consciousness
- Hinton expressed the view that current multimodal chatbots possess a form of consciousness, challenging conventional beliefs about AI capabilities [9][13].
- He discussed the importance of understanding subjective experience in AI, suggesting that many misconceptions exist regarding how these concepts operate [12].

Group 4: Training AI for Kindness
- Hinton proposed that training AI to be both intelligent and kind involves different methodologies, allowing countries to share techniques for fostering AI kindness without compromising intelligence [14][15].
- He emphasized the need for ongoing research to develop universal methods for instilling kindness in AI systems as they become more intelligent [15][16].

Group 5: Advice for Young Researchers
- Hinton advised young researchers to explore areas where they believe "everyone is wrong," encouraging persistence in their unique approaches until they understand the reasoning behind established methods [18].
Just Announced: The Company Behind a Hit Designer Toy's Overseas Expansion Releases the First Global Marketing AI Agent
量子位· 2025-07-26 09:01
Core Viewpoint
- The article emphasizes the transformation of overseas marketing through the introduction of AI agents, specifically highlighting the launch of Navos by Titanium Technology, which aims to enhance efficiency and reduce costs in marketing processes [1][3][43].

Group 1: Introduction of Navos
- Navos is introduced as the first global marketing AI agent product that empowers the entire marketing chain, including creativity, deployment, and data analysis [2][4].
- The launch of Navos marks the beginning of the AI agent era in the overseas marketing industry [3].

Group 2: Functionality and Advantages
- Navos operates through intelligent agent collaboration, allowing it to autonomously plan and execute tasks based on user input, thereby improving efficiency and reducing costs [4][7].
- The AI agent is designed to take over repetitive tasks, enabling human resources to focus on high-value decision-making [7][8].

Group 3: Technical Innovations
- Navos addresses unique challenges in the marketing domain, such as cross-modal semantic alignment for creative design, utilizing advanced models for multi-modal content generation [9][28].
- The AI agent can analyze data from platforms like TikTok to provide creative suggestions based on market trends and successful content characteristics [11][14].

Group 4: Efficiency Gains
- The marketing cycle duration has been significantly reduced from one to three months to potentially just a few hours or days, achieving efficiency improvements of 10 to 50 times [28].
- For mature clients, the return on investment (ROI) can increase by over three times, while for small to medium clients, the ROI can rise by 10 to 50 times [29].

Group 5: Market Potential and Future Outlook
- The article highlights the vast potential for growth in the overseas marketing sector, with predictions that the AI marketing market could exceed 3 trillion yuan by 2028 [43].
- Titanium Technology aims to expand Navos's applications across various industries and regions, with a focus on e-commerce, gaming, and short video content [40][42].

Group 6: Industry Context
- The marketing industry is inherently AI-friendly due to its reliance on data-driven decision-making and the presence of numerous repetitive tasks that AI can effectively handle [32][33].
- Titanium Technology has established itself as a leading player in the AI-driven overseas marketing space, having served over 80,000 enterprises globally [36].
A Domestic GPU Now Runs Full-Scale DeepSeek at 100 tokens/s!
量子位· 2025-07-26 09:01
Core Viewpoint
- The fastest chip for running full-scale DeepSeek is a domestic GPU from Moore Threads, achieving a speed of 100 tokens/s, significantly faster than foreign GPUs at 50 tokens/s and domestic counterparts at 15 tokens/s [1][4].

Group 1: Moore Threads' Achievements
- Moore Threads has developed an AI super factory that goes beyond just creating faster chips, focusing on a comprehensive transformation of the entire technology stack [6][10].
- The AI super factory is not a physical chip manufacturing facility but a systemic overhaul that includes innovations in chip architecture, cluster design, and software algorithms [9][10].

Group 2: Key Components of the AI Super Factory
- The AI super factory's production efficiency is defined by five core elements: generality of accelerated computing, effective chip performance, node efficiency, cluster efficiency, and cluster stability [13].
- A full-function GPU serves as the foundation of the AI super factory, evolving from basic graphics acceleration to a versatile computing platform capable of handling various AI tasks [14][16].

Group 3: MUSA Architecture
- The MUSA architecture acts as the "chief designer" of the super factory, allowing for scalable and configurable chip designs that optimize resource allocation [25][26].
- MUSA's innovative design enables global resource sharing, reducing bottlenecks and improving efficiency during multi-task operations [27][29].

Group 4: Full-Stack Software System
- Moore Threads has created a full-stack software system that integrates deeply with the MUSA hardware architecture, enhancing developer experience and operational efficiency [35][36].
- The software stack includes optimized drivers, core operator libraries, and tools for performance analysis, significantly improving task handling and resource utilization [41][42].

Group 5: KUAE Computing Cluster
- The KUAE computing cluster is a soft-hard integrated system that extends the performance advantages of individual GPUs to large-scale deployments, enabling efficient training of massive AI models [43][44].
- The cluster supports various parallel training strategies and provides end-to-end training optimization, ensuring high performance and stability [45][46].

Group 6: Zero-Interrupt Fault Tolerance Technology
- Moore Threads has developed a unique zero-interrupt fault tolerance technology that allows for continuous operation of the AI super factory, minimizing downtime and recovery costs [47][49].
- This technology enhances the overall stability and reliability of the system, ensuring high effective training time and reducing the impact of potential failures [51][52].

Group 7: Future of AI and Computing Needs
- The demand for computing power is expected to grow exponentially, driven by advancements in generative AI and the need for complex task execution [54][56].
- Moore Threads aims to provide a comprehensive solution that addresses the challenges of AI model training, emphasizing the importance of stability, reliability, and efficiency in future computing [58][61].
A "Ladder Tournament" for Large Models: Agents Evolve on Real Kaggle Tasks | Open-Sourced by Georgia Tech and Stanford
量子位· 2025-07-26 09:01
Core Viewpoint
- The article discusses the introduction of MLE-Dojo, an interactive framework designed to train and evaluate large language model (LLM) agents on machine learning engineering tasks, addressing the limitations of existing benchmarks that do not simulate real-world iterative workflows [1][2].

Group 1: Existing Problems and Solutions
- Current benchmarks for LLMs are mostly static and fail to capture the dynamic workflows of machine learning engineering, lacking assessments of continuous experimentation and structured feedback [6].
- Many platforms do not support advanced training paradigms like supervised fine-tuning (SFT) or reinforcement learning (RL), limiting the development of more autonomous AI agents [7].
- Existing benchmarks often focus on isolated tasks, missing the complexity and interconnections of end-to-end machine learning processes, which MLE-Dojo aims to address by providing a comprehensive training and evaluation environment [8].

Group 2: MLE-Dojo Features
- MLE-Dojo consists of over 200 real Kaggle competitions, covering various domains such as tabular data, computer vision (CV), and natural language processing (NLP), providing unprecedented breadth and depth for evaluating AI agents [12].
- The framework offers a Gym-style interactive environment where agents can perform actions like requesting task information, validating code, and executing code in a secure sandbox [13].
- MLE-Dojo provides advanced features such as detailed error reports and a HumanRank score, which measures the agent's relative position on human leaderboards, offering a standardized performance metric across tasks [14].

Group 3: Evaluation of LLMs
- The research team evaluated eight leading LLMs using a multi-dimensional assessment system rather than relying on a single metric [16].
- The HumanRank score reflects the model's performance relative to human competitors, while the Elo rating system provides a dynamic ranking based on head-to-head match results [17][18].
- The AUP (Area Under the Performance Profile) metric assesses the robustness and consistency of models across various tasks, with higher scores indicating better performance stability [18].

Group 4: Performance Analysis
- Gemini-2.5-Pro emerged as the top performer in the Elo rating, demonstrating strong competitive capabilities and surpassing 61.95% of human players in the HumanRank score [20].
- Different models exhibited distinct problem-solving strategies, with some being more aggressive in executing code while others were more conservative, impacting their efficiency and overall performance [23].
- The analysis revealed that stronger models tend to generate longer and more complex solutions, indicating deeper reasoning and multi-step problem-solving capabilities [24].

Group 5: Cost-Performance Trade-off
- High-performing models often incur significant computational costs, with top reasoning models consuming more tokens and resources [25].
- Some models, like DeepSeek-r1, show potential for competitive performance with higher cost-effectiveness, indicating a direction for future model optimization [25].
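A Gym-style environment of the kind described reduces to a small discrete action vocabulary with structured observations and rewards. The toy below is an interface sketch only; the action names, reward shaping, and class name are assumptions for illustration, not MLE-Dojo's real API.

```python
class MiniMLEEnv:
    """Toy Gym-style environment: an agent can request task info,
    validate code, or execute code for a score."""

    ACTIONS = ("request_info", "validate_code", "execute_code")

    def __init__(self, task_description, scorer):
        self.task = task_description
        self.scorer = scorer       # maps submitted code -> score in [0, 1]
        self.best = 0.0

    def reset(self):
        self.best = 0.0
        return {"observation": self.task}

    def step(self, action, payload=""):
        """Return (observation, reward, done), Gym-style."""
        if action == "request_info":
            return {"observation": self.task}, 0.0, False
        if action == "validate_code":
            try:
                compile(payload, "<agent>", "exec")
                return {"observation": "syntax ok"}, 0.0, False
            except SyntaxError as e:
                # Detailed error reports are part of the structured feedback.
                return {"observation": f"error: {e}"}, 0.0, False
        if action == "execute_code":
            score = self.scorer(payload)
            reward = max(0.0, score - self.best)   # reward only improvements
            self.best = max(self.best, score)
            return {"observation": f"score={score:.3f}"}, reward, False
        raise ValueError(f"unknown action {action!r}")
```

Rewarding only improvements over the agent's best prior score is one common shaping choice for iterative-experimentation loops; it keeps the reward signal compatible with both SFT trajectory collection and RL.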
2.18x Inference Speedup for Very Large Models! SGLang and Meituan's Tech Team Open-Source a Speculative Sampling Training Framework
量子位· 2025-07-26 09:01
Core Viewpoint
- SpecForge is an open-source training framework designed for speculative sampling, specifically tailored for large models, achieving a 2.18x inference acceleration [1][15].

Group 1: SpecForge Overview
- SpecForge is developed by the SGLang team in collaboration with Meituan's search recommendation platform and Cloudsway.AI [1].
- The framework is built to address the challenges posed by the increasing size of models, which often leads to lower inference efficiency [4][6].
- SpecForge integrates deeply with the SGLang inference engine, providing a seamless training and inference process for speculative sampling [5][7].

Group 2: Technical Features
- The framework incorporates Eagle3, an advanced speculative sampling method that enhances inference speed by training a lightweight draft model to predict token distributions accurately [7].
- SpecForge supports various mainstream models, including complex MoE layers and Transformer variants, ensuring broad applicability [7].
- It features scalable distributed training through Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP), optimizing resource utilization on GPU clusters [7][14].

Group 3: Training Modes and Efficiency
- SpecForge offers two training modes, Online and Offline, allowing users to choose based on their specific needs and resource availability [10][17].
- The Training-Time Test (TTT) architecture enhances the robustness of the draft model, encapsulating complex processes to simplify implementation for users [9].
- The framework is designed with a focus on memory-efficient training, significantly reducing memory overhead even for trillion-parameter models [7].

Group 4: Experimental Validation
- The effectiveness of SpecForge was validated through experiments on datasets like ShareGPT and UltraChat, demonstrating compatibility with the Eagle3 architecture [15].
- The draft models trained using SpecForge achieved a notable 2.18x inference acceleration on the MT-Bench benchmark [15].

Group 5: Future Developments
- SpecForge's roadmap includes plans to support additional model architectures and integrate visual-language models (VLM) into the framework [22].
- The team aims to enhance training efficiency through improved parallel strategies and kernel optimizations [22].
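Speculative sampling of the Eagle family speeds up decoding by letting a small draft model propose several tokens cheaply, which the large target model then verifies; tokens are kept while the two agree, so the output matches what the target model alone would produce. The toy below shows the greedy-verification variant only; Eagle3's real algorithm uses a probabilistic accept/reject rule over full token distributions, and `draft_next`/`target_next` are stand-in functions, not any library's API.

```python
def speculative_decode(draft_next, target_next, prompt,
                       num_draft=4, max_new=12):
    """Greedy speculative decoding toy.

    draft_next / target_next map a token sequence to the next token,
    standing in for the small draft model and the large target model.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft model proposes a short continuation cheaply.
        proposal = []
        for _ in range(num_draft):
            proposal.append(draft_next(tokens + proposal))
        # Target model verifies; accept the longest agreeing prefix.
        accepted = 0
        for tok in proposal:
            if target_next(tokens) == tok:
                tokens.append(tok)
                accepted += 1
            else:
                break
        if accepted < num_draft:
            # On the first disagreement, take the target's own token:
            # the result is identical to pure target-model greedy decoding.
            tokens.append(target_next(tokens))
    return tokens
```

The speedup comes from the acceptance rate: when the draft model agrees with the target often (which is what the framework trains it for), several tokens are confirmed per expensive target-model pass instead of one.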
Hinton's Shanghai Speech: Large Models Closely Resemble Human Intelligence; Beware of Raising a Tiger
量子位· 2025-07-26 09:01
Core Viewpoint
- Geoffrey Hinton emphasizes the importance of establishing a positive mechanism for AI development to ensure it does not threaten humanity, highlighting the complex relationship between AI and human intelligence [3][42][55].

Group 1: AI Development and Understanding
- Hinton discusses the evolution of AI over the past 60 years, identifying two main paradigms, logical reasoning and biological understanding, which have shaped current AI capabilities [8][10].
- He compares human understanding of language to that of large language models, suggesting that both operate on similar principles of feature interaction and semantic understanding [19][27].
- The efficiency of knowledge transfer in AI is significantly higher than in humans, with AI capable of sharing vast amounts of information rapidly across different systems [29][36].

Group 2: AI Safety and Collaboration
- Hinton warns that as AI becomes more intelligent, it may seek control and autonomy, necessitating international cooperation to ensure AI remains beneficial to humanity [42][55].
- He likens the current relationship with AI to raising a tiger cub, stressing the need to train AI so that it does not become a threat as it matures [49][51].
- He calls for a global AI safety institution aimed at researching and training AI to assist rather than dominate humanity [55][56].