量子位
On the Eve of Its IPO Bell-Ringing, Zhipu Pushed an Open-Source Coding Model to SOTA Overnight
量子位· 2025-12-23 00:15
On benchmarks such as AIME 25 and Humanity's Last Exam (HLE), GLM-4.7 outscores GPT-5.1; its SWE-Bench score reaches 73.8% (+5.8%), a new open-source high. 鱼羊 henry from 麦蒿寺, 量子位 | 公众号 QbitAI. As 2025 counts down, the stream of new SOTA models shows no sign of slowing. Overnight, the coding SOTA crown changed hands, and the model was open-sourced at launch, once again from a Chinese large-model company: Zhipu AI, with GLM-4.7. In this update's technical report, it is Coding, Coding, and more Coding. The most tangible effect of the capability gains: the official demo shows it building a Plants vs. Zombies clone with ease. The Chatbot on the official site and the API are both live, so you can try it online right now. On front-end generation quality, GLM-4.7 shows a clear upgrade: cleaner page structure and clearer component hierarchy. Compared with GLM-4.6, its output looks more like a modern Web UI, with more polished page elements. All in all, one model release and the holiday mood is complete (doge). On complex geometric structures and spatial relationships, GLM-4.7 maintains good structural consistency and detail stability. The quality of generated 3D assets has also improved markedly. For PPT and visual-material generation, GLM-4.7 produces clear heading hierarchies and element ...
Why Are Agents Mighty Dragons in Demos but Mere Worms in Practice?
量子位· 2025-12-22 09:30
Core Viewpoint - The article discusses the limitations of AI agents in real-world applications compared to their impressive demonstrations, emphasizing that adaptability is a key factor for improvement [1]. Summary by Sections Definition and Functionality of Agents - Agents are defined as AI systems that can plan, utilize tools (such as search engines and databases), and remember information to complete complex tasks independently [3]. Adaptability Framework - The core bottleneck in current agent systems is adaptability, specifically how models adjust their behavior based on feedback signals [6]. - A 2x2 classification framework is proposed to categorize existing adaptation methods into four paradigms based on two dimensions: who is optimized (the agent or the tools) and where the feedback signal comes from (tool execution results or agent output evaluations) [7][8][9]. Four Paradigms of Adaptation - **A1 Paradigm**: Agents learn from feedback based on tool execution, such as whether code runs successfully [10]. - **A2 Paradigm**: Uses the agent's final output as the optimization signal, exemplified by models like DeepSeek-R1 that train reasoning capabilities through reinforcement learning [11]. - **T1 Paradigm**: Tools are pre-trained independently and then called by the agent, allowing for plug-and-play functionality [12]. - **T2 Paradigm**: Tools optimize themselves based on the agent's output, creating a symbiotic relationship [13]. Benefits of Classification - This classification helps developers avoid trial and error when improving AI capabilities, allowing for targeted adaptations based on specific needs [15]. - It also clarifies trade-offs: modifying AI (A1/A2) is flexible but costly, while modifying tools (T1/T2) is cheaper but limited by the AI's inherent capabilities [16]. Key Findings on Data Efficiency - The T2 paradigm demonstrates significantly higher data efficiency compared to the A2 paradigm. 
For instance, Search-R1, which follows the A2 paradigm, requires approximately 170,000 training samples, while a T2 approach needs only about 2,400 samples to achieve comparable results [18][19][20]. Frontiers in Adaptability Research - The article identifies four cutting-edge directions for agent adaptability research: - **Co-Adaptation**: Aims for agents and tools to optimize together within the same learning cycle, presenting challenges in credit assignment [21]. - **Continual Adaptation**: Addresses the need for agents to continuously learn new skills without forgetting old ones in a changing environment [23]. - **Safe Adaptation**: Highlights concerns that large models may erode safety measures established during supervised fine-tuning, making them more vulnerable to attacks [25]. - **Efficient Adaptation**: Focuses on resource-constrained scenarios, discussing techniques like LoRA and FlashRL for efficient learning [27]. Additional Resources - The article mentions that a GitHub repository has been opened to continuously collect related papers and resources, serving as a guide for developers building agent systems [29].
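The 2x2 framework above can be sketched as a small classifier over the two axes the survey describes (who is optimized, and where the feedback comes from). The class and field names below are illustrative stand-ins, not the survey's own terminology or code.

```python
# A minimal sketch of the 2x2 adaptation framework: two axes map every
# method onto one of the four paradigms (A1/A2/T1/T2).
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationMethod:
    name: str
    optimizes: str        # "agent" or "tool" (who is optimized)
    feedback_from: str    # "tool_execution" or "agent_output" (feedback source)


def paradigm(m: AdaptationMethod) -> str:
    """Map a method onto the A1/A2/T1/T2 quadrants described above."""
    if m.optimizes == "agent":
        return "A1" if m.feedback_from == "tool_execution" else "A2"
    return "T1" if m.feedback_from == "tool_execution" else "T2"


# Example placements, paraphrasing the article's descriptions:
assert paradigm(AdaptationMethod("code-feedback RL", "agent", "tool_execution")) == "A1"
assert paradigm(AdaptationMethod("DeepSeek-R1-style RL", "agent", "agent_output")) == "A2"
assert paradigm(AdaptationMethod("independently pretrained tool", "tool", "tool_execution")) == "T1"
assert paradigm(AdaptationMethod("tool tuned on agent output", "tool", "agent_output")) == "T2"
```

The point of the quadrant view is the trade-off stated above: the left column (A1/A2) retrains the agent, which is flexible but expensive; the right column (T1/T2) retrains only the tools, which is cheap but bounded by the agent's base ability.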
Silicon Valley Blackout Cripples Google's Robotaxis, and Musk Gloats to Their Face: Tesla Was Fine
量子位· 2025-12-22 09:30
一凡 from 凹非寺, 量子位 | 公众号 QbitAI. A large-scale blackout exposed a weakness of the world's leading robotaxi operator. Just days after reports put its valuation above $100 billion, Waymo ground to a complete halt when local power failed: cars stalled in the middle of the road, worsening city congestion, and the videos went viral. Musk was the first to twist the knife, noting that his own Robotaxi fleet was unaffected. On the surface, the L2 progressive route that Tesla represents seems to have scored a small win... at any rate, Musk sees this as a moment that shows his approach's superiority. On the robotaxi battlefield, Musk's every move this year has pushed the autonomous-driving contest to new heights, and more players on both sides of the Pacific are entering the field, advancing along the "Tesla route" and vying with the "Waymo route" for the holy grail of autonomous driving. So the question is: how did a blackout affect Waymo's robotaxis? Local blackout, Waymo out of service. The shutdown traces back to a fire: a San Francisco substation caught fire, causing a large-scale outage said to have directly affected power for 130,000 residents. Worse still, the widespread outage knocked out the traffic lights, triggering a full stop of Waymo's driverless cars. When it rains, it pours: traffic that was already chaotic became even more congested with driverless cars blocking the roads. Waymo had to tow the cars away overnight and announced a local service suspension; it is still unclear when service will resume. So why does a blackout take Waymo offline? First, the official response revealed the oper ...
A Fully Self-Developed Simulation GPU Solver × a Sim-to-Real-Calibrated Physical Measurement Factory: Building an Embodied Synthetic-Data SuperApp to Accelerate the Embodied Simulation Ecosystem | 光轮智能 (Lightwheel) @ MEET2026
量子位· 2025-12-22 08:01
Compiled by the editorial team from MEET2026, 量子位 | 公众号 QbitAI. In the move from the "language world" of large-model intelligence to the "physical world" of embodied intelligence, simulation is becoming the underlying infrastructure that connects models to real deployment. At the 量子位 MEET2026 conference, 杨海波, co-founder and president of 光轮智能 (Lightwheel), offered his observation: embodied intelligence is far larger in scale than text and vision models, because its data dimensions are more real and more complex. This means the core of the embodied-intelligence era is not the algorithm itself but whether the data it depends on is valid and scalable, and simulation is the only approach that can solve the data problem. Along the simulation path, the industry runs into pain points such as unrealistic simulation and unreliable Sim2Real transfer. 光轮智能 is addressing these with a self-developed "measure, generate, solve" simulation infrastructure stack, providing a full-pipeline solution for embodied data, training, and evaluation. △ 杨海波 notes that 光轮智能 specializes in synthetic data. He further argued that simulation is not an isolated technical tool: it must be anchored in real industry demand and build an ecosystem through application scenarios. Within that ecosystem, embodied simulation asset production is the headwater: relying on automated physical measurement and generation technology, it yields highly physically realistic, standardized data assets that serve as core fuel for embodied training; large-scale RL training then lets agents learn efficiently by trial and error in parallel virtual scenes, converting data value into real embodied skills while, in turn, refining the simulation ...
Upside Down! Gemini Flash Outperforms Pro: "The Pareto Frontier Has Flipped"
量子位· 2025-12-22 08:01
Core Insights - Gemini 3 Flash outperforms its predecessor Gemini 2.5 Pro and even the flagship Gemini 3 Pro in various benchmarks, achieving a score of 78% in the SWE-Bench Verified test, surpassing Gemini 3 Pro's score of 76.2% [1][6][9] - The performance of Gemini 3 Flash in the AIME 2025 mathematics competition benchmark is notable, scoring 99.7% with code execution capabilities, indicating its advanced mathematical reasoning skills [7][8] - The article emphasizes a shift in perception regarding flagship models, suggesting that smaller, optimized models like Flash can outperform larger models, challenging the traditional belief that larger models are inherently better [19][20] Benchmark Performance - In the Humanity's Last Exam, Flash scored 33.7% without tools, closely trailing Pro's 37.5% [7][8] - Flash's performance in various benchmarks includes: - 90.4% in GPQA Diamond for scientific knowledge [8] - 95.2% in AIME 2025 for mathematics without tools [8] - 81.2% in MMMU-Pro for multimodal understanding [8] - Flash's speed is three times that of Gemini 2.5 Pro, with a 30% reduction in token consumption, making it cost-effective at $0.50 per million tokens for input and $3.00 for output [9] Strategic Insights - Google’s team indicates that the Pro model's role is to "distill" the capabilities of Flash, focusing on optimizing performance and cost [10][12][13] - The evolution of scaling laws is discussed, with a shift from merely increasing parameters to enhancing reasoning capabilities through advanced training techniques [15][16] - The article highlights the importance of post-training as a significant area for future development, suggesting that there is still substantial room for improvement in open-ended tasks [17][18] Paradigm Shift - The emergence of Flash has sparked discussions about the validity of the "parameter supremacy" theory, as it demonstrates that smaller, more efficient models can achieve superior performance [19][21] - The integration of 
advanced reinforcement learning techniques in Flash is cited as a key factor in its success, proving that increasing model size is not the only path to enhancing capabilities [20][22] - The article concludes with a call to reconsider the blind admiration for flagship models, advocating for a more nuanced understanding of model performance [23]
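As a quick sanity check on the pricing quoted above ($0.50 per million input tokens, $3.00 per million output tokens), the per-request cost works out as follows. The request sizes in the example are invented for illustration, not benchmark figures.

```python
# Per-request cost at the quoted Gemini 3 Flash rates.
INPUT_PER_M = 0.50    # USD per million input tokens
OUTPUT_PER_M = 3.00   # USD per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M


# e.g. a hypothetical 10k-token prompt with a 2k-token reply costs about $0.011:
cost = request_cost(10_000, 2_000)
assert abs(cost - 0.011) < 1e-9
```

Output pricing dominating input pricing (6x here) is typical of current LLM APIs, which is why token-consumption reductions like the 30% figure above translate directly into cost savings.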
Letting AI Think While It Paints, Like a Human Artist: CUHK & Meituan Teach Models to "Take One Step, Look One Step"
量子位· 2025-12-22 04:41
Core Viewpoint - The article discusses the introduction of a new paradigm called Thinking-while-Generating (TwiG), which interleaves textual reasoning with visual generation to enhance the capabilities of models in generating complex images and videos, addressing limitations of existing models in handling spatial relationships and object interactions [5][19]. Group 1: Existing Challenges - Current diffusion and autoregressive models, such as FLUX.1 and Emu3, struggle with generating accurate representations of complex spatial relationships and interactions, often resulting in errors like misplacing objects or incorrect quantities [1]. - Two main approaches have been previously explored: "Think-before-Generation," which lacks flexibility, and "Think-after-Generation," which incurs high computational costs and delays [4]. Group 2: Introduction of TwiG - TwiG allows models to pause during the generation process to evaluate and plan the next steps, mimicking human artistic processes [5][7]. - The framework breaks down visual generation into a cycle of "generate-think-regenerate," enabling models to incorporate reasoning at multiple points during the creation process [7]. Group 3: Core Dimensions of TwiG - The framework consists of three key dimensions: 1. **When to Think**: The model creates a "thinking schedule" based on user prompts, optimizing the generation process into three stages that align with the semantic structure of images [8]. 2. **What to Say**: At each pause, the model generates a "thought chain" that guides the next steps in a more precise manner than traditional prompts [9]. 3. **How to Refine**: After completing a section, the model performs self-reflection to correct any mistakes immediately, enhancing efficiency [10]. Group 4: Empirical Research and Results - The research team conducted experiments on a unified multimodal model (Janus-Pro) to validate the TwiG framework, demonstrating its potential through various stages of testing [12]. 
- **Zero-Shot Performance**: The TwiG-ZS model showed remarkable "think-while-generating" capabilities without parameter updates, outperforming baseline models in multiple dimensions [13][14]. - **Supervised Fine-Tuning (SFT)**: A dataset of 50K was used for SFT, which improved the model's coherence and control over generated thought chains [16]. - **Reinforcement Learning (RL)**: The TwiG-RL model, optimized with a specific RL strategy, demonstrated competitive performance against existing models like Emu3 and FLUX.1 in key metrics [17]. Group 5: Conclusions and Future Implications - The introduction of TwiG represents a shift in how visual generation models operate, emphasizing the need for logical reasoning in generation processes [19]. - Key conclusions include the necessity of explicit reasoning for complex logic, the efficiency of local corrections over complete rewrites, and the critical role of reinforcement learning in enhancing model capabilities [20]. - The TwiG framework is designed to be compatible with diffusion models, suggesting potential applications in more complex fields such as video generation and 3D modeling [21].
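The generate-think-regenerate cycle described above can be sketched as a simple control loop. Every function below is a stand-in for a model call; the names and signatures are assumptions for illustration, not the paper's actual API — only the control flow mirrors the three dimensions (when to think, what to say, how to refine).

```python
# A paraphrased sketch of the TwiG "generate-think-regenerate" cycle.
from typing import Callable, List


def twig_generate(prompt: str,
                  schedule: List[str],                  # "when to think": planned stages
                  generate: Callable[[str, str], str],  # stage + thought -> partial output
                  think: Callable[[str, str], str],     # stage + partial -> thought chain
                  refine: Callable[[str, str], str]) -> List[str]:
    """Interleave reasoning with generation instead of thinking only
    before or only after the whole image is produced."""
    canvas: List[str] = []
    thought = prompt
    for stage in schedule:
        part = generate(stage, thought)       # draw the next region
        thought = think(stage, part)          # "what to say": plan the next step
        canvas.append(refine(part, thought))  # "how to refine": local self-correction
    return canvas
```

The key contrast with "Think-before-Generation" is that the thought chain is recomputed at every pause from what has actually been drawn so far, so later stages can correct earlier mistakes locally rather than rewriting the whole output.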
MiniMax Hailuo Video Team's First Open-Source Release: Tokenizers Also Follow a Clear Scaling Law
量子位· 2025-12-22 04:41
Core Viewpoint - The MiniMax Hailuo (海螺) video team has introduced a new scalable visual tokenizer pre-training framework (VTP) that addresses the limitations of traditional tokenizers in generating high-quality outputs from generative models, emphasizing the importance of understanding over mere pixel reconstruction [5][15][58]. Group 1: Traditional Tokenizer Limitations - Traditional tokenizers focus on pixel-level reconstruction, which does not necessarily translate to improved generation quality, leading to a saturation point where increased computational resources yield diminishing returns [4][15]. - The "pre-training scaling problem" indicates that better reconstruction accuracy can paradoxically lead to poorer generation performance, as traditional methods often overlook high-level semantic understanding [12][15]. Group 2: VTP's Approach and Innovations - VTP shifts the focus from pixel-level reconstruction to a more holistic understanding of visual semantics, integrating various representation learning methods to enhance the tokenizer's capabilities [26][30]. - The framework employs a multi-task loss function that combines understanding, reconstruction, and generation, allowing the tokenizer to produce semantically rich latent representations that improve downstream model performance [34][35]. Group 3: Empirical Findings and Performance Metrics - VTP demonstrates that injecting "understanding" into the tokenizer significantly enhances generation quality, with empirical evidence showing a positive correlation between understanding capabilities and generation performance [40][41]. - The VTP model achieved a zero-shot classification accuracy of 78.2% on ImageNet, surpassing the original CLIP's 75.5%, and exhibited superior reconstruction and generation capabilities compared to existing models [44].
Group 4: Scaling Law and Industry Implications - VTP reveals a scaling law for tokenizers, indicating that performance can improve with increased computational resources, data, and parameters, challenging the traditional view that scaling benefits only apply to main models [50][54]. - The findings suggest that investing in tokenizer development is crucial for enhancing overall generative system performance, positioning tokenizers as a core component worthy of long-term investment in the industry [58].
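The multi-task objective described in Group 2 can be sketched as a weighted sum of the three signals. The weights, term names, and the plain weighted-sum form below are placeholders for illustration, not the paper's actual formulation.

```python
# An illustrative multi-task tokenizer objective: instead of training on
# pixel reconstruction alone, VTP-style training mixes an understanding
# term, a reconstruction term, and a generation term.
def vtp_loss(l_understand: float, l_reconstruct: float, l_generate: float,
             w_u: float = 1.0, w_r: float = 1.0, w_g: float = 1.0) -> float:
    """Weighted combination of the three training signals."""
    return w_u * l_understand + w_r * l_reconstruct + w_g * l_generate


# Raising w_u trades pixel fidelity for semantic richness in the latents:
assert vtp_loss(1.0, 2.0, 3.0) == 6.0
assert vtp_loss(1.0, 2.0, 3.0, w_u=2.0) == 7.0
```

The scaling-law claim in Group 4 then says that, under such a mixed objective, tokenizer quality keeps improving with more compute, data, and parameters rather than saturating the way pure-reconstruction training does.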
量子位 Is Hiring Editors and Writers
量子位· 2025-12-22 04:41
Editorial team, from 凹非寺, 量子位 | 公众号 QbitAI. The AI boom is still surging, and if you don't yet know how to take part... why not join 量子位? We are a content platform built around tracking new AI developments. After eight years, we have top-tier influence, broad and well-recognized industry resources, and one of the best vantage points for observing and learning at the frontier of the era. We are currently hiring in three directions and hope you are (or can become) a content expert in one of them: AI Industry: infrastructure-layer innovation, including chips, AI Infra, and cloud computing; AI Finance: AI-sector venture capital and earnings, tracking capital flows along the industry chain; AI Product: progress in AI applications and hardware devices. All positions are full-time, based in Zhongguancun, Beijing. Experienced hires: all levels from editor to lead writer to editor-in-chief, matched to ability; campus hires: new graduates, with internships that can convert to full-time. Positions at every level are open for all roles; apply according to your background and experience. Responsibilities include conducting core interviews with industry experts and technical leaders, and writing case studies of AI cloud deployments. What you gain: standing at the crest of the AI wave, with first access to the latest AI technologies and products and a complete AI knowledge system; mastery of new AI tools, applying new AI technologies and tools to your work, ...
Enterprises Have Long Suffered Under SaaS; Enterprise AI Must Let "Results" Do the Talking
量子位· 2025-12-22 04:41
Core Viewpoint - The article discusses the shift from traditional SaaS models to RaaS (Result as a Service) in the AI industry, highlighting the challenges and opportunities in deploying AI solutions for enterprises [2][35]. Group 1: Challenges in SaaS and AI Deployment - Service providers are struggling with high inference costs and inconsistent delivery quality, leading to a decline in the attractiveness of SaaS in the AI era [2][8]. - Traditional paths for deploying AI involve high upfront costs and significant trial-and-error expenses, which deter many potential customers from adopting AI solutions [11][15]. - The complexity of integrating new AI systems with existing infrastructure adds to the challenges faced by enterprises [12][17]. Group 2: Emergence of RaaS - RaaS is seen as a promising alternative to SaaS, focusing on paying for results rather than just tools, which aligns better with customer needs [39][40]. - The Results Cloud by BaiRongYunChuang offers a comprehensive solution that includes infrastructure, an operating system, and an application store, addressing the pain points of traditional AI deployment [16][34]. - RaaS encourages a collaborative relationship between service providers and clients, transforming the dynamic from a client-vendor relationship to a partnership [42][44]. Group 3: Results Cloud Architecture - The Results Cloud is structured in three layers: BaiJi (infrastructure), BaiGong (operating system), and BaiHui (application store), each serving a specific purpose in the AI deployment process [19][29]. - BaiJi provides a marketplace for AI infrastructure, offering pre-packaged models and computing power without exposing the underlying complexity to clients [20][21]. - BaiGong acts as a central hub that filters and optimizes the combination of models and computing resources, significantly reducing decision-making costs for clients [25][26]. 
Group 4: Performance Measurement and Compensation - The Results Cloud aligns the performance metrics of AI employees with human employees, allowing for a more straightforward evaluation of effectiveness [46]. - Compensation models for AI employees can include task-based pricing, value-sharing agreements, or fixed salaries, ensuring that clients only pay for actual results [48][49]. - This approach mitigates concerns about upfront costs, encouraging clients to trial AI solutions without financial risk [52]. Group 5: Ecosystem Development - BaiRongYunChuang emphasizes the importance of building an ecosystem for AI solutions, inviting third-party developers to contribute to the platform [57][59]. - The company aims to create a "Silicon-based Productivity Alliance" to foster collaboration and innovation in the AI space [59][60]. - By leveraging its established technology and client base, BaiRongYunChuang seeks to facilitate market opportunities for developers and enhance the overall AI ecosystem [62][63].
AI Infra Truly Built for Large Models Must Understand Models, Systems, and Industry All at Once | 宣善明 of SenseTime's 大装置 (SenseCore) @ MEET2026
量子位· 2025-12-22 01:40
Core Viewpoints - The core strategy of the company is "1+X," where "1" represents core businesses including the 大装置 (SenseCore) AI infrastructure, large models, and AI applications, while "X" encompasses innovative businesses such as smart driving, healthcare, and retail [6][10]. AI Infrastructure Development - The company emphasizes that AI infrastructure must not only address the availability of computing power but also ensure efficient, stable, and scalable support for models and industries [3][4]. - The total computing power of the company has reached 32,000 PetaFLOPS, showcasing its commitment to building a robust AI infrastructure [6][13]. Energy Efficiency and Carbon Reduction - The AI computing center has implemented a power consumption prediction system that can accurately forecast power needs within 15 minutes, achieving a 7% annual reduction in electricity costs and over 3,000 tons of annual carbon reduction [6][21]. - The center's Power Usage Effectiveness (PUE) has reached 1.267, with a 15% improvement in overall computing efficiency [21]. Collaboration and Resource Sharing - The company has launched the "SenseTime Computing Power Mall" in collaboration with over ten domestic manufacturers, allowing clients to freely combine and allocate diverse domestic computing resources and industry model services [6][22]. - The platform supports seamless implementation of algorithms across various chips, enhancing the overall capabilities of the PaaS platform [22]. Industry Applications and Partnerships - The company has established partnerships with top-tier research institutions and various industries, including internet technology, AIGC, and traditional sectors, providing comprehensive end-to-end solutions [25][26]. - Notable collaborations include working with major clients in traditional industries to develop industry-specific AI models, demonstrating the feasibility of AI applications even in complex traditional sectors [29][30].
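For reference, PUE (Power Usage Effectiveness) is defined as total facility power divided by IT equipment power, so a value closer to 1.0 means less overhead for cooling and power delivery. The absolute kW figures below are invented for illustration; only the quoted 1.267 ratio comes from the article.

```python
# PUE = total facility power / IT equipment power.
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Data-center efficiency metric; 1.0 is the theoretical ideal."""
    return total_facility_kw / it_equipment_kw


# A hypothetical facility drawing 1267 kW overall to power 1000 kW of IT load
# matches the quoted figure:
assert round(pue(1267.0, 1000.0), 3) == 1.267
```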