量子位
ChatGPT's architect just published his latest research
量子位· 2025-09-30 12:22
Wen Le, from Aofeisi. QbitAI | WeChat official account QbitAI

Just three days after its second paper, Thinking Machines has published its third research blog post. The core author is OpenAI co-founder John Schulman, and Thinking Machines founder and former OpenAI CTO Mira Murati again reposted it in support. The third study concerns efficient fine-tuning of LoRA parameters. Titled "LoRA Without Regret", it investigates the conditions under which LoRA matches the efficiency of full fine-tuning (FullFT), and also offers a simplified recipe that greatly lowers the difficulty of hyperparameter tuning. Today's mainstream large models routinely carry trillions of parameters and pretrain on tens of trillions of tokens, yet downstream tasks typically involve only a small, domain-focused dataset; updating every parameter with FullFT wastes substantial resources. LoRA, the core parameter-efficient fine-tuning (PEFT) method, captures the fine-tuning signal with low-rank matrices A and B (whose combined parameters are far fewer than the original weights), but it has long faced one dispute: can it really catch up to FullFT's performance? John Schulman and the Thinking Machines team answer yes: with the key details handled correctly, LoRA not only matches FullFT's sample efficiency but also reaches the same final performance. The details follow. LoRA's optimal learning rate ...
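The low-rank update at the heart of LoRA can be sketched in a few lines of NumPy. Dimensions, rank, and the alpha scale below are illustrative choices, not the settings from the Thinking Machines post:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # rank r is much smaller than d; alpha is the usual LoRA scale

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

def lora_forward(x):
    # base output plus the low-rank update (alpha / r) * B @ A @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# with B = 0 the adapter is a no-op, so fine-tuning starts exactly from the pretrained model
assert np.allclose(lora_forward(x), W @ x)

# trainable parameters: r * (d_in + d_out) = 1024, versus d_in * d_out = 4096 frozen
n_trainable = r * (d_in + d_out)
```

Only A and B are updated during fine-tuning, which is where LoRA's parameter savings come from.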
First-ever synchronized generation of first-person video and human motion! New framework overcomes the two technical barriers of view-action alignment
量子位· 2025-09-30 12:22
Core Viewpoint
- The article discusses the development of EgoTwin, a framework that successfully generates first-person-perspective videos and human actions in a synchronized manner, overcoming significant challenges in perspective-action alignment and causal coupling.

Group 1: Challenges in First-Person Video Generation
- The essence of first-person video generation is a visual record driven by human actions: head movements determine camera position and orientation, while full-body actions affect body posture and surrounding scene changes [9].
- Two main challenges are identified: 1. Perspective alignment, where the camera trajectory in the generated video must precisely match the head trajectory derived from human actions [10]. 2. Causal interaction, where each visual frame provides spatial context for human actions, and newly generated actions alter subsequent visual frames [10].

Group 2: Innovations of EgoTwin
- EgoTwin employs a diffusion Transformer architecture to create a "text-video-action" tri-modal joint generation framework, addressing the aforementioned challenges through three key innovations [12].
- The first innovation is a three-channel architecture that allows the action branch to cover only the lower half of the text and video branches, ensuring effective interaction [13].
- The second innovation is a head-centered action representation, which anchors actions directly to the head joint, achieving precise alignment with first-person observations [20].
- The third innovation is an asynchronous diffusion training framework that balances efficiency and generation quality by adapting to the different sampling rates of the video and action modalities [22].

Group 3: Performance Evaluation
- EgoTwin's performance was validated on the Nymeria dataset, which includes 170,000 five-second "text-video-action" triplets captured with first-person Aria glasses [32].
- The evaluation metrics included traditional video- and action-quality indicators as well as newly proposed consistency metrics [31].
- Results showed that EgoTwin significantly outperformed baseline models across nine metrics, with notable improvements in perspective-alignment error and hand-visibility consistency [32][33].

Group 4: Applications and Implications
- EgoTwin not only reduces cross-modal errors but also provides a foundational generation platform for applications in wearable interaction, AR content creation, and embodied intelligent agent simulation [34].
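The idea of a head-centered action representation can be illustrated with a minimal sketch: express every joint's position relative to the head, so the action stream and the egocentric camera share one anchor. EgoTwin's actual parameterization also handles orientation; the names and shapes here are assumptions for illustration only:

```python
import numpy as np

def head_centered(joints_world, head_idx=0):
    """Re-express per-frame joint positions relative to the head joint.

    joints_world: (T, J, 3) array of world-frame joint positions.
    Returns a (T, J, 3) array with the head at the origin of every frame.
    (Illustrative sketch only, not EgoTwin's full representation.)
    """
    head = joints_world[:, head_idx:head_idx + 1, :]   # (T, 1, 3), kept for broadcasting
    return joints_world - head

frames, joints = 4, 5
motion = np.random.default_rng(1).normal(size=(frames, joints, 3))
rel = head_centered(motion)
# the head joint sits exactly at the origin in every frame
assert np.allclose(rel[:, 0, :], 0.0)
```

Anchoring at the head removes the global-translation ambiguity between the motion and the camera trajectory, which is the alignment problem the paper describes.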
Hailing a ride like placing an order? Hands-on with Didi's AI assistant: ride-hailing can now be "made to order"
量子位· 2025-09-30 12:22
Core Viewpoint
- The article discusses the transformative impact of AI on the ride-hailing experience through the introduction of "Xiaodi," a new intelligent assistant by Didi, which allows users to actively choose their ride preferences rather than passively waiting for a match [1][49].

Group 1: Xiaodi's Features
- Xiaodi changes the traditional ride-hailing logic by enabling users to specify their preferences, such as vehicle type, air quality, and other personalized requirements [1][20].
- Users can interact with Xiaodi through voice or text to communicate multiple needs, enhancing the customization of their ride experience [20][23].
- The interface of Xiaodi resembles a chatbot, providing a more engaging and interactive experience than traditional ride-hailing apps [10][12].

Group 2: User Experience
- The article highlights a seamless user experience in which Xiaodi not only finds suitable vehicles but also provides detailed information about each option, including model, distance, estimated arrival time, and price [16][18].
- Users can easily track their ride history and expenses, which is particularly beneficial for business travelers [31][32].
- Xiaodi can help plan cost-effective travel routes even when a ride-hailing service is not being used, showcasing its versatility [29][31].

Group 3: MCP Service
- Didi has launched the MCP service, allowing developers to integrate Xiaodi's capabilities into their applications, broadening the potential for personalized ride-hailing experiences [34][48].
- The MCP service offers different versions (Beta, Pro, Pro+) catering to various user needs, from simple experiences to comprehensive enterprise solutions [46][48].
- The rapid iteration and updates of the MCP service indicate a commitment to enhancing the AI-driven ride-hailing ecosystem [48].

Group 4: Industry Implications
- The introduction of AI in ride-hailing not only benefits passengers but also enhances the visibility and earnings of drivers who provide better services [50].
- Didi's extensive experience and technological foundation in the ride-hailing sector enable it to implement AI solutions effectively, setting a precedent for future developments in the industry [51][52].
- The article suggests that as data accumulates, the AI models will become more sophisticated, continuously improving the ride-hailing experience [52].
The competition is relentless! Right after DeepSeek's update, Zhipu updates too: GLM-4.6, the strongest domestic model for code
量子位· 2025-09-30 08:26
Core Insights
- The article discusses the launch of GLM-4.6 by Zhipu, which is claimed to have the strongest coding capabilities among domestic models, surpassing Claude Sonnet 4 [2][5].
- GLM-4.6 has shown significant improvements across benchmarks, aligning closely with Claude Sonnet 4 in most assessments [6].
- The model has reduced average token consumption by over 30% compared with its predecessor, GLM-4.5, making it the most efficient in its category [8].

Performance Testing
- Zhipu ran tests in real programming scenarios, demonstrating GLM-4.6's ability to generate a shooting game in under a minute [14].
- The model successfully created an interactive animation using p5.js, showcasing its speed and efficiency in coding tasks [18].
- On a classic physics problem, GLM-4.6 accurately simulated a ball bouncing inside a rotating hexagon while obeying physical laws [22].

Mathematical and Reasoning Abilities
- GLM-4.6 was tested on an AIME 2025 math problem, correctly identifying the answer as 70, highlighting its mathematical and multimodal capabilities [25].
- The model's reasoning abilities have been enhanced, allowing it to call tools during inference [28].

Technological Advancements
- GLM-4.6 achieved a significant milestone by implementing FP8+Int4 mixed-precision quantization on domestic chips, marking the first successful integration of this technology [27].
- The context window has been expanded from 128K to 200K tokens, enabling it to handle longer code and agentic tasks [28].
- The model's deployment on the new generation of GPUs from Moore Threads demonstrates its compatibility and adaptability within the domestic ecosystem [30].

Pricing Strategy
- Zhipu has reduced the pricing for its GLM Coding Plan, offering a subscription at one-seventh the cost of competitors while providing 90% of Claude's capability [34].
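The Int4 half of a mixed-precision scheme like the one described can be sketched generically: weights are rounded to 16 signed levels with a per-tensor scale. This is a textbook illustration of int4 quantization, not GLM-4.6's actual FP8+Int4 kernel, whose details are not given in the article:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor int4 quantization: values map to integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int4 codes."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(128,)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)

# every code fits in 4 signed bits, and round-to-nearest bounds the error by half a step
assert q.min() >= -8 and q.max() <= 7
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Storing 4-bit codes plus one scale is roughly an 8x size reduction versus float32, which is the kind of saving that makes large models deployable on constrained hardware.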
ChatGPT can now place orders and shop for you
量子位· 2025-09-30 04:36
Yi Shui, from Aofeisi. QbitAI | WeChat official account QbitAI

At last, ChatGPT has closed the loop in e-commerce. Take one and the same request: can you help me find a great housewarming gift for a friend? Perhaps handmade ceramic tableware, white and brown, under 100 dollars. Previously you would only get suggestions or recommendations (so near and yet so far); now the order can be placed and paid in a single step (buying directly inside ChatGPT). That's right, this is the shopping feature OpenAI just launched: users can order items from the Etsy and Shopify platforms directly while using ChatGPT. OpenAI president Greg Brockman confirmed the news and added that more merchants will join. Netizens say this could upend the entire e-commerce industry, with Google and Amazon potentially taking the biggest hit. So what exactly is this plan, and what is OpenAI after?

ChatGPT finally connects chat and shopping

Let's start with the details of the feature worth noting. A reminder first: for now the feature is available only to logged-in US ChatGPT Pro, Plus, and Free users ordering on Etsy. For users, little changes in how it is opened; chat and payment are simply connected. After a user describes what they want, ChatGPT recommends the most relevant products following its text reply. ...
Unitree robots hit by disclosed vulnerability that lets robots infect one another; the company responds swiftly
量子位· 2025-09-30 04:36
Core Viewpoint
- The article highlights a significant wireless security vulnerability in multiple models of robots from Unitree, which allows attackers to gain root access and potentially create a botnet of infected robots [1][4].

Vulnerability Details
- Various models of Unitree robots have a serious vulnerability in their BLE (Bluetooth Low Energy) Wi-Fi configuration interface, enabling attackers to achieve maximum control [2].
- Attackers can bypass authentication using a hardcoded key in the firmware, allowing them to execute commands with root privileges [10][11].
- The vulnerability is characterized as "wormable": once one robot is compromised, it can automatically infect other nearby Unitree devices [15][16].

Researcher Communication
- The researchers who discovered the vulnerability, Andreas Makris and Kevin Finisterre, communicated with Unitree multiple times since May 2023, but progress on fixing the issue was minimal [20][21].
- The researchers publicly released a toolchain called UniPwn on GitHub, which exploits the vulnerability, revealing that multiple security flaws still existed in Unitree's firmware as of September 20, 2025 [22][23].

Company Response
- In response to the growing concerns, Unitree acknowledged the security issues and stated that it has formed a product security team to enhance product safety [6][25].
- The company claimed to have completed most of the necessary fixes and will push updates to users soon [25].
- Unitree expressed gratitude for external oversight and aims to collaborate with others to improve safety in the robotics field [27][31].
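The failure mode behind the bypass is a classic one: authentication keyed on a secret that ships identically inside every firmware image, so anyone who extracts it once can mint valid commands for every unit. A minimal sketch of the pattern, in which the protocol, names, and key are entirely hypothetical and not Unitree's actual BLE handshake:

```python
import hashlib
import hmac

# Hypothetical: the same secret baked into every firmware image,
# recoverable from any single firmware dump.
FIRMWARE_KEY = b"same-key-on-every-unit"

def sign_command(cmd: bytes) -> bytes:
    """Tag a command the way a device with the baked-in key would expect."""
    return hmac.new(FIRMWARE_KEY, cmd, hashlib.sha256).digest()

def device_accepts(cmd: bytes, tag: bytes) -> bool:
    """The device checks the tag against the same shared key."""
    return hmac.compare_digest(sign_command(cmd), tag)

# an attacker who extracted the key forges a privileged command that passes the check
forged = b"exec-as-root: reboot"
assert device_accepts(forged, sign_command(forged))
```

The fix is equally classic: per-device secrets provisioned at manufacture, or a challenge-response scheme where the long-term secret never leaves a secure element.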
Claude Sonnet 4.5 is out, still the strongest at coding, running autonomously for 30 hours straight writing code
量子位· 2025-09-30 00:57
Core Insights
- The article discusses the release of Claude Sonnet 4.5, which shows significant improvements over its predecessor, Claude Sonnet 4, across performance metrics [2][8].

Performance Improvements
- Claude Sonnet 4.5 scored 82.0% on SWE-bench, an increase of 1.8 percentage points over Sonnet 4 [2].
- In the OSWorld test it scored 60.2, nearly a 50% improvement over Sonnet 4 [7].
- The model can write code autonomously for up to 30 hours, producing over 11,000 lines of code, a significant increase from the previous model's 7-hour capability [3][5].

Benchmark Comparisons
- Claude Sonnet 4.5 outperformed other models in various benchmarks, including agentic coding at 77.2% [10], Terminal-Bench at 50.0% [10], and high-school math (AIME 2025) with 100% accuracy using Python and 87% without tools [9][10].
- In specialized fields such as finance, healthcare, and law, it showed over 60% win rates against baseline models [11].

Safety and Alignment
- The model has undergone safety training to reduce undesirable behaviors such as sycophancy and deception, with false positives dropping from 0.15% to 0.02% [12][13].
- Claude Sonnet 4.5 has made notable advances in defending against prompt injection attacks [12].

Pricing and Accessibility
- Pricing for Claude Sonnet 4.5 remains the same as Sonnet 4: $3 per million input tokens and $15 per million output tokens [24].

New Features and SDK
- The Claude Agent SDK has been upgraded to support the development of general autonomous agents, extending its capabilities beyond coding tasks [27].
- A new feature called "Imagine with Claude" lets users generate software in real time from their requirements, enabling functional prototypes without existing templates [32].
DeepSeek suddenly embraces a homegrown GPU language! TileLang benchmarks against CUDA as a Triton replacement, and Huawei Ascend announces Day-0 support
量子位· 2025-09-30 00:57
Core Viewpoint
- The article highlights the significance of TileLang, a domain-specific language for GPU kernel development, which DeepSeek adopted in its v3.2 update, showcasing performance advantages over traditional implementations such as Flash Attention 2 [1][6][26].

Group 1: TileLang Overview
- TileLang is designed to simplify the development of high-performance GPU/CPU kernels, comparable to NVIDIA's CUDA, and DeepSeek recommends it for experiments because of its debugging and rapid-iteration advantages [6][10].
- The language lets developers achieve performance parity with existing implementations in significantly fewer lines of code [5][8].
- TileLang's development is led by a team from Peking University, including key figures Wang Lei and Dong Yuqi [15][19].

Group 2: DeepSeek's Adoption of TileLang
- DeepSeek first showcased its choice of TileLang at the Beijing Zhiyuan (BAAI) Conference in June, where its potential for faster operator implementation was discussed [10][11].
- The integration of TileLang has been recognized by industry leaders, including Huawei, which announced support for the language [7][4].
- DeepSeek's v3.2 release demonstrates that TileLang can be used effectively for model training, validating its capabilities in real-world applications [34][26].

Group 3: Performance and Technical Aspects
- TileLang provides three programming interfaces catering to different levels of developer expertise, from beginners to performance-focused experts [20][21][23].
- The language's architecture decouples the scheduling space from the data flow, enabling more efficient optimization by the compiler [19].
- DeepSeek's implementation of TileLang has yielded significant performance improvements, with claims of a 30% speedup over traditional methods [5][27].
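The "tile" abstraction such languages are named for can be illustrated in plain Python: the kernel is written per output tile, and a compiler like TileLang's maps tiles onto thread blocks and shared memory. This NumPy sketch shows only the decomposition idea, not TileLang syntax:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Matrix multiply computed tile by tile.

    A plain-Python illustration of the tile decomposition that GPU DSLs
    express; a real compiler maps each (i, j) tile to a thread block and
    stages the A and B tiles through fast shared memory.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):   # accumulate partial products over K, tile by tile
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 48))
B = rng.normal(size=(48, 96))
assert np.allclose(tiled_matmul(A, B, tile=16), A @ B)
```

Writing the kernel at tile granularity is what lets the compiler, rather than the programmer, choose the mapping to hardware, which is the productivity argument the article attributes to TileLang.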
DeepSeek's new model is live! It introduces DSA, a new sparse attention mechanism, and takes another shot at CUDA
量子位· 2025-09-29 10:44
Core Insights
- DeepSeek has launched its latest model, DeepSeek-V3.2-Exp, which introduces a new attention mechanism called DeepSeek Sparse Attention (DSA) [1][6].
- The model aims to enhance long-text processing and inference efficiency without significantly affecting output quality [7].
- A significant price reduction for the official API has been announced, starting at 50% off [3][17].

Model Updates
- DeepSeek-V3.2-Exp is built on the previous version, DeepSeek-V3.1-Terminus, which focused on stability, tool-invocation capabilities, language consistency, and error correction [9].
- In benchmark comparisons, DeepSeek-V3.2-Exp shows comparable performance to DeepSeek-V3.1-Terminus across various evaluation sets [10].
- The model demonstrates improved inference costs when handling 128K long contexts, particularly during the decoding phase [12].

Technical Innovations
- The introduction of DSA allows for fine-grained sparse attention, leading to significant improvements in processing efficiency [6][7].
- DeepSeek has open-sourced GPU operators in both TileLang and CUDA versions, facilitating research and development [13][15].
- The company recommends using the TileLang version for debugging and rapid iteration during experimental research [16].

Community Engagement
- The announcement includes a call to action for the community to try the new model and take advantage of the promotional pricing [18].
- Links to access the model on platforms such as HuggingFace and ModelScope have been provided [19].
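The general idea of sparse attention is that each query attends to a small subset of keys rather than all of them, cutting the cost that dominates long contexts. The top-k sketch below illustrates that idea only; DSA's actual selection mechanism and kernels are not described in this summary and will differ:

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k):
    """Each query attends only to its k highest-scoring keys.

    A generic sparse-attention illustration, not DeepSeek's DSA: mask every
    score below each row's k-th largest, then softmax over the survivors.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_q, n_k)
    if k < scores.shape[-1]:
        # per-row threshold at the k-th largest score; drop the rest
        thresh = np.partition(scores, -k, axis=-1)[:, [-k]]
        scores = np.where(scores >= thresh, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(12, 8))
V = rng.normal(size=(12, 8))

dense = topk_sparse_attention(Q, K, V, k=12)   # k = n_k reduces to ordinary dense attention
sparse = topk_sparse_attention(Q, K, V, k=3)   # each query reads only 3 of 12 keys
assert dense.shape == sparse.shape == (5, 8)
```

With k fixed, the per-query work scales with k instead of the full context length, which is where the claimed decoding-phase savings on 128K contexts come from.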
Billion-scale parameters, hundred-billion-scale performance: Shanghai AI Lab releases a new-generation document-parsing model whose accuracy on complex scenarios rivals human experts
量子位· 2025-09-29 10:44
Contributed by the MinerU2.5 team. QbitAI | WeChat official account QbitAI

Large models keep getting larger, with parameter counts routinely in the hundreds of billions, but delivering both high accuracy and high efficiency in real scenarios is far from easy.

| Model Type | Models | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline | Marker-1.8.2 [32] | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| Pipeline | MinerU2-pipeline [46] | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | ... |