First-person video and human motion generated in sync for the first time! New framework cracks the two technical barriers of viewpoint-motion alignment
量子位· 2025-09-30 12:22
Core Viewpoint
- The article discusses EgoTwin, a framework that for the first time generates first-person (egocentric) video and human motion in a synchronized manner, overcoming the key challenges of viewpoint-motion alignment and causal coupling.

Group 1: Challenges in First-Person Video Generation
- First-person video is essentially a visual record driven by human motion: head movement determines camera position and orientation, while full-body motion affects body posture and changes in the surrounding scene [9].
- Two main challenges are identified: (1) viewpoint alignment, where the camera trajectory in the generated video must precisely match the head trajectory derived from the human motion [10]; (2) causal interaction, where each visual frame provides spatial context for subsequent motion, and newly generated motion in turn alters subsequent visual frames [10].

Group 2: Innovations of EgoTwin
- EgoTwin employs a diffusion Transformer architecture to build a "text-video-motion" tri-modal joint generation framework, addressing the above challenges through three key innovations [12].
- The first innovation is a three-branch architecture in which the motion branch spans only the lower half of the text and video branches, ensuring effective cross-modal interaction [13].
- The second innovation is a head-centric motion representation that anchors motion directly to the head joint, achieving precise alignment with first-person observations (a hedged sketch of this idea follows below) [20].
- The third innovation is an asynchronous diffusion training framework that balances efficiency and generation quality by accommodating the different sampling rates of the video and motion modalities [22].

Group 3: Performance Evaluation
- EgoTwin was validated on the Nymeria dataset, which contains 170,000 five-second "text-video-motion" triplets captured with first-person Aria glasses [32].
- Evaluation metrics included traditional video- and motion-quality indicators as well as newly proposed consistency metrics [31].
- Results show that EgoTwin significantly outperforms baseline models across nine metrics, with notable improvements in viewpoint alignment error and hand-visibility consistency [32][33].

Group 4: Applications and Implications
- EgoTwin not only reduces cross-modal errors but also provides a foundational generation platform for wearable interaction, AR content creation, and embodied-agent simulation [34].
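To make the head-centric idea concrete, here is a minimal sketch (not EgoTwin's actual code) of how body joints can be re-expressed relative to the head joint, so that the same head pose that anchors the motion also defines the egocentric camera. Array shapes, the head-to-world rotation convention, and the rigid head-camera attachment are assumptions for the example.

```python
import numpy as np

def to_head_centric(joints_world, head_rot, head_pos):
    """joints_world: (T, J, 3) world-space joints; head_rot: (T, 3, 3)
    head-to-world rotations; head_pos: (T, 3) head positions."""
    rel = joints_world - head_pos[:, None, :]            # offsets from the head
    # Rotate offsets into the head's local frame: R^T @ rel, per frame/joint.
    return np.einsum("tji,tkj->tki", head_rot, rel)

def head_to_camera(head_rot, head_pos):
    """Per-frame world-to-camera extrinsics, assuming the egocentric camera
    is rigidly attached to the head (identity head-to-camera offset)."""
    T = head_rot.shape[0]
    w2c = np.tile(np.eye(4), (T, 1, 1))
    w2c[:, :3, :3] = head_rot.transpose(0, 2, 1)                   # R^T
    w2c[:, :3, 3] = -np.einsum("tji,tj->ti", head_rot, head_pos)   # -R^T t
    return w2c
```

With this parameterization, matching the generated camera trajectory to the head trajectory reduces to consistency between `head_to_camera` and the video branch's implied camera, which is the alignment the paper evaluates.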
Talk about fierce competition! Right after DeepSeek updates, Zhipu updates too: GLM-4.6, the strongest domestic model for code
量子位· 2025-09-30 08:26
Core Insights
- The article covers the launch of GLM-4.6 by Zhipu, claimed to have the strongest coding capability among domestic models, surpassing Claude Sonnet 4 [2][5].
- GLM-4.6 shows significant improvements across benchmarks, aligning closely with Claude Sonnet 4 in most assessments [6].
- The model reduces average token consumption by over 30% compared with its predecessor, GLM-4.5, making it the most efficient in its category [8].

Performance Testing
- Zhipu ran tests in real programming scenarios, with GLM-4.6 generating a shooting game in under a minute [14].
- The model created an interactive animation using p5.js, showcasing its speed and efficiency in coding tasks [18].
- In a classic physics problem, GLM-4.6 accurately simulated a ball bouncing inside a rotating hexagon while obeying physical laws [22].

Mathematical and Reasoning Abilities
- GLM-4.6 was tested on an AIME 2025 math problem and correctly identified the answer as 70, demonstrating its mathematical and multimodal capabilities [25].
- Its reasoning abilities have been enhanced, allowing it to call tools during inference [28].

Technological Advancements
- GLM-4.6 is the first model to run FP8+Int4 mixed-precision quantization on domestic chips (a hedged sketch of generic Int4 weight quantization follows below) [27].
- The context window has been expanded from 128K to 200K tokens, enabling longer code and agentic tasks [28].
- Deployment on Moore Threads' new-generation GPUs demonstrates compatibility and adaptability within the domestic ecosystem [30].

Pricing Strategy
- Zhipu has cut the price of its GLM Coding Plan, offering a subscription at one-seventh the cost of competitors while claiming 90% of Claude's capability [34].
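For readers unfamiliar with Int4 weight quantization, here is a minimal sketch of the generic symmetric scheme (this illustrates the technique, not Zhipu's actual FP8+Int4 kernels): weights are stored as 4-bit integers with a per-row floating-point scale and dequantized on the fly.

```python
import numpy as np

def quantize_int4(w):
    """w: (out, in) float weights -> (int4 codes stored in int8, per-row scale)."""
    # Symmetric int4 covers [-8, 7]; scale maps the row's max magnitude to 7.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(dequantize_int4(q, s) - w).max())
```

The appeal is memory: 4 bits per weight plus a small scale tensor, at the cost of the rounding error printed above; mixed-precision setups keep sensitive tensors (e.g. activations) in FP8 or higher.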
ChatGPT can now place orders and go shopping for you
量子位· 2025-09-30 04:36
Core Viewpoint
- OpenAI has launched a shopping feature in ChatGPT that lets users make purchases directly from platforms such as Etsy and Shopify, potentially disrupting the e-commerce industry and challenging giants like Google and Amazon [4][7][31].

Group 1: Shopping Feature Details
- The shopping functionality is currently available only to U.S. ChatGPT Pro, Plus, and Free users, for orders on Etsy [10].
- Users describe the product they want and ChatGPT recommends relevant items; all merchants are eligible to be featured, with selection based on relevance [12].
- Merchant ranking is determined by factors such as availability, price, quality, and whether the merchant is the primary seller of the product (an illustrative scoring sketch follows below) [14].
- OpenAI charges only a small fee on completed transactions, and users can pay with various methods including credit cards and digital wallets [20].

Group 2: Market Impact
- The feature could significantly threaten Google's advertising revenue model, since OpenAI's business model relies on transaction fees rather than advertising [33].
- Amazon's traditional role as a traffic hub and transaction facilitator may be undermined if users start purchasing directly through ChatGPT instead of visiting Amazon [34].
- The shift in consumer behavior could erode Amazon's market share as users come to prefer ChatGPT's streamlined process [35].

Group 3: Historical Context and Future Implications
- History shows that non-traditional competitors can disrupt established industries, as with Netflix's rise over Blockbuster [37].
- OpenAI's entry into e-commerce may signal a broader trend of AI reshaping traditional search and shopping paradigms [39].
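Purely as a hypothetical illustration: the article names availability, price, quality, and primary-seller status as ranking inputs, but OpenAI has published no formula. The sketch below shows one conventional way such multi-factor ranking could be combined; every weight and field name is invented.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    merchant: str
    in_stock: bool
    price: float          # lower is better
    quality: float        # normalized 0..1, e.g. from reviews
    primary_seller: bool

def rank_offers(offers, max_price):
    def score(o):
        if not o.in_stock:
            return float("-inf")                   # unavailable offers never shown
        return (0.4 * o.quality
                + 0.4 * (1 - o.price / max_price)  # cheaper -> higher score
                + 0.2 * o.primary_seller)
    return sorted(offers, key=score, reverse=True)

offers = [Offer("A", True, 25.0, 0.9, True), Offer("B", True, 19.0, 0.7, False)]
print([o.merchant for o in rank_offers(offers, max_price=30.0)])
```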
Unitree robots hit by disclosed vulnerability that lets robots infect one another; the company responds swiftly
量子位· 2025-09-30 04:36
Core Viewpoint
- The article reports a serious wireless security vulnerability affecting multiple Unitree robot models that allows attackers to gain root access and potentially build a botnet of infected robots [1][4].

Vulnerability Details
- Multiple Unitree robot models carry a serious flaw in their BLE (Bluetooth Low Energy) Wi-Fi provisioning interface, enabling attackers to gain full control [2].
- Attackers can bypass authentication using a key hardcoded in the firmware, allowing them to execute commands with root privileges (a generic sketch of this anti-pattern follows below) [10][11].
- The vulnerability is "wormable": once one robot is compromised, it can automatically infect other nearby Unitree devices [15][16].

Researcher Communication
- The researchers who discovered the vulnerability, Andreas Makris and Kevin Finisterre, had contacted Unitree multiple times since May 2025, but progress on a fix was minimal [20][21].
- The researchers publicly released a toolchain called UniPwn on GitHub that exploits the vulnerability, noting that multiple security flaws still existed in Unitree's firmware as of September 20, 2025 [22][23].

Company Response
- In response to the growing concerns, Unitree acknowledged the security issues and stated that it has formed a product security team to strengthen product safety [6][25].
- The company says it has completed most of the necessary fixes and will push updates to users soon [25].
- Unitree expressed gratitude for external scrutiny and aims to work with the community to improve safety across the robotics field [27][31].
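The following is a generic illustration of why a key hardcoded in firmware defeats authentication, not a reconstruction of Unitree's actual protocol: anyone who dumps the firmware recovers the same key every device trusts, so "valid" commands can be forged at will. All names here are hypothetical; the mitigation shown is a per-device secret.

```python
import hashlib
import hmac

HARDCODED_KEY = b"same-key-baked-into-every-unit"   # recoverable from any firmware dump

def sign(command: bytes, key: bytes) -> bytes:
    return hmac.new(key, command, hashlib.sha256).digest()

def device_accepts(command: bytes, tag: bytes, key: bytes) -> bool:
    return hmac.compare_digest(sign(command, key), tag)

# An attacker who has extracted the firmware can forge any command:
cmd = b"privileged-command"
forged_tag = sign(cmd, HARDCODED_KEY)
assert device_accepts(cmd, forged_tag, HARDCODED_KEY)      # bypass succeeds

# Mitigation sketch: a unique per-device key means one firmware dump no
# longer authenticates against every other device.
per_device_key = hashlib.sha256(b"factory-secret||serial-0042").digest()
assert not device_accepts(cmd, forged_tag, per_device_key)  # forgery rejected
```

The shared key is also what makes the flaw wormable: a compromised robot already holds the credential every neighboring device accepts.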
Claude Sonnet 4.5 gets forced out by the competition: still the strongest at coding, runs autonomously for 30 hours straight writing code
量子位· 2025-09-30 00:57
Core Insights
- The article covers the release of Claude Sonnet 4.5, which shows significant improvements over its predecessor, Claude Sonnet 4, across performance metrics [2][8].

Performance Improvements
- Claude Sonnet 4.5 scored 82.0% on SWE-bench, up 1.8 percentage points from Sonnet 4 [2].
- In the OSWorld test it scored 60.2, nearly a 50% improvement over Sonnet 4 [7].
- The model can write code autonomously for up to 30 hours, producing over 11,000 lines of code, a large jump from the previous model's 7-hour limit [3][5].

Benchmark Comparisons
- Claude Sonnet 4.5 outperformed other models on several benchmarks: agentic coding 77.2% [10]; Terminal-Bench 50.0% [10]; high-school math (AIME 2025) 100% accuracy with Python tools and 87% without tools [9][10].
- In specialized fields such as finance, healthcare, and law, it achieved win rates above 60% against baseline models [11].

Safety and Alignment
- The model underwent safety training to reduce undesirable behaviors such as sycophancy and deception, with false positives dropping from 0.15% to 0.02% [12][13].
- Claude Sonnet 4.5 also made notable advances in defending against prompt injection attacks [12].

Pricing and Accessibility
- Pricing is unchanged from Sonnet 4: $3 per million input tokens and $15 per million output tokens (a quick cost sketch follows below) [24].

New Features and SDK
- The Claude Agent SDK has been upgraded to support building general-purpose autonomous agents, extending its capabilities beyond coding tasks [27].
- A new feature called "Imagine with Claude" generates software in real time from user requirements, enabling functional prototypes without existing templates [32].
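A quick cost check using only the pricing stated above ($3 / $15 per million input / output tokens); the token counts in the example call are made up for illustration.

```python
INPUT_PER_M, OUTPUT_PER_M = 3.00, 15.00   # USD per 1M tokens, per the article

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a long agentic session: 2M tokens in, 500K tokens out
print(f"${request_cost(2_000_000, 500_000):.2f}")   # -> $13.50
```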
DeepSeek suddenly embraces a domestic GPU language! TileLang takes on CUDA as a Triton replacement; Huawei Ascend announces Day-0 support and adaptation
量子位· 2025-09-30 00:57
Core Viewpoint
- The article highlights TileLang, a domain-specific language for GPU kernel development adopted by DeepSeek in its v3.2 update, showcasing performance advantages over traditional implementations such as Flash Attention 2 [1][6][26].

Group 1: TileLang Overview
- TileLang is designed to simplify the development of high-performance GPU/CPU kernels, positioned as an alternative to writing NVIDIA CUDA directly, and is recommended by DeepSeek for experiments thanks to its debugging and rapid-iteration advantages [6][10].
- The language lets developers match the performance of existing implementations with far fewer lines of code (a hedged matmul sketch in TileLang's published style follows below) [5][8].
- TileLang's development is led by a team from Peking University, including key figures Wang Lei and Dong Yuqi [15][19].

Group 2: DeepSeek's Adoption of TileLang
- DeepSeek's choice of TileLang was first showcased at the Beijing Zhiyuan (BAAI) Conference in June, where its potential for faster operator implementation was discussed [10][11].
- The integration has been recognized by industry leaders, including Huawei, which announced support for the language [7][4].
- DeepSeek's v3.2 release demonstrates that TileLang can be used for model training, validating its capabilities in real-world applications [34][26].

Group 3: Performance and Technical Aspects
- TileLang provides three programming interfaces for different levels of developer expertise, from beginners to performance-focused experts [20][21][23].
- Its architecture decouples the scheduling space from the data flow, enabling more efficient optimization by the compiler [19].
- DeepSeek's implementation reports significant performance gains, with claims of a 30% speedup over traditional methods [5][27].
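To give a feel for the tile-level style, here is a sketch adapted from TileLang's public tiled-matmul example. Exact API details (e.g. `T.Tensor` vs. `T.Buffer`, the jit/compile entry points) vary across TileLang versions, so treat this as illustrative rather than copy-paste ready.

```python
import tilelang.language as T

M = N = K = 1024
BLK_M, BLK_N, BLK_K = 128, 128, 32

@T.prim_func
def matmul(A: T.Tensor((M, K), "float16"),
           B: T.Tensor((K, N), "float16"),
           C: T.Tensor((M, N), "float32")):
    # One thread block per (BLK_M, BLK_N) output tile.
    with T.Kernel(T.ceildiv(N, BLK_N), T.ceildiv(M, BLK_M), threads=128) as (bx, by):
        A_s = T.alloc_shared((BLK_M, BLK_K), "float16")
        B_s = T.alloc_shared((BLK_K, BLK_N), "float16")
        C_l = T.alloc_fragment((BLK_M, BLK_N), "float32")
        T.clear(C_l)
        # Software-pipelined loop over K tiles: stage tiles into shared
        # memory, then issue a tile-level GEMM the compiler lowers to
        # hardware matrix units.
        for ko in T.Pipelined(T.ceildiv(K, BLK_K), num_stages=3):
            T.copy(A[by * BLK_M, ko * BLK_K], A_s)
            T.copy(B[ko * BLK_K, bx * BLK_N], B_s)
            T.gemm(A_s, B_s, C_l)
        T.copy(C_l, C[by * BLK_M, bx * BLK_N])
```

The point the article makes is visible here: memory staging, pipelining, and tensor-core dispatch are one-line tile primitives rather than hand-written CUDA.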
DeepSeek's new model goes live! It introduces DSA, a new sparse attention mechanism, and takes another shot at CUDA
量子位· 2025-09-29 10:44
Core Insights
- DeepSeek has launched its latest model, DeepSeek-V3.2-Exp, which introduces a new attention mechanism called DeepSeek Sparse Attention (DSA) [1][6].
- The model aims to improve long-text processing and inference efficiency without significantly affecting output quality [7].
- A substantial price cut for the official API has been announced, starting at 50% off [3][17].

Model Updates
- DeepSeek-V3.2-Exp builds on the previous version, DeepSeek-V3.1-Terminus, which focused on stability, tool-invocation capability, language consistency, and error correction [9].
- In benchmark comparisons, DeepSeek-V3.2-Exp performs on par with DeepSeek-V3.1-Terminus across evaluation sets [10].
- The model shows improved inference costs when handling 128K-token contexts, particularly during the decoding phase [12].

Technical Innovations
- DSA introduces a fine-grained sparse attention mechanism, yielding significant gains in processing efficiency (a hedged top-k sparse-attention sketch follows below) [6][7].
- DeepSeek has open-sourced GPU operators in both TileLang and CUDA versions, facilitating research and development [13][15].
- The company recommends the TileLang version for debugging and rapid iteration during experimental research [16].

Community Engagement
- The announcement invites the community to try the new model and take advantage of the promotional pricing [18].
- Links to the model on platforms such as HuggingFace and ModelScope are provided [19].
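Here is an illustrative sketch of fine-grained sparse attention in the spirit of DSA, not DeepSeek's actual kernel: a cheap indexer scores past tokens and each query attends only to its top-k keys, cutting decode-time attention cost from O(L) to O(k) per query. The dot-product indexer and the value of k are assumptions for the example.

```python
import numpy as np

def sparse_attention(q, K, V, k_top=64):
    """q: (d,) single decode-step query; K, V: (L, d) cached keys/values."""
    L, d = K.shape
    scores = K @ q / np.sqrt(d)       # indexer: score every cached token cheaply
    k = min(k_top, L)
    idx = np.argpartition(scores, -k)[-k:]   # keep only the top-k keys
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()                      # softmax over the selected keys only
    return w @ V[idx]                 # (d,) attention output

rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((4096, 128))
V = rng.standard_normal((4096, 128))
out = sparse_attention(q, K, V, k_top=64)
```

In a real kernel the indexer is itself a learned, low-cost scorer and the gather is fused into the attention kernel, which is where the TileLang/CUDA operators come in.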
Billion-scale parameters, hundred-billion-scale performance: Shanghai AI Lab releases a new-generation document-parsing model whose accuracy in complex scenarios rivals human experts
量子位· 2025-09-29 10:44
Submitted by the MinerU2.5 team
量子位 | WeChat official account QbitAI

Large models keep getting bigger, with parameter counts routinely in the hundreds of billions, yet delivering both high accuracy and high efficiency in real-world scenarios remains difficult.

| Model Type | Models | Slides | Academic Papers | Book | Textbook | Exam Papers | Magazine | Newspaper | Notes | Financial Report |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline | Marker-1.8.2 [32] | 0.1796 | 0.0412 | 0.1010 | 0.2908 | 0.2958 | 0.1111 | 0.2717 | 0.4656 | 0.0341 |
| Pipeline | MinerU2-pipeline [46] | 0.4244 | 0.0230 | 0.2628 | 0.1224 | 0.0822 | 0.395 | 0.0736 | 0.2603 | … |
New feed-forward 3D Gaussian splatting method: Zhejiang University team proposes "voxel alignment" to fuse multi-view 2D information directly in 3D space
量子位· 2025-09-29 04:57
Core Viewpoint
- The article discusses the rapid industrialization of feed-forward 3D Gaussian Splatting (3DGS) and introduces VolSplat, which abandons the traditional pixel-aligned strategy in favor of a voxel-aligned framework, improving robustness, efficiency, and engineering feasibility in multi-view rendering [1][2].

Introduction to VolSplat
- VolSplat addresses the limitations of existing pixel-aligned methods, which struggle to align 2D features precisely in 3D space and are constrained by the pixel grid when allocating Gaussian density [2][6].

Performance Comparison
- Experiments on public datasets such as RealEstate10K and ScanNet show that VolSplat outperforms various pixel-aligned baselines in visual quality and geometric consistency [4][5].

Core Concepts of VolSplat
- The core idea is to shift alignment from 2D to 3D, enabling better integration of multi-view information and overcoming challenges in multi-view consistency and Gaussian density allocation [6][9].

Methodology Breakdown
- The VolSplat pipeline consists of three modules: (1) 2D feature extraction and depth estimation; (2) lifting pixels to voxels and feature aggregation; (3) sparse 3D refinement and Gaussian regression [9][11].
- Step 1: 2D features are extracted with a shared encoder, and depth maps provide the geometric priors needed for subsequent processing [11].
- Step 2: Pixels are unprojected into 3D space using the predicted depths, producing a point cloud that is voxelized for feature aggregation, enhancing cross-view consistency (a hedged unproject-and-voxelize sketch follows below) [12][13].
- Step 3: A sparse 3D U-Net refines the voxel features, predicting per-voxel corrections and regressing Gaussian parameters for rendering [14].

Experimental Highlights
- VolSplat shows strong zero-shot generalization across datasets, maintaining high performance on unseen data, with a PSNR of 32.65 dB on the ACID dataset [15][17].

Practical Implications
- The advances yield fewer artifacts and better geometric fidelity, translating into improved user experience in applications such as virtual tours and indoor navigation [17][19].

Future Directions
- VolSplat opens new avenues for 3D reconstruction, robotics, autonomous driving, and AR/VR, providing a unified framework for integrating multimodal data [19][20].
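Below is a hedged sketch of the "lift pixels to voxels" step (Step 2): unproject each pixel with its predicted depth and camera parameters, then scatter-average per-pixel features into a voxel grid shared across views. This is a generic reconstruction of the idea, not VolSplat's actual code; the voxel size and tensor shapes are assumptions.

```python
import numpy as np

def unproject(depth, feat, K, cam_to_world):
    """depth: (H, W); feat: (H, W, C); K: (3, 3) intrinsics; cam_to_world: (4, 4)."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    # Back-project pixel rays, scale by predicted depth, move to world frame.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)                        # (3, H*W)
    pts = (cam_to_world[:3, :3] @ pts_cam).T + cam_to_world[:3, 3]
    return pts, feat.reshape(-1, feat.shape[-1])                 # world points, features

def voxelize(points, feats, voxel_size=0.05):
    """Average the features of all points (from all views) falling in a voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    acc = np.zeros((len(uniq), feats.shape[1]))
    cnt = np.zeros(len(uniq))
    np.add.at(acc, inv, feats)
    np.add.at(cnt, inv, 1)
    return uniq, acc / cnt[:, None]    # voxel coordinates, aggregated features
```

Because features from different views land in the same voxel and get averaged, the downstream 3D network sees one fused representation per location rather than per-pixel predictions that must later be reconciled, which is the cross-view consistency argument above.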
Huawei Pangu 718B model's latest results: second place among open-source models
量子位· 2025-09-29 04:57
Core Viewpoint
- Huawei has emerged as a strong competitor in the AI model landscape, highlighted by its performance across multiple dimensions in the latest SuperCLUE benchmark evaluation [1][2].

Group 1: Model Rankings and Performance
- The top three models in the SuperCLUE evaluation among open-source, domestic models are: (1) DeepSeek-V3.1-Terminus-Thinking; (2) openPangu-Ultra-MoE-718B; (3) Qwen3-235B-A22B-Thinking-2507 [5].
- Huawei's openPangu-Ultra-MoE-718B, with 718 billion parameters, stands out for a training philosophy that emphasizes data quality over sheer volume [6][35].

Group 2: Data Quality and Training Strategy
- The openPangu team follows three core principles in post-training data construction: quality first, diversity coverage, and complexity adaptation [10][21].
- A comprehensive framework for data generation, scientific selection, and precise enhancement ensures high data quality, which is crucial for improving reasoning in complex scenarios [13][35].

Group 3: Pre-training and Optimization Techniques
- Pre-training for openPangu-718B is divided into three stages: General, Reasoning, and Annealing, each targeting a different aspect of knowledge and reasoning enhancement [15][35].
- The model employs a "Critique Internalization" mechanism to mitigate hallucinations, letting it self-evaluate its reasoning process and improve output reliability [19][22].

Group 4: Tool Usage and Agent Capabilities
- To strengthen tool use, the team developed the ToolACE framework, which generates high-quality, complex multi-tool interaction data for training (a hypothetical sample format is sketched below) [23][26].
- Training includes a three-step post-training fine-tuning scheme that balances underfitting against overfitting [27][29].

Group 5: Technical Innovations and Industry Implications
- Systematic technical innovation across the training stages underpins openPangu-718B's performance, offering the industry a useful example of meticulous technical refinement and deep insight into core challenges [35].
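Purely hypothetical sketch of what one multi-tool interaction training sample might look like: the article says ToolACE generates such data but does not publish its schema, so every field name below is invented.

```python
# One synthetic training sample pairing a query with the multi-step
# tool-call trajectory the model should learn to reproduce.
sample = {
    "query": "What's the weather in Shenzhen tomorrow, and book me a taxi if it rains?",
    "tools": [
        {"name": "get_weather", "params": {"city": "str", "date": "str"}},
        {"name": "book_taxi",   "params": {"pickup": "str", "time": "str"}},
    ],
    "trajectory": [
        {"call": "get_weather", "args": {"city": "Shenzhen", "date": "2025-10-01"}},
        {"observation": {"condition": "rain", "temp_c": 24}},
        {"call": "book_taxi", "args": {"pickup": "home", "time": "08:00"}},
        {"observation": {"status": "confirmed"}},
        {"answer": "Rain is expected tomorrow, so I booked your taxi for 08:00."},
    ],
}
```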