Zhipu's luck falls just short: its visual-token research collides with DeepSeek yet again
量子位· 2025-10-22 15:27
Core Viewpoint
- The article discusses the competition between Zhipu and DeepSeek in the AI field, particularly focusing on the release of Zhipu's visual token solution, Glyph, which aims to address the challenges of long context in large language models (LLMs) [1][2][6]

Group 1: Context Expansion Challenges
- The demand for long context in LLMs is increasing due to various applications such as document analysis and multi-turn dialogues [8]
- Expanding context length significantly increases computational costs; for instance, increasing context from 50K to 100K tokens can quadruple the computational consumption [9][10]
- Merely adding more tokens does not guarantee improved model performance, as excessive input can lead to noise interference and information overload [12][14]

Group 2: Existing Solutions
- Three mainstream solutions to the long context problem are identified:
  1. **Extended Position Encoding**: extends the existing position encoding range to accommodate longer inputs without retraining the model [15][16]
  2. **Attention Mechanism Modification**: techniques like sparse and linear attention improve token processing efficiency, but do not reduce the total token count [20][21]
  3. **Retrieval-Augmented Generation (RAG)**: uses external retrieval to shorten inputs, but may slow down overall response time [22][23]

Group 3: Glyph Framework
- Glyph proposes a new paradigm by converting long texts into images, allowing for higher information density and efficient processing by visual language models (VLMs) [25][26]
- By using visual tokens, Glyph can significantly reduce the number of tokens needed; for example, it can represent the entire text of "Jane Eyre" using only 80K visual tokens compared to 240K text tokens [32][36]
- The training process for Glyph involves three stages: continual pre-training, LLM-driven rendering search, and post-training, which collectively enhance the model's ability to interpret visual information [37][44]

Group 4: Performance and Results
- Glyph achieves a token compression rate of 3-4x while maintaining accuracy comparable to mainstream models [49]
- The implementation of Glyph results in approximately 4x faster prefill and decoding speeds, as well as 2x faster supervised fine-tuning (SFT) training [51]
- Glyph demonstrates strong performance in multimodal tasks, indicating robust generalization capabilities [53]

Group 5: Contributors and Future Implications
- The primary author of the paper is Jiale Cheng, a PhD student at Tsinghua University, with contributions from Yusen Liu, Xinyu Zhang, and Yulin Fei [57][62]
- The article suggests that visual tokens may redefine how LLMs process information, potentially leading to pixels replacing text as the fundamental unit of AI input [76][78]
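The reported compression can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not Glyph's pipeline; the characters-per-page, page resolution, patch size, and characters-per-token figures are assumptions chosen only to illustrate how rendering density drives the visual-token count:

```python
import math

# Assumed constants (illustrative, NOT from the Glyph paper):
CHARS_PER_TEXT_TOKEN = 4      # rough average for English BPE tokenizers
CHARS_PER_PAGE = 12_000       # densely rendered text per page image
PAGE_PX, PATCH_PX = 896, 28   # page resolution and ViT-style patch size

def text_tokens(n_chars: int) -> int:
    """Approximate text-token count for a character count."""
    return math.ceil(n_chars / CHARS_PER_TEXT_TOKEN)

def visual_tokens(n_chars: int) -> int:
    """Approximate visual-token count if the text is rendered to page images."""
    pages = math.ceil(n_chars / CHARS_PER_PAGE)
    patches_per_page = (PAGE_PX // PATCH_PX) ** 2   # 32 x 32 = 1024 patches
    return pages * patches_per_page

# "Jane Eyre" is roughly 960K characters (~240K text tokens, as in the article)
n = 960_000
ratio = text_tokens(n) / visual_tokens(n)
print(text_tokens(n), visual_tokens(n), round(ratio, 2))  # → 240000 81920 2.93
```

With these assumed numbers the ratio lands near the 3-4x compression the article reports; in practice the ratio depends on font size and rendering resolution, which is exactly what Glyph's LLM-driven rendering search stage tunes.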
Tsinghua and NVIDIA build a new distillation paradigm for diffusion models: video generation 50x faster, clean output in 4 steps without clipping artifacts
量子位· 2025-10-22 09:12
Core Insights
- The article discusses a new distillation paradigm called rCM that significantly enhances video generation speed by up to 50x while maintaining high quality and diversity in the generated content [4][20][33]

Group 1: Introduction of rCM
- rCM is a novel large-scale diffusion model distillation paradigm developed by Tsinghua University and NVIDIA, which successfully extends continuous-time consistency distillation to billion-parameter models [5][9]
- The method addresses bottlenecks in existing approaches, particularly in real-world applications involving large-scale text-to-image and text-to-video models [3][9]

Group 2: Technical Innovations
- The rCM framework introduces joint optimization of forward and reverse divergences, which enhances inference speed while ensuring high-quality and diverse generation results [4][11]
- By utilizing self-developed FlashAttention-2 JVP CUDA operators and compatible distributed training strategies, rCM successfully applies continuous-time consistency distillation to leading models like Cosmos and Wan2.1 [13][18]

Group 3: Performance Metrics
- rCM demonstrates strong performance across large-scale text-to-image and text-to-video tasks, compressing the sampling process from hundreds of steps down to 1-4 steps, a speedup of 15-50x [20][21]
- In evaluations, the rCM model matches or even surpasses the performance of teacher models that require hundreds of sampling steps [21][25]

Group 4: Quality and Diversity
- rCM addresses the quality shortcomings of previous methods by incorporating reverse divergence as a regularization term, allowing it to maintain high diversity while improving quality [19][22]
- Compared to previous state-of-the-art distillation methods, rCM exhibits significantly higher diversity in generated video content, effectively avoiding "mode collapse" [25][31]

Group 5: Future Applications
- rCM is expected to be widely applied in NVIDIA's Cosmos series of world models, indicating its potential for broader industry adoption [34]
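The forward/reverse divergence tradeoff behind rCM can be illustrated on a toy problem. The sketch below is not rCM's training objective; it fits a single Gaussian to a bimodal "data" distribution under each divergence (by grid search over made-up parameter ranges) to show why the forward direction preserves diversity (mode-covering) while the reverse direction sharpens quality at the cost of mode collapse (mode-seeking):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p(x):
    """Bimodal 'data' distribution with modes at -2 and +2."""
    return 0.5 * normal_pdf(x, -2, 0.3) + 0.5 * normal_pdf(x, 2, 0.3)

def kl(f, g, lo=-6.0, hi=6.0, n=400):
    """Numerical KL(f || g) via midpoint rule on a uniform grid."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        fx = f(x)
        if fx > 1e-12:
            total += fx * math.log(fx / max(g(x), 1e-300)) * dx
    return total

def fit(divergence):
    """Grid-search the best single Gaussian q under the given divergence."""
    best = None
    for mu in [m / 5 for m in range(-15, 16)]:    # mu in [-3, 3]
        for s in [k / 5 for k in range(1, 16)]:   # sigma in [0.2, 3]
            q = lambda x, mu=mu, s=s: normal_pdf(x, mu, s)
            d = divergence(q)
            if best is None or d < best[0]:
                best = (d, mu, s)
    return best

_, fwd_mu, fwd_s = fit(lambda q: kl(p, q))  # forward KL: mode-covering (wide q)
_, rev_mu, rev_s = fit(lambda q: kl(q, p))  # reverse KL: mode-seeking (one mode)
```

The forward fit lands near mu = 0 with a large sigma spanning both modes (diversity), while the reverse fit collapses onto a single mode with small sigma (sharpness). Per the article, rCM's contribution is combining the two: a forward-divergence consistency objective for diversity plus reverse divergence as a quality regularizer, scaled up with JVP-capable attention kernels.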
KTransformers accepted at a top computer-systems conference and partnering with mainstream frameworks: 趋境科技 and Tsinghua make heterogeneous inference a new paradigm
量子位· 2025-10-22 09:12
Core Insights
- KTransformers, an open-source project developed by 趋境科技 and Tsinghua University's KVCache.AI team, focuses on system innovation during the inference phase of large models, enabling efficient operation on diverse hardware architectures with lower computational power [2][4]

Group 1: KTransformers Overview
- KTransformers is a high-performance heterogeneous inference framework that optimally utilizes computing resources such as GPUs, CPUs, and memory [2]
- The project paper was accepted at the prestigious SOSP 2025 conference, highlighting its significance in the field of computer systems [2][4]

Group 2: Technical Innovations
- The framework introduces an "Expert Deferral" mechanism for efficient scheduling of experts in Mixture of Experts (MoE) models, which reduces computational load without sacrificing model performance [7][13]
- KTransformers achieves nearly 4x speedup on a single Intel Xeon processor compared to traditional PyTorch implementations, significantly enhancing CPU performance in expert calculations [12]
- The system dynamically overlaps CPU and GPU loads, increasing model throughput by approximately 1.45x with minimal impact on model accuracy [15][16]

Group 3: Collaboration and Ecosystem
- KTransformers has partnered with SGLang, a mainstream inference framework, to integrate full-GPU inference with heterogeneous inference, enhancing the overall architecture for large-model deployment [5][19]
- This collaboration gives developers seamless access to both full-GPU and heterogeneous inference capabilities, particularly beneficial in scenarios with limited GPU resources [21]

Group 4: Market Position and Future Directions
- KTransformers has gained significant traction in the developer community, with over 15.2K stars on GitHub, indicating its widespread adoption as a foundational framework for large-model inference [24]
- The project aims to democratize AI capabilities, making them accessible beyond elite computational paths, and is actively collaborating with various domestic CPU and GPU platforms to promote cost-effective solutions [28][29]
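The throughput gain from overlapping CPU expert work with GPU attention can be seen in a toy pipeline model. This is not KTransformers' scheduler; the per-layer costs are made-up numbers, and "deferral" is modeled simply as letting the next layer's GPU work start without waiting for the previous layer's CPU experts:

```python
def sequential_makespan(layers):
    """Baseline: each layer runs attention (GPU) then experts (CPU), strictly in order."""
    return sum(attn + expert for attn, expert in layers)

def overlapped_makespan(layers):
    """Expert-deferral style: GPU attention never waits on deferred CPU experts;
    each layer's experts run on the CPU once that layer's attention has finished."""
    gpu_t = cpu_t = 0.0
    for attn, expert in layers:
        gpu_t += attn                       # attention back-to-back on the GPU
        cpu_t = max(cpu_t, gpu_t) + expert  # experts queue on the CPU
    return max(gpu_t, cpu_t)

layers = [(1.0, 2.0)] * 4  # hypothetical (attention, expert) costs per layer
base = sequential_makespan(layers)  # 12.0
over = overlapped_makespan(layers)  # 9.0
print(base / over)  # ~1.33x here; the article reports ~1.45x in the real system
```

The toy numbers give a speedup in the same ballpark as the reported 1.45x; the real gain depends on how well expert and attention costs balance across the two devices.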
The Artificial Intelligence Annual Rankings are open for registration! Five awards seeking the pioneering forces of the AI+ era
量子位· 2025-10-22 09:12
From the Organizing Committee at 凹非寺
量子位 | WeChat official account QbitAI

To let more practitioners feel the leap of the intelligence wave, and to offer applause and encouragement to fellow travelers, we are officially opening registration for the "2025 Artificial Intelligence Annual Rankings".

This year's selection covers three dimensions (companies, products, and people) with five award categories. Companies are warmly invited to register! Let us witness the stars of the year together and light the way toward the future.

Company List / Product List / People List
2025 AI Person of the Year
Detailed criteria and registration instructions follow.

Focused on innovative entrepreneurial forces in China's AI field, this award will select the AI startups with the greatest investment value and growth potential. Eligibility: Criteria:

The 2025 AI Leading Enterprise award covers China's AI field and will select the companies with the strongest overall capabilities. Eligibility:

2025 AI Leading Enterprise
2025 AI Promising Startup
2025 AI Outstanding Product
2025 AI Outstanding Solution

1. Registered in China, or with main business primarily serving the Chinese market;
2. Main business in AI or related industries, or AI widely applied to the main business, with an industry-leading position in its niche;
Criteria: 2025 AI Promising Startup
3. Mature products or services with real customer deployments and market recognition;
4. In the past year, in technology ...
Tencent open-sources Hunyuan World Model 1.1: videos become 3D worlds in seconds, single-GPU inference in just 1 second
量子位· 2025-10-22 09:12
Core Viewpoint
- Tencent has released and open-sourced the Hunyuan World Model 1.1, a unified end-to-end 3D reconstruction model that supports generating 3D worlds from multiple views or videos with high precision and efficiency [1][3][16]

Group 1: Model Features
- Hunyuan World Model 1.1 is the industry's first unified feedforward 3D reconstruction model, capable of handling various input modalities and producing multiple outputs simultaneously, achieving state-of-the-art (SOTA) performance [4][18][21]
- The model supports flexible input handling, allowing the integration of camera poses, intrinsic parameters, and depth maps to enhance reconstruction quality [18][20]
- It runs on a single GPU with one-second inference, significantly faster than traditional methods that may take minutes or hours [22][24]

Group 2: Performance Comparison
- In comparisons with Meta's MapAnything and AnySplat models, Hunyuan World Model 1.1 demonstrated superior surface smoothness and scene regularity in 3D point-cloud reconstruction tasks [11][12][14]
- The model excels in both geometric accuracy and detail restoration, providing more stable and realistic scene reconstructions compared to its competitors [14][15]

Group 3: User Accessibility
- The model is fully open-sourced, allowing developers to clone it from GitHub and deploy it locally, while ordinary users can access it online to generate 3D scenes from uploaded images or videos [34][37]
- The technology aims to democratize 3D reconstruction, making it possible for anyone to create professional-level 3D scenes in seconds [37]
A world first: high-performance humanoid robots that run and jump enter the 10,000-yuan era
量子位· 2025-10-22 09:12
Core Viewpoint
- The article discusses the launch of Bumi, billed as the world's first high-performance humanoid robot priced under 10,000 yuan, aimed at the consumer market and making advanced robotics accessible to households [3][9]

Group 1: Product Features
- Bumi is a humanoid robot that can walk, jump, and interact, designed to be a programming teacher and a play companion for children [12][13]
- The robot weighs 12 kg and stands under one meter tall, making it easy to handle [6]
- Bumi's capabilities include stable walking and dancing, showcasing advanced motion-control technology [19][21]

Group 2: Educational Value
- Bumi allows children to learn programming through a drag-and-drop interface, enabling them to design sequences of actions for the robot [24][26]
- This interactive learning experience is positioned as a more engaging alternative to traditional extracurricular classes for children [27]

Group 3: Company Background
- The company behind Bumi, Songyan Power, was founded less than two years ago and has a young team, primarily composed of graduates of Tsinghua University [28][30]
- The founder, Jiang Zheyuan, has a notable academic background, having progressed through Tsinghua's affiliated schools from kindergarten all the way to doctoral studies at the university [30][32]
- Songyan Power has previously developed several advanced robotic products, demonstrating a strong technical foundation [33][34]

Group 4: Market Position and Future Outlook
- Bumi represents a significant step toward making humanoid robots part of everyday life, marking a shift in the perception of robotics from futuristic concept to practical household item [49][50]
- The company has completed multiple rounds of financing, positioning itself among the leading players in the commercialization of robotics [45]
A rundown of all the ICCV awards, and congratulations to Jun-Yan Zhu's team on Best Paper
量子位· 2025-10-22 05:48
Core Points
- The ICCV 2025 conference in Hawaii highlighted significant contributions from Chinese researchers, who accounted for 50% of paper submissions [1]
- Various prestigious awards were announced, showcasing advancements in computer vision research [3]

Award Highlights
- Best Paper Award (Marr Prize): "Generating Physically Stable and Buildable Brick Structures from Text" introduced BRICKGPT, a model that generates stable brick structures from text prompts, built on a dataset of over 47,000 structures [4][24][26]
- Best Student Paper Award: "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models" proposed an inversion-free method for image editing, achieving state-of-the-art results [6][39][40]
- Best Paper Honorable Mention: "Spatially-Varying Autofocus" developed a technique for dynamic depth adjustment in imaging, enhancing focus clarity across scenes [7][42][44]
- Best Student Paper Honorable Mention: "RayZer: A Self-supervised Large View Synthesis Model" demonstrated 3D perception capabilities using uncalibrated images [9][47][49]

Special Awards
- Helmholtz Prize: awarded to "Fast R-CNN" for its efficient object detection, which significantly improved training and testing speeds [10][52][54]
- A second Helmholtz Prize went to research on rectified activation functions, which achieved performance surpassing human-level accuracy on ImageNet [10][59][60]
- Everingham Prize: recognized teams for their contributions to 3D modeling and visual question answering [12][63][68]
- Distinguished Researcher Award: David Forsyth and Michal Irani were honored for their impactful work in computer vision [14][73][76]
- Azriel Rosenfeld Lifetime Achievement Award: Rama Chellappa was recognized for his extensive contributions to the field [16][79]

Research Contributions
- The BRICKGPT model generates physically stable structures, utilizing a large dataset and innovative mechanisms for stability [24][26]
- FlowEdit's approach allows seamless image editing across different model architectures, enhancing flexibility in applications [39][40]
- The spatially-varying autofocus technique improves image clarity by dynamically adjusting focus based on scene depth [42][44]
- RayZer's self-supervised learning approach enables 3D scene reconstruction without the need for calibrated camera data [47][49]

Conclusion
- The ICCV 2025 conference showcased groundbreaking research and innovation in computer vision, with significant contributions from many teams and individuals, particularly highlighting the achievements of Chinese researchers [1][3]
Qwen Deep Research gets an overnight upgrade: it can generate web pages and audio podcasts, and the new model can read doctors' handwriting
量子位· 2025-10-22 05:48
Core Insights
- The article discusses advancements in Qwen's deep research capabilities, highlighting the addition of auditory and visual outputs that enable the generation of web pages and audio content [1][2]

Group 1: New Features and Functionalities
- The Qwen deep research tool can now convert lengthy text into audio podcasts, making information easier to consume during fragmented time [3]
- Compared with the previously popular NotebookLM, the deep research tool removes the need for users to supply source content to the AI, streamlining the input process [4]
- The latest visual language model, Qwen3-VL, can even recognize hard-to-read handwritten notes, showcasing significant improvements in model capabilities [7]

Group 2: User Interaction and Experience
- Upon activating the deep research feature, the system defaults to the most powerful Qwen3-Max model and first confirms the user's specific intent before proceeding [9][10]
- The entire run takes approximately six minutes, producing both a conventional AI text response and a downloadable PDF file [12][15]

Group 3: Performance Metrics and Comparisons
- The Qwen3-VL series has been updated with versions ranging from 2 billion to 32 billion parameters, and the team indicates this is the final update for the series [28][29]
- Performance evaluations show the 32B version surpasses the previous Qwen2.5-VL 72B version and competes favorably against closed-source offerings from OpenAI and Anthropic [30]

Group 4: Deployment and Accessibility
- Users can generate a clean, visually appealing web page with dynamic effects, including a day/night mode, enhancing the presentation of AI-generated research results [19][20]
- A deployment feature allows users to publish their web content either publicly or privately, providing flexibility in sharing information [22]
Chinese mathematicians land another paper in one of math's top four journals, a first for Lanzhou University: breaking the "smoothness" restriction on the Stokes equations
量子位· 2025-10-22 05:48
Core Viewpoint
- The article highlights the significant achievement of professors Geng Jun of Lanzhou University and Shen Zhongwei of Westlake University, whose research paper has been accepted by Inventiones Mathematicae, one of the top four mathematics journals, marking a milestone for Lanzhou University in the field of mathematics [2][6]

Group 1: Research Focus
- The research centers on the Stokes equations, fundamental to fluid mechanics, specifically a priori L∞ (sup-norm) estimates for the Stokes operator in non-smooth domains [3][4]
- The study aims to understand the behavior of fluid motion in domains with irregular boundaries, such as natural river channels, rather than smooth pipes [4][5]

Group 2: Key Breakthroughs
- The paper presents two major breakthroughs:
  1. It establishes that in dimensions three and higher with C¹ boundaries, and in two dimensions with Lipschitz boundaries, the maximum fluid velocity can be bounded in terms of the maximum external force [11]
  2. It introduces a novel large-scale averaging argument to address the problem of pressure control, allowing for the estimation of the maximum velocity in bounded domains [12]

Group 3: Theoretical and Practical Implications
- The research fills a critical gap in the theory of the Stokes equations on non-smooth domains, clarifying the applicability of C¹ and Lipschitz boundaries and enriching the mathematical analysis framework of fluid mechanics [13]
- Practically, the findings provide engineers with more accurate computational tools for real-world fluid scenarios, improving the precision of velocity and pressure estimates under non-smooth boundary conditions [14]

Group 4: Authors' Background
- The authors, Geng Jun and Shen Zhongwei, are prominent mathematicians with extensive academic backgrounds, having collaborated on influential papers prior to this achievement [15][20]
- Geng Jun, a professor at Lanzhou University, specializes in harmonic analysis and partial differential equations, while Shen Zhongwei, who recently returned to China, has had a distinguished career in mathematics education and research [16][22][23]
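For reference, the stationary Stokes system and the schematic form of the estimate described above can be written as follows (a standard formulation; the exact norms and constants used in the paper may differ):

```latex
% Stationary Stokes system with no-slip boundary condition
% on a bounded domain \Omega \subset \mathbb{R}^d:
\begin{aligned}
  -\Delta u + \nabla p &= f && \text{in } \Omega, \\
  \operatorname{div} u &= 0 && \text{in } \Omega, \\
  u &= 0 && \text{on } \partial\Omega.
\end{aligned}
% Schematic form of the sup-norm estimate: the maximum velocity is
% controlled by the size of the external force,
\| u \|_{L^\infty(\Omega)} \le C \, \| f \|_{X},
% for C^1 domains when d \ge 3 and Lipschitz domains when d = 2,
% where X denotes the appropriate function space for f (left
% unspecified here) and C does not depend on further boundary smoothness.
```

The novelty described in the article is precisely that this bound survives when ∂Ω is only C¹ (or Lipschitz in 2D), where classical Schauder-type arguments requiring smoother boundaries break down.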
Alibaba Cloud's secret weapon debuts at a top conference: slashing NVIDIA GPU needs by 82%, with 213 GPUs doing the work of 1,192
量子位· 2025-10-21 23:50
Core Viewpoint
- Alibaba Cloud has introduced a new GPU pooling system called Aegaeon, which significantly reduces the demand for NVIDIA GPUs by 82% through innovative resource-allocation techniques [1][3]

Group 1: Research Background
- The research was conducted in collaboration with Peking University, led by Alibaba Cloud's CTO Zhou Jingren [2]
- The study identified that 17.7% of GPU resources were allocated to underutilized models, which accounted for only 1.35% of total request volume [4]

Group 2: Aegaeon's Innovations
- Aegaeon addresses the inefficiencies in GPU resource allocation by implementing token-level auto-scaling technology, allowing for dynamic model switching during token generation rather than waiting for entire requests to complete [10][11]
- The system has achieved a 97% reduction in auto-scaling overhead through various optimizations, including an 80% reduction in initialization overhead and improved memory management [14][15]

Group 3: Performance Outcomes
- Aegaeon has demonstrated performance improvements of 1.5x to 9x compared to existing systems like ServerlessLLM and MuxServe [18]
- In production deployment, Aegaeon has served 47 models of varying sizes, increasing GPU utilization from 13.3%-33.9% to 48.1% without any service-level-objective violations or interruptions [20]
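The token-level auto-scaling idea can be illustrated with a toy single-GPU simulation. This is not Aegaeon's implementation; the swap cost, token cost, chunk size, and request mix are made-up numbers. It contrasts request-level switching (the loaded model changes only after a request completes) with token-level switching (a change is allowed at token-chunk boundaries), showing how the latter slashes time-to-first-token for the second model:

```python
# Toy simulation of one GPU multiplexed across two models (illustrative only).
SWAP = 5    # assumed cost (time units) to switch the loaded model
TOKEN = 1   # assumed cost to generate one token

def request_level(reqs):
    """Serve each request to completion before switching models.
    Returns the time of the FIRST token for each request."""
    t, loaded, first = 0, None, {}
    for rid, model, n_tokens in reqs:
        if model != loaded:
            t += SWAP
            loaded = model
        first[rid] = t + TOKEN
        t += n_tokens * TOKEN
    return first

def token_level(reqs, chunk=8):
    """Interleave requests, allowing a model switch every `chunk` tokens."""
    t, loaded, first = 0, None, {}
    remaining = {rid: n for rid, _, n in reqs}
    models = {rid: m for rid, m, _ in reqs}
    while any(remaining.values()):
        for rid, _, _ in reqs:
            if remaining[rid] == 0:
                continue
            if models[rid] != loaded:
                t += SWAP
                loaded = models[rid]
            if rid not in first:
                first[rid] = t + TOKEN
            done = min(chunk, remaining[rid])
            t += done * TOKEN
            remaining[rid] -= done
    return first

reqs = [("r1", "model-A", 100), ("r2", "model-B", 100)]
print(request_level(reqs)["r2"], token_level(reqs)["r2"])  # → 111 19
```

Under request-level scheduling the second model's first token waits behind the entire first request; switching at token granularity trades extra swap overhead for far lower head-of-line blocking, which is why Aegaeon's 97% reduction in scaling overhead is the enabling optimization.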