Xiaomi's Large Model Breaks into the First Tier: Top Open-Source Coding Ability, with Both IQ and EQ Fully Online
量子位· 2025-12-18 00:30
Core Viewpoint
- Xiaomi's newly announced open-source model MiMo-V2-Flash has entered the first tier of open-source models, combining high efficiency with strong performance at a parameter scale of only 309 billion [2][4].

Technical Innovations
- MiMo-V2-Flash employs an MoE architecture with 256 experts, achieving significant performance with a smaller parameter count than larger models [11].
- The model uses a dynamic activation mechanism, activating only 8 experts at a time, bringing inference cost down to roughly 2.5% of that of the closed-source competitor Claude 4.5 Sonnet [12].
- Key technologies include a 5:1 mixed attention mechanism, a learnable attention aggregation bias, multi-token prediction (MTP) for inference acceleration, and multi-teacher online policy distillation (MOPD) for training efficiency [13][23].

Performance Metrics
- MiMo-V2-Flash scored 86.2 on the Arena-Hard benchmark and 84.9 on the MMLU-Pro complex reasoning task, demonstrating strong general capabilities [27].
- In coding, it reached 73.4% on SWE-Bench Verified, surpassing competitors such as DeepSeek-V3.2 and Kimi-K2 [28].

Real-World Applications
- The model has performed well in practical scenarios, such as generating complete code for a web-based macOS-style operating system and implementing complex features like gesture control [30][41].
- Compared with closed-source models, MiMo-V2-Flash produced more functional and interactive web applications [36][40].

Strategic Vision
- Xiaomi's development of MiMo-V2-Flash reflects a strategic shift toward becoming a major player in the AI model space, aiming to create a unified "brain" for its hardware ecosystem [62][63].
- The company envisions AI that integrates seamlessly with physical devices, improving control precision and response speed [60][61].
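The dynamic-activation mechanism described above is standard top-k MoE routing. The sketch below illustrates the idea; the expert count and k match the article's 256/8 figures, but the router, expert shapes, and weighting scheme are assumptions for illustration, not MiMo-V2-Flash's actual implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=8):
    """Route a token through the top-k of many experts (sketch).

    x: (d,) token vector; gate_w: (d, n_experts) router weights;
    experts: list of callables, each mapping (d,) -> (d,).
    Only k experts run per token, so the active parameter count stays a
    small fraction of the total -- the mechanism behind the reported low
    inference cost (256 experts in total, 8 active).
    """
    logits = x @ gate_w                        # router score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 16, 256
gate = rng.normal(size=(d, n))
experts = [lambda v, W=rng.normal(size=(d, d)) / d: W @ v for _ in range(n)]
out = moe_forward(rng.normal(size=d), gate, experts, k=8)
print(out.shape)  # (16,)
```

With 256 experts of which 8 run per token, only about 3% of expert parameters are exercised on any forward pass, which is why total parameter count overstates inference cost for MoE models.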
"Tesla's Delayed Robot Deliveries Are Stuck on the Dexterous Hand, and China's Dexterous Hands Are Far Ahead" | 灵心巧手 @ MEET2026
量子位· 2025-12-17 10:00
Core Viewpoint
- The dexterous hand is a core component of embodied intelligence, capable of independent application in real-world scenarios without relying on humanoid robots, and represents a high-barrier hardware-software integrated platform [7][12][13].

Group 1
- The dexterous hand is not merely an accessory to humanoid robots but serves as the central execution platform for embodied intelligence [3][7].
- A good dexterous hand must offer high degrees of freedom, durability, cost-effectiveness, and multi-modal perception, along with tailored solutions for different scenarios [5][31].
- The global dexterous hand market features three main technical routes: tendon-driven, rigid-link, and direct-drive transmission; the company has solutions in all three [16][32].

Group 2
- The company emphasizes that a truly effective dexterous hand should mimic human hand capabilities, including high freedom of movement and the ability to interact with various tools [18][20].
- Prices for dexterous hands have dropped below 10,000 yuan, making them competitive with traditional two-finger grippers [23].
- The company is developing both the hardware and the necessary algorithms to ensure the dexterous hand can perform a wide range of tasks in real-world applications [24][55].

Group 3
- The company has developed several dexterous hand models, including the Linker Hand O6, which is lightweight yet capable of significant force, and the Linker Hand L20, known for its speed and efficiency in industrial environments [44][46].
- The Linker Hand L30, based on a tendon-driven structure, is set to commercialize in November 2024, showcasing advanced flexibility and responsiveness [52][53].
- The company is committed to in-house development of key components such as tactile sensors, motors, and reducers, ensuring high durability and performance [55].
Tencent Restructures Its Large-Model Organization: Yao Shunyu Joins, Reporting to President Liu Chiping
量子位· 2025-12-17 10:00
Core Viewpoint
- Tencent has announced a significant organizational restructuring of its AI division, with the notable addition of Yao Shunyu, a prominent figure in the AI research community, as Chief AI Scientist [1][4][11].

Group 1: Yao Shunyu's Background and Role
- Yao Shunyu, a former OpenAI researcher and distinguished academic, has joined Tencent as Chief AI Scientist in the CEO's office, reporting directly to Tencent president Liu Chiping [2][4].
- At only 28, Yao has made substantial contributions to AI, particularly in large models and agent research, with notable works including Tree of Thoughts and ReAct [3][19].
- His recent departure from OpenAI and move to Tencent drew significant attention, underscoring his status as a leading talent in the AI sector [3][11].

Group 2: Organizational Changes at Tencent
- Tencent has restructured its AI organization, establishing new departments such as AI Infra, AI Data, and Data Computing Platform to strengthen its large-model development capabilities [6][8].
- The AI Infra department, led by Yao, will focus on building the technical capabilities for large-model training and inference, aiming for a competitive edge in AI infrastructure [8][10].
- The restructuring aims to reinforce Tencent's engineering advantages and improve the efficiency of AI large-model research, in line with the company's strategic goals [8][12].

Group 3: Tencent's AI Product Development
- Over the past year, Tencent has released more than 30 new models in its Hunyuan (混元) series, with Hunyuan 2.0 showing significant improvements in pre-training data and reinforcement-learning strategies [9].
- Tencent's AI product Yuanbao has rapidly gained users, becoming one of the top AI applications in China, and is integrated into major platforms such as WeChat and QQ [10].
- The company is undergoing a comprehensive AI-driven efficiency transformation, with over 900 internal applications using Hunyuan models across various services [10][12].

Group 4: Strategic Importance of AI for Tencent
- Tencent's AI advances are closely tied to its extensive resources, including rich scenarios, vast data, and a clear strategic approach, positioning the company favorably in the AI landscape [14][15].
- Recruiting top talent like Yao Shunyu signals Tencent's commitment to accelerating its AI initiatives and strengthening its position in the competitive AI market [11][12].
The World's Most Feature-Complete Video Generation Model Is Here
量子位· 2025-12-17 10:00
Core Viewpoint
- Alibaba has launched the new Tongyi Wanxiang 2.6 model, billed as the most feature-complete video generation model globally, covering capabilities such as text-to-video, image generation, and audio-driven video creation [1].

Group 1: Video Generation Capabilities
- Wanxiang 2.6 introduces multi-speaker audio-driven video generation, along with audio-visual synchronization and multi-shot storytelling, features not available in Sora 2 [2].
- The model shows significant improvements in artistic style control, realistic portrait generation, and understanding of historical and cultural semantics in image generation [3][8].
- Its video generation capabilities include video-reference generation, subject consistency, and natural audio-visual synchronization, improving the overall user experience [11][12].

Group 2: Performance Testing
- Initial tests show Wanxiang 2.6 performs well on video subject consistency and prompt understanding, achieving near 1:1 replication of the subject's appearance and accurately matched lip movements [11].
- Multi-shot narrative generation is effective, with smooth transitions and coherent storytelling across shots, though some abstract actions remain challenging [17][18].
- Aesthetic quality in video generation has improved, showing a cinematic feel and strong visual appeal, particularly in complex scenes such as cyberpunk cityscapes [14][24].

Group 3: Image Generation Enhancements
- Wanxiang 2.6 advances image generation in style transfer, portrait generation, and bilingual text handling, demonstrating a better grasp of new aesthetic styles [19][22].
- The model successfully generated a food promotional poster with clear bilingual text and an appealing layout, indicating reliable aesthetic judgment [25][27].
- Overall, performance is commendable, with minor flaws in multi-character dialogue and complex action understanding, but the model is usable for everyday short-video creation and secondary-creation tasks [28][29].
Moore Threads' Algorithm Makes a Splash, Taking Silver at a Top Graphics Conference — and It's Open-Sourced
量子位· 2025-12-17 09:07
Core Viewpoint
- Moore Threads won a silver medal at the 3D Gaussian Splatting Reconstruction Challenge (3DGS Challenge) at SIGGRAPH Asia 2025, showcasing its algorithmic strength and hardware-software co-optimization in next-generation graphics rendering [1][2][13].

Group 1: 3D Gaussian Splatting Technology
- 3D Gaussian Splatting (3DGS), proposed in 2023, is a revolutionary 3D scene representation and rendering technique that strikes an exceptional balance between image quality, efficiency, and resource usage [4].
- Compared with traditional Neural Radiance Fields (NeRF), 3DGS improves rendering efficiency by hundreds to thousands of times while maintaining realistic quality, and adapts well to ray tracing, real-time VR/AR rendering, and multi-modal fusion [4][6].
- 3DGS is becoming a key foundational technology for embodied-AI training scenarios, supporting accurate world modeling and improving path planning, environmental perception, and complex task execution [7][8].

Group 2: Competition and Performance
- The 3DGS Challenge required participants to complete high-quality 3DGS reconstruction within 60 seconds from real terminal video sequences and SLAM point clouds, with PSNR and reconstruction speed as the evaluation metrics [9][10].
- Moore Threads achieved an average PSNR of 27.58 with a reconstruction time of 34 seconds, ranking third overall and significantly outperforming many teams [15][16].

Group 3: LiteGS Development
- Moore Threads developed the LiteGS foundational library to optimize 3DGS training, substantially reducing training time and parameter count while maintaining high reconstruction quality [17][20].
- LiteGS delivers up to 10.8x training acceleration and cuts parameter count by more than 50%, while exceeding mainstream solutions in PSNR by 0.2–0.4 dB [20][21].
- LiteGS has been fully open-sourced on GitHub to promote collaboration and continued progress in 3D reconstruction and rendering [23].

Group 4: Strategic Implications
- The result at an international graphics competition reflects Moore Threads' grasp of global technology trends and its ambition to lead the future direction of graphics computing [23][25].
- The company will host the first MUSA Developer Conference on December 20–21, 2025, to discuss how technologies like 3DGS can shape the future and empower fields such as embodied intelligence [25].
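PSNR, the challenge's image-quality metric, is a simple function of mean squared error. A minimal reference implementation (assuming images normalized to [0, 1]; this is the standard textbook definition, not the challenge's evaluation code):

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# uniform 0.01 error against a reference image
a = np.zeros((4, 4))
b = a + 0.01
print(round(psnr(a, b), 2))  # 40.0
```

On this logarithmic scale, the team's 27.58 dB average means each extra dB reported (such as LiteGS's 0.2–0.4 dB gain over mainstream solutions) corresponds to a multiplicative reduction in reconstruction error.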
The Evolution of Large Models: Words to Worlds | A Conversation with SenseTime's Lin Dahua
量子位· 2025-12-17 09:07
Core Insights
- The article discusses the breakthrough of SenseTime's SenseNova-SI model, which has surpassed the Cambrian-S model in spatial intelligence capabilities [2][5][50].
- It highlights a paradigm shift in AI, moving away from merely scaling models toward foundational research on multi-modal and spatial intelligence [9][20][22].

Model Performance
- SenseNova-SI achieved state-of-the-art (SOTA) results across various spatial intelligence benchmarks, outperforming both open-source and proprietary models [4][5].
- Specific metrics show SenseNova-SI scoring higher than Cambrian-S in key areas such as spatial reasoning and hallucination suppression [50].

Paradigm Shift in AI
- The traditional model-scaling approach is reaching its limits, necessitating a return to fundamental research [9][15][20].
- SenseTime's approach centers on a new architecture called NEO, which integrates visual and language processing at the core level, allowing a better understanding of spatial relationships [39][42].

Technological Innovations
- The NEO architecture processes visual and textual tokens simultaneously, enhancing the model's ability to understand and interact with the physical world [42][46].
- SenseNova-SI demonstrates a tenfold increase in data efficiency, needing only 10% of the training data of comparable models to reach SOTA performance [49].

Industrial Application
- The conversation stresses making AI economically viable, noting that high costs and slow processing times are barriers to widespread adoption [55][58].
- SenseTime's SekoTalk product exemplifies AI applied to real-time video generation, cutting processing time from hours to real time [64][66].

Future Directions
- The article encourages young researchers and entrepreneurs to explore fields beyond large language models, such as embodied intelligence and AI for science [68][70].
- It concludes with a vision of China developing AI that deeply interacts with the physical world, positioning the country as a leader in this emerging landscape [72][73].
Helping Large Models Learn from Their Mistakes: NJUST, Baidu, and Others Propose a New Model-Memory Method
量子位· 2025-12-17 09:07
Core Viewpoint
- The article presents ViLoMem, a method developed by Nanjing University of Science and Technology in collaboration with Baidu that tackles large models' poor memory retention, enabling them to learn from past mistakes by separating visual and logical errors into distinct memory streams [1][5].

Group 1: ViLoMem Framework
- ViLoMem employs a dual-stream semantic memory system that lets models remember visual and logical errors separately, improving their ability to learn from experience [15][16].
- The framework consists of two main components, memory generation and memory retrieval, which together improve performance without altering the model's parameters [18][5].

Group 2: Memory Generation
- When the model fails at a task, ViLoMem activates two branches: a visual analysis module that identifies visual errors and a logical analysis module that pinpoints logical mistakes, generating structured guidelines for each error type [19][20][21].
- Newly generated memories are matched against existing ones by similarity, either merging into more abstract rules or creating new memory slots, which prevents memory overload while abstracting general semantic patterns [22][24].

Group 3: Memory Retrieval
- Retrieval strategies differ between streams: visual memory uses a two-stage process of image-level similarity search followed by question-semantic filtering [27][28].
- Logical memory retrieval first understands the problem, then searches for relevant rules, which is more effective than simple keyword matching [29].

Group 4: Performance Improvement
- ViLoMem yields significant gains across six multimodal reasoning benchmarks, with notable improvements on mathematical tasks, such as +6.48 for GPT-4.1 on MathVision [2][31].
- Smaller models benefit even more, with Qwen3-VL-8B gaining +4.38 on MMMU [31].

Group 5: Cross-Model Memory Transfer
- An experiment showed that smaller models score higher when using memories generated by larger models, a form of "free knowledge distillation" [34][36].
- This suggests experiences from stronger models can directly boost weaker models without fine-tuning [36].
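The two-stage visual-memory retrieval described in Group 3 can be sketched with embedding similarity. Everything below — the memory schema, the cosine scorer, the similarity threshold — is an illustrative assumption, not ViLoMem's released code:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def retrieve_visual_memory(img_emb, q_emb, memories, k=5, sim_thresh=0.5):
    """Two-stage visual-memory retrieval (hypothetical sketch).

    Stage 1 ranks stored memories by image-level similarity and keeps the
    top k; stage 2 filters that shortlist by question semantics, returning
    only guidelines whose original question resembles the current one.
    Each memory is a dict with 'img' and 'question' embeddings plus a
    'guideline' string (an assumed schema).
    """
    # Stage 1: image-level similarity search
    ranked = sorted(memories, key=lambda m: cosine(img_emb, m["img"]),
                    reverse=True)[:k]
    # Stage 2: question-semantic filtering of the shortlist
    return [m["guideline"] for m in ranked
            if cosine(q_emb, m["question"]) > sim_thresh]
```

The two-stage design matters: an image match alone would surface memories from visually similar tasks that asked entirely different questions, so the semantic filter prunes guidelines that do not transfer.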
Mining Motion Cues from Attention: Unlocking 4D Scene Reconstruction Without Training
量子位· 2025-12-17 09:07
Core Insights
- The article presents VGGT4D, a framework that enables 3D foundation models to process dynamic 4D scenes without increasing training cost [1][2][30].
- VGGT4D exploits motion cues hidden in the attention layers of the Visual Geometry Transformer (VGGT) to improve performance on dynamic object segmentation and camera pose estimation [1][6][30].

Group 1: Challenges in Transitioning from 3D to 4D
- Existing 3D models such as VGGT and DUSt3R excel at static scene reconstruction but struggle with dynamic 4D scenes: moving objects interfere with background geometric modeling and cause significant camera pose drift [4].
- Current solutions face two main challenges: high computational or training cost, and reliance on external priors that complicate the system [5].

Group 2: VGGT4D's Mechanism
- VGGT4D extracts 4D perception capabilities directly from pre-trained 3D models, with no additional training [6].
- By visualizing VGGT's attention mechanism, the research team found that different network layers respond distinctly to dynamic regions, indicating that VGGT implicitly encodes rich dynamic cues despite being trained under static assumptions [7][13].

Group 3: Motion Cue Extraction Techniques
- VGGT4D introduces a training-free attention-feature mining and mask refinement mechanism that uses Gram matrices and gradient flow for high-precision dynamic-static separation [14].
- To overcome the limitations of standard attention maps, the method uses self-similarity Gram matrices that focus on motion-induced variance, sharpening the detection of dynamic features [17].

Group 4: Performance Evaluation
- VGGT4D significantly outperforms other variants on dynamic object segmentation across multiple datasets, achieving top performance on DAVIS-2016 and DAVIS-2017 without any 4D-specific training [21][20].
- Qualitative analysis shows VGGT4D produces more accurate masks with clearer boundaries than baseline methods, supporting the hypothesis that VGGT's Gram similarity statistics embed extractable motion cues [22].

Group 5: Robustness and Long-Sequence Performance
- VGGT4D is notably robust in camera pose estimation, achieving the best results on challenging long-sequence benchmarks while staying efficient [25].
- It identifies and removes residual pose inconsistencies caused by motion, yielding more stable and accurate camera trajectories [25].

Group 6: 4D Point Cloud Reconstruction
- On the DyCheck dataset, VGGT4D achieves the best scores on all reconstruction metrics, significantly improving accuracy and distance metrics over the VGGT baseline [28].
- It reduces median accuracy error from 0.009 to 0.004 and average distance from 0.150 to 0.123, demonstrating precise dynamic-static separation and improved geometric reconstruction quality [28].

Group 7: Conclusion
- VGGT4D offers a training-free paradigm that extends 3D foundation models to 4D dynamic scenes, providing a low-cost route to 4D reconstruction and showcasing the zero-shot transfer potential of foundation models [30].
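The self-similarity Gram-matrix idea in Group 3 can be illustrated with a toy computation: tokens on static background look alike across frames (high cross-frame self-similarity), while tokens on moving objects vary. The shapes, normalization, and saliency definition below are assumptions for illustration, not the VGGT4D implementation:

```python
import numpy as np

def motion_saliency(features):
    """Per-token motion saliency from cross-frame Gram matrices (sketch).

    features: (T, N, C) attention features over T frames, N tokens, C dims.
    For each token we build a (T, T) Gram matrix of its normalized feature
    against itself across frames; a low mean similarity means the token's
    appearance changes over time, flagging it as likely dynamic.
    """
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-9)
    # gram[n, t, s] = similarity of token n between frames t and s
    gram = np.einsum("tnc,snc->nts", f, f)
    sim = gram.mean(axis=(1, 2))     # average cross-frame self-similarity
    return 1.0 - sim                 # high saliency = likely moving
```

A static token scores near zero saliency (its Gram matrix is all ones), while a token whose features rotate between frames scores higher, giving a training-free dynamic-static signal that a mask-refinement step could then clean up.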
量子位 (QbitAI) Is Hiring Editors and Writers
量子位· 2025-12-17 09:07
Core Viewpoint
- The article highlights the ongoing AI boom and invites readers to join 量子位 (QbitAI), which tracks AI advancements and has established itself as a leading content platform in the industry [1].

Group 1: Job Opportunities
- The company is hiring in three main directions: AI Industry, AI Finance, and AI Product, with positions open to both experienced professionals and fresh graduates [2][4].
- Roles range from editor to lead writer to chief editor, matched to individual capabilities [6].

Group 2: Job Responsibilities
- **AI Industry Direction**: track infrastructure innovation, including chips, AI infrastructure, and cloud computing, and interpret technical reports from conferences [6][7].
- **AI Finance Direction**: cover venture capital, financial reports, and capital movements in the AI industry; requires strong analytical skills and a passion for interviews [11].
- **AI Product Direction**: evaluate AI applications and hardware, track new product releases, and engage with entrepreneurs and product experts in the AI space [11].

Group 3: Benefits and Work Environment
- Employees engage with cutting-edge AI technologies, improve their work efficiency with new tools, and build personal influence in the AI field [6].
- The company offers competitive salaries and comprehensive benefits, including social insurance, meal allowances, and performance bonuses, within a dynamic and open team culture [6][12].

Group 4: Company Growth
- By 2025, the company aims to exceed 2.4 million WeChat subscribers and 7 million users across platforms, with daily reading volume above 2 million [12].
NVIDIA's Moat Just Got Wider: A Quiet Acquisition of the Open-Source Compute-Scheduling Ace Used by Over Half the World's Top Supercomputers, and One Even Thinking Machines Can't Do Without
量子位· 2025-12-17 03:38
Core Viewpoint
- NVIDIA's acquisition of SchedMD is a strategic move to widen its competitive edge in HPC and AI by integrating SchedMD's Slurm system into its ecosystem, extending its influence beyond hardware into resource scheduling [1][11][15].

Group 1: SchedMD Overview
- SchedMD, founded in 2010, specializes in large-scale computing task scheduling [5].
- Its core asset is the open-source workload manager Slurm, which efficiently allocates computing resources across large numbers of devices for tasks such as AI model training and scientific research [6][8].
- Slurm is used by more than half of the world's TOP500 supercomputers, as well as by major tech companies such as Meta and various AI startups [3][9][10].

Group 2: Strategic Importance of the Acquisition
- Integration costs are low thanks to a decade-long collaboration between NVIDIA and SchedMD, enabling a smooth transfer of technology and team [12][13].
- The strategic value lies in extending NVIDIA's influence from hardware to scheduling: even clients running AMD or Intel chips will depend on NVIDIA's ecosystem through Slurm [14][15].
- The move further entrenches NVIDIA among key customer groups, including supercomputing centers, cloud providers, and AI enterprises [16].

Group 3: Future Considerations
- NVIDIA has committed to keeping Slurm open-source and vendor-neutral, ensuring continued access for global users [18].
- Concerns remain, however, about NVIDIA's ongoing investment in critical projects such as Slinky, which supports Slurm-on-Kubernetes services, raising questions about the future stability of related businesses [19][21].