Tencent Research Institute AI Daily Digest, 2025-12-09
Tencent Research Institute · 2025-12-08 16:01
Group 1: Microsoft VibeVoice-Realtime-0.5B
- Microsoft has open-sourced the lightweight real-time TTS model VibeVoice-Realtime-0.5B, achieving a first-packet latency of only 300 milliseconds and gaining 12.3K stars within 12 hours of release [1]
- The model uses an interleaved window architecture for smooth reading of long texts, supports natural dialogue with up to 4 speakers, offers emotion recognition and expression, and maintains long-range context for up to 90 minutes [1]
- It supports both Chinese and English speech generation, with a word error rate of approximately 2% on the LibriSpeech and SEED TTS test sets and speaker similarity above 0.65, making it suitable for AI assistants, meeting notes, and podcast generation [1]

Group 2: Zhipu GLM-4.6V
- Zhipu has officially launched and open-sourced the GLM-4.6V series of multimodal large models, including the 106B-A12B base version and the 9B lightweight Flash version, with the context window increased to 128k tokens and costs reduced by 50% compared to GLM-4.5V [2]
- The architecture integrates Function Call capabilities natively into the visual model, enabling a seamless link from visual perception to executable actions [2]
- The 9B version outperforms Qwen3-VL-8B, while the 106B version competes with Qwen3-VL-235B, which has roughly double the parameters, supporting applications such as mixed text-image layouts, visual shopping, and front-end replication [2]

Group 3: Keling O1 Features
- Keling O1 has introduced a "Subject Library" feature, letting users upload multi-angle reference images to create custom characters, props, and scenes, with support for up to 7 subjects in video O1 and 10 subjects in image O1 [3]
- A new AI image-completion feature can automatically expand additional perspectives and intelligently generate subject descriptions from a single primary reference image, with the official subject library continuously updated [3]
- A "Comparison Template" feature enables one-click integration of multimodal creation, allowing efficient side-by-side comparison of all inputs and final outputs and boosting the potential for viral content [3]

Group 4: Meituan LongCat-Image Model
- Meituan's LongCat team has released and open-sourced the 6B-parameter LongCat-Image model, reaching open-source SOTA levels on image-editing benchmarks such as ImgEdit-Bench (4.50) and GEdit-Bench (7.60/7.64) [4]
- The model uses a unified architecture for text-to-image and image editing with a progressive learning strategy, and scores 90.7 on Chinese text generation, leading significantly on an evaluation of 8,105 common Chinese characters [4]
- The full open-source release covers multi-stage text-to-image and image-editing capabilities, with strong performance on GenEval (0.87) and DPG-Bench (86.8) [4]

Group 5: Tencent HY 2.0 and DeepSeek V3.2
- Tencent has officially launched its self-developed large model HY 2.0, with 406B total parameters (32B active) and support for a 256K ultra-long context window, placing it at the forefront of the industry [6]
- DeepSeek V3.2 has been integrated into Tencent's ecosystem, focusing on reasoning performance and long-text generation quality; in public reasoning evaluations it is comparable to GPT-5 and slightly below Gemini-3 Pro [6]
- Both models are deployed in Tencent's native applications such as Yuanbao and ima, Tencent Cloud has opened API and platform services, and products such as QQ Browser and Sogou Input Method are gradually integrating them [6]

Group 6: Alibaba Qwen3-TTS
- Alibaba's Tongyi team has released the new-generation text-to-speech model Qwen3-TTS, offering 49 high-fidelity character voices, including distinct personas like "Mo Rabbit" (lively and cute) and "Cang Mingzi" (deep and wise) [7]
- The model supports 10 languages (Chinese, English, German, French, Spanish, Italian, Portuguese, Japanese, Korean, and Russian) and 9 Chinese dialects, preserving authentic intonation and regional accents [7]
- On the MiniMax TTS multilingual test set it outperformed competitors such as MiniMax, ElevenLabs, and GPT-4o Audio Preview in average WER, with clear perceptual improvements in prosody control over the previous generation [7]

Group 7: NVIDIA NVARC Model
- NVIDIA's 4B small model NVARC topped the ARC-AGI 2 test with a score of 27.64%, surpassing GPT-5 Pro's 18.3%, at a per-task cost of only 20 cents, roughly 1/36 of GPT-5 Pro's [8]
- The model uses a zero-pretraining deep learning approach, combining large-scale synthesis of high-quality data (over 3.2 million augmented samples) with test-time fine-tuning for rapid adaptation to each question [8]
- It simplifies puzzle understanding via a dialogue template with the Qwen3-4B small-parameter model, uses the NeMo RL framework for supervised fine-tuning, and moves complex reasoning into an offline synthetic-data pipeline [8]

Group 8: Pudu Robotics PUDU D5 Series
- Pudu Robotics has launched the industry-grade autonomous-navigation quadruped robot PUDU D5 series, in wheeled and point-foot versions, equipped with an NVIDIA Orin plus RK3588 dual-chip architecture totaling 275 TOPS of compute [9]
- The robot features a four-eye fisheye camera and dual 192-line LiDAR for centimeter-level positioning and environmental reconstruction, carries loads up to 30 kilograms with a single-charge range of 14 kilometers, and is rated IP67 [9]
- With a bionic wheel-leg fusion system it reaches speeds of up to 5 meters per second, climbs 30° slopes, and clears 25-centimeter obstacles, suiting applications such as park inspection, material transport, and guided delivery [9]

Group 9: Karpathy's AI Prompting Strategy
- Andrej Karpathy argues that large language models should be viewed not as entities but as simulators, advising against prompts like "What do you think?" because they imply a non-existent "you" [10]
- He suggests more effective questioning strategies, such as "What kind of group of people would be suited to exploring topic xyz? How would they respond?", letting the LLM guide or simulate multiple perspectives rather than being limited to a single AI persona [11]
- Karpathy notes that the "you" in models is deliberately designed and engineered, constructed through SFT and RLHF, and remains fundamentally a token-simulation engine rather than a "mind" that emerged over time [11]
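Karpathy's alternative framing can be made concrete with a small helper that builds a "panel simulation" prompt instead of a single-persona question. This is a minimal sketch; the function name and exact wording are illustrative, not Karpathy's own recipe:

```python
def panel_prompt(topic: str, n_perspectives: int = 3) -> str:
    """Build a prompt that asks the model to *simulate* a panel of
    viewpoints instead of answering as a single 'you' persona."""
    return (
        f"What kind of group of people would be well suited to explore "
        f"the topic: {topic}?\n"
        f"Name {n_perspectives} such perspectives, then simulate a short "
        f"exchange in which each one responds in turn."
    )

prompt = panel_prompt("small models beating large models on ARC-AGI")
```

The point is structural: the prompt asks the model to propose and then simulate several suitable viewpoints, rather than addressing a fictional single "you".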
NVIDIA's 4B small model beats GPT-5 Pro at just 1/36 the cost
36Kr · 2025-12-08 07:23
NVIDIA's small models keep winning. In the latest ARC-AGI 2 results, the 4B small model NVARC topped the public leaderboard with a score of 27.64%, beating GPT-5 Pro's 18.3%. And each task costs only 20 cents, roughly 1/36 of GPT-5 Pro's per-task cost (over $7). According to the official analysis, the highlight of NVARC's win is its zero-pretraining deep learning approach: it does not rely on large-scale general-purpose datasets for upfront pretraining, sidestepping the domain bias and data dependence of pretrained models. After the results came out, the organizers interviewed Jean-Francois Puget and Ivan Sorokin of the NVARC team for a technical breakdown. Here is how this "king of cost-effectiveness" was trained. ARC-AGI 2 is a deliberately harder test that removes overlap with public training data; it mainly measures whether a model can efficiently acquire new skills beyond its training data. No parameter stacking: NVIDIA's strategy is to move complex reasoning into an offline synthetic-data pipeline and train a smaller model that can run fast at evaluation time. In short, they synthesize high-quality data at scale, optimize an existing model on it, and shift the expensive computation offline. Because the Kaggle competition imposes strict compute limits, the team realized they could not directly use large LLMs that require heavy compute for complex, step-by-step reasoning and code generation. So they changed their approach ...
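In ARC-style pipelines, the "synthesize high-quality data at scale" step described above is commonly amplified with label-preserving geometric transforms applied identically to the input and output grids of each puzzle. The sketch below assumes that standard technique; the article does not specify which transforms NVARC actually used:

```python
from itertools import product

Grid = list[list[int]]

def rotate90(g: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g: Grid) -> Grid:
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in g]

def augment(pair: tuple[Grid, Grid]) -> list[tuple[Grid, Grid]]:
    """Apply the same geometric transform to input and output,
    so the underlying puzzle rule is preserved."""
    inp, out = pair
    variants = []
    for flip, rots in product([False, True], range(4)):
        a, b = (flip_h(inp), flip_h(out)) if flip else (inp, out)
        for _ in range(rots):
            a, b = rotate90(a), rotate90(b)
        variants.append((a, b))
    return variants

pairs = augment(([[1, 2], [3, 4]], [[4, 3], [2, 1]]))
# 2 flips x 4 rotations = 8 geometric variants per original pair
```

Each original input/output pair yields 8 geometric variants here; real ARC pipelines typically add color permutations as well, multiplying the sample count further.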
NVIDIA's 4B small model beats GPT-5 Pro! Cost only 1/36
QbitAI (量子位) · 2025-12-08 06:07
Wen Le, from Aofeisi · QbitAI | WeChat official account QbitAI
NVIDIA's small models keep winning. In the latest ARC-AGI 2 results, the 4B small model NVARC topped the public leaderboard with a score of 27.64%, beating GPT-5 Pro's 18.3%. And each task costs only 20 cents, roughly 1/36 of GPT-5 Pro's per-task cost (over $7). According to the official analysis, the highlight of NVARC's win is its zero-pretraining deep learning approach: it does not rely on large-scale general-purpose datasets for upfront pretraining, sidestepping the domain bias and data dependence of pretrained models. ARC-AGI 2 is a deliberately harder test that removes overlap with public training data; it mainly measures whether a model can efficiently acquire new skills beyond its training data. Here is how this "king of cost-effectiveness" was trained. No parameter stacking: NVIDIA's strategy is to move complex reasoning into an offline synthetic-data pipeline and train a smaller model that can run fast at evaluation time. In short, they synthesize high-quality data at scale, optimize an existing model on it, and shift the expensive computation offline. To ensure data quality, they broke the complex reasoning pipeline into separate stages, each of which can be validated independently. In this way they built a synthetic dataset of 3.2M+ augmented samples, each containing up to 7 input/output pairs. ...