Tencent Research Institute AI Express 20251209

Group 1: Microsoft VibeVoice-Realtime-0.5B
- Microsoft has open-sourced VibeVoice-Realtime-0.5B, a lightweight real-time TTS model with a first-packet latency of only 300 milliseconds that gained 12.3K stars within 12 hours of release [1]
- The model uses an interleaved window architecture for smooth long-text reading, supports natural dialogue with up to 4 speakers, offers emotion recognition and expression, and maintains long-form context for up to 90 minutes [1]
- It generates both Chinese and English speech, with a word error rate of roughly 2% on the LibriSpeech and SEED TTS test sets and speaker similarity above 0.65, making it suitable for AI assistants, meeting notes, and podcast generation [1]

Group 2: Zhipu GLM-4.6V
- Zhipu AI has officially launched and open-sourced the GLM-4.6V series of multimodal large models, including a 106B-A12B base version and a 9B lightweight Flash version, with the context window expanded to 128K tokens and costs cut by 50% compared with GLM-4.5V [2]
- The architecture builds Function Call capability natively into the vision model, enabling a seamless link from visual perception to executable actions [2]
- The 9B version outperforms Qwen3-VL-8B, while the 106B version competes with Qwen3-VL-235B, a model with roughly double the parameters, supporting applications such as mixed text-and-image layouts, visual shopping, and front-end page replication [2]

Group 3: Kling O1 Features
- Kling O1 has introduced a "Subject Library" feature that lets users upload multi-angle reference images to create custom characters, props, and scenes, supporting up to 7 subjects in video O1 and 10 subjects in image O1 [3]
- A new AI image-completion feature can automatically expand additional viewpoints from a single primary reference image and intelligently generate subject descriptions, backed by a continuously updated official subject library [3]
- The "Comparison Template" feature enables one-click assembly of multimodal creations, allowing efficient side-by-side comparison of all inputs and final outputs and raising the odds of producing viral content [3]

Group 4: Meituan LongCat-Image Model
- Meituan's LongCat team has released and open-sourced the 6B-parameter LongCat-Image model, reaching open-source SOTA on image-editing benchmarks such as ImgEdit-Bench (4.50) and GEdit-Bench (7.60/7.64) [4]
- The model uses a unified architecture for text-to-image generation and image editing with a progressive learning strategy, and scores 90.7 on Chinese text rendering, leading by a clear margin on an evaluation covering 8,105 common Chinese characters [4]
- The full open-source release covers multi-stage text-to-image and image-editing capabilities, with strongly competitive results on GenEval (0.87) and DPG-Bench (86.8) [4]

Group 5: Tencent HY 2.0 and DeepSeek V3.2
- Tencent has officially launched its self-developed large model HY 2.0, with 406B total parameters (32B active) and support for a 256K ultra-long context window, placing it at the forefront of the industry [6]
- DeepSeek V3.2 has been integrated into Tencent's ecosystem, focusing on stronger reasoning performance and long-text generation quality, reaching capability comparable to GPT-5 on public reasoning evaluations and slightly below Gemini-3 Pro [6]
- Both models are deployed in Tencent's first-party applications such as Yuanbao and ima, Tencent Cloud has opened API and platform services, and products such as QQ Browser and Sogou Input Method are gradually integrating them [6] (a hedged API-call sketch follows below)
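The item above only notes that Tencent Cloud has opened API and platform access for these models; the digest gives no endpoint details. Below is a minimal sketch, assuming a hypothetical OpenAI-compatible endpoint: the base_url and "hy-2.0" model ID are placeholders for illustration, not Tencent Cloud's documented values.

```python
# Minimal sketch: calling a hosted chat model through an OpenAI-compatible client.
# The base_url and model name are PLACEHOLDERS / assumptions; consult the provider's
# own documentation for the real endpoint and model identifiers.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-cloud-endpoint/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="hy-2.0",  # placeholder model ID, not an official identifier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize today's AI news in three bullet points."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```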
Group 6: Alibaba Qwen3-TTS
- Alibaba's Tongyi team has released the new-generation text-to-speech model Qwen3-TTS, offering 49 high-fidelity character voices with distinct personas such as "Mo Rabbit" (lively and cute) and "Cang Mingzi" (deep and wise) [7]
- The model supports 10 languages (Chinese, English, German, French, Spanish, Italian, Portuguese, Japanese, Korean, and Russian) and 9 Chinese dialects, preserving authentic intonation and regional accents [7]
- On the MiniMax TTS multilingual test set it outperforms MiniMax, ElevenLabs, and GPT-4o Audio Preview on average WER, with clear perceptual gains in prosody control over the previous generation [7]

Group 7: NVIDIA NVARC Model
- NVIDIA's 4B small model NVARC topped the ARC-AGI 2 benchmark with a score of 27.64%, surpassing GPT-5 Pro's 18.3%, at a cost of only 20 cents per task, roughly 1/36 of GPT-5 Pro's per-task cost [8]
- The model takes a zero-pretraining approach, relying on large-scale synthesis of high-quality data (over 3.2 million augmented samples) and test-time fine-tuning to adapt rapidly to each puzzle [8]
- It casts the puzzles into a dialogue template on top of the small-parameter Qwen3-4B model and uses the NeMo RL framework for supervised fine-tuning, pushing the complex reasoning into an offline synthetic-data pipeline [8]

Group 8: Pudu Robotics PUDU D5 Series
- Pudu Robotics has launched the PUDU D5 series of industrial-grade autonomous-navigation quadruped robots, available in wheeled and point-foot versions and built on a dual-chip NVIDIA Orin plus RK3588 architecture delivering a total of 275 TOPS of compute [9]
- The robot combines four fisheye cameras with dual 192-line LiDAR for centimeter-level positioning and environmental reconstruction, carries a 30-kilogram payload with a 14-kilometer range per charge, and is rated IP67 [9]
- Its bionic wheel-leg hybrid system reaches speeds of up to 5 meters per second, climbs 30° slopes, and clears 25-centimeter obstacles, fitting scenarios such as park inspection, material transport, and guided delivery [9]

Group 9: Karpathy's AI Prompting Strategy
- Andrej Karpathy argues that large language models should be treated not as entities but as simulators, advising against prompts like "What do you think?" because they presuppose a "you" that does not exist [10]
- He suggests more effective framings, such as "What kind of group of people would be well suited to exploring topic xyz? How would they respond?", letting the LLM convene or simulate multiple perspectives rather than being confined to a single AI persona [11] (a hedged prompt sketch follows below)
- Karpathy notes that the "you" in models is deliberately designed and engineered, constructed through SFT and RLHF, and remains fundamentally a token-simulation engine rather than a "mind" that emerged over time [11]
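To make the contrast concrete, here is a minimal sketch in Python of the two framings described above; the openai client configuration, the "gpt-4o-mini" model name, and the ask() helper are illustrative assumptions, not taken from Karpathy's post.

```python
# Sketch of the two prompt framings discussed above. The model name and client
# configuration are placeholders; any chat-completion API would work the same way.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; substitute any chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

topic = "open-sourcing frontier model weights"

# Framing Karpathy advises against: addresses the model as a single entity ("you").
persona_prompt = f"What do you think about {topic}?"

# Framing he suggests instead: ask the simulator to convene multiple perspectives.
panel_prompt = (
    f"What kind of group of people would be well suited to exploring the topic "
    f"of {topic}? How would each of them respond?"
)

print(ask(persona_prompt))
print(ask(panel_prompt))
```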