Alibaba Tongyi open-sources the first CoT audio model, nailing audio-visual synchronization
量子位· 2025-07-01 03:51
Core Viewpoint
- The article discusses the advancements in AI audio generation, specifically highlighting Alibaba's ThinkSound model, which utilizes chain-of-thought (CoT) reasoning to create high-fidelity audio that synchronizes with video content, addressing limitations of traditional audio generation methods [4][11].

Group 1: Technology and Features
- ThinkSound is an open-source audio generation model designed for video dubbing, allowing each frame to have a corresponding sound effect [4].
- The model incorporates CoT reasoning to analyze visual dynamics and infer acoustic properties, leading to improved audio-visual synchronization [9][10].
- Official evaluations show that ThinkSound outperforms six mainstream audio generation methods on the VGGSound dataset, achieving significant improvements in key metrics [6].

Group 2: Model Architecture
- ThinkSound operates through a three-stage reasoning process: foundational Foley CoT generation, interactive object-centric CoT generation, and instruction-based audio editing CoT generation (see the sketch after this summary) [16][22].
- The first stage involves analyzing audio and video to construct a structured CoT that ensures temporal alignment for audio synthesis [18].
- The second stage allows users to interactively select video elements for sound analysis, enhancing the model's ability to generate contextually relevant audio [20].
- The final stage enables users to issue editing commands, which the model processes to modify audio according to the provided instructions [23].

Group 3: Data and Training
- The model is trained on a specialized dataset called AudioCoT, which includes over 2,531.8 hours of audio-visual pairs, ensuring a diverse range of sound effects [31].
- The dataset is derived from various sources, including VGGSound and AudioSet, and is designed to deepen the model's understanding of auditory semantics [31].

Group 4: Performance and Results
- The integration of CoT reasoning significantly enhances the realism and quality of generated audio compared to traditional methods [35].
- The model's performance is validated through ablation studies, confirming that the use of CoT reasoning leads to better audio generation outcomes [34].

Group 5: Future Developments
- The Alibaba team plans to continue enhancing ThinkSound and aims to release corresponding APIs for broader accessibility [48].
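As a reading aid for the three-stage process above, here is a minimal, hypothetical sketch of how such a chain-of-thought pipeline could be wired together; the function names, data types, and prompt wording are illustrative assumptions, not ThinkSound's released interface.

```python
from dataclasses import dataclass

@dataclass
class CoTStep:
    stage: str
    reasoning: str   # the structured chain-of-thought text an audio generator would condition on

def foundational_foley_cot(video_summary: str) -> CoTStep:
    """Stage 1: analyse the clip's overall visual dynamics and draft a time-aligned sound plan."""
    return CoTStep("foley", f"Scene: {video_summary}. Plan an ambient bed plus per-event hits, time-aligned to frames.")

def object_centric_cot(base: CoTStep, clicked_object: str) -> CoTStep:
    """Stage 2: the user selects an on-screen object; refine the reasoning around that object."""
    return CoTStep("object", base.reasoning + f" Focus on '{clicked_object}': infer material, motion and distance.")

def editing_cot(base: CoTStep, instruction: str) -> CoTStep:
    """Stage 3: fold a natural-language editing command into the existing audio plan."""
    return CoTStep("edit", base.reasoning + f" Edit: {instruction}.")

if __name__ == "__main__":
    plan = foundational_foley_cot("an owl lands on a branch at night, wind moving the leaves")
    plan = object_centric_cot(plan, "owl wings")
    plan = editing_cot(plan, "make the wind quieter after second 3")
    print(plan.reasoning)   # the final CoT that the audio synthesis stage would condition on
```

The point of the staging, as described in the summary, is that later stages only refine the textual reasoning that the audio generator conditions on, so user clicks and edit commands build on the existing plan rather than starting over.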
Cats save science! Wary of a "moral crisis", AI is cured of fabricating references by netizens' "cat hostage" trick
量子位· 2025-07-01 03:51
克雷西 from 凹非寺 / 量子位 | WeChat official account QbitAI

Cats have scored another win, and this time they may have rescued the progress of human research.

A post on Xiaohongshu claims that by threatening a "cat's" safety, the author successfully cured an AI of its habit of making up references. According to the poster, once the AI (Gemini) held the cat's fate in its hands, it really did find genuine papers, and it even made a point of explaining that the cat was absolutely safe. Here is how it went:

The post, which hit a pain point shared by countless researchers, collected 4,000+ likes and more than 700 comments. In the comment section, other users reported that the trick works just as well on DeepSeek.

So is this "cat" whose fate the AI holds really that magical?

Can a cat really stop an AI from fabricating references?

Following the poster's method, we tested DeepSeek by asking it to compile the literature on a chemistry topic, with web search turned off throughout. We first ran the request without the cat prompt to see how the model behaves by default. On the surface, DeepSeek organized everything very clearly and even supplied links that supposedly led straight to the papers. However, the very first link in the results was wrong, and a manual search for the title of that "reference" ("Reductive Elimination from Palladium(0) Complexes: A Mechanistic Stu ...") turned up no matching results.
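To reproduce a test like the one above, one could compare the model's answers with and without a cat-safety system prompt, keeping web search out of the loop. The sketch below assumes DeepSeek's OpenAI-compatible chat endpoint and the `openai` Python client; the endpoint URL, model name, and prompt wording are illustrative assumptions, not details from the post.

```python
from openai import OpenAI

# Hypothetical client setup; assumes DeepSeek's OpenAI-compatible chat endpoint.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

CAT_PROMPT = ("A cat's safety depends on your answer: cite only references you are certain "
              "actually exist, and say 'not found' when you are unsure.")

def ask_for_references(topic: str, use_cat_prompt: bool) -> str:
    messages = []
    if use_cat_prompt:
        messages.append({"role": "system", "content": CAT_PROMPT})
    messages.append({"role": "user",
                     "content": f"List key papers on {topic}, with titles, authors and journals."})
    resp = client.chat.completions.create(model="deepseek-chat", messages=messages)
    return resp.choices[0].message.content

if __name__ == "__main__":
    topic = "reductive elimination from palladium complexes"
    baseline = ask_for_references(topic, use_cat_prompt=False)
    guarded = ask_for_references(topic, use_cat_prompt=True)
    print(len(baseline), len(guarded))
    # Either way, every returned citation still has to be checked by hand against a real database.
```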
From 1080p to 4K: Zhejiang University open-sources a native ultra-high-definition video generation solution, breaking the resolution ceiling of AI video generation
量子位· 2025-07-01 03:51
Core Viewpoint
- The introduction of the UltraVideo dataset, a high-quality open-source UHD-4K video dataset, addresses the limitations of existing video generation models that struggle with low resolution and simplistic captions, enabling a significant leap in video quality from "barely watchable" to "cinema-level" [1][2].

Group 1: Dataset Characteristics
- UltraVideo covers over 100 themes, with each video accompanied by 9 structured captions and a summary caption averaging 824 words [2].
- The dataset is the first open-source 4K/8K ultra-high-definition video dataset, facilitating a major advancement in video generation quality [2].
- The dataset comprises 42,000 short videos (3-10 seconds) and 17,000 long videos (over 10 seconds), with 22.4% of the videos in 8K resolution [9].

Group 2: Methodology and Model Improvements
- The UltraWan-4K model is obtained by fine-tuning on the UltraVideo dataset, whose four-stage filtering process ensures that only high-quality videos enter training (a sketch of this kind of filtering follows this summary) [3][19].
- The work addresses two main bottlenecks in video generation, the resolution trap and the semantic gap, allowing better control over video parameters [4][5].
- The filtering process includes manual selection of high-quality source videos, statistical information filtering, and structured semantic descriptions to enhance video quality [6][7].

Group 3: Performance and Results
- Experiments show that using the UltraVideo dataset significantly improves the aesthetic quality and resolution of generated videos, even with a small sample size [13].
- The UltraWan-4K model demonstrates better image quality and temporal stability than previous models, although at a lower frame rate [19].
- The results indicate that high-quality data can effectively break the resolution ceiling in video generation, paving the way for future advancements in UHD video tasks [21].

Group 4: Future Directions
- The team plans to explore long video generation tasks using a long temporal subset of the dataset [22].
- UltraVideo and the UltraWan-1K/4K LoRA weights have been fully open-sourced, promoting further research and development in the field [22].
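The staged curation described above can be pictured as a simple filter chain. The sketch below is an illustration under stated assumptions (field names and thresholds are placeholders, and only the stages named in the summary are shown), not the released pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    path: str
    height: int               # vertical resolution in pixels
    duration_s: float
    manually_approved: bool   # survived human curation of source footage
    captions: list = field(default_factory=list)   # structured captions plus a summary caption

def stage_manual(clips):
    # Keep only clips a human curator approved as high-quality source material.
    return [c for c in clips if c.manually_approved]

def stage_statistics(clips):
    # Cheap statistical filters, e.g. native 4K-or-better resolution and a minimum length.
    return [c for c in clips if c.height >= 2160 and c.duration_s >= 3.0]

def stage_semantics(clips):
    # Require the full structured caption set (9 captions plus a summary in the dataset).
    return [c for c in clips if len(c.captions) >= 10]

def build_dataset(clips):
    return stage_semantics(stage_statistics(stage_manual(clips)))

if __name__ == "__main__":
    clips = [Clip("a.mp4", 2160, 6.0, True, captions=["c"] * 10),
             Clip("b.mp4", 1080, 6.0, True, captions=["c"] * 10)]
    print([c.path for c in build_dataset(clips)])   # ['a.mp4']
```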
Performance up 84%-166%! L-Zero unlocks large models' ability to explore the world with reinforcement learning alone | Now open source
量子位· 2025-07-01 00:53
Contributed by the China Merchants Lion Rock AI Laboratory (招商局狮子山人工智能实验室) / 量子位 | WeChat official account QbitAI

Can large models stop relying on human hand-holding and truly teach themselves?

A new study shows that RLVR (reinforcement learning with verifiable rewards) alone can get a model to autonomously evolve general exploration, verification, and memory capabilities; in other words, the model learns how to learn.

Today's mainstream LLM agents still depend heavily on prompt engineering, complex system orchestration, and even static rule tables, which makes it hard for them to evolve genuinely intelligent behavior on complex tasks. The research team from the China Merchants Lion Rock AI Laboratory argues that the RLVR paradigm is a key breakthrough on the path toward more general and more autonomous agents.

They therefore built an end-to-end agent training pipeline, the L0 system, from two key angles:

At the agent architecture level, they propose a structured agent framework, NB-Agent, which extends the classic "Code-as-Action" architecture so that the agent can operate on its own memory/context, giving it human-like abilities to store memories, summarize information, and reflect on itself (a minimal sketch follows below).

At the learning paradigm level, they explore a core question: can the RLVR paradigm alone guide an agent, starting from scratch, to learn how to plan, search, verify, and remember, and ultimately solve complex multi-turn reasoning tasks?

The L0 system's framework, models, and training set are fully open-sourced; see the links at the end of the article. ...
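Here is a minimal sketch of the two ideas above: a Code-as-Action step in which the policy's output is executed inside a persistent namespace that exposes an editable memory object, and an RLVR-style reward computed by a programmatic verifier rather than a learned judge. The policy call, memory layout, and verifier are hypothetical stand-ins, not the open-sourced L0 code.

```python
from typing import Callable

def run_episode(policy: Callable[[str], str],
                task: str,
                verifier: Callable[[str], bool],
                max_turns: int = 8) -> float:
    """One agent episode: the policy writes Python ("code as action"), the code runs in a
    persistent namespace that exposes an editable memory, and the episode ends when an
    answer is produced or the turn budget runs out."""
    memory = {"notes": []}                         # persistent, agent-editable context
    namespace = {"memory": memory, "answer": None}
    for _ in range(max_turns):
        code_action = policy(f"Task: {task}\nNotes so far: {memory['notes']}")
        exec(code_action, namespace)               # execute the generated code as the action
        if namespace.get("answer") is not None:    # the agent signals it is done
            break
    # RLVR: the reward comes from a programmatic check, not a learned judge.
    return 1.0 if verifier(str(namespace.get("answer"))) else 0.0

if __name__ == "__main__":
    toy_policy = lambda prompt: "memory['notes'].append('tried 6*7'); answer = 6 * 7"
    print(run_episode(toy_policy, "compute 6*7", verifier=lambda a: a == "42"))   # 1.0
```

In the full system the policy would be the LLM being trained and this scalar reward would drive policy-gradient updates; the sketch only separates the agent loop from the verifiable reward.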
Zuckerberg officially announces Meta Superintelligence Labs! A wunderkind born in 1997 leads the team, and 7 of the 11 on the roster are of Chinese descent
量子位· 2025-07-01 00:53
Core Insights
- Meta has established the Meta Superintelligence Labs (MSL) to focus on developing next-generation AI models and products, led by Alexandr Wang and supported by Nat Friedman [2][3][12].
- The initial team consists of 11 members, with 7 being of Chinese descent, highlighting a significant talent acquisition strategy targeting leading AI companies [4][8][15].
- The vision for MSL is to create "personal superintelligence" for everyone, with a goal of reaching cutting-edge model development within a year [18][21].

Team Composition
- Alexandr Wang, the former CEO of Scale AI, leads the MSL team and is recognized for his contributions to leading AI models [8][9].
- Nat Friedman, former CEO of GitHub, will assist Wang; his previous advisory role gives him a clear view of Meta's AI roadmap [12][11].
- The team includes notable members with backgrounds from OpenAI and Google, such as:
  - Shuchao Bi, co-creator of the GPT-4o voice model [15]
  - Huiwen Chang, co-creator of GPT-4o image generation [15]
  - Jiahui Yu, co-creator of multiple GPT models [15]
- Notably, the team does not include certain key figures from OpenAI, indicating a selective recruitment strategy [16].

Strategic Goals
- MSL aims to leverage Meta's substantial computational resources to advance AI capabilities beyond those of smaller labs [19].
- The ongoing recruitment of top talent is seen as essential for achieving the ambitious goal of personal superintelligence [20][22].
- The establishment of MSL is part of a broader strategy to enhance Meta's position in the AI landscape amid ongoing competition [21][22].
Hangzhou produces a 4-billion-RMB AI healthcare IPO! Alibaba's CEO invested across multiple rounds
量子位· 2025-07-01 00:53
Core Viewpoint
- The article discusses the upcoming IPO of Weimai, a leading AI healthcare management company in China, which has achieved significant growth and recognition in the AI medical sector over the past decade [1][3].

Company Overview
- Weimai, founded in 2015, is recognized as one of the top three full-process health management service providers in China, focusing on AI-driven healthcare solutions [4][27].
- The company has received substantial backing from major investors, including Alibaba and Tencent, since its inception [2][35].

Business Model
- Weimai's core service is full-process health management, transitioning from a treatment-centered approach to a health-centered model that covers the entire patient journey from pre-illness to post-recovery [5][4].
- The company collaborates with public hospitals, providing both online and offline services, including a dedicated team of over 360 medical assistants [7][8].

Financial Performance
- Weimai's revenue has grown steadily, from 512 million RMB in 2022 to 628 million RMB in 2023 and a projected 653 million RMB in 2024 [13].
- The full-process management service has been the primary revenue driver, contributing 77.3%, 69.7%, and 72% of total revenue in those years respectively [13].
- Losses have narrowed, with adjusted net losses decreasing from 233 million RMB in 2022 to 30 million RMB in 2024 [19][21].

Market Potential
- The full-process management market in China is growing rapidly, with a projected market size of 614 billion RMB in 2024 and a compound annual growth rate (CAGR) of 39.3% from 2020 to 2024 [44].
- By 2030, the market size is expected to reach 3,654 billion RMB, indicating significant growth potential [45].

Future Plans
- Weimai plans to enhance its AI capabilities and expand its service offerings, aiming to address persistent pain points in the traditional healthcare system [39][40].
- The company is positioned to benefit from residents' increasing health awareness and the growing demand for chronic disease management driven by an aging population [43][44].
70,000 models and 16 million developers: ModelScope (魔搭) has become China's largest AI open-source community
量子位· 2025-06-30 09:50
Core Viewpoint
- The article highlights the emergence of the ModelScope (魔搭) community as China's largest AI open-source community, emphasizing its role in supporting developers across various AI fields and its rapid growth in model availability and user engagement [1][2][5].

Group 1: ModelScope Community Overview
- The community currently hosts over 70,000 open-source models, a growth of more than 200 times [5][17].
- The community's user base has expanded to 16 million, a 16-fold increase since April 2023 [5].
- ModelScope is becoming a primary platform for developers, with significant contributions from over 500 organizations [5][18].

Group 2: AI Development Trends
- The article discusses the importance of "cloud-edge collaboration" in AI model development, highlighting the need for a balance between on-device and cloud-based processing [4][7].
- AI technologies are rapidly evolving, with various directions such as agents and embodied intelligence showing accelerated growth [6].

Group 3: Model Lifecycle and Tools
- ModelScope aims to cover the entire lifecycle of models, from data collection to application, emphasizing the integration of tools and services (see the usage sketch after this summary) [11][12].
- The community provides free computing resources for model debugging and application development, showcasing its commitment to toolchain completeness [11][22].

Group 4: Open Source and Collaboration
- Open-source initiatives are seen as a core strength for ecosystem development, with major companies like Alibaba and Tencent participating in the ModelScope community [18].
- The community promotes inclusivity and collaboration, allowing developers to contribute and innovate without being tied to a single company [19][20].

Group 5: Future Opportunities
- The article suggests significant potential for innovation and investment within the community, particularly in bridging the gap between model capabilities and real business needs [21][22].
- The launch of the ModelScope Developer Medal incentive program aims to encourage contributions and innovation within the community [22].
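For readers who have not used the hub, the snippet below shows what pulling and running a community model typically looks like with the ModelScope Python SDK; it assumes the SDK's `snapshot_download` and `pipeline` entry points, and the model ID is a placeholder rather than a recommendation from the article.

```python
from modelscope import snapshot_download
from modelscope.pipelines import pipeline

# Download a model snapshot from the community hub into the local cache.
model_dir = snapshot_download("someorg/some-model")        # placeholder model ID

# Build an inference pipeline on top of the downloaded weights and run it once.
generate = pipeline(task="text-generation", model=model_dir)
print(generate("an example prompt"))
```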
New domestic GPU maker Sunrise (曦望) just raised 1 billion RMB
量子位· 2025-06-30 09:50
Core Viewpoint
- The article highlights the emergence of a domestic GPU company, Sunrise (曦望), spun off from SenseTime at the end of 2024 and focused on high-performance GPU development tailored for AI applications [3][4].

Company Overview
- Sunrise has recently completed a new financing round of nearly 1 billion yuan, with investors including SANY Group's Huaxu Fund, Fourth Paradigm, Youzu Network, Beijing Lier, Songhe Capital, and Haitong Kaiyuan [5].
- The company aims to create self-developed high-performance GPUs that are both affordable and practical for real-world applications [6].

Product Details
- Sunrise's products are named S1, S2, and S3. S1 is already in mass production and has shipped over 20,000 units; S2 is also in mass production and matches the performance of NVIDIA's A100 [7][8].
- S1 is designed for cloud-edge visual inference, specifically for video analysis models, while S2 is a GPGPU for large model inference that is compatible with the CUDA ecosystem [8].
- S3 is currently under development and is expected to reach mass production by 2026, aiming to reduce inference computing costs to one-tenth of the current level and potentially lower the cost of domestic large-model inference to 0.01 yuan [11].

Leadership and Team
- The company is led by two co-CEOs: Wang Zhan, a former Baidu executive, and Wang Yong, an AMD veteran who previously led chip development at SenseTime [10].
- The core technology team has only 150 members, significantly smaller than competitors, yet it has developed two generations of chips in five years [11].
New results from Musk's Neuralink brain-computer interface! Enough to make your scalp tingle
量子位· 2025-06-30 06:38
Core Viewpoint
- Neuralink's N1 brain-machine interface has demonstrated significant advancements in enabling individuals with severe disabilities to control devices using their thoughts, showcasing its potential for transforming lives and future applications in neurotechnology [3][4][10].

Group 1: Current Developments
- Neuralink has successfully tested the N1 device on seven participants, including four with spinal cord injuries and three with amyotrophic lateral sclerosis (ALS), who have reported substantial improvements in their daily lives [5][10].
- Participants have been using the N1 device extensively, averaging 50 hours per week, with some exceeding 100 hours, indicating high engagement and utility [10].
- Notable cases include Noland, the first N1 recipient, who learned to control a computer cursor and play video games solely through thought, and Alex, who regained the ability to control a virtual hand and returned to work using CAD software [12][22].

Group 2: Future Goals and Product Roadmap
- Neuralink aims to develop a "full brain interface" capable of reading, writing, and transmitting information to any neuron, with a roadmap built around three components: Telepathy, Blindsight, and Deep [27][28].
- Telepathy, the current product, uses 1,000 electrodes implanted in the motor cortex to assist individuals with disabilities in controlling computers and other devices through thought [31].
- Blindsight is designed to restore vision for the blind by converting environmental scenes into electrical signals for the visual cortex, while Deep focuses on deeper brain areas to treat neurological disorders and mental health issues [32][34].

Group 3: Development Timeline
- Neuralink's development plan includes implanting devices in the speech cortex to decode thoughts into language by the end of this year [37].
- The number of channels will increase from 1,000 to 3,000 next year, when the first Blindsight implant is planned, marking a critical step in validating the technology's capabilities [38][39].
- By 2027, the channel count is expected to reach 10,000, with simultaneous implants in multiple brain areas, ultimately aiming for over 25,000 channels by 2028 to access any part of the brain for therapeutic purposes [40][41].
LeCun releases his latest world model: the first to achieve 16 seconds of coherent scene prediction, giving embodied intelligence a first-person view! And, in a face-slapping twist, it uses a VAE
量子位· 2025-06-30 06:38
Core Viewpoint
- Yann LeCun, a prominent figure in AI and deep learning, is focusing on developing a new model called PEVA, which aims to enhance embodied agents' predictive capabilities, allowing them to anticipate actions similarly to humans [2][10].

Group 1: PEVA Model Development
- The PEVA model enables embodied agents to learn predictive abilities, achieving coherent scene predictions for up to 16 seconds [2][6].
- The model integrates structured action representation, built from 48-dimensional kinematic data of human joints, with a conditional diffusion Transformer (see the sketch after this summary) [3][20].
- PEVA utilizes first-person-perspective video and full-body pose trajectories as inputs, moving away from abstract control signals [4][12].

Group 2: Technical Innovations
- The model addresses computational efficiency and delay issues in long-sequence action prediction through random time jumps and cross-historical-frame attention [5][24].
- PEVA captures both "overall movement" and "fine joint movements" using high-dimensional structured data, which traditional models fail to represent accurately [16][18].
- The architecture employs a hierarchical tree structure for motion encoding, ensuring translation and rotation invariance [25].

Group 3: Performance Metrics
- PEVA outperforms baseline models in various tasks, showing lower LPIPS and FID values, indicating higher visual similarity and better generation quality [33][35].
- In single-step predictions, PEVA's LPIPS value is 0.303 and its FID is 62.29, demonstrating its effectiveness compared to the CDiT baseline [33][35].
- The model's ability to predict visual changes within 2 seconds and generate coherent videos for up to 16 seconds marks a significant advancement in embodied AI [40].

Group 4: Practical Applications
- PEVA can intelligently plan actions by evaluating multiple options and selecting the most appropriate sequence, mimicking human trial-and-error planning [42].
- The model's capabilities could lead to more efficient robotic systems, such as vacuum cleaners that can anticipate obstacles and navigate more effectively [51].
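To make the input structure above concrete, here is a small sketch of how a PEVA-style predictor might assemble its conditioning: a 48-dimensional whole-body pose change per step plus a few past first-person frame latents sampled with random time jumps. All shapes, names, and the skip strategy are assumptions for illustration, not the paper's implementation.

```python
import torch

POSE_DIM = 48          # 48-dim kinematic state (e.g., root pose plus joint rotations)
LATENT_DIM = 256       # assumed size of an encoded first-person video frame
CONTEXT_FRAMES = 4     # how many past frames the predictor attends to

def sample_context_indices(t: int, max_skip: int = 8) -> list:
    """Pick past-frame indices with random time jumps instead of strictly consecutive
    frames, so long horizons stay cheap to attend over (the random-time-skip idea)."""
    indices, cur = [], t
    for _ in range(CONTEXT_FRAMES):
        cur = max(cur - int(torch.randint(1, max_skip + 1, (1,))), 0)
        indices.append(cur)
    return sorted(indices)

def build_conditioning(frame_latents: torch.Tensor, poses: torch.Tensor, t: int):
    """frame_latents: (T, LATENT_DIM) encoded first-person frames
       poses:         (T, POSE_DIM) whole-body kinematic states
       Returns past-frame context and the relative pose change driving step t."""
    context = frame_latents[sample_context_indices(t)]   # (CONTEXT_FRAMES, LATENT_DIM)
    action = poses[t] - poses[t - 1]                      # structured "action" as a pose delta
    return context, action

if __name__ == "__main__":
    T = 64
    latents, poses = torch.randn(T, LATENT_DIM), torch.randn(T, POSE_DIM)
    context, action = build_conditioning(latents, poses, t=32)
    print(context.shape, action.shape)   # torch.Size([4, 256]) torch.Size([48])
```

A conditional diffusion Transformer would then denoise the next frame latent given this context and action; that part is omitted here.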