机器之心
A powerful pillar for Bilibili's overseas expansion: the newly open-sourced text-to-speech model IndexTTS-2.0 marks zero-shot TTS's entry into a dual-dimension era
机器之心· 2025-09-18 04:32
Recently on Bilibili, have you also come across some "addictively" uncanny AI videos — say, an English-dubbed Empresses in the Palace, flying tanks, or Cao Cao battling Sun Wukong? These works not only faithfully reproduce the original characters' timbre, but also recreate their emotion and prosody with remarkable fidelity. Even more surprising, all of them were generated by AI. Paper title: IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech. Demo videos include: "The English-dubbed Empresses in the Palace is here"; "Making tanks fly — a long-video test of Bilibili's open-source index-tts-2.0, and it's genuinely strong: Cao Cao vs. Sun Wukong"; "If AI ran an Apple launch event in Chinese — an indextts2 showcase". Reportedly, these videos were all made with IndexTTS-2.0, the text-to-speech model newly open-sourced by Bilibili's Index team. The model has drawn considerable attention from communities at home and abroad since its demo was released, and the project has surpassed 10k stars on GitHub. Paper link: https://arxiv.org/abs/2506.21619 In recent years, large-scale text-to-speech (Text-to-Spe ...
Tongyi DeepResearch makes a stunning debut: performance on par with OpenAI, with the model, framework, and recipe fully open-sourced
机器之心· 2025-09-18 01:01
Core Insights - The article discusses the advancements of Tongyi DeepResearch, highlighting its transition from basic conversational capabilities to sophisticated research functionalities, achieving state-of-the-art (SOTA) results across multiple benchmarks while being fully open-source [1][3]. Data Strategy - The improvement in model capabilities is attributed to a multi-stage data strategy designed to generate high-quality training data without relying on expensive manual annotations [5]. - The team introduced Agentic Continual Pre-training (CPT) to establish a solid foundation for the model, utilizing a systematic and scalable data synthesis approach [6]. - The data generation process involves restructuring and constructing questions based on a wide array of knowledge documents, web crawler data, and knowledge graphs, creating an open-world knowledge memory anchored by entities [6]. Reasoning Modes - Tongyi DeepResearch features both a native ReAct Mode and a Heavy Mode for managing complex multi-step research tasks [11]. - In ReAct Mode, the model excels in a standard thinking-action-observation cycle, supporting extensive interaction rounds with a context length of 128K [12]. - Heavy Mode employs a new IterResearch paradigm to deconstruct tasks into research rounds, allowing the agent to maintain cognitive focus and high-quality reasoning [13][14]. Training Methodology - The training process integrates Agentic CPT, Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL), establishing a new paradigm for agent model training [17][20]. - The team customized RL algorithms based on GRPO, ensuring that learning signals align with the model's current capabilities, and implemented strategies to enhance training stability [21]. - Dynamic indicators during training show significant learning effects, with rewards consistently increasing, indicating effective exploration and adaptation [23]. 
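The ReAct cycle described above — a thought, an action (tool call), and an observation, repeated until the agent can answer — can be sketched as a minimal loop. All names and the toy policy below are illustrative stand-ins, not Tongyi DeepResearch's actual interfaces:

```python
# Minimal ReAct-style agent loop: the model alternates between a
# "thought" (free-form reasoning), an "action" (a tool call), and an
# "observation" (the tool's result), until it emits a final answer.
# The policy and tool below are toy stand-ins, not the real system.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def toy_policy(question: str, history: list) -> dict:
    """Stand-in for the LLM: decides the next step from the trajectory."""
    if not history:  # first round: call a tool
        return {"thought": "I should compute this.",
                "action": ("calculator", question)}
    # a tool result is available: finish
    return {"thought": "I have the result.",
            "final_answer": history[-1]["observation"]}

def react_loop(question: str, max_rounds: int = 8) -> str:
    history = []
    for _ in range(max_rounds):
        step = toy_policy(question, history)
        if "final_answer" in step:
            return step["final_answer"]
        tool_name, tool_input = step["action"]
        observation = TOOLS[tool_name](tool_input)
        history.append({**step, "observation": observation})
    return "max rounds exceeded"

print(react_loop("2 * (3 + 4)"))  # → 14
```

In the real system the policy is the LLM itself and the trajectory can span many rounds within the 128K context; the loop structure, however, is exactly this thinking-action-observation cycle.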
Application Deployment - Tongyi DeepResearch has empowered various internal applications within Alibaba, including the creation of a simulated training environment to reduce development costs and improve speed [27]. - The team developed a stable and efficient tool sandbox to ensure reliable tool calls during agent training and evaluation [27]. - The collaboration with Gaode App focuses on enhancing complex query experiences in navigation and local services, showcasing the practical application of agent capabilities [28]. Legal Intelligence - Tongyi Farui serves as a legal intelligence agent, providing professional legal services such as legal Q&A, case law retrieval, and document drafting, leveraging innovative agent architecture [30]. - The performance metrics of Tongyi Farui indicate superior quality in answer points, case citations, and legal references compared to other models [31]. Research Contributions - The Tongyi DeepResearch team has consistently published technical reports, contributing to the open-source community and advancing the field of deep research agents [33].
Letting robots do more than just walk: Nav-R1 ushers in a new era of navigation with reasoning
机器之心· 2025-09-18 01:01
Core Insights - The article discusses the challenges in enabling robots to understand and execute complex navigation commands in real-world environments, emphasizing the need for improved reasoning, path planning, and action execution capabilities [2][4]. Group 1: Key Innovations - The paper introduces a new foundational model called Nav-R1, which integrates perception, reasoning, and action in 3D environments, enhancing the robot's ability to think clearly before acting [5]. - A large dataset, Nav-CoT-110K, consisting of approximately 110,000 Chain-of-Thought trajectories, is constructed to facilitate cold-start training, allowing the model to learn reasoning and action decision-making before reinforcement learning optimization [8]. - Nav-R1 employs three complementary reward mechanisms during reinforcement learning: Format Reward, Understanding Reward, and Navigation Reward, which collectively enhance the model's logical behavior and alignment with human expectations [9][13]. Group 2: Experimental Results - Nav-R1 demonstrates significant improvements in success rates and path efficiency across various navigation tasks, achieving approximately an 8% increase compared to other advanced methods [14]. - In real-world experiments, Nav-R1 was tested on a mobile robot platform, showing robust performance in navigating complex indoor environments such as meeting rooms and corridors [18][23]. Group 3: Practical Applications - The capabilities of Nav-R1 suggest potential applications in service robots and home assistants, where understanding and navigating cluttered environments is crucial for user experience [29]. - In healthcare settings, Nav-R1 can enhance the navigation of robots in hospitals and nursing homes, ensuring safe and reliable operation in complex environments [30]. - The model's reasoning and control capabilities are also applicable in augmented reality (AR) and virtual reality (VR) scenarios, where virtual agents need to navigate physical spaces [31]. 
- In industrial and hazardous environments, Nav-R1's robustness and generalization abilities make it suitable for tasks in factories, mines, and disaster sites [32].
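The three reward terms from Group 1 can be pictured as a weighted sum. The concrete scoring functions and weights below are illustrative assumptions for the sketch, not the paper's exact definitions:

```python
# Sketch of a composite reward in the spirit of Nav-R1's Format /
# Understanding / Navigation rewards. The scoring rules, template, and
# weights here are illustrative placeholders, not the paper's values.
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows a <think>...</think><answer>...</answer> template."""
    return 1.0 if re.fullmatch(r"<think>.+</think><answer>.+</answer>",
                               output, re.DOTALL) else 0.0

def understanding_reward(predicted_goal: str, true_goal: str) -> float:
    """1.0 if the model identified the commanded goal correctly."""
    return 1.0 if predicted_goal == true_goal else 0.0

def navigation_reward(final_pos, goal_pos, success_radius=1.0) -> float:
    """1.0 on success; otherwise decays with distance to the goal."""
    dist = ((final_pos[0] - goal_pos[0]) ** 2 +
            (final_pos[1] - goal_pos[1]) ** 2) ** 0.5
    return 1.0 if dist <= success_radius else 1.0 / (1.0 + dist)

def total_reward(output, predicted_goal, true_goal, final_pos, goal_pos,
                 w=(0.2, 0.3, 0.5)) -> float:
    return (w[0] * format_reward(output)
            + w[1] * understanding_reward(predicted_goal, true_goal)
            + w[2] * navigation_reward(final_pos, goal_pos))

r = total_reward("<think>go left</think><answer>kitchen</answer>",
                 "kitchen", "kitchen", (0.5, 0.0), (0.0, 0.0))
print(round(r, 2))  # → 1.0: all three terms are satisfied
```

The design point is that format and understanding terms shape the model's behavior (think first, state the goal) even when the navigation outcome alone would be a sparse signal.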
Just now, the DeepSeek-R1 paper lands on the cover of Nature, with Liang Wenfeng as corresponding author
机器之心· 2025-09-17 17:00
Core Viewpoint - The article highlights the significance of DeepSeek-R1, which is recognized as the first large language model (LLM) to pass peer review in a prestigious academic journal, Nature. This achievement marks a pivotal shift in the AI industry towards more rigorous scientific validation of AI models, moving from mere technical competition to a focus on scientific discipline and public trust [5][11][12]. Summary by Sections DeepSeek-R1 Overview - DeepSeek-R1 is trained using reinforcement learning, where the model receives rewards for correct answers and penalties for incorrect ones, enabling it to develop reasoning capabilities similar to human problem-solving [7][8]. - The model's ability to self-validate and reflect on its performance enhances its effectiveness in programming and advanced scientific inquiries [7]. Peer Review Significance - The peer review process serves as a critical gatekeeper, requiring AI companies to substantiate their claims with solid evidence rather than self-promotion [10]. - The rigorous evaluation of DeepSeek-R1's methodology and limitations by external experts helps to mitigate inflated claims in the AI industry [9][10]. Training Methodology - DeepSeek-R1 employs a novel multi-stage pipeline that enhances reasoning capabilities without relying heavily on supervised data [15]. - The model utilizes Group Relative Policy Optimization (GRPO) to reduce training costs and incorporates a dual reward mechanism based on accuracy and format [16][17]. - A structured training template guides the model to articulate its reasoning process before providing final answers, allowing for clear observation of its learning progress [18]. Performance and Limitations - DeepSeek-R1 demonstrates advanced self-evolution capabilities, developing higher-order reasoning skills autonomously during training [20]. - Despite its advancements, the model still faces challenges such as poor readability and language mixing in its outputs [21][26]. 
Cold Start and Reinforcement Learning - The development team collected a small amount of long Chain of Thought (CoT) data to stabilize the model during the early stages of reinforcement learning [22]. - The integration of language consistency rewards during training aims to improve the model's readability, although it may slightly affect performance [23]. Distillation and Model Efficiency - The team successfully distilled the reasoning capabilities of DeepSeek-R1 into smaller models, significantly enhancing their performance [29]. - Benchmark tests indicate that DeepSeek-R1 competes effectively with state-of-the-art models in reasoning tasks, showcasing its robust capabilities [30][31].
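GRPO's key idea — scoring each sampled answer against its own group rather than against a learned value function — can be sketched concretely. The reward values below are toy numbers; the dual (accuracy + format) reward shaping mirrors the description above but its magnitudes are assumptions:

```python
# Sketch of Group Relative Policy Optimization's advantage computation:
# for each prompt, sample a group of G answers, score each one, and use
# the group-normalized reward as the advantage (no critic network needed).
# Reward magnitudes below are toy numbers for illustration.

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample = (r_i - mean(group)) / std(group)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5 or 1.0  # avoid division by zero for a uniform group
    return [(r - mean) / std for r in rewards]

# Dual reward as described: accuracy (is the answer correct?) plus a
# format term (did the model wrap its reasoning in the required template?).
def dual_reward(correct: bool, well_formatted: bool) -> float:
    return (1.0 if correct else 0.0) + (0.1 if well_formatted else 0.0)

group = [dual_reward(True, True), dual_reward(False, True),
         dual_reward(False, False), dual_reward(True, False)]
print([round(a, 2) for a in grpo_advantages(group)])
```

Because the baseline is just the group mean, GRPO avoids training a separate critic model, which is the main source of its cost savings relative to PPO-style training.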
6.1B matches a 40B dense model: Ant Group open-sources its latest MoE model, Ling-flash-2.0
机器之心· 2025-09-17 09:37
Core Insights - Ant Group's Ling-flash-2.0 model, a new MoE model, features a total of 100 billion parameters with only 6.1 billion active parameters, achieving performance comparable to or exceeding that of larger models with 40 billion parameters [1][3][4] - The model represents a shift from a "parameter arms race" to an "efficiency-first" approach, emphasizing full-stack optimization across architecture, training, and inference [3][4][10] Group 1: Model Performance and Efficiency - Ling-flash-2.0 achieves approximately 7 times the performance leverage, activating only 6.1 billion parameters while delivering performance equivalent to a 40 billion dense model [4][9] - The model's inference speed is over three times faster than similar performance dense models, capable of generating over 200 tokens per second on the H20 platform [9][10] - The architecture includes a 1/32 activation ratio, expert fine-tuning, and a shared expert mechanism to enhance efficiency and reduce redundant activations [6][10] Group 2: Application and Use Cases - Ling-flash-2.0 demonstrates strong capabilities in various tasks, including high-difficulty mathematical reasoning, code generation, and front-end development [11][14][15] - The model outperforms both similar-sized dense models and larger MoE models in benchmarks across multiple disciplines [11][14] - Specific applications include generating Python programs, creating responsive web designs, and solving complex mathematical problems like Sudoku [17][19][27] Group 3: Training and Data Management - The model's training is supported by a robust AI Data System, processing over 40 trillion tokens of high-quality data, with a focus on 20 trillion tokens for pre-training [31][34] - The pre-training process is divided into three stages, optimizing hyperparameters and employing innovative learning rate scheduling to enhance downstream task performance [32][34] - The vocabulary has been expanded to 156,000 tokens to improve multilingual 
capabilities, incorporating high-quality data from 30 languages [34] Group 4: Post-Training Innovations - The model employs a four-stage post-training process designed to enhance reasoning and conversational abilities, including decoupled fine-tuning and progressive reinforcement learning [35][38][40] - ApexEval is introduced to evaluate model potential based on knowledge mastery and reasoning depth, ensuring only the most capable models proceed to reinforcement learning [39] - The training system supports high-quality data selection and model iteration through an efficient reward system [41] Conclusion - Ling-flash-2.0 redefines the relationship between efficiency and capability in large models, emphasizing that intelligence is not solely dependent on scale but on the synergy of architecture, data, and training strategies [42][43][46]
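The 1/32 activation ratio behind the "6.1B active out of 100B total" arithmetic comes from sparse top-k expert routing. The expert count and k below are illustrative numbers chosen to reproduce that ratio, not Ling-flash-2.0's published configuration:

```python
# Sketch of sparse MoE routing: a router scores all experts per token,
# only the top-k are activated, so active parameters are a small fraction
# of the total. Expert counts and k below are illustrative, chosen to
# give a 1/32 activation ratio like the one described for Ling-flash-2.0.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts; renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}  # expert index -> gate weight

n_experts, k = 256, 8     # illustrative: 8 of 256 experts fire per token
print(k / n_experts)      # → 0.03125, i.e. a 1/32 activation ratio

gates = route_top_k([0.1 * i for i in range(n_experts)], k)
print(sorted(gates))      # indices of the activated experts
```

A shared expert, as mentioned above, would additionally process every token regardless of routing, which slightly raises the active-parameter count while reducing redundant specialization among the routed experts.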
Unexpectedly, the most thoroughly open-sourced large audio models come from Xiaohongshu
机器之心· 2025-09-17 09:37
Core Viewpoint - The article highlights the recent surge in open-source AI models in the audio domain, particularly by domestic companies in China, with a focus on the advancements made by Xiaohongshu in developing high-quality audio models and fostering an open-source community [1][4][22]. Summary by Sections Open Source Trends - In recent months, open-source has become a focal point in the AI community, especially among domestic tech companies, with 33 and 31 models being open-sourced in July and August respectively [1]. - The majority of these open-source efforts are concentrated in text, image, video, reasoning, and world models, while audio generation remains a smaller segment [1][2]. Xiaohongshu's Contributions - Xiaohongshu has maintained a steady rhythm of open-sourcing audio technologies since last year, releasing models like FireRedTTS for text-to-speech (TTS) and FireRedASR for automatic speech recognition (ASR), achieving state-of-the-art (SOTA) results [3][4]. - The open-sourcing of high-quality audio models enhances Xiaohongshu's technical influence and signals a long-term strategic commitment to open-source development [4][22]. Technical Achievements - Xiaohongshu's FireRedTTS model allows for flexible voice synthesis, enabling the imitation of various speaking styles with minimal training [6][9]. - FireRedASR has achieved a character error rate (CER) of 3.05%, outperforming other closed-source models [7][8]. - The new FireRedTTS-2 model addresses existing challenges in voice synthesis, providing superior solutions for long dialogue synthesis and achieving industry-leading performance in audio scene modeling [9][11]. Ecosystem Development - Xiaohongshu aims to build a comprehensive open-source community around audio models, covering TTS, ASR, and voice dialogue systems, thereby lowering industry entry barriers and fostering innovation [22][23]. 
- The introduction of FireRedChat, a fully open-source duplex voice dialogue system, represents a significant advancement, providing a complete solution for developers to create their own voice assistants [17][22]. Future Plans - Xiaohongshu plans to release additional models, including FireRedMusic and FireRedASR-2, to further enhance its audio technology stack and support a broader range of applications [22][26]. - The company is committed to establishing itself as a leader in the open-source audio domain, with a focus on creating industrial-grade, commercially viable models [23][26]. Industry Impact - The article emphasizes that open-source initiatives are reshaping the AI landscape, making advanced capabilities accessible to a wider audience and fostering a collaborative environment for innovation [25][26].
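The character error rate cited for FireRedASR (3.05%) is the character-level edit distance between the recognized text and the reference, normalized by the reference length. A minimal implementation:

```python
# Character error rate (CER), the metric behind FireRedASR's reported
# 3.05%: Levenshtein edit distance between the recognized text and the
# reference transcript, divided by the reference length.

def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One substituted character in a ten-character reference -> CER of 10%.
print(cer("今天天气真不错呀你说", "今天天气真不错呢你说"))  # → 0.1
```

For Chinese ASR, character-level scoring is the standard choice because there are no word boundaries in the raw text; a 3.05% CER means roughly one error per 33 characters.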
Tencent AI Lab pioneers the RL framework Parallel-R1, teaching large models "parallel thinking"
机器之心· 2025-09-17 09:37
Ever since Google Gemini attributed part of its success on math olympiad problems to "parallel thinking", teaching large models this ability to explore multiple reasoning paths in parallel has become a focus of academic attention. However, existing methods mostly rely on supervised fine-tuning (SFT): the model can only imitate pre-constructed parallel-thinking data and struggles to generalize to genuinely complex tasks; moreover, this approach is data-hungry, often requiring an elaborate data pipeline to construct the training set. To address these problems, researchers from Tencent AI Lab Seattle, the University of Maryland, Carnegie Mellon University, UNC Chapel Hill, City University of Hong Kong, Washington University in St. Louis, and other institutions (first author Tong Zheng is a PhD student at the University of Maryland; this work was completed during his internship at Tencent AI Lab Seattle) have created the Parallel-R1 framework — the first framework to teach large models parallel thinking on general mathematical reasoning tasks via reinforcement learning (RL). Through its novel "progressive curriculum" and "alternating reward" designs, the framework resolves the cold-start and reward-design difficulties of RL training. Experiments show that Parallel-R1 not only brings an average accuracy gain of up to 8.4% across multiple math benchmarks, but also, via a "mid-training scaffold" strategy, achieves a 42.9% performance leap on the AIME25 test ...
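The "alternating reward" idea — switching the reward signal between pure answer accuracy and accuracy plus a bonus for exhibiting parallel-thinking structure — might be sketched as follows. The phase schedule, template tokens, and reward magnitudes are illustrative assumptions, not the paper's exact recipe:

```python
# Illustrative sketch of an alternating reward schedule: some training
# phases reward only final-answer accuracy, others additionally reward
# the use of parallel exploration (here, emitting several <path> blocks).
# The period, tags, and reward shaping are assumptions for illustration.

def parallel_structure_bonus(output: str) -> float:
    """Reward outputs that actually explore several paths in parallel."""
    n_paths = output.count("<path>")
    return 1.0 if n_paths >= 2 else 0.0

def alternating_reward(step: int, correct: bool, output: str,
                       period: int = 2) -> float:
    accuracy = 1.0 if correct else 0.0
    if (step // period) % 2 == 0:
        return accuracy                                  # accuracy-only phase
    return accuracy + parallel_structure_bonus(output)   # structure phase

out = "<path>try substitution</path><path>try induction</path>"
print(alternating_reward(0, True, out))  # accuracy-only phase → 1.0
print(alternating_reward(2, True, out))  # structure phase → 2.0
```

The motivation for alternating rather than always summing both terms is to keep the structure bonus from being gamed: the model must stay accurate in every phase and can only profit from parallel exploration when it genuinely helps.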
Is the "AI assistant" really here? Google leads the push for AP2, a payment protocol for agents
机器之心· 2025-09-17 09:37
Core Viewpoint - Google has launched the Agent Payments Protocol (AP2), an open shared protocol designed to facilitate secure and compliant transactions between agents and merchants, providing a common language for these interactions [2][10]. Summary by Sections Introduction of AP2 - AP2 serves as an extension of the A2A and MCP protocols, enhancing the capabilities of AI agents in processing payments across platforms [5][7]. - The protocol addresses the need for intelligent interactions among multiple agents, moving beyond manual operations to a more automated and integrated approach [6]. Key Issues Addressed by AP2 - AP2 focuses on three main issues: authorization, authenticity, and accountability in transactions initiated by agents [9]. - It aims to ensure that transactions are secure and that users' intentions are accurately represented, while also establishing clear accountability in case of fraud or errors [8][10]. Operational Mechanism - The protocol utilizes mandates (authorization documents) to build trust, which are tamper-proof, encrypted digital contracts serving as verifiable proof of user instructions [12]. - These mandates create an audit trail from user intent to payment, addressing key concerns of authorization and authenticity [13]. Practical Applications - AP2 enables a new business model in the AI era, allowing agents to interact with various service providers seamlessly. For example, a user can instruct an agent to book travel arrangements within a specified budget, and the agent can execute transactions across multiple platforms [14]. - Google has partnered with over 60 companies, including major players like American Express, Alibaba, and PayPal, to implement this protocol [14]. Technical Implementation - The project is publicly available on GitHub, including technical specifications and reference implementations, facilitating broader adoption and integration [15][24]. 
- The protocol supports various payment types, ensuring a consistent and secure experience for users and merchants alike [10].
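The "mandate" concept — a tamper-evident, signed record of the user's instruction that later anchors the audit trail — can be illustrated with a simple HMAC-signed payload. This is a conceptual sketch only; AP2's actual wire format, key management, and cryptography differ:

```python
# Conceptual sketch of an AP2-style mandate: the user's instruction is
# serialized canonically and signed, so any later tampering is detectable
# and the record serves as verifiable proof of what the user authorized.
# AP2's real format and crypto differ; this only illustrates the idea.
import hashlib
import hmac
import json

SECRET = b"user-device-key"  # stand-in for a real signing key

def issue_mandate(intent: dict) -> dict:
    payload = json.dumps(intent, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"intent": intent, "signature": sig}

def verify_mandate(mandate: dict) -> bool:
    payload = json.dumps(mandate["intent"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, mandate["signature"])

m = issue_mandate({"task": "book flight", "budget_usd": 700})
print(verify_mandate(m))            # → True: intact mandate verifies
m["intent"]["budget_usd"] = 7000    # an agent (or attacker) tampers
print(verify_mandate(m))            # → False: tampering is detected
```

This is how a signed instruction record can settle AP2's three questions at once: the signature proves authorization, canonical serialization pins down the authentic intent, and the verifiable trail assigns accountability when something goes wrong.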
The LLM open-source 2.0 reshuffle: 60 projects out, 39 take a seat, AI Coding runs wild, and TensorFlow is dead
机器之心· 2025-09-17 04:00
Core Insights - The article discusses the significant changes in the open-source AI model ecosystem, highlighting a shift towards a more competitive and rapidly evolving landscape, particularly in the AI Agent and Model Serving sectors [4][9][61]. Group 1: Ecosystem Changes - The latest version of the open-source landscape includes 114 projects, a decrease of 21 from the previous version, with 39 new projects and 60 projects that have disappeared, indicating a significant reshuffling in the ecosystem [7][10]. - The average lifespan of projects in the AI model ecosystem is only 30 months, with 62% of projects emerging after the "GPT moment" in October 2022, showcasing a high turnover rate [10][11]. - TensorFlow has been overtaken by PyTorch, which now dominates the landscape, marking a dramatic shift in the competitive dynamics [8]. Group 2: Key Trends - The article identifies three main areas of focus: AI Coding, Model Serving, and LLMOps, which are emerging as the primary tracks in the evolving landscape [29][61]. - AI Coding has transitioned from merely assisting in code writing to becoming a comprehensive lifecycle engine, indicating a significant increase in its capabilities and market potential [43][44]. - The AI Data sector remains relatively stable but is expected to evolve as new challenges arise in the native large model era, suggesting a potential for future growth [82][88]. Group 3: Global Contributions - The United States and China contribute over 55% of the total developer population in the open-source AI space, with the U.S. leading at 37.41% [17][20]. - In specific areas, the U.S. has a dominant position in AI Infrastructure and AI Data, with contributions significantly higher than those from China [19][23]. Group 4: Licensing Trends - There is a noticeable trend towards more restrictive open-source licenses, with many new projects adopting custom agreements that allow for greater control by the license holders [90][92]. 
- This shift raises questions about the definition of "open source" in the current competitive environment, as some projects that are popular on platforms like GitHub are not fully open-source [94].
Breaking the ceiling of single-chain thinking: a Tsinghua team proposes a native "parallel thinking" scaling paradigm
机器之心· 2025-09-17 00:07
Core Insights - The article discusses the advancements in large language models (LLMs) in complex reasoning tasks, emphasizing the limitations of current sequential reasoning strategies and introducing a new paradigm called "Native Parallel Thinking" to overcome these challenges [2][4][6]. Group 1: Bottlenecks and Challenges - The performance improvement of LLMs has stagnated despite increased computational resources, indicating a scaling bottleneck in sequential reasoning [3][10]. - The phenomenon known as "Tunnel Vision" restricts LLMs to suboptimal reasoning paths once they make an initial flawed decision, making it difficult to correct or discover better solutions later [6][12]. Group 2: Native Parallel Thinking - The research proposes a framework called ParaThinker, which enables LLMs to generate and integrate multiple reasoning paths simultaneously, thus avoiding the "Tunnel Vision" issue and unlocking their potential reasoning capabilities [14][29]. - ParaThinker is designed to train LLMs to explore diverse reasoning paths in a single forward propagation process, leading to higher quality final answers [14][29]. Group 3: Innovations in ParaThinker - ParaThinker incorporates three core innovations: 1. Specialized controllable tokens to guide the model in opening independent thinking paths [18]. 2. Unique thought embeddings to maintain clarity in the source of information during the integration phase [18]. 3. Two-stage attention masks to ensure independence during parallel reasoning and allow global attention during the summarization phase [18]. Group 4: Experimental Results - Experiments show that using 8 parallel paths with a 1.5 billion parameter model leads to an average accuracy improvement of 12.3%, while a 7 billion parameter model shows a 7.5% improvement [23]. - The results indicate that the accuracy increases with the breadth of thinking, demonstrating the effectiveness of parallel reasoning [22][23]. 
Group 5: Comparison with Majority Voting - ParaThinker can be combined with majority voting strategies to achieve even higher accuracy, showcasing its compatibility with existing methods [26][28]. - The integration of ParaThinker with majority voting allows for a more robust approach to reasoning tasks, enhancing overall performance [26][28].
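The two-stage attention masks from Group 3 can be pictured concretely: during reasoning, tokens attend only within their own path (a block-diagonal mask), while during summarization the final tokens attend across all paths. A toy construction, with sizes chosen purely for illustration:

```python
# Toy construction of ParaThinker-style two-stage attention masks:
# stage 1 is block-diagonal (each reasoning path attends only to itself),
# stage 2 lets summarization tokens attend to every path's tokens.
# Sizes are illustrative; 1 = attention allowed, 0 = masked out.

def reasoning_stage_mask(n_paths: int, path_len: int):
    """Block-diagonal mask: token i sees token j iff they share a path."""
    size = n_paths * path_len
    return [[1 if i // path_len == j // path_len else 0
             for j in range(size)] for i in range(size)]

def summarization_mask(n_paths: int, path_len: int, summary_len: int):
    """Summary tokens see all path tokens, plus themselves causally."""
    size = n_paths * path_len
    return [[1] * size + [1 if j <= i else 0 for j in range(summary_len)]
            for i in range(summary_len)]

m1 = reasoning_stage_mask(n_paths=2, path_len=3)
print(m1[0])  # token 0 (path 1) sees only path-1 tokens → [1, 1, 1, 0, 0, 0]
m2 = summarization_mask(n_paths=2, path_len=3, summary_len=2)
print(m2[0])  # first summary token sees all 6 path tokens plus itself
```

The block-diagonal stage is what makes the paths genuinely independent within a single forward pass, and the full-attention stage is what lets the model integrate them into one answer rather than averaging them externally.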