机器之心
Xiaohongshu Releases FireRedChat: The First Full-Duplex LLM Voice Interaction System Supporting Private Deployment
机器之心· 2025-10-02 03:12
Online demo: https://fireredteam.github.io/demos/firered_chat Open-source code: https://github.com/FireRedTeam/FireRedChat Xiaohongshu's Intelligent Creation Audio team has released FireRedChat, the industry's first full-duplex LLM voice interaction system that supports private deployment, directly targeting pain points such as high latency, noise sensitivity, poor controllability, and reliance on external APIs. Its self-developed streaming pVAD (personalized voice activity detection for barge-in) and EoT (semantic end-of-turn detection) make voice interaction more natural; the debut ships both a cascaded and a semi-cascaded implementation, with end-to-end latency approaching industrial-grade applications. Fully open-sourced and deployable on private infrastructure, it aims to build a voice AI that truly "senses warmth, empathizes, and expresses itself". Built on a complete "interaction controller + interaction modules + dialogue manager" architecture, FireRedChat upgrades any half-duplex pipeline to full duplex in one step; it integrates in-house core models including the streaming personalized barge-in model pVAD, the semantic end-of-turn model EoT, FireRedTTS-1s, FireRedASR, and FireRedTTS2, and offers two end-to-end deployment options, cascaded and semi-cascaded, covering needs from "stable and easy to deploy" to "warmer and more expressive", significantly improving real-time performance and robustness ...
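The "interaction controller" role can be illustrated with a toy turn-taking state machine. This is a minimal sketch under stated assumptions, not FireRedChat's implementation: the `vad_active` and `end_of_turn` flags stand in for the (not publicly detailed) pVAD and EoT model outputs, and the event names are hypothetical.

```python
from enum import Enum, auto

class Turn(Enum):
    IDLE = auto()       # waiting for user speech
    LISTENING = auto()  # user is speaking, ASR is streaming
    SPEAKING = auto()   # assistant TTS is playing

class DuplexController:
    """Toy full-duplex turn controller arbitrating ASR, TTS, and barge-in."""

    def __init__(self):
        self.state = Turn.IDLE
        self.events = []

    def step(self, vad_active: bool, end_of_turn: bool):
        if self.state == Turn.IDLE and vad_active:
            self.state = Turn.LISTENING
            self.events.append("start_asr")
        elif self.state == Turn.LISTENING and end_of_turn:
            self.state = Turn.SPEAKING
            self.events.append("start_tts")
        elif self.state == Turn.SPEAKING and vad_active:
            # Barge-in: the user interrupts assistant playback.
            self.state = Turn.LISTENING
            self.events.append("stop_tts_start_asr")
        return self.state
```

Feeding the controller per-frame flags shows the barge-in path: while TTS is playing, a positive voice-activity decision stops playback and reopens ASR, which is what distinguishes a full-duplex loop from a half-duplex one.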
Dreams have everything? Google's new world model trains purely in "imagination" and learns to mine diamonds in Minecraft
机器之心· 2025-10-02 01:30
To solve complex tasks in embodied environments, agents need a deep understanding of the world and the ability to choose successful actions. World models offer a promising path to this goal by learning to predict the future outcomes of potential actions from the perspective of the agent (such as a robot or a video-game player). In this way, world models give agents a deep understanding of the world and the ability to select actions by planning or reinforcement learning in imagination. Moreover, world models can in principle be learned from fixed datasets, allowing agents to be trained purely in imagination without online interaction. Optimizing behavior offline is valuable for many practical applications, such as robots in the physical world, where online interaction with an undertrained agent is often unsafe. World-model agents such as Dreamer 3 are among the best-performing and most robust reinforcement learning algorithms to date in games and robotics. While these models are fast and accurate in their specific narrow environments, their architectures lack the capacity to fit complex real-world distributions. Controllable video models such as Genie 3 have been trained on diverse real videos and games and achieve varied scene generation and simple interaction. These models build on scalable architectures such as diffusion transformers. However, they still struggle to learn the precise physics of object interactions and game mechanics, which limits their usefulness for training successful agents. Moreover, they ...
Does Sora 2 crush Veo 3? An exhaustive head-to-head test: it can do stand-up comedy in Chinese but fumbles gymnastics; working invite codes included
机器之心· 2025-10-01 07:26
Core Viewpoint - The article discusses the advancements of Sora 2, an AI video and audio generation model, highlighting its superior physical accuracy, realism, and controllability compared to its predecessor and competitors like Google's Veo 3 [1][6][7]. Comparison with Veo 3 - Sora 2 can generate up to 20 seconds of 1080p video, positioning it as a strong competitor to Veo 3 [7]. - The audio generation capabilities of Sora 2 are noted to be superior to those of Veo 3 [9]. - Sora 2's video generation avoids issues like object disappearance and distortion, which were present in the previous version [5][9]. - Users can access Sora 2 through a web platform or an iOS app, both requiring an invitation and a US IP address [11][12]. Performance Testing - In various tests, Sora 2 demonstrated impressive capabilities in generating realistic videos, including ASMR and singing performances, with accurate audio-visual synchronization [20][22]. - However, both Sora 2 and Veo 3 struggled with generating gymnastics videos, resulting in unrealistic movements [28][33]. - Sora 2 outperformed Veo 3 in generating fake news segments, providing a more dynamic presentation [24][25]. User Experience and Accessibility - The Sora iOS app mimics popular social media platforms like TikTok, featuring a recommendation algorithm and options for user interaction [44]. - OpenAI has implemented safety measures, including watermarks and restrictions on deepfakes of public figures, to prevent misuse of the technology [35]. Market Position and Competition - The article suggests that while OpenAI's Sora 2 has established a product barrier, competition remains fierce in the AI video generation space, with other companies like Meta and domestic platforms also advancing their offerings [46][47].
A CUDA kernel god and the world's strongest GPU programmer? Who is this behind-the-scenes master at OpenAI?
机器之心· 2025-09-30 23:49
机器之心 report | Editor: +0. In the AI world, the spotlight always chases stars with glittering resumes. But a great team has not only front-stage stars, it also has countless heroes contributing critical work behind the scenes. We previously profiled two Polish engineers at OpenAI; recently, another behind-the-scenes OpenAI engineer has become the focus. It started with a popular post on X claiming that OpenAI supports trillions of computations every day on the strength of critical CUDA kernels written by a single engineer. Commenters widely speculated that this master is OpenAI senior engineer Scott Gray. Why would an engineer who writes CUDA kernels attract so much attention? Because writing high-performance CUDA kernels for model training is an extremely specialized skill: the developer must simultaneously master three deep fields, parallel computing theory, GPU hardware architecture, and deep learning algorithms. Top talent fluent in all three is vanishingly rare. Most developers stay at the application layer, using off-the-shelf tools. Somewhat more people work on inference optimization, where the problem boundaries are clearer. But going down to the lowest level and hand-writing CUDA kernels from scratch for the complex training process (especially backpropagation) that outperform existing libraries like cuDNN requires a grandmaster-level understanding of algorithms, parallel computing, and hardware. And Scott Gray's ...
Sora 2 lands late at night, OpenAI ships an app outright: video's ChatGPT moment has arrived
机器之心· 2025-09-30 23:49
Core Insights - OpenAI has quietly launched Sora 2, a new product that directly enters the video generation space, similar to the impact of ChatGPT in the language model domain [1][8][12] - Sora 2 is designed to enhance physical accuracy, realism, and controllability in video generation, outperforming previous systems [5][12][14] - The introduction of a new iOS app, Sora, allows users to create and share videos, incorporating a feature called "cameos" for high-fidelity personal representation [19][25] Product Features - Sora 2 demonstrates significant advancements in simulating complex physical actions, such as Olympic gymnastics and dynamic buoyancy [12][13] - The model improves upon previous video generation systems by adhering more closely to physical laws, allowing for realistic failure simulations [13][17] - Sora 2 supports complex multi-shot instructions and excels in various styles, including realistic, cinematic, and anime [14] User Engagement and Safety - The Sora app includes a recommendation algorithm that prioritizes user control over content consumption, aiming to mitigate issues related to addiction and isolation [21][22] - OpenAI emphasizes the importance of user agency in content creation and consumption, with built-in mechanisms for users to manage their experience [22] - The app is designed to foster creativity rather than consumption, addressing safety concerns related to content generation and usage rights [22][23] Availability and Future Plans - The Sora iOS app is currently available for download in the US and Canada, initially free with relaxed computational limits [25] - OpenAI plans to release the Sora 2 Pro model for ChatGPT Pro users and intends to make Sora 2 available via API in the future [25]
Heavyweight release from Fudan, Tongji, CUHK and others: a comprehensive survey of reinforcement learning across the full LLM lifecycle
机器之心· 2025-09-30 23:49
In recent years, training methods centered on reinforcement learning have markedly improved the reasoning ability and alignment of large language models (LLMs), with especially strong results in understanding human intent, following user instructions, and strengthening reasoning. Although existing surveys give overviews of RL-enhanced LLMs, their coverage is limited and they do not comprehensively summarize how reinforcement learning operates across the full LLM lifecycle. In response, researchers from top institutions including Fudan University, Tongji University, Lancaster University, and the MM Lab at The Chinese University of Hong Kong have comprehensively summarized the latest reinforcement learning research across the full LLM lifecycle in a long-form survey titled "Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle", systematically reviewing recent progress, examining open research challenges, and charting future directions. Paper title: Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Acr ...
Thinking Machines publishes another high-quality blog post: making the case that LoRA matches full fine-tuning
机器之心· 2025-09-30 10:38
Core Insights - The article emphasizes the advantages of LoRA (Low-Rank Adaptation) over Full Fine-tuning (FullFT) in terms of cost-effectiveness and performance in various training scenarios [2][7][18]. Group 1: Importance of LoRA - LoRA is a popular parameter-efficient fine-tuning method that updates a low-dimensional adapter instead of the entire model weights, leading to lower memory requirements and faster loading [11][13]. - The research indicates that LoRA can achieve performance comparable to FullFT in small to medium-sized datasets, while it may struggle in large datasets due to capacity limitations [14][22]. Group 2: Key Findings - The study found that LoRA's performance is closely tied to the training conditions, including the size of the training dataset and the rank of the LoRA parameters [16][25]. - In reinforcement learning tasks, even with a very low rank (rank=1), LoRA can perform similarly to FullFT, indicating that reinforcement learning has lower capacity demands [29]. Group 3: Experimental Methodology - The research utilized models like LLaMA 3 and Qwen3, adjusting LoRA ranks from 1 to 512 and scanning learning rates to find optimal training conditions [20][21]. - Results showed that high-rank LoRA performed almost identically to FullFT in certain datasets, but performance varied across different tasks due to training dynamics [22][24]. Group 4: Practical Implications - LoRA's optimal learning rate is typically about 10 times that of FullFT, allowing it to accept higher learning rates under the same conditions [35]. - The study suggests that applying LoRA across all layers, especially MLP and MoE layers, is crucial for achieving performance close to FullFT [37].
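The core LoRA update rule is easy to state in code. Below is a minimal, dependency-free sketch of a LoRA forward pass (the `lora_forward` name and the tiny list-of-lists matrices are illustrative, not the blog's implementation): the frozen weight `W` is augmented by a low-rank product `A·B` scaled by `alpha/r`, and because `B` is initialized to zero, training starts exactly at the pretrained model.

```python
def matmul(A, B):
    # Naive matrix multiply over lists of lists; fine for a sketch.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """y = x*W + (alpha/r) * x*A*B; only A (d x r) and B (r x k) are trained."""
    base = matmul(x, W)               # frozen pretrained path
    delta = matmul(matmul(x, A), B)   # rank-r adapter path
    s = alpha / r
    return [[bb + s * dd for bb, dd in zip(br, dr)]
            for br, dr in zip(base, delta)]
```

Since only `A` and `B` (of total size `d*r + r*k` instead of `d*k`) receive gradients, optimizer state and adapter checkpoints shrink accordingly, which is the memory and loading advantage the post describes.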
CAIR open-sources the ultrasound foundation model EchoCare "Lingyin", topping performance on more than 10 medical tasks
机器之心· 2025-09-30 08:45
Core Insights - The article discusses the launch of EchoCare's "Lingyin" ultrasound foundation model, which has been trained on over 4.5 million ultrasound images covering more than 50 human organs, achieving superior performance in various ultrasound medical tasks [2][28] - The model addresses significant challenges in ultrasound AI, including reliance on large labeled datasets and the ability to handle diverse clinical scenarios, marking a milestone in the integration of AI and clinical medicine [2][24] Group 1: Model Development and Performance - "Lingyin" has undergone clinical validation with over 3,000 cases across multiple hospitals, showing an average performance improvement of 3% to 5% compared to current state-of-the-art models [2][28] - The model's innovative structured contrastive self-supervised learning framework effectively resolves traditional ultrasound AI challenges, such as data dependency and model generalization issues [2][24] - The model's architecture includes a hierarchical dual-branch design that aligns with clinical diagnostic logic, enhancing its ability to interpret ultrasound images and structures [12][13] Group 2: Data and Training Innovations - The EchoAtlas dataset, which is the largest ultrasound image dataset globally, was created by integrating 138 high-quality datasets from various sources, ensuring diversity in demographics and anatomical structures [10][24] - The model employs a two-stage training strategy that maximizes the value of unlabeled data, allowing for efficient adaptation to new tasks with only 40%-60% of the original training data required [14][21] Group 3: Clinical Applications and Advantages - "Lingyin" demonstrates high accuracy in key clinical tasks, such as thyroid nodule segmentation and disease diagnosis, with metrics like AUC reaching 86.48% for thyroid malignancy discrimination [17][18] - The model significantly reduces the time for fetal heart-to-chest ratio measurement from 5 minutes to 2 seconds, enhancing efficiency in congenital heart disease screening [19][21] - It achieves a high level of clinical adaptability, processing single images in under 0.5 seconds and generating visual results that assist physicians in drafting reports [22][28] Group 4: Future Directions and Industry Impact - The article outlines the potential for "Lingyin" to evolve into a comprehensive clinical decision-making partner, moving from image analysis to proactive decision support in healthcare [26][29] - Future improvements are suggested, including the integration of multimodal data and enhanced capabilities for processing dynamic sequences, which could further advance ultrasound AI applications [25][26] - The open access to the EchoAtlas dataset and model code is expected to break down barriers in the ultrasound AI field, encouraging broader participation in innovation and research [29][30]
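The article does not spell out EchoCare's "structured contrastive" objective, so as a reference point, here is a generic InfoNCE contrastive loss, the usual starting point for contrastive self-supervised pretraining, in a minimal dependency-free sketch (function and variable names are illustrative, not the paper's):

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor:
    -log( exp(sim(a,p)/tau) / sum_j exp(sim(a,x_j)/tau) )
    where x_j ranges over the positive and all negatives.
    Assumes nonzero vectors (cosine similarity is undefined otherwise).
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    logits = [cos(anchor, positive) / tau] + \
             [cos(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

Minimizing this loss pulls an image embedding toward its augmented view (the positive) and pushes it away from other images (the negatives), which is how unlabeled scans can be exploited before any fine-tuning on labeled tasks.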
NeurIPS 2025 Spotlight | FSDrive unifies VLA and world models, pushing autonomous driving toward visual reasoning
机器之心· 2025-09-30 08:45
Multimodal large models for autonomous driving mostly use text or symbols as the medium of their "reasoning chain", which tends to blur spatial-temporal relations and lose fine-grained information. FSDrive (FutureSightDrive) proposes a "Spatio-Temporal Chain-of-Thought" (visual CoT) that lets the model "think directly in images", using a unified future image frame as the intermediate reasoning step and performing visualized reasoning jointly over future scenes and perception results. Without modifying the original MLLM architecture, the method activates image-generation capability through "vocabulary expansion + autoregressive visual generation" and injects physical priors through a progressive, easy-to-hard visual CoT. The model thus serves both as a "world model" that predicts the future and as an "inverse dynamics model" for trajectory planning. Thanks to their world knowledge and interpretable reasoning, multimodal large language models (MLLMs) are rapidly entering the end-to-end "vision-language-action" (VLA) paradigm for autonomous driving. But existing approaches mostly rely on discrete text CoT (e.g., rule descriptions, coordinates), which is essentially a heavy symbolic compression of visual information, suffering from a cross-modal semantic gap and insufficient representation of spatio-temporal relations. Project page: https://miv-xjtu.github.io/FSDrive.github.io/ Paper link: https://arxiv.org/abs/25 ...
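"Vocabulary expansion" here means appending discrete visual-codebook tokens to the text vocabulary so that a single autoregressive head can emit image tokens alongside words. A minimal sketch of the bookkeeping involved (the `<img_i>` token naming is hypothetical, not FSDrive's actual scheme):

```python
def extend_vocab(vocab, n_image_tokens):
    """Append n_image_tokens visual tokens after the existing text ids.

    vocab: dict mapping token string -> integer id (ids 0..len-1).
    Returns a new dict; in a real model, the embedding table and output
    head would be resized to the new vocabulary size to match.
    """
    extended = dict(vocab)
    base = len(vocab)
    for i in range(n_image_tokens):
        extended[f"<img_{i}>"] = base + i
    return extended
```

After the extension, "generating a future frame" reduces to the same next-token prediction the MLLM already performs, just over the appended image-token range, which is why the original architecture needs no changes.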
Pre-holiday blockbuster: a new open-source flagship SOTA as Zhipu's GLM-4.6 debuts
机器之心· 2025-09-30 08:45
Core Insights - The article discusses the release of the new flagship model GLM-4.6 by Zhipu AI, which showcases significant advancements in performance and capabilities compared to its predecessor and competitors [4][16][59] Model Performance Enhancements - GLM-4.6 has achieved comprehensive improvements in various aspects, including advanced coding capabilities, increased context length from 128K to 200K, and enhanced reasoning abilities [15][16] - The model outperformed Claude Sonnet 4 and other domestic models in 74 real-world programming tasks, demonstrating its superior coding performance [23] - It has shown a reduction of over 30% in average token consumption compared to GLM-4.5, making it the most efficient model in its category [27] Technical Innovations - GLM-4.6 is compatible with domestic AI hardware, having implemented FP8+Int4 mixed-precision quantization on Cambricon chips, which significantly lowers inference costs while maintaining accuracy [30] - The model can also run on the new generation of GPUs from Moore Threads using the vLLM inference framework [31] Practical Applications - The model has been tested in various scenarios, including generating a playable game and creating a dynamic visualization of the solar system, showcasing its coding and analytical capabilities [35][40] - It can be integrated into programming environments like Claude Code, allowing for iterative optimization of code [46] Research and Content Creation - GLM-4.6 has demonstrated its ability to conduct in-depth research, generating comprehensive reports on topics such as former OpenAI employees and their ventures, indicating its potential as a research assistant [50][52] - The model's capabilities extend to full-stack development, where it can plan and execute projects autonomously, reflecting a human-like thought process [54] Overall Assessment - The advancements in GLM-4.6 position it as a leading model in the global open-source AI landscape, setting new benchmarks in technical architecture, performance, and cost-effectiveness [59]
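Int4 weight quantization of the kind mentioned above can be sketched in a few lines. This is a generic symmetric per-tensor scheme for illustration only, not Zhipu's or Cambricon's actual FP8+Int4 mixed-precision recipe:

```python
def quantize_int4(weights):
    """Symmetric per-tensor int4: map floats onto integers in [-8, 7].

    Assumes at least one nonzero weight (otherwise the scale would be 0).
    """
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    # Per-weight reconstruction error is at most about scale / 2.
    return [v * scale for v in q]
```

Storing 4-bit integers plus a single scale per tensor cuts weight memory roughly 4x versus FP16 at the cost of a bounded rounding error, which is the inference-cost trade-off the article alludes to.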