Waver
HKU and ByteDance Propose JoVA: A Joint Video-Audio Generation Model Based on Joint Self-Attention
机器之心 (Jiqizhixin) · 2025-12-29 23:36
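About the author: The first author of this paper, Xiaohu Huang, is a third-year PhD student at the University of Hong Kong, advised by Professor Kai Han. His research centers on video, including audio-video generation, video understanding, and video recognition.

Joint video-audio generation has recently drawn wide attention in both the open-source and closed-source communities, with a key research focus on how to generate audio and video that are aligned with each other.

Recently, a research team from the University of Hong Kong and ByteDance proposed JoVA, a simple and effective framework in which video and audio tokens interact directly across modalities within the attention module of a single Transformer. To address lip-speech synchronization when a person is speaking, JoVA introduces a mouth-area specific loss based on facial keypoint detection.

Experiments show that JoVA, trained on only about 1.9 million samples, reaches an advanced level in lip-sync accuracy, speech quality, and overall generation fidelity.

Project page: https://visual-ai.github.io/jova/
Paper: https://arxiv.org/abs/2512.13677

I. Research Background and Motivation

Current open-source solutions generally fall into two categories. One is the "cascaded" approach, which either generates the video first and then dubs it, or generates the speech first and then uses it to drive video generation. This approach ...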
LatePost Exclusive | Aishi Technology Completes New RMB 100 Million Series B+ Round, ARR Surpasses USD 40 Million
晚点LatePost · 2025-10-17 07:29
Core Insights
- The article discusses the competitive landscape of AI video generation, highlighting the rapid growth and potential of companies like Aishi Technology and OpenAI's Sora [5][7][11].

Company Developments
- Aishi Technology has completed a B+ round financing of 100 million RMB, bringing its total funding to over 100 million USD since its establishment in April 2023 [5].
- Aishi's products, PixVerse and Pai Wo AI, have over 100 million total users and a monthly active user count exceeding 16 million, with an annual recurring revenue (ARR) of 40 million USD [5].
- OpenAI launched the Sora 2 video generation model and Sora App, which quickly topped the US App Store free chart and surpassed 1 million downloads in less than two weeks [8][13].

Market Dynamics
- The video generation app market is vast, with existing tools unable to cover all users, as evidenced by TikTok and Douyin's monthly active users exceeding 2 billion [9].
- Aishi's CEO noted that the emergence of AI is reshaping content consumption, similar to the impact of short videos [8].
- Despite Sora's rapid growth, Aishi's PixVerse has not been negatively impacted, indicating a large market capacity for multiple players [9].

Competitive Landscape
- The current leading models in video generation are dominated by Chinese companies, with Kuaishou's Kling, Aishi's PixVerse, and MiniMax ranking in the top three, while Sora ranks 31st [11].
- ByteDance's video generation models, Seedance and Waver, are also strong competitors, with significant daily active user growth targets [12].
- The competition in the multi-modal field is intensifying, driven by the enormous consumer and entertainment potential [13].