Two-Stage Training Strategy
Camera motion error down 40%! DualCamCtrl fits video generation with a "depth camera" to make camera moves more obedient
机器之心· 2025-12-21 04:21
The co-first authors of this work are Zhang Hongfei (research assistant) and Chen Kanghao (PhD student) of EnVision Research at the Hong Kong University of Science and Technology (Guangzhou); both are advised by Professor Chen Yingcong.

Does your generative model really "understand geometry", or is it merely pretending to follow the camera trajectory?

Many current video generation models claim "camera motion control", but their control signal typically depends only on camera pose. Although recent work encodes motion information through per-pixel ray directions (Ray Condition), the model must still infer 3D structure implicitly, so it fundamentally lacks an explicit geometric understanding of the scene. This limitation leads to inconsistent camera motion: constrained by the coupling of appearance and structure within a single representation, the model cannot fully capture the scene's underlying geometry.

To address these challenges, a research team from the Hong Kong University of Science and Technology, Fudan University, and other institutions proposes DualCamCtrl, a new end-to-end geometry-aware diffusion framework. Targeting existing methods' shortcomings in scene understanding and geometric awareness, the work introduces a dual-branch diffusion architecture that simultaneously generates RGB and depth sequences consistent with the camera motion. To further enable efficient cooperation between the RGB and depth modalities, DualCamCtrl proposes a Semantic Guided Mutual Alignment mechanism, which uses semantic information as guidance, ...
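The excerpt describes the architecture only at a high level, so the following is a minimal PyTorch sketch of one plausible reading of a dual-branch block with semantic-guided mutual alignment: each modality self-attends, then cross-attends to the other branch with semantic tokens appended as guidance. All module names, shapes, and the exact alignment scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of one dual-branch block; illustrative only.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """RGB and depth branches coupled by semantic-guided cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.rgb_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, depth, semantic):
        # rgb, depth: (batch, tokens, dim) spatio-temporal tokens;
        # semantic: (batch, s_tokens, dim) guidance tokens.
        rgb = rgb + self.rgb_self(rgb, rgb, rgb)[0]
        depth = depth + self.depth_self(depth, depth, depth)[0]
        # Mutual alignment: each branch queries the other branch, with the
        # semantic tokens concatenated into the keys/values as guidance.
        kv_depth = torch.cat([depth, semantic], dim=1)
        kv_rgb = torch.cat([rgb, semantic], dim=1)
        rgb = rgb + self.rgb_from_depth(rgb, kv_depth, kv_depth)[0]
        depth = depth + self.depth_from_rgb(depth, kv_rgb, kv_rgb)[0]
        return rgb, depth
```

A full model would stack such blocks inside the video diffusion backbone, condition both branches on camera pose (for example, ray embeddings), and denoise the RGB and depth sequences jointly.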
An open-source reproduction of o3's image thinking! Kuaishou frees AI from passively looking at images: the model autonomously generates code to invoke tools
量子位· 2025-08-21 04:23
Contributed by the Kwai Keye team to 量子位 | WeChat account QbitAI

After OpenAI released o3, its "think with image" capability drew broad attention from industry and academia. The Kwai Keye team proposes Thyme (Think Beyond Images), a new paradigm, and builds a complete technical stack around it, aiming to move past the limits of existing methods and give open-source models a stronger, more autonomous, and more broadly capable way to "think beyond images".

Its main contributions can be summarized as follows:

A new multimodal interaction paradigm, Thyme:
- Core idea: instead of passively "looking at" images, a multimodal large model actively generates and executes code to invoke tools for complex image processing and mathematical computation (see the loop sketch after this list).
- Rich functionality: the model can crop, rotate, zoom, enhance contrast, and perform other image operations on the fly, and can also work through complex math problems.
- High autonomy: the model decides on its own when a tool is needed and which one to use, and dynamically generates the code to run it, with no task-specific human intervention.

An efficient two-stage SFT + RL training strategy:
- Supervised fine-tuning (SFT) stage: a carefully constructed dataset of roughly 500,000 high-quality samples quickly teaches the model to generate code for the various operations. This stage takes only about 200 GPU hours, which is highly cost-effective.
- Reinforcement learning ...
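The paradigm above reduces to an agent loop: the model writes code, a sandbox executes it, and the output is appended to the context for the next round of reasoning. Here is a minimal sketch of such a loop under stated assumptions: the `model.generate` interface, the `<code>` tags as the tool-call marker, and the bare `exec` sandbox are illustrative stand-ins, not Thyme's actual implementation.

```python
# Hypothetical "think beyond images" loop; all interfaces are assumed.
import contextlib
import io
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_sandboxed(code: str) -> str:
    """Execute model-generated code and capture its stdout.

    A production system would isolate this properly; bare exec() is used
    here only to keep the sketch self-contained (it is NOT safe).
    """
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"[execution error] {exc}"
    return buf.getvalue()

def thyme_loop(model, prompt: str, max_turns: int = 4) -> str:
    """Alternate model reasoning with tool execution until a final answer."""
    transcript = prompt
    for _ in range(max_turns):
        reply = model.generate(transcript)        # assumed model interface
        transcript += reply
        match = CODE_RE.search(reply)
        if match is None:                         # no code emitted: done
            return reply
        result = run_sandboxed(match.group(1))    # e.g. PIL crop/rotate, math
        transcript += f"\n[tool output]\n{result}\n"
    return transcript
```

The key design point is that the tool set is open-ended: because the model emits arbitrary code rather than picking from a fixed tool list, cropping, rotating, zooming, contrast enhancement, and numeric computation all flow through the same channel.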
Chart reasoning with chain-of-thought supervision and reinforcement: a 7B model rivals large closed-source models
机器之心· 2025-08-01 04:23
Core Viewpoint
- The article discusses the emergence of the Chart-R1 model developed by the DocTron team, which utilizes a chain-of-thought supervision and reinforcement learning approach to enhance chart reasoning capabilities, particularly in complex multi-step numerical reasoning tasks [2][20].

Innovation and Technical Breakthroughs
- The Chart-R1 model introduces a novel procedural data synthesis technique that generates high-quality reasoning data, resulting in the creation of the ChartRQA dataset containing 258,000 multi-step reasoning samples, ensuring data diversity and authenticity [7][22].
- The model employs a unique two-stage training strategy that utilizes different datasets for each stage, preventing the degradation of the model's exploratory capabilities during reinforcement learning [10][22].

Experimental Results and Performance
- Chart-R1 demonstrates superior performance across various public benchmark tests and the self-constructed ChartRQA dataset, outperforming existing chart-domain methods and rivaling large closed-source models such as GPT-4o and Claude-3.5 in multiple tasks [16][20].
- In complex chart reasoning tasks, while existing visual language models show significant performance drops, Chart-R1 maintains a consistently high level of performance, highlighting its effectiveness in complex reasoning scenarios [17][20].

Research Significance and Application Prospects
- The research not only achieves technical breakthroughs but also opens new avenues for chart understanding and reasoning, with potential applications in business intelligence analysis, scientific research data interpretation, and financial report analysis, significantly enhancing automated analysis efficiency [19][20].
- The success of Chart-R1 indicates that even models with relatively smaller parameter scales can achieve performance comparable to large closed-source models in specific domains, providing valuable insights for building efficient, domain-specific AI models and guiding future multi-modal reasoning research [20][21].
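To make the two-stage recipe concrete, here is a hedged sketch under stated assumptions: supervised fine-tuning on chain-of-thought traces, then outcome-reward RL on a disjoint question split so the model's exploratory behavior is not eroded. The `model` interface (`nll`, `sample`, `policy_update`), the dataset fields, and the 0/1 answer reward are hypothetical stand-ins, not the DocTron team's code.

```python
# Hypothetical two-stage SFT + RL driver; interfaces are assumed.
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the last number out of a model response (illustrative parser)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

def answer_reward(sample: dict, response: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0.

    Rewarding only the outcome leaves the chain of thought free to vary,
    one common way to preserve exploration during RL.
    """
    return float(extract_final_answer(response) == sample["answer"])

def train_chart_model(model, sft_split, rl_split):
    # Stage 1: SFT on full reasoning traces (chart + question -> CoT + answer).
    for sample in sft_split:
        loss = model.nll(prompt=sample["chart"] + sample["question"],
                         target=sample["cot"] + sample["answer"])
        loss.backward_and_step()                  # assumed optimizer wrapper

    # Stage 2: RL on questions the model never saw during SFT, so the
    # policy is pushed to reason rather than recall.
    for sample in rl_split:
        prompt = sample["chart"] + sample["question"]
        responses = model.sample(prompt, n=8)     # assumed sampler
        rewards = [answer_reward(sample, r) for r in responses]
        model.policy_update(prompt, responses, rewards)  # assumed RL step
    return model
```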