Gemini Diffusion

Open-source diffusion LLMs outpace autoregressive models for the first time: Shanghai Jiao Tong University and UCSD unveil D2F, with 2.5x the throughput of LLaMA3
机器之心· 2025-08-18 03:22
…challenges, such as the lack of a mature KV-cache mechanism and untapped parallelism, mean their inference is far slower than AR models of the same scale. A recent work has completely reversed this situation. The DENG Lab at Shanghai Jiao Tong University, together with UC San Diego (UCSD), has released Discrete Diffusion Forcing (D2F), which for the first time makes open-source dLLMs generate significantly faster than AR models of the same size. Experiments show that on benchmarks such as GSM8K, D2F models achieve up to 2.5x the throughput of mainstream AR models such as LLaMA3, while …

The author team comes from the DENG Lab at Shanghai Jiao Tong University and UC San Diego (UCSD). The research was carried out by master's student Wang Xu, incoming master's student Xu Chenkai, undergraduate Jin Yijie, and PhD student Jin Jiachun, advised by Deng Zhijie and Zhang Hao. DENG Lab, based at Shanghai Jiao Tong University, focuses on efficient, cross-modal generative models.

Paper: https://arxiv.org/abs/2508.09192
Code: https://github.com/zhijie-group/Discrete-Diffusion-Forcing

Video 1: comparison of the inference process of D2F dLLMs and same-size AR LLMs …
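The reported speedup is easiest to see in forward-pass counts. The sketch below is a back-of-the-envelope illustration under assumed settings (32-token blocks refined in 4 denoising steps each; these numbers are illustrative, not D2F's published configuration):

```python
import math

def ar_forward_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one model forward pass per generated token."""
    return num_tokens

def blockwise_diffusion_passes(num_tokens: int, block_size: int,
                               steps_per_block: int) -> int:
    """Block-wise parallel decoding: every token in a block is refined
    together, so passes scale with blocks * refinement steps, not tokens."""
    return math.ceil(num_tokens / block_size) * steps_per_block

# Illustrative numbers only: 256 tokens, 32-token blocks, 4 steps per block.
print(ar_forward_passes(256))                  # 256 passes
print(blockwise_diffusion_passes(256, 32, 4))  # 32 passes, an 8x reduction
```

With a working KV cache amortizing attention over committed blocks, fewer forward passes translate directly into higher throughput, which is the gap D2F closes.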
AI Outlook: New Scaling, New Paradigm, New TAM
HTSC· 2025-06-10 01:43
Securities research report, Technology. AI Outlook: New Scaling, New Paradigm, New TAM. Huatai Research, June 10, 2025 | Mainland China, mid-year strategy.

Global AI outlook: New Scaling, New Paradigm, New TAM. Looking at global AI trends: 1) on the model side, new architectures are being explored step by step, and the pre-training scaling law may reach a new starting point; 2) on the compute side, training and inference together keep pushing demand upward, potentially opening a new TAM, while compute-hardware design enters a new paradigm; 3) on the application side, changing business models bring a new paradigm, and agents landing first in vertical niches bring a new TAM. We remain positive on the AI investment theme and expect global AI applications to enter an earnings-harvest period.

Models: the pre-training scaling law may open a new starting point. Over the past three quarters of model iteration, post-training test-time compute driven by reinforcement learning (RL) has remained the mainstream direction. Under the classic transformer architecture, model parameter scale may have hit a bottleneck, and publicly available human data is close to exhausted. Notably, though, tech giants keep experimenting at the pre-training stage: models such as Tencent Hunyuan Turbo S and Gemini Diffusion have begun to explore architectural …
Challenging next token prediction: are Diffusion LLMs up to it?
机器之心· 2025-06-08 02:11
Group 1
- The article discusses the potential of Diffusion LLMs, particularly Gemini Diffusion, as a significant breakthrough in AI, challenging traditional autoregressive models [3][4][5]
- Gemini Diffusion demonstrates high generation efficiency, achieving an average sampling speed of 1479 TPS and up to 2000 TPS in encoding tasks, outperforming Gemini 2.0 Flash-Lite by 4-5 times [4][6]
- The parallel generation mechanism of the diffusion architecture allows for efficient processing, which could lead to reduced computational costs compared to autoregressive models [6][7]

Group 2
- Mary Meeker emphasizes that the speed of AI development surpasses that of the internet era, highlighting the cost disparity between AI model training and inference [1][2]
- The article suggests that the rise of open-source models in China may impact the global supply chain, indicating a shift in competitive dynamics within the industry [1][2]
- The balance between computational investment and commercial returns is crucial for enterprises as AI inference costs decline [1][2]
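The parallel generation mechanism mentioned above can be illustrated with a toy confidence-threshold decoder. Everything in this sketch is hypothetical: the stub model, the target string, and the linear confidence schedule stand in for a real dLLM. It only shows the shape of the loop: each forward pass proposes tokens for every masked position at once, and the most confident proposals are committed each step.

```python
MASK = "_"
TARGET = "diffusion"  # the string this toy denoiser converges to

def stub_model(seq):
    """Stand-in for a dLLM forward pass: one call proposes a token plus a
    confidence score for EVERY position in parallel (confidence is a
    made-up function of position here)."""
    return [(TARGET[i], 1.0 - i / len(seq)) for i in range(len(seq))]

def parallel_decode(max_steps=10):
    """Commit every masked position whose confidence clears a falling bar."""
    seq = [MASK] * len(TARGET)
    for step in range(1, max_steps + 1):
        proposals = stub_model(seq)
        bar = 1.0 - 2 * step / max_steps  # lower the bar each step
        for i, (tok, conf) in enumerate(proposals):
            if seq[i] == MASK and conf >= bar:
                seq[i] = tok
        if MASK not in seq:
            return "".join(seq), step  # finished early
    return "".join(seq), max_steps

print(parallel_decode())  # ('diffusion', 5): 9 tokens in 5 parallel passes
```

An autoregressive decoder would need 9 forward passes for the same 9 tokens; committing several positions per pass is the source of the TPS advantage the article describes.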
Taking on autoregression: diffusion models are rewriting the paradigm for the next generation of general-purpose models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35]

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8]
- The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3]
- Gemini Diffusion's sampling speed can reach an astonishing 1479 tokens per second, showcasing its potential in various benchmarks [8][9]

Group 2: Development of Diffusion Language Models
- Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4]
- The introduction of LLaDA, the first 8 billion parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21]
- Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21]

Group 3: Multimodal Diffusion Language Models
- The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31]
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27]
- MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32]

Group 4: Future Directions and Implications
- The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36]
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36]
AGI: a road of no return
虎嗅APP· 2025-06-03 13:52
This article is from the WeChat account 未尽研究 (ID: Weijin_Research); author: 未尽研究; header image: AI-generated.

In the blink of an eye, 2025 is nearly half over. The first half of the year brought OpenAI o3, Gemini 2.5 Pro, Grok 3 mini, and Claude 4, plus the launch and convergence of agent protocols such as MCP and A2A, once again accelerating progress in frontier models, agents, and applications.

In the first half, China consolidated its advantage in open source. Tongyi Qianwen (Qwen) began overtaking Llama 3 as early as September 2024, and DeepSeek R1 started catching up with o1 from early 2025. The launch of Llama 4 did not change the emerging pattern of DeepSeek and Qwen leapfrogging each other on performance.

"Internet queen" Mary Meeker has published her first AI trends report. Viewing AI through the lens of PCs, the internet, mobile, and cloud computing, she argues that every later technology compounds on the ones before it, and AI is no exception; betting on optimism is therefore often one of the most worthwhile investments.

Some 2.6 billion people worldwide still lack internet access. Meeker is bullish on lower-cost satellite internet combined with online experiences that ship with AI built in. "Imagine a 'first internet experience' that is no longer typing …
Three top AI technologists share a rare stage to discuss the industry's biggest "Rashomon"
36Kr · 2025-05-28 11:59
Core Insights
- The AI industry is currently experiencing a significant debate over the effectiveness of pre-training models versus first principles, with notable figures like Ilya from OpenAI suggesting that pre-training has reached its limits [1][2]
- The shift from a consensus-driven approach to exploring non-consensus methods is evident, as companies and researchers seek innovative solutions in AI [6][7]

Group 1: Industry Trends
- The AI landscape is witnessing a transition from a focus on pre-training to exploring alternative methodologies, with companies like Sand.AI and NLP LAB leading the charge in applying multi-modal architectures to language and video models [3][4]
- The emergence of new models, such as Dream 7B, demonstrates the potential of applying diffusion models to language tasks, outperforming larger models like DeepSeek V3 [3][4]
- The consensus around pre-training is being challenged, with some experts arguing that it is not yet over, as there remains untapped data that could enhance model performance [38][39]

Group 2: Company Perspectives
- Alibaba's Qwen team, led by Lin Junyang, has faced criticism for being conservative, yet they emphasize that their extensive experimentation has led to valuable insights, ultimately reaffirming the effectiveness of the Transformer architecture [5][15]
- The exploration of Mixture of Experts (MoE) models is ongoing, with the team recognizing the potential for scalability while also addressing the challenges of training stability [16][20]
- The industry is increasingly focused on optimizing model efficiency and effectiveness, with a particular interest in achieving a balance between model size and performance [19][22]

Group 3: Technical Innovations
- The integration of different model architectures, such as using diffusion models for language generation, reflects a broader trend of innovation in AI [3][4]
- The challenges of training models with long sequences and the need for effective optimization strategies are critical areas of focus for researchers [21][22]
- The potential for future breakthroughs lies in leveraging increased computational power to revisit previously unviable techniques, suggesting a cycle of innovation driven by advancements in hardware [40][41]
Another giant releases its most powerful model, moving to overtake OpenAI and Google
财富FORTUNE· 2025-05-26 13:06
Some early testers have already tried the new models on real tasks. The company noted, for example, that the AI general manager at shopping-rewards company Rakuten said Opus 4 "coded autonomously for nearly seven hours" after being deployed on a complex project.

[Image caption] Anthropic has released its latest generation of "frontier" AI models, Claude Opus 4 and Claude Sonnet 4. Image credit: GETTY IMAGES

Last Thursday, at its first developer conference in San Francisco, AI startup Anthropic released its latest generation of "frontier" models, Claude Opus 4 and Claude Sonnet 4. In a blog post, the company, valued at over $61 billion, called the long-awaited Opus "the world's best coding model," able to "maintain stable performance on long-horizon tasks that require sustained focus and involve thousands of steps." AI agents powered by the new models can analyze thousands of data sources and execute complex operations.

The launch underscores how fierce the race for "the world's most advanced AI model" has become, especially in areas such as software engineering, as companies adopt new techniques to gain speed and efficiency; Google's experimental research model Gemini Diffusion, unveiled last week, is a case in point. In a benchmark comparing large language models on software-engineering tasks, Anthropic's two models …
Google I/O: AI validated from the technology frontier to the commercial ecosystem
HTSC· 2025-05-25 13:25
Securities research report, Technology. Google I/O: AI validated from the technology frontier to the commercial ecosystem. Huatai Research, May 25, 2025 | United States.

Google keeps strengthening its foundation models with multimodal and reasoning capabilities. We focus on Flow, which supports Veo 3 and Imagen 4 and may see early commercial traction among content creators: 1) Gemini 2.5 Pro now supports native audio output, improving multimodal interaction efficiency, and is already embedded in several AI IDE tools (such as Cursor); an enhanced reasoning mode, Deep Think, generates multiple reasoning chains that cross-check one another; 2) for content generation, Veo 3 adds native audio generation, with breakthroughs in lip sync and real-world physics modeling, while Imagen 4 supports 2K resolution and high-fidelity images with complex materials; both are available through the Flow app. Gemini Diffusion, a new-generation diffusion model, generates five times faster than 2.5 Flash and supports parallel generation and iterative correction. Google also introduced Lyria RealTime, an experimental interactive music-generation model, announced smart-glasses partnerships with Samsung, Gentle Monster, and Warby Parker, and showed two third-party Android XR devices, respectively …
More versatile than Gemini Diffusion! MMaDA, the first multimodal diffusion large language model, delivers both strong reasoning and high controllability
机器之心· 2025-05-22 08:46
Core Insights
- The article discusses the advancements in large language models (LLMs) and their application in multimodal tasks, highlighting the challenges in architecture uniformity and post-training methods [1]
- DeepMind's Gemini Diffusion has demonstrated the potential of diffusion models in text modeling, leading to the development of MMaDA, which integrates text reasoning, multimodal understanding, and image generation into a unified model [1][4]

Group 1: Model Development
- MMaDA is the first systematic exploration of a diffusion architecture for multimodal foundational models, achieving breakthroughs through three core technologies [1]
- The team has open-sourced the training, inference, and weights for MMaDA-8B-Base, with plans to release additional weights [4]

Group 2: Performance Metrics
- MMaDA achieved state-of-the-art (SOTA) performance in three major tasks:
  - Textual reasoning, with an MMLU accuracy of 68.4%, surpassing models like LLaMA-3-8B and Qwen2-7B [7]
  - Multimodal understanding, matching specialized models on benchmarks like POPE and VQAv2 [7]
  - Image generation, with a CLIP Score of 32.46, significantly improving accuracy in cultural knowledge generation tasks [7]

Group 3: Cross-Task Synergy
- During mixed training phases, improvements in text reasoning and image generation metrics were observed, indicating a strong cross-task synergy [9]
- MMaDA supports three types of cross-modal completion tasks, showcasing its flexibility and generalization capabilities in complex generation and reasoning tasks [11][13]

Group 4: Key Technical Innovations
- MMaDA's architecture unifies the text and image generation processes within a diffusion framework, eliminating the complexity of traditional mixed architectures [15]
- The model employs a mixed long-chain thinking fine-tuning strategy to address challenges in complex tasks, enhancing its reasoning capabilities [15][19]
- A unified inference format is defined to ensure the model outputs cross-modal reasoning steps before generating answers [18]

Group 5: Training Strategies
- The model utilizes structured noise strategies and diversified reward modeling to enhance performance across different tasks [19][21]
- The UniGRPO algorithm has shown a 40% improvement in convergence speed during training compared to baseline methods [21]
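A unified diffusion framework of this kind builds on the masked discrete-diffusion forward process used by LLaDA-style models: during training, each token (text or image token alike) is independently replaced by a mask with probability equal to a sampled noise level t, and the network learns to recover the originals. The sketch below shows that generic corruption step; it is the standard formulation, not MMaDA's actual code, and the mask id and token list are illustrative.

```python
import random

def corrupt(tokens, t, mask_id="<mask>", rng=None):
    """Masked discrete-diffusion forward process: each token is
    independently replaced by mask_id with probability t in [0, 1]."""
    rng = rng or random.Random(0)
    return [mask_id if rng.random() < t else tok for tok in tokens]

toks = ["a", "photo", "of", "a", "cat"]
print(corrupt(toks, 0.0))  # t=0: no noise, tokens come back unchanged
print(corrupt(toks, 1.0))  # t=1: every position is masked
```

Because the same corruption applies to any discrete token stream, one denoising objective can cover text reasoning, multimodal understanding, and image generation, which is the uniformity the article credits to MMaDA.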
Google I/O's new AI narrative: from large models to one-stop services, AI and XR converge
36Kr · 2025-05-22 00:15
Google CEO Sundar Pichai said that a year ago Google's AI models and APIs processed 9.7 trillion tokens per month; that figure has now grown to 480 trillion, and the AI Overviews feature in Google Search has reached 1.5 billion monthly active users.

AI is steadily weaving itself into our lives and becoming indispensable. From Google's new models and AI applications to its XR platform and phone OS, nothing escapes AI's influence.

In the early hours of May 21, tech giant Google held its I/O 2025 developer conference. Besides the much-anticipated AI features, it announced new plans and selected features for the Android XR platform and Android 16.

AI: from large model to one-stop service platform. As the undisputed star of I/O, AI dominated the announcements and accounted for the most launches. The repeatedly leaked Gemini 2.5 series was confirmed for a June release; Gemini 2.5 Pro, billed as the world's most intelligent AI model, topped the LMArena leaderboard with an ELO benchmark score of 1448. Gemini 2.5 Pro also gains a Deep Think version, which leads the base Gemini 2.5 Pro across tests including USAMO 2025, LiveCodeBench, and MMMU. Gemini 2.5 Flash is the lightweight model; compared with the previous …