Gemini Diffusion

An open-source diffusion LLM beats autoregressive models for the first time! Shanghai Jiao Tong University and UCSD launch D2F, with throughput 2.5× that of LLaMA3
机器之心· 2025-08-18 03:22
Core Insights
- The article discusses the introduction of Discrete Diffusion Forcing (D2F), a new method that significantly raises the inference speed of open-source diffusion large language models (dLLMs) relative to autoregressive (AR) models, achieving up to 2.5 times higher throughput on benchmarks such as GSM8K [2][6][22]

Group 1: Challenges and Solutions
- Existing dLLMs lack a complete KV cache mechanism and leave much of their parallel potential untapped, resulting in slower inference than AR models [2][8]
- D2F addresses these challenges with a hybrid autoregressive-diffusion paradigm, optimizing the model architecture, training method, and inference strategy together [11][12]

Group 2: D2F Design Features
- D2F adopts block-level causal attention for compatibility with KV caching, allowing KV states to be reused and reducing computational redundancy [12][15]
- The model employs asymmetric distillation and structured noise scheduling to transfer knowledge efficiently from a pre-trained teacher model to the D2F student, enhancing its parallel decoding ability [18]

Group 3: Inference Mechanism
- D2F introduces a pipelined parallel decoding algorithm that maintains a dynamic decoding window with semi-activated and fully-activated block states, balancing throughput against quality [20][21]
- The model achieves a maximum speedup of up to 50 times over the original dLLMs while maintaining average performance [22]

Group 4: Performance Metrics
- D2F demonstrates a superior performance-efficiency trade-off and can adapt to different scenarios by adjusting decoding parameters, reaching over four times the throughput of AR models on specific tasks [25]
- In comparative tests, D2F-LLaDA reaches a throughput of 52.5 tokens per second, a 7.3-fold increase over baseline methods [23]

Group 5: Future Directions
- The success of D2F points to a promising path for further research on parallel decoding, with potential future developments including real-time serving and hybrid parallel processing [28]
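The block-level causal attention described above can be sketched as an attention mask (a minimal illustration, not the authors' implementation; the sequence length and `block_size=2` are arbitrary choices): tokens attend bidirectionally within their own block but only causally to earlier blocks, so the KV states of finished blocks never change and can be cached.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """mask[i, j] is True if token i may attend to token j."""
    blocks = np.arange(seq_len) // block_size   # block index of each token
    # Full bidirectional attention inside a block, causal across blocks:
    # token i may attend to j iff j's block does not come after i's block.
    return blocks[:, None] >= blocks[None, :]

mask = block_causal_mask(seq_len=6, block_size=2)
print(mask.astype(int))
```

Because earlier blocks never attend to later ones, a completed block's keys and values can be computed once and reused for every subsequent decoding step, which is what makes the diffusion model KV-cache compatible.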
AI Outlook: New Scaling, New Paradigm, New TAM
HTSC· 2025-06-10 01:43
Group 1: Global AI Outlook
- The report highlights a new paradigm in AI development characterized by new scaling, new architecture, and new total addressable market (TAM) opportunities [1]
- Demand for computing power is expected to rise on advances in both training and inference, potentially unlocking new TAMs [1][3]
- The report maintains a positive outlook on AI industry investment, anticipating that global AI applications will enter a performance-harvesting phase [1]

Group 2: Model Development
- The pre-training scaling law is anticipated to open a new starting point for model development, with significant architectural innovations under exploration [2][23]
- The classic Transformer architecture has reached a parameter-scale bottleneck, with existing public data nearly exhausted [2][20]
- Major tech companies are experimenting with new architectures, such as Tencent's Hunyuan TurboS and Google's Gemini Diffusion, which may accelerate scaling-law advances [23][24]

Group 3: Computing Power Demand
- The report identifies a clear long-term upward trend in computing power demand, driven by both training and inference needs [3][32]
- New scaling paths are emerging in the post-training phase, and ongoing exploration of new architectures may reignite the pre-training demand narrative [3][33]
- The deployment of large-scale computing clusters, such as OpenAI's Stargate, is expected to support continued pre-training exploration [38]

Group 4: Application Development
- The rapid advance of agent applications is leading global AI applications into a performance-harvesting phase [4][67]
- Commercialization of agent products is accelerating, with domestic AI applications iterating quickly and entering the market [4][67]
- Agent applications are evolving from simple tools into complex solutions, with significant growth expected across sectors [5][68]

Group 5: Business Model Transformation
- The shift from traditional software delivery to outcome-based delivery is a key trend, with quantifiable ROI accelerating the adoption of agent applications [5]
- Consumer-facing scenarios (advertising, e-commerce) and AI in marketing/sales are expected to lead commercialization due to their inherent advantages [5][67]
- AI applications in HR are transitioning from efficiency tools to strategic hubs, indicating a broader transformation of business models [5][67]
Challenging next-token prediction: is the diffusion LLM up to the task?
机器之心· 2025-06-08 02:11
Group 1
- The article discusses the potential of diffusion LLMs, particularly Gemini Diffusion, as a significant breakthrough in AI that challenges traditional autoregressive models [3][4][5]
- Gemini Diffusion demonstrates high generation efficiency, achieving an average sampling speed of 1479 TPS and up to 2000 TPS on coding tasks, outperforming Gemini 2.0 Flash-Lite by 4-5 times [4][6]
- The parallel generation mechanism of the diffusion architecture allows efficient processing, which could reduce computational costs relative to autoregressive models [6][7]

Group 2
- Mary Meeker emphasizes that AI is developing faster than the internet era did, highlighting the cost disparity between AI model training and inference [1][2]
- The article suggests that the rise of open-source models in China may reshape the global supply chain, indicating a shift in the industry's competitive dynamics [1][2]
- As AI inference costs decline, balancing computational investment against commercial returns becomes crucial for enterprises [1][2]
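The speed advantage of the parallel generation mechanism mentioned above comes from committing many tokens per model call instead of one. A toy illustration (our simplification, not Gemini Diffusion's actual sampler; the "confidence" selection is faked by committing arbitrary positions):

```python
import random

MASK = "_"

def toy_denoise(target: str, tokens_per_step: int) -> int:
    """Count model calls needed to reveal `target`, committing
    `tokens_per_step` masked positions per call."""
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        steps += 1
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Stand-in for "commit the most confident predictions": commit any k.
        for i in random.sample(masked, min(tokens_per_step, len(masked))):
            seq[i] = target[i]
    return steps

random.seed(0)
text = "hello world"                           # 11 tokens
print(toy_denoise(text, tokens_per_step=1))    # AR-like: 11 model calls
print(toy_denoise(text, tokens_per_step=4))    # parallel: 3 model calls
```

With four tokens committed per call, the 11-position sequence finishes in ceil(11/4) = 3 calls instead of 11, which is the rough intuition behind the TPS gap between diffusion and autoregressive decoding.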
Taking on autoregression: diffusion models are rewriting the paradigm for next-generation general-purpose models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses advances in diffusion language models (dLLMs), focusing on Google's Gemini Diffusion and its implications for AI development, and highlighting speed and performance improvements over traditional autoregressive models [1][8][35]

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its generation speed, reported as five times faster than previous models, and for handling programming tasks effectively [2][8]
- The diffusion mechanism allows rapid iteration and error correction during generation, distinguishing it from autoregressive models [2][3]
- Gemini Diffusion's sampling speed can reach 1479 tokens per second, showcasing its potential across benchmarks [8][9]

Group 2: Development of Diffusion Language Models
- Before Gemini Diffusion, several research teams explored diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4]
- LLaDA, the first 8-billion-parameter diffusion language model, marked a milestone for the field, achieving performance comparable to LLaMA 3 [4][21]
- Follow-up models such as d1 and LaViDa have since emerged, establishing LLaDA as a foundational model in dLLM research [20][21]

Group 3: Multimodal Diffusion Language Models
- Diffusion multimodal language models (dMLLMs) are emerging, with LLaDA-V and MMaDA as prominent examples that integrate visual and language processing [10][31]
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, performing strongly on multimodal understanding tasks [26][27]
- MMaDA introduces innovations in text reasoning and multimodal understanding, solidifying its position as a leading result in the dMLLM space [31][32]

Group 4: Future Directions and Implications
- The article frames the shift from autoregressive to diffusion models as a significant paradigm change in AI, with broad implications for future research and applications [35][36]
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36]
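The error-correction property mentioned above, where a diffusion sampler can revise tokens it has already committed, can be sketched as low-confidence re-masking (an illustrative simplification of the general technique, not Gemini Diffusion's or LLaDA's actual algorithm; the `model` stand-in and threshold are ours):

```python
import random

MASK = "_"

def model(position: int, rng: random.Random):
    """Stand-in predictor: returns a token and a confidence for one position."""
    return "abcdefgh"[position % 8], rng.random()

def refine(tokens: list, confs: list, rng: random.Random, threshold: float = 0.3):
    """One refinement step: re-mask positions the model was unsure about,
    then redraw all masked positions in parallel."""
    for i, c in enumerate(confs):
        if c < threshold:
            tokens[i], confs[i] = MASK, 0.0   # revise an earlier commitment
    for i, t in enumerate(tokens):
        if t == MASK:
            tokens[i], confs[i] = model(i, rng)
    return tokens, confs

rng = random.Random(0)
tokens = ["a", MASK, "x", MASK]   # "x" was committed with low confidence
confs  = [0.9, 0.0, 0.1, 0.0]
tokens, confs = refine(tokens, confs, rng)
print(tokens)  # ['a', 'b', 'c', 'd'] -- position 2 was revised, not frozen
```

An autoregressive decoder can never reopen position 2 once it is emitted; the iterative denoising loop can, which is the mechanism behind the "error correction" claim.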
AGI's Road of No Return
虎嗅APP· 2025-06-03 13:52
Core Insights
- The article discusses rapid advances in AI, focusing on the emergence of intelligent agents and their potential to replace a significant share of entry-level jobs, with predictions that they could take over 50% of such roles by 2026 [3][4][5]
- Competition between the US and China in AI is intensifying, with Chinese models like DeepSeek showing significant performance improvements and closing the gap with US counterparts [5][6][11]

Group 1: AI Advancements
- The introduction of advanced models such as OpenAI's o3 and Gemini 2.5 Pro has accelerated the development of intelligent agents, which can now handle increasingly complex tasks [3][4]
- OpenAI's annual revenue has reached $10 billion, while Anthropic's has surged from $1 billion to $3 billion within six months, indicating strong market demand for AI applications [4]

Group 2: Global AI Competition
- China's DeepSeek model has surpassed Gemini 2.5 Pro in performance, showcasing the rapid advance of Chinese AI technology [5][6]
- The gap between Chinese and US AI models has narrowed from two years at the time of ChatGPT's release to less than three months, highlighting China's growing competitiveness in AI [11]

Group 3: Geopolitical Implications
- Both the US and China view AI as a major economic lever and a source of geopolitical influence, investing heavily in AI infrastructure and talent [36][37]
- The article suggests the next phase of AI commercialization may not be "winner-takes-all" but rather a fusion and restructuring of platforms and specialized vendors [35]
Three top AI technologists share a rare stage to discuss the industry's biggest "Rashomon"
36Kr· 2025-05-28 11:59
Core Insights
- The AI industry is engaged in a significant debate over the effectiveness of pre-training versus first principles, with notable figures such as Ilya Sutskever (formerly of OpenAI) suggesting that pre-training has reached its limits [1][2]
- A shift from consensus-driven approaches toward non-consensus exploration is evident as companies and researchers seek innovative solutions in AI [6][7]

Group 1: Industry Trends
- The AI landscape is transitioning from a focus on pre-training to alternative methodologies, with companies like Sand.AI and NLP LAB leading the charge in applying multi-modal architectures to language and video models [3][4]
- New models such as Dream 7B demonstrate the potential of applying diffusion models to language tasks, outperforming larger models like DeepSeek V3 [3][4]
- The consensus around pre-training is being challenged, with some experts arguing it is not yet over, since untapped data could still improve model performance [38][39]

Group 2: Company Perspectives
- Alibaba's Qwen team, led by Lin Junyang, has faced criticism for being conservative, yet they emphasize that extensive experimentation has yielded valuable insights, ultimately reaffirming the effectiveness of the Transformer architecture [5][15]
- Exploration of Mixture of Experts (MoE) models is ongoing, with the team recognizing their potential for scalability while addressing the challenges of training stability [16][20]
- The industry is increasingly focused on optimizing model efficiency and effectiveness, with particular interest in balancing model size against performance [19][22]

Group 3: Technical Innovations
- Integrating different model architectures, such as using diffusion models for language generation, reflects a broader trend of innovation in AI [3][4]
- The challenges of training on long sequences and the need for effective optimization strategies are critical areas of focus for researchers [21][22]
- Future breakthroughs may come from using increased computational power to revisit previously unviable techniques, suggesting a cycle of innovation driven by hardware advances [40][41]
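The MoE trade-off discussed above, scalability versus per-token compute, rests on sparse routing: each token runs through only its top-k experts, so total parameters can grow without growing inference cost proportionally. A minimal top-k routing sketch (illustrative only; the Qwen team's actual design is not described in the article, and the gate, expert shapes, and k are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(token: np.ndarray, experts: list, gate: np.ndarray, k: int = 2):
    """Route one token to its k highest-scoring experts and mix their outputs."""
    scores = gate @ token                        # one gating score per expert
    top = np.argsort(scores)[-k:]                # indices of the k best experts
    probs = np.exp(scores[top]) / np.exp(scores[top]).sum()  # renormalize
    # Only the k selected experts run; the others are skipped entirely.
    return sum(p * (experts[i] @ token) for p, i in zip(probs, top))

d, n_experts = 4, 8
token = rng.standard_normal(d)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate = rng.standard_normal((n_experts, d))
out = topk_route(token, experts, gate, k=2)
print(out.shape)  # (4,)
```

Here 8 experts exist but only 2 execute per token; the training-stability challenges the panelists mention stem largely from keeping this gating balanced so no expert is starved.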
Another giant launches its most powerful model yet, moving to overtake OpenAI and Google
财富FORTUNE· 2025-05-26 13:06
Core Insights
- Anthropic launched its latest AI models, Claude Opus 4 and Claude Sonnet 4, at a developer conference in San Francisco, emphasizing their capabilities in coding and long-horizon tasks [1][4]
- Competition in advanced AI models is intensifying, with companies like Google also releasing new models to improve software-engineering efficiency [1]
- The new models are designed to operate autonomously, showing significant advances in memory and task execution [4][5]

Model Capabilities
- Claude Opus 4 is touted as the "best coding model in the world," maintaining stable performance over long tasks involving thousands of steps [1]
- Early testers report that Opus 4 can code autonomously for nearly seven hours on complex projects [4]
- The models can switch between reasoning and tool use, allowing simultaneous operations such as web searches and code testing [5]

Safety and Governance
- Anthropic has introduced safety protocols for Claude Opus 4 that exceed previous models' standards, as part of its Responsible Scaling Policy (RSP) [5][6]
- The company aims to ensure that AI development benefits everyone while maintaining safety and governance, reflecting concerns over the pace of AI advances [5][6]
- Anthropic's models have been classified under AI Safety Level 2 (ASL-2), with Claude Opus 4 intended to meet the more rigorous ASL-3 standard [5][6]

Model Release and Transparency
- Anthropic is releasing model cards alongside Opus 4 and Sonnet 4, detailing their capabilities and safety assessments [7]
- This contrasts with competitors like OpenAI and Google, which have been criticized for delays and inadequate transparency in their model releases [7]
Google I/O: validating AI from the technical frontier to the commercial ecosystem
HTSC· 2025-05-25 13:25
Investment Rating
- The report maintains an "Overweight" rating for the industry, indicating an expectation that the industry stock index will outperform the benchmark [6]

Core Insights
- The Google I/O 2025 conference highlighted the integration of AI into core search products, with a focus on enhancing user experience and regaining market share [2][3]
- The Gemini application has grown significantly, with monthly active users exceeding 400 million and usage of Gemini 2.5 Pro up 45% [1][14]
- The commercial path for AI products is accelerating, with subscriptions priced at $19.99 for Google AI Pro and $249.99 for Google AI Ultra, indicating potential for greater penetration among content creators [1][20]

Summary by Sections

AI Search
- AI Mode has been fully launched for U.S. users, providing personalized search results and integrating with Gmail; the shopping experience is enhanced with visual search and Shopping Graph 2.0, which covers over 50 billion products [2][7]
- AI Overviews now cover over 200 countries and support more than 40 languages, including new additions such as Arabic and Chinese [7]
- Google Lens usage has surpassed 100 billion queries this year, a 65% year-on-year increase [2][7]

Gemini Ecosystem
- The Gemini ecosystem is strengthening its integration capabilities with the introduction of Gemini Live and Agent Mode, which enhance multi-modal interaction and real-time capability [1][3]
- Gemini 2.5 Pro supports native audio output and has been embedded in various AI IDE tools, while the Deep Think mode can generate multiple reasoning chains [4][7]

Hardware and Future Developments
- The report highlights the potential of Android XR hardware, with collaborations on smart glasses and the introduction of third-party devices such as Samsung's Project Moohan and Xreal's Project Aura [4][7]
- The Beam project aims to deliver AI-driven 3D video calling, with its core technology expected to be integrated into Google Meet [3][7]
More versatile than Gemini Diffusion! MMaDA, the first multimodal diffusion large language model, debuts with both strong reasoning and high controllability
机器之心· 2025-05-22 08:46
Core Insights
- The article discusses advances in large language models (LLMs) and their application to multimodal tasks, highlighting challenges in architecture unification and post-training methods [1]
- DeepMind's Gemini Diffusion has demonstrated the potential of diffusion models for text modeling, motivating the development of MMaDA, which unifies text reasoning, multimodal understanding, and image generation in a single model [1][4]

Group 1: Model Development
- MMaDA is the first systematic exploration of a diffusion architecture for multimodal foundation models, achieving breakthroughs through three core technologies [1]
- The team has open-sourced the training and inference code and the weights for MMaDA-8B-Base, with additional weights to follow [4]

Group 2: Performance Metrics
- MMaDA achieved state-of-the-art (SOTA) performance on three major tasks:
  - Text reasoning: an MMLU accuracy of 68.4%, surpassing models like LLaMA-3-8B and Qwen2-7B [7]
  - Multimodal understanding: matching specialized models on benchmarks like POPE and VQAv2 [7]
  - Image generation: a CLIP Score of 32.46, with significantly improved accuracy on cultural-knowledge generation tasks [7]

Group 3: Cross-Task Synergy
- During mixed training phases, text-reasoning and image-generation metrics improved together, indicating strong cross-task synergy [9]
- MMaDA supports three types of cross-modal completion tasks, showcasing its flexibility and generalization on complex generation and reasoning tasks [11][13]

Group 4: Key Technical Innovations
- MMaDA's architecture unifies the text and image generation processes within one diffusion framework, eliminating the complexity of traditional mixed architectures [15]
- The model employs a mixed long chain-of-thought fine-tuning strategy to handle complex tasks, enhancing its reasoning capabilities [15][19]
- A unified inference format ensures the model outputs cross-modal reasoning steps before generating answers [18]

Group 5: Training Strategies
- The model uses structured noise strategies and diversified reward modeling to improve performance across tasks [19][21]
- The UniGRPO algorithm shows a 40% improvement in convergence speed during training compared with baseline methods [21]
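The noise process that diffusion LLMs such as MMaDA train against can be sketched as masked corruption (an illustrative simplification; the article does not give MMaDA's exact schedule, so the uniform per-token masking probability and `MASK_ID` below are our assumptions): at diffusion time t in [0, 1], each token is independently replaced by a mask token with probability t, and the model learns to reverse this.

```python
import random

MASK_ID = -1  # assumed sentinel id for the [MASK] token

def forward_mask(tokens: list, t: float, rng: random.Random) -> list:
    """Corrupt a token sequence: each token becomes MASK_ID with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
seq = [101, 7, 42, 9, 55, 3]
print(forward_mask(seq, t=0.0, rng=rng))  # t=0: fully clean
print(forward_mask(seq, t=1.0, rng=rng))  # t=1: fully masked
```

Because text tokens and discretized image tokens can share one vocabulary, the same corruption process covers both modalities, which is the structural reason a single diffusion framework can replace the traditional mixed architecture.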
Google I/O's new AI narrative: from large models to one-stop services, AI and XR converge
36Kr· 2025-05-22 00:15
Group 1: AI Developments
- Google announced the Gemini 2.5 series, with Gemini 2.5 Pro touted as the world's most intelligent AI model, scoring 1448 on the ELO benchmark [2]
- The Gemini 2.5 Flash model has improved efficiency by 22% and reduced token usage by 20% to 30% compared with its predecessor [2]
- AI capabilities will enhance Google Search, introducing features like chart generation and ticket search, making results more comprehensive than traditional search [4][10]

Group 2: XR Platform and Devices
- Google and Samsung's Android XR platform has gained support from hundreds of software developers, with the first XR device, Samsung's Project Moohan, set to launch later this year [11][20]
- The Android XR platform integrates AI for improved interaction, allowing users to engage with devices through natural language [12]
- XR devices still face challenges such as limited application ecosystems and short battery life, but a unified ecosystem may encourage more developers to build applications [20][25]

Group 3: Android 16 and Wear OS 6
- Android 16 will feature Live Updates, similar to Apple's Live Activities, displaying real-time information like navigation and delivery status [21][23]
- Wear OS 6 introduces a new design language and dynamic color themes, though it remains a closed-source system that limits customization [21]
- Project Astra, an AI assistant for Android, aims to provide context-aware assistance, although its full capabilities may not be realized immediately [24]

Group 4: Industry Trends and Challenges
- The AI and XR industries are transitioning from growth to maturity, focusing on practical applications [25]
- Despite these advances, leading AI and XR companies are unlikely to achieve profitability in the short term due to heavy investment in data centers and ecosystem development [27]
- The XR industry faces ecosystem challenges and needs time for software development and improvements in battery and performance technology [27]