A Conversation with 张津剑: Four Years Ago Nobody Believed in AGI; Today MiniMax Has Reached a 300 Billion Valuation
投中网· 2026-02-26 01:57
Core Viewpoint
- The article recounts the journey of Oasis Capital, focusing on its AI investment strategy and the challenges it faced through the pandemic and subsequent economic shifts. It highlights the importance of optimism, belief in innovation, and the role of young entrepreneurs in driving the AI sector forward [3][4][5].

Group 1: Investment Strategy and Market Conditions
- Oasis Capital was founded in 2019, and its initial investment strategy was built around worst-case scenarios, which proved beneficial during the pandemic [3].
- In 2022, the venture capital landscape shifted dramatically as inflation and interest rates rose, and the share of "down round" financings climbed from 8% to 20% [3][4].
- The firm made AI its core investment direction in November 2022, anticipating the release of new AI models, which positioned it ahead of the curve [5][14].

Group 2: Key Investments and Achievements
- In 2023, Oasis Capital completed investments in more than 10 AI projects, capitalizing on the "GPT moment" despite widespread skepticism about domestic AI models [7].
- The firm's first IPO in the AI sector was MiniMax, whose shares rose 109% on the first trading day and whose market capitalization exceeded 300 billion HKD shortly after [7][8].
- The firm has also invested in AI startups including 千寻智能, Vast, and 逐际动力, showing a sustained commitment to innovative companies in the AI space [8][20].

Group 3: Entrepreneurial Insights and Philosophy
- The article emphasizes optimism and belief in the potential of young entrepreneurs, particularly in the context of AI innovation [5][29].
- MiniMax founder 闫俊杰 is highlighted for his focus and dedication, which resonated with Oasis Capital's philosophy of backing passionate, innovative individuals [28][40].
- The narrative suggests that successful investing ultimately rests on the ability to believe in and support visionary entrepreneurs, even when the broader market is skeptical [25][32].
Autoregressive Models Strike Back in Image Generation: Pixel-Level Precision, More Efficient and Controllable than Diffusion
量子位· 2025-07-29 05:05
Core Viewpoint
- The article discusses the limitations of Diffusion models in AI image generation, particularly for precise control, and introduces MENTOR, a new framework that uses autoregressive (AR) models for more efficient and controllable multimodal image generation [1][2][3].

Group 1: Challenges in Current Models
- Diffusion models struggle with precise visual control, balancing multimodal inputs, and high training costs [2][6].
- The inherent randomness of Diffusion models makes precise control difficult in high-fidelity tasks such as image reconstruction [6].
- Existing methods often exhibit modality imbalance, over-relying on either reference images or text instructions [6].

Group 2: Introduction of MENTOR
- MENTOR is a novel AR framework that, using only one-tenth of the training data and suboptimal model components, still outperforms Diffusion-based methods such as Emu2 and DreamEngine [2][3].
- The framework employs a two-stage training method to enable efficient multimodal image generation with pixel-level precision [3][8].

Group 3: MENTOR's Design and Training
- MENTOR features a unified AR architecture consisting of a multimodal encoder and an autoregressive generator, allowing token-level alignment between inputs and outputs [9].
- The two-stage training strategy comprises:
  1. Multimodal Alignment Pretraining: establishes understanding of different input types along with pixel-level and semantic alignment [10].
  2. Multimodal Instruction Tuning: strengthens the model's ability to follow instructions and reason across modalities [12].

Group 4: Performance and Efficiency
- MENTOR achieved competitive performance on DreamBench++, surpassing larger models such as Emu2 (37 billion parameters) and DreamEngine (10.5 billion parameters) while maintaining a lower CP/PF ratio, indicating a better balance between visual feature preservation and prompt following [15][17].
- Training used approximately 3 million image-text pairs over 1.5 days, a significant efficiency gain over other baseline methods [18].

Group 5: Applications and Future Potential
- MENTOR's framework is highly versatile, handling a variety of complex multimodal generation tasks with minimal adjustments [24].
- The article concludes that MENTOR opens a new path for controllable image generation and showcases the potential of AR models in visual generation, while acknowledging that it still lags top-tier Diffusion models in some areas [26].
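The token-level alignment described above can be illustrated with a toy autoregressive decoding loop: image tokens are emitted one at a time, each conditioned on the multimodal prefix plus everything generated so far, which is what makes exact reconstruction possible. This is a minimal sketch, not MENTOR's actual implementation; the stand-in model, codebook size, and token values are all invented for the example.

```python
import numpy as np

VOCAB = 16  # toy codebook size for discrete image tokens (assumed)

def make_toy_model(reference):
    """Hypothetical stand-in for an AR generator: at step i it puts all
    probability mass on the i-th reference token, mimicking the token-level
    alignment that enables exact image reconstruction."""
    prefix_len = len(reference)
    def logits_fn(tokens):
        step = len(tokens) - prefix_len  # image tokens emitted so far
        logits = np.zeros(VOCAB)
        logits[reference[step]] = 1.0
        return logits
    return logits_fn

def autoregressive_generate(logits_fn, prefix, num_tokens):
    """Greedy AR decoding: each image token is predicted conditioned on the
    multimodal prefix plus all previously generated tokens."""
    tokens = list(prefix)
    for _ in range(num_tokens):
        tokens.append(int(np.argmax(logits_fn(tokens))))
    return tokens[len(prefix):]

reference = [5, 3, 7, 1]  # tokens of a reference image (toy values)
model = make_toy_model(reference)
out = autoregressive_generate(model, prefix=reference, num_tokens=4)
print(out)  # [5, 3, 7, 1] — exactly reproduces the reference tokens
```

Because decoding is a deterministic argmax over discrete tokens, there is no sampling noise to fight, which is the intuition behind AR models offering tighter control than Diffusion in reconstruction-style tasks.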
TransDiffuser: 理想's Architecture for Generating VLA Driving Trajectories via Diffusion
理想TOP2· 2025-05-18 13:08
Core Viewpoint
- The article discusses advancements in autonomous driving, focusing on the Diffusion model and its application to generating driving trajectories, and highlights the differences between VLM and VLA systems [1][4].

Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns a data distribution through a forward process that adds noise and a reverse process that removes it, akin to solving a puzzle in reverse [4].
- The denoising process is learned by training a neural network to predict and remove the noise, ultimately generating the target data [4].
- Diffusion generates not only the ego vehicle's trajectory but also predicted trajectories for other vehicles and pedestrians, improving decision-making in complex traffic environments [5].

Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 imitates learned behavior to output trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast and slow thinking capabilities, inherently capable of semantic reasoning [2].
- VLA outputs action tokens encoding the vehicle's driving behavior and surrounding environment, which the Diffusion model then decodes into driving trajectories [4][5].

Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multi-modal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture includes a Scene Encoder for processing multi-modal data and a Denoising Decoder that uses the DDPM framework for trajectory generation [7][9].
- A multi-head cross-attention mechanism fuses scene and motion features during the denoising process [9].

Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multi-modal representation decorrelation mechanism that increases trajectory diversity and reduces redundancy [11][12].

Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13].
- Future directions include integrating reinforcement learning and drawing on models such as OpenVLA [13].
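The forward/reverse process summarized above can be sketched as a minimal DDPM-style loop over a 1-D "trajectory" of waypoints. This is a generic illustration of the DDPM framework the Denoising Decoder builds on, not TransDiffuser's code; the linear noise schedule, step count, and the zero-output stand-in denoiser are assumptions made for the example.

```python
import numpy as np

T = 50                              # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.05, T)  # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_diffuse(x0, t, rng):
    """Forward process q(x_t | x_0): add Gaussian noise in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def reverse_denoise(xT, predict_noise, rng):
    """Reverse process: start from pure noise and step backwards,
    subtracting the noise estimated by `predict_noise` (the role a
    trained denoising network would play)."""
    x = xT
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no fresh noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(8)                         # a toy 8-waypoint trajectory
xt, _ = forward_diffuse(x0, T - 1, rng)  # by the last step the signal is mostly noise
traj = reverse_denoise(rng.standard_normal(8), lambda x, t: np.zeros_like(x), rng)
```

In TransDiffuser, `predict_noise` would be the Denoising Decoder, and the scene/motion features fused by cross-attention would enter as conditioning inputs to that network at every step.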