Dissecting Li Auto's Work on World Models
自动驾驶之心· 2026-01-05 09:30
Core Insights
- The article discusses the advancements and applications of world models in autonomous driving, particularly the reconstruction and generation techniques used by companies like Li Auto [2][3]
- It highlights the importance of understanding world models for newcomers to the field, emphasizing the challenges of grasping both the concepts and their practical applications [4][5]

Summary by Sections

Section 1: Introduction to World Models
- The first chapter provides an overview of world models and their connection to end-to-end autonomous driving, detailing their historical development and current applications [7]
- It categorizes world models into purely simulated models, simulation combined with planning, and models that generate sensor inputs and perception results [7]

Section 2: Background Knowledge of World Models
- The second chapter covers foundational knowledge for world models, including scene representation, Transformer technology, and BEV perception [8][13]
- It emphasizes the role of these concepts in preparing for advanced discussions of world models [8]

Section 3: General World Model Exploration
- The third chapter focuses on general world models and recent popular works in autonomous driving, discussing models such as Marble, Genie 3, and DriveVLA-W0 [9]

Section 4: Video Generation-Based World Models
- The fourth chapter covers video generation algorithms, currently the most actively researched direction in both academia and industry, starting with notable works such as GAIA-1 and GAIA-2 [10]

Section 5: OCC-Based World Models
- The fifth chapter centers on occupancy (OCC) generation methods, explaining how they can extend to vehicle trajectory planning and support end-to-end solutions [10]

Section 6: World Model Job Topics
- The sixth chapter shares practical insights from industry experience, covering applications of world models, industry pain points, and interview preparation for related positions [11]

Course Overview
- The course aims to provide a comprehensive understanding of world models, targeting readers who want to advance their knowledge and skills in autonomous driving technology [12][15]
- It follows a structured schedule with specific topics in each chapter, progressing from foundational concepts to advanced applications [16][17]
BAAI Open-Sources EditScore: Unlocking the Unlimited Potential of Online Reinforcement Learning for Image Editing
机器之心· 2025-10-22 03:30
Core Insights
- The article discusses significant advancements in instruction-guided image editing, particularly the introduction of the EditScore model series by the VectorSpace Lab team at the Beijing Academy of Artificial Intelligence [2][3]
- EditScore aims to provide precise and reliable reward signals for instruction-guided image editing tasks, addressing the difficulty existing models have in following complex text instructions [2][5]

Development of EditScore
- EditScore continues the OmniGen series' goal of building more general and controllable generative AI [3]
- The EditReward-Bench dataset has been released as the first public benchmark designed specifically for evaluating image editing reward models, covering 13 sub-tasks and 11 state-of-the-art editing models [6]
- The EditScore series comes in three sizes (7B, 32B, and 72B), designed to provide high-fidelity feedback signals for instruction-based image editing tasks [7]

Performance Metrics
- EditScore outperforms other models on EditReward-Bench, with the largest model's accuracy surpassing that of GPT-5 [9]
- Across the evaluation metrics, EditScore models consistently outperform existing vision-language models, providing accurate quality scores for image editing results [8][9]

Applications of EditScore
- EditScore serves as an advanced reranker, improving the output quality of mainstream editing models through a "Best-of-N" approach [15]
- It also functions as a high-fidelity reward signal for reinforcement learning, enabling stable and efficient RL fine-tuning, with notable performance gains observed on the OmniGen2 model [15][16]

Insights from Research
- A high benchmark score does not necessarily make an effective reinforcement learning coach; the distribution of the reward model's scores is also crucial to training effectiveness [16]
- A self-ensemble scaling strategy was identified as a way to enhance performance, suggesting that a well-designed 7B model can outperform larger models on specific tasks [19]

Future Directions
- The team plans to continue exploring reward modeling and will release additional reinforcement learning training code and inference scripts for the community [3][22]
- Continued development of EditScore is expected to improve the controllability and reliability of AIGC models, opening new application possibilities across fields [22]
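The "Best-of-N" reranking described above can be sketched in a few lines. Everything here is illustrative: `generate_edit` and `reward_score` are hypothetical stand-ins for the editing model and an EditScore-style reward model, not real APIs.

```python
import random

def generate_edit(image, instruction, seed):
    # Hypothetical stand-in for an instruction-guided editing model;
    # a real system would return an edited image, not a dict.
    return {"image": image, "instruction": instruction, "seed": seed}

def reward_score(candidate):
    # Hypothetical stand-in for an EditScore-style reward model that
    # maps an edit candidate to a scalar quality score.
    rng = random.Random(candidate["seed"])
    return rng.random()

def best_of_n(image, instruction, n=8):
    # Sample N candidate edits, score each with the reward model,
    # and return the highest-scoring one (the "Best-of-N" rerank).
    candidates = [generate_edit(image, instruction, seed=s) for s in range(n)]
    return max(candidates, key=reward_score)

best = best_of_n("cat.png", "put a red hat on the cat", n=8)
```

The same reward function doubles as the RL signal: instead of keeping only the best sample, a fine-tuning loop would weight policy updates by these scores, which is why the score distribution (not just its peak) matters for training.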
New Domestic SOTA Model Accurately Handles "Draw an Animal with (3+6) Lives" | Open Source
量子位· 2025-06-20 03:28
Core Viewpoint
- The article discusses advances in AI, focusing on the new model MindOmni, which strengthens reasoning and generative capabilities in image generation, enabling more coherent and logical outputs from complex instructions [7][9][44]

Group 1: MindOmni Model Overview
- MindOmni is a collaboration between Tsinghua University, Tencent ARC Lab, and other institutions, designed to improve AI's reasoning-driven generation capabilities [7]
- The model integrates visual understanding and generation, building on the Qwen2.5-VL vision-language model [14][18]
- Its core image generation module is a diffusion decoder, which transforms noise into realistic images through an iterative denoising process [15][16]

Group 2: Training Phases
- MindOmni is trained in three phases: basic pre-training, supervised fine-tuning, and reasoning generation policy optimization (RGPO) [19][32]
- In the pre-training phase, the model learns basic text-to-image generation from open-source image-text pairs [20]
- The RGPO phase uses reinforcement learning to strengthen the model's ability to generate logical reasoning chains [26][29]

Group 3: Performance Metrics
- MindOmni shows superior performance on various multimodal understanding and generation benchmarks, outperforming previous models [36][38]
- On image understanding tasks, MindOmni improved by 10.6% on MMMU and 9.8% on MMBench over earlier models [38][39]
- The model achieved an overall score of 83% on the GenEval benchmark, demonstrating strong generative capability [40]

Group 4: Reasoning Generation Capabilities
- MindOmni excels at reasoning generation, scoring 0.71 on the WISE benchmark across multiple subcategories [45]
- The model interprets complex prompts effectively, such as generating images from mathematical expressions, showcasing its reasoning ability [46][47]
- Its performance on multimodal inputs further highlights its ability to understand and generate relevant outputs [48]

Group 5: Ablation Studies
- Extensive ablation studies confirm the contribution of each training phase to the model's performance [50]
- Pre-training establishes basic generation ability, while supervised fine-tuning and RGPO further refine reasoning generation [50][51]
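The diffusion decoder's denoising process mentioned above can be sketched as a toy loop. This is a minimal illustration, not MindOmni's sampler: `predict_noise` is a hypothetical stand-in for the decoder's conditioned noise predictor, and the update rule is deliberately simplified.

```python
import random

def predict_noise(x, t, condition):
    # Hypothetical stand-in for the diffusion decoder's noise predictor,
    # conditioned on the language model's output; here it just predicts
    # a fixed fraction of the current latent.
    return [xi * 0.1 for xi in x]

def denoise(condition, steps=50, dim=4):
    # Start from pure Gaussian noise and iteratively subtract the
    # predicted noise, moving the latent toward a clean sample.
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(dim)]
    for t in range(steps, 0, -1):
        eps = predict_noise(x, t, condition)
        x = [xi - e for xi, e in zip(x, eps)]
    return x

latent = denoise(condition="an animal with (3+6) lives")
```

A real sampler also rescales by noise-schedule coefficients at each step; the point here is only the shape of the loop: many small conditioned denoising steps from pure noise to an image latent.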
A Knowledge-Type Perspective on Evaluating Image Editing Models' Reasoning: All Models Perform Poorly at "Procedural Reasoning"
量子位· 2025-06-13 05:07
Core Viewpoint
- The article discusses the development of KRIS-Bench, a benchmark for evaluating the reasoning capabilities of image editing models, modeled on the structured way humans acquire knowledge [2][3][16]

Group 1: KRIS-Bench Overview
- KRIS-Bench is a collaboration among multiple prestigious institutions aimed at assessing AI's reasoning abilities in image editing [2]
- The benchmark categorizes knowledge into three types (Factual Knowledge, Conceptual Knowledge, and Procedural Knowledge), allowing AI to face progressively complex editing challenges [4][8]
- It spans 7 reasoning dimensions and 22 typical editing tasks, ranging from basic to advanced difficulty [6]

Group 2: Evaluation Metrics
- KRIS-Bench introduces a four-dimensional automated evaluation system that scores editing outputs on Visual Consistency, Visual Quality, Instruction Following, and Knowledge Plausibility [10][11][13]
- The evaluation set comprises 1,267 image-instruction pairs, meticulously curated by experts to ensure diverse data sources [12]

Group 3: Model Performance Insights
- The benchmark tests 10 models (3 closed-source and 7 open-source), revealing performance gaps, particularly in procedural reasoning and natural science tasks [14][16]
- Closed-source models such as GPT-Image-1 lead in performance, while open-source models such as BAGEL-Think improve knowledge plausibility through enhanced reasoning processes [17]
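The four-dimensional scoring scheme can be illustrated with a simple aggregation. The equal weighting and 0-1 scale below are assumptions for illustration, not KRIS-Bench's published formula.

```python
def kris_style_score(dims, weights=None):
    # Aggregate per-dimension scores (on a 0-1 scale here) into one
    # overall score. Equal weighting is an illustrative assumption.
    weights = weights or {k: 1.0 for k in dims}
    total = sum(weights.values())
    return sum(dims[k] * weights[k] for k in dims) / total

sample = {
    "visual_consistency": 0.9,
    "visual_quality": 0.8,
    "instruction_following": 0.7,
    "knowledge_plausibility": 0.4,  # often the weakest dimension
}
overall = kris_style_score(sample)  # ≈ 0.7
```

Keeping the dimensions separate before aggregating is what lets the benchmark localize failures: a model can follow instructions well while scoring poorly on knowledge plausibility, which a single overall score would mask.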
BAAI Releases the "Wujie" Series of Large Models, Including Emu3, the World's First Native Multimodal World Model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token prediction paradigm for unified multimodal learning, encoding images and videos into discrete symbol sequences [2]
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform various neuroscience tasks and potentially becoming the "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source framework for embodied-intelligence SaaS platforms, significantly lowering development barriers and improving performance by 30% over its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning, achieving a 74% improvement in task planning accuracy over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and illuminating the relationship between microscopic fluctuations and macroscopic biological function [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced roughly 200 models and 160 datasets, and the FlagOS software stack has been upgraded to support a range of AI hardware with performance gains of up to 23% [8]

Group 5: Applications and Collaborations
- The Brainμ model has shown potential in consumer-grade brain-computer interface applications and is being developed with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- A digital twin heart and a drug safety evaluation platform demonstrate the application of these modeling techniques to pharmacology and personalized medicine [12]
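Emu3's unified next-token prediction over mixed modalities can be sketched as follows. The `<boi>`/`<eoi>` markers and the token naming are illustrative assumptions, not Emu3's actual vocabulary; the point is that image codes and text tokens share one autoregressive sequence.

```python
def tokenize_multimodal(text_tokens, image_codes):
    # Images/videos are first encoded by a visual tokenizer into discrete
    # codes, then spliced into one sequence alongside the text tokens.
    return ["<boi>"] + [f"img_{c}" for c in image_codes] + ["<eoi>"] + text_tokens

def next_token_pairs(sequence):
    # Autoregressive training pairs: each position predicts the next
    # token, whether that token is text or an image code.
    return list(zip(sequence[:-1], sequence[1:]))

seq = tokenize_multimodal(["a", "cat"], [17, 4])
pairs = next_token_pairs(seq)
# pairs[0] == ("<boi>", "img_17")
```

Because both modalities reduce to tokens in one vocabulary, a single transformer and a single loss cover understanding and generation, which is also what lets Brainμ slot brain signals in as just another token stream.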