Multimodal Learning
ICLR 2026 | Imperial College London proposes DyMo: teaching multimodal models to "choose" and breaking through the missing-modality problem
机器之心· 2026-03-09 02:00
Core Insights
- The article discusses advancements in multimodal learning, particularly in addressing the challenge of missing modalities in applications such as medical diagnosis and autonomous driving [3][39].
- A new framework called DyMo is introduced, which dynamically selects and integrates reliable recovery modalities during inference, overcoming the traditional dilemma of discarding or imputing missing data [15][39].

Multimodal Learning and Its Challenges
- Multimodal learning is driving breakthroughs in fields like medical imaging and human-computer interaction by integrating data types such as images, text, and tables [2].
- A significant issue in real-world applications is the "missing modality" problem, where certain data inputs are incomplete, leading to potential loss of critical information [3][7].

The Discarding-Imputation Dilemma
- Existing methods for handling missing modalities fall into two categories: recovery-free methods that ignore missing data and recovery-based methods that attempt to reconstruct it; both have inherent drawbacks [11][12].
- The article highlights the "discarding-imputation dilemma": discarding can lose important information, while imputation may introduce noise [3][12].

DyMo Framework
- DyMo is designed to address this dilemma by dynamically identifying and integrating reliable recovery modalities during the inference phase [15][39].
- The framework establishes a connection between information gain and task loss, using a reward function to guide the modality selection process [19][21].

Experimental Results
- DyMo has been tested on multiple datasets, including PolyMNIST, MST, and CelebA, demonstrating significant performance improvements in scenarios with missing modalities [4][30].
- For instance, on the PolyMNIST dataset, DyMo improved classification accuracy by 1.61% under missing-modality conditions [12][31].

Conclusion and Future Directions
- DyMo offers a novel perspective on multimodal learning, shifting the focus from recovering all modalities to determining which recovered modalities are trustworthy [39].
- Future research directions include extending dynamic selection to the training phase and exploring applications beyond classification tasks [41].
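The selection idea above — keep a recovered modality only when its estimated reward (information gain, proxied by a drop in task loss) is positive — can be sketched in a few lines. This is a minimal illustration, not DyMo's actual formulation: the reward function, threshold, and averaging fusion rule here are assumptions for clarity.

```python
# Hedged sketch of reward-guided modality selection in the spirit of DyMo.
# The reward definition, threshold, and fusion rule are illustrative
# assumptions, not the paper's exact method.

def select_recovered_modalities(candidates, loss_with, loss_without, threshold=0.0):
    """Keep a recovered modality only if including it reduces task loss.

    candidates:    dict name -> recovered feature vector
    loss_with:     dict name -> task loss when that recovery is included
    loss_without:  baseline task loss using observed modalities only
    """
    selected = {}
    for name, feat in candidates.items():
        # Reward = information gain, proxied by the drop in task loss.
        reward = loss_without - loss_with[name]
        if reward > threshold:
            selected[name] = feat
    return selected

def fuse(observed, selected):
    """Average observed and trusted recovered features element-wise."""
    feats = list(observed.values()) + list(selected.values())
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

observed = {"image": [0.2, 0.8]}
candidates = {"text": [0.1, 0.9], "audio": [0.7, 0.3]}
loss_with = {"text": 0.35, "audio": 0.52}  # the audio recovery is noisy
selected = select_recovered_modalities(candidates, loss_with, loss_without=0.45)
print(sorted(selected))  # only "text" clears the reward threshold
```

The key design choice this mirrors is that selection happens per-sample at inference time, so a noisy recovery (here, "audio") is simply excluded rather than imputed blindly.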
Published in Nature! BAAI unveils AI all-rounder Emu3, unifying multimodal learning
生物世界· 2026-01-31 03:05
Core Viewpoint
- The article discusses the introduction of Emu3, a multimodal large model developed by the Beijing Academy of Artificial Intelligence, which aims to unify the learning of text, images, and videos through next-token prediction, potentially transforming the AI landscape [2][3].

Multimodal Learning
- Multimodal learning refers to the ability of AI to process various types of information simultaneously, akin to human sensory perception. Achieving a unified algorithm for learning and generating content from multiple modalities has been a long-standing challenge in the AI field [6].

Emu3's Mechanism
- Emu3 employs a simple yet effective approach: converting all modal data into discrete tokens and using a Transformer model to predict the next token, the key recipe behind the success of GPT-series language models [6][7].

Training Process
- The training of Emu3 consists of three stages:
  1. Pre-training on large-scale multimodal data, balancing the loss weights of text and visual tokens to prevent visual tokens from dominating [10].
  2. Post-training for quality fine-tuning on generation tasks, incorporating human preference optimization [10].
  3. Inference supporting classifier-free guidance for low-latency, high-throughput generation [11].

Performance Comparison
- Emu3 matches or exceeds specialized models across various tasks:
  - In image generation, it achieved a human preference score of 70.0, surpassing Stable Diffusion v1.5 (59.3) and SDXL (66.9) [13].
  - In video generation, it scored 81.0 on the VBench evaluation, comparable to mainstream diffusion models [13].
  - In visual language understanding, it averaged 62.1 across 12 benchmark tests, rivaling models such as LLaVA-1.6 [13].
  - In robotic manipulation, it achieved a success rate of 87.0% in a simulated environment [13].

Significance of the Research
- The significance of Emu3 lies not only in its performance improvements but also in its simplification of paradigms. It demonstrates that next-token prediction can serve as a core paradigm for multimodal models, paving the way for more powerful "world models" that integrate perception, language, and action [15][17].

Future Developments
- Following Emu3, the research team has introduced Emu3.5, which enhances the model's capabilities through large-scale long-sequence video training, improving its ability to model physical-world dynamics; multimodal capabilities are observed to scale with model and data size [15].
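The unifying trick described above — every modality becomes discrete tokens in one shared vocabulary, so a single autoregressive model can predict the next token regardless of modality — can be sketched as follows. The vocabulary sizes and offset scheme here are illustrative assumptions, not Emu3's actual tokenizer.

```python
# Minimal sketch of the "everything is a token" idea behind Emu3: text ids
# and visual codebook ids are mapped into one shared id space so a single
# next-token predictor handles both. Sizes and offsets are assumptions.

TEXT_VOCAB = 50_000          # hypothetical text vocabulary size
VISION_VOCAB = 32_768        # hypothetical visual codebook size
VISION_OFFSET = TEXT_VOCAB   # visual codes live after the text ids

def encode_text(token_ids):
    # Text ids are already in [0, TEXT_VOCAB); pass through unchanged.
    return list(token_ids)

def encode_vision(codebook_ids):
    # Shift visual codebook indices into the shared id space.
    return [VISION_OFFSET + c for c in codebook_ids]

def decode(token_id):
    # Recover (modality, local id) from a shared-space token id.
    if token_id < VISION_OFFSET:
        return ("text", token_id)
    return ("vision", token_id - VISION_OFFSET)

# One interleaved sequence: a short caption followed by image tokens.
sequence = encode_text([11, 42, 7]) + encode_vision([5, 900, 31])
print([decode(t) for t in sequence])
```

Because the sequence is homogeneous after encoding, the same Transformer loss applies everywhere; the loss-weight balancing mentioned in the training stages would simply weight positions by which side of `VISION_OFFSET` their target id falls on.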
Record-breaking! Chinese scholars publish 22 Nature papers in a single day
生物世界· 2026-01-29 08:00
Core Insights
- On January 28, 2026, a total of 43 research papers were published in the prestigious journal Nature, 22 of them authored by Chinese scholars, highlighting the significant contribution of Chinese researchers to global scientific advancement [3][4][6][8][10][11][12][14][18][19][22].

Group 1: Research Contributions
- "Constraints on axion dark matter by distributed intercity quantum sensors" was authored by Professors Peng Xinhua and Jiang Min of the University of Science and Technology of China, with postdoctoral researcher Wang Yuanhong as first author [3].
- "Prethermalization by random multipolar driving on a 78-qubit processor" was published by researchers from the Chinese Academy of Sciences and Peking University, with Liu Zhenghe and Liu Yu as co-first authors [4].
- "Multimodal learning with next-token prediction for large multimodal models" was led by Professor Huang Tiejun and Wang Zhongyuan of the Beijing Academy of Artificial Intelligence, with Wang Xinlong as a co-first author [6].
- "Radiation-tolerant atomic-layer-scale RF system for spaceborne communication" was authored by Professor Zhou Peng and Associate Professor Ma Shunli of Fudan University, with postdoctoral researcher Zhu Liyuan as first author [8].
- "Accurate determination of the 3D atomic structure of amorphous materials" was published by Miao Jianwei of UCLA, with Liao Yuxuan as first author [10].

Group 2: Diverse Research Topics
- "Optical control of integer and fractional Chern insulators" was authored by Xu Xiaodong of the University of Washington, with Li Weijie as a co-first author [11].
- "Bandwidth-tuned Mott transition and superconductivity in moiré WSe2" was co-authored by researchers from Cornell University, with Xia Yiyu and Han Zhongdong as co-first authors [12].
- "Frequency reproducibility of solid-state thorium-229 nuclear clocks" was published by Ye Jun of the University of Colorado Boulder [13].
- "A Cambrian soft-bodied biota after the first Phanerozoic mass extinction" was authored by Zhu Maoyan and Zhao Fangchen of the Nanjing Institute of Geology and Palaeontology, Chinese Academy of Sciences [14].
- "Advancing regulatory variant effect prediction with AlphaGenome" was co-authored by Cheng Jun of Google DeepMind, with co-first author Zhang Mingchao [16].
Building the ImageNet of image editing? Apple open-sources a massive dataset built with Nano Banana
机器之心· 2025-10-26 04:03
Core Insights
- Apple has been perceived as lagging in the development and application of large models, particularly in the field of visual generation [1][2].
- The company has made significant strides in research, recently introducing the Pico-Banana-400K dataset, which consists of 400,000 images for instruction-based image editing [6][9].

Dataset Overview
- Pico-Banana-400K is built using Google's Nano-banana model and aims to provide a comprehensive resource for training and evaluating text-guided image editing models [6][9].
- The dataset includes several subsets:
  - 258,000 single-turn editing examples covering 35 editing categories [12]
  - 72,000 multi-turn editing examples for studying sequential modifications [13]
  - 56,000 preference samples for alignment research [14]
  - Instruction pairing sets for developing instruction rewriting and summarization capabilities [15]

Quality Control and Methodology
- The dataset emphasizes quality and diversity through systematic design, ensuring comprehensive coverage of editing types and balancing content consistency with instruction fidelity [9][16].
- Apple implemented a self-editing and evaluation pipeline in which the Nano-banana model performs edits and Gemini 2.5 Pro assesses the results, with automatic retries until an edit succeeds [17].

Editing Types and Success Rates
- The dataset categorizes editing instructions into 35 types, covering a wide range of operations from color adjustments to object manipulation [21][22].
- Success rates vary by editing type: global appearance and style edits are the easiest, while precise geometric and text edits are the most challenging [31][32][34].

Contributions to the Field
- The release of Pico-Banana-400K represents a significant contribution to the field of multimodal learning, providing a large-scale, shareable dataset that supports various training objectives [40][41].
- The dataset not only facilitates model training but also demonstrates the capability of AI to generate and validate training data autonomously, without human supervision [41][42].
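The self-editing pipeline described above — one model edits, another judges, failed attempts are retried — is a simple control loop. A minimal sketch under stated assumptions: `edit` and `judge` are placeholders for calls to Nano-banana and Gemini 2.5 Pro, and the retry budget and pass threshold are invented for illustration.

```python
# Hedged sketch of the generate-then-judge loop behind Pico-Banana-400K.
# `edit` and `judge` stand in for the real model calls (Nano-banana and
# Gemini 2.5 Pro); the retry budget and threshold are assumptions.

def collect_example(instruction, image, edit, judge, max_retries=3, pass_score=0.8):
    """Return (edited_image, score) on success, or None if all retries fail."""
    for _attempt in range(max_retries):
        candidate = edit(image, instruction)          # editor proposes an edit
        score = judge(image, candidate, instruction)  # judge scores fidelity
        if score >= pass_score:
            return candidate, score                   # accept into the dataset
    return None                                       # drop after the budget

# Toy stand-ins: the "editor" appends the instruction to the image name,
# and the "judge" passes only edits that reflect the instruction.
fake_edit = lambda img, instr: f"{img}+{instr}"
fake_judge = lambda src, out, instr: 1.0 if instr in out else 0.0

result = collect_example("add a hat", "photo.png", fake_edit, fake_judge)
print(result)  # ('photo.png+add a hat', 1.0)
```

The design point this illustrates is that the judge, not a human, gates what enters the dataset, which is what makes fully automatic data generation at 400K scale feasible.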
Google's "Banana" kills Photoshop: the global software industry has changed for good
TMTPost APP· 2025-09-16 08:45
Core Insights
- The article highlights the revolutionary impact of Google's AI model, Nano Banana, in the field of image generation and editing, showcasing capabilities superior to existing models [1][2][28].

User Experience
- Nano Banana has been described as "stunning," with text-to-image generation and image editing capabilities surpassing previous models [3][28].
- The model effectively addresses the challenge of generating coherent text within images, a task previous models struggled with [4][28].
- Users report that Nano Banana can create highly realistic figurine images that outsiders find difficult to distinguish from real objects [6][28].
- The model excels at image editing tasks such as adding or removing elements, local repainting, style transfer, and maintaining high fidelity in details [9][28].
- Its ability to perform pixel-level edits allows precise modifications without disrupting the overall image [10][12].
- It can render objects from different angles, demonstrating a deep understanding of three-dimensional space [14][28].
- The model employs a "staggered generation" approach, breaking complex tasks into manageable steps for improved accuracy [15][28].
- Users have experienced "smart" outputs that exceed their expectations, indicating the model's ability to interpret and enhance vague instructions [16][28].

Commercial Prospects
- The commercial viability of Nano Banana is being explored, with considerations around cost-effectiveness and potential revenue models [18][28].
- Training the model requires significant resources, and the team is seeking efficient evaluation metrics to reduce costs [19][28].
- API pricing is set at competitive rates, with free usage options available, enhancing its attractiveness in the market [20][28].
- Third-party platforms have begun offering Nano Banana's API at lower prices, indicating a competitive landscape [21][28].
- The model's introduction is part of Google's strategy to maintain its leadership in the AI sector against competitors like OpenAI and Midjourney [21][28].

Technical Logic
- Nano Banana's capabilities stem from Google's long-term investments in multimodal learning, user feedback mechanisms, and innovative architectural design [22][28].
- The model uses a "text rendering metric" to assess improvements without relying on subjective user evaluations [23][28].
- It employs a unified multimodal model approach, allowing knowledge transfer across different modalities and enhancing overall performance [24][28].
- User feedback is integral to the model's iterative improvement process, with the team learning from past failures to refine its capabilities [26][28].
- Collaboration between the Gemini and Imagen teams was crucial in balancing intelligent content generation with visual quality [27][28].
OpenVision 2: a generative pre-trained vision encoder that embraces simplicity
机器之心· 2025-09-15 12:19
Core Insights
- The article discusses the development of OpenVision 2, a generative visual pre-training model that simplifies the training process while maintaining strong performance and significantly improving training efficiency [2][21].

Group 1: OpenVision 2 Overview
- OpenVision 2, proposed by researchers from UCSC, Apple, and UCB, represents a new direction in generative visual pre-training, enhancing training efficiency while scaling to 1 billion parameters [2][21].
- The model eliminates the complexity of its predecessor OpenVision's training pipeline by removing the text encoder and contrastive learning, focusing solely on the "image → description" generation target [9][21].

Group 2: Performance and Efficiency
- Experimental results show that OpenVision 2 performs comparably to or better than OpenAI's CLIP and Google's SigLIP on various multimodal benchmark tasks, particularly excelling in OCR and text-related tasks [14][21].
- Training time is reduced by a factor of 1.5 to 2, and memory usage is cut by nearly half, allowing larger batch sizes and more efficient training [14][16].

Group 3: Key Innovations
- OpenVision 2 randomly drops about 2/3 of the visual tokens during pre-training, which reduces the computational burden on the text decoder and enhances training efficiency [10][22].
- The model relies on high-quality synthetic descriptions as its sole supervision signal, which aligns closely with downstream tasks and reduces the "goal misalignment" between pre-training and application [22][21].

Group 4: Community Impact
- The research challenges the long-standing dominance of contrastive learning, demonstrating that powerful visual encoders can be trained through a generative framework and paving the way for future multimodal foundation models [21][22].
- Over 25 models of varying scales and configurations have been open-sourced, providing a reproducible, scalable resource base for both academia and industry [21].
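The token-dropping trick in Group 3 is easy to state concretely: before visual tokens reach the text decoder, most are randomly discarded, shrinking the decoder's sequence length and thus its compute. A minimal sketch, assuming uniform sampling without replacement and a 1/3 keep ratio; the paper's actual sampling scheme may differ.

```python
import random

# Sketch of the visual-token dropping attributed to OpenVision 2: roughly
# 2/3 of the encoder's visual tokens are randomly discarded before the
# text decoder sees them. Keep ratio and sampling scheme are assumptions.

def drop_visual_tokens(tokens, keep_ratio=1 / 3, rng=None):
    """Randomly keep `keep_ratio` of the tokens, preserving their order."""
    rng = rng or random.Random()
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]

patch_tokens = [f"p{i}" for i in range(12)]  # e.g. 12 encoder patch tokens
survivors = drop_visual_tokens(patch_tokens, rng=random.Random(0))
print(len(survivors))  # 4 of the 12 tokens reach the decoder
```

Since decoder attention cost grows superlinearly with sequence length, keeping only a third of the visual tokens is where the reported 1.5-2x speedup and memory savings plausibly come from.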
Baidu Vision Technology Department hiring for multimodal perception and understanding (experienced / campus / intern)
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article focuses on recruitment opportunities in video understanding and artificial intelligence, outlining the responsibilities and requirements for various positions within the company [2][4][5].

Recruitment Responsibilities
- The company is looking for candidates to conduct cutting-edge algorithm research and development for video understanding, targeting tasks such as video question answering, video summarization, temporal action localization, and event detection [2].
- Responsibilities also include building large-scale, high-quality multimodal datasets, distributed training of large models, and collaborating with business teams on practical applications and innovation [2].

Job Requirements
- Candidates should hold a master's or doctoral degree in computer science, artificial intelligence, electronic information, automation, or a related field [4].
- Publications at top AI conferences or in journals are preferred, particularly in areas like computer vision and multimodal learning [5].

Advantages of Joining
- The company offers a supportive environment with ample hiring capacity for new graduates, interns, and experienced hires, along with competitive salaries and benefits such as mentorship and participation in significant projects [6].

Community and Resources
- The article mentions a community platform for job seekers in autonomous driving and robotics, providing resources such as interview questions, industry reports, and salary negotiation tips [7][19].
Mimicking the brain's functional specialization! Peking University and CUHK release Fast-in-Slow VLA, unifying "fast action" and "slow reasoning"
机器之心· 2025-07-12 02:11
Core Insights
- The article discusses the development of Fast-in-Slow (FiS-VLA), a dual-system visual-language-action model that integrates high-frequency response and complex reasoning in robotic control [4][29].

Group 1: Research Background and Challenges
- The goal of robotic operating systems is to generate precise control signals from sensor inputs and language instructions in complex environments. However, large visual-language models (VLMs) are limited by their large parameter counts and slow inference speed, which restrict their practical use in high-frequency control tasks [7].
- The research draws inspiration from Kahneman's dual-system theory, in which System 1 represents fast, intuitive decision-making and System 2 represents slower, deeper reasoning. Previous methods attempted to create dual-system structures but lacked efficient collaboration between the two systems [8][9].

Group 2: FiS-VLA Architecture and Design
- FiS-VLA proposes an innovative structure that reconstructs the last few layers of the VLM into a System 1 execution module, embedding it within System 2 to form a unified model for efficient reasoning and control. System 2 processes 2D images and language instructions at low frequency, while System 1 responds to real-time sensory inputs at high frequency [11][13].
- The architecture includes a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion modeling. This design lets System 1 inherit pre-trained knowledge while achieving high-frequency execution [13].

Group 3: Dual-System Collaboration
- FiS-VLA pairs a slow System 2 with a fast System 1: System 2 processes task-related visual observations and language instructions, converting them into high-dimensional features, while System 1 focuses on real-time action generation, receiving current sensory inputs and outputting actions using periodic updates from System 2 [14][15].
- The model employs asynchronous sampling to control the operating frequencies of the two systems, ensuring temporal consistency in action generation [14].

Group 4: Performance Evaluation
- In simulation tests, FiS-VLA achieved an average success rate of 69% on RLBench tasks, outperforming models such as CogACT (61%) and π0 (55%). Its control frequency reached 21.9 Hz, more than double that of CogACT [17].
- On real robot platforms (Agilex and AlphaBot), FiS-VLA demonstrated average success rates of 68% and 74% across eight tasks, significantly surpassing the π0 baseline [19].
- The model exhibited robust performance in generalization tests, showing a smaller accuracy decline than π0 when faced with unseen objects, complex backgrounds, and lighting changes [21].

Group 5: Ablation Studies and Future Directions
- Ablation studies indicate that System 1 performs best when sharing two Transformer layers, and the optimal collaboration frequency ratio between Systems 1 and 2 is 1:4. The theoretical control frequency can reach 117.7 Hz when predicting eight actions at once [23].
- The article concludes that FiS-VLA innovatively merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation. Future enhancements may include dynamic adjustment of shared structures and collaboration-frequency strategies to improve adaptability and robustness in real-world tasks [29].
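The asynchronous dual-frequency scheduling described above can be sketched as a simple interleaved loop: the slow system refreshes a latent "plan" once every N fast steps, and the fast system emits an action at every step using the latest plan. The 1:4 ratio mirrors the ablation result; the toy systems and the lockstep loop itself are illustrative assumptions (the real systems run concurrently).

```python
# Sketch of FiS-VLA-style dual-frequency collaboration: System 2 (slow
# reasoning) updates a plan every `ratio` steps; System 1 (fast control)
# acts at every step using the most recent plan. Toy functions are
# assumptions standing in for the real reasoning and control modules.

def run_dual_system(observations, slow_system, fast_system, ratio=4):
    """Interleave slow reasoning (every `ratio` steps) with fast control."""
    actions, plan = [], None
    for step, obs in enumerate(observations):
        if step % ratio == 0:
            plan = slow_system(obs)                 # System 2 fires at 1/ratio rate
        actions.append(fast_system(obs, plan))      # System 1 fires every step
    return actions

slow = lambda obs: f"plan({obs})"
fast = lambda obs, plan: f"act({obs}|{plan})"
acts = run_dual_system(["o0", "o1", "o2", "o3", "o4"], slow, fast)
print(acts[0], acts[4])  # the plan is refreshed at steps 0 and 4
```

With ratio=4, steps 1-3 reuse the plan computed at step 0, which is the essence of how a heavyweight VLM can sit inside a control loop running far faster than its own inference rate.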
BAAI releases the "Wujie" series of large models, including Emu3, the world's first native multimodal world model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute (BAAI) launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1].

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token prediction paradigm for unified multimodal learning, encoding images and videos into discrete symbol sequences [2].
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform various neuroscience tasks and potentially becoming the "AlphaFold" of brain science [2][3].

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source framework for embodied-intelligence SaaS platforms, significantly lowering development barriers and improving performance by 30% over its predecessor [4].
- RoboBrain 2.0 enhances multi-agent task planning, achieving a 74% improvement in task-planning accuracy over RoboBrain 1.0 [5].

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and providing insight into the relationship between microscopic fluctuations and macroscopic biological function [6][7].

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced approximately 200 models and 160 datasets, with the FlagOS software stack upgraded to support various AI hardware and improve performance by up to 23% [8].

Group 5: Applications and Collaborations
- The Brainμ model has shown potential in consumer-grade brain-computer interface applications, collaborating with leading neuroscience laboratories and companies to expand its industrial applications [3][11].
- The development of a digital twin heart and a drug safety evaluation platform demonstrates the application of advanced modeling techniques in pharmacology and personalized medicine [12].
Over 40,000 authors in the scramble: CVPR 2025 officially reveals its three hottest topics — are you working in the right direction?
机器之心· 2025-05-28 03:02
Core Insights
- The article discusses the latest trends in computer vision, highlighting three major research directions gaining traction as of 2025 [3][4].

Group 1: Major Research Directions
- The three prominent areas identified are:
  1. Multi-view and sensor 3D technology, which has evolved from 2D rendering to more complex 3D evaluation, significantly influenced by the introduction of NeRF in 2020 [5].
  2. Image and video synthesis, which has become a focal point for presenting environmental information more accurately, reflecting advances in analyzing and generating multimedia content [6].
  3. Multimodal learning, which integrates visual, linguistic, and reasoning capabilities, indicating a trend toward more interactive and comprehensive AI systems [7][8].

Group 2: Conference Insights
- CVPR 2025 saw a 13% increase in paper submissions, with a total of 13,008 submissions and an acceptance rate of 22.1%, indicating a highly competitive environment [3].
- The conference emphasizes the importance of diverse voices in the research community, ensuring that every paper, regardless of the author's affiliation, receives equal consideration [8].