Multimodal Learning
Published in Nature! BAAI unveils AI all-rounder Emu3, unifying multimodal learning
生物世界· 2026-01-31 03:05
Core Viewpoint
- The article discusses the introduction of Emu3, a multimodal large model developed by the Beijing Academy of Artificial Intelligence (BAAI), which unifies the learning of text, images, and videos through next-token prediction, potentially transforming the AI landscape [2][3]

Multimodal Learning
- Multimodal learning refers to an AI system's ability to process several types of information at once, akin to human sensory perception; a unified algorithm for learning and generating content across multiple modalities has been a long-standing challenge in the field [6]

Emu3's Mechanism
- Emu3 takes a simple yet effective approach: it converts data from every modality into discrete tokens and trains a Transformer to predict the next token, the same recipe behind the success of the GPT series of language models (a minimal code sketch of this idea appears at the end of this summary) [6][7]

Training Process
- Emu3 is trained in three stages:
  1. Pre-training on large-scale multimodal data, with the loss weights of text and visual tokens balanced so that visual tokens do not dominate [10]
  2. Post-training that fine-tunes generation quality and incorporates human preference optimization [10]
  3. Inference that supports classifier-free guidance for low-latency, high-throughput generation [11]

Performance Comparison
- Emu3 matches or exceeds specialized models across a range of tasks:
  - Image generation: a human preference score of 70.0, surpassing Stable Diffusion v1.5 (59.3) and SDXL (66.9) [13]
  - Video generation: 81.0 on the VBench evaluation, comparable to mainstream diffusion models [13]
  - Visual-language understanding: an average of 62.1 across 12 benchmarks, rivaling models such as LLaVA-1.6 [13]
  - Robotic manipulation: an 87.0% success rate in a simulated environment [13]

Significance of the Research
- Emu3's significance lies not only in its performance but in its simplification of paradigms: it demonstrates that next-token prediction can serve as the core paradigm for multimodal models, paving the way for more powerful "world models" that integrate perception, language, and action [15][17]

Future Developments
- The team has since introduced Emu3.5, which strengthens the model through large-scale training on long video sequences, improving its modeling of physical-world dynamics; multimodal capabilities are observed to grow as model and data scale increase [15]
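To make the mechanism above concrete, here is a minimal, hypothetical PyTorch sketch of unified next-token prediction over a shared text-plus-visual vocabulary, with the per-modality loss weighting the pre-training stage describes. The vocabulary sizes, model dimensions, and the 0.5 visual weight are illustrative assumptions, not Emu3's actual configuration.

```python
# Hypothetical sketch of Emu3's core idea as summarized above: text and visual
# data share one discrete vocabulary, and a single Transformer predicts the
# next token. All sizes and the loss-weighting scheme are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000      # assumed text-token vocabulary size
VISUAL_VOCAB = 4000    # assumed visual codebook size (e.g. from a VQ tokenizer)
VOCAB = TEXT_VOCAB + VISUAL_VOCAB  # visual ids are offset past the text ids

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # causal mask: each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))

def weighted_nll(logits, targets, text_weight=1.0, visual_weight=0.5):
    """Cross-entropy with per-modality weights, mirroring the idea of
    down-weighting visual tokens so they do not dominate the loss."""
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1), reduction="none")
    is_visual = (targets.reshape(-1) >= TEXT_VOCAB).float()
    weights = text_weight * (1 - is_visual) + visual_weight * is_visual
    return (loss * weights).mean()

# toy batch: a text prefix followed by visual-codebook ids (offset by TEXT_VOCAB)
tokens = torch.cat([torch.randint(0, TEXT_VOCAB, (2, 16)),
                    torch.randint(TEXT_VOCAB, VOCAB, (2, 48))], dim=1)
model = TinyMultimodalLM()
logits = model(tokens[:, :-1])
print(weighted_nll(logits, tokens[:, 1:]).item())
```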
Record broken! Chinese scholars publish 22 Nature papers in a single day
生物世界· 2026-01-29 08:00
Core Insights
- On January 28, 2026, a total of 43 research papers were published in the journal Nature, 22 of them authored by Chinese scholars, highlighting the significant contribution of Chinese researchers to global scientific advancement [3][4][6][8][10][11][12][14][18][19][22]

Group 1: Research Contributions
- "Constraints on axion dark matter by distributed intercity quantum sensors" was authored by Professor Peng Xinhua and Professor Jiang Min of the University of Science and Technology of China, with postdoctoral researcher Wang Yuanhong as first author [3]
- "Prethermalization by random multipolar driving on a 78-qubit processor" was published by researchers from the Chinese Academy of Sciences and Peking University, with Liu Zhenghe and Liu Yu as co-first authors [4]
- "Multimodal learning with next-token prediction for large multimodal models" was led by Professor Huang Tiejun and Wang Zhongyuan of the Beijing Academy of Artificial Intelligence, with Wang Xinlong as a co-first author [6]
- "Radiation-tolerant atomic-layer-scale RF system for spaceborne communication" was authored by Professor Zhou Peng and Associate Professor Ma Shunli of Fudan University, with postdoctoral researcher Zhu Liyuan as first author [8]
- "Accurate determination of the 3D atomic structure of amorphous materials" was published by Miao Jianwei of UCLA, with Liao Yuxuan as first author [10]

Group 2: Diverse Research Topics
- "Optical control of integer and fractional Chern insulators" was authored by Xu Xiaodong of the University of Washington, with Li Weijie as a co-first author [11]
- "Bandwidth-tuned Mott transition and superconductivity in moiré WSe2" was co-authored by researchers from Cornell University, with Xia Yiyu and Han Zhongdong as co-first authors [12]
- "Frequency reproducibility of solid-state thorium-229 nuclear clocks" was published by Ye Jun of the University of Colorado Boulder [13]
- "A Cambrian soft-bodied biota after the first Phanerozoic mass extinction" was authored by Zhu Maoyan and Zhao Fangchen of the Nanjing Institute of Geology and Palaeontology, Chinese Academy of Sciences [14]
- "Advancing regulatory variant effect prediction with AlphaGenome" was co-authored by Cheng Jun of Google DeepMind, with co-first author Zhang Mingchao [16]
Building the ImageNet of image editing? Apple open-sources a massive dataset made with Nano Banana
机器之心· 2025-10-26 04:03
Core Insights
- Apple has been perceived as lagging in the development and application of large models, particularly in visual generation [1][2]
- The company has nevertheless made significant strides in research, recently introducing Pico-Banana-400K, a dataset of 400,000 images for instruction-based image editing [6][9]

Dataset Overview
- Pico-Banana-400K is built with Google's Nano-banana model and aims to provide a comprehensive resource for training and evaluating text-guided image editing models [6][9]
- The dataset comprises several subsets:
  - 258,000 single-turn editing examples covering 35 editing categories [12]
  - 72,000 multi-turn editing examples for studying sequential modifications [13]
  - 56,000 preference samples for alignment research [14]
  - Instruction-pairing sets for developing instruction rewriting and summarization capabilities [15]

Quality Control and Methodology
- The dataset emphasizes quality and diversity through systematic design, ensuring comprehensive coverage of editing types while balancing content consistency against instruction fidelity [9][16]
- Apple uses a self-editing-and-evaluation pipeline in which the Nano-banana model performs the edits and Gemini 2.5 Pro assesses the results, with automatic retries until an edit succeeds (a schematic sketch of this loop appears at the end of this summary) [17]

Editing Types and Success Rates
- Editing instructions fall into 35 categories, spanning operations from color adjustment to object manipulation [21][22]
- Success rates vary by type: global appearance and style edits are the easiest, while precise geometric and text edits are the most challenging [31][32][34]

Contributions to the Field
- Pico-Banana-400K is a significant contribution to multimodal learning, providing a large-scale, shareable dataset that supports a variety of training objectives [40][41]
- Beyond facilitating model training, the dataset demonstrates that AI can generate and validate its own training data without human supervision [41][42]
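As a concrete reading of the quality-control loop above, here is a schematic Python sketch of an edit-judge-retry pipeline. `apply_edit` and `judge_edit` are hypothetical stand-ins for calls to Nano-banana and Gemini 2.5 Pro; the real APIs, prompts, acceptance threshold, and retry budget are not specified in the article.

```python
# Schematic edit-then-judge loop: an editor proposes an edit, a judge scores
# it, and failures trigger a retry. Both model calls are stubbed placeholders.
import random

def apply_edit(image, instruction):
    """Hypothetical editor call (stand-in for Nano-banana): returns an edited image."""
    return f"{image}+edit({instruction})"

def judge_edit(original, edited, instruction):
    """Hypothetical judge call (stand-in for Gemini 2.5 Pro): returns a score
    in [0, 1] covering instruction fidelity and content consistency."""
    return random.random()

def collect_example(image, instruction, threshold=0.8, max_retries=3):
    """Retry the edit until the judge accepts it or retries run out,
    mirroring the dataset's automatic quality-control loop."""
    for attempt in range(max_retries):
        edited = apply_edit(image, instruction)
        score = judge_edit(image, edited, instruction)
        if score >= threshold:
            return {"input": image, "instruction": instruction,
                    "output": edited, "score": score, "attempt": attempt}
    return None  # discarded: no attempt passed the quality bar

print(collect_example("cat.png", "make the sky sunset orange"))
```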
Google's "Banana" kills Photoshop: the global software industry has changed for good
Tai Mei Ti APP· 2025-09-16 08:45
Core Insights
- The article highlights the revolutionary impact of Google's AI model Nano Banana on image generation and editing, showcasing capabilities superior to existing models [1][2][28]

User Experience
- Nano Banana has been described as "stunning," with text-to-image generation and image editing that surpass previous models [3][28]
- The model handles the long-standing challenge of rendering coherent text inside generated images, a task earlier models struggled with [4][28]
- Users report that it can create figurine images realistic enough that outsiders cannot distinguish them from real objects [6][28]
- It excels at a range of editing tasks, including adding or removing elements, local repainting, style transfer, and preserving fine detail [9][28]
- Pixel-level edits allow precise modifications without disturbing the rest of the image [10][12]
- It can render objects from different angles, demonstrating an understanding of three-dimensional space [14][28]
- The model uses a "staggered generation" approach, breaking complex tasks into manageable steps to improve accuracy (a speculative sketch of this idea appears at the end of this summary) [15][28]
- Users have seen "smart" outputs that exceed their expectations, suggesting the model can interpret and enhance vague instructions [16][28]

Commercial Prospects
- Nano Banana's commercial viability is being explored, with attention to cost-effectiveness and potential revenue models [18][28]
- Training the model requires significant resources, and the team is seeking efficient evaluation metrics to reduce costs [19][28]
- API pricing is set at competitive rates, with free usage options available, strengthening its market appeal [20][28]
- Third-party platforms have begun offering Nano Banana's API at lower prices, signaling a competitive landscape [21][28]
- The model is part of Google's strategy to defend its leadership in AI against competitors such as OpenAI and Midjourney [21][28]

Technical Logic
- Nano Banana's capabilities stem from Google's long-term investment in multimodal learning, user feedback mechanisms, and architectural innovation [22][28]
- A "text rendering metric" is used to assess improvements without relying on subjective user evaluation [23][28]
- A unified multimodal model design allows knowledge transfer across modalities, lifting overall performance [24][28]
- User feedback is integral to the model's iterative improvement, with the team learning from past failures to refine its capabilities [26][28]
- Collaboration between the Gemini and Imagen teams was crucial to balancing intelligent content generation with visual quality [27][28]
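The "staggered generation" idea above can be pictured as a plan-then-execute loop. The Python sketch below is speculative: the article does not describe Nano Banana's internals, and both `decompose` and `edit_step` are invented placeholders.

```python
# Speculative illustration of staggered generation: rather than applying one
# compound instruction in a single pass, the request is split into simpler
# sub-edits applied sequentially. The decomposition rule is invented.
def decompose(instruction):
    """Hypothetical planner: split a compound instruction into atomic steps."""
    return [step.strip() for step in instruction.split(" and ")]

def edit_step(image, step):
    """Hypothetical single-step edit; a real system would call the model here."""
    return f"{image} -> [{step}]"

def staggered_edit(image, instruction):
    for step in decompose(instruction):
        image = edit_step(image, step)  # each step sees the previous result
    return image

print(staggered_edit("room.png",
                     "remove the lamp and repaint the wall blue and add a plant"))
```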
OpenVision 2: A simplicity-first generative pre-trained vision encoder
机器之心· 2025-09-15 12:19
Core Insights
- The article discusses OpenVision 2, a generative visual pre-training model that simplifies the training pipeline while maintaining strong performance and significantly improving training efficiency [2][21]

Group 1: OpenVision 2 Overview
- OpenVision 2 is a new direction in generative visual pre-training, proposed by researchers from UCSC, Apple, and UCB, that improves training efficiency while scaling to 1 billion parameters [2][21]
- The model strips away the complexity of its predecessor OpenVision's training pipeline, removing the text encoder and contrastive learning and keeping only the "image → description" generation objective [9][21]

Group 2: Performance and Efficiency
- Experiments show OpenVision 2 performs on par with or better than OpenAI's CLIP and Google's SigLIP across multimodal benchmarks, excelling in particular on OCR and text-related tasks [14][21]
- Training time is cut by a factor of 1.5 to 2 and memory usage by nearly half, allowing larger batch sizes and more efficient training [14][16]

Group 3: Key Innovations
- OpenVision 2 randomly drops roughly 2/3 of the visual tokens during pre-training, reducing the computational burden on the text decoder and raising training efficiency (a minimal sketch of this trick appears at the end of this summary) [10][22]
- The model relies on high-quality synthetic descriptions as its sole supervision signal, which aligns closely with downstream tasks and narrows the "goal misalignment" between pre-training and application [21][22]

Group 4: Community Impact
- The work challenges the long-standing dominance of contrastive learning, demonstrating that powerful vision encoders can be trained in a purely generative framework and paving the way for future multimodal foundation models [21][22]
- More than 25 models of varying scales and configurations have been open-sourced, giving academia and industry a reproducible, scalable resource base [21]
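Here is a minimal PyTorch sketch of the token-dropping trick above, assuming grid-shaped patch features and a 1/3 keep ratio; the tensor shapes and sampling scheme are illustrative, not OpenVision 2's exact implementation.

```python
# Minimal sketch of visual-token dropping: during pre-training, roughly 2/3 of
# the visual tokens are randomly discarded before reaching the text decoder,
# cutting its computational load. Shapes and keep ratio are assumptions.
import torch

def drop_visual_tokens(tokens, keep_ratio=1/3):
    """tokens: (batch, num_tokens, dim) visual features from the encoder.
    Keeps a random subset per sample (same count across the batch so the
    result stays a dense tensor)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    scores = torch.rand(b, n, device=tokens.device)
    keep_idx = scores.topk(n_keep, dim=1).indices   # random subset per sample
    keep_idx = keep_idx.sort(dim=1).values          # preserve original order
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

features = torch.randn(4, 196, 768)   # e.g. a 14x14 grid of patch features
kept = drop_visual_tokens(features)
print(kept.shape)                     # torch.Size([4, 65, 768])
```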
Baidu Vision Technology Department hiring for multimodal perception and understanding (experienced hires/new graduates/interns)
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article covers recruitment opportunities in video understanding and artificial intelligence, outlining the responsibilities and requirements for several positions [2][4][5]

Recruitment Responsibilities
- Candidates will conduct cutting-edge algorithm research and development for video understanding, targeting tasks such as video question answering, video summarization, temporal action localization, and event detection [2]
- Responsibilities also include building large-scale, high-quality multimodal datasets, distributed training of large models, and collaborating with business teams on practical applications and innovation [2]

Job Requirements
- Candidates should hold a master's or doctoral degree in computer science, artificial intelligence, electronic information, automation, or a related field [4]
- Publications at top AI conferences or in journals are preferred, particularly in computer vision and multimodal learning [5]

Advantages of Joining
- The company offers ample hiring capacity for new graduates, interns, and experienced hires, along with competitive salaries and benefits such as mentorship and participation in significant projects [6]

Community and Resources
- The article also mentions a community platform for job seekers in autonomous driving and robotics, offering resources such as interview questions, industry reports, and salary negotiation tips [7][19]
Mimicking the brain's functional specialization! Peking University and CUHK release Fast-in-Slow VLA, uniting "fast action" and "slow reasoning"
机器之心· 2025-07-12 02:11
Core Insights
- The article presents Fast-in-Slow (FiS-VLA), a new dual-system vision-language-action model that integrates high-frequency response and complex reasoning in robotic control [4][29]

Group 1: Research Background and Challenges
- Robotic operating systems aim to generate precise control signals from sensor inputs and language instructions in complex environments; large vision-language models (VLMs), however, are limited by their parameter counts and slow inference, which restricts their use in high-frequency control tasks [7]
- The work draws on Kahneman's dual-system theory, in which System 1 performs fast, intuitive decision-making and System 2 performs slower, deeper reasoning; previous attempts at dual-system structures lacked efficient collaboration between the two [8][9]

Group 2: FiS-VLA Architecture and Design
- FiS-VLA reconstructs the last few layers of the VLM into a System 1 execution module embedded inside System 2, forming a unified model for efficient reasoning and control; System 2 processes 2D images and language instructions at low frequency, while System 1 responds to real-time sensory inputs at high frequency [11][13]
- The architecture comprises a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion modeling, letting System 1 inherit pre-trained knowledge while executing at high frequency [13]

Group 3: Dual-System Collaboration
- FiS-VLA pairs a slow System 2 with a fast System 1: System 2 converts task-relevant visual observations and language instructions into high-dimensional features, while System 1 focuses on real-time action generation, consuming current sensory inputs together with periodic feature updates from System 2 [14][15]
- Asynchronous sampling controls the operating frequencies of the two systems, ensuring temporal consistency in action generation (a schematic sketch of this scheduling appears at the end of this summary) [14]

Group 4: Performance Evaluation
- In simulation, FiS-VLA achieved a 69% average success rate on RLBench tasks, outperforming models such as CogACT (61%) and π0 (55%), with a control frequency of 21.9 Hz, more than double CogACT's [17]
- On real robot platforms (Agilex and AlphaBot), FiS-VLA reached average success rates of 68% and 74% across eight tasks, significantly surpassing the π0 baseline [19]
- In generalization tests with unseen objects, complex backgrounds, and lighting changes, FiS-VLA's accuracy declined less than π0's [21]

Group 5: Ablation Studies and Future Directions
- Ablations show System 1 performs best when sharing two Transformer layers, and the best collaboration frequency ratio between Systems 1 and 2 is 1:4; the theoretical control frequency reaches up to 117.7 Hz when predicting eight actions at once [23]
- FiS-VLA innovatively merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation; future work may dynamically adjust the shared structure and collaboration frequency strategy to improve adaptability and robustness in real-world tasks [29]
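The asynchronous dual-system scheduling above can be sketched as follows. This Python sketch stubs out both networks and interprets the reported 1:4 collaboration ratio as the slow system refreshing its features once every four fast-system ticks; it illustrates only the scheduling, not FiS-VLA's actual networks.

```python
# Schematic dual-system control loop: a slow System 2 refreshes high-level
# features every few ticks, while a fast System 1 emits an action on every
# tick using the latest cached features. Both systems are stubs.
def system2(observation, instruction):
    """Stub for the slow reasoning pass (2D images + language)."""
    return {"plan_features": f"features({instruction}@{observation})"}

def system1(sensors, cached):
    """Stub for the fast execution pass (real-time sensing + cached features)."""
    return f"action(sensors={sensors}, ctx={cached['plan_features']})"

def control_loop(steps, instruction, ratio=4):
    cached = None
    for t in range(steps):
        obs, sensors = f"img_{t}", f"state_{t}"
        if t % ratio == 0:               # slow system: every `ratio`-th tick
            cached = system2(obs, instruction)
        yield system1(sensors, cached)   # fast system: every tick

for action in control_loop(8, "pick up the red cup"):
    print(action)
```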
BAAI releases the "Wujie" series of large models, including Emu3, the world's first native multimodal world model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute (BAAI) launched the "Wujie" series of large models, comprising Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token-prediction paradigm for unified multimodal learning, encoding images and videos into discrete symbol sequences [2]
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform a range of neuroscience tasks and positioning it as a potential "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source SaaS framework for embodied intelligence, significantly lowering development barriers and improving performance by 30% over its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning, improving task planning accuracy by 74% over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and offering insight into the relationship between microscopic fluctuations and macroscopic biological function [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced roughly 200 models and 160 datasets; the FlagOS software stack has been upgraded to support a range of AI hardware, improving performance by up to 23% [8]

Group 5: Applications and Collaborations
- Brainμ has shown potential in consumer-grade brain-computer interface applications and is being developed with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- A digital-twin heart and a drug-safety evaluation platform demonstrate the application of these advanced modeling techniques in pharmacology and personalized medicine [12]
Over 40,000 authors in fierce competition: CVPR 2025 officially reveals its three hottest topics. Are you working in the right direction?
机器之心· 2025-05-28 03:02
Core Insights
- The article surveys the latest trends in computer vision, highlighting three research directions gaining traction as of 2025 [3][4]

Group 1: Major Research Directions
- The three prominent areas are:
  1. Multi-view and sensor-based 3D technology, which has evolved from 2D rendering toward more complex 3D evaluation, strongly influenced by the introduction of NeRF in 2020 [5]
  2. Image and video synthesis, now a focal point for presenting environmental information more accurately, reflecting advances in analyzing and generating multimedia content [6]
  3. Multimodal learning, which integrates visual, linguistic, and reasoning capabilities, pointing toward more interactive and comprehensive AI systems [7][8]

Group 2: Conference Insights
- CVPR 2025 saw a 13% increase in paper submissions, with 13,008 submissions in total and an acceptance rate of 22.1%, indicating a highly competitive environment [3]
- The conference emphasizes the importance of diverse voices in the research community, ensuring every paper receives equal consideration regardless of the authors' affiliation [8]
ETT: Breaking the visual bottleneck in native multimodal learning, reshaping the vision tokenizer optimization paradigm
机器之心· 2025-05-27 06:38
Core Viewpoint
- The article introduces ETT (End-to-End Vision Tokenizer Tuning), a novel method that jointly optimizes visual tokenization and downstream tasks, addressing the limitations of traditional visual tokenization methods [2][4]

Group 1: Limitations of Traditional Methods
- Traditional visual tokenization decouples the optimization of the tokenizer from downstream task training, leading to suboptimal performance on tasks that require rich semantic representation [1][5]
- Existing multimodal pre-training frameworks, such as Emu3, use frozen visual tokenizers, wasting their rich feature-representation capabilities and preventing end-to-end training [6][10]

Group 2: ETT Innovations
- ETT jointly optimizes visual tokenization with the target autoregressive task, letting the visual tokenizer adapt based on feedback from downstream tasks [4][10]
- The architecture builds on an improved IBQ framework, with a codebook size of 131,072 and feature dimension of 256, improving tokenizer efficiency [10]

Group 3: Training Strategy
- ETT uses a structured two-phase training strategy: an alignment-learning phase trains only the visual projection layer while keeping the large language model and visual tokenizer frozen [11]
- In the semantic-learning phase, all model weights are unfrozen for end-to-end training, letting the visual tokenizer improve its perceptual capability while retaining its image reconstruction ability (a minimal sketch of this schedule appears at the end of this summary) [11]

Group 4: Performance Metrics
- ETT performs strongly on multimodal understanding, achieving competitive results on benchmarks such as GQA and MMBench with fewer parameters and less training data than state-of-the-art vision-language models [12][13]
- On multimodal generation, ETT matches advanced diffusion and autoregressive models while remaining more efficient in model parameters and training data [14][15]

Group 5: Qualitative Results
- ETT generates diverse, detailed visual content that adheres closely to text prompts, producing high-quality images across a range of artistic styles and themes [16]

Group 6: Visual Reconstruction
- ETT significantly improves visual reconstruction, preserving low-level detail while strengthening high-level semantic representation, yielding better visual representations for multimodal tasks [17]

Group 7: Future Directions
- Future work will scale ETT's data and model capacity, explore end-to-end training of visual tokenizers from scratch, and extend the method to other modalities such as video and audio [19]

Group 8: Conclusion
- ETT represents a breakthrough in native multimodal learning: a simple yet effective way to optimize visual tokenizers that improves multimodal model performance and paves the way for broader applications [25]
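Here is a minimal PyTorch sketch of the two-phase freeze/unfreeze schedule above, using tiny stand-in modules; ETT's real components (the IBQ-based tokenizer and the LLM) are far larger, and the optimizer and losses are omitted.

```python
# Minimal sketch of ETT's two-phase schedule: alignment learning trains only
# the visual projection layer with the LLM and tokenizer frozen, then
# semantic learning unfreezes everything for end-to-end tuning.
import torch.nn as nn

llm = nn.Linear(256, 256)          # stand-in for the large language model
tokenizer = nn.Linear(64, 256)     # stand-in for the vision tokenizer encoder
projection = nn.Linear(256, 256)   # the visual projection layer

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# Phase 1: alignment learning, only the projection layer is updated.
set_trainable(llm, False)
set_trainable(tokenizer, False)
set_trainable(projection, True)

# Phase 2: semantic learning, all weights are unfrozen for end-to-end
# training, letting task gradients reach the tokenizer itself.
for module in (llm, tokenizer, projection):
    set_trainable(module, True)

print(all(p.requires_grad for p in tokenizer.parameters()))  # True after phase 2
```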