Multimodal Learning
Google's "Banana" Kills Photoshop: The Global Software Industry Has Completely Changed
Tai Mei Ti APP · 2025-09-16 08:45
Core Insights
- The article highlights the revolutionary impact of Google's AI model, Nano Banana, on image generation and editing, showcasing capabilities superior to existing models [1][2][28]

User Experience
- Nano Banana has been described as "stunning," with text-to-image generation and image-editing capabilities that surpass previous models [3][28]
- The model effectively addresses the challenge of generating coherent text within images, a task previous models struggled with [4][28]
- Users report that Nano Banana can create highly realistic figurine images that outsiders find difficult to distinguish from real objects [6][28]
- The model excels at a range of image-editing tasks, such as adding or removing elements, local repainting, and style transfer, while maintaining high fidelity in details [9][28]
- Its ability to perform pixel-level edits allows precise modifications without disrupting the rest of the image [10][12]
- It can render objects from different angles, demonstrating a deep understanding of three-dimensional space [14][28]
- The model employs a "staggered generation" approach, breaking complex tasks into manageable steps for improved accuracy [15][28]
- Users have experienced "smart" outputs that exceed their expectations, indicating the model's ability to interpret and enhance vague instructions [16][28]

Commercial Prospects
- The commercial viability of Nano Banana is being explored, with considerations around cost-effectiveness and potential revenue models [18][28]
- Training the model requires significant resources, and the team is seeking efficient evaluation metrics to reduce costs [19][28]
- The API is priced at competitive rates, with free usage options available, enhancing its attractiveness in the market [20][28]
- Third-party platforms have begun offering Nano Banana's API at lower prices, indicating a competitive landscape [21][28]
- The model's introduction is part of Google's strategy to maintain its leadership in AI against competitors such as OpenAI and Midjourney [21][28]

Technical Logic
- Nano Banana's capabilities stem from Google's long-term investment in multimodal learning, user feedback mechanisms, and innovative architectural design [22][28]
- The model uses a "text rendering metric" to assess improvements without relying on subjective user evaluations (an illustrative sketch of such a metric follows this summary) [23][28]
- It employs a unified multimodal model, allowing knowledge transfer across modalities and improving overall performance [24][28]
- User feedback is integral to the iterative improvement process, with the team learning from past failures to refine the model's capabilities [26][28]
- Collaboration between the Gemini and Imagen teams has been crucial in balancing intelligent content generation with visual quality [27][28]
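The article does not explain how the "text rendering metric" is computed. One plausible reading is an OCR-based check: generate an image that is supposed to contain specific text, OCR the result, and score it against the intended string. The sketch below illustrates that idea only; the scoring rule, the file names, and the use of pytesseract are assumptions, not Google's published method.

```python
# Hypothetical text-rendering metric: OCR the generated image and compare the
# recovered text with the intended text. This is an illustrative assumption,
# not the metric Google actually uses.
from PIL import Image
import pytesseract


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def text_rendering_score(image_path: str, target_text: str) -> float:
    """Return a 0-1 score; 1.0 means the OCR'd text matches the target exactly."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path)).strip()
    target = target_text.strip()
    if not target:
        return 1.0
    dist = edit_distance(ocr_text.lower(), target.lower())
    return max(0.0, 1.0 - dist / max(len(ocr_text), len(target)))


if __name__ == "__main__":
    # Hypothetical generated images and the text each one was asked to render.
    samples = [("poster_001.png", "GRAND OPENING SALE"),
               ("poster_002.png", "Nano Banana Cafe")]
    for path, text in samples:
        print(path, round(text_rendering_score(path, text), 3))
```

Averaging such a score over a fixed prompt suite would give exactly the kind of cheap, automatic signal the summary describes: a way to track improvements between model versions without waiting for subjective human ratings.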
OpenVision 2: A "Less Is More" Generative Pretrained Visual Encoder
机器之心· 2025-09-15 12:19
This work is a collaboration between the University of California, Santa Cruz (UCSC), Apple, and the University of California, Berkeley (UCB). First author Yanqing Liu holds a bachelor's degree from Zhejiang University and is now a PhD student at UCSC, working on multimodal understanding, vision-language pretraining, and vision foundation models. The other authors are Xianhang Li (UCSC), Letian Zhang (UCSC), Zirui Wang (Apple), Zeyu Zheng (UCB), and Yuyin Zhou (UCSC). The corresponding author is Professor Cihang Xie of UCSC.

Paper title: OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Paper: arXiv:2509.01644
Project page: https://ucsc-vlaa.github.io/OpenVision2
Code and models: GitHub · UCSC-VLAA/OpenVision
Hugging Face models: OpenVision 2 on HuggingFace

Amid the rapid evolution of multimodal large models, the vision module has remained the cornerstone of the whole system. For a long time, CLIP-style image-text contrastive learning has been the default recipe for visual pretraining. From OpenAI's CL ...
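The excerpt breaks off above, but the direction its title points toward, replacing contrastive image-text pretraining with a purely generative (captioning) objective for the visual encoder, can be illustrated compactly. The following is a minimal sketch of caption-only pretraining under toy assumptions: the encoder, decoder, vocabulary, dimensions, and data are placeholders, not the OpenVision 2 architecture or training setup.

```python
# Minimal sketch of caption-only (generative) pretraining for a vision encoder:
# no text encoder, no contrastive loss, just a vision encoder feeding a small
# text decoder trained with next-token prediction. All sizes are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PATCH, IMG = 1000, 256, 16, 224


class TinyVisionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        self.pos = nn.Parameter(torch.zeros(1, (IMG // PATCH) ** 2, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, images):                                    # (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, DIM)
        return self.blocks(x + self.pos)                          # patch tokens


class CaptionDecoder(nn.Module):
    """Small autoregressive decoder that cross-attends to the patch tokens."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerDecoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, caption_ids, visual_tokens):
        T = caption_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(caption_ids.device)
        h = self.blocks(self.tok(caption_ids), visual_tokens, tgt_mask=causal)
        return self.head(h)                                       # (B, T, VOCAB)


encoder, decoder = TinyVisionEncoder(), CaptionDecoder()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=3e-4)

# One toy training step on random data: predict token t+1 from tokens <= t.
images = torch.randn(4, 3, IMG, IMG)
captions = torch.randint(0, VOCAB, (4, 12))
logits = decoder(captions[:, :-1], encoder(images))
loss = F.cross_entropy(logits.reshape(-1, VOCAB), captions[:, 1:].reshape(-1))
loss.backward()
opt.step()
print("captioning loss:", loss.item())
```

The point of the design is that the only training signal reaching the encoder is the captioning loss, which is the simplification ("less is more") the title advertises relative to CLIP-style contrastive pipelines.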
Baidu Vision Technology Department: Hiring for Multimodal Perception and Understanding (Experienced Hires / Campus Recruitment / Internships)
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article focuses on recruitment opportunities in video understanding and artificial intelligence, outlining the responsibilities and requirements for several positions within the company [2][4][5]

Recruitment Responsibilities
- The company is looking for candidates to conduct cutting-edge algorithm research and development in video understanding, targeting tasks such as video question answering, video summarization, temporal action localization, and event detection [2]
- Responsibilities also include building large-scale, high-quality multimodal datasets, running distributed training of large models, and collaborating with business teams on practical applications and innovation [2]

Job Requirements
- Candidates should hold a master's or doctoral degree in computer science, artificial intelligence, electronic information, automation, or a related field [4]
- Publications at top AI conferences or in journals are preferred, particularly in areas such as computer vision and multimodal learning [5]

Advantages of Joining
- The company offers a supportive environment with ample headcount for new graduates, interns, and experienced hires, along with competitive salaries and benefits such as mentorship and participation in significant projects [6]

Community and Resources
- The article also mentions a community platform for job seekers in autonomous driving and robotics, providing resources such as interview questions, industry reports, and salary negotiation tips [7][19]
Mimicking the Brain's Functional Specialization: Peking University and CUHK Release Fast-in-Slow VLA, Letting "Fast Action" and "Slow Reasoning" Work Together in One Unified Model
机器之心· 2025-07-12 02:11
Core Insights
- The article discusses a new dual-system visual-language-action model named Fast-in-Slow (FiS-VLA) that integrates high-frequency response and complex reasoning in robotic control [4][29]

Group 1: Research Background and Challenges
- The goal of robotic operating systems is to generate precise control signals from sensor inputs and language instructions in complex environments. However, large visual-language models (VLMs) are held back by their large parameter counts and slow inference, which restricts their practical use in high-frequency control tasks [7]
- The research draws inspiration from Kahneman's "dual-system theory," in which System 1 represents fast, intuitive decision-making and System 2 represents slower, deeper reasoning. Previous methods attempted to build a dual-system structure but lacked efficient collaboration between the two systems [8][9]

Group 2: FiS-VLA Architecture and Design
- FiS-VLA reconstructs the last few layers of the VLM into a System 1 execution module and embeds it within System 2, forming a unified model for efficient reasoning and control. System 2 processes 2D images and language instructions at a low frequency, while System 1 responds to real-time sensory inputs at a high frequency [11][13]
- The architecture includes a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion modeling. This design lets System 1 inherit pre-trained knowledge while executing at high frequency [13]

Group 3: Dual-System Collaboration
- FiS-VLA consists of a slow System 2 and a fast System 1: System 2 processes task-related visual observations and language instructions, converting them into high-dimensional features, while System 1 focuses on real-time action generation, receiving current sensory inputs and outputting actions conditioned on periodic updates from System 2 [14][15]
- The model uses asynchronous sampling to control the operating frequencies of the two systems, ensuring temporal consistency in action generation (a toy control-loop sketch follows this summary) [14]

Group 4: Performance Evaluation
- In simulation, FiS-VLA achieved an average success rate of 69% on RLBench tasks, outperforming models such as CogACT (61%) and π0 (55%). Its control frequency reached 21.9 Hz, more than double that of CogACT [17]
- On real robot platforms (Agilex and AlphaBot), FiS-VLA reached average success rates of 68% and 74% across eight tasks, significantly surpassing the π0 baseline [19]
- The model was robust in generalization tests, showing a smaller accuracy decline than π0 when faced with unseen objects, complex backgrounds, and lighting changes [21]

Group 5: Ablation Studies and Future Directions
- Ablation studies indicate that System 1 performs best when sharing two Transformer layers, and that the best collaboration frequency ratio between Systems 1 and 2 is 1:4. The theoretical control frequency can reach 117.7 Hz when predicting eight actions at once [23]
- The article concludes that FiS-VLA innovatively merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation. Future work may dynamically adjust the shared structure and the collaboration-frequency strategy to improve adaptability and robustness in real-world tasks [29]
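To make the asynchronous, dual-frequency collaboration concrete, here is a toy control loop under stated assumptions. It is not the FiS-VLA implementation: both modules are random stand-ins, the feature and action sizes are placeholders, and the 1:4 ratio from the summary is interpreted here as one slow System 2 update per four fast System 1 steps.

```python
# Toy sketch of a dual-system control loop: a slow reasoning module (System 2)
# refreshes a conditioning feature every K control steps, while a fast policy
# (System 1) emits an action at every step from the latest sensor reading plus
# the cached System 2 feature. All internals are placeholders.
import numpy as np

K = 4                      # assumed System 1 : System 2 frequency ratio of 4:1
FEAT_DIM, ACT_DIM = 32, 7  # placeholder feature / action sizes


def system2_reason(image_2d, instruction, rng):
    """Slow path: stand-in for a VLM pass fusing 2D observation + language.

    The inputs are ignored here; a real system would encode them.
    """
    return rng.standard_normal(FEAT_DIM)


def system1_act(sensor, cached_feature, rng):
    """Fast path: map real-time sensing + cached intent to a low-level action."""
    mix = np.concatenate([sensor[:FEAT_DIM], cached_feature])
    return np.tanh(rng.standard_normal(ACT_DIM) * 0.1 + mix.mean())


rng = np.random.default_rng(0)
cached = np.zeros(FEAT_DIM)
for step in range(12):
    if step % K == 0:                      # slow system fires at 1/K frequency
        image = rng.standard_normal((64, 64))
        cached = system2_reason(image, "stack the red block", rng)
    sensor = rng.standard_normal(128)      # high-rate sensory stream
    action = system1_act(sensor, cached, rng)
    print(f"step {step:02d}  |action|={np.linalg.norm(action):.3f}")
```

The structural point is that the expensive reasoning call sits outside the per-step loop, so the action rate is bounded by the cheap System 1 path rather than by VLM inference latency.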
Zhiyuan Releases the "Wujie" Series of Large Models, Including Emu3, the World's First Native Multimodal World Model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token prediction paradigm for unified multimodal learning, encoding images and videos into discrete symbol sequences (see the toy sketch after this summary) [2]
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform a range of neuroscience tasks and potentially becoming the "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source framework for embodied-intelligence SaaS platforms, significantly lowering development barriers and improving performance by 30% over its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning, achieving a 74% improvement in task-planning accuracy over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and offering insights into the relationship between microscopic fluctuations and macroscopic biological function [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced roughly 200 models and 160 datasets, and the FlagOS software stack has been upgraded to support a range of AI hardware, improving performance by up to 23% [8]

Group 5: Applications and Collaborations
- The Brainμ model has shown potential in consumer-grade brain-computer interface applications and is being developed with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- The development of a digital twin heart and a drug-safety evaluation platform demonstrates the application of advanced modeling techniques in pharmacology and personalized medicine [12]
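The summary's description of Emu3, images and videos encoded into discrete symbol sequences and modeled with the same next-token objective as text, can be sketched in a few dozen lines. The code below is a toy illustration of that unified-autoregression idea, not the Emu3 architecture: the quantizer is an untrained nearest-neighbor lookup, and all vocabulary sizes and dimensions are placeholders.

```python
# Toy "everything is next-token prediction" sketch: images are quantized into
# discrete codes via a (random, untrained) VQ-style codebook, concatenated with
# text tokens, and a single autoregressive transformer models the mixed stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB, DIM = 1000, 512, 256
TOTAL_VOCAB = TEXT_VOCAB + IMG_VOCAB              # shared text + image vocabulary


def quantize_image(patch_feats, codebook):
    """Map continuous patch features to nearest-codebook indices (VQ lookup)."""
    d = ((patch_feats.unsqueeze(2) - codebook) ** 2).sum(-1)  # (B, N, IMG_VOCAB)
    return d.argmin(-1) + TEXT_VOCAB                          # offset into shared vocab


class TinyUnifiedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TOTAL_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, TOTAL_VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(self.emb(ids), mask=causal))


codebook = torch.randn(IMG_VOCAB, DIM)
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))               # "caption" tokens
image_ids = quantize_image(torch.randn(2, 64, DIM), codebook)  # 64 visual tokens
seq = torch.cat([text_ids, image_ids], dim=1)                  # one mixed stream

model = TinyUnifiedLM()
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, TOTAL_VOCAB), seq[:, 1:].reshape(-1))
print("unified next-token loss:", loss.item())
```

Because text and visual codes live in one shared vocabulary and one loss, the same model can, in principle, continue a sequence with either modality, which is what makes the "world model" framing possible.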
With Over 40,000 Authors Competing, CVPR 2025 Officially Reveals Its Three Hottest Topics. Are You Working on the Right One?
机器之心· 2025-05-28 03:02
Core Insights
- The article discusses the latest trends in computer vision, highlighting three major research directions that are gaining traction as of 2025 [3][4]

Group 1: Major Research Directions
- The three prominent areas identified are:
  1. Multi-view and sensor-based 3D technology, which has evolved from 2D rendering to more complex 3D evaluation, strongly influenced by the introduction of NeRF in 2020 [5]
  2. Image and video synthesis, which has become a focal point for presenting environmental information more accurately, reflecting advances in analyzing and generating multimedia content [6]
  3. Multimodal learning, which integrates visual, linguistic, and reasoning capabilities, pointing toward more interactive and comprehensive AI systems [7][8]

Group 2: Conference Insights
- CVPR 2025 saw a 13% increase in paper submissions, with 13,008 submissions in total and an acceptance rate of 22.1%, indicating a highly competitive environment [3]
- The conference emphasizes the importance of diverse voices in the research community, ensuring that every paper, regardless of the author's affiliation, receives equal consideration [8]
ETT: Breaking the Visual Bottleneck of Native Multimodal Learning and Reshaping the Visual Tokenizer Optimization Paradigm
机器之心· 2025-05-27 06:38
Core Viewpoint
- The article introduces ETT (End-to-End Vision Tokenizer Tuning), a method that jointly optimizes visual tokenization and downstream tasks, addressing the limitations of traditional visual tokenization pipelines [2][4]

Group 1: Limitations of Traditional Methods
- Traditional visual tokenization suffers from a critical flaw: the optimization of the visual tokenizer is decoupled from the training of downstream tasks, leading to suboptimal performance on tasks that require rich semantic representations [1][5]
- Existing multimodal pre-training frameworks, such as Emu3, use frozen visual tokenizers, wasting their rich feature-representation capacity and preventing end-to-end training [6][10]

Group 2: ETT Innovations
- ETT jointly optimizes the visual tokenizer with the target autoregressive task, allowing the tokenizer to adapt based on feedback from downstream tasks [4][10]
- The architecture is based on an improved IBQ framework, with a codebook size of 131,072 and a feature dimension of 256, improving the efficiency of the visual tokenizer [10]

Group 3: Training Strategy
- ETT uses a two-stage training strategy, beginning with an alignment-learning phase in which only the visual projection layer is trained while the large language model and the visual tokenizer remain frozen [11]
- In the semantic-learning phase, all model weights are unfrozen for end-to-end training, allowing the visual tokenizer to improve its perceptual capabilities while preserving its image-reconstruction ability (a minimal sketch of this schedule follows the summary) [11]

Group 4: Performance Metrics
- ETT performs strongly on multimodal understanding tasks, achieving competitive results on benchmarks such as GQA and MMBench, even with fewer model parameters and less training data than state-of-the-art visual language models [12][13]
- On multimodal generation tasks, ETT matches the performance of advanced diffusion and autoregressive models while being more efficient in model parameters and training data [14][15]

Group 5: Qualitative Results
- ETT generates diverse and detailed visual content that adheres closely to text prompts, producing high-quality images across a range of artistic styles and themes [16]

Group 6: Visual Reconstruction
- ETT significantly improves visual reconstruction, preserving low-level details while strengthening high-level semantic representation, thereby providing better visual representations for multimodal tasks [17]

Group 7: Future Directions
- Future research will focus on scaling up data and model capacity for ETT, exploring end-to-end training of visual tokenizers from scratch, and extending the methodology to other modalities such as video and audio [19]

Group 8: Conclusion
- ETT represents a breakthrough in native multimodal learning, offering a simple yet effective way to optimize visual tokenizers, improving the performance of multimodal models and paving the way for broader applications [25]
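The two-phase schedule described in Group 3 is straightforward to express in code. The following is a hedged sketch under toy assumptions: the modules are small stand-ins (no IBQ quantizer, no 131,072-entry codebook, no real LLM), the loss is a placeholder, and the point is only to show how the freeze/unfreeze boundary moves between the alignment phase and the semantic phase.

```python
# Hedged sketch of a two-phase tuning schedule in the spirit of ETT: phase 1
# trains only a visual projection layer with the LLM and visual tokenizer
# frozen; phase 2 unfreezes everything so the tokenizer receives gradients
# from the downstream autoregressive-style loss. All modules are toy stand-ins.
import torch
import torch.nn as nn

visual_tokenizer = nn.Sequential(nn.Conv2d(3, 64, 4, 4), nn.Flatten(2))  # toy tokenizer
projection = nn.Linear(64, 128)                                          # trainable bridge
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(128, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(128, 1000)                                              # toy task head


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def forward_loss(images, targets):
    feats = visual_tokenizer(images).transpose(1, 2)   # (B, N, 64) visual tokens
    hidden = llm(projection(feats))                    # (B, N, 128)
    logits = head(hidden).mean(1)                      # (B, 1000) placeholder objective
    return nn.functional.cross_entropy(logits, targets)


images = torch.randn(4, 3, 32, 32)
targets = torch.randint(0, 1000, (4,))

# Phase 1: alignment learning -- only the projection layer is updated.
set_trainable(visual_tokenizer, False)
set_trainable(llm, False)
opt1 = torch.optim.AdamW(projection.parameters(), lr=1e-3)
loss = forward_loss(images, targets)
loss.backward()
opt1.step()

# Phase 2: semantic learning -- all weights unfrozen for end-to-end tuning.
for m in (visual_tokenizer, projection, llm, head):
    m.zero_grad(set_to_none=True)
set_trainable(visual_tokenizer, True)
set_trainable(llm, True)
all_params = (list(visual_tokenizer.parameters()) + list(projection.parameters())
              + list(llm.parameters()) + list(head.parameters()))
opt2 = torch.optim.AdamW(all_params, lr=1e-4)
loss = forward_loss(images, targets)
loss.backward()
opt2.step()
print("phase-2 loss:", loss.item())
```

The key difference from frozen-tokenizer pipelines is the second phase: once the tokenizer's parameters require gradients and sit in the optimizer, the downstream task loss reshapes the visual codes themselves rather than only the layers above them.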