ICCV 2025 | Training too complex? Demands on image semantics and layout too high? Image morphing finally works in a single step
机器之心· 2025-07-18 00:38
The first author of this paper is Cao Yukang, a postdoctoral researcher at MMLab, Nanyang Technological University, whose research covers 3D/4D reconstruction and generation, human motion/video generation, and image generation and editing. The co-first author is Si Chenyang, an assistant professor at Nanjing University, working on image/video generation and on the optimization and acceleration of generative models.

Introduction: In image processing, "image morphing" is a common yet creative task: it lets two stylistically very different images transition smoothly and blend naturally, producing striking intermediate images. You may have seen it in animation, film effects, or photo editing. In the past, this technique typically relied on complex image-alignment algorithms and color-interpolation rules, and struggled with images of complex texture and diverse semantics. In recent years, deep-learning methods such as GANs and VAEs have made significant progress, but they still face high training cost, strong data dependence, and unstable inversion, and are especially unreliable on real-world images. To achieve high-quality morphing, researchers have tried everything from image warping and color interpolation to GANs and VAEs, and even large models such as Stable Diffusion and CLIP. Yet even in the most advanced schemes, high training cost and poor adaptability remain unavoidable problems. So, is it possible to drop training entirely? To rely on no pretrained model and no extra annotation? Using only ...
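For context, the classical color-interpolation baseline the article alludes to reduces, in its simplest form, to a cross-dissolve: a linear blend of pixel colors between the two endpoint images. Real morphing pipelines add warping and alignment on top. A minimal sketch (function name and toy images are illustrative, not from the paper):

```python
import numpy as np

def cross_dissolve(img_a, img_b, alpha):
    # alpha = 0 returns img_a, alpha = 1 returns img_b;
    # intermediate alphas linearly blend pixel colors.
    return (1.0 - alpha) * img_a + alpha * img_b

img_a = np.zeros((2, 2, 3))   # toy black image
img_b = np.ones((2, 2, 3))    # toy white image
mid = cross_dissolve(img_a, img_b, 0.5)
print(mid[0, 0])              # mid-gray pixel: [0.5 0.5 0.5]
```

The article's point is precisely that such a color-only blend breaks down when the two images differ in semantics or layout, which is what the alignment-, GAN-, VAE-, and diffusion-based approaches it surveys try to address.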
Three Questions on AI (3): The Model Question | Facing the model question head-on to jointly shape AI's future: the WAIC 2025 Large Model Forum leads technological innovation by breaking through with questions
36Kr· 2025-07-17 03:21
WAIC 2025 World Artificial Intelligence Conference. Forum: July 26-28, 2025; Exhibition: July 26-29, 2025. As a key part of the "Model Question" event series, this forum takes "cracking the essential problems of models" as its core goal, building a cross-border, cross-architecture exchange platform for top global researchers and engineers. Technical experts from leading AI companies and scholars from top universities will gather for an in-depth dialogue on the core question of "the intrinsic link between generalization bottlenecks and underlying model paradigms": analyzing whether models' insufficient generalization stems from inherent limits of architecture design and learning paradigms, and exploring paths to technical breakthroughs. The wisdom of different countries and different technical routes will collide and merge here, not only advancing cross-domain exchange on frontier large-model results, but also offering diverse perspectives, through targeted discussion, on the technical bottlenecks in current large-model development, making the "model question" the logical starting point for technical breakthroughs. Highlight 2: architectural innovation and industrial deployment, using the "model question" to drive paradigm shifts. The event takes the "model question" as the core guide for technical exploration, delving into fusion paths between Transformer and non-Transformer architectures and pushing large-model technology from a single path toward multiple paradigms. On one hand, it focuses on the "semantic gap of cross-modal intelligence," analyzing the semantic mismatch among heterogeneous modalities such as text and images and exploring breakthrough directions for multimodal fusion architectures; on the other hand, it directly confronts "性 ...
The energy-based Transformer arrives, beating mainstream models across the board by 35%
量子位· 2025-07-08 07:30
Core Viewpoint
- The article discusses the introduction of the Energy-Based Transformers (EBT) architecture by a team from the University of Virginia, which surpasses the Transformer++ model across multiple dimensions, including data, parameters, computation, and model depth, through a novel energy mechanism [1][3][28].

Summary by Sections

EBT Architecture and Performance
- EBT achieves approximately 35% improvement over Transformer++ in dimensions such as data volume, batch size, parameter count, computation, and model depth [3].
- During inference, EBT shows a 29% performance enhancement compared to Transformer++ [7].
- EBT is designed to simulate human-like thinking by minimizing energy through a gradient descent process, allowing the model to determine the number of "thinking steps" dynamically [13][14].

Energy-Based Models (EBM)
- EBT is developed from the principles of Energy-Based Models (EBM), which assign a scalar value to each input configuration through an energy function [15][16].
- Lower energy indicates higher compatibility or probability among input variables, while higher energy suggests lower compatibility [17][18].
- Large-scale training of EBMs remains an unresolved challenge, with two primary training methods identified: contrastive learning and regularization methods [19][20].

Training and Scalability
- The research team transformed EBM learning into an optimization problem, effectively avoiding the curse of dimensionality and enabling scalable learning [22].
- EBT includes two variants: bidirectional EBT, which is simpler to implement, and autoregressive EBT, which is more complex due to information-leakage issues [26].

Comparative Analysis
- EBT consistently outperforms Transformer++ across six different dimensions, becoming the first model to achieve multi-dimensional superiority without changing the tokenizer [27][28].
- As training time increases, EBT's thinking capability improves, with performance gains rising from 4%-8% to 10%-14% [28].
- EBT outperforms diffusion models in image denoising while reducing the required forward computation by 99% [32].

Implications and Future Directions
- EBT introduces a new approach to implementing System 2 thinking through an energy-based optimization mechanism, demonstrating strong scalability and generalization capabilities [34].
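The energy-minimization inference loop described above can be illustrated with a toy example. This is a hedged sketch, not the paper's implementation: the quadratic energy function stands in for a trained network, and the stopping rule mimics the dynamically chosen number of "thinking steps."

```python
import numpy as np

def energy(x, y):
    # Lower energy = higher compatibility between input x and answer y.
    # In this toy, the "correct" answer is 2*x, so E is minimized at y = 2*x.
    return float(np.sum((y - 2.0 * x) ** 2))

def energy_grad(x, y):
    # Analytic gradient of the toy energy w.r.t. the candidate answer y.
    return 2.0 * (y - 2.0 * x)

def ebt_infer(x, lr=0.1, tol=1e-6, max_steps=1000):
    # Instead of one forward pass, refine a candidate answer by gradient
    # descent on the energy, stopping when thinking no longer helps.
    y = np.zeros_like(x)              # uninformed initial guess
    prev = energy(x, y)
    for step in range(1, max_steps + 1):
        y -= lr * energy_grad(x, y)   # one "thinking step"
        cur = energy(x, y)
        if prev - cur < tol:          # dynamic number of thinking steps
            return y, step
        prev = cur
    return y, max_steps

x = np.array([1.0, -3.0])
y, steps = ebt_infer(x)
print(np.round(y, 3), steps)   # y converges toward 2*x = [2, -6]
```

Harder inputs (a flatter or more rugged energy landscape) would take more steps before the improvement falls below the tolerance, which is the mechanism behind "thinking longer" at inference time.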
The "water seller" (picks-and-shovels supplier) behind Tesla's and NVIDIA's robots
虎嗅APP· 2025-07-06 03:31
Core Viewpoint
- The article discusses the rise of embodied intelligence and the critical role of data providers, like CyberOrigin, in the robotics industry, emphasizing that data is the new oil for the development of humanoid robots [3][5][23].

Group 1: Industry Trends
- The emergence of embodied AI has led to significant interest from major companies like Tesla and NVIDIA, which are now focusing on humanoid robot development [11][20].
- The Transformer architecture has revolutionized the robotics field by enabling better spatial understanding and generalization capabilities, allowing robots to learn from vast amounts of data [12][13][14].

Group 2: Company Insights
- CyberOrigin, founded by Yin Peng, aims to become a leading data supplier for humanoid robots, focusing on real-world interaction data rather than just hardware [5][22].
- The company has established partnerships with major AI firms and is actively collecting millions of hours of real-world data to enhance robot training [25][26][29].

Group 3: Data Importance
- Data is essential for the evolution of both the physical robot and its cognitive capabilities; in the article's analogy, models are engines while data is the fuel [23][24].
- The company prioritizes collecting real-world data over synthetic data, believing that authentic data significantly improves model training outcomes [26][27].

Group 4: Challenges and Opportunities
- The robotics industry is currently in a chaotic phase, with many new entrants recognizing the value of data, leading to increased competition [51].
- The company acknowledges the long commercial chain in the robotics sector but believes that data can quickly form a commercial loop, making it a strategic focus [22][23].
Wall Street scents a quantum investment opportunity: hot quantum-computing stock Rigetti Computing wins an "Overweight" rating
Zhi Tong Cai Jing· 2025-07-02 14:20
Core Insights
- Rigetti Computing has gained significant attention in the U.S. stock market after Cantor Fitzgerald initiated coverage with a "buy" rating and a $15 target price, indicating Wall Street's growing interest in quantum computing as a lucrative investment opportunity [1][2].
- The quantum computing sector is still in its infancy but is recognized as a highly sought-after technological milestone with potential for substantial economic impact in the future [1][3].
- Major tech companies like NVIDIA, Microsoft, and IBM are heavily investing in quantum computing, signaling a competitive landscape and the potential for significant advancements in commercial applications [1][4][8].

Company Developments
- Rigetti recently completed a $350 million stock issuance to strengthen its balance sheet [2].
- NVIDIA CEO Jensen Huang highlighted that quantum computing is approaching a critical technological turning point, with the potential to solve significant global issues in the coming years [4][5].
- Cisco has announced its entry into the quantum computing field by showcasing a prototype chip for connecting quantum computers, indicating a broadening interest in the sector [6].

Industry Trends
- The concept of a "Transformer moment" in quantum computing is emerging, referring to the arrival of controllable and commercially valuable quantum computing applications [7][8].
- Recent advancements in technologies such as ion traps and quantum annealing are paving the way for practical quantum computing applications, moving from theoretical concepts to real-world implementations [7][8].
- The involvement of major tech giants and government support is expected to accelerate the commercialization of quantum computing on a global scale [8].
Wherever you draw, it moves! ByteDance releases ATI, a "magic brush" for video generation, now open-sourced!
机器之心· 2025-07-02 10:40
Core Viewpoint
- The article discusses the development of ATI, a new controllable video generation framework by ByteDance, which allows users to create dynamic videos by drawing trajectories on static images, transforming user input into explicit control signals for object and camera movements [2][4].

Group 1: Introduction to ATI
- Angtian Wang, a researcher at ByteDance focusing on video generation and 3D vision, highlights the advancements in video generation tasks driven by diffusion models and transformer architectures [1].
- Current mainstream methods face a significant bottleneck in providing effective and intuitive motion control for users, limiting creative expression and practical application [2].

Group 2: Methodology of ATI
- ATI accepts two basic inputs: a static image and a set of user-drawn trajectories, which can be any shape, including lines and curves [6].
- The Gaussian Motion Injector encodes these trajectories into motion vectors in latent space, guiding the video generation process frame by frame [6][14].
- The model uses Gaussian weights to ensure that it can "see" the drawn trajectories and understand how they relate to the generated video [8][14].

Group 3: Features and Capabilities
- Users can draw trajectories for key actions like running or jumping, with ATI accurately sampling and encoding joint movements to generate natural motion sequences [19].
- ATI can handle up to 8 independent trajectories simultaneously, ensuring that object identities remain distinct during complex interactions [21].
- The system allows for synchronized camera movements, enabling users to create dynamic videos with cinematic techniques like panning and tilting [23][25].

Group 4: Performance and Applications
- ATI demonstrates strong cross-domain generalization, supporting various artistic styles such as realistic films, cartoons, and watercolor renderings [28].
- Users can create non-realistic motion effects, such as flying or stretching, providing creative possibilities for sci-fi or fantasy scenes [29].
- The high-precision model based on Wan2.1-I2V-14B can generate videos comparable to real footage, while a lightweight version is available for real-time interactions in resource-constrained environments [30].

Group 5: Open Source and Community
- The Wan2.1-I2V-14B model version of ATI has been open-sourced on Hugging Face, facilitating high-quality, controllable video generation for researchers and developers [32].
- Community support is growing, with tools like ComfyUI-WanVideoWrapper available to optimize model performance on consumer-grade GPUs [32].
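The Gaussian-weighting idea behind the Gaussian Motion Injector can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's actual design: the grid size, the sigma value, and the additive injection rule are all made up for demonstration; the real system operates on the latent space of a video diffusion model.

```python
import numpy as np

def gaussian_weight_map(h, w, px, py, sigma=1.5):
    # Spread a trajectory point (px, py) over an h-by-w feature grid with a
    # Gaussian kernel, so nearby cells also "see" the trajectory softly.
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - px) ** 2 + (ys - py) ** 2
    wmap = np.exp(-d2 / (2.0 * sigma ** 2))
    return wmap / wmap.sum()          # normalize weights to sum to 1

def inject_motion(latent, px, py, motion_vec, sigma=1.5):
    # latent: (h, w, c) feature grid; motion_vec: (c,) motion encoding.
    # The motion vector is added locally, weighted by the Gaussian map.
    h, w, _ = latent.shape
    wmap = gaussian_weight_map(h, w, px, py, sigma)
    return latent + wmap[..., None] * motion_vec

latent = np.zeros((8, 8, 2))
out = inject_motion(latent, px=3.0, py=5.0, motion_vec=np.array([1.0, -1.0]))
print(out[5, 3])   # strongest response at the drawn trajectory point
```

Repeating this per frame along a drawn trajectory gives the generator a soft, spatially localized motion signal at each step, which is the intuition the article describes for frame-by-frame guidance.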
A roundup of the important papers in the LLM field since the 2017 Transformer
机器之心· 2025-06-29 04:23
Core Insights
- The article discusses Andrej Karpathy's concept of "Software 3.0," where natural language becomes the new programming interface and AI models execute specific tasks [1][2].
- It emphasizes the transformative impact of this shift on developers, users, and software design paradigms, indicating that a new computational framework is being constructed [2].

Development of LLMs
- The evolution of Large Language Models (LLMs) has accelerated since the introduction of the Transformer architecture in 2017, leading to significant advancements in the GPT series and multimodal capabilities [3][5].
- Key foundational papers that established today's AI capabilities are reviewed, highlighting the transition from traditional programming to natural-language interaction [5][6].

Foundational Theories
- "Attention Is All You Need" (2017) introduced the Transformer architecture, which relies solely on self-attention mechanisms, revolutionizing natural language processing and computer vision [10][11].
- "Language Models are Few-Shot Learners" (2020) demonstrated the capabilities of GPT-3, establishing the "large model + large data" scaling law as a pathway to more general artificial intelligence [13][18].
- "Deep Reinforcement Learning from Human Preferences" (2017) laid the groundwork for reinforcement learning from human feedback (RLHF), crucial for aligning AI outputs with human values [15][18].

Milestone Breakthroughs
- The "GPT-4 Technical Report" (2023) details a large-scale, multimodal language model that exhibits human-level performance across various benchmarks, emphasizing the importance of AI safety and alignment [26][27].
- The release of the LLaMA models (2023) demonstrated that smaller models trained on extensive datasets could outperform larger models, promoting a new approach to model efficiency [27][30].

Emerging Techniques
- The "Chain-of-Thought Prompting" technique enhances reasoning in LLMs by guiding them to articulate their thought processes before arriving at conclusions [32][33].
- "Direct Preference Optimization" (2023) simplifies the alignment of language models by directly utilizing human preference data, making it a widely adopted method in the industry [34][35].

Important Optimizations
- The "PagedAttention" mechanism improves memory management for LLMs, significantly enhancing throughput and reducing memory usage during inference [51][52].
- The "Mistral 7B" model showcases how smaller models can achieve high performance through innovative architecture, influencing the development of efficient AI applications [55][56].
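Of the techniques surveyed above, Direct Preference Optimization has a particularly compact core that is easy to show numerically. The sketch below computes the standard DPO loss for a single preference pair; the log-probabilities are made-up numbers, and `beta` is the usual temperature on the implicit reward.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit "rewards" are beta-scaled log-probability ratios of the
    # policy against a frozen reference model; no reward model, no RL loop.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = r_chosen - r_rejected
    # Negative log-sigmoid of the margin: small when the policy already
    # prefers the chosen answer over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen answer -> lower loss.
good = dpo_loss(-5.0, -9.0, -6.0, -6.0)
# Policy prefers the rejected answer -> higher loss.
bad = dpo_loss(-9.0, -5.0, -6.0, -6.0)
print(round(good, 4), round(bad, 4))
```

Minimizing this loss over a dataset of (chosen, rejected) pairs pushes the policy's preference margin up directly, which is why the article describes DPO as a simplification of the RLHF pipeline.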
Your CamScanner (扫描全能王), valued at 21.7 billion RMB, sprints toward a Hong Kong IPO
量子位· 2025-06-27 10:57
Core Viewpoint
- The company, Shanghai Hehe Information Technology, is aiming to become the "first stock of intelligent text recognition" in Hong Kong, following its previous listing on the A-share Sci-Tech Innovation Board. The company has shown significant growth in revenue and user engagement, positioning itself as a leader in the AI sector with a focus on text-intelligence technology [2][3][4].

Financial Performance
- In 2024, the company reported revenue of 1.438 billion RMB, a net profit of 400 million RMB, and a gross margin of 84.3% [4][25].
- Revenue grew at approximately a 21% CAGR from 2022 to 2024, with revenues of 989 million RMB, 1.187 billion RMB, and 1.438 billion RMB respectively [25].
- The C-end business accounted for a significant portion of total revenue, contributing 82.2%, 84.3%, and 83.8% from 2022 to 2024 [27].

User Engagement
- Monthly active users (MAU) of C-end products reached 171 million in 2024, with a paid-user ratio of 4.3% [21].
- The company ranks first in China and fifth globally among efficiency-AI companies with MAU exceeding 100 million [21][22].

Product Portfolio
- The company offers products for both C-end and B-end markets, including "Scan All-in-One" and "Business Card All-in-One" for C-end, and "TextIn" and "Qixin Huayan" for B-end [8][12].
- The core technology is multi-modal text intelligence, which enhances efficiency across various applications [14][15].

Market Position
- The company is positioned as a leading AI firm in text recognition and processing, competing with major players like OpenAI, Google, Adobe, and Microsoft [5][6][21].
- The global AI product market is projected to grow from 46.5 billion USD in 2024 to 228 billion USD by 2029, indicating a robust growth trajectory for the industry [66].

Research and Development
- R&D investment rose from 280 million RMB in 2022 to 323 million RMB in 2023 and 390 million RMB in 2024, representing about 27% of total revenue [33].
- The workforce consists of 1,053 employees, with 60.6% in R&D roles, highlighting the company's commitment to innovation [35].

Future Plans
- Funds raised from the Hong Kong listing will primarily be used for R&D, international expansion, and exploring investment and acquisition opportunities [50].
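As a quick sanity check on the growth figure above (a back-of-the-envelope computation, not from the article): revenue went from 989 million RMB in 2022 to 1,438 million RMB in 2024, i.e. over two year-on-year steps.

```python
# CAGR over two annual steps: (end / start) ** (1 / years) - 1
cagr = (1438 / 989) ** (1 / 2) - 1
print(f"{cagr:.1%}")   # ≈ 20.6%, consistent with the "approximately 21%" claim
```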
Zhou Bowen, Director of Shanghai AI Lab: ten questions on the frontiers of artificial intelligence
机器人圈· 2025-06-26 10:46
This article originates from the Shanghai Artificial Intelligence Laboratory (author: Shanghai AI Lab). The Shanghai AI Laboratory is a new type of research institution in China's AI field that conducts strategic, original, and forward-looking scientific research and technical breakthroughs, aiming to become a world-class AI laboratory and a globally renowned source of original AI theory and technology.

"Investing in discovering problems is as important as solving them." This was one of the core points of the opening report given by Zhou Bowen, Director of the Shanghai AI Laboratory, at the first Mingzhu Lake Conference. In the report, Zhou also posed ten questions about the frontiers of artificial intelligence:

1. Aggregate intelligence vs. per-unit intelligence: how do we balance the quality and efficiency of intelligence development?
2. The resource paradox of scaling deep RL: how should compute be balanced between the two tasks of "data synthesis" and "algorithm training"?
3. Software-hardware co-innovation: should software adapt to hardware, or hardware accommodate software?
4. The impact of compute constraints: how should compute be allocated across applications, iteration, and disruptive technologies?
10. 颠 ...
Paying tribute to Qian Xuesen, Chinese researchers develop "Lingjing" (灵境), an AI virtual-reality exercise system that tackles adolescent obesity and reveals the mechanisms by which VR exercise drives weight loss and enhances brain cognition
生物世界· 2025-06-24 03:56
Core Viewpoint
- Adolescent obesity is a global public-health crisis with rising prevalence, leading to increased risks of cardiovascular and metabolic diseases, as well as cognitive impairments [2]

Group 1: Research and Development
- A research team from Shanghai Jiao Tong University and other institutions developed the world's first VR-based exercise intervention system, REVERIE, aimed at overweight adolescents [4][8]
- The REVERIE system utilizes deep reinforcement learning and a Transformer-based virtual coach to provide safe, effective, and empathetic exercise guidance [4][8]

Group 2: Study Design and Methodology
- The study included a randomized controlled trial with 227 overweight adolescents, comparing outcomes between VR exercise, real-world exercise, and a control group [11]
- Participants were assigned to different groups, including VR and real-world sports, with all groups receiving uniform dietary management over an eight-week intervention [11]

Group 3: Results and Findings
- After eight weeks, the VR exercise group lost an average of 4.28 kg of body fat, while the real-world exercise group lost 5.06 kg, showing comparable results [13]
- Both VR and real-world exercise groups showed improvements in liver-enzyme levels, LDL cholesterol, physical fitness, mental health, and exercise willingness [13]
- VR exercise demonstrated superior cognitive-function enhancement compared to real-world exercise, supported by fMRI findings indicating increased neural efficiency and plasticity [14]

Group 4: Safety and Implications
- The injury rate in the VR exercise group was 7.69%, lower than the 13.48% in the real-world exercise group, with no severe adverse events reported [15]
- The REVERIE system is positioned as a promising solution for addressing adolescent obesity and promoting overall health improvements beyond weight loss [16][17]