Diffusion Models
TransDiffuser: Li Auto's VLA architecture for generating trajectories with diffusion
理想TOP2· 2025-05-18 13:08
Core Viewpoint - The article covers advances in autonomous driving, focusing on the Diffusion model and its use in generating driving trajectories, and highlighting the differences between VLM and VLA systems [1][4].

Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns a data distribution through a forward process that adds noise and a reverse process that removes it, akin to solving a puzzle in reverse [4].
- The denoising step trains a neural network to predict and remove the noise, ultimately generating the target data; a minimal sketch of this reverse process follows this summary [4].
- Diffusion generates not only the ego vehicle's trajectory but also predicted trajectories for other vehicles and pedestrians, improving decision-making in complex traffic environments [5].

Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 learns by imitation and outputs trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast- and slow-thinking capabilities and inherent semantic reasoning [2].
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment; these are then decoded into driving trajectories by the Diffusion model [4][5].

Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that fuses multi-modal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture pairs a Scene Encoder, which processes the multi-modal inputs, with a Denoising Decoder built on the DDPM framework for trajectory generation [7][9].
- During denoising, a multi-head cross-attention mechanism fuses scene and motion features [9].

Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multi-modal representation decorrelation mechanism that increases trajectory diversity and reduces redundancy [11][12].

Group 5: Limitations and Future Directions
- The authors note that fine-tuning the model, particularly the perception encoder, remains challenging [13].
- Future directions include integrating reinforcement learning and drawing on models such as OpenVLA [13].
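To make the reverse (denoising) process from Group 1 concrete, the sketch below runs a standard DDPM sampling loop over trajectory waypoints. This is a minimal sketch, not TransDiffuser's released code: the network `eps_model`, its signature, and the linear noise schedule are all assumptions for illustration.

```python
# Minimal DDPM-style reverse process for trajectory generation (a sketch,
# not TransDiffuser's code). `eps_model` is a hypothetical trained network
# that predicts the noise added at step t, conditioned on scene features.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products of alphas

@torch.no_grad()
def sample_trajectory(eps_model, scene_feat, horizon=8, dim=2):
    """Start from pure Gaussian noise and iteratively remove predicted
    noise, yielding a (1, horizon, dim) tensor of trajectory waypoints."""
    x = torch.randn(1, horizon, dim)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]), scene_feat)   # predicted noise
        a_t, ab_t = alphas[t], alpha_bars[t]
        # DDPM posterior mean: subtract the predicted noise contribution
        x = (x - (1.0 - a_t) / (1.0 - ab_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # sampling noise
    return x
```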
Toggle the lights with one click! Google pushes cinema-grade lighting control to the limit with a diffusion model
机器之心· 2025-05-16 04:39
Core Viewpoint - Google has launched LightLab, a project that enables precise control over lighting in images, letting users adjust light source intensity and color and insert virtual light sources into scenes [1][2].

Group 1: Technology and Methodology
- LightLab uses a fine-tuned diffusion model trained on a specially constructed dataset to achieve precise control over lighting in images [7][11].
- The dataset combines real photo pairs with controlled lighting changes and synthetic images from a physically based renderer, allowing the model to learn complex lighting effects [10].
- The model can simulate indirect lighting, shadows, and reflections, providing a photorealistic prior for lighting control [10][11].

Group 2: Data Collection and Processing
- The research team captured 600 pairs of raw photos of the same scene with a single light source turned on and off, using automatic settings to ensure good exposure [22][23].
- The dataset was expanded to roughly 36,000 images through post-processing to cover a range of intensities and colors [27].
- The team applied a consistent tone-mapping strategy and separated the target light source's contribution from the ambient light in the images (a sketch of this linear decomposition follows this summary) [17][18].

Group 3: Model Training and Evaluation
- The model was trained for 45,000 steps at 1024 × 1024 resolution with a learning rate of 1e-5 and a batch size of 128, taking about 12 hours on 64 v4 TPUs [28].
- Evaluation metrics included Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), with user studies conducted to validate results [29].
- The model outperformed previous methods, achieving a PSNR of 23.2 and an SSIM of 0.818 [31][33].

Group 4: Applications and Features
- LightLab offers a range of interactive lighting controls, letting users adjust light source intensity and color [12][38][41].
- The technology can insert virtual point light sources into scenes, expanding creative possibilities [44].
- Separating the target light source from ambient light also gives control over natural light entering a scene, which is normally difficult to manage [45].
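The decomposition in Group 2 relies on the fact that light transport is linear: an image with the target light at relative intensity alpha can be composed from the ambient-only photo plus a scaled, recolored copy of the isolated target-light contribution (the on-photo minus the off-photo). The sketch below illustrates that composition under the assumption of linear (not yet tone-mapped) images; the function and parameter names are ours, not LightLab's.

```python
# Sketch of linear light composition for dataset augmentation (our
# reconstruction of the idea, not LightLab's code). Inputs are HxWx3 float
# arrays in linear radiance space, before tone mapping.
import numpy as np

def relight(img_on: np.ndarray, img_off: np.ndarray,
            alpha: float, rgb=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Compose an image with the target light at intensity `alpha` and
    color `rgb`, from photos with that light on and off."""
    light_only = np.clip(img_on - img_off, 0.0, None)  # isolated target light
    return img_off + alpha * np.asarray(rgb) * light_only
```

Sweeping `alpha` and `rgb` over a grid is one plausible way the 600 captured pairs could be expanded to roughly 36,000 training images.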
Why is every product manager building AI products losing money right now?
36Kr· 2025-05-06 01:50
Right now, almost every AI product manager I know is iterating features on an existing AI product; very few are building an AI product from zero to one.

A while back I wrote an article arguing that AI product managers essentially work within just two product frameworks: either the user goes to the AI, or the AI comes to the user. The biggest difference is that in the former, after users register and log in, the core functionality is not the AI itself, whereas in the latter, users complete basic operations on top of the AI model's capabilities as soon as they log in, without needing to know where the AI entry point even is.

The regrettable part is that, under either framework, product managers building AI are losing money.

There are also non-transformer models in China. For example, I recently came across a team that worked on search early on in China; they had explored the transformer themselves, but because of its hallucination problem and high training cost they turned to a different model, which they call the Yan model. Its architecture requires remarkably few resources, making it suitable for running on phones and other end devices.

The reason for saying the transformer is not the best architecture for AI lies in the problem of large-model hallucination: the common approach of using reinforcement learning (RL) for feedback never reaches 100%, and this is in fact the wrong fix. As the AI pioneer Yann LeCun put it: "This is like endlessly repainting a beat-up car. We only fix the surface and never look at what's inside, and that will never repair it."

Based on this, ...
CVPR 2025 | How can personalized multi-person images be generated stably and efficiently? ID-Patch offers a new solution
机器之心· 2025-05-03 04:18
Core Viewpoint - The article discusses the advances and challenges in personalized multi-person image generation with diffusion models, highlighting the innovative ID-Patch mechanism, which addresses identity leakage and improves accuracy in both positioning and identity representation [1][5][21].

Group 1: Challenges in Multi-Person Image Generation
- Personalized single-person image generation already achieves impressive visual results, but generating images with multiple people introduces new complexities [4].
- Identity leakage is a central challenge: features of different individuals blend together, making them difficult to distinguish [2][4].
- Existing methods such as OMG and InstantFamily attempt to tackle identity confusion but suffer in efficiency and accuracy, especially as the number of individuals grows [4][14].

Group 2: ID-Patch Mechanism
- ID-Patch is a solution designed specifically for multi-person image generation, built around binding identity to position (a schematic sketch follows this summary) [6][21].
- The mechanism splits facial information into two key modules, allowing precise placement of each individual while preserving their unique identity [9][21].
- ID-Patch integrates additional spatial conditions, such as pose and depth maps, improving its adaptability to complex scenes [10][21].

Group 3: Performance and Efficiency
- ID-Patch achieves superior identity resemblance (0.751) and identity-position matching (0.958), demonstrating consistent faces and accurate placement [12].
- It is also the fastest of the compared methods, generating an 8-person group photo in roughly 10 seconds versus nearly 2 minutes for OMG [17][15].
- Performance remains robust as the number of faces increases, with only a slight decline in effectiveness [14][21].

Group 4: Future Directions
- Facial feature representation could improve further by incorporating diverse images of the same identity under varying lighting and expressions [20].
- Future work may enhance facial fidelity through multi-angle images and achieve joint control over position and expression via the patch technique [22].
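To make the identity-position binding in Group 2 concrete, here is an illustrative sketch (not the paper's code): each person's identity embedding is stamped into its own local patch of a spatial conditioning map at that person's target location, so the diffusion model receives identity and position as one spatially separated signal. All names and shapes here are assumptions.

```python
# Illustrative sketch of the ID-Patch idea: render each identity embedding
# as a local patch at its target location in a conditioning map. Schematic
# reconstruction only; shapes and names are assumed.
import torch

def build_id_patch_map(id_embs, centers, H=64, W=64, patch=8):
    """id_embs: (N, C), one embedding per person; centers: list of (x, y)
    positions on the HxW latent grid. Returns a (C, H, W) conditioning map."""
    C = id_embs.shape[1]
    cond = torch.zeros(C, H, W)
    for emb, (cx, cy) in zip(id_embs, centers):
        x0, y0 = max(cx - patch // 2, 0), max(cy - patch // 2, 0)
        x1, y1 = min(x0 + patch, W), min(y0 + patch, H)
        # Stamp the embedding into its own spatial region only; disjoint
        # regions are what keep different people's features from mixing.
        cond[:, y0:y1, x0:x1] = emb.view(C, 1, 1)
    return cond
```

Because each embedding occupies a disjoint region, features of different people never share the same spatial slot, which is the intuition behind reduced identity leakage.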
StepFun open-sources the image-editing model Step1X-Edit; Alibaba's flagship AI app Quark releases an all-new "AI Camera" | AIGC Daily
创业邦· 2025-04-27 23:48
3. [Meta's Token-Shuffle debuts: autoregressive models break through a bottleneck, generating 2048×2048 images] Reports say Meta AI has introduced Token-Shuffle, aiming to solve the scaling difficulty that autoregressive (AR) models face when generating high-resolution images. AR models shine in language generation and have been widely explored for image synthesis in recent years, but they hit a bottleneck at high resolutions. Unlike text generation, which needs only a modest number of tokens, a high-resolution image often requires thousands of tokens, so compute cost explodes. As a result, many AR-based multimodal models can only handle low- to mid-resolution images, limiting their use in fine-grained image generation. Diffusion models perform strongly at high resolution, but their complex sampling process and slower inference have limitations of their own. (Sohu) A schematic sketch of the token-shuffle idea follows this digest.

4. [Adobe releases the Firefly Image Model 4 models: AI image generation upgraded again] Adobe published a blog post introducing Firefly Image Model 4 and Firefly Image Model 4 Ultra, two text-to-image AI models, and previewed, for Photoshop and Illustrator, the Crea ...
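The reported trick in item 3 is to shrink the token sequence before the transformer runs: merge each s×s window of visual tokens into one wider token, process the shorter sequence autoregressively, then unshuffle afterwards, cutting sequence length by a factor of s². The sketch below is our schematic reconstruction of that pattern, not Meta's implementation.

```python
# Schematic token shuffle/unshuffle over a grid of visual tokens
# (a pixel-shuffle-like reshape; not Meta's actual code).
import torch

def token_shuffle(tokens: torch.Tensor, h: int, w: int, s: int = 2):
    """tokens: (B, h*w, C) token grid -> (B, h*w/s^2, C*s^2)."""
    B, _, C = tokens.shape
    x = tokens.view(B, h // s, s, w // s, s, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, (h // s) * (w // s), C * s * s)

def token_unshuffle(tokens: torch.Tensor, h: int, w: int, s: int = 2):
    """Inverse: (B, h*w/s^2, C*s^2) -> (B, h*w, C)."""
    B, _, CC = tokens.shape
    C = CC // (s * s)
    x = tokens.view(B, h // s, w // s, s, s, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, h * w, C)
```

With s = 2, a 2048×2048 image's token sequence shrinks fourfold before the AR transformer runs, which is the reported source of the scaling headroom.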
ICLR 2025 | 20× speedup with no extra training: Tsinghua's Zhu Jun group proposes DBIM, an inference algorithm for diffusion bridge models in image translation
机器之心· 2025-04-27 10:40
The paper has two co-first authors: Zheng Kaiwen, a third-year PhD student in Tsinghua University's Department of Computer Science, and He Guande, a first-year PhD student at UT Austin.

Diffusion models have made breakthrough progress on generative tasks in recent years, delivering excellent results in image generation, video synthesis, and speech synthesis, and driving technical advances in text-to-image and video generation. However, the standard diffusion model is designed for generating data from random noise; it is ill-suited to tasks such as image translation or image restoration, where the mapping between a given input and output is explicit.

To address this, a variant called Denoising Diffusion Bridge Models (DDBMs) emerged. DDBMs model the bridge process between two given distributions and therefore apply well to image translation and restoration. However, they rely mathematically on complex ordinary/stochastic differential equations, and generating high-resolution images typically requires hundreds of iteration steps, so inference is computationally inefficient, severely limiting practical adoption.

Compared with standard diffusion models, inference in diffusion bridge models additionally involves linear combinations tied to the initial condition and a singularity at the starting point, so standard diffusion inference algorithms cannot be applied directly. To address this, Zhu Jun's team at Tsinghua proposed Diffusion Bridge Implicit Models (DBIM) ... A DDIM-style sketch of the analogous speed-up pattern for standard diffusion models follows.
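The name suggests the same relationship DDIM has to DDPM: replace the long stochastic chain with a short deterministic sampler over a subsequence of timesteps. The sketch below shows that pattern for a standard diffusion model only; DBIM's actual update additionally handles the bridge's endpoint conditioning and starting-point singularity, which we omit. `eps_model` and the schedule are assumptions.

```python
# DDIM-style deterministic sampler over a timestep subsequence (a generic
# sketch of the speed-up pattern, not the DBIM update itself).
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bars, num_steps=50):
    T = len(alpha_bars)
    ts = torch.linspace(T - 1, 0, num_steps).long()  # ~50 steps, not ~1000
    x = torch.randn(shape)
    for i in range(len(ts) - 1):
        t, t_prev = ts[i], ts[i + 1]
        ab_t, ab_p = alpha_bars[t], alpha_bars[t_prev]
        eps = eps_model(x, t)                               # predicted noise
        x0 = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()    # predicted clean x
        x = ab_p.sqrt() * x0 + (1 - ab_p).sqrt() * eps      # deterministic jump
    return x
```

Because each jump is deterministic and skips many intermediate timesteps, the step count drops by roughly an order of magnitude without retraining, which is the kind of training-free speedup the title claims for bridges.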
"Computer vision has been killed off by GPT-4o" (doge)
量子位· 2025-03-29 07:46
Core Viewpoint - The article discusses the advances in computer vision (CV) and image generation brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2].

Group 1: Technological Advancements
- GPT-4o introduces native multimodal image generation, expanding AI tools beyond their traditional applications [2][12].
- GPT-4o's image generation is based on an autoregressive model, unlike the diffusion model used in DALL·E, which allows better instruction following and stronger image editing [15][19].
- Observations suggest the generation may use a multi-scale autoregressive scheme: a rough image appears first, then details are filled in while the coarse shape continues to evolve (a purely illustrative sketch of this pattern follows this summary) [17][19].

Group 2: Industry Impact
- These capabilities have raised concerns among designers and computer vision researchers, signaling a significant shift in the competitive landscape of AI tools [6][10].
- OpenAI's approach of scaling foundation models to reach these capabilities surprised many in the industry, suggesting a new trend in AI development [12][19].
- GPT-4o's potential to enhance autonomous driving applications has also been noted, with implications for future developments in that sector [10].

Group 3: Community Engagement
- The article invites community members to share their experiences and inventive uses of GPT-4o, fostering collaborative exploration of AI applications [26].
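OpenAI has not disclosed GPT-4o's image decoder, so the following is purely illustrative: a next-scale autoregressive loop (the VAR-style pattern the observed coarse-to-fine behavior resembles), in which each resolution's token map is predicted conditioned on upsampled earlier scales. Every name here, including `model.predict_scale`, is assumed.

```python
# Purely illustrative next-scale autoregressive generation; not GPT-4o's
# actual decoder. `model.predict_scale` is a hypothetical interface.
import torch

@torch.no_grad()
def generate_multiscale(model, scales=(4, 8, 16, 32)):
    """Predict token maps coarse-to-fine; each scale conditions on the
    accumulated, upsampled predictions of all earlier scales."""
    context, maps = None, []
    for s in scales:
        tok = model.predict_scale(context, size=s)  # (1, C, s, s) token map
        maps.append(tok)
        up = torch.nn.functional.interpolate(tok, size=(scales[-1],) * 2)
        context = up if context is None else context + up  # accumulate detail
    return maps
```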
Event registration: we've brought together the authors of LCM, InstantID, and AnimateDiff for a sharing session
42章经· 2024-05-26 14:35
LCM, InstantID, and AnimateDiff are three pieces of research of enormous global significance and influence; over the past year they have arguably delivered some of the biggest breakthroughs and most deployable results in text-to-image and text-to-video, and a great many founders are using their outputs in practice. This time, we have brought the authors of all three works together for the first time, and we have also invited the well-known AI product manager Hidecloud to host the panel; we look forward to discussing the latest research and applications in text-to-image and text-to-video with dozens of AI founders.

42章经 AI private board event: text-to-image and text-to-video, from research to application

Speakers
· 骆思勉 - Master's student at Tsinghua's Institute for Interdisciplinary Information Sciences; research focuses on multimodal generation, diffusion models, and consistency models; representative works include LCM, LCM-LoRA, and Diff-Foley
· 王浩帆 - Master's graduate of CMU and member of the InstantX team; research focuses on consistency generation; representative works include InstantStyle, InstantID, and Score-CAM
· 杨策元 - PhD graduate of the Chinese University of Hong Kong; research focuses on video generation

Time: Beijing time 6/01, 13:00-14:00 (Saturday); US West time 5/31, 22:00-23:00 (Friday)
Format: online (the meeting link will be sent one-on-one) ...