Flux

ICML 2025 | Latest Advances in Multimodal Understanding and Generation: HKUST and Snap Research Release ThinkDiff, Giving Diffusion Models a Brain
机器之心· 2025-07-16 04:21
Core Viewpoint
- The article discusses the introduction of ThinkDiff, a new method for multimodal understanding and generation that enables diffusion models to perform reasoning and creative tasks with minimal training data and computational resources [3][36].

Group 1: Introduction to ThinkDiff
- ThinkDiff is a collaborative effort between Hong Kong University of Science and Technology and Snap Research, aimed at enhancing diffusion models' reasoning capabilities with limited data [3].
- The method allows diffusion models to understand the logical relationships between images and text prompts, leading to high-quality image generation [7].

Group 2: Algorithm Design
- ThinkDiff transfers the reasoning capabilities of large vision-language models (VLMs) to diffusion models, combining the strengths of both for improved multimodal understanding [7].
- The architecture aligns VLM-generated tokens with the diffusion model's decoder, enabling the diffusion model to inherit the VLM's reasoning abilities [15].

Group 3: Training Process
- The training process includes a vision-language pretraining task that aligns the VLM with the LLM decoder, facilitating the transfer of multimodal reasoning capabilities [11][12].
- A masking strategy is employed during training so that the alignment network learns to recover semantics from incomplete multimodal information (a minimal sketch of this alignment-and-masking idea follows this summary) [15].

Group 4: Variants of ThinkDiff
- ThinkDiff has two variants: ThinkDiff-LVLM, which aligns large-scale VLMs with diffusion models, and ThinkDiff-CLIP, which aligns CLIP with diffusion models for enhanced text-image combination capabilities [16].

Group 5: Experimental Results
- ThinkDiff-LVLM significantly outperforms existing methods on the CoBSAT benchmark, demonstrating high accuracy and quality in multimodal understanding and generation [18].
- The training efficiency of ThinkDiff-LVLM is notable, achieving optimal results with only 5 hours of training on 4 A100 GPUs, whereas other methods require significantly more resources [20][21].

Group 6: Comparison with Other Models
- ThinkDiff-LVLM exhibits capabilities comparable to commercial models like Gemini in everyday image reasoning and generation tasks [25].
- The method also shows potential in multimodal video generation by adapting the diffusion decoder to generate high-quality videos based on input images and text [34].

Group 7: Conclusion
- ThinkDiff represents a significant advancement in multimodal understanding and generation, providing a unified model that excels in both quantitative and qualitative assessments and contributes to both research and industrial applications [36].
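To make the alignment idea concrete, here is a minimal PyTorch sketch of an alignment network that projects frozen-VLM tokens into a decoder's feature space and randomly masks tokens during training. The module structure, dimensions, masking scheme, and loss are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the alignment idea, assuming illustrative dimensions,
# module names, and loss; this is not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentNetwork(nn.Module):
    """Projects tokens produced by a frozen VLM into the feature space
    expected by the shared (LLM / diffusion) decoder."""

    def __init__(self, vlm_dim: int = 4096, decoder_dim: int = 4096, mask_ratio: float = 0.3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, vlm_dim))

    def forward(self, vlm_tokens: torch.Tensor, training: bool = True) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) features emitted by the frozen VLM.
        if training and self.mask_ratio > 0:
            # Randomly replace a fraction of tokens so the network must recover
            # semantics from incomplete multimodal information.
            keep = torch.rand(vlm_tokens.shape[:2], device=vlm_tokens.device) >= self.mask_ratio
            vlm_tokens = torch.where(keep.unsqueeze(-1), vlm_tokens, self.mask_token)
        return self.proj(vlm_tokens)


def alignment_loss(aligned: torch.Tensor, decoder_targets: torch.Tensor) -> torch.Tensor:
    # Pull projected VLM tokens toward decoder-side features
    # (a simple MSE stand-in; the paper's training objective may differ).
    return F.mse_loss(aligned, decoder_targets)


if __name__ == "__main__":
    net = AlignmentNetwork()
    fake_vlm_tokens = torch.randn(2, 77, 4096)  # stand-in for VLM output tokens
    fake_targets = torch.randn(2, 77, 4096)     # stand-in for decoder-side features
    loss = alignment_loss(net(fake_vlm_tokens), fake_targets)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

Once trained, such a projector would let the diffusion decoder consume VLM reasoning tokens directly, which is the spirit of the transfer described above.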
【七彩虹教育】What Is the Most Useful AI? Voice Assistants? Large Language Models? Text-to-Image?
Sou Hu Cai Jing· 2025-07-15 13:37
Group 1
- Recent years have seen a small explosion in artificial intelligence, with various tools for voice recognition, meeting summaries, and interactive text models emerging, as well as image generation technologies like Midjourney and Stable Diffusion [1]
- There is a growing sentiment that these AI tools may not be as user-friendly as initially thought, which can be analyzed through the basic unit of "information" [3]

Group 2
- In terms of voice, humans can understand speech at a rate of approximately 150 to 200 words per minute, equating to about 1600 bits of information per minute [4]
- For images, a person can theoretically process about 189 MB of image information per minute, assuming one image of 1024x1024 pixels is understood per second [6]
- The average reading speed for text is estimated at 250 to 300 words per minute, resulting in an information flow of about 10,000 bits per minute [8][9]

Group 3
- Overall, the information transmission capacity is ranked as follows: voice carries the least at 1600 bits per minute, text is in the middle at 10,000 bits per minute, and images have the highest capacity at 189 MB per minute (the snippet after this summary reproduces these back-of-envelope numbers) [11]
- AI applications in voice recognition and generation have reached or exceeded human levels, with tools like CosyVoice and SenseVoice performing well [11]
- Text-based AI models, particularly after the advent of ChatGPT, are also approaching human-level performance, with models like Qwen2 achieving top-tier status [11]
- However, image generation and recognition still lag behind, primarily due to the significantly higher information content in images compared to voice and text [11]
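As a quick sanity check on these figures, the following Python back-of-envelope reproduces them. The bits-per-word constants are assumptions chosen to match the quoted numbers, and the image estimate assumes uncompressed 1024x1024 RGB frames at one per second.

```python
# Back-of-envelope reproduction of the article's information-rate estimates.
# The bits-per-word figures are assumptions chosen to match the quoted numbers,
# not measurements.

SPEECH_WORDS_PER_MIN = 200          # upper end of the quoted 150-200 wpm
BITS_PER_SPOKEN_WORD = 8            # assumed, so 200 wpm -> ~1600 bits/min
speech_bits_per_min = SPEECH_WORDS_PER_MIN * BITS_PER_SPOKEN_WORD

READING_WORDS_PER_MIN = 300         # upper end of the quoted 250-300 wpm
BITS_PER_READ_WORD = 33             # assumed, so ~300 wpm -> ~10,000 bits/min
text_bits_per_min = READING_WORDS_PER_MIN * BITS_PER_READ_WORD

# One 1024x1024 RGB image per second, 3 bytes per pixel, uncompressed.
IMAGE_BYTES = 1024 * 1024 * 3
image_mb_per_min = IMAGE_BYTES * 60 / 1e6

print(f"speech: ~{speech_bits_per_min} bits/min")   # ~1600
print(f"text:   ~{text_bits_per_min} bits/min")     # ~9900, i.e. roughly 10,000
print(f"images: ~{image_mb_per_min:.0f} MB/min")    # ~189 MB
```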
Tsinghua's SageAttention3: 5x Speedup with FP4 Quantization, and the First to Support 8-Bit Training
机器之心· 2025-06-18 09:34
Core Insights
- The article discusses advancements in attention mechanisms for large models, focusing on SageAttention3, which offers significant performance improvements over previous versions and competitors [1][2].

Group 1: Introduction and Background
- Optimizing attention speed has become crucial as sequence lengths in large models increase [7].
- Previous versions of SageAttention (V1, V2, V2++) achieved accelerations of 2.1x, 3x, and 3.9x respectively over FlashAttention [2][5].

Group 2: Technical Innovations
- SageAttention3 provides a 5x inference acceleration over FlashAttention, achieving 1040 TOPS on an RTX 5090 and outperforming even the more expensive H100 running FlashAttention3 by 1.65x [2][5].
- The introduction of trainable 8-bit attention (SageBwd) enables training acceleration while matching full-precision attention on various fine-tuning tasks [2][5].

Group 3: Methodology
- The research team employed Microscaling FP4 quantization, adopting the NVFP4 format, to improve quantization accuracy [15][16].
- A two-level quantization approach was proposed to address the narrow range of scaling factors for the P matrix, improving overall precision (a toy illustration of block-wise FP4 quantization follows this summary) [15][16].

Group 4: Experimental Results
- SageAttention3 demonstrated impressive performance across various models, maintaining end-to-end accuracy in video and image generation tasks [21][22].
- In specific tests, SageAttention3 achieved a 3x acceleration on HunyuanVideo, with significant reductions in processing time across multiple models [33][34].
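To give a feel for what microscaling quantization does, here is a toy NumPy sketch: values are snapped to an FP4 (E2M1) grid with one scale per 16-element block, and the block scales themselves receive a second, global scale. This is a simplified illustration of the two-level idea, not SageAttention3's actual CUDA kernel; all names are placeholders.

```python
# Toy block-wise "microscaling" FP4-style quantizer in NumPy. The 16-element
# block size and E2M1 value grid follow public FP4/NVFP4 descriptions; the
# two-level scaling below is a simplified illustration, not the real kernel.
import numpy as np

# Representable magnitudes of an FP4 E2M1 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_micro(x: np.ndarray, block: int = 16):
    """Quantize a 1-D array with one scale per 16-element block (level 1)
    and one global scale over the block scales themselves (level 2)."""
    x = x.reshape(-1, block)
    # Level 1: per-block scale so the block max maps to the largest FP4 value.
    block_scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    block_scale = np.where(block_scale == 0, 1.0, block_scale)
    # Level 2: normalize the block scales before they would themselves be
    # stored in a low-precision format (simplified here to a single max).
    global_scale = block_scale.max()
    stored_scale = block_scale / global_scale
    # Snap each scaled element to the nearest FP4 grid point (sign kept separately).
    scaled = x / block_scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, stored_scale, global_scale

def dequantize(q, stored_scale, global_scale):
    return q * stored_scale * global_scale

x = np.random.randn(64).astype(np.float32)
q, s, g = quantize_fp4_micro(x)
err = np.abs(dequantize(q, s, g).ravel() - x).mean()
print(f"mean abs quantization error: {err:.4f}")
```

Per-block scales keep quantization error local to each 16-element group, which is why microscaling formats tolerate such a coarse 4-bit grid.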
Hands-On Test of Lovart, the World's First Design Agent, Released Late at Night
数字生命卡兹克· 2025-05-12 19:08
Core Viewpoint
- The article discusses the emergence and potential of Lovart, an AI design agent tool, highlighting its capabilities and the future of design workflows in the industry [1][64].

Group 1: Product Overview
- Lovart is an AI design agent tool that gained significant attention, particularly in overseas markets, and operates on an invitation-only basis for its beta testing [2][6].
- The interface of Lovart resembles an AI chat platform, providing a user-friendly experience for design requests [7][8].
- The tool emphasizes the importance of industry-specific knowledge, suggesting that understanding design requirements and context is crucial for effective AI application [8].

Group 2: Functionality and Features
- Users can input specific design requests, and Lovart processes these by first matching the required style before executing the task [11][17].
- The tool utilizes a LoRA model for style matching, which is essential for achieving the desired design outcome [17].
- Lovart can break down design tasks into detailed prompts, ensuring clarity and precision in the execution of design requests (a hypothetical sketch of this style-matching and prompt-decomposition flow follows this summary) [19][23].

Group 3: Design Process and Output
- The article illustrates a practical example where Lovart generated a series of illustrations based on a detailed prompt, showcasing its efficiency and effectiveness [9][30].
- Lovart supports various design functionalities, including resizing images and separating text from backgrounds for easier editing [52][57].
- The tool can also generate video content based on design prompts, demonstrating its versatility in handling multimedia projects [58][61].

Group 4: Future Implications
- The author expresses optimism about the future of design workflows, suggesting that AI agents like Lovart could redefine the role of designers and the nature of design outputs [64].
- The potential for vertical agents in various industries is highlighted, indicating a trend towards specialized AI tools that cater to specific fields [64].
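For illustration only, here is a hypothetical Python sketch of the workflow the article describes: pick a style LoRA, decompose the brief into per-asset prompts, then hand each prompt to an image model. The function names and the commented-out generate_image call are placeholders, not Lovart's actual API.

```python
# Hypothetical sketch of a design-agent pipeline: style matching, brief
# decomposition, then per-asset generation. All names are placeholders.
from dataclasses import dataclass

@dataclass
class DesignTask:
    prompt: str          # detailed, self-contained text-to-image prompt
    width: int
    height: int

def match_style_lora(brief: str) -> str:
    # Placeholder: a real agent would let an LLM (or retrieval) pick among
    # available style LoRAs based on the brief.
    return "flat-illustration-lora" if "illustration" in brief.lower() else "photo-realistic-lora"

def decompose_brief(brief: str, n_assets: int = 4) -> list[DesignTask]:
    # Placeholder: a real agent would prompt an LLM to expand the brief into
    # precise, deliverable-level prompts (poster, banner, thumbnail, ...).
    return [DesignTask(f"{brief}, asset {i + 1} of {n_assets}, consistent style", 1024, 1024)
            for i in range(n_assets)]

def run_agent(brief: str) -> None:
    lora = match_style_lora(brief)
    for task in decompose_brief(brief):
        print(f"[{lora}] {task.width}x{task.height}: {task.prompt}")
        # generate_image(task.prompt, lora=lora, size=(task.width, task.height))  # hypothetical call

run_agent("A series of warm children's book illustrations about a fox learning to fly")
```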