Generative Models
Kuaishou-W (01024): Kling iterations expected to drive user growth; the "One" series of models continues to boost the core business
2026-01-05 | Report rationale: reiterating the existing investment rating

| Market data: | 2026-01-02 |
| --- | --- |
| Closing price (HKD) | 66.25 |
| Hang Seng China Enterprises Index | 9,168.99 |
| 52-week high/low (HKD) | 92.60 / 38.15 |
| H-share market cap (HKD 100mn) | 2,861.66 |
| Outstanding H shares (mn) | 4,319.48 |
| Exchange rate (HKD/RMB) | 0.9032 |

[Chart: year-to-date share price vs. benchmark index. Source: Bloomberg]

Related research | Securities analysts: Lin Qixian A0230519060002 (linqx@swsresearch.com), Ren Mengni A0230521100005 (renmn@swsresearch.com) | Contact: Ren Mengni A0230521100005 (renmn@swsresearch.com)

Key investment points:

Financial data and earnings forecast
| | 2023A | 2024A | 2025E | 2026E | 20 ...
Just put together a world-model learning roadmap, aimed at beginners...
自动驾驶之心· 2025-12-25 03:24
Core Viewpoint
- The article distinguishes world models from end-to-end models in autonomous driving, clarifying that "world model" names a category of models with certain capabilities rather than a specific technology. It highlights the industry trend of using world models for closed-loop simulation to address the high cost of corner cases in autonomous driving [2].

Course Overview
- The course on world models in autonomous driving is structured into six chapters, covering the introduction, background knowledge, general world models, video-generation-based models, OCC-based models, and job-related insights for the industry [5][6][7][8][9].

Chapter Summaries
- **Chapter 1: Introduction to World Models.** Outlines the relationship between world models and end-to-end autonomous driving, the development history and current applications of world models, and the main streams: pure simulation, simulation plus planning, and generating sensor inputs [5].
- **Chapter 2: Background Knowledge.** Covers foundational material for world models, including scene representation, Transformer technology, and BEV perception, which are crucial for the later chapters [6].
- **Chapter 3: General World Models.** Focuses on popular general world models such as Marble from Fei-Fei Li's team and Genie 3 from DeepMind, discussing their core technologies and design philosophies [7].
- **Chapter 4: Video-Generation-Based World Models.** Delves into video generation algorithms, starting from GAIA-1 & GAIA-2 and extending to recent works such as UniScene and OpenDWM, covering both classic and cutting-edge advances in this area [8].
- **Chapter 5: OCC-Based World Models.** Concentrates on OCC generation algorithms, discussing three major papers and a practical project, and notes the potential for these methods to extend into vehicle trajectory planning [9].
- **Chapter 6: World Model Job Topics.** Shares practical insights from the instructor's experience, addressing industry applications, pain points, and interview preparation for world-model-related positions [9].

Learning Outcomes
- The course aims to provide a comprehensive understanding of world models in autonomous driving, equipping participants with knowledge comparable to roughly one year of experience as a world-model algorithm engineer [10].
A 56x speedup for generative policies: EfficientFlow, toward efficient embodied intelligence
具身智能之心· 2025-12-17 00:05
The co-first authors are Jianlei Chang (master's student) and Ruofeng Mei (PhD student) of Xi'an Jiaotong University; Wei Ke is an associate professor at Xi'an Jiaotong University. The corresponding author is Professor Xiangyu Xu of Xi'an Jiaotong University, whose research covers 3D vision, generative AI, and embodied intelligence (homepage: https://xuxy09.github.io/).

Generative models are becoming an important paradigm in robotics and embodied intelligence: they can generate complex, flexible action policies directly from high-dimensional visual observations and have performed impressively in manipulation and grasping tasks. On real systems, however, these methods still face two hard problems: training depends heavily on large-scale demonstration data, and inference requires many iterations, making action generation too slow for real-time control.

To attack this core bottleneck, the Xi'an Jiaotong University team proposes EfficientFlow, a new generative policy learning method. By deeply integrating equivariant modeling with efficient flow matching, it substantially improves data efficiency while sharply reducing the number of inference iterations, achieving SOTA performance on multiple robot manipulation benchmarks and speeding up inference by more than an order of magnitude. ...
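The paper is only excerpted above, but the speedup lever is easy to illustrate: a flow-matching policy predicts a velocity field, and inference is numerical integration of that field, so cutting the number of integration steps cuts latency roughly proportionally. A minimal sketch under that assumption (the network and all dimensions are hypothetical; EfficientFlow's equivariant architecture is not modeled here):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Hypothetical stand-in for a flow-matching policy network.

    Maps (current action sample x_t, time t, observation encoding) to a
    velocity; EfficientFlow's actual equivariant design is not shown.
    """
    def __init__(self, act_dim=7, obs_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + obs_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, x, t, obs):
        return self.net(torch.cat([x, t, obs], dim=-1))

@torch.no_grad()
def sample_action(model, obs, steps=4):
    """Euler integration of the learned velocity field from noise to action.

    Fewer steps means proportionally faster inference, which is the kind of
    iteration-count reduction the article highlights.
    """
    x = torch.randn(obs.shape[0], 7)           # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        x = x + dt * model(x, t, obs)          # x_{t+dt} = x_t + v(x_t, t) * dt
    return x

model = VelocityNet()
obs = torch.randn(2, 64)                       # stand-in for an observation encoding
actions = sample_action(model, obs, steps=4)   # 4 integration steps instead of dozens
```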
An Intuitive Understanding of the Flow Matching Generative Algorithm
自动驾驶之心· 2025-12-17 00:03
Author | Zhang Yuncong  Editor | 自动驾驶之心  Original link: https://zhuanlan.zhihu.com/p/28731517852

Many articles on Flow Matching open with a pile of concepts and formulas that make your head spin, yet the algorithm is really not that complicated, and the code is easy to follow. This article explains flow matching without deriving formulas or invoking advanced mathematics, and finishes with a simple hands-on code example.

Algorithm principle

Related works

Flow Matching is a generative model. The simplest generative model has one goal: with no input at all, generate samples that resemble those in a given target set. For example, a diffusion model can generate images directly, with no prompt. Prompt-conditioned generation can be built straightforwardly on top of unconditioned generation, so for now we consider only the unconditioned case. Since what we learn is generally a mapping, and mapping an empty input to many different samples does not fit the definition of a mapping, in practice we generate a batch of random values to serve as the input, ...
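The post's code is truncated above. As a stand-in illustration of the procedure it describes (pair a random source point with a target sample, regress the slope of the line between them, then follow that slope at inference), here is a minimal self-contained sketch; it is not the author's original code, and the toy ring-shaped dataset and all hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

# Toy target set: 2-D points scattered around a unit ring.
angles = torch.rand(4096) * 2 * torch.pi
targets = torch.stack([angles.cos(), angles.sin()], dim=1) + 0.05 * torch.randn(4096, 2)

# Network that predicts the "slope" (velocity) at position x and time t.
model = nn.Sequential(nn.Linear(3, 128), nn.SiLU(),
                      nn.Linear(128, 128), nn.SiLU(),
                      nn.Linear(128, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x1 = targets[torch.randint(0, 4096, (256,))]   # target samples
    x0 = torch.randn(256, 2)                       # random source points
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                     # point on the connecting line
    v_target = x1 - x0                             # slope of that line
    loss = ((model(torch.cat([xt, t], dim=1)) - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: start from noise and follow the learned slope in small steps.
x = torch.randn(512, 2)
for i in range(50):
    t = torch.full((512, 1), i / 50)
    x = x + (1 / 50) * model(torch.cat([x, t], dim=1))
# x now approximates samples from the ring-shaped target set.
```

The design choice mirrors the article's framing: training never simulates a full trajectory; it only supervises the local slope at randomly sampled points along source-target lines, which is what keeps flow matching training so simple.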
Li Auto's Lang Xianpeng shares a long post on why his views on VLA differ from Unitree's Wang Xingxing
理想TOP2· 2025-12-10 06:50
Core Insights
- The key to successful autonomous driving lies in integrating the VLA model with the entire embodied intelligence system, with data playing the decisive role in effectiveness [1][4].

Summary by Sections

VLA Model
- The VLA is fundamentally a generative model, applying a GPT-like approach to autonomous driving: it generates trajectories and control signals instead of text (a schematic sketch follows this summary). User feedback indicates that VLA exhibits emergent behaviors in certain scenarios, reflecting a growing understanding of the physical world [2].
- The world model is better suited to building "test environments" than to acting as the "test subject," due to its heavy computational demands. Li Auto currently relies on cloud-based data generation and realistic simulation testing, using several exaFLOPS of compute for simulation tests, which even the most powerful in-vehicle chips cannot match [2].
- Debates about model architecture matter less than actual performance outcomes. In autonomous driving, the focus should be on vast amounts of real data; Li Auto's commitment to VLA is backed by a data loop built from millions of vehicles, enabling near-human driving with current computational resources [2].

Embodied Intelligence
- To excel at autonomous driving, it must be treated as a complete embodied intelligence system whose components are developed together to maximize value. Human drivers do not need extraordinary abilities; coordination among the parts is what matters [3].
- The embodied intelligence system comprises perception (eyes), models (brain), operating systems (nervous system), chips (heart), and the body (vehicle). Full-stack self-research is necessary, spanning both software and hardware. Li Auto's autonomous driving team collaborates with the foundation model, chip, and chassis teams to build a comprehensive autonomous driving system [3].

Data Utilization
- The key to effective modeling is compatibility with the whole embodied intelligence system, with data as the decisive factor. While data acquisition is hard in robotics, it is not a major obstacle for autonomous driving companies with established data loops: Li Auto can mine and filter over 1 billion kilometers of accumulated data and continuously gathers new data from 1.5 million vehicle owners [4].
- Data filtering surfaced interesting patterns: nearly 40% of human driving data shows a tendency to hug one side of the lane and not strictly obey speed limits. Because this matches typical human driving behavior, these samples were deliberately not removed. The VLA model is expected to serve both current vehicles and future automotive forms of embodied robots [4].
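The "GPT-like approach generating trajectories instead of text" can be pictured as autoregressive decoding over discretized trajectory tokens. The sketch below is purely illustrative; Li Auto's actual VLA architecture, tokenization, and conditioning are not described in the post, so every name and dimension here is an assumption:

```python
import torch
import torch.nn as nn

class TinyTrajectoryDecoder(nn.Module):
    """GPT-style decoder sketch: each step predicts the next trajectory
    token (a discretized waypoint/control bin) given the tokens emitted
    so far. Illustrative only; no real VLA conditioning is modeled."""
    def __init__(self, vocab=1024, dim=256, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # tokens: (B, T)
        x = self.tok(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.backbone(x, mask=mask)) # (B, T, vocab)

@torch.no_grad()
def decode_trajectory(model, start_token, steps=16):
    seq = torch.tensor([[start_token]])
    for _ in range(steps):
        logits = model(seq)[:, -1]
        nxt = logits.argmax(dim=-1, keepdim=True)     # greedy next waypoint token
        seq = torch.cat([seq, nxt], dim=1)
    return seq                                        # de-tokenize to waypoints downstream

model = TinyTrajectoryDecoder()
traj_tokens = decode_trajectory(model, start_token=0)
print(traj_tokens.shape)   # (1, 17): start token plus 16 generated waypoint tokens
```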
Yann LeCun takes a different path, founding a new AI company in Europe: Silicon Valley is not the soil for AGI
36Kr· 2025-12-05 00:04
Core Insights
- Yann LeCun, the outgoing Chief AI Scientist at Meta, plans to establish a new startup in Europe pursuing a different AI path from the generative models dominated by tech giants such as OpenAI and Google [1][2].
- The new company, Advanced Machine Intelligence (AMI), aims to develop systems that understand the physical world rather than just generating text, targeting a significant revolution in AI capabilities [2][3].

Group 1
- LeCun announced his departure from Meta to focus on creating his own company, emphasizing the need for AI development outside Silicon Valley [1][2].
- The startup will be a "global entity" with multiple research bases worldwide, particularly in Europe, to harness local talent [2].
- LeCun criticized current text-based language models for lacking essential capabilities, arguing they cannot perform tasks comparable to those a five-year-old child can handle [2].

Group 2
- AMI's goal is to enable systems to understand the physical world, possess long-term memory, reason, and plan complex actions [2].
- The new company will adopt a "non-generative" AI architecture to perceive environments and understand the physical world, opening up new application possibilities [2].
- Meta will collaborate with AMI and provide access to its innovative technologies, but will not invest in the startup [2][3].
An Intuitive Understanding of the Flow Matching Generative Algorithm
自动驾驶之心· 2025-11-28 00:49
Algorithm Overview
- Flow Matching is a generative model that aims to generate samples similar to a given target set without any input [3][4].
- The model learns a direction of movement from a source point to a target point, generating new samples by iteratively moving the current position toward the target [14][17].

Training and Inference
- During training, the model samples points along the line connecting a source and a target point, learning the average slope over many such connections [16][17].
- At inference, the model starts from a noise point and moves toward the target, gradually collapsing to a specific state as it approaches [17][18].

Code Implementation
- The implementation generates random inputs, predicts the slope with a neural network, and minimizes the loss between predicted and target slopes [18][19].
- The code exposes hyperparameters for dimensions, sample sizes, and training epochs, demonstrating a straightforward implementation of Flow Matching [19][25].

Advanced Applications
- The model can be adapted to generate samples from prompts, enabling more controlled generation by segmenting the target distribution (a minimal conditional sketch follows this summary) [24][29].
- A more complex example generates handwritten digits from the MNIST dataset, showcasing the model's versatility across data types [30][32].

Model Architecture
- The architecture uses a UNet backbone to predict the velocity field, improving performance through multi-scale feature fusion [32][34].
- Conditional inputs refine the generation process so that outputs align with the specified conditions [34][35].

Training Process
- The training loop generates dynamic noise, computes the loss between predicted and actual images, and updates model parameters accordingly [40][41].
- Generated samples are visualized periodically, giving insight into performance and output quality [40][41].
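As a companion to the unconditional sketch earlier in this digest, the prompt-conditioned variant described under "Advanced Applications" can be illustrated as follows. This is not the article's implementation (which uses a UNet backbone on MNIST); the toy clustered dataset, dimensions, and class embedding are all assumptions:

```python
import torch
import torch.nn as nn

class CondVelocity(nn.Module):
    """Minimal class-conditional velocity predictor (illustrative only;
    the article's MNIST example uses a UNet backbone instead)."""
    def __init__(self, dim=2, n_classes=10, emb=16):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + emb, 128), nn.SiLU(),
            nn.Linear(128, dim),
        )

    def forward(self, x, t, y):
        return self.net(torch.cat([x, t, self.embed(y)], dim=-1))

model = CondVelocity()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy conditional targets: class k clusters around the point (k, k).
def batch(n=256):
    y = torch.randint(0, 10, (n,))
    x1 = y[:, None].float().repeat(1, 2) + 0.1 * torch.randn(n, 2)
    return x1, y

for step in range(2000):
    x1, y = batch()
    x0 = torch.randn_like(x1)                 # source noise
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                # point on the connecting line
    loss = ((model(xt, t, y) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling with the condition fixed to "class 3": integrate the field.
x = torch.randn(64, 2)
y = torch.full((64,), 3, dtype=torch.long)
for i in range(50):
    t = torch.full((64, 1), i / 50)
    x = x + (1 / 50) * model(x, t, y)
# x should now cluster near (3, 3), the region of the conditioned class.
```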
Performance gains with no retraining! HKU team proposes the GPC framework for robot "policy composition"
机器之心· 2025-10-19 09:17
Core Viewpoint
- The article introduces the General Policy Composition (GPC) framework, a training-free approach that enhances robot control performance by dynamically combining multiple pre-trained models at test time, overcoming the limitations of traditional training methods [2][5][7].

Summary by Sections

Improving Policy Performance
- GPC represents a paradigm shift: it improves policy performance not through additional training but by composing existing strategies [6][15].

Innovative Theoretical Foundation
- The framework rests on two key theoretical findings (illustrated in the toy sketch after this summary):
  1. Functional-level improvement: convex combinations of the decision scores of multiple pre-trained policies can yield a combined score more accurate than any single policy's [9].
  2. System-level stability: improvements in single-step errors propagate through the entire trajectory, lifting overall performance [10].

General "Policy Composer"
- GPC's core advantage is its plug-and-play nature, allowing seamless integration of diverse robot policies without retraining [14][15].

Heterogeneous Strategy Flexibility
- GPC can flexibly combine policies across architectures and modalities, balancing information from different conditions to produce stable, coherent action trajectories [17][19].

Weight Search for Optimal Strategy
- GPC's weight-search mechanism customizes the optimal weight configuration per task, underscoring how much the weight distribution determines the effectiveness of the composition [22][23].

Experimental Validation
- GPC demonstrated superior performance in both simulation and real-world environments, improving success rates over single baselines by up to 7.55% on simulation tasks and 5-10% on real-world tasks [28][30].

Key Findings from Experiments
- Three core findings highlight GPC's behavior:
  1. Combining strategies of moderate accuracy can yield higher accuracy than any individual strategy [29].
  2. A weak strategy in the mix can drag down overall performance, so contributing strategies must be chosen carefully [29].
  3. Performance is maximized when stronger strategies receive greater weight in the combination [29].
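The convex score combination and per-task weight search described above can be sketched in a few lines. This is a toy illustration of the stated idea, not the GPC implementation; the example scores, evaluation function, and search resolution are all assumptions:

```python
import numpy as np

def composed_score(scores, weights):
    """Convex combination of per-policy decision scores (the operation the
    article's functional-level result concerns)."""
    assert abs(sum(weights) - 1.0) < 1e-8 and all(w >= 0 for w in weights)
    return sum(w * s for s, w in zip(scores, weights))

def search_weights(score_sets, evaluate, resolution=10):
    """Toy stand-in for GPC's per-task weight search over two policies:
    sweep the convex weights and keep the best-scoring configuration."""
    best = (None, -np.inf)
    for w1 in np.linspace(0.0, 1.0, resolution + 1):
        weights = (w1, 1.0 - w1)
        val = evaluate(composed_score(score_sets, weights))
        if val > best[1]:
            best = (weights, val)
    return best

# Toy demo: two "policies" score five candidate actions; suppose the task
# rewards picking action 2, so evaluation is just that action's score.
s_a = np.array([0.1, 0.9, 0.3, 0.2, 0.4])
s_b = np.array([0.2, 0.4, 0.8, 0.1, 0.3])
weights, value = search_weights([s_a, s_b], evaluate=lambda s: s[2])
print(weights, value)   # favors policy B, which prefers action 2
```

Note how the demo reflects the article's third experimental finding: the search ends up assigning more weight to the policy that is stronger on the task at hand.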
Insta360's latest panoramic survey: challenges, methods, and the future of panoramic vision
机器之心· 2025-10-04 03:38
Core Insights
- The article discusses the transition from perspective vision to panoramic vision, framing the "perspective-panorama gap" as the central lens for understanding the field's challenges and opportunities [6][19].
- It argues for a systematic upgrade across data, models, and applications to make panoramic vision technologies genuinely usable [16][19].

Research Background and Motivation
- The paper, "One Flight Over the Gap: A Survey from Perspective to Panoramic Vision," systematically analyzes the differences between perspective and panoramic vision, covering over 300 papers and 20 representative tasks [4][19].
- The challenges of panoramic vision are grouped into three main gaps: geometric distortion, non-uniform sampling, and boundary continuity [6][9].

Strategies Overview
- The adaptation strategies target the three gaps (see the snippet after this summary for a concrete view of the sampling gap):
  1. Geometric distortion: projecting spherical images onto a plane distorts shapes [7].
  2. Non-uniform sampling: pixel density varies sharply across regions, affecting effective resolution [7].
  3. Boundary continuity: cutting the sphere open into a 2D image breaks continuity across the seam, hampering learning [7].
- The article provides a cross-method comparison to clarify which strategies fit which tasks [9][15].

Task Toolbox
- Over 20 tasks are organized into four areas: enhancement and assessment, understanding, multi-modal, and generation, each with representative methods and key papers [12][15].
- New paradigms such as diffusion and generative models are emerging rapidly, particularly for text-to-image/video and novel view synthesis [15].

Future Directions
- Moving from "usable" to "user-friendly" requires progress on data, model paradigms, and downstream applications [16][21]. Key challenges include:
  1. Data bottlenecks: the lack of large-scale, diverse, high-quality 360° datasets limits general training and reproducible evaluation [21].
  2. Model paradigms: robust models are needed that adapt from perspective to panoramic vision while maintaining performance across tasks [21].
  3. Downstream applications: applications in spatial intelligence, XR, 3D reconstruction, and various industry sectors require effective deployment and compliance [21][22].
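To make the non-uniform sampling gap concrete: in an equirectangular panorama, every image row holds the same number of pixels but covers a shrinking band of the sphere toward the poles, so per-pixel solid angle falls off with the cosine of latitude. A short illustrative snippet (not from the survey) computes the standard cos-latitude weights of the kind used in WS-PSNR-style weighted metrics:

```python
import numpy as np

def equirect_pixel_weights(height, width):
    """Solid-angle weight of each pixel in an equirectangular panorama.

    Rows near the poles pack the same number of pixels into a much smaller
    sphere area; weighting by cos(latitude) compensates for this
    non-uniform sampling in losses and evaluation metrics.
    """
    v = (np.arange(height) + 0.5) / height   # row position in [0, 1]
    lat = (v - 0.5) * np.pi                  # latitude in [-pi/2, pi/2]
    w = np.cos(lat)                          # per-row weight
    return np.repeat(w[:, None], width, axis=1)

weights = equirect_pixel_weights(512, 1024)
print(weights[0, 0], weights[256, 0])  # near-pole row ~0, equator row ~1
```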
Reconstructing 3D space from just two images? Tsinghua & NTU use generative models to unlock a new paradigm for spatial intelligence
量子位· 2025-07-09 01:18
Core Viewpoint
- LangScene-X introduces a generative framework for constructing generalized 3D language-embedded scenes from only sparse views, drastically reducing the number of required input images compared with traditional methods such as NeRF, which typically need more than 20 views [2][5].

Group 1: Challenges in 3D Language Scene Generation
- Current 3D language scene generation faces three core challenges. First, dense-view dependency clashes with sparse inputs: with only 2-3 images, existing methods produce severe 3D structure artifacts and semantic distortion [5].
- Second, cross-modal information is disconnected and 3D consistency is lacking: existing models process appearance, geometry, and semantics independently, causing semantic misalignment [6].
- Third, high-dimensional compression of language features and limited generalization hinder practical applications; existing methods lose significant accuracy when switching scenes [7].

Group 2: Solutions Offered by LangScene-X
- LangScene-X employs the TriMap video diffusion model for unified multimodal generation under sparse inputs, markedly improving RGB and normal consistency errors and semantic mask boundary accuracy [8].
- The Language Quantization Compressor (LQC) rethinks high-dimensional feature compression, mapping high-dimensional CLIP features to 3D discrete indices with minimal reconstruction error and better cross-scene transferability (a generic quantization sketch follows this summary) [9][10].
- A progressive training strategy ensures seamless generation of RGB images, normal maps, and semantic segmentation maps, improving the efficiency of the 3D reconstruction process [14].

Group 3: Spatial Intelligence and Performance Metrics
- LangScene-X enhances spatial intelligence by accurately aligning text prompts with 3D scene surfaces, enabling natural-language queries to locate objects in 3D environments [15].
- Empirically, LangScene-X achieves an overall mean accuracy (mAcc) of 80.85% and a mean intersection over union (mIoU) of 50.52% on the LERF-OVS dataset, significantly outperforming existing methods [16].
- These capabilities position the model as a potential core driver for VR scene construction, human-computer interaction, and foundational technologies for autonomous driving and embodied intelligence [18].
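The LQC component is described as mapping high-dimensional CLIP features to compact discrete indices with minimal reconstruction error. As a rough illustration of that kind of language-feature quantization, here is a generic vector-quantizer sketch; it is not LangScene-X's actual LQC, and the codebook size, dimensions, and straight-through training trick are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ layer: encode a feature as its nearest codebook entry.

    Illustrates the compression idea behind LQC (a high-dimensional
    language feature becomes a small discrete index); the real module's
    architecture, losses, and codebook size are not reproduced here.
    """
    def __init__(self, codebook_size=512, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, feats):                         # feats: (N, dim)
        d = torch.cdist(feats, self.codebook.weight)  # (N, codebook_size)
        idx = d.argmin(dim=1)                         # discrete index per feature
        quantized = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        quantized = feats + (quantized - feats).detach()
        commit_loss = F.mse_loss(quantized.detach(), feats)
        return quantized, idx, commit_loss

vq = VectorQuantizer()
clip_feats = torch.randn(8, 512)                      # stand-in for CLIP features
q, idx, loss = vq(clip_feats)
print(idx.shape, q.shape)  # 8 integer indices stand in for 8 512-D vectors
```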