Diffusion Models
The Whole Internet Melts Down: AI Flunks the "Finger Problem", and a Six-Fingered Hand Exposes a Fatal Flaw in the Transformer
36Kr · 2025-12-15 12:39
Netizens have lately been driven mad by the AI "finger problem": shown a hand with six fingers, AI simply cannot count them correctly. Come clean, AI, are you mocking humanity? Behind it all lies the Achilles' heel of the Transformer architecture ...

Over the past few days, a shadow has fallen over the internet: AI is mocking humans with finger counting. The task given to the AI is simple: label each finger in the image with a number, in order. The catch, of course, is that this hand actually has six fingers. Nano Banana Pro confidently labeled 1, 2, 3, 4, 5 on the hand, skipping one finger outright. This absurd scene once again stunned netizens.

Is the AI model really that dumb? Many people think not: perhaps the AI is just playing dumb and teasing humans. Quite possibly, it is mocking the inferior humans who try to test it. To pass the Turing test, an AI has to make itself a little stupid to look human; if it were too smart, humans would melt down.

GPT-5.2 failed too. Someone put the same question to GPT-5.2, with a prompt stating plainly that the image contains six fingers. Yet asked "how many fingers are in the image," GPT-5.2 still answered flatly: five! Its reasoning: humans have five fingers, so an image that does not show five fingers must be wrong. Others drew hands so misshapen that even humans would struggle ...
One Year On, DiffusionDrive Upgrades to v2 and Sets a New Record!
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article discusses the upgrade of DiffusionDrive to version 2, highlighting its advancements in end-to-end autonomous driving trajectory planning through the integration of reinforcement learning to address the challenges of diversity and sustained high quality in trajectory generation [1][3][10].

Background Review
- The shift towards end-to-end autonomous driving (E2E-AD) has emerged as traditional tasks like 3D object detection and motion prediction have matured. Early methods faced limitations in modeling, often generating single trajectories without alternatives in complex driving scenarios [5][10].
- Previous diffusion models applied to trajectory generation struggled with mode collapse, leading to a lack of diversity in generated behaviors. DiffusionDrive introduced a Gaussian Mixture Model (GMM) to define prior distributions for initial noise, promoting diverse behavior generation [5][13].

Methodology
- DiffusionDriveV2 introduces a novel framework that utilizes reinforcement learning to overcome the limitations of imitation learning, which previously led to a trade-off between diversity and sustained high quality in trajectory generation [10][12].
- The framework incorporates intra-anchor GRPO and inter-anchor truncated GRPO to manage advantage estimation within specific driving intentions, preventing mode collapse by avoiding inappropriate comparisons between different intentions [9][12][28].
- The method employs scale-adaptive multiplicative noise to enhance exploration while maintaining trajectory smoothness, addressing the inherent scale inconsistency between the proximal and distal segments of trajectories [24][39].

Experimental Results
- Evaluations on the NAVSIM v1 and NAVSIM v2 datasets demonstrated that DiffusionDriveV2 achieved state-of-the-art performance, with a PDMS score of 91.2 on NAVSIM v1 and 85.5 on NAVSIM v2, significantly outperforming previous models [10][33].
- The results indicate that DiffusionDriveV2 effectively balances trajectory diversity and sustained quality, achieving optimal performance in closed-loop evaluations [38][39].

Conclusion
- The article concludes that DiffusionDriveV2 successfully addresses the inherent challenges of imitation learning in trajectory generation, achieving an optimal trade-off between planning quality and diversity through innovative reinforcement learning techniques [47].
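The intra-anchor idea can be sketched in a few lines: GRPO-style advantages are computed only within each anchor's own group of sampled trajectories, so rollouts under different driving intentions are never ranked against one another. The function name, reward values, and anchor labels below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def intra_anchor_advantages(rewards_by_anchor):
    """Group-relative (GRPO-style) advantages: each sampled trajectory is
    compared only against other samples drawn from the SAME anchor (driving
    intention), so a cautious lane-keeping rollout is never penalized merely
    for scoring lower than a rollout under a different intention."""
    advantages = {}
    for anchor, rewards in rewards_by_anchor.items():
        r = np.asarray(rewards, dtype=float)
        # Normalize within the group: mean-zero, unit-ish scale.
        advantages[anchor] = (r - r.mean()) / (r.std() + 1e-8)
    return advantages

# Illustrative rewards for trajectories sampled under two driving intentions.
adv = intra_anchor_advantages({
    "keep_lane": [0.9, 0.8, 0.7],
    "turn_left": [0.3, 0.2, 0.1],
})
```

Because normalization happens per anchor, the best "turn_left" sample gets a positive advantage even though its raw reward is below every "keep_lane" sample, which is exactly the mode-collapse-avoiding comparison structure the summary describes.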
Learn Anytime! The End-to-End and VLA Autonomous Driving Small-Class Course Has Officially Concluded
自动驾驶之心· 2025-12-09 19:00
Core Viewpoint
- 2023 marks the year of end-to-end production, with 2024 expected to be a significant year for end-to-end production in the automotive industry, as leading new forces and manufacturers have already achieved end-to-end production [1][3].

Group 1: End-to-End Production Development
- The automotive industry has two main paradigms, single-stage and two-stage, with UniAD being a representative of the single-stage approach that directly models vehicle trajectories from sensor inputs [1].
- Since last year, single-stage end-to-end development has rapidly advanced, leading to various derivatives such as perception-based, world-model-based, diffusion-model-based, and VLA-based single-stage methods [3][5].
- Major players in the autonomous driving sector, including both solution providers and car manufacturers, are focusing on in-house research and production of end-to-end autonomous driving technologies [3].

Group 2: Course Overview
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, aimed at teaching cutting-edge algorithms in both single-stage and two-stage end-to-end approaches, with a focus on the latest developments in industry and academia [5][14].
- The course is structured into several chapters, starting with an introduction to end-to-end algorithms, followed by background knowledge on technologies such as VLA, diffusion models, and reinforcement learning [8][9].
- The second chapter is highlighted as containing the technical keywords most frequently asked about in job interviews over the next two years [9].

Group 3: Technical Focus Areas
- The course covers various subfields of single-stage end-to-end methods, including perception-based (UniAD), world-model-based, diffusion-model-based, and the currently popular VLA-based approaches [10][12].
- The curriculum includes practical assignments, such as RLHF fine-tuning, and aims to give students hands-on experience building and experimenting with pre-training and reinforcement learning modules [11][12].
- The course emphasizes the importance of understanding BEV perception, multi-modal large models, and the latest advancements in diffusion models, which are crucial for the future of autonomous driving [12][16].
Peking University: Principles and Industry Applications of AI Video Generation Technology, 2025
Sohu Finance · 2025-12-09 06:48
Group 1: AI Video Technology Overview
- AI video technology is a subset of narrow AI focused on generative tasks such as video generation, editing, and understanding, with typical methods including text-to-video and image-to-video [1]
- The evolution of the technology spans from the exploration of GANs before 2016 to the commercialization of diffusion models from 2020 to 2024, culminating in the release of Sora in 2024, marking the "AI Video Year" [1]

Group 2: Main Tools and Platforms
- Key platforms include OpenAI Sora, Kuaishou Keling AI, ByteDance Jimeng AI, Runway, and Pika, each offering unique features in terms of duration, quality, and style [2]

Group 3: Technical Principles and Architecture
- The mainstream paradigm is the diffusion model, which is stable in training and offers strong generation diversity, with architectures categorized into U-Net and DiT [3]
- Key components include the self-attention mechanism of Transformers for temporal consistency, a VAE for compression, and CLIP for semantic alignment between text and visuals [3]

Group 4: Data Value and Training
- The scale, quality, and diversity of training data determine a model's upper limits, with prominent datasets including WebVid-10M and UCF-101 [4]

Group 5: Technological Advancements and Breakthroughs
- Mainstream models can generate videos at 1080p/4K resolution and up to 2 minutes in length, with some models supporting native audio-visual synchronization [5]
- Remaining challenges include temporal consistency, physical logic, and emotional detail expression, alongside computational cost constraints [5]
- Evaluation frameworks like VBench and SuperCLUE have been established, focusing on "intrinsic authenticity" [5]

Group 6: Industry Applications and Value
- In the film and entertainment sector, AI is involved in the entire production process, leading to cost reductions and efficiency improvements [6]
- The short-video and marketing sectors utilize AI for rapid content generation, exemplified by Xiaomi's AI glasses advertisement [6]
- In the cultural tourism industry, AI is used for city promotional videos and immersive experiences [7]
- In education, AI facilitates the bulk generation of micro-course videos and personalized learning content [8]
- In news media, AI virtual anchors enable 24-hour reporting, though ethical challenges regarding content authenticity persist [9]

Group 7: Tool Selection Recommendations
- Recommendations include Runway or Keling AI for professional film, Jimeng AI or Pika for short-video operations, and Vidu for traditional Chinese content [10]
- Domestic tools like Keling and Jimeng have low barriers to entry, while overseas tools require a VPN and foreign-currency payment [11]
- A multi-tool collaborative workflow is advised, emphasizing a "director's mindset" rather than reliance on a single platform [12]

Group 8: Future Outlook
- The report concludes that AI video will evolve towards a "human-machine co-creation" model, becoming foundational infrastructure akin to the internet, with human value centered on creativity and judgment [13]
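Group 3 above names the diffusion model as the mainstream paradigm. The core mechanism can be illustrated with the standard forward-noising rule x_t = sqrt(abar_t)·x0 + sqrt(1 − abar_t)·eps; the NumPy sketch below uses a textbook DDPM-style schedule, and the array shapes and schedule constants are illustrative assumptions, not the internals of any platform named above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (DDPM-style); alpha_bar[t] is the cumulative product
# of (1 - beta) up to step t, i.e. how much clean signal survives.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_noise(x0, t):
    """Sample x_t from q(x_t | x_0): mix the clean data with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.standard_normal((8, 8, 3))   # stand-in for an image / video frame
x_early, _ = forward_noise(x0, 10)    # early step: still close to the data
x_late, _ = forward_noise(x0, T - 1)  # final step: nearly pure noise
```

A generator is trained to invert this corruption step by step; "stable in training" in Group 3 refers to this simple regression-style objective, in contrast to adversarial GAN training.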
Roblox CEO Marvels at the Pace of AI Research: Even a Once Well-Read Founder Can Barely Keep Up
Sohu Finance · 2025-12-08 11:28
Baszucki founded Roblox in 2005. In the company's early days he read nearly everything, from physics-simulation to graphics-rendering research, and could understand all of it. The arrival of the AI era changed everything. He describes today's research wave as "enormous in scale and astonishing in speed": from Transformers to diffusion models to world models, "there is so much that it is hard to fully grasp."

As IT Home reported on December 8, AI research is moving at breakneck speed, new papers appear almost daily, and the technical concepts keep getting more complex, something Roblox CEO David Baszucki knows firsthand.

According to Business Insider today, Baszucki revealed that he set aside large blocks of vacation time to read AI research systematically, only to find the process "sobering": truly understanding all the papers is "extremely difficult."

While outside attention has focused on scaling compute, OpenAI co-founder Ilya Sutskever argues that what really determines AI's direction is still "the research itself": "We are back in an age of research, only now with bigger computers."

As for Roblox, Baszucki's conclusion is that AI remains at a very early stage in the "three-dimensional world." He notes that AI relies on human-made text and images: "We are training AI on content we created ourselves, not on raw 3D data from the real world."

As AI expands from academia to the level of national strategy, companies such as Meta and Microsoft are building their own ...
Are World Models Approaching Their Own "ChatGPT Moment"?
Sina Finance · 2025-12-02 11:22
This roundtable, moderated by the editor-in-chief of the Huang Danian Tea House (黄大年茶思屋) and gathering experts from the Institute of Automation of the Chinese Academy of Sciences, Nanjing University, the Beijing Institute for General Artificial Intelligence, GigaAI (极佳科技), and other institutions, went straight at the hottest direction in AI today: world models. Recently, from the release of Google's Genie 3 to Fei-Fei Li's long essay, concepts such as world models and spatial intelligence have become the new focus.

Reported by Synced (机器之心).

Is the startup direction that top scholars such as Fei-Fei Li have thrown themselves into, the world model, AI's next stop?

"AI is the only field of technology since the dawn of humanity that truly deserves the phrase 'changing by the day,'" said Zhang Qunying, editor-in-chief of the Tea House tech site, opening the roundtable at Synced's recent NeurIPS 2025 paper-sharing session; the remark resonated with the experts present.

Over the forty-odd minutes of conversation, the experts discussed the definition of world models, directions for data and architecture, diverging technical paths, and commercialization prospects. On some topics they agreed, but on many important directions their thinking clearly differed. Evidently, for this fast-developing emerging field, much remains to be explored and validated, in technology and in evaluation standards alike.

First, what exactly is a world model? The guests offered definitions from different angles. Zhu Zheng, co-founder and chief scientist of GigaAI, holds that a world model is essentially a predictive model: "Given the current state and a sequence of actions, predict the next state." He pointed out ...
Diffusion Models Took a Ten-Year Detour! Kaiming He's Major New Work JiT Returns to the True Essence of "Denoising"
自动驾驶之心· 2025-12-01 00:04
Core Viewpoint
- The article discusses the limitations of current diffusion models in denoising tasks and introduces a simplified architecture called JiT (Just image Transformers) that focuses on predicting clean images directly rather than noise, leading to improved performance in high-dimensional pixel spaces [10][18][34].

Group 1: Diffusion Models and Noise Prediction
- Traditional diffusion models are designed to predict noise or the amount of mixed noise, which is fundamentally different from predicting clean images [6][7].
- The authors argue that the essence of denoising should be to let the network predict clean data instead of noise, simplifying the task and improving model performance [18][19].

Group 2: JiT Architecture
- JiT is a minimalist framework that operates directly on pixel patches without relying on latent spaces, tokenizers, or additional loss functions, making it more efficient [10][25][34].
- The architecture demonstrates that even with high-dimensional patches (up to 3072 dimensions), the model can maintain stable training and performance by focusing on predicting clean images [23][30].

Group 3: Experimental Results
- In experiments on ImageNet at various resolutions, JiT models achieved impressive FID scores, with JiT-G/16 reaching 1.82, comparable to complex models that utilize latent spaces [30][31].
- The model's performance remained stable even at higher resolutions (1024×1024), showcasing its capability to handle high-dimensional data without increased computational costs [32][34].

Group 4: Implications for Future Research
- The JiT framework suggests a potential shift in generative modeling, emphasizing the importance of working directly in pixel space for applications in embodied intelligence and scientific computing [34].
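The article's central claim is that the regression target should be the clean image x0 rather than the injected noise eps. Under the standard diffusion corruption the two objectives differ only in what the network is asked to output, which the NumPy sketch below contrasts. The "network" here is a stand-in identity map and the patch dimension is an illustrative assumption; this is not JiT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sample(x0, alpha_bar_t):
    """Standard diffusion corruption: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps

def eps_prediction_loss(f, x0, alpha_bar_t):
    """Conventional objective: the network regresses the injected noise eps."""
    xt, eps = noisy_sample(x0, alpha_bar_t)
    return float(np.mean((f(xt) - eps) ** 2))

def x_prediction_loss(f, x0, alpha_bar_t):
    """x-prediction objective (the article's 'true denoising'): the network
    regresses the clean data x0 directly."""
    xt, _ = noisy_sample(x0, alpha_bar_t)
    return float(np.mean((f(xt) - x0) ** 2))

# Stand-in "network": identity over a flattened 16x16x3 patch (768 dims).
x0 = rng.standard_normal(768)
identity = lambda xt: xt
loss_x = x_prediction_loss(identity, x0, alpha_bar_t=0.99)
loss_e = eps_prediction_loss(identity, x0, alpha_bar_t=0.99)
```

At a low-noise step the identity map is already a decent x0-predictor but a terrible eps-predictor, which makes concrete that the two parameterizations pose genuinely different regression problems even though they share the same corruption process.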
NeurIPS 2025 Awards Announced: Qwen Wins a Best Paper Award
具身智能之心· 2025-11-28 00:04
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advancements in various AI research areas [1][2][4].

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" introduces Infinity-Chat, a dataset with 26,000 diverse user queries, addressing the issue of homogeneity in language model outputs [6][8][10].
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals the impact of gated attention mechanisms on model performance, enhancing training stability and robustness [12][18].
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" demonstrates that increasing network depth to 1024 layers significantly improves performance in self-supervised reinforcement learning tasks [19][20].
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" explores the dynamics of training diffusion models, identifying mechanisms that prevent memorization and enhance generalization [21][23].

Group 2: Awards and Recognition
- The Test of Time Award was given to the paper "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on computer vision since its publication in 2015 [38][42].
- The Sejnowski-Hinton Prize was awarded to researchers for their work on "Random synaptic feedback weights support error backpropagation for deep learning," contributing to the understanding of biologically plausible learning rules [45][49].
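The gated attention of Paper 2 can be illustrated with a single head: a sigmoid gate computed from the input multiplies the attention output elementwise, adding non-linearity and input-dependent sparsity. The NumPy sketch below is a hedged reconstruction of that idea; the dimensions, weight shapes, and single-head setup are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention(x, Wq, Wk, Wv, Wg):
    """Single-head scaled dot-product attention whose output is multiplied
    elementwise by a sigmoid gate computed from the same input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v                 # standard attention output
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))    # sigmoid gate in (0, 1)
    return gate * out                          # gated output

d = 16
x = rng.standard_normal((6, d))               # 6 tokens, model dimension d
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]
y = gated_attention(x, *Ws)
```

Because the gate lies strictly in (0, 1), every output entry is attenuated relative to ungated attention; gate values near zero give the model a learned way to suppress a head's contribution for particular tokens, which is one route to the attention-sink-free behavior the title refers to.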
NeurIPS 2025 Best Papers Announced; the Decade-Old Classic by Kaiming He, Jian Sun, and Colleagues Takes the Test of Time Award
36Kr · 2025-11-27 07:27
Core Insights
- NeurIPS 2025 announced its best paper awards, with four papers recognized, including a significant contribution from Chinese researchers [1][2]
- The Test of Time Award was given to Faster R-CNN, highlighting its lasting impact on the field of computer vision [1][50]

Best Papers
- The first best paper, "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)," was authored by a team from multiple prestigious institutions, including the University of Washington and Carnegie Mellon University [5][6]
- The second best paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free," involved collaboration between researchers from Alibaba, the University of Edinburgh, Stanford University, MIT, and Tsinghua University [14][15]
- The third best paper, "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities," was authored by researchers from Princeton University and Warsaw University of Technology [21][24]
- The fourth best paper, "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training," was a collaborative effort from PSL University and Bocconi University [28][29]

Runners-Up
- Three runner-up papers were also recognized, including "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" from Tsinghua University and Shanghai Jiao Tong University [33][34]
- Another runner-up, "Optimal Mistake Bounds for Transductive Online Learning," was authored by researchers from Kent State University, Purdue University, Google Research, and MIT [38][39]
- The third runner-up, "Superposition Yields Robust Neural Scaling," was from MIT [42][46]

Test of Time Award
- The Test of Time Award went to the paper "Faster R-CNN," which has been cited over 56,700 times and has significantly influenced the computer vision field [50][52]
- The paper introduced a fully learnable two-stage pipeline that replaced traditional region-proposal methods, achieving high detection accuracy at near real-time speed [50][52]