机器之心
Only $0.30 and 26 Minutes: CudaForge, a Disruptive Low-Cost CUDA Optimization Framework
机器之心· 2025-11-17 09:00
The authors of this paper include Zijian Zhang (co-first author), Rong Wang (co-first author), Shiyang Li, Yuebo Luo, Mingyi Hong, and Caiwen Ding of the University of Minnesota.

The performance of CUDA code is critical to today's model training and inference, yet hand-writing optimized CUDA kernels demands deep expertise and considerable time. Meanwhile, LLMs have achieved notable success in the code domain in recent years, motivating efforts to use LLMs to write optimized CUDA kernels. Existing methods, however, face many problems, such as high training and inference costs, poor kernel performance, and blind exploration caused by the lack of hardware feedback. So, for LLM-based CUDA code generation, can we design a simple yet effective method that generates reliable, efficient CUDA kernels at low cost?

The team at the University of Minnesota proposes a new method: CudaForge. It is a simple, efficient, and low-cost multi-agent workflow for CUDA kernel generation and optimization. Inspired by the actual development process of human experts, the workflow comprises key stages including initial kernel writing, correctness testing, hardware-feedback analysis, and iterative refinement. Experiments show that CudaForge achieves SOTA results on KernelBench Levels 1-3, surpassing all existing ...
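The generate-test-refine loop described above can be sketched as follows. This is a minimal illustration only: the agent roles, the feedback strings, and the stopping rule are assumptions for the sketch, not CudaForge's exact design.

```python
# Minimal sketch of an iterative write -> test -> profile -> analyze kernel
# loop in the spirit of CudaForge. All callables are supplied by the caller;
# in a real system they would wrap an LLM coder agent, a compiler/correctness
# harness, a hardware profiler, and an LLM judge agent respectively.
from dataclasses import dataclass
from typing import Callable

@dataclass
class KernelCandidate:
    source: str        # CUDA source text produced by the coder agent
    correct: bool      # did it pass the correctness test?
    speedup: float     # measured speedup vs. a reference kernel

def optimize_kernel(
    write_kernel: Callable[[str], str],   # coder agent: feedback -> CUDA source
    run_tests: Callable[[str], bool],     # compile + numerical correctness check
    profile: Callable[[str], float],      # hardware profiler: source -> speedup
    analyze: Callable[[str, float], str], # judge agent: metrics -> feedback text
    max_rounds: int = 5,
) -> KernelCandidate:
    """Iterate the loop for max_rounds, keeping the best correct kernel seen."""
    feedback = "Write an initial CUDA kernel for the given task."
    best = KernelCandidate(source="", correct=False, speedup=0.0)
    for _ in range(max_rounds):
        src = write_kernel(feedback)
        if not run_tests(src):
            feedback = "The kernel failed correctness tests; fix it."
            continue
        speedup = profile(src)
        if speedup > best.speedup:
            best = KernelCandidate(src, True, speedup)
        feedback = analyze(src, speedup)  # e.g. occupancy / coalescing hints
    return best
```

The key design point mirrored here is that hardware feedback (the profiler plus the judge's analysis) steers each refinement round, rather than the coder agent exploring blindly.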
After Genuinely Trying Alibaba's Qianwen APP, Why Do We Call It "China's ChatGPT"?
机器之心· 2025-11-17 04:23
Core Viewpoint
- Alibaba has launched a new application called Qianwen APP, which aims to compete in the consumer-facing (C-end) AI application market, positioning itself as "China's ChatGPT" [3][5][55].

Group 1: Product Positioning and Strategy
- Qianwen APP is designed to be a personal AI assistant, integrating daily tasks such as knowledge Q&A, search, content creation, code generation, and shopping into one platform [5][55].
- The app is seen as a significant move following Alibaba's investment of 380 billion yuan in AI infrastructure earlier this year, indicating the company's commitment to AI development [5][55].
- Qianwen APP represents Alibaba's first attempt to connect its strongest models directly to users in a personal-assistant format, enhancing user experience and perception of model capabilities [6][13].

Group 2: Model Capabilities
- The Qwen model family, particularly Qwen3-Max, boasts over 1 trillion parameters and 36 trillion tokens of pre-training data, achieving breakthroughs in capabilities such as Chinese and English understanding, complex instruction following, and programming [6][12].
- Qwen3-Max has demonstrated top-tier coding and tool-calling performance, ranking highly in global benchmarks [6][7][12].
- The Qwen model family covers a wide range of modalities, including text, vision, speech, video, and code, and is recognized as one of the most popular open-source model families globally [8][12].

Group 3: User Experience and Features
- The Qianwen APP features a minimalist design focused on user-friendliness and ease of use, which is crucial for a general-purpose AI assistant [15][53].
- The app excels at visual recognition, accurately identifying a variety of objects and providing detailed information that enriches user interaction [19][25].
- Qianwen APP's professional Q&A capabilities are tailored for high-value fields such as finance, technology, and academia, ensuring depth and accuracy in responses [32][53].

Group 4: Competitive Landscape
- The launch of Qianwen APP puts Alibaba in direct competition with global AI applications, particularly OpenAI's ChatGPT, as it aims to establish itself as a leading AI assistant [55][56].
- The app's capabilities and design reflect a strategic shift toward building a "super AI assistant," in line with global trends in AI development [55][56].
VinciCoder: A Unified Multimodal Code Generation Framework with Visual-Feedback Reinforcement Learning; Data, Code, and Model Weights Open-Sourced
机器之心· 2025-11-17 04:23
Core Insights
- The article discusses the limitations of traditional supervised fine-tuning (SFT) in multimodal code generation and introduces VinciCoder, a unified model that leverages visual reinforcement learning (ViRL) to enhance visual fidelity and code executability [2][6][22].
- VinciCoder employs a two-phase strategy combining large-scale SFT with coarse-to-fine ViRL to address the challenges existing models face in generating diverse code from various visual inputs [2][7][22].

Limitations of Traditional SFT
- Traditional SFT suffers from a "visual gap" between training objectives and final tasks, leading to local optimization that fails to ensure global code executability and to a lack of visual feedback during training [6][13].
- The absence of visual feedback is critical: minor code modifications can cause significant changes in the rendered image, highlighting the need for a mechanism that provides global visual feedback [6][7].

VinciCoder's Approach
- VinciCoder's innovation lies in shifting the reward mechanism from the text domain to the visual domain, using large-scale SFT to build foundational code capabilities, followed by a ViRL phase to optimize visual fidelity and executability [7][12].
- The training framework consists of a 1.6M-sample large-scale SFT phase and a 42k-sample coarse-to-fine ViRL phase, enabling strong code understanding and high-fidelity visual alignment [7][12].

Large-Scale SFT and Code Optimization
- The research team created a large-scale SFT corpus of 1.6 million image-code pairs, including a new "visual code optimization" task in which the model corrects defective code to align with target images [10][12].

Coarse-to-Fine ViRL Framework
- VinciCoder introduces a coarse-to-fine visual reward mechanism that derives reward signals directly from visual outputs, addressing the lack of visual-code feedback in traditional SFT [12][14].
- The framework evaluates visual similarity at both the global (coarse) and local (fine) levels, enhancing the model's ability to generate accurate code [14].

Experimental Results
- VinciCoder demonstrated superior performance across multiple multimodal code generation benchmarks, outperforming both open-source and closed-source models and establishing new state-of-the-art (SOTA) results [16][18].
- On challenging tasks such as Image-to-SVG and chemical formula generation, the model rivals top closed-source models, showcasing its effectiveness [16][18].

Research Significance and Future Applications
- The research presents a new paradigm for multimodal code generation, emphasizing the importance of visual feedback in guiding the code generation process [19][20].
- VinciCoder's success illustrates the potential of reinforcement learning to bridge the gap between the visual and code modalities, paving the way for future work on generalized multimodal intelligence [20][22].
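A coarse-to-fine visual reward of the kind described can be sketched as follows. The similarity metric (mean pixel agreement), the tiling scheme, and the 50/50 weighting are illustrative assumptions, not the paper's actual reward.

```python
# Illustrative coarse-to-fine visual reward: the rendered image of the
# generated code is compared to the target image both globally and tile by
# tile, and the two scores are blended into one scalar reward.
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean agreement in [0, 1] between two images scaled to [0, 1]."""
    return float(1.0 - np.abs(a - b).mean())

def coarse_to_fine_reward(rendered: np.ndarray, target: np.ndarray,
                          patches: int = 4, w_global: float = 0.5) -> float:
    """Blend a global (coarse) score with an average per-tile (fine) score."""
    coarse = similarity(rendered, target)
    # Split both images into a patches x patches grid and score each tile.
    rows = np.array_split(np.arange(rendered.shape[0]), patches)
    cols = np.array_split(np.arange(rendered.shape[1]), patches)
    fine_scores = [
        similarity(rendered[np.ix_(r, c)], target[np.ix_(r, c)])
        for r in rows for c in cols
    ]
    fine = float(np.mean(fine_scores))
    return w_global * coarse + (1.0 - w_global) * fine
```

The fine term is what distinguishes this from a purely global score: a small local defect (e.g. one wrong chart element) fully penalizes its own tile instead of being averaged away over the whole image.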
ChatGPT: Goodbye, "Em Dash"
机器之心· 2025-11-17 04:23
Core Viewpoint
- The article discusses the peculiarities of AI-generated text, particularly the frequent use of em dashes, which has become a hallmark of AI writing style; texts with excessive em dashes are now widely presumed to be AI-generated [6][7].

Group 1: AI Writing Characteristics
- AI models such as ChatGPT tend to overuse em dashes in their outputs, a phenomenon now described as the "ChatGPT style" [6].
- Users have begun avoiding em dashes in their own writing to keep from being mistaken for AI-generated content [6][18].
- OpenAI's CEO, Sam Altman, announced that users can now instruct ChatGPT to avoid using em dashes, which he called a "small but happy victory" [7].

Group 2: User Experience and Feedback
- Despite the update, users quickly reported that em dashes still appeared in ChatGPT's responses, indicating that the issue may not be fully resolved [9].
- In testing, when explicitly instructed not to use em dashes, ChatGPT complied and omitted them from its responses [10].
- The article also highlights other AI habits, such as frequently including English terms in parentheses and putting quotation marks around abstract concepts, which can likewise hurt readability [15][17].
Solving Tesla's "Sparse Supervision" Problem: DriveVLA-W0 Uses World Models to Amplify the Data Scaling Law in Autonomous Driving
机器之心· 2025-11-17 04:23
Core Insights
- The article discusses the transition of VLA models in autonomous driving from academic research to practical application, highlighting the challenge of a "supervision deficit" [2][5][8].
- A new research paper titled "DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving" addresses this challenge by introducing world models as a source of dense self-supervised signals [6][10][12].

Group 1: Supervision Deficit
- VLA models face a "supervision deficit": high-dimensional visual input is paired with low-dimensional, sparse supervisory signals, wasting representational capacity [8][9].
- The research team found that under sparse action supervision, VLA performance saturates quickly as data grows, diminishing the effect of the Data Scaling Law [9][22].

Group 2: World Models as a Solution
- Introducing a world model lets the system predict future images, providing a richer and denser learning signal than sparse actions alone [11][15][16].
- This approach fundamentally alleviates the supervision deficit, enabling better learning of the complex dynamics of driving environments [16][18].

Group 3: Amplifying the Data Scaling Law
- The core contribution is the finding that world models significantly amplify the Data Scaling Law, yielding a steeper performance curve with increasing data than baseline models [18][21].
- In experiments with up to 70 million frames, the world model reduced collision rates by 20.4%, a qualitative leap that surpasses merely stacking action data [24].

Group 4: Efficiency and Real-World Application
- The research also addresses the high inference latency of VLA models with a lightweight MoE "action expert" architecture, which cuts inference latency to 63.1% of the baseline VLA without sacrificing performance [26][27].
- This design improves the feasibility of real-time deployment of VLA models in autonomous driving applications [27][29].
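The dense-supervision idea above amounts to adding a future-frame prediction term next to the sparse action term in the training objective. The sketch below is illustrative only: both loss forms and the weighting are assumptions, not DriveVLA-W0's exact objective.

```python
# Sketch of sparse action supervision augmented with a dense world-model
# term: the action loss supervises a few scalars per frame, while the
# world-model loss supervises every pixel of a predicted next frame.
import numpy as np

def action_loss(pred_action: np.ndarray, gt_action: np.ndarray) -> float:
    """Sparse supervision: a handful of scalars (e.g. steering, speed)."""
    return float(np.mean((pred_action - gt_action) ** 2))

def world_model_loss(pred_frame: np.ndarray, next_frame: np.ndarray) -> float:
    """Dense supervision: one target value per pixel of the next frame."""
    return float(np.mean((pred_frame - next_frame) ** 2))

def joint_loss(pred_action: np.ndarray, gt_action: np.ndarray,
               pred_frame: np.ndarray, next_frame: np.ndarray,
               w_world: float = 0.1) -> float:
    """Total objective: sparse action term plus weighted dense world term."""
    return (action_loss(pred_action, gt_action)
            + w_world * world_model_loss(pred_frame, next_frame))
```

The contrast in supervision density is the point: for a 2-dimensional action and a 4x4 frame, the world-model term contributes 16 target values per sample versus the action term's 2, and real camera frames make that ratio larger by orders of magnitude.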
Unbelievable: 21% of ICLR 2026 Reviews Were AI-Generated? The Organizers Respond
机器之心· 2025-11-17 03:19
Core Insights
- The article discusses the significant presence of AI-generated content in the ICLR 2026 review process, where a substantial portion of review comments were created by AI [2][11].

Group 1: AI Usage in Paper Reviews
- A systematic analysis of 75,800 review comments found that 21% were fully AI-generated, 4% heavily AI-edited, 9% moderately edited, and 22% lightly edited, with only 43% fully human-written [2][11].
- AI-generated reviews tend to be 26% longer and score submissions higher on average: fully AI-generated reviews average a score of 4.43 versus 4.13 for fully human-written reviews [11].
- Fully AI-generated reviews also report slightly higher average confidence, indicating a tendency toward more confident evaluations [12].

Group 2: Implications and Responses
- The ICLR 2026 organizing committee acknowledged the problem of low-quality AI-generated reviews and is considering appropriate measures, including marking and reporting such reviews [18].
- Proposed remedies include removing poor evaluations and deeming the responsible reviewers to have failed their duties, which could trigger automatic rejection of their own submissions [18].
- Pangram Labs' analysis indicates that 39% of submitted papers used AI in some capacity, with higher AI usage correlating with lower average scores [8].
Is arXiv Rejecting Survey Papers Now? This NeurIPS Paper Already Discussed the "Paper DDoS" Problem
机器之心· 2025-11-17 03:19
Core Viewpoint
- The article discusses a significant policy update from arXiv requiring all review and position papers in the computer science category to undergo peer review before submission, primarily in response to the overwhelming influx of AI-generated content [2][8].

Group 1: The Crisis of AI-Generated Papers
- The term "Survey Paper DDoS attack" describes the overwhelming number of low-quality AI-generated survey papers flooding the academic community [5][20].
- The influx of AI-generated content obscures valuable insights, akin to a denial-of-service attack, making it difficult for researchers to find meaningful academic contributions [7][21].

Group 2: Quantitative Evidence of the Surge
- A study analyzed 10,063 survey papers on arXiv from 2020 to 2024 and found a significant spike in submissions after 2022, coinciding with the rise of generative AI tools such as ChatGPT [10][12].
- The average AI-generated score more than doubled, indicating that AI is a primary driver of this growth [13].
- Suspicious publishing behavior has increased notably, with some authors publishing many papers in a short time frame, suggesting AI-assisted bulk production [14].

Group 3: Detrimental Effects on Academic Integrity
- AI-generated survey papers are not merely noise; they pose a serious threat to the academic ecosystem by introducing low-quality, redundant content [16][19].
- Traditional expert-written surveys provide critical insight, whereas AI-generated ones often lack structure and novel taxonomies and can contain inaccuracies [17][18].
- "Literature poisoning" occurs when new researchers rely on flawed AI-generated surveys, potentially embedding incorrect academic foundations [19].

Group 4: Proposed Solutions
- The article argues that arXiv's new rules are a necessary but reactive measure against the crisis [23][25].
- The authors propose a shift toward "Dynamic Live Surveys" (DLS): community-maintained online knowledge bases that allow real-time updates and reduce redundancy [24].
- Recommendations include stricter review processes, transparency about AI usage, and incentives for high-quality surveys to counter the influx of low-quality submissions [26].
This Year's NeurIPS 2025 Has Plenty to See! See You in Beijing on November 22
机器之心· 2025-11-16 07:30
Core Insights
- By 2025, the evolution of AI is transitioning from "capability breakthroughs" to "system construction," focusing on reliability, interpretability, and sustainability [2].
- NeurIPS 2025 will be held December 2-7 in San Diego, USA, with a record 21,575 submissions and an acceptance rate of 24.52%, reflecting a growing global AI academic ecosystem [2].
- The event aims to serve the Chinese AI community through various activities, including keynote speeches, paper sharing, roundtable discussions, and poster sessions [3].

Event Details
- The "NeurIPS 2025 Paper Sharing Conference" will take place on November 22, 2025, from 09:00 to 17:30 at the Crowne Plaza Hotel in Zhongguancun, Beijing [5][6].
- The agenda includes keynote speeches, paper presentations, and poster exchanges, providing a platform for academic and industry collaboration [3][10].

Keynote Speakers
- The morning keynote will be delivered by Professor Qiu Xipeng of Fudan University on "Contextual Intelligence: Completing the Key Puzzle of AGI" [14][16].
- The afternoon keynote speaker is Fan Qi of Nanjing University, with the topic yet to be announced [17].

Paper Presentations
- The presented papers cover topics such as data mixing, multimodal adaptation, and reinforcement learning in large language models [9][11][23].
- Notable presentations include "Data Mixing Can Induce Phase Transitions in Knowledge Acquisition" and "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model" [9][11].
Lumina-DiMOO: A Multimodal Diffusion Language Model Reshaping Image Generation and Understanding
机器之心· 2025-11-16 04:01
Core Viewpoint
- Lumina-DiMOO is an innovative multimodal generative language model that uses discrete diffusion modeling to bridge multimodal tasks, enabling seamless integration of text-to-image, image-to-image, and image-to-text capabilities [2][11].

Group 1: Historical Context
- Traditional autoregressive models, such as Chameleon and Janus-Pro, face significant limitations, including slow generation, constrained quality in high-resolution image generation, and a lack of seamless task integration [7].

Group 2: Current Innovations
- Lumina-DiMOO employs a pure discrete diffusion framework, addressing these limitations through parallelized bidirectional attention and flexible sampling strategies that improve both generation speed and quality [9][11].

Group 3: Key Features
- Discrete diffusion architecture: image generation and understanding run efficiently within a single framework, breaking down the traditional boundary between the two [12].
- Efficient generation: by processing multiple tokens simultaneously, Lumina-DiMOO accelerates inference while improving quality and ensuring effective collaboration between tasks [15].
- Bidirectional attention: enhances the model's grasp of contextual relationships in text and structural details in images, ensuring high consistency across multimodal tasks [17].
- Joint optimization: a global optimization strategy during training improves performance across tasks and ensures seamless transitions between them [18].
- Max-logit caching: caching stable tokens avoids unnecessary computation and significantly boosts generation efficiency while maintaining high-quality outputs, especially in high-resolution tasks [20].

Group 4: Advanced Learning Framework
- Self-GRPO framework: a new self-reinforcement framework that integrates image generation and multimodal understanding into a single reinforcement learning trajectory, allowing the model to learn from its own outputs and improve iteratively [22][23].

Group 5: Performance and Recognition
- Lumina-DiMOO has achieved top rankings in several authoritative evaluations, demonstrating superiority in semantic consistency, layout understanding, and reasoning compared with leading models such as GPT-4o and Janus-Pro [29].
Exclusive: Chen Tianqiao's Shanda Team Releases EverMemOS, Its Strongest Open-Source Memory System
机器之心· 2025-11-16 04:01
Core Viewpoint
- EverMind has launched EverMemOS, a world-class long-term memory operating system for AI agents, aiming to give AI a persistent, coherent, and evolving "soul" [1][4].

Group 1: Memory Capability
- The fixed context windows of LLMs cause frequent "amnesia" during long-running tasks, producing memory breaks and factual inconsistencies that diminish the application value of AI [4].
- Long-term memory is becoming a core competitive advantage in AI applications, marking the shift from AI as a "tool" to AI as an "agent" capable of proactive evolution [5].

Group 2: Industry Trends
- Major industry players such as Claude and ChatGPT have introduced long-term memory as a strategic feature, a clear signal that memory is a critical capability for future AI applications [5].
- Current attempts to address memory, such as RAG and emerging memory systems, are often fragmented, highlighting the need for a comprehensive, usable memory system that supports diverse scenarios [5].

Group 3: Inspiration and Design
- EverMemOS's design is inspired by human brain memory mechanisms, aiming to let AI think, remember, and grow like humans [7][14].
- Its architecture is a four-layer design that parallels key functions of the human brain, enhancing its memory capabilities [19][22].

Group 4: Technical Performance
- EverMemOS achieves significant breakthroughs in both scene coverage and technical performance, and is the first memory system to support both one-on-one conversations and complex multi-user collaboration [15].
- The system scored 92.3% on the LoCoMo benchmark and 82% on LongMemEval-S, surpassing previous state-of-the-art levels [17].

Group 5: System Features
- EverMemOS is not just a memory "database" but a "memory processor," addressing the core pain point of existing methods that only retrieve information without using it effectively [23].
- The system features innovative "layered memory extraction" and dynamic organization, producing structured memory that captures implicit context [23][26].
- It introduces the first scalable, modular memory framework, adaptable to different memory needs across scenarios, ensuring appropriate memory organization and application strategies for each [26].

Group 6: Open Source and Future Plans
- EverMind has released an open-source version of EverMemOS on GitHub for developers and AI teams to deploy and test [28].
- A cloud-service version is expected later this year, providing enhanced technical support and data persistence for enterprise users [28].