机器之心
Andrej Karpathy's latest talk goes viral: we have entered the Software 3.0 era, where you can program just by talking
机器之心· 2025-06-20 00:58
Core Viewpoint
- The article discusses the evolution of software in the context of AI, particularly focusing on the transition to "Software 3.0," where natural language becomes the new programming interface and large language models (LLMs) play a central role in software development [6][8][25].

Group 1: Evolution of Software
- Software development is categorized into three phases: Software 1.0 (manual coding), Software 2.0 (neural network weights), and Software 3.0 (LLMs as programming interfaces) [8][25].
- The current shift signifies a transformation in which LLMs are viewed as a new type of operating system, centralizing computational power in the cloud and allowing users to interact through natural language [14][48].

Group 2: Characteristics of LLMs
- LLMs are described as "defective superheroes," possessing vast knowledge but prone to errors and lacking long-term memory, which necessitates careful supervision in their application [14][88].
- The article emphasizes the need to redesign digital infrastructure to make it more machine-readable, facilitating the development of advanced AI systems [14][38].

Group 3: Opportunities in AI Applications
- The concept of "partial autonomy" in applications is introduced, with tools like Cursor and Perplexity exemplifying how LLMs can enhance human capabilities while keeping the user in control [101][107].
- The importance of user-friendly graphical interfaces (GUIs) is highlighted, as they make human oversight of AI-generated outputs more efficient [104][117].

Group 4: Future of Programming
- The emergence of "vibe coding" is noted, where individuals can create software by describing problems in natural language, thus democratizing programming [138][144].
- The article suggests that the future of software development will involve building tools that are friendly to LLMs, enabling seamless interaction and enhancing productivity [170][179].
Large recommendation models are here? The OneRec paper explained: how end-to-end training wins on both quality and cost
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the transformation of recommendation systems through the integration of large language models (LLMs), highlighting the introduction of the "OneRec" system by Kuaishou, which aims to enhance the efficiency and effectiveness of recommendation [2][35].

Group 1: Challenges in Traditional Recommendation Systems
- Traditional recommendation systems face significant challenges, including low computational efficiency, conflicting optimization objectives, and an inability to leverage the latest AI advancements [5].
- For instance, Kuaishou's SIM model reaches a Model FLOPs Utilization (MFU) of only 4.6%/11.2%, far below the 40%-50% achieved by LLMs [5][28].

Group 2: Introduction of OneRec
- OneRec is an end-to-end generative recommendation system that uses an Encoder-Decoder architecture to model user behavior and improve recommendation accuracy [6][11].
- The system has demonstrated a tenfold increase in effective computation and improved MFU to 23.7%/28.8%, cutting operational costs to just 10.6% of those of traditional methods [8][31].

Group 3: Performance Improvements
- OneRec has shown substantial gains in user engagement metrics, achieving a 0.54%/1.24% increase in app usage duration and a 0.05%/0.08% growth in the 7-day user lifecycle (LT7) [33].
- In local life service scenarios, OneRec has driven a 21.01% increase in GMV and an 18.58% rise in the number of purchasing users [34].

Group 4: Technical Innovations
- The system employs a multi-modal fusion approach, integrating data such as video titles, tags, and user behavior to enhance recommendation quality [14].
- OneRec's architecture allows for significant computational optimizations, including a 92% reduction in the number of key operators, which improves overall efficiency [27][28].

Group 5: Future Directions
- Kuaishou's technical team identifies areas for further improvement, including strengthening inference capabilities, developing a more integrated multi-modal architecture, and refining the reward system to better align with user preferences [38].
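For a feel for the MFU figures quoted in the summary above, here is a back-of-the-envelope sketch. The formula (useful model FLOPs executed per second divided by hardware peak FLOPS) is standard; the concrete numbers in the usage line are illustrative assumptions, not values reported in the OneRec paper.

```python
def mfu(items_per_s: float, flops_per_item: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs actually executed per second,
    divided by the hardware's theoretical peak FLOPs per second."""
    return items_per_s * flops_per_item / peak_flops

# Illustrative numbers only: a model spending 2 GFLOPs per scored item, running at
# 22,900 items/s on an accelerator with a 989 TFLOPS peak.
print(f"{mfu(22_900, 2e9, 989e12):.1%}")  # -> 4.6%, the order of magnitude cited for SIM
```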
Kaiming He's latest CVPR lecture slides are online: toward end-to-end generative modeling
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the evolution of generative models, particularly focusing on the transition from diffusion models to end-to-end generative modeling, highlighting the potential for generative models to replicate the historical advancements seen in recognition models [6][36][41].

Group 1: Workshop Insights
- The workshop led by Kaiming He at CVPR focused on the evolution of visual generative modeling beyond diffusion models [5][7].
- Diffusion models have become the dominant method in visual generative modeling, but they face limitations such as slow generation speed and challenges in simulating complex distributions [6][36].
- Kaiming He's presentation emphasized the need for end-to-end generative modeling, contrasting it with the historical layer-wise training methods prevalent before AlexNet [10][11][41].

Group 2: Recognition vs. Generation
- Recognition and generation can be viewed as two sides of the same coin, where recognition abstracts features from raw data, while generation concretizes abstract representations into detailed data [41][42].
- The article highlights the fundamental differences between recognition tasks, which have a clear mapping from data to labels, and generation tasks, which involve complex, non-linear mappings from simple distributions to intricate data distributions [58].

Group 3: Flow Matching and MeanFlow
- Flow Matching is presented as a promising approach to address the challenges in generative modeling by constructing ground-truth fields that are independent of specific neural network architectures [81].
- The MeanFlow framework introduced by Kaiming He aims to achieve single-step generation tasks by modeling average velocity rather than instantaneous velocity, providing a theoretical basis for network training [83][84].
- Experimental results show that MeanFlow significantly outperforms previous single-step diffusion and flow models, achieving a FID score of 3.43, which is over 50% better than the previous best [101][108].

Group 4: Future Directions
- The article concludes with a discussion on the ongoing research efforts in the field, including Consistency Models, Two-time-variable Models, and revisiting Normalizing Flows, indicating that the field is still in its early stages akin to the pre-AlexNet era in recognition models [110][113].
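To make the "average velocity" idea in the summary above concrete, here is a sketch of the standard flow-matching and MeanFlow definitions; the notation is ours and may differ from the lecture slides.

```latex
% Flow-matching velocity field (sketch): the sample z_t evolves as dz_t/dt = v(z_t, t).
% Average velocity over an interval [r, t]:
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
% Differentiating (t - r)\,u(z_t, r, t) with respect to t (total derivative) gives the
% MeanFlow identity, which can serve as the training target for a network predicting u:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
```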
27, 42, 73: why do large models like DeepSeek all favor these numbers?
机器之心· 2025-06-19 03:50
Reported by 机器之心
Editor: Panda

42, the "ultimate answer to life, the universe, and everything" from The Hitchhiker's Guide to the Galaxy, has become a number meme that practically everyone knows, and even AI seems to have a special fondness for it.

Technical writer Carlos E. Perez found that when GPT-4o and Claude are asked to guess a number between 1 and 100, their first pick is most often 42, and when asked to guess again, they both, without any coordination, go for 73!

We also managed to reproduce the phenomenon on Grok. Gemini behaves the same way, and DeepSeek is no exception either. An excerpt from the DeepSeek chat:

Guess a number between 1 and 100
Alright, let's play! I'll guess a number between 1 and 100.
My guess is ... 42!
Is that correct? If not, let me know if I should guess higher or lower, and I'll try again!
try again
19
Got it! Since y ...
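One simple way to run this experiment yourself is to query a model repeatedly and tally the first number in each reply. The sketch below assumes an OpenAI-compatible endpoint reached through the official openai Python SDK; the model name, prompt wording, and sample count are illustrative choices, not the exact setup used by Carlos E. Perez or in the tests above.

```python
import re
from collections import Counter

from openai import OpenAI  # pip install openai

client = OpenAI()          # reads OPENAI_API_KEY from the environment
counts = Counter()

for _ in range(50):
    resp = client.chat.completions.create(
        model="gpt-4o",    # swap in whichever model you want to probe
        messages=[{"role": "user", "content": "Guess a number between 1 and 100."}],
        temperature=1.0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"\b(\d{1,3})\b", text)  # first number mentioned in the reply
    if match:
        counts[int(match.group(1))] += 1

print(counts.most_common(10))  # per the article, 42 (and 73 on a retry) tend to dominate
```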
Homography computation sped up by tens of times with 95% fewer operations: geometry-based SKS and ACA matrix decompositions proposed
机器之心· 2025-06-19 03:50
Group 1
- The research team from Donghua University, Shanghai Jiao Tong University, and the Chinese Academy of Sciences has proposed two geometry-based homography decomposition methods that significantly reduce the computational load of solving homographies from four points by over 95% compared to conventional sparse linear equation methods [3][4].
- The paper titled "Fast and Interpretable 2D Homography Decomposition: Similarity-Kernel-Similarity and Affine-Core-Affine Transformations" has been accepted by the IEEE T-PAMI journal [5][4].
- The proposed methods are expected to be applicable in various visual applications, including QR code scanning, projective geometry, computer vision, and graphics problems [3].

Group 2
- The traditional Direct Linear Transformation (DLT) method constructs a sparse linear equation system for homography solving, which typically requires around 2000 floating-point operations [7].
- Improved methods have been developed, reducing the computational load to approximately 1800 operations for SVD decomposition and 220 operations for a customized Gaussian elimination method [7].
- The new methods, SKS and ACA, achieve a significant reduction in floating-point operations, with ACA requiring only 29 operations for specific cases like square templates [18][22].

Group 3
- The SKS transformation decomposes the homography matrix into multiple sub-transformations, leveraging the hierarchical nature of geometric transformations [9][10].
- The ACA transformation similarly computes affine transformations from three corresponding points, resulting in an efficient homography matrix decomposition [15].
- The average time for a single four-point homography calculation using the ACA method is reported to be only 17 nanoseconds, achieving acceleration factors of 29 times and 43 times compared to previous methods [22].

Group 4
- The methods can be integrated into various visual processing applications, replacing traditional homography algorithms, particularly in QR code scanning, which is estimated to reach billions of scans daily in China [24].
- The research team is also exploring further applications in deep learning for estimating geometric parameters, P3P pose estimation based on planar homography, and N-dimensional homography matrix decomposition [25].
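For context on the baseline these decompositions are compared against, here is the conventional 4-point DLT described above, sketched in NumPy. This is the standard SVD-based solve, not an implementation of SKS or ACA, and the example points are made up.

```python
import numpy as np

def homography_dlt(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Standard 4-point DLT: stack the 8x9 linear system and take its null vector.
    src, dst: (4, 2) arrays of corresponding points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)      # null vector = right singular vector of the
    H = Vt[-1].reshape(3, 3)         # smallest singular value
    return H / H[2, 2]               # normalize so H[2, 2] = 1

src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)  # e.g. a square template
dst = np.array([[10, 10], [52, 8], [60, 55], [8, 50]], dtype=float)
H = homography_dlt(src, dst)
p = H @ np.array([1.0, 1.0, 1.0])
print(p[:2] / p[2])                  # ~ [60, 55]: the corresponding corner maps correctly
```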
Over 1000x less data and a first-rate video model trained for $500: Pusa from City University of Hong Kong and Huawei is here
机器之心· 2025-06-19 02:28
Core Viewpoint
- The article discusses the revolutionary advancements in video generation through the introduction of the Frame-aware Video Diffusion Model (FVDM) and its practical application in the Pusa project, which significantly reduces training costs and enhances video generation capabilities [2][3][37].

Group 1: FVDM and Pusa Project
- FVDM introduces a vectorized timestep variable (VTV) that allows each frame to have an independent temporal evolution path, addressing the limitations of traditional scalar timesteps in video generation [2][18].
- The Pusa project, developed in collaboration with Huawei's Hong Kong Research Institute, serves as a direct application and validation of FVDM, exploring a low-cost method for fine-tuning large-scale pre-trained video models [3][37].
- Pusa achieves superior results compared to the official Wan I2V model while reducing training costs by over 200 times (from at least $100,000 to $500) and data requirements by over 2500 times [5][37].

Group 2: Technical Innovations
- The Pusa project utilizes non-destructive fine-tuning on pre-trained models like Wan-T2V 14B, allowing for effective video generation without compromising the original model's capabilities [5][29].
- The introduction of a probabilistic timestep sampling training strategy (PTSS) in FVDM enhances convergence speed and improves performance compared to the original model [30][31].
- Pusa's VTV mechanism enables diverse video generation tasks by allowing different frames to have distinct noise perturbation controls, thus facilitating more nuanced video generation [35][36].

Group 3: Community Engagement and Future Prospects
- The complete codebase, training datasets, and training code for Pusa have been open-sourced to encourage community contributions and collaboration, aiming to enhance performance and explore new possibilities in video generation [17][37].
- The article emphasizes the potential of Pusa to lead the video generation field into a new era characterized by low costs and high flexibility [36][37].
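As a conceptual illustration of the vectorized timestep variable (VTV) described above, a DDPM-style forward process can draw one timestep per frame instead of one per video. This is only a sketch under that assumption: it is not Pusa's or FVDM's actual training code, and the tensor layout, schedule, and names are our own.

```python
import torch

def add_noise_per_frame(x0: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Vectorized-timestep sketch: x0 has shape (B, F, C, H, W) and each of the F
    frames draws its own diffusion timestep (a scalar timestep would be shape (B,)).
    alphas_cumprod is the (T,) noise schedule, assumed to live on x0's device."""
    B, F = x0.shape[:2]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B, F), device=x0.device)       # per-frame timesteps
    a_bar = alphas_cumprod[t].view(B, F, 1, 1, 1)            # broadcast over C, H, W
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # DDPM-style forward step
    return x_t, noise, t

# The denoiser is then conditioned on the whole timestep vector t (one entry per frame),
# e.g. pred = model(x_t, t); loss = mse(pred, noise) -- these names are placeholders.
```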
The Ghibli-style "game" video that drew 770,000 viewers: we recreated it with three Chinese AI tools (prompts included)
机器之心· 2025-06-19 02:28
Core Insights
- The article discusses the rising trend of AI-generated content in gaming, particularly focusing on the Ghibli-style game videos that have gained popularity on platforms like Reddit and X [2][3][4].
- It highlights the potential of AI to revolutionize game development by enabling the creation of dynamic, immersive virtual environments from user prompts [4][30].
- The introduction of AI video generation models is seen as a disruptive force in the gaming industry, allowing real-time content generation based on player interactions and preferences [30][31].

Group 1: AI in Game Development
- The recent success of AI-generated Ghibli-style videos indicates a growing interest in AI's capabilities within the gaming sector [2][3].
- AI models like GameNGen and GameGen-O are mentioned as examples of technology that can dynamically generate game visuals and storylines based on player choices [30].
- The traditional game development process is often lengthy and costly, with examples like the AAA title "Black Myth: Wukong" costing between 150 million and 200 million yuan per hour of development [29].

Group 2: Emerging AI Technologies
- New AI video generation models such as Keling 2.1 and Hailuo 02 are compared for their effectiveness in creating game content [20][28].
- The article notes that AI can lower barriers to entry for independent developers and non-professionals, as seen with tools like Buildbox 4 Alpha that allow users to create games through simple prompts [31].
- Despite the advancements, challenges remain in real-time content generation, including the need for significant computational power and issues related to content quality and copyright [32].

Group 3: Future Outlook
- The potential for fully AI-generated games within the next 5-10 years is suggested, aligning with predictions from industry leaders like NVIDIA CEO Jensen Huang [33].
Saining Xie's team's new benchmark leaves LLMs collectively stumped: DeepSeek R1 and Gemini 2.5 Pro both score zero
机器之心· 2025-06-18 09:34
Core Insights
- The article discusses the significant gap between current LLMs (Large Language Models) and human expert-level performance in competitive programming [2][18].
- A new benchmark, LiveCodeBench Pro, was introduced to evaluate LLMs against high-quality programming problems sourced from top competitions [4][6].

Evaluation of LLMs
- LLMs have shown impressive results in code generation, surpassing human averages in some benchmarks, particularly in competitive programming [2][12].
- However, when evaluated without external tools, the best-performing models achieved a pass rate of only 53% on medium difficulty problems and 0% on high difficulty problems [12][18].

Benchmark Details
- LiveCodeBench Pro includes 584 high-quality problems from competitions like Codeforces, ICPC, and IOI, with continuous updates to mitigate data contamination [6][10].
- Problems are categorized by algorithm type, and the performance of models is analyzed based on their failure submissions [7][12].

Model Performance Analysis
- The analysis revealed that LLMs perform well on implementation-heavy problems but struggle with complex algorithmic reasoning and edge case analysis [17][18].
- Knowledge-intensive and logic-intensive problems are areas where LLMs excel, while observation-intensive problems and case work present significant challenges [20][22][24].

Comparison with Human Performance
- LLMs exhibit a higher rate of algorithmic logic errors compared to humans, while they make fewer implementation logic errors [27][30].
- The models' inability to handle edge cases and their reliance on external tools for high scores highlight their limitations in reasoning capabilities [17][30].

Impact of Multiple Attempts
- Increasing the number of attempts (pass@k) significantly improves model performance, although high-difficulty problems remain unsolved [33][36].
- The difference in performance between models with terminal access and those without indicates that tool usage plays a crucial role in enhancing scores [34][36].

Reasoning Capability Comparison
- Enabling reasoning capabilities in models leads to substantial improvements in performance, particularly in combinatorial mathematics and knowledge-intensive categories [38][41].
- However, the enhancement is limited in observation-intensive categories, raising questions about the effectiveness of current reasoning methods in these areas [42].
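For reference, pass@k as used above is conventionally computed with the unbiased estimator introduced with HumanEval; the sketch below implements that standard estimator and is not claimed to be LiveCodeBench Pro's exact evaluation code. The numbers in the usage line are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn without
    replacement from n generations (c of them correct) solves the problem."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 200 samples per problem, 11 of them correct, reporting pass@10
print(round(pass_at_k(n=200, c=11, k=10), 3))
```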
Tsinghua's SageAttention3: 5x speedup with FP4 quantization, plus first-ever support for 8-bit training
机器之心· 2025-06-18 09:34
Core Insights
- The article discusses the advancements in attention mechanisms for large models, particularly focusing on the introduction of SageAttention3, which offers significant performance improvements over previous versions and competitors [1][2].

Group 1: Introduction and Background
- The need for optimizing attention speed has become crucial as the sequence length in large models increases [7].
- Previous versions of SageAttention (V1, V2, V2++) achieved acceleration factors of 2.1, 3, and 3.9 times respectively compared to FlashAttention [2][5].

Group 2: Technical Innovations
- SageAttention3 provides a 5x inference acceleration compared to FlashAttention, achieving 1040 TOPS on an RTX 5090 and outperforming even the more expensive H100 running FlashAttention3 by 1.65 times [2][5].
- The introduction of trainable 8-bit attention (SageBwd) allows for training acceleration while matching full-precision attention across various fine-tuning tasks [2][5].

Group 3: Methodology
- The research team employed microscaling FP4 quantization, using the NVFP4 format, to improve quantization accuracy [15][16].
- A two-level quantization approach was proposed to address the narrow range of scaling factors for the P matrix, improving overall precision [15][16].

Group 4: Experimental Results
- SageAttention3 demonstrated impressive performance in various models, maintaining end-to-end accuracy in video and image generation tasks [21][22].
- In specific tests, SageAttention3 achieved a 3x acceleration on HunyuanVideo, with significant reductions in processing time across multiple models [33][34].
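To make the microscaling idea above concrete, here is a toy NumPy sketch of blockwise 4-bit quantization: one scale per 16-value block, magnitudes snapped to the FP4 (E2M1) grid. It only simulates the general NVFP4-style scheme; the scales are kept in float rather than FP8, and this is not the SageAttention3 kernel or its two-level quantization.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by NVFP4-style microscaling.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_microscaling_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Toy blockwise quantization (len(x) must be divisible by `block`): each block
    gets one scale, values are snapped to the nearest FP4 magnitude, and the
    dequantized tensor is returned so the rounding error can be inspected."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP4_GRID[-1]   # per-block scale
    scale = np.where(scale == 0, 1.0, scale)                       # guard all-zero blocks
    mag = np.abs(xb) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)        # nearest grid point
    return (np.sign(xb) * FP4_GRID[idx] * scale).reshape(x.shape)

x = np.random.randn(1024).astype(np.float32)
x_hat = quantize_microscaling_fp4(x)
print("mean abs quantization error:", np.abs(x - x_hat).mean())
```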
In an age of information overload, how do you truly understand LLMs? Start with 50 interview questions shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of Large Language Models (LLMs) in less than a decade, enabling millions globally to engage in creative and analytical tasks through natural language [2][3].

Group 1: LLM Development and Mechanisms
- LLMs have transformed from basic models to advanced intelligent agents capable of executing tasks autonomously, presenting both opportunities and challenges [2].
- Tokenization is a crucial process in LLMs, breaking down text into smaller units (tokens) for efficient processing, which enhances computational speed and model effectiveness [7][9].
- The attention mechanism in Transformer models allows LLMs to assign varying importance to different tokens, improving contextual understanding [10][12].
- Context windows define the number of tokens LLMs can process simultaneously, impacting their ability to generate coherent outputs [13].
- Sequence-to-sequence models convert input sequences into output sequences, applicable in tasks like machine translation and chatbots [15].
- Embeddings represent tokens in a continuous space, capturing semantic features, and are initialized using pre-trained models [17].
- LLMs handle out-of-vocabulary words through subword tokenization methods, ensuring effective language understanding [19].

Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that allow efficient adaptation of LLMs with minimal memory requirements, making them suitable for resource-constrained environments [34].
- Techniques to prevent catastrophic forgetting during fine-tuning include rehearsal and elastic weight consolidation, ensuring LLMs retain prior knowledge [37][43].
- Model distillation enables smaller models to replicate the performance of larger models, facilitating deployment on devices with limited resources [38].
- Overfitting can be mitigated through methods like rehearsal and modular architecture, ensuring robust generalization to unseen data [40][41].

Group 3: Output Generation and Evaluation
- Beam search improves text generation by considering multiple candidate sequences, enhancing coherence compared to greedy decoding [51].
- Temperature settings control the randomness of token selection during text generation, balancing predictability and creativity [53].
- Prompt engineering is essential for optimizing LLM performance, as well-defined prompts yield more relevant outputs [56].
- Retrieval-Augmented Generation (RAG) enhances answer accuracy in tasks by integrating relevant document retrieval with generation [58].

Group 4: Challenges and Ethical Considerations
- LLMs face challenges in deployment, including high computational demands, potential biases, and issues with interpretability and privacy [116][120].
- Addressing biases in LLM outputs involves improving data quality, enhancing reasoning capabilities, and refining training methodologies [113].
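Two of the mechanisms listed above, scaled dot-product attention and temperature-controlled sampling, fit in a few lines of NumPy. The sketch below is a generic illustration with made-up shapes, not code from the MIT question set.

```python
import numpy as np

def softmax(z: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature < 1 sharpens the distribution; temperature > 1 flattens it."""
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V: each query token mixes the
    value vectors of all key tokens, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, head dimension 8
K = rng.normal(size=(6, 8))   # 6 key/value tokens
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```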