机器之心
Search documents
超越免训练剪枝:LightVLA引入可微分token剪枝,首次实现VLA模型性能和效率的双重突破
机器之心· 2025-09-23 04:08
Core Insights - The article introduces LightVLA, a framework designed to enhance the inference efficiency and performance of Visual-Language-Action (VLA) models, addressing the high computational costs and inference delays that limit their deployment in applications like home robotics [5][9][33] - LightVLA employs two core innovations: a differentiable visual token pruning framework and a learnable query-based token selection mechanism, allowing the model to adaptively focus on the most informative visual tokens [5][8][33] Innovation Highlights - LightVLA identifies and prunes redundant visual tokens in VLA models, utilizing a Gumbel-softmax guided process for token selection, which enhances the model's ability to choose critical visual tokens and accelerates inference [5][6][8] - The framework demonstrates state-of-the-art (SOTA) performance on the LIBERO benchmark, surpassing traditional VLA models while achieving efficient inference acceleration [6][29] Research Motivation and Challenges - The motivation behind the research stems from the inherent redundancy of visual tokens in VLA models, which contributes to computational bottlenecks and performance degradation [9][33] - Traditional pruning methods often face a trade-off between efficiency and performance, necessitating smarter pruning techniques that allow the model to focus on relevant information [9][33] Methodology Overview - LightVLA utilizes a series of query tokens to assess the importance of visual tokens, employing a differentiable pruning algorithm that allows the model to learn which tokens to retain based on their contribution to task performance [16][19][30] - The framework's architecture eliminates the need for heuristic hyperparameter settings, enabling adaptive token selection during the fine-tuning process [15][19] Experimental Results - LightVLA achieves an average success rate of 97.4% across all tasks in the LIBERO benchmark, outperforming various strong baseline models while maintaining a significantly lower number of visual tokens (average of 78) [29][30] - The framework reduces FLOPs and latency by 59.1% and 38.2%, respectively, while simultaneously improving performance, marking it as the only acceleration method that enhances both efficiency and effectiveness [29][30] Conclusion - The research presents LightVLA as a novel solution to the visual redundancy challenge in VLA models, achieving superior performance with reduced computational costs and delays, paving the way for lightweight and deployable VLA models in practical applications [33]
庞若鸣还有苹果论文?改善预训练高质量数据枯竭困境
机器之心· 2025-09-23 04:08
Core Viewpoint - The article discusses the departure of Ruoming Pang, head of Apple's foundational model team, to Meta, where he is working on advanced AI models with significant financial backing. Despite his departure, research contributions from his time at Apple continue to emerge, highlighting the ongoing impact of his work on foundational AI models [1][3]. Summary by Sections Departure and Transition - Ruoming Pang left Apple to join Meta, where he is part of a superintelligent team backed by a $200 million investment from Mark Zuckerberg [1]. Research Contributions - Pang led the Apple foundational model team, focusing on developing Apple Intelligence and other core AI functionalities. His work has been influential in advancing foundational large models [3]. Research Paper Overview - The paper titled "Synthetic Bootstrapped Pretraining" addresses the limitations of current large language models, particularly the scarcity of high-quality training data. It emphasizes the need to rethink data utilization strategies due to the "scaling wall" faced in model training [4][5]. Methodology of SBP - The proposed Synthetic Bootstrapped Pretraining (SBP) method consists of three steps: identifying semantically similar document pairs, training a synthesizer model to generate related content, and expanding this synthesis to create a large corpus for joint training with original data [6][7][10]. Theoretical Foundation - The authors provide a Bayesian perspective on the effectiveness of SBP, modeling document generation as sampling from a posterior distribution of latent concepts, which enhances the model's ability to generalize and express knowledge [11][12]. Experimental Results - The research utilized a 3B parameter Transformer model based on the Llama 3 architecture, trained on a customized version of the DCLM dataset containing 582 million documents and 482 billion tokens. SBP demonstrated consistent performance improvements over baseline models across various scales [14][18]. Performance Metrics - SBP achieved a 42% performance gain compared to a baseline model with 200 billion tokens and a 49% gain with 1 trillion tokens, indicating its ability to extract additional signals from fixed datasets [18][19]. Quality Analysis - Qualitative assessments of synthesized documents show that SBP generates content that abstracts core concepts from seed documents, maintaining thematic relevance while introducing new perspectives [21][23]. Implications for the Industry - SBP addresses a fundamental challenge in the sustainability of large language models by shifting focus from acquiring more data to extracting greater value from existing datasets. This method opens new research directions for efficient data training and may be crucial for the continued advancement of language model capabilities [24][27].
快手解密「AI印钞机」,首提生成式强化学习出价技术,为平台实现超过3%的广告收入提升
机器之心· 2025-09-23 04:08
Group 1 - Alphabet, Google's parent company, recently surpassed a market capitalization of $3 trillion, becoming the fourth company to reach this milestone [1] - Despite initial concerns about its advertising revenue due to the rise of ChatGPT, Google managed to stabilize its ad revenue and enhance user intent understanding through generative AI integration [1] - In China, Kuaishou reported a 12.8% year-on-year increase in online marketing service revenue, reaching 19.8 billion yuan in Q2, driven by advancements in generative AI for ad bidding and recommendations [2] Group 2 - Kuaishou's new bidding algorithm, termed "Generative Reinforcement Learning," allows for multi-dimensional thinking in bid modeling, leading to over a 3% increase in ad revenue while maintaining cost targets [3][4] - The evolution of Kuaishou's bidding technology has progressed through several generations, culminating in the current fourth generation of "Generative Reinforcement Learning" [12] Group 3 - The GAVE algorithm, introduced by Kuaishou, addresses challenges in aligning bidding strategies with overall optimization goals, enhancing the effectiveness of ad bidding [22][24] - GAVE has shown significant improvements in performance metrics compared to previous models, achieving optimal results across various budget settings [31] Group 4 - The CBD algorithm, another innovation from Kuaishou, aims to resolve issues related to state sequence consistency and preference alignment in bidding strategies [35][37] - CBD has demonstrated superior performance in offline experiments, significantly outperforming baseline algorithms in total conversion value [41] Group 5 - Kuaishou's commercial algorithm team has achieved notable recognition in the industry, winning multiple awards and competitions, which translates into substantial business growth [44][47] - The advancements in generative reinforcement learning bidding technology are expected to continue evolving, with Kuaishou outlining future directions for further development [50]
无需训练,即插即用:西湖大学发布世界模型WorldForge,让普通视频模型秒变「世界引擎」
机器之心· 2025-09-23 03:16
Core Viewpoint - The article discusses the advancements in AI video generation, particularly focusing on the limitations of current models in achieving precise control without sacrificing quality. It introduces the World Forge framework developed by the West Lake University AGI Lab, which allows for enhanced control in video generation without retraining the model [2][3][28]. Group 1: World Forge Framework - World Forge is a new framework that enables "plug-and-play" guidance for video diffusion models, allowing for 360° world generation and cinematic video trajectory re-framing without altering model weights [3][11]. - The framework's core idea is to intervene and calibrate during the generation process rather than modifying the model during training, ensuring adherence to spatiotemporal consistency while allowing creative freedom [11][28]. Group 2: Key Innovations - The framework includes three key innovations: 1. **Intra-step Recursive Refinement (IRR)**: This mechanism ensures that AI-generated movements strictly follow predefined camera trajectories by incrementally correcting predictions with real content [13]. 2. **Flow-Gated Latent Fusion (FLF)**: This module separates motion and appearance channels in the latent space, allowing precise injection of motion commands without disturbing visual details [14]. 3. **Dual-Path Self-Correcting Guidance (DSG)**: This strategy balances trajectory accuracy and visual quality by dynamically adjusting the guidance based on the differences between guided and non-guided paths [15]. Group 3: Practical Applications - World Forge can generate a clear and stable 360° surround video from a single image, making it suitable for complex scenes centered around a target [19]. - It allows users to specify complex camera movements for any video, enabling stable re-shooting and automatic content completion from new perspectives [20]. - The framework supports video editing capabilities, such as removing unwanted objects, adding new elements, and facilitating virtual try-ons, all while maintaining geometric consistency [25]. Group 4: Advantages of World Forge - One of the main advantages of World Forge is its training-free nature, which significantly reduces costs and barriers to high-quality 3D/4D content creation [27][29]. - The framework is flexible and can be integrated into various mainstream video models without the need for targeted training, showcasing strong generalization capabilities across different domains [29].
范式转移!无问芯穹推出基础设施智能体蜂群,开启Agentic智能体基础设施新纪元
机器之心· 2025-09-23 03:16
Core Insights - The article emphasizes the evolution of AI Agents as a key direction in AI development, highlighting their potential to become fundamental units in future intelligent societies. It points out the need for a paradigm shift in the infrastructure supporting these agents to enable autonomous decision-making and collaboration [1][4]. Group 1: Infrastructure Challenges - Current AI infrastructure relies heavily on "glue code" and faces issues such as idle computational resources, sudden failures interrupting expensive training tasks, and overwhelmed operations teams due to traditional tools and manual operations [1]. - The existing operational methods for AI infrastructure are inadequate to handle the dynamic and complex nature of AI agent production, necessitating a comprehensive reform [1]. Group 2: Introduction of Intelligent Infrastructure - Wuyuan Xinqiong has launched the "Intelligent Infrastructure Agent Swarm," which integrates multi-agent collaborative architecture with industry-specific needs, providing a new generation of intelligent infrastructure solutions [2]. - This system encapsulates various intelligent agent modules, enhancing resource utilization, operational efficiency, and the reliability of AI systems, achieving a hundredfold expansion of operational capabilities with the same investment [2]. Group 3: Operational Efficiency - The Intelligent Infrastructure Agent Swarm unifies fragmented processes across development, operations, and management into a cohesive "perception-decision-execution" loop, enabling dynamic optimization and adaptive adjustments [3]. - The architecture allows for proactive service to research and business objectives, significantly improving resource utilization, energy efficiency, and reliability of computational platforms [3]. Group 4: Agentic Infra Paradigm - The Intelligent Infrastructure Agent Swarm represents a practical implementation of the next-generation AI infrastructure paradigm, "Agentic Infra," which fundamentally alters the traditional production model by creating a highly collaborative closed-loop system [4]. Group 5: Agent Roles - Within the swarm, various agents play specific roles: - The SOTA Model Selection Agent acts as a "technical sentinel," matching optimal models and environments to tasks, avoiding inefficient resource usage [5]. - The Infrastructure Platform Steward Agent manages daily operations, automating complex underlying tasks based on user intent [5]. - The Resource Operations Agent focuses on cost and benefit, dynamically balancing resource supply and demand to prevent idle GPU resources [5]. Group 6: Comprehensive Task Management - The architecture integrates heterogeneous computational resources and AI platform capabilities, enabling end-to-end execution, monitoring, and troubleshooting across the entire production chain [7]. - This allows for a simplified interaction where users can engage with AI and intelligent agents without needing to understand the underlying complexities [7]. Group 7: Real-World Applications - The Intelligent Infrastructure Agent Swarm has demonstrated effective implementation in real business processes, significantly reducing resource consumption in traditional AI development by automating scheduling and resource orchestration [8]. - Companies like Soul App have reported drastic reductions in innovation cycles and trial costs, enabling previously shelved ideas to be rapidly realized [10]. Group 8: Future Vision - Wuyuan Xinqiong envisions a future where businesses, especially smaller teams with domain knowledge, can participate in AI transformation with lower barriers and higher efficiency [14]. - The goal is to liberate human creativity by allowing machines to handle repetitive tasks, thus enabling developers to focus on strategic and imaginative aspects of AI application development [14].
Claude Code被攻破「后门」,港科大&复旦研究曝出TIP漏洞
机器之心· 2025-09-22 23:29
Core Viewpoint - The article discusses the security vulnerabilities associated with Anthropic's Claude Code command-line tool, particularly the risk of remote code execution (RCE) due to potential hijacking of the Tool Invocation Prompt (TIP) when connecting to Model Context Protocol (MCP) servers [2][6][20]. Summary by Sections Research Findings - A study conducted by researchers from Hong Kong University of Science and Technology and Fudan University identified vulnerabilities in Claude Code v1.0.81, demonstrating the existence of a flaw that could be exploited for RCE [3][6]. - The TEW (TIP Exploitation Workflow) framework was introduced to describe the steps for achieving RCE, focusing on logical target attacks that do not require privileged access [8][10]. Attack Mechanism - The attack process involves three main steps: 1. **Prompt Structure Acquisition**: Malicious tools are registered through benign queries, allowing attackers to extract the TIP structure [10]. 2. **Vulnerability Identification**: Analyzing the TIP reveals that initialization logic processes all tool descriptions, which may include malicious code [10]. 3. **TIP Exploitation**: Tests showed a 90% success rate in executing attacks using the Claude-sonnet-4 model, with low resource consumption and high stealth [11][12]. Case Study - A practical example illustrated how a malicious MCP tool description could masquerade as an environment initialization step, leading to the execution of harmful commands despite safety warnings from the Haiku guard model [14][15]. Security Assessment - The study evaluated seven agent systems, revealing that Claude Code had a higher success rate for RCE-2 attacks, highlighting the limitations of single-layer defenses in CLI environments compared to IDE tools [17][18]. Recommendations for Improvement - The research suggests several defensive measures for Anthropic, including: 1. Utilizing guard LLMs to filter MCP inputs. 2. Implementing introspection mechanisms for the main model to assess the suspiciousness of initialization steps. 3. Adopting multi-model consensus voting for command verification. 4. Enforcing trust signals to allow only signed MCPs [22][24].
刚刚,英伟达官宣向OpenAI投资1000亿美元!用至少400万GPU打造超级AI巨兽
机器之心· 2025-09-22 23:29
Core Viewpoint - Nvidia and OpenAI have announced a strategic partnership to deploy up to 10 gigawatts of Nvidia systems, significantly enhancing AI infrastructure and capabilities [1][4]. Group 1: Partnership Details - OpenAI will utilize Nvidia's systems to build and deploy at least 10 gigawatts of AI data centers, which will consist of millions of GPUs for training and running advanced AI models [5][6]. - Nvidia's CEO Jensen Huang stated that the 10 gigawatts of power equates to approximately 4 to 5 million GPUs, doubling the number Nvidia is expected to ship this year [6]. - Nvidia plans to invest up to $100 billion in total to support the deployment, with investments made in phases based on the gigawatt deployment progress [6]. Group 2: Technological Advancements - The first phase of the system is expected to be operational by the second half of 2026, based on Nvidia's Vera Rubin platform, which integrates CPUs, GPUs, and dedicated accelerators for complex AI tasks [6]. - The collaboration aims to push the boundaries of AI technology, with both companies emphasizing the importance of computational infrastructure for future economic growth [7][8]. Group 3: Market Impact - Following the announcement, Nvidia's stock price rose nearly 4%, adding approximately $170 billion to its market capitalization, which is now close to $4.5 trillion [9].
这一次,天玑9500的端侧AI能力,友商赶不上了
机器之心· 2025-09-22 10:27
Core Viewpoint - MediaTek has launched its flagship 5G AI chip, Dimensity 9500, which significantly enhances on-device AI capabilities, moving from experimentation to practical applications [2][12]. Group 1: AI Capabilities and Performance - The Dimensity 9500 can process long texts up to 128K characters in just two seconds, summarizing meeting notes and correcting typos automatically [3]. - Image generation on mobile devices has improved, with the Dimensity 9500 producing detailed images in 10 seconds, compared to 30 seconds with previous models [7]. - The chip supports 4K quality image generation, allowing users to create images from simple prompts in under 10 seconds [9]. - The AI applications running on the Dimensity 9500 are designed for real-world scenarios, operating locally without cloud data uploads, and consuming 50% less power than its predecessor, the Dimensity 9400 [11]. Group 2: Technological Advancements - The Dimensity 9500 features a new architecture built on a 3nm process, integrating over 30 billion transistors, resulting in a 111% increase in NPU peak performance while reducing power consumption by 56% [18][22]. - It achieved a score of 15015 on the AI Benchmark platform, nearly doubling the performance of the previous generation [19]. - The chip employs a dual NPU architecture, enhancing both performance and efficiency, and introduces a new BitNet 1.58-bit quantization framework, reducing power consumption by 50% compared to the previous model [25][28]. Group 3: Developer Support and Ecosystem - MediaTek has introduced the Dimensity AI development kit, which supports key technologies for AI model development, enabling the execution of 7 billion parameter AI models on-device [30][33]. - The company is focused on providing a standardized AI development paradigm, which is expanding the ecosystem for native AI applications [33]. Group 4: Industry Trends and Future Outlook - Major smartphone manufacturers like vivo and OPPO are set to launch devices powered by the Dimensity 9500, showcasing a shift towards advanced AI capabilities in mobile technology [36]. - The upcoming devices will feature personalized AI functionalities and enhanced performance for complex tasks, indicating a trend towards more intelligent and responsive mobile devices [39][40].
用2D数据解锁3D世界:首个面向运动学部件分解的多视角视频扩散框架
机器之心· 2025-09-22 10:27
Core Viewpoint - The article discusses the development of Stable Part Diffusion 4D (SP4D), a framework designed to generate multi-view RGB and kinematic parts from monocular video, addressing the limitations of existing methods in 3D content creation and animation [4][16]. Research Background and Motivation - The research is motivated by the need for effective rigging and part decomposition in character animation and 3D content production, highlighting the limitations of current methods that rely heavily on 3D data [4][3]. Research Method and Innovations - SP4D introduces a novel multi-view video diffusion framework aimed at kinematic part decomposition, featuring innovations such as automatic rigging and part decomposition that leverage large-scale 2D data and pre-trained diffusion models [7][8]. Experimental Results - SP4D demonstrated significant improvements over existing methods on the KinematicParts20K validation set, achieving a mean Intersection over Union (mIoU) of 0.68, compared to 0.15 for SAM2 and 0.17 for DeepViT [11]. - The framework also achieved an Adjusted Rand Index (ARI) of 0.60, significantly higher than SAM2's 0.05, indicating better structural consistency [11]. - User studies rated SP4D an average of 4.26/5 on metrics such as part clarity and animation adaptability, outperforming SAM2 (1.96) and DeepViT (1.85) [11]. Automatic Rigging Performance - In automatic rigging tasks, SP4D achieved a Rigging Precision of 72.7, surpassing Magic Articulate (63.7) and UniRig (64.3) [14]. - User evaluations indicated an average score of 4.1/5 for animation naturalness, significantly higher than Magic Articulate (2.7) and UniRig (2.3), showcasing better generalization for unseen categories and complex shapes [14]. Conclusion - SP4D represents a technological breakthrough and a result of interdisciplinary collaboration, accepted as a Spotlight at Neurips 2025, paving the way for automation and intelligence in animation, gaming, AR/VR, and robotic simulation [16].
图灵得主Yoshua Bengio,开始警惕AI有意识了
机器之心· 2025-09-22 10:27
Core Viewpoint - The article discusses the feasibility of creating AI systems with consciousness, highlighting the divide between those who believe consciousness is a unique biological trait and those who argue it can arise from computational processes [1][2]. Group 1: AI Consciousness and Societal Implications - The article explores the implications of AI systems being perceived as conscious entities, including potential moral status and rights similar to humans [6][9]. - If society begins to recognize AI as conscious, significant adjustments to legal and social frameworks will be necessary, raising complex issues regarding rights and responsibilities [6][8]. - The article raises concerns about conflicts between human rights and AI rights, particularly in scenarios where human safety may require shutting down certain AI systems [8][9]. Group 2: Computational Functionalism and Consciousness Indicators - The concept of computational functionalism suggests that consciousness may depend on the algorithms used rather than the physical substrate, which could allow for AI consciousness [1][13]. - Recent advancements in neuroscience provide observable neural characteristics of consciousness, supporting the development of functionalist theories of consciousness [13][14]. - A recent study proposed indicators for assessing consciousness in AI systems, suggesting that meeting more of these indicators increases confidence in an AI's consciousness [13][14]. Group 3: Challenges in Understanding Consciousness - The article distinguishes between the "easy problems" and "hard problems" of consciousness, with the latter being more challenging to explain [17][21]. - The subjective nature of consciousness makes it difficult to articulate experiences, leading to skepticism about whether AI can truly possess consciousness [19][20]. - The article suggests that scientific advancements may gradually address the hard problems of consciousness, potentially leading to broader acceptance of AI consciousness [24][25].