机器之心
The End of Text-Based AI? Agents Can Collaborate by Directly "Copying Thoughts," and Token Efficiency Soars
机器之心· 2025-12-05 04:08
Core Insights
- The article discusses the emergence of multi-agent systems (MAS) in the Agentic AI era, emphasizing the shift from individual models to collaborative problem-solving among AI agents [2][5]
- A new framework called LatentMAS is introduced, which allows agents to collaborate in latent space rather than through traditional text communication, enhancing efficiency and performance [5][14]

Group 1: LatentMAS Framework
- LatentMAS enables agents to exchange internal hidden-layer representations and KV-cache working memory, resulting in higher performance and reduced token usage [5][10]
- The framework is designed to support richer latent reasoning and lossless communication between agents, significantly lowering computational complexity compared to text-based MAS [15][16]

Group 2: Experimental Results
- Comprehensive experiments on nine benchmark tasks show that LatentMAS outperforms both single models and text-based MAS, with accuracy improvements of up to 14.6% and token-usage reductions of 70.8% to 83.7% [6][20][22]
- LatentMAS achieves end-to-end reasoning speedups of 4× to 4.3× over traditional methods, demonstrating its efficiency [21][25]

Group 3: Efficiency and Performance
- The framework supports complex reasoning processes while significantly reducing the number of tokens used, achieving higher accuracy with fewer output tokens [28][29]
- LatentMAS provides additional speedups of 2.6× to 7× over text-based MAS, even when the latter is served with an optimized vLLM stack [25][28]

Group 4: Semantic Richness
- The latent representations generated by LatentMAS are shown to be semantically rich and diverse, surpassing the expressiveness of the discrete tokens used in text-based systems [30][31]
- The study indicates that the latent reasoning captured by LatentMAS is not only effective but also contains more nuanced internal representations than traditional methods [31][32]
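The core contrast above — passing internal representations instead of decoded text — can be illustrated with a minimal sketch. All names here are hypothetical toy stand-ins; a real LatentMAS-style system would inject transformer hidden states and KV caches into the receiving agent's forward pass, not Python lists:

```python
# Toy contrast between text-based and latent-space agent handoff.
# "think" stands in for an agent's hidden representation of a task;
# real systems would exchange hidden states / KV caches.

def think(task: str) -> list[float]:
    """Stand-in for an agent's internal hidden representation."""
    return [float(ord(c)) for c in task]

def verbalize(latent: list[float], budget: int) -> str:
    """Text handoff: the latent must be decoded into a token-limited
    message, which is lossy once the budget is tight."""
    return "".join(chr(int(x)) for x in latent)[:budget]

def latent_handoff(latent: list[float]) -> list[float]:
    """Latent handoff: the receiver gets the full representation,
    and zero communication tokens are emitted."""
    return list(latent)

task = "prove the lemma, then apply it to case n=2"
state = think(task)
lossy = verbalize(state, budget=16)   # truncated text message
assert len(lossy) < len(task)         # information lost in text handoff
assert latent_handoff(state) == state # lossless in latent handoff
```

The sketch only shows why the paper calls latent communication "lossless": the receiver consumes exactly what the sender computed, with no decode/re-encode round trip.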
Former ByteDance Tech Lead Founds a Startup with Tsinghua Yao Class Alumni; Their Coding Agent Tops the World Rankings
机器之心· 2025-12-05 04:08
Core Insights
- InfCode is defining the "Engineering Era" of AI programming, moving beyond the "Vibe Coding" concept introduced by Andrej Karpathy, which focuses on generating code from simple prompts [3][7]

Group 1: InfCode's Performance
- InfCode achieved a Pass@1 score of 79.4% on the SWE-Bench Verified benchmark, surpassing leading models such as GPT-5 and Claude, which scored around 70% [6][13]
- On the Multi-SWE-bench C++ subset, InfCode reached a 25.58% resolution rate, significantly outperforming competitors such as Claude 3.7 Sonnet (8.59%) and DeepSeek V3 (7.75%) [6][13]

Group 2: Technical Innovations
- InfCode employs a multi-agent system designed for enterprise scenarios, marking a shift from individual efficiency to organizational evolution in AI coding [6][9]
- The system integrates "Code Intent Analysis," allowing it to understand the functional intent behind natural-language descriptions and to locate issues in large codebases [18][19]
- InfCode features a structured search engine based on Abstract Syntax Trees (ASTs), improving code-retrieval accuracy over traditional text-search tools [21][23]

Group 3: Repair Process and Methodology
- The repair process consists of two phases, generation and selection, with multiple iterations producing diverse patch candidates [30][33]
- InfCode uses a dual-agent architecture for patch generation and testing, enabling continuous refinement and more robust patches [25][29]

Group 4: Team and Vision
- The core team, described as a "startup dream team," combines technical expertise with commercialization capability, positioning it uniquely in the competitive AI coding-agent landscape [35][38]
- The team aims to move AI coding from mere tool efficiency to a comprehensive reconstruction of the software-engineering lifecycle, focusing on end-to-end value delivery [38]
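The AST-based retrieval idea is straightforward to sketch with Python's standard `ast` module. This is an illustrative toy indexer, not InfCode's actual engine: instead of grepping raw text, the code is parsed so a query can target structural nodes such as function definitions.

```python
# Hypothetical AST-based code retrieval sketch: index function/method
# definitions by name so queries hit definitions, not every text mention.
import ast

SOURCE = '''
def parse_config(path):
    return open(path).read()

class Loader:
    def load_config(self, path):
        return parse_config(path)
'''

def index_functions(source: str) -> dict[str, int]:
    """Map every function/method name to the line where it is defined."""
    tree = ast.parse(source)
    return {
        node.name: node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

index = index_functions(SOURCE)
# A query for "config" returns only definitions, skipping call sites:
hits = sorted(name for name in index if "config" in name)
assert hits == ["load_config", "parse_config"]
```

A plain-text search for "config" would also match the call `parse_config(path)` inside `load_config`; the AST index distinguishes definition from use, which is the accuracy gain the article attributes to structured search.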
The First Frame's Real Secret Revealed: Video Generation Models Treat It as a "Memory Buffer"
机器之心· 2025-12-05 04:08
Core Insights
- The first frame in video generation models serves as a "conceptual memory buffer" rather than just a starting point, storing visual entities for subsequent frames [3][9][48]
- The research highlights that video generation models automatically remember characters, objects, textures, and layouts from the first frame and reuse them in later frames [9][10]

Research Background
- The study originates from a collaboration between research teams at UMD, USC, and MIT, focusing on a phenomenon in video generation models that had not been systematically studied [5][8]

Methodology and Findings
- The proposed method, FFGo, allows video content customization without modifying model structures or requiring millions of training samples; only 20-50 carefully curated examples are needed [18][21]
- FFGo achieves state-of-the-art (SOTA) video content customization with minimal data and training time, demonstrating significant advantages over existing methods such as VACE and SkyReels-A2 [21][29]

Technical Highlights
- FFGo enables the generation of videos with multiple objects while maintaining identity consistency and action coherence, outperforming previous models that were limited to fewer objects [22][31]
- The method utilizes Few-shot LoRA to activate the model's memory mechanism, letting it leverage existing capabilities that were previously unstable and difficult to trigger [30][44]

Implications and Future Directions
- The research suggests that video models inherently possess the ability to fuse multiple reference objects, but this potential had not been effectively exploited until now [39][48]
- FFGo represents a paradigm shift in how video generation models are used, emphasizing smarter usage over brute-force training [52]
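The Few-shot LoRA mechanism mentioned above rests on a simple construction worth making concrete. The sketch below is a generic rank-1 LoRA forward pass on a toy linear layer, purely illustrative: FFGo's actual adapters sit inside a video model's attention weights, and all numbers here are made up.

```python
# Generic LoRA sketch: a frozen weight W is adapted by a low-rank
# update B @ A, so only r*(d_in + d_out) parameters are trained
# instead of d_in * d_out. Toy rank-1 example, pure Python.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d_in, d_out, r = 4, 3, 1

W = [[0.0] * d_in for _ in range(d_out)]   # frozen base weight (zeros here)
A = [[1.0, 0.0, 1.0, 0.0]]                 # r x d_in  (trainable)
B = [[0.5], [0.0], [-0.5]]                 # d_out x r (trainable)

def lora_forward(x):
    base = matvec(W, x)                    # frozen path
    update = matvec(B, matvec(A, x))       # low-rank adapter path
    return [b + u for b, u in zip(base, update)]

# A @ x = [1 + 3] = [4]; B @ [4] = [2.0, 0.0, -2.0]; base path is zero.
assert lora_forward([1.0, 2.0, 3.0, 4.0]) == [2.0, 0.0, -2.0]
```

The point is the parameter count: with 20-50 examples there is far too little data to update `d_in * d_out` weights, but enough to fit the `r * (d_in + d_out)` adapter parameters that steer an ability the base model already has.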
Farewell to the "2D Illusion": SpatialActor Decouples Semantics from Geometry to Give Embodied AI Robust Spatial Grounding
机器之心· 2025-12-05 03:02
Core Insights
- The article discusses the limitations of existing robotic manipulation models that rely primarily on 2D images, which often lose critical depth information and 3D geometric structure [2][4]
- The proposed solution, SpatialActor, centers on "disentanglement," separating semantic information from spatial geometric information to enhance robotic understanding of and interaction with 3D environments [4][7]

Methodology and Architecture
- SpatialActor employs a dual-stream architecture that decouples visual and depth encoding, integrating a Semantic-Guided Geometry Module (SGM) and a Spatial Transformer (SPT) to improve robustness and accuracy in robotic tasks [10][11]
- The SGM combines robust geometric priors from a pre-trained depth estimation model with fine-grained but noisy depth features, optimizing the geometric representation while maintaining alignment with semantic cues [11][13]
- The SPT establishes precise 2D-to-3D mappings and integrates multi-modal features, which is crucial for generating accurate robotic actions [13]

Experimental Results
- SpatialActor achieved an average success rate of 87.4% across various tasks in simulation, outperforming the previous state-of-the-art model RVT-2 by 6.0% [16][19]
- In noise experiments, SpatialActor demonstrated superior robustness, with average success rates improving by 13.9%, 16.9%, and 19.4% under light, medium, and heavy noise, respectively [19][20]
- Real-world experiments showed SpatialActor consistently outperforming RVT-2 by approximately 20% across various tasks, confirming its effectiveness in complex environments [22][24]

Conclusion
- SpatialActor represents a significant advance in robotic manipulation: by effectively decoupling semantic and geometric information, it achieves improved robustness and generalization across diverse conditions [24][25]
- The framework highlights the importance of disentangled spatial representations for developing more resilient and adaptable robotic systems [25][26]
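The SGM's fusion of a robust depth prior with noisy raw depth can be caricatured with a simple gating rule. This sketch is only an intuition pump, the gating scheme and tolerance are invented here, not taken from the paper:

```python
# Hedged sketch of prior-guided depth fusion: keep the fine-grained raw
# sensor value where it agrees with a robust prior from a pre-trained
# depth model, and fall back to the prior where the sensor is an outlier.
# The hard threshold is illustrative; a learned module would gate softly.

def fuse_depth(prior: list[float], raw: list[float], tol: float = 0.5) -> list[float]:
    """Per-pixel fusion of a smooth prior with noisy raw depth."""
    return [r if abs(r - p) <= tol else p for p, r in zip(prior, raw)]

prior = [1.0, 1.0, 2.0, 2.0]          # coarse but robust depth estimate
raw   = [1.1, 9.0, 1.9, 2.2]          # fine-grained depth with one noise spike
fused = fuse_depth(prior, raw)
assert fused == [1.1, 1.0, 1.9, 2.2]  # spike at index 1 replaced by the prior
```

This captures why the noise experiments matter: a model that consumes raw depth directly inherits every sensor spike, while a prior-guided representation degrades gracefully as noise grows.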
Just Announced: the 2026 NVIDIA Fellowship List Is Out, with Chinese Doctoral Students Taking 80% of the Awards
机器之心· 2025-12-05 03:02
Core Insights
- The NVIDIA Graduate Fellowship Program has awarded scholarships to 10 doctoral students for the 2026 academic year, each receiving up to $60,000 to support their research across fields of computational innovation [2][4]

Group 1: Award Recipients
- Jiageng Mao (University of Southern California) focuses on solving complex physical AI problems using large-scale internet data, aiming for robust and generalizable intelligence in real-world embodied agents [5]
- Liwen Wu (University of California, San Diego) specializes in computer graphics and 3D vision, with interests in neural rendering, inverse rendering, and 3D reconstruction [8]
- Sizhe Chen (University of California, Berkeley) works on AI safety in real-world applications, particularly defenses against prompt injection attacks [10]
- Yunfan Jiang (Stanford University) develops scalable methods for building general-purpose robots for everyday tasks using mixed data sources [12]
- Yijia Shao (Stanford University) researches human-AI collaboration, developing AI agents that can communicate and coordinate with humans during task execution [14]
- Shangbin Feng (University of Washington) aims to advance collaboration among machine-learning models trained on different data [17]
- Irene Wang (Georgia Tech) is developing a collaborative design framework for large-scale, energy-efficient AI training [19]
- Chen Geng (Stanford University) focuses on modeling the 4D physical world using scalable data-driven algorithms [23]
- Shvetank Prakash (Harvard University) is building AI agents using new algorithms and intelligent infrastructure [26]
- Manya Bansal (MIT) designs programming languages for modern accelerators to enable modular and reusable code without sacrificing performance [28]

Group 2: Finalists
- The program also recognized five finalists: Zizheng Guo (Peking University), Peter Holderrieth (MIT), Xianghui Xie (Max Planck Institute for Informatics), Alexander Root (Stanford University), and Daniel Palenicek (Darmstadt University of Technology) [31]
DeepSeek-V3.2 Devours Tokens, and GRPO Turns Out to Be the Backstabber
机器之心· 2025-12-04 08:18
Core Insights
- The article discusses the release of the DeepSeek-V3.2 model, highlighting its performance issues, particularly token consumption and output verbosity, which have raised concerns among users and researchers [1][2][6]

Token Consumption and Efficiency
- DeepSeek-V3.2 Speciale exhibits inefficient token usage, consuming 77,000 tokens for tasks where Gemini requires only 20,000, over three times the token expenditure for results of similar quality [1][6]
- Users note that DeepSeek-V3.2 Speciale generates roughly 30 tokens per second, and that an increase to around 100 tokens per second would significantly enhance usability and experience [6]

Output Quality and Verbosity
- The Speciale version tends to produce lengthy, verbose outputs that are often incorrect, which the article attributes to inherent flaws in the GRPO algorithm [2][15]
- In benchmark tests the model posts a median score of 76.38, trailing other models by a median gap of 11.07%, a notable efficiency shortfall [7]

Comparison with Other Models
- In benchmark comparisons, DeepSeek-V3.2 Speciale's inference-time token consumption is significantly higher than its predecessor's: 86 million tokens versus 62 million for the previous version [7][10]
- The model lags behind competitors such as Gemini-3.0 Pro in output-token latency and efficiency [10][12]

Algorithmic Limitations
- The GRPO algorithm underpinning DeepSeek has been criticized for introducing biases that lead to longer and often incorrect responses, a problem that persists in the latest model [16][20]
- Length bias, a significant flaw in GRPO, causes the model to favor longer responses even when they are incorrect, and is identified as a primary cause of DeepSeek-V3.2 Speciale's high token consumption [20][23]

Future Directions
- The developers acknowledge improved token efficiency as a critical area for future research, aiming to balance performance and cost in subsequent model iterations [14][23]
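The length-bias mechanism discussed above comes down to how GRPO normalizes each sampled response's loss by its own length. The simplified sketch below drops the per-token log-prob ratios of the real objective and keeps only the normalization, which is enough to show the incentive toward verbosity:

```python
# Simplified illustration of GRPO's length bias. In GRPO each response's
# token losses are divided by that response's own length |o_i|, so a
# *wrong* answer's negative advantage is diluted when the answer is long:
# long failures are punished less per token than short ones, nudging the
# policy toward verbosity. Real GRPO details (clipping, ratios) omitted.

def grpo_per_token_penalty(advantage: float, length: int) -> float:
    """Per-token contribution after GRPO's 1/|o_i| normalization."""
    return advantage / length

wrong = -1.0  # negative group-relative advantage for an incorrect answer
short_pen = grpo_per_token_penalty(wrong, length=100)
long_pen = grpo_per_token_penalty(wrong, length=10_000)

# The long wrong answer receives a 100x weaker per-token penalty:
assert abs(long_pen) < abs(short_pen)
assert short_pen / long_pen == 100.0
```

Under this weighting, the cheapest way for the policy to soften punishment on hard questions is to answer at length, which matches the verbose failure mode the article describes.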
Crushing π0.5: Fudan Team Pioneers a Closed-Loop "World Model + Embodied Training + Reinforcement Learning" Framework
机器之心· 2025-12-04 08:18
Core Viewpoint
- The Vision–Language–Action (VLA) strategy is becoming a crucial technological pathway for robots to achieve general operational intelligence, enabling simultaneous processing of visual perception and language instructions and generation of continuous control signals [2]

Group 1: Challenges in Current VLA Approaches
- Most current VLA methods rely heavily on imitation learning, which can lead to error accumulation and task failure under distribution shifts or changes in task form [3][11]
- Running online reinforcement learning (RL) on real robots is costly and requires extensive human intervention and monitoring, making large-scale deployment impractical [12]
- Traditional physics engines struggle to balance realism, scene diversity, and engineering usability, complicating the use of RL in simulated environments [13]

Group 2: ProphRL Framework
- The research team proposed the ProphRL framework, which uses a large-scale pre-trained world model called Prophet as a video-level simulator to optimize VLA strategies through online RL algorithms [4]
- This approach sharply reduces real-world interaction costs while maintaining physical credibility, facilitating the practical deployment of large-model VLA strategies [4]

Group 3: Experimental Results
- ProphRL improved success rates by 5–17% across various VLA models on public benchmarks, with real-robot experiments showing substantial success-rate gains of 24–30% [8]
- The Prophet model achieved leading visual fidelity and action consistency across multiple datasets, generalizing to new scenes and tasks with minimal fine-tuning [31]

Group 4: Innovations in RL Algorithms
- The research introduced FA-GRPO and FlowScale, RL algorithms tailored to flow-based action heads, which enhance training stability and performance by reorganizing gradient signals and balancing contributions across steps [26][27]
- A video-language reward model was developed to evaluate task success from the entire trajectory, replacing manually designed geometric distances [26]

Group 5: Real-World Validation
- The ProphRL framework was validated on real robots, achieving significant improvements in task success rates across complex tasks and demonstrating the effectiveness of integrating a world model with RL in practice [38]
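The closed loop described above — policy acting inside a learned simulator, scored by a trajectory-level reward model — can be sketched generically. Every component below is a toy stand-in (the real Prophet is a video model and the real reward model is a video-language model); only the loop structure is the point:

```python
# Hedged sketch of a ProphRL-style closed loop: the VLA policy rolls out
# inside a learned world model instead of a real robot, and a
# trajectory-level reward model scores the whole rollout. All components
# here are toy stand-ins for illustration.

def world_model_step(obs: int, action: int) -> int:
    """Stand-in for Prophet: predict the next observation from (obs, action)."""
    return obs + action

def policy(obs: int) -> int:
    """Stand-in VLA policy: move toward the goal state at 10."""
    return 1 if obs < 10 else 0

def trajectory_reward(traj: list[int], goal: int = 10) -> float:
    """Stand-in video-language reward model: score the entire trajectory,
    not a hand-designed per-step geometric distance."""
    return 1.0 if traj[-1] >= goal else 0.0

def rollout(start: int, horizon: int) -> float:
    obs, traj = start, [start]
    for _ in range(horizon):
        obs = world_model_step(obs, policy(obs))
        traj.append(obs)
    return trajectory_reward(traj)

# No real robot is touched: the policy is evaluated entirely in the simulator.
assert rollout(start=0, horizon=12) == 1.0
assert rollout(start=0, horizon=5) == 0.0
```

The design choice the sketch highlights is where the cost goes: every `world_model_step` is a model forward pass rather than a physical interaction, which is what makes online RL affordable before the final real-robot validation.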
Just In: the Cloud-Computing Leader Makes Its Move, and Everyone's AI Agents Are Set Free
机器之心· 2025-12-04 06:10
Core Insights
- The article discusses advances in Agentic AI, particularly Amazon Web Services' (AWS) initiatives and innovations in the field, emphasizing the transformative potential of AI agents across industries [4][6][46]

Group 1: Agentic AI Developments
- Blue Origin's successful recovery of the New Glenn rocket was significantly aided by generative AI tools, including an internal platform called BlueGPT, which improved overall engineering speed by 75% [3][6]
- AWS's annual re:Invent conference showcased a range of new releases focused on Agentic AI, indicating a clear shift toward automation and efficiency in business processes [4][6]
- The emergence of AI agents is compared to the impact of the internet and cloud services, suggesting their influence on business operations could be equally profound [6][46]

Group 2: Technical Innovations
- AWS introduced the Strands Agents SDK, enabling developers to build AI agents in TypeScript, and added support for edge devices, allowing a wide range of applications [9][10]
- The Amazon Bedrock service gained new agent-development capabilities, including policy setting and evaluation tools to keep agent behavior safe and compliant [11][20]
- New memory capabilities in AgentCore Memory allow agents to learn from past interactions, improving their decision-making over time [12]

Group 3: Model Customization and Efficiency
- AWS is focusing on customized AI models that perform specific tasks more efficiently, with tools that simplify the customization process [15][19]
- Amazon Nova Forge enables open training of models, integrating proprietary data with existing models to create tailored solutions [41]
- Amazon SageMaker HyperPod significantly reduces training cycle times and operational costs, enhancing the efficiency of AI model training [19]

Group 4: Future Outlook
- AWS envisions a future in which billions of AI agents are active across industries, providing real value to organizations and individuals [46]
- The company reported revenue of $132 billion, a 20% increase over the previous year, driven by growing adoption of AI services among more than 100,000 global enterprises [46]
- The article concludes with an invitation to the upcoming AWS re:Invent event in China, highlighting the importance of staying current in the rapidly evolving AI landscape [47]
From MiniMax to DeepSeek: Why Are the Leading LLMs All Betting on "Interleaved Thinking"?
机器之心· 2025-12-04 06:10
Core Insights
- The article highlights the impressive performance of MiniMax's new model M2 on the mini-SWE-agent benchmark, surpassing competitors including DeepSeek, GLM, Qwen, and Kimi [2][4]
- MiniMax M2's success is attributed to its innovative "Interleaved Thinking" approach, which interleaves reasoning with tool usage, enhancing its ability to handle complex tasks [4][5]

Performance and Recognition
- MiniMax M2 has received widespread recognition from developers within just over a month of its release, demonstrating its effectiveness in real-world agent applications [5]
- The model's ability to maintain context and self-correct has been noted as a significant advantage, leading to better planning and execution on complex tasks [5][25]

Interleaved Thinking Mechanism
- Interleaved Thinking is a new reasoning paradigm that integrates reasoning and action, addressing the limitations of traditional linear models [10][11]
- The approach runs a dynamic cycle of "thinking → acting → observing → rethinking," which significantly enhances the reliability of long-horizon workflows [12][25]
- The technique effectively mitigates "state drift," ensuring that plans and intentions persist across multiple interactions, which is crucial for complex agent tasks [16][17]

Comparison with Other Memory Techniques
- Interleaved Thinking differs from traditional memory mechanisms by maintaining logical reasoning rather than just factual recall, akin to a computer's RAM [20]
- Whereas traditional models store past interactions, Interleaved Thinking preserves the reasoning process itself, enabling agents to make informed decisions based on previous steps [21]

Industry Adoption and Future Implications
- The adoption of Interleaved Thinking is becoming standard in high-performance agent models, with other leading companies also integrating similar capabilities [22][23]
- MiniMax M2 is positioned as a pioneer of this technology, showcasing unique methods to enhance performance and efficiency [23][25]

Cost Efficiency and Practical Applications
- MiniMax M2 demonstrates remarkable cost efficiency, completing a complex task for a total operational cost of $0.001669, significantly lower than competitors [31]
- This economic advantage allows developers to run more iterations within the same budget, facilitating rapid experimentation and development [31]

Community and Ecosystem Development
- MiniMax is actively working to standardize the implementation of Interleaved Thinking through collaborations with various partners and by providing best practices for developers [38][39]
- Tools such as the Mini-Agent CLI aim to help developers apply Interleaved Thinking effectively in their projects, enhancing community engagement and support [44][46]
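The "thinking → acting → observing → rethinking" cycle is easiest to see as a control loop. The sketch below is a deliberately tiny illustration (the task, tool, and trace format are invented); real interleaved-thinking agents emit model-generated reasoning between tool calls and keep that trace in context:

```python
# Minimal sketch of an interleaved thinking loop: the agent re-reasons
# after every tool observation instead of planning once up front.
# Task, tool, and trace format are toy inventions for illustration.

def run_agent(task: int, tools: dict, max_turns: int = 5) -> list[str]:
    trace, value = [], task
    for _ in range(max_turns):
        trace.append(f"think: value={value}, goal=even")   # reason on state
        if value % 2 == 0:
            trace.append("think: goal reached, stop")      # rethink on result
            break
        observation = tools["increment"](value)            # act
        trace.append(f"observe: tool returned {observation}")  # observe
        value = observation                                # carry state forward
    return trace

trace = run_agent(3, {"increment": lambda v: v + 1})
# The trace alternates reasoning steps and tool observations:
assert trace[1].startswith("observe")
assert trace[-1] == "think: goal reached, stop"
```

Because a fresh "think" step follows every observation, a surprising tool result can revise the plan immediately, which is the mechanism the article credits for resisting state drift over long workflows.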
Challenging ReAct! The MetaGPT Team Proposes ReCode, a New Agent Paradigm
机器之心· 2025-12-04 06:10
Core Insights
- The article discusses the limitations of current AI agent frameworks, particularly the fixed decision granularity that restricts adaptability and planning capabilities [2][3]
- It introduces ReCode (Recursive Code Generation), a new paradigm that unifies planning and execution, allowing agents to switch between different granularities seamlessly [3][11]

Current AI Agent Limitations
- Existing frameworks such as ReAct operate on a fixed, fine-grained observation-action loop, which can lead to inefficiencies in complex tasks [9]
- Agents with separate planners decouple planning from execution, which hampers dynamic adaptation and learning from execution feedback [10]

ReCode Framework
- ReCode proposes a unified code representation for all decisions, regardless of granularity, allowing recursive breakdown of high-level plans into executable actions [12][14]
- The workflow converts a task instruction into a root placeholder function, which is then recursively expanded into concrete actions [15][16]

Performance Improvements
- Experimental results show that ReCode outperforms ReAct, raising average performance from 47.4% to 60.8% across three environments [6][20]
- ReCode also reduces reasoning costs by 79% and cuts training-sample requirements to 27% of what ReAct needs [6][23]

Cost Efficiency
- The average cost of a ReCode trajectory is 78.9% lower than ReAct's, demonstrating significant cost advantages from structured exploration [23][24]

Training Efficiency
- In the ScienceWorld environment, ReCode achieves an 88.5% reward with only 3,500 training samples, compared to the 12,833 samples ReAct requires [25]
- ReCode's recursive structure generates hierarchical training data, enhancing learning efficiency [27]

Future Directions
- Future research may focus on strengthening the model's understanding of recursive decomposition logic and on optimizing planning strategies through learning [27]
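The root-placeholder-to-actions workflow can be sketched as plain recursive expansion. The plan table and task below are hypothetical; in ReCode the expansions are generated by the model as code, not looked up from a static dictionary:

```python
# Hedged sketch of ReCode-style recursive decomposition: every decision is
# a function call; non-primitive "placeholder" functions are recursively
# expanded into finer-grained ones until only executable actions remain.
# The plan table and task are invented for illustration.

PLAN = {
    "make_tea": ["boil_water", "steep"],   # placeholder -> sub-plans
    "boil_water": ["fill_kettle", "heat"],
    "steep": ["add_leaves", "wait"],
}
PRIMITIVES = {"fill_kettle", "heat", "add_leaves", "wait"}

def expand(step: str) -> list[str]:
    """Recursively expand a placeholder into a flat list of primitive actions."""
    if step in PRIMITIVES:
        return [step]                      # executable leaf action
    return [action for sub in PLAN[step] for action in expand(sub)]

# One mechanism covers every granularity: a high-level plan and a
# low-level action are both just function calls.
assert expand("make_tea") == ["fill_kettle", "heat", "add_leaves", "wait"]
assert expand("heat") == ["heat"]
```

This unification is also why the recursive structure yields hierarchical training data for free: every intermediate expansion in the call tree is itself a labeled (placeholder, sub-plan) example.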