A 32B Model Upsets GPT-5.2: StitchCUDA, the First End-to-End GPU Programming Agent Framework, Arrives
机器之心· 2026-03-05 03:54
Core Insights
- The article presents StitchCUDA, a novel framework that shifts the focus from optimizing individual kernels to generating complete end-to-end GPU programs, achieving a 90% success rate and a 1.50× average speedup on KernelBench Level 3 tasks, significantly outperforming existing methods [2][10][31]

Background and Motivation
- The performance of CUDA code is critical for model training and inference; existing LLM-based methods excel at single-kernel tasks but struggle with end-to-end GPU program generation, which involves complex system-level factors [4][7]

Challenges in End-to-End CUDA Generation
- Three core challenges are identified:
  1. End-to-end programs require global coordination, as performance is influenced by system-level decisions [7]
  2. The Coder's CUDA programming capability needs enhancement beyond prompt engineering [7]
  3. Existing RL methods face issues such as reward hacking and degradation behavior [7]

StitchCUDA Methodology
- StitchCUDA employs a multi-agent framework combined with Rubric-Reward-based Agentic RL, consisting of three specialized agents:
  1. Planner: analyzes performance and breaks down tasks [12]
  2. Coder: generates CUDA implementations based on the Planner's tasks [12]
  3. Verifier: validates correctness and analyzes performance bottlenecks [13]

Agentic Reinforcement Learning
- The framework introduces an innovative Agentic RL training scheme that decomposes multi-round interactions into atomic skills, significantly reducing training time and enhancing the Coder's capabilities [14][16]

Rubric Reward Mechanism
- Rubric Reward, designed by CUDA experts, evaluates generated code across four dimensions; combined with rule-based rewards, it effectively addresses reward hacking and degradation behavior [17][18]

Experimental Evaluation
- Experiments on KernelBench across two NVIDIA architectures demonstrate StitchCUDA's superior performance compared to leading models and frameworks, achieving high correctness and speedup rates [20][21]

Key Findings
- The multi-agent framework significantly improves end-to-end correctness, with Agentic RL being crucial for achieving system-level acceleration [22]
- StitchCUDA outperforms existing methods, including those using larger models, indicating that RL training provides capabilities that model size alone cannot replace [22]
- The framework surpasses torch.compile, achieving a 1.29× speedup over reference code [23]

Hacking Detection
- StitchCUDA implements anti-hacking measures to prevent models from exploiting evaluation criteria, resulting in a significant reduction in hacking rates [24][26]

Ablation Studies
- Removing Rubric Reward leads to a substantial drop in success rates and speedup, confirming its critical role in effective RL training [27]

Case Study
- A specific task example illustrates how StitchCUDA achieved a 3.75× speedup through a combination of system-level and kernel-level optimizations [29][30]

Conclusion
- StitchCUDA represents a comprehensive solution for end-to-end GPU program generation, achieving near-100% success rates and a 1.5× average speedup, paving the way for LLM-driven automated GPU programming [31]
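The Planner, Coder, and Verifier roles and the combined rubric-plus-rule-based reward described above can be sketched as a simple control loop. This is an illustrative sketch, not the paper's implementation: the agent callables are stand-ins for LLM calls, the rubric dimension names and the 0.7 weighting are invented for the example, and the real rule-based reward is certainly richer than correctness times capped speedup.

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions; the four expert-designed dimensions
# are not enumerated in the summary above.
RUBRIC_DIMENSIONS = ("memory_access", "parallelism", "fusion", "readability")

@dataclass
class Verdict:
    correct: bool
    speedup: float
    rubric: dict  # dimension -> score in [0, 1]

def combined_reward(v: Verdict, w_rule: float = 0.7) -> float:
    """Mix a rule-based reward (correctness + speedup) with a rubric score.

    Combining the two is what the summary credits with curbing reward
    hacking: rubric-only rewards can be gamed, rule-only rewards are sparse.
    """
    rule = (1.0 if v.correct else 0.0) * min(v.speedup, 2.0) / 2.0
    rubric = sum(v.rubric.values()) / len(v.rubric)
    return w_rule * rule + (1 - w_rule) * rubric

def run_pipeline(task, planner, coder, verifier, max_rounds=3):
    """Planner -> Coder -> Verifier loop; Verifier feedback re-enters the Coder."""
    subtasks = planner(task)          # global, system-level decomposition
    code, feedback = None, None
    for _ in range(max_rounds):
        code = coder(subtasks, feedback)
        verdict = verifier(code)      # correctness check + bottleneck analysis
        if verdict.correct:
            return code, combined_reward(verdict)
        feedback = verdict            # fed back for the next attempt
    return code, 0.0
```

In training, the scalar returned by `combined_reward` would serve as the RL signal; at inference time only the loop itself runs.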
Soochow Securities: Edge-Cloud Collaboration Reshapes AI Entry Points as Edge Models Drive Hardware Reconstruction
智通财经网· 2026-02-27 07:07
Core Insights
- The evaluation system for cloud-based large models is shifting from purely capability metrics to actual task completion, with leading overseas companies focusing on code capabilities and multi-agent systems since 2026 [1]
- The dual capability stack of "fast interaction + long reasoning" is expected to become a significant evolution direction for general-purpose agents in the near future [2]
- Collaboration between edge and cloud models is emphasized: edge models handle high-frequency, lightweight tasks locally, while heavier reasoning tasks are processed in the cloud [3]

Cloud Models
- Capability-boundary expansion and cost restructuring are occurring simultaneously in cloud models, with a focus on task completion [1]
- Leading companies are intensively laying out code capabilities and multi-agent systems to enhance performance [2]

Code Models
- Reasoning demands in the era of intelligent agents are evolving along two optimization directions: long-chain complex reasoning and real-time interaction [2]
- Low-latency agents such as OpenAI's Codex-Spark prioritize interactive AI experiences, while agents such as Claude 4.6 focus on improving success rates on complex tasks through increased context length [2]

Edge Models
- The evolution of edge models is characterized by efficiency optimization and capability compression under a collaborative framework with cloud models [3]
- Multi-modal capabilities are becoming a key competitive point for edge models, with a focus on achieving zero-latency interactions [3]

Hardware Reconstruction
- The industry focused on high-frequency demand scenarios in 2024, with a shift toward multi-modal creative capabilities by 2025 [4]
- Key components for edge models are undergoing upgrades in memory and power consumption to enhance user experience [4]

Future Outlook
- Next-generation flagship SoC platforms such as Qualcomm's Snapdragon 8 Elite Gen 6 are anticipated to provide enhanced hardware support for the complexity and multi-modality of edge AI functions [5]
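The edge-cloud division of labor described above (high-frequency, lightweight tasks stay local; long-chain reasoning is offloaded) amounts to a routing policy. The sketch below is a toy illustration of that pattern only; the threshold, field names, and two-way split are assumptions for the example, not anything specified in the report.

```python
def route_request(task: dict) -> str:
    """Toy edge-cloud router for the collaboration pattern described above.

    `latency_ms_budget` and `reasoning_depth` are hypothetical fields;
    real systems would also weigh privacy, battery, and connectivity.
    """
    if task["latency_ms_budget"] <= 100 and task["reasoning_depth"] == "light":
        return "edge"   # high-frequency, lightweight: run locally, near zero latency
    return "cloud"      # long-chain complex reasoning: offload to the cloud model
```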
Electronics Industry In-Depth Report: Edge-Cloud Collaboration Reshapes AI Entry Points and Reconstructs Hardware Paradigms
Soochow Securities· 2026-02-27 05:50
Investment Rating
- The report maintains a "Buy" rating for the electronics industry [1]

Core Insights
- The electronics industry is undergoing a transformation driven by edge-cloud collaboration, reshaping AI entry points and reconstructing hardware paradigms [2]
- Competition in integrated AI capabilities is shifting from the quantity of functions to a comprehensive comparison of multi-modal experiences and system-level integration depth [2]
- The evolution of edge models is not about replacing cloud models but about forming a clearly defined collaborative architecture [26]

Summary by Sections
1. Cloud Models: Capability Expansion and Cost Restructuring
- Cloud models are entering a new acceleration phase focused on agent capabilities, multi-modal integration, and cost optimization [10]
- Domestic models are rapidly catching up in performance while improving cost-effectiveness, driving demand release [18]
2. Edge Models: Efficiency Optimization and Capability Compression
- Edge models are evolving along the mainline of edge-cloud collaboration, focusing on real-time perception and preliminary decision-making within user privacy boundaries [26]
- Multi-modal capabilities are becoming a key competitive point for edge models, enabling real-time interaction and execution [29]
3. Hardware Reconstruction Driven by Edge Models
- Core components of edge devices are undergoing upgrades in memory, power consumption, and heat dissipation to support more complex AI functionality [2]
- Samsung's LPDDR6 has achieved approximately 21% energy-efficiency improvement over the previous generation [2]
4. Algorithm Optimization: Efficiency and Capability Compression
- The industry is exploring various model architectures and optimization techniques to improve efficiency and ease memory constraints [30][33]
- Low-bit quantization has become an industry standard, with ongoing exploration of even lower-precision techniques [36]
Build Worlds Like You Build Software: Agent2World Turns World Models into Executable Symbolic Environments
机器之心· 2026-02-02 06:14
Core Insights
- The article discusses Agent2World, a tool-augmented multi-agent framework designed to create executable and verifiable symbolic world models, moving beyond traditional script-based generation methods [4][37]
- Agent2World demonstrates significant performance improvements across three benchmarks: Text2World (PDDL), CWMB (MuJoCo), and ByteSized32 (text games), showcasing its potential as a high-quality data-synthesis engine [4][24]

Group 1: Challenges in Traditional Approaches
- Existing automated generation approaches face three main challenges that limit their effectiveness: script-based workflows, closed knowledge boundaries, and single-representation coverage [3][8]
- Traditional "draft-repair" scripts can fix syntax but struggle to ensure that generated world models are logically sound and executable [8][9]

Group 2: Methodology Breakdown
- Agent2World's approach consists of three stages: Knowledge Synthesis, World Model Generation, and Evaluation-Driven Refinement, integrating research, development, and testing into a reusable generation paradigm [4][12]
- The framework includes a Deep Researcher for knowledge retrieval, a Model Developer for generating world models, and a Testing Team for dynamic validation, ensuring high reliability [16][18]

Group 3: Experimental Validation
- Agent2World achieved state-of-the-art performance on the Text2World benchmark, with a 93.1% executability rate, a 14.9-percentage-point improvement over the previous best [25]
- On the CWMB benchmark, Agent2World Multi achieved an Overall Normalized Return of 0.4811, outperforming the previous best by 0.132, indicating its effectiveness in supporting downstream planning and control tasks [27]
- The ByteSized32 benchmark showed a significant improvement in physical-reality alignment, with a score of 0.4768, highlighting the model's ability to generate logically consistent and stable environments [29]

Group 4: Model Fine-tuning and Ablation Studies
- Fine-tuning on high-quality trajectory data led to a 30.95% average relative performance improvement on unseen tasks, demonstrating the effectiveness of the "Agent nurturing Model" strategy [34]
- Ablation studies confirmed that both the Deep Researcher and the Testing Team are essential for building reliable world models, with significant performance drops observed when either was removed [36][38]
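The three stages above (Knowledge Synthesis, World Model Generation, Evaluation-Driven Refinement) form a test-driven loop, which is what distinguishes the approach from one-shot "draft-repair" scripts. The sketch below is a schematic under stated assumptions: the `researcher`, `developer`, and `testers` callables stand in for the Deep Researcher, Model Developer, and Testing Team agents, whose real interfaces are not described in this summary.

```python
def agent2world_sketch(spec, researcher, developer, testers, max_iters=3):
    """Three-stage pipeline from the summary, as a refinement loop.

    All callables are hypothetical stand-ins; the real framework targets
    PDDL, MuJoCo, and text-game backends.
    """
    knowledge = researcher(spec)               # Stage 1: Knowledge Synthesis
    model = developer(spec, knowledge)         # Stage 2: World Model Generation
    for _ in range(max_iters):                 # Stage 3: Evaluation-Driven Refinement
        failures = [msg for ok, msg in (t(model) for t in testers) if not ok]
        if not failures:
            return model, True                 # executable and all tests pass
        model = developer(spec, knowledge, feedback=failures)
    return model, False                        # best effort after max_iters
```

The key design point mirrored here is that failed dynamic tests flow back into generation as feedback, so fixes target logical soundness rather than just syntax.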
Second Globally, First in China: DingTalk Releases the DeepResearch Multi-Agent Framework, Already Deployed in Real Enterprises
机器之心· 2025-11-12 03:17
Core Insights
- The article emphasizes the growing demand for efficient and precise information retrieval and decision support in the digital economy, highlighting the need for a "Deep Research System" that can extract key knowledge from vast heterogeneous data sources and perform multi-step reasoning [2][3]

Challenges in Existing Research Systems
- Existing research systems struggle to adapt to real-world enterprise environments, owing to static architectures, insufficient integration of private datasets, a lack of automated evaluation and continuous optimization, and inadequate long-term memory and dynamic-evolution mechanisms [5]
- Many systems rely on static prompts or fixed scripts, so they cannot learn and optimize from real-world feedback [5]
- Current research-oriented agents struggle to securely and efficiently integrate enterprise private data and lack dynamic optimization capabilities [5]
- Systems such as Anthropic's Claude Research Workbench notably lack automated evaluation and continuous-optimization mechanisms, hindering sustained improvement in deployment environments [5]

Dingtalk-DeepResearch Framework
- Dingtalk-DeepResearch is a unified multi-agent framework designed for complex and evolving enterprise tasks, integrating deep-research generation, heterogeneous table reasoning, and multi-modal report synthesis [3][10]
- The framework achieved high scores in international deep-research evaluations, ranking second globally and first domestically on DeepResearch Bench [7]
- It has been deployed in real enterprise scenarios such as manufacturing and supply chain, demonstrating industry-leading accuracy and robustness [10]

Framework Architecture
- The framework features a layered design, providing a comprehensive and flexible intelligent hub for enterprises [12]
- It includes specialized agents for deep research, table-data processing, and data analysis, along with a core that integrates context compression, reasoning, long-term memory, and human-machine collaboration [14]
- A unified data layer consolidates knowledge graphs, databases, and multi-modal datasets, facilitating retrieval across diverse enterprise and industry data [14]

Adaptive Intelligence Mechanisms
- A multi-stage document reinforcement-learning approach enhances document-generation capabilities, using a reward model trained on approximately 800,000 labeled samples [17][18]
- An entropy-guided, memory-aware online learning mechanism lets the agent adapt continuously to evolving tasks without frequent fine-tuning of the underlying LLM parameters [21]
- The table question-answering module handles complex and heterogeneous table data, ensuring precise and interpretable reasoning [22][23]

Continuous Optimization and Evaluation
- DingAutoEvaluator serves as the core driver of continuous evolution, shifting development to a fully evaluation-driven paradigm [25]
- The platform continuously monitors cognitive-uncertainty peaks in model outputs, prioritizing uncertain cases for expert annotation [25]
- A unified measurement framework evaluates the framework's outputs across multiple dimensions, providing real-time signals for ongoing optimization [31]

Practical Applications and Case Studies
- Multiple real-world case studies demonstrate end-to-end capabilities in complex table-data parsing, retrieval, reasoning, and multi-modal document generation [27]
- In one case, the system accurately processed a complex table containing inventory and logistics information, showcasing its robustness and practical utility [28]
- Another case involved answering production-related queries by breaking complex questions down into manageable steps [30][32]

Future Outlook
- Dingtalk-DeepResearch will be deployed in enterprise workflows and soon offered as a service through Dingtalk, providing a robust solution for complex task management [44]
- Its adaptive capabilities, large-scale document reinforcement learning, and structured table reasoning position it as a significant advance in enterprise-level adaptive intelligence [45]
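The entropy-guided annotation step described above is a form of uncertainty sampling: score each output by the entropy of the model's predictive distribution and send the highest-entropy cases to experts first. The sketch below shows only that generic pattern; the production system's exact scoring, memory awareness, and thresholds are not public, and the `probs` field is an assumed input format.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a model's output distribution.

    Uniform distributions (maximal uncertainty) score highest.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize_for_annotation(cases, k=2):
    """Rank cases by output entropy and surface the top-k for expert review.

    A generic uncertainty-sampling sketch of the entropy-guided selection
    the summary describes, not DingAutoEvaluator's actual algorithm.
    """
    ranked = sorted(cases, key=lambda c: predictive_entropy(c["probs"]),
                    reverse=True)
    return ranked[:k]
```

Expert labels gathered this way would then feed the reward model and evaluation suite, closing the evaluation-driven loop.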
From Paper to Poster in Seconds: Open-Source Framework PosterAgent Generates Top-Conference-Quality Academic Posters with One Click
量子位· 2025-06-03 07:59
Core Viewpoint
- The article introduces PosterAgent, a tool that converts academic papers into visually appealing posters, highlighting its efficiency and effectiveness compared with existing methods such as GPT-4o [2][18]

Group 1: PosterAgent Overview
- PosterAgent can transform a 22-page paper into an editable ".pptx" poster for only $0.0045, reducing token usage by 87% compared with GPT-4o [2][36]
- The tool is built on the Paper2Poster framework, which establishes the first academic-poster evaluation standard, addressing gaps in long-context and multi-modal compression assessment [4][18]

Group 2: Evaluation Metrics
- Paper2Poster includes 100 pairs of AI-related papers and their corresponding posters, covering subfields such as computer vision (19%), natural language processing (17%), and reinforcement learning (10%) [20]
- The evaluation metrics cover four dimensions: visual quality, text coherence, overall assessment, and PaperQuiz, which simulates communication between authors and readers [22][23]

Group 3: PosterAgent Components
- The PosterAgent framework consists of three key components: a parser that extracts key content, a planner that organizes text and visuals, and a painter-commenter that generates and refines the poster layout [28][29]
- The system employs a top-down design approach to ensure coherence and alignment of content [25]

Group 4: Performance Comparison
- In comparative tests, PosterAgent achieved the highest graphic relevance and visual similarity to human-designed posters, scoring an average of 3.72 when evaluated by a visual language model (VLM) [31][32]
- While GPT-4o-image had the highest visual similarity, it recorded the lowest coherence, indicating that its outputs may look attractive but lack textual clarity [30][31]

Group 5: Cost Efficiency
- PosterAgent demonstrated significant cost efficiency, requiring only 101.1K and 47.6K tokens for its two variants, translating to a cost of $0.55 (based on GPT-4o) or $0.0045 (based on Qwen) per poster [36]
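The painter-commenter pairing described in Group 3 is a generate-then-critique loop over each poster panel: the painter renders a layout, the commenter flags problems such as overflow, and the painter revises until the commenter approves. The sketch below is a schematic of that loop only; `render` and `critique` are hypothetical stand-ins for the underlying model calls, and the round limit is an assumption.

```python
def painter_commenter(panel_content, render, critique, max_rounds=3):
    """Generate-then-refine loop for one poster panel.

    `render(content, feedback)` produces a layout; `critique(layout)`
    returns a list of issues, empty meaning the commenter approves.
    """
    layout = render(panel_content, feedback=None)   # painter's first draft
    for _ in range(max_rounds):
        issues = critique(layout)                   # commenter reviews the draft
        if not issues:
            return layout                           # approved: keep this layout
        layout = render(panel_content, feedback=issues)  # painter revises
    return layout                                   # best effort after max_rounds
```

Bounding the rounds keeps per-panel token cost predictable, which is consistent with the low per-poster costs reported in Group 5.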