Qwen2.5

Search documents
ARPO:智能体强化策略优化,让Agent在关键时刻多探索一步
机器之心· 2025-08-09 06:02
Core Viewpoint - The article introduces a novel method called Agentic Reinforced Policy Optimization (ARPO), designed to enhance the performance of large language models (LLMs) in multi-round interactions by addressing the challenges of uncertainty and exploration during tool usage [3][41]. Group 1: Research Motivation and Background - The emergence of Agentic Reinforcement Learning (RL) is driven by the need for LLMs to engage in dynamic multi-round interactions with external tools, moving from static problem-solving to a more interactive agent-environment reasoning paradigm [8]. - Existing Agentic RL methods often underestimate the value of multi-round interactions due to sparse rewards and overuse of tools, leading to a lack of fine-grained exploration of tool usage [8][41]. - The study identifies a significant increase in entropy (uncertainty) after tool calls, indicating an opportunity for exploration that current methods do not fully leverage [14][16]. Group 2: ARPO Methodology - ARPO introduces an entropy-driven adaptive rollout strategy that enhances exploration during high-entropy tool usage phases, allowing for more diverse reasoning paths [11][20]. - The method includes four key steps: initialization of global rollout, monitoring entropy changes, adaptive branching based on entropy, and defining termination conditions for the rollout process [24][27]. - ARPO incorporates advantage attribution estimation to help the model better internalize the value differences in tool usage at each step [28][30]. Group 3: Experimental Results - ARPO outperforms existing sample-level RL methods, achieving better performance with only half the tool call budget across 13 challenging benchmarks, demonstrating its efficiency in training multi-round reasoning agents [21][41]. - The method shows consistent improvements in performance metrics such as Pass@3 and Pass@5, particularly in dynamic, multi-round tasks [37][39]. - In comparative tests, ARPO achieves higher accuracy than GRPO and DAPO in various tasks, including deep search and knowledge-intensive reasoning [41][42]. Group 4: Future Directions - Future research may explore the application of ARPO in multi-modal tasks, expanding its capabilities beyond text-based reasoning to include images and videos [42]. - There is potential for integrating a broader range of external tools to enhance complex task performance through optimized tool usage strategies [42]. - The scalability and real-time deployment of ARPO in larger models and dynamic environments could further improve its practical value and cost-effectiveness [42].
监督学习未死,一题训练五小时起飞!华人学者新方法20倍训练效率释放大模型推理能力
量子位· 2025-08-04 07:00
Core Viewpoint - The article discusses the breakthrough of One-Shot Critique Fine-Tuning (One-Shot CFT) in enhancing reasoning capabilities of large language models (LLMs) with minimal data and computational resources, outperforming traditional reinforcement learning (RL) methods and small-scale supervised fine-tuning (SFT) approaches [1][3][14]. Group 1: One-Shot CFT Methodology - One-Shot CFT is a new method that allows models to learn reasoning by analyzing the quality of answers rather than merely imitating them, thus providing a deeper learning signal [3][12]. - The process involves selecting a representative task, generating multiple answers using various models, and then having a more powerful model critique these answers, which serves as the supervision signal for training [4][5]. - The entire training process requires only one question, multiple answers, and critiques, taking approximately 5 GPU hours, significantly less than RL methods [5][14]. Group 2: Performance and Results - In experiments, Qwen2.5-Math-7B achieved a 15% accuracy increase after One-Shot CFT fine-tuning on a single question, surpassing both RL and full supervised fine-tuning models that used tens of thousands of training samples [9][10]. - The method demonstrated strong performance across various mathematical and logical reasoning tasks, with accuracy improvements ranging from 10% to 16% in specific sub-tasks [10][11]. - One-Shot CFT showed stability and reproducibility across different tasks and model configurations, indicating its robustness [11][13]. Group 3: Advantages of One-Shot CFT - The method emphasizes critical learning, allowing models to understand why answers are correct or incorrect, which enhances the depth of learning compared to traditional SFT [12]. - It introduces multi-perspective inputs by generating multiple answers and critiques for a single task, closely mimicking human learning processes [12]. - The training signals from critiques are highly generalizable, reducing the risk of overfitting and allowing for easier transfer to new tasks [12]. Group 4: Accessibility and Practical Implications - One-Shot CFT's low computational cost makes it accessible for individual researchers, resource-limited labs, and startups, providing a cost-effective solution for enhancing reasoning capabilities [14][15]. - The entire process is open-source, including training scripts, model parameters, and datasets, which significantly lowers the barrier for replication and experimentation [17].
GPT-4o遭越狱后指挥机器人做危险动作!全球首个具身智能体安全评测基准来了,大模型集体翻车
量子位· 2025-08-01 04:23
Core Viewpoint - The article discusses the alarming potential risks associated with embodied AI systems, particularly when they are subjected to "jailbreak" attacks, which can lead to dangerous behaviors in robots [2][8]. Group 1: Introduction to AGENTSAFE - A new comprehensive evaluation benchmark called AGENTSAFE has been proposed to address the safety of embodied intelligent agents, filling a gap in adversarial safety assessment [4]. - This groundbreaking research has received the Outstanding Paper Award at the ICML 2025 Multi-Agent Systems workshop [5]. - The research team plans to release datasets, code, and evaluation sandboxes for global researchers to utilize [6]. Group 2: Need for AGENTSAFE - The necessity for AGENTSAFE arises from the evolution of "jailbreak" attacks, which have shifted from generating harmful content to executing dangerous physical actions [8]. - Existing evaluation benchmarks primarily focus on task completion rates or obstacle avoidance, neglecting safety assessments under adversarial commands [9]. - The authors emphasize the importance of proactively identifying safety vulnerabilities before any harm occurs [10][11]. Group 3: AGENTSAFE Framework - AGENTSAFE simulates 45 real indoor scenarios with 104 interactive objects, creating a dataset of 9,900 dangerous commands inspired by Asimov's "Three Laws of Robotics" [14][15]. - The framework incorporates six advanced "jailbreak" attack methods to disguise dangerous commands, making them harder to detect [15]. - AGENTSAFE features an end-to-end evaluation design that assesses the entire process from perception to action execution, ensuring a comprehensive safety evaluation [16][18]. Group 4: Evaluation Metrics and Results - The evaluation process is divided into three stages: perception, planning, and execution, with specific metrics for assessing safety [30]. - Experimental results indicate that top models perform well on safe commands but show significant variability when faced with dangerous instructions [33][34]. - The study reveals that once commands are subjected to "jailbreak" attacks, the safety of all models declines sharply, with notable drops in refusal rates for harmful commands [37][38]. Group 5: Conclusion and Implications - The findings highlight the current vulnerabilities in embodied intelligent agents regarding safety measures [42]. - The authors stress the need to focus on what these models should not do, advocating for safety testing before deployment in real-world scenarios [43].
Meta-Think ≠ 记套路,多智能体强化学习解锁大模型元思考泛化
机器之心· 2025-07-03 03:26
Core Viewpoint - The article discusses a new framework called ReMA (Reinforced Meta-thinking Agents) designed to enhance the reasoning capabilities of large language models (LLMs) by introducing a multi-agent system that separates meta-thinking from reasoning tasks, thereby improving adaptability and effectiveness in complex problem-solving [3][4][6][10]. Group 1: Introduction and Background - Recent explorations in large model reasoning have introduced various paradigms, including structured search and process reward models, but the mechanisms behind "Aha Moments" in reasoning remain unclear [3]. - The study emphasizes the importance of reasoning patterns and posits that the strength of complex reasoning in large models fundamentally relies on their meta-thinking abilities [3][4]. Group 2: ReMA Framework - The ReMA framework consists of two hierarchical agents: the meta-thinking agent, which generates strategic supervision and planning, and the reasoning agent, which executes detailed sub-tasks based on the meta-thinking agent's guidance [10][11]. - This multi-agent system allows for a more structured and efficient exploration of the reasoning process, balancing generalization capabilities and exploration efficiency [12]. Group 3: Methodology - The study defines a single-round multi-agent meta-thinking reasoning process (MAMRP) where the meta-thinking agent analyzes the problem and generates a solution plan, while the reasoning agent completes the task based on these instructions [13][14]. - In multi-round interactions, the meta-thinking agent can provide ongoing guidance, allowing for planning, reflection, and correction throughout the reasoning process [14][20]. Group 4: Experimental Results - In single-round experiments, ReMA consistently outperformed baseline methods across various benchmarks, demonstrating superior generalization capabilities, particularly on out-of-distribution datasets [27][28]. - The results indicate that ReMA's meta-thinking mechanism significantly enhances performance, with improvements noted in specific benchmarks such as AMC23, where performance increased by up to 20% [28][29]. Group 5: Challenges and Future Work - The study acknowledges challenges in multi-round training, including instability and sensitivity to hyperparameters, suggesting that the current training processes may not be suitable for stochastic or non-stationary environments [39][40]. - Further exploration is needed to address these issues and improve the robustness of the ReMA framework in diverse training scenarios [39].
7B智能体仅凭9个任务训练即超越R1!上交大打造AI-for-AI新范式
机器之心· 2025-06-21 01:33
Core Viewpoint - The article discusses the emergence of AI-for-AI (AI4AI) as a solution to the limitations of traditional AI development, which heavily relies on human intervention and manual tuning, thereby slowing down innovation and the path to Artificial General Intelligence (AGI) [1][6]. Group 1: AI4AI Development - AI4AI aims to enable AI agents to autonomously design, optimize, and improve AI algorithms, significantly reducing human involvement and accelerating the iterative development cycle [1][6]. - A recent study by Shanghai Jiao Tong University and Shanghai AI Lab demonstrated that a 7 billion parameter AI agent (ML-Agent) could surpass a 671 billion parameter model (Deepseek-R1) in performance by utilizing a new paradigm of "experience learning" [2][9]. Group 2: Traditional Machine Learning Challenges - Traditional machine learning processes are time-consuming and inefficient, often requiring days to months for model design and parameter tuning, which limits the speed of AI innovation [4][5]. - Existing AI agents still depend on human-designed prompts, leading to a cycle of waiting, modifying, and retrying, which perpetuates inefficiency [5][6]. Group 3: Breakthroughs in Autonomous Machine Learning - The study introduces a learning-based paradigm for autonomous machine learning, allowing agents to learn from execution trajectories through online reinforcement learning, enabling proactive exploration of strategies [7][9]. - The ML-Agent, powered by a 7 billion parameter model, demonstrated remarkable performance improvements by learning from just nine machine learning tasks, showcasing its ability to generalize across tasks [20][24]. Group 4: Training Framework and Methodologies - The training framework includes three core breakthroughs that enhance the self-evolution of AI agents, such as exploration-enriched fine-tuning and a step-wise reinforcement learning paradigm [11][15]. - A customized reward module was developed to unify feedback from complex experimental results, providing consistent signals for reinforcement learning optimization [19][20]. Group 5: Performance Comparison and Results - ML-Agent outperformed several advanced AI models in both seen and unseen machine learning tasks, demonstrating its strong generalization capabilities [20][22]. - The research highlights that ML-Agent's performance consistently improved throughout training, surpassing all baseline methods and establishing a new paradigm for AI design [24][25]. Group 6: Community and Future Directions - ML-Agent is part of the MASWorks open-source community, which aims to connect global researchers and foster collaboration in the multi-agent systems field [26][27]. - The community plans to host a workshop focused on large language models and multi-agent systems at ICML 2025, encouraging participation from scholars worldwide [28].
小红书hi lab首次开源文本大模型,训练资源不到Qwen2.5 72B 的四分之一
AI前线· 2025-06-06 08:30
刚刚,Qwen3 终于发布!混合推理模式、支持MCP,成本仅DeepSeek R1三分之一,网友喊话小扎:工程师要赶紧加班了
AI前线· 2025-04-28 23:57
Qwen3 在推理、指令遵循、工具调用、多语言能力等方面均大幅增强。在官方的测评中,Qwen3 创下所有国产模型及全球开源模型的性能新高:在奥 数水平的 AIME25 测评中,Qwen3 斩获 81.5 分,刷新开源纪录;在考察代码能力的 LiveCodeBench 评测中,Qwen3 突破 70 分大关,表现甚至超过 Grok3;在评估模型人类偏好对齐的 ArenaHard 测评中,Qwen3 以 95.6 分超越 OpenAI-o1 及 DeepSeek-R1。 | | Qwen3-235B-A22B | Qwen3-32B | OpenAl-o1 | Deepseek-R1 | Grok 3 Beta | Gemini2.5-Pro | Open Al-o 3-mini | | --- | --- | --- | --- | --- | --- | --- | --- | | | MoE | Dense | 2024-12-17 | | Think | | Medium | | ArenaHard | 95.6 | 93.8 | 92.1 | 93.2 | - | 96.4 | 89.0 | | AIM ...
速递|印度初创公司Ziroh Labs,推出无需高端芯片即可运行大型AI模型
Z Potentials· 2025-04-11 04:20
Core Viewpoint - Ziroh Labs has developed an affordable AI system that can run large AI models without relying on high-end computing chips from companies like Nvidia, focusing on making AI accessible to developers in India [1][2]. Group 1: Technology and Development - The framework named Kompact AI was developed in collaboration with the Indian Institute of Technology Madras, allowing AI to run on everyday computing devices' CPUs instead of expensive GPUs [2]. - Ziroh Labs' approach focuses on the inference process, optimizing mainstream AI models to run on personal computers, demonstrated successfully on laptops with Intel Xeon processors [3]. - The technology has been tested by major chip manufacturers like Intel and AMD, indicating its potential for high-quality outcomes [3]. Group 2: Market Impact and Accessibility - The rising costs and shortages of GPUs have hindered local AI research and deployment in India, creating an AI gap where only those with access to expensive resources can develop powerful AI [3]. - The success of Ziroh Labs' cost-effective AI models could lead to a significant reduction in chip usage among AI developers in the coming months [2]. - The initiative aims to democratize AI access, proving that powerful AI can be developed without the need for high-end resources [3].