Reinforcement Learning
"Of the 20 Hottest Robotics Companies This Year, I Only Invested in 5"
投中网· 2025-10-23 06:30
Core Insights
- The article presents the investment strategies and reflections of Wang Sheng, a partner at Inno Angel Fund, in the context of the AI and robotics sectors, highlighting the importance of early investment and the need to adapt to changing market dynamics [2][4][14]

Investment Strategy
- Wang Sheng emphasizes the significance of identifying potential winners in the robotics industry, noting that he invested in five of the 20 most valuable robotics companies in China, with returns ranging from tens to hundreds of times [2][4]
- The investment logic varies across companies: some bets were tactical responses to market trends, while others rested on strategic foresight about the future of intelligent development [2][4]
- He reflects on missed opportunities, concluding that backing several promising companies beats betting on a single champion, suggesting a shift toward a more inclusive investment philosophy [14][16]

Personal Journey
- Wang Sheng's background spans a long career in internet ventures and entrepreneurship, with a transition to early-stage investment as he recognized the potential of the mobile internet and entertainment sectors [3][4][5]
- His experience underscores the role of personal interest and passion in investment decisions: he found success in areas he genuinely cared about, such as entertainment and AI, rather than forcing himself into less appealing sectors [4][62]

Industry Trends
- The article notes the rapid evolution of the AI and robotics sectors, with a particular focus on embodied intelligence and the need for a deeper understanding of the technology behind it [24][26]
- Wang Sheng addresses misconceptions around embodied intelligence, clarifying that it concerns human-like cognitive processes and learning through interaction rather than merely endowing machines with intelligence [26][27]
- The investment landscape is cautious yet proactive, favoring companies that demonstrate both long-term growth potential and immediate revenue generation [35][36]

Reflections on Investment Philosophy
- Past investment choices are examined critically, with an acknowledgment that overconfidence in market predictions may have led to missed opportunities [16][19]
- The current investment strategy places greater trust in founders and their visions, even when those diverge from the fund's initial assessments [20][21]
- Wang Sheng emphasizes the importance of understanding the people behind the projects, arguing that strong teams can adapt and thrive even if their initial direction differs from established trends [20][22]
Alibaba International's Marco Wins Six Championships at the WMT Machine Translation Competition, Beating GPT-4.1, Gemini 2.5 Pro, and Other Giants on the English-Chinese Track
Cai Jing Wang· 2025-10-23 05:56
Core Insights
- Alibaba's Marco-MT-Algharb translation model achieved significant success at the 2025 WMT competition, winning 6 first places, 4 second places, and 2 third places, excelling in particular at English-to-Chinese translation and surpassing top closed-source AI systems such as Gemini 2.5 Pro and GPT-4.1 [1][2][3]

Group 1: Competition Overview
- The WMT competition is regarded as the "gold standard" of machine translation, combining automatic metrics such as COMET and LLM Judge with extensive human evaluation to determine rankings [3]
- Marco-MT entered the more challenging constrained track, which requires models to handle diverse content while using only open-source data and models of at most 20 billion parameters [2]

Group 2: Model Performance and Methodology
- Marco-MT's success is attributed to combining extensive e-commerce translation experience with an original training method, M2PO (Multi-stage Preference Optimization), which applies reinforcement learning to improve translation quality [2]
- Training proceeds in three steps: broadening knowledge through supervised fine-tuning, applying reinforcement learning to judge translation quality, and incorporating word-alignment and reordering techniques during decoding to improve accuracy and fidelity [2]

Group 3: Market Position and Future Prospects
- Marco-MT, first launched in 2024 for e-commerce translation, has expanded to support search, product-information, dialogue, and image translation scenarios, laying a strong foundation for its move into general translation [3]
- The model has already demonstrated its competitive edge in multimodal translation, winning 2 first places and 2 second places at the 2025 IWSLT international competition [3]
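The article names M2PO but gives no formula. As a rough illustration of what a pairwise preference-optimization objective looks like, here is a minimal DPO-style loss in Python; the log-probability values and the `beta` weight are invented for the example, and this is a generic stand-in, not Marco-MT's actual method:

```python
import math

def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style pairwise loss: push the policy to favour the translation
    judged better, measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy already prefers the better translation -> small loss.
low = preference_loss(-3.0, -7.0, ref_chosen=-4.0, ref_rejected=-6.0)
# Preference inverted -> larger loss, stronger gradient signal.
high = preference_loss(-7.0, -3.0, ref_chosen=-6.0, ref_rejected=-4.0)
```

The loss only cares about the *relative* margin between the two candidates, which is what makes preference data from quality judges usable as an RL-style training signal.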
A New Paradigm for LLM Reasoning: The ExGRPO Framework, from Blind Problem-Grinding to Smart Review
量子位· 2025-10-23 05:18
Contributed by the ExGRPO team, QbitAI | WeChat account QbitAI. During reinforcement learning, large models finally know which experiences are most valuable! A research team from Shanghai AI Laboratory, the University of Macau, Nanjing University, and the Chinese University of Hong Kong recently proposed ExGRPO, an experience-management and learning framework: by scientifically identifying, storing, filtering, and learning from valuable experience, it helps large models go steadier, faster, and farther in improving their reasoning ability. Experiments show that compared with traditional on-policy RLVR (reinforcement learning from verifiable rewards) methods, ExGRPO delivers consistent performance gains across benchmarks, with especially clear improvements on highly challenging tasks such as AIME competition math, demonstrating its effectiveness on hard reasoning problems. The study also reveals some interesting phenomena, such as a snowball effect. Before diving in, though, one core question is worth answering: why does the next step in large-model reasoning need "experience-driven" training? Since early 2025, the dominant technical route for strengthening large-model reasoning has been Reinforcement Learning from Verifiable Rewards. Simply put, the model is treated like a student who keeps "grinding through problems" (generating reasoning steps), which are then graded by a "marking teacher" ...
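The "identify, store, filter, and learn" loop described above can be sketched as a small experience buffer. The class name and scoring heuristic below are invented for illustration and are not ExGRPO's actual interface; the heuristic encodes one idea the framework's description suggests, that problems of intermediate difficulty carry the most learning signal:

```python
import random

class ExperienceBuffer:
    """Toy sketch of experience-driven RLVR: store verified rollouts with a
    value estimate, keep only the most informative ones, replay them later."""

    def __init__(self, capacity=1000):
        self.buffer = []        # list of (value, prompt, rollout), sorted desc
        self.capacity = capacity

    def add(self, prompt, rollout, reward, pass_rate):
        # Heuristic: mid-difficulty problems (pass_rate near 0.5) are the
        # most valuable; always-solved or never-solved ones teach little.
        value = reward * (1.0 - abs(pass_rate - 0.5) * 2.0)
        self.buffer.append((value, prompt, rollout))
        self.buffer.sort(key=lambda x: x[0], reverse=True)
        del self.buffer[self.capacity:]   # evict the least valuable entries

    def sample(self, k):
        """Draw past experiences to mix into the next training batch."""
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

A real system would refresh value estimates as the policy improves, since an experience that was hard last week may be trivial now.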
Ask an LLM to Throw a Stone, and It Builds a Catapult
量子位· 2025-10-22 15:27
Core Insights
- The article introduces BesiegeField, a new research platform developed by researchers from CUHK (Shenzhen) that allows large language models (LLMs) to design and build functional machines from scratch [2][39]
- The platform enables LLMs to learn mechanical design through reinforcement learning, evolving their designs based on feedback from physical simulations [10][33]

Group 1: Mechanism of Design
- The research frames the problem as Compositional Machine Design, which reduces complex designs to discrete assembly problems over standard parts [4][5]
- A structured, XML-like representation lets the model understand and modify designs piece by piece [6][7]
- The platform runs on Linux clusters, executing hundreds of mechanical experiments in parallel and returning comprehensive physical feedback such as speed, force, and energy changes [9][10]

Group 2: Collaborative AI Workflow
- To address the limitations of a single model, the team developed an Agentic Workflow in which multiple AIs collaborate on design tasks [23][28]
- The workflow defines distinct roles, including a Meta-Designer, Designer, Inspector, Active Env Querier, and Refiner, which together strengthen the design process [28][31]
- This hierarchical design strategy significantly outperforms single-agent or simple iterative-editing approaches on tasks such as building a catapult and a car [31]

Group 3: Self-Evolution and Learning
- Introducing reinforcement learning via RLVR allows models to self-evolve, using simulation feedback as the reward signal [33][34]
- As iterations increase, the models' design capability improves, yielding better task performance [35][37]
- Combining a cold-start strategy with RL achieves the best scores on both the catapult and car tasks, showing that LLMs can sharpen mechanical design skills through feedback [38]

Group 4: Future Implications
- BesiegeField represents a new paradigm for structural creation, enabling AI to design not just static machines but dynamic structures capable of movement and collaboration [39][40]
- The platform turns complex mechanical design into a structured language-generation task, allowing models to grasp mechanical principles and structural collaboration [40]
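The article says machines are serialized in an XML-like format that the model can read and edit block by block. A minimal sketch of what such a representation might look like follows; the part names and attributes are made up for illustration, since BesiegeField's actual schema is not shown in the article:

```python
import xml.etree.ElementTree as ET

# Hypothetical parts for a toy catapult; real part types are assumptions.
parts = [
    ("wooden-block", 0, 0),  # base
    ("hinge",        0, 1),  # pivot for the throwing arm
    ("wooden-pole",  1, 1),  # throwing arm
    ("spring",       1, 0),  # stores the launch energy
]

machine = ET.Element("machine", name="catapult")
for part_id, (part_type, x, y) in enumerate(parts):
    # Each part is one element, so an LLM can add, delete, or retune
    # components by emitting small, local edits to the document.
    ET.SubElement(machine, "part", id=str(part_id), type=part_type,
                  x=str(x), y=str(y))

xml_text = ET.tostring(machine, encoding="unicode")
```

Discrete, element-per-part documents like this are what makes "compositional machine design" tractable as a language-generation task: the edit space is a sequence of tag insertions and attribute changes rather than free-form geometry.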
OpenAI Wants AI to Replace "Junior Investment Bankers"
Hu Xiu· 2025-10-22 13:24
Core Insights
- OpenAI is running a unique experiment called "Mercury," hiring more than 100 former investment-banking employees to train its AI models in financial modeling and other core skills [1][2]
- The project aims to teach AI to perform tasks typically done by junior bankers, raising concerns about the future job security of entry-level positions in finance [1][2]

Group 1: Project Details
- "Mercury" has recruited professionals from top financial institutions, including JPMorgan, Morgan Stanley, and Goldman Sachs, as well as talent from Brookfield Corp., Mubadala Investment Co., Evercore Inc., and KKR & Co. [2]
- Participants are paid $150 per hour and must submit a financial model each week, writing prompts in plain language and executing them in Microsoft Excel [2]
- The application process involves minimal human intervention: a 20-minute interview with an AI chatbot plus tests of financial-statement knowledge and modeling skills [3]

Group 2: AI Learning Focus
- The project emphasizes attention to detail, since junior analysts often work long hours on tedious tasks such as building complex merger models in Excel [4]
- According to Bloomberg columnist Matt Levine, the meticulous nature of investment banking is exactly what AI must learn, since even minor formatting errors can cause significant trust issues [5]
- Levine describes current generative AI as "smart but careless," casting the project as a form of reinforcement learning meant to instill the necessary attention to detail [5]

Group 3: Implications for the Industry
- The direct goal of "Mercury" is to enable AI to replace the work of junior employees, raising questions about the future of investment banking's traditional apprenticeship model [6]
- Junior analysts have historically learned their craft through foundational work; if AI takes over these tasks, it may hinder the development of the industry's future leaders [6]
- Given investment banking's high turnover, many former analysts may feel little burden about training AI to replace their previous roles [6]

Group 4: OpenAI's Strategic Focus
- "Mercury" reflects OpenAI's broader commercialization strategy, targeting the lucrative financial-services sector to demonstrate the value of its technology in complex business environments [7]
- Despite its high valuation, OpenAI has yet to achieve profitability, prompting the company to push actively into enterprise markets [7]
- The initiative signals OpenAI's ambition to build specialized AI tools deeply integrated into corporate workflows, aiming for a significant position in the global business landscape [7]
BAAI Open-Sources EditScore: Unlocking Online Reinforcement Learning for Image Editing
机器之心· 2025-10-22 03:30
As multimodal large models continue to evolve, instruction-guided image editing has made remarkable progress. Yet existing models still struggle to follow complex, fine-grained text instructions, often forcing users into repeated attempts and manual filtering, and rarely delivering stable, high-quality edits in one shot. Reinforcement learning (RL) offers a highly promising path for models to self-improve and better follow instructions, but its application to image editing has long been blocked by a core bottleneck: the lack of a reward model that can precisely assess editing quality and provide high-fidelity feedback. Without a reliable reward signal, a model cannot judge the quality of its own outputs and thus cannot self-optimize efficiently. To overcome this, the VectorSpace Lab team at the Beijing Academy of Artificial Intelligence (BAAI) recently released EditScore, a new family of high-fidelity reward models. The work tackles the challenge above head-on, aiming to provide precise, reliable reward signals for instruction-guided image editing, paving the way for deeper applications of RL in AIGC and truly unlocking its potential. EditScore is BAAI's next major step toward more general, more controllable generative AI, following its unified image-generation model series OmniGen. To promote ...
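A reward model like EditScore can add value even before any RL training, simply by ranking candidate edits, replacing the "repeated attempts and manual filtering" the passage describes with automatic best-of-N selection. A toy sketch, where the scoring function is a crude stand-in and not EditScore itself:

```python
def best_of_n(candidates, reward_model):
    """Score each candidate edit with the reward model and keep the best,
    turning an unreliable one-shot editor into a more stable pipeline."""
    return max(candidates, key=reward_model)

# Stand-in reward: count instruction words mentioned in the edit summary.
# (A real reward model would score the edited image against the instruction.)
instruction = "make the sky blue"
toy_reward = lambda edit: sum(word in edit for word in instruction.split())

best = best_of_n(["sky turned red", "sky is now blue", "no change"], toy_reward)
```

The same scores can later serve as the reward signal for online RL, which is the use the article emphasizes.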
A Heavyweight Fires Away: Agents Are Just for Show, Reinforcement Learning Is Terrible, and AGI Is Still a Decade Off
自动驾驶之心· 2025-10-22 00:03
Core Insights
- The article examines the current state and future of AI, focusing on the limitations of reinforcement learning and the timeline for achieving Artificial General Intelligence (AGI) [5][6][10]

Group 1: AGI and AI Development
- AGI is expected to take about ten years to develop, contrary to the belief that this would be the year of agents [12][13]
- Current AI agents, such as Claude and Codex, are impressive but still lack essential capabilities, including multimodal abilities and continual learning [13][14]
- The industry has been overly optimistic about the pace of AI development, leading to inflated expectations [12][15]

Group 2: Limitations of Reinforcement Learning
- Reinforcement learning is criticized as inadequate for replicating human learning, since it often relies on trial and error without a deep understanding of the problem [50][51]
- The approach can inject noise into learning, because every action is weighted by the final outcome rather than by the quality of the individual steps [51][52]
- Human learning involves a richer reflection on successes and failures, which current AI models do not replicate [52][53]

Group 3: Future of AI and Learning Mechanisms
- The future of AI may involve more sophisticated attention mechanisms and learning algorithms that better mimic human cognitive processes [33][32]
- AI models need mechanisms for long-term memory and knowledge retention, which they currently lack [31][32]
- The integration of AI into programming and development is seen as a continuous evolution rather than a sudden leap to superintelligence [45][47]
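The credit-assignment criticism in Group 2 can be made concrete with a few lines of code: outcome-based RL spreads the trajectory's final reward uniformly over every step, so a wasted detour earns exactly the same credit as the key insight. This is a generic illustration of the problem, not any particular lab's implementation:

```python
def outcome_credit(steps, final_reward):
    """Assign every step the same weight as the final outcome,
    regardless of whether the step helped or hurt."""
    return [final_reward] * len(steps)

trajectory = ["restate problem", "wrong detour", "key insight", "answer"]
credits = outcome_credit(trajectory, final_reward=1.0)
# "wrong detour" is rewarded exactly as much as "key insight" -- the noise
# the article describes, which per-step (process) supervision tries to fix.
```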
o1 Core Author Jason Wei: Three Key Ideas for Understanding AI Progress in 2025
Founder Park· 2025-10-21 13:49
Group 1
- The article centers on three concepts critical for understanding and navigating AI development through 2025: the Verifiers Law, the jagged edge of intelligence, and the commoditization of intelligence [3][14]
- The Verifiers Law states that the ease of training AI to complete a specific task is proportional to how verifiable that task is, implying that tasks which are both solvable and easily verifiable will eventually be tackled by AI [21][26]
- The commoditization of intelligence means that knowledge and reasoning will become increasingly accessible and affordable, sharply reducing over time the cost of reaching a given level of capability [9][11]

Group 2
- AI development has two phases: an initial phase in which researchers unlock new capabilities, and a subsequent phase in which those capabilities are commoditized, driving down the cost of a given level of performance [11][13]
- The commoditization trend is driven by adaptive computing, which adjusts computational resources to task complexity and thereby reduces costs [13][16]
- The article traces the evolution of information retrieval across eras, emphasizing the drastic reduction in the time required to access public information as AI technologies advance [16][17]

Group 3
- The jagged edge of intelligence means that AI's capabilities and progress will vary significantly across tasks, producing an uneven development landscape [37][42]
- Tasks that are easy to verify will be the first to be automated, which makes creating objective, scalable evaluation methods important across fields [38][39]
- AI's self-improvement capabilities will not produce a sudden leap in intelligence but rather a gradual enhancement across different tasks, at varying rates of progress [41][45]
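The Verifiers Law is easiest to see in code: whenever a cheap programmatic checker exists, reward signal for training is essentially free, which is why verifiable tasks get automated first. A toy sketch, with the arithmetic checker invented as an example:

```python
def verifiable_reward(candidate, checker):
    """Reward is trivial to compute whenever a task has an automatic
    verifier -- exactly the tasks the Verifiers Law predicts AI masters first."""
    return 1.0 if checker(candidate) else 0.0

# Arithmetic is perfectly verifiable, so every model attempt can be scored
# at scale with no human in the loop:
is_correct = lambda ans: int(ans) == 6 * 7
reward_good = verifiable_reward("42", is_correct)
reward_bad = verifiable_reward("41", is_correct)
```

By contrast, tasks like "write a moving essay" admit no such `checker`, which on this view is precisely why progress there is slower and more jagged.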
OpenAI Veteran Karpathy Pours Cold Water: Agents Are Still a Decade Away from "Doing Real Work"
36Ke· 2025-10-21 12:42
Group 1
- Andrej Karpathy argues that mature AI agents are still about ten years away, stating that current agents like Claude and Codex are not yet capable of being employed for real tasks [2][4][5]
- He critiques the current state of AI learning, arguing that reinforcement learning is inadequate and that true learning should resemble human cognitive processes, involving reflection and growth rather than mere trial and error [11][12][22]
- Karpathy suggests that future breakthroughs will require a shift from knowledge accumulation to self-growth capabilities and a reconstruction of cognitive structures [4][5][22]

Group 2
- He highlights the current limitations of large language models (LLMs) in coding tasks, noting that they struggle with structured, nuanced engineering design [6][7][9]
- He categorizes human interaction with code into three types, emphasizing that LLMs cannot yet function as true collaborators in software development [7][9][10]
- While LLMs can assist with certain coding tasks, Karpathy believes they are not yet capable of writing or improving their own code effectively [9][10][11]

Group 3
- Karpathy stresses the importance of a reflective mechanism in AI learning, suggesting that models should learn to review and reflect on their processes rather than focus solely on outcomes [18][19][20]
- He introduces the concept of a "cognitive core," advocating that models retain essential thinking and planning abilities while discarding unnecessary knowledge [32][36]
- He proposes that a smaller, more efficient model of only a billion parameters could suffice, arguing that high-quality data can yield effective cognitive capabilities without massive scale [34][36]

Group 4
- Karpathy asserts that AGI will integrate gradually into the economy rather than causing a sudden disruption, with digital knowledge work as its initial application area [38][39][40]
- He predicts a collaborative future of work in which agents perform 80% of tasks under human supervision of the remaining 20% [40][41]
- AGI deployment will be a gradual process, starting with structured tasks like programming and customer service before expanding to more complex roles [48][49][50]

Group 5
- Karpathy discusses the challenges of fully autonomous driving, calling it a high-stakes task that cannot afford errors, unlike other AI applications [59][60]
- Successful autonomous driving requires not just technological advances but also a supportive societal framework [61][62]
- The transition to widespread autonomous driving will be slow and incremental, beginning with specific use cases and expanding gradually [63]
Tsinghua and Kuaishou Propose AttnRL: Letting Large Models Explore with "Attention"
机器之心· 2025-10-21 09:32
Core Insights
- The article covers advances in reinforcement learning (RL), focusing on Process-Supervised RL (PSRL) and introducing a new framework, AttnRL, that improves exploration efficiency and performance for reasoning models [3][4][9]

Group 1: Challenges in Traditional Methods
- Traditional PSRL methods assign the same reward signal to all tokens, neglecting fine-grained quality within the reasoning process [7]
- Existing PSRL approaches face significant bottlenecks in exploration efficiency and training cost, incurring high computational expense [4][10]

Group 2: Introduction of AttnRL
- AttnRL introduces an innovative exploration method that uses attention to guide reasoning, branching the search from high-attention steps [9][12]
- Its Attention-based Tree Branching (ATB) analyzes the reasoning sequence and computes Forward Context Influence (FCI) scores to identify the most impactful steps to branch from [13][16]

Group 3: Adaptive Sampling Mechanisms
- AttnRL incorporates two adaptive sampling mechanisms, difficulty-aware exploration and dynamic batch adjustment, concentrating learning on challenging problems while reducing compute on simpler ones [20][22]
- Training is streamlined into a one-step off-policy scheme, significantly reducing sampling costs relative to previous PSRL methods [23]

Group 4: Experimental Results
- AttnRL outperforms baselines such as GRPO and TreeRL across mathematical-reasoning benchmarks, reaching average accuracies of 57.2% with 1.5B models and 68.7% with 7B models [28]
- The framework samples more efficiently, achieving a higher effective ratio and better performance in fewer training steps than traditional methods [29][31]

Group 5: Future Outlook
- Using attention scores to drive PSRL exploration decisions opens new avenues for model interpretability and RL research, suggesting that efficiency and intelligence can coexist through more effective exploration strategies [34]
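The summary describes FCI only at a high level: score each reasoning step by how much later context attends back to it, then branch the search tree from high-scoring steps. The following toy version makes that idea concrete; the exact definition, normalization, and the attention values are invented for illustration and are not AttnRL's actual formula:

```python
def fci_scores(attention, step_boundaries):
    """Toy Forward Context Influence: score each step (a token range) by the
    total attention that tokens *after* the step pay back to its tokens.
    attention[i][j] is the attention weight from token i to earlier token j."""
    scores = []
    for start, end in step_boundaries:
        scores.append(sum(attention[i][j]
                          for i in range(end, len(attention))
                          for j in range(start, end)))
    return scores

# Two steps over four tokens. Later tokens attend heavily back to step one,
# so a tree search would branch new rollouts from that step.
attn = [[0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.2, 0.3, 0.5, 0.0]]
scores = fci_scores(attn, [(0, 2), (2, 4)])
branch_step = scores.index(max(scores))
```

The appeal of this signal is that it is free: the attention weights are already computed during generation, so ranking branch points costs no extra forward passes.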