Reinforcement Learning
Explainer: Deconstructing LLM Post-Training, and the Past and Present of GRPO and Its Successors
36Kr · 2025-09-01 04:38
Group 1
- The core concept of the article revolves around the evolution of post-training methods in large language models, particularly the GRPO algorithm as a significant advancement in reinforcement learning paradigms [2][46].
- GRPO has emerged as a universal reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].
Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction in computational cost and memory requirements that makes GRPO the more efficient alternative [18][14].
- GRPO's methodology uses the mean reward of a group of sampled responses as the baseline for advantage estimation, eliminating the need for a separate value function [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting further research and improved algorithms such as DAPO and GSPO [19][48].
Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds upon GRPO with enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO represents a significant advancement by shifting the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses GRPO's limitations by allowing simultaneous optimization of multiple response attributes, improving overall model performance [33][34].
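The token-level versus sequence-level distinction drawn between GRPO/PPO and GSPO in Group 3 can be sketched in a few lines. This is a toy illustration only; the function names and the length normalization are our assumptions, not the papers' exact definitions:

```python
import math

def token_level_ratios(logp_new, logp_old):
    """PPO/GRPO-style: one importance ratio per token,
    exp(logp_new - logp_old) at each position."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style: a single, length-normalized ratio for the whole
    response, which the article credits with more stable training."""
    n = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / n)

# Two tokens whose per-token ratios diverge in opposite directions
# can still yield a neutral sequence-level ratio:
print(token_level_ratios([-1.0, -2.0], [-1.5, -1.5]))
print(sequence_level_ratio([-1.0, -2.0], [-1.5, -1.5]))  # 1.0
```

The example shows why sequence-level clipping behaves differently: noisy per-token ratios can cancel out at the sequence level instead of each triggering a clip.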
RLinf Open-Sourced! The First Large-Scale Reinforcement Learning Framework for Embodied Intelligence That Unifies Rendering, Training, and Inference
具身智能之心· 2025-09-01 04:02
Core Viewpoint
- The article discusses the launch of RLinf, a large-scale reinforcement learning framework aimed at embodied intelligence, highlighting its innovative design and capabilities in advancing AI's transition from perception to action [2][5].
Group 1: Framework Overview
- RLinf is a flexible and scalable framework designed for embodied intelligence, integrating various components to optimize performance [5].
- The "inf" in the framework's name signifies both "infrastructure" and "infinite" scaling, emphasizing its adaptable system design [7].
- RLinf features a hybrid execution model that achieves over 120% system speedup compared to traditional frameworks, with VLA model performance improvements of 40%-60% [7][12].
Group 2: Execution Modes
- RLinf supports three execution modes: Collocated, Disaggregated, and Hybrid, allowing users to configure components based on their needs [17][15].
- The hybrid mode combines the advantages of shared and separated execution, minimizing system idle time and enhancing efficiency [12][15].
Group 3: Communication and Scheduling
- The framework includes an adaptive communication library designed for reinforcement learning, optimizing data exchange between components [19][22].
- An automated scheduling module minimizes resource idleness and dynamically adapts to user training flows, enabling rapid scaling [23][24].
Group 4: Performance Metrics
- RLinf has demonstrated significant performance improvements in embodied intelligence tasks, achieving success rates of 80%-90% in specific scenarios, compared to 30%-50% for previous models [24][26].
- The framework has also achieved state-of-the-art (SOTA) performance on mathematical reasoning tasks across multiple datasets, showcasing its versatility [29][30].
Group 5: Documentation and Community Engagement
- Comprehensive documentation and API support are provided to enhance user experience and facilitate understanding of the framework [32][34].
- The RLinf team encourages collaboration and invites users to explore the framework, highlighting ongoing recruitment for various research and engineering positions [33][34].
Sequoia US: The Five AI Tracks We Will Focus On in the Coming Year
创业邦· 2025-09-01 03:48
Core Insights
- Sequoia Capital views the AI revolution as a transformative event comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, of which only $20 billion is currently automated by AI [1][7][13].
Investment Themes
- **Theme 1: Persistent Memory** - Persistent memory involves both long-term memory, so AI can retain shared context, and stable identity for AI agents, so they maintain their unique characteristics over time. This area remains largely unsolved, presenting a significant opportunity [30].
- **Theme 2: Seamless Communication Protocols** - Standardized communication protocols among AI agents are critical for seamless collaboration, much as TCP/IP was during the internet revolution. They could transform business models by allowing AI agents to interact autonomously [32].
- **Theme 3: AI Voice** - AI voice technology is maturing, with improvements in fidelity and latency enabling real-time conversation. Its applications span consumer and enterprise sectors, including logistics and trading [35].
- **Theme 4: AI Security** - There is a substantial opportunity in AI security across the development and consumer spectrum, ensuring technology is developed and used safely. This includes protecting both users and AI agents from vulnerabilities [37].
- **Theme 5: Open Source AI** - Open-source AI is at a pivotal moment, with the potential to compete with proprietary models. This is essential for fostering a more open and accessible AI landscape and broader participation in AI development [40].
Explainer: Deconstructing LLM Post-Training, and the Past and Present of GRPO and Its Successors
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].
Summary by Sections
Development of Large Language Models
- The rapid advancement of large language models has driven the emergence of various post-training methods, with GRPO a notable innovation that enhances reinforcement learning paradigms [3][5].
Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly from human feedback (RLHF), plays a vital role in the post-training phase, optimizing model outputs according to user preferences [14][19].
GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational costs compared to PPO, which requires dual networks [30][35].
- GRPO scores each sampled response against the mean reward of its group, using that baseline to evaluate improvements and thereby simplifying the training process [34][35].
Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it the more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].
Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO advances the methodology by shifting from token-level to sequence-level importance sampling, significantly improving training stability [48][49].
- GFPO allows simultaneous optimization of multiple response attributes, addressing GRPO's limitations with scalar feedback and multi-round reasoning tasks [61][63].
Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, illustrates a clear trajectory in optimizing large language models, with GRPO serving as a pivot for further advances in the field [81][82].
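The critic-free advantage estimate described above can be sketched in a few lines: each of a group's sampled responses is scored against the group's own reward statistics, so no value network is needed. An illustrative toy assuming per-group reward normalization, not the exact published formulation:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage for one prompt: G responses are sampled,
    each reward is compared to the group mean and scaled by the group
    standard deviation, replacing a learned critic's baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# One prompt, G = 4 responses scored by a reward model:
advs = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
print([round(a, 2) for a in advs])  # → [-1.18, 1.57, -0.39, 0.0]
```

Because the baseline is computed from the group itself, the advantages always sum to zero: above-average responses are reinforced, below-average ones suppressed.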
RLinf, the First Large-Scale Reinforcement Learning Framework Built for Embodied Intelligence, Open-Sourced by Tsinghua University, Beijing Zhongguancun Academy, Infinigence AI (无问芯穹), and Others
机器之心· 2025-09-01 02:49
Core Viewpoint
- The article discusses the launch of RLinf, a large-scale reinforcement learning framework designed for embodied intelligence, emphasizing its flexible and scalable architecture that integrates training, rendering, and inference [5][7].
Group 1: Development of RL Framework
- The transition in artificial intelligence from "perception" to "action" highlights the importance of embodied intelligence, which is gaining attention in both academia and industry [2][4].
- RLinf is developed collaboratively by Tsinghua University, Beijing Zhongguancun Academy, and Infinigence AI (无问芯穹), aiming to address the limitations of existing frameworks in supporting embodied intelligence [5][7].
Group 2: Features of RLinf
- RLinf's architecture consists of six layers: user, task, execution, scheduling, communication, and hardware, enabling a hybrid execution mode that achieves over 120% system speedup [7][12].
- The framework introduces a Macro-to-Micro Flow (M2Flow) mechanism, enabling flexible construction of training processes while maintaining high programming flexibility and ease of debugging [14][15].
Group 3: Execution Modes
- RLinf supports three execution modes: Collocated, Disaggregated, and Hybrid, allowing users to configure components for optimal resource utilization [19][20].
- The framework integrates low-intrusion multi-backend support to cater to the diverse needs of researchers in the embodied intelligence field [16][20].
Group 4: Communication and Scheduling
- RLinf features an adaptive communication library designed for reinforcement learning, optimizing data exchange between components to enhance system efficiency [22][28].
- An automated scheduling module minimizes resource idling by analyzing component performance and selecting the best execution mode, significantly improving training stability [24][25].
Group 5: Performance Metrics
- RLinf demonstrates superior performance on embodied intelligence tasks, achieving over 120% efficiency improvement compared to existing frameworks in specific tests [27][33].
- The framework has shown significant success-rate improvements across tasks, with models reaching up to 97.3% success in specific scenarios [31][35].
Group 6: Future Development and Community Engagement
- The RLinf team emphasizes open-source principles, providing comprehensive documentation and support to enhance user experience and facilitate collaboration [40][41].
- The team is actively recruiting for various positions to further develop and maintain RLinf, inviting community engagement and feedback [42][43].
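As a rough illustration of the three execution modes described above, consider how components might be mapped to GPUs under each. This is a toy sketch; the `Mode` enum and `place` function are hypothetical and not RLinf's actual API:

```python
from enum import Enum

class Mode(Enum):
    COLLOCATED = "collocated"        # all components share every GPU
    DISAGGREGATED = "disaggregated"  # each component owns a disjoint GPU set
    HYBRID = "hybrid"                # rollout side shared, trainer separated

def place(components, gpus, mode):
    """Toy placement of pipeline components (e.g. render/infer/train)
    onto a GPU list under each execution mode."""
    if mode is Mode.COLLOCATED:
        return {c: list(gpus) for c in components}
    if mode is Mode.DISAGGREGATED:
        n = len(gpus) // len(components)
        return {c: gpus[i * n:(i + 1) * n] for i, c in enumerate(components)}
    # Hybrid: collocate all but the last component, give the last its own GPUs
    shared, solo = components[:-1], components[-1]
    split = len(gpus) // 2
    placement = {c: gpus[:split] for c in shared}
    placement[solo] = gpus[split:]
    return placement
```

The hybrid case captures the idea in Group 3 of the first RLinf entry: time-sharing the rollout-side components keeps GPUs busy while the trainer runs on dedicated hardware, reducing idle time relative to a fully disaggregated split.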
R-Zero Deep Dive: How Does AI Self-Evolve Without Human Data?
机器之心· 2025-08-31 03:54
Core Viewpoint
- The article discusses the R-Zero framework, which enables AI models to self-evolve from "zero data" through the co-evolution of two AI roles, a Challenger and a Solver, aiming to overcome the reliance of traditional large language models on extensive human-annotated data [2][3].
Group 1: R-Zero Framework Overview
- R-Zero is designed to let AI self-generate learning tasks and improve reasoning capabilities without human intervention [11].
- The framework consists of two independent yet collaborating agents: the Challenger (Qθ) and the Solver (Sϕ) [6].
- The Challenger acts as a curriculum generator, creating tasks at the edge of the Solver's current capabilities and focusing on tasks with high information gain [6].
Group 2: Iterative Process
- The process is an iterative loop in which the Challenger is trained against a frozen Solver to generate questions that maximize the Solver's uncertainty [8].
- After each iteration, the enhanced Solver becomes the new target for the Challenger's training, leading to a spiral increase in both agents' capabilities [9].
Group 3: Implementation and Results
- The framework generates pseudo-labels through a self-consistency strategy: the Solver produces multiple candidate answers for each question, and the most frequent is selected as the pseudo-label [17].
- A filtering mechanism retains for training only questions whose accuracy falls within a specific range, enhancing the quality of the learning process [18].
- Experimental results show significant improvements in reasoning, with the Qwen3-8B-Base model's average score on mathematical benchmarks rising from 49.18 to 54.69 after three iterations (+5.51) [18].
Group 4: Generalization and Efficiency
- The model demonstrates strong generalization, with average scores on general reasoning benchmarks such as MMLU-Pro and SuperGPQA improving by 3.81 points, indicating enhanced core reasoning abilities rather than mere memorization of specific knowledge [19].
- R-Zero can serve as an efficient intermediate training stage, maximizing the value of human-annotated data when that data is used for subsequent fine-tuning [22].
Group 5: Challenges and Limitations
- A key challenge is declining pseudo-label accuracy, which dropped from 79.0% in the first iteration to 63.0% in the third, indicating increased noise in the supervisory signals as task difficulty rises [26].
- The framework's reliance on domains with objective, verifiable answers limits its applicability in areas with subjective evaluation criteria, such as creative writing [26].
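The self-consistency pseudo-labeling and filtering steps in Group 3 above can be sketched as follows. This is a minimal illustration; the function names and the retention band are assumptions, not the paper's exact values:

```python
from collections import Counter

def pseudo_label(candidate_answers):
    """Self-consistency: the Solver samples several answers to one
    Challenger question; the most frequent answer becomes the
    pseudo-label, and its frequency is an empirical consistency rate."""
    counts = Counter(candidate_answers)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(candidate_answers)

def keep_for_training(candidates, low=0.3, high=0.8):
    """Filter: retain only questions whose consistency rate falls in a
    band; near-zero agreement is noise, near-total agreement means the
    question is too easy to be informative. Band values are illustrative."""
    _, rate = pseudo_label(candidates)
    return low <= rate <= high
```

Usage: `pseudo_label(["42", "42", "17", "42", "9"])` yields `("42", 0.6)`, so the question is kept; five identical answers would yield a rate of 1.0 and be filtered out as uninformative.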
Boston Dynamics' Robot Dog Finally Has a New Trick. Engineers: We Didn't Expect It to Pull It Off Either
机器人大讲堂· 2025-08-30 14:59
Core Viewpoint
- Boston Dynamics' Spot robot has showcased impressive new capabilities, including backflips, highlighting its advanced engineering and potential applications across industries [1][3][5].
Group 1: Technical Achievements
- Spot can perform multiple backflips and other complex movements, demonstrating agility comparable to a gymnast's [3][5].
- The engineering team, led by Arun Kumar, initially doubted the feasibility of Spot performing backflips, indicating the experimental nature of the project [5].
- The training for these movements is not merely for show; it aims to ensure Spot can recover quickly from falls while carrying heavy loads in industrial settings [8][10].
Group 2: Training and Development Process
- The development process involves iterative testing in simulation before successful movements are deployed to the physical robot [11].
- The team uses reinforcement learning to push Spot's performance, achieving speeds over 5.2 meters per second, more than three times the default controller's maximum [13].
Group 3: Practical Applications
- Since its commercial launch in 2020, Spot has been used in various industrial applications, including surveying at Ford factories and safety inspections at Kia [14][17].
- Spot has also performed radiation surveys for Dominion Energy and automated inspections at Chevron facilities, showcasing its versatility across environments [16][17].
Group 4: Public Perception and Engagement
- Public performances, such as those on "America's Got Talent," aim to change perceptions of robots, presenting them as engaging and beneficial rather than threatening [20][22].
- Deployments for unusual tasks, such as delivering pizza for Domino's, illustrate Spot's adaptability and potential for diverse applications [18].
After a Year Out of Sight, Kimi's Yang Zhilin in His Latest Conversation: "Standing at the Beginning of Infinity"
创业邦· 2025-08-30 03:19
Core Viewpoint
- The article discusses the evolution and advancement of AI, focusing on the Kimi K2 model developed by Moonshot AI, and highlights both ongoing challenges and the philosophical implications of problem-solving in AI development [4][5][12].
Group 1: Kimi K2 Model Development
- The Kimi K2 model, based on the MoE architecture, represents a significant advancement in AI, enabling open-source programming and interaction with the digital world [4][5].
- The model's release in July 2025 marked a return to public attention for Moonshot AI after a period of relative silence from its founder, Yang Zhilin [4][5].
- Development involved a shift from pre-training plus supervised fine-tuning to pre-training plus reinforcement learning, which significantly changed how the company operates [27][28].
Group 2: Philosophical Insights
- Yang Zhilin emphasizes that human civilization is a continuous process of conquering problems and expanding the boundaries of knowledge, drawing inspiration from David Deutsch's book "The Beginning of Infinity" [5][12].
- The notion that every solved problem leads to new questions is central to ongoing AI development, suggesting an infinite journey of exploration and innovation [5][12].
Group 3: Technical Innovations
- The K2 model aims to maximize token efficiency, allowing the model to learn more from the same amount of data, which is crucial given the slow growth of high-quality data [29][30].
- The Muon optimizer significantly improves token efficiency, enabling the model to learn from data more effectively than traditional optimizers like Adam [30][31].
- The model's ability to perform complex tasks over extended periods without human intervention is a notable advancement, showcasing the potential for end-to-end automation in AI applications [17][44].
Group 4: Agentic Capabilities
- K2 is characterized as an agentic model, capable of multi-turn interactions and of using various tools to connect with the external world, enhancing its problem-solving capabilities [43][44].
- Multi-agent systems are highlighted as a way to improve task execution and collaboration among agents, enabling more complex problem-solving [22][44].
- The challenge of generalization in agent models is acknowledged, with ongoing efforts to improve adaptability to varied tasks and environments [34][46].
Sequoia US: The Five AI Tracks We Will Focus On in the Coming Year
Founder Park· 2025-08-29 12:19
Core Viewpoint
- Sequoia Capital believes the AI revolution will be a transformative change comparable to the Industrial Revolution, presenting a $10 trillion opportunity in the service industry, of which only $20 billion is currently automated by AI [2][11].
Investment Themes
- Sequoia will focus on five key investment themes over the next 12-18 months: persistent memory, communication protocols, AI voice, AI security, and open-source AI [2][30].
Historical Context
- The article draws parallels between the current AI revolution and the milestones of the Industrial Revolution, emphasizing the importance of specialization in the development of complex systems [5][7][10].
Market Potential
- The U.S. service industry is valued at $10 trillion, with only $20 billion currently impacted by AI, indicating a massive growth opportunity [11][13].
Investment Trends
- Five observed investment trends:
  1. Leverage over certainty: AI agents can significantly increase productivity despite some uncertainty [21].
  2. Real-world validation of AI capabilities, moving beyond academic benchmarks [23].
  3. Practical application of reinforcement learning in industry [25].
  4. AI's integration into the physical world, enhancing processes and hardware [27].
  5. Computing as a new productivity function, with knowledge workers' computational needs expected to increase dramatically [29].
Focus Areas for Investment
- Persistent memory is crucial for AI to integrate deeply into business processes, with ongoing challenges in this area [31].
- Seamless communication protocols are needed for AI agents to collaborate effectively, similar to the TCP/IP standard in the internet revolution [34].
- AI voice technology is maturing, with applications in consumer and enterprise sectors [36][37].
- AI security presents a significant opportunity across the development and consumer-usage spectrum [39].
- Open-source AI is at a critical juncture, with the potential to compete with proprietary models and foster a more open future [41].
No Wonder It's a Chinese Robot: Its Table Tennis Game Is Superb
量子位· 2025-08-29 11:37
Core Viewpoint
- The article discusses advancements in humanoid robots, focusing on a table tennis robot developed by Tsinghua University students and showcasing the high-level table tennis skills it achieves through a combination of hierarchical planning and reinforcement learning [7][8].
Group 1: Robot Performance
- The robot responds with a reaction time of 0.42 seconds and has achieved a maximum of 106 consecutive hits during a match [3][5][23].
- In real-world tests, the robot successfully returned 24 of 26 balls, achieving a hitting rate of 96.2% and a return rate of 92.3% [21].
Group 2: Technical Framework
- The research team proposed a hierarchical framework that separates high-level planning from low-level control, allowing the robot to predict ball trajectories and execute human-like movements [9][11].
- A model-based planner predicts the ball's position, speed, and timing, while a reinforcement-learning-based controller generates coordinated movements [10][16].
Group 3: Training Methodology
- The robot was trained on a standard table tennis setup, with its hand modified to function as a paddle [13].
- Training incorporated human motion references to encourage the robot to mimic human-like swings [18][19].
Group 4: Challenges in Robotics
- Table tennis is a challenging sport for robots because it demands rapid perception, prediction, planning, and execution within a very short time frame [29][30].
- The sport requires agile full-body movement, including quick arm swings, waist rotation, and balance recovery, making it a complex task for humanoid robots [32][33].
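The planner/controller split in Group 2 above can be illustrated with a minimal model-based prediction step: the high-level planner estimates where and when the ball will cross the robot's hitting plane, and hands that target to the low-level controller. A hypothetical sketch that ignores spin and air drag, not the team's actual model:

```python
def predict_interception(p0, v0, x_plane, g=9.81):
    """Toy model-based planner: given the ball's position p0 = (x, y, z)
    and velocity v0 (meters, m/s), solve simple ballistic flight for the
    time it crosses the hitting plane x = x_plane, and return the
    predicted hit point and hit time for the controller to track."""
    t = (x_plane - p0[0]) / v0[0]                # time to reach the plane
    y = p0[1] + v0[1] * t                        # lateral drift
    z = p0[2] + v0[2] * t - 0.5 * g * t ** 2     # height under gravity
    return (x_plane, y, z), t
```

With a 0.42-second reaction budget, the point of this split is that the cheap analytic prediction runs every frame, while the learned whole-body controller only has to reach the predicted point in time.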