LaPha: Is Your Agent's Trajectory Actually Embedded in a Poincaré Ball?
机器之心 · 2026-03-18 03:35
Core Insights
- The article discusses the limitations of traditional reinforcement learning (RL) for language models, highlighting the infinite action space and the challenge of semantic equivalence in decision-making [6][8].
- It introduces LaPha (Latent Poincaré Shaping for Agentic Reinforcement Learning), an approach that maps the agent's behavior tree into the latent space of the language model and uses geometric distances to define potential functions and construct dense process rewards [8][12].

Group 1: Limitations of Traditional RL
- In classic RL problems, the action space is typically discrete and finite, allowing for structured decision-making [2][3].
- Language models, by contrast, face an effectively infinite action space: the same semantic decision can be expressed in many surface forms, producing a very high branching factor in the search tree [7][8].
- This is compounded by RL's sparse-reward problem: only a few paths are validated as correct, so feedback is weak and credit assignment is difficult [8].

Group 2: LaPha Approach
- LaPha addresses these issues by defining potential functions and rewards geometrically within the latent space of the language model [8][12].
- For each search node, the last hidden layer of the LLM is averaged to form a state vector, which is centered on the prompt's hidden vector; all state vectors are then mapped into the Poincaré ball [14].
- This enables pruning strategies and the training of a value network with minimal overhead, significantly improving test-time scaling [12][20].

Group 3: Performance Metrics
- LaPha's effectiveness is demonstrated on benchmarks, with significant gains across competitions and assessments, such as an improvement from 66.0% to 88.2% on MATH-500 and Gaokao'23 [24].
- The method also notably increases coverage and exploration efficiency by clustering non-terminal nodes based on hyperbolic distances, reducing semantic redundancy [28].

Group 4: Technical Implementation
- The potential function is constructed to provide dense rewards, transforming sparse validation signals into intermediate learning signals [19][23].
- A lightweight value head over the pooled hidden state is trained to fit the potential function and guides Monte Carlo Tree Search (MCTS) at test time [25][26].
- Training curves indicate that the value head learns information independent of the policy head, enhancing overall performance [27].
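The Group 2 pipeline (mean-pool the last hidden layer, center on the prompt's hidden vector, map into the Poincaré ball) can be sketched as below. This is a minimal illustration assuming the standard exponential map at the origin of the unit ball; the function names and pooling details are hypothetical, not LaPha's actual code.

```python
import numpy as np

def exp_map_origin(v):
    """Exponential map at the origin of the unit Poincaré ball:
    sends a Euclidean (tangent) vector to a point with norm tanh(||v||) < 1."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    return np.tanh(norm) * v / norm

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    diff2 = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * diff2 / denom)

def node_state(last_hidden, prompt_vector):
    """State vector for a search node: mean-pool the LLM's last hidden layer
    (tokens x dim), center it on the prompt's pooled vector, map into the ball."""
    return exp_map_origin(last_hidden.mean(axis=0) - prompt_vector)
```

The resulting pairwise hyperbolic distances between node states are what the article describes being used both for the potential function and for clustering non-terminal nodes to prune semantically redundant branches.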
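Group 4's dense rewards follow the classic potential-based shaping scheme: a term γΦ(s′) − Φ(s) is added to the sparse outcome reward, densifying feedback while provably leaving the optimal policy unchanged (Ng et al., 1999). A sketch, assuming a hypothetical Φ built from hyperbolic distance to the root/prompt embedding — the article only says the potential is defined from geometric distances in the ball, so this exact Φ is an illustration:

```python
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance in the unit Poincaré ball."""
    diff2 = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * diff2 / denom)

def potential(z, z_root):
    # Hypothetical Phi: reward progress away from the root embedding.
    return poincare_distance(z, z_root)

def shaped_reward(r, z_s, z_next, z_root, gamma=0.99):
    """Dense reward: sparse outcome reward r plus the shaping term
    gamma * Phi(s') - Phi(s), turning terminal validation into step-wise signal."""
    return r + gamma * potential(z_next, z_root) - potential(z_s, z_root)
```

Along a full trajectory the shaping terms telescope, so the shaped return differs from the original only by boundary potentials — which is why dense signal can be injected without changing which trajectory is optimal.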
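The lightweight value head over the pooled hidden state can be approximated as a single linear probe fit to potential targets. Below is a toy sketch using synthetic data and closed-form ridge regression (the paper's actual loss and architecture may differ); the fitted head then ranks candidate children during test-time MCTS:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Synthetic stand-ins: pooled hidden states of visited nodes and their
# potential-function targets Phi(s) (in the paper these come from ball geometry).
H = rng.normal(size=(256, d))
true_w = rng.normal(size=d)
phi_targets = H @ true_w + 0.05 * rng.normal(size=256)

# Lightweight value head: one linear layer, fit in closed form (ridge regression).
lam = 1e-2
w = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ phi_targets)

def value_head(h):
    """Predict Phi for a pooled hidden state; used to score MCTS children."""
    return h @ w

# Test-time guidance: expand the child with the highest predicted potential.
children = rng.normal(size=(5, d))
best_child = int(np.argmax(value_head(children)))
```

Because the head is a single linear map over states the policy already computes, scoring every frontier node adds almost no overhead — consistent with the article's claim that the value network trains and runs with minimal cost.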