Last Spot Remaining! Applications of Reinforcement Learning to Humanoids, Quadrupeds, Robotic Arms, and More
具身智能之心· 2025-10-21 00:03
Core Insights
- Reinforcement Learning (RL) remains a significant field, with growing applications in robotics, including humanoid and quadrupedal robots, as well as in product optimization across various industries [1][2][3]
- The complexity of RL poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][6][9]

Group 1: Importance of Reinforcement Learning
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capabilities [2]
- Companies like Yushu and Zhiyuan use RL to enable humanoid robots to perform complex actions such as climbing stairs, running, and dancing, opening applications in rescue and hazardous environments [2][8]

Group 2: Challenges in Learning and Research
- The breadth and intricacy of RL make it hard for beginners to enter the field, often leading to frustration and abandoned study [5][9]
- Producing a paper that survives peer review requires proficiency in methodology, experimental results, and writing; any misstep can draw low scores from reviewers [5][6]

Group 3: Educational Initiatives
- To lower the entry barriers to RL research, a specialized 1v6 mentoring course has been launched, targeting graduate students and others who need guidance in paper writing [6][7]
- The course includes weekly live sessions, project implementation, experimental guidance, and writing refinement, aiming to help participants produce a draft suitable for submission to top conferences and journals [7][9][15]

Group 4: Course Structure and Content
- The course spans 14 weeks of intensive online training followed by 8 weeks of follow-up support, covering various aspects of RL and robotics [9][15]
- Key topics include foundational RL concepts, simulation environments, sim2real techniques, and writing guidance, structured so that participants reach measurable milestones [15][19][20]
腾讯研究院 AI Digest 20251021
腾讯研究院· 2025-10-20 16:01
Group 1: Oracle's AI Supercomputer
- Oracle launched the world's largest cloud AI supercomputer, OCI Zettascale10, consisting of 800,000 NVIDIA GPUs and achieving a peak performance of 16 ZettaFLOPS; it serves as the core computing power for OpenAI's "Stargate" cluster [1]
- The supercomputer uses a custom Acceleron RoCE network architecture that significantly reduces GPU-to-GPU communication latency and switches paths automatically on failure [1]
- Services are expected to reach customers in the second half of 2026; the peak figure may be based on low-precision metrics and still needs validation in practical applications [1]

Group 2: Google's Gemini 3.0
- Google's Gemini 3.0 appears to have surfaced in the LMArena under the aliases lithiumflow (Pro version) and orionmist (Flash version), with Gemini 3 Pro reportedly the first AI model capable of accurately reading clock faces [2]
- Testing shows Gemini 3 Pro excels at SVG drawing and music composition, mimicking musical styles while keeping rhythm, with markedly improved visual performance over previous versions [2]
- Despite the notable gains in model capability, evaluation methods in the AI community remain traditional, lacking innovative assessment techniques [2]

Group 3: DeepSeek's OCR Model
- DeepSeek has open-sourced a 3-billion-parameter OCR model, DeepSeek-OCR, which maintains 97% accuracy at under 10x compression and around 60% accuracy at 20x compression [3]
- The model pairs DeepEncoder (380M parameters) with a DeepSeek 3B-MoE decoder (570M activated parameters), outperforming GOT-OCR2.0 on OmniDocBench using only 100 visual tokens [3]
- A single A100-40G GPU can generate over 200,000 pages of LLM/VLM training data per day, with recognition support for nearly 100 languages, showcasing its efficient visual-text compression [3]

Group 4: Yuanbao AI Recording Pen
- Yuanbao has introduced a new AI recording-pen feature that uses Tencent's Tianlai noise-reduction technology for clear, accurate recording and transcription without additional hardware [4]
- The "Inner OS" feature interprets a speaker's underlying thoughts and nuances, helping users stay focused on the core content of meetings or conversations [4]
- Recordings can intelligently separate multiple speakers within a single audio segment, improving the clarity of meeting notes without repeated listening [4]

Group 5: Vidu's Q2 Features
- Vidu's Q2 reference-generation feature launched globally on October 21 with inference three times faster than Q1, supporting multi-subject consistency and precise semantic understanding while maintaining 1080p HD video quality [5][6]
- The video-extension feature lets free users generate videos up to 30 seconds long and paid users extend videos up to 5 minutes, supporting text-to-video, image-to-video, and reference-based generation [6]
- The Vidu app has been comprehensively redesigned, shifting from an AI creation platform to a one-stop AI content social platform with a large subject library for collaborative video generation [6]

Group 6: Gemini's Geolocation Intelligence
- Google has opened the Gemini API's Google Maps integration to all developers, providing location awareness for 250 million places at $25 per 1,000 grounded prompts [7]
- The feature supports the Gemini 2.5 Flash-Lite, 2.5 Pro, 2.5 Flash, and 2.0 Flash models, covering scenarios such as restaurant recommendations, route planning, and travel itineraries, with real-time traffic and business-hours queries [7]
- The move signals AI's shift from static tools to dynamic "intelligent spaces"; domestic competitor Amap has already launched similar smart applications [7]

Group 7: AI Trading Experiment
- The Alpha Arena experiment by nof1.ai allocated $10,000 each to GPT-5, Gemini 2.5 Pro, Claude 4.5 Sonnet, Grok 4, Qwen3 Max, and DeepSeek V3.1 for real-market trading; DeepSeek V3.1 ranked first with over $3,500 in profit [8]
- DeepSeek secured the highest returns with only five trades, Grok-4 followed closely with one trade, and Gemini 2.5 Pro lost the most across 45 trades [8]
- The experiment treats the financial market as an ultimate test of intelligence, emphasizing survival under uncertainty rather than raw cognitive capability [8]

Group 8: Robotics Development
- Yushu has released its fourth humanoid robot, H2, standing 180 cm tall and weighing 70 kg (BMI 21.6), with 31 joints, about 19% more than the R1 model [9]
- H2 significantly upgrades movement fluidity and bionic features, performing ballet and martial arts, and its "face" has earned it the title of "the most human-like bionic robot" [9]
- Compared with its predecessor H1, H2's joint control and balance algorithms are greatly optimized, expanding its applications from industrial automation to entertainment and companionship services [9]

Group 9: Karpathy's Insights on AGI
- In a podcast, Karpathy said achieving AGI may still take a decade, a view 5-10 times more cautious than the prevailing optimism in Silicon Valley [10]
- He criticized the inefficiency of reinforcement learning, likening it to "sucking supervision signals through a straw," and highlighted its susceptibility to noise and interference [10]
- He introduced the concept of a "cognitive core," suggesting future models will first grow larger and then shrink into smaller, more specialized cognitive nuclei [11]
Karpathy Responds to the Controversy: RL Isn't Really a Dead End, and the Prediction That Agents Need Another Decade Is Actually Optimistic
Founder Park· 2025-10-20 12:45
Group 1
- Andrej Karpathy's core view is that Artificial General Intelligence (AGI) is still a long way off; in the current hype environment, a roughly ten-year timeline actually counts as optimistic [10][21][23]
- Karpathy acknowledges the significant progress in Large Language Models (LLMs) but emphasizes that considerable work remains before AI can outperform humans at any job [11][12]
- He critiques the current state of LLMs as cognitively flawed and overly reliant on pre-training data, which may not be a sustainable way to learn [13][14]

Group 2
- Karpathy is skeptical of the effectiveness of reinforcement learning (RL), arguing it has a poor signal-to-noise ratio and is often misapplied [15][16]
- He proposes that future learning paradigms should center on agentic interaction rather than relying solely on RL, signaling a shift toward more effective learning mechanisms [15][16]
- The concept of a "cognitive core" is introduced, suggesting LLMs should be slimmed down to improve generalization and reduce excessive reliance on memorization [19]

Group 3
- Karpathy critiques current autonomous-agent development, advocating a more collaborative approach in which LLMs assist rather than operate independently [20][21]
- He believes the next decade will be crucial for the evolution of agents, with significant capability improvements expected [21][22]
- The discussion stresses the need for realistic expectations about agents and warns against overestimating their current capabilities [20][21]

Group 4
- Karpathy emphasizes the limits of LLMs in coding tasks, noting that they often misread context and produce suboptimal code [47][48]
- While LLMs can assist in common coding scenarios, they struggle with unique or complex implementations that deviate from familiar patterns [48][49]
- The conversation reveals a gap between LLM capabilities and expectations for their role in software development, pointing to the need for further advances [52]
LLM Memory Management Finally Doesn't Need Hand-Holding: A New Framework Lets Agents Manage Their Memory Systems Autonomously
量子位· 2025-10-20 10:29
Core Insights
- The article introduces Mem-α, an innovative reinforcement learning framework that enables large language models (LLMs) to autonomously manage complex memory systems, moving away from manual design and predefined instructions [2][4][14]

Memory Management Challenges
- Traditional memory-enhanced agents depend on predefined instructions and tools for memory updates, which can lead to suboptimal memory construction and information loss, particularly in long-term interactions [7][8][9]
- LLMs are constrained by finite context windows, making external memory systems crucial for understanding long-term information [5][6]

Mem-α Framework
- Mem-α casts memory construction as a sequential decision-making problem optimized through reinforcement learning, letting agents explore effective memory-management strategies while processing information [14][16]
- Inspired by cognitive science, the framework's memory system combines core memory, episodic memory, and semantic memory, each supporting its own memory operations [20][22]

Training and Evaluation
- Mem-α uses a multi-dimensional reward function to optimize memory construction, targeting accurate retrieval, test-time learning, long-range understanding, and conflict resolution [18][28]
- Experiments show Mem-α significantly outperforms existing methods, achieving higher accuracy and efficient memory usage while maintaining performance [35][36]

Key Findings
- Mem-α leads on all tasks, particularly accurate retrieval and long-range understanding, indicating strong generalization [35]
- The framework cuts memory usage by roughly 50% compared with traditional methods while improving performance, validating its semantic compression mechanism [35]
- Mem-α's structured architecture proves essential for processing complex information, exposing the limitations of flat memory representations [35]
- Mem-α generalizes robustly to documents exceeding 400K tokens, despite being trained on documents averaging under 30K tokens [35]
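The multi-dimensional reward idea above can be illustrated with a minimal sketch. This is a hedged toy version, not Mem-α's actual formula: the components (answer accuracy and a token-budget penalty) and the weights `w_acc`/`w_cost` are illustrative assumptions chosen to show how retrieval quality and memory efficiency can be traded off in a single scalar reward.

```python
def memory_reward(correct_answers: int, total_questions: int,
                  memory_tokens: int, budget_tokens: int,
                  w_acc: float = 1.0, w_cost: float = 0.1) -> float:
    """Toy multi-dimensional reward: retrieval accuracy minus a penalty
    for memory that overflows its token budget (weights are illustrative)."""
    accuracy = correct_answers / total_questions
    overflow = max(0, memory_tokens - budget_tokens) / budget_tokens
    return w_acc * accuracy - w_cost * overflow

# An agent that answers 8/10 questions but doubles its memory budget
# earns less than one that answers the same with compact memory.
r_bloated = memory_reward(8, 10, memory_tokens=2000, budget_tokens=1000)
r_compact = memory_reward(8, 10, memory_tokens=800, budget_tokens=1000)
```

An RL optimizer maximizing such a reward is pushed toward the behavior the article reports: keeping retrieval accuracy high while compressing memory, rather than hoarding every token.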
The 具身智能之心 Community Group Is Now Live! Covering VLA, RL, Navigation, Data Collection, and More
具身智能之心· 2025-10-20 10:00
Group 1
- The establishment of a technical exchange group focused on embodied intelligence has been announced, inviting participation from various stakeholders in the field [1]
- The group encompasses nearly 20 sub-directions, indicating a broad scope of interest and expertise within the embodied-intelligence domain [1]
- Participants are encouraged to discuss humanoid robots, quadrupeds, robotic arms, and advanced technologies such as VLA, large models, VLN, reinforcement learning, mobile manipulation, multi-modal perception, simulation, and data collection [1]
NeurIPS 2025 | CMU, Tsinghua, and UT Austin Open-Source ReinFlow: Fine-Tuning Robot Flow-Matching Policies with Online RL
机器之心· 2025-10-20 09:15
Core Insights
- The article discusses ReinFlow, an online reinforcement learning framework for fine-tuning flow-matching policies, which has been accepted at NeurIPS 2025 and is open-sourced with comprehensive documentation [2][5][27]

Group 1: ReinFlow Overview
- ReinFlow is a general framework applicable to any policy defined by an ordinary differential equation, such as Rectified Flow and Shortcut Models, and supports inference with minimal steps [12]
- The framework cuts training time by over 60% compared with DPPO while maintaining similar performance [14][16]

Group 2: Algorithm Characteristics
- ReinFlow applies policy-gradient theory by converting the deterministic flow into a discrete-time Markov process, optimizing the entire flow-matching chain [5][7]
- The algorithm injects a small amount of learnable noise into the flow policy's deterministic path, yielding a stochastic diffusion process that encourages exploration while bounding deviation from the pre-trained policy [8][10]

Group 3: Performance Metrics
- On D4RL locomotion tasks, ReinFlow-fine-tuned Rectified Flow policies achieved an average net performance gain of 135.36% while reducing wall-clock fine-tuning time by 82.63% [16]
- On long-horizon manipulation tasks, ReinFlow-fine-tuned Shortcut Model policies improved success rates by an average of 40.34% with fewer diffusion steps, saving an average of 23.20% in training time [18]

Group 4: Experimental Validation
- Ablation studies assessed how various factors affect training outcomes, demonstrating that RL fine-tuning improves performance beyond what data augmentation alone provides [24]
- The framework has been validated across multiple benchmark tasks, showing significant gains over pre-trained models [14]

Group 5: Open Source and Future Directions
- ReinFlow's GitHub project is fully open-sourced and actively maintained, providing a complete codebase, model checkpoints, and detailed documentation for community engagement [27]
- Future updates will add support for more flow models and classic RL environments, plus comprehensive installation and usage guides [29]
AI Tears Off the Fig Leaf of "Pseudo-Work"
Hu Xiu· 2025-10-20 08:21
Core Insights
- Current AI development may lead either to AGI or to a more sophisticated word predictor, a distinction that significantly shapes market psychology [2]
- An MIT report found that 95% of corporate AI investments yielded zero returns, suggesting fragile market sentiment [2]
- AI's potential to replace low-level white-collar jobs could free humans for more meaningful work, but many individuals may struggle to adapt [3]

Group 1
- The debate over AI's trajectory is crucial because it determines whether current advances lead to AGI or merely enhance predictive capabilities [2]
- Expert opinion on AI's future strongly influences market sentiment, with pessimistic views highlighting the risks of overvaluation [2]
- The notion that AI can handle trivial tasks suggests it may replace jobs that do not use higher-level human intelligence [2][3]

Group 2
- In the short term, AI adoption may boost capital profits, but in the long run it could depress overall demand as wealth distribution tilts toward capital [4]
- Historical context suggests the gains of the first internet boom took about a decade to materialize, raising concerns about a downturn in the current AI cycle [4]
- The market's resilience may prove more critical than the initial explosive growth of AI technologies [4]
Andrej Karpathy: The Decade-Long War of AI Agents, the Dilemma of Reinforcement Learning, and the Awakening of the "Digital Ghost"
锦秋集· 2025-10-20 07:00
Group 1
- The article's core viewpoint is that the current era is not the "year of agents" but the "decade of agents," emphasizing long-term evolution of AI capabilities over immediate breakthroughs [1][6][7]
- AI needs to develop four critical modules to become a fully functional intelligent agent: multimodal perception, memory systems, continuous learning, and action interfaces [1][8][15]
- The next phase of AI development will focus on self-reflection, allowing AI to review its outputs and learn from its mistakes rather than merely imitating human behavior [2][20][21]

Group 2
- The article traces three key paradigm shifts in AI history: the perception revolution, the action revolution, and the representation revolution, each taking years to mature [10][12][14]
- The evolution of intelligent agents will not happen overnight; it will require a decade of systematic engineering and integration of capabilities [4][9]
- The limitations of reinforcement learning are discussed, highlighting its inefficiency and the need for more nuanced feedback mechanisms [20][46][50]

Group 3
- AI should be viewed as a cognitive collaborator rather than a competitor, pointing toward a future of human-AI symbiosis [52][56]
- The next decade will focus on "taming" AI, establishing societal rules and values to ensure safe and reliable AI interactions [54][58]
- The conclusion emphasizes that this decade will not be about AI taking over the world, but about humans redefining their roles in collaboration with intelligent systems [56][58]
The MuJoCo Tutorial Is Here! From Zero Basics to Reinforcement Learning to Sim2Real
具身智能之心· 2025-10-20 00:03
Core Insights
- The article argues that AI is at a pivotal moment, moving from early symbolic reasoning through deep-learning breakthroughs to the rise of embodied intelligence, which is redefining human-machine relationships [1][3]

Group 1: Embodied Intelligence
- Embodied intelligence describes machines that can understand language commands, navigate complex environments, and make intelligent real-time decisions, moving beyond purely virtual space [1]
- Major tech companies like Tesla, Boston Dynamics, OpenAI, and Google are actively developing technologies in this disruptive field, indicating a competitive landscape [1][3]
- Its potential impact spans industries including manufacturing, healthcare, and space exploration, suggesting a transformative effect on the economy and society [1]

Group 2: Technical Challenges and Solutions
- Achieving true embodied intelligence poses unprecedented technical challenges, requiring advances in algorithms, physical simulation, robot control, and perception fusion [3]
- MuJoCo (Multi-Joint dynamics with Contact) is highlighted as a critical enabling technology: a high-fidelity simulation engine that bridges virtual and real-world environments [4][6]
- MuJoCo lets researchers run millions of trials in simulation, dramatically accelerating the learning process while minimizing risk to physical hardware [6][8]

Group 3: MuJoCo's Advantages
- MuJoCo's advanced contact-dynamics algorithms enable precise simulation of complex interactions between robots and their environments, making it a standard tool in academia and industry [4][8]
- The engine is highly parallelizable, allowing thousands of simulations to run simultaneously and improving training efficiency [4][6]
- Its stability and numerical accuracy support reliable long-horizon simulations, making it a preferred choice among leading tech companies [4][6]

Group 4: Educational Initiatives
- A comprehensive MuJoCo development tutorial has been created, combining practical applications with theoretical foundations in the context of embodied intelligence [9][11]
- The course is structured into six modules, each with specific learning objectives and practical projects, ensuring thorough coverage of the technology stack [15][17]
- Participants work through hands-on projects ranging from basic robotic-arm control to complex multi-agent systems, building both theoretical knowledge and practical skills [19][29]

Group 5: Target Audience and Outcomes
- The course is designed for people with programming or algorithm backgrounds entering embodied robotics, as well as students and professionals seeking stronger practical capabilities [32][33]
- Upon completion, participants will possess a complete embodied-intelligence skill set, including proficiency in MuJoCo, reinforcement learning, and real-world application of simulation techniques [32][33]
- The program aims to cultivate a combination of technical, engineering, and innovative skills for tackling complex problems in the field [33]
Stable Training and Data Efficiency: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for Flow Policies
具身智能之心· 2025-10-20 00:03
Core Viewpoint
- The article introduces SAC Flow, a data-efficient reinforcement learning method that trains flow-based policies end-to-end without surrogate objectives or policy distillation, achieving high data efficiency and state-of-the-art performance on various benchmarks [1][4][20]

Group 1: Research Background
- Flow-based policies are gaining popularity in robot learning because they model multi-modal action distributions and are simpler than diffusion policies; they are widely used in advanced VLA models [4]
- Previous attempts to train flow policies with off-policy reinforcement learning (RL) often suffered gradient explosion caused by the multi-step sampling process inherent in flow policies [4][5]

Group 2: Methodology
- SAC Flow treats a flow policy as a sequential model, borrowing modern recurrent structures such as GRU and Transformer to stabilize training and optimize the policy directly within an off-policy framework [7][10]
- SAC Flow adds Gaussian noise and a drift correction at each rollout step so the final action distribution is unchanged, allowing the actor/critic losses to be expressed via the log-likelihood of the flow policy's multi-step sampling [14]

Group 3: Training Paradigms
- Two training paradigms are supported: from-scratch training for dense-reward tasks, where SAC Flow is trained directly [18], and offline-to-online training for sparse-reward tasks, where pre-training on a dataset is followed by online fine-tuning [18][20]

Group 4: Experimental Results
- SAC Flow-T and SAC Flow-G converged stably and faster in environments such as Hopper, Walker2D, and Ant, achieving state-of-the-art performance [20][21]
- Offline-to-online results show SAC Flow keeps gradients stable and prevents gradient explosion, outperforming naive SAC training [24][26]

Group 5: Comparison with Similar Works
- SAC Flow outperforms existing methods such as FlowRL and diffusion policies in convergence speed and efficiency, particularly on challenging sparse-reward tasks [30][31]
- The method preserves the modeling power of flow policies without distilling them into single-step models, a common workaround in other approaches [31]

Group 6: Key Takeaways
- SAC Flow's key attributes are serialization, stable training, and data efficiency, enabling off-policy RL algorithms to train flow policies directly and effectively [32]
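The "flow policy as sequential model" view above can be sketched in a few lines. This is a hedged illustration, not the paper's architecture: the velocity field is a small fixed function standing in for a learned network, and the point is only that K Euler integration steps unroll exactly like a K-step recurrent network, which is what lets RNN-style stabilization tricks apply.

```python
import numpy as np

def velocity(x, t, W):
    # Placeholder velocity field v(x, t); in SAC Flow this would be a
    # learned network (illustrative assumption for this sketch).
    return np.tanh(W @ x) * (1.0 - t)

def flow_policy_as_sequence(x0, W, num_steps=4):
    """Unroll the flow ODE like an RNN: each Euler step
    x_{k+1} = x_k + v(x_k, t_k) * dt is one 'recurrent cell' update."""
    dt = 1.0 / num_steps
    x = x0
    states = [x0]
    for k in range(num_steps):
        x = x + velocity(x, k * dt, W) * dt  # one cell of the unrolled chain
        states.append(x)
    return x, states  # final action and the full unrolled trajectory

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2)) * 0.5
action, traj = flow_policy_as_sequence(rng.standard_normal(2), W)
```

Backpropagating through this unrolled chain is where naive training explodes, the same vanishing/exploding-gradient problem RNNs face, which motivates swapping the plain Euler cell for gated (GRU-like) or attention-based (Transformer-like) updates as SAC Flow does.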