Reinforcement Learning
Reading the Rise and Fall of Research Directions from Nearly 30 Embodied-Intelligence Surveys (VLA, VLN, Reinforcement Learning, Diffusion Policy, and More)
具身智能之心· 2025-07-11 00:57
Core Insights
- The article provides a comprehensive overview of various surveys and research papers related to embodied intelligence, focusing on areas such as vision-language-action models, reinforcement learning, and robotics applications [1][2][3][4][5][6][8][9]

Group 1: Vision-Language-Action Models
- A survey on Vision-Language-Action (VLA) models highlights their significance in autonomous driving and human motor learning, discussing progress, challenges, and future trends [2][3][8]
- The exploration of VLA models emphasizes their applications in embodied AI, showcasing a variety of datasets and methodologies [5][8][9]

Group 2: Robotics and Reinforcement Learning
- Research on foundation models in robotics addresses applications, challenges, and future directions, indicating a growing interest in integrating AI with robotic systems [3][4]
- Deep reinforcement learning is identified as a key area with real-world successes, suggesting its potential for enhancing robotic capabilities [3][4]

Group 3: Multimodal and Generative Approaches
- The article discusses multimodal fusion and vision-language models, which are crucial for improving robot vision and interaction with the environment [6][8]
- Generative artificial intelligence in robotic manipulation is highlighted as an emerging field, indicating a shift towards more sophisticated AI-driven solutions [6][8]

Group 4: Datasets and Community Engagement
- The article encourages engagement with a community focused on embodied intelligence, offering access to a wealth of resources, including datasets and collaborative projects [9]
What Changed in the AI Agent Field in the First Half of 2025, and What Opportunities Emerged?
Hu Xiu· 2025-07-11 00:11
Core Insights
- The rapid development of AI Agents has ignited a trend of "everything can be an Agent," particularly evident in the competitive landscape of model development and application [1][2][10]
- Major companies like OpenAI, Google, and Alibaba are heavily investing in the Agent space, with new products emerging that enhance user interaction and decision-making capabilities [2][7][8]
- The evolution of AI applications is categorized into three phases: prompt-based interactions, workflow-based systems, and the current phase of AI Agents, which emphasize autonomous decision-making and tool usage (a rough sketch of the three phases follows this summary) [17][19]

Group 1: Model Development
- The AI sector has entered an "arms race" for model development, with significant advancements marked by the release of models like DeepSeek, o3 Pro, and Gemini 2.5 Pro [5][6][14]
- The introduction of DeepSeek has demonstrated that there is no significant gap between domestic and international model technologies, prompting major players to accelerate their model strategies [6][10]
- The focus has shifted from "pre-training" to "post-training" methods, utilizing reinforcement learning to enhance model performance even with limited labeled data [11][13]

Group 2: Application Development
- The launch of OpenAI's Operator and Deep Research has marked 2025 as the "Year of AI Agents," with a surge in applications that leverage these capabilities [7][8]
- Companies are exploring various applications of AI Agents, with notable examples including Cursor and Windsurf, which have validated product-market fit in the programming domain [9][21]
- The ability of Agents to use tools effectively has been a significant breakthrough, allowing for enhanced information retrieval and interaction with external systems [20][21]

Group 3: Challenges and Opportunities
- Despite advancements, AI Agents face challenges such as context management, memory mechanisms, and interaction with complex software systems [39][40]
- The future of Agent applications may involve evolving business models, potentially shifting from subscription-based to usage-based or outcome-based payment structures [40][41]
- The industry is witnessing a competitive landscape where vertical-specific Agents may offer more value due to their specialized knowledge and closer user relationships [42][46]
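The three application phases described above can be made concrete with a short sketch. The snippet is illustrative only: `call_llm` is a hypothetical stand-in for any chat-completion API, and the single `web_search` tool and the FINAL/TOOL reply convention are assumptions, not any particular product's protocol.

```python
# Illustrative contrast between the three phases: prompt, workflow, agent.
# call_llm() and the web_search tool are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion API call; wire up a real client here."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Placeholder tool; a real agent would hit a search API."""
    return f"(search results for: {query})"

TOOLS = {"web_search": web_search}

# Phase 1: prompt-based -- one request, one response, no tools.
def prompt_phase(task: str) -> str:
    return call_llm(task)

# Phase 2: workflow-based -- a fixed, human-designed chain of model calls.
def workflow_phase(task: str) -> str:
    plan = call_llm(f"Draft a step-by-step plan for: {task}")
    return call_llm(f"Carry out this plan and report the result:\n{plan}")

# Phase 3: agent -- the model itself decides when to call a tool and when to stop.
def agent_phase(task: str, max_steps: int = 8) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = call_llm("\n".join(transcript) +
                         "\nAnswer with 'FINAL: <answer>' or 'TOOL: <name> | <input>'.")
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        name, tool_input = reply[len("TOOL:"):].split("|", 1)
        transcript.append(f"{name.strip()} -> {TOOLS[name.strip()](tool_input.strip())}")
    return call_llm("\n".join(transcript) + "\nGive your best final answer.")
```

The structural difference is that only the third version hands control flow to the model, which is what the article means by autonomous decision-making and tool usage.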
So This Is How a Student from a Non-985/211 University Published Their First CVPR Paper!
具身智能之心· 2025-07-10 13:16
Core Insights
- The article highlights the success story of a student who, despite lacking guidance, managed to publish a paper in CVPR25 through proactive efforts and support from a service provider [1]
- The emphasis is placed on the importance of taking initiative and being diligent in research endeavors [1]

Group 1: Student Success Case
- A student with no guidance successfully published a paper in CVPR25 after 10 months of communication, experimentation, and writing [1]
- The student's proactive approach and willingness to work hard were crucial to overcoming the lack of mentorship [1]

Group 2: Service Offerings
- The company offers comprehensive support for research and publication, covering various stages from idea generation to submission [1]
- Specific research areas for guidance include large models, visual language navigation, reinforcement learning, and more [1]
- The service provides tiered pricing based on the level of the paper, including top conferences and journals, as well as various academic categories [2]
These End-to-End VLA Salaries Have Me Tempted...
自动驾驶之心· 2025-07-10 12:40
Core Viewpoint
- End-to-End Autonomous Driving (E2E) is the core algorithm for mass-produced intelligent driving, marking a new phase in the industry with significant advancements and competition following the recognition of UniAD at CVPR [2]

Group 1: E2E Autonomous Driving Overview
- E2E can be categorized into single-stage and two-stage approaches, directly modeling from sensor data to vehicle control information and thus avoiding the error accumulation seen in modular methods (a toy single-stage sketch follows this summary) [2]
- The emergence of BEV perception has bridged gaps between modular methods, leading to a significant technological leap [2]
- The rapid development of E2E has led to a surge in demand for VLM/VLA expertise, with potential annual salaries reportedly reaching into the millions of RMB [2]

Group 2: Learning Challenges
- The fast-paced evolution of E2E technology has made previous learning materials outdated, necessitating a comprehensive understanding of multi-modal large models, BEV perception, reinforcement learning, and more [3]
- Beginners face challenges in synthesizing knowledge from numerous fragmented papers and transitioning from theory to practice due to a lack of high-quality documentation [3]

Group 3: Course Development
- A new course titled "End-to-End and VLA Autonomous Driving" has been developed to address these learning challenges, focusing on Just-in-Time Learning to help students quickly grasp core technologies [4]
- The course aims to build a framework for research capabilities, enabling students to categorize papers and extract innovative points [5]
- Practical applications are integrated into the course to ensure a complete learning loop from theory to practice [6]

Group 4: Course Structure
- The course consists of multiple chapters covering the history and evolution of E2E algorithms, background knowledge, two-stage and one-stage E2E methods, and the latest advancements in VLA [8][9][10]
- Key topics include the introduction of E2E algorithms, background knowledge on VLA, and practical applications of diffusion models and reinforcement learning [11][12]

Group 5: Target Audience and Outcomes
- The course is designed for individuals with a foundational understanding of autonomous driving and aims to elevate participants to a level comparable to one year of experience as an E2E algorithm engineer [19]
- Participants will gain a deep understanding of key technologies such as BEV perception, multi-modal large models, and reinforcement learning, enabling them to apply learned concepts to real-world projects [19]
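As a rough illustration of the single-stage idea (sensor data mapped directly to control commands through a BEV-style intermediate representation), here is a toy PyTorch sketch. The module sizes, the pooling-based "view transform", and the three-value control output are assumptions for illustration only, not the architecture of UniAD or of the course materials.

```python
import torch
import torch.nn as nn

class TinyE2EDriver(nn.Module):
    """Toy single-stage end-to-end driver: camera frames -> pseudo-BEV -> controls."""

    def __init__(self, bev_channels: int = 64):
        super().__init__()
        # Image encoder: raw frames -> feature maps.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, bev_channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        # Stand-in for the view transform that lifts image features onto a BEV grid.
        self.to_bev = nn.AdaptiveAvgPool2d((32, 32))
        # Planning head: BEV features -> [steering, throttle, brake].
        self.planner = nn.Sequential(
            nn.Flatten(),
            nn.Linear(bev_channels * 32 * 32, 256), nn.ReLU(),
            nn.Linear(256, 3),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(images)   # (B, C, H/4, W/4)
        bev = self.to_bev(feats)             # (B, C, 32, 32) pseudo-BEV grid
        return self.planner(bev)             # (B, 3) control command

controls = TinyE2EDriver()(torch.randn(2, 3, 128, 128))  # two frames -> two control vectors
```

Because the whole pipeline is differentiable from pixels to controls, there is no hand-off between separate perception and planning modules, which is what removes the error accumulation mentioned above.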
Openings at Several Top Embodied-Intelligence Companies: Large Models, Reinforcement Learning, VLA, and Embodied Navigation!
具身智能之心· 2025-07-10 03:36
Core Viewpoint
- The article discusses job opportunities in the fields of multimodal large models, reinforcement learning, and navigation, highlighting positions in a unicorn company with ample funding [1].

Group 1: Multimodal Large Models
- Job locations are in Beijing and Shenzhen with a salary range of 40k-80k/month [2].
- Responsibilities include developing cutting-edge algorithms for embodied intelligent multimodal large models applicable in various indoor and outdoor scenarios, focusing on framework design, model optimization, and training for navigation and operation tasks [2].
- Candidates should have a master's degree or higher in computer science, artificial intelligence, robotics, or control engineering, along with extensive experience in robot perception, navigation, and AI large models [3].
- Preferred qualifications include experience with algorithms related to multimodal large models in robot navigation and a solid foundation in algorithm development and engineering implementation [3][4].

Group 2: Reinforcement Learning
- Job location is in Beijing with a salary range of 40k-80k/month [5].
- Specific job descriptions and requirements are not detailed in the provided text [5].

Group 3: Embodied Navigation Algorithms
- Job location is in Shenzhen with a salary range of 30k-60k/month [6].
- The role involves researching and developing algorithms for embodied intelligence, focusing on the integration of multimodal data into planning and achieving end-to-end mapping from data to actions [6].

Group 4: Additional Qualifications
- Candidates should have a strong foundation in machine learning, deep learning, and reinforcement learning, with the ability to conduct independent research in embodied intelligence and related fields [7].
- Experience in publishing papers in top conferences and journals is a plus, along with strong coding skills and participation in robotics competitions [7].
LatePost Exclusive | Agent Startup Pokee.ai Raises a $12 Million Seed Round, with Investors Including Point72 Ventures and Intel's Lip-Bu Tan
晚点LatePost· 2025-07-09 11:38
Core Viewpoint
- Pokee.ai, an AI Agent startup, recently raised approximately $12 million in seed funding to accelerate research and sales efforts, with notable investors including Point72 Ventures and Qualcomm Ventures [5][6].

Group 1: Company Overview
- Pokee.ai was founded in October 2022 and currently has only 7 employees. The founder, Zhu Zheqing, previously led the "Applied Reinforcement Learning" department at Meta, where he significantly improved the content recommendation system [7].
- Unlike other startups that use large language models (LLMs) as the "brain" of their agents, Pokee relies on a different reinforcement learning model that does not require extensive context input [7].

Group 2: Technology and Cost Efficiency
- The current version of Pokee has been trained on 15,000 tools, allowing it to adapt to new tools without needing additional context [8].
- Using reinforcement learning models is more cost-effective compared to LLMs, which can incur costs of several dollars per task due to high computational demands. Pokee's task completion cost is only about 1/10 of its competitors [8].

Group 3: Market Strategy and Product Development
- Pokee aims to optimize its ability to call data interfaces (APIs) across various platforms, targeting large companies and professional consumers to facilitate cross-platform tasks [9].
- The funding will also support the integration of new features, including a memory function to better understand client needs and preferences [9].

Group 4: Seed Funding Trends
- The seed funding landscape for AI startups is evolving, with average seed round sizes increasing significantly. In 2020, the median seed round was around $1.7 million, which has risen to approximately $3 million in 2023 [10].
- The high costs associated with AI product development necessitate larger funding rounds to sustain operations, with some companies reportedly burning through $100 million to $150 million annually [13][14].

Group 5: Investment Climate
- Investors are becoming more cautious, requiring solid product-market fit (PMF) before committing to funding. The median time between seed and Series A funding has increased to 25 months, the highest in a decade [17][18].
How Do You Teach AI to Reflect?
Hu Xiu· 2025-07-09 07:57
Core Insights
- The article discusses a research paper titled "Reflect, Retry, Reward: Self-Improvement of Large Language Models through Reinforcement Learning," which presents a novel approach for AI to learn from its mistakes [5][6][10].

Group 1: Research Overview
- The research team from an AI startup called Writer, consisting of eight authors, published the paper, which ranked third in the June leaderboard of the Hugging Face platform [3][4].
- The paper emphasizes a three-step process for AI to learn from errors: Reflect, Retry, and Reward [5][10].

Group 2: Learning Mechanism
- The first step, Reflect, involves the AI generating a self-reflection on its mistakes after failing a task, similar to how students analyze their errors [11].
- The second step, Retry, allows the AI to attempt the same task again, armed with insights from its reflection [12].
- The third step, Reward, applies reinforcement learning to adjust the model's parameters based on the effectiveness of its reflection, rather than just the final answer (a minimal sketch of the full loop follows this summary) [13][14].

Group 3: Experimental Validation
- The research team conducted two experiments: one on function calling and another on solving mathematical equations, both of which are challenging tasks with clear success criteria [16][18].
- In the function calling task, a model with 1.5 billion parameters improved its first-attempt accuracy from approximately 32.6% to 48.6% after implementing the reflection mechanism, and to 52.9% after a retry [20][21].
- For the mathematical equation solving task, the same model's accuracy increased from 6% to 34.9% on the first attempt, and to 45% after a retry, demonstrating significant improvement [23][24][25].

Group 4: Implications for AI Development
- The findings suggest that smaller models can outperform larger models when trained with effective learning strategies, indicating that model size is not the only determinant of performance [26][29].
- The research highlights the potential for optimizing training methods to enhance the capabilities of smaller models, which can lead to cost savings in AI development [29].
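The three steps map onto a short training loop. The sketch below is an assumption-laden outline, not the paper's implementation: `generate`, `task_passes`, and `reinforce_reflection` are hypothetical stand-ins for the model's decoding call, the task's pass/fail check, and the RL update that the paper applies to the reflection.

```python
def generate(model, prompt: str) -> str:
    """Stand-in for the model's decoding call."""
    return model(prompt)

def task_passes(task: dict, answer: str) -> bool:
    """Stand-in for a binary success check (e.g. exact match on the target)."""
    return answer.strip() == task["target"]

def reinforce_reflection(model, reflection: str, reward: float) -> None:
    """Stand-in for the RL update that credits the reflection tokens; a no-op here."""

def reflect_retry_reward(model, task: dict) -> str:
    first = generate(model, task["prompt"])
    if task_passes(task, first):
        return first                          # success on the first try: nothing to learn

    # Step 1 -- Reflect: the model writes a short note on what went wrong.
    reflection = generate(model, task["prompt"] +
                          "\nYour previous answer was wrong. "
                          "Briefly explain what to do differently next time.")

    # Step 2 -- Retry: attempt the same task again with the reflection in context.
    second = generate(model, task["prompt"] + "\nSelf-reflection: " + reflection)

    # Step 3 -- Reward: if the retry succeeds, reinforce the reflection itself,
    # so the model learns to produce reflections that actually help.
    if task_passes(task, second):
        reinforce_reflection(model, reflection, reward=1.0)
    return second
```

Rewarding the reflection rather than the final answer is what lets the mechanism generalize: the model is optimized for producing useful self-critiques, not for memorizing specific answers.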
A Super Add-on for DeepSeek-R1! "Humanity's Last Exam" Tops 30 Points for the First Time as an Open-Source Approach from Shanghai Jiao Tong University and Others Crushes OpenAI and Google
量子位· 2025-07-09 04:57
Core Insights
- The article highlights a significant achievement by a domestic team from Shanghai Jiao Tong University and DeepMind Technology, which scored 32.1 points on the "Humanity's Last Exam" (HLE), setting a new record in a notoriously difficult AI test [1][2][26].

Group 1: Achievement and Context
- The previous highest score on the HLE was 26.9, achieved by Kimi-Researcher and Gemini Deep Research [2].
- The HLE was launched earlier this year and is known for its extreme difficulty, with no model scoring above 10 points initially [34][39].
- The test includes over 3,000 questions across various disciplines, with a significant focus on mathematics [39].

Group 2: Methodology and Tools
- The team developed two key systems: the tool-enhanced reasoning agent X-Master and the multi-agent workflow system X-Masters [3][20].
- X-Master operates by simulating the dynamic problem-solving process of human researchers, allowing for seamless switching between internal reasoning and external tool usage [9][10].
- The core mechanism involves conceptualizing code as an interactive language, enabling the agent to generate and execute code when faced with unsolvable problems (a schematic sketch follows this summary) [11][14].

Group 3: Performance Metrics
- The X-Masters system achieved a record score of 32.1%, surpassing all existing agents and models [26].
- The performance improvement was attributed to various components of the workflow: tool-enhanced reasoning improved baseline accuracy by 3.4%, iterative optimization added 9.5%, and final selection led to the record score [29][30].
- In specific categories, X-Masters outperformed existing systems, achieving 27.6% accuracy in the biology/medicine category, compared to 17.3% for Biomni and 26% for STELLA [31].

Group 4: Future Implications
- The introduction of X-Masters aims to enhance the breadth and depth of reasoning through a decentralized-stacked approach, where multiple agents collaborate to generate and refine solutions [20][22].
- This structured exploration and exploitation strategy is likened to concepts in reinforcement learning, indicating a potential for further advancements in AI reasoning capabilities [23].
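The "code as an interactive language" mechanism can be pictured as a loop in which the model alternates between free-form reasoning and emitting Python that gets executed, with the output folded back into its context. The sketch below is only a schematic of that idea, not the X-Master implementation: `ask_model` is a hypothetical stand-in for the underlying LLM call, the `RUN:` reply convention is an assumption, and a real system would sandbox the execution step.

```python
import contextlib
import io

def ask_model(context: str) -> str:
    """Stand-in for the LLM call; returns a final answer or 'RUN:' plus Python code."""
    raise NotImplementedError

def run_snippet(code: str) -> str:
    """Execute generated code and capture stdout (a real system would sandbox this)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

def solve(question: str, max_rounds: int = 5) -> str:
    context = question
    for _ in range(max_rounds):
        reply = ask_model(context)
        if reply.startswith("RUN:"):           # the model chose to act by writing code
            output = run_snippet(reply[len("RUN:"):])
            context += f"\n[code output]\n{output}"
        else:                                   # the model produced a final answer
            return reply
    return ask_model(context + "\nGive your final answer now.")
```

The point of the loop is that the decision to compute rather than to keep reasoning is made by the model itself, which is what the summary describes as seamless switching between internal reasoning and external tool usage.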
A 4B Model Beats Claude 4 at Math Reasoning for the First Time; 700 Steps of RL Training Approach 235B-Scale Performance | HKU, ByteDance Seed & Fudan
量子位· 2025-07-09 01:18
Core Viewpoint
- The Polaris model, developed by a collaboration between the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates superior mathematical reasoning capabilities compared to leading commercial models, achieving scores of 79.4 on AIME25 and 81.2 on AIME24 [1][53].

Group 1: Model Performance and Training
- Polaris utilizes Scaling Reinforcement Learning (RL) to enhance the mathematical reasoning abilities of the 4B model, surpassing various commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5].
- The lightweight nature of Polaris-4B allows deployment on consumer-grade graphics cards [2].
- The research team confirmed that Scaling RL can replicate significant performance improvements in cutting-edge open-source models like Qwen3 [5].

Group 2: Training Data and Methodology
- The success of Polaris hinges on tailored training data and hyperparameter settings that align with the model being trained [7].
- The team discovered a mirrored difficulty distribution in the training data, indicating that the same dataset presents varying challenges to models of different capabilities [8][10].
- A dynamic updating strategy for training data was implemented, allowing the model to adapt as it improves and ensuring that overly easy samples are removed during training (a simplified sketch follows this summary) [13].

Group 3: Sampling Diversity and Temperature Control
- Diversity in sampling is crucial for enhancing model performance, allowing exploration of broader reasoning paths [14].
- The team identified that common temperature settings (0.6 and 1.0) were too low, limiting the model's exploration capabilities [27].
- A three-zone temperature framework was established: Robust Generation Zone, Controlled Exploration Zone, and Performance Collapse Zone, guiding the selection of optimal sampling temperatures [28].

Group 4: Long Context Training and Performance
- The model's pre-training context length was limited to 32K, but during RL training it was extended to 52K, addressing the challenge of long-context training [37].
- The introduction of length extrapolation techniques improved the accuracy of long text generation from 26% to over 50% [41].
- A multi-stage training approach was adopted, gradually increasing context window lengths to enhance reasoning capabilities [48].

Group 5: Evaluation and Results
- Polaris achieved the highest performance in most evaluations, demonstrating its effectiveness in mathematical reasoning tasks [53].
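One of the mechanisms above, the dynamic updating of training data, is easy to sketch: after each round of rollouts, problems the current policy already solves too reliably are dropped so that the pool keeps pace with the model. The threshold, function names, and rollout interface below are illustrative assumptions, not Polaris's actual recipe.

```python
def update_training_pool(pool, rollout, n_rollouts: int = 8,
                         easy_threshold: float = 0.9):
    """Drop problems the current model already solves too reliably.

    pool:     list of problems (prompt plus reference answer)
    rollout:  callable(problem) -> bool, one sampled attempt judged pass/fail
    """
    kept = []
    for problem in pool:
        pass_rate = sum(rollout(problem) for _ in range(n_rollouts)) / n_rollouts
        if pass_rate < easy_threshold:   # keep only problems that still teach something
            kept.append(problem)
    return kept

# Hypothetical usage between RL rounds, with attempt() and policy as placeholders:
# pool = update_training_pool(pool, rollout=lambda p: attempt(policy, p) == p["answer"])
```

Re-filtering between rounds is what keeps the effective difficulty of the data matched to the model as it improves, which is the "mirrored difficulty distribution" observation put into practice.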
Embodied Intelligence Paper Digest | Reinforcement Learning, VLA, VLN, World Models, and More
具身智能之心· 2025-07-08 12:54
Core Insights
- The article discusses advancements in Vision-Language-Action (VLA) models through reinforcement learning (RL) techniques, specifically the Proximal Policy Optimization (PPO) algorithm, which significantly enhances the generalization capabilities of these models [2][4].

Group 1: VLA Model Enhancements
- The application of PPO has led to a 42.6% increase in task success rates in out-of-distribution (OOD) scenarios [2].
- Semantic understanding success rates improved from 61.5% to 75.0% when encountering unseen objects [2].
- In dynamic interference scenarios, success rates surged from 28.6% to 74.5% [2].

Group 2: Research Contributions
- A rigorous benchmark was established to evaluate the impact of VLA fine-tuning methods on generalization across visual, semantic, and execution dimensions [4].
- PPO was identified as superior to other RL algorithms like GRPO and DPO for VLA fine-tuning, with discussions on adapting these algorithms to meet the unique needs of VLA [4].
- An efficient PPO-based fine-tuning scheme was developed, utilizing a shared actor-critic backbone network, VLA model warm-up, and minimal PPO training iterations (a minimal sketch of the shared-backbone design follows this summary) [4].
- The study demonstrated that RL's generalization capabilities in VLA for semantic understanding and entity execution outperformed supervised fine-tuning (SFT), while maintaining comparable visual robustness [4].

Group 3: NavMorph Model
- The NavMorph model was introduced as a self-evolving world model for vision-and-language navigation in continuous environments, achieving a success rate of 47.9% in unseen environments [13][15].
- The model incorporates a World-aware Navigator for inferring dynamic representations of the environment and a Foresight Action Planner for optimizing navigation strategies through predictive modeling [15].
- Experiments on mainstream VLN-CE benchmark datasets showed that NavMorph significantly enhanced the performance of leading models, validating its advantages in adaptability and generalization [15].
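The shared actor-critic backbone mentioned above can be illustrated in a few lines of PyTorch: one trunk produces features that feed both the action head and the value head, so PPO only adds a small critic on top of the policy. The dimensions and the linear "backbone" are placeholders standing in for the VLA model, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One trunk, two heads: the PPO actor and critic share the same features."""

    def __init__(self, obs_dim: int = 512, action_dim: int = 7, hidden: int = 256):
        super().__init__()
        # Placeholder trunk standing in for the VLA backbone (vision-language encoder).
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, action_dim)   # action logits / means
        self.critic = nn.Linear(hidden, 1)           # state-value estimate used by PPO

    def forward(self, obs: torch.Tensor):
        feats = self.backbone(obs)
        return self.actor(feats), self.critic(feats).squeeze(-1)

actions, values = SharedActorCritic()(torch.randn(4, 512))   # batch of 4 observations
```

Sharing the trunk keeps the extra memory and compute of PPO's critic small relative to the VLA model itself, which fits the summary's emphasis on an efficient fine-tuning scheme.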