Kitchen-R: A Mobile-Manipulation Benchmark for Jointly Evaluating High-Level Task Planning and Low-Level Control
具身智能之心· 2025-08-25 00:04
Preface & Motivation
1) Why benchmarks matter: Benchmarks are widely used to measure progress in natural language processing (e.g., GLUE) and computer vision (e.g., Visual Genome). In robotics, simulator-based benchmarks (e.g., Behavior-1K) are similarly common; they serve both evaluation and training, and must simulate low-level actions accurately enough for results to transfer to real robots.
2) The fragmentation of existing benchmarks: Large language models (LLMs) and vision-language models (VLMs) are now widely applied to robot task planning and instruction following, but existing benchmarks have clear shortcomings:
3) The core value of Kitchen-R: Benchmarks are the central tool for measuring progress in robotics and embodied AI, yet current ones are sharply split: benchmarks for high-level language instruction following typically assume perfect low-level execution, while benchmarks for low-level robot control use only simple single-step instructions. This split makes it impossible to holistically evaluate integrated systems in which both task planning and physical execution matter. To fill this gap, the authors propose the Kitchen-R benchmark, a simulated kitchen environment ...
Briefing | A $5M seed round for Paradigm, a spreadsheet equipped with over 5,000 AI agents
Z Potentials· 2025-08-19 15:03
Long before the term "AI agent" existed, Anna Monaco was already building AI agent systems. After developing numerous chatbots, she went looking for interfaces better suited to AI agents and eventually settled on the spreadsheet.
"I personally noticed a common pattern: a lot of people keep very important CRM data in spreadsheets, simply because it's the most flexible option," Monaco told TechCrunch. "But actually maintaining them is painful and takes a lot of manual work. So I went down the rabbit hole of building a product for myself, reimagining what a spreadsheet would look like with the full power of LLMs."
The result is Paradigm: an intelligent spreadsheet equipped with more than 5,000 AI agents. Users can assign dedicated instructions to individual columns and cells, and independent AI agents automatically crawl the web to find and fill in the requested information. Paradigm is now launching publicly and has announced a $5 million seed round led by General Catalyst, bringing its total funding to $7 million. Monaco says Paradigm supports Anthropic, OpenAI, and ...
(Image source: Paradigm)
Open-source diffusion LLMs beat autoregressive models for the first time! SJTU and UCSD release D2F, with 2.5x the throughput of LLaMA3
机器之心· 2025-08-18 03:22
challenges (for example, the lack of a mature KV-cache mechanism and under-exploited parallelism) that leave their inference far slower than autoregressive (AR) models of the same size. A recent work completely reverses this situation: the DENG Lab at Shanghai Jiao Tong University, together with UC San Diego (UCSD), introduces Discrete Diffusion Forcing (D2F), which for the first time makes open-source dLLMs generate significantly faster than AR models of the same scale. Experiments show that on benchmarks such as GSM8K, D2F models achieve up to 2.5x higher throughput than mainstream AR models such as LLaMA3, while ...
The author team comes from the DENG Lab at Shanghai Jiao Tong University and UC San Diego (UCSD). The work was carried out by master's student Wang Xu, incoming master's student Xu Chenkai, undergraduate Jin Yijie, and PhD student Jin Jiachun, advised by Deng Zhijie and Zhang Hao. DENG Lab, based at Shanghai Jiao Tong University, focuses on efficient, cross-modal generative models.
Paper: https://arxiv.org/abs/2508.09192
Code: https://github.com/zhijie-group/Discrete-Diffusion-Forcing
Video 1: inference comparison between D2F dLLMs and same-size AR LLMs ...
A 10,000-word deep dive! The first survey on self-evolving agents: the road toward artificial superintelligence
自动驾驶之心· 2025-07-31 23:33
Core Insights
- The article discusses the transition from static large language models (LLMs) to self-evolving agents that adapt and learn continuously from interactions with their environment, aiming toward artificial superintelligence (ASI) [3][5][52]
- It frames self-evolving agents around three fundamental questions: what to evolve, when to evolve, and how to evolve, providing a structured framework for understanding and designing these systems [6][52]
Group 1: What to Evolve
- Self-evolving agents can improve components such as models, memory, tools, and workflows to enhance performance and adaptability [14][22]
- Agent evolution is categorized into four pillars: the cognitive core (model), context (instructions and memory), external capabilities (tool creation), and system architecture [22][24]
Group 2: When to Evolve
- Self-evolution occurs in two time modes: intra-test-time self-evolution, during task execution, and inter-test-time self-evolution, between tasks [26][27]
- Three basic learning paradigms are relevant to self-evolution: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement learning (RL) [27][28]
Group 3: How to Evolve
- Methods for self-evolution include reward-based evolution, imitation and demonstration learning, and population-based approaches [32][36]
- The survey highlights continuous learning from real-world interactions, seeking feedback, and adjusting strategies to dynamic environments [30][32]
Group 4: Evaluation of Self-evolving Agents
- Evaluating self-evolving agents poses unique challenges, requiring assessments that capture adaptability, knowledge retention, and long-term generalization [40]
- The article calls for dynamic evaluation methods that reflect ongoing evolution and the diverse contributions of agents in multi-agent systems [40][51]
Group 5: Future Directions
- Deploying personalized self-evolving agents is identified as a critical goal, focused on accurately capturing user behavior and preferences over time [43]
- Open challenges include preventing self-evolving agents from reinforcing existing biases and developing adaptive evaluation metrics that reflect their dynamic nature [44][45]
Large models show a "seesaw effect" between privacy and fairness, and the optimal balance has just been found | RUC & Shanghai AI Lab
量子位· 2025-07-27 11:57
Core Insights
- Research from Renmin University and Shanghai AI Lab reveals that strengthening privacy protection in large language models (LLMs) can cause a significant drop in fairness, by as much as 45% [1][8]
- The study traces this "seesaw effect" to coupled neurons that encode both fairness and privacy, creating conflicts during model optimization [1][10]
Group 1: Ethical Challenges in LLMs
- The "alignment tax" describes the trade-off in which optimizing for alignment-related goals often sacrifices other foundational capabilities, such as general knowledge and reasoning [3]
- As LLMs enter critical sectors such as healthcare, finance, and education, maintaining both fairness and privacy has become essential [4][5]
- Users expect LLMs to protect privacy while also ensuring fairness, but achieving both simultaneously is challenging [7]
Group 2: SPIN Methodology
- SPIN is a training-free method that precisely suppresses roughly 0.00005% of key neurons to improve fairness and privacy simultaneously [2][12]
- It proceeds in three steps: identify critical neurons, locate the coupled neurons that affect both fairness and privacy, and suppress them to decouple their effects [13][15][16]
- SPIN delivers significant improvements on fairness and privacy metrics across various models, outperforming traditional fine-tuning methods [17][18][19]
Group 3: Performance and Robustness
- SPIN deploys at essentially zero cost: it requires only a one-time neuron scan and adds no computational overhead at inference [20]
- The method remains stable even when exposed to harmful training data, preserving its fairness and privacy gains [26][31]
- Benchmark tests validate that SPIN improves these metrics without sacrificing model capability [21][22]
Group 4: Broader Implications
- The principles behind SPIN can be extended to other ethical conflicts in AI, such as balancing safety and utility [37]
- The work underscores the importance of understanding neuron-level interactions for building more responsible AI systems [12][37]
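The suppression step described above, zeroing a tiny set of neurons identified as important to both objectives, can be sketched in a few lines of numpy. This is a minimal toy illustration, not the paper's method: the `|weight * gradient|` importance proxy, the top-k intersection rule, and all names here are assumptions standing in for SPIN's actual attribution and localization procedure.

```python
import numpy as np

def importance_scores(weights: np.ndarray, grads: np.ndarray) -> np.ndarray:
    # Hypothetical importance proxy: |weight * gradient|, summed over
    # the input dimension (a stand-in for SPIN's attribution method).
    return np.abs(weights * grads).sum(axis=1)

def find_coupled(fair_scores: np.ndarray, priv_scores: np.ndarray, k: int) -> np.ndarray:
    # Neurons ranked in the top-k for BOTH objectives are treated as "coupled".
    fair_top = set(np.argsort(fair_scores)[-k:])
    priv_top = set(np.argsort(priv_scores)[-k:])
    return np.array(sorted(fair_top & priv_top), dtype=int)

def suppress(hidden: np.ndarray, coupled: np.ndarray) -> np.ndarray:
    # Training-free intervention: zero the coupled units' activations.
    out = hidden.copy()
    out[..., coupled] = 0.0
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))        # toy layer: 16 neurons, 8 inputs
g_fair = rng.normal(size=(16, 8))   # gradients w.r.t. a fairness loss
g_priv = rng.normal(size=(16, 8))   # gradients w.r.t. a privacy loss

coupled = find_coupled(importance_scores(W, g_fair),
                       importance_scores(W, g_priv), k=6)
h = rng.normal(size=(4, 16))        # a batch of hidden activations
h_edit = suppress(h, coupled)
print(coupled)
```

Because the intervention is a one-time mask rather than a weight update, it matches the summary's "zero-cost deployment" claim: no retraining, and negligible overhead at inference.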
HKUST and collaborators propose LOVON: a new paradigm for full-domain object tracking in the open world with legged robots!
具身智能之心· 2025-07-27 09:37
Core Viewpoint
- The article introduces the LOVON framework, which integrates large language models, open-vocabulary visual detection, and precise language-motion mapping to enhance the navigation capabilities of legged robots in dynamic, unstructured environments [4][6][23]
Group 1: LOVON Framework Overview
- LOVON targets long-range, multi-target navigation for legged robots in complex environments, overcoming the limitations of traditional methods that struggle with real-time visual disturbances and target loss [3][6]
- By combining the task-planning capabilities of large language models with open-vocabulary visual detection, the framework lets robots efficiently navigate and track dynamic targets in open-world scenarios [4][6][10]
Group 2: Key Features of LOVON
- LOVON consists of three core modules that close the loop between language, vision, and motion, enhancing the robot's ability to perform complex tasks [10]
- Laplacian variance filtering stabilizes visual processing, improving the usable detection frame rate by 25% during robot movement [12][13]
- Adaptive execution logic lets robots handle unexpected situations, such as target loss or external interference, by switching to search mode or seamlessly executing new commands [14][16]
Group 3: Performance Metrics
- In simulation, LOVON achieved a success rate (SR) of 1.00, significantly outperforming traditional methods such as EVT (SR 0.94) [19]
- Training takes only 1.5 hours, versus 360 hours for the best competing model, TrackVLA, a 240-fold improvement [19][20]
Group 4: Practical Applications
- LOVON's plug-and-play design allows easy deployment on mainstream legged-robot platforms, supporting home services, industrial inspection, and field research [21][24]
- The framework shows strong open-world adaptation, multi-target long-range tracking, robustness in dynamic environments, and resistance to interference, making it suitable for diverse real-world scenarios [24]
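The Laplacian variance filtering mentioned above is a standard blur metric: convolve a grayscale frame with the Laplacian kernel and keep the frame only if the response variance (a proxy for edge content) clears a threshold. Below is a self-contained numpy sketch; the 3x3 kernel is the textbook discrete Laplacian, but the threshold and frame sizes are illustrative assumptions, not LOVON's actual values.

```python
import numpy as np

# Textbook 3x3 discrete Laplacian kernel.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def laplacian_variance(gray: np.ndarray) -> float:
    # Convolve with the Laplacian (valid region only) and return the
    # variance of the response: low variance => few edges => likely blur.
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())

def keep_frame(gray: np.ndarray, threshold: float = 50.0) -> bool:
    # Illustrative threshold; a real system would tune it per camera.
    return laplacian_variance(gray) >= threshold

rng = np.random.default_rng(1)
sharp = rng.integers(0, 256, size=(64, 64)).astype(float)  # high-frequency "sharp" frame
blurry = np.full((64, 64), 128.0)                          # flat, featureless "blurred" frame
print(keep_frame(sharp), keep_frame(blurry))
```

Dropping low-variance frames before detection is what plausibly yields the reported 25% improvement in usable detections: the detector only ever sees frames with enough edge energy to localize a target.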
HKUST & Beijing Humanoid propose LOVON: a new paradigm for full-domain object tracking in the open world with legged robots!
机器之心· 2025-07-25 04:29
Core Viewpoint
- The LOVON framework marks a significant advance in robotics, enabling legged robots to autonomously navigate complex, dynamic environments by integrating large language models, open-vocabulary visual detection, and precise language-motion mapping [2][5][20]
Group 1: Introduction to LOVON
- LOVON addresses long-range, multi-target navigation in open environments, overcoming the limitations of traditional methods that struggle with real-time visual disturbances and target loss [1][5]
- It combines the task-planning capabilities of large language models with open-vocabulary visual detection and a language-motion mapping model, allowing efficient navigation in dynamic, unstructured settings [2][5]
Group 2: Core Modules of LOVON
- Three core modules close the loop between language, vision, and motion, enhancing the robot's navigation capabilities [9]
- Laplacian variance filtering stabilizes visual processing, improving the detection rate of clear frames by 25% during robot movement [11][12]
- Adaptive execution logic handles unexpected situations, such as target loss or external disturbances, by switching to search mode or seamlessly executing new commands [13][15]
Group 3: Performance Metrics
- In simulation environments such as GymUnreal, LOVON achieved a success rate of 1.00, significantly outperforming traditional methods at 0.94 [18]
- Training requires only 1.5 hours, compared with 360 hours for the best competing model, a 240-fold improvement [18]
Group 4: Real-World Applications
- LOVON has been deployed on multiple legged-robot platforms, including Unitree Go2, B2, and H1-2, demonstrating plug-and-play capability without extensive customization [19]
- The framework is poised to support smart homes, industrial inspection, and field research across diverse tasks [20][21]
Group 5: Key Features
- Open-world adaptability: robots can recognize a wide range of objects in unfamiliar environments [23]
- Multi-target long-range tracking: complex tasks execute smoothly and without interruption [23]
- Robustness in dynamic environments: stable tracking of moving targets across varied terrain [23]
- Anti-interference: the robot quickly reacquires targets and resumes its task after disruptions [23]
Making VLMs a better fit for robots: small VLMs can also exhibit strong visual planning capabilities
具身智能之心· 2025-07-15 13:49
Core Insights
- The article examines the potential of large language models (LLMs) for robot program planning: they can generate coherent action sequences but often lack the sensory detail needed for physical execution [3][4]
- It introduces SelfReVision, a framework that improves small vision-language models (VLMs) through self-distillation without external supervision, strengthening their planning in real-world scenarios [4][9]
Research Background
- LLMs show promise at generating action sequences but often lack the precision robotic tasks require, owing to their human-centric training data [3]
- VLMs could address these limitations, but existing methods either require specialized simulation environments or are costly to train and deploy [3]
Methodology
- SelfReVision is a self-improvement framework that lets small VLMs boost their own performance through iterative self-critique and revision [4][6]
- It operates in three stages: critique, revise, and verify, enabling models to generate and refine plans via self-assessment [4][10]
Experimental Setup
- Two experiment types evaluated SelfReVision's planning: image-based program planning and embodied-agent tasks [11]
- Metrics included coverage, ordering, completeness, overall quality, and a new metric, image groundedness [12]
Key Results
- SelfReVision significantly outperformed baseline models across metrics, with average win rates of 68% on the PLACES dataset and 72% on the SIMULATION dataset [13]
- Larger models benefited more, with an average gain of 74% for models of 12 billion parameters or larger [13]
Comparison with Other Methods
- SelfReVision showed clear advantages over methods such as Best-of-N and PaliGemma, improving by 60% in most settings versus modest gains from Best-of-N [17]
- Against GPT-4o, SelfReVision's plans achieved at least a 25% higher win rate for models of 12 billion parameters or larger, showing its effectiveness at lifting smaller models [17]
Ablation Studies
- The complete Criticize-Revise-Verify (CRV) pipeline performed best, with average win rates of 68.3% on PLACES and 71.9% on SIMULATION [18]
- Ablated variants showed significant performance drops, underscoring the verification step's role in filtering out suboptimal revisions [18]
Application in Embodied-Agent Tasks
- In block-manipulation tasks, SelfReVision improved the Gemma 12B model by 26% and the Gemma 27B model by 17% [21]
- In hierarchical tasks, SelfReVision plans yielded a 70% trajectory-generation success rate, surpassing the 61% of baseline models [21]
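The critique-revise-verify loop described above can be sketched as a generic control flow in which a revision is kept only when the verifier prefers it, the step the ablations identify as critical. The summary does not give SelfReVision's prompts or model calls, so the three stages here are stubbed with toy functions; only the loop structure is meant to reflect the method.

```python
from typing import Callable

def self_revision(plan: str,
                  critique: Callable[[str], str],
                  revise: Callable[[str, str], str],
                  verify: Callable[[str, str], bool],
                  max_rounds: int = 3) -> str:
    # Critique-Revise-Verify loop: accept a revision only if the
    # verifier prefers it over the current plan, which is what filters
    # out suboptimal revisions in the ablation studies.
    for _ in range(max_rounds):
        feedback = critique(plan)
        if not feedback:                 # nothing left to criticize: stop
            break
        candidate = revise(plan, feedback)
        if verify(candidate, plan):      # keep only verified improvements
            plan = candidate
    return plan

# Toy stand-ins for the VLM calls (hypothetical, illustration only):
def toy_critique(p): return "missing: verify grasp" if "verify grasp" not in p else ""
def toy_revise(p, fb): return p + "; verify grasp"
def toy_verify(new, old): return len(new) > len(old)

final = self_revision("pick cup; place cup", toy_critique, toy_revise, toy_verify)
print(final)
```

In the real system, all three callables would be prompts against the same small VLM, which is what makes the procedure self-distillation: no external teacher or supervision enters the loop.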
CICC: How can large models nowcast macroeconomic indicators in real time?
中金点睛· 2025-07-09 23:59
Core Viewpoint
- The article presents a real-time forecasting framework driven by large language models (LLMs) to predict macroeconomic indicators, addressing the inherent lag in traditional macroeconomic data collection and reporting [1][7]
Group 1: Real-time Forecasting Methods
- Macroeconomic indicators are typically delayed by time-consuming collection and validation, with data often released only in the following month or quarter [2][7]
- Three common ways to handle the lag:
  1. Periodic lagging: reuse previously published data; reliable, but amounts to linear extrapolation [8]
  2. Dynamic lagging: adjust data based on historical release patterns; also relies on linear extrapolation [8]
  3. Real-time forecasting: build models for real-time state prediction, which may introduce randomness [8]
Group 2: Specific Forecasting Techniques
- Three techniques are detailed:
  1. High-frequency data splitting: use dynamic high-frequency macro data to update low-frequency predictions, as in the GDPNow model; interpretable, but it demands extensive domain knowledge, and noise in high-frequency data can cause overfitting [9]
  2. SARIMAX: a seasonal autoregressive integrated moving-average model with exogenous variables, whose seasonal parameters enhance predictive power; suited to stable, high-frequency indicators with limited external shocks [10][14]
  3. LLM-based text interpretation: use LLMs to analyze unstructured text (macro news, analyst reports) and generate predictive signals from semantic relationships and logical reasoning; this captures market reactions to sudden events faster than traditional models [3][15]
Group 3: Performance of Forecasting Models
- Autoregressive predictions: limited accuracy gains for indicators weakly correlated with their own lags, such as month-on-month CPI and new RMB loans; strongly correlated indicators (correlation ≥ 0.8) can simply reuse lagged data without modeling [4][27]
- LLM enhancements: significant accuracy improvements across indicators, with correlation rising from -0.1 to 0.9 for new RMB loans and from 0.37 to 0.72 for export value [5][35]
Group 4: Conclusion and Recommendations
- The recommended approach for nowcasting lagging macro data:
  1. For indicators highly correlated with their own lags, use lagged data directly
  2. For stable indicators with weak trends, apply a SARIMAX model with seasonal adjustment
  3. Otherwise, combine LLMs with news or report data for real-time prediction [45]
Stop picking LLMs blindly! New ICML 2025 research explains the "black magic" of large-model selection
机器之心· 2025-07-04 08:59
Core Viewpoint
- The article introduces LensLLM, a framework from Virginia Tech that significantly improves the efficiency of selecting large language models (LLMs) while reducing computational cost, addressing the model-selection challenges facing researchers and developers [2][3][4]
Group 1: Introduction
- The rapid advance of LLMs has made model selection a challenge: traditional methods are resource-intensive and yield limited insight [4]
Group 2: Theoretical Breakthrough of LensLLM
- LensLLM rests on a novel PAC-Bayesian generalization bound that reveals distinctive dynamics between test loss and training-data size during LLM fine-tuning [6][10]
- The framework gives a first-principles explanation of the "phase transition" in fine-tuning performance, indicating when additional data investment yields significant improvement [12][16]
Group 3: LensLLM Framework
- LensLLM incorporates the Neural Tangent Kernel (NTK) to capture the complex fine-tuning dynamics of transformer architectures, establishing a precise relationship between model performance and data volume [15][16]
- It demonstrates strong accuracy in curve fitting and test-loss prediction across benchmark datasets, outperforming traditional models [17][18]
Group 4: Performance and Cost Efficiency
- LensLLM achieved a Pearson correlation coefficient of 85.8% and a relative accuracy of 91.1% on the Gigaword dataset, showing its effectiveness at ranking models [21]
- It cuts computational cost by up to 88.5% versus full fine-tuning, reaching superior performance with far fewer FLOPs [23][25]
Group 5: Future Prospects
- The work opens new avenues for LLM development and deployment, with potential extensions to multi-task scenarios and emerging architectures such as Mixture of Experts (MoE) [27][30]
- LensLLM is particularly suited to resource-constrained environments, accelerating model testing and deployment cycles while maximizing performance [31]
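The core trick behind this kind of selection, fitting a loss-versus-data curve on cheap small-scale runs and extrapolating to the full budget, can be illustrated with a plain power-law fit. To be clear, this is not LensLLM's actual model (which uses an NTK-based rectified scaling law that also captures the pre-power "phase transition"); the curve and numbers below are invented for the sketch.

```python
import numpy as np

def fit_power_law(n, loss):
    # Fit loss ~ a * n^(-b) by linear regression in log-log space:
    # log(loss) = log(a) - b * log(n).
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return np.exp(intercept), -slope   # (a, b)

def predict(a, b, n):
    return a * n ** (-b)

# Synthetic "test loss vs. fine-tuning set size" points from a few
# cheap small-scale runs (invented numbers for illustration):
n = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss = 8.0 * n ** (-0.25)

a, b = fit_power_law(n, loss)
# Extrapolate to a larger data budget before committing compute:
print(round(predict(a, b, 1e6), 3))
```

Ranking candidate models by their extrapolated loss at the target budget, instead of fully fine-tuning each one, is where the reported cost savings of up to 88.5% come from.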