The Hottest Field, VLA: One Survey Is All You Need
具身智能之心· 2025-11-03 00:03
**Core Insights**
- The article discusses the rapid growth and significance of the Vision-Language-Action (VLA) field, highlighting its potential to enable robots to understand human language, perceive the world, and perform tasks effectively [2][7].

**Summary by Sections**

**VLA Overview**
- VLA submissions have risen dramatically, from single digits to 164 papers, an 18-fold increase [6].
- A model qualifies as VLA if it uses a backbone pre-trained on large-scale vision-language data, emphasizing language understanding, visual generalization, and task transfer [8][9].

**Trends in VLA**
- **Trend 1: Efficient Architecture** - Discrete diffusion models are emerging as a new paradigm, generating action sequences in parallel for greater efficiency [15][17].
- **Trend 2: Embodied Chain-of-Thought (ECoT)** - ECoT lets robots generate intermediate reasoning steps before acting, improving planning and interpretability [18][19].
- **Trend 3: Action Tokenizer** - Continuous robot actions are converted into discrete tokens that VLMs can consume, improving efficiency and tightening the integration of reasoning and action [22]. A minimal tokenizer sketch follows this summary.
- **Trend 4: Reinforcement Learning (RL)** - RL is re-emerging as a crucial tool for fine-tuning VLA policies, particularly in extreme scenarios [26][27].
- **Trend 5: Efficiency Optimization** - Efforts are underway to reduce the cost and complexity of VLA models, making them accessible to smaller labs [28][29].
- **Trend 6: Video Prediction** - Video generation models are used to give VLA an understanding of temporal dynamics and physical laws [30].
- **Trend 7: Realistic Evaluation Benchmarks** - New evaluation methods address the saturation of existing benchmarks, including future-frame prediction tasks [37][39].
- **Trend 8: Cross-Embodiment Learning** - Architectural innovations are essential for universal robot policies that operate across different body structures [41][43].

**Challenges and Future Directions**
- The article highlights the "performance ceiling" in mainstream simulation evaluations, where high scores do not necessarily translate into real-world capability [44].
- Two areas needing more attention are data quality and the potential of in-context learning to enhance VLA systems [49][50].
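The action-tokenizer trend (Trend 3) comes down to mapping each continuous action dimension onto a small discrete vocabulary that a VLM can emit. Below is a minimal sketch of one common binning scheme; the bin count, action bounds, and function names are illustrative assumptions, not details taken from the survey.

```python
import numpy as np

# Minimal binning-based action tokenizer sketch. The 256-bin vocabulary and
# the per-dimension action bounds are illustrative choices, not survey values.
N_BINS = 256

def tokenize_action(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to discrete token ids in [0, n_bins)."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)              # scale to [0, 1]
    return np.minimum((normalized * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=N_BINS):
    """Map token ids back to the bin-center continuous actions."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Example: a hypothetical 7-DoF arm action (6 joint deltas + gripper command).
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
a = np.array([0.01, -0.02, 0.0, 0.03, -0.01, 0.02, 1.0])
tokens = tokenize_action(a, low, high)
print(tokens, np.round(detokenize_action(tokens, low, high), 3))
```

Published action tokenizers vary (quantile binning, learned codebooks, compressed frequency-domain tokens), but the discretize-then-detokenize round trip shown here is the shared idea.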
A Self-Checking VLA! ReflectDrive: An End-to-End Framework for Safer, More Efficient Scaling (Li Auto & Tsinghua)
自动驾驶之心· 2025-09-27 23:33
**Core Viewpoint**
- ReflectDrive is a novel learning framework that integrates a reflective mechanism to achieve safe trajectory generation through discrete diffusion, addressing key challenges in end-to-end autonomous driving systems [4][46].

**Group 1: Introduction and Background**
- Autonomous driving is leading the transportation industry toward a safer and more efficient future, with end-to-end (E2E) systems becoming a mainstream alternative to traditional modular designs [4].
- Vision-Language-Action (VLA) models draw on knowledge pre-trained in vision-language models (VLMs) to improve adaptability in complex scenarios [4][5].
- Current learning-based methods have not resolved the core challenges of imitation-learning driving systems, particularly encoding physical rules such as collision avoidance [4][5].

**Group 2: ReflectDrive Framework**
- ReflectDrive proposes a learning framework that uses a discrete-diffusion reflective mechanism for safe trajectory generation [3][12].
- The framework first discretizes the two-dimensional driving space to construct an action codebook, allowing pre-trained diffusion language models to be fine-tuned for planning tasks [3][14].
- The reflective mechanism operates without gradient computation, enabling iterative self-correction inspired by spatiotemporal joint planning [3][8].

**Group 3: Methodology and Mechanism**
- The reflective inference process consists of two stages: goal-conditioned trajectory generation and safety-guided regeneration [20][25].
- The framework integrates safety metrics to evaluate the generated multimodal trajectories, identifying unsafe waypoints through local search [8][25]. A simplified sketch of this loop follows this summary.
- The iterative optimization loop continues until the trajectory is deemed safe or the computational budget is exhausted, keeping real-time performance efficient [31][32].

**Group 4: Experimental Results**
- ReflectDrive was evaluated on the NAVSIM benchmark, showing significant improvements in safety metrics such as collision rate and drivable-area compliance [32][38].
- The safety-guided regeneration mechanism produced substantial gains over the baseline: DAC +3.9%, TTC +1.3%, NC +0.8%, and EP +7.9% [37][38].
- With ground-truth agent information, ReflectDrive approached human driving levels, achieving an NC of 99.7% and a DAC of 99.5% [38][39].

**Group 5: Conclusion**
- ReflectDrive effectively combines a reflective mechanism with discrete diffusion for safe trajectory generation, validated by its performance on the NAVSIM benchmark [46].
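To make the two-stage reflective inference concrete, here is a gradient-free regenerate-until-safe loop. The `generate` and `is_safe` callables, the toy grid, and the iteration budget are hypothetical stand-ins for the paper's discrete diffusion planner and its NAVSIM-style safety metrics; the actual masking and local-search details differ.

```python
import random
from typing import Callable, List, Optional, Tuple

Waypoint = Tuple[int, int]          # a discrete (x, y) cell in the action codebook
Trajectory = List[Waypoint]

def reflective_plan(
    generate: Callable[[Optional[Trajectory], List[bool]], Trajectory],
    is_safe: Callable[[Waypoint], bool],
    horizon: int,
    max_iters: int = 5,
) -> Trajectory:
    """Regenerate flagged waypoints until all pass the safety check or budget ends."""
    trajectory = generate(None, [True] * horizon)        # initial full generation
    for _ in range(max_iters):
        mask = [not is_safe(wp) for wp in trajectory]    # True = unsafe, resample
        if not any(mask):
            break
        trajectory = generate(trajectory, mask)          # keep safe points, redo rest
    return trajectory

# Toy stand-ins: a random "planner" over a 10x10 grid and a no-go region x < 2.
def toy_generate(prev: Optional[Trajectory], mask: List[bool]) -> Trajectory:
    fresh = [(random.randint(0, 9), random.randint(0, 9)) for _ in mask]
    return fresh if prev is None else [f if m else p for p, f, m in zip(prev, fresh, mask)]

def toy_is_safe(wp: Waypoint) -> bool:
    return wp[0] >= 2                # pretend cells with x < 2 are off the drivable area

print(reflective_plan(toy_generate, toy_is_safe, horizon=8))
```

The design choice mirrored here is that already-safe waypoints stay fixed and only flagged ones are resampled, which is what keeps an iterative, gradient-free correction loop cheap enough to consider for real-time planning.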
AI Roundup: Zhipu Releases GLM-4.5, Ant Digital Technologies Releases the Financial Reasoning Model Agentar-Fin-R1
China Post Securities· 2025-08-06 02:33
- The GLM-4.5 model, developed by Zhipu, integrates reasoning, coding, and agentic capabilities into a single architecture. It employs a mixture-of-experts design with 355 billion total parameters, activating only 32 billion per inference to improve computational efficiency. Training proceeds in three stages: pretraining on 15 trillion general text tokens, fine-tuning on 8 trillion tokens of specialized data, and reinforcement learning for multi-task alignment. The model achieves a 37% performance improvement on complex reasoning tasks through innovations such as deep-layer prioritization and grouped-query attention [12][14][15]
- GLM-4.5 ranks third globally in AGI core-capability evaluations with a composite score of 63.2. It outperforms competitors on tasks such as web interaction (26.4% accuracy on BrowseComp) and code repair (64.2 on SWE-bench Verified), and posts an 80.8% win rate against Qwen3-Coder across 52 real-world programming tasks despite having half the parameters of DeepSeek-R1, demonstrating a superior performance-to-parameter ratio [15][16][19]
- The Agentar-Fin-R1 model, launched by Ant Digital Technologies (蚂蚁数科), is a financial reasoning model built on the Qwen3 architecture. It features a dual-engine design: the Master Builder engine translates business logic into executable code, while the Agent Group engine uses consensus algorithms for multi-agent decision-making. The model is trained on a domain-specific corpus covering six major financial sectors and reaches a financial-knowledge accuracy of 92.3% through weighted training algorithms [20][21][23]
- Agentar-Fin-R1 excels in financial evaluations, scoring 87.70 on FinEval1.0 and 86.79 on FinanceIQ. It leads in tasks such as risk pricing and compliance review, with a score of 69.93 on the Finova evaluation, surpassing larger general-purpose models. Its compliance system improves review efficiency by 90%, and its credit-approval module cuts loan processing time from 3 days to 15 minutes while lowering bad-debt rates by 18% [23][24][25]
- The Goedel-Prover-V2 theorem-proving system, developed by Princeton, Tsinghua, and NVIDIA, uses 8B and 32B parameter models to achieve state-of-the-art results. It employs scaffolded data synthesis, verifier-guided self-correction, and model averaging to enhance performance. The system reaches 88.1% Pass@32 accuracy on the MiniF2F benchmark, with the 8B model reaching 83.3% of the 671B DeepSeek-Prover-V2's performance while using only about 1/100th of the parameters [58][60][61]
- Goedel-Prover-V2 is also highly efficient: its 32B model solves 64 PutnamBench problems at Pass@64, outperforming the 671B DeepSeek-Prover-V2, which required Pass@1024 to solve 47 (the Pass@k metric itself is sketched below). Its iterative self-correction mode improves proof quality with only a minimal increase in token consumption, and each training iteration takes just 12 hours on 4 H100 GPUs [60][61][63]
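The Goedel-Prover-V2 numbers above are quoted as Pass@k (Pass@32, Pass@64, Pass@1024), i.e. the chance that at least one of k sampled proof attempts is verified correct. Below is a small sketch of the standard unbiased estimator from the HumanEval/Codex evaluation methodology; the choice of estimator and the sample counts in the example are assumptions for illustration, not figures from the report.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts of which c are correct:
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:                      # fewer failures than draws: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up numbers: 64 sampled proofs per problem, 5 verified correct.
print(round(pass_at_k(n=64, c=5, k=32), 3))
```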