Chain-of-Thought Reasoning
Beyond CLIP: Peking University Open-Sources a Fine-Grained Visual Recognition Large Model That Needs Only 4 Training Images per Class
36Kr · 2026-02-11 08:03
Core Insights
- The research team led by Professor Peng Yuxin from Peking University has made significant advancements in fine-grained visual recognition using multi-modal large models, with their latest paper accepted at ICLR 2026 and made open-source [1][19].

Group 1: Fine-Grained Visual Recognition
- The real world exhibits fine-grained characteristics, with objects often containing a rich hierarchy of categories, such as the classification of aircraft into specific models like Boeing 707, 717, and 727, with over 500 types of fixed-wing aircraft recorded globally [2].
- The Fine-R1 model aims to leverage the extensive knowledge of fine-grained subcategories contained within multi-modal large models to achieve fine-grained recognition of visual objects in open domains, overcoming the limitations of traditional methods that focus on a closed set of categories [4].

Group 2: Model Development and Methodology
- The Fine-R1 model employs a two-phase approach:
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to enhance the model's inference capabilities [7].
  2. Triplet enhancement strategy optimization, which improves the model's robustness to intra-class variations and its ability to distinguish between different classes [8].
- The model demonstrates superior performance, achieving higher accuracy in recognizing both seen and unseen subcategories with only four training images per class, surpassing models like OpenAI's CLIP and Google DeepMind's SigLIP [13][14].

Group 3: Experimental Results
- Experimental results indicate that Fine-R1 outperforms various models in both closed-set and open-set recognition tasks, showcasing its effectiveness in fine-grained visual recognition [14][16].
- The model's enhancements are attributed primarily to its improved ability to utilize fine-grained subcategory knowledge rather than merely optimizing visual representations or increasing knowledge reserves [16].
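The triplet strategy is described only at a high level here; a minimal sketch of a standard triplet-margin objective conveys the idea of pulling intra-class variants together while pushing other subcategories apart. The embeddings, margin, and function names are invented for illustration; Fine-R1's actual formulation may differ.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet-margin loss: the anchor should sit closer to a
    same-subcategory positive than to a different-subcategory
    negative, by at least `margin`."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Toy 2-D embeddings: two images of the same aircraft model vs. a
# different model (values are purely illustrative).
anchor   = [0.9, 0.1]
positive = [0.8, 0.2]   # intra-class variant (e.g. another viewpoint)
negative = [0.1, 0.9]   # different subcategory

loss = triplet_loss(anchor, positive, negative)
```

With well-separated embeddings as above, the hinge is inactive and the loss is zero; training pressure only appears when a negative drifts inside the margin.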
Beyond Prompts: Unveiling "Neural Network Reprogrammability"
机器之心· 2026-01-24 04:09
Core Viewpoint
- The article discusses the evolution of model adaptation techniques in the context of large pre-trained models, emphasizing a shift from parameter-centric adaptation to reprogrammability-centric adaptation, which allows for efficient task adaptation without modifying model parameters [5][9].

Group 1: Transition in Model Training Paradigms
- The adaptation paradigm has fundamentally shifted from traditional parameter adjustment to a focus on model reprogrammability, enabling the reuse of pre-trained models across various tasks with minimal computational overhead [5][9].
- The new approach emphasizes modifying the task presentation rather than the model itself, allowing a single frozen model to handle multiple tasks by changing the interaction method [9].

Group 2: Efficiency Advantages of Reprogrammability
- Empirical data shows that reprogrammability-centric adaptation (RCA) significantly outperforms parameter-centric adaptation (PCA) in terms of parameter efficiency, requiring 2-3 orders of magnitude fewer parameters for task adaptation [11][12].
- RCA enables adaptation in resource-constrained environments and supports simultaneous adaptation to multiple tasks without catastrophic forgetting, making it increasingly relevant as pre-trained models grow in scale and complexity [12].

Group 3: Terminology and Framework
- The article identifies a terminological confusion in the research community, where similar adaptation methods are referred to differently across fields, such as "prompt tuning" in NLP and "model reprogramming" in machine learning literature [14].
- Despite the different names, these methods fundamentally leverage the same property of neural networks (reprogrammability), leading to the proposal of a unified framework that connects these disparate research areas [14][17].

Group 4: Mathematical Expression of Reprogrammability
- The article provides a mathematical framework for neural network reprogrammability, defining how a fixed pre-trained model can be adapted to new tasks through configurable transformations without changing the model's parameters [25][34].

Group 5: Case Studies of Reprogrammability
- The article illustrates three methods of reprogramming using a vision-language model, highlighting how each method achieves the same goal of reusing a frozen model for new tasks through different computational paths [27][30].
- Input manipulation and output alignment are key components of these methods, allowing for effective task adaptation without additional training parameters [30][32].
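The framework above amounts to wrapping a frozen model M with a learnable input-side transformation and an output-side label mapping, i.e. predicting via something like f_out(M(f_in(x))). A toy sketch follows; every function, value, and label in it is a hypothetical stand-in, not the article's actual formulation.

```python
def frozen_model(x):
    """Stand-in for a fixed pre-trained classifier whose parameters
    are never touched. Returns toy scores for 3 pre-training classes."""
    return [x * 1.0, x * -0.5, x * 0.25]

def input_transform(x, delta):
    """Input-side reprogramming: a learnable perturbation added to the
    input (the analogue of a visual prompt or adversarial program)."""
    return x + delta

def output_map(scores, label_map):
    """Output-side alignment: map pre-training classes to new-task labels."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return label_map[best]

# Adaptation learns only `delta` and fixes `label_map`; the model stays frozen.
delta = 0.3                                    # hypothetical learned offset
label_map = {0: "cat", 1: "dog", 2: "bird"}    # hypothetical new-task labels
prediction = output_map(frozen_model(input_transform(2.0, delta)), label_map)
```

Note that the trainable state here is a single scalar plus a lookup table, which is the parameter-efficiency argument in miniature: adaptation cost is decoupled from model size.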
Baidu X-Driver: A VLA with Closed-Loop Evaluation
自动驾驶之心· 2025-12-28 03:30
Core Viewpoint
- The article discusses the development and evaluation of X-Driver, a unified multimodal large language model (MLLM) framework designed for closed-loop autonomous driving, emphasizing the importance of closed-loop evaluation metrics for assessing the performance of autonomous driving systems [2][3][23].

Group 1: Methodology and Architecture
- X-Driver integrates a chain-of-thought (CoT) reasoning mechanism within the MLLM to enhance decision-making in autonomous driving, processing inputs from camera data and navigation commands [6][11].
- The system operates in a closed loop, where actions taken by the vehicle affect the real-world environment, generating new sensory data for continuous optimization [7][24].
- The architecture includes LLaVA, a multimodal model that aligns features from images and text, ensuring a comprehensive understanding of driving scenarios [9][10].

Group 2: Training and Reasoning Process
- The CoT fusion training method employs high-quality CoT prompt data to improve reasoning and decision-making capabilities in driving scenarios [11][12].
- The model breaks down tasks into sub-tasks such as object detection and traffic-signal interpretation, integrating these results to generate final driving decisions [17][18].
- The training process covers accurate perception of complex 3D driving environments and adherence to traffic regulations, ensuring safe navigation [15][22].

Group 3: Closed-Loop Evaluation and Results
- The closed-loop evaluation is conducted in the CARLA simulation environment, with Driving Score and Success Rate as the key performance indicators [27][28].
- The Bench2Drive dataset, containing over 2 million frames, is used to assess closed-loop driving performance under various conditions [27].
- Results indicate that incorporating CoT reasoning significantly improves decision accuracy, although the success rate in closed-loop simulation remains only around 20% [30][31].
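The sub-task decomposition described above (solve perception sub-tasks first, then integrate them into a final decision) can be sketched in a few lines. The function names, the stub perception logic, and the rule-based integration are purely hypothetical illustrations, not X-Driver's actual interfaces or policy.

```python
def detect_objects(frame):
    """Hypothetical perception sub-task (stub for a detector head)."""
    return frame.get("objects", [])

def read_traffic_signal(frame):
    """Hypothetical traffic-signal interpretation sub-task."""
    return frame.get("signal", "none")

def decide(frame, command):
    """CoT-style decomposition: record intermediate sub-task results
    as reasoning steps, then integrate them into a driving action."""
    steps = []
    objects = detect_objects(frame)
    steps.append(f"objects ahead: {objects or 'none'}")
    signal = read_traffic_signal(frame)
    steps.append(f"traffic signal: {signal}")
    if signal == "red" or "pedestrian" in objects:
        action = "stop"          # safety overrides navigation
    else:
        action = command         # otherwise follow the navigation command
    steps.append(f"action: {action}")
    return action, steps

action, trace = decide({"objects": ["pedestrian"], "signal": "green"}, "go_straight")
```

The returned `trace` is the analogue of an explicit reasoning chain: each sub-task's result is inspectable before the final action is committed.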
Bosch's Latest 41-Page Survey of Trajectory Planning for Autonomous Driving
自动驾驶之心· 2025-12-05 00:03
Core Insights
- The article discusses the advancements and applications of foundation models (FMs) in trajectory planning for autonomous driving, highlighting their potential to enhance understanding and decision-making in complex driving scenarios [4][5][11].

Background Overview
- Foundation models are large-scale models that learn representations from vast amounts of data, applicable to various downstream tasks, including language and vision [4].
- The study emphasizes the importance of FMs in the autonomous driving sector, particularly in trajectory planning, which is deemed the core task of driving [8][11].

Research Contributions
- A classification system for methods utilizing FMs in autonomous driving trajectory planning is proposed, analyzing 37 existing methods to provide a structured understanding of the field [11][12].
- The research evaluates these methods in terms of code and data openness, offering practical references for reproducibility and reusability [12].

Methodological Insights
- The article categorizes methods into two main types: FMs customized for trajectory planning and FMs that guide trajectory planning [16][19].
- Customized FMs leverage pre-trained models, adapting them for specific driving tasks, while guiding FMs enhance existing trajectory planning models through knowledge transfer [19][20].

Application of Foundation Models
- FMs can enhance trajectory planning capabilities through various approaches, including fine-tuning existing models, utilizing chain-of-thought reasoning, and enabling language and action interactions [9][19].
- The study identifies 22 methods focused on customizing FMs for trajectory planning, detailing their functionalities and the importance of prompt design in model performance [20][32].

Challenges and Future Directions
- The article outlines key challenges in deploying FMs in autonomous driving, such as reasoning costs, model size, and the need for suitable datasets for fine-tuning [5][12].
- Future research directions include addressing the efficiency, robustness, and transferability of models from simulation to real-world applications [12][14].

Comparative Analysis
- The study contrasts its findings with existing literature, noting that while previous reviews cover various aspects of autonomous driving, this research specifically focuses on the application of FMs in trajectory planning [13][14].

Data and Model Design
- The article discusses the importance of data curation for training FMs, emphasizing the need for structured datasets that pair sensor data with trajectories [24][28].
- It also highlights different model design strategies, including the use of existing visual language models and the combination of visual encoders with large language models [27][29].

Language and Action Interaction
- The research explores models that incorporate language interaction capabilities, detailing how these models utilize visual question-answering datasets to enhance driving performance [38][39].
- It emphasizes the significance of training datasets and evaluation metrics in assessing the effectiveness of language interaction in trajectory planning [39][41].
Beyond ORION! CoT4AD: A VLA Model with Explicit Chain-of-Thought Reasoning (Latest from Peking University)
自动驾驶之心· 2025-12-02 00:03
Core Insights
- The article introduces CoT4AD, a new Vision-Language-Action (VLA) framework designed to enhance logical and causal reasoning capabilities in autonomous driving scenarios, addressing limitations in existing VLA models [1][3][10].

Background Review
- Autonomous driving is a key research area in AI and robotics, promising improvements in traffic safety and efficiency, and playing a crucial role in smart city and intelligent transportation system development [2].
- Traditional modular architectures in autonomous driving face challenges such as error accumulation and limited generalization, leading to the emergence of end-to-end paradigms that utilize unified learning frameworks [2][3].

CoT4AD Framework
- CoT4AD integrates chain-of-thought reasoning into end-to-end autonomous driving, allowing for explicit or implicit reasoning through a series of downstream tasks tailored for driving scenarios [3][10].
- The framework combines perception, language reasoning, future prediction, and trajectory planning, enabling the generation of explicit reasoning steps [6][10].

Experimental Results
- CoT4AD was evaluated on the nuScenes and Bench2Drive datasets, achieving state-of-the-art performance in both open-loop and closed-loop assessments, outperforming existing LLM-based and end-to-end methods [10][19].
- On the nuScenes dataset, CoT4AD achieved L2 distance errors of 0.12m, 0.24m, and 0.53m at 1s, 2s, and 3s respectively, with an average collision rate of 0.10% [17][18].

Contributions of CoT4AD
- The model's design allows for robust multi-task processing and future trajectory prediction, leveraging a diffusion model integrated with chain-of-thought reasoning [10][12].
- CoT4AD demonstrates superior performance in complex driving scenarios, enhancing decision-making consistency and reliability across diverse environments [19][23].

Ablation Studies
- The effectiveness of components such as perception tokenizers and the chain-of-thought design was validated through ablation studies, showing significant performance improvements when these elements were included [26][28].
- The model's ability to predict future scenarios was found to be crucial, with optimal performance achieved when predicting four future scenarios [29].

Conclusion
- CoT4AD represents a significant advancement in autonomous driving technology, demonstrating enhanced reasoning capabilities and superior performance compared to existing methods, while also highlighting areas for future research to improve computational efficiency [30][32].
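The open-loop L2 metric quoted above measures the Euclidean displacement between predicted and ground-truth waypoints at fixed horizons (1s/2s/3s). A minimal sketch with toy trajectories follows; the 2 Hz sampling rate, the waypoint indexing, and all coordinate values are assumptions for illustration, not nuScenes conventions.

```python
import math

def l2_errors(pred, gt, horizon_steps):
    """L2 displacement error at each evaluation horizon.
    `pred` and `gt` are lists of (x, y) waypoints at a fixed timestep;
    `horizon_steps` maps a horizon name to its waypoint index."""
    errors = {}
    for name, idx in horizon_steps.items():
        dx = pred[idx][0] - gt[idx][0]
        dy = pred[idx][1] - gt[idx][1]
        errors[name] = math.hypot(dx, dy)
    return errors

# Toy trajectories sampled at 2 Hz: index 1 ~ 1s, 3 ~ 2s, 5 ~ 3s.
pred = [(0, 0), (1.0, 0.1), (2.0, 0.2), (3.0, 0.2), (4.0, 0.3), (5.0, 0.4)]
gt   = [(0, 0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
errs = l2_errors(pred, gt, {"1s": 1, "2s": 3, "3s": 5})
```

In benchmark reports these per-horizon errors are then averaged over all evaluation samples; the numbers cited for CoT4AD are such dataset-level averages.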
Latest from Peking University! MobileVLA-R1: Beyond Robotic Arms, How Capable Are VLAs on Mobile Robots?
具身智能之心· 2025-11-30 03:03
Core Insights
- The article discusses the introduction of MobileVLA-R1, a new framework for quadruped robots that bridges the gap between high-level semantic reasoning and low-level action control, addressing the challenges of stability and interpretability in existing methods [1][2][21].

Group 1: Need for Reconstruction of the VLA Framework
- Current quadruped robots face two main challenges: a semantic-control gap leading to instability in command execution, and a lack of traceable reasoning that complicates error diagnosis [2].
- MobileVLA-R1's breakthrough lies in decoupling reasoning from action execution, allowing robots to "think clearly" before "acting accurately," enhancing both interpretability and control robustness [2][23].

Group 2: Implementation of MobileVLA-R1
- MobileVLA-R1 employs a structured CoT dataset, a two-stage training paradigm, and multi-modal perception fusion to achieve coherent reasoning, stable control, and strong generalization [4][6].
- The structured CoT dataset includes 18K episode-level samples, 78K step-level samples, and 38K navigation-specific samples, filling the gap in reasoning supervision from instruction to action [4][5].

Group 3: Performance Evaluation
- In navigation tasks, MobileVLA-R1 achieved success rates of 68.3% and 71.5% on the R2R-CE and RxR-CE datasets, respectively, outperforming existing methods by an average of 5% [10].
- For quadruped control tasks, it achieved an average success rate of 73% across six locomotion and operation tasks, significantly surpassing baseline models [12][13].

Group 4: Real-World Deployment
- MobileVLA-R1 was tested on the Unitree Go2 quadruped robot in various environments, demonstrating robust adaptation to complex scenarios with a success rate of 86%-91% for complex instructions [14][18].
- The integration of depth and point-cloud encoders improved navigation success rates by 5.8%, highlighting the importance of 3D spatial information for scene understanding [19][20].

Group 5: Key Conclusions and Future Directions
- MobileVLA-R1 innovatively integrates chain-of-thought reasoning with reinforcement learning, addressing the industry's dilemma of choosing between interpretability and execution stability [21][23].
- Future directions include expanding the action space for more precise tasks, reducing reasoning latency through model optimization, and enhancing self-supervised learning to decrease reliance on labeled data [23].
The Better AI Reasons, the Easier It Is to Fool? "Chain-of-Thought Hijacking" Attacks Succeed Over 90% of the Time
机器之心· 2025-11-03 08:45
Core Insights
- The article discusses a new attack method called Chain-of-Thought Hijacking, which exploits the reasoning capabilities of AI models to bypass their safety mechanisms [1][2][5].

Group 1: Attack Mechanism
- Chain-of-Thought Hijacking involves inserting a lengthy harmless reasoning sequence before a harmful request, effectively diluting the model's refusal signals and allowing harmful instructions to slip through [2][5].
- The attack has shown high success rates on various models, including Gemini 2.5 Pro (99%), GPT o4 mini (94%), Grok 3 mini (100%), and Claude 4 Sonnet (94%) [2][11].

Group 2: Experimental Setup
- The research utilized the HarmBench benchmark to evaluate the effectiveness of the attack against several reasoning models, comparing it to baseline methods like Mousetrap, H-CoT, and AutoRAN [11][15].
- The team implemented an automated process using a supporting LLM to generate candidate reasoning prefaces and integrate harmful content, optimizing the prompts without accessing the model's internal parameters [6][7].

Group 3: Findings and Implications
- The results indicate that while chain-of-thought reasoning can enhance model accuracy, it also introduces new security vulnerabilities, challenging the assumption that more reasoning leads to greater robustness [26].
- The study suggests that existing defenses are limited and may need to embed security within the reasoning process itself, such as monitoring refusal activations across layers or ensuring attention to potentially harmful text spans [26].
Can AI Make a "Sacred-Site Pilgrimage"? VIR-Bench, a New Evaluation Benchmark for Multimodal Large Models, Arrives
机器之心· 2025-10-15 04:08
Core Insights
- The article discusses the development of a new multimodal large model evaluation benchmark called VIR-Bench, aimed at assessing AI's ability to understand travel videos in terms of geographical locations and temporal sequences [4][20].
- The research emphasizes the importance of reconstructing travel itineraries from videos, which requires models to comprehend both geographic and temporal relationships [20].

Group 1: VIR-Bench Overview
- VIR-Bench is designed to evaluate AI's understanding of travel vlogs by generating a visiting order graph that represents the sequence and relationships of visited locations [6][9].
- The visiting order graph consists of nodes representing locations categorized into three levels: Prefecture, City, and Point of Interest (POI) [7][9].

Group 2: Task Design and Dataset
- The task is divided into two sub-tasks: node prediction, where the model identifies all visited locations, and edge prediction, where it determines the relationships between these locations [10][11].
- A dataset of 200 travel videos was constructed, covering 3,689 POIs across 43 prefectures in Japan, with detailed annotations for each video [17][13].

Group 3: Experimental Results and Challenges
- Current models, particularly open-source ones, lag behind commercial models in POI node recognition and transition-edge prediction, with transition-edge prediction being notably challenging [16][18].
- Model performance improves significantly with increased scale and the inclusion of geographic pre-training, highlighting the importance of these factors in enhancing accuracy [16][18].

Group 4: Future Directions
- The research indicates that while current models struggle with long-range reasoning and temporal understanding, there are clear pathways for improvement, such as enhancing spatial awareness and integrating multimodal information [20][18].
- The ultimate goal is for AI to not only analyze videos but also to possess the capability to act within the world, aligning with applications in robotics and autonomous systems [20][18].
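Edge prediction over a visiting order graph can be scored like any directed-edge retrieval task: compare the set of predicted transitions against the gold itinerary. A minimal sketch follows; the POI names are invented for illustration, and VIR-Bench's actual metric may differ.

```python
def edge_f1(pred_edges, gold_edges):
    """F1 on transition-edge prediction, where each edge is a
    (from_poi, to_poi) pair in visiting order."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)            # correctly recovered transitions
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold itinerary: Tokyo Station -> Senso-ji -> Tokyo Skytree.
gold = [("Tokyo Station", "Senso-ji"), ("Senso-ji", "Tokyo Skytree")]
# A model that recovered one transition and hallucinated another.
pred = [("Tokyo Station", "Senso-ji"), ("Tokyo Station", "Tokyo Skytree")]
score = edge_f1(pred, gold)
```

This also shows why edge prediction is harder than node prediction: a model can name every POI correctly yet still score poorly if it gets the visiting order wrong.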
ICCV 2025 | UV-CoT: A New Breakthrough in Unsupervised Visual Reasoning, Preference Optimization Reshapes Image-Level Chain-of-Thought
机器之心· 2025-07-28 04:24
Core Viewpoint
- The article discusses the introduction of a novel unsupervised visual reasoning framework called UV-CoT, which enhances model reasoning capabilities and interpretability in visual understanding tasks by leveraging a chain-of-thought (CoT) approach [2][3][25].

Group 1: Background and Challenges
- Existing models rely on supervised fine-tuning (SFT) strategies that require extensive labeled data, leading to high annotation costs and limited scalability [6][7].
- SFT methods face challenges such as high labor costs for annotating key image regions and reasoning paths, and limited generalization capabilities due to reliance on a single type of training signal [7].

Group 2: UV-CoT Framework
- UV-CoT is designed to mimic human visual understanding by focusing on "key regions → reasoning process," employing an unsupervised data generation and preference optimization mechanism [4][3].
- The framework utilizes an automated preference data generation and evaluation process, guided by an improved preference optimization algorithm called Score-DPO (sDPO), to achieve unsupervised image-level chain-of-thought learning [8][11].

Group 3: Methodology
- UV-CoT generates diverse intermediate reasoning responses for image-question pairs using a target model and an evaluation model, which assesses the selected regions' scores and their impact on subsequent answers [13].
- The preference dataset is constructed by randomly selecting preference pairs from the generated responses, retaining the highest-scoring response chains for further reasoning [14].

Group 4: Performance and Results
- UV-CoT demonstrates significant performance improvements over existing supervised chain-of-thought models, outperforming models like Visual-CoT-7B and LLaVA-1.5-7B across six benchmarks [20][22].
- The self-evaluation capability of UV-CoT leads to high-quality bounding-box generation, surpassing LLaVA-1.5-7B by 4.8% and closely approaching the performance of the 12B model OmniLMM-12B [23].

Group 5: Conclusion
- UV-CoT presents an innovative approach to unsupervised visual reasoning, eliminating the dependency on manual annotations and enabling automatic identification and reasoning optimization of key image regions, thus laying a solid foundation for future research in unsupervised visual understanding [25].
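The summary does not give sDPO's exact formula. The sketch below combines the standard DPO preference loss with the evaluator's score gap as a per-pair weight, which is one plausible reading of "Score-DPO"; the symbol names, the weighting scheme, and all toy values are assumptions, not the paper's actual objective.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              score_w, score_l, beta=0.1):
    """Sketch of a score-aware DPO objective: the usual DPO preference
    margin between the chosen (w) and rejected (l) reasoning chains,
    weighted by the evaluation model's score gap. Hypothetical form."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    weight = score_w - score_l          # larger score gap -> stronger update
    return -weight * math.log(sigmoid(margin))

# Toy log-probabilities under the policy and the frozen reference model,
# plus evaluator scores for the two candidate reasoning chains.
loss = sdpo_loss(logp_w=-1.0, logp_l=-2.0,
                 ref_logp_w=-1.5, ref_logp_l=-1.8,
                 score_w=0.9, score_l=0.4)
```

The intuition matches the framework described above: pairs where the evaluator sees a large quality gap push the policy harder toward the higher-scoring chain, while near-tied pairs contribute little.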
3D Chip Stacking: New Approaches
半导体行业观察· 2025-07-01 01:03
Core Viewpoint
- The next significant leap in semiconductor packaging will require a series of new technologies, processes, and materials that will collectively achieve an order-of-magnitude performance improvement, which is crucial for the AI era [1].

Group 1: Advances in Cooling Technologies
- Liquid cooling technology at the chip level is emerging as forced-air cooling reaches its limits, with up to 40% of power used for current delivery and heat dissipation [4].
- TSMC's silicon integrated micro-cooler (IMEC-Si) is being tested for reliability, designed to handle over 3,000 watts of uniform power dissipation under specific conditions [6].
- The demand for direct liquid cooling is increasing, with innovative concepts like using chips as coolants being proposed [7].

Group 2: Hybrid Bonding and Interconnects
- Hybrid bonding with fine-pitch multilayer redistribution layers (RDL) is gaining attention as a cost-effective solution for high-speed interconnects [14].
- Intel's hybrid bonding can achieve spacing as small as 1µm, which is critical for advanced applications [5][17].
- The transition from traditional dielectric materials to polymer/copper hybrid bonding is being explored to enhance performance [16].

Group 3: Backside Power Delivery
- Backside power delivery significantly reduces voltage drop related to transistor power supply, but it also exacerbates heat issues [19].
- IBM has developed an anisotropic model for precise heat-transfer calculations in backend stacks, emphasizing the importance of thermal considerations in design [21].
- The implementation of backside power delivery is expected to lead to a 10% to 30% reduction in thermal losses [23].

Group 4: Co-Packaged Optical Devices
- The demand for faster data networks is driving the integration of optical engines with GPUs and HBM in a single package, significantly increasing data transmission speeds [26].
- Co-packaged optical devices (CPO) are expected to achieve a 32-fold increase in bandwidth by bringing optical engines closer to processors [26].
- However, challenges remain regarding thermal management and warpage sensitivity in CPO implementations [28].