ACM MM'25 | New SOTA for 2D Object Detection in Autonomous Driving! Surpassing the Latest YOLO Series
自动驾驶之心 · 2025-08-01 16:03
Core Viewpoint
- The article presents Butter, a new object detection framework for autonomous driving. It targets the difficulty of modeling multi-scale semantic information, with the aim of improving detection robustness and deployment efficiency [3][11].

Group 1: Framework Innovations
- Butter introduces two core innovations in the Neck layer: the Frequency-Adaptive Feature Consistency Enhancement (FAFCE) module and the Progressive Hierarchical Feature Fusion Network (PHFFNet) [3][15].
- FAFCE sharpens boundary resolution by combining high-frequency detail enhancement with low-frequency noise suppression, while PHFFNet fuses semantic information progressively to strengthen multi-scale feature representation [3][15].

Group 2: Performance Metrics
- Butter outperforms existing state-of-the-art (SOTA) methods in detection accuracy with far fewer parameters. On the KITTI dataset it reaches an mAP@50 of 94.4%, 1.2 percentage points above the previous best, while using only about one-third of the computational load [32][34].
- On the BDD100K and Cityscapes datasets, Butter reaches mAP@50 scores of 53.7% and 53.2%, respectively, ahead of other lightweight models; the gain on Cityscapes is 1.6 percentage points [32][34].

Group 3: Structural Challenges
- Existing Neck structures often suffer from frequency aliasing and rigid fusion processes, which degrade feature expression and detection accuracy, particularly for small objects in complex environments [9][10].
- Butter addresses these structural bottlenecks by decoupling frequency modeling from multi-scale fusion, balancing accuracy against efficiency [11][12].
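
The frequency aliasing mentioned under Structural Challenges can be illustrated with a small one-dimensional example. This is a generic signal-processing sketch, not code from Butter: the function names `downsample_naive` and `downsample_filtered` are hypothetical, and the box filter is only a stand-in for whatever filtering a real Neck would use.

```python
# Illustrative only: a 1D toy signal stands in for a feature map row.

def downsample_naive(signal, stride=2):
    """Keep every `stride`-th sample with no filtering beforehand."""
    return signal[::stride]

def downsample_filtered(signal, stride=2):
    """Average each non-overlapping window of `stride` samples (a box
    low-pass filter) and keep one value per window."""
    return [sum(signal[i:i + stride]) / stride
            for i in range(0, len(signal) - stride + 1, stride)]

# The finest possible detail: values alternate at every sample.
fine_detail = [1.0, -1.0] * 4

naive = downsample_naive(fine_detail)
filtered = downsample_filtered(fine_detail)

print(naive)     # [1.0, 1.0, 1.0, 1.0]  the detail aliases into a false constant
print(filtered)  # [0.0, 0.0, 0.0, 0.0]  the unrepresentable detail is removed
```

Without filtering, the alternating pattern reappears at the coarser scale as a flat signal of the wrong value, which is the kind of distortion that can mislead a later fusion step.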

Group 4: Methodology Overview
- The Butter framework takes a 640×640 monocular image as input, extracts initial features with a lightweight Backbone module, and refines them through several lightweight blocks before they enter the Neck module [16][17].
- In the Head layer, the model uses a four-output head to produce the final detection results: class labels, confidence scores, and bounding boxes [16][17].

Group 5: Feature Fusion Techniques
- FAFCE improves the accuracy and robustness of feature fusion through high-frequency amplification and low-frequency damping, which make multi-scale feature integration more consistent and precise [20][27].
- PHFFNet applies a hierarchical fusion strategy that reduces semantic discrepancies between non-adjacent layers, significantly improving detection accuracy and alignment in scenarios that require precise boundary detection [29][30].
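
The two fusion ideas above can be sketched on one-dimensional toy features. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not give the actual FAFCE or PHFFNet operations, so the box filter, the fixed gains `high_gain` and `low_gain`, the nearest-neighbour upsampling, and the averaging fusion are all hypothetical stand-ins, as are the function names.

```python
# Illustrative only: plain lists stand in for feature maps.

def box_lowpass(x, radius=1):
    """Moving average with edge clamping; approximates the low-frequency part."""
    n = len(x)
    smoothed = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed

def reweight_frequencies(x, high_gain=1.5, low_gain=0.9):
    """Split x into low and high bands, amplify the high band and damp the
    low band, then recombine. The gains are assumed values."""
    low = box_lowpass(x)
    high = [value - smooth for value, smooth in zip(x, low)]
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]

def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [value for value in x for _ in range(2)]

def progressive_fuse(levels):
    """Fuse a fine-to-coarse pyramid one adjacent pair at a time, so that
    non-adjacent levels only meet through an intermediate result."""
    fused = levels[-1]
    for finer in reversed(levels[:-1]):
        upsampled = upsample2(fused)
        fused = [(a + b) / 2 for a, b in zip(finer, upsampled)]
    return fused

edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]          # a step edge (object boundary)
sharpened = reweight_frequencies(edge)

pyramid = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0], [4.0]]  # each level half the size
fused = progressive_fuse(pyramid)

print(sharpened)
print(fused)  # [2.0, 2.0, 2.0, 2.0]
```

With the assumed gains, the jump across the edge grows from 1.0 to about 1.3 while flat regions are slightly damped. In `progressive_fuse`, the coarsest level is merged with the middle level before anything reaches the finest one, which is one simple way to avoid fusing non-adjacent levels directly.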