Accurate vehicle damage detection is essential in intelligent transportation systems, insurance claim assessment, and automotive maintenance. Although conventional detection models demonstrate strong performance, they still struggle to capture fine-grained details and long-range dependencies, which can constrain their effectiveness in real-world applications. To address these limitations, we propose HL-YOLO, an enhanced YOLO11-based architecture that integrates Heterogeneous Convolutions (HetConv) to improve feature extraction diversity and Large Separable Kernel Attention (LSKA) to strengthen contextual representation. Model evaluation results on a vehicle damage dataset demonstrate that HL-YOLO consistently outperforms the YOLO11 baseline, achieving relative improvements of 2.5% in precision, 5.8% in recall, 3.9% in mAP50, and 3.1% in mAP50–95. These results underscore the model’s robustness in identifying complex damage types, ranging from scratches and dents to accident-induced damage. Although inference latency increased moderately due to the added architectural complexity, the overall accuracy gains confirm the effectiveness of HL-YOLO in scenarios where detection reliability is prioritized over real-time speed. The proposed model shows strong potential for deployment in insurance automation, intelligent traffic monitoring, and vehicle after-service systems, providing a reliable framework for accurate vehicle damage assessment.
1. Introduction
Vehicle damage detection has become an increasingly important task in intelligent transportation systems, insurance claim assessment, and automotive maintenance [1]. With the rapid growth of the automobile industry, the rising number of vehicles inevitably leads to accidents and wear-related damages. Accurate and automated detection of such damages is essential for ensuring road safety, expediting insurance claim procedures, and improving after-sales services [2]. Manual inspection, however, is often inefficient, subjective, and prone to inconsistencies, which motivates the development of computer vision-based approaches capable of providing fast, consistent, and reliable results [3].
Deep learning, particularly convolutional neural networks (CNNs), has demonstrated remarkable success in a wide range of visual recognition tasks, from object detection to semantic segmentation [4]. Among detection frameworks, the You Only Look Once (YOLO) series has gained wide adoption due to its favorable trade-off between speed and accuracy. From YOLOv5 to YOLOv8, and more recently YOLO11, the models have undergone significant architectural refinements, such as improved backbones, better feature aggregation strategies, and advanced training techniques, which have collectively enhanced detection performance [5,6]. Despite these advancements, challenges remain in the specific context of vehicle damage detection. Damage types such as scratches, dents, and accident-related traces are typically subtle, irregular in form, and surrounded by complex visual contexts, which makes them challenging to differentiate from background noise [7]. Conventional convolutional operations may have limitations in capturing fine-grained details, while existing attention mechanisms often provide only limited capability for modeling long-range dependencies [8,9]. In addition, bounding box regression methods based on traditional IoU-derived loss functions may converge relatively slowly and insufficiently optimize localization accuracy, which can in turn affect detection robustness [10].
To address these limitations, we propose HL-YOLO, an enhanced YOLO11-based framework specifically tailored for vehicle damage detection. The HL-YOLO framework leverages the YOLO11 architecture as a strong foundation for high-speed object detection. To overcome the specific challenges of fine-grained vehicle damage recognition, we introduce three core structural and methodological improvements. We integrate Heterogeneous Convolutions (HetConv) into the backbone to enhance feature diversity and apply a Large Separable Kernel Attention (LSKA) mechanism to strengthen contextual modeling and long-range dependency capture. Additionally, the training methodology employs the SIoU loss function for bounding box regression, which incorporates geometric constraints (angle and distance) to optimize localization accuracy and achieve faster convergence than conventional IoU-based methods.
Together, these enhancements enable HL-YOLO to more effectively detect fine-grained and complex vehicle damages, achieving a stronger balance between accuracy and robustness while maintaining practical efficiency for real-world deployment. In this work, HL-YOLO is proposed as an enhanced framework based on the YOLO11 architecture, specifically developed for accurate and efficient vehicle damage recognition. The main contributions are as follows:
Enhanced Feature Extraction with HetConv: Heterogeneous Convolutions are incorporated into the backbone to enrich feature diversity and adaptively capture both local and global structures. This design provides a more comprehensive representation of subtle vehicle damages, such as scratches and small dents, which are often challenging for conventional convolutional kernels to capture.
Contextual Awareness via Large Separable Kernel Attention: An LSKA mechanism is integrated to expand the receptive field and enhance the modeling of long-range dependencies. This allows the network to better capture complex damage patterns, particularly in cluttered or low-contrast environments.
Optimized Localization with SIoU Loss: For bounding box regression, the SIoU loss function is adopted, integrating angle, distance, and shape constraints. These additional factors contribute to faster convergence and more accurate localization, which are particularly beneficial for detecting irregular and fine-grained damage.
Application-Oriented Design: With these improvements, HL-YOLO surpasses the YOLO11 baseline in precision, recall, and mAP, while maintaining practical inference speed for real-world applications such as transportation monitoring, insurance assessment, and vehicle maintenance.
2. Related Works
The field of vehicle damage detection has attracted increasing research attention in recent years, driven by advances in computer vision and deep learning. A recent systematic review of AI-based approaches emphasizes that, although notable progress has been made, several key challenges remain unresolved, including data scarcity, the subtle nature of certain damages, and the complexity of real-world backgrounds [11].
Deep learning methods, particularly convolutional neural networks, have been extensively applied to vehicle damage localization and classification tasks. For instance, a CNN-based approach was proposed to detect and localize damage across twelve distinct categories, demonstrating the effectiveness of deep feature representations in capturing fine-grained characteristics [12]. Transfer learning techniques have also been adopted to enhance detection robustness. In one study, Mask R-CNN was first employed to segment vehicles from their surroundings, followed by a CNN-based classifier to determine whether damage was present, effectively reducing background interference and improving accuracy [13].
Beyond classification, instance segmentation frameworks such as Mask R-CNN have been adapted to provide quantitative assessments of damage. These methods enable measurement of damaged surface areas, which is particularly useful for applications in insurance claim processing and repair cost estimation [14,15]. Other research efforts have focused on multi-class exterior damage detection, aiming to identify and distinguish between different categories such as scratches, dents, and cracks under varied real-world conditions [16]. In addition, the introduction of the Three-Quarter View Car Damage Dataset has provided a valuable benchmark for vehicle damage classification tasks. This dataset has been evaluated using multiple backbone architectures, including ResNet, DenseNet, and EfficientNet, highlighting the importance of diverse viewpoints in building robust classifiers [17].
In contrast, this study emphasizes an integrated architecture that simultaneously localizes and classifies vehicle damage, providing a more streamlined and effective solution. This work develops HL-YOLO, an enhanced YOLO11-based model that incorporates heterogeneous convolutions, large-kernel attention, and the SIoU loss function. These enhancements collectively improve feature representation, contextual understanding, and localization accuracy, yielding a more robust framework for real-world vehicle damage detection.
3. Materials and Methods
3.1. Dataset Construction
A dedicated dataset was constructed to facilitate the training and evaluation of HL-YOLO, focusing specifically on vehicle damage detection. Images were collected from a variety of publicly available online sources and open repositories, as well as images from our own collection, with the objective of covering a broad spectrum of damage scenarios encountered in real-world automotive environments. The data selection process was supervised by an expert with extensive practical experience in vehicle damage assessment to ensure sufficient diversity in viewing angles, damage severity, and environmental conditions. The dataset comprises a total of 3830 images, distributed across three representative categories of damage: accident-related structural damage, localized dents, and surface scratches. These categories reflect the most common forms of external vehicle damage that are relevant in traffic safety monitoring, insurance assessment, and maintenance services.
All images were annotated using Labelme (version 3.16.7) and then converted into the YOLO format, with bounding boxes delineating the precise regions of visible damage. This annotation scheme provides both categorical and spatial information, enabling the detector to learn not only the type of damage but also its location. The dataset encompasses substantial variation in resolution, image quality, viewpoint, illumination conditions, and background complexity. To ensure consistency across inputs and compatibility with the YOLO architecture, all images were uniformly resized to 640 × 640 pixels before training and testing, thereby supporting robust model generalization.
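Since the Labelme rectangles must be re-expressed as normalized YOLO boxes, a short conversion step is involved. The following is a minimal sketch of that step, assuming Labelme’s standard JSON rectangle format; the class-name-to-index mapping shown is hypothetical, not the paper’s actual mapping.

```python
import json
from pathlib import Path

# Hypothetical class mapping; the actual label names used in the dataset may differ.
CLASSES = {"accident": 0, "dent": 1, "scratch": 2}

def labelme_to_yolo(json_path: Path, out_dir: Path) -> None:
    """Convert one Labelme rectangle-annotation file to a YOLO .txt label file."""
    data = json.loads(json_path.read_text())
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        if shape["shape_type"] != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # YOLO format: class x_center y_center width height, normalized to [0, 1].
        xc = (x1 + x2) / 2 / w
        yc = (y1 + y2) / 2 / h
        bw = abs(x2 - x1) / w
        bh = abs(y2 - y1) / h
        lines.append(f"{CLASSES[shape['label']]} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    (out_dir / f"{json_path.stem}.txt").write_text("\n".join(lines))
```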
For model development, the dataset was partitioned into a training set of 3427 images and a validation set of 403 images. This division was carefully designed to maintain class balance across the three categories, facilitating robust parameter optimization and reliable performance evaluation. To enhance model generalization and mitigate overfitting, a three-stage data augmentation strategy was employed, involving random 90-degree rotations to simulate camera orientation variations, Gaussian blurring to emulate motion-induced image degradation, and the injection of stochastic noise to mimic artifacts from low-quality sensors or unstable transmission conditions.
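The three-stage augmentation strategy can be expressed compactly with an augmentation library. The snippet below is a sketch using Albumentations; the probabilities and blur magnitudes are assumptions, since the paper does not report them.

```python
import albumentations as A

# Sketch of the three-stage augmentation pipeline; p-values and magnitudes are assumed.
augment = A.Compose(
    [
        A.RandomRotate90(p=0.5),                   # random 90-degree rotations
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),  # motion-induced degradation
        A.GaussNoise(p=0.3),                       # sensor/transmission artifacts
    ],
    # Keep the YOLO-format bounding boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = augment(image=img, bboxes=boxes, class_labels=labels)
```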
The resulting dataset, carefully curated, annotated with Labelme, standardized to a consistent resolution, and reviewed by an expert, represents a comprehensive and diverse benchmark for vehicle damage detection. This resource provides a strong foundation for evaluating the efficacy of HL-YOLO in addressing fine-grained and complex detection tasks under challenging real-world conditions.
3.2. HL-YOLO
The HL-YOLO framework represents an advancement over the YOLO11 baseline, incorporating a suite of architectural innovations specifically designed to address the challenges of vehicle damage detection. As depicted in Figure 1, the network preserves the established tripartite structure—backbone, neck, and detection head—but introduces heterogeneous convolutions and advanced attention mechanisms to improve its capacity for identifying subtle and irregular damage patterns.
The backbone network comprises multiple C3k2-HetConv modules, implementing heterogeneous convolutions that combine kernels of varying sizes to promote feature diversity and enhance adaptability [18]. This design choice enables the extraction of fine-grained representations of damage characteristics, such as scratches and dents, while maintaining computational efficiency. An SPPF block further aggregates multi-scale features, and the inclusion of the C2PSA-LSKA module introduces Large Separable Kernel Attention, expanding the receptive field and facilitating the modeling of long-range dependencies to improve contextual understanding in visually complex scenes [19].
The neck incorporates a multi-scale feature fusion strategy, employing upsampling and concatenation to effectively integrate low-level spatial details with high-level semantic information. The inclusion of HetConv modules within this stage further enhances the robustness of feature propagation, enabling the capture of both small-scale and large-scale damage cues while preserving spatial resolution.
The detection head generates predictions at three different scales, enabling sensitivity to damages of varying sizes and shapes. By leveraging the enriched multi-scale features extracted from the backbone and neck, this component enhances both localization accuracy and category-level classification performance.
The HL-YOLO framework effectively balances fine-grained local feature extraction with robust global context modeling. The synergistic integration of heterogeneous convolutions, large-kernel attention, and optimized SIoU-based regression results in a model with strong potential for accurately identifying diverse forms of vehicle damage under challenging real-world conditions [20].
3.2.1. Large Separable Kernel Attention
The Large Separable Kernel Attention (LSKA) mechanism is designed to effectively capture long-range dependencies and contextual relationships while maintaining computational efficiency. Conventional convolution operations, particularly those with small kernel sizes, are inherently limited by restricted receptive fields, which hinder the extraction of structural information crucial for fine-grained object recognition. To mitigate this limitation, LSKA decomposes a large 2D kernel into multiple lightweight depthwise convolutions, enabling the simulation of large kernel receptive fields without incurring substantial computational overhead.
As illustrated in Figure 2, LSKA employs a factorization strategy wherein a large square convolutional kernel is decomposed into separable 1D kernels, such as 1 × k and k × 1, followed by depthwise dilated convolutions to further enlarge the receptive field. This decomposition substantially reduces the model’s parameter count while maintaining the capacity to aggregate spatial information across extended regions. A final pointwise convolution integrates the extracted features, and the resulting output is modulated through an attention weighting mechanism, enabling the selective emphasis of salient features and suppression of irrelevant noise. Compared with standard convolution and attention modules, LSKA achieves a better balance between accuracy and efficiency [21,22]. By leveraging separable and dilated convolutions, it not only reduces computational overhead but also ensures that long-range contextual cues are effectively incorporated. This is particularly advantageous in vehicle damage detection, where damage such as scratches or fractures may appear subtle and spatially scattered. Incorporating LSKA enhances HL-YOLO’s capability to detect such fine-grained patterns by expanding its receptive field and improving feature discriminability, thereby supporting more robust detection performance in complex visual environments.
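To make the factorization concrete, the following PyTorch sketch implements an LSKA-style block as described above: two 1D depthwise convolutions approximate a square kernel, dilated depthwise convolutions enlarge the receptive field, and a pointwise convolution produces the attention map. The kernel sizes and dilation rate are illustrative assumptions, not the exact values used in HL-YOLO.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """LSKA-style attention: separable 1D depthwise convs, dilated depthwise
    convs, and a pointwise conv whose output gates the input features."""

    def __init__(self, dim: int, k: int = 5, kd: int = 7, dilation: int = 3):
        super().__init__()
        pad, pad_d = k // 2, (kd // 2) * dilation
        # 1 x k and k x 1 depthwise convolutions approximate a k x k kernel.
        self.dw_h = nn.Conv2d(dim, dim, (1, k), padding=(0, pad), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (k, 1), padding=(pad, 0), groups=dim)
        # Dilated depthwise convolutions further enlarge the receptive field.
        self.dwd_h = nn.Conv2d(dim, dim, (1, kd), padding=(0, pad_d),
                               dilation=dilation, groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (kd, 1), padding=(pad_d, 0),
                               dilation=dilation, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)  # pointwise feature integration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dw_v(self.dw_h(x))
        attn = self.dwd_v(self.dwd_h(attn))
        attn = self.pw(attn)
        return x * attn  # attention weighting: emphasize salient responses
```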
The C2PSA-LSKA module integrates the C2PSA mechanism with LSKA to enhance the model’s ability to capture both fine-grained local details and global contextual dependencies. As illustrated in Figure 3, the input feature map is initially processed by a convolutional layer and subsequently partitioned into multiple channel groups. This partial split strategy facilitates independent emphasis or suppression of disparate feature subspaces, thereby reducing redundancy and maximizing representational efficiency.
The partitioned features are then selectively routed through multiple PSABlock-LSKA branches. Each PSABlock harnesses the benefits of LSKA, employing factorized large-kernel depthwise convolutions to significantly expand the receptive field while preserving computational efficiency. This facilitates the effective aggregation of spatially distant features, a critical capability for vehicle damage detection where subtle imperfections may be spatially dispersed and context-dependent. The attention weighting mechanism within each branch adaptively emphasizes salient channel responses, enabling the model to learn robust discriminative representations under adverse visual conditions.
Following processing through the parallel PSA-LSKA branches, the outputs are concatenated and fused, followed by a convolutional transformation to re-establish channel interactions. This hierarchical design facilitates a synergistic balance between the preservation of fine-grained local detail and the modeling of long-range contextual dependencies, thereby enhancing both feature expressiveness and discriminative power [23]. By incorporating this module into HL-YOLO, the detector demonstrates improved sensitivity to subtle damage cues, increased robustness to background interference, and superior generalization capabilities in real-world scenarios.
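A schematic sketch of this split-transform-merge layout is given below, reusing the LSKA module from the previous sketch; the block depth and even channel split are assumptions intended only to illustrate the routing described above, not the authors’ exact module.

```python
import torch
import torch.nn as nn

class PSABlockLSKA(nn.Module):
    """Schematic PSABlock-LSKA branch: residual LSKA attention plus a small
    pointwise feed-forward mix (LSKA is the class sketched above)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = LSKA(dim)
        self.ffn = nn.Sequential(nn.Conv2d(dim, dim * 2, 1), nn.SiLU(),
                                 nn.Conv2d(dim * 2, dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x)    # residual attention
        return x + self.ffn(x)  # residual channel interaction

class C2PSA_LSKA(nn.Module):
    """Schematic C2PSA-LSKA: project, partially split the channels, route one
    part through stacked PSABlock-LSKA units, then concatenate and fuse."""
    def __init__(self, dim: int, n: int = 1):
        super().__init__()
        self.cv1 = nn.Conv2d(dim, dim, 1)
        self.blocks = nn.Sequential(*[PSABlockLSKA(dim // 2) for _ in range(n)])
        self.cv2 = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.cv1(x).chunk(2, dim=1)  # partial split into two groups
        return self.cv2(torch.cat((a, self.blocks(b)), dim=1))
```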
3.2.2. C3K2-HetConv
Conventional convolution employs identical K × K kernels across all input channels, as illustrated in Figure 4. Although effective for feature extraction, this homogeneous design can be computationally expensive and introduce redundancy. Heterogeneous Convolution addresses these limitations by mixing large K × K kernels with lightweight 1 × 1 kernels within the same layer. As shown in Figure 4, a subset of input channels is processed with the larger kernels to capture fine spatial details, while the remaining channels utilize the smaller kernels for dimensionality reduction and cross-channel interaction. The balance between the two is determined by the partition factor P. For instance, when P is set to 2, half of the channels employ large kernels, whereas at P = 4, only one quarter do, thereby further reducing computational cost [24].
Formally, the computational cost (in FLOPs) of standard convolution is:

$$FL_{std} = M^{2} \times C_{in} \times C_{out} \times K^{2} \quad (1)$$
In contrast, HetConv decomposes this into two parts, one for the K × K kernels and one for the 1 × 1 kernels:

$$FL_{het} = \frac{M^{2} \times C_{in} \times C_{out} \times K^{2}}{P} + M^{2} \times C_{in} \times C_{out}\left(1 - \frac{1}{P}\right) \quad (2)$$
where $M$ is the feature map size, $C_{in}$ and $C_{out}$ are the input and output channels, and $P$ is the partition factor. This heterogeneous formulation clearly reduces FLOPs compared with the homogeneous case while maintaining strong feature extraction capacity. By combining local detail extraction from K × K kernels with efficient channel mixing from 1 × 1 kernels, HetConv enables multi-scale representation learning. In vehicle damage detection, this is particularly beneficial for identifying small and irregular damages such as scratches, dents, and fractures embedded in complex environments. In this study, we adopt HetConv with a partition factor that strikes a balance between computational efficiency and detection accuracy, making the model suitable for real-world applications that require both precision and speed.
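As a quick sanity check on Equations (1) and (2), the helper below computes both costs; with K = 3 and P = 4, HetConv needs roughly one third of the standard FLOPs. This is an illustrative calculation with assumed layer dimensions, not profiling of the actual model.

```python
def conv_flops(M: int, c_in: int, c_out: int, K: int) -> float:
    """Multiply-accumulate cost of a standard K x K convolution, Eq. (1)."""
    return M * M * c_in * c_out * K * K

def hetconv_flops(M: int, c_in: int, c_out: int, K: int, P: int) -> float:
    """HetConv cost, Eq. (2): 1/P of the kernels are K x K, the rest 1 x 1."""
    return (M * M * c_in * c_out * K * K / P
            + M * M * c_in * c_out * (1 - 1 / P))

# Example: a 3 x 3 layer on a 40 x 40 feature map with 256 channels in and out.
std = conv_flops(40, 256, 256, 3)
het = hetconv_flops(40, 256, 256, 3, P=4)
print(f"HetConv cost: {het / std:.1%} of standard convolution")  # 33.3%
```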
Capitalizing on the computational advantages and enhanced multi-scale feature representation afforded by HetConv, the C3k2-HetConv module further integrates the Cross Stage Partial (CSP) strategy and variable convolutional designs to enhance both computational efficiency and feature discriminability. As illustrated in Figure 5, the module operates in two distinct configurations, contingent upon the logical state of the parameter c3k.
When c3k is set to False, the module adopts a CSPHet-Bottleneck structure in which the input feature map is divided into two parallel branches. One branch is processed through multiple CSPHet-Bottleneck units that integrate standard residual bottlenecks with heterogeneous convolution kernels, whereas the other branch bypasses these transformations. The outputs from both branches are subsequently concatenated and fused via a convolution operation. This architecture mitigates redundant gradient information while preserving a balance between feature reuse and gradient flow, thereby enhancing training stability and computational efficiency.
When c3k = True, the module replaces the bottleneck with stacked C3k2 blocks, which integrate multiple heterogeneous convolutions in a residual manner. This approach yields an enhanced representational capacity, enabling the concurrent capture of fine-grained spatial dependencies and long-range contextual cues [25]. In contrast to a bottleneck architecture, this configuration prioritizes richer multi-scale feature extraction, a property particularly conducive to tasks requiring precise localization accuracy.
Consequently, the C3k2-HetConv module represents a flexible generalization of HetConv, providing a trade-off between computational complexity and representational capacity. The splitting and fusion strategy effectively mitigates redundancy, while the heterogeneous kernel composition facilitates efficient multi-scale feature extraction. This dual-path design enables the module to adaptively balance computational efficiency with expressive feature learning, leading to improved performance of object detection networks in both resource-constrained and accuracy-critical scenarios.
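The dual-path design can be sketched as follows. The HetConv layer below is a simplified approximation (it ignores the per-filter kernel interleaving of the original formulation, though its FLOPs match Eq. (2)), and the surrounding split/concatenate/fuse structure is schematic rather than the authors’ exact module.

```python
import torch
import torch.nn as nn

class HetConv(nn.Module):
    """Approximate HetConv: 1/P of the input channels pass through K x K
    kernels, the remainder through 1 x 1 kernels, and the results are summed."""
    def __init__(self, c_in: int, c_out: int, K: int = 3, P: int = 4):
        super().__init__()
        self.split = c_in // P
        self.kxk = nn.Conv2d(self.split, c_out, K, padding=K // 2)
        self.pw = nn.Conv2d(c_in - self.split, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.kxk(x[:, :self.split]) + self.pw(x[:, self.split:])

class C3k2HetConv(nn.Module):
    """Schematic C3k2-HetConv: split the projected input, transform one branch
    with stacked HetConv bottlenecks, bypass the other, then fuse."""
    def __init__(self, c: int, n: int = 2, P: int = 4):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 1)
        half = c // 2
        self.branch = nn.Sequential(
            *[nn.Sequential(HetConv(half, half, K=3, P=P), nn.SiLU())
              for _ in range(n)])
        self.cv2 = nn.Conv2d(c, c, 1)  # fuse the concatenated branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.cv1(x).chunk(2, dim=1)
        return self.cv2(torch.cat((a, self.branch(b)), dim=1))
```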
3.2.3. SIoU-Based Localization Loss
Accurate bounding box regression is essential for enhancing detection performance, particularly in complex scenarios involving small, rotated, or densely distributed targets. Conventional IoU-based loss functions, including GIoU, DIoU, and CIoU, mainly focus on overlap, center distance, and aspect ratio alignment, while providing limited consideration of the angular relationship between predicted and ground-truth boxes. This omission can lead to reduced localization accuracy when bounding boxes share similar sizes and positions but differ in orientation [26]. To overcome this limitation, the Scylla IoU (SIoU) loss introduces an additional angle-aware constraint, decomposing the localization error into four complementary components: overlap loss, distance loss, shape loss, and angle loss [27].
As illustrated in Figure 6, SIoU measures the angle α between the principal axis and the line connecting the centers of the predicted box $B$ and the ground-truth box $B^{GT}$. This angle serves as a measure of orientation deviation, enabling bounding box regression to account not only for positional displacement but also for geometric alignment. The angle is computed as follows:
$$\alpha = \arcsin\left(\frac{\left|y_{c}^{gt} - y_{c}\right|}{\sigma}\right), \quad \sigma = \sqrt{\left(x_{c}^{gt} - x_{c}\right)^{2} + \left(y_{c}^{gt} - y_{c}\right)^{2}} \quad (3)$$

where $(x_{c}, y_{c})$ and $(x_{c}^{gt}, y_{c}^{gt})$ represent the centers of the predicted and ground-truth boxes and σ is the distance between them. A smaller α indicates closer alignment with the principal axis, reflecting improved orientation consistency. To complement angle modeling, SIoU further incorporates distance and shape constraints. The distance term adaptively reweights center-point deviations based on angular error, thereby imposing stronger penalties on misaligned predictions. The shape constraint enforces consistency between the width and height of predicted and ground-truth boxes, defined as follows:

$$\Omega = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_{t}}\right)^{\theta} \quad (4)$$

where $\omega_{w} = |w - w^{gt}| / \max(w, w^{gt})$ and $\omega_{h} = |h - h^{gt}| / \max(h, h^{gt})$ represent normalized differences in width and height, and θ is a scaling factor. The final SIoU loss extends the conventional IoU penalty by incorporating orientation- and shape-aware terms, formulated as:

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \quad (5)$$

where $\Delta$ denotes the angle-guided distance cost. The incorporation of angular deviation into the regression process enables Scylla IoU to achieve more refined geometric modeling, accelerate convergence, and improve detection robustness [28]. Consequently, it is particularly well-suited for challenging detection tasks involving rotated, small-scale, or densely packed objects.
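For reference, a PyTorch sketch of the SIoU loss is given below, written from the published SIoU formulation (angle, distance, and shape costs) rather than from the authors’ code; the shape exponent θ = 4 is the value suggested in the original SIoU paper, and the exact variant used in HL-YOLO may differ.

```python
import torch

def siou_loss(pred: torch.Tensor, target: torch.Tensor,
              theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    """SIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format."""
    # IoU term.
    iw = (torch.min(pred[..., 2], target[..., 2])
          - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    ih = (torch.min(pred[..., 3], target[..., 3])
          - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = iw * ih
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Angle cost (Eq. 3): deviation of the center line from the nearest axis.
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    sigma = torch.sqrt((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) + eps
    sin_alpha = ((cy2 - cy1).abs() / sigma).clamp(0, 1 - eps)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - torch.pi / 4) ** 2

    # Distance cost, reweighted by the angle term (gamma = 2 - angle).
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    gamma = 2 - angle
    dist = ((1 - torch.exp(-gamma * ((cx2 - cx1) / (cw + eps)) ** 2))
            + (1 - torch.exp(-gamma * ((cy2 - cy1) / (ch + eps)) ** 2)))

    # Shape cost (Eq. 4): normalized width/height differences.
    ow = (w1 - w2).abs() / torch.max(w1, w2).clamp(min=eps)
    oh = (h1 - h2).abs() / torch.max(h1, h2).clamp(min=eps)
    shape = (1 - torch.exp(-ow)) ** theta + (1 - torch.exp(-oh)) ** theta

    return 1 - iou + (dist + shape) / 2  # Eq. (5)
```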
3.2.4. Evaluation Indicators
A comprehensive evaluation of the proposed HL-YOLO model for vehicle damage detection requires not only conventional object detection metrics but also considerations of robustness and real-time performance. Since vehicle damage detection is closely related to safety-critical applications such as insurance assessment, intelligent transportation monitoring, and automotive maintenance, the evaluation framework must capture the model’s ability to achieve high accuracy, minimize false alarms, avoid missed detections, and operate efficiently in real time to ensure practical applicability [29].
Among the fundamental evaluation indicators, precision (P) and recall (R) play a central role. Precision reflects the proportion of true damage detections among all predicted damages:

$$P = \frac{TP}{TP + FP} \quad (6)$$

where TP represents true positives and FP false positives. Precision serves as a measure of the reliability of positive predictions, quantifying the proportion of true positives among all positively identified damage instances. A high precision score indicates that the model rarely confuses undamaged regions with actual damage, which is essential for preventing unnecessary interventions and ensuring operational credibility. Conversely, recall assesses the model’s ability to capture the full extent of existing damage by measuring the proportion of true cases that are successfully detected:

$$R = \frac{TP}{TP + FN} \quad (7)$$

where FN represents false negatives. Recall indicates the proportion of true damage instances that the model successfully identifies. In the context of vehicle damage detection, recall is particularly critical, as missed detections—such as overlooking severe dents or structural failures—may lead to incomplete assessments, underestimated repair costs, or even safety risks if the damage remains unaddressed. To unify these complementary measures, the average precision (AP) is defined as the area under the precision–recall curve:

$$AP = \int_{0}^{1} P(R)\, dR \quad (8)$$

where P(R) denotes precision as a function of recall. When dealing with multiple vehicle damage categories, including accident-related structural damage, dents, and scratches, the mean average precision (mAP) is utilized:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \quad (9)$$

with N representing the total number of categories. Collectively, AP and mAP provide a rigorous and standardized evaluation of both detection and classification accuracy, offering a comprehensive perspective on the model’s performance across diverse damage types.

The Fβ-Score is a generalized form of the F1-Score that provides a weighted harmonic mean of precision and recall. It is formally defined as:

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot P \cdot R}{\beta^{2} \cdot P + R} \quad (10)$$

where β is a non-negative parameter that controls the trade-off between Precision and Recall. Specifically, when β = 1, the measure reduces to the conventional F1-Score, which equally weights Precision and Recall. For β > 1, the metric places greater emphasis on Recall, making it suitable for tasks in which false negatives are more costly than false positives, such as medical diagnosis or safety-critical detection. Conversely, β < 1 emphasizes Precision, which is preferable in scenarios where false alarms must be minimized.

In this study, the F2-Score (β = 2) is adopted to prioritize recall, thereby penalizing missed detections more heavily than false alarms. This choice reflects the practical requirements of vehicle damage detection, where failing to identify a true damage instance such as a severe dent or structural failure may result in incomplete assessments, underestimated repair costs, or potential safety risks if left unaddressed.
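Equation (10) can be checked directly against the reported results; a minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall, Eq. (10)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# F2 for the full HL-YOLO configuration in Table 4 (P = 72.6, R = 56.7):
print(round(f_beta(72.6, 56.7), 1))  # 59.3, matching the table
```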
In summary, the evaluation indicators, including precision, recall, AP/mAP, and the Fβ-score, provide a multidimensional framework for assessing the accuracy, robustness, and applicability of HL-YOLO in real-world vehicle damage detection.
4. Results
4.1. Training and Evaluation Setup
To promote fairness and reproducibility in both training and evaluation, a comprehensive vehicle damage image dataset was constructed and randomly partitioned into training and validation sets at a ratio of roughly 8.5:1. The dataset comprised 3427 training images and 403 validation images, representing a diverse spectrum of real-world vehicle damage scenarios—including dents, scratches, and accident-induced deformations—to maximize the model’s generalization performance across varying damage types and severities.
All experiments were carried out in a high-performance cloud computing environment equipped with an NVIDIA A100 GPU with 40 GB of memory. The training and evaluation setup ensured sufficient computational capacity to support large-scale training and efficient inference. A detailed summary of the hardware and software configuration is provided in Table 1.
The hyperparameter settings adopted in this study are summarized in Table 2: 500 training epochs, a batch size of 64, 16 data-loading workers, an initial learning rate of 0.01, the SGD optimizer, and an input image size of 640 × 640 [30]. These values were selected to ensure stable convergence and balanced performance across accuracy and efficiency, and are reported in full to facilitate reproducibility and comparative studies. The Stochastic Gradient Descent (SGD) optimizer was chosen over adaptive methods because it often generalizes better and yields higher final accuracy in large-scale computer vision models such as YOLO, despite potentially slower initial convergence. To further stabilize training, a cosine annealing learning rate scheduler was employed to gradually decrease the learning rate, preventing oscillation near the minimum and promoting fine-tuning during later epochs. To evaluate detection performance under complex real-world conditions, the mAP at an IoU threshold of 0.5 was adopted as the primary evaluation metric, as it effectively reflects both detection accuracy and robustness across different types of vehicle damage. The selected configuration was designed to balance convergence speed, detection accuracy, and inference efficiency, thereby supporting the model’s practical applicability in intelligent vehicle damage detection and classification tasks.
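For readers reproducing the setup, the configuration in Tables 1 and 2 maps onto the standard Ultralytics training interface roughly as follows; the model and dataset YAML file names are hypothetical placeholders, not released artifacts.

```python
from ultralytics import YOLO

# Sketch of the training run described above; "hl-yolo.yaml" and
# "vehicle-damage.yaml" are assumed file names.
model = YOLO("hl-yolo.yaml")
model.train(
    data="vehicle-damage.yaml",  # 3 damage classes, 3427 train / 403 val images
    epochs=500,
    batch=64,
    workers=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,      # initial learning rate
    cos_lr=True,   # cosine annealing schedule
)
```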
4.2. Comparison with Mainstream Methods
To further validate the effectiveness of the proposed HL-YOLO, a comparative evaluation was conducted against representative state-of-the-art detection algorithms, including DETR, Faster R-CNN, YOLOv5n, YOLOv8n, and YOLO11n. The results are summarized in Table 3. The transformer-based DETR achieved a mean average precision at an IoU threshold of 0.5 (mAP50) of 63.8% and a recall of 60.7%; however, its relatively low precision of 63.0%, high computational cost of 36.8 GFLOPs, and large model size of 36.7 MB limit its suitability for real-time vehicle damage detection, with an inference speed of only 75.8 FPS. Faster R-CNN achieved a higher precision of 68.2% and mAP50 of 65.0%, yet its recall remained moderate at 60.0%, and the computational cost of 37.5 GFLOPs with a model size of 28.2 MB indicates limited efficiency for practical deployment, achieving 84.7 FPS.
YOLOv5n and YOLOv8n achieved notable computational efficiency, requiring only 7.1 GFLOPs and 8.1 GFLOPs, respectively, with compact model sizes of 2.5 MB and 3.0 MB and high inference speeds of 294.1 FPS and 277.8 FPS. However, their relatively low recall values of 53.7% and 54.5% limited their ability to capture diverse damage instances. YOLO11n improved precision to 70.8% while reducing complexity to 6.3 GFLOPs, maintaining a lightweight 2.6 MB size and an inference speed of 227.3 FPS, yet its recall remained at 53.6%.
By contrast, the proposed HL-YOLO achieved superior overall performance, with 72.6% precision, 56.7% recall, 64.3% mAP50, and 33.1% mAP50–95, while maintaining a compact model size of 2.7 MB, a computational cost of 7.4 GFLOPs, and a real-time inference speed of 188.7 FPS. Compared with YOLO11n, HL-YOLO increased precision by 1.8 percentage points, recall by 3.1 points, mAP50 by 2.4 points, and mAP50–95 by 1.0 points, with only a marginal rise in complexity and a modest reduction in inference speed. These results demonstrate that HL-YOLO achieves a favorable trade-off between detection accuracy, model compactness, and efficiency, outperforming existing models and offering a reliable solution for real-time vehicle damage detection and classification.
In addition to surpassing mainstream methods in precision, recall, and mean average precision, HL-YOLO also demonstrated stable and consistent convergence during training. As shown in Figure 7, the model maintained smooth optimization dynamics, with both training and validation losses steadily decreasing across 500 epochs. The corresponding precision, recall, and mAP curves exhibit clear upward trends and eventually stabilize, indicating effective feature learning and generalization. These results further validate the robustness of the proposed framework and its suitability for real-world deployment.
The box regression loss, classification loss, and distribution focal loss all demonstrated a steady downward trend, eventually reaching a low and stable level, which indicates effective optimization and reduced overfitting. Similarly, the validation losses showed consistent convergence patterns, further confirming the robustness of the training process. Precision and recall increased progressively with the number of epochs and stabilized after approximately 250 epochs, reflecting the model’s ability to balance false positives and false negatives effectively.
The mAP curves exhibited a similar trajectory, with continuous improvements followed by a plateau, reflecting reliable convergence and strong generalization. Together, these results indicate that the proposed framework is capable of learning discriminative features efficiently while maintaining stability across the training process, thereby reinforcing its suitability for practical vehicle damage detection.
Figure 8 presents qualitative detection results of HL-YOLO on a representative set of vehicle damage scenarios, encompassing accident-induced deformation, localized dents, and surface scratches. The first case depicts a vehicle following a frontal collision, resulting in extensive structural damage to the bumper and fender. Despite the irregular geometry and significant deformation, the model accurately localizes the damaged region, achieving a confidence score of 0.83 and demonstrating its capacity to effectively capture large-scale and complex accident patterns. The second example showcases the detection of a localized dent on the vehicle’s side panel, where challenging factors such as lighting variations, raindrop reflections, and complex metallic surface textures introduce considerable visual noise. Nevertheless, HL-YOLO precisely delineates the dent area with a high confidence score of 0.92, indicating robust performance in identifying subtle damage under adverse environmental conditions. The third image illustrates a long, irregular scratch on a black vehicle body, partially occluded by the dark background and wheel arch. Despite the low contrast and elongated shape—which pose a significant challenge for detection—the model successfully identifies the scratch with a confidence score of 0.93, confirming its capacity for fine-grained damage recognition. Collectively, these qualitative results demonstrate that HL-YOLO is capable of consistently detecting a diverse range of vehicle damage categories across varying severities, shapes, and background complexities. This robust performance reinforces its potential for practical deployment in intelligent transportation systems, automated vehicle inspection, and insurance claim assessment.
4.3. Ablation Experiment
To systematically assess the contribution of each proposed module, an ablation study was conducted using the vehicle damage detection dataset. The performance impacts of the LSKA, the C3K2-HetConv module, and the SIoU loss function were evaluated both independently and in combination, providing insights into their respective roles in the overall detection pipeline. The quantitative results are summarized in Table 4.
The introduction of LSKA as a standalone module resulted in a significant improvement in recall—increasing from 53.6% to 56.2%—indicating its capacity to effectively capture long-range dependencies and enhance object localization. However, a concurrent slight reduction in precision and mAP50 was observed. The C3K2-HetConv module demonstrably improved overall performance, increasing recall to 56.6% and mAP50–95 to 32.4%, which highlights its effectiveness in multi-scale feature extraction. The integration of SIoU primarily influenced localization accuracy, resulting in a slight decrease in precision but an improvement in bounding box regression alignment.
Regarding the combined effects of these modules, integrating LSKA and C3K2-HetConv yielded a higher precision of 71.6%, albeit at the expense of recall, which decreased to 52.4%. This suggests that while feature representation was enhanced, sensitivity to small and irregularly shaped damages was partially diminished. Similarly, the combination of LSKA with SIoU produced a more balanced trade-off but did not surpass the performance achieved by C3K2-HetConv alone. Notably, the concurrent integration of C3K2-HetConv and SIoU resulted in improvements in both recall and F2-score, demonstrating their complementary benefits for vehicle damage detection.
The final configuration was selected on the basis of the F2-score, which provides a balanced assessment that emphasizes recall and overall detection completeness. Model parameters were optimized by minimizing the standard YOLO composite loss function—which integrates bounding-box regression, confidence, and classification components—using stochastic gradient descent (SGD) to ensure stable convergence and consistent performance. Through this optimization and evaluation process, the best overall performance was achieved when all three modules were integrated simultaneously, yielding a precision of 72.6%, a recall of 56.7%, an mAP50 of 64.3%, and an mAP50–95 of 33.1%. This outcome confirms the synergistic and complementary effects among the proposed modules, enabling the HL-YOLO framework to deliver more accurate and robust detection across diverse damage scenarios.
The extended ablation experiments provide deeper insights into the individual and combined effects of each module on detection performance. The results indicate that introducing LSKA alone enhances recall with a relative gain of 4.9%. However, this improvement is accompanied by a marked reduction in precision (−9.2%) and a slight decline in localization metrics. A similar trend was observed for SIoU, where recall shows a modest increase of 0.9%, yet precision decreases by 8.2% and the comprehensive metric mAP50–95 drops by 4.0%. In contrast, the C3K2-HetConv module exhibits a more balanced outcome, yielding improvements in recall (5.6%) and in both mAP50 and mAP50–95 (0.8% and 0.9%, respectively) with only a negligible reduction in precision. These comparative trends are further illustrated in Figure 9, which provides a clear visualization of the relative performance variations across different module configurations.
The combined use of these modules reveals several performance trade-offs. Integrating LSKA with C3K2-HetConv yields a slight improvement in precision (1.1%) at the expense of recall, which decreases by 2.2%. Pairing C3K2-HetConv with SIoU provides a moderate relative gain in recall of 5.0%, whereas pairing LSKA with SIoU does not; in both cases the benefits are offset by reductions in precision and localization accuracy, reflecting the inherent compromise between detection sensitivity and stability.
The most significant finding emerges when all three modules are integrated. The combination of LSKA, C3K2-HetConv, and SIoU results in consistent improvements across all evaluation metrics: precision increases by 2.5%, recall by 5.8%, and mAP50 and mAP50–95 by 3.9% and 3.1%, respectively. This outcome confirms the complementary nature of the three modules, mitigating the weaknesses observed when each component is applied in isolation and leading to the most robust and stable detection performance.
As illustrated in Figure 10, the Grad-CAM–based attention visualizations reveal clear differences in the feature focus patterns across the baseline model, ablated variants, and the proposed HL-YOLO. The baseline YOLO11 exhibits dispersed and sometimes misplaced activation regions, particularly in areas with subtle deformation or complex illumination. Removing SIoU, HetConv, or LSKA further degrades attention stability—each ablated model tends to produce fragmented or incomplete focus maps, indicating weakened localization capability and reduced sensitivity to fine-grained structural details.
In contrast, HL-YOLO demonstrates more concentrated, coherent, and semantically aligned activation around the damaged regions, regardless of viewpoint, surface reflectance, or damage texture. The heatmaps show that HL-YOLO not only captures the primary impact areas but also attends to secondary deformation cues that are often overlooked by other configurations. This enhanced spatial consistency confirms that the joint integration of LSKA, C3K2-HetConv, and SIoU effectively strengthens contextual understanding and geometric awareness, enabling more reliable and robust detection in real-world damage scenarios.
5. Discussion
The results of the ablation experiments provide deeper insight into the roles and interactions of the proposed modules. When evaluated individually, both LSKA and SIoU contribute to increased recall, indicating enhanced sensitivity to the presence of damaged regions. However, this is accompanied by a reduction in precision and localization accuracy, suggesting that these modules, when applied in isolation, increase the propensity for false positive detections. In contrast, C3K2-HetConv achieves a more stable performance profile, producing consistent improvements in recall and in both mAP50 and mAP50–95, which underscores its effectiveness in enhancing feature representation and multi-scale information extraction.
The combined results further emphasize the complexity of module interactions. Pairing LSKA with C3K2-HetConv yields a moderate gain in precision, but at the expense of recall, while combinations involving SIoU improve recall, often at the cost of localization consistency. These observations indicate that the relationship among modules is not simply additive, and in certain instances, the limitations of individual components may be exacerbated rather than mitigated. This underscores the necessity of carefully designing integration strategies to avoid performance trade-offs.
The most promising results are obtained when all three modules—LSKA, C3K2-HetConv, and SIoU—are integrated simultaneously. Under this configuration, improvements are observed across precision, recall, mAP50, and mAP50–95, demonstrating that the modules are complementary when jointly applied. LSKA enhances global contextual perception, C3K2-HetConv strengthens local feature modeling, and SIoU refines geometric alignment in bounding box regression. The combined use of these elements addresses the shortcomings of each individual component, resulting in a detector that is both accurate and robust under challenging conditions.
In addition to the above analyses, the findings of this study align with and extend those reported in recent literature. Prior studies incorporating LSKA, such as YOLOv8-PD [31], have shown mAP50 improvements of approximately 0.9–2.2%, consistent with our results demonstrating LSKA’s ability to enhance recall through expanded contextual perception. Similarly, SIoU has been shown on the COCO benchmark to improve mAP50–95 by +2.4% and mAP50 by +3.6%, reflecting outcomes comparable to our observed recall gains, though with minor localization instability. Published results using C3K2-HetConv within lightweight YOLO architectures report stable AP improvement, matching our findings that C3K2-HetConv contributes steady gains across recall and mAP metrics. The proposed HL-YOLO achieves simultaneous improvements in precision, recall, and mAP with only a modest computational increase. These comparisons highlight the advantageous accuracy–efficiency balance of HL-YOLO and emphasize its competitiveness among state-of-the-art lightweight detection frameworks.
Overall, the findings demonstrate that the integration of complementary modules, when carefully orchestrated, can significantly improve detection accuracy and robustness. The study highlights the importance of understanding both the individual contributions and the interactive effects of model components, providing valuable insights for the design of advanced architectures in vehicle damage detection.
6. Conclusions
This study proposes an enhanced detection framework for vehicle damage integrating three complementary modules: LSKA, C3K2-HetConv, and SIoU. Through extensive experimentation on a carefully constructed dataset, the results demonstrate that each component contributes uniquely to performance improvement. With the integration of all three modules, the proposed HL-YOLO achieves consistent gains over the YOLO11n baseline, improving precision by 1.8 percentage points, recall by 3.1 points, mAP50 by 2.4 points, and mAP50–95 by 1.0 points. Specifically, LSKA enhances contextual representation, C3K2-HetConv improves multi-scale feature extraction with reduced redundancy, and SIoU refines bounding box regression with improved geometric alignment. Ablation studies reveal that the simultaneous integration of these modules yields the most consistent gains across precision, recall, and mAP metrics, indicating synergistic benefits beyond the sum of individual contributions. Moreover, Grad-CAM visualizations further validate these findings by showing that HL-YOLO produces more concentrated, coherent, and semantically aligned attention around damaged regions than both the baseline model and ablated variants. The enhanced spatial focus illustrates the strengthened contextual reasoning and localization stability achieved through the integrated module design, reinforcing the framework’s robustness in real-world vehicle damage scenarios.
The proposed framework not only enhances detection robustness under complex real-world scenarios but also provides insights into the design of efficient architectures for practical applications. Future research will focus on optimizing computational efficiency and exploring adaptive mechanisms that dynamically balance accuracy and real-time performance. From an application perspective, this method holds substantial potential in domains such as automated insurance claims processing, rapid accident detection for traffic safety management, and vehicle condition monitoring for intelligent transportation and autonomous driving systems. These prospects underscore the practical value and relevance of the proposed approach beyond research settings.
Conceptualization, W.L.; methodology, W.L.; software, L.H.; validation, W.L. and P.L.; formal analysis, H.X.; investigation, H.X.; resources, W.L.; data curation, W.L.; writing—original draft preparation, W.L.; writing—review and editing, P.L.; visualization, L.H.; supervision, H.X.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
The authors would like to thank the Roboflow platform user “infovision” and the Kaggle platform user “NasimEtemadi” for publicly sharing their datasets, from which certain data relevant to this research were utilized during the construction of our own dataset. Their contributions provided valuable resources that supported the development and validation of this study.
The authors declare that they have no known conflicts of interest or personal relationships that could have affected the work reported in this paper.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 Overall architecture of the proposed HL-YOLO model.
Figure 2 Architecture of the LSKA module.
Figure 3 Architecture of the C2PSA-LSKA module.
Figure 4 Architecture of the HetConv module.
Figure 5 Architecture of the C3k2-HetConv module.
Figure 6 Schematic of SIoU directional modeling.
Figure 7 Training and validation convergence curves of the proposed HL-YOLO model.
Figure 8 Results of the proposed HL-YOLO model under challenging scenarios.
Figure 9 Comparative training curves of YOLO variants across loss, recall, and mAP.
Figure 10 Grad-CAM Visualization for vehicle damage detection across different model configurations.
Table 1 Computational environment configuration.
| Environment Configuration | Value |
|---|---|
| Operating system | Ubuntu 20.04 (Cloud Environment) |
| CPU | Intel Xeon @ 2.20 GHz |
| GPU | NVIDIA A100 (NVIDIA A100-SXM4) |
| RAM | 40 GB |
| Development environment | Cloud-based Jupyter |
| Programming language | Python 3.12 |
Table 2 Hyperparameter settings.
| Hyperparameter | Value |
|---|---|
| Epochs | 500 |
| Batch size | 64 |
| Num workers | 16 |
| Initial learning rate | 0.01 |
| Optimizer | SGD |
| Input image size | 640 × 640 |
Table 3 Trade-off between accuracy and efficiency across detection models.
| Algorithm | Precision/% | Recall/% | mAP50/% | mAP50–95/% | GFLOPs | Model Size/MB | FPS |
|---|---|---|---|---|---|---|---|
| DETR | 63.0 | 60.7 | 63.8 | 33.0 | 36.8 | 36.7 | 75.8 |
| Faster R-CNN | 68.2 | 60.0 | 65.0 | 34.1 | 37.5 | 28.2 | 84.7 |
| YOLOv5n | 65.0 | 53.7 | 60.7 | 31.1 | 7.1 | 2.5 | 294.1 |
| YOLOv8n | 65.2 | 54.5 | 61.9 | 31.1 | 8.1 | 3.0 | 277.8 |
| YOLO11n | 70.8 | 53.6 | 61.9 | 32.1 | 6.3 | 2.6 | 227.3 |
| HL-YOLO | 72.6 | 56.7 | 64.3 | 33.1 | 7.4 | 2.7 | 188.7 |
Table 4 Ablation results of modules in HL-YOLO performance.
| LSKA | HetConv | SIoU | Precision | Recall | mAP50 | mAP50–95 | F2-Score |
|---|---|---|---|---|---|---|---|
| - | - | - | 70.8 | 53.6 | 61.9 | 32.1 | 56.3 |
| √ | - | - | 64.3 | 56.2 | 61.4 | 31.9 | 57.7 |
| - | √ | - | 70.5 | 56.6 | 62.4 | 32.4 | 58.9 |
| - | - | √ | 65.0 | 54.1 | 59.8 | 30.8 | 56.0 |
| √ | √ | - | 71.6 | 52.4 | 61.7 | 32.0 | 55.4 |
| √ | - | √ | 68.7 | 52.5 | 61.3 | 32.2 | 55.1 |
| - | √ | √ | 68.5 | 56.3 | 60.9 | 31.6 | 58.4 |
| √ | √ | √ | 72.6 | 56.7 | 64.3 | 33.1 | 59.3 |
1. Talaat, F.M.; ZainEldin, H. An improved fire detection approach based on YOLO-v8 for smart cities. Neural Comput. Appl.; 2023; 35, pp. 20939-20954. [DOI: https://dx.doi.org/10.1007/s00521-023-08809-1]
2. Bala, J.A.; Adeshina, S.A.; Aibinu, A.M. Performance Evaluation of You Only Look Once v4 in Road Anomaly Detection and Visual Simultaneous Localisation and Mapping for Autonomous Vehicles. World Electr. Veh. J.; 2023; 14, 265. [DOI: https://dx.doi.org/10.3390/wevj14090265]
3. Drliciak, M.; Cingel, M.; Celko, J.; Panikova, Z. Research on Vehicle Congestion Group Identification for Evaluation of Traffic Flow Parameters. Sustainability; 2024; 16, 1861. [DOI: https://dx.doi.org/10.3390/su16051861]
4. Li, Z. Mamba with split-based pyramidal convolution and Kolmogorov-Arnold network-channel-spatial attention for electroencephalogram classification. Front. Sens.; 2025; 6, pp. 2673-5067. [DOI: https://dx.doi.org/10.3389/fsens.2025.1548729]
5. Qu, S.; Yang, X.; Zhou, H.; Xie, Y. Improved YOLOv5-based for small traffic sign detection under complex weather. Sci. Rep.; 2023; 13, 16219. [DOI: https://dx.doi.org/10.1038/s41598-023-42753-3]
6. Yan, H.; Pan, S.; Zhang, S.; Wu, F.; Hao, M. Sustainable utilization of road assets concerning obscured traffic signs recognition. Proc. Inst. Civ. Eng. Eng. Sustain.; 2024; 178, pp. 124-134. [DOI: https://dx.doi.org/10.1680/jensu.24.00090]
7. Liang, R.; Jiang, M.; Li, S. YOLO-DPDG: A Dual-Pooling Dynamic Grouping Network for Small and Long-Distance Traffic Sign Detection. Appl. Sci.; 2025; 15, 10921. [DOI: https://dx.doi.org/10.3390/app152010921]
8. Laskar, R.H.; Seema, S.; Goel, S.; Bansal, A.; Ahmad, T. Artificial Intelligence-Based Vehicle Damage Detection: A Systematic Review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov.; 2023; 15, 70027. [DOI: https://dx.doi.org/10.1002/widm.70027]
9. Ni, Y.; Jin, Q.; Hu, R. A Novel Unsupervised Structural Damage Detection Method Based on TCN-GAT Autoencoder. Sensors; 2025; 25, 6724. [DOI: https://dx.doi.org/10.3390/s25216724] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/41228947]
10. Anand, A.; Kumar, S. A Deep Learning and Transfer Learning Approach for Vehicle Damage Detection. Procedia Comput. Sci.; 2021; 192, pp. 3763-3772. [DOI: https://dx.doi.org/10.32473/flairs.v34i1.128473]
11. Gu, Y.; Chen, L.; Su, T. Research on Small Object Detection in Degraded Visual Scenes: An Improved DRF-YOLO Algorithm Based on YOLOv11. World Electr. Veh. J.; 2025; 16, 591. [DOI: https://dx.doi.org/10.3390/wevj16110591]
12. Gálvez-Gutiérrez, A.I.; Afonso, F.; Martínez-Heredia, J.M. On the Usage of Deep Learning Techniques for Unmanned Aerial Vehicle-Based Citrus Crop Health Assessment. Remote Sens.; 2025; 17, 2253. [DOI: https://dx.doi.org/10.3390/rs17132253]
13. Sudhakar, R.; Joseph, M.; George, A. Automated Detection of Multi-Class Vehicle Exterior Damages Using Deep Learning. Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS); Madurai, India, 6–8 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1-6. [DOI: https://dx.doi.org/10.1109/ICECCME52200.2021.9590927]
14. Lee, D.; Lee, J.; Park, E. Automated vehicle damage classification using the three-quarter view car damage dataset and deep learning approaches. Heliyon; 2024; 10, e34016. [DOI: https://dx.doi.org/10.1016/j.heliyon.2024.e34016] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39104489]
15. Li, L.; He, Y.; Wei, Y.; Pu, H.; He, X.; Li, C.; Zhang, W. HSS-YOLO Lightweight Object Detection Model for Intelligent Inspection Robots in Power Distribution Rooms. Algorithms; 2025; 18, 495. [DOI: https://dx.doi.org/10.3390/a18080495]
16. Zhou, X.; Xu, M.; Pan, P. C5LS: An Enhanced YOLOv8-Based Model for Detecting Densely Distributed Small Insulators in Complex Railway Environments. Appl. Sci.; 2025; 15, 10694. [DOI: https://dx.doi.org/10.3390/app151910694]
17. Lv, D.; Meng, J.; Meng, G.; Shen, Y. Railway Fastener Defect Detection Model Based on Dual Attention and MobileNetv3. World Electr. Veh. J.; 2025; 16, 513. [DOI: https://dx.doi.org/10.3390/wevj16090513]
18. Chen, K.; Zhou, X.; Ren, J. DLF-YOLO: A Dynamic Synergy Attention-Guided Lightweight Framework for Few-Shot Clothing Trademark Defect Detection. Electronics; 2025; 14, 2113. [DOI: https://dx.doi.org/10.3390/electronics14112113]
19. Ju, Z.; Shui, J.; Huang, J. GLDS-YOLO: An Improved Lightweight Model for Small Object Detection in UAV Aerial Imagery. Electronics; 2025; 14, 3831. [DOI: https://dx.doi.org/10.3390/electronics14193831]
20. Lin, X.; Liao, D.; Du, Z.; Wen, B.; Wu, Z.; Tu, X. SDA-YOLO: An Object Detection Method for Peach Fruits in Complex Orchard Environments. Sensors; 2025; 25, 4457. [DOI: https://dx.doi.org/10.3390/s25144457]
21. Hussain, F.; Ali, Y.; Li, Y.; Haque, M.M. Revisiting the hybrid approach of anomaly detection and extreme value theory for estimating pedestrian crashes using traffic conflicts obtained from artificial intelligence-based video analytics. Accid. Anal. Prev.; 2024; 199, 107517. [DOI: https://dx.doi.org/10.1016/j.aap.2024.107517]
22. Zhu, T. Lightweight Heterogeneous Convolutional Neural Network for Trash Classification. Proceedings of the 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA); Changchun, China, 27–29 February 2024; pp. 224-227. [DOI: https://dx.doi.org/10.1109/EEBDA60612.2024.10485705]
23. Wang, L.; Jiang, F.; Zhu, F.; Ren, L. Enhanced Multi-Target Detection in Complex Traffic Using an Improved YOLOv8 with SE Attention, DCN_C2f, and SIoU. World Electr. Veh. J.; 2024; 15, 586. [DOI: https://dx.doi.org/10.3390/wevj15120586]
24. Wen, L.; Li, S.; Ren, J. Surface Defect Detection for Automated Tape Laying and Winding Based on Improved YOLOv5. Materials; 2023; 16, 5291. [DOI: https://dx.doi.org/10.3390/ma16155291] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37569994]
25. Yue, G.; Liu, Y.; Niu, T.; Liu, L.; An, L.; Wang, Z.; Duan, M. GLU-YOLOv8: An Improved Pest and Disease Target Detection Algorithm Based on YOLOv8. Forests; 2024; 15, 1486. [DOI: https://dx.doi.org/10.3390/f15091486]
26. Zhou, Q.; Zhang, D.; Liu, H.; He, Y. KCS-YOLO: An Improved Algorithm for Traffic Light Detection under Low Visibility Conditions. Machines; 2024; 12, 557. [DOI: https://dx.doi.org/10.3390/machines12080557]
27. Ashraf, I.; Hur, S.; Kim, G.; Park, Y. Analyzing Performance of YOLOx for Detecting Vehicles in Bad Weather Conditions. Sensors; 2024; 24, 522. [DOI: https://dx.doi.org/10.3390/s24020522]
28. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl.; 2023; 35, pp. 7853-7865. [DOI: https://dx.doi.org/10.1007/s00521-022-08077-5]
29. Zhao, S.; Gong, Z.; Zhao, D. Traffic signs and markings recognition based on lightweight convolutional neural network. Vis. Comput.; 2024; 40, pp. 559-570. [DOI: https://dx.doi.org/10.1007/s00371-023-02801-5]
30. Li, W.; Huang, L.; Lai, X. A Deep Learning Framework for Traffic Accident Detection Based on Improved YOLO11. Vehicles; 2025; 7, 81. [DOI: https://dx.doi.org/10.3390/vehicles7030081]
31. Zhao, J.; Zhang, C.; Cong, S.; Yu, Y.; Yue, X.; Shen, Y.; Hui, Y. Research on Road Defect Detection Based on Deep Learning. Eng. Lett.; 2025; 33, pp. 3311-3317.
© 2025 by the authors. Published by MDPI on behalf of the World Electric Vehicle Association. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).