Abstract

Conventional car damage assessments rely on visual judgments by human inspectors, which are labor-intensive and vulnerable to fraud through damage manipulation. Recent advances in artificial intelligence have given rise to a state-of-the-art object detection algorithm, You Only Look Once (YOLO), that sets a new standard for smart, automated damage assessment. This study proposes an enhanced YOLOv9 network tailored to detect six types of car damage. The enhancements include the convolutional block attention module (CBAM), applied to the backbone layer to improve the model's ability to focus on key damaged regions, and the SCYLLA-IoU (SIoU) loss function, introduced for bounding box regression. To assess damage severity comprehensively, we propose a novel formula, the damage severity index (DSI), for quantifying damage severity directly from images by integrating multiple factors: the number of detected damages, the ratio of damage to image size, the object detection confidence, and the type of damage. Experimental results on the CarDD dataset show that the proposed model outperforms state-of-the-art YOLO algorithms by 1.75% and that the proposed DSI provides an intuitive numerical assessment of damage severity, aiding repair decisions.

1. Introduction

In automotive insurance, it is paramount to expedite claims processing while ensuring accuracy and fairness. The complexities of assessing car body damage, compounded by the prevalence of false claims, have long been challenges for insurers [1,2]. Manual inspections are not only time-consuming but are also prone to inconsistencies due to varying levels of expertise among assessors. With the continuously increasing number of claims, especially due to the rise in vehicle ownership, there is an even greater need for effective and reliable automated systems for damage assessment.

The Internet of Things (IoT) has enabled significant advancements in automating and streamlining traditionally human-centered processes such as damage assessment in vehicle claims [3]. By combining the IoT with artificial intelligence (AI) tools such as data mining, machine learning, and deep learning, the industry can harness real-time data and build automated workflows that reduce dependence on manual inspections, addressing several outstanding challenges in automobile insurance [4,5]. Computer vision, the subfield of AI that allows machines to interpret and understand visual information, is among the most promising enablers of this automation. The IoT facilitates car damage assessment systems by connecting devices, data storage, and AI processing in a seamless network that enables real-time, automated damage analysis [6], as shown in Figure 1.

Convolutional Neural Networks (CNNs), introduced by Yann LeCun in the late 1980s and popularized in 2012 by Alex Krizhevsky through AlexNet [7], revolutionized computer vision by enabling pattern recognition in images; they excel at recognizing edges, shapes, and textures and are therefore well suited to image classification. One study on car body damage detection using CNN architectures was conducted by Sruthy et al. [8], which compared several architectures, including InceptionV3, Xception, VGG16, VGG19, ResNet50, and MobileNet, for classifying types of car damage. The MobileNet architecture gave the best performance, reaching 97% accuracy. However, image classification only indicates the presence of an object in an image without locating it. This becomes an obstacle when more than one type of damage appears in a single photo, because the CNN reports only the damage with the highest probability score.

To overcome this limitation, object detection methods were developed that not only classify objects but also determine their position in the image with a bounding box. Significant methods in this evolution are the Region-based Convolutional Neural Network (R-CNN) [9], Fast R-CNN [10], Faster R-CNN [11], and Mask R-CNN [12]. This family of methods processes images by generating many proposal regions that may contain objects. Widjojo et al. [13] integrated object segmentation techniques using the Mask R-CNN, EfficientNet, and MobileNet V2 architectures to detect and classify the type and severity of car damage; MobileNet V2 performed best, achieving an F1-score of 91.11%. However, the R-CNN family has a major disadvantage: it is slow and computationally expensive because it requires repeated CNN processing for each proposal region. The You Only Look Once (YOLO) algorithm, first proposed by Redmon et al. in 2016 [14], was developed to overcome this.

This paper is organized as follows: Section 2 provides an overview of the YOLO algorithm, problem identification, the convolutional block attention module, and the SCYLLA intersection over union (SIoU) loss for bounding box regression. Section 3 introduces the proposed improvement of the YOLOv9 architecture in detail, together with the dataset used in this study, the image augmentation techniques, and the proposed damage severity index (DSI). Indexes such as the proposed DSI are an effective way to aggregate and interpret deep learning model outputs: they condense complex data, such as bounding box confidence scores, damage localization, and severity levels, into a single actionable metric. This not only enhances the usability of the model but also aligns with industry requirements for efficient and accurate decision-making in car damage assessment. By incorporating the DSI, this study bridges technical outputs with practical applications, ensuring relevance and reliability in automated damage assessment systems. Section 4 presents the experimental results and analysis, Section 5 discusses the findings and limitations, and Section 6 concludes the paper with future research directions.

2. Related Works

2.1. Overview of YOLO Algorithm

You Only Look Once (YOLO) is an object detection algorithm proposed in 2016 by Redmon, Divvala, and Farhadi [14]. YOLO revolutionized object detection by introducing a single-stage approach that predicts object locations and classes in a single pass. YOLOv3 [15] added residual connections and multi-scale predictions, further improving the accuracy and performance of YOLOv2. YOLOv4 [16], developed by Alexey Bochkovskiy, introduced advancements such as CSPDarknet53 and PANet for feature aggregation, achieving state-of-the-art results with balanced accuracy and speed. YOLOv5, developed by Ultralytics, was highly optimized for practical deployment, focusing on ease of use, training efficiency, and real-world performance; this version uses the complete intersection over union (CIoU) loss proposed by Zheng et al. [17] for bounding box regression. YOLOv6 [18], proposed by Li et al., introduced reparameterization from RepVGG, aimed at efficient and lightweight detection. YOLOv7 [19] uses the Efficient Layer Aggregation Network (ELAN) to improve feature usage by aggregating features more effectively across layers. YOLOv9 [20] uses the Generalized ELAN (GELAN), which generalizes ELAN from stacked convolutional layers to an architecture that can use any computational block, improving efficiency, scalability, and adaptability across tasks.

Another improvement that can be explored is optimizing the loss function. In most YOLO versions, the loss function comprises two distinct components: binary cross entropy (BCE) for classification and complete intersection over union (CIoU) for bounding box regression. The BCE loss can be formulated as follows:

(1) $L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log p_{ij}$

where $N$ is the total number of bounding boxes, $C$ is the total number of classes, $y_{ij}$ is the $j$-th element of the one-hot encoded ground truth vector $y_i$ for the $i$-th bounding box, and $p_{ij}$ is the predicted probability of the $j$-th class for the $i$-th bounding box.
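To make Equation (1) concrete, the following is a minimal NumPy sketch with toy values; the arrays are illustrative and not taken from the CarDD pipeline.

```python
import numpy as np

def bce_classification_loss(y_true, y_pred, eps=1e-7):
    """Classification loss of Equation (1).

    y_true: (N, C) one-hot ground-truth matrix for N boxes and C classes.
    y_pred: (N, C) predicted class probabilities for the same boxes.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)          # avoid log(0)
    n = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred)) / n

# Toy example: two boxes, three damage classes.
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]], dtype=float)
print(bce_classification_loss(y_true, y_pred))        # approximately 0.29
```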

Second, let $B$ be the predicted bounding box and $B^{gt}$ the ground truth bounding box, with $w, h$ the width and height of the predicted box and $w^{gt}, h^{gt}$ the width and height of the ground truth box, respectively. In addition, $b$ and $b^{gt}$ are the center points of $B$ and $B^{gt}$, $\rho^2(b, b^{gt})$ is the squared distance between the center points of the predicted box $B$ and the ground truth box $B^{gt}$, and $c^2$ is the squared length of the diagonal of the smallest box that can fully contain both the predicted and ground truth boxes. The CIoU loss can be written as follows:

(2) $L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$, where $IoU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$, $\alpha = \frac{v}{(1 - IoU) + v}$, and $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$

However, the CIoU loss function has a major drawback. It does not consider the angle factor between the predicted and ground truth bounding boxes, leading to a slower convergence and poorer accuracy.
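As a worked illustration of Equation (2), the following is a minimal NumPy sketch that computes the CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2) corners; the sample boxes are hypothetical.

```python
import numpy as np

def ciou_loss(box, box_gt):
    """CIoU loss of Equation (2) for boxes given as (x1, y1, x2, y2)."""
    # Intersection and union for the IoU term.
    ix1, iy1 = max(box[0], box_gt[0]), max(box[1], box_gt[1])
    ix2, iy2 = min(box[2], box_gt[2]), min(box[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]
    iou = inter / (w * h + wg * hg - inter)

    # Squared center distance rho^2 and squared enclosing-box diagonal c^2.
    bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (box_gt[0] + box_gt[2]) / 2, (box_gt[1] + box_gt[3]) / 2
    rho2 = (bx - gx) ** 2 + (by - gy) ** 2
    cw = max(box[2], box_gt[2]) - min(box[0], box_gt[0])
    ch = max(box[3], box_gt[3]) - min(box[1], box_gt[1])
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term v and trade-off weight alpha.
    v = (4 / np.pi ** 2) * (np.arctan(wg / hg) - np.arctan(w / h)) ** 2
    alpha = v / (1 - iou + v + 1e-7)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((50, 50, 150, 150), (60, 55, 160, 165)))
```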

2.2. Problem Identification

Despite the strong performance and efficiency of the YOLO algorithm, its implementation for car body damage detection faces challenges, particularly in identifying subtle or obscured damages like scratches and in images captured under poor lighting conditions. There are three main types of object detection errors. The first is false negative, where the algorithm fails to detect damage in the given image despite its presence. The second is false positive, which is further divided into two subtypes: misclassification error and localization error. Misclassification occurs when the algorithm correctly identifies the damage location but mislabels its type. Localization error occurs when the algorithm accurately detects the type of damage but predicts the bounding box inaccurately.

Another problem in car damage assessment is that there is no standardized formula for assessing damage severity. Existing methods for car damage severity estimation use deep learning to classify severity into a few categories. A study by Shirode et al. [21] developed a system that detects the damage location and classifies severity into three categories, namely minor, moderate, and severe, using the VGG model. Similarly, Elroy, Sannidhan, and Balasubramani [22] developed a system that classifies dent severity into three classes, namely minor, major, and severe, using VGG, ResNet, and DenseNet models. While these approaches provide a general indication of damage level, they often lack the precision and detail required for comprehensive assessments.

Two approaches have been outlined for improving algorithm capabilities in computer vision tasks, especially object detection. The first is the integration of the convolutional block attention module (CBAM), proposed by Woo et al. in 2018 [23]. CBAM is a lightweight attention module that can be integrated directly into convolutional networks to improve feature representations relevant to the detection task. Integrating CBAM into the Faster R-CNN algorithm with a ResNet-50 backbone increased mAP from 46.2% to 48.2%, while Faster R-CNN with a ResNet-101 backbone improved from 48.4% to 50.5%. However, this approach had not previously been applied to the YOLO algorithm for car body damage detection.

On top of that, the CIoU loss does not account for the angle between the predicted bounding box and the ground truth. This can slow convergence during training, especially in cases where angle misalignment plays a significant role in optimizing the bounding box position. Therefore, we utilize a newer loss function, the SCYLLA intersection over union (SIoU), proposed by Gevorgyan in 2022 [24]. This function incorporates angle, shape, distance, and IoU factors in its calculation. Its implementation improved the performance of the Scylla-Net-S object detection algorithm, increasing the mean average precision (mAP) from 66.4% to 70%.

For damage severity assessment, we propose a damage severity index (DSI) to assess damage severity quantitatively. Our proposed DSI combines several indicators, such as the number of damages detected, the ratio of the damaged area to the total body surface, the object detection confidence, and the type of damage. This scoring system enables a more efficient and objective damage assessment, reducing the potential for subjective judgment and improving decision-making.

2.3. Convolutional Block Attention Module

The convolutional block attention module (CBAM) is a lightweight yet powerful attention mechanism that enhances the representational power of CNNs. Proposed by Woo et al. in 2018 [23], CBAM works by sequentially applying two types of attention, channel attention and spatial attention, as shown in Figure 2. This dual-attention strategy allows CBAM to refine feature maps effectively by focusing on the most informative parts of an input image.

Assume that $F$ is the original feature map. First, CBAM applies channel attention $M_c(F)$ to emphasize the most relevant feature channels, capturing relationships among the channels themselves. The module applies average pooling and max pooling followed by a shared multilayer perceptron (MLP) to obtain a channel attention map, as illustrated in Figure 3. The original feature map is multiplied by this attention map to highlight the key features at all spatial locations. This module can be written mathematically as in Equation (3):

(3) $M_c(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big), \quad F' = M_c(F) \otimes F$

where $\otimes$ is an element-wise multiplication operation, AvgPool is an average pooling layer, MaxPool is a max pooling layer, and $\sigma$ is the sigmoid activation function. These are combined as in Equation (3) to produce the feature map after channel attention, denoted $F'$.

After channel attention, CBAM applies spatial attention $M_s(F')$ to emphasize the most relevant spatial regions of the refined feature map. This attention map is generated by first applying average and max pooling across the channel dimension, and then passing the result through a convolutional layer and a sigmoid activation function, as illustrated in Figure 4. The result is a spatial attention map which, when multiplied with the feature map $F'$, highlights the spatial regions critical to the task. This process is formulated in Equation (4):

(4) $M_s(F') = \sigma\big(f^{7\times7}([AvgPool(F'); MaxPool(F')])\big), \quad F'' = M_s(F') \otimes F'$

where $\otimes$ is an element-wise multiplication operation, AvgPool is an average pooling layer, MaxPool is a max pooling layer, $f^{7\times7}$ denotes a convolution with a $7\times7$ kernel, and $\sigma$ is the sigmoid activation function. These are combined as in Equation (4) to produce the final feature map after spatial attention, denoted $F''$.
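As an illustration of Equations (3) and (4), the following is a minimal PyTorch sketch of a CBAM block; the reduction ratio of 16 and the 7×7 kernel follow the original CBAM paper and are assumptions here, not values taken from the proposed architecture.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sequential channel and spatial attention (Equations (3) and (4))."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP used for the channel attention map M_c.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution producing the spatial attention map M_s.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP on average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * mc                                   # F' = M_c(F) ⊗ F
        # Spatial attention: pool across channels, 7x7 conv, sigmoid.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.spatial(pooled))
        return x * ms                                # F'' = M_s(F') ⊗ F'

feat = torch.randn(1, 256, 40, 40)      # a hypothetical backbone feature map
print(CBAM(256)(feat).shape)            # torch.Size([1, 256, 40, 40])
```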

2.4. SCYLLA Intersection over Union

The CIoU loss function used in the YOLO algorithm has several drawbacks related to performance and convergence. Therefore, the loss function used in this study is the SCYLLA intersection over union (SIoU) loss, which comprises four distinct cost components: angle, distance, shape, and IoU cost. The purpose of the angle cost component is to reduce the complexity of determining the direction of movement relative to the distance. Essentially, the model first steers the prediction toward the x- or y-axis, whichever is closer, and then continues movement along that axis, as illustrated in Figure 5.

The formula for angle cost is given as follows:

(5) $\Lambda = 1 - 2\sin^2\!\left(\arcsin\!\left(\frac{c_h}{\sigma}\right) - \frac{\pi}{4}\right), \quad \sigma = \sqrt{(b_{c_x}^{gt} - b_{c_x})^2 + (b_{c_y}^{gt} - b_{c_y})^2}, \quad c_h = \max(b_{c_y}^{gt},\, b_{c_y}) - \min(b_{c_y}^{gt},\, b_{c_y})$

where $b_{c_x}^{gt}$, $b_{c_x}$, $b_{c_y}^{gt}$, and $b_{c_y}$ represent the x-coordinate of the ground truth bounding box center, the x-coordinate of the predicted bounding box center, the y-coordinate of the ground truth bounding box center, and the y-coordinate of the predicted bounding box center, respectively. The distance cost in the SIoU loss is designed with the angle cost in mind: when the angle cost approaches zero, the contribution of the distance loss decreases drastically. The motivation for introducing the distance cost is to pull the predicted bounding box close to the ground truth bounding box. The formula for the distance cost is given as follows:

(6) $\Delta = \sum_{t=x,y}\left(1 - e^{-(2 - \Lambda)\rho_t}\right), \quad \rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2$

where $c_w$ and $c_h$ denote the width and height of the smallest box enclosing both the ground truth and predicted bounding boxes, respectively. The shape cost is a component that handles aspect ratio mismatches between the ground truth bounding box and the predicted bounding box. Aspect ratio mismatches can degrade object detection performance, especially for objects with non-square shapes, as illustrated in Figure 6.

The formula for shape cost is given as follows:

(7) $\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \quad \omega_w = \frac{|w - w^{gt}|}{\max(w,\, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h,\, h^{gt})}$

where $w$, $w^{gt}$, $h$, and $h^{gt}$ represent the width of the predicted box, the width of the ground truth box, the height of the predicted box, and the height of the ground truth box, respectively. The final component of the SIoU loss is the IoU cost, defined as 1 minus the IoU between the predicted and ground truth bounding boxes; subtracting the IoU from 1 penalizes the non-overlapping parts of the prediction. The final SIoU formula is as follows:

(8) $L_{IoU} = 1 - IoU, \quad L_{SIoU} = L_{IoU} + \frac{\Delta + \Omega}{2}$
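The four cost terms combine as in Equation (8). Below is a minimal NumPy sketch for boxes given as (cx, cy, w, h); the shape-cost exponent θ = 4 and the use of the enclosing-box width and height for $c_w$ and $c_h$ follow the SIoU paper [24] and are assumptions rather than values stated above.

```python
import numpy as np

def siou_loss(box, box_gt, theta=4.0):
    """SIoU loss (Equations (5)-(8)); boxes given as (cx, cy, w, h)."""
    cx, cy, w, h = box
    gx, gy, gw, gh = box_gt

    # IoU cost (Equation (8), first part).
    ix1, iy1 = max(cx - w / 2, gx - gw / 2), max(cy - h / 2, gy - gh / 2)
    ix2, iy2 = min(cx + w / 2, gx + gw / 2), min(cy + h / 2, gy + gh / 2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (w * h + gw * gh - inter)

    # Angle cost (Equation (5)).
    sigma = np.hypot(gx - cx, gy - cy) + 1e-7
    c_h = max(gy, cy) - min(gy, cy)
    lam = 1 - 2 * np.sin(np.arcsin(min(c_h / sigma, 1.0)) - np.pi / 4) ** 2

    # Distance cost (Equation (6)), normalised by the enclosing box size.
    cw_box = max(cx + w / 2, gx + gw / 2) - min(cx - w / 2, gx - gw / 2)
    ch_box = max(cy + h / 2, gy + gh / 2) - min(cy - h / 2, gy - gh / 2)
    rho_x, rho_y = ((gx - cx) / cw_box) ** 2, ((gy - cy) / ch_box) ** 2
    delta = sum(1 - np.exp(-(2 - lam) * rho) for rho in (rho_x, rho_y))

    # Shape cost (Equation (7)).
    omega_w = abs(w - gw) / max(w, gw)
    omega_h = abs(h - gh) / max(h, gh)
    omega = sum((1 - np.exp(-o)) ** theta for o in (omega_w, omega_h))

    return (1 - iou) + (delta + omega) / 2

print(siou_loss((100, 100, 80, 60), (110, 95, 90, 70)))
```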

3. Material and Methodology

3.1. Proposed Architecture

This research applies enhancements of YOLOv9 tailored to accurately identify and classify car damage. The proposed architecture contains backbone, neck, auxiliary, and head layers. To enhance the performance, we integrate the CBAM module in the neck and auxiliary layers. This integration refines attention before altering feature map size, retaining important features when merging information. The full architecture of the proposed YOLO enhancement can be seen in Figure 7.

3.1.1. Backbone Layer

The backbone of the proposed YOLO contains two main components: the Repeated Normalized Cross Stage Partial with Efficient Layer Aggregation Network (RepNCSPELAN4) block and the asymmetric downsampling (Adown) block. RepNCSPELAN4 is designed to effectively capture both local and global features at various scales and resolutions by combining the RepNCSP block with the Efficient Layer Aggregation Network (ELAN), as shown in Figure 8.

The RepNCSP is a block containing a series of repeated bottleneck layers, convolutional layers, activation functions, and normalizations combined with the cross stage partial (CSP) that splits the input into two parts. The first part undergoes the RepNBottleneck operations, while the other part is directly concatenated with the output.

The Adown module is a component of YOLOv9 used to reduce the spatial resolution of the features. This block contains average pooling, max pooling, and convolutional layers, aiming to maintain spatial information while reducing the size of the feature map, as shown in Figure 9.
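For illustration, the following is a rough PyTorch sketch of a downsampling block consistent with this description (average pooling, max pooling, and convolutions applied to two channel halves before concatenation); the channel split, kernel sizes, and activation are assumptions and may differ from the official YOLOv9 Adown implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(c_in, c_out, k, s, p):
    """Convolution + batch norm + SiLU, a common YOLO building block."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class ADownSketch(nn.Module):
    """Hypothetical downsampling block in the spirit of the Adown description."""

    def __init__(self, c_in, c_out):
        super().__init__()
        half = c_out // 2
        self.branch_conv = conv_bn_act(c_in // 2, half, 3, 2, 1)  # stride-2 conv branch
        self.branch_pool = conv_bn_act(c_in // 2, half, 1, 1, 0)  # branch after max pooling

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)   # smooth features first
        x1, x2 = x.chunk(2, dim=1)                                # split channels in half
        x1 = self.branch_conv(x1)                                 # half 1: 3x3 stride-2 conv
        x2 = self.branch_pool(F.max_pool2d(x2, 3, stride=2, padding=1))  # half 2: max pool + 1x1 conv
        return torch.cat((x1, x2), dim=1)                         # feature map at half resolution

print(ADownSketch(256, 256)(torch.randn(1, 256, 80, 80)).shape)   # -> (1, 256, 40, 40)
```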

3.1.2. Neck and Auxiliary Layers

In the proposed YOLO enhancement, the neck layer aggregates and fuses feature maps extracted by the backbone layer at various scales to enhance multi-scale detection capabilities and handle complex scenes. The neck layer used in this study integrates the PANet module with RepNCSPELAN4 and the Adown module to enhance feature representation.

The purpose of the multi-level auxiliary branch is to enhance learning by adding an integration network between the feature pyramid layers used for auxiliary supervision and the main branch of the network. It integrates the gradient information coming from the various prediction heads, each representing a different object category or level of detail. This is achieved by summing the gradients from multiple levels and feeding them into the main branch, so that the main branch is updated with information from all target objects rather than being dominated by specific ones.

3.2. Dataset

The dataset used in this study is CarDD, published by Wang, Li, and Wu in 2022 [25]. The dataset contains 4000 images with 9000 annotations covering 6 types of damage, including dent, scratch, crack, shattered glass, broken lamp, and flat tire. Figure 10 details the characteristics of the dataset.

Figure 10a shows a class imbalance in the dataset, with dent and scratch dominating, indicating the need for image augmentation to handle biased predictions. Figure 10b,c illustrate that the damages tend to be centrally located, as indicated by the denser concentration of bounding boxes in Figure 10b and the darker regions in the heatmap in Figure 10c. Figure 10d shows that the damages vary in size, but most are small relative to the image, as indicated by the darker regions in the heatmap. The dataset is divided into 2816 training images, 810 validation images, and 374 testing images.

For a more in-depth analysis, Figure 11 provides a pair plot of bounding boxes in the dataset. This pair plot visualizes the relationships between the variables x, y, width, and height in the dataset for object detection. The darker regions indicate higher concentration of data points. The variables x and y represent the position of the center of bounding boxes, while width and height represent their dimensions. The diagonal histograms show the distribution of each variable individually; for instance, if the x histogram peaks around the center, it suggests that bounding boxes are often centrally located in the images. The off-diagonal plots reveal pairwise relationships, such as the concentration of bounding boxes in certain areas (x vs. y) or correlations between position and size (x or y with width or height). The width vs. height plot shows typical size ratios, which could be useful for identifying common bounding box shapes. Overall, this visualization helps assess where objects are located and their typical sizes, both of which are important for optimizing an object detection model.
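A pair plot like Figure 11 can be produced with a few lines of seaborn; the DataFrame below is a hypothetical stand-in for values that would, in practice, be parsed from the CarDD annotation files.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical bounding boxes (normalised center x, center y, width, height).
boxes = pd.DataFrame({
    "x": [0.48, 0.52, 0.31, 0.66],
    "y": [0.55, 0.47, 0.62, 0.50],
    "width": [0.20, 0.35, 0.10, 0.25],
    "height": [0.15, 0.28, 0.08, 0.22],
})

# Pair plot: histograms on the diagonal, pairwise relationships off the diagonal.
sns.pairplot(boxes)
plt.savefig("bbox_pairplot.png", dpi=150)
```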

3.3. Color Space Transformation

Color transformation, including changing hue, saturation, and brightness, is a fundamental approach to enhance the generalization capability of deep learning models in object detection [26,27,28]. It helps the model generalize to various lighting and environmental conditions. Given that R, G, and B are the pixel values at a given coordinate in the RGB color space, this study uses the integer-based conversion algorithm between RGB and HSV proposed by Chernov, Alander, and Bochko [29], as shown in Algorithm 1.

Algorithm 1: Image Conversion from RGB to HSV
Input: R, G, and B values
Output: H, S, and V values
1: Find the maximum (M), minimum (m), and middle (c) of R, G, B
2: Assign V = M
3: Calculate delta (d) as d = M − m
4: if d = 0 then
5:   S = 0
6: else
7:   I = 0 if M = R and m = B; 1 if M = G and m = B; 2 if M = G and m = R; 3 if M = B and m = R; 4 if M = B and m = G; 5 if M = R and m = G
8:   S = (d · 2^16 − 1) / V
9:   F = ((c − m) · 2^16) / d + 1
10:  if I = 1 or I = 3 or I = 5 then
11:    F = E − F
12:  end if
13:  H = E · I + F, where E = 2^16 is the integer hue scale per sector used in [29]
14: end if
15: return H, S, V

3.4. Image Augmentation

Image augmentation in this study involves four techniques to enrich the training data: hue, saturation, brightness, and mosaic augmentation. Hue augmentation randomly changes the color of the input images by shifting the H value according to the following formula:

(9) $H' = (H + \Delta H) \bmod 360$

where $H'$ is the adjusted hue value, $H$ is the original hue value, and $\Delta H$ is a random value; in this study, $\Delta H$ is drawn from the interval $[-90°, 90°]$. Although hue augmentation can produce unnatural-looking images, it is effective for training deep learning models [30,31].

Saturation augmentation, like hue augmentation, adjusts the image’s vibrancy by shifting color using the following formula:

(10) $S' = S \times \alpha$

where $S'$ is the adjusted saturation, $S$ is the original saturation, and $\alpha$ is a random scaling factor, which in this study is drawn from the interval $[0, 0.5]$. This method makes the object detection model more robust to different lighting conditions and camera settings. The following formula is used to adjust the brightness of an RGB image:

(11) $I' = \begin{cases} 255, & I + \Delta B \geq 255 \\ 0, & I + \Delta B \leq 0 \\ I + \Delta B, & \text{otherwise} \end{cases}$

where $I'$ is the adjusted pixel brightness value for each channel, $I$ is the original pixel brightness value, and $\Delta B$ is a random value; in this study, $\Delta B$ is drawn from the interval $[-20, 20]$. Studies [32,33] have shown that brightness augmentation significantly increases model performance. The augmentation techniques used in this study are shown in Figure 12.
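For illustration, the following sketch applies the three color-space augmentations of Equations (9)-(11) with OpenCV; note that OpenCV stores hue on a 0-179 scale, so the degree shift is halved, and the sampling intervals simply follow the text above. The input image path is hypothetical.

```python
import cv2
import numpy as np

def augment_colors(image_bgr, delta_h, alpha, delta_b):
    """Apply the hue, saturation, and brightness shifts of Equations (9)-(11)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h, s, v = cv2.split(hsv)
    h = (h + delta_h / 2.0) % 180.0            # Equation (9); OpenCV hue = degrees / 2
    s = np.clip(s * alpha, 0, 255)             # Equation (10): S' = S * alpha
    hsv = cv2.merge([h, s, v]).astype(np.uint8)
    out = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR).astype(np.int16)
    out = np.clip(out + int(delta_b), 0, 255)  # Equation (11): clipped brightness shift
    return out.astype(np.uint8)

img = cv2.imread("damaged_car.jpg")            # hypothetical input image
aug = augment_colors(img,
                     delta_h=np.random.uniform(-90, 90),
                     alpha=np.random.uniform(0.0, 0.5),
                     delta_b=np.random.uniform(-20, 20))
```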

The last technique used in this study is mosaic augmentation. The idea is to combine four random images from the training dataset into a single larger image while retaining the relative scale of the objects, as shown in Figure 13. This helps the model handle occlusion and translation more effectively and exposes it to combinations of classes that would not normally appear together.
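A simplified sketch of mosaic augmentation is shown below; it only assembles the image canvas, whereas a full implementation would also remap the bounding box coordinates of each tile. The image paths are hypothetical.

```python
import random
import cv2
import numpy as np

def mosaic(images, out_size=640):
    """Combine four training images into one 2x2 mosaic canvas.

    Each image fills one fixed quadrant, a simplification of variants that
    jitter the mosaic center point; boxes would be shifted and scaled the same way.
    """
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background
    corners = [(0, 0), (0, half), (half, 0), (half, half)]          # top-left of each quadrant
    for img, (ty, tx) in zip(images, corners):
        canvas[ty:ty + half, tx:tx + half] = cv2.resize(img, (half, half))
    return canvas

paths = random.sample(["a.jpg", "b.jpg", "c.jpg", "d.jpg"], 4)  # hypothetical training images
combined = mosaic([cv2.imread(p) for p in paths])
```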

3.5. Damage Severity Index

In this study, we propose a novel scoring system, the damage severity index (DSI), whose purpose is to give a standardized score to the detected damage. Given $N$ detected damages, each detection $i$ has a damage type, the width ($w_{d_i}$) and height ($h_{d_i}$) of its bounding box, and a prediction confidence $\hat{C}_i$, while $w_c$ and $h_c$ are the width and height of the image. The DSI algorithm is presented in Algorithm 2.

Algorithm 2: Damage Severity Index
Input: $N$, $w_c$, $h_c$, and $\{(i, w_{d_i}, h_{d_i}, \hat{C}_i)\}_{i=1}^{N}$
Output: severity score DSI
1: for i = 1 to N do
2:   $A_i = \frac{w_{d_i} \times h_{d_i}}{w_c \times h_c} \times 100$
3:   $D_i$ = 100 if i = crack; 85 if i = shattered glass; 70 if i = dent; 55 if i = scratch; 40 if i = broken lamp; 25 if i = flat tire
4:   $S_i = 0.4 \times A_i + 0.2 \times \hat{C}_i + 0.4 \times D_i$
5: end for
6: $DSI = \frac{1}{N}\sum_{i=1}^{N} S_i$
7: return DSI
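For clarity, the following is a direct Python transcription of Algorithm 2; the detection dictionary format is illustrative, and the confidence is rescaled to a 0-100 range so that the three weighted terms share a comparable scale, which is an assumption not stated in the algorithm.

```python
def damage_severity_index(detections, image_w, image_h):
    """Damage severity index of Algorithm 2.

    detections: list of dicts with keys 'type', 'w', 'h' (bounding box size in pixels)
    and 'conf' (detection confidence in [0, 1], rescaled to 0-100 below).
    """
    type_weight = {"crack": 100, "shattered glass": 85, "dent": 70,
                   "scratch": 55, "broken lamp": 40, "flat tire": 25}
    if not detections:
        return 0.0
    scores = []
    for det in detections:
        area_ratio = det["w"] * det["h"] / (image_w * image_h) * 100  # A_i
        conf = det["conf"] * 100                                      # C_i rescaled
        weight = type_weight[det["type"]]                             # D_i
        scores.append(0.4 * area_ratio + 0.2 * conf + 0.4 * weight)   # S_i
    return sum(scores) / len(scores)                                  # mean over detections

dets = [{"type": "dent", "w": 200, "h": 150, "conf": 0.91},
        {"type": "scratch", "w": 320, "h": 60, "conf": 0.78}]
print(round(damage_severity_index(dets, 1280, 720), 2))
```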

The intuition behind this scoring is based on the number of damages, the damage type, and the ratio of the damaged area to the image area. Larger damages, and those that compromise the structural integrity of the car, such as cracks and shattered glass, contribute to a higher score.

The proposed scoring system offers several advantages. First, it provides a standardized method to evaluate the severity of damages across multiple cases, which is beneficial for insurance claims, repair prioritization, and cost estimation. Second, it allows for scalability, as new types of damage can be easily incorporated into the scoring system with appropriately weighted values.

3.6. Metrics Evaluation

In assessing our car body damage detection system, we placed significant emphasis on crucial performance measures to gauge its effectiveness and accuracy in identifying and categorizing damages. Precision and recall will be used in this study.

(12) $\text{Precision} = \frac{TP}{TP + FP}$

(13) $\text{Recall} = \frac{TP}{TP + FN}$

However, for object detection tasks, average precision (AP) and mean average precision (mAP) are the most commonly used metrics. AP is the area under the precision-recall curve, computed as a summation:

(14) $AP = \sum_{n} \left(R_{n+1} - R_n\right) P_{interp}(R_{n+1})$

where $n$ indexes the threshold levels at which precision and recall are calculated, $R_n$ is the recall value at the $n$-th threshold, and $P_{interp}(R_{n+1})$ is the interpolated precision at recall $R_{n+1}$.

Mean average precision (mAP) is a metric used to evaluate the performance of object detection models across multiple classes. It provides a single scalar value that summarizes the model’s ability to accurately detect objects of different classes in an image. The formula for mAP involves calculating the AP for each individual class and then computing the average of these AP scores. The formula for mAP is as follows:

(15) $mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$

where $N$ is the total number of classes and $AP_i$ is the AP for class $i$, with $i$ ranging from 1 to $N$. [email protected] is a specific instance of the mAP calculation in which the IoU threshold is set to 0.5, while [email protected]:0.95 is calculated at multiple IoU thresholds, in steps of 5% from 50% to 95%, and then averaged over these thresholds.
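A minimal NumPy sketch of Equations (14) and (15) is given below; the precision-recall points and per-class AP values are toy numbers for illustration only.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve with interpolated precision (Equation (14))."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Interpolate: each precision becomes the maximum precision to its right.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum (R_{n+1} - R_n) * P_interp(R_{n+1}) over the curve.
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    """Equation (15): the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Hypothetical PR points for one class and per-class APs for six damage classes.
ap = average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.75, 0.6]))
print(ap, mean_average_precision([0.73, 0.68, 0.55, 0.81, 0.77, 0.44]))
```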

4. Experimental Results

In this study, we propose an enhanced YOLOv9 with CBAM and SIoU integration, referred to as YOLOv9-CS. We also leverage advanced hardware, specifically an RTX A4000 GPU, to boost processing speed during training and inference of our damage detection algorithms. All models are trained for 200 epochs with a batch size of 16 to ensure a consistent comparison.

Our proposed enhancement shows the decreasing trend in box loss and classification loss alongside the increasing trends in precision, recall, and mAP metrics, as shown in Figure 14, indicating that the model is learning effectively. The box loss, measuring errors in bounding box predictions, starts around 4.5 and steadily decreases to approximately 1.2 by the end of training, indicating that the model is effectively learning to localize objects. Similarly, the classification loss drops from 2.8 to 0.8, showing improved accuracy in predicting object classes. Both loss curves are smooth, suggesting stable training with no significant signs of overfitting or instability.

The precision metric begins at 20%, but rapidly improves within the first 25 epochs, stabilizing around 78% after 50 epochs. This highlights the model’s ability to minimize false positives efficiently. The recall, which measures the proportion of correctly detected objects, rises gradually from 20% to 69%, indicating steady progress in detecting true positives. The difference between precision and recall suggests that while the model excels at avoiding false positives, some true positives may still be missed.

The mean Average Precision (mAP) metrics further emphasize the model’s performance. The [email protected], which evaluates detection performance with a loose overlap criterion, rises quickly and plateaus at 73%, demonstrating that the model reliably identifies and localizes objects. The stricter [email protected]:0.95, which averages over tighter IoU thresholds, increases more gradually and stabilizes at 58%, reflecting the model’s capability to make precise detections under challenging conditions. The lower value of this metric compared to [email protected] is typical and highlights the difficulty of achieving consistent high overlap across predictions. Overall, the model shows significant improvement and convergence by the end of the training, with all losses decreasing and metrics stabilizing. The relatively high precision (78%) is promising for applications where minimizing false positives is crucial, such as IoT-based real-time damage assessment systems.

The comparison of YOLOv9-CS with and without data augmentation highlights the impact of augmentation techniques on performance metrics over training epochs. Figure 15 illustrates the progression of precision, recall, [email protected], and [email protected]:0.95 across 200 epochs for both configurations. In terms of precision, YOLOv9-CS trained on augmented data performs better, reaching 78%, which is 6% higher than the 72% obtained without augmentation. In terms of recall, the augmented model reaches 66.2%, which is 3.76% higher than the 63.8% obtained without augmentation. For [email protected], the augmented model reaches 70.7%, which is 6.32% higher than the 66.5% obtained without augmentation. For [email protected]:0.95, the augmented model again performs better, reaching 56.8%, which is 8.19% higher than the 52.5% obtained without augmentation. Comparing the two setups, we conclude that data augmentation influences the learning process and contributes to higher performance and robustness in detection tasks using YOLOv9-CS.

Figure 16 provides a confusion matrix overview of the proposed YOLO enhancement. The model is highly effective at detecting shattered glass, flat tires, and broken lamps, with correct classification rates of 98%, 94%, and 83%, respectively, measured on the validation data. However, YOLOv9-CS still struggles with more ambiguous or subtle classes such as dent, scratch, and crack, as indicated by lower correct prediction rates, since these are often confused with the background. Crack has the lowest rate, with only 44% of instances correctly classified.

To further evaluate the performance of our YOLOv9-CS model, we compared it against the baseline YOLOv9 model on an additional dataset sourced from Kaggle [34,35] containing 767 images of damaged cars. This dataset provides diverse and challenging scenarios, ensuring a robust assessment of the models' capabilities. The comparison focused on precision, recall, [email protected], and [email protected]:0.95. YOLOv9-CS consistently outperformed YOLOv9 in all metrics: it achieved precision and recall of 97.2% and 78.8%, respectively, compared to 94.3% and 76.7% for YOLOv9, indicating fewer false positives and fewer missed detections. The [email protected] for YOLOv9-CS was 86.2% compared to 85.1% for YOLOv9, while the [email protected]:0.95 was 64.2% compared to 64.1%, highlighting the effectiveness of our model across varying levels of detection strictness. These findings underscore the robustness and adaptability of YOLOv9-CS, making it a promising choice for real-world object detection tasks.

To provide an in-depth analysis, we conducted a comparative study with several state-of-the-art models, namely YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, YOLOv7+CBAM [36], YOLOv8, and YOLOv9, across four key metrics: precision, recall, [email protected], and [email protected]:0.95. The results are shown in Figure 17. In terms of precision, the models show consistently high scores, ranging from 75% to 78%, with YOLOv9-CS achieving the highest value of 78%, indicating its ability to minimize false positives more effectively than the baseline models. For recall, YOLOv9-CS and YOLOv7 perform best, achieving scores of 70%, while YOLOv4 lags with a recall of 66%, suggesting it misses more true positives than the newer models and enhancements.

In the [email protected] metric, which evaluates detection accuracy with a loose overlap threshold, YOLOv9-CS again outperforms other models with a score of 73%, demonstrating the consistent benefit of the proposed enhancements. Similarly, for [email protected]:0.95, which measures detection quality across multiple IoU thresholds and is more challenging, YOLOv9-CS achieves the highest value of 58%, significantly outperforming all YOLO versions compared in this study. The improvements in this stricter metric highlight the robustness of the proposed modifications, particularly the integration of CBAM (Convolutional Block Attention Module) and SIoU (SCYLLA Intersection over Union), in enhancing the model’s detection capabilities under more demanding conditions.

To enable seamless communication between the web application and the detection model, an API (Application Programming Interface) is implemented as the bridge between the frontend and the backend. The API handles image uploads from the user interface, processes the input image using the deployed YOLOv9-CS model, and computes the Damage Severity Index (DSI) score. It ensures efficient data transmission, handles model inference requests, and returns the results to the web application in real time. The API is designed to be lightweight, scalable, and secure, allowing smooth integration of the detection model into the application while maintaining fast response times. This architecture facilitates the deployment process and ensures that the system can handle multiple user requests concurrently, making the car damage assessment system both reliable and user-friendly.
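As an illustration of this pipeline, the following is a minimal FastAPI sketch of the upload-detect-score flow; the endpoint name, weight file, Ultralytics-based inference call, and class-weight table are illustrative assumptions rather than the authors' deployed code.

```python
# Minimal sketch of the web API described above (assumed names and paths).
import io

import numpy as np
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from ultralytics import YOLO

app = FastAPI()
model = YOLO("yolov9_cs_cardd.pt")   # hypothetical trained YOLOv9-CS weights

CLASS_WEIGHT = {"crack": 100, "shattered_glass": 85, "dent": 70,
                "scratch": 55, "broken_lamp": 40, "flat_tire": 25}

@app.post("/assess")
async def assess_damage(file: UploadFile = File(...)):
    # Read the uploaded image and run single-image inference.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    result = model.predict(np.array(image))[0]
    w_img, h_img = image.size
    detections, scores = [], []
    for box, conf, cls in zip(result.boxes.xywh.tolist(),
                              result.boxes.conf.tolist(),
                              result.boxes.cls.tolist()):
        name = result.names[int(cls)]
        area_ratio = box[2] * box[3] / (w_img * h_img) * 100       # A_i of Algorithm 2
        scores.append(0.4 * area_ratio + 0.2 * conf * 100
                      + 0.4 * CLASS_WEIGHT.get(name, 50))           # S_i (default weight assumed)
        detections.append({"type": name, "confidence": round(conf, 3)})
    dsi = sum(scores) / len(scores) if scores else 0.0
    return {"detections": detections, "dsi": round(dsi, 2)}
```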

The model was then deployed in a simple web application, as shown in Figure 18. The application asks for an input image of a damaged car. When the "Upload and Process" button is pressed, the image is sent to the server, where the detection model and the DSI scoring algorithm are run. Once processed, the server returns an annotated image displaying the detected damage with bounding boxes and confidence scores, together with a quantitative DSI score. This application streamlines car damage assessment by providing a visual representation of detected damages and a severity score, offering users a clear and objective understanding of the damage extent.

5. Discussion

In this study, we proposed enhancements to the YOLOv9 algorithm through convolutional block attention module (CBAM) integration and the use of SCYLLA intersection over union (SIoU) for box regression. For damage severity assessment, we proposed a numerical damage severity index (DSI) based on the damage area, the number of damages, and the prediction confidence. These advancements aim to deliver a robust car damage detection system, enabling accurate and objective assessments. Additionally, by integrating these techniques with IoT frameworks, the system can facilitate real-time, scalable, and automated damage evaluations, making it highly applicable for modern insurance and repair solutions.

Overall, the results demonstrate that the proposed YOLOv9-CS model consistently outperforms its counterparts across all metrics, emphasizing the value of incorporating advanced attention mechanisms and improved box loss calculations. From Figure 14, it can be seen that while the models generally exhibit a trade-off between precision and recall, with higher precision being prioritized, the improvements in mAP metrics, particularly [email protected]:0.95, suggest a notable advancement in achieving reliable object detections, outperforming all YOLO versions. These findings underscore the importance of the proposed enhancements for applications requiring high precision and robust localization, such as automated car damage assessment in IoT-driven systems.

To evaluate the effectiveness of the proposed model, Figure 19 compares inference using the proposed model and the baseline YOLOv9 on three images from the test set. Our proposed model improves on the baseline YOLOv9 in terms of detection accuracy. In the first row, a dent and a crack are detected by our proposed model but missed by YOLOv9. In the second row, both models produce false positive predictions, detecting two damages that do not exist in the ground truth; however, our proposed model assigns higher confidence to the damages that do exist and lower scores to the false positives. In the third row, both models produce predictions close to the ground truth, but our model assigns higher confidence scores to the damages. This shows that our improvements yield better predictions than the baseline YOLOv9.

Indexes, such as the proposed Damage Severity Index (DSI), play a critical role in simplifying the interpretation of deep learning model outputs for practical applications. By aggregating multiple aspects of car damage—such as severity, type, and location—into a single metric, the DSI bridges the gap between technical outputs and real-world usability. This is particularly important in automotive insurance, where quick and accurate decisions are essential for claims processing.

The use of indexes enhances transparency and standardization in damage assessment, allowing stakeholders such as insurers and repair shops to rely on consistent metrics rather than subjective interpretations of raw model outputs. Furthermore, it facilitates seamless integration into automated workflows, reducing the dependency on manual evaluations. Future research could explore refining these indexes by incorporating additional contextual factors, such as vehicle type or repair costs, to further enhance their applicability and precision.

However, the results of this study also present several limitations. First, there are instances of missed detections of car damage, which could reduce the accuracy of the assessment and lead to underreporting of damage severity. These missed detections, especially for damage types such as dents and scratches, highlight the challenge of achieving perfect object detection in real-world scenarios with varying image quality, angles, and lighting conditions. To address this limitation, future work should collect more data, specifically for dents and scratches.

Another limitation of this study is the increased computational cost. While adding the convolutional block attention module to the YOLO architecture improves detection performance, it also raises the computational cost, since CBAM introduces additional operations. CBAM applies spatial and channel attention sequentially, adding multiple convolutional and pooling layers, which leads to a higher processing load and potential latency during inference. A suggestion for future work is to reduce this cost by replacing certain operations with cheaper alternatives.

6. Conclusions

Car damage assessment is a process that can be significantly enhanced by the emergence of AI and IoT devices. In this work, two YOLOv9 improvements are proposed for car damage detection: the addition of a convolutional block attention module and the use of SCYLLA intersection over union for the box loss. We also introduce the damage severity index (DSI), a quantitative damage severity score based on the ratio of the detected damage area to the image size, the number of damages, and the prediction confidence. Experimental studies and data augmentation were performed using the CarDD dataset. Our proposed method achieved the best performance, with precision, recall, [email protected], and [email protected]:0.95 of 78%, 69%, 73%, and 58%, respectively. Comparative experiments indicate that our proposed method outperforms YOLOv3, YOLOv4, YOLOv5, YOLOv7, YOLOv7+CBAM, and YOLOv9 by 1.75%, resulting in more robust car damage detection. Moreover, the DSI provides an objective and consistent damage severity assessment, enabling IoT systems to deliver automated, real-time repair decisions.

Author Contributions

Conceptualization, R.A.B. and M.R.S.R.; methodology, R.A.B. and M.R.S.R.; software, M.R.S.R.; validation, A.B., M.R.S.R. and R.A.B.; formal analysis, M.R.S.R.; investigation, R.A.B.; resources, R.A.B.; data curation, M.R.S.R.; writing—original draft preparation, M.R.S.R.; writing—review and editing, R.A.B. and M.R.S.R.; visualization, M.R.S.R.; supervision, A.B.; project administration, A.B.; funding acquisition, R.A.B. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the CarDD dataset owner which can be found on the following link: https://arxiv.org/abs/2211.00945 (accessed on 19 August 2024).

Acknowledgments

This work is supported by Seleris Asia Pacific Technology Pte. Ltd., Data Science Center Universitas Indonesia (DSC UI), and Laboratory of Bioinformatics and Advanced Computing Department of Mathematics Universitas Indonesia.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures
Figure 1. Internet of Things model for car damage assessment.

Figure 2. Convolutional block attention module mechanism.

Figure 3. Channel attention mechanism.

Figure 4. Spatial attention mechanism.

Figure 5. Angle cost component of SIoU loss illustration.

Figure 6. Shape cost component of SIoU loss illustration.

Figure 7. Proposed YOLOv9 enhancement for car damage detection.

Figure 8. Main modules in the proposed YOLO. (a) RepNCSPELAN4 module; (b) RepNCSP module.

Figure 9. Adown module.

Figure 10. The CarDD dataset. (a) Class distribution; (b) Annotation distribution; (c) Damage location distribution; (d) Image size distribution.

Figure 11. The CarDD dataset pair plot.

Figure 12. Image color augmentation used in this study.

Figure 13. Mosaic augmentation used in this study.

Figure 14. Training results of the proposed YOLO enhancement across 200 epochs.

Figure 15. Performance comparison between YOLOv9-CS with data augmentation to YOLOv9-CS without data augmentation.

Figure 16. Confusion matrix of the proposed YOLO enhancement.

Figure 17. Performance comparison of the proposed method to several YOLO versions.

Figure 18. Simple user interface for car damage assessment.

Figure 19. Comparison between YOLOv9 and proposed model. Column (a) is the ground truth, column (b) is detection using YOLOv9, and column (c) is detection using the proposed model.

References

1. Doshi, S.; Gupta, A.; Gupta, J.; Hariya, N.; Pavate, A. Vehicle Damage Analysis Using Computer Vision: Survey. Proceedings of the 2023 4th International Conference on Communication Systems, Computing and IT Applications; Mumbai, India, 31 March–1 April 2023; pp. 132-135.

2. Yan, C.; Li, M.; Liu, W.; Qi, M. Improved Adaptive Genetic Algorithm for the Vehicle Insurance Fraud Identification Model Based on a BP Neural Network. Theor. Comput. Sci.; 2020; 817, pp. 12-23. [DOI: https://dx.doi.org/10.1016/j.tcs.2019.06.025]

3. Aleksandrova, A.; Ninova, V.; Zhelev, Z. A Survey on AI Implementation in Finance, (Cyber) Insurance and Financial Controlling. Risks; 2023; 11, 91. [DOI: https://dx.doi.org/10.3390/risks11050091]

4. Aslam, F.; Hunjra, A.I.; Ftiti, Z.; Louhichi, W.; Shams, T. Insurance Fraud Detection: Evidence from Artificial Intelligence and Machine Learning. Res. Int. Bus. Financ.; 2022; 62, 101744. [DOI: https://dx.doi.org/10.1016/j.ribaf.2022.101744]

5. Hassebo, A.; Tealab, M. Global Models of Smart Cities and Potential IoT Applications: A Review. Internet Things; 2023; 4, pp. 366-411. [DOI: https://dx.doi.org/10.3390/iot4030017]

6. Asif, R.; Hassan, S.R. Exploring the Confluence of IoT and Metaverse: Future Opportunities and Challenges. Internet Things; 2023; 4, pp. 412-429. [DOI: https://dx.doi.org/10.3390/iot4030018]

7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the Advances in Neural Information Processing Systems; Lake Tahoe, NV, USA, 3–6 December 2012.

8. Sruthy, C.M.; Kunjumon, S.; Nandakumar, R. Car Damage Identification and Categorization Using Various Transfer Learning Models. Proceedings of the 5th International Conference on Trends in Electronics and Informatics, ICOEI 2021; Tirunelveli, India, 3–5 June 2021; pp. 1097-1101.

9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA, 23–28 June 2014.

10. Girshick, R. Fast R-CNN. Proceedings of the International Conference on Computer Vision (ICCV); Santiago, Chile, 7–13 December 2015.

11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv; 2015; arXiv: 1506.01497[DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27295650]

12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); Venice, Italy, 22–29 October 2017.

13. Widjojo, D.; Setyati, E.; Kristian, Y. Integrated Deep Learning System for Car Damage Detection and Classification Using Deep Transfer Learning. Proceedings of the IEEE 8th Information Technology International Seminar, ITIS 2022; Surabaya, Indonesia, 19–21 October 2022; pp. 21-26.

14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016.

15. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv; 2018; arXiv: 1804.02767

16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv; 2020; arXiv: 2004.10934

17. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern.; 2020; 52, pp. 8574-8586. [DOI: https://dx.doi.org/10.1109/TCYB.2021.3095305] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34437079]

18. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv; 2022; arXiv: 2209.02976

19. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv; 2022; arXiv: 2207.02696

20. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv; 2024; arXiv: 2402.13616

21. Shirode, A.; Rathod, T.; Wanjari, P.; Halbe, A. Car Damage Detection and Assessment Using CNN. Proceedings of the 2022 IEEE Delhi Section Conference, DELCON 2022; New Delhi, India, 11–13 February 2022.

22. Elroy Martis, J.; Sannidhan, M.S.; Aravinda, C.V.; Balasubramani, R. Car Damage Assessment Recommendation System Using Neural Networks. Mater. Today Proc.; 2023; 92, pp. 24-31. [DOI: https://dx.doi.org/10.1016/j.matpr.2023.03.259]

23. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.

24. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv; 2022; arXiv: 2205.12740

25. Wang, X.; Li, W.; Wu, Z. CarDD: A New Dataset for Vision-Based Car Damage Detection. IEEE Trans. Intell. Transp. Syst.; 2023; 24, pp. 7202-7214. [DOI: https://dx.doi.org/10.1109/TITS.2023.3258480]

26. Holilah, D.; Bustamam, A.; Sarwinda, D. Detection of Alzheimer’s Disease with Segmentation Approach Using K-Means Clustering and Watershed Method of MRI Image. J. Phys.Conf. Ser.; 2021; 1725, 012009. [DOI: https://dx.doi.org/10.1088/1742-6596/1725/1/012009]

27. Nanni, L.; Paci, M.; Brahnam, S.; Lumini, A. Comparison of Different Image Data Augmentation Approaches. J. Imaging; 2021; 7, 254. [DOI: https://dx.doi.org/10.3390/jimaging7120254] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34940721]

28. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data; 2019; 6, 60. [DOI: https://dx.doi.org/10.1186/s40537-019-0197-0]

29. Chernov, V.; Alander, J.; Bochko, V. Integer-Based Accurate Conversion between RGB and HSV Color Spaces. Comput. Electr. Eng.; 2015; 46, pp. 328-337. [DOI: https://dx.doi.org/10.1016/j.compeleceng.2015.08.005]

30. Kang, L.-W.; Wang, I.-S.; Chou, K.-L.; Chen, S.-Y.; Chang, C.-Y. Image-Based Real-Time Fire Detection Using Deep Learning with Data Augmentation for Vision-Based Surveillance Applications. Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS); Taipei, Taiwan, 18–21 September 2019; pp. 1-4.

31. Yoshimura, M.; Otsuka, J.; Irie, A.; Ohashi, T. Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada, 17–24 June 2023; pp. 14007-14017.

32. Cirillo, M.D.; Abramian, D.; Eklund, A. What Is The Best Data Augmentation For 3D Brain Tumor Segmentation?. Proceedings of the International Conference on Image Processing, ICIP; Anchorage, AK, USA, 19–22 September 2021; pp. 36-40.

33. Volk, G.; Muller, S.; von Bernuth, A.; Bringmann, O. Towards Robust CNN-Based Object Detection through Augmentation with Synthetic Rain Variations. Proceedings of the ITSS Institute of Electrical and Electronics Engineers; Auckland, New Zealand, 27–30 October 2022; pp. 1-7.

34. Kalia, R. COCO Annotated Dataset Car Damage Detection. Available online: https://www.kaggle.com/datasets/ramsikalia/coco-annotated-dataset-car-damage-detection (accessed on 29 March 2023).

35. Lenka, L.P. Coco Car Damage Detection Dataset. Available online: https://www.kaggle.com/datasets/lplenka/coco-car-damage-detection-dataset (accessed on 19 May 2023).

36. Kang, Z.; Liao, Y.; Du, S.; Li, H.; Li, Z. SE-CBAM-YOLOv7: An Improved Lightweight Attention Mechanism-Based YOLOv7 for Real-Time Detection of Small Aircraft Targets in Microsatellite Remote Sensing Imaging. Aerospace; 2024; 11, 605. [DOI: https://dx.doi.org/10.3390/aerospace11080605]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).