Abstract

Shipping containers are vital to the transportation industry due to their cost-effectiveness and compatibility with intermodal systems. With the significant increase in container usage since the mid-20th century, manual tracking at port terminals has become inefficient and prone to errors. Recent advancements in Deep Learning for object detection have introduced Computer Vision as a solution for automating this process. However, challenges such as low-quality images, varying font sizes and illumination, and environmental conditions hinder recognition accuracy. This study explores various architectures and proposes a Container Code Localization Network (CCLN), utilizing ResNet and UNet for code identification, and a Container Code Recognition Network (CCRN), which combines Convolutional Neural Networks with Long Short-Term Memory to convert the image text into a machine-readable format. By enhancing existing shipping container localization and recognition datasets with additional images, our models exhibited improved generalization capabilities on other datasets, such as SynthText, for text recognition. Experimental results demonstrate that our system achieves 97.93% accuracy at 64.11 frames per second under challenging conditions such as varying font sizes, illumination, tilt, and depth, effectively simulating real port terminal environments. The proposed solution promises to enhance workflow efficiency and productivity in container handling processes, making it highly applicable in modern port operations.

1. Introduction

Shipping containers are a pivotal component in global transportation due to their high transportation capacity, re-usability, and cost-effectiveness [1]. Since their introduction in the mid-20th century, they have revolutionized the logistics industry. According to the World Shipping Council (WSC), over 200 million container trips occur annually, transporting approximately 90% of the world’s non-bulk cargo [2]. The volume of containerized trade has surged, increasing from 102 million twenty-foot equivalent units (TEUs) in 2000 to more than 1 billion TEUs in 2024 [2] as illustrated in Figure 1. This remarkable growth underscores the efficiency and reliability of containerization in facilitating global trade.

A shipping container is a standard transport and storage unit designed to protect goods during transit and streamline handling processes [5]. Each container has a unique identifier called a shipping container code, which allows for tracking the container movement throughout the supply chain, especially during Container Multimodal Transportation (CMT) [6]. Port terminals, especially at gate entrances, play a crucial role in the CMT process, as these are the points where container codes are identified and recorded [6]. The accuracy and speed of container code recognition at these locations directly influence the overall efficiency of CMT operations.

According to the International Standards Organization (ISO), a shipping container code consists of 11 characters divided into four components: (1) the Owner code, (2) the Category identifier, (3) the Serial number, and (4) the Check digit [5]. The check digit is calculated based on the other three components to ensure compliance with the ISO 6346 standard [7]. These container codes can be aligned horizontally or vertically, as illustrated in Figure 2.
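
To make the check-digit relationship concrete, the sketch below follows the commonly documented ISO 6346 procedure (letters map to the values 10–38 with multiples of 11 skipped, each of the first ten characters is weighted by a power of two, and the weighted sum is reduced modulo 11). This is our own illustrative code, not an implementation taken from the standard or from the proposed system.

```python
import string

def iso6346_check_digit(code10: str) -> int:
    """Check digit for the first 10 characters of a container code
    (owner code, category identifier, and serial number)."""
    letter_values = [v for v in range(10, 39) if v % 11 != 0]  # A=10, B=12, ..., Z=38
    total = 0
    for i, ch in enumerate(code10.upper()):
        value = int(ch) if ch.isdigit() else letter_values[string.ascii_uppercase.index(ch)]
        total += value * (2 ** i)  # position weight 2^i
    return total % 11 % 10  # a remainder of 10 is written as 0

# Widely cited example: "CSQU" + "305438" yields check digit 3, i.e., the code CSQU3054383.
assert iso6346_check_digit("CSQU305438") == 3
```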

Despite the critical importance of container codes, manual recording methods have proven to be highly erroneous, time-consuming, and labor-intensive [8]. This inefficiency highlights the pressing need for automated solutions in the industry. Automatic Container Code Localization and Recognition (ACCLR) systems have emerged, leveraging technologies such as Wireless Sensor Networks (WSN) [9], Radio Frequency Identification (RFID) [10], and Computer Vision (CV) [7]. While WSN and RFID offer automated recognition, they fall short due to the cost-prohibitive and complex implementation associated with extensive infrastructure requirements [7,8,11]. In contrast, CV-based systems provide more practical, affordable, and easy-to-install options, primarily requiring only a camera and a processor [12].

Nonetheless, these systems face significant challenges, including variations in image text quality, font size, orientation, and environmental conditions, which complicate the recognition process [13,14]. To address these challenges, we propose a high-performance deep learning-based system to detect and recognize shipping container codes under various conditions.

Deep learning techniques such as CNN [15], RNN [16], CRNN [17], RCNN [18], Fast-RCNN [19], Faster-RCNN [20], YOLO [21], SSD [22], and RetinaNet [23] have demonstrated impressive performance in sequence recognition, object detection, and image classification tasks. These advancements provide a solid foundation for further research into the application of deep learning for shipping container code detection and recognition at port terminals.

This research makes the following contributions:

Dataset Extension: We expanded existing datasets [24,25] by merging and augmenting them with new images to enhance diversity and better simulate real port terminal conditions. This approach addresses data scarcity and mitigates overfitting, providing a more robust benchmark for model evaluation.

Advanced Localization Model: We proposed the Container Code Localization Network (CCLN), modifying the model architecture in [7] by removing the last two hidden layers and changing the output from sliding windows to bounding box regression, as in [21], which enables the elimination of post-processing steps such as ASA and AMSR. This modification, combined with the use of a stricter loss function that caters for aspect ratio, results in a simpler and better-performing model.

Advanced Recognition Model: We proposed the Container Code Recognition Network (CCRN), modifying the model in [17] by incorporating LSTM as in [7], eliminating the last convolutional layer, and optimizing the image size, resulting in a model that demonstrates strong generalization capabilities when trained on the SynthText [26] dataset.

The rest of the paper is structured as follows: Section 2 presents related work, offering an overview of existing state-of-the-art ACCLR systems while highlighting their strengths, weaknesses, and potential research gaps. Section 3 outlines the proposed method, detailing the design of the CCLN and CCRN, along with their respective architectures, loss functions, and input-output relationships. Section 4 presents the datasets from [24,25,26], discusses the results of the proposed framework model, measures its performance, and compares it with the state-of-the-art models. We will also address the limitations of the model. Finally, Section 5 concludes the paper by summarizing the work, discussing key findings, addressing limitations, and suggesting directions for future research.

2. Related Works

This section explores various state-of-the-art methods for container code localization and recognition, analyzing their strengths and weaknesses to develop an advanced system that addresses existing limitations.

Mei et al. [27] proposed a code character recognition framework that integrates Convolutional Neural Network (CNN) [15] for container code characters with a Template Matching (TM) [27] algorithm for multi-feature extraction, addressing challenges associated with blurred and damaged container codes. The CNN [15] directly predicts the container codes from the input images, whilst the TM identifies similarities between container code characters and template images. The code character combination strategy synthesizes the container code characters based on the outputs from both the CNN [15] and TM [27]. Although this approach achieved an accuracy of 92.00%, it faced challenges with incorrect template matches, high time complexity, and limitations in recognizing sequences.

In [28], a method that differs slightly from that of Mei et al. [27], focusing on multiple-view code extraction rather than multi-feature extraction, was proposed. This approach achieved an accuracy of 96% by aligning outputs from various perspectives (i.e., top, front, rear, left, and right). In their proposed model, container code characters are segmented from each image using a sliding window and subsequently recognized with a Support Vector Machine [29]. This strategy aims to address challenges related to the deformation and occlusion of container code characters. However, this method experiences increased time complexity due to the separate recognition steps. Also, the top codes of containers often become unclear because of environmental damage from rain, dust, and snow, making the system susceptible to noise [7].

Further advancements in the field of ACCLR systems have been explored. Zhiming et al. [30] introduced a system that employs Faster-RCNN [31] for detecting and classifying container code characters, subsequently recognized using a binary search tree. This approach attained an accuracy of 97.71% on a test dataset comprising 831 container code images. However, it incurs higher processing costs, rendering it unsuitable for real-time applications.

In [11], the author proposed a method utilizing EAST [32], ResNet [33], and CRNN [17], which attained an accuracy of 93.98%. While this method mitigated the effects of noise, it struggled to generalize across various environmental conditions.

Later, Chen et al. [34] presented a similar approach employing PSENet [35], ResNet [33], and CRNN [17], achieving 95.00% accuracy but trained on a smaller dataset, raising concerns about its generalizability to diverse scenarios.

In [7], the author proposed a comparable system that combines ResNet [33], UNet [36], and CRNN [17], alongside ASA and AMSR algorithms, to reduce noise and enhance accuracy to 93.33%. However, the system assumed incorrect predictions were solely due to incomplete code regions, which is not always the case.

Ran et al. [37] later developed a distinct ACCLR system to recognize vertically aligned shipping container codes using a Feature Pyramid Network (FPN) [38] and a Regional Proposal Network (RPN) [31] for code region detection, utilizing two separate recognition pipelines. This system achieved an accuracy of 94.6% at 3.22 FPS on a dataset containing 1853 images of vertically aligned trailer and shipping container codes, but the use of separate recognition pipelines resulted in increased time complexity.

In [39], the author integrated YOLOv4 [21] and Tesseract OCR, achieving 90.7% accuracy. Despite its speed, the system encountered difficulties under poor lighting conditions, leading to decreased performance.

Zhao et al. [40] proposed a Practical Unified Network for arbitrarily oriented codes, achieving 98.57% accuracy. However, the model’s performance suffered significantly with very low-quality images.

In [41], the author proposed a framework that combines ResNet [33], UNet [36], and CRNN [17], achieving an accuracy of 95% at a speed of 4 FPS, though it still encounters challenges under adverse environmental conditions.

Recently, Yu et al. [42] introduced a two-stage ACCLR system utilizing C-YOLOv4 and C-Deeplabv3+, achieving an accuracy of 99.51% and a processing speed of 434.78 FPS. However, this system struggles to recognize characters under complex conditions.

In [43], the author employed a different approach using Differential Binarization and a Semantic Reasoning Network, achieving an accuracy of 89.20%. Nonetheless, this system encountered challenges with vertical text and low-quality images due to poor lighting conditions.

A recent hybrid system for container OCR by Santos et al. [44] integrates YOLOv7 for code detection with the transformer-based TrOCR for recognition, achieving an average accuracy of approximately 97.12%. Although the approach demonstrated high reliability in handling multi-oriented codes and environmental distortions, the use of heavy models resulted in time complexity issues, making the system challenging for real-time applications.

In the pursuit of broader container attribute recognition, Ruoxin et al. [45] developed an ACCLR system to handle arbitrarily oriented codes, types, weights, and hazard signs. The system employs three distinct models for container detection, text detection, and text recognition, achieving 96.29% accuracy at 32.71 FPS. While validated at the real-world Ningbo Zhoushan Port, the model faced issues with misclassification of similar letters and false text detections.

Su et al. [46] developed a hybrid CNN–BiLSTM–Attention model to forecast the Shanghai Container Freight Index (SCFI), achieving an R² of 94.8%. While their study targets market prediction rather than container recognition, it demonstrates the broader significance of precise container tracking. Accurate ACCLR outputs contribute to downstream forecasting, risk analysis, and strategic decision-making in global logistics, aligning with the data-driven perspective emphasized in their work.

Notably, Transformer-based systems such as the Spatial Semantic Text Spotter [47] and Text-Aware Text Super-Resolution [48] have shown remarkable success in sequence recognition, particularly for complex scene text tasks. These models leverage the global attention mechanism of Transformers to address limitations found in traditional CNN and RNN-based approaches. Their ability to model spatial and sequence information marks a significant advancement in scene text recognition, paving the way for further research in this area.

Despite significant advancements, many existing systems still struggle with poor image quality, blurred or deformed characters, and limited generalization across various conditions [7,11,28,39,40,41,42,43,45]. The primary research gap lies in the absence of a robust model that generalizes effectively across different environments while maintaining real-time performance. This paper aims to address this gap by proposing a model that overcomes challenges related to adverse environmental conditions and poor generalization, thereby striking a balance between accuracy and speed.

The reviewed methods reveal trends such as the prevalent use of ResNet [33] for its high accuracy in compact models and CRNN [17] for recognizing scene text of varying lengths. Other notable architectures include YOLO [21], renowned for its real-time object detection capabilities, and Transformer models, which excel in sequence recognition tasks due to their self-attention mechanisms. UNet is preferred for image segmentation, especially in biomedical contexts. Each architecture has its strengths and weaknesses; for instance, ResNet is highly effective for image classification due to its deep residual learning framework, yet it can be computationally intensive. UNet [36] excels in tasks requiring precise localization and context capture, although it may struggle with processing speed in large datasets. Transformer models, while powerful, often face challenges with time complexity and resource demands, particularly in real-time applications. Conversely, CRNNs effectively handle variable-length inputs but may require multiple recognition pipelines, complicating deployment.

Despite these challenges, ResNet [33] and UNet [36] were selected for this research due to their proven robustness and adaptability in various scenarios. ResNet’s ability to achieve high accuracy with fewer parameters aligns well with our compact model requirements, while UNet’s architecture is particularly suited for the segmentation tasks involved in automatic container code recognition. Nonetheless, challenges remain in managing time complexity and multiple recognition pipelines [27,28,37]. Efforts to mitigate these issues have relied on post-processing algorithms [7,27,37], which in turn raise their own time complexity concerns. This research seeks to develop a robust approach to enhance both accuracy and processing efficiency in automatic container code recognition [7,11,27,28,30,34,37,41,49,50,51].

3. Materials and Methods

In our paper, we introduce a CCLN for container code localization and a CCRN for container code recognition. The CCLN model is based on the architecture proposed in [7], utilizing ResNet [33] and UNet [36] to deliver high detection accuracy while maintaining a compact and efficient design. The CCRN model incorporates CNN [15], LSTM [52], and Connectionist Temporal Classification (CTC) [53], following the architectures outlined in [7,17]. The CCRN, derived from the widely adopted CRNN architecture [7,11,30,34,37], effectively recognizes scene text of varying lengths without requiring character segmentation and has also shown generalization in similar applications, such as music score recognition [7].

The CCLN processes rear-view images of containers to predict the container code region, which is defined by a bounding box that includes (1) the owner code and category identifier, (2) the serial number, and (3) the check digit. The x and y coordinates of the top-left and bottom-right corners of the Region of Interest (RoI) define the bounding box, following the Pascal Visual Object Classes (VOC) [54] standard. This bounding box guides the system in cropping the container code image, which is then passed to the recognition model. The CCRN processes the cropped image and converts the visual container code into a machine-readable format, facilitating accurate and efficient recognition.

3.1. Container Code Localization

The container code localization process consists of several steps: pre-processing, feature extraction, up-scaling, and post-processing. During pre-processing, the Red Green Blue (RGB) rear-view image of a container is resized to 256×256 pixels to meet the requirements of the localization model. These dimensions were selected after exploring a variety of sizes, including (96×96, 224×224, 256×256, 320×320, 640×640, 960×960, and 1024×1024), that optimize the model without compromising performance. Feature extraction utilizes ResNet [33] and UNet [36], and the subsequent up-scaling addresses the reduction in feature map size. The architecture of the localization model, illustrated in Figure 3, comprises a series of convolutions followed by adaptive pooling to transform the feature map into a vector of four values. This vector represents the coordinates for the upper left and lower right corners of the bounding box surrounding the container code. In the post-processing stage, the predicted code region guides the image-cropping process, resulting in a smaller image focused on the container code.

The Container Code Localization Network (CCLN) is designed to accurately detect the code region within an input image. As illustrated in Figure 3a, the CCLN takes a 256×256 pixel RGB image as input and outputs the coordinates for the bounding box, following the Pascal VOC [54] standard annotations. The CCLN architecture, shown in Figure 3b and Table 1, leverages a pre-trained ResNet50 backbone [33] to extract a hierarchy of visual features from the input image. We explored various architectures, including LeNet [55], VGG [56], and Inception [57], and found ResNet50 to be the optimal choice due to its ability to mitigate vanishing gradients and effectively extract features [58]. The CCLN contains approximately 16 million parameters, resulting in a model size of 78 Megabytes (MB), balancing accuracy with computational efficiency.

The CCLN modifies the output of the sliding window approach to bounding box regression, as demonstrated in [21], streamlining the detection pipeline and eliminating post-processing steps such as ASA and AMSR. Inspired by the UNet [36] architecture, the model incorporates skip connections to preserve spatial information and local details necessary for accurate bounding box predictions. By combining these skip connections with convolutional layers and a ResNet50 backbone, the CCLN effectively identifies the code region within the input image, even under challenging conditions. The use of multi-scale feature fusion enables the model to focus on relevant details while maintaining a comprehensive understanding of the scene, resulting in precise bounding box predictions that facilitate the subsequent recognition process, as supported by similar studies [42,59,60].

To further define the CCLN output, recall that in Pascal VOC [54], x starts from left to right and y starts from top to bottom, as illustrated in Figure 3a.

xmin indicates the distance from the left edge of the bounding box to the left side of the image.

ymin indicates the distance from the top edge of the bounding box to the top of the image.

xmax denotes the distance from the left side of the image to the right edge of the bounding box.

ymax signifies the distance from the top of the image to the bottom edge of the bounding box.
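
To illustrate how these VOC-style coordinates drive the cropping step described above, the sketch below crops the predicted code region with Pillow. It is our own minimal example; the assumption that the predicted coordinates refer to the 256×256 model input and must be scaled back to the original resolution is not stated explicitly in the text.

```python
from PIL import Image

def crop_code_region(image: Image.Image, box):
    """Crop the predicted code region from the original rear-view image.
    `box` holds (xmin, ymin, xmax, ymax) in Pascal VOC order."""
    xmin, ymin, xmax, ymax = box
    # Scale from the 256x256 input space back to the original image size (assumed step).
    sx, sy = image.width / 256.0, image.height / 256.0
    return image.crop((int(xmin * sx), int(ymin * sy), int(xmax * sx), int(ymax * sy)))
```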

The loss function for training the CCLN is based on the model’s outputs and annotations that comply with the Pascal VOC [54] standard. We utilized the Complete Intersection over Union (CIoU) loss [61], an enhancement of the traditional IoU loss. CIoU not only measures the overlap between predicted and ground truth bounding boxes but also accounts for the distance between their centers and their aspect ratios. This additional information allows CIoU to provide more informative gradients, particularly when there is no overlap, thereby accelerating the training process and improving localization accuracy. As a result, CIoU proves to be a more effective choice for our model. The Complete Intersection over Union (CIoU) loss, LCIoU, is defined as follows:

(1) $L_{CIoU}(B, B^{*}) = 1 - IoU(B, B^{*}) + \dfrac{\rho^{2}(B, B^{*})}{c^{2}} + \alpha\, v(B, B^{*})$

where IoU(B,B*) is the intersection over union between the predicted bounding box B and the ground truth bounding box B*:

(2) $IoU(B, B^{*}) = \dfrac{\left(\min(x_{max}, x_{max}^{*}) - \max(x_{min}, x_{min}^{*})\right)\left(\min(y_{max}, y_{max}^{*}) - \max(y_{min}, y_{min}^{*})\right)}{\left(\max(x_{max}, x_{max}^{*}) - \min(x_{min}, x_{min}^{*})\right)\left(\max(y_{max}, y_{max}^{*}) - \min(y_{min}, y_{min}^{*})\right)}$

The distance between the centers of the predicted and ground truth bounding boxes is given by:

(3) $\rho^{2}(B, B^{*}) = \left(\dfrac{x_{max} + x_{min}}{2} - \dfrac{x_{max}^{*} + x_{min}^{*}}{2}\right)^{2} + \left(\dfrac{y_{max} + y_{min}}{2} - \dfrac{y_{max}^{*} + y_{min}^{*}}{2}\right)^{2}$

The aspect ratio difference between the predicted and ground truth bounding boxes is:

(4) $v(B, B^{*}) = \dfrac{4}{\pi^{2}}\left(\arctan\dfrac{x_{max}^{*} - x_{min}^{*}}{y_{max}^{*} - y_{min}^{*}} - \arctan\dfrac{x_{max} - x_{min}}{y_{max} - y_{min}}\right)^{2}$

Finally, the constants c and α are set to 1 and 0.5, respectively. These values were chosen after experiments showed they balance the IoU and aspect ratio terms well, improving convergence speed. They also match the recommendations in [61], confirming their effectiveness. The value of c heavily influences the penalty for large bounding boxes, while α controls the trade-off between the IoU term and the aspect ratio term, with higher values giving more weight to the aspect ratio term.
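
For reference, the following PyTorch sketch implements the CIoU loss exactly as written in Equations (1)–(4), with the fixed constants c = 1 and α = 0.5 reported above. The tensor layout and the small epsilon guards against division by zero are our own assumptions rather than details taken from the paper.

```python
import torch

def ciou_loss(pred, target, c=1.0, alpha=0.5):
    """CIoU loss of Equations (1)-(4). `pred` and `target` are (N, 4) tensors
    of (xmin, ymin, xmax, ymax) boxes."""
    px1, py1, px2, py2 = pred.unbind(dim=1)
    tx1, ty1, tx2, ty2 = target.unbind(dim=1)

    # Equation (2): overlap ratio between predicted and ground-truth boxes.
    inter_w = (torch.minimum(px2, tx2) - torch.maximum(px1, tx1)).clamp(min=0)
    inter_h = (torch.minimum(py2, ty2) - torch.maximum(py1, ty1)).clamp(min=0)
    enc_w = torch.maximum(px2, tx2) - torch.minimum(px1, tx1)
    enc_h = torch.maximum(py2, ty2) - torch.minimum(py1, ty1)
    iou = (inter_w * inter_h) / (enc_w * enc_h + 1e-7)

    # Equation (3): squared distance between the two box centres.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Equation (4): aspect-ratio consistency term.
    v = (4 / torch.pi ** 2) * (
        torch.atan((tx2 - tx1) / (ty2 - ty1 + 1e-7))
        - torch.atan((px2 - px1) / (py2 - py1 + 1e-7))
    ) ** 2

    # Equation (1): combined loss, averaged over the batch.
    return (1 - iou + rho2 / c ** 2 + alpha * v).mean()
```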

3.2. Container Code Recognition

Container code recognition comprises several processes: pre-processing, feature extraction, feature transformation, sequence recognition, and post-processing. In the pre-processing stage, the cropped image from the localization model is converted into grayscale. The resulting grayscale image is resized to a fixed height of 32 pixels, with white pixel padding added to the right side to achieve a width of 128 pixels, thus conforming to the dimension requirements of the recognition model. As depicted in Figure 4, the input dimensions of the recognition model are 128×32 pixels. These dimensions were selected after experimenting with various sizes, including (128×32, 256×32, 512×32, 512×64), and were found to be optimal as they strike a balance between being small and not compromising model performance.
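
A minimal sketch of this pre-processing step is shown below. The interpolation mode, the white padding value of 255, and the final scaling to [0, 1] are our own assumptions, as the text does not specify them.

```python
import numpy as np
from PIL import Image

def preprocess_for_recognition(crop: Image.Image, target_h=32, target_w=128):
    """Grayscale the cropped code image, resize it to a height of 32 pixels while
    keeping the aspect ratio, and pad the right side with white pixels to 128 wide."""
    gray = crop.convert("L")
    w, h = gray.size
    new_w = min(target_w, max(1, round(w * target_h / h)))
    gray = gray.resize((new_w, target_h), Image.BILINEAR)
    canvas = Image.new("L", (target_w, target_h), color=255)  # white padding on the right
    canvas.paste(gray, (0, 0))
    return np.asarray(canvas, dtype=np.float32) / 255.0
```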

During feature extraction, CNN [15] extracts image features, which are then transformed into a time frame of receptive fields. The receptive field represents an area in the input image that creates a feature whose height matches the height of the image but with a width set to 1 pixel. In sequence recognition, the bidirectional Long Short-Term Memory (BiLSTM) [52] layer outputs the probability distribution of all characters in our character set per time frame. In this research, the character set includes all uppercase and lowercase English letters, numbers 0 to 9, and a hyphen representing a space.

The total number of characters in the character set is computed as follows:

(5) $N_{characters} = n_{lowercase} + n_{uppercase} + n_{numbers} + n_{space}$

(6) $N_{characters} = 26 + 26 + 10 + 1 = 63$

In the first part of the post-processing stage, the most confident character is selected for each time frame, resulting in some duplicated characters. In the final part of the post-processing stage, Connectionist Temporal Classification (CTC) [53] is employed to remove successive duplicated characters and all spaces, as shown in Figure 4.

Figure 4a shows the recognition block diagram, where a grayscale image containing a part of a container code (owner code and category identifier) is entered into the recognition model. The model uses CNN and LSTM to predict the probabilities of characters in the character set. The final layer of the model produces a probability distribution vector for each time frame. Following this, the decoding of the sequence employs a standard CTC Greedy Decoding mechanism. This process selects the character with the maximum probability (highest confidence) for each time frame, indicated by the red box (with the character “-” representing a space). The CTC decoding algorithm [53] then removes successive duplicate characters and all spaces (represented by “-”), resulting in the string format of the actual image text: “BBCU”.
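
The greedy decoding step can be sketched as follows. The ordering of the 63-symbol character set (with “-” at index 0) is our assumption for illustration; only the collapse-then-remove behaviour is taken from the description above.

```python
import string

# 63-symbol character set described above; "-" stands for the space/blank symbol.
CHARSET = "-" + string.ascii_uppercase + string.ascii_lowercase + string.digits

def ctc_greedy_decode(frame_probs):
    """Pick the most confident symbol per time frame (frame_probs: (T, 63) array),
    collapse consecutive duplicates, then drop every '-' symbol."""
    best = frame_probs.argmax(axis=1)
    decoded, prev = [], None
    for idx in best:
        if idx != prev and CHARSET[idx] != "-":
            decoded.append(CHARSET[idx])
        prev = idx
    return "".join(decoded)

# Example: frames decoding to B, -, B, B, C, U, - collapse to the string "BBCU".
```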

Figure 4b and Table 2 illustrate the architecture of the recognition model, showing that the CCRN accepts an input of 128×32 pixels and produces a vector representing the probability distribution of all characters for each time frame. Consequently, the model produces 63 probability predictions for each of the 128 time frames. It should be noted that this architecture is similar to the one proposed in [17], with the main difference being the removal of the last convolutional layer before connecting to the first LSTM layer. The CCRN model contains approximately 10 million parameters and has a size of about 45 MB, offering a compact yet effective solution for container code recognition.

The loss function to train the recognition model is expressed as follows:

(7) $L = \begin{cases} -\sum_{y^{*} \in Y^{*},\, y \in Y} \ln\left(p(y^{*} \mid y)\right), & \text{if } y \in Y^{*} \\ 0, & \text{otherwise} \end{cases}$

where Y and Y* are the predicted and actual character sequences, respectively, and y and y* are characters within these sequences.
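
In practice, this objective is commonly optimised with an off-the-shelf CTC loss implementation. The sketch below uses PyTorch’s nn.CTCLoss with dummy shapes matching the CCRN description; treating the blank symbol as index 0 and the choice of library routine are our assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank assumed at index 0 of the 63-symbol set

# Dummy shapes for illustration: 128 time frames, batch of 4, 63 classes, 11-character codes.
log_probs = torch.randn(128, 4, 63, requires_grad=True).log_softmax(dim=2)  # (T, N, C)
targets = torch.randint(1, 63, (4, 11))
input_lengths = torch.full((4,), 128, dtype=torch.long)
target_lengths = torch.full((4,), 11, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```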

4. Experimental Results and Discussion

The dataset utilized in this study comprises rear-view images of shipping containers captured by cameras positioned on the top right side of each container. These images were sourced from various port terminals in Africa, Asia, and America and augmented with publicly available ACCLR and scene text datasets. This approach increased the sample size and enhanced the system’s reliability and generalizability. The dataset includes 3600 images, ensuring a diverse representation of container manufacturers, colors, and code formats.

We employed Kili Technology Software for labeling, ensuring a rigorous validation method to maintain annotation accuracy. Images were labeled with a focus on the nearest rear container code, while views of the rear, top, side, or nearby codes were treated as background. Container codes too far from the camera’s view were classified as background artifacts due to recognition difficulties.

Figure 5 illustrates samples from the dataset, showcasing images captured under various lighting conditions (day and night) and backgrounds, including tilted, low-quality, noisy, and deformed container code images with varying font sizes as containers moved toward or away from the camera.

The dataset was split into training (70%), validation (10%), and test (20%) sets. This split was chosen to provide a larger test set that captures the wide variability in the limited dataset, ensuring robust evaluation of generalization across diverse conditions while retaining enough data for training and validation. By monitoring the training loss against the validation loss, the model’s generalization capability could be assessed early on, allowing for timely adjustments to the architecture or hyperparameters if necessary. The test set was kept separate to provide an unbiased evaluation of the final model’s performance. This approach ensured that the localization and recognition models were trained and evaluated on representative samples, leading to robust and reliable performance in real-world scenarios. In the final experiment, the localization and recognition models were combined to predict container codes and determine their accuracy and processing power.

The experiments were conducted on a Linux Ubuntu 22.04.4 system, featuring a 13th Gen Intel Core i7-13600P processor, 16 GB of RAM, and a 64-bit architecture, with a clock speed of 5 GHz. The models were implemented and trained using Python 3.13.0 with the PyTorch 2.4.0 framework [62], which was chosen for its flexibility, dynamic computation graph, and strong community support. Training used the Adam optimizer with an initial learning rate of 0.001 and the “Reduce Learning Rate on Plateau” scheduler to improve convergence, with a batch size of 64 over 256 epochs. During training, the model achieving the lowest combined training and validation loss was saved at each epoch. The entire training process took approximately 3.5 h.
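
The reported training configuration can be sketched as follows. Here `model`, the data loaders, and `loss_fn` are placeholders for the CCLN or CCRN setup, and the checkpointing criterion paraphrases the description above; this is an illustrative outline rather than the authors’ training script.

```python
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=256, lr=1e-3):
    """Adam (lr 0.001) with ReduceLROnPlateau scheduling; the checkpoint with the
    lowest combined training and validation loss is kept."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
    best = float("inf")
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for images, targets in train_loader:  # batch size 64 set in the loader
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        scheduler.step(val_loss)
        if train_loss + val_loss < best:  # keep the best-performing checkpoint
            best = train_loss + val_loss
            torch.save(model.state_dict(), "best_model.pt")
```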

4.1. Container Code Localization

The shipping container code localization dataset used in this study comprised 3600 RGB rear-view images of containers, annotated with Extensible Markup Language (XML) files providing bounding box coordinates (xmin, ymin, xmax, ymax). This dataset was created by combining a subset of images from the datasets released by [24,25], which represented about 50% of our experimental dataset. The remaining images were sourced from various online datasets around the world, capturing diverse environmental and lighting conditions, as well as different manufacturers.

Figure 6 shows the localization (a) dataset structure, with subfolders for Joint Photographic Experts Group (JPEG) images and XML annotations, (b) dataset annotations, where image file names match corresponding annotation files containing bounding box coordinates, and (c) an annotated image with a green bounding box around the container code.

The key performance metrics selected for the container code localization task are Accuracy, Precision, Recall, F1-Score, and Average Precision (AP). Accuracy provides an overall assessment of the correctness of the detected container code regions. Precision evaluates the accuracy of the model’s positive predictions, Recall measures the ability to detect all code regions without missing any, and F1-Score captures the trade-off between Precision and Recall. AP further assesses the model’s performance across different IoU thresholds. These metrics were chosen to comprehensively evaluate the localization model’s capabilities from multiple perspectives, ensuring high detection rates, low false positives, and well-aligned bounding boxes. By optimizing these metrics, the localization model provides the recognition stage with an accurate and complete set of detected regions, minimizing feature loss and directly benefiting the overall system performance.

(8) $\text{Recall} = \dfrac{TP}{TP + FN}$

(9) $\text{Precision} = \dfrac{TP}{TP + FP}$

(10) $\text{F1-Score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

(11) $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

(12) $\text{Average Precision} = \dfrac{1}{N}\sum_{r=1}^{N} P(r)$

where N is set to 11 because there are 11 points on the Precision-Recall (P-R) graphs to estimate the area under the curve, similar to the experimental setup in [7]. The abbreviations for TP, FP, TN, and FN are as follows:

TP—True Positive

FP—False Positive

TN—True Negative

FN—False Negative

AP was computed using 11 points, a standard practice in CV [7], and consistent across all compared methods in this paper.

Algorithm 1 provides a method for computing TP, FP, TN, and FN based on the IoU between the predicted and ground-truth bounding boxes. Initially, all counts are set to zero. For each predicted bounding box, the algorithm calculates the IoU with all actual bounding boxes and identifies the maximum IoU value. If this value exceeds the IoU threshold, the prediction is considered a TP, otherwise, it is an FP. The algorithm then iterates over each actual bounding box to ensure that any actual unrecognized box (IoU below the threshold) is counted as an FN. The TN is computed by subtracting the sum of TP, FP, and FN from the sum of all images. This approach ensures a thorough evaluation of the performance of the localization model.

Algorithm 1 Computation of TP, FP, TN, and FN
Require: predicted_boxes, actual_boxes, IoU_threshold
Ensure: TP, FP, TN, FN
  Initialize TP, FP, TN, FN to 0
  for each pred_box in predicted_boxes do
    max_IoU ← 0
    for each actual_box in actual_boxes do
      IoU ← ComputeIoU(pred_box, actual_box)
      if IoU > max_IoU then
        max_IoU ← IoU
      end if
    end for
    if max_IoU ≥ IoU_threshold then
      TP ← TP + 1
    else
      FP ← FP + 1
    end if
  end for
  for each actual_box in actual_boxes do
    max_IoU ← 0
    for each pred_box in predicted_boxes do
      IoU ← ComputeIoU(pred_box, actual_box)
      if IoU > max_IoU then
        max_IoU ← IoU
      end if
    end for
    if max_IoU < IoU_threshold then
      FN ← FN + 1
    end if
  end for
  TN ← total_number_of_images − (TP + FP + FN)
  return TP, FP, TN, FN
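
An equivalent Python transcription of Algorithm 1 is given below for reference. The `compute_iou` helper uses the conventional intersection-over-union of two VOC-style boxes, since Algorithm 1 only names ComputeIoU without spelling it out.

```python
def compute_iou(a, b):
    """Conventional IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def confusion_counts(predicted_boxes, actual_boxes, iou_threshold, total_images):
    """TP, FP, TN, and FN as computed by Algorithm 1."""
    tp = fp = fn = 0
    for pred in predicted_boxes:
        max_iou = max((compute_iou(pred, act) for act in actual_boxes), default=0.0)
        if max_iou >= iou_threshold:
            tp += 1
        else:
            fp += 1
    for act in actual_boxes:
        max_iou = max((compute_iou(pred, act) for pred in predicted_boxes), default=0.0)
        if max_iou < iou_threshold:
            fn += 1
    tn = total_images - (tp + fp + fn)
    return tp, fp, tn, fn
```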

Figure 7 presents a comparison of Precision, Recall, Accuracy and F1-Score across different IoU thresholds (0.5 to 0.9) for our proposed CCLN and existing methods. CCLN consistently outperforms other models in Precision, Accuracy and F1-Score, demonstrating strong robustness to stricter IoU requirements. Although it slightly lags behind the method in [42] in Recall, the gap is minimal, and CCLN achieves a higher F1-Score, indicating a better balance between Precision and Recall. Maintaining high performance at higher IoU thresholds is critical, as it reduces the risk of character clipping during cropping in the localization stage, thereby benefiting the downstream recognition task [7].

Figure 8 shows the Precision-Recall curve of our proposed model, outperforming existing models, including those from [42,43], with a higher AP. Our model maintains high precision at stricter IoU thresholds, demonstrating improved robustness and accuracy, and indicating better container code localization performance.

The results presented in Table 3 demonstrate the superior performance of the proposed CCLN model compared to other state-of-the-art localization approaches. The CCLN model achieved the highest accuracy of 98.98%, outperforming all other methods. This indicates that the CCLN model was able to accurately locate the target objects in the test images with a high degree of precision. In terms of precision, the CCLN model achieved 98.75%, the highest among all the methods evaluated. This means that the model was able to correctly identify the true positive detections with a very low rate of false positives. The high precision of the CCLN model suggests that it is capable of making reliable and confident predictions, which is a crucial property for practical applications. While the CCLN model did not achieve the highest recall of 98.02% (as reported for the model in [42]), it still maintained a very competitive recall of 97.79%. The fact that the CCLN model was able to outperform the other methods in terms of the F1-score, which is a balanced metric considering both precision and recall, indicates that it has achieved an optimal trade-off between these two important performance measures. Finally, the model demonstrated the highest average precision of 97.86%, which provides a comprehensive evaluation of its localization capabilities across different IoU thresholds. This metric reflects the model’s ability to maintain high precision while achieving good recall, making it a robust and reliable localization solution.

The superior performance of the CCLN model can be attributed to several key architectural and design choices. The transition from a sliding window approach to a regression-based method, similar to the YOLO model [21], has significantly improved the model’s ability to detect small features in complex images. The deep and robust feature extraction capabilities of ResNet [33] have enabled the CCLN model to effectively handle various lighting conditions and challenging scenarios. Furthermore, the multi-scale feature fusion of the UNet architecture [36] has allowed the CCLN model to address diverse font sizes and noisy backgrounds, as demonstrated in previous studies [60]. The integration of these advanced techniques has been shown to enhance the accuracy and robustness of localization models in similar works [42,59].

The slightly lower recall of the CCLN model compared to the best-performing method [42] can be attributed to factors such as occlusion and character deformation, which may have posed challenges for the model. To address this, future work could focus on incorporating more diverse training data, including deformed and occluded container code images, to improve the model’s generalization under such conditions. Overall, the CCLN model has demonstrated exceptional performance across multiple evaluation metrics, showcasing its effectiveness in accurately localizing target objects in complex images.

4.2. Container Code Recognition

The recognition dataset comprises 4000 grayscale images of horizontal container codes and scene text, constructed by merging images from [24,25] with 800 additional SynthText [26] images. The addition of images from different datasets was necessary because there are no standard datasets for shipping container codes. To assess the recognition model’s performance, we employed accuracy as the primary evaluation metric, which measures the model’s ability to correctly identify container codes. This choice is consistent with existing studies [7,11,39], where accuracy has been widely adopted as a key performance indicator for recognition tasks. A prediction was considered accurate only if it exactly matched the original text, ensuring a rigorous evaluation of the model’s recognition capabilities.

Figure 9 shows the recognition (a) dataset structure, with sub-folders for JPEG images, (b) dataset annotations within the file name, and (c) an annotated image with ground truth printed on the bottom right.

The accuracy of the CCRN can be computed as follows:

(13) $\text{Accuracy} = \dfrac{N_{correct}}{N_{correct} + N_{incorrect}}$

where Ncorrect and Nincorrect are the number of correctly and incorrectly predicted container codes, respectively.

The newly designed recognition model was trained on the same dataset under various conditions, including various font sizes, tilt, shades, slightly deformed or occluded characters, noisy backgrounds, and various lighting and environmental conditions. These factors were consistent across the training of the newly proposed and existing models to ensure a fair comparison.

Table 4 presents the accuracy of the proposed recognition model, which achieved an impressive 98.71%, surpassing the latest models by Yu et al. [42] and Lau et al. [43], as well as LSTM-based models such as that of Ran et al. [7]. This enhanced performance was achieved by optimizing the model proposed by Ran et al. [7] through the removal of the last convolutional layer (within hidden layers) and adjustments to the input and output sizes, along with the introduction of dropout layers to prevent overfitting. Lastly, we employed various data augmentation techniques, including color adjustments (brightness and contrast), Gaussian blur, slight rotation, and cropping, which collectively improved the model’s generalization capabilities, an approach shown to help prevent overfitting in [40].

4.3. Localization and Recognition Model Integration

The integrated system combines the CCLN and CCRN to detect and recognize container codes. By first localizing code regions using CCLN, cropping, resizing, and converting to grayscale, and then inputting to CCRN, we obtain the machine-readable format of the image text. We evaluate the integrated system using accuracy and processing speed, as higher accuracy ensures reliability in the presence of optical challenges, while faster processing times are crucial for capturing moving containers without missing codes [7].

The recognition accuracy of the integrated system was evaluated using Equation (13), while the processing power was calculated as follows:

(14) $\text{Processing Power} = \dfrac{N_{images}}{\sum_{t=1}^{N_{images}} \left(T_{finish} - T_{start}\right)}$

where Tstart is the time (in seconds) before the container code localization and recognition process, Tfinish is the time (in seconds) after the container code localization and recognition process, and Nimages is the number of test images. This formula calculates the average number of images that can be processed per second by the overall system.
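
A small sketch of how Equation (14) can be measured in practice is shown below; `pipeline` is a placeholder for the combined CCLN localization, cropping, and CCRN recognition call for a single image.

```python
import time

def processing_power(images, pipeline):
    """Average number of images processed per second, per Equation (14)."""
    elapsed = 0.0
    for image in images:
        t_start = time.perf_counter()
        pipeline(image)                      # localization + recognition for one image
        elapsed += time.perf_counter() - t_start
    return len(images) / elapsed
```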

Table 5 shows that our integrated system achieves state-of-the-art performance with an accuracy of 97.93%, a throughput of 64.11 FPS, and an average inference latency of 7.2 ms per image, outperforming recent transformer-based models [40,42] and LSTM-based models [7,11]. This improvement is attributed to the efficient design of CCLN, which combines ResNet and UNet architectures with skip connections to preserve spatial details. The adoption of the Complete IoU loss further ensured alignment between predicted and ground-truth bounding boxes, while architectural optimization and layer pruning enhanced inference speed without sacrificing detection accuracy. Lastly, the use of bidirectional LSTM layers in the CCRN helped capture sequential dependencies while preventing overfitting.

This research went further to investigate the model’s performance across different optical challenges by splitting the dataset into various classes. Figure 10 illustrates the robustness of the model under varying environmental conditions, maintaining an accuracy greater than 98.53% in most scenarios, except for occlusion and deformation of characters, where a minor drop to 94% was observed. The model’s ability to consistently locate and recognize codes, even in complex environments, demonstrates its capacity to generalize. Although recurrent layers slightly increase inference latency, the achieved 64.11 FPS enables near real-time processing, suitable for automated gate operations and port monitoring. Future improvements could explore attention-based architectures [42,43] and check-digit verification schemes, as used in [7], to further enhance reliability and adaptability for large-scale, real-time deployments.

5. Conclusions

This paper presented an advanced two-stage framework for automatic container code localization and recognition, integrating the CCLN and CCRN. Each model was trained independently and later combined to form the ACCLR system, evaluated using accuracy and processing power as key performance metrics. The proposed system achieved an accuracy of 97.93% at 64.11 FPS, outperforming recent state-of-the-art models [40,42]. The adoption of a stricter Complete IoU loss ensured that predicted bounding boxes preserved realistic aspect ratios, minimizing feature loss and improving detection reliability.

Comprehensive experiments demonstrated the system’s ability to generalize beyond container codes, performing competitively on the SynthText [26] dataset for scene text recognition. Furthermore, detailed analysis of environmental challenges revealed strong robustness under various lighting and orientation conditions, though reduced performance was observed in cases involving severely deformed or occluded characters. This highlights the need for further improvements through feature enhancement via attention-based mechanisms (as successfully employed by Zhao et al. [40] to selectively focus on character features) and leveraging multi-view information fusion to recover occluded characters and enhance overall code validation [28].

The proposed ACCLR models prove highly effective for real-world port automation, offering scalability and adaptability to logistics, supply chain management, and asset tracking applications. Its efficient design and real-time capability make it suitable for deployment in high-traffic environments, and future work will focus on enhancing robustness to degraded text and expanding validation across larger and more diverse datasets.

Author Contributions

Conceptualization was carried out by S.H., R.L.K., and J.-R.T.; methodology was developed by S.H., R.L.K., and J.-R.T.; software was created by S.H., R.L.K., and J.-R.T.; validation was conducted by S.H., R.L.K., and J.-R.T.; formal analysis was performed by S.H., R.L.K., and J.-R.T.; investigation was led by S.H., R.L.K., and J.-R.T.; draft preparation was handled by S.H.; resources were managed by R.L.K. and J.-R.T.; the writing, review, and editing process involved S.H., R.L.K., and J.-R.T.; supervision was provided by R.L.K. and J.-R.T.; project administration was overseen by R.L.K. and J.-R.T. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in and supporting this research are subsets of images from the following publicly available datasets: 1. ACCLR dataset: https://universe.roboflow.com/project-ha37k/new-container-number (accessed on 29 June 2024); 2. ACCLR dataset: https://drive.google.com/drive/folders/13LpHEeFExmDJnw_U9peqLR-8uAAUMEzi (accessed on 21 December 2023); 3. SynthText dataset: https://www.kaggle.com/datasets/wassefy/synthtext (accessed on 29 August 2024); 4. Code implementation: https://github.com/sanelehlabisa/ACCLR. These datasets have been used to compile the data necessary for the experiments and analyses reported in this research.

Acknowledgments

This research was made possible through the invaluable guidance and support of RL Khuboni and JR Tapamo, who provided expert supervision throughout this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The manuscript uses the following abbreviations:

ACCLR   Automatic Container Code Localization and Recognition
CCLN    Container Code Localization Network
CCRN    Container Code Recognition Network
CNN     Convolutional Neural Network
CRNN    Convolutional Recurrent Neural Network
CTC     Connectionist Temporal Classification
CV      Computer Vision
FPS     Frames Per Second
LSTM    Long Short-Term Memory
RFID    Radio Frequency Identification
RoI     Region of Interest
RNN     Recurrent Neural Network
WSN     Wireless Sensor Network
YOLO    You Only Look Once

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Illustration of the exponential increase in world container traffic from 2010 to 2024, with a noticeable growth trend over the years [3,4].

Figure 2 Examples of (a) horizontal container codes and (b) vertical container codes. Different colors are used to distinguish the individual components of the ISO 6346 code, namely the owner code, equipment category identifier, serial number, and check digit.

Figure 3 Code localization process (a) Localization Block Diagram, and (b) Localization Model Architecture.

Figure 4 Code recognition process (a) Recognition Block Diagram, and (b) Recognition Model Architecture.

Figure 5 Sample of Dataset Images with (a) Various Lighting Conditions, (b) Various Background Colors, (c) Various Tilt Angle, (d) Low Quality, (e) Noisy Background, and (f) Deformed Characters.

Figure 6 Localization (a) Dataset Structure, (b) Dataset Annotations, and (c) Annotated Image.

Figure 7 Comparison of (a) Precision, (b) Recall, (c) Accuracy and (d) F1-Score of the newly designed localization models and existing models [7,11,39,40,41,42,43] under different IoU thresholds (0.5, 0.6, 0.7, 0.8, 0.9).

Figure 8 Precision–Recall Curve comparing our localization model against existing ones [7,11,39,40,41,42,43].

Figure 9 Recognition (a) Dataset Structure, (b) Dataset Annotations, and (c) Annotated Image.

Figure 10 Model Accuracy Across Different Environmental Challenges.

Table 1 Detailed Architecture of the Container Code Localization Network.

Stage Type Reps Out Ch. Kernel/Stride Act. Fusion Source
Initial Block
0 Conv → BN → MaxPool 1 64 7×7/2 (Conv) ReLU N/A
Encoder (ResNet Bottleneck Blocks)
1 Bottleneck 3 256 3 × 3 / 1 ReLU N/A
2 Bottleneck 4 512 3×3/2 (Downsample) ReLU N/A
3 Bottleneck 6 1024 3×3/2 (Downsample) ReLU N/A
4 Bottleneck 3 2048 3×3/2 (Downsample) ReLU N/A
Decoder (Upsampling and Feature Fusion)
5 UpSample → Conv 1 512 3 × 3 L.ReLU Stage 3
6 UpSample → Conv 1 256 3 × 3 L.ReLU Stage 2
7 UpSample → Conv 1 128 3 × 3 L.ReLU Stage 1
Output Layer
8 Final Conv 1 4 1 × 1 N/A BBox

Table 2 Detailed Architecture of the Container Code Recognition Network.

ID Layer Out Ch. K S Act. Notes
CNN Backbone (Feature Extraction)
1 Conv/ReLU 64 3 × 3 1 × 1 ReLU Input: 1 Ch.
2 MaxPool 64 2 × 2 2 × 2 N/A
3 Conv/ReLU 128 3 × 3 1 × 1 ReLU
4 MaxPool 128 2 × 2 2 × 2 N/A
5 Conv/ReLU 256 3 × 3 1 × 1 ReLU
6 Conv/ReLU 512 3 × 3 1 × 1 ReLU
7 MaxPool 512 2 × 1 2 × 1 N/A Height Reduction
8 Conv/BN/ReLU 1024 3 × 3 1 × 1 ReLU With BatchNorm
9 MaxPool 1024 2 × 1 2 × 1 N/A Final Height Red.
RNN Head (Sequence Recognition)
10 Linear (Map) 64 N/A N/A N/A Map from 2048 feats
11 Bi-LSTM 1 512 N/A N/A N/A 2×256 units
12 Bi-LSTM 2 512 N/A N/A N/A 2×256 units
13 Linear (Output) 63 N/A N/A N/A Output Classes

Table 3 Localization Performance.

Method Accuracy Precision Recall F1 Score Average Precision
Chenghao et al. [11] 96.14 94.19 93.66 93.24 92.98
Ran et al. [7] 96.58 94.82 94.01 94.50 94.31
Hsu et al. [39] 96.54 95.26 93.72 93.87 94.97
Zhao et al. [40] 97.41 97.29 96.98 97.03 97.11
Hlabisa et al. [41] 95.00 96.33 93.24 95.11 96.14
Yu et al. [42] 98.18 97.74 98.02 97.88 97.56
Lau et al. [43] 96.85 98.29 97.09 96.17 97.75
Ours 98.98 98.75 97.79 98.03 97.86

Results referenced from the original publication under their respective settings/datasets. Bold indicates the superior performance.

Table 4 Recognition Model Performance.

Method Accuracy
Chenghao et al. [11] 93.98
Ran et al. [7] 94.37
Hsu et al. [39] 93.41
Zhao et al. [40] 98.17
Hlabisa et al. [41] 95.41
Yu et al. [42] 98.28
Lau et al. [43] 98.33
Ours 98.71

Results referenced from the original publication under their respective settings/datasets. Bold indicates superior performance.

Table 5 Performance of Systems.

System Accuracy (%) Processing Power (FPS)
Chenghao et al. [11] 93.98 10.00
Ran et al. [7] 93.33 1.13
Hsu et al. [39] 92.69 30.00
Zhao et al. [40] 97.57 4.86
Hlabisa et al. [41] 95.00 4.00
Yu et al. [42] 97.39 53.21
Lau et al. [43] 89.20 31.72
Ours 97.93 64.11

Results referenced from the original publication under their respective settings/datasets. Bold indicates superior performance.

References

1. Hao, C.; Yue, Y. Optimization on Combination of Transport Routes and Modes on Dynamic Programming for a Container Multimodal Transport System. Procedia Eng.; 2016; 137, pp. 382-390. [DOI: https://dx.doi.org/10.1016/j.proeng.2016.01.272]

2. World Shipping Council. Container Shipping: Facts and Information. Available online: https://www.worldshipping.org/ (accessed on 7 June 2024).

3. Matsuda, T.; Hirata, E.; Kawasaki, T. Monopoly in the container shipping market: An econometric approach. Marit. Bus. Rev.; 2022; 7, pp. 318-331. [DOI: https://dx.doi.org/10.1108/MABR-12-2020-0071]

4. Yu, H.; Deng, Y.; Zhang, L.; Xiao, X.; Tan, C. Yard Operations and Management in Automated Container Terminals: A Review. Sustainability; 2022; 14, 3419. [DOI: https://dx.doi.org/10.3390/su14063419]

5. International Organization for Standardization (ISO). ISO 6346:2022: Freight Containers—Coding, Identification and Marking; ISO: Geneva, Switzerland, 2022.

6. Chao, S.L.; Lin, Y.L. Gate automation system evaluation: A case of a container number recognition system in port terminals. Marit. Bus. Rev.; 2017; 2, pp. 21-35. [DOI: https://dx.doi.org/10.1108/MABR-09-2016-0022]

7. Zhang, R.; Bahrami, Z.; Wang, T.; Liu, Z. An Adaptive Deep Learning Framework for Shipping Container Code Localization and Recognition. IEEE Trans. Instrum. Meas.; 2021; 70, 2501013. [DOI: https://dx.doi.org/10.1109/TIM.2020.3016108]

8. Long, S.; He, X.; Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. Int. J. Comput. Vis.; 2021; 129, pp. 161-184. [DOI: https://dx.doi.org/10.1007/s11263-020-01369-0]

9. Abbate, S.; Avvenuti, M.; Corsini, P.; Vecchio, A. Localization of Shipping Containers in Ports and Terminals Using Wireless Sensor Networks. Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks; Washington, DC, USA, 13–16 April 2009; pp. 587-592.

10. Narsoo, J.; Muslun, W.; Sunhaloo, M.S. Radio Frequency Identification (RFID) Container Tracking System for Port Louis Harbor: The Case of Mauritius. Issues Informing Sci. Inf. Technol.; 2009; 6, pp. 127-142. [DOI: https://dx.doi.org/10.28945/1047]

11. Li, C.; Liu, S.; Xia, Q.; Wang, H.; Chen, H. Automatic Container Code Localization and Recognition via an Efficient Code Detector and Sequence Recognition. Proceedings of the 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM); Hong Kong, China, 8–12 July 2019; [DOI: https://dx.doi.org/10.1109/AIM.2019.8868819]

12. Mi, C.; Cao, L.; Zhang, Z.; Feng, Y.; Yao, L.; W, Y. A Port Container Code Recognition Algorithm under Natural Conditions. J. Coast. Res.; 2020; 103, pp. 822-829. [DOI: https://dx.doi.org/10.2112/SI103-170.1]

13. Klöver, S.; Kretschmann, L.; Jahn, C. A first step towards automated image-based container inspections. Data Science and Innovation in Supply Chain Management: How Data Transforms the Value Chain. Proceedings of the Hamburg International Conference of Logistics (HICL); Kersten, W.; Blecker, T.; Ringle, C.M. epubli GmbH: Berlin, Germany, 2020; Volume 29, pp. 427-456. [DOI: https://dx.doi.org/10.15480/882.3122]

14. Bahrami, Z.; Zhang, R.; Rayhana, R.; Liu, Z. Deep Learning-based Framework for Shipping Container Security Seal Detection. Proceedings of the 2021 Joint 10th International Conference on Informatics, Electronics & Vision (ICIEV) and 2021 5th International Conference on Imaging, Vision & Pattern Recognition (icIVPR); Kitakyushu, Japan, 16–20 August 2021; pp. 1-7.

15. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv; 2015; [DOI: https://dx.doi.org/10.48550/arXiv.1511.08458] arXiv: 1511.08458

16. Du, K.L.; Swamy, M.N.S. Recurrent Neural Networks. Neural Networks and Statistical Learning; Springer: London, UK, 2014; [DOI: https://dx.doi.org/10.1007/978-1-4471-5571-3_11]

17. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2016; 39, pp. 2298-2304. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2646371]

18. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA, 23–28 June 2014; pp. 580-587.

19. Girshick, R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile, 7–13 December 2015; pp. 1440-1448.

20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015); Montreal, QC, Canada, 7–12 December 2015.

21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 779-788.

22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot multi-box Detector. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference; Amsterdam, The Netherlands, 11–14 October 2016; Volume 14, pp. 21-37.

23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 2980-2988.

24. Project. Shipping Container Code Dataset. 2019; Available online: https://drive.google.com/drive/folders/13LpHEeFExmDJnw_U9peqLR-8uAAUMEzi (accessed on 1 February 2024).

25. Project. New-Container-Number Dataset. 2023; Available online: https://universe.roboflow.com/project-ha37k/new-container-number (accessed on 4 February 2024).

26. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic Data for Text Localisation in Natural Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016.

27. Liang, M.; Guo, J.; Liu, Q.; Lu, P. A Novel Framework for Container Code-Character Recognition Based on Deep Learning and Template Matching. Proceedings of the 2016 International Conference on Industrial Informatics—Computing Technology, Intelligent Technology, Industrial Information Integration (ICIICII); Wuhan, China, 3–4 December 2016; pp. 78-82.

28. Yoon, Y.; Ban, K.D.; Yoon, H.; Kim, J. Automatic Container Code Recognition from Multiple Views. ETRI J.; 2016; 32, pp. 767-775. [DOI: https://dx.doi.org/10.4218/etrij.16.0014.0069]

29. Zhang, Y. Support Vector Machine Classification Algorithm and Its Application. Proceedings of the Information Computing and Applications; Liu, C.; Wang, L.; Yang, A. Springer: Berlin/Heidelberg, Germany, 2012; pp. 179-186.

30. Wu, Z.; Wang, W.; Xing, Y. Automatic Container Code Recognition via Faster-RCNN. Proceedings of the 2019 5th International Conference on Control, Automation and Robotics (ICCAR); Beijing, China, 19–22 April 2019; pp. 870-874.

31. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2017; 39, pp. 1137-1149. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27295650]

32. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. East: An efficient and accurate scene text detector. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 5551-5560.

33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778.

34. Sun, C.; Liu, K.; Chi, H.; Zareapoor, M. A Hybrid Model for Container-code Detection. Proceedings of the 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI); Chengdu, China, 17–19 October 2020; pp. 299-304.

35. Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 9336-9345.

36. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference; Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; Volume 18, pp. 234-241.

37. Zhang, R.; Bahrami, Z.; Liu, Z. A Vertical Text Spotting Model for Trailer and Container Codes. IEEE Trans. Instrum. Meas.; 2021; 70, 5017313. [DOI: https://dx.doi.org/10.1109/TIM.2021.3115211]

38. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; [DOI: https://dx.doi.org/10.1109/CVPR.2017.106]

39. Hsu, C.C.; Yang, Y.Z.; Chang, A.; Salahuddin Morsalin, S.M.; Shen, G.T.; Shiu, L.S. Automatic Recognition of Container Serial Code. Proceedings of the 2023 International Conference on Consumer Electronics—Taiwan (ICCE-Taiwan); PingTung, Taiwan, 17–19 July 2023; pp. 257-258. [DOI: https://dx.doi.org/10.1109/ICCE-Taiwan58799.2023.10226954]

40. Zhao, J.; Jia, N.; Liu, X.; Wang, G.; Zhao, W. A Practical Unified Network for Localization and Recognition of Arbitrary-Oriented Container Code and Type. IEEE Trans. Instrum. Meas.; 2024; 73, 4505110. [DOI: https://dx.doi.org/10.1109/TIM.2024.3370750]

41. Hlabisa, S.; Khuboni, R.L.; Tapamo, J.R. Automated Shipping Container Code Localization and Recognition Using Deep Learning. Proceedings of the 2024 Conference on Information Communications Technology and Society (ICTAS); Durban, South Africa, 7–8 March 2024; pp. 198-203. [DOI: https://dx.doi.org/10.1109/ICTAS59620.2024.10507121]

42. Yu, M.; Zhu, S.; Lu, B.; Chen, Q.; Wang, T. A Two-Stage Automatic Container Code Recognition Method Considering Environmental Interference. Appl. Sci.; 2024; 14, 4779. [DOI: https://dx.doi.org/10.3390/app14114779]

43. Lau, L.J.; Lim, L.T.; Tew, Y. Modelling Studies of Automatic Container Code Recognition System for Real Time Implementation. Proceedings of the 2024 IEEE Symposium on Industrial Electronics & Applications (ISIEA); Kuala Lumpur, Malaysia, 6–7 July 2024; pp. 1-6. [DOI: https://dx.doi.org/10.1109/ISIEA61920.2024.10607358]

44. Santos, J.; Canedo, D.; Neves, A.J.R. Automating Code Recognition for Cargo Containers. Electronics; 2025; 14, 4437. [DOI: https://dx.doi.org/10.3390/electronics14224437]

45. Cheng, R.; You, Z.; Chea, S.; Wu, G.; Xia, K.; Huang, S. Fast and Lightweight Automatic Shipping Container Attributes Spotting. IEEE Trans. Instrum. Meas.; 2025; 74, 2531913. [DOI: https://dx.doi.org/10.1109/TIM.2025.3573765]

46. Su, Z.; Li, J.; Pang, Q.; Su, M. China futures market and world container shipping economy: An exploratory analysis based on deep learning. Res. Int. Bus. Financ.; 2025; 76, 102870. [DOI: https://dx.doi.org/10.1016/j.ribaf.2025.102870]

47. Wang, H.; Zhou, H. Chinese Text Spotter Exploiting Spatial Semantic Information in Scene Text Images. Proceedings of the 2023 5th International Conference on Robotics and Computer Vision (ICRCV); Nanjing, China, 15–17 September 2023; pp. 204-208. [DOI: https://dx.doi.org/10.1109/ICRCV59470.2023.10329042]

48. Qin, R.; Wang, B. Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks. Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN); Yokohama, Japan, 30 June–5 July 2024; pp. 1-10. [DOI: https://dx.doi.org/10.1109/IJCNN60899.2024.10650570]

49. Chen, H.; Wang, Y.; Guo, J.; Tao, D. VanillaNet: The Power of Minimalism in Deep Learning. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2305.12972] arXiv: 2305.12972

50. Liu, K.; Sun, C.; Chi, H. Boundary-based Real-time Text Detection on Container Code. Proceedings of the 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC); Rome, Italy, 12–14 November 2021; pp. 78-81.

51. Tang, C.-M.; Chen, P. Container Number Recognition Method Based on SSD_MobileNet and SVM. Am. Sci. Res. J. Eng. Technol. Sci. (ASRJETS); 2020; 74, pp. 1-12. Available online: https://asrjetsjournal.org/American_Scientific_Journal/article/view/6482 (accessed on 13 July 2024).

52. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput.; 1997; 9, pp. 1735-1780. [DOI: https://dx.doi.org/10.1162/neco.1997.9.8.1735]

53. Graves, A.; Fernández, S.; Gomez, F.J.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning; Pittsburgh, PA, USA, 25–29 June 2006; pp. 369-376.

54. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis.; 2010; 88, pp. 303-338. [DOI: https://dx.doi.org/10.1007/s11263-009-0275-4]

55. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE; 1998; 86, pp. 2278-2324. [DOI: https://dx.doi.org/10.1109/5.726791]

56. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv; 2015; [DOI: https://dx.doi.org/10.48550/arXiv.1409.1556] arXiv: 1409.1556

57. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv; 2014; [DOI: https://dx.doi.org/10.48550/arXiv.1409.4842] arXiv: 1409.4842

58. Agrawal, S.; Rewaskar, V.; Agrawal, R.; Chaudhari, S.; Patil, Y.; Agrawal, N. Advancements in NSFW Content Detection: A Comprehensive Review of ResNet-50 Based Approaches. Int. J. Intell. Syst. Appl. Eng.; 2023; 11, pp. 41-45.

59. Qian, X.; Lin, S.; Cheng, G.; Yao, X.; Ren, H.; Wang, W. Object Detection in Remote Sensing Images Based on Improved Bounding Box Regression and Multi-Level Features Fusion. Remote Sens.; 2020; 12, 143. [DOI: https://dx.doi.org/10.3390/rs12010143]

60. Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting Text in Natural Image with Connectionist Text Proposal Network. Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 56-72.

61. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv; 2019; [DOI: https://dx.doi.org/10.48550/arXiv.1911.08287] arXiv: 1911.08287

62. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv; 2019; arXiv: 1912.01703

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).