Abstract
Clinical diagnosis of cervical lymphadenopathy (CLA) using ultrasound images is a time-consuming and laborious process that heavily relies on expert experience. This study aimed to develop an intelligent computer-aided diagnosis (CAD) system using deep learning models (DLMs) to enhance the efficiency of ultrasound screening and the diagnostic accuracy for CLA. We retrospectively collected 4089 ultrasound images of cervical lymph nodes across four categories from two hospitals: normal, benign CLA, primary malignant CLA, and metastatic malignant CLA. We employed transfer learning, data augmentation, and five-fold cross-validation to evaluate the diagnostic performance of DLMs with different architectures. To boost the application potential of the DLMs, we investigated the impact of various optimizers and machine learning classifiers on their diagnostic performance. Our findings revealed that EfficientNet-B1 with transfer learning and the root-mean-square-propagation optimizer achieved state-of-the-art performance, with overall accuracies of 97.0% and 90.8% on the internal and external test sets, respectively. Additionally, human–machine comparison experiments and the implementation of explainable artificial intelligence technology further enhanced the reliability and safety of the DLMs and helped clinicians easily understand the DLM results. Finally, we developed an application that can be implemented in systems running Microsoft Windows. However, additional prospective studies are required to validate the clinical utility of the developed application. All pretrained DLMs, code, and the application are available at
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Cervical lymph nodes are an important part of the human immune system and are responsible for filtering harmful substances and pathogens from body fluids and producing antibodies [1–3]. Cervical lymphadenopathy (CLA) is a critical basis for the diagnosis of infections, inflammations, and certain cancers, such as lymphoma and head and neck cancers. Therefore, the accurate assessment of CLA is essential for prognosis prediction and aids in selecting appropriate treatment, which has substantial clinical significance [4–6]. Ultrasound analysis has emerged as the predominant method for evaluating cervical lymph nodes because it is nonradiative, cost-effective, and easy to perform and allows doctors to assess blood flow and internal structural changes in real time [7, 8]. However, using ultrasound to screen for CLA is challenging, as it can range from normal to various benign conditions, such as viral, tuberculosis, or autoimmune lymphadenitis, and may extend to primary and metastatic malignancies [9]. More importantly, analyzing ultrasound examination results is time-consuming and labor-intensive, easily affected by noise and artifacts during imaging, and heavily depends on clinicians’ domain expertise and experience. These factors can introduce subjectivity and bias into the diagnostic process, potentially affecting diagnostic accuracy [10–12].
Recent studies have demonstrated that rapidly developing deep learning models (DLMs) offer considerable potential for providing more objective and accurate ultrasound examination results. For example, the Visual Geometry Group Net (VGGNet), a very deep convolutional network, has been successfully employed for myositis [13] and breast cancer classification from ultrasound images [14], whereas ResNet with skip connections, aimed at addressing the vanishing gradient problem, has exhibited good performance for detecting breast tumors [15, 16] and thyroid nodules [17]. Additionally, DenseNet and EfficientNet have been demonstrated to be effective for diagnosing fatty liver disease [18], rotator cuff tears [19], COVID-19 [20], and breast tumors [21]. Moreover, lightweight DLMs, such as ShuffleNet and MobileNet, have made significant progress for kidney disease, fetal cardiac cycle, and early stroke detection [22–24]. However, despite the significant success of DLMs in various ultrasound image analysis tasks, few studies have comprehensively investigated their potential for diagnosing cervical lymph nodes. To the best of our knowledge, only Xia et al. [25] have attempted to use VGGNet, ResNet, and DenseNet to diagnose CLA from ultrasound images. However, their primary goal was to distinguish between benign and malignant CLA, and the datasets they employed lacked sufficient samples and diversity. Subsequently, Liu et al. [26] proposed a depth-wise separable convolutional Swin transformer that can automatically classify the cervical lymph node level (Levels 1–6); however, their study did not involve CLA detection.
Based on this knowledge, we believe that further research on the potential of DLMs for diagnosing CLA from ultrasound images is warranted to promote the development and application of computer-aided diagnosis (CAD) systems for enhancing the efficiency and consistency of clinical diagnosis. Specifically, the results of previous studies indicate that DLMs with different architectures offer distinct advantages for diagnosing diseases from ultrasound images of different body parts. Additionally, most studies have significantly improved the diagnostic performance of DLMs by adopting transfer learning, which involves fine-tuning models pretrained on large datasets, such as ImageNet, for specific downstream tasks.
Our study aimed to explore and compare the performances of different DLMs with transfer learning for detecting CLA from ultrasound images and to develop a CAD system based on a state-of-the-art (SOTA) DL model. The contributions of this study can be summarized as follows:
1. We collected 4089 high-quality ultrasound CLA images from two clinical institutions and classified them into four pathological groups: normal, benign CLA, primary malignant CLA, and metastatic malignant CLA.
2. We employed transfer learning and five-fold cross-validation to comprehensively evaluate the effectiveness of different DLMs for CLA detection.
3. To further explore the application potential of the models, we studied the impact of different optimizers and machine learning classifiers on the diagnostic performance of each model. The results indicated that model performance benefited most from the Adam optimizer and each model's built-in linear classifier.
4. Extensive experimental results demonstrated that EfficientNet-B1 with transfer learning and a root-mean-square propagation (RMSProp) optimizer achieved SOTA performance.
5. We developed an intelligent and convenient application to assist clinicians in diagnosing CLA from ultrasound images. We have made the developed models and application publicly available to facilitate broader research.
2. Materials and Methods
2.1. Clinical Dataset Construction and Annotation
This study employed two datasets: (1) an internal dataset comprising ultrasound images of 480 patients who visited the Department of Medical Ultrasound of the Second Affiliated Hospital of Guangzhou Medical University (acquisition equipment: SonoScape WIS and Mindray DC-8) and (2) an external dataset containing ultrasound images of 259 patients who visited the Ultrasound Department of the First Affiliated Hospital of Guangzhou Medical University (acquisition equipment: Mindray Resona R9 and Philips CX50). During ultrasound screening, the ultrasound technician collected multiple images for each patient: (1) an ultrasound image of each lymph node and (2) multiple images of specific lymph nodes along both the long and short axes. Additionally, the same imaging settings were employed across all devices: probe frequency of 3–12 MHz, center frequency of 7 MHz, and mechanical index of 1.1. Furthermore, each ultrasound image contained only one lymph node.
After data collection and organization, the ultrasound images were divided into four categories based on the status of the cervical lymph nodes: normal, benign CLA (benign), primary malignant CLA (primary), and metastatic malignant CLA (metastatic). During annotation, clinicians conformed to the following standards: (1) For the benign, primary, and metastatic categories, pathological biopsy results (needle and excision biopsies) were used as the gold standard for each lymph node image. (2) For the normal category, ultrasound images of healthy individuals who underwent physical examination were selected. Specifically, their ultrasound images featured neither any obvious abnormalities during the initial physical examination nor any significant changes after 6 months. The study protocol was approved by the Review Committee of the Second Affiliated Hospital of Guangzhou Medical University (Approval No. 2022-hs-78-02). Additionally, our study adhered to the tenets of the Declaration of Helsinki, ensuring that the rights, privacy, and anonymity of the patients were respected. Owing to the use of deidentified data and the retrospective nature of our study, the requirement for informed consent was waived by the Ethics Committee of the Second Affiliated Hospital of Guangzhou Medical University. Figure 1 shows some data samples, and Table 1 presents the detailed data distribution.
[figure(s) omitted; refer to PDF]
Table 1
The number of images of each category in the dataset.
| Dataset | Normal | Benign | Primary | Metastatic | Total |
| Internal dataset | 1217 | 601 | 236 | 1338 | 3392 |
| External test set | 210 | 136 | 77 | 274 | 697 |
| Total | 1427 | 737 | 313 | 1612 | 4089 |
2.2. Pretrained DLMs
Transfer learning involves using knowledge gained from training a model for one task or on a dataset to improve its performance for other similar or related tasks and/or different datasets. It has made significant contributions in the field of medical image analysis as it overcomes the problem of dataset scarcity and effectively reduces the model development time and required hardware resources [27, 28]. Transfer learning from natural large-scale image datasets, particularly ImageNet, using standard large models and the corresponding pretrained weights, has become a de facto method for DL applications in medical imaging [29].
In this study, we comprehensively evaluated the potential of multiple SOTA DLMs pretrained on ImageNet to diagnose CLA from ultrasound images. Based on the rapid advancements in DLMs and the advantages offered by different architectures, we selected the following 10 DLMs: (1) VGGNet, which is a widely used classical deep convolutional neural network renowned for its unified 3 × 3 convolution kernel and deep architecture [30]; (2) ResNet, which uses skip connections to mitigate the vanishing or exploding gradient problem of deep networks to significantly enhance model accuracy [31]; (3) MobileNetV3, which is a lightweight network that integrates the HardSwish activation function and squeeze-and-excitation modules with inverted residual blocks and offers efficient computing on mobile devices [32]; (4) ShuffleNetV2, which effectively balances model complexity and parameter size and offers high accuracy by employing channel shuffle operations and an efficient grouped convolution design [33]; (5) DenseNet, which effectively alleviates the vanishing gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces parameter size by connecting each layer to every other layer in a feedforward manner [34]; (6) EfficientNet, which uniformly scales all depth, width, and resolution dimensions by employing a simple yet highly effective compound coefficient, offering a significantly better accuracy and efficiency than conventional CNNs [35]; (7) EfficientNetV2, which employs a combination of training-aware neural architecture search and scaling to accelerate training and enhance parameter efficiency [36]; (8) Vision Transformer (ViT), which is a competitive alternative to CNN. 
It splits an image into fixed-size patches and employs a transformer-like architecture to process them and has demonstrated excellent classification performance for ImageNet [37]; (9) Swin Transformer, which adopts a hierarchical structure and a shifted-windowing self-attention mechanism to effectively extract local and global features from images, thereby offering better classification performance and efficiency for visual tasks [38]; and (10) MobileViT, which is a lightweight hybrid transformer that combines the strengths of CNNs and ViTs and provides a different perspective on global information processing using transformers [39]. These models with different architectures reflect the significant advancements in DL and provide developers with various options to effectively deploy models in various clinical settings.
2.3. Evaluation Metrics
Accuracy, precision, recall, specificity, and F1-score are typically employed for evaluating the classification performance of DLMs and are defined in equations (1)–(5), respectively. Specifically, accuracy measures the proportion of correct classifications over the entire dataset, precision represents the proportion of true positives (TPs) among all predicted positives, recall indicates the proportion of actual positives that are correctly identified as positive, and specificity measures the proportion of actual negatives (true negatives, TNs) that are correctly identified as negative. The F1-score, which summarizes the overall performance of a model, is the harmonic mean of precision and recall:
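The referenced equations (1)–(5), reconstructed here from the standard definitions of these metrics (TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively), are:

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP}, \tag{2}\\
\text{Recall} &= \frac{TP}{TP + FN}, \tag{3}\\
\text{Specificity} &= \frac{TN}{TN + FP}, \tag{4}\\
\text{F1-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{5}
\end{align}
```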
Furthermore, a confusion matrix and receiver operating characteristic (ROC) curves were used to evaluate the performance of the DLMs for each category.
2.4. Model Weight Optimizer
In this study, we employed three different optimizers (stochastic gradient descent [SGD], adaptive moment estimation [Adam], and RMSProp) to optimize the model parameters, which not only helped further enhance their performance but also offered practical guidance and suggestions for subsequent researchers and developers. The basic concepts and advantages of each optimizer are summarized below.
2.4.1. SGD
SGD is a variant of the gradient descent algorithm and is the most basic and widely used optimizer for optimizing DLMs. It effectively addresses the computational inefficiency of traditional gradient descent methods by selecting a single random training sample (or a small batch) to compute the gradient and update the model parameters.
2.4.2. RMSProp
RMSProp is an adaptive learning rate optimization algorithm designed to improve the performance and speed of DLM training. It is an extension of the SGD algorithm and the foundation of the Adam algorithm. By adapting the learning rates based on the moving average of squared gradients, RMSProp offers an optimal tradeoff between efficient convergence and stability during training, making it a widely used optimization approach for modern DLMs.
2.4.3. Adam
Adam is a first-order gradient-based optimizer designed for stochastic objective functions. It adaptively estimates lower order moments [40], making it simple to implement, computationally efficient, and memory-efficient for model training. Additionally, it is invariant to the diagonal rescaling of gradients and effectively handles large datasets and/or numerous parameters.
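For reference, the update rules of the three optimizers can be sketched in plain Python on a one-dimensional quadratic (an illustrative toy, not the study's implementation; training used PyTorch's torch.optim classes, and the demonstration learning rate for Adam is chosen for this toy problem rather than taken from the paper):

```python
import math

def sgd_step(w, grad, lr=0.001):
    # Vanilla SGD: step against the gradient at a fixed learning rate.
    return w - lr * grad

def rmsprop_step(w, grad, state, lr=0.01, alpha=0.99, eps=1e-8):
    # RMSProp: divide the step by a moving average of squared gradients.
    state["sq"] = alpha * state["sq"] + (1 - alpha) * grad ** 2
    return w - lr * grad / (math.sqrt(state["sq"]) + eps)

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first (m) and second (v) moment estimates.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy problem: minimize f(w) = (w - 3)^2, with gradient 2 * (w - 3).
w, state = 0.0, {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(2000):
    w = adam_step(w, 2 * (w - 3), state, lr=0.05)
# w now sits close to the minimizer 3.0
```

The adaptive per-parameter scaling in the RMSProp and Adam rules is what lets them cope with the imbalanced gradient statistics discussed later, whereas SGD applies the same scale in every direction.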
2.5. Implementation Details
2.5.1. Data Preprocessing and Dataset Partition
Prior to training, all ultrasound images were resized to 224 × 224 × 3, normalized, and standardized. We split the internal dataset into internal development and test sets in a 9:1 ratio. Five-fold cross-validation and the internal development set were used to evaluate the model performance and stability, whereas the internal test set was used to preliminarily evaluate their generalizability. The data distributions of these sets are detailed in Supporting Information Tables S1 and S2.
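The partitioning scheme described above can be sketched as follows (a minimal illustration with a hypothetical `split_dataset` helper; the actual tooling and random seed used by the study are not specified in the text):

```python
import random

def split_dataset(items, test_ratio=0.1, n_folds=5, seed=42):
    """Shuffle the images, hold out an internal test split (9:1),
    then partition the development split into five folds for
    cross-validation."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_test = int(len(items) * test_ratio)
    test_set, dev_set = items[:n_test], items[n_test:]
    folds = [dev_set[i::n_folds] for i in range(n_folds)]
    return dev_set, test_set, folds

# 3392 internal images -> 3053 for development, 339 for internal testing.
dev, test, folds = split_dataset(range(3392))
```

Each cross-validation round then uses one fold for validation and the remaining four for training.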
2.5.2. Training and Evaluation
The DLMs were trained and evaluated over three stages to ensure that the proposed diagnosis system offered the best performance. In the first stage, we initialized the 10 models using ImageNet pretrained weights and then employed the Adam optimizer with an initial learning rate of 0.001, β1 of 0.9, β2 of 0.999, and a weight decay of 0.0001 to optimize their parameters. After training, their classification performance was evaluated, and the four models with the highest accuracies were selected as candidate models. In the second stage, these candidate models were retrained and reevaluated using RMSProp (learning rate = 0.01, momentum = 0, alpha = 0.99) and SGD (learning rate = 0.001, momentum = 0, dampening = 0, weight decay = 0) with the default parameter settings in PyTorch to explore the impact of the optimizers on their performance. Similarly, we initialized the models using ImageNet pretrained weights before training. In the third stage, the weights of the backbones of the four models were frozen, and machine learning classifiers were used to replace their fully connected layers. Subsequently, they were trained to explore whether these machine learning classifiers would further enhance their classification performance. Note that the number of epochs and the batch size were set to 150 and 128, respectively, for all training stages. Furthermore, we employed an early stopping strategy to prevent overfitting. However, no data augmentation was used in these three training stages because we wanted to evaluate the ability of the models to extract lesion features from actual ultrasound images, test their performance on an unbalanced dataset, and validate their robustness and generalizability on a dataset reflecting the distribution of real clinical data. The experiments were conducted in PyTorch on a system comprising an NVIDIA GeForce RTX 4090 GPU and Ubuntu 20.04.
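The early stopping strategy mentioned above can be sketched as follows (an illustrative implementation; the monitored metric and patience value are assumptions, as the text does not specify them):

```python
class EarlyStopping:
    """Stop training when validation accuracy has not improved for
    `patience` consecutive epochs (illustrative sketch; the paper does
    not state its exact early-stopping criterion)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        # Returns True when training should stop.
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `stopper.step(val_acc)` would be called once per epoch, and the checkpoint corresponding to `stopper.best` would be retained.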
2.5.3. Clinical Testing and Human–Machine Comparisons
We selected the two best performing models from the trained candidate models as candidate diagnostic models and preliminarily evaluated their generalizability and diagnostic performance on the internal test set to reflect their performance in an actual clinical environment. To further enhance their performance, we employed data augmentation to alleviate the data imbalance in the internal development set (the distribution of this augmented set is presented in Supporting Information Table S3) and retrained and reevaluated them under the same experimental environment via cross-validation. Furthermore, the external test set was used to test their performance under external clinical settings. To further validate their reliability, we recruited six clinicians from the Second Affiliated Hospital to conduct human–machine comparison experiments and plotted their diagnostic results as confusion matrices. Finally, the best model was selected as the clinical diagnostic model by comparing the results of all models.
2.5.4. Model Interpretability and Application System Development
To improve the security and transparency of the proposed diagnostic system, we used an interpretable artificial intelligence (AI) technology based on gradient-weighted class activation mapping (Grad-CAM) to clarify the internal decision-making mechanisms of the model. Finally, we developed an application for Microsoft Windows that leverages our system to assist clinicians in CLA diagnosis.
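Grad-CAM weights each convolutional feature map by the global-average-pooled gradient of the target class score and applies a ReLU to the weighted sum. This combination step can be sketched as follows (a minimal pure-Python illustration; in practice the activations and gradients are extracted from the trained DLM, e.g., via PyTorch hooks):

```python
def grad_cam_map(activations, gradients):
    """Grad-CAM's core combination step: weight each feature map A_k by
    alpha_k (the global-average-pooled gradient of the target class
    score w.r.t. A_k), sum the weighted maps, and apply ReLU."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for A_k, G_k in zip(activations, gradients):
        alpha_k = sum(sum(row) for row in G_k) / (h * w)  # pooled gradient
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha_k * A_k[i][j]
    # ReLU keeps only regions with positive evidence for the class.
    return [[max(0.0, v) for v in row] for row in cam]
```

The resulting map is upsampled to the input resolution and overlaid on the ultrasound image as a heatmap, highlighting the lymph node regions that drove the model's decision.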
3. Results and Discussion
3.1. Comprehensive Performance Analysis of the Ten DLMs
The 10 DLMs compared in this study were as follows: DenseNet-169 (D169), EfficientNet-B1 (EffB1), EfficientNetV2-Large (EffV2), MobileViT-Small (MbVit), MobileNetV3-Small (MbV3), ResNet-50 (Res50), ShuffleNetV2-x2.0 (Sv2 × 2), Swin Transformer-Base (SwinB), VGGNet-19 (VGG19), and ViT-Base (VitB).
Table 2 presents their average accuracies for five-fold cross-validation, wherein SwinB exhibits the best performance with an average accuracy of 0.9440, closely followed by D169, EffB1, and Sv2 × 2 with average accuracies of 0.9326, 0.9319, and 0.9319, respectively. By contrast, VitB and EffV2 exhibit significantly lower accuracies of 0.7951 and 0.8311, respectively. From the perspective of the internal mechanism employed, the Swin Transformer divides the images into small, nonoverlapping patches for hierarchical modeling. This approach allows it to retain the advantages of the ViT for modeling long-range dependencies, while also ensuring that it effectively captures the local feature information of lymph node lesions, similar to traditional CNNs. This combination of global and local feature extraction capabilities significantly enhances its classification performance for ultrasound images.
Table 2
The average accuracy (±standard deviation), parameter size (params, in millions), and GFLOPs of the various models.
| Model | Average accuracy | Params (M) | GFLOPs |
| D169 | 0.9326 (±0.0074) | 12.49 | 3.38 |
| EffB1 | 0.9319 (±0.0099) | 6.51 | 0.58 |
| EffV2 | 0.8311 (±0.0056) | 117.23 | 12.28 |
| MbVit | 0.9254 (±0.0063) | 4.94 | 1.53 |
| MbV3 | 0.9231 (±0.0066) | 1.52 | 0.06 |
| Res50 | 0.9316 (±0.0043) | 23.52 | 4.10 |
| Sv2 × 2 | 0.9319 (±0.0037) | 5.35 | 0.58 |
| SwinB | 0.9440 (±0.0067) | 86.75 | 15.42 |
| VGG19 | 0.9044 (±0.0078) | 139.58 | 19.62 |
| VitB | 0.7951 (±0.0099) | 85.80 | 17.57 |
Additionally, the parameter sizes and GFLOPs of the models differed significantly. For example, although SwinB exhibited the best accuracy, its parameter size (86.75 M) and GFLOPs (15.42) were relatively high, which may limit its deployment on resource-constrained systems. By contrast, although MbV3 exhibited a slightly lower accuracy than the other models, it featured the smallest parameter size (1.52 M) and lowest GFLOPs (0.06), making it suitable for mobile and embedded devices. Notably, EffB1 and Sv2 × 2 had low parameter sizes (6.51 M and 5.35 M, respectively) and GFLOPs (0.58 for both) while maintaining high accuracy (0.9319 for both), demonstrating their suitability for resource-constrained environments. The high efficiency of these lightweight models enhances their flexibility and scalability for practical applications, particularly in scenarios requiring real-time processing with constrained computing resources.
Table 3 lists the classification performance of each model for each category. For the normal category, MbVit and Res50 achieved the best results, and both D169 and EffB1 achieved a recall of 1.000. For the benign CLA category, SwinB exhibited the best overall diagnostic performance, whereas D169 and EffB1 achieved the highest precision and specificity, respectively. Additionally, SwinB was the best performing model for both the primary malignant CLA and metastatic malignant CLA categories; however, EffB1 demonstrated the best recall (0.964) for the metastatic category. A comprehensive analysis of the results listed in Table 3 indicated that SwinB performed the best for classifying images across all categories. Furthermore, the performance of ViT was clearly unsatisfactory, as its results for some metrics did not even reach 0.5.
Table 3
Models’ performances under 5-fold cross-validation.
| | Normal | | | | Benign | | | |
| Model | Precision | Recall | Specificity | F1-score | Precision | Recall | Specificity | F1-score |
| D169 | 0.997 ± 0.004 | 1.000 ± 0.000 | 0.998 ± 0.002 | 0.998 ± 0.002 | 0.894 ± 0.044 | 0.795 ± 0.035 | 0.979 ± 0.011 | 0.840 ± 0.009 |
| EffB1 | 0.996 ± 0.005 | 1.000 ± 0.000 | 0.998 ± 0.003 | 0.998 ± 0.003 | 0.892 ± 0.036 | 0.777 ± 0.037 | 0.980 ± 0.007 | 0.830 ± 0.033 |
| EffV2 | 0.986 ± 0.010 | 0.998 ± 0.002 | 0.992 ± 0.006 | 0.992 ± 0.006 | 0.687 ± 0.039 | 0.549 ± 0.064 | 0.945 ± 0.012 | 0.608 ± 0.042 |
| MbVit | 0.998 ± 0.004 | 1.000 ± 0.000 | 0.999 ± 0.002 | 0.999 ± 0.002 | 0.849 ± 0.018 | 0.795 ± 0.056 | 0.969 ± 0.006 | 0.820 ± 0.024 |
| MbV3 | 0.997 ± 0.004 | 0.999 ± 0.002 | 0.998 ± 0.002 | 0.998 ± 0.003 | 0.862 ± 0.020 | 0.795 ± 0.016 | 0.972 ± 0.005 | 0.827 ± 0.014 |
| Res50 | 0.998 ± 0.002 | 1.000 ± 0.000 | 0.999 ± 0.001 | 0.999 ± 0.001 | 0.870 ± 0.031 | 0.810 ± 0.042 | 0.973 ± 0.008 | 0.837 ± 0.010 |
| Sv2 × 2 | 0.997 ± 0.006 | 0.998 ± 0.004 | 0.998 ± 0.003 | 0.998 ± 0.003 | 0.876 ± 0.024 | 0.789 ± 0.028 | 0.976 ± 0.006 | 0.830 ± 0.007 |
| SwinB | 0.998 ± 0.004 | 0.999 ± 0.002 | 0.999 ± 0.002 | 0.998 ± 0.002 | 0.876 ± 0.032 | 0.858 ± 0.017 | 0.973 ± 0.008 | 0.866 ± 0.010 |
| VGG19 | 0.996 ± 0.004 | 0.999 ± 0.002 | 0.998 ± 0.002 | 0.997 ± 0.002 | 0.839 ± 0.052 | 0.756 ± 0.027 | 0.968 ± 0.012 | 0.794 ± 0.023 |
| VitB | 0.972 ± 0.010 | 0.993 ± 0.006 | 0.984 ± 0.005 | 0.982 ± 0.005 | 0.613 ± 0.050 | 0.434 ± 0.061 | 0.939 ± 0.018 | 0.504 ± 0.045 |
| | Primary | | | | Metastatic | | | |
| Model | Precision | Recall | Specificity | F1-score | Precision | Recall | Specificity | F1-score |
| D169 | 0.892 ± 0.052 | 0.794 ± 0.050 | 0.993 ± 0.004 | 0.840 ± 0.041 | 0.899 ± 0.006 | 0.958 ± 0.026 | 0.930 ± 0.005 | 0.927 ± 0.012 |
| EffB1 | 0.936 ± 0.035 | 0.794 ± 0.025 | 0.996 ± 0.002 | 0.858 ± 0.016 | 0.892 ± 0.013 | 0.964 ± 0.013 | 0.924 ± 0.010 | 0.926 ± 0.011 |
| EffV2 | 0.678 ± 0.199 | 0.270 ± 0.094 | 0.988 ± 0.010 | 0.371 ± 0.101 | 0.746 ± 0.014 | 0.880 ± 0.022 | 0.804 ± 0.016 | 0.807 ± 0.009 |
| MbVit | 0.882 ± 0.076 | 0.808 ± 0.041 | 0.991 ± 0.007 | 0.840 ± 0.025 | 0.902 ± 0.031 | 0.937 ± 0.012 | 0.933 ± 0.024 | 0.919 ± 0.012 |
| MbV3 | 0.867 ± 0.040 | 0.790 ± 0.088 | 0.990 ± 0.005 | 0.822 ± 0.028 | 0.892 ± 0.012 | 0.935 ± 0.013 | 0.927 ± 0.008 | 0.913 ± 0.010 |
| Res50 | 0.892 ± 0.034 | 0.836 ± 0.049 | 0.992 ± 0.003 | 0.861 ± 0.022 | 0.906 ± 0.020 | 0.941 ± 0.015 | 0.936 ± 0.015 | 0.923 ± 0.007 |
| Sv2 × 2 | 0.894 ± 0.028 | 0.860 ± 0.057 | 0.992 ± 0.002 | 0.875 ± 0.029 | 0.904 ± 0.011 | 0.949 ± 0.013 | 0.934 ± 0.009 | 0.925 ± 0.004 |
| SwinB | 0.947 ± 0.016 | 0.837 ± 0.066 | 0.996 ± 0.001 | 0.887 ± 0.036 | 0.926 ± 0.007 | 0.952 ± 0.013 | 0.950 ± 0.005 | 0.939 ± 0.009 |
| VGG19 | 0.861 ± 0.116 | 0.611 ± 0.095 | 0.991 ± 0.008 | 0.702 ± 0.048 | 0.862 ± 0.021 | 0.937 ± 0.003 | 0.902 ± 0.018 | 0.898 ± 0.011 |
| VitB | 0.719 ± 0.066 | 0.278 ± 0.084 | 0.992 ± 0.002 | 0.397 ± 0.092 | 0.715 ± 0.016 | 0.868 ± 0.033 | 0.774 ± 0.023 | 0.784 ± 0.013 |
In clinical settings, model accuracy is directly related to the early detection and treatment of cancer. Therefore, we selected D169, EffB1, Sv2 × 2, and SwinB as candidate models owing to their high performance, thereby ensuring that the proposed system produced reliable diagnostic results. Subsequently, these models were retrained using the RMSProp and SGD optimizers. Supporting Information Table S4 presents the classification performance of each retrained model for each image category, with the maximum values highlighted in red. For the normal category, Sv2 × 2_R exhibited SOTA performance among the models retrained with RMSProp, whereas SwinB_S achieved the best performance among those retrained with SGD. For the benign CLA category, EffB1_R exhibited the best performance among the RMSProp-trained models, whereas SwinB_S remained the best among the SGD-trained models, albeit with significantly lower results. For the primary malignant CLA category, EffB1_R achieved SOTA performance among the RMSProp-trained models, whereas SwinB_S again performed the best among the SGD-trained models. Finally, for the metastatic malignant CLA category, EffB1_R exhibited the best performance among the RMSProp-trained models, whereas SwinB_S again performed the best among the SGD-trained models, achieving precision, recall, specificity, and F1-score values of 0.624, 0.966, 0.622, and 0.759, respectively.
A comparative analysis of the metric results presented in Table 3 and Supporting Information Table S4 indicated that the performance of most candidate models improved after retraining using the Adam and RMSProp optimizers. By contrast, those optimized using SGD generally produced unsatisfactory results, particularly for diagnosing primary CLA. For example, the precision, recall, and F1-score of D169_S, EffB1_S, and Sv2 × 2_S for the primary malignant CLA category were 0.000. This can be attributed to improper initial learning rate settings and SGD’s uniform gradient scaling in all directions, which makes it difficult to effectively converge when handling unbalanced datasets, particularly when fewer samples are available for training. By contrast, the Adam and RMSProp optimizers better optimized the model parameters through their adaptive learning rates, thereby significantly improving their overall diagnostic performance.
Table 4 lists the average accuracies of the candidate models trained using the three optimizers. A comprehensive review of the results presented in Tables 3 and 4 and Supporting Information Table S4 indicated that the Adam optimizer effectively enhanced the performance of most models. Notably, among the models trained using the RMSProp optimizer, EffB1 demonstrated the best diagnostic performance. Additionally, SwinB with Adam and transfer learning stood out as the SOTA model among all candidates. Based on these findings, we suggest using the Adam optimizer to optimize the weights of DLMs, including CNNs and ViTs, for ultrasound-based CLA classification.
Table 4
The average accuracy of D169, EffB1, Sv2 × 2, and SwinB using different optimizers.
| D169 | EffB1 | Sv2 × 2 | SwinB | |
| Adam | 0.9326 ± 0.0074 | 0.9319 ± 0.0099 | 0.9319 ± 0.0037 | 0.9440 ± 0.0067 |
| SGD | 0.7227 ± 0.0016 | 0.6079 ± 0.0081 | 0.7496 ± 0.0057 | 0.6920 ± 0.0023 |
| RMSProp | 0.9290 ± 0.0053 | 0.9372 ± 0.0055 | 0.9257 ± 0.0011 | 0.9204 ± 0.0049 |
To investigate the classification potential of the candidate DLMs, we removed the linear classifiers from the four Adam-optimized candidate models and used their backbones to extract features from the ultrasound images. Subsequently, these features were classified using four classical machine learning algorithms: support vector machine (SVM), random forest (RF), decision tree (DT), and K-nearest neighbors (KNN). The performances of the candidate models with these classifiers are listed in Supporting Information Table S5.
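Replacing a model's linear classifier with a distance-based classifier can be illustrated with a minimal KNN sketch (toy two-dimensional feature vectors and a hypothetical `knn_predict` helper; the study used high-dimensional features extracted by the frozen backbones and standard library implementations of the four classifiers):

```python
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a feature vector by majority vote among its k nearest
    training features (Euclidean distance)."""
    dists = sorted(
        (math.dist(query, f), lbl)
        for f, lbl in zip(train_feats, train_labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "features"; real features come from the frozen DLM backbone.
feats = [(0.1, 0.2), (0.0, 0.1), (0.9, 0.8), (1.0, 0.9)]
labels = ["benign", "benign", "metastatic", "metastatic"]
print(knn_predict(feats, labels, (0.95, 0.85), k=3))  # prints "metastatic"
```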
For the normal category, all models exhibited robust performances, with D169_RF, D169_SVC, SwinB_KNN, SwinB_RF, and SwinB_SVC achieving the best results. For the benign CLA category, SwinB_KNN exhibited the best performance, whereas EffB1_RF achieved the highest recall of 0.922. For the primary malignant CLA category, EffB1_KNN performed the best, whereas EffB1_RF again achieved the highest recall of 0.964 and SwinB_SVC achieved the highest F1-score of 0.895. Additionally, both EffB1_RF and SwinB_KNN performed well for the diagnosis of metastatic lymph nodes. Specifically, EffB1_RF achieved the highest precision (0.978) and specificity (0.985), whereas SwinB_KNN achieved the best recall (0.923) and F1-score (0.937).
Table 5 summarizes the average accuracies of the candidate models with machine learning classifiers. Evidently, the accuracy of SwinB_KNN is consistent with that of the original SwinB model. Notably, EffB1_SVC achieved an average accuracy of 0.9339, surpassing all other candidate models except SwinB. The results indicate that SwinB_KNN outperformed SwinB, particularly in terms of recall and specificity.
Table 5
Average accuracy of models using different machine learning classification algorithms.
| Model | SVC | DT | RF | KNN |
| D169 | 0.9290 ± 0.0068 | 0.8900 ± 0.0158 | 0.9231 ± 0.0062 | 0.9267 ± 0.0078 |
| EffB1 | 0.9339 ± 0.0066 | 0.8913 ± 0.0138 | 0.9260 ± 0.0044 | 0.9280 ± 0.0095 |
| Sv2 × 2 | 0.9296 ± 0.0071 | 0.8495 ± 0.0156 | 0.9195 ± 0.0060 | 0.9224 ± 0.0059 |
| SwinB | 0.9431 ± 0.0099 | 0.9178 ± 0.0046 | 0.9434 ± 0.0105 | 0.9440 ± 0.0104 |
Our findings show that, although SVM was more effective than the others for classifying features extracted by the DLMs, it did not offer a significant enhancement compared with the models’ built-in classifiers. Moreover, the other classifiers generally led to inferior performance compared with the original classifiers of the models. Therefore, most machine learning classification algorithms do not outperform the linear classifiers of the models for ultrasound images of cervical lymph nodes.
3.2. Performance Comparisons of Candidate Diagnostic Models on the Internal Test Set
Based on the results of the five-fold cross-validation of the four candidate models, we selected SwinB_Adam and EffB1_RMSprop as the candidate diagnostic models. We loaded each with the five weight sets obtained via cross-validation and tested their performance on the internal test set. SwinB_Adam and EffB1_RMSprop achieved average accuracies of 0.9163 and 0.9318, respectively, and highest accuracies of 0.9288 and 0.9466, respectively. Figure 2 shows the confusion matrices of these models and the ROC curves for the best-performing weights. As shown in Figure 2(a), the recall of EffB1_RMSprop exceeded 0.8000 for all image categories. Additionally, its area under the ROC curve (AUC) values for all categories exceeded 0.95, validating its excellent classification performance. Although the AUC values of SwinB_Adam (Figure 2(b)) were slightly better than those of EffB1_RMSprop, its confusion matrix results were inferior.
[figure(s) omitted; refer to PDF]
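The evaluation protocol above (load each of the five cross-validation weight sets, score it on the internal test set, report the average and highest accuracy) can be sketched as follows. The per-fold predictions are hypothetical stand-ins; in practice each fold would load its checkpoint and run inference.

```python
import numpy as np

# Hypothetical internal test labels; real labels come from the test set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 4, size=200)

fold_accuracies = []
for fold in range(5):
    # Stand-in for: model.load_state_dict(fold_weights[fold]); y_pred = model(x)
    # Here we simulate ~7% per-fold errors by corrupting some labels.
    y_pred = y_true.copy()
    flip = rng.random(200) < 0.07
    y_pred[flip] = (y_pred[flip] + 1) % 4
    fold_accuracies.append((y_pred == y_true).mean())

# The two numbers reported in the text for each candidate model
print(f"average accuracy: {np.mean(fold_accuracies):.4f}")
print(f"highest accuracy: {np.max(fold_accuracies):.4f}")
```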
Table 6 presents the classification performances of the candidate diagnostic models for each category. The overall performance of EffB1_RMSprop is significantly superior to that of SwinB_Adam, and both models achieved a recall of 1.000 for normal cervical lymph nodes. However, these candidate models were prone to misdiagnosing benign CLA and primary CLA as metastatic CLA, which we attribute to two factors. First, from the perspective of clinical diagnostic insights, benign, primary, and metastatic CLA exhibit certain morphological similarities. For example, primary CLA and metastatic CLA typically exhibit irregular boundaries, round or oval shapes, inhomogeneous internal echoes, and similar shapes and sizes [9]. In clinical practice, even experienced clinicians may struggle to identify unique features that differentiate these lesions. Additionally, some benign CLA may exhibit similar characteristics after infection or inflammation. This morphological similarity can lead DLMs to classification errors owing to visual misinterpretation during diagnosis. Second, owing to the class imbalance of the dataset, the models may be biased toward classes with more samples during training, thereby reducing the diagnostic accuracy for classes with fewer samples. Concretely, since considerably fewer images of benign and primary CLA are available compared with those of metastatic CLA, the model becomes relatively sensitive to the features of metastatic CLA, tends to overfit them, and cannot fully learn the feature representations of benign and primary CLA, thus affecting the accuracy and reliability of the overall diagnosis.
Table 6
Precision, recall, specificity, and F1-score of EffB1 and SwinB.
| Category | Precision (EffB1) | Recall (EffB1) | Specificity (EffB1) | F1-score (EffB1) | Precision (SwinB) | Recall (SwinB) | Specificity (SwinB) | F1-score (SwinB) |
| Benign | 0.895 | 0.850 | 0.978 | 0.872 | 0.889 | 0.800 | 0.978 | 0.842 |
| Normal | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Primary | 1.000 | 0.826 | 1.000 | 0.905 | 0.720 | 0.783 | 0.978 | 0.750 |
| Metastatic | 0.914 | 0.962 | 0.941 | 0.937 | 0.920 | 0.947 | 0.946 | 0.933 |
Note: EffB1 and SwinB mean EffB1_RMSprop and SwinB_Adam, respectively.
To alleviate these limitations, four data augmentation strategies typically employed for medical images were used to address the imbalance in the internal dataset: affine transformation, contrast change, Gaussian noise, and horizontal flip. A schematic of the enhanced dataset is shown in Supporting Information Figure S1. SwinB_Adam and EffB1_RMSprop were retrained on this enhanced dataset (the division of the augmented dataset is shown in Supporting Information Table S2), and their performances were evaluated via five-fold cross-validation. After data augmentation, SwinB_Adam (SwinB_Adam_Aug) and EffB1_RMSprop (EffB1_RMSprop_Aug) achieved average accuracies of 0.9520 and 0.9614, respectively, and highest accuracies of 0.9614 and 0.9703, respectively. The results of the best SwinB_Adam_Aug and the best EffB1_RMSprop_Aug across all other metrics are listed in Table 7; averaging these results shows that EffB1_RMSprop_Aug performs better.
Table 7
Precision, recall, specificity, and F1-score of best EffB1 and SwinB.
| Category | Precision (EffB1) | Recall (EffB1) | Specificity (EffB1) | F1-score (EffB1) | Precision (SwinB) | Recall (SwinB) | Specificity (SwinB) | F1-score (SwinB) |
| Benign | 0.981 | 0.867 | 0.996 | 0.920 | 0.917 | 0.917 | 0.982 | 0.917 |
| Normal | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Primary | 0.880 | 0.957 | 0.990 | 0.917 | 0.917 | 0.957 | 0.994 | 0.936 |
| Metastatic | 0.957 | 0.993 | 0.971 | 0.974 | 0.955 | 0.947 | 0.971 | 0.951 |
| Mean | 0.955 | 0.954 | 0.989 | 0.953 | 0.947 | 0.955 | 0.987 | 0.951 |
Note: EffB1 and SwinB mean EffB1_RMSprop_Aug and SwinB_Adam_Aug, respectively.
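A minimal sketch of the four augmentation strategies named above, implemented directly in NumPy on a dummy grayscale frame. A production pipeline would instead use a library such as torchvision or albumentations, and the affine transform is simplified here to a horizontal translation.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(img):
    # Mirror the image left-to-right
    return img[:, ::-1]

def gaussian_noise(img, sigma=10.0):
    # Additive Gaussian noise, clipped back to the valid intensity range
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255)

def contrast_change(img, factor=1.3):
    # Stretch intensities around the image mean
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0, 255)

def affine_shift(img, dx=5):
    # Simplified stand-in for a full affine transform: shift right by dx pixels
    shifted = np.zeros_like(img)
    shifted[:, dx:] = img[:, :-dx]
    return shifted

# Dummy grayscale ultrasound frame; real frames come from the dataset
image = rng.integers(0, 256, size=(224, 224)).astype(float)
augmented = [f(image) for f in (horizontal_flip, gaussian_noise,
                                contrast_change, affine_shift)]
assert all(a.shape == image.shape for a in augmented)
```

Applying each transform to every image in an underrepresented class quadruples its effective sample count, which is the imbalance-mitigation effect described above.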
Additionally, Figure 3 shows the confusion matrices and ROC curves for EffB1_RMSprop_Aug and SwinB_Adam_Aug, which confirm that data augmentation significantly enhanced their performance. Specifically, the recall values of EffB1_RMSprop_Aug for benign and primary CLA increased from 85.00% and 82.61% to 86.67% and 95.65%, respectively, whereas those of SwinB_Adam_Aug increased from 80.00% and 78.26% to 91.67% and 95.65%, respectively. Moreover, their AUCs for all image categories exceeded 0.98.
[figure(s) omitted; refer to PDF]
3.3. External Testing and Human–Machine Comparisons
The experimental results show that EffB1_RMSprop_Aug and SwinB_Adam_Aug achieved overall accuracies of 90.8% and 88.5%, respectively, on the external test set. Table 8 presents the per-category results of the two candidate models. For the normal category, SwinB_Adam_Aug slightly outperformed EffB1_RMSprop_Aug. For the remaining three categories, EffB1_RMSprop_Aug outperformed SwinB_Adam_Aug on most metrics, the exceptions being precision and specificity for the primary category.
Table 8
The metric results of candidate diagnostic models on the external test set.
| Category | Model | Precision | Recall | Specificity | F1-score |
| Normal | EffB1_RMSprop_Aug | 1.000 | 0.985 | 1.000 | 0.993 |
| Normal | SwinB_Adam_Aug | 1.000 | 0.993 | 1.000 | 0.996 |
| Benign | EffB1_RMSprop_Aug | 0.879 | 0.895 | 0.947 | 0.887 |
| Benign | SwinB_Adam_Aug | 0.826 | 0.881 | 0.920 | 0.853 |
| Primary | EffB1_RMSprop_Aug | 0.901 | 0.831 | 0.989 | 0.865 |
| Primary | SwinB_Adam_Aug | 0.924 | 0.792 | 0.992 | 0.853 |
| Metastatic | EffB1_RMSprop_Aug | 0.889 | 0.902 | 0.927 | 0.895 |
| Metastatic | SwinB_Adam_Aug | 0.868 | 0.861 | 0.915 | 0.864 |
Subsequently, from the external test set, we selected 70% of the samples from each image category to conduct human–machine comparisons.
Figure 4 shows the confusion matrices for EffB1_RMSprop_Aug, SwinB_Adam_Aug, and six clinicians (two residents, two attending physicians, and two chief physicians) on this set. EffB1_RMSprop_Aug and SwinB_Adam_Aug achieved accuracies of 88.3% and 87.1%, respectively. For the normal category, both models exhibited a recall of 100%, whereas their recalls for the benign, metastatic, and primary malignant categories decreased. Similar to the results on the internal dataset, the diagnostic performances of the candidate models for the benign and primary categories were inferior to those for the normal and metastatic categories. Despite this, both candidate models surpassed all six clinicians, including the chief physicians. This expert-level performance renders DLMs suitable auxiliary tools for clinicians, minimizing their workload and the risks of misdiagnosis and missed diagnosis. Additionally, DLMs can provide high-level diagnostic support in resource-limited environments to compensate for the shortage of professionals. Based on these results, we selected EffB1_RMSprop_Aug as the diagnostic model owing to its superior performance, fewer parameters (6.51M), and lower GFLOPs (0.58).
[figure(s) omitted; refer to PDF]
3.4. Explainable AI
Explainable AI focuses on making AI models more transparent and elucidates their expected impacts and potential biases. It not only helps characterize model accuracy, fairness, transparency, and outcomes for AI-based decision-making but is also critical for building trust and confidence when implementing AI models [41].
Compared to models developed for classifying natural images, those developed for medical images must be interpretable so that both doctors and patients can trust their outputs and decisions [42]. Clear and well-reasoned outputs are essential for building this trust. Additionally, understanding the internal decision-making process of a model through interpretability techniques allows researchers and developers to continue enhancing model performance. Moreover, the World Health Organization encourages the provision of adequate explanations for the outputs of machine learning models before they are used in clinical trials.
This study employed the widely used Grad-CAM, an algorithm used for producing visual explanations of DLM outputs, to clarify the decision-making process of the model [43]. Grad-CAM generates a coarse localization map by analyzing the gradients of any target concept flowing into the final convolutional layer. This map highlights the important regions in the image for predicting the output. Through this visualization, researchers can gain a deeper understanding of the decision-making processes of a model by identifying the specific regions that the model focuses on during classification or detection tasks. Figure 5 shows a heat map of the model generated by the Grad-CAM algorithm for the different image categories, indicating that the proposed model focuses on the lesion areas in the ultrasound images, visually matching the insights of human experts.
[figure(s) omitted; refer to PDF]
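The core Grad-CAM computation described above is compact. The sketch below assumes the final convolutional layer's activations and the gradients of the target class score with respect to them have already been captured (in PyTorch, via forward/backward hooks); random arrays stand in for those tensors here.

```python
import numpy as np

# Stand-ins for captured tensors: K feature maps of size H x W and
# the matching gradients of the target class score.
rng = np.random.default_rng(0)
activations = rng.random((128, 7, 7))
gradients = rng.random((128, 7, 7))

# 1. Channel weights: global-average-pool the gradients over space
weights = gradients.mean(axis=(1, 2))                       # shape (K,)
# 2. Weighted sum of feature maps, followed by ReLU
cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
# 3. Normalize to [0, 1]; in practice the map is then upsampled to the
#    input resolution and overlaid on the ultrasound image as a heat map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

High values in `cam` mark the regions the model relied on for its prediction, which is what Figure 5 visualizes for each image category.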
3.5. Application System Page and Its Function Introduction
We developed a PC version of the proposed application using Java (Figure 6). Its main function is to read ultrasound imaging videos in real time and analyze them frame by frame. The original window displays the captured ultrasound video, while the processed window displays the heat-map video generated by the system; this heat map guides doctors toward the affected area. We also added an image-saving function to support retrospective analyses: when the system detects a lesion in a real-time video frame, it saves the frame and displays it in a window in the lower-left corner. We tested the application on a laptop running Windows 11 and an Intel(R)-Core(TM)[email protected] GHz.
[figure(s) omitted; refer to PDF]
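The frame-by-frame workflow described above can be sketched in Python (the shipped application itself is written in Java). The video source and the classifier below are hypothetical stubs in place of real ultrasound input and EfficientNet-B1 inference.

```python
import numpy as np

CATEGORIES = ["normal", "benign", "primary", "metastatic"]

def frame_source(n_frames=10):
    # Stub for a real-time ultrasound feed: yields dummy grayscale frames
    rng = np.random.default_rng(0)
    for _ in range(n_frames):
        yield rng.integers(0, 256, size=(224, 224), dtype=np.uint8)

def classify(frame):
    # Stub classifier; the real system preprocesses the frame and runs
    # EfficientNet-B1 inference, then renders a Grad-CAM heat map
    return CATEGORIES[int(frame.mean()) % 4]

saved_frames = []
for i, frame in enumerate(frame_source()):
    label = classify(frame)
    if label != "normal":
        # Lesion detected: keep the frame for the lower-left review window
        saved_frames.append((i, label))
print(f"saved {len(saved_frames)} lesion frames")
```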
This system can be easily integrated into the clinical diagnosis process through a connection with the data interface of an ultrasound device. Clinicians can also use this CAD system on traditional ultrasound examinations, without interrupting the diagnostic process. When an ultrasound device captures the ultrasound data, the system immediately receives and processes them. This integration allows clinicians to automatically analyze and process images in real time during routine ultrasound examinations. Additionally, it can process ultrasound video files from other data sources, making it suitable not only for real-time diagnosis but also for postanalysis of historical data. Clinicians can import historical ultrasound examination videos through the AI system to conduct a deeper analysis or review of a patient’s condition. Moreover, this system is suitable for clinical teaching and training. By repeatedly watching ultrasound videos of specific cases, clinicians and students can compare their evaluations with AI predictions and gain a deeper understanding of the subtle characteristics of different lymph node conditions.
3.6. Comparison With Related Work
We compared this work with related previous studies; the findings are summarized in Table 9. According to our investigation, most studies have focused on cervical lymph node metastasis for specific cancers, and few have specifically used DLMs to automatically diagnose different types of CLA from cervical ultrasound images. Compared with related work, our dataset better reflects the actual clinical setting, as it contains 4089 images from two clinical institutions. Additionally, by employing transfer learning, our EfficientNet-B1_RMSprop model achieved overall accuracies of 97.0% and 90.8% on the internal and external test sets, respectively. On this basis, we developed a practical application to enhance the clinical diagnostic efficiency for CLA. It is worth noting that we conducted an in-depth analysis of numerous transfer learning models with different optimizers and classifiers, which can aid future research in this field.
Table 9
Limitations of related work and strengths of this work.
| Study | Model name | Model task | Dataset description | Multicenter | Model accuracy (%) | Application system |
| [25] | DenseNet161 | Detection of cervical lymphadenopathy | 420 images (metastatic and benign) | × | 86.5 | × |
| [26] | Improved Swin Transformer | Grading of cervical lymph nodes | 2268 images (cervical lymph nodes from Level 1 to Level 6) | × | 80.65 | × |
| Ours | EfficientNetB1+TL + RMSprop | Detection of cervical lymphadenopathy | 4089 images (normal, benign, primary, and metastatic) | √ | Internal: 97.0; External: 90.8 | √ |
Abbreviation: TL, transfer learning.
4. Conclusion
This study developed a DL-based CAD system to enhance the diagnostic accuracy and efficiency of CLA assessment from ultrasound images. To the best of our knowledge, this is the first study to test the performance of various DLMs for diagnosing CLA from ultrasound images using transfer learning. Notably, we collected numerous clinical images and explored a variety of model architectures, including the currently popular ViTs. We also employed multiple optimizers to further explore their potential for enhancing model performance and tested different machine learning classification algorithms. Through an in-depth analysis of our experimental results, we found that the combination of the Adam optimizer and the DLM's own classifier usually resulted in the best performance. After a comprehensive evaluation, EfficientNet-B1 with transfer learning, the RMSprop optimizer, and data augmentation was determined to be the SOTA model. Another highlight of this work is that we conducted external testing of DLMs and implemented human–machine comparison experiments in actual clinical environments. These measures further verified the clinical practicality and safety of DLMs. Finally, an application was developed to assist clinicians in diagnosing and classifying CLA. We believe that this study provides valuable insights for researchers and developers aiming to investigate ultrasound images of cervical lymph nodes.
However, this study had some limitations. First, the employed dataset had a significant class imbalance, which led to poor performance of the model for diagnosing primary CLA. Second, the limited size and diversity of the dataset may affect the applicability of the proposed system across different populations and imaging conditions. Third, although the preliminary results of this study are promising, the clinical applicability of our system has not yet been prospectively validated. Finally, this study mainly focused on the detection performance of models without thoroughly comparing the complexity and parameter sizes of different models, which may limit their deployment on certain resource-limited hardware platforms.
In future work, we will first focus on constructing a larger and more diverse dataset using images from multiple medical institutions and locations, which will allow us to further validate and enhance the robustness and utility of our diagnostic system. Second, we will develop multimodal DLMs to provide clinicians with more comprehensive diagnostic insights. Third, prospective studies are required to confirm the efficacy and reliability of the proposed AI system across various clinical settings. Finally, we plan to develop lightweight models suitable for various hardware systems so that users in resource-constrained environments can access intelligent auxiliary diagnosis, and we will explore effective data augmentation strategies and well-designed loss functions to enhance model performance on class-imbalanced datasets.
Author Contributions
Ming Xu: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing – original draft, writing – review and editing, visualization, and project administration. Yubiao Yue: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing – original draft, writing – review and editing, visualization, and project administration. Zhenzhang Li: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing – original draft, writing – review and editing, visualization, and project administration. Yinhong Li: resources, data curation and validation. Guoying Li: resources, data curation, and validation. Haihua Liang: resources. Di Liu: software. Xiaohong Xu: conceptualization, methodology, validation, formal analysis, investigation, resources, data curation, writing – original draft, writing – review and editing, supervision, project administration, and funding acquisition. Ming Xu, Yubiao Yue, and Zhenzhang Li contributed equally and share first authorship.
Funding
This study was supported by grants from the Natural Science Foundation of Guangdong Province (no. 2022A1515012241), the NNSF of China (no. 62276074), the National Key Research and Development Plan sub-topics (no. 2023YFB4706800), the Scientific Research Project of Guangzhou Education Bureau (no. 202235325), the NSF of Guangdong Province (no. 2022A1515011044 and no. 2023A1515010885), the Project of Promoting Research Capabilities for Key Constructed Disciplines in Guangdong Province (no. 2021ZDJS028), and the Scientific Research Project of Guangdong Provincial Department of Education (no. 2023KQNCX060).
[1] S. P. Leong, A. Pissas, M. Scarato, "The Lymphatic System and Sentinel Lymph Nodes: Conduit for Cancer Metastasis," Clinical and Experimental Metastasis, vol. 39, pp. 139-157, DOI: 10.1007/s10585-021-10123-w, 2022.
[2] C. L. Willard-Mack, "Normal Structure, Function, and Histology of Lymph Nodes," Toxicologic Pathology, vol. 34, pp. 409-424, DOI: 10.1080/01926230600867727, 2006.
[3] M. Buettner, U. Bode, "Lymph Node Dissection-Understanding the Immunological Function of Lymph Nodes," Clinical and Experimental Immunology, vol. 169, pp. 205-212, DOI: 10.1111/j.1365-2249.2012.04602.x, 2012.
[4] A. T. Ahuja, M. Ying, S. Y. Ho, "Ultrasound of Malignant Cervical Lymph Nodes," Cancer Imaging, vol. 8, pp. 48-56, DOI: 10.1102/1470-7330.2008.0006, 2008.
[5] H. S. Hwang, L. A. Orloff, "Efficacy of Preoperative Neck Ultrasound in the Detection of Cervical Lymph Node Metastasis From Thyroid Cancer," The Laryngoscope, vol. 121, pp. 487-491, DOI: 10.1002/lary.21227, 2011.
[6] F. López, J. P. Rodrigo, C. E. Silver, "Cervical Lymph Node Metastases From Remote Primary Tumor Sites," Head and Neck, vol. 38, pp. E2374-E2385, DOI: 10.1002/hed.24344, 2016.
[7] J. Wang, H. Wei, H. Chen, "Application of Ultrasonography in Neonatal Lung Disease: An Updated Review," Frontiers in Pediatrics, vol. 10, DOI: 10.3389/fped.2022.1020437, 2022.
[8] Y. Liu, J. Chen, C. Zhang, "Ultrasound-Based Radiomics Can Classify the Etiology of Cervical Lymphadenopathy: A Multi-Center Retrospective Study," Frontiers in Oncology, vol. 12, DOI: 10.3389/fonc.2022.856605, 2022.
[9] Y. Chong, G. Park, H. J. Cha, "A Stepwise Approach to Fine Needle Aspiration Cytology of Lymph Nodes," Journal of Pathology and Translational Medicine, vol. 57, pp. 196-207, DOI: 10.4132/jptm.2023.06.12, 2023.
[10] J. Bojunga, P. Trimboli, "Thyroid Ultrasound and Its Ancillary Techniques," Reviews in Endocrine and Metabolic Disorders, vol. 25, pp. 161-173, DOI: 10.1007/s11154-023-09841-1, 2024.
[11] K. Su, J. Liu, X. Ren, "A Fully Autonomous Robotic Ultrasound System for Thyroid Scanning," Nature Communications, vol. 15, DOI: 10.1038/s41467-024-48421-y, 2024.
[12] B. Föllmer, M. C. Williams, D. Dey, "Roadmap on the Use of Artificial Intelligence for Imaging of Vulnerable Atherosclerotic Plaque in Coronary Arteries," Nature Reviews Cardiology, vol. 21, pp. 51-64, DOI: 10.1038/s41569-023-00900-3, 2024.
[13] E. Uçar, "Classification of Myositis From Muscle Ultrasound Images Using Deep Learning," Biomedical Signal Processing and Control, vol. 71, DOI: 10.1016/j.bspc.2021.103277, 2022.
[14] S. D. Deb, R. K. Jha, "Breast UltraSound Image Classification Using Fuzzy-Rank-Based Ensemble Network," Biomedical Signal Processing and Control, vol. 85, DOI: 10.1016/j.bspc.2023.104871, 2023.
[15] You-Wei, T.-T. Kuo, Yi-H. Chou, Yu Su, S.-H. Huang, C.-J. Chen, "Breast Tumor Classification Using Short-ResNet With Pixel-Based Tumor Probability Map in Ultrasound Images," Ultrasonic Imaging, vol. 45, pp. 74-84, DOI: 10.1177/01617346231162906, 2023.
[16] G. Kılıçarslan, C. Koç, F. Özyurt, Y. Gül, "Breast Lesion Classification Using Features Fusion and Selection of Ensemble ResNet Method," International Journal of Imaging Systems and Technology, vol. 33, pp. 1779-1795, DOI: 10.1002/ima.22894, 2023.
[17] N. Aboudi, H. Khachnaoui, O. Moussa, N. Khlifa, "Bilinear Pooling for Thyroid Nodule Classification in Ultrasound Imaging," Arabian Journal for Science and Engineering, vol. 48, pp. 10563-10573, DOI: 10.1007/s13369-023-07674-3, 2023.
[18] P. Zhang, H. Huang, Q. Xiong, X. He, Y. Liu, "Feature Analysis and Automatic Classification of B-Mode Ultrasound Images of Fatty Liver," Biomedical Signal Processing and Control, vol. 79, DOI: 10.1016/j.bspc.2022.104073, 2023.
[19] T. T. Ho, G.-T. Kim, T. Kim, S. Choi, E.-K. Park, "Classification of Rotator Cuff Tears in Ultrasound Images Using Deep Learning Models," Medical, Biological Engineering and Computing, vol. 60, pp. 1269-1278, DOI: 10.1007/s11517-022-02502-6, 2022.
[20] M. M. Al Rahhal, Y. Bazi, R. M. Jomaa, M. Zuair, F. Melgani, "Contrasting EfficientNet, ViT, and gMLP for COVID-19 Detection in Ultrasound Imagery," Journal of Personalized Medicine, vol. 12, DOI: 10.3390/jpm12101707, 2022.
[21] M. F. Dar, A. Ganivada, "EfficientU-Net: A Novel Deep Learning Method for Breast Tumor Segmentation and Classification in Ultrasound Images," Neural Processing Letters, DOI: 10.1007/s11063-023-11333-x, 2023.
[22] S. Sudharson, P. Kokil, "An Ensemble of Deep Neural Networks for Kidney Ultrasound Image Classification," Computer Methods and Programs in Biomedicine, vol. 197, DOI: 10.1016/j.cmpb.2020.105709, 2020.
[23] B. Pu, N. Zhu, K. Li, S. Li, "Fetal Cardiac Cycle Detection in Multi-Resource Echocardiograms Using Hybrid Classification Framework," Future Generation Computer Systems, vol. 115, pp. 825-836, DOI: 10.1016/j.future.2020.09.014, 2021.
[24] S. Latha, P. Muthu, K. Wee Lai, A. Khalil, S. Dhanalakshmi, "Performance Analysis of Machine Learning and Deep Learning Architectures on Early Stroke Detection Using Carotid Artery Ultrasound Images," Frontiers in Aging Neuroscience, vol. 13, DOI: 10.3389/fnagi.2021.828214, 2022.
[25] L. Xia, S. Lei, H. Chen, H. Wang, Ultrasound-Assisted Diagnosis of Benign and Malignant Cervical Lymph Nodes in Patients With Lung Cancer Based on Deep Learning, 2020.
[26] Y. Liu, J. Zhao, Q. Luo, C. Shen, R. Wang, X. Ding, "Automated Classification of Cervical Lymph-Node-Level From Ultrasound Using Depthwise Separable Convolutional Swin Transformer," Computers in Biology and Medicine, vol. 148, DOI: 10.1016/j.compbiomed.2022.105821, 2022.
[27] F. Zhuang, Z. Qi, K. Duan, "A Comprehensive Survey on Transfer Learning," Proceedings of the IEEE, vol. 109, pp. 43-76, DOI: 10.1109/JPROC.2020.3004555, 2021.
[28] H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, T. Ganslandt, "Transfer Learning for Medical Image Classification: A Literature Review," BMC Medical Imaging, vol. 22, DOI: 10.1186/s12880-022-00793-7, 2022.
[29] M. Raghu, C. Zhang, J. Kleinberg, S. Bengio, "Transfusion: Understanding Transfer Learning for Medical Imaging," Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3347-3357, 2019.
[30] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," DOI: 10.48550/arXiv.1409.1556, 2015.
[31] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, DOI: 10.1109/CVPR.2016.90, 2016.
[32] A. Howard, M. Sandler, Bo Chen, "Searching for MobileNetV3," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314-1324, DOI: 10.1109/ICCV.2019.00140, 2019.
[33] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," Computer Vision-ECCV 2018, pp. 122-138, DOI: 10.1007/978-3-030-01264-9_8, 2018.
[34] G. Huang, Z. Liu, L. V. D. Maaten, K. Q. Weinberger, "Densely Connected Convolutional Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269, DOI: 10.1109/CVPR.2017.243, 2017.
[35] M. Tan, Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv, DOI: 10.48550/arXiv.1905.11946, 2020.
[36] M. Tan, Q. V. Le, "EfficientNetV2: Smaller Models and Faster Training," arXiv, DOI: 10.48550/arXiv.2104.00298, 2021.
[37] A. Dosovitskiy, L. Beyer, A. Kolesnikov, "An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale," DOI: 10.48550/arXiv.2010.11929, 2021.
[38] Ze Liu, Y. Lin, Y. Cao, "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992-10002, DOI: 10.1109/ICCV48922.2021.00986, 2021.
[39] S. Mehta, M. Rastegari, "MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer," DOI: 10.48550/arXiv.2110.02178, 2022.
[40] D. P. Kingma, J. Ba, "Adam: A Method for Stochastic Optimization," DOI: 10.48550/arXiv.1412.6980, 2017.
[41] W. Samek, T. Wiegand, K.-R. Müller, "Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models," DOI: 10.48550/arXiv.1708.08296, 2017.
[42] M. A. Gulum, C. M. Trombley, M. Kantardzic, "A Review of Explainable Deep Learning Cancer Detection Models in Medical Imaging," Applied Sciences, vol. 11, DOI: 10.3390/app11104573, 2021.
[43] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, "Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618-626, DOI: 10.1109/ICCV.2017.74, 2017.
Copyright © 2025 Ming Xu et al. International Journal of Intelligent Systems published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/