Abstract
Hypertensive retinopathy (HR) can lead to vision loss if left untreated, so early screening and treatment are critical to reducing this risk. Computer-aided diagnostic systems present an opportunity to improve the efficiency and reliability of HR screening and diagnosis, particularly given the shortage of specialized medical professionals and the difficulty primary care physicians face in making precise diagnoses. A notable barrier to the development of such diagnostic algorithms is the lack of publicly available benchmarks and datasets. To address these issues, we organized the “HRDC—Hypertensive Retinopathy Diagnosis Challenge” in conjunction with the Computer Graphics International (CGI) 2023 conference. The challenge provided a fundus image dataset for two clinical tasks, hypertension classification and HR classification, with 1,000 images per task. This paper presents a concise summary and analysis of the submitted methods and results for the two challenge tasks. For hypertension classification, the best performing algorithm achieved a Kappa score of 0.3819, an F1 score of 0.6337, and a specificity of 0.8472. For HR classification, the best performing algorithm achieved a Kappa score of 0.4154, an F1 score of 0.6122, and a specificity of 0.8444. We also explored an ensemble of the top-ranking methods, which improved the kappa and F1 scores in both tasks and the overall ranking score in the HR classification task. The challenge results show that there is room for further optimization of these methods, but the insights and methodologies derived from this challenge provide valuable directions for developing more precise and reliable classification models for hypertension and HR.
Introduction
Hypertension, often referred to as high blood pressure, is a medical condition characterized by excessive force exerted by the blood against the walls of arteries [8, 41]. Globally, approximately 1.13 billion individuals in developing countries are affected by hypertension [42]. This widespread condition presents a significant global health challenge, increasing the risk not only for cardiovascular diseases and strokes, but also posing a substantial threat to retinal health [28, 35]. Hypertensive retinopathy (HR) is a prevalent complication among hypertensive patients, capable of causing visual impairment and, in severe cases, even blindness. The diagnosis of HR involves a visual examination of retinal fundus images to identify retinal lesions, including arteriolar narrowing, hemorrhages, cotton-wool spots, and hard exudates [44].
Early screening and prompt treatment of HR offer multiple benefits [4, 27]: they reduce the risk of visual impairment, and they can alleviate the burden on ophthalmologists by reducing the need for complex interventions. However, precise diagnosis is difficult even for primary care physicians, and the current diagnosis of HR relies on the labor-intensive manual inspection of individual images, which is time-consuming and heavily dependent on the expertise of ophthalmologists. Therefore, the development of an effective computer-aided system is essential to assist ophthalmologists in the analysis of HR.
In recent years, artificial intelligence (AI) has been widely applied to various computer vision tasks, such as classification [37], segmentation [13, 22] and detection [45]. AI has also facilitated automated medical image diagnosis [9, 10, 23], with further potential to contribute to the development of the medical field [15, 38]. Integrating AI into computer-aided systems for the diagnosis of hypertension and HR is significant: AI can provide more accurate and consistent diagnoses, analyze a large volume of retinal images more quickly, and ultimately improve the efficiency of HR diagnosis. Recent efforts have leveraged AI techniques for HR detection [2, 5, 36]. Manually designed features such as the arteriolar-to-venular ratio (AVR), optic disk localization, and the tortuosity index have been used as input to classifiers to identify HR [3, 49]. End-to-end deep learning methods have also been explored for HR identification [1, 2]. Existing datasets such as DRIVE [40], DiaRetDB [19], and DR-HAGIS [17] can serve as valuable resources for evaluating HR identification [1].
Although considerable effort has been devoted to automatic HR diagnosis, developing AI algorithms for precise diagnosis remains challenging due to the lack of publicly available datasets and benchmarks for model development and evaluation. AI challenges hold significant importance in medical image analysis because they typically release publicly available data for model development and validation by participating teams from around the world [29, 32, 33]. Several AI challenges have been dedicated to the diagnosis of various ophthalmic diseases, such as IDRiD [32], REFUGE [29] and DeepDRiD [24]. However, no challenge has focused on HR diagnostics. To further promote the application of machine learning and deep learning algorithms in computer-aided clinical HR diagnosis, we organized the hypertensive retinopathy diagnosis challenge (HRDC) at Computer Graphics International (CGI) 2023. This challenge introduces two clinically relevant tasks, hypertension classification and HR classification, with the aim of establishing a benchmark and evaluation framework for automated HR diagnosis. In this paper, we provide a comprehensive overview of the challenge setup, the dataset, and the competing solutions. We hope that the algorithms outlined in this challenge will prove invaluable to the HR research community. The contributions of this paper are as follows:
We present the challenge setup, evaluation, datasets, competing algorithms and results submitted in the challenge, which provides a reference for the organization of subsequent medical image challenges.
We summarize the top performing algorithms submitted in the challenge, and provide the strategies to improve the performance of the classification model.
We provide a comprehensive analysis of the challenge results, including ranking stability and statistical significance analysis, which provides a benchmark for performance comparison on the hypertension and HR classification tasks. We also explore and demonstrate the effectiveness of model ensembling in improving the classification results.
Prior work
Hypertension and hypertensive retinopathy classification
Relatively little research has focused on predicting and classifying hypertension from fundus images using deep learning. Among the available studies, Zhang et al. [47] developed neural network models to predict hypertension based on retinal fundus images. Their study involved the collection of 1,222 high-quality retinal images and more than 50 anthropometric and biochemical measurements from 625 subjects, and their models achieved an AUC of 0.766 in predicting hypertension. Poplin et al. [31] employed deep learning to predict systolic blood pressure (SBP). They trained the model on a large dataset comprising 284,335 patients and validated its performance on two independent datasets of 12,026 and 999 patients; the model achieved an absolute error of 11.23 mmHg in SBP estimation. In addition, the model’s application extended beyond hypertension prediction to the identification of cardiovascular risk factors not traditionally associated with or quantifiable from retinal images.
In contrast, the application of image processing and machine learning techniques to HR detection has received more attention. The first step typically involves the identification of various HR-related features, such as AVR, optic disk localization, and tortuosity index. A machine learning classifier is then used to identify HR within color fundus images. For instance, Zhu et al. [49] introduced a supervised approach utilizing the extreme learning machine (ELM) for retinal vessel segmentation. They extracted a set of 39-dimensional discriminative feature vectors for each pixel within the fundus image. These included local features, morphological attributes, and more. These feature vectors, along with manually assigned labels, were used to construct a matrix that served as input for the ELM classifier. The classifier effectively generated binary segmentation results for the retinal vasculature. Similarly, Akbar et al. [3] presented an automated system for retinal vessel segmentation, which forms the basis for HR calculation and grading, primarily relying on AVR. The performance of the system was evaluated using three datasets: INSPIRE-AVR, VICAVR, and a local AVRDB dataset, with average accuracies of 95.14%, 96.82%, and 98.76%, respectively. Additionally, Cavallari et al. [6] adopted a combination of AVR, tortuosity index, and mean fractal dimension to facilitate HR detection.
Deep learning methods offer the advantage of end-to-end detection, eliminating the need for a separate feature extraction process. Abbas et al. [1] introduced a deep learning system named DenseHyper, designed for HR classification. DenseHyper integrated a trained feature layer with a dense feature transformation layer, which significantly improved its ability to generalize features derived from HR-related lesions. The system incorporated a perceptually oriented color space in conjunction with deep features, resulting in improved performance in HR classification. A dataset of 1,600 HR and 2,670 non-HR retinal images from four different online platforms and one private source was used to train the system, which achieved a sensitivity of 93%, a specificity of 95% and an AUC of 0.96 in a tenfold cross-validation test. Furthermore, Abbas et al. [2] developed the Hyper-Retino framework for HR grading across five grades. This framework included a preprocessing step involving color space transformation and contrast enhancement, effectively addressing issues related to illumination and contrast. Subsequently, instance-based segmentation, using a shallow CNN model and a random forest classifier, as well as semantic segmentation using a U-Net model, were used to detect HR-related lesions. On average, a tenfold cross-validation test on 1,400 HR images yielded a sensitivity of 90.5%, a specificity of 91.5% and an AUC of 0.915. Sajid et al. [36] developed a lightweight HR-related eye disease diagnosis system called Mobile-HR. This system integrates a pretrained model with dense blocks, simplifying the network architecture while improving accuracy and speed. Experimental results show an accuracy of 99% and an F1 score of 99% on various datasets.
Fig. 1 [Images not available. See PDF.]
Examples of fundus images in two tasks of the challenge
The future of deep learning-based HR detection is poised for evolution, characterized by the emergence of deeper models and the use of larger datasets. The effectiveness of pretrained foundational models is expected to play a pivotal role in improving detection accuracy and efficiency. Additionally, the availability of publicly accessible datasets is of great value, serving as critical benchmarks for researchers to evaluate the effectiveness of their models and catalyze advancements in the field of medical research.
Medical image challenges
In recent years, there has been a significant increase in the number of challenges organized in the field of medical artificial intelligence. These challenges encourage innovation in the development of AI-driven healthcare solutions and provide a platform for testing and refining cutting-edge technologies. At events such as the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) and the IEEE International Symposium on Biomedical Imaging (ISBI), many challenges have revolved around fundus images and eye diseases. For example, the DeepDRiD challenge [24], hosted at ISBI 2020, centered on diabetic retinopathy grading and image quality estimation. The challenge provided 2,000 regular fundus images from 500 patients; each patient had four fundus images, with two records per eye, focusing on the macula and optic disk. The Automatic Detection challenge on age-related macular degeneration (ADAM) [12], also held at ISBI 2020, encompassed four distinct tasks focused on the detection and characterization of age-related macular degeneration (AMD) using fundus images: AMD detection, optic disk detection and segmentation, fovea localization, and lesion detection and segmentation. A dataset of 1,200 fundus images was released, along with corresponding diagnostic labels for AMD, pixel-wise segmentation masks for the optic disk and AMD-related lesions (such as drusen, exudates, hemorrhages, and scars), and the coordinates of the macular fovea. The retinal fundus glaucoma challenge (REFUGE) [29], hosted at MICCAI 2018, focused on two primary tasks: optic disk/cup segmentation and glaucoma classification.
Challenge description
Challenge setup
Fig. 2 [Images not available. See PDF.]
Ranking scores and rankings of the valid submissions in the two challenge tasks
The HRDC challenge was set up with the primary objective of providing a benchmark for the evaluation of algorithms designed for the automatic detection of hypertension and HR using fundus images. This challenge consisted of two tasks. Figure 1 shows the example images for each task.
Task 1: Hypertension classification. Participants were tasked with confirming whether a given fundus image of a patient’s eye indicated the presence of hypertension.
Task 2: Hypertensive retinopathy classification. The aim was to determine whether a patient’s eye showed signs of HR in the fundus images.
To facilitate clarity, we adopt the labels A and B to denote the algorithms for task 1 (hypertension classification) and task 2 (HR classification), respectively, followed by a number indicating the algorithm’s rank in that task. For example, A1 and B1 represent the first-ranked algorithms in task 1 and task 2, respectively. In the hypertension classification task, a total of 12 teams submitted their results. Three invalid submissions were identified: one team directly submitted the example provided by the organizer, and two submissions yielded a score of 0 for both kappa and specificity. This left nine valid submissions for this task. Among these, three teams (A1, A2, A7) submitted method description papers. The ranking scores and rankings for these nine teams are shown in Fig. 2. Similarly, in the HR classification task, a total of seven teams submitted their results. Two submissions were identified as invalid: one team directly submitted the example provided by the organizer, and another submission yielded scores of 0 for both kappa and specificity. As a result, five valid submissions remained for this task. Among these, two teams (B1, B3) submitted method description papers. The ranking scores and rankings for these five teams are also presented in Fig. 2.
Dataset
Table 1. Statistical details of the two tasks in HRDC dataset
| Metadata | Task 1 Training | Task 1 Testing | Task 2 Training | Task 2 Testing |
|---|---|---|---|---|
| No. of images | 712 | 288 | 712 | 288 |
| No. of participants | 613 | 257 | 483 | 208 |
| No. of eyes | 647 | 267 | 539 | 232 |
| Male (%) | 67.54 | 70.82 | 80.95 | 81.73 |
| Age (mean ± std) | 48.05 ± 10.98 | 48.32 ± 9.88 | 50.94 ± 10.53 | 50.40 ± 10.03 |
In the challenge dataset, all images were captured using the TOPCON TRC-NW400 retinal camera with a 45-degree field of view. This study was approved by the Ethics Committee of Shanghai Sixth People’s Hospital and conducted in accordance with the Declaration of Helsinki. For preprocessing, we first removed the black background and then resized the images to a resolution of 800 × 800 pixels. The images were saved in PNG format and released to the participants. Statistical details of the two tasks are presented in Table 1. Each task consists of 1,000 color fundus images, divided into a training set of 712 images and a test set of 288 images. The training sets of the two tasks share 356 images, and the test sets share 144 images. For the annotation of the hypertension classification task, hypertension was defined as a systolic blood pressure of 140 mmHg or greater, a diastolic blood pressure of 90 mmHg or greater, or taking medication for hypertension [48]. The annotation for the HR classification task followed the Wong–Mitchell grading system [7, 44] for HR. All ground truth annotations are provided in a CSV file, in which the classification annotation is a numerical value describing the associated class for each image. In the hypertension classification task, category 0 indicates non-hypertension and category 1 indicates hypertension; in the HR classification task, category 0 corresponds to non-HR and category 1 corresponds to HR.
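As an illustration, the following is a minimal sketch of such preprocessing; the intensity threshold used to locate the non-black fundus region and the other implementation details are our assumptions, not the released pipeline:

```python
import numpy as np
from PIL import Image

# Sketch: crop away the black background around the fundus region, then
# resize to 800 x 800 and save as PNG (threshold value is an assumption).
def preprocess(in_path, out_path, threshold=10, size=(800, 800)):
    img = np.array(Image.open(in_path).convert("RGB"))
    mask = img.max(axis=2) > threshold                 # non-black pixels
    ys, xs = np.where(mask)
    cropped = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    Image.fromarray(cropped).resize(size).save(out_path, format="PNG")
```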
Evaluation and ranking
In both tasks, the evaluation of algorithms involves three metrics: kappa, F1 score, and specificity. Each metric is computed across all test cases, and the three metric scores are then averaged to produce the final ranking score. The three metrics are calculated as follows:
$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{1}$$
where $p_o = \frac{\sum_{i=1}^{C} n_{ii}}{N}$ and $p_e = \frac{\sum_{i=1}^{C} a_i b_i}{N^2}$. $p_o$ is the overall classification accuracy, calculated by summing the correctly classified images in each category and dividing this sum by the total number of images $N$. $C$ is the total number of categories, $n_{ii}$ is the number of images correctly classified in category $i$, $a_i$ is the true number of images in category $i$, and $b_i$ is the predicted number of images in category $i$.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$. TP, TN, FN and FP are the numbers of true positives, true negatives, false negatives and false positives, respectively.
Table 2. Performance comparison of the top three algorithms in hypertension classification. Ensemble represents the ensemble results of the top three algorithms
Method | Kappa | F1 score | Specificity | Average |
|---|---|---|---|---|
A1 | 0.3819 | 0.6337 | 0.8472 | 0.6210 |
A2 | 0.3889 | 0.6741 | 0.7569 | 0.6066 |
A3 | 0.3264 | 0.6367 | 0.7361 | 0.5664 |
Ensemble | 0.3889 | 0.6615 | 0.7917 | 0.6140 |
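To make the ranking score concrete, the following is a minimal sketch of its computation with scikit-learn; the toy labels are hypothetical:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

# Sketch of the HRDC ranking score: the average of kappa, F1 score and
# specificity over binary labels (1 = hypertension/HR, 0 = negative).
def ranking_score(y_true, y_pred):
    kappa = cohen_kappa_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp)
    return (kappa + f1 + specificity) / 3

# Toy example with hypothetical predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(round(ranking_score(y_true, y_pred), 4))
```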
Algorithms summary
In this section, we summarize the algorithms for which method description papers were submitted in each task of the challenge. Specifically, the team of Xinming Shu et al. submitted a paper describing their method for task 1, while the teams of Shichang Liu et al. and Julio Silva-Rodríguez et al. each submitted a paper describing their methods for both task 1 and task 2.
A1 (Xinming Shu et al.)
The team used three different ResNet [16] models for automatic hypertension classification. The first model used the ResNet50 architecture with an additional fully connected layer appended at the end. The second model employed the ResNet34 architecture, also with an additional fully connected layer at the end. The third model followed the ResNet34 architecture with two fully connected linear layers added: one with an input of 1,000 and an output of 50, and another with an input of 50 and an output of 2. During training, the images were resized to a resolution of 512 × 512 pixels and the pixel values were normalized to the range [0, 1]. Data augmentation techniques, including brightness enhancement, contrast enhancement, rotation, and flipping, were applied to improve the robustness and generalization of the model. Finally, the second model achieved the highest average score of 0.6210, with a kappa of 0.3819, an F1 score of 0.6337, and a specificity of 0.8472.
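As an illustration, the following is a PyTorch sketch of the third model variant, whose layer sizes are stated explicitly above; the activation function and pretrained initialization are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the third A1 variant: ResNet34 keeping its 1000-way ImageNet
# head, followed by two appended linear layers (1000 -> 50 -> 2).
class HypertensionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet34(weights="IMAGENET1K_V1")
        self.fc1 = nn.Linear(1000, 50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        x = self.backbone(x)             # (B, 1000) backbone outputs
        x = torch.relu(self.fc1(x))      # activation choice is an assumption
        return self.fc2(x)               # (B, 2) class logits

model = HypertensionNet()
logits = model(torch.randn(1, 3, 512, 512))  # inputs resized to 512 x 512
```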
Fig. 3 [Images not available. See PDF.]
Confusion matrix for hypertension classification task (top row) and hypertensive retinopathy classification task (bottom row). From left to right are the first, second and third ranked results, respectively
A2 and B1 (Shichang Liu et al.)
The team chose ConvNeXt [25] as the baseline classification model. Images were resized to a resolution of 512 × 512, and threefold cross-validation was adopted. Offline data augmentation was applied to enlarge the dataset before training: images were randomly selected from the minority class and augmented with techniques such as pepper noise, Gaussian noise, and photometric distortion to generate new images. During online training, data augmentation techniques such as horizontal flip, vertical flip, random affine, and random perspective were used. The model was initialized with official pretrained weights and fine-tuned on each fold using the AdamW optimizer [26] with a learning rate of 0.00025. For task 1, the model was trained for 35 epochs with a batch size of 16, while for task 2 it was trained for 20 epochs with the same batch size; all other training parameters were consistent between the two tasks. Before submitting the models to the challenge to obtain results on the hidden test set, several experiments were conducted to determine the optimal data augmentation methods and training setup. An initial experiment showed that all four online data augmentation techniques, as well as the offline augmentation, improved the classification results. Further experiments were then carried out with these augmentation methods on different variants of ConvNeXt: ConvNeXt-S, ConvNeXt-B, ConvNeXt-L, and ConvNeXt-B with offline data augmentation. They submitted these four models to the challenge, and the best results were obtained by ConvNeXt-B with offline data augmentation. The results also showed that, despite having more parameters, the larger ConvNeXt-L model was more prone to overfitting.
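A minimal sketch of the online augmentation pipeline and optimizer setup described above, assuming the torchvision ConvNeXt-B implementation; the augmentation magnitudes are our assumptions, as they were not reported:

```python
import torch
from torchvision import models, transforms

# Online augmentations named above; parameter values are assumptions.
train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.RandomPerspective(distortion_scale=0.2),
    transforms.ToTensor(),
])

# ConvNeXt-B with official pretrained weights and a 2-class head,
# fine-tuned with AdamW at the reported learning rate.
model = models.convnext_base(weights="IMAGENET1K_V1")
model.classifier[2] = torch.nn.Linear(model.classifier[2].in_features, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00025)
```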
A7 and B3 (Julio Silva-Rodríguez et al.)
This team explored how well the FLAIR foundation model [39] could be applied to the two tasks within the HRDC challenge. The FLAIR model, initially pretrained on a large dataset of fundus images from various sources and tasks, served as the foundation model. They used two transfer learning strategies, linear probing (LP) and fine-tuning (FT), to adapt the FLAIR model for these tasks. LP involves directly transferring features and tuning only the linear classifier, while FT involves retraining the entire FLAIR model using task-specific data. They also explored a combination of LP and FT, where they first trained the classifier with a frozen backbone, and then fine-tuned the whole network for the target task [18, 21]. They compared the performance of the adapted FLAIR model with baseline models initialized with pretraining features from ImageNet [11].
During training, they scaled the image pixel values to the [0, 1] range. For LP adaptation, they adopted the same LP approach as in CLIP [34] and introduced class weights to deal with class imbalance. In the case of FT, they initialized a classification head with random weights and retrained the whole network for the target task. They employed the Adam optimizer with an initial learning rate of 0.0001 and used a resampling strategy for the minority class to address class imbalance. Data augmentation techniques, including random flips, rotations, scaling, and color jitter, were applied during each training iteration. To evaluate performance, they performed a fivefold cross-validation on the HRDC training dataset: in each fold, 20% of the training images of each class were randomly reserved for evaluation, 70% were used for training, and 10% for internal validation. The internal validation results indicate that using FLAIR features as a starting point yields a performance improvement of approximately 2.5% over ImageNet initialization, and further fine-tuning of the whole network increases this gap to around 4%. These results highlight the benefits of leveraging foundation models pretrained on medical domains for both tasks in the HRDC challenge.
Table 3. Performance comparison of the top three algorithms in HR classification. Ensemble represents the ensemble results of the top three algorithms
Method | Kappa | F1 score | Specificity | Average |
|---|---|---|---|---|
B1 | 0.4154 | 0.6122 | 0.8444 | 0.6240 |
B2 | 0.4026 | 0.6146 | 0.8111 | 0.6095 |
B3 | 0.3109 | 0.5619 | 0.7611 | 0.5446 |
Ensemble | 0.4373 | 0.6337 | 0.8333 | 0.6348 |
Fig. 4 [Images not available. See PDF.]
Blob plots visualizing ranking stability based on bootstrapping (1000 bootstrap samples). The size of each circle corresponds to the relative frequency with which an algorithm achieved a particular rank across the 1000 bootstrap samples. The black cross indicates the median rank for each algorithm, and the black lines represent the 95% intervals calculated across the bootstrap samples
Finally, they chose the LP adaptation as their solution to the HRDC challenge. Although this approach was not the top performer in cross-validation, their intention was to assess the direct transferability of the foundation model in a real-world context. Consequently, they trained a classifier for each task on top of the frozen vision encoder of FLAIR, using the complete challenge training dataset. With this setup, they achieved ranking scores of 0.5 and 0.544 for task 1 and task 2, respectively. In summary, this method convincingly demonstrates that foundation models like FLAIR have the potential to advance deep learning-based fundus image analysis.
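A minimal sketch of linear probing in this spirit, assuming a hypothetical `encode` helper that returns features from the frozen FLAIR vision encoder as an (N, D) array; the logistic classifier and balanced class weights mirror the CLIP-style LP with imbalance handling described above, but are not the team's exact implementation:

```python
from sklearn.linear_model import LogisticRegression

# Sketch of linear probing (LP): only a linear classifier is trained on
# features from the frozen backbone. `encode` is a hypothetical helper.
def linear_probe(train_images, train_labels, test_images, encode):
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(encode(train_images), train_labels)
    return clf.predict(encode(test_images))
```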
Results
In this section, we present the results achieved by the participating teams on the hidden test set and perform an analysis of the stability of their rankings. We also provide an ensemble of the results of the top three algorithms in each task and present a statistical significance analysis of these algorithms.
Task 1: hypertension classification
Table 2 shows the performance comparison of the top three algorithms, and the confusion matrices for these methods are shown in Fig. 3. It can be observed that the non-hypertension class is classified more accurately than the hypertension class. It is worth noting that the dataset maintains a balanced 1:1 distribution of non-hypertensive to hypertensive images in both the training and test sets, eliminating data imbalance as a contributing factor. A possible explanation for the misclassification of many hypertension images as non-hypertensive is that the fundus changes caused by hypertension are subtle, making them difficult for deep learning algorithms to detect effectively.
Task 2: hypertensive retinopathy classification
Table 3 gives an overview of the performance comparison of the top three algorithms, and the corresponding confusion matrices are shown in Fig. 3. The results make clear that the non-HR class is classified more accurately than the HR class. Although there is a slight data imbalance, with approximately 60% of the dataset being non-HR images in both the training and test sets, we do not believe this factor is the primary cause of the superior detection accuracy in the non-HR class. For instance, method B1 used a resampling technique (offline data augmentation) to maintain a 1:1 ratio of non-HR to HR training images, yet the gap in detection accuracy persists. We therefore suggest that detecting the lesions associated with HR is challenging for the algorithms, resulting in the comparatively lower accuracy in this category.
Ensemble of top three algorithms
The power of ensemble models to improve performance has been well documented in previous studies [14, 20, 46]. It is therefore of interest to explore the potential of ensembling the top three methods in each task. In both classification tasks, we adopted a majority voting strategy over the top three results to generate the ensemble result. For hypertension classification, the ensemble performance is shown in Table 2: its average score is 0.7% lower than that of the best single method, mainly due to a 5.55% decrease in specificity, although it improves kappa and F1 score by 0.7% and 2.78%, respectively, compared to the first-ranked method. For the HR classification task, the ensemble performance is shown in Table 3: the ensemble outperforms the first-ranked method by 1.08% in average score, with improvements of 2.19% and 2.15% in kappa and F1 score, respectively, despite a specificity 1.11% lower than that of the first-ranked method. From these results, we can see that ensembling typically improves kappa and F1 scores, although often at the cost of a decrease in specificity.
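The majority voting used here reduces to a simple threshold over the three binary predictions; a minimal sketch with hypothetical per-model predictions:

```python
import numpy as np

# Majority voting over the binary predictions of the top three methods:
# each image receives the class predicted by at least two of the models.
def majority_vote(preds):                 # preds: (3, N) array of 0/1 labels
    return (preds.sum(axis=0) >= 2).astype(int)

preds = np.array([[1, 0, 1, 1],           # hypothetical per-model predictions
                  [1, 0, 0, 1],
                  [0, 1, 1, 1]])
print(majority_vote(preds))               # -> [1 0 1 1]
```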
Ranking stability
Fig. 5 [Images not available. See PDF.]
Violin plots illustrating Kendall’s $\tau$ to visualize ranking stability via bootstrapping (1000 bootstrap samples). The ranking derived from the whole test data is compared pairwise with the ranking derived from each bootstrap. For each pair of rankings, Kendall’s $\tau$ is calculated, and the results are used to generate a violin plot
Table 4. Kendall’s using different metrics as the ranking criteria
Kappa | F1 score | Specificity | Average | |
|---|---|---|---|---|
Task 1 | 0.7573 | 0.5206 | 0.1420 | 0.7398 |
Task 2 | 0.8119 | 0.7359 | 0.7020 | 0.8255 |
Each column separately presents the value of Kendall’s when using that metric as the ranking criterion
Inspired by the challengeR toolkit [43], we performed bootstrapping with 1000 bootstrap samples to evaluate the ranking stability of the results. Ranking stability was quantified by the agreement between the challenge ranking and the ranking obtained on each bootstrap sample of the test set images. Kendall’s $\tau$ is used as the rank correlation coefficient; it yields values ranging from $-1$ (a reversed ranking order) to 1 (an identical ranking order). The violin plots in Fig. 5 show the results of the bootstrap analysis for each task, revealing Kendall’s $\tau$ values of 0.7398 and 0.8255 for the two tasks, respectively. In addition, Fig. 4 provides blob plots illustrating the bootstrap rankings for each task. Beyond assessing ranking stability with the average score of the three metrics, we also evaluated it with each individual metric, as shown by the Kendall’s $\tau$ values in Table 4. It is worth noting that although ranking by the average score does not always yield the highest stability (in task 1, for example, it is slightly less stable than ranking by kappa alone), it is consistently the optimal or near-optimal choice.
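A minimal sketch of this bootstrap analysis, in the spirit of challengeR; here `score_fn` stands for the ranking score (the average of the three metrics), `y_true` and each prediction array are NumPy arrays, and the details are our assumptions rather than the toolkit’s exact implementation:

```python
import numpy as np
from scipy.stats import kendalltau

def ranks(scores):
    # Double argsort converts scores into rank positions (rank 0 = best).
    return np.argsort(np.argsort(-np.asarray(scores)))

# Resample test cases with replacement, re-rank the algorithms on each
# bootstrap sample, and compare against the full-test-set ranking.
def bootstrap_taus(y_true, preds, score_fn, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    full = ranks([score_fn(y_true, p) for p in preds])
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot = ranks([score_fn(y_true[idx], p[idx]) for p in preds])
        tau, _ = kendalltau(full, boot)
        taus.append(tau)
    return np.array(taus)
```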
Statistical significance analysis
Fig. 6 [Images not available. See PDF.]
Significance maps depicting incidence matrices of pairwise significance test results for the one-sided Wilcoxon signed-rank test at a 5% significance level, with adjustment for multiple testing according to Holm. Yellow shading indicates that the ranking score of the algorithm on the x-axis is significantly superior to that of the algorithm on the y-axis; blue indicates no significant difference
In each challenge task, a statistical comparison of the team scores was conducted using the one-sided Wilcoxon signed-rank test at a significance level of 5%. The challengeR toolkit [43] was used to perform the significance analysis and generate the significance maps, shown in Fig. 6 for both tasks. In the hypertension classification task, the best performing algorithm, A1, shows significant superiority over the ensemble result, A2 and A3. In the HR classification task, the ensemble model shows significant superiority over algorithms B1, B2, and B3.
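A sketch of such a pairwise analysis; we assume per-case score arrays as input (one per method) and implement the Holm step-down adjustment directly, which is not the challengeR implementation itself:

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

# One-sided Wilcoxon signed-rank tests on paired per-case scores for each
# ordered pair of methods, with Holm adjustment of the p-values.
def significant_pairs(case_scores, names, alpha=0.05):
    pairs = list(combinations(range(len(names)), 2))
    pairs += [(j, i) for i, j in pairs]          # both directions (one-sided)
    pvals = np.array([wilcoxon(case_scores[i], case_scores[j],
                               alternative="greater").pvalue
                      for i, j in pairs])
    order = np.argsort(pvals)                    # Holm: step down sorted p-values
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for k, o in enumerate(order):
        if pvals[o] > alpha / (m - k):
            break
        reject[o] = True
    return [(names[i], names[j])                 # "i significantly beats j"
            for (i, j), r in zip(pairs, reject) if r]
```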
Discussion
Findings for the classification algorithms
In line with the evolution of the literature in the field, we observed that the proposed solutions for hypertension and HR detection were generally based on commonly used CNNs for image classification, such as ResNet and ConvNeXt. Since training such deep architectures from scratch on a training set of only 712 images is prone to overfitting, the common solution was to initialize the CNNs with pretrained ImageNet weights and use K-fold cross-validation. Data augmentation techniques were also widely used by these teams to reduce overfitting. The ensemble results illustrate the potential of model ensembling to improve performance: although an ensemble does not always yield the highest accuracy, on unseen data it produces outputs that are stable and closer to the best results than those of a single model.
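For the K-fold strategy mentioned above, a minimal sketch of a stratified split (threefold, as used by A2/B1); the dummy labels stand in for the per-image annotations:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0, 1] * 356)              # placeholder for 712 training labels
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(labels, labels)):
    # Each fold preserves the class ratio in both splits.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```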
The recently popular foundation models also have good prospects for application in medical image analysis. One team fine-tuned the FLAIR model on the challenge dataset. Although it did not obtain the best results, there are many advantages to fine-tuning a foundation model. Firstly, foundation models have undergone pretraining on extensive datasets, enabling more precise extraction of lesion features, particularly in medical image analysis tasks with limited data. Additionally, fine-tuning these models reduces algorithm development time, achieving more accurate diagnostic results with less data and shorter training periods. We believe that applications of foundation models to medical image analysis will continue to grow.
Limitations and future work
One limitation of the HRDC challenge is the lack of diversity in ethnicity and imaging devices in its dataset: the images come from a Chinese population and were all captured with the same device. Ethnicities manifest differently in color fundus photographs due to differences in fundus pigmentation. Therefore, it cannot be guaranteed that the best performing models in this challenge would achieve the same outcomes on a different population without retraining. Another limitation is that our challenge did not include an image quality assessment task. Image quality assessment is important in clinical diagnosis and practical screening, as it influences the diagnostic results. However, other AI fundus imaging challenges include an image quality assessment task, and such methods can be applied in practical screening to remove images of poor quality.
In our future efforts for the HRDC challenge, we intend to expand the dataset, providing images from multiple centers for model training and validation. Moreover, we will make the HRDC dataset accessible to the community through the HRDC website, with the hope that it will prove highly valuable for researchers addressing various topics in this field. We will also maintain an open post-challenge registration and submission system beyond the end of this challenge, encouraging the evaluation of innovative solutions and driving progress in the domain of automated HR analysis.
Conclusions
The HRDC challenge provides a benchmark and evaluation framework for automatic hypertension and hypertensive retinopathy classification from fundus images. In this paper, we thoroughly summarized and discussed the algorithms and results of the participating teams. The challenge illuminates the potential of deep learning models to aid healthcare professionals in accurately diagnosing conditions like hypertensive retinopathy. The availability of labeled datasets and innovative models is a positive step toward accelerating the development of computer-assisted diagnostic systems for hypertensive retinopathy. However, it is essential to acknowledge that much work remains: further research and improvements are necessary to make these models clinically applicable and robust.
Acknowledgements
This work was supported by National Key R & D Program of China (2022YFC2502802), National Natural Science Foundation of China (62272298 and 8238810007), Beijing Natural Science Foundation (IS23096), the College-level Project Fund of Shanghai Sixth People’s Hospital (ynlc201909), the Interdisciplinary Program of Shanghai Jiao Tong University (YG2022QN089), Clinical Special Program of Shanghai Municipal Health Commission (20224044), Chronic Disease Health Management and Comprehensive Intervention Based on Big Data Application (GWVI-8), and Research on Health Management Strategy and Application of Elderly Population (GWVI-11.1-28).
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Abbas, Q; Ibrahim, ME. Densehyper: an automatic recognition system for detection of hypertensive retinopathy using dense features transform and deep-residual learning. Multimed. Tools Appl.; 2020; 79, pp. 31595-31623. [DOI: https://dx.doi.org/10.1007/s11042-020-09630-x]
2. Abbas, Q; Qureshi, I; Ibrahim, ME. An automatic detection and classification system of five stages for hypertensive retinopathy using semantic and instance segmentation in densenet architecture. Sensors; 2021; 21,
3. Akbar, S., Akram, M.U., Sharif, M., Tariq, A., Khan, S.A.: Decision support system for detection of hypertensive retinopathy using arteriovenous ratio. Artif. Intell. Med. 90, 15–24 (2018)
4. Arsalan, M., Haider, A., Choi, J., Park, K.R.: Diabetic and hypertensive retinopathy screening in fundus images using artificially intelligent shallow architectures. J. Pers. Med. 12(1), 7 (2021)
5. Badawi, S.A., Fraz, M.M., Shehzad, M., Mahmood, I., Javed, S., Mosalam, E., Nileshwar, A.K.: Detection and grading of hypertensive retinopathy using vessels tortuosity and arteriovenous ratio. J. Digit. Imaging. pp. 1–21 (2022)
6. Cavallari, M., Stamile, C., Umeton, R., Calimeri, F., Orzi, F., et al.: Novel method for automated analysis of retinal images: results in subjects with hypertensive retinopathy and CADASIL. BioMed Res. Int. 2015 (2015)
7. Cheung, CY; Biousse, V; Keane, PA; Schiffrin, EL; Wong, TY. Hypertensive eye disease. Nat. Rev. Dis. Primers.; 2022; 8,
8. Chhajer, B. High Blood Pressure; 2014; New Delhi, Diamond Pocket Books Pvt Ltd:
9. Dai, L., Sheng, B., Chen, T., Wu, Q., Liu, R., Cai, C., Wu, L., Yang, D., Hamzah, H., Liu, Y., et al.: A deep learning system for predicting time to progression of diabetic retinopathy. Nat. Med. pp. 1–11 (2024)
10. Dai, L; Wu, L; Li, H; Cai, C; Wu, Q; Kong, H; Liu, R; Wang, X; Hou, X; Liu, Y et al. A deep learning system for detecting diabetic retinopathy across the disease spectrum. Nat. Commun.; 2021; 12,
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
12. Fang, H; Li, F; Fu, H; Sun, X; Cao, X; Lin, F; Son, J; Kim, S; Quellec, G; Matta, S et al. Adam challenge: detecting age-related macular degeneration from fundus images. IEEE Trans. Med. Imaging; 2022; 41,
13. Fu, Y; Chen, Q; Zhao, H. Cgfnet: cross-guided fusion network for rgb-thermal semantic segmentation. Vis. Comput.; 2022; 38,
14. Ganaie, M.A., Hu, M., et al.: Ensemble deep learning: a review. arXiv preprint arXiv:2104.02395 (2021)
15. Guan, Z., Li, H., Liu, R., Cai, C., Liu, Y., Li, J., Wang, X., Huang, S., Wu, L., Liu, D., et al.: Artificial intelligence in diabetes management: advancements, opportunities, and challenges. Cell Rep. Med. (2023)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
17. Holm, S; Russell, G; Nourrit, V; McLoughlin, N. Dr hagis-a fundus image database for the automatic extraction of retinal surface vessels from diabetic patients. J. Med. Imaging; 2017; 4,
18. Kanavati, F., Tsuneki, M.: Partial transfusion: on the expressive influence of trainable batch norm parameters for transfer learning. In: Medical Imaging with Deep Learning, pp. 338–353. PMLR (2021)
19. Kauppi, T; Kalesnykiene, V; Kamarainen, JK; Lensu, L; Sorri, I; Raninen, A; Voutilainen, R; Uusitalo, H; Kälviäinen, H; Pietilä, J. The diaretdb1 diabetic retinopathy database and evaluation protocol. BMVC; 2007; 1, 10.
20. Khened, M; Kollerathu, VA; Krishnamurthi, G. Fully convolutional multi-scale residual densenets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Med. Image Anal.; 2019; 51, pp. 21-45. [DOI: https://dx.doi.org/10.1016/j.media.2018.10.004]
21. Kumar, A., Raghunathan, A., Jones, R., Ma, T., Liang, P.: Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054 (2022)
22. Li, X; Huang, H; Zhao, H; Wang, Y; Hu, M. Learning a convolutional neural network for propagation-based stereo image segmentation. Vis. Comput.; 2020; 36, pp. 39-52. [DOI: https://dx.doi.org/10.1007/s00371-018-1582-y]
23. Li, Y., Wang, Z., Yin, L., Zhu, Z., Qi, G., Liu, Y.: X-net: a dual encoding–decoding method in medical image segmentation. Vis. Comput. pp. 1–11 (2023)
24. Liu, R., Wang, X., Wu, Q., Dai, L., Fang, X., Yan, T., Son, J., Tang, S., Li, J., Gao, Z., et al.: Deepdrid: diabetic retinopathy-grading and image quality estimation challenge. Patterns p. 100512 (2022)
25. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11,976–11,986 (2022)
26. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
27. Nagpal, D., Panda, S.N., Malarvel, M.: Hypertensive retinopathy screening through fundus images-a review. In: 2021 6th International Conference on Inventive Computation Technologies (ICICT), pp. 924–929. IEEE (2021)
28. Organization, W.H., et al.: Hypertension control: report of a WHO Expert Committee. World Health Organization (1996)
29. Orlando, JI; Fu, H; Breda, JB; Van Keer, K; Bathula, DR; Diaz-Pinto, A; Fang, R; Heng, PA; Kim, J; Lee, J et al. Refuge challenge: a unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med. Image Anal.; 2020; 59, 101570. [DOI: https://dx.doi.org/10.1016/j.media.2019.101570]
30. Pavao, A., Guyon, I., Letournel, A.C., Baró, X., Escalante, H., Escalera, S., Thomas, T., Xu, Z.: Codalab competitions: an open source platform to organize scientific challenges. Ph.D. thesis, Université Paris-Saclay, FRA (2022)
31. Poplin, R; Varadarajan, AV; Blumer, K; Liu, Y; McConnell, MV; Corrado, GS; Peng, L; Webster, DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng.; 2018; 2,
32. Porwal, P; Pachade, S; Kokare, M; Deshmukh, G; Son, J; Bae, W; Liu, L; Wang, J; Liu, X; Gao, L et al. Idrid: diabetic retinopathy-segmentation and grading challenge. Med. Image Anal.; 2020; 59, pp. 101-561. [DOI: https://dx.doi.org/10.1016/j.media.2019.101561]
33. Qian, B., Chen, H., Wang, X., Guan, Z., Li, T., Jin, Y., Wu, Y., Wen, Y., Che, H., Kwon, G., et al.: Drac 2022: a public benchmark for diabetic retinopathy analysis on ultra-wide optical coherence tomography angiography images. Patterns (2024)
34. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
35. Rajagopalan, S; Al-Kindi, SG; Brook, RD. Air pollution and cardiovascular disease: Jacc state-of-the-art review. J. Am. Coll. Cardiol.; 2018; 72,
36. Sajid, MZ; Qureshi, I; Abbas, Q; Albathan, M; Shaheed, K; Youssef, A; Ferdous, S; Hussain, A. Mobile-hr: an ophthalmologic-based classification system for diagnosis of hypertensive retinopathy using optimized mobilenet architecture. Diagnostics; 2023; 13,
37. Shajini, M., Ramanan, A.: A knowledge-sharing semi-supervised approach for fashion clothes classification and attribute prediction. Vis. Comput. 38(11), 3551–3561 (2022)
38. Sheng, B., Guan, Z., Lim, L.L., Jiang, Z., Mathioudakis, N., Li, J., Liu, R., Bao, Y., Bee, Y.M., Wang, Y.X., et al.: Large language models for diabetes care: potentials and prospects. Sci. Bull. pp. S2095–9273 (2024)
39. Silva-Rodriguez, J., Chakor, H., Kobbi, R., Dolz, J., Ayed, I.B.: A foundation language-image model of the retina (flair): encoding expert knowledge in text supervision. arXiv preprint arXiv:2308.07898 (2023)
40. Staal, J; Abràmoff, MD; Niemeijer, M; Viergever, MA; Van Ginneken, B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging; 2004; 23,
41. Suman, S., Tiwari, A.K., Singh, K.: Computer-aided diagnostic system for hypertensive retinopathy: a review. Comput. Methods Prog. Biomed. p. 107627 (2023)
42. Tsukikawa, M., Stacey, A.W.: A review of hypertensive retinopathy and chorioretinopathy. Clin. Optomet. pp. 67–73 (2020)
43. Wiesenfarth, M; Reinke, A; Landman, BA; Eisenmann, M; Saiz, LA; Cardoso, MJ; Maier-Hein, L; Kopp-Schneider, A. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci. Rep.; 2021; 11,
44. Wong, TY; Mitchell, P. Hypertensive retinopathy. N. Engl. J. Med.; 2004; 351,
45. Wu, X; Sahoo, D; Hoi, SC. Recent advances in deep learning for object detection. Neurocomputing; 2020; 396, pp. 39-64. [DOI: https://dx.doi.org/10.1016/j.neucom.2020.01.085]
46. Xie, F; Fan, H; Li, Y; Jiang, Z; Meng, R; Bovik, A. Melanoma classification on dermoscopy images using a neural network ensemble model. IEEE Trans. Med. Imaging; 2016; 36,
47. Zhang, L; Yuan, M; An, Z; Zhao, X; Wu, H; Li, H; Wang, Y; Sun, B; Li, H; Ding, S et al. Prediction of hypertension, hyperglycemia and dyslipidemia from retinal fundus photographs via deep learning: a cross-sectional study of chronic diseases in central china. PLoS ONE; 2020; 15,
48. Zhou, B; Carrillo-Larco, RM; Danaei, G; Riley, LM; Paciorek, CJ; Stevens, GA; Gregg, EW; Bennett, JE; Solomon, B; Singleton, RK et al. Worldwide trends in hypertension prevalence and progress in treatment and control from 1990 to 2019: a pooled analysis of 1201 population-representative studies with 104 million participants. The Lancet; 2021; 398,
49. Zhu, C; Zou, B; Zhao, R; Cui, J; Duan, X; Chen, Z; Liang, Y. Retinal vessel segmentation in colour fundus images using extreme learning machine. Comput. Med. Imaging Graph.; 2017; 55, pp. 68-77. [DOI: https://dx.doi.org/10.1016/j.compmedimag.2016.05.004]