INTRODUCTION
Diabetic retinopathy (DR), one of the most common complications of diabetes, is the major cause of irreversible blindness and visual disability in adults aged 20–74 years [1]. Studies have found that among patients with type 1 diabetes with onset before the age of 30, the 10-year cumulative incidence of DR is 90%; for those presumed to have type 2 diabetes, it is estimated to be 66% [2]. Treatment of DR is most effective in the initial stage, when a cure rate of 90% can be achieved [3]. Therefore, the early detection and treatment of DR can reduce the risk of eventual blindness. In the examination stage, retinal imaging techniques mainly include 2D imaging techniques (such as red-free colour fundus photography, optical fundus photography, fundus stereo photography, and scanning laser ophthalmoscope [SLO] technology) and 3D imaging technology, including computed tomography, magnetic resonance imaging (MRI), ultrasound imaging, infrared thermography, colour Doppler imaging and optical coherence tomography (OCT) [4, 5]. This paper is mainly based on 2D colour retinal fundus images, as shown in Figure 1a. According to the proposed international clinical diabetic retinopathy disease severity scales [6], the severity of DR can be graded into five stages (i.e. normal, mild, moderate, severe, and proliferative DR [PDR]). In the clinical diagnosis of DR, the presence of lesions such as microaneurysms (MAs), hemorrhages (HEs), exudates (EXUs), and retinal neovascularisation (RNV) determines the grade of DR in a patient [7]. Specifically, MAs represent the most primitive and perceptible signs of retinal injury, because the abnormal permeability of retinal blood vessels leads to their formation. As the disease progresses, the retinal vein vessels may swell and rupture, making them unable to transport blood and leading to the appearance of bleeding points and hard exudates.
At the PDR stage, the most obvious feature is the formation of new blood vessels, and lesions at this stage may lead to blindness. Ophthalmologists normally spend a long time examining a patient's fundus images for symptoms, which may delay treatment. To improve efficiency, a number of automatic DR detection approaches have been proposed recently.
[IMAGE OMITTED. SEE PDF]
Deep learning methods have been widely used in DR detection tasks [8, 9]. Deep convolutional neural networks (DCNNs) [10–12] have proven to be a powerful tool for processing fundus images in DR grading. The convolutional operation of DCNNs [13–15] is a local inter-correlation operation that extracts features from each operation block in the image. Here, an operation block can be seen as the local region covered by the convolution operation, whose size is determined by the size of the convolution kernel. The convolutional operation extracts features by continually learning the features within operation blocks, but it does not take the relationships between different operation blocks into account. In DR detection tasks, not only do the features of local blocks provide important information; the global connections between scattered lesion patches (such as microaneurysms, hemorrhages, and exudates) on fundus images are also useful. For instance, when hemorrhages appear in different places of a fundus image, there are correlations between these pixel patches due to the similarity of the lesions. To take advantage of the relationships between these long-range patches, our method draws on the idea of the patch-wise non-local means denoising method [16]. We compute the weights between a patch and the other patches in the input feature map according to their similarity, as shown in Figure 1. The weights between a patch and all other patches can be regarded as the correlations between one operation block and the other operation blocks. The more similar the compared patches, the higher the weight. The response of a patch is the weighted sum of all the other patches, and it is used to update the pixel values of that feature patch. For the implementation of the non-local means algorithm in DCNNs, Wang et al. [17] proposed non-local neural networks for video classification.
However, in the spatial dimension, that work only considers pixel-wise relationships in the feature map. Since lesions usually take the form of plaques, exploring the relationships between feature patches is more appropriate for DR detection tasks.
In this paper, we propose a novel DCNN-based method that uses both the local and the long-range global dependencies of features in images. We also design a residual structure for the Long-Range block, which allows us to flexibly insert our long-range operation unit into any trained network without breaking it. According to clinical experience, the features of lesions are crucial to the diagnosis of DR; it can thus be inferred that lesion information also plays an important role in the DR detection task. In the absence of pixel-level labelling data, it is difficult to obtain specific information about the lesions in fundus images. This work utilises the relationships between pixel patches to enhance the features of long-distance lesions in fundus images, so that small and scattered lesion information is not easily lost during image processing.
Compared to the previous DCNN-based DR detection methods, this work has the following advantages and contributions:
-
The proposed network retains local-dependence and global-dependence features to enhance representations.
-
The global dependencies of the DR image are captured by mining the correlations between any two patches in the input feature map.
-
In the proposed network, the patch-wise Long-Range unit can be flexibly embedded into the trained networks without breaking it.
-
Experimental results show that our proposed method outperforms other state-of-the-art methods for DR detection, and the computational time is also reduced.
The remaining parts of this paper are organised as follows. The related works of this paper are reviewed in Section 2. Section 3 shows the details of the implementation and architecture of our method. Section 4 describes the experimental results and analysis on the Messidor dataset. Finally, the discussion and conclusion are given in Section 5.
RELATED WORK
Non-local means operation
The non-local means algorithm [16], a classical denoising method, is based on one principle: replacing the pixel to be denoised with a weighted average of the other similar pixels. Compared with some local smoothing filters (e.g. median filtering and mean filtering), it allows long-distance pixels to contribute to a given pixel by searching for similar pixels in a larger neighbourhood. The algorithm was later implemented in pixel-wise and patch-wise forms by Buades et al. [18], and the patch-wise method was found to perform better in image denoising. The method proposed in this paper mainly draws on the patch-wise implementation, because it is more in line with the needs of practical applications.
We assume that $\hat{u}(D)$ is the output value obtained after denoising, $u(D)$ represents the input image patch that needs to be denoised, and $w(A, D)$ represents the weight between the image patch $D$ to be denoised and another image patch $A$:

$$\hat{u}(D) = \sum_{A} w(A, D)\, u(A), \qquad 0 \le w(A, D) \le 1, \quad \sum_{A} w(A, D) = 1.$$

The objective image patch is denoted as $A = A(p, f)$, centred at pixel $p$ with a width of $(2f + 1) \times (2f + 1)$ pixels; the compared patch $D$ is centred at $q$ with the same size. The weight $w(A, D)$ depends on the squared Euclidean distance

$$d^2(A, D) = \lVert u(A) - u(D) \rVert_2^2.$$

The Euclidean distance is regarded as measuring the similarity between two patches: the higher the similarity, the greater the weight. Thus, the weights can be expressed as a monotonically decreasing function of $d^2$:

$$w(A, D) = \frac{1}{C(D)} \exp\!\left(-\frac{d^2(A, D)}{h^2}\right), \qquad C(D) = \sum_{A} \exp\!\left(-\frac{d^2(A, D)}{h^2}\right),$$

where $h$ is a filtering parameter.
This work draws on the treatment of non-local patch relations in the non-local means algorithm and uses the degree of similarity between patches to quantify their global relations.
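To make the patch-wise weighting above concrete, the following is a minimal NumPy sketch (not the paper's implementation; the function name and the filtering parameter `h` are illustrative):

```python
import numpy as np

def nl_means_response(patches, h=0.5):
    """Patch-wise non-local means sketch: the response of each patch is a
    weighted average of all patches, with weights decreasing monotonically
    in the squared Euclidean distance between patches."""
    P = patches.reshape(len(patches), -1).astype(float)  # flatten each patch
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    w = np.exp(-d2 / h ** 2)                             # monotonically decreasing weights
    w /= w.sum(axis=1, keepdims=True)                    # normalise: each row sums to 1
    return (w @ P).reshape(patches.shape)                # weighted sum over all patches
```

Because the weights of identical patches are equal and sum to one, a stack of identical patches is returned unchanged; dissimilar patches contribute exponentially less to each response.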
CNNs-based methods for diabetic retinopathy detection
Recently, deep learning methods have been widely applied in the area of computer vision. As a prominent branch of deep learning networks, CNNs have shown exceptional performance in image processing tasks. As such, more and more CNN-based methods have been proposed to assist ophthalmologists in DR detection. CNN-based DR detection methods can be mainly divided into two categories: pixel-level supervised methods and image-level supervised methods.
Pixel-level supervised methods use the information of tiny lesion features (e.g. MAs, HEs, and EXUs). For lesion detection, Dai et al. [19] combined fundus images and expert reports to develop an intersecting deep mining model for the detection of microaneurysms. Van Grinsven et al. [20] proposed an enhanced ConvNet for the detection of bleeding in fundus images, which can dynamically select misclassified negative samples during the training phase. Ref. [21] uses a convolutional neural network to classify pixels as 'part of an exudate' or 'part of a non-exudate'. To both identify lesions and give the DR severity grade, Yang et al. [22] proposed a two-stage deep convolutional neural network. Lin et al. [23] proposed a new framework that diagnoses DR grades by combining the original images and lesion features, and designed a lesion detection model to reduce the impact of samples with missing annotation frames. To improve the DR detection performance of the model, Zhou et al. [24] extracted lesion information via an attention model in a semi-supervised way before classifying DR grades. However, the detection of lesions relies on a large amount of pixel-level annotation data, which severely limits the performance of such models because of the high cost of annotating medical images.
The image-level supervised methods grade DR directly by training a classification model with fundus images [10, 25–27]. Gulshan et al. [25] designed a classification model based on the Google Inception v3 network to divide samples into normal and referable DR. Pratt et al. [26] proposed an improved CNN-based model that could measure the severity of DR. Gargeya et al. [27] developed a robust CNN-based automated DR screening algorithm. Wang et al. [10] predicted the severity of DR by combining the high-resolution suspicious patches highlighted in the preprocessing procedure of the model.
However, most image-level supervised methods use CNNs as a black box for DR detection. In this way, the features of tiny lesions in the images are lost during the feature extraction phase. Therefore, we propose a novel DCNN-based model that mines the interrelations between similar lesion patches and strengthens the local lesion features.
Transformer-based methods for diabetic retinopathy detection
Recently, to learn long-distance dependent features, the transformer-based image processing method ViT [28] has been introduced into the field of computer vision. Transformer-based methods have since been used in a variety of image processing tasks, including DR detection. Wu et al. [29] proposed a ViT-based DR classification model that embeds fundus image patches together with location information; the generated embedding sequence is fed into a model containing an attention mechanism and a multi-layer perceptron (MLP) for training, with the aim of predicting DR levels. Sun et al. [30] proposed a novel lesion-aware transformer (LAT) model comprising a pixel-relation-based encoder and a lesion-filter-based decoder. The encoder models pixel correlations and generates enhanced feature maps using self-attention mechanisms; the decoder recognises different lesion regions by learning lesion-aware filters. Kamran et al. [31] created a novel network that can generate normal or diseased images of retinal blood vessels while predicting retinal diseases. Meanwhile, inspired by the ViT algorithm, Yu et al. [32] proposed a multi-instance learning (MIL) module that can be easily inserted into and instantly connected to ViT, which effectively improves the performance of the model for subsequent fundus image classification. Furthermore, Papadopoulos et al. [33] designed a neural network based on multi-instance learning, which extracts image information from embedded patches to predict referable DR in fundus images and generates focal attention heat maps. While these methods can capture global features, the large number of parameters of transformer-based models makes them highly dependent on the scale of the dataset, so they are less effective on small-scale DR datasets.
Therefore, to extract long-range features effectively, we construct the Long-Range unit to extract features with long-distance dependence.
METHODS
Network architecture
In this section, we present the details of the proposed method for DR detection. An architectural diagram of the proposed network is shown in Figure 2. The core idea of our design is to use the correlations of long-range patches to enhance the sensitivity to global features of CNN networks, which have an innate advantage in local feature mining. This enables our network not only to learn the local parts of the image well, but also to flexibly acquire long-distance dependence characteristics. We regard the features obtained by the traditional convolution operation as local-dependence features: in a feature map with local dependence, there are strong dependencies between elements within the same operation block. Long-range-dependence features, in contrast, are the features computed by the Long-Range units, which are embedded at several points in the network and play the important role of mining the relationships between long-range patches. In a feature map with long-range dependence, the feature elements in one operation block are calculated from elements in other operation blocks and are thus related to features at different locations. Our proposed network inserts Long-Range units into the CNN. In this way, the network first learns local features and then obtains long-distance features through the Long-Range computation, organically combining local- and long-range-dependence features.
[IMAGE OMITTED. SEE PDF]
As shown in Figure 2 and Table 1, the proposed network is equipped with Long-Range units, basic standard convolution layers (3 × 3 kernel), max-pooling layers, and the inception modules proposed in Ref. [34]. In the course of our experiments, we found that the results are better when the Long-Range units are placed in the front and middle layers of the network; we therefore distributed three Long-Range units across the middle layers. The colour fundus image, as the input of the network, is resized to 299 × 299 with three colour channels (R, G, B). First, the input passes through the standard convolution layers, which extract high-level image features and reduce the computational complexity of the following Long-Range unit. Based on the non-local means algorithm, the Long-Range units compute the similarities of long-range patches and use these similarities to enhance the features. Then, three Long-Range units and three groups of inception blocks alternate to form the next part of the network. Finally, a fully connected layer with a softmax function classifies the feature map into the different levels of DR. Here, the number of categories is denoted by c_num.
TABLE 1 The architecture of the proposed network
| Type | Kernel size, stride or remarks | Input size |
| Conv | 3 × 3, 2 | 299 × 299 × 3 |
| Conv | 3 × 3, 1 | 149 × 149 × 32 |
| Conv padded | 3 × 3, 1 | 147 × 147 × 32 |
| Max_Pool | 3 × 3, 2 | 147 × 147 × 64 |
| Conv | 3 × 3, 1 | 73 × 73 × 64 |
| Conv | 3 × 3, 2 | 71 × 71 × 80 |
| Conv | 3 × 3, 1 | 35 × 35 × 192 |
| Long-range unit | As in Figure 3 | 35 × 35 × 192 |
| 3× inception modules | Inception block [34] | 35 × 35 × 192 |
| Long-range unit | As in Figure 3 | 35 × 35 × 288 |
| 5× inception modules | Inception block [34] | 35 × 35 × 288 |
| Long-range unit | As in Figure 3 | 17 × 17 × 768 |
| 3× inception modules | Inception block [34] | 17 × 17 × 768 |
| Max_Pool | 8 × 8 | 8 × 8 × 2048 |
| Fully connected layer | Linear | 1 × 1 × 2048 |
| Classifier | Softmax | 1 × 1 × c_num |
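The spatial sizes in Table 1 follow from standard convolution arithmetic. As a sanity check, the stem of the network can be traced with a short script (layer parameters transcribed from Table 1; the padding values are inferred, since only the third convolution is listed as padded):

```python
def conv_out(size, kernel, stride, pad=0):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (kernel, stride, padding) for the stem layers in Table 1
stem = [
    (3, 2, 0),  # Conv 3x3/2:      299 -> 149
    (3, 1, 0),  # Conv 3x3/1:      149 -> 147
    (3, 1, 1),  # Conv padded 3x3: 147 -> 147
    (3, 2, 0),  # Max_Pool 3x3/2:  147 -> 73
    (3, 1, 0),  # Conv 3x3/1:      73  -> 71
    (3, 2, 0),  # Conv 3x3/2:      71  -> 35
]

size, trace = 299, [299]
for k, s, p in stem:
    size = conv_out(size, k, s, p)
    trace.append(size)
```

The trace reproduces the input sizes of the successive rows of Table 1 (299, 149, 147, 147, 73, 71, 35).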
The proposed network is trained following the process of Algorithm 1. We assume that the input training set is $\langle O, L \rangle$, where $O = [O_1, \ldots, O_k, \ldots, O_N] \in \mathbb{R}^{N \times C_o \times H_o \times W_o}$ is a set of colour fundus images and $L$ is the corresponding label set. In detail, $N$ is the number of samples in the training set, $C_o$ denotes the number of colour channels, and $H_o$ and $W_o$ denote the height and width of the images, respectively. The true label set is $L = [l_1, \ldots, l_k, \ldots, l_N]$, in which $l_k$ is the label of sample $O_k$. The output of the network, $Y = [y_1, \ldots, y_k, \ldots, y_N]$, is the predicted label set of the input images. We train the model until the loss converges. The loss function is the cross-entropy loss, which can be written as

$$\mathcal{L} = -\frac{1}{N} \sum_{k=1}^{N} \sum_{c=1}^{c\_num} l_{k,c} \log y_{k,c},$$

where $l_{k,c}$ and $y_{k,c}$ denote the one-hot true label and the predicted probability of class $c$ for sample $k$, respectively.
Algorithm 1 The training process of the proposed network
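As a concrete reference for the loss minimised during training, the following is a minimal NumPy sketch of the softmax cross-entropy computation (illustrative, assuming the standard cross-entropy for a softmax classifier; the paper's exact training settings are not specified here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over N samples; `labels` holds integer class
    indices, equivalent to one-hot targets in the loss equation."""
    probs = softmax(logits)
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
```

With uniform logits over four classes the loss is log 4, and it approaches zero as the predicted probability of the true class approaches one.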
In the next part of this section, we introduce the definition of the proposed patch-wise Long-Range unit and the implementation of the proposed networks.
Methodology for long-range unit
Given a feature map $X \in \mathbb{R}^{C \times H \times W}$, we define a patch as $B_i = B(i, r),\ i \in I$, and the compared patch as $Q_j = Q(j, r),\ j \in I$, where $I$ is the set of patch indices and $r$ denotes the patch size. The response of the patch $B_i$ is regarded as the weighted sum of all patches in $I$. The objective function of the Long-Range unit can be written as

$$\hat{B}_i = \frac{1}{\mathcal{C}(B_i)} \sum_{j \in I} f(B_i, Q_j)\, Q_j, \tag{1}$$

where $f(\cdot, \cdot)$ computes the similarity between two patches and $\mathcal{C}(B_i)$ is a normalisation factor.
The Long-Range unit differs from the non-local means filter [18] in the way the compared patches are captured: the Long-Range unit uses all the patches in the input feature map to compute the response of a patch, whereas the non-local means filter takes only the patches from one region into account. Moreover, the difference from the non-local operation in Ref. [17] is that the proposed Long-Range unit considers the interrelations between feature blocks instead of only between pixels.
Implementation
Our network takes the Google Inception V3 network [34] as the backbone and embeds the proposed Long-Range units. The structure of our proposed network is shown in Table 1. Our model is pre-trained on ImageNet [35] for rapid convergence. We implement the function in Equation (1) as a Long-Range unit following the residual connection structure in Ref. [17]. Figure 3 illustrates the detailed structure of the proposed Long-Range unit.
[IMAGE OMITTED. SEE PDF]
The Long-Range unit takes the feature maps as input and adopts a convolutional operation to extract the features of the patches (whose size depends on the size of the convolution kernel). Here, we use convolutions with a kernel size of 5 × 5. The feature maps $B$ and $Q$ are obtained as follows:

$$B = W_B * X, \qquad Q = W_Q * X,$$

where $X$ is the input feature map, $W_B$ and $W_Q$ are the weights of the 5 × 5 convolutions, and $*$ denotes the convolution operation.
The softmax operation implements the normalisation in Equation (7), and ⊗ denotes matrix multiplication. In the proposed network, the multiplication of the transpose of $B$ with $Q$ is used to represent the similarity of pair-wise patches, which is easy to carry out in a deep learning model. To obtain the response feature map of the input features, the implementation function is defined as follows:

$$F = \mathrm{BN}\!\left(\theta\!\left(X \otimes \mathrm{softmax}\!\left(B^{\top} \otimes Q\right)^{\top}\right)\right).$$
Here, θ(⋅) is the operation that changes the dimensions from C × HW to C × H × W, and BN(⋅) is a batch normalisation function, which simplifies the computation of the model while retaining its ability to represent the features. Note that, because we need to embed the Long-Range unit into networks flexibly, we add the residual connection structure of Ref. [17]. The final output of the Long-Range unit is obtained through the computation in Equation (13), where ⊕ denotes element-wise addition. In this way, the Long-Range unit can be inserted into any trained network without breaking it.
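A minimal NumPy sketch of the unit's core computation may help fix ideas. Here `w_b` and `w_q` stand in for the 5 × 5 convolutions that produce B and Q, the feature map is pre-flattened to C × HW, and batch normalisation and the reshape θ(⋅) are omitted; all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def long_range_unit(x, w_b, w_q):
    """x: flattened feature map (C x HW); w_b, w_q: projection matrices
    standing in for the 5x5 convolutions that produce B and Q."""
    B = w_b @ x                      # patch features,          C' x HW
    Q = w_q @ x                      # compared patch features, C' x HW
    attn = softmax(B.T @ Q, axis=1)  # HW x HW normalised patch similarities
    y = x @ attn.T                   # each position: weighted sum of all positions
    return y + x                     # residual connection (element-wise addition)
```

Because each row of the similarity matrix sums to one, the weighted sum of a constant feature map reproduces the constant, and the residual connection then doubles it; in general the unit leaves the input's shape unchanged, which is what allows it to be dropped into an existing network.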
EXPERIMENTS
Dataset
To verify the effectiveness of our proposed method, we conduct extensive experiments on the challenging Messidor and EyePACS datasets. In addition, to meet the requirements of the network, we resize all input images to 299 × 299.
Messidor dataset
We employ the independent Messidor dataset [36] for DR detection. The Messidor dataset has 1200 colour fundus images captured by three ophthalmologic departments using a colour video 3CCD camera. The fundus images are acquired at 1440 × 960, 2240 × 1488 or 2304 × 1536 pixels with a field of view of 45°. The DR grade of each image in the dataset is provided by ophthalmologists to measure the severity of DR, which is graded into four stages from 0 to 3. The grade is determined by the number of lesions such as MAs, HEs and RNV. We randomly selected three-quarters of the images as the training set and used the remaining ones as the test set. Since Messidor has only 1200 images, which is too small a set to train CNNs on, Holly et al. [37] proposed pre-training the model on another dataset (e.g. ImageNet) to obtain a high-quality classifier. In this paper, we follow a protocol similar to Ref. [37] and conduct binary classification (referable vs. non-referable, normal vs. abnormal) for DR grading. For referable/non-referable, grades 0 and 1 of the Messidor dataset are considered non-referable, while grades 2 and 3 are labelled referable. For normal/abnormal classification, only images graded into stage 0 are considered normal; all the others are assigned as abnormal. These binary classification rules are reasonable to a certain extent for detection on the Messidor dataset, and these binary classification settings are widely used in recent methods [10, 37, 38].
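The two binary protocols above amount to a simple grade mapping, sketched below (the function name and task keywords are ours, for illustration):

```python
def to_binary_label(grade, task="referable"):
    """Map the 0-3 Messidor grade to the two binary protocols:
    referable (grades 2-3) vs non-referable (grades 0-1), and
    abnormal (grades 1-3) vs normal (grade 0)."""
    if task == "referable":
        return int(grade >= 2)   # 1 = referable
    if task == "abnormal":
        return int(grade >= 1)   # 1 = abnormal
    raise ValueError(f"unknown task: {task}")
```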
EyePACS dataset
The EyePACS dataset [39] contains 88,702 colour fundus photographs, including 35,126 training samples and 53,576 test samples. The dataset contains JPEG-compressed images centred on the optic disc or the macula, taken with different fundus cameras. Ophthalmologists classify each image on a 0–4 scale (i.e. 0: normal, 1: mild, 2: moderate, 3: severe, and 4: PDR), depending on the presence of lesions associated with vascular anomalies caused by DR. This is the largest open database for DR classification, but it contains a large amount of image noise, such as blurred or overexposed images. In addition, the distribution of samples is extremely unbalanced: the proportion of normal samples exceeds 70%, while the proportion of the most serious level (i.e. level 4) is only about 2%.
Evaluation metrics
To measure the network performance, we introduce the evaluation metrics, such as accuracy (Acc.), sensitivity (Sens.), specificity (Spec.), and kappa. Several terms should be introduced before defining these metrics.
Definition
Let PD(k) and OD(k) denote the DR detection results of the model's prediction and the ophthalmologist's diagnosis for patient k, respectively. Then, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are defined to represent the prediction correctness.
In the experiments, we choose accuracy to evaluate the detection model. Accuracy is the proportion of correct predictions made by the model out of the total. Furthermore, sensitivity, specificity, and precision are used to judge the diagnostic ability of the models. Sensitivity is the ability of a diagnostic model to correctly identify an actually sick person as a patient, and it relates to the case group. Specificity is the ability of a diagnostic model to correctly identify an actually healthy person as a non-patient, and it relates to the non-case group. A diagnostic model with high sensitivity has a lower rate of missed diagnosis; one with high specificity has a lower rate of misdiagnosis. Precision represents the proportion of examples classified as positive that are actually positive. These evaluation metrics are defined as follows:

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Sens.} = \frac{TP}{TP + FN}, \quad \mathrm{Spec.} = \frac{TN}{TN + FP}, \quad \mathrm{Prec.} = \frac{TP}{TP + FP}.$$
In real-world applications, the number of samples in each DR level tends to be uneven, which may lead to a bias in the model. In other words, when the accuracy is high, there may be a situation where the identification accuracy of large-scale categories is high while very small-scale categories are neglected by the identification algorithm. In order to evaluate the detection accuracy of the model and avoid model bias, the kappa evaluation metric is introduced. The kappa coefficient penalises the bias of the model: the higher the bias, the lower the kappa value. For binary classification, the kappa value can be calculated as follows:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement between the model and the ground truth (i.e. the accuracy) and $p_e$ is the agreement expected by chance.
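These definitions can be collected into a single helper for the binary case (a sketch; the function name is illustrative):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and Cohen's kappa
    from the four confusion counts of a binary classifier."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n          # observed agreement p_o
    sens = tp / (tp + fn)        # true positive rate
    spec = tn / (tn + fp)        # true negative rate
    prec = tp / (tp + fp)        # positive predictive value
    # chance agreement p_e: product of marginal frequencies, summed over classes
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return acc, sens, spec, prec, kappa
```

For example, with 40 true positives, 40 true negatives and 10 errors of each kind, all four rates are 0.8, chance agreement is 0.5, and kappa is 0.6.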
Experimental results on the Messidor dataset
To evaluate the performance of the proposed method, we adopt the confusion matrix to visualise algorithm performance, as shown in Figure 4. Each column of the confusion matrix represents a predicted category, and the total of each column is the number of samples predicted as that category. Meanwhile, each row represents the actual category, and the total of each row is the number of data instances in that class.
[IMAGE OMITTED. SEE PDF]
For the sake of analysis, we use the normalised confusion matrix, from which we can see that 89.5%–95.7% of each category is predicted correctly. The highest classification accuracy (95.7%) appears in the detection of normal images, and the accuracy is reduced to 94.1% when grade 2 is added (i.e. the classification of referable images). Moreover, the accuracy for the non-referable category is 92.8%, and it decreases to 89.5% for the abnormal category, which includes grade 2. The results indicate that our model performs well on the various categories but still has some difficulty detecting grade 2, because its pathological features are indistinct in the early stage of the disease.
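For reference, a row-normalised confusion matrix of the kind shown in Figure 4 can be computed as follows (an illustrative sketch; rows are actual classes and columns are predicted classes, matching the convention described above):

```python
import numpy as np

def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix: entry (i, j) is the fraction of
    class-i samples that the model predicted as class j."""
    m = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m / m.sum(axis=1, keepdims=True)  # each row sums to 1
```

The diagonal of this matrix gives the per-class accuracies quoted in the text (e.g. 95.7% for the normal category).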
In the experimental comparison phase, we follow the experimental methods in Ref. [10], using AUC and accuracy as the evaluation metrics. AUC, the area under the receiver operating characteristic curve, is used to verify the performance of the classifier. As shown in Tables 2 and 3, our method is compared with state-of-the-art methods such as Fisher Vector [40], CKML Net [37], DSF-RFcara [41], Comprehensive CAD [42], Zoom-in-Net [10], AFN [23], CANet [38] and LAT [30], and achieves the best results. For the classification of referable versus non-referable, the AUC of our proposed method is 98.1%, which is 1.3%–22.1% higher than the compared methods. For normal versus abnormal, our proposed method achieves an AUC of 96.7%, 0.4%–10.5% higher than the other methods.
TABLE 2 AUC of different methods on the Messidor dataset for referable/non-referable
| Method | Source | AUC | Acc. |
| Lesion-based [40] | JBHI 2017 | 0.760 | - |
| Fisher vector [40] | JBHI 2017 | 0.863 | - |
| VNXK/LGI [37] | ISM 2016 | 0.887 | 0.893 |
| CKML Net/LGI [37] | ISM 2016 | 0.891 | 0.897 |
| DSF-RFcara [41] | TMI 2016 | 0.916 | - |
| Comprehensive CAD [42] | - | 0.910 | - |
| Clinical A [42] | - | 0.940 | - |
| Clinical B [42] | - | 0.920 | - |
| Zoom-in-Net [10] | MICCAI 2017 | 0.957 | 0.911 |
| AFN [23] | MICCAI 2018 | 0.968 | - |
| CANet [38] | TMI 2020 | 0.963 | 0.926 |
| Ours | - | 0.981 | 0.935 |
TABLE 3 AUC of different methods on the Messidor dataset for normal/abnormal
| Method | Source | AUC | Acc. |
| Splat feature/KNN [43] | TMI 2013 | 0.870 | - |
| VNXK/LGI [37] | CVPR 2017 | 0.870 | 0.871 |
| CKML Net/LGI [37] | ISM 2016 | 0.862 | 0.858 |
| Comprehensive CAD [42] | - | 0.876 | - |
| Clinical A [42] | - | 0.922 | - |
| Clinical B [42] | - | 0.865 | - |
| Zoom-in-Net [10] | MICCAI 2017 | 0.921 | 0.905 |
| AFN [23] | MICCAI 2018 | 0.935 | - |
| LAT [30] | CVPR 2021 | 0.963 | - |
| Ours | - | 0.967 | 0.921 |
To further quantify model performance, we also report receiver operating characteristic (ROC) curves, sensitivity (Sens.), specificity (Spec.), kappa and total elapsed time, and compare our method with different open-source models. The comparison methods can be roughly divided into CNN-based methods: VGG [44], ResNet [45], DenseNet [46], and Inception Net [34]; and transformer-based methods: ViT [47], ViG [48], and the model proposed by Huang et al. [49].
According to Figures 5 and 6, our proposed method achieves the best performance both in the classifications of referable/non-referable and normal/abnormal. Notably, our model is designed based on the same backbone of Google Inception Network and achieves better performance than these baselines, which shows the effectiveness of our proposed method.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Since our method mainly combines local-dependence and long-range-dependence features, we compare it with CNN-based methods, which have innate advantages in local features, and transformer-based methods, which are better at mining long-distance features. Tables 4 and 5 report that our method obtains accuracies of 93.5% and 92.1% for referable/non-referable and normal/abnormal, respectively. Compared with the traditional CNNs, the accuracy for referable/non-referable is improved by 4.5%–21.6%, and the accuracy for normal/abnormal is improved by 4.1%–22.9%. This proves that simply learning local features cannot meet the requirements of extracting complex lesion features from fundus images, and that adding appropriate Long-Range units to enhance the extraction of scattered micro-features is useful. Moreover, compared with transformer-based methods, the accuracy of our method for referable/non-referable and for normal/abnormal is at least 8.3% and 5.9% higher, respectively. This is because the large number of parameters of transformer-based models makes them highly dependent on the scale of the dataset; with a small training set, a transformer-based model tends to overfit, degrading its performance. The experimental results show that our method is superior to methods that focus only on local or only on global features, and that combining local- and long-range-dependence features is a feasible approach.
TABLE 4 Comparison with benchmarking models on the Messidor dataset for referable/non-referable
| Method | Source | AUC | Acc. | Sens. | Spec. | Kappa | Time |
| DenseNet121 [46] | CVPR 2017 | 0.868 | 0.719 | 0.713 | 0.710 | 0.428 | 13.100 |
| DenseNet161 [46] | CVPR 2017 | 0.890 | 0.789 | 0.783 | 0.788 | 0.572 | 22.583 |
| VGG16 [44] | Arxiv. 2014 | 0.873 | 0.785 | 0.780 | 0.781 | 0.565 | 13.920 |
| VGG19 [44] | Arxiv. 2014 | 0.882 | 0.787 | 0.786 | 0.787 | 0.569 | 16.791 |
| ResNet18 [45] | CVPR 2016 | 0.911 | 0.812 | 0.810 | 0.805 | 0.619 | 11.917 |
| ResNet34 [45] | CVPR 2016 | 0.912 | 0.833 | 0.832 | 0.827 | 0.664 | 13.422 |
| Inception V3 [34] | CVPR 2016 | 0.952 | 0.890 | 0.885 | 0.880 | 0.780 | 10.364 |
| Inception V4 [34] | CVPR 2016 | 0.948 | 0.883 | 0.881 | 0.884 | 0.764 | 21.243 |
| Inception-Resnet-v2 [34] | CVPR 2016 | 0.920 | 0.839 | 0.841 | 0.840 | 0.677 | 20.349 |
| ViT [47] | ICLR 2021 | 0.871 | 0.804 | 0.803 | 0.804 | 0.604 | 5.160 |
| ViG [48] | Arxiv. 2022 | 0.346 | 0.483 | 0.482 | 0.480 | 0.211 | 7.625 |
| Huang et al. [49] | Arxiv. 2022 | 0.941 | 0.852 | 0.851 | 0.851 | 0.703 | 6.703 |
| Ours | - | 0.981 | 0.935 | 0.936 | 0.935 | 0.869 | 10.452 |
TABLE 5 Comparison with benchmarking models on the Messidor dataset for normal/abnormal
| Method | Source | AUC | Acc. | Sens. | Spec. | Kappa | Time |
| DenseNet121 [46] | CVPR 2017 | 0.811 | 0.716 | 0.705 | 0.709 | 0.414 | 12.248 |
| DenseNet161 [46] | CVPR 2017 | 0.882 | 0.880 | 0.879 | 0.884 | 0.568 | 15.413 |
| VGG16 [44] | Arxiv. 2014 | 0.830 | 0.721 | 0.730 | 0.731 | 0.445 | 16.121 |
| VGG19 [44] | Arxiv. 2014 | 0.813 | 0.692 | 0.810 | 0.811 | 0.409 | 20.100 |
| ResNet18 [45] | CVPR 2016 | 0.892 | 0.804 | 0.795 | 0.797 | 0.596 | 12.820 |
| ResNet34 [45] | CVPR 2016 | 0.902 | 0.812 | 0.811 | 0.810 | 0.617 | 11.179 |
| Inception V3 [34] | CVPR 2016 | 0.852 | 0.765 | 0.788 | 0.785 | 0.538 | 11.394 |
| Inception V4 [34] | CVPR 2016 | 0.927 | 0.820 | 0.831 | 0.835 | 0.642 | 19.923 |
| Inception-Resnet-v2 [34] | CVPR 2016 | 0.915 | 0.834 | 0.835 | 0.840 | 0.661 | 24.138 |
| ViT [47] | ICLR 2021 | 0.894 | 0.820 | 0.806 | 0.806 | 0.622 | 4.7693 |
| ViG [48] | Arxiv. 2022 | 0.534 | 0.589 | 0.503 | 0.501 | 0.231 | 8.867 |
| Huang et al. [49] | Arxiv. 2022 | 0.907 | 0.862 | 0.863 | 0.862 | 0.724 | 6.225 |
| Ours | - | 0.967 | 0.921 | 0.931 | 0.925 | 0.838 | 12.035 |
Further, we observe that our model has more reliable diagnostic capability in DR detection, as it achieves the highest sensitivity and specificity in the comparison; that is, it has a lower probability of missed diagnosis and misdiagnosis than the other models. We then compare the kappa scores of the different models and their total elapsed time in the test phase under the same implementation setup. Our method obtains the best kappa score, which indicates that its classification bias is minimal and that the experimental results are robust and reliable. Our proposed method requires more computational time than the ViT, ViG, and Inception V3 models, but within an acceptable range, and its AUC, accuracy, sensitivity, specificity, and kappa score are far better than theirs. Moreover, our method is not only superior in classification ability but also faster than a large proportion of the baseline CNN-based methods. Overall, our proposed method has the best comprehensive performance among the compared methods and is well suited to the DR detection task.
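For reference, the binary metrics reported in Tables 4 and 5 can all be derived from a confusion matrix; the sketch below computes sensitivity, specificity, accuracy, and Cohen's kappa from illustrative counts (made up here, not the paper's actual predictions):

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and Cohen's kappa from a
    binary confusion matrix (tp/fp/fn/tn are raw counts)."""
    total = tp + fp + fn + tn
    sens = tp / (tp + fn)      # recall on the positive (diseased) class
    spec = tn / (tn + fp)      # recall on the negative (healthy) class
    acc = (tp + tn) / total    # observed agreement p_o
    # Expected agreement p_e under chance, from the row/column marginals.
    p_pos = ((tp + fn) / total) * ((tp + fp) / total)
    p_neg = ((tn + fp) / total) * ((tn + fn) / total)
    p_e = p_pos + p_neg
    kappa = (acc - p_e) / (1 - p_e)  # agreement corrected for chance
    return sens, spec, acc, kappa

# Illustrative counts: 100 diseased and 100 healthy test images.
sens, spec, acc, kappa = binary_metrics(tp=90, fp=8, fn=10, tn=92)
```

With these toy counts, sensitivity is 0.90, specificity 0.92, accuracy 0.91, and kappa 0.82; kappa is lower than accuracy because it discounts the agreement expected by chance.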
Experimental results on the EyePACS dataset
For the five-category DR classification task, we adopt widely recognised assessment metrics, including accuracy (Acc.), precision (Prec.), sensitivity (Sens.), specificity (Spec.), F1 score [50], and total elapsed time. Notably, the F1 score provides a comprehensive assessment of model performance and avoids evaluation bias caused by the sample imbalance in the EyePACS dataset. To verify the effectiveness of our method, we conducted comparative experiments with DenseNet [46], VGG [44], Inception Net [34], MobileNet V2 [51], ResNext50_32 × 4d [52], ShuffleNet [53], ViT [47], ViG [48], and the model proposed by Huang et al. [49].
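Because plain accuracy can look deceptively high on imbalanced data, a class-balanced F1 variant is a natural complement. A minimal sketch of macro-averaged F1 (the averaging scheme is an assumption here; the paper does not state which variant it uses), with made-up labels for five DR grades:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 scores are averaged with equal
    weight, so rare DR grades count as much as the dominant normal class.
    Labels are integers; `classes` lists all possible grades."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Five DR grades (0 = normal ... 4 = PDR); imbalanced toy example.
y_true = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2, 3, 4]  # the single grade-1 sample is missed
score = macro_f1(y_true, y_pred, classes=range(5))
```

On this toy example plain accuracy is 0.90, while the macro F1 of about 0.78 exposes the completely missed rare grade, which is exactly why the F1 score is reported for EyePACS.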
First, we analyse the performance of our model on the EyePACS dataset by computing the confusion matrix of the proposed method. As shown in Figure 7, 95.1% of the normal samples are correctly predicted. Except for the grade-1 samples, most samples in the other categories are correctly classified. Because grade-1 samples contain only very mild lesions, they are hard to distinguish from grade-0 samples, which leads to most grade-1 samples being classified as grade 0.
[IMAGE OMITTED. SEE PDF]
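Per-class figures such as the 95.1% quoted above correspond to the row-normalised diagonal of the confusion matrix. A small sketch with made-up counts (not the paper's actual matrix) shows the computation:

```python
def per_class_recall(cm):
    """Row-normalise a confusion matrix: entry [i][j] becomes the fraction
    of true class i predicted as class j, so the diagonal gives per-class
    recall. `cm` is a list of rows of raw counts."""
    rates = []
    for row in cm:
        total = sum(row)
        rates.append([x / total for x in row])
    return rates

# Toy 3-grade matrix: grade 1 (middle row) is mostly confused with grade 0,
# mirroring the grade-1 behaviour described for EyePACS.
cm = [[95,  4,  1],
      [60, 30, 10],
      [ 2,  8, 90]]
rates = per_class_recall(cm)
```

Here the diagonal reads 0.95, 0.30, 0.90: the second class is detected poorly even though overall accuracy still looks reasonable, the same failure mode seen for grade-1 samples.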
To further illustrate the performance of our model, we conducted a comparative experiment, the results of which are shown in Table 6. The accuracy of our method is 83.6%, which is 2.9%–18.1% higher than that of the comparison methods. Compared with the other methods, the proposed method achieves the best performance, with average improvements of 3.9%, 10.6%, and 6.0% in overall sensitivity, specificity, and precision, respectively. Notably, owing to the sample imbalance in the EyePACS dataset, we introduce the F1 score to evaluate the methods. As shown in Figure 8, our method achieves the best F1 score, which demonstrates its superiority and robustness.
TABLE 6 Comparison with benchmarking models
| Method | Source | Acc. | Sens. | Spec. | Prec. | Time |
| DenseNet121 [46] | CVPR 2017 | 0.773 | 0.826 | 0.513 | 0.738 | 7.594 × 10³ |
| DenseNet161 [46] | CVPR 2017 | 0.784 | 0.837 | 0.553 | 0.756 | 4.965 × 10³ |
| VGG16 [44] | Arxiv. 2014 | 0.803 | 0.858 | 0.640 | 0.783 | 7.569 × 10³ |
| Inception V3 [34] | CVPR 2016 | 0.799 | 0.854 | 0.660 | 0.803 | 4.893 × 10³ |
| Inception-Resnet-v2 [34] | CVPR 2016 | 0.788 | 0.841 | 0.616 | 0.785 | 2.453 × 10³ |
| MobileNet V2 [51] | CVPR 2018 | 0.795 | 0.849 | 0.632 | 0.773 | 3.614 × 10⁴ |
| ResNext50_32 × 4d [52] | CVPR 2017 | 0.807 | 0.862 | 0.685 | 0.810 | 4.177 × 10³ |
| ShuffleNet [53] | CVPR 2018 | 0.775 | 0.828 | 0.543 | 0.741 | 3.569 × 10⁴ |
| ViT [47] | ICLR 2021 | 0.655 | 0.701 | 0.468 | 0.655 | 1.827 × 10² |
| ViG [48] | Arxiv. 2022 | 0.760 | 0.813 | 0.550 | 0.750 | 2.830 × 10² |
| Huang et al. [49] | Arxiv. 2022 | 0.762 | 0.814 | 0.602 | 0.752 | 6.937 × 10³ |
| Ours | - | 0.836 | 0.865 | 0.693 | 0.819 | 4.778 × 10³ |
[IMAGE OMITTED. SEE PDF]
CONCLUSION
In this paper, we presented a novel DR detection model that establishes long-range connections in the spatial dimension. The proposed model learns features with both local and long-distance dependence to improve feature representation. To improve DR detection performance, we strengthen tiny lesion features by exploiting the interrelation between similar lesion patches across the whole feature map. Moreover, the proposed Long-Range unit, which explores the correlations between long-range patches, can be flexibly embedded into other trained networks without altering their construction. To verify the effectiveness of our method, we compared it with state-of-the-art methods; in the experiments, our method achieved the best performance, which suggests that it is potentially suitable for real-world DR detection. The model can highlight small lesion features during training using only image-level annotations, and it could also be applied to related problems in medical image processing, such as mammographic image analysis and brain image analysis. In future work, we plan to extract lesion features with weakly supervised networks and to explore the relationship between retinal blood vessels and lesions. On this basis, we hope to further improve DR detection performance by jointly learning multiple features.
ACKNOWLEDGEMENTS
This work was supported in part by the National Natural Science Foundation of China (grant nos. 62001141 and 62272319), the Science, Technology and Innovation Commission of Shenzhen Municipality (grant nos. JCYJ20210324131800002, RCBS20210609103820029, GJHZ20210705141812038, and JCYJ20210324094413037), and the Stable Support Projects for Shenzhen Higher Education Institutions under grant no. 20220715183602001.
CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available from ADCIS.
REFERENCES
Ning, C., Wong, T.Y.: Diabetic retinopathy and systemic vascular complications. Prog. Retin. Eye Res. 27(2), 161–176 (2008). https://dx.doi.org/10.1016/j.preteyeres.2007.12.001
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
He, K., et al.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Huang, G., et al.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 (2020)
Han, K., et al.: Vision GNN: an image is worth graph of nodes. arXiv:2206.00272 (2022)
Huang, L., et al.: Green hierarchical vision transformer for masked image modeling. arXiv:2205.13515 (2022)
Sasaki, Y.: The truth of the F-measure. Teach Tutor Mater 1(5), 1–5 (2007)
Sandler, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Xie, S., et al.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Ma, N., et al.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: European Conference on Computer Vision (ECCV) (2018)
© 2024. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License").
ABSTRACT
Diabetic retinopathy (DR), the main cause of irreversible blindness, is one of the most common complications of diabetes. At present, deep convolutional neural networks have achieved promising performance in automatic DR detection tasks. The convolution operation of methods is a local cross‐correlation operation, whose receptive field determines the size of the local neighbourhood for processing. However, for retinal fundus photographs, there is not only the local information but also long‐distance dependence between the lesion features (e.g. hemorrhages and exudates) scattered throughout the whole image. The proposed method incorporates correlations between long‐range patches into the deep learning framework to improve DR detection. Patch‐wise relationships are used to enhance the local patch features since lesions of DR usually appear as plaques. The Long‐Range unit in the proposed network with a residual structure can be flexibly embedded into other trained networks. Extensive experimental results demonstrate that the proposed approach can achieve higher accuracy than existing state‐of‐the‐art models on Messidor and EyePACS datasets.
Details
; Xu, Yong 2 ; Lai, Zhihui 3 ; Jin, Xiaopeng 4 ; Zhang, Bob 5
; Zhang, David 6 1 Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China
2 Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China, Peng Cheng Laboratory, Shenzhen, China
3 Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China
4 College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
5 The Department of Computer and Information Science, University of Macau, Macao, Macau, China
6 The Chinese University of Hong Kong (Shenzhen), Shenzhen, China




