Purpose
Glaucoma is a leading cause of irreversible blindness, and accurate cup-to-disc ratio (CDR) measurement is essential for early detection. This study presents an enhanced deep learning–based system for automated CDR estimation and glaucoma screening.
Methods
We propose an end-to-end framework consisting of three modules: (1) optic cup and disc segmentation using an enhanced dual encoder–decoder network (E-DCoAtUNet), (2) a conditional random field (CRF) post-processing module for boundary refinement, and (3) a measurement module for vertical CDR calculation and glaucoma classification. The model was trained and evaluated on the Drishti-GS dataset and validated on the REFUGE dataset to assess generalizability.
Results
The system achieved Dice scores of 97.6% for the optic disc and 90.8% for the optic cup after CRF refinement. Automated CDR estimation showed strong agreement with expert annotations (Pearson’s r = 0.9190, MAE = 0.0387). For glaucoma screening, the system demonstrated reliable performance across both datasets, highlighting its robustness and clinical applicability.
Conclusion
The proposed E-DCoAtUNet-based system provides a fully automated, interpretable, and precise solution for glaucoma screening. By integrating advanced segmentation, boundary refinement, and accurate measurement, it ensures consistent CDR evaluation even under challenging imaging conditions, and demonstrates strong potential for real-world clinical application.
Introduction
Glaucoma ranks as the world’s second most prevalent blinding eye disease, with the number of affected individuals projected to reach 111.8 million by 2040 [1–3]. As the leading cause of irreversible blindness globally, its hallmark pathological feature is progressive optic nerve damage, which induces visual field defects and irreversible vision loss, severely compromising patients’ quality of life [4, 5]. Chronic glaucoma is frequently termed the “silent thief of sight” because of its clinically asymptomatic early phases, a characteristic that delays diagnosis and treatment initiation in over 40% of cases [6]. Studies have demonstrated that early, accurate diagnosis combined with timely intervention can reduce blindness risk by approximately 50% [7]. Consequently, early screening and therapeutic intervention are critical for preventing permanent visual impairment caused by glaucoma.
Elevated intraocular pressure (IOP) is regarded as a key warning factor for glaucoma [8]. When IOP exceeds a certain threshold, pressure is exerted on the retinal nerve fibers, leading to optic nerve damage and ultimately to gradual vision loss. Although abnormal IOP is one of the early signals of glaucoma, relying solely on IOP changes cannot fully and accurately predict disease progression, so comprehensive evaluation combining other detection methods is required. Currently, clinical diagnosis of glaucoma primarily depends on assessment of the optic nerve head, in which the CDR serves as one of the critical diagnostic indicators, as shown in Fig. 1 [9, 10]. With progressive loss of optic nerve fibers, the relative size of the optic cup continuously increases, elevating CDR values [11, 12]. A CDR exceeding 0.6 is generally considered one of the diagnostic criteria for glaucoma. However, traditional CDR-based assessment methods have inherent limitations, particularly during manual evaluation, where subjective factors may interfere with the segmentation of the optic cup and disc [13]. Therefore, objectively improving the segmentation accuracy of the optic cup and disc remains a pressing clinical challenge for accurate glaucoma diagnosis.
[IMAGE OMITTED: SEE PDF]
Current research approaches can be categorized into two primary methodologies: (1) direct classification methods based on fundus images or optical coherence tomography (OCT) scans, and (2) computer-aided diagnostic systems that calculate the cup-to-disc ratio through optic disc (OD) and optic cup (OC) segmentation. Within direct classification paradigms, Juneja et al. developed CoG-Net, a model demonstrating precise localization of optic disc regions to achieve efficient glaucoma classification [14]. Madadi et al. proposed a deep domain adaptation framework incorporating low-rank representation and progressive weighting modules, which significantly enhanced diagnostic accuracy for glaucoma detection [15]. Sharma et al. implemented a hybrid architecture combining deep learning with feature dimensionality reduction, optimized through an Extreme Learning Machine and modified Particle Swarm Optimization algorithm, reporting diagnostic accuracies of 97.80% on the G1020 dataset and 98.46% on the ORIGA dataset [16]. Mandal et al. introduced a weakly supervised time-series learning approach for detecting glaucoma progression, demonstrating superior performance in identifying longitudinal structural changes [17].
In contrast to direct classification approaches, methodologies based on optic cup/disc segmentation with subsequent CDR calculation provide more precise quantitative biomarkers for glaucoma assessment. Prastyo et al. [18] utilized the U-Net architecture for cup/disc segmentation, achieving a Dice coefficient of 98.4% through adaptive learning rate optimization. Tabassum et al. [19] enhanced joint cup/disc segmentation accuracy by integrating attention mechanisms with generative adversarial learning, effectively mitigating non-elliptical fitting errors in fundus images. Hervella et al. [20] developed a multi-task deep learning framework combining pixel-level and image-level annotations, which improved diagnostic performance via shared-parameter networks and multi-adaptive optimization strategies. Hua et al. [21] implemented multi-scale attention mechanisms and region-weighted convolutional modules, significantly boosting model generalizability across domains while preserving anatomical details in segmentation tasks, outperforming existing methods. Nawaz et al. [22] proposed an EfficientDet-D0 framework with bidirectional feature pyramid integration, demonstrating superior localization capability and cross-dataset robustness on ORIGA, HRF, and RIM-ONE DL datasets. He et al. proposed a prior-guided multi-task Transformer framework, JOINEDTrans, for optic disc and cup segmentation [23]. This architecture effectively mitigates structural distortions caused by ocular pathologies and image quality issues through integration of multi-scale spatial features. Another study developed the AI-GS network, which integrates multiple lightweight sub-models to capture early structural abnormalities in fundus images, achieving efficient glaucoma screening with minimal computational burden and maintaining strong sensitivity and specificity in practical applications [24]. Experimental results indicate that ViT-based models exhibit superior diagnostic accuracy and enhanced interpretability, thereby establishing new research avenues for glaucoma diagnosis.
Despite significant advancements in existing methodologies, current research continues to face multiple challenges [25,26,27,28,29,30,31]. Notably, existing deep learning models for glaucoma analysis often underexplore the complex morphological characteristics of the optic cup and disc, which limits their robustness and stability when dealing with fundus images containing ambiguous or irregular boundaries. To overcome these limitations, this study proposes an automated glaucoma diagnostic system centered on optic cup and disc segmentation. The framework integrates three core modules: segmentation, post-processing, and measurement. In particular, we introduce E-DCoAtUNet, a dual encoder–decoder architecture that hybridizes convolutional and Transformer blocks, complemented by a CRF-based post-processing module to enhance boundary clarity and regional consistency. This design effectively addresses the challenges of morphological variability and boundary ambiguity. Comprehensive experiments on the publicly available Drishti-GS and REFUGE datasets demonstrate the system’s strong segmentation accuracy, reliable CDR measurement (r = 0.9190, MAE = 0.0387), and improved generalization capability, highlighting its practical potential for real-world clinical glaucoma screening.
Materials and methods
This study constructed a deep learning-based glaucoma diagnostic system, with its architecture illustrated in Fig. 2. The system consists of three components: an optic cup and disc segmentation module, a post-processing module, and a measurement module, aiming to achieve fully automated analysis from fundus image processing to pathological parameter calculation. First, the optic cup and disc segmentation module is responsible for accurately extracting the optic cup and disc regions from fundus images, generating high-quality segmentation results. Subsequently, the post-processing module optimizes the segmentation results by enhancing the continuity and consistency of segmentation boundaries, thereby eliminating noise and errors that may occur during the segmentation process. Based on the optimized results, the measurement module precisely calculates CDR, providing diagnostic evidence for glaucoma. Through efficient collaboration of the segmentation, post-processing, and measurement modules, the system achieves effective fundus image analysis and pathological parameter measurement, offering an efficient and precise solution for early glaucoma screening and diagnosis.
[IMAGE OMITTED: SEE PDF]
CoAtUNet model structure
The local connectivity and parameter sharing mechanisms of CNNs endow them with exceptional capability in capturing spatial information. However, these inherent architectural characteristics restrict their ability to learn globally dependent features, consequently leading to suboptimal performance in complex target reconstruction tasks. In contrast, Transformers establish long-range dependencies through self-attention mechanisms, enabling effective acquisition of global image features. Nevertheless, Transformers exhibit relative weakness in fine-grained feature extraction, resulting in limitations when processing local structural details. Given these complementary limitations, hybrid architectures combining CNN and Transformer advantages have become a prominent research focus in recent years [32, 33].
CoAtNet is a hybrid architecture that combines the advantages of convolutional neural networks and Transformers, aiming to overcome the limitations of single architectures and achieve an optimal balance between model generalization capability and performance [34]. The model adopts a staged design, integrating the characteristics of convolution and self-attention mechanisms to progressively model feature information from local to global, thereby demonstrating excellent adaptability and robustness in complex tasks. Based on this design, we constructed a segmentation model named CoAtUNet, as shown in Fig. 3. CoAtUNet employs a U-shaped network architecture, where the encoder extracts features from input images through progressive down-sampling, capturing multi-level semantic information. The decoder restores image resolution through stepwise up-sampling and integrates encoder features into the decoding process via skip connections. This U-shaped architecture not only effectively recovers spatial resolution but also preserves detailed boundary information of the optic cup and disc during segmentation, thereby improving the model’s accuracy in segmentation tasks.
[IMAGE OMITTED: SEE PDF]
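To make the U-shaped data flow concrete, the following minimal PyTorch sketch, a simplified stand-in rather than the paper’s implementation, shows how down-sampled encoder features are fused back into the decoder through a skip connection; layer widths and depths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Minimal U-shaped encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, in_ch=3, base=32, n_classes=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                                  # progressive down-sampling
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)    # stepwise up-sampling
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, n_classes, 1)                    # cup / disc / background

    def forward(self, x):
        s1 = self.enc1(x)                         # high-resolution encoder features
        b = self.enc2(self.down(s1))              # deeper semantic features
        u = self.up(b)                            # restore spatial resolution
        d = self.dec1(torch.cat([u, s1], dim=1))  # skip connection preserves boundary detail
        return self.head(d)
```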
In the design of the convolutional module, this study employs the MBConv structure to achieve efficient feature extraction, with optimized designs implemented for both up-sampling and down-sampling stages. As illustrated in Fig. 4(a), the down-sampling stage first adjusts the channel dimension through a 1 × 1 convolution to match the target feature space requirements. Subsequently, depthwise separable convolution with a stride of 2 is applied to reduce the spatial resolution of input feature maps, thereby accomplishing down-sampling. In the figure, \(C_i\) and \(C_o\) represent the input and output channel numbers, respectively, while H and W denote the height and width of the input feature maps. To ensure effective fusion between features in the residual path and the scaled main branch features, this study introduces max pooling in the residual path to adjust feature map resolution, accompanied by a 1 × 1 convolution to align channel dimensions with the main branch output. For the up-sampling stage, transposed convolution is utilized in both the main and residual paths to achieve spatial resolution doubling, as shown in Fig. 4(b). The transposed convolution not only effectively restores feature map resolution but also enhances feature learning capability through parameter learning during the up-sampling process.
[IMAGE OMITTED: SEE PDF]
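A minimal PyTorch sketch of the down-sampling MBConv stage described above, written as our illustrative reconstruction rather than the authors’ code; the normalization and activation choices are assumptions.

```python
import torch.nn as nn

class MBConvDown(nn.Module):
    """Down-sampling MBConv block: stride-2 depthwise separable main path plus
    a max-pool + 1x1-conv residual path (illustrative reconstruction)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1),            # 1x1 conv: adjust channel dimension
            nn.Conv2d(c_out, c_out, 3, stride=2,
                      padding=1, groups=c_out),   # depthwise conv, stride 2: halve H and W
            nn.Conv2d(c_out, c_out, 1),           # pointwise conv completes the separable pair
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )
        self.residual = nn.Sequential(
            nn.MaxPool2d(2),                      # match the halved spatial resolution
            nn.Conv2d(c_in, c_out, 1),            # align channels with the main branch
        )

    def forward(self, x):
        return self.main(x) + self.residual(x)    # fuse main and residual paths
```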
In the Transformer module, up-sampling is achieved through a combination of linear expansion, multi-head self-attention, and feed-forward networks to double the feature resolution. First, the input features undergo channel adjustment via a 3 × 3 convolution, followed by flattening along the spatial dimension. A linear layer then expands the flattened features to four times the original resolution, accomplishing feature map up-sampling. The expanded feature maps are subsequently fed into the Transformer module, where relative self-attention (Rel-Attention) captures long-range feature interactions. To further optimize feature representation, the expanded features are fused with attention outputs through residual connections, followed by point-wise enhancement via a feed-forward network. This process ultimately outputs high-resolution feature maps enriched with semantic information.
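A rough PyTorch sketch of this Transformer up-sampling stage; standard multi-head self-attention stands in for the relative self-attention (Rel-Attention) used in the model, and the token-to-pixel reshape is a simplified assumption about the expansion scheme.

```python
import torch
import torch.nn as nn

class TransformerUp(nn.Module):
    """Transformer up-sampling sketch: conv channel adjustment, 4x token expansion,
    self-attention with residual fusion, and a point-wise feed-forward network."""
    def __init__(self, c_in, c_out, heads=4):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 3, padding=1)          # 3x3 conv channel adjustment
        self.expand = nn.Linear(c_out, 4 * c_out)                 # 4x tokens -> 2x height, 2x width
        self.attn = nn.MultiheadAttention(c_out, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(c_out, 4 * c_out), nn.GELU(),
                                 nn.Linear(4 * c_out, c_out))

    def forward(self, x):
        b, _, h, w = x.shape
        t = self.proj(x).flatten(2).transpose(1, 2)               # flatten spatially: (B, HW, C)
        t = self.expand(t).reshape(b, 4 * h * w, -1)              # expand to 4x the tokens
        a, _ = self.attn(t, t, t)                                 # long-range interactions
        t = t + a                                                 # residual fusion with attention
        t = t + self.ffn(t)                                       # point-wise enhancement
        return t.transpose(1, 2).reshape(b, -1, 2 * h, 2 * w)     # back to a 2x-resolution map
```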
Cup-disc joint segmentation model
The boundaries of the optic cup and disc are often indistinct, particularly in pathological regions or under low image quality conditions, which increases the difficulty of boundary identification for segmentation models. While CoAtUNet demonstrates superior performance in local feature extraction and global information modeling, partially alleviating this issue, the inherent complexity of cup-disc morphology presents additional challenges. Specifically, the nested structure of the optic cup within the disc, combined with the morphological variability of the cup under pathological changes, significantly complicates the segmentation task. To address these limitations, this study proposes an innovative dual encoder-decoder architecture, E-DCoAtUNet (Enhanced Dual CoAtUNet), designed to enhance segmentation precision and robustness, as illustrated in Fig. 5.
[IMAGE OMITTED: SEE PDF]
The proposed Enhanced DCoAtUNet extends and optimizes the original DCoAtUNet design, retaining its strengths in local feature extraction and global dependency modeling while introducing novel modules tailored for joint optic disc and cup segmentation. The original DCoAtUNet employs a CoAtNet-based encoder and a dual-branch decoder architecture, separately modeling disc and cup features. Through a feature pyramid–style multi-scale fusion mechanism, it achieves fine-grained structural delineation, thereby combining the efficient feature aggregation capability of U-shaped networks with enhanced multi-scale perception and global context modeling.
Building upon this foundation, the E-DCoAtUNet integrates Cross Branch Attention (CBA) modules across multiple decoding layers, explicitly capturing dependencies between the disc and cup to facilitate cross-branch feature interaction and complementarity. To strengthen hierarchical representation, a Multi-Scale Feature Enhancement (MFE) module is embedded in the intermediate layers of the disc branch, combining local and global contextual cues via parallel convolution, dilated convolution, and global pooling. In addition, an Edge Aware Module (EAM) is incorporated into the deeper layers of the disc decoder, where boundary detection and feature modulation improve segmentation accuracy along optic disc and cup boundaries. Finally, the outputs of the two decoder branches are fused through feature concatenation, enabling synergistic optimization that balances disc localization with cup detail refinement.
Overall, the E-DCoAtUNet preserves the original model’s core advantages—CoAtNet’s powerful local and global feature extraction, robust multi-scale feature fusion, and the task-specific dual-branch structure—while achieving further improvements in cross-branch interaction, multi-scale enhancement, and boundary delineation, leading to superior accuracy and robustness in optic disc and cup segmentation tasks.
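As one illustration of the MFE design, the sketch below combines the three parallel branches described above (plain convolution, dilated convolution, and global pooling); the kernel sizes, dilation rate, and residual fusion are our assumptions.

```python
import torch
import torch.nn as nn

class MFE(nn.Module):
    """Multi-Scale Feature Enhancement sketch: parallel local, dilated, and
    global-context branches fused by a 1x1 convolution (illustrative)."""
    def __init__(self, c):
        super().__init__()
        self.local = nn.Conv2d(c, c, 3, padding=1)                 # local contextual cues
        self.dilated = nn.Conv2d(c, c, 3, padding=2, dilation=2)   # enlarged receptive field
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global pooling branch
        self.fuse = nn.Conv2d(3 * c, c, 1)                         # merge the three branches

    def forward(self, x):
        g = self.pool(x).expand_as(x)                              # broadcast global descriptor
        out = torch.cat([self.local(x), self.dilated(x), g], dim=1)
        return x + self.fuse(out)                                  # residual enhancement
```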
Post-processing module
This study introduces a CRF as a post-processing module in E-DCoAtUNet, thereby further improving segmentation accuracy and stability [35]. In optic cup and disc segmentation tasks, the nested structure of the cup within the disc often results in ambiguous boundaries, particularly in pathological regions or under poor image quality conditions. This ambiguity frequently leads to unclear boundaries or regional misclassification in segmentation results.
The core mechanism of the CRF addresses these challenges by constructing contextual dependencies between pixels while balancing the contributions of the data and smoothness terms. Specifically, the CRF uses the probability maps generated by the segmentation model as the data term, which characterizes the likelihood of each pixel belonging to either the optic cup or disc. The smoothness term models pixel similarity through Gaussian kernel functions, incorporating multi-dimensional features such as color, texture, and spatial position to eliminate segmentation noise and errors. Particularly in overlapping regions between the cup and disc, the smoothness-term optimization effectively reduces category confusion, providing reliable data support for subsequent pathological parameter measurements.
In this study, CRF is introduced as a post-processing module to refine segmentation boundaries by integrating preliminary probability maps with the original image information. Formally, the CRF energy function is defined as:
$$E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$
(2.1)
where \(x\) denotes the label assignment of all pixels. The unary potential \(\psi_u(x_i)\) is obtained from the network’s predicted probability maps, reflecting the likelihood of pixel \(i\) belonging to the optic cup or disc. The pairwise potential \(\psi_p(x_i, x_j)\) enforces spatial consistency and is defined as:
$$\psi_p(x_i, x_j) = \mu(x_i, x_j)\, w \exp\left(-\frac{\| p_i - p_j \|^2}{2\sigma_{xy}^2}\right)$$
(2.2)
where \(p_i\) and \(p_j\) denote pixel coordinates, and the compatibility function \(\mu(x_i, x_j) = 1\) if \(x_i \neq x_j\) and 0 otherwise. This Gaussian kernel penalizes label discontinuities between nearby pixels, thereby reducing boundary noise and misclassification.
For implementation, the spatial Gaussian kernel width was set to \(\sigma_{xy} = 13\), and the pairwise compatibility parameter compat was set to 10–11. These values were chosen empirically with reference to ranges commonly adopted in previous CRF-based segmentation studies and were fixed across all experiments. In preliminary testing, this configuration provided a good balance between local smoothness and boundary preservation, and performance was not highly sensitive to small parameter variations, ensuring stable and reproducible results.
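For reference, a minimal sketch of this post-processing step using the pydensecrf library; the spatial kernel uses the reported \(\sigma_{xy} = 13\) and a compat value in the reported 10–11 range, while the appearance-kernel parameters are illustrative assumptions.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(probs: np.ndarray, image: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """Refine softmax probability maps with a dense CRF.

    probs: (n_labels, H, W) float32 network output after softmax.
    image: (H, W, 3) uint8 original fundus image for the appearance kernel.
    """
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))   # data term from network predictions
    d.addPairwiseGaussian(sxy=13, compat=10)      # spatial smoothness, sigma_xy = 13
    # Appearance kernel (color-dependent smoothness); sxy/srgb here are assumptions.
    d.addPairwiseBilateral(sxy=40, srgb=13, rgbim=np.ascontiguousarray(image), compat=10)
    return np.argmax(d.inference(n_iters), axis=0).reshape(h, w)   # refined label map
```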
Measurement module
The measurement module utilizes CRF-optimized segmentation results of the optic cup and disc to calculate the vertical CDR, providing a critical quantitative indicator for glaucoma diagnosis. The calculation formula is shown in Eq. (2.3), where \(V_{cup}\) and \(V_{disc}\) represent the vertical diameters of the optic cup and disc, respectively. Specifically, the measurement module first extracts the vertical diameters of the optic cup and disc from the segmentation results, then computes the vertical CDR through their ratio. A CDR value exceeding 0.6 is typically regarded as a potential diagnostic marker for glaucoma. By transforming segmentation results into clinically meaningful CDR metrics, the measurement module provides essential quantitative evidence for clinical decision-making.
$$\mathrm{CDR} = \frac{V_{cup}}{V_{disc}}$$
(2.3)
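A minimal NumPy sketch of Eq. (2.3): the vertical diameters are taken as the row extents of the binary cup and disc masks, and the 0.6 threshold flags suspected glaucoma; the helper names are ours.

```python
import numpy as np

def vertical_cdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """Vertical cup-to-disc ratio (Eq. 2.3) from binary segmentation masks."""
    def v_diameter(mask: np.ndarray) -> int:
        rows = np.where(mask.any(axis=1))[0]          # image rows containing the structure
        return int(rows[-1] - rows[0] + 1) if rows.size else 0

    v_disc = v_diameter(disc_mask)
    return v_diameter(cup_mask) / v_disc if v_disc else 0.0

# Threshold-based screening: CDR > 0.6 is treated as a potential glaucoma marker
# suspected = vertical_cdr(cup_mask, disc_mask) > 0.6
```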
Statistical method
To quantitatively analyze the evaluation results, this study employs the Dice coefficient and Intersection over Union (IoU) as primary evaluation metrics. The Dice coefficient is a widely used similarity measure that effectively assesses the overlap between two samples. Its calculation formula is as follows:
$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$
(2.4)
Here, A represents the predicted segmentation region, and B denotes the ground truth annotation region. The Dice coefficient ranges from 0 to 1, with values closer to 1 indicating higher overlap between the prediction and ground truth. IoU is another crucial metric for evaluating segmentation results, calculated as the ratio of the intersection to the union of the predicted and ground truth regions. The formula is as follows:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
(2.5)
The IoU value also ranges from 0 to 1, with higher values indicating better segmentation performance of the model.
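Both metrics follow directly from mask overlaps; a short NumPy sketch (assuming non-empty binary masks) is given below.

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Dice (Eq. 2.4) and IoU (Eq. 2.5) for binary masks A (prediction) and B (ground truth)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum())        # 2|A ∩ B| / (|A| + |B|)
    iou = inter / np.logical_or(pred, gt).sum()       # |A ∩ B| / |A ∪ B|
    return float(dice), float(iou)
```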
Results
Accurate segmentation of the optic cup and disc is a critical step in glaucoma diagnosis, as its precision directly affects the calculation accuracy of the CDR. Addressing the complexity of cup and disc segmentation tasks, this study proposes a series of optimized model architectures aimed at enhancing feature extraction capabilities and refining segmentation processing. These improvements are designed to boost the model’s recognition and segmentation accuracy for cup and disc regions, providing robust quantitative indicators for glaucoma diagnosis. The study is conducted using the Drishti-GS dataset for training and validation. Drishti-GS is a retinal image dataset specifically designed for optic nerve head segmentation tasks, primarily used for automated glaucoma assessment research. The dataset consists of 101 images, including 31 images of normal eyes and 70 images from glaucoma patients, divided into a training set (50 images) and a test set (51 images).

The computational complexity of the proposed E-DCoAtUNet cascade architecture was evaluated to assess its clinical feasibility. The model contains approximately 349.40 M parameters, and the average inference time on a 384 × 384 fundus image is 77.3 ms on an NVIDIA RTX 4090 GPU. Although the model size is relatively large, the inference speed remains within a clinically acceptable range for real-time glaucoma screening applications.

The models were implemented in PyTorch and trained using the Adam optimizer. The batch size was set to 2, and training was performed for 200 epochs. The initial learning rate was fixed at 1e-5 to ensure stable convergence given the relatively small dataset size, and a cosine annealing scheduler was applied to adaptively decay the learning rate during training.
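The reported training configuration can be reproduced with a few lines of PyTorch; the 1 × 1-conv model and random tensors below are placeholder stand-ins for E-DCoAtUNet and the Drishti-GS data pipeline.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Placeholder stand-ins for the real model and dataset (illustrative assumptions)
model = nn.Conv2d(3, 2, kernel_size=1)
data = TensorDataset(torch.randn(4, 3, 384, 384), torch.randint(0, 2, (4, 384, 384)))
train_loader = DataLoader(data, batch_size=2, shuffle=True)   # batch size 2

criterion = nn.CrossEntropyLoss()                             # assumed segmentation loss
optimizer = Adam(model.parameters(), lr=1e-5)                 # fixed initial learning rate
scheduler = CosineAnnealingLR(optimizer, T_max=200)           # cosine annealing decay

for epoch in range(200):                                      # 200 training epochs
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # per-epoch LR update
```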
Ablation experiment
Building on the exceptional feature extraction capabilities of CoAtNet, this study first constructs CoAtUNet, which accurately captures structural features of the optic disc and cup through multi-scale feature extraction and effective fusion of local and global information. To address the complex overlapping relationship and ambiguous boundary characteristics between the optic cup and disc, a dual encoder-decoder architecture, DCoAtUNet, is further designed, significantly improving the model’s segmentation performance for the cup region. Additionally, to optimize boundary details and regional consistency of segmentation results, the model incorporates CRF as a post-processing module, thereby enhancing segmentation accuracy and robustness. To further improve the generalization ability and alignment of automated CDR estimation with clinical standards, we propose an enhanced variant, E-DCoAtUNet. This network introduces architectural refinements and domain-specific optimization strategies, enabling more precise segmentation in challenging cases and achieving superior consistency between estimated CDR values and expert annotations. Experiments are conducted on both the Drishti-GS and REFUGE datasets, with the model trained on 50 images and validated on 51 images from Drishti-GS, while cross-dataset validation on REFUGE further demonstrates the robustness of E-DCoAtUNet. This study employs Dice coefficients and IoU metrics to analyze segmentation performance, and additionally evaluates CDR estimation accuracy using the Pearson correlation coefficient r and mean absolute error (MAE), with experimental results summarized in Table 1.
[IMAGE OMITTED: SEE PDF]
To provide a more intuitive comparison of different models’ performance in optic cup and disc segmentation tasks, this study conducts a visual analysis of segmentation results. As shown in Fig. 6, the performance of each model in optic disc segmentation is illustrated. Due to the distinct characteristics of the optic disc in fundus images, most models achieve high segmentation accuracy. However, in certain cases, segmentation results are still affected by interference. For example, in CoAtUNet’s segmentation results, a noticeable shape anomaly appears in the lower right corner of the first fundus image. This may be caused by interference from vascular features, leading to erroneous expansion of the disc region. In contrast, E-DCoAtUNet effectively avoids this issue in the same scenario, producing more regular boundaries, which demonstrates that the decoder design significantly enhances the model’s anti-interference capability in complex backgrounds. Furthermore, with the additional incorporation of CRF, boundary details in the segmentation results are further refined, proving the value of post-processing strategies in improving model segmentation performance.
[IMAGE OMITTED: SEE PDF]
Additionally, in optic cup segmentation, while CoAtUNet captures most regions of the cup, its performance is relatively coarse when handling complex boundaries. In contrast, E-DCoAtUNet demonstrates significant improvements in segmentation results, as shown in Fig. 7.
[IMAGE OMITTED: SEE PDF]
To evaluate the contribution of each component in the proposed E-DCoAtUNet, ablation experiments were performed by selectively removing the CBA, MFE, and EAM modules. As shown in Table 2, removing any single module leads to a slight decline in Dice and IoU scores, whereas the complete model achieves the highest performance. These results indicate that all three modules provide complementary benefits, and their integration is essential for maximizing segmentation accuracy and ensuring reliable CDR estimation.
[IMAGE OMITTED: SEE PDF]
Method comparison
To comprehensively compare the performance of different models in optic disc and cup segmentation tasks, this study evaluates multiple models on the same dataset in a local experimental environment, with results presented in Table 3. The experimental results reveal significant performance differences among models. The traditional U-Net performs well in optic disc segmentation but is relatively weak in cup segmentation, achieving a Dice score of only 88.87%, indicating that it struggles to fully capture the complex features of the optic cup owing to boundary ambiguity and regional confusion. Integrating ResNet34 and ResNet50 as encoders into U-Net improves cup segmentation, with ResNet50 + U-Net achieving a Dice score of 89.38%; its Precision and Recall are also more balanced, reflecting the advantage of deeper encoders in feature extraction. However, EfficientNet + U-Net and U-Net++ fail to break through the performance bottleneck in cup segmentation, with noticeable fluctuations in Precision and Recall. Although PSPNet performs well in Recall, its Dice and Precision are relatively low, suggesting a trade-off in boundary detail preservation when capturing global features. SwinUNet (Dice 89.71%) and TransUNet (Dice 89.52%) are competitive but still fall short in cup segmentation accuracy, whereas the proposed E-DCoAtUNet raises the cup Dice score to 90.74% and, with CRF post-processing, reaches 90.81%, highlighting its superiority in handling boundary ambiguity and overlapping structures.
[IMAGE OMITTED: SEE PDF]
By subtracting the predicted results from the ground-truth labels, segmentation errors of different models can be visually analyzed. As shown in Fig. 8, white regions represent segmentation errors, which are typically concentrated near disc boundaries or in complex background areas. After this subtraction, all models exhibit noticeable error regions in the lower right corner. However, as illustrated in Fig. 8(h), the proposed joint cup-disc segmentation system shows the smallest error distribution, confirming its superior performance in optic disc segmentation tasks.
[IMAGE OMITTED: SEE PDF]
Additionally, this study visualizes segmentation errors for the optic cup, as shown in Fig. 9. Experimental results indicate that E-DCoAtUNet, combined with post-processing, exhibits a discontinuous distribution of cup segmentation errors, suggesting high accuracy in predicting cup boundaries. Furthermore, among multiple models, the proposed model demonstrates the smallest error in the left region, further validating its stability in optic cup segmentation.
[IMAGE OMITTED: SEE PDF]
To further validate the effectiveness of the proposed method, this study compares the model with existing publicly available methods on the same dataset, with experimental results presented in Table 4. The results demonstrate that E-DCoAtUNet, combined with conditional random fields, exhibits significant advantages in segmentation performance, achieving Dice coefficients of 97.60% for the optic disc and 90.81% for the optic cup, outperforming state-of-the-art methods. This indicates that the architectural design of E-DCoAtUNet + CRF plays a crucial role in improving segmentation accuracy and stability.
[IMAGE OMITTED: SEE PDF]
Model generalization capability verification
To further evaluate the generalization capability of the proposed model, we conducted additional experiments on the REFUGE (Retinal Fundus Glaucoma Challenge) dataset. The REFUGE dataset was jointly established by the Eye and ENT Hospital of Fudan University and Zhongshan Ophthalmic Center, and was first released in the MICCAI 2018 Glaucoma Detection Challenge. It consists of 400 training, 400 validation, and 400 test images with detailed annotations for both OD and OC, making it the largest publicly available dataset of its kind [52]. Because its imaging conditions, patient populations, and annotation protocols differ from those of Drishti-GS, REFUGE provides a suitable benchmark for assessing model robustness in cross-dataset scenarios. Table 5 summarizes the segmentation performance of different models on the REFUGE dataset.
[IMAGE OMITTED: SEE PDF]
The results show that the conventional U-Net achieves stable performance in OD segmentation with a Dice score of 97.78%, but its OC segmentation remains limited at 95.28%. In contrast, Transformer-based approaches such as SwinUNet and TransUNet demonstrate stronger performance in OC segmentation, with Dice scores of 95.71% and 95.52%, respectively. Furthermore, the proposed E-DCoAtUNet consistently achieves superior performance across both OD and OC tasks, with the OC Dice score reaching 95.98%, highlighting its strength in addressing boundary ambiguity and overlapping regions. Notably, when combined with CRF as a post-processing step, the OD segmentation Dice of E-DCoAtUNet increases to 98.52%, further confirming the effectiveness of architectural design and post-processing optimization.
Figure 10 illustrates the segmentation results of different models for the optic disc on the REFUGE dataset. It can be observed that the conventional U-Net delineates the disc region reasonably well in most cases, but its boundary precision is limited. In comparison, SwinUNet and TransUNet demonstrate more stable overall contour preservation, although slight deviations still occur near the edges. The proposed E-DCoAtUNet achieves superior performance in both boundary consistency and detail recovery, yielding contours that align more closely with the ground-truth annotations. With the integration of CRF post-processing, the E-DCoAtUNet results almost completely overlap with the reference labels, further validating its effectiveness in optic disc segmentation.
[IMAGE OMITTED: SEE PDF]
Figure 11 presents the visualization results of different models for optic cup segmentation. Due to the inherent boundary ambiguity and overlap with the disc region, the conventional U-Net performs suboptimally in this task, often leading to over-segmentation or under-segmentation. SwinUNet and TransUNet alleviate some of these issues by capturing structural contours more effectively, yet minor inaccuracies remain in fine structures. In contrast, the proposed E-DCoAtUNet exhibits stronger boundary discrimination, significantly reducing regional confusion. When further combined with CRF post-processing, the segmentation results show high consistency with the ground truth, particularly in handling overlapping and ambiguous regions, highlighting its robustness in optic cup segmentation.
[IMAGE OMITTED: SEE PDF]
These findings indicate that E-DCoAtUNet not only performs well on Drishti-GS but also maintains robust segmentation accuracy on REFUGE, demonstrating its enhanced generalization potential for clinical deployment. Future work will focus on extending validation to additional public and multi-center clinical datasets to comprehensively assess the model’s generalization ability.
Interpretability analysis
To enhance the clinical interpretability of the proposed model, we applied saliency-based visualizations (Grad-CAM) to fundus images. As illustrated in Figs. 12 and 13, the model’s attention is predominantly focused on the optic disc and optic cup regions, which are consistent with the anatomical structures evaluated by ophthalmologists during glaucoma screening. Separate visualizations for disc and cup segmentation confirm that the model effectively distinguishes between the two structures, with high attention intensity concentrated at their respective boundaries. This consistency between model attention and clinical regions of interest further demonstrates the reliability and clinical credibility of the proposed system.
[IMAGE OMITTED: SEE PDF]
[IMAGE OMITTED: SEE PDF]
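As one way to generate such maps, the sketch below implements Grad-CAM with forward and backward hooks in PyTorch; the model, target layer, and class index are placeholders, not the paper’s exact configuration.

```python
import torch

def grad_cam(model, image, target_layer, class_idx):
    """Grad-CAM sketch: weight a layer's activations by its pooled gradients."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[:, class_idx].sum()          # scalar score for the chosen class
    model.zero_grad()
    score.backward()                                  # gradients w.r.t. the target layer
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = torch.relu((w * acts["a"]).sum(dim=1))      # channel-weighted activation map
    return cam / cam.max().clamp(min=1e-8)            # normalize to [0, 1]
```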
Glaucoma screening
This study calculates the CDR based on segmentation results of the optic cup and disc to assess the likelihood of glaucoma, with specific results shown in Table 6. Experiments demonstrate that the proposed method achieves a precision of 84.09%, a recall of 97.37%, and an F1-score of 90.24% in glaucoma diagnosis. Compared with another study reporting 83.78% on the same task [53], this study improves the figure to 84.09% through optimized segmentation and calculation strategies. Compared with existing glaucoma diagnostic methods based on cup-disc segmentation, the proposed approach demonstrates significant performance advantages.
[IMAGE OMITTED: SEE PDF]
In addition, statistical analysis was conducted to evaluate the agreement between the predicted CDR values and expert annotations. The results indicate that the proposed system achieved strong consistency with manual references, with a Pearson correlation coefficient of 0.919 and a mean absolute error of 0.0387. Furthermore, Bland–Altman analysis (Fig. 14) confirmed this agreement, showing a mean bias close to zero and 95% limits of agreement within ± 0.10. These findings highlight that the proposed method not only provides reliable glaucoma detection but also effectively reduces the risk of missed diagnoses.
[IMAGE OMITTED: SEE PDF]
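These agreement statistics can be computed in a few lines with NumPy and SciPy; cdr_pred and cdr_ref below are placeholders for the predicted and expert CDR arrays.

```python
import numpy as np
from scipy import stats

cdr_pred = np.array([0.45, 0.62, 0.71, 0.38])   # placeholder predicted CDR values
cdr_ref = np.array([0.43, 0.65, 0.68, 0.40])    # placeholder expert annotations

r, _ = stats.pearsonr(cdr_pred, cdr_ref)        # Pearson correlation coefficient
mae = np.mean(np.abs(cdr_pred - cdr_ref))       # mean absolute error

diff = cdr_pred - cdr_ref                       # Bland-Altman quantities
bias = diff.mean()                              # mean bias (near zero for good agreement)
loa = (bias - 1.96 * diff.std(ddof=1),
       bias + 1.96 * diff.std(ddof=1))          # 95% limits of agreement
```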
Discussion
Deep learning technologies have demonstrated significant potential in the field of intelligent glaucoma diagnosis, providing crucial clinical support. However, this domain still faces numerous technical challenges that require resolution [54]. First, the morphological relationship between the optic cup and disc regions is complex and exhibits high inter-individual variability. Traditional methods and existing deep learning models show clear limitations in capturing these subtle and intricate features. Second, the ambiguity of cup-disc boundaries further increases the difficulty of optic cup segmentation tasks. To address these challenges, we propose a deep learning–based automated system that optimizes feature extraction and segmentation of the optic cup and disc. The system integrates high-precision segmentation, boundary refinement via post-processing, and objective CDR measurement in a fully automated and interpretable pipeline. This design enables consistent and reliable CDR calculation, supporting clinical decision-making and early glaucoma detection, and demonstrates strong practical and clinical potential.
Focusing on glaucoma diagnosis tasks, this study proposes a diagnostic system based on optic cup and disc segmentation. The system consists of three core modules: (1) a cup-disc segmentation module, (2) a conditional random field post-processing module, and (3) a vertical CDR measurement module. First, the segmentation module introduces E-DCoAtUNet, a dual encoder-decoder architecture designed to address challenges such as ambiguous cup-disc boundaries, morphological diversity, and complex morphological features. Compared with traditional segmentation models, E-DCoAtUNet captures cup and disc features more accurately through its dual encoder-decoder design, significantly improving segmentation precision. Subsequently, a CRF is incorporated as a post-processing step to further optimize boundary clarity and regional consistency. The measurement module calculates the vertical CDR based on the segmentation results and enables automated glaucoma screening through threshold-based classification. High-precision segmentation ensures the accuracy of CDR calculations, thereby enhancing the reliability of the screening system.
The experimental evaluation is conducted on the Drishti-GS dataset to assess system performance. In segmentation tasks, E-DCoAtUNet achieves Dice scores of 97.52% for the optic disc and 90.74% for the optic cup, which further improve to 97.60% and 90.81%, respectively, after incorporating CRF. For glaucoma screening tasks, the system achieves Precision, Recall, and F1-Score values of 84.09%, 97.37%, and 90.24%, demonstrating its practical utility. In addition, to evaluate the generalization capability of the proposed system, further experiments were conducted on the REFUGE dataset. The results demonstrate that E-DCoAtUNet also achieves excellent segmentation performance on this dataset, with Dice scores of 98.03% for the optic disc and 95.98% for the optic cup. With the integration of CRF post-processing, the OD Dice score is further improved to 98.52%. These findings highlight the robustness and adaptability of the proposed method in cross-dataset scenarios, underscoring its potential for broader clinical deployment. Furthermore, statistical validation was performed to assess the agreement between the system’s CDR measurements and expert annotations. The results indicate a strong correlation, with a Pearson correlation coefficient of 0.919 and a mean absolute error of only 0.0387. These outcomes suggest that the proposed system not only generalizes well across datasets but also aligns closely with clinical expert judgments, thereby enhancing its reliability and practical utility in glaucoma screening. The proposed deep learning-based glaucoma diagnostic system, built on segmentation and integrated with a measurement module, provides reliable technical support for early screening and precise diagnosis of glaucoma. This system holds significant clinical value and application potential. The novelty of this study lies in its targeted architectural and algorithmic refinements that enhance the ability to model ambiguous boundaries and morphological variations. These innovations lead to improved CDR measurement accuracy and further strengthen the clinical utility of the system for glaucoma screening.
Although this study has achieved significant breakthroughs in intelligent glaucoma diagnosis, several limitations remain. First, the dual encoder-decoder architecture has high computational complexity and stringent hardware requirements, which may limit its practical application in resource-constrained clinical environments. Second, the parameters of the CRF in the post-processing module require manual tuning, leaving room for improvement in adaptability and automation. Additionally, the measurement module heavily relies on segmentation accuracy, particularly in regions with ambiguous boundaries or pathological changes. Deviations in segmentation results may affect the accuracy of CDR calculations, thereby restricting the system’s applicability to more complex cases. Moreover, this study mainly focuses on CDR as the structural biomarker for glaucoma diagnosis, while in clinical practice, comprehensive assessment typically requires integration of additional structural and functional indicators, such as retinal nerve fiber layer thickness and visual field indices.
Future research will focus on addressing these limitations to further enhance the system’s accuracy and practicality. For CDR calculation, the current method is based on 2D image segmentation and assumes the optic cup as a regular planar structure, without fully considering the complex 3D characteristics of actual fundus images. Future work could explore 3D morphological feature-based calculation methods, such as introducing regional shape fitting models or surface geometric reconstruction techniques, to simulate the true spatial structure of the optic cup and thereby calculate CDR more accurately. Additionally, the segmentation model can be further optimized by designing specialized enhancement modules for regions with ambiguous boundaries, such as boundary attention mechanisms or multi-scale feature fusion strategies, to reduce segmentation errors. Simultaneously, the computational complexity of the dual encoder-decoder architecture can be reduced, and the parameter auto-tuning capability of the post-processing module can be optimized to improve the model’s robustness and adaptability. Furthermore, multimodal integration of CDR with other structural and functional indicators will be explored to achieve a more comprehensive and clinically reliable glaucoma diagnostic system. Through these improvements, the system will provide more efficient and precise technical support for early glaucoma screening and diagnosis, with broader clinical application prospects.
Conclusions
Glaucoma is the second leading cause of blindness worldwide, posing a severe threat to patients’ visual health and quality of life. Early diagnosis is crucial for slowing visual impairment and preventing irreversible blindness. To address this challenge, this study proposes a deep learning-based intelligent glaucoma diagnostic system. The system consists of three modules: optic cup and disc segmentation, post-processing, and measurement. It enables automated processing of fundus images to accurately segment cup and disc regions, optimize boundary continuity of segmentation results, and quantitatively measure glaucoma-related pathological parameters. Experimental validation demonstrates that the system achieves high accuracy and robustness in both cup-disc segmentation and glaucoma screening tasks, outperforming traditional methods in efficiency and reliability. Compared to traditional diagnostic approaches relying on manual judgment, the proposed system significantly reduces the impact of human subjectivity on diagnostic outcomes, providing an efficient, automated, and low-cost solution for early glaucoma screening and diagnosis. This research not only offers critical technical support for clinical glaucoma diagnosis but also demonstrates the vast potential of artificial intelligence in medical image analysis.
Data availability
The datasets used in this study are publicly available. The Drishti-GS dataset can be accessed at https://cvit.iiit.ac.in/projects/mip/drishti-gs/mip-dataset2/Home.php, and the REFUGE dataset can be accessed at https://refuge.grand-challenge.org.
Abbreviations
OD:
Optic Disc
OC:
Optic Cup
CDR:
Cup-to-Disc Ratio
CRF:
Conditional Random Field
MAE:
Mean Absolute Error
IoU:
Intersection over Union
E-DCoAtUNet:
Enhanced Dual CoAtUNet (an enhanced dual encoder–decoder network)
IOP:
Intraocular Pressure
OCT:
Optical Coherence Tomography
CBA:
Cross Branch Attention
MFE:
Multi-Scale Feature Enhancement
EAM:
Edge Aware Module
References
Yang WH, Xu YW, Sun XH. Guidelines for glaucoma imaging classification, annotation, and quality control for artificial intelligence applications. Int J Ophthalmol. 2025;18(7):1181.
Singh LK, Garg H, Khanna M. An artificial intelligence-based smart system for early glaucoma recognition using OCT images. Int J E-Health Med Commun (IJEHMC). 2021;12(4):32–59.
Singh LK, Garg H, Pooja. Automated glaucoma type identification using machine learning or deep learning techniques. In: Advancement of machine intelligence in interactive medical image analysis. Singapore: Springer Singapore; 2019. pp. 241–63.
Bourne RR, Stevens GA, White RA, et al. Causes of vision loss worldwide, 1990–2010: a systematic analysis. Lancet Glob Health. 2013;1(6):e339–49.
Flaxman SR, Bourne RRA, Resnikoff S, et al. Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis. Lancet Global Health. 2017;5(12):e1221–34.
Abdull MM, Chandler C, Gilbert C. Glaucoma, the silent thief of sight: patients’ perspectives and health seeking behaviour in Bauchi, Northern Nigeria. BMC Ophthalmol. 2016;16:1–9.
Michelson G, Hornegger J, Wärntges S, et al. The papilla as screening parameter for early diagnosis of glaucoma. Deutsches Aerzteblatt Int. 2008;105(34–35):583.
Moreno MV, Houriet C, Grounauer PA. Ocular phantom-based feasibility study of an early diagnosis device for glaucoma. Sensors. 2021;21(2):579.
Harizman N, Oliveira C, Chiang A, et al. The ISNT rule and differentiation of normal from glaucomatous eyes. Arch Ophthalmol. 2006;124(11):1579–83.
Guo L, Yang JJ, Peng L, et al. A computer-aided healthcare system for cataract classification and grading based on fundus image analysis. Comput Ind. 2015;69:72–80.
Singh LK, Garg H. Detection of glaucoma in retinal images based on multiobjective approach. Int J Appl Evolutionary Comput (IJAEC). 2020;11(2):15–27.
Medeiros FA, Zangwill LM, Bowd C, et al. Use of progressive glaucomatous optic disk change as the reference standard for evaluation of diagnostic tests in glaucoma. Am J Ophthalmol. 2005;139(6):1010–8.
Ashtari-Majlan M, Dehshibi MM, Masip D. Glaucoma diagnosis in the era of deep learning: a survey. Expert Syst Appl. 2024;256:124888.
Juneja M, Thakur S, Uniyal A, et al. Deep learning-based classification network for glaucoma in retinal images. Comput Electr Eng. 2022;101:108009.
Madadi Y, Abu-Serhan H, Yousefi S. Domain Adaptation-Based deep learning model for forecasting and diagnosis of glaucoma disease. Biomed Signal Process Control. 2024;92:106061.
Sharma SK, Muduli D, Priyadarshini R, et al. An evolutionary supply chain management service model based on deep learning features for automated glaucoma detection using fundus images. Eng Appl Artif Intell. 2024;128:107449.
Mandal S, Jammal AA, Malek D, et al. Progression or aging? A deep learning approach for distinguishing glaucoma progression from age-related changes in OCT scans. Am J Ophthalmol. 2024.
Prastyo PH, Sumi AS, Nuraini A. Optic cup segmentation using U-net architecture on retinal fundus image. J Inform Technol Comput Eng. 2020;4(2):105–9.
Tabassum M, Khan TM, Arsalan M, et al. CDED-Net: joint segmentation of optic disc and optic cup for glaucoma screening. IEEE Access. 2020;8:102733–47.
Hervella ÁS, Rouco J, Novo J, et al. End-to-end multi-task learning for simultaneous optic disc and cup segmentation and glaucoma classification in eye fundus images. Appl Soft Comput. 2022;116:108347.
Hua K, Fang X, Tang Z, et al. DCAM-NET: A novel domain generalization optic cup and optic disc segmentation pipeline with multi-region and multi-scale Convolution attention mechanism. Comput Biol Med. 2023;163:107076.
Nawaz M, Nazir T, Javed A, et al. An efficient deep learning approach to automatic glaucoma detection using optic disc and optic cup localization. Sensors. 2022;22(2):434.
He H, Qiu J, Lin L, et al. JOINEDTrans: prior guided multi-task transformer for joint optic disc/cup segmentation and fovea detection. Comput Biol Med. 2024;177:108613.
Sharma P, Takahashi N, Ninomiya T, et al. A hybrid multi model artificial intelligence approach for glaucoma screening using fundus images. Npj Digit Med. 2025;8(1):130.
Singh LK, Khanna M, Monga H, et al. Nature-inspired algorithms-based optimal features selection strategy for COVID-19 detection using medical images. New Generation Comput. 2024;42(4):761–824.
Singh LK, Khanna M, Garg H. Multimodal biometric based on fusion of ridge features with minutiae features and face features. Int J Inform Syst Model Des (IJISMD). 2020;11(1):37–57.
Singh LK, Khanna M, Singh R. Efficient feature selection for breast cancer classification using soft computing approach: a novel clinical decision support system. Multimedia Tools Appl. 2024;83(14):43223–76.
Li NX, Yang QS, Wang J, et al. Current status and prospect of digital technology in management of patients with mental disorders comorbid with physical diseases. Digit Med Health. 2024;2(5):344–8. https://doi.org/10.3760/cma.j.cn101909-20231211-00078.
Cheng H, Huang ZH, Zhang JT, et al. Integration of human–machine intelligence and rehabilitation to promote the human–machine rehabilitation paradigm. Digit Med Health. 2025;3(4):280–92. https://doi.org/10.3760/cma.j.cn101909-20250514-00083.
Wu J, Fang H, Zhu J, et al. Multi-rater prism: learning self-calibrated medical image segmentation from multiple raters. Sci Bull. 2024;69(18):2906–19.
Gong D, Li WT, Li XM, et al. Development and research status of intelligent ophthalmology in China. Int J Ophthalmol. 2024;17(12):2308.
Wu H, Xiao B, Codella N, et al. CvT: introducing convolutions to vision transformers. International Conference on Computer Vision. 2021:22–31.
Zhang Q, Yang YB. ResT: an efficient transformer for visual recognition. Adv Neural Inf Process Syst. 2021;34:15475–85.
Dai Z, Liu H, Le QV, et al. Coatnet: marrying Convolution and attention for all data sizes. Adv Neural Inf Process Syst. 2021;34:3965–77.
Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Int Conf Mach Learn. 2001;1(2):3.
Imtiaz R, Khan TM, Naqvi SS, et al. Screening of glaucoma disease from retinal vessel images using semantic segmentation. Comput Electr Eng. 2021;91:107036.
Pachade S, Porwal P, Kokare M, et al. NENet: nested EfficientNet and adversarial learning for joint optic disc and cup segmentation. Med Image Anal. 2021;74:102253.
Luo L, Xue D, Pan F, et al. Joint optic disc and optic cup segmentation based on boundary prior and adversarial learning. Int J Comput Assist Radiol Surg. 2021;16(6):905–14.
Zhao X, Wang S, Zhao J, et al. Application of an attention u-net incorporating transfer learning for optic disc and cup segmentation. Signal Image Video Process. 2021;15:913–21.
Sun JD, Yao C, Liu J, et al. GNAS-U2Net: a new optic cup and optic disc segmentation architecture with genetic neural architecture search. IEEE Signal Process Lett. 2022;29:697–701.
Liu B, Pan D, Shuai Z, et al. ECSD-Net: A joint optic disc and cup segmentation and glaucoma classification network based on unsupervised domain adaptation. Comput Methods Programs Biomed. 2022;213:106530.
Jiang Y, Ma Z, Wu C, et al. RSAP-Net: joint optic disc and cup segmentation with a residual spatial attention path module and MSRCR-PT pre-processing algorithm. BMC Bioinformatics. 2022;23(1):523.
Liu Z, Chen Y, Xiang X, et al. An end-to-end real-time lightweight network for the joint segmentation of optic disc and optic cup on fundus images. Mathematics. 2022;10(22):4288.
Sun G, Zhang Z, Zhang J, et al. Joint optic disc and cup segmentation based on multi-scale feature analysis and attention pyramid architecture for glaucoma screening. Neural Comput Appl. 2023;1–14.
Tadisetty S, Chodavarapu R, Jin R, et al. Identifying the edges of the optic cup and the optic disc in glaucoma patients by segmentation. Sensors. 2023;23(10):4668.
Chen Z, Pan Y, Xia Y. Reconstruction-driven dynamic refinement based unsupervised domain adaptation for joint optic disc and cup segmentation. IEEE J Biomedical Health Inf. 2023;27(7):3537–48.
Jiang L, Tang X, You S, et al. BEAC-Net: Boundary-Enhanced adaptive context network for optic disk and optic cup segmentation. Appl Sci. 2023;13(18):10244.
Chen Y, Liu Z, Meng Y, et al. Lightweight optic disc and optic cup segmentation based on MobileNetv3 convolutional neural network. Biomimetics. 2024;9(10):637.
Yu J, Chen N, Li J, et al. LC-MANet: location-constrained joint optic disc and cup segmentation via multiplex aggregation network. Comput Electr Eng. 2024;118:109423.
Liu Y, Wu J, Zhu Y, et al. Combined optic disc and optic cup segmentation network based on adversarial learning. IEEE Access. 2024:194296–321.
Li L, Zhou Y, Yang G. Robust source-free domain adaptation for fundus image segmentation. Winter Conference on Applications of Computer Vision. 2024: 7840–7849.
Orlando JI, Fu H, Breda JB, et al. Refuge challenge: a unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med Image Anal. 2020;59:101570.
Pawar DJ, Kanse YK, Patil SS. Insights into fundus images to identify glaucoma using convolutional neural network. International Conference on Image Processing and Capsule Networks. 2022:654–663.
Li F, Wang D, Yang Z, et al. The AI revolution in glaucoma: bridging challenges with opportunities. Prog Retin Eye Res. 2024;101291.