1. Introduction
The accurate classification of medical images is crucial for the effective diagnosis and treatment of numerous diseases. Nevertheless, medical image classification systems frequently encounter obstacles, particularly when dealing with insufficient, imbalanced, or unlabeled datasets, since annotating retinal images requires specialized expertise and is time-consuming and costly [1,2]; these limitations can significantly degrade accuracy and overall performance. Additionally, retinal images are prone to variations in capture conditions, such as differences in lighting, focus, and imaging equipment, which introduce noise and inconsistencies and lead to poor model generalization and overfitting [3]. Traditional data augmentation techniques, such as geometric transformations (e.g., rotation, translation, zoom), are commonly employed to mitigate these dataset imbalances, but they are often insufficient for generating the subtle pathological variations needed to improve model performance [4]. Furthermore, the detection of fine-grained pathological features, including microaneurysms, hemorrhages, and exudates, remains challenging due to their small size and variability [5]. These limitations highlight the need for more sophisticated data augmentation methods that can generate realistic and diverse synthetic images while preserving critical diagnostic features. With the advent of artificial intelligence and deep learning, more sophisticated methods have emerged. One such method is the Generative Adversarial Network (GAN) [6], which is capable of generating realistic synthetic images that are virtually indistinguishable from real ones. Despite this, traditional GANs often suffer from issues such as mode collapse, which limits the diversity of generated images [7]. This has led to the exploration of more advanced augmentation methods based on GANs. For instance, the AC-WGAN-GP framework proposed by Sun et al. [8] demonstrated the efficacy of GANs in generating high-quality synthetic samples for hyperspectral image classification, highlighting the importance of integrating techniques such as auxiliary classifiers and gradient penalties to ensure stable training and high-quality sample generation. These principles are equally applicable to medical image analysis, where generating high-fidelity images is crucial for improving classification performance. Our proposed method, based on the Wasserstein GAN with Gradient Penalty (WGAN-GP), aims to overcome these challenges by ensuring stable training and generating medically plausible synthetic images, thereby addressing the data scarcity problem and enhancing the performance of retinal image classification models.
GANs have become increasingly important in the field of medical image processing, where they are widely used to create synthetic datasets that help improve classification performance across various medical tasks. For instance, ref. [9] demonstrated the effectiveness of using GAN-augmented datasets to enhance diagnostic accuracy in the classification of liver lesions. Their application of a Deep Convolutional GAN (DCGAN) [10] notably increased dataset diversity, leading to a marked improvement in model performance.
Recent advancements have continued to push the boundaries of synthetic image generation in medical image analysis. For example, ref. [11] introduced a novel architecture utilizing conditional GANs (cGANs) [12] to generate realistic mammography images containing synthetic lesions. By allowing for controlled variation in lesion types and their specific positions, their approach tackled the challenge of limited data availability in breast cancer screening. This technique led to significant improvements in the performance of classification models trained on their augmented datasets, providing a promising solution for enhancing breast cancer diagnosis.
In a related vein, ref. [13] proposed an explainable GAN framework designed to synthesize retinal images for the detection of diabetic retinopathy. Their work underscored the critical need for interpretability in GAN models, ensuring that medical professionals can understand and trust the synthetic images used in training classification systems. The incorporation of attention mechanisms within their framework further enhanced the generation of lesions, ensuring that the synthesized images were closely aligned with clinical standards and expectations. This focus on explainability adds a layer of transparency, which is essential in medical applications.
Similarly, ref. [14] introduced a multiscale GAN architecture aimed at generating high-resolution medical images, with a particular focus on improving the granularity of synthetic data. This work is especially relevant for generating small-scale lesions, which are critical for early diagnosis of various medical conditions. The multiscale approach effectively addressed the challenge of generating fine details in synthetic images, overcoming a limitation that had hindered previous methods in capturing the necessary resolution for accurate diagnosis.
Building on these advancements, we propose the use of a conditional Generative Adversarial Network (cGAN) for generating synthetic medical images tailored to classification tasks. Specifically, we adopt the Wasserstein GAN with Gradient Penalty (WGAN-GP) [15] framework to ensure the production of high-quality images while mitigating the risk of mode collapse, a frequent issue in GAN-based models. Our approach integrates medical domain knowledge directly into the image generation process, ensuring that the resulting synthetic images are not only visually convincing but also medically plausible, aligning with the diagnostic requirements of healthcare professionals.
The main contributions of this work are as follows:
Application of WGAN-GP for Enhanced Stability: The use of WGAN-GP ensures the generation of high-quality images while avoiding issues such as mode collapse, which is a common problem in traditional GANs. This enhancement contributes to more stable training and results in superior image synthesis, making the model more reliable for medical applications.
Lesion Extraction and Style Transfer: We ensure that the synthetic images retain critical pathological features by incorporating lesion extraction and style transfer techniques into the GAN model. This incorporation makes the generated data particularly valuable for training classification models in medical imaging, as it closely mirrors real-world medical conditions.
Dataset Augmentation for Improved Model Performance: The synthetic images produced by our method are used to augment real datasets, addressing the challenge of data scarcity in medical imaging. This augmentation leads to improved performance in medical image classification models, as the expanded dataset allows the model to learn from a more diverse set of examples.
Building on the need for image augmentation in medical contexts, this work presents a novel approach using WGAN-GP to generate high-quality synthetic retinal images. The subsequent sections provide a detailed breakdown of our methodology and experimental design. In Section 2, we describe the core components of the proposed method, including the application of WGAN-GP, lesion extraction, style transfer, and preprocessing steps tailored for medical image synthesis. Section 3 outlines the datasets used and the specific configurations for each, followed by Section 4, which analyzes our experimental results, comparing the generated images across various metrics and databases. The paper concludes with a discussion of key findings, expert evaluations, and areas for future improvement in synthetic medical image generation.
2. Materials and Methods
Our proposed method leverages WGAN-GP to generate synthetic medical images with the goal of augmenting datasets used for classification tasks. The framework, illustrated in Figure 1, comprises several key components, each designed to ensure high-quality image generation and robust model training:
2.1. WGAN-GP
In our proposed method, the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) [15] was specifically chosen to improve the quality and stability of synthetic medical image generation. WGAN-GP addresses issues such as mode collapse, which commonly affects traditional GANs, by employing a Wasserstein loss and incorporating a gradient penalty to enforce the Lipschitz constraint during training. This method ensures smoother gradients and leads to more stable training, which is crucial for generating high-quality medical images.
In recent years, more advanced methods have emerged for image generation. StyleGAN and StyleGAN2 [16,17] introduced a style-based architecture and progressive growing to achieve high-resolution image synthesis with precise control over visual attributes. While StyleGAN excels at generating natural images, it requires large datasets and significant computational resources, which limits its applicability in medical imaging. In contrast, WGAN-GP remains robust with smaller datasets and simpler architectures.
BigGAN [18] enhances image quality by scaling up model size and dataset scope, using class-conditional generation and orthogonal regularization. However, BigGAN’s reliance on extensive computational power and large-scale datasets makes it less practical for medical image synthesis. WGAN-GP offers a more efficient alternative while maintaining high fidelity.
Diffusion models, such as Denoising Diffusion Probabilistic Models (DDPMs) [19,20], have gained attention for their ability to produce high-fidelity images through an iterative denoising process. Despite their impressive results, diffusion models require high computational costs and slow inference times, making them less suitable for rapid image generation tasks. In comparison, WGAN-GP’s adversarial framework offers faster training and inference.
Methods integrating multimodal techniques have also advanced image generation capabilities. MEGAN (Mixture of Experts of Generative Adversarial Networks) [21] leverages a mixture of expert GANs to generate images conditioned on multiple modalities, such as text, images, or categorical data. By combining outputs from specialized GANs, MEGAN achieves higher diversity and quality in multimodal image synthesis. These techniques can enhance medical image generation by incorporating additional modalities, such as clinical reports or annotations, alongside image data. However, these methods often require more complex architectures and training strategies, which may not always be feasible in data-limited medical contexts.
Generative Adversarial MultiTask Learning [22] generates images while simultaneously performing auxiliary tasks (e.g., face recognition and sketch synthesis). This approach improves the quality of the generated images by leveraging multiple related tasks during training. In medical imaging, similar multitask methods can enhance synthesis by incorporating diagnostic tasks, ensuring the generated images are clinically relevant and useful for downstream applications. While multitask GANs offer these advantages, their training can be more challenging due to the need for balanced learning across multiple objectives.
GANs with Transformers (GANformer) [23] combine GANs with transformer-based attention mechanisms to model long-range dependencies in images, improving global coherence. Latent Diffusion Models (LDMs) [24] optimize the diffusion process by operating in a lower-dimensional latent space, reducing computational demands while maintaining high-resolution outputs. Though powerful, these models introduce additional complexity and dependencies that may not always align with medical image generation requirements.
Compared with these newer methods, WGAN-GP remains a compelling choice due to its stability, efficiency, and ability to perform well on smaller datasets. This makes it particularly effective for generating synthetic medical images where preserving pathological details, computational feasibility, and robustness are paramount.
In our implementation, the generator network $G$ takes a vessel segmentation image $y$, lesion descriptors $l$, and a noise vector $z$ as inputs. Here, $y$ and $l$ provide physiological and pathological information, respectively. The noise vector $z$ introduces randomness, allowing the model to produce different outputs each time. The output is a synthesized diabetic retinopathy fundus image $\hat{x}$.
To ensure the stability of the training process and enhance the quality of the synthesized images, we employ the WGAN-GP. This approach optimizes the generator to produce high-quality images while maintaining smooth gradient behavior, thereby avoiding common issues like mode collapse. The entire synthesis process can be expressed as $\hat{x} = G(y, l, z)$, capturing the interplay of input features and randomness in the generation of realistic medical images.
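How the three inputs are combined is not spelled out above; the following is a minimal sketch under the assumption that the vessel map, the two lesion tensors described in Section 2.2, and a spatially broadcast noise vector are concatenated channel-wise before entering the generator. Tensor names and shapes are illustrative, not the exact implementation used here.

```python
import torch

def assemble_generator_input(vessel_map, lesion_small, lesion_large, z):
    """Channel-wise concatenation of the conditioning maps with broadcast noise.

    vessel_map:                 (B, 1, 512, 512) vessel segmentation y (assumed shape)
    lesion_small, lesion_large: (B, 1, 512, 512) binarized lesion maps l (assumed shape)
    z:                          (B, z_dim) noise vector
    """
    b, _, h, w = vessel_map.shape
    # Broadcast the noise vector to a spatial map so it can be concatenated
    # with the image-shaped conditioning inputs.
    z_map = z.view(b, -1, 1, 1).expand(b, z.shape[1], h, w)
    return torch.cat([vessel_map, lesion_small, lesion_large, z_map], dim=1)
```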
We use the WGAN-GP strategy by solving the following optimization problem, which governs the interaction between the generator $G$ (with parameters $\theta_G$) and the discriminator $D$ (with parameters $\theta_D$):

\min_{\theta_G} \max_{\theta_D} \; \mathcal{L}_{W} + \lambda_{per} \mathcal{L}_{per} + \lambda_{sev} \mathcal{L}_{sev}    (1)
where $\mathcal{L}_{W}$ represents the Wasserstein loss, with $\mathcal{L}_{per}$ and $\mathcal{L}_{sev}$ being the perceptual loss and severity loss, respectively, weighted by $\lambda_{per}$ and $\lambda_{sev}$.

Wasserstein Loss: The Wasserstein loss is computed on real and synthesized images and is defined as:

\mathcal{L}_{W} = \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\hat{x} \sim \mathbb{P}_g}[D(\hat{x})] - \lambda_{gp} \, \mathbb{E}_{\bar{x} \sim \mathbb{P}_{\bar{x}}}\big[ (\| \nabla_{\bar{x}} D(\bar{x}) \|_2 - 1)^2 \big]    (2)

where $\mathbb{P}_r$ and $\mathbb{P}_g$ denote the real and generated image distributions, $\bar{x}$ is sampled uniformly along straight lines between pairs of real and generated images, and $\lambda_{gp}$ weights the gradient penalty.
In this case, learning the discriminator parameters $\theta_D$ involves maximizing the Wasserstein loss $\mathcal{L}_{W}$, which ensures the discriminator provides meaningful feedback during training. To maintain stable training and avoid exploding or vanishing gradients, a gradient penalty is applied, promoting smooth updates to the model parameters.
For the generator, learning the parameters $\theta_G$ corresponds to minimizing the following loss function:

\mathcal{L}_{G} = - \mathbb{E}_{\hat{x} \sim \mathbb{P}_g}[D(\hat{x})]    (3)
This objective encourages the generator to produce synthetic images that the discriminator finds indistinguishable from real ones, thereby improving the overall quality of the generated data. The interplay between these two loss functions drives the adversarial training process, ultimately leading to the generation of realistic and clinically relevant medical images.
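To illustrate how these losses translate into code, the snippet below gives a minimal PyTorch sketch of the critic and generator objectives of Eqs. (1)-(3). The generator/discriminator modules, the gradient-penalty weight of 10, and the unit loss weights are placeholder assumptions rather than our exact training code.

```python
import torch

def gradient_penalty(D, real, fake, device):
    """Gradient penalty term of Eq. (2): push ||grad_x D(x)||_2 toward 1 on interpolates."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=device)
    x_bar = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_out = D(x_bar)
    grads = torch.autograd.grad(outputs=d_out, inputs=x_bar,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def critic_loss(D, real, fake, lambda_gp=10.0):
    """Negative of the Wasserstein objective in Eq. (2), so the critic can be
    trained by minimization with a standard optimizer."""
    fake = fake.detach()                      # critic step does not update the generator
    wasserstein = D(fake).mean() - D(real).mean()
    return wasserstein + lambda_gp * gradient_penalty(D, real, fake, real.device)

def generator_loss(D, fake, per_loss, sev_loss, lambda_per=1.0, lambda_sev=1.0):
    """Adversarial term of Eq. (3) plus the perceptual and severity terms of Eq. (1)."""
    return -D(fake).mean() + lambda_per * per_loss + lambda_sev * sev_loss
```

In practice the critic and generator steps would alternate, with the critic typically updated several times per generator update, as is standard for WGAN-GP training.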
2.2. Lesion Extraction
Inspired by the work of [13], we employed the DR Detector from Kaggle’s Diabetic Retinopathy Detection competition [25], where the objective was to detect Diabetic Retinopathy (DR) and classify it into five severity levels: 0 (No DR), 1, 2, 3, and 4 according to the International Clinical Diabetic Retinopathy Disease Severity Scale [26]. The second-place team (Team ‘o_O’) publicly shared their implementation, which we utilized as the foundation for our lesion extraction process.
Using this implementation, we calculated the backpropagation of the lesion activations generated by the network. To refine the extracted lesions, we applied a Gaussian filter followed by binarization, resulting in two tensors representing lesions of different sizes. This approach, shown in Figure 2, allowed us to isolate and capture the critical pathological features in the retinal images, which are crucial for further analysis.
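The following is a hedged sketch of this gradient-based lesion-map step, assuming a PyTorch DR grading network with a five-class output; the Gaussian sigma and binarization threshold are illustrative values rather than those used in our pipeline.

```python
import torch
from scipy.ndimage import gaussian_filter

def extract_lesion_map(dr_detector, image, sigma=2.0, threshold=0.5):
    """Backpropagate the activation of the predicted DR grade to the input image,
    then smooth and binarize the gradient magnitude (sigma/threshold are illustrative)."""
    x = image.detach().clone().requires_grad_(True)       # (1, 3, H, W) fundus image
    logits = dr_detector(x)                                # (1, 5) severity logits
    logits[0, logits.argmax()].backward()
    saliency = x.grad.abs().max(dim=1)[0].squeeze(0).cpu().numpy()   # (H, W)
    smoothed = gaussian_filter(saliency, sigma=sigma)
    smoothed = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min() + 1e-8)
    return (smoothed > threshold).astype("float32")        # binary lesion mask
```

Applying the same procedure with two different smoothing scales would yield the two lesion tensors of different sizes mentioned above.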
2.3. Style Transfer
To ensure our generated images closely resemble real images, we adopted a style transfer [27] technique that employs two distinct loss functions as shown in Figure 3. The first, called severity loss, leverages the DR Detector from the previous section (Team ‘o_O’ detector) to evaluate the grading of real and synthetic images. The severity loss is calculated as follows:
\mathcal{L}_{sev} = \big\| f_{DR}(x) - f_{DR}(\hat{x}) \big\|_2^2    (4)

This loss function measures the difference between the diabetic retinopathy grading of the real image $x$ and the synthetic image $\hat{x}$, where $f_{DR}$ denotes the grading output of the DR Detector, ensuring that the generated image preserves the clinical severity level of the disease. The second loss function is the perceptual loss, which evaluates the divergence between real and synthetic images in the feature space of a perceptual network. This helps accurately reconstruct both pathological and physiological details in the synthetic images. Our approach employs a pre-trained VGG-19 model as the perceptual network. For a specific layer $j$ and the VGG feature extraction function $\phi_j$, we define the perceptual loss as

\mathcal{L}_{per} = \big\| \phi_j(x) - \phi_j(\hat{x}) \big\|_2^2    (5)
This loss encourages the generator to create images that are not only visually similar to the real images but also match their structural and textural features at different levels of abstraction.
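A minimal PyTorch sketch of the VGG-19 perceptual loss in Eq. (5) is given below; the truncation at conv4_2 and the use of torchvision's pre-trained ImageNet weights are assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """L2 distance between frozen VGG-19 features of real and synthetic images."""
    def __init__(self, layer_index=21):          # index 21 = conv4_2 (assumed layer choice)
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features[:layer_index + 1]
        for p in features.parameters():
            p.requires_grad_(False)               # the perceptual network stays frozen
        self.features = features.eval()

    def forward(self, real, fake):
        return F.mse_loss(self.features(fake), self.features(real))
```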
2.4. Preprocessing
Given that each database contained images of different sizes, we standardized all images to a uniform size of 512 × 512 pixels. This resizing ensured that the images remained manageable for processing while preserving critical information. After scaling the images, we extracted the corresponding masks and segmented the images using the Spatial Attention U-Net (SA-UNet) model proposed by [28]. This segmentation step ensures that critical regions, such as blood vessels, are accurately extracted, allowing for more precise lesion generation and classification during the image synthesis process. The preprocessing step was crucial for maintaining consistency across datasets and ensuring accurate lesion extraction and style transfer in the subsequent stages of our method.
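As an illustration of the resizing step, the snippet below assumes TensorFlow's tf.image.resize, whose method names (nearest, bilinear, bicubic, lanczos5, mitchellcubic) and antialias flag match the interpolation configurations compared in Section 4; it is a sketch, not a statement about the exact preprocessing code, and the SA-UNet segmentation step is not shown.

```python
import tensorflow as tf

# Interpolation names mirroring the configurations compared in Section 4.
RESIZE_METHODS = ["nearest", "bilinear", "bicubic", "lanczos5", "mitchellcubic"]

def standardize(image, method="nearest", antialias=False):
    """Resize a fundus image to the uniform 512x512 resolution used in preprocessing."""
    return tf.image.resize(image, size=(512, 512), method=method, antialias=antialias)
```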
3. Experimental Setting
In this study, we began by collecting and preprocessing several publicly available medical image datasets relevant to diabetic retinopathy and retinal lesions. The datasets described in Table 1 include the Kaggle Diabetic Retinopathy dataset [25], Indian Diabetic Retinopathy Image Dataset (IDRiD) [29], Retinal-Lesions [30], Fine-Grained Annotated Diabetic Retinopathy (FGADR) [31], and Retinal Fundus MultiDisease Image Dataset (RFMiD) [32]. Each dataset provides a rich collection of retinal images that are essential for training and evaluating our synthetic image generation model.
3.1. Kaggle Database
The Kaggle Diabetic Retinopathy Detection dataset [25] was developed for a competition on the Kaggle platform to create models capable of identifying diabetic retinopathy in retinal images. The dataset was provided by the EyePACS platform and sponsored by the California Healthcare Foundation. It is one of the most widely used datasets for diabetic retinopathy research, containing a total of 88,702 high-resolution images captured under diverse conditions. The primary purpose of this dataset is to classify diabetic retinopathy severity into five categories: No DR, Mild, Moderate, Severe, and Proliferative DR. For our study, we selected a subset of the dataset using 700 randomly selected images from each class, resulting in a total of 3500 images. From this subset, 1400 images were used for training, and the remaining 2100 images were reserved for testing. We opted to use a smaller subset of the dataset than initially available due to the extended training time required for larger datasets and the inherent class imbalance in the original data, with only 712 images available for the Proliferative DR class.
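A hedged sketch of this class-balanced subsampling is shown below, assuming the competition's trainLabels.csv with columns image and level; the random seed and file name are illustrative.

```python
import pandas as pd

# Kaggle trainLabels.csv lists each image with its DR grade (level 0-4).
labels = pd.read_csv("trainLabels.csv")

# Draw 700 images per grade (3500 total), then split 1400 train / 2100 test.
subset = (labels.groupby("level", group_keys=False)
                .apply(lambda g: g.sample(n=700, random_state=42)))
subset = subset.sample(frac=1.0, random_state=42)          # shuffle before splitting
train_split, test_split = subset.iloc[:1400], subset.iloc[1400:]
```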
3.2. IDRiD Database
The Indian Diabetic Retinopathy Image Dataset (IDRiD, [29]) is a publicly available database created by the Indian Institute of Technology, Delhi. This dataset contains high-resolution retinal images with pixel-level annotations for various lesions associated with diabetic retinopathy. Its primary purpose is to facilitate research in the segmentation and classification of diabetic retinopathy lesions, making it a valuable resource for advancing automated detection and analysis of retinal diseases. The pixel-level annotations provided in this dataset are particularly useful for developing and evaluating algorithms focused on identifying and delineating specific pathological features, such as microaneurysms, hemorrhages, and exudates, which are crucial for accurate diagnosis.
3.3. Retinal-Lesions Database
The Retinal-Lesions Database [30] was developed to provide detailed annotations of lesions and severity levels in retinal images. It is commonly used for training models that detect and classify various types of lesions associated with diabetic retinopathy. This dataset contains 1842 selected images sourced from the Kaggle Diabetic Retinopathy Detection database. They have been re-labeled by a panel of 45 ophthalmologists. The re-labeling process categorizes the images into five levels of diabetic retinopathy (DR) and eight distinct lesion classes. The comprehensive nature of this dataset with expert annotations makes it a valuable tool for improving the performance of automated systems in accurately detecting and classifying both the severity of DR and the presence of specific lesions.
3.4. FGADR Database
The Fine-Grained Annotated Diabetic Retinopathy (FGADR) database, developed by [31], was created to provide extensive resources for the development and evaluation of machine learning models focused on diabetic retinopathy. This database contains 2842 images, which are divided into two subsets: 1842 images with pixel-level lesion annotations and 1000 images with diabetic retinopathy severity grades assessed by six ophthalmologists. For our work, we utilized the first subset of 1842 images with pixel-level annotations. However, since our focus was on verifying the capacity of lesion transfer in the synthetic images, we opted not to use the annotations in our process. The availability of numerous annotated lesions made this dataset suitable for testing the effectiveness of our lesion transfer techniques in generating medically plausible synthetic images.
3.5. RFMID Database
The Retinal Fundus MultiDisease Image Dataset (RFMID) [32] is a comprehensive dataset containing retinal images captured for the diagnosis of multiple retinal diseases. This dataset was designed to address a variety of visual health issues, including diabetic retinopathy, age-related macular degeneration, glaucoma, and several other retinal pathologies. Each image in this database is labeled with high precision, identifying the specific diseases present. We utilized this dataset to demonstrate the versatility of our method, showing its applicability beyond diabetic retinopathy by testing it on various retinal diseases.
3.6. Image Metrics
To evaluate the quality of the generated synthetic images, we used several established image metrics that provide quantitative insights into fidelity, diversity, and structural consistency. These metrics ensure a comprehensive assessment of the performance of our proposed WGAN-GP method.
- Fréchet Inception Distance (FID) [33]: The FID metric is widely used to measure the similarity between the distributions of real and generated images. It does so by comparing the mean and covariance of features extracted from an Inception [34] network trained on the ImageNet [35] dataset. Lower FID scores indicate that the generated images are closer in distribution to the real images, reflecting higher image quality and realism. This metric is particularly useful in evaluating GANs because it accounts for both the diversity of the generated images (by measuring how well they capture the distribution of the real data) and their fidelity (by assessing how realistic individual images appear). Unlike simple pixel-wise comparisons, FID considers the underlying feature distribution, making it more robust to minor variations in pixel values. Let $(\mu_r, \Sigma_r)$ be the mean and covariance of the features of the real images and $(\mu_g, \Sigma_g)$ be the corresponding values for the generated images. The FID score is defined as:

\mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big)    (6)
- Mean Squared Error (MSE): MSE measures the average squared difference between the pixel values of the real and generated images. Lower MSE values correspond to higher image similarity, indicating more accurate reproduction of the original images. Let $r$ and $g$ represent the real and generated images, respectively, with $N$ being the total number of pixels. The MSE is defined as

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (r_i - g_i)^2    (7)
- Structural Similarity Index (SSIM) [36]: SSIM assesses the perceived quality of images by comparing luminance, contrast, and structure between two images. An SSIM value closer to 1 indicates higher structural similarity between the real and synthetic images, capturing key visual information beyond simple pixel-wise differences. Let $r$ and $g$ be the real and generated images. The SSIM is defined as

\mathrm{SSIM}(r, g) = \frac{(2 \mu_r \mu_g + C_1)(2 \sigma_{rg} + C_2)}{(\mu_r^2 + \mu_g^2 + C_1)(\sigma_r^2 + \sigma_g^2 + C_2)}    (8)

where $\mu_r$ and $\mu_g$ are the means, $\sigma_r^2$ and $\sigma_g^2$ are the variances, $\sigma_{rg}$ is the covariance, and $C_1$ and $C_2$ are constants for numerical stabilization. SSIM values closer to 1 indicate higher similarity. This metric is particularly relevant for GANs as it assesses perceptual image quality by focusing on structural content rather than individual pixel values. A computational sketch of these metrics is provided below.
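The sketch below shows how these metrics can be computed; it assumes Inception features have already been extracted for FID and that images are float arrays in [0, 1] with channels last for MSE/SSIM. The library choices (NumPy/SciPy/scikit-image) are assumptions rather than the exact evaluation code.

```python
import numpy as np
from scipy import linalg
from skimage.metrics import mean_squared_error, structural_similarity

def fid_score(real_feats, gen_feats):
    """Eq. (6): Frechet distance between Gaussian fits of Inception features (n, d)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

def pixel_metrics(real, generated):
    """Eqs. (7) and (8): MSE and SSIM for image pairs scaled to [0, 1], channels last."""
    mse = mean_squared_error(real, generated)
    ssim = structural_similarity(real, generated, channel_axis=-1, data_range=1.0)
    return mse, ssim
```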
4. Results
We compare our results with PathoGAN from [13] and a cGAN. For each dataset, we generated five synthetic datasets, and the results are summarized in the following tables.
4.1. Retinal-Lesions Database
Table 2 presents the results of our experiments with the Retinal-Lesions database. The test set consisted of 1256 images, and we generated an equivalent number of synthetic images based on their corresponding segmented images. The first column lists the different methods we evaluated. Among the configurations tested, we found that the WGAN-GP with nearest-neighbor interpolation consistently outperformed the other methods across all metrics. In Figure 4, we illustrate examples where the original image did not contain lesions. The synthetic images generated by WGAN-GP demonstrate successful synthesis of the overall color, retinal blood vessels, and a well-formed optic disc that is clearly visible. In contrast, the images generated by the cGAN exhibit several issues, such as areas with incorrect coloration, staining, and a poorly formed optic disc. The images generated by PathoGAN are slightly better than those from the cGAN, showing a more consistent structure and an improved optic disc, though still not as refined as those from the WGAN-GP.
4.2. FGADR Database
Table 3 presents the results of our evaluations on the FGADR database. This dataset posed significant challenges to the implementation of our method. In this case, the cGAN achieved the best FID score, with a value of 28.16. On the other hand, WGAN-GP with nearest-neighbor interpolation performed best in terms of MSE, achieving a value of 0.544, while WGAN-GP with bilinear interpolation yielded the highest SSIM score, with a value of 0.793.
The difficulty in achieving optimal performance with WGAN-GP on this dataset is likely due to the inherent noise in the images, making it harder for the model to learn effectively. This issue highlights the potential for further improvements when dealing with noisy datasets. Despite the lower FID score, the visual quality of the images generated by WGAN-GP remains comparable to the originals, as shown in Figure 5, where the color and structure of the generated images are similar to the real images.
4.3. IDRID Database
For the IDRiD database, the nearest-neighbor configuration proved to be optimal, achieving an FID score of 71.54. The cGAN configuration achieved the lowest MSE, with a value of 0.00722, while the Mitchell-Bicubic configuration attained the highest SSIM, with a score of 0.6852, as shown in Table 4. Due to the small size of this dataset, some features were not captured in full detail, leading to potential overfitting in the model. This issue is evident in the images shown in Figure 6, where the generated images display good color representation and capture key characteristics of retinal images, such as the veins, optic disc, and macula. However, some finer details may have been lost due to the dataset’s limited size.
4.4. Kaggle Database
In the Kaggle Diabetic Retinopathy dataset, the nearest-neighbor algorithm yielded the best results, achieving an FID score of 15.21. The bilinear antialiasing configuration performed best in terms of MSE and SSIM, with values of 0.002025 and 0.89, as can be seen in Table 5. This dataset presented a significant challenge in terms of processing time due to its large size, with training times ranging from 3 to 9 days, depending on the configuration. Due to its complexity, the bicubic algorithm took the longest, requiring 9 days of training. It is worth noting that, in the literature, there is only one study that uses a cGAN with this dataset [14], but the code was not made publicly available, which prevented direct comparison with our results.
Figure 7 presents sample images generated with each configuration. It is evident that WGAN-GP successfully synthesizes the key characteristics of retinal images, such as color, blood vessels, and optic disc. In contrast, the cGAN struggled to accurately transfer the images’ color and structural details.
4.5. RFMiD
For the RFMiD database, we tested the best overall configuration from previous experiments—namely, the nearest-neighbor method. This dataset was used to assess the capability of the proposed method to generate lesions in medical contexts other than diabetic retinopathy.
In Figure 8, we present a comparison of three real images and three synthetic images generated using the nearest-neighbor configuration. The results, summarized in Table 6, show an FID score of 69.25, comparable to the quality achieved with the IDRiD images. This suggests that the proposed method produces synthetic images of acceptable quality, even when applied to different retinal pathologies. Concerning lesion transfer, Figure 8 shows that, despite the method not being specifically designed for this context, it was able to detect and transfer lesions to the synthetic images successfully.
4.6. Lesions
Images containing lesions were also generated, as depicted in Figure 9, Figure 10, Figure 11, and Figure 12, which display examples from all four databases. In these images, the real counterparts feature visible lesions. The results indicate that the proposed method effectively captures and synthesizes the lesions in a manner faithful to the originals. The generated images maintain the overall structure of the retina, including key characteristics such as color, veins, and the optic disc, without introducing artifacts or distortions. In contrast, the images generated by cGAN and PathoGAN do not successfully transfer the lesions, and the colors diverge noticeably from the original images. Additionally, artifacts and a less defined optic disc are present in their synthetic images, except for the IDRiD database, where the small dataset size leads to overfitting, resulting in color tones closer to the originals. Specifically, in Figure 10, although cGAN achieved the best FID score, it does not transfer the lesions as effectively as our improved WGAN-GP method. One limitation of our method is that the lesions appear slightly blurred compared with the real images, but overall, it still outperforms cGAN in lesion synthesis and image quality.
4.7. Expert Evaluation
To assess how convincingly the generated images could be perceived as real, a survey was conducted using 50 randomly selected real images from the Retinal-Lesions dataset and 50 synthetic images generated by our method. For each image, respondents were asked to determine whether the image was real or synthetic and rate the overall quality of the image.
The survey was completed by three experts in the field of ophthalmology. Each expert evaluated the images independently, ensuring a diversity of perspectives based on their specific fields of expertise.
Table 7 presents the accuracy of each expert in distinguishing between real and synthetic images. A lower accuracy corresponds to a higher quality of the generated images, as it implies that the experts had difficulty distinguishing between real and synthetic images. Additionally, the table includes the fidelity score, which represents the respondents' average rating (on a scale from 1 to 10) for each synthetic image. The results indicate that the experts correctly identified only 56.66% of the images (a mix of real images and synthetic images generated by the WGAN-GP method). Since random guessing would be expected to achieve an accuracy of 50%, this result suggests that our synthetic images mimic real images convincingly. The results demonstrate the high quality of the generated images regarding color, structure, and texture, as even trained professionals found it challenging to differentiate between real and synthetic images.
5. Discussion
The proposed method, based on WGAN-GP with nearest-neighbor interpolation, has demonstrated superior performance across multiple datasets for generating synthetic medical images, particularly for diabetic retinopathy. Compared with existing approaches such as cGAN and PathoGAN, WGAN-GP consistently achieved better results on key metrics such as FID, MSE, and SSIM in the Retinal-Lesions, FGADR, and Kaggle databases. The method's ability to mitigate mode collapse, a common issue in GANs, significantly contributed to its success in generating diverse and high-quality images.
One of the standout features of this method is its effectiveness in lesion transfer, where the generated images closely replicated the pathological features of the originals, such as color, vessel structure, and the optic disc. Although the method occasionally produced slightly blurred lesions, this issue could be addressed by incorporating detail refinement techniques in future work.
In the expert evaluation, the high average rating for synthetic images and the 56.66% accuracy in distinguishing between real and synthetic images further validated the realism of the generated images. This evaluation suggests that the proposed method can produce clinically relevant synthetic images that are difficult to differentiate from real ones, reinforcing its potential utility in medical image augmentation for training classification models.
The practical applications of this work are significant for addressing the challenges of data scarcity and class imbalance in medical imaging, particularly in fields like ophthalmology. High-quality synthetic retinal images generated by our WGAN-GP method can be used to augment datasets for training deep learning models tasked with detecting conditions such as diabetic retinopathy, hypertension, and age-related macular degeneration. This augmentation can improve the generalization performance of classification models, leading to more reliable diagnostic tools. Furthermore, by integrating multitasking and multimodal learning principles, the synthetic image generation process can be enhanced to support related tasks such as lesion segmentation, disease grading, and patient-specific diagnosis. For example, a system that generates synthetic retinal images conditioned on patient-specific parameters could help in personalized medicine by simulating disease progression or treatment outcomes.
Dataset-specific challenges highlight areas for improvement, such as the FGADR database’s image noise and the IDRiD database’s small size, leading to overfitting. Incorporating more robust preprocessing techniques or exploring advanced data augmentation strategies could improve the model’s generalization in noisy or small datasets.
While the method performed well overall, training time on large datasets such as Kaggle remains a concern, with some configurations taking up to nine days. Despite this, the model's ability to synthesize high-quality images from large datasets suggests that, with further optimization, the method could become more time-efficient without sacrificing image quality.
6. Conclusions
In this work, we presented a WGAN-GP-based method with nearest-neighbor interpolation for generating high-quality synthetic medical images, specifically targeting diabetic retinopathy and related retinal diseases. The proposed approach demonstrated superior performance compared with cGAN and PathoGAN, particularly in terms of lesion transfer fidelity and overall image quality, as validated by both quantitative metrics and expert evaluations. Key contributions of this work include the accurate replication of lesions and critical retinal structures and the ability to generate clinically relevant synthetic data, which can enhance medical image classification models. The results highlight the potential of WGAN-GP in addressing data scarcity and improving classification accuracy in medical contexts.
However, challenges such as blurring of lesions, handling noisy data, and managing training time in large datasets remain to be solved. Future work will focus on addressing these limitations by incorporating advanced refinement techniques, improved preprocessing, and possibly exploring semi-supervised or hybrid learning approaches. Additionally, asking experts different questions about the pathological components in the images could provide valuable insights into how clinical professionals interpret our images. This feedback could help us understand potential gaps or strengths in the image analysis process and improve the overall accuracy and relevance of our approach. The findings suggest that the proposed WGAN-GP method is a promising tool for synthetic data generation in medical image analysis, with potential applications in clinical and research environments.
Conceptualization, H.A.-S. and L.A.-R.; Methodology, H.A.-S., L.A.-R., R.D.-H. and S.Z.-M.; Software, H.A.-S.; Validation, S.Z.-M.; Formal analysis, L.A.-R.; Investigation, H.A.-S.; Data curation, H.A.-S.; Writing—original draft, H.A.-S., L.A.-R., R.D.-H. and S.Z.-M.; Writing—review & editing, L.A.-R., R.D.-H. and S.Z.-M. All authors have read and agreed to the published version of the manuscript.
The original data presented in the study are openly available in github.com/hector-anaya.
The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject or materials discussed in this manuscript.
Figure 3. Style Transfer diagrams. (a) Diagram illustrating the perceptual loss process, utilizing VGG19 for feature extraction. (b) Diagram depicting the severity loss process, where a pretrained CNN is employed for retinal classification.
Figure 4. Images for each configuration where the real image does not contain lesions. The image labeled PathoGAN was generated by the implementation of [13]; the others were generated with WGAN-GP using different resizing algorithms, except for the cGAN. Underlined is the best FID result.
Figure 5. Comparison of generated images across different configurations, where the real image does not contain lesions. The images generated by WGAN-GP and PathoGAN exhibit smoothing effects, while the cGAN successfully transfers the noise from the original image. Underlined is the best FID result.
Figure 6. Comparison with generated and real image samples. The images generated using the proposed method exhibit colors and textures that are more similar to the real image. In contrast, the images generated by the cGAN and PathoGAN show color variations in areas where the real image does not present them. Underlined is the best FID result.
Figure 7. Comparison with generated and real image samples. The proposed method successfully extracts and preserves the color and texture of the original image, while the cGAN method displays different tones. Underlined is the best FID result.
Figure 8. Comparison with generated and real image samples. The proposed method successfully transfers lesions from the original images.
Figure 9. Sample images with lesions from the Retinal-Lesions database. Underlined is the best FID result.
Figure 10. Sample images with lesions from the FGADR database. Underlined is the best FID result.
Figure 11. Sample images with lesions from the IDRiD database. Underlined is the best FID result.
Figure 12. Sample images with lesions from the Kaggle database. Underlined is the best FID result.
Description of the databases used.
Database | Description | Resolution | Number of Images |
---|---|---|---|
Kaggle | DR images, classified in 5 classes | 1444 × 1444, 2184 × 3456 | 1379 train, 2069 test |
IDRiD | DR images with segmented lesions | 4288 × 2848 | 54 train, 27 test |
Retinal-Lesions | DR images with lesion annotations and severity grading | 896 × 896 | 337 train, 1256 test |
FGADR | DR images with lesion segmentation and severity grading | 1280 × 1280 | 500 train, 1342 test |
RFMiD | Images with different illness present, labeled by illness | 2048 × 1536, 512 × 512 | 348 train, 174 test |
Experimental results on the Retinal-Lesions database. The arrows (↓, ↑) indicate the desired direction for performance improvement: ↓ denotes that lower values are better (MSE and FID), while ↑ indicates that higher values are preferable (SSIM).
Retinal-Lesions | |||
---|---|---|---|
Method | MSE ↓ | SSIM ↑ | FID ↓ |
PathoGAN | | | |
cGAN | | | |
WGAN-GP w/mitchellbicubic | | | |
WGAN-GP w/bicubic | | | |
WGAN-GP w/nearest | | | |
WGAN-GP w/lanczos5 | | | |
WGAN-GP w/bilinear | | | |
WGAN-GP w/bilinear+antialias | | | |
Experimental results on the FGADR database. The arrows (↓, ↑) indicate the desired direction for performance improvement: ↓ denotes that lower values are better (MSE and FID), while ↑ indicates that higher values are preferable (SSIM).
FGADR | |||
---|---|---|---|
Method | MSE ↓ | SSIM ↑ | FID ↓ |
PathoGAN | | | |
cGAN | | | |
WGAN-GP w/mitchell-bicubic | | | |
WGAN-GP w/bicubic | | | |
WGAN-GP w/nearest | | | |
WGAN-GP w/lanczos5 | | | |
WGAN-GP w/bilinear | | | |
WGAN-GP w/bilinear+antialias | | | |
Experimental results on the IDRiD database. The arrows (↓, ↑) indicate the desired direction for performance improvement: ↓ denotes that lower values are better (MSE and FID), while ↑ indicates that higher values are preferable (SSIM).
IDRiD | |||
---|---|---|---|
Method | MSE ↓ | SSIM ↑ | FID ↓ |
PathoGAN | | | |
cGAN | | | |
WGAN-GP w/mitchellbicubic | | | |
WGAN-GP w/bicubic | | | |
WGAN-GP w/nearest | | | |
WGAN-GP w/lanczos5 | | | |
WGAN-GP w/bilinear | | | |
WGAN-GP w/bilinear+antialias | | | |
Experimental results on the Kaggle database. The arrows (↓, ↑) indicate the desired direction for performance improvement: ↓ denotes that lower values are better (MSE and FID), while ↑ indicates that higher values are preferable (SSIM).
Kaggle | |||
---|---|---|---|
Method | MSE ↓ | SSIM ↑ | FID ↓ |
cGAN | | | |
WGAN-GP w/mitchellbicubic | | | |
WGAN-GP w/bicubic | | | |
WGAN-GP w/nearest | | | |
WGAN-GP w/lanczos5 | | | |
WGAN-GP w/bilinear | | | |
WGAN-GP w/bilinear+antialias | | | |
Experimental results on the RFMID dataset. The arrows (↓, ↑) indicate the desired direction for performance improvement: ↓ denotes that lower values are better (MSE and FID), while ↑ indicates that higher values are preferable (SSIM).
RFMiD | |||
---|---|---|---|
Method | MSE ↓ | SSIM ↑ | FID ↓ |
WGAN-GP w/nearest | | | |
Table of the accuracy and ratings given by the surveyed experts.
Respondents | Accuracy | Rating |
---|---|---|
Expert 1 | 52% | 8.84 |
Expert 2 | 58% | 7.06 |
Expert 3 | 60% | 7.84 |
Average | 56.66% | 7.91 |
References
1. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal.; 2017; 42, pp. 60-88. [DOI: https://dx.doi.org/10.1016/j.media.2017.07.005]
2. Cheplygina, V.; de Bruijne, M.; Pluim, J.P. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med. Image Anal.; 2019; 54, pp. 280-296. [DOI: https://dx.doi.org/10.1016/j.media.2019.03.009] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30959445]
3. Gargeya, R.; Leng, T. Automated Identification of Diabetic Retinopathy Using Deep Learning. Ophthalmology; 2017; 124, pp. 962-969. [DOI: https://dx.doi.org/10.1016/j.ophtha.2017.02.008] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28359545]
4. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data; 2019; 6, pp. 1-48. [DOI: https://dx.doi.org/10.1186/s40537-019-0197-0]
5. Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M.C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J. et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA; 2016; 316, pp. 2402-2410. [DOI: https://dx.doi.org/10.1001/jama.2016.17216] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27898976]
6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. arXiv; 2014; arXiv: 1406.2661
7. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv; 2017; arXiv: 1701.07875
8. Sun, C.; Zhang, X.; Meng, H.; Cao, X.; Zhang, J. AC-WGAN-GP: Generating Labeled Samples for Improving Hyperspectral Image Classification with Small-Samples. Remote Sens.; 2022; 14, 4910. [DOI: https://dx.doi.org/10.3390/rs14194910]
9. Frid-Adar, M.; Klang, E.; Amitai, M.M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018); Washington, DC, USA, 4–7 April 2018; pp. 289-293.
10. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv; 2016; arXiv: 1511.06434
11. Zhao, H.; Li, H.; Maurer-Stroh, S.; Guo, Y.; Deng, Q.; Cheng, L. Supervised Segmentation of Un-Annotated Retinal Fundus Images by Synthesis. IEEE Trans. Med. Imaging; 2019; 38, pp. 46-56. [DOI: https://dx.doi.org/10.1109/TMI.2018.2854886]
12. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv; 2014; arXiv: 1411.1784
13. Niu, Y.; Gu, L.; Zhao, Y.; Lu, F. Explainable Diabetic Retinopathy Detection and Retinal Image Generation. IEEE J. Biomed. Health Inform.; 2022; 26, pp. 44-55. [DOI: https://dx.doi.org/10.1109/JBHI.2021.3110593] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34495852]
14. Zhou, Y.; Wang, B.; He, X.; Cui, S.; Shao, L. DR-GAN: Conditional Generative Adversarial Network for Fine-Grained Lesion Synthesis on Diabetic Retinopathy Images. IEEE J. Biomed. Health Inform.; 2022; 26, pp. 56-66. [DOI: https://dx.doi.org/10.1109/JBHI.2020.3045475]
15. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. arXiv; 2017; arXiv: 1704.00028
16. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 43, pp. 4217-4228. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.2970919]
17. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA, 13–19 June 2020; pp. 8107-8116.
18. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv; 2019; arXiv: 1809.11096
19. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv; 2020; arXiv: 2006.11239
20. Nichol, A.Q.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. arXiv; 2021; arXiv: 2102.09672
21. Park, D.K.; Yoo, S.; Bahng, H.; Choo, J.; Park, N. MEGAN: Mixture of Experts of Generative Adversarial Networks for Multimodal Image Generation. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18; Stockholm, Sweden, 13–19 July 2018; pp. 878-884.
22. Wan, W.; Lee, H.J. Generative Adversarial Multi-Task Learning for Face Sketch Synthesis and Recognition. Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP); Taipei, Taiwan, 22–25 September 2019; pp. 4065-4069.
23. Hudson, D.A.; Zitnick, L. Generative Adversarial Transformers. arXiv; 2021; arXiv: 2103.01209
24. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. arXiv; 2022; arXiv: 2112.10752
25. Dugas, E.; Jared; Jorge; Cukierski, W. Diabetic Retinopathy Detection. Kaggle. 2015; Available online: https://kaggle.com/competitions/diabetic-retinopathy-detection (accessed on 16 December 2024).
26. Diabetic Retinopathy—Europe. American Academy of Ophthalmology. 2024; Available online: https://www.aao.org/education/topic-detail/diabetic-retinopathy–europe (accessed on 16 December 2024).
27. Gatys, L.; Ecker, A.; Bethge, M. A Neural Algorithm of Artistic Style. J. Vis.; 2016; 16, 326. [DOI: https://dx.doi.org/10.1167/16.12.326]
28. Guo, C.; Szemenyei, M.; Yi, Y.; Wang, W.; Chen, B.; Fan, C. SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation. Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); Milan, Italy, 10–15 January 2021; pp. 1236-1242.
29. Porwal, P.; Pachade, S.; Kamble, R.; Kokare, M.; Deshmukh, G.; Sahasrabuddhe, V.; Meriaudeau, F. Indian Diabetic Retinopathy Image Dataset (IDRiD). 2018; Available online: https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid (accessed on 16 December 2024).
30. Wei, Q.; Li, X.; Yu, W.; Zhang, X.; Zhang, Y.; Hu, B.; Mo, B.; Gong, D.; Chen, N.; Ding, D. et al. Learn to Segment Retinal Lesions and Beyond. Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); Milan, Italy, 10–15 January 2021; pp. 7403-7410.
31. Zhou, Y.; Wang, B.; Huang, L.; Cui, S.; Shao, L. A Benchmark for Studying Diabetic Retinopathy: Segmentation, Grading, and Transferability. IEEE Trans. Med. Imaging; 2021; 40, pp. 818-828. [DOI: https://dx.doi.org/10.1109/TMI.2020.3037771]
32. Pachade, S.; Porwal, P.; Thulkar, D.; Kokare, M.; Deshmukh, G.; Sahasrabuddhe, V.; Giancardo, L.; Quellec, G.; Mériaudeau, F. Retinal Fundus Multi-Disease Image Dataset (RFMiD). 2020; Available online: https://ieee-dataport.org/open-access/retinal-fundus-multi-disease-image-dataset-rfmid (accessed on 16 December 2024).
33. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv; 2017; arXiv: 1706.08500
34. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Boston, MA, USA, 7–12 June 2015; pp. 1-9.
35. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis.; 2015; 115, pp. 211-252. [DOI: https://dx.doi.org/10.1007/s11263-015-0816-y]
36. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process.; 2004; 13, pp. 600-612. [DOI: https://dx.doi.org/10.1109/TIP.2003.819861]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Accurate synthetic image generation is crucial for addressing data scarcity challenges in medical image classification tasks, particularly in sensor-derived medical imaging. In this work, we propose a novel method using a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) and nearest-neighbor interpolation to generate high-quality synthetic images for diabetic retinopathy classification. Our approach enhances training datasets by generating realistic retinal images that retain critical pathological features. We evaluated the method across multiple retinal image datasets, including Retinal-Lesions, Fine-Grained Annotated Diabetic Retinopathy (FGADR), Indian Diabetic Retinopathy Image Dataset (IDRiD), and the Kaggle Diabetic Retinopathy dataset. The proposed method outperformed traditional generative models, such as conditional GANs and PathoGAN, achieving the best performance on key metrics: a Fréchet Inception Distance (FID) of 15.21, a Mean Squared Error (MSE) of 0.002025, and a Structural Similarity Index (SSIM) of 0.89 in the Kaggle dataset. Additionally, expert evaluations revealed that only 56.66% of synthetic images could be distinguished from real ones, demonstrating the high fidelity and clinical relevance of the generated data. These results highlight the effectiveness of our approach in improving medical image classification by generating realistic and diverse synthetic datasets.
1 Computer Science Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro No. 1, Sta. María Tonantzintla, Puebla 72840, Mexico;
2 Optics Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro No. 1, Sta. María Tonantzintla, Puebla 72840, Mexico;