Abstract
Text-to-image generation is a challenging task. Although diffusion models can generate high-quality images of complex scenes, they sometimes lack realism, show large diversity among images generated from different texts with the same semantics, and render insufficient detail. Generative adversarial networks (GANs), in contrast, can generate realistic, content-consistent images that agree with their text descriptions. In this paper, we argue that generating images that are more consistent with the text descriptions is more important than generating higher-quality images. Therefore, this paper proposes the pretrained model-based generative adversarial network (PMGAN). PMGAN utilizes multiple pre-trained models in both the generator and the discriminator. Specifically, in the generator, the deep attentional multimodal similarity model (DAMSM) text encoder extracts word and sentence embeddings from the input text, and the contrastive language-image pre-training (CLIP) text encoder extracts initial image features from the input text. In the discriminator, a pre-trained CLIP image encoder extracts image features from the input image. Because the CLIP encoders map text and images into a common semantic space, they help the network generate high-quality images. Experimental results show that, compared with state-of-the-art methods, PMGAN achieves better inception score and Fréchet inception distance and produces higher-quality images while maintaining greater consistency with the text descriptions.
Introduction
In the field of deep learning, multimodal tasks have always been highly regarded. Among them, one of the most popular applications is text-to-image generation. The goal of this task is to generate real, natural, and diverse images that are semantically consistent with the given natural language descriptions. The convenience of natural language as a modality of information has attracted a large number of researchers to delve into text-to-image generation, making it a thriving research field today.
Recently, autoregressive models and diffusion models have been introduced as powerful approaches for generating high-quality images of complex scenes. These models leverage large-scale parameter models, extensive training with massive datasets, multiple pretraining models, and iterative generation processes. Notably, models like DALL-E [15] and LDM [18] have demonstrated the ability to produce impressive results. However, these methods still face certain challenges, such as limitations in generating fine object details.
GAN-based models have achieved good results in various fields of image generation, such as image translation [9, 31]. For the text-to-image task, unlike autoregressive models and diffusion models, generative adversarial networks (GANs) have not delivered satisfactory results in generating images of complex scenes, although they have shown remarkable success in generating detailed depictions of objects. Current GAN-based methods, such as StackGAN++ [33], employ multiple sets of generators and discriminators to improve the resolution of generated images. Although this technique enhances image quality, the increased complexity introduced by multiple sets of generators and discriminators exacerbates the instability of GAN training. GANs can generate realistic images but often lack diversity; when generating scenes with multiple objects, the objects may be poorly formed and the quality of the generated images may be low.
In comparison with autoregressive models [1, 2, 15] and diffusion models [11, 16, 18], GANs offer the advantages of faster image generation, decoupling and separation of distinct features, and a larger range in their latent space. Nevertheless, it is widely acknowledged that the training process of GANs is prone to instability. This shortcoming has also led to lower image quality in current text-to-image generation methods that rely on GANs, particularly in complex scenes with multiple objects.
This paper proposes a novel method for generating images from text based on the principles of generative adversarial networks (GANs). The proposed method, referred to as the pretrained model-based generative adversarial network (PMGAN) model, is designed to capitalize on the power of pre-trained models. The network architecture of the PMGAN model, depicted in Fig. 1, comprises multiple upsampling fusion modules, a DAMSM [29] text encoder, a CLIP [13] text encoder, a CLIP image encoder, a conditional discriminator, and an unconditional discriminator. Notably, pre-trained models are employed in both the generator and discriminator to maximize their potential.
Given that the CLIP text encoder and image encoder can map text features and image features into the same latent space, the PMGAN model leverages the pre-trained CLIP text encoder in the generator to extract initial image features from the text. In the discriminator, the pre-trained CLIP image encoder is used to extract image features from the images, which serve as inputs for subsequent unconditional and conditional discriminators. To obtain coarse-grained and fine-grained features from the text, the PMGAN model employs the pre-trained DAMSM text encoder to extract sentence embeddings and word embeddings, which are used as conditional inputs to constrain and guide the image generation process. In order to maximize the utility of sentence embeddings and word embeddings, each upsampling fusion module contains a deep fusion module for global guidance and an attention fusion module for detail enhancement. These modules are thoughtfully integrated to ensure that the model can take full advantage of the rich semantic information contained in the text input.
Overall, the contributions of this paper can be summarized as follows:
This paper proposes a method based on generative adversarial networks combined with pre-trained models for text-to-image generation tasks. It uses pre-trained encoder models in both the generator and discriminator to extract features from text or images and can generate realistic, diverse, and high-quality images.
This paper proposes a discriminator based on a pre-trained CLIP image encoder. It uses the pre-trained CLIP image encoder to extract features and introduces a penalty gradient loss. This improves the stability of adversarial training.
This paper proposes a generator based on a pre-trained text encoder. It uses a pre-trained CLIP text encoder to extract initial image features from the input text and a pre-trained DAMSM text encoder to obtain sentence and word embeddings from the text. This provides the generator with more image information as well as coarse and fine-grained information from the text.
This paper proposes an upsampling fusion module for feature fusion. Each upsampling fusion module includes a deep fusion module and an attention fusion module. It uses sentence and word embeddings as conditions to constrain and guide the image generation process. This fully utilizes the information from the text and generates more image details.
Related work
GANs-based Text-to-Image Generation Reed et al. [17] introduced GAN-INT-CLS, the first text-to-image synthesis method based on GANs. However, this method could only produce low-resolution images of 64 × 64 pixels. To overcome this limitation, several methods such as StackGAN++ [33], AttnGAN [29] and CF-GAN [35] employed multiple generators and discriminators to refine high-resolution images from coarse ones. Although this enhanced the image resolution, it also increased the model complexity and the instability of training generative adversarial networks. Instead of using a stacked structure, Swin-GAN [28] and DF-GAN [23] increase the depth of the network model to enhance the image resolution progressively.
This paper chooses a single pair of generator and discriminator consistent with DF-GAN as the overall network framework. This simplifies the complexity of the network and is conducive to the stability of adversarial training.
Pretrained Models for Text-to-Image Generation AttnGAN [29] pre-trains a pair of text and image encoders with the deep attentional multimodal similarity model (DAMSM) to extract text features and compute the similarity between images and texts. XMC-GAN [34] employs BERT pre-trained on a pure text dataset as a text encoder to obtain word embeddings and sentence embeddings from text. LAFITE [36] integrates the pre-trained CLIP text encoder and image encoder to enable training without paired image-text data, requiring only real images for training a deep learning model for text-to-image synthesis. The text encoder and the image encoder of CLIP are pre-trained on a dataset containing 400 million pairs of images and texts. GigaGAN [6] extracts text embeddings from the CLIP text encoder and further employs a learnable text encoder to obtain local and global text embeddings. GALIP [24] designs a generator and discriminator based on CLIP, fully utilizing CLIP's ability to understand complex scenes and its domain generalization ability. GLIDE [11] applies CLIP-guided diffusion for text-conditional image synthesis, while DALL-E2 [16] integrates the CLIP representation with a diffusion model as a CLIP decoder. Imagen [20] has explored large-scale pre-trained text encoders, such as CLIP, BERT [7], and T5 [14], showing that T5-XXL can generate higher-quality text features.
Pre-trained models facilitate faster convergence, and CLIP maps text and images into the same latent space. Therefore, this paper employs multiple pre-trained encoders to extract text or image features, providing more information for the generation process and enhancing the stability of adversarial training. Unlike LAFITE, GigaGAN, and GALIP, which also build on the CLIP pre-trained model, this paper uses the CLIP text encoder as an initial image feature extractor: it extracts initial image features directly from the input text, providing the generator with more image information.
Large Models in Text-to-Image Generation In recent advancements in text-to-image synthesis, large pretrained autoregressive and diffusion models have demonstrated remarkable results. Models like DALL-E [15] and CogView [2] use VQ-VAE [25] or VQGAN [4] to convert images into discrete tokens, which are then combined with word tokens for pre-training large unidirectional transformers, facilitating autoregressive generation. Parti [30] approaches text-to-image synthesis as a translation problem, using a sequence-to-sequence autoregressive model. CogView2 [3] enhances this process with hierarchical transformers and local parallel autoregressive generation for quicker image creation. To address the slow generation issue of autoregressive models, some researchers have explored diffusion models. VQ-Diffusion [5] merges VQ-VAE [25] with a diffusion model to reduce unidirectional bias and prevent accumulated prediction errors. Latent Diffusion Models [18] work in the latent space, enabling training with limited computational resources without sacrificing image quality. Imagen [20] proposes an Efficient UNet for diffusion models, aiming at photorealistic results.
Compared with generative adversarial networks, diffusion models are less effective at generating details: it is difficult for them to keep details consistent with the text description, and their parameter counts and generation times are often larger. To address these problems, this paper proposes PMGAN. By incorporating pre-trained models, the method can generate realistic, diverse, and high-quality images that are particularly strong in detail.
Method
This paper proposes a text-to-image generation method based on generative adversarial networks combined with pre-trained models, called the PMGAN model (pretrained model-based generative adversarial network). Figure 1 shows the network structure of PMGAN, which consists of multiple upsampling fusion modules (UFM), a DAMSM [29] text encoder, a CLIP [13] text encoder, a CLIP image encoder, a conditional discriminator, and an unconditional discriminator.
Fig. 1 [Images not available. See PDF.]
The network architecture of the PMGAN model proposed in this paper. It consists of multiple upsampling fusion modules (UFM), a DAMSM text encoder, a CLIP text encoder, a CLIP image encoder, a conditional discriminator, and an unconditional discriminator
The generator consists of several upsampling fusion modules with identical structures that are connected in a cascade. Each module comprises a deep fusion module and an attention fusion module. PMGAN uses the CLIP [13] text encoder as the initial image feature extractor to obtain primary image features from the input natural language text, which serve as the generator’s initial input. The DAMSM pre-trained text encoder extracts sentence embeddings and word embeddings from the input natural language, which serve as the generator’s conditional input. The CLIP image encoder extracts image features from images and feeds them to the unconditional discriminator and the conditional discriminator, which judge the images’ realism and semantic consistency with the input natural language, respectively.
The CLIP text encoder, the CLIP image encoder, and the DAMSM text encoder are pre-trained models. The DAMSM text encoder is pre-trained on the training set, while CLIP is pre-trained on a dataset of 400 million image-text pairs collected from various public sources on the Internet.
The PMGAN model extracts the initial image features from the text using the CLIP text encoder, and concatenates them with Gaussian noise. Then, it transforms the dimension of the concatenated features through a fully connected layer, which serves as the input for the subsequent upsampling fusion modules. In each upsampling fusion module, the image features from the previous level are guided and fused by the sentence embedding and word embedding, resulting in new image features. The final image features output by the last upsampling fusion module are passed through a convolutional layer to generate the final image.
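The overall pipeline can be summarized in the following minimal PyTorch-style sketch. The dimensions, the number of modules, and the placeholder UpsampleFusionModule are illustrative assumptions rather than the exact PMGAN configuration; the detailed module designs are described in the following subsections.

```python
# Minimal sketch of the generator pipeline: CLIP text feature + noise -> fully
# connected layer -> cascade of upsampling fusion modules -> final convolution.
# All sizes are illustrative assumptions; UpsampleFusionModule is a placeholder.
import torch
import torch.nn as nn

class UpsampleFusionModule(nn.Module):
    """Placeholder UFM: upsample + convolution; the full text-conditioned version is sketched later."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x, sent_emb, word_emb):
        x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
        return self.conv(x)

class Generator(nn.Module):
    def __init__(self, clip_dim=512, noise_dim=100, base_ch=512, n_blocks=6):
        super().__init__()
        # Fully connected layer reshapes [CLIP text feature ; noise] into a 4x4 feature map
        self.fc = nn.Linear(clip_dim + noise_dim, base_ch * 4 * 4)
        self.base_ch = base_ch
        # Cascade of upsampling fusion modules, each conditioned on the text embeddings
        self.blocks = nn.ModuleList([UpsampleFusionModule(base_ch) for _ in range(n_blocks)])
        # Final convolution maps the last feature map to an RGB image
        self.to_rgb = nn.Sequential(nn.Conv2d(base_ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, clip_text_feat, noise, sent_emb, word_emb):
        h = self.fc(torch.cat([clip_text_feat, noise], dim=1))
        h = h.view(-1, self.base_ch, 4, 4)
        for block in self.blocks:
            h = block(h, sent_emb, word_emb)
        return self.to_rgb(h)
```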
Fig. 2 [Images not available. See PDF.]
Network architecture of upsampling fusion module. It comprises a deep fusion module and an attention fusion module. They are sequentially connected and integrated with the input image features via residual connection
Upsampling fusion module
The upsampling fusion module consists of a deep fusion module for global guidance and an attention fusion module for detail enhancement. The detailed network structure of the upsampling fusion module is shown in Fig. 2. The deep fusion module and the attention fusion module are connected in series and then fused with the input image features through a residual connection, obtaining the enhanced image features.
Fig. 3 [Images not available. See PDF.]
Network structure of the deep fusion module and the affine module. The deep fusion module contains two affine modules, two ReLU activation functions, and a convolutional layer. The affine module maps the sentence embedding to the scaling and shifting parameters γ and β that modulate the image features
Deep fusion module
The network structure of the deep fusion module is shown in the left diagram of Fig. 3. The deep fusion module is consistent with that of DF-GAN and contains two affine modules, two ReLU activation functions, and a convolutional layer. The affine module, shown in the right diagram of Fig. 3, guides the transformation of the image features by mapping the sentence embedding to a scaling parameter γ and a shifting parameter β. Therefore, the affine module can be expressed by the following formula:
$\mathrm{Aff}(x_i, s) = \gamma_i \odot x_i + \beta_i, \quad \gamma_i = \mathrm{MLP}_{\gamma}(s), \quad \beta_i = \mathrm{MLP}_{\beta}(s)$ (1)
where Aff denotes the affine module and MLP denotes a multilayer perceptron, a feedforward artificial neural network composed of nodes arranged in multiple layers; there is no structural difference between $\mathrm{MLP}_{\gamma}$ and $\mathrm{MLP}_{\beta}$. $x_i$ denotes the i-th image feature and $s$ denotes the sentence embedding. Furthermore, the formula of the deep fusion module can be obtained:
$x_{i+1} = \mathrm{Conv}\big(\sigma\big(\mathrm{Aff}\big(\sigma\big(\mathrm{Aff}(x_i, s)\big), s\big)\big)\big)$ (2)
where $\sigma$ denotes the ReLU activation function and $\mathrm{Conv}$ denotes the convolutional layer. This achieves guidance of the image generation process using the global text information (i.e., the sentence embedding).
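A minimal sketch of the affine module and the deep fusion block described above is given below; the MLP hidden size is an illustrative assumption.

```python
# Sketch of the affine module (Eq. 1) and the deep fusion block (Eq. 2);
# DF-GAN-style channel-wise modulation conditioned on the sentence embedding.
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Predicts channel-wise scale (gamma) and shift (beta) from the sentence embedding."""
    def __init__(self, sent_dim, ch, hidden=256):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(sent_dim, hidden), nn.ReLU(), nn.Linear(hidden, ch))
        self.beta = nn.Sequential(nn.Linear(sent_dim, hidden), nn.ReLU(), nn.Linear(hidden, ch))

    def forward(self, x, s):                       # x: (B, C, H, W), s: (B, sent_dim)
        gamma = self.gamma(s).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(s).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta

class DeepFusionBlock(nn.Module):
    """Two affine modulations, two ReLUs and a convolution, as in Eq. (2)."""
    def __init__(self, sent_dim, ch):
        super().__init__()
        self.aff1 = Affine(sent_dim, ch)
        self.aff2 = Affine(sent_dim, ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x, s):
        h = torch.relu(self.aff1(x, s))
        h = torch.relu(self.aff2(h, s))
        return self.conv(h)
```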
Attention fusion module
We designed the attention fusion module based on the Transformer encoder block and made some improvements. Figure 4 shows the network structure of the enhanced attention fusion module. We use the self-modulated layer normalization method from ViTGAN for normalization, as shown by the following formula:
$\mathrm{SLN}(h_i, s) = \gamma(s) \odot \dfrac{h_i - \mu}{\sigma} + \beta(s)$ (3)
where $\mu$ and $\sigma$ represent the mean and variance of the summed inputs within the layer, $\gamma(\cdot)$ and $\beta(\cdot)$ calculate the adaptive normalization parameters controlled by the latent vector derived from the condition, $\odot$ represents element-wise multiplication, $s$ represents the condition, and $h_i$ is the image hidden feature from the previous layer.
Fig. 4 [Images not available. See PDF.]
Network structure of attention fusion module. It is based on the Transformer encoder block. We apply cross-attention between image features and word features in multi-head attention, and use self-modulated normalization for images
In the attention fusion module, only the image features are normalized using self-modulated layer normalization, while the word embeddings still use the standard layer normalization. Specifically, let $s$ denote the sentence embedding, $w$ denote the word embedding, and $x_i$ denote the image features output from the previous layer. Then, in the attention fusion module, the normalization part can be expressed using the following formula:
$\bar{x}_i = \mathrm{SLN}(x_i, s), \qquad \bar{w} = \mathrm{LN}(w)$ (4)
Furthermore, in multi-head attention:
$Q = \bar{x}_i W_Q, \qquad K = \bar{w} W_K, \qquad V = \bar{w} W_V$ (5)
where $\bar{x}_i$ represents the image features after self-modulated layer normalization, $\bar{w}$ represents the normalized word embeddings, and $W_Q$, $W_K$ and $W_V$ represent learnable projection matrices.
To further improve computational efficiency and enhance the diversity of generated images, this paper adopts a dynamic mask in the multi-head attention module. Specifically, during attention calculation, the attention weights of low-attention parts are directly ignored, thereby further reducing the computational load. The mask is obtained by setting a threshold: attention weights above the threshold are retained, while those below it are set to zero. It can be expressed using the following formula:
$\mathrm{Mask}(A)_{jk} = \begin{cases} A_{jk}, & A_{jk} \ge \tau \\ 0, & A_{jk} < \tau \end{cases}$ (6)
where $A$ denotes the attention weight matrix and $\tau$ denotes the threshold.
Therefore, the calculation process of attention becomes
$\mathrm{Attention}(Q, K, V) = \mathrm{Mask}\!\left(\mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d}}\right)\right)V$ (7)
The other multi-head attention (MultiHeadAtn) components, residual connections, and feed-forward networks (FFN) are consistent with the Transformer [26] encoder block. Multi-head attention can learn information from different subspaces in parallel, improving the model's ability to understand and process different contextual features. Therefore, the attention fusion module can be expressed as:
$\hat{x}_i = \mathrm{MultiHeadAtn}(\bar{x}_i, \bar{w}, \bar{w}) + x_i, \qquad x_{i+1} = \mathrm{FFN}\big(\mathrm{SLN}(\hat{x}_i, s)\big) + \hat{x}_i$ (8)
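The following is a simplified sketch of the attention fusion module: self-modulated layer normalization of the image tokens, threshold-masked cross-attention from image tokens to word embeddings, and a feed-forward network with residual connections. Single-head attention, flattened spatial positions as tokens, and the threshold and dropout values are simplifying assumptions for illustration.

```python
# Simplified attention fusion module: SLN + threshold-masked cross-attention + FFN.
import torch
import torch.nn as nn

class SelfModulatedLN(nn.Module):
    """LayerNorm whose scale/shift are predicted from the sentence embedding (ViTGAN-style)."""
    def __init__(self, dim, sent_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.gamma = nn.Linear(sent_dim, dim)
        self.beta = nn.Linear(sent_dim, dim)

    def forward(self, x, s):                       # x: (B, N, D), s: (B, sent_dim)
        return self.gamma(s).unsqueeze(1) * self.norm(x) + self.beta(s).unsqueeze(1)

class AttentionFusion(nn.Module):
    def __init__(self, dim, sent_dim, word_dim, threshold=0.01):
        super().__init__()
        self.sln1 = SelfModulatedLN(dim, sent_dim)
        self.sln2 = SelfModulatedLN(dim, sent_dim)
        self.word_norm = nn.LayerNorm(word_dim)    # standard LN for word embeddings
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(word_dim, dim)
        self.v = nn.Linear(word_dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.drop = nn.Dropout(0.1)                # dropout on residual branches
        self.threshold = threshold

    def forward(self, x, s, w):                    # x: image tokens, w: word embeddings
        q = self.q(self.sln1(x, s))
        wn = self.word_norm(w)
        k, v = self.k(wn), self.v(wn)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        attn = attn * (attn >= self.threshold)     # dynamic mask: drop low-attention weights
        x = x + self.drop(attn @ v)                # residual connection
        x = x + self.drop(self.ffn(self.sln2(x, s)))
        return x
```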
As can be seen, residual connections are used multiple times in the upsampling fusion module. To enhance the diversity of the generated images and reduce the computational cost, this paper applies dropout operations in each residual connection.

Pre-trained encoder
This paper uses three pre-trained encoders, namely DAMSM text encoder, CLIP text encoder, and CLIP image encoder. The DAMSM text encoder obtains sentence embeddings and word embeddings from the input natural language text. The CLIP text encoder takes the input natural language text as a prompt and obtains initial image features. The CLIP image encoder extracts image features from the image, which are used as the input for the subsequent feature discriminator.
The text encoder of DAMSM is pre-trained on the training dataset without using any additional data. The text encoder is a bidirectional long short-term memory (LSTM) [21] that concatenates the two hidden states of each word to represent its semantic embedding. The semantic embeddings of all words form the word embedding matrix $e \in \mathbb{R}^{D \times T}$ of the entire input text, where D represents the dimension of the word embedding and T represents the number of words. The sentence embedding is formed by concatenating the two last hidden states of the bidirectional LSTM. The image encoder is an Inception-v3 [22] pre-trained on ImageNet [19], which extracts local image features of size 768 × 289 from an intermediate convolutional layer, where 768 is the dimension of the local features and 289 is the number of local subregions. Finally, the global features of the image are extracted through the last average pooling layer.
The text and image encoders of CLIP are pre-trained on a dataset containing 400 million image-text pairs using contrastive pre-training. N image-text pairs are passed through the text and image encoders to obtain text semantic features and image semantic features, respectively, and the pairwise cosine similarities between them are computed. The goal is to maximize the cosine similarity of matching image-text pairs and minimize the cosine similarity of non-matching pairs. The text encoder of CLIP is a Transformer, and the CLIP image encoder used in this paper is ViT-B/32.
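For reference, the following sketch shows how the CLIP features used in PMGAN can be extracted with the openai/CLIP package (https://github.com/openai/CLIP), assuming the ViT-B/32 checkpoint mentioned above; the caption and image path are placeholders.

```python
# Extracting CLIP text and image features with the ViT-B/32 checkpoint.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

tokens = clip.tokenize(["this bird has a yellow throat and brown streaks"]).to(device)
image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)   # placeholder image path

with torch.no_grad():
    text_feat = model.encode_text(tokens)    # used as initial image features in the generator
    image_feat = model.encode_image(image)   # used as input to the feature discriminator
```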
Feature discriminator
The feature discriminator is shown in Fig. 5. This paper includes two types of discriminators: conditional and unconditional. The conditional discriminator mainly determines whether the natural language description and image are consistent, while the unconditional discriminator mainly determines whether the given image conforms to the distribution of real samples.
Fig. 5 [Images not available. See PDF.]
Feature discriminator network structure. The left one is the conditional discriminator, and the right one is the unconditional discriminator. Both contain two linear layers, a LeakyReLU, and a Sigmoid
Both discriminators have the same structure, but they differ in the dimensions of the features they process. The image features are extracted by the CLIP image encoder, and the text features are encoded by the DAMSM text encoder. The formulas for the conditional and unconditional probabilities are:
$p_{\mathrm{cond}} = \sigma\big(D_c([f_{\mathrm{img}}, f_{\mathrm{txt}}])\big), \qquad p_{\mathrm{uncond}} = \sigma\big(D_u(f_{\mathrm{img}})\big)$ (9)
where $f_{\mathrm{img}}$ denotes the image features extracted by the CLIP image encoder, $f_{\mathrm{txt}}$ denotes the text features encoded by the DAMSM text encoder, $[\cdot,\cdot]$ denotes concatenation, $D_c$ and $D_u$ denote the conditional and unconditional discriminators, and $\sigma$ denotes the Sigmoid activation function. Based on the output of the discriminator, we design a loss function to enhance the discriminative power of the feature discriminator and provide feedback to the generator to improve its image generation quality.
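A minimal sketch of the two discriminator heads described above follows; each consists of two linear layers with a LeakyReLU in between and a Sigmoid output. The hidden size and the CLIP/DAMSM feature dimensions are illustrative assumptions.

```python
# Conditional and unconditional feature discriminator heads.
import torch
import torch.nn as nn

class ConditionalD(nn.Module):
    """Judges whether an image feature matches a text feature."""
    def __init__(self, img_dim=512, txt_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_img, f_txt):
        return self.net(torch.cat([f_img, f_txt], dim=1))

class UnconditionalD(nn.Module):
    """Judges whether an image feature comes from a real image."""
    def __init__(self, img_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_img):
        return self.net(f_img)
```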
Loss function
For the feature discriminator, the conditional loss in this paper includes real image-matching text, generated image-matching text, and real image-non-matching text. The unconditional loss determines whether the image is real. The conditional and unconditional losses can be expressed by formulas as follows:
$L_{D}^{\mathrm{cond}} = -\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_c(x, s)\big] - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_c(G(z, s), s)\big)\big] - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log\big(1 - D_c(x, \hat{s})\big)\big]$
$L_{D}^{\mathrm{uncond}} = -\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_u(x)\big] - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_u(G(z, s))\big)\big]$ (10)
where $s$ represents the sentence embedding, $\hat{s}$ represents a non-matching sentence embedding, $G$ denotes the generator, $D_c$ represents the conditional discriminator, and $D_u$ represents the unconditional discriminator. Following the One-Way Output in DF-GAN, the conditional and unconditional losses in this paper are trained alternately.

During the experiments, it was found that the loss of the feature discriminator fluctuated significantly and often reached a balance with the generator in a locally optimal state during training, so that the generator and feature discriminator could not further improve their generation and discrimination capabilities. Therefore, this paper also introduces a gradient penalty loss for the feature discriminator. The gradient penalty is a regularization technique that penalizes large gradients of the feature discriminator, making it smoother with respect to changes in the input data. This prevents the generator from producing excessively sharp or discontinuous data in certain regions, enhances the stability of adversarial training, and promotes the authenticity and diversity of the generated images.
Specifically, this paper mainly applies gradient penalty between real images and matching natural language descriptions, which can be expressed by the following formula:
$L_{GP} = k\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\Big[\big(\big\|\nabla_{x} D_c(x, s)\big\| + \big\|\nabla_{s} D_c(x, s)\big\|\big)^{p}\Big]$ (11)
where $k$ and $p$ are hyperparameters controlling the strength of the penalty.
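The sketch below illustrates the conditional discriminator loss together with the gradient penalty, under the loss formulation reconstructed above. `cond_d` is a conditional discriminator head as sketched earlier, and the penalty weight and exponent are illustrative assumptions.

```python
# Conditional discriminator loss with gradient penalty on (real feature, matching text) pairs.
import torch

def d_cond_loss_with_gp(cond_d, real_feat, fake_feat, sent_emb, mis_sent_emb,
                        gp_weight=2.0, gp_power=6):
    eps = 1e-8
    # Real image + matching text, fake image + matching text, real image + mismatched text
    loss = (-torch.log(cond_d(real_feat, sent_emb) + eps).mean()
            - torch.log(1 - cond_d(fake_feat.detach(), sent_emb) + eps).mean()
            - torch.log(1 - cond_d(real_feat, mis_sent_emb) + eps).mean())

    # Gradient penalty on real image features with their matching sentence embeddings
    f = real_feat.detach().requires_grad_(True)
    s = sent_emb.detach().requires_grad_(True)
    out = cond_d(f, s)
    grads = torch.autograd.grad(out.sum(), [f, s], create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    gp = gp_weight * (grad_norm ** gp_power).mean()
    return loss + gp
```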
In summary, the final loss of the feature discriminator can be obtained as follows:
$L_{D} = \alpha L_{D}^{\mathrm{cond}} + (1 - \alpha) L_{D}^{\mathrm{uncond}} + L_{GP}$ (12)
where $\alpha$ is used to control whether the conditional or the unconditional loss is used during alternating training.

For the generator, the loss function is also divided into two parts, a conditional loss and an unconditional loss, which can be expressed by the following formula:
$L_{G}^{\mathrm{cond}} = -\mathbb{E}_{z \sim p_z}\big[\log D_c(G(z, s), s)\big], \qquad L_{G}^{\mathrm{uncond}} = -\mathbb{E}_{z \sim p_z}\big[\log D_u(G(z, s))\big]$ (13)
In addition, to further enhance the semantic consistency between the generated image and the input natural language, this paper also introduces a contrastive loss for the generator. Specifically, the CLIP text encoder obtains the initial image feature with the natural language description as a prompt, and the CLIP image encoder obtains the generated image feature with the generated image as input. The cosine similarities between these two sets of features are calculated, resulting in a total of $N \times N$ cosine similarities between the initial image features and the generated image features, of which N pairs are matching and the remaining $N^2 - N$ pairs are non-matching. The contrastive loss maximizes the cosine similarity of the N matching pairs and minimizes the cosine similarity of the non-matching pairs. It can be expressed by the following formula:
$L_{\mathrm{con}} = -\dfrac{1}{N}\sum_{i=1}^{N} \log \dfrac{\exp\big(\mathrm{sim}(f_i, g_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(f_i, g_j)/\tau\big)}$ (14)
where $f_i$ denotes the initial image feature of the i-th text, $g_j$ denotes the feature of the j-th generated image, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is a temperature parameter.
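A compact sketch of this contrastive term is given below: for a batch of N samples, the N matching (initial feature, generated feature) pairs are pulled together and the remaining pairs pushed apart. The temperature value is an illustrative assumption.

```python
# Generator-side contrastive loss between CLIP-text-derived initial features
# and CLIP-image features of the generated images (Eq. 14).
import torch
import torch.nn.functional as F

def generator_contrastive_loss(init_feats, gen_feats, temperature=0.07):
    a = F.normalize(init_feats, dim=-1)          # from the CLIP text encoder, shape (N, D)
    b = F.normalize(gen_feats, dim=-1)           # from the CLIP image encoder, shape (N, D)
    logits = a @ b.t() / temperature             # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)      # diagonal entries are the matching pairs
```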
Therefore, the final loss function of the generator is:
$L_{G} = \alpha L_{G}^{\mathrm{cond}} + (1 - \alpha) L_{G}^{\mathrm{uncond}} + L_{\mathrm{con}}$ (15)
where $\alpha$ is used to control whether the conditional or the unconditional loss is used during alternating training.

Experiment
Dataset Following previous GAN-based text-to-image generation methods, this paper evaluates our model on two challenging datasets, CUB [27] and COCO [10]. The CUB dataset used in this paper is CUB-200-2011, which contains 8,855 training images and 2,933 test images; each bird image has 10 corresponding textual descriptions. The COCO version used in this paper was released in 2014 and contains 82,783 training images, 40,504 validation images, and 40,775 test images; each image has 5 corresponding text descriptions.
Implementation Details During training, both the feature discriminator and the generator used the AdamW optimizer with weight decay, with the weight decay set to 0.001. The learning rate was set to 0.0003 for the feature discriminator and 0.0002 for the generator.
Evaluation Metrics In this paper, we primarily use two evaluation metrics, inception score (IS) and Fréchet inception distance (FID), to measure the quality of the generated images. IS evaluates the generative model on two aspects, the diversity and realism of the images; a higher IS indicates that the generated images are more diverse and of higher quality. FID measures the distance between the distributions of real and generated images based on features extracted from a pre-trained network; a lower FID indicates that the distribution of generated images is closer to that of real images.
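For reference, the FID described above can be computed from Inception features of the real and generated images as in the following sketch; this is the standard formula rather than PMGAN-specific code.

```python
# FID = ||mu_r - mu_g||^2 + Tr(Cov_r + Cov_g - 2 (Cov_r Cov_g)^(1/2)),
# given (N, D) arrays of Inception features for real and generated images.
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts from sqrtm
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))
```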
Quantitative analysis
We conducted quantitative evaluations on the CUB and COCO datasets, computed FID and IS scores, and compared them with the current state-of-the-art text-to-image generation methods based on generative adversarial networks, including StackGAN [32], StackGAN++ [33], AttnGAN [29], MirrorGAN [12], DM-GAN [37], XMC-GAN [34], DF-GAN [23], Memory-Driven [8] and LAFITE [36]. Table 1 lists the FID and IS scores of these methods on the CUB and COCO datasets.
As shown in Table 1, the number of parameters (NoP) of PMGAN is 60 M. Because the attention fusion module adds multi-head attention, the parameter count increases, but it remains moderate compared with the other methods. Moreover, the proposed method achieved the best scores on both the CUB and COCO datasets. On the CUB dataset, it improved the IS by 0.39 and reduced the FID by 0.25 compared to the previously best method, LAFITE. On the COCO dataset, it improved the IS by 2.59 and reduced the FID by 0.23 compared to LAFITE. This indicates that the proposed PMGAN achieves the best results on both the FID and IS evaluation metrics.
Table 1. IS, FID and NoP results on the CUB and COCO datasets
| Model | CUB IS | CUB FID | COCO IS | COCO FID | NoP |
|---|---|---|---|---|---|
| StackGAN [32] | 3.7 | 51.89 | 8.45 | 74.05 | – |
| StackGAN++ [33] | 4.04 | 15.3 | 8.3 | 81.59 | – |
| AttnGAN [29] | 4.36 | 23.98 | 25.89 | 35.49 | 230 M |
| MirrorGAN [12] | 4.56 | – | 26.47 | – | – |
| DM-GAN [37] | 4.75 | 16.09 | 30.49 | 32.64 | 46 M |
| XMC-GAN [34] | – | – | 30.45 | 9.33 | 166 M |
| DF-GAN [23] | 5.1 | 14.81 | 19.32 | – | 19 M |
| Memory-Driven [8] | – | 10.49 | – | 19.47 | – |
| LAFITE [36] | 5.97 | 10.48 | 32.34 | 8.12 | 75 M |
| PMGAN (ours) | **6.36** | **10.23** | **34.93** | **7.89** | 60 M |
The best results of different compared methods are shown in bold
Qualitative analysis
This paper visualizes the results on the CUB and COCO datasets and compares them with other methods. In this section, two of the most recent methods are selected for comparison, namely LAFITE [36], which has the best quantitative evaluation so far, and DF-GAN [23], which can generate images of the highest quality. Figure 6 shows the images generated by different methods. The left three columns are images generated on the CUB dataset, and the right three columns are images generated on the COCO dataset.
Fig. 6 [Images not available. See PDF.]
Visualization results of LAFITE, DF-GAN and our PMGAN on the CUB [27] and COCO [10] datasets
As shown in Fig. 6, our method can generate images of higher quality than the other methods. For example, in the second column, LAFITE does not capture the detail of the "white belly" of the bird, and DF-GAN also does not show this detail very clearly, while our method clearly demonstrates this detail. For the image in the sixth column, LAFITE only shows a chair in the generated image, DF-GAN only shows a snowy scene with a vague chair, while our method generates an image that clearly shows a chair on the snow.
This paper also compares the generated images with the diffusion model. In this section, we chose the LDM method [18] for visualization comparison.
Fig. 7 [Images not available. See PDF.]
Visualization Comparison between our proposed PMGAN and LDM
As can be seen from the first column of Fig. 7, LDM generated high-quality birds with red heads. However, it generated white chests and abdomens and black wings with a white stripe. In contrast, PMGAN generated red heads, chests, and abdomens with pure black feathers, which is highly consistent with the text description. In the second column, although LDM generated a bird with a long and sharp beak, it had an inharmonious combination of black and yellow feathers. In comparison, PMGAN generated a yellow bird with black wings and a long and sharp beak. Overall, the color distribution of the bird is more natural and realistic. In the third column, LDM generated an orange-yellow bird while PMGAN generated a bird with a clear yellow-brown body and brown crest, with distinct and natural color separation. In the fourth column, LDM generated a bird with obvious yellow stripes on its wings and a distinct black neck, which is inconsistent with the text description. In contrast, PMGAN accurately generated all the details in the text, such as a yellow bird with a black tail, gray wings, and a black beak.
It can be seen that although the diffusion model can understand the objects contained in the text and generate corresponding objects, it is not as good as PMGAN in generating details and may produce inharmonious color distribution.
Ablation study
To verify the effectiveness of the modules proposed in this paper, this section adds them to the baseline one by one and reports the FID and IS results to demonstrate the effectiveness of each module. The modules include the upsampling fusion module, the feature discriminator, and the CLIP text encoder. Multiple upsampling fusion modules are connected in series to form the generator, the feature discriminator judges whether the image is real or generated, and the CLIP text encoder extracts primary image features from the text. Table 2 shows the results of the ablation study; all experiments in this section are conducted on the CUB dataset.
In Table 2, the baseline refers to the DF-GAN model. As the data in Table 2 show, using the generator composed of upsampling fusion modules greatly improves image generation, reducing the FID from 12.10 to 11.49 and increasing the IS from 5.10 to 5.92. The feature discriminator and the CLIP text encoder bring further improvements when added.
Table 2. Ablation study of the module effectiveness on the CUB dataset
Model | FID | IS |
|---|---|---|
Baseline | 12.10 | 5.10 |
Baseline+Upsample Fusion | 11.49 | 5.92 |
Baseline+Upsample Fusion+Feature Discriminator | 10.92 | 6.13 |
Baseline+Upsample Fusion+Feature Discriminator+CLIP Text Encoder | 10.23 | 6.36 |
Baseline In this section, DF-GAN is used as the baseline; it employs a single-stage structure and the One-Way Output adversarial loss. The FID value of 12.10 is the evaluation result obtained with the latest released DF-GAN model.
Fig. 8 [Images not available. See PDF.]
Visualization of the diversity of images generated based on the text: “This bird has a yellow throat, belly, abdomen and sides with lots of brown streaks on them”
Fig. 9 [Images not available. See PDF.]
The impact of gradient penalty on adversarial training
Effect of Upsample Fusion Module The proposed upsampling fusion module reduces the FID to 11.49 and increases the IS to 5.92. The results indicate that the upsampling fusion module improves the diversity and quality of the generated images. As can be seen from Fig. 8, the generated birds exhibit diversity in both posture and body shape. Some birds are looking up toward the sky, while others are looking down. Some birds have plump bodies, while others are slender. In addition, the image backgrounds also fully reflect diversity.
Effect of Feature Discriminator The proposed feature discriminator reduces the FID to 10.92 and increases the IS to 6.13. The results indicate that the feature discriminator can more effectively constrain the generator to produce high-quality images and improve the stability of training for generative adversarial networks. Figure 9 shows the changes in the feature discriminator loss. As can be seen from Fig. 9, after adding gradient penalty, the fluctuations in the loss have decreased to some extent and the convergence speed has also become faster.
Effect of CLIP Text Encoder The proposed CLIP text encoder reduces the FID to 10.23 and increases the IS to 6.36. The results indicate that the CLIP text encoder provides more image information to the generator, enabling it to generate more realistic and high-quality images.
Conclusion
This paper proposes a method for text-to-image generation named the PMGAN model, which is based on a generative adversarial network model and incorporates a pre-trained model. We propose an upsampling fusion module that utilizes both word embedding and sentence embedding from the text to constrain and guide the generator to produce more realistic and high-quality images. We also introduce a feature discriminator based on the CLIP image encoder, which enhances the training stability and assists the generator to generate higher-quality and semantically consistent images. Moreover, our model leverages the CLIP text encoder to extract the initial image features, which enables the generator to generate higher-quality images more efficiently. Extensive experiments show that our PMGAN outperforms the current state-of-the-art models on CUB and COCO datasets. Compared to diffusion models such as LDM, PMGAN has better performance in generating details.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [Grant Numbers 61807002].
Funding
This work was supported by the National Natural Science Foundation of China [Grant Numbers 61807002].
Availability of data and materials
The datasets generated during and analyzed during the current study are available in the CUB-200-2011 repository, http://www.vision.caltech.edu/datasets/cub_200_2011/, and the COCO 2014 repository, https://cocodataset.org/#download.
Declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Chang, H., Zhang, H., Barber, J., et al.: Muse: Text-to-image generation via masked generative transformers (2023). arXiv:2301.00704
2. Ding, M; Yang, Z; Hong, W et al. Cogview: mastering text-to-image generation via transformers. Adv. Neural. Inf. Process. Syst.; 2021; 34, pp. 19822-19835.
3. Ding, M; Zheng, W; Hong, W et al. Cogview2: faster and better text-to-image generation via hierarchical transformers. Adv. Neural. Inf. Process. Syst.; 2022; 35, pp. 16890-16902.
4. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883 (2021)
5. Gu, S., Chen, D., Bao, J., et al.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10696–10706 (2022)
6. Kang, M., Zhu, J.Y,. Zhang, R., et al.: Scaling up gans for text-to-image synthesis. arXiv:2303.05511 (2023)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
8. Li, B., Torr, P.H., Lukasiewicz, T.: Memory-driven text-to-image generation. arXiv:2208.07022 (2022)
9. Li, X; Du, Z; Huang, Y et al. A deep translation (gan) based change detection network for optical and sar remote sensing images. ISPRS J. Photogramm. Remote. Sens.; 2021; 179, pp. 14-34. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2021.07.007]
10. Lin, T.Y., Maire, M., Belongie, S., et al.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (ECCV), Springer, pp. 740–755 (2014)
11. Nichol, A., Dhariwal, P., Ramesh, A., et al.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741 (2021)
12. Qiao, T., Zhang, J., Xu, D., et al.: Mirrorgan: Learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
13. Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, PMLR, pp. 8748–8763 (2021)
14. Raffel, C; Shazeer, N; Roberts, A et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.; 2020; 21,
15. Ramesh, A., Pavlov, M., Goh, G., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, PMLR, pp. 8821–8831 (2021)
16. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125 1(2) (2022)
17. Reed, S., Akata, Z., Yan, X., et al.: Generative adversarial text to image synthesis. In: International conference on machine learning, PMLR, pp. 1060–1069 (2016)
18. Rombach, R., Blattmann, A., Lorenz, D., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
19. Russakovsky, O; Deng, J; Su, H et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis.; 2015; 115, pp. 211-252.
20. Saharia, C; Chan, W; Saxena, S et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural. Inf. Process. Syst.; 2022; 35, pp. 36479-36494.
21. Schuster, M; Paliwal, KK. Bidirectional recurrent neural networks. IEEE Trans. Signal Process.; 1997; 45,
22. Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
23. Tao, M., Tang, H., Wu, F., et al.: Df-gan: A simple and effective baseline for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16515–16525 (2022)
24. Tao, M., Bao, B.K., Tang, H., et al.: Galip: Generative adversarial clips for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14214–14223 (2023)
25. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6309–6318 (2017)
26. Vaswani, A; Shazeer, N; Parmar, N et al. Attention is all you need. Adv. Neural. Inf. Process. Syst.; 2017; 30, pp. 5998-6008.
27. Wah, C., Branson, S., Welinder, P., et al.: The caltech-ucsd birds-200-2011 dataset. California Institute of Technology (2011)
28. Wang, S; Gao, Z; Liu, D. Swin-gan: generative adversarial network based on shifted windows transformer architecture for image generation. Vis. Comput.; 2023; 39,
29. Xu, T., Zhang, P., Huang, Q., et al.: Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Rrecognition, pp. 1316–1324 (2018)
30. Yu, J., Xu, Y., Koh, J.Y., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789 2(3) (2022)
31. Yuan, L; Chen, D; Hu, H. Unsupervised object-level image-to-image translation using positional attention bi-flow generative network. IEEE Access; 2019; 7, pp. 30637-30647. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2903543]
32. Zhang, H., Xu, T., Li, H., et al.: Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)
33. Zhang, H; Xu, T; Li, H et al. Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2018; 41,
34. Zhang, H., Koh, J.Y., Baldridge, J., et al.: Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 833–842 (2021)
35. Zhang, Y; Han, S; Zhang, Z et al. Cf-gan: cross-domain feature fusion generative adversarial network for text-to-image synthesis. Vis. Comput.; 2023; 39,
36. Zhou, Y., Zhang, R., Chen, C., et al.: Towards language-free training for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17907–17917 (2022)
37. Zhu, M., Pan, P., Chen, W., et al.: Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)