Abstract
Existing image-to-image (I2I) translation methods achieve state-of-the-art performance by incorporating patch-wise contrastive learning into generative adversarial networks. However, patch-wise contrastive learning focuses only on local content similarity and neglects global structure constraints, which degrades the quality of the generated images. In this paper, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. To maintain consistency of the global structure and texture, we design a dual contrastive regularization that operates in two different deep feature spaces. To improve the global structure of the generated images, we formulate a semantic contrastive loss that encourages the global semantic structure of the generated images to resemble that of the real images of the target domain in a semantic feature space. Similarly, we design a style contrastive loss to improve the global texture of the generated images, using Gram matrices to represent the texture style of an image. Moreover, to enhance training stability, we employ spectrally normalized convolutional layers in the design of our generator. We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance on multiple tasks. The code and pretrained models are available at
Introduction
Image-to-image (I2I) translation tasks aim to map an input image from the source domain to the target domain while retaining its original content and structure. In many I2I tasks, it is impossible to collect paired training data, and training with the adversarial loss [10] alone leads to the collapse of content and structure. To alleviate this problem, many representative works use a cycle consistency loss [31], which enforces consistency between the input images and the images reconstructed by an inverse mapping of the generated images. Motivated by the vision transformer (ViT), a recent cycle consistency-based work [30] explored a ViT-based generator for unpaired image-to-image translation. However, cycle consistency requires the mapping between the two domains to be a bijection, which is too restrictive [23].
Fig. 1 [Images not available. See PDF.]
Visual comparison with all baselines on the Van Gogh→Photo dataset. The global structure of the images generated by previous methods is clearly corrupted, whereas our SN-DCR preserves the global structure and generates photos with more natural details, performing better in terms of both global structure and texture. Note that CUT and CycleGAN fail to generate a valid output for the first input
Recently, inspired by the success of contrastive learning, CUT [23] proposed patch-wise contrastive learning to maximize the mutual information between the same locations of the input and generated images, introducing contrastive learning into I2I translation for the first time. DCLGAN [12] proposed a method based on patch-wise contrastive learning and a dual learning setting (using two generators and two discriminators). Although DCLGAN improves the quality of the generated images, an additional generator and discriminator need to be trained, which increases the training cost. F-LSeSim [42] utilized patch-wise contrastive learning by computing a learned self-similarity. However, it relies on VGG features to measure similarity, which reduces training efficiency. In previous studies, the query for contrastive learning was selected randomly from the generated images, which is a clear shortcoming since some locations contain little information from the source domain. Therefore, QS-Attn [16] designed a query-selected attention module that deliberately chooses significant anchors for patch-wise contrastive learning. Recently, a novel I2I method [28] based on style harmonization was proposed, which leverages two distinct styles: a class-aware memory style and an image-specific component style. Gou et al. [11] proposed multi-feature contrastive learning (MCL) to construct a patch-wise contrastive loss using the feature information of the discriminator output layer for I2I tasks.
Previous works retain the content consistency of the generated images via patch-wise contrastive learning without any additional regularization. However, patch-wise contrastive learning alone cannot effectively maintain the overall structure and texture of the images, as it focuses only on local content similarity and neglects the global structure constraint. This issue degrades the quality of the generated images.
In this paper, we propose a new I2I translation framework based on dual contrastive regularization and spectral normalization (SN-DCR). To achieve a global constraint of structure and texture in an unpaired manner, we formulate two new global contrastive loss functions to supplement the patch-wise contrastive loss, called dual contrastive regularization (DCR). DCR contains two parts: one is the semantic contrastive loss and the other is the style contrastive loss. Specifically, the semantic contrastive loss is proposed to improve the global structure information of the generated images, which encourages the generated images and the real images of the target domain (positives) to pull together in the semantic feature space while pushing the generated images away from the real images of the source domain (negatives). For example, the semantic structure information of the generated dog should be similar to the real dog in the semantic feature space. The style contrastive loss is developed to achieve the global texture consistency, in which the feature space capturing texture maps [9] is adopted to represent the style of texture.
Furthermore, it is well known that GAN training is unstable, with issues such as mode collapse and convergence difficulties. To alleviate these issues, we employ spectral normalization [21] in the design of our model, which enhances the stability of training. Moreover, to further boost the feature representation ability of our model, the Frequency Channel Attention Network (FCANet) [25] is introduced to improve translation performance. We demonstrate that our designed generator is effective through ablation experiments. In addition, to compute a stronger patch-wise contrastive loss, the QS-Attn module is adopted in this paper. We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance on multiple tasks.
In summary, the main contributions of our method are threefold:
We propose a novel unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. Our proposed SN-DCR focuses not only on local content similarity but also on the consistency of the global structure and style, enabling the generation of high-quality images with more natural structure and texture.
To improve the global information of the generated images, we design a dual contrastive regularization (DCR) that achieves consistency of the global structure and texture. Our proposed DCR can be viewed as a universal regularization for enhancing the quality of generated images.
Experimental results compared with SOTAs clearly demonstrate that our SN-DCR exhibits superiority over prior unsupervised I2I translation approaches. Furthermore, we conduct comprehensive ablation experiments to scrutinize each of our contributions and prove the effectiveness of each element.
Related work
Image-to-image translation
GANs [10, 19, 20] have achieved great success, especially in I2I translation, and the key idea is the adversarial loss [10]. I2I translation can be categorized into two groups: the paired setting [18, 22, 32] (supervised) and the unpaired setting (unsupervised) [1, 17, 31, 37]. In the paired setting, each image from the source domain has a corresponding target image, so the task can be addressed with a conditional GAN. To further improve the quality of the generated images, SPADE [22] introduces a spatially-adaptive normalization layer. However, paired training data are difficult to obtain; as a result, current methods [31] are usually based on the unpaired setting and are built on one assumption: cycle consistency. For example, CycleGAN [31], DualGAN [37] and MUNIT [17] train cross-domain GANs with a cycle consistency loss. CycleGAN learns two mappings simultaneously by translating an image to the target domain and back, preserving the fidelity between the input and the reconstructed image. However, the assumption of cycle consistency, namely that the two domains can be mapped onto each other bijectively, is too strict. To alleviate this issue, many methods have tried to relax or remove cycle consistency. DistanceGAN [1] proposes a distance constraint that allows unsupervised domain mapping to be one-sided. GC-GAN [8] enforces geometry consistency as a constraint for unsupervised domain mapping. CUT [23] introduces patch-wise contrastive learning into I2I translation, which significantly improves translation quality. However, patch-wise contrastive loss alone cannot effectively maintain the global structure and texture of the generated images.
Contrastive learning
Recent studies of self-supervised learning [5, 13, 29, 39] show its strong ability to represent images without labels, particularly with the help of contrastive losses [3, 6]. The idea is to perform instance-level discrimination and learn a feature embedding by pulling features from the same image together and pushing those from different images apart. Contrastive losses have also been utilized in several low-level vision tasks, yielding outstanding performance in applications such as style transfer [40], image generation [24], image smoothing [43], image super-resolution, and deraining [4, 35]. PatchNCE [23] proposes patch-based contrastive learning, which uses a noise-contrastive estimation framework to learn the correspondence between patches of the input image and the corresponding patches of the generated image. Excellent results are achieved, and recent methods [12, 16, 33, 41, 42] have also obtained strong performance by building on patch-wise contrastive learning. DCLGAN [12] proposes a method based on patch-wise contrastive learning and a dual learning setting (using two generators and two discriminators); although it improves the quality of the generated images, an additional generator and discriminator need to be trained, which increases the training cost. QS-Attn [16] designs a query-selected attention module that deliberately chooses significant anchors for patch-wise contrastive learning. MCL [11] constructs a patch-wise contrastive loss using feature information from the discriminator's output layer. In contrast to these methods, we mainly explore global regularization by introducing the idea of a global contrastive loss. Therefore, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization to improve the global information of the generated images, which consistently shows better results.
Fig. 2 [Images not available. See PDF.]
Overall framework of our proposed SN-DCR. A cat (the input image) is translated by the generator G into a dog (the generated image). We introduce a dual contrastive regularization that combines semantic and style contrastive losses to pull the generated image closer to the real images of the target domain (positives) while pushing it away from the real images of the source domain (negatives)
Fig. 3 [Images not available. See PDF.]
Our proposed spectral normalized generator, denoted as G, incorporates the use of InsNorm (IN) and a Frequency Channel Attention Network (FCA). Additionally, it utilizes our novel spectral normalized residual block (SN ResBlock), with nine such blocks present in the middle of the architecture. In line with the configuration of CUT, we extract features from five different layers to compute the multi-layer patch-wise contrastive loss. These layers include RGB pixels, the initial two downsampling convolutions, and the first and fifth residual blocks
Our method
Overall framework
Given an input image x from the source domain X, our goal is to translate it into G(x) in the target domain Y via the adversarial loss, so that it has no apparent difference from real images in domain Y. The framework of SN-DCR is shown in Fig. 2 and consists of a generator G and a discriminator D. We divide the generator into two parts: the first part is defined as the encoder E, and the second part as the decoder. SN-DCR includes three loss terms: the adversarial loss, the patch-wise contrastive loss, and the dual contrastive regularization. To compute a stronger patch-wise contrastive loss, we introduce the QS-Attn (global) module into our framework.
We expect the generated image G(x) to be as similar as possible to the real images from domain Y, so that the generator G learns a mapping from domain X to domain Y through the adversarial loss. The adversarial loss is defined as:
$$\mathcal{L}_{GAN}(G, D, X, Y) = \mathbb{E}_{y \sim Y}\big[\log D(y)\big] + \mathbb{E}_{x \sim X}\big[\log\big(1 - D(G(x))\big)\big] \qquad (1)$$
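As a concrete illustration, the following is a minimal PyTorch sketch of the two sides of the adversarial objective in Eq. (1); the discriminator D and the image batches are assumed to be defined elsewhere, and CUT-style implementations often substitute a least-squares (LSGAN) variant for the logarithmic form.

```python
import torch
import torch.nn.functional as F

def adversarial_loss_d(D, real_y, fake_y):
    """Discriminator side of Eq. (1): classify real target images as 1, translated images as 0."""
    pred_real = D(real_y)
    pred_fake = D(fake_y.detach())  # stop gradients into G when updating D
    loss_real = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
    loss_fake = F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    return loss_real + loss_fake

def adversarial_loss_g(D, fake_y):
    """Generator side of Eq. (1): fool D into classifying G(x) as real."""
    pred_fake = D(fake_y)
    return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```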
Network architecture
Generator. We employ a ResNet-based encoder-decoder network with nine residual blocks as the generator module. As is well known, the training process of GANs is very unstable, and problems such as mode collapse and convergence difficulties often occur. We therefore employ spectrally normalized convolutions (SNConv) in the design of the residual blocks, which enhances the stability of training. In addition, to further boost the performance of SN-DCR, a recent attention mechanism, the Frequency Channel Attention Network (FCANet), is introduced into our network. As illustrated in Fig. 3, given an input image x, the generator G maps x to G(x) in the target domain, and G is expected to preserve both image structure and details during translation. Given a cat, we first employ an initial layer and two down-sampling layers to encode the input image into a low-resolution feature map. Then, nine SN ResBlocks extract more complex and deeper features in the low-resolution space; Fig. 4 shows the detailed structure of the SN ResBlock. After that, we employ two corresponding up-sampling layers and a 7×7 convolutional layer to output a dog. Moreover, as mentioned above, we introduce FCANet into the generator to further enhance the ability of SN-DCR. FCANet combines channel attention with the discrete cosine transform and extends SENet [15] to a multi-spectral channel attention mechanism, enabling our model to adaptively learn weights for different feature maps. The ablation study shows that introducing FCANet significantly improves the performance of the proposed model.
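The snippet below is a minimal PyTorch sketch of such an SN ResBlock; the reflection padding, instance normalization, and 256-channel width are assumptions borrowed from the common ResNet-generator recipe, and the FCANet attention module is omitted for brevity.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNResBlock(nn.Module):
    """Residual block whose 3x3 convolutions are wrapped with spectral normalization."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3)),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3)),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # identity shortcut keeps the residual formulation

# Nine such blocks operate on the low-resolution feature map in the middle of the generator.
blocks = nn.Sequential(*[SNResBlock(256) for _ in range(9)])
```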
Discriminator. We use the same PatchGAN discriminator [27] architecture as CycleGAN and Pix2Pix, which operates on local 70×70 patches and assigns every patch a score. This is equivalent to manually cropping an image into overlapping 70×70 patches, running a regular discriminator over each patch, and averaging the results. Concretely, the discriminator takes an image from either domain X or domain Y, passes it through five downsampling Convolution-Normalization-LeakyReLU layers, and outputs a 30×30 result matrix, where each element corresponds to the classification of one patch. Following CycleGAN and Pix2Pix, to improve the stability of adversarial training, we use a buffer to store 50 previously generated images.
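The history buffer can be sketched as follows; the class name and the 50/50 replacement policy are illustrative assumptions in the spirit of CycleGAN's image pool.

```python
import random
import torch

class ImagePool:
    """Buffer of previously generated images used when updating the discriminator
    (the 50-image history buffer described above)."""
    def __init__(self, pool_size: int = 50):
        self.pool_size = pool_size
        self.images = []

    def query(self, image: torch.Tensor) -> torch.Tensor:
        if self.pool_size == 0:
            return image
        if len(self.images) < self.pool_size:      # fill the buffer first
            self.images.append(image.detach().clone())
            return image
        if random.random() > 0.5:                  # half the time, swap with a stored image
            idx = random.randrange(self.pool_size)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image                               # otherwise return the current image
```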
Fig. 4 [Images not available. See PDF.]
SN Residual block. The SN residual block can enhance the stability of training and assist the generator to extract more complex features
Dual contrastive regularization
We adopt the real images of the target domain and the source domain as positives and negatives, respectively, to improve the quality of the generated images. DCR constrains the generator through two different feature spaces. Note that, to ensure the flexibility of our proposed method, these positives and negatives are chosen randomly. To achieve a constraint on the global structure, we propose a semantic contrastive loss, which encourages the generated image G(x) to be close to the positives P while keeping away from the negatives N. For the feature space, inspired by AECR-Net [36], we employ a pre-trained VGG-16 network to extract the feature maps.
Table 1. Quantitative comparison with all baselines
Method | Cat→Dog FID | Cat→Dog SWD | Van Gogh→Photo FID | Horse→Zebra FID | Horse→Zebra SWD | sec/iter |
|---|---|---|---|---|---|---|
CycleGAN | 80.5 | 19.5 | 103.0 | 72.2 | 39.1 | 0.40 |
CUT | 76.2 | 12.9 | 96.9 | 45.5 | 31.5 | 0.24 |
FastCUT | 94.0 | 17.6 | 105.3 | 73.4 | 38.2 | 0.15 |
FSeSim | 87.8 | 13.8 | 94.3 | 43.4 | 37.2 | 0.11 |
DCLGAN | 68.7 | 12.5 | 93.7 | 43.2 | 31.2 | 0.41 |
QS-Attn | 72.8 | 12.8 | 92.2 | 41.1 | 30.3 | 0.22 |
SN-DCR | 62.7 | 12.1 | 90.3 | 33.6 | 28.4 | 0.23 |
On these metrics, our algorithm clearly outperforms all baselines, and SN-DCR achieves state-of-the-art performance on unpaired I2I translation tasks
Table 2. Efficiency comparison with all baselines on the Horse→Zebra dataset
Method | Memory | Overall training time | FLOPs | Parameters |
|---|---|---|---|---|
CycleGAN | 4.7 G | 46 h | 128.26 G | 28.286 M |
CUT | 3.9 G | 27 h | 64.13 G | 14.703 M |
FastCUT | 3.4 G | 17 h | 64.13 G | 14.703 M |
FSeSim | 3.8 G | 12 h | 64.13 G | 14.143 M |
DCLGAN | 7.8 G | 44 h | 128.26 G | 29.406 M |
QS-Attn | 5.5 G | 24 h | 64.13 G | 14.703 M |
SN-DCR | 5.5 G | 26 h | 42.38 G | 14.679 M |
Although SN-DCR is not the fastest method and does not have the lowest memory consumption, it achieves notably strong translation quality, striking a reasonable balance between computational cost and performance. Moreover, comparing the FLOPs of each method's generator, our approach performs best, which further substantiates the advantage of SN-DCR
The semantic contrastive loss can be expressed as:
$$\mathcal{L}_{sem} = \sum_{i=1}^{n} \omega_i \cdot \frac{\big\|\phi_i(G(x)) - \phi_i(P)\big\|_1}{\big\|\phi_i(G(x)) - \phi_i(N)\big\|_1} \qquad (2)$$
where $\phi_i$ denotes the features extracted from the $i$-th hidden layer of the VGG-16 network pre-trained on ImageNet, and $n$ is the number of selected layers. $P$ refers to the real images of the target domain, and $N$ refers to the real images of the source domain. Here we choose the 1st, 3rd, 5th, 9th and 13th layers. The $\omega_i$ are layer-wise weight coefficients, with $\omega_1 = 1/16$. Besides, to achieve a constraint on the global texture, we use a dedicated feature space. This feature space can be built in any layer of the network and consists of the correlations between the different filter responses. These feature correlations are given by the Gram matrix:
$$M_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l} \qquad (3)$$
where $F^{l}$ denotes the flattened feature map of layer $l$, $M_{ij}^{l}$ is the inner product between the vectorized feature maps $i$ and $j$ in layer $l$, and $k$ runs over the vector of length $h \times w$. Concretely, a feature-extraction network produces feature maps of shape $[c, h, w]$; flattening and transposing yields matrices of shape $[c, h \times w]$, and the Gram matrix is the inner product between them. We then obtain a set of Gram matrices from layers $\{1, 2, 3, \ldots, L\}$ of the feature-extraction network. The Gram matrix $M$ is a quantitative description of latent image features. We expect the distance $d$ (measured with the L2 norm) between the generated image $G(x)$ and the negatives $N$ to be much greater than the distance between $G(x)$ and the positives $P$:

$$d\big(M(G(x)), M(N)\big) \gg d\big(M(G(x)), M(P)\big) \qquad (4)$$
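As an illustration, here is a minimal PyTorch sketch of the Gram-matrix computation of Eq. (3) and of the L2 texture distance d used in Eq. (4); the division by the vector length is an assumed normalization.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map of shape [B, C, H, W], as in Eq. (3): inner products
    between the flattened channel responses (normalization by the vector length is assumed)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)                      # [B, C, H*W]
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)   # [B, C, C]

def gram_distance(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """L2 distance between Gram matrices: the texture distance d used in Eq. (4)."""
    return torch.norm(gram_matrix(feat_a) - gram_matrix(feat_b), p=2)
```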
Therefore, our proposed style contrastive loss can be expressed as:

$$\mathcal{L}_{sty} = \max\Big( d\big(M(G(x)), M(P)\big) - d\big(M(G(x)), M(N)\big) + \alpha,\ 0 \Big) \qquad (5)$$
where $\alpha$ is a margin hyperparameter (similar to the triplet loss [26]); we set it to 0.04 in our experiments. Finally, our dual contrastive regularization is formulated as:

$$\mathcal{L}_{DCR} = \lambda_1 \mathcal{L}_{sem} + \lambda_2 \mathcal{L}_{sty} \qquad (6)$$
where $\lambda_1$ and $\lambda_2$ are weight coefficients; we set $\lambda_1 = 1$ and $\lambda_2 = 0.5$.
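Putting the pieces together, the sketch below illustrates one possible implementation of DCR (Eqs. 2-6) over frozen VGG-16 features. The mapping of the chosen layers onto torchvision module indices, the L1 ratio form of the semantic term, and the omission of the per-layer weights ω_i and of the ImageNet input normalization are all simplifying assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DCRLoss(nn.Module):
    """A sketch of dual contrastive regularization (Eq. 6): a ratio-style semantic term over
    VGG-16 features (Eq. 2) plus a margin-based style term over Gram matrices (Eq. 5)."""
    def __init__(self, layers=(1, 3, 5, 9, 13), margin=0.04, lambda_sem=1.0, lambda_sty=0.5):
        super().__init__()
        # Frozen ImageNet-pretrained VGG-16 feature extractor (input normalization omitted here).
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)   # assumed mapping of the chosen layers to module indices
        self.margin = margin
        self.lambda_sem, self.lambda_sty = lambda_sem, lambda_sty

    @staticmethod
    def _gram(feat):
        b, c, h, w = feat.shape
        f = feat.reshape(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (h * w)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, fake, pos, neg):
        f_fake, f_pos, f_neg = map(self._features, (fake, pos, neg))
        sem = fake.new_zeros(())
        sty = fake.new_zeros(())
        for a, p, n in zip(f_fake, f_pos, f_neg):
            # Semantic term: pull toward the target-domain positive, push from the source-domain negative.
            sem = sem + torch.norm(a - p, p=1) / (torch.norm(a - n, p=1) + 1e-7)
            # Style term: triplet-style margin on Gram-matrix (texture) distances, cf. Eq. (5).
            d_pos = torch.norm(self._gram(a) - self._gram(p), p=2)
            d_neg = torch.norm(self._gram(a) - self._gram(n), p=2)
            sty = sty + torch.clamp(d_pos - d_neg + self.margin, min=0.0)
        return self.lambda_sem * sem + self.lambda_sty * sty
```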
Patch-wise contrastive loss

Following the setup of CUT, we employ a patch-wise contrastive loss to maximize the mutual information between the inputs and outputs. The key idea is to pull the query (a patch from the generated image, yellow box in Fig. 2) and the positive (the corresponding patch from the input image, blue box) together, and to push the query and the negatives (non-local patches from the input image, red box) apart. We employ E to extract features from the input image and the generated image, and introduce the QS-Attn module to deliberately choose important anchors for the patch-wise contrastive loss. The per-patch loss can be expressed as:
$$\ell\big(q, k^{+}, k^{-}\big) = -\log \frac{\exp\big(q \cdot k^{+}/\tau\big)}{\exp\big(q \cdot k^{+}/\tau\big) + \sum_{n=1}^{N-1} \exp\big(q \cdot k_{n}^{-}/\tau\big)} \qquad (7)$$
where $q$ refers to the anchor feature (yellow box in Fig. 2) from $G(x)$, $k^{+}$ refers to a positive (blue box), and $k^{-}$ refers to the $(N-1)$ negatives (red box). Here $\tau$ is a temperature parameter used to scale the distance between the query and the other samples; its default value is 0.07. Note that the positive is the patch corresponding to the anchor feature $q$ in the input image $x$, and the negatives are randomly selected from $x$. We utilize $E$ to extract features from the generated image, select $L$ layers of interest from $E$, and send them to the QS-Attn module $Q$ to obtain the features we need. The resulting features are denoted by $\{\hat{z}_l\}_{l=1}^{L}$, where $l$ indexes the selected layers. The QS-Attn module selects $S_l$ important anchors in each selected layer, where $S_l$ is the number of patches selected in layer $l$. In the same way, the corresponding patches $\{z_l\}_{l=1}^{L}$ of the $L$ layers are obtained from the input image. We take the corresponding patches obtained from the input image as positives and the other features as negatives.
The patch-wise contrastive loss can be expressed as:
$$\mathcal{L}_{NCE}(X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\big(\hat{z}_l^{s}, z_l^{s}, z_l^{S \setminus s}\big) \qquad (8)$$
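For a single selected layer, a minimal sketch of this loss is shown below; following the standard CUT-style implementation, the other S−1 selected patches in the same layer serve as negatives, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(query: torch.Tensor, positive: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Patch-wise contrastive loss of Eqs. (7)-(8) for one layer.
    query:    [S, C] anchor features from the generated image (QS-Attn-selected patches)
    positive: [S, C] features at the corresponding locations of the input image
    The remaining S-1 patches act as negatives for each anchor."""
    query = F.normalize(query, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = query @ positive.t() / tau          # [S, S]; diagonal entries are the positive pairs
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)      # softmax over positive + negatives, as in Eq. (7)
```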
In summary, the overall objective function of SN-DCR can be formulated as:

$$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \mathcal{L}_{NCE}(X) + \mathcal{L}_{idt}(Y) + \mathcal{L}_{DCR} \qquad (9)$$
where $\mathcal{L}_{NCE}(X)$ is the patch-wise contrastive loss defined in Eq. (8), and $\mathcal{L}_{idt}(Y)$ is the identity loss, in which the positive and negatives are extracted from a real image $y$ of domain $Y$ and the anchor $q$ is taken from $G(y)$. Following the settings of CUT, we add this identity loss to prevent the generator $G$ from making unnecessary changes to images that are already in the target domain.

Fig. 5 [Images not available. See PDF.]
Visual comparison with all baselines on the Horse→Zebra and Cat→Dog datasets. Compared with all baselines, our SN-DCR not only performs better in terms of global structure and texture, but also generates more realistic images with more natural details
Fig. 6 [Images not available. See PDF.]
Visual comparison with all baselines on the Cityscapes dataset. Our proposed SN-DCR shows visually satisfactory results: it clearly generates a wide variety of cars with natural details and performs better in terms of global structure and texture
Experiments
Experimental settings
Datasets. SN-DCR is trained and evaluated on the Horse→Zebra, Cat→Dog, Van Gogh→Photo and Cityscapes datasets. Horse→Zebra is provided in [31] and contains 1,067 and 1,334 training images of horses and zebras, respectively; we use 120 horse images as test images. Cat→Dog is from [7] and consists of 5,153 and 4,739 training images of cats and dogs, respectively; we use 500 cat images as test images. Van Gogh→Photo is a dataset of 400 Van Gogh paintings and 6,287 photos extracted from [7]; we use 400 Van Gogh images as test images. Cityscapes contains street scenes from German cities, with 2,975 training images and 500 test images.
Training details. The implementation of SN-DCR is mainly based on CUT. We use a ResNet-based generator and a PatchGAN discriminator. Different from CUT, we introduce the QS-Attn module in the patch-wise contrastive loss and the SN and FCA modules in the design of the generator. Our proposed dual contrastive regularization employs VGG-16 to extract features. We use the Adam optimizer. The batch size is 1, and all training images are resized to 286×286 and then randomly cropped to 256×256. SN-DCR is trained for 400 epochs on each dataset with a learning rate of 0.0002, which starts to decay linearly to 0 after 200 epochs.
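A small sketch of this optimization setup is shown below; the placeholder generator and the default Adam momentum terms are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder generator used only to make the snippet runnable; the real model is the SN-DCR generator.
G = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))

# Linear schedule: constant learning rate for the first 200 epochs,
# then linear decay to zero over the remaining 200 epochs (400 in total).
def lr_lambda(epoch: int, n_constant: int = 200, n_decay: int = 200) -> float:
    return 1.0 - max(0, epoch - n_constant) / float(n_decay)

optimizer = torch.optim.Adam(G.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for epoch in range(400):
    # ... one training epoch over the dataset ...
    scheduler.step()  # advance the linear decay once per epoch
```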
Evaluation metrics. We primarily utilize the Fréchet Inception Distance (FID) [14] and the Sliced Wasserstein Distance (SWD) [2] as key evaluation metrics for assessing the performance of SN-DCR. Both FID and SWD are well-established metrics with a robust correlation to human perception. They measure the distance between two distributions, namely the distributions of real and generated images; lower FID and SWD signify a closer resemblance between the generated and real images. For the Cityscapes dataset, we use the pretrained semantic segmentation network DRN [38] and compute the mean average precision (mAP), pixel-wise accuracy (pixAcc), and average class accuracy (classAcc), showing the semantic interpretability of the generated images.
Comparison with other methods
Table 1 shows the quantitative results of our proposed SN-DCR compared with all baselines on three datasets, including QS-Attn [16], DCLGAN [12], FSeSim [42], CUT [23], FastCUT [23] and CycleGAN [31]. We mainly use FID and SWD scores as quantitative metrics. On these metrics, our algorithm clearly outperforms all baselines, and SN-DCR achieves state-of-the-art performance on unpaired I2I translation tasks. As illustrated in Table 2, SN-DCR demonstrates superiority over DCLGAN and CycleGAN in terms of both performance and computational cost. Despite having a computational cost comparable to QS-Attn and CUT, SN-DCR outperforms them in translation quality, and although it falls slightly short of FastCUT and FSeSim in computational cost, its performance significantly surpasses theirs. Figure 1 shows the visual comparison with all baselines on the Van Gogh→Photo dataset. Compared with the other methods, our SN-DCR preserves the global structure and generates photos with more natural details, proving that our DCR effectively improves the global information. To better illustrate the superiority of SN-DCR, we display additional visual results in Fig. 5, where we randomly pick four samples from the Horse→Zebra and Cat→Dog datasets. QS-Attn, DCLGAN and FSeSim fail to preserve the details and textures of the generated image, while CUT, FastCUT and CycleGAN generate unsatisfactory content. Compared with all baselines, our SN-DCR not only performs better in terms of global structure and texture, but also generates more realistic images with more natural details. Figure 6 shows the visual comparison with all baselines on Cityscapes. Compared with the other methods, our SN-DCR clearly generates a wide variety of cars with natural details and performs better in terms of global structure and texture, proving that our DCR enhances the consistency of the global texture. Moreover, our SN-DCR produces more realistic images with richer color details. Table 3 shows the quantitative results of SN-DCR compared with all baselines on the Cityscapes dataset; our algorithm clearly outperforms all baselines. For the structural similarity (SSIM) metric, our proposed SN-DCR performs better than the other methods, proving that our DCR effectively enhances the global structure information.
Ablation study
In the comparison experiments, SN-DCR shows better performance than all baseline methods. In SN-DCR, we apply the FCA and SN modules in the design of the generator, the QS-Attn module in patch-wise contrastive learning, and the dual contrastive regularization.
To study the effect of the weights (hyperparameters $\lambda_1$ and $\lambda_2$) in DCR, ablation experiments are performed on the Horse→Zebra dataset to determine their optimal values, as shown in Table 4. To evaluate the remaining components separately, we conduct the ablation study on the Horse→Zebra dataset, with CUT as our baseline. Visual results and metrics for the ablation study are shown in Fig. 7 and Table 5.
The metrics of models A and B are better than those of CUT, reflecting that our proposed global contrastive losses are useful for I2I translation tasks. Model C outperforms A and B, indicating the effectiveness of our dual setting. Model D performs better than C, indicating that adding the QS-Attn module is beneficial. Model E omits DCR to isolate the influence of the global constraints from the local constraints; it outperforms QS-Attn, proving that our designed generator is effective. Model F outperforms D, indicating the effectiveness of the SN ResBlock in our generator. Finally, Model G, which combines all the measures, achieves the best performance on I2I translation tasks, validating the efficacy of our overall approach.
Table 3. Quantitative comparison with all baselines on Cityscapes dataset
Method | FID | mAP | pixAcc | classAcc | SSIM |
|---|---|---|---|---|---|
CycleGAN | 76.3 | 20.4 | 55.9 | 25.4 | 0.3601 |
CUT | 56.4 | 24.7 | 68.8 | 30.7 | 0.4474 |
FastCUT | 68.8 | 19.1 | 59.9 | 24.3 | 0.4265 |
FSeSim | 54.3 | 22.1 | 69.4 | 27.8 | 0.4367 |
DCLGAN | 49.4 | 22.9 | 76.9 | 29.6 | 0.4486 |
QS-Attn | 53.5 | 25.5 | 79.9 | 31.2 | 0.4573 |
SN-DCR | 46.6 | 27.9 | 74.3 | 35.4 | 0.4638 |
For the Cityscapes dataset, we use the pretrained semantic segmentation network DRN and compute the mean average precision (mAP), pixel-wise accuracy (pixAcc), and average class accuracy (classAcc), showing the semantic interpretability of the generated images. Unlike the other datasets, Cityscapes has corresponding ground-truth labels, so we also evaluate the structural consistency between the ground truth and the generated images using the structural similarity (SSIM) [34] metric
Fig. 7 [Images not available. See PDF.]
Visual results of the ablation study on the Horse→Zebra dataset. The leftmost column shows the input images. Model G is our proposed SN-DCR
Table 4. Ablation study of the DCR weights (hyperparameters $\lambda_1$ and $\lambda_2$) on the Horse→Zebra dataset
38.1 | 37.9 | 36.5 | 37.4 | 38.4 | |
38.3 | 34.1 | 33.6 | 36.7 | 37.9 | |
39.2 | 35.5 | 33.9 | 36.2 | 38.9 | |
40.3 | 37.8 | 35.4 | 38.6 | 40.9 | |
41.1 | 38.4 | 36.2 | 39.5 | 40.7 |
We use FID to evaluate this ablation study
Table 5. Ablation study of each component on the Horse→Zebra dataset (the configuration of models A–G is described in the text)
Method | FID |
|---|---|
CUT | 45.5 |
A | 37.4 |
B | 38.7 |
C | 35.5 |
D | 34.5 |
E | 38.9 |
F | 34.1 |
G | 33.6 |
DCR refers to dual contrastive regularization, "semantic" to the semantic contrastive loss, and "style" to the style contrastive loss. FCA refers to the Frequency Channel Attention Network, SN to the spectrally normalized convolutional network, and QS-Attn to the QS-Attn module
Conclusion
In this paper, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. To achieve a global constraint on structure and texture in an unpaired manner, we formulate two new global contrastive loss functions that supplement the patch-wise contrastive loss, called dual contrastive regularization (DCR). To alleviate mode collapse and convergence difficulties, we employ spectrally normalized convolutional layers in the design of our generator. Moreover, the Frequency Channel Attention Network is introduced to further boost the feature representation ability of our generator. The ablation study demonstrates the effectiveness of each component of our method, and our design handles various I2I translation tasks well. In the comparison experiments, SN-DCR outperforms all baseline methods in terms of global structure and texture, and quantitative and visual results on multiple datasets show that SN-DCR achieves the best results. Furthermore, we highlight that the principal advantage of SN-DCR lies in its ability to generate highly realistic images, particularly excelling in the generation of textures and natural details. Nevertheless, our approach has certain limitations, chiefly its training cost and memory consumption. Finally, we believe that our work will stimulate further research on unpaired image-to-image translation.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62276138 and 61876087.
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Benaim, S., Wolf, L.: One-sided unsupervised domain mapping. Neural Inf. Process. Syst. 752–762 (2017)
2. Bruckstein, A.M., ter Haar Romeny, B.M., Bronstein, A.M., et al.: Wasserstein barycenter and its application to texture mixing. In: International Conference on Scale Space and Variational Methods, pp. 435–446 (2011)
3. Caron, M., Misra, I., Mairal, J., et al.: Unsupervised learning of visual features by contrasting cluster assignments. Neural Inf. Process. Syst. (2020)
4. Chang, Y., Guo, Y., Ye, Y., et al.: Unsupervised deraining: Where asymmetric contrastive learning meets self-similarity. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
5. Chen, T., Kornblith, S., Norouzi, M., et al.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020)
6. Chen, X., Fan, H., Girshick, R.B., et al.: Improved baselines with momentum contrastive learning. CoRR (2020)
7. Choi, Y., Uh, Y., Yoo, J., et al.: Stargan v2: Diverse image synthesis for multiple domains. In: Conference on Computer Vision and Pattern Recognition, pp. 8185–8194 (2020)
8. Fu, T., Gong, M., Wang, C., et al.: Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In: Conference on Computer Vision and Pattern Recognition, pp. 2427–2436 (2019)
9. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Conference on Computer Vision and Pattern Recognition, pp. 2414–2423 (2016)
10. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial networks. Commun. ACM 63 (2020)
11. Gou, Y., Li, M., Song, Y., et al.: Multi-feature contrastive learning for unpaired image-to-image translation. Complex Intell. Syst. 9 (2023)
12. Han, J., Shoeiby, M., Petersson, L., et al.: Dual contrastive learning for unsupervised image-to-image translation. In: Conference on Computer Vision and Pattern Recognition Workshops, pp. 746–755 (2021)
13. He, K., Fan, H., Wu, Y., et al.: Momentum contrast for unsupervised visual representation learning. In: Conference on Computer Vision and Pattern Recognition, pp. 9726–9735 (2020)
14. Heusel, M., Ramsauer, H., Unterthiner, T., et al.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Neural Inf. Process. Syst. 6626–6637 (2017)
15. Hu, J., Shen, L., Albanie, S., et al.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2020)
16. Hu, X., Zhou, X., Huang, Q., et al.: Qs-attn: Query-selected attention for contrastive learning in I2I translation. In: Conference on Computer Vision and Pattern Recognition, pp. 18270–18279 (2022)
17. Huang, X., Liu, M.Y., Belongie, S., et al.: Multimodal unsupervised image-to-image translation. In: European Conference on Computer Vision, pp .172–189 (2018)
18. Isola, P., Zhu, J., Zhou, T., et al.: Image-to-image translation with conditional adversarial networks. In: Conference on Computer Vision and Pattern Recognition, pp. 5967–5976 (2017)
19. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 43 (2021)
20. Mirza, M., Osindero, S.: Conditional generative adversarial nets. CoRR (2014)
21. Miyato, T., Kataoka, T., Koyama, M., et al.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018)
22. Park, T., Liu, M., Wang, T., et al.: Semantic image synthesis with spatially-adaptive normalization. In: Conference on Computer Vision and Pattern Recognition, pp. 2337–2346 (2019)
23. Park, T., Efros, A.A., Zhang, R., et al.: Contrastive learning for unpaired image-to-image translation. In: European Conference on Computer Vision, vol. 12345, pp. 319–345 (2020)
24. Phaphuangwittayakul, A., Ying, F., Guo, Y., et al.: Few-shot image generation based on contrastive meta-learning generative adversarial network. Vis. Comput. 39 (2023)
25. Qin, Z., Zhang, P., Wu, F., et al.: Fcanet: Frequency channel attention networks. In: International Conference on Computer Vision, pp. 763–772 (2021)
26. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
27. Son, J., Park, S.J., Jung, K.: Retinal vessel segmentation in fundoscopic images with generative adversarial networks. CoRR (2017)
28. Song, S., Lee, S., Seong, H., et al.: Shunit: Style harmonization for unpaired image-to-image translation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2292–2302 (2023)
29. Sung, F., Yang, Y., Zhang, L., et al.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
30. Torbunov, D., Huang, Y., Yu, H., et al.: Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 702–712 (2023)
31. Zhu, J., Park, T., Isola, P., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: International Conference on Computer Vision, pp. 2242–2251 (2017)
32. Wang, T., Liu, M., Zhu, J., et al.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Conference on Computer Vision and Pattern Recognition, pp. 8798–8807 (2018)
33. Wang, W., Zhou, W., Bao, J., et al.: Instance-wise hard negative example generation for contrastive learning in unpaired image-to-image translation. In: International Conference on Computer Vision, pp. 14000–14009 (2021)
34. Wang, Z., Bovik, A.C., Sheikh, H.R., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (2004)
35. Wu, G., Jiang, J., Liu, X.: A practical contrastive learning framework for single-image super-resolution. IEEE Trans. Neural Netw. Learn. Syst. (2023)
36. Wu, H., Qu, Y., Lin, S., et al.: Contrastive learning for compact single image dehazing. In: Conference on Computer Vision and Pattern Recognition, pp. 10551–10560 (2021)
37. Yi, Z., Zhang, H.R., Tan, P., et al.: Dualgan: unsupervised dual learning for image-to-image translation. In: International Conference on Computer Vision, pp. 2868–2876 (2017)
38. Yu, F., Koltun, V., Funkhouser, T.A.: Dilated residual networks. In: Conference on Computer Vision and Pattern Recognition, pp. 636–644 (2017)
39. Zhang, D., Zheng, Z., Li, M., et al.: Reinforced similarity learning: Siamese relation networks for robust object tracking. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 294–303 (2020)
40. Zhang, Y., Tian, Y., Hou, J.: Csast: content self-supervised and style contrastive learning for arbitrary style transfer. Neural Netw. 164, 146–155 (2023). https://dx.doi.org/10.1016/j.neunet.2023.04.037
41. Zhao, C., Cai, W., Yuan, Z., et al.: Multi-crop contrastive learning for unsupervised image-to-image translation. CoRR (2023)
42. Zheng, C., Cham, T., Cai, J.: The spatially-correlative loss for various image translation tasks. In: Conference on Computer Vision and Pattern Recognition, pp. 16407–16417 (2021)
43. Zhu, D., Wang, W., Xue, X., et al.: Structure-preserving image smoothing via contrastive learning. The Visual Computer, pp. 1–15 (2023)