1. Introduction
Image style transfer refers to rendering a real image in a different painting style, i.e., transforming it into a stylized image, while keeping the details of the image’s content intact. This was previously accomplished by a technically skilled painter and took a great deal of time. With the development of artificial intelligence, it is now possible to use advanced computing power to complete this task, significantly improving efficiency; improving the quality of stylization remains one of the most challenging problems. With the emergence of deep learning, Gatys et al. [1] first proposed a method for image style transfer based on the convolutional neural network VGG-19 [2], using VGG-19 to extract the content features of the image, taking the Gram matrix of the feature maps as the style features, and transferring the style through iterative optimization. This achieved great success and provided a basis for subsequent research. To speed up inference, many feedforward-network-based methods have emerged [3,4,5,6], achieving good results. However, these methods are limited to single-style transfer.
Since then, arbitrary style transfer methods have emerged, which encode content and style images as feature maps, fuse the content and style features, and finally output the result through a decoder. Huang et al. [7] proposed AdaIN (adaptive instance normalization), which aligns the mean and variance of the content image features with those of the style image features and can perform multi-style conversion; however, the severe loss of information in the decoder of the autoencoder leads to a “blocky” texture in the transferred images. Li et al. [8] proposed WCT (whitening and coloring transforms) to achieve arbitrary style transfer through transformations of the content and style features: first, the whitening transform erases the style information of the content image while retaining its high-level semantic information, and then the coloring transform maps the style of the style image onto the processed content features to reconstruct the stylized image. Sheng et al. [9] proposed Avatar-Net, combining patch matching and adaptive normalization to achieve style transformation. Park et al. [10] proposed SANet (style-attentional network), which matches style features to content features by attention; however, because the encoder output is globally optimized directly through attention, content information can easily be lost. Wang et al. [11] proposed AesUST (aesthetic-enhanced universal style transfer), incorporating aesthetic features into the style transfer network and using an aesthetic discriminator to efficiently and flexibly integrate style patterns based on the global aesthetic channel distribution of the style image and the local semantic spatial distribution of the content image; however, its stylized images are overly colorful, resulting in an insufficient stylization effect. Deng et al. [12] and Zhang et al. [13] both used transformer-based approaches [14] for arbitrary image style transfer, but the number of transformer parameters is very large, so these approaches consume too much training time. Luo et al. [15] proposed PAMA (progressive attentional manifold alignment), using attention operations and spatially aware interpolation to align the content and style spaces; however, the model performs manifold alignment directly on the features output by the encoder, so it cannot extract sufficient local information beforehand. Although these methods have been widely used, they do not maintain both global and local styles well, and the visual quality of their stylized images still needs improvement. As shown in Figure 1, the style image has a square, block-like texture, while the compared methods in the figure fail to reproduce this block-like style in their stylized images.
Since arbitrary style transfer models are based on an encoder–decoder structure, after the content and style features are extracted by the encoder, it is critical that the two features are well fused in the intermediate layers. To achieve a good fusion effect, an attention mechanism can be used to locally reinforce the fused features. Inspired by the PAMA algorithm, we propose an arbitrary image style transfer algorithm based on halo attention dynamic convolution manifold alignment. The method introduces the halo attention module and dynamic convolution into the AMA (attentional manifold alignment) module, forming HDAMA (halo-attention dynamic-convolution attentional manifold alignment). Content and style features are extracted using halo attention and dynamic convolution, providing better conditions for the subsequent manifold alignment operations. After manifold alignment, features are extracted again, and the decoder is then used to fuse the feature maps and generate the stylized image. The contributions of our research are as follows.
- First, the halo attention module and dynamic convolution are introduced for content and style feature extraction, screening out the more critical details of the image so that the content manifolds can better match the corresponding style manifolds. Dynamic convolution and the halo attention module are then also used for the output features.
- A multi-stage loss function is used in the manifold alignment stage, providing better conditions for manifold alignment. Total variation loss is introduced to smooth the image and eliminate noise; it is combined with the relaxed earth mover distance (REMD) loss, moment matching loss, and differentiable color histogram loss as the style loss, and optimized for the generated image together with the content loss and VGG reconstruction loss.
- Finally, to validate the proposed method, ArtFID, content loss, and style loss were selected as objective evaluation metrics, and ablation experiments were performed to verify the effectiveness of its components.
The rest of the paper is structured as follows: Section 2 provides an overview of the related work, Section 3 provides the methodology of the study, Section 4 includes the experimental content and results, and Section 5 presents the conclusion.
2. Related Work
2.1. Image Style Transfer
Image style transfer methods can be divided into single-style transfer and arbitrary style transfer. Gatys et al. [1] first used a pre-trained deep convolutional neural network for image style transfer, providing the basis for style transfer with convolutional networks. To address the transfer speed problem, Ulyanov et al. [3] trained feedforward neural networks through model optimization, which improved the transfer speed but remained limited to single-style transfer. Later, researchers proposed many image style transfer methods [16,17,18] based on generative adversarial networks (GANs) [19], reporting good performance. The above methods are all single-style transfer methods.
To realize arbitrary style transfer, Huang et al. [7] proposed AdaIN, which aligns the mean and variance of the content image features with those of the style image features to perform multi-style conversion. Influenced by the attention mechanism [14], Park et al. [10] proposed a style-attentional network to match style features to content features. Liu et al. [20] proposed adaptive attention normalization (AdaAttN), which adaptively performs attention normalization in the transformation layer. Wang et al. [11] incorporated aesthetic features into the style transfer network with an aesthetic discriminator. Deng et al. [12] and Zhang et al. [13] both used transformer-based approaches [21] for arbitrary image style transfer. Since then, several style transfer methods based on contrastive learning have emerged [22,23,24]. Each image feature follows a distribution composed of many different manifolds [25], and manifold alignment establishes the relationship between two images drawn from different manifold distributions: projecting the two images into a shared subspace links their feature spaces while preserving their manifold structures. Several manifold alignment methods already exist [26,27,28]. Huo et al. [25] proposed an image style transfer method that performs manifold alignment with a global channel transformation, taking multiple manifold distributions and aligning them; however, the model must include the most relevant content and style manifolds, so the generated stylized images are prone to losing semantic content and style information. Luo et al. [15] proposed progressive attentional manifold alignment (PAMA), which uses attention operations and spatially aware interpolation to align the content and style spaces; however, it performs manifold alignment directly on the encoder output, preventing it from extracting sufficient local information beforehand.
2.2. Attention Mechanism
Applying an attention mechanism allows a neural network to focus on more helpful information. Hu et al. [29] proposed SE-Net (squeeze-and-excitation network), which learns feature weights from the loss so that more effective feature maps receive larger weights and less effective ones receive smaller weights, enabling the model to achieve better results. Woo et al. proposed CBAM (convolutional block attention module) [30], combining channel and spatial attention with good results. SK-Net (selective kernel network) [31] verified the importance of the choice of convolutional kernels. Subsequently, ECA-Net (efficient channel attention network) [32] improved upon SE-Net by aggregating cross-channel information through a one-dimensional convolutional layer to obtain more accurate attention information. Xiao et al. [33] proposed RTFN (robust temporal feature network), using multi-headed convolutional layers to diversify multi-scale features and employing self-attention to correlate different positions of previously extracted features. Chen et al. [34] proposed dynamic convolution, employing an attention mechanism over convolution kernels; compared with static convolution, it has a stronger feature representation capability. Vaswani et al. [35] proposed HaloNet, which, unlike the traditional self-attention module, efficiently uses self-attention as a convolution-like local filter, improving the network’s performance.
3. Methods
3.1. Overall Framework
The algorithm proposed in this research comprises an encoder module, three halo-attention dynamic-convolution attentional manifold alignment (HDAMA) modules, and a decoder module, as shown in Figure 2. First, we use the layers of the pre-trained VGG-19 network to encode the input content and style images, obtaining the content features $F_c$ and style features $F_s$, respectively. The HDAMA modules then align the content and style manifolds, gradually fusing $F_c$ with $F_s$ to obtain the fused feature map $F_{cs}$. This process is repeated three times to better align the manifolds. Finally, the output of the last HDAMA module is passed to the VGG-19 decoder, which is trained to generate the stylized image.
3.2. Halo Attention Dynamic Convolution Attention Manifold Alignment Module
The structure of the HDAMA module proposed in this research is shown in Figure 3. First, the content and style features are extracted separately by the halo attention module (with halo = 2), followed by dynamic convolution [34] for further feature extraction. The style features are then rearranged by the attention module to obtain $F_{s \to c}$; spatially aware interpolation then aggregates the channel information to obtain the adaptive weights $W$, which are used to interpolate between the content features $F_c$ and $F_{s \to c}$. Finally, the result is output through dynamic convolution and the halo attention module (with halo = 4).
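To make the data flow through one HDAMA block concrete, the following PyTorch sketch composes the steps in the order described above. It is an illustration only: the class names (HaloAttention, DynamicConv2d, AttentionRearrange, SpaceAwareInterpolation) refer to the hypothetical sketches given in the following subsections, not to the authors’ released implementation, and the halo values are taken from the description above.

```python
import torch.nn as nn

class HDAMABlock(nn.Module):
    """Illustrative composition of one HDAMA block: halo attention and dynamic
    convolution extract features, attention rearranges the style features,
    spatially aware interpolation fuses them, and the result is refined by
    dynamic convolution and halo attention again."""
    def __init__(self, dim, halo_in=2, halo_out=4):
        super().__init__()
        self.halo_c = HaloAttention(dim, halo=halo_in)    # content branch
        self.halo_s = HaloAttention(dim, halo=halo_in)    # style branch
        self.dyn_c = DynamicConv2d(dim, dim, kernel_size=3, padding=1)
        self.dyn_s = DynamicConv2d(dim, dim, kernel_size=3, padding=1)
        self.rearrange = AttentionRearrange(dim)          # Equations (6)-(7)
        self.interp = SpaceAwareInterpolation(dim)        # Equations (8)-(9)
        self.dyn_out = DynamicConv2d(dim, dim, kernel_size=3, padding=1)
        self.halo_out_attn = HaloAttention(dim, halo=halo_out)

    def forward(self, Fc, Fs):
        Fc = self.dyn_c(self.halo_c(Fc))    # halo attention, then dynamic convolution
        Fs = self.dyn_s(self.halo_s(Fs))
        Fs2c = self.rearrange(Fc, Fs)       # style features rearranged to the content layout
        Fcs = self.interp(Fc, Fs2c)         # adaptive interpolation with weights W
        return self.halo_out_attn(self.dyn_out(Fcs))
```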
3.2.1. Dynamic Convolution Module
Dynamic convolution has a more robust feature expression capability compared to static convolution. The traditional static perceptron is calculated as follows.
$y = g\left(W^{T}x + b\right)$ (1)
where $W$ is the weight, $b$ is the bias, and $g$ is the activation function. The dynamic perceptron aggregates $K$ linear perceptrons via input-dependent attention weights, making the overall mapping non-linear, and is computed as follows.
$y = g\left(\tilde{W}^{T}(x)\,x + \tilde{b}(x)\right)$ (2)
where $\pi_{k}(x)$ is the attention weight of the $k$th linear perceptron, with $0 \leq \pi_{k}(x) \leq 1$ and $\sum_{k}\pi_{k}(x) = 1$, and it varies with the input. The dynamic perceptron therefore introduces two additional calculations: the attention weight calculation and the dynamic weight fusion.
$\tilde{W}(x) = \sum_{k=1}^{K}\pi_{k}(x)\,\tilde{W}_{k},\qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_{k}(x)\,\tilde{b}_{k}$ (3)
Even though this adds extra computation, it is negligible compared to the original perceptron. The dynamic convolution structure diagram is shown in Figure 4.
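The following PyTorch sketch illustrates dynamic convolution in the spirit of [34] and Equations (2) and (3): K parallel kernels are aggregated per input sample using softmax attention weights π_k(x). The squeeze-and-excitation-style attention branch, K = 4, and the grouped-convolution trick for batching are our assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Minimal sketch of dynamic convolution: K kernels are mixed per sample
    with attention weights pi_k(x), as in Equations (2)-(3)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, K=4, reduction=4):
        super().__init__()
        self.K, self.out_ch, self.padding = K, out_ch, padding
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        self.attn = nn.Sequential(                       # pi_k(x): attention over kernels
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, K))

    def forward(self, x):                                # x: (B, C_in, H, W)
        B = x.size(0)
        pi = F.softmax(self.attn(x), dim=1)              # (B, K), sums to 1 per sample
        # Aggregate kernels per sample: W(x) = sum_k pi_k(x) * W_k, b(x) = sum_k pi_k(x) * b_k
        w = torch.einsum('bk,koihw->boihw', pi, self.weight).flatten(0, 1)
        b = torch.einsum('bk,ko->bo', pi, self.bias).flatten()
        # A grouped convolution applies each sample's own aggregated kernel in one call.
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]), w, b,
                       padding=self.padding, groups=B)
        return out.reshape(B, self.out_ch, *out.shape[2:])
```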
3.2.2. Halo Attention Module
The halo attention module [35] differs from the traditional self-attention module in that it efficiently employs self-attention as a local, convolution-like filter, improving the network’s performance. The halo attention module is shown in Figure 5 (halo = 1).
First, the input feature map is divided into four blocks of the same size. Each block is then padded with an n-layer halo around it, where n is the halo value; this enlarges each block’s receptive field. Keys and values are sampled from each haloed block, an attention operation is performed on the sampled information (which also downsamples it), and the resulting feature map is finally output and summed with the input feature map. This residual connection integrates local and global details.
The halo attention network differs from an ordinary convolutional network in that it uses self-attention for the convolution-like operations. It adds only a few parameters but provides a good performance improvement.
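The sketch below illustrates the blocked local self-attention with a halo described above, loosely following HaloNet [35]. The block size, the single attention head, and the use of unfold/fold for block extraction are simplifying assumptions; it only requires the spatial size to be divisible by the block size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaloAttention(nn.Module):
    """Sketch of blocked local self-attention with a halo: queries come from
    non-overlapping blocks, keys/values from the same blocks enlarged by a halo."""
    def __init__(self, dim, block=8, halo=2):
        super().__init__()
        self.block, self.halo = block, halo
        self.scale = dim ** -0.5
        self.to_q = nn.Conv2d(dim, dim, 1, bias=False)
        self.to_kv = nn.Conv2d(dim, dim * 2, 1, bias=False)

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by block
        B, C, H, W = x.shape
        b, h = self.block, self.halo
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=1)

        # Queries: non-overlapping b x b blocks -> (B*nb, b*b, C)
        q = F.unfold(q, kernel_size=b, stride=b)                # (B, C*b*b, nb)
        nb = q.shape[-1]
        q = q.transpose(1, 2).reshape(B * nb, C, b * b).transpose(1, 2)

        # Keys/values: (b+2h) x (b+2h) haloed windows around each block
        def windows(t):
            t = F.unfold(t, kernel_size=b + 2 * h, stride=b, padding=h)
            return t.transpose(1, 2).reshape(B * nb, C, (b + 2 * h) ** 2).transpose(1, 2)
        k, v = windows(k), windows(v)

        attn = (q @ k.transpose(1, 2)) * self.scale             # (B*nb, b*b, (b+2h)^2)
        out = attn.softmax(dim=-1) @ v                          # (B*nb, b*b, C)

        # Fold the blocks back into a spatial map and add the residual connection.
        out = out.transpose(1, 2).reshape(B, nb, C * b * b).transpose(1, 2)
        out = F.fold(out, output_size=(H, W), kernel_size=b, stride=b)
        return out + x
```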
3.2.3. Attention Manifold Alignment
Attention manifold alignment is a module proposed by Luo et al. [15] for arbitrary image style transfer. According to Huo et al. [25], each image feature follows a distribution composed of multiple manifolds with many different semantic distributions. Assuming that the content and style images have m and n semantic regions, respectively, the content features and style features are represented as follows.
$F_{c} = F_{c}^{1} \cup F_{c}^{2} \cup \cdots \cup F_{c}^{m}$ (4)
$F_{s} = F_{s}^{1} \cup F_{s}^{2} \cup \cdots \cup F_{s}^{n}$ (5)
where $F_{c}^{i}$ and $F_{s}^{i}$ denote the $i$th subsets of $F_{c}$ and $F_{s}$, respectively, each corresponding to the $i$th manifold. To address the inconsistency between the content and style manifolds, this research introduces and improves upon the attentional manifold alignment (AMA) module, which comprises an attention module [10,36] and spatially aware interpolation.
In the attention module, the input content features $F_{c}$ and style features $F_{s}$ are first normalized and subsequently embedded to obtain the attention map $A$, calculated as follows.
$A = \mathrm{softmax}\left(f\left(\mathrm{Norm}(F_{c})\right) \otimes g\left(\mathrm{Norm}(F_{s})\right)^{T}\right)$ (6)
where $f$ denotes the convolution block applied to the content features $F_{c}$, and $g$ denotes the convolution block applied to the style features $F_{s}$; $\mathrm{Norm}$ is the mean-variance normalization, and ⊗ is matrix multiplication. The attention map $A$ contains the pairwise similarities between the content features $F_{c}$ and the style features $F_{s}$. Using $A$ as an affine transformation, the style features are spatially rearranged to obtain $F_{s \to c}$.
$F_{s \to c} = \mathrm{Reshape}\left(A \otimes h\left(F_{s}\right)\right)$ (7)
where $h$ is a convolution block and $\mathrm{Reshape}$ denotes redefining the shape of the matrix. In this way, links between the different manifolds can be established. In the spatially aware interpolation module, the dense channel module applies convolution kernels of different scales to the channel-concatenated features $F_{c}$ and $F_{s \to c}$ to obtain the adaptive weights $W$. This is calculated as follows.
$W = \sum_{i}\mathrm{Conv}_{i}\left(F_{c} \oplus F_{s \to c}\right)$ (8)
where $\mathrm{Conv}_{i}$ denotes the $i$th convolution kernel and $\oplus$ denotes the channel concatenation operation. The features concatenated in the spatially aware interpolation module capture the differences between the content manifold and the style manifold, and the output adaptive weights $W$ are used to interpolate between $F_{c}$ and $F_{s \to c}$, calculated as follows.
$F_{cs} = W \odot F_{c} + \left(1 - W\right) \odot F_{s \to c}$ (9)
where ⊙ denotes element-wise multiplication of the feature maps. The structural similarity between the corresponding manifolds can be increased by this interpolation. The HDAMA module introduces dynamic convolution to enhance the expression of features. The attention module finds the correspondences between the content and style spaces, and the spatially aware interpolation module increases their structural similarity and resolves local distortions. After the feature maps output by the VGG-19 encoder enter the HDAMA module, the affinity between the corresponding manifolds of the content features and style features increases, which allows the attention module in the next HDAMA module to match semantic regions more readily. Spatially aware interpolation then further increases the structural similarity between the content and style spaces to generate high-quality images.
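A minimal PyTorch sketch of the attention rearrangement (Equations (6) and (7)) and the spatially aware interpolation (Equations (8) and (9)) is given below. The 1 × 1 convolution embeddings, the kernel scales (1, 3, 5), and the sigmoid used to bound the adaptive weights are assumptions in the spirit of SANet [10] and PAMA [15], not the exact modules of this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_variance_norm(feat, eps=1e-5):
    # Normalize each channel of a (B, C, H, W) map to zero mean and unit variance.
    B, C = feat.shape[:2]
    flat = feat.view(B, C, -1)
    mean = flat.mean(dim=2).view(B, C, 1, 1)
    std = flat.std(dim=2).view(B, C, 1, 1) + eps
    return (feat - mean) / std

class AttentionRearrange(nn.Module):
    """Sketch of Equations (6)-(7): style features are spatially rearranged
    according to their similarity with the content features."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Conv2d(dim, dim, 1)   # embedding of normalized content features
        self.g = nn.Conv2d(dim, dim, 1)   # embedding of normalized style features
        self.h = nn.Conv2d(dim, dim, 1)   # embedding of raw style features

    def forward(self, Fc, Fs):
        B, C, H, W = Fc.shape
        q = self.f(mean_variance_norm(Fc)).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.g(mean_variance_norm(Fs)).flatten(2)                   # (B, C, HsWs)
        v = self.h(Fs).flatten(2).transpose(1, 2)                       # (B, HsWs, C)
        A = F.softmax(q @ k, dim=-1)                                    # Equation (6)
        return (A @ v).transpose(1, 2).reshape(B, C, H, W)              # Equation (7)

class SpaceAwareInterpolation(nn.Module):
    """Sketch of Equations (8)-(9): adaptive weights W from multi-scale
    convolutions over the channel-concatenated features, then interpolation."""
    def __init__(self, dim, scales=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(2 * dim, dim, k, padding=k // 2) for k in scales])

    def forward(self, Fc, Fs2c):
        cat = torch.cat([Fc, Fs2c], dim=1)                           # channel concatenation
        W = torch.sigmoid(sum(conv(cat) for conv in self.convs))     # Equation (8), bounded to [0, 1]
        return W * Fc + (1 - W) * Fs2c                               # Equation (9)
```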
3.3. Loss Function
The loss function used in this research consists of the VGG reconstruction loss, content loss, and style loss, where the style loss includes the relaxed earth mover distance (REMD) loss, moment matching loss, differentiable color histogram loss, and total variation loss. The total loss function is represented as follows.
$L_{total} = L_{r} + \sum_{i=1}^{3}\left(\lambda_{c}^{i}\,L_{c} + \lambda_{s}^{i}\left(L_{REMD} + L_{m}\right) + \lambda_{h}^{i}\,L_{h} + \lambda_{tv}^{i}\,L_{tv}\right)$ (10)
where the superscript $i$ indexes the $i$th manifold alignment stage and the $\lambda^{i}$ terms are the corresponding stage weights; $L_{r}$ is the VGG reconstruction loss, $L_{c}$ is the content loss, and $L_{REMD}$, $L_{m}$, $L_{h}$, and $L_{tv}$ are the style losses. Except for the VGG reconstruction loss, all the loss functions are computed at each manifold alignment stage. The VGG reconstruction loss [15] constrains all the features to the VGG space and is calculated as follows.
$L_{r} = \left\|I_{cc} - I_{c}\right\|_{2} + \left\|I_{ss} - I_{s}\right\|_{2} + \lambda\sum_{i}\left(\left\|\phi_{i}(I_{cc}) - \phi_{i}(I_{c})\right\|_{2} + \left\|\phi_{i}(I_{ss}) - \phi_{i}(I_{s})\right\|_{2}\right)$ (11)
where $I_{cc}$ and $I_{ss}$ are the content and style images reconstructed from the VGG features, and $\lambda$ is a constant weight. $\phi_{i}(I)$ refers to the features extracted from image $I$ by the $i$th layer of the VGG-19 encoder. The VGG reconstruction loss makes the decoder reconstruct the VGG features, so all features between the VGG encoder and decoder are restricted to the VGG space. The content loss [15] is a structural self-similarity loss between the content features of the stylized image and those of the content image in the VGG space, preserving the manifold structure.
$L_{c} = \frac{1}{N^{2}}\sum_{i,j}\left|\frac{D_{ij}^{cs}}{\sum_{i}D_{ij}^{cs}} - \frac{D_{ij}^{c}}{\sum_{i}D_{ij}^{c}}\right|$ (12)
where $D^{cs}$ and $D^{c}$ are the pairwise cosine distance matrices of $F_{cs}$ and $F_{c}$, respectively. Introducing the REMD loss [37] on the style feature manifold optimizes that manifold so that the style and content manifolds can be aligned more effectively. The REMD loss is calculated as follows.
$L_{REMD} = \max\left(\frac{1}{n}\sum_{i}\min_{j}C_{ij},\ \frac{1}{m}\sum_{j}\min_{i}C_{ij}\right)$ (13)
where $C$ denotes the pairwise cosine distance matrix between $F_{cs}$ and $F_{s}$. The moment matching loss [15] is used to regularize the magnitude of the features and is calculated as follows.
$L_{m} = \left\|\mu_{cs} - \mu_{s}\right\|_{1} + \left\|\Sigma_{cs} - \Sigma_{s}\right\|_{1}$ (14)
where $\mu$ and $\Sigma$ are the mean and covariance matrices of the feature vectors. To reduce the problem of color mixing, a differentiable color histogram loss [15,38] is introduced, calculated as follows.
$L_{h} = \frac{1}{\sqrt{2}}\left\|\sqrt{H_{cs}} - \sqrt{H_{s}}\right\|_{2}$ (15)
where $H$ denotes the color histogram feature and $\sqrt{\cdot}$ denotes the element-wise square root. The total variation loss [39] smooths the image to eliminate noise and is calculated as follows.
$L_{tv} = \lambda\sum_{i,j}\left(\left|x_{i+1,j} - x_{i,j}\right| + \left|x_{i,j+1} - x_{i,j}\right|\right)$ (16)
where $x_{i,j}$ is the pixel at coordinate $(i, j)$, $x_{i+1,j}$ is the vertically adjacent pixel, $x_{i,j+1}$ is the horizontally adjacent pixel, and $\lambda$ is a weight, set to 1 in all experiments. The VGG reconstruction loss maintains a shared space in which the content and style manifolds are aligned. The content and style losses help the respective manifolds generate better images during alignment. The total variation loss effectively reduces the noise of the generated image, making the stylized image clearer and more visually appealing.
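For illustration, the following sketch implements the REMD loss (Equation (13)), the moment matching loss (Equation (14)), and the total variation loss (Equation (16)); it operates on feature maps flattened to an (N, C) matrix of feature vectors, and the mean reductions and other normalization details are assumptions rather than the paper’s reference implementation.

```python
import torch
import torch.nn.functional as F

def cosine_distance_matrix(a, b):
    # a: (N, C), b: (M, C) feature vectors; returns the pairwise cosine distance matrix.
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    return 1.0 - a @ b.t()

def remd_loss(f_cs, f_s):
    """Relaxed earth mover distance, Equation (13): the larger of the two
    one-sided nearest-neighbour matching costs."""
    C = cosine_distance_matrix(f_cs, f_s)
    return torch.max(C.min(dim=1).values.mean(), C.min(dim=0).values.mean())

def moment_matching_loss(f_cs, f_s):
    """Moment matching, Equation (14): L1 distance between the means and
    covariance matrices of the feature vectors."""
    mu_cs, mu_s = f_cs.mean(dim=0), f_s.mean(dim=0)
    cov_cs = (f_cs - mu_cs).t() @ (f_cs - mu_cs) / f_cs.size(0)
    cov_s = (f_s - mu_s).t() @ (f_s - mu_s) / f_s.size(0)
    return (mu_cs - mu_s).abs().mean() + (cov_cs - cov_s).abs().mean()

def total_variation_loss(img, weight=1.0):
    """Total variation, Equation (16): penalize differences between vertically
    and horizontally adjacent pixels of a (B, C, H, W) image to suppress noise."""
    dv = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dh = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return weight * (dv + dh)
```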
4. Experiment
4.1. Experimental Environment and Parameter Setup
The experimental hardware comprised an Intel Core™ i9-12900KF CPU @ 3.20 GHz, 32 GB of RAM, and an NVIDIA GeForce RTX 3090 Ti GPU with 24 GB of memory. The operating system was 64-bit Windows 11, running the deep learning framework PyTorch 1.12.0 with the PyCharm editor.
MS-COCO [40] and WikiArt [41] were chosen as the content and style datasets, respectively, each with approximately 80,000 images. Both the encoder and decoder used the VGG-19 [2] network, pre-trained on the ImageNet [42] dataset, with the encoder and decoder having symmetrical structures. The content loss, REMD loss, and moment matching loss were computed using features extracted from three layers of the VGG-19 encoder. The content loss weights for the three manifold alignment stages were set to 12, 9, and 7, respectively. All weights of the REMD loss and moment matching loss were set to 2. The differentiable color histogram loss weights for the three stages were set to 0.25, 0.5, and 1, respectively, and all weights of the total variation loss were set to 0.1. Adam [43] was used as the optimizer with a learning rate of 0.0001, a batch size of 8, and a total of 160,000 iterations. The content and style images were resized to 512 × 512 and then randomly cropped to 256 × 256.
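A minimal sketch of this training configuration is shown below, assuming the datasets are stored as plain image folders; the dataset paths, the stand-in model, and the placeholder loss are hypothetical, while the resize/crop sizes, batch size, optimizer, learning rate, and iteration count follow the settings above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Preprocessing as described: resize to 512 x 512, then random crop to 256 x 256.
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

# "path/to/coco" and "path/to/wikiart" are placeholder directories for the
# MS-COCO and WikiArt images; ImageFolder is one convenient way to read them.
content_loader = DataLoader(datasets.ImageFolder("path/to/coco", transform=transform),
                            batch_size=8, shuffle=True, drop_last=True)
style_loader = DataLoader(datasets.ImageFolder("path/to/wikiart", transform=transform),
                          batch_size=8, shuffle=True, drop_last=True)

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the encoder / 3x HDAMA / decoder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# In practice the loaders are re-iterated until 160,000 training steps are reached.
max_iters = 160_000
for step, ((content, _), (style, _)) in enumerate(zip(content_loader, style_loader)):
    if step >= max_iters:
        break
    optimizer.zero_grad()
    stylized = model(content)       # placeholder forward pass
    loss = stylized.mean()          # placeholder; the full objective is Equation (10)
    loss.backward()
    optimizer.step()
```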
4.2. Results
4.2.1. Objective Evaluation of Results
In order to verify the effectiveness of the proposed method, the model was compared with PAMA [15], S2WAT [13], StyTr-2 [12], and AesUST [11]. Transfer efficiency was assessed first; the time required to generate a single stylized image at two different sizes, 256 × 256 and 512 × 512, is shown in Table 1.
According to Table 1, compared with S2WAT, StyTr-2, and AesUST, the transfer efficiency of our model for a single stylized image has only a slight advantage, but the advantage becomes significant when generating many stylized images. Compared with PAMA, our method is slower, but the generated stylized images show better stylization and less detail loss.
Stylization performance is mainly determined by content preservation and style compatibility. Content preservation refers to the extent to which the semantic content from the content image is preserved in the stylized image. Style compatibility refers to the degree of similarity between the generated image style and the target image style. The ArtFID [44] evaluation metric combines both, calculated as follows.
$\mathrm{ArtFID} = \left(1 + \mathrm{LPIPS}\left(I_{c}, I_{cs}\right)\right)\cdot\left(1 + \mathrm{FID}\left(I_{s}, I_{cs}\right)\right)$ (17)
where $I_{c}$ is the content image, $I_{s}$ is the style image, $I_{cs}$ is the stylized image generated by the model, $\mathrm{LPIPS}(\cdot,\cdot)$ is the distance calculated using the LPIPS metric [45], and $\mathrm{FID}(\cdot,\cdot)$ is the result of the FID evaluation metric [46]. As with FID, the lower the ArtFID value, the better the result. To further verify the effectiveness of this research, the ArtFID metric, as well as the content and style losses [7], was used to compare our method with PAMA, S2WAT, StyTr-2, and AesUST. Sixteen content images and sixteen style images were taken from the MS-COCO and WikiArt datasets, respectively, generating a total of 256 stylized images. The objective evaluation results are shown in Table 2. Our method obtained the best results for ArtFID and content loss, while its style loss was slightly higher than that of S2WAT, indicating that our method performs well in both content preservation and stylization.
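Assuming the per-pair LPIPS distances [45] and the FID value [46] have already been computed with their standard implementations, the ArtFID combination of Equation (17) reduces to the following sketch.

```python
def artfid(lpips_scores, fid_value):
    """Equation (17): combine content preservation (mean LPIPS between each
    content image and its stylized result) with style fidelity (FID between
    the style images and the stylized results). Lower is better.
    lpips_scores: list of per-pair LPIPS distances; fid_value: scalar FID."""
    mean_lpips = sum(lpips_scores) / len(lpips_scores)
    return (1.0 + mean_lpips) * (1.0 + fid_value)

# Illustrative call with made-up numbers, not values from the paper:
# artfid([0.55, 0.61, 0.58], 21.3)
```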
4.2.2. Subjective Evaluation Results
Since the evaluation of image style transfer is rather subjective, the objective evaluation outcomes have limited value, so the proposed method was also qualitatively compared with PAMA, S2WAT, StyTr-2, and AesUST, as shown in Figure 6. S2WAT and StyTr-2 use transformer-based approaches with many parameters; the resulting stylized images show poor content detail and style transformation, as seen in the fourth and fifth columns of the third row of Figure 6, where unnecessary textures appear. AesUST uses aesthetic channels and an aesthetic discriminator for image style transfer; however, the stylized images it generates are uniform in color, as shown in the third row and sixth column of Figure 6, weakening the intended style, and some detail is lost, as shown by the billboard in the fourth row and sixth column of Figure 6. PAMA also adopts manifold alignment for image style transfer, but its stylized image is uniform in color, as shown by the sky in the first row and third column of Figure 6, and its local stylization is imperfect. In the fourth row, fourth and sixth columns of Figure 6, the style dictates that the images should be black-and-white pencil drawings, yet both S2WAT and AesUST retain the original color of the character’s face, indicating insufficient stylization. In contrast, the stylized images generated by our method perform better in terms of content preservation and stylization; for example, the first row and seventh column of Figure 6 shows the square-shaped sky in our stylized image, and the fourth row and seventh column of Figure 6 shows more detail on the character’s face compared with the fourth row and third column of Figure 6.
4.3. Ablation Experiments
4.3.1. Validation of the Three HDAMA Modules
In order to verify the effectiveness of using three HDAMA modules, we tested configurations with one HDAMA module and with two HDAMA modules, as shown in Figure 7. When using one HDAMA module, the content loss weight was set to 12 and the color histogram loss weight to 0.25. When using two HDAMA modules, the content loss weights were set to 12 and 9 and the color histogram loss weights to 0.25 and 0.5, respectively; the remaining parameters were set as in the original experiment. From Figure 7a, it can be seen that with one HDAMA module the stylization effect is mediocre and does not show the square pattern of the style image. With two HDAMA modules, a faint square-like style is partially produced, as shown in Figure 7b. With three HDAMA modules, a more apparent square-like style appears, as shown in Figure 7c. This demonstrates that the stylization effect is better when using three HDAMA modules.
4.3.2. Halo Attention Ablation Experiment
To verify the effectiveness of the halo attention module, all halo attention modules in the network were removed for the ablation experiment, as shown in Figure 8. It can be seen that the method without the halo attention module failed to reproduce the same degree of line thickness in the stylized image as the method with the halo attention module, as seen when comparing the third and fourth columns of the first row of Figure 8. Furthermore, the color intensity of the stylized image was also superior with the halo attention module, as seen when comparing the third and fourth columns of the second row of Figure 8. The method without the halo attention module also generated a small aperture artifact, as shown by the fire hydrant in the third column of the third row of Figure 8. It can be seen that the halo attention module is an important part of the proposed network.
4.3.3. Dynamic Convolutional Ablation Experiment
To verify the effectiveness of dynamic convolution, all dynamic convolutions in the network were replaced with ordinary convolutions for the ablation experiment, with all other settings unchanged, as shown in Figure 9. It can be seen that the images generated after replacing dynamic convolution with ordinary convolution appear blurred in the middle, as shown in the first, second, and third rows of the third column of Figure 9. This shows that dynamic convolution is an important part of the proposed network.
5. Conclusions
In this research, we proposed an arbitrary image style transfer algorithm based on halo attention dynamic convolution manifold alignment. Halo attention and dynamic convolution are used to extract image features; attention operations and spatially aware interpolation align the content and style feature spaces; and dynamic convolution and halo attention extract features again for the output. During manifold alignment, a multi-level loss function with total variation loss is used to eliminate image noise. The manifold alignment process is repeated three times, and the stylized image is finally output by the VGG decoder. The experimental results show that our proposed method can produce high-quality stylized images, achieving values of 33.861, 2.516, and 3.602 for ArtFID, style loss, and content loss, respectively, and comparing favorably with existing methods. However, there are still occasional problems concerning the loss of content detail and the relatively large number of model parameters. In future work, we will optimize the model and the attention mechanism to improve its performance and make it lighter.
Conceptualization, K.L., D.Y. and Y.M.; methodology, K.L., D.Y. and Y.M.; software, K.L. and D.Y.; validation, K.L.; formal analysis, K.L.; investigation, K.L.; resources, D.Y. and Y.M.; data curation, K.L. and D.Y.; writing—original draft preparation, K.L.; writing—review and editing, K.L.; supervision, D.Y. and Y.M.; project administration, D.Y. and Y.M.; funding acquisition, D.Y. and Y.M. All authors have read and agreed to the published version of the manuscript.
The datasets used are available from the corresponding author upon reasonable request.
The authors declare no conflict of interest.
Figure 1. Stylization comparison of our method with PAMA [15], AesUST [11], StyTr-2 [12], and S2WAT [13]. The style image has a square, block-like texture; therefore, the stylized image generated by the model should also exhibit this square pattern.
Figure 2. The overall structure is an encoder–decoder architecture: the content and style images are input and their features are extracted by the VGG encoder; manifold alignment is then performed by the three HDAMA modules; and finally the stylized image is generated by the VGG decoder.
Figure 3. HDAMA block structure. The input feature maps $F_c$ and $F_s$ are processed by the halo attention and dynamic convolution modules; manifold alignment is then performed, and the features are finally output by the dynamic convolution and halo attention modules.
Figure 5. Halo attention structure. The input is divided into four blocks of the same size. The queries are computed from these blocks, and n layers of halo are added outside each block by the halo operation (n is the halo value). An attention operation is then performed on the sampled information, downsampling it, and the feature map is finally output.
Figure 7. The effectiveness of using three HDAMA modules. (a) One HDAMA module; (b) two HDAMA modules; (c) three HDAMA modules.
Table 1. Transfer efficiency.
Method | Time (s), 256 × 256 | Time (s), 512 × 512
---|---|---
PAMA | 0.0140 | 0.0159
S2WAT | 0.0180 | 0.0220
StyTr-2 | 0.0600 | 0.2516
AesUST | 0.0130 | 0.0350
Ours | 0.0149 | 0.0179
Table 2. Objective evaluation of results.
Method | ArtFID | Style Loss | Content Loss
---|---|---|---
PAMA | 35.869 | 2.705 | 3.609
S2WAT | 40.707 | **2.442** | 3.610
StyTr-2 | 40.993 | 2.639 | 3.613
AesUST | 41.157 | 2.546 | 3.737
Ours | **33.861** | 2.516 | **3.602**
Bold indicates the best result. The lower each evaluation metric value, the better the effect.
References
1. Gatys, L.A.; Ecker, A.S.; Bethge, M. A neural algorithm of artistic style. arXiv; 2015; arXiv: 1508.06576[DOI: https://dx.doi.org/10.1167/16.12.326]
2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv; 2014; arXiv: 1409.1556
3. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 6924-6932.
4. Kotovenko, D.; Sanakoyeu, A.; Lang, S.; Ommer, B. Content and style disentanglement for artistic style transfer. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, South Korea, 27 October–2 November 2019; pp. 4422-4431.
5. Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; Yang, M.H. Diversified texture synthesis with feed-forward networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 3920-3928.
6. Wang, X.; Oxholm, G.; Zhang, D.; Wang, Y.F. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 5239-5247.
7. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 1501-1510.
8. Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; Yang, M.H. Universal style transfer via feature transforms. arXiv; 2017; arXiv: 1705.08086
9. Sheng, L.; Lin, Z.; Shao, J.; Wang, X. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 8242-8250.
10. Park, D.Y.; Lee, K.H. Arbitrary style transfer with style-attentional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 5880-5888.
11. Wang, Z.; Zhang, Z.; Zhao, L.; Zuo, Z.; Li, A.; Xing, W.; Lu, D. AesUST: Towards aesthetic-enhanced universal style transfer. Proceedings of the 30th ACM International Conference on Multimedia; Lisboa, Portugal, 10–14 October 2022; pp. 1095-1106.
12. Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. Stytr2: Image style transfer with transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 11326-11336.
13. Zhang, C.; Yang, J.; Wang, L.; Dai, Z. S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention. arXiv; 2022; arXiv: 2210.12381
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv; 2017; arXiv: 1706.03762
15. Luo, X.; Han, Z.; Yang, L.; Zhang, L. Consistent style transfer. arXiv; 2022; arXiv: 2201.02233
16. Kim, Y.H.; Nam, S.H.; Hong, S.B.; Park, K.R. GRA-GAN: Generative adversarial network for image style transfer of Gender, Race, and age. Expert Syst. Appl.; 2022; 198, 116792. [DOI: https://dx.doi.org/10.1016/j.eswa.2022.116792]
17. Li, R.; Wu, C.H.; Liu, S.; Wang, J.; Wang, G.; Liu, G.; Zeng, B. SDP-GAN: Saliency detail preservation generative adversarial networks for high perceptual quality style transfer. IEEE Trans. Image Process.; 2020; 30, pp. 374-385. [DOI: https://dx.doi.org/10.1109/TIP.2020.3036754] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33186111]
18. Lin, C.T.; Huang, S.W.; Wu, Y.Y.; Lai, S.H. GAN-based day-to-night image style transfer for nighttime vehicle detection. IEEE Trans. Intell. Transp. Syst.; 2020; 22, pp. 951-963. [DOI: https://dx.doi.org/10.1109/TITS.2019.2961679]
19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. arXiv; 2014; arXiv: 1406.2661
20. Liu, S.; Lin, T.; He, D.; Li, F.; Wang, M.; Li, X.; Sun, Z.; Li, Q.; Ding, E. Adaattn: Revisit attention mechanism in arbitrary neural style transfer. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, BC, Canada, 11–17 October 2021; pp. 6649-6658.
21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv; 2020; arXiv: 2010.11929
22. Chen, H.; Wang, Z.; Zhang, H.; Zuo, Z.; Li, A.; Xing, W.; Lu, D. Artistic style transfer with internal-external learning and contrastive learning. Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 26561-26573.
23. Zhang, Y.; Tang, F.; Dong, W.; Huang, H.; Ma, C.; Lee, T.Y.; Xu, C. Domain enhanced arbitrary image style transfer via contrastive learning. Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings; Vancouver, BC, Canada, 7–11 August 2022; pp. 1-8.
24. Wu, Z.; Zhu, Z.; Du, J.; Bai, X. CCPL: Contrastive Coherence Preserving Loss for Versatile Style Transfer. Proceedings of the Computer Vision–ECCV 2022: 17th European Conference; Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XVI Springer: Berlin/Heidelberg, Germany, 2022; pp. 189-206.
25. Huo, J.; Jin, S.; Li, W.; Wu, J.; Lai, Y.K.; Shi, Y.; Gao, Y. Manifold alignment for semantically aligned style transfer. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, BC, Canada, 11–17 October 2021; pp. 14861-14869.
26. Lu, C.; Hu, F.; Cao, D.; Gong, J.; Xing, Y.; Li, Z. Transfer learning for driver model adaptation in lane-changing scenarios using manifold alignment. IEEE Trans. Intell. Transp. Syst.; 2019; 21, pp. 3281-3293. [DOI: https://dx.doi.org/10.1109/TITS.2019.2925510]
27. Pei, Y.; Huang, F.; Shi, F.; Zha, H. Unsupervised image matching based on manifold alignment. IEEE Trans. Pattern Anal. Mach. Intell.; 2011; 34, pp. 1658-1664.
28. Cui, Z.; Chang, H.; Shan, S.; Chen, X. Generalized unsupervised manifold alignment. Adv. Neural Inf. Process. Syst.; 2014; 27.
29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132-7141.
30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.
31. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seoul, South Korea, 27 October–2 November 2019; pp. 510-519.
32. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 13–19 June 2020; pp. 11534-11542.
33. Xiao, Z.; Xu, X.; Xing, H.; Luo, S.; Dai, P.; Zhan, D. RTFN: A robust temporal feature network for time series classification. Inf. Sci.; 2021; 571, pp. 65-86. [DOI: https://dx.doi.org/10.1016/j.ins.2021.04.053]
34. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 13–19 June 2020; pp. 11030-11039.
35. Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; Shlens, J. Scaling local self-attention for parameter efficient visual backbones. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Montreal, BC, Canada, 11–17 October 2021; pp. 12894-12904.
36. Deng, Y.; Tang, F.; Dong, W.; Sun, W.; Huang, F.; Xu, C. Arbitrary style transfer via multi-adaptation network. Proceedings of the 28th ACM International Conference on Multimedia; Seattle, WA, USA, 12–16 October 2020; pp. 2719-2727.
37. Kolkin, N.; Salavon, J.; Shakhnarovich, G. Style transfer by relaxed optimal transport and self-similarity. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10051-10060.
38. Afifi, M.; Brubaker, M.A.; Brown, M.S. Histogan: Controlling colors of gan-generated and real images via color histograms. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Montreal, BC, Canada, 11–17 October 2021; pp. 7941-7950.
39. Rudin, L.I.; Osher, S.; Fatemi, E. Nonlinear total variation based noise removal algorithms. Phys. Nonlinear Phenom.; 1992; 60, pp. 259-268. [DOI: https://dx.doi.org/10.1016/0167-2789(92)90242-F]
40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. Proceedings of the Computer Vision–ECCV 2014: 13th European Conference; Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13 Springer: Berlin/Heidelberg, Germany, 2014; pp. 740-755.
41. Phillips, F.; Mackintosh, B. Wiki Art Gallery, Inc.: A case for critical thinking. Issues Account. Educ.; 2011; 26, pp. 593-608. [DOI: https://dx.doi.org/10.2308/iace-50038]
42. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; Miami, FL, USA, 20–25 June 2009; pp. 248-255.
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv; 2014; arXiv: 1412.6980
44. Wright, M.; Ommer, B. Artfid: Quantitative evaluation of neural style transfer. Proceedings of the Pattern Recognition: 44th DAGM German Conference, DAGM GCPR 2022; Konstanz, Germany, 27–30 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 560-576.
45. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 586-595.
46. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv; 2017; arXiv: 1706.08500
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The objective of image style transfer is to render an image with the artistic features of a style reference while preserving the details of the content image. With the development of deep learning, many arbitrary style transfer methods have emerged. Recent arbitrary style transfer algorithms, however, suffer from poor stylization quality in the generated images. To solve this problem, we propose an arbitrary style transfer algorithm based on halo attention dynamic convolution manifold alignment. First, the features of the content image and style image are extracted by a pre-trained VGG encoder. Then, the features are further extracted by halo attention and dynamic convolution, and the content feature space and style feature space are aligned by attention operations and spatially aware interpolation. The output is produced through dynamic convolution and halo attention. During this process, multi-level loss functions are used, and total variation loss is introduced to eliminate noise. The manifold alignment process is repeated three times. Finally, the pre-trained VGG decoder is used to output the stylized image. The experimental results show that our proposed method can generate high-quality stylized images, achieving values of 33.861, 2.516, and 3.602 for ArtFID, style loss, and content loss, respectively. A qualitative comparison with existing algorithms showed that it achieves good results. In future work, we will aim to make the model lightweight.