Meta Network for Flow-Based Image Style Transfer

Full text

Turn on search term navigation

1. Introduction

Image style transfer [1,2,3,4,5,6,7,8,9] refers to the process of incorporating the style of a reference image (or style image) into a content image, with applications in digital art, design, and image generation. Over the years, this field has undergone significant advancements, transitioning from traditional image processing techniques to the adoption of deep learning methodologies.

Early image style transfer methods relied on digital signal processing techniques, such as Fourier transforms or image segmentation, to blend style and content. However, these approaches struggled to capture fine stylistic details and heavily depended on manually designed features, making them ill-suited for diverse style requirements. In 2015, Gatys et al. [10] introduced the first convolutional neural network (CNN)-based style transfer method, leveraging pre-trained VGG networks to extract the content and style features and optimizing the generated image via a gradient descent. Although this approach produced high-quality results, its computational cost rendered it unsuitable for real-time applications.

To address this limitation, Johnson et al. [11] proposed a fast feed-forward network in 2016, enabling stylized image generation with a single forward pass through a pre-trained generative network. However, this method required training a separate model for each style, limiting its flexibility and generalization capability. Subsequent research aimed to overcome these constraints by enabling multi-style and arbitrary style transfer. The notable contributions have included Adaptive Instance Normalization (AdaIN) [12], which dynamically adjusts the normalization parameters (scale and bias) to match the input style features, and Conditional Instance Normalization (CIN) [13], which utilizes style labels to control the normalization parameters for seamless multi-style switching. Other advancements have included Universal Style Transfer (UST) [14], which leverages mathematical operations in feature space, such as Principle Component Analysis (PCA) and Whitening and Coloring Transform (WCT), to achieve generalized content–style fusion; and StyleBank [15], which designs unique convolutional kernels for each style and supports arbitrary styles through linear combinations. Other techniques, such as Avatar-Net [16] and Linear Style Transfer [17], have further improved the detail reconstruction and computational efficiency. However, most style transfer methods still suffer from content leakage, where the generated image fails to preserve the structural characteristics of the original content image, such as scene layout, object boundaries, or details. This issue often arises due to an insufficient disentanglement of the content and style features, overly simplistic feature fusion mechanisms, or generative networks that excessively favor the style features.

To address these challenges, recent works have incorporated attention mechanisms, such as Self-Attention [18] and Cross-Attention [19], to enhance the weighting of the content features. Others have employed adversarial learning [20] to ensure the balanced fusion of content and style or have adopted flow-based models [21,22] to improve the content preservation and detail reconstruction. For example, ArtFlow [21] built a reversible feature extraction framework inspired by the Glow model [23] to prevent content leakage. However, its restrictive inverse computation often resulted in artifacts like checkerboard patterns. Hierarchy Flow [22], introduced by W. Fan et al., achieved greater flexibility by introducing hierarchical feature interactions during the transformation process, effectively resolving artifacts and preserving content details.

Recent style transfer methods have typically required training an image transfer network for each new style, with the style information embedded in the network parameters through numerous iterations of stochastic gradient descent. To address these limitations, meta learning [24,25]—originally introduced to enable models to quickly adapt to new tasks using only a few examples—has emerged as a promising paradigm for improving generalization and adaptability. Inspired by this, F. Shen et al. [26] proposed a meta network that takes a style image as the input and directly generates a corresponding image transformation network, bypassing the need for retraining. While this approach enables adaptive style transfer, its single-pass feed-forward design lacks the content-preserving capabilities of flow-based architectures. This study aims to develop a meta network that enhances the adaptability of flow-based style transfer by generating a model tailored to a given style image. The proposed framework consists of two key components: (1) a modified version of the Hierarchical Flow model, termed Randomized Hierarchical Flow (RH Flow), which introduces a random permutation of the feature sub-blocks before the hierarchical coupling layer to enhance the feature interaction diversity and flexibility; and (2) a meta network that generates the RH Flow parameters, enabling both effective content preservation and flexible style adaptation. By integrating these innovations, our approach enhances the balance between content fidelity and stylistic expressiveness, offering significant advancements for applications in digital art, design, and image generation. The key contributions of this paper include (1) the design of the Randomized Hierarchy Flow for enhanced content preservation, (2) the development of a meta network that generates adaptive transformation parameters without retraining, and (3) comprehensive experimental validation demonstrating its superior performance and parameter efficiency.

2. Related Works

The field of image style transfer has evolved significantly from traditional methods to advanced deep learning approaches. This section provides a concise overview of the flow-based models and meta-learning techniques.

2.1. Flow-Based Style Transfer Models

Flow-based models, particularly Glow [23], offer unique advantages due to their reversible design. The Glow model’s components—Activation Normalization (ActNorm), an invertible 1 × 1 convolution, and affine coupling layers—ensure high fidelity in data reconstruction and flexible manipulation of the latent features, as shown in Figure 1.

ArtFlow [21] was the first to adopt Glow for style transfer, leveraging its reversibility for content preservation. However, ArtFlow’s reliance on multi-scale squeezing operations led to artifacts, such as checkerboard patterns. To overcome these issues, Hierarchy Flow [22] introduced structural improvements, including hierarchical coupling layers and an aligned-style loss function. These enhancements ensured robust content preservation and style representation.

2.2. Meta Learning for Style Transfer

Meta-learning (or learning to learn) originated from the concept of improving learning efficiency and adaptability [27,28]. Recently, it has gained significant attention for its ability to enhance the speed of learning and generalization to new tasks. A gradient-based learning method can be expressed using Equation (1) [29], where the model parameters θ are updated based on the gradient of a loss function ℓ with parameters ϕ. The update employs a gradient transformation function h with parameters ψ to compute new model parameters:

(1) $θ_{n e w} = h_{ψ} (θ, \nabla_{θ} L_{ϕ} (y, f_{θ} (x)))$

Meta-learning research can be categorized into the following: (1) learning model parameters that adapt easily to new tasks [30], (2) learning optimizer strategies based on reward functions derived from loss or parameter updates [31], (3) learning representations of loss or reward functions [32], and (4) discovering transferable unsupervised rules in task-agnostic environments [33].

The current multi-style and arbitrary style transfer techniques have addressed the need to retrain networks for each style, improving the diversity and quality of the generated images. However, these methods are limited by fixed feature spaces and struggle to adapt to entirely new styles. Additionally, their performance is constrained when handling extreme styles or real-time applications. To address these challenges, F. Shen et al. [26] proposed a meta network that dynamically generates the parameters for style transfer networks based on the input style images. This approach eliminates the need for retraining networks for each style through a single forward pass, significantly enhancing efficiency and flexibility.

3. Meta Model for Flow-Based Image Style Transfer

This study proposes a flow-based style transfer meta network, Meta model for Flow-based Image Style Transfer (MFIST), which focuses on generating a flexible and adaptable flow-based image style transfer system. The research methods and procedures are divided into two main components. (1) Improving the Hierarchical Flow Model: Building upon the Hierarchical Flow method proposed by W. Fan et al. [22], we introduce an improved version, called Randomized Hierarchy Flow (RH Flow). Before the hierarchical coupling layer, we apply random permutation to the split feature blocks, dynamically altering the coupling order. This enhancement aims to increase the diversity of feature interactions, further improving the model’s expressiveness and flexibility in style transfer. Additionally, we propose a lightweight Style Encoder to replace the Style Net used in the original work. The Style Encoder processes the initial features extracted from a pre-trained VGG network and generates AdaIN parameters for style transfer. This design significantly reduces the number of parameters while maintaining accuracy and efficiency in style feature extraction. These improvements collectively ensure that the model achieves both efficiency and flexibility during style transfer. (2) Constructing a Flow-Based Reversible Style Transfer Architecture with Meta-Learning: The parameters of the Randomized Hierarchy Flow model are not obtained through conventional training but are instead generated by a meta network trained for this purpose. Inspired by the meta-learning network proposed by F. Shen et al. [26], this study designs a meta network to discover optimal parameters for the Randomized Hierarchy Flow model, rather than relying on traditional feed-forward training. This reversible architecture effectively preserves content features while enabling accurate style transfer, providing a flexible solution for real-time and diverse style transfer applications.

3.1. Randomized Hierarchical Flow Model

As illustrated in Figure 2, the proposed Randomized Hierarchy Flow (RH Flow) is composed of multiple reversible randomized hierarchical coupling (RHC) layers. Given a content image I_c and a style image I_s, the processing is as follows: During the forward pass (indicated by the red arrows), the RHC layers encode the content image I_c to extract multi-level hierarchical content features. Simultaneously, the style image I_s is processed through a pre-trained VGG network to extract style features. These features are further encoded by the Style Encoder to generate the style feature statistics (mean and standard deviation), as indicated by the green arrows. These content and style features are fused using the Adaptive Instance Normalization (AdaIN) module (depicted by the merging arrows) to incorporate style information into the content representation. In the backward pass (blue arrows), the network progressively reconstructs the stylized image I_cs from the fused features through invertible RHC layers. The RH Flow model is fully reversible, enabling lossless reconstruction during style transfer.

Randomized Hierarchical Coupling Layers

Unlike the ArtFlow model, which requires spatial compression, the Hierarchical Flow model applies hierarchical subtraction along the channel dimension to enable learnable spatial feature fusion and transformation. As illustrated in Figure 3, during the forward pass (top part of the figure), the Affine-Net first applies an affine transformation to the input content feature, followed by splitting into N sub-tensors. These sub-tensors are then randomly permuted (indicated by arrows with “random permute”) and hierarchically fused through subtractive coupling over T steps. The final output is obtained by concatenating all intermediate results (purple arrow). In the reverse pass (bottom part of the figure), the stylized feature is split again into N sub-tensors. Using the style statistics (μ, σ) obtained from the Style Encoder via AdaIN, weighted additive coupling (shown by the ⊕ operations) is applied sequentially to progressively inject style information into the content features. This process reconstructs the final stylized image. All intermediate operations and data flows are reversible, ensuring that the RH Flow architecture preserves content structure while achieving flexible style adaptation.

Forward Pass

As illustrated in Algorithm 1, given an input tensor x with dimensions H × W × C, the Affine-Net, parameterized by $θ_{a}$ , performs affine transformations, expanding the tensor along the channel dimension to H × W × NC. The expanded tensor a is then split into N sub-tensors $(a_{1}, a_{2}, \dots, a_{N})$ , each of size H × W × C, which are subsequently shuffled through a random permutation to form $(b_{1}, b_{2}, \dots, b_{N})$ . A hierarchical subtractive coupling mechanism is then iteratively applied across N steps, where each step progressively refines the intermediate features $h_{i}$ by subtracting the corresponding shuffled components. Finally, the intermediate feature maps are concatenated along the channel dimension to generate the output y.

Algorithm 1. Forward Pass

FORWARD(

x, θ_{a}

)

a

\leftarrow

Affine-Net(

x; θ_{a}

)

{(a}_{1}, a_{2}, \dots, a_{N}) \leftarrow

split

(a)

{(b}_{1}, b_{2}, \dots, b_{N}) \leftarrow

random permute

{(a}_{1}, a_{2}, \dots, a_{N})

h_{1}

\leftarrow

x - b_{1}

for

i \leftarrow 2

N

{h_{i} \leftarrow h_{i - 1} - b}_{i}

y \leftarrow

concat

{(h}_{1}, h_{2}, \dots, h_{N})

return

y, b_{1}, b_{2}, \dots, b_{N}

Reverse Pass

As shown in Algorithm 2, the reverse pass reconstructs a stylized version of the original tensor x by iteratively applying N additive coupling transformations. The process begins by normalizing the input tensor y using Adaptive Instance Normalization (AdaIN), with style statistics (μ, σ) extracted by the Style Encoder. The normalized tensor y is then split into N feature blocks $(y_{1}, y_{2}, \dots, y_{N})$ . Each block is progressively fused with the corresponding shuffled affine tensor $b_{i}$ in a backward manner, starting from the last block. To enhance feature fusion and improve adaptability during training, a learnable weight α is introduced at each step, dynamically balancing the influence of the current feature block and the accumulated transformation. The final output x is reconstructed after N steps of fusion.

Algorithm 2. Reversed Pass

REVERSED(

y, b_{1}, b_{2}, \dots, b_{N},

μ, σ

)

y

\leftarrow

AdaIN(

y, μ, σ

)

y_{1}, y_{2}, \dots, y_{N} \leftarrow

split(

y

)

h_{N} \leftarrow y_{N} + b_{N}

for

i \leftarrow N - 1

downto 1 do

h_{i} \leftarrow α \cdot (y_{i} + b_{i}) + (1 - α) {\cdot h}_{i + 1}

x \leftarrow h_{1}

return

x

RH Flow

As described in Algorithm 3, the RH Flow model performs style transfer by iteratively refining the stylized image through a randomized hierarchical transformation. The process begins with encoding the style features f_s from the style image using a pre-trained VGG network. These features are further processed by the Style Encoder, parameterized by $θ_{s}$ , to produce the adaptive style statistics (μ, σ). The input content image I_c is then progressively transformed over T iterations. During each iteration, the forward pass applies an affine transformation to the content tensor, producing an intermediate representation y along with a set of affine parameters $(b_{1}, b_{2}, \dots, b_{N})$ . The reverse pass then reconstructs a stylized version of x by fusing these components hierarchically, guided by the learnable fusion weight α. All parameters required for the RH Flow model, including those of the Affine-Net, Style Encoder, and fusion weights, are dynamically generated through the proposed meta network, ensuring efficient, adaptive, and high-fidelity style transfer.

Algorithm 3. Randomized Hierarchy Flow

RH_FLOW(

I_{c}, f_{s}, α, θ_{a}, θ_{s}

)

(μ, σ) \leftarrow

Style-Encoder(

f_{s} {; θ}_{s}

)

x \leftarrow

I_{c}

for

t \leftarrow 1

T

y, b_{1}, b_{2}, \dots, b_{N} \leftarrow

FORWARD(

x, θ_{a}

)

x \leftarrow

REVERSED(

y, b_{1}, b_{2}, \dots, b_{N},

μ, σ

)return

x

Affine-Net and Style Encoder

The Affine-Net performs affine transformations, expanding the input tensor along the channel dimension by a factor of N. We adopt the Affine-Net architecture proposed by W. Fan et al. [22]. As illustrated in Figure 4a, our current Affine-Net is a three-layer perceptron with the structure Conv-IN-ReLU → Conv-IN-ReLU → Conv-ReLU, where all convolutional layers utilize k × k kernels with a stride of 1. The first two convolutional layers double the input channel dimension C, while the final layer maps the features to the output channel dimension NC. This results in an output size of H × W × NC. For k = 3, N = 4, and C = 3, the total number of parameters in the Affine-Net is calculated as follows: (2C² + 4C² + 2NC²) × k² = 1134.

The Style Encoder generates mean (μ) and standard deviation (σ) vectors, each of dimension NC, as input into the AdaIN module. Unlike the Style Net architecture used by W. Fan et al. [22], our Style Encoder is lightweight and designed to operate on style features rather than raw style images I_s. The processing pipeline is as follows: The style image is first passed through the frozen VGG-16 network, and feature maps are extracted from four specific layers—relu1_2, relu2_2, relu3_3, and relu4_3—producing a total of 64 + 128 + 256 + 512 = 960 feature maps. These maps are subsequently processed using Mean-Std Feature Embedding (MSFE), which calculates the mean and standard deviation of each individual feature map. This results in a 1920-dimensional vector, denoted as f_s, referred to as the style feature vector. This vector f_s is subsequently passed to the Style Encoder, which produces a 2NC-dimensional output vector. For N = 4 and C = 3, this results in a 24-dimensional vector.

As illustrated in Figure 4b, the Style Encoder is a three-layer perceptron consisting of the structure Conv-BN-ReLU → Conv-BN-ReLU → Fully Connected. The 1920-dimensional style feature vector f_s is first reshaped to a 3D tensor of size 30 × 32 × 2 before being fed into the Style Encoder. Both convolutional layers employ 1 × 1 kernels with a stride of 1, producing feature maps of sizes 8 × 8 × 16 and 1 × 1 × 64, respectively. The output of the second convolutional layer is reshaped into a 64-dimensional vector and passed through a fully connected layer, which generates a 2NC-dimensional vector comprising the mean (μ) and standard deviation (σ) for AdaIN. This design significantly reduces the number of trainable parameters while ensuring efficient and accurate extraction of style features. For example, when N = 4 and C = 3, the total number of parameters in the Style Encoder is calculated as 2 × 16 + 16 × 64 + 64 × 24 = 2592 parameters.

3.2. MFIST Architecture

The proposed MFIST framework integrates a frozen VGG-16 network to extract multi-level texture features from style images. These extracted features are then processed through a meta network, which consists of fully connected layers that map them onto the parameter space of the Randomized Hierarchy (RH) Flow model. The optimization process minimizes the total loss, which is defined as a weighted sum of content and style losses, ensuring high-quality stylization. This design enables the generation of adaptive style transfer models tailored to different style images.

As illustrated in Figure 5, the proposed MFIST architecture is composed of three main parts: the VGG-16 encoder, the meta network, and the Randomized Hierarchy Flow (RH Flow) model. The left section represents the VGG-16 encoder, which extracts multi-level feature maps from the content and style images. The style image is passed through the VGG-16 encoder to extract deep features (vertical green arrows). These features are used both for computing style loss (horizontal red arrows) and for generating style representations for the meta network. The content image is processed similarly, with extracted features used to compute content loss (horizontal blue arrows) between the content and stylized outputs. The middle section shows the meta network, which processes the style features extracted by VGG-16. The 1920-dimensional style feature vector is first input into a hidden layer (green box). It is then divided into six groups feeding into six fully connected (FC) layers (orange boxes) to predict parameters for the Style Encoder and Affine-Net of the RH Flow model. The black arrows indicate the flow of style feature representations through the meta network for parameter generation. The right section houses the RH Flow model, which performs hierarchical style fusion based on the generated parameters. The RH Flow model takes the content features and fuses them with the stylized statistics through a sequence of reversible randomized hierarchical coupling (RHC) layers. Red arrows indicate the forward content encoding, while blue arrows represent the reverse reconstruction of the stylized image. Style information (green arrows) extracted and processed by the Style Encoder is injected into the RH Flow model during the reverse pass through the AdaIN modules. This structured design enables efficient and adaptive style transfer while preserving high-fidelity content structures and achieving flexible stylization. Each arrow in the figure explicitly represents a key operation: feature extraction, parameter generation, content encoding, style fusion, or loss computation.

Architecture of the Meta Network

The RH Flow architecture consists of two key sub-networks: the Affine-Net and the Style Encoder, which require 1134 and 2592 parameters, respectively. Additionally, with N = 4 learnable fusion weights (α), the total number of parameters required by RH Flow amounts to 2592 + 1134 + 4 = 3730 parameters. All these parameters are dynamically generated by the meta network. As illustrated in Figure 6, the meta network processes the 1920-dimensional feature vector extracted from the VGG-16 features of the style image (Input). This feature vector is first passed through a hidden layer with 224 output dimensions. The hidden layer output is then evenly split into seven groups, each consisting of 32 dimensions (indicated by black arrows). These groups are individually fed into seven fully connected (FC) layers: The first six FC layers each output 432 parameters, corresponding to the convolutional parameters of the Style Encoder. The last FC layer outputs 1138 parameters, corresponding to the Affine-Net and the learnable fusion weights. Specifically, among the 1138 parameters, 1134 are used by the Affine-Net, and 4 are assigned to the fusion weights (α). The black arrows in the figure represent the flow of feature information through the splitting and fully connected transformations. This grouped design offers significant advantages compared to using a single, large-scale FC layer: it reduces the total number of trainable parameters, optimizes computational resource allocation, and enhances parameter-sharing efficiency. By incorporating convolutional layers into the Style Encoder, the model benefits from improved computational efficiency and generalization, making the overall architecture lightweight and well-suited for dynamic parameter generation.

3.3. Training the Meta Network

The definitions of content loss and style loss in this study align with those used in AdaIN [12]. The content loss measures the difference between the transformed image I_cs and the content image I_c in the relu4_3 layer of the pre-trained VGG-16 encoder. After performing channel normalization, the content loss is defined as the Mean Squared Error (MSE) between the normalized features, as expressed in (2), where norm represents the channel normalization operation. The style loss evaluates the stylistic similarity between the transformed image I_cs and the style image I_s across four layers of the VGG-16 encoder (relu1_2, relu2_2, relu3_3, and relu4_3). The style loss is defined as the sum of the MSEs of the mean and standard deviation of each channel, as shown in (3), where μ and σ denote the mean and standard deviation of each channel. In (1) and (2), $ϕ_{1}$ , $ϕ_{2}$ , $ϕ_{3}$ , and $ϕ_{4}$ correspond to the feature maps of the VGG-16 encoder at layers relu1_2, relu2_2, relu3_3, and relu4_3, with their respective channel dimensions denoted as $l_{1}$ , $l_{2}$ , $l_{3}$ , and $l_{4}$ . The total loss for training the meta network is defined as the weighted sum of the content loss and style loss, as formulated in (4), where $β$ is the weighting factor that balances content loss and style loss. The meta network is trained to minimize the total loss L, allowing it to dynamically generate parameters for the RH Flow architecture. The training procedure for the meta network is described in Algorithm 4.

(2) $L_{c} \leftarrow {‖n o r m (ϕ_{4_3} (I_{c s})) - n o r m (ϕ_{4_3} (I_{c}))‖}_{2}$

(3) $L_{s} \leftarrow \sum_{i = 1}^{4} \sum_{j = 1}^{l_{i}} ({‖μ ({ϕ_{i} (I_{c s})}_{j}) - μ ({ϕ_{i} (I_{s})}_{j})‖}_{2} + {‖σ ({ϕ_{i} (I_{c s})}_{j}) - σ ({ϕ_{i} (I_{s})}_{j})‖}_{2})$

(4) $L \leftarrow {β L}_{c} + (1 - β) L_{s}$

Algorithm 4. Meta Net Training

META(

I_{c}, I_{s}, θ_{m}

)

f_{s} \leftarrow m s f e (ϕ_{1} (I_{s}), ϕ_{2} (I_{s}), ϕ_{3} (I_{s}), ϕ_{4} (I_{s}))

θ_{a}, θ_{s}, α \leftarrow

Meta Net(

f_{s} {; θ}_{m}

)

I_{c s} \leftarrow

RH_FLOW(

I_{c}, f_{s}, α, θ_{a}, θ_{s}

)

L_{c} \leftarrow {‖n o r m (ϕ_{4} (I_{c s})) - n o r m (ϕ_{4} (I_{c}))‖}_{2}

L_{s} \leftarrow

\sum_{i = 1}^{3} \sum_{j = 1}^{l_{i}} ({‖μ ({ϕ_{i} (I_{c s})}_{j}) - μ ({ϕ_{i} (I_{s})}_{j})‖}_{2} + {‖σ ({ϕ_{i} (I_{c s})}_{j}) - σ ({ϕ_{i} (I_{s})}_{j})‖}_{2})

L \leftarrow {β L}_{c} + (1 - β) L_{s}

θ_{m}

\leftarrow θ_{m}

- η \nabla_{θ_{m}} L

4. Experimental Results

We developed and evaluated the Randomized Hierarchy Flow (RH Flow) model with the parameters dynamically generated by the meta network (Meta Net). All the experiments were conducted on a server equipped with NVIDIA RTX 2080 Ti GPUs (NVIDIA Corporation, Santa Clara, CA, USA) and 256 GB of system RAM. The implementation was carried out using Python and the PyTorch deep learning framework (version 1.8). To train and evaluate the model, we utilized two widely adopted style transfer datasets: MS-COCO 2014 [34], containing a total of 123,558 real-world images; and WikiArt [35], containing 52,757 artwork images. From MS-COCO 2014, 82,783 images were randomly selected for the training, while the testing was conducted by randomly sampling images from the remaining 40,775 images. Similarly, from WikiArt, 42,129 images were used for the training, and the testing was conducted on randomly sampled images from the remaining 10,628 images. By default, all the images were resized to 300 × 400 pixels for both the training and testing.

Representative style transfer results generated by our method are shown in Figure 7 and Figure 8. Figure 9 provides a comparative analysis with state-of-the-art style transfer approaches, demonstrating that both our method and Hierarchy Flow are effective at preserving content features. The enlarged regions further highlight the superior content preservation achieved by our method, which substantially mitigates the content leakage. For comparison purposes, some of the results shown in Figure 9 are adapted from Reference [22].

We quantitatively evaluated the stylized images using the SSIM, Gram loss, and KID. Specifically, the SSIM (Structural Similarity Index Measure) evaluates the similarity between the stylized image and the original content image based on the structural information, luminance, and contrast, thus serving as an effective indicator of the content preservation. The Gram loss, computed as the Gram matrix distance [10] on the deep features extracted from a pre-trained VGG network, measures the stylistic difference between the stylized and style images, reflecting the effectiveness of the style transfer. The KID (Kernel Inception Distance) assesses the perceptual similarity between the distributions of the generated and target style images, with lower values indicating better perceptual quality and style consistency. As shown in Table 1, our model achieved the highest SSIM score and the second-lowest Gram distance, demonstrating its strong content retention and effective style adaptation. The arrows in the table headers indicate the desired direction of the metric—higher for SSIM (↑) and lower for Gram distance (↓).

As shown in Table 2, our model again attained the highest SSIM and ranked third in KID performance, while maintaining the lowest number of trainable parameters among all the compared methods. Again, the arrows in the table indicate that higher SSIM (↑) and lower KID values (↓) are preferred. These quantitative results substantiate that our proposed framework effectively preserves the critical content structures while achieving visually consistent and high-quality stylization. For a fair comparison, some of the results in these tables were adapted from References [21,22].

5. Conclusions and Discussion

In this work, we propose Meta FIST, a novel flow-based image style transfer framework that integrates Randomized Hierarchy Flow (RH Flow) with a meta network for adaptive parameter generation. The meta network dynamically produces the RH Flow parameters conditioned on the given style image, enabling more flexible and adaptive style transfer. Our approach improves the content preservation, style fidelity, and adaptability, effectively addressing the key limitations of traditional style transfer methods. The experimental results confirm that Meta FIST delivers high-quality stylization while maintaining the structural integrity of the content image.

Despite these promising results, some limitations remain. Specifically, the computational cost of dynamically generating the parameters through the meta network could be further optimized. Additionally, the current model may struggle with extremely abstract or non-representational styles, as the hierarchical flow structure emphasizes strong content preservation, which may limit the flexibility required for such styles. Addressing this trade-off between content fidelity and stylization flexibility is an important challenge for future work.

Furthermore, the current framework lacks explicit semantic constraints, relying primarily on the reversibility of the RH Flow and content loss to retain structural details. While effective at the feature level, this approach may be insufficient for achieving semantic-level content control. Incorporating semantic information, such as semantic segmentation maps or object detection annotations, could enable the model to better preserve and manipulate the meaningful content structures, particularly in complex scenes.

In summary, future work will focus on the following: (a) improving the computational efficiency of the meta network; (b) enhancing the flexibility of the model to better handle highly abstract or non-representational artistic styles; (c) integrating semantic-aware mechanisms, such as semantic segmentation or object detection information, to improve the content-aware stylization; (d) and exploring hybrid models that combine attention-based learning with flow-based approaches to further advance the model’s adaptability and content fidelity.

In addition, this study represents an initial attempt to apply meta network-based parameter generation within a hierarchical flow-based style transfer framework. We believe that the meta network is a general and flexible component that could be further explored in other style transfer architectures, beyond the hierarchical flow. Investigating such extensions could help validate the versatility and broader applicability of our approach to various style transfer paradigms.

Author Contributions

Conceptualization, Y.T., H.-W.L. and H.-J.L.; Methodology, H.-W.L. and C.-J.C.; Software, Y.T. and H.-J.L.; Validation, C.-J.C.; Formal Analysis, H.-J.L.; Investigation, H.-W.L. and H.-J.L.; Resources, Y.T.; Data Curation, C.-H.Y.; Writing—Original Draft Preparation, H.-J.L.; Writing—Review and Editing, H.-W.L. and C.-H.Y.; Visualization, Y.T.; Supervision, H.-W.L.; Project Administration, H.-J.L.; Funding Acquisition, H.-J.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Ref. [34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll’ar, and C. L. Zitnick, “Microsoft coco: Common objects in context”, European conference on computer vision. Springer, 2014, pp. 740–755. https://doi.org/10.1007/978-3-319-10602-1_48 (accessed on 1 October 2024). Ref. [35] K. Nichol, Painter by numbers, wikiart, 2016. https://www.kaggle.com/c/painter-by-numbers/ (accessed on 1 October 2024).

Acknowledgments

This work was supported by the National Science and Technology Council, Taiwan, R.O.C., under grant NSTC 113-2221-E-032-020.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Flow model [23].

Figure 2 Randomized Hierarchy Flow model.

Figure 3 Randomized Hierarchy Coupling (RHC).

Figure 4 (a) Affine-Net and (b) Style Encoder.

Figure 5 MFIST Architecture.

Figure 6 Architecture of meta network.

Figure 7 Style transfer results: multiple content images stylized based on a single style image.

Figure 8 Style transfer results: a single content image individually stylized using different style images.

Figure 9 Style transfer results and enlarged regions compared with state-of-the-art style transfer methods.

Table 1

Quantitative evaluation results based on SSIM and Gram distance metrics.

Method	SSIM ↑	Gram Distance ↓
StyleSwap	0.44	0.00482
AdaIN	0.29	0.00127
WCT	0.27	0.00074
LinearWCT	0.35	0.00093
OptimalWCT	0.21	0.00035
Avatar-Net	0.31	0.00099
Artlow+AdaIN	0.45	0.00078
Ours	0.615	0.00050

Table 2

Quantitative evaluation results of different flow architectures in terms of SSIM, KID (×10³), and number of parameters.

Method	SSIM ↑	KID ↓	Parameters
AdaIN	0.28	41.1/5.1	7.01 M
WCT	0.24	51.2/6.2	34.24 M
Artflow+AdaIN	0.52	24.6/3.8	6.42 M
Artflow+WCT	0.53	33.3/5.3	6.42 M
CCPL	0.43	39.1/6.8	8.67 M
Hierarchy Flow	0.60	28.2/4.7	1.01 M
Ours	0.615	29.0/5.0	0.55 M

References

1. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. arXiv; 2018; arXiv: 1711.11585

2. Miyato, T.; Koyama, M. CGANs with Projection Discriminator. arXiv; 2018; arXiv: 1802.05637

3. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward Multimodal Image-to-Image Translation. arXiv; 2018; arXiv: 1711.11586

4. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv; 2018; arXiv: 1703.10593

5. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. arXiv; 2018; arXiv: 1611.07004

6. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic Image Synthesis with Spatially-Adaptive Normalization. arXiv; 2019; arXiv: 1903.07291

7. Kotovenko, D.; Sanakoyeu, A.; Ma, P.; Lang, S.; Ommer, B. A Content Transformation Block for Image Style Transfer. arXiv; 2020; arXiv: 2003.08407

8. Wei, Y. Artistic Image Style Transfer Based on CycleGAN Network Model. Int. J. Image Graph.; 2024; 24, 2450049. [DOI: https://dx.doi.org/10.1142/S0219467824500499]

9. Liu, J.; Liu, H.; He, Y.; Tong, S. An Improved Detail-Enhancement CycleGAN Using AdaLIN for Facial Style Transfer. Appl. Sci.; 2024; 14, 6311. [DOI: https://dx.doi.org/10.3390/app14146311]

10. Gatys, L.A.; Ecker, A.S.; Bethge, M. A Neural Algorithm of Artistic Style. arXiv; 2015; arXiv: 1508.06576v2[DOI: https://dx.doi.org/10.1167/16.12.326]

11. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv; 2016; arXiv: 1603.08155v1

12. Huang, X.; Belongie, S. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. arXiv; 2017; arXiv: 1703.06868v2

13. Dumoulin, V.; Shlens, J.; Kudlur, M. A Learned Representation for Artistic Style. arXiv; 2017; arXiv: 1610.07629v5

14. Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; Yang, M.-H. Universal Style Transfer via Feature Transforms. arXiv; 2017; arXiv: 1705.08086v2

15. Chen, D.; Yuan, L.; Liao, J.; Yu, N.; Hua, G. StyleBank: An Explicit Representation for Neural Image Style Transfer. arXiv; 2017; arXiv: 1703.09210v2

16. Sheng, L.; Lin, Z.; Shao, J.; Wang, X. Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration. arXiv; 2018; arXiv: 1805.03857v2

17. Li, X.; Liu, S.; Kautz, J.; Yang, M.-H. Learning Linear Transformations for Fast Arbitrary Style Transfer. arXiv; 2018; arXiv: 1808.04537v1

18. Liu, B.; Wang, C.; Cao, T.; Jia, K.; Huang, J. Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing. arXiv; 2024; arXiv: 2403.03431v1

19. Zhou, X.; Yin, M.; Chen, X.; Sun, L.; Gao, C.; Li, Q. Cross Attention Based Style Distribution for Controllable Person Image Synthesis. Computer Vision—ECCV 2022; Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G.M.; Hassner, T. Lecture Notes in Computer Science Springer: Cham, Switzerland, 2022; Volume 13675.

20. Pan, X.; Zhang, M.; Ding, D.; Yang, M. A Geometrical Perspective on Image Style Transfer with Adversarial Learning. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 44, pp. 63-75. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.3011143]

21. An, J.; Huang, S.; Song, Y.; Dou, D.; Liu, W.; Luo, J. ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows. arXiv; 2021; arXiv: 2103.16877v2

22. Fan, W.; Chen, J.; Liu, Z. Hierarchy Flow for High-Fidelity Image-to-Image Translation. arXiv; 2023; arXiv: 2308.06909v1

23. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1 × 1 Convolutions. arXiv; 2018; arXiv: 1807.03039v2

24. Yao, F. A learning theory of meta learning. Natl. Sci. Rev.; 2024; 11, nwae133. [DOI: https://dx.doi.org/10.1093/nsr/nwae133] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39007010]

25. Gao, W.; Shao, M.; Shu, J.; Zhuang, X. Meta-BN Net for Few-Shot Learning. Front. Comput. Sci.; 2023; 17, 171302. [DOI: https://dx.doi.org/10.1007/s11704-021-1237-4]

26. Shen, F.; Yan, S.; Zeng, G. Meta Networks for Neural Style Transfer. arXiv; 2017; arXiv: 1709.04111v1

27. Schmidhuber, J. Evolutionary Principles in Self-Referential Learning. Master’s Thesis; Technische Universität München: München, Germany, 1987.

28. Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.

29. Bechtle, S.; Molchanov, A.; Chebotar, Y.; Grefenstette, E.; Righetti, L.; Sukhatme, G.; Meier, F. Meta-learning via learned loss. arXiv; 2019; arXiv: 1906.05374

30. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning; Sydney, Australia, 6–11 August 2017.

31. Meier, F.; Kappler, D.; Schaal, S. Online learning of a memory for learning rates. Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA); Brisbane, Australia, 21–25 May 2018; pp. 2425-2432.

32. Houthooft, R.; Chen, Y.; Isola, P.; Stadie, B.C.; Wolski, F.; Ho, J.; Abbeel, P. Evolved policy gradients. Proceedings of the NeurIPS Proceeding; Montréal, QC, Canada, 2–8 December 2018; pp. 5405-5414.

33. Metz, L.; Maheswaranathan, N.; Cheung, B.; Sohl-Dickstein, J. Learning unsupervised learning rules. Proceedings of the International Conference on Learning Representations; New Orleans, LA, USA, 6–9 May 2019.

34. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll, P.; Zitnick, C.L. Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision; Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740-755.

35. Nichol, K. Painter by Numbers, Wikiart. 2016; Available online: https://www.kaggle.com/c/painter-by-numbers/ (accessed on 1 October 2024).

Word count: 5683

Show less

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

A style transfer aims to produce synthesized images that retain the content of one image while adopting the artistic style of another. Traditional style transfer methods often require training separate transformation networks for each new style, limiting their adaptability and scalability. To address this challenge, we propose a flow-based image style transfer framework that integrates Randomized Hierarchy Flow (RH Flow) and a meta network for adaptive parameter generation. The meta network dynamically produces the RH Flow parameters conditioned on the style image, enabling efficient and flexible style adaptation without retraining for new styles. RH Flow enhances feature interaction by introducing a random permutation of the feature sub-blocks before hierarchical coupling, promoting diverse and expressive stylization while preserving the content structure. Our experimental results demonstrate that Meta FIST achieves superior content retention, style fidelity, and adaptability compared to existing approaches.

Details

Title

Meta Network for Flow-Based Image Style Transfer

Author

Tsai Yihjia¹; Hsiau-Wen, Lin²; Chen Chii-Jen¹

; Lin Hwei-Jen¹; Chen-Hsiang, Yu³

¹ Department of Computer Science and Information Engineering, Tamkang University, New Taipei City 251301, Taiwan; [email protected] (Y.T.); [email protected] (C.-J.C.)
² Department of Information Management, Chihlee University of Technology, New Taipei City 220305, Taiwan
³ Multidisciplinary Graduate Engineering, College of Engineering, Northeastern University, Boston, MA 02115, USA; [email protected]

First page

2035

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

20799292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/electronics14102035

ProQuest document ID

3211939607

Meta Network for Flow-Based Image Style Transfer

Jump to:

Full text

1. Introduction

2. Related Works

2.1. Flow-Based Style Transfer Models

2.2. Meta Learning for Style Transfer

3. Meta Model for Flow-Based Image Style Transfer

3.1. Randomized Hierarchical Flow Model

3.2. MFIST Architecture

3.3. Training the Meta Network

4. Experimental Results

5. Conclusions and Discussion

Abstract

Details

Suggested sources