Text-to-Thangka generation requires preserving both semantic accuracy and textural detail. Current methods struggle with fine-grained feature extraction, multi-level feature integration, and discriminator overfitting due to limited Thangka data. We present HST-GAN, a novel framework combining parallel hybrid attention with differentiable symmetric augmentation. The architecture features a Parallel Spatial-Channel Attention module (PSCA) for precise localization of deity facial features and ritual object textures, along with a Hierarchical Feature Fusion Network (HLFN) for multi-scale alignment. The framework’s Differentiable Symmetric Augmentation (DiffAugment) dynamically adjusts discriminator inputs to prevent overfitting while improving generalization. On the T2IThangka dataset, HST-GAN achieves an Inception Score of 2.08 and reduces Fréchet Inception Distance to 87.91, outperforming competing baselines; its generalization is further confirmed on the Oxford-102 benchmark.
Introduction
Thangka art1, as a visual art form embodying profound cultural heritage and significant cultural value (as illustrated in Fig. 1), requires months to years of meticulous craftsmanship due to its distinctive compositional style, exquisite textural details, and vibrant color palette. Generative artificial intelligence technologies, particularly text-to-image generation models, are injecting new vitality into the digital preservation of cultural heritage while opening innovative pathways for its sustainable transmission. More importantly, the methodologies and technical frameworks developed for Thangka image generation, characterized by rich coloration and fine-grained textures, demonstrate remarkable transferability to digital preservation and creative applications involving other complex visual artifacts, including murals and traditional embroideries. This research not only carries substantial practical value for cultural heritage conservation but also establishes a paradigmatic case for technological innovation in cross-modal generation domains.
[See PDF for image]
Fig. 1
Some Thangka images.
Example images from the T2IThangka dataset used in this study.
Since the rise of deep learning, models based on Generative Adversarial Networks (GANs)2 have become a classical approach for this task. Through adversarial training between the generator and the discriminator, GANs demonstrate powerful feature learning capabilities in image generation tasks. However, in the context of Thangka image generation, GAN models still face challenges such as insufficient semantic alignment due to the loss of fine-grained features, as well as issues related to data scarcity and training instability, which require further improvement. While the diffusion models proposed by Sheynin et al.3 have made significant progress in text-to-image generation, their iterative generation paradigm struggles to meet the real-time requirements for generating high-resolution details in Thangka images. In terms of model architecture, convolutional neural networks (CNNs) remain a cornerstone in various fields, particularly in cultural heritage digitization, due to their strong local feature extraction capability, parameter-sharing mechanism that ensures efficiency, and excellent performance across diverse visual tasks. In recent years, although Transformer-based architectures such as Vision Transformer have achieved remarkable results on large-scale datasets, their global attention mechanism entails high computational complexity and a heavy reliance on massive datasets. These characteristics make them highly susceptible to overfitting and impose significant computational burdens in scenarios such as Thangka image processing, where data quality is inconsistent and sample sizes are limited. In contrast, the local feature extraction capability of CNNs aligns naturally with the high sensitivity required for details such as lines and textures in this task. Moreover, their robustness in small-sample scenarios has been widely validated. Therefore, we opt for a GAN-based approach with a CNN backbone, aiming to achieve an optimal balance among model performance, computational efficiency, and data requirements. This choice offers unique advantages in the specific context of our study.
In response to issues such as blurred details and insufficient resolution in early text-to-image generation models, subsequent works have introduced a series of improvements. These include: StackGAN4, which enhanced image quality through a strategy of generating high-resolution images from low-resolution inputs; AttnGAN5, which incorporated an attention mechanism to dynamically associate words with relevant image regions; and DF-GAN6, which simplified the architecture and achieved high-resolution single-stage generation via an object-aware discriminator. DM-GAN7 introduced a dynamic memory module to refine blurred details and stabilize training, while GigaGAN8 developed a multi-scale parallel generation architecture. With the rise of diffusion models, Imagen9 employed advanced optimization techniques to enhance local detail generation while maintaining high resolution. CogView310 innovatively combined autoregressive and diffusion-based architectures to improve text-image alignment accuracy. In terms of model architecture optimization, RATLIP11 adopted a mixed attention mechanism to mitigate feature forgetting in recurrent neural networks. GALIP12 integrated CLIP models into both the discriminator and generator to enhance semantic understanding in the discriminator. Additionally, SF-GAN13 proposed a recurrent semantic fusion network that enables local feature adaptation through progressive feature interaction across time steps.
Although current generative models have achieved remarkable progress in general image generation tasks—such as on datasets like CUB-200 Birds and Oxford-102 Flowers (see Fig. 2a), with Fig. 2c further demonstrating the strong performance of diffusion models on the COCO dataset—Thangka image generation remains largely unexplored both domestically and internationally, owing to inherent algorithmic limitations and the unique characteristics of Thangka art. Existing methods are unable to meet the requirements for high-quality Thangka presentation and appreciation. As illustrated in Fig. 2b, current models lack emphasis on key regions (such as the face of the main deity or ritual objects), resulting in issues such as detail degradation (e.g., broken lines). Furthermore, existing approaches are inefficient in integrating high-level semantic text information with low-level image texture features. Lastly, conventional data augmentation strategies (e.g., flipping and cropping), which are commonly used to mitigate training instability, are ill-suited to the symmetry constraints inherent in Thangka art and may instead introduce noise. These technical bottlenecks significantly limit the practical application of generative artificial intelligence in Thangka image generation.
[See PDF for image]
Fig. 2
Comparison of Generative Model development.
An overview of generative model development spanning datasets, T2IThangka, and general benchmarks. a Commonly used datasets in the text-to-image domain. b The effectiveness of the GAN model on the Thangka dataset. c The output quality of diffusion models on the COCO dataset.
To address these challenges—including detail blurring, feature degradation, compositional irregularities, and coarse textures, as well as the model overfitting caused by limited dataset sizes—we propose HST-GAN, a novel framework integrating parallel hybrid attention mechanisms and differentiable symmetric augmentation. The main contributions of this work are summarized as follows:
We propose a novel Parallel Spatial-Channel Attention module (PSCA) that innovatively achieves simultaneous attention to semantically critical image regions and adaptive enhancement of task-relevant color/feature channels. This mechanism strengthens local features essential for the generation task while suppressing irrelevant background interference, thereby significantly improving the model’s feature extraction precision and attention allocation capabilities. PSCA ensures concurrent enhancement of both key regions and important channels, ultimately establishing more comprehensive semantic representations.
We design a Hierarchical Feature Fusion Network (HLFN) that effectively integrates high-level semantic information (e.g., proportions of principal deities) with low-level detail features (e.g., clothing textures) through a cross-level feature alignment strategy. This architecture maintains exceptional harmony between global structures and local details in generated images, leading to substantial improvements in output quality.
We introduce a Differentiable Symmetric Augmentation strategy (DiffAugment) into the discriminator of the generative adversarial network. This approach effectively mitigates overfitting caused by limited training data while remarkably enhancing the model’s generalization capability.
Methods
Based on the RAT-GAN baseline model, we propose HST-GAN, a novel generative framework that synergistically integrates parallel hybrid attention mechanisms with differentiable symmetric augmentation techniques. The overall architecture is illustrated in Fig. 3. In the generator, the Thangka text description feature vector and noise vector are initially processed through an affine transformation block to enable deep fusion of textual information. Because the PSCA module is applied after the generator blocks, key regions and channels in the Thangka image can be attended to simultaneously. Considering that the generated Thangka images lack guiding information about character contours for the details of each part, the HLFN is adopted to align high-level semantics with low-level semantics, ensuring that the fused features contain the fine-grained information in the images. To improve the discriminator’s learning of the data distribution and to reduce the model’s dependence on specific patterns in the training data, the DiffAugment strategy is introduced into the discriminator, applying differentiable data augmentation to both real and generated images simultaneously. With a strengthened discriminator, the generator must produce higher-quality images in order to deceive it. This section elaborates on the PSCA module, the HLFN, and the DiffAugment strategy.
[See PDF for image]
Fig. 3
The proposed model architecture for text-to-image generation.
The upper part illustrates the generator structure, while the lower part details the discriminator structure.
Contrastive text embedding pre-training
We employ a bidirectional Long Short-Term Memory (LSTM) network to encode the textual descriptions, generating a sentence-level feature vector $e$, while a convolutional network is used to extract the image feature vector $v$. Following the approach of AttnGAN, a contrastive loss function is adopted to achieve cross-modal feature alignment between text and image. Specifically, the similarity matrix for all possible image-text pairs is computed using Eq. (1) as follows:

$$S_{i,j} = e_i^{\top} v_j, \qquad i, j = 1, \ldots, n \tag{1}$$

Here, $n$ denotes the number of images in the batch, $e_i$ and $v_i$ represent the $i$-th text feature and image feature, respectively, and $S_{i,j}$ indicates the dot-product similarity between the $i$-th text feature and the $j$-th image feature. The similarity matrix is then converted into matching probabilities:

$$P(v_i \mid e_i) = \frac{\exp(S_{i,i})}{\sum_{j=1}^{n} \exp(S_{i,j})} \tag{2}$$

To maximize the similarity between matched text-image feature pairs, we minimize the following contrastive loss function:

$$\mathcal{L}_{\mathrm{contrast}} = -\frac{1}{n} \sum_{i=1}^{n} \log P(v_i \mid e_i) \tag{3}$$
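For concreteness, the batch-wise alignment in Eqs. (1)–(3) can be sketched in PyTorch as follows; the encoders are omitted, and the tensor names, dimensions, and function name are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
    """Batch-wise text-image contrastive loss (Eqs. 1-3).

    text_feats: (n, d) sentence-level features from the bidirectional LSTM.
    img_feats:  (n, d) image features from the convolutional encoder.
    """
    # Eq. (1): dot-product similarity matrix S[i, j] = e_i . v_j
    sim = text_feats @ img_feats.t()          # (n, n)
    # Eq. (2): convert each row into matching probabilities with a softmax
    log_prob = F.log_softmax(sim, dim=1)      # log P(v_j | e_i)
    # Eq. (3): maximize the probability of the matched (diagonal) pairs
    return -log_prob.diag().mean()

# Usage with random placeholder features
e = torch.randn(24, 256)
v = torch.randn(24, 256)
loss = contrastive_loss(e, v)
```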
This pre-training process enables the text encoder to generate semantically consistent feature representations, providing conditional input for the subsequent generative adversarial network. As shown in Fig. 3, in the generator, the text feature vector and noise vector are used as inputs simultaneously. An RNN is then employed to model the temporal structure between generator blocks, enabling global allocation of textual information. Specifically, a variant of LSTM is used instead of a standard RNN. First, a channel-wise scaling operation is applied to the image feature vector $c$, followed by a channel-wise shifting operation. This process can be formally expressed as:

$$\mathrm{Affine}(c \mid h_t) = \gamma_t \cdot c + \beta_t \tag{4}$$

where $h_t$ is the hidden state of the RNN, and $\gamma_t$ and $\beta_t$ are parameters predicted by two single-hidden-layer multilayer perceptrons (MLPs) based on $h_t$:

$$\gamma_t = \mathrm{MLP}_{\gamma}(h_t), \qquad \beta_t = \mathrm{MLP}_{\beta}(h_t) \tag{5}$$

When processing an image feature map composed of w × h feature vectors, this affine transformation is repeated for each feature vector. The initial state of the LSTM is generated from the noise vector $z$:

$$(h_0, m_0) = \mathrm{MLP}_{\mathrm{init}}(z) \tag{6}$$

where $h_0$ and $m_0$ denote the initial hidden and cell states, respectively.
At each time step $t$, the LSTM receives the text feature $e$ and the previous hidden state $h_{t-1}$. The update rules are as follows (Eqs. 7–9):

$$i_t = \sigma(W_i [h_{t-1}; e] + b_i), \quad f_t = \sigma(W_f [h_{t-1}; e] + b_f), \quad o_t = \sigma(W_o [h_{t-1}; e] + b_o) \tag{7}$$

$$m_t = f_t \odot m_{t-1} + i_t \odot \tanh(W_m [h_{t-1}; e] + b_m) \tag{8}$$

$$h_t = o_t \odot \tanh(m_t) \tag{9}$$

where $i_t$, $f_t$, and $o_t$ denote the input, forget, and output gates, $m_t$ is the cell state, and $\odot$ denotes element-wise multiplication. The current hidden state $h_t$ is then transformed into the affine parameters:

$$(\gamma_t, \beta_t) = \big(\mathrm{MLP}_{\gamma}(h_t),\; \mathrm{MLP}_{\beta}(h_t)\big) \tag{10}$$
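A minimal PyTorch sketch of this LSTM-driven affine conditioning (Eqs. (4)–(10)) is given below; the module structure, dimensions, and MLP design are assumptions made for illustration, not the authors' RAT-GAN code.

```python
import torch
import torch.nn as nn

class RecurrentAffine(nn.Module):
    """Sketch of LSTM-driven affine conditioning (Eqs. 4-10): the hidden
    state at each generator stage is mapped to per-channel scale/shift."""
    def __init__(self, text_dim: int, hidden_dim: int, num_channels: int):
        super().__init__()
        self.lstm_cell = nn.LSTMCell(text_dim, hidden_dim)  # Eqs. (7)-(9)
        self.to_gamma = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                      nn.ReLU(), nn.Linear(hidden_dim, num_channels))
        self.to_beta = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                     nn.ReLU(), nn.Linear(hidden_dim, num_channels))

    def forward(self, feat, text, state):
        h, m = self.lstm_cell(text, state)                   # update hidden and cell state
        gamma = self.to_gamma(h).unsqueeze(-1).unsqueeze(-1) # Eqs. (5)/(10)
        beta = self.to_beta(h).unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta, (h, m)                   # Eq. (4): scale then shift

# Usage: noise-initialized state, one conditioning step on a feature map
init = nn.Linear(100, 2 * 128)                               # Eq. (6): (h0, m0) from z
z = torch.randn(4, 100)
h0, m0 = init(z).chunk(2, dim=1)
block = RecurrentAffine(text_dim=256, hidden_dim=128, num_channels=64)
feat = torch.randn(4, 64, 16, 16)
text = torch.randn(4, 256)
out, state = block(feat, text, (h0, m0))
```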
The recurrent connections in the RNN establish temporal dependencies among the affine transformation parameters across different layers. This design addresses the conflict issues inherent in traditional isolated fusion blocks (e.g., CBN/CIN) and achieves consistent allocation of textual information across layers.

Parallel Spatial-Channel Attention Module (PSCA)
Generators often fail to concurrently capture both spatial key regions and channel-level color distributions, resulting in synthesized Thangka images lacking critical details and fine textures. Notably, existing attention mechanisms, such as CBAM14, suffer from computational inefficiency, which stems from their serial processing architecture. Furthermore, these mechanisms tend to overlook the subtle yet critical interdependencies among color channels that carry specific cultural meanings in Thangka art. To address these limitations, we propose a Parallel Spatial-Channel Dual Attention Module (PSCA), with its architecture depicted in Fig. 4. The PSCA module employs a dual-path pooling-convolution cooperative mechanism that enables precise localization of key regions (e.g., facial features of Buddhist deities). By utilizing shared spatial contextual weights and efficient convolutional operations, it simultaneously refines features across both spatial and channel dimensions. This dual refinement process generates semantically and spatially balanced features for subsequent cross-level feature fusion, effectively minimizing interference from irrelevant background elements.
[See PDF for image]
Fig. 4
PSCA: A Parallel Spatial-Channel Dual Attention Module.
This module is designed to integrate spatial and channel attention mechanisms in parallel to enhance feature representation.
In this study, the text feature T and processed noise vector Z are jointly fed into a series of generator blocks (G_block0, G_block1, G_block2, G_block3). Each generator block contains two affine transformation layers and convolutional layers, which work cooperatively to progressively refine the noise vector and text features, thereby generating multi-level features ($F_0$–$F_3$). In the generator’s hierarchical feature maps, the output $F_1$ from the higher-level block (G_block1) captures rich semantic and fine-grained information, while the outputs $F_2$ and $F_3$ from the mid-to-low-level blocks (G_block2, G_block3) focus more on localized detail features. To harmonize multi-scale feature representation, the PSCA module employs shared-parameter 1 × 1 convolutional kernels to simultaneously process $F_1$, $F_2$, and $F_3$, producing channel-aligned lateral features $L_1$, $L_2$, and $L_3$.
To process the spatial and channel-wise feature dependencies in parallel while avoiding the information loss caused by serial computation, we apply the spatial attention and channel attention branches to each lateral feature $L_i$ for synchronous computation, generating the final dual-attention features. In the spatial attention branch, each lateral feature undergoes average pooling (to preserve global context) and max pooling (to highlight salient features) simultaneously, yielding the average-pooled and max-pooled features. These features are then concatenated and processed by a 7 × 7 convolution to expand the receptive field, making it suitable for the large-scale symmetrical structures of Thangka paintings. Finally, the spatially weighted attention map $M_s(L_i)$ is generated. The computation of the spatial attention branch is formulated in Eq. (11):

$$M_s(L_i) = \sigma\Big(f^{7 \times 7}\big(\big[\mathrm{AvgPool}(L_i);\, \mathrm{MaxPool}(L_i)\big]\big)\Big) \tag{11}$$
In the equation, σ denotes the sigmoid activation function, $f^{7 \times 7}$ represents a 7 × 7 convolutional operation, and $\mathrm{AvgPool}(L_i)$ and $\mathrm{MaxPool}(L_i)$ correspond to the average-pooled and max-pooled features, respectively. Meanwhile, the channel attention branch employs a dual-pooling strategy to aggregate spatial information from the feature maps, generating two descriptors that capture distinct yet complementary spatial contexts. These descriptors undergo nonlinear transformation through a shared-parameter MLP. The corresponding output features from the MLP are then combined through element-wise summation, followed by a sigmoid activation function to produce the final channel attention weight vector $M_c(L_i)$. The computation of the channel attention branch is formulated as follows:

$$M_c(L_i) = \sigma\Big(W_1\big(W_0(\mathrm{AvgPool}(L_i))\big) + W_1\big(W_0(\mathrm{MaxPool}(L_i))\big)\Big) \tag{12}$$
In the equation, σ denotes the sigmoid activation function; AvgPool() and MaxPool() represent average pooling and max pooling operations, respectively; W₀ and W₁ indicate the dimensionality-reduction and dimensionality-increasing weights of the MLP. Finally, the spatial and channel attention weights are multiplied element-wise to enhance joint modeling capability for key Thangka regions (such as Buddha’s robe patterns and background motifs):

$$M(L_i) = M_s(L_i) \otimes M_c(L_i) \tag{13}$$

where ⊗ denotes element-wise multiplication. The parallel spatial-channel attention mechanism module processes spatial and channel-wise features simultaneously through a dual-path pooling-convolution cooperative mechanism. Multi-level features are first aligned via 1 × 1 convolutions, then synchronously optimized by dual attention branches, with their final weights multiplied to enhance cooperative responses. This method can accurately generate intricate Thangka texture details, improve color distribution rationality, and suppress background interference. The design preserves high-level semantic guidance while incorporating low-level detail cues, providing structured multi-scale representations for subsequent hierarchical feature fusion.
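The parallel branches described in Eqs. (11)–(13) can be sketched in PyTorch roughly as follows; the layer names, reduction ratio, and the way the combined weight is applied back to the lateral feature are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PSCA(nn.Module):
    """Parallel spatial-channel attention sketch (Eqs. 11-13)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Spatial branch: 7x7 conv over concatenated avg/max maps (Eq. 11)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Channel branch: shared MLP with reduction W0 and expansion W1 (Eq. 12)
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W) lateral feature
        # Spatial attention, Eq. (11)
        avg_sp = x.mean(dim=1, keepdim=True)            # (B, 1, H, W)
        max_sp = x.max(dim=1, keepdim=True).values      # (B, 1, H, W)
        m_s = self.sigmoid(self.spatial_conv(torch.cat([avg_sp, max_sp], dim=1)))
        # Channel attention, Eq. (12)
        avg_ch = x.mean(dim=(2, 3))                     # (B, C)
        max_ch = x.amax(dim=(2, 3))                     # (B, C)
        m_c = self.sigmoid(self.mlp(avg_ch) + self.mlp(max_ch)).view(x.size(0), -1, 1, 1)
        # Eq. (13): element-wise product of the two weights, applied to the feature
        return x * (m_s * m_c)

# Usage on a lateral feature map
feat = torch.randn(2, 64, 32, 32)
out = PSCA(64)(feat)
```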
Hierarchical Feature Fusion Network (HLFN)
The Parallel Spatial-Channel Attention Module (PSCA) significantly enhances the model’s capability to precisely identify critical regions and channels during the image generation process. However, the rigorous compositional principles inherent in Thangka artwork demand that generators maintain a delicate balance between global structure and local details—a requirement where conventional approaches like StackGAN++15 exhibit limitations due to their unidirectional feature propagation, often resulting in detail degradation. This challenge stems from a fundamental architectural limitation: current generator networks typically process hierarchical features (including high-level semantic representations and low-level detail features) in isolation, lacking an effective cross-level fusion mechanism. Such fragmented processing inevitably leads to compromised performance in both fine detail preservation and semantic coherence, ultimately diminishing the overall image quality. To address these critical issues, we propose a novel HLFN that strategically integrates recurrent neural networks (RNNs) with affine transformation layers. This innovative architecture not only preserves intricate textures and contextual information but also achieves optimized multi-scale feature fusion.
The proposed framework operates through a two-phase process. Initially, the parallel spatial-channel attention module generates comprehensive dual-attention features $A_1$, $A_2$, and $A_3$. Subsequently, our hierarchical feature fusion network performs advanced integration of these multi-scale features, effectively bridging high-level semantic features with low-level, fine-grained detail representations. This fusion mechanism enables efficient hierarchical feature consolidation while strictly adhering to the exacting standards required for authentic Thangka image synthesis.
The fusion process, as illustrated in Fig. 5, employs an elaborate feature integration strategy that effectively combines cross-layer features through progressive upsampling and element-wise addition. Beginning with the high-level dual-attention feature map $A_1$ (with the lowest resolution), the network gradually scales up its dimensions while systematically incorporating detailed information from the lower-level feature maps $A_2$ and $A_3$. This process utilizes nearest-neighbor interpolation for upsampling operations to preserve the intricate edge details characteristic of Thangka artwork, thereby ensuring precise feature transmission and minimizing information loss. The final output consists of a series of meticulously fused feature maps that not only significantly enhance the richness and consistency of feature representation but also provide higher-quality input for subsequent image generation stages. This fusion process can be formally expressed by Eq. (14):

$$F'_{i} = \mathrm{Upsample}(F'_{i-1}) + A_{i}, \qquad F'_{1} = A_{1}, \quad i = 2, 3 \tag{14}$$

where Upsample denotes the nearest-neighbor interpolation operation, $A_1$, $A_2$, and $A_3$ represent the input feature maps at different hierarchical levels, and $F'_i$ indicates the fused output feature map at level $i$.

[See PDF for image]
Fig. 5
The employed Hierarchical Feature Fusion Network (HLFN).
It extracts and amalgamates features from different scales to enhance the model's representational capacity.
The Hierarchical Feature Fusion Network addresses the critical challenge of coordinating global structure with local details in Thangka image generation through its innovative combination of spatial-channel attention mechanisms and multi-scale feature fusion. The system first extracts dual-attention features using PSCA, then performs deep integration of high-level semantic features with low-level details through progressive upsampling, achieving optimized cross-level feature enhancement. This strategic approach effectively circumvents the detail degradation caused by unidirectional feature propagation while substantially improving both the richness and consistency of feature representation.
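A minimal sketch of the top-down fusion in Eq. (14) is shown below, assuming three dual-attention maps with matching channel counts whose spatial resolutions double from level to level; the shapes and function name are illustrative.

```python
import torch
import torch.nn.functional as F

def hlfn_fuse(a1: torch.Tensor, a2: torch.Tensor, a3: torch.Tensor) -> torch.Tensor:
    """Progressive top-down fusion (Eq. 14): upsample the coarser map with
    nearest-neighbor interpolation and add the finer map element-wise.

    a1: high-level dual-attention map, lowest resolution.
    a2, a3: progressively finer maps with matching channel counts.
    """
    x = F.interpolate(a1, size=a2.shape[-2:], mode="nearest") + a2
    x = F.interpolate(x, size=a3.shape[-2:], mode="nearest") + a3
    return x

# Usage with illustrative 64-channel maps at 16, 32 and 64 pixels
a1 = torch.randn(2, 64, 16, 16)
a2 = torch.randn(2, 64, 32, 32)
a3 = torch.randn(2, 64, 64, 64)
fused = hlfn_fuse(a1, a2, a3)   # (2, 64, 64, 64)
```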
Differentiable Symmetric Augmentation Strategy (DiffAugment)
When analyzing the performance of RAT-GAN on the T2IThangka dataset, we first observed a widening divergence between the generator and discriminator losses (Fig. 6a), indicating that the discriminator had begun to memorize the training images. It is worth noting that such overfitting behavior in the discriminator is not unique to the T2IThangka dataset; similar phenomena have been reported by Brock et al.16 even on large-scale datasets such as ImageNet. Further experiments (Fig. 6b) revealed that an overfitted discriminator loses its ability to accurately evaluate newly generated images. As a result, the generator receives unreliable feedback, ultimately producing images that are blurry, distorted, and lacking in diversity, which significantly compromises the overall quality and authenticity of the generated outputs. To quantitatively assess this effect, we introduced the Fréchet Inception Distance (FID) metric for tracking and analysis. The results show that although the FID exhibits an overall downward trend as training progresses, its fluctuation pattern reveals critical issues (Fig. 6c): in the later stages of training, the magnitude of FID rebound increases significantly. This behavior is highly consistent with typical characteristics of discriminator overfitting—when the discriminator over-memorizes training samples, the distorted feedback signals lead to instability in generation quality. These findings highlight the importance of closely monitoring discriminator overfitting during model training to ensure stable and consistent improvement in generation quality.
[See PDF for image]
Fig. 6
Indicators of model overfitting.
a Oscillating and non-converging loss functions. b Generated images with repetitive patterns and artifacts. c Persistent fluctuation in the FID score, indicating a failure to converge.
Extensive and sustained efforts have been devoted to adversarial training strategies and loss function design for GANs in image generation, yet data quality and training stability remain two fundamental challenges in Thangka image generation. Although present-day Internet data has the advantage of scale, it generally suffers from inherent defects such as heterogeneous quality, sparse text descriptions, and noise interference, which makes the acquisition of paired text-image data a key bottleneck. Consequently, in training scenarios with limited data, the discriminator is prone to memorization, and an overfitted discriminator severely penalizes any generated sample that deviates from the training distribution; because it lacks generalization ability, the gradient signals it provides lose their informativeness, ultimately leading to training instability. Traditional solutions rely heavily on explicit regularization methods. For example, BigGAN employs spectral normalization16 to suppress discriminator overfitting, but still suffers from mode collapse in small-data scenarios. Data augmentation has been found to play an irreplaceable role in suppressing overfitting. However, its application to GANs lags significantly behind explicit regularization methods17, and traditional data augmentation (e.g., random cropping and flipping) destroys the compositional symmetry of Thangkas, leading to misalignment in the generated images. Notably, Zhang et al.18 experimentally demonstrated that directly applying traditional data augmentation to GAN training not only fails to improve performance but may instead destroy the adversarial balance of the model. These findings are consistent with our observation in Thangka image generation that traditional augmentation methods can destroy the symmetric structure of the images. Inspired by differentiable augmentation techniques such as DiffAugment19 and AugSelf20, this paper proposes a three-stage augmentation strategy for Thangka images: first, symmetric augmentation is achieved by applying the same augmentation operation to both real and generated images; second, parametric color adjustment guarantees differentiability; and third, intra-batch diversity is obtained by dynamically sampling augmentation parameters. This strategy effectively improves generation quality and training stability while respecting the artistic conventions of Thangka.
This study addresses two fundamental challenges in Thangka image generation through the integration of parallel spatial-channel attention mechanisms and hierarchical feature fusion: insufficient attention to critical regions and feature channels, and misalignment between high-level semantic features and low-level details. The proposed approach enables generated images to achieve superior consistency between overall contours and fine details while better conforming to human esthetic standards. However, the scarcity of Thangka training data leads to discriminator overfitting, limiting generation capability to simple principal deity images (e.g., Sakyamuni Buddha, Akshobhya). To enhance the model’s generative capacity, particularly in preserving intricate textures and rich colors while improving output diversity, we propose a differentiable symmetric augmentation strategy. Unlike traditional augmentation methods (e.g., flipping) that disrupt compositional symmetry, our approach simultaneously applies differentiable data augmentation to both real and generated samples during discriminator training. This design propagates augmented sample gradients back to the generator, achieving three key improvements: enhanced discriminator capability in learning data distributions, higher-quality image generation, and reduced model dependency on specific patterns with mitigated mode collapse. The framework incorporates three core technical innovations specifically designed for Thangka image generation:
Let $p_{\mathrm{data}}$ denote the real data distribution, $G$ the generator, and $D$ the discriminator. Symmetric augmentation applies a random transformation $T(\cdot;\theta)$ (with parameters $\theta$) to both real and generated samples; the objective function of the discriminator is:

$$\mathcal{L}_{D} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\max\big(0,\, 1 - D(T(x;\theta))\big)\big] + \mathbb{E}_{z \sim p(z)}\big[\max\big(0,\, 1 + D\big(T(G(z);\theta)\big)\big)\big] \tag{15}$$

All transformations are designed as differentiable operations, enabling gradient backpropagation to the generator:

$$\mathcal{L}_{G} = -\,\mathbb{E}_{z \sim p(z)}\big[D\big(T(G(z);\theta)\big)\big] \tag{16}$$
To enhance training diversity and stability, we apply identical augmentation types with independently sampled parameters across samples within each batch. This maintains intra-batch consistency while increasing augmentation diversity. Given the distinctive artistic attributes and structural characteristics of Thangka images, we strategically selected two augmentation strategies in the practical application of DiffAugment: translation (random affine shifts within ±12.5% of the image dimensions, as shown in Fig. 7a) and color adjustment (nonlinear perturbations within brightness ±0.5, contrast [0.5, 1.5], and saturation [0, 2], as illustrated in Fig. 7b). These augmentation operations allow gradients to propagate back to the generator, thereby maintaining dynamic equilibrium during the training process. Although this approach typically leads to a slight reduction in the discriminator’s training accuracy, it improves the generator’s validation accuracy, effectively mitigates overfitting, and ultimately promotes better model convergence.
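A simplified, differentiable sketch of these two operations is shown below; the parameter ranges follow the text, while the wrap-around shift and the exact color formulas are illustrative simplifications rather than the reference DiffAugment code.

```python
import torch

def rand_translate(x: torch.Tensor, ratio: float = 0.125) -> torch.Tensor:
    """Random shift within +/-ratio of the image size (wrap-around shift for
    simplicity; the reference implementation pads with zeros instead)."""
    b, _, h, w = x.shape
    max_dx, max_dy = int(w * ratio), int(h * ratio)
    dx = torch.randint(-max_dx, max_dx + 1, (b,), device=x.device)
    dy = torch.randint(-max_dy, max_dy + 1, (b,), device=x.device)
    out = torch.zeros_like(x)
    for i in range(b):  # parameters are sampled independently per sample
        out[i] = torch.roll(x[i], shifts=(int(dy[i]), int(dx[i])), dims=(1, 2))
    return out

def rand_color(x: torch.Tensor) -> torch.Tensor:
    """Brightness +/-0.5, contrast in [0.5, 1.5], saturation in [0, 2]."""
    b = x.size(0)
    brightness = torch.rand(b, 1, 1, 1, device=x.device) - 0.5      # [-0.5, 0.5]
    contrast = torch.rand(b, 1, 1, 1, device=x.device) + 0.5        # [0.5, 1.5]
    saturation = torch.rand(b, 1, 1, 1, device=x.device) * 2.0      # [0, 2]
    x = x + brightness
    mean = x.mean(dim=(1, 2, 3), keepdim=True)
    x = (x - mean) * contrast + mean
    gray = x.mean(dim=1, keepdim=True)
    return (x - gray) * saturation + gray

def diff_augment(x: torch.Tensor) -> torch.Tensor:
    """Apply the same augmentation pipeline to real and generated batches."""
    return rand_color(rand_translate(x))

# In the discriminator update, both branches see augmented inputs, e.g.:
# d_real = D(diff_augment(real_images), text)
# d_fake = D(diff_augment(G(z)), text)
```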
[See PDF for image]
Fig. 7
The two enhancement strategies used.
a Geometric transformation using random translation. b Photometric adjustment via color space perturbations.
Loss functions
The generator’s loss function comprises an adversarial loss and a gradient penalty term. Through adversarial training and the gradient penalty, the generator learns to produce high-quality, diverse images that are semantically aligned with the text descriptions. Accordingly, the training objective of the generator can be expressed by Eq. (17):

$$\mathcal{L}_{G} = -\,\mathbb{E}_{G(z) \sim p_{g}}\big[D\big(T(G(z)),\, e\big)\big] \tag{17}$$

where $e$ denotes the matched text embedding and $p_g$ the distribution of generated images.
Building upon the baseline symmetric augmentation framework (Eq. (15)), and to ensure gradient smoothness, this study integrates the negative sample supervision mechanism proposed by Ye et al.21 with the gradient regularization method from MA-GP6. Specifically, while maintaining differentiable augmentations, the discriminator is tasked not only with distinguishing between real and generated samples, but also with identifying negative sample pairs in which the text descriptions do not match the images. By further incorporating the hinge loss from MA-GP6 on both real and matched text-image pairs, the final objective function of the discriminator is formulated as shown in Eq. (18):

$$\mathcal{L}_{D} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\max(0,\, 1 - D(T(x), e))\big] + \tfrac{1}{2}\,\mathbb{E}_{G(z) \sim p_{g}}\big[\max(0,\, 1 + D(T(G(z)), e))\big] + \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\max(0,\, 1 + D(T(x), \hat{e}))\big] + k\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\big(\lVert\nabla_{x} D(x, e)\rVert + \lVert\nabla_{e} D(x, e)\rVert\big)^{p}\big] \tag{18}$$

where $\hat{e}$ denotes a mismatched text embedding, and $k$ and $p$ are hyperparameters of the gradient penalty.
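A hedged PyTorch sketch of these objectives is given below; the gradient-penalty coefficients k and p, the 1/2 weighting of the negative terms, and the function signatures are illustrative assumptions rather than the exact formulation used in training.

```python
import torch

def generator_loss(d_fake_logits: torch.Tensor) -> torch.Tensor:
    """Adversarial generator loss (cf. Eq. 17): raise D's score on generated samples."""
    return -d_fake_logits.mean()

def discriminator_loss(d_real, d_fake, d_mismatch,
                       real_images, text_emb, discriminator,
                       k: float = 2.0, p: float = 6.0) -> torch.Tensor:
    """Hinge loss with mismatched-text negatives plus a MA-GP-style gradient
    penalty on real, matched pairs (cf. Eq. 18); k and p are illustrative."""
    loss = torch.relu(1.0 - d_real).mean() \
        + 0.5 * torch.relu(1.0 + d_fake).mean() \
        + 0.5 * torch.relu(1.0 + d_mismatch).mean()
    # Gradient penalty on (real image, matched text) pairs
    real = real_images.detach().requires_grad_(True)
    text = text_emb.detach().requires_grad_(True)
    score = discriminator(real, text).sum()
    grad_img, grad_txt = torch.autograd.grad(score, (real, text), create_graph=True)
    gp = (grad_img.flatten(1).norm(dim=1) + grad_txt.flatten(1).norm(dim=1)).pow(p).mean()
    return loss + k * gp
```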
Results
Dataset and experimental parameter settings
This study extends and enhances the Thangka image dataset originally constructed by Hu et al.22 through systematic data augmentation and annotation refinement, resulting in the newly developed T2IThangka dataset that is specifically optimized for text-to-Thangka image generation tasks. The finalized dataset comprises three distinct subsets: a training set containing 7568 images, a validation set with 1490 images, and a test set consisting of 1098 carefully curated samples. To ensure data quality, we implemented advanced object detection techniques to precisely extract the principal deity regions from each image. Each cropped region is accompanied by comprehensive Chinese semantic annotations (averaging 100 words per image) that meticulously document the deity’s distinctive features, characteristic postures, and symbolic representations.
In order to rigorously evaluate the generalization capability and practical effectiveness of our proposed model, we conducted supplementary experiments using the widely recognized Oxford-102 dataset23. This benchmark dataset encompasses 102 unique flower categories, totaling 8189 high-quality images, with each floral specimen being described by ten distinct textual captions to provide rich semantic context.
The experimental environment was configured with the following specifications: an Ubuntu 18.04 operating system, PyTorch 1.10.0 framework, and CUDA 11.3 for GPU acceleration, running on a computing platform equipped with Xeon Platinum 8362 processors and RTX 3090 graphics cards. For the training process, we employed the Adam optimizer with distinct learning rates for the generator (1 × 10⁻⁴) and discriminator (4 × 10⁻⁴), along with a momentum parameter β of 0.9. Due to GPU memory constraints, we set the batch size to 24 and conducted training for a total of 900 iterations to ensure model convergence while maintaining computational efficiency.
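For reference, the optimizer setup described above might be configured as in the following sketch; interpreting the reported momentum parameter β = 0.9 as Adam's first-moment coefficient, with the second-moment coefficient left at PyTorch's default, is an assumption.

```python
import torch
from torch import nn

def build_optimizers(generator: nn.Module, discriminator: nn.Module):
    """Adam optimizers with the reported learning rates (1e-4 for G, 4e-4 for D).
    betas=(0.9, 0.999): the first value follows the reported beta = 0.9; the
    second is PyTorch's default and is an assumption, not stated in the paper."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.9, 0.999))
    return opt_g, opt_d
```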
Evaluation metrics
This study employs two widely adopted quantitative metrics for assessing image generation quality and diversity: the Inception Score (IS)24 and FID25. The IS quantitatively evaluates generated images by computing the Kullback-Leibler (KL) divergence between the conditional class distribution of generated images and the marginal class distribution over the generated set. A higher IS value indicates superior image generation quality with better class discriminability and visual fidelity. The mathematical formulation for IS calculation is presented in Eq. (19):

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_{g}}\big[D_{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big) \tag{19}$$

where $p(y \mid x)$ is the conditional class distribution predicted for a generated image $x$ and $p(y)$ is the marginal class distribution.
The FID employs a pre-trained Inception v3 network to compute the Fréchet distance between the feature distributions of generated and real images. In contrast to the IS metric, lower FID values indicate that the generated images more closely approximate the real image distribution, reflecting superior visual authenticity. The mathematical formulation of FID is presented in Eq. (20):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^{2} + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big) \tag{20}$$

where $\mu_r$, $\Sigma_r$ and $\mu_g$, $\Sigma_g$ denote the means and covariance matrices of the Inception features of real and generated images, respectively.
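A minimal sketch of the Fréchet distance in Eq. (20), computed between Gaussian fits of pre-extracted Inception features (feature extraction itself is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features (Eq. 20):
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real                     # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```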
In the evaluation of Thangka image generation, subjective human assessment demonstrates superior authority due to its precise perception of artistic quality and cultural connotations. This study conducts a comprehensive multi-dimensional qualitative evaluation through survey statistics for both the proposed GAN model and comparative models in text-to-Thangka image generation. The evaluation framework encompasses four core dimensions: image quality Q (assessing clarity and detail representation), artistic merit A (evaluating color, composition, and style consistency), cultural accuracy C (measuring conformity to traditional characteristics), and innovation I (assessing reasonable creativity based on tradition). Each dimension employs a 10-point Likert scale with respective weights of 0.4, 0.3, 0.2, and 0.1. We recruited fifty evaluators, including professional Thangka artists, Tibetan studies researchers, and general audiences, to perform blind assessments on 100 generated images. The multi-dimensional scores were aggregated into a single subjective score for each image, calculated through Eq. (21). The final comprehensive score for each model was derived by averaging the scores across all 100 generated images.
$$S = 0.4\,Q + 0.3\,A + 0.2\,C + 0.1\,I \tag{21}$$
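As a worked illustration with hypothetical ratings that are not drawn from the actual survey, an image scored Q = 8, A = 7, C = 9, and I = 6 would receive

$$S = 0.4 \times 8 + 0.3 \times 7 + 0.2 \times 9 + 0.1 \times 6 = 3.2 + 2.1 + 1.8 + 0.6 = 7.7$$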
Quantitative evaluation
In this study, experiments were conducted on the T2IThangka dataset and the Oxford-102 dataset to evaluate the effectiveness of the proposed method. Comparisons were made with several existing mainstream text-to-image generation models, including the Attentional Generative Adversarial Network (AttnGAN)5, the Semantic-Spatial Aware Generative Adversarial Network (SSA-GAN)26, the Recurrent Affine Transformation Generative Adversarial Network (RAT-GAN)21, the Dynamic Memory Generative Adversarial Network (DM-GAN)7, the Deep Fusion Block-based Generative Adversarial Network (DF-GAN)6, the ControlGAN27 with Separated Attribute Space, and the Vector Quantized Diffusion (VQ-Diffusion)28 model operating in a discrete index space.
According to the experimental results of different models on the T2IThangka dataset presented in Table 1, the proposed method achieves improvements of 77.78% in IS, 8.55% in FID, and 6.61% in subjective evaluation compared to RAT-GAN, while also attaining the best performance in terms of FID and subjective evaluation. It is worth noting that although models such as AttnGAN, DM-GAN, and ControlGAN achieve relatively high IS scores, their FID performance is notably poor. This indicates that while the images generated by these models exhibit high contrast, they significantly deviate from real Thangka images in color usage and detail representation, losing the distinctive artistic style and cultural connotations of Thangka. Further analysis of the comparative models reveals that although AttnGAN utilizes an attention mechanism to capture key textual information, its semantic understanding of Thangka image descriptions remains insufficient, leading to the omission of critical details in the generated images. The discrete codebook in the VQ-VAE of VQ-Diffusion fails to accurately represent the intricate lines, unique colors, and sacred style of Thangka art, resulting in the loss or distortion of critical details during the compression and quantization process. DM-GAN optimizes detail generation through a memory module, but this approach shows limited effectiveness when applied to Thangka images. These results demonstrate that the HST-GAN model excels in the task of text-to-Thangka image generation. It more accurately integrates textual semantic information, enhances feature representation in key regions and channels, improves detail generation quality through hierarchical feature alignment, and optimizes the discriminator via a differentiable symmetric augmentation strategy, thereby producing Thangka images with higher authenticity and structural integrity. Although RAT-GAN employs recurrent affine transformations to handle long sequences, it may suffer from gradient instability during training, which constrains its generation quality. In contrast, SSA-GAN leverages a dynamic masking mechanism based on text embeddings to explicitly enhance or suppress certain features, ensuring that the generated images remain spatially consistent with the text descriptions. As a result, this feature modulation approach proves more effective in the task of text-to-Thangka image generation.
Table 1. Quantitative results of different models on the T2IThangka dataset, showing that HST-GAN outperforms other state-of-the-art methods
| Models | Size | IS↑ | FID↓ | Subjective Evaluation↑ |
|---|---|---|---|---|
| AttnGAN5 | 256×256 | 2.73 | 231.28 | 5.02 |
| DM-GAN7 | 256×256 | 2.50 | 115.32 | 6.93 |
| SSA-GAN26 | 256×256 | 1.93 | 93.67 | 7.13 |
| RAT-GAN21 | 256×256 | 1.17 | 96.13 | 6.96 |
| DF-GAN6 | 256×256 | 2.04 | 93.68 | 6.53 |
| ControlGAN27 | 256×256 | 2.51 | 149.45 | 6.22 |
| VQ-Diffusion28 | 256×256 | 1.32 | 138.15 | 5.34 |
| HST-GAN (ours) | 256×256 | 2.08 | 87.91 | 7.42 |
Bold values indicate the best performance in each column.
To further validate the effectiveness and generalization capability of the proposed method, we conducted additional experiments on the publicly available Oxford-102 dataset. As demonstrated in Table 2, the HST-GAN outperforms all comparative methods across quantitative evaluation metrics, including Inception Score (IS) and FID. These experimental results confirm that our approach consistently generates more text-consistent and visually natural images across different domains, while maintaining superior performance in cross-domain applications. The demonstrated generalization capability suggests that the proposed framework possesses considerable adaptability for diverse image generation tasks beyond Thangka synthesis.
Table 2. Quantitative results of HST-GAN on the Oxford-102 dataset, demonstrating the superiority of our method over the baseline approach
| Models | IS↑ | FID↓ |
|---|---|---|
| StackGAN4 | 3.20 | 55.28 |
| StackGAN++15 | 3.26 | 48.68 |
| DF-GAN6 | 3.80 | - |
| ControlGAN27 | 3.81 | - |
| RAT-GAN21 | 4.09 | - |
| HST-GAN (ours) | 4.37 | 15.98 |
Bold values indicate the best performance in each column.
Qualitative evaluation
This section presents a comprehensive qualitative evaluation of HST-GAN’s performance on the T2IThangka dataset. Figure 8 provides a visual comparison between our proposed method and five state-of-the-art approaches (SSA-GAN, RAT-GAN, VQ-Diffusion, ControlGAN, and DM-GAN) that demonstrated competitive quantitative results. The comparative analysis reveals HST-GAN’s superior performance across multiple critical dimensions of image generation quality.
[See PDF for image]
Fig. 8
Qualitative comparison on the T2IThangka dataset.
The input text descriptions are provided in the bottom row, with corresponding generated images by different methods aligned horizontally.
In terms of fine-grained feature representation, HST-GAN generates images with significantly sharper contours and richer textural details. As shown in the first two columns of row 6 (depicting Sakyamuni Buddha and Amitabha), the monastic robes rendered by HST-GAN exhibit meticulously detailed folds and naturally draping patterns that create authentic three-dimensional effects. In contrast, comparative methods display various limitations: (i) VQ-Diffusion’s outputs (row 1) exhibit an overall blurry appearance; (ii) DM-GAN (row 2) produces stiff robe textures lacking proper layering; while (iii) SSA-GAN (row 4) demonstrates noticeable edge blurring and texture deficiency.
Remarkably, HST-GAN maintains excellent performance when processing complex structures. For instance, the generated statue of Avalokiteshvara in meditation posture displays natural and fluid bodily proportions. This achievement contrasts sharply with RAT-GAN’s outputs (row 3), which frequently exhibit structural adhesion or loss of fine details. Furthermore, in rendering ritual implements, HST-GAN’s version of four-armed Avalokiteshvara (row 6, third column) presents clearly defined patterns on the wish-fulfilling jewel held in the primary hands, with its luster seamlessly integrated into the ambient lighting.
Particularly noteworthy is HST-GAN’s exceptional capability in facial expression generation. The produced Buddha images capture subtle spiritual qualities—slightly downcast eyes and compassionate smiles are vividly rendered with remarkable fidelity. This advantage becomes especially evident when compared to ControlGAN (row 5), whose facial outputs often display distortions or disproportioned features. Regarding image quality consistency, HST-GAN maintains stable high-resolution output, while DF-GAN exhibits localized pixelation artifacts at equivalent scales. The comprehensive comparison demonstrates HST-GAN’s significant advantages in three key aspects: (1) fine-grained detail preservation, (2) structural coherence maintenance, and (3) textural richness achievement. These combined strengths enable HST-GAN to deliver unparalleled performance in the specialized domain of Thangka image generation, establishing it as a superior solution for this culturally significant application.
To address the inherent limitations of quantitative evaluation metrics in assessing multidimensional aspects, we conducted a comprehensive user study involving 30 evaluators who performed multidimensional assessments on model-generated text-image pairs. The evaluation framework incorporated four critical dimensions: (1) image quality, (2) text-image consistency, (3) creativity, and (4) color performance, with all ratings recorded on a standardized 5-point Likert scale (ranging from 1 to 5). The collected evaluation data were visualized through an interactive faceted bubble chart matrix (Fig. 9), where the x-axis represents different generative models, the y-axis indicates rating scores, and the area of each bubble corresponds to the proportional distribution of evaluators selecting that particular rating.
[See PDF for image]
Fig. 9
Qualitative evaluation bubble chart.
The x-axis represents different models, the y-axis shows average scores across evaluation dimensions, with bubble sizes proportional to the number of raters for each score.
On the Oxford-102 dataset, HST-GAN demonstrates superior performance in shape representation, texture synthesis, and color reproduction, generating images that exhibit stronger semantic alignment with textual descriptions. As visually confirmed in Fig. 10, our proposed method shows significantly more pronounced qualitative advantages compared to RAT-GAN, particularly in producing geometrically regular shapes and naturally gradient color transitions that faithfully reflect the input text specifications.
[See PDF for image]
Fig. 10
Qualitative comparison on the Oxford-102 dataset.
Input text descriptions (bottom row) with corresponding generated images by different methods (horizontal alignment).
Ablation study
To systematically evaluate the individual contributions of each proposed component, we conducted ablation studies on the T2IThangka dataset using both Inception Score (IS) and FID as evaluation metrics. Specific results can be seen in Table 3, where the benchmark model is a RAT-GAN variant containing only the recurrent affine transformations.
Table 3. Ablation results of the proposed method on the T2IThangka dataset
| Model | PSCA | CBAM | HLFN | DiffAugment | IS↑ | FID↓ |
|---|---|---|---|---|---|---|
| 1 | - | - | - | - | 1.17 | 96.13 |
| 2 | √ | - | - | - | 1.65 | 89.03 |
| 3 | - | √ | - | - | 1.43 | 94.87 |
| 4 | - | - | √ | - | 1.55 | 93.62 |
| 5 | - | - | - | √ | 1.63 | 89.92 |
| 6 | √ | - | √ | - | 1.96 | 88.60 |
| HST-GAN (ours) | √ | - | √ | √ | 2.08 | 87.91 |
Bold values indicate the best performance in each column.
The incorporation of the PSCA module yields consistent improvements across all evaluation metrics, with the IS score increasing by 0.48 points and the FID metric decreasing significantly compared to the baseline model. These quantitative results demonstrate PSCA’s crucial capability in extracting discriminative spatial-channel features, enabling precise localization of key regions for accurate contour delineation of principal deities. As a comparison, we also introduced the well-established CBAM module for ablation studies. The results show that while CBAM demonstrates improvements over the baseline model, its performance gains are consistently inferior to those of our proposed PSCA module across all evaluation metrics. This comparative analysis highlights the superior specificity and effectiveness of the PSCA module in handling such tasks compared to general-purpose attention mechanisms. Subsequent integration of the HLFN module further enhances model performance, achieving a 0.38-point IS improvement and 2.51-point FID reduction. This performance gain stems from HLFN’s effective alignment mechanism between high-level semantic concepts and low-level visual features, which mitigates semantic information loss during the generation process and consequently improves the quality of synthesized Thangka images.

The introduction of differentiable symmetric augmentation (DiffAugment) in the discriminator contributes an additional 0.46-point IS improvement. This enhancement originates from DiffAugment’s dynamic augmentation strategy that operates on all input data during training, substantially improving data utilization efficiency while strengthening the discriminator’s feature discrimination capability. As illustrated in Fig. 11, the loss function trajectories demonstrate that the DiffAugment-enhanced model exhibits markedly more stable convergence characteristics compared to the baseline. The final integrated model, incorporating all proposed components, achieves optimal performance across all evaluation metrics. These comprehensive experimental results provide conclusive evidence for the effectiveness of our proposed methodological innovations in Thangka image generation.
[See PDF for image]
Fig. 11
Training dynamics of RAT-GAN versus RAT-GAN+DA.
Discriminator and generator losses are plotted for both models (RAT-GAN in blue; RAT-GAN+DA in red).
The experimental results presented in Fig. 12 reveal distinct performance characteristics across different model configurations. The baseline model generates notably coarse images, exhibiting significant deficiencies in both clothing texture reconstruction and edge detection capabilities, resulting in poorly differentiated visual features. Upon integrating the PSCA module, the generated images demonstrate marked improvements, with substantially reduced background noise and more concentrated weight distribution in key regions, leading to significantly enhanced boundary definition of target objects. The subsequent incorporation of HLFN enables effective fusion of low-level local details (e.g., facial features) with high-level semantic information (e.g., facial contour structures), thereby producing results with substantially enriched hierarchical representation. Finally, through the integration of DiffAugment in the discriminator, the model achieves stable generation quality across various evaluation metrics. This progressive enhancement demonstrates the complementary nature and cumulative benefits of each proposed component in our framework.
[See PDF for image]
Fig. 12
Representative results from the ablation study.
The comparison demonstrates the performance contribution of each key component in the proposed model.
Visualization of attention maps and feature maps
To validate the effectiveness of the proposed PSCA mechanism and HLFN, we conducted visualizations of both attention maps and multi-level feature representations. The first column of Fig. 13 displays the heatmaps generated by the attention mechanism. The results indicate that the heatmaps consistently highlight image regions corresponding to key elements in the text descriptions, demonstrating the module’s ability to identify semantically relevant spatial locations and guide the generation process toward salient areas. Columns 2 and 3 of Fig. 13 illustrate deep-level features F₁ (containing contour and structural information) and shallow-level features F₂ (containing texture and edge details), respectively. The fourth column shows that the fused feature F₃, obtained through cross-level integration, preserves both the global structure from deep features and fine-grained details from shallow features, indicating that the fusion strategy effectively combines multi-scale information and improves structural coherence and detail quality in the generated images. These visualizations confirm that the proposed methods contribute meaningfully to feature representation and integration.
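For reference, this kind of overlay can be produced by upsampling the attention map to the image resolution and blending it with the image; the following matplotlib sketch uses random placeholder data and is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image: np.ndarray, attn: np.ndarray, path: str = "heatmap.png"):
    """Overlay a low-resolution attention map on an image.

    image: (H, W, 3) array in [0, 1]; attn: (h, w) attention weights.
    """
    h, w = image.shape[:2]
    # Nearest-neighbor upsample the attention map to the image size
    attn_up = np.kron(attn, np.ones((h // attn.shape[0], w // attn.shape[1])))
    attn_up = (attn_up - attn_up.min()) / (np.ptp(attn_up) + 1e-8)
    plt.imshow(image)
    plt.imshow(attn_up, cmap="jet", alpha=0.5)   # semi-transparent heatmap
    plt.axis("off")
    plt.savefig(path, bbox_inches="tight")
    plt.close()

# Usage with random placeholders
overlay_attention(np.random.rand(256, 256, 3), np.random.rand(16, 16))
```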
[See PDF for image]
Fig. 13
Visualization of attention and hierarchical features.
From left to right: the attention map highlighting the model’s focus, followed by feature maps from successive layers of the network.
To further investigate the impact of individual modules on model performance, we conducted a comprehensive user study involving 20 evaluators (including both Thangka art specialists and general users). Participants were presented with five sets of text prompts, with each prompt generating four variant images from different model configurations. For each model variant’s output, evaluators selected the optimal results based on text-image alignment and visual esthetics. As shown in Fig. 14, the results indicate that each method proposed in this paper led to generated outputs that were more favorably received by users.
[See PDF for image]
Fig. 14
User study evaluation based on ablation experiment results.
The chart presents subjective human ratings of the visual quality for different model variants.
Discussion
To address the issues of insufficient fine-grained feature extraction, difficulties in multi-level feature fusion, and training instability in text-to-Thangka image generation, this paper proposes a cross-modal generative framework based on a parallel hybrid attention mechanism and differentiable symmetric augmentation for generating novel images that align with Thangka textual descriptions. First, a PSCA module is designed to effectively enhance detailed representation through a dual attention mechanism. Second, an HLFN is constructed to achieve effective alignment between high-level semantics and low-level details. To improve the diversity of generated images, differentiable symmetric augmentation (DiffAugment) is employed to suppress discriminator overfitting and enhance model generalization. Compared to existing methods, the proposed HST-GAN generates Thangka images that exhibit higher consistency with textual descriptions, as well as more accurate texture and color restoration. Both objective and subjective evaluations validate the reliability and superiority of the proposed method in text-to-image generation tasks.
Although this study demonstrates promising performance in local Thangka image generation, the model still exhibits layout biases and detail blurring when generating global, high-resolution Thangka images involving complex backgrounds and multi-object interactions. This is primarily attributed to the current model’s limited capability in understanding global spatial relationships. In the future, our research will advance along two main directions: First, at the architectural level, we will explore the deep integration of Transformer-based architectures or lightweight global attention mechanisms with existing CNN backbones. Second, at the generative paradigm level, we plan to combine the core attention and feature fusion mechanisms proposed in this study with emerging diffusion models, ultimately achieving higher-quality and more efficient Thangka image generation. Finally, we anticipate that the core methodology presented in this study may offer new technical paradigms for digital cultural heritage preservation and even scientific computing visualization.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (Grants No. 62061042 for the project “Research on Text Description Generation Methods for Iconic Thangka Images” and No. 62366047 for the project “Research on Key Technologies for Content Generation and Style Transfer in Thangka Murals”), the Fundamental Research Funds for the Central Universities (Grant No. 31920240070 for the project “Deep Learning-Based Thangka Pattern Generation from Chinese Text”), and the Special Funds for Guiding the Development of World-Class Universities (Disciplines) and Characteristic Research from the Central Universities.
Author contributions
W.H. and Y.Z. contributed to conceptualization, methodology, software, resources, formal analysis, writing-original draft, writing-review and editing. M.L. and Q.Z. contributed to supervision, investigation and data curation. All authors reviewed the manuscript.
Data availability
The dataset used in this study will be considered for public release at a later stage; it is currently available from [email protected] upon request.
Competing interests
The authors declare no competing interests.
Ethics approval and consent to participate
Ethical approval does not apply to this article.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Li, Y. & Liu, X. Sketch-based Thangka image retrieval. In Proc. 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). 2066–2070 (IEEE, 2021).
2. Goodfellow, I. J. et al. Generative adversarial nets. In Proc. 28th International Conference on Neural Information Processing Systems 2672–2680 (ACM, 2014).
3. Sheynin, S. et al. kNN-diffusion: image generation via large-scale retrieval. Preprint at https://doi.org/10.48550/arXiv.2204.02849 (2022).
4. Zhang, H. et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. IEEE International Conference on Computer Vision. 5908–5916. https://doi.org/10.1109/ICCV.2017.629 (IEEE, 2017).
5. Xu, T. et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 1316–1324. https://doi.org/10.1109/CVPR.2018.00143 (IEEE, 2018).
6. Tao, M. et al. DF-GAN: A simple and effective baseline for text-to-image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 16515–16525 (IEEE, 2022).
7. Zhu, M. et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5795–5803 (IEEE, 2019).
8. Kang M. et al. Scaling up GANs for text-to-image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10124–10134 (IEEE, 2023).
9. Saharia, C et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst.; 2022; 36, pp. 36479-36494.
10. Zheng, W. et al. (2024). CogView3: finer and faster text-to-image generation via relay diffusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision. ECCV 2024. Lecture Notes in Computer Science, vol 15135. 1–22 (Springer, 2024).
11. Lin, C., Lu, X. & Chen, G. RATLIP: generative adversarial CLIP text-to-image synthesis based on recurrent affine transformations. https://doi.org/10.48550/arXiv.2405.08114 (2024).
12. Tao M. et al. GALIP: generative adversarial CLIPs for text-to-image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14214–14223 (IEEE, 2023).
13. Yang, B et al. SF-GAN: semantic fusion generative adversarial networks for text-to-image synthesis. Expert Syst. Appl.; 2025; 262, 125583. [DOI: https://dx.doi.org/10.1016/j.eswa.2024.125583]
14. Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. CBAM: convolutional block attention module. In Computer Vision - ECCV (eds. Ferrari, V. et al.) 3–19 (Springer, 2018).
15. Zhang, H. et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 41, pp. 1947-1962. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2856256] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30010548]
16. Brock, A. et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis. International Conference on Learning Representations (ICLR), 2019.
17. Gulrajani, I. et al. Improved training of Wasserstein GANs. In Proc. 31st International Conference on Neural Information Processing Systems. Long Beach, California, USA: Curran Associates Inc., 5769–5779 (2017).
18. Zhang, H. et al. Consistency regularization for generative adversarial networks. In Proc. 6th International Conference on Learning Representations. (2020).
19. Zhao, S et al. Differentiable augmentation for data-efficient GAN Training. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 7559-7570.
20. Hou, L. et al. Augmentation-aware self-supervision for data-efficient GAN training. Adv. Neural Inf. Process. Syst.; 2023; 36, pp. 31601-31620.
21. Ye, S; Wang, H; Tan, M; Liu, F. Recurrent affine transformation for text-to-image synthesis. IEEE Trans. Multimed.; 2024; 26, pp. 462-473. [DOI: https://dx.doi.org/10.1109/TMM.2023.3266607]
22. Hu, W; Zhang, F; Zhao, Y. Thangka image captioning model with salient attention and local interaction aggregator. Herit. Sci.; 2024; 12, 407. [DOI: https://dx.doi.org/10.1186/s40494-024-01518-5]
23. Nilsback, M.-E. & Zisserman, A. Automated flower classification over a large number of classes. In Proc. Indian Conference on Computer Vision, Graphics and Image Processing. 722–729. https://doi.org/10.1109/ICVGIP.2008.47 (2008).
24. Salimans, T et al. Improved techniques for training GANs. Adv. Neural Inf. Process. Syst.; 2016; 29, pp. 2234-2242.
25. Heusel, M et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst.; 2017; 30, pp. 6629-6640.
26. Liao, W et al. Text-to-image generation with semantic-spatial aware GAN. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.; 2022; 2022, pp. 18166-18175. [DOI: https://dx.doi.org/10.1109/CVPR52688.2022.01765]
27. Lee, M; Seok, J. Controllable generative adversarial network. IEEE Access; 2019; 7, pp. 28158-28169. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2899108]
28. Gu, S et al. Vector quantized diffusion model for text-to-image synthesis. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.; 2022; 2022, pp. 10686-10696.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.