Text-to-Thangka generation must preserve both semantic accuracy and textural detail. Current methods struggle with fine-grained feature extraction and multi-level feature integration, and their discriminators overfit because Thangka data is limited. We present HST-GAN, a framework that combines parallel hybrid attention with differentiable symmetric augmentation. The architecture includes a Parallel Spatial-Channel Attention (PSCA) module for precise localization of deity facial features and ritual object textures, along with a Hierarchical Feature Fusion Network (HLFN) for multi-scale alignment. Differentiable Symmetric Augmentation (DiffAugment) dynamically adjusts discriminator inputs to prevent overfitting and improve generalization. On the T2IThangka dataset, HST-GAN achieves an Inception Score of 2.08 and reduces the Fréchet Inception Distance to 87.91; it also outperforms baselines on the Oxford-102 benchmark.
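To make the idea of parallel spatial-channel attention concrete, the following is a minimal NumPy sketch of a generic parallel attention block. It is not the authors' PSCA module, whose internals the abstract does not specify. The function name, the bottleneck weights `w1` and `w2`, the reduction ratio `r`, and the choice to sum the two branches are all assumptions made for illustration. The spatial branch is also simplified: it adds the channel-wise mean and max maps instead of passing them through a convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parallel_spatial_channel_attention(feat, w1, w2):
    """Generic parallel attention sketch (hypothetical, not the paper's PSCA).

    feat: feature map of shape (C, H, W)
    w1:   bottleneck weights of shape (C // r, C)
    w2:   expansion weights of shape (C, C // r)
    """
    # Channel branch: global average pool, two-layer bottleneck, per-channel gate.
    pooled = feat.mean(axis=(1, 2))                        # (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)                  # ReLU, (C // r,)
    channel_gate = sigmoid(w2 @ hidden)                    # (C,)

    # Spatial branch: pool over channels, per-location gate.
    # Assumption: mean and max maps are summed; real modules often use a conv here.
    spatial_gate = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)

    # "Parallel": both branches read the same input, and their outputs are summed.
    channel_out = feat * channel_gate[:, None, None]
    spatial_out = feat * spatial_gate[None, :, :]
    return channel_out + spatial_out

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = parallel_spatial_channel_attention(feat, w1, w2)
```

Because both gates lie strictly between 0 and 1, the output keeps the sign of each input activation and rescales it by a factor between 0 and 2. A sequential design would instead feed the channel-refined features into the spatial branch.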
Details
1 Northwest Minzu University, School of Mathematics and Computer Science, Lanzhou, China (GRID:grid.412264.7) (ISNI:0000 0001 0108 3408); Gansu Provincial Engineering Research Center of Multi-Modal Artificial Intelligence, Lanzhou, China (GRID:grid.412264.7); Northwest Minzu University, Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Key Laboratory of Linguistic and Cultural Computing of Ministry of Education, Chinese National Information Technology Research Institute, Lanzhou, China (GRID:grid.412264.7) (ISNI:0000 0001 0108 3408)
2 Northwest Minzu University, Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Key Laboratory of Linguistic and Cultural Computing of Ministry of Education, Chinese National Information Technology Research Institute, Lanzhou, China (GRID:grid.412264.7) (ISNI:0000 0001 0108 3408)