Introduction
With the development of sensors, people's daily lives are increasingly intertwined with computer imaging1, 2–3. However, a single sensor often lacks the ability to fully capture the rich information present in real-world scenes4,5. There is a growing need for infrared and visible information fusion technology6,7, which integrates data captured by multiple sensors8, facilitating advanced visual tasks such as military surveillance9, autonomous driving10, and medical image denoising11. With the development of deep learning, the technologies for solving infrared and visible image fusion (IVIF) tasks have also been refined12. However, a central challenge remains: how to retain the high contrast of infrared images while preserving as many texture details as possible from visible images, so that the fused images can be effectively applied to real-world tasks in production and daily life. Meanwhile, addressing the issue of information loss in infrared and visible image fusion is another key challenge, and designing an appropriate loss function that reduces information loss and produces a high-quality, information-rich fused image is a popular area of research. Existing fusion methods generally suffer from insufficient adaptive capability to different modal information in their overall network design, as well as limited ability to model complementary interactions between modalities, making it challenging to dynamically balance the unique attributes of infrared and visible images. While some methods have achieved partial success in pixel-level information retention, their designs often remain disconnected from the practical requirements of downstream tasks.
To address the above issues, we propose TPFusion, a texture-preserving and information loss minimization method for infrared and visible image fusion, which utilizes convolution operations with various receptive field sizes to fully extract information from the source images. It then performs contrast enhancement and texture detail preservation based on the unique properties of the source images. After obtaining the enhanced image features, the features are input into a dual-attention fusion module for integration. Finally, a multi-level decoder decodes the fused features to reconstruct the final fused image. To ensure optimal fusion results and avoid information loss, we design an information loss function based on image information entropy principles13. This information constraint is combined with structural similarity and gradient constraints to optimize the fusion model toward the best solution. The main contributions of this paper are as follows:
We propose a texture-preserving and information loss minimization method for infrared and visible image fusion, TPFusion, which aims to retain information from the source images as much as possible.
We design a loss function based on image information entropy principles to train TPFusion, which consists of information loss, structural similarity loss, and gradient loss functions.
Extensive qualitative and quantitative experiments have demonstrated that TPFusion is capable of producing fused images with high contrast and rich textural details, while simultaneously maximizing the preservation of information.
Related work
This section briefly introduces texture-preserving IVIF methods and specific-constraint-based IVIF methods.
Texture-preserving IVIF methods
In current research, numerous image fusion methods have been proposed, most of which focus on preserving texture information to maximize the detail representation and visual quality of the fused images14, 15, 16, 17, 18, 19, 20–21. Liu et al.22 propose an attention-guided global–local adversarial learning framework, where attention weight maps highlight essential regions across exposure levels to guide the coarse fusion. In parallel, Liu et al.23 design a cascaded architecture that integrates a feature learning module and a fusion mechanism, facilitating multi-scale feature extraction from multi-modal inputs and promoting the discovery of salient common structures for effective fusion. Yang et al.24 present CEFusion, a texture-preserving IVIF method that enhances image quality by leveraging cross-modal multi-granularity information interaction and edge guidance through a triple-branch scene fidelity module and a progressive cross-granularity interaction feature enhancement module. Li et al.25 design PMFANet to adaptively integrate features and optimize texture extraction, ensuring enhanced texture preservation. Meanwhile, the authors design ETPENet to capture fine-grained texture details while suppressing lighting interference. These networks ensure the fused image retains optimal texture, color, contrast, and brightness. Lu et al.26 propose DEFusion, a dual encoder image fusion method based on dense connectivity, which employs a progressive fusion strategy and a novel loss function to minimize gradient loss and preserve detailed information, ensuring superior fusion results. Wang et al.27 propose AMLCA, a network combining an additive cross-attention module (ACAM) and a wavelet convolution-guided transformer module (WCGTM), to enhance feature interaction and preserve local and global information in visible and infrared image fusion. By leveraging a multi-layer fusion strategy, AMLCA effectively extracts complementary information from both local details and global dependencies, improving overall fusion performance. Yi et al.28 propose the Text-IF method, which achieves degradation-aware and interactive image fusion through semantic text guidance. It excels in texture preservation and effectively alleviates the degradation issues in low-quality source images while enhancing multi-modal information fusion. Liu et al.29 propose the DCEvo method, which adopts a cross-dimensional evolutionary learning approach to effectively fuse key information from infrared and visible images. By jointly learning features across multiple dimensions, DCEvo extracts discriminative features, ensuring the preservation of details during fusion while reducing the impact of noise and degradation. Additionally, DCEvo uses an evolutionary learning strategy to adaptively optimize the fusion process, improving fusion quality and enhancing robustness and adaptability in complex scenarios.
Moreover, existing texture-preserving IVIF methods do not fully consider the unique information of infrared and visible images. Especially under dark nighttime conditions, people desire fused results that contain both texture details and high contrast, yet existing IVIF methods rarely provide solutions that effectively balance these two aspects.
Specific constraints based IVIF methods
In unsupervised image fusion tasks, the use of specific constraints can significantly improve the fusion results, leading to enhanced image quality30, 31–32. To address the performance limitations under the constraints of realistic visual quality and structural fidelity, Liu et al.33 propose HALDeR, a hierarchical attention-guided learning framework for multi-modal image fusion. Their method imposes multi-level constraints, such as illumination-aware attention, detail refinement, and adversarial realism, to ensure vivid color and faithful structure. Yi et al.34 propose UIRGBfuse, which adopts a unified fusion framework incorporating an IR-RGB joint fusion learning strategy, R, G, and B fusion losses, a frequency domain compensation feature fusion module, and a hybrid CNN-Transformer deep feature refinement module. This approach ensures natural fusion, preserves source image details, enhances color fidelity, and improves target detection performance in infrared and RGB visible image fusion. Yang et al.35 propose a novel multi-scale fusion network, SADFusion, which integrates a domain-specific framework and salient-aware loss to guide the model in balancing the preservation of texture details and thermal targets. The salient-aware loss utilizes salient modality features as pixel-to-pixel intensity and gradient maps, helping the network trade off the necessary information while effectively fusing complementary features at multiple scales. Wang et al.36 propose a dual-path residual attention fusion network, DRAFusion, which incorporates a novel feature adaptation loss function to control the proportion of key information from infrared and visible images during training. This loss function helps maintain a better balance between the fused result and the source images. Existing specific-constraint-based IVIF methods often use a specific loss to optimize the fusion results, but they may overlook the structural differences between the source images, resulting in overly smooth fused images or the loss of key target information in certain regions. More refined loss functions therefore need to be designed that can adaptively adjust the weight of information from different modalities (infrared and visible), ensuring that thermal target information is preserved while details and textures are effectively retained.
Existing IVIF networks often struggle to accommodate the distinct characteristics of infrared and visible modalities and to model their complementary relationships, making it difficult to balance the high thermal contrast of infrared images with the rich spatial detail of visible images. Although some methods succeed in preserving pixel-level information, their architectures rarely meet the practical requirements of downstream tasks; in pursuing visually pleasing outputs, many approaches sacrifice crucial details needed for object detection, semantic segmentation, and other high-level applications. Compared with the latest IVIF method, CDDFuse37, the approach proposed in this paper directly overcomes these shortcomings through a modality-aware design that deploys two independent multi-scale feature extraction branches, is trained end-to-end in a single stage, and is guided by a concise composite loss that simultaneously minimizes information loss while preserving gradient fidelity and structural similarity.
Methods
In this section, we introduce the architecture of TPFusion, which mainly consists of a multi-scale feature extraction module, a texture enhancement module, a contrast enhancement module, and a dual-attention fusion module.
The overall architecture
As Fig. 1 shows, TPFusion takes an infrared image and a visible image as inputs and outputs the fused image ,
1
[See PDF for image]
Fig. 1
The architecture of TPFusion, which consists of multi-scale feature extraction module (MSFEM), texture enhancement module (TEM), contrast enhancement module (CEM), and dual-attention fusion module (DAFM).
In TPFusion, the initial features of the two input images are first extracted using the multi-scale feature extraction module (MSFEM), respectively, i.e.,
2
3
where denotes the multi-scale feature extraction module, denotes batch normalization with a ReLU activation function and represents the initial features output from MSFEM. By leveraging the multi-scale feature extraction, the fusion model can effectively handle variations in texture, contrast, and illumination between the infrared and visible images. Then, the features are respectively input into the texture enhancement module (TEM) and the contrast enhancement module (CEM) to enhance texture and contrast, i.e.,
4
After passing through the TEM and the CEM, the enhanced features are input into the dual-attention fusion module (DAFM) for feature fusion and decoding to generate the final fused image , i.e.,
5
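To make the data flow concrete, the following is a minimal PyTorch sketch of the MSFEM as described by the layer settings in Table 1 (7 × 7, 5 × 5, and 3 × 3 convolutions, each followed by batch normalization and ReLU). Class and variable names are illustrative, and whether the infrared and visible branches share weights is an assumption here (two separate instances are used).

```python
import torch
import torch.nn as nn

class MSFEM(nn.Module):
    """Multi-scale feature extraction: 7x7 -> 8, 5x5 -> 16, 3x3 -> 32 channels (Table 1)."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=7, stride=1, padding=3),
            nn.BatchNorm2d(8), nn.ReLU(inplace=True),
            nn.Conv2d(8, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# Independent branches for the two modalities (weight sharing is not assumed).
msfem_ir, msfem_vi = MSFEM(), MSFEM()
ir = torch.randn(1, 1, 256, 256)   # infrared input
vi = torch.randn(1, 1, 256, 256)   # visible (Y channel) input
feat_ir, feat_vi = msfem_ir(ir), msfem_vi(vi)   # each: 1 x 32 x 256 x 256
```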
Texture and contrast enhancement module
In order to obtain high-quality fused images, the feature enhancement operations need to effectively preserve both the texture details of visible images and the contrast of infrared images simultaneously. To achieve this goal, TPFusion designs a texture enhancement module (TEM) based on the Laplacian operator. The Laplacian operator highlights regions of intensity discontinuity, making it particularly effective for extracting fine texture details and edges in visible images. By integrating this operation into the TEM, TPFusion enhances its ability to capture and preserve high-frequency details, which are crucial for maintaining the visual fidelity of the fused images.
At the same time, the contrast enhancement module (CEM) increases the receptive field to help the model capture broader contextual information. In the context of infrared images, contrast is often associated with thermal differences and structural information. By expanding the receptive field, the CEM can better capture global contrast variations and integrate them into the fused images, retaining more contrast information from the infrared images and thereby improving the quality of the fused images. The initial features output from the MSFEM are fed into the TEM and the CEM, respectively.
6
7
8
where denotes the enhanced features, denotes concatenation along the channel dimension, and denotes the Laplacian operator, which is commonly used38,39 to effectively capture high-frequency information. The enhanced features are then input into the dual-attention fusion module (DAFM) for feature fusion and feature decoding.
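A hedged PyTorch sketch of the TEM and CEM consistent with Eqs. (6)–(8) and the layer settings in Table 1 is given below. The placement of the Laplacian after the 1 × 1 convolution follows Table 1, while the fixed 4-neighbour kernel used to realize it is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, k):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class Laplacian(nn.Module):
    """Fixed channel-wise Laplacian filtering (4-neighbour kernel assumed)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
        self.register_buffer("kernel", k.repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        return F.conv2d(x, self.kernel, padding=1, groups=self.channels)

class TEM(nn.Module):
    """Texture enhancement: 5x5 -> 3x3 -> 1x1 convolutions followed by the Laplacian (Table 1)."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(ch, ch, 5), conv_bn_relu(ch, ch, 3),
                                  nn.Conv2d(ch, ch, 1))
        self.lap = Laplacian(ch)

    def forward(self, x):
        return self.lap(self.body(x))

class CEM(nn.Module):
    """Contrast enhancement: the same convolutional stack without the Laplacian branch."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(ch, ch, 5), conv_bn_relu(ch, ch, 3),
                                  nn.Conv2d(ch, ch, 1))

    def forward(self, x):
        return self.body(x)
```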
Dual-attention fusion module
In the DAFM, the enhanced features are passed through a channel-spatial attention operation and concatenated along the channel dimension to obtain the features , which effectively capture both channel-wise and spatial-wise dependencies, ensuring that the most informative features are emphasized while redundant or less relevant information is suppressed. Finally, these features undergo global average pooling (GAP), max pooling (Max), and a Softmax operation to obtain the feature weight matrices , , , i.e.,
9
10
11
where ChaSpa(·) denotes the spatial and channel attention mechanism. Then, the features and are element-wise multiplied by the weight matrices of each branch to obtain the weighted features and ,
12
13
Finally, they are element-wise added to get the fused features , which is used to reconstruct the final fused image ,
14
15
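Because the weighting details of Eqs. (9)–(15) are only partially recoverable from the text, the sketch below approximates the DAFM: pooled global statistics are softmax-normalized into per-branch weights, the weighted features are combined, and a three-layer decoder with a sigmoid output (Table 1: 192 → 64 → 16 → 1) reconstructs the fused image. The channel bookkeeping (64 channels per modality after TEM/CEM concatenation, giving the 192-channel decoder input) is an assumption.

```python
import torch
import torch.nn as nn

class DAFM(nn.Module):
    """Simplified dual-attention fusion and decoding (decoder sizes from Table 1)."""
    def __init__(self, ch_per_branch: int = 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(3 * ch_per_branch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, f_ir: torch.Tensor, f_vi: torch.Tensor) -> torch.Tensor:
        # Per-branch scalar weights from global average pooling, softmax-normalized;
        # this stands in for the paper's GAP / Max / Softmax weight matrices.
        stats = torch.stack([f_ir.mean(dim=(1, 2, 3)), f_vi.mean(dim=(1, 2, 3))], dim=1)
        w = torch.softmax(stats, dim=1)
        w_ir, w_vi = w[:, 0].view(-1, 1, 1, 1), w[:, 1].view(-1, 1, 1, 1)
        fused = torch.cat([w_ir * f_ir, w_vi * f_vi, f_ir + f_vi], dim=1)  # 192 channels
        return self.decoder(fused)

# Example: 64-channel per-modality features (assumed TEM + CEM concatenation).
f_ir, f_vi = torch.randn(1, 64, 256, 256), torch.randn(1, 64, 256, 256)
fused_image = DAFM()(f_ir, f_vi)   # 1 x 1 x 256 x 256
```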
Loss function
To comprehensively optimize TPFusion, we design an information-content-based loss function, which consists of a structural similarity loss , a gradient loss , and an information loss ,
16
where , , are the hyperparameters that weight the three loss functions. In Eq. (16), the structural similarity loss is used to express the degree of distortion of luminance and structure in the fused image,
17
The gradient loss makes the fused image have clearer edge information; it is designed using the first-order gradients of the fused image and the original images,
18
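The exact form of Eq. (18) is not reproduced here; a common first-order gradient loss consistent with this description is sketched below, under the assumption of Sobel gradients and an element-wise maximum over the source gradients.

```python
import torch
import torch.nn.functional as F

def sobel_grad(img: torch.Tensor) -> torch.Tensor:
    """Absolute first-order gradient magnitude of a B x 1 x H x W image (Sobel assumed)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return gx.abs() + gy.abs()

def gradient_loss(fused, ir, vi):
    # Encourage the fused gradients to follow the stronger of the two source gradients.
    target = torch.maximum(sobel_grad(ir), sobel_grad(vi))
    return F.l1_loss(sobel_grad(fused), target)
```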
Inspired by local mutual information maximization40,41, the information loss makes the fused image retain important information from the source images, which helps to reduce information loss during image fusion and ensures that the fused image accurately reflects the features of the original images,
19
where denotes the mutual information between two images and denotes the image entropy; they are used to quantify the similarity between two images at the information level. The mutual information between the fused image and each original image is first calculated to measure how much of the original information is retained in the fused image, and each mutual information term is then weighted by the inverse square root of the entropy of the infrared or visible image, which adjusts the contribution of each image's mutual information based on its entropy. Higher entropy indicates less information loss from the original images. Finally, the weighted sum of mutual information is normalized by half of the fused image's entropy.
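As a reference for Eq. (19), the following NumPy sketch computes the quantities described above: the mutual information between the fused image and each source, weighted by the inverse square root of the corresponding source entropy and normalized by half the entropy of the fused image. It is a non-differentiable, histogram-based illustration of the formulation, not the trainable loss used in the paper; images are assumed to be normalized to [0, 1].

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins, range=[[0, 1], [0, 1]])
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum())

def information_score(fused, ir, vi):
    # Entropy-weighted mutual information, normalized by half the fused image's entropy.
    weighted = (mutual_information(fused, ir) / np.sqrt(entropy(ir))
                + mutual_information(fused, vi) / np.sqrt(entropy(vi)))
    return weighted / (0.5 * entropy(fused))   # higher means less information lost
```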
Experiments and results
In this section, we conduct experiments to evaluate the performance of TPFusion on the image fusion task and on downstream tasks.
Dataset and implementation details
The MSRS42, TNO43 and M3FD44 datasets are commonly used in IVIF. In our experiments, we select 1083 pairs of images from the MSRS training set to train TPFusion, which is then qualitatively and quantitatively tested on the TNO, MSRS, and M3FD datasets. During training on the MSRS dataset and testing on the MSRS and M3FD datasets, the RGB images are first converted into the YCbCr color space. The Y channel of the RGB image is fused with the corresponding infrared image. After the fusion, the fused image is reconcatenated with the Cb and Cr channels along the channel dimension, and the resulting image is finally converted back to the RGB color space. From the 1083 pairs of images in the MSRS training set, we crop the images and obtain 28,000 pairs of training images of size 256 × 256, which are normalized into [0, 1]. In the training phase, the initial learning rate is 1e−4, the batch size is set to 4, and the Adam optimizer is employed. The model is trained for 30 epochs, and the three hyperparameters in the loss function are set to 10, 100, and 1, respectively. The experiments are conducted with the PyTorch framework on an Intel(R) Core(TM) i9-14900HX CPU and an NVIDIA GeForce RTX 4080 Laptop GPU. The parameter settings of the TPFusion network model are shown in Table 1.
Table 1. Parameter settings of TPFusion in our experiments.
Blocks | Layers | Input size | Structure | Output size |
---|---|---|---|---|
MSFEM | Convolution | 256 × 256 × 1 | Kernel size 7 × 7, stride 1, padding 3 | 256 × 256 × 8 |
BN + ReLU | 256 × 256 × 8 | – | 256 × 256 × 8 | |
Convolution | 256 × 256 × 8 | Kernel size 5 × 5, stride 1, padding 2 | 256 × 256 × 16 | |
BN + ReLU | 256 × 256 × 16 | – | 256 × 256 × 16 | |
Convolution | 256 × 256 × 16 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 32 | |
BN + ReLU | 256 × 256 × 32 | – | 256 × 256 × 32 | |
TEM | Convolution | 256 × 256 × 32 | Kernel size 5 × 5, stride 1, padding 2 | 256 × 256 × 32 |
BN + ReLU | 256 × 256 × 32 | – | 256 × 256 × 32 | |
Convolution | 256 × 256 × 32 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 32 | |
BN + ReLU | 256 × 256 × 32 | – | 256 × 256 × 32 | |
Convolution | 256 × 256 × 32 | Kernel size 1 × 1, stride 1, padding 0 | 256 × 256 × 32 | |
Laplacian | 256 × 256 × 32 | – | 256 × 256 × 32 | |
CEM | Convolution | 256 × 256 × 32 | Kernel size 5 × 5, stride 1, padding 2 | 256 × 256 × 32 |
BN + ReLU | 256 × 256 × 32 | – | 256 × 256 × 32 | |
Convolution | 256 × 256 × 32 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 32 | |
BN + ReLU | 256 × 256 × 32 | – | 256 × 256 × 32 | |
Convolution | 256 × 256 × 32 | Kernel size 1 × 1, stride 1, padding 0 | 256 × 256 × 32 | |
DAFM | Convolution | 256 × 256 × 192 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 64 |
BN + ReLU | 256 × 256 × 64 | – | 256 × 256 × 64 | |
Convolution | 256 × 256 × 64 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 16 | |
BN + ReLU | 256 × 256 × 16 | – | 256 × 256 × 16 | |
Convolution | 256 × 256 × 16 | Kernel size 3 × 3, stride 1, padding 1 | 256 × 256 × 1 | |
Sigmoid | 256 × 256 × 1 | – | 256 × 256 × 1
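For completeness, a minimal, self-contained sketch of the training configuration stated above (Adam optimizer, initial learning rate 1e−4, batch size 4, 30 epochs, loss weights 10/100/1) is given below. The network and the individual loss terms are placeholders standing in for the modules and losses described in this paper, and the random tensors merely mimic the cropped 256 × 256 training pairs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder fusion network: concatenates the two inputs and predicts a fused image.
model = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

def composite_loss(fused, ir, vi, alpha=10.0, beta=100.0, gamma=1.0):
    # Placeholder terms; the real SSIM, gradient, and information losses go here.
    return (alpha * (fused - vi).abs().mean()
            + beta * (fused - ir).abs().mean()
            + gamma * fused.var())

pairs = TensorDataset(torch.rand(8, 1, 256, 256), torch.rand(8, 1, 256, 256))
loader = DataLoader(pairs, batch_size=4, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(30):
    for ir, vi in loader:
        fused = model(torch.cat([ir, vi], dim=1))
        loss = composite_loss(fused, ir, vi)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```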
In the experiments, we compare our TPFusion with thirteen previous deep-learning-based IVIF methods, i.e., DDcGAN45, PMGI46, NestFuse47, RFN-Nest48, SDNet49, MFEIF23, TarDAL44, DDFM50, DATFuse51, CDDFuse37, BTSFusion18, SFCFusion52, and PromptFusion53.
We compare the performance of these image fusion methods using eight objective evaluation metrics, including mutual information (MI)40, visual information fidelity (VIF)54, average gradient (AG)55, correlation coefficient (CC)56, sum of correlations of differences (SCD)57, entropy (EN)58, QAB/F59, and spatial frequency (SF)60.
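Two of these metrics can be stated compactly; the sketch below gives reference NumPy implementations of entropy (EN) and spatial frequency (SF) for an 8-bit single-channel image, following their standard definitions. The remaining metrics are computed per their cited papers and are not reproduced here.

```python
import numpy as np

def en(img: np.ndarray) -> float:
    """Shannon entropy of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sf(img: np.ndarray) -> float:
    """Spatial frequency: sqrt of summed mean squared row/column differences."""
    img = img.astype(np.float64)
    rf = np.mean((img[:, 1:] - img[:, :-1]) ** 2)   # squared row frequency
    cf = np.mean((img[1:, :] - img[:-1, :]) ** 2)   # squared column frequency
    return float(np.sqrt(rf + cf))
```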
Experimental results
Experimental results on the TNO dataset
Table 2 presents the quantitative results on the TNO dataset, where the best result is in bold with an underline, the second-best result is in bold, and the third-best result is underlined. TPFusion outperforms the other methods in terms of AG, CC, and QAB/F, indicating that its fusion results contain richer texture details and clearer object edges. Meanwhile, TPFusion ranks second in MI, VIF, EN, SCD, and SF among the 14 methods, which indicates that TPFusion performs excellently in extracting important information from the original images and effectively distinguishes useful information from noise.
Table 2. Quantitative comparison of methods on the TNO dataset.
Methods | Year | MI | VIF | AG | CC | SCD | EN | QAB/F | SF |
---|---|---|---|---|---|---|---|---|---|
DDcGAN45 | 2020 | 1.5107 | 0.4451 | 3.9118 | 0.4375 | 1.6226 | 6.9104 | 0.3414 | 12.1566 |
PMGI46 | 2020 | 1.9975 | 0.4635 | 2.5859 | 0.4141 | 1.5263 | 7.0255 | 0.4140 | 8.7531 |
NestFuse47 | 2020 | 2.0934 | 0.5539 | 3.8253 | 0.4468 | 1.5307 | 7.0165 | 0.5053 | 9.7701 |
RFN-Nest48 | 2021 | 1.7597 | 0.4720 | 1.6669 | 0.4480 | 1.6041 | 6.9764 | 0.3334 | 5.8759 |
SDNet49 | 2021 | 1.9096 | 0.4927 | 3.6285 | 0.4111 | 1.5640 | 6.7127 | 0.4337 | 11.7009 |
MFEIF23 | 2021 | 1.8538 | 0.5384 | 2.9083 | 0.4416 | 1.5560 | 6.6650 | 0.4532 | 7.1030 |
TarDAL44 | 2022 | 2.0013 | 0.5333 | 3.9112 | 0.4175 | 1.3856 | 6.8284 | 0.3996 | 10.5084 |
DDFM50 | 2023 | 1.7577 | 0.3185 | 3.4528 | 0.4543 | 1.5455 | 6.8552 | 0.2440 | 7.8717 |
DATFuse51 | 2023 | 2.3187 | 0.5201 | 2.5904 | 0.4221 | 1.4933 | 6.4818 | 0.4976 | 9.6988 |
CDDFuse37 | 2023 | 2.0807 | 0.5419 | 3.8687 | 0.4469 | 1.5686 | 6.9836 | 0.5075 | 11.5175 |
BTSFusion18 | 2024 | 1.3757 | 0.5064 | 4.2795 | 0.4406 | 1.6335 | 6.7130 | 0.4114 | 11.2330 |
SFCFusion52 | 2024 | 1.6566 | 0.5340 | 3.5832 | 0.4665 | 1.6010 | 7.1110 | 0.4296 | 10.7048 |
PromptFusion53 | 2024 | 2.9755 | 0.5709 | 4.1560 | 0.4682 | 1.6882 | 6.9597 | 0.5174 | 11.9845 |
TPFusion(ours) | – | 2.4564 | 0.5687 | 4.3945 | 0.4687 | 1.6801 | 7.0514 | 0.5213 | 12.1247
Figure 2 shows the qualitative comparison of methods on the TNO dataset. We magnify a local region in the image for better comparison. In Fig. 2, TPFusion not only presents clear infrared human targets but also effectively extracts rich texture detail information. Meanwhile, TPFusion enhances the clarity of helicopter texture details, and its fusion results also contain cleaner object edges and richer contrast. Compared to the other methods, TPFusion demonstrates its powerful ability in extracting valuable information while efficiently processing complex data, showcasing its superior performance in information retrieval. This strength is evident in its fusion results, where the fused image displays enhanced contrast and sharper texture details, preserving the rich and clear object features even under challenging lighting conditions. Additionally, TPFusion ensures that no critical information is lost, even in the darkest regions. These advantages make TPFusion highly effective in low-light environments, significantly improving its ability to recover fine details in poorly lit areas.
[See PDF for image]
Fig. 2
Qualitative comparison of methods on the TNO dataset.
Experimental results on the MSRS dataset
Table 3 presents the quantitative results on the MSRS dataset. TPFusion achieves the best results in AG and CC, which indicates that TPFusion generates high-quality fused images that contain rich information and are highly similar to the structure of the original images. TPFusion ranks second in EN, which demonstrates that TPFusion can effectively preserve information while maintaining the uniqueness of the multi-modal data. It minimizes the issue of information loss and generates high-quality fused images.
Table 3. Quantitative comparison of methods on the MSRS dataset.
Methods | Year | MI | VIF | AG | CC | SCD | EN | QAB/F | SF |
---|---|---|---|---|---|---|---|---|---|
DDcGAN45 | 2020 | 1.8444 | 0.5747 | 4.5061 | 0.6303 | 1.1485 | 7.3152 | 0.3640 | 11.9985 |
PMGI46 | 2020 | 2.1880 | 0.7230 | 2.9615 | 0.7301 | 1.0618 | 6.2421 | 0.4172 | 8.2757 |
NestFuse47 | 2020 | 2.4215 | 0.7290 | 3.0975 | 0.6141 | 1.0907 | 6.5043 | 0.5487 | 9.7199 |
RFN-Nest48 | 2021 | 2.4599 | 0.6558 | 2.1151 | 0.7283 | 1.2313 | 6.1958 | 0.3940 | 6.1634 |
SDNet49 | 2021 | 1.7245 | 0.5015 | 2.6762 | 0.7026 | 0.7477 | 5.2460 | 0.3797 | 8.6730 |
MFEIF23 | 2021 | 2.1305 | 0.7592 | 2.6049 | 0.6330 | 1.0275 | 5.8004 | 0.5533 | 8.0698 |
TarDAL44 | 2022 | 2.3532 | 0.6732 | 3.1161 | 0.6262 | 1.2068 | 6.3462 | 0.4258 | 9.8841 |
DDFM50 | 2023 | 2.6548 | 0.7429 | 2.5109 | 0.6588 | 1.4492 | 6.0147 | 0.4652 | 7.0861 |
DATFuse51 | 2023 | 2.8922 | 0.6909 | 3.5739 | 0.6887 | 1.2102 | 6.4796 | 0.5050 | 10.9269 |
CDDFuse37 | 2023 | 2.3520 | 0.8510 | 3.7479 | 0.6007 | 1.1210 | 6.7010 | 0.5928 | 11.5564 |
BTSFusion18 | 2024 | 2.2816 | 0.5675 | 4.1242 | 0.6994 | 1.1693 | 6.2910 | 0.4900 | 11.6698 |
SFCFusion52 | 2024 | 3.1097 | 0.7880 | 4.1417 | 0.6878 | 1.2611 | 6.4411 | 0.6402 | 11.7680 |
PromptFusion53 | 2024 | 3.6505 | 0.9142 | 3.5226 | 0.5984 | 1.5124 | 6.6456 | 0.5875 | 9.2697 |
TPFusion(ours) | – | 2.9861 | 0.7901 | 4.9561 | 0.7992 | 1.2787 | 7.1883 | 0.5909 | 11.5292 |
Figure 3 shows the qualitative comparison of methods on the MSRS dataset. In Fig. 3, TPFusion not only ensures the preservation of clear texture details but also achieves a more distinct and noticeable contrast retention. As observed in the localized areas, TPFusion effectively preserves the fine object texture details. Compared to the other methods, TPFusion shows a significant advantage in global illumination retention. Meanwhile, in the case of the person walking away from the camera, TPFusion effectively preserves and enhances the texture details of the person's clothing on the back. The clear presentation of the distant car and the preservation of global brightness further demonstrate that TPFusion enhances the feature clarity without sacrificing the broader scene background.
[See PDF for image]
Fig. 3
Qualitative comparison of methods on the MSRS dataset.
Experimental results on the M3FD dataset
Table 4 presents the quantitative comparison of methods on the M3FD dataset. TPFusion achieves the best fusion results in terms of SCD, EN, QAB/F, and SF. Meanwhile, TPFusion ranks third in terms of MI and VIF, indicating its capability in preserving texture details and contrast. Compared to the other methods, TPFusion not only generates high-quality fused images but also ensures that the edge information is clearer and the details are more abundant.
Table 4. Quantitative comparison of methods on the M3FD dataset.
Methods | Year | MI | VIF | AG | CC | SCD | EN | QAB/F | SF |
---|---|---|---|---|---|---|---|---|---|
DDcGAN45 | 2020 | 2.5700 | 0.6192 | 5.7327 | 0.5400 | 1.6733 | 7.0738 | 0.4812 | 14.1103 |
PMGI46 | 2020 | 3.1750 | 0.5917 | 3.3321 | 0.5219 | 1.5461 | 6.9066 | 0.4372 | 9.3976 |
NestFuse47 | 2020 | 3.4870 | 0.5725 | 3.7403 | 0.5261 | 1.5788 | 6.8017 | 0.5355 | 11.1034 |
RFN-Nest48 | 2021 | 2.8789 | 0.5833 | 2.8551 | 0.5724 | 1.7263 | 6.8636 | 0.4059 | 7.7208 |
SDNet49 | 2021 | 3.2314 | 0.5823 | 4.7340 | 0.5009 | 1.5226 | 6.8357 | 0.5289 | 13.6008 |
MFEIF23 | 2021 | 3.1406 | 0.6242 | 3.0979 | 0.5120 | 1.6472 | 6.6747 | 0.4864 | 8.7762 |
TarDAL44 | 2022 | 3.1780 | 0.6089 | 4.2255 | 0.5128 | 1.5585 | 7.0977 | 0.4193 | 12.6157 |
DDFM50 | 2023 | 2.8573 | 0.6086 | 3.1883 | 0.5785 | 1.6491 | 6.7181 | 0.4555 | 9.1628 |
DATFuse51 | 2023 | 4.1202 | 0.6425 | 3.4366 | 0.4931 | 1.2838 | 6.4029 | 0.4919 | 10.4482 |
CDDFuse37 | 2023 | 3.7173 | 0.6316 | 4.8741 | 0.5355 | 1.6475 | 6.8996 | 0.6148 | 13.9768 |
BTSFusion18 | 2024 | 2.4817 | 0.5551 | 6.6066 | 0.5474 | 1.5384 | 6.7485 | 0.4993 | 14.7871 |
SFCFusion52 | 2024 | 3.4796 | 0.6749 | 4.9860 | 0.5503 | 1.7549 | 6.9953 | 0.6075 | 14.6383 |
PromptFusion53 | 2024 | 4.2918 | 0.7944 | 4.5126 | 0.5033 | 1.4875 | 6.7955 | 0.6133 | 13.5892 |
TPFusion(ours) | – | 3.8629 | 0.6541 | 6.4173 | 0.5453 | 1.7826 | 7.3060 | 0.6361 | 14.8767
Figure 4 shows two example fused images from the M3FD dataset. Under extreme conditions such as heavy fog, where visible images cannot provide rich texture details and clear edge information, TPFusion is still able to generate high-quality fusion results. From both the global and local fusion outcomes, it can be observed that TPFusion effectively preserves and enhances edges and contrast while achieving near-complete dehazing. These findings not only demonstrate the exceptional performance of TPFusion but also confirm its robustness in challenging scenarios. Meanwhile, TPFusion preserves as much contrast and texture detail information as possible in the fused images. This demonstrates that our method can reliably handle multi-modal image fusion tasks even in extreme environments. Furthermore, compared with the other methods, the overall structure of our fusion results is closer to the original images, minimizing information loss.
[See PDF for image]
Fig. 4
Qualitative comparison of methods on the M3FD dataset.
Ablation experiments
We conduct ablation experiments to evaluate the contributions of TPFusion's different components: the network blocks, the loss function components, and the hyperparameters.
Multi-scale feature extraction module
TPFusion employs the MSFEM to enhance its ability to capture both local and global information. To verify its effectiveness, we randomly select 30 pairs of images from the TNO dataset for experimental validation and retrain TPFusion without the MSFEM. The experimental results are presented in Table 5. One can notice that removing the MSFEM reduces the MI, AG, EN, and SF values of TPFusion by 0.1891, 0.1252, 0.2177, and 2.0185, respectively, which indicates that the MSFEM enhances TPFusion's feature extraction capability and effectively prevents information loss during fusion.
Table 5. Quantitative comparison of methods on module based ablation experiments on the TNO dataset.
Methods | MI | VIF | AG | CC | SCD | EN | QAB/F | SF |
---|---|---|---|---|---|---|---|---|
w/o MSFEM | 2.2673 | 0.5174 | 4.2693 | 0.4220 | 1.4996 | 6.8337 | 0.3561 | 10.1062 |
w/o CEM | 2.1973 | 0.4671 | 4.3395 | 0.4099 | 1.3794 | 6.8562 | 0.4392 | 11.9584 |
w/o TEM | 1.9508 | 0.4379 | 4.1093 | 0.3855 | 1.2954 | 7.0272 | 0.2720 | 13.1898 |
w/o TEM&CEM | 1.8530 | 0.4356 | 4.2176 | 0.4698 | 1.6724 | 7.0776 | 0.3200 | 11.0951 |
w/o DAM | 2.2759 | 0.4099 | 4.1363 | 0.4453 | 1.5352 | 6.2962 | 0.4201 | 9.6174 |
TPFusion | 2.4564 | 0.5687 | 4.3945 | 0.4665 | 1.6801 | 7.0514 | 0.5213 | 12.1247 |
Figure 5 shows the examples of fusion results generated by TPFusion with and without the MSFEM. The images generated with the MSFEM display richer texture details and higher contrast than the images generated without the MSFEM, further confirming that the MSFEM improves TPFusion’s feature extraction capability and helps reduce information loss.
[See PDF for image]
Fig. 5
Qualitative comparison of methods on module based ablation experiments of TPFusion on the Kaptein_1123 data from the TNO dataset.
Texture enhancement module and contrast enhancement module
To enhance the model’s ability to capture texture and contrast from the original images, TPFusion employs the TEM and the CEM. To verify their effectiveness, we conduct ablation experiments with three variants. The first variant removes the TEM, the second variant removes the CEM, and the third variant removes both the TEM and CEM.
Table 5 shows that the TEM and the CEM enable TPFusion to extract more texture details from the source images, thereby improving the quality of the fused images. This enhancement not only boosts the overall fusion performance but also ensures that the fused images align more closely with the human visual system, making them appear natural and visually appealing. The increased retention of fine details and textures ensures that the fused images are clearer and more coherent, offering a realistic representation that is easier for the human eye to interpret.
Figure 5 shows qualitative results from the TEM and CEM ablation experiments. The fused images without the TEM and CEM exhibit weakened texture details and reduced contrast; the overall brightness of the pedestrian and the texture details of the clothing decrease noticeably. These observations demonstrate that the TEM and the CEM improve TPFusion's ability to extract and represent texture details and contrast.
Dual-attention fusion module
TPFusion employs a dual-attention fusion module to guide the network in generating high-quality fused images. To verify the effectiveness of this module, we retrain TPFusion without the dual-attention mechanism (DAM). Table 5 presents the experimental results, which show that removing the DAM reduces the VIF, AG, SCD, EN, and SF values by 0.1588, 0.2582, 0.1449, 0.7552, and 2.5073, respectively. These results indicate that the DAM enhances TPFusion's ability to preserve gradient information from the source images.
Figure 5 shows qualitative examples of fusion results without the DAM. Without the DAM, the fused image deviates from the original structure and shows lower similarity to the source images; salient pedestrian features appear weaker, and the overall fusion quality drops. These observations confirm that the DAM strengthens TPFusion's spatial and channel-level feature representation, increases sensitivity to saliency and texture detail, and helps prevent information loss, thereby improving the fused image quality.
Loss function
In this section, we conduct ablation experiments on the loss function by removing the structural similarity loss, the gradient loss, and the information loss, respectively. Table 6 shows the experimental results. We can find that, after removing the structural similarity loss from the total loss, TPFusion's performance on VIF, AG, EN, QAB/F, and SF decreases by 0.2247, 1.7237, 1.4221, 0.1887, and 0.65, respectively, which demonstrates that the structural similarity loss reduces distortion in the fused images while significantly enhancing the representation of scene information. When the gradient loss is removed, TPFusion's performance in MI, VIF, AG, CC, EN, and SF decreases by 0.161, 0.354, 2.7125, 0.2608, 2.0022, and 2.5059, respectively, which indicates that the gradient loss enhances TPFusion's ability to extract gradient information and texture details from the source images and helps TPFusion generate high-quality and informative fused images. When the information loss is removed, TPFusion's performance in MI, VIF, CC, EN, and QAB/F decreases by 0.4302, 0.1283, 0.2307, 0.6773, and 0.2185, respectively, indicating that the information loss term enhances TPFusion's ability to extract and integrate important information and reduces information loss during the image fusion process, ensuring that TPFusion can fully extract information from the source images and generate high-quality fused images.
Table 6. Quantitative comparison of methods on loss function based ablation experiments on the MSRS dataset.
Methods | MI | VIF | AG | CC | SCD | EN | QAB/F | SF |
---|---|---|---|---|---|---|---|---|
w/o SSIM | 3.0262 | 0.5654 | 3.2324 | 0.5714 | 1.0728 | 5.7662 | 0.4022 | 10.8792 |
w/o Grad | 2.8251 | 0.4361 | 2.2436 | 0.5384 | 0.9514 | 5.1861 | 0.3037 | 9.0233 |
w/o Info | 2.5559 | 0.6618 | 4.3643 | 0.5685 | 1.2116 | 6.5110 | 0.3724 | 13.0848 |
TPFusion | 2.9861 | 0.7901 | 4.9561 | 0.7992 | 1.2787 | 7.1883 | 0.5909 | 11.5292 |
Figure 6 presents the example images of TPFusion’s fusion results trained with different loss functions. We can find that TPFusion’s fused results accurately capture the contour of the person. Meanwhile, the fused images of TPFusion effectively represent texture details in the magnified areas, which indicates that the information loss function term helps the network retain more important information from the source images and allows the fused images to contain richer texture details and salient features.
[See PDF for image]
Fig. 6
Qualitative comparison of methods on the loss function ablation experiments on the 01506D data from the MSRS dataset.
Loss function parameter
The loss in Eq. (16) contains three hyperparameters that adjust the contributions of the structural similarity, gradient, and information terms in the fused images. Tables 7, 8, and 9 present the ablation studies for these hyperparameters.
Table 7. Quantitative comparison on the TNO dataset conducted through ablation studies on the structural similarity weight of the loss function.
SSIM weight | Gradient weight | Information weight | MI | VIF | AG | CC | SCD | EN | QAB/F | SF
---|---|---|---|---|---|---|---|---|---|---
1 | 100 | 1 | 2.3289 | 0.5065 | 4.4664 | 0.4941 | 1.7736 | 7.0055 | 0.4839 | 11.2443 |
5 | 2.4333 | 0.5180 | 4.4634 | 0.4977 | 1.7779 | 7.0035 | 0.4866 | 11.2642 | ||
10 | 2.4564 | 0.5687 | 4.3945 | 0.4687 | 1.6801 | 7.0514 | 0.5213 | 12.1247 | ||
25 | 2.4674 | 0.5352 | 4.4419 | 0.4962 | 1.7637 | 7.0490 | 0.4724 | 11.0206 | ||
50 | 2.4447 | 0.5354 | 4.4496 | 0.4646 | 1.7607 | 7.0162 | 0.4714 | 11.2368 |
Table 8. Quantitative comparison on the TNO dataset conducted through ablation studies on the gradient weight of the loss function.
SSIM weight | Gradient weight | Information weight | MI | VIF | AG | CC | SCD | EN | QAB/F | SF
---|---|---|---|---|---|---|---|---|---|---
10 | 25 | 1 | 1.9015 | 0.4719 | 3.4737 | 0.4790 | 1.4509 | 6.5379 | 0.2962 | 10.6925 |
50 | 2.3157 | 0.6012 | 4.7966 | 0.4918 | 1.6980 | 6.8291 | 0.4388 | 9.9407 | ||
100 | 2.5464 | 0.5687 | 4.3945 | 0.4687 | 1.6801 | 7.0514 | 0.5213 | 12.1247 | ||
150 | 2.5684 | 0.5945 | 4.7024 | 0.3849 | 1.3220 | 7.0153 | 0.4399 | 9.4293 | ||
200 | 2.4039 | 0.5993 | 3.7749 | 0.4023 | 1.4360 | 7.0120 | 0.4457 | 9.6348 |
Table 9. Quantitative comparison on the TNO dataset conducted through ablation studies on the information weight of the loss function.
SSIM weight | Gradient weight | Information weight | MI | VIF | AG | CC | SCD | EN | QAB/F | SF
---|---|---|---|---|---|---|---|---|---|---
10 | 100 | 1 | 2.5464 | 0.5687 | 4.3945 | 0.4687 | 1.6801 | 7.0514 | 0.5213 | 12.1247 |
5 | 1.9956 | 0.5111 | 3.7778 | 0.5244 | 1.7648 | 6.8878 | 0.3951 | 12.0247 | ||
10 | 1.9581 | 0.5197 | 4.1959 | 0.4610 | 1.6683 | 7.0204 | 0.4589 | 10.6667 | ||
25 | 1.7990 | 0.5264 | 4.3223 | 0.4782 | 1.7091 | 6.9598 | 0.4852 | 11.1178 | ||
50 | 1.6933 | 0.5171 | 4.4683 | 0.4847 | 1.7229 | 6.9677 | 0.4814 | 11.2484 |
The experimental results in Table 7 indicate that a moderate increase in the structural similarity weight helps TPFusion more effectively preserve structural information in the fused images and enhances the integration of details from the source images, resulting in noticeable improvements in evaluation metrics such as MI, VIF, QAB/F, and SF. However, when the weight of the structural similarity loss becomes excessively high, the fused images, although containing more information, also introduce considerable noise and redundant content, thereby degrading the overall visual quality. Specifically, an overemphasis on structural similarity adversely affects the clarity of texture details, as evidenced by the decline in VIF, AG, QAB/F, and SF.
As Table 8 shows, when greater weight is assigned to the gradient loss during TPFusion’s training, the model becomes more adept at capturing contrast and fine texture details, thereby enriching the fused images with enhanced high-frequency components. This improvement is reflected in the rising scores of evaluation indicators such as VIF, AG, and SF. Nevertheless, as the gradient loss weight increases beyond the optimal value of 100, the model starts to introduce undesirable artifacts and noise, which adversely affect the sharpness and integrity of edges and textures. As a result, the perceptual quality of the fused images deteriorates, as indicated by the subsequent decline in VIF, AG, and SF.
According to Table 9, raising the coefficient of the information loss enables TPFusion to absorb a larger share of content from the source images, thereby endowing the fusion results with richer informational cues. Because genuinely valuable and complementary information in the sources is limited, this extra retention mostly appears as duplicated structures, background clutter, and noise. Once the coefficient exceeds a moderate range, these unwanted artefacts become prominent, softening edges and blurring fine textures. The scores for MI, VIF, and QAB/F slip, signalling a measurable decline in perceptual quality.
Table 10. Object detection analysis results on the MSRS and M3FD datasets.
Methods | MSRS dataset | M3FD Dataset | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
mAP@0.5 | mAP@0.5:0.95 | mAP@0.5 | mAP@0.5:0.95
Car | Person | Avg. | Car | Person | Avg. | Car | Person | Avg. | Car | Person | Avg. | |
Visible | 85.1 | 63.0 | 74.0 | 62.0 | 35.9 | 48.9 | 70.3 | 15.8 | 43.1 | 48.1 | 8.58 | 14.2 |
Infrared | 63.8 | 89.6 | 76.7 | 46.7 | 63.3 | 55.0 | 46.8 | 31.7 | 39.3 | 31.2 | 17.8 | 24.5 |
DDcGAN45 | 88.1 | 53.8 | 70.9 | 63.4 | 34.2 | 48.8 | 71.1 | 11.6 | 41.3 | 48.3 | 6.84 | 27.6 |
PMGI46 | 87.2 | 84.5 | 85.9 | 65.1 | 58.3 | 61.7 | 73.5 | 21.1 | 47.3 | 50.0 | 12.8 | 31.4 |
NestFuse47 | 86.2 | 86.4 | 86.3 | 61.8 | 57.5 | 59.6 | 25.7 | 5.39 | 8.75 | 8.75 | 2.75 | 5.75 |
RFN-Nest48 | 88.4 | 77.3 | 82.9 | 63.3 | 51.3 | 57.3 | 72.9 | 16.9 | 44.9 | 50.5 | 10.3 | 30.4 |
SDNet49 | 71.4 | 75.6 | 73.5 | 59.8 | 53.1 | 56.4 | 67.3 | 24.9 | 46.1 | 46.5 | 14.3 | 30.4 |
MFEIF23 | 74.7 | 83.0 | 78.8 | 52.4 | 55.4 | 53.9 | 70.7 | 22.2 | 46.4 | 48.6 | 12.6 | 30.6 |
TarDAL44 | 87.7 | 88.0 | 87.9 | 62.5 | 58.6 | 60.5 | 68.4 | 28.3 | 48.3 | 47.3 | 15.4 | 31.4 |
DDFM50 | 83.7 | 87.0 | 85.4 | 60.0 | 59.3 | 59.7 | 71.3 | 21.9 | 46.6 | 50.0 | 12.7 | 31.3 |
DATFuse51 | 87.0 | 87.5 | 87.2 | 63.9 | 58.8 | 61.4 | 68.4 | 20.6 | 44.5 | 47.2 | 12.1 | 29.7 |
CDDFuse37 | 85.0 | 84.8 | 84.9 | 63.7 | 56.2 | 59.9 | 72.6 | 20.9 | 46.7 | 49.9 | 13.5 | 31.7 |
BTSFusion18 | 83.4 | 89.4 | 86.4 | 60.6 | 59.7 | 60.1 | 64.7 | 20.9 | 42.8 | 46.4 | 12.3 | 29.3 |
SFCFusion52 | 85.3 | 89.2 | 87.3 | 63.6 | 59.7 | 61.6 | 69.1 | 22.1 | 45.6 | 48.6 | 13.6 | 31.1 |
PromptFusion53 | 83.2 | 88.7 | 85.9 | 61.8 | 57.3 | 59.6 | 73.3 | 22.0 | 47.6 | 50.7 | 12.6 | 31.7 |
TPFusion | 84.3 | 87.9 | 86.1 | 62.7 | 64.3 | 63.5 | 74.6 | 23.1 | 48.8 | 51.5 | 14.2 | 32.8 |
Table 11. Quantitative comparison of IVIF image semantic segmentation on the MFNet dataset and object detection on the M3FD dataset.
Methods | Semantic segmentation on MFNet | Object detection on M3FD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Car | Person | Bike | Curve | Car stop | Guardrail | mAcc | Person | Car | Bus | Motor | Truck | Lamp | mAP | |
Visible | 86.8 | 77.2 | 73.7 | 68.1 | 85.2 | 70.5 | 75.8 | 62.5 | 98.2 | 90.3 | 91.3 | 78.5 | 99.9 | 86.8 |
Infrared | 69.1 | 72.6 | 58.1 | 60.1 | 73.9 | 0.19 | 40.2 | 81.3 | 98.6 | 96.5 | 68.7 | 85.1 | 92.6 | 87.1 |
DDcGAN45 | 79.5 | 84.1 | 73.3 | 77.4 | 74.8 | 75.1 | 77.3 | 83.9 | 99.4 | 99.5 | 97.0 | 88.5 | 99.5 | 94.6 |
PMGI46 | 72.2 | 70.6 | 80.7 | 65.0 | 66.4 | 98.7 | 77.3 | 86.8 | 99.0 | 99.1 | 94.7 | 90.9 | 99.5 | 95.0 |
NestFuse47 | 80.8 | 72.9 | 46.1 | 78.0 | 61.9 | 85.4 | 70.4 | 41.8 | 88.3 | 96.8 | 40.1 | 64.2 | 65.0 | 66.1 |
RFN-Nest48 | 79.0 | 80.2 | 68.5 | 84.0 | 79.6 | 69.4 | 75.8 | 69.0 | 99.2 | 99.2 | 96.4 | 90.3 | 99.5 | 95.0 |
SDNet49 | 97.5 | 57.0 | 39.6 | 49.0 | 45.8 | 94.0 | 70.7 | 86.5 | 99.4 | 98.9 | 94.5 | 89.6 | 99.5 | 94.7 |
MFEIF23 | 78.7 | 71.5 | 56.1 | 68.9 | 68.8 | 88.8 | 75.3 | 41.5 | 88.8 | 97.6 | 41.2 | 65.0 | 66.3 | 66.7 |
TarDAL44 | 89.7 | 77.7 | 78.4 | 78.7 | 74.5 | 83.7 | 81.6 | 45.7 | 93.9 | 99.5 | 40.4 | 66.6 | 64.8 | 68.5 |
DDFM50 | 87.8 | 79.8 | 66.9 | 81.1 | 75.6 | 76.9 | 77.3 | 46.3 | 93.7 | 97.6 | 42.2 | 69.5 | 63.6 | 68.8 |
DATFuse51 | 40.3 | 72.8 | 61.1 | 67.5 | 69.3 | 93.1 | 71.7 | 83.9 | 99.4 | 99.5 | 94.2 | 90.3 | 99.5 | 94.5 |
CDDFuse37 | 91.6 | 76.7 | 63.5 | 79.0 | 62.2 | 87.4 | 79.4 | 41.9 | 88.6 | 99.5 | 39.8 | 65.8 | 66.3 | 67.0 |
BTSFusion18 | 92.1 | 76.6 | 49.0 | 74.4 | 56.7 | 80.4 | 75.5 | 85.7 | 99.4 | 98.9 | 96.0 | 90.1 | 98.9 | 94.8 |
SFCFusion52 | 85.6 | 58.5 | 53.6 | 79.6 | 69.9 | 89.7 | 77.0 | 84.8 | 98.4 | 98.5 | 95.1 | 89.3 | 98.5 | 94.1 |
PromptFusion53 | 90.1 | 74.4 | 60.4 | 79.4 | 65.9 | 86.5 | 75.7 | 46.3 | 94.6 | 98.3 | 42.9 | 67.7 | 62.5 | 68.7 |
TPFusion | 94.2 | 80.9 | 62.5 | 87.7 | 70.0 | 91.1 | 80.4 | 86.2 | 99.4 | 99.5 | 96.6 | 90.6 | 98.9 | 95.2 |
Analysis of object detection on the fused images
To further validate the effectiveness of TPFusion in downstream tasks, we compare object detection results on fused images generated by TPFusion and the thirteen other fusion methods using the MSRS and M3FD datasets. In the experiments, we randomly select 60 image pairs, fuse them using the different methods, and apply YOLOv8 to detect the objects in the fused images. We take mean average precision (mAP) to evaluate the detection results, where mAP@0.5 denotes the average mAP for IoU (intersection over union) exceeding a threshold of 0.5, and mAP@0.5:0.95 indicates the average mAP across five thresholds from 0.5 to 0.95 with 0.1 increments.
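For reference, the sketch below shows the IoU computation underlying these metrics and the threshold set implied by the protocol described above; the detector (YOLOv8) is used off the shelf and is not reproduced here.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

# Five IoU thresholds with 0.1 increments, as described in the evaluation protocol.
iou_thresholds = np.arange(0.5, 0.951, 0.1)
```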
As Table 10 shows, on the fused images of our TPFusion for the MSRS dataset, YOLOv8 achieves an mAP@0.5 of 84.3% and 87.9% in detecting cars and persons, respectively, ranking tenth and sixth among all the detection results for the fourteen fusion methods. The average mAP@0.5 for TPFusion is 86.1%, ranking sixth. When measured by mAP@0.5:0.95, TPFusion achieves detection accuracies of 62.7% and 64.3% for cars and persons, respectively, ranking seventh and first. With TPFusion, the average mAP@0.5:0.95 for the two object classes is 63.5%, which ranks first and surpasses the second-best method by 1.8%. On the M3FD dataset, YOLOv8 achieves, on TPFusion's fused images, an mAP@0.5 of 74.6% and 23.1% in detecting cars and persons, ranking first and fourth among all the detection results for the fourteen fusion methods. The average detection accuracy is 48.8% in terms of mAP@0.5, surpassing the second-best method by 0.5%. When measured by mAP@0.5:0.95, TPFusion achieves detection accuracies of 51.5% and 14.2% for cars and persons, respectively, ranking first and fourth. The average detection accuracy is 32.8% in terms of mAP@0.5:0.95, exceeding the second-best method by 1.1%. Overall, TPFusion's fused images outperform those of the other methods for object detection.
To further compare the fusion results of different methods, we fine-tune a well-trained YOLOv5 on their fused images. In these object detection experiments, 300 fused images are randomly selected from each fusion method to construct the training set, and 60 images are randomly selected from each fusion method to serve as the test set. The fine-tuning process is optimized using stochastic gradient descent (SGD), with an initial learning rate of 0.08 and a batch size of 32. The model is trained for a total of 500 epochs. Table 11 shows the quantitative results. One can find that TPFusion achieves an average precision of 95.2%, the highest overall ranking compared with the other fusion methods.
Figures 7 and 8 present two example images of object detection. As Fig. 7 shows, on TPFusion's fused images, YOLOv8 detects multiple persons with higher confidence. In Fig. 8, the fusion results generated by TPFusion significantly enhance the detector's accuracy in identifying dual pedestrian targets within complex scenes. These comparative experimental results demonstrate that the detection confidence levels achieved by our method substantially outperform those of existing approaches.
[See PDF for image]
Fig. 7
Qualitative comparison of object detection results in TPFusion and other methods on the MSRS dataset.
[See PDF for image]
Fig. 8
Qualitative comparison of object detection results in TPFusion and other methods on the M3FD dataset.
Analysis of semantic segmentation experiment on the fused images
Using a pre-trained MFNet segmentation model, we perform semantic segmentation experiments on the fused images of TPFusion and other methods on the MFNet61 dataset to verify the advantages of TPFusion.
Table 11 shows the semantic segmentation quantitative results. We can find that TPFusion achieves segmentation accuracies of 94.2%, 80.9%, and 87.7% on the car, person, and curve categories, ranking second, second, and first among the 16 methods. In terms of overall performance, TPFusion attains a mean segmentation accuracy of 80.4% across all categories, ranking second and falling only behind the top-performing method, which is specifically designed for segmentation and detection.
Figure 9 shows two example segmented images for various fusion methods. The semantic segmentation results from methods such as DDcGAN, PMGI, NestFuse, SDNet and DATFuse not only fail to preserve the original pedestrian semantic information in the image but also cause partial semantic information loss. Moreover, they incorrectly classify some correct semantic categories in the image into new semantics that do not appear in the ground truth. For example, in the segmentation results of DATFuse, the semantic information of the pedestrian's feet and head is lost, and car-type semantics, which are entirely absent in the original image, appear on the right side of the image. In contrast, our method not only preserves the semantic information of the pedestrian category to the greatest extent but also avoids introducing any incorrect semantic categories.
[See PDF for image]
Fig. 9
Qualitative comparison of semantic segmentation results in TPFusion and other methods on the MFNet dataset.
Table 12. Running times (in seconds) for different methods on various datasets.
Methods | TNO | MSRS | M3FD | Framework |
---|---|---|---|---|
DDcGAN | 0.9019 | 0.8396 | 0.9821 | TensorFlow |
PMGI | 0.0267 | 0.5489 | 0.2776 | TensorFlow |
NestFuse | 0.3869 | 0.4043 | 1.0424 | PyTorch |
RFN-Nest | 0.2973 | 0.4843 | 0.6509 | PyTorch |
SDNet | 0.1280 | 0.0636 | 0.0794 | TensorFlow |
MFEIF | 0.9660 | 0.9117 | 4.3066 | PyTorch |
TarDAL | 8.1755 | 3.8169 | 11.2723 | PyTorch |
DDFM | 77.7769 | 48.9294 | 464.9891 | PyTorch |
DATFuse | 0.0468 | 0.0475 | 0.0625 | PyTorch |
CDDFuse | 0.3692 | 0.2738 | 0.6798 | PyTorch |
BTSFusion | 0.1224 | 0.1523 | 0.3363 | PyTorch |
SFCFusion | 3.5514 | 3.7764 | 5.6325 | Matlab |
PromptFusion | 1.0592 | 0.8761 | 1.8590 | PyTorch |
TPFusion | 0.0923 | 0.0877 | 0.2319 | PyTorch |
Analysis of running efficiency
To compare the running efficiency of TPFusion with the other fusion methods, all methods are tested on an NVIDIA GeForce RTX 4080 Laptop GPU; the comparison results are summarized in Table 12. It can be observed that TPFusion ranks third in terms of speed on the TNO, MSRS and M3FD datasets, with differences of 0.0656, 0.0402, and 0.1694 seconds compared to the fastest algorithm on each dataset, respectively. Considering both the image fusion performance and the experimental results of the downstream tasks, these results demonstrate the overall superiority and practicality of TPFusion compared with other methods.
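A hedged sketch of how per-image running time can be measured on the GPU is shown below; `model` stands for any fusion network taking the two source images, and CUDA synchronization ensures that asynchronous kernels are included in the timing.

```python
import time
import torch

def time_inference(model, ir, vi, warmup: int = 5, runs: int = 20) -> float:
    """Average per-image inference time in seconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up iterations are excluded
            model(ir, vi)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(ir, vi)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.time() - start) / runs
```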
Conclusions
This paper proposes TPFusion, a texture-preserving and information loss minimization method for IVIF. Qualitative and quantitative analysis on three datasets, as well as object detection, semantic segmentation, and running efficiency experiments, reveals that our method not only achieves superior performance in IVIF tasks but also ensures the retention of sufficient critical information, effectively addressing the issue of information loss in the fusion process. While TPFusion demonstrates satisfactory fusion performance, its computational efficiency currently ranks third among existing methods. This limitation motivates our future research, in which we will employ a lightweight fusion strategy to optimize the network architecture of TPFusion, accelerating its processing while maintaining its fusion performance.
Acknowledgements
This work is supported by the Natural Science Foundation of Ningxia under Grant 2022AAC03250, the National Natural Science Foundation of China under Grant 11761001, the Leading Talent Project of Science and Technology Innovation of Ningxia under Grant No. KJT2016002, and the Major Project of North Minzu University under Grant No. ZDZX201801.
Author contributions
Q.H. contributed to the experimental results discussion and analysis. All authors prepared main contents of the manuscript and reviewed and edited the submission.
Data availability
The TNO dataset is utilized and analysed in the current study; the data is available at: https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029 (accessed on February 11, 2025). The MSRS dataset is utilized and analysed in the current study; the data is available at: https://github.com/Linfeng-Tang/MSRS (accessed on February 11, 2025). The M3FD dataset is utilized and analysed in the current study; the data is available at: https://github.com/JinyuanLiu-CV/TarDAL (accessed on February 11, 2025). The YOLOv8 code employed and assessed during the current study is available at: https://github.com/Pertical/YOLOv8/blob/main/README.zh-CN.md (accessed on February 11, 2025).
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Yi, W et al. Frequency-guidance collaborative triple-branch network for single image dehazing. Displays; 2023; 80, 102577.
2. Yi, W et al. Gated residual feature attention network for real-time dehazing. Appl. Intell.; 2022; 52, pp. 17449-17464.
3. Yi, W et al. Semi-supervised progressive dehazing network using unlabeled contrastive guidance. Neurocomputing; 2023; 551, 126494.
4. Xu, H; Ma, J; Jiang, J; Guo, X; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 44, pp. 502-518.
5. Li, X. et al. From text to pixels: A context-aware semantic synergy solution for infrared and visible image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023).
6. Yi, W et al. MFAF-Net: image dehazing with multi-level features and adaptive fusion. Vis. Comput.; 2024; 40, pp. 2293-2307.
7. Liu, J et al. Infrared and visible image fusion: From data compatibility to task adaption. IEEE Trans. Pattern Anal. Mach. Intell.; 2025; 47, pp. 2349-2369.
8. Yin, F et al. DCNet: Large-scale point cloud semantic segmentation with discriminative and efficient feature aggregation. IEEE Trans. Circuits Syst. Video Technol.; 2023; 33, pp. 4083-4095.
9. Gracewell, J., Santhosh, R. & Sabarish, V. Image fusion for improved situational awareness in military operations using machine learning. In 2024 2nd International Conference on Advances in Computation, Communication and Information Technology, vol. 1, 733–737 (2024).
10. Li, Y; Fang, A; Guo, Y; Wang, X. Image fusion via mutual information maximization for semantic segmentation in autonomous vehicles. IEEE Trans. Ind. Inform.; 2023; 20, pp. 5838-5848.
11. Dong, A. et al. TDMF: Text-guided denoising and interactive medical image fusion. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1–5 (2025).
12. Liu, J et al. Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. Int. J. Comput. Vis.; 2024; 132, pp. 1748-1775.
13. Tian, Y., Zhang, Y. & Li, L. Assessment method of fusion image quality based on region entropy. In IEEE 15th International Conference on Electronic Measurement and Instruments, 1–4 (2021).
14. Liu, J; Wu, Y; Wu, G; Liu, R; Fan, X. Learn to search a lightweight architecture for target-aware infrared and visible image fusion. IEEE Signal Process. Lett.; 2022; 29, pp. 1614-1618.
15. Liu, J. et al. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8115–8124 (2023).
16. Lu, Y et al. PTPFusion: A progressive infrared and visible image fusion network based on texture preserving. Image Vis. Comput.; 2024; 151, 105287.
17. Pan, Z., Lin, H., Wu, Q., Xu, G. & Yu, Q. Residual texture-aware infrared and visible image fusion with feature selection attention and adaptive loss. Infrared Physics & Technology 105410 (2024).
18. Qian, Y; Liu, G; Tang, H; Xing, M; Chang, R. BTSFusion: Fusion of infrared and visible image via a mechanism of balancing texture and salience. Opt. Lasers Eng.; 2024; 173, 107925.
19. Xu, M; Tang, L; Zhang, H; Ma, J. Infrared and visible image fusion via parallel scene and texture learning. Pattern Recognit.; 2022; 132, 108929.
20. Wang, J. et al. Advancing infrared and visible image fusion with enhanced multiscale encoder and attention-based networks. iScience (2024).
21. Ming, R; Xiao, Y; Liu, X; Zheng, G; Xiao, G. SSDFusion: A scene-semantic decomposition approach for visible and infrared image fusion. Pattern Recognit.; 2025; 163, 111457.
22. Liu, J; Shang, J; Liu, R; Fan, X. Attention-guided global–local adversarial learning for detail-preserving multi-exposure image fusion. IEEE Trans. Circuits Syst. Video Technol.; 2022; 32, pp. 5026-5040.
23. Liu, J; Fan, X; Jiang, J; Liu, R; Luo, Z. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans. Circuits Syst. Video Technol.; 2021; 32, pp. 105-119.
24. Yang, B; Hu, Y; Liu, X; Li, J. CEFusion: An infrared and visible image fusion network based on cross-modal multi-granularity information interaction and edge guidance. IEEE Trans. Intell. Transp. Syst.; 2024; 25, pp. 17794-17809.
25. Li, S et al. P3TFusion: Progressive two-stage infrared and visible image fusion network focused on enhancing target and texture information. Digit. Signal Process.; 2025; 162, 105136.
26. Lu, Q; Zhang, H; Yin, L. Infrared and visible image fusion via dual encoder based on dense connection. Pattern Recognit.; 2025; 163, 111476.
27. Wang, D et al. AMLCA: Additive multi-layer convolution-guided cross-attention network for visible and infrared image fusion. Pattern Recognit.; 2025; 163, 111468.
28. Yi, X., Xu, H., Zhang, H., Tang, L. & Ma, J. Text-if: Leveraging semantic text guidance for degradation-aware and interactive image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).
29. Liu, J. et al. DCEvo: Discriminative cross-dimensional evolutionary learning for infrared and visible image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025).
30. Cheng, M et al. LEFuse: Joint low-light enhancement and image fusion for nighttime infrared and visible images. Neurocomputing; 2025; 626, 129592.
31. Liu, J; Wu, Y; Huang, Z; Liu, R; Fan, X. SMoA: Searching a modality-oriented architecture for infrared and visible image fusion. IEEE Signal Process. Lett.; 2021; 28, pp. 1818-1822.
32. Luo, X; Zhang, J; Wang, L; Niu, D. HBANet: A hybrid boundary-aware attention network for infrared and visible image fusion. Comput. Vis. Image Understanding; 2024; 249, 104161.
33. Liu, J., Shang, J., Liu, R. & Fan, X. Halder: Hierarchical attention-guided learning with detail-refinement for multi-exposure image fusion. In 2021 IEEE International Conference on Multimedia and Expo, 1–6 (2021).
34. Yi, S; Guo, S; Chen, M; Wang, J; Jia, Y. UIRGBFuse: Revisiting infrared and visible image fusion from the unified fusion of infrared channel with R, G, and B channels. Infrared Phys. Technol.; 2024; 143, 105626.
35. Yang, Z et al. SADFusion: A multi-scale infrared and visible image fusion method based on salient-aware and domain-specific. Infrared Phys. Technol.; 2023; 135, 104925.
36. Wang, Z et al. A dual-path residual attention fusion network for infrared and visible images. Optik; 2023; 290, 171251.
37. Zhao, Z. et al. CDDFuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5906–5916 (2023).
38. Dinesh, C; Cheung, G; Bajić, IV. Point cloud denoising via feature graph Laplacian regularization. IEEE Trans. Image Process.; 2020; 29, pp. 4143-4158.
39. Zhuang, P; Wu, J; Porikli, F; Li, C. Underwater image enhancement with hyper-Laplacian reflectance priors. IEEE Trans. Image Process.; 2022; 31, pp. 5442-5455.
40. Qu, G; Zhang, D; Yan, P. Information measure for performance of image fusion. Electron. Lett.; 2002; 38, 1.
41. Hjelm, R. D. et al. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018).
42. Tang, L; Yuan, J; Zhang, H; Jiang, X; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion; 2022; 83, pp. 79-92.
43. Toet, A. The TNO multiband image data collection. Data Brief; 2017; 15, pp. 249-251.
44. Liu, J. et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5802–5811 (2022).
45. Ma, J; Xu, H; Jiang, J; Mei, X; Zhang, X-P. DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans. Image Process.; 2020; 29, pp. 4980-4995.
46. Zhang, H., Xu, H., Xiao, Y., Guo, X. & Ma, J. Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 12797–12804 (2020).
47. Li, H; Wu, X-J; Durrani, T. NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans. Instrum. Meas.; 2020; 69, pp. 9645-9656.
48. Li, H; Wu, X-J; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion; 2021; 73, pp. 72-86.
49. Zhang, H; Ma, J. SDNet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis.; 2021; 129, pp. 2761-2785.
50. Zhao, Z. et al. DDFM: Denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8082–8093 (2023).
51. Tang, W; He, F; Liu, Y; Duan, Y; Si, T. DATFuse: Infrared and visible image fusion via dual attention transformer. IEEE Trans. Circuits Syst. Video Technol.; 2023; 33, pp. 3159-3172.
52. Chen, H et al. SFCFusion: Spatial-frequency collaborative infrared and visible image fusion. IEEE Trans. Instrum. Meas.; 2024; 73, pp. 1-15.
53. Liu, J et al. PromptFusion: Harmonized semantic prompt learning for infrared and visible image fusion. IEEE/CAA J. Autom. Sin.; 2024; 12, pp. 502-515.
54. Han, Y; Cai, Y; Cao, Y; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion; 2013; 14, pp. 127-135.
55. Cui, G; Feng, H; Xu, Z; Li, Q; Chen, Y. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt. Commun.; 2015; 341, pp. 199-209.
56. Ma, J; Ma, Y; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion; 2019; 45, pp. 153-178.
57. Aslantas, V; Bendes, E. A new image quality metric for image fusion: The sum of the correlations of differences. AEU Int. J. Electron. Commun.; 2015; 69, pp. 1890-1896.
58. Roberts, JW; Van Aardt, JA; Ahmed, FB. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens.; 2008; 2, 023522.
59. Piella, G. & Heijmans, H. A new quality metric for image fusion. In Proceedings 2003 International Conference on Image Processing, vol. 3, III–173 (2003).
60. Eskicioglu, AM; Fisher, PS. Image quality measures and their performance. IEEE Trans. Commun.; 1995; 43, pp. 2959-2965.
61. Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y. & Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 5108–5115 (2017).
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
In the task of infrared and visible image fusion, achieving high-quality results typically requires preserving detailed texture and minimizing information loss while maintaining high contrast and clear edges. However, existing methods often struggle to balance these objectives, leading to texture degradation and information loss during the fusion process. To address these challenges, we propose TPFusion, a texture-preserving and information loss minimization method for infrared and visible image fusion. TPFusion consists of the following key components: a multi-scale feature extraction module that strengthens the network's ability to capture features; a texture enhancement module and a contrast enhancement module, which help to preserve fine-grained textures and to extract salient contours and contrast information; a dual-attention fusion module for fusing the features extracted from the source images; and an information-content-based loss function that minimizes the feature discrepancy between the fused image and the source images, effectively reducing information loss. Extensive evaluations demonstrate that TPFusion achieves superior fusion performance. Across three datasets, TPFusion delivers the best results: on the TNO dataset, it raises AG by 2.69% and QAB/F by 0.75%; on the MSRS dataset, it lifts AG by 9.99% and CC by 9.46%; and on the M3FD dataset, it boosts SCD by 1.58% and EN by 2.93% over the second-best method. In downstream tasks, TPFusion attains the highest mean average precision in object detection and achieves the second-highest accuracy in semantic segmentation.
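The EN (information entropy) and AG (average gradient) scores quoted above follow standard definitions: EN is the Shannon entropy of the grayscale histogram, and AG is the mean magnitude of local intensity differences. As a point of reference, the NumPy sketch below computes these two metrics for a grayscale image; it is an illustrative implementation of the conventional formulas, not the authors' evaluation code, and the function names are our own.

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (EN) of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img.ravel(), bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def average_gradient(img: np.ndarray) -> float:
    """Average gradient (AG): mean magnitude of local horizontal/vertical differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]  # horizontal differences, cropped to (M-1, N-1)
    gy = np.diff(img, axis=0)[:, :-1]  # vertical differences, cropped to (M-1, N-1)
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

# Hypothetical usage: `fused` is an 8-bit grayscale fusion result loaded as a NumPy array.
# print(entropy(fused), average_gradient(fused))
```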
Details
1 North Minzu University, Institute of Image Understanding Research, Yinchuan, China (GRID:grid.464238.f) (ISNI:0000 0000 9488 1187); Dalian Minzu University, School of Information and Communication Engineering, Dalian, China (GRID:grid.440687.9) (ISNI:0000 0000 9927 2735)
2 North Minzu University, Institute of Image Understanding Research, Yinchuan, China (GRID:grid.464238.f) (ISNI:0000 0000 9488 1187)