Introduction
Multi-sensor images offer a way to capture complementary information of the same scene, facilitating enhanced visual understanding and scene perception, and overcoming the limitations of single-sensor imaging. By synthesizing information from various sensors, such as infrared and visible light, these images can provide more robust data for subsequent image processing or decision-making tasks [1]. Specifically, the fusion of infrared and visible light images has become a critical area of study within computer vision. Infrared-visible light fusion is now widely applied across many fields [2]. Visible light sensors, which rely on reflected light, offer high spatial resolution and detailed background information; however, they often fail to capture targets effectively in poor lighting conditions or when objects are camouflaged. In contrast, infrared sensors detect the thermal radiation emitted by objects, are unaffected by lighting or environmental challenges, and can operate continuously, day or night. Therefore, fusing infrared and visible light images into a single, cohesive image is vital for retaining critical information from both modalities.
In recent decades, various traditional infrared and visible image fusion methods have been proposed and have demonstrated strong performance. These methods generally fall into four major categories: multiscale transform-based methods, sparse representation-based methods, subspace-based methods, and hybrid models. The multiscale methods include wavelet transform techniques [3] and non-subsampled contourlet transform [4]. Sparse representation methods [5] and [6] focus on constructing an overcomplete dictionary from high-quality natural images, enabling sparse representation of both infrared and visible light images, thereby enhancing the final fused image. Subspace-based methods, such as principal component analysis (PCA) [7] and independent component analysis (ICA) [8], project high-dimensional source images into lower-dimensional subspaces, capturing their intrinsic structures more effectively. Hybrid models, which combine the strengths of multiple techniques, have also been proposed. For example, Liu [9] introduced a unified fusion framework by integrating multiple fusion approaches to improve overall performance.
In the task of infrared and visible light image fusion, [10] showed that applying complex evidence theory to multi-source data fusion is an effective approach. [11] combined an attention mechanism with LSTM to enable parameter-adaptive, non-model-based state estimation. [12] proposed an infrared and visible light video fusion algorithm based on possibility distribution synthesis theory; by quantitatively describing feature differences, correlation measures, and joint synthesis, it achieves better target and detail preservation and significantly improves fusion quality. [13] reviewed innovations in 3D object detection based on multi-sensor fusion and discussed the limitations of current methods as well as future research directions.
In recent years, fusion methods based on convolutional neural networks (CNNs) have gained prominence due to their superior ability to represent features. Generative adversarial networks (GANs) have also been employed for image fusion, owing to their powerful unsupervised distribution estimation capabilities. Most Transformer-based studies have concentrated on capturing the global information of images, aiming to address the limitations of convolutional operations, which primarily capture local features. However, the traditional self-attention mechanism is ill-suited to outdoor, resource-constrained applications because of its high computational complexity, which stems from its quadratic attention cost and large number of model parameters. Moreover, the Transformer architecture often fails to fully exploit the differential and shared information between modalities, leading to suboptimal feature extraction. Similarly, standalone CNNs struggle to capture long-range dependencies, and their ability to extract multi-scale and texture information is limited.
Traditional algorithms rely on manually designed feature templates, which fail to adapt to complex and diverse image scenarios. Especially in noisy or variable lighting conditions, the extracted features often lack robustness. While deep learning models have powerful capabilities for automatic feature learning, they exhibit limitations in extracting high-frequency features (e.g., edges or textures). In particular, deeper network layers tend to overlook local details, leading to the smoothing of high-frequency features. Traditional methods for texture retention often employ direct fusion or simple weighting, resulting in blurred image edges, distorted texture structures, and “ghosting” effects in edge regions. Deep learning algorithms may suffer from overfitting or limitations in feature extraction layers, resulting in the loss of local textures or blurring of details when generating high-resolution outputs. Especially in low-contrast or complex textured areas, the generated images may fail to accurately reflect the target’s texture characteristics. High-quality feature extraction is foundational for subsequent tasks such as object detection, classification, registration, or fusion. Insufficient or indistinct feature extraction compromises the overall performance of algorithms. In infrared and visible image fusion, insufficient texture information can obscure target areas, affecting scene interpretability or the recognition of critical targets. In practical applications such as nighttime license plate recognition and road crack detection, algorithms must not only extract functional features but also preserve overall image details to ensure usability.
Infrared and visible images capture different key information (e.g., infrared highlights thermal radiation, while visible light retains rich texture details). Traditional fusion methods often involve simple feature weighting or merging, which may overlook the complex relationships between the features of the two modalities. The cross-attention mechanism computes interdependencies among features, effectively capturing the complementary information across modalities and thereby enhancing the quality of the fusion results. Cross-attention not only focuses on local details but also dynamically adjusts to global features, which is crucial in complex scenarios (e.g., intricate backgrounds or sparse target information). This capability helps prevent issues like information loss or over-smoothing. Traditional methods often rely on fixed weight allocations, which are insufficient for adapting to diverse image content. Many approaches focus solely on low-level features, neglecting the integration of high-level semantic information, which results in subpar visualization of the fused image. Some methods merely combine features without thoroughly exploring the dynamic correlations between modalities. Cross-attention learns the dynamic dependencies between the features of both modalities, enhancing the fusion effect by emphasizing relevant regions. Our proposed cross-attention mechanism effectively integrates modal features, offering a novel approach to infrared and visible image fusion. It enhances the interpretability and applicability of fusion algorithms, addressing the deficiencies of existing methods in adaptability and dynamism.
To address these issues, this paper presents a multi-scale image fusion algorithm with two-branch, multi-level feature fusion. To acquire global context and multi-scale feature information while reducing the number of network parameters, a lightweight multi-scale grouped convolution (LMGC) is introduced. To effectively extract and fuse the differential and shared information from multimodal images, a lightweight cross-attention fusion module (LCFM) is proposed. Additionally, to capture spatial information and enhance multi-level feature fusion and texture extraction, an improved multi-dimensional hybrid spatial attention (MHSA) mechanism is developed, along with an optimized dual-branch multi-level fusion module (DMFM). Our contributions can be summarized in four aspects:
1. We propose DMCM, a two-branch multilevel feature fusion network with a cross-attention mechanism for infrared and visible image fusion.
2. We propose a lightweight multi-scale grouped convolution based on GSConv and introduce SwiftFormer to fully extract and fuse global and local image features.
3. The improved dual-branch multi-level fusion module is designed to capture rich multi-level features and fully extract texture information, with one branch employing lightweight cross-attention fusion and multi-dimensional hybrid spatial attention, and the other leveraging a Sobel operator and depthwise convolution for edge information extraction.
4. Extensive experiments on public datasets demonstrate that the proposed model outperforms seven classical algorithms in terms of subjective and objective evaluation as well as operational efficiency. Furthermore, the proposed model achieves excellent performance in downstream tasks such as target detection.
The rest of the paper is organized as follows: Sect 2 discusses existing deep learning-based methods for the fusion of infrared and visible images. In Sect 3, we present the overall network architecture of the proposed algorithm, a detailed explanation of the proposed modules, and the loss function. Sect 4 describes the experimental parameters, training details, comparisons between the proposed algorithm and other state-of-the-art algorithms, ablation experiments, and target detection experiments. Finally, we give conclusions in Sect 5.
Related works
Liu [14] first introduced CNNs for image fusion using twin neural networks to generate weight mappings for fusion. Li [15] proposed DenseFuse, an encoding-decoding-based method with dense connections, to fuse infrared and visible images, aiming to mitigate the loss of deep features in the fusion process. Jian [16] incorporated an attention mechanism into a symmetric encoder-decoder network to enhance salient infrared information during fusion. Li [17] designed a learnable fusion network trained in two phases, which better preserved image details and addressed the limitations of traditional manual fusion strategies. Similarly, Ma [18] developed an end-to-end model that uses a salient target mask to guide network training, ensuring that thermal information is effectively highlighted. Li and Wu [19] were the first to employ an autoencoder network for infrared-visible image fusion (IVIF), incorporating dense blocks in the encoder to comprehensively extract features, followed by an additive fusion strategy in the fusion layer to generate the fused outputs.
Ma [20] initially used GANs to establish an adversarial process between visible images and fusion results, enhancing texture detail. However, this method relied solely on information from visible images, resulting in the loss of target contrast and contour. To address this issue, they later introduced a dual discriminator GAN [21], which utilized both high- and low-resolution versions of the fused image to deceive two discriminators. This approach integrated both infrared and visible image data, significantly improving fusion performance. Li [22] introduced an end-to-end GAN model that incorporates multi-class classification constraints to further enhance fusion. Liu [23] designed a fusion network featuring a single generator and dual discriminators, employing a saliency mask to preserve the structural information of infrared targets and the texture details from visible light.
The Transformer model was first introduced by Vaswani [24] to address challenges in natural language processing, where it achieved remarkable success. Building on this, Dosovitskiy [25] extended the Transformer architecture to the visual domain by designing the Vision Transformer (ViT) for image classification tasks. Owing to its self-attention mechanism, which effectively captures long-term dependencies, the Transformer has since been applied to a wide range of computer vision tasks, including object detection [26], video restoration [27], and image super-resolution [28]. These successes have catalyzed the development of Transformer-based methods within the field of image fusion. VS [29] were pioneers in proposing a Transformer model for image fusion, enabling the extraction of both local and long-range information through the use of spatial and Transformer branches. Ma [30] advanced this work by introducing a generalized fusion method that integrates cross-domain learning with the Swin Transformer. Their model effectively preserves foreground targets from thermal images and background textures from visible light images, given that these regions tend to exhibit higher pixel intensities in their respective modalities. Tang [31] proposed an end-to-end Transformer architecture for infrared and visible image fusion. Their design features a dual attention residual module to extract critical features from the source images, while a Transformer module is utilized to retain global complementary information, ensuring the preservation of long-range dependencies.
The proposed method
Framework overview
Our network is composed of three primary components: feature encoding, feature fusion, and feature decoding, as illustrated in Fig 1. To effectively extract deep features from both infrared and visible images, we employ UNet [32] as the backbone network, using concatenated infrared and visible images as input. The backbone’s feature extraction stage incorporates lightweight multi-scale grouped convolution and SwiftFormer modules to capture local and global features, respectively. For shallow feature extraction, we utilize a pre-trained VGG19 [33] as a secondary stem within the feature extraction architecture. In the feature fusion stage, a dual-branch multilevel fusion module is introduced to integrate deep and shallow features more comprehensively.
[Figure omitted. See PDF.]
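To make the data flow concrete, the following is a minimal PyTorch sketch of the encode-fuse-decode pipeline described above. The encoder, fusion, and decoder arguments are placeholders for the components detailed in the following subsections (the pre-trained VGG19 shallow-feature stems are omitted for brevity); this is an illustrative sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FusionPipeline(nn.Module):
    """Sketch of the three-stage flow: feature encoding -> feature fusion -> feature decoding."""
    def __init__(self, encoder: nn.Module, fusion: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.fusion, self.decoder = encoder, fusion, decoder

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        x = torch.cat([ir, vis], dim=1)   # concatenated infrared + visible input
        feats = self.encoder(x)           # backbone features (LMGC + SwiftFormer)
        fused = self.fusion(feats)        # dual-branch multilevel fusion
        return self.decoder(fused)        # reconstruct the fused image

# toy wiring check with identity placeholders
net = FusionPipeline(nn.Identity(), nn.Identity(), nn.Conv2d(2, 1, 1))
out = net(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))  # -> (1, 1, 64, 64)
```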
Feature extraction portion of the backbone network
LMGC.
To minimize feature information loss and reduce the number of network parameters, Li [34] introduced GSConv (Fig 2(a)), which combines depthwise separable convolution with a channel mixing operation. Depthwise separable convolution not only captures local feature information but also significantly reduces the number of network parameters. The channel mixing operation facilitates interaction and communication between different channels, thereby enhancing the richness and diversity of feature representations.
Building on GSConv, this paper proposes the LMGC (Fig 2(c)), which extracts feature information across different scales while further reducing network parameters. Grouped convolution, an advanced convolutional algorithm, divides input channels into groups and performs independent convolution operations within each group. This strategy reduces the model’s parameter count by limiting the number of channels processed by each convolution kernel, thereby decreasing storage requirements, mitigating the risk of overfitting, and improving the model’s generalization capabilities. In this study, grouped convolution replaces depthwise separable convolution to achieve model lightweighting.
This module addresses the limitations of current algorithms in capturing multi-scale information. By incorporating multi-scale receptive field designs into grouped convolutions, we significantly enhance the model’s adaptability to various resolutions and complex structures while reducing parameter counts and computational overhead. This approach enables substantial reductions in hardware requirements while maintaining model accuracy. Compared to existing convolutional methods, it more comprehensively extracts both local image details (e.g., textures) and global semantic information (e.g., large-scale structures) without compromising computational efficiency.
Inspired by VGG [33], which expands the receptive field by stacking small convolution kernels (two 3x3 convolutions achieve a 5x5 receptive field, and three 3x3 convolutions approximate a 7x7 receptive field), this paper cascades three 3x3 convolution kernels and adds multiple short connections to enable multi-scale feature extraction (Fig 2(b)).
[Figure omitted. See PDF.]
(a) GSConv (b) multi-scale feature extraction (c) Lightweight multi-scale grouped convolution.
Initially, the infrared and visible images are concatenated, and channel dimensionality reduction is performed using 1x1 pointwise convolution to halve the number of channels. This process can be expressed as:
$F_{in} = \mathrm{PWConv}\big(\mathrm{Concate}(IR, VIS)\big)$ (1)
where IR and VIS represent the source images, Concate denotes the concatenation operation, PWConv refers to pointwise convolution, and $F_{in}$ is the channel-reduced input feature. Next, the input features are partitioned into several groups, and multi-scale information is extracted using cascaded 3x3 convolution kernels:
$\{x_1, x_2, \ldots, x_N\} = \mathrm{Group}(F_{in})$ (2)
$F_i^{1} = \mathrm{Conv}_{3\times3}(x_i), \quad F_i^{2} = \mathrm{Conv}_{3\times3}(F_i^{1}), \quad F_i^{3} = \mathrm{Conv}_{3\times3}(F_i^{2})$ (3)
where Group refers to grouped convolution, $x_i$ indicates one of the N divided parts, and $F_i^{1}$, $F_i^{2}$, and $F_i^{3}$ represent the output after one, two, and three 3x3 convolutions, respectively. Subsequently, the sub-modules of each group are recombined, and channel independence is broken using the channel mixing operation, which promotes information sharing between groups and enhances inter-channel interaction:
$F_{mix} = \mathrm{Shuffle}\Big(\mathrm{Aggregate}\big(\{F_i^{1}, F_i^{2}, F_i^{3}\}_{i=1}^{N}\big)\Big)$ (4)
where Aggregate refers to the recombination of the grouped convolution outputs, and Shuffle denotes the channel shuffle operation. Finally, the original features are concatenated with the input to retain more detailed information:
$F_{LMGC} = \mathrm{Concate}\big(F_{mix}, \mathrm{Concate}(IR, VIS)\big)$ (5)
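For illustration, a minimal PyTorch sketch of how such an LMGC-style block could be assembled is given below. The group count, the additive merging of the three scales via short connections, and the final concatenation with the module input are assumptions made for readability, not the exact design.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # break channel independence between groups (as in GSConv / ShuffleNet)
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class LMGCSketch(nn.Module):
    """Hypothetical reading of Eqs. (1)-(5): pointwise reduction, grouped multi-scale
    3x3 cascade, channel shuffle, and a skip concatenation with the input."""
    def __init__(self, in_ch: int, groups: int = 4):
        super().__init__()
        mid = in_ch // 2
        self.pw = nn.Conv2d(in_ch, mid, kernel_size=1)
        # three cascaded 3x3 grouped convolutions -> 3x3 / 5x5 / 7x7 receptive fields
        self.c1 = nn.Conv2d(mid, mid, 3, padding=1, groups=groups)
        self.c2 = nn.Conv2d(mid, mid, 3, padding=1, groups=groups)
        self.c3 = nn.Conv2d(mid, mid, 3, padding=1, groups=groups)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f0 = self.pw(x)                      # Eq. (1): halve the channel count
        f1 = self.c1(f0)                     # one 3x3 conv
        f2 = self.c2(f1)                     # two stacked 3x3 convs (about 5x5)
        f3 = self.c3(f2)                     # three stacked 3x3 convs (about 7x7)
        multi = f1 + f2 + f3                 # short connections merge the scales (assumed additive)
        multi = channel_shuffle(multi, self.groups)   # Eq. (4): inter-group interaction
        return torch.cat([multi, x], dim=1)  # Eq. (5): keep the original detail

# example: y = LMGCSketch(64)(torch.rand(1, 64, 32, 32))  -> (1, 96, 32, 32)
```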
Swiftformer.
Given the significant computational overhead of traditional attention mechanisms, SwiftFormer, proposed in [35], offers an efficient alternative. By eliminating key-value interactions while retaining performance, SwiftFormer encodes query-key interactions through merged linear projection layers. This approach, termed efficient additive attention, results in faster inference and more robust contextual representations. To address the limited ability of our base model to capture global context, we incorporate the SwiftFormer module as the global feature extraction component in the backbone network. This process can be formulated as:
$F_{L} = \mathrm{LMGC}\big(\mathrm{Concate}(IR, VIS)\big)$ (6)
$F_{G} = \mathrm{SwiftFormer}(F_{L})$ (7)
where IR and VIS represent the two input source images, $F_{L}$ denotes the local feature maps output by the LMGC, and $F_{G}$ represents the local-global feature maps output by the SwiftFormer module.
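The core idea of efficient additive attention, replacing the quadratic query-key matrix with a single learned global query, can be sketched as follows. This is a simplified illustration in the spirit of SwiftFormer [35], not the module's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttentionSketch(nn.Module):
    """Query-key-only attention with linear cost in the number of tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w = nn.Parameter(torch.randn(dim))   # learnable scoring vector
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C) token sequence
        q, k = self.to_q(x), self.to_k(x)
        scores = (q @ self.w) * self.scale                # (B, N) per-token relevance
        alpha = F.softmax(scores, dim=1).unsqueeze(-1)
        q_global = (alpha * q).sum(dim=1, keepdim=True)   # (B, 1, C) global query
        return self.proj(k * q_global) + q                # query-key interaction only, no value matrix

# example: out = AdditiveAttentionSketch(64)(torch.rand(2, 196, 64))
```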
Feature fusion module
DMFM.
To effectively fuse shallow and deep features from the two modalities, we design a dual-branch multilevel fusion module based on multilevel attention [36] (see Fig 3). To compensate for the lack of spatial information in deep features, we introduce multidimensional hybrid spatial attention after the channel attention (CA) mechanism. The deep feature maps are represented as layers $D_1$, $D_2$, $D_3$, and $D_4$, with 32, 64, 128, and 256 channels, respectively. Ascending pointwise convolution increases the number of channels in each layer to 64, 128, 256, and 512, while descending pointwise convolution reduces the channel dimensions back to 32, 64, 128, and 256. For this task, we select two VGG-19 networks with pre-trained weights as sub-stem networks. These networks take visible and infrared images as input and aim to fully exploit the shallow features of the source images.
[Figure omitted. See PDF.]
The module is specifically designed to resolve mismatches in cross-domain information fusion encountered by current algorithms. It captures salient features from different information domains and achieves more effective fusion. This design addresses the issue of information loss caused by single-domain fusion in traditional methods. Unlike conventional fusion methods that rely on direct weighting or simple concatenation, the module integrates features from different layers in a manner tailored to their characteristics.
To enhance the fusion of shallow infrared and visible light features, we employ a lightweight cross-attention fusion module. The infrared features from the sub-stem network are denoted as $R_1$, $R_2$, and $R_3$, corresponding to layers with 64, 128, and 256 channels, respectively. Similarly, the visible features are denoted as $V_1$, $V_2$, and $V_3$. Finally, the deep features, infrared features, and visible features are fused along the channel dimension.
The shallow feature maps from the infrared and visible images are first concatenated, and the fusion is performed using a lightweight cross-attention fusion module combined with 3x3 depthwise convolution. Concurrently, deep features undergo feature extraction using channel attention and multidimensional hybrid spatial attention. This process can be expressed as:
$F_{s} = \mathrm{DWConv}\big(\mathrm{LCFM}(R_i, V_i)\big)$ (8)
$F_{d} = \mathrm{MHSA}\big(\mathrm{CA}(D_i)\big)$ (9)
where $R_i$ and $V_i$ denote the shallow features of the source images at the VGG-19 network output, $D_i$ is the backbone network output at the corresponding level, and DWConv represents the 3x3 depthwise convolution. To effectively fuse and enhance feature representation, the outputs from both sub-branches are concatenated, dimensionality is reduced using pointwise convolution, and detailed features are extracted via depthwise convolution. This process is expressed as:
$F_{m} = \mathrm{DPConv}\big(\mathrm{Concate}(F_{s}, F_{d})\big)$ (10)
where DPConv represents the pointwise convolution followed by two 3x3 depthwise convolutions. Finally, in the second branch, all feature maps are concatenated, and edge texture information is extracted using the Sobel operator and depthwise convolution. The two branches are then fused, with the final process represented as:
$F_{out} = \mathrm{Fusion}\Big(F_{m},\ \mathrm{DWConv}\big(\mathrm{Sobel}(\mathrm{Concate}(R_i, V_i, D_i))\big)\Big)$ (11)
where Sobel represents the edge detection operator, $F_{out}$ denotes the final output feature map, and Fusion represents the final integration of both branches.
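As an illustration of the edge branch in Eq. (11), the sketch below applies fixed Sobel kernels depthwise and follows them with a learnable grouped 3x3 convolution; the channel count and the way the two gradient maps per channel are merged are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdgeBranch(nn.Module):
    """Fixed depthwise Sobel filtering followed by a learnable grouped 3x3 convolution."""
    def __init__(self, ch: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        kernel = torch.stack([gx, gy]).unsqueeze(1)                   # (2, 1, 3, 3)
        self.register_buffer("kernel", kernel.repeat(ch, 1, 1, 1))    # (2*ch, 1, 3, 3)
        self.ch = ch
        self.dw = nn.Conv2d(2 * ch, ch, 3, padding=1, groups=ch)      # merge gx/gy per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        grads = F.conv2d(x, self.kernel, padding=1, groups=self.ch)   # two gradient maps per channel
        return self.dw(grads)

# example: edges = SobelEdgeBranch(8)(torch.rand(1, 8, 32, 32))  -> (1, 8, 32, 32)
```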
MHSA.
To address the limitations of prior channel attention mechanisms, such as inadequate generalization ability and the issues arising from channel dimension reduction, efficient local attention (ELA) was proposed in [37]. ELA extracts feature vectors in both the horizontal and vertical directions by applying strip pooling in the spatial dimension, while maintaining elongated kernel shapes to capture long-range dependencies. This approach also reduces the influence of irrelevant regions on label prediction.
To mitigate the lack of spatial information, we propose a MHSA (Fig 4) based on ELA. While ELA convolves and normalizes the two dimensions separately before generating weights, which limits the diversity of information obtained, we introduce a multidimensional fusion branch to enhance feature representation. This addition improves model accuracy by fusing features across dimensions, thereby capturing richer and more comprehensive information.
[Figure omitted. See PDF.]
Initially, the input features are pooled equally across both dimensions, processed using 1D convolution, and normalized. The corresponding weights are then generated via a Sigmoid function. The fusion branch then concatenates the pooled features, after which a 3x3 depthwise convolution extracts local features and reduces the number of network parameters. The output from the multidimensional fusion branch is divided into two parts, with corresponding weights generated through Sigmoid. Finally, these weights are fused with the weights of the other two dimensions and multiplied with the initial features to produce the final feature map.
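A rough PyTorch sketch of this design is given below; the 1D kernel size, the normalization, and the way the fusion branch's weights are combined with the two directional weights are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MHSASketch(nn.Module):
    """ELA-style strip pooling along H and W plus an assumed multi-dimensional fusion branch."""
    def __init__(self, ch: int, k: int = 5):
        super().__init__()
        self.conv_h = nn.Conv1d(ch, ch, k, padding=k // 2, groups=ch)
        self.conv_w = nn.Conv1d(ch, ch, k, padding=k // 2, groups=ch)
        self.fuse = nn.Conv1d(ch, ch, 3, padding=1, groups=ch)   # fusion branch (assumed 1D depthwise)
        self.gn = nn.GroupNorm(4, ch)                            # assumes ch divisible by 4
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = x.mean(dim=3)                                       # (B, C, H): pooled along width
        xw = x.mean(dim=2)                                       # (B, C, W): pooled along height
        ah = self.act(self.gn(self.conv_h(xh)))                  # directional weights (height)
        aw = self.act(self.gn(self.conv_w(xw)))                  # directional weights (width)
        fused = self.fuse(torch.cat([xh, xw], dim=2))            # joint processing of both directions
        fh, fw = self.act(fused[:, :, :h]), self.act(fused[:, :, h:])
        return x * (ah + fh).view(b, c, h, 1) * (aw + fw).view(b, c, 1, w)

# example: weighted = MHSASketch(64)(torch.rand(2, 64, 32, 32))
```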
LCFM.
The cross-attention mechanism [38], an extension of self-attention originally introduced in Transformers, enhances model performance by focusing not only on internal positional relationships within an input sequence but also on relationships between positions in different input sequences. While traditional self-attention [39] emphasizes internal relations within a single sequence, cross-attention plays a crucial role in multimodal fusion in computer vision, allowing the transfer of information between different modalities.
The cross-attention mechanism further enhances the fusion process by effectively combining high-temperature regions from infrared images with edge information from visible light images. It enables precise matching and interaction of cross-modal features, addressing the inefficiencies of current methods in fusing features from different modalities. Existing approaches often lack accurate feature interaction mechanisms during modality fusion, leading to feature conflicts or information redundancy. In contrast, cross-attention dynamically adjusts the focus on regions of interest in different modalities, significantly improving the efficiency and accuracy of cross-modal fusion.
In the context of infrared and visible image fusion, the objective is to produce a composite image that captures salient targets while preserving rich texture details. Fully utilizing the distinctive and shared information between source images is essential for superior fusion performance. Motivated by the effectiveness of cross-attention in extracting common features across images, we propose Difference Feature Attention in Fig 5(a) and Common Feature Attention in Fig 5(b), both of which are embedded into the LCFM, as depicted in Fig 5(c).
[Figure omitted. See PDF.]
(a) Difference Feature Attention (b) Common Feature Attention (c) The LCFM we proposed.
First, the input feature maps of the two modalities are transformed into query Q, key K, and value V using depthwise convolution followed by pointwise convolution, as expressed by the following equation:
$Q = \mathrm{PWConv}\big(\mathrm{DWConv}(F_{ir})\big), \quad K = \mathrm{PWConv}\big(\mathrm{DWConv}(F_{vis})\big), \quad V = \mathrm{PWConv}\big(\mathrm{DWConv}(F_{vis})\big)$ (12)
where $F_{ir}$ and $F_{vis}$ denote the infrared and visible feature maps entering the attention block.
Next, to explore common information between infrared and visible images while accounting for long-term dependencies, we calculate the similarity matrix between Q and K using a dot-product attention layer. This matrix is then multiplied by V to infer the shared information between Q and V. This process is represented as:
$F_{com} = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (13)
where $d_k$ is the channel dimension of K.
Subsequently, the differential information can be obtained by subtracting the shared information from the original data, represented as:
$F_{dif} = V - F_{com}$ (14)
Finally, to obtain complementary information from multimodal images, we inject the differential information back into Q, which can be formulated as:
$F_{out} = (Q + F_{dif}) + \mathrm{MLP}\big(\mathrm{LN}(Q + F_{dif})\big)$ (15)
where LN represents layer normalization, and MLP refers to the multi-layer perceptron.
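To make Eqs. (12)-(15) concrete, the following is a simplified PyTorch sketch of the common/difference attention on token sequences; the linear projections (in place of the depthwise and pointwise convolutions), the Q/K/V assignment, and the MLP width are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonDifferenceAttention(nn.Module):
    """Dot-product attention extracts shared information, subtraction recovers the
    differential part, and a LayerNorm + MLP block injects it back into the query."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.scale = dim ** -0.5

    def forward(self, f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
        # f_a, f_b: (B, N, C) token sequences from the two modalities
        q, k, v = self.to_q(f_a), self.to_k(f_b), self.to_v(f_b)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        common = attn @ v                 # Eq. (13): shared information
        diff = v - common                 # Eq. (14): differential information
        x = q + diff                      # Eq. (15): inject the complement back into Q
        return x + self.mlp(self.norm(x))

# example: out = CommonDifferenceAttention(64)(torch.rand(1, 196, 64), torch.rand(1, 196, 64))
```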
Loss function
The goal of image fusion is to produce an information-rich image with sufficient detail and balanced intensity by combining the favorable features of the source images. The loss function consists of two parts, the fundamental loss and the contrast loss. The fundamental loss can be defined as:
$\mathcal{L}_{fund} = \mathcal{L}_{pixel} + \alpha \mathcal{L}_{grad}$ (16)
where $\mathcal{L}_{pixel}$ and $\mathcal{L}_{grad}$ denote the pixel loss and gradient loss, respectively, and $\alpha$ is the hyperparameter that balances these two loss terms.
$\mathcal{L}_{grad} = \frac{1}{HW}\Big\| \left|\nabla I_{f}\right| - \max\!\big(\left|\nabla I_{ir}\right|, \left|\nabla I_{vis}\right|\big) \Big\|_{1}$ (17)
where $\nabla$ denotes the Sobel operator, which is used to compute the gradient, $\|\cdot\|_{1}$ denotes the $\ell_{1}$ norm, H and W denote the height and width, respectively, and max( , ) denotes the element-by-element maximum selection.
$\mathcal{L}_{pixel} = \frac{1}{HW}\big\| I_{f} - \max\!\big(I_{ir}, I_{vis}\big) \big\|_{1}$ (18)
where $I_{ir}$ and $I_{vis}$ are the source images and $I_{f}$ is the fused image. The contrast loss can be expressed as:
$\mathcal{L}_{con} = \sum_{i=1}^{N} \frac{\big\| \phi_{i}(I_{f}) - \phi_{i}(p) \big\|_{1}}{\sum_{m=1}^{M} \big\| \phi_{i}(I_{f}) - \phi_{i}(n_{m}) \big\|_{1}}$ (19)
where N and M are the number of VGG layers and the number of negative samples for each positive sample, respectively, $\phi_{i}(I_{f})$ denotes the foreground feature of the fused image at the i-th VGG layer, m indexes the negative samples, and p and $n_{m}$ are the positive and negative samples. So the final loss function can be expressed as:
$\mathcal{L}_{total} = \mathcal{L}_{fund} + \beta \mathcal{L}_{con}$ (20)
where $\beta$ balances the fundamental loss and the contrast loss.
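A minimal sketch of the fundamental loss in Eqs. (16)-(18) is shown below for single-channel inputs; the Sobel implementation and the weight alpha = 10 are assumptions, since the exact value is not reported here.

```python
import torch
import torch.nn.functional as F

def sobel_grad(img: torch.Tensor) -> torch.Tensor:
    """Absolute image gradient via fixed Sobel kernels; img is (B, 1, H, W)."""
    gx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]], device=img.device)
    gy = gx.transpose(-2, -1)
    return F.conv2d(img, gx, padding=1).abs() + F.conv2d(img, gy, padding=1).abs()

def fundamental_loss(fused, ir, vis, alpha: float = 10.0):
    """L1 pixel loss against the elementwise maximum of the sources (Eq. 18) plus an
    alpha-weighted gradient loss (Eq. 17); alpha is an assumed value."""
    pixel = F.l1_loss(fused, torch.maximum(ir, vis))
    grad = F.l1_loss(sobel_grad(fused), torch.maximum(sobel_grad(ir), sobel_grad(vis)))
    return pixel + alpha * grad

# example: loss = fundamental_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```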
Experiments
Comparison with SOTA methods
Datasets and experimental parameters.
For our experiments, we utilize two publicly available datasets: the TNO dataset [40] and the Roadscene dataset [41]. The experiments are conducted in the following environment: an Intel i9-12900k processor (3.2 GHz), an NVIDIA GeForce RTX 3090 GPU, 64 GB of RAM, Python 3.9.0 as the programming language, Windows 11 as the operating system, PyTorch 1.10.1 as the deep learning framework, and CUDA version 11.2.
Training details.
The entire fusion framework is trained on the TNO dataset in two stages: a pre-training phase and a fine-tuning phase. During pre-training, we select 46 image pairs, which are converted into grayscale. To fully leverage the gradient and pixel information from each image, 1,410 image blocks of size 64 × 64 are cropped from the source images. The Adam optimizer is used with a learning rate of 0.0001 and a batch size of 30. During this phase, only the fundamental loss is employed to update the network parameters. In the fine-tuning phase, the dataset consists of 18 images from the TNO dataset containing significant masks. As in the pre-training phase, 1,410 image blocks of size 64 × 64 are extracted. A contrast-constrained loss is applied, utilizing a positive sample alongside three negative samples. The network is trained for 5 epochs, using the same optimizer, learning rate, and batch size as in the pre-training phase. In this stage, both fundamental and contrast losses are utilized to update the network weights.
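The 64 x 64 block extraction can be sketched as follows; the non-overlapping stride is an assumption, since the cropping strategy used to obtain the 1,410 blocks is not specified.

```python
import torch

def crop_patches(img: torch.Tensor, patch: int = 64, stride: int = 64) -> torch.Tensor:
    """Cut a (C, H, W) grayscale image into patch x patch blocks (non-overlapping by default)."""
    c, _, _ = img.shape
    blocks = img.unfold(1, patch, stride).unfold(2, patch, stride)   # (C, nH, nW, patch, patch)
    return blocks.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

# example: a 256 x 320 image yields 4 * 5 = 20 blocks
patches = crop_patches(torch.rand(1, 256, 320))
print(patches.shape)   # torch.Size([20, 1, 64, 64])
```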
Comparison models and evaluation metrics.
To assess the effectiveness of the proposed algorithm, we compare it with six state-of-the-art fusion algorithms, as well as foundational models, for a total of seven comparison models: SwinFusion [30], U2Fusion [41], RFN [42], DenseFuse [15], LRRNet [43], DIDFuse [44], and CoCoNet [36]. The experimental parameters of the comparison methods are fine-tuned based on the settings provided in the original papers and adapted to the laboratory configuration.
For quantitative evaluation, we employ six metrics to measure fusion performance: Average Gradient (AG), Entropy (EN), Standard Deviation (SD), Spatial Frequency (SF), Visual Information Fidelity (VIF), and the Sum of the Correlations of Differences (SCD).
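For reference, two of these metrics can be computed as follows using their common definitions in the fusion literature; the exact formulations used for the reported scores may differ slightly.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    """Shannon entropy (EN) of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img: np.ndarray) -> float:
    """Average gradient (AG) from horizontal and vertical finite differences."""
    gx = np.diff(img.astype(np.float64), axis=1)[:-1, :]
    gy = np.diff(img.astype(np.float64), axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

# example: fused = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
#          print(entropy(fused), average_gradient(fused))
```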
Results and analysis on TNO dataset.
Qualitative Comparisons on TNO: The introduction of LCFM and LMGC significantly enhances our results, providing well-defined features such as figures, bushes, and road signs (highlighted by red and green boxes) with clear background texture. As seen in the first image (Fig 6), the salient object (the figure in the red box) is more distinct due to the edge enhancement facilitated by the multilevel feature integration module. While DenseFuse and DIDFuse also deliver clear thermal features, their images lack brightness, which diminishes their overall visual quality. U2Fusion and RFN fail to render a clear target, resulting in blurry depictions of humans. In contrast, our fusion images preserve vivid texture details. In the second image (Fig 7), the fusion results from CoCoNet and SwinFusion exhibit low contrast, with the target (the human in the green box) appearing blurred and indistinct. Other algorithms show limited texture detail and poor visualization under low light conditions. Our model achieves the optimal balance between saliency and vivid detail preservation.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Quantitative Comparisons on TNO: As shown in Table 1, the red highlights indicate the highest value for each metric. The results demonstrate that our proposed network outperforms the other seven algorithms. Specifically, on the TNO dataset, our algorithm achieves the best results for EN, SCD, and AG, indicating superior fusion of texture details and appropriate contrast. While our method ranks second for VIF and SF, the performance is only marginally lower than the top-ranked algorithm. It ranks third for SD, but the difference compared to the top two algorithms is minimal, suggesting potential for further improvement.
[Figure omitted. See PDF.]
Results and analysis on Roadscene dataset.
Qualitative Comparisons on Roadscene Dataset: In Roadscene.04424 (Fig 8), the fusion results from RFN, DenseFuse, and U2Fusion (vehicles in the red and green boxes) exhibit unclear edges and low contrast. DIDFuse produces poorly lit significant targets with blurred textures, while the SwinFusion result is overly bright, leading to a less visually appealing image. The fusion images from LRRNet and CoCoNet appear darker but retain clear textures. In contrast, our method generates brighter and more distinct significant targets (pedestrians in the red boxes) with clear background textures, demonstrating the effectiveness of our cross-fertilization module and multi-branch feature interaction module in capturing edge detail information.
[Figure omitted. See PDF.]
Quantitative Comparisons on Roadscene Dataset: Table 1 presents a quantitative comparison on the RoadScene dataset. Our method achieves the best results for SD, EN, AG, and SCD, confirming the superior fusion results of our approach. For the remaining metrics, our algorithm ranks second, with only a 5% reduction compared to the top-performing method. As shown in Fig 9, we further compare our algorithm with the seven existing fusion methods using six metrics across 10 image pairs from the RoadScene dataset. Our method ranks highest in all metrics except VIF and SF, affirming that our fusion strategy retains the most information and achieves the best overall fusion performance.
[Figure omitted. See PDF.]
To further evaluate the real-time performance and resource consumption of the algorithm across different hardware environments, comparative experiments were conducted on a variety of platforms:
1. High-performance GPU environment: the representative device was the NVIDIA RTX 3090, known for its exceptional computational power and abundant memory, making it ideal for server-side batch processing and high-performance tasks. Use case: large-scale image fusion tasks, such as real-time surveillance systems or cloud-based image processing services.
2. Mid-range GPU environment: the representative device was the NVIDIA GTX 1650, which offers low power consumption and adequate computational capability, making it suitable for lightweight tasks. Use case: small-scale image fusion tasks on personal computers or portable workstations.
3. Embedded computing environment (high-end embedded device): the representative device was the NVIDIA Jetson Xavier NX, specifically designed for embedded AI applications. It delivers relatively high computational performance, supports CUDA acceleration, and has low power consumption. Use case: real-time processing for drones, smart cameras, and in-vehicle systems.
4. Embedded computing environment (low-end embedded device): the representative device was the Raspberry Pi 4 (with 4 GB of memory), characterized by its limited computational resources and low power consumption, making it ideal for resource-constrained edge devices. Use case: offline processing tasks with low real-time requirements, such as sensor fusion and data collection.
Experimental results demonstrated that the NVIDIA RTX 3090 excelled in real-time performance, with the shortest processing time and lowest latency. However, it consumed the most memory and power. In contrast, the Raspberry Pi 4 had the lowest memory and power usage, albeit with significantly reduced real-time performance. Therefore, the choice of hardware environment for deploying the algorithm should be tailored to the specific requirements of the application.
[Figure omitted. See PDF.]
To analyze the adaptability and robustness of the multi-dimensional fusion branches across different scenarios, comparative experiments were conducted under the following conditions:
1. Daytime scenario: characterized by uniform lighting and rich details, but complex backgrounds may interfere with target features. Adaptability analysis: the multi-branch design enhances texture details and suppresses interference from the complex background, ensuring that target features are prominently represented.
2. Nighttime scenario: the lighting is insufficient, the contrast between target areas and the background is low, and noise may be more prominent. Adaptability analysis: the module effectively suppresses noise and extracts hidden high-frequency information, improving the visibility of nighttime targets (e.g., vehicles, pedestrians).
3. Low-light scenario: this scenario features significant lighting variations (e.g., alternating light and dark regions, indoor-to-outdoor transitions) with stark contrasts in brightness between different areas. Adaptability analysis: by incorporating cross-attention and spatial attention mechanisms, the module balances feature representation between bright and dark regions, reducing the impact of uneven lighting on fusion quality.
We conducted experiments on the TNO dataset across these various scenarios to assess the adaptability and robustness of the proposed method, as shown in Table 3. The results demonstrate that, in daytime scenarios with abundant lighting, the four metrics were the highest. In contrast, the metrics slightly decreased in nighttime and low-light scenarios, indicating that the proposed algorithm exhibits strong robustness and adaptability across different environments.
[Figure omitted. See PDF.]
Ablation studies
To assess the performance of each module, ablation experiments were conducted on the TNO and Roadscene datasets. In the first set of experiments, the Swiftformer module was removed while keeping the other modules intact, referred to as ST. In the second set, the multidimensional hybrid spatial attention was excluded, referred to as MHSA. The third set involved replacing the lightweight cross-attention fusion module with the channel attention module, referred to as LCFM. The fourth set replaced the dual-branch multilevel fusion module with the channel attention module, referred to as DMFM. The fifth set replaced the lightweight multi-scale grouped convolution with standard convolution, referred to as LMGC. The sixth set represents the proposed model, denoted as Ours.
Objective Evaluation: As shown in Table 4, removing the Swiftformer module results in a significant decrease in performance metrics on both the TNO and Roadscene datasets, highlighting the importance of global feature extraction for effective image fusion. The fusion metrics in the second to fourth experiments also show slight declines, confirming that the modules introduced in this study enhance image fusion performance. The removal of the lightweight multi-scale grouped convolution causes only a minor reduction in the metrics, suggesting that its capacity for local feature extraction is comparable to that of standard convolution.
[Figure omitted. See PDF.]
Subjective Evaluation: In Fig 10, the absence of Swiftformer results in a network that lacks global structural features and exhibits reduced image brightness. The second set, which lacks MHSA, produces images with diminished clarity. The third set, without LCFM, results in darker image regions, particularly affecting character visibility. The fourth and fifth sets, which lack edge enhancement branches and multi-scale information fusion, lead to blurred texture edges, demonstrating the superior capability of the proposed algorithm in capturing detailed features. The sixth set, representing the proposed model, achieves clear character and streetlight contours, well-defined background textures, and balanced illumination.
[Figure omitted. See PDF.]
To demonstrate the effectiveness of stacking small convolution kernels, we conducted ablation experiments on the TNO dataset. The first experiment used a single 3×3 convolution, keeping other parts unchanged (see Fig 11(a)). The second experiment utilized parallel 3×3 and 5×5 convolutions (see Fig 11(b)). The third experiment employed our proposed stacked small convolution kernel method. As shown in Table 5, our method achieved the best performance across all four metrics, demonstrating the superiority of stacked small convolution kernels for image fusion.
[Figure omitted. See PDF.]
(a) The first experiment used a single 3×3 convolution (b) The second experiment utilized parallel 3×3 and 5×5 convolutions.
[Figure omitted. See PDF.]
To demonstrate the superiority of the proposed cross-attention mechanism, we conducted comparative and ablation experiments on the TNO dataset. In the comparative experiments, we compared cross-attention with other attention mechanisms and convolution operations. Specifically, in the first experiment, cross-attention was replaced with standard 3×3 convolutions for both branches (see Fig 12(a)); in the second experiment, it was replaced with channel attention for both branches (see Fig 12(b)); and in the third experiment, it was replaced with multi-head attention for both branches (see Fig 12(c)). For the ablation experiments, the first experiment removed differential attention with all other components unchanged, the second removed common attention with all other components unchanged, and the third used the proposed method with both attention mechanisms intact.
[Figure omitted. See PDF.]
(a) standard 3×3 convolutions for both branches (b) channel attention for both branches (c) multi-head attention for both branches.
As shown in Table 6, the proposed cross-attention mechanism achieved the best performance across all four metrics in the comparative experiments, demonstrating its advantages over other attention mechanisms and convolution operations. In Table 7, removing either differential or common attention resulted in a decline in performance metrics, highlighting that both types of attention are indispensable.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
The number of training parameters
In addition to qualitative and quantitative analyses, model size and runtime speed are critical factors in practical applications. Consequently, we evaluate the memory consumption and computational efficiency of the proposed model. It is important to note that the results presented in this section may differ slightly from those reported in the original papers due to variations in platform settings, hyperparameters, and other factors. Specifically, FLOPs and the number of training parameters are computed using an input size of 64×64 pixels. For runtime measurements, we randomly selected ten images from the TNO dataset and calculated the average execution time.
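A generic routine for counting parameters and averaging runtime is sketched below (FLOPs are typically obtained with a profiler such as thop or fvcore); this is an illustration of the measurement protocol, not the exact evaluation script.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def average_runtime(model, inputs, device="cuda", warmup=3):
    """Average forward-pass time over a list of input tensors (e.g., ten TNO pairs)."""
    model = model.to(device).eval()
    for x in inputs[:warmup]:
        model(x.to(device))                      # warm-up iterations
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for x in inputs:
        model(x.to(device))
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / len(inputs)
```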
Fig 13 and Table 8 present the quantitative results for model size, FLOPs, and runtime for several state-of-the-art methods. Our model demonstrates superior speed, outperforming all methods except DIDFuse and CoCoNet. The incorporation of spatial attention and the SwiftFormer module increases the number of parameters compared to DIDFuse and LRRNet. However, by employing lightweight multi-scale grouped convolution and the lightweight cross-fusion module, our algorithm reduces the parameter count by 40% relative to CoCoNet. Although our runtime is 15% longer than that of CoCoNet, our model still ranks third in overall performance. Furthermore, while the enhanced multi-branch feature interaction module results in a larger number of parameters, it effectively extracts more edge and texture information, leading to clearer fusion results.
[Figure omitted. See PDF.]
Infrared-visible object detection
Target detection is a well-established and extensively studied task in advanced computer vision. With the continual improvement of multimodal datasets, their capacity to capture and reflect semantic information has become increasingly important for evaluating multimodal image fusion techniques. This subsection examines the impact of image fusion on target detection. Implementation Details: Experiments were conducted on the dataset [45] using the state-of-the-art YOLOv5 [46] detector. The detector’s configuration was kept consistent with its original settings, and all quantitative results were directly derived from the test code. The quantitative comparison results for the compared methods are presented in Table 9. The metric mAP@0.5 refers to the mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5, while the other values represent the Average Precision (AP) for the respective classes. In terms of mAP@0.5, which is the primary metric of interest, the proposed method demonstrates superior performance.
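For reference, fused outputs can be passed to an off-the-shelf YOLOv5 model loaded through torch.hub, as sketched below; the file paths are placeholders, and the dataset-level mAP@0.5 in Table 9 comes from the detector's own test pipeline rather than this per-image call.

```python
import torch

# load a pretrained YOLOv5 detector from the official hub entry point
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model(["fused/frame_0001.png", "fused/frame_0002.png"])  # placeholder paths to fused images
results.print()   # per-image detections (class, confidence, box)
```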
[Figure omitted. See PDF.]
Conclusion
This paper introduces a multi-scale image fusion algorithm with a two-branch, multi-level feature fusion structure to achieve effective fusion between infrared and visible images. A lightweight multi-scale grouped convolution is proposed to reduce network parameters while extracting both local features and multi-scale information. Additionally, the multi-level feature integration module is improved by incorporating an edge feature enhancement branch and replacing the channel attention mechanism for shallow feature extraction with a lightweight cross-attention fusion module. This adjustment, tailored for multimodal fusion, enhances edge information extraction and enables more effective fusion of shallow features from both modalities. Furthermore, an improved spatial attention mechanism is introduced to incorporate a multi-dimensional fusion branch, allowing for more comprehensive spatial information fusion. Extensive experiments conducted on publicly available datasets demonstrate that the proposed algorithm consistently outperforms existing algorithms in both objective and subjective evaluations. While running efficiency experiments suggest that further optimization is needed in terms of processing speed, the algorithm still achieves competitive performance in target detection tasks when compared to other methods.
References
1. Ma J, Ma Y, Li C. Infrared and visible image fusion methods and applications: a survey. Inf Fusion. 2019;45:153–78.
2. Xu H, Ma J. EMFusion: an unsupervised enhanced medical image fusion network. Inf Fusion. 2021;76:177–86.
3. Petrović VS, Xydeas CS. Gradient-based multiresolution image fusion. IEEE Trans Image Process. 2004;13(2):228–37. pmid:15376943
4. Bhatnagar G, Wu Q, Liu Z. Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Trans Multim. 2013;15(5):1014–24.
5. Yin M, Duan P, Liu W, Liang X. A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation. Neurocomputing. 2017;226:182–91.
6. Kim M, Han D, Ko H. Joint patch clustering-based dictionary learning for multimodal image fusion. Inf Fusion. 2016;27:198–214.
7. Abdi H, Williams L. Principal component analysis. Wiley Interdiscip Rev: Comput Statist. 2010;2(4):433–59.
8. Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000;13(4–5):411–30. pmid:10946390
9. Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion. 2015;24:147–64.
10. Xiao F, Wen J, Pedrycz W, Aritsugi M. Complex evidence theory for multisource data fusion. Chin J Inf Fusion. 2024;1(2):134–59.
11. Jin X, Sun T, Chen W, Ma H, Wang Y, Zheng Y. Parameter adaptive non-model-based state estimation combining attention mechanism and LSTM. IECE Trans Intell Syst. 2024;1(1):40–8.
12. Guo X, Yang F, Ji L. A mimic fusion algorithm for dual channel video based on possibility distribution synthesis theory. Chin J Inf Fusion. 2024;1(1):33–49.
13. Abro G, Ali Z, Rajput S. Innovations in 3D object detection: a comprehensive review of methods, sensor fusion, and future directions. IECE Trans Sens Commun Control. 2024;1(1):3–29.
14. Liu Y, Chen X, Cheng J, Peng H, Wang Z. Infrared and visible image fusion with convolutional neural networks. Int J Wavelets Multiresolut Inf Process. 2018;16(3):1850018.
15. Li H, Wu XJ. DenseFuse: a fusion approach to infrared and visible images. IEEE Trans Image Process. 2018;28(5):2614–23.
16. Jian L, Yang X, Liu Z, Jeon G, Gao M, Chisholm D. SEDRFuse: a symmetric encoder–decoder with residual block network for infrared and visible image fusion. IEEE Trans Instrum Meas. 2020;70(1):1–15.
17. Li H, Wu XJ, Kittler J. RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion. 2021;73:72–86.
18. Ma J, Tang L, Xu M, Zhang H, Xiao G. STDFusionNet: an infrared and visible image fusion network based on salient target detection. IEEE Trans Instrum Meas. 2021;70:1–13.
19. Li H, Wu XJ, Kittler J. Infrared and visible image fusion using a deep learning framework. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018. p. 2705–10.
20. Ma J, Yu W, Liang P, Li C, Jiang J. FusionGAN: a generative adversarial network for infrared and visible image fusion. Inf Fusion. 2019;48:11–26.
21. Ma J, Xu H, Jiang J, Mei X, Zhang X. DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans Image Process. 2020;29:4980–95.
22. Li Q, Lu L, Li Z, Wu W, Liu Z, Jeon G. Coupled GAN with relativistic discriminators for infrared and visible images fusion. IEEE Sens J. 2019;21(6):7458–67.
23. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5802–11.
24. Vaswani A. Attention is all you need. Adv Neural Inf Process Syst. 2017.
25. Dosovitskiy A. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint. 2020.
26. Xi Z, Wang J, Kang Y. Oriented target detection algorithm based on transformer. In: Proceedings of the 2021 4th International Conference on Artificial Intelligence and Pattern Recognition. 2021. p. 22–8.
27. Li G, Zhang K, Su Y, Wang J. Feature pre-inpainting enhanced transformer for video inpainting. Eng Appl Artif Intell. 2023;123:106323.
28. Zou B, Ji Z, Zhu C, Dai Y, Zhang W, Kui X. Multi-scale deformable transformer for multi-contrast knee MRI super-resolution. Biomed Signal Process Control. 2023;79:104154.
29. Vs V, Valanarasu J, Oza P, Patel V. Image fusion transformer. In: 2022 IEEE International Conference on Image Processing (ICIP). 2022. p. 3566–70.
30. Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. SwinFusion: cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J Autom Sinica. 2022;9(7):1200–17.
31. Tang W, He F, Liu Y, Duan Y, Si T. DATFuse: infrared and visible image fusion via dual attention transformer. IEEE Trans Circuits Syst Video Technol. 2023;33(7):3159–72.
32. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer; 2015. p. 234–41.
33. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014.
34. Li H, Li J, Wei H, Liu Z, Zhan Z, Ren Q. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J Real-Time Image Process. 2024;21(3):62.
35. Shaker A, Maaz M, Rasheed H, Khan S, Yang M, Khan F. SwiftFormer: efficient additive attention for transformer-based real-time mobile vision applications. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 17425–36.
36. Liu J, Lin R, Wu G, Liu R, Luo Z, Fan X. CoCoNet: coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. Int J Comput Vision. 2024;132(5):1748–75.
37. Xu W, Wan Y. ELA: efficient local attention for deep convolutional neural networks. arXiv preprint. 2024.
38. Lin H, Cheng X, Wu X, Shen D. Cross attention in vision transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2022. p. 1–6.
39. Parikh AP, Täckström O, Das D, Uszkoreit J. A decomposable attention model for natural language inference. arXiv preprint. 2016.
40. Toet A. The TNO multiband image data collection. Data Brief. 2017;15:249–51. pmid:29034288
41. Xu H, Ma J, Jiang J, Guo X, Ling H. U2Fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell. 2022;44(1):502–18. pmid:32750838
42. Li H, Wu X, Kittler J. RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion. 2021;73:72–86.
43. Li H, Xu T, Wu X-J, Lu J, Kittler J. LRRNet: a novel representation learning guided fusion network for infrared and visible images. IEEE Trans Pattern Anal Mach Intell. 2023;45(9):11040–52. pmid:37074897
44. Zhao Z, Xu S, Zhang C, Liu J, Li P, Zhang J. DIDFuse: deep image decomposition for infrared and visible image fusion. arXiv preprint. 2020.
45. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5802–11.
46. Jocher G, Stoken A, Borovec J, Changyu L, Hogan A, Diaconu L, et al. ultralytics/yolov5: v3.1 - bug fixes and performance improvements. Zenodo. 2020.
Citation: Sun X, Lv F, Feng Y, Zhang X (2025) DMCM: Dwo-branch multilevel feature fusion with cross-attention mechanism for infrared and visible image fusion. PLoS ONE 20(3): e0318931. https://doi.org/10.1371/journal.pone.0318931
About the Authors:
Xicheng Sun
Contributed equally to this work with: Xicheng Sun, Xu Zhang
Roles: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft
Affiliation: Software College, Liaoning Technical University, Huludao, Liaoning, China
ORCID: https://orcid.org/0009-0008-9310-4840
Fu Lv
Roles: Funding acquisition, Resources, Supervision
E-mail: [email protected]
Affiliations: Software College, Liaoning Technical University, Huludao, Liaoning, China, Department of Basic Teaching School of Software, Liaoning Technical University, Huludao, Liaoning, China
Yongan Feng
Roles: Supervision, Writing – review & editing
Affiliation: Software College, Liaoning Technical University, Huludao, Liaoning, China
Xu Zhang
Contributed equally to this work with: Xicheng Sun, Xu Zhang
Roles: Visualization, Writing – review & editing
Affiliation: Software College, Liaoning Technical University, Huludao, Liaoning, China
1. Ma J, Ma Y, Li C. Infrared and visible image fusion methods and applications: a survey. Inf Fusion. 2019;45:153–78.
2. Xu H, Ma J. EMFusion: an unsupervised enhanced medical image fusion network. Inf Fusion. 2021;76:177–86.
3. Petrović VS, Xydeas CS. Gradient-based multiresolution image fusion. IEEE Trans Image Process. 2004;13(2):228–37. pmid:15376943
4. Bhatnagar G, Wu Q, Liu Z. Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Trans. Multim. 2013;15(5):1014–24.
5. Yin M, Duan P, Liu W, Liang X. A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation. Neurocomputing. 2017;226:182–91.
6. Kim M, Han D, Ko H. Joint patch clustering-based dictionary learning for multimodal image fusion. Inf Fusion. 2016;27:198–214.
7. Abdi H, Williams L. Principal component analysis. Wiley Interdiscip Rev: Comput Statist. 2010;2(4):433–59.
8. Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000;13(4–5):411–30. pmid:10946390
9. Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion. 2015;24:147–64.
10. Xiao F, Wen J, Pedrycz W, Aritsugi M. Complex evidence theory for multisource data fusion. Chin J Inf Fusion. 2024;1(2):134–59.
11. Jin X, Sun T, Chen W, Ma H, Wang Y, Zheng Y. Parameter adaptive non-model-based state estimation combining attention mechanism and LSTM. IECE Trans Intell Syst. 2024;1(1):40–8.
12. Guo X, Yang F, Ji L. A mimic fusion algorithm for dual channel video based on possibility distribution synthesis theory. Chin J Inf Fusion. 2024;1(1):33–49.
13. Abro G, Ali Z, Rajput S. Innovations in 3D object detection: A comprehensive review of methods, sensor fusion, and future directions. IECE Trans Sens Commun Control. 2024;1(1):3–29.
14. Liu Y, Chen X, Cheng J, Peng H, Wang Z. Infrared and visible image fusion with convolutional neural networks. Int J Wavelets Multiresolut Inf Process. 2018;16(3):1850018.
15. Li H, Wu XJ. DenseFuse: a fusion approach to infrared and visible images. IEEE Trans Image Process. 2018;28(5):2614–23.
16. Jian L, Yang X, Liu Z, Jeon G, Gao M, Chisholm D. SEDRFuse: a symmetric encoder–decoder with residual block network for infrared and visible image fusion. IEEE Trans Instrum Meas. 2020;70(1):1–15.
17. Li H, Wu XJ, Kittler J. RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion. 2021;73:72–86.
18. Ma J, Tang L, Xu M, Zhang H, Xiao G. STDFusionNet: an infrared and visible image fusion network based on salient target detection. IEEE Trans Instrum Meas. 2021;70:1–13.
19. Li H, Wu XJ, Kittler J. Infrared and visible image fusion using a deep learning framework. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018. p. 2705–10.
20. Ma J, Yu W, Liang P, Li C, Jiang J. FusionGAN: a generative adversarial network for infrared and visible image fusion. Inf Fusion. 2019;48:11–26.
21. Ma J, Xu H, Jiang J, Mei X, Zhang X. DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans Image Process. 2020;29:4980–95.
22. Li Q, Lu L, Li Z, Wu W, Liu Z, Jeon G. Coupled GAN with relativistic discriminators for infrared and visible images fusion. IEEE Sens J. 2019;21(6):7458–67.
23. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5802–11.
24. Vaswani A. Attention is all you need. Adv Neural Inf Process Syst. 2017.
25. Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint 2020.
26. Xi Z, Wang J, Kang Y. Oriented target detection algorithm based on transformer. In: Proceedings of the 2021 4th International Conference on Artificial Intelligence and Pattern Recognition. 2021. p. 22–8.
27. Li G, Zhang K, Su Y, Wang J. Feature pre-inpainting enhanced transformer for video inpainting. Eng Appl Artif Intell. 2023;123:106323.
28. Zou B, Ji Z, Zhu C, Dai Y, Zhang W, Kui X. Multi-scale deformable transformer for multi-contrast knee MRI super-resolution. Biomed Signal Process Control. 2023;79:104154.
29. Vs V, Valanarasu J, Oza P, Patel V. Image fusion transformer. In: 2022 IEEE International Conference on Image Processing (ICIP). 2022. p. 3566–70.
30. Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J Autom Sinica. 2022;9(7):1200–17.
31. Tang W, He F, Liu Y, Duan Y, Si T. DATFuse: Infrared and visible image fusion via dual attention transformer. IEEE Trans Circuits Syst Video Technol. 2023;33(7):3159–72.
32. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer; 2015. p. 234–41.
33. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint 2014.
34. Li H, Li J, Wei H, Liu Z, Zhan Z, Ren Q. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J Real-Time Image Process. 2024;21(3):62.
35. Shaker A, Maaz M, Rasheed H, Khan S, Yang M, Khan F. SwiftFormer: efficient additive attention for transformer-based real-time mobile vision applications. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 17425–36.
36. Liu J, Lin R, Wu G, Liu R, Luo Z, Fan X. Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. Int J Comput Vision. 2024;132(5):1748–75.
37. Xu W, Wan Y. ELA: Efficient local attention for deep convolutional neural networks. arXiv preprint 2024.
38. Lin H, Cheng X, Wu X, Shen D. Cross attention in vision transformer. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2022. p. 1–6.
39. Parikh AP, Täckström O, Das D, Uszkoreit J. A decomposable attention model for natural language inference. arXiv preprint 2016.
40. Toet A. The TNO multiband image data collection. Data Brief. 2017;15:249–51. pmid:29034288
41. Xu H, Ma J, Jiang J, Guo X, Ling H. U2Fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell. 2022;44(1):502–18. pmid:32750838
42. Li H, Wu X, Kittler J. RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion. 2021;73:72–86.
43. Li H, Xu T, Wu X-J, Lu J, Kittler J. LRRNet: a novel representation learning guided fusion network for infrared and visible images. IEEE Trans Pattern Anal Mach Intell. 2023;45(9):11040–52. pmid:37074897
44. Zhao Z, Xu S, Zhang C, Liu J, Li P, Zhang J. DIDFuse: deep image decomposition for infrared and visible image fusion. arXiv preprint 2020.
45. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 5802–11.
46. Jocher G, Stoken A, Borovec J, Changyu L, Hogan A, Diaconu L, et al. ultralytics/yolov5: v3.1 - Bug fixes and performance improvements. Zenodo; 2020.
© 2025 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
To address the limitations of current infrared and visible light image fusion algorithms, namely insufficient feature extraction, loss of detailed texture information, underutilization of differential and shared information, and excessive model parameters, this paper proposes a novel multi-scale infrared and visible image fusion method with two-branch feature interaction. The proposed method introduces a lightweight multi-scale group convolution, based on GS convolution, which enhances multi-scale information interaction while reducing network parameters by incorporating group convolution and stacking multiple small convolutional kernels. The multi-level attention module is further improved by integrating edge-enhancement branches and depthwise separable convolutions to preserve detailed texture information. In addition, a lightweight cross-attention fusion module is introduced that exploits differential and shared features while minimizing computational complexity. Finally, efficient local attention is enhanced with a multi-dimensional fusion branch, which strengthens the interaction of information across dimensions and enables comprehensive spatial information extraction from multimodal images. The proposed algorithm was evaluated against seven others in extensive experiments on public datasets such as TNO and Roadscene. The results show that it outperforms the comparison algorithms in both subjective and objective evaluations while maintaining good operational efficiency, and target detection experiments on these datasets further confirm its superior performance.
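For readers who want a concrete picture of the components summarized above, the following PyTorch sketch illustrates two of the ideas in miniature: a lightweight multi-scale block built from grouped 3x3 convolutions (stacking two small kernels to approximate a larger receptive field with fewer parameters) and a simple cross-attention exchange between infrared and visible feature maps. The class names, channel sizes, and the use of torch.nn.MultiheadAttention are illustrative assumptions made for exposition only; this is not the authors' DMCM implementation.

```python
# Minimal sketch, assuming feature maps of shape (B, C, H, W); not the DMCM code.
import torch
import torch.nn as nn

class MultiScaleGroupConv(nn.Module):
    """Lightweight multi-scale block: grouped 3x3 convolutions, with two stacked
    3x3 kernels standing in for a 5x5 receptive field at lower parameter cost."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.branch5 = nn.Sequential(          # two 3x3 convs ~ one 5x5 kernel
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # 1x1 conv mixes branches

    def forward(self, x):
        return self.fuse(torch.cat([self.branch3(x), self.branch5(x)], dim=1))

class CrossAttentionFusion(nn.Module):
    """Toy cross-attention between modalities: each modality queries the other,
    and the two attended results are averaged into a fused feature map."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn_ir = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_vi = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, f_ir, f_vi):
        b, c, h, w = f_ir.shape
        ir = f_ir.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        vi = f_vi.flatten(2).transpose(1, 2)
        ir2vi, _ = self.attn_ir(ir, vi, vi)    # infrared queries visible features
        vi2ir, _ = self.attn_vi(vi, ir, ir)    # visible queries infrared features
        fused = 0.5 * (ir2vi + vi2ir)
        return fused.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    ir = torch.randn(1, 32, 64, 64)
    vi = torch.randn(1, 32, 64, 64)
    feat = MultiScaleGroupConv(32)
    fuse = CrossAttentionFusion(32)
    print(fuse(feat(ir), feat(vi)).shape)      # torch.Size([1, 32, 64, 64])
```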