Highlights
What are the main findings?
Our proposed UniFusOD method integrates infrared-visible image fusion and object detection into a unified, end-to-end framework, achieving superior performance across multiple tasks. The introduction of the Fine-Grained Region Attention (FRA) module and UnityGrad optimization significantly enhances the model's ability to handle multi-scale features and resolves gradient conflicts, improving both fusion and detection outcomes.
What are the implications of the main findings?
The unified optimization approach not only improves image fusion quality but also enhances downstream task performance, particularly in detecting rotated and small objects. It demonstrates significant robustness across various datasets, offering a promising solution for multimodal perception tasks in remote sensing and autonomous driving.
Abstract
Infrared-visible image fusion and object detection are crucial components in remote sensing applications, each offering unique advantages. Recent research has increasingly sought to combine these tasks to enhance object detection performance. However, their integration presents several challenges, primarily due to two overlooked issues: (i) existing infrared-visible image fusion methods often fail to adequately focus on fine-grained or dense information, and (ii) while joint optimization methods can improve fusion quality and downstream task performance, their multi-stage training processes often reduce efficiency and limit the network's global optimization capability. To address these challenges, we propose UniFusOD, an efficient end-to-end framework that simultaneously optimizes both infrared-visible image fusion and object detection. The method integrates Fine-Grained Region Attention (FRA) for region-specific attention operations at different granularities, enhancing the model's ability to capture complex information. Furthermore, UnityGrad is introduced to balance the gradient conflicts between the fusion and detection tasks, stabilizing the optimization process. Extensive experiments demonstrate the superiority and robustness of our approach. Not only does UniFusOD achieve excellent results in image fusion, but it also provides significant improvements in object detection performance. The method exhibits remarkable robustness across various tasks, achieving improvements of 0.8 and 1.9 mAP50 over state-of-the-art methods on the DroneVehicle dataset for rotated object detection and on the M3FD dataset for horizontal object detection, respectively.
Full text
1. Introduction
The rapid development of remote sensing satellite platforms has made the acquisition of vast amounts of data possible, significantly driving advancements in deep learning technologies [1]. However, the inherent limitations of individual sensor modalities present substantial challenges for achieving comprehensive visual perception. Visible images, with their high spatial resolution and rich color information, are adept at capturing texture and chromatic details. Nevertheless, they heavily depend on high-quality illumination conditions, and their performance deteriorates significantly in low-light environments, resulting in the loss of important information [2,3,4,5]. In contrast, infrared images, owing to their thermal imaging mechanism, are not dependent on ambient light, providing robust edge information even in dim lighting. They also offer certain advantages in terms of penetration and camouflage resistance, making them particularly useful for highlighting target contours. However, they fall short in representing fine textures [6,7,8]. Thus, relying solely on a single modality for object detection in remote sensing often leads to perceptual blind spots, limiting model performance and scene generalization.
To address these complementary deficiencies, Infrared-Visible Image Fusion has emerged as a critical research domain, with substantial applications in autonomous driving and remote sensing systems [9,10,11,12]. On the one hand, the task of infrared-visible image fusion aims to integrate complementary information from both types of images, creating a richer and more informative fused image. This enhanced image improves scene clarity and provides the necessary details relevant to the specific application scenario. Existing methods in infrared-visible fusion typically focus on feature-level fusion and alignment using deep learning techniques. These methods can be broadly classified into three categories: autoencoder (AE)-based approaches [4,13,14,15] enhance feature representation through reconstruction; generative adversarial network (GAN)-based approaches [8,16,17] constrain the fusion image distribution to align with original inputs, thus avoiding direct fusion weight learning; and unified models [18,19] employ cross-learning to address the lack of ground truth and training samples. However, these methods generally improve fusion performance from the perspectives of fusion weights or image distribution, with an emphasis on semantic information. They often overlook the fine-grained fusion of image details, which are crucial for downstream tasks, especially those requiring dense information. On the other hand, while fused images can provide high-quality inputs for higher-level perceptual tasks such as object detection and tracking [6,20,21], traditional infrared-visible fusion methods primarily focus on improving visual information quality without fully addressing the specific needs of downstream tasks. As a result, although fused images may exhibit high visual quality, they do not necessarily lead to significant improvements in perceptual accuracy or overall task performance in real-world applications [22].
To overcome these limitations, recent studies have explored the joint optimization of image fusion and high-level perception tasks such as object detection [20,21,22,23]. This approach aims to optimize both pixel-level and feature-level processes simultaneously, ensuring that enhancements in image fusion also improve downstream tasks like object detection. One of the primary advantages of joint optimization is its ability to leverage semantic information from object detection to guide the fusion process, making the fused image more effective for detection. Additionally, this optimization enables the fusion task itself to be more beneficial in enhancing object detection performance. However, despite its potential, several challenges remain, as illustrated in Figure 1. These challenges include the following: (1) Inefficient stepwise optimization: Most existing methods adopt a cascaded design, where image fusion and object detection networks are optimized in separate stages, as shown in Figure 1a–c. While this approach may offer improvements in both fusion and detection individually, it introduces inefficiencies due to the lack of integrated learning, making the process computationally expensive and complex. This stepwise optimization also poses challenges for real-time processing, as it does not leverage the potential for joint optimization that could reduce computational overhead. (2) Lack of focus on fine-grained or dense information: Existing feature fusion methods often fail to emphasize fine-grained or dense information, resulting in fused features that may not perform well in location-sensitive tasks such as detection and segmentation. This oversight limits the effectiveness of the fused features in tasks where precise spatial information is crucial. (3) Limited ability to find global optimal solutions: Multi-stage optimization methods often become trapped in local optima due to their stepwise nature. Furthermore, these methods typically connect tasks via the loss function, without structural interactions between them, limiting the optimization process’s ability to address the needs of both tasks simultaneously.
Overall, significant challenges remain in achieving synergistic optimization between image fusion and downstream tasks, particularly regarding efficient end-to-end optimization, for which effective solutions are still lacking. In this paper, we propose UniFusOD, a novel end-to-end framework that unifies image fusion and object detection (OD) tasks into a single, integrated optimization process, as illustrated in Figure 1d. By jointly optimizing these tasks, UniFusOD ensures that the fused image is not only visually enhanced but also optimized for downstream tasks like object detection, leveraging the complementary strengths of both modalities. This approach enhances feature fusion across various levels, improving visual perception capabilities. To enable the model to focus on important details at fine-grained levels, we introduce the Fine-Grained Region Attention (FRA) module. Inspired by the biological visual system, the FRA module allows the model to selectively attend to and distinguish key regions, thereby improving feature representation at both spatial and semantic levels. Additionally, we introduce UnityGrad, a novel optimization algorithm based on the Nash bargaining principle. UnityGrad resolves gradient conflicts between fusion and detection tasks, aligning their optimization directions and scales. This approach stabilizes the optimization process and enhances the efficiency and effectiveness of multimodal image fusion for object detection.
The main contributions of this paper are as follows:
(1) We present UniFusOD, an end-to-end multimodal image fusion detection framework that synchronously optimizes both image fusion and downstream tasks. This approach overcomes the inefficiencies and local optima issues associated with multi-stage optimization methods.
(2) The Fine-Grained Region Attention (FRA) module is designed to enhance the model’s ability to focus on and capture region-specific information at various levels of granularity. Inspired by biological visual systems, FRA improves feature representation by selectively attending to crucial regions, enabling the model to better capture and represent task-relevant information in complex multimodal images.
(3) We propose UnityGrad, inspired by the Nash bargaining principle, to resolve gradient conflicts between fusion and detection tasks. This novel approach harmonizes the optimization goals of both tasks, leading to a more balanced and efficient optimization process, ultimately stabilizing and improving model performance.
(4) Through extensive experiments on image fusion and object detection tasks, we demonstrate the effectiveness and robustness of our approach, achieving superior performance over traditional methods in both tasks.
2. Related Work
2.1. Multimodal Image Fusion and Object Detection
Deep learning has significantly advanced both low- and high-level visual tasks in remote sensing, particularly in image fusion and object detection, demonstrating great potential [6,24,25,26]. Early multimodal image fusion studies [7,27,28] mainly optimized fusion outcomes by adjusting network structures or loss functions, achieving good visual effects. From the perspective of network architecture design, many works adopted encoder-decoder frameworks to extract hierarchical features from source images—for instance, integrating residual blocks or dense connections to enhance feature propagation and avoid gradient vanishing, which is particularly effective for fusing heterogeneous modalities like visible and infrared images [29,30]. In terms of feature fusion strategies, researchers have explored multi-scale fusion mechanisms and attention-driven weight allocation to emphasize complementary information between modalities [31,32]. These feature-based fusion approaches allow for a more nuanced integration of information from different sources, potentially improving the overall quality of fused images. However, a significant limitation of feature-based fusion methods lies in their potential disconnection from the end-task performance, such as object detection. While these methods can enhance visual quality and detail in fused images, they may not always be optimized for downstream tasks, which require more task-specific feature integration [33]. For loss function optimization, pixel-level reconstruction losses and structural similarity (SSIM) loss were widely employed to constrain the fused image to be consistent with source images in pixel intensity and structural distribution [34,35]. However, these methods often overlook a crucial point: the primary goal of fusion is not just to improve visual quality but to enhance the performance of downstream tasks, such as object detection. Although high-quality fused images are visually impressive, they may not always meet the specific needs of practical applications [33].
Recent research has increasingly recognized that multimodal image fusion should not be an isolated task but closely integrated with downstream tasks like object detection, tracking, and segmentation. This has led to the development of joint optimization frameworks that combine image fusion with object detection. In these frameworks, fusion is not only aimed at generating visually pleasing images but also at improving downstream task performance. For example, Yuan et al. [26] pioneered cross-modal alignment to address airborne visible and infrared misalignment for rotated detection, establishing the foundational need to resolve modality discrepancies. Building on this, Liu et al. [22] introduced joint learning of fusion and detection with a novel loss function, directly incorporating detection-derived semantic and location information into fusion to simultaneously enhance both tasks. This approach improves fusion quality and detection performance by incorporating semantic and location information from the detection task into the fusion process. Finally, Liu et al. [36] generalized this interaction paradigm through a multi-interaction architecture, formalizing mutual task promotion beyond single-directional guidance to achieve bidirectional, task-aligned feature learning that collectively elevates fusion and detection performance.
Despite these advances, challenges remain. Object detection focuses on semantic understanding, while fusion and segmentation tasks emphasize pixel-level relationships, making the optimization of image fusion and object detection complex. A critical challenge is finding a balance that allows both tasks to mutually enhance each other [10,36,37]. Many methods still rely on cascade architectures, where separate modules are trained and inferred independently, resulting in high computational cost and inefficiency. Furthermore, efficiently integrating information from different modalities while removing redundant features remains a persistent challenge in multimodal fusion.
In conclusion, the integration of image fusion with object detection is a promising research area. By designing effective network architectures and loss functions, image fusion and object detection can mutually promote each other, enhancing the overall performance of multimodal image processing tasks. However, overcoming the optimization challenges requires exploring more efficient and flexible model architectures, particularly end-to-end optimization frameworks for joint inference of image fusion and object detection tasks.
2.2. Multitask Learning
Multitask Learning (MTL) is a technique that improves learning efficiency by simultaneously addressing multiple tasks and sharing information between them [38,39]. This information sharing is typically achieved through a shared hidden representation [40,41,42]. However, the optimization process in multitask learning presents several challenges, such as gradient conflicts between tasks [43,44] and plateau effects in the loss function [45], which complicate the optimization process.
To overcome these challenges, various architectures and methods have been proposed [46,47,48]. Some approaches focus on optimizing the training process by adjusting the gradients of tasks through weighting. For example, some studies weight the loss functions based on task uncertainty [49], gradient norms [50], stochastic weights [51], or gradient similarity [52,53]. However, these methods are predominantly heuristic and may lead to performance instability in practical applications [54]. Additionally, other methods employ techniques such as Neural Architecture Search (NAS) [55,56] or routing networks [57] to automatically discover shared patterns and determine network architectures. While effective in some cases, these approaches come with significant computational overhead.
Recently, there has been growing interest in multi-objective optimization based on the Multi-Gradient Descent Algorithm (MGDA) [58]. Under certain conditions, MGDA guarantees convergence to a Pareto stable point, making it a promising optimization strategy. Hotegni et al. [59] framed the multi-objective optimization problem as a multitask learning problem and introduced a task-weighting approach based on the Frank-Wolfe algorithm [60]. Liu et al. [54] proposed a method that maximizes the worst-case improvement by searching for the optimal update direction within the neighborhood of the average gradient. Liu [51] further developed a method to find a fair gradient direction by ensuring equal cosine similarity of gradients across all tasks. While this approach satisfies all the requirements of Nash axioms, it does not guarantee a Pareto optimal solution. Therefore, mitigating gradient conflicts in multitask learning remains a critical challenge.
3. Methodology
In this section, we introduce UniFusOD, a unified end-to-end framework that simultaneously addresses infrared-visible image fusion and object detection. Specifically, in Section 3.1, we formalize the joint fusion and detection task as an end-to-end optimization problem, aiming to simultaneously improve both visual quality and detection performance. Then, in Section 3.2, we present the overall framework, which integrates a shared backbone, a Fine-Grained Region Attention (FRA) module, and task-specific heads for fusion and detection. The entire model is trained end-to-end using the UnityGrad method, which harmonizes gradients from both tasks to enable stable and balanced multi-task optimization. In Section 3.3, we introduce the FRA mechanism designed to enhance region-level feature representation by focusing on important areas across multiple scales. Next, Section 3.4 details the task heads and their corresponding loss functions used to guide the model toward generating semantically meaningful fused images and accurate object detection. Finally, in Section 3.5, we propose UnityGrad, a gradient harmonization strategy that mitigates optimization conflicts between tasks, enabling more stable and effective end-to-end learning.
3.1. Problem Formulation
Assuming the visible image is denoted as $I_{vis}$ and the infrared image as $I_{ir}$, the optimization problem can be formulated as follows:
Here, $I_f$ represents the fused image, and $F(\cdot;\theta_F)$ and $D(\cdot;\theta_D)$ are the fusion network and detection network controlled by parameters $\theta_F$ and $\theta_D$, respectively.
To avoid optimization difficulties, most methods separately train the fusion network and detection network at different stages. However, this approach makes it challenging to find the global optimal solution. To enable end-to-end optimization, we reformulate the problem into the following optimization problem:
Here, $\theta_s$ represents the parameters shared by both the detection network and the fusion network, such as the backbone parameters. $\theta_d$ and $\theta_f$ are the specific parameters of the detection and fusion networks, respectively, and $\lambda$ is a balancing coefficient. To ensure the stability of the optimization process, we jointly optimize the losses $\mathcal{L}_{det}$ and $\mathcal{L}_{fus}$, and $R(\cdot)$ is a regularization term applied to the parameters. The regularization constraints are implemented using the UnityGrad method, which adjusts the gradients during optimization.
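To make the two formulations concrete, a minimal LaTeX sketch is given below. It assumes the standard cascaded form for the first problem and the shared-parameter joint form implied by the symbol definitions above; the exact terms and weights of the original formulation may differ.

```latex
% Sketch of the two optimization problems (assumed forms, notation as defined above).
\begin{align}
  % Cascaded formulation: fuse first, then detect on the fused image I_f.
  &\min_{\theta_F,\,\theta_D}\;
      \mathcal{L}_{det}\big(D(I_f;\theta_D)\big)
      + \mathcal{L}_{fus}\big(I_f, I_{vis}, I_{ir}\big),
   \qquad \text{s.t. } I_f = F(I_{vis}, I_{ir};\theta_F), \\
  % End-to-end reformulation with shared parameters \theta_s and regularizer R.
  &\min_{\theta_s,\,\theta_d,\,\theta_f}\;
      \mathcal{L}_{det}(\theta_s,\theta_d)
      + \lambda\,\mathcal{L}_{fus}(\theta_s,\theta_f)
      + R(\theta_s,\theta_d,\theta_f).
\end{align}
```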
3.2. Overall Architecture
The overall framework, as shown in Figure 2, consists of three components: the Backbone, the Fine-grained Region Attention (FRA), and task-specific heads. Moreover, to mitigate gradient conflicts between different tasks, the UnityGrad method is used during parameter updates to compute more stable gradients, thus stabilizing the multi-task optimization process.
For the visible image $I_{vis}$ and the infrared image $I_{ir}$, the backbone network is first employed to extract features $F_{vis}^{l}$ and $F_{ir}^{l}$ for each modality. To save memory and computational resources, the backbone is assumed to have shared parameters. The features extracted from each block of the backbone are then summed along the channel dimension, producing the mixed-modal features $\{X_l\}_{l=1}^{L}$ from the L blocks.
To enhance the model’s ability to perceive features from different regions, we propose a fine-grained region attention mechanism, which progressively extracts region and object information across different scales, improving the feature representation capacity. Finally, a lightweight task head is used to generate the fused image, ensuring it exhibits both high visual quality and strong semantic information. At the same time, a detection head is employed for object detection, ensuring that the learned representations effectively balance visual quality and detection task accuracy, making the model suitable for various perception tasks in multimodal scenarios.
Finally, the proposed UnityGrad method is used to modulate the gradients propagated from different tasks, solving for new update gradients to ensure stable optimization across tasks.
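The following minimal PyTorch-style sketch illustrates how the components described above fit together in one forward pass. The module names (backbone, fra, fusion_head, det_head) and the per-level summation of modality features are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def unifusod_forward(backbone, fra, fusion_head, det_head, vis, ir):
    """Illustrative forward pass of a UniFusOD-style pipeline.

    vis, ir: batched visible and infrared images of the same spatial size.
    The backbone is shared across modalities; its L block outputs are summed
    per level to form mixed-modal features, which the FRA module re-weights
    before the task-specific heads consume them.
    """
    feats_vis = backbone(vis)                     # list of L multi-scale feature maps
    feats_ir = backbone(ir)                       # same backbone, shared weights
    mixed = [fv + fi for fv, fi in zip(feats_vis, feats_ir)]    # per-level summation
    enhanced = [fra(x) for x in mixed]            # fine-grained region attention
    fused_img = fusion_head(enhanced, out_size=vis.shape[-2:])  # reconstructed fused image
    detections = det_head(enhanced)               # detection predictions
    return fused_img, detections
```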
3.3. Fine-Grained Region Attention
In multimodal perception tasks, models must process information across multiple scales and feature representations. This requires feature extractors to adaptively capture region-specific information at various levels of granularity. However, traditional convolutional neural networks (CNNs) typically use fixed-size convolution kernels, limiting their ability to handle regions with fine-grained precision. In biological visual systems, the ability of neurons to selectively focus on regions at different scales is a key feature of visual perception. Based on this principle, we propose a fine-grained region attention mechanism (FRA), which improves the model’s ability to focus on and distinguish important regions by enabling more effective feature representation across multiple spatial and semantic levels. By integrating region-level attention mechanisms, FRA improves the model’s capacity to capture and represent crucial region-specific information in complex images. The structure of the FRA module is illustrated in Figure 3, providing a detailed view of how attention is applied across different regions.
The input to the FRA module consists of multi-scale feature maps $\{X_l\}_{l=1}^{L}$ extracted by the backbone. Each $X_l$ represents mixed features from the visible and infrared images at a different scale. Here, L denotes the number of feature maps, and $X_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ represents the feature map at layer l, where $H_l$ and $W_l$ are the height and width of the feature map, and $C_l$ is the number of channels.
These multi-scale feature maps are extracted from different layers of the backbone, each representing information at a different granularity. Thus, the maps contain multi-scale information from distinct regions of the image. To improve the network's ability to capture region-specific features, we apply region-level operations to these feature maps. Specifically, we perform K convolution operations $\{C_k\}_{k=1}^{K}$ on each feature map using different dilation factors and kernel sizes to generate initial region-specific attention maps. These attention maps are then aggregated to obtain the final attention maps. The calculation of the region-level attention maps is as follows:
$$A_k = C_k(X_l), \quad k = 1, \dots, K$$
where $C_k$ represents the k-th convolution operation with dilation factor $d_k$. $A_k \in \mathbb{R}^{M \times H_l \times W_l}$ represents the region-level attention maps generated by the k-th convolution operation, consisting of M attention masks, each with spatial dimensions $H_l \times W_l$ but focusing on different regions of the image. Considering the role of global information, we integrate the global features into the region-specific attention maps to refine them. We apply global pooling to the input feature map $X_l$ to extract its global features, compressing each channel into a scalar to form a global feature vector $g \in \mathbb{R}^{C_l}$, which represents the global context of the image:
$$g_c = \frac{1}{H_l W_l} \sum_{i=1}^{H_l} \sum_{j=1}^{W_l} X_l(i, j, c), \quad c = 1, \dots, C_l$$
This global feature vector is then passed through a feed-forward network (FFN), which generates a weight matrix $W \in \mathbb{R}^{K \times M}$. Each row of $W$ represents the weights for the M regions produced by the k-th convolution operation. To normalize the weights, we apply softmax along the M-dimension, resulting in $\hat{W}$, which represents the contribution of each operation to the M regions:
$$\hat{W}_{k,m} = \frac{\exp\!\big(W_{k,m}\big)}{\sum_{m'=1}^{M} \exp\!\big(W_{k,m'}\big)}$$
The softmax operation ensures that the attention weights are normalized, allowing the coefficients to reflect the relative importance of each region.
Using the weighting coefficients predicted from the global information, we compute a weighted sum of the initial attention maps to obtain the combined attention maps $A$:
$$A^{(m)} = \sum_{k=1}^{K} \hat{W}_{k,m}\, A_k^{(m)}, \quad m = 1, \dots, M$$
The combined maps effectively capture region-specific features by aggregating the attention from different dilation factors and kernel sizes, which focus on various spatial scales and receptive fields.
Next, we apply the Sigmoid activation function to $A$ to normalize it:
$$\hat{A} = \sigma(A)$$
where $\sigma$ represents the Sigmoid activation function, which normalizes the attention weights for each region, ensuring a balanced distribution across all regions. The attention map $\hat{A}$ contains M region masks, where each mask represents the attention distribution for a specific region of the image. Thus, the region attention map can be represented as
$$\hat{A} = \big\{\hat{A}^{(1)}, \hat{A}^{(2)}, \dots, \hat{A}^{(M)}\big\}$$
Using the region attention maps $\hat{A}$, we apply pixel-wise weighting to the original feature map $X_l$, enhancing the features of the important regions. In this process, each region mask $\hat{A}^{(m)}$ is pixel-wise multiplied with the corresponding region of the input feature map $X_l$, resulting in the weighted region feature map $\tilde{X}_l$:
$$\tilde{X}_l^{(m)}(i, j) = \hat{A}^{(m)}(i, j) \odot X_l(i, j)$$
where ⊙ represents the pixel-wise multiplication operation and $(i, j)$ denotes the spatial location. By performing this operation, each region mask weights the corresponding region in the feature map $X_l$. Regions with higher attention weights are amplified, while those with lower weights are suppressed. This process enhances the important regions and effectively reduces the impact of irrelevant or less important regions, refining the overall feature map representation. Using this approach, we apply region attention to the multi-scale feature maps extracted by the backbone, producing region-enhanced features $\{R_l\}_{l=1}^{L}$. These enhanced features are then used for the final detection and fusion tasks.
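To make the mechanism concrete, the following PyTorch sketch implements a single-scale fine-grained region attention block. The number of regions M, the kernel sizes and dilations, the FFN width, and the final aggregation of the M weighted masks (a simple sum) are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedRegionAttention(nn.Module):
    """Illustrative FRA block: K dilated convs produce M region attention masks,
    which are re-weighted by a global context vector and applied to the input map."""

    def __init__(self, channels, num_regions=8,
                 kernel_sizes=(3, 5, 7), dilations=(1, 1, 1)):
        super().__init__()
        self.num_regions = num_regions
        # K region-attention generators with different receptive fields
        self.region_convs = nn.ModuleList([
            nn.Conv2d(channels, num_regions, k, padding=d * (k // 2), dilation=d)
            for k, d in zip(kernel_sizes, dilations)
        ])
        # FFN mapping the global feature vector to K x M mixing weights
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, len(kernel_sizes) * num_regions),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        b, _, _, _ = x.shape
        # K sets of initial region attention maps, stacked to (B, K, M, H, W)
        maps = torch.stack([conv(x) for conv in self.region_convs], dim=1)
        # Global context vector via global average pooling
        g = F.adaptive_avg_pool2d(x, 1).flatten(1)                          # (B, C)
        w = self.ffn(g).view(b, len(self.region_convs), self.num_regions)   # (B, K, M)
        w = F.softmax(w, dim=-1)                                            # normalize over M
        # Weighted sum over the K operators, then sigmoid normalization
        attn = torch.sigmoid((w[..., None, None] * maps).sum(dim=1))        # (B, M, H, W)
        # Aggregate the M region masks (assumed: sum) and re-weight the input features
        return x * attn.sum(dim=1, keepdim=True)
```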
3.4. Detection and Fusion Heads
In multimodal perception tasks, besides the feature extraction module, the task heads play a crucial role in both object detection and image fusion. We design two distinct task heads for object detection and image fusion, and optimize them using appropriate loss functions.
As shown in Figure 4, the image fusion task head aims to restore the multi-scale region-level feature maps $\{R_l\}_{l=1}^{L}$ from the FRA module to the same spatial dimensions as the original input images. It then reconstructs the fused image. First, for each $R_l$, we use an upsampling operation to resize it to match the dimensions of the input image. This is typically done using bilinear interpolation. After upsampling, the feature maps align with the input image spatially, preserving spatial consistency during fusion. We then sum all the upsampled feature maps to fuse information from different scales:
$$F = \sum_{l=1}^{L} \mathrm{Up}\big(R_l\big)$$
To reduce computational complexity and prepare the feature maps for the next step, we apply a convolution layer to decrease the number of channels in the fused feature map. This operation compresses the feature map’s channel dimension, making it suitable for reconstruction. After channel reduction, we process the feature map with five consecutive convolution layers, followed by ReLU activations, to progressively reconstruct the fused image. Each convolution operation helps recover image details by learning convolutional kernels.
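A minimal sketch of such a reconstruction head is shown below. The channel widths, the 1×1 reduction, the 3×3 kernels, the final sigmoid, and the assumption that all enhanced maps share one channel count are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionHead(nn.Module):
    """Illustrative fusion head: upsample multi-scale features to the input
    resolution, sum them, reduce channels, and reconstruct the fused image."""

    def __init__(self, in_channels, mid_channels=64, out_channels=1):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        layers = []
        for _ in range(4):                       # four intermediate 3x3 conv + ReLU blocks
            layers += [nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(True)]
        layers += [nn.Conv2d(mid_channels, out_channels, 3, padding=1)]  # fifth conv outputs the image
        self.reconstruct = nn.Sequential(*layers)

    def forward(self, feats, out_size):
        # feats: list of L region-enhanced maps, assumed to share one channel count
        up = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
              for f in feats]
        fused = torch.stack(up, dim=0).sum(dim=0)     # sum information across scales
        return torch.sigmoid(self.reconstruct(self.reduce(fused)))
```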
To optimize the fusion quality, we use the Structural Similarity Index (SSIM) and a Laplacian second-order gradient loss. SSIM is effective at evaluating structural and perceptual image quality, and the corresponding loss is computed as
$$\mathcal{L}_{ssim} = 1 - \tfrac{1}{2}\big[\mathrm{SSIM}\big(I_f, I_{vis}\big) + \mathrm{SSIM}\big(I_f, I_{ir}\big)\big]$$
where $I_f$ is the fused image, and $I_{vis}$ and $I_{ir}$ are the source images. The Laplacian operator captures second-order texture details, enhancing edges and fine features such as high-frequency textures. The gradient loss is defined as the difference between the Laplacian responses of the fused image and the source images:
$$\mathcal{L}_{grad} = \sum_{k}\Big(\big\|\nabla^2_k I_f - \nabla^2_k I_{vis}\big\|_1 + \big\|\nabla^2_k I_f - \nabla^2_k I_{ir}\big\|_1\Big)$$
where $\nabla^2_k$ is the Laplacian operator calculated using different Gaussian kernel sizes k, $I_f$ is the fused image, and $I_{vis}$ and $I_{ir}$ are the source images. The final image fusion loss function is
$$\mathcal{L}_{fus} = \alpha\,\mathcal{L}_{ssim} + \beta\,\mathcal{L}_{grad}$$
where $\mathcal{L}_{ssim}$ is the SSIM-based structural loss, $\mathcal{L}_{grad}$ is the Laplacian gradient loss, and $\alpha$ and $\beta$ are balancing coefficients controlling the importance of each term. The object detection task head consists of regression and classification branches, aiming to detect target objects through accurate regression and classification. We use classic loss functions such as Smooth L1 loss and Focal loss, which have demonstrated strong performance in detection tasks:
$$\mathcal{L}_{det} = \gamma_1\,\mathcal{L}_{SmoothL1} + \gamma_2\,\mathcal{L}_{Focal}$$
where $\gamma_1$ and $\gamma_2$ are balancing coefficients that adjust the relative importance of the regression and focal losses. Therefore, the final total loss function is
$$\mathcal{L}_{total} = \mathcal{L}_{det} + \lambda\,\mathcal{L}_{fus}$$
where $\lambda$ is a balancing coefficient that controls the trade-off between the object detection loss $\mathcal{L}_{det}$ and the image fusion loss $\mathcal{L}_{fus}$, allowing the model to effectively learn both tasks simultaneously.
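The loss terms above can be sketched as follows. The 3×3 Laplacian kernel, the equal default weights, the single-channel (grayscale) inputs, and the external ssim_fn (e.g., from the pytorch-msssim package) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel; larger Gaussian-smoothed (LoG) variants could be added
# to realize the sum over multiple kernel sizes described in the text.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)


def laplacian(img):
    """Second-order gradient response of a single-channel image batch (B, 1, H, W)."""
    return F.conv2d(img, _LAPLACIAN.to(img.device, img.dtype), padding=1)


def fusion_loss(fused, vis, ir, ssim_fn, alpha=1.0, beta=1.0):
    """Illustrative L_fus = alpha * L_ssim + beta * L_grad.

    ssim_fn(x, y) should return a similarity score in [0, 1], e.g. pytorch_msssim.ssim
    or a hand-rolled SSIM; alpha and beta are placeholder weights.
    """
    l_ssim = 1.0 - 0.5 * (ssim_fn(fused, vis) + ssim_fn(fused, ir))
    l_grad = (laplacian(fused) - laplacian(vis)).abs().mean() + \
             (laplacian(fused) - laplacian(ir)).abs().mean()
    return alpha * l_ssim + beta * l_grad


def total_loss(det_loss, fus_loss, lam=1.0):
    """L_total = L_det + lambda * L_fus, where the detection loss itself is a
    weighted sum of Smooth-L1 (regression) and Focal (classification) terms."""
    return det_loss + lam * fus_loss
```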
3.5. UnityGrad
In image fusion and object detection tasks, the optimization objectives often conflict in both gradient direction and magnitude. Image fusion aims to generate high-quality fused images, while object detection focuses on improving detection accuracy. When these tasks share parameters $\theta_s$, conflicting gradients can lead to suboptimal updates and degraded overall performance. To address this challenge effectively, we propose UnityGrad, a principled approach that unifies conflicting gradients through cooperative bargaining.
Let K denote the number of tasks and let $g_i = \nabla_{\theta_s}\mathcal{L}_i$ be the gradient of the i-th task's loss with respect to the shared parameters $\theta_s$. While we primarily focus on image fusion ($i = 1$) and object detection ($i = 2$) in this paper, the UnityGrad formulation generalizes to any number of tasks.
Given the current parameters $\theta_s$, we search for an update vector $\Delta\theta$ within the ball of radius $\epsilon$ centered at zero, denoted as $B_\epsilon$. The key insight of UnityGrad is to formulate this as a cooperative bargaining problem in which the agreement set is $B_\epsilon$ and the disagreement point is 0, which represents making no parameter update [61]. For each task, we define a utility function $u_i(\Delta\theta) = g_i^{\top}\Delta\theta$, representing how beneficial the update direction $\Delta\theta$ is for that task [62].
Our main assumption is that when $\theta_s$ is not at a Pareto stationary point, the task gradients $\{g_i\}_{i=1}^{K}$ are linearly independent. This ensures that the disagreement point (no update) is dominated by some point in $B_\epsilon$ that benefits all tasks.
The core of UnityGrad is the following optimization objective:
$$\max_{\Delta\theta \in B_\epsilon} \; \sum_{i=1}^{K} \log\big(g_i^{\top} \Delta\theta\big)$$
This logarithmic objective is derived from the Nash Bargaining Solution in cooperative game theory, which maximizes the product of utility gains. Taking the logarithm transforms this product into a sum while preserving the solution’s properties. The logarithmic formulation is particularly important as it ensures scale-invariance across tasks with different gradient magnitudes, preventing any single task from dominating the optimization process. We can characterize the solution to this logarithmic optimization problem as follows:
Let G be the matrix whose columns are the gradients $g_1, \dots, g_K$. The solution to our optimization problem is (up to scaling) $\Delta\theta = G\alpha$, where $\alpha \in \mathbb{R}_{+}^{K}$ is the solution to $G^{\top}G\,\alpha = 1/\alpha$, with $1/\alpha$ representing the element-wise reciprocal.
To derive this result, we analyze the gradient of our objective function, which takes the form $\sum_{i=1}^{K} g_i / (g_i^{\top} \Delta\theta)$. We observe that for any vector $\Delta\theta$ satisfying $g_i^{\top} \Delta\theta > 0$ for all i, the utility functions increase monotonically with the norm of $\Delta\theta$. This, combined with the Pareto optimality characteristic inherent in bargaining solutions [61], necessitates that the optimal point must lie on the boundary of $B_\epsilon$. Consequently, at the optimal solution, the gradient must align with the radial direction. Mathematically, this means $\sum_{i=1}^{K} g_i / (g_i^{\top} \Delta\theta) = \mu\,\Delta\theta$ for some scalar $\mu$. Given the independence of the gradients, we can express $\Delta\theta$ as a linear combination $\Delta\theta = \sum_{i=1}^{K} \alpha_i g_i = G\alpha$, where each $\alpha_i > 0$. This yields the condition $1 / (g_i^{\top} \Delta\theta) = \mu\,\alpha_i$, which can be rearranged as $g_i^{\top} G\alpha = 1 / (\mu\,\alpha_i)$. Since we require $g_i^{\top} \Delta\theta > 0$ for descent directions, it follows that $\mu > 0$. For simplicity, we set $\mu = 1$ to determine the direction of $\Delta\theta$ (noting that its magnitude might exceed $\epsilon$). The bargaining problem thus reduces to finding coefficients $\alpha$ with positive components such that $g_i^{\top} G\alpha = 1/\alpha_i$ for all i. This can be elegantly expressed in matrix form as $G^{\top}G\,\alpha = 1/\alpha$, where $1/\alpha$ denotes the element-wise reciprocal vector. □
To solve the equation $G^{\top}G\,\alpha = 1/\alpha$ efficiently, we employ an iterative approach. We initialize $\alpha^{(0)}$ with uniform weights and use a fixed-point iteration:
$$\alpha^{(t+1)} = \sqrt{\alpha^{(t)} \oslash \big(G^{\top}G\,\alpha^{(t)}\big)}$$
where the square root and division ($\oslash$) operations are applied element-wise. This iteration continues until $\big\|\alpha^{(t+1)} - \alpha^{(t)}\big\| < \tau$ for a small threshold $\tau$, typically requiring only a few iterations to achieve good convergence. Through this iterative optimization, UnityGrad converges to a Pareto stationary point, where the gradients of all tasks are balanced relative to each other. This ensures that no task's loss can be further reduced without increasing another task's loss, achieving true unity in the optimization process. The complete UnityGrad algorithm is summarized in Algorithm 1.
| Algorithm 1 UnityGrad |
| Input: Initial shared parameters $\theta_s$; differentiable losses $\{\mathcal{L}_i\}_{i=1}^{K}$; learning rate $\eta$; total steps T |
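A compact sketch of the fixed-point solver described above is given below. The clamping constant, iteration count, and stopping threshold are illustrative, and this is not the authors' released code.

```python
import torch


def unity_grad(grads, num_iters=20, tol=1e-4, eps=1e-8):
    """Combine K task gradients into a single update direction.

    grads: list of K flattened gradient tensors g_i of identical length.
    Solves G^T G alpha = 1/alpha via the element-wise fixed-point iteration
    alpha <- sqrt(alpha / (G^T G alpha)) and returns delta = G @ alpha.
    """
    G = torch.stack(grads, dim=1)                # (P, K), columns are task gradients
    gram = G.t() @ G                             # (K, K) Gram matrix
    k = gram.shape[0]
    alpha = torch.full((k,), 1.0 / k, device=G.device, dtype=G.dtype)
    for _ in range(num_iters):
        denom = torch.clamp(gram @ alpha, min=eps)   # keep the sqrt argument positive
        new_alpha = torch.sqrt(alpha / denom)
        if torch.norm(new_alpha - alpha) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return G @ alpha                             # unified update direction (up to scaling)


# Usage sketch for two tasks sharing parameters `shared` (a list of tensors):
#   backprop L_fus alone and collect g_fus = torch.cat([p.grad.flatten() for p in shared]),
#   zero the grads, backprop L_det alone and collect g_det the same way, then
#   delta = unity_grad([g_fus, g_det]) gives the unified update; apply -lr * delta
#   to the shared parameters.
```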
4. Experiment
4.1. Datasets and Evaluation Criteria
4.1.1. Introduction to Experimental Datasets
We conducted validation on four publicly available infrared and visible image datasets, which are as follows: M3FD [22], DroneVehicle [63], RoadScene [18], and TNO [64]. Specifically, all datasets consist of co-registered infrared and visible image pairs, which are acquired concurrently with aligned sensors to ensure spatial and temporal consistency between modalities. The selection of these datasets allows us to comprehensively evaluate both image fusion and object detection performance under diverse conditions, reflecting the robustness of our approach across various challenges. The M3FD dataset was used to evaluate both detection and image fusion performance [22], while RoadScene and TNO datasets were used for evaluating image fusion performance [65]. The DroneVehicle dataset was used to assess performance in detecting rotated objects [42].
The M3FD dataset includes 4200 pairs of high-resolution aligned infrared and visible light images. The dataset covers a variety of scenes and is categorized into four different types: daytime, overcast, nighttime, and challenging conditions. Additionally, the M3FD dataset annotates a total of 33,603 objects across six categories: people, cars, buses, motorcycles, trucks, and lights. This makes it suitable for evaluating both object detection and fusion tasks.
The DroneVehicle dataset consists of a total of 56,878 images collected by drones, with half of the images being RGB and the other half infrared. The dataset provides detailed annotations for five categories: cars, trucks, buses, vans, and cargo trucks, using rotated bounding boxes for annotation. This increases the evaluation standard for the model’s detection capabilities and is well-suited for evaluating multimodal object detection performance.
The RoadScene dataset, created in 2020, is based on road scenes and includes paired infrared and visible light images. It contains 221 pairs of aligned images, covering a rich set of road scenes such as bicycles, cars, pedestrians, and traffic lights. These image pairs were extracted from FLIR video footage and have been denoised and rigorously aligned. With a large number of high-resolution images, the RoadScene dataset is suitable for evaluating image fusion tasks.
The TNO dataset is a commonly used dataset in the field of infrared and visible image fusion. It includes a large collection of multispectral images from various military-related scenes, such as enhanced visual images, near-infrared images, long-wave infrared images, and thermal radiation images, collected by the Netherlands Organization for Applied Scientific Research. Unlike the MSRS and RoadScene datasets, the visible light images in the TNO dataset are single-channel images. It contains night-time images of multi-band military scenes, with a total of 60 pairs of infrared and visible light images.
4.1.2. Evaluation Criteria
To comprehensively evaluate the performance of the model, we used five evaluation metrics for the image fusion task: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD). The object detection task was assessed using mean Average Precision (mAP).
EN measures the information richness of the fused image. A higher entropy value indicates that the fused image contains more information. The entropy is calculated as
$$EN = -\sum_{n=0}^{N-1} p_n \log_2 p_n$$
where N is the number of gray levels in the fused image, and $p_n$ is the proportion of pixels at gray level n in the fused image. SSIM is a metric used to assess image quality, particularly to measure the structural similarity between the fused image and the reference image. The SSIM is calculated as
$$\mathrm{SSIM}(x, y) = \frac{\big(2\mu_x\mu_y + C_1\big)\big(2\sigma_{xy} + C_2\big)}{\big(\mu_x^2 + \mu_y^2 + C_1\big)\big(\sigma_x^2 + \sigma_y^2 + C_2\big)}$$
where $\mu_x$ and $\mu_y$ are the mean values of images x and y, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are constants to avoid division by zero. SSIM values range from −1 to 1, with 1 indicating perfect similarity. The closer the value is to 1, the more similar the structural information between the images.
MI quantifies how much information is retained in the fused image from the source images. In information theory, MI is used to measure the dependence between two random variables. The MI is calculated as
$$MI = H(A) + H(B) - H(A, B)$$
where $H(A)$ and $H(B)$ are the entropies of images A and B, and $H(A, B)$ is their joint entropy. VIF evaluates the image quality by quantifying the consistency between the image content and human visual perception. Unlike traditional pixel-based metrics (such as MSE or PSNR), VIF considers the perceptual quality by accounting for how the human eye is more sensitive to certain frequencies. The VIF calculation involves several steps: first, the image is decomposed using filters (such as Gaussian filters) to generate multi-scale representations. Then, information is calculated for each scale, with higher weights given to low-frequency components due to human sensitivity. Finally, the VIF is computed by combining the information from all scales:
$$VIF = \frac{\sum_{i=1}^{M} I_i\big(\text{fused};\, \sigma_{n,i}^2\big)}{\sum_{i=1}^{M} I_i\big(\text{source};\, \sigma_{n,i}^2\big)}$$
where $I_i(\cdot\,;\sigma_{n,i}^2)$ represents the information content at the i-th scale, $\sigma_{n,i}^2$ is the noise variance at that scale, and M is the number of scales. A higher VIF indicates better image quality. SD measures the degree of variation in the pixel values of the image. A higher SD indicates that the image has more distinct details and richer textures. The SD is calculated as
$$SD = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(x_i - \mu\big)^2}$$
where $x_i$ is the pixel value of the image, $\mu$ is the mean of the pixel values, and N is the total number of pixels. For evaluating object detection performance, we use Precision, Recall, mAP0.5, and mAP0.5:0.95 as the evaluation metrics. These metrics are derived from the counts of true positives (TP), false positives (FP), and false negatives (FN), as well as the Intersection over Union (IoU) between predicted and ground-truth bounding boxes.
Precision is the ratio of correctly predicted positive samples to all detected samples, calculated as
$$Precision = \frac{TP}{TP + FP}$$
Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples, calculated as
$$Recall = \frac{TP}{TP + FN}$$
Average Precision (AP) is the area under the precision-recall curve, calculated as
$$AP = \int_{0}^{1} P(R)\, dR$$
Mean Average Precision (mAP) is the average of the AP values over all classes, calculated as
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where $AP_i$ is the AP value for class i, and N is the number of classes in the dataset. mAP0.5 indicates the average precision when the Intersection over Union threshold is set to 0.5. mAP0.5:0.95 represents the mean average precision across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
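For reference, the fusion metrics EN, SD, and MI can be computed from grayscale images with a few lines of NumPy, as in the generic sketch below; it follows the standard definitions above and is not the paper's exact evaluation code.

```python
import numpy as np


def entropy(img, levels=256):
    """Shannon entropy (EN) of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def std_dev(img):
    """Standard deviation (SD) of pixel intensities."""
    return float(np.std(img))


def mutual_information(a, b, levels=256):
    """Mutual information MI = H(A) + H(B) - H(A, B) between two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(),
                                 bins=levels, range=[[0, levels], [0, levels]])
    p_joint = joint / joint.sum()
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return h(p_a) + h(p_b) - h(p_joint)
```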
4.2. Implementation Details
All experiments were conducted on an NVIDIA RTX 4090 GPU, using the MMDetection framework for multimodal image fusion and detection. During training, the Adam optimizer was applied with an initial learning rate that decayed every 10 epochs. We employed the UniFusOD structure, shown in Figure 2, for end-to-end training, utilizing both the fusion and detection losses. The balancing hyperparameters in the detection and fusion heads were set to fixed values chosen to balance the fusion and detection tasks. For the M3FD dataset, a 2:8 training-to-testing split was used to ensure stability during training and reliability in evaluation. All images were resized and underwent essential data augmentation, including random cropping and flipping, to enhance the model's generalization ability. During inference, performance was evaluated separately for fusion and detection, with adjustments made based on the characteristics of each dataset. This allows for a more targeted assessment of the model's strengths in different aspects of multimodal processing.
The experimental results demonstrate that the proposed method achieves strong performance across multiple tasks and datasets, highlighting the potential of this algorithm in the fields of multimodal image fusion and object detection.
5. Results
5.1. Results on Infrared-Visible Image Fusion
To validate the competitiveness of our algorithm, we compared it with ten other methods: DIDFuse [7], FusionGAN [5], SDNet [2], U2Fusion [18], TarDAL [22], RFN-Nest [13], DenseFuse [4], CDDFuse [65], AMDANet [66] and MMIF-INet [67]. We evaluated performance using five metrics: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD), with detailed descriptions provided in Section 4.1.2. The experimental results show that, even with a simple and direct feature extraction approach, our method outperforms the others on several metrics.
On the TNO dataset, as shown in Table 1, our method achieves higher EN and MI values, 7.44 and 2.00, respectively, significantly surpassing the alternatives. This indicates that our method preserves more image details and information. Our method also outperforms the others in SSIM with a score of 1.07, indicating that the fused images retain the best structural resemblance to the originals. While the SD of our method is slightly lower than that of CDDFuse, it maintains higher stability, balancing brightness variations and avoiding overprocessing.
On the RoadScene dataset, the results in Table 2 indicate that, although the SSIM of our method is slightly lower than that of some methods, it still excels in EN and SD, especially with an SD of 59.48, reflecting superior brightness and contrast retention. Additionally, our MI value of 1.96 is comparable to that of other methods, demonstrating strong information preservation.
For the M3FD dataset, as shown in Table 3, our method achieves the highest MI and EN values, 1.40 and 6.60, respectively, showing a clear advantage in information and detail retention. Although VIF and SSIM are slightly lower than some methods, the overall fusion quality remains high, particularly in terms of detail and information fidelity.
To visually validate the fusion performance, Figure 5 presents qualitative fusion results. The first row shows M3FD samples, and the second row shows DroneVehicle samples, with each group comprising visible, infrared, and fused images. The blue rectangles in the fused images highlight regions where key thermal features from the infrared modality are retained. Meanwhile, the red rectangles emphasize enhanced texture and color semantics originating from the visible spectrum. These results demonstrate that our fusion algorithm effectively integrates complementary information from both modalities, maintaining critical details while enhancing overall scene visibility.
5.2. Results on Infrared-Visible Object Detection
In addition to evaluating image fusion quality, we also assessed the performance of the detector using the DroneVehicle and M3FD datasets. To ensure a fair comparison, we used Oriented RCNN [68] as the baseline model for rotated object detection and YOLOv5 [69] for horizontal object detection on the M3FD dataset, consistent with prior methods. The backbone network was kept the same as the fusion network. The object detection results demonstrate that our method outperforms others across several categories, validating its effectiveness and stability.
On the DroneVehicle dataset, as shown in Table 4, the fused detector integrating visible and infrared images with the Oriented RCNN architecture achieved notable performance gains, especially across multiple object categories. In terms of overall performance, our method attained an optimal mAP of 79.5, 0.8 higher than M2FP—the second-ranked method. For individual categories, our method delivered the highest detection accuracy in Car, Truck, and Van. Specifically, the Car detection accuracy reached 96.4, 0.7 higher than M2FP; the Truck accuracy hit 81.3, a substantial 3.0 higher than C2Former which ranked second in this category; and the Van accuracy reached 65.6, 0.7 higher than AFFCM. These results highlight our clear advantage in object detection, confirming that fusing visible and infrared modalities significantly improves performance, particularly for small objects in complex environments.
On the M3FD dataset, as seen in Table 5, our method paired with YOLOv5 demonstrated superior performance, achieving 59.4 mAP50:95 and 86.9 mAP50, 1.9 higher than Fusion-Mamba (the second-best method) on both metrics. In category-specific detection, our method outperformed all counterparts. The Car detection accuracy reached 95.0, 0.2 higher than TarDAL, which previously held the top spot; the Bus accuracy hit 94.1, 0.9 higher than SuperFusion; the Motorcycle accuracy reached 77.8, 0.4 higher than SuperFusion; and the Lamp accuracy attained 88.6, 0.8 higher than DetFusion. However, Truck detection was lower at 82.4, compared to 87.1 and 85.8 for Fusion-Mamba and SuperFusion, respectively. This indicates that our method struggles with larger objects, likely due to the challenges in capturing complex spatial features or less distinct boundaries, which may not be fully addressed by the current fusion-detection framework. Overall, our method demonstrates strong performance, especially for smaller objects and in challenging environments, though further refinement is needed for detecting larger objects such as trucks.
To further validate the detection robustness, Figure 6 provides qualitative results on both datasets. It clearly illustrates UniFusOD’s capability in accurately localizing rotated and small targets, even under challenging multimodal conditions. The comparison with ground truth highlights its strong spatial precision and reliable semantic understanding.
In summary, UniFusOD not only excels in image fusion but also delivers state-of-the-art performance in object detection across varying modalities, categories, and visual complexities. Its robustness against object rotation, scale variation, and modality noise makes it a compelling solution for multimodal perception tasks.
5.3. Ablation Study Results
To verify the effectiveness of the proposed Fine-Grained Region Attention (FRA) mechanism and the UnityGrad optimization strategy, we conducted systematic ablation experiments on the M3FD dataset. Performance was evaluated using image fusion metrics and object detection metrics. Results are presented in Table 6.
As shown in Table 6, the Baseline achieves initial performance with EN of 5.21, SSIM of 0.76, mAP50 of 83.3, and mAP50:95 of 57.1. The introduction of the FRA module significantly enhances these metrics. By focusing on fine-grained regional variations during the fusion process, FRA boosts EN to 6.44, improves the VIF to 0.64, and increases the detection metrics by 2.7 for mAP50 and 1.6 for mAP50:95. This demonstrates FRA’s ability to capture multi-scale regional information, which significantly enhances both visual quality and semantic representation.
Further integration of UnityGrad—designed to reduce conflicts during multi-task optimization—leads to even more substantial gains across the board. Specifically, UnityGrad improves fusion metrics, with SD increasing by 4.97 compared to FRA alone. The SSIM increases to 0.85, while EN sees a modest improvement of 0.16. Detection performance is also enhanced, with mAP50 increasing by 0.9 to 86.9 and mAP50:95 rising by 0.7 to 59.4. These results confirm that UnityGrad optimizes the gradient propagation across tasks, allowing the FRA module to fully exploit its potential in synergizing image fusion and object detection tasks.
6. Discussion
6.1. Study of Fine-Grained Region Attention
The FRA module uses multi-scale feature maps and region-level attention to adaptively focus on important regions, enhancing the model’s ability to capture key features in complex images. To validate its effectiveness, we conducted ablation experiments on the M3FD dataset. Specifically, we investigated the effects of (1) varying the number and configuration of convolutional operators and (2) changing the number of region attention maps M.
6.1.1. Effect of Different Designs
Each $C_k$ denotes a convolution operation with a specific kernel size and dilation factor, followed by batch normalization and ReLU activation. By combining multiple such operators, the module generates diverse region attention maps, enabling it to model spatial contexts of different granularities. To study the impact of varying $C_k$, we conducted experiments using combinations of convolutional kernels with sizes 3 × 3, 5 × 5, 7 × 7, and 11 × 11. The number of operators in the FRA module, denoted as K, directly determines the diversity of region-wise attention maps. Table 7 presents the results for both image fusion and object detection tasks.
The results show that using a single small kernel such as 3 × 3 limits the model's ability to capture broader contextual information, as it tends to focus solely on fine-grained local details. Specifically, the fusion metric EN changes only marginally between 6.44 and 6.42, and detection accuracy measured by mAP50 shows only a minor rise from 86.0 to 86.2. When a second kernel with size 5 × 5 is added, the model benefits from a broader receptive field, resulting in a noticeable enhancement in performance. For instance, the spatial detail metric SD improves from 37.86 to 40.25, and the mAP50:95 increases from 59.0 to 59.1.
The best performance is observed when three kernels are used—specifically 3 × 3, 5 × 5, and 7 × 7. Under this configuration, all key metrics reach their peak values. The EN reaches 6.60, SD increases to 42.52, mutual information MI rises to 1.40, and VIF stands at 0.68. In terms of detection, mAP50 reaches 86.9, while mAP50:95 improves to 59.4. However, introducing a fourth kernel with a size of 11 × 11 slightly degrades performance. For example, SD drops to 39.88 and mAP50:95 decreases to 58.9, which may be attributed to the over-smoothing of fine details and increased computational overhead, leading to redundancy in feature representation.
To gain deeper insights into how different configurations affect spatial attention, we visualize the corresponding region attention maps in Figure 7. The attention map generated by the first operator, which uses a 3 × 3 kernel, predominantly focuses on localized, fine-grained regions. As larger kernels are progressively introduced, the receptive field expands, allowing the attention to gradually shift toward broader, more semantically meaningful areas across the object. In the final aggregation stage, the attention maps clearly highlight the full extent of the target object, demonstrating the effectiveness of the FRA module in capturing both local details and global structural information.
These results collectively confirm that incorporating multiple convolutional operators with varied receptive fields significantly enhances the model’s ability to capture both fine-grained details and broader semantic context. A carefully selected combination of small to medium-sized kernels enables the FRA module to generate diverse region attention, which leads to more informative and discriminative feature representations.
6.1.2. Effect of the Number of Region Attention Maps
The parameter M represents the number of region attention weights, which controls the number of regions the model focuses on during region modeling. As shown in Table 8, increasing M leads to a significant improvement in the model’s performance for both image fusion and object detection tasks.
With the smallest tested value of M, the model performs at a basic level, with an EN of 6.30, indicating that it is unable to effectively focus on high-information regions during image fusion, which limits detection performance. In this case, mAP50 is 85.8 and mAP50:95 is 58.5. As M increases to 4, performance improves slightly, with EN reaching 6.35, mAP50 at 86.0, and mAP50:95 at 58.7, suggesting some improvement in region modeling. Increasing M further to 6 yields a more substantial performance gain, with EN at 6.38, MI at 1.35, SSIM at 0.75, mAP50 at 86.5, and mAP50:95 at 59.0. This indicates the model's enhanced ability to capture and represent important regional features, improving both image fusion and detection accuracy. At the next larger setting of M, the model achieves its best performance: by increasing the number of region attention maps, the model strengthens its capacity to model region-specific features in complex images, significantly boosting detection accuracy and image quality. However, when M is increased to 10, performance slightly declines relative to the best setting, with an EN of 6.40, mAP50 of 86.0, and mAP50:95 of 58.8, suggesting that the benefits of region modeling are approaching saturation while the computational cost continues to rise.
6.2. Study of UnityGrad Algorithm
To evaluate the robustness of the proposed UnityGrad method, we conduct comparative experiments with three typical MTL approaches: GradNorm [50], PCGrad [82], and CAGrad [54], using multiple evaluation metrics. As shown in Table 9, UnityGrad outperforms all other methods, demonstrating significant improvements in both image fusion metrics, such as SSIM of 0.85, and object detection metrics, such as mAP50:95 of 59.4, as well as stability metrics, such as EN of 6.60 and VIF of 0.68. These results confirm that UnityGrad effectively mitigates task interference in multi-task learning, ensuring both superior and stable performance across various dimensions. Compared to existing MTL methods, UnityGrad exhibits enhanced robustness, which can be attributed to its effective multi-task gradient coordination mechanism.
In addition to the quantitative results presented in Table 6 and Table 9, we also analyzed the gradient behavior during training to better understand how UnityGrad mitigates conflicts between tasks. As illustrated in Figure 8, the blue curves represent gradients of the detection loss with respect to shared parameters, while the red curves correspond to gradients of the fusion loss. Without UnityGrad, the detection task dominates the shared gradient space, suppressing the learning capacity of the fusion network due to its relatively smaller gradient magnitude. In contrast, with UnityGrad, the gradient contributions from both tasks are better balanced, leading to improved gradient alignment and a more stable optimization process. Overall, UnityGrad enhances both low-level image fidelity and high-level semantic accuracy, delivering stable joint optimization and superior end-to-end performance for integrated vision tasks.
7. Conclusions
In this paper, we proposed the UniFusOD, a unified framework designed to optimize both infrared-visible image fusion and object detection tasks simultaneously. This network performs end-to-end optimization of low-level and high-level tasks. The Fine-Grained Region Attention (FRA) module enhances the model’s ability to recognize complex region-specific information by applying attention operations at multiple granularities. Furthermore, to address the gradient conflicts between the fusion and detection tasks, we introduced the UnityGrad method, which balances the gradients of the different tasks, stabilizing and improving optimization performance. Experimental results show that UniFusOD significantly enhances the performance of both image fusion and object detection tasks across multiple datasets, demonstrating its potential for real-world applications such as autonomous driving and remote sensing. In the future, we plan to extend this framework to more modalities, such as LiDAR and SAR, to further improve its performance in complex environments.
Author Contributions: Conceptualization, X.X., B.N., Z.W. and W.G.; Methodology, X.X.; Software, X.X.; Validation, Z.P.; Formal analysis, J.Q.; Resources, G.Z., B.N. and L.H.; Data curation, X.X. and W.L.; Writing—original draft, X.X.; Writing—review & editing, B.N., Z.P. and L.H.; Visualization, W.L. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The data presented in this study are available upon request from the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Figure 1 Comparison of (d) UniFusOD with existing image fusion and detection paradigms: (a) Decoupled stages, (b) Coupled two-stage, (c) Multi-stage, and (d) End-to-end. The figure shows the evolution from separated optimization (a,b) to multi-stage (c) and our unified end-to-end framework (d), addressing key challenges through joint fusion-detection optimization.
Figure 2 An overview of the proposed UniFusOD framework. The Backbone extracts and fuses multi-level features from infrared and visible images. The Fine-grained Region Attention (FRA) module learns and distinguishes the importance of different feature regions, guiding the model to focus on key features. The model, together with the task-specific heads, is end-to-end synchronized and optimized using the UnityGrad method.
Figure 3 Detailed view of the fine-grained region attention. This is a schematic diagram showing an example where
Figure 4 The structure diagram of Detection and Fusion Heads.
Figure 5 Qualitative fusion results. The first row shows examples from the M3FD dataset (Visible, Infrared, and Fused images), and the second row shows samples from the DroneVehicle dataset. Red boxes highlight infrared-dominant regions preserved in the fusion result, while blue boxes indicate enhanced semantic and detail representations from the visible spectrum.
Figure 6 Qualitative detection results. The first two columns are samples from the DroneVehicle dataset, and the last four columns are from the M3FD dataset. The first row shows ground truth annotations, and the second row shows predictions from UniFusOD. The results highlight the model’s robustness in detecting rotated and scale-varying targets under complex conditions.
Figure 7 Visualization of region attention maps generated by different
Figure 8 Gradient conflict visualization in joint fusion-detection optimization: (a) Without UnityGrad, showing the gradients of both fusion (red) and detection (blue) tasks across iterations, with visible misalignment; (b) With UnityGrad, demonstrating improved gradient alignment between the fusion (red) and detection (blue) tasks, resulting in better convergence. The x-axis represents the iterations, while the y-axis shows the gradient values.
Fusion Results on the TNO Dataset. Best Results are shown in bold.
| Model | EN | SD | MI | VIF | SSIM |
|---|---|---|---|---|---|
| DIDFuse | 6.97 | 45.12 | 1.70 | 0.60 | 0.81 |
| U2Fusion | 6.83 | 34.55 | 1.37 | 0.58 | 0.99 |
| SDNet | 6.64 | 32.66 | 1.52 | 0.56 | 1.00 |
| RFN-Nest | 6.83 | 34.50 | 1.20 | 0.51 | 0.92 |
| TarDAL | 6.84 | 45.63 | 1.86 | 0.53 | 0.88 |
| DenseFuse | 6.95 | 38.41 | 1.78 | 0.60 | 0.96 |
| MMIF-INet | 6.88 | 39.27 | 1.69 | 0.56 | 0.83 |
| FusionGAN | 7.10 | | 1.78 | 0.57 | 0.88 |
| AMDANet | | 39.52 | 1.82 | 0.70 | 0.95 |
| CDDFuse | 7.12 | | | | |
| UniFusOD | | 41.28 | | | |
Fusion Results on the Roadscene Dataset with Best Results shown in bold.
| Model | EN | SD | MI | VIF | SSIM |
|---|---|---|---|---|---|
| DIDFuse | 7.43 | 51.58 | 2.11 | 0.58 | 0.86 |
| U2Fusion | 7.09 | 38.12 | 1.87 | 0.60 | 0.97 |
| SDNet | 7.14 | 40.20 | 2.21 | 0.60 | |
| RFN-Nest | 7.21 | 41.25 | 1.68 | 0.54 | 0.90 |
| TarDAL | 7.17 | 47.44 | 2.14 | 0.54 | 0.88 |
| DenseFuse | 7.23 | 44.44 | | 0.63 | 0.89 |
| MMIF-INet | 7.24 | 49.75 | 2.05 | 0.61 | 0.78 |
| FusionGAN | 7.36 | 52.54 | 2.18 | 0.59 | 0.88 |
| AMDANet | 7.43 | 53.77 | 1.92 | | 0.81 |
| CDDFuse | | | | 0.69 | |
| UniFusOD | | | 1.96 | | 0.90 |
Fusion Results on the M3FD Dataset with Best Results in bold.
| Model | EN | SD | MI | SSIM | VIF |
|---|---|---|---|---|---|
| DIDFuse | 5.97 | | | 0.81 | 0.54 |
| U2Fusion | 5.62 | 36.51 | 1.20 | | 0.50 |
| SDNet | 6.21 | 34.22 | 1.24 | | 0.61 |
| RFN-Nest | 6.01 | 37.59 | 1.01 | 0.92 | 0.51 |
| TarDAL | 5.84 | 40.18 | | 0.88 | 0.59 |
| DenseFuse | 6.44 | 36.46 | 1.23 | 0.96 | 0.57 |
| MMIF-INet | 5.74 | 40.67 | 1.18 | 0.96 | 0.55 |
| FusionGAN | 6.30 | 39.83 | 1.16 | 0.88 | 0.53 |
| AMDANet | | 38.27 | 1.31 | 0.97 | |
| CDDFuse | 5.77 | 39.74 | 1.33 | 0.91 | 0.69 |
| UniFusOD | | | | 0.85 | |
Object detection results on the DroneVehicle dataset. The table shows the performance of various methods using different modalities: visible images, infrared (IR) images, and their fusion (visible + IR). The best results in each category are highlighted in bold.
| Methods | Modality | Car | Truck | Freight-Car | Bus | Van | mAP |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [ | Visible | 79.0 | 49.0 | 37.2 | 77.0 | 37.0 | 55.9 |
| RoITransformer [ | Visible | 61.6 | 55.1 | 42.3 | 85.5 | 44.8 | 61.6 |
| YOLOv5s [ | Visible | 78.6 | 55.3 | 43.8 | 87.1 | 46.0 | 62.1 |
| Faster R-CNN | IR | 89.4 | 53.5 | 48.3 | 87.0 | 42.6 | 64.2 |
| RoITransformer | IR | 90.1 | 60.4 | 58.9 | 89.7 | 52.2 | 70.3 |
| YOLOv5s | IR | 90.0 | 59.5 | 60.8 | 89.5 | 53.8 | 70.7 |
| Halfway Fusion [ | Visible + IR | 90.1 | 62.3 | 58.5 | 89.1 | 49.8 | 70.0 |
| UA-CMDet [ | Visible + IR | 88.6 | 73.1 | 57.0 | 88.5 | 54.1 | 70.0 |
| MBNet [ | Visible + IR | 90.1 | 64.4 | 62.4 | 88.8 | 53.6 | 71.9 |
| TSFADet [ | Visible + IR | 89.9 | 67.9 | 63.7 | 89.8 | 54.0 | 73.1 |
| C2Former [ | Visible + IR | 90.2 | | 64.4 | 89.8 | 58.5 | 74.2 |
| AFFCM [ | Visible + IR | 90.2 | 73.4 | | 89.9 | 64.9 | 76.6 |
| MC-DETR [ | Visible + IR | 94.8 | 76.7 | 60.4 | | 61.4 | 76.9 |
| M2FP [ | Visible + IR | | 76.2 | | | | |
| UniFusOD (Oriented RCNN) | Visible + IR | | | 63.5 | 90.8 | | |
Object detection results on the M3FD dataset. The best results in each category are highlighted in bold.
| Methods | Detector | mAP50 | mAP | People | Bus | Car | Motorcycle | Lamp | Truck |
|---|---|---|---|---|---|---|---|---|---|
| DIDFuse [ | YOLOv5 | 78.9 | 52.6 | 79.6 | 79.6 | 92.5 | 68.7 | 84.7 | 68.7 |
| SDNet [ | YOLOv5 | 79.0 | 52.9 | 79.4 | 81.4 | 92.3 | 67.4 | 84.1 | 69.3 |
| RFNet [ | YOLOv5 | 79.4 | 53.2 | 79.4 | 78.2 | 91.1 | 72.8 | 85.0 | 69.0 |
| TarDAL [ | YOLOv5 | 80.5 | 54.1 | 81.5 | 81.3 | | 69.3 | 87.1 | 68.7 |
| DetFusion [ | YOLOv5 | 80.8 | 53.8 | 80.8 | 83.0 | 92.5 | 69.4 | | 71.4 |
| CDDFuse [ | YOLOv5 | 81.1 | 54.3 | 81.6 | 82.6 | 92.5 | 71.6 | 86.9 | 71.5 |
| IGNet [ | YOLOv5 | 81.5 | 54.5 | 81.6 | 82.4 | 92.8 | 73.0 | 86.9 | 72.1 |
| SuperFusion [ | YOLOv7 | 83.5 | 56.0 | | | 91.0 | 77.4 | 70.0 | |
| Fd2-Net [ | YOLOv5 | 83.5 | 55.7 | 82.7 | 82.7 | 93.6 | | 87.8 | 73.7 |
| Fusion-Mamba [ | YOLOv5 | | | 80.3 | 92.8 | 91.9 | 73.0 | 84.8 | |
| UniFusOD | YOLOv5 | | | | | | | | 82.4 |
Ablation study results on the M3FD dataset, with the best results shown in bold. “✓” denotes that the module is enabled.
| Baseline | FRA | UnityGrad | EN | SD | MI | VIF | SSIM | mAP50 | mAP50:95 |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 5.21 | 37.68 | 1.21 | 0.52 | 0.76 | 83.3 | 57.1 |
| ✓ | ✓ | | 6.44 | 37.55 | 1.15 | 0.64 | 0.68 | 86.0 | 58.7 |
| ✓ | ✓ | ✓ | 6.60 | 42.52 | 1.40 | 0.68 | 0.85 | 86.9 | 59.4 |
Ablation study of the number of convolutional operators.
| Number of Convolutional Operators (Kernel Sizes) | EN | SD | MI | VIF | mAP50 | mAP50:95 |
|---|---|---|---|---|---|---|
| 0 (no region attention) | 6.44 | 37.55 | 1.15 | 0.64 | 86.0 | 58.7 |
| 1 (3 × 3) | 6.42 | 37.86 | 1.38 | 0.62 | 86.2 | 59.0 |
| 2 (3 × 3, 5 × 5) | 6.41 | 40.25 | 1.38 | 0.68 | 86.7 | 59.1 |
| 3 (3 × 3, 5 × 5, 7 × 7) | 6.60 | 42.52 | 1.40 | 0.68 | 86.9 | 59.4 |
| 4 (3 × 3, 5 × 5, 7 × 7, 11 × 11) | 6.52 | 39.88 | 1.37 | 0.66 | 86.2 | 58.9 |
Ablation study of the number of region attention maps M. The table shows the impact of different values of M on the model’s performance, with the best performing metrics highlighted in bold.
| M | EN | SD | MI | VIF | SSIM | mAP50 | mAP50:95 |
|---|---|---|---|---|---|---|---|
| 2 | 6.30 | 37.00 | 1.10 | 0.62 | 0.68 | 85.8 | 58.5 |
| 4 | 6.35 | 37.30 | 1.30 | 0.61 | 0.72 | 86.0 | 58.7 |
| 6 | 6.38 | 39.00 | 1.35 | 0.65 | 0.75 | 86.5 | 59.0 |
| 8 | 6.60 | 42.52 | 1.40 | 0.68 | 0.85 | 86.9 | 59.4 |
| 10 | 6.40 | 39.50 | 1.35 | 0.64 | 0.73 | 86.0 | 58.8 |
Performance Comparison of Different Multi-task Learning Methods on M3FD for Image Fusion and Object Detection.
| Method | EN | SD | MI | VIF | SSIM | mAP50 | mAP50:95 |
|---|---|---|---|---|---|---|---|
| GradNorm | 6.21 | 38.60 | 1.29 | 0.57 | 0.70 | 86.1 | 58.7 |
| PCGrad | 6.51 | 39.92 | 1.35 | 0.63 | 0.79 | 86.1 | 59.0 |
| CAGrad | 6.47 | 40.77 | 1.36 | 0.66 | 0.84 | 86.3 | 58.8 |
| UnityGrad | 6.60 | 42.52 | 1.40 | 0.68 | 0.85 | 86.9 | 59.4 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).