
Abstract

What are the main findings?

Our proposed UniFusOD method integrates infrared-visible image fusion and object detection into a unified, end-to-end framework, achieving superior performance across multiple tasks.

The introduction of the Fine-Grained Region Attention (FRA) module and UnityGrad optimization significantly enhances the model’s ability to handle multi-scale features and resolves gradient conflicts, improving both fusion and detection outcomes.

What are the implications of the main findings?

The unified optimization approach not only improves image fusion quality but also enhances downstream task performance, particularly in detecting rotated and small objects.

This approach demonstrates significant robustness across various datasets, offering a promising solution for multimodal perception tasks in remote sensing and autonomous driving.

Infrared-visible image fusion and object detection are crucial components in remote sensing applications, each offering unique advantages. Recent research has increasingly sought to combine these tasks to enhance object detection performance. However, the integration of these tasks presents several challenges, primarily due to two overlooked issues: (i) existing infrared-visible image fusion methods often fail to adequately focus on fine-grained or dense information, and (ii) while joint optimization methods can improve fusion quality and downstream task performance, their multi-stage training processes often reduce efficiency and limit the network’s global optimization capability. To address these challenges, we propose the UniFusOD method, an efficient end-to-end framework that simultaneously optimizes both infrared-visible image fusion and object detection tasks. The method integrates Fine-Grained Region Attention (FRA) for region-specific attention operations at different granularities, enhancing the model’s ability to capture complex information. Furthermore, UnityGrad is introduced to balance the gradient conflicts between fusion and detection tasks, stabilizing the optimization process. Extensive experiments demonstrate the superiority and robustness of our approach. Not only does UniFusOD achieve excellent results in image fusion, but it also provides significant improvements in object detection performance. The method exhibits remarkable robustness across various tasks, achieving a 0.8 and 1.9 mAP50 improvement over state-of-the-art methods on the DroneVehicle dataset for rotated object detection and the M3FD dataset for horizontal object detection, respectively.


1. Introduction

The rapid development of remote sensing satellite platforms has made the acquisition of vast amounts of data possible, significantly driving advancements in deep learning technologies [1]. However, the inherent limitations of individual sensor modalities present substantial challenges for achieving comprehensive visual perception. Visible images, with their high spatial resolution and rich color information, are adept at capturing texture and chromatic details. Nevertheless, they heavily depend on high-quality illumination conditions, and their performance deteriorates significantly in low-light environments, resulting in the loss of important information [2,3,4,5]. In contrast, infrared images, owing to their thermal imaging mechanism, are not dependent on ambient light, providing robust edge information even in dim lighting. They also offer certain advantages in terms of penetration and camouflage resistance, making them particularly useful for highlighting target contours. However, they fall short in representing fine textures [6,7,8]. Thus, relying solely on a single modality for object detection in remote sensing often leads to perceptual blind spots, limiting model performance and scene generalization.

To address these complementary deficiencies, Infrared-Visible Image Fusion has emerged as a critical research domain, with substantial applications in autonomous driving and remote sensing systems [9,10,11,12]. On the one hand, the task of infrared-visible image fusion aims to integrate complementary information from both types of images, creating a richer and more informative fused image. This enhanced image improves scene clarity and provides the necessary details relevant to the specific application scenario. Existing methods in infrared-visible fusion typically focus on feature-level fusion and alignment using deep learning techniques. These methods can be broadly classified into three categories: autoencoder (AE)-based approaches [4,13,14,15] enhance feature representation through reconstruction; generative adversarial network (GAN)-based approaches [8,16,17] constrain the fusion image distribution to align with original inputs, thus avoiding direct fusion weight learning; and unified models [18,19] employ cross-learning to address the lack of ground truth and training samples. However, these methods generally improve fusion performance from the perspectives of fusion weights or image distribution, with an emphasis on semantic information. They often overlook the fine-grained fusion of image details, which are crucial for downstream tasks, especially those requiring dense information. On the other hand, while fused images can provide high-quality inputs for higher-level perceptual tasks such as object detection and tracking [6,20,21], traditional infrared-visible fusion methods primarily focus on improving visual information quality without fully addressing the specific needs of downstream tasks. As a result, although fused images may exhibit high visual quality, they do not necessarily lead to significant improvements in perceptual accuracy or overall task performance in real-world applications [22].

To overcome these limitations, recent studies have explored the joint optimization of image fusion and high-level perception tasks such as object detection [20,21,22,23]. This approach aims to optimize both pixel-level and feature-level processes simultaneously, ensuring that enhancements in image fusion also improve downstream tasks like object detection. One of the primary advantages of joint optimization is its ability to leverage semantic information from object detection to guide the fusion process, making the fused image more effective for detection. Additionally, this optimization enables the fusion task itself to be more beneficial in enhancing object detection performance. However, despite its potential, several challenges remain, as illustrated in Figure 1. These challenges include the following: (1) Inefficient stepwise optimization: Most existing methods adopt a cascaded design, where image fusion and object detection networks are optimized in separate stages, as shown in Figure 1a–c. While this approach may offer improvements in both fusion and detection individually, it introduces inefficiencies due to the lack of integrated learning, making the process computationally expensive and complex. This stepwise optimization also poses challenges for real-time processing, as it does not leverage the potential for joint optimization that could reduce computational overhead. (2) Lack of focus on fine-grained or dense information: Existing feature fusion methods often fail to emphasize fine-grained or dense information, resulting in fused features that may not perform well in location-sensitive tasks such as detection and segmentation. This oversight limits the effectiveness of the fused features in tasks where precise spatial information is crucial. (3) Limited ability to find global optimal solutions: Multi-stage optimization methods often become trapped in local optima due to their stepwise nature. Furthermore, these methods typically connect tasks via the loss function, without structural interactions between them, limiting the optimization process’s ability to address the needs of both tasks simultaneously.

Overall, significant challenges remain in achieving synergistic optimization between image fusion and downstream tasks, particularly regarding efficient end-to-end optimization, for which effective solutions are still lacking. In this paper, we propose UniFusOD, a novel end-to-end framework that unifies image fusion and object detection (OD) tasks into a single, integrated optimization process, as illustrated in Figure 1d. By jointly optimizing these tasks, UniFusOD ensures that the fused image is not only visually enhanced but also optimized for downstream tasks like object detection, leveraging the complementary strengths of both modalities. This approach enhances feature fusion across various levels, improving visual perception capabilities. To enable the model to focus on important details at fine-grained levels, we introduce the Fine-Grained Region Attention (FRA) module. Inspired by the biological visual system, the FRA module allows the model to selectively attend to and distinguish key regions, thereby improving feature representation at both spatial and semantic levels. Additionally, we introduce UnityGrad, a novel optimization algorithm based on the Nash bargaining principle. UnityGrad resolves gradient conflicts between fusion and detection tasks, aligning their optimization directions and scales. This approach stabilizes the optimization process and enhances the efficiency and effectiveness of multimodal image fusion for object detection.

The main contributions of this paper are as follows:

(1) We present UniFusOD, an end-to-end multimodal image fusion detection framework that synchronously optimizes both image fusion and downstream tasks. This approach overcomes the inefficiencies and local optima issues associated with multi-stage optimization methods.

(2) The Fine-Grained Region Attention (FRA) module is designed to enhance the model’s ability to focus on and capture region-specific information at various levels of granularity. Inspired by biological visual systems, FRA improves feature representation by selectively attending to crucial regions, enabling the model to better capture and represent task-relevant information in complex multimodal images.

(3) We propose UnityGrad, inspired by the Nash bargaining principle, to resolve gradient conflicts between fusion and detection tasks. This novel approach harmonizes the optimization goals of both tasks, leading to a more balanced and efficient optimization process, ultimately stabilizing and improving model performance.

(4) Through extensive experiments on image fusion and object detection tasks, we demonstrate the effectiveness and robustness of our approach, achieving superior performance over traditional methods in both tasks.

2. Related Work

2.1. Multimodal Image Fusion and Object Detection

Deep learning has significantly advanced both low- and high-level visual tasks in remote sensing, particularly in image fusion and object detection, demonstrating great potential [6,24,25,26]. Early multimodal image fusion studies [7,27,28] mainly optimized fusion outcomes by adjusting network structures or loss functions, achieving good visual effects. From the perspective of network architecture design, many works adopted encoder-decoder frameworks to extract hierarchical features from source images—for instance, integrating residual blocks or dense connections to enhance feature propagation and avoid gradient vanishing, which is particularly effective for fusing heterogeneous modalities like visible and infrared images [29,30]. In terms of feature fusion strategies, researchers have explored multi-scale fusion mechanisms and attention-driven weight allocation to emphasize complementary information between modalities [31,32]. These feature-based fusion approaches allow for a more nuanced integration of information from different sources, potentially improving the overall quality of fused images. However, a significant limitation of feature-based fusion methods lies in their potential disconnection from the end-task performance, such as object detection. While these methods can enhance visual quality and detail in fused images, they may not always be optimized for downstream tasks, which require more task-specific feature integration [33]. For loss function optimization, pixel-level reconstruction losses and structural similarity (SSIM) loss were widely employed to constrain the fused image to be consistent with source images in pixel intensity and structural distribution [34,35]. However, these methods often overlook a crucial point: the primary goal of fusion is not just to improve visual quality but to enhance the performance of downstream tasks, such as object detection. Although high-quality fused images are visually impressive, they may not always meet the specific needs of practical applications [33].

Recent research has increasingly recognized that multimodal image fusion should not be an isolated task but closely integrated with downstream tasks like object detection, tracking, and segmentation. This has led to the development of joint optimization frameworks that combine image fusion with object detection. In these frameworks, fusion is not only aimed at generating visually pleasing images but also at improving downstream task performance. For example, Yuan et al. [26] pioneered cross-modal alignment to address airborne visible and infrared misalignment for rotated detection, establishing the foundational need to resolve modality discrepancies. Building on this, Liu et al. [22] introduced joint learning of fusion and detection with a novel loss function, directly incorporating detection-derived semantic and location information into fusion to simultaneously enhance both tasks. This approach improves fusion quality and detection performance by incorporating semantic and location information from the detection task into the fusion process. Finally, Liu et al. [36] generalized this interaction paradigm through a multi-interaction architecture, formalizing mutual task promotion beyond single-directional guidance to achieve bidirectional, task-aligned feature learning that collectively elevates fusion and detection performance.

Despite these advances, challenges remain. Object detection focuses on semantic understanding, while fusion and segmentation tasks emphasize pixel-level relationships, making the optimization of image fusion and object detection complex. A critical challenge is finding a balance that allows both tasks to mutually enhance each other [10,36,37]. Many methods still rely on cascade architectures, where separate modules are trained and inferred independently, resulting in high computational cost and inefficiency. Furthermore, efficiently integrating information from different modalities while removing redundant features remains a persistent challenge in multimodal fusion.

In conclusion, the integration of image fusion with object detection is a promising research area. By designing effective network architectures and loss functions, image fusion and object detection can mutually promote each other, enhancing the overall performance of multimodal image processing tasks. However, overcoming the optimization challenges requires exploring more efficient and flexible model architectures, particularly end-to-end optimization frameworks for joint inference of image fusion and object detection tasks.

2.2. Multitask Learning

Multitask Learning (MTL) is a technique that improves learning efficiency by simultaneously addressing multiple tasks and sharing information between them [38,39]. This information sharing is typically achieved through a shared hidden representation [40,41,42]. However, the optimization process in multitask learning presents several challenges, such as gradient conflicts between tasks [43,44] and plateau effects in the loss function [45], which complicate the optimization process.

To overcome these challenges, various architectures and methods have been proposed [46,47,48]. Some approaches focus on optimizing the training process by adjusting the gradients of tasks through weighting. For example, some studies weight the loss functions based on task uncertainty [49], gradient norms [50], stochastic weights [51], or gradient similarity [52,53]. However, these methods are predominantly heuristic and may lead to performance instability in practical applications [54]. Additionally, other methods employ techniques such as Neural Architecture Search (NAS) [55,56] or routing networks [57] to automatically discover shared patterns and determine network architectures. While effective in some cases, these approaches come with significant computational overhead.

Recently, there has been growing interest in multi-objective optimization based on the Multi-Gradient Descent Algorithm (MGDA) [58]. Under certain conditions, MGDA guarantees convergence to a Pareto stable point, making it a promising optimization strategy. Hotegni et al. [59] framed the multi-objective optimization problem as a multitask learning problem and introduced a task-weighting approach based on the Frank-Wolfe algorithm [60]. Liu et al. [54] proposed a method that maximizes the worst-case improvement by searching for the optimal update direction within the neighborhood of the average gradient. Liu [51] further developed a method to find a fair gradient direction by ensuring equal cosine similarity of gradients across all tasks. While this approach satisfies all the requirements of Nash axioms, it does not guarantee a Pareto optimal solution. Therefore, mitigating gradient conflicts in multitask learning remains a critical challenge.

3. Methodology

In this section, we introduce UniFusOD, a unified end-to-end framework that simultaneously addresses infrared-visible image fusion and object detection. Specifically, in Section 3.1, we formalize the joint fusion and detection task as an end-to-end optimization problem, aiming to simultaneously improve both visual quality and detection performance. Then, in Section 3.2, we present the overall framework, which integrates a shared backbone, a Fine-Grained Region Attention (FRA) module, and task-specific heads for fusion and detection. The entire model is trained end-to-end using the UnityGrad method, which harmonizes gradients from both tasks to enable stable and balanced multi-task optimization. In Section 3.3, we introduce the FRA mechanism designed to enhance region-level feature representation by focusing on important areas across multiple scales. Next, Section 3.4 details the task heads and their corresponding loss functions used to guide the model toward generating semantically meaningful fused images and accurate object detection. Finally, in Section 3.5, we propose UnityGrad, a gradient harmonization strategy that mitigates optimization conflicts between tasks, enabling more stable and effective end-to-end learning.

3.1. Problem Formulation

Assuming the visible image is denoted as $x \in \mathbb{R}^{H \times W \times 3}$ and the infrared image as $y \in \mathbb{R}^{H \times W \times 1}$, the optimization problem can be formulated as follows:

$$\min_{\theta_d, \theta_f} \; \mathcal{L}_d\big(t, \Phi(u; \theta_d)\big), \quad \text{s.t.} \quad u = \Psi(x, y; \theta_f)$$

Here, $u$ represents the fused image, $t$ denotes the detection ground truth, and $\Psi(\cdot)$ and $\Phi(\cdot)$ are the fusion network and detection network controlled by parameters $\theta_f$ and $\theta_d$, respectively.

To avoid optimization difficulties, most methods separately train the fusion network $\Psi(\cdot)$ and detection network $\Phi(\cdot)$ at different stages. However, this approach makes it challenging to find the global optimal solution. To enable end-to-end optimization, we reformulate the problem as follows:

$$\theta^*, \theta_d^*, \theta_f^* = \arg\min_{\theta, \theta_d, \theta_f} \; \lambda \, \mathcal{L}_d\big(t, \Phi(x, y; \theta, \theta_d)\big) + (1 - \lambda)\, \mathcal{L}_f\big(\Psi(x, y; \theta, \theta_f)\big) + R(\theta, \theta_d, \theta_f)$$

Here, $\theta$ represents the parameters shared by both the detection network and the fusion network, such as the backbone parameters; $\theta_d$ and $\theta_f$ are the task-specific parameters of the detection and fusion networks, respectively; and $\lambda$ is a balancing coefficient. To ensure the stability of the optimization process, we jointly optimize the losses $\mathcal{L}_d$ and $\mathcal{L}_f$, and $R(\cdot)$ is a regularization term applied to the parameters. The regularization constraints are implemented using the UnityGrad method, which adjusts the gradients during optimization.

3.2. Overall Architecture

The overall framework, as shown in Figure 2, consists of three components: the Backbone, the Fine-grained Region Attention (FRA), and task-specific heads. Moreover, to mitigate gradient conflicts between different tasks, the UnityGrad method is used during parameter updates to compute more stable gradients, thus stabilizing the multi-task optimization process.

For the visible image $x \in \mathbb{R}^{H \times W \times 3}$ and the infrared image $y \in \mathbb{R}^{H \times W \times 1}$, the backbone network $f(\cdot)$ is first employed to extract features $f(x)$ and $f(y)$ for each modality. To save memory and computational resources, the backbone is assumed to have shared parameters. The features extracted from each block of the backbone are then summed along the channel dimension, producing the mixed-modal features $z_1, z_2, \ldots, z_L$ from the $L$ blocks.

To enhance the model’s ability to perceive features from different regions, we propose a fine-grained region attention mechanism, which progressively extracts region and object information across different scales, improving the feature representation capacity. Finally, a lightweight task head is used to generate the fused image, ensuring it exhibits both high visual quality and strong semantic information. At the same time, a detection head is employed for object detection, ensuring that the learned representations effectively balance visual quality and detection task accuracy, making the model suitable for various perception tasks in multimodal scenarios.

Finally, the proposed UnityGrad method is used to modulate the gradients propagated from different tasks, solving for new update gradients to ensure stable optimization across tasks.
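For concreteness, the following PyTorch sketch outlines this data flow under simplifying assumptions: the backbone, FRA module, and task heads are placeholder sub-modules, the single-channel infrared input is replicated to three channels before entering the shared backbone, and the modality features are merged by element-wise summation as described above. It is an illustrative skeleton rather than the released implementation.

```python
import torch
import torch.nn as nn

class UniFusODSketch(nn.Module):
    """Minimal sketch of the UniFusOD forward pass described above.

    The sub-modules (backbone, fra, fusion_head, det_head) are placeholders:
    any backbone returning L multi-scale feature maps, region attention applied
    per scale, and lightweight task heads can be plugged in.
    """

    def __init__(self, backbone, fra, fusion_head, det_head):
        super().__init__()
        self.backbone = backbone          # shared weights for both modalities
        self.fra = fra                    # Fine-Grained Region Attention
        self.fusion_head = fusion_head    # reconstructs the fused image u
        self.det_head = det_head          # regression + classification branches

    def forward(self, x_vis, y_ir):
        # The shared backbone is applied to each modality separately.
        feats_vis = self.backbone(x_vis)                    # list of L feature maps
        feats_ir = self.backbone(y_ir.repeat(1, 3, 1, 1))   # assumption: replicate IR to 3 channels

        # Mixed-modal features z_l: element-wise sum of the two modalities.
        z = [fv + fi for fv, fi in zip(feats_vis, feats_ir)]

        # Region-enhanced features v_l from the FRA module (applied per scale).
        v = self.fra(z)

        fused_image = self.fusion_head(v)   # u, reconstructed at input resolution
        detections = self.det_head(v)       # boxes + class scores
        return fused_image, detections
```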

3.3. Fine-Grained Region Attention

In multimodal perception tasks, models must process information across multiple scales and feature representations. This requires feature extractors to adaptively capture region-specific information at various levels of granularity. However, traditional convolutional neural networks (CNNs) typically use fixed-size convolution kernels, limiting their ability to handle regions with fine-grained precision. In biological visual systems, the ability of neurons to selectively focus on regions at different scales is a key feature of visual perception. Based on this principle, we propose a fine-grained region attention mechanism (FRA), which improves the model’s ability to focus on and distinguish important regions by enabling more effective feature representation across multiple spatial and semantic levels. By integrating region-level attention mechanisms, FRA improves the model’s capacity to capture and represent crucial region-specific information in complex images. The structure of the FRA module is illustrated in Figure 3, providing a detailed view of how attention is applied across different regions.

The input to the FRA module consists of the multi-scale feature maps $z_1, z_2, \ldots, z_L$ extracted by the backbone. Each $z_l$ represents mixed features from the visible and infrared images at a different scale. Here, $L$ denotes the number of feature maps, and $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ represents the feature map at layer $l$, where $H_l$ and $W_l$ are the height and width of the feature map, and $C_l$ is the number of channels.

These multi-scale feature maps are extracted from different layers of the backbone, each representing information at a different granularity. Thus, the maps contain multi-scale information from distinct regions of the image. To improve the network’s ability to capture region-specific features, we apply region-level operations to these feature maps. Specifically, we perform $K$ convolution operations on each feature map $z_l$ using different dilation factors and kernel sizes to generate initial region-specific attention maps. These attention maps are then aggregated to obtain the final attention maps. The region-level attention maps $A_l^k$ are calculated as follows:

$$A_l^k = \varphi_{k, d_k}(z_l) \in \mathbb{R}^{M \times H_l \times W_l}, \quad k = 1, 2, \ldots, K$$

where $\varphi_{k, d_k}(\cdot)$ represents the $k$-th convolution operation with dilation factor $d_k$. $A_l^k$ denotes the region-level attention maps generated by the $k$-th convolution operation, consisting of $M$ attention masks, each with spatial dimensions $H_l \times W_l$ but focusing on different regions of the image.

Considering the role of global information, we integrate the global features into the region-specific attention maps to refine them. We apply global pooling on the input feature map $z_l$ to extract its global features, compressing each channel into a scalar to form a global feature vector $s_l \in \mathbb{R}^{1 \times C_l}$, which represents the global context of the image:

$$s_l = \frac{1}{H_l \times W_l} \sum_{i=1}^{H_l} \sum_{j=1}^{W_l} z_l(i, j)$$

This global feature vector is then passed through a feed-forward network (FFN), which generates a weight matrix $W_l \in \mathbb{R}^{K \times M}$. Each row $w_k$ of $W_l$ represents the weights over the $M$ regions for the $k$-th convolution operation. To normalize the weights, we apply softmax along the $M$-dimension, resulting in $\alpha_k$, which represents the contribution of the $k$-th operation to the $M$ regions:

$$\alpha_k = \mathrm{softmax}(w_k) \in \mathbb{R}^{M}$$

The softmax operation ensures that the attention weights are normalized, allowing the coefficients to reflect the relative importance of each region.

Using the weighted coefficients $\alpha_k$ predicted from the global information, we compute a weighted sum of the initial attention maps to obtain the combined attention maps $A_l$:

$$A_l = \sum_{k=1}^{K} \alpha_k \cdot A_l^k \in \mathbb{R}^{M \times H_l \times W_l}$$

The combined maps effectively capture region-specific features by aggregating the attention from different dilation factors and kernel sizes, which focus on various spatial scales and receptive fields.

Next, we apply the Sigmoid activation function to $A_l$ to normalize it:

$$A_l = \sigma(A_l) \in \mathbb{R}^{M \times H_l \times W_l}$$

where $\sigma(\cdot)$ represents the Sigmoid activation function, which maps each attention weight into the range $(0, 1)$, keeping the weights of all regions on a comparable scale.

The attention map $A_l$ contains $M$ region masks, where each mask $A_l^m \in \mathbb{R}^{H_l \times W_l}$ represents the attention distribution for a specific region of the image. Thus, the region attention map $A_l$ can be represented as

$$A_l = \{A_l^1, A_l^2, \ldots, A_l^M\}$$

Using the region attention maps $A_l$, we apply pixel-wise weighting to the original feature map $z_l$, enhancing the features of the important regions. In this process, each region mask $A_l^m$ is pixel-wise multiplied with the corresponding region of the input feature map $z_l$, resulting in the weighted region feature map $v_l$:

$$v_l(i, j) = \sum_{m=1}^{M} A_l^m(i, j) \odot z_l(i, j)$$

where $\odot$ represents the pixel-wise multiplication operation and $(i, j)$ denotes the spatial location. Through this operation, each region mask $A_l^m$ weights the corresponding region in the feature map $z_l$. Regions with higher attention weights are amplified, while those with lower weights are suppressed. This process enhances the important regions and effectively reduces the impact of irrelevant or less important regions, refining the overall feature map representation.

Using this approach, we apply region attention to the multi-scale feature maps $z_1, z_2, \ldots, z_L$ extracted by the backbone, producing region-enhanced features $v_1, v_2, \ldots, v_L$. These enhanced features are then used for the final detection and fusion tasks.
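The following PyTorch sketch summarizes the FRA computation for a single scale $z_l$. The kernel sizes, dilation factors, number of masks $M$, and FFN width are illustrative defaults rather than the exact configuration reported in Section 6.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRASketch(nn.Module):
    """Minimal sketch of Fine-Grained Region Attention for one scale z_l."""

    def __init__(self, channels, num_masks=8,
                 kernel_sizes=(3, 5, 7), dilations=(1, 1, 1)):
        super().__init__()
        self.K, self.M = len(kernel_sizes), num_masks
        # phi_k: one convolution per (kernel size, dilation), producing M masks each.
        self.phis = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, num_masks, k, padding=d * (k // 2), dilation=d),
                nn.BatchNorm2d(num_masks), nn.ReLU(inplace=True))
            for k, d in zip(kernel_sizes, dilations)
        ])
        # FFN mapping the global feature vector s_l to a K x M weight matrix W_l.
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, self.K * self.M))

    def forward(self, z):                               # z: (B, C, H, W)
        B, C, H, W = z.shape
        # Initial region attention maps A_l^k, stacked to (B, K, M, H, W).
        A_k = torch.stack([phi(z) for phi in self.phis], dim=1)
        # Global context s_l and per-operator weights alpha_k (softmax over M).
        s = z.mean(dim=(2, 3))                          # global average pooling, (B, C)
        W_mat = self.ffn(s).view(B, self.K, self.M)     # (B, K, M)
        alpha = F.softmax(W_mat, dim=2)
        # Combined attention A_l = sum_k alpha_k * A_l^k, followed by Sigmoid.
        A = torch.sigmoid((alpha.unsqueeze(-1).unsqueeze(-1) * A_k).sum(dim=1))
        # v_l(i, j) = sum_m A_l^m(i, j) * z_l(i, j)
        return A.sum(dim=1, keepdim=True) * z           # (B, C, H, W)
```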

3.4. Detection and Fusion Heads

In multimodal perception tasks, besides the feature extraction module, the task heads play a crucial role in both object detection and image fusion. We design two distinct task heads for object detection and image fusion, and optimize them using appropriate loss functions.

As shown in Figure 4, the image fusion task head aims to restore the multi-scale region-level feature maps $v_1, v_2, \ldots, v_L$ from the FRA module to the same spatial dimensions as the original input image. It then reconstructs the fused image. First, for each $v_l$, we use an upsampling operation to resize it to match the dimensions of the input image $(H, W)$, typically via bilinear interpolation. After upsampling, the feature maps align with the input image spatially, preserving spatial consistency during fusion. We then sum all the upsampled feature maps to fuse information from different scales:

$$F_{\mathrm{fuse}} = \sum_{l=1}^{L} \mathrm{Upsample}(v_l)$$

To reduce computational complexity and prepare the feature maps for the next step, we apply a $1 \times 1$ convolution layer to decrease the number of channels in the fused feature map. This operation compresses the channel dimension, making the feature map suitable for reconstruction. After channel reduction, we process the feature map $F_{\mathrm{reduce}}$ with five consecutive $3 \times 3$ convolution layers, followed by ReLU activations, to progressively reconstruct the fused image. Each convolution operation helps recover image details through its learned kernels.
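A minimal sketch of this reconstruction path is given below. It assumes that all region-enhanced maps $v_l$ share the same channel count so that they can be summed after upsampling; the intermediate width and the output activation are illustrative choices, not prescribed settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHeadSketch(nn.Module):
    """Sketch of the fusion head: upsample, sum, 1x1 reduction, five 3x3 convs."""

    def __init__(self, in_channels, mid_channels=64, out_channels=1):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        layers = []
        for i in range(5):
            last = (i == 4)
            layers.append(nn.Conv2d(mid_channels,
                                    out_channels if last else mid_channels,
                                    kernel_size=3, padding=1))
            if not last:
                layers.append(nn.ReLU(inplace=True))
        self.reconstruct = nn.Sequential(*layers)

    def forward(self, feats, out_size):
        # feats: list of region-enhanced maps v_l with identical channel counts.
        # Upsample every v_l to the input resolution (H, W) and sum them.
        f_fuse = sum(F.interpolate(v, size=out_size, mode='bilinear',
                                   align_corners=False) for v in feats)
        f_reduce = self.reduce(f_fuse)                  # channel compression
        # Output activation (sigmoid to [0, 1]) is an assumption of this sketch.
        return torch.sigmoid(self.reconstruct(f_reduce))
```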

To optimize the fusion quality, we use the Structural Similarity Index (SSIM) and Laplacian second-order gradient loss. SSIM is effective at evaluating structural and perceptual image quality and is computed as

$$\mathcal{L}_{\mathrm{SSIM}} = \frac{1 - \mathrm{SSIM}(u, x)}{2} + \frac{1 - \mathrm{SSIM}(u, y)}{2}$$

where u is the fused image, and x and y are the source images.

The Laplacian operator $\nabla^2$ captures second-order texture details, enhancing edges and fine features such as high-frequency textures. The gradient loss is defined as the difference between the Laplacian responses of the fused and source images:

$$\mathcal{L}_{\mathrm{grad}} = \sum_{k=3,5,7} \left\| \nabla_k^2 u - \max\big(\nabla_k^2 x, \nabla_k^2 y\big) \right\|$$

where $\nabla_k^2$ is the Laplacian operator computed with Gaussian kernel size $k$, $u$ is the fused image, and $x$ and $y$ are the source images.

The final image fusion loss function Lf is

$$\mathcal{L}_f = \lambda_1 \mathcal{L}_{\mathrm{SSIM}} + \lambda_2 \mathcal{L}_{\mathrm{grad}}$$

where $\mathcal{L}_{\mathrm{SSIM}}$ is the SSIM-based structural loss, $\mathcal{L}_{\mathrm{grad}}$ is the Laplacian gradient loss, and $\lambda_1$ and $\lambda_2$ are balancing coefficients controlling the importance of each term.
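The fusion loss can be assembled as in the following sketch. It assumes single-channel inputs in $[0, 1]$, an external SSIM helper (for example, the ssim function from the pytorch-msssim package), a Laplacian-of-Gaussian approximation of $\nabla_k^2$, and a mean absolute difference as the norm; these are implementation assumptions rather than prescribed settings.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # any SSIM helper with a compatible signature works

def gaussian_kernel(k, sigma=None, device='cpu'):
    """k x k Gaussian kernel; sigma defaults to a size-dependent value."""
    sigma = sigma or 0.3 * ((k - 1) * 0.5 - 1) + 0.8
    ax = torch.arange(k, device=device, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    g2d = torch.outer(g, g)
    return (g2d / g2d.sum()).view(1, 1, k, k)

def log_response(img, k):
    """Laplacian-of-Gaussian response with Gaussian kernel size k (1-channel input)."""
    blurred = F.conv2d(img, gaussian_kernel(k, device=img.device), padding=k // 2)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=img.device).view(1, 1, 3, 3)
    return F.conv2d(blurred, lap, padding=1)

def fusion_loss(u, x_gray, y_ir, lam1=1.0, lam2=1.0):
    """L_f = lam1 * L_SSIM + lam2 * L_grad for single-channel tensors in [0, 1]."""
    l_ssim = (1 - ssim(u, x_gray, data_range=1.0)) / 2 \
           + (1 - ssim(u, y_ir, data_range=1.0)) / 2
    l_grad = 0.0
    for k in (3, 5, 7):
        target = torch.maximum(log_response(x_gray, k), log_response(y_ir, k))
        l_grad = l_grad + (log_response(u, k) - target).abs().mean()
    return lam1 * l_ssim + lam2 * l_grad
```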

The object detection task head consists of regression and classification branches, aiming to detect target objects through accurate regression and classification. We use classic loss functions such as SmoothL1 Loss and Focal Loss, which have demonstrated strong performance in detection tasks:

$$\mathcal{L}_d = \lambda_3 \mathcal{L}_{\mathrm{SmoothL1}} + \lambda_4 \mathcal{L}_{\mathrm{focal}}$$

where $\lambda_3$ and $\lambda_4$ are balancing coefficients that adjust the relative importance of the regression and classification losses.

Therefore, the final total loss function is

$$\mathcal{L} = \lambda \mathcal{L}_d + (1 - \lambda)\mathcal{L}_f$$

where $\lambda$ is a balancing coefficient that controls the trade-off between the object detection loss $\mathcal{L}_d$ and the image fusion loss $\mathcal{L}_f$, allowing the model to effectively learn both tasks simultaneously.

3.5. UnityGrad

In image fusion and object detection tasks, optimization objectives often conflict in both gradient direction and magnitude. Image fusion aims to generate high-quality fused images, while object detection focuses on improving detection accuracy. When these tasks share parameters θ, conflicting gradients can lead to suboptimal updates and degraded overall performance. To address this challenge effectively, we propose UnityGrad, a principled approach that unifies conflicting gradients through cooperative bargaining.

Let $K$ denote the number of tasks and let $g_i \in \mathbb{R}^{d}$ be the gradient of the $i$-th task’s loss with respect to the shared parameters $\theta$. While we primarily focus on image fusion ($i=1$) and object detection ($i=2$) in this paper, the UnityGrad formulation generalizes to any number of tasks.

Given the current parameters $\theta$, we search for an update vector $\Delta\theta$ within the ball of radius $\epsilon$ centered at zero, denoted as $B_\epsilon$. The key insight of UnityGrad is to formulate this as a cooperative bargaining problem in which the agreement set is $B_\epsilon$ and the disagreement point is $0$, representing no parameter update [61]. For each task, we define a utility function $u_i(\Delta\theta) = g_i^\top \Delta\theta$, representing how beneficial the update direction is for that task [62].

Our main assumption is that when $\theta$ is not at a Pareto stationary point, the task gradients are linearly independent. This ensures that the disagreement point (no update) is dominated by some point in $B_\epsilon$ that benefits all tasks.

The core of UnityGrad is the following optimization objective:

$$\Delta\theta^* = \arg\max_{\Delta\theta \in B_\epsilon} \sum_{i=1}^{K} \log\big(\Delta\theta^\top g_i\big)$$

This logarithmic objective is derived from the Nash Bargaining Solution in cooperative game theory, which maximizes the product of utility gains. Taking the logarithm transforms this product into a sum while preserving the solution’s properties. The logarithmic formulation is particularly important as it ensures scale-invariance across tasks with different gradient magnitudes, preventing any single task from dominating the optimization process. We can characterize the solution to this logarithmic optimization problem as follows:

Claim 1.

Let $G$ be the $d \times K$ matrix whose columns are the gradients $g_i$. The solution to our optimization problem is (up to scaling) $\Delta\theta^* = \sum_i \alpha_i g_i$, where $\alpha \in \mathbb{R}_{+}^{K}$ is the solution to $G^\top G \alpha = 1/\alpha$, with $1/\alpha$ denoting the element-wise reciprocal.

Proof. 

To derive this result, we analyze the gradient of the objective function, which takes the form $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i}$. We observe that for any vector $\Delta\theta$ satisfying $\Delta\theta^\top g_i > 0$ for all $i$, the utility functions increase monotonically with $\|\Delta\theta\|$. This, combined with the Pareto optimality characteristic inherent in bargaining solutions [61], necessitates that the optimal point lie on the boundary of $B_\epsilon$. Consequently, at the optimal solution, the gradient $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i}$ must align with the radial direction. Mathematically, this means $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i} = \lambda \Delta\theta$ for some scalar $\lambda$. Given the linear independence of the gradients, we can express $\Delta\theta$ as a linear combination $\Delta\theta = \sum_i \alpha_i g_i$, where each $\alpha_i > 0$. This yields the condition $\frac{1}{\Delta\theta^\top g_i} = \lambda \alpha_i$, which can be rearranged as $\Delta\theta^\top g_i = \frac{1}{\lambda \alpha_i}$. Since we require $\Delta\theta^\top g_i > 0$ for descent directions, it follows that $\lambda > 0$. For simplicity, we set $\lambda = 1$ to determine the direction of $\Delta\theta$ (noting that its magnitude might exceed $\epsilon$). The bargaining problem thus reduces to finding coefficients $\alpha \in \mathbb{R}^{K}$ with positive components such that $\Delta\theta^\top g_i = \sum_j \alpha_j g_j^\top g_i = \frac{1}{\alpha_i}$ for all $i$. This can be expressed in matrix form as $G^\top G \alpha = \alpha^{-1}$, where $\alpha^{-1}$ denotes the element-wise reciprocal vector. □

To solve the equation $G^\top G \alpha = 1/\alpha$ efficiently, we employ an iterative approach. We initialize $\alpha$ with uniform weights $\alpha^{(0)} = (1/K, \ldots, 1/K)$ and use a fixed-point iteration:

$$\alpha^{(t+1)} = \frac{1}{\sqrt{G^\top G\, \alpha^{(t)}}}$$

where the square root and division operations are applied element-wise. This iteration continues until $\|\alpha^{(t+1)} - \alpha^{(t)}\| < \epsilon$ for a small threshold $\epsilon$, typically requiring only a few iterations to achieve good convergence.
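A compact NumPy sketch of this solver is shown below; the clipping floor and the stopping rule are implementation assumptions added for numerical safety.

```python
import numpy as np

def solve_alpha(G, num_iters=20, tol=1e-6):
    """Fixed-point iteration for G^T G alpha = 1 / alpha, as described above.

    G is a d x K matrix whose columns are the task gradients. Starting from
    uniform weights, the element-wise update alpha <- 1 / sqrt(G^T G alpha)
    is applied until the change falls below tol.
    """
    K = G.shape[1]
    gram = G.T @ G                                   # K x K Gram matrix
    alpha = np.full(K, 1.0 / K)
    for _ in range(num_iters):
        denom = np.clip(gram @ alpha, 1e-12, None)   # keep strictly positive (assumption)
        new_alpha = 1.0 / np.sqrt(denom)
        if np.abs(new_alpha - alpha).max() < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha

# The combined update direction is then Delta_theta = G @ alpha, which the
# optimizer applies to the shared parameters.
```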

Through this iterative optimization, UnityGrad converges to a Pareto Stationary Point, where the gradients of all tasks are balanced relative to each other. This ensures that no task’s loss can be further reduced without increasing another task’s loss, achieving true unity in the optimization process. The complete UnityGrad algorithm is summarized in Algorithm 1.

Algorithm 1 UnityGrad
Input: initial shared parameters $\theta^{(0)}$; differentiable losses $\{\ell_i\}_{i=1}^{K}$; learning rate $\eta$; total steps $T$
Output: $\theta^{(T)}$
for $t = 1$ to $T$ do
    Compute task gradients:
    for $i = 1$ to $K$ do
        $g_i^{(t)} \leftarrow \nabla_{\theta^{(t-1)}} \ell_i$
    end for
    $G^{(t)} \leftarrow [g_1^{(t)}, g_2^{(t)}, \ldots, g_K^{(t)}]$
    Solve for $\alpha^{(t)}$: find $\alpha^{(t)} \in \mathbb{R}_{>0}^{K}$ such that $(G^{(t)})^\top G^{(t)} \alpha^{(t)} = 1/\alpha^{(t)}$
    Update shared parameters:
    $\theta^{(t)} \leftarrow \theta^{(t-1)} - \eta\, G^{(t)} \alpha^{(t)}$
end for
return $\theta^{(T)}$

4. Experiment

4.1. Datasets and Evaluation Criteria

4.1.1. Introduction to Experimental Datasets

We conducted validation on four publicly available infrared and visible image datasets, which are as follows: M3FD [22], DroneVehicle [63], RoadScene [18], and TNO [64]. Specifically, all datasets consist of co-registered infrared and visible image pairs, which are acquired concurrently with aligned sensors to ensure spatial and temporal consistency between modalities. The selection of these datasets allows us to comprehensively evaluate both image fusion and object detection performance under diverse conditions, reflecting the robustness of our approach across various challenges. The M3FD dataset was used to evaluate both detection and image fusion performance [22], while RoadScene and TNO datasets were used for evaluating image fusion performance [65]. The DroneVehicle dataset was used to assess performance in detecting rotated objects [42].

The M3FD dataset includes 4200 pairs of high-resolution aligned infrared and visible light images. The dataset covers a variety of scenes and is categorized into four different types: daytime, overcast, nighttime, and challenging conditions. Additionally, the M3FD dataset annotates a total of 33,603 objects across six categories: people, cars, buses, motorcycles, trucks, and lights. This makes it suitable for evaluating both object detection and fusion tasks.

The DroneVehicle dataset consists of a total of 56,878 images collected by drones, with half of the images being RGB and the other half infrared. The dataset provides detailed annotations for five categories: cars, trucks, buses, vans, and cargo trucks, using rotated bounding boxes for annotation. This increases the evaluation standard for the model’s detection capabilities and is well-suited for evaluating multimodal object detection performance.

The RoadScene dataset, created in 2020, is based on road scenes and includes paired infrared and visible light images. It contains 221 pairs of aligned images, covering a rich set of road scenes such as bicycles, cars, pedestrians, and traffic lights. These image pairs were extracted from FLIR video footage and have been denoised and rigorously aligned. With a large number of high-resolution images, the RoadScene dataset is suitable for evaluating image fusion tasks.

The TNO dataset is a commonly used dataset in the field of infrared and visible image fusion. It includes a large collection of multispectral images from various military-related scenes, such as enhanced visual images, near-infrared images, long-wave infrared images, and thermal radiation images, collected by the Netherlands Organization for Applied Scientific Research. Unlike the MSRS and RoadScene datasets, the visible light images in the TNO dataset are single-channel images. It contains night-time images of multi-band military scenes, with a total of 60 pairs of infrared and visible light images.

4.1.2. Evaluation Criteria

To comprehensively evaluate the performance of the model, we used five evaluation metrics for the image fusion task: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD). The object detection task was assessed using mean Average Precision (mAP).

EN measures the information richness of the fused image. A higher entropy value indicates that the fused image contains more information. The entropy is calculated as

$$\mathrm{EN} = -\sum_{n=1}^{N} p_n \log_2 p_n$$

where $N$ is the number of gray levels in the fused image, and $p_n$ is the proportion of pixels at gray level $n$ in the fused image.

SSIM is a metric used to assess image quality, particularly to measure the structural similarity between the fused image and the reference image. The SSIM is calculated as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ and $\mu_y$ are the mean values of images $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are constants to avoid division by zero.

SSIM values range from −1 to 1, with 1 indicating perfect similarity. The closer the value is to 1, the more similar the structural information between the images.

MI quantifies how much information is retained in the fused image from the source images. In information theory, MI is used to measure the dependence between two random variables. The MI is calculated as

$$\mathrm{MI}(A, B) = H(A) + H(B) - H(A, B)$$

where H(A) and H(B) are the entropies of images A and B, and H(A,B) is their joint entropy.

VIF evaluates the image quality by quantifying the consistency between the image content and human visual perception. Unlike traditional pixel-based metrics (such as MSE or PSNR), VIF considers the perceptual quality by accounting for how the human eye is more sensitive to certain frequencies. The VIF calculation involves several steps: first, the image is decomposed using filters (such as Gaussian filters) to generate multi-scale representations. Then, information is calculated for each scale, with higher weights given to low-frequency components due to human sensitivity. Finally, the VIF is computed by combining the information from all scales:

$$\mathrm{VIF} = \sum_{i=1}^{M} \frac{I_i}{I_i + \sigma_i^2}$$

where $I_i$ represents the information content at the $i$-th scale, $\sigma_i^2$ is the noise variance at that scale, and $M$ is the number of scales. A higher VIF indicates better image quality.

SD measures the degree of variation in the pixel values of the image. A higher SD indicates that the image has more distinct details and richer textures. The SD is calculated as

$$\mathrm{SD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

where $x_i$ is the pixel value of the image, $\mu$ is the mean of the pixel values, and $N$ is the total number of pixels.
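For reference, the EN and SD metrics can be computed as in the following NumPy sketch; it is purely illustrative and assumes a grayscale image with 256 gray levels.

```python
import numpy as np

def entropy_and_sd(img, levels=256):
    """EN and SD of a grayscale image (values in [0, levels)), per the formulas above."""
    pixels = np.asarray(img, dtype=np.float64).ravel()
    hist, _ = np.histogram(pixels, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                                          # ignore empty gray levels
    en = -np.sum(p * np.log2(p))                          # Entropy (EN)
    sd = np.sqrt(np.mean((pixels - pixels.mean()) ** 2))  # Standard Deviation (SD)
    return en, sd
```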

For evaluating object detection performance, we use Precision, Recall, mAP0.5, and mAP0.5:0.95 as the evaluation metrics. These metrics are derived based on the counts of true positives (TP), false positives (FP), and false negatives (FN), as well as the Intersection over Union (IoU) between predicted and ground-truth bounding boxes.

Precision is the ratio of correctly predicted positive samples to all detected samples, calculated as

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples, calculated as

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

Average Precision (AP) is the area under the precision-recall curve, calculated as

$$\mathrm{AP} = \int_{0}^{1} \mathrm{precision}(\mathrm{recall}) \, d(\mathrm{recall})$$

Mean Average Precision (mAP) is the average of the AP values for all classes, calculated as

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$

where $\mathrm{AP}_i$ is the AP value for class $i$, and $N$ is the number of classes in the dataset. mAP$_{0.5}$ denotes the average precision when the Intersection over Union (IoU) threshold is set to 0.5, and mAP$_{0.5:0.95}$ represents the mean average precision across IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
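The following simplified NumPy sketch illustrates how AP and mAP are obtained from scored detections. Standard toolkits (e.g., VOC or COCO evaluators) add further conventions such as specific interpolation and matching rules, so this is an illustrative approximation rather than the exact evaluation protocol.

```python
import numpy as np

def average_precision(tp_flags, scores, num_gt):
    """AP as the area under the precision-recall curve (all-point interpolation).

    tp_flags: 1/0 per detection (IoU-matched to ground truth or not),
    scores: detection confidences, num_gt: number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(tp_flags, dtype=np.float64)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    # Integrate precision over recall using the monotone precision envelope.
    prec_env = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    prec_env = np.concatenate(([prec_env[0] if len(prec_env) else 0.0], prec_env))
    return float(np.sum(np.diff(recall) * prec_env[1:]))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```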

4.2. Implementation Details

All experiments were conducted on an NVIDIA RTX 4090 GPU, using the MMDetection framework for multimodal image fusion and detection. During training, the Adam optimizer was applied with an initial learning rate of $1 \times 10^{-4}$, which decayed every 10 epochs. We employed the UniFusOD structure, shown in Figure 2, for end-to-end training, utilizing both the fusion and detection losses. The hyperparameters in the Detection and Fusion Heads were set as $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$, with $\lambda = 0.4$ and $1 - \lambda = 0.6$, balancing the fusion and detection tasks. For the M3FD dataset, a 2:8 training-to-testing split was used to ensure stability during training and reliability in evaluation. All images were resized and underwent essential data augmentation, including random cropping and flipping, to enhance the model’s generalization ability. During inference, performance was evaluated separately for fusion and detection, with adjustments made based on the characteristics of each dataset. This approach allows for a more targeted assessment of the model’s strengths in different aspects of multimodal processing.
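For illustration, the corresponding optimizer and schedule can be configured as follows; the decay factor and the toy module are assumptions, since only the initial learning rate and the 10-epoch decay interval are specified above.

```python
import torch
import torch.nn as nn

# Illustrative training configuration matching Section 4.2. The decay factor
# gamma is an assumption, and the toy module stands in for the full UniFusOD model.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lam = 0.4  # lambda: weight on L_d; (1 - lambda) = 0.6 weights L_f
```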

The experimental results demonstrate that the proposed method achieves strong performance across multiple tasks and datasets, highlighting the potential of this algorithm in the fields of multimodal image fusion and object detection.

5. Results

5.1. Results on Infrared-Visible Image Fusion

To validate the competitiveness of our algorithm, we compared it with ten other methods: DIDFuse [7], FusionGAN [5], SDNet [2], U2Fusion [18], TarDAL [22], RFN-Nest [13], DenseFuse [4], CDDFuse [65], AMDANet [66] and MMIF-INet [67]. We evaluated performance using five metrics: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD), with detailed descriptions provided in Section 4.1.2. The experimental results show that, even with a simple and direct feature extraction approach, our method outperforms the others on several metrics.

On the TNO dataset, as shown in Table 1, our method achieves the highest EN and MI values, 7.44 and 2.00, respectively, significantly surpassing the alternatives. This indicates that our method preserves more image details and information. Our method also outperforms the others in SSIM with a score of 1.07, indicating that the fused images retain the best structural resemblance to the originals. While the SD of our method is slightly lower than that of CDDFuse, it maintains higher stability, balancing brightness variations and avoiding overprocessing.

On the RoadScene dataset, the results in Table 2 show that although our SSIM is slightly lower than that of some methods, our method still excels in EN and SD, especially with an SD of 59.48, reflecting superior brightness and contrast retention. Additionally, our MI value of 1.96 is comparable to that of other methods, demonstrating strong information preservation.

For the M3FD dataset, as shown in Table 3, our method achieves the highest MI and EN values, 1.40 and 6.60, respectively, showing a clear advantage in information and detail retention. Although VIF and SSIM are slightly lower than some methods, the overall fusion quality remains high, particularly in terms of detail and information fidelity.

To visually validate the fusion performance, Figure 5 presents qualitative fusion results. The first row shows M3FD samples, and the second row shows DroneVehicle samples, with each group comprising visible, infrared, and fused images. The blue rectangles in the fused images highlight regions where key thermal features from the infrared modality are retained. Meanwhile, the red rectangles emphasize enhanced texture and color semantics originating from the visible spectrum. These results demonstrate that our fusion algorithm effectively integrates complementary information from both modalities, maintaining critical details while enhancing overall scene visibility.

5.2. Results on Infrared-Visible Object Detection

In addition to evaluating image fusion quality, we also assessed the performance of the detector using the DroneVehicle and M3FD datasets. To ensure a fair comparison, we used Oriented RCNN [68] as the baseline model for rotated object detection and YOLOv5 [69] for horizontal object detection on the M3FD dataset, consistent with prior methods. The backbone network was kept the same as the fusion network. The object detection results demonstrate that our method outperforms others across several categories, validating its effectiveness and stability.

On the DroneVehicle dataset, as shown in Table 4, the fused detector integrating visible and infrared images with the Oriented RCNN architecture achieved notable performance gains across multiple object categories. In terms of overall performance, our method attained the best mAP of 79.5, 0.8 higher than M2FP, the second-ranked method. For individual categories, our method delivered the highest detection accuracy for Car, Truck, and Van. Specifically, the Car accuracy reached 96.4, 0.7 higher than M2FP; the Truck accuracy reached 81.3, a substantial 3.0 higher than C2Former, which ranked second in this category; and the Van accuracy reached 65.6, 0.7 higher than AFFCM. These results highlight a clear advantage in object detection, confirming that fusing visible and infrared modalities significantly improves performance, particularly for small objects in complex environments.

On the M3FD dataset, as seen in Table 5, our method paired with YOLOv5 demonstrated superior performance, achieving 59.4 mAP and 86.9 mAP50, 1.9 higher than Fusion-Mamba, the second-best method, in both metrics. In category-specific detection, our method outperformed all counterparts. The Car accuracy reached 95.0, 0.2 higher than TarDAL, which previously held the top spot; the Bus accuracy reached 94.1, 0.9 higher than SuperFusion; the Motorcycle accuracy reached 77.8, 0.4 higher than SuperFusion; and the Lamp accuracy reached 88.6, 0.8 higher than DetFusion. However, Truck detection was lower at 82.4, compared with 87.1 and 85.8 for Fusion-Mamba and SuperFusion, respectively. This indicates that our method struggles with larger objects, likely because of the difficulty of capturing complex spatial features or less distinct boundaries, which may not be fully addressed by the current fusion-detection framework. Overall, our method demonstrates strong performance, especially for smaller objects and in challenging environments, though further refinement is needed for detecting larger objects such as trucks.

To further validate the detection robustness, Figure 6 provides qualitative results on both datasets. It clearly illustrates UniFusOD’s capability in accurately localizing rotated and small targets, even under challenging multimodal conditions. The comparison with ground truth highlights its strong spatial precision and reliable semantic understanding.

In summary, UniFusOD not only excels in image fusion but also delivers state-of-the-art performance in object detection across varying modalities, categories, and visual complexities. Its robustness against object rotation, scale variation, and modality noise makes it a compelling solution for multimodal perception tasks.

5.3. Ablation Study Results

To verify the effectiveness of the proposed Fine-Grained Region Attention (FRA) mechanism and UnityGrad optimization strategy, we conducted systematic ablation experiments on the M3FD dataset. Performance was evaluated using image fusion metrics and object detection metrics. Results are presented in Table 6, where “✓” indicates that the module is enabled, and bold values denote the best performance.

As shown in Table 6, the Baseline achieves initial performance with EN of 5.21, SSIM of 0.76, mAP50 of 83.3, and mAP50:95 of 57.1. The introduction of the FRA module significantly enhances these metrics. By focusing on fine-grained regional variations during the fusion process, FRA boosts EN to 6.44, improves the VIF to 0.64, and increases the detection metrics by 2.7 for mAP50 and 1.6 for mAP50:95. This demonstrates FRA’s ability to capture multi-scale regional information, which significantly enhances both visual quality and semantic representation.

Further integration of UnityGrad—designed to reduce conflicts during multi-task optimization—leads to even more substantial gains across the board. Specifically, UnityGrad improves fusion metrics, with SD increasing by 4.97 compared to FRA alone. The SSIM increases to 0.85, while EN sees a modest improvement of 0.16. Detection performance is also enhanced, with mAP50 increasing by 0.9 to 86.9 and mAP50:95 rising by 0.7 to 59.4. These results confirm that UnityGrad optimizes the gradient propagation across tasks, allowing the FRA module to fully exploit its potential in synergizing image fusion and object detection tasks.

6. Discussion

6.1. Study of Fine-Grained Region Attention

The FRA module uses multi-scale feature maps and region-level attention to adaptively focus on important regions, enhancing the model’s ability to capture key features in complex images. To validate its effectiveness, we conducted ablation experiments on the M3FD dataset. Specifically, we investigated the effects of (1) varying the number and configuration of convolutional operators φk and (2) changing the number of region attention maps M.

6.1.1. Effect of Different φk Designs

Each φk denotes a convolution operation with a specific kernel size and dilation factor, followed by batch normalization and ReLU activation. By combining multiple such φk, the module generates diverse region attention maps, enabling it to model spatial contexts of different granularities. To study the impact of varying φk, we conducted experiments using combinations of convolutional kernels with sizes 3 × 3, 5 × 5, 7 × 7, and 11 × 11. The number of φk in the FRA module, denoted as K, directly determines the diversity of region-wise attention maps. Table 7 presents the results for both image fusion and object detection tasks.

The results show that using a single small kernel such as 3 × 3 limits the model’s ability to capture broader contextual information, as it tends to focus solely on fine-grained local details. Specifically, the fusion metric EN changes only marginally, from 6.44 to 6.42, and detection accuracy measured by mAP50 shows only a minor rise, from 86.0 to 86.2. When a second kernel with size 5 × 5 is added, the model benefits from a broader receptive field, resulting in a noticeable enhancement in performance. For instance, the spatial detail metric SD improves from 37.86 to 40.25, and the mAP50:95 increases from 59.0 to 59.1.

The best performance is observed when three kernels are used—specifically 3 × 3, 5 × 5, and 7 × 7. Under this configuration, all key metrics reach their peak values. The EN reaches 6.60, SD increases to 42.52, mutual information MI rises to 1.40, and VIF stands at 0.68. In terms of detection, mAP50 reaches 86.9, while mAP50:95 improves to 59.4. However, introducing a fourth kernel with a size of 11 × 11 slightly degrades performance. For example, SD drops to 39.88 and mAP50:95 decreases to 58.9, which may be attributed to the over-smoothing of fine details and increased computational overhead, leading to redundancy in feature representation.

To gain deeper insights into how different φk configurations affect spatial attention, we visualize the corresponding region attention maps in Figure 7. The attention map generated by the first operator, which uses a 3 × 3 kernel, predominantly focuses on localized, fine-grained regions. As larger kernels are progressively introduced, the receptive field expands, allowing the attention to gradually shift toward broader, more semantically meaningful areas across the object. In the final aggregation stage, the attention maps clearly highlight the full extent of the target object, demonstrating the effectiveness of the FRA module in capturing both local details and global structural information.

These results collectively confirm that incorporating multiple convolutional operators with varied receptive fields significantly enhances the model’s ability to capture both fine-grained details and broader semantic context. A carefully selected combination of small to medium-sized kernels enables the FRA module to generate diverse region attention, which leads to more informative and discriminative feature representations.

6.1.2. Effect of the Number of Region Attention Maps

The parameter M represents the number of region attention weights, which controls the number of regions the model focuses on during region modeling. As shown in Table 8, increasing M leads to a significant improvement in the model’s performance for both image fusion and object detection tasks.

When M=2, the model performs at a basic level, with an EN of 6.30, indicating that the model is unable to effectively focus on high-information regions during image fusion, which limits detection performance. In this case, mAP50 is 85.8 and mAP50:95 is 58.5. As M increases to 4, performance improves slightly, with EN reaching 6.35, mAP50 at 86.0, and mAP50:95 at 58.7, suggesting some improvement in region modeling. Increasing M further to 6 yields a more substantial performance gain, with EN at 6.38, MI at 1.35, SSIM at 0.75, mAP50 at 86.5, and mAP50:95 at 59.0. This indicates the model’s enhanced ability to capture and represent important regional features, improving both image fusion and detection accuracy. When M=8, the model achieves the best performance: by increasing the number of region attention maps, the model strengthens its capacity to model region-specific features in complex images, significantly boosting detection accuracy and image quality. However, when M is increased to 10, performance slightly declines, with EN at 6.40, mAP50 at 86.0, and mAP50:95 at 58.8, suggesting that the benefits of region modeling are approaching saturation while the computational cost continues to rise.

6.2. Study of UnityGrad Algorithm

To evaluate the robustness of the proposed UnityGrad method, we conduct comparative experiments with three typical MTL approaches: GradNorm [50], PCGrad [82], and CAGrad [54], using multiple evaluation metrics. As shown in Table 9, UnityGrad outperforms all other methods, demonstrating significant improvements in both image fusion metrics, such as SSIM of 0.85, and object detection metrics, such as mAP50:95 of 59.4, as well as stability metrics, such as EN of 6.60 and VIF of 0.68. These results confirm that UnityGrad effectively mitigates task interference in multi-task learning, ensuring both superior and stable performance across various dimensions. Compared to existing MTL methods, UnityGrad exhibits enhanced robustness, which can be attributed to its effective multi-task gradient coordination mechanism.

In addition to the quantitative results presented in Table 6 and Table 9, we also analyzed the gradient behavior during training to better understand how UnityGrad mitigates conflicts between tasks. As illustrated in Figure 8, the blue curves represent gradients of the detection loss with respect to shared parameters, while the red curves correspond to gradients of the fusion loss. Without UnityGrad, the detection task dominates the shared gradient space, suppressing the learning capacity of the fusion network due to its relatively smaller gradient magnitude. In contrast, with UnityGrad, the gradient contributions from both tasks are better balanced, leading to improved gradient alignment and a more stable optimization process. Overall, UnityGrad enhances both low-level image fidelity and high-level semantic accuracy, delivering stable joint optimization and superior end-to-end performance for integrated vision tasks.

7. Conclusions

In this paper, we proposed UniFusOD, a unified framework that simultaneously optimizes infrared-visible image fusion and object detection in a single end-to-end network spanning low-level and high-level tasks. The Fine-Grained Region Attention (FRA) module enhances the model’s ability to recognize complex region-specific information by applying attention operations at multiple granularities. To address the gradient conflicts between the fusion and detection tasks, we further introduced the UnityGrad method, which balances the gradients of the two tasks, stabilizing and improving optimization. Experimental results show that UniFusOD significantly enhances both image fusion and object detection performance across multiple datasets, demonstrating its potential for real-world applications such as autonomous driving and remote sensing. In future work, we plan to extend the framework to additional modalities, such as LiDAR and SAR, to further improve its performance in complex environments.

Author Contributions

Conceptualization, X.X., B.N., Z.W. and W.G.; Methodology, X.X.; Software, X.X.; Validation, Z.P.; Formal analysis, J.Q.; Resources, G.Z., B.N. and L.H.; Data curation, X.X. and W.L.; Writing—original draft, X.X.; Writing—review & editing, B.N., Z.P. and L.H.; Visualization, W.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Comparison of (d) UniFusOD with existing image fusion and detection paradigms: (a) Decoupled stages, (b) Coupled two-stage, (c) Multi-stage, and (d) End-to-end. The figure shows the evolution from separated optimization (a,b) to multi-stage (c) and our unified end-to-end framework (d), addressing key challenges through joint fusion-detection optimization.

Figure 2 An overview of the proposed UniFusOD framework. The Backbone extracts and fuses multi-level features from infrared and visible images. The Fine-grained Region Attention (FRA) module learns and distinguishes the importance of different feature regions, guiding the model to focus on key features. The model, together with the task-specific heads, is end-to-end synchronized and optimized using the UnityGrad method.

Figure 3 Detailed view of the fine-grained region attention. This is a schematic diagram showing an example where k=2 in φk,dk(.).

Figure 4 The structure diagram of Detection and Fusion Heads.

Figure 5 Qualitative fusion results. The first row shows examples from the M3FD dataset (Visible, Infrared, and Fused images), and the second row shows samples from the DroneVehicle dataset. Red boxes highlight infrared-dominant regions preserved in the fusion result, while blue boxes indicate enhanced semantic and detail representations from the visible spectrum.

Figure 6 Qualitative detection results. The first two columns are samples from the DroneVehicle dataset, and the last four columns are from the M3FD dataset. The first row shows ground truth annotations, and the second row shows predictions from UniFusOD. The results highlight the model’s robustness in detecting rotated and scale-varying targets under complex conditions.

Figure 7 Visualization of region attention maps generated by different φk(.) in the FRA module. The aggregated attention highlights the complete object region.

Figure 8 Gradient conflict visualization in joint fusion-detection optimization: (a) Without UnityGrad, showing the gradients of both fusion (red) and detection (blue) tasks across iterations, with visible misalignment; (b) With UnityGrad, demonstrating improved gradient alignment between the fusion (red) and detection (blue) tasks, resulting in better convergence. The x-axis represents the iterations, while the y-axis shows the gradient values.

Table 1 Fusion results on the TNO dataset. The best results are shown in red and the second-best in blue.

Model EN SD MI VIF SSIM
DIDFuse 6.97 45.12 1.70 0.60 0.81
U2Fusion 6.83 34.55 1.37 0.58 0.99
SDNet 6.64 32.66 1.52 0.56 1.00
RFN-Nest 6.83 34.50 1.20 0.51 0.92
TarDAL 6.84 45.63 1.86 0.53 0.88
DenseFuse 6.95 38.41 1.78 0.60 0.96
MMIF-INet 6.88 39.27 1.69 0.56 0.83
FusionGAN 7.10 44.85 1.78 0.57 0.88
AMDANet 7.37 39.52 1.82 0.70 0.95
CDDFuse 7.12 46.00 2.19 0.77 1.03
UniFusOD 7.44 41.28 2.00 0.79 1.07

Table 2 Fusion results on the Roadscene dataset. The best results are shown in red and the second-best in blue.

Model EN SD MI VIF SSIM
DIDFuse 7.43 51.58 2.11 0.58 0.86
U2Fusion 7.09 38.12 1.87 0.60 0.97
SDNet 7.14 40.20 2.21 0.60 0.99
RFN-Nest 7.21 41.25 1.68 0.54 0.90
TarDAL 7.17 47.44 2.14 0.54 0.88
DenseFuse 7.23 44.44 2.25 0.63 0.89
MMIF-INet 7.24 49.75 2.05 0.61 0.78
FusionGAN 7.36 52.54 2.18 0.59 0.88
AMDANet 7.43 53.77 1.92 0.73 0.81
CDDFuse 7.44 54.67 2.30 0.69 0.98
UniFusOD 7.47 59.48 1.96 0.84 0.90

Table 3 Fusion results on the M3FD dataset. The best results are shown in red and the second-best in blue.

Model EN SD MI SSIM VIF
DIDFuse 5.97 41.78 1.37 0.81 0.54
U2Fusion 5.62 36.51 1.20 0.99 0.50
SDNet 6.21 34.22 1.24 1.00 0.61
RFN-Nest 6.01 37.59 1.01 0.92 0.51
TarDAL 5.84 40.18 1.37 0.88 0.59
DenseFuse 6.44 36.46 1.23 0.96 0.57
MMIF-INet 5.74 40.67 1.18 0.96 0.55
FusionGAN 6.30 39.83 1.16 0.88 0.53
AMDANet 6.51 38.27 1.31 0.97 0.72
CDDFuse 5.77 39.74 1.33 0.91 0.69
UniFusOD 6.60 42.52 1.40 0.85 0.68

Table 4 Object detection results on the DroneVehicle dataset. The table shows the performance of various methods using different modalities: visible images, infrared (IR) images, and their fusion (visible + IR). The best results in each category are highlighted in red and the second-best in blue.

Methods Modality Car Truck Freight-Car Bus Van mAP50
Faster R-CNN [70] Visible 79.0 49.0 37.2 77.0 37.0 55.9
RoITransformer [71] Visible 61.6 55.1 42.3 85.5 44.8 61.6
YOLOv5s [69] Visible 78.6 55.3 43.8 87.1 46.0 62.1
Faster R-CNN IR 89.4 53.5 48.3 87.0 42.6 64.2
RoITransformer IR 90.1 60.4 58.9 89.7 52.2 70.3
YOLOv5s IR 90.0 59.5 60.8 89.5 53.8 70.7
Halfway Fusion [3] Visible + IR 90.1 62.3 58.5 89.1 49.8 70.0
UA-CMDet [63] Visible + IR 88.6 73.1 57.0 88.5 54.1 70.0
MBNet [72] Visible + IR 90.1 64.4 62.4 88.8 53.6 71.9
TSFADet [26] Visible + IR 89.9 67.9 63.7 89.8 54.0 73.1
C2Former [73] Visible + IR 90.2 78.3 64.4 89.8 58.5 74.2
AFFCM [74] Visible + IR 90.2 73.4 64.9 89.9 64.9 76.6
MC-DETR [75] Visible + IR 94.8 76.7 60.4 91.1 61.4 76.9
M2FP [76] Visible + IR 95.7 76.2 64.7 92.1 64.7 78.7
UniFusOD (Oriented R-CNN) Visible + IR 96.4 81.3 63.5 90.8 65.6 79.5

Table 5 Object detection results on the M3FD dataset. The best results in each category are highlighted in red and the second-best in blue.

Methods Detector mAP50 mAP50:95 People Bus Car Motorcycle Lamp Truck
DIDFuse [7] YOLOv5 78.9 52.6 79.6 79.6 92.5 68.7 84.7 68.7
SDNet [2] YOLOv5 79.0 52.9 79.4 81.4 92.3 67.4 84.1 69.3
RFNet [13] YOLOv5 79.4 53.2 79.4 78.2 91.1 72.8 85.0 69.0
TarDAL [22] YOLOv5 80.5 54.1 81.5 81.3 94.8 69.3 87.1 68.7
DetFusion [77] YOLOv5 80.8 53.8 80.8 83.0 92.5 69.4 87.8 71.4
CDDFuse [65] YOLOv5 81.1 54.3 81.6 82.6 92.5 71.6 86.9 71.5
IGNet [78] YOLOv5 81.5 54.5 81.6 82.4 92.8 73.0 86.9 72.1
SuperFusion [79] YOLOv7 83.5 56.0 83.7 93.2 91.0 77.4 70.0 85.8
Fd2-Net [80] YOLOv5 83.5 55.7 82.7 82.7 93.6 78.1 87.8 73.7
Fusion-Mamba [81] YOLOv5 85.0 57.5 80.3 92.8 91.9 73.0 84.8 87.1
UniFusOD YOLOv5 86.9 59.4 83.4 94.1 95.0 77.8 88.6 82.4

Table 6 Ablation study results on the M3FD dataset, with the best results shown in bold. “✓” denotes that the module is enabled and “–” that it is disabled.

Baseline FRA UnityGrad EN SD MI VIF SSIM mAP50 mAP50:95
✓ – – 5.21 37.68 1.21 0.52 0.76 83.3 57.1
✓ ✓ – 6.44 37.55 1.15 0.64 0.68 86.0 58.7
✓ ✓ ✓ 6.60 42.52 1.40 0.68 0.85 86.9 59.4

Table 7 Ablation study of the number of convolutional operators φk (i.e., K). The table shows the performance of the model when using different numbers of kernel sizes (e.g., 3 × 3, 5 × 5, 7 × 7, 11 × 11). The best results are shown in bold.

Number of φk (K) EN SD MI VIF mAP50 mAP50:95
0 (no region attention) 6.44 37.55 1.15 0.64 86.0 58.7
1 (3 × 3) 6.42 37.86 1.38 0.62 86.2 59.0
2 (3 × 3, 5 × 5) 6.41 40.25 1.38 0.68 86.7 59.1
3 (3 × 3, 5 × 5, 7 × 7) 6.60 42.52 1.40 0.68 86.9 59.4
4 (3 × 3, 5 × 5, 7 × 7, 11 × 11) 6.52 39.88 1.37 0.66 86.2 58.9

Table 8 Ablation study of the number of region attention maps M. The table shows the impact of different values of M on the model’s performance, with the best results highlighted in bold.

M EN SD MI VIF SSIM mAP50 mAP50:95
2 6.30 37.00 1.10 0.62 0.68 85.8 58.5
4 6.35 37.30 1.30 0.61 0.72 86.0 58.7
6 6.38 39.00 1.35 0.65 0.75 86.5 59.0
8 6.60 42.52 1.40 0.68 0.85 86.9 59.4
10 6.40 39.50 1.35 0.64 0.73 86.0 58.8

Table 9 Performance comparison of different multi-task learning methods on the M3FD dataset for image fusion and object detection.

Method EN SD MI VIF SSIM mAP50 mAP50:95
GradNorm 6.21 38.60 1.29 0.57 0.70 86.1 58.7
PCGrad 6.51 39.92 1.35 0.63 0.79 86.1 59.0
CAGrad 6.47 40.77 1.36 0.66 0.84 86.3 58.8
UnityGrad 6.60 42.52 1.40 0.68 0.85 86.9 59.4

1. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion; 2021; 76, pp. 323-336. [DOI: https://dx.doi.org/10.1016/j.inffus.2021.06.008]

2. Zhang, H.; Ma, J. SDNet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis.; 2021; 129, pp. 2761-2785. [DOI: https://dx.doi.org/10.1007/s11263-021-01501-8]

3. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv; 2016; [DOI: https://dx.doi.org/10.48550/arXiv.1611.02644] arXiv: 1611.02644

4. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process.; 2018; 28, pp. 2614-2623. [DOI: https://dx.doi.org/10.1109/TIP.2018.2887342] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30575534]

5. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion; 2019; 48, pp. 11-26. [DOI: https://dx.doi.org/10.1016/j.inffus.2018.09.004]

6. Zhang, X.; Demiris, Y. Visible and infrared image fusion using deep learning. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; 45, pp. 10535-10554. [DOI: https://dx.doi.org/10.1109/TPAMI.2023.3261282] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37015127]

7. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. arXiv; 2020; arXiv: 2003.09210

8. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion; 2020; 54, pp. 85-98. [DOI: https://dx.doi.org/10.1016/j.inffus.2019.07.005]

9. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. Proceedings of the Computer Vision—ECCV 2020: 16th European Conference; Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16 Springer: Berlin/Heidelberg, Germany, 2020; pp. 787-803.

10. Yuan, M.; Cui, B.; Zhao, T.; Wei, X. UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning. arXiv; 2024; arXiv: 2404.17360

11. Kim, D.; Ruy, W. CNN-based fire detection method on autonomous ships using composite channels composed of RGB and IR data. Int. J. Nav. Archit. Ocean. Eng.; 2022; 14, 100489. [DOI: https://dx.doi.org/10.1016/j.ijnaoe.2022.100489]

12. Zhao, G.; Hu, Z.; Feng, S.; Wang, Z.; Wu, H. GLFuse: A Global and Local Four-Branch Feature Extraction Network for Infrared and Visible Image Fusion. Remote Sens.; 2024; 16, 3246. [DOI: https://dx.doi.org/10.3390/rs16173246]

13. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion; 2021; 73, pp. 72-86. [DOI: https://dx.doi.org/10.1016/j.inffus.2021.02.023]

14. Wang, Z.; Ng, M.K.; Michalski, J.; Zhuang, L. A self-supervised deep denoiser for hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens.; 2023; 61, 5520414. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3303921]

15. Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol.; 2021; 32, pp. 1186-1196. [DOI: https://dx.doi.org/10.1109/TCSVT.2021.3075745]

16. Hou, J.; Zhang, D.; Wu, W.; Ma, J.; Zhou, H. A generative adversarial network for infrared and visible image fusion based on semantic segmentation. Entropy; 2021; 23, 376. [DOI: https://dx.doi.org/10.3390/e23030376] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33801048]

17. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas.; 2020; 70, 5005014. [DOI: https://dx.doi.org/10.1109/TIM.2020.3038013]

18. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 44, pp. 502-518. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.3012548]

19. Xu, H.; Wang, X.; Ma, J. DRF: Disentangled representation for visible and infrared image fusion. IEEE Trans. Instrum. Meas.; 2021; 70, 5006713. [DOI: https://dx.doi.org/10.1109/TIM.2021.3056645]

20. Zhang, S.; Zhang, X.; Ren, W.; Shen, L.; Wan, S.; Zhang, J.; Jiang, Y.M. Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection. arXiv; 2025; [DOI: https://dx.doi.org/10.48550/arXiv.2501.15099] arXiv: 2501.15099

21. Li, S.; Han, M.; Qin, Y.; Li, Q. Self-attention progressive network for infrared and visible image fusion. Remote Sens.; 2024; 16, 3370. [DOI: https://dx.doi.org/10.3390/rs16183370]

22. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 5802-5811.

23. Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada, 17–24 June 2023; pp. 13955-13965.

24. Bae, S.; Shin, H.; Kim, H.; Park, M.; Choi, M.Y.; Oh, H. Deep learning-based human detection using rgb and ir images from drones. Int. J. Aeronaut. Space Sci.; 2024; 25, pp. 164-175. [DOI: https://dx.doi.org/10.1007/s42405-023-00632-1]

25. Lee, Y.; Kim, S.; Lim, H.; Lee, H.K.; Choo, H.G.; Seo, J.; Yoon, K. Performance analysis of object detection neural network according to compression ratio of RGB and IR images. J. Broadcast Eng.; 2021; 26, pp. 155-166.

26. Yuan, M.; Wang, Y.; Wei, X. Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. Proceedings of the European Conference on Computer Vision; Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 509-525.

27. Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans. Circuits Syst. Video Technol.; 2021; 32, pp. 105-119. [DOI: https://dx.doi.org/10.1109/TCSVT.2021.3056725]

28. Liu, R.; Liu, Z.; Liu, J.; Fan, X. Searching a hierarchically aggregated fusion architecture for fast multi-modality image fusion. Proceedings of the 29th ACM International Conference on Multimedia; Virtual, 20–24 October 2021; pp. 1600-1608.

29. Wang, J.; Lan, C.; Gao, Z. Deep Residual Fusion Network for Single Image Super-Resolution. J. Phys. Conf. Ser.; 2020; 1693, 012164. [DOI: https://dx.doi.org/10.1088/1742-6596/1693/1/012164]

30. Peng, J.; Zhang, W.; Hou, Y.; Yu, H.; Zhu, Z.l. ECAFusion: Infrared and visible image fusion via edge-preserving and cross-modal attention mechanism. Infrared Phys. Technol.; 2025; 151, 106085. [DOI: https://dx.doi.org/10.1016/j.infrared.2025.106085]

31. Zhang, C.; He, D. A Deep Multiscale Fusion Method via Low-Rank Sparse Decomposition for Object Saliency Detection Based on Urban Data in Optical Remote Sensing Images. Wirel. Commun. Mob. Comput.; 2020; 2020, 7917021. [DOI: https://dx.doi.org/10.1155/2020/7917021]

32. Zhang, P.; Jiang, Q.; Cai, L.; Wang, R.; Wang, P.; Jin, X. Attention-based F-UNet for Remote Sensing Image Fusion. Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys); Haikou, China, 20–22 December 2021; pp. 81-88.

33. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion; 2022; 82, pp. 28-42. [DOI: https://dx.doi.org/10.1016/j.inffus.2021.12.004]

34. Huo, Z.; Qiao, L. Research on Monocular Depth Estimation Algorithm Based on Structured Loss. J. Univ. Electron. Sci. Technol. China; 2021; 50, pp. 728-733.

35. Jiang, L.; Fan, H.; Li, J. A multi-focus image fusion method based on attention mechanism and supervised learning. Appl. Intell.; 2022; 52, pp. 339-357. [DOI: https://dx.doi.org/10.1007/s10489-021-02358-7]

36. Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France, 2–6 October 2023; pp. 8115-8124.

37. Senushkin, D.; Patakin, N.; Kuznetsov, A.; Konushin, A. Independent component alignment for multi-task learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada, 17–24 June 2023; pp. 20083-20093.

38. Caruana, R. Multitask learning. Mach. Learn.; 1997; 28, pp. 41-75. [DOI: https://dx.doi.org/10.1023/A:1007379606734]

39. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng.; 2021; 33, pp. 2739-2756. [DOI: https://dx.doi.org/10.1109/TKDE.2021.3070203]

40. Menon, R.; Dengler, N.; Pan, S.; Chenchani, G.K.; Bennewitz, M. EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images. arXiv; 2025; arXiv: 2503.04441

41. Wu, Y.; Wang, Y.; Yang, H.; Zhang, P.; Wu, Y.; Wang, B. A Mutual Information Constrained Multi-Task Learning Method for Very High-Resolution Building Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2025; 18, pp. 9230-9243. [DOI: https://dx.doi.org/10.1109/JSTARS.2025.3550940]

42. Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Facial landmark detection by deep multi-task learning. Proceedings of the Computer Vision—ECCV 2014: 13th European Conference; Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13 Springer: Berlin/Heidelberg, Germany, 2014; pp. 94-108.

43. Wang, Z.; Tsvetkov, Y.; Firat, O.; Cao, Y. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. arXiv; 2020; [DOI: https://dx.doi.org/10.48550/arXiv.2010.05874] arXiv: 2010.05874

44. Yang, R.; Xu, H.; Wu, Y.; Wang, X. Multi-task reinforcement learning with soft modularization. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 4767-4777.

45. Maninis, K.K.; Radosavovic, I.; Kokkinos, I. Attentive single-tasking of multiple tasks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 1851-1860.

46. Crawshaw, M. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv; 2020; [DOI: https://dx.doi.org/10.48550/arXiv.2009.09796] arXiv: 2009.09796

47. Bairaktari, K.; Blanc, G.; Tan, L.Y.; Ullman, J.; Zakynthinou, L. Multitask Learning via Shared Features: Algorithms and Hardness. Proceedings of the Thirty Sixth Conference on Learning Theory; Bangalore, India, 12–15 July 2023; pp. 747-772.

48. Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev.; 2017; 5, pp. 30-43. [DOI: https://dx.doi.org/10.1093/nsr/nwx105]

49. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482-7491.

50. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. Proceedings of the International Conference on Machine Learning; Stockholm, Sweden, 10–15 July 2018; pp. 794-803.

51. Liu, L.; Li, Y.; Kuang, Z.; Xue, J.; Chen, Y.; Yang, W.; Liao, Q.; Zhang, W. Towards impartial multi-task learning. Proceedings of the ICLR; Vienna, Austria, 4 May 2021.

52. Du, Y.; Czarnecki, W.M.; Jayakumar, S.M.; Farajtabar, M.; Pascanu, R.; Lakshminarayanan, B. Adapting auxiliary losses using gradient similarity. arXiv; 2018; arXiv: 1812.02224

53. Panageas, I.; Piliouras, G.; Wang, X. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. arXiv; 2019; arXiv: 1906.07772

54. Liu, B.; Liu, X.; Jin, X.; Stone, P.; Liu, Q. Conflict-averse gradient descent for multi-task learning. Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 18878-18890.

55. Bragman, F.J.; Tanno, R.; Ourselin, S.; Alexander, D.C.; Cardoso, J. Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1385-1394.

56. Ahn, C.; Kim, E.; Oh, S. Deep elastic networks with model selection for multi-task learning. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6529-6538.

57. Rosenbaum, C.; Klinger, T.; Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv; 2017; arXiv: 1711.01239

58. Yu, J.; Dai, Y.; Liu, X.; Huang, J.; Shen, Y.; Zhang, K.; Zhou, R.; Adhikarla, E.; Ye, W.; Liu, Y.; et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2404.18961] arXiv: 2404.18961

59. Hotegni, S.S.; Berkemeier, M.; Peitz, S. Multi-objective optimization for sparse deep multi-task learning. Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN); Yokohama, Japan, 30 June–5 July 2024; pp. 1-9.

60. Garber, D.; Kretzu, B. Projection-free online convex optimization with time-varying constraints. arXiv; 2024; arXiv: 2402.08799

61. Nash, J. Two-person cooperative games. Econom. J. Econom. Soc.; 1953; 21, pp. 128-140. [DOI: https://dx.doi.org/10.2307/1906951]

62. Navon, A.; Shamsian, A.; Achituve, I.; Maron, H.; Kawaguchi, K.; Chechik, G.; Fetaya, E. Multi-task learning as a bargaining game. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2202.01017] arXiv: 2202.01017

63. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol.; 2022; 32, pp. 6700-6713. [DOI: https://dx.doi.org/10.1109/TCSVT.2022.3168279]

64. Toet, A. The TNO multiband image data collection. Data Brief; 2017; 15, pp. 249-251. [DOI: https://dx.doi.org/10.1016/j.dib.2017.09.038]

65. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada, 17–24 June 2023; pp. 5906-5916.

66. Zhong, H.; Tang, F.; Chen, Z.; Chang, H.J.; Gao, Y. AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Honolulu, HI, USA, 19–25 October 2025; pp. 10645-10655.

67. He, D.; Li, W.; Wang, G.; Huang, Y.; Liu, S. MMIF-INet: Multimodal medical image fusion by invertible network. Inf. Fusion; 2025; 114, 102666. [DOI: https://dx.doi.org/10.1016/j.inffus.2024.102666]

68. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada, 10–17 October 2021; pp. 3520-3529.

69. Ultralytics. Ultralytics/yolov5: v3.1—Bug Fixes and Performance Improvements. Zenodo; 2020; [DOI: https://dx.doi.org/10.5281/zenodo.4154370]

70. Girshick, R. Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile, 7–13 December 2015; pp. 1440-1448.

71. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 2849-2858.

72. Shan, L.; Wang, W. Mbnet: A multi-resolution branch network for semantic segmentation of ultra-high resolution images. Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Singapore, Singapore, 22–27 May 2022; pp. 2589-2593.

73. Yuan, M.; Wei, X. C2former: Calibrated and complementary transformer for rgb-infrared object detection. IEEE Trans. Geosci. Remote Sens.; 2024; 62, 5403712. [DOI: https://dx.doi.org/10.1109/TGRS.2024.3376819]

74. Wu, Y.; Guan, X.; Zhao, B.; Ni, L.; Huang, M. Vehicle detection based on adaptive multimodal feature fusion and cross-modal vehicle index using RGB-T images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2023; 16, pp. 8166-8177. [DOI: https://dx.doi.org/10.1109/JSTARS.2023.3294624]

75. Ouyang, J.; Wang, Q.; Liu, J.; Qu, X.; Song, J.; Shen, T. Multi-modal and cross-scale feature fusion network for vehicle detection with transformers. Proceedings of the 2023 International Conference on Machine Vision, Image Processing and Imaging Technology (MVIPIT); Hangzhou, China, 26–28 July 2023; pp. 175-180.

76. Ouyang, J.; Jin, P.; Wang, Q. Multimodal feature-guided pre-training for RGB-T perception. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 16041-16050. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3454054]

77. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. Proceedings of the 30th ACM International Conference on Multimedia; Lisboa, Portugal, 10–14 October 2022; pp. 4003-4011.

78. Li, J.; Chen, J.; Liu, J.; Ma, H. Learning a graph neural network with cross modality interaction for image fusion. Proceedings of the 31st ACM International Conference on Multimedia; Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4471-4479.

79. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin.; 2022; 9, pp. 2121-2137. [DOI: https://dx.doi.org/10.1109/JAS.2022.106082]

80. Li, K.; Wang, D.; Hu, Z.; Li, S.; Ni, W.; Zhao, L.; Wang, Q. Fd2-net: Frequency-driven feature decomposition network for infrared-visible object detection. Proceedings of the AAAI Conference on Artificial Intelligence; Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4797-4805.

81. Dong, W.; Zhu, H.; Lin, S.; Luo, X.; Shen, Y.; Liu, X.; Zhang, J.; Guo, G.; Zhang, B. Fusion-mamba for cross-modality object detection. arXiv; 2024; arXiv: 2404.09146 [DOI: https://dx.doi.org/10.1109/TMM.2025.3599020]

82. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 5824-5836.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).