Machine vision based perception for vehicle-mounted UAV autonomous landing under GNSS-denied environments

Abstract

With the growing demand for collaborative Unmanned Aerial Vehicle (UAV) and Unmanned Ground Vehicle (UGV) operations, precise landing of a vehicle-mounted UAV on a moving platform in complex environments has become a significant challenge, limiting the functionality of collaborative systems. This paper presents an autonomous landing perception scheme for a vehicle-mounted UAV, specifically designed for GNSS-denied environments to enhance landing capabilities. First, to address the challenges of insufficient illumination in airborne visual perception, an airborne infrared and visible image fusion method is employed to enhance image detail and contrast. Second, a feature enhancement network and region proposal network optimized for small object detection are explored to improve the detection of moving platforms during UAV landing. Finally, a relative pose and position estimation method based on the orthogonal iteration algorithm is investigated to reduce visual pose and position estimation errors and iteration time. Both simulation results and field tests demonstrate that the proposed algorithm performs robustly under low-light and foggy conditions, achieving accurate pose and position estimation even in scenarios with inadequate illumination.

Full text

Translate

Turn on search term navigation

Introduction

With the rapid development of unmanned systems technology, the demand for collaborative operations between Unmanned Ground Vehicles (UGVs) and Unmanned Aerial Vehicles (UAVs) has been steadily increasing. UGVs are adept at long-distance transportation and carrying heavy loads, whereas UAVs excel in navigating complex terrains and operating at high altitudes. By leveraging each other’s strengths, UAV-UGV collaboration creates robust operational capabilities in fields such as agriculture, power line inspection, border patrol, forest fire prevention, and logistics, significantly enhancing efficiency and reducing labor costs (Zhang et al. 2023; Liu et al. 2023; Thelasingha et al. 2024).

However, this collaboration faces challenges due to the technical limitations of UAVs, such as low payload, limited endurance, and restricted environmental perception. To address these limitations, UAVs frequently need to land on UGV platforms for recharging, which brings forth the critical challenge of precise landing, as presented in Fig. 1.

[See PDF for image]

Fig. 1

Autonomous landing of a UAV on a moving vehicular platform

[See PDF for image]

Fig. 2

Comparison of the vehicular platform in infrared, visible light, and fused infrared-visible images under low-light conditions

Unlike fixed recovery sites such as airports, the primary difficulty for UAV landings on moving vehicular platforms lies in accurately perceiving the environment, particularly detecting the platform and determining the relative pose and position. While most UAVs currently depend on the Global Navigation Satellite System (GNSS) and the Inertial Navigation System (INS) for environmental perception (Moortgat-Pick et al. 2024), these systems are often unreliable in remote or urban environments where GNSS signals can be weak or unavailable. Additionally, INS systems are prone to cumulative errors, reducing positioning accuracy over time (Cao et al. 2022).

In contrast, visual sensors mounted on UAVs offer a promising alternative, as they provide high-resolution environmental information with strong anti-interference capabilities at a low cost. These advantages make vision-based perception an effective solution for UAV landings in GNSS-denied environments (Arafat et al. 2023). Consequently, vision-based environmental perception has emerged as a critical approach for enhancing UAV-UGV collaboration, particularly in challenging environments where traditional navigation systems are limited.

However, in dynamic and complex scenarios, visual environmental perception during the landing process of UAVs on moving platforms still faces the following challenges:

(1) How to ensure the reliability of visual perception under complex and changing weather or lighting conditions

The quality of visual information is fundamental to environmental perception and directly determines its usability (Li and Wu 2024). Currently, UAVs often rely on a single visual source, such as a visible light camera, for environmental perception. However, in low-light conditions, the perception capability of UAVs equipped with a single visual sensor is limited, as presented in Fig. 2 (a). In contrast, infrared sensors can provide clear images in low-light environments, such as at night or during foggy conditions, as presented in Fig. 2 (b). By fusing infrared and visible light images, the advantages of both can be significantly combined to enhance environmental perception, as presented in Fig. 2 (c).

(2) How to accurately detect a moving vehicular platform in aerial images with varying object scales and similar object interference

The scale of objects in UAV images varies widely, with the size of the vehicular platform in the image changing according to the UAV’s altitude and shooting angle (Du et al. 2023). Additionally, the background may contain objects with appearances similar to the vehicular platform, further complicating detection, as presented in Fig. 3. Therefore, detecting a vehicular platform from the UAV perspective poses unique challenges. Additionally, the vehicular platform often occupies a small proportion of pixels in UAV images and appears blurred, leading to the problem of small object detection (Yuan et al. 2023).

[See PDF for image]

Fig. 3

Vehicular platforms are difficult to be detected and identified effectively in aerial visual images when the UAV is landing at a high altitude

[See PDF for image]

Fig. 4

Pose and position estimation of vehicle-mounted UAV with respect to the vehicular platform in GNSS-Denied scenarios

(3) How to use visual algorithms for pose and position estimation of Vehicle-Mounted UAV with respect to the vehicular platform in GNSS-denied scenarios

During the landing process of a UAV, GNSS signals can provide precise positioning. However, GNSS navigation may fail in remote areas or environments with poor signal quality. In such cases, visual algorithms can be utilized to estimate the pose and position parameters of a vehicle-mounted UAV with respect to the vehicular platform, as presented in Fig. 4. Unlike landing on a fixed point, landing on a moving vehicular platform requires more precise and faster visual pose and position estimation algorithms to handle the dynamic changes of the vehicle (Jian et al. 2024). This ensures that the UAV can land stably on the vehicular platform.

This study aims to address these technical challenges by proposing a vision-based autonomous landing algorithm for UAVs. Specifically, the main contributions of this paper can be concluded as follows. It should be emphasized that the novelty of this work does not lie in inventing fundamentally new algorithms, but in the system-level integration and optimization of existing techniques for the specific challenge of UAV autonomous landing under GNSS-denied environments. By combining infrared-visible fusion, enhanced small-object detection, and an efficient pose estimation solver into a unified framework, this study demonstrates how tailored integration and rigorous field validation can yield a practical and robust solution for complex engineering problems.

For collaborative operations between UAVs and UGVs in GNSS-denied scenarios, a comprehensive vision-based environmental perception algorithm is proposed for the autonomous landing of UAVs on moving vehicular platforms. This algorithm enables precise estimation of the landing pose and position parameters of the UAV, relying solely on visual sensors.
To address the susceptibility of visual perception to weather and lighting conditions, airborne infrared and visible image fusion is employed to enhance image quality and improve the usability of visual perception during the autonomous landing of UAVs.
Field tests with a quadrotor UAV and a vehicle in GNSS-denied urban and hilly areas validate the proposed landing perception scheme, achieving accurate pose estimation and reliable vehicle detection through infrared-visible image fusion, even in low-light conditions.

The remainder of this paper is organized as follows: In Section 2, related works are presented. In Section 3, the key techniques and methods of the proposed autonomous landing algorithm for UAVs are described in detail. In Section 4, the experimental setup and results analysis are presented. Finally, Section 5 concludes the paper and discusses potential future work.

Related works

Infrared and visible image fusion based airborne visual information enhancement

Infrared and Visible Image Fusion (IVIF) is a viable approach to enhancing the visual perception quality of onboard UAV systems by providing complementary and enhanced multimodal image information (Lei et al. 2023). IVIF methods can generally be categorized into traditional methods and deep learning-based methods. Traditional methods include multiscale transformations (Zhao et al. 2023), sparse representations (Karim et al. 2023), and variational models (Nie et al. 2021). Although these methods perform well in many applications, they typically rely on prior knowledge, leading to difficulties in obtaining accurate results in unknown scenarios. To overcome this limitation, deep learning-based IVIF methods have rapidly developed, significantly improving the quality of fused images through neural networks. Currently, these methods mainly include approaches based on autoencoders (AE) (Li et al. 2023), convolutional neural networks (CNN) (Xu et al. 2022), and generative adversarial networks (GAN) (Liu et al. 2022). In UAV applications, Yang et al. (2022) proposed a object detection algorithm that combines RGB and infrared images. By effectively utilizing the complementary advantages of the two spectral types through deep learning, this approach significantly enhances object detection performance in complex environments. However, despite significant progress in various aspects, some common issues remain, including how to effectively fuse multimodal image information and reduce information loss. In UAV visual perception tasks, the fused images need to provide high-quality visual information to accommodate dynamic changes during UAV flight. To address these challenges, this paper improves image fusion quality based on existing algorithms by introducing multilayer deep feature extraction, resulting in significant enhancements in detail presentation and information richness of the fused images. Furthermore, efficient network structures and optimization strategies are employed, which help to enhance the performance and robustness of the fusion process for UAV applications.

Airborne image based vehicular landing platform detection

Accurately detecting the vehicular landing platform from UAV visual images in complex environments is essential for achieving precise UAV landings. Object detection technology has evolved from traditional methods to deep learning-based approaches. Traditional methods rely on manual feature extraction, which can be cumbersome, computationally redundant, and lack accuracy. The rapid development of deep learning has established deep neural networks as the mainstream framework for UAV object detection (Tang et al. 2023). Currently, deep object detection algorithms are mainly divided into two categories: two-stage and one-stage algorithms. Two-stage algorithms, such as the R-CNN (Xie et al. 2021) series, generate regions of interest (ROIs) and then perform feature extraction and classification regression on these regions. In contrast, one-stage algorithms, like the YOLO (Tang et al. 2024) series, perform regression detection directly on the entire image without generating candidate regions, thus improving detection efficiency. The YOLO series has progressively optimized detection accuracy, particularly excelling in detecting small objects against complex backgrounds, thanks to its speed advantages. In recent years, YOLO series algorithms have undergone several important improvements tailored to different application scenarios, such as optimizing anchor box settings through K-means clustering (Sun et al. 2022) and enhancing low-level feature transfer with CFANet (Zhang et al. 2023), which have improved small object detection. However, in UAV visual applications, the variability in flight altitude and the complexity of ground backgrounds, along with interference from similar objects, still make it challenging for existing detection algorithms to reliably and accurately recognize objects. To address this, this paper improves upon YOLOv5 by incorporating the Convolutional Block Attention Module (CBAM) (Woo et al. 2018) to enhance the model’s focus on important features of the vehicular landing platform. Additionally, a dedicated detection head for small objects is introduced to further improve the detection capabilities for objects of various scales, particularly small ones. Furthermore, this paper integrates Transformer (Dosovitskiy 2020) structures to leverage their advantages in feature modeling and global context information extraction, thereby further enhancing detection performance.

Vision based pose and position estimation

Pose and position estimation is a crucial component of UAV environmental perception, particularly in the absence of GNSS signals. The pose and position estimation problem can be defined as follows: given a set of feature points with known spatial coordinates in the world coordinate system and their corresponding image coordinates, determine the rotation and translation relationship between the world coordinate system and the camera coordinate system (Hruby et al. 2024). The camera’s intrinsic parameters are typically obtained through calibration, transforming the pose and position estimation problem into an absolute orientation problem: using coordinate information of the same set of points in two coordinate systems to determine the rotation and translation relationship between these systems. Existing pose estimation algorithms are mainly classified into linear and nonlinear methods. Early linear algorithms, while computationally fast, have a complexity of O(n) and larger result errors. These include the Direct Linear Transform (DLT) (Oh and Kim 2023), Efficient Perspective-n-Point (EPnP) (Kim et al. 2023), Robust Perspective-n-Point (RPnP) (Mcleod et al. 2024), and Optimization Perspective-n-Point (OPnP) (Vakhitov et al. 2021). In contrast, nonlinear algorithms generally provide higher accuracy but require suitable initial values, involve more computation, and may face convergence issues during iteration. Given the real-time computation requirements for UAV flight and landing, computation time is also a critical factor in addition to accuracy. The Orthogonal Iteration (OI) algorithm (Lu et al. 2000), as a classic nonlinear method, optimizes parameters by minimizing spatial reprojection errors and typically converges within five to ten iterations. Compared to other iterative algorithms, it offers a balance of high computational accuracy and relatively fast speed.

[See PDF for image]

Fig. 5

The general framework of machine vision based perception for vehicle-mounted UAV autonomous landing

Methodology

This section provides a detailed description of machine vision based perception for vehicle-mounted UAV autonomous landing in GNSS-denied scenarios. The general framework of the proposed algorithm is presented as Fig. 5.

Airborne infrared and visible image fusion for UAV autonomous landing on moving platforms

As presented in Fig. 5, the infrared and visible images obtained from the UAV-mounted visual sensors are denoted as $I_{1}$ and $I_{2}$ , respectively. Therefore, optimization based method is utilized to decompose the source images, resulting in the base components $I_{1}^{b}$ and $I_{2}^{b}$ , and the detail components $I_{1}^{d}$ and $I_{2}^{d}$ for each source image. The base component is obtained by solving the following optimization problem:

\begin{matrix} I_{k}^{b} = arg min_{I_{k}^{b}} {∥I_{k} - I_{k}^{b}∥}_{F}^{2} + λ ({∥g_{x} \times I_{k}^{b}∥}_{F}^{2} + {∥g_{y} \times I_{k}^{b}∥}_{F}^{2}) \end{matrix}

where

g_{x} = [- 1 1]

and

g_{y} = {[- 1 1]}^{T}

represent the horizontal and vertical gradient operators, respectively.

λ

is the regularization parameter that controls the strength of the low-pass filtering of the image, and

k \in \{1, 2\}

After obtaining the base components $I_{1}^{b}$ and $I_{2}^{b}$ , the detail components are obtained using (2):

\begin{matrix} I_{k}^{d} = I_{k} - I_{k}^{b} \end{matrix}

where

k \in \{1, 2\}

The base components are fused using a weighted average strategy, while the detail components are reconstructed using a deep learning framework. Finally, the fused image F is reconstructed by adding the fused base component $F_{b}$ and the detail component $F_{d}$ .

Integration of basis components

The base part extracted from the source image contains common features and redundant information. In this paper, the weighted average strategy is adopted to fuse these base parts. The fused basis parts are calculated by (3).

\begin{matrix} F_{b} (x, y) = α_{1} I_{1}^{b} (x, y) + α_{2} I_{2}^{b} (x, y) \end{matrix}

where (x, y) denotes the corresponding position of the image intensity in

I_{1}^{b}

I_{2}^{b}

and

F_{b}

, and

α_{1}

α_{2}

denote the weight values of the pixels in

I_{1}^{b}

I_{2}^{b}

, respectively.

Integration of detail parts

For the detail parts $I_{1}^{d}$ , $I_{2}^{d}$ , the depth features are extracted based on VGG19 network, and the process is presented in Fig. 5. After extracting the depth features in VGG19. The weight maps are obtained by multilayer fusion strategy. Finally, the fused detail content is reconstructed from these weight maps and detail content.

The deep feature maps of the detail contents are extracted using the VGG19 network, and for each detail content image $I_{k}^{d}$ , the feature maps $ϕ_{k}^{i, m}$ are extracted through the different layers of the VGG19 network (in this paper, we choose the layers of $r e l u_{1_1}$ , $r e l u_{2_1}$ , $r e l u_{3_1}$ , and $r e l u_{4_1}$ ), where i stands for the index of the layer, and m stands for the channel number. The fusion strategy is presented in Fig. 6.

[See PDF for image]

Fig. 6

Details part of the process of integrating strategies

After obtaining the deep features $ϕ_{k}^{i, m}$ , the activity level map $\hat{C_{k}^{i}}$ is obtained by calculating the L1 norm and applying block mean operations. For each location (x, y) on the feature map $ϕ_{k}^{i, m}$ , the L1 norm is calculated to serve as the preliminary activity level map $C_{k}^{i}$ :

\begin{matrix} C_{k}^{i} (x, y) = | | ϕ_{k}^{i, 1 : M} (x, y) {| |}_{1} \end{matrix}

where

k \in \{1, 2\}

i \in \{1, 2, 3, 4\}

To make the fusion method robust to image misregistration, the activity level map is smoothed using a block mean operation. The final activity level map $\hat{C_{k}^{i}}$ is computed as:

\begin{matrix} \hat{C_{k}^{i}} (x, y) = \frac{1}{{(2 r + 1)}^{2}} \sum_{β = - r}^{r} \sum_{θ = - r}^{r} C_{k}^{i} (x + β, y + θ) \end{matrix}

where r determines the size of the block.

Based on the activity level map, the initial weight map $W_{k}^{i}$ is computed using the soft-max operator as presented in (6):

\begin{matrix} W_{k}^{i} (x, y) = \frac{\hat{C_{k}^{i}} (x, y)}{\sum_{n = 1}^{K} \hat{C_{n}^{i}} (x, y)} \end{matrix}

where K represents the number of activity level maps.

W_{k}^{i} (x, y)

indicates the value of the initial weight map, ranging from [0, 1].

To match the size of the input detail content, the weight map is upsampled. Each initial weight map $W_{k}^{i}$ is expanded to the same size as the detail part using the upsampling operation:

\begin{matrix} \hat{W_{k}^{i}} (x + p, y + q) = W_{k}^{i} (x, y) \end{matrix}

where

p, q \in \{0, 1, 3, \dots, (2^{(i - 1)} - 1)\}

Using the generated weight maps $\hat{W_{k}^{i}}$ , the detail parts of the images $I_{1}^{d}$ and $I_{2}^{d}$ are fused by weighted fusion to obtain the fused detail part $F_{d}$ . The weighted fusion process is as follows:

\begin{matrix} F_{d} (x, y) = \sum_{n = 1}^{K} \hat{W_{k}^{i}} (x, y) \times I_{n}^{d} (x, y), K = 2 . \end{matrix}

Reconstruction

Once the fused detail part $F_{d}$ is obtained, the final fused image F is reconstructed using the fused base part $F_{b}$ and the fused detail part $F_{d}$ , as presented in (9):

\begin{matrix} F (x, y) = F_{b} (x, y) + F_{d} (x, y) \end{matrix}

Airborne image based vehicular landing platform detection

In the early stages of landing, when the flight altitude is relatively high, the vehicular platform occupies only a small area in the image, making it hard to be effectively detected. To address this issue, three improvements are made to the YOLOv5 model: introducing a Transformer encoder block, integrating the CBAM, and adding a small object prediction head.

The transformer encoder block

Airborne small object detection is often affected by complex backgrounds and high-density occlusion. Traditional convolutional neural networks (CNNs), due to their limited local receptive fields, struggle to comprehensively capture the global information of small objects. This limitation is particularly pronounced in small object detection tasks, especially when the vehicle mounted UAV flying at high altitudes where the vehicular platform occupies a very small proportion of the image. To enhance the model’s small object detection capabilities in such complex scenarios, we drew inspiration from Vision Transformers and introduced Transformer encoder blocks into the YOLOv5 architecture, replacing some of the convolutional and CSP Bottleneck blocks. The Transformer encoder blocks capture richer global contextual information through a multi-head attention mechanism, significantly improving the recognition ability of the vehicular platform, particularly in high-density occlusion scenarios. Each Transformer encoder comprises two sub-layers: the first is a multi-head attention layer, and the second is a multilayer perceptron (MLP). Layer normalization and dropout layers help the network converge better and prevent overfitting. The multi-head attention mechanism allows the model to focus not only on the current pixel but also on acquiring more contextual semantic information. Residual connections between sub-layers facilitate this global information capture, which is crucial for detecting landing areas on vehicular platforms. It helps the model accurately identify and locate on vehicular platforms amidst complex terrain and backgrounds, ensuring precise and safe UAV landings in challenging environments.

Integration of the CBAM

The object area in airborne images often contains a large amount of complex geographical elements and background noise, which greatly increases the difficulty of detecting vehicular platform. This is particularly evident at high altitudes, where small objects are easily obscured by these complex backgrounds. To effectively solve this problem and improve the performance of small object detection. The CBAM is integrated into the YOLOv5 architecture. Given a feature map, CBAM infers attention maps sequentially along the channel and spatial dimensions, and then multiplies these attention maps with the input feature map to perform adaptive feature optimization, enhancing the model’s focus on small object areas. As presented in Fig. 7, the structure of the CBAM module can effectively suppress interfering information in the background and highlight the salient features of small objects, enabling the model to accurately recognize and detect small objects even in complex backgrounds. The integration of CBAM into the YOLOv5 architecture enhances the model’s ability to focus on small objects in complex airborne imagery. By refining feature representation, CBAM mitigates the interference of background noise and geographical elements, improving small object detection, particularly in cluttered, high-altitude environments. This enhancement is essential for reliable detection of vehicular platforms in challenging conditions.

[See PDF for image]

Fig. 7

CBAM module utilized for suppressing background noise and enhancing small object features

Adding the small object prediction head

Deep image features are rich in semantic information and aid in object recognition, while shallow features are more focused on small object information. To capitalize on these benefits, a dedicated small object prediction head (Prediction head I) is added to the YOLOv5 base model. This prediction head is located at the lower levels of the network, extracting detailed information from high-resolution feature maps and forming a four-head structure with the other three prediction heads. The Transformer encoder block is also applied to the prediction head, enhancing the four-head structure’s detection capabilities for various objects by leveraging its global information processing capability. This multi-head structure enables the model to better handle detection challenges arising from variations in object scale, particularly improving the detection accuracy of small-sized objects such as distant vehicular platforms.

[See PDF for image]

Fig. 8

Coordinate system construction for vision based pose and position estimation

Pose and position estimation for UAV autonomous landing

Accurately estimating the pose and position of the UAV with respect to the moving vehicular platform during landing is crucial. To achieve this, this paper utilizes the OI algorithm for pose and position estimation.

Coordinate systems construction

As presented in Fig. 8, four coordinate systems are constructed for vision based pose and position estimation.

World Coordinate System $O_{w} - X_{w} Y_{w} Z_{w}$ : The origin is at a reference point on the vehicular platform, with the $X_{w}$ axis extending forward along the platform, the $Z_{w}$ axis vertically downward, and the $Y_{w}$ axis determined according to the right-hand rule.
Camera Coordinate System $O_{c} - X_{c} Y_{c} Z_{c}$ : The origin $O_{c}$ is placed at the optical center of the camera, with the $X_{c}$ and $Y_{c}$ axes parallel to the image coordinate system’s x and y axes, respectively. The optical axis of the camera is the $Z_{c}$ axis, extending from the optical center to the origin of the image coordinate system.
Pixel Coordinate System $o - u v$ : The origin of the pixel coordinate system is at the top left corner of the image plane, with the u axis pointing to the right horizontally and the v axis pointing down vertically. The coordinates used in image processing refer to those in the pixel coordinate system.
Image Coordinate System $I - x y$ : The origin of the image coordinate system is at the point where the camera’s optical axis intersects the image plane, with the x and y axes parallel to the u and v axes of the pixel coordinate system, respectively.

Pose and position estimation algorithm

The pose and position estimation between the UAV and the vehicular platform is transformed into an absolute orientation problem. The specific steps are as follows. First, predefined feature points are extracted from the surface of the vehicular platform, and their corresponding positions in the camera images are obtained. Then, using the world coordinate system coordinates of these feature points and their image coordinates in the camera coordinate system, a PnP problem is constructed. An initial pose and position estimation is obtained using the DLT algorithm. Finally, the OI algorithm is applied to refine the initial PnP estimate, achieving precise pose and position estimation.

Implementation details of the algorithm

The OI algorithm uses the object-space residual norm as its objective function, where $v_{i} = {[\begin{matrix} u_{i} & v_{i} & 1 \end{matrix}]}^{T}$ is the projection point of $P_{i}$ on the normalized image plane. The object-space residual is the distance between the spatial point and the re-projected ray. It is described as follows:

\begin{matrix} e_{os} (i) = & (I - V_{i}) (R P_{i} + t) \end{matrix}

\begin{matrix} V_{i} = & ({v_{i} v_{i}}^{T}) {(v_{i}^{T} v_{i})}^{- 1} \end{matrix}

where

v_{i}

and

V_{i}

are the projection point and re-projection vector corresponding to the feature point, respectively,

e_{os}

is the object-space residual,

P_{i}

is the set of corresponding points in the world coordinate system, R is the rotation matrix from the world coordinate system to the camera coordinate system, and t is the translation vector from the world coordinate system to the camera coordinate system (Fig. 9).

[See PDF for image]

Fig. 9

Physical meaning of object-space residual

The norm of the objective function E(R, t) is calculated using the formula:

\begin{matrix} E (R, t) = \sum_{i = 1}^{n} | | (V_{i} - I) (R P_{i} + t) {| |}^{2} \end{matrix}

where E(R, t) is the norm of the object-space residual. Given the rotation matrix R, the objective function E(R, t) can be considered a quadratic function of t. The closed-form solution of (12) is given by (13):

\begin{matrix} t = \frac{1}{n} {(I - \frac{1}{n} \sum_{i = 1}^{n} V_{i})}^{- 1} \sum_{i = 1}^{n} (V_{i} - I) R P_{i} \end{matrix}

where n denotes the number of feature points.

[See PDF for image]

Fig. 10

Four pairs of visible and infrared scenarios selected in DroneVehicle

Next, the rotation matrix R is solved based on singular value decomposition. The covariance matrix $\sum_{PQ}$ is given by (14):

\begin{matrix} \sum_{PQ} = \frac{1}{n} \sum_{i = 1}^{n} (Q_{i} - μ_{Q}) {(P_{i} - μ_{P})}^{T} \end{matrix}

where

Q_{i}

is the set of corresponding points in the camera coordinate system, and

μ_{P}

and

μ_{Q}

are the mean vectors of

P_{i}

and

Q_{i}

, respectively.

If the singular value decomposition of the covariance matrix yields $U D V^{T}$ , then the optimal solution for the rotation matrix R can be uniquely determined as:

\begin{matrix} R = & U S V^{T} \end{matrix}

\begin{matrix} S = & \{\begin{matrix} I & d e t (\sum_{xy} \geq 0) \\ d i a g (1, 1, . . ., - 1) & d e t (\sum_{xy} < 0) \end{matrix}) \end{matrix}

where U and V are results from the singular value decomposition of the covariance matrix, and S is a constant matrix.

The rotation matrix R and translation vector t describe the relative pose and position between the UAV and the vehicular platform. Equations (17) and (18) provide the calculations for the position P and the UAV’s position $O^{w}$ with respect to the vehicular platform:

\begin{matrix} P_{i} = & R^{- 1} Q_{i} - R^{- 1} t \end{matrix}

\begin{matrix} O^{w} = & - R^{- 1} t \end{matrix}

Experimental results and analysis

Experiment setup

All experiments are conducted in a unified hardware environment with specific configurations, including an Intel Core i5-10400F processor (2.90GHz) and an NVIDIA GeForce RTX 3090 graphics card. The dataset used in this paper includes Visdrone 2019 and DroneVehicle. The scenarios selected in the DroneVehicle dataset are shown in Fig. 10. The specific parameter configurations for each experiment are provided in Table 1.

Table 1. Experimental parameter configuration

Parameter	Role of the Parameter	Configuration Value (or Range)
$λ$	Controls model complexity to prevent overfitting	(0, 6.4]
npad	Adjusts image smoothness and detail preservation	(0, 64]
Learning rate	Determines the rate of parameter updates, affecting convergence speed and stability	lr0 : 0.0032, lrf : 0.12, momentum : 0.843, $w e i g h t_d e c a y : 0.00036$ , $w a r m u p_e p o c h s : 2.0$
Batch size	Affects training efficiency and memory usage; smaller batch sizes enhance detail capture, while larger sizes increase training speed	4
Epochs	Sets the number of training iterations, affecting model accuracy and overfitting	80
$σ$	Simulates noise in feature points to assess model robustness	0.0001
Pose and Position Estimation Horizontal Distance	Determines the horizontal distance between the UAV and the landing platform, affecting estimation accuracy	(0,120]
Number of Feature Points	Number of feature points used in pose estimation; a larger number improves accuracy but increases computational complexity	5

[See PDF for image]

Fig. 11

Image fusion results in foggy, dark, daytime

Algorithm performance evaluation

Image fusion

To evaluate the experimental results of image fusion, this paper adopts a combination of subjective and objective evaluations to assess the quality of the fusion results. Subjective evaluation relies purely on human visual assessment of the fused results, while objective evaluation is conducted using a series of image quality metrics.

In Fig. 11, several examples of fused images are presented. Visually, the fused images are clear and natural, with rich details and smooth edges. The combination of infrared and visible light is shown to produce images with realistic colors, clear thermal information, and a well-integrated fusion of visual details. In UAV autonomous landing scenarios, the fused images are found to significantly enhance object recognition and scene understanding, enabling UAVs to capture critical details more effectively.

For quantitative analysis of the fused images, four evaluation metrics are selected: $Q_{MI}$ , $Q_{M}$ , $Q_{S}$ , and $Q_{CB}$ (Li et al. 2013). $Q_{MI}$ measures the mutual information between the fused image and the input images, indicating the amount of original image information retained in the fused image. Higher mutual information reflects that the fused image preserves more features and details of the original images. $Q_{M}$ evaluates the brightness and contrast characteristics by considering the mean and variance of the fused image, where higher values generally indicate better brightness and contrast processing, resulting in enhanced visual effects. $Q_{S}$ measures the retention of structural information from the source images, with higher values indicating greater structural similarity. $Q_{CB}$ evaluates the correlation between the input images and the fused image, assessing how well the fused image retains the characteristic information of the original images. A higher correlation coefficient signifies better similarity between the fused image and the original images. When $λ$ and npad are varied, the objective evaluation metrics of Fig. 11 are displayed in Figs. 12 and 13.

[See PDF for image]

Fig. 12

Objective evaluation indicators when $λ$ changes

As presented in Fig. 12, the horizontal axis $λ$ represents the regularization parameter that controls the strength of the image’s low-pass filtering. In the context of Tikhonov regularization, it determines the trade-off between the fidelity to the original image and the smoothness of the result. A larger $λ$ value leads to stronger smoothing effects, causing more suppression of high-frequency components (details). A moderate increase in $λ$ generally enhances structural information, as seen in Figs. 12 (g) and (k), which improve $Q_{S}$ and partially improve $Q_{CB}$ . However, an excessive increase in $λ$ may result in excessive smoothing or loss of information, as shown in Figs. 12 (a) and (d). Although $Q_{S}$ continues to increase, $Q_{MI}$ and $Q_{M}$ may decrease. By comprehensively considering the effects of these four parameters on the fused image, the optimal value is found to be $λ = 3$ .

[See PDF for image]

Fig. 13

Objective evaluation indicators when npad changes

The padding parameter, npad, effectively reduces boundary artifacts, resulting in a smooth transition at the image edges. However, it can increase computational cost. Smaller padding may speed up computation but could introduce noticeable artifacts at the image edges, impacting the quality of the fused image. As shown in Figs. 13 (d), (h), and (j), a moderate increase in npad helps mitigate edge effects, enhancing information retention, contrast, and structural similarity, which can potentially improve $Q_{MI}$ , $Q_{M}$ , $Q_{S}$ , and $Q_{CB}$ . However, excessive npad introduces too much unnecessary padding, adding useless information and reducing contrast and structural similarity, thereby lowering $Q_{MI}$ , $Q_{M}$ , $Q_{S}$ , and $Q_{CB}$ . By comprehensively considering the impact of these four parameters on the fused image and computational cost, an optimal value of $n p a d = 16$ is selected.

To assess competitiveness, the proposed fusion method is evaluated against representative IVIF approaches, including CDDFuse (Zhao et al. 2023), RFNet (Xu et al. 2022), an AE-based method (Li et al. 2023), and a GAN-based method (Liu et al. 2022), under identical settings (input size, preprocessing, and evaluation protocol). The proposed method is configured with $λ = 3$ and $n p a d = 16$ . As shown in Table 2, the proposed approach consistently outperforms all competing methods across four objective metrics. Compared with earlier AE- and GAN-based methods, our approach preserves more mutual information ( $Q_{MI}$ ) and achieves higher structural similarity ( $Q_{S}$ ), demonstrating its ability to retain both global context and fine details. When compared with recent state-of-the-art methods such as CDDFuse and RFNet, the proposed method still provides notable improvements, with $Q_{MI}$ increasing from 4.28 to 4.75 and $Q_{S}$ from 0.66 to 0.71. These results confirm that the integration of multilayer deep feature extraction and optimized fusion strategy enables more effective preservation of complementary information, yielding fused images with richer details, improved contrast, and better structural fidelity, which are crucial for UAV perception under low-light conditions.

Table 2. Quantitative comparison of fusion methods (higher is better)

Method	$Q_{MI}$	$Q_{M}$	$Q_{S}$	$Q_{CB}$
AE-based (Li et al. 2023)	3.12	0.44	0.55	0.48
GAN-based (Liu et al. 2022)	3.68	0.47	0.59	0.52
CDDFuse (Zhao et al. 2023)	4.05	0.53	0.63	0.57
RFNet (Xu et al. 2022)	4.28	0.56	0.66	0.60
Proposed (ours)	4.75	0.62	0.71	0.66

The bold entries indicate the best results among the compared methods

[See PDF for image]

Fig. 14

Object detection algorithm performance evaluation

Object detection

The evaluation metrics selected for object detection include precision, recall, F1 score. The relationships between the metrics and confidence levels are presented in Fig. 14.

[See PDF for image]

Fig. 15

Object detection examples on low-light visible images and infrared-visible light fused images. (flight height: 100m)

As presented in Fig. 14 (a), the model achieves a precision of 1.00 at high confidence levels (confidence $> 0.9$ ), indicating extremely accurate predictions. This is particularly suited for scenarios like autonomous drone landings on vehicle platforms, where minimizing false positives is critical. A false positive detection in this context could cause the UAV to incorrectly assess the landing area, leading to failed or unsafe landings. As presented in Fig. 14 (b), the F1 score peaks at 0.58 when the confidence is around 0.34, suggesting that the model strikes a good balance between precision and recall at this level, making it suitable for applications that require both metrics. As presented in Fig. 14 (c), the car category achieves the highest precision at 0.870, reflecting the best detection performance among categories. Finally, as presented in Fig. 14 (d), at low confidence levels, all categories exhibit high recall rates, reaching 0.77, indicating the model’s strong ability to detect a wide range of objects, including those with uncertainty.

A lower confidence level is suitable for detecting all objects, while a higher confidence level is appropriate for scenarios requiring high precision. In the UAV autonomous landing process on a vehicular platform, the model’s confidence threshold can be dynamically adjusted based on the UAV’s altitude and distance from the platform. During the initial landing phase, when the UAV is further from the platform and requires a broader search area, a lower confidence threshold (confidence $< 0.3$ ) should be used to detect all potential landing sites and minimize the risk of missing viable objects. As the UAV moves closer and enters the final approach phase, the search area narrows, and a higher confidence threshold (confidence $> 0.9$ ) is necessary to improve recognition accuracy and reduce false positives, ensuring a safe and precise landing (Figs. 15 and 16).

By comparing the object detection results of low-light visible images and fused images, it is clear that infrared and visible light fused images have advantages in object detection, significantly improving detection and recognition performance under low-light conditions.

To assess the robustness of each component, the ablation experiments are repeated five times with different random seeds. Table 3 reports the mean and standard deviation of the main evaluation metrics. The results show that each added module consistently improves detection performance. In particular, the full model achieves the highest [email protected] of 0.672 ± 0.004, confirming that the performance gain is stable and not due to random initialization. The baseline YOLOv5s achieves 0.642 in [email protected] and 0.381 in [email protected]:0.95.When the Transformer encoder is introduced, the performance increases to 0.683 and 0.406, indicating that the incorporation of global context modeling is beneficial for capturing the spatial distribution of small-scale targets in aerial views.By integrating the CBAM module, additional improvements are observed, with [email protected] reaching 0.702, as the attention mechanism enhances the focus on salient regions and suppresses irrelevant background noise. With the addition of a dedicated small-object prediction head, the detector further improves to 0.719 in [email protected] and 0.436 in [email protected]:0.95, highlighting the effectiveness of explicitly addressing scale variation. Finally, when all components are combined, the proposed detector achieves the best performance, reaching 0.741 in [email protected] and 0.451 in [email protected]:0.95, with precision and recall also improved to 0.81 and 0.75, respectively. These results confirm that each modification provides complementary benefits and that their joint integration leads to the most significant performance gain.

[See PDF for image]

Fig. 16

Feature point layout schemes

Table 3. Ablation study of the proposed detector

Variant	[email protected]	[email protected]:0.95	Precision	Recall
Baseline (YOLOv5s)	0.642	0.381	0.71	0.66
+ Transformer encoder	0.683	0.406	0.74	0.68
+ CBAM	0.702	0.419	0.76	0.70
+ Small-object head	0.719	0.436	0.79	0.72
All (ours)	0.741	0.451	0.81	0.75

The bold entries indicate the best results among the compared methods

To quantify the effectiveness of the proposed modifications, the detector is evaluated against several representative baselines, including YOLOv5s (baseline), YOLOv7-tiny, YOLOv8n, and CFANet (Zhang et al. 2023), under identical training protocols and data splits. As shown in Table 4, the baseline YOLOv5s achieves a [email protected] of 0.642 and a [email protected]:0.95 of 0.381, which reflects its limitations in handling small targets in aerial images. YOLOv7-tiny and YOLOv8n yield moderate improvements, with YOLOv8n reaching 0.681 in [email protected] and 0.405 in [email protected]:0.95, benefiting from its more efficient backbone design. CFANet further enhances detection accuracy, particularly in recall, achieving 0.702 in [email protected] and 0.422 in [email protected]:0.95, owing to its enhanced low-level feature transfer. In comparison, the proposed detector consistently surpasses all counterparts, attaining 0.741 in [email protected] and 0.451 in [email protected]:0.95, while also yielding the highest precision (0.81) and recall (0.75). These results demonstrate that the integration of the Transformer encoder, CBAM, and the small-object prediction head provides complementary benefits: the Transformer encoder enhances global context modeling, CBAM suppresses irrelevant background noise, and the dedicated prediction head improves the sensitivity to small-scale targets. Overall, the results confirm that the proposed detector delivers substantial performance gains over existing detectors while maintaining computational feasibility for aerial deployment.

Pose and position estimation

For the UAV landing process, as shown in Fig. 17, the pose and position estimation errors are analyzed under three different feature point layout schemes. The layout schemes are as follows: A, C, E, G, and I for the first layout scheme; B, D, F, H, and I for the second layout scheme; and A, B, D, F, and G for the third layout scheme.

Figure 18 presents the pose and position estimation errors for each feature point layout scheme at a horizontal distance of 100 m from the vehicle platform and a landing height of 50 m.

Table 5 provides the average pose and position estimation errors for each feature point layout scheme. It is evident that the third layout scheme yields the smallest average attitude and position errors, with an attitude angle error of 0.5889 $^{\circ}$ and a position error of 0.0083 m.

Table 4. Detection performance on aerial datasets

Method	[email protected]	[email protected]:0.95	Precision	Recall	Params (M)	FLOPs (G)
YOLOv5s (baseline)	0.642	0.381	0.71	0.66	7.2	16.5
YOLOv7-tiny	0.668	0.397	0.74	0.68	6.2	13.2
YOLOv8n	0.681	0.405	0.76	0.70	3.2	8.7
CFANet (Zhang et al. 2023)	0.702	0.422	0.77	0.72	8.8	19.5
Ours (full)	0.741	0.451	0.81	0.75	9.1	21.3

The bold entries indicate the best results among the compared methods

[See PDF for image]

Fig. 17

Pose and position estimation errors under different feature point layout schemes

To further validate the effectiveness of the Orthogonal Iteration (OI) algorithm, it is compared with several widely used PnP algorithms, including DLT (Oh and Kim 2023), EPnP (Kim et al. 2023), RPnP (Mcleod et al. 2024), and OPnP (Vakhitov et al. 2021), under identical experimental conditions. Both estimation accuracy (rotation error and translation error) and computational efficiency (average processing time per frame) are evaluated. As summarized in Table 6, OI achieves the lowest rotation and translation errors while maintaining competitive runtime, converging within only a few iterations. EPnP demonstrates the fastest runtime but at the cost of higher errors. RPnP and OPnP improve robustness compared to DLT, but their computational overhead is larger, making them less suitable for real-time UAV deployment. These results confirm that OI provides a favorable trade-off between accuracy and efficiency, which is critical for real-time autonomous landing applications.

[See PDF for image]

Fig. 18

Physical verification platform

Table 5. Pose and position estimation errors for UAV landing under different feature point layout schemes

Layout Scheme	Mean Pitch Angle Error ( $^{\circ}$ )	Mean Yaw Angle Error ( $^{\circ}$ )	Mean Roll Angle Error ( $^{\circ}$ )	Mean Attitude Angle Error ( $^{\circ}$ )	Mean Offset Error (m)	Mean Distance Error (m)	Mean Height Error (m)	Mean Position Error (m)
First Feature Point Layout	0.0507	1.2676	1.2672	0.8618	0.0167	0.0047	0.0360	0.0192
Second Feature Point Layout	0.2122	4.3823	4.3708	2.9885	0.1070	0.0319	0.1824	0.1071
Third Feature Point Layout	0.0444	0.8610	0.8612	0.5889	0.0084	0.0010	0.0155	0.0083

The bold entries indicate the best results among the compared methods

Table 6. Comparison of PnP algorithms in pose and position estimation

Method	Rotation error ( $^{\circ}$ )	Translation error (m)	Time (ms/frame)
DLT (Oh and Kim 2023)	1.85	0.042	3.1
EPnP (Kim et al. 2023)	1.12	0.027	1.6
RPnP (Mcleod et al. 2024)	0.94	0.019	2.9
OPnP (Vakhitov et al. 2021)	0.88	0.016	3.4
OI (ours)	0.59	0.008	2.3

The bold entries indicate the best results among the compared methods

Runtime performance on embedded platform

Real-time execution is a critical requirement for UAV autonomous landing, since delays in perception directly increase the risk of unstable or failed landings. To evaluate the computational feasibility, the complete perception pipeline is deployed on an Nvidia Jetson Xavier NX, and the runtime of each module is measured over 1,000 test frames. As presented in Table 7, the image fusion stage requires an average of 18.6 ms per frame, the detection stage requires 15.2 ms per frame, and the pose estimation stage requires 8.9 ms per frame. The overall end-to-end latency is therefore 42.7 ms, corresponding to a throughput of approximately 23.4 FPS.

In UAV autonomous navigation, a frame rate above 20 FPS is generally considered sufficient to support stable control and timely landing decisions. The achieved 23.4 FPS thus satisfies the real-time requirement, ensuring that the UAV can continuously update its perception of the landing platform during approach. It is also observed that image fusion contributes the largest portion of runtime (43.5%), which suggests that further optimization (e.g., lightweight backbone replacement or pruning) could increase efficiency without compromising detection accuracy. Overall, the measured runtime confirms that the proposed system can be reliably deployed on embedded platforms for real-time UAV landing tasks.

Table 7. Average runtime of each module on Nvidia Jetson Xavier

Module	Time (ms/frame)	Throughput (FPS)
Image fusion	18.6	53.8
Object detection	15.2	65.8
Pose estimation (OI)	8.9	112.4
End-to-end system	42.7	23.4

The bold entries indicate the best results among the compared methods

[See PDF for image]

Fig. 19

Results of hardware experiments

Field tests

As presented in Fig. 19, the field tests of the algorithm proposed in this paper are conducted on a real quadrotor UAV and a Buick Encore vehicle. The quadrotor UAV is equipped with a FLIR Vue Pro and an X Lite FPV CAM to capture infrared and visible images, respectively. The collected heterogeneous images are input into an Nvidia Jetson Xavier via a USB interface to enable real-time processing of the vision algorithm. In the field tests, the parameters for the vision algorithm are set to the optimal values identified in Section 4.2, and the third layout scheme for pose estimation feature points, which produces the smallest error, is adopted. By extracting the corner points of the vehicle-mounted moving platform, the positions of the feature points in the image can be determined. The field test results are presented in Table 3, with three GNSS-denied testing scenarios selected in urban building areas and hilly terrain, conducted from 7:00 PM to 10:00 PM. The flight altitude varies from 26m to 5.5m. The test results demonstrate that the infrared and visible image fusion algorithm significantly improves imaging quality under low-light conditions, thereby enabling the UAV to effectively detect the ground vehicle and achieve satisfactory pose estimation accuracy.

It should be noted that the current field tests are conducted with a single vehicle type (Buick Encore) and primarily under low-light conditions. While the results confirm the robustness of the proposed framework in these scenarios, several potential limitations remain. First, the generalizability of the detector to vehicles of different shapes, sizes, and colors has not yet been systematically evaluated. Variations in vehicle geometry or paint reflectivity may affect the detection performance, especially at long distances. Second, the image fusion and detection modules have not been extensively tested under adverse weather conditions, such as daytime with harsh shadows, rain, or snow. These conditions can introduce illumination imbalance, reflections, and occlusions that may challenge both the fusion quality and the detector reliability. Future work will therefore extend the testing campaign to include diverse vehicle types and environmental conditions in order to comprehensively validate the robustness and applicability of the system.

Conclusion

In this paper, a novel machine vision based autonomous landing perception scheme is presented for UAVs in GNSS-denied environments. The presented scheme addresses key challenges in landing on moving vehicular platforms by utilizing infrared and visible image fusion, optimized small object detection, and an orthogonal iteration-based pose and position estimation.

To enhance the visual perception capabilities of UAVs under low-light and foggy conditions, a deep learning-based image fusion technique is employed. Image quality is significantly improved, ensuring that the surroundings can be effectively perceived by the UAV during landing. Additionally, feature enhancement and region proposal networks optimized for detecting small moving platforms are integrated, overcoming the limitations posed by varying object scales and complex backgrounds. Finally, an orthogonal iteration algorithm is utilized to improve the accuracy of pose and position estimation between the UAV and the vehicular platform, ensuring reliable landings even in challenging GNSS-denied scenarios.

The contributions of this paper can be summarized as follows:

Comprehensive Vision-based Perception Algorithm: We developed an integrated vision-based perception algorithm for the autonomous landing of UAVs on moving vehicular platforms. This algorithm utilizes only visual sensors to accurately estimate the UAV’s landing pose and position, enabling reliable performance in GNSS-denied scenarios.
Enhanced Visual Perception with Image Fusion: To mitigate the impact of adverse weather and lighting conditions on visual perception, we introduced an airborne infrared and visible image fusion method. This approach significantly improves image quality, enhancing the UAV’s capability to detect and land on the target platform in challenging environments.
Field Tests Validation: Field tests with a quadrotor UAV and a vehicle in GNSS-denied urban and hilly environments validate the proposed landing perception scheme. These tests demonstrate the algorithm’s accurate pose estimation and reliable vehicle detection capabilities through infrared-visible image fusion, even in low-light conditions, confirming the robustness of the approach in challenging scenarios.

It should be clarified that the primary contribution of this study lies in the integration and application-specific optimization of established techniques, rather than the development of entirely new algorithms. The significance of this work is therefore twofold: first, it demonstrates how the careful combination of complementary methods can effectively address the challenges of UAV autonomous landing in GNSS-denied environments; second, it validates the integrated framework through extensive field tests, thereby bridging the gap between theoretical research and real-world deployment. These contributions establish a robust foundation for future applications involving UAV and UGV collaborative operations, particularly in scenarios where precise and autonomous UAV landings on moving vehicular platforms are crucial for seamless integration and coordination between the two systems.

Acknowledgements

The authors gratefully acknowledge the support from the above-mentioned funding agencies and institutions.

Funding

This work was supported by Natural Science Foundation of China (62201229, 62503202, 52225212), the 10th China Association for Science and Technology (CAST) Young Elite Scientist Sponsorship Program (NO.YESS20240254), Natural Science Foundation of Jiangsu Province (BK20220516), Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (22KJB510002), the State Key Laboratory of Autonomous Intelligent Unmanned Systems (ZZKF2025-2-7), the Open Project Program of Shandong Marine Aerospace Equipment Technological Innovation Center, Ludong University (Grant No. MAETIC202208), Beijing Engineering Research Center of Aerial Intelligent Remote Sensing Equipments Fund (AIRSE202411), Vehicle Measurement, Control and Safety Key Laboratory of Sichuan Province (QCCK2025-002), Jiangsu University Research Startup Fund for Senior Talent (21JDG063) and National College Student Innovation Training Program (Project No. 202410299067Z).

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. The data that support the findings of this study are not publicly available due to institutional restrictions and confidentiality agreements. Specifically, the dataset contains location-sensitive and scenario-specific information related to autonomous vehicle operations on unstructured roads, which are subject to internal security policies and third-party licensing constraints. Reasonable requests for access to anonymized portions of the data may be considered by the corresponding author on a case-by-case basis, subject to approval by the affiliated institutions. All experiments reported in this paper can be reproduced using publicly available datasets such as VisDrone2019 and DroneVehicle, together with the algorithmic descriptions provided in the manuscript. Source code and trained models can be shared with interested researchers upon request to the corresponding author, subject to institutional approval.

Declarations

Competing Interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Arafat, MY; Alam, MM; Moh, S. Vision-based navigation techniques for unmanned aerial vehicles: review and challenges. Drones; 2023; 7, 2 89. [DOI: https://dx.doi.org/10.3390/drones7020089]

Cao, S; Lu, X; Shen, S. Gvins: tightly coupled gnss–visual–inertial fusion for smooth and consistent state estimation. IEEE Trans Rob; 2022; 38, 4 pp. 2004-2021. [DOI: https://dx.doi.org/10.1109/TRO.2021.3133730]

Dosovitskiy A (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929

Du B, Huang Y, Chen J, Huang D (2023) Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 13435–13444

Hruby P, Duff T, Pollefeys M (2024) Efficient solution of point-line absolute pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 21316–21325

Jian Z, Li Q, Zheng S, Wang X, Chen X (2024) Lvcp: lidar-vision tightly coupled collaborative real-time relative positioning. arXiv:2407.10782

Karim, S; Tong, G; Li, J; Qadir, A; Farooq, U; Yu, Y. Current advances and future perspectives of image fusion: a comprehensive review. Inf Fusion; 2023; 90, pp. 185-217. [DOI: https://dx.doi.org/10.1016/j.inffus.2022.09.019]

Kim M, Koo J, Kim G (2023) Ep2p-loc: end-to-end 3d point to 2d pixel localization for large-scale visual localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 21527–21537

Lei, J; Li, J; Liu, J; Zhou, S; Zhang, Q; Kasabov, NK. Galfusion: multi-exposure image fusion via a global–local aggregation learning network. IEEE Trans Instrum Meas; 2023; 72, pp. 1-15.

Li, H; Wu, X-J. Crossfuse: a novel cross attention mechanism based infrared and visible image fusion approach. Inf Fusion; 2024; 103, 102147. [DOI: https://dx.doi.org/10.1016/j.inffus.2023.102147]

Li, S; Kang, X; Hu, J. Image fusion with guided filtering. IEEE Trans Image Process; 2013; 22, 7 pp. 2864-2875. [DOI: https://dx.doi.org/10.1109/TIP.2013.2244222]

Li, H; Zhao, J; Li, J; Yu, Z; Lu, G. Feature dynamic alignment and refinement for infrared-visible image fusion: translation robust fusion. Inf Fusion; 2023; 95, pp. 26-41. [DOI: https://dx.doi.org/10.1016/j.inffus.2023.02.011]

Liu, Z; Shang, Y; Li, T; Chen, G; Wang, Y; Hu, Q; Zhu, P. Robust multi-drone multi-target tracking to resolve target occlusion: a benchmark. IEEE Trans Multimedia; 2023; 25, pp. 1462-1476. [DOI: https://dx.doi.org/10.1109/TMM.2023.3234822]

Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W, Luo Z (2022) Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 5802–5811

Lu C-P, Hager GD, Mjolsness E (2000) Fast and globally convergent pose estimation from video images. IEEE Trans Pattern Anal Mach Intell 22(6):610–622

Mcleod S, Chng CK, Ono T, Shimizu Y, Hemmi R, Holden L et al (2024) Robust perspective-n-crater for crater-based camera pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6760–6769

Moortgat-Pick A, Schwahn M, Adamczyk A, Duecker DA, Haddadin S (2024) Autonomous uav mission cycling: a mobile hub approach for precise landings and continuous operations in challenging environments. In: Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), pp 8450–8456

Nie, R; Ma, C; Cao, J; Ding, H; Zhou, D. A total variation with joint norms for infrared and visible image fusion. IEEE Trans Multimedia; 2021; 24, pp. 1460-1472. [DOI: https://dx.doi.org/10.1109/TMM.2021.3065496]

Oh J, Kim H (2023) A camera center estimation based on perspective one point method. IEEE Trans Intell Vehicles

Sun W, Dai L, Zhang X, Chang P, He X (2022) Rsod: real-time small object detection algorithm in uav-based traffic monitoring. Appl Intell, 1–16

Tang, G; Ni, J; Zhao, Y; Gu, Y; Cao, W. A survey of object detection for uavs based on deep learning. Remote Sensing; 2023; 16, 1 149. [DOI: https://dx.doi.org/10.3390/rs16010149]

Tang S, Zhang S, Fang Y (2024) Hic-yolov5: improved yolov5 for small object detection. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp 6614–6619

Thelasingha N, Julius AA, Humann J, Reddinger J-P, Dotterweich J, Childers M (2024) Iterative planning for multi-agent systems: an application in energy-aware uav-ugv cooperative task site assignments. IEEE Trans Autom Sci Eng

Vakhitov A, Ferraz L, Agudo A, Moreno-Noguer F (2021) Uncertainty-aware camera pose estimation from points and lines. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 4659–4668

Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

Xie X, Cheng G, Wang J, Yao X, Han J (2021) Oriented r-cnn for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 3520–3529

Xu H, Ma J, Yuan J, Le Z, Liu W (2022) Rfnet: unsupervised network for mutually reinforcing multi-modal image registration and fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 19679–19688

Yang L, Ma R, Zakhor A (2022) Drone object detection using rgb/ir fusion. arXiv:2201.03786

Yuan X, Cheng G, Yan K, Zeng Q, Han J (2023) Small object detection via coarse-to-fine proposal generation and imitation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 6317–6327

Zhang, C; Zhou, W; Qin, W; Tang, W. A novel uav path planning approach: heuristic crossing search and rescue optimization algorithm. Expert Syst Appl; 2023; 215, [DOI: https://dx.doi.org/10.1016/j.eswa.2022.119243] 119243.

Zhang, Y; Wu, C; Guo, W; Zhang, T; Li, W. Cfanet: efficient detection of uav image based on cross-layer feature aggregation. IEEE Trans Geosci Remote Sens; 2023; 61, pp. 1-11.

Zhao Z, Bai H, Zhang J, Zhang Y, Xu S, Lin Z, Timofte R, Van Gool L (2023) Cddfuse: correlation-driven dual-branch feature decomposition for multi-modality image fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 5906–5916

Word count: 8445

Show less

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Machine vision based perception for vehicle-mounted UAV autonomous landing under GNSS-denied environments

Content area

Abstract

Full text