Abstract
Autonomous localization methods for Unmanned Aerial Vehicles (UAVs) have significant application potential in complex environments. This paper presents a comprehensive survey of UAV localization techniques, focusing on both pure vision-based and sensor-assisted approaches. For pure vision-based localization, the survey emphasizes key technologies for feature descriptor generation, advancements in similarity measurement criteria, and optimized computational strategies, and analyzes the impact of these technologies on computational efficiency and localization accuracy. In the context of sensor-assisted multi-source UAV localization, the applications of filtering-based fusion, optimization-based fusion, and deep learning-based fusion methods are discussed. A detailed analysis demonstrates the advantages of multi-modal data fusion in improving robustness and accuracy. Despite significant progress in localization accuracy and adaptability to complex environments, challenges remain in adapting to low-texture environments, optimizing fusion strategies, and addressing computational resource limitations. Finally, the paper discusses future directions for the research and implementation of UAV autonomous localization methods.
Introduction
The significance of autonomous localization for UAVs in satellite-denied environments lies in ensuring mission reliability. Conventional localization methods relying on satellite navigation, such as GPS, are susceptible to interference or complete signal blockage, rendering them ineffective in determining the UAV's spatial position accurately [1]. This limitation is particularly detrimental for UAVs performing critical missions such as military operations, emergency rescue, or surveying. Autonomous localization technologies, especially sensor-based visual localization methods, leverage environmental visual information captured by onboard cameras to compute position and orientation without reliance on external satellite signals [2]. These methods achieve high-precision localization through feature matching and image fusion, enabling UAVs to navigate effectively in harsh environments, complex terrains, and high-interference zones such as occluded areas [3]. Both pure vision-based localization and multi-sensor fusion techniques provide feasible solutions for autonomous UAV localization. With continuous advancements in technology, the application of UAV localization techniques is expanding, as illustrated in Fig. 1.
[See PDF for image]
Fig. 1
Application of autonomous UAV localization
Pure visual localization relies on image sequences captured by UAVs, where feature points are extracted, matched, and geometrically reconstructed to estimate the UAV's pose. This method does not depend on any external sensors and is entirely based on computer vision and image processing techniques, making it suitable for environments where satellite navigation is denied [4]. Through the extraction and matching of feature descriptors, the selection of precise metric criteria, and geometric optimization, pure visual methods have become a key direction in UAV autonomous localization. Numerous vision-based UAV localization methods have been proposed in the literature, covering areas such as feature extraction, matching, and geometric optimization. Visual simultaneous localization and mapping (SLAM) methods, such as ORB-SLAM [5] and SVO [6], have been widely applied in UAV localization. Cha et al. [7] proposed a structural visual inspection method based on Faster R-CNN. By integrating this visual inspection approach with UAV visual localization, autonomous inspection can be achieved with the support of purely visual SLAM. The advent of deep learning has also given rise to a surge of interest in visual localization methods based on deep features, such as SuperPoint [8] and LOFTR [9].
Although pure visual methods have certain advantages, they are susceptible to factors such as lighting conditions, weather, and motion blur [10]. To improve the accuracy and robustness of localization, multi-sensor fusion-based UAV autonomous localization methods have gained increasing attention in recent years. The integration of LiDAR, Inertial Measurement Units (IMU), infrared sensors, and traditional RGB images within these methods ensures the provision of more reliable localization data for UAVs under various environmental conditions [11]. Multi-source vision fusion technology not only enhances environmental adaptability but also opens up new research directions for UAV autonomous localization. Waqas et al. [12] proposed an autonomous UAV obstacle avoidance and localization framework designed for GPS-denied environments, integrating a YOLOv3-based deep learning obstacle avoidance method with benchmark marker-based localization technology. Experimental results in indoor environments and large parking lots demonstrated superior performance compared to traditional methods, enhancing the UAV's obstacle avoidance and localization capabilities. Liang et al. [13] focused on human–machine interaction torque estimation in rehabilitation training, enhancing tracking accuracy and stability in sudden torque variations by optimizing the dynamic model structure and noise estimation capability. Chen et al. [14] addressed the issue of excessive collision force in autonomous docking of unmanned vehicles, proposing an adaptive impedance control strategy and a locking mechanism based on a Stewart platform to reduce collision forces and achieve compliant control. Ali et al. [15] proposed an autonomous UAV system based on Faster R-CNN, capable of target recognition and mapping in GPS-denied environments. The system leverages real-time streaming protocols and multiprocessing techniques to reduce false positives, making it particularly effective for detecting small and blurry targets. The core of UAV autonomous localization lies in leveraging onboard sensors, such as vision, IMU, and GPS, for real-time position and attitude estimation in three-dimensional space, typically relying on state estimation algorithms and path planning algorithms to enable precise navigation and dynamic obstacle avoidance. Reference [16] proposes a UAV localization method based on mechanical antennas (MA), addressing the localization challenges in environments where Global Navigation Satellite System (GNSS) signals are unavailable. This method employs low-frequency magnetic field signals generated by the MA on the UAV and magnetic field sensors at ground-based stations. It utilizes a particle swarm optimization algorithm to achieve stable localization in complex electromagnetic environments.
This paper presents a systematic review of UAV autonomous localization methods, focusing on the following aspects: Firstly, for pure visual-based UAV autonomous localization methods, the generation of feature descriptors, similarity measurement criteria, and optimization strategies are discussed. The advantages and disadvantages of different pure visual UAV localization methods are compared and analyzed. Secondly, for sensor-assisted multi-source UAV autonomous localization methods, an in-depth study is conducted on filtering fusion, optimization fusion, and multi-modal fusion methods incorporating deep learning, with a focus on their localization performance and applicability in complex environments. Through the analysis and comparison of different methods, this paper aims to provide a systematic reference for the research and development of UAV autonomous localization technology.
Pure visual-based UAV autonomous localization methods
Pure visual-based UAV autonomous localization methods primarily rely on image data to achieve autonomous localization and pose estimation of the UAV through steps such as feature extraction, feature matching, and optimization. As shown in Fig. 2, the process typically includes the generation of feature descriptors, the design and application of similarity measurement criteria, and optimization calculations based on geometric constraints. The generation of feature descriptors constitutes the basis of the entire process, as it extracts stable and distinguishable feature points that can adapt to complex scene variations [17]. Similarity measurement criteria evaluate the similarity between different image features to achieve accurate matching, ensuring the reliability of feature points [18]. Finally, optimization calculations based on geometric constraints fuse multi-view information to precisely estimate the spatial position and pose of the target [19]. This chapter systematically reviews the key technologies in this process and, on this basis, conducts a comparative analysis of mainstream pure visual UAV localization methods, providing references for further research.
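As a concrete illustration of this pipeline, the sketch below uses OpenCV to extract ORB features from two successive UAV frames, match them with a Hamming-distance matcher, and recover the relative camera pose from the essential matrix. The intrinsic matrix K and the frame pair are assumed inputs; this is a minimal sketch of the generic pipeline rather than any specific method surveyed here.

```python
import cv2
import numpy as np

def estimate_relative_pose(img1, img2, K):
    """Minimal two-view pipeline: ORB features -> Hamming matching -> essential matrix -> pose."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force matching with the Hamming distance suited to binary ORB descriptors
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Geometric constraint: essential matrix with RANSAC, then rotation/translation recovery
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # translation is only recovered up to scale in the monocular case
```

The unknown monocular scale in such a pipeline is one reason the sensor-assisted fusion methods discussed later remain attractive.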
[See PDF for image]
Fig. 2
Overall flowchart of pure visual localization
Feature descriptor generation
The generation of feature descriptors represents a pivotal stage in the process of visual-based UAV localization. Conventional feature descriptors, including SIFT [20], SURF [21] and ORB [22], are predicated on local image gradients and corner features, thereby ensuring stability across a range of viewpoints and scales. Figure 3a shows images captured by the UAV from different viewpoints, while Fig. 3b illustrates the feature matching results of these images across various viewpoints.
[See PDF for image]
Fig. 3
Feature matching of UAV images across different viewpoints
Reference [20] put forth a methodology for the extraction of invariant features from images that remain stable under conditions such as scaling, rotation, affine distortion, and lighting changes. This approach allows for the matching of features across different views. Shan et al. [23] further developed a framework for UAV navigation assisted by Google Maps in GPS-denied environments, using optical flow [24] and HOG features [25] for localization on georeferenced images, with a particle filter employed for coarse-to-fine search. In the case of major navigation failures, Reference [26] proposed a system for autonomous and safe UAV return. This system utilizes omnidirectional stabilized stereo vision technology to construct a visual map and localize the real-time view onto the map, achieving successful localization at altitudes of 5–25 m and flight speeds up to 55 km/h, without reliance on external infrastructure or inertial sensors.
Furthermore, Yol et al. [27] employed template matching with cross-correlation, using aerial images as templates to match with another georeferenced image. However, traditional feature descriptors exhibit several limitations when confronted with substantial variations in images. This is largely attributable to their susceptibility to alterations in lighting, changes in viewpoint, and the presence of image noise. Collectively, these factors result in diminished matching reliability and precision. Furthermore, traditional feature descriptors typically rely on the extraction of repeatable interest points. However, in complex scenes or areas with low texture, the generation of these points is challenging and unstable, which limits their effectiveness in practical applications, particularly in dynamic environments such as dense urban areas or during long-range flights.
With the development of deep learning, feature extraction methods based on deep neural networks have emerged. Techniques such as SuperPoint [8] and LOFTR [9] are capable of learning more discriminative features. This has resulted in enhanced matching accuracy and robustness against interference. Nassar et al. [24] improved geolocation accuracy by extracting contextual information from the scene, enabling navigation without the need for Global Positioning System (GPS). The framework takes input from downward-facing camera images and corresponding satellite images at specific locations, extracting local features for registration and establishing an initial association between UAV motion and map locations. The localization accuracy is further enhanced through semantic shape matching. In addition, reference [28] applied the U-Net convolutional neural network [29] to beach debris segmentation and multispectral image monitoring of coastal environments. A number of researchers have concentrated on the enhancement of descriptors with a view to improving the accuracy of UAV localization. It is worthy of note that Tian et al. [30] were the first to investigate the potential of second-order similarity (SOS) in descriptor learning. They proposed the second-order similarity regularization (SOSR) method, which is based on the principle of matching point similarity distance. The combination of SOSR with triplet loss has resulted in descriptors that significantly outperform previous models on several local descriptor benchmarks. Consequently, research in the field of deep learning-based feature descriptors is progressing towards enhancing feature expressiveness and adaptability. Furthermore, the application of advanced network architectures is advancing UAV localization technology.
Similarity measurement criteria
Similarity measurement helps UAVs achieve precise pose estimation by evaluating the feature similarity between images from different viewpoints. In traditional similarity metrics, the most commonly used standards are Euclidean distance [31] and Hamming distance [32], applied for matching floating-point descriptors and binary descriptors, respectively. However, these metrics are susceptible to interference under variations in lighting, viewpoint changes, and noise. Fan et al. [33] proposed an image registration algorithm based on composite deformable template matching, which combines image edge and entropy features to cope with environmental changes and sensor discrepancies. The employment of an efficient search strategy facilitates the identification of the optimal match in satellite images. The LIFT method, as introduced by Yi et al. [34], employs a voting mechanism to assess the accuracy of matches. In similarity measurement, LIFT searches for the nearest neighbors of each key point within different directions and sets thresholds based on descriptor distances, followed by voting.
LIFT [34] selects the key point with the most votes as the match result. In comparison to conventional L2 distance and Hamming distance methodologies, this approach offers enhanced robustness and reliability in the determination of matching, thereby providing a novel perspective on the measurement of similarity. To improve localization accuracy and reduce computational burden, Reference [35] proposed SimGNN, a neural network-based approach with learnable embedding functions and attention mechanisms, enabling more efficient performance in graph similarity computations. In contrast to LIFT, which relies on a voting mechanism for matching, SimGNN combines global graph-level embeddings and fine-grained node-level comparisons, effectively reducing the computational burden while offering better generalization ability. Figure 4 illustrates the similarity measurement retrieval results based on a vocabulary tree, with the query image marked in yellow. The SIFT algorithm is predominantly employed in numerous studies for the purpose of extracting feature descriptors. The subsequent implementation of similarity measurement methods facilitates the construction of a vocabulary tree, thereby ensuring efficient retrieval.
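For illustration, the following sketch contrasts the two traditional metrics discussed above, Euclidean distance for floating-point descriptors and Hamming distance for binary ones, combined with a Lowe-style ratio test; it is a generic stand-in for these similarity criteria, not an implementation of LIFT's voting scheme.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8, metric="euclidean"):
    """Nearest-neighbour matching with a ratio test over a chosen distance metric."""
    if metric == "euclidean":            # float descriptors (e.g. SIFT-like)
        dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    else:                                # binary descriptors (e.g. ORB-like), Hamming distance
        dists = (desc_a[:, None, :] != desc_b[None, :, :]).sum(axis=2)

    matches = []
    for i, row in enumerate(dists):
        best, second = np.argsort(row)[:2]      # best and second-best candidates
        if row[best] < ratio * row[second]:     # keep only unambiguous matches
            matches.append((i, int(best)))
    return matches
```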
[See PDF for image]
Fig. 4
Feature matching of UAV images across different viewpoints
In recent years, Ling et al. [36] proposed a Multi-Scale Graph Matching Network (MGMN) framework. This framework computes the similarity between two graph-structured objects in an end-to-end manner by integrating the relationships between each node of one graph and the other graph. This integration has been demonstrated to enhance the accuracy of similarity calculation. Compared to MGMN, the MSM-Transformer framework [37] enhances the modeling capability of global features and contextual relevance through a Dual-Attention Visual Transformer (DaViT). This architectural configuration enables more accurate matching of images captured from disparate viewpoints.
Additionally, the weight-sharing mechanism of MSM-Transformer significantly reduces model complexity, making it suitable for deployment on resource-constrained embedded devices. The implementation of a Symmetric Decoupled Contrastive Learning (DCL) loss function effectively addresses the issue of class imbalance, thereby enhancing the stability and performance of the model in satellite and UAV image matching tasks.
Consequently, the application of deep learning-based metric criteria in UAV pure visual autonomous localization enables adaptive optimization through the learning of an end-to-end matching process, thereby markedly enhancing the precision and resilience of feature point matching. These methods effectively address interference factors such as lighting, viewpoint changes, and noise. Furthermore, these methods enhance computational efficiency and generalization ability by leveraging learnable embedding functions and attention mechanisms introduced by neural networks. The Symmetric Decoupled Contrastive Learning (DCL) loss function [38] effectively alleviates the class imbalance problem, improving the stability and performance of the model in complex environments. This enhances the reliability of UAVs during image registration and similarity calculation, ensuring their autonomous navigation capability under various environmental conditions.
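As an illustration of the contrastive objectives mentioned above, the following PyTorch sketch shows a generic symmetric InfoNCE-style loss over paired UAV and satellite image embeddings; it is a simplified stand-in rather than the exact DCL formulation of [38], and the temperature and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(uav_emb, sat_emb, temperature=0.07):
    """Symmetric contrastive loss: matched UAV/satellite pairs share the same row index."""
    uav = F.normalize(uav_emb, dim=1)
    sat = F.normalize(sat_emb, dim=1)
    logits = uav @ sat.t() / temperature                   # pairwise cosine similarities
    targets = torch.arange(uav.size(0), device=uav.device)
    # Symmetric: UAV-to-satellite and satellite-to-UAV retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```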
Optimization computation
In pure vision-based UAV autonomous localization methods, the application of optimization computation is crucial. Although template matching techniques can accurately establish the correspondence between the UAV’s perspective and the reference map, their main drawback is the high computational cost of similarity measurements, especially when processing large-scale image data. To address this issue, numerical optimization [27] has been employed to efficiently retrieve template positions within the reference map. In reference [39], a numerical optimization technique was employed to maximize the Normalized Information Distance (NID) between the reference map and the UAV image. This process utilized the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, a quasi-Newton optimization method. The process starts with optimizing a Gaussian-blurred image and then uses the optimization results as the initialization for refining the original image optimization process. Research indicates that this method provides results similar to grid search with fewer computations. However, this approach relies on optimizing a Gaussian-blurred image, which can lead to the loss of important features, affecting the precision of the final result.
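The coarse-to-fine L-BFGS idea can be sketched as follows, assuming the UAV view and the reference crop have the same size and that only a 2-D translation is sought; negative cross-correlation is used here as a placeholder cost, whereas the cited work optimizes the NID.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import gaussian_filter, shift

def align_to_map(uav_img, ref_map, x0=(0.0, 0.0)):
    """Coarse-to-fine 2-D translation alignment with L-BFGS-B (similarity cost is a stand-in)."""
    def cost(p, img, ref):
        warped = shift(ref, p, order=1, mode="nearest")         # translate reference by (dy, dx)
        return -np.corrcoef(warped.ravel(), img.ravel())[0, 1]  # maximise similarity

    # Stage 1: optimise on Gaussian-blurred images to smooth the cost surface
    coarse = minimize(cost, x0, args=(gaussian_filter(uav_img, 3), gaussian_filter(ref_map, 3)),
                      method="L-BFGS-B")
    # Stage 2: refine on the original images, initialised from the coarse result
    fine = minimize(cost, coarse.x, args=(uav_img, ref_map), method="L-BFGS-B")
    return fine.x  # estimated (dy, dx) offset of the UAV view within the reference map
```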
The primary objective of aerial visual localization (AVL) is to match the UAV's current view against a visual memory constructed from previously acquired data to achieve accurate positioning [40]. The primary challenges associated with the matching process pertain to the substantial volume of data and the intricacies inherent to matching images derived from disparate sources. These images are typically acquired by UAVs under a multitude of circumstances, including varying lighting conditions and resolutions, which collectively increase the complexity of the matching process [41]. To address this issue, optimization computation techniques are widely employed, particularly in the context of large-scale image data processing, to effectively reduce computational costs and improve matching accuracy.
Figure 5 illustrates a schematic of an AVL localization task combined with optimization computation, highlighting the critical role of optimization in enhancing positioning accuracy and efficiency.
[See PDF for image]
Fig. 5
Schematic diagram of AVL localization task
Goforth et al. [42] proposed an AVL method combining convolutional neural networks (CNNs) and visual odometry (VO), inspired by [43], which learns deep feature representations for direct image alignment. The network consists of two parallel CNNs built on the first three layers of VGG16 [44], with shared weights between the two. The UAV's position is estimated by combining a series of homographies computed by the CNN, a method that is closely related to VO. However, this approach may result in drift over time. To address this issue, the authors [42] introduced pose parameter optimization on a sliding window, although this only helps reduce drift rather than eliminate it. Reference [45] proposed a joint positioning and power optimization (JPPO) framework for UAVs. This framework jointly optimizes the deployment location and power distribution of the UAVs, thereby markedly enhancing the localization accuracy of the entire area of interest (AoI) and mitigating the drift problem inherent to AVL methods over time.
Additionally, JPPO is applicable in both dynamic and static environments and enhances the overall system performance by maximizing the localization accuracy index (ALAI) of the AoI. Fu et al. [46] employed a double-deep Q-network (DDQN) algorithm integrated with geographic positioning information (GPI), which significantly simplifies the computation of channel state information and improves optimization efficiency. This method achieves higher localization accuracy through rapid optimization of UAV deployment positions. Optimization techniques can significantly improve the accuracy and efficiency of localization, reduce the computational burden when processing large-scale image data, and speed up real-time localization responses. Furthermore, optimization can effectively handle complex relationships between images, improve feature matching accuracy, reduce drift, and ensure stable operation of UAVs in dynamic environments.
Comparison of pure visual UAV localization methods
In pure visual UAV localization, feature descriptors are crucial for localization accuracy, as their quality directly affects the precision of feature point matching. Table 1 compares the performance of various feature descriptors based on the HPatches dataset, using mean Average Precision (mAP) as the evaluation metric across three tasks: Patch Verification, Image Matching, and Patch Retrieval.
Table 1. Comparison of feature descriptor performance
| Methods | Patch verification (mAP, %) | Image matching (mAP, %) | Patch retrieval (mAP, %) |
|---|---|---|---|
| SIFT [20] | 63.3 | 24.4 | 42.1 |
| TFeat [47] | 80.6 | 27.9 | 50.7 |
| DeepDesc [48] | 78.5 | 26.5 | 52.7 |
| L2Net [49] | 84.3 | 42.3 | 63.8 |
| HardNet [50] | 87.2 | 50.1 | 69.0 |
| DOAP [51] | 87.6 | 49.5 | 69.7 |
| HyNet [52] | 88.7 | 53.9 | 72.2 |
| SOSNet [30] | 87.7 | 51.4 | 70.3 |
| GeoDesc [53] | 90.6 | 58.1 | 75.3 |
These descriptors include traditional methods like SIFT [20], as well as TFeat [47], DeepDesc [48], L2Net [49], and HardNet [50]; manually designed improvements such as DOAP [51]; and the latest deep learning methods like SOSNet [30], HyNet [52], and GeoDesc [53]. The data in Table 1 is directly taken from the original papers, as presented in their respective publications.
GeoDesc [53] demonstrates superior performance compared to the other methods across all tasks, particularly in scenarios involving complex viewpoint and illumination variations, achieving mAP values of 90.6%, 58.1%, and 75.3% for Patch Verification, Image Matching, and Patch Retrieval, respectively. SOSNet [30] and HyNet [52] also demonstrate robust performance: in the Patch Verification and Image Matching tasks they reach 87.7% and 51.4% (SOSNet) and 88.7% and 53.9% (HyNet), respectively. Both exhibit superior performance compared to traditional approaches and earlier learned descriptors, such as TFeat [47] and L2Net [49].
However, they still do not match GeoDesc’s generalization ability, especially in the Patch Retrieval task, where GeoDesc demonstrates superior performance due to its optimized batch processing strategy and training approach. Nevertheless, GeoDesc requires higher diversity in training data and more computational resources, which may limit its applicability in resource-constrained environments. While traditional methods like SIFT [20] are more efficient, they are less adaptive to changes in viewpoint and illumination, making them unsuitable for the complex scenarios UAV localization demands. Therefore, GeoDesc is ideal for high-precision tasks in complex environments, while SOSNet [30] and HyNet [52] offer a better balance between performance and resource consumption, making them suitable for a wider range of application scenarios.
To further compare the performance of different descriptors, the experiment selected Mean Matching Accuracy (MMA) as the evaluation metric and tested the descriptors in scenarios with illumination changes, viewpoint changes, and overall data. Table 2 presents the MMA values for each method under error thresholds of 2 pixels (@2px), 4 pixels (@4px), and 6 pixels (@6px). The data in the table is directly taken from the original papers, as presented in their respective publications. As shown in Table 2, the evaluation includes traditional methods such as SIFT [20] and RootSIFT [55], local feature learning methods such as SOSNet [30], ALIKE [57], and LF-Net [56], and other deep learning-based descriptors.
Table 2. Performance comparison in different scenarios
| Methods | Overall @2px | Overall @4px | Overall @6px | Illumination @2px | Illumination @4px | Illumination @6px | Viewpoint @2px | Viewpoint @4px | Viewpoint @6px |
|---|---|---|---|---|---|---|---|---|---|
| SIFT [20] | 0.48 | 0.57 | 0.60 | 0.47 | 0.53 | 0.57 | 0.48 | 0.57 | 0.62 |
| ORB [22] | 0.32 | 0.43 | 0.48 | 0.38 | 0.46 | 0.50 | 0.30 | 0.40 | 0.43 |
| DELF [54] | 0.46 | 0.53 | 0.62 | – | – | – | 0.08 | 0.19 | 0.33 |
| RootSIFT [55] | 0.48 | 0.58 | 0.61 | 0.47 | 0.55 | 0.59 | 0.48 | 0.59 | 0.63 |
| SuperPoint [8] | 0.53 | 0.72 | 0.78 | 0.60 | 0.76 | 0.80 | 0.50 | 0.66 | 0.71 |
| LF-Net [56] | 0.62 | 0.76 | 0.79 | 0.61 | 0.78 | 0.81 | 0.51 | 0.67 | 0.73 |
| SOSNet [30] | 0.51 | 0.62 | 0.63 | 0.52 | 0.62 | 0.66 | 0.47 | 0.61 | 0.65 |
| ALIKE [57] | 0.62 | 0.75 | 0.77 | 0.65 | 0.77 | 0.83 | 0.58 | 0.69 | 0.73 |
| D2-Net [58] | 0.25 | 0.54 | 0.72 | 0.28 | 0.56 | 0.77 | 0.20 | 0.48 | 0.67 |
| FeatureBooster [59] | 0.62 | 0.76 | 0.79 | 0.61 | 0.78 | 0.81 | 0.51 | 0.67 | 0.73 |
| HAN [60] | 0.55 | 0.67 | 0.73 | 0.52 | 0.64 | 0.72 | 0.53 | 0.66 | 0.71 |

All values are MMA at the indicated pixel error threshold.
From the results, LF-Net [56] and FeatureBooster [59] perform excellently across the 2px, 4px, and 6px thresholds, both achieving overall MMA values of 0.62, 0.76, and 0.79, respectively. They also exhibit high robustness to illumination and viewpoint changes, reflecting the adaptability of deep learning methods to complex scenes. ALIKE [57] achieves an MMA of 0.65 in the Illumination task at the 2px threshold, outperforming most methods and indicating its particular advantage in handling illumination variation scenarios.
Although SIFT [20] and RootSIFT [55] exhibit dependable matching efficiency and stability, they are less resilient in scenarios with considerable illumination and viewpoint alterations. Notably, at the 2px threshold, the matching accuracy of SIFT and RootSIFT is only 0.48, which is considerably inferior to that of learned descriptors. SOSNet, a deep learning-based descriptor, demonstrates satisfactory performance in the illumination task (matching accuracy of 0.52 at 2px). However, its performance in the viewpoint change task (matching accuracy of 0.47 at 2px) is comparatively weaker than that of LF-Net [56] and ALIKE [57], indicating that its adaptability to complex viewpoint change scenarios requires enhancement. In contrast, LF-Net and FeatureBooster demonstrate superior performance in both illumination and viewpoint change tasks through the implementation of enhanced training strategies and model structures. In particular, FeatureBooster exhibits a combination of robustness and accuracy, rendering it a superior choice for multi-scenario applications. Although D2-Net [58] achieves a matching accuracy of 0.67 in the viewpoint change task at the 6px threshold, its performance is unstable at lower pixel thresholds (2px), resulting in weaker overall adaptability. Therefore, when selecting a descriptor, it is important to consider the specific application scenario. For resource-constrained environments, SOSNet [30] can be chosen, while for high-precision scenarios, ALIKE or FeatureBooster is recommended.
In summary, there are significant performance differences among various feature descriptors in UAV pure visual localization tasks. While traditional descriptors such as SIFT [20] and RootSIFT [55] demonstrate good stability, they exhibit lower matching accuracy under significant illumination and viewpoint variations, which limits their application in complex environments. Conversely, deep learning-based methods, including LF-Net [56], FeatureBooster [59], and ALIKE [57], demonstrate robust performance in handling illumination and viewpoint changes. LF-Net [56], in particular, exhibits superior overall performance across diverse scenarios. While SOSNet [30] demonstrates robust performance in illumination change scenarios, its performance in viewpoint variation tasks is comparatively modest. GeoDesc [53] demonstrates particular efficacy in tasks requiring high precision, outperforming the other descriptors across all three HPatches tasks in Table 1, thereby underscoring its advantages in high-precision tasks. Overall, deep learning methods exhibit considerable promise in handling complex environments. However, in resource-constrained or low-precision scenarios, traditional methods or other lightweight models may retain certain applicability.
Sensor-assisted multi-source UAV autonomous localization methods
In complex environments, a single sensor often fails to meet the localization requirements [61]. The GNSS is prone to external factors such as building occlusion or electromagnetic interference, which can compromise its reliability. Visual odometry (VO) is susceptible to performance degradation in low-light and harsh weather conditions due to its reliance on visible light [62]. Additionally, the inertial navigation system (INS) may result in inaccurate localization over extended periods due to the accumulation of errors over time [63]. These limitations make it difficult for a single sensor to provide accurate and reliable navigation information for autonomous systems. To overcome these challenges, it becomes crucial to fuse multi-source heterogeneous data obtained from various sensors. Data fusion leverages the complementary characteristics of each sensor, which not only significantly enhances the robustness of the navigation system but also improves its adaptability in various environments. The framework of a multi-sensor data fusion localization system, as shown in Fig. 6, can be categorized into three main approaches: filtering-based, nonlinear optimization-based, and learning-based methods.
[See PDF for image]
Fig. 6
Localization methods based on multi-sensor data fusion
Filter fusion methods
The filtering fusion technique, as shown in Fig. 6a, effectively enhances the robustness of UAV localization by combining data from different sensors. The Kalman Filter (KF) [64] is the most classical recursive filter and is widely applied in integrated navigation systems, such as GNSS/INS systems, with coupling methods including loose coupling [65], tight coupling [66], and deep coupling [67]. The loose coupling algorithm processes GNSS and INS data in a relatively independent manner in GNSS/INS data fusion. Reference [68] estimates UAV attitude information by loosely coupling data from dual GPS antennas and the Inertial Navigation System (INS), where the GPS antennas provide position data, and the INS provides attitude and inertial data. This loose coupling method utilizes GPS information to correct the drift of the INS without directly relying on GPS as the dominant sensor, thereby maintaining the system's independence and flexibility [69]. However, since the system relies on the results of GNSS calculations, the overall accuracy can be affected when GNSS signal quality is poor or interfered with. The tight coupling method, on the other hand, directly combines pseudorange and Doppler data, applying GNSS signals directly to the INS solution [65]. Tight coupling eliminates the need to rely on independent positioning results by utilizing GNSS data to adjust position and velocity in real-time [70]. This approach markedly enhances positioning accuracy and stability, particularly in environments with partial GNSS signal obstruction. The deep coupling method is the most complex fusion approach, requiring high coordination between hardware and algorithms. This method tightly couples the GNSS receiver with the INS, improving both accuracy and anti-jamming capabilities, and ensuring high reliability in navigation even in extreme environments [71]. Deep coupling is mainly used in military equipment such as UAVs, missiles, and armored vehicles, ensuring stable positioning and navigation even under interference and jamming conditions [72].
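A minimal sketch of the loose-coupling idea is given below: the INS propagates a position-velocity state and a Kalman update fuses each GNSS position fix. The state layout, noise levels, and constant-velocity model are illustrative assumptions, not a production design.

```python
import numpy as np

class LooseGnssInsFilter:
    """Toy loosely coupled GNSS/INS Kalman filter over position and velocity only."""

    def __init__(self, q=0.1, r=2.0):
        self.x = np.zeros(6)                      # [px, py, pz, vx, vy, vz]
        self.P = np.eye(6)
        self.Q = q * np.eye(6)                    # process noise (models INS drift)
        self.R = r * np.eye(3)                    # GNSS position measurement noise
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])

    def predict(self, accel, dt):
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)                # position integrates velocity
        self.x = F @ self.x
        self.x[3:] += accel * dt                  # velocity integrates measured acceleration
        self.P = F @ self.P @ F.T + self.Q

    def update_gnss(self, gnss_pos):
        y = gnss_pos - self.H @ self.x            # innovation: GNSS fix vs. INS prediction
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x += K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```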
The Extended Kalman Filter (EKF) [73] and Unscented Kalman Filter (UKF) [74] are two commonly used filtering algorithms for state estimation in nonlinear systems. EKF linearizes the state and observation equations of a nonlinear system by expanding them in a Taylor series [75–77]. EKF has been widely applied in vision-assisted Inertial Navigation Systems (INS) [78], and Reference [79] applied an Adaptive High-Gain Extended Kalman Filter (AEKF) to the INS of a quadrotor UAV, utilizing its ability to adjust high-gain parameters to enhance the robustness of state estimation. However, due to the errors introduced by linearization approximations, EKF's accuracy tends to degrade in highly nonlinear and complex scenarios [80]. In contrast, the Unscented Kalman Filter (UKF) estimates the nonlinear system without the need for linearization by employing the unscented transform [81]. UKF selects a set of representative sampling points and propagates the system state using these points, thereby capturing nonlinear features more accurately [82].
Figure 7 illustrates the schematic of UKF combined with multi-sensor fusion, showing how UKF enhances positioning accuracy and system stability by fusing data from multiple sensors in complex dynamic environments. Reference [83] proposed a variant of the UKF based on the unscented transform to address the state estimation challenge for maneuvering UAVs in complex dynamic environments and designed a Linear-Quadratic-Gaussian (LQG) controller to achieve synchronized control of UAVs and robotic arms. Compared to EKF, UKF demonstrates superior accuracy and stability when dealing with more highly nonlinear filtering problems [84]. Moreover, the Complementary Filter (CF) [85], due to its high computational efficiency, has become a common choice for fusing multi-sensor data in embedded systems. In particular, in resource-constrained embedded platforms, CF is extensively employed to integrate gyroscope and accelerometer data to provide a seamless and real-time estimation of attitude, thereby satisfying the general control requirements.
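The CF idea can be condensed into a few lines, as in the single-axis sketch below, where the blending factor alpha (an assumption here) sets the crossover between gyro integration and the accelerometer-derived tilt.

```python
import numpy as np

def complementary_filter(gyro_rate, accel, angle_prev, dt, alpha=0.98):
    """Single-axis complementary filter: gyro trusted at high frequency, accelerometer at low."""
    gyro_angle = angle_prev + gyro_rate * dt          # integrate the angular rate
    accel_angle = np.arctan2(accel[1], accel[2])      # roll angle from the gravity direction
    return alpha * gyro_angle + (1.0 - alpha) * accel_angle
```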
[See PDF for image]
Fig. 7
Filter fusion framework
Table 3 provides a comprehensive analysis and comparison of the advantages of multi-source filter fusion methods for UAV positioning, as reported in recent representative literature [82, 86–95], while also discussing the limitations of each method.

In research on multi-sensor fusion, various methods have demonstrated significant advantages in providing high-precision localization and environmental adaptability, but they also have notable limitations. First, the limitations of filtering methods primarily arise from their sensitivity to the model and their computational complexity. Specifically, Bucci et al. [82] pointed out that while filtering methods perform excellently in high-precision navigation, they are susceptible to error propagation, meaning that accuracy declines as observations decrease. Similarly, Zhu et al. [87] emphasized that filtering methods rely on high-precision measurement models; if the model is flawed or the data quality is low, localization errors accumulate rapidly, leading to a decline in accuracy.

In contrast, fusion methods combining optimization and deep learning improve localization accuracy but also introduce new challenges. Studies by Song et al. [88] and Wang et al. [89] show that optimization fusion methods can provide high localization accuracy with a low computational burden, but their real-time performance is often poor; in complex environments in particular, insufficient optimization of the noise model can degrade performance. Additionally, while the introduction of deep learning improves system robustness and adaptability (as shown by Yusefi et al. [90] and Guo et al. [94]), deep learning models typically require large amounts of high-quality data and training, leading to high computational burdens and sensitivity to sensor quality.
Table 3. Advantages and limitations of multi-source filtering fusion methods for UAV localization
Work | Fusion methods | Advantages | Limitations |
|---|---|---|---|
Bucci et al. [82] | Filter | • High accuracy navigation • Parallel processing • Low computational load | • Risk of error propagation • Accuracy decreases when observations are reduced |
Hartley et al. [86] | Filter | • Fast error convergence • Dynamically robust • Multi-sensor compatibility | • Modeling sensitivity • Computationally complex |
Zhu et al. [87] | Filter | • High accuracy • Fast convergence • Good robustness | • High model dependence • Sensitive to measurement model errors |
Song et al. [88] | Filter + optimization | • High positioning accuracy • Low computational burden • Great environmental adaptability | • Limited real-time ability • Poor optimization of noise models |
Wang et al. [89] | Filter + optimization | • Low resource requirements • Adapts to complex environments • High localization accuracy | • Dependent on visual conditions • Sensitive to sign detection |
Yusefi et al. [90] | Filter + optimization + deep learning | • Multi-sensor fusion • Adapting to GNSS failure • High positioning accuracy | • Dependent on sensor quality • Computationally complex |
Cao et al. [91] | Filter + optimization | • Tightly coupled fusion • Small drift • Environmental adaptability | • Initialization relies on short measurements |
Wei et al. [92] | Filter + optimization | • High navigation accuracy • Robustness • Adapts to changes in sensor property | • Relies on dynamic weight design • Computationally complex |
Liu et al. [93] | Filter + optimization + deep learning | • Accuracy improvement • Self-supervised learning | • Complex processing • High computational overhead |
Guo et al. [94] | Filter + deep learning | • Precise navigation at low cost • Noise reduction • Low computational resource | • Dependent on deep learning • Sensitive to IMU noise |
Aslan et al. [95] | Filter + deep learning | • Highly accurate position estimation • Low cost sensor fusion • Suitable for UAS navigation | • Dependent on deep learning training • Requires high-quality data |
When comparing the limitations of different methods, filtering methods have relatively low computational complexity, making them suitable for resource-limited environments, but their sensitivity to model and measurement errors is an inherent flaw that can lead to a decline in localization accuracy over long durations of operation (as described by Hartley et al. [86]). While optimization methods have made breakthroughs in improving accuracy and environmental adaptability, there is a trade-off in terms of real-time performance and noise handling (Song et al. [88]; Wang et al. [89]). Deep learning fusion methods can achieve high localization accuracy in complex environments, but their reliance on data and high computational burden restrict their use in real-time applications (Liu et al. [93]; Aslan et al. [95]).

In practical cases, the limitations of these methods significantly impact UAV performance. For example, the filtering method of Bucci et al. [82], when faced with adverse weather or multipath effects, may lead to localization drift and loss of accuracy because it cannot effectively handle sensor faults or error propagation. The deep learning method proposed by Guo et al. [94], while improving accuracy, may become unstable in some practical applications due to its sensitivity to IMU noise and high training requirements. Furthermore, the localization optimization method of Song et al. [88] may introduce delays and accuracy loss in dynamic and complex environments because of imperfect noise models.

To overcome these issues, several alternative solutions have been proposed. One effective solution is to alleviate computational complexity through hardware acceleration. Liu et al. [93] proposed a fusion method combining deep learning, optimization, and filtering, which, through self-supervised learning and data fusion techniques, reduces the computational burden and improves localization accuracy to some extent.
On the other hand, model simplification is a common optimization approach. Aslan et al. [95] accelerated computation by reducing sensor data processing or adopting low-complexity models. The combination of these methods can help the system maintain high accuracy while improving real-time processing capability and environmental adaptability.
In high-precision and more complex nonlinear filtering applications, researchers have proposed the Invariant Extended Kalman Filter (InEKF) [96]. Leveraging Lie group theory, InEKF tightly integrates the nonlinear structure of the system state with the covariance propagation process, significantly improving accuracy and consistency in state estimation [86]. InEKF is particularly well-suited for high-dimensional and strongly nonlinear application scenarios, such as mobile robots, UAVs, and navigation systems, enabling more robust state estimation in real-time complex environments [87, 97].
In summary, UAV filtering and fusion-based localization methods enhance robustness and navigation accuracy in unstructured environments by integrating multisource sensor data and employing various algorithms, including classical Kalman filters, Extended Kalman Filters, Unscented Kalman Filters, and Complementary Filters.
Optimization fusion methods
Traditional filtering methods typically rely only on the state information from the previous time step, which can lead to the loss of some information. As shown in Fig. 6b, the multi-source information fusion problem can instead be described using Bayesian graphical networks, where the Maximum A Posteriori (MAP) estimate of the system state is used as the objective for state estimation [98]. When new measurement data are introduced, Bayesian graphical networks effectively eliminate redundant information through marginalization, thereby improving the accuracy of state estimation [99]. This approach represents the system state and sensor measurements as a factor graph structure, with measurement data and state updates mapped to the nodes of the graph. The structure establishes an optimization framework based on posterior estimation, whereby the posterior is estimated from the measurement data and state updates. Within this framework, nonlinear optimization algorithms solve a nonlinear least squares problem over all available measurements to obtain the optimal state estimate for the system.
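The MAP-as-least-squares view can be illustrated with a deliberately tiny example: a 1-D pose chain with a prior, odometry factors, and one absolute measurement, solved with SciPy. This is a conceptual sketch of the factor-graph formulation, not a full SLAM back end, and all measurement values are made up for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_toy_factor_graph():
    """MAP estimation as nonlinear least squares on a tiny 1-D pose chain."""
    odometry = [1.0, 1.1, 0.9]        # relative measurements between poses 0-1, 1-2, 2-3
    absolute = (2, 2.05)              # an absolute fix on pose index 2

    def residuals(x):
        res = [x[0]]                                                  # prior anchoring pose 0 at the origin
        res += [(x[i + 1] - x[i]) - odometry[i] for i in range(3)]    # odometry factors
        res.append(x[absolute[0]] - absolute[1])                      # absolute measurement factor
        return np.array(res)

    result = least_squares(residuals, x0=np.zeros(4))
    return result.x                   # MAP estimate of the four poses

print(solve_toy_factor_graph())
```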
In reference [98], the extensive applications of factor graph algorithms in addressing advanced robotic optimization problems were reviewed. By leveraging an efficient decomposable structure, factor graph algorithms are capable of handling high-dimensional nonlinear problems, thereby significantly enhancing computational efficiency and accuracy. This offers substantial support for the autonomy and real-time performance of robotic systems. Furthermore, the study presents a comprehensive examination of the applications and key benefits of this algorithm in pivotal domains such as state estimation, SLAM, and trajectory planning. The incorporation of process noise into Stochastic Boolean Networks (SBNs) enables a more accurate reflection of real-world conditions. However, the presence of noise increases the complexity of state estimation. To address this challenge, Reference [100] employed recursive algorithms to compute the prior and posterior beliefs of the system state. This approach enabled them to achieve optimal state estimation and to propose novel solutions to state estimation challenges in complex scenarios.
The Incremental Smoothing and Mapping (iSAM) algorithm [101] facilitated efficient incremental updates in high-dimensional and multi-state scenarios. iSAM updates the sparse smoothing information matrix incrementally, only recalculating the changed terms when new measurements are added, thus improving the efficiency of incremental updates. The iSAM2 [102] further enhanced this approach by introducing a Bayesian tree structure and improving algorithmic efficiency through incremental state reordering and relinearization operations. The literature [103] proposed an improved iSAM algorithm that combines flexible relinearization thresholds and error learning models, significantly enhancing the navigation efficiency and accuracy of autonomous underwater vehicles (AUVs). In reference [104], the Multi-Robot Incremental Smoothing and Mapping (MR-iSAM2) algorithm was introduced, which employs an innovative Multi-Root Bayesian Tree (MRBT) data structure to address the SLAM inference problem for multiple robots. Significant advancements have also been made in visual-inertial fusion algorithms. FMC-SVIL [89] integrates optimized stereo vision and inertial measurement data to achieve efficient attitude estimation, thereby enabling real-time processing of information from both cameras and inertial sensors. This ensures accurate positioning results even in GPS-denied environments. The Factor Graph Optimization (FGO) algorithm represents the state estimation problem through a graph optimization model.
As shown in Fig. 8, this approach typically represents the system's factor graph model and measurement variables as a directed graph structure derived from Bayesian networks. In this representation, sensor measurements and state updates are depicted as nodes in the factor graph, with the associated factors represented as edges. This approach utilizes posterior estimation theory to achieve optimal data fusion. The literature [88] proposed a tightly coupled UWB/INS integrated navigation method based on FGO, which reduces computational load by incorporating IMU pre-integration factors and effectively reduces indoor positioning errors. OKVIS [105] and VINS-Fusion [106] employ sliding window and IMU pre-integration techniques to effectively address lighting changes and rapid motion, improving the robustness of purely visual approaches and achieving high-precision short-term navigation and stability in dynamic environments.
[See PDF for image]
Fig. 8
Factor graph optimization model
As shown in Table 4, the optimization method adopts a sliding window strategy to jointly optimize the UAV's state variables within a fixed time horizon, balancing estimation accuracy and computational efficiency. Based on the IMU pre-integration technique, this approach effectively avoids repeated integration of raw IMU data, thereby improving computational efficiency. The core of the method lies in constructing an optimization objective function that incorporates multi-source observations from IMU, vision, and LiDAR. By minimizing the weighted sum of squared errors from all observations—such as IMU pre-integration error, visual reprojection error, and LiDAR point cloud registration error—the UAV's pose and position are estimated with high precision. During the optimization process, these error terms serve as constraints that jointly act on the state variables, enhancing the accuracy and robustness of the sensor fusion results.
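Written out, the sliding-window objective described above takes the following generic form, where X denotes the window's state variables, the r(·) terms are the IMU pre-integration, visual reprojection, and LiDAR registration residuals, and the Σ matrices are the corresponding measurement covariances; the exact residual definitions and weightings vary by system and are assumptions here.

$$
\min_{\mathcal{X}} \; \sum_{k}\left\|r_{\mathrm{IMU}}\!\left(z_{k},\mathcal{X}\right)\right\|_{\Sigma_{\mathrm{IMU}}}^{2}
+\sum_{j}\left\|r_{\mathrm{vis}}\!\left(z_{j},\mathcal{X}\right)\right\|_{\Sigma_{\mathrm{vis}}}^{2}
+\sum_{l}\left\|r_{\mathrm{LiDAR}}\!\left(z_{l},\mathcal{X}\right)\right\|_{\Sigma_{\mathrm{LiDAR}}}^{2}
$$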
Table 4. Optimization techniques applied to UAV localization
Algorithm 1: Optimization algorithms in UAV localization

Input: imu_data, visual_data, lidar_data, dt, Q: process noise covariance, R: sensor noise covariance
Output: state[:3]: the UAV's estimated position [x, y, z] after optimization

// Initialization
sliding_window = []  // sliding window of recent states
initial_state = [0, 0, 0, 0, 0, 0, 0, 0, 0]  // [x, y, z, vx, vy, vz, roll, pitch, yaw]
sliding_window.append(initial_state)

// Main loop for optimization-based fusion
for t in time_steps:
    // (1) Get sensor data at current time
    imu_data = get_imu_data()
    visual_data = get_visual_data()
    lidar_data = get_lidar_data()
    // (2) Propagate prior state using IMU integration (pre-integration)
    imu_preint = integrate_imu(imu_data, dt)  // Δpos, Δvel, Δrot
    predicted_state = propagate_state(sliding_window[-1], imu_preint)
    // (3) Add predicted state to sliding window
    sliding_window.append(predicted_state)
    // (4) Construct optimization problem
    problem = initialize_optimization_problem()
    for i in range(len(sliding_window) - 1):
        // Add IMU pre-integration constraints
        problem.add_imu_factor(sliding_window[i], sliding_window[i + 1], Q)
    if visual_data is available:
        problem.add_visual_factor(visual_data, R)
    if lidar_data is available:
        problem.add_lidar_factor(lidar_data, R)
    // (5) Solve the optimization problem
    optimized_states = solve_optimization(problem)
    // (6) Update state estimates
    optimized_state = optimized_states[-1]  // take the last optimized state
    state = optimized_state[:3]  // position
    velocity = optimized_state[3:6]  // velocity
    orientation = optimized_state[6:9]  // orientation
    // (7) Maintain fixed-size sliding window
    if len(sliding_window) > window_size:
        sliding_window.pop(0)
    // (8) Output estimated position
    print("Estimated Position:", state)
The fusion methods for UAV visual-inertial odometry (VIO) still face limitations in applications such as low-texture environments, absolute positioning, and extreme weather conditions [107]. However, combining GNSS with visual-inertial fusion solutions can effectively address these issues, especially in environments with limited GNSS signal availability. The literature [90] proposed a multi-sensor fusion positioning method based on GNSS, camera, and IMU. It achieves accurate and reliable attitude estimation by employing multi-step correction filters, smoothing data, and applying Generalized Depth Visual-Inertial Odometry (GD-VIO) during GNSS signal interruptions. A loosely coupled GNSS/INS/camera system uses EKF for UAV attitude estimation [108]. Tight-coupling systems, such as GVINS [91] and InGVIO [109], integrate GNSS raw measurements, vision, and inertial data to enable UAV navigation and position correction in satellite signal-deprived environments. Reference [110] introduced a tightly coupled Invariant Extended Kalman Filter (IEKF) based on a Two-Frame Group (TFG), achieving deep fusion of GNSS, INS, and LiDAR. By utilizing a unique group structure to approximate logarithmic linearization, this method effectively resolves the accuracy degradation and divergence issues caused by initial state errors. The tight-coupling architecture fully leverages the high-precision positioning capability of GNSS. It integrates IMU and visual data to supplement information, ensuring accurate and stable localization in complex environments.
Optimization algorithms for UAV autonomous localization improve positioning accuracy by minimizing observation errors. Bundle Adjustment (BA) is one of the most commonly used optimization methods. Reference [111] proposed that to recover the camera's pose and the scene's geometric structure, the consistency correspondence between multiple image pairs must be bundled together to generate a complete trajectory. BA achieves precise alignment for UAV autonomous navigation by adjusting the camera positions and 3D point coordinates across multiple viewpoints. The multi-source optimization fusion methods presented in Table 5 demonstrate various application potentials and technical characteristics for UAV localization. The research by Dellaert et al. [98] is based on optimization methods, with its flexible structure and efficient performance making it suitable for a wide range of application scenarios. However, the method's dependence on factor decomposition and the complexity of the modeling process pose significant limitations in realizing its advantages. Guo et al. [103] further improved the optimization algorithm, proposing a high-precision, robust, and computationally convenient localization solution. However, this method exhibits some sensitivity under complex modeling conditions, and its algorithmic complexity remains high. Compared to single optimization methods, Song et al. [88] combined optimization with filtering, reducing computational burden while maintaining high localization accuracy and exhibiting strong environmental adaptability. However, its real-time capability is somewhat limited, and the optimization ability for noise models still requires further enhancement. Li et al. [100] focused on utilizing deep learning techniques to improve noise immunity and enhance model performance.
Table 5. Advantages and limitations of multi-source optimization fusion methods for UAV localization
Work | Fusion methods | Advantages | Limitations |
|---|---|---|---|
Dellaert et al. [98] | Optimization | • Flexible structure • High efficiency • Suitable for a variety of scenarios | • Dependent on factor decomposition • Modeling complexity |
Guo et al. [103] | Optimization | • High accuracy • Robustness • Easy to calculate | • Sensitive to modeling errors • High complexity |
Song et al. [88] | Optimization + filter | • High positioning accuracy • Low computational burden • Great environmental adaptability | • Limited real-time ability • Poor optimization of noise models |
Van et al. [99] | Optimization + Deep learning | • Robustness • Resistance to uncertainty • High accuracy | • Computationally complex • Dependent on sensor quality |
Li et al. [100] | Optimization + deep learning | • Strong noise immunity • High estimation accuracy • High model performance | • Computationally complex • Relies on measurement model accuracy |
Wang et al. [89] | Optimization + filter | • Low resource requirements • Adapts to complex environments • High localization accuracy | • Dependent on visual conditions • Sensitive to sign detection |
Yusefi et al. [90] | Optimization + filter + deep learning | • Multi-sensor fusion • Adapting to GNSS failure • High positioning accuracy | • Dependent on sensor quality • Computationally complex |
Cao et al. [91] | Optimization + filter | • Tightly coupled fusion • Small drift • Environmental adaptability | • Initialization relies on short measurements |
Wei et al. [92] | Optimization + filter | • High navigation accuracy • Adapts to changes in sensor property | • Relies on dynamic weight design • Computationally complex |
Liu et al. [93] | Optimization + filter + deep learning | • Accuracy Improvement • Self-supervised learning | • Complex processing • High computational overhead |
Almalioglu et al. [112] | Optimization + deep learning | • Self-supervised learning • High accuracy of attitude estimation | • Relies on unlabeled data • Sensitive to training data quality |
In summary, UAV optimization fusion localization methods achieve high-precision and robust positioning in complex environments through deep integration of multi-source sensors and nonlinear optimization algorithms. Particularly in GNSS-denied or extreme conditions, these methods provide reliable navigation and positioning assurance for UAVs.
Deep learning fusion methods
Filters and factor graph optimization methods exhibit limitations in handling complex nonlinear features, dynamic environmental changes, and multi-source data fusion [113]. These methods heavily rely on accurate sensor mathematical models and error propagation processes, resulting in limited robustness and generalization capabilities when confronted with unknown environments and high-noise data [92]. In contrast, deep learning methods leverage large-scale data training to automatically extract features and learn complex patterns in the environment. These methods are more appropriate for managing diverse scenarios, including occlusions, lighting variations, and scene uncertainty, and thus are more suitable for complex UAV localization tasks [114]. Figure 6c and Fig. 9 present schematic diagrams of the deep learning algorithm structure. In this framework, neural networks assume the role of decision-making units, while the environment provides rewards or penalties based on the outcome of actions, thereby modulating the neural network parameters. The literature [115] proposed a solution based on deep learning and RGB-D fusion, enabling UAVs to perceive obstacles' categories, contours, and 3D spatial position while autonomously generating optimal obstacle avoidance paths. The reference [116] introduced a probabilistic framework for multi-target detection and localization, achieving accurate target detection in complex environments through multi-sensor data fusion from micro-UAV swarms. This approach offers good computational scalability and robustness.
[See PDF for image]
Fig. 9
Typical deep learning architecture
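To make the reward-driven structure in Fig. 9 concrete, the following is a minimal sketch in Python/PyTorch of such a decision loop. The policy network size, the placeholder state vector, and the dummy environment returning scalar rewards are assumptions for illustration only; the sketch shows the general pattern (action, reward, parameter update) rather than any specific method from the surveyed literature.

```python
# Illustrative sketch only: a small policy network chooses an action, the
# environment returns a reward or penalty, and the reward gradient adjusts
# the network parameters (REINFORCE-style update).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, state):
        # Categorical distribution over discrete actions
        return torch.distributions.Categorical(logits=self.net(state))

def training_step(policy, optimizer, state, env_step):
    """One decision/reward/update cycle.

    env_step(action) -> reward is a stand-in for the UAV environment,
    e.g. a penalty proportional to localization error after the action.
    """
    dist = policy(state)
    action = dist.sample()
    reward = env_step(action.item())          # environment returns reward/penalty
    loss = -dist.log_prob(action) * reward    # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

if __name__ == "__main__":
    torch.manual_seed(0)
    policy = PolicyNet()
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    # Dummy environment: reward 1.0 for action 0, small penalty otherwise
    dummy_env = lambda a: 1.0 if a == 0 else -0.1
    for _ in range(200):
        training_step(policy, optimizer, torch.randn(8), dummy_env)
```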
Cong et al. [117] proposed a multi-UAV network source localization method combining deep neural networks (DNNs) and spatial spectral fitting (SSF). By fusing Direction of Arrival (DOA) information with deep SSF and CRB weighting, the method improves source localization accuracy and efficiency. However, DNNs are sensitive to noise and outliers in the input data and often lose generalization ability under feature changes or distribution shifts. End-to-end DNN-based methods have nevertheless become a research hotspot in recent years, particularly in applications such as visual depth estimation.
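As a simplified illustration of how bearing measurements from several UAVs can be fused into a single source position, the sketch below (Python/NumPy) solves a weighted least-squares problem over bearing-line constraints, with a generic inverse-variance weighting standing in for the CRB weighting of Cong et al. [117]. The UAV positions, bearings, and noise variances are synthetic assumptions, not data from the cited work.

```python
# Bearing-only source triangulation with inverse-variance weighting.
# Each UAV i at position p_i measures a bearing theta_i to the source; the
# constraint "source lies on the bearing line" is
#   sin(theta_i) * x - cos(theta_i) * y = sin(theta_i) * p_ix - cos(theta_i) * p_iy
import numpy as np

uav_pos = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 60.0], [50.0, 60.0]])
true_src = np.array([30.0, 25.0])

rng = np.random.default_rng(0)
var = np.array([1e-4, 4e-4, 1e-4, 9e-4])          # per-UAV bearing variance (rad^2)
bearings = np.arctan2(true_src[1] - uav_pos[:, 1],
                      true_src[0] - uav_pos[:, 0]) + rng.normal(0.0, np.sqrt(var))

A = np.column_stack([np.sin(bearings), -np.cos(bearings)])
b = np.sum(A * uav_pos, axis=1)
W = np.diag(1.0 / var)                             # more precise UAVs weigh more

# Weighted least-squares solution: (A^T W A) x = A^T W b
src_est = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print("estimated source:", src_est, "true source:", true_src)
```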
The core challenge in solving self-motion estimation from visual data lies in obtaining the depth of feature points, which is then used to recover the camera's pose with the Perspective-n-Point (PnP) method [118]. Eigen et al. [119] proposed a dual-scale DNN for pixel-level depth estimation, pioneering a new approach to the problem. Generative Adversarial Networks (GANs) have also been used to estimate depth from a single image. These techniques underscore the potential of deep learning in image reconstruction and depth inference. Liu et al. [93] improved depth prediction accuracy and pixel-level depth estimation by utilizing a geometric pose estimator, employing alternating learning strategies and a sensitivity-adaptive depth decoder to optimize the results. In addition, some researchers have explored integrating deep learning frameworks into IMU data modeling to enhance the accuracy of GNSS/INS integrated navigation systems [94, 120]. Doostdar et al. [121] used recurrent neural networks to predict IMU errors, improving system robustness and navigation performance in the absence of GNSS signals.
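The depth-then-PnP step described above can be illustrated with a short sketch in Python using OpenCV and NumPy. The camera intrinsics, per-feature depths, and relative motion below are synthetic placeholders (for instance, depths could come from a monocular depth network); the sketch simply shows keypoints being lifted to 3D and the second view's pose being recovered with cv2.solvePnP.

```python
# Lift 2D keypoints to 3D using per-feature depth, then recover camera pose
# from 3D-2D correspondences with the PnP solver.
import cv2
import numpy as np

K = np.array([[600.0, 0.0, 320.0],      # synthetic pinhole intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Keypoints in the reference image (pixels) and their estimated depths (m)
kpts = np.array([[100, 120], [400, 90], [250, 300], [500, 380],
                 [60, 400], [350, 200], [200, 150]], dtype=np.float64)
depths = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.5])

# Back-project to 3D in the reference camera frame: X = depth * K^-1 [u, v, 1]^T
rays = np.hstack([kpts, np.ones((len(kpts), 1))]) @ np.linalg.inv(K).T
pts3d = rays * depths[:, None]

# Synthetic relative motion used only to generate the second view's observations
rvec_true = np.array([[0.02], [-0.05], [0.01]])
t_true = np.array([[0.3], [0.0], [0.1]])
proj, _ = cv2.projectPoints(pts3d, rvec_true, t_true, K, None)
pts2d = proj.reshape(-1, 2)

# Recover the pose of the second view from the 3D-2D correspondences
ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None)
print("recovered rotation (Rodrigues):", rvec.ravel())
print("recovered translation:", tvec.ravel())
```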
In the field of VIO, the high-dimensional feature processing capability of DNNs has brought significant advantages to end-to-end VIO algorithms. Compared with traditional VIO methods, deep learning-based VIO simplifies synchronization and calibration between the camera and IMU. In recent years, many unsupervised deep learning methods have made progress in the VIO domain. Aslan et al. [95] proposed a deep learning-based VIO method that combines CNNs and Bidirectional Long Short-Term Memory networks (BiLSTM) to extract and fuse visual and inertial features, enabling high-precision position prediction for Unmanned Aerial Systems (UAS). VINet [122] and SelfVIO [112] employed self-supervised learning to achieve accurate self-motion estimation and depth reconstruction without IMU intrinsic parameters, significantly enhancing adaptability and accuracy in dynamic environments. Additionally, Wong et al. [123] proposed a multi-sensor fusion method based on deep reinforcement learning and multi-model adaptive estimation, which integrates LiDAR and RGB-D camera data to enhance localization precision in SLAM operations.
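A minimal sketch of the CNN + BiLSTM fusion pattern is given below in Python/PyTorch. The layer sizes, sequence length, and 6-DoF regression head are illustrative assumptions and do not reproduce the HVIOnet configuration of Aslan et al. [95]; the point is only to show a per-frame visual encoder and a per-step inertial encoder feeding a bidirectional recurrent layer that fuses the two streams over time.

```python
# CNN encodes each image, an MLP encodes each IMU sample, and a BiLSTM fuses
# the two feature streams over the sequence before regressing relative pose.
import torch
import torch.nn as nn

class VisualInertialNet(nn.Module):
    def __init__(self, imu_dim=6, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame visual encoder
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.imu_mlp = nn.Sequential(             # per-step inertial encoder
            nn.Linear(imu_dim, 32), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(input_size=32 + 32, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 6)      # translation + rotation per step

    def forward(self, images, imu):
        # images: (B, T, 3, H, W), imu: (B, T, imu_dim)
        b, t = images.shape[:2]
        vis = self.cnn(images.flatten(0, 1)).view(b, t, -1)    # (B, T, 32)
        ine = self.imu_mlp(imu)                                 # (B, T, 32)
        fused, _ = self.bilstm(torch.cat([vis, ine], dim=-1))   # (B, T, 2*hidden)
        return self.head(fused)                                 # (B, T, 6)

if __name__ == "__main__":
    net = VisualInertialNet()
    poses = net(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 5, 6))
    print(poses.shape)  # torch.Size([2, 5, 6])
```

The bidirectional recurrence lets each pose estimate draw on both earlier and later measurements in the window, which is one reason such architectures are less sensitive to camera-IMU timing misalignment than hand-tuned filters.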
Table 6 presents a comprehensive examination of the advantages and limitations of multi-source deep learning fusion methodologies for UAV localization, as documented in the existing literature. The sources referenced in the table include [90, 93–95, 99, 100, 112, 115, 117, 124–128]. The research by Wang et al. [115] demonstrates that deep learning methods have unique advantages in enhancing environmental perception, autonomous obstacle avoidance, and intelligent flight; however, the accuracy of these methods decreases as the target distance increases, and image processing times are relatively long, which significantly impacts real-time performance. Cong et al. [117] further improved the accuracy and computational efficiency of Direction of Arrival (DOA) estimation through deep learning, showcasing the power of nonlinear fitting, but the method is highly dependent on model training and computationally complex, which may be limiting in resource-constrained scenarios. Li et al. [100] further incorporated optimization algorithms, giving the system stronger noise immunity and estimation accuracy and making it suitable for more complex environments, although the gains rely on the accuracy of the measurement model and computational complexity remains a significant challenge. Guo et al. [94] combined deep learning with filtering techniques to provide a low-cost solution for UAV navigation, with noise reduction capabilities and low computational resource requirements; however, this method is highly dependent on the deep learning model and is sensitive to IMU noise. The method proposed by Norbelt et al. [125] offers low latency and high precision but faces high computational complexity, and its adaptability in extreme environments is limited: although it performs well in dynamic environments, its performance may degrade significantly under extreme weather or complex lighting, failing to achieve optimal localization accuracy and stability. Yang et al. [126] provided multi-map fusion and advanced scene understanding, optimizing the selection of landing points; however, this method relies on visual SLAM and is therefore highly sensitive to lighting and environmental changes. In low light or under significant environmental variation, the performance of the SLAM system may be severely impacted, affecting the UAV's navigation and landing accuracy.
Table 6. Advantages and limitations of multi-source deep learning fusion methods in UAV localization
Work | Fusion methods | Advantages | Limitations |
|---|---|---|---|
Wang et al. [115] | Deep learning | • Improved environmental awareness • Intelligent autonomous obstacle-avoidance flight | • Accuracy decreases with distance • Long image processing time |
Cong et al. [117] | Deep learning | • Highly accurate DOA estimation • Improved computational efficiency • Powerful nonlinear fitting abilities | • Training dependent • Computationally complex |
Van et al. [99] | Deep learning + optimization | • Robustness • Resistance to uncertainty • High accuracy | • Computationally complex • Dependent on sensor quality |
Li et al. [100] | Deep learning + optimization | • Strong noise immunity • High estimation accuracy • High model performance | • Computationally complex • Relies on measurement model accuracy |
Yusefi et al. [90] | Deep learning + optimization + filter | • Multi-sensor fusion • Adapts to GNSS failure • High positioning accuracy | • Dependent on sensor quality • Computationally complex |
Liu et al. [93] | Deep learning + optimization + filter | • Accuracy improvement • Self-supervised learning | • Complex processing • High computational overhead |
Guo et al. [94] | Deep learning + filter | • Precise navigation at low cost • Noise reduction • Low computational resource | • Dependent on deep learning • Sensitive to IMU noise |
Aslan et al. [95] | Deep learning + filter | • Highly accurate position estimation • Low cost sensor fusion • Suitable for UAS navigation | • Dependent on deep learning training • Requires high quality data |
Almalioglu et al. [112] | Deep learning + optimization | • Self-supervised learning • No need for IMU calibration • High accuracy of attitude estimation | • Relies on unlabeled data • Sensitive to training data quality |
Steenbeek et al. [124] | Deep learning + SLAM | • No GNSS dependency • Real-time map generation • Emergency response support | • High computational requirements • Sparse maps |
Norbelt et al. [125] | Deep learning + SLAM | • Low latency, high accuracy • Robust in dynamic environments • Real-time object classification | • Computationally complex • Limited adaptability to extreme environments |
Yang et al. [126] | Deep learning + SLAM | • Multi-map fusion • Advanced scene understanding • Optimized landing sites | • Dependent on visual SLAM • Affected by light and environmental changes |
Basiri et al. [127] | Deep learning + ORB-SLAM | • Environmental adaptability • High precision • Enhanced robustness | • Dependent on visual features • Computational complex |
Jing et al. [128] | Deep learning + VINS-Mono | • No GPS dependence • Real-time performance • Vision-inertia fusion | • Affected by light • High requirements for sensor calibration |
In summary, deep learning fusion methods in UAV localization integrate visual data, IMU data, and other sensor information to automatically extract complex features and handle dynamic changes and noisy environments, thereby improving localization accuracy and robustness. The application of DNNs in multi-sensor fusion, VIO, and tasks such as target detection and obstacle avoidance has significantly enhanced system precision.
As shown in Table 7, vision-based localization systems, particularly those employing RGB or depth cameras mounted on UAVs, represent the most widely adopted approach in current UAV localization research. Cameras serve as the primary sensors for environmental perception and feature matching, whether using the D435 depth camera as in the work of Jing et al. [128], or the monocular and front-facing cameras adopted by Norbelt et al. [125] and Waqas et al. [12]. In terms of processing units, Intel Core i7 series processors are frequently employed, indicating a prevailing preference for high-performance computing platforms suitable for deployment on mobile systems. Regarding environmental adaptability, these methods have been validated across indoor, outdoor, and simulated environments, and have been tested under varying lighting conditions and even moderate wind speeds, demonstrating their feasibility and robustness in diverse and challenging scenarios.
Table 7. The hardware and environmental conditions required by certain methods
Methods | Sensors | Parameters | Processors | Environmental conditions |
|---|---|---|---|---|
Jing et al. [128] | D435 depth camera, IMU, OptiTrack system | Camera intrinsics, camera-IMU extrinsic calibration | Quad-core Intel NUC onboard computing unit | Indoor, natural light |
Cha et al. [7] | Nikon D5200 Cameras, Nikon D7200 Cameras | 500 × 375 pixels, bounding box annotations | Intel Core i7-6700k @4 GHz | Different lighting conditions |
Ali et al. [15] | UAV camera, VPS, stationary and mobile beacons | 1920 × 1080 pixels, 30 FPS, four fixed beacons | UAV processor, laptop | Indoor, outdoor (15–19 km/h winds) |
Basiri et al. [127] | Multiple drone front-facing cameras | Feature identification, 3D mapping | Intel Core i7-4702MQ, 2 GB VGA GeForce GT 740 M | Indoor, simulated environment |
Norbelt et al. [125] | RGB monocular camera, UAV camera | 640 × 480 pixels, RPE, ATE | Intel(R) Core i7-6700 K@ 4.0 GHz | Outdoor, different lighting conditions |
Waqas et al. [12] | UAV front camera, Hawkeye Micro Cam 160 | 640 × 640 pixels, 0.3 FPS | Intel Core i7 2.6 GHz, Nvidia GeForce 1060 | Indoor and outdoor (ArUco) |
Discussion
This paper presents a comprehensive performance comparison and evaluation of different methods for UAV localization, leading to several key conclusions. First, GeoDesc demonstrated the best performance in complex scenarios with varying viewpoints and lighting conditions, achieving mAP scores of 90.6%, 58.1%, and 75.3%, respectively. It also performed excellently in the torque estimation problem for rehabilitation training. However, GeoDesc has high requirements for the diversity of training data and computational resources, limiting its applicability in resource-constrained environments. While SOSNet and HyNet also showed strong performance across multiple tasks, their generalization capability was weaker, particularly in the face of complex environmental changes. LF-Net and FeatureBooster exhibited robust adaptability in tasks involving lighting and viewpoint variations, with FeatureBooster being an excellent choice for multi-scenario applications due to its balance of robustness and accuracy.
Second, traditional methods such as SIFT and RootSIFT, while reliable in terms of matching efficiency and stability, exhibited insufficient robustness in handling large variations in viewpoint and lighting. In particular, their matching accuracy was low in viewpoint variation tasks. In contrast, deep learning methods such as GeoDesc and FeatureBooster demonstrated stronger adaptability and higher matching accuracy, especially in dynamically changing complex environments. This indicates that, although traditional methods still have applications in specific tasks, they can no longer meet the localization demands in complex environments.
Finally, multi-sensor fusion methods, particularly those combining GNSS and VIO, have proven effective in overcoming the challenges posed by GNSS signal-limited environments. Tight-coupling systems such as GVINS and InGVIO, which directly fuse raw sensor data, enhance navigation accuracy and robustness. However, despite significant advancements in visual-inertial fusion methods, challenges remain in low-texture environments and extreme weather conditions.
Summary and outlook
This paper presents a comprehensive and systematic review of sensor-based UAV autonomous localization methods, offering a detailed analysis of pure vision-based localization and sensor-assisted approaches. The review is supported by extensive literature and experimental data. In the realm of pure vision-based UAV autonomous localization, the generation of feature descriptors, similarity measurement criteria, and optimization strategies are explored. Vision-based methods demonstrate high flexibility and computational efficiency but exhibit limitations in weakly textured or complex lighting scenarios. Sensor-assisted multi-source UAV localization methods leverage the advantages of multi-sensor data fusion through techniques such as filtering, optimization, and deep learning. These approaches significantly improve localization accuracy, robustness, and applicability in challenging environments. However, these methods face common challenges, including high computational complexity and strong dependencies on sensor quality and model accuracy.
Neither vision-based methods nor single-sensor techniques alone can meet the diverse requirements of complex environments. As a result, multimodal fusion is emerging as a critical direction for improving localization performance. The integration of deep learning further enhances environmental adaptability and data learning capabilities. However, the high demands for data quality and computational resources pose limitations for large-scale practical applications. Future research on sensor-based UAV autonomous localization technologies will primarily focus on the following directions:
Lightweight design and real-time performance
Adaptation to low-texture and complex scenarios
Optimization of multimodal fusion
Open data and standardized evaluation
In summary, sensor-based UAV autonomous localization methods still have vast potential for development in both theoretical research and practical applications. Future efforts should focus on advancing algorithms and data frameworks in a coordinated manner to further enhance the accuracy, real-time performance, and adaptability of localization systems. These advancements will lay a solid foundation for the application of UAVs in defense, industrial, and civilian domains.
Conclusion
Sensor-based UAV autonomous localization methods hold broad application prospects in complex environments. This paper systematically reviewed pure vision-based methods, sensor-assisted multimodal fusion approaches, and optimization strategies incorporating deep learning technologies, providing a detailed analysis of their advantages, limitations, and applicable scenarios. While current autonomous localization methods have achieved significant progress in terms of accuracy, environmental adaptability, and real-time performance, challenges remain in adapting to weakly textured scenes, improving multimodal fusion efficiency, and establishing standardized evaluation frameworks. Future efforts should focus on optimizing algorithm structures to achieve system lightweight and efficiency, developing more robust feature extraction and matching algorithms to handle complex scenarios, deeply integrating multimodal data to enhance robustness and precision, and promoting the development of open datasets and standardized evaluation frameworks to ensure fairness in research and facilitate the sharing and application of technological outcomes. As UAV autonomous localization technology continues to evolve, it will demonstrate strong capabilities in increasingly complex scenarios, providing solid technical support for intelligent navigation, disaster rescue, infrastructure inspection, and other fields.
Acknowledgements
This research is supported by the National Natural Science Foundation of China (No. 62203163), the Educational Commission of Hunan Province of China (No. 24A0519), and the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX2024100).
Author contributions
Haiqiao Liu: Funding acquisition, Supervision, Writing. Qing Long: Methodology, Data collection and analysis, Paper Writing. Bing Yi: Project administration. Wen Jiang: Paper editing and proofreading.
Funding
National Natural Science Foundation of China, 62203163, Haiqiao Liu; Postgraduate Scientific Research Innovation Project of Hunan Province, CX2024100, Qing Long; Educational Commission of Hunan Province of China, 24A0519, Haiqiao Liu.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Declarations
Conflict of interest
All authors disclosed no relevant relationships.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Dempster, AG; Cetin, E. Interference localization for satellite navigation systems. Proc IEEE; 2016; 104,
2. Morales-Ferre, R; Richter, P; Falletti, E et al. A survey on coping with intentional interference in satellite navigation for manned and unmanned aircraft. IEEE Commun Surv Tutor; 2019; 22,
3. Shen, Z; Sun, J; Wang, Y et al. Semi-dense feature matching with transformers and its applications in multiple-view geometry. IEEE Trans Pattern Anal Mach Intell; 2022; 45,
4. Strode, PRR; Groves, PD. GNSS multipath detection using three-frequency signal-to-noise measurements. GPS Solutions; 2016; 20, pp. 399-412.
5. Mur-Artal, R; Montiel, JMM; Tardos, JD. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans Rob; 2015; 31,
6. Forster C, Pizzoli M, Scaramuzza D (2014) SVO: fast semi-direct monocular visual odometry. In: 2014 IEEE international conference on robotics and automation (ICRA). IEEE 2014:15–22
7. Cha, YJ; Choi, W; Suh, G et al. Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Comput-Aided Civ Infrastr Eng; 2018; 33,
8. DeTone D, Malisiewicz T, Rabinovich A (2018) Superpoint: self-supervised interest point detection and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 224–236
9. Sun J, Shen Z, Wang Y et al (2021) LoFTR: detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8922–8931
10. Dandois, JP; Olano, M; Ellis, EC. Optimal altitude, overlap, and weather conditions for computer vision UAV estimates of forest structure. Remote sensing; 2015; 7,
11. Li, W; Fu, Z. Unmanned aerial vehicle positioning based on multi-sensor information fusion. Geo-Spatial Inf Sci; 2018; 21,
12. Waqas, A; Kang, D; Cha, YJ. Deep learning-based obstacle-avoiding autonomous UAVs with fiducial marker-based localization for structural health monitoring. Struct Health Monit; 2024; 23,
13. Liang, X; Yan, Y; Wang, W et al. Adaptive human–robot interaction torque estimation with high accuracy and strong tracking ability for a lower limb rehabilitation robot. IEEE/ASME Trans Mechatron; 2024; 29,
14. Chen, Z; Zhan, G; Jiang, Z et al. Adaptive impedance control for docking robot via Stewart parallel mechanism. ISA Trans; 2024; 155, pp. 361-372.
15. Ali, R; Kang, D; Suh, G; Cha, YJ. Real-time multiple damage mapping using autonomous UAV and deep faster region-based neural networks for GPS-denied structures. Autom Constr; 2021; 130, 103831.
16. Cui, Y; Wang, C; Hu, Q et al. A novel positioning method for UAV in GNSS-denied environments based on mechanical antenna. IEEE Trans Industr Electron; 2024; 71,
17. Chen, P; Dang, Y; Liang, R et al. Real-time object tracking on a drone with multi-inertial sensing data. IEEE Trans Intell Transp Syst; 2017; 19,
18. Minaeian, S; Liu, J; Son, YJ. Vision-based target detection and localization via a team of cooperative UAV and UGVs. IEEE Trans Syst Man Cybern Syst; 2015; 46,
19. Sabzehali, J; Shah, VK; Fan, Q et al. Optimizing number, placement, and backhaul connectivity of multi-UAV networks. IEEE Internet Things J; 2022; 9,
20. Lowe, DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision; 2004; 60, pp. 91-110.
21. Bay, H; Ess, A; Tuytelaars, T et al. Speeded-up robust features (SURF). Comput Vis Image Underst; 2008; 110,
22. Rublee E, Rabaud V, Konolige K et al (2011) ORB: An efficient alternative to SIFT or SURF. In: 2011 International conference on computer vision. IEEE, 2564–2571
23. Shan M, Wang F, Lin F et al (2015) Google map aided visual navigation for UAVs in GPS-denied environment. In: 2015 IEEE international conference on robotics and biomimetics (ROBIO). IEEE, 114–119
24. Beauchemin, SS; Barron, JL. The computation of optical flow. ACM Comput Surv (CSUR); 1995; 27,
25. Zhang, T; Zhang, X; Ke, X et al. HOG-ShipCLSNet: a novel deep learning network with hog feature fusion for SAR ship classification. IEEE Trans Geosci Remote Sens; 2021; 60, pp. 1-22.
26. Warren, M; Greeff, M; Patel, B et al. There's no place like home: visual teach and repeat for emergency return of multirotor UAVs during GPS failure. IEEE Robot Autom Lett; 2018; 4,
27. Yol A, Delabarre B, Dame A et al (2014) Vision-based absolute localization for unmanned aerial vehicles. In: 2014 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 3429–3434
28. Tiškus, E; Bučas, M; Gintauskas, J et al. U-net performance for beach wrack segmentation: effects of UAV camera bands, height measurements, and spectral indices. Drones; 2023; 7,
29. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III 18. Springer International Publishing, 234–241
30. Tian Y, Yu X, Fan B et al (2019) Sosnet: second order similarity regularization for local descriptor learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11016–11025
31. Wang, L; Zhang, Y; Feng, J. On the Euclidean distance of images. IEEE Trans Pattern Anal Mach Intell; 2005; 27,
32. Norouzi M, Fleet DJ, Salakhutdinov RR (2012) Hamming distance metric learning. Adv Neural Inf Process Syst 25
33. Fan B, Du Y, Zhu L et al (2010) The registration of UAV down-looking aerial images to satellite images with image entropy and edges. In: Intelligent robotics and applications: third international conference, ICIRA 2010, Shanghai, China, November 10–12, 2010. Proceedings, Part I 3. Springer Berlin Heidelberg, 609–617
34. Yi KM, Trulls E, Lepetit V et al (2016) Lift: learned invariant feature transform. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI 14. Springer International Publishing, 467–483
35. Bai Y, Ding H, Bian S et al (2019) Simgnn: a neural network approach to fast graph similarity computation. In: Proceedings of the twelfth ACM international conference on web search and data mining, 384–392
36. Ling, X; Wu, L; Wang, S et al. Multilevel graph matching networks for deep graph similarity learning. IEEE Trans Neural Netw Learn Syst; 2021; 34,
37. He, Q; Xu, A; Zhang, Y et al. A contrastive learning based multiview scene matching method for UAV view geo-localization. Remote Sensing; 2024; 16,
38. Yeh CH, Hong CY, Hsu YC et al (2022) Decoupled contrastive learning. In: European conference on computer vision. Springer Nature Switzerland, Cham, 668–684
39. Saputro DRS, Widyaningsih P (2017) Limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method for the parameter estimation on geographically weighted ordinal logistic regression model (GWOLR)[C]//AIP conference proceedings. AIP Publ 1868(1):040009
40. Al-Jarrah, OY; Shatnawi, AS; Shurman, MM et al. Exploring deep learning-based visual localization techniques for UAVs in GPS-denied environments. IEEE Access; 2024; 12, pp. 113049-113071.
41. Balta, H; Velagic, J; Beglerovic, H et al. 3D registration and integrated segmentation framework for heterogeneous unmanned robotic systems. Remote Sens; 2020; 12,
42. Goforth H, Lucey S (2019) GPS-denied UAV localization using pre-existing satellite imagery. In: 2019 international conference on robotics and automation (ICRA). IEEE, 2974–2980
43. Wang C, Galoogahi HK, Lin CH et al (2018) Deep-LK for efficient adaptive object tracking. In: 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 627–634
44. Kim J, Lee JK, Lee KM (2016) Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 1646–1654
45. Mughal, MH; Khokhar, MJ; Shahzad, M. Assisting UAV localization via deep contextual image matching. IEEE J Sel Top Appl Earth Observ Remote Sens; 2021; 14, pp. 2445-2457.
46. Fu, R; Ren, X; Li, Y et al. Machine-learning-based uav-assisted agricultural information security architecture and intrusion detection. IEEE Internet Things J; 2023; 10,
47. Balntas V, Riba E, Ponsa D et al (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. BMVC 1(2):1–11
48. Simo-Serra E, Trulls E, Ferraz L et al (2015) Discriminative learning of deep convolutional feature point descriptors. In: Proceedings of the IEEE international conference on computer vision, 118–126
49. Tian Y, Fan B, Wu F (2017) L2-net: deep learning of discriminative patch descriptor in Euclidean space. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 661–669
50. Zhang X, Yu FX, Kumar S et al (2017) Learning spread-out local feature descriptors. In: Proceedings of the IEEE international conference on computer vision, 4595–4603
51. He K, Lu Y, Sclaroff S (2018) Local descriptors optimized for average precision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 596–605
52. Tian, Y; Barroso Laguna, A; Ng, T et al. HyNet: learning local descriptor with hybrid similarity measure and triplet loss. Adv Neural Inf Process Syst; 2020; 33, pp. 7401-7412.
53. Luo Z, Shen T, Zhou L et al (2018) Geodesc: learning local descriptors by integrating geometry constraints. In: Proceedings of the European conference on computer vision (ECCV), 168–183
54. Noh H, Araujo A, Sim J et al (2017) Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE international conference on computer vision, 3456–3465
55. Arandjelović R, Zisserman A (2012) Three things everyone should know to improve object retrieval. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2911–2918
56. Ono Y, Trulls E, Fua P et al (2018) LF-Net: learning local features from images. Adv Neural Inf Process Syst 31
57. Zhao, X; Wu, X; Miao, J et al. Alike: accurate and lightweight keypoint detection and descriptor extraction. IEEE Trans Multimedia; 2022; 25, pp. 3101-3112.
58. Dusmanu M, Rocco I, Pajdla T et al (2019) D2-net: a trainable CNN for joint description and detection of local features. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8092–8101
59. Wang X, Liu Z, Hu Y et al (2023) Featurebooster: boosting feature descriptors with a lightweight neural network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7630–7639
60. Mishkin D, Radenovic F, Matas J (2018) Repeatability is not enough: learning affine regions via discriminability. In: Proceedings of the European conference on computer vision (ECCV), 284–300
61. Asaad, SM; Maghdid, HS. A comprehensive review of indoor/outdoor localization solutions in IoT era: research challenges and future perspectives. Comput Netw; 2022; 212, 109041.
62. Tan YX, Prasetyo MB, Daffa MA et al (2023) Evaluating visual odometry methods for autonomous driving in rain. In: 2023 IEEE 19th international conference on automation science and engineering (CASE). IEEE, 1–8
63. Liu, X; Zhou, Q; Chen, X et al. Bias-error accumulation analysis for inertial navigation methods. IEEE Signal Process Lett; 2021; 29, pp. 299-303.
64. Kalman RE (1960) A new approach to linear filtering and prediction problems. J Basic Eng 82(1):35–45
65. Falco, G; Pini, M; Marucco, G. Loose and tight GNSS/INS integrations: comparison of performance assessed in real urban scenarios. Sensors; 2017; 17,
66. Gao, B; Hu, G; Zhong, Y et al. Cubature Kalman filter with both adaptability and robustness for tightly-coupled GNSS/INS integration. IEEE Sens J; 2021; 21,
67. Ban Y, Niu X, Zhang T et al (2014) Low-end MEMS IMU can contribute in GPS/INS deep integration. In: 2014 IEEE/ION position, location and navigation symposium-PLANS 2014. IEEE, 746–752
68. Rhudy MB, Gu Y, Napolitano M (2013) Low-cost loosely-coupled dual GPS/INS for attitude estimation with application to a small UAV. In: AIAA guidance, navigation, and control (GNC) conference, 4957
69. Semeniuk, L; Noureldin, A. Bridging GPS outages using neural network estimates of INS position and velocity errors. Meas Sci Technol; 2006; 17,
70. Morgado, M; Oliveira, P; Silvestre, C. Tightly coupled ultrashort baseline and inertial navigation system for underwater vehicles: an experimental validation. J Field Robot; 2013; 30,
71. Singh, T; Vishwakarma, DK. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images. Neural Comput Appl; 2021; 33,
72. Li, H; Yu, L; Zhang, J et al. Fusion deep learning and machine learning for heterogeneous military entity recognition. Wirel Commun Mob Comput; 2022; 2022,
73. Li, M; Mourikis, AI. High-precision, consistent EKF-based visual-inertial odometry. Int J Robot Res; 2013; 32,
74. Xiong, K; Zhang, HY; Chan, CW. Performance evaluation of UKF-based nonlinear filtering. Automatica; 2006; 42,
75. Charkhgard, M; Farrokhi, M. State-of-charge estimation for lithium-ion batteries using neural networks and EKF. IEEE Trans Industr Electron; 2010; 57,
76. Simanek, J; Reinstein, M; Kubelka, V. Evaluation of the EKF-based estimation architectures for data fusion in mobile robots. IEEE/ASME Trans Mechatron; 2014; 20,
77. Guo, H; Chen, H; Xu, F et al. Implementation of EKF for vehicle velocities estimation on FPGA. IEEE Trans Industr Electron; 2012; 60,
78. Toledo-Moreo, R; Zamora-Izquierdo, MA; Ubeda-Minarro, B et al. High-integrity IMM-EKF-based road vehicle navigation with low-cost GPS/SBAS/INS. IEEE Trans Intell Transp Syst; 2007; 8,
79. Sebesta, KD; Boizot, N. A real-time adaptive high-gain EKF, applied to a quadcopter inertial navigation system. IEEE Trans Industr Electron; 2013; 61,
80. Valipour, M; Ricardez-Sandoval, LA. Assessing the impact of EKF as the arrival cost in the moving horizon estimation under nonlinear model predictive control. Ind Eng Chem Res; 2021; 60,
81. Costanzi, R; Fanelli, F; Meli, E et al. UKF-based navigation system for AUVs: online experimental validation. IEEE J Oceanic Eng; 2018; 44,
82. Bucci, A; Franchi, M; Ridolfi, A et al. Evaluation of UKF-based fusion strategies for autonomous underwater vehicles multisensor navigation. IEEE J Oceanic Eng; 2022; 48,
83. Feng, X; Zhang, T; Lin, T et al. Implementation and performance of a deeply-coupled GNSS receiver with low-cost MEMS inertial sensors for vehicle urban navigation. Sensors; 2020; 20,
84. Simon, D. Kalman filtering with state constraints: a survey of linear and nonlinear algorithms. IET Control Theory Appl; 2010; 4,
85. Nazarahari, M; Rouhani, H. 40 years of sensor fusion for orientation tracking via magnetic and inertial measurement units: Methods, lessons learned, and future challenges. Inf Fusion; 2021; 68, pp. 67-84.
86. Hartley, R; Ghaffari, M; Eustice, RM et al. Contact-aided invariant extended Kalman filtering for robot state estimation. Int J Robot Res; 2020; 39,
87. Zhu, Z; Sorkhabadi, SMR; Gu, Y et al. Design and evaluation of an invariant extended kalman filter for trunk motion estimation with sensor misalignment. IEEE/ASME Trans Mechatron; 2022; 27,
88. Song, Y; Hsu, LT. Tightly coupled integrated navigation system via factor graph for UAV indoor localization. Aerosp Sci Technol; 2021; 108, 106370.
89. Wang, F; Zou, Y; Zhang, C et al. UAV navigation in large-scale GPS-denied bridge environments using fiducial marker-corrected stereo visual-inertial localisation. Autom Constr; 2023; 156, 105139.
90. Yusefi, A; Durdu, A; Bozkaya, F et al. A generalizable D-VIO and its fusion with GNSS/IMU for improved autonomous vehicle localization. IEEE Trans Intell Vehic; 2023; 9,
91. Cao, S; Lu, X; Shen, S. GVINS: Tightly coupled GNSS–visual–inertial fusion for smooth and consistent state estimation. IEEE Trans Rob; 2022; 38,
92. Wei, X; Li, J; Zhang, D et al. An improved integrated navigation method with enhanced robustness based on factor graph. Mech Syst Signal Process; 2021; 155, 107565.
93. Liu, J; Cao, Z; Liu, X et al. Self-supervised monocular depth estimation with geometric prior and pixel-level sensitivity. IEEE Trans Intell Vehic; 2022; 8,
94. Guo, F; Yang, H; Wu, X et al. Model-based deep learning for low-cost IMU dead reckoning of wheeled mobile robot. IEEE Trans Industr Electron; 2023; 71,
95. Aslan, MF; Durdu, A; Yusefi, A et al. HVIOnet: a deep learning based hybrid visual–inertial odometry approach for unmanned aerial system position estimation. Neural Netw; 2022; 155, pp. 461-474.
96. Potokar, ER; Norman, K; Mangelson, JG. Invariant extended kalman filtering for underwater navigation. IEEE Robot Autom Lett; 2021; 6,
97. Zhang, H; Xiao, R; Li, J et al. A high-precision LiDAR-inertial odometry via invariant extended Kalman filtering and efficient Surfel mapping. IEEE Trans Instrum Meas; 2024; 73, 8502911.
98. Dellaert, F. Factor graphs: exploiting structure in robotics. Ann Rev Control Robot Autonom Syst; 2021; 4,
99. Van Nam, D; Gon-Woo, K. Learning type-2 fuzzy logic for factor graph based-robust pose estimation with multi-sensor fusion. IEEE Trans Intell Transp Syst; 2023; 24,
100. Li, F; Tang, Y. Multi-sensor fusion Boolean Bayesian filtering for stochastic Boolean networks. IEEE Trans Neural Netw Learn Syst; 2022; 34,
101. Kaess, M; Ranganathan, A; Dellaert, F. iSAM: incremental smoothing and mapping. IEEE Trans Rob; 2008; 24,
102. Kaess, M; Johannsson, H; Roberts, R et al. iSAM2: Incremental smoothing and mapping using the Bayes tree. Int J Robot Res; 2012; 31,
103. Guo, J; He, B. Improved iSAM based on flexible re-linearization threshold and error learning model for AUV in large scale areas. IEEE Trans Intell Transp Syst; 2020; 22,
104. Zhang Y, Hsiao M, Dong J et al (2021) MR-iSAM2: incremental smoothing and mapping with multi-root bayes tree for multi-robot SLAM. In: 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 8671–8678
105. Leutenegger S, Furgale P, Rabaud V et al (2013) Keyframe-based visual-inertial slam using nonlinear optimization. In: Proceedings of robotics science and systems (RSS), 2013
106. Qin T, Cao S, Pan J et al (2019) A general optimization-based framework for global pose estimation with multiple sensors. arXiv:1901.03642
107. Yang, D; Liu, H; Jin, X et al. Enhancing VIO robustness under sudden lighting variation: a learning-based IMU dead-reckoning for UAV localization. IEEE Robot Autom Lett; 2024; 9,
108. Angelino CV, Baraniello VR, Cicala L (2012) UAV position and attitude estimation using IMU, GNSS and camera. In: 2012 15th international conference on information fusion. IEEE, 735–742
109. Liu, C; Jiang, C; Wang, H. Ingvio: a consistent invariant filter for fast and high-accuracy gnss-visual-inertial odometry. IEEE Robot Autom Lett; 2023; 8,
110. Xia C, Li X, He F et al (2024) Accurate and rapidly-convergent GNSS/INS/LiDAR tightly-coupled integration via invariant EKF based on two-frame group. IEEE Trans Autom Sci Eng 2024:1–14
111. Cao, M; Jia, W; Lv, Z et al. Fast and robust feature tracking for 3D reconstruction. Opt Laser Technol; 2019; 110, pp. 120-128.
112. Almalioglu, Y; Turan, M; Saputra, MRU et al. SelfVIO: self-supervised deep monocular visual-inertial odometry and depth estimation. Neural Netw; 2022; 150, pp. 119-136.
113. Xiwei, WU; Bing, X; Cihang, WU et al. Factor graph based navigation and positioning for control system design: a review. Chin J Aeronaut; 2022; 35,
114. Nguyen, G; Dlugolinsky, S; Bobák, M et al. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev; 2019; 52, pp. 77-124.
115. Wang, D; Li, W; Liu, X et al. UAV environmental perception and autonomous obstacle avoidance: a deep learning and depth camera combined solution. Comput Electron Agric; 2020; 175, 105523.
116. Deming, RW; Perlovsky, LI. Concurrent multi-target localization, data association, and navigation for a swarm of flying sensors. Inf Fusion; 2007; 8,
117. Cong, J; Wang, X; Yan, C et al. CRB weighted source localization method based on deep neural networks in multi-UAV network. IEEE Internet Things J; 2022; 10,
118. Xie, X; Zou, D. Depth-based efficient PnP: a rapid and accurate method for camera pose estimation. IEEE Robot Autom Lett; 2024; 9,
119. Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems, 27
120. Chen, C; Pan, X. Deep learning for inertial positioning: a survey. IEEE Trans Intell Transp Syst; 2024; 25,
121. Doostdar, P; Keighobadi, J; Hamed, MA. INS/GNSS integration using recurrent fuzzy wavelet neural networks. GPS Solutions; 2020; 24,
122. Clark R, Wang S, Wen H et al (2017) Vinet: visual-inertial odometry as a sequence-to-sequence learning problem. In: Proceedings of the AAAI conference on artificial intelligence, 31(1), 3995–4001
123. Wong, CC; Feng, HM; Kuo, KL. Multi-sensor fusion simultaneous localization mapping based on deep reinforcement learning and multi-model adaptive estimation. Sensors; 2023; 24,
124. Steenbeek, A; Nex, F. CNN-based dense monocular visual SLAM for real-time UAV exploration in emergency conditions. Drones; 2022; 6,
125. Norbelt, M; Luo, X; Sun, J et al. UAV localization in urban area mobility environment based on monocular VSLAM with deep learning. Drones; 2025; 9,
126. Yang, L; Ye, J; Zhang, Y et al. A semantic SLAM-based method for navigation and landing of UAVs in indoor environments. Knowl-Based Syst; 2024; 293, 111693.
127. Basiri, A; Mariani, V; Glielmo, L. Improving visual SLAM by combining SVO and ORB-SLAM2 with a complementary filter to enhance indoor mini-drone localization under varying conditions. Drones; 2023; 7,
128. Chen LJ, Henawy J, Kocer BB et al (2019) Aerial robots on the way to underground: an experimental evaluation of VINS-mono on visual-inertial odometry camera. In: 2019 international conference on data mining workshops (ICDMW). IEEE, 91–96
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License").