1. Introduction

Thermal infrared technology can work under all types of weather conditions and has been widely used for rescue, surveillance, and automatic target recognition. In addition, tracking based on thermal infrared imagery is insensitive to illumination variations and can follow the target in total darkness [1,2]. Airborne target tracking, which plays an important role in infrared imaging guidance, remains a challenging task [3,4]. Compared with visible-light imagery, the imagery generated by infrared imaging guidance has low resolution and lacks texture information [5]. Moreover, both aircraft and infrared imaging platforms are highly maneuverable, leading to strong ego-motion and severe image jitter [6]. When an aircraft passes through a cloud, it is partly occluded by the cloud. An infrared decoy can likewise occlude the aircraft, and it radiates a stronger signal than the aircraft itself. Changes in aircraft attitude also change its appearance in the image, which is a further challenge for aircraft tracking, as seen in Figure 1.

Wang et al. [7] broke a tracker down into several parts and observed that the feature extractor plays the most important role. Thus, a robust feature representation of the aircraft is crucial to the overall performance of the tracker. Recently, trackers based on correlation filters have achieved great success [8,9,10,11]; they can be trained efficiently in the Fourier domain and generate dense response scores over all search locations. With the adoption of multi-channel features [12,13] instead of single-channel gray-scale features [14], tracking performance has improved greatly. Progress in convolutional neural networks (CNNs) has inspired further research on combining CNN features with correlation filters [15,16], which provides a further performance boost. However, CNN features pre-trained on ImageNet are not discriminative enough for domain-specific target tracking and incur a high computational cost.

By implicitly including dense samples, correlation filters are able to make full use of limited training data [17]. Motivated by the training mechanism of correlation filters [12,18], we explicitly construct shifted versions of the aircraft in the initial frame as the training data, so no additional training data are required. To ensure better tracking performance, we integrate handcrafted features that encode a general representation into the network. The learned domain-specific features and the handcrafted features can then co-adapt and cooperate toward a common objective.

The main contributions of this work are summarized as follows:

We propose a new approach to automatically learn features online that can be adapted to the current video domain without pre-training on large datasets.
The general feature representations and the domain-specific features learned online are integrated into a unified framework to ensure the tracking performance.
The proposed method can be embedded in a framework based on correlation filters as a flexible module to improve the performance.
We carry out experiments on airborne infrared imagery to demonstrate that the proposed tracking algorithm achieves competitive performance compared with benchmark trackers.

2. Related Work

Bolme et al. [14] first introduced correlation filters to visual tracking, taking single-channel gray-scale features as the input. Tracking based on the Minimum Output Sum of Squared Error (MOSSE) filter achieves competitive performance compared with more complex trackers and runs at 669 frames per second. Henriques et al. [18] explored the circulant structure of dense samples and derived closed-form solutions with polynomial and Gaussian kernels. The kernel trick and the circulant structure of the samples together enable efficient training in the frequency domain, which is orders of magnitude faster than standard methods. The Kernelized Correlation Filter (KCF) [12], which can be seen as a kernelized version of a linear correlation filter, extended the work of [18] by replacing single-channel features with Histogram of Oriented Gradients (HOG) features [19]. The KCF [12] and the multi-channel extension of correlation filters improve tracking performance significantly and run at hundreds of frames per second. However, the aforementioned trackers cannot handle scale variations well. To address the problem of scale estimation, Danelljan et al. [20] proposed learning separate filters for scale estimation and target translation. After the optimal translation is found, scale estimation is achieved by training a classifier based on a scale pyramid. Similarly, Li et al. [13] proposed a scale-adaptive scheme by defining a scaling pool. The multi-scale search strategy and the multiple-feature integration scheme work together to boost tracking performance.

The periodic assumption of the samples implied in correlation filters enables efficient training using the Fast Fourier Transform (FFT). However, the periodic assumption also introduces undesired boundary effects, making the tracking model inaccurate. Galoogahi et al. [21] addressed the issue for single-channel discriminative correlation filters by proposing a new objective, which can reduce the samples affected by the boundary effect and can be optimized by using the Augmented Lagrangian Method (ALM). The approach limits boundary effects and preserves computational efficiency. Danelljan et al. [17] exploited a spatial regularization component to penalize correlation filter coefficients near the background to alleviate the boundary effects. The spatial regularization mitigates the attention on the background region and enhances the emphasis on the target region. The introduced component can be used for multi-dimensional features, leading to a more discriminative tracking model. Instead of learning from the circular samples, which are plagued by boundary effects, Background-Aware Correlation Filters (BACFs) [22] emphasize the learning of the tracking model from real negative samples extracted from the background. The optimization process based on the Alternating Direction Method of Multipliers (ADMM) and Sherman–Morrison lemma achieves real-time performance while maintaining competitive accuracy. Li et al. [11] incorporated temporal regularization with Spatially-Regularized Discriminative Correlation Filters (SRDCFs) [17] to handle the appearance variations of the target during the tracking process. The introduction of the temporal regularizer to SRDCF with a single sample can approximate the training of SRDCF with multiple samples, and the training can be optimized efficiently via ADMM. Dai et al. [8] proposed an adaptive spatial regularization component to obtain object-aware spatial weight. 
The approach can be seen as a general extension of SRDCF [17] and BACF [22]. To accelerate the tracking process, a CF model with shallow features is used to estimate the scale, while a second CF model equipped with more complex features is responsible for accurate localization.

Feature representation is a critical part of visual tracking [7,23]. Recently, convolutional neural networks have achieved great success in various vision tasks. With the adoption of CNN features, trackers based on correlation filters have shown further improvements [15,16,24]. Ma et al. [15,25] exploited hierarchical convolutional features as target representations for visual tracking. The correlation filters learned on each layer cooperate to infer the target location in a coarse-to-fine manner. Danelljan et al. [16] extended SRDCF [17] by using CNN features and demonstrated superior performance compared to handcrafted feature representations. Further, they proposed to learn continuous convolution operators [26]. With the integration of multi-resolution feature maps in the continuous spatial domain, the tracking performance was improved. Efficient Convolution Operators (ECO) [24] introduce a factorized convolution operator and a compact generative model of samples to the C-COT (Continuous Convolution Operators Tracking) [26] tracker, which simultaneously improves computational efficiency and tracking accuracy. He et al. [27] investigated multi-resolution CNN features and proposed a weighted sum of the response maps based on the ECO [24] tracker. The adoption of the first convolution layer and the final convolution layer of the VGG (Visual Geometry Group) network [28] achieves the best tracking performance. Xu et al. [29] exploited the relevance of multi-channel features and presented group feature selection in the channel and spatial dimensions. With the use of group-sparse regularization and a low-rank temporal constraint, the combination of correlation filters and CNN features provides superior tracking performance.

3. Proposed Algorithm

In this section, we first introduce feature learning via convolutional regression. Second, we detail the architecture of the network. Finally, we introduce the proposed tracking algorithm. Algorithm 1 depicts the whole process.

Algorithm 1: Proposed tracking algorithm.

Input: Initial position and size of the aircraft [x₀ y₀ w₀ h₀]. Output: Estimated aircraft states [x_i y_i w_i h_i].

1: Construct shifted versions of the aircraft as training samples.

2: Train the network according to Equation (1).

3: Train the correlation filters using features extracted from the network.

4: for i = 2, ..., n (length of sequence) do

5: Extract features of the search patch.

6: Generate a response map based on the extracted features and the trained filters.

7: Calculate the displacement from the response map to estimate the state of the aircraft.

8: Update the correlation filters.

9: end for

3.1. Learning via Convolutional Regression

In the typical formulation, correlation filters are trained by solving a linear least-squares problem. The training samples are generated implicitly by a circular sliding-window operation, which is evaluated efficiently with the fast Fourier transform [12,18]. The features adopted in correlation filters are usually handcrafted features or CNN features pre-trained on large datasets, which are not tightly bound to the current video domain. Inspired by the training mechanism of correlation filters, we explicitly construct shifted versions of the aircraft in the initial frame as the training data and learn features for the current domain with a convolutional regression network. The training of the network is consistent with the training of correlation filters. Therefore, the features obtained from the network are tightly coupled with both the current video domain and the tracking frameworks based on correlation filters. The weights $w$ of the network are learned by minimizing the following loss function,

$$L(w) = \frac{1}{N}\sum_{i=1}^{N} \ell(\varphi(s_i, w), g_i) + \lambda r(w) \tag{1}$$

where $N$ is the number of shifted samples, $\ell(\varphi(s_i, w), g_i)$ denotes the loss of the $i$-th training sample $s_i$, $\lambda$ is a regularization parameter, and $r(w)$ represents the weight decay term. The desired output $g_i$ for $s_i$ is a scalar value sampled from a Gaussian function according to its shifted position, which can be written as,

$$g_i(x_i, y_i) = \exp\left(-\left(\frac{(x_i - x_0)^2}{2\sigma_x^2} + \frac{(y_i - y_0)^2}{2\sigma_y^2}\right)\right) \tag{2}$$

where $(x_0, y_0)$ stands for the initial position of the aircraft, $(x_i, y_i)$ represents the shifted position of the aircraft in the sample $s_i$, and the standard deviations $\sigma_x$ and $\sigma_y$ are proportional to the width and height of the aircraft. The correspondence between the sample $s_i$ and the label $g_i$ is shown in Figure 2.
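As a minimal sketch of the Gaussian labelling rule above, the following Python snippet assigns a label to each shifted sample. The function names, the list of shifts, and the proportionality constant `scale` are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_label(xi, yi, x0, y0, sigma_x, sigma_y):
    """Label for a sample whose target center sits at (xi, yi)."""
    dx = (xi - x0) ** 2 / (2.0 * sigma_x ** 2)
    dy = (yi - y0) ** 2 / (2.0 * sigma_y ** 2)
    return math.exp(-(dx + dy))

def shifted_labels(x0, y0, width, height, shifts, scale=0.1):
    """Labels for a list of (dx, dy) shifts of the initial target.

    `scale` is a hypothetical constant: the text only states that the
    spreads are proportional to the target width and height.
    """
    sigma_x, sigma_y = scale * width, scale * height
    return [
        gaussian_label(x0 + dx, y0 + dy, x0, y0, sigma_x, sigma_y)
        for dx, dy in shifts
    ]

# A 40 x 20 target centered at (64, 64), shifted by a few pixels.
labels = shifted_labels(64, 64, 40, 20, [(0, 0), (4, 0), (0, 4), (8, 8)])
```

The unshifted sample receives the label 1, and the label decays faster along the shorter side of the target because the spread in that direction is smaller.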

Specifically, $\ell(\varphi(s_i, w), g_i)$ can be defined as the error term between the network output $\varphi(s_i, w)$ and the label $g_i$, which is given by,

$$\ell(\varphi(s_i, w), g_i) = \|\varphi(s_i, w) - g_i\|^2 \tag{3}$$

The weights $w$ can be effectively calculated via gradient descent [23,30], which can be written as,

$$w_t = w_{t-1} - \eta \frac{\partial L}{\partial w} \tag{4}$$

where $\eta$ is the learning rate and $L(w)$ is the loss function defined in Equation (1). We iteratively optimize $w$ by minimizing the loss function $L(w)$.
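The loss and the update rule above can be illustrated with a toy example. This sketch replaces the convolutional network with a linear model, $\varphi(s, w) = s^{T} w$, and assumes $r(w) = \|w\|^2$ for the weight decay term; the synthetic data, the learning rate, and the step count are likewise assumptions chosen so that the toy problem converges, not the settings used in the paper.

```python
import numpy as np

def loss_and_grad(w, S, g, lam):
    """Regularized squared-error loss and its gradient.

    Assumes a linear stand-in phi(s, w) = s @ w and r(w) = ||w||^2.
    """
    residual = S @ w - g
    loss = np.mean(residual ** 2) + lam * np.dot(w, w)
    grad = 2.0 * S.T @ residual / len(g) + 2.0 * lam * w
    return loss, grad

def gradient_descent(S, g, lam=1e-3, eta=0.1, steps=200):
    """Repeat w_t = w_{t-1} - eta * dL/dw and record the loss."""
    w = np.zeros(S.shape[1])
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, S, g, lam)
        history.append(loss)
        w = w - eta * grad
    return w, history

rng = np.random.default_rng(0)
S = rng.normal(size=(256, 8))      # 256 synthetic samples, 8 features each
g = S @ rng.normal(size=8)         # synthetic regression targets
w, history = gradient_descent(S, g)
```

For a real network the gradient would come from backpropagation rather than a closed-form expression, but the update step has the same form.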

3.2. Network Architecture

Since only a few samples extracted from the initial frame can be used as training data, we incorporate general feature representations into the network to ensure better tracking performance. HOG features have been widely used to represent the target and achieve excellent performance in visual tracking [27]. To this end, we propose to combine HOG features, which encode general feature representations, with domain-specific features learned online in a single framework. Instead of directly concatenating the HOG features and the CNN features, we encode the HOG computation into the network so that the two kinds of features can co-adapt and cooperate toward a common objective. We follow the work of [31,32], which implemented the HOG features in a CNN framework. The implementation mainly includes the calculation of the gradient, the assignment of the gradient, and the normalization of the block. Firstly, the gradient along the direction $u_k$ is calculated using a directional filter. The $k$-th directional filter $G_k$ can be written as,

$$G_k = u_{1k} G_x + u_{2k} G_y \tag{5}$$

$$u_k = \begin{bmatrix} \cos\frac{2\pi k}{K} \\ \sin\frac{2\pi k}{K} \end{bmatrix}, \quad G_x = \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad G_y = G_x^{T} \tag{6}$$

where $K$ is the number of orientations. Then, the gradients are assigned to histogram $h_k$ by using an approximated bilinear binning, which is given by,

$$h_k \approx \|g\| \max\left\{0, \frac{\langle g, u_k \rangle / \|g\| - \cos(2\pi/K)}{1 - \cos(2\pi/K)}\right\} \tag{7}$$

where $\langle g, u_k \rangle$ is the projection of the gradient $g$ along direction $u_k$. The cell histogram is calculated in 8 × 8 pixels and normalized in a block composed of 2 × 2 cells. The network architecture is shown in Figure 3. The norm layer is a special case of the Local Response Normalization (LRN) layer [33], and the clamp implements the function,

$$y = \min\{x, \tau\} \tag{8}$$

where $\tau$ is a positive threshold. Clipping the values avoids giving very large gradients too much influence [34].
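The directional filters, the approximate bilinear binning, and the clamp described above can be sketched in a few lines of NumPy. This is an illustration for a single gradient vector rather than a full HOG layer; the function names are hypothetical, and the values K = 18 and τ = 0.2 are the ones reported in the experimental setup.

```python
import numpy as np

GX = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]], dtype=float)
GY = GX.T

def directional_filters(K):
    """Unit directions u_k and 3 x 3 filters G_k = u_1k * G_x + u_2k * G_y."""
    angles = 2.0 * np.pi * np.arange(K) / K
    U = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # shape (K, 2)
    G = U[:, 0, None, None] * GX + U[:, 1, None, None] * GY   # shape (K, 3, 3)
    return U, G

def soft_binning(g, U):
    """Approximate bilinear binning of one gradient vector g into K bins."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros(len(U))
    c = np.cos(2.0 * np.pi / len(U))
    return norm * np.maximum(0.0, ((U @ g) / norm - c) / (1.0 - c))

def clamp(x, tau=0.2):
    """Element-wise y = min(x, tau)."""
    return np.minimum(x, tau)

U, G = directional_filters(18)
h = soft_binning(np.array([2.0, 0.0]), U)   # gradient aligned with u_0
clipped = clamp(np.array([0.1, 0.5]))
```

A gradient aligned with one direction falls entirely into that bin, while a gradient lying between two neighbouring directions is shared between their bins.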

3.3. Tracking Algorithm

The initialization parameters of the network are obtained from the HOG features, as shown in Figure 4, and we further train the network by using Equation (1). The training process of the network is consistent with the training mechanism of correlation filters, as mentioned in Section 3.1. After the network is trained, the feature maps from the network are integrated into the correlation filters for aircraft tracking. We denote the input image by $x$ and the corresponding feature by $\varphi(x, w)$. A correlation filter $f$ is then learned by solving the following objective function:

$$f^{*} = \arg\min_{f} \|f * \varphi(x, w) - y\|^2 + \gamma \|f\|_2^2 \tag{9}$$

where $y$ is a Gaussian function peaked at the target center, $*$ denotes circular correlation, and $\gamma$ is a regularization parameter. Afterwards, we crop a search patch in the new frame and obtain its features $z$. The correlation response map $m$ is given by,

$$m = \mathcal{F}^{-1}(\hat{f} \odot \hat{z}) = \mathcal{F}^{-1}\left(\frac{\hat{y} \odot \hat{\varphi}^{*}}{\hat{\varphi} \odot \hat{\varphi}^{*} + \gamma} \odot \hat{z}\right) \tag{10}$$

where the hat denotes the Fourier transform, the operator $\mathcal{F}^{-1}$ denotes the inverse fast Fourier transform, $\odot$ is the element-wise product, and $\hat{\varphi}^{*}$ is the complex conjugate of $\hat{\varphi}$. Thus, the translation of the target from the previous frame can be estimated by searching for the maximum value of the correlation response map. The overall procedure of the algorithm is shown in Figure 5. We summarize the main steps of the proposed tracking algorithm in Algorithm 1.
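The closed-form filter and the response map above can be checked with a small single-channel NumPy sketch. The random patch stands in for a feature map, and the function names, patch size, label width, and regularization value are illustrative assumptions; the multi-channel case and the online filter update are omitted.

```python
import numpy as np

def gaussian_response(shape, sigma):
    """Desired output y: a Gaussian peaked at the patch center."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def train_filter(phi, y, gamma=1e-2):
    """Filter in the Fourier domain: y_hat * conj(phi_hat) / (|phi_hat|^2 + gamma)."""
    phi_hat = np.fft.fft2(phi)
    y_hat = np.fft.fft2(y)
    return y_hat * np.conj(phi_hat) / (phi_hat * np.conj(phi_hat) + gamma)

def respond(f_hat, z):
    """Response map m = IFFT(f_hat * z_hat) for the search-patch features z."""
    return np.real(np.fft.ifft2(f_hat * np.fft.fft2(z)))

def displacement(m):
    """Offset of the response peak from the patch center, as (dy, dx)."""
    h, w = m.shape
    py, px = np.unravel_index(np.argmax(m), m.shape)
    return int(py) - h // 2, int(px) - w // 2

rng = np.random.default_rng(0)
phi = rng.normal(size=(64, 64))                # stand-in for a feature map
y = gaussian_response(phi.shape, sigma=2.0)
f_hat = train_filter(phi, y)
z = np.roll(phi, shift=(3, -5), axis=(0, 1))   # target moved 3 down, 5 left
dy, dx = displacement(respond(f_hat, z))
```

Because the search patch is a circular shift of the training patch, the peak of the response map moves by the same shift, which is how the translation is read off.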

4. Experiments

We validate the proposed method by conducting experiments on both synthetic infrared imagery and real infrared imagery. We first introduce the parameter settings of the experiments. Then, we conduct ablation studies to verify the most important part of the proposal. Finally, we compare its performance with trackers based on the tracking benchmark library.

4.1. Experimental Setup

We construct shifted versions of the aircraft in the initial frame to obtain 256 training samples. The corresponding labels are assigned according to Equation (2). We follow the initial parameter settings of the network. The number of orientations is set to 18, and the threshold value $\tau$ in Equation (8) is 0.2. We train the network with the Stochastic Gradient Descent (SGD) optimizer and a batch size of 16. The choice of learning rate strongly affects the loss curve. Therefore, we conduct experiments with learning rates of different orders of magnitude and randomly select 30 percent of the samples from the training set as the validation set. The corresponding loss curves are shown in Figure 6. With a learning rate of 0.01, the loss function has difficulty converging. With a learning rate of 0.0001, the loss function converges slowly. We therefore set the learning rate to 0.001, so that the loss function converges more smoothly and quickly. The training is stopped after 10 epochs since the loss value decreases little after that, as shown in Figure 6b. After the network is trained, the features from the network are integrated into the correlation filter tracking framework [12,24,25]. The experiments were performed on a PC with an Intel i3-4030U 1.9 GHz CPU and 4 GB of RAM.

4.2. Ablation Studies

The features adopted in the correlation filter tracking framework play a critical role. We perform quantitative analysis to evaluate the use of features from different layers. We follow the evaluation metrics used in [35,36], which include the precision metric and the success metric, and follow the One-Pass Evaluation (OPE) protocol. The success metric is presented with plots that show how the ratio of successful frames varies with the threshold on the overlap ratio between the tracked and ground-truth bounding boxes. The precision metric calculates the percentage of frames whose center location error is within a given threshold. Given the tracking bounding box $B_t$ and the ground-truth bounding box $B_{gt}$, the precision measure $P$ (the center location error) and the overlap ratio $R$ are defined as follows:

$$P = \sqrt{(x_t - x_{gt})^2 + (y_t - y_{gt})^2} \tag{11}$$

$$R = \frac{B_t \cap B_{gt}}{B_t \cup B_{gt}} \tag{12}$$

where $(x_t, y_t)$ and $(x_{gt}, y_{gt})$ are the center coordinates of the tracking bounding box $B_t$ and the ground-truth bounding box $B_{gt}$, respectively. For each frame, we can calculate the precision $P$ and the overlap ratio $R$. Given a precision threshold $P_{th}$, the percentage of frames within the threshold can be computed. Varying $P_{th}$ and plotting the corresponding percentage of frames gives the precision plot. Similarly, plotting the percentage of frames against the overlap threshold $R_{th}$ gives the success plot. The precision score and the overlap score are reported at $P_{th}$ = 20 pixels and $R_{th}$ = 0.5, which is consistent with the parameter settings in [35].
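The two per-frame measures and the scores at the 20-pixel and 0.5 thresholds can be computed as in the following sketch. It assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner, which the text does not specify; the function names and the three example frames are made up for illustration.

```python
import math

def center_error(bt, bgt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    cxt, cyt = bt[0] + bt[2] / 2.0, bt[1] + bt[3] / 2.0
    cxg, cyg = bgt[0] + bgt[2] / 2.0, bgt[1] + bgt[3] / 2.0
    return math.hypot(cxt - cxg, cyt - cyg)

def overlap_ratio(bt, bgt):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    iw = max(0.0, min(bt[0] + bt[2], bgt[0] + bgt[2]) - max(bt[0], bgt[0]))
    ih = max(0.0, min(bt[1] + bt[3], bgt[1] + bgt[3]) - max(bt[1], bgt[1]))
    inter = iw * ih
    union = bt[2] * bt[3] + bgt[2] * bgt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_score(tracked, truth, p_th=20.0):
    """Fraction of frames whose center error is within p_th pixels."""
    hits = sum(center_error(a, b) <= p_th for a, b in zip(tracked, truth))
    return hits / len(tracked)

def success_score(tracked, truth, r_th=0.5):
    """Fraction of frames whose overlap ratio exceeds r_th."""
    hits = sum(overlap_ratio(a, b) > r_th for a, b in zip(tracked, truth))
    return hits / len(tracked)

tracked = [(10, 10, 20, 20), (10, 10, 20, 20), (100, 100, 20, 20)]
truth = [(10, 10, 20, 20), (20, 10, 20, 20), (10, 10, 20, 20)]
precision = precision_score(tracked, truth)
success = success_score(tracked, truth)
```

Sweeping `p_th` or `r_th` over a range of values and plotting the resulting scores yields the precision plot and the success plot, respectively.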

We extract features from each layer of the network to analyze the tracking performance. The experiments are performed based on the Hierarchical Convolutional Features Tracking (HCFT) framework [25]. The corresponding precision plots and success plots are shown in Figure 7. The results are obtained on synthetic infrared imagery, composed of simulated aircraft and a real cloud background. The dataset was collected by the Institute of Flight Control and Simulation Technology. The features from the later layers achieve better performance. Therefore, the features of the last layer are adopted in our subsequent experiments.

To evaluate the effect of initializing the weights with the computation of the HOG features, we perform experiments that include random initialization of all the convolutional layers and initialization of the weights of the first to fourth convolutional layers with the HOG features in turn. The performance comparisons with different initialization parameters are shown in Figure 8. As we can see, initializing the second convolutional layer with the HOG features improves the performance greatly. To better analyze the weights after training, we visualize the changes of the first and second convolutional layers after training with initialization parameters obtained from the HOG features. As shown in Figure 9, the second convolutional layer shows only slight changes, which also indicates the importance of its initialization. Its main distribution is kept after training. Training with parameters initialized from the HOG features therefore acts like fine-tuning the parameters for the current video domain. The best results are achieved by initializing all layers with the HOG features. The tracking results using the HOG features (tracking boxes with green borders) and the features after training (tracking boxes with red borders) are shown in Figure 10. If we adopt the HOG features alone, the tracker begins to drift to the suspected region because of decoy interference. After training the network for the current video domain with the initialization parameters of the HOG features, the tracker learns more discriminative features. As shown in Figure 10, the maximum value of the response maps points to the target region. Thus, combining online training with the HOG-based initialization achieves better performance.

Furthermore, we conduct experiments with different networks to verify the effect of the architecture. We manually remove different layers from the original network and adopt the features of the last layer for comparison. The performance degrades after removing layers, as shown in Figure 11. The performance boost benefits from the combination of the network architecture and the initialization parameters obtained from the HOG features.

To further analyze the effectiveness of the feature learning module, we integrate the module into different frameworks based on correlation filters and compare the tracking performance with different baselines. The experiments are conducted based on the KCF framework [12], the HCFT framework [25], and the ECO framework [24]. The trackers with the online learning module are named KCFOL (Kernelized Correlation Filter with Online Learning), CFOL (Convolutional Features with Online Learning), and ECOOL (Efficient Convolution Operator with Online Learning), respectively. On the basis of the experimental results stated above, KCFOL, CFOL, and ECOOL are equipped with features of the fourth convolutional layer. The evaluation includes the features learned online (ECOOL, CFOL, KCFOL), the features extracted from VGG-Net [28] (HCFT, ECO), and the HOG features (ECO-HC, KCF, HCFT-HOG). As seen from Figure 12, the ECO tracker benefits greatly from the integration of the online learning module (ECOOL), while the KCF tracker gains little from the embedded module. We visualize the tracking results of the ECOOL tracker and the KCFOL tracker. As the target approaches, its scale changes. Since the KCF tracker has no scale estimation module, its tracking results focus on a local region of the target and cannot achieve a good feature representation of the target model. The tracking bounding box easily drifts to the suspected area. The scale estimation method adopted in the ECO tracker can handle the scale change, leading to a better feature representation of the target model and superior tracking performance, as seen from Figure 13.

4.3. Evaluating the Tracking Benchmark

The preceding experiments were conducted on synthetic infrared imagery, composed of simulated aircraft and a real cloud background. The aircraft was simulated with the OpenSceneGraph (OSG) toolkit and was rendered according to its infrared signatures [37,38,39,40]. The generation of the simulated image was integrated with the navigation and guidance processes of the missile. The image of the real cloud was captured by an IRCAM Equus 327 KM. The infrared camera worked in the band of 3–5 μm, with a resolution of 640 × 512 pixels. To demonstrate the effectiveness of our aircraft-tracking algorithm, we conducted the following experiments on both synthetic infrared imagery and real infrared imagery based on the tracking benchmark library [35]. The comparison includes ECO [24], HCFT [15], SiamRPN (Siamese Region Proposal Network) [41], SiamFC (Fully-Convolutional Siamese Networks) [42], and the trackers in the tracking benchmark library [35]. The relevant methods for comparison are summarized as follows.

CF based trackers are trained by solving a linear least-squares problem. The periodic assumption of the samples implied by correlation filters enables efficient training using the fast Fourier transform and can generate a dense response. In the process of implementation, KCF [12] adopts HOG features, while HCFT [15] and ECO [24] adopt CNN features pre-trained from VGG-Net [28]. KCF, HCFT, and ECO are implemented using MATLAB.

Boosting-based trackers consider tracking as a binary classification problem and combine weak classifiers into a strong classifier. OAB (Online Adaptive Boosting) [43] adopts Haar features, orientation histograms, and local binary patterns to generate weak classifiers. To alleviate the drift problem introduced by the online update of the ensemble of classifiers, SemiT (Semi-supervised Tracking) [44] formulates the update process in a semi-supervised fashion, which utilizes both labeled data and unlabeled samples collected during tracking. OAB and SemiT are implemented in the C language.

TLD (Tracking-Learning-Detection) [45] uses positive and negative constraints to restrict the labeling of the unlabeled samples, which in turn guides the training of the binary classifier. The constraints are implemented via Lucas–Kanade and the Normalized Cross-Correlation (NCC). To reduce the dependence on generating training samples from unlabeled data, Struck [46] uses a kernelized structured output support vector machine to directly predict the change in object location. The features adopted in TLD and Struck are binary patterns and Haar features, respectively. Besides, TLD is implemented using MATLAB and the C language, while Struck is carried out using the C language.

Trackers based on sparse representation express a target by a sparse linear combination of a few trivial templates. In this category, the L1APG (L1 Accelerated Proximal Gradient) [47] tracker adopts a holistic representation and tracks the object by solving an L1 minimization problem. ASLA (Adaptive Structural Local Appearance) [48] utilizes a structural local sparse model and an alignment-pooling method across the local patches to measure the similarity between the candidate regions and the target model. Both are implemented within the particle filter framework, and the optimal state is computed by maximum a posteriori estimation. L1APG and ASLA are implemented in MATLAB.

The Siamese network consists of two branches, and the parameters of the two branches are tied so that an identical transformation is applied to the exemplar image and the candidate image. SiamFC [42] formulates tracking as learning similarity functions. In its implementation, the similarity functions are trained on ImageNet Video with the convolutional features of AlexNet [33]. SiamRPN [41] adopts a region proposal network instead of the multi-scale test adopted in SiamFC to obtain a better estimation of the scale. The training set of SiamRPN includes ImageNet Video and YouTube-BB (YouTube Bounding Boxes) [49]. SiamFC and SiamRPN are implemented with MatConvNet (a MATLAB toolbox implementing Convolutional Networks) and PyTorch, respectively. The code for KCF, OAB, SemiT, TLD, Struck, L1APG, and ASLA is provided in the tracking benchmark library [35], and the code for HCFT, ECO, SiamFC, and SiamRPN is provided by the respective authors.

The evaluation follows the OPE protocol used in [35], and the details of the dataset used in the experiments are listed in Table 2. The evaluated tracking algorithms are summarized in Table 3. Trackers with dense sampling (TLD, Struck) provide a large search range and achieve better performance. Among the evaluation results, discriminative trackers (ECO [24], TLD [45], Struck [46]) perform better than trackers based on generative models (L1APG [47], ASLA [48]). Discriminative trackers employ information from both the target and the background and train a classifier to distinguish the target from the background [7]. For trackers based on generative models, it is difficult to learn the generative appearance model of the target in a complex background. For aircraft tracking in infrared imagery, the aircraft may be frequently occluded by a cloud or an infrared decoy, resulting in inaccurate target models. As seen from Figure 14, ASLA and L1APG drift to the decoy and the cloud in Frame 75. The online update of the ensemble of weak classifiers helps distinguish the target from the background, but it also introduces errors due to frequent updates. SiamRPN and SiamFC benefit from the large training dataset to learn similarity functions. However, they lack an efficient model update mechanism to handle appearance changes, leading to model drift. Notice that in Frame 107, TLD loses the target. However, in Frame 108, TLD is re-initialized by its detector and successfully locks onto the target again. Both HCFT and ECO adopt CNN features to improve the performance. Instead of simply resampling all feature channels at the same resolution, ECO adopts continuous convolution operators to integrate feature channels, which enables more accurate localization. After replacing the pre-trained CNN features with features learned online (CFOL, ECOOL), the performance of the baseline methods is improved. The overall performance is summarized by precision plots and success plots. For clarity, only the top 10 trackers are presented, as shown in Figure 15. Qualitative comparisons with the top-performing trackers are shown in Figure 16.

5. Conclusions

In this paper, we propose an effective algorithm for aircraft tracking in infrared imagery. We integrate domain-specific features learned online and general feature representations in a unified convolutional network. The training of the network is consistent with the training mechanism of the correlation filters. Therefore, the features learned are closely related to both the current video domain and the trackers based on correlation filters. The introduced feature learning method can be integrated into the tracking framework as a flexible module to improve the baseline method. Experimental results show that the proposed algorithm achieves competitive performance in terms of accuracy and robustness.

Table 1. Parameters of the convolutional layers of the network.

Layer	Kernel Size	Channels	Stride	Padding
Conv1	3 × 3	18	1	1
Conv2	8 × 8	18	4	2
Conv3	2 × 2	108	1	0
Conv4	2 × 2	27	1	1
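As a worked example of how the kernel size, stride, and padding in the layer table above determine the feature-map size, the sketch below applies the standard convolution output-size formula, floor((n + 2p − k) / s) + 1, layer by layer. The 64 × 64 input size is a hypothetical choice, and the sketch assumes that only these four convolutional layers change the spatial size (the norm and clamp layers are taken to preserve it).

```python
# (name, kernel, channels, stride, padding), copied from the layer table.
LAYERS = [
    ("Conv1", 3, 18, 1, 1),
    ("Conv2", 8, 18, 4, 2),
    ("Conv3", 2, 108, 1, 0),
    ("Conv4", 2, 27, 1, 1),
]

def conv_output_size(n, kernel, stride, padding):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_shapes(input_size, layers=LAYERS):
    """Return (name, channels, height, width) after each layer."""
    shapes = []
    n = input_size
    for name, kernel, channels, stride, padding in layers:
        n = conv_output_size(n, kernel, stride, padding)
        shapes.append((name, channels, n, n))
    return shapes

shapes = trace_shapes(64)   # hypothetical 64 x 64 input patch
```

Under these assumptions the spatial size goes 64 → 64 → 16 → 15 → 16, so the last layer produces a feature map at roughly one quarter of the input resolution.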

Table 2. Details of the dataset used in the experiments.

Dataset	Number of Sequences	Max Frames	Min Frames	Total Frames	Bit Depth (bits)	Resolution (pixels)
Synthetic imagery	6	450	316	2304	8	128 × 128
Real imagery	4	2831	821	8832	8	640 × 512

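Table 3. Summary of the evaluated tracking algorithms. DM: discriminative model; GM: generative model; DS: dense sampling; PF: particle filter; MU: model update; BP: binary pattern; OH: orientation histogram.

Model	Category	Tracker	Feature	Search	MU	Precision	Overlap
DM	CF based	KCF	HOG	DS	Y	0.668	0.348
DM	CF based	HCFT	Pretrained CNN	DS	Y	0.741	0.377
DM	CF based	CFOL	CNN learned online	DS	Y	0.748	0.439
DM	CF based	ECO	Pretrained CNN	DS	Y	0.898	0.764
DM	CF based	ECOOL	CNN learned online	DS	Y	0.932	0.834
DM	Boosting based	OAB	Haar, BP, OH	DS	Y	0.701	0.408
DM	Boosting based	SemiT	Haar	DS	Y	0.652	0.359
DM	-	Struck	Haar	DS	Y	0.749	0.471
DM	-	TLD	BP	DS	Y	0.841	0.434
GM	-	ASLA	Sparse	PF	Y	0.489	0.384
GM	-	L1APG	Sparse	PF	Y	0.515	0.356
Siamese	-	SiamRPN	Pretrained CNN	DS	N	0.638	0.448
Siamese	-	SiamFC	Pretrained CNN	DS	N	0.643	0.471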

Author Contributions

S.W. designed the algorithm, conducted the experiments, and wrote the paper. S.L. analyzed the data and assisted in writing the manuscript. K.Z. and J.Y. supervised the study and reviewed this paper. All authors read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant Number 61703337 and the Aerospace Science and Technology Innovation Fund of China under Grant Number SAST2017-082.

Acknowledgments

The authors would like to thank the Institute of Flight Control and Simulation Technology for providing the simulated infrared image sequences.

Conflicts of Interest

The authors declare no conflict of interest.

© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Airborne target tracking in infrared imagery remains a challenging task. The airborne target usually has a low signal-to-noise ratio and shows different visual patterns. The features adopted in visual tracking algorithms are usually deep features pre-trained on ImageNet, which are not tightly coupled with the current video domain and therefore might not be optimal for infrared target tracking. To this end, we propose a new approach to learn domain-specific features, which can be adapted to the current video online without pre-training on large datasets. Considering that only a few samples from the initial frame can be used for online training, general feature representations are encoded into the network for a better initialization. The feature learning module is flexible and can be integrated into tracking frameworks based on correlation filters to improve the baseline method. Experiments on airborne infrared imagery are conducted to demonstrate the effectiveness of our tracking algorithm.

Details

Title

Learning to Track Aircraft in Infrared Imagery

First page

3995

Publication year

2020

Publication date

2020

Publisher

MDPI AG

e-ISSN

20724292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/rs12233995

ProQuest document ID

2468949431
