This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Unmanned helicopters (UHs) are highly manoeuvrable, easy to conceal, and able to avoid radar detection by flying at ultralow altitude, so they are widely used to raid important targets on the battlefield. However, an ordinary UH still needs to be operated by rear personnel to complete a series of tasks; this merely moves the operator from the front line to the rear and does not achieve true unmanned operation. An intelligent UH should complete a series of tasks through autonomous decision-making, so as to operate fully autonomously without human control. To achieve this, research on real-time communication, resource allocation, and path planning needs more attention [1–3].
As one of the key technologies for unmanned systems to achieve intelligence, path planning plays an important role in improving the intelligence, safety, and adaptability of UHs [4]. A UH needs to make a series of decisions under the guidance of a safe path to achieve autonomous movement, so path planning is the basis for a UH to move towards its target. To ensure the safety of the UH during flight, path planning must consider a large number of constraints, such as the battlefield environment and the manoeuvrability of the UH. Since many elements of the battlefield environment pose a serious threat to the safe flight of the UH, the path planning of the UH faces complex constraints.
In recent years, researchers have proposed a series of solutions to the path planning problem of unmanned aerial vehicles (UAVs). A new metaheuristic grey wolf optimizer (GWO) was proposed to solve the UCAV two-dimensional path planning problem in the literature [5], which fully considers the threats and constraints of the battlefield environment. In the literature [6], an improved pigeon-inspired optimization algorithm (PIOFOA) was proposed to solve path planning problems in a three-dimensional dynamic oilfield environment. An improved constrained differential evolution (DE) algorithm, which combines the DE algorithm with the level comparison method, was proposed in the literature [7] to find the optimal route within feasible regions. An adaptive selection mutation-constrained differential evolution algorithm was proposed in [8], where UAV path planning was modelled as an optimization problem whose fitness functions include the travelling distance and the risk of the UAV and whose three constraints involve the height, angle, and slope of the UAV. On the one hand, the environmental threats in existing research are usually static, and the threat area is completely impassable. This way of modelling constraints reduces the difficulty of avoiding dangerous areas to a certain extent. However, there are usually many dynamically changing threat areas in the battlefield environment, and it is difficult for the algorithms in the above studies to accurately identify and avoid them. On the other hand, most of the above studies were conducted on general-purpose UAVs, and few studies were conducted on UHs alone. There is a big difference between UAVs and UHs in terms of flight height and usage scenarios. First, UAVs usually cruise at high altitudes of around ten thousand meters, while UHs usually operate at low altitudes of thousands or even hundreds of meters. Second, UAVs are usually used for high-altitude reconnaissance, confrontation, and other tasks on the battlefield, while UHs are more often used for low-altitude raid missions. Therefore, it is necessary to study UHs separately from UAVs. In summary, the existing path planning algorithms cannot fully meet the path planning requirements of UH in complex battlefield environments.
UH needs to fly in low airspace for a long time during a raid mission, facing ground obstacles and radar threats. Ground obstacles such as mountains are usually stationary, and it is not difficult to accurately identify and avoid them. However, due to factors such as terrain and the curvature of the earth, the probability of the UH being detected by radar varies with flight height, which means that the UH faces a dynamically changing threat area for a long time. Traditional path planning algorithms generate optimal paths based on real-time environmental information, which is effective in the face of static threat areas. However, the security status of locations within a dynamic threat area changes in real time: some locations are passable at one moment and impassable at another. Therefore, the paths planned by traditional path planning algorithms are not absolutely safe in the face of dynamic threat areas. A good solution to this problem is to accurately identify the dynamic threat area and avoid it entirely. The deep Q-network (DQN) algorithm combines a neural network with the Q-learning algorithm. It can not only process large state spaces but also interact with the environment to seek optimal strategies when the environment state is unknown. With appropriate reward settings, the DQN algorithm can accurately identify the dynamic threat area and learn to avoid it by interacting with the environment. Therefore, using the DQN algorithm for path planning can help the UH effectively avoid dynamic threat areas in the battlefield environment. This paper aims to provide an effective path planning method based on the DQN algorithm, which can help the UH effectively avoid the dynamically changing radar coverage area and successfully complete the raid task in a complex environment.
Based on the above analysis, a heuristic deep Q-network (H-DQN) algorithm is proposed in this paper. We study the ability of the proposed algorithm to plan paths for UH in complex environments and try to make the planned paths smoother, thereby reducing the manoeuvring consumption of the UH. Compared with traditional algorithms, the H-DQN algorithm can effectively identify the dynamically changing radar coverage area and help the UH plan a safe and effective flight path. The main contributions of this paper are as follows:
(1) A heuristic comprehensive reward function is designed, which mainly includes two parts: a heuristic reward function and a smooth reward function. The heuristic reward function promotes the rapid convergence of the algorithm and effectively alleviates the sparse reward problem faced by traditional reinforcement learning. The smooth reward function constrains the UH’s behaviour, prompting the UH to choose a smoother flight path and thereby reducing flight consumption. The proposed heuristic comprehensive reward function integrates the information of environmental constraints and motion constraints, which effectively speeds up the convergence of the algorithm and further improves the quality of the planned path. It is fairly general and can be combined with other intelligent algorithms
(2) We model the dynamic threat constraints faced by UH in low-altitude raid missions and apply the proposed H-DQN algorithm to UH path planning. The modelling of dynamic constraints fully considers the complexity of the battlefield environment and reflects the difficult situations faced by UH on the battlefield. The way the proposed algorithm is embedded in the environment model for path planning is described in detail, which provides a new solution to the path planning problem
The rest of this paper is structured as follows. Related works are presented in the next section. In Section 3, numerical analysis and modelling of the complex low airspace environment faced by UH are carried out. Section 4 introduces deep reinforcement learning methods. In Section 5, the design of the comprehensive reward function and the proposed H-DQN algorithm are explained in detail. In Section 6, the experimental results and comparative analysis are presented. The conclusions are presented in Section 7.
2. Related Works
Path planning technology usually refers to finding the optimal path from the starting position to the target position according to certain evaluation criteria under certain environmental constraints [9]. Path planning algorithms are usually divided into global path planning algorithms and local path planning algorithms [10]. Among them, global path planning requires the environmental model to be known; the algorithm can then generate a globally optimal path according to the environmental constraints, and representative examples include the A∗ and Dijkstra algorithms [11–14].
The local path planning algorithm can make corresponding decisions according to local environment information and, when global information is unknown, explore a passable path by interacting with the environment. Representative algorithms include the genetic algorithm, the dynamic window approach (DWA), the ant colony algorithm, the particle swarm algorithm, and the artificial potential field method [15–20]. In the literature [15], the authors find the optimal flight path for a UAV by using an improved genetic algorithm with a new genetic factor on the basis of a probability map. Aiming at the problem that the classical DWA produces unreasonable paths in dense obstacles and cannot guarantee speed and safety at the same time, the literature [16] proposes an adaptive DWA algorithm, which is successfully applied to the local path planning of a robot. A heterogeneous UAV coverage path planning algorithm based on the ant colony algorithm is proposed in [17]; the authors applied the ant colony algorithm to a cooperative search system to minimize the time consumption of the task. An improved particle swarm algorithm was proposed in the literature [18] to solve the path planning problem of a UAV in adversarial environments including radar-guided surface-to-air missiles (SAMs) and unknown threats. To efficiently collect underwater information, the literature [19] proposed a heterogeneous AUV-aided information collection system that maximizes the energy efficiency of IoUT nodes while taking into account the AUV trajectory, resource allocation, and the Age of Information (AoI); a particle swarm optimization algorithm was used for the trajectory planning of the underwater robots. To ensure the optimality, rationality, and path continuity of the formation trajectory of unmanned surface vehicles, a deterministic multi-subtarget artificial potential field (MTAPF) algorithm based on an improved APF was proposed in the literature [20]. MTAPF can greatly reduce the probability of the USV falling into a local minimum and help the USV escape from local minima by switching the target point. However, for these algorithms convergence is generally difficult to guarantee and they easily fall into local optima, so their applicable scenarios are relatively limited. Among them, the genetic algorithm has been widely used in many fields because of its strong scalability and ease of combination with other algorithms [21]. In the literature [22], an improved cost function for grid path planning in a 2D static environment based on the genetic algorithm (GA) was proposed to reduce the energy consumption of AUVs. A genetic algorithm was used to determine the optimized path with the minimum travel time for a USV under environmental loads in the literature [23]. A new hybrid algorithm based on the genetic algorithm and the firefly algorithm was proposed in the literature [24] to solve the path planning problem of mobile robots. It is worth noting that the parameters of the genetic algorithm are numerous and complex, so its path search is inefficient and time-consuming.
Introducing reinforcement learning into path planning has been a research hotspot in recent years. Reinforcement learning is a learning method that maps from the environment state to actions. By constructing a Markov decision model, the learner repeatedly interacts with and explores the environment to learn the optimal strategy. Reinforcement learning does not require complete prior knowledge. Since learners can independently obtain optimal behaviour strategies through dynamic interaction with an unfamiliar environment, applying reinforcement learning to path planning has certain advantages. According to the way the policy is updated, reinforcement learning can be classified into value function-based and policy gradient-based methods [25], among which value function-based reinforcement learning is more widely used. As a value function-based reinforcement learning algorithm, the Q-learning algorithm has been widely used in the field of path planning. To demonstrate the ability of the Q-learning algorithm to interact with the environment, it was used to extract the state of the environment in the literature [26], and the path planning task of a mobile robot in an unknown environment was accomplished by combining the Q-learning algorithm with the dynamic window approach. In the literature [27], the Q-learning algorithm was used to complete the autonomous navigation and control of intelligent ships in simulated waterways; the authors modelled the environmental information during the ship’s navigation and set environmental factors such as obstacles and restricted areas as reward and punishment information. By combining the Q-learning algorithm, a multi-AUV collaborative data acquisition algorithm was proposed in the literature [28], which can reduce the data acquisition load of a single AUV and serve as a path planning algorithm for autonomous underwater vehicles. However, as the environment becomes more complex, the state space of the environment also becomes larger; the resulting state space explosion makes it difficult for the traditional Q-learning algorithm to converge.
Relevant studies have shown that deep reinforcement learning, formed by combining deep learning and reinforcement learning, can effectively alleviate the state space explosion problem [29]. The DQN algorithm is formed by combining a deep neural network with the Q-learning algorithm, and its appearance further advanced the solution of path planning problems. In the literature [30], a deep reinforcement learning method, ANOA, based on the dueling deep Q-network was proposed, with a tailored design of the state and action spaces and the reward function. In the literature [31], a smoothly convergent DRL (SCDRL) method based on the deep Q-network (DQN) and reinforcement learning was proposed to solve the path-following problem for an underactuated unmanned surface vessel (USV). Aiming at the problems of vehicle model tracking error and overdependence in traditional path planning of intelligent driving vehicles, a path planning method for intelligent driving vehicles based on deep reinforcement learning was proposed in the literature [32]. A novel hierarchical framework to achieve real-time path planning and following for a gliding robotic dolphin was proposed in the literature [33], which presents a hierarchical deep Q-network method that separately plans the collision avoidance path and the approach path and designs different continuous states under the kinematic constraints.
Based on the above analysis, the DQN algorithm has obvious advantages in solving path planning problems in complex environments. In view of the complexity of the UH raid task model, a heuristic DQN algorithm for UH path planning, built on the deep reinforcement learning DQN algorithm, is designed in this paper.
3. Environment Model
Figure 1 is an illustration of the battlefield environment that UH faces when performing low airspace raid missions. The helicopter flight area
[figure(s) omitted; refer to PDF]
The experimental environment is set as a low airspace with a length of 50 km and a height of 1 km. UH’s mission is to raid radar positions 50 km away. It is assumed that the UH is equipped with a radar warning device that can determine whether it is locked by the radar, and the position of the target radar is known. The position coordinates of the radar are expressed as
UH avoids colliding with mountains during flight. Assuming that the height of the mountain is 0.15 km, its position is expressed as
As shown in Figure 2, due to the powerful manoeuvrability of UH, it can perform movements with 8 degrees of freedom within
[figure(s) omitted; refer to PDF]
Then, the position of UH in
In the process of UH flight, the distance from the obstacle is greater than the safety radius
Due to the large difference between the horizontal and vertical speeds of UH, its safe radius
The maximum attack distance of UH is 8 km. Assuming that each attack of UH is regarded as a hit, the condition for completing the task is that the distance
The maximum detection range of the radar is 45 km. Due to factors such as ground reflection clutter and the curvature of the earth, it is usually difficult for radars to detect low-flying targets. The radar detection probability is expressed as
In equation (8),
[figure(s) omitted; refer to PDF]
In Figure 3, the
Based on the above information, the passable condition
Modelling the battlefield environment is the first step in path planning. Through the above numerical analysis, we introduce the entire battlefield environment in detail, define the movement mode and behaviour constraints of UH, and clarify the passable and impassable areas in the environment. In order to successfully complete the raid mission, UH needs to reach the attack area safely, and in the process, it needs to avoid hitting mountains and being detected by radar. Since the radar threat area is dynamic, the best way to keep UH safe is to avoid crossing the radar coverage area. Therefore, to measure the quality of the planned path, indicators such as path length, path smoothness, whether there is a collision, and whether it crosses the radar coverage area should be integrated. The ideal planned path should have a short length and good smoothness, so as to effectively reduce the flight consumption of UH. Avoiding collisions and avoiding crossing the radar area are prerequisites for UH safety. It can be seen from Figure 3 that within the range of the flight height
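To make the passability condition concrete, the following is a minimal Python sketch of the safety check described in this section. The exact detection-probability model of equation (8) and the mountain geometry were lost in extraction, so detection_probability() and the obstacle test below are assumed placeholders (detectability growing with flight height and falling off toward the radar's 45 km maximum range, mountains of height 0.15 km treated as vertical obstacles); only the 45 km range, the 0.15 km mountain height, and the general height dependence come from the text.

```python
import numpy as np

RADAR_MAX_RANGE_KM = 45.0   # maximum radar detection range stated in Section 3
MOUNTAIN_HEIGHT_KM = 0.15   # assumed mountain height stated in Section 3

def detection_probability(uh_pos, radar_pos, max_height_km=1.0):
    """Assumed height/range-dependent detection probability (placeholder for equation (8))."""
    dist = np.linalg.norm(np.subtract(uh_pos, radar_pos))
    if dist > RADAR_MAX_RANGE_KM:
        return 0.0
    height_factor = uh_pos[1] / max_height_km        # higher flight -> easier to detect
    range_factor = 1.0 - dist / RADAR_MAX_RANGE_KM   # closer to the radar -> easier to detect
    return float(np.clip(height_factor * range_factor, 0.0, 1.0))

def is_passable(uh_pos, radar_pos, mountain_xs, safety_radius_km, p_threshold=0.5):
    """A position is passable if it keeps a safe distance from mountains and
    its radar detection probability stays below a threshold."""
    x, h = uh_pos
    for mx in mountain_xs:  # mountains modelled as vertical obstacles at positions mx
        if abs(x - mx) < safety_radius_km and h < MOUNTAIN_HEIGHT_KM + safety_radius_km:
            return False
    return detection_probability(uh_pos, radar_pos) < p_threshold
```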
4. Deep Reinforcement Learning Methods
Selecting a suitable algorithm model for path search is at the core of path planning. In this section, we introduce the reinforcement learning Q-learning algorithm and the deep reinforcement learning DQN algorithm, respectively, and explain the experience replay and target network mechanisms in the DQN algorithm. These algorithms are key to completing the path search and form the basis of our proposed algorithm.
4.1. Reinforcement Learning
Four elements, the state set $S$, the action set $A$, the state transition probability $P$, and the reward $R$, constitute the Markov decision process on which reinforcement learning is built.
Due to the Markov property of reinforcement learning, the state-action value function can be expressed as
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \right],$$
where $\gamma \in [0, 1]$ is the discount factor that weights future rewards.
Q-learning is a relatively mature and widely used reinforcement learning algorithm. It is a value function-based reinforcement learning algorithm, and its update rule can be expressed as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $\alpha$ is the learning rate.
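As a concrete illustration of this update rule, the following is a minimal tabular Q-learning sketch in Python; the table size (500 grid cells and 8 actions, matching Section 5.1) and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Tabular Q-learning update: move Q(s, a) toward the TD target.
n_states, n_actions = 500, 8        # grid cells and movement directions (Section 5.1)
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9             # assumed learning rate and discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step for transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```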
4.2. Deep Reinforcement Learning
The combination of reinforcement learning and neural networks was studied early on, but simply combining the two did not achieve the desired effect [34]. The proposal of the DQN algorithm provided a powerful boost for the development of deep reinforcement learning [35]. Experience replay and the target network mechanism are important reasons for the success of the DQN algorithm. Since the correlation between consecutive samples in deep reinforcement learning is much stronger than in tabular reinforcement learning, the purpose of experience replay is to break this correlation so that the gradient descent updates of the deep neural network remain stable and the algorithm converges. At the same time, the experience replay mechanism requires the algorithm to randomly sample training samples from the experience pool, which improves data utilization. The experience replay mechanism effectively addresses three problems: overcoming the correlation of empirical data, reducing the variance of parameter updates, and overcoming the nonstationary distribution problem [36].
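The experience replay mechanism described above can be sketched in a few lines of Python; the buffer capacity and batch size below are illustrative assumptions, not the paper's settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience pool sampled uniformly at random, which breaks
    the temporal correlation between consecutive transitions."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # oldest transitions are discarded first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Draw a decorrelated mini-batch of stored transitions."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```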
The principle of the DQN algorithm is to combine reinforcement learning and deep neural networks: the Q-learning algorithm provides labelled samples for the neural network, and backpropagation with gradient descent is then used to update the network parameters. The DQN algorithm uses a neural network $Q(s, a; \theta)$ to fit the update process of the Q-learning algorithm, with the training target
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$
where $\theta^{-}$ denotes the parameters of the target network. The update process uses stochastic gradient descent to update the network parameters $\theta$ by minimizing the loss function
$$L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^{2} \right].$$
The pseudocode of the DQN algorithm is given in Algorithm 1:
Algorithm 1: DQN algorithm.
Initialization: initialize the training network parameters θ, the target network parameters θ⁻ = θ, and the replay memory D with capacity N
Iterative process:
Repeat (for each episode)
  Initialize state s_1
  Repeat (for each step t)
    Select action a_t with the ε-greedy policy based on Q(s_t, ·; θ)
    Perform action a_t, observe reward r_t and the next state s_{t+1}
    Store transition (s_t, a_t, r_t, s_{t+1}) in D
    Sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}) from D
    Set the target y_j = r_j if s_{j+1} is terminal; otherwise y_j = r_j + γ max_{a′} Q(s_{j+1}, a′; θ⁻)
    Loss function L(θ) = (y_j − Q(s_j, a_j; θ))²
    Updating network parameters: perform a gradient descent step on L(θ) with respect to θ; every C steps set θ⁻ ← θ
  End Repeat (until the terminal state is reached)
End Repeat (end of training)
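To make Algorithm 1 concrete, the following is a minimal Python/TensorFlow sketch of the training loop (TensorFlow 2.x is the framework named in Section 6). The network architecture, the environment interface (env.reset/env.step returning state, reward, and a done flag), and all hyperparameter values are illustrative assumptions rather than the paper's exact settings.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    """Small fully connected Q-network; layer sizes are assumptions."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),              # one Q-value per action
    ])

def train_dqn(env, state_dim, n_actions, episodes=500, gamma=0.9,
              epsilon=0.1, batch_size=32, buffer_size=10000, target_sync=100):
    q_net = build_q_network(state_dim, n_actions)       # training network (theta)
    target_net = build_q_network(state_dim, n_actions)  # target network   (theta-)
    target_net.set_weights(q_net.get_weights())
    optimizer = tf.keras.optimizers.Adam(1e-3)
    replay = deque(maxlen=buffer_size)                  # experience pool D
    step_count = 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                q_values = q_net(np.array([state], dtype=np.float32))
                action = int(tf.argmax(q_values[0]))

            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state
            step_count += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                # TD target: r + gamma * max_a' Q(s', a'; theta-) for non-terminal s'
                q_next = target_net(s2.astype(np.float32)).numpy().max(axis=1)
                targets = (r + gamma * q_next * (1.0 - d.astype(np.float32))).astype(np.float32)
                with tf.GradientTape() as tape:
                    q_pred = tf.reduce_sum(
                        q_net(s.astype(np.float32)) * tf.one_hot(a, n_actions), axis=1)
                    loss = tf.reduce_mean(tf.square(targets - q_pred))
                grads = tape.gradient(loss, q_net.trainable_variables)
                optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

            if step_count % target_sync == 0:            # periodic target-network update
                target_net.set_weights(q_net.get_weights())
    return q_net
```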
5. Heuristic Deep Q-Network Algorithm
In this section, we describe the proposed heuristic DQN algorithm in detail and embed it in the UH path planning task. We first describe the state and action sets of the UH low-altitude raid task model and then design the heuristic comprehensive reward function.
5.1. State and Action Sets
The partitioning of state and action sets is the first step in reinforcement learning algorithms. Since the UH is moving in the environment, its motion path is a time-dependent nonlinear function. Considering that the DQN algorithm requires the state to be discrete, the environment model needs to be discretized. The grid method can be used to discretize the system environment. First, the airspace environment is divided into 500 squares using
UH performs one action per time step. Since the UH can move in 8 directions, the action set
The movement direction of
Path search is a key step in path planning, and the division of state-action sets is a prerequisite for path search. The state set can effectively pass the UH position and environmental constraints to the algorithm for path search. The action set defines the movement mode of the UH in the environment and further clarifies the action constraints.
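As an illustration of the discretization and the 8-action set described above, the following minimal Python sketch assumes the 50 km × 1 km airspace is split into 50 × 10 = 500 grid cells (consistent with the "500 squares" mentioned above, though the exact cell size was lost in extraction) and takes the 8 actions to be the unit moves to the neighbouring cells.

```python
import numpy as np

GRID_X, GRID_H = 50, 10                              # assumed cells along range and height
ACTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1),       # forward, backward, up, down
           (+1, +1), (+1, -1), (-1, +1), (-1, -1)]   # diagonal moves

def step_position(state, action_idx):
    """Apply one of the 8 actions and clip the result to the flight area."""
    x, h = state
    dx, dh = ACTIONS[action_idx]
    return (int(np.clip(x + dx, 0, GRID_X - 1)),
            int(np.clip(h + dh, 0, GRID_H - 1)))
```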
5.2. Comprehensive Reward Function
The setting of the reward function is crucial for reinforcement learning algorithms. An appropriate reward function can effectively promote convergence, while an inappropriate reward function may make convergence difficult [37]. In traditional reinforcement learning, the learner receives a corresponding reward when it completes the task, while the preceding series of behaviours receives no reward. Some studies have pointed out that this kind of reward setting leads to the sparse reward problem in complex environments [38]. When the set of environmental states is large, the learner passes through a long series of states without feedback before completing the task; since effective rewards cannot be obtained in time, the algorithm is difficult to converge. To alleviate this problem, we design a heuristic reward function
Considering the motion constraints, frequently changing the direction of motion is unfavourable for the UH, especially large turns. Frequent direction changes increase the flight consumption of the UH and also affect flight safety. In order to make the planned path smoother, the smooth path reward function
It is worth noting that both
Completing the task is strongly encouraged, so it yields a large reward. We summarize the above reward settings into a comprehensive reward function
To sum up, the comprehensive reward function designed here can generate dynamic rewards in real time in combination with environmental information, so that the UH has good control performance and the planned path becomes smoother. The heuristic comprehensive reward function can optimize the search according to the continuously estimated environmental cost information, which makes the process of reward accumulation smoother, thus effectively alleviating the sparse reward problem in complex environments. The positional relationship between the UH and the radar target, the motion constraints, and the environmental constraints are effectively integrated by the reward function, which further improves the efficiency of the path search.
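The composition of the comprehensive reward can be sketched as follows. The decomposition into a heuristic reward, a smooth reward, a collision penalty, and a completion reward follows the abstract and Section 5.2, but the weights and the exact functional forms below are illustrative assumptions, since the paper's coefficients are not recoverable from the extracted text.

```python
import numpy as np

def heuristic_reward(pos, next_pos, target):
    """Positive when the step moves the UH closer to the target."""
    return (np.linalg.norm(np.subtract(pos, target))
            - np.linalg.norm(np.subtract(next_pos, target)))

def smooth_reward(prev_action, action):
    """Penalize changes of movement direction; zero when the heading is kept."""
    return 0.0 if prev_action == action else -1.0

def comprehensive_reward(pos, next_pos, target, prev_action, action,
                         collided, task_done, w=(1.0, 0.5, 10.0, 10.0)):
    """Weighted sum of the four reward components; weights w are placeholders."""
    w1, w2, w3, w4 = w
    r = w1 * heuristic_reward(pos, next_pos, target)
    r += w2 * smooth_reward(prev_action, action)
    if collided:      # hit a mountain or entered the radar coverage area
        r -= w3
    if task_done:     # reached the attack area safely
        r += w4
    return r
```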
5.3. Heuristic Deep Q-Network Algorithm Model
Figure 4 shows the algorithm model, which clearly shows the whole process of using H-DQN for path search. It can be seen from the figure that after the algorithm runs, the
[figure(s) omitted; refer to PDF]
The division of the state-action set is a key step in embedding the reinforcement learning algorithm model into the path planning problem, and designing an appropriate reward function is an important way to improve the performance of the algorithm. We describe these procedures in detail and frame the proposed algorithm so that these results can conveniently be extended to more general path planning problems. It is worth noting that the comprehensive reward function we designed has remarkable generality: it is not only applicable to the DQN algorithm and its variants but can also be combined with other intelligent algorithms that require reward constraints.
6. Simulation Experiment
In this section, the performance of the proposed H-DQN algorithm is evaluated through comparative experiments. To ensure the validity of the experiments, all experiments were carried out in the same environment. The construction of the experimental environment and the implementation of the algorithm are done in Python on the PyCharm platform, using Python 3.6.10, with the neural network built with TensorFlow 2.6.0. All experiments were performed on the same computer with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz (twelve logical processors), an NVIDIA GeForce GT 430 GPU, and 16 GB of RAM.
6.1. Experimental Parameter Settings
The learning rate
[figure(s) omitted; refer to PDF]
Figure 5 shows the performance of the algorithm under different learning rate values. It can be seen from the figure that the performance of the algorithm is affected by the learning rate, because the learning rate determines how much each update overrides the previous estimate. The larger the learning rate α, the more weight is given to new experience and the faster the training proceeds, but the algorithm becomes less stable and prone to oscillation. The smaller the learning rate, the less weight is given to new experience and the slower the training proceeds; the algorithm then becomes relatively stable, but an overly slow training speed brings more time overhead, which is also unacceptable in certain circumstances. According to the results of this experiment, when the learning rate
In addition to the learning rate
The influence of other parameters of the algorithm is as follows: if the exploration factor
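For reference, the hyperparameters discussed above can be collected into one configuration and passed to the train_dqn() sketch given in Section 4; all values below are illustrative placeholders rather than the paper's reported settings (the optimizer learning rate is set inside train_dqn in that sketch).

```python
# Hypothetical hyperparameter configuration for the train_dqn() sketch.
hyperparams = dict(
    episodes=1000,    # number of training episodes
    gamma=0.9,        # discount factor: weight of future rewards
    epsilon=0.1,      # exploration factor of the epsilon-greedy policy
    batch_size=32,    # mini-batch size sampled from the replay memory
    target_sync=200,  # steps between target-network updates
)
# q_net = train_dqn(env, state_dim=2, n_actions=8, **hyperparams)
```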
6.2. The Effect of Reward Coefficients
Since the comprehensive reward function designed here is obtained by the weighted addition of its component reward functions, the influence of the weight of each component on the overall performance of the algorithm is worth discussing. In this section, we analyse the influence of each coefficient of the comprehensive reward function through multiple comparative experiments. Due to the particularity of the definition,
[figure(s) omitted; refer to PDF]
Figure 6 shows the score of the algorithm when the parameters
[figure(s) omitted; refer to PDF]
It can be seen from Figure 7 that when
It can be seen from the above experimental results that the heuristic reward can provide heuristic information for UH, guide UH to move closer to the goal, and effectively promote the algorithm convergence. The introduction of the smooth reward function can effectively promote the path planned by the algorithm to be smoother, but when the weight of the smooth reward function exceeds a certain value, its smoothing ability begins to gradually weaken. Combining Figures 7–9, we can see that in our experimental environment, when
6.3. Compare with Other Algorithms
To further demonstrate the performance of the H-DQN algorithm proposed in this paper, the representative path planning algorithms A∗ and GA are selected for comparative experiments
[figure(s) omitted; refer to PDF]
It can be seen from Figure 10 that although GA and A∗
Figures 11 and 12 show the comparison of the average length and average smoothness of the paths planned by the GA, A∗
7. Conclusions
In this paper, an H-DQN algorithm for path planning of UH in complex low airspace environments is proposed. Numerical modelling and analysis of UH’s low airspace raid mission environment are carried out. On this basis, the corresponding state space, action space, and comprehensive reward function of the task model are given. To alleviate the sparse reward problem of traditional reinforcement learning algorithms, a heuristic reward function is designed to guide the algorithm to converge quickly. The introduction of a smooth reward function constrains the behaviour of the UH and makes the planned path smoother. The simulation experiments analyse the influence of the weight of each part of the reward function on the performance of the algorithm and compare the proposed algorithm with traditional path planning algorithms. The experimental results show that the proposed H-DQN algorithm has better performance and faster convergence speed and can help UH successfully complete the raid task. In future work, we will consider combining the comprehensive reward function with more intelligent algorithms to verify its effectiveness in different experimental settings.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62071483 and Grant No. 61602505).
[1] Z. Fang, J. Wang, Y. Ren, Z. Han, H. V. Poor, L. Hanzo, "Age of information in energy harvesting aided massive multiple access networks," IEEE Journal on Selected Areas in Communications, vol. 40 no. 5, pp. 1441-1456, DOI: 10.1109/JSAC.2022.3143252, 2022.
[2] T. Do-Duy, L. D. Nguyen, T. Q. Duong, S. R. Khosravirad, H. Claussen, "Joint optimisation of real-time deployment and resource allocation for UAV-aided disaster emergency communications," IEEE Journal on Selected Areas in Communications, vol. 39 no. 11, pp. 3411-3424, DOI: 10.1109/JSAC.2021.3088662, 2021.
[3] Z. Qadir, F. Ullah, H. S. Munawar, F. al-Turjman, "Addressing disasters in smart cities through UAVs path planning and 5G communications: a systematic review," Computer Communications, vol. 168, pp. 114-135, DOI: 10.1016/j.comcom.2021.01.003, 2021.
[4] J. Chen, C. Du, Y. Zhang, P. Han, W. Wei, "A clustering-based coverage path planning method for autonomous heterogeneous UAVs, " , IEEE Transactions on Intelligent Transportation Systems,DOI: 10.1109/TITS.2021.3131473, 2021.
[5] S. Zhang, Y. Zhou, Z. Li, W. Pan, "Grey wolf optimizer for unmanned combat aerial vehicle path planning," Advances in Engineering Software, vol. 99, pp. 121-136, DOI: 10.1016/j.advengsoft.2016.05.015, 2016.
[6] F. Ge, K. Li, Y. Han, W. Xu, Y. Wang, "Path planning of UAV for oilfield inspections in a three-dimensional dynamic environment with moving obstacles based on an improved pigeon-inspired optimization algorithm," Applied Intelligence, vol. 50 no. 9, pp. 2800-2817, DOI: 10.1007/s10489-020-01650-2, 2020.
[7] X. Zhang, H. Duan, "An improved constrained differential evolution algorithm for unmanned aerial vehicle global route planning," Applied Soft Computing, vol. 26, pp. 270-284, DOI: 10.1016/j.asoc.2014.09.046, 2015.
[8] X. Yu, C. Li, J. F. Zhou, "A constrained differential evolution algorithm to solve UAV path planning in disaster scenarios," Knowledge-Based Systems, vol. 204,DOI: 10.1016/j.knosys.2020.106209, 2020.
[9] J. Chen, Y. Zhang, L. Wu, T. You, X. Ning, "An adaptive clustering-based algorithm for automatic path planning of heterogeneous UAVs," IEEE Transactions on Intelligent Transportation Systems,DOI: 10.1109/TITS.2021.3131473, 2021.
[10] Z. Zhang, J. Wu, J. Dai, C. He, "Rapid penetration path planning method for stealth UAV in complex environment with BB threats," International Journal of Aerospace Engineering, vol. 2020,DOI: 10.1155/2020/8896357, 2020.
[11] N. Wang, X. Jin, M. J. Er, "A multilayer path planner for a USV under complex marine environments," Ocean Engineering, vol. 184,DOI: 10.1016/j.oceaneng.2019.05.017, 2019.
[12] T. Phanthong, T. Maki, T. Ura, T. Sakamaki, P. Aiyarak, "Application of A ∗ algorithm for real-time path re-planning of an unmanned surface vehicle avoiding underwater obstacles," Journal of Marine Science and Application, vol. 13 no. 1, pp. 105-116, DOI: 10.1007/s11804-014-1224-3, 2014.
[13] A. Ammar, H. Bennaceur, I. Châari, A. Koubâa, M. Alajlan, "Relaxed Dijkstra and A ∗ with linear complexity for robot path planning problems in large-scale grid environments," Soft Computing, vol. 20 no. 10, pp. 4149-4171, DOI: 10.1007/s00500-015-1750-1, 2016.
[14] T. Whitaker, S. J. Cunningham, C. Bobda, "Decentralised indoor smart camera mapping and hierarchical navigation for autonomous ground vehicles," IET Computer Vision, vol. 14 no. 7, pp. 462-470, DOI: 10.1049/iet-cvi.2019.0949, 2020.
[15] H. Shorakaei, M. Vahdani, B. Imani, A. Gholami, "Optimal cooperative path planning of unmanned aerial vehicles by a parallel genetic algorithm," Robotica, vol. 34 no. 4, pp. 823-836, DOI: 10.1017/S0263574714001878, 2016.
[16] Y. X. Wang, Y. Y. Tian, X. Li, L. H. Li, "Self-adaptive dynamic window approach in dense obstacles," Control and Decision, vol. 34 no. 5, pp. 927-936, 2019.
[17] J. Chen, F. Ling, Y. Zhang, T. You, Y. Liu, X. du, "Coverage path planning of heterogeneous unmanned aerial vehicles based on ant colony system," Swarm and Evolutionary Computation, vol. 69, article 101005,DOI: 10.1016/j.swevo.2021.101005, 2022.
[18] J. J. Shin, H. Bang, "UAV path planning under dynamic threats using an improved PSO algorithm," International Journal of Aerospace Engineering, vol. 2020,DOI: 10.1155/2020/8820284, 2020.
[19] Z. Fang, J. Wang, J. Du, X. Hou, Y. Ren, Z. Han, "Stochastic optimization aided energy-efficient information collection in internet of underwater things networks," IEEE Internet of Things Journal, vol. 9,DOI: 10.1109/JIOT.2021.3088279, 2021.
[20] H. Sang, Y. You, X. Sun, Y. Zhou, F. Liu, "The hybrid path planning algorithm based on improved A ∗ and artificial potential field for unmanned surface vehicle formations," Ocean Engineering, vol. 223 no. 3–4,DOI: 10.1016/j.oceaneng.2021.108709, 2021.
[21] V. Roberge, M. Tarbouchi, G. Labonté, "Comparison of parallel genetic algorithm and particle swarm optimization for real-time UAV path planning," IEEE Transactions on Industrial Informatics, vol. 9 no. 1, pp. 132-141, DOI: 10.1109/TII.2012.2198665, 2013.
[22] K. Tanakitkorn, P. A. Wilson, S. R. Turnock, A. B. Phillips, "Grid-based GA path planning with improved cost function for an over-actuated hover-capable AUV," 2014 IEEE/OES Autonomous Underwater Vehicles (AUV),DOI: 10.1109/AUV.2014.7054426, .
[23] H. Kim, S. H. Kim, M. Jeon, J. H. Kim, S. Song, K. J. Paik, "A study on path optimization method of an unmanned surface vehicle under environmental loads using genetic algorithm," Ocean Engineering, vol. 142, pp. 616-624, DOI: 10.1016/j.oceaneng.2017.07.040, 2017.
[24] T. W. Zhang, G. H. Xu, X. S. Zhan, T. Han, "A new hybrid algorithm for path planning of mobile robot," The Journal of Supercomputing, vol. 2, 2021.
[25] M. Sun, X. Xu, X. Qin, P. Zhang, "AoI-energy-aware UAV-assisted data collection for IoT networks: a deep reinforcement learning method," IEEE Internet of Things Journal, vol. 8 no. 24, pp. 17275-17289, DOI: 10.1109/JIOT.2021.3078701, 2021.
[26] L. Chang, L. Shan, C. Jiang, Y. Dai, "Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment," Autonomous Robots, vol. 45 no. 1, pp. 51-76, DOI: 10.1007/s10514-020-09947-4, 2021.
[27] C. Chen, X. Q. Chen, F. Ma, X. J. Zeng, J. Wang, "A knowledge-free path planning approach for smart ships based on reinforcement learning," Ocean Engineering, vol. 189, pp. 106299-106299, DOI: 10.1016/j.oceaneng.2019.106299, 2019.
[28] G. Han, A. Gong, H. Wang, M. Martinez-Garcia, Y. Peng, "Multi-AUV collaborative data collection algorithm based on Q-learning in underwater acoustic sensor networks," IEEE Transactions on Vehicular Technology, vol. 70 no. 9, pp. 9294-9305, DOI: 10.1109/TVT.2021.3097084, 2021.
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518 no. 7540, pp. 529-533, DOI: 10.1038/nature14236, 2015.
[30] X. Wu, H. Chen, C. Chen, M. Zhong, S. Xie, Y. Guo, H. Fujita, "The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method," Knowledge-Based Systems, vol. 196,DOI: 10.1016/j.knosys.2019.105201, 2020.
[31] Y. Zhao, X. Qi, Y. Ma, Z. Li, M. A. Sotelo, "Path following optimization for an underactuated USV using smoothly-convergent deep reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, vol. PP no. 99, 2020.
[32] J. Li, Y. Chen, X. N. Zhao, J. Huang, "An improved DQN path planning algorithm," The Journal of Supercomputing, vol. 78, 2021.
[33] J. Wang, Z. Wu, S. Yan, M. Tan, J. Yu, "Real-time path planning and following of a gliding robotic dolphin within a hierarchical framework," IEEE Transactions on Vehicular Technology, vol. 70 no. 4, pp. 3243-3255, DOI: 10.1109/TVT.2021.3066482, 2021.
[34] J. Tsitsiklis, B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation (Technical Report LIDS-P-2322), 1996.
[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, "Playing Atari with deep reinforcement learning," 2013, https://arxiv.org/abs/1312.5602.
[36] J. Li, Y. Chen, X. N. Zhao, J. Huang, "An improved DQN path planning algorithm," The Journal of Supercomputing, vol. 78 no. 1, pp. 616-639, DOI: 10.1007/s11227-021-03878-2, 2022.
[37] B. Wang, S. Li, X. Gao, T. Xie, "UAV swarm confrontation using hierarchical multiagent reinforcement learning," International Journal of Aerospace Engineering, vol. 2021,DOI: 10.1155/2021/3360116, 2021.
[38] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess, J. T. Springenberg, "Learning by playing solving sparse reward tasks from scratch," International conference on machine learning, pp. 4344-4353, 2018.
[39] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2018.
Copyright © 2022 Jiangyi Yao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Unmanned helicopters (UH) can evade radar detection by flying at ultralow altitudes, so as to conduct raids on targets. Path planning is one of the key technologies to realize UH’s autonomous completion of raid missions. Since the probability of UH being detected by radar varies with height, how to accurately identify the radar coverage area to avoid crossing has become a difficult problem in UH path planning. Aiming at this problem, a heuristic deep Q-network (H-DQN) algorithm is proposed. First, as part of the comprehensive reward function, a heuristic reward function is designed. The function can generate dynamic rewards in real time according to the environmental information, so as to guide the UH to move closer to the target and at the same time promote the convergence of the algorithm. Second, in order to smooth the flight path, a smoothing reward function is proposed. This function can evaluate the pros and cons of UH’s actions, so as to prompt UH to choose a smoother path for flight. Finally, the heuristic reward function, the smooth reward function, the collision penalty, and the completion reward are weighted and summed to obtain the heuristic comprehensive reward function. Simulation experiments show that the H-DQN algorithm can help UH to effectively avoid the radar coverage area and successfully complete the raid mission.
1 Equipment Simulation Training Center, Shijiazhuang Campus, Army Engineering University, Shijiazhuang, Hebei 050003, China
2 Department of UAV Engineering, Shijiazhuang Campus, Army Engineering University, Shijiazhuang, Hebei 050003, China
3 State Key Laboratory of Blind Signal Processing, Chengdu, Sichuan 610000, China