1. Introduction
From a macro point of view, air combat decision making refers to one side in air combat analyzing and judging battlefield information and then issuing the corresponding control instructions to its fighter so that it can occupy a dominant attack position against the enemy. Decision making is the core of air combat, and its rationality determines the final outcome of the engagement [1].
In recent years, with the continuous improvement and development of deep learning (DL) theory, deep reinforcement learning (DRL), which combines deep learning with reinforcement learning, has become a research hotspot in artificial intelligence. Because it requires no pre-collected training samples, is not limited to specific models, and can account for the long-term impact of actions, deep reinforcement learning has gradually received attention in research on air combat maneuver decision making. Deep reinforcement learning methods can be divided into two types: value-based and policy-based algorithms [2–4].
Watkins proposed Q-learning on the basis of dynamic programming, which forms an evaluation value for each state-action pair through repeated trials and iterations. However, owing to the limitations of the look-up table method, the algorithm is only applicable to finite state and action spaces. Subsequently, as the state-space dimension of the research objects increased, DNNs, CNNs, or RNNs were used to approximate the action-value function Q, forming the deep Q-network (DQN) algorithm [5, 6], which introduced experience replay and a target Q-value network. In reference [7], the DQN algorithm is used to construct autonomous obstacle-avoidance decisions for UAVs. By transforming the obstacle-avoidance process into a Markov decision problem, introducing neural networks into the decision model, and improving the replay process, random dynamic obstacle avoidance of UCAVs in a 3D environment is realized, which effectively improves the efficiency of task execution. The DeepMind team achieved autonomous learning with the DQN algorithm in the OpenAI Gym simulation platform [8] and defeated professional players by a wide margin, demonstrating again that DQN has clear advantages over traditional algorithms and human decision making. Subsequently, the AlphaGo system and AlphaGo Master were developed and defeated the world champions, which caused a sensation and prompted a new understanding of artificial intelligence technology. In 2017, AlphaGo Zero achieved self-play, starting training without human game samples, and further improved both speed and performance. Silver et al. [9] and Liu and Ma [10] constructed a discrete UAV maneuvering action library and realized the autonomous attack of a low-dynamic UAV using a DQN. In reference [11], the DQN algorithm is applied to UAV air combat confrontation, and the min-max algorithm is used to solve the value functions in different states; the simulation results verify that this method is effective.
Value-based reinforcement learning methods cannot handle problems with continuous action spaces [12–15]. Lillicrap et al. combined the deterministic policy gradient algorithm [17] with the actor-critic framework and proposed the deep deterministic policy gradient (DDPG) algorithm to address problems with continuous state and action spaces [16].
Wang et al. [18] used the DDPG algorithm to study the pursuit strategy of a car in the plane. Yang [19] used the DDPG algorithm to construct an air combat decision system. Focusing on the low data utilization caused by the lack of prior air combat knowledge in the DDPG algorithm, they proposed adding sample data from an existing mature maneuvering decision-making system to the replay buffer in the initial training stage to prevent the DDPG algorithm from falling into a local optimum during training, thus accelerating the convergence of the algorithm.
At present, deep reinforcement learning has been widely applied in unmanned vehicle control [20], robot path planning and control [21], target pursuit and evasion [22], autonomous driving [23, 24], and real-time strategy games [25, 26]. However, most of the reinforcement learning algorithms used in air combat maneuvering decision making operate on discrete action spaces, which inevitably leads to rough flight paths and limited reachable domains. At the same time, the model-free deep reinforcement learning algorithms that are widely used at present can learn effective air combat maneuver strategies independently of human expert experience and have a general learning framework. However, they need to interact with the environment to obtain a large number of training samples, and inefficient data utilization and low learning efficiency have become important bottlenecks in their practical application [3, 27–30].
In view of the above problems, this paper studies the UCAV maneuvering decision-making problem in continuous action space. By introducing a heuristic exploration strategy, the insufficient exploration ability and low data utilization of the DDPG algorithm are improved, and a UCAV air combat maneuver decision-making method based on the heuristic DDPG algorithm is proposed.
2. Air Combat Environment Design
2.1. Flight Motion Model
To account for the coupling between the control quantities when each continuous control quantity is determined independently, a UCAV platform model with the angle of attack, engine thrust, and roll angle as control quantities is adopted. This model fully considers the influence of the platform's aerodynamic characteristics on the flight state, so it is closer to reality and the flight trajectory is more realistic, increasing its engineering value. Its three-degree-of-freedom point-mass kinematic model is as follows:
The updated equations for its velocity
As seen from the above equation, to obtain a direct mapping relationship between the model control quantities
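The original equations are not reproduced in this version. For reference, a standard three-degree-of-freedom point-mass model with angle of attack α, thrust T, and roll (bank) angle μ as control quantities takes the following form; this is a hedged sketch under assumed notation, not necessarily the authors' exact formulation:

```latex
% Assumed notation: (x, y, z) position, v airspeed, \gamma flight-path angle,
% \psi heading angle, m mass, g gravity, L and D lift and drag, which depend on
% the angle of attack \alpha and v through the platform's aerodynamic data.
\begin{aligned}
\dot{x} &= v\cos\gamma\cos\psi, \qquad
\dot{y} = v\cos\gamma\sin\psi, \qquad
\dot{z} = v\sin\gamma, \\
\dot{v} &= \frac{T\cos\alpha - D}{m} - g\sin\gamma, \\
\dot{\gamma} &= \frac{(T\sin\alpha + L)\cos\mu}{mv} - \frac{g\cos\gamma}{v}, \\
\dot{\psi} &= \frac{(T\sin\alpha + L)\sin\mu}{mv\cos\gamma}.
\end{aligned}
```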
2.2. Geometry of Air Combat
When describing the geometric relationship between aircraft in air combat, the important factors usually considered are the distance between two aircraft, heading crossing angle (HCA), line of sight (LOS), antenna train angle (ATA), and aspect angle (AA). The distance between two aircraft is usually expressed by the calculation
[figure omitted; refer to PDF]
ATA and AA can be expressed as
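The expressions themselves are omitted in this version. Under the usual definitions (a sketch with assumed notation, where p_U and p_E are the UCAV and enemy positions, v_U and v_E their velocity vectors, and R the LOS vector from the UCAV to the enemy), they read:

```latex
\mathbf{R} = \mathbf{p}_E - \mathbf{p}_U, \qquad d = \lVert \mathbf{R} \rVert,
\qquad
\mathrm{ATA} = \arccos\!\left(\frac{\mathbf{v}_U \cdot \mathbf{R}}{\lVert\mathbf{v}_U\rVert\,\lVert\mathbf{R}\rVert}\right),
\qquad
\mathrm{AA} = \arccos\!\left(\frac{\mathbf{v}_E \cdot \mathbf{R}}{\lVert\mathbf{v}_E\rVert\,\lVert\mathbf{R}\rVert}\right).
```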
2.3. Reward Shaping
The objective of maneuver decision making in close air combat based on reinforcement learning is to find an optimal maneuver strategy that enables the UCAV to reach an attack position while maximizing the cumulative reward. The reward is the only quantitative index for strategy evaluation; it determines the strategy that the agent finally learns and directly affects the convergence and learning speed of the algorithm. When the UCAV makes air combat decisions through deep reinforcement learning, there is no reward in the intermediate process except the reward for completing the task, which leads to the problem of sparse rewards. Therefore, it is necessary to design not only the reward for completing the task but also a guiding reward for each step of each round. In this paper, a reward function including angle, height, distance, and speed factors is designed.
2.3.1. Angle Factor
When the maximum firing range of the UCAV weapon exceeds that of the enemy, the UCAV can be the first to meet the missile firing conditions in a head-on encounter. Because of the omnidirectional attack capability of fourth-generation short-range air-to-air missiles, there is no need to consider the enemy's attitude in this case. Therefore, under this weapon advantage, the angle factor is mainly determined by the ATA of the UCAV: as long as the ATA is within the maximum off-axis launch angle, the angle reward is obtained, specifically expressed as
When the maximum firing distance of the UCAV weapon is inferior to that of the enemy aircraft, the situation is extremely detrimental to UCAV safety. In this case, to ensure its own safety, the UCAV should be guided to make full use of its maneuverability, always stay beyond the enemy's maximum off-axis launch angle, and attack the enemy aircraft with tail-attack tactics as far as possible. The angle factor should then consider both the ATA of the UCAV and the AA of the enemy aircraft, and it is designed as follows:
2.3.2. Height Factor
The height factor not only represents the vertical relationship between the two aircraft in the air combat geometry but also measures the energy advantage of the UCAV. The side with the height advantage not only has an energy and maneuverability advantage but can also exploit the missile's larger attack envelope. The maximum height reward is obtained when the UCAV is in the desired altitude range relative to the enemy aircraft:
2.3.3. Distance Factor
Distance is an important factor for UCAV platform situation assessment and weapon launch conditions. When the relative distance between the two aircraft is within the maximum missile launch distance, the maximum distance factor is obtained, which is defined as
2.3.4. Speed Factor
When the distance between the two aircraft is within the maximum launch distance of the missile, the UCAV takes the speed of the enemy aircraft as the optimal attack speed. When the distance between the two aircraft is relatively large, the UCAV should maintain a high flight speed to rapidly form a favorable situation and maintain a kinetic energy advantage through high speed and maneuverability. The speed factor is established as follows:
2.3.5. Environmental Factor
When the UCAV air combat strategy is learned through reinforcement learning, in addition to making the UCAV capable of attacking enemy aircraft, an even more important prerequisite is that the UCAV can adapt to the battlefield environment and maintain a safe flight altitude. Therefore, to train an air combat strategy with both combat capability and safe flight capability, it is necessary to impose negative rewards in the form of punishment for dangerous flight maneuvers, so the environmental factor
2.3.6. End Factor
2.3.7. Total Reward Function
Based on the above analysis, the total reward function is
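The explicit expressions of the individual factors and of the total reward are not reproduced in this version. The sketch below illustrates how such a shaped reward (Sections 2.3.1–2.3.7) could be assembled in Python; all thresholds, weights, and functional forms are illustrative assumptions rather than the values used in the paper:

```python
import numpy as np

def angle_reward(ata, aa, ucav_weapon_superior, phi_max=np.radians(30)):
    """Angle factor: with superior weapon range only the ATA matters;
    otherwise both the UCAV ATA and the enemy AA matter (tail-attack geometry)."""
    if ucav_weapon_superior:
        return 1.0 if ata <= phi_max else -ata / np.pi
    return 1.0 - (ata + aa) / np.pi

def height_reward(dz, h_low=500.0, h_high=2000.0):
    """Height factor: maximal when the UCAV sits in a desired band above the enemy."""
    return 1.0 if h_low <= dz <= h_high else float(np.clip(dz / h_high, -1.0, 1.0))

def distance_reward(d, d_launch=8000.0):
    """Distance factor: maximal inside the assumed maximum missile launch range."""
    return 1.0 if d <= d_launch else float(np.exp(-(d - d_launch) / d_launch))

def speed_reward(v, v_enemy, d, d_launch=8000.0, v_pref=300.0):
    """Speed factor: match enemy speed inside launch range, stay fast when far away."""
    if d <= d_launch:
        return 1.0 - abs(v - v_enemy) / v_pref
    return float(np.clip(v / v_pref, 0.0, 1.0))

def environment_penalty(z, z_safe=1000.0):
    """Environmental factor: penalize dangerously low altitude."""
    return -1.0 if z < z_safe else 0.0

def total_reward(s, weights=(0.3, 0.2, 0.3, 0.2)):
    """Weighted sum of the shaping factors plus the sparse end-of-episode reward."""
    w_a, w_h, w_d, w_v = weights
    return (w_a * angle_reward(s["ata"], s["aa"], s["weapon_superior"])
            + w_h * height_reward(s["dz"])
            + w_d * distance_reward(s["d"])
            + w_v * speed_reward(s["v"], s["v_enemy"], s["d"])
            + environment_penalty(s["z"])
            + s.get("end_reward", 0.0))
```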
3. Heuristic DDPG Algorithm
This section constructs an exploration strategy that is more effective than traditional Gaussian or OU noise. Traditional exploration strategies such as OU noise usually act directly on the actions output by the policy network, adding random perturbations to explore the unknown space. In the air combat environment, the UCAV control quantities are high-dimensional and have large amplitude ranges, so an exploration strategy based on OU noise in the DDPG algorithm is likely to leave many blind spots in the search and seriously degrade the training effect. At the same time, owing to platform performance limits and flight safety requirements, the variation and magnitude of each control dimension are strictly bounded. When the output action of the policy network is close to the boundary of its range, implementing exploration by directly adding noise is blind and ineffective.
A large number of current research exploration methods that potentially set the exploration strategy
As the DDPG algorithm is a typical off-policy learning method, its exploration process and learning process are independent of each other, so the exploration strategy
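For reference, the conventional OU exploration used in DDPG adds temporally correlated noise directly to the actor output; the following minimal sketch (with typical default parameters, an assumption) shows exactly the mechanism whose limitations are discussed above:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process, the conventional DDPG exploration noise."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu, dtype=np.float64)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x

# Conventional DDPG exploration adds this noise to the actor output and clips:
#     a = np.clip(actor(s) + noise.sample(), a_low, a_high)
# which is exactly the boundary problem discussed above: near a_low/a_high most of
# the added noise is clipped away and contributes little exploration.
```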
3.1. Algorithm Design
The framework proposed in this section can be regarded as a heuristic learning framework, in which the exploration strategy
The generation of heuristic information
Referring to the parameterized representation of the policy network in traditional DDPG, the policy
This value can be calculated from the data generated by the interaction of the DDPG algorithm with the environment.
To estimate the difference reward value
After the difference reward
After the exploration strategy
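For completeness, the DDPG updates invoked inside this framework are the standard ones from the original algorithm (restated here in textbook form; the heuristic modification described above only changes how the exploration data are generated):

```latex
% Critic target and loss over a minibatch of N transitions (s_i, a_i, r_i, s_{i+1}):
y_i = r_i + \gamma\, Q'\!\big(s_{i+1}, \mu'(s_{i+1}\mid\theta^{\mu'}) \mid \theta^{Q'}\big),
\qquad
L(\theta^{Q}) = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2 .

% Deterministic policy gradient for the actor and soft target updates:
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i
\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\;
\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i},
\qquad
\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'},\quad
\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'} .
```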
Algorithm 1: Heuristic DDPG pseudocode.
(1) Initialize exploration strategy
(2) Action strategy
(3) Initialize the replay buffer.
(4) for
(5) Heuristic strategy
(6) Call DDPG:
(7) Action strategies
(8) Calculate the reward of the exploration strategy
(9) Update network parameters according to the gradient of exploration strategy:
(10) Add heuristic information
(11) Update action strategy
(12) End for
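A schematic Python rendering of Algorithm 1 is given below. The exploration policy, the heuristic-information generator, and the difference-reward computation are placeholders (assumptions), since the corresponding symbols are not reproduced in this version; only the control flow mirrors the pseudocode:

```python
# Schematic rendering of Algorithm 1 (heuristic DDPG). `explorer` is the heuristic
# exploration policy and `ddpg` a standard DDPG agent; their internals, the form of
# the heuristic information, and the difference reward are illustrative assumptions.
def train_heuristic_ddpg(env, ddpg, explorer, episodes, max_steps):
    replay_buffer = []                                            # (3) initialize replay buffer
    for _ in range(episodes):                                     # (4) training loop
        s = env.reset()
        for _ in range(max_steps):
            a = explorer.act(s)                                   # (5) heuristic exploration action
            s_next, r, done, _ = env.step(a)
            replay_buffer.append((s, a, r, s_next, done))
            ddpg.update(replay_buffer)                            # (6)-(7) DDPG actor/critic updates
            s = s_next
            if done:
                break
        r_diff = explorer.difference_reward(ddpg, replay_buffer)  # (8) reward of exploration policy
        explorer.gradient_update(r_diff)                          # (9) update exploration network
        explorer.add_heuristic_info(ddpg.actor)                   # (10) inject heuristic information
        # (11) the action (target) strategy is refreshed inside ddpg.update via soft updates
```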
Although the improvement of the DDPG algorithm in this section increases the amount of computation in the calculation of heuristic data
3.2. Performance Test of Algorithm
To test the performance of the improved algorithm proposed in this paper, HalfCheetah-v1, a MuJoCo robot control environment in the OpenAI Gym toolkit, is selected as the test environment. Considering that UCAV air combat in this paper is a decision-making process that takes air combat state information as input, image input is not considered; the RAM version of the environment is therefore chosen so that the state information is obtained directly, rather than the RGB version with the game graphics as input. To reflect the performance of the algorithm, 20 Monte Carlo simulations are performed for each algorithm. The
[figures omitted; refer to PDF]
In Figure 2, the ordinate ‘performance’ is the cumulative reward value obtained for completing a task. The areas covered by red, dark blue, and light blue correspond to the heuristic DDPG algorithm proposed in this paper, the PPO algorithm, and the traditional DDPG algorithm, respectively, over 20 Monte Carlo simulations of the
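For reproducibility, the Monte Carlo test protocol described above can be sketched as follows; the Gym environment id follows the older naming used in the paper, and the evaluation loop itself is an assumption based on the description:

```python
import gym
import numpy as np

def evaluate(agent, env_id="HalfCheetah-v1", runs=20, seed0=0):
    """Run `runs` independent episodes and return the mean/std of the cumulative
    reward ("performance" in Figure 2). Uses the pre-0.26 Gym step/reset API."""
    returns = []
    for k in range(runs):
        env = gym.make(env_id)
        env.seed(seed0 + k)                    # one seed per Monte Carlo run
        s, done, total = env.reset(), False, 0.0
        while not done:
            s, r, done, _ = env.step(agent.act(s))
            total += r
        returns.append(total)
        env.close()
    return float(np.mean(returns)), float(np.std(returns))
```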
4. Maneuver Decision Scheme Design
To increase the generalization ability of the policy network, this paper considers the relative relationship between the enemy and the UCAV in the selection of state variables and takes the three-dimensional relative position coordinates of the two aircraft, the relative flight speed, AA, and ATA as state variables; that is, the state variables are
In the selection of the action, it is designed as the variation of the model control variable
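A small sketch of this state/action interface is given below; the exact ordering, normalization, and control increments used by the authors are not reproduced here, so the names and limits are assumptions:

```python
import numpy as np

def build_state(p_u, p_e, v_u, v_e, aa, ata):
    """State: 3-D relative position, relative flight speed, AA, and ATA."""
    rel_pos = np.asarray(p_e, dtype=np.float64) - np.asarray(p_u, dtype=np.float64)
    rel_speed = np.linalg.norm(v_e) - np.linalg.norm(v_u)
    return np.concatenate([rel_pos, [rel_speed, aa, ata]]).astype(np.float32)

def apply_action(controls, action, lower, upper):
    """Action: bounded increments of the control variables (e.g., d_alpha, d_T, d_mu)."""
    return np.clip(np.asarray(controls) + np.asarray(action), lower, upper)
```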
5. Simulation and Analysis
5.1. Network and Parameter
Combined with the maneuver decision scheme, the actor and critic network structures of the proposed algorithm are designed. The actor network and the actor target network have the same structure, and the input value is the state input
Table 1
Actor/Actor target network structure.
Layer | Input | Activation function | Output |
Input layer | |||
Full connection layer 1 | tanh | ||
Full connection layer 2 | tanh | ||
Output layer | Linear |
The critic network has the same structure as the critic target network. The input value is the combination of the state input
Table 2
Critic/Critic target network structure.
Layer | Input | Activation function | Output |
Input layer | |||
Full connection layer 1 | tanh | ||
Full connection layer 2 | tanh | ||
Output layer | Linear |
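A minimal tf.keras sketch of the two structures in Tables 1 and 2 is shown below; the hidden-layer widths are placeholders (the layer dimensions are not reproduced in this version), and only the tanh/linear pattern follows the tables:

```python
import tensorflow as tf

def build_actor(state_dim, action_dim, action_scale=1.0):
    """Actor: two tanh fully connected layers, linear output scaled to the control limits."""
    s = tf.keras.Input(shape=(state_dim,))
    h = tf.keras.layers.Dense(256, activation="tanh")(s)   # width 256 is an assumption
    h = tf.keras.layers.Dense(128, activation="tanh")(h)   # width 128 is an assumption
    a = tf.keras.layers.Dense(action_dim, activation=None)(h)      # linear output layer
    a = tf.keras.layers.Lambda(lambda x: x * action_scale)(a)      # scale to action range
    return tf.keras.Model(s, a)

def build_critic(state_dim, action_dim):
    """Critic: takes (state, action), two tanh layers, linear Q-value output."""
    s = tf.keras.Input(shape=(state_dim,))
    a = tf.keras.Input(shape=(action_dim,))
    h = tf.keras.layers.Concatenate()([s, a])
    h = tf.keras.layers.Dense(256, activation="tanh")(h)
    h = tf.keras.layers.Dense(128, activation="tanh")(h)
    q = tf.keras.layers.Dense(1, activation=None)(h)
    return tf.keras.Model([s, a], q)
```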
The neural network training platform is the TensorFlow open-source deep learning framework, running on an NVIDIA GeForce GTX 1080Ti GPU under Ubuntu 16.04. The specific hyperparameter settings of the heuristic DDPG (H-DDPG) algorithm are shown in Table 3.
Table 3
Hyperparameter setting of heuristic DDPG algorithm.
Parameter | Parameter value |
Size of replay buffer | 50000 |
Size of minibatch | 64 |
Actor learning rate | 0.0001 |
Critic learning rate | 0.001 |
Discount rate | 0.99 |
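Collected as a configuration dictionary, the settings of Table 3 read as follows (the soft-update rate tau is not listed in the table and is added here only as a typical DDPG default, i.e., an assumption):

```python
H_DDPG_CONFIG = {
    "replay_buffer_size": 50_000,   # size of replay buffer (Table 3)
    "minibatch_size": 64,           # size of minibatch (Table 3)
    "actor_lr": 1e-4,               # actor learning rate (Table 3)
    "critic_lr": 1e-3,              # critic learning rate (Table 3)
    "gamma": 0.99,                  # discount rate (Table 3)
    "tau": 1e-3,                    # soft-update rate: assumption, not from Table 3
}
```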
5.2. Initial Situation Setting
To verify the effectiveness of the algorithm, it is assumed that the enemy fighter and the UCAV use the same platform model and the same maneuverability constraints. The enemy's decisions are generated by the rolling time-domain maneuver decision method proposed in reference [36]. To reflect the adversarial nature of air combat, the two sides are assumed to enter the battle in a head-on encounter, and the initial altitude of the UCAV is set slightly lower than that of the enemy aircraft, placing it at a disadvantage. The simulation initialization states are shown in Table 4.
Table 4
Initial state of UCAV and enemy.
 | x (m) | y (m) | z (m) | Speed (m/s) | Flight path angle (°) | Heading (°) | Max step (s) |
UCAV | 0 | 0 | 10000 | 250 | 0 | 45 | 200 |
Enemy | 10000 | 10000 | 12000 | 200 | 0 | −135 | |
5.3. Enemy Making Random Maneuvers
Case 1.
The UCAV weapon is stronger; that is, in the head-on situation, the launch distance of the UCAV weapon is superior. The winning conditions of the UCAV are as follows:
As shown in Figure 3, the enemy aircraft chooses to dive downward through random maneuvers. The UCAV approaches the enemy aircraft in level flight and then dives to gain the altitude advantage; it is finally the first to meet the weapon firing conditions and launches missiles to win the air battle. The changes in the reward factors in Figure 4 show that, at the beginning of the battle, the UCAV had already obtained the maximum angle reward factor; it approached the enemy aircraft in level flight, dove at 21 s to obtain the height advantage, and reached the weapon launch range at 26 s. At this time, all reward factors reached 1, meeting the winning conditions of the air battle. Figure 5 shows the curve of the average cumulative reward value during training for this air combat mission. Each epoch on the horizontal axis contains 200 training missions, and the ordinate is the average cumulative reward value obtained over every 200 missions.
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
Case 2.
The enemy weapon is stronger; that is, the firing distance of the enemy weapon is dominant. The UCAV winning conditions are as follows:
As seen from Figure 6, the enemy swoops down to the left through random maneuvers and then climbs to the left. Because of its low initial altitude, the UCAV first shortens the distance to the enemy and improves its height advantage by climbing; before entering the enemy's attack range, it makes a sharp right turn and dives behind the enemy's tail. Finally, the UCAV continuously follows the enemy with small-overload deceleration and a right turn, then pulls up to the left to meet the rear-attack conditions and win the air battle. As seen from the changes in the reward factors in Figure 7, in the early stage of the air combat, because of the long distance, the low altitude, and the enemy meeting the attack angle, all reward factors are −1. With the implementation of large-overload maneuvers, the UCAV gradually obtains each situational advantage and finally meets the weapon launch conditions at 89 s by tracking the enemy aircraft. Figure 8 shows the curve of the cumulative reward value during training for this task.
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
5.4. Enemy Making Intelligent Maneuvers
In this task, the enemy makes intelligent maneuvers using the rolling time-domain maneuver decision method proposed in reference [36]: 216 trial maneuvers are generated by discretely varying the control variables, and the maneuver with the optimal value of the air combat situation membership function is selected and executed.
Mission setting: the enemy weapon is stronger. Under this mission, the UCAV adopts the rear-attack mode to attack the enemy aircraft, whereas the enemy does not need to maneuver to the rear and adopts an omnidirectional attack strategy. The UCAV winning conditions are as follows:
[figures omitted; refer to PDF]
As shown in Figure 9, after the two sides enter the air combat airspace in a head-on encounter, the enemy aircraft adopts an accelerated dive from high altitude to quickly approach the UCAV and be the first to meet the weapon launch conditions. The UCAV first accelerates in level flight to quickly shorten the distance between the two sides and then pulls up and away to the upper left of the enemy aircraft before entering the enemy missile attack range. The enemy aircraft loses its altitude advantage because of the high dive speed; it then levels out and pulls up to the left, regaining altitude and turning, but the climb reduces its speed and results in a larger turning radius. At this point, the UCAV performs a loop to increase its speed and energy advantage and finally wins by following the enemy aircraft until the weapon firing conditions are reached.
Figure 10 shows the reward function curves. Between 30 s and 50 s, the UCAV gains a temporary altitude advantage through a loop and is then repositioned below the enemy aircraft; it remains level until 105 s and then dives, adjusting its angle by sacrificing the altitude advantage. At 110 s, the UCAV succeeds in placing itself behind the enemy aircraft, gaining altitude and angle advantages; it then closes the distance by adjusting its attitude and defeats the enemy aircraft at 133 s. Figure 11 shows the curves of the UCAV control variables. Figure 12 shows the curve of the average cumulative reward value during training for this air combat mission. Each epoch on the horizontal axis contains 200 training missions, and the ordinate is the average cumulative reward value obtained over every 200 missions.
[figures omitted; refer to PDF]
[figure omitted; refer to PDF]
6. Conclusions
In this paper, continuous-action-space air combat decision-making technology for UCAVs based on reinforcement learning is studied. Starting from the UCAV continuous action space model, a continuous-action-space air combat model is constructed based on the aerodynamic parameters of an unmanned stealth fighter. Focusing on the weak exploration ability and low data utilization of the DDPG algorithm, a heuristic exploration strategy is introduced and a heuristic DDPG algorithm is proposed to improve the exploration ability of the original algorithm. The effectiveness and superiority of the proposed algorithm are verified by Monte Carlo simulations in a typical continuous motion control environment (HalfCheetah). In the simulation verification stage, two enemy subtasks of increasing difficulty, random maneuvers and intelligent attack maneuvers, are adopted, and the results show that the proposed method can accomplish maneuver decision making in all of these tasks.
[1] L. Fu, F. Xie, D. Wang, G. Meng, "The overview for UAV air-combat decision method," Proceedings of the 26th Chinese Control and Decision Conference (2014 CCDC), 2014, DOI: 10.1109/ccdc.2014.6852760.
[2] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2018.
[3] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. Sallab, S. Yogamani, P. Perez, "Deep reinforcement learning for autonomous driving: a survey," IEEE Transactions on Intelligent Transportation Systems,DOI: 10.1109/tits.2021.3054625, 2021.
[4] Y. Li, "Deep reinforcement learning: an overview," 2017. https://arxiv.org/abs/1701.07274
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, "Playing atari with deep reinforcement learning," 2013. https://arxiv.org/abs/1312.5602
[6] F. Agostinelli, G. Hocquet, S. Singh, P. Baldi, "From reinforcement learning to deep reinforcement learning: an overview," Braverman Readings in Machine Learning. Key Ideas from Inception to Current State, pp. 298-328, DOI: 10.1007/978-3-319-99492-5_13, 2018.
[7] X. Han, J. Wang, J. Xue, Q. Zhang, "Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning," Proceedings of the 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP), 2019, DOI: 10.1109/wcsp.2019.8928110.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518 no. 7540, pp. 529-533, DOI: 10.1038/nature14236, 2015.
[9] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550 no. 7676, pp. 354-359, DOI: 10.1038/nature24270, 2017.
[10] P. Liu, Y. Ma, "A deep reinforcement learning based intelligent decision method for UCAV air combat," pp. 274-286, DOI: 10.1007/978-981-10-6463-0_24.
[11] W. Ma, H. Li, Z. Wang, Z. Huang, Z. Wu, X. Chen, "Close air combat maneuver decision based on deep stochastic game," Systems Engineering and Electronics, vol. 9, 2020.
[12] X. Zhang, G. Liu, C. Yang, J. Wu, "Research on air combat maneuver decision-making method based on reinforcement learning," Electronics, vol. 7 no. 11,DOI: 10.3390/electronics7110279, 2018.
[13] Q. Yang, J. Zhang, G. Shi, J. Hu, Y. Wu, "Maneuver decision of UAV in short-range air combat based on deep reinforcement learning," IEEE Access, vol. 8, pp. 363-378, DOI: 10.1109/ACCESS.2019.2961426, 2019.
[14] X. Xu, D. Hu, X. Lu, "Kernel-based least squares policy iteration for reinforcement learning," IEEE Transactions on Neural Networks, vol. 18 no. 4, pp. 973-992, DOI: 10.1109/tnn.2007.899161, 2007.
[15] Z. Wang, H. Li, H. Wu, Z. Wu, "Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm," Mathematical Problems in Engineering, vol. 2020,DOI: 10.1155/2020/7180639, 2020.
[16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, "Continuous control with deep reinforcement learning," 2015. https://arxiv.org/abs/1509.02971
[17] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, "Deterministic policy gradient algorithms," pp. 387-395, 2014.
[18] M. Wang, L. Wang, T. Yue, H. Liu, "Influence of unmanned combat aerial vehicle agility on short-range aerial combat effectiveness," Aerospace Science and Technology, vol. 96,DOI: 10.1016/j.ast.2019.105534, 2020.
[19] M. Wang, L. Wang, T. Yue, "An application of continuous deep reinforcement learning approach to pursuit-evasion differential game," pp. 1150-1156, 2019, DOI: 10.1109/itnec.2019.8729310.
[20] Q. Yang, Y. Zhu, J. Zhang, S. Qiao, J. Liu, "UAV air combat autonomous maneuver decision based on DDPG algorithm," pp. 37-42, 2019, DOI: 10.1109/icca.2019.8899703.
[21] B. Li, Y. Wu, "Path planning for UAV ground target tracking via deep reinforcement learning," IEEE Access, vol. 8, pp. 29064-29074, DOI: 10.1109/access.2020.2971780, 2020.
[22] S. You, M. Diao, L. Gao, F. Zhang, H. Wang, "Target tracking strategy using deep deterministic policy gradient," Applied Soft Computing, vol. 95,DOI: 10.1016/j.asoc.2020.106490, 2020.
[23] A. E. Sallab, M. Abdou, E. Perot, S. Yogamani, "Deep reinforcement learning framework for autonomous driving," Electronic Imaging, vol. 2017 no. 19, pp. 70-76, DOI: 10.2352/issn.2470-1173.2017.19.avm-023, 2017.
[24] S. Wang, D. Jia, X. Weng, "Deep reinforcement learning for autonomous driving," 2018. https://arxiv.org/abs/1811.11329
[25] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. V. Hasselt, D. Silver, T. Lillicrap, K. Calderone, R. R. Tsing, "Starcraft ii: a new challenge for reinforcement learning," 2017. https://arxiv.org/pdf/1710.03131.pdf
[26] P. Peng, Y. Wen, Y. Yang, Y. Quan, T. Zhenkun, L. Haitao, W. Jun, "Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play starcraft combat games," 2017. https://arxiv.org/abs/1703.10069
[27] T. Zhao, L. Kong, Y. Han, D. Ren, Y. Chen, "Review of model-based reinforcement learning," Journal of Frontiers of Computer Science and Technology, vol. 14 no. 06, pp. 918-927, 2020.
[28] Q. Zhihui, L. Ning, L. Xiaotong, L. Xiulei, Q. Dong, "Overview of research on model-free reinforcement learning," Computer Science, vol. 48 no. 03, pp. 180-187, 2021.
[29] H. Liu, Y. Pan, S. Li, Y. Chen, "Synchronization for fractional-order neural networks with full/under-actuation using fractional-order sliding mode control," International Journal of Machine Learning and Cybernetics, vol. 9 no. 7, pp. 1219-1232, DOI: 10.1007/s13042-017-0646-z, 2018.
[30] H. Liu, S. Li, G. Li, H. Wang, "Robust adaptive control for fractional-order financial chaotic systems with system uncertainties and external disturbances," Information Technology and Control, vol. 46 no. 2, pp. 246-259, DOI: 10.5755/j01.itc.46.2.13972, 2017.
[31] "Storm Shadow UCAV performance," 1994. http://www.aerospaceweb.org/design/ucav/
[32] T. Xu, Y. Wang, C. Kang, "Tailings saturation line prediction based on genetic algorithm and BP neural network," Journal of Intelligent and Fuzzy Systems, vol. 30 no. 4, pp. 1947-1955, 2016.
[33] Z. Zhao, Q. Xu, M. Jia, "Improved shuffled frog leaping algorithm-based BP neural network and its application in bearing early fault diagnosis," Neural Computing & Applications, vol. 27 no. 2, pp. 375-385, DOI: 10.1007/s00521-015-1850-y, 2016.
[34] Y. C. Lin, D. D. Chen, M. S. Chen, X. M. Chen, J. Li, "A precise BP neural network-based online model predictive control strategy for die forging hydraulic press machine," Neural Computing & Applications, vol. 29,DOI: 10.1007/s00521-016-2556-5, 2016.
[35] H. Wang, Y. Wang, K. E. Wen-Long, "An intrusion detection method based on spark and BP neural network," Computer Knowledge & Technology, vol. 13 no. 6, pp. 157-160, 2017.
[36] K. Dong, C. Huang, "Autonomous air combat maneuver decision using Bayesian inference and moving horizon optimization," Journal of Systems Engineering and Electronics, vol. 29 no. 1, pp. 86-97, 2018.
Copyright © 2022 Wang Yuan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
With the rapid development of unmanned combat aerial vehicle (UCAV) technologies, UCAVs are playing an increasingly important role in military operations, and it has become an inevitable trend in future air combat that UCAVs complete air combat tasks independently to acquire air superiority. In this paper, the UCAV maneuver decision problem in continuous action space is studied based on deep reinforcement learning policy optimization. A UCAV platform model with continuous action space is established. Focusing on the insufficient exploration ability of the Ornstein–Uhlenbeck (OU) exploration strategy in the deep deterministic policy gradient (DDPG) algorithm, a heuristic DDPG algorithm is proposed by introducing a heuristic exploration strategy, and a UCAV air combat maneuver decision method based on the heuristic DDPG algorithm is then developed. The superior performance of the algorithm is verified by comparison with different algorithms in a test environment, and the effectiveness of the decision method is verified by simulations of air combat tasks with different difficulties and attack modes.