1. Introduction
Intercepting maneuvering targets is particularly challenging because of the complexity of the engagement [1, 2]. Traditional guidance and control systems for interception show weaknesses against highly maneuvering targets, whereas intelligent methods offer a way to address the problem [3]. In the field of guidance, proportional navigation (PN) has found widespread application because of its simplicity and robustness [4]. PN is mainly divided into true proportional navigation (TPN) [5] and pure proportional navigation (PPN) [6]. For maneuvering targets, Ref. [7] investigated the capture region of realistic true proportional navigation (RTPN) in three-dimensional (3D) space, taking into account the nonlinearity of the interceptor-target relative kinematics and obtaining more general findings. However, when targets maneuver strongly, the performance of PN can decline significantly, mainly because the commanded acceleration of PN often exceeds the capability of the interceptor, resulting in large miss distances [8]. Optimal guidance law (OGL) can intercept or strike a target while optimizing a specific performance index [9]. However, OGL requires an accurate estimate of the time-to-go; otherwise, its performance may degrade. Many newer guidance methods based on differential geometry [10], sliding mode control [11], and other dynamics and control theories have also been proposed. However, these guidance laws are often complex in form, typically require a large amount of measurement information, and involve many guidance parameters, making them difficult to apply in practice.
Reinforcement learning (RL) [12] offers a new approach to the homing guidance law design problem. For example, Q-learning is used in [13] and [14] to adaptively determine guidance parameters through training. In [15], a guidance framework based on an RL-designed guidance law is proposed, and extensive numerical simulations show that RL-based guidance laws substantially outperform PN. However, these traditional RL-based algorithms improve guidance performance only by selecting suitable controller coefficients [16], which cannot achieve precise guidance under realistic disturbed conditions. Moreover, the state and action sets of traditional RL methods are discrete and low-dimensional, whereas the actual interception engagement is continuous and high-dimensional [17].
As deep learning (DL) continued to advance, a new class of algorithms known as deep reinforcement learning (DRL), combining DL and RL techniques, emerged [18]. DRL methods can effectively handle complex, high-dimensional spaces [19, 20], so they may offer advantages for homing guidance. Ref. [21] proposed the deep Q-network (DQN), which solves the problem of high-dimensional inputs. For the problem of exoatmospheric homing guidance, a guidance method using DQN is proposed in [22]. However, DQN is better suited to discrete control problems, while the actual interceptor's acceleration is usually continuous; discretized action commands may lead to large deviations and a large miss distance.
The DDPG algorithm, introduced in [18], is an actor-critic (AC) [23] algorithm that is well suited to the homing guidance problem in continuous state and action spaces. Ref. [24] explored applying DDPG to the design of homing guidance laws; by comparing learning from scratch with learning from prior knowledge, it showed that the latter improves learning efficiency. In [25] and [26], missile terminal guidance laws are also developed based on DDPG, and the results show that the proposed policies are more robust and achieve smaller miss distances than PN. However, most DRL-based guidance laws need to measure or estimate the relative velocity and position between the target and the interceptor as well as the target's acceleration [27, 28]. These measurements are numerous and are usually subject to lags and large errors. An RL-based guidance law was proposed in [29] and [30] to address this problem, using only the LOS angle measurements and their rates of change as observations. This simplifies state estimation and may eliminate the adverse effects of position and velocity estimation biases. Ref. [29] uses proximal policy optimization (PPO), combined with metalearning [31, 32], to derive a homing guidance law for intercepting exoatmospheric maneuvering targets; experimental results show that it outperforms the augmented ZEM guidance method [30]. Ref. [33] proposed a model-based DRL method that uses deep neural networks and metalearning to approximate a predictive model of the guidance dynamics and incorporates it into a path-integral control framework. It provides a general guidance framework, but it is complex in form, and the problem of estimation errors remains unsolved.
In this paper, a novel homing guidance law against maneuvering targets is proposed using the DDPG algorithm. It directly maps the engagement state information to the commanded acceleration, forming an end-to-end, model-free guidance policy. The proposed homing guidance law takes only the LOS angles and LOS rates between the target and interceptor as observation and state inputs and does not require prior estimation of the target's acceleration. The DDPG algorithm can effectively handle a continuous, high-dimensional dynamic environment. A continuous action space is designed based on the interceptor's acceleration overload, the LOS rate and ZEM are the main considerations in the reward design, and the agent is trained in a 3D environment. Comparisons with TPN and a DQN-based RL guidance law show that the proposed guidance method has strong environmental adaptability and better guidance performance.
The paper is structured as follows: Section 2 presents the problem formulation, including the engagement scenario and the models of motion and measurement. Section 3 introduces the DDPG algorithm and describes the details of the RL guidance law. The results are given in Section 4, and Section 5 presents the conclusion.
2. Problem Formulation
2.1. Engagement Scenario
A simplified engagement scenario of the interception process is used. Referring to Figure 1, the target's and the interceptor's position vectors are
[figure(s) omitted; refer to PDF]
For the process of interception, the closing velocity is usually large. If the target and interceptor maneuver along the LOS direction, it can be challenging to alter the miss distance outcome. Therefore, we assume that the interceptor maneuvers only in a plane perpendicular to the direction of LOS in the LOS coordinate system, without considering its maneuver along the LOS direction.
2.2. Motion Model of Interception
The intersection plane is formed by
[figure(s) omitted; refer to PDF]
The relative velocity is decomposed into two components:
The LOS direction can be represented using
In summary, when
ZEM is the final miss distance that would result if neither the target nor the interceptor maneuvers from the current instant [7, 34]. ZEM and time-to-go are calculated as follows.
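For reference, the following is a minimal sketch of the standard time-to-go and ZEM relations under the no-maneuver assumption, with r and v the relative position and velocity vectors; the exact expressions used in the paper follow [7, 34] and may differ in form.

```python
# Minimal sketch of time-to-go and zero-effort-miss (ZEM) for a
# non-maneuvering target and interceptor (standard relations, not
# necessarily the paper's exact formulation).
import numpy as np

def time_to_go(r, v):
    """Time-to-go assuming constant relative velocity from the current instant."""
    return -np.dot(r, v) / np.dot(v, v)

def zem(r, v):
    """Miss distance that would result if neither vehicle maneuvers further."""
    return np.linalg.norm(r + v * time_to_go(r, v))
```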
2.3. Measurement Model of Interception
The measurement model mainly processes the information measured by the interceptor and is used to calculate the LOS angles and LOS rates of change from the current missile-target state [37]. Referring to Section 2.1, the relative position and velocity vectors are as follows:
By utilizing equations (9) and (10),
The simulations in this paper neglect measurement errors in the relative distance, closing velocity, and LOS angles; only errors in the LOS angular rate measurements are introduced. A Gaussian noise with zero mean and a specified standard deviation
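A minimal sketch of this measurement model is given below; the standard deviation value is an illustrative placeholder, not the paper's setting.

```python
# Only the LOS angular rates are corrupted, with additive zero-mean
# Gaussian noise; sigma is an assumed, illustrative value (rad/s).
import numpy as np

def measure_los_rates(true_rates, sigma=1e-4):
    return np.asarray(true_rates) + np.random.normal(0.0, sigma, size=np.shape(true_rates))
```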
3. RL Homing Guidance Law
Establishing a Markov decision model [38] of the problem is a prerequisite for designing the homing guidance law using the DRL algorithm [12]. Then, the interception problem needs to be transformed into the RL framework.
3.1. The Overview of RL
Reinforcement learning is an iterative process [39] that involves an agent interacting with the environment, observing state
Reinforcement learning algorithms are broadly categorized into two classes: value-function methods and policy-gradient methods [40]. The former, such as Q-learning and DQN, estimate the value of state-action pairs; the latter, such as the policy-gradient and AC algorithms, directly learn a policy that maps states to actions. DRL algorithms, such as DDPG and A3C, combine deep learning with these methods. However, value-function methods are not well suited to high-dimensional problems with continuous action spaces, whereas policy-gradient methods based on the AC architecture have advantages for such problems. In this paper, DDPG is used to solve the problem and is compared against TPN and a DQN-based algorithm.
3.2. DDPG Algorithm
DDPG is based on the AC architecture and solves RL problems with continuous state and action spaces. The algorithm uses neural networks to approximate two functions: the value network (the critic) and the policy network (the actor). The value network estimates the action value of a given state-action pair, while the policy network maps a state to an action. The DDPG framework is shown in Figure 3.
[figure(s) omitted; refer to PDF]
A dual-network structure is also used in the DDPG algorithm, namely, a current network and a target network. Since an AC-type algorithm generally includes a policy network and a value network, DDPG has four networks in total after adopting the dual-network structure [18].
DDPG also uses a replay buffer to reduce the correlation between training data. During training, the agent randomly samples minibatches from the experience replay pool to compute the network losses and gradients and then updates the current policy and value networks through gradient backpropagation. Unlike DQN, which periodically copies parameters from the current network, DDPG uses a soft update for the target networks: the parameters are moved only slightly at each update, which is mathematically expressed as follows:
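Assuming the standard DDPG soft-update rule is what equation (12) refers to, each target-network parameter w' is moved slightly toward its current-network counterpart w as w' ← τw + (1 − τ)w', with τ the soft-update coefficient from Table 3. A minimal sketch:

```python
# Soft update of the target-network parameters toward the current network:
# w' <- tau*w + (1 - tau)*w', with tau = 0.001 taken from Table 3.
def soft_update(current_params, target_params, tau=0.001):
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(current_params, target_params)]
```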
Algorithm 1: DDPG for homing guidance law.
1. Initialize network parameters and target Q network parameters w,
2. Initialize replay pool D.
3. For episode = 0 to T
4. Interceptor's state s0 is initialized
5. For s = s0 to termination:
6. a) Output action
7. b) Execute a, transfer to s', and get reward r. Judge termination d.
8. c) Store the transition {s, a, s', r, d} in replay pool D.
9. d) Sample n transitions from D.
10. e) Compute the current target Q value yi.
11. f) Compute the loss
12. g) Compute the loss of policy network
13. h) Update parameters in target networks with equation (12).
14. i) Update state: s = s'.
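For concreteness, a compact sketch of steps (e)-(h) of Algorithm 1 is given below. It is written with the TF2 Keras API for brevity, whereas the paper uses TensorFlow 1.15; the function and variable names (actor, critic, ddpg_update, and so on) are illustrative rather than the authors' implementation.

```python
# Sketch of one DDPG update step (steps e-h of Algorithm 1), TF2 Keras style.
import tensorflow as tf

GAMMA = 0.995   # discount factor (Table 3)
TAU = 0.001     # soft-update coefficient (Table 3)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch):
    s, a, r, s_next, done = batch  # tensors sampled from the replay pool D

    # e) target Q value: y = r + gamma * (1 - d) * Q'(s', mu'(s'))
    a_next = target_actor(s_next)
    y = r + GAMMA * (1.0 - done) * target_critic([s_next, a_next])

    # f) value-network loss: mean-squared TD error
    with tf.GradientTape() as tape:
        q = critic([s, a])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # g) policy-network loss: maximize Q(s, mu(s)) by minimizing its negative
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # h) soft update of the target networks (equation (12))
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for w, w_t in zip(net.variables, target.variables):
            w_t.assign(TAU * w + (1.0 - TAU) * w_t)
```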
3.3. RL Model of Interception
To solve the problem of interception using DDPG, the original problem needs to be transformed into the framework of RL. First, the corresponding MDP is established, and the elements of reinforcement learning are designed according to the motion model in Section 2.2.
3.3.1. State
The process of interception can be described by an MDP. The environment of this process consists of the 3D motion model established in Section 2. The state space mainly includes the LOS angles and their rates of change [29], which is expressed as follows:
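A sketch of the resulting four-dimensional state vector, matching the four state inputs listed in Table 1 (symbol names are illustrative):

```python
# State built only from LOS measurements: two LOS angles and their rates.
import numpy as np

def make_state(los_elev, los_azim, los_elev_rate, los_azim_rate):
    return np.array([los_elev, los_azim, los_elev_rate, los_azim_rate],
                    dtype=np.float32)
```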
3.3.2. Action
The DDPG algorithm is particularly appropriate for problems with continuous actions. Considering the interceptor's continuous maneuvering and neglecting any maneuver along the LOS, the interceptor maneuvers in the plane perpendicular to the LOS. Therefore, if the interceptor's acceleration in
The total acceleration acting on the interceptor is as follows:
We assume that the maximal overloads of the maneuvering target and of the interceptor in a given direction are 3 g and 6 g, respectively, so the target's and the interceptor's maximum total overloads are
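A sketch of how the normalized actor output can be mapped to commanded accelerations in the plane perpendicular to the LOS under the 6 g per-axis limit (g = 9.8 m/s² is assumed here; the scaling form is an assumption, not the paper's stated mapping):

```python
# Map the actor's tanh output in [-1, 1] to accelerations saturated at 6 g per axis.
import numpy as np

G = 9.8
A_MAX = 6.0 * G  # per-axis interceptor acceleration limit

def command_acceleration(norm_action):
    a_cmd = A_MAX * np.asarray(norm_action)   # two components in the LOS frame
    return np.clip(a_cmd, -A_MAX, A_MAX)
```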
3.3.3. Reward
The reward design is the key to RL problems. To ensure that training converges to the optimum, reward shaping [41] is used to avoid reward sparsity and learn the optimal policy.
The LOS rate and ZEM are considered in the reward function of the model. During interception, the LOS rate is positively correlated with the component of relative velocity perpendicular to the LOS; the smaller its absolute value, the smaller the ZEM. The Gaussian reward [30] is designed as follows:
The reward is a shaping reward that depends on the velocity-leading angle
[figure(s) omitted; refer to PDF]
To ensure effective interception of the target, a terminal reward constraint is required, so a terminal reward function is designed: if the final ZEM is within the allowable miss distance, a positive reward (+10) is given; otherwise, no reward (+0) is given. It is expressed as follows:
To sum up, the total reward is as follows:
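A hedged sketch of the reward structure described above is given below. The argument of the Gaussian term (LOS rate or velocity-leading angle), its scale, and the way the 0.05 reward coefficient of Table 3 enters are assumptions; the terminal bonus (+10 within the 0.2 m allowable miss distance, +0 otherwise) follows the text.

```python
# Hedged sketch of the reward: Gaussian shaping term plus terminal bonus.
import numpy as np

REWARD_COEF = 0.05     # reward coefficient (Table 3)
ALLOWED_MISS = 0.2     # allowable miss distance in m (Section 4)

def gaussian_reward(x, sigma_r=0.01):
    """Shaping term; x is the shaped quantity (e.g., LOS rate magnitude or
    velocity-leading angle) and sigma_r an assumed scale."""
    return REWARD_COEF * np.exp(-x * x / (2.0 * sigma_r ** 2))

def terminal_reward(final_zem):
    """+10 if the final ZEM is within the allowable miss distance, else +0."""
    return 10.0 if final_zem <= ALLOWED_MISS else 0.0
```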
3.4. Create the Agent
Based on the established interception framework, the networks are further designed, the algorithm hyperparameters are tuned, and the DDPG agent is trained.
3.4.1. The Neural Network
The TensorFlow framework is used to build the DDPG neural networks. DDPG contains two parts: the value network and the policy network. For the value network, the output is the action value corresponding to a state-action pair, which differs from the Q network in DQN. The value network uses a three-layer backpropagation (BP) neural network [42], shown in Table 1; ReLU and tanh activation functions are used [43]. The policy network structure is described in Table 2, and a sketch of both networks is given after Table 2.
Table 1
The structure of the value network.
Layers | Neurons | Activation functions |
Input of state | 4 | \ |
Input of action | 2 | \ |
Hidden layer 1 | 60 | ReLU |
Hidden layer 2 | 40 | ReLU |
Output | 1 | \ |
Table 2
The structure of the policy network.
Layers | Neurons | Activation functions |
Input of state | 4 | \ |
Hidden layer 1 | 60 | ReLU |
Hidden layer 2 | 40 | ReLU |
Output | 1 | Tanh |
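As a reference, the value and policy networks of Tables 1 and 2 can be sketched as follows. The TF2 Keras form is used for brevity, whereas the paper uses TensorFlow 1.15, and the two-dimensional actor output is an assumption chosen to match the two action inputs of the value network and the two perpendicular acceleration components (Table 2 itself lists a single output neuron).

```python
# Sketch of the critic (Table 1) and actor (Table 2) networks.
import tensorflow as tf
from tensorflow.keras import layers

def build_critic(state_dim=4, action_dim=2):
    s_in = layers.Input(shape=(state_dim,))   # LOS angles and rates
    a_in = layers.Input(shape=(action_dim,))  # commanded accelerations
    x = layers.Concatenate()([s_in, a_in])
    x = layers.Dense(60, activation="relu")(x)  # hidden layer 1 (Table 1)
    x = layers.Dense(40, activation="relu")(x)  # hidden layer 2 (Table 1)
    q = layers.Dense(1)(x)                      # action value Q(s, a)
    return tf.keras.Model([s_in, a_in], q)

def build_actor(state_dim=4, action_dim=2, a_max=6 * 9.8):
    s_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(60, activation="relu")(s_in)  # hidden layer 1 (Table 2)
    x = layers.Dense(40, activation="relu")(x)     # hidden layer 2 (Table 2)
    a = layers.Dense(action_dim, activation="tanh")(x)  # output in [-1, 1]
    return tf.keras.Model(s_in, layers.Lambda(lambda t: a_max * t)(a))
```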
3.4.2. Hyperparameters
Hyperparameters strongly affect training performance, and their tuning differs across applications: different problems require different parameter sets, and if the settings are unreasonable, the algorithm will not converge. Therefore, the hyperparameters need to be adjusted continually during training. The hyperparameters used in this problem were determined through numerous numerical simulations in the established interception environment; Table 3 lists the values ultimately chosen for this study.
Table 3
The hyperparameters of DDPG.
Hyperparameter | Parameter value |
Maximum iterations | 2000 |
Discount factor | 0.995 |
Coefficient of soft update | 0.001 |
Reward coefficient | 0.05 |
Capacity of experience replay pool | 100000 |
Minibatch size | 64 |
Environmental noise variance | 1.0 |
Noise attenuation rate | 0.99 |
Value network learning rate | 0.002 |
Policy network learning rate | 0.001 |
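The "environmental noise variance" and "noise attenuation rate" in Table 3 imply an exploration-noise schedule of the following form; whether Gaussian or Ornstein-Uhlenbeck noise is used, and whether it is applied to the normalized action, is not stated in the text, so this sketch is an assumption.

```python
# Assumed exploration-noise schedule: Gaussian noise on the normalized
# action, with the variance attenuated after each episode (Table 3 values).
import numpy as np

sigma2, decay = 1.0, 0.99   # noise variance and attenuation rate (Table 3)

def explore(norm_action):
    """Add Gaussian exploration noise to the normalized actor output."""
    noise = np.random.normal(0.0, np.sqrt(sigma2), size=np.shape(norm_action))
    return np.clip(np.asarray(norm_action) + noise, -1.0, 1.0)

# after each training episode, the noise variance is attenuated:
# sigma2 *= decay
```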
4. Simulations and Analysis
During training, the state measurement errors and the control time constant are not considered. The equations of motion in each episode are integrated with a fourth-order Runge-Kutta method with a 1 ms simulation step; a minimal integrator sketch is given after Table 4. Table 4 shows the initial conditions.
Table 4
The initial conditions for training.
Physical parameters | Reference value |
Azimuth angle of LOS | |
Elevation angle of LOS | |
LOS range | |
Interceptor’s position vector | |
Velocity yaw angle of target | |
Interceptor velocity | |
Target velocity | |
Alignment deviation perpendicular to intersection plane | |
Velocity pitch angle of the target | |
Alignment deviation in the intersection plane |
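The fixed-step, fourth-order Runge-Kutta propagation mentioned above can be sketched as follows; dynamics(t, x) stands for the engagement equations of motion from Section 2.2 and is a hypothetical name.

```python
def rk4_step(dynamics, t, x, dt=1e-3):
    """One fourth-order Runge-Kutta step with the 1 ms simulation step."""
    k1 = dynamics(t, x)
    k2 = dynamics(t + 0.5 * dt, x + 0.5 * dt * k1)
    k3 = dynamics(t + 0.5 * dt, x + 0.5 * dt * k2)
    k4 = dynamics(t + dt, x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```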
During training, a terminal reward with an allowable miss distance of 0.2 m was used. The results presented include the training results, a comparison with TPN, and a comparison with a homing guidance law based on DQN [22].
4.1. Results of Training
The DDPG environment is built in TensorFlow, and the agent generates a large volume of data that is used to optimize its policy. The agent is trained on a computer with an NVIDIA GeForce RTX 2080 Ti GPU and an Intel Xeon Gold 6226R CPU (2.90 GHz). The Python and TensorFlow versions are 3.7.6 and 1.15.0, respectively.
TensorBoard is used to monitor training; training 2000 episodes took approximately 9401 s, or about 2.6 hours. Figures 5 and 6 depict the losses of the policy and value networks, respectively, with the horizontal axis representing training iterations. A decrease in the policy-network loss corresponds to an increase in the Q-value output of the value network, as shown in Figure 5, indicating that the policy-network parameters are continuously optimized toward maximizing the action value. The loss of the value network is the TD error: it is relatively small in the early stages of training, and as training progresses the network becomes increasingly optimized, with lower TD-error values being more beneficial for algorithm training.
[figure(s) omitted; refer to PDF]
Figure 7 illustrates the change in rewards, with the horizontal axis representing episodes and the vertical axis showing the smoothed cumulative reward (blue) and average reward (orange) per episode. The maximum cumulative reward is reached after about 250 episodes. Because DDPG contains two networks, the stability of the algorithm is affected and fluctuations may occur, so the cumulative reward still varies within a certain range after convergence. Nevertheless, the agent's policy is optimized during training, and convergence is fast.
[figure(s) omitted; refer to PDF]
4.2. Comparison with TPN
The training in Section 4.1 does not consider measurement errors or time delays. Here, the trained agent is compared with TPN using different guidance coefficients for two target maneuver types: constant maneuvering and sinusoidal maneuvering. The simulations account for measurement error. Additionally, the control system's response delay is assumed to be two sampling periods (20 ms) after the guidance command; a simple way to model this delay is sketched below.
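One simple way to model the two-sampling-period (20 ms) response delay is a short command buffer; the 10 ms guidance sample time is inferred from the stated delay, and the variable names are illustrative.

```python
# Apply the guidance command two guidance samples (20 ms) after it is issued.
from collections import deque

DELAY_STEPS = 2  # two 10 ms sampling periods = 20 ms response delay
cmd_buffer = deque([0.0] * DELAY_STEPS, maxlen=DELAY_STEPS)

def delayed_command(new_cmd):
    """Return the command that actually acts on the interceptor this sample."""
    applied = cmd_buffer[0]   # command issued two samples ago
    cmd_buffer.append(new_cmd)
    return applied
```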
The simulation is conducted under the following conditions: the launch location’s latitude is 60°, longitude is 140°, launch azimuth is 90°, and altitude is 100 m. The target’s and interceptor’s initial information is presented in Table 5, indicating an initial relative distance between them of 100 km with
Table 5
The initial conditions of the simulation.
Position (km) | Velocity (m/s) | |
Interceptor | (0, 0, 0) | (338.7, 4984, -211) |
Target | (70, 50, -33.3) | (-6039, 610, 3486) |
4.2.1. Constant-Maneuvering Target
In the case of a constant-maneuvering target, the maneuver is considered only in the plane perpendicular to the LOS. To verify the generalization ability of the DDPG guidance law, we assume the target's acceleration is
[figure(s) omitted; refer to PDF]
The terminal miss distances are given in Table 6. In Figure 8(a), TPN with N = 3 and N = 5 cannot reduce the vertical velocity. When the time-to-go decreases,
Table 6
The comparison results of terminal miss distance.
TPN (N = 3) | TPN (N = 5) | DDPG guidance law |
Constant maneuvering | 414 m | 68 m | 0.16 m |
Sinusoidal maneuvering | 12.5 m | 0.93 m | 0.1 m |
4.2.2. Sinusoidal-Maneuvering Target
The target’s acceleration is
[figure(s) omitted; refer to PDF]
The guidance coefficient also takes 3 and 5. Figure 10 gives the simulation results. In Figure 10(a), the guidance law based on DDPG can reduce the vertical velocity more fully than TPN. In Figure 10(d), the change of the LOS rate also shows that DDPG can reduce the LOS rate more effectively during the guidance process.
[figure(s) omitted; refer to PDF]
The coefficient
The above results show that the proposed RL method is more effective than TPN at intercepting targets with a certain maneuvering capability. The DDPG guidance law effectively reduces the vertical relative velocity, ensures a very small final miss distance, and mitigates the divergence of the LOS rate.
4.3. Comparison with DQN
During the training process in Section 4.1, the target’s maximum overload is
Table 7
The initial conditions of the target and interceptor.
Position (km) | Velocity (m/s) | |
Target | (66.34, 50, -55.67) | (-5362, 0, 4499) |
Interceptor | (0, 0, 0) | (872, 4860, -785) |
[figure(s) omitted; refer to PDF]
Figures 11 and 12(a) show that the two methods produce discrete (DQN) and continuous (DDPG) acceleration commands, respectively. As shown in Figure 12(d), the terminal miss distances of the DQN and DDPG guidance laws are both below the allowable miss distance, and both are less than 0.01 m. When the target's overload saturation is
In addition, when the total overload saturation of the target is less than
5. Conclusion
In this paper, we propose a DDPG-based guidance law for the guidance and control of interceptors with continuous maneuvering capability. The DDPG agent is developed using TensorFlow and optimized in the interception engagement scenario. Taking into account the effects of measurement errors and time delays in guidance and control, the effectiveness of the proposed guidance law is compared with TPN and a DQN-based RL guidance law through simulations of typical examples. The results suggest that the DDPG-based guidance law outperforms the other two in guidance performance. Future research could consider more complex interception scenarios and explore more suitable intelligent guidance methods, with potential implementation in real interception processes.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (Grant No. 12002370).
[1] D. Hong, M. Kim, S. Park, "Study on reinforcement learning-based missile guidance law," Applied Sciences, vol. 10 no. 18,DOI: 10.3390/app10186567, 2020.
[2] Y. W. Fang, T. B. Deng, W. X. Fu, "Review of intelligent guidance law," Unmanned Systems Technology, vol. 3 no. 6, pp. 36-42, 2020.
[3] Y. F. Nie, Q. J. Zhou, T. Zhang, "Research status and prospect of guidance law," Flight Mechanics, vol. 19 no. 3, 2001.
[4] P. Zarchan, Tactical and Strategic Missile Guidance, 2012.
[5] K. B. Li, W. S. Sun, L. Chen, "Performance analysis of realistic true proportional navigation against maneuvering targets using Lyapunov-like approach," Aerospace Science and Technology, vol. 69 no. 10, pp. 333-341, DOI: 10.1016/j.ast.2017.06.036, 2017.
[6] C. D. Yang, C. C. Yang, "A unified approach to proportional navigation," IEEE Transactions on Aerospace and Electronic Systems, vol. 33 no. 2, pp. 557-567, DOI: 10.1109/7.575895, 1997.
[7] K. B. Li, Z. H. Bai, H. S. Shin, A. Tsourdos, M. J. Tahk, "Capturability of 3D RTPN against true-arbitrarily maneuvering target with maneuverability limitation," Chinese Journal of Aeronautics, vol. 644, pp. 4511-4528, DOI: 10.1007/978-981-15-8155-7_374, 2022.
[8] Z. H. Bai, K. B. Li, W. S. Su, L. Chen, "Real true proportional guidance intercepts the capture area of any maneuvering target," Acta Aeronautica et Astronautica Sinica, vol. 41 no. 8, pp. 338-348, 2020.
[9] X. S. Huang, Missile Guidance and Control Systems Design, 2013.
[10] K. B. Li, L. Chen, X. Z. Bai, "Differential geometry modeling of interceptor guidance," SCIENCE CHINA Technological Sciences, vol. 41 no. 9, pp. 1205-1217, 2011.
[11] Z. X. Li, R. Zhang, "Time-varying sliding mode control of missile based on suboptimal method," Journal of Systems Engineering and Electronics, vol. 32 no. 3, pp. 700-710, DOI: 10.23919/JSEE.2021.000060, 2021.
[12] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 1998.
[13] T. R. Li, B. Yang, R. Wang, J. P. Hui, "Guidance method of reentry vehicle based on Q-learning algorithm," Tactical Missile Technology, vol. 5, pp. 44-49, 2019.
[14] Q. H. Zhang, B. Q. Ao, Q. X. Zhang, "Q-learning reinforcement learning guidance law," Systems Engineering and Electronics, vol. 42 no. 2, pp. 414-419, 2020.
[15] B. Gaudet, R. Furfaro, "Missile homing-phase guidance law design using reinforcement learning," AIAA Guidance, Navigation, and Control Conference, DOI: 10.2514/6.2012-4470, 2012.
[16] G. L. Han, Design of Terminal Guidance Guidance Law Based on Reinforcement Learning, 2019.
[17] H. Y. Li, J. Wang, S. M. He, C. H. Lee, "Nonlinear optimal impact-angle-constrained guidance with large initial heading error," Journal of Guidance, Control, and Dynamics, vol. 44 no. 9, pp. 1663-1676, DOI: 10.2514/1.G005868, 2021.
[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, "Continuous control with deep reinforcement learning," 2015. https://arxiv.org/abs/1509.02971
[19] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362 no. 6419, pp. 1140-1144, DOI: 10.1126/science.aar6404, 2018.
[20] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529 no. 7587, pp. 484-489, DOI: 10.1038/nature16961, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518 no. 7540, pp. 529-533, DOI: 10.1038/nature14236, 2015.
[22] J. Tang, Z. H. Bai, Y. G. Liang, F. Zheng, K. Li, "An exoatmospheric homing guidance law based on deep Q network," International Journal of Aerospace Engineering, vol. 2022,DOI: 10.1155/2022/1544670, 2022.
[23] S. Fujimoto, H. V. Hoof, D. Meger, "Addressing function approximation error in actor-critic methods," 2018. https://arxiv.org/abs/1802.09477
[24] S. M. He, H. S. Shin, A. Tsourdos, "Computational missile guidance: a deep reinforcement learning approach," Journal of Aerospace Information Systems, vol. 18 no. 8, pp. 571-582, DOI: 10.2514/1.I010970, 2021.
[25] X. L. Hou, H. Li, Z. Wang, Z. X. Wu, H. Wen, "Design of missile terminal guidance law based on DDPG algorithm," Tactical Missile Technology, vol. 4, pp. 110-116, 2021.
[26] Y. Liu, Z. Z. He, C. Y. Wang, M. Z. Guo, "Research on terminal guidance law design based on DDPG algorithm," Journal of Computer Science, vol. 44 no. 9, pp. 1854-1865, 2021.
[27] A. Ratnoo, D. Ghose, "Collision-geometry-based pulsed guidance law for exo-atmospheric interception," Journal of Guidance, Control, and Dynamics, vol. 32 no. 2, pp. 669-675, DOI: 10.2514/1.37863, 2009.
[28] S. Gutman, "Exo-atmospheric interception via linear quadratic optimization," Journal of Guidance, Control, and Dynamics, vol. 42 no. 3, pp. 624-631, DOI: 10.2514/1.G003093, 2019.
[29] B. Gaudet, R. Furfaro, R. Linares, "Reinforcement learning for angle-only intercept guidance of maneuvering targets," Aerospace Science and Technology, vol. 99, article 105746,DOI: 10.1016/j.ast.2020.105746, 2020.
[30] B. Gaudet, R. Furfaro, R. Linares, A. Scorsoglio, "Reinforcement meta-learning for interception of maneuvering exo-atmospheric targets with parasitic attitude loop," Journal of Spacecraft and Rockets, vol. 58 no. 2, pp. 386-399, DOI: 10.2514/1.a34841, 2021.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, "Proximal policy optimization algorithms," 2017. https://arxiv.org/abs/1707.06347
[32] H. Xiang, J. Lin, C. H. Chen, Y. Kong, "Asymptotic meta learning for cross validation of models for financial data," IEEE Intelligent Systems, vol. 35 no. 2, pp. 16-24, DOI: 10.1109/MIS.2020.2973255, 2020.
[33] C. Liang, W. Wang, Z. Liu, C. Lai, B. Zhou, "Learning to guide: guidance law based on deep meta-learning and model predictive path integral control," IEEE Access, vol. 7, pp. 47353-47365, DOI: 10.1109/ACCESS.2019.2909579, 2019.
[34] K. B. Li, H. S. Shin, A. Tsourdos, M. J. Tahk, "Performance of 3-D PPN against arbitrarily maneuvering target for homing phase," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 5, pp. 3878-3891, DOI: 10.1109/TAES.2020.2987404, 2020.
[35] K. B. Li, H. S. Shin, A. Tsourdos, "Capturability of a sliding-mode guidance law with finite-time convergence," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 3, pp. 2312-2325, DOI: 10.1109/TAES.2019.2948519, 2020.
[36] K. B. Li, H. S. Shin, A. Tsourdos, M. J. Tahk, "Capturability of 3D PPN against lower-speed maneuvering target for homing phase," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 1, pp. 711-722, DOI: 10.1109/TAES.2019.2938601, 2020.
[37] H. S. Shin, K. B. Li, "An improvement in three-dimensional pure proportional navigation guidance," IEEE Transactions on Aerospace and Electronic Systems, vol. 57 no. 5, pp. 3004-3014, DOI: 10.1109/TAES.2021.3067656, 2021.
[38] S. Kieninger, L. Donati, B. G. Keller, "Dynamical reweighting methods for Markov models," Current Opinion in Structural Biology, vol. 61, pp. 124-131, DOI: 10.1016/j.sbi.2019.12.018, 2020.
[39] L. Busoniu, T. De Bruin, D. Tolic, J. Kober, I. Palunko, "Reinforcement learning for control: performance, stability, and deep approximators," Annual Reviews in Control, vol. 46, DOI: 10.1016/j.arcontrol.2018.09.005, 2018.
[40] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, M. Botvinick, "Learning to reinforcement learn," 2016. https://arxiv.org/abs/1611.05763
[41] W. Y. Yang, C. J. Bai, C. Cai, "Review on sparse reward in deep reinforcement learning," Computer Science, vol. 47 no. 3, pp. 183-191, 2020.
[42] N. Fatema, S. G. Farkoush, M. H. Hasan, H. Malik, "Deterministic and probabilistic occupancy detection with a novel heuristic optimization and back-propagation (BP) based algorithm," Journal of Intelligent Fuzzy Systems, vol. 42 no. 2, pp. 779-791, DOI: 10.3233/JIFS-189748, 2022.
[43] A. Maniatopoulos, N. Mitianoudis, "Learnable leaky ReLU (LeLeLU): an alternative accuracy-optimized activation function," Information, vol. 12 no. 12,DOI: 10.3390/info12120513, 2021.
Copyright © 2023 Yangang Liang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
A novel homing guidance law against maneuvering targets based on the deep deterministic policy gradient (DDPG) is proposed. The proposed guidance law directly maps the engagement state information to the acceleration of the interceptor, forming an end-to-end guidance policy. Firstly, the kinematic model of the interception process is described as a Markov decision process (MDP) suitable for the deep reinforcement learning (DRL) algorithm. Then, the training environment, state, action, and network structure are designed. Only the measurements of line-of-sight (LOS) angles and LOS rotational rates are used as state inputs, which greatly simplifies the problem of state estimation. Considering the LOS rotational rate and zero-effort-miss (ZEM), a Gaussian reward and a terminal reward are designed to build a complete training and testing simulation environment. DDPG is used to solve the RL problem and obtain the guidance law. Finally, the performance of the proposed RL guidance law is validated through numerical simulation examples. It demonstrates improved performance compared with the classical true proportional navigation (TPN) method and an RL guidance policy based on the deep Q-network (DQN).
Details
1 College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China; Hunan Key Laboratory of Intelligent Planning and Simulation for Aerospace Mission, Changsha 410073, China
2 The 31102 Troops, Nanjing 210000, China