1. Introduction
Intercepting maneuvering targets is particularly challenging because of the complexity of the engagement [1, 2]. Traditional guidance and control systems for interception show weaknesses against highly maneuvering targets, whereas intelligent methods offer a way to address the problem [3]. In the field of guidance, proportional navigation (PN) has found widespread application because of its simplicity and robustness [4]. PN is mainly divided into true proportional navigation (TPN) [5] and pure proportional navigation (PPN) [6]. For maneuvering targets, Ref. [7] investigated the capture region of realistic true proportional navigation (RTPN) in three-dimensional (3D) space, taking into account the nonlinearity of the interceptor-target relative kinematics and obtaining more general findings. However, when targets maneuver strongly, the performance of PN can decline significantly, mainly because the commanded acceleration of PN often exceeds the capability of the interceptor, resulting in large miss distances [8]. Optimal guidance law (OGL) can intercept or strike a target while optimizing a specific performance index [9]. However, OGL requires an accurate estimate of the time-to-go; otherwise, its performance may degrade. Many newer guidance methods based on differential geometry [10], sliding mode control [11], and other dynamics and control theories have also been proposed. However, these guidance laws are often complex in form, typically require a large amount of measurement information, and involve many guidance parameters, making them difficult to apply in practice.
Reinforcement learning (RL) [12] offers a new approach to the homing guidance law design problem. For example, Q-learning is used in [13] and [14] to adaptively determine guidance parameters through training. In [15], a guidance framework based on an RL-designed guidance law is proposed, and extensive numerical simulations show that RL-based guidance laws substantially outperform PN. However, these traditional RL-based algorithms improve guidance performance only by selecting suitable controller coefficients [16], which cannot achieve precise guidance under realistic disturbed conditions. Moreover, the state and action sets of traditional RL methods are discrete and low-dimensional, whereas the actual interception engagement is continuous and high-dimensional [17].
As deep learning (DL) continued to advance, a new class of algorithms known as deep reinforcement learning (DRL), combining DL and RL techniques, emerged [18]. DRL methods can effectively handle complex, high-dimensional spaces [19, 20], so they may offer advantages for homing guidance. Ref. [21] proposed the deep Q-network (DQN), which solves the problem of high-dimensional inputs. For the problem of exoatmospheric homing guidance, a guidance method using DQN is proposed in [22]. However, DQN is better suited to discrete control problems, while the actual interceptor's acceleration is usually continuous; discretized action commands may lead to large deviations and a large miss distance.
The DDPG algorithm, introduced in [18], is an actor-critic (AC) [23] algorithm that is well suited to the homing guidance problem in continuous state and action spaces. Ref. [24] explored applying DDPG to the design of homing guidance laws; by comparing learning from scratch with learning from prior knowledge, it showed that the latter improves learning efficiency. In [25] and [26], missile terminal guidance laws are also developed based on DDPG, and the results show that the proposed policies are more robust and achieve smaller miss distances than PN. However, most DRL-based guidance laws need to measure or estimate the relative velocity and position between the target and the interceptor as well as the target's acceleration [27, 28]. These measurements are numerous and are usually subject to lags and large errors. An RL-based guidance law was proposed in [29] and [30] to address this problem, using only the LOS angle measurements and their rates of change as observations. This simplifies state estimation and may eliminate the adverse effects of position and velocity estimation biases. Ref. [29] uses proximal policy optimization (PPO), combined with metalearning [31, 32], to derive a homing guidance law for intercepting exoatmospheric maneuvering targets; experimental results show that it outperforms the augmented ZEM guidance method [30]. Ref. [33] proposed a model-based DRL method that uses deep neural networks and metalearning to approximate a predictive model of the guidance dynamics and incorporates it into a path-integral control framework. It provides a general guidance framework, but it is complex in form, and the problem of estimation errors remains unsolved.
In this paper, a novel homing guidance law against maneuvering targets is proposed using the DDPG algorithm. It directly maps the engagement state information to the commanded acceleration, forming an end-to-end, model-free guidance policy. The proposed homing guidance law takes only the LOS angles and LOS rates between the target and interceptor as observation and state inputs and does not require prior estimation of the target's acceleration. The DDPG algorithm can effectively handle a continuous, high-dimensional dynamic environment. A continuous action space is designed based on the interceptor's acceleration overload, the LOS rate and ZEM are the main considerations in the reward design, and the agent is trained in a 3D environment. Comparisons with TPN and a DQN-based RL guidance law show that the proposed guidance method has strong environmental adaptability and better guidance performance.
The paper is structured as follows: Section 2 presents the problem formulation, including the engagement scenario and the models of motion and measurement. Section 3 introduces the DDPG algorithm and describes the details of the RL guidance law. The results are given in Section 4, and Section 5 presents the conclusion.
2. Problem Formulation
2.1. Engagement Scenario
A simplified engagement scenario of the interception process is used. Referring to Figure 1, the target's and the interceptor's position vectors are
[figure(s) omitted; refer to PDF]
For the process of interception, the closing velocity is usually large. If the target and interceptor maneuver along the LOS direction, it can be challenging to alter the miss distance outcome. Therefore, we assume that the interceptor maneuvers only in a plane perpendicular to the direction of LOS in the LOS coordinate system, without considering its maneuver along the LOS direction.
2.2. Motion Model of Interception
The intersection plane is formed by
[figure(s) omitted; refer to PDF]
The relative velocity is decomposed into two components:
The LOS direction can be represented using
In summary, when
ZEM is the final miss distance that would result if neither the target nor the interceptor maneuvers from the current instant [7, 34]. ZEM and time-to-go are calculated as follows.
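For reference, the following is a minimal sketch of the standard time-to-go and ZEM relations under the no-maneuver assumption, with r and v the relative position and velocity vectors; the exact expressions used in the paper follow [7, 34] and may differ in form.

```python
# Minimal sketch of time-to-go and zero-effort-miss (ZEM) for a
# non-maneuvering target and interceptor (standard relations, not
# necessarily the paper's exact formulation).
import numpy as np

def time_to_go(r, v):
    """Time-to-go assuming constant relative velocity from the current instant."""
    return -np.dot(r, v) / np.dot(v, v)

def zem(r, v):
    """Miss distance that would result if neither vehicle maneuvers further."""
    return np.linalg.norm(r + v * time_to_go(r, v))
```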
2.3. Measurement Model of Interception
The measurement model mainly processes the information measured by the interceptor and is used to calculate the LOS angles and LOS rates of change from the current missile-target state [37]. Referring to Section 2.1, the relative position and velocity vectors are as follows:
By utilizing equations (9) and (10),
The simulations in this paper neglect measurement errors in the relative distance, closing velocity, and LOS angles; only errors in the LOS angular rate measurements are introduced. A Gaussian noise with zero mean and a specified standard deviation
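A minimal sketch of this measurement model is given below; the standard deviation value is an illustrative placeholder, not the paper's setting.

```python
# Only the LOS angular rates are corrupted, with additive zero-mean
# Gaussian noise; sigma is an assumed, illustrative value (rad/s).
import numpy as np

def measure_los_rates(true_rates, sigma=1e-4):
    return np.asarray(true_rates) + np.random.normal(0.0, sigma, size=np.shape(true_rates))
```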
3. RL Homing Guidance Law
Establishing a Markov decision model [38] of the problem is a prerequisite for designing the homing guidance law using the DRL algorithm [12]. Then, the interception problem needs to be transformed into the RL framework.
3.1. The Overview of RL
Reinforcement learning is an iterative process [39] that involves an agent interacting with the environment, observing state
Reinforcement learning algorithms are broadly categorized into two classes: value-function methods and policy-gradient methods [40]. The former, such as Q-learning and DQN, estimate the value of state-action pairs; the latter, such as the policy-gradient and AC algorithms, directly learn a policy that maps states to actions. DRL algorithms, such as DDPG and A3C, combine deep learning with these methods. However, value-function methods are not well suited to high-dimensional problems with continuous action spaces, whereas policy-gradient methods based on the AC architecture have advantages for such problems. In this paper, DDPG is used to solve the problem and is compared against TPN and a DQN-based algorithm.
3.2. DDPG Algorithm
DDPG is based on the AC architecture and solves RL problems with continuous state and action spaces. The algorithm uses neural networks to approximate two functions: the value network (the critic) and the policy network (the actor). The value network estimates the action value of a given state-action pair, while the policy network maps a state to an action. The DDPG framework is shown in Figure 3.
[figure(s) omitted; refer to PDF]
A dual-network structure is also used in the DDPG algorithm, namely, a current network and a target network. Since an AC-type algorithm generally includes a policy network and a value network, DDPG has four networks in total after adopting the dual-network structure [18].
DDPG also uses a replay buffer to reduce the correlation between training data. During training, the agent randomly samples minibatches from the experience replay pool to compute the network losses and gradients and then updates the current policy and value networks through gradient backpropagation. Unlike DQN, which periodically copies parameters from the current network, DDPG uses a soft update for the target networks: the parameters are moved only slightly at each update, which is mathematically expressed as follows:
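Assuming the standard DDPG soft-update rule is what equation (12) refers to, each target-network parameter w' is moved slightly toward its current-network counterpart w as w' ← τw + (1 − τ)w', with τ the soft-update coefficient from Table 3. A minimal sketch:

```python
# Soft update of the target-network parameters toward the current network:
# w' <- tau*w + (1 - tau)*w', with tau = 0.001 taken from Table 3.
def soft_update(current_params, target_params, tau=0.001):
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(current_params, target_params)]
```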
Algorithm 1: DDPG for homing guidance law.
1. Initialize network parameters and target Q network parameters w,
2. Initialize replay pool D.
3. For episode = 0 to T
4. Interceptor's state s0 is initialized
5. For s = s0 to termination:
6. a) Output action
7. b) Execute a, transfer to s', and get reward r. Judge termination d.
8. c) Store the transition {s, a, s', r, d} in replay pool D.
9. d) Sample n transitions from D.
10. e) Compute the current target Q value yi.
11. f) Compute the loss
12. g) Compute the loss of policy network
13. h) Update parameters in target networks with equation (12).
14. i) Update state: s = s'.
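For concreteness, a compact sketch of steps (e)-(h) of Algorithm 1 is given below. It is written with the TF2 Keras API for brevity, whereas the paper uses TensorFlow 1.15; the function and variable names (actor, critic, ddpg_update, and so on) are illustrative rather than the authors' implementation.

```python
# Sketch of one DDPG update step (steps e-h of Algorithm 1), TF2 Keras style.
import tensorflow as tf

GAMMA = 0.995   # discount factor (Table 3)
TAU = 0.001     # soft-update coefficient (Table 3)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch):
    s, a, r, s_next, done = batch  # tensors sampled from the replay pool D

    # e) target Q value: y = r + gamma * (1 - d) * Q'(s', mu'(s'))
    a_next = target_actor(s_next)
    y = r + GAMMA * (1.0 - done) * target_critic([s_next, a_next])

    # f) value-network loss: mean-squared TD error
    with tf.GradientTape() as tape:
        q = critic([s, a])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # g) policy-network loss: maximize Q(s, mu(s)) by minimizing its negative
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # h) soft update of the target networks (equation (12))
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for w, w_t in zip(net.variables, target.variables):
            w_t.assign(TAU * w + (1.0 - TAU) * w_t)
```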
3.3. RL Model of Interception
To solve the problem of interception using DDPG, the original problem needs to be transformed into the framework of RL. First, the corresponding MDP is established, and the elements of reinforcement learning are designed according to the motion model in Section 2.2.
3.3.1. State
The process of interception can be described by an MDP. The environment of this process consists of the 3D motion model established in Section 2. The state space mainly includes the LOS angles and their rates of change [29], which is expressed as follows:
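A sketch of the resulting four-dimensional state vector, matching the four state inputs listed in Table 1 (symbol names are illustrative):

```python
# State built only from LOS measurements: two LOS angles and their rates.
import numpy as np

def make_state(los_elev, los_azim, los_elev_rate, los_azim_rate):
    return np.array([los_elev, los_azim, los_elev_rate, los_azim_rate],
                    dtype=np.float32)
```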
3.3.2. Action
The DDPG algorithm is particularly appropriate for problems with continuous actions. Considering the interceptor's continuous maneuvering and neglecting any maneuver along the LOS, the interceptor maneuvers in the plane perpendicular to the LOS. Therefore, if the interceptor's acceleration in
The total acceleration acting on the interceptor is as follows:
We assume that the maximal overloads of the maneuvering target and of the interceptor in a given direction are 3 g and 6 g, respectively, so the target's and the interceptor's maximum total overloads are
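A sketch of how the normalized actor output can be mapped to commanded accelerations in the plane perpendicular to the LOS under the 6 g per-axis limit (g = 9.8 m/s² is assumed here; the scaling form is an assumption, not the paper's stated mapping):

```python
# Map the actor's tanh output in [-1, 1] to accelerations saturated at 6 g per axis.
import numpy as np

G = 9.8
A_MAX = 6.0 * G  # per-axis interceptor acceleration limit

def command_acceleration(norm_action):
    a_cmd = A_MAX * np.asarray(norm_action)   # two components in the LOS frame
    return np.clip(a_cmd, -A_MAX, A_MAX)
```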
3.3.3. Reward
The reward design is the key to RL problems. To ensure that training converges to the optimum, reward shaping [41] is used to avoid reward sparsity and learn the optimal policy.
The LOS rate and ZEM are considered in the reward function of the model. During interception, the LOS rate is positively correlated with the component of relative velocity perpendicular to the LOS; the smaller its absolute value, the smaller the ZEM. The Gaussian reward [30] is designed as follows:
The reward is a shaping reward that depends on the velocity-leading angle
[figure(s) omitted; refer to PDF]
To ensure effective interception of the target, a terminal reward constraint is required, so a terminal reward function is designed: if the final ZEM is within the allowable miss distance, a positive reward (+10) is given; otherwise, no reward (+0) is given. It is expressed as follows:
To sum up, the total reward is as follows:
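A hedged sketch of the reward structure described above is given below. The argument of the Gaussian term (LOS rate or velocity-leading angle), its scale, and the way the 0.05 reward coefficient of Table 3 enters are assumptions; the terminal bonus (+10 within the 0.2 m allowable miss distance, +0 otherwise) follows the text.

```python
# Hedged sketch of the reward: Gaussian shaping term plus terminal bonus.
import numpy as np

REWARD_COEF = 0.05     # reward coefficient (Table 3)
ALLOWED_MISS = 0.2     # allowable miss distance in m (Section 4)

def gaussian_reward(x, sigma_r=0.01):
    """Shaping term; x is the shaped quantity (e.g., LOS rate magnitude or
    velocity-leading angle) and sigma_r an assumed scale."""
    return REWARD_COEF * np.exp(-x * x / (2.0 * sigma_r ** 2))

def terminal_reward(final_zem):
    """+10 if the final ZEM is within the allowable miss distance, else +0."""
    return 10.0 if final_zem <= ALLOWED_MISS else 0.0
```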
3.4. Create the Agent
Based on the established interception framework, the networks are further designed, the algorithm hyperparameters are tuned, and the DDPG agent is trained.
3.4.1. The Neural Network
The TensorFlow framework is used to build the DDPG neural networks. DDPG contains two parts: the value network and the policy network. For the value network, the output is the action value corresponding to a state-action pair, which differs from the Q network in DQN. The value network uses a three-layer backpropagation (BP) neural network [42], shown in Table 1; ReLU and tanh activation functions are used [43]. The policy network structure is described in Table 2, and a sketch of both networks is given after Table 2.
Table 1
The structure of the value network.
Layers | Neurons | Activation functions |
Input of state | 4 | \ |
Input of action | 2 | \ |
Hidden layer 1 | 60 | ReLU |
Hidden layer 2 | 40 | ReLU |
Output | 1 | \ |
Table 2
The structure of the policy network.
Layers | Neurons | Activation functions |
Input of state | 4 | \ |
Hidden layer 1 | 60 | ReLU |
Hidden layer 2 | 40 | ReLU |
Output | 1 | Tanh |
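As a reference, the value and policy networks of Tables 1 and 2 can be sketched as follows. The TF2 Keras form is used for brevity, whereas the paper uses TensorFlow 1.15, and the two-dimensional actor output is an assumption chosen to match the two action inputs of the value network and the two perpendicular acceleration components (Table 2 itself lists a single output neuron).

```python
# Sketch of the critic (Table 1) and actor (Table 2) networks.
import tensorflow as tf
from tensorflow.keras import layers

def build_critic(state_dim=4, action_dim=2):
    s_in = layers.Input(shape=(state_dim,))   # LOS angles and rates
    a_in = layers.Input(shape=(action_dim,))  # commanded accelerations
    x = layers.Concatenate()([s_in, a_in])
    x = layers.Dense(60, activation="relu")(x)  # hidden layer 1 (Table 1)
    x = layers.Dense(40, activation="relu")(x)  # hidden layer 2 (Table 1)
    q = layers.Dense(1)(x)                      # action value Q(s, a)
    return tf.keras.Model([s_in, a_in], q)

def build_actor(state_dim=4, action_dim=2, a_max=6 * 9.8):
    s_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(60, activation="relu")(s_in)  # hidden layer 1 (Table 2)
    x = layers.Dense(40, activation="relu")(x)     # hidden layer 2 (Table 2)
    a = layers.Dense(action_dim, activation="tanh")(x)  # output in [-1, 1]
    return tf.keras.Model(s_in, layers.Lambda(lambda t: a_max * t)(a))
```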
3.4.2. Hyperparameters
Hyperparameters strongly affect training performance, and their tuning differs across applications: different problems require different parameter sets, and if the settings are unreasonable, the algorithm will not converge. Therefore, the hyperparameters need to be adjusted continually during training. The hyperparameters used in this problem were determined through numerous numerical simulations in the established interception environment; Table 3 lists the values ultimately chosen for this study.
Table 3
The hyperparameters of DDPG.
Hyperparameter | Parameter value |
Maximum iterations | 2000 |
Discount factor | 0.995 |
Coefficient of soft update | 0.001 |
Reward coefficient | 0.05 |
Capacity of experience replay pool | 100000 |
Minibatch size | 64 |
Environmental noise variance | 1.0 |
Noise attenuation rate | 0.99 |
Value network learning rate | 0.002 |
Policy network learning rate | 0.001 |
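The "environmental noise variance" and "noise attenuation rate" in Table 3 imply an exploration-noise schedule of the following form; whether Gaussian or Ornstein-Uhlenbeck noise is used, and whether it is applied to the normalized action, is not stated in the text, so this sketch is an assumption.

```python
# Assumed exploration-noise schedule: Gaussian noise on the normalized
# action, with the variance attenuated after each episode (Table 3 values).
import numpy as np

sigma2, decay = 1.0, 0.99   # noise variance and attenuation rate (Table 3)

def explore(norm_action):
    """Add Gaussian exploration noise to the normalized actor output."""
    noise = np.random.normal(0.0, np.sqrt(sigma2), size=np.shape(norm_action))
    return np.clip(np.asarray(norm_action) + noise, -1.0, 1.0)

# after each training episode, the noise variance is attenuated:
# sigma2 *= decay
```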
4. Simulations and Analysis
During training, the state measurement errors and the control time constant are not considered. The equations of motion in each episode are integrated with a fourth-order Runge-Kutta method with a 1 ms simulation step; a minimal integrator sketch is given after Table 4. Table 4 shows the initial conditions.
Table 4
The initial conditions for training.
Physical parameters | Reference value |
Azimuth angle of LOS | |
Elevation angle of LOS | |
LOS range | |
Interceptor’s position vector | |
Velocity yaw angle of target | |
Interceptor velocity | |
Target velocity | |
Alignment deviation perpendicular to intersection plane | |
Velocity pitch angle of the target | |
Alignment deviation in the intersection plane |
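The fixed-step, fourth-order Runge-Kutta propagation mentioned above can be sketched as follows; dynamics(t, x) stands for the engagement equations of motion from Section 2.2 and is a hypothetical name.

```python
def rk4_step(dynamics, t, x, dt=1e-3):
    """One fourth-order Runge-Kutta step with the 1 ms simulation step."""
    k1 = dynamics(t, x)
    k2 = dynamics(t + 0.5 * dt, x + 0.5 * dt * k1)
    k3 = dynamics(t + 0.5 * dt, x + 0.5 * dt * k2)
    k4 = dynamics(t + dt, x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```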
During training, a terminal reward with an allowable miss distance of 0.2 m was used. The results presented include the training results, a comparison with TPN, and a comparison with a homing guidance law based on DQN [22].
4.1. Results of Training
The DDPG environment is built in TensorFlow, and the agent generates a large volume of data that is used to optimize its policy. The agent is trained on a computer with an NVIDIA GeForce RTX 2080 Ti GPU and an Intel Xeon Gold 6226R CPU (2.90 GHz). The Python and TensorFlow versions are 3.7.6 and 1.15.0, respectively.
TensorBoard is used to monitor training; training 2000 episodes took approximately 9401 s, or about 2.6 hours. Figures 5 and 6 depict the losses of the policy and value networks, respectively, with the horizontal axis representing training iterations. A decrease in the policy-network loss corresponds to an increase in the Q-value output of the value network, as shown in Figure 5, indicating that the policy-network parameters are continuously optimized toward maximizing the action value. The loss of the value network is the TD error: it is relatively small in the early stages of training, and as training progresses the network becomes increasingly optimized, with lower TD-error values being more beneficial for algorithm training.
[figure(s) omitted; refer to PDF]
Figure 7 illustrates the change in rewards, with the horizontal axis representing episodes and the vertical axis showing the smoothed cumulative reward (blue) and average reward (orange) per episode. The maximum cumulative reward is reached after about 250 episodes. Because DDPG contains two networks, the stability of the algorithm is affected and fluctuations may occur, so the cumulative reward still varies within a certain range after convergence. Nevertheless, the agent's policy is optimized during training, and convergence is fast.
[figure(s) omitted; refer to PDF]
4.2. Comparison with TPN
The training in Section 4.1 does not consider measurement errors or time delays. Here, the trained agent is compared with TPN using different guidance coefficients for two target maneuver types: constant maneuvering and sinusoidal maneuvering. The simulations account for measurement error. Additionally, the control system's response delay is assumed to be two sampling periods (20 ms) after the guidance command; a simple way to model this delay is sketched below.
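One simple way to model the two-sampling-period (20 ms) response delay is a short command buffer; the 10 ms guidance sample time is inferred from the stated delay, and the variable names are illustrative.

```python
# Apply the guidance command two guidance samples (20 ms) after it is issued.
from collections import deque

DELAY_STEPS = 2  # two 10 ms sampling periods = 20 ms response delay
cmd_buffer = deque([0.0] * DELAY_STEPS, maxlen=DELAY_STEPS)

def delayed_command(new_cmd):
    """Return the command that actually acts on the interceptor this sample."""
    applied = cmd_buffer[0]   # command issued two samples ago
    cmd_buffer.append(new_cmd)
    return applied
```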
The simulation is conducted under the following conditions: the launch location’s latitude is 60°, longitude is 140°, launch azimuth is 90°, and altitude is 100 m. The target’s and interceptor’s initial information is presented in Table 5, indicating an initial relative distance between them of 100 km with
Table 5
The initial conditions of the simulation.
Position (km) | Velocity (m/s) | |
Interceptor | (0, 0, 0) | (338.7, 4984, -211) |
Target | (70, 50, -33.3) | (-6039, 610, 3486) |
4.2.1. Constant-Maneuvering Target
In the case of a constant-maneuvering target, the maneuver is considered only in the plane perpendicular to the LOS. To verify the generalization ability of the DDPG guidance law, we assume the target's acceleration is
[figure(s) omitted; refer to PDF]
The terminal miss distances are given in Table 6. In Figure 8(a), TPN with N = 3 and N = 5 cannot reduce the vertical velocity. When the time-to-go decreases,
Table 6
The comparison results of terminal miss distance.
TPN (N = 3) | TPN (N = 5) | DDPG guidance law |
Constant maneuvering | 414 m | 68 m | 0.16 m |
Sinusoidal maneuvering | 12.5 m | 0.93 m | 0.1 m |
4.2.2. Sinusoidal-Maneuvering Target
The target’s acceleration is
[figure(s) omitted; refer to PDF]
The guidance coefficient also takes 3 and 5. Figure 10 gives the simulation results. In Figure 10(a), the guidance law based on DDPG can reduce the vertical velocity more fully than TPN. In Figure 10(d), the change of the LOS rate also shows that DDPG can reduce the LOS rate more effectively during the guidance process.
[figure(s) omitted; refer to PDF]
The coefficient
The above results show that the proposed RL method is more effective than TPN at intercepting targets with a certain maneuvering capability. The DDPG guidance law effectively reduces the vertical relative velocity, ensures a very small final miss distance, and mitigates the divergence of the LOS rate.
4.3. Comparison with DQN
During the training process in Section 4.1, the target’s maximum overload is
Table 7
The initial conditions of the target and interceptor.
Position (km) | Velocity (m/s) | |
Target | (66.34, 50, -55.67) | (-5362, 0, 4499) |
Interceptor | (0, 0, 0) | (872, 4860, -785) |
[figure(s) omitted; refer to PDF]
Figures 11 and 12(a) show that the two methods produce discrete (DQN) and continuous (DDPG) acceleration commands, respectively. As shown in Figure 12(d), the terminal miss distances of the DQN and DDPG guidance laws are both below the allowable miss distance, and both are less than 0.01 m. When the target's overload saturation is
In addition, when the total overload saturation of the target is less than
5. Conclusion
In this paper, we propose a DDPG-based guidance law for the guidance and control of interceptors with continuous maneuvering capability. The DDPG agent is developed using TensorFlow and optimized in the interception engagement scenario. Taking into account the effects of measurement errors and time delays in guidance and control, the effectiveness of the proposed guidance law is compared with TPN and a DQN-based RL guidance law through simulations of typical examples. The results suggest that the DDPG-based guidance law outperforms the other two in guidance performance. Future research could consider more complex interception scenarios and explore more suitable intelligent guidance methods, with potential implementation in real interception processes.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (Grant No. 12002370).
[1] D. Hong, M. Kim, S. Park, "Study on reinforcement learning-based missile guidance law," Applied Sciences, vol. 10 no. 18,DOI: 10.3390/app10186567, 2020.
[2] Y. W. Fang, T. B. Deng, W. X. Fu, "Review of intelligent guidance law," Unmanned Systems Technology, vol. 3 no. 6, pp. 36-42, 2020.
[3] Y. F. Nie, Q. J. Zhou, T. Zhang, "Research status and prospect of guidance law," Flight Mechanics, vol. 19 no. 3, 2001.
[4] P. Zarchan, Tactical and Strategic Missile Guidance, 2012.
[5] K. B. Li, W. S. Sun, L. Chen, "Performance analysis of realistic true proportional navigation against maneuvering targets using Lyapunov-like approach," Aerospace Science and Technology, vol. 69 no. 10, pp. 333-341, DOI: 10.1016/j.ast.2017.06.036, 2017.
[6] C. D. Yang, C. C. Yang, "A unified approach to proportional navigation," IEEE Transactions on Aerospace and Electronic Systems, vol. 33 no. 2, pp. 557-567, DOI: 10.1109/7.575895, 1997.
[7] K. B. Li, Z. H. Bai, H. S. Shin, A. Tsourdos, M. J. Tahk, "Capturability of 3D RTPN against true-arbitrarily maneuvering target with maneuverability limitation," Chinese Journal of Aeronautics, vol. 644, pp. 4511-4528, DOI: 10.1007/978-981-15-8155-7_374, 2022.
[8] Z. H. Bai, K. B. Li, W. S. Su, L. Chen, "Real true proportional guidance intercepts the capture area of any maneuvering target," Acta Aeronautica et Astronautica Sinica, vol. 41 no. 8, pp. 338-348, 2020.
[9] X. S. Huang, Missile Guidance and Control Systems Design, 2013.
[10] K. B. Li, L. Chen, X. Z. Bai, "Differential geometry modeling of interceptor guidance," SCIENCE CHINA Technological Sciences, vol. 41 no. 9, pp. 1205-1217, 2011.
[11] Z. X. Li, R. Zhang, "Time-varying sliding mode control of missile based on suboptimal method," Journal of Systems Engineering and Electronics, vol. 32 no. 3, pp. 700-710, DOI: 10.23919/JSEE.2021.000060, 2021.
[12] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 1998.
[13] T. R. Li, B. Yang, R. Wang, J. P. Hui, "Guidance method of reentry vehicle based on Q-learning algorithm," Tactical Missile Technology, vol. 5, pp. 44-49, 2019.
[14] Q. H. Zhang, B. Q. Ao, Q. X. Zhang, "Q-learning reinforcement learning guidance law," Systems Engineering and Electronics, vol. 42 no. 2, pp. 414-419, 2020.
[15] B. Gaudet, R. Furfaro, "Missile homing-phase guidance law design using reinforcement learning," AIAA Guidance, Navigation, and Control Conference, DOI: 10.2514/6.2012-4470, 2012.
[16] G. L. Han, Design of Terminal Guidance Guidance Law Based on Reinforcement Learning, 2019.
[17] H. Y. Li, J. Wang, S. M. He, C. H. Lee, "Nonlinear optimal impact-angle-constrained guidance with large initial heading error," Journal of Guidance, Control, and Dynamics, vol. 44 no. 9, pp. 1663-1676, DOI: 10.2514/1.G005868, 2021.
[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, "Continuous control with deep reinforcement learning," 2015. https://arxiv.org/abs/1509.02971
[19] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362 no. 6419, pp. 1140-1144, DOI: 10.1126/science.aar6404, 2018.
[20] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529 no. 7587, pp. 484-489, DOI: 10.1038/nature16961, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518 no. 7540, pp. 529-533, DOI: 10.1038/nature14236, 2015.
[22] J. Tang, Z. H. Bai, Y. G. Liang, F. Zheng, K. Li, "An exoatmospheric homing guidance law based on deep Q network," International Journal of Aerospace Engineering, vol. 2022,DOI: 10.1155/2022/1544670, 2022.
[23] S. Fujimoto, H. V. Hoof, D. Meger, "Addressing function approximation error in actor-critic methods," 2018. https://arxiv.org/abs/1802.09477
[24] S. M. He, H. S. Shin, A. Tsourdos, "Computational missile guidance: a deep reinforcement learning approach," Journal of Aerospace Information Systems, vol. 18 no. 8, pp. 571-582, DOI: 10.2514/1.I010970, 2021.
[25] X. L. Hou, H. Li, Z. Wang, Z. X. Wu, H. Wen, "Design of missile terminal guidance law based on DDPG algorithm," Tactical Missile Technology, vol. 4, pp. 110-116, 2021.
[26] Y. Liu, Z. Z. He, C. Y. Wang, M. Z. Guo, "Research on terminal guidance law design based on DDPG algorithm," Journal of Computer Science, vol. 44 no. 9, pp. 1854-1865, 2021.
[27] A. Ratnoo, D. Ghose, "Collision-geometry-based pulsed guidance law for exo-atmospheric interception," Journal of Guidance, Control, and Dynamics, vol. 32 no. 2, pp. 669-675, DOI: 10.2514/1.37863, 2009.
[28] S. Gutman, "Exo-atmospheric interception via linear quadratic optimization," Journal of Guidance, Control, and Dynamics, vol. 42 no. 3, pp. 624-631, DOI: 10.2514/1.G003093, 2019.
[29] B. Gaudet, R. Furfaro, R. Linares, "Reinforcement learning for angle-only intercept guidance of maneuvering targets," Aerospace Science and Technology, vol. 99, article 105746,DOI: 10.1016/j.ast.2020.105746, 2020.
[30] B. Gaudet, R. Furfaro, R. Linares, A. Scorsoglio, "Reinforcement meta-learning for interception of maneuvering exo-atmospheric targets with parasitic attitude loop," Journal of Spacecraft and Rockets, vol. 58 no. 2, pp. 386-399, DOI: 10.2514/1.a34841, 2021.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, "Proximal policy optimization algorithms," 2017. https://arxiv.org/abs/1707.06347
[32] H. Xiang, J. Lin, C. H. Chen, Y. Kong, "Asymptotic meta learning for cross validation of models for financial data," IEEE Intelligent Systems, vol. 35 no. 2, pp. 16-24, DOI: 10.1109/MIS.2020.2973255, 2020.
[33] C. Liang, W. Wang, Z. Liu, C. Lai, B. Zhou, "Learning to guide: guidance law based on deep meta-learning and model predictive path integral control," IEEE Access, vol. 7, pp. 47353-47365, DOI: 10.1109/ACCESS.2019.2909579, 2019.
[34] K. B. Li, H. S. Shin, A. Tsourdos, M. J. Tahk, "Performance of 3-D PPN against arbitrarily maneuvering target for homing phase," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 5, pp. 3878-3891, DOI: 10.1109/TAES.2020.2987404, 2020.
[35] K. B. Li, H. S. Shin, A. Tsourdos, "Capturability of a sliding-mode guidance law with finite-time convergence," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 3, pp. 2312-2325, DOI: 10.1109/TAES.2019.2948519, 2020.
[36] K. B. Li, H. S. Shin, A. Tsourdos, M. J. Tahk, "Capturability of 3D PPN against lower-speed maneuvering target for homing phase," IEEE Transactions on Aerospace and Electronic Systems, vol. 56 no. 1, pp. 711-722, DOI: 10.1109/TAES.2019.2938601, 2020.
[37] H. S. Shin, K. B. Li, "An improvement in three-dimensional pure proportional navigation guidance," IEEE Transactions on Aerospace and Electronic Systems, vol. 57 no. 5, pp. 3004-3014, DOI: 10.1109/TAES.2021.3067656, 2021.
[38] S. Kieninger, L. Donati, B. G. Keller, "Dynamical reweighting methods for Markov models," Current Opinion in Structural Biology, vol. 61, pp. 124-131, DOI: 10.1016/j.sbi.2019.12.018, 2020.
[39] L. Busoniu, T. De Bruin, D. Tolic, J. Kober, I. Palunko, "Reinforcement learning for control: performance, stability, and deep approximators," Annual Reviews in Control, vol. 46, DOI: 10.1016/j.arcontrol.2018.09.005, 2018.
[40] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, M. Botvinick, "Learning to reinforcement learn," 2016. https://arxiv.org/abs/1611.05763
[41] W. Y. Yang, C. J. Bai, C. Cai, "Review on sparse reward in deep reinforcement learning," Computer Science, vol. 47 no. 3, pp. 183-191, 2020.
[42] N. Fatema, S. G. Farkoush, M. H. Hasan, H. Malik, "Deterministic and probabilistic occupancy detection with a novel heuristic optimization and back-propagation (BP) based algorithm," Journal of Intelligent Fuzzy Systems, vol. 42 no. 2, pp. 779-791, DOI: 10.3233/JIFS-189748, 2022.
[43] A. Maniatopoulos, N. Mitianoudis, "Learnable leaky ReLU (LeLeLU): an alternative accuracy-optimized activation function," Information, vol. 12 no. 12,DOI: 10.3390/info12120513, 2021.
Copyright © 2023 Yangang Liang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
A novel homing guidance law against maneuvering targets based on the deep deterministic policy gradient (DDPG) is proposed. The proposed guidance law directly maps the engagement state information to the acceleration of the interceptor, forming an end-to-end guidance policy. Firstly, the kinematic model of the interception process is described as a Markov decision process (MDP) suitable for the deep reinforcement learning (DRL) algorithm. Then, the training environment, state, action, and network structure are designed. Only the measurements of line-of-sight (LOS) angles and LOS rotational rates are used as state inputs, which greatly simplifies the problem of state estimation. Considering the LOS rotational rate and zero-effort-miss (ZEM), a Gaussian reward and a terminal reward are designed to build a complete training and testing simulation environment. DDPG is used to solve the RL problem and obtain the guidance law. Finally, the performance of the proposed RL guidance law is validated through numerical simulation examples. It demonstrates improved performance compared with the classical true proportional navigation (TPN) method and an RL guidance policy based on the deep Q-network (DQN).
Details
1 College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China; Hunan Key Laboratory of Intelligent Planning and Simulation for Aerospace Mission, Changsha 410073, China
2 The 31102 Troops, Nanjing 210000, China