1. Introduction
In recent years, the pursuit-evasion (PE) problem has attracted great attention because of its wide range of applications in competitive games, optimization of IoT resources, and military operations [1,2,3,4]. However, owing to the real-time confrontation between the pursuing and evading sides, traditional unilateral control theory cannot solve the problem accurately [5]. Although existing algorithms can solve the differential game problem in many scenarios, an offline algorithm cannot respond in real time to the agents' information in a PE game with strong real-time requirements. Thus, this paper focuses on the online PE game problem and solves for the agents' policies based on the concept of adaptive dynamic programming.
The core of solving the PE game problem is to obtain the control policy of each agent on both sides of the game. Isaacs [6] introduced modern control theory into game theory and established differential game theory. Thereafter, as a branch of differential games, the PE game of agents has attracted much attention. With the continuous development of aerospace technology and the launch of man-made satellites, game problems involving continuous confrontation between two or even multiple players urgently need to be solved [7,8]. Friedman [9] proved the existence of saddle points in differential games, making it possible to optimize the strategies of all agents in the PE problem. For the control problem in linear differential game systems, [10,11] discussed control methods for cost functions of quadratic form. In contrast, [12] discussed the uniqueness of the Nash equilibrium point, so that an analytical solution can be obtained for the classical differential game problem.
However, a general system may be more complex, and it can be difficult to obtain its analytical solution. Therefore, compared with analytic methods, scholars usually prefer numerical methods for problems with more complex agents [13], such as the pursuit-evasion problem of aircraft. Zhang et al. [14] and Pontani and Conway [15] studied optimal control strategies and solution methods for two-spacecraft pursuit-evasion problems via a multiple-shooting method. Xu and Cai [16] used a genetic algorithm to find the Nash equilibrium of a game problem and obtained the controls of two aircraft. There are also applications to multi-agent pursuit-evasion systems in recent research [17,18,19,20,21]. Thus, offline methods for the PE game problem are becoming increasingly sophisticated. However, a policy obtained offline cannot deal with online emergencies, such as temporary changes in the agents' goals.
Solving PE game problems online is increasingly becoming a focus. Werbos et al. [22,23] designed actor-critic structures for implementing algorithms in real time, where the learning mechanism of the structure is composed of policy evaluation and policy improvement. Bertsekas and Tsitsiklis [24] introduced RL methods of different forms and compared the policy iteration (PI) and value iteration (VI) methods for discrete-time (DT) dynamic systems, initially applying the idea of RL to control systems. Werbos [25,26] developed an RL approach based on VI for feedback control of DT dynamic systems using value function approximation (VFA), and the VFA method has been proven suitable for finding the optimal control online in DT control problems. However, in actual scenarios, pursuit-evasion problems mostly involve the control and game of continuous systems. Vrabie [27] presented an adaptive dynamic programming (ADP) method and established PI algorithms for continuous-time (CT) control problems. Noting that complete system information might not be available, Vrabie and Lewis [28] considered systems of different forms and obtained online learning methods via optimal control with incomplete system information. The concept of adaptive dynamic programming was further extended to the field of differential games by Vrabie and Lewis [29], where a synchronous tuning algorithm was used to achieve the Nash equilibrium; however, complete system information about both sides of the game is still required. Kartal et al. [30] used the synchronous tuning algorithm in the pursuit-evasion game of a first-order system to obtain the capture conditions of the agents and reached the Nash equilibrium. Zhang et al. [31] and Li et al. [32] demonstrated the scheme's feasibility in distributed systems. However, in general differential game problems, the states of agents are usually not direct control variables, and hence the system becomes more complex. Furthermore, solving practical pursuit-evasion games in real time without full information about the game systems remains an active research area.
This paper proposes a novel ADP method for solving online the Nash equilibrium policies of two-player pursuit-evasion differential games. The min-max principle is adopted to confirm the Nash equilibrium of the game. As the agents in the game can form an Internet of Things (IoT) system, the real-time control law of each agent is obtained by taking a linear-quadratic cost function in adaptive dynamic programming. To analyze the scenario in which capture occurs, we introduce a Lyapunov function. Since most actual systems are continuous, we use the policy iteration algorithm to make the real-time policy converge to the analytical solution of the Nash equilibrium. Moreover, we employ the value function approximation method to calculate the neural network parameters without solving the Hamilton–Jacobi–Isaacs equation directly. The feasibility of the proposed method is demonstrated through simulation results from different scenarios of the pursuit-evasion game. This paper is inspired by recent research in various fields, such as motion coordination in wafer scanners [33], soil-structure interaction [34,35], driving fatigue feature detection [36], H∞ consensus for multiagent-based supply chain systems [37], and reliable and secure communications in wireless-powered NOMA systems [38]. These studies have contributed significantly to advancing real-time control and optimization methods in various applications.
The contributions of the paper are as follows:
The min-max principle is used to find the analytical solution of Nash equilibrium, and the method’s stability is proven by establishing a Lyapunov function for obtaining the capture conditions of the game.
By constructing a form of adaptive dynamic programming, the policies of agents in each cycle are obtained through the PI method, and we prove that it converges to the Nash equilibrium.
To avoid the inconvenience of solving the HJI equation, we establish a set of functions to approximate the value function. As the neural network parameters converge, the agent’s solution in policy iteration is obtained.
The rest of the paper is organized as follows. The dynamic model of the PE game is established in Section 2. We discuss the features of the Nash equilibrium in Section 3, where the capture conditions of the agents are analyzed under different parameter settings. Section 4 develops the adaptive dynamic programming method, which consists of the PI method and the VFA algorithm; the agents' policies are obtained without directly solving the Riccati equations of the PE game. Section 5 presents simulations of some practical problems. Section 6 concludes the paper and discusses the limitations of the research.
2. Formulation of the Game
Consider a system containing two agents that form a pursuer-evader pair. The pursuer tries to capture the evader, while the evader tries to escape from being captured.
The real-time pursuit-evasion game is a typical differential game problem. Here, the motion equation of each participant can be expressed as a pair of differential equations defined in a fixed coordinate system. The game with one pursuer and one evader is a typical zero-sum differential game, as the benefits of the two sides are mutually exclusive.
(1)
(2)
where the state variables and control variables of the two players appear in Equations (1) and (2). Among them, the state variables contain the state information of the players, and they may include various physical quantities representing the operation of the players, depending on the game system. To facilitate the subsequent derivations in this paper, the state variables here must contain the location information of the agents in each dimension, and the control variables contain the elements that actuate the agents in each dimension. In the PE game problem, the relative motion state of the agents is very important, so we define the state difference between the two agents:
(3)
The pursuer tries to reduce the distance between the two agents, which is embedded in the state difference, while the evader tries to enlarge it. Substituting Equations (1) and (2) into Equation (3) and differentiating with respect to time, we have:
(4)
To formulate the zero-sum PE game, we construct a performance function in integral form:
(5)
where the state-weighting matrix is non-negative definite and the two control-weighting matrices are positive definite. In the integrand, the quadratic state term measures the relative state of system (4) and is used to limit the distance between the agents, while the two control terms stand for the consumption scales of the two agents and are used to limit their controls. The value function is given as follows when the agents execute certain policies:
(6)
If both pursuer and evader employ their optimal policies along the optimal paths, then the optimal value of the game can be obtained as:
(7)
In this paper, the goal is to find the control policy of each agent. The difficulty of the work lies in finding the numerical solution for each agent's policy, in which the steps of policy iteration and the selection of an appropriate value function approximation are very important. In reinforcement learning, the policy requires several iterative steps, and in a continuous system we adopt adaptive dynamic programming to solve the agents' policies. This makes the value function of this paper different from the terminal-value performance index constructed by Jagat and Sinclair [2], because a terminal-value performance index cannot be optimized iteratively. Moreover, the performance index of this paper is selected in a quadratic structure, which pays more attention to the intermediate process of the game, improves the real-time competitiveness of both sides, and facilitates the development of the policy iteration algorithm described in Section 4. The distance between the two agents is regarded as the tracking error, which means that both pursuer and evader optimize their policies throughout the game process. This is not only in line with the actual situation but also convenient for solving the problem.
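For concreteness, the sketch below (in Python) evaluates a running cost of the quadratic form described above. The specific form x'Qx + u_p'R_p u_p − u_e'R_e u_e, the sign convention (the pursuer minimizes, the evader maximizes), and all symbol names and numerical values are illustrative assumptions, since Equations (5)–(7) are omitted from this copy.

```python
import numpy as np

def running_cost(x, u_p, u_e, Q, R_p, R_e):
    """Assumed zero-sum running cost: r = x'Qx + u_p'R_p u_p - u_e'R_e u_e.

    The pursuer minimizes the accumulated cost, the evader maximizes it."""
    x, u_p, u_e = map(np.atleast_1d, (x, u_p, u_e))
    return float(x @ Q @ x + u_p @ R_p @ u_p - u_e @ R_e @ u_e)

# Illustrative example: relative state [dx, dy, dvx, dvy], planar accelerations.
Q = np.diag([1.0, 1.0, 0.0, 0.0])            # penalize relative position only
R_p, R_e = 0.2 * np.eye(2), 0.5 * np.eye(2)  # illustrative control penalties
print(running_cost([3.0, -1.0, 0.5, 0.0], [0.1, 0.2], [0.0, 0.1], Q, R_p, R_e))
```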
3. Solution of the Pursuit-Evasion Game
In this section, we apply the minimax principle to the dynamic model of the PE game problem and obtain the analytic Nash equilibrium of the game. The cases in which capture occurs are studied and proven by adopting a Lyapunov function approach.
The PE game of agents is a kind of differential game, which is settled on the basis of bilateral optimal control theory, and the optimal policies of the agents are obtained by using the min-max principle. A differential game refers to a continuous game between a pair of players in a continuous-time system. Each agent tries to achieve its goal and maximize its benefit, and the game ends with every participant adopting its Nash equilibrium policy. Using the minimax theorem, we can ensure that the agents' policies are their corresponding optimal policies. When each agent adopts its optimal policy, the Nash equilibrium is achieved; the condition in which the optimal policies are adopted is called the saddle point.
In a two-player PE game problem, the optimal policy of the pursuer tries to minimize the Hamiltonian whereas that of the evader tries to maximize it. Therefore, there exists a pair of policies such that, when the pursuer and the evader adopt them, the game reaches the Nash equilibrium. This pair of policies is called the saddle point of the game.
Equation (6) has the same form as the Bellman equation of a zero-sum game. From Equations (1) and (2) and Leibniz's formula, we have:
(8)
where the Hamiltonian is expressed in terms of admissible control policies of the pursuer and evader and the gradient of the value function with respect to the state difference. We can obtain the optimal control of each agent from the stationarity conditions:
(9)
(10)
Additionally, the second derivatives of the Hamiltonian with respect to the two controls should satisfy:
(11)
(12)
The optimal controls of the agents are obtained as:
(13)
(14)
As the system is time-invariant over the infinite horizon, the solution of the problem is defined by Equations (13) and (14), in which the value function solves the following equation analytically:
(15)
Since the pursuit-evasion behavior between the two agents is a zero-sum game, when both agents adopt their optimal policies, called the game-theoretic saddle-point policies, the game reaches the Nash equilibrium.
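As a minimal illustration of Equations (13)–(15), whose exact expressions are omitted from this copy, the following sketch assumes linear error dynamics xdot = A x + B_p u_p − B_e u_e and a quadratic value function V(x) = x'Px. Under these assumptions the saddle-point policies reduce to linear state feedback, and P solves a game-type algebraic Riccati equation, computed here from the stable invariant subspace of the associated Hamiltonian matrix. All names and values are illustrative.

```python
import numpy as np

def solve_game_are(A, B_p, B_e, Q, R_p, R_e):
    """Stabilizing solution of the assumed game Riccati equation
    A'P + PA + Q - P (B_p R_p^-1 B_p' - B_e R_e^-1 B_e') P = 0,
    obtained from the stable invariant subspace of the Hamiltonian matrix."""
    S = B_p @ np.linalg.inv(R_p) @ B_p.T - B_e @ np.linalg.inv(R_e) @ B_e.T
    H = np.block([[A, -S], [-Q, -A.T]])
    w, V = np.linalg.eig(H)
    n = A.shape[0]
    stable = V[:, w.real < 0]          # assumes exactly n stable eigenvalues
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    return 0.5 * (P + P.T)

def saddle_point_gains(P, B_p, B_e, R_p, R_e):
    """Assumed saddle-point feedback u_p* = -K_p x and u_e* = -K_e x
    for the error dynamics xdot = A x + B_p u_p - B_e u_e."""
    K_p = np.linalg.inv(R_p) @ B_p.T @ P
    K_e = np.linalg.inv(R_e) @ B_e.T @ P
    return K_p, K_e

# Scalar example: single-integrator agents, the pursuer with the cheaper control.
A, B = np.array([[0.0]]), np.array([[1.0]])
P = solve_game_are(A, B, B, np.array([[1.0]]), np.array([[0.2]]), np.array([[0.5]]))
K_p, K_e = saddle_point_gains(P, B, B, np.array([[0.2]]), np.array([[0.5]]))
```

The sketch presumes that a stabilizing solution exists, which corresponds to the capture condition discussed later in this section.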
Before proving that the game can achieve the Nash equilibrium as per policies Equations (13) and (14), we need to use the properties of the Hamiltonian function of the system, which is demonstrated in Lemma 1.
Suppose the value function satisfies the HJI Equation (15), which makes the Hamiltonian equal to zero. Then, (8) transforms to:
(16)
Suppose the value function satisfies the HJI Equation (15), which makes the Hamiltonian equal to zero. Then, (8) transforms to:
(17)
If the value function reaches the optimal value, we have:
(18)
According to the HJI Equation (15), the Hamiltonian equals zero when the value function reaches its optimal value, and the proof is completed. □
The Hamiltonian can be transformed as demonstrated in Lemma 1 to support the proof of the Nash equilibrium given in the following theorem.
Consider the dynamics of the agents in Equations (1) and (2) with the value function (6), and let the value function be a positive definite solution of the HJI Equation (15). Then, the controls in Equations (13) and (14) are the Nash equilibrium policies of the agents, and the corresponding value is the optimal value of the PE game.
To prove that the controls in (13) and (14) form the Nash equilibrium solution, we have to confirm that the value function is maximized when the evader executes its policy in (13) and minimized when the pursuer executes its policy in (14), which can be expressed as:
(19)
(20)
Moreover, letting the value function be evaluated when the pursuer and the evader execute an arbitrary pair of admissible policies, we can turn Equations (19) and (20) into the inequalities:
(21)
(22)
where the value function is the solution of the Hamiltonian Equation (16). Let the value function start from its initial state, and assume that the capture happens within a finite period, so that the terminal value vanishes. To verify inequalities (21) and (22), we add this term into Equation (8) and have:

(23)
From Equation (23), the required relation obviously holds. Upon using Lemma 1, (23) becomes:
(24)
Consider the integral term in Equation (24); to prove (21) and (22), we only need to verify its sign under each unilateral deviation. Using (24), we get:
(25)
(26)
which accomplishes the proof. □

It can be seen from Theorem 2 that the value function does not continue to decrease when it reaches the Nash equilibrium, regardless of how the pursuer unilaterally changes its policy. Similarly, no matter how the evader unilaterally changes its policy, the value function will not continue to increase. When the game reaches the game-theoretic saddle point, if one agent changes its policy unilaterally, which is contrary to its own benefit, the other agent will reap the benefit of the change. At the Nash equilibrium, if the pursuer unilaterally alters its strategy, it becomes harder to capture the evader; conversely, if the evader unilaterally changes its policy, it becomes easier for the pursuer to realize the capture.
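The saddle-point inequalities (21) and (22) can also be probed numerically. The sketch below, a finite-horizon Euler approximation under the same illustrative linear-quadratic assumptions as above, accumulates the zero-sum cost along a rollout; detuning one player's gain away from the saddle-point pair should not improve that player's outcome.

```python
import numpy as np

def rollout_cost(A, B_p, B_e, K_p, K_e, Q, R_p, R_e, x0, horizon=20.0, dt=1e-3):
    """Euler rollout of the error dynamics under linear feedback policies,
    returning the accumulated zero-sum cost (a finite-horizon approximation)."""
    x, J = np.asarray(x0, dtype=float), 0.0
    for _ in range(int(horizon / dt)):
        u_p, u_e = -K_p @ x, -K_e @ x
        J += (x @ Q @ x + u_p @ R_p @ u_p - u_e @ R_e @ u_e) * dt
        x = x + (A @ x + B_p @ u_p - B_e @ u_e) * dt
    return J

# Probing the saddle point: detuning the pursuer's gain should not lower the
# cost it is trying to minimize, e.g.
#   rollout_cost(A, B_p, B_e, 1.2 * K_p, K_e, Q, R_p, R_e, x0) \
#       >= rollout_cost(A, B_p, B_e, K_p, K_e, Q, R_p, R_e, x0)
# and, symmetrically, detuning the evader's gain should not raise it.
```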
In the PE game problem, it is important to determine whether the pursuer can capture the evader. If so, the problem becomes a finite-time game; such issues are common in the interception field. We now derive the conditions that lead to capture in the game.
The following theorem gives the necessary condition for the occurrence of the capture.
Let the pursuer and evader obey the dynamic models in Equations (1) and (2), and let Equations (13) and (14) be the controls of the agents in the game, in which the value function is the analytical solution of the HJI Equation (15). Then, the capture scenario happens only if the dynamics in (6) are asymptotically stable.
Because the value function solves the HJI Equation (15) analytically, it is obviously positive definite. Select the value function as a candidate Lyapunov function. Its derivative is given by:
(27)
As we can see, the derivative of the value function can be negative when the weighting matrices satisfy the stated condition. That means that if the system dynamics (4) are stabilizable and observable and this condition holds, then the dynamics in (6) are asymptotically stable and the capture occurs. On the other hand, if the condition is violated, the Lyapunov stability condition fails, and the states of the PE game (4) are likely to diverge. This will cause the distance between the two agents to grow, making capture impossible; at this point, the pursuer cannot capture the evader. □
It can be predicted that when the error dynamics are stable, the distance between the two agents will approach 0 as time goes to infinity. Conversely, if the stability condition does not hold, the pursuer probably cannot capture the evader. If the capture takes place, since the distance between the two agents is embedded in the state variables, changing the positive state-weighting matrix will change the capture time and pattern of the PE game. In the value function (6), the quadratic control terms stand for the control energy consumption of the two players. For the pursuer and evader, the control-weighting matrices represent dynamic constraints on their control or performance [13], which are known as control penalties. In this way, a smaller pursuer penalty or a larger evader penalty tends to facilitate the capture scenario.
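Under the same illustrative linear-quadratic assumptions, the capture condition of the theorem admits a simple numerical check: substitute the saddle-point gains into the error dynamics and test whether the closed-loop matrix is Hurwitz, which corresponds to the Lyapunov derivative being negative along nonzero states. The sketch below is not the paper's general condition.

```python
import numpy as np

def capture_predicted(A, B_p, B_e, K_p, K_e):
    """Capture is predicted when the closed-loop error dynamics
    xdot = (A - B_p K_p + B_e K_e) x are asymptotically stable,
    i.e., when every closed-loop eigenvalue has a negative real part."""
    A_cl = A - B_p @ K_p + B_e @ K_e
    return bool(np.all(np.linalg.eigvals(A_cl).real < 0.0))
```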
In [39,40], a non-quadratic Lyapunov function is proposed to verify system convergence, and the tracking performance is better than that of a quadratic Lyapunov function. However, the model construction of the pursuit-evasion game in this paper focuses more on the physical meaning of the objects: the quadratic form of the state variables represents the relative error, including the relative distance and relative speed, and the quadratic form of the control represents the power of the object after integration. For other systems with relative degree 1, a non-quadratic Lyapunov function can be used to improve convergence efficiency and make the tracking error approach 0 faster.
4. Numerical Solution of the PE Game by the ADP Method
The expressions of the policies that let the game reach the Nash equilibrium were obtained in Section 3. To obtain the policies by numerical means, previous researchers [15] introduced genetic algorithms to calculate the controls of each player in the game offline. However, an offline strategy cannot track temporary changes in the states of the agents in an online game with continuous-time dynamics. Therefore, to solve the policies of the agents online and efficiently, the ADP method is used in this section. As it is hard to obtain the gradient terms of the value function, we introduce the policy iteration method to form an iteration, and the value function is fitted in the process of VFA.
4.1. Policy Iteration
Since the value function is an integral, the whole time period is segmented, following the concept of ADP, to realize the PI method.
For convenience, the value function of the PE game is simplified as:
(28)
for all admissible states. Let T denote the length of an interval; then, Equation (6) is expanded as:
(29)
It should be noted that T is neither a state nor a control of the game; it is usually regarded as a hyperparameter of ADP, and its variation may affect the performance and efficiency of the method. Here, we separate the entire period into segments of length T and consider the i-th interval of the PE game, in which the two agents execute their current pair of policies. Then, we have:
(30)
From (30), we can obtain the controls that the pursuer and evader should adopt in the next interval:
(31)
From (31), the policies are obtained based only on the states and controls of the two agents; the system matrices in Equation (4) are absent from the illustrated process. In actual scenarios, there may be unknown parameters in the modeling of various agents. Therefore, the method can be applied to solving online games efficiently.
Equations (30) and (31) formulate a cycle of the PI method. As the iterations execute, the PE game will converge to the Nash equilibrium gradually. We will show the feasibility of this property in Theorem 3.
For a PE game, let the initial controls of the pursuer and evader be admissible. Then, the iterated value function and the controls of the two agents converge to the optimal value function and the Nash equilibrium controls, respectively. The game reaches the Nash equilibrium as the controls converge.
Let the value function at iteration i be that obtained when the pursuer and evader execute their i-th policies, with i as the iteration counter. Subtracting consecutive value functions by using (24) as the agents execute the corresponding policies, we have:
(32)
which indicates that one sequence of iterated value functions decreases monotonically while the other increases monotonically. As the value function (6) is unique for any adopted controls, according to Dini's theorem the value functions converge uniformly. Since the value function is continuous and differentiable, it converges to the optimal value as the iteration proceeds. Thus, the policies of both players converge to the Nash equilibrium policies as the value function converges, which completes the proof. □
For continuous-time PE games, the whole period is divided into intervals by the PI method. As the iteration proceeds, the value function converges to the Nash equilibrium value, and the control policies of the agents converge to the saddle point. In particular, the method remains applicable to time-varying PE game problems: if the system matrices vary to other admissible values and the new system is still stable, the iteration converges to a new saddle point, which is the Nash equilibrium of the new problem.
The PI method does not need to know the system matrix to obtain the policies of the two players, which indicates that the method still converges to the Nash equilibrium for systems with incomplete model information. However, the states of the agents and their controls need to be known at each step.
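For reference, the sketch below shows a model-based analogue of the PI cycle for the linear-quadratic case: policy evaluation solves a Lyapunov equation for the value matrix of the current policy pair, and policy improvement updates both feedback gains from its gradient. Unlike the online method of this section, the sketch uses the system matrices explicitly; it only illustrates the evaluation-improvement structure whose convergence Theorem 3 addresses, and the initial gains are assumed admissible.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lq_game_policy_iteration(A, B_p, B_e, Q, R_p, R_e, K_p, K_e, n_iter=30):
    """Model-based analogue of the PI loop for a linear-quadratic PE game.

    Policy evaluation: solve a Lyapunov equation for the value matrix of the
    current (assumed admissible) policy pair.  Policy improvement: update both
    feedback gains from the gradient of the evaluated value function."""
    Rp_inv, Re_inv = np.linalg.inv(R_p), np.linalg.inv(R_e)
    for _ in range(n_iter):
        A_cl = A - B_p @ K_p + B_e @ K_e
        C = Q + K_p.T @ R_p @ K_p - K_e.T @ R_e @ K_e
        P = solve_continuous_lyapunov(A_cl.T, -C)   # A_cl'P + P A_cl + C = 0
        K_p = Rp_inv @ B_p.T @ P                    # pursuer improvement
        K_e = Re_inv @ B_e.T @ P                    # evader improvement
    return P, K_p, K_e
```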
4.2. Value Function Approximation
For most PE games, the HJI equation is difficult to solve analytically, or it might not have any analytical solution. Therefore, we use an approximation process to obtain the solution of the HJI equation. The method focuses on approximating the value function and is named the VFA algorithm.
Assume that a linear combination of a set of linearly independent basis functions can approximate the value function, expressed as:
(33)
where the number of retained basis functions determines the dimension of the basis-function vector, and the vector of neural network parameters to be determined collects the corresponding weights. Using the above value function approximation (VFA) for the cost function, the HJI equation of policy iteration in Algorithm 1 can be expressed as:
(34)
Since the weight parameters are unknown, their initial values are guessed at the beginning of the iteration process, which produces a residual error before the neural network parameters converge to the optimal values. From (34), the residual error is expressed as:
(35)
The above residual error is a temporal-difference residual. To find the neural network parameter vector that approximates the cost function, the least-squares method is used at every iteration; hence, the weight parameters are adapted to minimize:
(36)
The quadratic integral residual reaches its minimum when its partial derivative with respect to the weight vector becomes zero:
(37)
Substituting (35) into (37) and assuming that the resulting matrix is invertible, we obtain:
(38)
where the two stacked terms are assembled from the collected data. Although the classical ADP method can also solve such game problems, in the process of policy iteration the basis-function signal composed of state variables is required to satisfy a persistence of excitation condition. Thus, there must exist a positive constant such that the following condition holds:
(39)
Equation (39) is the general definition of persistence of excitation for input signals, where the signal should be a continuously differentiable bounded function.
However, as the number of iterations increases, even though each basis function in the basis vector is required to be linearly independent, the basis functions may approach 0 if the state input becomes very small. In addition, the basis functions tend to weaken as the states converge and ultimately cannot provide effective excitation. In this case, it is necessary to introduce an additional excitation signal to sustain the iteration and prevent the divergence of the states. The excitation signal should satisfy Equation (40):
(40)
In this way, the least-squares update remains computable as the states converge, making the updates of the weight parameters more reliable.
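A discretized version of the persistence of excitation requirement in (39) and an added probing signal in the spirit of (40) can be sketched as follows; the multi-sine probe and all thresholds are assumptions for illustration, since the paper's excitation function is not shown in this copy.

```python
import numpy as np

def persistently_exciting(phi_samples, dt, beta):
    """Discretized check of (39): the Gram matrix of the basis signal over a
    window must dominate beta times the identity."""
    gram = sum(np.outer(p, p) for p in phi_samples) * dt
    return bool(np.linalg.eigvalsh(gram).min() >= beta)

def excited_control(u_nominal, t, amp=0.1):
    """Nominal control plus a small multi-sine probing term, one assumed way
    of keeping the regressor excited once the states become small."""
    probe = amp * np.array([np.sin(3.1 * t) + np.sin(7.3 * t),
                            np.cos(2.3 * t) + np.sin(5.7 * t)])
    return np.asarray(u_nominal) + probe
```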
We summarize the policy iteration method by using VFA in Algorithm 1.
Algorithm 1 Policy Iteration Algorithm Using Value Function Approximation
Step 1: Select admissible initial policies for the pursuer and the evader, respectively.
Step 2: Approximate the value function in each interval using a set of admissible basis functions.
Algorithm 1 introduces VFA based on the PI method, which adopts a numerical way of solving the HJI equation. The neural network parameters are approximated at every iteration step using the batch least-squares method, and the policies of the players are obtained once the weight parameters are determined. According to Theorem 3, if the basis functions are competent to approximate the value function, then the weight parameters converge as the game reaches the Nash equilibrium.
The VFA algorithm is embedded in the PI method when solving for the agents' policies. In the batch least-squares algorithm, the number of data points collected in every iteration step has a minimum value, which should be no less than the number of neural network parameters. Otherwise, the solution of the PE game may diverge and the value function cannot be approximated, resulting in fatal errors in the VFA algorithm [27].
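A minimal sketch of the batch least-squares step in (35)–(38) is given below, under the assumption that each data tuple collects the basis vector at the start of an interval, the basis vector at its end, and the measured integral of the running cost over that interval; it also enforces the requirement that the number of intervals be at least the number of weights.

```python
import numpy as np

def vfa_batch_update(phi_start, phi_end, cost_integrals):
    """Batch least-squares update of the VFA weights (a sketch of (35)-(38)).

    Each interval contributes one temporal-difference equation
        W' (phi(x(t)) - phi(x(t+T))) = measured integral cost over [t, t+T],
    and the weights minimize the summed squared residuals."""
    X = np.asarray(phi_start) - np.asarray(phi_end)   # one row per interval
    y = np.asarray(cost_integrals)
    if X.shape[0] < X.shape[1]:
        raise ValueError("need at least as many intervals as weight parameters")
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W
```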
5. Numerical Simulation
In this section, the pursuit-evasion game is simulated numerically. Based on the general motion model, the pursuit-evasion problem of a second-order system is studied, in which the accelerations of both players along all dimensions are taken as the controls. The positions and velocities of the agents are monitored online as the state variables.
Consider the PE game problem in a two-dimensional space, whose dynamic model is:
(41)
(42)
where the state variables of the pursuer are its coordinates and velocities in the x and y directions, and those of the evader are defined analogously. As for the controls, the acceleration pairs of the two agents stand for their respective policies. Here, we subtract model (41) from (42) and obtain the difference system (43), whose state variables are the relative positions and relative velocities in the x and y directions. The complete difference model is:
(43)
The distance between the two agents determines the capture condition of the PE game problem and is given as follows:
(44)
To determine whether the pursuer can catch the evader, a capture radius is set. When the distance between the two agents falls below this radius, an effective capture occurs, which terminates the PE game.
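The capture test of (44) then reduces to a one-line check on the relative position components; the state ordering used below is an assumption.

```python
import numpy as np

def captured(x, r_c):
    """Effective capture: the relative position (first two state components,
    by assumption) lies within the capture radius r_c."""
    return bool(np.hypot(x[0], x[1]) <= r_c)
```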
In this process, the velocities of the agents are unconstrained and do not affect the agents' benefits, so the corresponding entries of the state-weighting matrix in the value function (6) can be set to zero.
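The following sketch assembles the relative double-integrator model suggested by the description of (41)–(43), with the relative positions and velocities as states and the planar accelerations as controls, using the convention xdot = A x + B_p u_p − B_e u_e; the weighting values are illustrative, with zero weights on the velocity components as stated above.

```python
import numpy as np

# Assumed relative double-integrator model for (41)-(43):
# state x = [dx, dy, dvx, dvy], controls = planar accelerations of each agent,
# with error dynamics xdot = A x + B_p u_p - B_e u_e.
A = np.block([[np.zeros((2, 2)), np.eye(2)],
              [np.zeros((2, 2)), np.zeros((2, 2))]])
B_p = np.vstack([np.zeros((2, 2)), np.eye(2)])   # pursuer acceleration channel
B_e = B_p.copy()                                 # evader acceleration channel

# Zero weights on the relative velocities, as discussed above; the numerical
# values of all weights are illustrative only.
Q = np.diag([1.0, 1.0, 0.0, 0.0])
R_p, R_e = 0.2 * np.eye(2), 0.5 * np.eye(2)
# The analytical Nash gains used for comparison could then be obtained with
# solve_game_are / saddle_point_gains from the Section 3 sketch.
```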
Generally, the basis functions in the VFA algorithm are made up of the Kronecker product of the state vector with itself, i.e., quadratic polynomial terms. However, for game problems with more states, this definition makes the calculation inefficient. To improve the operational efficiency and obtain the policies of the agents, we construct a single-layer neural network as follows:
(45)
Then, the parameters are updated online through the algorithm introduced in Section 4. The initial values of the parameters are selected arbitrarily. As the input changes in real time, the residual error can be calculated according to Equation (35). Then, by minimizing the quadratic integral residual, we obtain the updated parameter vector from Equation (38).
To keep the basis-function vector persistently excited, an excitation function is defined, and its selection is set to satisfy Equation (40).
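A plausible form of the quadratic basis in (45) for the four-dimensional relative state is the vector of its ten distinct second-order monomials; the sketch below is a guess at that structure, since the expression in (45) is omitted from this copy.

```python
import numpy as np

def phi(x):
    """Quadratic basis for the 4-dimensional relative state: the 10 distinct
    monomials x_i * x_j (an assumed structure for the network in (45))."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.size)
    return x[i] * x[j]

def value_estimate(W, x):
    """Approximate value V(x) ~ W' phi(x)."""
    return float(np.asarray(W) @ phi(x))
```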
Other initial states are set in the following simulation as , , , , .
Set the capture radius and begin to simulate the PE game problem. The locations of the agents vary as shown in Figure 1.
In this problem, the policies of the agents are their accelerations. After the game starts, even if the pursuer and the evader initially move in different directions with respect to each other, the pursuer can still adjust its policy quickly and accelerate toward the evader. Meanwhile, the evader adjusts its policy in time to escape from being caught. However, the evader is still captured because of the more stringent constraints on its control effort. The capture occurs at , and the coordinates of the pursuer and evader are and , respectively.
The distance between the two agents is shown in Figure 2.
As the iterations continue, the parameters of the neural network converge as the game reaches the Nash equilibrium, which indicates that the policies of both agents have converged to their optimized values. The distance between the two agents when the capture scenario happens is within the capture radius. In this simulation, the policies of both agents are updated at the end of each iteration cycle, and the interval of the PI method is set as T = 0.05 s. It can be seen in Figure 2 that the policy obtained by the PI method almost converges to the analytical solution of the Nash equilibrium, and the capture obtained by the PI method occurs essentially at the Nash equilibrium.
Compared with the online PSO algorithm, the policies obtained by the ADP algorithm are closer to the analytical Nash equilibrium, reflecting better convergence performance. This advantage becomes more evident as the number of iterations increases over time.
According to Algorithm 1, the initial values of the network parameters are given arbitrarily. During the whole PE game, six complete cycles of the PI method are executed. The neural network parameters of the VFA algorithm converge gradually, stabilizing to a set of fixed values from the third cycle until termination. Moreover, to verify that the converged neural network parameters are not locally optimal, we compare the analytical values with our solution in Figure 3. Each parameter converges to its analytical solution, which reflects the stability of the algorithm in solving the problem.
As mentioned above, the control-weighting matrix is a soft constraint on the agents, determined by their actual structural parameters, and different values may lead to different outcomes of the PE game. Now we change the value of this weighting to repeat the simulation above, keeping the initial states unchanged; its values are taken as 0.3, 0.5, and 0.8. Then, using Algorithm 1, the obtained distance between the two agents in the PE game is shown in Figure 4.
When the pursuer's weighting is less than the evader's (so that the PE game system remains stable), the closer the two values, the longer the time required for the pursuer to catch the evader. Therefore, the control-weighting matrix can be regarded as a motion performance limitation index of an agent: the smaller its value, the better the agent's motion performance and the wider the range of its admissible control.
Moreover, the interval of the PI method at each step can influence the performance of the obtained solution, and the available computing power plays an important role in setting an appropriate interval. Here, we set different PI intervals, including 0.025 s, to recompute the game problem. Keeping the other initial states and parameters unchanged, we recompute the PE game and obtain the distance between the two agents, shown in Figure 5; the corresponding parameters are shown in Figure 6.
It can be seen from Figure 6 that a shorter PI interval can improve the efficiency of convergence to the Nash equilibrium. However, a shorter interval leads to more iteration cycles before the game terminates, which indicates that the calculation cost grows as the number of iteration cycles increases. Figure 7 shows that the neural network parameters converge to the analytical solution corresponding to the Nash equilibrium for both intervals. The parameters occasionally deviate from the analytical solution in the simulation with the smaller interval, which means that too small an iteration interval may cause the parameters to diverge. It is therefore beneficial to select a moderate iteration interval according to the conditions of the agents.
Now we consider the case in which the mobility of the pursuer is less than that of the evader, i.e., there is no capture during the game. In this case, the solution of the game problem does not meet the system stability condition, so the policies of both agents, i.e., the scale of their controls, may diverge. The control weightings are set accordingly, and we impose hard constraints on the controls of both sides. The motion trajectories of both sides of the game are shown in Figure 8.
Figure 8 shows that the distance between the two agents increases further as time elapses. Therefore, as proven in Theorem 2, when the stability condition is violated, the capture may not occur. At this point, the state variables of both agents diverge, and the pursuer cannot catch the evader.
6. Conclusions
In this paper, the solution of a two-player pursuit-evasion game is discussed. Through the minimax principle, the analytic solution of the Nash equilibrium is obtained, and the necessary condition for the occurrence of capture is discussed. The PI method is adopted to solve the PE game online, and the VFA algorithm is adopted to avoid possible inconvenience in dealing with the HJI equation. There is no need to know the system matrix to obtain the policies, and the game approaches the analytical Nash equilibrium solution, which is verified in the simulations.
In the future, we will study more complex PE game problems with more agents. The case when coupling exists between state variables or control variables also deserves further study.
Conceptualization, B.H.; Methodology, Z.G.; Software, X.Z.; Investigation, G.L.; Resources, B.H.; Writing—original draft, Z.G.; Visualization, G.L.; Supervision, X.Z. All authors have read and agreed to the published version of the manuscript.
Not applicable.
The authors declare no conflict of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
1. Gong, H.; Gong, S.; Li, J. Pursuit–evasion game for satellites based on continuous thrust reachable domain. IEEE Trans. Aerosp. Electron. Syst.; 2020; 56, pp. 4626-4637. [DOI: https://dx.doi.org/10.1109/TAES.2020.2998197]
2. Jagat, A.; Sinclair, A.J. Nonlinear control for spacecraft pursuit-evasion game using the state-dependent riccati equation method. IEEE Trans. Aerosp. Electron. Syst.; 2017; 53, pp. 3032-3042. [DOI: https://dx.doi.org/10.1109/TAES.2017.2725498]
3. Zhang, K.; Wang, P.; Zhou, J. Distributed nash equilibrium seeking for noncooperative games in nonlinear multi-agent systems: An event-triggered neuro-adaptive approach. Asian J. Control; 2021; 24, pp. 605-613. [DOI: https://dx.doi.org/10.1002/asjc.2527]
4. Li, C.; Li, S.; Zhang, A.; He, Q.; Liao, Z.; Hu, J. Meta-learning for few-shot bearing fault diagnosis under complex working conditions. Neurocomputing; 2021; 439, pp. 197-211. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.01.099]
5. Ye, D.; Shi, M.; Sun, Z. Satellite proximate pursuit-evasion game with different thrust configurations. Aerosp. Sci. Technol.; 2020; 99, 105715. [DOI: https://dx.doi.org/10.1016/j.ast.2020.105715]
6. Isaacs, R. Games of Pursuit; Rand: Santa Monica, CA, USA, 1951.
7. Shima, T. Optimal cooperative pursuit and evasion strategies against a homing missile. J. Guid. Control Dyn.; 2011; 34, pp. 414-425. [DOI: https://dx.doi.org/10.2514/1.51765]
8. Turetsky, V.; Glizer, V.Y. Open-loop solution of a defender–attacker–target game: Penalty function approach. J. Control Decis.; 2019; 6, pp. 166-190. [DOI: https://dx.doi.org/10.1080/23307706.2018.1453311]
9. Friedman, A. Differential Games; Courier Corporation: Phoenix, AZ, USA, 2013.
10. Lukes, D.L. Equilibrium feedback control in linear games with quadratic costs. SIAM J. Control Optim.; 1971; 9, pp. 234-252. [DOI: https://dx.doi.org/10.1137/0309018]
11. Lopez, V.G.; Lewis, F.L.; Wan, Y.; Sanchez, E.N.; Fan, L. Solutions for multiagent pursuit-evasion games on communication graphs: Finite-time capture and asymptotic behaviors. IEEE Trans. Autom. Control; 2019; 65, pp. 1911-1923. [DOI: https://dx.doi.org/10.1109/TAC.2019.2926554]
12. Yu, Z. Linear–quadratic optimal control and nonzero-sum differential game of forward–backward stochastic system. Asian J. Control; 2011; 14, pp. 173-185. [DOI: https://dx.doi.org/10.1002/asjc.406]
13. Faruqi, F.A. Differential Game Theory with Applications to Missiles and Autonomous Systems Guidance; John Wiley & Sons: Hoboken, NJ, USA, 2017; [DOI: https://dx.doi.org/10.1002/9781119168485]
14. Zhang, Q.; Sun, S.; Chen, Y. Strategy and numerical solution of pursuit-evasion with fixed duration for two spacecraft. J. Astronaut.; 2014; 35, pp. 537-544.
15. Pontani, M.; Conway, B.A. Optimal interception of evasive missile warheads: Numerical solution of the differential game. J. Guid. Control Dyn.; 2008; 31, pp. 1111-1122. [DOI: https://dx.doi.org/10.2514/1.30893]
16. Xu, X.; Cai, Y. Design and numerical simulation of a differential game guidance law. Proceedings of the 2016 IEEE International Conference on Information and Automation (ICIA); Ningbo, China, 1–3 August 2016; pp. 314-318.
17. Talebi, S.P.; Werner, S. Distributed kalman filtering and control through embedded average consensus information fusion. IEEE Trans. Autom. Control; 2019; 64, pp. 4396-4403. [DOI: https://dx.doi.org/10.1109/TAC.2019.2897887]
18. Zhang, Z.; Yan, W.; Li, H. Distributed optimal control for linear multiagent systems on general digraphs. IEEE Trans. Autom. Control; 2020; 66, pp. 322-328. [DOI: https://dx.doi.org/10.1109/TAC.2020.2974424]
19. Li, C.; Li, S.; Zhang, A.; Yang, L.; Zio, E.; Pecht, M.; Gryllias, K. A Siamese hybrid neural network framework for few-shot fault diagnosis of fixed-wing unmanned aerial vehicles. J. Comput. Des. Eng.; 2022; 9, pp. 1511-1524. [DOI: https://dx.doi.org/10.1093/jcde/qwac070]
20. Zhou, J.; Wu, X.; Lv, Y.; Li, X.; Liu, Z. Recent progress on the study of multi-vehicle coordination in cooperative attack and defense: An overview. Asian J. Control; 2021; 24, pp. 794-809. [DOI: https://dx.doi.org/10.1002/asjc.2685]
21. Li, Z.; Zhu, H.; Luo, Y. An escape strategy in orbital pursuit-evasion games with incomplete information. Sci. China Technol. Sci.; 2021; 64, pp. 559-570. [DOI: https://dx.doi.org/10.1007/s11431-020-1662-0]
22. Werbos, P.J.; Miller, W.; Sutton, R. A menu of designs for reinforcement learning over time. Neural Networks for Control; MIT Press: Cambridge, MA, USA, 1990; Volume 3, pp. 67-95.
23. Werbos, P.J. Intelligence in the brain: A theory of how it works and how to build it. Neural Netw.; 2009; 22, pp. 200-212. [DOI: https://dx.doi.org/10.1016/j.neunet.2009.03.012]
24. Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-dynamic programming: An overview. Proceedings of the 1995 34th IEEE Conference on Decision and Control; New Orleans, LA, USA, 13–15 December 1995; Volume 1, pp. 560-564.
25. Werbos, P. Approximate dynamic programming for realtime control and neural modelling. Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493-525.
26. Werbos, P.J. Neural networks for control and system identification. Proceedings of the 28th IEEE Conference on Decision and Control; Tampa, FL, USA, 13–15 December 1989; pp. 260-265.
27. Vrabie, D. Online Adaptive Optimal Control for Continuous-Time Systems. 2010; Available online: https://rc.library.uta.edu/uta-ir/handle/10106/2083 (accessed on 3 April 2023).
28. Vrabie, D.; Lewis, F. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw.; 2009; 22, pp. 237-246. [DOI: https://dx.doi.org/10.1016/j.neunet.2009.03.008]
29. Vrabie, D.; Lewis, F. Adaptive dynamic programming algorithm for finding online the equilibrium solution of the two-player zero-sum differential game. Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN); Barcelona, Spain, 18–23 July 2010; pp. 1-8. [DOI: https://dx.doi.org/10.1109/ijcnn.2010.5596754]
30. Kartal, Y.; Subbarao, K.; Dogan, A.; Lewis, F. Optimal game theoretic solution of the pursuit-evasion intercept problem using on-policy reinforcement learning. Int. J. Robust Nonlinear Control; 2021; 31, pp. 7886-7903. [DOI: https://dx.doi.org/10.1002/rnc.5719]
31. Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; Basar, T. Fully decentralized multi-agent reinforcement learning with networked agents. Proceedings of the International Conference on Machine Learning, PMLR; Stockholm, Sweden, 10–15 July 2018; pp. 5872-5881.
32. Li, Y.; Tang, Y.; Zhang, R.; Li, N. Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach. IEEE Trans. Autom. Control; 2021; 67, pp. 6429-6444. [DOI: https://dx.doi.org/10.1109/TAC.2021.3128592]
33. Song, F.; Liu, Y.; Shen, D.; Li, L.; Tan, J. Learning Control for Motion Coordination in Wafer Scanners: Toward Gain Adaptation. IEEE Trans. Ind. Electron.; 2022; 69, pp. 13428-13438. [DOI: https://dx.doi.org/10.1109/TIE.2022.3142428]
34. Li, J.; Chen, M.; Li, Z. Improved soil–structure interaction model considering time-lag effect. Comput. Geotech.; 2022; 148, 104835. [DOI: https://dx.doi.org/10.1016/j.compgeo.2022.104835]
35. Hong, Y.; Yao, M.; Wang, L. A multi-axial bounding surface p-y model with application in analyzing pile responses under multi-directional lateral cycling. Comput. Geotech.; 2023; 157, 105301. [DOI: https://dx.doi.org/10.1016/j.compgeo.2023.105301]
36. Wang, F.; Wang, H.; Zhou, X.; Fu, R. A Driving Fatigue Feature Detection Method Based on Multifractal Theory. IEEE Sens. J.; 2022; 22, pp. 19046-19059. [DOI: https://dx.doi.org/10.1109/JSEN.2022.3201015]
37. Li, Q.-K.; Lin, H.; Tan, X.; Du, S. H∞ Consensus for Multiagent-Based Supply Chain Systems Under Switching Topology and Uncertain Demands. IEEE Trans. Syst. Man Cybern. Syst.; 2018; 50, pp. 4905-4918. [DOI: https://dx.doi.org/10.1109/TSMC.2018.2884510]
38. Cao, K.; Wang, B.; Ding, H.; Lv, L.; Tian, J.; Hu, H.; Gong, F. Achieving Reliable and Secure Communications in Wireless-Powered NOMA Systems. IEEE Trans. Veh. Technol.; 2021; 70, pp. 1978-1983. [DOI: https://dx.doi.org/10.1109/TVT.2021.3053093]
39. Hosseinzadeh, M.; Yazdanpanah, M.J. Performance enhanced model reference adaptive control through switching non-quadratic Lyapunov functions. Syst. Control Lett.; 2015; 76, pp. 47-55. [DOI: https://dx.doi.org/10.1016/j.sysconle.2014.12.001]
40. Tao, G. Model reference adaptive control with L1+α tracking. Int. J. Control; 1996; 64, pp. 859-870. [DOI: https://dx.doi.org/10.1080/00207179608921661]
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The paper studies a novel method for real-time solutions of the two-player pursuit-evasion game. The min-max principle is adopted to confirm the Nash equilibrium of the game. As agents in the game can form an Internet of Things (IoT) system, the real-time control law of each agent is obtained by taking a linear-quadratic cost function in adaptive dynamic programming. By introducing the Lyapunov function, we consider the scenario when capture occurs. Since most actual systems are continuous, the policy iteration algorithm is used to make the real-time policy converge to the analytical solution of the Nash equilibrium. Furthermore, we employ the value function approximation method to calculate the neural network parameters without directly solving the Hamilton–Jacobi–Isaacs equation. Simulation results depict the method’s feasibility in different scenarios of the pursuit-evasion game.