A Deep Reinforcement Learning-Based

Full text

Turn on search term navigation

1. Introduction

The Autonomous marine vehicle (AMV) is a marine intelligent platform that performs tasks autonomously or semi-autonomously [1], which is widely applied in military and civilian fields due to its small volume, strong concealment, good flexibility and other advantages [2]. In different missions, path following of the AMV plays a crucial role for realized autonomous operation. Considering that, in practice, the AMV inevitably suffers from marine environment disturbances, the path-following control method with high precision and efficiency is crucial to the success of an operation, where a parameterized path is expected to be tracked as accurately as possible [3].

In generally, the path-following control of an AMV consists of two critical parts: kinematics guidance and dynamics control. In the part of guidance research, by calculating the desired heading angle, path-following errors can converge to zero, and in the part of control research, control inputs including surge force and yaw torque are solved using the desired guidance signals, thus contributing to the path following performance [4]. For the former, the light-of-sight (LOS) guidance was widely applied because of its high precision and simplicity [5,6,7,8]. For the latter, fruitful methods were proposed and applied to controller design, such as PID control [9,10], fuzzy control [11,12,13], adaptive control [14,15], active disturbances rejection control [16,17], sliding mode control [18,19], and backstepping control [20,21,22]. In [23], considering the path-following control under unknown environment disturbances, the modified integral LOS guidance law and the adaptive sliding mode control law are developed, realizing the desired path following. In [24], to solve the path-following control of an under-actuated autonomous underwater vehicle subject to environment disturbances, an adaptive robust control method is proposed using fuzzy logic, backstepping and sliding mode technology, where the fuzzy logic system is utilized to approximate the unknown uncertainties. In [25], a novel, adaptive, robust path-following scheme is proposed by combining with the trajectory linearization control and the finite-time disturbance observer. In [26], a fuzzy unknown observer-based, robust, adaptive path-following control scheme is proposed, where the fuzzy observer is designed to estimate lumped unknowns and the observer-based, robust, adaptive tracking control laws are synthesized, thus ensuring that the guided signals are globally asymptotically tracked. However, the above control method depends on a system model with high accuracy, and the derivation process is complex.

With increasingly rapid development of machine learning, deep reinforcement learning (DRL) algorithms are widely applied to the relative studies of unmanned system control [27]. The DRL is a combination of deep learning and reinforcement learning, which has strong decision-making ability and anti-disturbance ability of reinforcement learning and strong representation ability of deep neural network, thus effectively reducing the complexity and difficulty of the controller design. At present, the popular DRL algorithms include the soft actor-critic (SAC), the proximal policy optimization (PPO), and the deep deterministic policy gradient (DDPG) [28,29,30]. In [31], the advantage of actor-critic (A2C) is proposed to solve path-following control for a fish-like robot, where the desired path is a randomly generated curve. In [32], a DRL controller is designed using the DDPG for path following, and simulation shows that the proposed method is better than the PID in terms of transient characteristics. In [33], a distributed DRL method is proposed to solve the path-following control of an under-actuated AMV, where the DDPG-based controller is designed and the radial basis neural network is utilized to approximate the unknown disturbances. In [34], an improved DDPG control method was proposed for path following based on an optimized sampling pool and average motion evaluation network, and the simulation results show that the proposed method effectively improves the utilization rate of samples and avoids falling into a local optimum in the training process. In [35], a linear active disturbances rejection controller based on the SAC was proposed to solve path-following control under wind and wave environments. In [36], the path-following control laws were designed using the twin-delayed deep deterministic policy gradient algorithm (TD3), where desired velocities were generated by the LOS guidance.

Considering the path following of an under-actuated AMV under unknown model parameters and environment disturbances, this article establishes the motion model for an AMV, and proposes a path-following control method by combining with the long short-term memory network (LSTM) and the TD3 algorithm. The main contributions are as follows: (1) within the kinematic level, the surge-heading joint guidance law is developed based on the LOS, where the desired velocity signals are generated to guide the vehicle along the desired path; (2) within the dynamic level, the TD3-based surge-heading controller is developed for the vehicle, where states, actions and reward functions are defined; and (3) to enhance the convergence rate of controller networks, the LSTM layer, using the historical states, is added into the TD3.

The remainder of this article is organized as follows: preliminaries and problem statement are described in Section 2; Section 3 presents the kinematic guidance law and the DRL-based dynamic controller of an AMV; simulation results and analysis are presented in Section 4; and Section 5 contains the conclusion.

2. Preliminaries and Problem Statement

2.1. Reinforcement Learning

Reinforcement learning is based on the Markov decision process. Four basic elements are defined as ${S, A, P, R}$ , where $S$ is the set of all states, $A$ is the set of all actions, $P$ is the state transition probability, and $R$ is the reward function [37]. The decay sum of all rewards from a certain state to the final state can be calculated by

(1) $R_{t} = r_{t + 1} + γ r_{t + 2} + γ^{2} r_{t + 3} + \dots = \sum_{k = 0}^{\infty} γ^{k} r_{t + k + 1}$

where

γ

is the discount factor, satisfying

γ \in [0, 1]

, and

r_{t + i} (i = 1, \dots, k + 1)

is the reward at the current time.

Additionally, the value functions under the policy $η$ include action value function $Q^{μ} (s_{t}, a_{t})$ and state value function $V^{μ} (s_{t})$ , where $s_{t}$ and $a_{t}$ are the state and action at the current time. The value functions are described as

(2) $\{\begin{cases} Q^{μ} (s_{t}, a_{t}) = E_{μ} [R_{t} | s_{t}, a_{t}] = E_{μ} [\sum_{k = 0}^{\infty} γ^{k} r_{t + k + 1} | s_{t}, a_{t}] \\ V^{μ} (s_{t}) = E_{μ} [R_{t} | s_{t}] = E_{μ} [\sum_{k = 0}^{\infty} γ^{k} r_{t + k + 1} | s_{t}] \end{cases}$

where

E

expresses the expectation, and the optimal policy

μ^{*}

can be achieved by maximizing the optimal state-action value functions [38].

(3) $μ^{*} = \arg \max V^{μ} (s_{t}) = \arg \max Q^{μ} (s_{t}, a_{t})$

2.2. LSTM Network

The LSTM network has better memory ability, where important data is retained and irrelevant noise is deleted, thereby relieving the gradient disappearance of the existing recurrent neural network and the memory burden of networks [39]. The neuronal structure is shown in Figure 1, where $x_{t}$ is the input; $h_{t}$ is the output; $c_{t}$ is the state value of the memory cell at the current time; $h_{t - 1}$ and $c_{t - 1}$ are the input signals at the previous time; $f_{t}$ is the forgetting gate; $i_{t}$ is the input gate; $o_{t}$ is the output gate; and $σ$ is the sigmoid function.

As shown in Figure 1, when the information inputs to the neuron, it firstly goes through the forgetting gate and input gate; then, it goes through the output gate, and the state value of the memory cell $c_{t}$ are calculated based on the $f_{t}$ and $i_{t}$ . Finally, the outputs are calculated based on $o_{t}$ and $c_{t}$ . The renewal process can be described by

(4) $\{\begin{cases} f_{t} = σ (W_{1} x_{t} + W_{2} h_{t - 1} + b_{1}) \\ i_{t} = σ (W_{3} x_{t} + W_{4} h_{t - 1} + b_{2}) \\ o_{t} = σ (W_{7} x_{t} + W_{8} h_{t - 1} + b_{4}) \\ c_{t} = c_{t - 1} \times f_{t} + \tanh (W_{5} x_{t} + W_{6} h_{t - 1} + b_{3}) \times i_{t} \\ h_{t} = o_{t} \times \tanh (c_{t}) \end{cases}$

where

W_{i}

is the weight coefficient with

i = 1, 2, \dots, 8

;

b_{h}

is the bias with

h = 1, \dots, 4

, and

\tanh

is the activation function [40].

2.3. Under-Actuated AMV Model

As described in [41], the under-actuated AMV model of three degrees of freedom in the horizontal plane is written as

(5) $\{\begin{cases} η = R (η) ν \\ M \dot{ν} + C (ν) ν + D (ν) ν = τ_{d} + τ \end{cases}$

where

η = {[x, y, ψ]}^{T}

are the positions and heading angle of AMV in the earth-fixed frame, and

ν = {[u, v, r]}^{T}

are the surge, sway and yaw velocities in the body-fixed frame.

τ = {[τ_{u}, 0, τ_{r}]}^{T}

are the control inputs of path-following, and

τ_{d} = {[τ_{u d}, τ_{v d}, τ_{r d}]}^{T}

are the time-varying marine environment disturbances.

R (η)

is the rotation matrix from the body-fixed frame to the earth-fixed frame, which is defined as

(6) $R (η) = [\begin{matrix} \cos ψ & - \sin ψ & 0 \\ \sin ψ & \cos ψ & 0 \\ 0 & 0 & 1 \end{matrix}]$

$M$ is the inertial matrix and satisfies $M = M^{T} > 0$ , which is written as

(7) $M = [\begin{matrix} m_{11} & 0 & 0 \\ 0 & m_{22} & m_{23} \\ 0 & m_{32} & m_{33} \end{matrix}]$

$C (ν)$ is the coriolis-centripetal matrix and satisfies $C (ν) = - C {(ν)}^{T}$ , which is written as

(8) $C (ν) = [\begin{matrix} 0 & 0 & c_{13} (ν) \\ 0 & 0 & c_{23} (ν) \\ - c_{13} (ν) & - c_{23} (ν) & 0 \end{matrix}]$

and

D (ν)

is the damping matrix, which is written as

(9) $D (ν) = [\begin{matrix} d_{11} (ν) & 0 & 0 \\ 0 & d_{22} (ν) & d_{23} (ν) \\ 0 & d_{32} (ν) & d_{33} (ν) \end{matrix}]$

with

m_{11} = m - X_{\dot{u}}

m_{22} = m - Y_{\dot{v}}

m_{33} = I_{z} - N_{\dot{r}}

m_{23} = m x_{g} - Y_{\dot{r}}

m_{32} = m x_{g} - N_{\dot{v}}

c_{13} (ν) = - m_{11} v - m_{23} r

c_{23} (ν) = - m_{11} u

d_{11} (ν) = - X_{u} - X_{| u | u} |u| - X_{u u u} {|u|}^{2}

d_{22} (ν) = - Y_{v} - Y_{| v | v} |v|

d_{33} (ν) = - N_{r} - N_{| v | r} |v| - N_{| r | r} |r|

Y_{\dot{r}} = N_{\dot{r}}

, where

m

is AMV mass, and

I_{z}

is the moment of inertia in yaw.

X_{*}

Y_{*}

and

N_{*}

are the hydrodynamic coefficients.

As shown in Figure 2, the desired path $(x_{d} (s), y_{d} (s))$ is a continuous parameterized curve with a time-independent variable $s$ . For any moving point on the curve, a path-tangential angle in the earth-fixed frame is defined as

(10) $α = a t a n 2 (y_{d}^{'} (s), x_{d}^{'} (s))$

where

y_{d}^{'} (s) = \partial y_{d} / \partial s

x_{d}^{'} (s) = \partial x_{d} / \partial s

. The errors between

(x, y)

and

(x_{d}, y_{d})

can be formulated as

(11) $[\begin{array}{l} x_{e} \\ y_{e} \end{array}] = {[\begin{array}{l} \cos α - \sin α \\ \sin α \cos α \end{array}]}^{T} [\begin{array}{l} x - x_{d} (s) \\ y - y_{d} (s) \end{array}]$

where

x_{e}

is the along-track error, and

y_{e}

is the cross-track error.

In this article, our objective is to design the DRL-based path-following control scheme for an uncertain under-actuated AMV, such that the vehicle can follow the desired path with the desired velocities regardless of model uncertainties and unknown marine environment disturbances. To be specific, the objective can be formalized as

(12) $\{\begin{cases} \lim_{t \to \infty} x_{e} \leq δ_{x} \\ \lim_{t \to \infty} y_{e} \leq δ_{y} \end{cases}$

where

δ_{x}

and

δ_{y}

are any small positive constants.

3. DRL-Based Path-Following Control Scheme

In this section, a DRL-based path-following control scheme is established for an under-actuated AMV in the presence of model uncertainties and unknown marine environment disturbances. The diagram of the proposed control scheme is shown in Figure 3, where kinematic guidance and dynamic control are designed, respectively.

Within the kinematic level, according to the position errors and related motion states obtained by the AMV model, the desired surge velocity and heading angle are generated by the designed surge-heading joint guidance law based on the light-of-sight guidance. Within the dynamic level, DRL-based surge-heading controllers are presented for following the desired guidance signals. The reward function is designed to generate rewards by comparing the desired signals with the actual vehicle states in the environment. The controllers generate the control inputs based on the rewards, and the vehicle precisely tracks the desired signals based on the control inputs and the novel environment states to realize path-following control. By combining with the kinematic guidance and dynamic control, the objective (12) can be successfully completed.

3.1. Kinematic Guidance Design

Firstly, kinematic guidance is designed in this subsection, where desired surge velocities and heading angles are produced. Differentiating (11) along (5) yields

(13) $\{\begin{cases} {\dot{x}}_{e} = u \cos (ψ - α) - v \sin (ψ - α) + \dot{α} y_{e} - u_{s} \\ {\dot{y}}_{e} = u \sin (ψ - α) + v \cos (ψ - α) - \dot{α} x_{e} \end{cases}$

where

u_{s}

is velocity of the virtual point along the desired path, which is defined by

(14) $u_{s} = \dot{s} \sqrt{{x_{d}^{'}}^{2} (s) + {y_{d}^{'}}^{2} (s)}$

Define the sideslip angle of under-actuated vehicle as

(15) $β = \arctan (\frac{v}{u})$

In this context, path-following error dynamics (13) is rewritten as

(16) $\{\begin{cases} {\dot{x}}_{e} = u \cos (ψ - α) - u \sin (ψ - α) t a n β + \dot{α} y_{e} - u_{s} \\ {\dot{y}}_{e} = u \sin (ψ - α) + u \cos (ψ - α) t a n β - \dot{α} x_{e} \end{cases}$

Then, select the Lyapunov function related to path following errors as

(17) $V = \frac{1}{2} (x_{e}^{2} + y_{e}^{2})$

The time derivative of (17) along the solution (16) is

(18) $\begin{array}{l} \dot{V} = x_{e} {\dot{x}}_{e} + y_{e} {\dot{y}}_{e} \\ = x_{e} (u \cos (ψ - α) - u \sin (ψ - α) t a n β + \dot{α} y_{e} - u_{s}) \\ + y_{e} (u \sin (ψ - α) - \dot{α} x_{e} + u \cos (ψ - α) t a n β) \end{array}$

Thus, the surge-heading joint guidance law is designed as follows

(19) $\{\begin{cases} u_{d} = k_{1} \sqrt{y_{e}^{2} + Δ^{2}} \\ ψ_{d} = α - β_{d} - \arctan (\frac{y_{e}}{Δ}) \end{cases}$

where

k_{1} > 0

;

Δ > 0

is the look-ahead distance;

β_{d} = \arctan (v / u_{d})

and virtual velocity

u_{s}

is determined by

(20) $u_{s} = U_{d} \cos (β_{d} + ψ - α) + k_{2} x_{e}$

with

k_{2} > 0

and

U_{d} = \sqrt{u_{d}^{2} + v^{2}}

Using the fact

(21) $\{\begin{cases} \sin (\tan^{- 1} (- \frac{y_{e}}{Δ})) = - \frac{y_{e}}{{(y_{e}^{2} + Δ^{2})}^{1 / 2}} \\ \cos (\tan^{- 1} (- \frac{y_{e}}{Δ})) = - \frac{Δ}{{(y_{e}^{2} + Δ^{2})}^{1 / 2}} \end{cases}$

and substituting the guidance law (19) into (18) yield

(22) $\begin{array}{l} \dot{V} = x_{e} (\dot{α} y_{e} - k_{2} x_{e}) + y_{e} (- \frac{k_{1}}{\cos β_{d}} y_{e} - \dot{α} x_{e}) \\ = - k_{2} x_{e}^{2} - \frac{k_{1}}{\cos β_{d}} y_{e}^{2} \end{array}$

Since $0 < \cos β_{d} \leq 1$ renders $\dot{V} \leq - k_{2} x_{e}^{2} - k_{1} y_{e}^{2}$ , it indicates that path-following error $x_{e}$ and $y_{e}$ can globally asymptotically converge to the origin using the proposed guidance method.

3.2. Dynamic Control Design

Because of inaccurate measurements and environment disturbances, AMV model parameters cannot be obtained completely, thereby resulting in uncertainties of dynamics. To enhance the engineering practicality and reduce the complexity of controllers, a TD3-based reinforcement learning control method is presented within the dynamic level, where control inputs of the AMV are generated successfully.

The main purpose of the DRL algorithm is to make the vehicle take actions in the case of different path information. The proposed TD3 algorithm is based on the actor-critic structure, where policy functions are produced using actor networks and critic networks used to judge the performance of the actor [42]. Additionally, LSTM network layers are introduced into the TD3 and thereby enhance the utilization rate of historical states.

Firstly, the network structure of TD3 algorithm is shown in Figure 4. By virtue of initial environment states, actions of the AMV are generated using actor networks, and rewards are accordingly calculated using reward functions; thus, the states can be updated with the generated actions. The empirical value is defined as $e (t) = {s, a, r, s_{t + 1}}$ and saved into buffer $M e m o r y D$ . Through repeated training, the empirical replay sequence $D = {e_{1}, e_{2}, \dots, e_{n}}$ can be formed. Considering that the adjacent actions of path-following have strong relevance, a batch of empirical sequences are selected randomly for training. The actor network of target generates the action $a_{t + 1}$ according to the state $s_{t + 1}$ in the empirical replay sequence, and the critic network of target calculates the $Q_{\min}$ value, where $Q_{\min}$ is the smaller of the two $Q_{t a r g e t}$ values generated by target networks. Two critic networks are updated based on $Q$ , $Q_{\min}$ and loss functions. Actor networks generate actions using states. Critic networks generate the $Q$ value using states and actions, and thus calculate the policy gradient and update actor networks using the $Q$ value.

The specific renewal process is as follows. Considering that TD3 algorithm is a deterministic policy and has the characteristic of target policy smoothing regularization, random noise $ε$ is added into target actions. Therefore, there is $a_{t + 1} = μ^{'} (s_{t + 1} | w^{'}) + ε$ , where $μ^{'}$ is the policy of target actor networks, and $w^{'}$ is the parameter of target actor networks.

The target value is calculated as

(23) $y = r + γ Q_{\min} (s_{t + 1}, a_{t + 1} | θ_{i}^{'})$

where

i = 1, 2

θ_{i}^{'}

represents parameters of target critic networks.

The loss function is defined as

(24) $L (θ_{i}) = \frac{1}{N} {(y - Q (s, a | θ_{i}))}^{2}$

where

N

represents the number of mini-batch, and

θ_{i}

represents parameters of critic networks. The gradient is updated by

(25) $\frac{\partial L (θ_{i})}{\partial θ_{i}} = E [(y - Q (s, a | θ_{i})) \frac{\partial Q (s, a | θ_{i})}{\partial θ_{i}}]$

Subsequently, the policy gradient of actor networks is updated by

(26) $\{\begin{cases} \frac{\partial J (w)}{\partial w} = E [\frac{\partial Q (s, a | θ)}{\partial a} \frac{\partial μ (s | w)}{\partial w}] \\ \nabla_{w} J \approx \frac{1}{N} \sum (\nabla_{a} Q (s, a | θ_{1}) |_{a = μ (s)} \nabla_{w} μ (s | w)) \end{cases}$

After a couple of cycles, target parameters are soft updated by

(27) $\{\begin{cases} θ_{i}^{'} = ξ θ_{i} + (1 - ξ) θ_{i}^{'} \\ w^{'} = ξ w + (1 - ξ) w^{'} \end{cases}$

where

ξ \in (0, 1)

represents the learning rate.

Then, the LSTM network is introduced into actor-critic networks, thus contributing to the LSTM-TD3-based reinforcement learning controllers. The LSTM-TD3 network still retains the actor-critic structure, where LSTM inputs is a length of sequences. According to the real-time navigational information, the continuous states are saved into the sequences. The LSTM network layer is connected to generate the final hidden state $h_{t}$ , where $h_{t}$ is a one-dimensional array. Via the multi-layer perceptron (MLP) neural networks, the path-following control inputs of an AMV are generated, which include surge forces and yaw torques. The network structure of the actor is shown in Figure 5. Note that the critic has the similar network structure to the actor, and generates available actions.

Finally, the state space, action space and reward function are designed as follows. To be specific, the state space represents perceived environment information of the AMV, which is the basis of decision-making and reward-evaluating. The action space represents control inputs of the AMV, including surge forces and yaw torques. The reward function is used to evaluate current state of the AMV.

In this context, the state space is defined as

(28) $s_{t} = [x_{e}, y_{e}, ψ, ψ_{e}, u, v, r, u_{e}, τ_{u (}_{t - 1)}, τ_{r (}_{t - 1)}]$

where

τ_{u (}_{t - 1)}

and

τ_{r (}_{t - 1)}

are control inputs at the previous moment.

u_{e} = u - u_{d}

and

ψ_{e} = ψ - ψ_{d}

with

u_{d}

ψ_{d}

are generated by the guidance law (19).

Taking path-following errors, surge velocity and heading angle errors into considerations, the reward function $r_{1}$ is designed as

(29) $r_{1} = λ (2 \exp^{- k_{3} | u_{e} |} - 1) + (2 \exp^{- k_{4} | ψ_{e} |} - 1) + (2 \exp^{- k_{5} | \sqrt{x_{e}^{2} + y_{e}^{2}} |} - 1)$

where

k_{*} (* = 3, 4, 5) > 0

and

λ > 0

. Note that the exponential function is used to calculate rewards, which limits the size of rewards and avoids high rewards.

Furthermore, the reward function $r_{2}$ is designed as

(30) $r_{2} = \exp^{- k_{6} | σ_{τ_{u}} |} + \exp^{- k_{6} | σ_{τ_{r}} |} - 1$

where

k_{6} > 0

;

σ_{τ_{u}}

and

σ_{τ_{r}}

are the standard deviation of two inputs, which are used to reduce the chattering of control inputs.

By combining with (29) and (30), the hybrid rewards of path-following control are established as

(31) $r = r_{1} + k_{7} r_{2}$

where

k_{7} > 0

, and satisfies

k_{7} \in (0, 1)

The framework of the dynamic control algorithm is summarized in Algorithm 1.

Algorithm 1. Dynamic control algorithm of an AMV.

Inputs: Learning rate

ξ

l_{θ}

and

l_{w}

, regular factor

ε

, gradient threshold parameter

g

, discount factor

γ

, sequence length

L

, the maximum number of steps per training

K

, updating cycle of target network parameter

d

, training cycle

F

, mini-batch

N

Initialize: Critic network

Q (s, a | θ_{i})

and actor network

μ (s | w)

with random parameters

θ_{1}

θ_{2}

and

w

, target network

θ_{1}^{'} \leftarrow θ_{1}

θ_{2}^{'} \leftarrow θ_{2}

and

w^{'} \leftarrow w

, experience replay buffer

M e m o r y D

, navigation environment of an AMV.

Procedure:1: for

n_{1} = 1, \dots, F

do2: for

n_{2} = 1, \dots, K

3: Select actions with exploration noise

a ~ μ (s | w) + ε

and obtain reward

r

and next moment state

s_{t + 1}

4: Save transition tuple

(s, a, r, s_{t + 1})

into

M e m o r y D

5: Sample

N

transitions

(s, a, r, s_{t + 1})

\{\begin{cases} a_{t + 1} \leftarrow μ^{'} (s_{t + 1} | w^{'}) + ε \\ y \leftarrow r + γ Q_{\min} (s_{t + 1}, a_{t + 1} | θ_{i}^{'}) \end{cases}

6: Update Critic networks parameters

θ_{i}

θ_{i} \leftarrow \arg \min_{θ_{i}} \frac{1}{N} {(y - Q (s, a | θ_{i}))}^{2}

7: if

t

mod

d

then

8: Update actor network parameters

w

\nabla_{w} J \approx \frac{1}{N} \sum (\nabla_{a} Q (s, a | θ_{1}) |_{a = μ (s)} \nabla_{w} μ (s | w))

9: Update target network as

\{\begin{cases} θ_{i}^{'} = ξ θ_{i} + (1 - ξ) θ_{i}^{'} \\ w^{'} = ξ w + (1 - ξ) w^{'} \end{cases}

end if end for

Outputs: Actor network parameter

w

, critic network parameters

θ_{1}

and

θ_{2}

4. Simulation Studies

In this section, simulation studies are shown to verify the effectiveness and superiority of the proposed DRL-based path-following control method. Consider an under-actuated AMV described by (5) with model uncertainties and unknown marine environment disturbances. Model parameters of the prototype AMV can be found in [43].

Within the kinematic guidance level, relative parameters are chosen as follows: $k_{1} = 0.2$ , $k_{2} = 2$ , $Δ = 3$ . Within the dynamic control level, relative parameters are chosen as follows: $k_{3} = 1.5$ , $k_{4} = 6$ , $k_{5} = 1$ , $k_{6} = 1$ , $λ = 0.8$ , $k_{7} = 0.3$ . Training hyper parameters and network parameters of the LSTM-TD3 are shown in Table 1 and Table 2, respectively.

Episode rewards with different DRL algorithms of path-following control of an AMV are shown in Figure 6. It can be seen that the initial reward is extremely low since the vehicle explores the environment randomly during the initial training stage. After collecting enough data, the rewards converge to a stable value under the DRL control method. Compared to the asynchronous advantage of actor-critic (A3C) developed in [44], the TD3 and LSTM-TD3 can effectively increase the accumulated reward. Additionally, the proposed LSTM-TD3 shows a faster convergence rate and a more stable convergence process than the other two algorithms.

After training the algorithm for 1000 episodes, the optimal actor network parameters of the LSTM-TD3 and TD3 are saved and utilized. Simulation time is set as 200s. The initial positions and attitude are $η = {[- 10, 0, 0]}^{T}$ and the initial velocities are $ν = {[0, 0, 0]}^{T}$ . Time-varying marine environment disturbances are defined as

(32) $τ_{d} = [\begin{matrix} 5 \sin (0.1 t + π / 5) \\ 2.2 \cos (0.1 t + 6) \\ 1.2 \cos (0.1 t + 3) \end{matrix}]$

The desired path is defined as

(33) $\{\begin{cases} x_{d} = s \\ y_{d} = 10 \sin (0.3 s) + 2 s \end{cases}$

Simulation results are shown in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11. Figure 7 shows the path-following control performance of an under-actuated AMV where the desired path, actual path under the proposed LSTM-TD3 and the traditional TD3 are plotted. Obviously, the proposed control method has significant superiority in terms of transient responses and steady-state performance. Figure 8 shows the path-following errors of an AMV subject to model uncertainties and marine environment disturbances. It can be seen that the tangential error and the normal error can converge to the origin faster under the proposed control method. Figure 9 shows the surge velocity and heading angle, where desired signals are generated by guidance law (19). The actual velocities gradually converge to the desired value by the aid of the DRL controller. Note that the slight chattering of velocities is due to large path inflection point, and under-actuated AMV have to reduce their speed to follow the desired path. Figure 10 shows the velocity error and heading angle error of an AMV. It can be seen that the LSTM network considers historical states and thus enhances control performance. Figure 11 shows path-following control inputs of an AMV, including the surge force and the yaw torque. Because of the hybrid rewards with standard deviation, where 20 continuous inputs are set as a calculation group and dynamic sliding is introduced, the control chattering is effectively relieved.

5. Conclusions

This article studies the path-following control of an under-actuated AMV subject to unknown model parameters and marine environmental disturbances. Within the kinematic level, a surge-heading joint guidance law is presented, and makes the vehicle follow the desired path. Within the dynamic level, a LSTM-TD3-based reinforcement learning controller is presented, where vehicle actions are generated by the state space and hybrid reward. Additionally, actor-critic networks are developed using the LSTM network, and vehicles can make a decision by the aid of historical states, thus enhancing the convergence rate of controller networks. Simulation results and comprehensive comparisons demonstrate the remarkable effectiveness and superiority of the proposed path-following control method. By the aid of the proposed controller, the AMV can achieve path following regardless of marine environment disturbances.

Author Contributions

Conceptualization, X.Q. and Y.J.; methodology, X.Q.; software, Y.J.; validation, X.Q., Y.J. and R.Z.; formal analysis, F.L.; investigation, Y.J.; resources, X.Q.; data curation, X.Q.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q.; visualization, F.L.; supervision, F.L.; project administration, R.Z.; funding acquisition, R.Z. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Neuronal structure of the LSTM.

Figure 2. Diagram of horizontal path-following control.

Figure 3. Diagram of the proposed path-following control scheme.

Figure 4. Network structure of the TD3.

Figure 5. Network structure of Actor.

Figure 6. Episode rewards with different DRL algorithms.

Figure 7. Path-following control performance of an AMV.

Figure 8. Path-following errors of an AMV under LSTM-TD3 and TD3.

Figure 9. Surge velocity and heading angle of an AMV.

Figure 10. Velocity and heading errors of an AMV under LSTM-TD3 and TD3.

Figure 11. Surge force and yaw torque of an AMV.

Table 1

Training hyper parameters.

Parameters	Value
Discount factor $γ$	0.99
State sequence length $L$	20
Training cycle $F$	1000
Maximum number of steps $K$	1000
Capacity of buffer $D$	100,000
Learning rate $l$	0.001
Optimizer	Adam
Gradient threshold parameter $g$	1
Regular factor $ε$	0.00005
Mini-batch $N$	128

Table 2

LSTM-TD3 network parameters.

Parameters	Value
Input layer of actor network	11
Input layer of critic network	13
Fully connected layer	200
LSTM layer of actor network	100
LSTM layer of critic network	100
Output layer of actor	2
Output layer of critic	1

References

1. Jorge, V.A.; Granada, R.; Maidana, R.G.; Jurak, D.A.; Heck, G.; Negreiros, A.P.; Dos Santos, D.H.; Gonçalves, L.M.; Amory, A.M. A Survey on Unmanned Surface Vehicles for Disaster Robotics: Main Challenges and Directions. Sensors; 2019; 19, 702. [DOI: https://dx.doi.org/10.3390/s19030702]

2. Liu, T.; Dong, Z.; Du, H.; Song, L.; Mao, Y. Path Following Control of the Underactuated USV Based on the Improved Line-of-Sight Guidance Algorithm. Pol. Marit. Res.; 2017; 24, pp. 3-11. [DOI: https://dx.doi.org/10.1515/pomr-2017-0001]

3. Mu, D.; Wang, G.; Fan, Y.; Bai, Y.; Zhao, Y. Fuzzy-Based Optimal Adaptive Line-of-Sight Path Following for underactuated unmanned surface vehicle with uncertainties and time-varying disturbances. Math. Probl. Eng.; 2018; 2018, 7512606. [DOI: https://dx.doi.org/10.1155/2018/7512606]

4. Koh, S.; Zhou, B.; Fang, H.; Yang, P.; Yang, Z.; Yang, Q.; Guan, L.; Ji, Z. Real-time deep reinforcement learning based vehicle navigation. Appl. Soft Comput.; 2020; 96, 106694. [DOI: https://dx.doi.org/10.1016/j.asoc.2020.106694]

5. Mu, D.; Wang, G.; Fan, Y.; Bai, Y.; Zhao, Y. Path following for podded propulsion unmanned surface vehicle: Theory, simulation and experiment. IEEJ Trans. Electr. Electron. Eng.; 2018; 13, pp. 911-923. [DOI: https://dx.doi.org/10.1002/tee.22645]

6. Lekkas, A.M.; Fossen, T.I. Integral LOS Path Following for Curved Paths Based on a Monotone Cubic Hermite Spline Parametrization. IEEE Trans. Control Syst. Technol.; 2014; 22, pp. 2287-2301. [DOI: https://dx.doi.org/10.1109/TCST.2014.2306774]

7. Fossen, T.I.; Lekkas, A.M. Direct and indirect adaptive integral line-of-sight path-following controllers for marine craft exposed to ocean currents. Int. J. Adapt. Control Signal Process.; 2017; 31, pp. 445-463. [DOI: https://dx.doi.org/10.1002/acs.2550]

8. Fossen, T.I.; Pettersen, K.Y.; Galeazzi, R. Line-of-Sight Path Following for Dubins Paths with Adaptive Sideslip Compensation of Drift Forces. IEEE Trans. Control Syst. Technol.; 2014; 23, pp. 820-827. [DOI: https://dx.doi.org/10.1109/TCST.2014.2338354]

9. Liu, Z.; Song, S.; Yuan, S.; Ma, Y.; Yao, Z. ALOS-Based USV Path-Following Control with Obstacle Avoidance Strategy. J. Mar. Sci. Eng.; 2022; 10, 1203. [DOI: https://dx.doi.org/10.3390/jmse10091203]

10. Rout, R.; Subudhi, B. Inverse optimal self-tuning PID control design for an autonomous underwater vehicle. Int. J. Syst. Sci.; 2017; 48, pp. 367-375. [DOI: https://dx.doi.org/10.1080/00207721.2016.1186238]

11. Yu, C.; Xiang, X.; Lapierre, L.; Zhang, Q. Nonlinear guidance and fuzzy control for three-dimensional path following of an underactuated autonomous underwater vehicle. Ocean Eng.; 2017; 146, pp. 457-467. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2017.10.001]

12. Xiang, X.; Yu, C.; Zhang, Q. Robust fuzzy 3D path following for autonomous underwater vehicle subject to uncertainties. Comput. Oper. Res.; 2017; 84, pp. 165-177. [DOI: https://dx.doi.org/10.1016/j.cor.2016.09.017]

13. Zhang, J.; Xiang, X.; Lapierre, L.; Zhang, Q.; Li, W. Approach-angle-based three-dimensional indirect adaptive fuzzy path following of under-actuated AUV with input saturation. Appl. Ocean Res.; 2021; 107, 102486. [DOI: https://dx.doi.org/10.1016/j.apor.2020.102486]

14. Sahu, B.K.; Subudhi, B. Adaptive tracking control of an autonomous underwater vehicle. Int. J. Autom. Comput.; 2014; 11, pp. 299-307. [DOI: https://dx.doi.org/10.1007/s11633-014-0792-7]

15. Shin, J.; Kwak, D.J.; Lee, Y. Adaptive Path-Following Control for an Unmanned Surface Vessel Using an Identified Dynamic Model. IEEE/ASME Trans. Mechatron.; 2017; 22, pp. 1143-1153. [DOI: https://dx.doi.org/10.1109/TMECH.2017.2651057]

16. Lamraoui, H.C.; Zhu, Q. Path following control of fully-actuated autonomous underwater vehicle in presence of fast-varying disturbances. Appl. Ocean Res.; 2019; 86, pp. 40-46. [DOI: https://dx.doi.org/10.1016/j.apor.2019.02.015]

17. Zhang, H.; Zhang, X.; Cao, T.; Bu, R. Active disturbance rejection control for ship path following with Euler method. Ocean Eng.; 2022; 247, 110516. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2021.110516]

18. Zhang, G.; Huang, H.; Wan, L.; Li, Y.; Cao, J.; Su, Y. A novel adaptive second order sliding mode path following control for a portable AUV. Ocean Eng.; 2018; 151, pp. 82-92. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2017.12.054]

19. Zhang, H.; Zhang, X.; Bu, R. Radial Basis Function Neural Network Sliding Mode Control for Ship Path Following Based on Position Prediction. J. Mar. Sci. Eng.; 2021; 9, 1055. [DOI: https://dx.doi.org/10.3390/jmse9101055]

20. Wang, J.; Wang, C.; Wei, Y.; Zhang, C. Three-Dimensional Path Following of an Underactuated AUV Based on Neuro-Adaptive Command Filtered Backstepping Control. IEEE Access; 2018; 6, pp. 74355-74365. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2883081]

21. Yan, Z.; Yang, Z.; Zhang, J.; Zhou, J.; Jiang, A.; Du, X. Trajectory tracking control of UUV based on backstepping sliding mode with fuzzy switching gain in diving plane. IEEE Access; 2019; 7, pp. 166788-166795. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2953530]

22. Zhou, J.; Zhao, X.; Chen, T.; Yan, Z.; Yang, Z. Trajectory tracking control of an underactuated AUV based on backstepping sliding mode with state prediction. IEEE Access; 2019; 7, pp. 181983-181993. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2958360]

23. Chen, X.; Liu, Z.; Zhang, J.; Zhou, D.; Dong, J. Adaptive sliding-mode path following control system of the underactuated USV under the influence of ocean currents. J. Syst. Eng. Electron.; 2018; 29, pp. 1271-1283. [DOI: https://dx.doi.org/10.21629/JSEE.2018.06.14]

24. Liang, X.; Wan, L.; Blake, J.I.; Shenoi, R.A.; Townsend, N. Path Following of an Underactuated AUV Based on Fuzzy Backstepping Sliding Mode Control. Int. J. Adv. Robot. Syst.; 2016; 13, 122. [DOI: https://dx.doi.org/10.5772/64065]

25. Qiu, B.; Wang, G.; Fan, Y.; Mu, D.; Sun, X. Path Following of Underactuated Unmanned Surface Vehicle Based on Trajectory Linearization Control with Input Saturation and External Disturbances. Int. J. Control Autom. Syst.; 2020; 18, pp. 2108-2119. [DOI: https://dx.doi.org/10.1007/s12555-019-0659-3]

26. Wang, N.; Sun, Z.; Yin, J.; Zou, Z.; Su, S. Fuzzy unknown observer-based robust adaptive path following control of underactuated surface vehicles subject to multiple unknowns. Ocean Eng.; 2019; 176, pp. 57-64. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2019.02.017]

27. Havenstrøm, S.T.; Rasheed, A.; San, O. Deep reinforcement learning controller for 3D path following and collision avoidance by autonomous underwater vehicles. Front. Robot. AI; 2021; 7, 211. [DOI: https://dx.doi.org/10.3389/frobt.2020.566037]

28. Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access; 2020; 8, pp. 165344-165364. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3022600]

29. Sola, Y.; Le Chenadec, G.; Clement, B. Simultaneous control and guidance of an auv based on soft actor–critic. Sensors; 2022; 22, 6072. [DOI: https://dx.doi.org/10.3390/s22166072]

30. Fang, Y.; Huang, Z.; Pu, J.; Zhang, J. AUV position tracking and trajectory control based on fast-deployed deep reinforcement learning method. Ocean Eng.; 2022; 245, 110452. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2021.110452]

31. Zhang, T.; Tian, R.; Wang, C.; Xie, G. Path-Following Control of Fish-like Robots: A Deep Reinforcement Learning Approach. IFAC-PapersOnLine; 2020; 53, pp. 8163-8168. [DOI: https://dx.doi.org/10.1016/j.ifacol.2020.12.2306]

32. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng.; 2019; 183, pp. 155-166. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2019.04.099]

33. Han, Z.; Wang, Y.; Sun, Q. Straight-Path Following and Formation Control of USVs Using Distributed Deep Reinforcement Learning and Adaptive Neural Network. IEEE/CAA J. Autom. Sin.; 2023; 10, pp. 572-574. [DOI: https://dx.doi.org/10.1109/JAS.2023.123255]

34. Sun, Y.; Ran, X.; Zhang, G.; Wang, X.; Xu, H. AUV path following controlled by modified Deep Deterministic Policy Gradient. Ocean Eng.; 2020; 210, 107360. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2020.107360]

35. Zheng, Y.; Tao, J.; Sun, Q.; Sun, H.; Chen, Z.; Sun, M.; Xie, G. Soft Actor–Critic based active disturbance rejection path following control for unmanned surface vessel under wind and wave disturbances. Ocean Eng.; 2022; 247, 110631. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2022.110631]

36. Liang, Z.; Qu, X.; Zhang, Z.; Chen, C. Three-Dimensional Path-Following Control of an Autonomous Underwater Vehicle Based on Deep Reinforcement Learning. Pol. Marit. Res.; 2022; 29, pp. 36-44. [DOI: https://dx.doi.org/10.2478/pomr-2022-0042]

37. Liu, Y.; Peng, Y.; Wang, M.; Xie, J.; Zhou, R. Multi-usv system cooperative underwater target search based on reinforcement learning and probability map. Math. Probl. Eng.; 2020; 2020, 7842768. [DOI: https://dx.doi.org/10.1155/2020/7842768]

38. Havenstrøm, S.T.; Sterud, C.; Rasheed, A.; San, O. Proportional integral derivative controller assisted reinforcement learning for path following by autonomous underwater vehicles. arXiv; 2020; [DOI: https://dx.doi.org/10.48550/arXiv.2002.01022] arXiv: 2002.01022

39. Zhang, W.; Wu, P.; Peng, Y.; Liu, D. Roll motion prediction of unmanned surface vehicle based on coupled CNN and LSTM. Future Internet; 2019; 11, 243. [DOI: https://dx.doi.org/10.3390/fi11110243]

40. Li, J.; Tian, Z.; Zhang, G.; Li, W. Multi-AUV Formation Predictive Control Based on CNN-LSTM under Communication Constraints. J. Mar. Sci. Eng.; 2023; 11, 873. [DOI: https://dx.doi.org/10.3390/jmse11040873]

41. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; John Wiley & Sons: Hoboken, NJ, USA, 2011.

42. Chu, Z.; Sun, B.; Zhu, D.; Zhang, M.; Luo, C. Motion control of unmanned underwater vehicles via deep imitation reinforcement learning algorithm. IET Intell. Transp. Syst.; 2020; 14, pp. 764-774. [DOI: https://dx.doi.org/10.1049/iet-its.2019.0273]

43. Wang, N.; Gao, Y.; Yang, C.; Zhang, X. Reinforcement learning-based finite-time tracking control of an unknown unmanned surface vehicle with input constraints. Neurocomputing; 2022; 484, pp. 26-37. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.04.133]

44. Xie, S.; Chu, X.; Zheng, M.; Liu, C. A composite learning method for multi-ship collision avoidance based on reinforcement learning and inverse control. Neurocomputing; 2020; 411, pp. 375-392. [DOI: https://dx.doi.org/10.1016/j.neucom.2020.05.089]

Word count: 4990

Show less

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

In this article, a deep reinforcement learning-based path-following control scheme is established for an under-actuated autonomous marine vehicle (AMV) in the presence of model uncertainties and unknown marine environment disturbances is presented. By virtue of light-of-sight guidance, a surge-heading joint guidance method is developed within the kinematic level, thereby enabling the AMV to follow the desired path accurately. Within the dynamic level, model uncertainties and time-varying environment disturbances are taken into account, and the reinforcement learning control method using the twin-delay deep deterministic policy gradient (TD3) is developed for the under-actuated vehicle, where path-following actions are generated via the state space and hybrid rewards. Additionally, actor-critic networks are developed using the long-short time memory (LSTM) network, and the vehicle can successfully make a decision by the aid of historical states, thus enhancing the convergence rate of dynamic controllers. Simulation results and comprehensive comparisons on a prototype AMV demonstrate the remarkable effectiveness and superiority of the proposed LSTM-TD3-based path-following control scheme.

Details

Title

A Deep Reinforcement Learning-Based Path-Following Control Scheme for an Uncertain Under-Actuated Autonomous Marine Vehicle

Author

Qu, Xingru

; Jiang, Yuze; Zhang, Rubo; Long, Feifei

First page

1762

Publication year

2023

Publication date

2023

Publisher

MDPI AG

e-ISSN

20771312

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/jmse11091762

ProQuest document ID

2869437725

A Deep Reinforcement Learning-Based Path-Following Control Scheme for an Uncertain Under-Actuated Autonomous Marine Vehicle

Jump to:

Full text

Abstract

Details

Suggested sources