Full Text

Turn on search term navigation

1. Introduction

Due to their advantages of simple structure, sustainability, and flexible control methods, skid-steering vehicles have been widely used in civil and military applications, including agricultural vehicles [1], combat vehicles [2,3], robots [4,5], and so on. As shown in Figure 1, skid-steering vehicles are typically equipped with four independent driving wheels, forming an overdrive system with more control output than control input. Unlike car-like vehicles [6,7,8], skid-steering vehicles do not possess a dedicated steering system; therefore, a reasonable torque distribution is key to skid-steering vehicle control. With the development of optimal problem solvers, through reinforcement learning (RL) techniques in complex systems, new insights have been provided for coordinated torque distribution in skid-steering vehicles. Therefore, the development of torque distribution strategies based on RL has become a promising research direction for future skid-steering vehicles.

The conventional control methods for skid-steering vehicles, mainly, include kinematics and dynamics methods [9]. In kinematics methods, the desired control value is transformed into a reference speed for each wheel, based on the vehicle kinematics, following which each driving wheel generates driving torque, to track its own reference speed for vehicle control [10,11,12]. Kinematics methods are subject to significant slippage and skidding, when precisely following the wheel reference speed, which seriously impacts the control performance of the vehicle. In [13], the authors proposed a speed-adaptive control algorithm for six-wheeled differential steering vehicles. To enhance the vehicle handing and lateral stability performance, reference wheel speeds are, individually, generated for each wheel, based on their respective slipping and skidding status. In [14], a hierarchical controller was designed for a skid-steering vehicle, based on the need of dynamic control. In the controller, a reference wheel speed generator was designed to calculate the wheel speed, and a wheel speed follower is used to reach the target wheel speed. In [15], in order to mitigate the effects of wheel slip on accurate control, the authors proposed a new kinematics model for skid-steering vehicles, which has the ability to predict and compensate for the slippages in the forward kinematics.

To improve maneuverability and stability, recent studies have proposed dynamics methods for the control of skid-steering vehicles. Dynamics methods can address the problem of torque distribution for skid-steering vehicles, through optimization theory [16,17,18]. In [19], the authors proposed a hierarchical torque distribution strategy for a skid-steering electric unmanned ground vehicle (EUGV), in order to control the longitudinal speed and yaw rate. The objective function, consisting of the longitudinal tire workload rates and weight factors of the tires, was established by considering inequality constraints, including actuator, road adhesion, and tire friction circle constraints. In [20], a hierarchical control framework for a six-wheel independent drive (6WID) vehicle was proposed, and optimization theory was employed to distribute the driving torques. The control strategy was used to realize the real-time torque distribution, having the ability to maintain wheel failure tolerance and limit wheel slip.

To date, RL algorithms have been successfully implemented in robots [21] and unmanned aerial vehicles (UAVs) [22,23] as well as energy [24], transportation [25], and other complex systems [26,27]. The DDPG algorithm has been successfully used to deal with decision-making problems involving continuous action spaces in various vehicle control applications [28], such as trajectory planning [29], automatic lane changing [30], and optimal torque distribution [31,32,33]. To overcome the shortcomings of the original DDPG in the training process, many learning tricks have been proposed to make the training more efficient and the convergence more stable. In [34], the authors proposed a DDPG-based controller that allows UAVs to fly robustly in uncertain environments. Three learning tricks—delayed learning, adversarial attack, and mixed exploration—were introduced, to overcome the fragility and volatility of the original DDPG, which greatly improved the convergence speed, convergence effect, and stability. In [35], a control approach based on Twin Delayed DDPG (TD3) was proposed, to handle the model-free attitude control problem of On-Orbit Servicing Spacecraft (OOSS), under the guidance of a mixed reward system. The Proportional–Integral–Derivative (PID) guide TD3 algorithm effectively increased the training speed and learning stability of the TD3 algorithm, through the use of prior knowledge. In [36], an improved energy management framework that embeds expert knowledge into the DDPG was proposed for hybrid electric vehicles (HEVs). By incorporating the battery characteristics and the optimal brake specific fuel consumption of HEVs, the proposed framework not only accelerated the learning process, but also obtained better fuel economy. In [37], the authors proposed a knowledge-assisted DDPG for the control of a cooperative wind farm, by combining knowledge-assisted methods with the DDPG algorithm. Three analytical models were utilized to speed up the learning process and train robustly.

To summarize the above work, kinematic methods are incapable of overcoming unexpected wheel slip, while dynamics methods can, theoretically, effectively control wheel slip within a limited range. However, the implementation of dynamics methods requires complicated functions, in order to estimate the vehicle model and the wheel–ground interactions, which are difficult to apply in practice [38]. The RL algorithm iteratively explores the optimal control strategy of the system in the training process, and the associated neural networks can approximate the dynamics model and wheel–ground interactions. Therefore, an RL-based torque distribution strategy for skid-steering vehicles is incorporated into our work.

In this study, we propose a KA-DDPG-based driving torque distribution method for skid-steering vehicles, in order to minimize the tracking error, with respect to the desired value. We, first, analyze a dynamics model of skid-steering vehicles for torque distribution. Then, the agent of the KA-DDPG algorithm is designed for vehicle control. Based on the torque distribution strategy of KA-DDPG, we can achieve longitudinal speed and yaw rate tracking control for skid-steering vehicles. The main contributions of this study can be summarized as follows: (1) a KA-DDPG-based torque distribution strategy is proposed for skid-steering vehicles, in order to minimize the errors of the longitudinal speed and yaw rate, thus realizing the tracking control of the desired value; (2) considering the learning efficiency of the KA-DDPG algorithm, the knowledge-assisted RL framework is proposed by combining two knowledge-assisted learning methods with the DDPG algorithm; and (3) a dynamics model for skid-steering vehicles is constructed, in order to evaluate the performance of the proposed method.

The remainder of this paper is organized as follows. Section 2 consists of two parts: one addressing the dynamics model of skid-steering vehicles, and another focused on the DDPG algorithm. Section 3 presents our KA-DDPG-based torque distribution method for skid-steering vehicles. The settings of the simulation environment are detailed in Section 4. In Section 5, the performance of the KA-DDPG-based torque distribution strategy and the contributions of the assisted learning methods in KA-DDPG are illustrated. Section 6 concludes this work and discusses possible future work.

2. Preliminaries

2.1. Dynamics Model

The skid-steering vehicle investigated in this study is shown in Figure 1, which is equipped with four independently driven wheels, forming a differential steering system. $O X Y$ is the reference coordinate system, $o x y$ is the fixed coordinate system attached to the vehicle, and o represents the origin of the fixed coordinate system. B is the width of the vehicle. The distances from the center of gravity (CG) to the front and rear wheels are $L_{f}$ and $L_{r}$ , respectively. As shown in Figure 1, the motion of the skid-steering vehicle is caused by the friction forces, which are generated by the wheel–ground interactions. The dynamics model is formulated as follows:

(1) $\begin{matrix} m (v_{x a}^{'} - v_{y a} ω_{ϕ a}) & = \sum_{i = 1}^{i = 4} f_{x_{i}} + d_{x}, \\ m (v_{y a}^{'} + v_{x a} ω_{ϕ a}) & = \sum_{i = 1}^{i = 4} f_{y_{i}} + d_{y}, \\ J ω_{ϕ a}^{'} & = \sum_{i = 1}^{i = 4} τ_{f_{i}} + d_{ϕ}, \end{matrix}$

where

f_{x_{i}}

is the longitudinal force, and

f_{y_{i}}

is the lateral force on the

i th

wheel, m is the mass of the vehicle, J is the inertia of the vehicle around CG,

v_{x a}

is the longitudinal speed,

v_{y a}

is the lateral speed,

ω_{ϕ a}

is the yaw rate,

τ_{f_{i}}

is the torque caused by the traction force

f_{i} = f_{x_{i}} \hat{x} + f_{y_{i}} \hat{y}

(where

\hat{x}

is the unit vector in the longitudinal direction, and

\hat{y}

is the unit vector in the lateral direction), and

d_{x}

d_{y}

, and

d_{ϕ}

are disturbances in the different directions, respectively.

From Equation (1), it can be seen that the vehicle motion is determined by $f_{x_{i}}$ and $f_{y_{i}}$ resulting from the wheel–ground interaction. A large number of semi-empirical complex friction models have been used to describe the wheel–ground interaction, such as the LuGre and Dugoff models. Most such models are very complex and are not necessary for the design of RL-based control strategies. In this work, we choose an appropriate friction model, described in [39], where $f_{x_{i}}$ and $f_{y_{i}}$ are defined as follows:

(2) $\begin{matrix} f_{x_{i}} & = μ_{f} N_{i} S_{f} (s_{i}) \frac{s_{x_{i}}}{s_{i}}, \\ f_{y_{i}} & = μ_{f} N_{i} S_{f} (s_{i}) \frac{s_{y_{i}}}{s_{i}}, \\ s_{i} & = \sqrt{s_{x_{i}}^{2} + s_{y_{i}}^{2}}, \end{matrix}$

where

N_{i}

is the force of the driving wheel in the vertical direction,

μ_{f}

is the friction coefficient,

S_{f} (s_{i}) = \frac{2}{π} \times arctan (90 s_{i})

is the dynamic feature of the friction force,

s_{x_{i}}

is the longitudinal slip, and

s_{y_{i}}

is the lateral slip. The slips in different directions are defined as follows:

(3) $\begin{matrix} s_{x_{l f}} = r ω_{r_{l f}} - v_{x} + \frac{B}{2} ω_{ϕ a}, & s_{y_{l f}} = v_{y} + L_{f} ω_{ϕ a}, \\ s_{x_{l r}} = r ω_{r_{l r}} - v_{x} + \frac{B}{2} ω_{ϕ a}, & s_{y_{l r}} = v_{y} - L_{r} ω_{ϕ a}, \\ s_{x_{r f}} = r ω_{r_{r f}} - v_{x} - \frac{B}{2} ω_{ϕ a}, & s_{y_{r f}} = v_{y} + L_{f} ω_{ϕ a}, \\ s_{x_{r r}} = r ω_{r_{r r}} - v_{x} - \frac{B}{2} ω_{ϕ a}, & s_{y_{r r}} = v_{y} - L_{r} ω_{ϕ a}, \end{matrix}$

where

ω_{r_{i}}

is the wheel rotation speed. As shown in Figure 2, the dynamics model for each driving wheel can be described as:

(4) $j ω_{r_{i}}^{'} = T_{i} - r f_{x_{i}} - T_{z_{i}} + d_{r_{i}}, i = l f, l r, r f, r r,$

where j is the wheel inertia,

T_{i}

is the driving torque,

T_{z_{i}}

is the rolling resistance torque, r is the wheel radius, and

d_{r_{i}}

is the disturbance from the external environment.

Based on the definitions of the wheel dynamics model and wheel slip, the dynamics model of the wheel slip in the longitudinal direction can be described as:

(5) $s_{x_{i}}^{'} = \frac{r}{j} (T_{i} - r f_{x_{i}} - T_{z_{i}} + d_{w_{i}}) - v_{x a}^{'} .$

The lateral slip speed $v_{y a}$ is very small, relative to the longitudinal speed $v_{x a}$ . From an engineering perspective, we simplify the problem by disregarding the lateral motion of the vehicle. Based on the vehicle dynamics model introduced above, the objective of this paper is to propose an RL-based torque distribution strategy for skid-steering vehicles, which can minimize the errors of longitudinal speed and yaw rate, such that the desired value can be tracked well.

2.2. DDPG Algorithm

The DDPG algorithm [40] is a mature RL algorithm that can be used to deal with continuous action control. The idea of experience replays and random sampling are incorporated into the DDPG algorithm, and the process of establishing the target network is, also, considered. The DDPG is an actor–critic algorithm, where the basic framework includes actor networks and critic networks. The actor networks are used for policy search, while the critic networks are used for value function approximation. As shown in Figure 3, the DDPG algorithm includes two actor networks and two critic networks.

The online actor network, u, is used to output an action $a = u (s ∣ θ^{u})$ , according to the current state s. The online critic network, Q, is used to evaluate the action and obtain a Q-value $Q (s, a ∣ θ^{Q})$ . In the above, $θ^{u}$ and $θ^{Q}$ represent the parameters of the online actor network and online critic network, respectively. In addition, to follow the network update process, the DDPG also includes a target actor network $u^{'}$ and a target critic network $Q^{'}$ .

The critic network is updated, by minimizing the loss function between the target and online networks. The loss function is given as follows:

(6) $L (θ^{Q}) = \frac{1}{N} \sum_{t}^{N} {(y_{i} - Q (s_{i}, a_{i} ∣ θ^{Q}))}^{2},$

where N is the number of mini-batches, and

y_{i}

is the target value, defined as

(7) $y_{i} = r_{i} + τ Q^{'} (s_{i + 1}, u^{'} (s_{i + 1} ∣ θ^{u^{'}}) ∣ θ^{Q^{'}}),$

where

τ

is the discount factor, and

θ^{u^{'}}

and

θ^{Q^{'}}

are the parameters of the target actor network and target critic network, respectively. The actor networks are updated according to the following formula:

(8) $\begin{matrix} \nabla_{θ^{u}} u ∣_{s_{t}} \approx \frac{1}{N} \sum_{t}^{N} [\nabla_{a_{t}} Q (s_{t}, a_{t} ∣ θ^{Q}) \nabla_{θ^{u}} u (s_{t} ∣ θ^{u})] . \end{matrix}$

Further, the target networks are updated as follows:

(9) $\begin{matrix} θ^{Q^{'}} \leftarrow λ θ^{Q} + (1 - λ) θ^{Q^{'}}, \\ θ^{u^{'}} \leftarrow λ θ^{u} + (1 - λ) θ^{u^{'}}, \end{matrix}$

where

λ

is a configurable constant coefficient, such that

λ ≪ 1

During the learning process of DDPG, the agent can only select actions based on the randomly generated policy, then optimize the policy through trial and error. This action may lead to sub-optimal policy actions in the overdrive system. Therefore, the agent requires the help of knowledge-assisted learning methods, to guide the learning direction in the learning process.

3. KA-DDPG-Based Direct Torque Distribution Method

3.1. Framework of KA-DDPG

The KA-DDPG algorithm is, mainly, composed of the original DDPG algorithm and knowledge-assisted methods. The considered knowledge-assisted methods include a low-fidelity controller that generates criteria actions and an evaluation method that generates guiding rewards. In this study, the low-fidelity controller is a simplified and inaccurate torque distribution method for skid-steering vehicles, which may be replaced by a torque distribution method based on optimal control theory. The evaluation method is an estimation value function, based on the ideal torque distribution feature, which was constructed through expert knowledge. The framework of KA-DDPG is shown in Figure 4.

After receiving a state s, the agent generates an action through its own policy, named the agent action $a_{a}$ . The low-fidelity controller also generates an action based on the state s, named the criteria action $a_{c}$ . The action that the agent actually performs in the environment is called the execution action $a_{e}$ , which is produced by adjustment of the agent action $a_{a}$ , with respect to the criteria action $a_{c}$ . The reward that the agent gathers from the environment is called the observed reward $R_{o}$ , while the reward gathered from the evaluation method is called the guiding reward $R_{g}$ . The updating reward $R_{u}$ is generated by combining the observation reward $R_{o}$ and the guiding reward $R_{g}$ . Through continuous interaction with the environment, the agent can quickly learn to make reasonable actions in the direction of high reward, with the help of knowledge-assisted learning methods. The help of knowledge-assisted learning methods is, gradually, eliminated over the training period.

3.2. KA-DDPG-Based Driving Torque Distribution

We introduce the KA-DDPG-based torque distribution method in this section, including the state space, action space, and reward function. The criteria action method is elaborated in the action space part, while the guiding reward method is elaborated in the reward function part.

3.2.1. State Space

The state space of a vehicle is valuable information that the agent can attain before the decision-making step, which is used to evaluate the vehicle’s situation. In this study, the control objective of the torque distribution strategy for the skid-steering vehicle is to minimize the error in desired value tracking, including minimizing the longitudinal speed error and the yaw rate error. Thus, the longitudinal speed error $v_{d e l t a}$ and the yaw rate error $ω_{d e l t a}$ are included in the agent’s state space, defined as follows:

(10) $\begin{matrix} v_{d e l t a} & = v_{x d} - v_{x a}, \\ ω_{d e l t a} & = ω_{ϕ d} - ω_{ϕ a}, \end{matrix}$

where

v_{x d}

and

ω_{ϕ d}

are the desired values of the longitudinal speed and yaw rate of vehicle, respectively, which are illustrated in Figure 1. The sign of the value indicates the direction of motion. In addition, the longitudinal acceleration

v_{x a}^{'}

and the angular acceleration

ω_{ϕ a}^{'}

have a significant impact on the convergence performance of the desired value tracking error. Thus, both are included in the state space. In summary, the state space is defined as follows:

(11) $state = \{v_{d e l t a}, ω_{d e l t a}, v_{x a}^{'}, ω_{ϕ a}^{'}\} .$

3.2.2. Action Space

For the drive configuration of the skid-steering vehicle described in this study, the vehicle forms a differential drivetrain, through four independently driven wheels. The wheels work in torque drive mode, and their control output is the driving torque of the wheel. Thus, the action space can be expressed as follows:

(12) $action = \{T_{f l}, T_{r l}, T_{f r}, T_{r r}\} .$

During the learning process, the agent learns from previous experience, by continuously interacting with its environment, and generates reasonable actions through the powerful fitting ability of the neural network. At the beginning of learning, the actions are randomly generated by the agent, which means that actions with low rewards are likely to be generated. Although the control performance of the low-fidelity controller cannot meet the requirements, it still greatly outperforms the control of randomly selected actions. Thus, the low-fidelity controller is used to assist the agent to learn quickly, by generating criteria actions; that is, it acts like a teacher when the agent has just started learning, guiding the direction of learning. The core idea of the criteria action method is to assist the agent in searching for actions at the beginning of learning, reducing randomly selected actions. The framework of the criteria action method is shown in Figure 5.

The execution action $a_{e}$ is generated by adjusting the agent action $a_{a}$ , with the help of the criteria action $a_{c}$ . The criteria action needs to be eliminated after assisting the agent for a period of time, such that the agent can learn to select reasonable actions independently. Thus, the execution action $a_{e}$ is defined as follows:

(13) $a_{e} = (1 - γ^{i}) \cdot a_{a} + γ^{i} \cdot a_{c},$

where

γ

is the discount factor, and i is the training episode. The execution action defined in this way is the same as training the agent with a low-fidelity controller at the beginning, then slowly reducing the assistance over time, until the assistance method has no effect on the agent’s action selection, such that the agent selects the action independently.

The low-fidelity controller helps to reduce randomly selected actions for the agent and provides a direction of learning for action selection. The low-fidelity controller defined in this study is expressed by Equation (14), which may be replaced by a suitable dynamics-based torque distribution method.

(14) $\begin{matrix} T_{l f} = T_{l r} = K_{p} (m \cdot \frac{v_{d e l t a}}{d t} - J \cdot \frac{ω_{d e l t a}}{d t}), \\ T_{r f} = T_{r r} = K_{p} (m \cdot \frac{v_{d e l t a}}{d t} + J \cdot \frac{ω_{d e l t a}}{d t}), \end{matrix}$

where

K_{p}

is the proportional coefficient, and

d t

is the step size. The controller imparts the same driving torque to the wheels on the same side, which does not meet the requirements of actual control, but it is far superior to randomly selected actions.

3.2.3. Reward Function

The reward is the only feedback signal available for the agent’s learning, which is used to evaluate the action taken by the agent. The reward function designed in this paper combines a dense reward with a sparse reward, for resolving the problem of driving torque distribution. Three types of excitations are considered in the dense reward: (1) a reward $r_{1}$ , regarding the desired value tracking errors; (2) a reward $r_{2}$ , regarding the convergence of tracking errors; (3) and a reward $r_{3}$ , which encourages reducing the driving torques. The specific rewards are defined as follows:

(15) $r_{1} = v_{d e l t a}^{2} + 5 \cdot ω_{d e l t a}^{2},$

(16) $r_{2} = {(v_{x a}^{'})}^{2} + {(ω_{ϕ a}^{'},)}^{2},$

(17) $r_{3} = T_{f l}^{2} + T_{r l}^{2} + T_{f r}^{2} + T_{r r}^{2} .$

The dense reward part of the learning process is defined as follows:

(18) $r_{d e n s e} = w_{1} \cdot r_{1} + w_{2} \cdot r_{2} + w_{3} \cdot r_{3} .$

The dense reward is a punishment that helps the agent minimize the tracking error of the desired value. In addition, the sparse reward is used as a positive reward, rewarding the agent for controlling the vehicle in an ideal state. The sparse reward is defined as follows:

(19) $r_{s p a r s e} = \{\begin{matrix} 3000, & if |v_{d e l t a}| < 0.01 \times |v_{x d}| and |ω_{d e l t a}| < 0.01 \times |ω_{ϕ d}| \\ 0, & otherwise . \end{matrix}$

The final observed reward is the sum of the dense reward $r_{d e n s e}$ and the sparse reward $r_{s p a r s e}$ , expressed as follows:

(20) $R_{o} = r_{d e n s e} + r_{s p a r s e} .$

Unlike the original DDPG algorithm, which directly uses the observed reward as the updating reward to update the policy directly, in the KA-DDPG algorithm, we adjust the observed reward, with the help of the evaluation method, to sharpen the updating reward, which can accelerate the agent’s learning process. The guiding reward method is proposed to adjust the updating reward, and its framework is shown in Figure 6.

The core idea behind the guiding reward method is to sharpen the updating reward that the agent uses to update the policy. The guiding reward, like the criteria action, needs to eliminate the effect after a period of training. Thus, by combining the guiding reward $R_{g}$ and the observed reward $R_{o}$ , the updating reward $R_{u}$ is defined as follows:

(21) $R_{u} = (1 - γ^{i}) \cdot R_{o} + γ^{i} \cdot R_{g} .$

The distance between the agent action and criteria action is used as the guiding reward, which is a simple but effective evaluation method. The agent action and criteria action are both four-dimensional vectors. The guiding reward is defined as:

(22) $R_{g} = - \sqrt{{(a_{a} - a_{c})}^{T} (a_{a} - a_{c})} .$

The KA-DDPG algorithm is summarized in Algorithm 1.

Algorithm 1. KA-DDPG algorithm.

1: Initialize critic network

Q (s, a ∣ θ^{Q})

and actor network

u (s ∣ θ^{u})

; 2: Initialize target network

Q^{'}

and

u^{'}

with same weights; 3: Initialize replay buffer

R B

; 4: for

e p i s o d e = 1, \dots, M

do 5: Receive initial observation state

s_{0}

; 6: for

t = 1, \dots, T

do 7: Select agent action

a_{a, t} = u (s_{t} ∣ θ^{u}) + N_{t}

; 8: Get criteria action

a_{c, t}

; 9: Calculate execution action:

a_{e, t} = (1 - γ^{e p i s o d e}) \cdot a_{a, t} + γ^{e p i s o d e} \cdot a_{c, t}

; 10: Get observed reward

R_{o, t}

; 11: Get guiding reward

R_{g, t}

; 12: Calculate updating reward:

R_{u, t} = (1 - γ^{e p i s o d e}) \cdot R_{o, t} + γ^{e p i s o d e} \cdot R_{g, t}

; 13: Store transition

⟨s_{t}, a_{e, t}, R_{u, t}, s_{t + 1}⟩

R B

; 14: Sample a random mini-batch of N transitions from

R B

; 15: Set

y_{i} = r_{i} + γ Q^{'} (s_{i + 1}, u^{'} (s_{i + 1} θ^{u^{'}}) ∣ θ^{Q^{'}})

; 16: Update critic by minimizing the loss:

L (θ^{Q}) = \frac{1}{N} \sum_{t}^{N} {(y_{i} - Q (s_{i}, a_{i} ∣ θ^{Q}))}^{2}

; 17: Update the actor policy, using the sampled policy gradient:

\nabla_{θ^{u}} u ∣_{s_{t}} \approx \frac{1}{N} \sum_{t}^{N} [\nabla_{a_{t}} Q (s_{t}, a_{t} ∣ θ^{Q}) \nabla_{θ^{u}} u (s_{t} ∣ θ^{u})]

; 18: Update the target networks

Q^{'}

and

u^{'}

4. Simulation Environment Settings

In the simulations, we considered the dynamics model of skid-steering vehicles with four independent driving wheels, which includes vehicle body dynamics, wheel dynamics, and the wheel–ground interaction model. It is important to note that we consider the vehicle as running on flat ground with friction coefficient $μ_{f} = 0.85$ . Table 1 provides the detailed vehicle dynamics settings used in the simulations.

The KA-DDPG algorithm was implemented on the PyCharm IDE with Python 3.7, and ran on an Intel Core i5 computer. Based on the definition of the agent, the actor network and its target network were constructed, using two $4 \times 512 \times 512 \times 4$ fully connected neural networks, and the critic network and its target network were constructed using two $8 \times 512 \times 512 \times 1$ fully connected neural networks. The structures of the neural networks are shown in Figure 7, and the detailed parameters of the KA-DDPG algorithm are listed in Table 2.

Based on the simulation environment design described above, training was carried out for a total of 5000 episodes, and the designed agent learned the torque distribution strategy. During training, the vehicle state was randomly initialized in each episode. Due to the two newly introduced knowledge-assisted learning methods, the training of the KA-DDPG converged quickly and stably. After completing the learning, only the parameters of the actor network were retained, which then received the current state of the vehicle in real-time and generated the optimal distributing action, in order to realize control of the vehicle.

5. Results and Discussion

Simulations were first conducted to demonstrate the control performance of the KA-DDPG-based torque distribution method. The vehicle behaviors are discussed under different scenarios, including a straight scenario and a cornering scenario. Furthermore, we verified the contributions of the knowledge-assisted learning methods in the learning process of the KA-DDPG algorithm, through three different cases.

5.1. Effectiveness of KA-DDPG

In order to verify the control performance of the KA-DDPG-based torque distribution method for skid-steering vehicles, the simulations were designed with two different scenarios: a straight scenario and a corning scenario. The low-fidelity controller defined in Section 3.2.2 was considered as the baseline for comparative experiments, which is a controller based on physical knowledge. We introduced an evaluation method that is, commonly, used to quantitatively evaluate longitudinal speed and yaw rate tracking performance. This evaluation method uses the integral of the quadratic function of the tracking error of the longitudinal speed and yaw rate, to evaluate the tracking performance. The integrals $J_{1}$ and $J_{2}$ are, respectively, expressed as follows [41]:

(23) $\begin{matrix} J_{1} = \int_{0}^{t} v_{d e l t a}^{2} d τ, \\ J_{2} = \int_{0}^{t} ω_{d e l t a}^{2} d τ . \end{matrix}$

In the straight scenario, the desired yaw rate was always kept at 0 rad/s, in order to identify the drift of the skid-steering vehicle. The desired longitudinal speed changed during the simulation, which was set to $3.8$ m/s at the beginning of the simulation, to $1.8$ m/s and $3.2$ m/s at 200 s and 400 s, respectively, and, then, kept at $3.2$ m/s until the end. The simulation results for the straight scenario are shown in Figure 8.

Figure 8a shows a comparison of longitudinal speed tracking, for the different torque distribution methods in the straight scenario. To more clearly demonstrate the tracking performance for longitudinal speed, Figure 8b shows the corresponding tracking error. The tracking error of the baseline method fluctuated greatly, and the longitudinal speed tracking performance was significantly worse than that of KA-DDPG. The tracking error of the baseline frequently exceeded $0.2$ m/s, while the maximum error of KA-DDPG was always stably kept within $0.2$ m/s. Figure 8c shows a comparison of the yaw rate tracking performance in the straight scenario. It can be emphasized that the target yaw rate is a constant value in the straight scenario, which is always 0 rad/s. The tracking results show that KA-DDPG kept the tracking error of the yaw rate within $0.2$ rad/s, while the maximum tracking error of the baseline exceeded $0.2$ rad/s. As shown in Figure 8d, the heading of the baseline was larger than that of KA-DDPG, indicating that the vehicle controlled by the baseline suffered from greater drift. These simulation results demonstrate that the tracking performance of the KA-DDPG algorithm, for the longitudinal speed and yaw rate in the straight scenario, was better than that of the baseline.

Figure 9 shows the torque distribution curves for each wheel in the straight scenario, and the torque distributions of the two different methods had similar trends. However, the driving torques of the KA-DDPG algorithm reflect the vehicle’s situation more quickly than the baseline, which explains why the KA-DDPG algorithm presented better tracking performance than the baseline.

To verify the cornering ability of the KA-DDPG-based torque distribution method, the simulations were conducted in a cornering scenario. Compared to the simulations in the straight scenario, the tracking performance of the proposed torque distribution method was more rigorously verified in the cornering scenario. The longitudinal speed of the vehicle was set to $1.8$ m/s at the beginning of the simulation, $3.7$ m/s at 150 s, $2.5$ m/s at 350 s, and, then, held at $2.5$ m/s until the end. The yaw rate was no longer a constant value, which was set to $3.2$ rad/s at the beginning of the simulation, $2.5$ rad/s at 150 s, $3.7$ rad/s at 350 s, and, then, held at $3.7$ rad/s until the end of the simulation. The tracking performances of the different methods in the cornering scenario are shown in Figure 10. Figure 10a shows the longitudinal speed tracking results in the cornering scenario, and it can be seen, intuitively, that the longitudinal speed tracking performance did not deteriorate significantly, compared with the straight scenario. Figure 10b shows the tracking errors of the longitudinal speed in the cornering scenario. The maximum error of the baseline exceeded $0.2$ m/s, while the maximum error of the KA-DDPG method was kept within $0.2$ m/s. The yaw rate tracking performance of different methods is shown in Figure 10c, while Figure 10d shows the tracking errors of the yaw rate in the cornering scenario. The KA-DDPG method kept the yaw rate error within $0.2$ rad/s, which was significantly smaller than that of the baseline. In the cornering scenario, the tracking performance of the KA-DDPG method was still better than that of the baseline, which is the same as the conclusion in the straight scenario, indicating that the tracking performance did not deteriorate in a more complex scenario.

Figure 11 shows the torque distribution curves, for each wheel in the cornering scenario. The torque distributions of the different methods in the cornering scenario showed the same trend as in the straight scenario, and the driving torques of the KA-DDPG method changed more quickly than those of the baseline method.

To quantitatively illustrate the tracking performance more specifically, Equation (23) was introduced, to evaluate the tracking performance of the longitudinal speed and yaw rate. The quantitative evaluation results for the simulations in the straight scenario and the cornering scenario discussed above are displayed in Figure 12.

Figure 12a shows the quantitative evaluation in the straight scenario. In 500 s of straight driving, $J_{1}$ in the KA-DDPG method was reduced by $71.36 %$ , compared to the baseline. Similarly, $J_{2}$ of the KA-DDPG method was reduced by $87.33 %$ , compared to the baseline. The quantitative evaluation demonstrated that the KA-DDPG method has better tracking performance for the longitudinal speed and yaw rate than the baseline in the straight scenario, consistent with the results of the above analysis. The quantitative evaluation in the cornering scenario is shown in Figure 12b. Compared with the evaluation results of the baseline, in the simulation under the cornering scenario, $J_{1}$ of the KA-DDPG method was reduced by $67.87 %$ , and $J_{2}$ was reduced by $80.99 %$ , when compared to the baseline. These quantitative evaluations in different scenarios demonstrate that the KA-DDPG method has better desired value tracking performance than the baseline.

Based on the analysis of the simulation results, the tracking performance of skid-steering vehicles, based on the KA-DDPG method, was investigated. Compared with the baseline, the KA-DDPG-based torque distribution method showed better tracking performance in different scenarios. Although the baseline is a low-fidelity controller, it was sufficient to illustrate that the KA-DDPG method can be successfully applied to the torque distribution problem of skid-steering vehicles.

5.2. Contributions of Knowledge-Assisted Learning Methods

To verify the contributions of the knowledge-assisted learning methods in the learning process of KA-DDPG, we trained the KA-DDPG algorithm, with three cases. The configurations of these cases were as follows: (1) the KA-DDPG, including the criteria action and guiding reward methods (i.e., the algorithm proposed in this work); (2) the KA-DDPG, only including the guiding reward method; and (3) the KA-DDPG, only including the criteria action method. As the skid-steering vehicle studied in this paper is an overdrive system, applying the original DDPG may cause the agent to search in a wrong direction, leading to a sub-optimal solution. Therefore, we do not consider the torque distribution method, based on the original DDPG, in this section. In each case, the KA-DDPG was trained for 5000 episodes. During training, the vehicle state was randomly initialized in each episode. To verify the stability of the convergence, the training in each case was repeated five times.

Figure 13 shows the total rewards during the learning process of the first case, which is the case of KA-DDPG with both of the assisted learning methods. As shown in Figure 13a, the rewards converged quickly and smoothly in each learning process. Such learning processes indicate that the agent can still stably learn a reasonable torque distribution strategy, to ensure the control performance. With the assistance of knowledge-assisted learning methods, the agent no longer randomly generates actions with low rewards during the learning process, and has the ability to rapidly increase the cumulative rewards, which not only accelerated the convergence of the learning process, but also ensured that the convergence was smooth. As shown in Figure 13b, on average, the KA-DDPG-based torque distribution method outperformed the baseline in cumulative rewards after training for 200 episodes, which means that KA-DDPG not only learns control policies from the low-fidelity controller, but also explores and learns policies for better control performance. At about 300 episodes, each learning process converged smoothly.

Then, we trained the KA-DDPG, including only the guiding reward method and without the criteria action method, in order to illustrate the contribution of the criteria action method. Figure 14 shows the learning process of the KA-DDPG, without the assistance of the criteria action method. Without the assistance of criteria actions, execution actions were searched randomly at the beginning of the learning process, making the rewards more volatile. As the number of episodes increased, the volatility of the total rewards decreased, and the learning process converged stably. Figure 14a shows the total rewards in each learning process, and Figure 14b shows the average of the total rewards. Without the assistance of the criteria action method, the KA-DDPG needed about 800 episodes to perform better than the baseline, while the first case (including both assisted learning methods) only took about 200 episodes to achieve the same result. Similarly, the convergence of KA-DDPG, without the criteria action method, was, also, slower than the first case, taking about 1200 episodes to converge. Such results illustrate that the learning cost of KA-DDPG is greatly increased without the assistance of the criteria action method. The above simulation results demonstrate that the criteria action method reduces the learning cost, by helping the agent to reduce the randomly selected actions in the learning process.

Finally, we trained the KA-DDPG, including only the criteria action method, in order to illustrate the contribution of the guiding reward method. Figure 15 shows the learning process of the KA-DDPG algorithm, without sharpening the reward through use of the guiding reward method. Compared to the first case, KA-DDPG, without the guiding reward method, had a larger reward at the beginning of the learning process, but it grew more slowly and showed more volatility. The simulation results demonstrated that the guiding reward method in KA-DDPG led the learning process to converge faster, by sharpening the reward function. Figure 15a shows the total rewards of KA-DDPG, without the guiding reward method in each learning process, while Figure 15b shows the average of the total rewards. Without the assistance of the guiding reward method, KA-DDPG needed about 500 episodes to outperform the baseline, while only about 200 episodes were needed to achieve the same control performance, when using both assisted learning methods. Similarly, KA-DDPG, without the guiding reward method, needed about 800 episodes to converge, which was also slower. These simulation results demonstrate that the guiding reward method, also, has the ability to reduce the learning cost. However, unlike the criteria action method, which reduces the learning cost by reducing randomly selected actions, the guiding reward method achieves this purpose by sharpening the reward function.

To summarize, the simulation results presented above demonstrate that both of the assisted learning methods proposed in this paper can reduce the learning cost of the KA-DDPG algorithm. The guiding reward method accelerates the learning process, by sharpening the updating reward with the assistance of the evaluation method, whereas the criteria action method achieves this through providing the learning direction by criteria actions, thus reducing the randomly searched actions.

5.3. Discussion

To verify the KA-DDPG based torque distribution strategy, we conducted the desired value tracking performance evaluation in two different scenarios, by comparing with the controller used for assisted learning. The contribution of the assisted learning methods was, also, verified by comparing the learning process of different cases of the KA-DDPG algorithm.

For the desired value tracking performance evaluation, quantitative evaluations in different scenarios show that the KA-DDPG method had smaller desired value tracking errors than the baseline. In other words, the KA-DDPG method had a better desired value tracking performance. The great improvements in the tracking performance come from the fact that the KA-DDPG algorithm not only learned from the knowledge-assisted learning methods but also explored the better distribution strategy, through the exploration ability of RL.

For the contribution of knowledge-assisted learning methods evaluation, we trained the KA-DDPG algorithm, with different configurations on three cases, namely the KA-DDPG algorithm with two assisted learning methods and the KA-DDPG algorithm with only one of the assisted learning methods. The result is the KA-DDPG with two assisted learning methods converged with less learning time than the other cases. The great improvements in the learning process come from the knowledge-assisted learning methods. The guiding reward method accelerates the learning process, by sharpening the updating reward with the assistance of the evaluation method, whereas the criteria action method achieves this by providing the learning direction through criteria actions, reducing the randomly searched actions.

From all the simulation results, we conclude that the KA-DDPG algorithm, which combines the knowledge-assisted learning methods with the DDPG algorithm, can be successfully applied to the torque distribution of skid steering vehicles. This work lays a foundation for the use of RL technologies to directly distribute the torque of skid-steering vehicles, in future practical applications. The knowledge-assisted RL framework proposed in this work provides a powerful weapon for applying RL technologies in overdrive systems, such as skid-steering vehicles. However, the verification part of this paper was carried out in a simulation environment, so challenges do still exist to transfer the proposed method to a real application. For future research, we will consider reducing the reality gap in a real application.

6. Conclusions

In this study, a KA-DDPG-based torque distribution strategy for skid-steering vehicles was proposed, in order to minimize the tracking error of the desired value, which included the longitudinal speed and yaw rate, making the considered problem a dual-channel control problem. The KA-DDPG algorithm combines the DDPG algorithm with knowledge-assisted learning methods, constructing a knowledge-assisted learning framework that combines analytical methods with an RL algorithm. Two knowledge-assisted methods were introduced into the KA-DDPG algorithm: a criteria action method and a guiding reward method.

In order to validate the proposed strategy, simulations were first conducted in different scenarios, including a straight scenario and a cornering scenario. The tracking performance simulation results demonstrated that the proposed torque distribution strategy had an excellent control performance in different scenarios. In addition, simulations were conducted to verify the contributions of the assisted learning methods in the learning process of KA-DDPG. The simulation results illustrated that both of the proposed assisted learning methods helped to accelerate the agent’s learning process. The criteria action method provides the learning direction for the agent and speeds up the learning speed, by reducing the random search actions, while the guiding reward method achieves the same result by sharpening the reward function.

This work opens an exciting path for the use of RL algorithms in the problem of torque distribution in skid-steering vehicles. However, there are still some areas that require more in-depth study. The verification section of this research was carried out in a simulation environment, without any experiments being in a real environment. Wheel slip limits, which are important for skid-steering vehicle control, were not considered in this work. The lateral movement of the vehicle, which is unavoidable in actual applications, was, also, not considered in this work. At present, we are exploring these unsolved problems to extend this work.

Author Contributions

Conceptualization, P.C. and H.Y.; methodology, H.D.; software, H.D.; validation, H.D. and P.C.; formal analysis, P.C.; investigation, P.C.; resources, H.D.; data curation, H.D.; writing–original draft preparation, H.D.; writing–review and editing, P.C.; visualization, P.C.; supervision, H.Y.; project administration, H.Y. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. Skid-steering vehicle diagram.

Figure 2. Wheel rotational motion.

Figure 3. Framework of the DDPG algorithm.

Figure 4. Framework of KA-DDPG.

Figure 5. Knowledge-assisted learning method—criteria action method.

Figure 6. Knowledge-assisted learning method—guiding reward method.

Figure 7. Structure of the neural networks: (a) actor neural network; and (b) critic neural network.

View Image - Figure 8. Tracking performance in the straight scenario: (a) longitudinal speed tracking; (b) longitudinal speed error; (c) yaw rate tracking; and (d) heading.

Figure 8. Tracking performance in the straight scenario: (a) longitudinal speed tracking; (b) longitudinal speed error; (c) yaw rate tracking; and (d) heading.

Figure 9. Torque distributions in the straight scenario: (a) baseline and (b) KA-DDPG.

View Image - Figure 10. Tracking performance in the cornering scenario: (a) longitudinal speed tracking; (b) longitudinal speed error; (c) yaw rate tracking; and (d) yaw rate error.

Figure 10. Tracking performance in the cornering scenario: (a) longitudinal speed tracking; (b) longitudinal speed error; (c) yaw rate tracking; and (d) yaw rate error.

Figure 11. Torque distributions in the cornering scenario: (a) baseline and (b) KA-DDPG.

Figure 12. Quantitative evaluation results: (a) evaluation in the straight scenario and (b) evaluation in the cornering scenario.

View Image - Figure 13. The learning process of KA-DDPG with criteria action and guiding reward methods: (a) total rewards for each learning process and (b) the average value of the total rewards.

Figure 13. The learning process of KA-DDPG with criteria action and guiding reward methods: (a) total rewards for each learning process and (b) the average value of the total rewards.

View Image - Figure 14. The learning process of KA-DDPG without the criteria action method: (a) total rewards for each learning process and (b) the average value of the total rewards.

Figure 14. The learning process of KA-DDPG without the criteria action method: (a) total rewards for each learning process and (b) the average value of the total rewards.

View Image - Figure 15. The learning process of KA-DDPG without the guiding reward method: (a) total rewards for each learning process and (b) the average value of the total rewards.

Figure 15. The learning process of KA-DDPG without the guiding reward method: (a) total rewards for each learning process and (b) the average value of the total rewards.

Table 1

Vehicle dynamics parameters.

Description	Symbol	Value
Vehicle mass	m	$2360 kg$
Vehicle width	B	$2.4 m$
Distance from the front axle to CG	$L_{f}$	$2.5 m$
Distance from the front axle to CG	$L_{r}$	$2.5 m$
Wheel radius	r	$0.2 m$
Vehicle inertia around CG	J	$4050 kg \cdot m^{2}$
Friction coefficient	$μ_{f}$	$0.85$
Rotation moment of wheel	j	$0.85 kg \cdot m^{2}$
Rolling resistance torque	$T_{Z_{i}}$	$11.8 N \cdot m$
Max torque of wheel	$T_{m a x}$	$100 N \cdot m$

Table 2

Hyperparameters of the KA-DDPG algorithm.

Parameters	Value
Random seed	2
Max episode	5000
Max steps per episode	100
Step size	$0.1$
Memory capacity	1,000,000
Batch size	512
Actor network learning rate	$0.0003$
Critic network learning rate	$0.001$
Exploration noise scale	$0.1$
Soft update rate	$0.01$
Discount factor	$0.999$
Proportional coefficient	$0.05$

References

1. Fernandez, B.; Herrera, P.J.; Cerrada, J.A. A simplified optimal path following controller for an agricultural skid-steering robot. IEEE Access; 2019; 7, pp. 95932-95940. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2929022]

2. Ouda, A.N.; Amr, M. Autonomous Fuzzy Heading Control for a Multi-Wheeled Combat Vehicle. Int. J. Robot. Control; 2021; 1, pp. 90-101. [DOI: https://dx.doi.org/10.31763/ijrcs.v1i1.286]

3. Zhang, H.; Yang, X.; Liang, j.; Xu, X.; Sun, X. GPS Path Tracking Control of Military Unmanned Vehicle Based on Preview Variable Universe Fuzzy Sliding Mode Control. Machines; 2021; 12, 304. [DOI: https://dx.doi.org/10.3390/machines9120304]

4. Krecht, R.; Hajdu, C.; Ballagi, A. Possible Control Methods for Skid-Steer Mobile Robot Platforms. Proceedings of the 2020 2nd IEEE International Conference on Gridding and Polytope Based Modelling and Control; Gyor, Hungary, 19 November 2020; pp. 31-34. [DOI: https://dx.doi.org/10.1109/GPMC50267.2020.9333814]

5. Liao, J.F.; Chen, Z.; Yao, B. Adaptive robust control of skid steer mobile robot with independent driving torque allocation. Proceedings of the 2017 IEEE International Conference on Advanced Intelligent Mechatronics; Munich, Germany, 3–7 July 2017; pp. 340-345. [DOI: https://dx.doi.org/10.1109/AIM.2017.8014040]

6. Kruse, O.; Mukhamejanova, A.; Mercorelli, P. Super-Twistiing Sliding Mode Control for Differential Steering Systems in Vehicular Yaw Tracking Motion. Electronics; 2022; 11, 1330. [DOI: https://dx.doi.org/10.3390/electronics11091330]

7. Mercorelli, P. Fuzzy Based Control of a Nonholonomic Car-like Robot for Drive Assistant Systems. Proceedings of the 19th International Carpathian Control Conference; Szilvasvarad, Hungary, 28–31 May 2018; pp. 434-439. [DOI: https://dx.doi.org/10.1109/CarpathianCC.2018.8399669]

8. Mercorelli, P. Using Fuzzy PD Controllers for Soft Motions in a Car-like Robot. Adv. Sci. Technol. Eng. Syst. J.; 2018; 3, pp. 380-390. [DOI: https://dx.doi.org/10.25046/aj030646]

9. Khan, R.; Malik, F.M.; Raza, A.; Mazhar, N. Comprehensive study of skid-steer wheeled mobile robots: Development and challenges. Ind. Robot; 2021; 48, pp. 142-156. [DOI: https://dx.doi.org/10.1108/IR-04-2020-0082]

10. Dogru, S.; Marques, L. An improved kinematic model for skid-steered wheeled platforms. Auton. Robots; 2021; 45, pp. 229-243. [DOI: https://dx.doi.org/10.1007/s10514-020-09959-0]

11. Huskic, G.; Buck, S.; Herrb, M.; Lacroix, S.; Zell, A. High-speed path following control of skid-steered vehicles. Int. J. Robot. Res.; 2019; 38, pp. 1124-1148. [DOI: https://dx.doi.org/10.1177/0278364919859634]

12. Zhao, Z.Y.; Liu, H.O.; Chen, H.Y.; Hu, J.M.; Guo, H.M. Kinematics-aware model predictive control for autonomous high-speed tracked vehicles under the off-road conditions. Mech. Syst. Signal Process.; 2019; 123, pp. 333-350. [DOI: https://dx.doi.org/10.1016/j.ymssp.2019.01.005]

13. Du, P.; Ma, Z.M.; Chen, H.; Xu, D.; Wang, Y.; Jiang, Y.H.; Lian, X.M. Speed-adaptive motion control algorithm for differential steering vehicle. Proc. Inst. Mech. Eng. Part D J. Autom. Eng.; 2020; 235, pp. 672-685. [DOI: https://dx.doi.org/10.1177/0954407020950588]

14. Zhang, Y.S.; Li, X.Y.; Zhou, J.J.; Li, S.C.; Du, M. Hierarchical Control Strategy Design for a 6-WD Unmanned Skid-steering Vehicle. Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation; Changchun, China, 5–8 August 2018; pp. 2036-2041. [DOI: https://dx.doi.org/10.1109/ICMA.2018.8484545]

15. Rabiee, S.; Biswas, J. A Friction-Based Kinematic Model for Skid-Steer Wheeled Mobile Robots. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA); Montreal, QC, Canada, 20–24 May 2019; pp. 8563-8569. [DOI: https://dx.doi.org/10.1109/ICRA.2019.8794216]

16. Novellis, L.D.; Sorniotti, A.; Gruber, P. Wheel Torque Distribution Criteria for Electric Vehicles with Torque-Vectoring Differentials. Veh. Technol.; 2014; 63, pp. 1593-1602. [DOI: https://dx.doi.org/10.1109/TVT.2013.2289371]

17. Zhao, H.Y.; Gao, B.Z.; Ren, B.T.; Chen, H.; Deng, W.W. Model predictive control allocation for stability improvement of four-wheel drive electric vehicles in critical driving condition. Control Theory Appl.; 2015; 9, pp. 2268-2696. [DOI: https://dx.doi.org/10.1049/iet-cta.2015.0437]

18. Khosravani, S.; Jalali, M.; Khajepour, A.; Kasaiezadeh, A.; Chen, S.K.; Litkouhi, B. Application of Lexicographic Optimization Method to Integrated Vehicle Control Systems. IEEE Trans. Ind. Electron.; 2018; 65, pp. 9677-9686. [DOI: https://dx.doi.org/10.1109/TIE.2018.2821625]

19. Zhang, H.; Liang, H.; Tao, X.; Ding, Y.; Yu, B.; Bai, R. Driving force distribution and control for maneuverability and stability of a 6WD skid-steering EUGV with independent drive motors. Appl. Sci.; 2021; 11, 961. [DOI: https://dx.doi.org/10.3390/app11030961]

20. Prasad, R.; Ma, Y. Hierarchical Control Coordination Strategy of Six Wheeled Independent Drive (6WID) Skid Steering Vehicle. IFAC-PapersOnLine; 2019; 52, pp. 60-65. [DOI: https://dx.doi.org/10.1016/j.ifacol.2019.09.010]

21. Xiong, H.; Ma, T.; Zhang, L.; Diao, X. Comparison of end-to-end and hybrid deep reinforcement learning strategies for controlling cable-driven parallel robots. Neurocomputing; 2020; 377, pp. 73-84. [DOI: https://dx.doi.org/10.1016/j.neucom.2019.10.020]

22. Xu, D.; Hui, Z.; Liu, Y.; Chen, G. Morphing control of a new bionic morphing UAV with deep reinforcement learning. Aerosp. Sci. Technol.; 2019; 92, pp. 232-243. [DOI: https://dx.doi.org/10.1016/j.ast.2019.05.058]

23. Gong, L.; Wang, Q.; Hu, C.; Liu, C. Switching control of morphing aircraft based on Q-learning. Chin. J. Aeronaut.; 2020; 33, pp. 672-687. [DOI: https://dx.doi.org/10.1016/j.cja.2019.10.005]

24. Huang, Q.; Huang, R.; Hao, W.; Tan, J.; Fan, R.; Huang, Z. Adaptive power system emergency control using deep reinforcement learning. IEEE Trans. Smart Grid; 2020; 11, pp. 1171-1182. [DOI: https://dx.doi.org/10.1109/TSG.2019.2933191]

25. Ning, Z.; Kwok, R.; Zhang, K. Joint computing and caching in 5G-envisioned Internet of vehicles: A deep reinforcement learning-based traffic control system. IEEE Trans. Intell. Transp. Syst.; 2021; 22, pp. 5201-5212. [DOI: https://dx.doi.org/10.1109/TITS.2020.2970276]

26. Mnih, V.; Kavukcuoglu, K.; Silver, D. Human-level control through deep reinforcement learning. Natures; 2015; 518, pp. 529-533. [DOI: https://dx.doi.org/10.1038/nature14236] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25719670]

27. Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. Proceedings of the 30th AAAI Conference on Artificial Intelligence; Phoenix, AZ, USA, 12–17 February 2016; pp. 2094-2100. [DOI: https://dx.doi.org/10.48550/arXiv.1509.06461]

28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglo, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv; 2013; [DOI: https://dx.doi.org/10.48550/arXiv.1312.5602] arXiv: 1312.5602

29. Yu, L.; Shao, X.; Wei, Y.; Zhou, K. Intelligent Land-Vehicle Model Transfer Trajectory Planning Method Based on Deep Reinforcement Learning. Sensors; 2018; 18, 2905. [DOI: https://dx.doi.org/10.3390/s18092905] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30200499]

30. Hu, H.; Lu, Z.; Wang, Q.; Zheng, C. End-to-End Automated Lane-Change Maneuvering Considering Driving Style Using a Deep Deterministic Policy Gradient Algorithm. Sensors; 2020; 20, 5443. [DOI: https://dx.doi.org/10.3390/s20185443]

31. Jin, L.; Tian, D.; Zhang, Q.; Wang, J. Optimal Torque Distribution Control of Multi-Axle Electric Vehicles with In-wheel Motors Based on DDPG Algorithm. Energies; 2020; 13, 1331. [DOI: https://dx.doi.org/10.3390/en13061331]

32. Sun, M.; Zhao, W.Q.; Song, G.G.; Nie, Z.; Han, X.J.; Liu, Y. DDPG-Based Decision-Making Strategy of Adaptive Cruising for Heavy Vehicles Considering Stability. IEEE Access; 2020; 8, pp. 59225-59246. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2982702]

33. Wei, H.Q.; Zhang, N.; Liang, J.; Ai, Q.; Zhao, W.Q.; Huang, T.Y.; Zhang, Y.T. Deep reinforcement learning based direct torque control strategy for distributed drive electric vehicles considering active safety and energy saving performance. Energy; 2022; 238, 121725. [DOI: https://dx.doi.org/10.1016/j.energy.2021.121725]

34. Wan, K.; Gao, X.; Hu, Z.; Wu, G. Robust Motion Control for UAV in Dynamic Uncertain Environments Using Deep Reinforcement Learning. Remote Sens.; 2020; 12, 640. [DOI: https://dx.doi.org/10.3390/rs12040640]

35. Zhang, Z.B.; Li, X.H.; An, J.P.; Man, W.X.; Zhang, G.H. Model-free attitude control of spacecraft based on PID-guide TD3 algorithm. Int. J. Aerosp. Eng.; 2020; 2020, 8874619. [DOI: https://dx.doi.org/10.1155/2020/8874619]

36. Lian, R.Z.; Peng, J.K.; Wu, Y.K.; Tan, H.C.; Zhang, H.L. Rule-interposing deep reinforcement learning based on energy management strategy for power-split hybrid electric vehicle. Energy; 2020; 197, 117297. [DOI: https://dx.doi.org/10.1016/j.energy.2020.117297]

37. Zhao, H.; Zhao, J.H.; Qiu, J.; Liang, G.Q.; Dong, Z.Y. Cooperative wind farm control with deep reinforcement learning and knowledge assisted learning. IEEE Trans. Ind. Inform.; 2020; 16, pp. 6912-6921. [DOI: https://dx.doi.org/10.1109/TII.2020.2974037]

38. Leng, B.; Xiong, L.; Yu, Z.P.; Zou, T. Allocation control algorithms design and comparion based on distributed drive electric vehicles. Int. J. Autom. Technol.; 2018; 19, pp. 55-62. [DOI: https://dx.doi.org/10.1007/s12239-018-0006-3]

39. Liao, J.F.; Chen, Z.; Yao, B. Model-Based Coordinated Control of Four-Wheel Independently Driven Skid Steer Mobile Robot with Wheel-Ground Interaction and Wheel Dynamics. IEEE Trans. Ind. Inform.; 2019; 15, pp. 1742-1752. [DOI: https://dx.doi.org/10.1109/TII.2018.2869573]

40. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv; 2015; [DOI: https://dx.doi.org/10.48550/arXiv.1509.02971] arXiv: 1509.02971

41. Zhang, H.; Zhao, W.; Wang, J. Fault-Tolerant Control for Electric Vehicles with Independently Driven in-Wheel Motors Considering Individual Driver Steering Characteristics. IEEE Trans. Veh. Technol.; 2019; 68, pp. 4527-4536. [DOI: https://dx.doi.org/10.1109/TVT.2019.2904698]

Word count: 7859

Show less

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Due to the advantages of their drive configuration form, skid-steering vehicles with independent wheel drive systems are widely used in various special applications. However, obtaining a reasonable distribution of the driving torques for the coordinated control of independent driving wheels is a challenging problem. In this paper, we propose a torque distribution strategy based on the Knowledge-Assisted Deep Deterministic Policy Gradient (KA-DDPG) algorithm, in order to minimize the desired value tracking error as well as achieve the longitudinal speed and yaw rate tracking control of skid-steering vehicles. The KA-DDPG algorithm combines knowledge-assisted learning methods with the DDPG algorithm, within the framework of knowledge-assisted reinforcement learning. To accelerate the learning process of KA-DDPG, two assisted learning methods are proposed: a criteria action method and a guiding reward method. The simulation results obtained, considering different scenarios, demonstrate that the KA-DDPG-based torque distribution strategy allows a skid-steering vehicle to achieve high performance, in tracking the desired value. In addition, further simulation results, also, demonstrate the contributions of knowledge-assisted learning methods to the training process of KA-DDPG: the criteria action method speeds up the learning speed by reducing the agent’s random action selection, while the guiding reward method achieves the same result by sharpening the reward function.

Details

Title

Driving Torque Distribution Strategy of Skid-Steering Vehicles with Knowledge-Assisted Reinforcement Learning

Author

Dai, Huatong¹; Chen, Pengzhan²; Yang, Hui¹

¹ School of Electrical Engineering and Automation, East China Jiaotong University, Nanchang 330013, China; [email protected]
² School of Intelligent Manufacture, Taizhou University, Taizhou 318000, China; [email protected]

First page

5171

Publication year

2022

Publication date

2022

Publisher

MDPI AG

e-ISSN

20763417

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/app12105171

ProQuest document ID

2670081897

Driving Torque Distribution Strategy of Skid-Steering Vehicles with Knowledge-Assisted Reinforcement Learning

Jump to:

Full Text

Abstract

Details

Suggested sources