Intelligent Navigation of Indoor Robot Based on

Full text

Turn on search term navigation

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

With the rapid development of intelligent manufacturing and e-commerce, traditional manual sorting and transportation efficiency are no longer able to meet the requirements of contemporary logistics and warehousing industries [1]. Indoor robots can replace the front-line workers in the assembly workshop, thus realizing the automation and intelligent delivery. Their independent transportation is more efficient, and their automated transportation system is also safer and more reliable. As application scenarios of robots become increasingly complicated with the fast growth of modern industry [2], studying the autonomous intelligent navigation decision-making algorithm is of great significance.

In recent years, the path planning problem has gradually become a research hotspot. Traditional path planning algorithms include Dijkstra algorithm [3], $A^{*}$ algorithm [4], $D^{*}$ algorithm [5], artificial potential field method [6], and fast expanding random tree method [7].

However, the traditional path planning algorithm cannot fully understand increasingly complex and unknown external environment information, especially in path planning. It is difficult to model because of the complexity of the environment, and the previous algorithm is prone to a state of convergence instability. At the same time, it faces the problem of insufficient data processing capacity in large-scale space [8]. Reinforcement learning is a new intelligent learning algorithm. Because of its unique learning mechanism and prior knowledge independent of the environment, reinforcement learning provides a new solution for path planning [9].

Deep reinforcement learning has a strong perception ability of deep learning and decision-making ability of enhanced learning intelligence, so it has better performance when facing complex environments and tasks, which is conducive to autonomous learning and obstacle avoidance planning of robots.

In this paper, an optimized deep reinforcement learning algorithm is proposed to improve the indoor robot’s intelligent navigation system, and a deep reinforcement learning model is built for the agent to realize the processes from the input of state perception to the output of motion control, which facilitates the robot to make autonomous navigation decisions. At the same time, an optimized DL (disturb learning)-DDPG algorithm based on the depth deterministic strategy gradient DDPG algorithm is suggested, and noise is processed to reduce the influence on the training samples in the real environment. Finally, these algorithms are applied to the path planning system to simulate the distribution task by a single robot in a task. The simulation experiment proves that the optimized DL-DDPG algorithm can improve the efficiency of intelligent autonomous navigation and the goods distribution by the agent.

[figure(s) omitted; refer to PDF]

The structure of this paper is organized as follows:

(1) In Section 1, the research content, innovations, and organizational structure are introduced in this paper

(2) In Section 2, the research background and the reason for using the improved DRL algorithm for path planning are explained

(3) In Section 3, the motion mode, observation model, and sensor model of the indoor robot are built, and the intelligent control model of the indoor robot is built for subsequent algorithms and simulations

(4) In Section 4, the improved DL-DDPG algorithm is designed based on the DRL

(5) In Section 5, the research results are verified by the simulation results and analysis

(6) Finally, the full research and the expectations for future work are summarized

The main innovations in this paper are as follows:

(1) Aiming at the dependence of the path planning method on the environment model, a method of end-to-end online learning and decision-making for robots is designed, and the robot can plan the path through autonomous learning without relying on environment modeling

(2) Combined with sensors to detect and perceive the surrounding obstacles, the DDPG (deep deterministic policy gradient) algorithm is adopted to realize the input of environmental perception and action direct output control

(3) The algorithm preprocesses the relevant data in the learning sample with Gaussian noise, which helps the agent adapt to the noisy training environment and improve its robustness

2. Background

Robots have been widely used in home services, space exploration, automated industrial environments, rescues, and other fields. In these applications, path planning without collision is the primary prerequisite. Therefore, path planning ability plays a very important role in the process of robot navigation tasks.

Traditional path planning algorithms include Dijkstra, $A^{*}$ (A-start) and RRT (rapidly exploring random tree). They first need to understand all the environmental information, then build a model in the environment [10], and obtain the planned path by using the path-searching algorithm according to certain optimization criteria. Traditional path planning methods rely on environment modeling, which lacks robustness and has local optimal solutions, so the accuracy of dealing with complex problems is very low.

Later, intelligent bionic path planning algorithms with partial autonomy emerged, mainly including genetic algorithm [11], ant colony optimization [12], and particle swarm optimization [13]. The intelligent bionic algorithm can carry out path planning tasks in a dynamic space, but when the amount of computation is relatively large, the efficiency of path planning is low, and the real-time performance of path planning cannot be guaranteed [14]. In addition, when the robot has insufficient knowledge of the working environment, the planned path is not optimal.

In general, these traditional path planning algorithms are incapable of processing complex and high-dimensional environmental information in complicated environments, or they easily fall into local optimality. They also require a large amount of prior knowledge in path planning. Reinforcement learning (RL) [15] is a kind of artificial intelligence algorithm that does not require prior knowledge. It obtains feedback information through trial and error and iteration with the environment to optimize strategies directly. Moreover, it can learn autonomously and online, which has made it gradually become the research focus of path planning of robots in unknown environments [16].

The basic idea of reinforcement learning is to make the agent learn toward the direction of obtaining the maximum reward according to the reward given by the environment during the continuous interaction between the agent and the environment and to obtain the optimal strategy through obtaining the maximum reward [17]. Reinforcement learning mainly includes two types of methods based on value function and strategy searches, such as the Q-learning algorithm, SARSA algorithm, policy gradient algorithm, and actor-critic (AC) algorithm [18].

However, the traditional reinforcement learning method is limited by the dimension of action space and sample space, so it is difficult to adapt to complex problems closer to reality. In contrast, deep learning (DL) has the strong perceptual ability and can adapt to complex problems, but it lacks certain decision-making abilities. The combination of deep reinforcement learning (DRL) with DL and RL can overcome their weakness and provide new ideas and directions for motion planning of mobile robots in complex environments.

During the past few decades, there has been increasing attention on artificial intelligence technology. Deep reinforcement learning (DRL) plays an important role in the field of intelligent navigation and path planning [19]. Levine et al. [20] adopted the DRL method to carry out end-to-end joint training on visual perception and motion control of the robot for it to complete the task of placement of specific objects. Mirowski et al. [21] combined the A3C algorithm with a recurrent neural network for an agent to complete path planning and map construction of the surrounding environment in a maze. To reduce the error in actor-critic, Fujimoto et al. [22] proposed the TD3 algorithm when updating the target network by adding noise to the target actor network and combining with the skill of the delayed updated actor, which increased the robustness of the algorithm to a certain extent. However, in the dynamic environment, the TD3 algorithm cannot get to the target point effectively in the environment with obstacles, and the sample utilization is low.

Sallab et al. [23] combined DQN (deep Q network) and actor-critic algorithm to control the driving of unmanned vehicles. In 2016, Lillicrap et al. combined DQN and DPG algorithms and proposed a depth deterministic strategy gradient (DDPG) [24]. Kendall et al. [25] made an agent that could effectively obtain the visual information of its surroundings based on the DDPG (deep deterministic policy gradient) algorithm to control its autonomous movement. DDPG algorithm can maintain flexible decisions in a dynamic space, which shows a great advantage in robot path planning problems in a complex environment.

The characteristics of traditional algorithms, reinforcement learning algorithms, and deep reinforcement learning algorithms are summarized in Table 1.

Table 1

Comparison of three algorithms.

	Traditional algorithm	Reinforcement learning algorithm	Deep reinforcement learning algorithm
Example	Dijkstra, A-start, ACO	Q-learning, SARSA, AC (actor-critic)	DQN, TD3, DDPG
Working mode	Off-line	Online	Online
Calculated amount	Large	Average	Small
Environmental modeling	Necessary	Needless	Needless
Local optimum	Often	Seldom	Seldom
Static environment	Good performance	Average performance	Average performance
Dynamic environment	Bad performance	Average performance	Good performance

In DRL path planning, there is still a gap between the simulated environment and the real environment because of the simple scenario settings in the simulated environment. The agent needs to detect and perceive the surrounding environment through sensors, radars, and other equipment. Affected by hardware performance and information transmission, there would be some deviations between the data samples obtained and the real data, causing the misjudgment of the agent in the environment, thus resulting in the inefficiency of the autonomous navigation decision-making process.

Therefore, an improved DDPG algorithm combined with sensors to detect and perceive surrounding obstacles is proposed to realize the input of environmental perception and action direct output control, to enable the robot to complete the task of path planning in a complex environment.

3. Intelligent Control Model of Indoor Robot

3.1. Indoor Robot Motion Mode

To complete the delivery task, the indoor robot is equipped with advanced facilities for autonomous navigation, such as high-precision dynamic 3D information processing, AHRS, and GPS-INS inertial navigation assistance systems. To ensure brevity and universality in the 2D plane, the continuous motion equations of the robot are set by the following equations: $\begin{matrix} (1) & \begin{matrix} \dot{x} t \\ \dot{y} t \\ \dot{ϕ} t \\ \dot{v} t \end{matrix} = \begin{matrix} v t \cos ϕ t \\ v t \sin ϕ t \\ ω t \\ a t \end{matrix}, \end{matrix}$ $w h e r e x$ and $y$ represent the 2D coordinates of the robot in the environment; $ϕ$ indicates the direction of motion of the robot; and $v$ is the velocity of the robot. During the interval, the status update of time $t$ can be described as the following equations: $\begin{matrix} (2) & \begin{matrix} x t = x t - 1 + v t - 1 ∆ t \cos ϕ t - 1, \\ y t = y t - 1 + v t - 1 ∆ t \sin ϕ t - 1, \\ v t = v t - 1 + a t - 1 ∆ t, \\ ϕ t = ϕ t - 1 + ω t - 1 ∆ t . \end{matrix} \end{matrix}$

Note that it is reasonable to assume the speed $v$ of the indoor robot within a certain range, i.e., $v \in 0, v_{\max}$ .

3.2. Indoor Robot Observation Model

The radar is installed on the indoor robot to detect signals. By comparing the received target echo with the transmitted signal, the target range, bearing, speed, and other information can be obtained, which provides basic data for navigation, collision avoidance, parking, and other actions. At any time $t$ , the relative distance $D$ between the indoor robot and the target point, and the relative azimuth angle $φ$ can be obtained. In addition, suppose that the robot position vector $S^{u a v}$ and the destination position vector $S_{d}$ as $\begin{matrix} (3) & S^{u a v} = {x_{t}, y_{t}, z_{t}}^{T} S_{d} = {x_{d}, y_{d}, z_{d}}^{T}, \end{matrix}$

And the observation equation of the robot to the target point in a two-dimensional coordinate system at a certain time is defined as $\begin{matrix} (4) & Z = \begin{matrix} D \\ φ \end{matrix} = \begin{matrix} {x_{t}, y_{t} - x_{d}, y_{d}}_{2} \\ \arctan \frac{y_{t} - y_{d}}{x_{t} - x_{d}} \end{matrix} . \end{matrix}$

3.3. Indoor Robot Sensor Model

A complex and dynamic environment is the main challenge for autonomous navigation and intelligent control of the robot. In the process of autonomous navigation, the robot needs to detect threats in the environment. Therefore, a group of distance sensors (the number is 12) is installed on the robot to help it detect possible obstacles within the detectable range over the front. At each moment, the observation of obstacles by the robot is defined as follows: $\begin{matrix} (5) & O_{o} = d_{1}, d_{2}, d_{3}, d_{4}, d_{5}, d_{6}, d_{7}, d_{8}, d_{9}, d_{10}, d_{11}, d_{12} . \end{matrix}$

$w h e r e d_{1}, d_{2}, \dots, d_{3}$ represent the data transmitted by the corresponding sensor. As shown in Figure 2, the red line represents the sensor indication in front of the agent, the orange lines represent the sensor indication in front of the agent (including the right and left sides of the robot), and the yellow lines represent the sensor indication from behind the agent. The maximum detectable range of the sensor is set as L if the sensor does not detect an obstacle, and if the sensor detects an obstacle, $d_{n} \in 0, L$ represents the distance from the agent to the obstacle.

[figure(s) omitted; refer to PDF]

4. Autonomous Navigation Based of Indoor Robot Based on Deep Reinforcement Learning

4.1. Deep Reinforcement Learning

In the traditional path planning algorithm, a large amount of computation and slow convergence speed make it difficult to cope with large-scale and complex emergencies. Combining the perception ability of deep learning with the decision-making ability of reinforcement learning, deep reinforcement learning can be applied in the end-to-end perception and control system with strong adaptability [26], which can greatly improve the efficiency of path planning.

Figure 3 describes the decision-making process of an agent based on deep reinforcement learning, which could be described as follows:

(1) At each moment, the agent interacts with the environment to obtain a high-dimensional observation, and the specific state features can be sensed by using the deep learning method

(2) The value function of the action is evaluated based on the expected return, and the current state is mapped as the corresponding action

(3) The environment responds to this action and gets the next observation, and thus, the optimal strategy can be obtained by continuously repeating the above processes

[figure(s) omitted; refer to PDF]

Specifically, the complete learning process of an agent can be indicated by the Markov decision process, which is represented by a quad $S, A, R, γ$ . The quad $S$ represents the observation and the state of an agent; A represents the actions that an agent can carry out; $R$ is the reward function, representing the rewards of the agent after the action is completed in a certain state; $γ$ is the discount coefficient to balance instantaneous and cumulative rewards in the learning process.

4.2. DDPG Algorithm

DDPG (deep deterministic policy gradient) is an effective algorithm for precise control. The biggest advantage of the DDPG algorithm is that it can learn in a continuous action space. The algorithm is mainly composed of an actor-critic framework, including the actor network and critic network. The actor network is responsible for generating reasonable actions based on the existing state, and the critic network is responsible for scoring the completed actions to evaluate their effectiveness. According to the current state of the agent, actions $a_{t} = μ s_{t} θ^{μ}$ can be completed by online actor networks. Then, the online critic network will evaluate this action and generate the corresponding value $Q = s_{t}, a_{t} θ^{Q}$ , where $θ^{μ}$ and $θ^{Q}$ represent the online actor network and online critic network, respectively. At the same time, the actor target network $θ^{μ^{'}}$ and critic target network $θ^{Q^{'}}$ are constructed, respectively, to solve the over-fitting problem during efficiency training.

After each turn, the samples based on the information of state, action, reward, and state at the next moment of the current agent will be stored in the experience pool, denoted as $s_{t}, a_{t}, r_{t}, s_{t + 1}$ for subsequent agent learning. Learning samples of $N$ are randomly selected from the experience pool to calculate the loss function $L$ of the critic network during each learning. The calculation process can be described as follows: $\begin{matrix} (6) & L θ^{Q} = N^{- 1} \sum_{t}^{N} {Y_{i} - Q s_{i}, a_{i} θ^{Q}}^{2} \\ Y_{i} = r_{i} + γ Q^{'} s_{i + 1}, μ^{'} s_{i + 1} θ^{μ^{'}} θ^{Q^{'}}, \end{matrix}$ $w h e r e Y_{i}$ represents the target value, $γ$ indicates the attenuation coefficient, and $i$ represents the serial number of the learning sample. Meanwhile, the policy gradient is used to complete the training of the actor network, as shown in the following equation: $\begin{matrix} (7) & \nabla_{θ^{μ}} J \approx N^{- 1} \sum_{t}^{N} \nabla_{a_{t}} Q s_{t}, a_{t} θ^{Q} \nabla_{θ^{μ}} μ s_{t} θ^{μ} . \end{matrix}$

After updating the online actor and critic network, the target networks are updated, respectively, through soft updates: $\begin{matrix} (8) & \begin{matrix} θ^{Q^{'}} = τ θ^{Q} + 1 - τ θ^{Q^{'}}, \\ θ^{μ^{'}} = τ θ^{μ} + 1 - τ θ^{μ^{'}}, \end{matrix} \end{matrix}$ where τ denotes the soft update factor to control the update amplitude of the target network.

4.3. DL-DDPG Algorithm

In the conventional DDPG algorithm, the network takes random samples from the experience pool for offline training. However, if no proper methods are used to preprocess the original samples, the model training process will become time-consuming. In the real distribution environment, the agent will perceive the environment through methods including sensor detection, image recognition, signal processing, and other processes. In the complex dynamic environment, the perception process of the robot is prone to error because it can be disturbed by noises and other environmental factors, which will affect its safety. Therefore, the robustness of the model must be improved to reduce the interference and error caused by the environment. Enhancing adaptability in complex environments is of great significance in the field of intelligent control such as unmanned vehicles and robots.

In order to train a robust indoor robot to adapt to the measurement error and other noises in the real environment, a DL (disturb learning)-DDPG algorithm is proposed to reduce the influence of errors in the real environment. In the process of saving samples each time, active noise is processed for each sample in advance, which could confuse the perception of the agent and help it adapt to the training environment under various noisy situations. After each sample $s_{t}, a_{t}, r_{t}, s_{t + 1}$ is generated, Gaussian noise processing is performed on each element: $\begin{matrix} (9) & \begin{matrix} s_{t}^{'} \sim N_{G a u s s i a n} s_{t}, {σ_{s}}^{2}, \\ a_{t}^{'} \sim N_{G a u s s i a n} s_{t}, {σ_{a}}^{2}, \\ s_{t + 1}^{'} \sim N_{G a u s s i a n} s_{t}, {σ_{s}}^{2}, \end{matrix} \end{matrix}$ $w h e r e σ_{s}$ and $σ_{a}$ represent variance values for generating action and state noise, respectively. Next, the sample data are processed into $s_{t}^{'}, a_{t}^{'}, r_{t}, s_{t + 1}^{'}$ and stored in the experience pool to generate error samples for disturbance learning of the robot.

In the DL-DDPG algorithm, the state $s$ of the robot is set as $\begin{matrix} (10) & s = d_{1}, \dots, d_{12}, d_{r}^{t}, ξ^{t}, v^{t}, ϕ^{t} . \end{matrix}$ where $d_{1}, d_{2}, \dots, d_{3}$ represent the observation of obstacles by robot sensors; $d_{r}^{t}$ represents the distance from the robot to the target point at time $t$ ; $ξ^{t}$ represents the relative angle between the robot and the target point; $v^{t}$ represents the speed of the robot movement; $ϕ^{t}$ represents the motion angle of the robot.

When the robot completes cargo distribution through autonomous navigation, it needs to maintain a stable speed and driving direction. The indoor robot strategy controller is constructed according to the change rate of the moving speed and the driving direction, whose action output is defined as follows: $\begin{matrix} (11) & a_{t} = {\dot{v}}_{t}, {\dot{ϕ}}_{t}, \end{matrix}$ where ${\dot{v}}_{t}$ and ${\dot{ϕ}}_{t}$ represent the change rate of the angle and the motion speed, respectively. Then, load them into the robot motion model in the first part to update the real-time state of the robot.

To ensure that the robot can complete autonomous navigation tasks safely and effectively during the training, proximity reward $r_{c l o s e}$ , speed reward $r_{s p e e d}$ , and task reward $r_{t a s k}$ are set as follows: $\begin{matrix} (12) & r_{c l o s e} = D_{t - 1} - D_{t}, \\ r_{s p e e d} = v_{t} - {\dot{v}}_{t}, \\ r_{t a s k} = \begin{matrix} 20, & M i s s i o n a c c o m p l i s h e d, \\ - 20, & C o l l i s i o n, \\ 0, & O t h e r c o n d i t i o n s, \end{matrix} \end{matrix}$ $w h e r e D_{t - 1}$ and $D_{t}$ are obtained based on formula (4) and represent the distance between the robot and target point at times $t - 1$ and $t$ , respectively. $r_{s p e e d}$ is used to make the robot drive fast and stably, while $r_{t a s k}$ is used to help the robot avoid obstacles and distribute goods. Therefore, the reward function in this study can be obtained as follows: $\begin{matrix} (13) & R = β_{1} r_{c l o s e} + β_{2} r_{s p e e d} + β_{3} r_{t a s k}, \end{matrix}$ $w h e r e β_{1}$ , $β_{2}$ , $β_{3}$ represent the weight coefficients of the three subreward functions, respectively. In the subsequent simulation experiment, we set $β_{1}$ = 10, $β_{2}$ = 0.5, $β_{3}$ = 1. A complete indoor robot intelligent navigation system can be built based on the DL-DDPG algorithm and the design of robot action, state, and reward function. Figure 4 describes the complete learning process of the robot intelligent navigation strategy based on the DL-DDPG algorithm.

[figure(s) omitted; refer to PDF]

5. Simulation Experiment

5.1. Simulation Environment

To verify the effectiveness of the DL-DDPG algorithm in indoor robot intelligent navigation and autonomous control, corresponding simulation experiments are set. The simulated environment runs on the platform of Gym-agent-master, Python 3.6, TensorFlow 1.14.0, PyCharm. The VTK third-party library is used to generate the simulated environment in the Northeast geodetic coordinate system, and the cylinder represents an obstacle, as shown in Figure 5. The maximum running speed of the robot is set as 2.0 m/s, the radius of the obstacles in the simulation scenario is 1 m, and the center-to-center distance of obstacles is 3 m. To ensure the effectiveness of the robot navigation task, the initial distance between the robot and destination shall not be less than 50 m, and the simulation step is 1 s.

[figure(s) omitted; refer to PDF]

In the simulation experiment, a simulator is constructed to realize the autonomous intelligent control of the robot in large and complex environments. To make the simulation experiment convenient, the dynamic physical constraints of the robot are ignored, and the shape of the robot is abstracted as a sphere. The minimum distance between the original spot and the destination is greater than 30 m to ensure sparse rewards. The rangefinder is installed on the robot for observation. When the distance between the robot and the target point is less than 1 m, it is deemed that the robot reaches the destination, and the tasks of autonomous navigation and intelligent control are completed.

As shown in Figure. 6, the same BP neural network is built for the Actor network and Critic network, and the number of fully connected layers of the neural network is set as 2, in which the number of neurons in each layer is 128. The number of input layer nodes is equal to that of state quantities of the agent, which is 16. The number of nodes in the output layer is equal to that of the action quantity of the agent, which is 2. The nonlinear function ReLU is introduced as the excitation function. As the experiment progresses, a backpropagation mechanism, namely the gradient descent, is used in the neural network to fit the network and update the parameters as shown in Figure 6.

[figure(s) omitted; refer to PDF]

In the simulation experiment, the backpropagation mechanism (gradient descent method) is used in the neural network to fit the network and update the parameters. The learning rates of network A and network C are set to 0.0001 and 0.0002, respectively, and the reward discount coefficient is set to 0.99. The capacity of the experience pool is set to 100000. When the experience pool is filled up with data, it enters the learning state. The number of samples extracted from the experience pool each time is set to 32.

5.2. Simulation Results and Analysis

After setting relevant parameters, the training of AGV starts. If the robot fails to complete the training task or collides with an obstacle within the specified time, this round will be deemed to be over. Then, the simulation environment will reset, and the next round will begin. At the same time, to simulate the environment in reality, the updating rules of the scenario are set as follows: the location of the robot, the destination, and the number of obstacles in each round are all set randomly.

To verify the effectiveness of the DL-DDPG algorithm in automatic navigation and intelligent control, DL-DDPG, DDPG, and TD3 algorithms are adopted to train and compare the robot, respectively, in the simulation experiment. The reward value obtained by the robot in each round during the training is recorded, as shown in Figure 7.

[figure(s) omitted; refer to PDF]

As is shown in Figure 7, the DL-DDPG algorithm has the most obvious upward trend, taking the lead to reach the peak of 240 after about 4000 rounds. TD3 algorithm has the lowest return performance with dramatic fluctuation. The traditional DDPG algorithm does not start to rise until about 2000 rounds with a strong fluctuation, and the peak time is also later than the optimized DL-DDPG algorithm. However, after around 6600 training episodes, the reward value under the DL-DDPG algorithm dropped erratically, but it quickly returned to a higher stable level after around 7100 rounds. This shows that the DL-DDPG algorithm proposed in this paper can help the robot adapt to the noisy training environment and improve the training effect. At the same time, when the model is unstable during the training process, the model under the Dl-DDPG algorithm can quickly regain its stability and receive continuous high rewards.

Figure 8 shows the success rate of 0–10000 rounds under the training by DDPG, AL-DDPG, and TD3 algorithms. It can be seen that the task completion rate of the robot is lower than 80% under the DDPG and TD3 algorithms, and the performance of their learned strategies is poor. In comparison, the success rate of DL-DDPG algorithm training has the highest rising trend. After 3000 rounds, the success rate is stably above 80%, and the peak value is close to 90%. DL-DDPG algorithm has the highest success rate and the best learning strategy compared to the other two algorithms.

[figure(s) omitted; refer to PDF]

To verify the effectiveness of the indoor robot system navigation strategy, the intelligent control system will be tested in three environments after the training. The number of environmental obstacles is set to 100, 150, and 200, respectively. The simulation results are shown in Figure 9.

[figure(s) omitted; refer to PDF]

From the results of the simulation experiment, it can be concluded that in the environment with different number of obstacles, the trained robot can achieve intelligent autonomous navigation and then avoid the obstacles to reach the destination. Besides, according to the trend of robot speed change, the robot is able to steadily increase its speed and maintain it within the maximum speed limit until it finally reaches the destination.

To verify the effectiveness of robot autonomous navigation strategy under the DL-DDPG algorithm, we conducted 1000 rounds of comparative tests in the above three scenarios, respectively, and collected the success rates of indoor robot navigation as shown in Table 2.

Table 2

Success rate of robot completing autonomous navigation tasks under different scenarios.

Algorithm	100 obstacles (success rate) (%)	150 obstacles (success rate) (%)	200 obstacles (success rate) (%)
DL-DDPG	90.3	81.2	76.6
DDPG	79.1	63.5	59.3
TD3	83.2	71.8	63.7

Although the success rate of robot navigation tends to decline as the obstacles increase, the success rate under the DL-DDPG algorithm is always higher than the other two algorithms. In a complex scenario with 200 obstacles, the success rate under the DL-DDPG algorithm remains at a high level of 76.6%. It means that the indoor robot navigation system under the DL-DDPG algorithm has a better performance and can adapt to more complex delivery scenarios.

At the same time, we recorded the data of all successful rounds in the above testing process, and counted the average task completion time under the three algorithms, as shown in Table 2. In simple scenarios, there is no significant difference in the time needed for navigation tasks under the three algorithms. However, as shown in Table. 3, as the number of obstacles increases, the system based on the DL-DDPG algorithm shows better adaptability and reaches the destination in a shorter time. It means that the system trained by the optimized DL-DDPG algorithm could develop a more efficient navigation strategy and complete the autonomous navigation task faster Table 3.

Table 3

Average completion time of robot completing autonomous navigation task under different scenarios.

Algorithm	Time (100 obstacles) (s)	Time (150 obstacles) (s)	Time (200 obstacles) (s)
DL-DDPG	64.6	71.5	88.9
DDPG	63.4	80.9	97.7
TD3	65.1	74.1	95.3

Considering the delay errors generated by hardware facilities such as sensors, we simulate the performance of robots in real delay scenarios through a simple scene to verify the adaptability of DL-DDPG algorithm in complex environments. Set in the scenario of 150 obstacles, when the robot conducts navigation tasks, there is a probability of $w$ to trigger $s t = s t - 1$ , i.e, the next state will not be updated. The success rate of the three algorithms to complete the distribution task is shown in the following table:

Table 4 shows that the success rate of the algorithm has a downward trend, but no matter in which situation the success rate of DL-DDPG algorithm can maintain above 75%, much higher than that of DDPG and TD3. The data show that the improved DL-DDPG algorithm has strong adaptability in complex environment and can maintain a high task completion rate.

Table 4

Average success rate of robot completing autonomous navigation task under delay condition (150 obstacles).

Runaway probability	Success rate ( $w$ = 0) (%)	Success rate ( $w$ = 0.05) (%)	Success rate ( $w$ = 0.1) (%)
DL-DDPG	81.2	80.9	77.4
DDPG	63.5	64.1	61.7
TD3	71.8	69.2	70.3

6. Conclusion

The path planning problem of robots has gradually become a research hotspot. Conventional algorithms rely on environmental modeling and cannot make autonomous decisions, so they can only be applied to static off-line environment and are difficult to adapt to more and more complex distribution scenarios. In comparison, deep reinforcement learning has strong perception and decision-making ability, which has great advantages in path planning in complex dynamic environments and has gradually become a research hotspot in path planning in the future.

In this paper, an autonomous intelligent navigation algorithm based on deep reinforcement learning is proposed, which could realize direct control from the perceptual input of the environment to direct output of action through end-to-end learning. Meanwhile, based on the partially observable Markov model, sensors are installed to help the robot detect and perceive obstacles to avoid the obstacles autonomously. In addition, a DL-DDPG strategy is adopted to disturb the learning samples with noises, which improves the robustness of robot autonomous decisions in real environment. The simulation results show that the DL-DDPG algorithm in this paper is more efficient in online decision-making for the indoor robot control system, so that the robot can complete the autonomous navigation task more accurately and stably.

In the experimental simulation, although the DL-DDPG algorithm proposed in this paper has achieved good results, real environments are much more complicated than simulated ones. The robot would be affected by signal interference or moving objects in the environment. In the subsequent research, we plan to configure real image perception equipment on the indoor robot. Then, we will introduce a convolutional neural network to the deep reinforcement learning model to help the indoor robot with image processing and environment perception, which will make our model closer to the real application scenarios.

Acknowledgments

This research was supported by the General Special Scientific Research Plan of Science and Technology Department of Shaanxi Province in China (2022GY-319) and General Special Scientific Research Plan of Science and Technology Department of Shaanxi Province in China (2022GY-329).

References

[1] I. F. Vis, "Survey of research in the design and control of automated guided vehicle systems," European Journal of Operational Research, vol. 170 no. 3, pp. 677-709, DOI: 10.1016/j.ejor.2004.09.020, 2006.

[2] Z. Zhang, J. Chen, Q. Guo, "AGVs Route Planning Based on Region-Segmentation Dynamic Programming in Smart Road Network Systems," Scientific Programming, vol. 2021,DOI: 10.1155/2021/9589476, 2021.

[3] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1 no. 1, pp. 269-271, DOI: 10.1007/bf01386390, 1959.

[4] A. R. Leach, A. P. Lemon, "Exploring the conformational space of protein side chains using dead-end elimination and the A ∗ algorithm," Proteins: Structure, Function, and Genetics, vol. 33 no. 2, pp. 227-239, DOI: 10.1002/(sici)1097-0134(19981101)33:2<227::aid-prot7>3.0.co;2-f, 1998.

[5] T. Oral, F. Polat, "MOD ∗ lite: an incremental path planning algorithm taking care of multiple objectives," IEEE Transactions on Cybernetics, vol. 46 no. 1, pp. 245-257, DOI: 10.1109/tcyb.2015.2399616, 2016.

[6] O. Khatib, "Real-time obstacle avoidance for manipulators and mobile robots," Proceedings of the 1985 IEEE International Conference on Robotics and Automation, vol. 2, pp. 500-505, .

[7] R. S. Sutton, Reinforcement Learning: An introduction, 2018.

[8] Z. Zhang, Q. Guo, J. Chen, P. Yuan, "Collision-free route planning for multiple AGVs in an automated warehouse based on collision classification," IEEE Access, vol. 6, pp. 26022-26035, DOI: 10.1109/ACCESS.2018.2819199, 2018.

[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518 no. 7540, pp. 529-533, DOI: 10.1038/nature14236, 2015.

[10] P. Chen, Y. Huang, J. Mou, "Global path planning for autonomous ship: a hybrid approach of fast Marching Square and velocity obstacles methods[J]," Ocean Engineering, 2020.

[11] D. Ghosh, J. Singh, "Spectrum-based Muliti-Fault Localization Using Chaotic Genetic Algorithm," Information and Software Technology, vol. 133,DOI: 10.1016/j.infsof.2021.106512, 2021.

[12] M. Dorigo, V. Maniezzo, A. Colorni, "Positive Feedback as a Search strategy," 1999. Technical Report

[13] R. Eberhart, J. Kennedy, "A new optimizer using particle swarm theory," pp. 39-43, .

[14] X. Liang, Y. Mu, B. Wu, "Review on correlative algorithms of path planning," Value Engineering, vol. 39 no. 3, pp. 295-299, 2021.

[15] F. Liu, C. Chen, Z. Li, "Research on path planning of robot based on deep reinforcement learning," Proceedings of the 2020 39th Chinese Control Conference(CCC), Shenyang, pp. 3730-3734, .

[16] J. Kober, J. A. Bagnell, J. Peters, "Reinforcement learning in robotics: a survey," The International Journal of Robotics Research, vol. 32 no. 11, pp. 1238-1274, DOI: 10.1177/0278364913495721, 2013.

[17] A. S. Polydoros, L. Nalpantidis, "Survey of model based reinforcement learning:Applications on robotics[J]," Journal of Intelligent and Robotic Systems, vol. 86, 2017.

[18] R. S. Sutton, A. G. Barto, Reinforcement Learning: An introduction, 2018.

[19] M. Seyed Sajad, M. Schukat, Deep Reinforcement Learning: An Overview, 2018.

[20] S. Levine, C. Finn, T. Darrell, "End-to-End training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17 no. 39, 2016.

[21] P. Mirowski, R. Pascanu, F. Viola, "Learning to Navigate in Complex Environments," Proceedings of the International Conference on Learning Representations, .

[22] S. Fujimoto, H. Hoof, D. Meger, "Addressing function approximation error in actor-critic methods," Proceedings of the 35th International Conference on Machine Learning, pp. 1587-1596, .

[23] A. E. Sallab, M. Abdou, E. Perot, "End-to-End Deep Reinforcement Learning for Lane Keeping Assist," 2016. https://arxiv.org/abs/1612.04340

[24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, "Continuous Control with Deep Reinforcement learning," Proceedings of the 4th International Conference on Learning Representations, pp. 2829-2838, .

[25] A. Kendall, J. Hawke, D. Janz, "Learning to drive in a day," pp. 8248-8254, .

[26] R. Zhang, C. Wu, T. Sun, "Research progress of reinforcement Learning in Path planning [J/OL]," Computer Engineering and Applications, vol. 1–15, pp. 07-27, 2021.

Word count: 5457

Show less

Copyright © 2023 Xuemei He et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/

Abstract

Translate

Targeting the problem of autonomous navigation of indoor robots in large-scale, complicated, and unknown environments, an autonomous online decision-making algorithm based on deep reinforcement learning is put forward in this paper. Traditional path planning methods rely on the environment modeling, which can cause more workload of calculating. In this paper, the sensors to detect surrounding obstacles are combined with the DDPG (deep deterministic policy gradient) algorithm to input environmental perception and control the action direct output, which enables robots to complete the tasks of autonomous navigation and distribution without relying on environment modeling. In addition, the algorithm preprocesses the relevant data in the learning sample with Gaussian noise, facilitating the agent to adapt to noisy training environment and improve its robustness. The simulation results show that the optimized DL-DDPG algorithm is more efficient on online decision-making for the indoor robot navigation system, which enables the robot to complete autonomous navigation and intelligent control independently.

Details

Title

Intelligent Navigation of Indoor Robot Based on Improved DDPG Algorithm

Author

He, Xuemei¹; Yin Kuang¹

; Song, Ning¹; Liu, Fan¹

¹ College of Art and Design, Shaanxi University of Science & Technology, Xi’an, China

Editor

Zhi-Wei Liu

Publication year

2023

Publication date

2023

Publisher

John Wiley & Sons, Inc.

ISSN

1024123X

e-ISSN

15635147

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2023/6544029

ProQuest document ID

2804963630

Intelligent Navigation of Indoor Robot Based on Improved DDPG Algorithm

Jump to:

Full text

Abstract

Details

Suggested sources