This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Currently, with the continuous development of aviation technology and artificial intelligence [1], unmanned aerial vehicles (UAVs) have been widely used in target reconnaissance [2], communication relay [3], geographic surveying and mapping [4], and logistics and distribution [5]. However, as application environments become more complex, a single UAV adapts poorly to the environment and achieves low mission success rates for complex operational tasks [6]. Fortunately, the collaborative work of multiple UAVs effectively mitigates many drawbacks of a single UAV in complex environments and significantly improves the success rate of mission execution. Multi-UAV cooperative work offers low cost, good scalability, and strong adaptability, and it has become the mainstream direction of UAV application [7].
The basis of multi-UAV cooperative work is the multi-UAV path planning problem [8], which generates an optimal path for each UAV to reach its destination, taking various constraints into account, such that the total planned path is the shortest [9]. Especially in dynamic environments with partial or no prior knowledge, collaborative path planning for multiple UAVs requires real-time adjustment of the planned paths in response to environmental changes to avoid collisions with other flying objects, which is a highly complex and challenging problem. Path planning for multiple UAVs is essentially a multiconstraint combinatorial optimization problem [10], and its combinatorial nature makes it strongly NP-hard [11]. Moreover, the presence of multiple environmental constraints and the varying performance capabilities of the UAVs further increase the difficulty of reaching optimal solutions.
The performance limitations of a single UAV must also be taken into account during the actual path planning process, in addition to objective conditions such as the mission task and its environment [12]. As illustrated in Figure 1, path planning for UAVs entails receiving the group mission specifications and quickly computing a workable path for every UAV. This process fully considers the significant influencing factors, both static and dynamic: the static factors encompass UAV capabilities, region scope, and fixed obstacles, while the dynamic factors involve cooperative relationships, communication networks, and moving obstacles. Cooperative path planning aims to efficiently generate a feasible path for each UAV while ensuring that the UAVs can accomplish the cluster mission.
[figure(s) omitted; refer to PDF]
With the extensive application of large-scale UAV collaborative work, multi-UAV path planning has gained widespread attention and has become a hot topic in current UAV research. Researchers from diverse backgrounds have investigated multi-UAV path planning problems with various restrictions and objectives in recent years. However, owing to the multitude of constraints, immense computational requirements, and the complexity of UAV motion models, most previous studies have simplified the UAV constraints and applied them to relatively simple static environments; as a result, they cannot generate optimal paths in real time in complex environments. Fortunately, driven by rapid advances in artificial intelligence, deep reinforcement learning techniques based on trial-and-error learning have been widely applied to multi-UAV path planning. These techniques adapt to dynamically changing environments and generate optimal paths for each UAV. Compared with traditional methods, deep reinforcement learning approaches have significant advantages in complex environments, demonstrating stronger robustness and enabling true collective intelligence [13].
However, despite the potential of deep reinforcement learning algorithms in addressing multi-UAV path planning problems, they face several challenges. One of these challenges is how to effectively utilize valuable experiences [14]. During the training process, deep reinforcement learning algorithms need to interact with the environment and extract useful knowledge from experiences. However, due to the complexity and diversity of path planning problems, effectively leveraging experiences becomes quite difficult. Furthermore, deep reinforcement learning algorithms are prone to instability during the training process [15]. This means that the algorithm may experience fluctuations and inconsistencies, leading to slower convergence speeds.
In this study, we propose a prioritized experience replay (PER) mechanism based on the temporal difference (TD) error to address the issue of effectively utilizing experience. In addition, we propose a delayed update strategy to address the issue of unstable and divergent updates during the agents’ training process. Combining these two points, we propose the PERDE-MADDPG algorithm to solve the path planning problem for UAVs. The significant contributions of this study are summarized as follows:
1. In order to efficiently utilize the more valuable experiences, we employ a PER mechanism in place of the traditional experience replay mechanism, which accelerates the convergence of the algorithm.
2. We adopt a delayed update strategy to address the issue of unstable updates during the agents’ training process. Specifically, after a certain number of updates to the critic network, we update the actor network, target critic network, and target actor network once, which helps stabilize training and mitigate the instability associated with frequent updates.
The remainder of this work is organized as follows. Section 2 reviews related work of path planning. In Section 3, the cooperative path planning of multiple UAVs is abstracted as a multiconstraint combinatorial optimization problem, and its Markov model is established. Section 4 presents the PERDE-MADDPG algorithm. In Section 5, simulation experiments are carried out, and the proposed algorithm’s efficacy is confirmed. Section 6 summarizes the work of this paper and future research directions.
2. Related Work
Path planning aims to design the optimal path for each drone to follow during its flight [16] and is a key reason why multiple UAVs can be successfully deployed across different domains [17]. Path planning algorithms can generally be divided into three main groups: sampling-based methods, heuristic algorithms, and deep reinforcement learning algorithms.
The Probabilistic Roadmap Method (PRM) [18] and the Rapidly-exploring Random Tree (RRT) [19] are the main sampling-based methods. Sampling methods avoid modeling the entire environmental space and instead reconstruct the environment from sampled points, resulting in relatively low computational cost. For instance, Yan, Zhuang, and Xiao [20] proposed a real-time path planning technique for UAVs in intricate 3D environments, in which a modified PRM is obtained by random sampling in a bounding box array. Song et al. [21] proposed an improved RRT path planning algorithm for automated land vehicles (ALVs), which combines the nonholonomic constraints of the vehicle with a double extended RRT to improve search efficiency while ensuring path feasibility. Although sampling methods can handle path planning problems in high-dimensional environments, they rely on random sampling within the problem space, which makes it impossible to guarantee the quality of the obtained paths; as a result, the outcomes are frequently less than ideal.
Heuristic-based methods include the genetic algorithm (GA) [22], particle swarm optimization (PSO) [23], ant colony optimization (ACO) [24], and others. Most heuristic algorithms search for spatially optimal solutions by mimicking the behavior of biological groups, such as foraging or encirclement, and can solve high-dimensional, multiconstraint optimization problems. To address the issue of the traditional PSO algorithm easily falling into local optima, Huang et al. [25] proposed a novel PSO algorithm called ACVDEPSO, in which particle velocities are transformed into cylinder vectors to facilitate path searching. To overcome the low search efficiency of the traditional ACO algorithm in path planning, Li, Xiong, and She [26] introduced variable pheromone enhancement factors and variable pheromone evaporation coefficients into the ACO algorithm, proposing a new algorithm called ACO-VP. Despite their capability to find approximately optimal solutions in large-scale problems and effectively explore complex search spaces, heuristic algorithms often suffer from slow convergence, complex operations, long computation times, and sensitivity to parameters.
Sampling methods and heuristic algorithms mostly treat UAV cluster task allocation and path planning separately, ignoring the coupling between the two. Deep reinforcement learning algorithms are more task-oriented and are guided by reward functions to accomplish task allocation, path planning, collision avoidance, and obstacle avoidance while satisfying various constraints. To address slow convergence caused by the low reward of training samples, Jun, Yunxiao, and Li [27] combined hindsight experience replay (HER) with DQN to increase the sample validity of the experience pool, thus improving the convergence speed. Han et al. [28] employed the artificial potential field approach to influence the DDPG algorithm’s action selection, which improved path smoothness and shortened path length. The benefits of reinforcement learning algorithms for path planning include their capacity to adapt to complex environments and their adaptability and generalization abilities.
However, despite these advantages, deep reinforcement learning algorithms for multi-UAV path planning still face some problems. Two significant challenges are how to efficiently utilize experience and how to cope with instability and slow convergence during training. To address the challenge of effectively utilizing experience, we propose a PER mechanism based on the TD-error. This mechanism assigns priority according to the TD-error of each experience, which reflects its importance for learning and decision-making. In addition, we address instability and slow convergence during training by introducing a delayed update strategy, which reduces the frequency of parameter updates and thereby increases the stability of training.
3. Problem Formulation and System Model
In this section, we define and model the path planning problem of multiple UAVs. Firstly, the path planning problem is described as an optimization problem subject to multiple constraints. Then, in order to satisfy the optimization conditions, we construct a multiagent system based on the MADDPG framework.
3.1. Formulation of the Multi-UAV Path Planning Problem
In our work, a fleet of
As shown in Figure 2,
[figure(s) omitted; refer to PDF]
The collaborative path planning problem in this work is aimed at creating an uninterrupted flight path for each UAV connecting its starting position and destination while avoiding collisions with other obstacles and UAVs. Each UAV should arrive at its specified position at a given time point. Therefore, a time-sensitive series of waypoints can be used to characterize the flight route of a specific UAV. The time-sensitive waypoint sequence for UAV
1. Collision avoidance constraint: Each UAV should avoid collisions with other UAVs or obstacles. We define the safety distance
2. Motion state update constraint: Each UAV motion state needs to be periodically updated by applying a new control vector through iterative steps. The UAV motion over short durations
3. Destination arrival constraint: We consider a UAV to have reached its destination when the distance between the UAV and the endpoint is less than
The collaborative path planning problem can be described as the minimization of the total flight distance for all UAVs, as its primary objective is to find collision-free routes for each UAV to efficiently reach their respective targets. According to the waypoint sequence
It can be calculated using the following formula:
The collaborative path planning problem addressed in this paper is to find the optimal path for each UAV without collision, aiming to minimize the total path length of all UAVs. It can be expressed as follows:
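Since the paper's equations are not reproduced in this extract, the following is a hedged LaTeX sketch of the optimization problem described in this subsection, under assumed notation: p_i(t) is the position of UAV i at waypoint t, v_i(t) its velocity, g_i its destination, o_k the position of obstacle k, d_safe the safety distance, and d_arr the arrival threshold. These symbols and the exact form are assumptions, not necessarily the paper's own formulation.

```latex
% Hedged reconstruction of the objective and constraints under assumed notation;
% the paper's original equations are not reproduced in this extract.
\begin{align*}
  L_i &= \sum_{t=0}^{T_i-1} \bigl\lVert p_i(t+1) - p_i(t) \bigr\rVert_2
        && \text{(length of UAV $i$'s planned path)}\\
  \min_{\{p_i\}} \;& \sum_{i=1}^{N} L_i
        && \text{(minimize the total path length)}\\
  \text{s.t.}\;& \lVert p_i(t) - p_j(t) \rVert_2 \ge d_{\mathrm{safe}}, \quad i \ne j,
        && \text{(UAV-UAV collision avoidance)}\\
               & \lVert p_i(t) - o_k \rVert_2 \ge d_{\mathrm{safe}},
        && \text{(UAV-obstacle collision avoidance)}\\
               & \lVert p_i(T_i) - g_i \rVert_2 \le d_{\mathrm{arr}},
        && \text{(destination arrival)}\\
               & p_i(t+1) = p_i(t) + v_i(t)\,\Delta t
        && \text{(motion state update over a short duration)}
\end{align*}
```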
3.2. Building Multiagent System Based on MADDPG
Because the path planning problem of multiple UAVs is a typical sequential decision problem, it can be modeled as a stochastic game (also known as a Markov game). According to the Markov decision process, the state information of the UAV cluster is represented as the five-tuple
In the five-tuple
Owing to the high speed and short response time of UAVs, each UAV can only observe a portion of the environment. The path planning problem can therefore be described as a partially observable Markov game, characterized by a tuple
1. Observation space
2. Action space
3. Reward function
Therefore, there are three termination conditions for UAVs: reaching the destination point, colliding with other UAVs or obstacles, and reaching the maximum number of steps without reaching the destination point.
In this work, the objective of the UAV is to reach the designated targets without experiencing any collisions. Thus, when UAV
When UAV
When the maximum number of steps is reached and UAV
During the flight process of a UAV, when it moves away from the destination point, it should be penalized. We use
Based on the above description, the total reward for UAV
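As a concrete illustration of the reward structure described above, the following is a minimal Python sketch. The function and constant names (uav_reward, R_ARRIVE, R_COLLIDE, R_TIMEOUT, SHAPING_W, D_ARRIVE, D_SAFE) and their numerical values are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

# Illustrative reward/penalty magnitudes and thresholds (assumptions, not the paper's values).
R_ARRIVE, R_COLLIDE, R_TIMEOUT = 10.0, -10.0, -5.0
SHAPING_W, D_ARRIVE, D_SAFE = 0.1, 0.5, 1.0

def uav_reward(pos, prev_pos, goal, others, obstacles, step, max_steps):
    """Return (reward, done) for one UAV at one time step; the total reward is the
    sum of the arrival, collision, timeout, and distance-shaping components."""
    dist_goal = np.linalg.norm(pos - goal)
    prev_dist_goal = np.linalg.norm(prev_pos - goal)
    reward, done = 0.0, False

    if dist_goal < D_ARRIVE:                       # termination 1: destination reached
        reward += R_ARRIVE
        done = True
    if any(np.linalg.norm(pos - q) < D_SAFE for q in list(others) + list(obstacles)):
        reward += R_COLLIDE                        # termination 2: collision with a UAV or obstacle
        done = True
    if not done and step >= max_steps:             # termination 3: step budget exhausted
        reward += R_TIMEOUT
        done = True
    reward += SHAPING_W * (prev_dist_goal - dist_goal)  # penalize moving away from the goal
    return reward, done
```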
4. The PERDE-MADDPG Approach
In this section, we propose the PERDE-MADDPG algorithm to solve the path planning problem of multiple UAVs. PERDE-MADDPG builds upon the MADDPG algorithm and introduces a PER mechanism based on sample priority, which replaces the traditional experience replay mechanism. The algorithm also incorporates a delayed update strategy that reduces the update frequency of the actor network and target network parameters, which helps mitigate the impact of parameter instability.
In this approach, each UAV is considered an autonomous agent that learns a reasonable path selection strategy and executes optimal actions based on its current local observations. Similar to the MADDPG algorithm, the proposed approach employs centralized training with decentralized execution. In the centralized training phase, agents can share their experiences and iteratively optimize their policies through interactions with the environment and other agents. During the decentralized execution phase, each agent independently takes actions according to its own observations and policy.
The PERDE-MADDPG algorithmic architecture is shown in Figure 3, which can be divided into three main parts: environment, UAVs, and replay buffer
[figure(s) omitted; refer to PDF]
First, the environment module serves as a simulation environment for the path planning problem. It encompasses all the elements that the UAVs need to perceive and interact with. The environment provides state information, such as the current position and velocity, to the UAVs. Through interactions with the environment, the UAVs execute actions and receive reward signals, enabling them to learn adaptive path planning strategies. The environment module plays the role of facilitating the interaction between the UAVs and provides necessary information and feedback.
Next, the UAV module is the core of the system, where each UAV is represented by an intelligent agent responsible for generating its optimal path. The agent uses deep reinforcement learning to select actions based on the environment and the experiences stored in the replay buffer. Each agent consists of four networks: actor, critic, target actor, and target critic. The actor network chooses actions based on the environment state, the critic network scores the state-action pairs for path planning, and the target networks stabilize the training process. During network updates, we employ a delayed update strategy on top of soft updates: the critic network is updated multiple times while the actor and target networks are updated once, which reduces the update frequency of the actor and target networks and makes training more stable. During training, the agents update their parameters based on reward signals and environmental feedback, continuously improving their path planning strategies.
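The per-agent network layout described above can be sketched as follows, assuming a PyTorch implementation with simple MLPs. The class names, layer sizes, and activations are illustrative assumptions; the key point is that the critic takes the joint observations and actions of all agents (centralized training) while the actor takes only the local observation (decentralized execution).

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps the agent's local observation to an action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic: scores the joint observation-action pair of all agents."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

class Agent:
    """Each UAV agent holds four networks: actor, critic, and their target copies."""
    def __init__(self, obs_dim, act_dim, joint_obs_dim, joint_act_dim):
        self.actor = Actor(obs_dim, act_dim)
        self.critic = Critic(joint_obs_dim, joint_act_dim)
        self.target_actor = copy.deepcopy(self.actor)      # target copies stabilize training
        self.target_critic = copy.deepcopy(self.critic)
```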
Finally, the replay buffer module is used to store the UAVs’ experiences in the environment and provide samples for the UAV module during training. The replay buffer adopts the PER method, assigning priorities to experiences and sampling and updating them based on these priorities. As a result, experiences that have a greater impact on path planning are selected more frequently for training, thereby improving the efficiency and performance of the path planning algorithm. Through PER, UAVs can better leverage their previous experiences to optimize their path planning strategies.
Each agent’s training process can be characterized as follows: (1) by interacting with the environment, each agent obtains its local observation
4.1. PER
Experience replay methods were initially introduced in DQN algorithms. The two key design considerations for experience replay methods revolve around how to store experiences and how to effectively replay them. The regular DQN algorithm uses uniform sampling, treating each experience equally without considering their varying importance. However, in reality, different experiences have different levels of importance for the agent’s learning. Solely relying on uniform sampling fails to fully leverage the experience data and maximize the learning effectiveness of the agent.
PER is an effective solution to the problem of inefficient experience utilization; its improvement lies in how experiences are replayed. The PER method abandons the uniform random sampling of classical experience replay and uses the magnitude of the TD-error to measure the priority of each experience. The TD-error of the temporal-difference method reflects the importance of a sample: experiences with larger TD-errors are given higher priority, while experiences with smaller TD-errors are given lower priority. The TD-error expression for experience
In order to break the random sampling criterion, the probability of drawing experience
Since prioritized experience replay introduces the TD-error, it alters the sample distribution in an uncontrolled way, which may make the neural network’s training more prone to oscillation or even divergence. To address this problem, the design of importance sampling weights
In practice, normalization is generally required, so each sample weight is divided by the largest weight, yielding the formula:
The new loss function expression of agent
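The following is a minimal Python sketch of such a TD-error-based prioritized replay buffer, assuming the standard proportional-PER formulation: priority equal to the absolute TD-error plus a small constant, sampling probability proportional to the priority raised to a power alpha, and importance-sampling weights normalized by the largest weight. The class name, the flat-array storage (instead of a sum-tree), and the alpha/beta/eps values are illustrative assumptions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """TD-error-proportional prioritized replay (sketch; flat array for clarity)."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def store(self, transition):
        # New experiences receive the current maximum priority so they are replayed at least once.
        p_max = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = p_max
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                                   # P(j) = p_j^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # importance-sampling weights
        weights /= weights.max()                              # normalize by the largest weight
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is the magnitude of the TD-error plus a small constant.
        self.priorities[idx] = np.abs(td_errors) + self.eps
```

In use, each sampled transition's squared TD-error would be multiplied by its importance-sampling weight in the critic loss, and the returned indices would be passed back to update_priorities with the freshly computed TD-errors.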
4.2. Delayed Update
In the MADDPG method, each agent has an actor network and a critic network. Using the local observation
In addition, the target network is an important technique in MADDPG for improving data utilization and algorithm stability [29]. The target network refers to the target actor network and target critic network, which have the same structure as their respective actor and critic networks. In reinforcement learning, the computation of target values can be affected by the fluctuations in the current estimated
In this work, we use
To update the parameters
In traditional MADDPG methods, the network parameters of the actor network, critic network, and their corresponding target networks are updated at a certain frequency during training. However, such frequent updating can cause oscillation and instability in policy learning, causing the agent to lose direction and resulting in unstable agents and occasional policy oscillations in practical applications. To address this issue, we employ a delayed update strategy: after a certain number of updates to the critic network, we update the actor network and target networks once, which reduces the frequency of updates to the actor network and its corresponding target network and stabilizes the learning process.
At each training step, the critic network is updated by minimizing the loss function
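A minimal sketch of this delayed update schedule is shown below, assuming PyTorch modules and optimizers. The helper names (soft_update, train_step), the attributes agent.critic_optim and agent.actor_optim, the delay interval policy_delay, and the soft update rate tau are illustrative assumptions, not the paper's exact settings.

```python
# The agent's networks are assumed to be torch.nn.Module instances and the
# optimizers torch.optim objects; only the update scheduling is sketched here.

def soft_update(target, source, tau=0.01):
    """Polyak (soft) update: target <- (1 - tau) * target + tau * source."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * s_param.data)

def train_step(agent, critic_loss, actor_loss_fn, step, policy_delay=2, tau=0.01):
    # 1) The critic is updated on every training step using the (importance-weighted) TD loss.
    agent.critic_optim.zero_grad()
    critic_loss.backward()
    agent.critic_optim.step()

    # 2) The actor and both target networks are updated only once per `policy_delay` critic updates.
    if step % policy_delay == 0:
        actor_loss = actor_loss_fn()          # e.g., -Q(o, actor(o)) averaged over the batch
        agent.actor_optim.zero_grad()
        actor_loss.backward()
        agent.actor_optim.step()
        soft_update(agent.target_actor, agent.actor, tau)
        soft_update(agent.target_critic, agent.critic, tau)
```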
Algorithm 1: PERDE-MADDPG algorithm.
Input: the total number of episodes
1: For i = 1 to N do
2:   Initialize the network parameters of the actor, critic, and corresponding target networks of UAV i;
3: End for
4: For episode = 1 to
5:   Initialize a random noise process
6:   Get initial state
7:   For t = 1 to T do
8:     For i = 1 to N do
9:       Choose action
10:    End for
11:    Execute joint actions
12:    Get reward
13:    Store
14:    If t mod
15:      For i = 1 to N do
16:        For j = 1 to K do
17:          Experience j is sampled with probability according to Eq. (16);
18:          Calculate the corresponding values according to Eq. (19);
19:          Update the priority
20:        End for
21:        Update critic network parameters by minimizing the loss function as in Formula (20);
22:        If t mod
23:          Update actor network parameters using Formula (21);
24:          Update target network parameters using Formula (22);
25:        End if
26:      End for
27:    End if
28:  End for
29: End for
According to the previous discussions, the proposed PERDE-MADDPG method can be summarized as Algorithm 1. The process of the PERDE-MADDPG algorithm is as follows: First, we initialize the network parameters of the actor, critic, and corresponding target networks of UAV
5. Experiments and Results
In this section, we conduct experiments in the MPE (multiagent particle environment) simulation environment to validate the effectiveness of the proposed PERDE-MADDPG algorithm. We evaluate our method by comparing it with three widely used reinforcement learning algorithms: MATD3 [30], MADDPG [31], and SAC [32]. In the following subsections, we first explore the parameter settings of the proposed algorithm and then evaluate its performance gains in terms of average reward, arrival rate, and path length.
5.1. Parameter Setting
The parameters of the four approaches follow the configurations in [33, 34], as shown in Table 1. The effectiveness of the reinforcement learning algorithms is strongly influenced by the number of neurons in the networks’ hidden layers as well as the learning rate of each network. Thus, to determine the ideal values for later trials, we first carry out tests to assess the effects of the number of neurons
Table 1
Parameter configuration in the algorithm.
Parameters | Value |
Batch size | 1024 |
Replay buffer | 1e6 |
Update interval | 50 |
Soft update rate | |
Priority parameter | 0.6 |
Reward discount factor | 0.95 |
Delayed update interval | 100 |
Number of steps in each episode | 100 |
The total number of episodes | 50,000 |
In the experiments shown in Figures 4 and 5, there are 4 UAVs and 12 obstacles, and their positions are randomly generated. Each UAV flies towards its destination while avoiding collisions with obstacles and other UAVs. Figures 4 and 5 display the mean episode reward (every 1000 episodes) in the training process of the PERDE-MADDPG method.
[figure(s) omitted; refer to PDF]
As shown in Figure 4, the number of neurons in the hidden layers changes from 32 to 256. From Figure 4, it can be observed that our method achieves the shortest convergence time and highest reward when the number of neurons is set to 64. As a result, setting the neuron number
5.2. Mean Reward Evaluation
Under the parameter settings described above, we evaluate the performance of our method based on the mean reward, i.e., the mean of the rewards obtained per episode during evaluation. The average reward is a significant indicator of a reinforcement learning algorithm’s learning efficiency and optimization capability; algorithms that achieve a higher mean reward are considered more efficient and accurate. In this subsection, we sum the rewards of the agents over 100 evaluation episodes, divide the sum by 100 to obtain the mean reward, and evaluate the performance of the different methods by comparing their average rewards.
Figure 6 displays the average rewards obtained by the four reinforcement learning methods under different scenarios, where the number of obstacles is fixed at 12, and the number of UAVs ranges from 2 to 6. From the graph, it can be observed that as the number of UAVs increases, the average rewards for each method also increase. This is because, with an increasing number of UAVs, although the number of times each UAV reaches its destination decreases, the overall number of successful UAV arrivals increases, resulting in higher rewards. Furthermore, we can also observe that regardless of the number of UAVs, our proposed PERDE-MADDPG algorithm achieves higher mean reward than the other three algorithms. For instance, when the number of UAVs is 5, our algorithm obtains a reward of 52.70, which is an improvement of 10.43% compared to MATD3, 17.27% compared to MADDPG, and 26.74% compared to SAC.
[figure(s) omitted; refer to PDF]
Figure 7 illustrates the average rewards obtained by the four reinforcement learning methods under different scenarios, where the number of UAVs is fixed at 2 and the number of obstacles ranges from 12 to 20. From the graph, it can be observed that as the number of obstacles increases, the average rewards of all methods decrease. This is because, with more obstacles, the likelihood of collisions increases, making it more challenging for the UAVs to reach their destinations and resulting in lower rewards. Additionally, regardless of the number of obstacles, our proposed PERDE-MADDPG algorithm achieves a higher mean reward than the other three algorithms. For instance, when the number of obstacles is 20, our algorithm obtains a reward of 18.75, which is an improvement of 7.10% compared to MATD3, 15.96% compared to MADDPG, and 25.18% compared to SAC.
[figure(s) omitted; refer to PDF]
5.3. Arrival Rate Evaluation
The arrival rate is an important metric for evaluating path planning algorithms. We evaluate the trained algorithm in different scenarios for 100 episodes. Each time all UAVs reach their destinations, we increment the counter for successful episodes. Therefore, the arrival rate of the algorithm in a particular scenario is defined as the number of successful episodes divided by 100.
Figure 8 shows the arrival rate of the four methods in scenarios where the number of obstacles is fixed at 12 and the number of UAVs ranges from 2 to 6. The graph shows that as the number of UAVs increases, the arrival rates of all four algorithms decline in a similar manner. This is because, with more UAVs, the likelihood of collisions between them increases, resulting in fewer successful episodes. Additionally, the graph indicates that our algorithm outperforms the other three methods for every fixed number of UAVs. For instance, when the number of UAVs is five, the arrival rate of the PERDE-MADDPG algorithm is 40.00%, which is 10.00%, 13.00%, and 16.00% higher than that of MATD3, MADDPG, and SAC, respectively.
[figure(s) omitted; refer to PDF]
Figure 9 illustrates the arrival rate of four different methods in various scenarios where the number of UAVs is 2 and the number of obstacles ranges from 12 to 20. From the graph, it is clear that the arrival rate of all four algorithms falls as the number of obstacles rises. This is because with more obstacles, the scenarios become more complex, and the likelihood of collisions between UAVs and obstacles increases, resulting in a decrease in the success rate. Additionally, the graph also demonstrates that regardless of the number of obstacles, our algorithm outperforms the other three methods. For instance, when the number of obstacles is 20, the success rate of the PERDE-MADDPG algorithm is 70.00%, which is 9.00%, 14.00%, and 17.00% higher than that of MATD3, MADDPG, and SAC, respectively.
[figure(s) omitted; refer to PDF]
5.4. Path Length Evaluation
In this section, we will validate the performance of our method by evaluating the total length of the planned paths. As stated in Equation (7), the objective of multi-UAV path planning is to minimize the total length of the planned paths. Therefore, a direct evaluation of path planning methods is to compare the total lengths of the planned paths. A more accurate method will generate shorter path lengths for the UAVs.
We will evaluate the trained algorithm in different scenarios for 100 episodes. If, in an evaluating episode, all UAVs reach their destinations, we will record their total path length. The path length of the planned paths in different scenarios is then defined as the total path length of all UAVs in successful episodes divided by the total number of successful episodes.
Figure 10 shows the path lengths of the four methods in the experiment with 12 obstacles and the number of UAVs varying from 2 to 6. It can be seen that as the number of UAVs increases, the path lengths of all methods also increase. This is because the total flying distance grows with the number of UAVs; moreover, as the number of UAVs increases, the probability of collisions between UAVs also increases, and to avoid collisions, UAVs tend to choose longer paths, further increasing the flying distance. From Figure 10, it can be observed that regardless of the number of UAVs used in the experiment, our PERDE-MADDPG method consistently plans shorter paths than the other methods. When the number of UAVs is five, our method achieves a path length of 89.78 km, which is 7.64%, 10.13%, and 13.76% shorter than the path lengths produced by the MATD3, MADDPG, and SAC algorithms, respectively.
[figure(s) omitted; refer to PDF]
Figure 11 displays the path lengths of four methods in the experiment with a fixed number of 2 UAVs and a varying number of obstacles from 12 to 20. It is evident that as the number of obstacles increases, the path lengths of all methods also increase. This is because as the number of obstacles increases, the likelihood of collisions between UAVs and obstacles also increases. UAVs need to navigate through more complex paths to avoid collisions with obstacles. From Figure 11, it can be observed that our PERDE-MADDPG method consistently plans paths with shorter lengths compared to the other methods, regardless of the number of obstacles. When the number of obstacles is 16, our method achieves a path length of 35.2 km, which is 5.98%, 10.19%, and 14.62% shorter than the path lengths produced by the MATD3, MADDPG, and SAC algorithms, respectively.
[figure(s) omitted; refer to PDF]
6. Conclusions
This paper proposed a path planning algorithm, PERDE-MADDPG, based on PER and a delayed update strategy. When selecting experiences from the experience pool, we employ a PER mechanism to utilize valuable experiences efficiently. On this basis, we adopt a delayed update strategy to address the issue of unstable updates during the agents’ training process. Experimental results show that the proposed PERDE-MADDPG algorithm outperforms the MATD3, MADDPG, and SAC algorithms in terms of mean reward, arrival rate, and planned path length in different environments. However, when the number of drones becomes very large, the PERDE-MADDPG algorithm suffers from convergence difficulties and poor training performance. We therefore intend to incorporate attention mechanisms in future work to further boost the algorithm’s performance.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62106202), the Key Research and Development Program of Shaanxi (No. 2024GX-YBXM-118), the Aeronautical Science Foundation of China (No. 2023M073053003), and the Fundamental Research Funds for the Central Universities.
[1] B. Fan, Y. Li, R. Zhang, Q. Fu, "Review on the technological development and application of UAV systems," Chinese Journal of Electronics, vol. 29 no. 2, pp. 199-207, DOI: 10.1049/cje.2019.12.006, 2020.
[2] Y. Tan, Y. Ni, W. Xu, Y. Xie, L. Li, D. Tan, "Key technologies and development trends of the soft abrasive flow finishing method," Journal of Zhejiang University-Science A, vol. 24 no. 12, pp. 1043-1064, DOI: 10.1631/jzus.A2300038, 2023.
[3] L. Li, P. Xu, W. Xu, B. Lu, C. Wang, D. Tan, "Multi-field coupling vibration patterns of the multiphase sink vortex and distortion recognition method," Mechanical Systems and Signal Processing, vol. 219, article 111624, DOI: 10.1016/j.ymssp.2024.111624, 2024.
[4] T. Feng, "Research on application of UAV remote sensing surveying and mapping technology in engineering surveying and mapping," International Journal of Geology, vol. 3 no. 1, 2021.
[5] Y. Li, M. Liu, D. Jiang, "Application of unmanned aerial vehicles in logistics: a literature review," Sustainability, vol. 14 no. 21, DOI: 10.3390/su142114473, 2022.
[6] L. Chang, X. Wenjun, Z. Peng, G. Qing, X. Zonghao, G. Chao, "UAV real-time route planning logical architecture in complex threat environment," Journal of Beijing University of Aeronautics and Astronautics, vol. 46 no. 10, pp. 1948-1957, 2020.
[7] J. Chen, Y. Zhang, L. Wu, T. You, X. Ning, "An adaptive clustering-based algorithm for automatic path planning of heterogeneous UAVs," IEEE Transactions on Intelligent Transportation Systems, vol. 23 no. 9, pp. 16842-16853, DOI: 10.1109/TITS.2021.3131473, 2022.
[8] C. Ren, J. Chen, C. Du, W. Yu, "Path planning algorithm for multiple UAVs based on artificial potential field," 2023 IEEE 11th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), pp. 970-974, DOI: 10.1109/ITAIC58329.2023.10408816.
[9] J. Chen, C. Du, Y. Zhang, P. Han, W. Wei, "A clustering-based coverage path planning method for autonomous heterogeneous UAVs," IEEE Transactions on Intelligent Transportation Systems, vol. 23 no. 12, pp. 25546-25556, DOI: 10.1109/TITS.2021.3066240, 2022.
[10] H. Qin, B. Zhao, L. Xu, X. Bai, "Petri-net based multi-objective optimization in multi-UAV aided large-scale wireless power and information transfer networks," Remote Sensing, vol. 13 no. 13, DOI: 10.3390/rs13132611, 2021.
[11] J. Chen, Y. He, Y. Zhang, P. Han, C. Du, "Energy-aware scheduling for dependent tasks in heterogeneous multiprocessor systems," Journal of Systems Architecture, vol. 129, article 102598, DOI: 10.1016/j.sysarc.2022.102598, 2022.
[12] J. Chen, P. Han, Y. Zhang, T. You, P. Zheng, "Scheduling energy consumption-constrained workflows in heterogeneous multi-processor embedded systems," Journal of Systems Architecture, vol. 142, article 102938, DOI: 10.1016/j.sysarc.2023.102938, 2023.
[13] L. Junlan, Z. Wenbo, J. Hongbing, Z. Mingzhe, "A summary of UAV swarm path planning algorithm research," Aerospace Electronic Warfare, vol. 38 no. 1, 2022.
[14] C. Kang, C. Rong, W. Ren, F. Huo, P. Liu, "Deep deterministic policy gradient based on double network prioritized experience replay," IEEE Access, vol. 9, pp. 60296-60308, DOI: 10.1109/ACCESS.2021.3074535, 2021.
[15] K. Wan, D. Wu, B. Li, X. Gao, Z. Hu, D. Chen, "Me-MADDPG: an efficient learning-based motion planning method for multiple agents in complex environments," International Journal of Intelligent Systems, vol. 37 no. 3, pp. 2393-2427, DOI: 10.1002/int.22778, 2022.
[16] H. Ergezer, K. Leblebicioglu, "Path planning for UAVs for maximum information collection," IEEE Transactions on Aerospace and Electronic Systems, vol. 49 no. 1, pp. 502-520, DOI: 10.1109/TAES.2013.6404117, 2013.
[17] S. Aggarwal, N. Kumar, "Path planning techniques for unmanned aerial vehicles: a review, solutions, and challenges," Computer Communications, vol. 149, pp. 270-299, DOI: 10.1016/j.comcom.2019.10.014, 2020.
[18] L. E. Kavraki, M. N. Kolountzakis, J.-C. Latombe, "Analysis of probabilistic roadmaps for path planning," IEEE Transactions on Robotics and Automation, vol. 14 no. 1, pp. 166-171, DOI: 10.1109/70.660866, 1998.
[19] S. Karaman, M. R. Walter, A. Perez, E. Frazzoli, S. Teller, "Anytime motion planning using the RRT*," 2011 IEEE International Conference on Robotics and Automation, pp. 1478-1483, DOI: 10.1109/ICRA.2011.5980479.
[20] F. Yan, Y. Zhuang, J. Xiao, "3D PRM based real-time path planning for UAV in complex environment," 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1135-1140, DOI: 10.1109/ROBIO.2012.6491122.
[21] J. Song, B. Dai, E. Shan, H. He, "An improved RRT path planning algorithm," Acta Electonica Sinica, vol. 38 no. 2A, 2010.
[22] J. Li, Y. Huang, Z. Xu, J. Wang, M. Chen, "Path planning of UAV based on hierarchical genetic algorithm with optimized search region," 2017 13th IEEE International Conference on Control & Automation (ICCA), pp. 1033-1038, DOI: 10.1109/ICCA.2017.8003203.
[23] J. L. Foo, J. Knutzon, V. Kalivarapu, J. Oliver, E. Winer, "Path planning of unmanned aerial vehicles using B-splines and particle swarm optimization," Journal of Aerospace Computing, Information, and Communication, vol. 6 no. 4, pp. 271-290, DOI: 10.2514/1.36917, 2009.
[24] M. Dorigo, M. Birattari, T. Stutzle, "Ant colony optimization," IEEE Computational Intelligence Magazine, vol. 1 no. 4, pp. 28-39, DOI: 10.1109/MCI.2006.329691, 2006.
[25] C. Huang, X. Zhou, X. Ran, J. Wang, H. Chen, W. Deng, "Adaptive cylinder vector particle swarm optimization with differential evolution for UAV path planning," Engineering Applications of Artificial Intelligence, vol. 121, article 105942, DOI: 10.1016/j.engappai.2023.105942, 2023.
[26] J. Li, Y. Xiong, J. She, "UAV path planning for target coverage task in dynamic environment," IEEE Internet of Things Journal, vol. 10 no. 20, pp. 17734-17745, DOI: 10.1109/JIOT.2023.3277850, 2023.
[27] W. Jun, Y. Yunxiao, L. Li, "Based mobile robot path planning based on improved deep reinforcement learning," Electronic Measurement Technology, vol. 44 no. 22, 2021.
[28] Z. Han, X. Mingyang, Z. Min, W. Naiqi, "Path planning of mobile robot with fusion DDPG algorithm," Control Engineering of China, vol. 28 no. 11, pp. 2136-2142, 2021.
[29] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, L. Hanzo, "Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing," IEEE Transactions on Cognitive Communications and Networking, vol. 7 no. 1, pp. 73-84, DOI: 10.1109/TCCN.2020.3027695, 2021.
[30] J. Ackermann, V. Gabler, T. Osa, M. Sugiyama, "Reducing overestimation bias in multi-agent domains using double centralized critics," 2019. https://arxiv.org/abs/1910.01465
[31] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," Advances in Neural Information Processing Systems, vol. 30, 2017.
[32] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," International conference on machine learning, pp. 1861-1870, 2018.
[33] J. Chen, T. Li, Y. Zhang, T. You, Y. Lu, P. Tiwari, N. Kumar, "Global-and-local attention-based reinforcement learning for cooperative behaviour control of multiple UAVs," IEEE Transactions on Vehicular Technology, vol. 73 no. 3, DOI: 10.1109/TVT.2023.3327571, 2024.
[34] J. Chen, F. Ling, Y. Zhang, T. You, Y. Liu, X. Du, "Coverage path planning of heterogeneous unmanned aerial vehicles based on ant colony system," Swarm and Evolutionary Computation, vol. 69, article 101005, DOI: 10.1016/j.swevo.2021.101005, 2022.
Copyright © 2024 Chongde Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/