Reinforcement learning (RL), an emerging interdisciplinary field formed by the integration of artificial intelligence and control science, is showing a cross-disciplinary development trend led by artificial intelligence and has become a research hotspot in optimal control. This paper systematically reviews the development trajectory of RL, focusing on the intrinsic connection between single-agent reinforcement learning (SARL) and multi-agent reinforcement learning (MARL). Starting from the formation and development of RL, it first elaborates on the similarities and differences between RL and other learning paradigms in machine learning, and briefly introduces the main branches of current RL. Then, taking the basic knowledge and core ideas of SARL as the underlying framework and extending them to multi-agent system (MAS) cooperative control, it explores the continuity of the two in theoretical frameworks and algorithm design. On this basis, the paper organizes SARL algorithms into dynamic programming, value function-based and policy gradient (PG) types, and abstracts MARL algorithms into four paradigms: behavior analysis, centralized learning, communication learning and collaborative learning, thereby establishing an algorithm mapping from single-agent to multi-agent scenarios. This framework provides a new perspective for understanding the evolutionary relationship between the two classes of methods, and the paper also discusses the challenges and possible solutions of MARL for large-scale MAS problems. The paper aims to provide a reference for researchers in this field and to promote the development of cooperative control and optimization methods for MAS as well as related application research.
Introduction
In recent years, reinforcement learning (RL), a fundamental artificial intelligence technology, has garnered significant attention. In 2024 (Ministry of Industry and Information Technology of the People’s Republic of China 2024), the Notice on Printing and Issuing the Guide for the Construction of the National Artificial Intelligence Industry Comprehensive Standardization System (2024 Edition), jointly issued by China’s Ministry of Industry and Information Technology and four other departments, emphasized the important strategic position of artificial intelligence technology. As a fundamental technology that integrates multiple fields of artificial intelligence, reinforcement learning is closely associated with adaptive dynamic programming (ADP) (Liu and Wei 2014; Xue et al. 2021, 2022; Zhao et al. 2025) and deep learning (Zhang et al. 2025a; Qin et al. 2025c), and has permeated multiple disciplinary fields. It has produced a large number of significant results in artificial intelligence, biology and brain-like computing (Shen et al. 2021; Neftci and Averbeck 2019; Zhang and Wang 2023; Uc-Cetina et al. 2023; Kaloev and Krastev 2023; Liu et al. 2021a, 2017), and has promoted the integration and development of related disciplines. Many scholars in the field hold a positive attitude towards the development prospects of RL technology, believing that it has shown great potential in solving a series of challenging problems and will play an important role in promoting basic theoretical research on complex system intelligence.
The concept of RL was first used in psychology and neuroscience, and its origins date back to the 1950s. In 1950, psychologist B.F. Skinner proposed the concept of operant conditioning by studying animal behavior through experiments: the outcome (reward or punishment) that follows an action can increase or decrease the probability that the action will occur in the future (Skinner 1956, 1958). This principle of self-improvement through trial and error is very close to the principle of RL today and is therefore considered a foundation of RL. In the 1960s, Marvin Minsky proposed a novel approach to the “brain-model” problem in his paper (Minsky 2007), aiming to design computational algorithms by mimicking the working principles of the human brain; the concepts of “reinforcement operator”, “reinforcement process” and “reinforcement system” are explicitly mentioned, as well as trial-and-error learning. After the 1960s, the development of RL was slow for a long period, and much RL research gradually shifted to supervised learning algorithms, such as the MENACE system proposed by Donald Michie to solve the Tic-Tac-Toe game. Although this algorithm also emphasized the concept of reward and punishment, it is actually a pattern recognition and perceptual learning system (Anderson 1983).
Not until the 1980s and 1990s did Fukushima et al. (1983) explore the use of neuron-like adaptive elements to address complex learning control problems and their relationship with classical and operant conditioning in animal learning research. Watkins and Dayan (1992) first proposed the Q-learning model, a model-free RL algorithm that seeks the optimal policy based on the action-value function, which has since formed the foundation of modern RL theory. However, when handling complex systems and high-dimensional state spaces, conventional RL proves insufficient. In 2015, Mnih et al. (2015) at DeepMind proposed a deep learning-based RL model. This model was the first to combine deep learning with RL, significantly expanding the application scope of RL and realizing end-to-end learning.
David Silver, head of Technology at DeepMind, put it this way: “AI is RL plus deep learning!”. In recent years, the development of RL algorithms has entered a stage of diversification, showing a trend toward interdisciplinary integration, and RL has been more widely used in game development, autonomous driving, healthcare and robotics (Bai et al. 2024; Karimi et al. 2024; Ma et al. 2024; Yan et al. 2023; Rattal et al. 2025; Cao et al. 2024; Iskandar et al. 2023; Diprasetya et al. 2024; Guan 2020; Rawat and Rana 2023; Xu et al. 2022; Ahad et al. 2021; Qin et al. 2025b; Zhang et al. 2024; Sun et al. 2024). RL algorithms are also developing in the direction of greater practicality and scalability.
The RL algorithm primarily addresses intelligent decision-making by continuously adapting decisions based on the current state to ultimately achieve a given goal. A system consisting of multiple agents is called a multi-agent system (MAS); in general, the agents are able to sense the environment, make decisions, and perform actions to achieve a specific goal or task. Each agent exhibits a certain degree of autonomy and can independently control its own behavior and internal state without direct intervention from external entities. In recent years, with the rapid development of RL algorithms and their remarkable achievements in many fields, researchers have begun to turn their attention to MASs. Many scholars have applied RL algorithms to MASs, forming the direction of multi-agent reinforcement learning (MARL) (Littman 1994; Zhang et al. 2021; Liu and Wang 2022; Malathy et al. 2024). This creative combination attempts to solve complex tasks with higher-dimensional state spaces in complex environments while retaining the core idea of trial-and-error learning (Sui et al. 2023; Waltz and Fu 1965). Compared with single-agent RL (SARL), MARL leverages collective intelligence and collaboration to model complex social interactions such as dialogue, cooperation, and competition, which is highly significant for understanding human social behavior and designing socio-technical systems. At present, MARL has been widely used in mobile robots (Schwung et al. 2019a; Zhang et al. 2023; Qin et al. 2025b; Zhang et al. 2025b), aerospace (Saeed et al. 2024; Xia et al. 2023; Wang et al. 2022a), unmanned aerial vehicle (UAV) formation cooperative control (Xue and Chen 2023; Xing et al. 2024; Zhao et al. 2023a; Liu et al. 2023), medical resource allocation (Alelaiwi 2020; Kim 2023), autonomous driving (Yu et al. 2020; Kiran et al. 2022; Yu et al. 2019), market economy simulation (Kell et al. 2020; Zhu et al. 2023a; Wei et al. 2022; Shi et al. 2018) and other fields.
The methods for generalizing RL algorithms to MASs are diverse and complex. At present, the two mainstream paradigms are independent learning and joint learning. In independent learning, each agent in the system executes an RL algorithm independently and regards the other agents as the environment or part of the environment, as in Q-learning, Deep Q-Network (DQN), Sarsa, A3C and similar algorithms; the common feature of these algorithms is that each agent updates its own policy or value function independently, without directly considering the policies of other agents. In joint learning, the agents jointly learn a global policy based on the joint observations of all agents in the system and are trained to take a unified action using a joint reward signal. Examples include MOON (Multi-agent Optimization On Networks) (Li et al. 2024a), Multi-Agent Proximal Policy Optimization (MAPPO) (Yu et al. 2022), Value Decomposition Network (VDN) (Sunehag et al. 2017) and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) (Wu et al. 2024); these algorithms are built on joint learning and handle the interactions and dependencies between agents in different ways to achieve effective distributed learning. Compared with independent learning, joint learning allows the model to be personalized to each user’s situation and allows multiple users to participate in the training process. These advantages make joint learning an effective machine learning paradigm in scenarios that need to protect data privacy, exploit data diversity, improve model performance and reduce cost. Although joint learning has many advantages, it also has disadvantages and challenges, such as high communication cost, difficulty in model convergence, and data labeling and standardization. Based on this, in recent years many researchers have proposed MARL algorithms that lie between independent learning and joint learning, such as Co-Forest (Li and Zhou 2007) and Co-Trade (Zhang and Zhou 2011). These algorithms usually combine the characteristics of the two, maintaining a certain degree of distributed computing while sharing information or cooperating to some extent. In summary, with technological progress and growing application requirements, MARL algorithms are expected to achieve a wider range of applications in many fields.
In recent years, MARL algorithms and applications have shown a vigorous development trend and are widely used in numerous fields. However, there are still some unresolved issues in certain aspects. For instance, how to limit the impact of approximation errors on iteration, how to improve the processing efficiency of the algorithm for high-dimensional systems, how to enhance the intelligent characteristics of RL algorithms by integrating brain science and brain-inspired intelligence, and how to achieve virtual reality interaction by combining parallel control technology. Existing reviews (Sun and Mu 2020; Zhang et al. 2021; Xing-Xing et al. 2020; Wang et al. 2024a) of multi-agent learning mostly focus on a specific problem or scenario, and there are few comprehensive reviews in the literature that cover the transition from SARL to MAS. This paper aims to systematically expound the development process from SARL to MARL and the application of relevant algorithms in industrial practice, with the expectation of providing a relatively systematic reference for relevant technical personnel. In conclusion, RL will play an important role in promoting the basic theoretical research and key technology development of complex system intelligence.
The remainder of the paper is arranged as follows. An overview of RL is given in Sect. 2, including RL within machine learning in Sect. 2.1 and the branches of RL in Sect. 2.2. Sect. 3 introduces SARL, mainly covering the development of basic concepts and algorithms, while MARL is covered in Sect. 4. The progress of multi-agent methods in industrial applications is described in Sect. 5. The conclusion and outlook are summarized in Sect. 6.
Overview of reinforcement learning
As a pivotal learning paradigm within the realm of machine learning, RL occupies a crucial position. Unlike supervised and unsupervised learning, two of the main paradigms in machine learning, RL focuses on the decision-making processes of autonomous agents, especially sequential decision-making. Figure 1 illustrates the fundamental principles of RL, depicting the process by which an agent interacts with its environment to learn an optimal policy, with the ultimate goal of maximizing cumulative rewards within a given context. Consequently, RL is widely recognized as a leading approach towards the development of artificial intelligence.
Fig. 1 The process of the agent interacting with the environment
Reinforcement learning in machine learning
RL, a key branch of machine learning, emphasizes the sequential decision-making of autonomous agents, which distinguishes it from supervised learning. The RL mechanism is the process by which an agent learns an optimal policy through interaction with the environment. RL primarily addresses decision-making problems in uncertain environments. Such uncertainty manifests in two ways: either the environment itself contains critical information that is inaccessible, or the agent cannot fully perceive the environmental state due to its own limitations. The core research question is how to select the optimal action from a set of candidate actions so as to achieve long-term objectives.
Compared with RL, supervised learning has a clear supervision target, and the supervisor provides the correct answer in each sample. Supervised learning is a training method in machine learning and artificial intelligence (Rosenblatt 1958) in which an algorithm is trained on a labeled data set to predict a target variable. It covers two basic problems: regression and classification. Regression uses algorithms to understand the relationship between independent and dependent variables in order to predict the trend of continuous variables; classification predicts the values of discrete variables, accurately assigning test data to specific categories. Common algorithms for regression and classification include linear regression (Gauss 1809), logistic regression (Hosmer and Lemeshow 2000), decision trees (Geurts et al. 2009), support vector machines (Isla-Cernadas et al. 2025) and random forests (Breiman 2001).
Following supervised learning, Lloyd et al. (1982) proposed the K-means clustering algorithm in the study of quantization for pulse code modulation (PCM) in the middle of the 20th century, which marked the formation of the concept of unsupervised learning. In unsupervised learning, the training data are given in unlabeled form and there is no reward signal indicating correct or incorrect answers. Common clustering algorithms include K-means (Arthur and Vassilvitskii 2007), hierarchical clustering (Kobren et al. 2017; Lance and Williams 1967) and DBSCAN (Schubert et al. 2017). Association refers to discovering relationships and patterns between data items, using different rules to find relationships among variables in a given data set; common association algorithms include the Apriori algorithm (Agrawal et al. 1993), the FP-Growth algorithm (Han et al. 2000) and the DHP algorithm (Park et al. 1995). Dimensionality reduction projects high-dimensional data into a low-dimensional space while preserving data integrity, reducing the data dimension to make visualization and subsequent analysis easier; common dimensionality reduction algorithms include principal component analysis (PCA) (Pearson 1901) and t-SNE (Van der Maaten and Hinton 2008).
Therefore, RL can be roughly regarded as a type of learning that sits between supervised and unsupervised learning. In addition to the supervised and unsupervised paradigms discussed above, there are other machine learning paradigms such as semi-supervised learning (Chapelle et al. 2009; Yang et al. 2022) and self-supervised learning (Liu et al. 2021b; Gui et al. 2024). In recent years, these paradigms, together with RL methods, have constituted a new landscape of diverse development in machine learning.
Branches of reinforcement learning
At present, RL is developing in the direction of diversification. Researchers have creatively combined RL with related technologies to expand its applicability and to compensate for the inherent shortcomings of RL algorithms with the advantages of other machine learning methods, with good results on practical problems. Emerging methods have been derived, as shown in Table 1, including but not limited to Meta-reinforcement learning (Meta-RL) (Qu et al. 2021; Wang et al. 2024b), offline RL (Yang et al. 2023; Prudencio et al. 2023), integral RL (IRL) (Xue et al. 2025a, b), and transfer RL (TRL) (Chen et al. 2023; Zhu et al. 2023b). In the following, each subfield is introduced separately.
Table 1. Branch classification comparison of RL methods
Branch name | Core objective | Key technologies | Application scenario |
|---|---|---|---|
SARL | Learn the optimal policy to maximize cumulative rewards | Value function; policy function | Game field, simple robot control |
Meta-RL | Learn “learning strategies” to achieve rapid adaptation across tasks | Model-agnostic meta-learning; meta-policy optimization | Multitask execution of robots, autonomous driving |
TRL | Transfer source-task knowledge to the target task to accelerate learning | Parameter transfer; feature representation transfer; policy transfer | Robotics field, medical field |
HRL | Break down complex tasks into hierarchical subtasks to reduce decision complexity | Separation of high-level and low-level strategies | Logistics warehousing, military missions |
MARL | Multiple agents learn individual strategies while considering cooperation or competition | Joint learning; credit assignment; policy coordination | Traffic flow control, multi-robot collaboration |
Meta-reinforcement learning
The concept of Meta-RL comes from Meta-learning and RL, forming a cross field that fuses the advantages of the two. It aims to improve the learning efficiency of algorithms on complex tasks by integrating existing knowledge. Meta-learning leverages prior knowledge to enhance adaptation to new tasks, enabling the network to learn more efficiently; its essence is to increase the generalization ability of the learner across multiple tasks. Traditional RL usually needs a large amount of sample data to train a model when facing a new task, and this process often starts learning from scratch, resulting in high training cost and low sample efficiency. To solve these problems, Meta-learning is integrated into existing RL algorithms: Meta-RL focuses on learning how to learn a model that can perform well on new tasks with limited training data. This extends the original single-task framework of RL to the multi-task framework of Meta-learning, and good results have been obtained in practical applications. At present, Meta-RL has wide application prospects in game development, robot control, network optimization and industrial manufacturing processes. At the same time, many challenges remain, such as how to improve sample efficiency, how to give Meta-RL algorithms sufficient generalization ability, and how to reduce computational complexity when training on multiple tasks.
Transfer reinforcement learning
TRL is a method that applies the concept of transfer learning to RL algorithms, aiming to accelerate the learning process by transferring existing policies, models or experience. Transfer learning involves applying pretrained model parameters to a new model, facilitating rapid adaptation. The goal is to use the knowledge learned in a source task to help the target task learn faster. Through transfer, the training time on the target task can be reduced and its final learning performance improved. In traditional RL algorithms, whether value function-based or policy search-based, the agent needs to be retrained whenever the environmental conditions or the task change, and this retraining comes at a huge cost. To address this issue, researchers have integrated transfer learning into RL, leveraging prior knowledge from other tasks to enhance learning efficiency and effectiveness. This reduces the time and sample requirements of learning from scratch, which is especially important in scenarios with scarce data or complex tasks.
According to different classification criteria, TRL can be divided into different transfer types. For example, by task, TRL can be divided into single-task transfer and multi-task transfer; by the type of knowledge transferred, it can be divided into policy transfer, value function transfer, feature transfer and experience replay transfer. By applying transfer learning to RL algorithms, the agent can obtain better performance with less sample data. Meanwhile, leveraging prior experience significantly reduces the exploration time needed to discover optimal strategies, enhancing the agent’s adaptability while lowering the data requirements for model training. At present, TRL has been widely used in industrial process control and speech processing systems.
Hierarchical reinforcement learning
Different from TRL, hierarchical reinforcement learning (HRL), as a branch of RL, is not a fusion of two existing classes of methods but a hierarchical policy that decomposes the ultimate goal of RL into multiple sub-tasks and solves the complex problem by addressing the sub-problems one by one in a divide-and-conquer manner. There are two common approaches to subproblem decomposition: in one, all subproblems are shared tasks; in the other, the results of the previous subproblem are fed into the solution of the next subproblem (reused tasks). Common HRL methods fall mainly into four categories: option-based learning, learning based on hierarchical abstract machines, learning based on MaxQ function decomposition, and end-to-end learning. Although many theoretical breakthroughs in HRL are still to be made, its applications in several fields already show the potential of this class of algorithms for large-scale RL. HRL is therefore one of the most cutting-edge areas of RL.
As an evolving field of artificial intelligence, RL encompasses various specialized approaches beyond the three branches above, including inverse RL, SARL, and MARL. It is worth noting that each branch of RL presents unique challenges. Inverse RL needs to accurately recover the reward function from limited demonstration data, facing the double test of data scarcity and model generalization ability. In complex environments, SARL needs to balance exploration and exploitation to achieve efficient learning and decision-making. MARL faces challenges such as information asymmetry, communication cost and the design of cooperative policies among agents. However, it is precisely these unique challenges that make the various branches of RL play irreplaceable roles in different application scenarios.
In light of this, this paper focuses on SARL and MARL, using these two fields as the main thread for a comprehensive and in-depth discussion. We analyze the theoretical foundations, key technologies and application cases of these two sub-fields, in order to show readers the unique appeal and development prospects of RL in single-agent and multi-agent scenarios. The main research content of this paper is shown in Fig. 2.
Fig. 2 General block diagram of the research content
Single-agent reinforcement learning
SARL refers to applying RL algorithms to a single agent, and in the literature it is often simply referred to as the RL process. It enables decision-making in uncertain environments and improves system performance through continuous learning and interaction. Therefore, SARL is regarded as a key technology for advancing artificial intelligence.
To facilitate the introduction of the following content, the following definitions are given:
Definition 1
Policy $\pi$ denotes the probability of taking an action $a$ given the input state $s$, denoted by $\pi(a\mid s)$. The mathematical expression is
$$\pi(a\mid s) = P(A_t = a \mid S_t = s) \tag{1}$$
In fact, a policy represents a mapping from the state space to the action space.
Definition 2
Value function is commonly used to evaluate the goodness of a state or state-action pair. The value function mainly takes two forms, the state value function and the action value function, denoted as $V^{\pi}(s)$ and $Q^{\pi}(s,a)$, respectively.
The state value function $V^{\pi}(s)$ represents the expected cumulative reward obtained by acting according to policy $\pi$ starting from state $s$, and is denoted as
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1} \,\middle|\, S_t = s\right] \tag{2}$$
The action value function $Q^{\pi}(s,a)$ represents the expected cumulative reward obtained by taking action $a$ in state $s$ and thereafter following policy $\pi$, which is denoted as
$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] \tag{3}$$
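As a concrete illustration of the discounted cumulative reward inside these expectations, the following minimal Python sketch computes the return from a single sampled reward sequence; the rewards and the discount factor are hypothetical example values, and the expectation of this quantity over trajectories is what defines $V^{\pi}$ and $Q^{\pi}$.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for a sampled reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

# Example: rewards collected after time t along one trajectory
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```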
Probabilistic foundations of reinforcement learning
The probabilistic foundations of RL are central to understanding and designing effective RL algorithms. From the above definitions, it is clear that the policy is defined as a conditional probability distribution and that the state transition matrix is also closely related to probability, so probability plays an important supporting role in RL. First, real-world environments are usually full of uncertainty, so RL generally adopts stochastic policies. The benefit of a stochastic policy is that it couples exploration to the sampling process: through a stochastic policy, the agent can adjust flexibly in a dynamic environment and better adapt to it. This exploration is the key to discovering unknown states and potentially better policies. In addition, practical environments contain various kinds of noise, which may lead to overestimation, i.e., the agent’s estimate of a state or action value (such as the Q-value) may be higher than its true value. Such noise may lead the agent to adopt suboptimal strategies, affecting the robustness of the policy and the convergence of the algorithm and degrading system performance. Probability therefore provides a mathematical framework for modeling and handling the various uncertainties present in the environment.
A stochastic policy defines a probability distribution over actions conditioned on the current state, and stochastic policies are widely used in modern RL algorithms, so it is necessary to understand them. Understanding random variables and probability distributions is a prerequisite for understanding stochastic policies. A random variable is a variable that can take different values at random; in a Markov decision process (MDP), the action at the current time step is a random variable, usually denoted by a lowercase letter. The probability distribution of a random variable defines the likelihood of each possible value and is generally classified as discrete or continuous. A discrete probability distribution is usually described by a probability mass function, such as the binomial and Poisson distributions. Continuous probability distributions are often described by probability density functions, such as the normal and uniform distributions.
The most commonly used stochastic policies in RL algorithms are as follows:
Greedy policy
The greedy policy does not consider other possible actions and focuses only on the current best choice. For a given state $s$, the greedy policy chooses the action that maximizes the action-value function $Q(s,a)$, which is mathematically expressed as
$$\pi(s) = \arg\max_{a\in A} Q(s,a) \tag{4}$$
Strictly speaking, the greedy policy is deterministic. However, to promote exploration and enhance algorithm robustness, it may incorporate some randomness, in which case it can be viewed as a special case of a stochastic policy.
ε-Greedy policy
The ε-greedy policy is one of the most basic and most commonly used stochastic policies in RL algorithms. With probability $1-\varepsilon$ the agent chooses the action that maximizes the current action-value function, while with probability $\varepsilon$ it chooses among the actions with equal likelihood, which is mathematically expressed as
$$\pi(a\mid s) = \begin{cases} 1-\varepsilon+\dfrac{\varepsilon}{|A|}, & a=\arg\max_{a'} Q(s,a') \\ \dfrac{\varepsilon}{|A|}, & \text{otherwise} \end{cases} \tag{5}$$
Compared with the pure greedy policy, the ε-greedy policy can control the balance between exploration and exploitation through the parameter ε, thereby encouraging the agent to explore the environment and achieving a better trade-off between the two.
Gaussian policy
In a Gaussian policy, each action $a$ is sampled from a parameterized Gaussian distribution with mean $\mu_\theta(s)$. The score function of the Gaussian policy is given by
$$\nabla_\theta \log \pi_\theta(a\mid s) = \frac{\big(a-\mu_\theta(s)\big)\,\nabla_\theta\mu_\theta(s)}{\sigma^{2}} \tag{6}$$
Since actions sampled from the policy are normally distributed around the mean, we can update the parameter $\theta$ so that the sampled actions concentrate on high-reward behaviors as much as possible. Mathematically, the sampled action is expressed as follows
$$a = \mu_\theta(s) + \xi, \qquad \xi \sim \mathcal{N}(0,\sigma^{2}) \tag{7}$$
where $\mu_\theta(s)$ is the deterministic part and $\xi$ is Gaussian random noise with zero mean. This policy can control the balance between exploration and exploitation by adjusting the standard deviation of the Gaussian distribution. The Gaussian policy is widely used in RL algorithms where the action space is continuous.
Boltzmann distribution policy
The Boltzmann distribution policy is suitable for discrete action spaces, or for cases where the action space is not large, and is mathematically expressed as
$$\pi(a\mid s) = \frac{\exp\big(Q(s,a)\big)}{\sum_{a'\in A}\exp\big(Q(s,a')\big)} \tag{8}$$
where $Q(s,a)$ is the action-value function. From the above equation, it is easy to see that the action-value function is closely related to the Boltzmann distribution policy: the larger the action value, the higher the probability that the action is selected.
These concepts are the foundation of probability and statistics, and they have applications in data analysis, risk assessment, decision theory, and a variety of scientific and engineering fields. In RL algorithms, these concepts are used to model and analyze the interaction between the agent and the environment, as well as to evaluate and improve the agent’s policy and decision process.
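To make the two most common discrete-action policies above concrete, the following minimal Python sketch (NumPy only) samples an action under an ε-greedy policy and under a Boltzmann (softmax) policy; the Q-values, ε and random seed are illustrative assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 2.5, 0.3])  # hypothetical Q(s, a) for three actions

def epsilon_greedy(q, eps=0.1):
    """With probability eps explore uniformly, otherwise pick argmax Q(s, a)."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def boltzmann(q):
    """Sample an action with probability proportional to exp(Q(s, a))."""
    probs = np.exp(q - q.max())          # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

print(epsilon_greedy(q_values), boltzmann(q_values))
```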
Markov decision process
MDP is a mathematical model for sequential decision making whose roots lie in the Markov chain studied by the Russian mathematician Andrey Andreyevich Markov (1906). The Markov chain is a stochastic process characterized by the Markov property, defined on a discrete index set and state space. We consider that the next state $S_{t+1}$ of a system is related only to the current state $S_t$ and has nothing to do with earlier states. A state satisfying such a property is said to have the Markov property, where
$$P(S_{t+1}\mid S_t) = P(S_{t+1}\mid S_1, S_2, \ldots, S_t) \tag{9}$$
The Markov process can be represented by a two-tuple $(S, P)$, where $S$ is a finite set of states and $P$ is the state transition probability. The state transition matrix can be expressed as
$$P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}, \qquad P_{ij} = P(S_{t+1}=s_j \mid S_t = s_i) \tag{10}$$
RL interacts with the environment through actions and obtains rewards from the environment. This feedback mechanism requires a richer description than the state space alone, whereas a Markov process has neither actions nor rewards. Therefore, a pure Markov process is not sufficient to describe the RL process. In the 1950s, Bellman (1957a, 1957b) applied Markov chains to decision processes and proposed the concept of the MDP. The principle is illustrated in Fig. 3.
Fig. 3 Markov decision process
In RL, the feedback that the environment provides for the agent’s actions is assumed to satisfy the Markov property. Therefore, the RL problem can be transformed into an MDP, that is, the process of the agent interacting with the environment can be modeled as an MDP. An MDP is represented by a five-tuple $(S, A, P, R, \gamma)$, where $S$ is a finite set of states, $A$ denotes a finite set of actions, $P$ represents the state transition probability, $R$ stands for the reward function, and $\gamma$ denotes the discount factor used to calculate the cumulative return. In RL tasks, the state space $S$ and action space $A$ can be either discrete or continuous. In practical applications, however, the states and actions of a system are often continuous or high-dimensional discrete. Furthermore, when the state is not fully observable, the MDP becomes a partially observable MDP (POMDP). Compared with the fully observable MDP, the POMDP is usually more complicated; in this paper we focus on the fully observable case.
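As an illustration of the five-tuple above, the following minimal Python sketch encodes a toy two-state MDP as plain data structures; the states, actions, transition probabilities and rewards are hypothetical examples (with the reward depending only on the state-action pair for simplicity), not values from the text, and the same structure is reused conceptually by the dynamic programming methods discussed later.

```python
# A toy MDP (S, A, P, R, gamma) with two states and two actions.
# P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
states = ["s0", "s1"]
actions = ["stay", "move"]
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 0.9), ("s1", 0.1)]},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}
gamma = 0.9  # discount factor
```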
Classification of single-agent algorithms
SARL algorithms are mainly divided into model-based dynamic programming methods, value function-based policy learning algorithms and policy search algorithms. The algorithm classification is given in Table 2 and a comparative analysis in Table 3. Because the update rate of RL algorithms is increasingly rapid, it is quite difficult to classify different algorithms accurately and comprehensively. With the acceleration of research progress, the development of and connections among RL algorithms constitute a complex and rich network: algorithms appearing in different periods inspire, reference and integrate one another, from model-based dynamic programming methods to model-free value function-based policy learning algorithms and policy search algorithms. The Actor-Critic (AC) algorithm combines the PG method and the value function method in order to obtain the advantages of both. In recent years, the AC algorithm has been continuously improved and extended to adapt to more complex tasks and environments.
Table 2. Classification of SARL algorithms
Algorithm classification | Classical algorithm name |
|---|---|
Dynamic programming | Value iteration, Policy iteration |
Value function-based | MC, TD, Sarsa, Q-learning, DQN, Double DQN, Dueling DQN |
Policy gradient-based | A2C, A3C, TRPO, PPO, ACER, GAE, DPG, DDPG, TD3, SAC |
Table 3. Analysis and comparison of SARL algorithms
Classical algorithm name | Parallel capability | Core idea/Applicable scene |
|---|---|---|
Value iteration | – | Given the initial values, iterate and update until convergence to the optimal value function |
Policy iteration | – | Carry out policy evaluation and policy improvement alternately until the policy converges |
MC | – | Estimate expected values by random sampling after complete episodes |
Sarsa | – | Learn policy and value functions directly from interactions |
Q-learning | – | Learn the Q-function to estimate the expected return of each state-action pair |
Double DQN | – | Decouple action selection from action evaluation |
Dueling DQN | – | Learn state value and advantage functions to get Q-value indirectly |
A2C | Synchronous | Continuous state space, continuous action space |
A3C | Asynchronous | Continuous state space, continuous action space, high-dimensional input |
TRPO | Synchronous | Continuous state space, continuous action space, high-dimensional input |
PPO | Synchronous | Continuous state space, continuous action space |
ACER | Asynchronous | Discrete action space, continuous action space |
GAE | Synchronous | Continuous state space, continuous action space, high-dimensional input |
DPG | Synchronous | Continuous state space, high-dimensional action space |
TD3 | Synchronous | Continuous state space, high-dimensional action space |
SAC | Synchronous | Continuous state space, high-dimensional action space |
Broadly, RL is a sequential decision problem aimed at finding the optimal policy that maximizes the expected cumulative reward. As mentioned in the previous section, an MDP can be used to describe sequential decision problems in RL; of course, sequential decision problems are rich and diverse and also include non-Markovian processes. Based on whether the transition probability P in the five-tuple of an MDP is known, an approach is classified as either a model-based dynamic programming algorithm or a model-free RL algorithm. As shown in Table 2, the model-based dynamic programming class includes the policy iteration and value iteration algorithms, while model-free RL algorithms are further divided into online and offline methods.
Model-based dynamic programming
When the model of the dynamic characteristics of the environment is known and the state and action spaces are small, such problems can be reduced to model-based RL problems, and dynamic programming methods are suitable for solving them. Dynamic programming achieves optimization by working over the sequence of the agent’s decisions and state changes. Its essence lies in solving complex problems by breaking them down into simpler sub-problems, which makes it particularly effective for problems exhibiting overlapping sub-problems and optimal sub-structure.
Policy iteration (Bertsekas 2011; Liu and Wei 2013) and value iteration (Bertsekas 2012; Zhao et al. 2023b) are two important methods for solving dynamic programming problems. The policy iteration algorithm usually consists of two steps, policy evaluation and policy improvement, which alternate until convergence. In policy evaluation, the value function for each state is computed by a numerical iteration algorithm, and a new policy is then derived from the value function through the greedy policy. Therefore, the core of dynamic programming is to find the optimal value function under a given policy. Based on the Bellman optimality principle, the Bellman optimality equations can be written as
$$V^{*}(s) = \max_{a}\sum_{s'} P(s'\mid s,a)\big[R(s,a,s') + \gamma V^{*}(s')\big] \tag{11}$$
$$Q^{*}(s,a) = \sum_{s'} P(s'\mid s,a)\big[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\big] \tag{12}$$
The mathematical expression of the state value function under a policy $\pi$ is
$$V^{\pi}(s) = \sum_{a} \pi(a\mid s)\sum_{s'} P(s'\mid s,a)\big[R(s,a,s') + \gamma V^{\pi}(s')\big] \tag{13}$$
Equation (13) shows how the value function at state $s$ can be represented by the value function of its successor states $s'$. In this equation, $P(s'\mid s,a)$, $R(s,a,s')$ and $\gamma$ are known, and $\pi(a\mid s)$ is specified by the given policy, so the value function $V^{\pi}$ is the only unknown. That is, Eq. (13) is a linear system of equations in the value function, and the number of unknowns equals the total number of states. Policy evaluation therefore becomes a problem of solving linear equations. Common methods for solving linear equations include direct methods (such as Gaussian elimination and triangular matrix decomposition) and iterative methods. Compared with the Jacobi iteration, the Gauss-Seidel iteration generally converges faster because it immediately uses the components that have already been updated.
The above describes the policy evaluation algorithm. The purpose of computing the value function is to find the optimal policy, and the current policy is improved by acting greedily with respect to the value function in each state, as
$$\pi'(s) = \arg\max_{a}\sum_{s'} P(s'\mid s,a)\big[R(s,a,s') + \gamma V^{\pi}(s')\big] \tag{14}$$
This process is the policy improvement algorithm, and the combination of the two constitutes the policy iteration algorithm. As can be seen from the above procedure, the policy iteration algorithm requires a large number of iterations and has low computational efficiency. In many cases, the policy obtained before the value function has fully converged is the same as the policy obtained after infinitely many iterations of the value function. Therefore, we aim to relax the requirements on policy evaluation in order to improve the speed and computational efficiency of policy iteration. Value iteration is another effective method for finding the optimal policy. Compared with the policy iteration algorithm, value iteration finds the optimal value function through cyclic iteration and requires only a single policy extraction at the end. In other words, value iteration does not need to wait for the policy evaluation algorithm to fully converge, and policy improvement can be carried out after a single evaluation sweep. The value iteration algorithm can therefore be regarded as a policy iteration algorithm in which the policy evaluation process is iterated only once. This algorithm is suitable for cases with a large state space, and the computation per iteration is relatively small. Value iteration is the most general computational framework among dynamic programming algorithms. In recent years, researchers have made a variety of improvements and extensions to value iteration, including asynchronous dynamic programming and least-squares policy iteration (LSPI) (Lagoudakis and Parr 2003). Tamar et al. (2016) combined value iteration with deep RL, which further enhances the algorithm’s efficiency and expands its application scope.
In summary, policy iteration and value iteration are two complementary methods for solving MDP problems. They differ in algorithm design and iteration process, but both aim to find the optimal policy and are closely related mathematically.
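The following minimal Python sketch illustrates value iteration on the kind of small, fully known MDP discussed above; the toy transition and reward tables are hypothetical (with rewards depending only on the state-action pair), and the sweep stops when the largest value change falls below a tolerance, after which the greedy policy is extracted once.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a) [R(s,a) + gamma V(s')] until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction after the value function has converged
    policy = {
        s: max(actions, key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a]))
        for s in states
    }
    return V, policy

# Toy two-state example with the same structure as the MDP sketch above
states, actions = ["s0", "s1"], ["stay", "move"]
P = {"s0": {"stay": [("s0", 1.0)], "move": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]}}
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
print(value_iteration(states, actions, P, R))
```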
Policy learning algorithm based on value function
The RL problem with a known model is a relatively ideal situation, but most problems encountered in reality involve dynamics models that are unknown or only partially known. This is the model-free Markov decision problem often referred to in the literature (Lin et al. 2025). RL algorithms for this problem mainly include the Monte Carlo method (Sutton and Barto 1998) and the temporal difference (TD) method (Sutton 1988). The fundamental concept of model-free RL aligns with that of dynamic programming: first compute the value function associated with the current policy, and then use it to refine the policy through policy evaluation and improvement. The original expression for the value function is
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\, S_t = s\right] \tag{15}$$
and the original state-action value function is expressed as
$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\,\middle|\, S_t = s, A_t = a\right] \tag{16}$$
When calculating the value function without a model, random samples can be used to estimate the expectation. Unlike dynamic programming, which computes the expectation directly from the model, the Monte Carlo method uses the empirical average in place of the expectation of the random variable. The introduction of the Monte Carlo method therefore removes the need for a known environment model. Temporal difference (TD) learning is another model-free RL method and one of the core components of RL algorithms. In estimating the value function, TD combines Monte Carlo sampling with the bootstrapping idea of dynamic programming. The basic idea of temporal difference (Sutton 1988) is as follows
$$V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big] \tag{17}$$
The TD method reduces computational complexity, and bootstrapping enhances learning efficiency by using the value of the next state to estimate the value of the current state. Typical TD algorithms include Sarsa and Q-learning.
Sarsa (Rummery and Niranjan 1994) is an on-policy learning algorithm, that is, the same policy is used for both learning and execution. It is used to solve problems formulated as an MDP and is updated as follows
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big] \tag{18}$$
where $Q(S_{t+1}, A_{t+1})$ is the action value of the next state $S_{t+1}$ and the corresponding action $A_{t+1}$. During learning, both the behavior policy and the evaluation policy are ε-greedy policies. The core of Sarsa is to estimate the state-action value function (Q-function), whose update differs from that of the Monte Carlo method.
Q-learning (Watkins and Dayan 1992) is one of the most popular algorithms in current research. It is an off-policy learning algorithm, that is, the policy currently followed by the agent differs from the target policy: the agent learns the optimal policy while exploring the environment. It is updated as follows
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\big] \tag{19}$$
where $\max_{a} Q(S_{t+1}, a)$ is the maximum Q-value over all possible actions in the new state $S_{t+1}$, which represents the maximum possible future reward. Unlike Sarsa, the behavior policy of the Q-learning algorithm is ε-greedy, while the target policy is the greedy policy. The model-free property means that Q-learning does not need to model the environment and is suitable for complex or unknown environments. However, for problems with continuous state or action spaces, traditional Q-learning requires discretization, which may degrade performance. In this context, researchers use nonlinear function approximators to approximate Q(s, a), combining the idea of Q-value iteration with deep neural networks to form the DQN. The basic idea is to use convolutional and fully connected networks as nonlinear function approximators of the value function $Q(s, a; \theta)$. As a pioneering work, the DQN algorithm (Mnih et al. 2013; Hausknecht and Stone 2015) extracts and exploits information directly from high-dimensional sensory data through experience replay and a separate target network, successfully alleviating the curse of dimensionality faced by traditional RL algorithms. In addition, improved algorithms based on DQN, such as double DQN (Van Hasselt et al. 2016) and dueling DQN (Wang et al. 2016), have been proposed, and these algorithms show broader adaptability in practical applications. With the organic integration of deep learning and RL, the emergence of deep RL marks a new stage in RL research.
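As a concrete illustration of the tabular update in Eq. (19), the following minimal Python sketch (NumPy only) performs Q-learning with an ε-greedy behavior policy on a hypothetical environment exposing `reset()` and `step(action)`; the environment interface, learning rate and episode count are illustrative assumptions. Replacing the `max` in the target with the Q-value of the action actually selected next would give the on-policy Sarsa update of Eq. (18).

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                    # assumed to return an integer state index
        done = False
        while not done:
            # epsilon-greedy behavior policy
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # assumed (next_state, reward, done) signature
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```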
Policy gradient algorithm
As we mentioned earlier, the goal of RL is to obtain the optimal policy through interaction with the environment. The model-based dynamic programming algorithms and the value function-based policy learning algorithms above first estimate the action value function and then derive the corresponding optimal policy from the estimated value function. In practical scenarios, however, when dealing with large or continuous action spaces, it becomes difficult to determine the next action from the value function. The PG algorithm is therefore studied, which skips the evaluation of the action-value function and directly learns a mapping from input states to output policies.
The basic idea of the PG algorithm (Sutton et al. 1999) is to use a parameterized differentiable function to represent the policy directly. The objective of the PG method is some performance measure $J(\theta)$, which is a function of the policy $\pi_\theta$, and the goal is to find an optimal parameter $\theta^{*}$ such that the expectation of the cumulative reward under the policy is maximized. The policy is parameterized by $\theta$, so the objective function is a function of $\theta$. The mathematical expression of the objective function is as follows
$$J(\theta) = \sum_{s} d^{\pi_\theta}(s)\sum_{a}\pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a) \tag{20}$$
where $d^{\pi_\theta}(s)$ is the state density function of policy $\pi_\theta$, and differentiating with respect to $\theta$ yields
$$\nabla_\theta J(\theta) = \sum_{s} d^{\pi_\theta}(s)\sum_{a}\nabla_\theta\pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a) \tag{21}$$
It can be seen from Eq. (21) that the gradient of $J(\theta)$ does not involve the derivative of the state distribution $d^{\pi_\theta}(s)$ with respect to $\theta$. Further, it can be rewritten in the following form
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a)\big] \tag{22}$$
Equation (22) is known as the PG formula and was introduced by Jordan et al. (1998) at NIPS (Neural Information Processing Systems). $Q^{\pi_\theta}(s,a)$ can be estimated by Monte Carlo or temporal-difference methods, and $\theta$ can then be updated by gradient ascent
$$\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta) \tag{23}$$
The policy-based learning method and the value-based learning method primarily differ in that the former derives its final policy from the policy parameters, while the latter obtains its final policy directly from the value function.
In Eq. (23), when estimating $Q^{\pi_\theta}(s,a)$, if the Monte Carlo method is used, the cumulative reward can only be obtained after running a complete episode, which clearly reduces computational efficiency. In contrast, the temporal-difference method based on the value function can evaluate the policy at every step without waiting for the episode to finish and effectively reduces the variance. If $w$ is used to represent the parameters of the critic, the estimated cumulative reward can be expressed as
$$Q_{w}(s,a) \approx Q^{\pi_\theta}(s,a) \tag{24}$$
Its gradient can then be calculated by the following equation
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a\mid s)\,Q_{w}(s,a)\big] \tag{25}$$
where the parameter $w$ determines the magnitude of the estimated cumulative reward in the gradient. That is, the original AC framework (Sutton et al. 1999) usually needs to learn the policy and the Q-value function simultaneously.
DeepMind proposed the A3C (Asynchronous Advantage Actor-Critic) algorithm (Mnih et al. 2016) in 2016. A3C introduces an asynchronous training framework on top of the traditional AC framework and parallelizes the RL training process across multiple threads, allowing multiple agents to interact with the environment simultaneously. Compared with the original AC algorithm, A3C greatly improves training efficiency, and the final convergence and robustness of the algorithm are also improved.
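To make the gradient estimator in Eq. (22) concrete, the following minimal Python sketch (NumPy only) computes a REINFORCE-style policy gradient for a linear softmax policy from one sampled trajectory, using the Monte Carlo return in place of $Q^{\pi_\theta}(s,a)$; the feature dimension, trajectory and step size are hypothetical, and an actor-critic method would instead use a learned critic $Q_w(s,a)$ as in Eq. (25).

```python
import numpy as np

def softmax_probs(theta, s):
    """Linear softmax policy: pi(a|s) proportional to exp(theta[a] . s)."""
    logits = theta @ s
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Estimate grad J(theta) = sum_t grad log pi(a_t|s_t) * G_t from one trajectory."""
    # Discounted returns-to-go G_t, computed backwards over the episode
    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (s, a, _), G_t in zip(trajectory, returns):
        p = softmax_probs(theta, s)
        for a_prime in range(theta.shape[0]):
            # grad log pi(a|s) for a linear softmax policy: (1{a'=a} - pi(a'|s)) * s
            grad[a_prime] += ((1.0 if a_prime == a else 0.0) - p[a_prime]) * G_t * s
    return grad

# Hypothetical trajectory of (state_features, action, reward) triples
traj = [(np.array([1.0, 0.0]), 0, 1.0), (np.array([0.0, 1.0]), 1, 0.0)]
theta = np.zeros((2, 2))                               # 2 actions x 2 state features
theta = theta + 0.1 * reinforce_gradient(theta, traj)  # one gradient ascent step, Eq. (23)
```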
The development of PG algorithms has been accompanied by challenges in stability and convergence speed. In recent years, to improve on the stochastic PG algorithm, the deterministic policy gradient (DPG) algorithm (Silver et al. 2014) has been proposed. Its basic idea is to bring the ideas of the DQN algorithm into continuous action spaces so that the method can cope with environments that have many or continuous actions. Unlike a stochastic policy, a deterministic policy selects a unique action in a given state rather than a probability distribution over actions. The DPG method therefore no longer needs to take an expectation over actions, requiring fewer samples and achieving higher learning efficiency, especially when the action space is large. On the basis of DPG, Lillicrap et al. (2015) proposed the deep deterministic policy gradient (DDPG) algorithm by introducing deep neural networks for function approximation. DDPG provides a stable benchmark for the learning process by using an experience replay mechanism and fixed target networks, effectively avoiding instability and oscillation during training.
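DDPG's fixed (slowly updated) target networks are typically maintained with a soft update. The following short NumPy sketch shows this Polyak averaging step; the interpolation coefficient tau and the weight arrays are illustrative assumptions.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Soft target-network update used in DDPG: target <- tau*online + (1-tau)*target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]

# Example with hypothetical weight arrays for one layer (weights and biases)
online = [np.ones((2, 2)), np.zeros(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
target = soft_update(target, online)
```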
Multi-agent reinforcement learning
The unified whole formed by the interaction of multiple agents and their environment is called a MAS. The MARL algorithm is formed by combining existing RL algorithms with the MAS. MARL is a machine learning approach in which multiple agents operate in a shared environment and refine their policies through interaction, reaching the optimal policy by maximizing reward values. Compared with the SARL setting, the environment formed by multiple agents is complex and dynamic. Therefore, when dealing with MAS problems, various factors need to be considered, such as the interaction between agents and the environment, policy learning, credit assignment, and the balance between exploration and exploitation. In general, MARL is formed and developed on the basis of RL; it is both an inheritance and extension of existing RL algorithms and has been widely applied in many practical fields.
Description of the multi-agent learning model
For convenience, here are a few definitions:
Definition 3
A matrix game is also called a two-person zero-sum game. If there are two players, player one has n strategies and player two has m strategies, then the payoff of player one is the constructed matrix, usually written as $A_{n\times m}$. Because it is a zero-sum game, the payoff of player two is the negative of player one's payoff matrix, $-A_{n\times m}$.
Definition 4
Nash equilibrium is also known as a non-cooperative game equilibrium. In a game with n players, each player i has a policy set $\Pi_i$. A policy combination $(\pi_1^{*}, \ldots, \pi_n^{*})$ is a Nash equilibrium if and only if, for every player i and any policy $\pi_i \in \Pi_i$, $u_i(\pi_i^{*}, \pi_{-i}^{*}) \ge u_i(\pi_i, \pi_{-i}^{*})$, where $u_i$ is the payoff function of player i, $\pi_i^{*}$ is the policy of player i in the equilibrium policy combination, $\pi_{-i}^{*}$ represents the policy combination of the players other than i, and $\pi_i$ is any other policy of player i.
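As a small worked example (not taken from the text), consider the classic matching pennies matrix game: it has no pure-strategy equilibrium, and its unique Nash equilibrium is for both players to randomize uniformly, which illustrates that equilibrium strategies may need to be stochastic.
$$A = \begin{pmatrix} +1 & -1 \\ -1 & +1 \end{pmatrix}, \qquad \pi_1^{*} = \pi_2^{*} = \left(\tfrac{1}{2}, \tfrac{1}{2}\right), \qquad u_1(\pi_1^{*}, \pi_2^{*}) = 0$$
At this equilibrium, neither player can raise its expected payoff above zero by unilaterally changing its mixing probabilities.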
In MARL, problems are typically modeled as Markov games, and the Nash equilibrium provides a theoretical target for policy stability: when the strategies of all agents reach a Nash equilibrium, no party can obtain a higher reward by unilaterally changing its policy. When describing MASs, the single-agent Markov model cannot capture the dynamic characteristics of the environment induced by the other agents. To address this, researchers in recent years have made full use of the fitting and data processing capabilities of various neural networks and deeply integrated them with existing RL algorithms, which has proven effective in complex environments with high-dimensional state and action spaces. Therefore, to accurately describe the environmental and dynamic characteristics of multiple agents, the autonomous decision-making process of multiple agents can be expressed as a Markov game. Figure 4 illustrates the MARL problem.
Fig. 4 Representation of MARL problem
A Markov game is a branch of game theory based on stochastic game theory. It is the extension of the MDP to MASs and primarily addresses the modeling of environmental and dynamic characteristics in a MAS. Each agent possesses its own observation space, action space and reward function, and must make decisions under the influence of the behavior of other agents and the dynamics of the environment. Similar to the single-agent MDP, the modeling process can be described by a six-tuple $(\mathcal{N}, S, A, P, \{R_i\}, \gamma)$, where $\mathcal{N} = \{1, \ldots, N\}$ denotes the set of the N agents in the system, $S$ is the global state space of the environment, $A = A_1 \times \cdots \times A_N$ represents the joint action space of all agents, the state transition probability function $P(s' \mid s, a)$ gives the probability that the state transitions from $s$ to a new state $s'$ when the agents in the system select the joint action $a$, and $R_i$ denotes the local reward function of the i-th agent; each agent in the system has its own reward function. Finally, $\gamma$ denotes the discount factor, which represents the trade-off between long-term and immediate reward.
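Mirroring the MDP sketch in Sect. 3, the following minimal Python sketch spells out the Markov game tuple for a hypothetical two-agent example, with per-agent action sets and reward functions and a transition model defined over joint actions; all names and values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MarkovGame:
    agents: List[int]                                   # the set of agents
    states: List[str]                                   # global state space S
    actions: Dict[int, List[str]]                       # per-agent action sets A_i
    transition: Callable[[str, Tuple[str, ...]], Dict[str, float]]  # P(s' | s, joint action)
    rewards: Dict[int, Callable[[str, Tuple[str, ...]], float]]     # local reward R_i(s, joint action)
    gamma: float                                        # discount factor

# Hypothetical two-agent example: both agents must "push" to change the state
game = MarkovGame(
    agents=[0, 1],
    states=["idle", "done"],
    actions={0: ["wait", "push"], 1: ["wait", "push"]},
    transition=lambda s, a: {"done": 1.0} if s == "idle" and a == ("push", "push") else {s: 1.0},
    rewards={i: (lambda s, a: 1.0 if s == "done" else 0.0) for i in [0, 1]},
    gamma=0.95,
)
```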
In a multi-agent environment, the i-th agent observes the global state $s$ to obtain the local observation $o_i$. When the observable information of the agent is equal to all the information in the environment, that is, $o_i = s$, the process is a fully observable Markov decision process. When the observable information of the agent is less than all the information in the environment, that is, $o_i \neq s$, the process is a POMDP, also called a partially observable Markov game. In partially observable Markov games, the modeling process can be represented by an eight-tuple $(\mathcal{N}, S, A, P, \{R_i\}, \Omega, O, \gamma)$, where $\Omega$ represents the local observation space and $O$ denotes the observation function; the remaining elements are defined as in the Markov game. For a single agent in a POMDP, the Bellman equation for its state-action value function can be expressed as follows
$$Q^{*}(b,a) = \sum_{b',\,r} P(b', r \mid b, a)\big[r + \gamma \max_{a'} Q^{*}(b', a')\big] \tag{26}$$
where $Q^{*}(b,a)$ is the optimal expected cumulative discounted reward of taking action $a$ in the belief state $b$, and $P(b', r \mid b, a)$ is the probability of moving to the next belief state $b'$ and receiving an immediate reward $r$ given the current belief state $b$ and action $a$.
The optimal expected cumulative discounted reward for taking action $a$ in belief state $b$ can be derived from Eq. (26): it equals the expected immediate reward of that action plus the discount factor times the maximum optimal expected cumulative discounted reward in the subsequent belief state. This relationship forms the basis for computing the optimal reward of a given state-action pair. By solving this equation, the optimal action at each belief state can be found, thereby constructing the optimal policy $\pi^{*}$.
In summary, Markov games depict multi-agent interactions as a stochastic process over states, joint actions and rewards, while concepts such as Nash equilibrium and correlated equilibrium provide mathematical definitions of strategic stability. Marden et al. (2009) modeled networked multi-agent MDPs as potential games; by constructing a potential function equivalent to the global objective, each agent can achieve overall optimality by optimizing only its local reward. For adversarial MARL scenarios, Schwung et al. (2022) proposed a solution paradigm based on zero-sum differential games, converting multi-agent competition into the approximate solution of the Hamilton–Jacobi–Isaacs (HJI) equation. The fusion of game theory and MARL has become a research hotspot for solving large-scale distributed decision-making problems in recent years, and it provides a rigorous problem modeling framework and convergence targets for MARL.
Classification of multi-agent algorithms
Compared with the single-agent setting, a MAS must attend not only to each agent's interaction with the environment but also to the dynamic influence that agents exert on one another. To obtain the optimal policy, each agent needs to examine the actions and states of the other agents in order to obtain the joint action value function. In MASs, the large input space of the action value function leads to an exponential growth in the number of joint action value parameters as the number of agents increases, a phenomenon known as the "curse of dimensionality", which makes it difficult to fit a suitable function that represents the true joint action value function. How to learn the joint action value function is therefore one of the core problems in MARL algorithms.
In addition, as shown in Table 5, some important characteristics of the system deserve special attention in MARL algorithms, such as whether the communication control mode between agents is centralized or decentralized, whether the environment is fully or partially observable, and whether the agents cooperate or compete to obtain the optimal policy. It is noteworthy that the existing literature offers many dimensions for classifying MARL algorithms, and a single algorithm often satisfies several classification criteria simultaneously, which inevitably leads to overlaps in any single classification framework. In this paper, when naturally extending SARL algorithms to MASs, we adopt a function-oriented approach and primarily employ the following four extension paradigms: centralized learning, behavior analysis, communication learning, and collaborative learning. MARL algorithms are classified accordingly in Table 4.
Table 4. Classification of MARL algorithms
| Algorithm classification^a | Classical algorithm name |
|---|---|
| Behavior analysis | IQL+DQN, IQL+TRPO, IQL+DDPG, PS-TRPO, DRUQN |
| Centralized learning | Associative Q-learning, MAPPO, IRAT, MAIC |
| Communication learning | RIAL/DIAL, CommNet, IC3Net, BiCNet, ATOC, SchedNet |
| Collaborative learning | VDN, QMIX, QTRAN, Qatten, MADDPG, COMA, MAAC |
Note: ^a Algorithms can be classified in multiple ways. This table is only divided based on the dominant characteristics that are most frequently cited in the literature; if further subdivision is needed, cross-analysis can be conducted by combining dimensions such as training paradigms and information granularity.
Table 5. Analysis and comparison of MARL algorithms
| Classical algorithm name | Task type/Communication mode | Core idea/Applicable scene |
|---|---|---|
| IQL+DQN | Arbitrary type, non-communication | Agents independently run DQN algorithms |
| IQL+TRPO | Arbitrary type, non-communication | Agents independently run TRPO algorithms |
| IQL+DDPG | Arbitrary type, non-communication | Agents independently run DDPG algorithms |
| PS-TRPO | Arbitrary type, non-communication | Independent TRPO runs in parallel without interference |
| DRUQN | Arbitrary type, non-communication | Independent double Q-networks self-learning policy |
| Associative Q-learning | Arbitrary type, non-communication | Joint Q-table via associative memory |
| MAPPO | Cooperation type, non-communication | Centralized critic, decentralized actors |
| IRAT | Cooperation type, communication (implicit) | Iterative reward shaping for teams |
| MAIC | Cooperation type, communication (policy sharing) | Policy consensus via iterative communication |
| RIAL/DIAL | Cooperation type, communication | Discrete (RIAL) or differentiable (DIAL) message learning |
| CommNet | Cooperation type, communication | Continuous vector channel; average hidden states |
| IC3Net | Arbitrary type, communication | Gated continuous channel for selective broadcasting |
| BiCNet | Arbitrary type, communication | Bidirectional RNN for implicit message fusion |
| ATOC | Cooperation type, communication | Attention decides when & what to communicate |
| SchedNet | Cooperation type, communication | Learnable scheduler chooses who broadcasts |
| VDN | Cooperation type, CTDE (implicit)^b | Additive Q-value decomposition |
| QMIX | Cooperation type, CTDE (implicit) | Monotonic mixing of local Q-values |
| QTRAN | Cooperation type, CTDE (implicit) | Unconstrained Q-value transformation |
| Qatten | Cooperation type, CTDE (implicit) | Attention-based Q-value aggregation |
| MADDPG | Arbitrary type, CTDE (implicit) | Centralized critic, decentralized actors |
| COMA | Cooperation type, CTDE (implicit) | Counterfactual baseline for credit assignment |
| MAAC | Arbitrary type, CTDE (implicit) | Attentional multi-agent AC |
Note: ^b Non-communication means there are no explicit communication mechanisms between agents: they treat other agents as part of the environment, neither exchanging messages nor sharing gradients, so no communication channels exist. The CTDE (Centralized Training with Decentralized Execution) framework utilizes global information during the training phase without real-time communication exchange; in the execution phase each agent acts independently based solely on local observations, thus also falling into the non-communication category.
Centralized learning
Centralized learning in MARL optimizes the cooperative performance of multiple agents by processing and sharing information centrally. The core idea is to adopt a global perspective that gathers the state, action and reward information of all agents for joint decision making. By merging the agents' action spaces into a high-dimensional joint action space, global information can be used to optimize the policy during the training phase. Through information sharing, agents gain a better understanding of the environment, which enables them to coordinate actions and enhance overall performance. Typical centralized learning methods include joint action-value functions (such as joint Q-learning, Tan 1993), MAPPO (Yu et al. 2022), IRAT (Wang et al. 2022b) and MAIC (Yuan et al. 2022). Joint Q-learning and MAPPO are joint-learning forms of the SARL algorithms Q-learning and PPO, and joint learning is a typical instance of centralized learning. Aiming at the reward sparsity problem in MARL scenarios, the IRAT algorithm constructs an individual policy and a team policy and lets them learn and update simultaneously. In response to the challenges of weak reward signals and hard-to-interpret decision logic in complex domains, Urbanowicz and Moore (2009) focused on optimizing the Learning Classifier System (LCS) in sparse-reward environments and on building an interpretable decision-making mechanism; through a deep coupling of evolutionary algorithms and RL, they designed an adaptive rule exploration mechanism that overcomes the learning stagnation caused by sparse rewards in traditional methods. The MAIC algorithm presents a new regularization method that enables each agent to generate incentive messages and directly influence the value functions of other agents. Its core idea is to achieve more effective collaboration through incentive communication: agents not only learn their own strategies but also learn to influence the decision-making of other agents through communication, thereby achieving better coordination across the whole MAS. However, as the number of agents grows, the joint state space and joint action space grow exponentially; to avoid the curse of dimensionality, centralized learning is therefore only suitable for scenarios with a small number of agents. Moreover, centralized learning is generally limited to fully cooperative tasks: in fully competitive and hybrid tasks there is no complete information sharing between agents, so centralized communication cannot be assumed.
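As a minimal illustration of the joint-action viewpoint described above, the sketch below (with illustrative sizes, not any specific published implementation) maintains a single Q-table indexed by the global state and the joint action of all agents and applies a standard Q-learning update to it. The exponential width of the joint-action axis is exactly the scalability limit discussed in the text.

```python
import numpy as np
from itertools import product

# Minimal sketch of centralized (joint) Q-learning: one Q-table indexed by the
# global state and the joint action of all agents. Sizes are illustrative; the
# point is that the joint-action axis grows as |A|**N.

n_agents, n_actions, n_states = 3, 4, 10
joint_actions = list(product(range(n_actions), repeat=n_agents))  # |A|**N entries
Q = np.zeros((n_states, len(joint_actions)))                      # 10 x 64 here

def update(s, a_joint, r, s_next, alpha=0.1, gamma=0.99):
    """Standard Q-learning update applied to the joint-action index."""
    j = joint_actions.index(a_joint)
    td_target = r + gamma * Q[s_next].max()
    Q[s, j] += alpha * (td_target - Q[s, j])

# With 10 agents the same table would need 4**10 (over a million) joint-action
# columns per state, which is why centralized learning only scales to small teams.
```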
Behavior analysis
Behavior analysis, also called independent learning, means that each agent learns independently and treats the other agents as part of the environment. It is an important method in MASs, suitable for small-scale multi-agent problems with discrete state and action spaces, and offers strong scalability. Under normal circumstances, this approach directly applies a SARL algorithm to the multi-agent environment: each agent operates independently, following the independent Q-learning rule to execute its own Q-learning algorithm.
The existing independent learning algorithms mainly include IDQN (Tampuu et al. 2017), DQN/TRPO/DDPG+IQL (Chen et al. 2025), PS-TRPO, and Deep Repeated Update Q-Networks (DRUQN) (Castañeda 2016). The IDQN algorithm combines independent Q-learning with DQN and treats each agent as an independent learner: each agent learns a value function from its own observations and actions, and the loss function of agent $i$ is given by
$$L_i(\theta_i)=\mathbb{E}_{(o_i,a_i,r_i,o_i')}\Big[\big(r_i+\gamma \max_{a_i'} Q_i(o_i',a_i';\theta_i^{-})-Q_i(o_i,a_i;\theta_i)\big)^{2}\Big] \qquad (27)$$
The total loss function aggregates the loss functions of all agents, and the optimization process aims to minimize both the total loss and the individual agent losses. Practice has shown that independently run DQN agents can achieve good results in MASs of moderate scale. To improve the performance and efficiency of 5G and 6G networks, Geranmayeh and Grass (2025) proposed a method that deploys DQN to determine the optimal beamforming angles and transmission power for multiple transceivers, effectively enhancing network efficiency and resource allocation. In a multi-agent environment, each agent uses a DQN to learn its own policy, which guides the agent to take actions in a given state so as to maximize the cumulative reward. This combination achieves more stable and efficient learning in multi-agent environments while reducing overestimation of the Q-value, which promotes policy coordination in large-scale MASs. To address the non-stationarity of the environment during agent training, Castañeda (2016) proposed the DRUQN algorithm, an extension of traditional Q-learning whose core idea is to reduce the bias caused by policy updates through repeated updates of the Q-value, thereby improving learning stability and efficiency.
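For concreteness, the following sketch shows how the per-agent loss of Eq. (27) can be computed in an independent-learning setup; it is a minimal PyTorch illustration with assumed batch fields and network shapes, not the exact implementation of any cited algorithm.

```python
import torch
import torch.nn as nn

# Minimal sketch of the independent-learning (IDQN) loss in Eq. (27): every agent
# holds its own online and target networks and treats the other agents as part of
# the environment. Batch fields and shapes are illustrative assumptions.

def idqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss for one agent computed from its own (o, a, r, o') transitions."""
    obs, actions, rewards, next_obs, dones = batch          # tensors for this agent only
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1 - dones) * q_next
    return nn.functional.mse_loss(q_taken, target)

# The total loss in the text is simply the sum of idqn_loss over all agents, each
# optimized with its own optimizer (no gradients flow between agents).
```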
Communication learning
In MASs, agents usually only have access to local environment information and cannot directly observe the global state. Communication learning enables an agent to generate messages from its local observations during training and to decide whether to communicate and with which agents. After training, decisions are made explicitly on the basis of the information transmitted by the other agents. The existing communication-based multi-agent deep RL (CB-MADRL) algorithms mainly include RIAL/DIAL (Foerster et al. 2016), CommNet (Sukhbaatar and Fergus 2016), IC3Net (Singh et al. 2019), BiCNet (Guan et al. 2022), and the attention-based communication models ATOC (Peng et al. 2017) and SchedNet (Kim et al. 2019).
The RIAL algorithm integrates deep recurrent Q-networks (DRQN) with independent Q-learning. Specifically, it employs two distinct Q-networks, one for evaluating the original environment action and the other for assessing the discrete communication action. Since each agent must choose both an environment action and a communication action, the input of the Q-network contains two kinds of information: the agent's local observation and the channel messages passed by the other agents at the previous time step. DIAL further improves on RIAL by allowing gradient information to flow through the communication channel, so that agents can give each other feedback about their communication actions via gradient backpropagation. For multi-agent POMDP problems, Sukhbaatar and Fergus (2016) proposed the CommNet architecture, whose core idea is to let each agent learn how to encode and decode information from other agents in order to coordinate actions better. This approach encodes the agents' local observations through a shared neural network, and each agent's decision depends on an average vector of the observations and communication information of the other agents. In principle, agents can execute decentralized copies of the shared network in the environment while requiring instant communication with all agents. By allowing agents to exchange information effectively without hand-designed communication protocols, the collaboration ability of the whole system is improved. IC3Net extends CommNet; unlike the globally shared reward used in CommNet, IC3Net provides an individual reward for each agent, so it can exhibit more diverse behavior in fully competitive or hybrid environments.
The BiCNet algorithm links each agent's policy network and value function through a bidirectional long short-term memory layer, which enables agents to capture long-term dependencies in their memory states and to fuse information implicitly. This method performs well in multi-agent scenarios that require a high degree of cooperation. In addition, the attention-based communication model ATOC uses a bidirectional Long Short-Term Memory (LSTM) unit to integrate the received messages and improve cooperation efficiency, while the SchedNet algorithm (Kim et al. 2019) takes the limited communication bandwidth and shared medium into account. Using a Weight-based Scheduling Algorithm (WSA), the weight generators of all agents are treated as a single neural network, and each agent's observation is processed in turn by an encoder and an action selector. SchedNet provides an effective and adaptable solution for communication in MASs, especially in bandwidth-constrained environments.
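To illustrate the continuous-channel idea used by CommNet-style methods, the sketch below implements one communication step in which each agent's hidden state is combined with the mean of the other agents' hidden states; module names and layer sizes are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of a CommNet-style communication step: each agent's hidden state
# is updated from its own state plus the mean of the other agents' hidden states.

class CommStep(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.f_self = nn.Linear(hidden_dim, hidden_dim)
        self.f_comm = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                      # h: (n_agents, hidden_dim)
        n = h.shape[0]
        # Mean of the *other* agents' hidden states for each agent
        comm = (h.sum(dim=0, keepdim=True) - h) / max(n - 1, 1)
        return torch.tanh(self.f_self(h) + self.f_comm(comm))

# Usage: stack two or three CommStep layers between the observation encoder and
# the per-agent policy head; gradients flow through the averaged channel, which
# is what lets the agents learn what to communicate.
```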
Collaborative learning
Collaborative learning is a highly innovative and practical method in the field of MARL. It combines the advantages of the two learning methods above, behavior analysis and centralized learning, integrates cutting-edge ideas from the multi-agent field into the MARL framework, and realizes efficient collaboration between agents by implicitly learning the coordination mechanism between them. Different from traditional explicit communication methods, the core of collaborative learning is that agents do not exchange information directly, but learn how to cooperate better by observing and inferring the behavior patterns of other agents. Typical collaborative learning algorithms include VDN (Sunehag et al. 2017), QMIX (Rashid et al. 2018), QTRAN (Son et al. 2019), Qatten (Yang et al. 2020b), MADDPG (Wu et al. 2024), COMA (Foerster et al. 2017), MAAC (Iqbal and Sha 2018) and MASQL (Yang et al. 2018). Among these algorithms, all except MADDPG are designed for fully cooperative tasks in MASs, while MADDPG can handle cooperative, competitive, and hybrid problems. According to how the value function is evaluated, collaborative learning algorithms can be divided into value function decomposition based methods and PG based methods.
MARL algorithms based on value function decomposition address the scalability problem by breaking the global value function down into local value functions for each agent. The core idea is to separate each agent's own environment information from the observation information of the other agents by decomposing the global value function, thereby optimizing the policy. In MARL, the global value function usually depends on the joint actions and states of all agents, so when the number of agents is large, directly computing the global value function leads to the curse of dimensionality. The value function decomposition method simplifies the computation by expressing the global value function as a combination of the agents' local value functions. The challenge in this approach is how to decompose the joint value function correctly so that execution can remain decentralized. For cooperative tasks, a necessary condition is that the local maximum of each agent's value function coincides with the global maximum of the joint value function. This is known as the Individual-Global-Max (IGM) principle (Wen et al. 2023), which is expressed mathematically as follows
$$\arg\max_{\boldsymbol{a}} Q_{tot}(s,\boldsymbol{a})=\Big(\arg\max_{a_1} Q_1(o_1,a_1),\ \ldots,\ \arg\max_{a_N} Q_N(o_N,a_N)\Big) \qquad (28)$$
where $\arg\max_{a_i} Q_i(o_i,a_i)$ represents the optimal individual action of agent $i$, and $\arg\max_{\boldsymbol{a}} Q_{tot}(s,\boldsymbol{a})$ represents the optimal joint action selected collectively by all agents. In collaborative learning, a key challenge is to derive the individual value functions from the joint value function while ensuring that the local maxima of each agent's value function align with the global maximum of the joint value function; this requirement is exactly the IGM principle. Typical MARL algorithms based on value function decomposition mainly include VDN, QMIX, QTRAN, Qatten and QPLEX. The VDN (Sunehag et al. 2017) algorithm rests on an additive assumption about the "decomposable value function": the joint Q-value can be decomposed into the linear sum of the agents' individual Q-functions, with the following mathematical expression
$$Q_{tot}(s,\boldsymbol{a})=\sum_{i=1}^{N} Q_i(o_i,a_i) \qquad (29)$$
where $N$ represents the number of agents and $Q_{tot}$ stands for the joint action value function. An LSTM is used as the Q-network to approximate each agent's $Q_i$. After $Q_{tot}$ is obtained approximately, VDN uses the DQN update rule to update it through the global reward $r$, and its loss function is expressed as follows

$$L(\theta)=\frac{1}{M}\sum_{j=1}^{M}\Big(r_j+\gamma \max_{\boldsymbol{a}'} Q_{tot}(s_j',\boldsymbol{a}';\theta^{-})-Q_{tot}(s_j,\boldsymbol{a}_j;\theta)\Big)^{2} \qquad (30)$$
where $M$ is the batch size and $\theta^{-}$ denotes the parameters of the target network. The joint Q-function is then used to optimize the agents' policies. The VDN algorithm is simple, efficient and scalable; however, when there are complex nonlinear interactions between agents, the limitation of its additive assumption makes it perform poorly in complex tasks. To overcome this, the QMIX (Rashid et al. 2018) algorithm was proposed. QMIX can be regarded as a further extension of VDN: when VDN decomposes the joint value function it makes only a simple linear-additivity assumption, which cannot capture more complex relationships between agents. QMIX introduces monotonicity and nonlinearity assumptions, employs a mixing network to integrate the agents' local value functions, and incorporates global state information during training to enhance performance. The core idea is to use the mixing network (a neural network) to fit the nonlinear relationship between the local value functions and the joint value function of all agents, instead of the linear addition used in VDN. Compared with VDN, QMIX relaxes the assumption and can capture more complex interrelationships between agents, but its monotonicity assumption still limits the expression of complex nonlinear relationships. The QTRAN (Son et al. 2019) algorithm is a more general decomposition algorithm developed on the basis of the above two. Its core idea is to introduce an improved learning objective on top of QMIX, together with a specific network design. The approximation of the complex joint Q-function is decomposed into two parts: first, linear local Q-functions are obtained with the VDN method and used as an approximation of the joint Q-function; a state-value network is then used to fit the difference between the local and joint Q-functions. By introducing a transformation function that relaxes the monotonicity constraint on the joint action value function, QTRAN further expands the applicable scope of value decomposition methods and can model complex multi-agent cooperation tasks more flexibly. The Qatten (Yang et al. 2020b) algorithm introduces an attention mechanism to adjust the agents' contributions dynamically: it computes the contribution weight of each agent's Q-value to the global Q-value. It performs well in complex tasks, but its computational complexity is high and training is rather challenging. Building on these algorithms, further value function decomposition methods have recently been developed, such as WQMIX (Rashid et al. 2020), QTRAN++ (Son et al. 2020), QPLEX (Wang et al. 2020) and QPD (Yang et al. 2020a), each proposed to solve more complex multi-agent policy optimization problems. These methods not only simplify the interaction process between agents, but also significantly improve the flexibility and adaptability of the system, providing new ideas and solutions for complex multi-agent cooperation problems.
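To make the value decomposition concrete, the following sketch computes the VDN objective of Eqs. (29)–(30): per-agent Q-networks are summed into a joint Q-value and trained with a single TD loss on the global reward. It is a minimal PyTorch illustration with assumed tensor shapes rather than the original implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the VDN idea from Eqs. (29)-(30): the joint Q-value is the sum
# of the per-agent Q-values, and one TD loss on the global reward trains all agent
# networks jointly. Shapes and network choices are illustrative assumptions.

def vdn_loss(agent_nets, target_nets, batch, gamma=0.99):
    obs, actions, reward, next_obs, done = batch        # per-agent obs: (batch, n_agents, obs_dim)
    q_taken, q_next_max = [], []
    for i, (net, tgt) in enumerate(zip(agent_nets, target_nets)):
        q_i = net(obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
        q_taken.append(q_i)
        with torch.no_grad():
            q_next_max.append(tgt(next_obs[:, i]).max(dim=1).values)
    q_tot = torch.stack(q_taken, dim=0).sum(dim=0)                    # Eq. (29): additive mixing
    target = reward + gamma * (1 - done) * torch.stack(q_next_max, dim=0).sum(dim=0)
    return nn.functional.mse_loss(q_tot, target)                      # Eq. (30): one loss, global reward

# QMIX replaces the plain sum above with a small mixing network whose weights are
# produced by a hypernetwork conditioned on the global state and constrained to be
# non-negative, which preserves the IGM property while allowing non-linear mixing.
```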
The main PG-based MARL algorithms include MADDPG (Wu et al. 2024), COMA (Foerster et al. 2017), MAAC (Iqbal and Sha 2018) and MASQL (Yang et al. 2018). Directly applying single-agent RL algorithms to MASs can exacerbate environmental non-stationarity, given the partial observability of each agent; PG-based MARL algorithms can alleviate this non-stationarity. The MADDPG algorithm learns a centralized joint critic: during training, a critic with global observability is introduced to guide actor training, while during execution only the actor, which relies on local observations, selects actions. Since the critic network of each agent is based on global information, that is, each agent's Q-function is computed from the joint actions and observations of all agents, the algorithm can cope with non-stationary environments. The COMA algorithm aims to solve the multi-agent credit assignment problem. Similar to MADDPG, all agents share a critic network that relies on the local observations and actions of all agents, while each agent has an independent actor network based solely on its own local observations. In the actor, a GRU network is used to better handle partial observability. In addition, inspired by the difference rewards method, a counterfactual baseline is introduced: credit assignment is addressed by comparing the global reward obtained when the agent follows its current actor network with the global reward obtained when it follows a default policy. The MAAC algorithm can handle cooperative, competitive, and mixed environments simultaneously. Its core idea is to introduce an attention mechanism into the construction of the Q-function and to share parameters in the critic network. The observation-action pairs of the other agents are first embedded, weighted by attention weights and summed, and then combined with the agent's own local observation and action as the input to the critic. The attention weights measure how similar two agents' embeddings are, so that an agent can pay more attention to agents similar to itself and make better use of the available information. The MASQL method essentially migrates soft Q-learning to the multi-agent setting; it targets tasks in which the optimal policy is not unique, and tries to learn a distribution over optimal policies so as to cover all possible optima.
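The CTDE pattern shared by MADDPG, COMA and MAAC can be summarized by the sketch below: a centralized critic that consumes the joint observations and actions during training, and decentralized actors that use only local observations at execution time. Module names and sizes are illustrative assumptions, not the reference implementations of these algorithms.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CTDE pattern used by MADDPG-style methods: the critic sees
# the concatenated observations and actions of every agent during training, while
# each actor only ever consumes its own local observation.

class CentralizedCritic(nn.Module):
    def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)          # joint observations + joint actions
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, all_obs, all_actions):             # each: (batch, n_agents, dim)
        x = torch.cat([all_obs.flatten(1), all_actions.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)                   # Q(s, a_1, ..., a_N)

class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, local_obs):                        # only local information at execution time
        return self.net(local_obs)
```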
As an important branch of machine learning, MARL has achieved remarkable success in many fields in recent years. However, a series of critical issues still need to be solved in practice; we discuss them in detail in Sect. 6.
Application
RL algorithms have been widely used in many fields and have made significant progress. Their powerful adaptive ability and optimization potential make them an effective tool for solving complex decision-making problems (Qin et al. 2025a, 2024). With the rapid development of MASs, the combination of RL algorithms and MASs has become a current research hotspot and an inevitable trend of future development. This section focuses on the progress of applied research on MARL algorithms in three fields: intelligent transportation, medical treatment, and energy management and optimization, with the aim of providing new ideas and methods for theoretical innovation and practical application in related fields. Figure 5 gives an overview of the MARL application areas covered in this paper.
Fig. 5
Overview of the framework for MARL applications
Intelligent transportation
In recent years, the application of MARL in the field of intelligent transportation has made remarkable progress, and has become an important tool to solve complex transportation system optimization problems. Its core advantage is that it can adapt to the dynamic traffic environment through the cooperation and competition between multiple agents, and realize efficient traffic management and control. This section will introduce the related research progress from two aspects: traffic signal control and vehicle cooperation.
Traffic signal control
Traffic signal control is one of the important applications of MARL in intelligent transportation. The traditional fixed-duration signal control method runs according to a fixed schedule and ignores real-time traffic conditions; it cannot adapt to dynamic traffic flow and easily causes congestion. MARL treats the signal lights at each intersection as agents that are optimized cooperatively: the agents observe the traffic flow, the vehicle queue lengths and the current signal states, and learn how to adjust the signal timing, so that the signal state can be adapted in real time while reducing traffic congestion and vehicle waiting time.
Although RL has been proven to be an effective data-driven approach to this problem, multi-intersection scenarios under the traditional decentralized MARL framework still face the challenge of insufficient coordination across intersections. For the adaptive traffic signal control (ATSC) problem of multiple intersections in large-scale networks, Qi et al. (2024) introduced "pressure" as an indicator of the traffic condition at an intersection while taking the traffic state of its adjacent intersections into account. By designing a reward function that combines pressure and vehicle waiting time, they extended multi-agent Q-learning, mitigated the problems of partial observability and non-stationarity, and improved the adaptability of the algorithm in complex traffic networks. Chen et al. (2024) proposed the Coevolutionary Multi-agent RL (CoevoMARL) method, which leverages temporally stable traffic patterns to dynamically evolve the learned spatial interaction network, enhancing the adaptability of the traffic signal control policy to real-world traffic flow variations. Further research considers the scenario in which some intersections cannot receive traffic data, that is, the case where RL-based adaptive traffic signal control lacks robustness when deployed in practice. Jiang et al. (2024) introduced a new RL model, BlindLight, which uses a dual-model structure to learn the Q-values of different types of intersections independently and designs a new reward function called NOVO (Number of Vehicles and Outflow); combining the number of vehicles with the outflow provides an optimization objective with low variance. Experiments show that BlindLight compares favorably with existing state-of-the-art traffic signal control methods and is significantly better than other methods in scenes containing blind intersections. Ren et al. (2024) addressed the vulnerability of Traffic Signal Control Systems (TSCS) by proposing a black-box multi-objective attack policy and defense mechanism; they designed a dynamic threshold-based critical state selection method to minimize cumulative rewards with a limited number of attacks, which demonstrated excellent performance in practical applications. Xu et al. (2024), based on a graph deep RL model with a two-stage attention network and GraphSAGE, constructed a dynamic interaction graph to promote effective interaction between agents; together with a multi-agent state feature aggregation mechanism, this significantly improved the interaction efficiency between agents and the overall performance of the model.
Vehicle cooperation
The application of MARL in vehicle cooperation mainly focuses on path planning, obstacle avoidance and platooning of multiple vehicles. Through the multi-agent AC algorithm (MA-AC), each vehicle, acting as an agent, can dynamically adjust its driving policy according to the surrounding environment and the states of other vehicles, thereby improving driving safety and efficiency.
For the multi-vehicle pursuit (MVP) task, existing algorithms usually choose a fixed escape target for the pursuer without considering dynamic traffic conditions, which significantly reduces the pursuit success rate. Li et al. (2024b) proposed a progression cognition RL method with prioritized experience for multi-vehicle pursuit in urban multi-intersection dynamic traffic scenes. Their work introduced a prioritized experience replay mechanism that is personalized according to the parameters of each MARL agent, added an attention module to extract key features from the dynamic urban traffic environment, and used a progression cognition method based on these features to group the pursuing vehicles adaptively. For the problem of efficiently disseminating vehicle motion states in complex communication environments and mixed traffic scenarios, Liu et al. (2024) used a hierarchical CHA framework to make cooperatively driving vehicles forward-looking; each level of the CHA framework is combined with Graph Attention Networks (GAT) to stabilize cooperative vehicle decision optimization.
For UAV swarm search tasks in large-scale unknown areas, Hou et al. (2024) proposed a distributed cooperative search method based on MARL. The large-scale search scene is decomposed into local and global information to support the UAV swarm search algorithm, and a convolutional neural network is used to process high-dimensional map data to provide more accurate environmental perception for the UAVs. This method significantly improves the search efficiency and adaptability of the UAV swarm in complex environments. For the cooperation problem of multi-UAV systems in dynamic mission environments, Jiang et al. (2025) proposed a Hybrid Attention MARL (HAMRL) algorithm that uses an attention structure to learn the dynamic characteristics of the task environment; a hybrid attention mechanism establishes an efficient intra-group and inter-group communication aggregation mechanism, and the algorithm's superiority in convergence speed and performance was demonstrated.
Medical treatment
In recent years, driven by the demand for intelligent and personalized solutions, the application of RL in the medical field has gradually attracted wide attention, and intelligent medical systems will be an important development direction in the future. How to apply the basic RL framework to medical problems is a question of particular concern. The expectation is that once the basic building blocks of RL (state, action, and reward) are precisely defined, seemingly complex medical problems become tractable. Since the development and treatment of an individual's disease are influenced by the patient's past and current actions, this process can be modeled dynamically within a Markov framework analogous to RL. This feature gives RL unique advantages in medical scenarios, especially in chronic disease management and medical resource allocation.
Management of chronic disease
The management of chronic diseases (diabetes, hypertension, cancer, etc.) requires long-term follow-up and dynamic adjustment of treatment regimens. Traditional approaches to chronic disease management mainly rely on physician experience, standardized treatment guidelines, and patient self-management. Although these methods can control the disease to a certain extent, they have many disadvantages, such as the inability to fully consider individual differences between patients, the bias of physicians' experience and judgment, and the inability to monitor changes in a patient's condition in real time, resulting in lagging treatment adjustments. When deep RL is applied to chronic disease management, each individual is regarded as an agent: the chronic disease management problem is modeled as an MDP, and the patient's electronic health record is used to adjust the treatment plan dynamically. Research on deep RL in chronic disease management has made significant progress in recent years and shows broad application prospects.
In general, PG-based RL algorithms are less suitable for most clinical applications than other RL algorithms, because they require iterative data collection under new policies, a process that is often cost-prohibitive in clinical settings where real-time data acquisition is expensive. Value-based RL algorithms are therefore common in clinical applications. Diabetes is one of the most serious chronic diseases in the world. For the insulin dosing of patients with type 1 diabetes (T1D), Noaro et al. (2023) proposed a personalized and adaptive mealtime insulin dose calculation policy based on Double Deep Q-learning (DDQ) instead of the traditional standard formula. Through a two-step learning framework of population model training and group-initialized models combined with a clustering algorithm, the insulin dose could be adjusted dynamically according to the individual characteristics of patients. Compared with the traditional method, the proposed approach increased the time in target range from 68.35 to 70.08% and significantly reduced the time in hypoglycemia from 8.78 to 4.17%, markedly improving the level of personalized treatment.
For the comorbid chronic conditions common in type 2 diabetes patients, Zheng et al. (2021) modeled disease risks as health outcomes using outpatient electronic health records and evaluated the RL recommendations on an independent patient subset; the approach has been shown in clinical practice to achieve personalized management of diabetes and multimorbidity. A related study proposed a Double Dueling Deep Q Network (D3QN) model based on offline RL to derive the optimal switching policy during non-invasive ventilation (NIV) treatment. Compared with the traditional clinician policy, the treatment plan recommended by the D3QN model reduced the expected mortality rate from 27.82 to 25.44%, significantly improving patient survival; the model performs well not only in COPD patients but also in other respiratory disease subgroups. To address the high mortality of heart failure (HF) patients during treatment, Drudi et al. (2024), leveraging data from the MIMIC-IV database and taking individual patient characteristics into account, developed an RL-based model to determine the optimal usage policy for vasoactive drugs and diuretics. The RL-optimized treatment policy can significantly reduce the mortality of HF patients (by about 20%), providing strong evidence to support clinical treatment.
Allocation of medical resources
The dynamic allocation of medical resources is another important application direction of RL. Traditional medical resource allocation methods mainly rely on manual experience, fixed rules and static planning. Although these methods can meet medical needs to some extent, they cannot adapt to dynamically changing demand, resulting in resource waste or shortage. By introducing deep RL, dynamic priority allocation of resources is realized, both the efficiency of information transmission and the robustness of the system are improved, the efficiency and fairness of medical resource management are further enhanced, and important support is provided for the sustainable development of the medical system. Business process management is a key part of medical resource allocation; although there are many mechanisms to support resource allocation during business process execution, these methods often ignore performance optimization. Huang et al. (2011) modeled resource allocation as an MDP, considered the dynamic characteristics of different patient groups and the types of services required, and used the basic Q-learning algorithm so that the system could adjust its policy in real time according to environmental changes. Experiments show that the proposed method is significantly better than existing heuristic and hand-coded strategies in terms of performance optimization, providing a new optimization idea for business process management. Schütz and Kolisch (2012) combined simulation-based ADP with discrete event simulation to handle stochastic customers and profit maximization, problems that are particularly prominent in the capacity allocation of radiology services, greatly increasing hospital revenue and balancing resource allocation.
Owing to its remarkable advantages in data-driven processes, dynamic decision-making, and multi-objective optimization, the MARL algorithm offers a novel methodology for the intelligent management of medical resources. This technology can not only make flexible decisions based on real-time data, but also optimize multiple objectives simultaneously, so as to realize efficient allocation and utilization of resources in complex and changing medical environments. With the continuous progress of technology, RL is expected to generate more innovations and breakthroughs in the medical field, inject strong impetus into personalized medicine and intelligent medical decision-making, and promote the development of the whole industry to a more efficient and accurate direction.
Energy management and optimization
In recent years, the application of deep RL in energy management and optimization has made significant progress, becoming an important tool for solving the optimization problems of complex energy systems. Traditional energy management methods usually rely on a centralized management system to schedule and allocate energy uniformly, but this may cause lag in information transmission and reduce scheduling efficiency, and a failure of the centralized system can paralyze the entire energy system. Deep RL can be applied to energy storage management and renewable resource integration in microgrids: by modeling the energy storage state, load demand and energy allocation decisions within an RL framework, the operation policy of the microgrid can be optimized dynamically, significantly improving energy efficiency and stability. With the aggravation of environmental pollution and the continuous rise of industrial energy consumption, intelligent and efficient energy management policies are urgently needed to reduce costs and optimize the performance of energy systems.
To address the complex operation and control problems faced by modern industrial energy systems, Lu et al. (2024) proposed a novel model-free energy management policy based on the Hybrid-Action Deep RL (HADRL) algorithm. The interaction process between the industrial energy management center and each device is modeled as an MDP, and a double parameterized DQN learns through actor and critic networks, which avoids the value overestimation problem of traditional Q-learning and significantly improves the stability and efficiency of the algorithm. Addressing the core pain points of model mismatch and real-time control difficulties in the online optimization of building energy consumption, Mocanu et al. (2019) proposed a fully model-free online optimization framework based on deep RL. This research regarded building thermodynamics as an unknown environment, used a DQN as the agent, and continuously adjusted the HVAC parameters based solely on real-time sensor data to achieve a dynamic balance between energy consumption and comfort. Schwung et al. (2019b) addressed the energy optimization challenges in mixed production environments caused by discrete-continuous coupling, heterogeneous equipment, and frequent equipment switching, and proposed an RL framework based on the AC architecture that unifies production scheduling and energy consumption control as a continuous decision-making problem; compared with traditional heuristic scheduling, the average energy consumption was reduced by 12.8%, peak power was reduced by 22%, and the production cycle remained stable. Asha Rani et al. (2024) studied and compared a variety of DRL algorithms (such as DQN, DDPG and TD3) from the perspective of EV charging station operators, finally selecting the Truncated Quantile Critics (TQC) algorithm. TQC performs well on the underlying MDP and can effectively avoid the overestimation problem of traditional DRL algorithms; it may also be extended to other distributed energy management systems and has broad application prospects.
Gong et al. (2024) proposed an adaptive microgrid energy scheduling optimization framework to solve the multi-energy scheduling optimization problem in microgrids; it can optimize microgrid energy scheduling without depending on SOC data. This optimization method, based on a recurrent MADDPG, effectively addresses the challenges posed by privacy protection and missing data. In addition, a real-time multi-energy management system (EMS), as the core of Combined Heat and Power Microgrids (CHPMGs), faces various challenges in practical operation, such as complex multi-energy coordination and uncertain energy demand. To solve these problems, Hu et al. (2024) proposed a real-time EMS based on data-driven, model-independent safe deep RL. The energy management problem of CHPMGs was modeled as a constrained MDP, two sets of neural networks were used to approximate the system parameters so as to satisfy the operating constraints of the CHPMGs, and a security layer based on mathematical optimization was constructed on top of the actor network to modify the agents' actions to meet specific constraints; the method can effectively cope with the uncertainty of energy demand and ensure stable operation of the system in a dynamic environment. For the operational optimization of Ship Power Systems (SPSs) in All-Electric Ships (AESs), Shang et al. (2024) described the operational uncertainty through a unified triple, clustered the SPS operation scenarios, introduced a multi-task deep RL framework, and developed the Impala-RND algorithm by combining the Importance Weighted Actor Learner Architecture with the Random Network Distillation (RND) mechanism to minimize operating cost by adjusting power generation and navigation plan scheduling.
Conclusion and prospect
This paper systematically reviews the development trajectory of RL in MASs in recent years, revealing the interconnectivity between SARL and MARL in their theoretical frameworks and algorithm designs. This interconnectivity is reflected not only in the extended application of core methods such as value function decomposition and PG (for example, the adaptation of dynamic programming ideas to MARL's centralized learning), but also in the transition of decision-making logic from "individual optimality" to "group equilibrium". By reconfiguring SARL algorithms into dynamic programming, value function decomposition and PG types, and abstracting MARL algorithms into the four paradigms of centralized learning, behavior analysis, communication learning and collaborative learning, this review establishes an algorithm mapping relationship from single-agent to multi-agent scenarios. This framework provides a new perspective for understanding the evolutionary correlation between the two classes of methods. Although MARL has achieved remarkable success in many fields in recent years, a series of key issues remain to be solved in practice. Finally, based on the investigation of the current research status and consideration of existing problems, several key issues and future research directions for MARL algorithms are presented.
Curse of dimensionality
The curse of dimensionality in MARL means that as the number of agents grows, the dimensions of the state and action spaces expand, making it challenging to estimate and represent the data distribution accurately in a high-dimensional space. It also increases the difficulty of training with limited samples, which may lead to a significant decline in learning efficiency and performance and makes application to large-scale MASs difficult. Currently, a common solution is to break a complex problem down into several sub-problems, solve each separately, and then combine the solutions to reduce the complexity of the system. In addition, projecting high-dimensional data into a lower-dimensional space can also achieve dimensionality reduction. In general, however, it remains a challenge for current MARL algorithms to decompose and abstract problems effectively in complex, dynamically changing environments, so data sparsity, computational efficiency and model interpretability in high-dimensional settings still need further study.
Instability
The instability problem in MASs is mainly manifested as non-stationarity of the environment. In a MAS, the state transition probability of the environment depends on the strategies of all agents, and each agent's policy is constantly updated during learning. From the perspective of a single agent, the other agents form part of the environment, and the joint policy of the system changes dynamically. This dynamic shift violates the stationarity assumption of the MDP: the state transition probability and reward function are no longer fixed but vary with the other agents' policy changes. Environmental non-stationarity in MASs (Wang et al. 2025) is thus a complex and challenging problem. To reduce the impact of these dynamic changes, the CTDE learning paradigm lets agents use global information to learn a stable policy during training while remaining decentralized at execution time; QMIX and VDN, for example, follow this paradigm to mitigate non-stationarity. In the future, it is necessary to explore more efficient, robust and generalizable RL algorithms to improve the learning and adaptation ability of MASs in dynamic environments.
Credit allocation problem
In MARL, multiple agents jointly influence the global reward in order to cooperate, but the individual contribution of each agent to that reward remains unclear, so the causal link between an agent's local actions and the global reward is not well defined. How to allocate the environmental reward fairly to each participating agent, so that each agent can learn and adjust according to its own behavior, has therefore become a key problem. Researchers have proposed a variety of methods for credit assignment, and value function decomposition can handle it effectively: the core idea is to decompose the global reward into local rewards, i.e., to decompose the Q-value of the joint action into the local Q-values of the agents. These solutions have the desirable properties of fairness and symmetry. The above methods solve the credit assignment problem to some extent, but issues such as multi-agent cooperation, long-term reward delay, and the limitations of different methods in different scenarios still require further research.
With further development, MARL algorithms will gradually be improved to solve more complex problems and are expected to be applied to other related fields. Our study aims to organize and summarize the current state of MARL research. Owing to the authors' limited scope and the limitation of space, some theoretical methods and excellent results in RL may not be included in this review. It is hoped that the analysis in this paper can provide reference and inspiration for researchers in this field and promote the development of cooperative control and optimization methods for MASs and related applications.
Author contributions
D.Z. and Q.Y. were responsible for the conception and design of the study and wrote the main manuscript text. L.M. and R.X. participated in the literature review section and provided theoretical support for the manuscript. W.L. prepared the figures and tables and helped write the summary. C.Q. provided the final review and revision of the manuscript. All authors participated in the discussions and reviewed and approved the manuscript content.
Funding
This work was supported in part by Nanyang Normal University Foundation of China under Grant (2024PY011), in part by the Key Scientific and Technological Project of the Henan Province under Grant (232102311012, 232102311004), in part by the Key Research Projects Funding Program for Higher Education Institutions of Henan Province under Grant (24A320003), and in part by the Key Medical Science and Technology Research Project of Henan Province under Grant (SBJ202103098, LHGJ20220662).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Conflict of interest
The authors declare no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pp 207–216
Ahad A, Tahir M, Sheikh MAS, et al (2021) Optimal route selection in 5G-based smart health-care network: a reinforcement learning approach. In: 2021 26th IEEE Asia-Pacific conference on communications (APCC), pp 248–253
Alelaiwi A (2020) Resource allocation management in patient-to-physician communications based on deep reinforcement learning in smart healthcare services. In: 2020 IEEE International conference on multimedia & expo workshops (ICMEW), pp 1–5
Anderson, JR. Machine learning: an artificial intelligence approach; 1983; Burlington, Morgan Kaufmann:
Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, vol 1, pp 1027–1035
Asha Rani, GS; Lal Priya, PS; Jayan, J et al. Data-driven energy management of an electric vehicle charging station using deep reinforcement learning. IEEE Access; 2024; 12, pp. 65956-65966. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3398059]
Bai, H; Shen, R; Lin, Y et al. Lamarckian platform: pushing the boundaries of evolutionary reinforcement learning toward asynchronous commercial games. IEEE Trans Games; 2024; 16,
Bellman, R. A Markovian decision process. J Math Mech; 1957; 6,
Bellman, RE. Dynamic programming; 1957; Princeton, Princeton University Press:
Bertsekas, DP. Approximate policy iteration: a survey and some new methods. J Control Theory Appl; 2011; 9,
Bertsekas, D. Dynamic programming and optimal control; 2012; Nashua, Athena scientific:
Breiman, L. Random forests. Mach Learn; 2001; 45,
Cao, H; Xiong, H; Zeng, W et al. Safe reinforcement learning-based motion planning for functional mobile robots suffering uncontrollable mobile robots. IEEE Trans Intell Transp Syst; 2024; 25,
Castañeda AO (2016) Deep reinforcement learning variants of multi-agent learning algorithms. PhD thesis, University of Edinburgh, Edinburgh
Chapelle, O; Scholkopf, B; Zien, A. Semi-supervised learning (chapelle, o. et al., eds.; 2006) [book reviews]. IEEE Trans Neural Netw; 2009; 20,
Chen, H; Luo, H; Huang, B et al. Transfer learning-motivated intelligent fault diagnosis designs: a survey, insights, and perspectives. IEEE Trans Neural Netw Learning Syst; 2023; 35,
Chen, W; Yang, S; Li, W et al. Learning multi-intersection traffic signal control via coevolutionary multi-agent reinforcement learning. IEEE Trans Intell Transp Syst; 2024; 25,
Chen, J; Zhu, B; Zhang, M et al. Multi-agent deep reinforcement learning cooperative control model for autonomous vehicle merging into platoon in highway. World Electr Veh J; 2025; 16,
Diprasetya MR, Pullani AN, Schwung A (2024) Sim-to-real transfer for robotics using model-free curriculum reinforcement learning. In: 2024 IEEE international conference on industrial technology (ICIT), pp 1–6
Drudi C, Fechner M, Mollura M et al (2024) Reinforcement learning for heart failure treatment optimization in the intensive care unit. In: 2024 46th annual international conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp 1–4
Foerster JN, Assael YM, de Freitas N et al (2016) Learning to communicate with deep multi-agent reinforcement learning. In: Proceedings of the 30th international conference on neural information processing systems, Barcelona, pp 2145–2153
Foerster J, Farquhar G, Afouras T et al (2017) Counterfactual multi-agent policy gradients. arXiv:1705.08926
Fukushima, K; Miyake, S; Ito, T. Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans Syst Man Cybern; 1983; 13,
Gauss, CF. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium; 1809; Hamburg, Perthes et Besser:
Geranmayeh, P; Grass, E. Optimization of beamforming and transmit power using DGN and comparison with traditional techniques. IEEE Access; 2025; 13, pp. 94275-94285. [DOI: https://dx.doi.org/10.1109/ACCESS.2025.3573096]
Geurts, P; Irrthum, A; Wehenkel, L. Supervised learning with decision tree-based methods in computational and systems biology. Mol BioSyst; 2009; 5,
Gong, J; Yu, N; Han, F et al. Energy scheduling optimization for microgrids based on partially observable Markov game. IEEE Trans Artif Intell; 2024; 5,
Guan H (2020) Analysis on deep reinforcement learning in industrial robotic arm. In: 2020 international conference on intelligent computing and human–computer interaction (ICHCI), pp 426–430
Guan H, Gao Y, Zhao M et al (2022) Ab-mapper: attention and bicnet based multi-agent path planning for dynamic environment. In: 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), IEEE, pp 13799–13806
Gui, J; Chen, T; Zhang, J et al. A survey on self-supervised learning: algorithms, applications, and future trends. IEEE Trans Pattern Anal Mach Intell; 2024; 46,
Han, J; Pei, J; Yin, Y. Mining frequent patterns without candidate generation. ACM SIGMOD Rec; 2000; 29,
Hausknecht MJ, Stone P (2015) Deep recurrent Q-learning for partially observable MDPs. In: AAAI fall symposia, p 141
Hosmer, DW; Lemeshow, S. Applied logistic regression; 2000; Hoboken, Wiley: [DOI: https://dx.doi.org/10.1002/0471722146]
Hou, Y; Zhao, J; Zhang, R et al. UAV swarm cooperative target search: a multi-agent reinforcement learning approach. IEEE Trans Intell Veh; 2024; 9,
Hu, B; Gong, Y; Liang, X et al. Safe deep reinforcement learning-based real-time multi-energy management in combined heat and power microgrids. IEEE Access; 2024; 12, pp. 193581-193593. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3520357]
Huang, Z; van der Aalst, W; Lu, X et al. Reinforcement learning based resource allocation in business process management. Data Knowl Eng; 2011; 70,
Iqbal S, Sha F (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv:1810.02912
Iskandar A, Rostum HM, Kovács B (2023) Using deep reinforcement learning to solve a navigation problem for a swarm robotics system. In: 2023 24th international Carpathian control conference (ICCC), pp 185–189
Isla-Cernadas, D; Fernández-Delgado, M; Cernadas, E et al. Closed-form gaussian spread estimation for small and large support vector classification. IEEE Trans Neural Netw Learn Syst; 2025; 36,
Jiang, Q; Qin, M; Zhang, H et al. Blindlight: high robustness reinforcement learning method to solve partially blinded traffic signal control problem. IEEE Trans Intell Transp Syst; 2024; 25,
Jiang, Y; Di, K; Qian, R et al. Optimizing risk-aware task migration algorithm among multiplex UAV groups through hybrid attention multi-agent reinforcement learning. Tsinghua Sci Technol; 2025; 30,
Jordan MI, Kearns MJ, Solla SA (1998) Advances in neural information processing systems. In: proceedings of the 1997 conference, vol 10. MIT Press
Kaloev M, Krastev G (2023) Comprehensive review of benefits from the use of neuron connection pruning techniques during the training process of artificial neural networks in reinforcement learning: experimental simulations in atari games. In: 2023 7th International symposium on multidisciplinary studies and innovative technologies (ISMSIT), pp 1–6
Karimi S, Asadi S, Payberah AH (2024) Bazigooshi: a hybrid model of reinforcement learning for generalization in gameplay. IEEE Trans Games 16
Kell AJM, Forshaw M, Stephen McGough A (2020) Exploring market power using deep reinforcement learning for intelligent bidding strategies. In: 2020 IEEE international conference on big data (big data), pp 4402–4411
Kim S (2023) Learning and game based spectrum allocation model for internet of medical things (IOMT) platform. IEEE Access 11:48059–48068. https://dx.doi.org/10.1109/ACCESS.2023.3266331
Kim D, Moon S, Hostallero D et al (2019) Learning to schedule communication in multi-agent reinforcement learning. arXiv:1902.01554
Kiran BR, Sobh I, Talpaert V et al (2022) Deep reinforcement learning for autonomous driving: a survey. IEEE Trans Intell Transp Syst 23
Kobren A, Monath N, Krishnamurthy A et al (2017) A hierarchical algorithm for extreme clustering. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 255–264
Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J Mach Learn Res 4
Lance GN, Williams WT (1967) A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput J 9
Li M, Zhou ZH (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern Part A 37
Li K, Hu Q, Liu Q et al (2024) A predefined-time consensus algorithm of multi-agent system for distributed constrained optimization. IEEE Trans Netw Sci Eng 11
Li X, Yang Y, Yuan Z et al (2024) Progression cognition reinforcement learning with prioritized experience for multi-vehicle pursuit. IEEE Trans Intell Transp Syst 25
Lillicrap TP, Hunt JJ, Pritzel A et al (2015) Continuous control with deep reinforcement learning. arXiv:1509.02971
Lin M, Zhao B, Liu D (2025) Optimal learning output tracking control: a model-free policy optimization method with convergence analysis. IEEE Trans Neural Netw Learn Syst 36
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Machine learning proceedings 1994. Elsevier, pp 157–163
Liu Y, Wang Q (2022) Game confrontation of 5v5 multi-agent based on Mappo reinforcement learning algorithm. In: 2022 37th youth academic annual conference of chinese association of automation (YAC), pp 1395–1398
Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25
Liu D, Wei Q, Wang D et al (2017) Adaptive dynamic programming with applications in optimal control. Springer, Cham. https://dx.doi.org/10.1007/978-3-319-50815-3
Liu D, Xue S, Zhao B et al (2021a) Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans Syst Man Cybern Syst 51
Liu X, Zhang F, Hou Z et al (2021b) Self-supervised learning: generative or contrastive. IEEE Trans Knowl Data Eng 35
Liu D, Dou L, Zhang R et al (2023) Multi-agent reinforcement learning-based coordinated dynamic task allocation for heterogenous UAVs. IEEE Trans Veh Technol 72
Liu B, Han W, Wang E et al (2024) An efficient message dissemination scheme for cooperative drivings via cooperative hierarchical attention reinforcement learning. IEEE Trans Mob Comput 23
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28
Lu R, Jiang Z, Yang T et al (2024) A novel hybrid-action-based deep reinforcement learning for industrial energy management. IEEE Trans Ind Inf 20
Ma Z, Liu X, Huang Y (2024) Unsupervised reinforcement learning for multi-task autonomous driving: expanding skills and cultivating curiosity. IEEE Trans Intell Transp Syst 25
Malathy V, Al-Jawahry HM, GKM et al (2024) A reinforcement learning method in cooperative multi-agent system for production control system. In: 2024 international conference on data science and network security (ICDSNS), pp 1–4
Marden JR, Arslan G, Shamma JS (2009) Cooperative control and potential games. IEEE Trans Syst Man Cybern B 39
Markov AA (1906) Extension of the law of large numbers to dependent quantities. Izv Fiz-Matem Obsch Kazan Univ 15
Ministry of Industry and Information Technology of the People’s Republic of China (2024) Notice on issuing the national comprehensive standardization system construction guidelines for the artificial intelligence industry (2024 edition). https://www.miit.gov.cn/zwgk/zcwj/wjfb/tz/art/2024/art_e8ebf5600ec24d3db644150873712c5f.html
Minsky M (2007) Steps toward artificial intelligence. Proc IRE 49
Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518
Mnih V, Kavukcuoglu K, Silver D et al (2013) Playing Atari with deep reinforcement learning. arXiv:1312.5602
Mnih V, Badia AP, Mirza M et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937
Mocanu E, Mocanu DC, Nguyen PH et al (2019) On-line building energy optimization using deep reinforcement learning. IEEE Trans Smart Grid 10
Neftci EO, Averbeck BB (2019) Reinforcement learning in artificial and biological systems. Nat Mach Intell 1
Noaro G, Zhu T, Cappon G et al (2023) A personalized and adaptive insulin bolus calculator based on double deep q-learning to improve type 1 diabetes management. IEEE J Biomed Health Inform 27
Park JS, Chen MS, Yu PS (1995) An effective hash-based algorithm for mining association rules. ACM SIGMOD Rec 24
Pearson K (1901) LIII. On lines and planes of closest fit to systems of points in space. Lond Edinburgh Dublin Philos Mag J Sci 2
Peng P, Yuan Q, Wen Y et al (2017) Multiagent bidirectionally-coordinated nets for learning to play Starcraft combat games. arXiv:1703.10069
Prudencio RF, Maximo MR, Colombini EL (2023) A survey on offline reinforcement learning: taxonomy, review, and open problems. IEEE Trans Neural Netw Learn Syst 35
Qi L, Sun Y, Luan W (2024) Large-scale traffic signal control based on multi-agent q-learning and pressure. IEEE Access 12:1092–1101. https://dx.doi.org/10.1109/ACCESS.2023.3345343
Qin C, Qiao X, Wang J et al (2024) Barrier-critic adaptive robust control of nonzero-sum differential games for uncertain nonlinear systems with state constraints. IEEE Trans Syst Man Cybern Syst 54
Qin C, Hou S, Pang M et al (2025a) Reinforcement learning-based secure tracking control for nonlinear interconnected systems: an event-triggered solution approach. Eng Appl Artif Intell 161:112243. https://dx.doi.org/10.1016/j.engappai.2025.112243
Qin C, Jiang K, Wang Y et al (2025b) Event-triggered H∞ control for unknown constrained nonlinear systems with application to robot arm. Appl Math Model 144:116089. https://dx.doi.org/10.1016/j.apm.2025.116089
Qin C, Ran X, Zhang D (2025c) Unsupervised image stitching based on generative adversarial networks and feature frequency awareness algorithm. Appl Soft Comput 183:113466. https://dx.doi.org/10.1016/j.asoc.2025.113466
Qu G, Wu H, Li R et al (2021) DMRO: a deep meta reinforcement learning-based task offloading framework for edge-cloud computing. IEEE Trans Netw Serv Manage 18
Rashid T, Samvelyan M, Schroeder C et al (2018) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In: International conference on machine learning, PMLR, pp 4295–4304
Rashid T, Farquhar G, Peng B et al (2020) Weighted qmix: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. In: Advances in neural information processing systems, pp 10199–10210
Rattal S, Badri A, Moughit M et al (2025) AI-driven optimization of low-energy IoT protocols for scalable and efficient smart healthcare systems. IEEE Access 13:48401–48415. https://dx.doi.org/10.1109/ACCESS.2025.3551224
Rawat RS, Rana DS (2023) Implementation of reinforcement learning and imaging for better decision-making in the medical sector. In: 2023 IEEE 8th international conference for convergence in technology (I2CT), pp 1–4
Ren Y, Zhang H, Du L et al (2024) Stealthy black-box attack with dynamic threshold against MARL-based traffic signal control system. IEEE Trans Ind Inf 20
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65
Rummery GA, Niranjan M (1994) On-line Q-learning using connectionist systems. Department of Engineering, University of Cambridge, Cambridge
Saeed AK, Holguin F, Yasin AS et al (2024) Multi-agent and multi-target reinforcement learning for satellite sensor tasking. In: 2024 IEEE aerospace conference, pp 1–13
Schubert E, Sander J, Ester M et al (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst (TODS) 42
Schütz HJ, Kolisch R (2012) Approximate dynamic programming for capacity allocation in the service industry. Eur J Oper Res 218
Schwung A, Schwung D, Abdul Hameed MS (2019a) Cooperative robot control in flexible manufacturing cells: centralized vs. distributed approaches. In: 2019 IEEE 17th international conference on industrial informatics (INDIN), pp 233–238
Schwung D, Schwung A, Ding SX (2019b) Actor-critic reinforcement learning for energy optimization in hybrid production environments. Int J Comput 18
Schwung D, Schwung A, Ding SX (2022) Distributed self-optimization of modular production units: a state-based potential game approach. IEEE Trans Cybern 52
Shang C, Fu L, Xiao H et al (2024) Joint optimization of power generation and voyage scheduling in ship power system based on operating scene clustering and multitask deep reinforcement learning. IEEE Trans Transp Electr 10
Shen X, Zhang X, Wang Y (2021) Kernel temporal difference based reinforcement learning for brain machine interfaces. In: 2021 43rd annual international conference of the IEEE engineering in medicine & biology society (EMBC), pp 6721–6724
Shi B, Yuan H, Shi R (2018) Pricing cloud resource based on multi-agent reinforcement learning in the competing environment. In: 2018 IEEE Intl conf on parallel & distributed processing with applications, ubiquitous computing & communications, big data & Cloud computing, social computing & networking, sustainable computing & communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pp 462–468
Silver D, Lever G, Heess N et al (2014) Deterministic policy gradient algorithms. In: International conference on machine learning, PMLR, pp 387–395
Singh A, Jain T, Sukhbaatar S (2019) Learning when to communicate at scale in multiagent cooperative and competitive tasks. In: 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA
Skinner BF (1956) A case history in scientific method. Am Psychol 11
Skinner BF (1958) Reinforcement today. Am Psychol 13
Son K, Kim D, Kang WJ et al (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International conference on machine learning, PMLR, pp 5887–5896
Son K, Kim D, Kang WJ et al (2020) Qtran++: improved value transformation for cooperative multi-agent reinforcement learning. arXiv:2006.12010
Sui F, Yue W, Zhang Z et al (2023) Trial-and-error learning for mems structural design enabled by deep reinforcement learning. In: 2023 IEEE 36th international conference on micro electro mechanical systems (MEMS), pp 503–506
Sukhbaatar S, Fergus R (2016) Learning multiagent communication with backpropagation. In: Proceedings of the 29th conference on neural information processing systems, NIPS, Barcelona, pp 2252–2260
Sun C, Mu C (2020) Some key scientific problems in multi-agent deep reinforcement learning. Acta Automatica Sinica 46
Sun Z, Zhou Y, Tang S et al (2024) Noise suppression zeroing neural network for online solving the time-varying inverse kinematics problem of four-wheel mobile manipulators with external disturbances. Artif Intell Rev 57
Sunehag P, Lever G, Gruslys A et al (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv:1706.05296
Sutton RS (1988) Learning to predict by the methods of temporal differences. Mach Learn 3:9–44. https://dx.doi.org/10.1023/A:1022633531479
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Sutton RS, McAllester D, Singh S et al (1999) Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 12:1–7
Tamar A, Wu Y, Thomas G et al (2016) Value iteration networks. In: Advances in neural information processing systems, vol 29
Tampuu A, Matiisen T, Kodelja D et al (2017) Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12
Tan M (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In: Proceedings of the tenth international conference on machine learning, pp 330–337
Uc-Cetina V, Navarro-Guerrero N, Martin-Gonzalez A et al (2023) Survey on reinforcement learning for language processing. Artif Intell Rev 56
Urbanowicz RJ, Moore JH (2009) Learning classifier systems: a complete introduction, review, and roadmap. J Artif Evol Appl 2009:1–25
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9
Van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI conference on artificial intelligence
Waltz DL, Fu KS (1965) A heuristic approach to reinforcement learning control systems. IEEE Trans Autom Control 10
Wang Z, Schaul T, Hessel M et al (2016) Dueling network architectures for deep reinforcement learning. arXiv:1511.06581
Wang J, Ren Z, Liu T et al (2020) Qplex: duplex dueling multi-agent q-learning. arXiv:2008.01062
Wang H, Tao J, Peng T et al (2022a) Dynamic inventory replenishment strategy for aerospace manufacturing supply chain: combining reinforcement learning and multi-agent simulation. Int J Prod Res 60
Wang L, Zhang Y, Hu Y et al (2022b) Individual reward assisted multi-agent reinforcement learning. In: Proceedings of the 39th international conference on machine learning, ACM Press, Baltimore, pp 23417–23432
Wang D, Gao N, Liu D et al (2024a) Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications. IEEE/CAA J Autom Sin 11
Wang H, Liu Z, Hu G et al (2024b) Offline meta-reinforcement learning for active pantograph control in high-speed railways. IEEE Trans Ind Inform
Wang S, Yue Q, Xu Z et al (2025) A collaborative multi-agent reinforcement learning approach for non-stationary environments with unknown change points. Mathematics 13
Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Wei S, Wang S, Sun S et al (2022) Stock ranking prediction based on an adversarial game neural network. IEEE Access 10:65028–65036. https://dx.doi.org/10.1109/ACCESS.2022.3181999
Wen G, Yang T, Zhou J et al (2023) Reinforcement learning and adaptive/approximate dynamic programming: a survey from theory to applications in multi-agent systems. Control and Decision 38
Wu J, Li D, Yu Y et al (2024) An attention mechanism and adaptive accuracy triple-dependent MADDPG formation control method for hybrid UAVs. IEEE Trans Intell Transp Syst 25
Xia X, Fu X, Zhong S et al (2023) A multi-agent convolution deep reinforcement learning network for aeroengine fleet maintenance strategy optimization. J Manuf Syst 68:410–425. https://dx.doi.org/10.1016/j.jmsy.2023.05.005
Xing X, Zhou Z, Li Y et al (2024) Multi-UAV adaptive cooperative formation trajectory planning based on an improved MATD3 algorithm of deep reinforcement learning. IEEE Trans Veh Technol 73
Xing-Xing L, Yang-He F, Yang M et al (2020) Deep multi-agent reinforcement learning: a survey. Acta Automatica Sinica 46
Xu D, Yu Z, Liao X et al (2024) A graph deep reinforcement learning traffic signal control for multiple intersections considering missing data. IEEE Trans Veh Technol 73
Xue Y, Chen W (2023) Multi-agent deep reinforcement learning for UAVs navigation in unknown complex environment. IEEE Trans Intell Veh 9
Xue S, Luo B, Liu D (2021) Event-triggered adaptive dynamic programming for unmatched uncertain nonlinear continuous-time systems. IEEE Trans Neural Netw Learn Syst 32
Xue S, Luo B, Liu D et al (2022) Event-triggered ADP for tracking control of partially unknown constrained uncertain systems. IEEE Trans Cybern 52
Xue S, Zhang W, Luo B et al (2025) Integral reinforcement learning-based dynamic event-triggered nonzero-sum games of USVs. IEEE Trans Cybern 55
Xue S, Zhao N, Zhang W et al (2025) A hybrid adaptive dynamic programming for optimal tracking control of USVs. IEEE Trans Neural Netw Learn Syst 36
Xu H, Zuo L, Sun F et al (2022) Low-latency patient monitoring service for cloud computing based healthcare system by applying reinforcement learning. In: 2022 IEEE 8th international conference on computer and communications (ICCC), pp 1373–1377
Yan Z, Kreidieh AR, Vinitsky E et al (2023) Unified automatic control of vehicular systems with reinforcement learning. IEEE Trans Autom Sci Eng 20
Yang Y, Hao J, Chen G et al (2018) Multi-agent soft q-learning. arXiv:1804.04175
Yang Y, Hao J, Chen G et al (2020a) Q-value path decomposition for deep multi-agent reinforcement learning. arXiv:2002.03950
Yang Y, Hao J, Liao B et al (2020b) Qatten: a general framework for cooperative multi-agent reinforcement learning. arXiv:2002.03939
Yang X, Song Z, King I et al (2022) A survey on deep semi-supervised learning. IEEE Trans Knowl Data Eng 35
Yang Q, Wang S, Zhang Q et al (2023) Hundreds guide millions: adaptive offline reinforcement learning with expert guidance. IEEE Trans Neural Netw Learn Syst 35
Yu C, Wang X, Xu X et al (2020) Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs. IEEE Trans Intell Transp Syst 21
Yuan L, Wang J, Zhang F et al (2022) Multi-agent incentive communication via decentralized teammate modeling. In: Proceedings of the 36th AAAI conference on artificial intelligence, AAAI Press, Ottawa, pp 9466–9474
Yu C, Velu A, Vinitsky E et al (2022) The surprising effectiveness of PPO in cooperative multi-agent games. In: Advances in Neural Information Processing Systems, pp 24611–24624
Zhang X, Wang Y (2023) A kernel reinforcement learning decoding framework integrating neural and feedback signals for brain control*. In: 2023 45th annual international conference of the IEEE engineering in medicine & biology society (EMBC), pp 1–4
Zhang ML, Zhou ZH (2011) Cotrade: confident co-training with data editing. IEEE Trans Syst Man Cybern B 41
Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Handbook of reinforcement learning and control, pp 321–384
Zhang D, Wang Y, Jiang K et al (2023) Safe optimal robust control of nonlinear systems with asymmetric input constraints using reinforcement learning. Appl Intell 54
Zhang D, Wang Y, Meng L et al (2024) Adaptive critic design for safety-optimal FTC of unknown nonlinear systems with asymmetric constrained-input. ISA Trans 155:309–318. https://dx.doi.org/10.1016/j.isatra.2024.09.018
Zhang D, Yu C, Li Z et al (2025a) A lightweight network enhanced by attention-guided cross-scale interaction for underwater object detection. Appl Soft Comput 184:113811. https://dx.doi.org/10.1016/j.asoc.2025.113811
Zhang Y, Zhao B, Liu D (2025b) Distributed optimal containment control of wheeled mobile robots via adaptive dynamic programming. IEEE Trans Syst Man Cybern Syst 55
Zhao F, Hua Y, Zheng H et al (2023a) Cooperative target pursuit by multiple fixed-wing UAVs based on deep reinforcement learning and artificial potential field. In: 2023 42nd Chinese control conference (CCC), pp 5693–5698
Zhao M, Wang D, Qiao J et al (2023b) Advanced value iteration for discrete-time intelligent critic control: a survey. Artif Intell Rev 56
Zhao B, Zhang S, Liu D (2025) Self-triggered approximate optimal neuro-control for nonlinear systems through adaptive dynamic programming. IEEE Trans Neural Netw Learn Syst 36
Zheng H, Ryzhov IO, Xie W et al (2021) Personalized multimorbidity management for patients with type 2 diabetes using reinforcement learning of electronic health records. Drugs 81:471–482. https://dx.doi.org/10.1007/s40265-020-01435-4
Zhu H, Vyetrenko S, Dwarakanath K et al (2023a) Once burned, twice shy? The effect of stock market bubbles on traders that learn by experience. In: 2023 Winter simulation conference (WSC), pp 291–302
Zhu Z, Lin K, Jain AK et al (2023b) Transfer learning in deep reinforcement learning: a survey. IEEE Trans Pattern Anal Mach Intell 45