
Abstract

Accurate and feasible target assignment in an urban environment without road networks remains challenging. Existing methods exhibit critical limitations: computational inefficiency that prevents real-time decision-making, and poor cross-scenario generalization that yields task-specific policies lacking adaptability. To achieve efficient target assignment in urban adversarial scenarios, we propose an efficient traversable path generation method requiring only binarized images, along with four key constraint models serving as optimization objectives. Moreover, we model this optimization problem as a Markov decision process (MDP) and introduce the generalization sequential proximal policy optimization (GSPPO) algorithm within the reinforcement learning (RL) framework. Specifically, GSPPO integrates an exploration history representation (EHR) module and a neuron-specific plasticity enhancement (NPE) module. EHR incorporates exploration history into the policy learning loop, which significantly improves learning efficiency. To mitigate plasticity loss in neural networks, the NPE module boosts the model's representational capability and generalization across diverse tasks. Experiments demonstrate that our approach reduces planning time by four orders of magnitude compared with the online planning method. Against the benchmark algorithm, it achieves 94.16% higher convergence performance, 33.54% shorter assignment path length, 51.96% lower threat value, and 40.71% faster total time. Our approach supports real-time military reconnaissance and can also facilitate rescue operations in complex cities.


1. Introduction

Accelerating urbanization makes military operations in dense urban environments increasingly pivotal [1]. This raises demand for accurate and efficient target assignment in such settings. For target assignment, precise road network information is essential. However, obtaining accurate road network intelligence remains fundamentally challenging [2]. This constraint requires extracting traversable paths from limited situational inputs—especially unmanned aerial vehicle (UAV) reconnaissance imagery. Such path generation forms the critical foundation for real-time target assignment. Here, we focus on urban adversarial target assignment—defined as military engagements that pit our units against enemy targets in complex urban environments. Specifically, this is a weapon–target assignment (WTA) variant that optimizes the distribution of our units (e.g., unmanned ground vehicles, UGVs) against enemy targets [3].

Target assignment is an NP-hard problem. Its decision space grows exponentially with problem scale—specifically with the number of units and targets [4]. Consequently, solution real-time performance, accuracy, and effectiveness directly impact mission outcomes. These factors determine optimal results in adversarial operations. Current research employs two main algorithmic approaches: traditional exact algorithms based on mixed integer linear programming (MILP) and heuristic algorithms [5].

Traditional exact algorithms mainly address small-scale problems. They often require simplified structural assumptions. Target assignment is typically formulated as an MILP problem [6]. Solution methods include the Hungarian algorithm [7], branch-and-bound algorithm [8,9], and Lagrangian relaxation [10], which rely on classical solvers like Gurobi. However, they do not directly solve the original nonlinear objective function. When the problem size increases, online computation time scales exponentially. This makes it hard to meet strict real-time requirements in practical decision-making. Thus, exact algorithms face scalability limits. Their use is largely restricted to problems with low real-time demands and simple constraints.

Human decision preferences and weapon–target interaction knowledge can be encoded as rules. These rules help construct optimal solutions. Rule-based heuristic algorithms apply such rules during solution construction and search. This enables rapid generation of feasible assignments. Representative methods include genetic algorithms (GAs) [11], particle swarm optimization (PSO) [12], and ant colony optimization (ACO) [13]. However, these algorithms heavily depend on scenario-specific conditions and domain knowledge. They frequently converge to local optima [14]. Although optimization accuracy improves, convergence speed remains limited. Consequently, such algorithms show poor generalization capabilities in high-dimensional real-world problems.

Reinforcement learning (RL) has achieved remarkable breakthroughs in various domains recently [15]. Combinatorial optimization involves selecting optimal variables from discrete decision spaces. This process resembles RL’s action selection mechanism. The offline-to-online paradigm of RL enables real-time combinatorial optimization solutions. Consequently, RL techniques show strong potential for classical combinatorial optimization problems. Traditional methods include value-based algorithms for simplified single-target weapon scenarios [16] and actor-critic algorithms for cooperative target assignment problems [5,17].

However, conventional RL algorithms face four critical limitations in urban adversarial target assignment [18]. First, most methods optimize single tasks, lacking adaptability to diverse mission scenarios. Second, grid-validated algorithms fail in realistic cities with irregular structures [19]. Moreover, traditional approaches ignore the impact of historical decisions on current choices, resulting in low learning efficiency. Furthermore, continual learning gradually reduces neural network adaptability for new tasks—termed plasticity loss. This severely compromises cross-scenario generalization [20].

To overcome these limitations, this paper presents an integrated framework for target assignment in urban adversarial scenarios without road networks. Our solution spans traversable path construction to autonomous assignment scheme generation, enabling efficient real-time planning using only binarized shapefile (SHP) data. The main contributions of this paper are summarized as follows:

We propose an integrated framework for urban adversarial target assignment without road networks. It generates traversable paths from binarized SHP data and solves assignment via the novel GSPPO algorithm within an MDP formulation.

We design an EHR module that fully utilizes historical information by integrating environmental interaction trajectories into the policy learning loop, which enhances learning efficiency.

We develop an NPE module that dynamically recalibrates network parameters during training. It optimizes policy updates while mitigating plasticity loss, which improves the network’s generalization across diverse tasks.

We validate GSPPO’s effectiveness for urban adversarial target assignment through comprehensive experiments. Results show that GSPPO maintains high decision quality while significantly improving computational efficiency. These capabilities indicate strong practicality for real-time scenarios.

2. Related Work

Urban adversarial target assignment studies the optimal allocation of our units against enemy targets to maximize mission effectiveness. Current research mainly focuses on two aspects: model formulation and algorithm design. Based on problem scale and real-time requirements, existing approaches can be categorized into three major classes: exact algorithms, heuristic algorithms, and RL algorithms. These categories show distinct features in computational complexity, scalability, and real-time performance.

2.1. Exact Algorithms

Exact algorithms seek optimal solutions through mathematical programming frameworks, but due to computational complexity, they are typically only applicable to small-scale problems. Branch-and-bound is a mainstream exact method for solving the WTA problem. Cha et al. [21] proposed a branch-and-bound algorithm for artillery fire scheduling, capable of obtaining optimal solutions for small-scale instances within a reasonable computation time. Kline et al. [22] further introduced a hybrid depth-first search strategy to address nonlinear integer programming formulations of the WTA problem. Lu et al. [23] modeled the WTA problem as a 0–1 integer linear program and improved search efficiency in the solution space by combining branch-and-bound with column generation techniques. Ahner et al. [24] proposed an adaptive dynamic programming approach that integrates concave adaptive value estimation (CAVE) with the max-margin reward (MMR) algorithm, validating solution optimality through a post-decision dynamic programming formulation. Dynamic programming has also been applied to small-scale two-stage WTA problems; however, its practical applicability is limited by the curse of dimensionality [24].

2.2. Heuristic Algorithms

To address larger-scale or more complex WTA problems, researchers have proposed various heuristic algorithms that strike a balance between solution quality and computational efficiency. Rule-based heuristic methods generate feasible solutions rapidly by embedding domain-specific knowledge. Xin et al. [25,26] introduced a virtual permutation-based tabu search approach and a constructive heuristic algorithm, respectively, both of which significantly improved the efficiency of solving medium-scale WTA instances. Chang et al. [27] combined rule-driven population initialization with an improved artificial bee colony algorithm to effectively solve medium-scale WTA problems. Zhang et al. [28] proposed a heuristic algorithm based on statistical marginal reward (HA-SMR), which demonstrated effectiveness in asset-based WTA scenarios. Multi-objective optimization approaches enhance solution diversity by balancing damage probability and resource cost. Yan et al. [29] developed an improved multi-objective particle swarm optimization (MOPSO) algorithm that generates a solution set superior to the general Pareto front by dynamically adjusting learning factors and inertia weights. However, heuristic algorithms are generally unable to guarantee global optimality and exhibit high sensitivity to parameter configurations [27].

2.3. RL Algorithms

In recent years, RL has been introduced into the domain of WTA due to its advantages in dynamic decision-making. RL methods directly learn assignment policies through state-action modeling. Luo et al. [30] proposed an RL-based framework for solving WTA problems, which outperformed traditional heuristic approaches in both solution quality and computational efficiency. Wang et al. [31] integrated deep Q-networks (DQN) with an improved multi-objective artificial bee colony (MOABC) algorithm, enhancing system cumulative reward while reducing time overhead through a hybrid strategy. Multi-objective RL methods further optimize multiple conflicting objectives simultaneously. Zou et al. [32] combined DQN with adaptive mutation and greedy crossover operators, proposing a multi-objective evolutionary algorithm (MOEA) that significantly improved both the convergence and diversity of solutions. Ding et al. [33] designed a distributed PPO algorithm incorporating threat assessment and a dynamic attention mechanism, enabling adaptability to a complex battlefield environment through a hierarchical decision-making framework. LSTM-PPO [18] hybrids capture temporal dependencies in sequential tasks, yet struggle with plasticity loss during task shifts. Curriculum RL (CRL) frameworks employ phased task progression to reduce negative transfer but increase hyperparameter sensitivity. For specific solution methods and their corresponding characteristics and limitations, please refer to Table 1.

Our review reveals that traditional WTA frameworks still face critical limitations in urban adversarial scenarios. They inadequately handle assignment mechanisms under complex constraints, especially regarding real-time applicability. Theoretical and methodological innovations remain urgently needed to address complex urban challenges.

To address these gaps, we propose a comprehensive modeling framework for urban adversarial target assignment. Our solution leverages RL foundations. Specifically, it integrates historical interaction trajectories into the learning loop during decision-making. Moreover, it balances adaptability to new strategies with stability of historical policies across diverse scenarios. Experiments verify its ability to generate high-quality solutions efficiently, under real-time constraints, outperforming benchmarks in adaptability metrics.

3. Preliminaries

3.1. MDP

RL achieves decision optimization through interactive learning between the agent and the environment [34]. Within the standard framework of the MDP, RL can be formulated as $\mathcal{M} = \langle S, A, P, R, \gamma \rangle$. Here, $S$ denotes the state space, $A$ denotes the action space, $P(s_{t+1} \mid s_t, a_t)$ stands for the transition dynamics, $r = R(s,a)$ denotes the reward function, and $\gamma \in [0,1]$ represents the reward discount factor. For any state $s \in S$ and action $a \in A$, the value of action $a$ under state $s$ is given by the action-value function $Q^{\pi}(s,a) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \mid s_0 = s, a_0 = a\big]$. The objective of an RL agent is to learn an optimal policy $\pi$ that maximizes the expected discounted sum of rewards, formulated as $\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\big]$.

3.2. PPO Algorithm

PPO [35] demonstrates excellent stability and efficiency when handling high-dimensional and complex problems. Assume that the parameters of the actor and the critic in PPO are represented as $\theta$ and $\psi$, respectively. $A_t = \sum_{t' > t} \gamma^{t' - t} r_{t'} - V_{\psi}(s_t)$ denotes the approximated advantage function, in which $V_{\psi}(s_t)$ is the state value function. The clipped surrogate objective of the actor is presented as follows:

(1) $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(p_t(\theta) A_t,\ \mathrm{clip}(p_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\big]$

where $\hat{\mathbb{E}}_t[\cdot]$ represents the empirical average over a finite batch of samples, $p_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio, $\theta_{old}$ denotes the parameters of the actor before the update, and $\epsilon$ is a small hyperparameter which is typically set to 0.2 in our implementation. Moreover, the objective of the critic is represented as follows:

(2) $L(\psi) = \hat{\mathbb{E}}_t\Big[\big(\textstyle\sum_{t' > t} \gamma^{t' - t} r_{t'} - V_{\psi}(s_t)\big)^2\Big]$
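As a concrete reference, the following is a minimal PyTorch sketch of the two objectives in Equations (1) and (2); the tensor names and the batch-mean reduction are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch

def ppo_losses(log_probs_new, log_probs_old, advantages, values, returns, eps=0.2):
    """Clipped surrogate actor loss (Eq. 1) and squared-error critic loss (Eq. 2)."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # probability ratio p_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()          # negative sign: maximize L^CLIP
    critic_loss = (returns - values).pow(2).mean()              # squared value error
    return actor_loss, critic_loss
```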

3.3. Plasticity Loss

Plasticity refers to the ability of a neural network to modify its connection strengths in response to new information. Plasticity loss occurs when this adaptability decreases during learning [36]. Several methods have been proposed to address plasticity loss in deep neural networks, such as resetting the final layer [37], plasticity injection [38], and shrink + perturb [39]. Among these methods, the most widely used is plasticity injection. It replaces the final layer of a network with a new function: the sum of the original final layer's output and the output of a newly initialized layer, minus the output of a frozen copy of that newly initialized layer. Gradients are blocked in both the original layer and the subtracted frozen copy. During subsequent interventions, the previous parameter values are merged into a single set of weights and biases:

(3) $w_{new} = w_{old} + w_{new}^{a} - w_{new}^{b}$

However, update mechanism limitations restrict feature reuse and new information integration across tasks, which reduces learning efficiency. Moreover, impaired gradient flow prolongs training duration, increases optimization difficulty, and may cause convergence failure.
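For illustration, a minimal PyTorch sketch of the plasticity-injection mechanism described above is given below; the module and variable names are hypothetical, and the head is assumed to be a single linear layer.

```python
import copy
import torch.nn as nn

class PlasticityInjectedHead(nn.Module):
    """Final-layer plasticity injection: y = h_old(x) + h_new(x) - h_new_frozen(x).

    At injection time h_new and h_new_frozen are identical, so the output is
    unchanged; afterwards only h_new continues to receive gradients.
    """
    def __init__(self, old_head: nn.Linear):
        super().__init__()
        self.old_head = old_head
        self.new_head = nn.Linear(old_head.in_features, old_head.out_features)
        self.new_head_frozen = copy.deepcopy(self.new_head)
        for p in self.old_head.parameters():         # block gradients in the original layer
            p.requires_grad_(False)
        for p in self.new_head_frozen.parameters():  # block gradients in the subtracted copy
            p.requires_grad_(False)

    def forward(self, x):
        return self.old_head(x) + self.new_head(x) - self.new_head_frozen(x)
```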

4. Problem Formulation

We focus on modeling and solving the target assignment problem in an urban environment. Specifically, the proposed approach proceeds through the following stages: First, to understand the operational area, we analyze a specific adversarial zone. This analysis extracts essential features of the urban road network and identifies critical path points. Second, leveraging real-time battlefield awareness, we gather positions, types, and attributes of our units and enemy targets along with the critical points. Using this comprehensive situational data, we then identify multiple traversable paths between our units and enemy targets. For each identified path, we compute critical performance factors: total distance, congestion level, threat level, and so on. Third, to optimize the assignment and movement, we construct constraint models and design a specialized reward function. These components work together to jointly optimize the grouping strategies and maneuvering strategies. Finally, based on this optimization, the method generates an effective target assignment plan. The overall workflow is summarized in Figure 1.

4.1. Image Acquisition and Analysis with Critical Path Point Extraction

This study uses Shapefile (SHP) data to build a geospatial database. SHP is an industry-standard format. It provides robust vector data management capabilities. SHP supports points, lines, and polygons. It also stores rich attribute data. This capability enables precise representation of critical urban elements. These elements include building footprints, road networks, and target-specific locations [40]. Furthermore, standardized SHP datasets are widely available. They are accessible through open source platforms like OpenStreetMap [41]. This accessibility facilitates the rapid construction of urban adversarial battlefield environment databases. It also significantly lowers data acquisition barriers.

Core decision factors in target assignment depend fundamentally on horizontal spatial relationships. These factors include our units' and enemy targets' positions, fire coverage ranges, line-of-sight areas, and maneuvering paths [42,43,44]. Consequently, this study employs a two-dimensional top-down perspective as the primary analytical framework. It effectively avoids the visual distractions and computational complexity of three-dimensional space, streamlining target assignment in an urban environment. The original SHP image data of the urban environment come from UAV aerial photography, so the proposed method can also be applied directly to real-world geospatial data. First, we render building polygons from the SHP data as a grayscale image, with building interiors shown in black (0) and navigable areas in white (255). We then apply adaptive Gaussian thresholding to convert this image into a binary image:

(4) $T(x,y) = \frac{1}{k^{2}} \sum_{i=-h}^{h} \sum_{j=-h}^{h} I(x+i,\, y+j)\, G(i,j) - C$

where T(x,y) denotes the local threshold, I(x+i,y+j) denotes the original image’s grayscale value, G(i,j) denotes the Gaussian kernel weight, k denotes the neighborhood window size, and C denotes the threshold offset. The converted image is shown in Figure 2.
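As an illustration, the binarization step of Equation (4) can be realized with OpenCV's adaptive Gaussian thresholding; the file names, window size, and offset below are illustrative assumptions rather than the settings used in this paper.

```python
import cv2

# Load the rendered grayscale SHP image (building interiors = 0, free space = 255).
gray = cv2.imread("urban_shp_render.png", cv2.IMREAD_GRAYSCALE)

# Gaussian-weighted local mean threshold: neighborhood window 11 px, offset C = 2.
binary = cv2.adaptiveThreshold(gray, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY,
                               11, 2)

cv2.imwrite("urban_binary.png", binary)
```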

After binarization, we extract polygonal contours using chain code and bounding box techniques. These contours capture urban road networks and building structures, as shown in Figure 3. We then apply geometric simplification with the Douglas–Peucker algorithm. The tolerance parameter ϵ=1 pixel reduces computational complexity while preserving critical spatial features. The algorithm iteratively eliminates non-essential vertices. For each baseline segment, it evaluates the maximum perpendicular distance D from intermediate points to baseline segments:

(5) $D = \frac{\left| (y_2 - y_1)\, x_0 - (x_2 - x_1)\, y_0 + x_2 y_1 - y_2 x_1 \right|}{\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}}$

Then, the vertices of the simplified polygon are extracted and saved as critical path points.
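A minimal OpenCV sketch of the contour extraction and Douglas–Peucker simplification described above is shown below; inverting the image so that buildings appear white, and the variable names, are illustrative choices.

```python
import cv2

binary = cv2.imread("urban_binary.png", cv2.IMREAD_GRAYSCALE)

# findContours traces white objects, so invert: buildings (0) become white blobs.
contours, _ = cv2.findContours(255 - binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

bounding_boxes = []
critical_points = []
for contour in contours:
    bounding_boxes.append(cv2.boundingRect(contour))          # (x, y, w, h) of each building
    simplified = cv2.approxPolyDP(contour, 1.0, True)         # Douglas-Peucker, tolerance = 1 px
    critical_points.extend(pt[0].tolist() for pt in simplified)  # polygon vertices as [x, y]
```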

4.2. Connectivity Analysis and Traversable Path Computation

After extracting path points, we must determine their mutual connectivity. For any two points, we compute the differences in their horizontal and vertical coordinates Δx,Δy. We take the maximum of these values as the number of sampling points n between them. A uniform set of intermediate points is then sampled along the connecting line segment. Each sampled coordinate is rounded to the nearest integer. We then check all sampled points: if any point corresponds to a black pixel in the binary image, the two points are disconnected. Black pixels indicate obstacles or non-traversable areas. Otherwise, the points are connected.
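A minimal sketch of this sampling-based connectivity test is given below, assuming the binary image is a NumPy array with obstacles marked as 0; the function and variable names are illustrative.

```python
import numpy as np

def connected(p1, p2, binary):
    """Return True if the straight segment p1-p2 crosses no black (obstacle) pixel."""
    (x1, y1), (x2, y2) = p1, p2
    n = max(abs(x2 - x1), abs(y2 - y1))          # number of sampling points
    if n == 0:
        return True
    for t in np.linspace(0.0, 1.0, n + 1):
        x = int(round(x1 + t * (x2 - x1)))       # rounded sample coordinates
        y = int(round(y1 + t * (y2 - y1)))
        if binary[y, x] == 0:                    # row/column indexing; black pixel -> blocked
            return False
    return True
```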

After determining connectivity among all points, we employ the A* algorithm to generate traversable paths between our units and enemy units. As an efficient heuristic search method, A* guides its search using a cost function $f(n)$ that combines the actual and estimated costs:

(6) $f(n) = g(n) + h(n)$

where $g(n)$ represents the actual cost from the start node to the current node $n$, and $h(n)$ is the estimated cost from node $n$ to the goal. This approach significantly reduces unnecessary node expansions, enabling the rapid generation of multiple traversable paths within complex urban road networks.
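For reference, a minimal Python sketch of A* over the critical-point connectivity graph is given below, assuming a Euclidean-distance heuristic and adjacency-list inputs; the data structures are illustrative.

```python
import heapq
import math

def a_star(graph, coords, start, goal):
    """A* over the critical-point graph: graph[node] lists connected neighbors,
    coords[node] gives pixel coordinates, h(n) is Euclidean distance to the goal."""
    def h(n):
        return math.dist(coords[n], coords[goal])

    open_set = [(h(start), start)]     # priority queue ordered by f(n) = g(n) + h(n)
    g = {start: 0.0}
    parent = {}
    while open_set:
        _, node = heapq.heappop(open_set)
        if node == goal:               # reconstruct the path back to the start
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        for nbr in graph[node]:
            g_new = g[node] + math.dist(coords[node], coords[nbr])
            if g_new < g.get(nbr, float("inf")):
                g[nbr] = g_new
                parent[nbr] = node
                heapq.heappush(open_set, (g_new + h(nbr), nbr))
    return None                        # goal unreachable
```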

4.3. Constraint Model Design

We ensure the constraint model design aligns closely with urban combat characteristics and decision logic. To capture inherent environmental complexities, this study establishes four core constraint models: equipment mobility constraints, battlefield threat constraints, unit grouping constraints, and dynamic environmental constraints [45,46,47,48]. Table 2 summarizes key factors considered in each model.

Before building the constraint model, we must compute key path metrics. These metrics derive from real-time battlefield awareness in urban adversarial scenarios. The awareness integrates multi-source situational data. We design these indicators based on publicly available equipment parameters and rules.

The offensive firepower value $Z_i$ of our unit $i$ is closely related to its unit type, and specifically satisfies the following condition:

(7) $Z_i = \begin{cases} zh_{atk} & \text{if } zh_i = 1,\ hq_i = 0,\ zj_i = 0,\ bz_i = 0 \\ hq_{atk} & \text{if } zh_i = 0,\ hq_i = 1,\ zj_i = 0,\ bz_i = 0 \\ zj_{atk} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 1,\ bz_i = 0 \\ bz_{atk} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 0,\ bz_i = 1 \end{cases}$

where $zh_{atk} \in \mathbb{R}^{+}$ denotes the offensive firepower of command units, $hq_{atk} \in \mathbb{R}^{+}$ denotes the offensive firepower of logistics units, $zj_{atk} \in \mathbb{R}^{+}$ denotes the offensive firepower of armored units, and $bz_{atk} \in \mathbb{R}^{+}$ denotes the offensive firepower of support units. $Z_j^{*} \in \mathbb{R}^{+}$ denotes the offensive firepower of enemy target $j$.

The defensive capability value $Df_i$ of our unit $i$ is also closely related to its unit type and specifically satisfies the following condition:

(8) $Df_i = \begin{cases} zh_{def} & \text{if } zh_i = 1,\ hq_i = 0,\ zj_i = 0,\ bz_i = 0 \\ hq_{def} & \text{if } zh_i = 0,\ hq_i = 1,\ zj_i = 0,\ bz_i = 0 \\ zj_{def} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 1,\ bz_i = 0 \\ bz_{def} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 0,\ bz_i = 1 \end{cases}$

where $zh_{def} \in \mathbb{R}^{+}$ denotes the defensive capability of command units, $hq_{def} \in \mathbb{R}^{+}$ denotes the defensive capability of logistics units, $zj_{def} \in \mathbb{R}^{+}$ denotes the defensive capability of armored units, and $bz_{def} \in \mathbb{R}^{+}$ denotes the defensive capability of support units.

The priority level value $Im_i$ of our unit $i$ specifically satisfies the following condition:

(9) $Im_i = \begin{cases} zh_{im}^{day} & \text{if } zh_i = 1,\ hq_i = 0,\ zj_i = 0,\ bz_i = 0,\ day = 1 \\ zh_{im}^{night} & \text{if } zh_i = 1,\ hq_i = 0,\ zj_i = 0,\ bz_i = 0,\ day = 0 \\ hq_{im} & \text{if } zh_i = 0,\ hq_i = 1,\ zj_i = 0,\ bz_i = 0 \\ zj_{im} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 1,\ bz_i = 0 \\ bz_{im} & \text{if } zh_i = 0,\ hq_i = 0,\ zj_i = 0,\ bz_i = 1 \end{cases}$

The binary variable $day \in \{0,1\}$ indicates day–night conditions. $zh_{im}^{day} \in \mathbb{R}^{+}$ denotes the priority level of command units during the daytime and $zh_{im}^{night} \in \mathbb{R}^{+}$ at night, $hq_{im} \in \mathbb{R}^{+}$ denotes the priority level of logistics units, $zj_{im} \in \mathbb{R}^{+}$ denotes the priority level of armored units, and $bz_{im} \in \mathbb{R}^{+}$ denotes the priority level of support units.

4.3.1. Equipment Mobility Constraints

The equipment mobility constraint model directly responds to the physical limitations imposed by urban topography on unit deployment. Path traversability determines feasible deployment areas for different equipment types. Road congestion couples with equipment mobility speed, which further impacts task response timeliness. Additionally, the logistics supply radius acts as a hard constraint during task execution.

In practice, one of our units cannot engage multiple enemy targets simultaneously; it must select one path from the traversable paths to initiate an attack. For $i \in \{0, 1, \ldots, N\}$, $x_{i,j,k}$ satisfies the following condition:

(10) $\sum_{k=0}^{L} \sum_{j=0}^{K} x_{i,j,k} \le 1$

Mobility efficiency requires the following key considerations: unit grouping configurations, selected path characteristics, and support unit availability. For $i \in \{0, 1, \ldots, N\}$ and $k \in \{0, 1, \ldots, L\}$, the travel time $t_j \in \mathbb{R}^{+}$ for one of our groups moving toward enemy target $j$ must satisfy the following condition:

(11) $t_j = \max_{i}\big( x_{i,j,k}\, c_{i,j,k} \big) - bz_{num}(j)\, bz_{gain}$

$bz_{num}(j)$ denotes the number of support units attacking enemy target $j$. A greater number of support units leads to higher mobility efficiency. $bz_{gain} \in \mathbb{R}^{+}$ represents the mobility efficiency gain per support unit.

4.3.2. Battlefield Threat Constraints

The root cause of the battlefield threat constraint model stems from the spatial non-uniformity of threats in an urban environment. The perceived distance from enemies must be combined with path exposure levels to determine threat values. Furthermore, equipment importance and its integrated protection capabilities significantly determine survival likelihood. Therefore, we design this constraint based on historical experience rules and probability.

Consequently, battlefield threat assessment requires the following key considerations: unit grouping configuration, priority level, composite defense capability, path traversability, line-of-sight coverage along the path, and cumulative path risk. For $i \in \{0, 1, \ldots, N\}$ and $k \in \{0, 1, \ldots, L\}$, the battlefield threat $M(j) \in \mathbb{R}^{+}$ for our unit $i$ advancing toward enemy target $j$ must satisfy the following condition:

(12) $M(j) = \sum_{k=0}^{L} \sum_{i=0}^{N} Im(i)\, x(i,j,k)\, P(i,j)\, D(i,j,k)\, W(i,j,k)\, /\, Df(i)$

P(i,j) indicates path viability for our unit i to attack enemy target j. D(i,j,k) is the maximum exposure distance for our unit i attacking enemy target j via path k. W(i,j,k) is the combined danger level of the path.

4.3.3. Adversarial Grouping Constraints

The adversarial grouping constraint model arises from the dynamic balance requirement in firepower confrontation between our units and enemy targets. When enemy firepower exceeds a threshold, it must be suppressed by our own units. This firepower gap then triggers formation size adjustments. Ultimately, these adjustments result in exponential firepower growth that reflects grouping efficiency.

Based on the above factors, the combined firepower of our units in a group must be comprehensively considered with respect to the grouping configuration and offensive capabilities of the units. To constrain the combined firepower within a reasonable range, the following constraint is established:

(13) $z_{min} Z_j^{*} \le \sum_{k=0}^{L} \sum_{i=0}^{N} x(i,j,k)\, Z_i \le z_{max} Z_j^{*}$

$z_{min} \in \mathbb{R}^{+}$ denotes the minimum ratio of the combined firepower of our units to the firepower of enemy target $j$, and $z_{max} \in \mathbb{R}^{+}$ represents the maximum allowable ratio between them.

In addition, the remaining firepower value $s_{force} \in \mathbb{R}^{+}$ of all our groups must satisfy the following condition:

(14) $s_{force} = \sum_{k=0}^{L} \sum_{j=0}^{K} \sum_{i=0}^{N} \Big[ x(i,j,k)\, Z_i + hq_{gain}\, hq_{num}(j)\, x(i,j,k)\, Z_i + zh_{gain}\, gain(j)\, zh_{num}(j)\, x(i,j,k)\, Z_i - Z_j^{*} \Big]$

$hq_{num}(j)$ denotes the number of logistics units among all our units that are attacking enemy target $j$; $hq_{gain} \in \mathbb{R}^{+}$ represents the gain in offensive firepower contributed by each logistics unit. $zh_{num}(j)$ denotes the number of command units attacking enemy target $j$, and $zh_{gain} \in \mathbb{R}^{+}$ represents the gain in offensive firepower contributed by each command unit. The binary variable $gain(j)$ indicates whether our units attacking enemy target $j$ can maintain communication with the command unit.

4.3.4. Dynamic Environmental Constraints

In real-world adversarial operations, it is essential to account for dynamically changing external disturbances. To this end, a dynamic environmental constraint model is constructed, which specifically incorporates four key factors: day–night cycle effect, communication node dynamics, real-time path update, and electromagnetic interference. The following subsections present the mathematical formulations and physical interpretations of each sub-model.

The day–night cycle directly affects the mobility efficiency and tactical choices of our units. Following the binary classification principle for adversarial effectiveness, we define the day–night constraint as follows:

(15) $DNI(t) = \begin{cases} 1, & t = \text{day} \\ 0.8, & t = \text{night} \end{cases}$

Communication node dynamics refers to the destruction of the communication infrastructure during operation. This destruction causes communication node failures and subsequent disconnection of communication links. The probability of communication loss is positively correlated with path length. This correlation simulates the cumulative effect of disturbances throughout the engagement. The following constraint is satisfied:

(16) $web(i,j,k) = \frac{dis(i,j,k)}{100 + L_{max}}$

Real-time path update refers to the need to select a new route when the originally chosen path becomes impassable. This occurs due to destruction during operation. The following constraint must be satisfied:

(17) $P_{des}(i,j,k) = 1 - e^{-W(i,j,k)}$

Electromagnetic interference refers to active electronic jamming by enemies against our units during combat. Examples include communication disruption, GPS spoofing, and radar suppression. We introduce a critical time threshold $t_{th}$ to divide the mobility process into two phases: a linear interference accumulation phase ($t \le t_{th}$) and an exponential degradation phase ($t > t_{th}$). Coefficients $k_1$ and $k_2 e^{c(i,j,k) - t_{th}}$ dynamically characterize these phases. Coefficient $k_1$ represents short-term adaptive interference coupling effects, and $k_2 e^{c(i,j,k) - t_{th}}$ represents long-term systemic failure coupling effects. The following condition is satisfied:

(18) $Ele(i,j,k) = \begin{cases} k_1 \cdot c(i,j,k), & c(i,j,k) \le t_{th} \\ k_2 \cdot e^{c(i,j,k) - t_{th}} + k_1 \cdot t_{th}, & c(i,j,k) > t_{th} \end{cases}$

Considering all the above factors, the dynamic environmental impact $De(j) \in \mathbb{R}^{+}$ on our units moving toward enemy unit $j$ must satisfy the following constraint:

(19) $De(j) = \sum_{k=0}^{L} \sum_{i=0}^{N} \frac{1}{DNI(t)} \cdot \Big( \frac{1}{1 + web(dis(i,j,k))} + P_{des}(i,j,k) + Ele(i,j,k) \Big)$

These four constraint models simultaneously serve as the decision-making optimization objectives in this paper.

4.4. MDP Model Description

Incorporating the constraint models designed in this paper, the components of the MDP are formulated as follows:

4.4.1. State Space

In the target assignment environment, the state space Stot is a two-dimensional matrix. It contains the states of all our units. The first dimension’s length depends on four elements: enemy unit ID, selected path ID, our unit ID in one-hot encoding, and our unit coordinates. The second dimension’s length equals the number of our units. Its expression is as follows:

(20) $S_{tot} = [s_1, s_2, \ldots, s_n]^{T}$

4.4.2. Action Space

An action branch is a tuple of length 2. It contains the enemy unit ID selected by our unit and the corresponding path ID. The full action space comprises all possible combinations of enemy unit IDs and path IDs. Its design format is multi-discrete:

(21) $A = [a_1, a_2, \ldots, a_n]$
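As an illustration, the state and action spaces of Equations (20) and (21) could be declared with gym spaces as follows; the scenario sizes and the per-unit feature length are taken from Scenario 1 and are illustrative assumptions.

```python
import numpy as np
from gym import spaces

N, K, L = 7, 4, 20                        # our units, enemy targets, paths per unit-target pair
obs_dim = 2 + N + 2                       # target ID, path ID, N-dim one-hot unit ID, (x, y)

# State: one row per unit, as in Eq. (20).
observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(N, obs_dim), dtype=np.float32)

# Action: (enemy target ID, path ID) for each of our units, as in Eq. (21).
action_space = spaces.MultiDiscrete([K, L] * N)
```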

4.4.3. Reward Function

The reward function achieves four objectives. First, it improves unit mobility efficiency. Second, it shortens maneuvering time. Third, it enhances firepower utilization. Fourth, it reduces battlefield threats to units. It also mitigates adverse effects from dynamic environmental changes.

Based on the constraint models described above, the reward function is defined as follows:

(22) $Reward = R_t + R_M + R_S + R_{De}$

(23) $R_t = \beta_1 \sum_{j=0}^{K} t(j)$

(24) $R_M = \beta_2 \sum_{j=0}^{K} M(j)$

(25) $R_S = \beta_3\, s_{force}$

(26) $R_{De} = \beta_4 \sum_{j=0}^{K} De(j)$

where β1, β2, β3, and β4 denote the weights corresponding to the rewards of the four modules. Different weight configurations reflect preferences for different operational requirements.

5. Algorithm

Urban adversarial target assignment must satisfy the complex constraints described earlier. Cross-dataset training introduces additional risks like parameter rigidity and policy oscillation. These significantly degrade traditional PPO algorithm performance. To solve these problems, we propose the GSPPO algorithm. Its architecture contains three core modules: an EHR module, an actor-critic joint network, and an NPE module. The EHR module constructs memory units across time steps. This encodes the characteristics of historical interaction trajectories. The actor-critic network uses EHR module outputs to generate target assignment policies and value estimations. The NPE mechanism applies gradient-based parameter shrinking and perturbation. This imposes continuous constraints on network weights. These modules collectively enhance algorithm generalization.

5.1. Design of EHR Module

Crucially, traversable path training datasets come from diverse initial state distributions. Yet, all data originate from the same adversarial environment. This environment consistency embeds spatial correlations. These correlations enable transferable implicit information patterns across datasets. This significantly enhances model generalization.

Urban adversarial target assignment exhibits strong temporal dependencies. Each unit–target matching decision directly impacts subsequent assignment outcomes. Using only single-time decision results in updates may trap policies in local optima. Therefore, incorporating historical environment interaction trajectories into every learning process enhances decision quality and efficiency. This sequential decision-making requires long-term temporal modeling capability. To address this, we integrate an EHR module into the PPO framework. The module constructs multi-timestep memory units that capture explicit temporal features from input sequences and extract implicit topological relationships in environmental data. This dual capability significantly enhances complex environment decision-making. The module input expands into temporal sequences by incorporating historical observations, which is formalized as follows:

(27) $y = [obs_{t-m}, \ldots, obs_{t-2}, obs_{t-1}, obs_t]$

The module extracts valuable information from historical observation sequences. It feeds this information to subsequent layers, enhancing policy training. EHR integration enhances correlated information extraction from sequential data. This improves the policy network’s representation capability. The algorithm structure we designed is shown in Figure 4.

During training, EHR’s hidden state evolves with historical sequences. This captures time-varying hidden information. In contrast, traditional MLP-based policies lack recurrent memory. They optimize decisions using only current observations. Consequently, they underperform in stochastic tasks compared to recurrent policies. This demonstrates EHR’s superiority in sequential decision-making.

We summarize the EHR module in Algorithm 1.

Algorithm 1: EHR module of GSPPO algorithm

1: Initialize EHR states $h_0, c_0$
2: for $t = 1, 2, \ldots, T_{max}$ do
3:   for $i = 1, 2, \ldots, N_{env}$ do
4:     Store $(o_t, h_t, c_t)$ into the replay buffer
5:     $a_t, v_t, h_{t+1}, c_{t+1} = \mathrm{EHR}(o_t, h_t, c_t)$
6:   end for
7: end for
8: Compute advantages with EHR's final state
9: Train on sequences maintaining EHR state continuity
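For illustration, a minimal PyTorch sketch of an EHR-style recurrent actor-critic is given below; it is a simplified stand-in for the network in Figure 4, with illustrative layer sizes and a single discrete action head in place of the multi-discrete heads.

```python
import torch
import torch.nn as nn

class EHRActorCritic(nn.Module):
    """LSTM memory over the observation sequence y = [obs_{t-m}, ..., obs_t]
    feeding shared features to the policy and value heads."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.memory = nn.LSTM(hidden, hidden, batch_first=True)   # multi-timestep memory unit
        self.policy_head = nn.Linear(hidden, n_actions)           # actor logits
        self.value_head = nn.Linear(hidden, 1)                    # critic value

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim); state: previous (h, c) hidden state
        feat, state = self.memory(torch.relu(self.encoder(obs_seq)), state)
        last = feat[:, -1]                                         # feature at the current step t
        return self.policy_head(last), self.value_head(last), state
```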

5.2. Design of NPE Module

Urban adversarial target assignment faces dual challenges during cross-dataset training: network rigidity and policy instability. First, model parameters gradually converge during training. Feature discrepancies across heterogeneous datasets meanwhile hinder network adaptation. Second, policy-value function divergence induces policy oscillation during frequent task switching. Therefore, the model must enhance its adaptability to new datasets while maintaining performance to represent old datasets.

To overcome these limitations, we propose an NPE module. NPE activates after each gradient update. This enables dynamic weight adjustment without periodic hyperparameter tuning. For the neurons in the network, NPE breaks the direct inheritance of prior-round parameters. Instead, it combines parameter shrinking with perturbed initialization. This alters gradient dynamics and update paths for new data.

The shrink operation reduces old-data reliance while amplifying new-data gradients. This balances gradient amplitudes across datasets. Simultaneously, perturbation disrupts old-data memory patterns to enhance new-data adaptability. These operations collectively enhance the plasticity of the network, thereby greatly improving its generalization ability across datasets.

For a parameter set x, the updated parameter set xnew is computed according to the following formula:

(28) $x_{new} = \alpha x_{cur} + \beta x_{init}$

where $x_{init}$ is sampled from the initial parameter distribution, and $\alpha$ and $\beta$ are scaling coefficients that satisfy $\alpha = 1 - \beta$. This module directly regulates gradient balance during new-data training. It balances gradient contributions between old and new data, which prevents generalization degradation caused by distribution shifts. The mathematical representation is as follows:

(29) $\nabla_{x_{new}} L_{new}(x_{new}) \approx \nabla_{x_{new}} L_{old}(x_{new})$

where Lnew(xnew) denotes the loss on new datasets under parameter set xnew, and Lold(xnew) denotes the loss on old datasets under xnew.

We summarize the NPE module in Algorithm 2.

Algorithm 2: NPE module of GSPPO algorithm

1: Initialize NPE module parameters $\theta$ with $\theta_0$
2: for epochs $1$ to $N$ do
3:   Collect trajectories using $\pi_{\theta}$
4:   Update $\theta$ via GSPPO gradient step
5:   for each parameter group $g \in T$ do   // $T$ = encoder, policy, value
6:     $\theta_{new} = \alpha \theta_{cur} + \beta \theta_{init}$
7:   end for
8: end for
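A minimal PyTorch sketch of the NPE shrink-and-perturb step of Equation (28), applied as in line 6 of Algorithm 2, is shown below; the function, its arguments, and the value of alpha are illustrative assumptions.

```python
import torch

@torch.no_grad()
def npe_reset(module, init_fn, alpha=0.8):
    """Shrink current weights and add a freshly initialized perturbation,
    with beta = 1 - alpha; init_fn builds a newly initialized copy of `module`."""
    beta = 1.0 - alpha
    fresh = init_fn()                                  # x_init drawn from the initial distribution
    for p_cur, p_init in zip(module.parameters(), fresh.parameters()):
        p_cur.mul_(alpha).add_(beta * p_init)          # x_new = alpha * x_cur + beta * x_init
```

In Algorithm 2 this step would be invoked once per epoch for each of the encoder, policy, and value parameter groups.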

For large-scale exploration tasks, we use a multi-environment parallel framework. This coordinates shared network weight updates across multiple environments. The policy network runs synchronously in randomly initialized environments. It generates diverse trajectories. These form vast sample sets under the current policy. Stochastic gradient descent then optimizes this dataset iteratively. This reduces parameter update variance and accelerates convergence. It also improves convergence stability.
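For illustration, such a parallel rollout can be set up with gym's vectorized environments (gym 0.26) as sketched below; CartPole-v1 merely stands in for the target assignment environment of Section 4.4, which would be substituted in practice.

```python
import gym

N_env = 8
# Synchronously stepped workers; each call advances all environments at once.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(N_env)])

obs, info = envs.reset(seed=0)                       # randomly seeded initial situations
actions = envs.action_space.sample()                 # one action per worker
obs, rewards, terminated, truncated, info = envs.step(actions)
envs.close()
```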

We summarize the proposed method in Algorithm 3.

Algorithm 3: GSPPO algorithm for Urban Adversarial Target Assignment
Input: observation sequence $y$
Output: target assignment plan

1: Initialize policy networks $\pi_{\theta_{old}}$, $\pi_{\theta}$ and value function $V_{\psi}(s_t)$
2: for $n = 1, 2, \ldots$, maximum training episode $N_{max}$ do
3:   Initialize $N_{env}$ different workers randomly
4:   Calculate the learning rate $lr$ and entropy bonus coefficient $\beta$
5:   for $t = 1, 2, \ldots, T_{max}$ do
6:     for $i = 1, 2, \ldots, N_{env}$ do
7:       Run policy $\pi_{\theta_{old}}$, collect $\{o_i^t, r_i^t, a_i^t\}$
8:       Store $(S_t, a_t, r_t, S_{t+1})$ into the replay buffer
9:     end for
10:    Collect set of partial trajectories $D_t$ from different scenarios
11:  end for
12:  Divide $D_t$ into sequences of length $n_{len}$
13:  Calculate advantage estimates $A_t$
14:  for $k = 1, 2, \ldots$, maximum GSPPO epochs $k_{max}$ do
15:    Calculate loss function $L(\theta) = L^{CLIP}(\theta) + \alpha L^{VF}(\theta) + \beta L^{ENT}(\theta)$
16:    Update weights via backpropagation
17:  end for
18:  Adjust network parameters using NPE
19: end for

6. Experiments and Analysis

This section presents the experiments conducted on the proposed work from four distinct perspectives. All experiments were conducted in Python 3.8 using PyTorch 1.8.0 and gym 0.26.2 within PyCharm 2023.1.2, running on hardware with an NVIDIA Quadro 5000 GPU (16 GB) and Intel i9-9960X CPU. Firstly, we compared the convergence curves of various RL algorithms on this task. Secondly, we verified the effectiveness of each designed module through ablation experiments. Subsequently, we evaluated the performance of our method against RL algorithms and GA. Finally, we analyzed reward weight impacts on policy generalization. These experiments collectively validate our approach comprehensively.

6.1. Comparative Experiment

In the aforementioned urban environment scenarios, traversable paths from diverse initial states form the training dataset. The proposed GSPPO method was applied for training. We defined three adversarial scenarios. Scenario 1: 7 vs. 4 forces with 20 traversable paths per unit–target pair. Scenario 2: 10 vs. 6 forces with 20 paths per pair. Scenario 3: 7 vs. 4 forces with 30 paths per pair. These configurations test scalability and path density impacts.

We also compared the proposed method with several RL algorithms in the experiments: the classic value-based algorithm Rainbow DQN [49], the actor-critic based algorithm SAC [50], and the relatively recent algorithm CrossQ [51]. The resulting learning curves after training are shown in Figure 5, Figure 6 and Figure 7.

The experiments use the theoretical optimum of each reward module as the benchmark. The reward equals the negative deviation from this benchmark. Thus, the theoretical maximum total reward is 0. As shown in Figure 5, GSPPO’s average reward increases steadily during training. It then plateaus near zero. This indicates policy network convergence.

Rainbow DQN, SAC, and CrossQ show low learning efficiency in urban adversarial scenarios. Their reward curves fail to approach the theoretical optimum. In contrast, GSPPO integrates two key innovations: an EHR module and an NPE module. This integration improves data utilization and accelerates training convergence. It also mitigates performance disruptions from complex constraints and multi-dataset variations. Agents thus learn superior policies in challenging conditions.

Figure 6 shows that in scenario 2, GSPPO can still maintain high performance when facing scale expansion. Baseline algorithms exhibit significant degradation. This confirms GSPPO’s scalability for large-scale constrained target assignment.

Figure 7 displays the impact of increasing path density on the algorithm. From the results, it can be seen that though this expands the action space substantially, GSPPO maintains effective performance. This demonstrates generalization against combinatorial complexity.

6.2. Ablation Experiment

We introduced two improvements to the traditional PPO algorithm. To evaluate the impact of each component on overall performance, ablation experiments were conducted by removing one improvement at a time, and the algorithm’s performances were compared across the three aforementioned application scenarios, as shown in Figure 8, Figure 9 and Figure 10.

Ablation results demonstrate key insights. The full GSPPO (EHR + NPE) achieves optimal performance. Partial variants (w/o EHR or w/o NPE) still significantly outperform baseline PPO. Both variants yield higher rewards than PPO with comparable convergence speeds.

The EHR module captures temporal dependencies in urban target assignment. Its memory cells and gating mechanisms model long-term dynamics. This capability exceeds traditional PPO methods. EHR maintains decision consistency across time steps. It thereby improves cumulative rewards.

The NPE module mitigates network plasticity loss during multi-dataset training. It enhances the model’s representational capability and cross-task generalization. By injecting controlled noise into parameter updates, NPE improves generalization against input distribution shifts while preserving policy space continuity. EHR and NPE collaborate synergistically. Their integration enables strong generalization across diverse adversarial scenarios.

6.3. Policy Application Performance Comparison

We evaluated GSPPO in a 7 vs. 4 scenario with 20 traversable paths per unit–target pair. Performance metrics included path length, threat value, and total time. The results were compared with those of the RL algorithms CrossQ, Rainbow DQN, SAC, as well as an improved genetic algorithm, AGA [52].

Each method was evaluated using 100 Monte Carlo simulations, and the values of all metrics in the results were normalized, as shown in Table 3. The experimental results show that the target assignment solution obtained using the GSPPO algorithm achieves an average path length of 3.8073, a threat value of 2.3587, and a total time of 3.3784. Compared with the RL algorithms CrossQ, Rainbow DQN, SAC, and the genetic algorithm AGA, GSPPO can find better target assignment solutions.

We also compared the solving time and CPU utilization of the proposed algorithm with those of the genetic algorithm, as shown in Table 4 and Table 5. Each method was evaluated using 10 repeated simulations. The leftmost column indicates the number of traversable paths used in the planning process. Combining the results described above, it can be concluded that the trained model significantly reduces the planning time required and the CPU usage. Moreover, even when the initial situation changes, the proposed model can still be directly invoked to generate a better assignment solution without the need for replanning.

Furthermore, we found that the proposed method relies on complete UAV aerial imagery. This imagery enables accurate extraction of traversable paths and critical path points. These elements underpin the target assignment framework. However, real-world urban environments often present challenges. Incomplete geospatial data are common, such as missing road segments or occlusions from foliage and buildings. Noisy UAV imagery also occurs, for example, due to inconsistent lighting. These issues may reduce solution reliability. The method’s performance requires full aerial coverage. Incomplete images can lead to unsatisfactory path extraction, resulting in suboptimal target assignment schemes being generated.

6.4. Parameter Sensitivity Analysis

We analyzed reward weight impacts on four operational objectives: mobility efficiency, battlefield threat, remaining firepower, and dynamic environmental factors. Four weight configurations emphasize different aspects of the operational objectives. This reveals how reward shaping quantitatively influences target assignment outcomes.

When β1=0.7,β2=0.1,β3=0.1,β4=0.1, the generated grouping and maneuver strategy is shown in Figure 11. Our units 0, 1, 4, and 5 form a single group and advance along the planned route to engage enemy target 2. Our unit 2 forms a separate group and advances along the planned route to engage enemy target 1. Our unit 3 forms another individual group and follows its planned route to attack enemy target 3. Our unit 6 also operates independently and advances toward enemy target 0 along its designated path.

The results indicate that when β1=0.7,β2=0.1,β3=0.1,β4=0.1, our units prioritize mobility efficiency during the offensive, with less consideration given to battlefield threat, remaining firepower, and dynamic environmental factors. This configuration emphasizes eliminating enemy units in the shortest possible time.

When β1=0.1,β2=0.7,β3=0.1,β4=0.1, the generated grouping and maneuver strategy is shown in Figure 12. Our units 0, 3, 4, and 5 form a single group and advance along the planned route to engage enemy target 1. Our unit 1 operates independently and advances along the planned route to attack enemy target 2. Our unit 2 forms another individual group and follows its designated path to engage enemy target 3. Our unit 6 also acts alone, advancing toward enemy target 0 along its planned route.

The results indicate that when β1=0.1,β2=0.7,β3=0.1,β4=0.1, our units prioritize reducing the potential threat encountered during maneuvering, thereby minimizing losses to our assets during the offensive.

When β1=0.1,β2=0.1,β3=0.7,β4=0.1, the generated grouping and maneuver strategy is shown in Figure 13. Our unit 0 operates independently and advances along the planned route to engage enemy target 3. Our unit 1 forms another individual group and follows its planned route to attack enemy target 2. Our unit 3 acts alone and advances toward enemy target 0. Our unit 5 also forms a separate group and moves along its designated path to engage enemy target 1. Meanwhile, units 2, 4, and 6 are not assigned any target and remain in a reserve state.

The results indicate that when β1=0.1,β2=0.1,β3=0.7,β4=0.1, our units prioritize maintaining a larger reserve force during the offensive, aiming to maximize the utilization of available firepower resources.

When β1=0.1,β2=0.1,β3=0.1,β4=0.7, the generated grouping and maneuver strategy is shown in Figure 14. Our units 0, 3, and 5 form a group and advance along the planned route to engage enemy target 1. Our unit 1 operates independently and advances along the planned route to attack enemy target 3. Our units 2 and 6 form another group and follow their respective paths to engage enemy target 2. Our unit 4 acts alone and advances toward enemy target 0 along its designated route.

The results indicate that when β1=0.1,β2=0.1,β3=0.1,β4=0.7, our units prioritize strategies that are less susceptible to disruption. This configuration aims to minimize the negative impact of potential dynamic disturbances in the operational environment.

The reward weight sensitivity analysis reveals distinct behavioral shifts in target assignment strategies based on parameter configurations. Prioritizing mobility efficiency β1 minimizes engagement time but increases vulnerability to threats. Emphasizing threat reduction β2 enhances survivability at the cost of operational speed. Maximizing reserve firepower β3 conserves resources but may delay mission completion. Optimizing for environmental adaptability β4 improves disturbance resistance while requiring balanced trade-offs in other objectives. These results demonstrate that strategic preferences can be systematically encoded through weight adjustments, enabling mission-specific policy customization. The framework’s flexibility supports diverse operational requirements—from time-critical strikes to high-risk reconnaissance—by modulating the relative importance of core constraints within the reward function.

7. Conclusions

This paper establishes an integrated target assignment framework for urban environments without road networks, which is initiated with traversable path generation from binarized SHP data, progressing to MDP modeling solved by our novel GSPPO algorithm. GSPPO intrinsically integrates the EHR module and the NPE module. EHR incorporates historical exploration trajectories into policy learning, significantly improving decision efficiency. NPE maintains neural plasticity through dynamic parameter recalibration, enhancing cross-scenario generalization. This synergy enables simulation-based real-time planning under complex constraints. Furthermore, a multi-environment parallel training approach is adopted to improve data efficiency and accelerate the learning process. The framework is applicable to military operations and shows potential in humanitarian tasks like disaster zone search/rescue, high-risk urban evacuation, and resource logistics under geospatial uncertainty.

Experimental validation against Rainbow DQN, SAC, CrossQ, and AGA demonstrated GSPPO’s superior solution efficiency. Against the benchmark algorithm, GSPPO achieves 94.16% higher convergence performance, 33.54% shorter assignment paths, 51.96% lower threat exposure, and 40.71% faster total time. The impact of different reward weight designs on the target assignment outcomes is also evaluated, demonstrating the effectiveness of each constraint model’s corresponding reward shaping in the overall reward function.

We found that the approach exhibits a certain limitation: reliance on complete UAV imagery. Moreover, the deployment in tactical contexts raises ethical concerns, including accountability for autonomous decisions and risks of civilian harm. Future work will focus on further developing robust preprocessing methods for real-world noisy data, implementing real-world validation with UGVs and LiDAR sensors, integrating RL methods capable of adapting to varying operational scales, as well as minimizing ethical impacts.

Author Contributions

Conceptualization, X.D.; Software, X.D., Y.W., D.W. and K.F.; Validation, H.C.; Formal analysis, H.C.; Investigation, Q.L.; Data curation, Q.L.; Writing—original draft, X.D.; Writing—review & editing, H.C., L.L. and B.G.; Visualization, Y.W.; Supervision, J.H.; Project administration, J.H. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to acknowledge the assistance of DeepSeek-R1 in improving the English language of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Schematic diagram of the target assignment for an urban adversarial scenario.


Figure 2 Binarized SHP image.


Figure 3 Polygon contour extraction.


Figure 4 The GSPPO algorithm structure.


Figure 5 Performance curves for scenario 1 (5 random seeds).


Figure 6 Performance curves for scenario 2 (5 random seeds).


Figure 7 Performance curves for scenario 3 (5 random seeds).


Figure 8 Performance curves for scenario 1 (5 random seeds).


Figure 9 Performance curves for scenario 2 (5 random seeds).


Figure 10 Performance curves for scenario 3 (5 random seeds).


Figure 11 Option when β1=0.7,β2=0.1,β3=0.1,β4=0.1.


Figure 12 Option when β1=0.1,β2=0.7,β3=0.1,β4=0.1.


Figure 13 Option when β1=0.1,β2=0.1,β3=0.7,β4=0.1.


Figure 14 Option when β1=0.1,β2=0.1,β3=0.1,β4=0.7.


Table 1 Characteristics and limitations of various WTA algorithms.

Algorithms | Existing Approaches | Characteristics | Limitations
Exact Algorithms | Hungarian, branch-and-bound, Lagrangian relaxation, and so on | Optimal | Inefficient
Heuristic Algorithms | GA, MOPSO, ACO, HA-SMR, and so on | Practical | Suboptimal
RL algorithms | DQN, distributed PPO, and so on | Fast solution speed | Weak generalization

Table 2 Constraint models and related factors.

Equipment Mobility Constraints $t_j$ | Battlefield Threat Constraints $M(j)$ | Unit Grouping Constraints $s_{force}$ | Dynamic Environmental Constraints $De(j)$
equipment type | equipment criticality $Im_i$ | equipment type | day-night cycle effect $DNI(t)$
equipment mobility speed $t_j$ | path traversability $x_{i,j,k}$ | unit firepower $Z_i$ | real-time path update $P_{des}(i,j,k)$
path traversability $x_{i,j,k}$ | line-of-sight range $P(i,j)$ | enemy firepower $Z_j^*$ | communication node dynamics $web(i,j,k)$
road congestion level $c_{i,j,k}$ | path threat level $W(i,j,k)$ | firepower differential $Z_i - Z_j^*$ | electromagnetic interference effect $Ele(i,j,k)$
logistical support impact $bz_{gain}$ | composite protection computation $Df_i$ | grouping effectiveness $gain(j)$ |

Table 3 Generalization results of different algorithms.

Algorithm | Path Length | Threat Value | Total Time
GSPPO | 3.8073 | 2.3587 | 3.3784
CrossQ | 4.6881 | 3.6088 | 5.2842
Rainbow DQN | 5.1942 | 3.9406 | 5.3791
SAC | 6.2637 | 5.8793 | 6.0164
AGA | 3.7096 | 2.8541 | 3.6333

Table 4 Comparison of solving time.

Per-Pair Path Number | GSPPO (s) | AGA (s)
1 | 1 | 5
5 | 1 | 16
7 | 1 | 378
10 | 2 | 6254
20 | 6 | >>22,734

Table 5 Comparison of CPU utilization.

Per-Pair Path Number | GSPPO | AGA
1 | 21.2% | 30.1%
5 | 25.6% | 41.7%
7 | 25.3% | 54.0%
10 | 28.2% | 72.2%
20 | 28.8% | 86.5%

References

1. Li, X.J.; Zhang, D.D.; Yang, Y.; Zhang, H.X. Analysis on experienced lessons and core capabilities of urban operation. Prot. Eng.; 2020; 42, pp. 64-69.

2. Wei, N.; Liu, M.Y. Target allocation decision of incomplete information game based on Bayesian Nash equilibrium. J. Northwest. Polytech. Univ.; 2022; 40, pp. 755-763. [DOI: https://dx.doi.org/10.1051/jnwpu/20224040755]

3. Ou, Q.; He, X.Y.; Tao, J.Y. Overview of cooperative target assignment. J. Syst. Simul.; 2019; 31, pp. 2216-2227.

4. Li, K.P.; Liu, T.B.; Ram Kumar, P.N.; Han, X.F. A reinforcement learning-based hyper-heuristic for AGV task assignment and route planning in parts-to-picker warehouses. Transp. Res. Part E Logist. Transp. Rev.; 2024; 185, pp. 103518-103544. [DOI: https://dx.doi.org/10.1016/j.tre.2024.103518]

5. Ma, Y.; Wu, L.; Xu, X. Cooperative Targets Assignment Based on Multi-Agent Reinforcement Learning. Syst. Eng. Electron.; 2023; 45, pp. 2793-2801.

6. Moon, S.H. Weapon Effectiveness and the Shapes of Damage Functions. Def. Technol.; 2021; 17, pp. 617-632. [DOI: https://dx.doi.org/10.1016/j.dt.2020.04.009]

7. Cheng, Y.Z.; Zhang, P.C.; Cao, B.Q. Weapon Target Assignment Problem Solving Based on Hungarian Algorithm. Appl. Mech. Mater.; 2015; 3744, pp. 2041-2044. [DOI: https://dx.doi.org/10.4028/www.scientific.net/AMM.713-715.2041]

8. Andersen, A.C.; Pavlikov, K.; Toffolo, T.A.M. Weapon-target assignment problem: Exact and approximate solution algorithms. Ann. Oper. Res.; 2022; 312, pp. 581-606. [DOI: https://dx.doi.org/10.1007/s10479-022-04525-6]

9. Guo, W.K.; Vanhoucke, M.; Coelho, J. A Prediction Model for Ranking Branch-and-Bound Procedures for the Resource-Constrained Project Scheduling Problem. Eur. J. Oper. Res.; 2023; 306, pp. 579-595. [DOI: https://dx.doi.org/10.1016/j.ejor.2022.08.042]

10. Ni, M.F.; Yu, Z.K.; Ma, F.; Wu, X.R. A Lagrange Relaxation Method for Solving Weapon-Target Assignment Problem. Math. Probl. Eng.; 2011; 2011, 873292. [DOI: https://dx.doi.org/10.1155/2011/873292]

11. Su, J.; Yao, Y.; He, Y. Studying on Weapons-Targets Assignment Based on Genetic Algorithm. Proceedings of the 2nd International Symposium on Computer Science and Intelligent Control; Stockholm, Sweden, 21–23 September 2018; pp. 1-5.

12. Zhai, H.; Wang, W.; Li, Q.; Zhang, W. Weapon-Target Assignment Based on Improved PSO Algorithm. Proceedings of the 2021 33rd Chinese Control and Decision Conference (CCDC); Kunming, China, 22–24 May 2021; pp. 6320-6325.

13. Cao, M.; Fang, W. Swarm Intelligence Algorithms for Weapon-Target Assignment in a Multilayer Defense Scenario: A Comparative Study. Symmetry; 2020; 12, 824. [DOI: https://dx.doi.org/10.3390/sym12050824]

14. Huang, J.; Li, X.; Yang, Z.; Kong, W.; Zhao, Y.; Zhou, D. A Novel Elitism Co-Evolutionary Algorithm for Antagonistic Weapon-Target Assignment. IEEE Access; 2021; 9, pp. 139668-139684. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3119363]

15. Bengio, Y.; Lodi, A.; Prouvost, A. Machine Learning for Combinatorial Optimization: A Methodological Tour d’horizon. Eur. J. Oper. Res.; 2021; 290, pp. 405-421. [DOI: https://dx.doi.org/10.1016/j.ejor.2020.07.063]

16. Mouton, H.; Le Roux, H.; Roodt, J. Applying reinforcement learning to the weapon assignment problem in air defence. Sci. Mil. S. Afr. J. Mil. Stud.; 2011; 39, pp. 99-116. [DOI: https://dx.doi.org/10.5787/39-2-115]

17. Li, S.; Jia, Y.; Yang, F.; Qin, Q.; Gao, H.; Zhou, Y. Collaborative Decision-Making Method for Multi-UAV Based on Multiagent Reinforcement Learning. IEEE Access; 2022; 10, pp. 91385-91396. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3199070]

18. Chen, W.; Zhang, Z.; Tang, D.; Liu, C.; Gui, Y.; Nie, Q.; Zhao, Z. Probing an LSTM-PPO-Based Reinforcement Learning Algorithm to Solve Dynamic Job Shop Scheduling Problem. Comput. Ind. Eng.; 2024; 197, 110633. [DOI: https://dx.doi.org/10.1016/j.cie.2024.110633]

19. Lian, S.; Zhang, F. A Transferability Metric Using Scene Similarity and Local Map Observation for DRL Navigation. IEEE/ASME Trans. Mechatron.; 2024; 29, pp. 4423-4433. [DOI: https://dx.doi.org/10.1109/TMECH.2024.3376542]

20. Juliani, A.; Ash, J.T. A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning. Adv. Neural Inf. Process. Syst.; 2024; 37, pp. 113884-113910.

21. Cha, Y.-H.; Kim, Y.-D. Fire Scheduling for Planned Artillery Attack Operations under Time-Dependent Destruction Probabilities. Omega; 2010; 38, pp. 383-392. [DOI: https://dx.doi.org/10.1016/j.omega.2009.10.003]

22. Kline, A.G.; Ahner, D.K.; Lunday, B.J. Real-Time Heuristic Algorithms for the Static Weapon Target Assignment Problem. J. Heuristics; 2019; 25, pp. 377-397. [DOI: https://dx.doi.org/10.1007/s10732-018-9401-1]

23. Lu, Y.; Chen, D.Z. A New Exact Algorithm for the Weapon-Target Assignment Problem. Omega; 2021; 98, 102138. [DOI: https://dx.doi.org/10.1016/j.omega.2019.102138]

24. Ahner, D.K.; Parson, C.R. Optimal Multi-Stage Allocation of Weapons to Targets Using Adaptive Dynamic Programming. Optim. Lett.; 2015; 9, pp. 1689-1701. [DOI: https://dx.doi.org/10.1007/s11590-014-0823-x]

25. Xin, B.; Wang, Y.; Chen, J. An Efficient Marginal-Return-Based Constructive Heuristic to Solve the Sensor–Weapon–Target Assignment Problem. IEEE Trans. Syst. Man Cybern. Syst.; 2019; 49, pp. 2536-2547. [DOI: https://dx.doi.org/10.1109/TSMC.2017.2784187]

26. Xin, B.; Chen, J.; Zhang, J.; Dou, L.; Peng, Z. Efficient Decision Makings for Dynamic Weapon-Target Assignment by Virtual Permutation and Tabu Search Heuristics. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.); 2010; 40, pp. 649-662. [DOI: https://dx.doi.org/10.1109/TSMCC.2010.2049261]

27. Chang, T.; Kong, D.; Hao, N.; Xu, K.; Yang, G. Solving the Dynamic Weapon Target Assignment Problem by an Improved Artificial Bee Colony Algorithm with Heuristic Factor Initialization. Appl. Soft Comput.; 2018; 70, pp. 845-863. [DOI: https://dx.doi.org/10.1016/j.asoc.2018.06.014]

28. Zhang, K.; Zhou, D.; Yang, Z.; Li, X.; Zhao, Y.; Kong, W. A Dynamic Weapon Target Assignment Based on Receding Horizon Strategy by Heuristic Algorithm. J. Phys. Conf. Ser.; 2020; 1651, 012062. [DOI: https://dx.doi.org/10.1088/1742-6596/1651/1/012062]

29. Yan, Y.; Huang, J. Cooperative Output Regulation of Discrete-Time Linear Time-Delay Multi-Agent Systems under Switching Network. Neurocomputing; 2017; 241, pp. 108-114. [DOI: https://dx.doi.org/10.1016/j.neucom.2017.02.022]

30. Luo, W.; Lü, J.; Liu, K.; Chen, L. Learning-Based Policy Optimization for Adversarial Missile-Target Assignment. IEEE Trans. Syst. Man Cybern. Syst.; 2022; 52, pp. 4426-4437. [DOI: https://dx.doi.org/10.1109/TSMC.2021.3096997]

31. Wang, T.; Fu, L.; Wei, Z.; Zhou, Y.; Gao, S. Unmanned Ground Weapon Target Assignment Based on Deep Q-Learning Network with an Improved Multi-Objective Artificial Bee Colony Algorithm. Eng. Appl. Artif. Intell.; 2023; 117, 105612. [DOI: https://dx.doi.org/10.1016/j.engappai.2022.105612]

32. Zou, S.; Shi, X.; Song, S. MOEA with Adaptive Operator Based on Reinforcement Learning for Weapon Target Assignment. Electron. Res. Arch.; 2024; 32, pp. 1498-1532. [DOI: https://dx.doi.org/10.3934/era.2024069]

33. Ding, Y.; Kuang, M.; Shi, H.; Gao, J. Multi-UAV Cooperative Target Assignment Method Based on Reinforcement Learning. Drones; 2024; 8, 562. [DOI: https://dx.doi.org/10.3390/drones8100562]

34. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw.; 1998; 9, 1054. [DOI: https://dx.doi.org/10.1109/TNN.1998.712192]

35. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv; 2017; [DOI: https://dx.doi.org/10.48550/arXiv.1707.06347] arXiv: 1707.06347

36. Dohare, S.; Hernandez-Garcia, J.F.; Lan, Q.; Rahman, P.; Mahmood, A.R.; Sutton, R.S. Loss of Plasticity in Deep Continual Learning. Nature; 2024; 632, pp. 768-774. [DOI: https://dx.doi.org/10.1038/s41586-024-07711-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39169245]

37. Nikishin, E.; Schwarzer, M.; D’Oro, P.; Bacon, P.-L.; Courville, A. The Primacy Bias in Deep Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning; Baltimore, MD, USA, 17–23 July 2022; pp. 16828-16847.

38. Nikishin, E.; Oh, J.; Ostrovski, G.; Lyle, C.; Pascanu, R.; Dabney, W.; Barreto, A. Deep Reinforcement Learning with Plasticity Injection. Proceedings of the 37th International Conference on Neural Information Processing Systems; New Orleans, LA, USA, 10–16 December 2023.

39. Ash, J.; Adams, R.P. On Warm-Starting Neural Network Training. Proceedings of the Advances in Neural Information Processing Systems; Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 3884-3894.

40. Zhu, J.; Wang, X.; Wang, P.; Wu, Z.; Kim, M.J. Integration of BIM and GIS: Geometry from IFC to Shapefile Using Open-Source Technology. Autom. Constr.; 2019; 102, pp. 105-119. [DOI: https://dx.doi.org/10.1016/j.autcon.2019.02.014]

41. Santos, L.B.L.; Jorge, A.A.S.; Rossato, M.; Santos, J.D.; Candido, O.A.; Seron, W.; de Santana, C.N. (Geo)Graphs—Complex Networks as a Shapefile of Nodes and a Shapefile of Edges for Different Applications. Available online: https://arxiv.org/abs/1711.05879v1 (accessed on 20 June 2025).

42. Wang, D.; Xin, B.; Wang, Y.; Zhang, J.; Deng, F.; Wang, X. Constraint-Feature-Guided Evolutionary Algorithms for Multi-Objective Multi-Stage Weapon-Target Assignment Problems. J. Syst. Sci. Complex.; 2025; 38, pp. 972-999. [DOI: https://dx.doi.org/10.1007/s11424-025-4232-2]

43. Zeng, H.; Xiong, Y.; She, J.; Yu, A. Task Assignment Scheme Designed for Online Urban Sensing Based on Sparse Mobile Crowdsensing. IEEE Internet Things J.; 2025; 12, pp. 17791-17806. [DOI: https://dx.doi.org/10.1109/JIOT.2025.3540501]

44. Kline, A.; Ahner, D.; Hill, R. The Weapon-Target Assignment Problem. Comput. Oper. Res.; 2019; 105, pp. 226-236. [DOI: https://dx.doi.org/10.1016/j.cor.2018.10.015]

45. Xiang, X.; Wu, K.; Ren, T.; Wang, L.; Xie, C. Analysis of the Construction of the Kill Chain in Man-Unmanned Collaborative Command for Small-Unit Urban Operations. Proceedings of the 13th China Command and Control Conference; Beijing, China, 15–17 May 2025; pp. 137-142.

46. Zhao, Q.; Li, L.; Chen, X.; Hou, L.; Lei, Z. Research on the development strategy of intelligent weapon equipment for urban warfare based on SWOT and FAHP. Command Control Simul.; 2025; 47, pp. 93-100.

47. Zheng, W.; Li, Q.; Liu, W.; Fei, A.; Wang, F. Data-knowledge-driven metaverse modeling framework for urban warfare. J. Command Control; 2023; 9, pp. 23-32.

48. Li, H.; Yang, H.; Sheng, Z.; Liu, C.; Chen, Y. Multi-UAV collaborative distributed dynamic task allocation based on MAPPO. Control Decis.; 2025; 40, pp. 1429-1437.

49. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence; New Orleans, LA, USA, 2–7 February 2018; Volume 32, [DOI: https://dx.doi.org/10.1609/aaai.v32i1.11796]

50. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proceedings of the 35th International Conference on Machine Learning; Stockholm, Sweden, 10–15 July 2018; pp. 1861-1870.

51. Bhatt, A.; Palenicek, D.; Belousov, B.; Argus, M.; Amiranashvili, A.; Brox, T.; Peters, J. CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity. Proceedings of the Twelfth International Conference on Learning Representations; Vienna, Austria, 7–11 May 2024.

52. Ye, F.; Chen, J.; Tian, Y.; Jiang, T. Cooperative Task Assignment of a Heterogeneous Multi-UAV System Using an Adaptive Genetic Algorithm. Electronics; 2020; 9, 687. [DOI: https://dx.doi.org/10.3390/electronics9040687]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).