Abstract

Translate

The demand for low-latency computing from the Internet of Things (IoT) and emerging applications challenges traditional cloud computing. Mobile Edge Computing (MEC) offers a solution by deploying resources at the network edge, yet terrestrial deployments face limitations. Unmanned Aerial Vehicles (UAVs), leveraging their high mobility and flexibility, provide dynamic computation offloading for User Equipments (UEs), especially in areas with poor infrastructure or network congestion. However, UAV-assisted MEC confronts significant challenges, including time-varying wireless channels and the inherent energy constraints of UAVs. We put forward the Lyapunov-based Deep Deterministic Policy Gradient (LyDDPG), a novel computation offloading algorithm. This algorithm innovatively integrates Lyapunov optimization with the Deep Deterministic Policy Gradient (DDPG) method. Lyapunov optimization transforms the long-term, stochastic energy minimization problem into a series of tractable, per-timeslot deterministic subproblems. Subsequently, DDPG is utilized to solve these subproblems by learning a model-free policy through environmental interaction. This policy maps system states to optimal continuous offloading and resource allocation decisions, aiming to minimize the Lyapunov-derived “drift-plus-penalty” term. The simulation outcomes indicate that, compared to several baseline and leading algorithms, the proposed LyDDPG algorithm reduces the total system energy consumption by at least 16% while simultaneously maintaining low task latency and ensuring system stability.

Full text

Turn on search term navigation

Translate

1. Introduction

The proliferation of Internet of Things (IoT) devices and emerging low-latency applications have strained traditional cloud computing. Mobile Edge Computing (MEC) mitigates these issues by deploying resources at the network edge, thereby enhancing user experience [1,2,3,4]. However, as terrestrial MEC deployments are often hindered by geographical and infrastructural constraints, Unmanned Aerial Vehicles (UAVs) are increasingly seen as viable aerial platforms. Leveraging their high mobility and flexible deployment, UAVs can provide on-demand computation offloading, especially in areas with poor coverage or network congestion [5,6,7,8]. Despite this potential, UAV-assisted MEC systems face formidable challenges in dynamic environments, including time-varying wireless channels, stochastic task arrivals, and inherent energy limitations. Therefore, designing advanced offloading and resource allocation algorithms for dynamic UAV-assisted MEC has become a critical research direction to address these performance bottlenecks [9,10,11,12].

Optimizing energy consumption and Quality of Service (QoS) in UAV-assisted MEC often involves solving a multi-objective computation offloading problem. Many studies formulate this as a Mixed-Integer Non-Linear Programming (MINLP) problem to find optimal solutions [7,12,13,14]. However, MINLP problems are NP-hard, making their direct solution computationally infeasible for real-time systems. Beyond such mathematical programming, methods from control theory offer another powerful avenue. For instance, approaches based on optimal control theory can, in principle, determine energy-optimal trajectories and resource management policies, as demonstrated in studies like Bianchi [15]. Similarly, dynamic programming, while capable of finding globally optimal solutions, often suffers from the curse of dimensionality. To enable real-time control, some research investigates hybrid approaches that couple learning algorithms with classical controllers like the Linear Quadratic Regulator (LQR) for dynamic trajectory design [16]. Furthermore, robust control techniques such as Model Predictive Control (MPC) and Sliding Mode Control (SMC) are designed to offer stable performance in the presence of system disturbances. To reduce the complexity of MINLP, various approximation methods have also been proposed, including convex optimization relaxation [13], hierarchical approaches [17], and graph-theory-based algorithms [18]. However, despite their theoretical power, all these model-based paradigms—from MINLP approximations to optimal and robust control—share a critical drawback: a heavy reliance on precise system models and perfect environmental knowledge. This dependency significantly degrades their performance in the highly dynamic and uncertain conditions typical of UAV-assisted MEC, rendering them ill-suited for adaptive resource management. This fundamental limitation of model-based approaches has motivated a significant shift in research towards model-free paradigms, chief among them being Deep Reinforcement Learning (DRL).

DRL has emerged as a powerful paradigm for resource management challenges in edge computing [19,20,21,22], as it enables agents to learn optimal policies via model-free interaction with the environment. For instance, early approaches for discrete offloading decisions employed Q-learning-based algorithms [23] and Deep Q-Networks (DQN) [24] to handle intricate state spaces. However, their inability to manage continuous action spaces prompted the adoption of a range of actor–critic methods. While foundational algorithms like Deep Deterministic Policy Gradient (DDPG) [25] showed commendable performance, a remaining challenge was their training instability. To address this, more recent methods such as Proximal Policy Optimization (PPO) [26] and Soft Actor–Critic (SAC) [27] are often favored for their enhanced training stability and sample efficiency. PPO mitigates destructive policy updates by using a clipped surrogate objective, while SAC encourages broader exploration by incorporating an entropy maximization objective, leading to more robust policies. However, despite their sophisticated mechanisms for improving learning dynamics, a critical limitation is shared by all these DRL approaches—from DDPG to PPO and SAC: they primarily optimize for empirical long-term rewards and lack explicit theoretical guarantees for ensuring long-term system stability (e.g., bounded task queues). This gap makes their direct application risky in scenarios where strict Quality of Service (QoS) is a primary concern.

Ensuring long-term system stability is as critical as optimizing instantaneous performance in UAV-assisted MEC. Lyapunov optimization is a powerful stochastic optimization tool that provides theoretical guarantees for queue stability. Its core technique involves transforming a long-term stochastic problem into a series of per-timeslot deterministic subproblems by minimizing a “drift-plus-penalty” expression, enabling online control without future information [28,29]. While several studies have applied this framework to MEC, a key limitation persists: solving the per-timeslot subproblem can still be computationally intensive or rely on oversimplified models, hindering rapid adaptation in dynamic environments. This creates a compelling motivation to integrate the stability guarantees of the Lyapunov framework with the model-free, adaptive learning capabilities of DRL. Effectively synergizing these two paradigms to harness their combined strengths remains a pivotal and underexplored research direction. To provide a clearer overview of these methodologies, we summarize and compare the key existing approaches in Table 1.

Motivated by the need to bridge the gap between the stability guarantees of control theory and the adaptive learning of DRL, this paper proposes a novel algorithm, the Lyapunov-based Deep Deterministic Policy Gradient (LyDDPG). The LyDDPG algorithm synergistically combines Lyapunov optimization with the DDPG method to address the energy optimization problem in dynamic UAV-assisted MEC systems. The core idea is to leverage Lyapunov optimization to decompose the long-term stochastic problem into tractable per-timeslot subproblems, providing a theoretically sound optimization target for the DRL agent. DDPG, which excels in continuous action spaces, is then employed to learn a model-free policy that maps real-time system states to optimal offloading and resource allocation decisions. This integrated approach enables LyDDPG to achieve robust, adaptive policies that minimize long-term energy consumption while maintaining system stability, promising superior performance in complex UAV-MEC scenarios. The principal contributions of this paper are outlined below:

(1). This paper tackles the intricate problem of resource allocation in UAV-assisted MEC systems. We frame it as a long-term stochastic optimization task, where the objective is to curtail energy consumption while guaranteeing queue stability and low task latency.

(2). To accurately capture the system’s stochastic dynamics, we introduce a sophisticated analytical framework. We model the computation queues using M/G/1 queuing theory and employ the Pollaczek–Khinchine (P-K) formula to derive a precise, closed-form expression for the average task sojourn time. This provides a more realistic and robust foundation for latency-aware optimization.

(3). We propose a novel Lyapunov-based Deep Deterministic Policy Gradient (LyDDPG) algorithm. This approach leverages Lyapunov optimization to transform the complex long-term problem into tractable, per-timeslot subproblems, which are then solved by a DDPG agent in a model-free manner. The core optimization target is the minimization of the expected energy consumption over time. Our experiments reveal that, while ensuring low task latency and system stability, our proposed LyDDPG algorithm significantly outperforms other state-of-the-art reinforcement learning benchmarks, reducing the average energy consumption by at least 16%.

2. Modeling and Formulation

2.1. UAV-Assisted MEC Network Model

As depicted in Figure 1, we consider a UAV-assisted MEC system designed to serve computationally constrained User Equipments (UEs) in dynamic environments. The system comprises a UAV furnished with an Edge Server (ES) and a set of N ground UEs, denoted by the set $i = {1, 2, \dots, N}$ . The UAV is responsible for providing computation offloading services to UEs operating within its coverage area. The system operates in discrete timeslots, indexed by the set $t = {1, 2, \dots, T}$ , each with a duration of $δ$ . We assume that both the task arrival process and the wireless channel environment are stochastic and vary dynamically over time. Consequently, at the beginning of each timeslot, UE i faces a dynamic decision problem: based on the randomly arriving task volume, the state of its local and offloading queues, and the instantaneous channel conditions, the UE must decide whether to process the task locally or to partially offload it to the ES on the UAV for computation [13,24,25].

In Figure 1, $A_{i} (t)$ is the amount of data generated by UE i during timeslot t. $g_{i} (t)$ signifies the time-varying channel gain. $Q_{i}^{L} (t)$ represents the local computation task queue at UE i. $Q_{i}^{O} (t)$ represents the task queue awaiting for offloading at UE i. $Q_{i}^{E} (t)$ represents the computation task queue at the ES. Table 2 provides a comprehensive list of the primary system variables investigated in this study.

2.2. Task and Communication Model

Each time slot t begins with UE i generating a new computational task, which has a data size of $A_{i} (t)$ . The arrival process of these tasks is modeled by a Poisson distribution with rate parameter $λ$ . To accurately model the queuing behavior, we characterize the stochastic nature of the task size $A_{i} (t)$ by its first and second moments. We denote the average task size as $E [A_{i}] = {\bar{A}}_{i}$ , and its variance as $V a r (A_{i})$ . Consequently, the second moment of the task size is given by $E [A_{i}^{2}] = V a r (A_{i}) + {({\bar{A}}_{i})}^{2}$ . Upon generation, tasks are initially enqueued in the local computation data queue, $Q_{i}^{L} (t)$ , at UE i, awaiting either local processing or offloading to the UAV once preceding tasks are completed or offloaded. To optimize the task computation process, a partial offloading strategy is employed to determine how tasks are handled. A decision variable $x_{i} (t) \in [0, 1]$ is defined for UE i in time slot t. This continuous variable, $x_{i} (t)$ , signifies that UE i decides to offload a fraction $x_{i} (t)$ of its task data to the UAV, while the remaining fraction $(1 - x_{i} (t))$ is processed locally.

To mitigate inter-user interference during wireless transmission, UEs utilize a Time Division Multiple Access (TDMA) scheme to offload tasks to the UAV. The transmission rate, $R_{i} (t)$ , for offloading tasks from UE i to UAV in time slot t is formulated as [13]:

(1) $R_{i} (t) = B {log}_{2} (1 + \frac{g_{i} (t) P_{i} (t)}{N_{0}})$

where

g_{i} (t)

signifies the channel gain,

P_{i} (t)

represents the transmission power utilized by UE i when communicating with the UAV in time slot t, B represents the channel bandwidth, and

N_{0}

signifies the noise power spectral density.

2.3. Delay Model

To provide a more realistic performance analysis, the delay model moves beyond instantaneous processing times and instead evaluates the average total sojourn time for tasks, the aggregate of the queuing delay and the service delay. We employ the more general M/G/1 queuing model to capture the behavior of the computation queues [30], as the service time distribution for tasks is not necessarily exponential. This requires analyzing the first and second moments of the service time.

For the fraction of the task processed locally, the UE’s CPU acts as a single server. The service time, denoted by the random variable $S_{i}^{L} (t)$ , is the time required to process one entire task of size $A_{i} (t)$ . It is given by:

(2) $S_{i}^{L} (t) = \frac{A_{i} (t) b_{i} (t)}{f_{i}^{L} (t)}$

where we let

f_{i}^{L} (t)

be the CPU clock speed of UE i, and

b_{i} (t)

denote the number of CPU cycles needed per bit of data for processing. The average service time is therefore

E [S_{i}^{L} (t)] = \frac{{\bar{A}}_{i} b_{i} (t)}{f_{i}^{L} (t)}

. For a stable queue, the average data processing throughput of the local device, denoted as

μ_{i}^{L} (t)

, must equal the average arrival rate of data. This throughput is given by:

(3) $μ_{i}^{L} (t) = (1 - x_{i} (t)) λ {\bar{A}}_{i}$

The arrival rate of tasks to be processed locally is $λ_{i}^{L} = (1 - x_{i} (t)) λ$ . The utilization of the local server is $ρ_{i}^{L} = λ_{i}^{L} E [S_{i}^{L} (t)]$ . To ensure queue stability, we must have $ρ_{i}^{L} (t) < 1$ .

The performance of the M/G/1 model is fundamentally linked to the variability of the service time, which is captured by its second moment [30], $E [{(S_{i}^{L} (t))}^{2}]$ , calculated as:

(4) $E [{(S_{i}^{L} (t))}^{2}] = {(\frac{b_{i} (t)}{f_{i}^{L} (t)})}^{2} E [A_{i}^{2}] = {(\frac{b_{i} (t)}{f_{i}^{L} (t)})}^{2} (Var (A_{i}) + {({\bar{A}}_{i})}^{2})$

Using the Pollaczek–Khinchine formula [31,32], the average queuing delay in the local queue $Q_{i}^{L} (t)$ is:

(5) $W_{q, i}^{L} (t) = \frac{E [{(S_{i}^{L} (t))}^{2}]}{2 (1 - ρ_{i}^{L} (t))}$

Thus, the total average sojourn time for a locally processed task, denoted as $T_{i}^{L} (t)$ , is the sum of the average waiting time and the average service time:

(6) $T_{i}^{L} (t) = W_{q, i}^{L} (t) + E [S_{i}^{L} (t)]$

For tasks offloaded to the UAV, the delay consists of three main components: the transmission delay for the task data, the queuing delay at the ES, and the computation delay on the ES. First, the transmission delay for the offloaded portion of the task, denoted $τ_{i}^{O} (t)$ , is expressed as:

(7) $τ_{i}^{O} (t) = \frac{x_{i} (t) {\bar{A}}_{i}}{R_{i} (t)}$

Once the task arrives at the ES, it enters the computation queue $Q_{i}^{E} (t)$ . We model this as another M/G/1 queue. The computational resources allocated to UE i on the UAV are $ψ_{i} (t) F^{E} (t)$ , where $F^{E} (t)$ is the total computation capability of the ES and $ψ_{i} (t)$ is the allocated proportion [33]. This allocation factor $ψ_{i} (t)$ must satisfy the constraint:

(8) $\sum_{i = 1}^{N} ψ_{i} (t) \leq 1$

The service time for an offloaded task $A_{i} (t)$ on the ES is:

(9) $S_{i}^{E} (t) = \frac{A_{i} (t) b_{i} (t)}{ψ_{i} (t) F^{E} (t)}$

Similarly, for a stable edge queue, the average data processing throughput on the edge for tasks from UE i, denoted as $μ_{i}^{E} (t)$ , is:

(10) $μ_{i}^{E} (t) = x_{i} (t) λ {\bar{A}}_{i}$

Following the same M/G/1 analysis, the arrival rate is $λ_{i}^{E} = x_{i} (t) λ$ , and the ES utilization for UE i is $ρ_{i}^{E} (t) = λ_{i}^{E} E [S_{i}^{E} (t)] < 1$ . The average queuing delay at the edge is:

(11) $W_{q, i}^{E} (t) = \frac{λ_{i}^{E} E [{(S_{i}^{E} (t))}^{2}]}{2 (1 - ρ_{i}^{E} (t))}$

The total average sojourn time on the ES, $T_{i}^{E} (t)$ , is:

(12) $T_{i}^{E} (t) = W_{q, i}^{E} (t) + E [S_{i}^{E} (t)]$

The total expected delay for a task from UE i, $T_{i} (t)$ , is the weighted average of the local sojourn time and the total offloading path delay, based on the offloading decision $x_{i} (t)$ :

(13) $τ_{i} (t) = (1 - x_{i} (t)) T_{i}^{L} (t) + x_{i} (t) [τ_{i}^{O} (t) + T_{i}^{E} (t)]$

2.4. Energy Consumption Model

When UE i performs computation locally, its Central Processing Unit (CPU) operates at a frequency $f_{i}^{L} (t)$ . The energy consumed by UE i due to this local CPU processing is denoted by $E_{i}^{L} (t)$ and is given as follows:

(14) $E_{i}^{L} (t) = κ_{L} {[f_{i}^{L} (t)]}^{3} t_{i}^{L} (t)$

Herein, $κ_{L}$ is the effective energy coefficient related to the chip architecture for local computation. When tasks are offloaded, the energy consumption is primarily composed of data transmission and EC processing. The transmission energy consumption for UE i, denoted as $E_{i}^{O} (t)$ , is given by:

(15) $E_{i}^{O} (t) = \frac{P_{i} (t) x_{i} (t) {\bar{A}}_{i} (t)}{R_{i} (t)}$

The computation energy consumption at the ES on the UAV for processing tasks from UE i, denoted as $E_{i}^{C} (t)$ , depends on the computational resources allocated to UE i. Using the refined notation $F^{E} (t)$ for the total edge computation capability, this energy is modeled as:

(16) $E_{i}^{C} (t) = κ_{E} {[ψ_{i} (t) F^{E} (t)]}^{3} t_{i}^{E} (t)$

Where $κ_{E}$ is the effective energy coefficient related to the chip architecture for ES computation. Consequently, the total energy consumption for UE i in time slot t, $E_{i} (t)$ , is calculated as the sum of these three components:

(17) $E_{i} (t) = E_{i}^{L} (t) + E_{i}^{O} (t) + E_{i}^{C} (t)$

2.5. Queue Dynamics Model

To model the evolution of the system state, we define the dynamics of the computation and offloading queues [33]. The queue lengths serve as the state representation for our reinforcement learning agent.

The local computation queue, $Q_{i}^{L} (t)$ , contains tasks awaiting local processing. The evolution of its length is described by the equation below:

(18) $Q_{i}^{L} (t + 1) = max {Q_{i}^{L} (t) - D_{i}^{L} (t), 0} + (1 - x_{i} (t)) A_{i} (t)$

Where we define $D_{i}^{L} (t)$ as the service capacity of the local device in timeslot t, given by $D_{i}^{L} (t) = \frac{f_{i}^{L} (t)}{b_{i} (t)} δ$ , which represents the maximum amount of data that can be processed locally by UE i during the timeslot of duration $δ$ .

The offloading data queue, $Q_{i}^{O} (t)$ , buffers tasks that are designated for offloading but have not yet been transmitted. Its dynamics are given by:

(19) $Q_{i}^{O} (t + 1) = max {Q_{i}^{O} (t) - D_{i}^{O} (t), 0} + x_{i} (t) A_{i} (t)$

Here, $D_{i}^{O} (t)$ is the actual amount of data successfully transmitted in timeslot t. This amount is constrained by both the data available in the queue and the channel’s transmission capacity, $μ_{i}^{O} (t) = R_{i} (t) δ$ . Thus, $D_{i}^{O} (t)$ is given by:

(20) $D_{i}^{O} (t) = min {Q_{i}^{O} (t) + x_{i} (t) A_{i} (t), μ_{i}^{O} (t)}$

Finally, the edge computation queue, $Q_{i}^{E} (t)$ , holds the tasks that have arrived at the edge server. Its update rule is:

(21) $Q_{i}^{E} (t + 1) = max {Q_{i}^{E} (t) - D_{i}^{E} (t), 0} + D_{i}^{O} (t)$

Where we define $D_{i}^{E} (t)$ as the service capacity allocated to UE i on the ES, given by $D_{i}^{E} (t) = \frac{ψ_{i} (t) F^{E} (t)}{b_{i} (t)} δ$ , which is the maximum data that can be processed by the allocated edge resources in a timeslot.

A fundamental requirement for long-term system stability is that all queues are rate-stable [34]. The formal definition of rate-stability is as follows:

(22) $lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} E [Q_{k} (t)] < \infty, \forall k \in {L, O, E}$

2.6. Problem Definition

This study addresses the challenge of joint task offloading and resource allocation for dynamic UAV-enabled MEC systems. The core aim is to minimize the system-wide, time-averaged energy expenditure. This optimization must be achieved subject to constraints on queue stability, QoS requirements, and physical limitations. Mathematically, this multi-stage stochastic optimization problem, denoted as $P_{1}$ , is formally stated as:

(23) $P_{1} : min_{x_{i} (t), ψ_{i} (t)} lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} \sum_{i = 1}^{N} E_{i} (t)$

(23a) $\begin{matrix} lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} E [Q_{k} (t)] < \infty, \forall k \in {L, O, E} \end{matrix}$

(23b) $\begin{matrix} lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} E [τ_{i} (t)] \leq τ_{i}^{max}, \forall i \end{matrix}$

(23c) $\begin{matrix} \sum_{i = 1}^{N} ψ_{i} (t) \leq 1, \forall t \end{matrix}$

(23d) $\begin{matrix} 0 \leq x_{i} (t) \leq 1, \forall i, t \end{matrix}$

(23e) $\begin{matrix} 0 \leq ψ_{i} (t) \leq 1, \forall i, t \end{matrix}$

Constraint (23a) is the queue stability constraint for all queues in the system, ensuring that no queue grows infinitely large. Constraint (23b) is the QoS constraint, which guarantees that the long-term average sojourn time for tasks of each UE i does not exceed a predefined maximum tolerable delay $τ_{i}^{max}$ . This constraint directly utilizes the delay $τ_{i} (t)$ derived from our M/G/1 model. Constraint (23c) ensures that the total allocated proportion of edge server resources does not exceed its capacity. Constraints (23d) and (23e) define the valid ranges for the offloading ratio and resource allocation proportion, which are the primary decision variables for our agent.

The problem $P_{1}$ is a complex, multi-stage stochastic optimization problem. Obtaining a direct optimal solution is challenging due to the time-varying nature of the environment and the coupling between decisions across time slots. To address this, we leverage Lyapunov optimization in Section 3 to transform $P_{1}$ into a series of more tractable, per-timeslot optimization problems. Subsequently, in Section 4, the DDPG algorithm is employed to learn a policy to solve these subproblems online and in a model-free manner. The effectiveness of the proposed algorithm will be validated through simulations in Section 5.

3. The Lyapunov Optimization Framework

The Lyapunov optimization framework provides a robust methodology for the dynamic control of stochastic networks. In our work, we leverage this framework to transform our long-term stochastic optimization problem into a sequence of more tractable, deterministic subproblems, each addressed on a per-timeslot basis. The fundamental principle is to maintain queue stability by minimizing an upper bound on the “drift-plus-penalty” term within each time slot.

Let the vector of all queue backlogs define the system state, denoted by $Q (t)$ , as follows:

(24) $Q (t) = {[Q_{i}^{L} (t) + Q_{i}^{O} (t) + Q_{i}^{E} (t)]}_{i = 1}^{N}$

We first define a quadratic Lyapunov function $L (Q (t))$ which represents the total “pressure” or congestion of the system:

(25) $L (Q (t)) = \frac{1}{2} \sum_{i = 1}^{N} [{(Q_{i}^{L} (t))}^{2} + {(Q_{i}^{O} (t))}^{2} + {(Q_{i}^{E} (t))}^{2}]$

The one-slot conditional Lyapunov drift, $Δ (Q (t))$ , measures the expected change in $L (Q (t))$ over one timeslot, given the current state $Q (t)$ :

(26) $Δ (Q (t)) = E [L (Q (t + 1)) - L (Q (t)) ∣ Q (t)]$

To jointly minimize energy consumption while ensuring queue stability, we aim to minimize the upper bound of the Lyapunov drift-plus-penalty expression, which is formulated as:

(27) $Λ (Q (t)) ≜ Δ (Q (t)) + V \cdot E [\sum_{i = 1}^{N} E_{i} (t) ∣ Q (t)]$

Here, V is a non-negative control parameter that acts as a crucial trade-off knob, allowing us to explicitly control the balance between system robustness (i.e., queue stability) and energy efficiency. A larger value of V places a higher penalty on energy consumption, compelling the agent to learn a policy that prioritizes long-term energy savings, potentially at the expense of higher transient queue lengths. Conversely, a smaller V emphasizes queue stability, guiding the agent to make decisions that aggressively reduce queue backlogs, even if such actions are more energy-intensive. This parameter provides a flexible mechanism to adapt the system’s operational focus based on different performance requirements.

To make the problem tractable, we first find an upper bound for the drift term $Δ (Q (t))$ . Using the inequality ${(max {Q - D} + A)}^{2} \leq Q^{2} + D^{2} + A^{2} - 2 Q (D - A)$ for any $Q, D, A \geq 0$ , we can bound the drift for each queue based on its dynamics defined earlier:

(28) $\begin{matrix} \frac{1}{2} E [{(Q_{k}^{k} (t + 1))}^{2} - {(Q_{k}^{k} (t))}^{2} | Q (t)] = \frac{1}{2} E [{(A_{k}^{k} (t))}^{2} + {(D_{k}^{k} (t))}^{2}] - \\ E [Q_{k}^{k} (t) (D_{k}^{k} (t) - A_{k}^{k} (t)) | Q (t)] \end{matrix}$

where

k \in {L, O, E}

D_{k}^{k} (t)

and

A_{k}^{k} (t)

represent the corresponding service and arrival data volumes for queue k. By summing over all queues and all UEs, we get the upper bound for the total Lyapunov drift:

(29) $\begin{matrix} Δ (Q (t)) \leq B + \sum_{i = 1}^{N} E & [Q_{i}^{L} (t) ((1 - x_{i} (t)) A_{i} (t) - D_{i}^{L} (t)) \\ + Q_{i}^{O} (t) (x_{i} (t) A_{i} (t) - D_{i}^{O} (t)) \\ + Q_{i}^{E} (t) (D_{i}^{O} (t) - D_{i}^{E} (t)) | Q (t)] \end{matrix}$

where B is a positive constant that is independent of the queue lengths and the optimization decisions, defined as:

(30) $B \geq \frac{1}{2} \sum_{i = 1}^{N} E [{(D_{i}^{L} (t))}^{2} + {((1 - x_{i} (t)) A_{i} (t))}^{2} + \dots | Q (t)]$

Substituting this bound back into the drift-plus-penalty expression (26), we get:

(31) $\begin{matrix} Λ (Q (t)) \leq B + \sum_{i = 1}^{N} E & [Q_{i}^{L} (t) ((1 - x_{i} (t)) A_{i} (t) - D_{i}^{L} (t)) \\ + Q_{i}^{O} (t) (x_{i} (t) A_{i} (t) - D_{i}^{O} (t)) \\ + Q_{i}^{E} (t) (D_{i}^{O} (t) - D_{i}^{E} (t)) + V \cdot E_{i} (t) | Q (t)] \end{matrix}$

The inequality (31) provides a tractable upper bound for the drift-plus-penalty expression. The core principle of Lyapunov optimization is to make decisions in each timeslot to greedily minimize the right-hand side of this inequality. By doing so, we can ensure long-term system stability while pushing the time-averaged energy consumption towards its minimum.

This transforms our original complex stochastic problem P1 into a series of deterministic subproblems, where in each timeslot t, the goal is to choose actions that minimize the expected value of the terms dependent on these actions. In the following chapter, we will detail how we formulate this per-timeslot minimization task as a reinforcement learning problem and employ the DDPG algorithm to solve it.

4. DDPG Based on Lyapunov Function

4.1. Deep Reinforcement Learning-Based Solution Method

To solve the per-timeslot minimization problem derived from the Lyapunov optimization framework, we employ the DDPG algorithm.

We model the problem as a Markov Decision Process (MDP), where the DDPG agent learns to make optimal resource allocation decisions by interacting with the MEC environment.

The core of our proposed LyDDPG approach is to use the DDPG agent to learn a policy that maps the system state to actions, with the goal of minimizing the Lyapunov drift-plus-penalty term at each timeslot. The overall framework of the proposed algorithm is illustrated in Figure 2.

(1). State Space

The state observed by the agent at the beginning of each timeslot t must contain sufficient information to evaluate the per-timeslot optimization objective. The most critical information is the current length of all queues, as they directly influence the Lyapunov function. We also include the channel gains, as they affect transmission rates and energy. Therefore, the system state $s (t) \in S$ is defined as:

(32) $s (t) = (Q_{1}^{L} (t), \dots, Q_{N}^{L} (t), Q_{1}^{O} (t), \dots, Q_{N}^{O} (t), Q_{1}^{E} (t), \dots, Q_{N}^{E} (t), g_{1} (t), \dots, g_{N} (t))$

This state vector provides the agent with a comprehensive snapshot of the current system congestion and communication environment.

(2). Action Space

At each timeslot t, the agent takes an action $a (t) \in A$ , which is composed of the continuous decision variables it can directly control. Based on our model, these are the task offloading ratios $x_{i} (t)$ and the edge resource allocation proportions $ψ_{i} (t)$ for all UEs. The complete action vector is therefore constructed as:

(33) $a (t) = (x_{1} (t), \dots, x_{N} (t), ψ_{1} (t), \dots, ψ_{N} (t))$

where

x_{i} (t) \in [0, 1]

and the agent’s output for

ψ_{i} (t)

will be normalized to satisfy

\sum_{i = 1}^{N} ψ_{i} (t) \leq 1

(3). Reward Function

The core principle of integrating DDPG with the Lyapunov framework is to use the DRL agent to solve the per-timeslot minimization problem derived in Section 3. To achieve this, we design the instantaneous reward $r (t)$ to be the negative of the right-hand side (RHS) of the drift-plus-penalty upper bound in inequality (31). By training the agent to maximize the cumulative reward, it implicitly learns to minimize this expression.

Ignoring the constant B, which does not affect the optimal policy, the per-timeslot reward $r (t)$ is defined as [33]:

(34) $\begin{matrix} r (t) = - E & [\sum_{i = 1}^{N} (Q_{i}^{L} (t) ((1 - x_{i} (t)) A_{i} (t) - D_{i}^{L} (t)) \\ + Q_{i}^{O} (t) (x_{i} (t) A_{i} (t) - D_{i}^{O} (t)) \\ + Q_{i}^{E} (t) (D_{i}^{O} (t) - D_{i}^{E} (t))) \\ + V \cdot \sum_{i = 1}^{N} E_{i} (t) | Q (t)] \end{matrix}$

In practice, at the end of each timeslot t, the agent takes action $a (t)$ , observes the resulting energy consumption $E_{i} (t)$ and the service data rates $D_{k} (t)$ , and calculates the realized value of the expression inside the expectation as the reward $r (t)$ . This reward signal directly guides the DDPG agent to learn a policy $μ (s | θ^{μ})$ that balances queue stability (the Q terms) and energy consumption (the $V \cdot E$ term), thus practically realizing the objectives of the Lyapunov optimization framework in a model-free manner.

4.2. DDPG Algorithm Architecture

This section elaborates on the specific architecture and training protocol of the DDPG agent implemented in this study. The DDPG agent is primarily composed of four neural networks: an actor network, a critic network, and their respective target networks.

(1). Actor Network

The actor network’s primary role is to learn a deterministic policy, denoted as $μ (s (t) | θ^{μ})$ , which maps the currently observed state $s (t)$ to a specific action $a (t)$ . The policy is optimized to select actions that lead to the maximum expected long-term return. Table 3 illustrates the hierarchical structure of the architecture.

(2). Critic Network

The critic network approximates the action-value function $Q (s (t), a (t) ∣ θ^{Q})$ , providing a quantitative measure of how beneficial it is to select action $a (t)$ in the current state $s (t)$ . The hierarchical structure of the architecture is shown in Table 4.

(3). Target Networks

To stabilize the training, DDPG incorporates a target actor network $μ^{'} (s (t) ∣ θ^{μ^{'}})$ and a target critic network $Q^{'} (s (t), a (t) ∣ θ^{Q^{'}})$ as separate components. The architectures of these target networks are identical to their original actor and critic counterparts. At the commencement of training, the parameters of the target networks are initialized as exact copies of the original networks’ parameters. Subsequently, the parameters of the target networks are not updated directly via gradient descent. Instead, they are updated slowly using a “soft update” mechanism, gradually tracking the parameters of the original networks. Specifically, the target network parameters $θ^{'}$ are updated as follows [25,35]:

(35) $θ^{'} \leftarrow τ θ + (1 - τ) θ^{'}$

In this formula, the term $θ$ corresponds to the parameters of the main network, and the update is controlled by a small rate, $τ$ . This gradual update helps to prevent drastic fluctuations in the target Q-values, thereby stabilizing the learning procedure.

4.3. Training Protocol

DDPG utilizes an experience replay strategy to improve sample efficiency and decorrelate the training data. The agent’s interactions with its surroundings produce experience transitions, $(s (t), a (t), r (t), s (t + 1))$ , that are collected in a memory buffer, D, of a predetermined size. When performing a training update, the model learns from a mini-batch $M_{b a t c h}$ of experiences drawn at random from this memory, thus avoiding the use of only the most recent data. The training procedure generally begins only once a certain threshold of experiences has been gathered in the replay buffer. The agent’s interaction cycle with the environment is summarized in Algorithm 1.

Algorithm 1 LYDDPG algorithm.

1.. Initialize Actor network $μ (s (t) | θ^{μ})$ and Critic network $Q (s_{j}, a_{j} | θ^{Q})$ with random parameters $θ^{μ}$ and $θ^{Q}$

2.. Initialize Target networks $μ^{'}$ and $Q^{'}$ with weights $θ^{μ^{'}} \leftarrow θ^{μ}$ , $θ^{Q^{'}} \leftarrow θ^{Q}$

3.. Initialize replay buffer D

4.. Reset environment and get initial state $s_{t}$

5.. For episode=1 to $N_{episode}$ do

6.. For each time step t within the episode do

7.. Select action $d (t)$ based on current state $s (t)$ using actor network and exploration noise

8.. Get deterministic action from actor network: $a_{deterministic} \leftarrow μ (s (t) | θ^{μ})$

9.. Generate exploration noise: $ϵ \sim N (0, σ_{noise}^{2} I)$

10.. Add noise to the deterministic action: $d (t) \leftarrow a_{deterministic} + ϵ$

11.. Clip action $d (t)$ to be within valid action bounds

12.. Execute action $d (t)$ in the environment

13.. Obtain reward for each UE i based on $R_{i} (t)$ , and compute the average reward $r (t)$

14.. Observe next state $s (t + 1)$ , store transition $(s (t), d (t), r (t), s (t + 1))$ in replay buffer D

15.. Update current state: $s_{t + 1} \leftarrow s_{t}$

16.. End for

17.. End for

For each mini-batch sampled from the experience replay buffer, let an individual experience tuple be denoted as $(s_{j}, a_{j}, r_{j}, s_{j}^{'})$ , where j is the index within the mini-batch and $s_{j}^{'}$ represents the next state corresponding to state $s_{j}$ . The following update steps are performed [36]:

(1). Critic Network Update

The Critic network’s parameters, $θ^{Q}$ , are updated by minimizing the Mean Squared Error (MSE) loss. This loss is computed as the difference between the Q-value predicted by the network, $Q (s_{j}, a_{j} | θ^{Q})$ , and the target Q-value $y_{j}$ [37,38]:

(36) $L (θ^{Q}) = \frac{1}{M_{batch}} \sum_{j = 1}^{M_{batch}} {(y_{j} - Q (s_{j}, a_{j} | θ^{Q}))}^{2}$

Here, the calculation of the target Q-value $y_{j}$ utilizes the target actor network and the target critic network [37]:

(37) $y_{j} = r_{j} + γ Q^{'} (s_{j}^{'}, μ^{'} (s_{j}^{'} | θ^{μ^{'}}) | θ^{Q^{'}})$

The function of the discount factor $γ$ is to assign a lower weight to rewards that are further in the future, thereby prioritizing more immediate returns.

(2). Actor Network Update

The Actor network, $μ (s (t) | θ^{μ})$ , learns a deterministic policy that selects actions to maximize the expected cumulative return for a given state $s (t)$ . Its update relies on the Critic network’s evaluation of the actions produced by the current policy.

The Actor network is updated by applying the policy gradient theorem. For DDPG’s deterministic policy, the policy gradient is approximated by performing gradient ascent on the Q-values output by the Critic network. Specifically, the Actor’s parameters $θ^{μ}$ are adjusted such that for states $s_{j}$ sampled from the replay buffer, the actions $μ (s_{j} | θ^{μ})$ produced by the Actor yield higher Q-values. The loss function for the Actor is [35]:

(38) $L (θ^{μ}) = - \frac{1}{M_{batch}} \sum_{j = 1}^{M_{batch}} Q (s_{j}, μ (s_{j} | θ^{μ}) | θ^{Q})$

The detailed procedure for updating the actor and critic networks using a sampled mini-batch, including the soft update of target networks, is summarized in Algorithm 2.

Algorithm 2 DDPG network update procedure.

1.. If size of $D \geq$ minimal size then

2.. Sample a random mini-batch of $M_{batch}$ transitions $(s (t), a (t), r (t), s (t + 1))$ from D

3.. Compute target actions for next states: $a_{j}^{'} \leftarrow μ^{'} (s_{j}^{'} | θ^{μ^{'}})$

4.. Compute target Q-values: $y_{j} = r_{j} + γ Q^{'} (s_{j}^{'}, μ^{'} (s_{j}^{'} | θ^{μ^{'}}) | θ^{Q^{'}})$

5.. Update Critic by minimizing loss: $L (θ^{Q}) = \frac{1}{M_{batch}} \sum_{j = 1}^{M_{batch}} {(y_{j} - Q (s_{j}, a_{j} | θ^{Q}))}^{2}$

6.. Update Actor using policy gradient: $L (θ^{μ}) = - \frac{1}{M_{batch}} \sum_{j = 1}^{M_{batch}} Q (s_{j}, μ (s_{j} | θ^{μ}) | θ^{Q})$

7.. Soft update target networks:

8.. $θ^{Q^{'}} \leftarrow τ θ^{Q} + (1 - τ) θ^{Q^{'}}$

9.. $θ^{μ^{'}} \leftarrow τ θ^{μ} + (1 - τ) θ^{μ^{'}}$

10.. End If

4.4. Computational Complexity Analysis

The computational complexity of the LyDDPG algorithm primarily stems from two main components: interaction with the environment during execution and the training process of the neural networks.

4.4.1. Interaction with the Environment

At each time step, the agent needs to perform a forward pass through the Actor network to select an action. The complexity of this operation is proportional to the number of parameters in the Actor network, $P_{A}$ , denoted as $O (P_{A})$ [39]. Subsequently, the environment executes this action and transitions to a new state, with a complexity that depends on the intricacy of the environment simulation, denoted as $O (C_{e n v})$ . Therefore, the interaction complexity per time step is $O (P_{A} + C_{e n v})$ .

4.4.2. Network Updates

When a sufficient amount of data has been accumulated in the replay buffer, the algorithm samples a mini-batch of size $M_{b a t c h}$ to perform network updates [39]. The main update steps have the following complexities:

(1). Critic Network Update

This process involves one forward pass each through the target Actor and target Critic networks to compute the target values, followed by one forward and one backward pass through the current Critic network. Consequently, the complexity of the Critic update is approximately $O (M_{b a t c h} (P_{A} + 3 P_{C}))$ , where $P_{C}$ is the number of parameters in the Critic network [40].

(2). Actor Network Update

This procedure includes one forward pass through the current Actor network, one forward pass through the current Critic network, and one backward pass through the Actor network. Thus, the complexity of the Actor update is approximately $O (M_{b a t c h} (P_{A} + P_{C}))$ .

(3). Target Network Soft Updates

The complexity of this operation is proportional to the total number of parameters in both the Actor and Critic networks, i.e., $O (P_{A} + P_{C})$ .

Considering that the mini-batch update is the most computationally intensive part, a single complete training step has a complexity of $O (M_{b a t c h} (P_{A} + 3 P_{C}) + M_{b a t c h} (P_{A} + P_{C})) = O (M_{b a t c h} (2 P_{A} + 4 P_{C}))$ . Since $P_{A}$ and $P_{C}$ are often of the same order of magnitude, the overall training complexity is dominated by: $O (M_{b a t c h} (P_{A} + P_{C}))$ [41].

4.4.3. Implications for Practical Deployment

While the training complexity is substantial, it is crucial to distinguish between the training and deployment phases to assess the algorithm’s feasibility on real UAV hardware. The real-time requirement during flight pertains to the interaction (inference) complexity, which is only $O (P_{A})$ . A forward pass through a neural network is a highly efficient, parallelizable operation that can be executed in milliseconds on modern, lightweight onboard computers equipped with embedded GPUs (e.g., the NVIDIA Jetson series).

Therefore, a practical and highly feasible deployment strategy involves an offline training, online inference model. The LyDDPG agent can be thoroughly trained on a powerful ground-based computer. Subsequently, only the lightweight, optimized Actor network needs to be loaded onto the UAV for fast and efficient real-time decision-making. This approach effectively decouples the heavy computational load of training from the time-critical execution phase, making our LyDDPG algorithm a viable solution for practical deployment on resource-constrained UAVs.

4.5. Theoretical Guarantees and Stability Analysis

A critical aspect of our LyDDPG framework is to ensure system stability even when the per-timeslot subproblem is solved approximately by the DDPG agent. Classical Lyapunov stability proofs often assume that the drift-plus-penalty expression is minimized perfectly in each timeslot. However, a DDPG agent, relying on neural network function approximators, learns a near-optimal policy rather than guaranteeing the absolute optimum. Here, we provide a theoretical justification for the stability of our proposed approach.

Our analysis hinges on the premise that a well-trained DDPG agent can learn a policy that is at least as good as a simple, stationary policy that is known to be stabilizing. Let us assume there exists a baseline stationary and stabilizing policy $π^{'}$ , which for any queue state $Q (t)$ , makes a decision that ensures the drift-plus-penalty expression is bounded by a finite constant $C_{m a x}$ . That is:

(39) $E [Δ (Q (t)) + V \cdot E (t) | Q (t), π^{'}] \leq C_{m a x}$

where

E (t) = \sum_{i = 1}^{N} E_{i} (t)

is the total energy consumption at timeslot t.

The objective of our DDPG agent, following the learned policy $π$ , is to minimize this very same expression by maximizing its negative (the reward). After sufficient training, the performance of the learned policy $π$ is expected to be superior or at least equal to that of the simple stabilizing policy $π^{'}$ . Therefore, the following condition holds:

(40) $E [Δ (Q (t)) + V \cdot E (t) | Q (t), π] \leq E [Δ (Q (t)) + V \cdot E (t) | Q (t), π^{'}]$

Combining inequalities (39) and (40), we get:

(41) $E [Δ (Q (t)) + V \cdot E (t) | Q (t), π] \leq C_{m a x}$

This inequality shows that the Lyapunov drift, conditioned on the current queue state, is bounded from above. According to the principles of Lyapunov optimization theory, this is a sufficient condition to guarantee that all system queues are strongly stable, meaning their long-term time-average is finite. This demonstrates that even with an approximate solution from the DDPG agent, our LyDDPG algorithm retains the theoretical stability guarantees inherent to the Lyapunov optimization framework, provided the agent learns a policy that outperforms a trivial stabilizing one.

5. Simulation Results

To evaluate the performance of the proposed LyDDPG algorithm in dynamic UAV-assisted MEC environments, we developed a Python (Python 3.12.0, Python Software Foundation, Wilmington, DE, USA) based simulation platform running on an i9-12900H CPU with 16.0 GB RAM (Intel Corporation, Santa Clara, USA). The key simulation parameters are summarized in Table 5.

Our system model features a single UAV-mounted MEC server assisting nine ground UEs. These UEs are initially distributed in an open area, with their horizontal distances $d_{i}$ to the UAV linearly increasing from 120 m to 240 m at 15 m intervals. The task arrival process for each UE is modeled as an independent Poisson process with a rate of $λ = 2.5 tasks / s$ . The size of each task $A_{i} (t)$ is a random variable with a mean of ${\bar{A}}_{i} = 3 Mbits$ . These baseline parameters are set with reference to common values found in the related literature, such as [3], to establish a representative and moderately loaded scenario. The scalability and robustness of our algorithm’s performance under various conditions are then thoroughly investigated in the subsequent subsections.

The channel gain $g_{i} (t)$ is primarily determined by the path loss, calculated as $g_{i} (t) = A_{d} {(\frac{c}{4 π f_{c} d_{i}})}^{d_{e}}$ with $A_{d} = 3$ , $c = 3 \times 10^{8} m / s$ , $f_{c} = 915 MHZ$ , and $d_{e} = 3$ [3]. To simulate a dynamic environment, we further incorporate small-scale fading effects, making the instantaneous channel conditions a key challenge for the LyDDPG agent. The system operates with a 2MHz communication bandwidth, and the total noise power is calculated as $N_{0} = B v_{O}$ , where $v_{O} = - 174 dBm / HZ$ is the power spectral density.

In the following subsections, we present a comprehensive and multi-faceted evaluation of our proposed LyDDPG algorithm, structured to build a cohesive argument for its effectiveness. Our analysis begins with a sensitivity study of the key control parameter V to establish its optimal setting. Building on this foundation, we then demonstrate the superior performance and convergence of LyDDPG through a direct comparative analysis against several state-of-the-art baselines in a standard static configuration. To investigate the algorithm’s performance boundaries, our evaluation extends to a thorough scalability analysis under varying system loads and, further, to a robustness stress-test in a highly dynamic environment with user mobility. Finally, as a definitive validation of its efficacy, we quantify the algorithm’s optimality gap by benchmarking it against the theoretical optimum derived from Dynamic Programming.

5.1. Sensitivity Analysis of the Control Parameter

Figure 3 illustrates how the Lyapunov control parameter V governs a well-defined trade-off among the key performance metrics. A larger value of V places a greater penalty on energy consumption in the drift-plus-penalty function. Consequently, the DDPG agent learns a more energy-conscious policy. As shown in Figure 3b, this policy leads to a monotonic decrease in average energy consumption.

However, this comes at the cost of relaxing the pressure on maintaining short queues. This results in the agent tolerating larger queue backlogs to save energy, leading to a monotonic increase in the average data queue length and the average system delay, as shown in Figure 3a,c, respectively.

These results empirically validate the effectiveness of the parameter V as a control knob for managing the trade-off between energy efficiency and QoS. It demonstrates that our LyDDPG algorithm can be flexibly tuned to meet diverse operational requirements, from energy-sensitive applications to delay-critical scenarios. For the subsequent experiments, we select a balanced value of $V = 100$ to achieve a reasonable compromise between these competing objectives.

While other hyperparameters, such as the neural network architecture or the size of the replay buffer, are indeed crucial for the performance of any DRL agent, our sensitivity analysis here focuses on the parameter V. This is because V is the central control knob intrinsic to the Lyapunov optimization framework itself, directly governing the trade-off that is at the core of our contribution.

5.2. Comparative Performance Analysis

We perform a comprehensive analysis of the LyDDPG algorithm’s effectiveness by contrasting its results with those from a set of both conventional and cutting-edge techniques. The selected algorithms provide a robust benchmark, spanning from a standard actor–critic method to advanced DRL paradigms. The baselines are as follows:

DDPG (Deep Deterministic Policy Gradient) [42]: This serves as our primary baseline to demonstrate the performance gain achieved by integrating the Lyapunov framework. It is an off-policy, actor–critic algorithm for continuous action spaces. SAC (Soft Actor–Critic) [39]: A state-of-the-art, off-policy DRL algorithm renowned for its high sample efficiency and stability. It maximizes a trade-off between expected return and policy entropy, encouraging broader exploration. PPO (Proximal Policy Optimization) [26]: A popular on-policy algorithm that ensures stable training by using a clipped surrogate objective function, preventing overly large policy updates. A3C (Asynchronous Advantage Actor–Critic) [43]: An on-policy, parallel training method designed to improve learning efficiency and stability by using multiple asynchronous agents.

The fundamental difference between our proposed LyDDPG and these established DRL baselines lies in their core optimization paradigm. SAC, PPO, and A3C are fundamentally designed to empirically maximize a cumulative reward signal. While they may achieve a degree of system stability if the reward function is perfectly shaped, stability is not an explicit, guaranteed objective. In stark contrast, LyDDPG is a hybrid control–theoretic framework where the DRL agent’s task is explicitly to solve the per-timeslot drift-plus-penalty minimization problem derived from Lyapunov theory. For LyDDPG, ensuring queue stability is the primary objective, not an emergent property of reward maximization. This theoretical distinction is what we seek to validate in the following experiments.

By benchmarking against these diverse algorithms, we aim to validate the superiority of LyDDPG in terms of long-term energy consumption, queue stability, and holistic effectiveness within time-varying UAV-assisted MEC scenarios.

Figure 4, Figure 5, Figure 6 and Figure 7 presents a comprehensive performance comparison between our proposed LyDDPG algorithm and several baseline methods, including DDPG, A3C, PPO, and SAC. The results are evaluated across four key metrics: average reward, average data queue length, average energy consumption, and average system delay, all plotted against the number of training episodes.

Firstly, the average reward curves in Figure 4 illustrate the learning efficiency and convergence performance of the algorithms. It is evident that our proposed LyDDPG algorithm outperforms all baselines, achieving both a higher average reward and faster convergence. Notably, while SAC, the closest competitor, keeps pace in the early training phase, a clear performance gap emerges as it approaches final convergence. This superiority is attributed to using the Lyapunov drift-plus-penalty objective as a reward signal, which provides the agent with more effective guidance to accelerate convergence and secure a superior final policy.

Secondly, Figure 5 demonstrates the superior system stability of our LyDDPG algorithm. Our method successfully stabilizes the average data queue length at a minimal level, achieving a remarkable reduction of over 54% compared to the baseline DDPG algorithm.

This substantial advantage in stability is a direct result of the integrated Lyapunov optimization framework, which ensures queue stability by actively penalizing backlogs. Consequently, LyDDPG effectively prevents data congestion and provides a theoretical guarantee of reliability that standard DRL algorithms lack.

Thirdly, Figure 6 evaluates our primary objective: average energy consumption. Our proposed LyDDPG algorithm demonstrates clear superiority, achieving the lowest energy consumption of all benchmarks. This advantage stems from the Lyapunov-guided reward structure, which steers the agent directly toward long-term energy minimization. While the state-of-the-art SAC algorithm also performs well, it is ultimately less efficient; our LyDDPG algorithm reduces average energy consumption by nearly 16% in comparison. This performance gap is attributed to their fundamental policy differences. SAC’s entropy maximization objective fosters a stochastic policy, which can prevent it from converging to the optimal deterministic strategy ideal for this resource allocation task. Conversely, LyDDPG’s deterministic policy enables it to precisely exploit the most energy-efficient operating points, thus validating its superior performance.

Finally, Figure 7 shows the average system delay, which is a critical QoS metric. As a result of maintaining shorter queue backlogs, the proposed LyDDPG algorithm also achieves the lowest average system delay. Since our delay model incorporates queuing delay, the superior queue management of LyDDPG directly translates into a better delay performance. This confirms that our algorithm does not sacrifice QoS for energy savings; instead, it achieves energy efficiency while simultaneously providing the best delay performance among all tested methods.

In summary, the comparative results validate that the synergistic fusion of Lyapunov optimization and DDPG enables our proposed algorithm to excel across all key metrics, achieving superior energy efficiency, unparalleled system stability, and the lowest processing delay.

5.3. Scalability Analysis

5.3.1. Impact of Communication Bandwidth

To assess adaptability to communication resources, we evaluated performance under varying bandwidths from 2 to 10 MHz, with results for energy consumption and delay shown in Figure 8 and Figure 9. Unsurprisingly, all algorithms improved with increased bandwidth, as faster data transmission reduces both energy and delay.

However, our proposed LyDDPG algorithm consistently demonstrates superior performance across the entire spectrum. This superiority stems from its ability not just to react to better conditions, but to proactively re-optimize its entire offloading strategy to best exploit them. It finds a more sophisticated balance between local processing and offloading, ensuring the benefits of higher bandwidth are maximally translated into performance gains. In contrast, while other baselines also improve, their inability to match LyDDPG’s efficiency indicates a less adaptive policy that fails to fully capitalize on ample communication resources. This confirms the robustness and superior optimization capability of our approach.

5.3.2. Scalability with Number of UEs

To evaluate scalability, we examine the impact of increasing the number of UEs from 9 to 21 on both average energy consumption and delay, with results presented in Figure 10 and Figure 11. While the increased network load predictably degrades performance for all methods, our LyDDPG algorithm demonstrates far superior scalability. Crucially, the performance gap between LyDDPG and its competitors widens significantly as the system scales. For instance, its energy consumption advantage over the next-best algorithm, SAC, expands from approximately 16% at 9 UEs to 22% at 21 UEs. This trend strongly validates that our Lyapunov-based framework provides more effective guidance in high-dimensional decision spaces. As system complexity grows, its inherent stability guarantees prevent severe congestion, whereas standard DRL algorithms like SAC struggle to maintain global optimality, leading to more pronounced performance degradation.

5.3.3. Impact of Task Arrival Rate

Figure 12 and Figure 13 illustrate the system’s performance under varying task arrival rates, revealing the trade-off between energy consumption and queue stability for different algorithms.

As shown in Figure 13, our proposed LyDDPG algorithm consistently maintains the lowest average queue length, demonstrating its superior stability. This advantage becomes particularly pronounced under high-load conditions where the Task Arrival Rate is 2.9 or higher. Figure 12 reveals the cost of this stability: while all algorithms’ energy consumption rises with the workload, LyDDPG achieves its best-in-class queue control with the lowest corresponding energy cost. In contrast, algorithms like PPO initially exhibit high energy consumption to maintain short queues at low task arrival rates, but their stability deteriorates rapidly as the system becomes more congested. This comparison highlights that our LyDDPG framework not only ensures system stability but does so with the highest energy efficiency, striking the optimal balance between performance and operational cost.

5.4. Robustness in Dynamic Environments with User Mobility

To extend our analysis beyond static conditions, this section evaluates the algorithms’ robustness in a more challenging and realistic dynamic environment. This mobile user scenario is designed to rigorously test the adaptability of the benchmarked algorithms in a highly non-stationary setting.

In this experiment, we model user mobility using the widely adopted Random Waypoint Model. Each user moves within the service area with an average speed of 1.5 m/s. This value is selected as it represents a typical pedestrian walking speed (approximately 5.4 km/h) and is a commonly adopted parameter in user mobility studies for wireless networks [44]. Selecting new random destinations upon arrival, this mobility induces continuous and significant fluctuations in the user-UAV distance, leading to a highly non-stationary channel environment.

The comparative results for this dynamic scenario are presented in Figure 14 and Figure 15. As depicted, while the performance of all algorithms degrades compared to the static case, the superiority of LyDDPG becomes even more pronounced. The learning curves themselves reveal critical insights into the algorithms’ stability. The curve for LyDDPG is notably smoother and exhibits minimal volatility, indicating a stable and consistent learning process. In stark contrast, the baseline algorithms display highly erratic performance with significant fluctuations, reflecting their struggle to find a stable policy in a constantly shifting environment.

Figure 14 shows that LyDDPG not only maintains the lowest average energy consumption but also widens the performance gap to the next-best algorithm (SAC). Most critically, as shown in Figure 15, the stability advantage of LyDDPG is starkly evident. The learning curve for DDPG, for instance, shows dramatic oscillations, visually representing its difficulty in controlling the queues. The stabilizing force provided by the Lyapunov objective allows our algorithm to avoid this instability. These findings underscore that our LyDDPG framework is not only effective but also exceptionally robust, making it a more viable solution for real-world MEC deployments.

5.5. Optimality Gap Analysis via Dynamic Programming Benchmark

To rigorously evaluate the optimality of our proposed LyDDPG algorithm, we conducted a benchmark comparison against a theoretical optimum derived from Dynamic Programming (DP). As the original problem formulation involves continuous state and action spaces, it is computationally intractable for DP due to the “curse of dimensionality.” Therefore, we follow a standard validation methodology by formulating a small-scale, discrete version of the problem where DP can feasibly compute the optimal policy.

To this end, we constructed a simplified scenario with the following characteristics: a reduced number of UEs ( $N = 2$ ), a discrete set of channel quality levels (Good, Moderate, Poor), and discretized action spaces for both offloading ratios ( $x_{i} (t) \in {0, 0.5, 1.0}$ ) and edge resource allocation ( $ψ_{i} (t)$ ). Under this constrained setting, we applied the value iteration algorithm to solve the Bellman equation and find the optimal policy that minimizes the long-term time-averaged energy consumption while ensuring queue stability. We then evaluated our proposed LyDDPG algorithm within this identical small-scale environment.

The performance comparison across three key metrics is presented in Figure 16. The results demonstrate that our LyDDPG algorithm achieves a performance remarkably close to the theoretical optimum. Specifically, the average energy consumption of LyDDPG is within 6.4% of the optimal value found by DP, confirming its high energy efficiency. Furthermore, in terms of queue stability and Quality of Service, the average queue length and average delay show optimality gaps of only 10.0% and 8.7%, respectively. This small optimality gap across all metrics provides strong evidence that our algorithm, despite being a model-free and scalable approach, learns a highly efficient and near-optimal policy. This result validates the effectiveness of our Lyapunov-guided DRL framework and justifies its application in more complex and practical scenarios.

6. Conclusions

This paper addressed the problem of energy consumption minimization in dynamic UAV-assisted MEC systems, proposing a novel computation offloading and resource allocation algorithm named LyDDPG. The core of this work lies in the synergistic fusion of Lyapunov optimization with DDPG. Lyapunov optimization provides a rigorous framework to decompose the long-term stochastic optimization problem into tractable, per-timeslot “drift-plus-penalty” subproblems, ensuring theoretical stability guarantees. Subsequently, DDPG is employed to solve these subproblems in a model-free manner, learning a policy that maps real-time system states to optimal continuous offloading and resource allocation decisions to minimize the Lyapunov-derived objective. Through extensive simulation experiments, we validated the superiority of the proposed LyDDPG algorithm against several baseline and state-of-the-art methods. The results demonstrate that LyDDPG effectively learns a robust policy that significantly reduces the total system energy consumption by at least 16%, while simultaneously maintaining low task processing delays and ensuring system queue stability.

Despite the promising results, this work is limited to a single-UAV scenario. Future work will focus on extending the LyDDPG framework to multi-UAV systems using Multi-Agent Deep Reinforcement Learning to address collaborative computation offloading and resource allocation. Key research challenges in this domain would include: (1) Inter-UAV Interference Management: designing efficient communication protocols and resource allocation strategies to mitigate signal interference among multiple UAVs. (2) Task Allocation and Load Balancing: investigating dynamic task assignment policies to balance the computational load from ground users across the UAVs in the swarm, thereby preventing bottlenecks. (3) Trajectory and Handover Co-design: jointly optimizing the flight trajectories of multiple UAVs and the user handover mechanisms between them to ensure seamless and continuous service. Successfully addressing these challenges would significantly enhance the system’s scalability, coverage, and overall performance.

Author Contributions

Conceptualization, J.L. and X.Z.; methodology, X.Z., H.Z., J.L. and X.W.; software, X.Z., H.L., X.L., H.Z. and H.L.; validation, X.Z. and J.L.; formal analysis, X.Z.; data curation, H.Z.; writing—original draft preparation, J.L. and X.Z.; writing—review and editing, J.L., X.W. and X.L.; supervision, H.L. and J.L.; project administration, J.L.; funding acquisition, J.L., X.W. and H.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 System model.

Figure 2 The framework of the proposed LyDDPG algorithm.

Figure 3 Performance trade-off analysis with varying control parameter V.

Figure 4 Comparison of average reward convergence for different algorithms.

Figure 5 Comparison of queue stability performance for different algorithms.

Figure 6 Comparison of average energy consumption for different algorithms.

Figure 7 Comparison of average system delay for different algorithms.

Figure 8 Impact of communication bandwidth on average energy consumption for different algorithms.

Figure 9 Impact of communication bandwidth on average delay for different algorithms.

Figure 10 Impact of the number of UEs on average energy consumption.

Figure 11 Impact of the number of UEs on average delay.

Figure 12 Impact of task arrival rate on average energy consumption.

Figure 13 Impact of task arrival rate on average data queue length.

Figure 14 Average energy consumption under user mobility.

Figure 15 Average queue length under user mobility.

Figure 16 Optimality gap analysis against the DP-based benchmark in a small-scale scenario.

Table 1

Existing solutions for offloading and resource allocation in MEC.

Ref.	Objective	Categories	Method	Advantages	Drawback
Note: Opt. (Optimization); QoE (Quality of Experience); RIS (Reconfigurable Intelligent Surface).
[1]	UAV-RIS MEC Latency Opt.	Traditional Opt.	Latency-optimization framework	UAV/RIS synergy for low latency	High/unspecified coordination complexity
[2]	Energy-Harvesting MEC Offloading & Resource Opt.	Traditional Opt.	Joint offloading & resource allocation scheme	Improved energy efficiency	Ignores mobility/dynamic channels
[3]	Stable Online Offloading in Dynamic MEC	Hybrid Method	Lyapunov-guided DRL algorithm	Ensures long-term stability & adaptability	Complex algorithm/theory integration
[5]	Joint Trajectory & Resource Opt. in UAV-MEC	Traditional Opt.	Joint trajectory & resource optimization algorithm	Adapts to dynamics, good performance	High computational complexity
[7,12,14]	Multi-UAV MEC Delay & Energy Multi-obj. Opt.	Traditional Opt.	Multi-objective optimization algorithm	Achieves delay-energy trade-off	Complex to solve, hard to get global optimum
[8]	Covert Offloading in UAV-MEC	Traditional Opt.	Covertness-constrained offloading strategy	Improved communication security	May sacrifice communication performance
[9]	Energy Opt. for Multi-UAV Data Collection	Traditional Opt.	Energy-optimization framework	Reduced data collection energy	Ignores computation offloading
[11]	Low-altitude MEC Delay & Energy Opt.	RL-based	Evolutionary Multi-objective DRL algorithm	Learns Pareto-optimal policies	High training overhead
[13]	UAV-MEC QoE Optimization	Traditional Opt.	Two-tier QoE optimization framework	Hierarchical opt. improves QoE	Complex model, hard to quantify QoE
[15]	Dynamic trajectory design & Resource allocation for Multi-UAV MEC	Hybrid Method	MADDPG-LC (integrates MADDPG, LQR, CVXPY)	Real-time joint optimization	High computational complexity
[16]	Energy-optimal quadrotor trajectory control	Traditional Opt.	Hierarchical control based on optimal control theory	Guarantees near-optimal energy consumption	Relies on precise UAV mathematical model
[17]	Parallel Offloading & Trajectory Opt. in UAV-MEC	RL-based	Hierarchical RL algorithm	Reduced decision complexity	May fall into sub-optima
[19,25]	Distributed Offloading in Multi-Agent MEC	RL-based	Multi-Agent DRL algorithm	Distributed decision, good scalability	High coordination /communication overhead
[23]	Ensemble Q-Learning for Task Offloading	RL-based	Ensemble Q-Learning algorithm	Improved performance & robustness	Higher complexity & overhead
[24]	Deadline-aware Offloading in Vehicular Edge Computing	RL-based	Ensemble Q-Learning algorithm	Meets strict delay constraints	Requires real-time network state
[26]	Minimize latency & energy consumption in UAV-MEC	RL-based	Multi-Agent DRL algorithm	Addresses mobility-energy trade-offs	Performance depends on MARL coordination
[27]	Minimize average delay & energy	RL-based	SAC algorithm	Combining Lyapunov with the SAC for decision-making	Explicitly handles long-term energy
[29]	Dynamic Offloading for Maritime UAV-MEC	Traditional Opt.	Lyapunov-based dynamic offloading algorithm	Ensures long-term performance in dynamic env.	Not instantaneously optimal

Table 2

Main variables and their physical meanings.

Symbols	Meanings	Symbols	Meanings
$A_{i} (t)$	Volume of task data randomly arriving at UE	$D_{i}^{L} (t), D_{i}^{E} (t)$	Service capacity for local and allocated ES computation in a time slot
${\bar{A}}_{i}$	Average size of a task from UE	$D_{i}^{O} (t)$	Amount of data successfully offloaded from UE i in a time slot
$V a r (A_{i})$	Variance of the task size from UE	$Q_{i}^{L} (t), Q_{i}^{O} (t), Q_{i}^{E} (t)$	Queue lengths for local, offloading, and ES queues
$λ_{i}$	Average task arrival rate for each UE	$T_{i}^{L} (t), T_{i}^{E} (t)$	Total average sojourn time for local and ES computation
$δ$	Duration of a single timeslot	$τ_{i}^{O} (t)$	Task transmission latency
$x_{i} (t)$	The proportion of the task that is offloaded to the ES	$τ_{i} (t)$	Total average sojourn time for a task from UE
$g_{i} (t)$	Channel gain between UE i and UAV	$E_{i}^{L} (t), E_{i}^{O} (t), E_{i}^{C} (t)$	Energy consumption for local, transmission, and ES computation
B	Channel bandwidth	$E_{i} (t)$	Total energy consumption for UE i
$N_{0}$	Noise power	$μ_{i}^{L} (t), μ_{i}^{E} (t)$	Average data processing throughput for local and ES paths
$P_{i} (t)$	Transmission power between UE i and UAV	$S_{i}^{L} (t), S_{i}^{E} (t)$	Service time for local and ES computation
$R_{i} (t)$	Transmission rate between UE i and UAV	$W_{q, i}^{L} (t), W_{q, i}^{E} (t)$	Average queuing delay for local and ES
$b_{i} (t)$	Number of CPU cycles required to compute 1 bit of data	$κ_{L}, κ_{E}$	Effective energy coefficients for local and ES computation
$ψ_{i} (t)$	Proportion of computational resource allocated to UE i	$ρ_{i}^{L} (t), ρ_{i}^{E} (t)$	Server utilization for local and ES computation queues
$f_{i}^{L} (t)$	CPU computation frequency of UE i	$F^{E} (t)$	Total computational capability of the ES

Table 3

Architecture of the actor network.

Layer	Description	Number of Neurons	Activation Function
Input Layer	Receives the current environment state $s (t)$ as input.	36	N/A
Hidden Layer 1	Fully connected layer, performs feature extraction on the input state.	256	ReLU
Hidden Layer 2	Fully connected layer, further processes features from the previous layer.	128	ReLU
Hidden Layer 3	Fully connected layer, prepares for the output layer.	64	Tanh
Output Layer	Outputs the deterministic action $a (t)$ by applying a linear transformation followed by a Sigmoid function.	18	Sigmoid

Table 4

Architecture of the critic network.

Layer	Description	Number of Neurons	Activation Function
Input Layer	Receives the concatenated current state $s (t)$ and action $a (t)$ as input.	54	N/A
Hidden Layer 1	Fully connected layer, processes the combined state–action representation.	256	ReLU
Hidden Layer 2	Fully connected layer, further processes features from the previous layer.	128	ReLU
Hidden Layer 3	Fully connected layer, prepares for Q-value estimation.	64	ReLU
Output Layer	Fully connected layer, outputs a single scalar value representing the estimated Q-value, $Q (s (t), a (t) \| θ^{Q})$ , for the input state–action pair.	1	None

Table 5

Key simulation parameters and hyperparameters.

Parameter Name	Value	Parameter Name	Value
N	9	$M_{batch}$	128
$f_{i}^{L} (t)$	0.9–1.2 GHz	Layer type of Actor and Critic	Fully Connected
$f_{i}^{E} (t)$	9.5–10.5 GHz	Learning rate of Actor	1 × 10⁻⁴
${\bar{A}}_{i}$	3 Mbits	Learning rate of Critic	3 × 10⁻³
Number of episodes	5000	Size of replay buffer D	10,000
$γ$	0.98	Bandwidth	2 MHz
Path Loss Exponent $d_{e}$	3	Task Arrival Rate $λ$	2.5 tasks/s

References

1. Alshahrani, A. Toward 6G: Latency-Optimized MEC Systems with UAV and RIS Integration. Mathematics; 2025; 13, 871. [DOI: https://dx.doi.org/10.3390/math13050871]

2. Triyanto, D.; Mustika, I.W. Widyawan. Computation Offloading and Resource Allocation for Energy-Harvested MEC in an Ultra-Dense Network. Sensors; 2025; 25, 1722. [DOI: https://dx.doi.org/10.3390/s25061722]

3. Bi, S.; Huang, L.; Wang, H.; Zhang, Y.-J.A. Lyapunov-guided deep reinforcement learning for stable online computation offloading in mobile-edge computing networks. IEEE Trans. Wirel. Commun.; 2021; 20, pp. 7519-7537. [DOI: https://dx.doi.org/10.1109/TWC.2021.3085319]

4. Filali, A.; Abouaomar, A.; Cherkaoui, S.; Kobbane, A.; Guizani, M. Multi-access edge computing: A survey. IEEE Access; 2020; 8, pp. 197017-197046. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3034136]

5. Wang, Z.; Zhao, W.; Hu, P.; Zhang, X.; Liu, L.; Fang, C.; Sun, Y. UAV-Assisted Mobile Edge Computing: Dynamic Trajectory Design and Resource Allocation. Sensors; 2024; 24, 3948. [DOI: https://dx.doi.org/10.3390/s24123948] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38931732]

6. Baidya, T.; Nabi, A.; Moh, S. Trajectory-aware offloading decision in UAV-aided edge computing: A comprehensive survey. Sensors; 2024; 24, 1837. [DOI: https://dx.doi.org/10.3390/s24061837]

7. Sun, G.; Wang, Y.; Sun, Z.; Wu, Q.; Kang, J.; Niyato, D.; Leung, V.C. Multi-objective optimization for multi-uav-assisted mobile edge computing. IEEE Trans. Mob. Comput.; 2024; 23, pp. 14803-14820. [DOI: https://dx.doi.org/10.1109/TMC.2024.3446819]

8. Hu, Z.; Zhou, D.; Shen, C.; Wang, T.; Liu, L. Task Offloading Strategy for UAV-Assisted Mobile Edge Computing with Covert Transmission. Electronics; 2025; 14, 446. [DOI: https://dx.doi.org/10.3390/electronics14030446]

9. Xu, B.; Zhang, L.; Xu, Z.; Liu, Y.; Chai, J.; Qin, S.; Sun, Y. Energy optimization in multi-UAV-assisted edge data collection system. Comput. Mater. Contin.; 2021; 69, pp. 1671-1686. [DOI: https://dx.doi.org/10.32604/cmc.2021.018395]

10. Zhang, Y.; Zhao, R.; Mishra, D.; Ng, D.W.K. A Comprehensive Review of Energy-Efficient Techniques for UAV-Assisted Industrial Wireless Networks. Energies; 2024; 17, 4737. [DOI: https://dx.doi.org/10.3390/en17184737]

11. Sun, G.; Ma, W.; Li, J.; Sun, Z.; Wang, J.; Niyato, D.; Mao, S. Task Delay and Energy Consumption Minimization for Low-altitude MEC via Evolutionary Multi-objective Deep Reinforcement Learning. arXiv; 2025; arXiv: 2501.06410

12. Xu, Y.; Zhang, T.; Liu, Y.; Yang, D.; Xiao, L.; Tao, M. UAV-Assisted MEC Networks With Aerial and Ground Cooperation. IEEE Trans. Wirel. Commun.; 2021; 20, pp. 7712-7727. [DOI: https://dx.doi.org/10.1109/twc.2021.3086521]

13. He, H.; Yang, X.; Huang, F.; Shen, H. Two-Tier Efficient QoE Optimization for Partitioning and Resource Allocation in UAV-Assisted MEC. Sensors; 2024; 24, 4608. [DOI: https://dx.doi.org/10.3390/s24144608] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39066008]

14. Li, L.; Xu, G.; Liu, Z.; Xu, X.; Meng, X.; Meng, X. Multiobjective Optimization of Energy Efficiency and Fairness in AAV-Assisted Wireless Powered MEC Systems: A DRL-Based Approach. IEEE Internet Things J.; 2025; 12, pp. 28758-28775.

15. Wang, H.; Liu, L.; Sun, E.; Zhang, H.; Li, Z.; Fang, C.; Li, M. Dynamic Trajectory Design for Multi-UAV-Assisted Mobile Edge Computing. IEEE Trans. Veh. Technol.; 2025; 74, pp. 4684-4697.

16. Bianchi, D.; Borri, A.; Cappuzzo, F.; Di Gennaro, S. Quadrotor Trajectory Control Based on Energy-Optimal Reference Generator. Drones; 2024; 8, 29. [DOI: https://dx.doi.org/10.3390/drones8010029]

17. Wang, T.; Na, X.; Nie, Y.; Liu, J.; Wang, W.; Meng, Z. Parallel Task Offloading and Trajectory Optimization for UAV-Assisted Mobile Edge Computing via Hierarchical Reinforcement Learning. Drones; 2025; 9, 358. [DOI: https://dx.doi.org/10.3390/drones9050358]

18. Dai, Y.; Lyu, L.; Cheng, N.; Sheng, M.; Liu, J.; Wang, X.; Cui, S.; Cai, L.; Shen, X. A survey of graph-based resource management in wireless networks-part ii: Learning approaches. IEEE Trans. Cogn. Commun. Netw.; 2025; 9, 358.

19. He, H.; Yang, X.; Mi, X.; Shen, H.; Liao, X. Multi-Agent Deep Reinforcement Learning Based Dynamic Task Offloading in a Device-to-Device Mobile-Edge Computing Network to Minimize Average Task Delay with Deadline Constraints. Sensors; 2024; 24, 5141.

20. Uddin, A.; Sakr, A.H.; Zhang, N. Task Offloading in Vehicular Edge Computing using Deep Reinforcement Learning: A Survey. arXiv; 2025; arXiv: 2502.06963

21. Dayong, W.; Abu Bakar, K.B.; Isyaku, B.; Eisa, T.A.E.; Abdelmaboud, A. A comprehensive review on internet of things task offloading in multi-access edge computing. Heliyon; 2024; 10, e29916. [DOI: https://dx.doi.org/10.1016/j.heliyon.2024.e29916]

22. Yang, N.; Chen, S.; Zhang, H.; Berry, R. Beyond the edge: An advanced exploration of reinforcement learning for mobile edge computing, its applications, and future research trajectories. IEEE Commun. Surv. Tutor.; 2025; 27, pp. 546-594.

23. Malik, R. Ensemble Q-Learning Algorithm: An Effective and Novel Approach for Task Offloading in Edge Computing. Ph.D. Thesis; National College of Ireland: Dublin, Ireland, 2023.

24. Farimani, M.K.; Karimian-Aliabadi, S.; Entezari-Maleki, R.; Egger, B.; Sousa, L. Deadline-aware task offloading in vehicular networks using deep reinforcement learning. Expert Syst. Appl.; 2024; 249, 123622. [DOI: https://dx.doi.org/10.1016/j.eswa.2024.123622]

25. Xu, S.; Liu, Q.; Gong, C.; Wen, X. Energy-Efficient Multi-Agent Deep Reinforcement Learning Task Offloading and Resource Allocation for UAV Edge Computing. Sensors; 2025; 25, 3403.

26. Liu, Z.; Zhang, Q.; Su, Y. PPO-Based Joint Optimization for UAV-Assisted Edge Computing Networks. Appl. Sci.; 2023; 13, 12828. [DOI: https://dx.doi.org/10.3390/app132312828]

27. Liang, Y.; Tang, H.; Wu, H.; Wang, Y.; Jiao, P. Lyapunov-Guided Offloading Optimization Based on Soft Actor-Critic for ISAC-Aided Internet of Vehicles. IEEE Trans. Mob. Comput.; 2024; 23, pp. 14708-14721. [DOI: https://dx.doi.org/10.1109/TMC.2024.3445350]

28. Xu, C.; Zhang, P.; Yu, H. Lyapunov-Guided Resource Allocation and Task Scheduling for Edge Computing Cognitive Radio Networks via Deep Reinforcement Learning. IEEE Sens. J.; 2025; 25, pp. 12253-12264. [DOI: https://dx.doi.org/10.1109/JSEN.2025.3542972]

29. Bai, Y.; Zhang, Y. Dynamic Offloading Based on Lyapunov Optimization for UAV-Assisted Maritime IoT-MEC Networks. IEEE Trans. Veh. Technol.; 2025; pp. 1-13. [DOI: https://dx.doi.org/10.1109/TVT.2025.3576090]

30. Picano, B.; Mingozzi, E. Age-Oriented Resource Allocation for IoT Computational Intensive Tasks in Edge Computing Systems. IEEE Internet Things J.; 2025; 12, pp. 14498-14510. [DOI: https://dx.doi.org/10.1109/JIOT.2025.3525997]

31. Kleinrock, L. Queueing Systems, Volume I: Theory; Wiley: New York, NY, USA, 1975; pp. 185-188.

32. Huang, J.; Lai, X.; Yang, F.; Zhang, N.; Niyato, D.; Jiang, W. Ellipsoid-based learning for robust resource allocation with differentiated QoS in massive internet of vehicles networks. IEEE Trans. Veh. Technol.; 2025; 74, pp. 11425-11435.

33. Kumar, A.S.; Zhao, L.; Fernando, X. Task offloading and resource allocation in vehicular networks: A Lyapunov-based deep reinforcement learning approach. IEEE Trans. Veh. Technol.; 2023; 72, pp. 13360-13373. [DOI: https://dx.doi.org/10.1109/TVT.2023.3271613]

34. Neely, M.J. Stochastic Network Optimization with Application to Communication and Queueing Systems; Morgan & Claypool Publishers: San Rafael, CA, USA, 2010; 28.

35. Li, J.; Jiang, Q.; Leung, V.C.M.; Ma, Z.; Abrokwa, K.K. Deep Reinforcement Learning Based Joint Optimization of Task Migration and Resource Allocation for Mobile Edge Computing. IEEE Internet Things J.; 2025; 12, pp. 24431-24440. [DOI: https://dx.doi.org/10.1109/JIOT.2025.3555503]

36. Saad, M.M.; Jamshed, M.A.; Adedamola, A.I.; Nauman, A.; Kim, D. Twin Delayed DDPG (TD3)-Based Edge Server Selection for 5G-Enabled Industrial and C-ITS Applications. IEEE Open J. Commun. Soc.; 2025; 6, pp. 3332-3343. [DOI: https://dx.doi.org/10.1109/OJCOMS.2025.3545566]

37. Li, S.; Li, W.; Zheng, W.; Xia, Y.; Guo, K.; Peng, Q.; Li, X.; Ren, J. Multi-user joint task offloading and resource allocation based on mobile edge computing in mining scenarios. Sci. Rep.; 2025; 15, 16170. [DOI: https://dx.doi.org/10.1038/s41598-025-00730-y]

38. Li, A.; Zheng, Y.; Nong, W.; Wei, M.; Wang, G.; Huang, S. Task offloading decision and resource allocation strategy based on improved ddpg in mobile edge computing. Comput. Inform.; 2025; 44, pp. 245-271. [DOI: https://dx.doi.org/10.31577/cai_2025_2_245]

39. Goudarzi, S.; Soleymani, S.A.; Anisi, M.H.; Jindal, A.; Xiao, P. Optimizing UAV-Assisted Vehicular Edge Computing With Age of Information: An SAC-Based Solution. IEEE Internet Things J.; 2025; 12, pp. 4555-4569. [DOI: https://dx.doi.org/10.1109/JIOT.2025.3529836]

40. Min, J.; Jian, W.; Liang, Z.; Xinyu, W.; Qing, G. DDPG-based intelligent computation offloading and resource allocation for LEO satellite edge computing network. China Commun.; 2025; 22, pp. 1-15. [DOI: https://dx.doi.org/10.23919/JCC.fa.2023-0337.202503]

41. Zhang, J.; Zhang, G.; Wang, X.; Zhao, X.; Yuan, P.; Jin, H. UAV-Assisted Task Offloading in Edge Computing. IEEE Internet Things J.; 2025; 12, pp. 5559-5574. [DOI: https://dx.doi.org/10.1109/JIOT.2024.3488210]

42. Li, Y.; Qi, F.; Wang, Z.; Yu, X.; Shao, S. Distributed edge computing offloading algorithm based on deep reinforcement learning. IEEE Access; 2020; 8, pp. 85204-85215. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2991773]

43. Lin, P.; Ji, Y.; Zhang, Z.; Zhai, R.; Zhu, H.; Liu, Y. Delay-Aware and Energy-Efficient Task Offloading for IoV: A Thread-Cooperative A3C Approach. IEEE Commun. Lett.; 2025; 29, pp. 1599-1603. [DOI: https://dx.doi.org/10.1109/LCOMM.2025.3569328]

44. Camp, T.; Boleng, J.; Davies, V. A survey of mobility models for ad hoc network research. Wirel. Commun. Mob. Comput.; 2002; 2, pp. 483-502. [DOI: https://dx.doi.org/10.1002/wcm.72]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Lyapunov-Based Deep Deterministic Policy Gradient for Energy-Efficient Task Offloading in UAV-Assisted MEC

Content area

Abstract

Full text

1. Introduction

2. Modeling and Formulation

2.1. UAV-Assisted MEC Network Model

2.2. Task and Communication Model

2.3. Delay Model

2.4. Energy Consumption Model

2.5. Queue Dynamics Model

2.6. Problem Definition

3. The Lyapunov Optimization Framework

4. DDPG Based on Lyapunov Function

4.1. Deep Reinforcement Learning-Based Solution Method

4.2. DDPG Algorithm Architecture

4.3. Training Protocol

4.4. Computational Complexity Analysis

4.4.1. Interaction with the Environment

4.4.2. Network Updates

4.4.3. Implications for Practical Deployment

4.5. Theoretical Guarantees and Stability Analysis

5. Simulation Results

5.1. Sensitivity Analysis of the Control Parameter

5.2. Comparative Performance Analysis

5.3. Scalability Analysis

5.3.1. Impact of Communication Bandwidth

5.3.2. Scalability with Number of UEs

5.3.3. Impact of Task Arrival Rate

5.4. Robustness in Dynamic Environments with User Mobility

5.5. Optimality Gap Analysis via Dynamic Programming Benchmark

6. Conclusions