ABSTRACT
Nowadays, deep neural network (DNN) partition is an effective strategy to accelerate deep learning (DL) tasks. A pioneering technology, computing and network convergence (CNC), integrates dispersed computing resources and bandwidth via the network control plane so that they can be utilized efficiently. This paper presents a novel network-cloud (NC) architecture designed for DL task inference in the CNC scenario, where network devices participate directly in computation and thereby avoid extra transmission costs. Considering multi-hop computing-capable network nodes and one cloud node in a chain path, and leveraging deep reinforcement learning (DRL), we develop a joint-optimization algorithm for DNN partition, subtask offloading and computing resource allocation based on deep Q network (DQN), referred to as POADQ, which invokes a low-complexity subtask offloading and computing resource allocation (SORA) algorithm to minimize delay. DQN searches for the optimal DNN partition point, and SORA identifies the next optimal offloading node for the next subtask through our proposed NONPRA (next optimal node prediction with resource allocation) method, which selects the node with the smallest predicted increase in cost. We conduct experiments and compare POADQ with other schemes. The results show that our proposed algorithm outperforms the others in reducing the average delay of subtasks.
Abbreviations
- DL: Deep learning
- DNN: Deep neural network
- IIoT: Industrial Internet of Things
- CNC: Computing and network convergence
- NC: Network-cloud
- MDP: Markov decision process
- DQN: Deep Q network
- DRL: Deep reinforcement learning
Introduction
Preliminaries and Background
In recent years, with the rapid development of deep learning (DL) technology, the associated models, deep neural networks (DNNs), have been applied broadly in areas such as computer vision, natural language processing and big data. DNN models have become increasingly complex, leading to higher computational resource requirements for task execution. For instance, in industrial Internet of Things (IIoT) scenarios, applications such as product classification or product defect detection [1] require input images captured by end devices (e.g., cameras) that have limited computing resources. Because the terminals' computing power is insufficient to support such processing, these computation-intensive tasks need to be offloaded to servers at the edge or in the cloud.
Generally, the cloud possesses more abundant computing resources than the edge, whose resources are comparatively limited [2]. However, offloading DL tasks directly to the cloud results in long-distance data transmission delay [3], owing to the large volume of original data (e.g., high-resolution images). To address these problems, some researchers [4, 5] split DNN models into two parts, between which a dependency relationship exists because of the layered structure of DNN models: the task of the latter part takes as input the intermediate data output by the task of the former part. Usually, the former part is deployed on edge devices in proximity to terminals and the latter part in the remote cloud. The purpose is to reduce data transmission delay, because the intermediate data may have a much lower volume than the original data [6], while the abundant computing resources of the cloud are still efficiently utilized.
A novel concept, computing and network convergence (CNC), also known as computing force network or computing power network, refers to the management of distributed resources (e.g., computing, bandwidth) through the network control plane [7, 8]; network devices themselves can even execute computing tasks [8]. In this new architecture, the heterogeneous amounts of resources that different computing nodes allocate to computing tasks may vary over time, because optimal allocation can be realized anytime and anywhere [9]. CNC is becoming a growing research direction [10]. However, in this scenario, when a task seeks to utilize the abundant resources of a cloud node, it inevitably passes through multiple computing-capable network nodes before reaching its destination. Consequently, model partition and task offloading are affected by these multi-hop nodes, which vary in their proximity to the terminal.
Limitations and Problems
Figure 1 illustrates the shortcoming of edge servers and the benefit of CNC. Initially, the task is divided into two processing parts. The original data may be transmitted through network nodes to the edge server for computation; after processing, the intermediate data is sent back to a network device for subsequent transfer. In the CNC scenario, however, integrating computing capabilities into the network devices allows the first part of the computation to occur directly on them. This approach eliminates part of the latency cost of round-trip data transmission, as the network nodes are inherently positioned along the necessary data path.
[IMAGE OMITTED. SEE PDF]
This paper proposes a network-cloud (NC) collaborative computing architecture for DL task inference acceleration (as shown in Figure 2) based on CNC technology. In this architecture, we remove edge servers because of the limitations mentioned above. On the side close to end devices, a DL task orchestration system collects images captured by the end devices. Images that need to be processed by the same DNN model are treated as subtasks of one DL task. In the network domain, each network node inherently provides a certain level of computing power, while the links between them offer bandwidth for data transmission. Through resource management facilitated by CNC technology, all network nodes and links along a task's chain path can allocate arbitrary computing and bandwidth resources to it. Cloud nodes are deployed in the remote domain; although they do not possess the data-packet forwarding capabilities of network nodes, they provide substantial computing resources. The red line in the figure represents the path of a DL task. The whole DL task, as a computing request, is ultimately sent to a remote cloud node, which then sends the inference result back to the orchestration system. During this process, devices in the network domain execute the computation of the former part of the subtasks, while the cloud executes the latter part.
[IMAGE OMITTED. SEE PDF]
In this context, the computing power provided by the multi-hop network nodes is heterogeneous, and the bandwidth available on the links between them is uncertain. More importantly, their proximities to the end devices vary, so the computing power of different nodes carries different weights: nodes located farther away may offer greater computing resources but incur higher data transmission costs for offloading subtasks. Additionally, the complexity of DNN models, characterized by the varying computing overheads of different layers and the differing data volumes transferred between adjacent layers, further complicates the optimal determination of the partition point in a multi-hop computing-node scenario. Although existing methods can reduce the high transmission delay of model input data through the collaborative computation of nearby computing nodes and a distant cloud node, they typically focus on the single-hop scenario, in which computing nodes can be reached within one hop. When these methods are applied to multi-hop scenarios, they often overlook the proximity relationships among nodes with varying hop counts. Hence, we propose a scheme specifically designed for multi-hop contexts to overcome these limitations.
Literature Review
At present, research scenarios related to DNN-based DL inference mainly fall into cloud computing, device computing with limited computing power, and collaborative computing between them. Table 1 lists the corresponding references and highlights the strength of our work.
TABLE 1 Comparison between this paper and the existing methods.
| Ref. | Device computing in single-hop scenario | Device computing in multi-hop scenario | Cloud computing |
| [4] | ✓ | | ✓ |
| [5] | ✓ | | ✓ |
| [11] | ✓ | | ✓ |
| [12] | ✓ | | |
| [13] | ✓ | | |
| [14] | ✓ | | |
| [15] | | | ✓ |
| Ours | | ✓ | ✓ |
In studies [12–14], researchers deployed DNN models on devices near the terminal, or directly on the terminal, for local inference, but these approaches often face insufficient computing resources. In [12], Zeng et al. proposed a DNN computing system called CoEdge, which received information on computation capacity and network conditions from heterogeneous edge servers to provide the optimal partition policy for the DNN model; it achieved a reduction of more than 25% in energy consumption with comparable delay. In [13], Xu et al. studied high-concurrency offloading of DNN inference tasks in mobile edge computing and proposed a task offloading mechanism utilizing a distributed soft actor-critic (SAC) approach; compared to other baseline algorithms, this method achieved an 18.3% reduction in average service latency. In [14], Liu et al. proposed a collaborative inference scheme (CIS) specifically designed to tackle the challenges of collaborative inference between end devices and edge devices in heterogeneous environments. By optimizing task offloading and scheduling, CIS effectively reduces inference latency; experimental results indicate that, compared to four existing methods, CIS reduces the average weighted inference latency by 29% to 71%.
In [15], researchers deployed DNN models in the cloud for inference. Because this approach incurs significant transmission delays, relatively few studies focus on cloud-only scenarios. In that work, Li et al. investigated the challenge of automated online real-time deployment of DNN inference in the cloud. They proposed a novel algorithm, AutoDeep, which adaptively selected cost-efficient cloud configurations for inference tasks. Compared to non-trivial baselines, AutoDeep significantly improved inference speed, accelerated the configuration search and reduced inference costs.
In studies [4, 5, 11], researchers divided a complete inference task into two interdependent tasks through DNN model partition. The earlier task was executed on a nearby device, while the subsequent one was processed on a cloud node. Although this approach minimized the long-distance transmission of raw data, it only accounted for nearby devices within a single hop, which limited the utilization of distributed computing resources. In [4], Su et al. considered a scenario involving one edge device and one cloud node; the goal was to minimize the long-term average end-to-end latency of various DL tasks while keeping energy consumption within energy budgets. Utilizing the Lyapunov optimization technique and deep reinforcement learning (DRL), they developed a novel DNN partition and resource allocation algorithm based on the deep deterministic policy gradient (DDPG) method. In [5], Liu et al. tackled the challenge of DNN cloud-edge collaborative inference across heterogeneous edge devices and introduced an adaptive DNN inference partition and task offloading algorithm that leveraged DRL; compared to various traditional inference methods, the proposed algorithm reduced average DNN inference latency by approximately 38.85%. In [11], Fan et al. focused on a system comprising multiple base stations, each equipped with an edge device, and a single cloud, which collaboratively computed various types of DL tasks. By employing DNN partition and effectively allocating communication and computing resources, their methods minimized the end-to-end delays of all DL tasks in the system.
Summary
Unlike other studies that only consider offloading DL tasks to resource-constrained computing devices within a single hop, which limits their applicability, our work lets multi-hop computing nodes and one cloud node collaborate to process DL inference tasks. This approach is better aligned with CNC scenarios and offers greater scalability.
Our contributions in this paper are as follows:
- We innovatively propose an NC architecture supported by CNC for DL task inference acceleration. It is a collaborative computing pattern between multiple network devices with computing capacity along the communication path and the remote cloud. In this way, not only is the deployment cost of edge servers eliminated, but computing nodes outside the path are never selected, which guarantees that no additional transmission costs are incurred.
- We consider a scenario that contains one cloud node and several network nodes with heterogeneous computing resources allocated to a task in a bandwidth-variable environment, where the hop counts from these nodes to the end devices differ. We define a joint-optimization problem involving DNN partition, subtask offloading and computing resource allocation of one whole DL task for delay minimization; it is categorized as a mixed-integer nonlinear programming problem. Furthermore, to better describe the status of the whole system, we define states, actions and the reward function to build a Markov decision process (MDP), and then apply a DRL algorithm, deep Q network (DQN).
- To solve the problem mentioned above, we propose POADQ, a joint-optimization algorithm of DNN partition, subtask offloading and computing resource allocation based on DQN. POADQ invokes a low-complexity subtask offloading and computing resource allocation (SORA) algorithm. DQN searches for the optimal DNN partition point, and SORA identifies the next optimal offloading node for the next subtask through our proposed NONPRA (next optimal node prediction with resource allocation) method, which selects the node with the smallest predicted increase in cost. We carry out several simulation experiments on three popular DNN models and compare our algorithm with other schemes to prove its effectiveness. The simulation results demonstrate that POADQ performs best in reducing the average delay of subtasks.
Related Work
As an active area of research, the acceleration of DL inference tasks can be realized through various approaches.
DNN Partition
The goal of DNN model partition is to convert high-volume raw data into low-volume intermediate data at a nearby node, reducing the transmission overhead of tasks before offloading them to a remote node for further computation. Liao et al. [16] established a joint optimization model of multi-user DNN partition and task offloading for heterogeneous mobile devices and edge devices, aiming to minimize inference delay and energy consumption; the proposed scheme reduced total costs by more than 50% compared to other methods. Zhang et al. [17] formulated a DNN partition and task offloading problem in a dynamic bandwidth environment with multiple mobile devices and one edge device, aiming to jointly optimize delay and energy consumption, and proposed a DRL algorithm based on proximal policy optimization (PPO) to solve it. Experimental results showed that the algorithm effectively reduced processing delay and energy consumption while being applicable to various types of DNN models.
DNN Early Exit
Early exit aims to reduce the computational cost of a DNN model while preserving its predictive performance as much as possible. The principle is to stop computation early once an intermediate prediction is sufficiently accurate, rather than completing calculations for all layers of the model. Ebrahimi et al. [18] leveraged DNN early exit and provided a performance model, which captured the tradeoff between accuracy and latency, to determine the optimal partition point, exit policy and task placement. Zhou et al. [19] designed a dynamic-path-based DNN synergistic inference acceleration framework that applied multi-exit DNNs and online exit prediction, speeding up DNN inference by at least 1.8 times.
Feature Compression
By compressing the output features of intermediate layers in a DNN model, the transmission cost of feature data can be effectively reduced. Park et al. [20] proposed a specially designed autoencoder, Auto-Tiler, which efficiently compressed intermediate features; it handled feature spaces generated by different partition layers, supported various input dimensions, and provided multiple output dimensions to meet different compression-ratio requirements. Compared to existing methods, Auto-Tiler improved compression accuracy by 18%–67% at the same bit rate while reducing processing delay by 73%–81%. Kim et al. [21] designed a framework that dynamically recalculated the optimal model partition point, feature compression level, and transmission protocol based on bandwidth, round-trip time and packet loss rate; the framework reduced delay by 85% with less than 1% accuracy loss.
Method
Problem Definition and Model
Figure 3 provides an overview of the problem defined in this paper. A complete DL task corresponds to a DNN model that needs to process different image inputs from various end devices, with each input generating a subtask. The task path must traverse multi-hop computing-capable network nodes before reaching a cloud node. End devices, which generally lack computing power, do not participate in the computation. All computing-capable nodes along the path can provide varying amounts of computing resources to the task, and the links between nodes offer a certain level of bandwidth. After the model is partitioned, each subtask needs to select a network node for the first part of the inference, while the rest is processed in the cloud.
[IMAGE OMITTED. SEE PDF]
In summary, the problem involves determining the model partition point, deciding which network node each subtask should be offloaded to, and allocating resources from nodes to the offloaded subtasks, given the available resources from each node and link.
Task Model
We define a DL task as a tuple consisting of the collection of subtasks, the number of subtasks, the number of layers of the DNN model, and the size of the original input data processed by the model. We also define the collection of per-layer computational overheads, whose elements give the computational overhead of each layer of the DNN model, and the collection of per-layer output data sizes, whose elements give the size of the output data of each layer.
DNN Partition and Computing Nodes Model
This paper assumes that the chain path of a DL task goes through several network device nodes in the network domain and one server node in the cloud. A partition point splits the DNN model: for a subtask, the layers up to and including the partition point are offloaded to one of the network devices for computation, and the layers after it are offloaded to the cloud node. In the special case where the partition point precedes the first layer, each computing subtask is executed entirely in the cloud and the DNN model is not split.
Assume that the hop count of a DL task path in the network domain is given; the set of network nodes then contains the node at each hop through which the path passes. Each network node allocates a certain amount of computing resources to the whole task, indexed by its hop count, and the cloud node allocates its own computing resources as well. Each link between adjacent network nodes provides an available bandwidth for the task, and there is additionally an available bandwidth on the link between the last-hop network node and the cloud node.
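For concreteness, the task and path model described above can be captured by the following minimal sketch; the class and field names are our own illustrative choices, not the paper's notation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DLTask:
    """A DL task: several subtasks (images) processed by one DNN model."""
    num_subtasks: int              # number of subtasks in the task
    layer_flops: List[float]       # computational overhead of each layer (FLOPs)
    layer_out_bits: List[float]    # output data size of each layer
    input_bits: float              # size of the original input per subtask

@dataclass
class ChainPath:
    """Chain path: several computing-capable network nodes followed by one cloud node."""
    node_flops: List[float]        # computing resources allocated by each hop node (FLOPS)
    cloud_flops: float             # computing resources allocated by the cloud node (FLOPS)
    link_bw: List[float]           # bandwidth of the link between adjacent network nodes (bps)
    last_to_cloud_bw: float        # bandwidth from the last hop node to the cloud (bps)

# Partition point p (assumed convention): layers 1..p run on one chosen network node,
# layers p+1..L run in the cloud; p = 0 means the whole model runs in the cloud (no split).
```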
Delay Model
The delay of transmitting the original input data from the end device to the first-hop network node is fixed; it lies outside our optimization and can be ignored. We also ignore the backhaul delay of sending results back from the cloud node, because the result produced by the cloud is a label of negligible size [22].
Let a binary variable denote the offloading policy of each subtask in the network domain: it equals 1 if the former part of the subtask is computed on a given network node and 0 otherwise. We further denote the computing resources that the chosen network node allocates to the former part of the subtask, and those that the cloud node allocates to the latter part.
We assume that the bandwidth of each link is shared equally among the subtasks. The data transmission delay of a subtask is then calculated as sketched below.
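As an illustrative reconstruction, suppose subtask $s_i$ offloads its former part to the $j$-th hop node, the original input of size $d_0$ traverses the first $j-1$ links, and the intermediate output of partition layer $p$, of size $d_p$, traverses the remaining links and the last-hop-to-cloud link; with each link bandwidth shared equally among the $M$ subtasks, the transmission delay could be written as

$$
T^{\mathrm{tr}}_{i} \;=\; \sum_{h=1}^{j-1} \frac{d_0\,M}{B_h} \;+\; \sum_{h=j}^{H-1} \frac{d_p\,M}{B_h} \;+\; \frac{d_p\,M}{B_c},
$$

where $B_h$ is the bandwidth of the $h$-th link, $B_c$ the bandwidth of the last-hop-to-cloud link, and $H$ the hop count. The symbols here are our own, chosen to be consistent with the definitions in Table 2.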
Optimization Problem
In this paper, we aim to minimize the average delay of the subtasks in one DL task. The optimization problem is therefore sketched as follows.
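Assuming binary offloading variables $x_{i,j}$ (subtask $s_i$ offloads its former part to the $j$-th hop node), computing allocations $f_{i,j}$ from that node and $f^{c}_{i}$ from the cloud, and resource budgets $F_j$ and $F_c$, the joint problem can be sketched, in our own notation, as

$$
\min_{p,\;x,\;f}\ \frac{1}{M}\sum_{i=1}^{M} T_i
\qquad \text{s.t.}\quad
\sum_{j=1}^{H} x_{i,j}=1,\quad
\sum_{i=1}^{M} x_{i,j}\,f_{i,j}\le F_j,\quad
\sum_{i=1}^{M} f^{c}_{i}\le F_c,\quad
x_{i,j}\in\{0,1\},\ \ p\in\{0,\dots,L\},
$$

where $T_i$ is the total (transmission plus computing) delay of subtask $s_i$. The binary offloading variables and the integer partition point make this a mixed-integer nonlinear program, consistent with the contribution stated earlier.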
The notations of this paper are mainly shown in Table 2.
TABLE 2 Key notations.
| Notation | Definition |
| Task tuple | |
| Collection of subtasks | |
| the -th subtask of a task | |
| The number of subtasks in a task | |
| The number of layers of a DNN model | |
| The size of the original input data for a DNN model | |
| Collection of computational overhead of every layer of a DNN model | |
| Collection of output data size of every layer of a DNN model | |
| Partition point of a DNN model | |
| Hop counts of a DL task path in the network domain | |
| Collection of nodes through which the path passes in the network domain | |
| the -th hop node through which the path passes in the network domain | |
| Collection of computing resources from network nodes for a task | |
| Computing resources from the cloud node for a task | |
| Collection of available bandwidth from network links for a task | |
| Available bandwidth from the last hop network node to the cloud for a task | |
| Offloading policy of with regard to | |
| Computing resources allocated from for | |
| Computing resources allocated from the cloud for | |
| Data transmission delay of | |
| The original input transmission delay of from the first hop network node to | |
| The intermediate transmission delay of from to the last hop network node | |
| The intermediate transmission delay from the last hop network node to the cloud | |
| Computing delay of | |
| Total delay of | |
| Average delay of subtasks in a task | |
| Minimum resource unit |
ALGORITHM 1
SORA
| Input: | |
| Output: | |
| 1: | Initialize |
| 2: | /* Allocation of computing resources in the cloud node */ |
| 3: | |
| 4: | |
| 5: | |
| 6: | for each do |
| 7: | if then |
| 8: | |
| 9: | else |
| 10: | |
| 11: | |
| 12: | end if |
| 13: | end for |
| 14: | if then |
| 15: | /* Total cloud computing */ |
| 16: | |
| 17: | |
| 18: | return |
| 19: | end if |
| 20: | |
| 21: | return |
ALGORITHM 2
NONPRA
| Input: | |
| Output: | |
| 1: | Initialize |
| 2: | for each do |
| 3: | Calculate and according to (2) and (3) as for |
| 4: | , as for |
| 5: | |
| 6: | |
| 7: | end for |
| 8: | ascendingSort() |
| 9: | for each do |
| 10: | outputNodeIndex() |
| 11: | |
| 12: | |
| 13: | |
| 14: | |
| 15: | insertSort() |
| 16: | end for |
| 17: | for each where do |
| 18: | Allocate computing resources to subtasks like the way of the cloud nodes and obtain |
| 19: | end for |
| 20: | return |
Markov Decision Process
We intend to describe the evolution of the system and apply DRL to obtain the optimal solution. Hence, we construct a Markov decision process (MDP). Generally, an MDP can be represented by a state space, an action space, a state transition probability and a reward. Because the environment changes are uncertain, we do not model the state transition probability explicitly; that is, we construct a model-free MDP, whose elements are defined below.
- 1. State: In each step, the state comprises the observations from the environment. For a whole task, it includes the computing resources distributed from each node at different hops and the available bandwidth of each link.
- 2. Action: The action is the decision of the partition point of the DNN model in each step; we intend to obtain the optimal partition policy in the current state.
- 3. Reward Function: Our goal is to minimize the average delay of the subtasks in one DL task. Hence, we choose the negative of the optimization objective as the reward. Formal expressions for the state, action and reward are sketched after this list.
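Using the notation assumed earlier (per-hop computing resources $F_j$, cloud resources $F_c$, link bandwidths $B_h$ and $B_c$, partition point $p$, and subtask delays $T_i$), one plausible formalization of the MDP elements is

$$
s_t=\big(F_1,\dots,F_H,\;F_c,\;B_1,\dots,B_{H-1},\;B_c\big),\qquad
a_t=p\in\{0,1,\dots,L\},\qquad
r_t=-\frac{1}{M}\sum_{i=1}^{M} T_i .
$$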
Algorithm
Subtask Offloading and Resource Allocation Algorithm
We first propose the POADQ algorithm to address the joint-optimization problem of DNN partition, subtask offloading and computing resource allocation in the NC architecture. This approach is based on DQN and incorporates a subtask offloading and computing resource allocation (SORA) algorithm, given as Algorithm 1. SORA obtains the optimal subtask offloading policy and resource allocation when a minimum resource unit exists; trunc() indicates a truncation function that takes the integer part of its argument. To predict the next node that is likely to cause the minimal cost, we further propose an algorithm called next optimal node prediction with resource allocation (NONPRA), invoked by SORA and summarized in Algorithm 2. In this algorithm, we track the predicted increase in transmission delay cost, computing delay cost and total delay cost for each network node to which the next subtask could be distributed, and a counter records the number of subtasks already assigned to each network node. It is worth noting that the allocation of computing resources consumes a small amount of time, with the majority of the time occupied by the offloading decisions; SORA therefore has low time complexity.
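The prediction step of NONPRA can be illustrated with the sketch below. It assumes equal bandwidth sharing among subtasks and a greedy assignment of each subtask to the node with the smallest predicted delay increase; the function name, argument names and simplified cost model are ours, not the paper's. The last-hop-to-cloud transfer and the cloud computing time are identical for every candidate node and are therefore omitted from the comparison.

```python
from typing import List

def nonpra_assign(num_subtasks: int,
                  node_flops: List[float],   # compute offered by each hop node (FLOPS)
                  link_bw: List[float],      # bandwidth of link between hop h and h+1 (bps)
                  head_flops: float,         # FLOPs of the former part (layers 1..p) per subtask
                  input_bits: float,         # original input size per subtask
                  inter_bits: float) -> List[int]:
    """Greedy next-optimal-node prediction: each subtask is assigned to the node
    with the smallest predicted increase in transmission + computing delay."""
    H = len(node_flops)
    counts = [0] * H                         # subtasks already assigned to each node
    for _ in range(num_subtasks):
        best_j, best_cost = 0, float("inf")
        for j in range(H):
            # transmission: raw input over links before node j, intermediate
            # output over the remaining inter-node links (equal bandwidth sharing)
            tx = sum(input_bits * num_subtasks / link_bw[h] for h in range(j))
            tx += sum(inter_bits * num_subtasks / link_bw[h] for h in range(j, H - 1))
            # computing: node j's resources shared by the subtasks it would then host
            comp = head_flops * (counts[j] + 1) / node_flops[j]
            cost = tx + comp
            if cost < best_cost:
                best_j, best_cost = j, cost
        counts[best_j] += 1
    return counts                            # number of subtasks intended for each node
```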
DQN Architecture
Deep reinforcement learning (DRL) is an important branch of machine learning (ML) that combines DL with reinforcement learning (RL) to address complex decision-making and control problems. The core idea of RL is to learn a policy through the interaction between an agent and its environment, maximizing the accumulated long-term reward for a specific task. Traditional RL methods are limited in performance when handling high-dimensional state spaces or complex tasks. DRL incorporates the advantages of DL by feeding high-dimensional states into neural networks as feature inputs, enabling its application to continuous state spaces. DQN is a representative DRL method.
In the DQN architecture (illustrated in Figure 4), there are two neural networks, referred to as the current network and the target network. Although their structures are identical, their parameters may differ. The main role of the current network is to observe the current state from the environment and to generate a proper action based on that state. The action is executed in the environment, which yields a reward and a next state, and the process repeats after the current network receives the state of the next step. A tuple consisting of the current state, the action, the reward and the next state is stored in a replay buffer for future learning.
[IMAGE OMITTED. SEE PDF]
When the replay buffer reaches its capacity, a mini-batch is sampled to train the current network. In most steps, the weights of the target network remain unchanged; only after every fixed number of steps are the parameters of the target network replaced by those of the current network.
ALGORITHM 3
POADQ
| 1: | Initialize the current network with random weight |
| 2: | Initialize the target network with weight |
| 3: | Initialize the replay buffer with a capacity |
| 4: | Observe |
| 5: | for each episode do |
| 6: | With probability select a random action |
| 7: | Otherwise select |
| 8: | |
| 9: | Receive through Algorithm 1 |
| 10: | Calculate according to (7) |
| 11: | |
| 12: | Execute and receive the next state |
| 13: | Store to the replay buffer |
| 14: | |
| 15: | if the replay buffer is full then |
| 16: | Sample a mini-batch from |
| 17: | Perform a gradient descent step on the loss with respect to the current network weights |
| 18: | Every steps, reset |
| 19: | end if |
| 20: | end for |
The POADQ algorithm, which embeds the SORA algorithm, is based on the DQN architecture. The details are shown in Algorithm 3.
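For orientation, a compact PyTorch sketch of the DQN loop underlying POADQ is given below. The toy environment, the placeholder step() standing in for the SORA call, and the state/action dimensions are illustrative assumptions rather than the paper's implementation; the hyper-parameters follow Table 4.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 16, 20            # illustrative: resource observation size; partition points

def sample_state():
    return torch.rand(STATE_DIM)           # stand-in for node/link resource observation

def step(state, action):
    # placeholder for SORA: would offload subtasks, allocate resources, return average delay
    avg_delay = 1.0 + 0.1 * action + state.mean().item()
    return -avg_delay, sample_state()      # reward = negative average subtask delay

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=0.01)
buffer = deque(maxlen=500)
EPS, GAMMA, BATCH, SYNC = 0.1, 0.9, 60, 100

state = sample_state()
for episode in range(20000):
    # epsilon-greedy choice of the DNN partition point
    if random.random() < EPS:
        action = random.randrange(NUM_ACTIONS)
    else:
        with torch.no_grad():
            action = int(q_net(state).argmax())
    reward, next_state = step(state, action)
    buffer.append((state, action, reward, next_state))
    state = next_state

    if len(buffer) == buffer.maxlen:       # train once the replay buffer is full
        batch = random.sample(buffer, BATCH)
        s, a, r, s2 = map(torch.stack, zip(*[(b[0], torch.tensor(b[1]),
                                              torch.tensor(b[2]), b[3]) for b in batch]))
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            y = r + GAMMA * target_net(s2).max(1).values
        loss = nn.functional.mse_loss(q, y)
        opt.zero_grad(); loss.backward(); opt.step()
        if episode % SYNC == 0:            # periodically sync the target network
            target_net.load_state_dict(q_net.state_dict())
```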
Algorithm Process
The flow chart of all algorithms in this paper is shown in Figure 5.
[IMAGE OMITTED. SEE PDF]
Simulation
To conduct experimental simulations, we adopt three popular DNN models: VGG19 [23], AlexNet [24], and ResNet18 [25]. The scenario involves varying numbers of subtasks and network nodes in a task path, which consists of several network nodes and one cloud node, with each node's computing power and each link's bandwidth sampled from a specified range to ensure generality. Under these conditions, we compare our algorithm with other schemes in terms of delay reduction; the better algorithm minimizes the average delay of all subtasks along the task path.
Data Preparation of Model Workload
To evaluate the computing overhead (workload) of DNN models, we adopt floating-point operations (FLOPs) as the measurement metric. FLOPs is an important measure of computing complexity, representing the total number of floating-point operations required for a given task or for the inference of a DNN layer. The FLOPs of different layers are calculated in different ways, depending on their structures and configurations [5].
The FLOPs of a convolutional layer are determined by its kernel size, the numbers of input and output channels, and the output feature-map dimensions, while the FLOPs of a fully connected layer are determined by the numbers of input and output neurons; illustrative formulas are given below.
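For reference, the standard forms of these formulas, assuming the common convention that one multiply-accumulate counts as two floating-point operations and omitting bias terms, are

$$
\mathrm{FLOPs}_{\mathrm{conv}} = 2\, H_{\mathrm{out}} W_{\mathrm{out}} C_{\mathrm{out}} \left( C_{\mathrm{in}} K_h K_w \right),
\qquad
\mathrm{FLOPs}_{\mathrm{fc}} = 2\, N_{\mathrm{in}} N_{\mathrm{out}},
$$

where $H_{\mathrm{out}}$ and $W_{\mathrm{out}}$ are the output feature-map height and width, $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ the input and output channel counts, $K_h$ and $K_w$ the kernel dimensions, and $N_{\mathrm{in}}$ and $N_{\mathrm{out}}$ the numbers of input and output neurons.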
We consider the smallest indivisible unit in the model as a logical layer according to reference [4]. Next, we will present the computing workload of each logical layer for the three models.
Figures 6, 7, and 8 show the computing workload of each logical layer in VGG19, AlexNet, and ResNet18, respectively. The computing workload distribution varies across models, and the number of layers in each model also differs. ResNet18 contains indivisible residual blocks, which we treat as single logical layers. Generally, most of the computing workload is concentrated in convolutional layers or in the residual blocks composed of them. In the following experiments, we perform model partition at the granularity of logical layers.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Experimental Procedure
We first construct an environment consisting of multiple computing-capable network nodes and a cloud node along a task path, where the number of hops from these nodes to end devices varies due to the presence of links between them. We set corresponding parameters of resources (computing, bandwidth) for this environment. In each step for agent training, the computing resources from all nodes and the bandwidth from all links are randomly sampled within predefined ranges.
Next, after setting the hyper-parameters for DQN agent, we train our proposed algorithm, POADQ, which incorporates the agent in this environment. Upon completing the training, we compare our trained algorithm with other schemes in this environment. Finally, analyses are made based on the comparison of data from figures and tables.
Parameters Setting
The simulation environment is implemented in Python 3.11 and PyTorch 2.1.0. Table 3 details the main resource parameters along the task path. For this task, the computing resources allocated by network nodes range randomly from 300 to 600 GFLOPS, while the cloud node's resources range from 3000 to 6000 GFLOPS. The available bandwidth of each link between network nodes follows a uniform distribution in the range of 50–100 Mbps, except for the link between the last-hop network node and the cloud, which ranges from 200 to 400 Mbps. Lastly, the minimum resource unit is set to 10 MFLOPS.
TABLE 3 Parameters of nodes and links along the task path.
| Parameter | Value or range |
| [300, 600] GFLOPS | |
| [3000, 6000] GFLOPS | |
| [50, 100] Mbps | |
| [200, 400] Mbps | |
| 10 MFLOPS |
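As an illustration of how the environment can be re-sampled at each training step under the ranges of Table 3 (the function and key names are our own, and uniform sampling is an assumption), one could write:

```python
import random

def sample_path_resources(num_nodes: int) -> dict:
    """Sample per-step resources for a chain path, using the ranges of Table 3."""
    return {
        "node_gflops": [random.uniform(300, 600) for _ in range(num_nodes)],    # network nodes
        "cloud_gflops": random.uniform(3000, 6000),                             # cloud node
        "link_mbps": [random.uniform(50, 100) for _ in range(num_nodes - 1)],   # inter-node links
        "last_to_cloud_mbps": random.uniform(200, 400),                         # last hop -> cloud
    }

# Example: an 8-node path re-sampled at every training step
env_state = sample_path_resources(8)
```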
Before training the DQN agent, we configure its hyper-parameters, as detailed in Table 4.
TABLE 4 Hyper-parameters of DQN.
| Hyper-parameter | Value |
| Minibatch size | 60 |
| Learning rate | 0.01 |
| Random policy parameter | 0.1 |
| Discount factor | 0.9 |
| Optimizer | Adam |
| Training episodes | 20000 |
| Replay buffer size | 500 |
| Target network update step | 100 |
Agent Training
We then train the DQN agent, which is embedded in our POADQ algorithm, in the aforementioned environment. Figure 9 illustrates how the loss value of DQN evolves as the number of training episodes increases. Around 5000 episodes, the average loss becomes relatively small; however, fluctuations may still occur afterward, likely due to the random exploration mechanism, which may leave certain actions insufficiently explored. To ensure sufficient exploration, we extend the training to 20000 episodes.
[IMAGE OMITTED. SEE PDF]
Result of Comparison
To prove the effectiveness of our algorithm POADQ, we conduct simulations of POADQ and other schemes as described below for comparison.
- 1. Total Cloud Computing (TCC): The DNN model is not split and all subtasks are offloaded to the cloud for computation. The computing resources are allocated optimally [15].
- 2. Randomly Offloaded to Network Nodes (RONN): The former parts of subtasks are offloaded to network nodes randomly. The model partition and the computing resource allocation are both optimal.
- 3. Equally Offloaded to Network Nodes (EONN): The former parts of subtasks are distributed evenly across network nodes, with any remaining subtasks offloaded to closer network nodes. The model partition and the computing resource allocation are both optimal.
- 4. Offloaded to Network Nodes withOut Hop Counts (ONNOHC): The former parts of subtasks are offloaded to network nodes according to their resources and the subtasks already assigned to them, while hop counts are not considered. The model partition and the computing resource allocation are both optimal [5].
The results are the averages over 100 repeated experiments. For each model, we conduct two main experiments: in one, the number of network nodes varies over a range while the number of subtasks is fixed at 20; in the other, the number of subtasks varies over a range while the number of network nodes is fixed at 8. The result charts are shown in Figure 10.
[IMAGE OMITTED. SEE PDF]
From the charts, we first examine the case where the number of subtasks is fixed, that is, panels (a), (c) and (e). When the number of network nodes is small, the performance differences among the five schemes are minimal, because with fewer nodes to choose from for the former part of the subtask inference, the optimal node is easily selected. However, as the number of network nodes increases, the advantage of our algorithm becomes more pronounced, because it takes into account the proximity relationships between nodes when selecting optimal nodes. Moreover, although an increase in the number of network nodes may lead to more hops and potentially higher latency, the number of nodes providing computing resources also increases accordingly, so the impact is minimal.
Next, we examine the case where the number of network nodes is fixed, that is, panels (b), (d) and (f). As the number of subtasks increases, resources become increasingly strained when a fixed number of nodes provide computing resources; as a result, the subtask latency increases under all schemes.
Table 5 presents more detailed comparison data from the above charts, with results rounded to three decimal places. In most scenarios, TCC performs the worst, as offloading the original data directly to the cloud for inference incurs high transmission latency. The remaining schemes all employ DNN partition. Offloading subtasks randomly (RONN) performs second worst. Although ONNOHC takes node computing resources into account for subtask offloading, it overlooks the proximity between nodes; as a result, its performance is even worse than that of EONN, which distributes subtasks evenly across all network nodes. In all cases, our proposed POADQ is optimal and minimizes the average subtask latency to the greatest extent.
TABLE 5 Results of the average subtask delay for multiple algorithms.
| (a) Average subtask delay vs. the number of network nodes (20 subtasks) in VGG19. | (b) Average subtask delay vs. the number of subtasks (8 network nodes) in VGG19. |
| Nodes | POADQ | TCC | RONN | EONN | ONNOHC | Subtasks | POADQ | TCC | RONN | EONN | ONNOHC |
| 3 | 0.914 | 0.926 | 1.005 | 0.946 | 0.942 | 20 | 0.902 | 2.623 | 1.468 | 1.304 | 1.416 |
| 4 | 0.881 | 1.250 | 1.042 | 0.964 | 0.953 | 25 | 1.143 | 3.268 | 1.898 | 1.736 | 1.782 |
| 5 | 0.893 | 1.611 | 1.136 | 1.058 | 1.046 | 30 | 1.362 | 3.919 | 2.246 | 2.046 | 2.125 |
| 6 | 0.903 | 1.950 | 1.245 | 1.113 | 1.159 | 35 | 1.606 | 4.566 | 2.627 | 2.399 | 2.498 |
| 7 | 0.910 | 2.269 | 1.375 | 1.250 | 1.291 | 40 | 1.862 | 5.235 | 2.953 | 2.875 | 2.867 |
| 8 | 0.914 | 2.610 | 1.507 | 1.309 | 1.424 | 45 | 2.058 | 5.875 | 3.328 | 3.113 | 3.202 |
| | | | | | | 50 | 2.282 | 6.559 | 3.712 | 3.510 | 3.576 |
| (c) Average subtask delay vs. the number of network nodes (20 subtasks) in AlexNet. | (d) Average subtask delay vs. the number of subtasks (8 network nodes) in AlexNet. |
| Nodes | POADQ | TCC | RONN | EONN | ONNOHC | Subtasks | POADQ | TCC | RONN | EONN | ONNOHC |
| 3 | 0.259 | 0.789 | 0.438 | 0.420 | 0.435 | 20 | 0.300 | 2.534 | 1.273 | 1.142 | 1.266 |
| 4 | 0.269 | 1.134 | 0.602 | 0.596 | 0.593 | 25 | 0.388 | 3.129 | 1.605 | 1.522 | 1.579 |
| 5 | 0.281 | 1.475 | 0.757 | 0.754 | 0.759 | 30 | 0.448 | 3.799 | 1.915 | 1.797 | 1.881 |
| 6 | 0.290 | 1.838 | 0.951 | 0.867 | 0.926 | 35 | 0.536 | 4.391 | 2.228 | 2.091 | 2.215 |
| 7 | 0.292 | 2.151 | 1.097 | 1.045 | 1.088 | 40 | 0.618 | 5.081 | 2.592 | 2.564 | 2.562 |
| 8 | 0.308 | 2.511 | 1.288 | 1.132 | 1.271 | 45 | 0.675 | 5.657 | 2.898 | 2.730 | 2.859 |
| | | | | | | 50 | 0.754 | 6.218 | 3.185 | 3.046 | 3.131 |
| (e) Average subtask delay vs. the number of network nodes (20 subtasks) in ResNet18. | (f) Average subtask delay vs. the number of subtasks (8 network nodes) in ResNet18. |
| Nodes | POADQ | TCC | RONN | EONN | ONNOHC | Subtasks | POADQ | TCC | RONN | EONN | ONNOHC |
| 3 | 0.222 | 0.769 | 0.412 | 0.399 | 0.406 | 20 | 0.236 | 2.417 | 1.199 | 1.059 | 1.187 |
| 4 | 0.234 | 1.119 | 0.577 | 0.572 | 0.576 | 25 | 0.300 | 3.084 | 1.536 | 1.463 | 1.525 |
| 5 | 0.231 | 1.445 | 0.740 | 0.722 | 0.718 | 30 | 0.349 | 3.687 | 1.827 | 1.718 | 1.810 |
| 6 | 0.235 | 1.772 | 0.904 | 0.809 | 0.872 | 35 | 0.398 | 4.280 | 2.107 | 1.975 | 2.087 |
| 7 | 0.234 | 2.121 | 1.043 | 0.997 | 1.043 | 40 | 0.480 | 4.912 | 2.428 | 2.422 | 2.444 |
| 8 | 0.233 | 2.434 | 1.232 | 1.069 | 1.201 | 45 | 0.534 | 5.490 | 2.722 | 2.564 | 2.673 |
| | | | | | | 50 | 0.601 | 6.140 | 3.089 | 2.936 | 3.024 |
From the data in the table, it can be seen that when the number of network nodes is 8, our proposed algorithm reduces average subtask delay by at least 44.57% compared to other schemes. Moreover, under AlexNet and ResNet18 models, the algorithm can accelerate inference by at least 62.16%.
All in all, the algorithm we proposed, POADQ, outperforms other algorithms in reducing average subtask latency and in most cases, achieves significant inference acceleration.
Conclusion and Future Work
Conclusion
In this study, we present an NC architecture designed for DL task acceleration using CNC technology. Within this architecture, the nodes and links along the task communication path provide appropriate computing and bandwidth resources. The problem requires determining DNN partition, subtask offloading and computing resource allocation in a multi-hop, heterogeneous-node and bandwidth-variable environment. We construct an MDP and propose the POADQ algorithm to train the optimal policy for partition-point decision-making, while subtask offloading and resource allocation are decided by the SORA algorithm. Comparative analysis with other schemes clearly demonstrates the superior performance of POADQ.
Future Work
In the future, we plan to investigate the problem of finding optimal paths and allocating resources efficiently for multiple DNN tasks in CNC scenarios. However, achieving resource-state awareness across the entire network graph poses significant difficulties for general DRL methods, which cannot effectively process non-Euclidean data structures such as graphs. Graph neural networks (GNNs) are specifically designed to handle graph structures; therefore, applying them to CNC might be a promising direction.
Author Contributions
Ruiyu Yang: investigation, formal analysis, software, methodology, writing – original draft, validation. Zhili Wang: writing – review & editing, supervision, resources. Yang Yang: supervision, conceptualization. Sining Wang: supervision, resources.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62322103), Beijing Natural Science Foundation (4232009), and the Fund of Central University Basic Research Projects (2023ZCTH11).
Conflicts of Interest
The authors declare no conflicts of interest.
Data Availability Statement
Data are available upon reasonable request from the corresponding authors.
W. Fan, Z. Chen, Z. Hao, et al., “DNN Deployment, Task Offloading, and Resource Allocation for Joint Task Inference in IIoT,” IEEE Transactions on Industrial Informatics 19, no. 2 (2022): 1634–1646.
T. Feltin, L. Marchó, J. A. Cordero‐Fuertes, F. Brockners, and T. H. Clausen, “DNN Partitioning for Inference Throughput Acceleration at the Edge,” IEEE Access 11 (2023): 52236–52249.
W. Zhang, D. Yang, H. Peng, et al., “Deep Reinforcement Learning Based Resource Management for DNN Inference in Industrial IoT,” IEEE Transactions on Vehicular Technology 70, no. 8 (2021): 7605–7618.
Y. Su, W. Fan, L. Gao, L. Qiao, Y. Liu, and F. Wu, “Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems,” IEEE Transactions on Vehicular Technology 72, no. 3 (2022): 3930–3944.
X. Liu, S. Liang, Q. Li, and J. Zhang, “Cloud-Edge Collaborative DNN Inference Based on Deep Reinforcement Learning,” Computer Engineering 48 (2022): 30–38, https://doi.org/10.19678/j.issn.1000‐3428.0063579.
Y. Kang, J. Hauswald, C. Gao, et al., “Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge,” ACM SIGARCH Computer Architecture News 45, no. 1 (2017): 615–629.
X. Huang, B. Lei, M. Wei, G. Ji, and H. Lv, “Task Value Aware Optimization of Routing for Computing Power Network,” in 2023 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB) (IEEE, 2023), 1–6.
S. Tang, Y. Yu, H. Wang, et al., “A Survey on Scheduling Techniques in Computing and Network Convergence,” IEEE Communications Surveys & Tutorials 26, no. 1 (2024): 160–195.
Y. Ouyang, X. Ye, J. Sun, Y. Liu, and Y. Zhang, “The First Decade of Computing and Network Convergence,” in ICC 2023‐IEEE International Conference on Communications (IEEE, 2023), 1928–1933.
Z. Hong, X. Qiu, J. Lin, et al., “Intelligence-Endogenous Management Platform for Computing and Network Convergence,” IEEE Network 38, no. 4 (2024): 166–173.
W. Fan, L. Gao, Y. Su, F. Wu, and Y. Liu, “Joint DNN Partition and Resource Allocation for Task Offloading in Edge–Cloud-Assisted IoT Environments,” IEEE Internet of Things Journal 10, no. 12 (2023): 10146–10159.
L. Zeng, X. Chen, Z. Zhou, L. Yang, and J. Zhang, “CoEdge: Cooperative DNN Inference With Adaptive Workload Partitioning Over Heterogeneous Edge Devices,” IEEE/ACM Transactions on Networking 29, no. 2 (2020): 595–608.
W. Xu, N. Chen, and H. Tu, “DNN Inference Task Offloading Based on Distributed Soft Actor-Critic in Mobile Edge Computing,” in SEKE (ACM, 2023): 386–391.
M. Liu, Y. Gu, S. Dong, et al., “Collaborative Inference for Deep Neural Networks in Edge Environments,” KSII Transactions on Internet and Information Systems (TIIS) 18, no. 7 (2024): 1749–1773.
Y. Li, Z. Han, Q. Zhang, Z. Li, and H. Tan, “Automating Cloud Deployment for Deep Learning Inference of Real‐Time Online Services,” in IEEE INFOCOM 2020‐IEEE Conference on Computer Communications (IEEE, 2020), 1668–1677.
Z. Liao, W. Hu, J. Huang, and J. Wang, “Joint Multi-User DNN Partitioning and Task Offloading in Mobile Edge Computing,” Ad Hoc Networks 144 (2023): 103156.
J. Zhang, S. Ma, Z. Yan, and J. Huang, “Joint DNN Partitioning and Task Offloading in Mobile Edge Computing Via Deep Reinforcement Learning,” Journal of Cloud Computing 12, no. 1 (2023): 116.
M. Ebrahimi, A.d.S. Veith, M. Gabel, and E. de Lara, “Combining DNN Partitioning and Early Exit,” in Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking (IEEE, 2022), 25–30.
M. Zhou, B. Zhou, H. Wang, F. Dong, and W. Zhao, “Dynamic Path Based DNN Synergistic Inference Acceleration in Edge Computing Environment,” in 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS) (IEEE, 2021), 567–574.
J. Park, J. Kim, and J. H. Ko, “Auto‐tiler: Variable‐dimension Autoencoder With Tiling for Compressing Intermediate Feature Space of Deep Neural Networks for Internet of Things,” Sensors 21, no. 3 (2021): 896.
H. Kim, J. S. Choi, J. Kim, and J. H. Ko, “A DNN Partitioning Framework With Controlled Lossy Mechanisms for Edge-Cloud Collaborative Intelligence,” Future Generation Computer Systems 154 (2024): 426–439.
C. Dong, S. Hu, X. Chen, and W. Wen, “Joint Optimization With DNN Partitioning and Resource Allocation in Mobile Edge Computing,” IEEE Transactions on Network and Service Management 18, no. 4 (2021): 3973–3986.
K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large‐Scale Image Recognition,” arXiv preprint, arXiv:1409.1556 (2014).
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet Classification With Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25 (Curran Associates, Inc., 2012).
K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), 770–778.