Timely Reliability Analysis of Virtual Machines

Full text

Turn on search term navigation

1. Introduction

With the development of the Internet of Things (IoT), delay-sensitive computing tasks are increasing in cloud computing networks. Because of the long distance between the user equipment and the cloud, dealing with the IoT data in the cloud will cause unacceptable latency, especially for stream or real-time IoT data. To satisfy the strict delay requirements of such IoT services, edge computing emerged. Edge computing provides computing and storage resources at the edge of the network, so that latency-sensitive data and private data can be handled nearby the users, such as applications, e.g., driverless cars, real-time traffic management, virtual reality (VR), augmented reality (AR), and healthcare IoT [1,2]. Regardless of the application scenario, the success of the task relies on the powerful and reliable processing capability of edge computing servers, especially for latency-sensitive applications.

To achieve scalability and robustness, computing and storage resources in cloud servers or edge servers are usually managed by virtualization technology [3,4,5,6]. Computing tasks are performed by independent virtual machines (VMs). Owing to virtualization technology, the network can facilitate load balancing and maintain servers without terminating the task by live migration of VM [7,8]. More importantly, if a VM crashes or is about to crash, the memory image of this VM can be migrated to an idle one and restart the processing, improving fault tolerance of the system. According to [9,10,11], many software failures are transient, which means the system can reboot to repair the problem. If the VM failure is transient, the failed VM can reboot to recover in a short time after the migration. The processing capability of the system is thus recovered. However, as for latency-sensitive tasks, the virtualized management also creates new reliability challenges, which are the failure of VM reducing the processing capability and the migration of VM causing the service downtime [12,13].

Although researchers are trying to design migration mechanisms to shorten the total migration time and downtime, the downtime caused by VM migration still cannot be ignored, especially for service under strict Service Level Agreements (SLAs) [12,14]. As a result, the influence of failure and migration of VM on the satisfying of transmission time requirements needs to be studied, since it is critical for system reliability evaluation, optimizing of resources allocating, and designing of resource management policy. For example, to avoid the server crash, the computing server will not usually assign all the VMs to provide service, but keep a part of VMs idle as backups. When the total number of VMs is limited, more task VMs do not always reduce the delay, since this also brings about more failures in a given time and reduces the backup VMs. Designers of the network and service providers are curious about how to make a trade-off between the task computing resource and idle resource to achieve the highest reliability of the computing network.

The ability of an edge computing network to transmit and process data within the given time can be measured by timely reliability [15,16,17,18]. To analyze whether the VMs in the edge server and the whole network satisfy the time requirements of a task, we define the VM timely reliability (VTR) and the end-to-end timely reliability (ETR), respectively, and propose analysis methods. The end-to-end delay in the computer network consists of four types of delay [19], that is, processing delay, transmission delay, propagation delay, and queueing delay. Since queueing delay is the main random variable of the four components of end-to-end delay [18], we only consider processing delay and queueing delay of the edge server in this paper. From the perspective of edge computing, an edge server can be treated as a multi-server queue with unreliable servers, where the VMs are servers and the tasks are customers. At the same time, to protect the data, the recovery of VMs will not start until the memory image and state have been migrated. To measure the timely reliability of edge computing more precisely, we need to build a queueing model considering failure, migration, and reboot of VM. Present mathematical models [20,21,22,23,24,25,26,27] cannot take the migration of VM into account and neglect the migration downtime. Therefore, to analyze the edge computing system considering migration and recovery of VM, the essential migration-related states of VM need to be defined in a new evaluation model. From the perspective of the network, the end-to-end delay of an edge computing network is determined by the joint distribution of the successive delays of a packet traversing multiple nodes [28]. The main difficulty of networked queueing model analysis is the characterization of the departure process of an upstream queueing node, which is the arrival process of the next queueing node when all intermediate nodes are pure relays [28].

In this paper, to measure the timely reliability of edge computing more precisely, we define three VM states (available, failed-unmigrated, and failed-migrated) and build a multi-server queueing model considering failure, migration, and recovery of VM for edge computing server. The considered failures of VM are all transient and can be recovered by rebooting. The proposed queueing model is analyzed by a continuous-time Markov chain (CTMC) with three types of transfer relations solved by a matrix-geometric approach. The probability density function (PDF) of the sojourn time at an edge computing server is calculated based on the first passage time method. Then, we build a serial queueing model for the edge computing network base on Burke’s theorem, considering edge service and cloud service, respectively. Thereupon, the end-to-end delay and end-to-end timely reliability are obtained. Furthermore, we investigate the impact of VM migration on VTR, the analyzed factors including failure rate, migration rate, reboot rate, and resource allocation. Our results are verified by simulations based on OMNet++.

The main contributions of this paper include: (1) We build a multi-server queueing model that firstly takes account of failure, migration, and recovery of VMs. The present works merely analyze the situation where VMs run perfectly [24,29,30]. (2) A partitioned CTMC method is proposed to analyze three-dimension CTMC and obtain the VM timely reliability. (3) An analysis method based on Burke’s theorem is proposed to obtain the PDF of the end-to-end delay, which is indispensable for the evaluation of end-to-end timely reliability. Present works either obtain average metrics only, such as mean queue length and mean wait time [24,29,31], or obtain the end-to-end timely reliability by simulations [18,32]. (4) We investigate the impact of VM migration and assigning of task VMs and backup VMs on the VTR, which can be a guidance for the designing of a more reliable edge computing network for delay-sensitive applications.

The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 describes the problem, including the analysis of delay at the edge server and the analysis of end-to-end delay. Section 4 proposes the CTMC and end-to-end delay model and proposes an algorithm to obtain the VTR and ETR. Section 5 presents our results, including simulation results and numerical results, and the discussion. Finally, we give the summary and conclusions of this paper in Section 6.

2. Related Work 2.1. Network Delay Analysis

As mentioned before, the end-to-end delay in the computer network consists of processing delay, transmission delay, propagation delay, and queueing delay [19]. We investigate the literature of the network delay analysis, especially on the queueing delay, and provide a comparative table mainly focusing on the analysis method as below. The main models and methods are described in the table. In Table 1, “Fixed queues” means that the structure of the queue models of a task is fixed. For example, in [24], the queue model is a serial queue consists of the queue of edge computing, the queue of cloud gateway, and the queue of cloud computing; “Stochastic queues” means that the structure of the queue models is not fixed, but determined by some specific mechanisms with probabilities. For example, in [33], tasks might be processed at the IoT layer, fog layer, or cloud layer with different probabilities; “Homogeneous servers” means that the queue servers in the model are all homogeneous, i.e., each server has the same service rate [34]; “Heterogeneous servers” means that the queue servers in the model are heterogeneous [29].

As for simulation methods, (1) research focused on the resource management model, such as task offloading and resource allocation, often verifies their delay analysis models by semi-physical simulation [27,33], or Monte Carlo discrete event simulation using tools such as CloudSim [31] and, Arena [29,30]. Average metrics such as mean wait time and mean queue length are calculated to measure the system performance. The influence of VMs state transitions on the timely reliability is not modeled. (2) End-to-end delay models of a network are often verified by Monte Carlo discrete event simulation using Java Modeling Tools (JMS) [24], OPNET [18], OMNeT++, etc. Timely reliability has been analyzed [18,32], but the influence of VM state transitions on the timely reliability is still not modeled.

2.2. VMs Timely Reliability Analysis

In the delay analysis of a single node, the main delays include processing delay and queueing delay. Queueing theory is widely used to analyze node delay. Usually, the processes in the computing server are modeled with multiple servers, including homogeneous servers [24,25] and heterogeneous servers [29,39]. An admission manager or load balancer is often modeled as a single server ahead of the computing servers [25,27]. For the task/data arrival, most studies described the task/data flow by Poisson distribution. Researchers then build solutions by constructing and solving a CTMC. For example, Kafhali and Salah [24] modeled the edge server by M/M/s/K queueing model. Pereira et al. [25] proposed a queueing model for a Fog node containing a load balancer and several web servers to solve a closed-form solution. Liu et al. [27] analyzed the user equipment (UE) side delay and server-side delay, respectively, from a perspective of task offloading and resource allocation. However, the above studies only proposed metrics such as utilization of VMs, average throughput, mean response time, average queue length, and discard rate. Measuring the above quantities fails to manifest whether the low-latency constraints of delay-sensitive applications are satisfied. To solve this problem, the PDF of sojourn time at the edge server needs to be obtained.

Moreover, the above analyses based on queueing theory are all under the assumption that the edge node serves perfectly. However, the failure of a server or a VM can reduce the processing capability of the edge node directly and increase latency [12,40,41]. To model the failure and repair of VMs, VMs can be treated as unreliable and repairable queue servers based on the multi-server queueing models with unreliable and repairable servers [20,21,22,23,26]. For instance, Ke et al. [26] and Chakravarthy et al. [23] analyzed multi-server queueing models considering unreliable servers and vacation policy. Although the unreliable and repairable servers in these queueing models are similar to VMs of the edge node, the present multi-server queueing models cannot be applied to model the edge computing servers directly. The reason is that the VM needs to migrate memory images and states before its recovery, resulting in down states for the migration and recovery that cannot be described by present models.

To investigate the influence of VM migration on node delay, some researchers focused on the performance evaluation of VM migration [12,13,40,42,43,44,45,46,47] and latency-aware VM migration algorithms [14,48,49,50,51]. In these studies, migration time and service downtime [13] were measured to evaluate the performance of VM migration, yet how the migration time and service downtime influenced the sojourn time at the edge node has not been studied. For the latency analysis of edge computing, most studies focused on the delay of specific activities, such as processing, task offloading, and resource allocation [52]. The total sojourn time at the edge node has not been well investigated. Liu et al. [53] analyzed the influence of migration, but they only studied the probability that the computing system can provide sufficient VMs. The delay at the server was not analyzed.

All this considered, present studies on VM delay did not investigate the influence of VM migration on the total delay at the VMs, for the queueing models of VMs ignored the failure and migration of VMs [24,25,27], and the VM migration delay was measured separately from the total delay [12,13,14,42,44,45,46,47,53]. Meanwhile, due to the unique process of failure, migration, and recovery of VMs, present queueing models with unreliable servers [20,21,22,23,26] cannot be directly applied to analyze the VM delay. To analyze the timely reliability of the VMs and the network, we build a queueing model considering the whole process of failure, migration, and recovery of VMs and propose a method to obtain the PDF of the delay at the edge server.

2.3. End-to-End Timely Reliability Analysis

There are two types of serial queueing models for the analysis of end-to-end delay, which are analytic models and approximate models. By traversal of the end-to-end topology, the end-to-end delay is obtained based on the delay in each node. Some researchers analyzed the situation of two queuing nodes, i.e., tandem queue [54,55,56,57,58]. However, due to the mathematical complexity, the formulation of the networked queueing model is infeasible when the queueing nodes are more than three. Therefore, approximation methods such as simulation [18,59] and the upper bound method [60] are widely applied. Researchers also studied the end-to-end delay of queueing networks based on the departure process approximation. For instance, Xie and Haenggi [28] defined a parameter to measure the spatial correlation between interfacing queueing nodes based on the departure process approximation. Although Kafhali and Salah [24] proposed a serial M/M/s/K queueing model for an edge-cloud computing system based on Burke’s theorem, they did not analyze the end-to-end delay in the model. In addition, Jie Shen et al. [61] proved that the end-to-end delay distribution of a network system is the inverse Laplace transform of the transfer function of the signal flow graph and proposed an end-to-end delay analysis method based on frequency domain. The end-to-end delay analysis considering the structure of the edge computing network is still lacking.

3. Problem Description

As depicted by Figure 1, the edge computing network is composed of four parts—sensor nodes, the edge server, the cloud gateway, and the cloud server. The task data generated by sensors are transmitted to the edge server for necessary processing by VMs, including pre-processing, analysis, compression, and encryption, over Bluetooth Low Energy, or Wi-Fi. After the processing at the VMs of the edge server, a part of tasks will be transmitted to the cloud server by the cloud gateway for global storage, power computing, or running various applications. Usually, a task expects to be processed within an extremely short time at the edge server to guarantee the real-time application. Differently, the cloud server can process a task in hours or days. Therefore, the delay at the VMs in the edge server and the delay through the whole network need to be analyzed, respectively. For the edge service, a task only needs to queue at the edge server. Whereas, for the cloud service, a task needs to go through three queues, which are the queue at the edge server, the queue at the cloud gateway, and the queue at the cloud server successively. According to the delay analysis, the definition of timely reliability is given by [18]:

_RT=Pr(D<_Dmax)=_{∫0_Dmax}f(t)dt

whereDis the total delay of a task,_Dmaxis the maximum allowable delay of the task,f(t)is the PDF of the delay.

3.1. Analysis of Delay at the VMs in the Edge Server

In an application, the task data need to be transmitted to the edge server and be processed by VMs within a delay constraint of_Dmax1. In the edge server,SVMs are assigned to process tasks, which are called task VMs (TVMs), anddidle VMs are assigned as backups, which are called backup VMs (BVM). All the VMs are homogeneous. Then, the total number of VMs isZ=S+d. The task flow is generated following a Poisson process at each sensor, where the arrival rate isλ. The service time of each VM follows an exponential distribution where the mean service time is1/μ. If theSVMs are all occupied, the task can be stored in the buffer, whose capacity isK. Then, the maximum capacity of the edge server accommodating tasks isY=S+d+K. If the buffer is full as well, a discard occurs. We assume that the discarded tasks all violate the delay constraints and do not consider the retrial queue.

When a VM fails, the task will be migrated to an idle VM and restart the processing, as shown in Figure 2. The migration downtime follows an exponential distribution where the mean service time is1/α. Note that if all the other VMs are occupied or failed, the task will stay in the failed VM and wait for an idle VM to migrate. Extremely, when all the VMs are failed and not ready to reboot, to avoid deadlock of the VMs, the last failed VM can reboot without migration, reserving the un-migrated task. After the migration of the task, the failed VM can reboot and become an available VM again. If there are fewer thanSVMs available (capable to process a task), the newly recovered VM will be assigned as a TVM; otherwise, it will be assigned as a BVM. The reboot time of VM follows an exponential distribution where the mean reboot time is1/β . Then, each VM has three states: available, failed-unmigrated, and failed-migrated. Here, the available state refers to the state that the VM is capable to process a task. An available VM can be either idle or occupied. Transitions among the three states are depicted in Figure 3: (1) From ① to ②, the running VM fails, transiting from available to failed-unmigrated. (2) If there is at least one idle VM, the migration begins, or the failed VM needs to wait for an idle VM to migrate. When the migration completes, the edge server transits from ② to ③, where the failed VM is failed-migrated. (3) By rebooting, the failed-migrated VM can recover to be available again, edge server transiting from ③ to ①. If there are fewer than S VMs running, the newly recovered VM will be assigned as a TVM, or it will be assigned as a BVM. (4) At ①, when the TVMs are all occupied, the new arrival task will be waiting in the buffer, edge server transiting from ① to ④. If the buffer is full as well, a discard occurs. (5) At ④, if a VM completes the processing of a task and becomes an idle VM, the task waiting in the buffer will get into the idle VM for processing, edge server transiting from ④ to ①.

In the edge computing server described above, the processing of tasks can be treated as a revised M/M/s/K queue, where the VM migration and recovery are considered. Assumptions:

All the tasks are equal in data size;
The failures of VM are all transient failures, which can be recovered by rebooting in a short time;
Idle VMs do not fail;
The working times of VMs in the server are independent and identically distributed with the exponential distribution where1/θis the mean time to failure.

3.2. Analysis of End-to-End Delay

After the processing at the edge server, a part of task data needs to be transmitted to the cloud for further processing or storage within a delay constraint of_Dmax2 . According to Burke’s theorem [62], the steady-state departure process of a stable M/M/c queueing system is a Poisson process with the same rate as the arrival process [63]. Assume that the service time of the cloud gateway and the cloud server follow exponential distributions where the mean service time is1/_μlgand1/_μcs, respectively. Then, the queues at the cloud gateway and the cloud server can be modeled by the M/M/1/K queueing model [64]. As depicted in Figure 4, the tasks are processed at the edge server firstly. Then, a portionpof tasks is transmitted to a cloud server by a cloud gateway for further processing. The delay is caused by the queueing at the edge server, cloud gateway, and cloud server successively.

In the problem, we consider different delay constraints to obtain VTR and ETR of the edge computing network, according to the parameters of the network shown in Table 2.

4. Mathematical Model 4.1. Queueing Model of the Edge Server

To model the queueing at the VMs in the edge server described in the previous section, we denote the state of the edge server by(_x1,_x2,_x3), where integer_x1∈[0,Z]denotes the number of available VMs, integer_x2∈[0,Y]denotes the number of tasks in the edge server, and integer_x3∈[0,min(_x2, Z)]denotes the number of failed-unmigrated VMs. Since_x1,_x2, and_x3 are all limited, the number of states is finite. According to Section 3.1,_x1,_x3, and(_x1+_x3)shall all be less thanZ. In addition, the number of tasks_x2shall not be more than the present capacity of the edge server[min(_x1,S)+_x3+K]. Furthermore, the number of tasks_x3reaches its maximum when all the tasks stay in failed-unmigrated VMs, where the maximum value is_x2. According to the analysis above, the constraints of the three state variables are:

{0≤_x1≤Z0≤_x2≤min(_x1,S)+_x3+K0≤_x3≤min(_x2, Z − _x1).

whereZdenotes the total number of VMs,Sdenotes the number of TVMs,Kdenotes the capacity of the buffer of the edge server.

We assume that the states violating Equation (2) exist in the CTMC and name them as pseudo-states. The transitions of pseudo-states are the same as the real states. Furthermore, to study the state transitions and express the transition rate diagram conveniently, we divide the states into a few blocks in terms of_x2, i.e., the states that have the same value of_x2belong to the same block. Then the total number of blocks is (Y+1), for_x2ranges from 0 toY. Thej-th block_Bj:

_Bj={(_x1,_x2,_x3)|_x2=j−1,_x1∈[0,Z],_x3∈[0,min(_x2, Z)]}.

The number of columns of_Bj. is:

_aj=min(j,Z), 1≤j≤Y+1.

Let_bjdenote the total number of columns from block_B1to block_Bj. Then,_bjis given by:

_bj=_∑u=1j _au.

Consequently, the transitions inside a block are all triggered by state transitions of VM (failure, migration, and reboot). Figure 5 shows transitions inside block_Bm+1. There are three kinds of transition:

(1) Transitions caused by failures of VMs. The number of VMs about to fail equals the number of running VMs, which is the minimum of the number of available TVMsmin(_x1,S)and the number of live tasks(_x2−_x3). Thus, from state(_x1,_x2,_x3)to state(_x1−1,_x2,_x3+1), the transition rate isgθ, wheregis the number of about-to-fail VMs (or the number of running VMs), given by:

g=min(min(_x1,S),_x2−_x3).

Notably, the transition of migration does not exist when_x1=0, since none VM is available.

(2) Transitions caused by migrations of failed VMs. The number of migrating VMs is the minimum of the number of failed-unmigrated VMs_x3and the number of idle VMs(_x1−g). Thus, from state(_x1,_x2,_x3)to state(_x1,_x2,_x3−1), the transition rate ishα, wherehis the number of migrating VMs, given by:

h=min(_x3,_x1−g)=min(_x3,_x1−min(min(_x1,S),_x2−_x3)).

(3) Transitions caused by the reboot of failed-migrated VMs. The number of rebooting VMs is the number of failed-migrated VMs(Z−_x1−_x3). From state(_x1,_x2,_x3)to state(_x1+1,_x2,_x3), the transition rate is(Z−_x1−_x3)β.

At the same time, transitions between blocks are triggered by the arrival or departure of tasks. Figure 6 presents the transitions between blocks when_x1=Z−i+1. The presented two states in each block are the first two states of the block. When a task arrives, thek-th state of block_Bjtransits to thek-th state of block_Bj+1with a transition rate ofλ; when a task departs, thek-th state of block_Bjtransits to thek-th state of block_Bj−1with a transition rate ofgμ.

Note that the transitions inside the blocks and the transitions between the blocks all exist in the CTMC of the edge server. We detach the transitions into two types for a concise and explicit description. Figure 7 presents the states of a case whereS=1, d=1, K=1.

Given the partitioned configuration of the CTMC, to locate a state more easily, we define the coordinate of a state as(i,j,k), whereidenotes the row number,jdenotes the block number, andkdenotes the column number inside the block. The derivation between a state and its coordinate is

{_x1=Z+1−i_x2=j−1_x3=k−1.

where_x1denotes the number of available VMs,_x2denotes the number of tasks in the edge server,_x3denotes the number of failed-unmigrated VMs. According to Equations (2) and (8), the constraints of the three coordinate variables are:

{1≤i≤Z+11≤j≤min(Z−i+1,S)+k+K1≤k≤min(j, i).

To avoid misunderstanding, we define the set of all the states in rowi∈[1,Z+1]of the CTMC as_Li:

_Li={(i,j,k)|j∈[1,Y+1], k∈[1,min(j, Z)]}.

Then, in_Li, the number of states(i,j,k)isc=_bj−_aj+k, where_bjdenotes the total number of columns from block_B1to block_Bj,_ajdenotes the number of columns of the block_Bj. Since the number of states is numerous, it is hard to give the expression of transition rate for each state if we directly construct the generator matrix according to the sequence of states. Therefore, we construct the generator matrix of the CTMC based on the matrix-geometric method. Specifically, we obtain the generator matrices of transition inside a row_Li(transitions caused by arrival and departure of tasks and migration of VMs) and generator matrices of transitions between neighboring rows (transitions caused by failure and reboot of VMs), respectively. In this way, we can express the generator matrix as a partitioned matrix with quasi-birth-and-death feature:

A=[_A11_A120000_A21_A22_A230000_A32_A33⋱⋮⋮⋮0⋱⋱_AZ−1,Z00⋮0_AZ,Z−1_AZ,Z_AZ,Z+10000_AZ+1,Z_AZ+1,Z+1],

where matrix_{Ax_{y(_bY+1×_bY+1)}}denotes the generator matrix from_Lxto_Ly. Then, the element_Axy(_c1,_c2)in row_c1and column_c2of matrix_Axydenotes the transition rate from the_c1-th state of_Lxto the_c2-th state of_Ly. The general form of_Axyis given by:

{_Aii(_bj−_aj+k,_bj+1−_aj+1+k)=Nλ,1≤i≤Z+1, 1≤j≤Y, 1≤k≤min(j,Z)_Aii(_bj+1−_aj+1+k,_bj−_aj+k)=gμ,1≤i≤Z+1, 1≤j≤Y, 1≤k≤min(j,Z)_Aii(_bj−_aj+k,_bj−_aj+k−1)=hα,1≤i≤Z+1, 2≤j≤Y+1, 2≤k≤min(j,Z)_Ai,i+1(_bj−_aj+k,_bj−_aj+k+1)=gθ, 1≤i≤Z, 2≤j≤Y+1, 1≤k≤min(j,Z)−1_Ai+1,i(_bj−_aj+k,_bj−_aj+k)=iβ, 1≤i≤Z, 1≤j≤Y+1, 1≤k≤min(j,Z)_Ai+1,i(_bj−_aj+k+1,_bj−_aj+k)=(k−1)β,i=Z+1, Z+1≤j≤Y+1, k=Z+1,

wheregis the number of running VMs,his the number of migrating VMs.

Identify the pseudo-states according to Equation (2). Then, delete the rows and columns of the pseudo-states inAand name the remainingl×lmatrix asQ. After that, revise the diagonal elements ofQto subject to:

_Ql×l·_1l×1=_0l×1.

The matrixQis the final form of the generator matrix of the CTMC.

To obtain the PDF of sojourn time at the edge server, we employ the first passage time method. Among all the states of the edge computing server, the states in block 1 represent that the edge server is empty, for the numbers of the tasks are 0. Delete the states in block 1. Then, the sojourn time at the edge server approximately equals the time that the edge server transits from the non-empty states to the empty states when no new task arrives, which can be obtained by the first passage time method. Obtaining the PDF of sojourn time at the edge server follows the steps below:

1. Solve Equation (14) to obtain the steady-state probability vectorπof the CTMC:

{^QT ^πT=0^∑π=1;

2. Delete the empty states in the transition rate matrixA, and retain the rest states. The new matrix is named asM.

3. Delete the probability of empty states in the stable probability vectorπ. The new vector is named asγ. Letδ=γ/^∑γ.

4. The distribution function of sojourn timeDis:

_Fes(t)=P(D≤t)=1−δ·exp(Mt)·1.

5. Then PDF of the sojourn timeDis:

_fes(t)=_Fes′(t).

4.2. Timely Reliability Model

As analyzed in Section 3, the queueing at the cloud gateway and cloud server can be modeled by the M/M/1/C model. The tasks go through the queue of the edge server, the queue of cloud gateway, and the queue of the cloud server successively. Then, the end-to-end delay is determined by the joint distribution of the successive delays of a task traversing edge server, cloud gateway, and cloud server. The PDF of delay at the VMs in the edge server has been analyzed in Section 4.1. In the rest of Section 4.2, we present the analysis method of end-to-end delay and timely reliability, including VTR and ETR.

4.2.1. VMs Timely Reliability

VMs timely reliability (VTR) is defined as the probability that the VMs in an edge server are capable of processing a task within a required time. For tasks requiring edge service, the service delay equals the sojourn time at the edge server. According to Equation (1), VTR is:

_RV=_{∫0_Dmax1} _fes(t)dt,

where_Dmax1is the maximum allowable delay at the edge server.

4.2.2. End-to-End Timely Reliability

End-to-end timely reliability (ETR) is defined as the probability that a group of end-to-end nodes in the edge computing network is capable of processing a task within a required time. For tasks requiring cloud service, the service delay is the sum of delay at the edge server, cloud gateway, and cloud. Since only a portionpof tasks is transmitted to the cloud server, the arrival rate of tasks at the cloud gateway and the cloud server ispNλ. Departing from the edge server, a task arrives at the cloud gateway findingwtasks (0≤w≤_Clg ) already there with a probability [24]:

_Plgw={(1−_ρlg)^(_ρlg)w1−_ρlg ^_Clg ,_ρlg≠11_Clg ,_ρlg=1,

where_ρlg=pNλ/_μlgis the traffic intensity at the cloud gateway. The processing time of a task at the cloud gateway follows exponential distribution:

_flg0(t)=_μlg ^e−_μlgt, t>0.

Then, the sojourn time of a task is the joint distribution of the processing time of all the(w+1)tasks (the formerwtasks and itself). Since the sum of identically and exponentially distributed variables follow a Gamma distribution, the PDF of the sojourn time of a task at the cloud gateway if it arrives at the cloud gateway findingwtasks already there is:

_flgw(t)=_μlg ^ww!·^tw−1·^e−_μlgt, t>0.

Therefore, the PDF of the sojourn time at the cloud gateway:

_flg(t)=_{∑w=0_Clg−1} _Plgw·_flgw(t).

Similarly, departing from the cloud gateway, a task arrives at the cloud server findingwtasks (0≤w≤_Ccs) already there with a probability:

_Pcsw={(1−_ρcs)^(_ρcs)w1−^(_ρcs)_Ccs, _ρcs≠11_Ccs, _ρcs=1,

where_ρcs=pNλ/_μcsis the traffic intensity at the cloud server. The sojourn time of a task at the cloud server if it arrives at the cloud server findingwtasks already there is:

_fcsw(t)=_μcs ^ww!·^tw−1·^e−_μcst, t>0.

The PDF of the sojourn time at the cloud server:

_fcs(t)=_{∑w=0_Ccs−1} _Pcsw·_fcsw(t).

According to the above analysis, PDF of the end-to-end delay of the edge computing network can be obtained by:

f(_t3)=_∫0+∞[_∫0+∞ _fes(_t1)_flg(_t2−_t1)d_t1]_fcs(_t3−_t2)d_t2,

where_t1,_t2, and_t3are the sojourn time at the edge server, the total time to go through the edge server, and the cloud gateway, and the total time to go through the edge server, the cloud gateway, and the cloud server, respectively. According to Equation (1), ETR can be obtained by:

_RN=_{∫0_Dmax2}f(_t3)d_t3.

4.3. Algorithm

To calculate VTR, the key is to solve the CTMC with a state-space of high dimension and obtain the PDF of the sojourn time at the edge server. For this purpose, we need to propose an algorithm to formulate the partitioned generator matrix and obtain the PDF of the sojourn time at the edge server based on the first passage time method. The obtaining of PDF of the sojourn time at the edge server is described in Algorithm 1. After that, VTR and ETR can be calculated according to Section 4.2.

Algorithm1 Algorithm for obtaining PDF of the sojourn time at the edge server

Input: Arrival rate of tasks:λ Number of sensors:N Service rate of VM: μ Number of TVMs:S Number of BVMs:d Capacity of the buffer:K Failure rate of VMs:θ Migration rate of VMs:α Reboot rate of VMs:β Output: PDF of the sojourn time at the edge server:_fes(t)

1. initializeA2. vectora=min(j,Z), 1≤j≤Y+1% The number of columns of blockj3. vectorb=_∑u=1j _au, 1≤j≤Y+1% The number of total columns from block 1 to blockj4. for i = 1: Z + 1 do % Transitions caused by arrival or departure of tasks 5. for j = 1: Y do 6. for k = 1: min(a(j), i) do 7. x =b(j) −a(j) + k 8. y =b(j + 1) −a(j + 1) + k 9. A{i, i} (x, y)=Nλ10. A{i, i} (y, x) = min (min (Z + 1 − i, S), j − k) *μ11. end for 12. end for 13. end for 14. for i = 1: Z + 1 do % Transitions caused by migration of VMs 15. for j = 2: Y + 1 do 16. for k =a(j): −1: 2 do 17. x =b(j) −a(j) + k 18. A{i, i} (x, x − 1) = min (k − 1, Z + 1 − i − min (min (Z + 1 − i, S), j − k)) *α19. end for 20. end for 21. end for 22. for j = 2: Y + 1 do % Transitions caused by migration of VMs 23. for k = 1:a(j) − 1 do 24. for i = 1: Z do 25. x=b(j)−a(j)+k26. A {i, i+1} (x, x+1) = min (min(Z + 1 − i,S), j − k) *θ27. end for 28. end for 29. end for 30. for j = 1: Y + 1 do % Transitions caused by reboot of VMs 31. for k = 1:a(j) do 32. for i = 2: Z + 1 do 33. x =b(j) −a(j) + k 34. A{i, i − 1} (x, x) = iβ35. end for 36. end for 37. end for 38. for j = Z + 1: Y + 1 do % Transitions caused by reboot without migration when all the VMs are failed-unmigrated 39. k =a(j) 40. x =b(j) −a(j) + k 41. A{Z + 1, Z} (x, x−1) = (k − 1) *β42. end for 43. Q =A44. for j = 1: size (a,2) do % Identify pseudo-states 45. for k = 1:a(j) do 46. for i = 1: Z + 1 do 47. ifk≤min(j, i)|| j≤min(Z−i+1,S)+k+K48. x = (i − 1) *b(Y + 1) +b(j) −a(j) + k 49. replace row x of Q with 0 50. replace column x of Q with 0 51. end if 52. end for 53. end for 54. end for 55. replace the rows and columns of absorbing states with 0 56. record the row number (or column number) of the states whose value is 0 57. revise the diagonal elements of Q to ensure the sum of each row equal to 0 58. delete the rows and columns whose value is 0 59. the steady-state probability vector:π=Q/e%eis a column vector whose last element is 1, and the other elements are 0 60. M = A 61. set the arrival rate to 0 in M 62. revise the diagonal elements of M to ensure the sum of each row equals to 0 63. delete the rows and columns of the states of block 1 % The states of block 1 represent that the edge server is empty 64. delete the rows and columns that equals to 0 65. delete the probability of empty states in the stable probability vectorπand name the new vector asγ66. letδ=γ/^∑γ67._Fes(t)=1−δ·exp(Mt)·168. calculate the PDF of sojourn time at the edge server:_fes(t)=_Fes′(t)69. return_fes(t)

5. Results and Discussion In this section, we provide a case study of traffic monitoring for the proposed timely reliability evaluation method. Simulation results are obtained to verify our model and numerical results. We also investigate the influence of different factors on timely reliability, which helps design a better edge computing network. 5.1. Experimental Setup

At a crossroads, the edge computing network for traffic monitoring consists of several sensors, an edge server, a cloud gateway, and a cloud server. The architecture of this network is depicted in Figure 1. To monitor the traffic at this crossroads, the sensors collect task data in real-time and transmit the tasks to the edge server, where the tasks are processed by VMs. In addition, a part of processed tasks is transmitted to the cloud server through the cloud gateway for further process or storage. The parameters used in this case are shown in Table 3. The data on sensor data refer to [65,66]. The data on VM migration refer to [40]. The data on system failure refer to [10].

5.2. Simulation Experiments

The numerical results are verified by simulation based on OMNeT++ (V5.5.1), a C++ network simulation library and framework. We choose OMNeT++ for its open source and extensible features. Figure 8 shows the simulation model in OMNeT++. In the simulation model, the sensors are modeled by the Wireless Host Module; the buffer of the edge server is modeled by the Passive Queue Module; the VMs of the edge server are modeled by the Server Module; the cloud gateway is modeled by the Passive Queue Module; the cloud server is modeled by the Cloud Module. Particularly, to model the migration time of the VM migration, a Downtime Generator (DG), modeled by the Passive Queue Module, is employed to connect the edge servers and VMs. Once a running VM fails, the task on this VM will be migrated to an idle VM through the DG, where a migration downtime of an exponential distribution is needed. Notably, the tasks spend no time to pass the DG unless it is on the way of migration. As soon as the migrating task arrives at the destination VM, the failed VM becomes failed-migrated, and its rebooting can begin. The values of the parameters are the same as those in Table 3.

In the simulation, the end-to-end delay converges to a steady-state value as the simulation time goes by, as shown in Figure 9. We collect the delay values of each 1000 tasks to calculate ETR. When the variation of ETR is less than 1%, we take the final value as the stationary simulation result. Let the Boolean variable_Fidenote if the end-to-end delay of a task satisfies the task requirement:

_Fi={0, the delay does not satisfy the requirement _Dmax21, the delay satisfies the requirement _Dmax2.

Then ETR can be obtained by:

_RN=_∑i=1n _Fin.

In the simulation, it takes 5 min for the end-to-end delay to converge. To obtain the steady-state value of ETR, the simulation takes more than 10 min. We compare the simulation results and the numerical results of ETR when the failure rateθ equals to 0.0002, 0.0006, and 0.001 with different numbers of sensors. The comparisons are shown in Figure 10, where the bars’ heights represent the relative errors of the numerical results. The relative errors are all less than 2%. The simulation results verify that the analytical model proposed in this paper is valid and applicable for timely reliability analysis of a computing network considering failure, migration, and reboot of edge server VMs.

5.3. Timely Reliability Analysis To investigate the impacts of different factors on the timely reliability of the VMs, we compare VTR under different conditions based on the analysis methods we proposed.

5.3.1. Analysis of Failure Rate, Migration Rate, and Reboot Rate

It is observed in Figure 11 and Figure 12 that VTR is a monotonic function of the failure rate, migration rate, and reboot rate of VM. Comparing Figure 11a with Figure 11b, we find VTR is more sensitive to the failure rate when the reboot rate and migration rate are smaller since the VMs spend more time to migrate and recover when the reboot rate and migration rate are smaller. Thus, if the migration time and reboot time are quite short, the designers do not need to be very rigid on the reliability of VMs. Conversely, if the migration time and reboot time are relatively long, designers can improve VTR of the edge computing network effectively by reducing the failure rate of VMs.

5.3.2. Analysis of the Number of VMs

The number of TVMs and BVMs are also influential factors of VTR. Results in Figure 13a show that VTR is higher with the increase in the number of TVMs when the failure rate is high. The reason is that more VMs mean the server can process more tasks at the same time. However, when the failure rate is low, we have an interesting finding that VTR monotonically decreases as the number of TVMs grows, such as the cases when the failure rate equals 0.0002 or 0.0004 in Figure 13a. In fact, the increase in the number of TVMs influences VTR by two means, i.e., (1) reducing the end-to-end delay by processing more tasks at the same time and (2) increasing the end-to-end delay by bringing about more failure, migration, and reboot of VMs in a given time. The second impact is more significant when the failure rate is high. Whereas when the failure rate is low, the first impact is more significant. Similar results are shown by Figure 13b, where with no BVM or extremely few BVMs, VTR monotonically increases as the number of TVMs grows, indicating that the first impact is more significant. However, with more BVMs, VTR monotonically decreases as the number of TVMs grows, indicating that the second impact is more significant. Therefore, the increase in TVMs is not always a good choice for improving VTR.

According to the above findings, the resource manager can allocate an appropriate number of TVMs according to the above-mentioned analysis, which can be a guidance for the initial configuration of VMs. As for the number of BVMs, Figure 14 shows that VTR monotonically increases as the number of BVM grows.

5.3.3. Management of VMs

Since the increase in the number of TVMs and the number of BVMs both improve VTR generally, to achieve the highest VTR, there is a trade-off between the number of TVMs and the number of BVMs when the total number of VMs is limited. Figure 15a reveals that when the failure rate of VM is low, only assigning one VM as TVM is the best solution. Figure 15b shows that when the failure rate is high, more VMs should be assigned as TVMs to achieve the highest VTR. Therefore, to achieve the highest VTR, based on the analysis method of this paper, the resource manager can find the optimal allocation of VMs according to the related parameters of the edge computing network and the tasks.

5.4. Section Summary In this section, we provide a case study of traffic monitoring for the proposed timely reliability evaluation method and introduce the simulation method based on OMNeT++. Simulation results verify that the analytical model proposed in this paper is valid and applicable for timely reliability analysis of VMs of an edge server, considering failure, migration, and reboot of VMs. We also investigate the influence of different factors on timely reliability. Results show that: (1) VTR is a monotonic function of the failure rate, migration rate, and reboot rate of VM and is more sensitive to the failure rate when the reboot rate and migration rate are smaller. (2) VTR monotonically increases as the number of BVM grows, but the increase in TVMs does not always improve VTR. (3) To achieve the highest VTR, there is a trade-off between the number of TVMs and the number of BVMs when the total number of VMs is limited. 6. Conclusions In this paper, we analyzed the VTR and the ETR of the edge computing network, considering the failure, migration, and reboot of VMs. To model the state transitions of VMs, we define three states of VMs (available, fail-unmigrated, fail-migrated) and construct a partitioned CTMC for the edge server. We express the generator matrix as a partitioned matrix based on the matrix-geometric method. An algorithm is proposed to resolve the CTMC and obtain the sojourn time at the edge server as well as VTR. Moreover, to analyze ETR of the network, we build the delay analysis model based on Burke’s theorem and propose an end-to-end delay analysis method. The proposed method is more eligible to model the multi-state system of VMs. When the dimension of CTMC states is high, the constructing method of the generator matrix proposed in this paper can effectively reduce the difficulty of algorithm designing. We investigate VTR of the edge computing network under different conditions. The numerical results are verified to be correct by simulation based on OMNeT++. Results show that VTR is a monotonic function of the migration rate, reboot rate of VM, the number of TVMs, and the number of BVMs. However, in some cases, the increase in TVMs may conversely decrease VTR, since more TVMs also bring about more failures in a given time. Moreover, we find that there is a trade-off between the number of TVMs and the number of BVMs when the total number of VMs is limited. The resource manager can find the optimal allocation of VMs according to the migration rate, reboot rate, failure rate, etc., of the edge computing network and the arrive rate, computation complexity, etc., of the tasks. Based on the trade-off of TVMs and BVMs, in future research, we will further study the optimization of computing resources to obtain a better performance of the edge server. We will also study the case when the sizes of task data are different in future work.

Notation	Description
λ	Arrival rate of tasks
N	Number of sensors
μ	Service rate of virtual machine (VM)
S	Number of task VMs (TVMs)
d	Number of backup VMs (BVMs)
K	Capacity of the buffer of the edge server
Z	Total number of VMs
Y	Maximum capacity of the edge server
θ	Failure rate of VM
α	Migration rate of VM
β	Reboot rate of VM
p	The proportion of tasks requiring cloud service
_μlg	Service rate of cloud gateway
_Clg	Capacity of cloud gateway
_μcs	Service rate of cloud server
_Ccs	Capacity of cloud server
_Dmax1	Maximum allowable delay at the edge server
_Dmax2	Maximum allowable end-to-end delay
_x1	Number of available VMs
_x2	Number of tasks in the edge server
_x3	Number of failed-unmigrated VMs
i	State row number
j	State block number
k	State column number inside the block
g	Number of about-to-fail VMs
h	Number of migrating VMs
_Bw	Thew-th block of the CTMC
_Lw	Set of all the states in rowwof the CTMC

Parameter	Description	Value
λ	Arrival rate of tasks	180 per s
N	Number of sensors	10
μ	Service rate of VM	1000 per s
S	Number of TVMs	5
d	Number of BVMs	2
K	Capacity of the buffer of the edge server	180
Z	Total number of VMs	7
Y	Maximum capacity of the edge server	187
θ	Failure rate of VM	0.0002 per s
α	Migration rate of VM	0.2 per s
β	Reboot rate of VM	0.002 per s
p	The proportion of tasks requiring cloud service	1%
_μlg	Service rate of cloud gateway	200 per s
_Clg	Capacity of cloud gateway	100
_μcs	Service rate of cloud server	300 per s
_Ccs	Capacity of cloud server	100
_Dmax1	Maximum allowable delay at the edge server	0.04 s
_Dmax2	Maximum allowable end-to-end delay	0.04 s

Supplementary Materials

The following are available online at https://www.mdpi.com/1424-8220/21/1/93/s1.

Author Contributions

Conceptualization, K.L. and L.G.; methodology, K.L., L.G., and Y.W.; software, K.L. and X.C.; validation, K.L., L.G., Y.W. and X.C.; formal analysis, L.G. and Y.W.; investigation, K.L.; resources, K.L. and L.G.; data curation, K.L. and X.C.; writing-original draft preparation, K.L.; writing-review and editing, K.L., L.G., and Y.W.; visualization, K.L.; supervision, L.G.; project administration, L.G.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NATIONAL NATURAL SCIENCE FOUNDATION OF CHINA, grant number 61304148 and 61871013.

Data Availability Statement

The data presented in this study are available in the supplementary material, Data.zip.

Conflicts of Interest

The authors declare no conflict of interest.

Word count: 10117

Show less

© 2021. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

For the edge computing network, whether the end-to-end delay satisfies the delay constraint of the task is critical, especially for delay-sensitive tasks. Virtual machine (VM) migration improves the robustness of the network, whereas it also causes service downtime and increases the end-to-end delay. To study the influence of failure, migration, and recovery of VMs, we define three states for the VMs in an edge server and build a continuous-time Markov chain (CTMC). Then, we develop a matrix-geometric method and a first passage time method to obtain the VMs timely reliability (VTR) and the end-to-end timely reliability (ETR). The numerical results are verified by simulation based on OMNeT++. Results show that VTR is a monotonic function of the migration rate and the number of VMs. However, in some cases, the increase in task VMs (TVMs) may conversely decrease VTR, since more TVMs also brings about more failures in a given time. Moreover, we find that there is a trade-off between TVMs and backup VMs (BVMs) when the total number of VMs is limited. Our findings may shed light on understanding the impact of VM migration on end-to-end delay and designing a more reliable edge computing network for delay-sensitive applications.

Details

Title

Timely Reliability Analysis of Virtual Machines Considering Migration and Recovery in an Edge Server

First page

Publication year

2021

Publication date

2021

Publisher

MDPI AG

e-ISSN

14248220

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/s21010093

ProQuest document ID

2473635333

Timely Reliability Analysis of Virtual Machines Considering Migration and Recovery in an Edge Server

Jump to:

Full text

Abstract

Details

Suggested sources