Abstract
Microservers (MSs, ARM-based mobile devices) with built-in sensors and network connectivity have become increasingly pervasive, and their computational capabilities continue to improve. Many works have shown that heterogeneous clusters, consisting of low-power MSs and high-performance nodes (x86-based servers), can provide competitive performance and energy efficiency. However, these works make only simple modifications to existing distributed computing systems for adaptation, which have been shown not to fully exploit the various heterogeneous resources. In this paper, we argue that such heterogeneous clusters also call for flexible and efficient computational resource sharing and scheduling. We then present Aries, a platform for abstracting, sharing and scheduling cluster resources, scaling from embedded devices to high-performance servers, among multiple distributed computing frameworks (Hadoop, Spark, etc.). In Aries, we propose a two-layer scheduling mechanism to enhance the resource utilization of these heterogeneous clusters. Specifically, the resource abstraction layer in Aries is constructed for the overall coordination of resources, providing computation and energy management. Within this layer, a hybrid resource abstraction approach manages HS and MS resources at fine and coarse granularity separately, to support efficient resource offers based on the "resource slot". The task schedule layer supports the various sophisticated schedulers of existing distributed frameworks and decides how many resources to offer each computing framework. Furthermore, Aries adopts a novel strategy to support smart switching among three system modes for energy-saving effectiveness. We evaluate Aries with a variety of typical data center workloads and datasets, and the results show that Aries achieves more efficient utilization of resources when sharing the heterogeneous cluster among diverse frameworks.
Introduction
Recently, there has been an explosion in the number and variety of widely connected devices at the computing edge producing data, e.g., motion sensors, thermostats and mobile phones. Meanwhile, deploying masses of high-performance servers (HSs)1 (x86-based servers) in data centers has become a major big data processing trend, powering both large Internet services and data-intensive frameworks. Driven by these large-scale data processing applications, researchers and practitioners have been developing a diverse array of cluster computing frameworks to simplify programming and exploit the computing potential of data centers, including MapReduce [1], Spark [2], Naiad [3] and others [4].
While these platforms provide high-level abstractions for big data processing, they do not provide a distributed processing abstraction for running at the computing edge, which consists of a massive number of devices (mostly ARM-based nodes) representing a relatively rich pool of idle computing resources. Hence, works [5–8] have been proposed to handle the proliferation of MSs (ARM-based devices) and show that MSs have recently emerged as an alternative to HSs, promising scalable performance and low energy consumption in the data center via scale-out. However, in real run-time scenarios, many applications on the heterogeneous cluster require much richer data manipulation functions. For instance, PreHeat [9] stores data and processes long-period jobs (hours) on it to predict future occupancy from daily logs, while digital neighborhood watch [10] shares camera information based on a time window and has to process short-period jobs (milliseconds) for real-time responses. A new paradigm for these multi-functional network systems, providing interfaces to migrate distributed processing frameworks and to share the heterogeneous cluster resources among them, has emerged as a novel architecture coupling HS clusters and ARM-based MSs.
Our goal is to simplify the management of heterogeneous clusters consisting of ARM-based MSs and x86-based HSs. By studying existing and proposed applications, we uncover the key challenges for such systems. First, the platform should be flexible enough to support the abstraction of heterogeneous computing resources, because the resources in MSs and in the data center are prioritized by different metrics such as performance, reliability and cost. Second, the platform should enable sharing the computing resources across upper distributed computing frameworks, and further support scheduling the frameworks' computing tasks onto the heterogeneous cluster. Third, compared to other resource scheduling approaches in the data center [11, 12], the platform should achieve a remarkable decline in energy consumption, since it introduces low-power ARM-based MSs to the cluster.
In this paper, we propose and implement a platform prototype, namely Aries, for efficiently abstracting, sharing and scheduling cluster resources scaling from ARM-based MSs to x86-based HSs. We make maximum use of the idle CPU, memory and network resources in MSs by sharing processing responsibility with the servers [13], and integrate ZooKeeper [14] for fault tolerance. In Aries, we introduce a two-level hierarchical scheduling mechanism to manage these heterogeneous resources, with a flexible resource abstraction layer and an adaptive task schedule layer. Internally, we propose a novel hybrid fine/coarse-granularity resource abstraction approach in the resource abstraction layer to support efficient resource offers based on the "resource slot". Specifically, Aries organizes the computing resources of the MSs into a resource pool, built on an MS Queue, in which each slot records the total available resource size of a node. A fine-grained method is adopted for the HSs, whose resources are split into multiple slots with unit [1 CPU, 1 GB RAM]. In Aries, the task schedule layer supports the sophisticated schedulers of distributed computing frameworks and decides how many resource slots to offer the frameworks.
Moreover, we achieve a remarkable decline in energy consumption by switching part of the servers and MSs into running/waiting/hibernate mode under specific circumstances at specific times. Aries introduces the "system state" to handle the mixed usage of resources, maintaining the devices' running states to balance energy efficiency against service response times. In Aries, we provide multiple interfaces for deep, systematic analysis and optimization of history traces, and identify non-response-sensitive time periods for better system state switch decisions. Finally, Aries determines the specific system state for a period through pre-configured state abstractions [15], considering the incoming workload and the current resource utilization.
Fig. 1 [Images not available. See PDF.]
Proposed two-layer topological measurement testbed consisting of massive distributed embedded devices and a high-performance server cluster. We define the "data plane" as the collection of embedded devices and the "process plane" as the HS cluster. The two planes are connected by several Gigabit Ethernet switches
We deploy and evaluate Aries in our data center running basic distributed benchmarks on a cluster combining 3 servers and 100 embedded boards, as shown in Fig. 1. To evaluate Aries, we have ported two distributed computing frameworks to run over it, Hadoop and Spark. The evaluation results show that Aries achieves more efficient utilization of various resources and can periodically replace a few servers with MS resources to reduce energy consumption.
This paper is organized as follows. Section 2 describes related work and gives a brief overview of heterogeneous cluster computing. In Sect. 3, we briefly discuss the limitations we perceive in existing systems, from which new paradigms emerge as novel architectures combining HS clusters and MS systems. Section 4 presents the architecture and design of Aries in detail. The implementation of Aries is introduced in Sect. 5. Section 6 elaborates on the evaluation of Aries with the migrated distributed frameworks, Hadoop and Spark, using several benchmarks (WordCount, Teragen and Terasort on Wikipedia datasets). We conclude the paper in Sect. 7.
Related work
Recently, there has been an explosion in the number and variety of data produced by widely distributed devices, e.g., multi-sensor data, system and database audit logs, etc. Works have been proposed to handle the proliferation of MSs (ARM-based mobile devices) and show that MSs have recently emerged as an alternative to HSs, promising scalable performance and low energy consumption in the data center via scale-out. Our work is related to four strands of other work: (i) heterogeneous cluster computing, (ii) fog computing, (iii) characterization of big data processing in heterogeneous clusters, and (iv) resource schedulers on clusters. We discuss each in turn below.
Distributed computing on heterogeneous cluster
Kaewkasi and Srisuruk [6] presented a Hadoop cluster demonstration for distributed processing built atop 22 ARM-based boards. They also supported migrating Spark to this cluster with matching hardware configurations. From this work it can be concluded that distributed big data processing frameworks (Hadoop, Spark) can be migrated to such ARM-based clusters. Jung et al. [16] and Beloglazov et al. [17] designed a cluster of Linux servers with a broadband network of embedded devices and developed a heterogeneous computing system for MapReduce applications that couples cloud computing with distributed embedded computing; the project has proven that a distributed computing system can be deployed upon a heterogeneous cluster. Moreover, many researchers have recently been exploring machine learning applications on heterogeneous clusters based on FPGA accelerators and mobile devices. Neshatpour et al. [7, 18] analyzed several data mining and machine learning algorithms and proposed offloading the computationally intensive kernels of the algorithms to hardware acceleration to achieve speed-up and energy efficiency. They built a heterogeneous CPU and FPGA (Xilinx Zynq boards) platform for the implementation and achieved up to 321.5X kernel speedup. Meanwhile, the flexible architecture of Google's TensorFlow [19] deep-learning platform has also been proposed to address machine learning challenges across a broad range of devices (including mobile devices).
Fog computing
Meanwhile, a novel integration of the Internet of Things (IoT) and cluster computing has emerged, namely the cloud of things (CoT), which has attracted increasing attention from researchers. To address the challenges of time-constrained CoT applications, many researchers have proposed fog computing as a novel distributed computing framework that minimizes the data transmission overhead between IoT devices and data center clusters [20]. Specifically, the data collected from the embedded devices can be processed distributively in the data collect plane before a transmission decision is made for the data center clusters, after which the data are sent either to the cluster or to other devices [21–24]. Several works have developed IoT applications based on fog computing with different strategies, e.g., big data analytics on mobile devices [25], vehicular networks on fog computing [26], etc. These works cover many aspects of fog computing; however, most of them study fog computing merely from a qualitative perspective. In our work, we present a novel system architecture of a resource scheduler to couple the computing and storage resources of CoT computing. Our work provides fog computing with a novel resource management framework that enhances computing and storage resource utilization and makes it easier to deploy distributed computing applications and services in CoT environments.
Table 1. The evaluation platform for Aries
| | Server (HS) | Microserver (MS) |
|---|---|---|
| CPU | 2.0 GHz (12-core) | 1.0 GHz (dual-core) |
| Disk I/O (MB/s) | 117.41 | 53.02 |
| Memory (GB) | 64 | 1.0 |
| RAM sequential I/O (MB/s) | | |
| Network bandwidth (MB/s) | 94.4 | 90.2 |
| Idle power (W) | 263.1 | 2.5 |
| Reactive power (W) | 292.4 | 3.4 |
| Price ($) | 3000 | 20 |
Characterization of big data processing in heterogeneous cluster
Meanwhile, much work has been proposed to characterize big data processing on embedded servers. Malik et al. [8] characterized emerging big data applications on big Xeon- and little Atom-based server architectures, including Hadoop applications, graph analysis and several machine learning algorithms (collaborative filtering, clustering, etc.). They demonstrated that the size of the data, the performance constraints and the presence of the embedded server significantly influence the energy efficiency of the server.
Resource schedulers on cluster
Cluster management has long been a hot research topic in the high-performance computing and systems communities [11, 27–29]. Two resource schedulers, Hadoop YARN [12] and Mesos [11], which are the most widely used in academic research and industrial products, have been proposed to efficiently share resources among diverse cluster computing frameworks. Both systems use a two-level scheduling model, in which the upper computing frameworks request resource offers from a central manager. Distinguished from these frameworks, our work is the first resource manager architecture to support resource sharing that extends cluster computing to distributed embedded computing.
New paradigms for these multi-functional network systems have emerged as novel architectures combining HS clusters and sensor network systems [30]. In this paper, we propose and design a platform, namely Aries, for sharing resources among distributed computing frameworks in a heterogeneous cluster, and the evaluation confirms that by deploying Aries, a platform built on heterogeneous resource sharing can make large-scale distributed processing applications scalable and energy-efficient.
Analysis and motivation
In this section, we evaluate the computing performance and storage throughput of the nodes in the heterogeneous cluster. We then give our motivation for the design of a scalable platform, providing a coarse understanding of the cluster's behavior in order to characterize the environments in which the distributed scheduling model works well. We briefly introduce Aries to address the challenges.
Performance analysis of heterogeneous cluster
For the evaluation of our system demonstration, we construct the heterogeneous cluster shown in Fig. 1. The cluster consists of two planes, a data collect plane and a process plane. Specifically, the data collect plane is responsible for periodically aggregating sensor data, and the process plane is in charge of managing resources, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. In the process plane, each HS contains two Intel(R) Xeon(TM) E5-2620 CPUs, each with 6 cores running at 2.0 GHz, and 64 GB of memory. The storage layer in each high-performance node is a 20 TB hard disk drive, and the operating system is 64-bit Ubuntu 13.04. In the data collect plane, the MSs (embedded devices) in the cluster are a mass of CubieTruck embedded boards [31], each integrating a dual-core ARM Cortex-A8, 1 GB RAM and a 64 GB MicroSD-TF card, running 32-bit Linux 3.2.18. As Aries is built on commodity network infrastructure, three 100 Mbps Cisco SF220-24 Ethernet switches are used to connect all the nodes in the cluster, with an observed bandwidth of 90 Mbps. The detailed configuration of the platform is shown in Table 1. Note that the storage throughput of the CubieTruck is likely to be the bottleneck in such a cluster, while the network bandwidth among CubieTrucks is similar to that of the servers.
Motivation of sharing cluster resources
Consider the "smart home" application example we aim to support on the heterogeneous cluster of Sect. 3.1. Massive system log data are loaded from the data collect plane into the storage layer and analyzed to find the most frequent operation records. Currently, using state-of-the-art open-source distributed frameworks, we have to connect the data stream from the data collect plane to the process plane through a messaging framework, e.g., Kafka [32], and launch distributed jobs that run periodically to analyze the aggregated data. The cluster is thus separated into three kinds of responsibility: communication, storage and computing. Meanwhile, the heterogeneous cluster has to be used for many experimental jobs, ranging from ad-hoc MapReduce queries (e.g., Hive) to machine learning computations (e.g., Spark MLlib [33], GraphX [34]), which consist of various heterogeneous tasks, from short-term to long-term jobs. Therefore, we need to deploy several distributed computing frameworks on this cluster and statically partition the cluster into separate nodes for collection, communication, storage and computation. Moreover, the embedded devices in the data collect plane, which are accessible to each other at all times, are wasting their idle computing and network resources.
On the other hand, consider the architectural difference between the data collect plane and the process plane. While a mainstream HS contains several multi-core CPUs, several gigabytes of DDR4 RAM and high-speed persistent secondary storage, an MS normally embeds a light-weight ARM-SoC CPU and lower-speed storage. In recent years, the hardware performance of MSs has improved faster than that of HSs. We investigated a performance comparison of an HS cluster and an MS cluster, and discovered that a cluster of eight MSs can beat one server (detailed in Sect. 6). Thus, coupling HS cluster computing with MS cluster computing can provide a scalable and energy-efficient platform for enhancing distributed computing performance.
The underlying motivation and challenges of designing a platform to support resource management in the heterogeneous cluster are three-fold:
Sharing resources among distributed computing frameworks To meet the performance requirements of different computation jobs, many works [35–38] introduced different methods to load data from different services into a heterogeneous (x86 and ARM mixed) cluster, and migrated adapted versions of Hadoop, Spark and MPI to handle challenges such as log analysis, machine learning algorithms [39] and stream processing [30]. Unfortunately, there has not yet been a platform that supports abstracting and sharing resources efficiently among these distributed computing frameworks on one heterogeneous cluster.
Resource utilization enhancement in the heterogeneous cluster Since the deployment overhead of ARM-based embedded devices is very low, we can distribute them in large-scale deployments that are easily connected via network bridging, and theoretically they can also be centralized to process data-intensive analysis applications [40]. Meanwhile, the MSs share network resources, and the bandwidth between two endpoints is similar to that of the servers. We conclude that by sharing resources in the cluster, the resource utilization of the HSs can be raised through the employment of MSs. Moreover, the performance of distributed computing frameworks can be further enhanced by scheduling large numbers of distributed nodes.
Energy effectiveness and cost benefit Compared with HSs, MSs consume much less energy and cost much less. The energy-saving effectiveness of ARM-based embedded devices is competitive in both the idle and reactive states (ranging from 2 to 3 W for CubieBoard/Raspberry Pi), and their price is 150× lower than that of x86-based HSs [41]. However, these embedded devices in the data collect plane are usually used only for data collection, wasting their idle computing and network resources. A strategy for joining these embedded devices to the HS cluster domain is thus emerging, to share distributed processing jobs with the HSs.
The Aries architecture
In this section, we present Aries by discussing our design philosophy based on two key components, the cluster resource manager and the scheduler for supporting heterogeneous tasks. First, we describe an architectural overview of Aries (Sect. 4.1). Second, we introduce the detailed design of the cluster manager with its two layers, together with the cluster state switch manager (Sects. 4.2, 4.3) and the context packer that bridges them (Sect. 4.4). Third, the key approach by which Aries achieves efficient scheduling of heterogeneous tasks from the upper frameworks is discussed in Sect. 4.5.
Overview
The main components of Aries are shown in Fig. 3. Aries consists of a dynamic profiler process that manages executor daemons running on each node in the cluster, a resource manager that preserves a list of free resources and statuses on multiple nodes, and frameworks that launch tasks on the nodes. The core design paradigm of Aries is to provide a scheme that supports flexibly grained resource sharing across frameworks. Similar to Mesos and YARN, we also introduce a resource offer abstraction strategy in Aries, in which the computing resources in the cluster, organized as unitized offers, are offered to each distributed computing framework according to an organizational policy, e.g., fair sharing, monopoly sharing, etc.
The state switch manager provides Aries with the ability to handle the working state of each node in the heterogeneous cluster, which is crucially important for energy saving (discussed in Sect. 4.3). Moreover, Aries introduces a distributed two-level resource scheduling mechanism based on resource units to support the sophisticated schedulers of the Hadoop and Spark computing frameworks. Specifically, Aries decides how many resources to offer each framework, while the frameworks decide which resources to accept and which tasks to allocate. Instead of serving one unified resource unit, Aries separates resource units into two different resource types, from HSs and MSs, using a hybrid granularity of "resource slots". Specifically, we handle them differently and store them in two separate resource pools, using a fixed-size list and a priority queue. We maintain a fixed slot granularity of [1 CPU, 1 GB RAM] in the list for the servers' (HS) resource pool, and a priority queue is constructed to manage the MSs' resource pool.
Figure 2 shows an example of how Aries ports the distributed computing frameworks Hadoop and Spark to the cluster and schedules their tasks. For instance, MS#1 (a microserver) reports to the manager that it has 1 CPU and 1 GB RAM free. The manager in Aries records the message in the priority queue as one slot, launches the suitable task t1 on MS#1, and then replies to the upper framework's profiler. When task t2 requests [2 CPU, 1024 MB RAM] and MS#2 reports that 1 CPU and 256 MB RAM are free, the manager records the message from MS#2 but searches the list to launch t2 on a suitable HS with two slot units summed, i.e., 2 CPU and 2 GB RAM. For the choice between servers and MSs, we have implemented two allocation policies: one that performs traditional fairness over multiple resources, inspired by the dominant resource fairness (DRF) allocation algorithm, and one that implements strict optimal demand priorities for various tasks, discussed in Sect. 4.5. The evaluation results in Sect. 6 show that Aries reduces task launch overhead and achieves more efficient utilization of various resources when sharing the cluster among diverse frameworks.
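To make the hybrid abstraction concrete, the following C++ sketch shows the two resource pools and the matching step from the example above; all type and function names (HsSlot, MsEntry, ResourcePools::Allocate) are illustrative assumptions rather than Aries' actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// One fixed-size HS slot with unit [1 CPU, 1 GB RAM].
struct HsSlot {
  uint32_t node_id;   // owning high-performance server
  bool free = true;
};

// One MS entry: the total available resources of one microserver.
struct MsEntry {
  uint32_t node_id;
  uint32_t cpus;      // free cores
  uint32_t ram_mb;    // free memory (MB)
  bool operator<(const MsEntry& o) const {  // priority: larger capacity first
    return cpus < o.cpus || (cpus == o.cpus && ram_mb < o.ram_mb);
  }
};

struct ResourcePools {
  std::vector<HsSlot> hs_slots;           // fine-grained HS pool (list)
  std::priority_queue<MsEntry> ms_queue;  // coarse-grained MS pool (queue)

  // Place a task: try the best microserver first (as t1 is placed on MS#1);
  // otherwise sum HS unit slots until the demand is met (as t2 gets 2 slots).
  bool Allocate(uint32_t cpus, uint32_t ram_mb) {
    if (!ms_queue.empty()) {
      MsEntry best = ms_queue.top();
      if (best.cpus >= cpus && best.ram_mb >= ram_mb) {
        ms_queue.pop();                   // launch on this MS
        return true;
      }
    }
    uint32_t need = std::max(cpus, (ram_mb + 1023) / 1024);  // unit slots needed
    uint32_t got = 0;
    for (auto& slot : hs_slots)
      if (slot.free && got < need) { slot.free = false; ++got; }
    return got == need;                   // no rollback in this sketch
  }
};
```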
Fig. 2 [Images not available. See PDF.]
Basic structure of Aries: the full state switch and data abstraction logic on the classic two-layer design
Cluster resource manager
To support efficient resource scheduling in the heterogeneous cluster, we design the Aries cluster resource manager based on a traditional two-layer resource management paradigm (e.g., centralized scheduling in Mesos-like master–slave architectures), consisting of a resource abstraction layer and a task schedule layer. Specifically, the resource abstraction layer identifies and unifies the resources in the cluster, transforming them into schedulable resource units, and the task schedule layer allocates them to the computing frameworks running on Aries as needed. We maintain a thin interface to specify the cooperative interaction between the resource abstraction and task schedule layers.
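A minimal C++ sketch of what such a thin interface could look like follows; the names (SlotId, ResourceAbstraction, TaskScheduleLayer) are our illustration of the described interaction, not the actual Aries API.

```cpp
#include <cstdint>
#include <vector>

struct SlotId { uint32_t node_id; uint32_t slot_index; };

// Implemented by the resource abstraction layer: a unified view of
// schedulable resource units, whether they come from HSs or MSs.
class ResourceAbstraction {
 public:
  virtual ~ResourceAbstraction() = default;
  virtual std::vector<SlotId> FreeSlots() = 0;                 // current units
  virtual void Claim(const std::vector<SlotId>& slots) = 0;    // on allocation
  virtual void Release(const std::vector<SlotId>& slots) = 0;  // on completion
};

// Implemented by the task schedule layer: decides how many of the free
// units to offer each framework running on Aries.
class TaskScheduleLayer {
 public:
  virtual ~TaskScheduleLayer() = default;
  virtual void OnResourcesAvailable(const std::vector<SlotId>& free) = 0;
};
```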
The Aries framework implements the following abstractions to enable distributed computing applications to access heterogeneous resources, using the two-layer approach described above.
Fig. 3 [Images not available. See PDF.]
Design details of Aries. The main modules of Aries include the dynamic profiler, state switch manager, task context packer, resource manager, etc. We briefly introduce the design of each module with its data input and output, and also describe the internal data organization structures. Note that unavailable cores (busy/hibernate) in the computing pool are marked by squares shaded with slashes, and "FaultT." marks the fault tolerance module in Aries
First, as shown in Fig. 3, all nodes in the cluster used for shared access are collected into a computing resource pool. Aries forms the pools when the resource manager runtime starts, and backend daemon worker processes are spawned on the nodes in the data collect plane and the process plane separately. The pools are supported by the manager affinity mapper, co-organized over all participating nodes and storage resources. Once resource information about all cluster nodes is available, Aries determines the network topology and records it in the device table of the resource manager for all participating nodes, whose entries consist of the node host (IP address) and a local device ID mapper table. The cluster-level implementation of Aries permits any cluster node in the device table to participate, by forwarding requests to remote nodes via the network. Once the backend daemon processes on the nodes are spawned, the manager monitors detect the weight and workload of each node in the device table.
Second, resource abstraction information is organized in the above format, using a hybrid strategy. Specifically, the computing resources of the MSs are organized into the MS Queue, consisting of the total available resource size of each node. Compared with this coarse-grained resource abstraction for MSs, Aries adopts a fine-grained method for HSs, whose resources are split into multiple slots with unit [1 CPU, 1 GB RAM]. The abstraction is re-organized in the resource manager module as the HS unit table and the MS unit queue.
Third, when an upper application needs to launch tasks, Aries selects an appropriate target node based on the dynamically available information in the device table and status table. Using the recorded available device IDs, the task matcher inspects the resource requirements of the tasks and then determines the actual computing device to select. We discuss the detailed state switch manager design and the scheduling method for heterogeneous tasks in the following sections.
Finally, the resource manager selects and assigns tasks to the corresponding nodes, and then forwards the feedback, consisting of the nodes' resource utilization and the tasks' monitored states, to the appropriate backend process and node. The overall design of the cluster resource manager in Aries is shown in Fig. 3.
State switch manager
In the state switch manager, we design a device-state-based mechanism that approximates real-time adjustment. In Aries, we define nodes' states as a series of policies that guide how to utilize resources across categories. In accordance with the incoming workload, system traces and resource utilization, Aries chooses from the following three system states: server cluster (SC), using all servers in the cluster; mixed cluster (MIX), with a few servers (HSs) and mixed MS resources; and MC, with no HSs in the cluster, as shown in Fig. 2. Specifically, during periods of extremely high system workload, the MIX state is enabled to utilize the maximum resources from the resource pools, a state that brings out the best computing potential of the system. Accordingly, additional states can be configured to fit particular system requirements, e.g., energy awareness. Since a state prioritizing resources from MSs has much lower energy consumption, the system configures idle servers into hibernate mode.
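As a rough illustration, the state choice can be thought of as a simple rule over the current load; the thresholds and function below are illustrative assumptions, not measured Aries parameters.

```cpp
// The three system states described above.
enum class SystemState { SC, MIX, MC };  // servers only, mixed, MSs only

// Pick a state from the observed load (0..1); thresholds are assumed.
SystemState ChooseState(double load, bool energy_aware) {
  if (load > 0.8) return SystemState::MIX;  // peak load: use both pools
  if (energy_aware && load < 0.2)
    return SystemState::MC;                 // idle HSs can hibernate
  return SystemState::SC;
}
```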
Internally, a dynamic profiler in the state switcher is in charge of making state-switching decisions. The state switch manager determines the timing and the preferable energy-efficient patterns for a switch. The timing for invoking an energy-efficient state switch in low-load periods is predictable by analyzing traces, and we provide a user-defined protection interface to skip state switches in cases of extreme workload. The state switch manager interacts with two other modules (the context packer and the resource manager) via an active message interface to directly return the usage of nodes. Once Aries completes a switch, the manager layer delegates the messages about device states to the dynamic profiler. The state switch manager is responsible for collecting and analyzing system runtime logs, which are stored in persistent log storage, and for configuring the available resource states of nodes. As shown in Fig. 3, the state switch manager has the following components:
Fig. 4 [Images not available. See PDF.]
Signal dispatch based on system log analysis in Aries
State policy arbiter The state policy arbiter maintains the node states (status table) using log analysis and a feedback mechanism, based on static information (device runtime logs and capabilities) and dynamic information (application type, device workload and feedback from the lower resource manager). Aries maintains the three default policies (SC, MIX and MC), which store the decisions (running, waiting and hibernate) for devices with specific characteristics. Once the resource manager triggers dynamic policy switching, the state policy arbiter directs the device monitor to dispatch state signals to the nodes.
Target resource selector Using the real-time feedback about node characteristics, such as task execution time and resource utilization, returned from the lower-level resource manager, the target resource selector maintains the device table as a dynamic structure consisting of the calculated weight and load ratio of the cluster nodes. Based on the information in the device table, the target resource selector computes the appropriate node targets for dispatching tasks, and then maps tasks to a resource abstraction unit in the resource manager, based on the decision of the low-level manager. The specific scheduling policies for heterogeneous tasks are discussed in Sect. 4.5.
Device state monitor The device state monitor, which is responsible for node communication and runtime control, uses a three-way handshake protocol to control which monitor threads in the nodes' daemon libraries should be used and for how long. As shown in Fig. 4, the three-way handshake proceeds as follows: (1) the state switcher sends a message to the backend thread corresponding to the node, identified by [process_id, node_id], using IPC; (2) the listener worker of the node, on receiving the message, sends an availability signal and resource usage information back to the switcher; (3) based on the decision made by the dynamic profiler, the backend thread in the state switcher, on receiving the reply, executes a signal callback handler and sends the real-time state signal to the node. After the listener receives the state signal, the dispatcher thread switches the device state and acknowledges to the backend thread that its assigned signal has been received.
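The following C++ sketch summarizes the three steps of the handshake; the message types and the Channel wrapper are illustrative assumptions about the IPC layer, not the actual protocol encoding.

```cpp
#include <cstdint>

// Illustrative message types for the three-way handshake in Fig. 4.
struct Probe  { uint32_t process_id; uint32_t node_id; };               // step (1)
struct Report { bool available; uint32_t free_cpus; uint32_t free_mb; };// step (2)
enum class StateSignal : uint8_t { RUN, WAIT, HIBERNATE };              // step (3)

struct Channel {  // hypothetical IPC wrapper around the backend thread
  void Send(const Probe& p);
  void Send(StateSignal s);
  Report RecvReport();
};

// Switcher side: probe the node, read its resource report, then dispatch
// the state signal decided by the dynamic profiler.
void Handshake(Channel& ch, uint32_t pid, uint32_t nid, StateSignal decided) {
  ch.Send(Probe{pid, nid});           // (1) switcher -> node listener
  Report r = ch.RecvReport();         // (2) node -> switcher: availability + usage
  if (r.available) ch.Send(decided);  // (3) switcher -> node dispatcher
}
```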
Context packer
The state switch manager and the cluster resource manager have different granularities of visibility into the devices in the cluster. While the cluster resource manager has global knowledge of all resources in the cluster, the state switch manager has local information about the current resource utilization states and behavior of the devices.
Aries constructs the context packer module as a middleware layer connecting the state switch manager to the cluster resource manager and taking charge of task specification. When requested tasks arrive, Aries specifies the operations of each task and refines the request into a number of CPUs and a size of memory. The task scheduling decisions made by the task matcher are based on metrics that include the state and throughput of devices, fairness, the real-time system workload ratio of devices, etc. Afterwards, a task wrapper registers the requested task and wraps properties such as the runtime attributes of the task, the decided device ID, the device weight, and the task's requested resources (CPUs, memory) into a task specification structure. The context packer then submits the task specification structure to the cluster resource manager, and also feeds back to the state switch manager as the requested task is executed. The cluster resource manager monitors the phase state of each received task specification.
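The task specification structure might look roughly as follows; the field names and types are our illustration of the properties listed above, not the actual Aries layout.

```cpp
#include <cstdint>
#include <string>

// Illustrative layout of the task specification built by the task wrapper
// and submitted by the context packer to the cluster resource manager.
struct TaskSpec {
  std::string task_id;        // registered by the task wrapper
  std::string attributes;     // runtime attributes of the task
  uint32_t device_id;         // node chosen by the task matcher
  double device_weight;       // weight from the target resource selector
  uint32_t cpus;              // requested CPU cores
  uint32_t mem_mb;            // requested memory (MB)
};
```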
Scheduling heterogeneous tasks
To maximize the expected efficiency (energy, performance, cost) of the cluster resources, two modules in Aries, the dynamic profiler and the task context packer, are responsible for scheduling heterogeneous tasks. The challenge in the design of Aries is scheduling heterogeneous tasks without prior configuration knowledge of the upper frameworks. We outline the two key strategies that support scheduling heterogeneous tasks, achieving this purpose while satisfying the computing frameworks' preferences.
Node weight in target selector We maintain the device table with the two priorities of the nodes in the cluster, recorded in the "node weight" entries. The objective of the scheduling policy module is to dispatch tasks and minimize their ingress time, which means Aries allocates each framework its preferred nodes after at most one interval, thereby maximizing the throughput of the Aries system. We use a static cost model to define the selector. Specifically, the output of the target selector is a value assignment to the decision state variable $s_{i,j}$ of node $i$ in time period $j$, computed from $\mathrm{CPUOn}_i$, the number of available CPU cores, and $\mathrm{NetIdle}_i$, the size of the idle network bandwidth. The value of $s_{i,j}$ ranges over the enum set [H(halt), W(waiting), R(running)]; note that a node in the waiting state has been selected and is waiting to join the computation. Aries records the states of nodes in the "status table" and notifies the task matcher in the context packer module of the selected set of available nodes.

$$P_n = \alpha\,P_{n-1} + t_n \qquad (1)$$

where $P_n$ is the cumulative node computing time attained up to the $n$th duration, $t_n$ is the computing time attained in the $n$th duration, and $\alpha$ is a configurable weight variable ($\alpha = 0.8$). In Eq. (2), we configure the weight of a node using the real-time workload in the feedback table and the above priority accumulated by decaying service, which means a node with a lower workload and a higher service priority achieves a higher weight:

$$W_i = \gamma\,P_i - \beta\,L_i \qquad (2)$$

where $\beta$ is the weight variable for adjusting the node's load, $\gamma$ is the weight variable for the cumulative service priority, and $L_i$ is the workload level of node $i$ (1–10), determined by its available CPUs and memory size.

Scheduling policy Given a particular execution and a specified state duration for Aries, the goal is to maximize the expected efficiency while satisfying the frameworks. The slot allocation for the frameworks is designed around the scheduling configuration of the computing frameworks. When a framework carries an initial configuration for its own scheduler, Aries simply allocates its preferred slots once they all become available. When a framework has no such configuration, we apply a weighted fair allocation across the demanding frameworks, which allocates a balanced number of slots to each of them. For the choice between servers and MSs, we have implemented two allocation policies: one that performs traditional fairness over multiple resources, inspired by the DRF allocation algorithm [42], and one that implements strict optimal demand priorities for various tasks.
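A small C++ sketch of the weight update following Eqs. (1) and (2) as reconstructed above; the constants kBeta and kGamma and the exact combination are assumptions consistent with the stated behavior (lower workload and higher service priority yield a higher weight).

```cpp
// Per-node statistics used by the target selector (illustrative).
struct NodeStats {
  double p = 0.0;  // cumulative decayed computing time, P_n in Eq. (1)
  int load = 1;    // workload level L in 1..10, from available CPUs/memory
};

constexpr double kAlpha = 0.8;  // decay weight in Eq. (1), per the text
constexpr double kBeta  = 1.0;  // load weight in Eq. (2) (assumed value)
constexpr double kGamma = 1.0;  // priority weight in Eq. (2) (assumed value)

// Fold the computing time t_n of the latest duration into the node's
// decayed service priority, then return its scheduling weight.
double UpdateWeight(NodeStats& n, double t_n) {
  n.p = kAlpha * n.p + t_n;               // Eq. (1)
  return kGamma * n.p - kBeta * n.load;   // Eq. (2)
}
```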
Implementation of Aries
We have implemented Aries based on a modification of Mesos, in about 2000 lines of C++. The Aries platform runs on a variety of operating systems and devices, including Linux and OS X on x86-based and ARM-based nodes. For an efficient I/O mechanism, we use a C++ library (libprocess) to implement asynchronous I/O. To construct efficient table structures, we use the fast key-value store LevelDB to handle table queries and updates. Moreover, the fault tolerance module is implemented by periodically storing the operation logs in a distributed persistent storage file system. As in other centralized master–slave distributed frameworks, we integrate ZooKeeper to perform leader election between the controllers in the HS cluster plane. Finally, as with the resource isolation supplied by Mesos, Aries currently also supports isolating CPU cores and memory using Linux containers or Docker.
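For instance, keeping the device table in LevelDB amounts to plain key-value reads and writes; the path and the value encoding below are illustrative, while the LevelDB calls themselves are the library's real API.

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s =
      leveldb::DB::Open(options, "/tmp/aries_device_table", &db);  // assumed path
  assert(s.ok());

  // Key: node host (IP); value: serialized device entry (illustrative schema).
  s = db->Put(leveldb::WriteOptions(), "10.0.0.21", "id=17;state=R;weight=0.93");
  std::string value;
  s = db->Get(leveldb::ReadOptions(), "10.0.0.21", &value);

  delete db;
  return s.ok() ? 0 : 1;
}
```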
We port and tune Hadoop and Spark to Aries, modifying their schedulers and executors slightly using the resource offer API of Aries, as follows. To support Java SE 6.0 features, a binary-level porting method is applied for the embedded devices, importing missing class files and retro-translating class files.
resource_offer(schedulerId, list_of_tasks): offers resources to the upper framework.
launch_task(taskReservation): schedules a task of an upper framework.
shutdown_task(taskId): kills a scheduled task of an upper framework.
reply_slot(list_of_slots, list_of_tasks): reports the scheduled offers for tasks.
send_status(taskId, status): handles status messages from the computing job trackers.
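As a usage illustration, a ported framework's scheduler might react to an offer roughly as follows; the Slot and Task glue types and the C++ prototypes are assumptions, and only the five call names above come from Aries.

```cpp
#include <vector>

// Hypothetical C++ prototypes for the Aries calls listed above.
struct Slot {};
struct Task { int id; int reservation; };
void reply_slot(const std::vector<Slot>& slots, const std::vector<Task>& tasks);
void launch_task(int task_reservation);
void send_status(int task_id, int status);

// A ported framework's reaction to a resource offer (illustrative glue).
void OnResourceOffer(const std::vector<Slot>& offered,
                     std::vector<Task>& pending_tasks) {
  reply_slot(offered, pending_tasks);  // report which slots serve which tasks
  for (const Task& t : pending_tasks) {
    launch_task(t.reservation);        // schedule the task on its slot
    send_status(t.id, /*status=*/0);   // propagate an initial status message
  }
}
```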
Fig. 5 [Images not available. See PDF.]
a Task launch ingress time: Aries's processing ratio and task launch overhead versus the number of microservers. Ingress means loading files and starting task computation. Note that since Mesos supports resource scheduling in a distributed embedded cluster with a large grain-sized (1 GB memory, 1 CPU) resource abstraction, we deploy Mesos on the heterogeneous cluster with the 'master' and ZooKeeper configured on HS servers and 'slave' nodes on MSs in the 'original' cluster. b WordCount benchmark: the runtime of the Hadoop and Spark benchmarks ported on Aries. c Teragen/Terasort benchmark: the runtime of the Hadoop benchmarks ported on Aries, with the Terasort dataset generated by Teragen (20 GB)
Evaluation
In this section, we evaluate our platform prototype Aries using actual job processing times and high-resolution energy measurements taken on our own cluster, under several scenarios. We choose Mesos, a cluster resource scheduler, for comparison with Aries on two aspects: task ingress overhead and cluster resource utilization.
Evaluation platform A detailed description of the heterogeneous platform is given in Sect. 3.1.
Benchmarks We evaluate the systems, Aries and Mesos, using several traditional data-intensive processing benchmarks, including WordCount, Teragen and Terasort. The datasets for the workloads include real-world data from the Wikipedia edit history [43] and auto-generated cluster system audit log datasets.
Specifically, we consider four metrics in the evaluation of Aries:
Framework task launch ingress time The time it takes a ported framework to achieve its task allocation;
System resource utilization The total utilization ratio of resources in the cluster;
Job completion time The time it takes a job to complete, assuming one job per framework;
Energy savings The total energy saving achieved using Aries in the cluster.
Comparison with Mesos
In contrast to the widely used cluster resource scheduler Mesos, Aries extends the scheduler engine to support distributed embedded devices by choosing a finer granularity to manage the distributed embedded resources with a resource priority queue. To evaluate the primary goal of Aries, which is enabling distributed computing frameworks to efficiently execute tasks on a heterogeneous cluster, we chose the two frameworks ported onto Aries, Hadoop and Spark, and ran traditional applications with MSs. We compare Aries to Mesos on two metrics: framework task ingress overhead and cluster resource utilization.
Framework task ingress overhead
To evaluate the task ingress overhead of a framework, which consists of loading files and initializing task computation, we choose the Hadoop framework ported onto Aries and Mesos, and run a traditional log query application with N MSs on a 10 GB dataset collected from auto-generated cluster audit log files.
Moreover, we plot the extra task launch overhead in Fig. 5a, showing the average task launch ingress time over the 16 experiment runs. We observe that the overhead remains small (less than 8 s) even at 16 MSs. From the curve of the relative ingress ratio, we find that the increase in task ingress overhead remains at the 2X level. Compared to Mesos, Aries achieves lower ingress overhead when we deploy more than four MSs; Mesos shows more than 10X overhead when 16 MSs are connected. This performance improvement is due to two factors: (1) by automatically capturing the available resources in the distributed embedded cluster, Aries can organize the requested tasks from the ported frameworks with a preferable node selection; (2) compared to Mesos' fixed-size (1 GB memory, 1 CPU) resource offer abstraction, Aries abstracts the resources of the distributed embedded cluster using an efficient priority queue strategy and concrete resource allocation.
System resource utilization
To compare the resource utilization of Aries and Mesos, we run 100 WordCount jobs simultaneously; each server is configured with the default task types in equal proportions, and each MS runs one task at a time. We record 600 s of the job execution. From Fig. 6, we see that resources are allocated rapidly when jobs start to be submitted, with Aries achieving higher node allocation than Mesos, by 14% on average for CPU utilization and 26% for memory. Mesos skips task allocation when the available resources on a node are not enough to fill one offer, leading to a waste of a large number of available resource fragments. For processing these distributed tasks, one key contributing factor for Aries is its weighted scheduling method for managing the nodes in the heterogeneous cluster and its multi-level optimization of resource abstraction and sharing, resulting in more efficient resource utilization, especially on the embedded devices.
Figure 7 shows the normalized processing ratio on Aries. From the benchmark results, the performance of coupling 1 server with 8 MSs can beat that of 2 servers, and 2 servers with 14 MSs beats 3 servers. As the number of domain-joined MSs increases, the performance of coupling a server with a growing number of MSs reaches a peak. For the processing performance of coupled execution, one key contributing factor for Aries is its efficient resource abstraction and task scheduling, which enhance the computing resource utilization ratio and the network throughput in the map phase among devices. Moreover, as shown in the sub-figure of Fig. 8, we explore the daily CPU utilization of the cluster with Aries compared to the original cluster, which shows that the higher allocation of nodes also translates into increased CPU utilization (by almost 12%).
Fig. 6 [Images not available. See PDF.]
Average CPU and memory utilization over time in the Aries-based cluster versus original cluster (Mesos-based)
Fig. 7 [Images not available. See PDF.]
Relative processing ratio: Aries's processing ratio and task launch overhead versus the number of microservers. Ingress means loading files and starting task computation
Job completion runtime
To evaluate the computing efficiency of the distributed frameworks migrated to Aries, we use the distributed workload benchmarks WordCount and Terasort. The dataset for the WordCount workload is the 18 GB Wikipedia edit history [43], stored in HDFS in blocks of 16 MB on average (with double replication). We run the jobs on our experimental environment with the Hadoop and Spark frameworks, and configure each server with the default task types in equal proportions. By adding the resources of the computing embedded devices, we can evaluate the job completion runtime on the heterogeneous cluster, where each embedded device is configured to take one task from the upper framework per duration. As shown in Fig. 5b, c, the WordCount benchmark running time in Spark decreases by 3.1 with the increasing number of MSs, while the running time in Hadoop decreases by 0.34. Moreover, we report the Teragen and Terasort benchmark runtimes in Hadoop. For executing these benchmarks, one key contributing factor for Aries is its flexible mechanism for abstracting the cluster resources and easily supporting the task scheduling of the upper frameworks, resulting in efficient processing in the heterogeneous cluster.
Energy saving
Energy measurements were taken with each server connected to a WattsUp power meter and each CubieTruck node plugged into an electric energy reading socket. As Aries dynamically switches state under high workload variation, the percentage of savings changes. According to the summary statistics of a whole-day evaluation shown in Fig. 8, we observe an average of 16.42% energy saving per hour on the system with Aries. Built on coupling the cluster with distributed MSs, Aries implements automatic switching between the energy-efficient states, which is crucial to the energy saving.
Fig. 8 [Images not available. See PDF.]
Daily energy consumption ratio comparison for the original cluster (Mesos-based) and the Aries-based cluster
Conclusion
Aries is designed for resource usage efficiency, coupling MSs and data centers. Using Aries, we achieve higher processing ability through the mixed usage of servers' and embedded devices' resources in the proper system context. Experiments show that this design improves system resource usage efficiency and supports sharing resources among the several distributed frameworks deployed on the Aries platform. Finally, we demonstrate that Aries significantly reduces the overall daily system energy consumption.
In future work, we will migrate other parallel and distributed frameworks onto our heterogeneous cluster, for instance the MPI library and streaming data-intensive processing frameworks. We also intend to extend the proposed platform to support multi-GPU systems.
Acknowledgements
This work was financially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010600), as part of the DataOS Project. The authors would like to thank all researchers in the DataOS project for useful discussions and suggestions. The authors also thank the anonymous reviewers for their feedback.
1Throughout this paper, we use the term "microserver" (MS) in a broad sense to refer to ARM-based mobile devices, and "high performance server" (HS) to refer to x86-based servers.
References
1. Dean, J; Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM; 2008; 51,
2. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 15–28. USENIX Association (2012)
3. Murray, D.G., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M.: Naiad: a timely dataflow system. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 439–455. ACM, New York (2013)
4. Low, Y; Bickson, D; Gonzalez, J; Guestrin, C; Kyrola, A; Hellerstein, JM. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow.; 2012; 5,
5. Honjo, T., Oikawa, K.: Hardware acceleration of Hadoop MapReduce. In: 2013 IEEE International Conference on Big Data, pp. 118–124. IEEE (2013)
6. Kaewkasi, C., Srisuruk, W.: A study of big data processing constraints on a low-power Hadoop cluster. In: 2014 International Computer Science and Engineering Conference (ICSEC), pp. 267–272. IEEE (2014)
7. Neshatpour, K., Malik, M., Ghodrat, M.A., Sasan, A., Homayoun, H.: Energy-efficient acceleration of big data analytics applications using FPGAs. In: 2015 IEEE International Conference on Big Data, pp. 115–123. IEEE (2015)
8. Malik, M., Rafatirah, S., Sasan, A., Homayoun, H.: System and architecture level characterization of big data applications on big and little core server architectures. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 85–94. IEEE (2015)
9. Scott, J., Bernheim Brush, A.J., Krumm, J., Meyers, B., Hazas, M., Hodges, S., Villar, N.: PreHeat: controlling home heating using occupancy prediction. In: Proceedings of the 13th International Conference on Ubiquitous Computing, pp. 281–290. ACM, New York (2011)
10. Brush, A.J., Jung, J., Mahajan, R., Martinez, F.: Digital neighborhood watch: investigating the sharing of camera data amongst neighbors. In: Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 693–700. ACM, New York (2013)
11. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R.H., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol. 11, pp. 295–308 (2011)
12. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, pp. 5:1–5:16. ACM, New York (2013)
13. Leverich, J; Kozyrakis, C. On the energy (in) efficiency of Hadoop clusters. ACM SIGOPS Oper. Syst. Rev.; 2010; 44,
14. Junqueira, F; Reed, B. ZooKeeper: Distributed Process Coordination; 2013; Sebastopol, O’Reilly Media, Inc.:
15. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 165–178. ACM, New York (2009)
16. Jung, Y.H., Neill, R., Carloni, L.P.: A broadband embedded computing system for MapReduce utilizing Hadoop. In: 2012 IEEE 4th International Conference on Cloud Computing Technology and Science (CloudCom), pp. 1–9. IEEE (2012)
17. Beloglazov, A; Buyya, R; Lee, YC; Zomaya, A et al. A taxonomy and survey of energy-efficient data centers and cloud computing systems. Adv. Comput.; 2011; 82,
18. Neshatpour, K., Malik, M., Homayoun, H.: Accelerating machine learning kernel in Hadoop using FPGAs. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 1151–1154. IEEE (2015)
19. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), GA, pp. 265–283. USENIX Association (2016)
20. Bonomi, F., Milito, R., Zhu, J., Addepalli, S.: Fog computing and its role in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, pp. 13–16. ACM, New York (2012)
21. Vaquero, LM; Rodero-Merino, L. Finding your way in the fog: towards a comprehensive definition of fog computing. ACM SIGCOMM Comput. Commun. Rev.; 2014; 44,
22. Yi, S., Li, C., Li, Q.: A survey of fog computing: concepts, applications and issues. In: Proceedings of the 2015 Workshop on Mobile Big Data, pp. 37–42. ACM (2015)
23. Stojmenovic, I., Wen, S., Huang, X., Luan, H.: An overview of fog computing and its security issues. In: Concurrency and Computation: Practice and Experience. Wiley, Chichester (2015)
24. Dubey, H., Yang, J., Constant, N., Amiri, A.M., Yang, Q., Makodiya, K.: Fog data: enhancing telehealth big data through fog computing. In: Proceedings of the ASE BigData and SocialInformatics 2015, pp. 14:1–14:6. ACM, New York (2015)
25. Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., Zhang, Z.: TimeStream: reliable stream computation in the cloud. In: Proceedings of the 8th ACM European Conference on Computer Systems, pp. 1–14. ACM, New York (2013)
26. Stojmenovic, I.: Fog computing: a cloud to the ground support for smart things and machine-to-machine networks. In: 2014 Australasian Telecommunication Networks and Applications Conference (ATNAC), pp. 117–122. IEEE, Piscataway (2014)
27. Jonathan, A., Chandra, A., Weissman, J.: Awan: locality-aware resource manager for geo-distributed data-intensive applications. In: 2016 IEEE International Conference on Cloud Engineering (IC2E), pp. 32–41. IEEE (2016)
28. Chandra, A; Weissman, J; Heintz, B. Decentralized edge clouds. IEEE Internet Comput.; 2013; 17,
29. Zheng, X.: Load Sharing in Large-Scale, Heterogeneous Distributed Systems (1992)
30. Rabkin, A., Arye, M., Sen, S., Pai, V.S., Freedman, M.J.: Aggregation and degradation in jetstream: streaming analytics in the wide area. In: NSDI (2014)
31. BeagleBone. http://beagleboard.org/bone
32. Kreps, J., Narkhede, N., Rao, J., et al.: Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, pp. 1–7 (2011)
33. Meng, X; Bradley, J; Yuvaz, B; Sparks, E; Shivaram, V; Liu, D; Freeman, J; Tsai, D; Amde, M; Owen, S et al. MLlib: machine learning in Apache Spark. JMLR; 2016; 17,
34. Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: GraphX: graph processing in a distributed dataflow framework. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 599–613 (2014)
35. Hull, B., Bychkovsky, V., Zhang, Y., Chen, K., Goraczko, M., Miu, A., Shih, E., Balakrishnan, H., Madden, S.: CarTel: a distributed mobile sensor computing system. In: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, pp. 125–138. ACM, New York (2006)
36. Gubbi, J; Buyya, R; Marusic, S; Palaniswami, M. Internet of things (IoT): a vision, architectural elements, and future directions. Future Gener. Comput. Syst.; 2013; 29,
37. Gupta, T., Singh, R.P., Phanishayee, A., Jung, J., Mahajan, R.: Bolt: data management for connected homes. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 243–256 (2014)
38. Zhang, H., Hao, C., Wu, Y., Li, M.: Macaca: a scalable and energy-efficient platform for coupling cloud computing with distributed embedded computing. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 1785–1788. IEEE (2016)
39. Bekkerman, R; Bilenko, M; Langford, J. Scaling Up Machine Learning: Parallel and Distributed Approaches; 2011; New York, Cambridge University Press: [DOI: https://dx.doi.org/10.1017/CBO9781139042918]
40. Che, S., Li, J., Sheaffer, J.W., Skadron, K., Lach, J.: Accelerating compute-intensive applications with GPUs and FPGAs. In: Symposium on Application Specific Processors, 2008. SASP 2008, pp. 101–107. IEEE (2008)
41. Qureshi, A; Weber, R; Balakrishnan, H; Guttag, J; Maggs, B. Cutting the electric bill for internet-scale systems. ACM SIGCOMM Comput. Commun. Rev.; 2009; 39,
42. Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., Stoica, I.: Dominant resource fairness: fair allocation of multiple resource types. In: NSDI, vol. 11, p. 24 (2011)
43. Wikipedia datasets. http://snap.stanford.edu/data/wiki-meta.html