Abstract
Scheduling a distributed application modeled as a directed acyclic task graph (where nodes represent computation tasks and edges represent precedence constraints/data flow between the tasks) over a set of networked compute nodes is a fundamental problem in distributed computing and has thus received substantial scholarly attention. Most existing solutions, however, fall short of accommodating the dynamic nature of modern dispersed computing systems (e.g., IoT, edge, and robotic systems), where applications and compute networks are not known ahead of time, are constantly changing, and are heavily resource-constrained. We propose solutions that address this gap and show that novel methods for understanding the limitations of existing task scheduling algorithms can inform the design of new algorithms better suited for dynamic, unpredictable, and resource-constrained modern dispersed computing systems.
The goal in task scheduling is to assign computational tasks from the given task graph to different compute nodes so as to optimize some performance metric (e.g., total execution time, energy consumption, or throughput). We focus on the task scheduling problem for heterogeneous task graphs and compute networks with the objective of minimizing makespan (total execution time). In particular, we explore the potential and limitations of both new and existing task scheduling algorithms for constantly changing, unpredictable, and resource-constrained environments.
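To make the objective concrete, the following is a minimal sketch (not taken from the dissertation) of how the makespan of a fixed task-to-node assignment can be computed for a heterogeneous task graph and compute network. The attribute names (`cost`, `data`), the `comp_speed` and `comm_strength` tables, and the omission of queuing/contention on compute nodes and links are illustrative assumptions.

```python
# Minimal, illustrative makespan computation for a fixed assignment.
# Assumes: task_graph is a networkx DiGraph (a DAG) with node attribute "cost"
# (computation amount) and edge attribute "data" (data volume); comp_speed maps
# compute node -> speed; comm_strength maps (src_node, dst_node) -> bandwidth.
import networkx as nx

def makespan(task_graph, comp_speed, comm_strength, assignment):
    """Finish time of the last task under `assignment`, ignoring node/link contention."""
    finish = {}
    for task in nx.topological_sort(task_graph):
        node = assignment[task]
        # A task can start only after all predecessors finish and their data arrives.
        ready = 0.0
        for pred in task_graph.predecessors(task):
            data = task_graph.edges[pred, task]["data"]
            comm = 0.0 if assignment[pred] == node else data / comm_strength[(assignment[pred], node)]
            ready = max(ready, finish[pred] + comm)
        finish[task] = ready + task_graph.nodes[task]["cost"] / comp_speed[node]
    return max(finish.values())
```

A real scheduler must additionally account for contention when multiple tasks share a compute node or link; the sketch omits this for brevity.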
First, we show that many of the most widely used task scheduling algorithms are ill-suited to scenarios with highly dynamic network conditions and propose a novel approach to scheduling in such environments. Leveraging recent advances in machine learning, we show that graph neural networks (GNNs) can be used to quickly produce high-quality schedules in these settings.
Then, we study the problem from a slightly different perspective. In real-world scenarios, it is often the case that the compute network is not given, but rather is something that can be designed ahead of time. For example, IoT system designers can decide where to place sensors and compute nodes in order to best support their application. Because task scheduling is NP-hard, however, we know there must exist problem instances on which even the (empirically) best algorithms perform poorly. In our work on network synthesis, we show that, for this reason, optimizing networks with respect to traditional metrics like communication strength can be detrimental to performance. We propose a framework for studying network synthesis for task scheduling applications and demonstrate its utility with a tactical-IoT-inspired use case.
We also show that the traditional benchmarking approach to comparing existing task scheduling algorithms is insufficient because problem instances on which algorithms perform poorly are often remarkably similar to problem instances on which they perform well. We propose a new approach to comparing task scheduling algorithms that can help us better understand the conditions under which an algorithm performs well or poorly. Using our approach, we are able to find problem instances for which every algorithm we consider performs at least twice as poorly (produces a schedule with twice the makespan) as another algorithm. In fact, for most algorithms, we are able to find conditions under which they perform five times as poorly as another algorithm. These are the first results of their kind in the literature and represent an important step toward understanding the performance boundaries between algorithms.
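One way to picture this kind of comparison is as a search over problem instances for those that maximize the makespan ratio between two schedulers. The sketch below is purely illustrative and is not the dissertation's method; `perturb`, `scheduler_a`, and `scheduler_b` are assumed placeholders (each scheduler returns the makespan of the schedule it produces on an instance).

```python
# Hedged sketch: random local search for instances where scheduler_a performs
# much worse than scheduler_b. All callables are illustrative placeholders.
import random

def find_adversarial_instance(instance, scheduler_a, scheduler_b, perturb, steps=1000, seed=0):
    rng = random.Random(seed)
    best = instance
    best_ratio = scheduler_a(best) / scheduler_b(best)
    for _ in range(steps):
        candidate = perturb(best, rng)                      # small random change to the instance
        ratio = scheduler_a(candidate) / scheduler_b(candidate)
        if ratio > best_ratio:                              # scheduler_a is now relatively worse
            best, best_ratio = candidate, ratio
    return best, best_ratio
```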
Finally, we extend our efforts in comparing task scheduling algorithms to the individual algorithmic components that compose them. We propose a generalized parametric list scheduling approach for studying the individual and combined effects of such algorithmic components. We evaluate 72 algorithms produced by combining five different types of algorithmic components on 20 datasets and present results on their individual and combined effects on average performance and runtime. We also discuss how these results differ for individual datasets, suggesting that the way algorithmic components interact with each other is problem-dependent.
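The idea behind parametric list scheduling can be sketched as assembling a scheduler from interchangeable components. The snippet below is a minimal illustration under assumed component types (a task-ranking function and a node-selection rule), not the dissertation's actual implementation or component set.

```python
# Hedged sketch of a parametric list scheduler: swap components in and out to
# study their individual and combined effects. Component names are placeholders.
def make_list_scheduler(rank_tasks, choose_node):
    """rank_tasks: task_graph -> tasks in scheduling order;
    choose_node: (task, task_graph, network, partial_assignment) -> compute node."""
    def schedule(task_graph, network):
        assignment = {}
        for task in rank_tasks(task_graph):
            assignment[task] = choose_node(task, task_graph, network, assignment)
        return assignment
    return schedule

# Example (placeholder components): rank tasks by a static priority such as
# upward rank, then greedily pick the node minimizing estimated finish time.
# heft_like = make_list_scheduler(upward_rank, earliest_finish_node)
```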
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer