
Abstract

Our research aims to optimize the performance of large distributed systems, which operate across multiple machines, by applying machine learning techniques. In our first project, we set out to use a large dataset of performance measurements for the Linux operating system to suggest optimal tuning for network applications in a client-server setting. We conducted a series of experiments to select the hardware and Linux configuration options that significantly affect network performance. Our results showed that network performance was mainly a function of workload and hardware. Investigating these results revealed that our dataset did not contain enough diversity in configuration settings to infer the best tuning and was only useful for making hardware recommendations. Based on these experiments and their outcomes, we concluded that the diversity of a dataset, however large, should not be taken for granted. We also recommend a set of preliminary tests for anyone planning to apply machine learning to similarly complex datasets.
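
As an illustration of the kind of preliminary test recommended above, the following sketch flags configuration knobs that are effectively constant in a dataset and therefore cannot support tuning inference. It is a minimal sketch, assuming the data sits in a pandas DataFrame with one column per knob; the column names in the usage comment are hypothetical, not the dissertation's actual features.

    import math
    import pandas as pd

    def diversity_report(df: pd.DataFrame, config_cols: list) -> pd.DataFrame:
        """Report unique-value counts and normalized Shannon entropy per knob.
        Normalized entropy near 0 means the knob is nearly constant in the
        data, so no model trained on it can recommend how to tune it."""
        rows = []
        for col in config_cols:
            probs = df[col].value_counts(normalize=True)
            k = len(probs)
            # 1.0 = uniform over observed values, 0.0 = a single constant value.
            entropy = (-sum(p * math.log(p) for p in probs) / math.log(k)) if k > 1 else 0.0
            rows.append({"knob": col, "unique_values": k, "norm_entropy": round(entropy, 3)})
        return pd.DataFrame(rows).sort_values("norm_entropy")

    # Hypothetical usage: list knobs too uniform to learn from.
    # report = diversity_report(perf_df, ["tcp_congestion", "rx_ring_size"])
    # print(report[report["norm_entropy"] < 0.1])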

In our second project, we considered the problem of using a dataset publicly released by Alibaba to model batch tasks, which are often overlooked compared to online services. The dataset contains the arrivals and resource requirements (CPU, memory, etc.) of both batch and online tasks. Our trained model predicts, with high accuracy, the number of batch tasks that arrive in any 30-minute window, their associated CPU and memory requirements, and their lifetimes. It captures over 94% of arrivals in each 30-minute window within a 95% prediction interval. The F1 scores for the most frequent CPU classes exceed 75%, and our memory and lifetime predictions incur less than 1% loss on the test data. The prediction accuracy for the lifetime of a batch task drops when the model uses both CPU and memory information, as opposed to only memory information.
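
To make the coverage figure above concrete, here is a minimal sketch of how the fraction of 30-minute windows captured by a 95% prediction interval can be computed. It assumes the model emits a per-window mean and standard deviation and that the interval is Gaussian; the dissertation's actual model and interval construction may differ.

    import numpy as np

    def interval_coverage(y_true, y_mean, y_std, z=1.96):
        """Fraction of windows whose observed arrival count lies inside the
        model's 95% prediction interval (mean +/- 1.96 * std)."""
        y_true, y_mean, y_std = (np.asarray(a, dtype=float) for a in (y_true, y_mean, y_std))
        lo, hi = y_mean - z * y_std, y_mean + z * y_std
        return float(np.mean((y_true >= lo) & (y_true <= hi)))

    # Hypothetical check against the abstract's figure: coverage >= 0.94.
    # print(interval_coverage(arrivals, pred_mean, pred_std))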

Our third project proposes a deep reinforcement learning approach to task scheduling, aiming to maximize cloud resource utilization by strategically delaying and consolidating batch tasks onto fewer machines. We explore policy gradient (REINFORCE) and Double Deep Q-Network (DDQN) methods to develop a self-learning scheduler that adapts to dynamic workload conditions. Experimental results show that REINFORCE increases average CPU and memory utilization by 125-200% compared to Best-Fit and Packer, reduces the number of machines required, and achieves a 5-30% reduction in resource fragmentation. Although DDQN also reduces machine usage compared to traditional methods, its performance declines under high load due to job drops and sub-optimal long-term planning in partially observable environments. Moreover, REINFORCE is computationally more efficient and has lower memory requirements, while DDQN is more sample-efficient.
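
For readers unfamiliar with REINFORCE, the following sketch shows the core policy-gradient update such a scheduler would perform at the end of each episode. The action space (e.g., choice of machine or a delay action) and the reward signal (e.g., utilization gain minus a fragmentation penalty) are assumptions for illustration, not the dissertation's exact design.

    import torch

    def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
        """One REINFORCE update over a finished scheduling episode.
        log_probs: list of log pi(a_t | s_t) tensors saved during rollout.
        rewards:   per-step rewards (assumed: utilization gain minus a
                   fragmentation/delay penalty)."""
        returns, g = [], 0.0
        for r in reversed(rewards):  # discounted return G_t, computed backwards
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
        loss = -(torch.stack(log_probs) * returns).sum()  # ascend expected return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()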

Details

Title
Leveraging Machine Learning for Improved Distributed System Performance
Number of pages
109
Publication year
2025
Degree date
2025
School code
0234
Source
DAI-A 86/8(E), Dissertation Abstracts International
ISBN
9798304916066
Advisor / Committee members
Dogar, Fahad; Sambasivan, Raja R.; Kamra, Ashish; Hempstead, Mark
University/institution
Tufts University
Department
Computer Science
University location
United States -- Massachusetts
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
31564279
ProQuest document ID
3165731387
Document URL
https://www.proquest.com/dissertations-theses/leveraging-machine-learning-improved-distributed/docview/3165731387/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic