Abstract

On GPU-based clusters, the training workloads of machine learning (ML) models, particularly neural networks (NNs), are often structured as Directed Acyclic Graphs (DAGs) and deployed for parallel execution across heterogeneous GPU resources. Efficient scheduling of these workloads is crucial for optimizing performance metrics such as execution time under constraints including GPU heterogeneity, network capacity, and data dependencies. DAG-structured ML workload scheduling can be modeled as a Nonlinear Integer Program (NIP) and is shown to be NP-complete. Leveraging a positive correlation between Scheduling Plan Distance (SPD) and Finish Time Gap (FTG) identified through an empirical study, we propose a Running Time Gap Strategy for scheduling based on the Whale Optimization Algorithm (WOA) and Reinforcement Learning, referred to as WORL-RTGS. The proposed method integrates the global search capabilities of WOA with the adaptive decision-making of Double Deep Q-Networks (DDQN). In particular, we derive a novel function that uses DDQN to generate effective scheduling plans, enhancing adaptability to complex DAG structures. Comprehensive evaluations on practical ML workload traces collected from Alibaba, run on simulated GPU-enabled platforms, demonstrate that WORL-RTGS significantly improves WOA’s stability for DAG-structured ML workload scheduling and reduces completion time by up to 66.56% compared with five state-of-the-art scheduling algorithms.

Details

Title
Efficient Scheduling for GPU-Based Neural Network Training via Hybrid Reinforcement Learning and Metaheuristic Optimization
Author
Du, Nana 1; Wu, Chase 2; Hou, Aiqin 1; Nie, Weike 1; Song, Ruiqi 1

1 School of Computer, Northwest University, Xi’an 710100, China; [email protected] (N.D.); [email protected] (R.S.)
2 Department of Data Science, New Jersey Institute of Technology, Newark, NJ 07102, USA; [email protected]
Publication title
Volume
9
Issue
11
First page
284
Number of pages
42
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
e-ISSN
2504-2289
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-11-10
Milestone dates
2025-10-13 (Received); 2025-11-07 (Accepted)
First posting date
2025-11-10
ProQuest document ID
3275500653
Document URL
https://www.proquest.com/scholarly-journals/efficient-scheduling-gpu-based-neural-network/docview/3275500653/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-26
Database
ProQuest One Academic