Abstract
With the rapid growth of computation-intensive applications, high-performance computing (HPC) clusters have become essential for scientific computing, AI training, and industrial simulation. However, job scheduling in HPC clusters remains challenging due to heterogeneous resources, diverse task demands, and complex constraints. Traditional scheduling methods such as FCFS, SJF, and Backfilling show limited adaptability and struggle to achieve global optimization in large-scale environments. To address these issues, this paper proposes an intelligent scheduling method based on graph neural networks (GNNs) and deep reinforcement learning. A resource-constrained job–node bipartite graph is constructed to model task–node matching relationships, with node and task features capturing resource states and task demands. A GNN encodes the scheduling state, and an Actor–Critic reinforcement learning framework guides scheduling decisions. Simulation results show that, compared with representative baseline schedulers, the proposed GNN–Actor–Critic approach significantly reduces average waiting time, average turnaround time, and average slowdown while improving overall resource utilization, demonstrating its effectiveness and practicality for HPC cluster scheduling.
1. Introduction
With the rapid development of computationally intensive applications such as scientific computing, artificial intelligence training, and engineering simulation, high-performance computing clusters have become critical infrastructure underpinning scientific research capabilities and industrial digitalization. In recent years, HPC clusters have expanded in scale, with computing nodes exhibiting trends towards diversification and heterogeneity, while the volume of jobs and resource demands continue to grow. Against this backdrop, efficiently organizing and allocating computational resources to enhance overall cluster utilization and reduce job waiting times and turnaround times has become a core challenge in HPC system management.
However, HPC job scheduling is influenced by multiple factors, including diverse resource types, significant node performance variations, fluctuating job demands, and complex scheduling constraints [1]. This renders the scheduling problem a typical example of complex combinatorial optimization. On the one hand, task submissions exhibit randomness and burstiness, requiring schedulers to make rapid, efficient decisions amidst constantly evolving system states. On the other, scheduling objectives often encompass multiple metrics—such as fairness, resource utilization, average waiting time, and slowdown—which may conflict, precluding unified optimal solutions. Consequently, designing a scheduling method that balances performance, efficiency, and scalability is crucial for enhancing the overall service capability of HPC systems.
Early HPC scheduling primarily relied on fixed rules or mathematical programming models. Typical rules included First-Come, First-Served (FCFS), Shortest Job First (SJF), and Backfilling Scheduling. These strategies offered advantages such as straightforward implementation and high interpretability [2]. However, they often demonstrated insufficient scheduling efficiency when handling large volumes of heterogeneous jobs and struggled to maintain stable performance in complex dynamic scenarios.
Mathematical programming approaches model scheduling as linear or integer programming problems, such as employing Mixed Integer Linear Programming (MILP) or constrained programming to find optimal schedules. Whilst these methods yield high-quality solutions for small-scale scenarios, computational complexity increases exponentially with growing task volumes and resource dimensions, rendering them impractical for large-scale cluster scheduling.
To overcome the high computational complexity of mathematical programming, researchers have proposed various heuristic and metaheuristic optimization methods, including genetic algorithms, simulated annealing, ant colony algorithms, and local search [3]. These methods possess strong search capabilities, yielding near-optimal feasible solutions at larger problem scales, and they are suitable for combinatorial optimization scenarios with a certain degree of structure. However, their performance relies on manually designed heuristic rules or search operators, resulting in limited generalization capabilities and difficulty in adaptive adjustment based on system operational states.
In recent years, reinforcement learning has been applied to address large-scale scheduling problems. Nevertheless, existing RL methods face two primary limitations: firstly, most RL schedulers represent scheduling states as flat vectors or simple resource statistics, lacking explicit modeling of the “task–node bipartite graph structure”. This prevents capturing the impacts of resource dependencies, topological feasibility, and node heterogeneity. Secondly, the dimensionality of the action space grows exponentially with the number of nodes, making it challenging for RL models to directly learn effective strategies within high-dimensional combinatorial placement spaces. Consequently, many RL approaches address only either “resource quantity allocation” or “node selection,” failing to simultaneously optimize the overall task–node matching scheme.
Recent studies show that HPC scheduling performance is closely coupled with cluster topology and communication locality. Topology-aware schedulers explicitly consider network structure or communication patterns to reduce contention and tail latency, especially for communication-intensive workloads [4]. However, these methods usually require detailed topology and traffic information, which increases modeling complexity in heterogeneous environments.
With the advancement of data-driven approaches, increasing research endeavors to enhance scheduler decision quality through artificial intelligence techniques. Notably, graph neural networks and reinforcement learning demonstrate formidable capabilities in handling complex dependency structures and sequential decision problems, offering novel avenues for addressing HPC scheduling challenges. This research integrates GNN and RL for HPC scheduling, overcoming limitations of traditional approaches by enabling schedulers to autonomously learn strategies from historical scheduling behavior, thereby achieving intelligent, dynamic scheduling.
This research primarily encompasses the following two aspects. (1) Proposing a task–node bipartite graph representation model based on feasible matching relations. For HPC scheduling scenarios, we establish a formal representation of task demands and node resources, constructing a bipartite graph structure that incorporates schedulable relations. Feature parameterization enables differentiable encoding of complex states. This graph modeling substantially reduces the input dimensionality and the size of the action space, resolving traditional reinforcement learning's difficulty in handling large action sets. (2) Designing a scheduling algorithm based on graph neural networks and reinforcement learning. This research constructs a GNNBackbone encoder that performs graph structural embedding of node and task features to extract global resource state characteristics. Building upon this, an Actor–Critic policy network is designed to score candidate (job, node) matching pairs, selecting actions via a greedy policy to achieve multi-step scheduling decisions.
Recently, graph neural networks (GNNs) have been increasingly applied to scheduling problems due to their ability to model structured relationships among tasks and resources. Existing GNN-based scheduling studies can be broadly categorized into three directions [5]. First, some works employ GNNs to capture system topology or communication locality, enabling topology-aware scheduling and placement decisions in clusters and distributed systems. Second, several studies utilize static or semi-static graphs to represent task dependencies or resource constraints, and they apply GNNs for state embedding before heuristic or learning-based decision-making. Third, a small number of works combine GNNs with reinforcement learning to jointly model scheduling states and optimize long-term objectives.
Compared with these approaches, the proposed method differs in both graph construction and optimization perspective. Instead of focusing on explicit topology modeling or fixed task graphs, we dynamically construct a job–node bipartite graph to represent feasible scheduling relationships under resource constraints, and we integrate GNN-based state encoding with an Actor–Critic framework to learn global resource allocation policies. This design enables scalable decision-making under dynamic workloads while avoiding strong assumptions on topology or task dependency structures.
2. Related Theories and Technologies
2.1. Graph Neural Network
With the widespread application of deep learning across various domains, extending neural networks from traditional Euclidean structured data to graph-structured data with irregular topologies has become a significant research challenge. Graph neural networks emerged in this context as a class of models capable of processing non-Euclidean data. Graphs, as a universal data structure representing entity relationships, consist of nodes and edges [6]. They naturally describe complex systems such as social networks, knowledge graphs, transport networks, computer system architectures, and the resource states of high-performance computing clusters.
However, due to characteristics such as variable neighborhood sizes, irregular topologies, and complex information dependencies, traditional convolutional neural networks or recurrent neural networks struggle to directly model structural relationships on graphs. The emergence of GNN provides an effective solution to this challenge.
The core concept of graph neural networks lies in aggregating and updating information within a node’s local neighborhood through a message passing mechanism, thereby learning representations that reflect both node features and graph structure [7]. This mechanism typically involves two steps.
(1) Neighborhood aggregation. For each node v, the features of all neighboring nodes are aggregated as:
$$h_{\mathcal{N}(v)}^{(k)} = \mathrm{AGG}\left(\left\{ h_u^{(k-1)} : u \in \mathcal{N}(v) \right\}\right) \qquad (1)$$
where $h_u^{(k-1)}$ denotes the representation of node u at layer $k-1$, and $\mathrm{AGG}(\cdot)$ is a differentiable aggregation function such as mean, sum, or max.
(2) Feature update. The aggregated neighborhood information is then fused with the node's own features to obtain an updated node representation:
$$h_v^{(k)} = \sigma\left( W^{(k)} \cdot \left[ h_v^{(k-1)} \,\|\, h_{\mathcal{N}(v)}^{(k)} \right] \right) \qquad (2)$$
where $\|$ denotes vector concatenation, $W^{(k)}$ represents the learnable weight matrix at the k-th layer, and $\sigma(\cdot)$ is a non-linear activation function.
After propagating through K layers, each node incorporates structural information from its K-hop neighborhood, and its final embedding can be expressed as:
$$z_v = h_v^{(K)}, \quad v \in G \qquad (3)$$
where G denotes the input graph and $h_v^{(0)} = x_v$ is the initial feature vector of node v.
Through multiple rounds of message passing, nodes are able to capture broader structural information, enabling the model to characterize both local and global topological relationships.
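As an illustration, one round of the mean-aggregation message passing described by Eqs. (1) and (2) can be sketched in PyTorch as follows; the function name and the edge-list representation are illustrative assumptions rather than the interface of any specific GNN library.

```python
import torch

def message_passing_layer(h, edge_index, W, activation=torch.relu):
    """One round of mean-aggregation message passing (Eqs. (1)-(2)).

    h          : (num_nodes, d_in) node representations h^{(k-1)}
    edge_index : (2, num_edges) long tensor; column e = (u, v) means u sends to v
    W          : (2 * d_in, d_out) learnable weight matrix W^{(k)}
    """
    src, dst = edge_index
    # Eq. (1): mean of neighbor representations for every destination node.
    agg = torch.zeros_like(h).index_add_(0, dst, h[src])
    deg = torch.zeros(h.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
    agg = agg / deg.clamp(min=1.0)
    # Eq. (2): concatenate self features with the aggregate, then transform and activate.
    return activation(torch.cat([h, agg], dim=-1) @ W)
```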
High-performance computing cluster scheduling exhibits highly structured characteristics, which align well with the modeling capability of graph neural networks. In HPC scheduling, there exists a natural schedulable relationship between jobs and compute nodes, where resource constraints determine whether a given job can be executed on a specific node.
Based on this, a bipartite graph can be constructed as:
$$G = \left( V_J \cup V_M,\ E \right) \qquad (4)$$
where $V_J$ denotes the set of job nodes, $V_M$ represents the set of compute nodes, and
$$E = \left\{ (j, m) \mid j \in V_J,\ m \in V_M,\ \text{job } j \text{ is schedulable on node } m \right\} \qquad (5)$$
is the set of edges encoding feasible scheduling relationships.
The resource requirements of jobs, the available resources of nodes, and the schedulable relations between them jointly form a highly structured state space. This structure is typically sparse yet semantically rich, enabling GNNs to automatically learn the implicit correlations between job resource demands and node resource configurations.
Furthermore, modern HPC clusters exhibit multi-level structural characteristics, including rack architectures, network topologies, and GPU interconnection relationships. Such structural information is difficult to effectively represent using traditional vector-based representations, whereas GNNs can incorporate these heterogeneous relationships within a unified graph-based modeling framework. By encoding jobs and nodes as different types of vertices and representing feasible scheduling relations as edges, GNNs are able to capture the latent interaction patterns between tasks and computing resources. This provides reinforcement learning policy networks with more expressive and informative state representations.
The overall GNN-based computational workflow is illustrated in Figure 1.
2.2. Actor–Critic
Reinforcement learning (RL) constitutes a class of machine learning methods that investigate how agents acquire optimal behavioral strategies through continuous interaction with an unknown environment [8]. Deep reinforcement learning (DRL) extends this paradigm by using deep neural networks as function approximators for the policy function, value function, or action-value function. This enables reinforcement learning to handle complex decision-making tasks involving high-dimensional inputs (e.g., images, sensor time-series, or graph-structured data) and highly coupled state dependencies.
In HPC environments, the system state typically consists of multiple heterogeneous components, including job queues, node resource states, network topology, and real-time system loads. These components are characterized by high dimensionality, strong coupling, and dynamic evolution over time [9]. Meanwhile, scheduling decisions exhibit long-term impacts, as a job allocation at the current time step affects not only immediate resource utilization but also future scheduling behaviors. Therefore, scheduling inherently requires global and long-term optimization of system performance metrics, such as average waiting time, turnaround time, resource utilization, and energy consumption.
Within this context, deep reinforcement learning is able to naturally incorporate long-term performance objectives into the reward function and learn effective scheduling policies through extensive interaction with the environment. By continuously exploring the state-action space, the agent can autonomously discover superior resource allocation strategies beyond handcrafted rules.
Deep reinforcement learning can be broadly categorized into three paradigms: value-based methods, policy-based methods, and Actor–Critic methods, which combine the advantages of both [10].
Value-based approaches, such as Deep Q-Networks (DQN), aim to approximate the optimal action-value function by minimizing the temporal difference (TD) error. The loss function can be formulated as:
$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q\left(s', a'; \theta^{-}\right) - Q(s, a; \theta) \right)^{2} \right] \qquad (6)$$
where $\theta$ denotes the parameters of the current Q-network, and $\theta^{-}$ represents the parameters of the target network, which is used to stabilize training. DQN mitigates the instability of classical Q-learning with function approximation by introducing experience replay and a target network, enabling reinforcement learning to achieve major breakthroughs in large-scale perception tasks [11].
In contrast, policy gradient methods directly optimize a parameterized policy without explicitly estimating the action-value maximum. The policy parameters are updated by maximizing the expected return, and the gradient can be expressed as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a) \right] \qquad (7)$$
where $A(s, a)$ denotes the advantage function. These methods are particularly suitable for continuous action spaces, but they often suffer from high variance and require a large number of samples to converge.
Actor–Critic frameworks integrate the strengths of both value-based and policy-based methods by simultaneously learning two complementary components: a policy network (Actor) and a value network (Critic). The Actor outputs a probability distribution over actions, determining how the agent selects action a in state s, while the Critic evaluates the quality of the selected action by estimating the corresponding state value or action value, thus providing guidance for policy optimization [12].
Compared with pure value-based or policy gradient methods, Actor–Critic approaches exhibit superior stability, higher sample efficiency, and better suitability for continuous and high-dimensional decision problems. Therefore, they have been widely applied in complex control tasks, resource scheduling, and large-scale dynamic decision-making problems.
The overall framework of the Actor–Critic model is illustrated in Figure 2.
Within the Actor–Critic architecture, the policy is represented by a parameterized distribution $\pi_\theta(a \mid s)$, while the value network estimates the state value function $V_\phi(s)$. Given a transition tuple $(s_t, a_t, r_t, s_{t+1})$ obtained through interaction with the environment, the agent first uses the value network to estimate the state value, then it constructs the advantage function based on the discounted return:
$$A_t = G_t - V_\phi(s_t) \qquad (8)$$
where the return is defined as:
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \qquad (9)$$
which represents the cumulative discounted reward from time step t.
The advantage function measures how much the action improves upon the expected performance under the current policy. If $A_t > 0$, the action performs better than the baseline and its selection probability should be increased; otherwise, it should be decreased.
Based on the advantage function, the Actor network updates its parameters according to the policy gradient:
$$\theta \leftarrow \theta + \alpha_{\theta}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \qquad (10)$$
The Critic network is updated by minimizing the value estimation error via a regression loss:
$$L(\phi) = \left( G_t - V_\phi(s_t) \right)^{2} \qquad (11)$$
By approximating the true value function using Monte Carlo returns or temporal-difference targets, the Critic continuously improves its estimation accuracy, which in turn provides more reliable guidance for the Actor’s policy updates. Since the Actor relies on the Critic’s value estimation, a more accurate Critic network significantly enhances the stability and efficiency of policy gradient learning.
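For concreteness, the discounted return, advantage, and the two update targets of Eqs. (8)–(11) can be sketched as follows for a single episode; the tensor-based interface and function names are assumptions made for illustration.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Eq. (9): G_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode."""
    running, returns = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)))

def actor_critic_losses(log_probs, values, rewards, gamma=0.99):
    """Advantage (Eq. 8), policy-gradient loss (Eq. 10), and value regression (Eq. 11)."""
    returns = discounted_returns(rewards, gamma)
    advantages = returns - values.detach()            # A_t = G_t - V(s_t)
    policy_loss = -(log_probs * advantages).mean()    # gradient ascent form of Eq. (10)
    value_loss = (returns - values).pow(2).mean()     # Eq. (11)
    return policy_loss, value_loss
```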
3. Methodology
3.1. Model Architecture
The core task of high-performance computing cluster scheduling is to allocate incoming computational jobs to appropriate compute nodes based on the current cluster resource state, thereby improving overall system performance. Typical evaluation metrics include average job waiting time, resource utilization, and system throughput. Due to heterogeneous job resource demands, dynamic fluctuations of node resources, and strong temporal dependencies among scheduling decisions, constructing an effective system model is fundamental to designing efficient scheduling algorithms.
In an HPC cluster, each job j is associated with attributes such as its submission time, required CPU cores, memory capacity, number of GPUs, and estimated runtime. Each compute node m maintains a fixed total resource capacity, while its available resources dynamically change as jobs are executed and completed [13]. The arrival of jobs, resource occupation, and resource release jointly form a continuously evolving dynamic system. At each decision time step, the scheduler selects a pending job and assigns it to a target compute node based on the current system state.
To characterize this process, the HPC scheduling problem can be formulated as a sequential decision-making task. The system state at time t, denoted as $s_t$, consists of the current time t, the resource requirements and waiting times of all jobs in the queue, and the available resources of all compute nodes. An action $a_t$ corresponds to selecting a feasible job–node pair for scheduling. State transitions are driven by job execution and completion, while the reward function reflects the impact of scheduling decisions on long-term system performance.
This study proposes a scheduling framework that integrates graph neural networks with an Actor–Critic reinforcement learning architecture. Specifically, the HPC scheduling state is constructed as a task–node bipartite graph, where job nodes represent task demands, compute nodes represent resource states, and edges model the schedulable relationships between tasks and nodes. Through message-passing mechanisms, GNNs extract structural correlations and global resource distribution features from this bipartite graph. The learned graph representations are then fed into an Actor–Critic framework, where the Actor network outputs scores for task–node matching pairs, and the Critic network estimates the global state value, thereby enabling end-to-end learning of scheduling policies.
Through continuous interaction with the scheduling environment, the proposed framework can autonomously learn, from large-scale interaction experience, scheduling behaviors that reduce job waiting times and improve cluster resource utilization.
3.2. HPC State Graph Construction
Within high-performance computing clusters, the scheduler must simultaneously consider the attributes of tasks in the job queue and the resource utilization of compute nodes [14]. These elements exhibit complex structural relationships: whether a task can be scheduled on a specific node depends on the availability of sufficient resources on that node; multiple tasks may compete for limited resources, and a single node may execute multiple tasks concurrently.
To effectively capture such structured dependencies, this work models the HPC scheduling state as a task–node bipartite graph. By formulating the scheduling environment in this way, graph neural networks are able to exploit the interaction relationships between tasks and nodes at the structural level, thereby enhancing the representational capacity of the global system state.
Formally, the bipartite graph is defined as:
$$G = (V, E), \qquad V = V_J \cup V_M \qquad (12)$$
where the node set V consists of two disjoint subsets: the task node set $V_J$ and the compute node set $V_M$.
Each task node represents a job currently waiting to be scheduled, where each job j corresponds to one node in $V_J$. The compute node set $V_M$ represents all physical nodes in the cluster. Each node has a fixed total resource capacity and dynamically changing available resources over time.
Since task nodes and compute nodes belong to different categories, no edges exist between nodes of the same type, which preserves the bipartite property of the graph. Edges only exist between task nodes and compute nodes, representing schedulable relationships between jobs and resources.
For illustration purposes, Figure 3 shows an example consisting of four task nodes and four compute nodes. In practical systems, the number of task nodes and compute nodes can be flexibly scaled according to different cluster configurations and workload intensities.
For task nodes, this study extracts their core resource requirements and queue status as node features, including the required number of CPU cores, memory capacity, number of GPUs, and waiting time. The feature vector of a task node j can be expressed as follows:
$$x_j = \left[ \mathrm{cpu}_j,\ \mathrm{mem}_j,\ \mathrm{gpu}_j,\ \mathrm{wait}_j \right] \qquad (13)$$
For compute nodes, features are constructed based on their current available resources and resource utilization:
$$x_m = \left[ \mathrm{cpu}_m^{\mathrm{avail}},\ \mathrm{mem}_m^{\mathrm{avail}},\ \mathrm{gpu}_m^{\mathrm{avail}},\ u_m \right] \qquad (14)$$
where resource utilization $u_m$ is defined as the ratio of allocated resources to total resources, reflecting the current workload of a compute node.
The edge set between task nodes and compute nodes, denoted as E, characterizes the schedulable relationship. An edge is established between a task node j and a compute node m if and only if the currently available resources of node m can fully satisfy the resource requirements of task j, i.e.,
$$(j, m) \in E \iff \mathrm{cpu}_j \le \mathrm{cpu}_m^{\mathrm{avail}} \ \wedge\ \mathrm{mem}_j \le \mathrm{mem}_m^{\mathrm{avail}} \ \wedge\ \mathrm{gpu}_j \le \mathrm{gpu}_m^{\mathrm{avail}} \qquad (15)$$
By introducing edges based on schedulable relationships, the feasible scheduling action space is directly encoded into the graph structure. This significantly reduces the complexity of action selection. Instead of performing a Cartesian product search over the entire job set and node set, the policy network only needs to consider the feasible edge set, thereby dramatically reducing the number of candidate actions to be evaluated.
Compared with traditional vector-based representations, the bipartite graph-based state modeling offers two major advantages. First, task nodes and compute nodes are naturally distinguished as different entity types within the graph structure, enabling more effective learning of their semantic differences [15]. Second, the task–node edges explicitly encode scheduling constraints, allowing the graph neural network to leverage this structural information during message passing and thus learn more optimal resource matching patterns [16].
Especially in large-scale cluster environments with dynamically changing workloads, the graph-based representation can flexibly adapt to varying numbers of tasks and nodes, effectively avoiding the dimensionality explosion problem commonly encountered in fixed-size vector representations.
In summary, by constructing a task–node bipartite graph, this study achieves structured modeling of the global scheduling state in HPC systems. This provides a unified input representation for subsequent graph neural network processing and Actor–Critic policy learning, laying a solid foundation for the scalability and generalizability of the proposed scheduling framework.
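A minimal sketch of this graph construction is given below, assuming dictionary-style job and node records carrying the fields of Eqs. (13) and (14); the function and key names are illustrative rather than part of the actual implementation.

```python
import torch

def build_bipartite_graph(jobs, nodes):
    """Construct the job-node bipartite state graph of Eqs. (12)-(15).

    jobs  : list of dicts with keys 'cpu', 'mem', 'gpu', 'wait'  (task features, Eq. 13)
    nodes : list of dicts with keys 'cpu', 'mem', 'gpu', 'util'  (available resources, Eq. 14)
    Returns job features, node features, and the feasible edge set of Eq. (15).
    """
    x_jobs = torch.tensor([[j['cpu'], j['mem'], j['gpu'], j['wait']] for j in jobs],
                          dtype=torch.float)
    x_nodes = torch.tensor([[n['cpu'], n['mem'], n['gpu'], n['util']] for n in nodes],
                           dtype=torch.float)
    # An edge (j, m) exists only if node m can fully satisfy job j's demands (Eq. 15).
    edges = [(j, m) for j, job in enumerate(jobs) for m, node in enumerate(nodes)
             if job['cpu'] <= node['cpu'] and job['mem'] <= node['mem']
             and job['gpu'] <= node['gpu']]
    edge_index = (torch.tensor(edges, dtype=torch.long).t()
                  if edges else torch.empty(2, 0, dtype=torch.long))
    return x_jobs, x_nodes, edge_index
```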
3.3. Graph Neural Network-Based State Encoding Model
This study represents the scheduling state of a high-performance computing (HPC) cluster as a task–node bipartite graph, enabling a unified description of job demands, node resources, and schedulable relationships. To further extract effective scheduling features from this structured state, a graph neural network is employed to encode the bipartite graph. Through message passing, GNNs propagate information among nodes and capture topological dependencies between tasks and compute nodes, as well as latent resource matching patterns.
In this work, an enhanced GraphSAGE (Graph Sample and Aggregate) model is adopted as the graph encoder. The core idea of GraphSAGE is to update the representation of a target node by aggregating the features of its neighboring nodes at each layer. For a node v, the update rule for its l-th layer representation is defined as:
$$h_v^{(l)} = \sigma\left( W^{(l)} \cdot \left[ h_v^{(l-1)} \,\|\, \mathrm{AGG}\left( \left\{ h_u^{(l-1)} : u \in \mathcal{N}(v) \right\} \right) \right] \right) \qquad (16)$$
where $\mathcal{N}(v)$ denotes the neighbor set of node v, $\mathrm{AGG}(\cdot)$ represents the aggregation function, $W^{(l)}$ denotes the learnable parameters at the l-th layer, and $\sigma(\cdot)$ is a nonlinear activation function.
Since the task–node bipartite graph contains no edges among task nodes or among compute nodes, the GraphSAGE structure naturally captures cross-type message passing between tasks and nodes, enabling the model to learn task–node resource feasibility relationships from a structural perspective.
This research employs a GraphSAGE-based encoder with mean aggregation, where each node updates its representation by averaging the features of its neighboring nodes.
In implementation, two GraphSAGE layers are employed to encode both task nodes and compute nodes, yielding task embeddings $h_j$ and compute node embeddings $h_m$, respectively.
As each scheduling decision corresponds to selecting a feasible task–node pair $(j, m)$, a joint representation of the candidate action needs to be constructed. Specifically, the task embedding $h_j$ and node embedding $h_m$ are concatenated to form the representation of a candidate match:
$$e_{j,m} = h_j \,\|\, h_m \qquad (17)$$
where $\|$ denotes the vector concatenation operation. This representation integrates both task demand information and node resource information, forming the basis for evaluating the suitability of each scheduling action. To estimate the global system state, mean pooling is applied over the embeddings of compute nodes:
$$g = \frac{1}{|V_M|} \sum_{m \in V_M} h_m \qquad (18)$$
where $V_M$ denotes the set of compute nodes. The global state vector g is used as an input to the Critic network and also provides the Actor network with contextual information about the overall cluster load, enabling more informed scheduling decisions.
Through this structured state modeling and GNN encoding strategy, the proposed method naturally adapts to clusters of varying scales, maintaining dynamic scalability with respect to fluctuating numbers of tasks and compute nodes.
In summary, the GNN-based state encoder extracts high-quality structured features from the task–node bipartite graph, providing expressive, generalizable, and resource-structure-aware representations for the Actor–Critic reinforcement learning framework, thereby laying a solid foundation for effective scheduling policy learning in HPC environments.
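The cross-type mean-aggregation encoder, the pairwise concatenation of Eq. (17), and the global mean pooling of Eq. (18) can be sketched in PyTorch as follows; this is an illustrative implementation consistent with the description above, not the exact code used in the experiments.

```python
import torch
import torch.nn as nn

class BipartiteSAGEEncoder(nn.Module):
    """Two rounds of cross-type mean aggregation (Eq. 16), then pair/global readout."""

    def __init__(self, job_dim=4, node_dim=4, hidden=128):
        super().__init__()
        self.job_in = nn.Linear(job_dim, hidden)
        self.node_in = nn.Linear(node_dim, hidden)
        # One SAGE-style update per node type and per round: self features || neighbor mean.
        self.job_upd = nn.ModuleList([nn.Linear(2 * hidden, hidden) for _ in range(2)])
        self.node_upd = nn.ModuleList([nn.Linear(2 * hidden, hidden) for _ in range(2)])

    @staticmethod
    def _mean_agg(src_h, src_idx, dst_idx, num_dst):
        out = torch.zeros(num_dst, src_h.size(1)).index_add_(0, dst_idx, src_h[src_idx])
        deg = torch.zeros(num_dst, 1).index_add_(0, dst_idx,
                                                 torch.ones(dst_idx.size(0), 1))
        return out / deg.clamp(min=1.0)

    def forward(self, x_jobs, x_nodes, edge_index):
        j_idx, m_idx = edge_index                      # feasible (job, node) pairs
        h_j = torch.relu(self.job_in(x_jobs))
        h_m = torch.relu(self.node_in(x_nodes))
        for job_upd, node_upd in zip(self.job_upd, self.node_upd):
            agg_j = self._mean_agg(h_m, m_idx, j_idx, h_j.size(0))   # nodes -> jobs
            agg_m = self._mean_agg(h_j, j_idx, m_idx, h_m.size(0))   # jobs -> nodes
            h_j = torch.relu(job_upd(torch.cat([h_j, agg_j], dim=-1)))
            h_m = torch.relu(node_upd(torch.cat([h_m, agg_m], dim=-1)))
        pair_emb = torch.cat([h_j[j_idx], h_m[m_idx]], dim=-1)       # Eq. (17)
        global_state = h_m.mean(dim=0)                               # Eq. (18)
        return pair_emb, global_state
```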
3.4. Actor–Critic Scheduling
In the preceding sections, this study completed both the graph structural modeling of HPC scheduling states and the graph neural network-based state embedding representation. To achieve end-to-end scheduling policy learning, we further constructed an Actor–Critic deep reinforcement learning framework upon the GNN state encoder. This enables the scheduling policy to autonomously learn long-term optimal decision patterns for task–node matching through continuous interaction with the environment.
The Actor–Critic approach belongs to the classic policy gradient class of algorithms within deep reinforcement learning. Its core principle involves simultaneously training two networks: the Actor as the policy network, responsible for generating action selection probabilities, and the Critic as the value network, tasked with evaluating the value of the current state.
These two networks mutually reinforce each other through the advantage function, effectively mitigating the high variance and convergence difficulties inherent in pure policy gradient methods [17] while also avoiding the limitation of pure value methods in directly outputting action selections.
3.4.1. Policy Network
The Actor network constructs a probability distribution over all feasible scheduling actions given the current system state $s_t$, thereby guiding the task–node matching decision.
At each scheduling step t, the system first passes the task–node bipartite graph through the GraphSAGE encoder to obtain compute node embeddings $h_m$ and task node embeddings $h_j$. For each candidate task–node pair $(j, m)$, a joint representation is constructed by concatenating their embeddings:
$$e_{j,m} = h_j \,\|\, h_m \qquad (19)$$
The policy network takes the embeddings of all candidate pairs as input and outputs an unnormalized score (logit) for each action using a multi-layer perceptron (MLP):
$$\ell_{j,m} = f_{\theta}\left( e_{j,m} \right) \qquad (20)$$
where $f_{\theta}(\cdot)$ denotes the policy network composed of several fully connected layers with nonlinear activation functions, which evaluates the priority of assigning task j to node m.
The logits of all candidate actions are then normalized using the softmax function to form the policy distribution:
$$\pi_\theta\left( (j, m) \mid s_t \right) = \frac{\exp\left( \ell_{j,m} \right)}{\sum_{(j', m') \in \mathcal{A}_t} \exp\left( \ell_{j', m'} \right)} \qquad (21)$$
where $\mathcal{A}_t$ denotes the feasible action set consisting of all schedulable task–node matching pairs under the current state.
Based on this action probability distribution, the policy can autonomously select appropriate scheduling decisions by comprehensively considering task demands and node resource availability.
During the inference phase (i.e., online scheduling or environment simulation), the stochastic sampling strategy used in training is no longer required. To obtain more stable and reproducible scheduling behavior, a greedy decision strategy is adopted.
Specifically, at each scheduling step t, given the current state , the policy network computes the action probabilities for all feasible actions. The scheduler then executes the action with the maximum probability:
$$a_t = \arg\max_{(j, m) \in \mathcal{A}_t} \pi_\theta\left( (j, m) \mid s_t \right) \qquad (22)$$
This greedy selection strategy removes the randomness introduced by sampling during training, resulting in a deterministic decision process during deployment. It not only improves the stability of the scheduling results but also usually leads to better overall system performance.
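A compact sketch of the Actor head and the two action-selection modes (stochastic sampling during training, greedy arg max of Eq. (22) during inference) is shown below; the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class ActorHead(nn.Module):
    """Scores each feasible (job, node) pair (Eq. 20) and normalizes with softmax (Eq. 21)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, pair_emb):
        logits = self.mlp(pair_emb).squeeze(-1)   # one logit per feasible pair
        return torch.softmax(logits, dim=0)       # distribution over feasible actions

def select_action(probs, greedy=True):
    """Training samples from the distribution; inference takes the arg max (Eq. 22)."""
    if greedy:
        return int(torch.argmax(probs))
    return int(torch.multinomial(probs, 1))
```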
3.4.2. Value Network
Within the Actor–Critic framework, the Critic network is responsible for evaluating the state value function of the current scheduling state. It provides a value baseline and optimization direction for updating the Actor network, thereby reducing the variance of policy gradient estimation and improving training stability and convergence speed [18].
At each scheduling step t, the environment state $s_t$ is represented as a task–node bipartite graph. After being processed by the GraphSAGE-based graph encoder, low-dimensional embeddings of all compute nodes are obtained. To extract global state information from node-level representations, global average pooling is employed to aggregate all node embeddings into a unified global state vector:
$$g_t = \frac{1}{|V_M|} \sum_{m \in V_M} h_m \qquad (23)$$
This global pooling operation effectively compresses node information across clusters of different sizes, producing a fixed-dimensional global representation regardless of the number of compute nodes. The global state vector is then fed into the value network to estimate the state value:
$$V_\phi(s_t) = f_{\phi}\left( g_t \right) \qquad (24)$$
During training, the value network and policy network are jointly optimized; the value function error is minimized to update the parameters $\phi$. The objective function is defined as:
$$L(\phi) = \mathbb{E}\left[ \left( \hat{V}_t - V_\phi(s_t) \right)^{2} \right] \qquad (25)$$
where $\hat{V}_t$ denotes the target value computed based on sampled trajectories.
Training the value network effectively reduces variance in the policy gradient, enhancing the convergence efficiency and stability of the overall Actor–Critic framework.
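The Critic head operating on the pooled global state vector of Eq. (23), together with the value regression of Eq. (25), can be sketched as follows; the names are again illustrative.

```python
import torch
import torch.nn as nn

class CriticHead(nn.Module):
    """Estimates V(s_t) from the mean-pooled compute-node embedding (Eqs. 23-24)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, global_state):
        return self.mlp(global_state).squeeze(-1)   # scalar state-value estimate

def critic_loss(values, targets):
    """Value regression of Eq. (25): fit the Critic to trajectory targets."""
    return (targets - values).pow(2).mean()
```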
3.4.3. Reward Design
In reinforcement learning, the agent’s learning objective is directly determined by the reward function, making its design critical to the performance of scheduling policies. For high-performance computing cluster scheduling, the primary objectives of this study are to reduce job waiting time, decrease the overall system task slowdown, and improve task completion rate and resource utilization.
Based on these objectives, after executing an action $a_t = (j, m)$ (i.e., assigning task j to node m), the immediate reward is defined as:
$$r_t = -\,\alpha_1 \cdot \mathrm{wait}_j \; + \; \alpha_2 \cdot U_t \; - \; \alpha_3 \cdot \mathrm{slowdown}_j \qquad (26)$$
where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are weighting coefficients used to balance the relative importance of different optimization objectives. Specifically, $\alpha_1$ controls the importance of minimizing job completion time, $\alpha_2$ emphasizes resource utilization efficiency, and $\alpha_3$ balances fairness among jobs. The weight coefficients are selected to ensure that different objectives are comparably scaled and that no single term dominates the learning signal during training. In practice, the weight values are determined through preliminary experiments and empirical tuning, and the final settings are chosen from a range where stable performance is observed.
$\mathrm{wait}_j$ denotes the waiting time from task j submission to execution start, where a shorter waiting time reflects better system responsiveness. $U_t$ represents the current overall system resource utilization rate. $\mathrm{slowdown}_j$ measures the deceleration ratio of task j, which is defined as:
$$\mathrm{slowdown}_j = \frac{\mathrm{wait}_j + T_j^{\mathrm{run}}}{T_j^{\mathrm{run}}} \qquad (27)$$
This reward function jointly considers task responsiveness and resource efficiency, ensuring that the scheduling policy not only focuses on individual task performance but also optimizes the overall system resource utilization.
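A minimal sketch of the immediate reward of Eqs. (26) and (27) is given below; the coefficient values shown are placeholders, since the paper determines the actual weights through preliminary experiments.

```python
def immediate_reward(wait_time, run_time, utilization, a1=1.0, a2=1.0, a3=1.0):
    """Immediate reward of Eqs. (26)-(27) after assigning a job to a node.

    Penalizes waiting time and slowdown and rewards overall utilization. The
    coefficients a1, a2, a3 are illustrative defaults, not the tuned values.
    """
    slowdown = (wait_time + run_time) / max(run_time, 1e-6)    # Eq. (27)
    return -a1 * wait_time + a2 * utilization - a3 * slowdown  # Eq. (26)
```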
As scheduling in HPC clusters is inherently a long-term sequential decision-making process, each action affects not only the current system state but also future system dynamics. To capture such long-term effects, the cumulative discounted return is defined as:
$$G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k \qquad (28)$$
where T denotes the termination time of the current scheduling trajectory, and $\gamma \in (0, 1]$ is the discount factor that balances short-term and long-term rewards. As $\gamma \to 1$, the policy emphasizes long-term performance, whereas smaller values of $\gamma$ encourage short-term optimization.
Within the Actor–Critic framework, directly using the return as the policy update signal may lead to high gradient variance, which can affect training stability. Therefore, an advantage function is introduced to measure the relative quality of the selected action compared with the average baseline:
$$A(s_t, a_t) = G_t - V_\phi(s_t) \qquad (29)$$
where $V_\phi(s_t)$ is the state value estimated by the Critic network, and $A(s_t, a_t)$ represents the improvement achieved by executing action $a_t$ in state $s_t$. When $A(s_t, a_t) > 0$, the selected action performs better than the baseline and should be reinforced; when $A(s_t, a_t) < 0$, the action underperforms and its selection probability should be suppressed.
By subtracting the baseline value $V_\phi(s_t)$, the advantage function effectively reduces the variance of policy gradient updates, thereby improving the stability and convergence efficiency of the training process.
3.4.4. Policy Update and Loss Function
Within the Actor–Critic framework, the policy network and value network are optimized in an alternating manner. The Actor is responsible for optimizing action selection strategies, while the Critic evaluates state values and provides a baseline for policy updates. This collaborative optimization process improves scheduling performance while enhancing training stability.
(1) Policy Network Update
Based on the policy gradient theorem, the policy network aims to maximize the expected cumulative reward [19]. Its gradient update is given by:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right] \qquad (30)$$
where $\theta$ denotes the parameters of the policy network, $\pi_\theta(a_t \mid s_t)$ represents the action probability output by the Actor network, $A(s_t, a_t)$ is the advantage function measuring the relative quality of the selected action, and $\mathbb{E}[\cdot]$ denotes the expectation over sampled trajectories.
In practice, this update is implemented by minimizing the following policy loss function using backpropagation:
$$L_{\mathrm{actor}}(\theta) = -\,\mathbb{E}\left[ \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right] \qquad (31)$$
When $A(s_t, a_t) > 0$, the probability of the selected action is increased, whereas when $A(s_t, a_t) < 0$, its probability is reduced, thereby driving the policy towards better decisions.
(2) Value Network Update
The value network approximates the state value function under the current policy by fitting the true return $G_t$. Its optimization objective is defined as:
$$L_{\mathrm{critic}}(\phi) = \mathbb{E}\left[ \left( G_t - V_\phi(s_t) \right)^{2} \right] \qquad (32)$$
where $V_\phi(s_t)$ denotes the state value estimate output by the value network, and $G_t$ is the multi-step cumulative return obtained from environment interactions.
By minimizing this mean squared error loss, the value network continuously improves its estimation accuracy of long-term state returns [20]. This provides the Actor network with a more reliable baseline, effectively reducing gradient oscillations and enhancing policy learning stability.
(3) Entropy Regularization
To prevent premature convergence and encourage sufficient exploration, an entropy regularization term is introduced:
$$H\left(\pi_\theta(\cdot \mid s_t)\right) = -\sum_{a \in \mathcal{A}_t} \pi_\theta(a \mid s_t)\, \log \pi_\theta(a \mid s_t) \qquad (33)$$
where $\mathcal{A}_t$ denotes the feasible action set under state $s_t$.
This entropy term encourages the policy to maintain a certain degree of randomness, thus preventing deterministic collapse and facilitating the discovery of globally optimal scheduling strategies in complex environments [21].
(4) Overall Loss Function
By integrating the policy loss, value loss, and entropy regularization, the final optimization objective is defined as:
$$L = L_{\mathrm{actor}}(\theta) + c_v\, L_{\mathrm{critic}}(\phi) - c_e\, H\left(\pi_\theta(\cdot \mid s_t)\right) \qquad (34)$$
where $c_v$ balances the relative contribution of the value loss and policy loss, and $c_e$ controls the strength of the entropy regularization term.
Through this joint loss optimization, the Actor and Critic networks are trained collaboratively, enabling the scheduling policy to achieve improved task responsiveness, enhanced system stability, and stronger generalization capability.
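Combining the three terms, the per-step joint objective of Eq. (34) can be sketched as follows, assuming the log-probability, entropy, Critic estimate, and return target are available as tensors; the default coefficients are illustrative, not the values used in the experiments.

```python
def joint_loss(log_prob, entropy, value, target_return, c_v=0.5, c_e=0.01):
    """Joint objective of Eq. (34) for a single decision step.

    Combines the policy loss (Eq. 31), the value loss (Eq. 32), and the entropy
    bonus (Eq. 33); c_v and c_e are illustrative defaults.
    """
    advantage = (target_return - value).detach()   # Eq. (29); no gradient through the baseline
    actor_loss = -log_prob * advantage             # Eq. (31)
    critic_loss = (target_return - value).pow(2)   # Eq. (32)
    return actor_loss + c_v * critic_loss - c_e * entropy
```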
3.5. GNN–Actor–Critic Scheduling Inference
The proposed GNN–Actor–Critic-based task scheduling framework is illustrated in Figure 4.
After completing the GNN-based state encoding and Actor–Critic policy training, the system enters the inference phase, i.e., the actual HPC task scheduling process. During inference, the model parameters are fixed and no longer updated. Instead, scheduling decisions are generated by the trained policy network based on real-time system states. The core objective of this stage is to produce task–node matching actions and continuously drive the scheduling environment forward, thereby forming a complete closed-loop scheduling process.
Specifically, at each scheduling epoch t, the scheduler executes the following steps:
(1) System state modeling: the scheduler collects the current job queue information and the resource status of compute nodes and constructs a task–node bipartite graph. The graph includes task node features, compute node resource features, and the schedulable edges filtered according to resource constraints.
(2) GNN state encoding: the constructed bipartite graph is fed into the GNN encoder. Through message passing, task node embeddings, compute node embeddings, and their interaction features are extracted to form a unified representation of the current scheduling state.
(3) Actor inference and action selection: based on the encoded state representation, the Actor network evaluates all feasible task–node candidate pairs. After applying the softmax function, an action probability distribution is obtained. The scheduler then selects the task–node pair with the highest probability as the scheduling decision.
(4) Scheduling action execution: the selected scheduling action is executed by assigning the corresponding task to the target compute node. The node resource states are updated, and task execution timestamps are recorded.
(5) Environment state evolution: the scheduler periodically checks job completion status and releases the resources occupied by completed tasks. Meanwhile, the system time is advanced with a fixed step size, and the job queue and node resource states are updated accordingly.
(6) Iterative execution: the above procedure is repeated until all tasks in the system have been processed and completed.
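A schematic version of this closed-loop inference procedure is sketched below; the environment interface (pending_jobs, node_states, dispatch, advance_time) is a hypothetical API used only for illustration, and build_bipartite_graph refers to the graph-construction sketch in Section 3.2.

```python
import torch

def schedule(env, encoder, actor, time_step=1.0):
    """Greedy inference loop of the trained GNN-Actor-Critic scheduler (illustrative)."""
    while not env.all_jobs_done():
        # Keep dispatching as long as some (job, node) pair is feasible at this step.
        while True:
            x_jobs, x_nodes, edge_index = build_bipartite_graph(env.pending_jobs(),
                                                                env.node_states())
            if edge_index.size(1) == 0:
                break
            pair_emb, _ = encoder(x_jobs, x_nodes, edge_index)   # GNN state encoding
            best = int(torch.argmax(actor(pair_emb)))            # greedy action, Eq. (22)
            j, m = edge_index[:, best].tolist()
            env.dispatch(j, m)                                   # execute the matching pair
        env.advance_time(time_step)   # release finished jobs, update queue and clock
```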
3.6. Time and Space Complexity Analysis
Let $N_J$ denote the number of waiting jobs and $N_M$ denote the number of compute nodes in the cluster. Graph construction requires checking resource feasibility between jobs and nodes, resulting in a worst-case complexity of $O(N_J \cdot N_M)$. In practice, this cost is significantly reduced due to resource constraints that sparsify feasible job–node edges.
For GNN-based state encoding, each message passing layer aggregates information over the graph edges. Assuming E feasible edges, the computational complexity of a single GNN layer is $O(E)$ (treating the embedding dimension as a constant). With L GNN layers, the total GNN encoding complexity is $O(L \cdot E)$.
The Actor–Critic policy inference operates on learned embeddings and has a complexity of $O(E)$ for scoring feasible actions. Therefore, the overall time complexity of a single scheduling decision is dominated by sparse graph operations and scales approximately linearly with the number of feasible job–node pairs.
The space complexity of the proposed method mainly consists of storing node features, edge features, and learned embeddings. Specifically, the memory requirement is $O(N_J + N_M + E)$, where E denotes the number of feasible job–node edges. Additional memory is required for network parameters, which is independent of cluster scale once the model architecture is fixed.
4. Experiments and Results
This experiment was conducted on a workstation equipped with high-performance computing capabilities, featuring an Intel Core i7 series multi-core CPU operating at 3.6 GHz, 32 GB of memory, and an NVIDIA RTX 5090 GPU with 32 GB of dedicated graphics memory. The experiment employed a dual-system environment comprising Windows 11 and Ubuntu 20.04. Model training was primarily conducted under the Ubuntu system, while scheduling simulations and result analysis were performed within the Windows environment. The deep learning framework utilised was PyTorch version 1.13, with Python version 3.8.
4.1. Dataset
To systematically evaluate the effectiveness and stability of the proposed GNN–Actor–Critic scheduling algorithm under diverse workload conditions, this study constructs an HPC scheduling simulation environment based on real task trajectory logs and cluster node configurations. The experimental environment consists of three main components: task datasets, cluster resource configurations, and scheduling time control mechanisms.
This study utilizes a self-built monitoring dataset collected from a heterogeneous computing cluster. Each task record contains fields such as the job ID, submission time, requested CPU cores, memory capacity, number of GPUs, and runtime.
To improve the robustness and reliability of the scheduling system, the following preprocessing operations are applied to the entire task dataset: Sorting tasks by submission time to ensure the correct temporal order of task arrivals; Removing records with missing fields or abnormal values; Marking tasks that significantly exceed the resource capacity of any single cluster node as unschedulable. These tasks are excluded from resource allocation during the scheduling process, but they are still recorded and counted for statistical analysis.
An unschedulable task is defined as a task whose resource requirement exceeds the maximum resource capacity of all nodes in the cluster, formulated as:
$$\exists\, r \in \{\mathrm{cpu}, \mathrm{mem}, \mathrm{gpu}\}: \quad d_j^{(r)} > \max_{m} C_m^{(r)} \qquad (35)$$
where $d_j^{(r)}$ denotes the demand of task j for resource type r, and $C_m^{(r)}$ denotes the maximum capacity of compute node m for that resource type.
The experimental cluster environment was configured using YAML files, which describe the resource capacities and quantities of each compute node. A typical configuration is summarized as follows: number of nodes: ; maximum CPU cores per node: 32; maximum memory per node: 64 GB; number of GPUs per node: 0–2.
This configuration forms a heterogeneous HPC cluster platform for scheduling simulation. At the initialization stage, all nodes are assigned their maximum available resources. During scheduling, node resource states are dynamically updated as tasks are allocated and completed.
To simulate real-world HPC scheduling behavior as closely as possible, a discrete-time-based scheduling simulation mechanism is developed. The simulation framework mainly consists of the following components: Time Progression Mechanism: the simulation clock advances in a fixed time step of $\Delta t$ seconds. Task Arrival Mechanism: tasks are added into the scheduling queue when the current simulation time reaches or exceeds their submission times. Resource Allocation and Reclamation Mechanism: node resources are occupied when tasks start execution and are released immediately upon task completion. Concurrent Scheduling Mechanism: multiple tasks can be scheduled simultaneously on different nodes at the same time step. Non-Schedulable Filtering Mechanism: tasks whose resource requirements exceed the maximum capacity of any node are filtered out in advance and excluded from the scheduling process.
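The non-schedulable filtering rule of Eq. (35) can be sketched as follows, reusing the dictionary-style job and node records assumed in the earlier graph-construction sketch; the function names are illustrative.

```python
def is_unschedulable(job, nodes):
    """Eq. (35): a job cannot run if any resource demand exceeds the maximum
    capacity available on every node in the cluster."""
    return (job['cpu'] > max(n['cpu'] for n in nodes) or
            job['mem'] > max(n['mem'] for n in nodes) or
            job['gpu'] > max(n['gpu'] for n in nodes))

def split_workload(jobs, nodes):
    """Filter unschedulable jobs before simulation; they are still counted in statistics."""
    schedulable = [j for j in jobs if not is_unschedulable(j, nodes)]
    rejected = [j for j in jobs if is_unschedulable(j, nodes)]
    return schedulable, rejected
```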
4.2. Comparative Algorithms
To evaluate the effectiveness of the proposed GNN–Actor–Critic scheduling algorithm in high-performance computing cluster resource allocation scenarios, several representative baseline methods are selected for comparison, including First-Come-First-Served, Shortest Job First, Backfilling, Heuristic Resource Fit (HRF), and a Proximal Policy Optimization (PPO)-based scheduling algorithm. A multi-dimensional and multi-scenario experimental framework is constructed to enhance the comprehensiveness and credibility of the evaluation.
Specifically, FCFS schedules tasks strictly according to their arrival order, without considering task execution time or resource requirements. SJF prioritizes tasks with shorter estimated execution times. The Backfilling strategy attempts to fill idle resource slots with smaller executable tasks while ensuring that the head-of-queue task is not delayed. FCFS, SJF, and Backfilling are widely used classical HPC scheduling strategies [22]. HRF selects the compute node whose available resources best match the task’s resource demands, thereby reducing resource fragmentation and improving allocation efficiency.
The PPO-based scheduling algorithm models the HPC scheduling process as a sequential decision-making problem. It learns task–node matching strategies through a policy network, which outputs a probability distribution over candidate scheduling actions based on the current task queue state and cluster resource availability [23]. During training, the policy update magnitude is constrained by clipping the optimization objective, which improves learning stability and accelerates convergence.
For the PPO-based scheduler, the policy and value networks both adopt a two-layer MLP with 128 hidden units. The learning rate is set to , the discount factor is 0.99, and the clipping parameter is 0.2.
4.3. Experimental Design
In this research, the GNN encoder consists of two graph convolution layers, each with a hidden dimension of 128. Nonlinear activation functions are applied after each layer to enhance representation capacity. This lightweight architecture balances expressive power and computational efficiency for large-scale scheduling scenarios.
For the Actor–Critic framework, both the actor and critic networks are implemented as multilayer perceptrons with two hidden layers, each containing 128 neurons. The networks are trained using the Adam optimizer with a learning rate of . The discount factor is set to 0.99.
The model is trained for 500 episodes with a batch size of 64. These hyperparameters are selected based on empirical validation and prior reinforcement learning studies, and the same settings are used for all experiments to ensure fair comparisons.
To systematically evaluate the performance of the proposed GNN–Actor–Critic scheduling algorithm under different operating conditions, this study constructs a multi-scenario comparative experimental framework along three dimensions: task scale, system load intensity, and optimization objectives. This design enables a comprehensive assessment of the effectiveness and robustness of the proposed scheduling strategy.
With respect to task scale, three representative scheduling scenarios are constructed, including a small-scale scenario with approximately 500 tasks, a medium-scale scenario with approximately 2000 tasks, and a large-scale scenario with approximately 6000 tasks. By comparing the scheduling performance across different task scales, the scalability and scheduling stability of the proposed method are evaluated as the task volume gradually increases.
To analyze the performance under different load levels, three scheduling scenarios are designed by adjusting the job arrival rate: low load, medium load, and high load. These load levels mainly differ in the relationship between the job arrival rate and the overall system processing capacity.
Under low load conditions, the task arrival rate is significantly lower than the cluster processing capacity. Cluster resources remain relatively abundant, allowing tasks to be scheduled and executed quickly with short waiting times.
In the medium-load scenario, the task arrival rate approaches the system throughput capacity. As a result, task queues start to form, and performance differences among scheduling strategies gradually become more evident.
In the high-load scenario, the task arrival rate exceeds the system’s processing capacity, leading to persistent task accumulation and highly constrained resources. Under such conditions, the robustness of the scheduling algorithm and the effectiveness of its resource allocation strategy play a critical role in determining overall system performance.
To further investigate the effectiveness of each component in the proposed model, a structured ablation study is conducted. The following three variants are designed for comparison: the full GNN–Actor–Critic model; an MLP–Actor–Critic model without graph-structured modeling; a GNN–Greedy model without reinforcement learning training.
Through these ablation settings, both the individual contributions and the synergistic effects of the graph neural network and the Actor–Critic reinforcement learning modules on the overall scheduling performance are systematically analyzed.
4.4. Performance Metrics
To comprehensively evaluate performance differences among scheduling algorithms in high-performance computing clusters, this study constructs a multi-dimensional evaluation metric system. Key indicators include average waiting time, average turnaround time, average slowdown, task completion rate, and resource utilization.
4.4.1. Average Waiting Time
Task waiting time measures the delay between task submission and the actual start of execution, and it serves as an important indicator of the responsiveness of a scheduling strategy. For a task j, the waiting time is defined as:
$$T_j^{\mathrm{wait}} = T_j^{\mathrm{start}} - T_j^{\mathrm{submit}} \qquad (36)$$
where $T_j^{\mathrm{submit}}$ denotes the submission time of task j, and $T_j^{\mathrm{start}}$ represents its execution start time.
The average waiting time is calculated as:
$$\overline{T}_{\mathrm{wait}} = \frac{1}{N} \sum_{j=1}^{N} T_j^{\mathrm{wait}} \qquad (37)$$
where N denotes the total number of successfully completed tasks.
A lower average waiting time indicates that the scheduler can respond to task requests more promptly, thereby improving the system's responsiveness and interactivity.
4.4.2. Average Turnaround Time
Task turnaround time reflects the total duration from task submission to task completion, comprehensively capturing both waiting time and execution time. For a task j, the turnaround time is defined as:
$$T_j^{\mathrm{turn}} = T_j^{\mathrm{finish}} - T_j^{\mathrm{submit}} \qquad (38)$$
where $T_j^{\mathrm{finish}}$ denotes the task completion time.
The average turnaround time is calculated as:
$$\overline{T}_{\mathrm{turn}} = \frac{1}{N} \sum_{j=1}^{N} T_j^{\mathrm{turn}} \qquad (39)$$
where N represents the total number of successfully completed tasks.
A lower average turnaround time indicates higher efficiency of the scheduling system in processing tasks and reflects better overall resource utilization and system responsiveness.
4.4.3. Average Slowdown
Slowdown is a critical performance metric in high-performance computing scheduling, which measures the additional delay introduced by queuing and scheduling. It is defined as the ratio between the task turnaround time and the actual execution time.
For a task j, the slowdown is defined as:
$$\mathrm{slowdown}_j = \frac{T_j^{\mathrm{turn}}}{T_j^{\mathrm{run}}} \qquad (40)$$
where $T_j^{\mathrm{turn}}$ denotes the turnaround time of task j, and $T_j^{\mathrm{run}}$ denotes its actual execution time.
The average slowdown is calculated as:
$$\overline{\mathrm{slowdown}} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{slowdown}_j \qquad (41)$$
where N denotes the total number of successfully completed tasks.
A lower average slowdown indicates that the scheduling strategy achieves a better balance between short tasks and long tasks, reflecting improved fairness and overall scheduling efficiency.
4.4.4. Task Completion Rate
Under high-load conditions, some tasks may fail to meet the cluster’s resource constraints and thus cannot be executed. To quantify the system’s task processing capability under such conditions, the task completion rate is introduced, defined as:
$$\mathrm{CompletionRate} = \frac{N_{\mathrm{completed}}}{N_{\mathrm{total}}} \qquad (42)$$
where $N_{\mathrm{completed}}$ denotes the number of successfully completed tasks, and $N_{\mathrm{total}}$ denotes the total number of tasks in the workload.
This metric reflects the stability and throughput capability of the scheduling system under high resource pressure and extreme load conditions.
4.4.5. Resource Utilization
Resource utilization reflects the efficiency of computational resource usage within the cluster. In this study, a weighted combination of CPU utilization and GPU utilization is used to characterize the overall system resource consumption.
The CPU and GPU utilization rates are calculated as follows:
$$U_{\mathrm{CPU}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{UsedCPU}(t)}{\mathrm{TotalCPU}} \qquad (43)$$
$$U_{\mathrm{GPU}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{UsedGPU}(t)}{\mathrm{TotalGPU}} \qquad (44)$$
where $\mathrm{UsedCPU}(t)$ denotes the number of CPU cores occupied by running tasks at time t, $\mathrm{TotalCPU}$ represents the total number of available CPU cores in the cluster, and T denotes the total duration of the entire scheduling process; the GPU quantities are defined analogously.
The overall resource utilization is then computed as a weighted combination of CPU and GPU utilization:
$$U = w_{\mathrm{CPU}} \cdot U_{\mathrm{CPU}} + w_{\mathrm{GPU}} \cdot U_{\mathrm{GPU}} \qquad (45)$$
$$w_{\mathrm{CPU}} + w_{\mathrm{GPU}} = 1 \qquad (46)$$
The weighting coefficients are determined according to the relative proportion of CPU and GPU resources in the system, ensuring that the composite utilization metric accurately reflects the actual hardware configuration characteristics. In this study, $w_{\mathrm{CPU}}$ and $w_{\mathrm{GPU}}$ are fixed according to this proportion.
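For reference, the metrics of Eqs. (36)–(46) can be computed from a finished simulation run as sketched below; the record fields and the utilization weights shown are illustrative placeholders, not the values used in the experiments.

```python
def scheduling_metrics(completed, total_jobs, cpu_util, gpu_util, w_cpu=0.7, w_gpu=0.3):
    """Aggregate metrics of Section 4.4 for one finished run.

    `completed` holds per-job 'submit', 'start', 'finish' timestamps; `cpu_util`
    and `gpu_util` are the time-averaged utilizations of Eqs. (43)-(44). The
    weights w_cpu and w_gpu are placeholders.
    """
    n = max(len(completed), 1)
    wait = [j['start'] - j['submit'] for j in completed]           # Eq. (36)
    turn = [j['finish'] - j['submit'] for j in completed]          # Eq. (38)
    slow = [t / max(t - w, 1e-6) for t, w in zip(turn, wait)]      # Eq. (40)
    return {
        "avg_wait": sum(wait) / n,                                 # Eq. (37)
        "avg_turnaround": sum(turn) / n,                           # Eq. (39)
        "avg_slowdown": sum(slow) / n,                             # Eq. (41)
        "completion_rate": len(completed) / total_jobs,            # Eq. (42)
        "resource_util": w_cpu * cpu_util + w_gpu * gpu_util,      # Eqs. (45)-(46)
    }
```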
4.5. Results and Analysis
4.5.1. Scheduling Performance Across Task Scales
Comparative experiments were conducted on the FCFS, SJF, Backfilling, HRF, PPO-based scheduler, and the proposed GNN–Actor–Critic scheduling algorithm under small-scale, medium-scale, and large-scale task load scenarios. The experimental results under different task scales are presented in Table 1, Table 2 and Table 3.
As shown in Table 1, under small-scale task scenarios, all scheduling algorithms are able to complete most tasks within a relatively short period. However, the proposed GNN–Actor–Critic scheduling algorithm demonstrates a clear advantage in terms of average waiting time, average turnaround time, and average slowdown. Compared with FCFS, the average waiting time is reduced by approximately 60%, and compared with the PPO scheduler, a further reduction of about 23% is achieved. These results indicate that even under relatively small-scale workloads, the integration of structured state modeling and reinforcement learning-based policy optimization can significantly enhance scheduling performance.
As shown in Table 2, under medium-scale task scenarios, distinct scheduling algorithms exhibit significant differences across multiple performance metrics. Traditional FCFS and SJF strategies perform poorly in terms of average waiting time and slowdown, as their scheduling efficiency becomes increasingly constrained with task accumulation.
Backfilling and HRF alleviate resource fragmentation to some extent, thereby improving system turnaround time and reducing average slowdown. In comparison, the reinforcement learning-based PPO scheduler further enhances task response performance, reducing the average slowdown to 2.97.
The proposed GNN–Actor–Critic scheduling algorithm achieves the best overall performance, obtaining the lowest average waiting time (121.9 s) and average slowdown (2.43) while maintaining a high task completion rate of 98.9%. These results demonstrate that the proposed method significantly improves system response efficiency under medium-scale workloads while preserving stable and reliable task throughput.
As shown in Table 3, performance disparities among scheduling algorithms become more pronounced under large-scale task workloads. Both FCFS and SJF exhibit average slowdown values exceeding 5.0, accompanied by noticeable declines in task completion rates. This indicates that rule-based scheduling strategies struggle to effectively handle severe resource contention under high-load, large-scale task conditions.
Although Backfilling and HRF alleviate queue congestion to some extent, their overall scheduling performance remains significantly inferior to learning-based approaches. In contrast, both the PPO scheduler and the proposed GNN–Actor–Critic method maintain relatively low waiting times and slowdown values in large-scale scenarios.
Notably, the GNN–Actor–Critic scheduling algorithm achieves the best performance across all four evaluation metrics, demonstrating that incorporating graph-structured state modeling enables reinforcement learning strategies to exploit task–node structural relationships more effectively. As a result, it can sustain robust scheduling performance even under highly complex and large-scale workloads.
Comparative results across multiple performance metrics under different task scales are illustrated in Figure 5.
In summary, the GNN–Actor–Critic algorithm proposed in this study maintains a high level of performance across task scenarios of varying scales. It effectively captures system state characteristics under changing task scales and makes more rational scheduling decisions.
4.5.2. Scheduling Performance Under Different Load Levels
By adjusting the task arrival rate parameter, three typical scheduling scenarios were constructed: low load (one task injected every 20 s), medium load (one task injected every 10 s), and high load (one task injected every 3 s). The experimental results under different system load intensities are presented in Table 4, Table 5 and Table 6.
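For illustration, the workload injection for these scenarios can be sketched as follows, assuming fixed inter-arrival gaps as described above; the task demand distributions and names are placeholders rather than the exact experimental configuration.

```python
import random

# Sketch of workload generation for the three load scenarios (fixed gaps).
LOAD_SCENARIOS = {"low": 20.0, "medium": 10.0, "high": 3.0}  # seconds between arrivals

def generate_arrivals(interval_s: float, num_tasks: int) -> list[float]:
    """Arrival times for num_tasks tasks injected every interval_s seconds."""
    return [i * interval_s for i in range(num_tasks)]

def sample_task() -> dict:
    """Toy task with random CPU/GPU demands (placeholder distributions)."""
    return {"cpus": random.choice([1, 2, 4, 8]),
            "gpus": random.choice([0, 1, 2]),
            "runtime_s": random.uniform(60, 600)}

workload = [{"arrival": t, **sample_task()}
            for t in generate_arrivals(LOAD_SCENARIOS["high"], num_tasks=500)]
```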
As shown in Table 4, when system resources are relatively abundant, all scheduling algorithms perform adequately in task scheduling, with relatively limited overall performance differences. The proposed GNN–Actor–Critic scheduling algorithm maintains optimal performance across metrics including average waiting time, average turnaround time, and average slowdown. This demonstrates that incorporating graph structure modeling and reinforcement learning strategy optimization can further enhance scheduling efficiency, even under low-load conditions.
Under medium-load scenarios, as the task arrival rate approaches the system’s processing capacity, the performance gap between scheduling algorithms begins to widen. As shown in Table 5, both FCFS and SJF exhibit significant increases in average waiting time and slowdown metrics. Although Backfilling and HRF alleviate resource fragmentation to a certain extent, their overall scheduling effectiveness remains constrained by their rule-based nature.
In contrast, the reinforcement learning-based PPO Scheduler and the proposed GNN–Actor–Critic algorithm demonstrate more stable and robust task response performance. Among all methods, GNN–Actor–Critic achieves the best results, with an average waiting time of 134.2 s and an average slowdown of 2.61, outperforming all baseline algorithms and exhibiting stronger scheduling optimization capability under moderate load conditions.
Under high-load conditions, system resources become highly constrained, leading to increasingly pronounced performance disparities among different scheduling algorithms. As shown in Table 6, both FCFS and SJF exhibit average slowdown values exceeding 5.0, accompanied by a significant decline in task completion rates, which highlights the inherent limitations of traditional rule-based scheduling strategies in high-pressure environments. Although Backfilling and HRF can partially mitigate task queue congestion, their overall performance remains inferior to learning-based scheduling methods in terms of task responsiveness and completion efficiency.
In contrast, the proposed GNN–Actor–Critic scheduling algorithm maintains comparatively low average waiting time and slowdown while achieving a high task completion rate of 98.0%. These results demonstrate its superior robustness and scheduling stability under high-load conditions.
The comparative results of various scheduling algorithms across multiple performance metrics under different system load scenarios are illustrated in Figure 6.
In summary, the GNN–Actor–Critic scheduling algorithm demonstrates robust task responsiveness and stable resource allocation across diverse load scenarios, exhibiting excellent adaptability to complex environments.
4.5.3. Resource Utilization Comparison Across Algorithms
Based on the comprehensive resource utilization metrics outlined earlier, we evaluated resource consumption across algorithms within a high-performance computing cluster. Given that medium-load scenarios more closely approximate the routine operational state of actual HPC clusters, this study selected experimental results from medium-load scenarios for detailed analysis. The findings are presented in Table 7:
It can be observed that as the task arrival rate approaches the system processing capacity, disparities in resource allocation efficiency among different scheduling algorithms gradually emerge. Among traditional rule-based scheduling algorithms, FCFS and SJF exhibit relatively low CPU and GPU utilization, and they struggle to fully exploit the cluster resource potential when task resource demands become increasingly diverse.
In contrast, Backfilling and HRF partially alleviate resource fragmentation by introducing task interleaving mechanisms and explicit resource matching evaluations. Furthermore, the reinforcement learning-based PPO Scheduler effectively improves system resource efficiency by learning allocation strategies from historical scheduling experiences, achieving an overall resource utilization rate of 72.7%.
Building upon this foundation, the proposed GNN–Actor–Critic scheduling algorithm further integrates task–node graph modeling and global state representation, enabling a more accurate characterization of the complex relationships between cluster resource states and task demands. This facilitates more fine-grained coordination of heterogeneous resource allocation, effectively reducing resource idleness and fragmentation. Consequently, the overall system resource utilization is significantly improved to 78.8%.
4.5.4. Ablation Experiment Results
To further analyze the role of each module in the proposed method, three model variants were designed for the ablation experiments: (i) Baseline, the full model combining the GNN state encoder with the Actor–Critic policy; (ii) Abl-1, which removes the reinforcement learning module and retains the GNN encoder with a greedy scheduling policy (GNN + Greedy); and (iii) Abl-2, which removes graph structure modeling by replacing the GNN encoder with an MLP (MLP + Actor–Critic).
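The three variants can be summarized, purely for illustration, as a small configuration map; the flag names below are ours and do not reflect the actual implementation interface.

```python
# Illustrative configuration of the three ablation variants.
ABLATION_VARIANTS = {
    "Baseline": {"state_encoder": "gnn", "policy": "actor_critic"},  # full model
    "Abl-1":    {"state_encoder": "gnn", "policy": "greedy"},        # RL module removed
    "Abl-2":    {"state_encoder": "mlp", "policy": "actor_critic"},  # graph modeling removed
}
```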
The ablation experiment results are presented in Table 8:
The experimental results indicate that the Multi-Layer Perceptron (MLP) + Actor–Critic model significantly underperforms compared to the full model across all evaluation metrics. This suggests that purely vectorized feature representations are insufficient to capture the inherent structural information of high-performance computing scheduling environments, which are characterized by high-dimensional resource states and complex task–node associations.
The GNN + Greedy model demonstrates certain improvements over the MLP + Actor–Critic model in terms of average waiting time and slowdown, indicating that graph structure modeling contributes to enhanced single-step decision quality. However, due to the lack of long-term optimization capability, both simplified models suffer from performance degradation in multi-step scheduling processes, as reflected by higher slowdown values and lower task completion rates compared to the full model.
Overall, these results demonstrate that the graph neural network and Actor–Critic reinforcement learning complement each other within this scheduling framework: the GNN captures structural information in the task–node bipartite graph, while the Actor–Critic component optimizes the global scheduling strategy through multi-step return maximization. Both components are indispensable, and together they account for the superior performance of the proposed intelligent scheduling algorithm.
5. Conclusions
Resource scheduling in high-performance computing clusters plays a fundamental role in determining overall system efficiency and resource utilization, and it remains a highly challenging problem in large-scale heterogeneous computing environments. This work investigated an intelligent scheduling framework for HPC clusters based on graph neural networks and Actor–Critic deep reinforcement learning.
Firstly, motivated by the strong coupling between task characteristics and node resource states, this study constructed a task–node bipartite graph state representation, where pending jobs and compute nodes are modeled as two types of graph vertices, and schedulable relationships are represented through edges. By introducing structured relational modeling into the scheduling process, the proposed representation significantly enhances the expressiveness of system state modeling and effectively reduces the complexity of the decision space compared with traditional flat state representations.
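A minimal sketch of this bipartite construction is given below, assuming a feasibility rule in which an edge links a job to a node only when the node's free CPU and GPU resources cover the job's request; the feature layouts and names are illustrative rather than the paper's exact definitions.

```python
import numpy as np

# Sketch of the job-node bipartite state: an edge connects job j to node n
# only if n's free CPUs/GPUs cover j's request.
def build_bipartite_graph(jobs, nodes):
    """jobs: dicts with 'cpus'/'gpus'/'est_runtime'; nodes: dicts with
    'free_cpus'/'free_gpus'. Returns (job_feats, node_feats, adjacency),
    where adjacency[j, n] = 1 marks a schedulable job-node pair."""
    job_feats = np.array([[j["cpus"], j["gpus"], j["est_runtime"]] for j in jobs],
                         dtype=np.float32)
    node_feats = np.array([[n["free_cpus"], n["free_gpus"]] for n in nodes],
                          dtype=np.float32)
    adjacency = np.zeros((len(jobs), len(nodes)), dtype=np.float32)
    for ji, j in enumerate(jobs):
        for ni, n in enumerate(nodes):
            if j["cpus"] <= n["free_cpus"] and j["gpus"] <= n["free_gpus"]:
                adjacency[ji, ni] = 1.0
    return job_feats, node_feats, adjacency
```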
Secondly, an end-to-end scheduling model integrating a GNN-based state encoder and an Actor–Critic reinforcement learning architecture was designed. The GNN encoder captures both local and global resource competition patterns, while the Actor–Critic network performs multi-step scheduling policy optimization by jointly learning action selection and value estimation. This design enables the proposed method to adaptively learn scheduling strategies under dynamic and complex resource conditions.
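The following is a simplified, illustrative PyTorch sketch of such an architecture; the layer sizes, the mean-aggregation message passing, and the edge-masking scheme are assumptions made here and are not claimed to match the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Simplified bipartite GNN encoder with Actor-Critic heads (illustrative only).
class GNNActorCritic(nn.Module):
    def __init__(self, job_dim=3, node_dim=2, hidden=64):
        super().__init__()
        self.job_in = nn.Linear(job_dim, hidden)
        self.node_in = nn.Linear(node_dim, hidden)
        self.job_upd = nn.Linear(2 * hidden, hidden)   # job <- aggregated node messages
        self.node_upd = nn.Linear(2 * hidden, hidden)  # node <- aggregated job messages
        self.actor = nn.Linear(2 * hidden, 1)          # scores one (job, node) pair
        self.critic = nn.Linear(2 * hidden, 1)         # state value from pooled embeddings

    def forward(self, job_x, node_x, adj):
        # adj: [J, N] binary schedulability matrix (1 = job fits on node).
        h_j = torch.relu(self.job_in(job_x))
        h_n = torch.relu(self.node_in(node_x))
        deg_j = adj.sum(1, keepdim=True).clamp(min=1)
        deg_n = adj.sum(0, keepdim=True).clamp(min=1).T
        m_j = adj @ h_n / deg_j                        # mean of neighbouring node embeddings
        m_n = adj.T @ h_j / deg_n                      # mean of neighbouring job embeddings
        h_j = torch.relu(self.job_upd(torch.cat([h_j, m_j], dim=-1)))
        h_n = torch.relu(self.node_upd(torch.cat([h_n, m_n], dim=-1)))
        # Actor: logits for every (job, node) pair, with infeasible pairs masked out.
        pair = torch.cat([h_j.unsqueeze(1).expand(-1, h_n.size(0), -1),
                          h_n.unsqueeze(0).expand(h_j.size(0), -1, -1)], dim=-1)
        logits = self.actor(pair).squeeze(-1).masked_fill(adj == 0, float("-inf"))
        # Critic: scalar value from mean-pooled job and node embeddings.
        value = self.critic(torch.cat([h_j.mean(0), h_n.mean(0)], dim=-1))
        return logits, value
```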
Extensive experimental results compared with classical scheduling algorithms and the PPO Scheduler demonstrate that the proposed GNN–Actor–Critic scheduler consistently achieves superior performance in terms of average waiting time, average slowdown, task completion rate, and overall resource utilization, especially under high-load and large-scale task scenarios. These results indicate that the proposed method exhibits strong scalability, robustness, and adaptability in heterogeneous HPC environments.
From a practical deployment perspective, several factors should be considered. First, the proposed scheduling framework relies on historical cluster monitoring data for offline training, which is commonly available in modern HPC systems. The required training data consist of job-level resource usage and scheduling logs, without the need for additional instrumentation.
Second, regarding online retraining, the model can be periodically updated using newly collected workload data in an offline or asynchronous manner, avoiding interference with online scheduling decisions. Incremental retraining at coarse time scales (e.g., daily or weekly) is sufficient to adapt to workload evolution, as scheduling policies typically change slowly over time.
Third, in terms of scalability, the proposed GNN–Actor–Critic architecture is lightweight and operates on sparse job–node graphs, enabling efficient inference even in large-scale clusters. As discussed in the complexity and latency analysis, the per-decision overhead remains within practical limits, supporting deployment in production-scale environments.
In future work, we plan to optimize the graph construction and inference efficiency, incorporate real cluster topology information and system noise factors, and deploy the proposed scheduling framework in real HPC environments to further validate its practical applicability.
Conceptualization, X.B. and Z.W.; Methodology, X.B.; Software, J.Z.; Validation, X.B. and J.Z.; Formal analysis, X.B.; Investigation, X.B.; Resources, Z.W.; Data curation, J.Z.; Writing—original draft, X.B.; Writing—review and editing, Z.W.; Visualization, J.Z.; Supervision, Z.W.; Project administration, Z.W. All authors have read and agreed to the published version of the manuscript.
The data are available from the corresponding author upon reasonable request.
The authors would like to thank the HPC Center of Changchun Normal University for providing computing resources and administrative support.
The authors declare no conflicts of interest.
The following abbreviations are used in this manuscript:
| HPC | High-Performance Computing |
| GNN | Graph Neural Network |
| FCFS | First-Come, First-Served |
| SJF | Shortest Job First |
| DRL | Deep Reinforcement Learning |
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 GNN computational flowchart.
Figure 2 Actor–Critic model framework diagram.
Figure 3 Example of a task–node bipartite graph.
Figure 4 GNN–Actor–Critic scheduling framework.
Figure 5 Performance comparison results: (a) average waiting time; (b) average task slowdown rate; (c) task completion rate; (d) average task turnaround time.
Figure 6 Performance comparison of different scheduling algorithms under varying system load levels. (a) Average waiting time; (b) average task slowdown factor; (c) task completion rate; (d) average task turnaround time.
Table 1. Performance comparison of different scheduling algorithms in small-scale task scenarios.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 214.3 | 864.7 | 3.79 | 93.6 |
| SJF | 176.8 | 802.1 | 3.32 | 94.2 |
| Backfilling | 142.5 | 731.4 | 2.98 | 96.1 |
| HRF | 135.7 | 708.9 | 2.81 | 96.8 |
| PPO Scheduler | 109.4 | 652.3 | 2.55 | 98.0 |
| GNN–Actor–Critic | 83.6 | 598.2 | 2.17 | 99.1 |
Table 2. Performance comparison of different scheduling algorithms under medium-scale task scenarios.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 352.7 | 1480.5 | 4.82 | 92.4 |
| SJF | 287.9 | 1326.2 | 4.25 | 93.1 |
| Backfilling | 210.6 | 1145.8 | 3.68 | 95.6 |
| HRF | 198.4 | 1092.7 | 3.41 | 96.3 |
| PPO Scheduler | 156.2 | 962.5 | 2.97 | 97.8 |
| GNN–Actor–Critic | 121.9 | 873.6 | 2.43 | 98.9 |
Table 3. Performance comparison of different scheduling algorithms in large-scale task scenarios.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 521.6 | 2143.8 | 5.63 | 90.7 |
| SJF | 447.2 | 1986.4 | 5.08 | 91.9 |
| Backfilling | 362.9 | 1794.5 | 4.37 | 94.0 |
| HRF | 329.5 | 1712.8 | 4.02 | 95.1 |
| PPO Scheduler | 248.7 | 1498.3 | 3.36 | 97.0 |
| GNN–Actor–Critic | 187.4 | 1263.9 | 2.91 | 98.4 |
Table 4. Performance comparison of different scheduling algorithms under the low-load scenario.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 178.4 | 905.6 | 2.87 | 99.2 |
| SJF | 152.7 | 861.3 | 2.61 | 99.3 |
| Backfilling | 129.5 | 824.7 | 2.39 | 99.5 |
| HRF | 121.8 | 801.9 | 2.26 | 99.6 |
| PPO Scheduler | 103.6 | 759.2 | 2.08 | 99.7 |
| GNN–Actor–Critic | 82.4 | 714.5 | 1.81 | 99.8 |
Table 5. Performance comparison of different scheduling algorithms under medium-load scenarios.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 312.9 | 1387.4 | 4.35 | 96.8 |
| SJF | 268.3 | 1315.2 | 3.98 | 97.2 |
| Backfilling | 221.7 | 1204.6 | 3.54 | 97.9 |
| HRF | 207.4 | 1158.3 | 3.36 | 98.2 |
| PPO Scheduler | 167.8 | 1047.1 | 3.02 | 98.8 |
| GNN–Actor–Critic | 134.2 | 968.5 | 2.61 | 99.1 |
Table 6. Performance comparison of different scheduling algorithms under high-load scenarios.
| Algorithm | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| FCFS | 541.3 | 2198.6 | 5.92 | 91.0 |
| SJF | 486.7 | 2074.3 | 5.41 | 92.1 |
| Backfilling | 398.5 | 1887.9 | 4.76 | 94.3 |
| HRF | 362.4 | 1795.7 | 4.39 | 95.2 |
| PPO Scheduler | 282.6 | 1598.4 | 3.78 | 97.0 |
| GNN–Actor–Critic | 231.7 | 1426.8 | 3.29 | 98.0 |
Table 7. Comparison of resource utilization rates across different scheduling algorithms.
| Algorithm | CPU Utilization (%) | GPU Utilization (%) | Resource Utilization (%) |
|---|---|---|---|
| FCFS | 56.8 | 49.5 | 53.3 |
| SJF | 60.2 | 53.1 | 57.4 |
| Backfilling | 66.7 | 59.3 | 63.6 |
| HRF | 69.8 | 62.7 | 66.8 |
| PPO Scheduler | 75.1 | 68.2 | 72.7 |
| GNN–Actor–Critic | 81.3 | 74.9 | 78.8 |
Table 8. Performance comparison of different model architectures in ablation experiments.
| Model Architecture | AvgWait (s) | AvgTurnaround (s) | AvgSlowdown | Completion Rate (%) |
|---|---|---|---|---|
| Abl-1 | 156.9 | 1035.7 | 2.87 | 98.7 |
| Abl-2 | 182.6 | 1112.3 | 3.12 | 98.3 |
| Baseline | 134.2 | 968.5 | 2.61 | 99.1 |
1. Lu, P.J.; Lai, M.C.; Chang, J.S. A survey of high-performance interconnection networks in high-performance computer systems. Electronics; 2022; 11, 1369. [DOI: https://dx.doi.org/10.3390/electronics11091369]
2. Netto, M.A.S.; Cunha, R.L.F.; Sultanum, N. Deciding when and how to move HPC jobs to the cloud. Computer; 2015; 48, pp. 86-89. [DOI: https://dx.doi.org/10.1109/MC.2015.351]
3. Caballer, M.; Antonacci, M.; Šustr, Z.; Perniola, M.; Moltó, G. Deployment of elastic virtual hybrid clusters across cloud sites. J. Grid Comput.; 2021; 19, 4. [DOI: https://dx.doi.org/10.1007/s10723-021-09543-5]
4. Rajasekaran, S.; Ghobadi, M.; Akella, A. CASSINI: Network-aware job scheduling in machine learning clusters. Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI); Boston, MA, USA, 17–19 April 2023.
5. Zhou, W.; Zhang, J.; Sun, J.; Sun, G. Using small-scale history data to predict large-scale performance of HPC application. Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); New Orleans, LA, USA, 18–22 May 2020; pp. 787-795.
6. Tanash, M.; Dunn, B.; Andresen, D.; Hsu, W.; Yang, H.; Okanlawon, A. Improving HPC system performance by predicting job resources via supervised machine learning. Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning); Chicago, IL, USA, 28 July–1 August 2019; pp. 1-8.
7. Han, L. Resource scheduling algorithm for heterogeneous high-performance clusters. Proceedings of the Intelligent Networked Things: 5th China Conference, CINT 2022; Urumqi, China, 7–8 August 2022; Springer Nature: Singapore, 2023; pp. 307-321.
8. Abualigah, L.; Alkhrabsheh, M. Amended hybrid multi-verse optimizer with genetic algorithm for solving task scheduling problem in cloud computing. J. Supercomput.; 2022; 78, pp. 740-765. [DOI: https://dx.doi.org/10.1007/s11227-021-03915-0]
9. Mahmoud, H.; Thabet, M.; Khafagy, M.H.; Omara, F.A. An efficient load balancing technique for task scheduling in heterogeneous cloud environment. Clust. Comput.; 2021; 24, pp. 3405-3419. [DOI: https://dx.doi.org/10.1007/s10586-021-03334-z]
10. Wei, J.; Chen, M.; Wang, L.; Ren, P.; Lei, Y.; Qu, Y.; Jiang, Q.; Dong, X.; Wu, W.; Wang, Q.
11. Mariappan, U.; Balakrishnan, D.; Karunya, T.B.; Ponraj, A.; Abhishekh, P.; Devi, R.K. High-performance computing for accurate weather forecast: Leveraging advanced algorithms and HPC systems for improved predictions. Proceedings of the 2024 3rd International Conference for Advancement in Technology (ICONAT); GOA, India, 6–8 September 2024; IEEE: Piscataway, NJ, USA, pp. 1-6.
12. Li, J.; Michelogiannakis, G.; Maloney, S.; Cook, B.; Suarez, E.; Shalf, J.; Chen, Y. Job scheduling in high performance computing systems with disaggregated memory resources. Proceedings of the 2024 IEEE International Conference on Cluster Computing (CLUSTER); Kobe, Japan, 24–27 September 2024; IEEE: Piscataway, NJ, USA, pp. 297-309.
13. Menear, K.; Konate, K.; Potter, K.; Duplyakin, D. Tandem predictions for HPC jobs. Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing; Providence, RI, USA, 21–25 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1-9.
14. Boroumand, A.; Hosseini Shirvani, M.; Motameni, H. A heuristic task scheduling algorithm in cloud computing environment: An overall cost minimization approach. Clust. Comput.; 2025; 28, 137. [DOI: https://dx.doi.org/10.1007/s10586-024-04843-3]
15. Balis, B.; Lelek, T.; Bodera, J.; Grabowski, M.; Grigoras, C. Improving prediction of computational job execution times with machine learning. Concurr. Comput. Pract. Exp.; 2024; 36, e7905. [DOI: https://dx.doi.org/10.1002/cpe.7905]
16. Ramachandran, S.; Jayalal, M.L.; Vasudevan, M.; Das, S.; Jehadeesan, R. Combining machine learning techniques and genetic algorithm for predicting run times of high performance computing jobs. Appl. Soft Comput.; 2024; 165, 112053. [DOI: https://dx.doi.org/10.1016/j.asoc.2024.112053]
17. He, M.; Fu, Z.; Tian, W. Optimization of fault tolerance for iterative graph algorithm in Spark GraphX based on high performance computing cluster. CCF Trans. High Perform. Comput.; 2025; 7, pp. 465-477. [DOI: https://dx.doi.org/10.1007/s42514-025-00225-2]
18. Taghizadeh, J.; Ghobaei-Arani, M.; Shahidinejad, A. A metaheuristic-based data replica placement approach for data-intensive IoT applications in the fog computing environment. Softw. Pract. Exp.; 2022; 52, pp. 482-505. [DOI: https://dx.doi.org/10.1002/spe.3032]
19. Beaumont, O.; Lambert, T.; Marchal, L.; Thomas, B. Performance analysis and optimality results for data-locality-aware tasks scheduling with replicated inputs. Future Gener. Comput. Syst.; 2020; 111, pp. 582-598. [DOI: https://dx.doi.org/10.1016/j.future.2019.08.024]
20. Li, X.; Chen, F.; Ruiz, R.; Zhu, J. MapReduce task scheduling in heterogeneous geo-distributed data centers. IEEE Trans. Serv. Comput.; 2021; 15, pp. 3317-3329. [DOI: https://dx.doi.org/10.1109/TSC.2021.3092563]
21. Okanlawon, A.; Yang, H.; Bose, A.; Hsu, W.; Andresen, D.; Tanash, M. Feature selection for learning to predict outcomes of compute cluster jobs with application to decision support. Proceedings of the 2020 International Conference on Computational Science and Computational Intelligence (CSCI); Las Vegas, NV, USA, 16–18 December 2020; IEEE: Piscataway, NJ, USA, pp. 1231-1236.
22. Feitelson, D.G.; Rudolph, L. Toward convergence in job schedulers for parallel supercomputers. Lect. Notes Comput. Sci.; 1996; 1162, pp. 1-26.
23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv; 2017; [DOI: https://dx.doi.org/10.48550/arXiv.1707.06347] arXiv: 1707.06347
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.