Introduction

Oil production from mature reservoirs continues to be an essential component of the global energy supply, yet sustaining viable production rates from these fields presents significant engineering and economic challenges¹. Artificial lift technologies, particularly gas lift systems, have become indispensable for enhancing the productivity of wells where reservoir pressure alone is insufficient to drive fluids to the surface. Gas lift involves the controlled injection of high-pressure gas into the wellbore to reduce the density of produced fluids, decrease hydrostatic pressure, and improve oil flow. Its operational versatility, adaptability to different well types, and relative simplicity make it a preferred lift method in a broad range of environments, from the Niger Delta to the fields of Iraq and the North Sea^{2, 3, 4–5}.

Despite widespread adoption, accurately predicting the performance of gas-lifted wells remains an open challenge for engineers and field operators. Traditional modeling approaches are typically divided into mechanistic and empirical methods. Mechanistic models derive predictions from first principles, utilizing nodal analysis and coupling reservoir performance with wellbore performance to estimate production rates^{6, 7, 8, 9–10}. While this approach provides physical insight and is broadly applicable across well types, it can become unwieldy due to the complex multiphase flow regimes and interactions characteristic of gas-lifted systems. Moreover, many mechanistic models require detailed input parameters and make simplifying assumptions that limit their accuracy and field applicability, especially under changing conditions in mature wells^{11, 12, 13–14}.

Empirical models, on the other hand, are developed by fitting mathematical relationships directly to observed data. These models are quicker to implement and can yield accurate results for the scenarios and datasets on which they are trained. However, their reliability often diminishes when extrapolated beyond the range of underlying data or when applied to wells with different characteristics than the source population^{15, 16–17}. As oil fields continue to age and well configurations diversify, both mechanistic and empirical modeling approaches face persistent challenges in consistently predicting production rates and optimizing gas lift operations^18,19.

Recent technological advances have led to the increasing digitalization of the oil and gas industry, introducing new opportunities for leveraging large field datasets and deploying advanced data-driven analytical approaches. Multiphase flow measurement and metering, for example, are essential for assessing well performance and economics²⁰. Yet, conventional multiphase flow meters and virtual flow metering systems can be costly and complicated to calibrate and maintain, especially amid fluctuating flow regimes and complex well geometries. Data-driven virtual metering methods, using statistical or machine learning models, are emerging as attractive alternatives that complement or even replace physical metering in certain contexts^21,22.

Driven by rapid progress in artificial intelligence and machine learning (ML), a new generation of predictive models is now being developed to address the limitations of classical techniques. Machine learning is particularly well-suited to modeling the highly nonlinear and interdependent nature of gas-lifted well performance^{23, 24–25}. These models can learn intricate patterns linking operational variables to oil production without requiring the underlying physical equations to be specified explicitly. By training on extensive, validated field data, machine learning algorithms can capture subtleties that conventional models might miss, facilitating more accurate production forecasting, identifying optimization opportunities, and supporting timely interventions^{26, 27}. Machine learning has already been applied successfully to a variety of problems in the oil and gas industry. For example, Youcefi et al.²⁸ developed data-driven models to estimate carbon dioxide solubility in various brines. Alatefi et al.²⁹ used explainable artificial intelligence techniques to build models that predict the heat capacity of deep eutectic solvents. Amar et al.³⁰ presented intelligent modeling methods to estimate mercury solubility in natural gas components. Amar et al.³¹ estimated methane solubility in brine using robust data-driven frameworks. Amar et al.³² proposed data-driven frameworks to predict wax disappearance temperature.

The application of machine learning to choke flow rate prediction has garnered considerable interest, with studies reporting successful use of decision trees, random forests, support vector regression, neural networks, and boosting-based algorithms for tasks such as production rate estimation, tubing pressure prediction, and even well integrity risk assessment³³. These models often outperform traditional approaches in predictive accuracy, especially when data are abundant and well distributed. However, most studies in the literature address choke oil flow rate for wells operated under natural flow conditions (that is, without artificial lift)^{34, 35}. This paper aims to fill that gap by providing a workflow for constructing data-driven models that predict oil rates of wells under gas lift.

Predicting and optimizing oil rates of wells under gas lift systems are critical for advancing oil production, which plays a vital role in energy sustainability. Increasing oil recovery from gas lift wells remains a major priority for the petroleum industry, motivating ongoing research into data-driven approaches for production forecasting. The central problem addressed in this research is the difficulty of accurately predicting oil production rates in gas lift operations due to the multifaceted dependencies among operational and reservoir parameters. The hypothesis of this work is that modern machine learning models trained on high-quality, field-validated data can capture complex nonlinear relationships and significantly improve the accuracy of production rate predictions compared to traditional methods. The workflow for this research, illustrated in Fig. 1, involved (1) gathering and rigorously validating a representative dataset from Iraqi oil fields, (2) preprocessing and normalizing the data to ensure consistency, (3) training and optimizing several machine learning models using grid search and 5-fold cross-validation, (4) systematically evaluating model performance with multiple statistical metrics, and (5) interpreting model predictions and feature importance through sensitivity analysis and SHAP values to provide actionable insights for field operators.

Fig. 1 Overall workflow of the study. [Image not available. See PDF.]

Machine learning methods background

Ensemble learning

The proposed method employs an advanced ensemble learning framework, which merges the predictions of multiple, diverse machine learning algorithms to achieve improved predictive performance and model robustness. Ensemble learning has gained high importance recently due to its capacity to exploit the varying strengths and methodological perspectives of individual learners. Instead of relying on a single algorithm that may be prone to overfitting or strong inductive bias, ensemble approaches aggregate the outputs of multiple models, thereby minimizing the likelihood of systematic errors and increasing reliability.

In this study, the ensemble is constructed by integrating several well-established algorithms, namely K-Nearest Neighbors (KNN), Decision Trees, Random Forests, and Adaptive Boosting (AdaBoost). Each constituent model contributes unique advantages to the ensemble. KNN offers a non-parametric approach that adapts to the underlying local structure of the data, making it effective for complex, nonlinear relationships. Decision Trees are known for their simplicity and interpretability, forming hierarchical structures that facilitate the extraction of decision rules. Random Forests, as an extension of Decision Trees, mitigate overfitting and increase predictive stability by aggregating the outputs from an ensemble of trees trained on different bootstrapped samples and randomly selected feature subsets.

Furthermore, AdaBoost introduces a sequential training mechanism in which weak learners are trained iteratively, and greater emphasis is placed on instances that were previously misclassified. This adaptive re-weighting strategy enables the overall model to focus learning on harder cases, thus systematically reducing both bias and variance in the final prediction. The distinct methodologies of these algorithms result in varied error patterns; when their outputs are combined, the ensemble is able to neutralize the individual weaknesses inherent in each model, leading to more accurate and robust predictions.

The scientific basis for the enhanced performance of ensemble models rests on their ability to reduce both the variance and bias that are characteristic limitations of single learners. Bagging methods, exemplified by Random Forests, primarily operate by lowering variance through averaging predictions across multiple models, thereby stabilizing the outputs and enhancing generalizability. Boosting methods, as implemented in AdaBoost, address both bias and variance by guiding the sequential focus of weak learners toward difficult-to-classify data points. The aggregate effect is a model that is less sensitive to noise and random fluctuations in the training data, and better equipped to handle new, unseen samples.

Overall, the ensemble strategy, which incorporates KNN, Decision Trees, Random Forests, and AdaBoost, exemplifies a robust and scientifically grounded approach to machine learning. By integrating the outputs of these complementary algorithms, the ensemble can achieve predictive accuracies that consistently outperform those attainable by any single constituent model. A schematic illustration of the integrated ensemble architecture is depicted in Fig. 2, elucidating the manner in which the predictions from diverse base learners are synthesized into a unified output. This approach underscores the value of leveraging algorithmic diversity to create models that are both reliable and generalizable across complex datasets^{36, 37–38}.

Fig. 2 Ensemble learning flowchart utilized in this study. [Image not available. See PDF.]
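
The aggregation step of an ensemble can be illustrated with a minimal sketch. It assumes that the base learners are already trained and that each one can be called as a plain function. The unweighted average and the four toy predictors are illustrative assumptions and are not the aggregation rule or the fitted models of this study.

```python
from statistics import mean


def ensemble_predict(base_models, x):
    """Average the predictions of several base models for one input x."""
    return mean(model(x) for model in base_models)


# Hypothetical stand-ins for trained base learners (e.g. KNN, tree, forest, AdaBoost).
# Each one makes a different error around the same underlying trend 2 * x.
base_models = [
    lambda x: 2.0 * x + 1.0,   # over-predicts
    lambda x: 2.0 * x - 1.0,   # under-predicts
    lambda x: 2.0 * x + 0.5,
    lambda x: 2.0 * x - 0.5,
]

prediction = ensemble_predict(base_models, 3.0)
print(prediction)  # individual errors of opposite sign cancel in the average
```

Because the toy learners err in opposite directions, the average lands on the underlying trend, which is the effect the text describes when it says that varied error patterns can neutralize each other.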

AdaBoost

The method employed in this study is Adaptive Boosting (AdaBoost), a robust and widely utilized ensemble learning algorithm that demonstrates efficacy in both classification and regression tasks. AdaBoost constructs a strong predictive model by sequentially combining multiple weak learners, each of which is only slightly better than random guessing. The central mechanism of AdaBoost is its adaptive reweighting process, through which it dynamically focuses attention on the more challenging training samples, thereby improving ensemble accuracy in successive iterations.

At the outset, every data point in the training set is assigned an equal weight. Specifically, for a training set containing N samples, the initial weight w_i^(1) of each sample i is defined as

$$w_i^{(1)} = \frac{1}{N}, \qquad i = 1, \dots, N$$

Within each iteration t, AdaBoost trains a weak classifier, denoted as h_t(x), to minimize the total weighted classification error over all training examples. The weighted error rate, ϵ_t, for the weak learner at iteration t is computed by

$$\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \, I\big(y_i \neq h_t(x_i)\big)$$

where w_i^(t) is the weight of instance i at the t-th iteration, y_i denotes the true label, h_t(x_i) represents the weak classifier’s prediction for the i-th sample, and I is an indicator function that returns 1 when its argument is true (meaning the sample is misclassified) and 0 otherwise. To determine how much influence the weak learner should have on the final prediction, AdaBoost assigns it a weight, α_t, which is calculated as

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

This equation ensures that weak classifiers with lower error rates receive higher weights, thereby exerting greater influence on the aggregated prediction. Once α_t is obtained, the algorithm updates the weights associated with each data point. This updating process increases the weights of those samples that were misclassified in the current round, thus compelling the next weak learner to focus more intensely on these harder examples. The update rule for the weights is given by

$$w_i^{(t+1)} = w_i^{(t)} \exp\big(-\alpha_t \, y_i \, h_t(x_i)\big)$$

where the true label y_i and the weak learner’s prediction h_t(x_i) are typically encoded as +1 or −1. If an instance is misclassified, this rule increases its associated weight for the next iteration. After updating, the weights are normalized so that they sum to one across the entire training set, maintaining a valid probability distribution.

The boosting process continues for a predetermined number of iterations T or until satisfactory accuracy is achieved. The final ensemble prediction for a new input x is determined through a weighted majority vote (for classification), which aggregates the predictions of all weak learners as follows:

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)$$

In this equation, H(x) denotes the predicted class label, and the sign function maps the combined, weighted votes to a binary outcome. Through this strategy of adaptive weighting and aggregation, AdaBoost capitalizes on the strengths of multiple weak learners, resulting in a final model that exhibits both increased accuracy and resilience to overfitting compared to any single weak classifier^{39, 40–41}.
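
The boosting loop described above can be sketched in a few lines. The sketch follows the binary classification form given in the text (labels encoded as +1 or −1) and uses one-feature decision stumps as weak learners. The stump definition, the toy data, and the choice of three rounds are assumptions made for illustration, not the configuration used in this study.

```python
import math


def stump_predict(threshold, polarity, x):
    """Decision stump on a scalar feature: returns +1 or -1."""
    return polarity if x > threshold else -polarity


def fit_adaboost(xs, ys, n_rounds):
    """Discrete AdaBoost with stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n                       # equal initial weights
    sorted_x = sorted(set(xs))
    thresholds = [sorted_x[0] - 1.0] + [
        (a + b) / 2.0 for a, b in zip(sorted_x, sorted_x[1:])
    ]
    ensemble = []                                 # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # Choose the stump with the lowest weighted error.
        best = None
        for thr in thresholds:
            for pol in (1, -1):
                err = sum(w for w, x, y in zip(weights, xs, ys)
                          if stump_predict(thr, pol, x) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Increase weights of misclassified samples, then normalize.
        weights = [w * math.exp(-alpha * y * stump_predict(thr, pol, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
model = fit_adaboost(xs, ys, n_rounds=3)
print([predict_adaboost(model, x) for x in xs])
```

No single threshold separates this toy data, yet the weighted vote of three stumps reproduces all six labels, which illustrates how reweighting steers later learners toward the samples that earlier ones got wrong.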

Decision tree

A decision tree is a supervised learning algorithm widely used for both classification and regression owing to its intuitive interpretability and flexible handling of various data types. The underlying principle involves recursively partitioning the data into subsets by evaluating specific features and split points, thereby constructing a tree-like structure. This recursive partitioning continues until the terminal nodes, or leaves, are reached. Each internal node of the tree represents a decision based on a feature and a corresponding condition, while each branch represents an outcome of that condition.

At each stage of partitioning, the algorithm aims to create child nodes that are as homogeneous as possible, which in the context of classification means that a single class dominates the samples in each leaf node. In regression tasks, the objective is to minimize the prediction error, typically measured by metrics such as mean squared error.

For classification tasks, the splitting criterion at each node typically involves either maximizing Information Gain (IG) or minimizing the Gini Impurity. Information Gain is based on the concept of entropy, H(S), which measures the uncertainty or impurity in a dataset S. The entropy at a node is calculated as:

$$H(S) = -\sum_{k=1}^{K} P_k \log_2 P_k$$

where P_k denotes the proportion of samples in S that belong to class k, and K is the number of classes. A perfectly homogeneous node, where all samples belong to a single class, will have an entropy of zero.

Information Gain quantifies the reduction in entropy achieved by partitioning the data according to a specific attribute. For a parent node that splits into a set of child nodes, the Information Gain (IG) from a split is computed as:

$$IG = H(\text{parent}) - \sum_{j=1}^{M} \omega_j \, H(\text{child}_j)$$

Here, H(parent) denotes the parent node’s entropy, H(child_j) is the entropy of the j-th child node, ω_j is the proportion of samples in the j-th child node relative to the parent, and M is the total number of child nodes. The optimal split is the one that maximizes information gain, i.e., that reduces entropy the most.

Alternatively, the Gini Impurity is another metric frequently employed to evaluate the quality of a split. For a node, the Gini Impurity is calculated as:

$$Gini(S) = 1 - \sum_{k=1}^{K} P_k^2$$

where, again, P_k represents the proportion of samples belonging to class k. A node is considered pure (impurity of zero) if all contained samples belong to a single class. During the tree-building process, the algorithm selects the split that minimizes the Gini Impurity, striving to produce the purest possible child nodes.

In regression settings, instead of entropy or Gini impurity, the decision tree algorithm seeks to minimize metrics such as Mean Squared Error (MSE) across the splits. The mean squared error for a node containing N samples, with observed values y_i and mean prediction ȳ, is calculated as:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2$$

where y_i represents the observed value of the i-th sample, and ȳ denotes the mean of the observed values of all samples in the node.

Through recursive partitioning using these splitting criteria the decision tree algorithm efficiently organizes data into a series of nested, interpretable decisions. This process ultimately produces a model where each path from the root to a leaf node represents a unique series of rules for predicting a target value or class^{42, 43, 44–45}.
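
The splitting criteria discussed in this section can be computed directly from the labels or target values in a node. The following is a minimal sketch using only the Python standard library. The function names and the toy labels are hypothetical and serve only to illustrate the definitions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def node_mse(values):
    """Mean squared deviation of the target values from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


parent = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]
print(entropy(parent), gini(parent), information_gain(parent, perfect_split))
print(node_mse([1.0, 2.0, 3.0]))
```

For the balanced two-class parent, the entropy is 1 bit and the Gini impurity is 0.5. The split into two pure children removes all uncertainty, so the information gain equals the parent entropy.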

Random forest

Random Forest is a supervised machine learning technique belonging to the ensemble learning family. This method constructs a collection of decision trees, each trained on a unique, randomly sampled subset of the data, and aggregates their results to provide a final prediction. For classification tasks, the output is determined by a majority voting scheme, while for regression problems, the prediction is based on the average of all individual tree outputs.

The fundamental strength of Random Forest arises from its application of two principal techniques: bootstrap aggregating (bagging) and the random subspace method. Bagging generates multiple, diverse training sets by sampling from the original dataset with replacement. For a dataset with N samples, each subset created for an individual tree is also of size N, but may contain repeated observations due to sampling with replacement. Suppose the original labeled dataset is D = {(X₁, Y₁), (X₂, Y₂), …, (X_N, Y_N)}; each tree is then trained on a bootstrap sample D_b drawn randomly from D^{46, 47}.

In addition to bagging, the random subspace method introduces further diversity by selecting a random subset of features to consider at each candidate split in the tree construction process. If there are M total features, only m features (m < M) are randomly chosen at each split, over which the best split is determined. This reduces the correlation between individual trees and thus enhances the ensemble’s robustness.

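For classification tasks, the final prediction for an input x is determined as follows:

$$\hat{y}(x) = \operatorname{mode}\{h_1(x), h_2(x), \dots, h_B(x)\}$$

where h_b(x) denotes the class predicted by the tree trained on bootstrap sample D_b and B is the number of trees in the forest.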

The effectiveness of each individual decision tree in the forest depends on its internal split criteria, often based on metrics such as Gini Impurity or Entropy, which are respectively defined as

$$Gini(S) = 1 - \sum_{i=1}^{K} p_i^2, \qquad H(S) = -\sum_{i=1}^{K} p_i \log_2 p_i$$

where p_i is the proportion of samples belonging to class i in node S, and K is the number of classes. These metrics measure the purity of the nodes and guide the choice of optimal splits, but Random Forest does not rely on a single global objective function for its final outcome. Instead, its predictive power is derived from the statistical aggregation of numerous weak, decorrelated learners.

A key advantage of the Random Forest algorithm is its capacity to significantly decrease variance compared to a single decision tree, thereby reducing the likelihood of overfitting and improving its ability to generalize. This ensemble approach typically results in increased accuracy and robustness with respect to noise and outliers in the training data.

While Random Forests are fundamentally defined by the aggregation process, additional post-hoc analyses, such as the calculation of feature importance, are often performed. These analyses leverage measures such as the Mean Decrease in Impurity (MDI) or the Mean Decrease in Accuracy (MDA), which provide insights into how much each feature contributes to the predictive power of the model. However, these feature importance metrics are ancillary to the core predictive mechanism of the Random Forest algorithm^{48, 49, 50, 51, 52–53}.
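
The two sources of randomness and the two aggregation rules described in this section can be sketched with the standard library alone. The helper names, the seed, and the toy inputs are hypothetical. Tree fitting itself is omitted, so this is an outline of the sampling and voting steps rather than a full Random Forest.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Bagging: draw len(dataset) rows with replacement."""
    return [rng.choice(dataset) for _ in dataset]


def random_feature_subset(n_features, m, rng):
    """Random subspace: pick m of the n_features column indices, no repeats."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(tree_predictions):
    """Classification: the most common label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_vote(tree_predictions):
    """Regression: the mean of the individual tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)                    # fixed seed, chosen arbitrarily
rows = list(range(10))                    # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)      # same size, may repeat rows
features = random_feature_subset(9, 3, rng)

print(sample, features)
print(majority_vote(["high", "low", "high"]), average_vote([1.0, 2.0, 3.0]))
```

Each tree would be fitted on its own `sample` and would consider a fresh `features` subset at every split, which is what decorrelates the trees before their outputs are aggregated.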

Support vector regression

SVR is a robust algorithm tailored for regression tasks, developed as an extension of the SVM framework. In contrast to conventional regression models that aim to minimize the deviation between predicted and actual values, SVR formulates regression as an optimization problem. The goal is to determine a function that maintains the greatest possible flatness while fitting most targets within a specified margin of tolerance, known as the epsilon (ε) margin. This approach ensures that only those observations lying outside the margin significantly influence the model, thereby increasing its resilience to the effects of outliers.

The regression function in SVR is defined as

$$f(x) = \langle \omega, \phi(x) \rangle + b$$

where f(x) is the predicted value for the input vector x, ω is the weight vector, ϕ(x) represents a mapping of the input data into a higher-dimensional feature space, and b is the bias term. The transformation ϕ(x) allows the algorithm to capture nonlinear patterns in the data by implicitly projecting inputs into a space where linear regression becomes feasible.

To optimize the SVR model, the algorithm seeks to minimize the norm of the weight vector, promoting model flatness, while allowing violations of the epsilon margin through the introduction of slack variables. The corresponding convex optimization problem can be expressed as

$$\min_{\omega, b, \xi, \xi^*} \; \frac{1}{2} \lVert \omega \rVert^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$

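subject to the constraints

$$y_i - f(x_i) \le \varepsilon + \xi_i, \qquad f(x_i) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, N$$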

where y_i is the true target value corresponding to input x_i, N denotes the total number of training samples, ε is the margin of tolerance, and ξ_i, ξ_i* are slack variables introduced to allow some observations to lie outside the epsilon-tube. The parameter C > 0 serves as a regularization constant, balancing the trade-off between the model’s flatness and the degree to which deviations beyond ε are permitted.

A crucial feature of SVR is the use of kernel functions, which supply the mapping ϕ(x) implicitly, without explicitly computing the high-dimensional transformation. The most widely used kernels include the linear kernel, polynomial kernel, and radial basis function (RBF) kernel. The kernel function is defined as

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$

For example, the RBF kernel is given by

$$K(x_i, x_j) = \exp\big(-\gamma \lVert x_i - x_j \rVert^2\big)$$

where γ is a hyperparameter controlling the width of the Gaussian function, and ∥x_i−x_j∥ is the Euclidean distance between samples x_i and x_j.

Overall, SVR constructs a regression function that is optimally flat in the high-dimensional feature space and focuses on those training samples (support vectors) that fall outside the epsilon margin. By leveraging the kernel trick, SVR effectively models complex nonlinear relationships in data, while its loss function and regularization mechanism confer enhanced generalization performance and robustness against anomalous observations^{54, 55–56}.
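
Two building blocks of SVR, the RBF kernel and the epsilon-insensitive error that the slack variables measure, can be written out in a short sketch. The function names and the numerical values are hypothetical and only illustrate the definitions. Solving the full optimization problem is left out.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-tube, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Identical points have kernel value 1; the value decays with distance.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))

# A residual of 0.05 lies inside a tube of half-width 0.1 and costs nothing;
# a residual of 0.5 is penalized only for the part beyond the tube.
print(epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1))
print(epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1))
```

The second loss value corresponds to the slack that the optimization problem would assign to that sample, while the first sample would need no slack at all.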

Convolutional neural networks

CNNs represent a distinguished subclass of deep neural architectures meticulously engineered to process and extract features from structured data types, including images, audio signals, and video frames, which naturally exhibit a grid-like topology. The fundamental operation characterizing CNNs is the convolution, a mathematical operation that systematically applies a set of learnable filters across the input data to produce feature maps. Given an input matrix X and a convolutional kernel or filter K, the two-dimensional convolution operation for a single element S(i, j) in the resulting feature map can be expressed as

$$S(i, j) = (X * K)(i, j) = \sum_{m} \sum_{n} X(i - m, j - n) \, K(m, n)$$

where X(i, j) refers to the pixel value at spatial coordinates (i, j) of the input, K(m, n) is the kernel coefficient at position (m, n), and ∗ denotes the convolution operation. This process allows the network to learn localized spatial features, such as edges and textures, by sliding the filters over the input domain.

After the application of convolutional layers, CNNs typically incorporate pooling layers, which execute a form of nonlinear downsampling. The primary objective of pooling is to reduce the spatial dimensions of the feature maps, thus decreasing computational burden and introducing invariance to minor translations or deformations in the input. A common pooling operation is max pooling, defined as

$$P(i, j) = \max_{(m, n) \in R_{i,j}} F(m, n)$$

where F denotes the input feature map and R_{i,j} represents the pooling region associated with spatial location (i, j). This operation retains the maximum value within each local region, which preserves prominent features while discarding less informative activations.

As the hierarchical feature extraction progresses through stacked convolutional and pooling layers, the extracted feature maps encode increasingly abstract, high-level representations of the input. These high-level features are subsequently processed by one or more fully connected layers (dense layers), which compute the final output of the network. The operation in a fully connected layer can be written as

$$y = f(Wx + b)$$

where x is the input feature vector, W is the weight matrix, b is the bias vector, and f denotes the nonlinear activation function (such as ReLU or softmax, depending on the task). The fully connected layers serve as classifiers or regressors, mapping the learned representations to the desired output, such as category labels in classification or real values in regression.

CNNs demonstrate exceptional performance on perceptual tasks due to their ability to learn hierarchical, spatially invariant representations. The initial layers in a CNN are tailored to capture simple and local patterns, such as oriented edges or color gradients, while deeper layers are adept at spanning larger receptive fields, thereby identifying complex structures and semantic content present in the input data. This progressive abstraction makes CNNs particularly suitable for challenging applications in computer vision, including image recognition, object detection, and semantic segmentation. By combining localized feature extraction, parameter sharing, and hierarchical composition, CNNs achieve both computational efficiency and high accuracy in structured data analysis^{57, 58–59}.
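
The convolution and max pooling operations can be made concrete with a small pure-Python sketch. It assumes stride 1 and no padding for the convolution and non-overlapping square regions for the pooling. These settings, the function names, and the toy matrices are illustrative assumptions. The kernel is flipped, as in the mathematical definition of convolution, whereas many deep learning libraries compute the unflipped cross-correlation under the same name.

```python
def conv2d_valid(x, k):
    """2D convolution (kernel flipped), stride 1, no padding."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                x[i + kh - 1 - m][j + kw - 1 - n] * k[m][n]
                for m in range(kh)
                for n in range(kw)
            )
    return out


def max_pool2d(f, size):
    """Max pooling over non-overlapping size-by-size regions."""
    return [
        [max(f[r][c] for r in range(i, i + size) for c in range(j, j + size))
         for j in range(0, len(f[0]) - size + 1, size)]
        for i in range(0, len(f) - size + 1, size)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel)

grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = max_pool2d(grid, 2)
print(feature_map, pooled)
```

The toy image increases by a constant amount along its diagonal, so this difference kernel produces the same response at every position. The pooling step keeps only the largest value from each 2 by 2 block of `grid`.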

Multilayer perceptron-artificial neural network

MLP-ANN is a foundational model in the domain of ANNs, designed to approximate complex nonlinear functions in both classification and regression tasks. As a subclass of feedforward neural networks, the MLP-ANN draws inspiration from biological nervous systems, wherein artificial neurons are organized into multiple interconnected layers. The architecture of an MLP-ANN consists of an input layer, one or more hidden layers, and an output layer. Each layer comprises a set of artificial neurons that communicate with subsequent layers through directed, weighted connections.

In an MLP-ANN, the computation within a single artificial neuron in layer l involves the aggregation of inputs from the previous layer (l − 1), weighted by synaptic strengths and biased by a dedicated parameter. This transformation can be formalized as

$$z_j^{(l)} = \sum_{i=1}^{n_{l-1}} w_{ji}^{(l)} \, a_i^{(l-1)} + b_j^{(l)}$$

where z_j^(l) is the pre-activation value of neuron j in the l-th layer, w_ji^(l) is the weight of the connection from neuron i in layer l − 1 to neuron j in layer l, a_i^(l−1) represents the activation of neuron i in the previous layer, b_j^(l) is the bias for neuron j in layer l, and n_{l−1} is the number of neurons in layer l − 1.

Following this linear combination, a non-linear activation function f(⋅) is applied to obtain the neuron’s output activation:

$$a_j^{(l)} = f\big(z_j^{(l)}\big)$$

Common activation functions in MLP-ANNs include the sigmoid, hyperbolic tangent (tanh), or rectified linear unit (ReLU), each enabling the network to capture non-linear and intricate relationships inherent in the data.

The fundamental task in training an MLP-ANN is to find the set of weights and biases that minimizes the discrepancy between the model’s predictions and the true target values. This discrepancy is quantified by a loss function. For instance, in a regression context, the MSE loss function is commonly employed, defined as

$$L = \frac{1}{N} \sum_{k=1}^{N} (y_k - \hat{y}_k)^2$$

where L is the average loss, y_k is the target value for the k-th example, ŷ_k is the predicted output of the network, and N is the total number of training examples.

Optimization of the MLP-ANN is typically achieved through the backpropagation algorithm in combination with gradient-based methods such as stochastic gradient descent (SGD). During backpropagation, the gradient of the loss function with respect to each weight is computed recursively from the output layer back to the input layer. Weights are then updated according to

$$w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} - \eta \, \frac{\partial L}{\partial w_{ji}^{(l)}}$$

Here, w_ji^(l) denotes the current value of the weight and η is the learning rate, which determines the magnitude of the update. This cycle of forward propagation, loss evaluation, backpropagation, and weight adjustment is repeated over multiple training epochs. As the network iteratively refines its weights and biases, the MLP-ANN progressively reduces its prediction error on the training data while acquiring the ability to generalize to new, unseen data. The ability of the MLP-ANN to approximate arbitrary continuous functions underpins its widespread application across domains such as signal processing, pattern recognition, and predictive modeling^{60, 61–62}.
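
A single forward pass and a single gradient step can be traced by hand in a small sketch. The network size, the weights, the ReLU and identity activations, and the learning rate are all hypothetical values chosen so that the arithmetic is easy to follow. Only one output weight is updated, so this is not a full backpropagation implementation.

```python
def dense_forward(a_prev, weights, biases, activation):
    """Weighted sum of the previous activations plus bias, then activation."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + b
         for row, b in zip(weights, biases)]
    return [activation(v) for v in z]


def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def mse(targets, preds):
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)


# Forward pass: 2 inputs -> 2 hidden neurons (ReLU) -> 1 linear output.
x = [1.0, 2.0]
hidden = dense_forward(x, [[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0], relu)
out_weights, out_bias = [1.0, 0.5], 0.1
pred = dense_forward(hidden, [out_weights], [out_bias], identity)[0]
target = 1.0
loss = mse([target], [pred])

# Gradient of the single-sample MSE w.r.t. the weight from hidden neuron 2:
# dL/dw = 2 * (pred - target) * activation feeding that weight.
grad = 2.0 * (pred - target) * hidden[1]
learning_rate = 0.1
new_w = out_weights[1] - learning_rate * grad

# Re-run the output layer with the updated weight.
new_pred = dense_forward(hidden, [[out_weights[0], new_w]], [out_bias], identity)[0]
new_loss = mse([target], [new_pred])
print(loss, new_loss)
```

The updated weight moves the prediction closer to the target, so the loss after the step is smaller than before, which is the behaviour the update rule is designed to produce for a suitably small learning rate.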

Lasso regression

Lasso regression is a regularized linear regression method that enhances both the predictive accuracy and interpretability of statistical models, particularly in high-dimensional data scenarios. The essential innovation in Lasso Regression is the inclusion of an L1 regularization penalty within the loss function. This technique enables simultaneous coefficient shrinkage and automatic variable selection, distinctly setting it apart from traditional regression approaches.

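The objective function optimized by the Lasso Regression model augments the ordinary least squares (OLS) cost function with the sum of the absolute values of the coefficients. Mathematically, this is expressed as

$$\hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i \beta)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}$$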

In this formulation, y_i denotes the target variable for the i-th instance, x_i is a p-dimensional row vector of predictors corresponding to that instance, β = [β₁, β₂, …, β_p]ᵀ is the vector of regression coefficients, n is the number of observations, and p is the number of predictors. The regularization parameter λ ≥ 0 determines the strength of the penalty applied to the model.

The first term,

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i \beta)^2$$

represents the empirical MSE of the model on the training data. This loss quantifies the discrepancy between observed targets and the model’s predictions. The second term,

$$\lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

is the L₁ penalty, which is proportional to the sum of the absolute values of the regression coefficients. This penalty has the effect of both regularizing the model and, crucially, driving some of the β_j coefficients exactly to zero. As a direct result, Lasso inherently performs feature selection by retaining only the most relevant variables within the model.

The amount of regularization in Lasso Regression is controlled by the hyperparameter λ. A larger value of λ increases the penalty for nonzero coefficients, resulting in a sparser model with more coefficients set to zero. Conversely, a smaller λ reduces the regularization effect, making the model approach standard least squares estimation (Eq. 2 without the penalty). The optimal value of λ is typically determined using cross-validation to ensure the best balance between model simplicity and predictive accuracy.

Lasso Regression is especially beneficial in settings where interpretability and feature reduction are important. In fields such as bioinformatics, Lasso can identify key genetic markers from thousands of potential predictors. In econometrics and finance, it assists in pinpointing the most influential indicators from large datasets. In engineering, Lasso is used for sparse signal reconstruction and for reducing model complexity in signal processing tasks.

Despite its strengths, Lasso Regression also faces limitations. When predictor variables exhibit high collinearity, Lasso may arbitrarily select one variable from a group of correlated variables, making feature selection less stable. Additionally, careful and computationally intensive tuning of the regularization parameter λ is required for optimal performance. In cases where predictors are highly correlated, alternative regularization techniques such as Ridge Regression or Elastic Net may offer improved prediction and stability^{63, 64, 65–66}.
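
The way the L₁ penalty sets coefficients exactly to zero can be seen in a short sketch based on coordinate descent with soft-thresholding. This solver is a common choice but is an assumption here, since the text does not state which optimizer was used. The sketch takes the objective to be the mean squared error plus λ times the sum of absolute coefficients, and the toy design matrix is hypothetical. It also assumes that no predictor column is entirely zero.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 if |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/n) * sum (y_i - x_i beta)^2 + lam * sum |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual that leaves predictor j out of the fit.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam / 2.0) / scale
    return beta


# Toy data: the first predictor has a strong effect, the second a weak one.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]

beta_ols = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=0.2)
print(beta_ols, beta_lasso)
```

With λ = 0 the sketch returns the least squares fit, in which both coefficients are nonzero. With λ = 0.2 the weak coefficient is set exactly to zero and the strong one is shrunk, which mirrors the feature selection behaviour described above.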

Data description and model evaluation

Data set gathering and quality evaluation

The dataset employed in this study was compiled from operational oil fields located in the south-eastern region of Iraq. Its primary purpose is to facilitate the prediction of oil production rates from wells utilizing gas lift systems. The dataset encompasses a comprehensive range of input parameters that are known to influence gas lift performance and well productivity. Specifically, the input features include: basic sediment and water content (BS&W, %), choke size (1/64 in), upstream pressure (psig), surface gas injection pressure (psig), gas lift valve (GLV) depth (m), gas injection rate (MMscf/d), live oil density (characterized by a combination of API gravity and gas–oil ratio), as well as downstream pressure (psia) and temperature (°C).

All data points are derived from real field measurements and have undergone rigorous validation procedures to ensure accuracy and reliability. Each recorded data point was only included in the dataset if the associated production parameters remained stable and consistent for a minimum duration of ten consecutive days. This stringent data selection criterion minimizes the influence of short-term fluctuations and operational anomalies, thereby increasing the overall reliability of the dataset for machine learning modeling purposes.

The final dataset comprises 169 samples, which were randomly partitioned into training and testing subsets. Approximately 90% of the data (152 samples) was allocated to the training phase to develop the predictive model, while the remaining 10% (17 samples) was reserved for independent testing to assess model performance and generalization capability.
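The 90/10 partition can be sketched as a shuffled index split. This is only an illustration: the seed and the function name are hypothetical, and the study does not state how the random partition was generated.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.9, seed=42):
    """Shuffle sample indices and cut them into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(169)
print(len(train_idx), len(test_idx))  # 152 17
```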

Figure 3 provides a detailed statistical analysis of the distribution of each input parameter within the dataset. The figure is comprised of a series of ten subplots, each corresponding to one of the measured parameters: BS&W, choke size, upstream pressure, gas injection pressure at surface, GLV depth, injection rate, live oil density, downstream pressure, and downstream temperature. In each subplot, the x-axis represents the observed values of the respective parameter, while the y-axis denotes the frequency with which each value occurs across the collected data points.

These histograms not only illustrate the range and central tendency of each variable but also provide insights into the distribution characteristics, such as skewness, modality, and the presence of potential outliers. Furthermore, by including both frequency and cumulative frequency curves, the figure enables rapid assessment of population coverage and the relative proportion of data points within specific value ranges. This visualization is critical for understanding the overall structure of the dataset and for identifying any data imbalances or clustering that may influence the performance of subsequent machine learning models.

A thorough examination of Fig. 3 facilitates the identification of potential preprocessing needs, such as normalization or outlier mitigation, particularly in cases where parameter distributions deviate significantly from normality. Ultimately, this comprehensive depiction of the data distribution ensures transparency in dataset characteristics and supports informed decision-making in the modeling process.

Fig. 3 [Images not available. See PDF.]

Statistical distributions of all input parameters in the dataset, with frequency histograms illustrating value ranges and sample coverage.

Figure 4 illustrates the calculated Hat values for each data point in the dataset, facilitating a comprehensive assessment of statistical leverage within the underlying regression model. The Hat value, denoted as h_ii, quantifies the influence or leverage of the i-th observation on the fitted values. It is derived from the diagonal elements of the Hat matrix H, which is computed as follows:

$$H = X \left( X^{T} X \right)^{-1} X^{T}$$

where X represents the matrix of input features (predictors), and h_ii is the i-th diagonal element of H.

In Fig. 4, the x-axis corresponds to the data point index (ranging from 1 to 169), while the y-axis depicts the corresponding Hat values. A conventional threshold for identifying high-leverage points is defined as:

or, for simplicity in this analysis, a fixed cutoff of 0.125 is employed. Data points with Hat values exceeding this limit are regarded as potential outliers, as they exert disproportionate influence on the regression fit and may indicate anomalies or atypical observations.

As indicated in Fig. 4, only 8 data points surpass the specified leverage threshold of 0.125, classifying them as high-leverage points or outliers. The remaining observations exhibit Hat values below the threshold, suggesting that the majority of the dataset constitutes well-behaved, representative samples with acceptable influence on the model. This analysis thus confirms the integrity of the data and indicates minimal distortion from influential outliers^{67, 68–69}.
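The leverage check described above can be sketched with NumPy as follows. The feature matrix here is synthetic (169 rows, 9 columns, no intercept column), so the flagged indices say nothing about the actual field data; only the fixed cutoff of 0.125 is taken from the text.

```python
import numpy as np

def hat_values(X):
    """Diagonal of H = X (X^T X)^{-1} X^T without forming the full n-by-n matrix."""
    # Solve (X^T X) A = X^T, then h_ii = sum_j X[i, j] * A[j, i]
    A = np.linalg.solve(X.T @ X, X.T)
    return np.einsum("ij,ji->i", X, A)

# Synthetic stand-in for the feature matrix (assumption, not the field data)
rng = np.random.default_rng(0)
X = rng.normal(size=(169, 9))
X[0] *= 6.0  # make one observation deliberately atypical

h = hat_values(X)
cutoff = 0.125  # fixed cutoff quoted in the text
high_leverage = np.flatnonzero(h > cutoff)
print("high-leverage indices:", high_leverage)
print("sum of hat values:", round(float(h.sum()), 6))  # equals the number of columns
```

The sum of the Hat values equals the number of columns of X, which is a convenient sanity check on any implementation.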

Fig. 4 [Images not available. See PDF.]

Distribution of Hat values for all data points, illustrating the leverage of each observation within the regression model.

K-fold cross-validation is a well-established statistical methodology widely used to assess the predictive performance and generalizability of machine learning models. In this approach, the available dataset is partitioned into k equal-sized subsets, commonly referred to as “folds.” In each of the k iterations, the model is trained on k − 1 folds and validated on the remaining fold. Rotating the held-out fold across all k iterations ensures that every data point is used for validation exactly once. Upon completion, performance metrics from each fold are averaged, yielding a robust and less biased estimate of model accuracy than a single train–test split can offer.

A key advantage of k-fold cross-validation is its comprehensive utilization of all samples for both training and validation, which substantively reduces bias and variance associated with arbitrary train-test splits, especially crucial in cases of limited data availability. This systematic methodology supports reliable model selection, enables thorough hyperparameter optimization, and offers a more objective evaluation of model generalization capabilities. By rotating the validation set across each iteration, the technique also mitigates the risk of overfitting, thereby promoting the development of more resilient models, particularly when working with small or moderately sized datasets.

The selection of k is a critical factor influencing both the computational burden and the reliability of cross-validation estimates. Typical choices such as k = 5 or k = 10 provide a practical compromise between computational efficiency and statistical stability. Lower values of k may induce higher estimate variance, whereas higher values enhance the reliability of results at increased computational cost. For datasets with imbalanced class distributions, enhancements like stratified k-fold cross-validation are frequently employed; stratification ensures that each fold maintains class proportions representative of the overall dataset, thereby avoiding minority class underrepresentation and enhancing the validity of performance evaluation^{70, 71–72}.

In this study, a 5-fold cross-validation approach (k = 5) was implemented to balance computational efficiency with thorough performance assessment. This method ensures that the model is exposed to diverse data distributions throughout all training and validation cycles, enhancing both robustness and reliability of the resultant performance metrics. Thus, 5-fold cross-validation provides a more accurate and unbiased appraisal of model performance, supporting the development of generalizable predictive models. The general procedure for k-fold cross-validation is depicted in Fig. 5.
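The fold rotation can be sketched in a few lines of plain Python. The sample count of 152 (the training subset size) and the seed are assumptions made for illustration; the text does not state which subset the folds were drawn from.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for fold_no, (train, validation) in enumerate(k_fold_indices(152, k=5), start=1):
    print(fold_no, len(train), len(validation))
```

A model would be fitted on `train` and scored on `validation` in each iteration, and the k scores averaged.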

Fig. 5 [Images not available. See PDF.]

Schematic illustration of the k-fold cross-validation procedure used for model evaluation.

Models’ evaluation methods

To rigorously assess the predictive performance of the developed models, a comprehensive suite of statistical evaluation metrics was calculated for each model, as detailed in subsequent sections. The primary metrics employed include the RE%, AARE%, MSE, and R². In all performance equations, the subscript i denotes an individual data point, while the terms “pred” and “exp” refer to the predicted and experimentally observed (actual) values, respectively^{73, 74, 75, 76–77}.

The symbol n represents the total number of data points within the dataset.
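The four metrics can be sketched in plain Python using their standard textbook forms. The sign convention of RE% (predicted minus observed, divided by observed) is an assumption, since the study's equations are not reproduced in this text, and the sample values are hypothetical.

```python
def relative_error_pct(pred, exp):
    """RE% per data point: 100 * (pred_i - exp_i) / exp_i (sign convention assumed)."""
    return [100.0 * (p - e) / e for p, e in zip(pred, exp)]

def aare_pct(pred, exp):
    """AARE%: mean of the absolute relative errors, in percent."""
    re = relative_error_pct(pred, exp)
    return sum(abs(r) for r in re) / len(re)

def mse(pred, exp):
    """MSE: mean of the squared differences between predictions and observations."""
    return sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(exp)

def r_squared(pred, exp):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean observed value."""
    mean_exp = sum(exp) / len(exp)
    ss_res = sum((e - p) ** 2 for p, e in zip(pred, exp))
    ss_tot = sum((e - mean_exp) ** 2 for e in exp)
    return 1.0 - ss_res / ss_tot

# Hypothetical oil rates, for illustration only
exp = [100.0, 200.0, 400.0]
pred = [110.0, 190.0, 400.0]
print(relative_error_pct(pred, exp))  # [10.0, -5.0, 0.0]
print(aare_pct(pred, exp))            # 5.0
print(round(mse(pred, exp), 2))       # 66.67
print(round(r_squared(pred, exp), 4))
```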

To enhance the reliability of model evaluation and address inherent variations in the data, all input and output variables were normalized prior to the analysis, as formulated below. In this normalization equation, n is the original data value, n_max and n_min denote the maximum and minimum values in the dataset, respectively, and n_norm is the resulting normalized value. This preprocessing step helps to mitigate the impact of scale differences among variables, thereby facilitating more uniform and meaningful comparisons across the dataset and contributing to a more robust model performance assessment.
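A common form of this normalization is min-max scaling to the range [0, 1], sketched below. The target range is an assumption, as the study's equation is not reproduced in this text and some studies scale to [−1, 1] instead; the choke sizes are hypothetical.

```python
def min_max_normalize(values):
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    n_min, n_max = min(values), max(values)
    if n_max == n_min:
        raise ValueError("a constant column cannot be min-max scaled")
    return [(n - n_min) / (n_max - n_min) for n in values]

# Hypothetical choke sizes in 1/64 in, for illustration only
choke = [16, 32, 48, 64]
print(min_max_normalize(choke))
```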

Results and discussion

Sensitivity analysis

The relationship between the target output (oil production rate) and each input parameter was quantitatively assessed using the Pearson correlation coefficient. This statistical metric, denoted as r, measures the strength and direction of linear association between two continuous variables and is mathematically defined as follows:

$$r = \frac{\sum_{i=1}^{n} (I_i - \bar{I})(Z_i - \bar{Z})}{\sqrt{\sum_{i=1}^{n} (I_i - \bar{I})^2 \, \sum_{i=1}^{n} (Z_i - \bar{Z})^2}}$$

where I_i and Z_i represent the individual values of the input and output parameters, Ī and Z̄ are their respective means, and n is the total number of data points.
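The Pearson coefficient can be computed directly from its definition, as in the following sketch. The BS&W and oil-rate values are hypothetical and chosen only to show a negative association.

```python
import math

def pearson_r(inputs, outputs):
    """Pearson correlation between one input parameter and the output."""
    n = len(inputs)
    mean_i = sum(inputs) / n
    mean_z = sum(outputs) / n
    cov = sum((i - mean_i) * (z - mean_z) for i, z in zip(inputs, outputs))
    ss_i = sum((i - mean_i) ** 2 for i in inputs)
    ss_z = sum((z - mean_z) ** 2 for z in outputs)
    return cov / math.sqrt(ss_i * ss_z)

# Hypothetical values, for illustration only
bsw = [10.0, 20.0, 30.0, 40.0]
oil_rate = [900.0, 700.0, 650.0, 400.0]
print(round(pearson_r(bsw, oil_rate), 3))
```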

The analysis (Fig. 6) revealed that choke size exhibits the highest positive correlation with oil production rate, with a Pearson coefficient of 0.5. From a physical perspective, the choke regulates the flow of fluids from the well; increasing choke size reduces the restriction, thus allowing more oil to flow to the surface under a given pressure differential. This positive association aligns with the fundamental principles of fluid dynamics, where larger flow areas, all else being equal, increase volumetric flow rates as described by Bernoulli’s and Darcy’s laws.

Conversely, BS&W (%) demonstrated the most substantial negative correlation with oil production rate, with a Pearson coefficient of −0.7. Elevated BS&W values signify a higher fraction of water and solid contaminants in the produced fluids. From a chemical and operational standpoint, the presence of water and sediments not only reduces the fraction of oil in the production stream but can also impair flow efficiency by contributing to corrosion, scaling, or plugging issues within the wellbore and surface equipment. Consequently, higher BS&W levels are strongly associated with reduced oil output, which is reflected in the negative correlation observed.

These insights underscore the importance of choke management and water control in optimizing oil production. By coupling statistical analysis with an understanding of the underlying physical and chemical phenomena, the results provide both empirical and mechanistic guidance for production optimization^{17, 26, 78}.

Fig. 6 [Images not available. See PDF.]

Relevancy factor calculated through Pearson equations for various input parameters.

Determining the models’ hyperparameters

The optimization of key hyperparameters for each machine learning algorithm was systematically carried out using grid search and cross-validation methods, as illustrated by Figs. 7–13. Figure 7 demonstrates that the optimal value for the maximum depth hyperparameter in the Decision Tree model is 15. This depth was determined to provide an effective balance between model complexity and generalization, maximizing predictive performance while mitigating overfitting.

Figure 8 identifies the optimal number of estimators for the AdaBoost algorithm as 43. The number of estimators corresponds to the total boosting rounds; optimizing this parameter is critical, as too few can lead to underfitting, while too many may increase computational cost without substantial gains in accuracy. For the Random Forest model, Fig. 9 indicates that a maximum tree depth of 10 yields the best results. This setting constrains the individual trees within the ensemble, reducing the likelihood of overfitting by encouraging diversity among trees, thereby enhancing generalization on unseen data.

Figure 10 presents the results for the MLP-ANN, indicating that the optimal iteration depth is 400. This value reflects the point at which the network effectively converges, beyond which additional iterations provide negligible improvement and may introduce overfitting. The training process for the CNN model is shown in Fig. 11, which indicates that an epoch number of 2500 yields the lowest validation error. In neural network training, an epoch is one complete pass through the entire training dataset. Selecting an appropriate number of epochs is crucial: too few may result in underfitting, while excessive training can lead to overfitting. Empirically, 2500 epochs allowed the CNN to capture the underlying data patterns, as evidenced by the performance plateau observed in the figure.

For the SVR model, Fig. 12 displays the tuning of its key hyperparameter, with an optimal value of 400. Lastly, hyperparameter tuning for Lasso regression is illustrated in Fig. 13, where the lowest prediction error is found at an alpha (α) value of 10⁻⁷. The alpha parameter governs the strength of the regularization term, with optimal selection striking a balance between bias and variance, thereby improving model robustness and interpretability. Note that the MLP-ANN, CNN, SVR, Random Forest, AdaBoost, Lasso Regression, and Decision Tree models took about 13.36, 23.36, 9.23, 6.68, 7.63, 5.36, and 4.26 s, respectively, on a machine with a Core i7 CPU and 16 GB of RAM.
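The combination of grid search and cross-validation used above can be sketched generically. In this illustration a one-dimensional k-nearest-neighbour regressor stands in for the study's models purely for brevity; the data, the candidate grid, and all function names are hypothetical.

```python
import random

def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points (1-D input)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / n_neighbors

def cv_mse(xs, ys, n_neighbors, k=5):
    """Mean validation MSE over k folds for one candidate hyperparameter value."""
    n = len(xs)
    fold_mses = []
    for f in range(k):
        val = list(range(f, n, k))  # every k-th sample forms one fold
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        errs = [(knn_predict(tx, ty, xs[i], n_neighbors) - ys[i]) ** 2 for i in val]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# Synthetic data (assumption): noisy linear trend
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

grid = [1, 3, 5, 10, 20, 40]  # candidate hyperparameter values
scores = {g: cv_mse(xs, ys, g) for g in grid}
best = min(scores, key=scores.get)
print("best n_neighbors:", best)
```

The same loop applies to any single hyperparameter, such as a tree depth or a regularization strength: each candidate is scored by its averaged validation error, and the candidate with the lowest score is retained.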

Fig. 7 [Images not available. See PDF.]

Grid search results for optimum Decision Tree maximum depth.

Fig. 8 [Images not available. See PDF.]

Validation performance versus estimator number in AdaBoost.

Fig. 9 [Images not available. See PDF.]

Random Forest model accuracy as a function of maximum depth.

Fig. 10 [Images not available. See PDF.]

Iteration depth optimization for MLP-ANN.

Fig. 11 [Images not available. See PDF.]

CNN validation loss across training epochs.

Fig. 12 [Images not available. See PDF.]

SVR performance as a function of tuning parameter.

Fig. 13 [Images not available. See PDF.]

Effect of regularization parameter alpha on Lasso regression accuracy.

Evaluation of the developed data-driven models

Table 1 systematically compares the predictive performances of several machine learning models based on three key evaluation metrics: R², MSE, and AARE%. Metrics are provided for training, test, and total datasets, allowing for a comprehensive assessment of both model accuracy and generalization capability. Upon inspection of the test phase results in Table 1, Random Forest emerges as the optimal model for this predictive task. The Random Forest model achieved a test-phase R² value of 0.867, indicating high explanatory power and a strong linear correlation between predicted and observed values. Its corresponding test phase MSE is 18,502, which is among the lowest observed across all tested models, and the test AARE% is 8.76%, signifying robust predictive accuracy and reliability on unseen data.

In comparison, Decision Tree, AdaBoost, and Ensemble Learning models display significantly higher levels of overfitting. For instance, the Decision Tree model exhibits almost perfect fitting on the training set (R² = 1.000, MSE = 0, AARE% = 0.00), but its performance drops sharply on the test set (R² = 0.755, MSE = 34,048, AARE% = 10.25%). Similarly, AdaBoost and Ensemble Learning achieve elevated training R² values of 0.999 and 0.997, respectively, juxtaposed with test R² values of 0.885 and 0.884. Their MSE and AARE% metrics also degrade considerably in the test phase, further highlighting the overfitting issue. Notably, Ensemble Learning’s test MSE is 16,107 and AARE% is 7.48%, while AdaBoost yields a test MSE of 15,924 and AARE% of 7.10%.

The performance of deep learning and regression-based approaches such as CNN, SVR, MLP-ANN, and Lasso Regression is comparatively inferior. The CNN, for example, demonstrates a test R² of 0.473 and an elevated test MSE of 73,257, with a relatively high test AARE% of 18.52%. Both SVR and MLP-ANN also exhibit moderate test R² values (0.643 and 0.619, respectively), with higher MSE and AARE% compared to Random Forest, AdaBoost, and Ensemble models. Lasso Regression has the weakest total-dataset metrics (R² = 0.749, MSE = 44,516, AARE% = 15.84%), although its test-phase results (R² = 0.680, MSE = 44,499, AARE% = 14.91%) are better than those of CNN, SVR, and MLP-ANN.

These numerical trends are further corroborated in Fig. 14, which provides a graphical depiction of model predictions versus experimental observations in the test phase. The Random Forest’s predictions are tightly clustered around the ideal reference line, reflecting both low error and consistent generalization. In contrast, Ensemble, AdaBoost, and Decision Tree models show greater scatter and deviation from the ideal, graphically illustrating the overfitting and reduction in test phase accuracy quantified in Table 1.

Table 1. Comparative summary of training and test phase evaluation metrics for all investigated machine learning models.

Model	R² (Training)	R² (Test)	R² (Total)	MSE (Training)	MSE (Test)	MSE (Total)	AARE% (Training)	AARE% (Test)	AARE% (Total)
Decision Tree	1.000	0.755	0.980	0	34,048	3,508	0.00	10.25	1.06
AdaBoost	0.999	0.885	0.990	172	15,924	1,795	0.52	7.10	1.20
Random Forest	0.983	0.867	0.974	3,100	18,502	4,687	3.71	8.76	4.23
Ensemble Learning	0.997	0.884	0.988	534	16,107	2,138	1.56	7.48	2.17
CNN	0.888	0.473	0.855	20,232	73,257	25,695	12.63	18.52	13.24
SVR	0.916	0.643	0.894	15,286	49,621	18,824	5.34	14.99	6.34
MLP-ANN	0.890	0.619	0.869	19,846	52,931	23,254	8.29	15.43	9.03
Lasso Regression	0.754	0.680	0.749	44,518	44,499	44,516	15.95	14.91	15.84
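The degradation from training to test phase discussed above can be tabulated directly from the Table 1 values. In this sketch, the two gap measures (the R² difference and the test-to-train MSE ratio) are illustrative choices and are not metrics defined in the study.

```python
# (training R2, test R2, training MSE, test MSE) copied from Table 1
table1 = {
    "Decision Tree":     (1.000, 0.755, 0,     34048),
    "AdaBoost":          (0.999, 0.885, 172,   15924),
    "Random Forest":     (0.983, 0.867, 3100,  18502),
    "Ensemble Learning": (0.997, 0.884, 534,   16107),
    "CNN":               (0.888, 0.473, 20232, 73257),
    "SVR":               (0.916, 0.643, 15286, 49621),
    "MLP-ANN":           (0.890, 0.619, 19846, 52931),
    "Lasso Regression":  (0.754, 0.680, 44518, 44499),
}

summary = {}
for model, (r2_train, r2_test, mse_train, mse_test) in table1.items():
    r2_gap = r2_train - r2_test
    # A zero training MSE makes the ratio unbounded
    mse_ratio = mse_test / mse_train if mse_train > 0 else float("inf")
    summary[model] = (r2_gap, mse_ratio)
    print(f"{model:18s} R2 gap = {r2_gap:.3f}   MSE test/train = {mse_ratio:.1f}")
```

On these numbers, the Random Forest MSE ratio (about 6) is far smaller than that of AdaBoost (about 93) or Ensemble Learning (about 30), while the R² gaps of the three models are nearly identical (0.113 to 0.116).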

Fig. 14 [Images not available. See PDF.]

Graphical illustration of model predictions versus experimental values for all models in the test dataset.

Figure 15 presents crossplots comparing the real (observed) outputs to the predicted outputs generated by each machine learning model across both the training and test datasets. These crossplots provide a direct visual assessment of model accuracy and generalization. An ideal model would yield data points closely aligned with the diagonal reference line (y = x), reflecting predictions that accurately match true values. The Random Forest model stands out in this figure, as its data points for both training and test phases adhere closely to the y = x line, which is consistent with its superior evaluation metrics reported in Table 1.

Conversely, the AdaBoost, Ensemble, and Decision Tree models exhibit a pronounced divergence between their training and test results, with training points tightly clustered around y = x and test points displaying substantial scatter. This visual disparity further highlights the overfitting observed for these models, an outcome also indicated by the marked decline in their test-phase performance metrics in Table 1. The remaining models, including CNN, SVR, MLP-ANN, and Lasso Regression, display a wider spread of points, indicating poorer predictive skill relative to Random Forest and a tendency towards underfitting or inadequate representation of the underlying data.

Fig. 15 [Images not available. See PDF.]

Crossplots of real versus predicted output values for all models throughout training and test phases.

Figure 16 displays the RE% for all data points, offering a detailed, point-by-point analysis of each model’s prediction errors. Here, the Random Forest model again distinguishes itself, with RE% values tightly clustered around zero for both training and test samples, signifying consistent and unbiased predictions. In contrast, the RE% distributions for the Ensemble, AdaBoost, and Decision Tree models reveal substantial divergence between the training and test phases. Specifically, these models exhibit minimal error in training but substantial error on test data, a hallmark of overfitting. The other models, such as CNN, SVR, and Lasso Regression, also produce broader RE% distributions, demonstrating less reliable predictive performance when applied to unseen data.

Fig. 16 [Images not available. See PDF.]

Relative error percentages (RE%) of individual predictions for training and test phases across all models.

Collectively, Figs. 15 and 16 provide visual confirmation of the findings summarized in Table 1, reinforcing the superior accuracy and robustness of the Random Forest model, while illustrating the overfitting tendencies and lower generalization capabilities of the alternative approaches.

To quantitatively interpret the influence of each input variable on model output, SHAP (SHapley Additive exPlanations) analysis was conducted. SHAP, grounded in cooperative game theory, offers a robust and theoretically justified framework for evaluating feature importance and directionality in complex machine learning models. Specifically, SHAP values assign to each feature the marginal contribution it makes to the prediction, averaged over all possible feature orderings. This approach ensures a fair and interpretable decomposition of the model’s output.
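The averaging over feature orderings can be made concrete with an exact Shapley computation on a toy model. Everything in this sketch is hypothetical: the model, the feature values, and the choice to represent an "absent" feature by a fixed baseline value. Practical SHAP implementations use background data and model-specific approximations rather than full enumeration.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)       # start with every feature at its baseline
        previous = predict(current)
        for name in order:
            current[name] = x[name]    # switch this feature to its actual value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_model(f):
    """Hypothetical oil-rate model with one interaction term (not the study's model)."""
    return (500.0 + 10.0 * f["choke"] - 4.0 * f["bsw"]
            + 0.5 * f["pressure"] - 0.05 * f["choke"] * f["bsw"])

x = {"choke": 40.0, "bsw": 30.0, "pressure": 300.0}
baseline = {"choke": 32.0, "bsw": 20.0, "pressure": 250.0}

phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi.values()), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction for the sample and the prediction at the baseline, which is the additivity property that makes the decomposition interpretable.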

Figure 17 displays the SHAP feature importance summary, ranking input variables by their overall impact on model predictions. The analysis reveals that BS&W% (Basic Sediment & Water Percentage), Choke Size, Upstream Pressure, GLV Depth, and Live Oil Density (in conjunction with API gravity and Gas-Oil Ratio) are the most influential parameters affecting oil production rate, listed in descending order of importance. The dominance of BS&W% highlights the critical effect of water and sediment content on flow characteristics, potentially influencing oil rate through increased fluid viscosity, emulsion tendencies, or formation of flow barriers. Choke size and upstream pressure are directly related to well control and pressure drawdown, thus strongly impacting flow rates according to Darcy’s Law and multiphase flow dynamics. GLV depth, as a vital design parameter in artificial lift operations, determines gas injection effectiveness and thus production enhancement. Live oil density, closely tied to fluid composition and reservoir conditions (and further modulated by API and GOR), also substantially impacts oil deliverability through its role in phase behavior and hydrostatic head.

Fig. 17 [Images not available. See PDF.]

SHAP-based feature importance ranking for input variables affecting model predictions.

Figure 18 presents the SHAP summary plot of feature contributions for individual predictions, where each point represents a sample and color encodes feature value (red indicates higher values; blue denotes lower values). The horizontal spread indicates the SHAP value, corresponding to the variable’s effect on the prediction.

The plot for BS&W% indicates a predominantly negative SHAP value association for high BS&W values (red points mostly on the negative side), signifying that higher BS&W% consistently reduces predicted oil output. This aligns with the physical understanding that increased water and sediment content hinder effective hydrocarbon flow, resulting in lower production rates.

In contrast, Choke Size, Upstream Pressure, GLV Depth, and Live Oil Density all demonstrate a direct (positive) relationship with oil production rate: high values (red points) are found on the positive side of the SHAP axis. Larger choke sizes and elevated upstream pressures support higher flow rates by minimizing flow restrictions and maximizing pressure differentials, respectively. Deeper GLV installations can improve gas lift efficiency, enhancing oil mobility, while higher live oil density (correlated with richer hydrocarbon content and optimal reservoir conditions) typically supports greater production potential.

Fig. 18 [Images not available. See PDF.]

SHAP summary plot illustrating individual feature contributions and their effects on predicted oil production rate.

Taken together, these SHAP analyses provide not only a quantifiable importance ranking of input parameters but also a nuanced interpretation of how specific features and their underlying physical mechanisms drive model predictions. Such interpretability is critical for actionable reservoir management and well optimization, as it highlights controllable operational settings and reservoir attributes that can be targeted to maximize oil recovery.

This study presents a comprehensive and methodical approach to forecasting oil production in gas lift wells using advanced machine learning (ML) techniques. By leveraging a well-curated dataset from actual oil fields in south-eastern Iraq and incorporating diverse operational and reservoir parameters, the research ensures practical relevance and robust generalizability. The application of multiple ML models, including both conventional (e.g., Decision Tree, SVR) and deep learning-based architectures (e.g., CNN, MLP-ANN), allows for a thorough comparison of model performance. The use of 5-fold cross-validation and various statistical metrics enhances the credibility of the evaluation process. Additionally, the integration of SHAP analysis for model interpretability is a significant strength, offering transparent insights into feature importance, which can inform field-level decision-making and optimization strategies. However, despite its strengths, the study has several limitations. The dataset, although rigorously validated, contains only 169 samples, which may constrain the model’s ability to capture rare patterns or generalize to highly diverse operational conditions. Small datasets also increase the risk of overfitting, particularly for complex models like CNNs. Also, the constructed models only work within the range of input parameters from which they were developed.

In the broader context of petroleum engineering and reservoir management, this study contributes to the growing body of work integrating machine learning into upstream operations. By demonstrating not only predictive accuracy but also model transparency and statistical rigor, the findings reinforce the role of AI as a valuable tool for tackling complex subsurface problems where conventional modeling may fall short. The methodology and insights presented here can serve as a blueprint for other applications, such as artificial lift design, reservoir performance forecasting, or production optimization in fields with similar data constraints. As the industry continues to pursue digital transformation and data-driven solutions, such studies underscore the potential of machine learning to drive more informed, efficient, and adaptive field management strategies.

Future research can expand upon this study by incorporating a larger and more diverse dataset, including temporal data to capture dynamic changes in well performance over time. Integrating time-series modeling approaches, such as LSTM or Transformer-based architectures, could enhance the model’s ability to forecast production trends under evolving conditions. Additionally, combining machine learning predictions with domain-specific physics-based models may offer hybrid frameworks that balance accuracy with interpretability. Exploring the impact of real-time sensor data and deploying the predictive models in a live monitoring system for continuous optimization of gas lift parameters could significantly enhance operational efficiency. Finally, future studies could also investigate the transferability of the developed models to other fields with similar geological characteristics, enabling broader applicability across the oil and gas industry.

Conclusions

This study has demonstrated that machine learning, particularly the Random Forest algorithm, offers significant advancement in predicting oil production rates for wells utilizing gas lift systems. By systematically collecting and validating a diverse dataset of operational parameters, we developed models that balanced predictive accuracy with generalizability. The Random Forest model excelled, outperforming alternatives and delivering robust accuracy on unseen data (R²: 0.867, MSE: 18,502, and AARE: 8.76%) while minimizing overfitting. Sensitivity and SHAP analyses provided valuable interpretability, pinpointing the dominant influence of basic sediment and water, choke size, and upstream pressure on production outcomes. These insights not only enhance model transparency but also inform practical optimization of gas lift well operations. Overall, integrating data-driven methods with expert knowledge holds strong potential for elevating production forecasting and decision-making in complex oilfield environments.

Acknowledgements

The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/457/46.

Author contributions

All authors contributed equally to this research paper.

Data availability

Data will be made available upon reasonable academic request from the corresponding author.

Declarations

Competing interests

The authors declare no competing interests.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Khojastehmehr, M; Madani, M; Daryasafar, A. Screening of enhanced oil recovery techniques for Iranian oil reservoirs using TOPSIS algorithm. Energy Rep.; 2019; 5, pp. 529-544.

2. Yadua, AU et al. Performance of a gas-lifted oil production well at steady state. J. Petroleum Explor. Prod. Technol.; 2021; 11, 6 pp. 2805-2821.

3. Dai, T et al. Waste glass powder as a high temperature stabilizer in blended oil well cement pastes: hydration, microstructure and mechanical properties. Constr. Build. Mater.; 2024; 439, 137359.

4. Sun, H et al. Theoretical and numerical methods for predicting the structural stiffness of unbonded flexible riser for deep-sea mining under axial tension and internal pressure. Ocean Eng.; 2024; 310, 118672.

5. Qiu, Y et al. Synergic sensing of light and heat emitted by offshore oil and gas platforms in the South China sea. Int. J. Digit. Earth; 2024; 17, 1 2441932.

6. Tariq, Z et al. A systematic review of data science and machine learning applications to the oil and gas industry. J. Petroleum Explor. Prod. Technol.; 2021; 11, 12 pp. 4339-4374.

7. Zhou, Y et al. Effect of multi-scale rough surfaces on oil-phase trapping in fractures: Pore-scale modeling accelerated by wavelet decomposition. Comput. Geotech.; 2025; 179, 106951.

8. Kargarpour, MA. Oil and gas well rate Estimation by choke formula: semi-analytical approach. J. Petroleum Explor. Prod. Technol.; 2019; 9, 3 pp. 2375-2386.

9. Yu, H et al. Modeling thermal-induced Wellhead growth through the lifecycle of a well. Geoenergy Sci. Eng.; 2024; 241, 213098.

10. Neyolova, EY; Ponomarev, AA. The application of basin modeling of oil and gas systems based on the capillary-gravity concept. Geol. Ecol. Landscapes; 2024; 8, 1 pp. 41-48.

11. Khezerloo-ye Aghdam, S et al. Mechanistic assessment of Seidlitzia Rosmarinus-derived surfactant for restraining shale hydration: A comprehensive experimental investigation. Chem. Eng. Res. Des.; 2019; 147, pp. 570-578.

12. Agwu, OE et al. Modelling the flowing bottom hole pressure of oil and gas wells using multivariate adaptive regression splines. J. Petroleum Explor. Prod. Technol.; 2025; 15, 2 22.

13. Chaihad, N et al. In-situ catalytic upgrading of bio-oil derived from fast pyrolysis of sunflower stalk to aromatic hydrocarbons over bifunctional Cu-loaded HZSM-5. J. Anal. Appl. Pyrol.; 2021; 155, 105079.

14. Bangtang, YIN et al. Deformation and migration characteristics of bubbles moving in gas-liquid countercurrent flow in annulus. Pet. Explor. Dev.; 2025; 52, 2 pp. 471-484.

15. Isehunwa, S. O., Awojinrin, G. T. & Oluwatayo, P. Production Forecasting in Gas Lifted Wells using Interpretable Machine Learning Techniques, in SPE Nigeria Annual International Conference and Exhibition. p. D021S009R006. (2024).

16. Ben Seghier, MEA; Höche, D; Zheludkevich, M. Prediction of the internal corrosion rate for oil and gas pipeline: implementation of ensemble learning techniques. J. Nat. Gas Sci. Eng.; 2022; 99, 104425.

17. Khan, M. R. et al. Machine Learning Application for Oil Rate Prediction in Artificial Gas Lift Wells. in SPE Middle East Oil and Gas Show and Conference. (2019).

18. Yakoot, M. S., Ragab, A. M. S. & Mahmoud, O. Machine Learning Application for Gas Lift Performance and Well Integrity, in SPE Europec featured at 82nd EAGE Conference and Exhibition. p. D021S001R008. (2021).

19. Khezerloo-ye Aghdam, S; Kazemi, A; Ahmadi, M. Theoretical and experimental study of fine migration during Low-Salinity water flooding: effect of Brine composition on interparticle forces. SPE Reservoir Eval. Eng.; 2023; 26, 02 pp. 228-243.

20. Ragab, A. M. S., Yakoot, M. S. & Mahmoud, O. Application of Machine Learning Algorithms for Managing Well Integrity in Gas Lift Wells, in SPE/IATMI Asia Pacific Oil & Gas Conference and Exhibition. p. D012S032R003. (2021).

21. Bera, A et al. Recent advances in ionic liquids as alternative to surfactants/chemicals for application in upstream oil industry. J. Ind. Eng. Chem.; 2020; 82, pp. 17-30.

22. Barnaji, MJ; Pourafshary, P; Rasaie, MR. Visual investigation of the effects of clay minerals on enhancement of oil recovery by low salinity water flooding. Fuel; 2016; 184, pp. 826-835.

23. Agwu, OE; Okoro, EE; Sanni, SE. Modelling oil and gas flow rate through chokes: A critical review of extant models. J. Petrol. Sci. Eng.; 2022; 208, 109775.

24. Kargozarfard, Z; Riazi, M; Ayatollahi, S. Viscous fingering and its effect on areal sweep efficiency during waterflooding: an experimental study. Pet. Sci.; 2019; 16, 1 pp. 105-116.

25. Adu, E; Zhang, Y; Liu, D. Current situation of carbon dioxide capture, storage, and enhanced oil recovery in the oil and gas industry. Can. J. Chem. Eng.; 2019; 97, 5 pp. 1048-1076.

26. Sami, NA. Application of machine learning algorithms to predict tubing pressure in intermittent gas lift wells. Petroleum Res.; 2022; 7, 2 pp. 246-252.

27. Aghdam, SK; Kazemi, A; Ahmadi, M. Studying the effect of various surfactants on the possibility and intensity of fine migration during low-salinity water flooding in clay-rich sandstones. Results Eng.; 2023; 18, 101149.

28. Youcefi, MR et al. Improved explainable multi-gene genetic programming correlations for predicting carbon dioxide solubility in various brines. Desalination; 2025; 610, 118917.

29. Alatefi, S et al. Explainable artificial intelligence models for estimating the heat capacity of deep eutectic solvents. Fuel; 2025; 394, 135073.

30. Amar, M. N. et al. A reliable model to predict mercury solubility in natural gas components: A robust machine learning framework and data assessment. J. Hazard. Mater. 138396 (2025).

31. Nait Amar, M. et al. A reliable model for predicting methane solubility in brine: toward effective methane emission mitigation. Energy Fuels 39, 5562–5576 (2025).

32. Amar, MN et al. Modeling wax disappearance temperature using robust white-box machine learning. Fuel; 2024; 376, 132703.

33. Syed, FI et al. Artificial lift system optimization using machine learning applications. Petroleum; 2022; 8, 2 pp. 219-226.

34. Abbasi, P; Aghdam, SK; Madani, M. Modeling subcritical multi-phase flow through surface chokes with new production parameters. Flow Meas. Instrum.; 2023; 89, 102293.

35. Ghorbani, H et al. Prediction of oil flow rate through an orifice flow meter: artificial intelligence alternatives compared. Petroleum; 2020; 6, 4 pp. 404-414.

36. Zhang, Y; Liu, J; Shen, W. A review of ensemble learning algorithms used in remote sensing applications. Appl. Sci.; 2022; 12, 17 8654.

37. Yaghoubi, E et al. A systematic review and meta-analysis of machine learning, deep learning, and ensemble learning approaches in predicting EV charging behavior. Eng. Appl. Artif. Intell.; 2024; 135, 108789.

38. Mamun, M. et al. Lung cancer prediction model using ensemble learning techniques and a systematic review analysis. in 2022 IEEE World AI IoT Congress (AIIoT). (2022).

39. Ying, C et al. Advance and prospects of adaboost algorithm. Acta Automatica Sinica; 2013; 39, 6 pp. 745-758.

40. Schapire, R. E. Explaining adaboost, in Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik. 37–52. (Springer, 2013).

41. Hu, W; Hu, W; Maybank, S. Adaboost-based algorithm for network intrusion detection. IEEE Trans. Syst. Man. Cybernetics Part. B (Cybernetics); 2008; 38, 2 pp. 577-583.

42. Navada, A. et al. Overview of use of decision tree algorithms in machine learning. in 2011 IEEE Control and System Graduate Research Colloquium. (2011).

43. Elhazmi, A et al. Machine learning decision tree algorithm role for predicting mortality in critically ill adult COVID-19 patients admitted to the ICU. J. Infect. Public Health; 2022; 15, 7 pp. 826-834.

44. Zulfiqar, H et al. Identification of Cyclin protein using gradient boost decision tree algorithm. Comput. Struct. Biotechnol. J.; 2021; 19, pp. 4123-4131.

45. Bansal, M; Goyal, A; Choudhary, A. A comparative analysis of K-Nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decis. Analytics J.; 2022; 3, 100071.

46. Luna, JM et al. Building more accurate decision trees with the additive tree. Proc. Natl. Acad. Sci.; 2019; 116, 40 pp. 19887-19893.

47. Ghiasi, MM; Zendehboudi, S; Mohsenipour, AA. Decision tree-based diagnosis of coronary artery disease: CART model. Comput. Methods Programs Biomed.; 2020; 192, 105400.

48. Sarica, A., Cerasa, A. & Quattrone, A. Random Forest Algorithm for the Classification of Neuroimaging Data in Alzheimer’s Disease: A Systematic Review. Front. Aging Neurosci. 9 (2017).

49. Biau, G; Scornet, E. A random forest guided tour. TEST; 2016; 25, 2 pp. 197-227.

50. Rigatti, SJ. Random forest. J. Insur. Med.; 2017; 47, 1 pp. 31-39.

51. Mohapatra, N., Shreya, K. & Chinmay, A. Optimization of the Random Forest Algorithm (Springer Singapore, 2020).

52. Feng, W. et al. FSRF: An Improved Random Forest for Classification. in IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA) (2020).

53. Ao, Y et al. The linear random forest algorithm and its advantages in machine learning assisted logging regression modeling. J. Petrol. Sci. Eng.; 2019; 174, pp. 776-789.

54. Rocco, CM; Moreno, JA. Fast Monte Carlo reliability evaluation using support vector machine. Reliab. Eng. Syst. Saf.; 2002; 76, 3 pp. 237-243.

55. Qi, K; Yang, H. Elastic net nonparallel hyperplane support vector machine and its geometrical rationality. IEEE Trans. Neural Networks Learn. Syst.; 2022; 33, 12 pp. 7199-7209.

56. Kavitha, S., Varuna, S. & Ramya, R. A comparative analysis on linear regression and support vector regression. in Online International Conference on Green Engineering and Technologies (IC-GET) (2016).

57. Lopez Pinaya, W. H. et al. Chap. 10 - Convolutional neural networks, in Machine Learning (eds Mechelli, A. & Vieira, S.), 173–191. (Academic, 2020).

58. Sohn, C et al. Line chart Understanding with convolutional neural network. Electronics; 2021; 10, 6 749.

59. Dong, Z et al. A novel method for automatic quantification of different pore types in shale based on SEM-EDS calibration. Mar. Pet. Geol.; 2025; 173, 107278.

60. Gardner, MW; Dorling, SR. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmos. Environ.; 1998; 32, 14 pp. 2627-2636.

61. Tang, J; Deng, C; Huang, GB. Extreme learning machine for multilayer perceptron. IEEE Trans. Neural Networks Learn. Syst.; 2016; 27, 4 pp. 809-821.

62. Ghorbani, MA et al. Pan evaporation prediction using a hybrid multilayer perceptron-firefly algorithm (MLP-FFA) model: case study in North Iran. Theoret. Appl. Climatol.; 2018; 133, 3 pp. 1119-1131.

63. Hajihosseinlou, M; Maghsoudi, A; Ghezelbash, R. Regularization in machine learning models for MVT Pb-Zn prospectivity mapping: applying Lasso and elastic-net algorithms. Earth Sci. Inf.; 2024; 17, 5 pp. 4859-4873.

64. Emmert-Streib, F; Dehmer, M. High-Dimensional LASSO-Based computational regression models: regularization, shrinkage, and selection. Mach. Learn. Knowl. Extr.; 2019; 1, 1 pp. 359-383.

65. Ghosh, P et al. Efficient prediction of cardiovascular disease using machine learning algorithms with relief and LASSO feature selection techniques. IEEE Access.; 2021; 9, pp. 19304-19326.

66. Kang, J et al. LASSO-Based machine learning algorithm for prediction of lymph node metastasis in T1 colorectal Cancer. Crt; 2020; 53, 3 pp. 773-783.

67. Zhang, L et al. Improvement on enhanced Monte-Carlo outlier detection method. Chemometr. Intell. Lab. Syst.; 2016; 151, pp. 89-94.

68. Wang, H; Bah, MJ; Hammad, M. Progress in outlier detection techniques: A survey. IEEE Access; 2019; 7, pp. 107964-108000.

69. Rofatto, VF et al. A Monte Carlo-based outlier diagnosis method for sensitivity analysis. Remote Sens.; 2020; 12, 5 860.

70. Gorriz, J. M. et al. Is K-fold cross validation the best model selection method for Machine Learning? arXiv preprint arXiv:2401.16407 (2024).

71. Jung, Y. Multiple predicting K-fold cross-validation for model selection. J. Nonparametric Stat.; 2018; 30, 1 pp. 197-215.

72. Wong, TT; Yeh, PY. Reliable accuracy estimates from k-fold cross validation. IEEE Trans. Knowl. Data Eng.; 2019; 32, 8 pp. 1586-1594.

73. Ghorbani, H. et al. A Robust Approach for Estimation of the Bone Age. IEEE.

74. Madani, M; Moraveji, MK; Sharifi, M. Modeling apparent viscosity of waxy crude oils doped with polymeric wax inhibitors. J. Petrol. Sci. Eng.; 2021; 196, 108076.

75. Madani, M et al. Modeling of CO2-brine interfacial tension: application to enhanced oil recovery. Pet. Sci. Technol.; 2017; 35, 23 pp. 2179-2186.

76. Bemani, A; Madani, M; Kazemi, A. Machine learning-based estimation of nano-lubricants viscosity in different operating conditions. Fuel; 2023; 352, 129102.

77. Madani, M; Alipour, M. Gas-oil gravity drainage mechanism in fractured oil reservoirs: surrogate model development and sensitivity analysis. Comput. GeoSci.; 2022; 26, 5 pp. 1323-1343.

78. Müller, ER et al. Short-term steady-state production optimization of offshore oil platforms: wells with dual completion (gas-lift and ESP) and flow assurance. TOP; 2022; 30, 1 pp. 152-180.

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Optimizing oil production in wells employing gas lift systems is a critical challenge due to the complex interplay of operational and reservoir parameters. This study aimed to develop robust predictive models for estimating oil production rates using a comprehensive dataset from oil fields in south-eastern Iraq, leveraging advanced machine learning techniques. The dataset, comprised of 169 rigorously validated samples, includes key features such as basic sediment and water content, choke size, pressures, gas injection characteristics, gas lift valve depth, oil density, and temperature. Input and output variables were normalized and split into training and test sets to ensure fairness and reliability. Multiple machine learning models (Decision Tree, AdaBoost, Random Forest, Ensemble Learning, CNN, SVR, MLP-ANN, and Lasso Regression) were trained and evaluated using 5-fold cross-validation and key statistical metrics (R², MSE, AARE%). The Random Forest model demonstrated superior performance, achieving a test R² of 0.867 and the lowest prediction errors (MSE: 18502 and AARE: 8.76%) for the testing phase, while other models were prone to overfitting or underfitting. Sensitivity analysis and SHAP interpretability methods revealed that basic sediment and water content, choke size, and upstream pressure had the greatest influence on oil output. These findings underscore the importance of both statistical rigor and model interpretability in oil production forecasting and provide actionable insights for optimizing gas lift operations in oil wells.
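The three evaluation metrics named in the abstract (R², MSE, AARE%) can be illustrated with a minimal sketch. The formulas below are the conventional definitions and are assumed here rather than taken from the paper; in particular, AARE% is assumed to be the mean of |predicted − measured| / |measured|, expressed as a percentage. The function names and the sample oil rates are hypothetical and are not drawn from the study's dataset.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between measured and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def aare_percent(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute relative error in percent (assumed definition).

    Requires every measured value to be non-zero.
    """
    n = len(y_true)
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / n


# Hypothetical measured and predicted oil rates, for illustration only.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(f"MSE   = {mse(measured, predicted):.3f}")
print(f"R2    = {r2_score(measured, predicted):.4f}")
print(f"AARE% = {aare_percent(measured, predicted):.3f}")
```

Because AARE% is a relative measure, it weights errors on low-rate wells more heavily than MSE does, which is one reason the two metrics are usually reported together.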

Details

Title

Predictive modeling of oil rate for wells under gas lift using machine learning

Author

Ma, Famin¹; Altalbawy, Farag M. A.²; Patel, Pinank³; Manjunatha, R.⁴; Kalia, Rishiv⁵; Formanova, Shoira⁶; Naveen, P. Raja⁷; Joshi, Kamal Kant⁸; Sinha, Aashna⁹; Kandahari, Abdolali Yarahmadi¹⁰; Al-Rubaye, Taqi Mohammed Khattab¹¹; Alam, Mohammad Mahtab¹²

¹ Shangluo University, 726000, Shangluo, Shaanxi, China (ROR: https://ror.org/01a56n213) (GRID: grid.481179.2) (ISNI: 0000 0004 1757 7308)
² Department of Chemistry, University College of Duba, University of Tabuk, Tabuk, Saudi Arabia (ROR: https://ror.org/04yej8x59) (GRID: grid.440760.1) (ISNI: 0000 0004 0419 5685)
³ Department of Mechanical Engineering, Faculty of Engineering & Technology, Marwadi University Research Center, Marwadi University, Rajkot, Gujarat, India (ROR: https://ror.org/030dn1812) (GRID: grid.508494.4) (ISNI: 0000 0004 7424 8041)
⁴ Department of Data analytics and Mathematical Sciences, School of Sciences, JAIN (Deemed to be University), Bangalore, Karnataka, India (ROR: https://ror.org/01cnqpt53) (GRID: grid.449351.e) (ISNI: 0000 0004 1769 1282)
⁵ Centre for Research Impact & Outcome, Chitkara University Institute of Engineering and Technology, Chitkara University, 140401, Rajpura, Punjab, India (ROR: https://ror.org/057d6z539) (GRID: grid.428245.d) (ISNI: 0000 0004 1765 3753)
⁶ Department of Chemistry and Its Teaching Methods, Tashkent State Pedagogical University, Tashkent, Uzbekistan (ROR: https://ror.org/051g1n833) (GRID: grid.502767.1) (ISNI: 0000 0004 0403 3387)
⁷ Department of Mechanical Engineering, Raghu Engineering College, 531162, Visakhapatnam, Andhra Pradesh, India
⁸ Department of Allied Science, Graphic Era Hill University, Dehradun, India (ROR: https://ror.org/01bb4h160) (ISNI: 0000 0004 5894 758X); Graphic Era Deemed to be University, Dehradun, Uttarakhand, India (ROR: https://ror.org/02bdf7k74) (GRID: grid.411706.5) (ISNI: 0000 0004 1773 9266)
⁹ School of Applied and Life Sciences, Division of Research and Innovation, Uttaranchal University, Dehradun, Uttarakhand, India (ROR: https://ror.org/00ba6pg24) (GRID: grid.449906.6) (ISNI: 0000 0004 4659 5193)
¹⁰ Faculty of Engineering, Kandahar University, Kandahar, Afghanistan (ROR: https://ror.org/0157yqb81) (GRID: grid.440459.8) (ISNI: 0000 0004 5927 9333)
¹¹ Department of computers Techniques engineering, College of technical engineering, The Islamic University, Najaf, Iraq (ROR: https://ror.org/01wfhkb67) (GRID: grid.444971.b) (ISNI: 0000 0004 6023 831X); Department of computers Techniques engineering, College of technical engineering, The Islamic University of Al Diwaniyah, Al Diwaniyah, Iraq (ROR: https://ror.org/01wfhkb67) (GRID: grid.444971.b) (ISNI: 0000 0004 6023 831X); Department of computers Techniques engineering, College of technical engineering, The Islamic University of Babylon, Babylon, Iraq (ROR: https://ror.org/0170edc15) (GRID: grid.427646.5) (ISNI: 0000 0004 0417 7786)
¹² Department of Basic Medical Sciences, College of Applied Medical Science, King Khalid University, 61421, Abha, Saudi Arabia (ROR: https://ror.org/052kwzs30) (GRID: grid.412144.6) (ISNI: 0000 0004 1790 7100)

Pages

27765

Section

Article

Publication year

2025

Publication date

2025

Publisher

Nature Publishing Group

e-ISSN

20452322

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41598-025-12129-w

ProQuest document ID

3234777266
