Innovative data techniques for centrifugal pump optimization with machine learning and AI model

Abstract

In modern centrifugal pump machines (CPM), a data acquisition system with software-hardware interfacing is essential for parameter recording. The quality of the recorded data plays a crucial role and directly influences the data transformation phase in machine learning (ML) and deep learning (DL) models. The Dewesoft FFT DAQ system is designed to extract high-quality data from the CPM using sensor fusion technology. The data recorded by the DAQ system undergoes in-depth analysis, processing, and transformation before being passed to machine learning (ML) or artificial intelligence (AI) models. This paper emphasizes the importance of data cleaning, pre-processing, and applying appropriate methodologies to transform raw data into a valuable resource that can be utilized by ML and AI models. Key techniques include Exploratory Data Analysis (EDA), Data Visualization, and Feature Engineering (FE), which collectively enhance data interpretability. Following these transformations, hypothesis testing validates the data’s integrity, ensuring reliability for subsequent modeling. The validated data is employed to train machine learning classifiers and deep learning algorithms, targeting a 27.25% enhancement in operational efficiency based on F1 score. Additionally, the approach decreases model training time by 180 seconds, facilitating predictive maintenance of critical performance metrics and minimizing downtime. Model performance is assessed using Precision, Recall, and F1 score. This approach leverages recent advancements in data science to derive actionable insights from CPM data, facilitating more informed decision-making and optimization of pump operations.


Introduction

Centrifugal Pump Machines are employed in a variety of applications because they are dependable, efficient, and versatile. However, because of the intricate architecture of the system and the volume of data acquired during operation, predicting the performance and identifying faults in centrifugal pumps is an extremely difficult process. Machine learning and deep learning algorithms have demonstrated great potential in addressing these challenges by automating the process of data analysis and feature extraction. This study emphasizes the need for a theoretical framework connecting machine learning (ML) and artificial intelligence (AI) to predictive maintenance and anomaly detection in centrifugal pump monitoring [1]. Predictive maintenance employs condition monitoring principles, analyzing historical and real-time sensor data to foresee potential failures. Fault detection theories, including model-based and data-driven diagnosis frameworks, highlight the significance of identifying deviations from typical operational behavior. When appropriately integrated with these frameworks, ML and AI models can autonomously discern meaningful patterns from extensive sensor datasets, recognize early degradation signs, and initiate timely maintenance [2]. By aligning with established theories in condition monitoring and fault detection, this study enhances the empirical approach of the proposed methodology, ensuring it is rooted in reliable scientific principles, thus improving the reliability and interpretability of the predictive models created [3].

Recent studies have applied advanced signal processing and machine learning methods to detect faults in centrifugal pumps. To identify the severity of cavitation, Azizi et al. [4] created a feature extraction technique based on empirical mode decomposition (EMD). Kumar et al. [5] suggested an enhanced deep convolutional neural network (CNN) that uses acoustic imagery to detect faults in pump components. Gao et al. [6] employed a hybrid feature selection method to enhance the precision of cavitation severity identification. Resende et al. [7] developed the TIP4.0 platform for predictive maintenance (PdM) using a CNN architecture with distributed edge computing. Pooja Vinayak Kamat et al. [8] employed K-means clustering and an autoencoder-LSTM for anomaly trend analysis and classification. Antonio L. Alfeo et al. [9] used K-means clustering to improve model interpretability by finding the best-quality feature combination for a specific classification problem. Before modelling, exploratory data analysis (EDA) is usually carried out to identify patterns and informative aspects in the dataset. However, manual feature engineering by maintenance personnel has limitations, and more automated feature learning approaches such as semi-supervised learning are being explored.

Various studies have investigated methodologies such as support vector machines (SVMs) [10] for the prediction of cavitation and blockage of flow, variational mode decomposition for diagnosing faults in rolling bearings, and a data indicator-based deep learning network for the detection of multiple faults. Cyclostationary analysis, wavelet decomposition, and symmetric cross entropy of neutrosophic [2] sets have been utilized in the context of pump condition monitoring [8,9].

The procedure encompasses Exploratory Data Analysis (EDA), feature engineering, data transformation and validation, as well as machine learning and deep learning methodologies. EDA plays a crucial role in analyzing the characteristics of the dataset and identifying patterns and correlation between variables. Feature engineering is the method of creating or merging new dimensions from the unprocessed data to improve the quality of machine learning models. Data transformation, standardization/normalization and validation are essential steps in data analysis process to ensure the reliability and consistency of data [11]. ML and AI techniques have been widely applied to centrifugal pump data for performance prediction and fault diagnosis.

Centrifugal pump machines are dependable and efficient; nonetheless, their complexity and vast operational data complicate performance forecasting and fault diagnostics. Advanced signal processing, machine learning (ML) and deep learning (DL) techniques tackle these challenges, although they possess limitations. Empirical mode decomposition (EMD) and feature extraction necessitate manual intervention, which is labour-intensive and susceptible to errors. Deep CNNs can identify flaws; however, they necessitate substantial computer resources and labelled data, which may not always be available. Hybrid methodologies such as k-means clustering or autoencoders enhance feature selection; yet, they may lack scalability and generalizability across diverse datasets. Numerous studies overlook multi-fault scenarios and automated feature learning techniques, such as semi-supervised learning, which constitutes a significant constraint to the research.

Machine learning methodologies are increasingly employed for pump fault detection but encounter issues related to nonlinearities, noise susceptibility, and generalizability. Recent investigations highlight the need for resilient models capable of precise fault diagnosis amid uncertainties. The efficacy of machine learning and deep learning models is significantly influenced by data handling quality [4]. A discernible void exists in the literature regarding data preprocessing and exploratory data analysis (EDA) techniques that proficiently scrutinize intricate patterns within multivariate sensor datasets [12]. This study aims to address this gap by proposing an enhanced preprocessing framework followed by hypothesis testing tailored for improved monitoring of centrifugal pumps.

This study applies advanced data analysis and machine learning to centrifugal pump data in order to build reliable models that can detect flow obstructions, forecast pump performance, identify multiple faults, and enable predictive maintenance, ultimately improving the reliability and efficiency of centrifugal pump systems. Specifically, this research leverages the Dewesoft FFT DAQ system and integrates machine learning techniques to improve predictive accuracy and operational efficiency. The Dewesoft FFT DAQ system provides high-resolution data acquisition capabilities, while the applied machine learning methods streamline the process of identifying faults and predicting performance with greater precision and less computational time [13]. This dual approach addresses existing limitations by automating feature extraction and enhancing scalability and reliability across diverse operational scenarios.

Methodology

The data collected from the CPM has to be investigated and transformed before it is sent to the ML/DL model [14]. The architecture flow of the Exploratory Data Analysis (EDA) model starts with data analysis, which helps to understand the statistics of the CPM and its behavior [15–17]. It also shows whether the data recorded through the DAQ contains noise and outliers [18]. Data pre-processing is done after statistical analysis and helps to eliminate noise and outliers in the dataset [19]. After data pre-processing, data visualization becomes easier and gives insight into the data [20,21]. Data visualization uses three approaches, univariate, bivariate, and multivariate analysis, to gain these insights [22]. Feature Engineering helps to eliminate column and row features that carry little information, based on standard deviation and covariance.

Depending on the requirements of the ML/DL model, the data is standardized and the dataset is split into training and testing sets [23–25]. The most important step after EDA is hypothesis testing, which checks whether the transformations applied to the raw data have biased it. Transformed data that passes the hypothesis test is ready for training the ML/DL model [26–28].

The basic architecture flow of the model

The experimental setup presented in Fig 1 shows the hardware, which includes the Centrifugal Pump Machine, the sensors, and the data acquisition system. The experiment uses sensor fusion, with data extracted by the Dewesoft FFT Analyser. The sensors include an accelerometer, a flow sensor, and a pressure sensor, along with a multimeter to record the current and voltage [13].

[Figure omitted. See PDF.]

Dewesoft’s FFT analyzer can perform real-time FFT analysis on unlimited input channels. It has a sampling rate of 200 kHz per channel and a dynamic range of 160 dB in the time and frequency domains. The maximum frequency bandwidth that provides valid values without aliasing effects is the sample rate divided by 2.56. Fig 2 illustrates the fundamental design of the model. A detailed subset of the model is illustrated in Fig 3, showcasing real-time exploratory data analysis (EDA) and feature engineering (FE). The extracted features are recorded in xlsx or csv files. The data extracted from the Centrifugal Pump Machine (CPM) via the data acquisition system comprises 12 independent variables and one dependent variable with 4 classes, resulting in a dataset with 13 columns and 70,062 data points [13]. For further analysis, the data should be loaded into a suitable environment such as Jupyter Notebook, Google Colab, or PyCharm.
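As a quick check of the bandwidth rule quoted for the FFT analyzer, the short sketch below divides the stated 200 kHz sampling rate by 2.56. The variable names are illustrative only; the result is roughly 78.1 kHz, which sits below the theoretical Nyquist limit of 100 kHz.

```python
# Alias-free bandwidth from the sampling rate, using the 2.56 divisor quoted in the text.
SAMPLE_RATE_HZ = 200_000   # 200 kHz per channel (from the text)
ALIAS_DIVISOR = 2.56       # divisor quoted in the text

max_bandwidth_hz = SAMPLE_RATE_HZ / ALIAS_DIVISOR
nyquist_hz = SAMPLE_RATE_HZ / 2  # theoretical Nyquist limit, shown for comparison

print(f"Alias-free bandwidth: {max_bandwidth_hz / 1000:.3f} kHz")
print(f"Nyquist frequency:    {nyquist_hz / 1000:.3f} kHz")
```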

[Figure omitted. See PDF.]

To gain insight into the recorded data, exploratory data analysis (EDA) is used to understand the underlying patterns and relationships within the dataset before modelling. As seen in Fig 2, the knowledge gained from EDA guides the transformation of the data through Feature Engineering (FE). Feature engineering can occasionally skew the data, leading to poor modelling. To prevent this, a hypothesis test should be applied to the transformed data to determine whether bias has been introduced. If the hypothesis test confirms the quality of the dataset, the transformed data is ready for training the ML and DL models. For production use, the best-performing model is then serialized (pickled).

Data analysis & pre-processing

Data is essential for machine learning models, and high-quality input data enhances model performance. To increase data quality, data preprocessing is performed, which consists of data analysis and data visualization. Real-time data is often contaminated with noise and outliers, so training models on raw data is ineffective. Data preprocessing transforms unrefined data into a structured format suitable for machine learning (ML) and deep learning (DL) models [29,30].

Primary steps for data cleaning consist of:

* Eliminating Null data

* Removing Duplicates in data

* Transforming Categorical data

* Dealing with noise & outliers
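The four cleaning steps listed above can be sketched with pandas as follows. This is a minimal illustration on a small made-up frame, not the authors' code: the values are hypothetical, and the 1.5 x IQR fence used for outliers is one common convention assumed here.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the recorded CPM data.
raw = pd.DataFrame({
    "Casing": [310.0, 310.0, np.nan, 420.0, 5000.0, 305.0],
    "Pressure": [20.1, 20.1, 19.8, 21.0, 20.5, 20.3],
    "condition": ["GHC", "GHC", "IF", "MA", "GHC", "IBF"],
})

clean = raw.dropna()              # 1. eliminate null data
clean = clean.drop_duplicates()   # 2. remove duplicate rows

# 3. transform the categorical column into integer codes
clean = clean.assign(
    condition_code=clean["condition"].astype("category").cat.codes
)

# 4. drop outliers with a 1.5 x IQR fence (assumed convention)
q1, q3 = clean["Casing"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["Casing"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(clean)
```

On real sensor data the outlier rule would normally be reviewed per feature, since a value that looks extreme statistically may be a genuine fault signature.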

In various industries, including the field of manufacturing and mechanical engineering, the health monitoring of equipment plays a crucial role in ensuring smooth operations and preventing unexpected failures. Among the many types of machinery used in these industries, centrifugal pumps are commonly employed for various applications, such as fluid transportation and circulation [31,32].

To effectively monitor the health of a CPM, it is important to collect and analyze relevant data from sensors and other monitoring devices. However, the collected data often contains noise, outliers, or missing values, which may compromise the accuracy of the analysis. Data pre-processing methods address these issues and enhance data quality. One powerful approach to data pre-processing in the health monitoring system of a centrifugal pump combines statistics and visualization. These techniques allow analysts and engineers to gain insights into the data, detect anomalies, and identify patterns. Visualizing the data makes it easier to understand its characteristics and to make informed decisions about data cleaning and preprocessing steps, which are an important part of EDA [33]. EDA can be applied in various ways during the data pre-processing stage [34–36].

i. Data cleaning: Data cleaning helps identify outliers or noisy data points that need to be removed or corrected. Histograms, box plots, and scatter plots are frequently used to show data distributions and spot abnormalities.

ii. Missing data handling: Statistics and visualization aid in understanding the patterns of missing data. By locating missing values, analysts can decide on appropriate strategies for imputation, such as mean or regression-based imputation, to fill in the gaps.

iii. Feature engineering: Feature engineering helps in identifying relationships and correlations between different variables in the dataset. This knowledge can guide the selection of relevant features for further analysis and modelling.

iv. Data transformation: Data transformation techniques assist in exploring the distribution of variables and identifying skewness or non-normality. Such insights can guide data transformation steps, such as logarithmic or power transformations, to normalize the data for subsequent analyses.
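To make the missing-data and transformation items concrete, the sketch below fills gaps with the mean and then applies a logarithmic transformation to a right-skewed channel. The series is synthetic (a log-normal sample named `Flow` purely for illustration), and `log1p` is one of several possible transformations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sensor channel with a few missing readings.
flow = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=2000), name="Flow")
flow.iloc[[10, 200, 1500]] = np.nan

# Missing-data handling: mean imputation, the simplest strategy named in the text.
flow_filled = flow.fillna(flow.mean())

# Data transformation: log1p compresses the long right tail.
flow_log = np.log1p(flow_filled)

skew_before = flow_filled.skew()
skew_after = flow_log.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Comparing the skewness before and after gives a simple numeric check that the transformation moved the distribution closer to symmetry.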

Overall, incorporating statistical and visualization techniques into the data pre-processing stage of a health monitoring system for a CPM enables a better understanding of the data and helps address challenges related to noise, outliers, missing values, and data transformation [37–39]. Consequently, this leads to higher operating efficiency, less downtime, better maintenance methods, and more precise and trustworthy health monitoring [39–41]. Efficient data pre-processing is an essential component of any data analysis pipeline, and exploratory analysis is a useful tool in this process for the centrifugal pump’s health monitoring system [42–44].

Realtime exploratory data analysis on CPM – feature engineering

The Exploratory Data Analysis (EDA) and Feature Engineering (FE) stage includes statistical investigation, handling null and duplicate values, locating outliers, dealing with multicollinearity, re-checking, removing unwanted features and adding new ones, and checking and transforming the dependent variable and its distribution [45,46]. The methodology shown in Fig 3 also includes data visualization, pattern discovery, dataset standardization, and hypothesis testing to check for possible bias before modelling.

After hypothesis testing, the data is ready for training with different ML and DL models. The models generally perform well after EDA and FE, as shown in Fig 3. Performing EDA is fundamental and comprises approximately 70% of the total effort in the analysis process. To begin the analysis of the data collected from the Centrifugal Pump Machine (CPM), various libraries are required, including Pandas, Numpy, Matplotlib.pyplot, Seaborn, Scipy, and sklearn (for preprocessing and train_test_split). Additionally, the model_selection and imblearn libraries are used. The first step is to read the data file with the Pandas library. After the data is loaded, the next step is to check the shape, information, five-point summary (Table 1), and data types of the dataset. The dataset contains 70,062 rows and 13 columns, with 7 float features, 4 integer features, 1 timestamp, and 1 object-categorical feature. The features Casing, Impeller, and Bearing provide vibration data, along with associated temperature values designated as C_Temp, I_Temp, and B_Temp. The DC_RA feature is recorded to evaluate the roughness within the impeller casing, with the aim of predicting cavitation. Additionally, the features Flow, Pressure, Current, and Voltage are recorded during operation to assist in identifying potential faults. The timestamp and condition features are not displayed in the five-point summary table because their data types are neither integer nor float.
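The loading and inspection steps described above might look like the following sketch. The column names follow the text, but the three rows are invented for illustration and are read from an in-memory string rather than the real file; which columns are float and which are integer in the actual dataset is an assumption here.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has 70,062 rows.
csv_text = """timestamp,Casing,Impeller,Bearing,C_Temp,I_Temp,B_Temp,DC_RA,Flow,Pressure,Current,Voltage,condition
2024-01-01 10:00:00,310.5,402.1,350.0,41,43,45,0.8,12.1,20.9,1.6,175,GHC
2024-01-01 10:00:01,295.2,610.7,340.3,42,47,46,1.1,9.4,15.2,1.3,125,IF
2024-01-01 10:00:02,720.9,455.0,510.8,44,45,52,0.9,10.2,17.0,1.6,200,MA
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(df.shape)   # (rows, columns)
df.info()         # column names, non-null counts, dtypes

# Summary statistics for numeric columns only, so timestamp and
# condition are left out, as described in the text.
summary = df.select_dtypes(include="number").describe()
print(summary)
```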

[Figure omitted. See PDF.]

The five-point summary provides a statistical overview, including the number of observations, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values. As seen in Table 1, this summary aids in comprehending the central tendency and distribution of the data. The mean and standard deviation are computed over the sample population, and the quantiles (25%, 50%, and 75%) are read from the ordered sample. In essence, percentiles split an ordered sample into 100 equal parts and return the corresponding data values [47–49]. To ensure data integrity and quality, the first step is to check for and handle any Null or NA values in the dataset. Missing values should be imputed, for example by replacing them with the mean, median, or another appropriate function. However, the CPM dataset under consideration contains no Null or NA values [50]. Next, it is essential to address any duplicate items within the dataset. This involves functions that identify and remove rows whose values are identical across all features. The CPM dataset also contains no duplicate items [50].

This ensures that the data is clean and reliable for further analysis. When analyzing data, it is often necessary to drop features with low standard deviation (SD) or coefficient of variation (CoV) to avoid the curse of dimensionality. The CoV is a statistical measure of relative variability, where a lower CoV indicates less dispersion around the mean. As CoV is directly dependent on standard deviation, it provides a normalized measure of dispersion within the data distribution. In practice, CoVs below 1 are considered low variance, while CoVs above 1 are regarded as high variance. Essentially, SD measures the raw variability, whereas CoV indicates variability relative to the mean, facilitating comparison between variables using a standardized dispersion ratio [51]. This step drops features with low SD or CoV, given their limited contribution to the data’s variability. Although CoV can identify and eliminate more features than SD, it is highly sensitive to outliers, so subject expertise is crucial for the final decision. While SD is more popular, CoV is more effective when comparing different features within a dataset, provided there are no outliers and the feature units are the same. With a threshold of 0.2, the function eliminates features whose values fall below this threshold. However, when this approach was applied, every feature in the dataset had an SD above 0.20, so no features were dropped. This reinforces the recommendation to use the SD approach in most cases due to its robustness against outliers and its general applicability across various datasets [52]. For centrifugal pump data, removing unwanted low-variance features can improve model performance and efficiency. Larger deviations of sensor data from the mean indicate potential faults.
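The SD and CoV screening described above can be sketched as a small helper. The function name `low_variability_features` and the demo values are hypothetical; only the 0.2 threshold comes from the text.

```python
import pandas as pd

def low_variability_features(frame, threshold=0.2, use_cov=False):
    """Return numeric columns whose SD (or CoV = SD / |mean|) is below threshold."""
    numeric = frame.select_dtypes(include="number")
    spread = numeric.std()
    if use_cov:
        spread = spread / numeric.mean().abs()
    return spread[spread < threshold].index.tolist()

demo = pd.DataFrame({
    "Casing": [300.0, 450.0, 600.0, 750.0],    # large spread
    "Voltage": [175.0, 175.1, 174.9, 175.0],   # nearly constant
})

to_drop = low_variability_features(demo, threshold=0.2)
reduced = demo.drop(columns=to_drop)
print("dropped:", to_drop)
```

Switching `use_cov=True` expresses the same threshold relative to the mean, which only makes sense when the features share units and the mean is not close to zero.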

SD-based feature extraction retains only high variance features, enhancing effectiveness. In contrast, PCA may reorganize data but fails to eliminate inefficiencies. Therefore, akin to a filter in a pump improving flow efficiency, SD filtering in data analysis preserves relevant data while discarding features with information below the 0.2 SD threshold.

A number of preparation procedures were carried out to guarantee the dataset’s efficacy and integrity for machine learning (ML) and deep learning (DL) models. Multicollinearity, a situation where features contain redundant information, was addressed by assessing the correlation between features and setting a threshold of 70%. Features exceeding this threshold were identified for removal, but a subject matter expert was consulted to ensure essential features were not discarded; three important features initially dropped due to multicollinearity were retrieved [53]. Outliers were identified and eliminated using the Scipy-stats library, targeting values greater than the 75th percentile and less than the 25th percentile. Rows with 90% zero values were also removed, significantly improving model performance [54]. Analysis revealed an imbalance in the dependent variables, with the GHC dependent variable being predominant. The Synthetic Minority Over-Sampling Technique (SMOTE) from imblearn library was used to increase minority class samples and achieve balance. Additionally, categorical dependent variables were converted to numerical categories or one-hot encoded for DL models. Combining these preprocessing steps ensured the data was well-prepared, enhancing the reliability and accuracy of subsequent analyses and models [55].
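A minimal sketch of the multicollinearity screen follows, assuming absolute Pearson correlation and the 70% threshold from the text. The helper name `correlated_candidates` and its `keep` argument (standing in for the subject matter expert's list of features to retain) are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_candidates(frame, threshold=0.70, keep=()):
    """List columns whose |Pearson r| with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    flagged = [col for col in upper.columns if (upper[col] > threshold).any()]
    # Features protected by the domain expert are never proposed for removal.
    return [col for col in flagged if col not in keep]

rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "Impeller": base,
    "Voltage": base * 0.9 + rng.normal(scale=0.1, size=500),  # tied to Impeller
    "Flow": rng.normal(size=500),                              # independent
})

print(correlated_candidates(demo))
print(correlated_candidates(demo, keep=("Voltage",)))
```

Returning candidates instead of dropping them directly leaves room for the expert review step the text describes.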

The cleaned and processed CPM dataset is then split into 70% training and 30% testing data using train_test_split from the sklearn.model_selection module. Standardization is a transformation that rescales the data into a more usable form. StandardScaler from the sklearn.preprocessing module is applied after splitting the dataset when the features have units with large SD and mean. After data preprocessing using the statistical and visualization approach, it is important to perform hypothesis testing to find out whether the data became biased during the feature engineering procedure.
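The split and standardization steps can be sketched as below, using synthetic data in place of the CPM features. The stratified split and the fixed `random_state` are assumptions added for reproducibility; the scaler is fitted on the training portion only so that no information from the test set leaks into it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical features (e.g. a vibration and a pressure channel) and four condition codes.
X = rng.normal(loc=[300.0, 20.0], scale=[100.0, 3.0], size=(1000, 2))
y = rng.integers(0, 4, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)    # learn mean and SD from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)     # reuse the training mean and SD

print(X_train.shape, X_test.shape)
```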

Hypothesis testing

The Z-test and the T-test both use the mean of the sample data to calculate the probability, as shown in Fig 4. The Z-test is used when the population standard deviation is known (regardless of the sample size) or when the sample size is large (n ≥ 30). The T-test is used when the population standard deviation is unknown and the sample size is small (n < 30). Because the CPM dataset is large, the T-test is ruled out. The ANOVA test is used when an analysis of variance is required, and it can follow a one-way or a two-way approach. After the test is chosen based on the sample population, the null hypothesis and the alternative hypothesis should be defined.

[Figure omitted. See PDF.]

The p-value must be calculated for the sample data, which will inform the outcomes of the null and alternative hypothesis tests for both the z-test and ANOVA test. Therefore, the hypothesis will be applied for statistical comparison of CPM data before and after preprocessing.

The Z-test failed to reject the null hypothesis, indicating no significant difference between the means of the training and test sets for the “casing” feature. Likewise, the ANOVA test failed to reject the null hypothesis, suggesting that the means of the original, training, and test sets are statistically equal. The findings of the Z-test and ANOVA test suggest that no bias occurred during feature engineering, and the data is suitable for modelling.
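The two tests can be sketched as follows on synthetic data. The z-test helper is written by hand using sample variances (a large-sample approximation), the ANOVA uses `scipy.stats.f_oneway`, and the 0.05 significance level is an assumed convention; none of this is taken from the authors' code. The three-way comparison mirrors the one described in the text, where the training and test sets are drawn from the original data.

```python
import numpy as np
from scipy import stats

def two_sample_z_test(a, b):
    """Two-sided z-test for equal means (large-sample approximation)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    p = 2.0 * stats.norm.sf(abs(z))
    return z, p

rng = np.random.default_rng(7)
original = rng.normal(400.0, 120.0, size=5000)   # hypothetical "casing" values
shuffled = rng.permutation(original)
train, test = shuffled[:3500], shuffled[3500:]   # 70/30 split

z_stat, p_z = two_sample_z_test(train, test)
f_stat, p_anova = stats.f_oneway(original, train, test)

ALPHA = 0.05  # assumed significance level
print(f"z = {z_stat:.3f}, p = {p_z:.3f}, reject H0: {p_z < ALPHA}")
print(f"F = {f_stat:.3f}, p = {p_anova:.3f}, reject H0: {p_anova < ALPHA}")
```

A p-value above the chosen significance level means the null hypothesis of equal means is not rejected, which is the outcome the text reports for the casing feature.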

Results & discussion

After data cleaning and processing through Exploratory Data Analysis (EDA) and statistical feature engineering, the final stages involve data visualization, train-test splitting, standardization, and hypothesis testing. Merging EDA with data visualization accelerates model understanding and deployment. Data visualization is a crucial analysis technique for discovering patterns and visualizing statistical data, and it is divided into three main categories: univariate, bivariate, and multivariate analysis. Univariate analysis, which examines one variable at a time, has been applied through three different graphs of the same variables to understand the distribution and characteristics of individual variables in the CPM data [56,57].

Box plot analysis

The box plots [58] for “Casing,” “Bearing,” and “Impeller” reveal similar distributions of vibration, with medians around 300 to 400 G and interquartile ranges spanning approximately 200 to 500 G, as shown in Fig 5. The whiskers indicate the overall spread of the vibration data, extending from about 200 to 800 G for each variable, with no significant outliers present. This analysis gives a clear picture of each variable’s central tendency, dispersion, and range of values, which helps in understanding the fundamental patterns and distributions in the dataset.

[Figure omitted. See PDF.]

Pie chart analysis

As observed in the pie chart [59] in Fig 6, the classes of the dependent variable, Good Health Condition (GHC), Impeller Fault (IF), Impeller & Bearing Fault (IBF), and Misalignment (MA), are distributed in proportions of 30.5%, 27.1%, 27.1%, and 15.3%, respectively. The data is therefore slightly imbalanced across GHC, IF, IBF, and MA.

[Figure omitted. See PDF.]

Line plot analysis

The line plots [60] display the variations in pressure over a specified range for three components: casing, impeller, and bearing, as shown in Fig 7. Each plot shows pressure on the y-axis against an unlabeled x-axis measure, likely time or operational cycles. The casing feature covers a vibration range of 200-800 G, with pressure varying from 12.5 to 29 KPa. The impeller feature covers a vibration range of 200-950 G, with a pressure range of 10-36 KPa. The bearing feature covers a vibration range of 200-850 G, with a pressure range of 11-36 KPa. All three components exhibit periods of stability and fluctuation. The recurring patterns around specific marks (200, 400, 500) suggest these could be critical operational phases or maintenance cycles. Spikes and sharp drops in pressure are areas of interest for further investigation to understand the underlying causes, such as equipment wear, load changes, or procedural adjustments. The plots show that, despite the fluctuations, there are significant periods where the pressure is relatively stable, indicating normal operational conditions. These insights can guide maintenance schedules, operational adjustments, and additional research to improve the system’s dependability and effectiveness.

[Figure omitted. See PDF.]

Bar plot analysis

The bar plot [61] displays the variations in pressure, current, and voltage across four conditions: GHC (Good Health Condition), IF (Impeller Failure), IBF (Impeller & Bearing Failure), and MA (Misalignment), as shown in Fig 8. For pressure, GHC shows the highest value at around 21 KPa, while IF and IBF have lower values near 15 KPa, and MA is slightly higher at 17 KPa. In terms of current, GHC and MA exhibit similar values of around 1.6 A, IF has the lowest at 1.3 A, and IBF shows the highest at 1.8 A. For voltage, GHC is about 175 V, IF and IBF are around 125 V, and MA has the highest at 200 V. These plots indicate the health and maintenance status of the system, with GHC showing optimal conditions, IF and IBF indicating failures, and MA suggesting the need for maintenance.

[Figure omitted. See PDF.]

The correlation heatmap analysis

The heatmap [62] shown in Fig 9 displays the correlation matrix for the features casing, impeller, bearing, temperatures (C_TEMP, I_TEMP, B_TEMP), flow, pressure, DC_RA, current, and voltage. Dark hues on the color scale indicate a negative correlation, whereas light hues indicate a positive correlation. This heatmap is essential for identifying which variables are interrelated and which are independent, aiding in predictive modeling and operational decisions. It illustrates the significance of features and the relationships between variables. The heatmap shows a significant relationship between DC_RA and B_TEMP, indicating that increased cavitation is associated with higher bearing temperatures. Bearing temperature is therefore indicative of cavitation in both the impeller casing and the impeller. In a Good Health Condition (GHC), there is a positive relationship between flow rate and pressure. Conversely, 75% of the data reflects defective conditions, where MA greatly diminishes flow. As a result, the heatmap shows a negative correlation that reflects degraded pump performance. A positive correlation of 0.70 is seen between impeller vibration and voltage, indicating that voltage increases when the impeller is damaged. Correlations exceeding 0.70 between variables are instrumental in predicting maintenance faults.

[Figure omitted. See PDF.]

Scatter clustering analysis

The relationship between impeller readings (x-axis) and casing readings (y-axis) is depicted in the scatter plot [63] in Fig 10, where bearing values are represented by color intensity. For instance, data points with high impeller readings (around 700-900 G) and high casing readings (around 500-700 G) are marked with darker colors, indicating higher bearing values. This plot helps identify clusters and outliers, providing insights into typical and atypical operational conditions. For example, a dense cluster of points around impeller readings of 300-500 G and casing readings of 400-600 G with moderate bearing values suggests a common operational range, while isolated points might indicate anomalies. Data clusters at about 750 G may signify a malfunction in the CPM system, suggesting that maintenance by the support staff is needed. Faults exceeding 750 G signify shaft misalignment, while those below 300 G may suggest bearing or impeller defects.

[Figure omitted. See PDF.]

K-means clustering analysis

Fig 11 presents a 3D scatter plot [64] showcasing the relationships between casing, casing temperature (C_TEMP), and bearing values. The data points are plotted in three-dimensional space, providing a more comprehensive view of how these variables interact. For example, clusters of points around a casing value of 500G, C_TEMP of 150, and bearing value of 600G suggest stable operating conditions. Outliers in the plot, such as points far removed from these clusters, indicate potential issues or extreme operational conditions. This 3D visualization is particularly useful for detecting multi-variable interactions and anomalies that may not be visible in two-dimensional plots.
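A clustering pass over three such variables could be sketched as below. The two operating regimes are synthetic (the first is centred on the values mentioned above), and the choice of two clusters, feature scaling before K-means, and the three-standard-deviation distance rule for flagging suspects are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two hypothetical operating regimes in (Casing, C_TEMP, Bearing) space.
stable = rng.normal([500.0, 150.0, 600.0], [20.0, 5.0, 25.0], size=(200, 3))
shifted = rng.normal([780.0, 190.0, 820.0], [20.0, 5.0, 25.0], size=(200, 3))
points = np.vstack([stable, shifted])

# Scale first so that no single unit dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(points)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = model.labels_

# Distance to the assigned centroid; unusually large values are anomaly candidates.
dist = np.linalg.norm(scaled - model.cluster_centers_[labels], axis=1)
suspects = np.where(dist > dist.mean() + 3 * dist.std())[0]
print("cluster sizes:", np.bincount(labels), "suspect points:", len(suspects))
```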

[Figure omitted. See PDF.]

Taken together, Figs 10 and 11 reveal the operational dynamics of the system, expose the key correlations between variables, and point to the regions that need closer scrutiny to ensure reliable system performance.

For instance, consistent pressure values around 15-25 kPa, together with moderate impeller and bearing values, indicate stable operation, while deviations from these ranges could signal potential issues. Combining heatmaps, scatter clustering analysis, and K-means clustering analysis enables maintenance personnel or supervisors to assess potential faults statistically and identify them visually, thereby minimizing industrial downtime.

ML classifier models

Logistic Regression [65], shown in Fig 12, is a widely used supervised learning method for both binary and multi-class classification. It models the probability that a data point belongs to a given class using the logistic (sigmoid) function. Unlike linear regression, which predicts continuous variables, logistic regression estimates categorical outcomes and is valued for its simplicity, efficiency, and interpretability in practical applications.

[Figure omitted. See PDF.]
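To make the role of the logistic function concrete, the sketch below shows how linear scores are turned into class probabilities: the sigmoid for the binary case and the softmax as one common multi-class generalization. The scores are hypothetical values chosen for illustration, and the paper does not state which multi-class formulation was used.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Multi-class generalization: normalized exponentials of the class scores."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["GHC", "IF", "IBF", "MA"]
scores = [0.2, 2.5, 0.9, 0.7]  # hypothetical linear scores w.x + b per class
probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]
print(predicted, [round(p, 3) for p in probabilities])
```

The predicted class is simply the one with the largest probability, which is why overlapping feature distributions between two classes translate directly into confusion between them.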

The logistic regression classifier shows moderate overall performance, with an accuracy of 71.35% and a corresponding misclassification rate of 28.65%. Precision is highest for the IF class (88%), followed by GHC (70.2%) and IBF (69%), and lowest for the MA class (58%). Misclassifications occur predominantly between the MA and IBF classes, indicating overlap or ambiguity in the feature space for these categories. In summary, although the model performs adequately for two classes (GHC and IF), it needs improvement to better separate IBF from MA.

The data in Table 2 indicates that the IF class exhibits the highest precision at 88%, with GHC, IBF, and MA following at 70.2%, 69%, and 58%, respectively. Both GHC and IF achieve perfect recall (1.0), whereas IBF and MA demonstrate significantly lower values at 56.46% and 48.45%, respectively, suggesting difficulties in accurately identifying relevant instances. The F1-scores are highest for IF (93.62%) and GHC (82.49%), with IBF (62.10%) and MA (52.79%) reflecting poorer performance. The Macro-F1 score of 72.75% indicates a reasonable yet inconsistent performance across different classes, while the Weighted-F1 score of 69.89% highlights the effect of class distribution on the model’s efficacy. In conclusion, the model performs commendably in certain classes, particularly IF, but faces challenges with IBF and MA, signifying the need for further optimization, feature enhancement, or model refinement to enhance overall dependability.

[Figure omitted. See PDF.]
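The per-class F1-scores and the Macro-F1 quoted for Logistic Regression follow directly from the reported precision and recall values. The sketch below recomputes them; the helper name `f1_score` is an assumption for illustration, and the inputs are the figures given in the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall) as quoted in the text for Logistic Regression.
reported = {
    "GHC": (0.702, 1.0),
    "IF": (0.88, 1.0),
    "IBF": (0.69, 0.5646),
    "MA": (0.58, 0.4845),
}

per_class_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)  # unweighted mean

for name, value in per_class_f1.items():
    print(f"{name}: F1 = {value:.4f}")
print(f"Macro-F1 = {macro_f1:.4f}")
```

The recomputed values agree with the reported F1-scores and the 72.75% Macro-F1 to within rounding. The Weighted-F1 cannot be reproduced the same way without the per-class sample counts.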

The Naive Bayes classifier, shown in Fig 13, is a simple, fast, probabilistic machine learning algorithm. It estimates the likelihood that a data point belongs to a class under the “naive” assumption that features are independent of one another. It is commonly used for multiclass classification problems. Despite its simplicity, it works well on large datasets and high-dimensional data, though its performance can drop when features are highly correlated. The most popular machine learning classifiers for this kind of study include Naïve Bayes, decision trees, SVM, KNN, and random forests [66–71].

[Figure omitted. See PDF.]
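The independence assumption can be seen in a compact Gaussian Naïve Bayes sketch: each feature contributes its own one-dimensional Gaussian log-likelihood, and the contributions are simply summed. The function names, the two-feature layout, and the training values below are synthetic assumptions for illustration, not the authors' implementation or data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate class priors plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Pick the class with the highest log-posterior under feature independence."""
    def log_posterior(params):
        prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
            for xi, mu, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))


# Synthetic (vibration in G, pressure in kPa) readings, for illustration only.
samples = [(400, 20), (420, 22), (410, 19), (800, 35), (820, 33), (790, 36)]
labels = ["GHC", "GHC", "GHC", "MA", "MA", "MA"]
model = fit_gaussian_nb(samples, labels)
print(predict_gaussian_nb(model, (805, 34)), predict_gaussian_nb(model, (405, 21)))
```

Because training reduces to computing means and variances per class, the method is cheap to fit, which is consistent with the short computation time reported for it.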

The confusion matrix of the Gaussian Naïve Bayes classifier (Fig 13) shows its performance across the four classes: misalignment (MA), impeller fault (IF), impeller bearing fault (IBF), and good health condition (GHC). The MA class is classified perfectly, with no misclassifications. The GHC class has 7428 correct classifications and 73 instances misclassified as MA. The IBF class has 7538 correct classifications and 8 instances misclassified as MA.

The Naïve Bayes classifier achieves high overall accuracy with only minor misclassifications: its accuracy is 99.73%, accompanied by high recall. As observed in Table 3, the algorithm misclassified 73 GHC observations and 8 IBF observations, for a total misclassification rate of about 0.3%.

[Figure omitted. See PDF.]

The Support Vector Classifier (SVC) [71–73], shown in Fig 14, is a supervised learning algorithm that categorizes data into distinct classes. It does so by finding the hyperplane that maximizes the margin between the classes. The method performs well in high-dimensional spaces and handles both linearly and non-linearly separable classes through the choice of kernel function, which is a hyperparameter. The SVC is resilient to overfitting, particularly when the dimensionality exceeds the sample size. However, it can incur significant computational cost on large datasets.

[Figure omitted. See PDF.]

The SVC confusion matrix (Fig 14) shows the classifier’s ability to correctly classify instances across four classes. GHC and MA have perfect classification with all instances correctly identified. Class IF has 7413 correct classifications with 22 instances misclassified into Class IBF. Class IBF has 7227 correct classifications with 311 instances misclassified into Class IF.

This indicates a drop in accuracy caused by confusion between the IF and IBF classes. As shown in Table 4, the recall for SVC is 98.89% and the F1-score is 97.80%. The SVC detects GHC and MA well, while the main misclassifications occur for IF and IBF, with 22 and 311 misclassified observations, respectively. The SVC model is not as effective as the Naïve Bayes classifier.

[Figure omitted. See PDF.]
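The per-class effect of the IF/IBF confusion can be read off the quoted confusion-matrix counts. The sketch below computes class-level recall from the correct and misclassified counts given in the text for SVC; the dictionary layout and the helper name `recall_from_counts` are assumptions for illustration.

```python
# Correct and misclassified counts quoted in the text for the SVC confusion matrix.
svc_counts = {
    "IF": {"correct": 7413, "missed": 22},    # 22 IF instances predicted as IBF
    "IBF": {"correct": 7227, "missed": 311},  # 311 IBF instances predicted as IF
}


def recall_from_counts(correct: int, missed: int) -> float:
    """Recall = true positives / (true positives + false negatives)."""
    return correct / (correct + missed)


svc_recall = {
    name: recall_from_counts(c["correct"], c["missed"])
    for name, c in svc_counts.items()
}
for name, value in svc_recall.items():
    print(f"{name}: recall = {value:.4f}")
```

The IBF recall comes out noticeably lower than the IF recall, which is where most of the SVC's loss relative to the other classifiers originates.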

The Random Forest classifier, shown in Fig 15, is an ensemble learning technique that constructs numerous decision trees during training. It combines the predictions of these trees to improve accuracy and reduce the risk of overfitting [74,75]. Each tree is built from a random subset of the data and features, which makes the model more robust and more resistant to noise. Random Forest performs strongly across a range of classification tasks and handles large datasets and high-dimensional spaces efficiently; however, training can be slow when a large number of trees is used.

[Figure omitted. See PDF.]
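The two ingredients named above, random resampling of the training data and combining tree predictions, can be sketched in a few lines. This is a minimal illustration of bootstrap sampling and majority voting only; it does not build decision trees, and the helper names, seed, and vote values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree in bagging."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for row indices of a training set
sample = bootstrap_sample(rows, rng)

# Hypothetical votes from five trees for one observation.
votes = ["IBF", "IF", "IBF", "IBF", "MA"]
forest_label = majority_vote(votes)
print(sample, forest_label)
```

Because every tree sees a different resample, individual trees disagree on noisy observations, and the vote averages those disagreements out.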

The Random Forest classifier’s confusion matrix reveals near-perfect performance. Class GHC has 7500 correct classifications with just 1 misclassification as MA. Class IF and Class MA are perfectly classified, with all instances correctly identified. Class IBF has 7546 correct classifications with 8 misclassifications. The decision tree diagram visualizes the hierarchical structure and decision rules used by the Random Forest model, illustrating the paths taken to reach classifications. On the CPM data set, Random Forest achieves 99.99% accuracy. Fig 15 shows that the Random Forest ensemble has only one misclassified observation in the GHC class. The model outperforms the others; however, as Figs 13 and 14 indicate, the data is complex, and complex data can lead to overfitting in random forest models. Complete reliance on the random forest model for complex datasets is therefore not recommended.

The F1 Score recorded is 99.99% while Recall is 99.98% as shown in Table 5. Overall, these figures highlight the effectiveness of each model, with Random Forest showing the highest accuracy, followed by Naïve Bayes and then SVC. The decision tree for Random Forest provides insight into the model’s decision-making process.

[Figure omitted. See PDF.]

Table 6 presents a comparative performance assessment of four machine learning models (Logistic Regression Classifier, Naïve Bayes Classifier, Support Vector Classifier, and Random Forest Classifier) according to three principal performance metrics, Precision, Recall, and F1-Score, together with computational time. These criteria were selected to offer a thorough evaluation of the models’ proficiency in classification tasks.

[Figure omitted. See PDF.]

1. Logistic Regression Classifier: The Logistic Regression Classifier demonstrates a Precision of 76.23%, Recall of 71.30%, and an F1-Score of 72.75%. While its performance is modest compared to advanced models, it provides a solid baseline and efficient training duration. Its lower Precision, Recall, and F1-Score signify weaker predictive performance compared to other evaluated machine learning models. It is particularly effective for linearly separable datasets and scenarios requiring interpretability. The computation time of 150 seconds makes it more efficient than more complex models.

2. Naïve Bayes Classifier: The Naïve Bayes Classifier attains a Precision of 99.73%, Recall of 99.02%, and an F1-Score of 99.51%, demonstrating its efficacy in accurately classifying pertinent cases while sustaining a balance between Precision and Recall. Its performance indicates efficacy in situations with probabilistic interdependencies among features. Its computation time is 180 seconds, the lowest among the three high-accuracy models; only the Logistic Regression Classifier (150 seconds) is faster.

3. Support Vector Classifier: Exhibits marginally reduced Precision (98.84%), Recall (98.89%), and F1-Score (97.80%) relative to alternative models. Although effective, its comparatively lower F1-Score suggests that further fine-tuning or feature engineering may be necessary to improve its performance in intricate data situations. The time taken to train the SVC model is 300 seconds.

4. Random Forest Classifier: Surpasses the previous models with nearly flawless Precision (99.99%), Recall (99.98%), and F1-Score (99.99%). These results highlight its strength and proficiency in managing varied datasets, perhaps attributable to its ensemble learning methodology and adept management of feature interactions and noise. The training time of the RFC model, 360 seconds, is the highest of all the ML models.

Fig 16 highlights that the Naïve Bayes Classifier is the most suitable model for the given dataset, combining strong overall performance on a complex dataset with favorable computational time. The findings demonstrate the importance of model selection in achieving high reliability and accuracy for classification tasks in industrial applications.

[Figure omitted. See PDF.]
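The trade-off between F1-Score and training time described in the comparison can be made explicit with a small selection rule. The metrics below are the figures quoted in the text; the rule itself (fastest model whose F1-Score clears a floor) and the `F1_FLOOR` value are assumptions for illustration, not the authors' stated selection procedure.

```python
# Metrics (percent) and training times (seconds) as quoted in the text.
models = {
    "Logistic Regression": {"precision": 76.23, "recall": 71.30, "f1": 72.75, "time_s": 150},
    "Naive Bayes": {"precision": 99.73, "recall": 99.02, "f1": 99.51, "time_s": 180},
    "Support Vector Classifier": {"precision": 98.84, "recall": 98.89, "f1": 97.80, "time_s": 300},
    "Random Forest": {"precision": 99.99, "recall": 99.98, "f1": 99.99, "time_s": 360},
}

ranked_by_f1 = sorted(models, key=lambda name: models[name]["f1"], reverse=True)

# Hypothetical selection rule: fastest model whose F1 clears a chosen floor.
F1_FLOOR = 99.0  # assumption for illustration
eligible = [name for name in models if models[name]["f1"] >= F1_FLOOR]
selected = min(eligible, key=lambda name: models[name]["time_s"])
time_saved_s = models["Random Forest"]["time_s"] - models[selected]["time_s"]

print(ranked_by_f1)
print(selected, time_saved_s)
```

Under this rule Naïve Bayes is selected, and its training time is 180 seconds shorter than that of Random Forest, at a cost of less than half a percentage point of F1-Score.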

AI deep learning model

An Artificial Neural Network (ANN) was used for classification via deep learning. The architecture comprises multiple dense layers with dropout regularization [76,77]. The first dense layer has 128 units and 1536 parameters and is followed by a 30% dropout layer. A second dense layer has 64 units and 8256 parameters, followed by another dropout layer. A third dense layer with 32 units and 2080 parameters uses a Leaky ReLU activation function, a dropout rate of 0.3, and L2 regularization of 0.001. A fourth dense layer with 16 units and 528 parameters also uses Leaky ReLU activation and dropout. The final layer consists of 4 output units for classification, with 68 parameters and a softmax activation for multiclass output. The model has a total of 12,468 trainable parameters and no non-trainable parameters. Training ran for 20 epochs, reaching 100% accuracy and a minimal validation loss of 0.0024%, as depicted in Fig 17. However, training took significantly longer than for the conventional machine learning models, requiring approximately 920 seconds.

[Figure omitted. See PDF.]
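The parameter counts quoted for the ANN can be checked with the standard formula for a fully connected layer, inputs × units + units. The sketch below does this in plain Python; the input width of 11 features is inferred from the first layer's 1536 parameters and is not stated in the text, and dropout layers are omitted because they carry no parameters.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


INPUT_FEATURES = 11  # inferred: (11 + 1) * 128 = 1536, not stated in the text
layer_units = [128, 64, 32, 16, 4]  # dense layer widths quoted in the text

counts = []
previous = INPUT_FEATURES
for units in layer_units:
    counts.append(dense_params(previous, units))
    previous = units  # the next layer takes this layer's outputs as inputs

total_params = sum(counts)
print(counts, total_params)
```

The per-layer counts and the total of 12,468 match the reported figures, which supports the inferred input width.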

Conclusion

This research outlines the architecture and workflow of an ML/AI model that uses CPM data from a DAQ system. Exploratory Data Analysis (EDA), Feature Engineering, and Data Visualization, carried out through univariate, bivariate, and multivariate analyses, provided a deep understanding of the data. Feature engineering ensured robust data cleaning, and a 70-30% train-test split was employed for model selection, followed by standardization of the training and testing data. Hypothesis testing using the Z-test and ANOVA confirmed the absence of bias after preprocessing.

These steps ensured the data’s readiness for training ML and neural network models. Notably, the neural network achieved up to 100% accuracy after 20 epochs, with a validation loss of just 0.0024%, but at the cost of high computation time. The Random Forest classifier, for its part, exhibited a propensity to overfit, which presents difficulties for production deployment given the complexity of the data and its computation time. Among the ML classifiers, the Naive Bayes Classifier outperformed the Support Vector Classifier and the Logistic Regression Classifier when accuracy and computational time are considered together. In conclusion, the methodology of EDA, Feature Engineering, Standardization, and Hypothesis Testing has set a solid foundation for high-performing ML and AI algorithms. Deployment scalability is limited by integration challenges with existing DAQ systems. Additionally, inconsistencies in sensor compatibility and calibration may affect model performance across diverse industrial CPMs. Future work will focus on training and testing these models for real-time deployment. Reliability will depend increasingly on hardware with more powerful Graphics Processing Units (GPUs) that supports the Industrial Internet of Things (IIoT), and on cloud computing infrastructure.

References

1. Surucu O, Gadsden SA, Yawney J. Condition monitoring using machine learning: a review of theory, applications, and recent advances. Expert Systems with Applications. 2023;221:119738.

2. Chen L, Wei L, Wang Y, Wang J, Li W. Monitoring and predictive maintenance of centrifugal pumps based on smart sensors. Sensors (Basel). 2022;22(6):2106. pmid:35336277

3. Shewale MS, Mulik SS, Deshmukh SP, Patange AD, Zambare HB, Sundare AP. Novel machine health monitoring system. Advances in Intelligent Systems and Computing. Springer Singapore. 2018. p. 461–8. https://doi.org/10.1007/978-981-13-1610-4_47

4. Azizi R, Attaran B, Hajnayeb A, Ghanbarzadeh A, Changizian M. Improving accuracy of cavitation severity detection in centrifugal pumps using a hybrid feature selection technique. Measurement. 2017;108:9–17.

5. Kumar A, Gandhi CP, Zhou Y, Kumar R, Xiang J. Improved deep convolution neural network (CNN) for the identification of defects in the centrifugal pump using acoustic images. Applied Acoustics. 2020;167:107399.

6. Gao Q, Tang H, Xiang J, Zhong Y, Ye S, Pang J. A Walsh transform-based Teager energy operator demodulation method to detect faults in axial piston pumps. Measurement. 2019;134:293–306.

7. Resende C, Folgado D, Oliveira J, Franco B, Moreira W, Oliveira-Jr A, et al. TIP4.0: industrial internet of things platform for predictive maintenance. Sensors (Basel). 2021;21(14):4676. pmid:34300415

8. Kamat PV, Sugandhi R, Kumar S. Deep learning-based anomaly-onset aware remaining useful life estimation of bearings. PeerJ Comput Sci. 2021;7:e795. pmid:34909464

9. Alfeo AL, Cimino MGCA, Vaglini G. Degradation stage classification via interpretable feature learning. Journal of Manufacturing Systems. 2022;62:972–83.

10. Jose JT, Das J, Mishra SKr, Wrat G. Early detection and classification of internal leakage in boom actuator of mobile hydraulic machines using SVM. Engineering Applications of Artificial Intelligence. 2021;106:104492.

11. Mishra SKr, Wrat G, Ranjan P, Das J. PID controller with feed forward estimation used for fault tolerant control of hydraulic system. J Mech Sci Technol. 2018;32(8):3849–55.

12. Mahmood S, Sun H, El-Kenawy E-SM, Iqbal A, Alharbi AH, Khafaga DS. Integrating machine and deep learning technologies in green buildings for enhanced energy efficiency and environmental sustainability. Sci Rep. 2024;14(1):20331. pmid:39223231

13. Dave GS, Pandhare AP, Kulkarni AP, Khankal DV, Abdullah M. Experimental investigation of centrifugal pump machine and its faults through different type of DAQ system and selecting one based on statistical approach. Cogent Engineering. 2024;11(1).

14. Wrat G, Das J. Energy saving in off-road vehicles using leakage compensation technique, unpublished. Aalborg University, Denmark and IIT(ISM), Dhanbad, India, Sept. 2023. Online. Available from: https://www.researchgate.net/publication/374118311

15. Qiu G, Huang S, Gu Y. Experimental investigation and multi-conditions identification method of centrifugal pump using Fisher discriminant ratio and support vector machine. Advances in Mechanical Engineering. 2019;11(9).

16. Selvaraj S, Prabhu Kavin B, Kavitha C, Lai W-C. A multiclass fault diagnosis framework using context-based multilayered bayesian method for centrifugal pumps. Electronics. 2022;11(23):4014.

17. Giro RA, Bernasconi G, Giunta G, Cesari S. A data-driven pipeline pressure procedure for remote monitoring of centrifugal pumps. Journal of Petroleum Science and Engineering. 2021;205:108845.

18. Li S, Wang H, Song L, Wang P, Cui L, Lin T. An adaptive data fusion strategy for fault diagnosis based on the convolutional neural network. Measurement. 2020;165:108122.

19. Quintero DA, Claro H, Regino F, Gómez JA. Development of a data acquisition system using LabVIEW and Arduino microcontroller for a centrifugal pump test bench connected in series and parallel. J Phys: Conf Ser. 2019;1257(1):012002.

20. Kulkarni AJ, Satapathy SC, Kang T, Kashan AH. Advances in intelligent systems and computing. Advances in Intelligent Systems and Computing. Springer Singapore. 2019. p. 461–8. https://doi.org/10.1007/978-981-13-1610-4_47

21. Hajnayeb A. Cavitation analysis in centrifugal pumps based on vibration bispectrum and transfer learning. Shock and Vibration. 2021;2021(1).

22. Mousmoulis G, Karlsen-Davies N, Aggidis G, Anagnostopoulos I, Papantonis D. Experimental analysis of cavitation in a centrifugal pump using acoustic emission, vibration measurements and flow visualization. European Journal of Mechanics - B/Fluids. 2019;75:300–11.

23. Ning B, Cheng X, Wu S. Research on centrifugal pump monitoring system based on virtualization technology. Procedia Engineering. 2011;15:1077–81.

24. Ahmad Z, Prosvirin AE, Kim J, Kim J-M. Multistage centrifugal pump fault diagnosis by selecting fault characteristic modes of vibration and using pearson linear discriminant analysis. IEEE Access. 2020;8:223030–40.

25. Hachem CE, Perrot G, Painvin L, Couturier R. Automation of Quality Control in the Automotive Industry Using Deep Learning Algorithms. In 2021 International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China: IEEE, Jan. 2021, pp. 123–7. https://doi.org/10.1109/ICCCR49711.2021.9349273

26. Hasan MJ, Rai A, Ahmad Z, Kim J-M. A fault diagnosis framework for centrifugal pumps by scalogram-based imaging and deep learning. IEEE Access. 2021;9:58052–66.

27. Orrù PF, Zoccheddu A, Sassu L, Mattia C, Cozza R, Arena S. Machine learning approach using MLP and SVM algorithms for the fault prediction of a centrifugal pump in the oil and gas industry. Sustainability. 2020;12(11):4776.

28. Kumar A, Gandhi CP, Zhou Y, Kumar R, Xiang J. Improved deep convolution neural network (CNN) for the identification of defects in the centrifugal pump using acoustic images. Applied Acoustics. 2020;167:107399.

29. Yang Y, Zheng H, Li Y, Xu M, Chen Y. A fault diagnosis scheme for rotating machinery using hierarchical symbolic analysis and convolutional neural network. ISA Trans. 2019;91:235–52. pmid:30770156

30. Liu R, Yang B, Zio E, Chen X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mechanical Systems and Signal Processing. 2018;108:33–47.

31. Yang L, Chen H, Ke Y, Li M, Huang L, Miao Y. Multi-source and multi-fault condition monitoring based on parallel factor analysis and sequential probability ratio test. Eur J Adv Signal Process. 2021;2021(1):37.

32. Ebrahimi E, Javidan M. Vibration-based classification of centrifugal pumps using support vector machine and discrete wavelet transform. J vibroeng. 2017;19(4):2586–97.

33. Xiao Y, Li Y, Chu C. Performance analysis of vibration sensors for closed‐loop feedback health monitoring of mechanical equipment. Journal of Sensors. 2021;2021(1).

34. Nour MA, M Hussain M. A review of the real-time monitoring of fluid-properties in tubular architectures for industrial applications. Sensors (Basel). 2020;20(14):3907. pmid:32674278

35. Santos ROB, Chagas JM, Prado PHC, Giroto LGFF, Botura CA, Rosa AM, et al. Digital system for dynamic and vibration analysis of a centrifugal pump using the Teknikao Sdav software: a case study. IJAERS. 2020;7(8):367–77.

36. Dutta N, Kaliannan P, Shanmugam P. Application of machine learning for inter turn fault detection in pumping system. Sci Rep. 2022;12(1):12906. pmid:35902679

37. Selvaraj S, Prabhu Kavin B, Kavitha C, Lai W-C. A multiclass fault diagnosis framework using context-based multilayered Bayesian method for centrifugal pumps. Electronics. 2022;11(23):4014.

38. Kapuria A, Cole DG. Integrating survival analysis with bayesian statistics to forecast the remaining useful life of a centrifugal pump conditional to multiple fault types. Energies. 2023;16(9):3707.

39. Yan Y, Wu S. Analysis of key techniques of condition monitoring and fault diagnosis of mechanical system. AETR. 2023;6(1):414.

40. Wu S, Zhang F, Dang Y, Zhan C, Wang S, Ji S. A mechanical-electromagnetic coupling model of transformer windings and its application in the vibration-based condition monitoring. IEEE Trans Power Delivery. 2023;38(4):2387–97.

41. Wang S, Yadav R, Raffik R, Bhola J, Rakhra M, Webber JL, et al. Wireless sensor network technology for vibration condition monitoring of mechanical equipment. Electrica. 2023.

42. Song L, Wang H, Shi Z. A literature review research on monitoring conditions of mechanical equipment based on edge computing. Appl Bionics Biomech. 2022;2022:9489306. pmid:36254227

43. Valenzuela Del Río JE, Lancashire R, Chatrath K, Ritmeijer P, Arvanitis E, Mirabella L. Machine-learning-accelerated simulations for the design of airbag constrained by obstacles at rest. Stapp Car Crash J. 2024;67:1–13. pmid:38513070

44. Koulidis A, Abdullatif M, Ahmed S. Drilling Monitoring System: Mud Motor Condition and Performance Evaluation. In: Middle East Oil, Gas and Geosciences Show, 2023. https://doi.org/10.2118/213422-ms

45. Huang X, Xia H, Liu Y, Yin W, Ran W. Condition Monitoring Of Centrifugal Pump In Nuclear Power Plant Based On Improved Vmd And Svm. The Proceedings of the International Conference on Nuclear Engineering (ICONE). 2023.30. 1747. 2023. https://doi.org/10.1299/jsmeicone.2023.30.1747

46. Turunen T, Miettinen J, Hämäläinen A, Karhinen A, Viitala R. Deep Learning for Centrifugal Pump Condition Monitoring Using Data from Variable Frequency Drive. 2023.

47. Arun M, Venkatesh S, Naveen R, Sugumaran V. Can pretrained networks be used in fault diagnosis of monoblock centrifugal pump? Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering. 2023.

48. Ahmad Z, Kim J-Y, Kim J-M. A Technique for centrifugal pump fault detection and identification based on a novel fault-specific mann-Whitney test. Sensors (Basel). 2023;23(22):9090. pmid:38005476

49. Nielsen MB, Bjorck A. On unbiased estimation of standard deviation. IEEE Signal Processing Letters. 2020;27:1485–8.

50. Kutyniok G. Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function”. Ann Statist. 2020;48(4).

51. Dagdoug M, Goga C, Haziza D. Model-assisted estimation in high-dimensional settings for survey data. J Appl Stat. 2022;50(3):761–85. pmid:36819070

52. Bhaya W. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences. 2017;12:4102–7.

53. Lu JC, Li D. The Coefficient of variation: a model-free measure of variability. Journal.

54. Abu-Shawiesh MOA, Sinsomboonthong J, Kibria BMG. A modified robust confidence interval for the population mean of distribution based on deciles. Statistics in Transition New Series. 2022;23(1):109–28.

55. Peretz O, Koren M, Koren O. Naive Bayes classifier – An ensemble procedure for recall and precision enrichment. Engineering Applications of Artificial Intelligence. 2024;136:108972.

56. Filzmoser P, Nordhausen K. Robust linear regression for high‐dimensional data: An overview. WIREs Computational Stats. 2020;13(4).

57. Koziarski M. Radial-Based Undersampling for imbalanced data classification. Pattern Recognition. 2020;102:107262.

58. Ghosh A, Nashaat M, Miller J, Quader S, Marston C. A comprehensive review of tools for exploratory analysis of tabular industrial datasets. Visual Informatics. 2018;2(4):235–53.

59. Díaz Muñiz C, García Nieto PJ, Alonso Fernández JR, Martínez Torres J, Taboada J. Detection of outliers in water quality monitoring samples using functional data analysis in San Esteban estuary (Northern Spain). Sci Total Environ. 2012;439:54–61. pmid:23063638

60. Sorkun MC, Durmaz İNCEL Ö, Paoli C. Time series forecasting on multivariate solar radiation data using deep learning (LSTM). Turk J Elec Eng & Comp Sci. 2020;28(1):211–23.

61. Ruiz AP, Flynn M, Large J, Middlehurst M, Bagnall A. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min Knowl Discov. 2021;35(2):401–49. pmid:33679210

62. Subbarao MV, Samundiswary P. Time-frequency analysis of non-stationary signals using frequency slice wavelet transform. In: 2016 10th International Conference on Intelligent Systems and Control (ISCO), 2016. 1–6. https://doi.org/10.1109/isco.2016.7726999

63. Mulero-Pérez D, Benavent-Lledó M, Azorín-López J, Marcos-Jorquera D, García-Rodríguez J. Anomaly detection and virtual reality visualisation in supercomputers. Int J Adv Manuf Technol. 2023;133(1–2):935–47.

64. Kim H, Kim M. Malware detection and classification system based on CNN-BiLSTM. Electronics. 2024;13(13):2539.

65. Yehia T, Gasser M, Ebaid H, Meehan N, Okoroafor ER. Comparative analysis of machine learning techniques for predicting drilling rate of penetration (ROP) in geothermal wells: A case study of FORGE site. Geothermics. 2024;121:103028.

66. Peng W, Ward MO, Rundensteiner EA. Clutter reduction in multi-dimensional data visualization using dimension reordering. In Proc. IEEE Symposium on Information Visualization, Austin, TX, USA. 2004;89–96. https://doi.org/10.1109/infvis.2004.15

67. Liu S, Maljovec D, Wang B, Bremer P-T, Pascucci V. Visualizing high-dimensional data: advances in the past decade. IEEE Trans Vis Comput Graph. 2017;23(3):1249–68. pmid:28113321

68. Liu H, Motoda H. Data Mining and Knowledge Discovery. 2002;6(2):115–30.

69. Dai X, Gao Z. From model, signal to knowledge: A data-driven perspective of fault detection and diagnosis. IEEE Trans Ind Inf. 2013;9(4):2226–38.

70. Zheng Q, Yang M, Yang J, Zhang Q, Zhang X. Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access. 2018;6:15844–69.

71. Grave E, Joulin A, Cissé M, Grangier D, Jégou H. Efficient softmax approximation for GPUs. 2016.

72. Kumar A, Chatterjee N. Distance-based fuzzy-rough sets and their application to the classification problem. In Proc. Int. Joint Conf. Rough Sets (IJCRS), Halifax, NS, Canada, May 17–20, 2024, Part I, Berlin, Heidelberg: Springer, 2024;134–56. https://doi.org/10.1007/978-3-031-65665-1_9

73. Souza D, Granzotto M, Almeida G, Lopes LCO. Fault detection and diagnosis using support vector machines - A SVC and SVR comparison. safety. 2014;3(1):18–29.

74. Yehia T, Wahba A, Mostafa S, Mahmoud O. Suitability of different machine learning outlier detection algorithms to improve shale gas production data for effective decline curve analysis. Energies. 2022;15(23):8835.

75. Thabet SA, Zidan HA, Elhadidy AA, Helmy AG, Yehia TA, Elnaggar H, et al. Application of Machine Learning and Deep Learning to Predict Production Rate of Sucker Rod Pump Wells. In: GOTECH, 2024. https://doi.org/10.2118/219231-ms

76. Ebaid H, Rita Okoroafor E, Meehan N, Yehia T, Gasser M. A Comparative Analysis of Machine Learning Techniques for Geothermal Wells’ Drilling Rate of Penetration (ROP) Prediction. In: The Unconventional Resources Technology Conference, 2024. https://doi.org/10.15530/urtec-2024-4044244

77. Mahmood S, Sun H, Ali Alhussan A, Iqbal A, El-Kenawy E-SM. Active learning-based machine learning approach for enhancing environmental sustainability in green building energy consumption. Sci Rep. 2024;14(1):19894. pmid:39191844

Citation: Dave GS, Pandhare AP, Kulkarni AP, Khankal DV (2025) Innovative data techniques for centrifugal pump optimization with machine learning and AI model. PLoS One 20(6): e0325952. https://doi.org/10.1371/journal.pone.0325952

About the Authors:

Gaurav Sandeep Dave

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

E-mail: [email protected]

Affiliation: Department of Mechanical Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India

ORCID: https://orcid.org/0009-0005-6567-3091

Amar Pradeep Pandhare

Roles: Formal analysis, Methodology, Supervision

Affiliation: Department of Mechanical Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India

Atul Prabhakar Kulkarni

Roles: Investigation, Supervision

Affiliation: Department of Mechanical Engineering, VIIT, Pune, India

Dhananjay Vasant Khankal

Roles: Formal analysis, Investigation, Supervision, Validation

Affiliation: Department of Mechanical Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India

[/RAW_REF_TEXT]

References

1. Surucu O, Gadsden SA, Yawney J. Condition monitoring using machine learning: a review of theory, applications, and recent advances. Expert Systems with Applications. 2023;221:119738.

2. Chen L, Wei L, Wang Y, Wang J, Li W. Monitoring and predictive maintenance of centrifugal pumps based on smart sensors. Sensors (Basel). 2022;22(6):2106. pmid:35336277

3. Shewale MS, Mulik SS, Deshmukh SP, Patange AD, Zambare HB, Sundare AP. Novel machine health monitoring system. Advances in Intelligent Systems and Computing. Springer Singapore. 2018. p. 461–8. https://doi.org/10.1007/978-981-13-1610-4_47

4. Azizi R, Attaran B, Hajnayeb A, Ghanbarzadeh A, Changizian M. Improving accuracy of cavitation severity detection in centrifugal pumps using a hybrid feature selection technique. Measurement. 2017;108:9–17.

5. Kumar A, Gandhi CP, Zhou Y, Kumar R, Xiang J. Improved deep convolution neural network (CNN) for the identification of defects in the centrifugal pump using acoustic images. Applied Acoustics. 2020;167:107399.

6. Gao Q, Tang H, Xiang J, Zhong Y, Ye S, Pang J. A Walsh transform-based Teager energy operator demodulation method to detect faults in axial piston pumps. Measurement. 2019;134:293–306.

7. Resende C, Folgado D, Oliveira J, Franco B, Moreira W, Oliveira-Jr A, et al. TIP4.0: industrial internet of things platform for predictive maintenance. Sensors (Basel). 2021;21(14):4676. pmid:34300415

8. Kamat PV, Sugandhi R, Kumar S. Deep learning-based anomaly-onset aware remaining useful life estimation of bearings. PeerJ Comput Sci. 2021;7:e795. pmid:34909464

9. Alfeo AL, Cimino MGCA, Vaglini G. Degradation stage classification via interpretable feature learning. Journal of Manufacturing Systems. 2022;62:972–83.

10. Jose JT, Das J, Mishra SKr, Wrat G. Early detection and classification of internal leakage in boom actuator of mobile hydraulic machines using SVM. Engineering Applications of Artificial Intelligence. 2021;106:104492.

11. Mishra SKr, Wrat G, Ranjan P, Das J. PID controller with feed forward estimation used for fault tolerant control of hydraulic system. J Mech Sci Technol. 2018;32(8):3849–55.

12. Mahmood S, Sun H, El-Kenawy E-SM, Iqbal A, Alharbi AH, Khafaga DS. Integrating machine and deep learning technologies in green buildings for enhanced energy efficiency and environmental sustainability. Sci Rep. 2024;14(1):20331. pmid:39223231

13. Dave GS, Pandhare AP, Kulkarni AP, Khankal DV, Abdullah M. Experimental investigation of centrifugal pump machine and its faults through different type of DAQ system and selecting one based on statistical approach. Cogent Engineering. 2024;11(1).

14. Wrat G, Das J. Energy saving in off-road vehicles using leakage compensation technique, unpublished. Aalborg University, Denmark and IIT(ISM), Dhanbad, India, Sept. 2023. Online. Available from: https://www.researchgate.net/publication/374118311

15. Qiu G, Huang S, Gu Y. Experimental investigation and multi-conditions identification method of centrifugal pump using Fisher discriminant ratio and support vector machine. Advances in Mechanical Engineering. 2019;11(9).

16. Selvaraj S, Prabhu Kavin B, Kavitha C, Lai W-C. A multiclass fault diagnosis framework using context-based multilayered Bayesian method for centrifugal pumps. Electronics. 2022;11(23):4014.

17. Giro RA, Bernasconi G, Giunta G, Cesari S. A data-driven pipeline pressure procedure for remote monitoring of centrifugal pumps. Journal of Petroleum Science and Engineering. 2021;205:108845.

18. Li S, Wang H, Song L, Wang P, Cui L, Lin T. An adaptive data fusion strategy for fault diagnosis based on the convolutional neural network. Measurement. 2020;165:108122.

19. Quintero DA, Claro H, Regino F, Gómez JA. Development of a data acquisition system using LabVIEW and Arduino microcontroller for a centrifugal pump test bench connected in series and parallel. J Phys: Conf Ser. 2019;1257(1):012002.

20. Kulkarni AJ, Satapathy SC, Kang T, Kashan AH. Advances in intelligent systems and computing. Advances in Intelligent Systems and Computing. Springer Singapore. 2019. p. 461–8. https://doi.org/10.1007/978-981-13-1610-4_47

21. Hajnayeb A. Cavitation analysis in centrifugal pumps based on vibration bispectrum and transfer learning. Shock and Vibration. 2021;2021(1).

22. Mousmoulis G, Karlsen-Davies N, Aggidis G, Anagnostopoulos I, Papantonis D. Experimental analysis of cavitation in a centrifugal pump using acoustic emission, vibration measurements and flow visualization. European Journal of Mechanics - B/Fluids. 2019;75:300–11.

23. Ning B, Cheng X, Wu S. Research on centrifugal pump monitoring system based on virtualization technology. Procedia Engineering. 2011;15:1077–81.

24. Ahmad Z, Prosvirin AE, Kim J, Kim J-M. Multistage centrifugal pump fault diagnosis by selecting fault characteristic modes of vibration and using Pearson linear discriminant analysis. IEEE Access. 2020;8:223030–40.

25. Hachem CE, Perrot G, Painvin L, Couturier R. Automation of Quality Control in the Automotive Industry Using Deep Learning Algorithms. In 2021 International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China: IEEE, Jan. 2021, pp. 123–7. https://doi.org/10.1109/ICCCR49711.2021.9349273

26. Hasan MJ, Rai A, Ahmad Z, Kim J-M. A fault diagnosis framework for centrifugal pumps by scalogram-based imaging and deep learning. IEEE Access. 2021;9:58052–66.

27. Orrù PF, Zoccheddu A, Sassu L, Mattia C, Cozza R, Arena S. Machine learning approach using MLP and SVM algorithms for the fault prediction of a centrifugal pump in the oil and gas industry. Sustainability. 2020;12(11):4776.

28. Kumar A, Gandhi CP, Zhou Y, Kumar R, Xiang J. Improved deep convolution neural network (CNN) for the identification of defects in the centrifugal pump using acoustic images. Applied Acoustics. 2020;167:107399.

29. Yang Y, Zheng H, Li Y, Xu M, Chen Y. A fault diagnosis scheme for rotating machinery using hierarchical symbolic analysis and convolutional neural network. ISA Trans. 2019;91:235–52. pmid:30770156

30. Liu R, Yang B, Zio E, Chen X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mechanical Systems and Signal Processing. 2018;108:33–47.

31. Yang L, Chen H, Ke Y, Li M, Huang L, Miao Y. Multi-source and multi-fault condition monitoring based on parallel factor analysis and sequential probability ratio test. Eur J Adv Signal Process. 2021;2021(1):37.

32. Ebrahimi E, Javidan M. Vibration-based classification of centrifugal pumps using support vector machine and discrete wavelet transform. J vibroeng. 2017;19(4):2586–97.

33. Xiao Y, Li Y, Chu C. Performance analysis of vibration sensors for closed‐loop feedback health monitoring of mechanical equipment. Journal of Sensors. 2021;2021(1).

34. Nour MA, M Hussain M. A review of the real-time monitoring of fluid-properties in tubular architectures for industrial applications. Sensors (Basel). 2020;20(14):3907. pmid:32674278

35. Santos ROB, Chagas JM, Prado PHC, Giroto LGFF, Botura CA, Rosa AM, et al. Digital system for dynamic and vibration analysis of a centrifugal pump using the Teknikao Sdav software: a case study. IJAERS. 2020;7(8):367–77.

36. Dutta N, Kaliannan P, Shanmugam P. Application of machine learning for inter turn fault detection in pumping system. Sci Rep. 2022;12(1):12906. pmid:35902679

37. Selvaraj S, Prabhu Kavin B, Kavitha C, Lai W-C. A multiclass fault diagnosis framework using context-based multilayered Bayesian method for centrifugal pumps. Electronics. 2022;11(23):4014.

38. Kapuria A, Cole DG. Integrating survival analysis with Bayesian statistics to forecast the remaining useful life of a centrifugal pump conditional to multiple fault types. Energies. 2023;16(9):3707.

39. Yan Y, Wu S. Analysis of key techniques of condition monitoring and fault diagnosis of mechanical system. AETR. 2023;6(1):414.

40. Wu S, Zhang F, Dang Y, Zhan C, Wang S, Ji S. A mechanical-electromagnetic coupling model of transformer windings and its application in the vibration-based condition monitoring. IEEE Trans Power Delivery. 2023;38(4):2387–97.

41. Wang S, Yadav R, Raffik R, Bhola J, Rakhra M, Webber JL, et al. Wireless sensor network technology for vibration condition monitoring of mechanical equipment. Electrica. 2023.

42. Song L, Wang H, Shi Z. A literature review research on monitoring conditions of mechanical equipment based on edge computing. Appl Bionics Biomech. 2022;2022:9489306. pmid:36254227

43. Valenzuela Del Río JE, Lancashire R, Chatrath K, Ritmeijer P, Arvanitis E, Mirabella L. Machine-learning-accelerated simulations for the design of airbag constrained by obstacles at rest. Stapp Car Crash J. 2024;67:1–13. pmid:38513070

44. Koulidis A, Abdullatif M, Ahmed S. Drilling Monitoring System: Mud Motor Condition and Performance Evaluation. In: Middle East Oil, Gas and Geosciences Show, 2023. https://doi.org/10.2118/213422-ms

45. Huang X, Xia H, Liu Y, Yin W, Ran W. Condition monitoring of centrifugal pump in nuclear power plant based on improved VMD and SVM. The Proceedings of the International Conference on Nuclear Engineering (ICONE). 2023.30. 1747. 2023. https://doi.org/10.1299/jsmeicone.2023.30.1747

46. Turunen T, Miettinen J, Hämäläinen A, Karhinen A, Viitala R. Deep Learning for Centrifugal Pump Condition Monitoring Using Data from Variable Frequency Drive. 2023.

47. Arun M, Venkatesh S, Naveen R, Sugumaran V. Can pretrained networks be used in fault diagnosis of monoblock centrifugal pump? Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering. 2023.

48. Ahmad Z, Kim J-Y, Kim J-M. A technique for centrifugal pump fault detection and identification based on a novel fault-specific Mann-Whitney test. Sensors (Basel). 2023;23(22):9090. pmid:38005476

49. Nielsen MB, Bjorck A. On unbiased estimation of standard deviation. IEEE Signal Processing Letters. 2020;27:1485–8.

50. Kutyniok G. Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function”. Ann Statist. 2020;48(4).

51. Dagdoug M, Goga C, Haziza D. Model-assisted estimation in high-dimensional settings for survey data. J Appl Stat. 2022;50(3):761–85. pmid:36819070

52. Bhaya W. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences. 2017;12:4102–7.

53. Lu JC, Li D. The Coefficient of variation: a model-free measure of variability. Journal.

54. Abu-Shawiesh MOA, Sinsomboonthong J, Kibria BMG. A modified robust confidence interval for the population mean of distribution based on deciles. Statistics in Transition New Series. 2022;23(1):109–28.

55. Peretz O, Koren M, Koren O. Naive Bayes classifier – An ensemble procedure for recall and precision enrichment. Engineering Applications of Artificial Intelligence. 2024;136:108972.

56. Filzmoser P, Nordhausen K. Robust linear regression for high‐dimensional data: An overview. WIREs Computational Stats. 2020;13(4).

57. Koziarski M. Radial-Based Undersampling for imbalanced data classification. Pattern Recognition. 2020;102:107262.

58. Ghosh A, Nashaat M, Miller J, Quader S, Marston C. A comprehensive review of tools for exploratory analysis of tabular industrial datasets. Visual Informatics. 2018;2(4):235–53.

59. Díaz Muñiz C, García Nieto PJ, Alonso Fernández JR, Martínez Torres J, Taboada J. Detection of outliers in water quality monitoring samples using functional data analysis in San Esteban estuary (Northern Spain). Sci Total Environ. 2012;439:54–61. pmid:23063638

60. Sorkun MC, Durmaz İNCEL Ö, Paoli C. Time series forecasting on multivariate solar radiation data using deep learning (LSTM). Turk J Elec Eng & Comp Sci. 2020;28(1):211–23.

61. Ruiz AP, Flynn M, Large J, Middlehurst M, Bagnall A. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min Knowl Discov. 2021;35(2):401–49. pmid:33679210

62. Subbarao MV, Samundiswary P. Time-frequency analysis of non-stationary signals using frequency slice wavelet transform. In: 2016 10th International Conference on Intelligent Systems and Control (ISCO), 2016. 1–6. https://doi.org/10.1109/isco.2016.7726999

63. Mulero-Pérez D, Benavent-Lledó M, Azorín-López J, Marcos-Jorquera D, García-Rodríguez J. Anomaly detection and virtual reality visualisation in supercomputers. Int J Adv Manuf Technol. 2023;133(1–2):935–47.

64. Kim H, Kim M. Malware detection and classification system based on CNN-BiLSTM. Electronics. 2024;13(13):2539.

65. Yehia T, Gasser M, Ebaid H, Meehan N, Okoroafor ER. Comparative analysis of machine learning techniques for predicting drilling rate of penetration (ROP) in geothermal wells: A case study of FORGE site. Geothermics. 2024;121:103028.

66. Peng W, Ward MO, Rundensteiner EA. Clutter reduction in multi-dimensional data visualization using dimension reordering. In Proc. IEEE Symposium on Information Visualization, Austin, TX, USA. 2004;89–96. https://doi.org/10.1109/infvis.2004.15

67. Liu S, Maljovec D, Wang B, Bremer P-T, Pascucci V. Visualizing high-dimensional data: advances in the past decade. IEEE Trans Vis Comput Graph. 2017;23(3):1249–68. pmid:28113321

68. Liu H, Motoda H. Data Mining and Knowledge Discovery. 2002;6(2):115–30.

69. Dai X, Gao Z. From model, signal to knowledge: A data-driven perspective of fault detection and diagnosis. IEEE Trans Ind Inf. 2013;9(4):2226–38.

70. Zheng Q, Yang M, Yang J, Zhang Q, Zhang X. Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access. 2018;6:15844–69.

71. Grave E, Joulin A, Cissé M, Grangier D, Jégou H. Efficient softmax approximation for GPUs. 2016.

72. Kumar A, Chatterjee N. Distance-based fuzzy-rough sets and their application to the classification problem. In Proc. Int. Joint Conf. Rough Sets (IJCRS), Halifax, NS, Canada, May 17–20, 2024, Part I, Berlin, Heidelberg: Springer, 2024;134–56. https://doi.org/10.1007/978-3-031-65665-1_9

73. Souza D, Granzotto M, Almeida G, Lopes LCO. Fault detection and diagnosis using support vector machines - A SVC and SVR comparison. safety. 2014;3(1):18–29.

74. Yehia T, Wahba A, Mostafa S, Mahmoud O. Suitability of different machine learning outlier detection algorithms to improve shale gas production data for effective decline curve analysis. Energies. 2022;15(23):8835.

75. Thabet SA, Zidan HA, Elhadidy AA, Helmy AG, Yehia TA, Elnaggar H, et al. Application of Machine Learning and Deep Learning to Predict Production Rate of Sucker Rod Pump Wells. In: GOTECH, 2024. https://doi.org/10.2118/219231-ms

76. Ebaid H, Rita Okoroafor E, Meehan N, Yehia T, Gasser M. A Comparative Analysis of Machine Learning Techniques for Geothermal Wells’ Drilling Rate of Penetration (ROP) Prediction. In: The Unconventional Resources Technology Conference, 2024. https://doi.org/10.15530/urtec-2024-4044244

77. Mahmood S, Sun H, Ali Alhussan A, Iqbal A, El-Kenawy E-SM. Active learning-based machine learning approach for enhancing environmental sustainability in green building energy consumption. Sci Rep. 2024;14(1):19894. pmid:39191844

© 2025 Dave et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
