The shift to sustainable energy systems necessitates scalable techniques for the valorization of organic waste via biogas production. This study presents a comprehensive data-driven framework encompassing statistical analysis, Explainable AI (XAI), clustering, and predictive modeling of methane yield to gain deeper operational insights into large-scale biogas production. Using operational data from a large-scale biodigester in the Western Cape Province of South Africa, including key biochemical and physicochemical variables such as temperature, pH, total solids (TS), volatile solids (VS), moisture content (MC), and FOS/TAC, key insights were derived through correlation mapping, scatter analysis, SHapley Additive exPlanations (SHAP)-based XAI for ranking digestion operational features, principal component analysis (PCA) for addressing multicollinearity, and k-means cluster analysis to identify operational clusters that highlight critical shifts in system stability. Moreover, the ensemble learning approaches XGBoost and Random Forest, as well as Support Vector Machine and Artificial Neural Network models, were developed for methane yield prediction. The SHAP-based XAI identified FOS/TAC, volatile solids (VS), and moisture content (MC) as the most influential predictors of methane yield, while PCA explained 74% of the data variance in three principal components (PCs), with PC1 dominated by VS, MC, and temperature as key drivers of methane yield. K-means clustering uncovered three distinct operational clusters, offering actionable guidance for feedstock management and process stabilization. Feedstock regression further established municipal solid waste (MSW) as the optimal input for maximizing methane output, with processed organic waste (POW) serving as an effective co-substrate.
XGBoost achieved the best performance with an RMSE value of 1.18, followed by Random Forest (RMSE = 1.83), demonstrating the robustness of ensemble models in handling non-linear operational datasets. The research methodology is limited by its reliance on past operational data from a single digester and a lack of direct optimization experiments. However, the research strongly demonstrates the potential of data-driven approaches not only as powerful standalone tools but also as vital complements to experimental investigations. By transforming raw plant data into actionable intelligence, this study offers a scalable methodology for improving energy recovery, enhancing process control, and guiding sustainable development in industrial-scale bioenergy applications.
Introduction
Recovery of energy from waste is a suitable remedy to the problem of handling the enormous volume of waste produced, since it keeps a significant amount of waste out of landfills, which is believed to be among the biggest anthropogenic activities in the world (Adeleke et al., 2021). Waste has an energetic value of 0.87 GJ/ton, which can be harnessed worldwide, and 1.25 GJ/ton is expected by 2021 (Awasthi et al., 2019). The estimated global energy consumption was 520 quadrillion BTUs in 2010, and by 2040, a rise of 56% of the current level is expected (Khan et al., 2017). Due to the present status of the world’s economy and rapidly expanding population, worldwide energy consumption is currently greater than supply, and it will likely peak at levels that are more than six times higher than the existing capacity (Awasthi et al., 2019). The quest for greener and cleaner energy sources is necessary because fossil fuels are quickly running out and have an adverse impact on the environment (Dadak et al., 2016). A key element in ensuring the circular economy is the waste-to-energy, particularly bioenergy exploration (Adeleke & Jen, 2022). Strategic planning in large-scale biogas enterprises has improved due to the development of artificial intelligence, as exhibited in operational resource planning, methane yield forecasts, and optimization. Bioenergy, particularly biogas obtained from anaerobic digestion (AD), has emerged as a viable renewable energy resource owing to its dual benefit of energy recovery and organic waste reduction. However, the complex non-linear interconnectivity among feedstock composition, process parameters, and microbial dynamics renders optimization a challenge. Moreover, traditional mechanistic models demand significant calibration and frequently do not account for real-time variability at an industrial scale (Tsai et al., 2021). 
To address this, Machine learning (ML)-based data analytics have demonstrated efficacy in handling non-linearities and high-dimensional data for intelligent feedstock management and real-time decision-making. These techniques have facilitated enhanced strategic planning in large-scale biogas production, advancing circular economy goals and sustainable energy transitions. Recent studies have established that ML models outperform regression and kinetic models in biogas or methane yield predictions under varied operating situations (Li et al., 2022; Mohamed et al., 2025; Rutland et al., 2023; Sun et al., 2023).
The literature is replete with abundant research that has utilized data-driven methods, including regression analysis, to improve predictions of biogas process output. These studies highlight the significance of regression modeling, dimensionality reduction, feature ranking, and real-time data application in enhancing the efficiency and scalability of AD systems. Li et al., (2022) employed inverse design and data-driven optimization to model the AD processes, achieving high CH4 yields and low prediction error. Paladino (2022) applied control strategies and data-driven models to improve biogas quality and production from high solids AD. This study emphasizes the significance of process automation and control in enhancing biogas production efficiency. Rossi et al. (2022) developed a multiple linear regression (MLR) model to forecast specific methane production (SMP) from the dry AD of organic fraction municipal solid waste (OFMSW). Using 332 observations, Pearson correlation and Principal Component Analysis (PCA) identified six key variables influencing SMP. Kim et al. (2022) employed PCA and data smoothing to prioritize critical parameters influencing AD in a pilot-scale digester processing food waste. The technique facilitates the early identification of operational issues by emphasizing crucial process variables. The research conducted by Rutland et al., (2023) emphasized the significance of feature importance assessment (FIA). The study indicated that pH frequently appeared as the most significant variable affecting biogas output in many studies, highlighting the importance of identifying and prioritizing essential process parameters to enhance prediction models. The study by Schroer and Just (2024) employed feature engineering to analyze biogas flow from municipal anaerobic co-digestion utilizing high-resolution SCADA data. SHAP-based FIA indicated biogas flow history and high-strength waste (HSW) flow as the primary predictors.
In recent years, data-driven intelligence fueled by advancements in artificial intelligence (AI) has significantly enhanced data-driven decisions, predictive capability, driving a growing interest in smart solutions to the challenges in bioenergy research. Traditional mathematical models, while valuable, often require extensive time and resources. To address these challenges, researchers have turned to AI techniques. Salehi et al. (2022) developed k-nearest neighbors (k-NN) and support vector machine (SVM) models to forecast biogas generation from spent mushroom compost. SVM surpassed k-NN (k = 2), attaining superior R2 values (up to 0.9989) and substantially reduced RMSE. Modeling attempts in biogas research have frequently employed Artificial Neural Networks (ANN) and genetic algorithms (GA) for prediction and optimization of the process (Khaled & Hosseini, 2015). Essienubong et al., (2020) explored ANFIS for modeling and optimizing small-scale digesters fed with food waste and pig slurry, while Nair et al. (2016) utilized ANN to predict biomethane yield from paper and food waste in small reactors. Furthermore, applications were extended to Upflow Anaerobic Sludge Blanket (UASB) reactors, where an ANN model successfully predicted biogas and biomethane outputs from potato starch process wastewater with high accuracy [Index of Agreement (IA) = 0.9806, Coefficient of determination (R2) = 0.9793] (Antwi et al., 2017). Moreover, Olatunji et al. (2022) evaluated the performance of ANFIS and RSM in optimizing biomethane production from pre-treated Arachis hypogea shells. Olatunji et al. (2024) developed a hybrid model of an Adaptive Neuro-Fuzzy Inference System (ANFIS) and Genetic Algorithm (GA) to forecast cumulative biogas generation from food and farm waste (FFV). 
A study established a voting model that included various ML techniques, attaining a coefficient of determination (R2) of 0.778, underscoring the efficacy of ensemble methods in predicting biogas output (Sun et al., 2023). Mohamed et al., (2025) introduced a hybrid quantum–classical machine learning algorithm to forecast biogas production from full-scale AD plants via quantum circuit learning. The QCL model attained great accuracy (R2 = 0.959), comparable to MLP performance while utilizing fewer parameters and exhibiting superior scalability.
The drive toward enhanced energy recovery and economically viable waste-to-energy processes requires advanced and smarter techniques to optimize methane yield in large-scale biodigesters. While conventional experimental methods have established fundamental knowledge, complex real-time operations of biodigesters necessitate intelligent, data-driven strategies that can comprehend and interpret system behavior, identify hidden patterns and performance anomalies, and facilitate process optimization. Despite the rapid growth of data-driven and ML methods in their black box nature for biogas optimization, the interpretability of the features' impact and the multicollinearity challenge among the bio-digestion variables have not been addressed. Moreover, most existing research focuses on small-scale digesters or laboratory-controlled conditions, missing the robustness needed for industrial-scale biodigesters with feedstock heterogeneity, parameter dependency, and real-time unpredictability. In addition, little or no attention has been given to the potential of unsupervised clustering in the biogas research space to identify operational regimes, groups, or clusters that guide real-time process stability and control. This research fills this significant gap by developing a robust, explainable, and scalable data-driven framework that reveals the multivariate structure, operational regimes of the biodigester system, and interpretable impact of digestion parameters, offering explanatory power that enhances large-scale bioenergy exploration. The specific contributions of this research are as follows:
A SHapley Additive exPlanations (SHAP)-based explainable AI (XAI) for feature interpretability and assessment to interpret the influence of biodigester parameters on methane yield.
A feedstock regression model for evaluating the influence of specific substrate types on methane yield, informing tailored co-digestion techniques.
A PCA-driven multicollinearity reduction to capture the key variance structure of the dataset in fewer uncorrelated principal components.
A k-means clustering of bio-digestion operations that offers actionable insights into stable, transitional, and failure-prone operational regimes, facilitating adaptive process control.
Predictive models for methane yield using ensemble learning approaches (XGBoost, Random Forest) as well as SVM and ANN.
A scalable and interpretable decision-support framework suitable for industrial biodigester systems to enhance energy efficiency and system management.
This research promotes energy efficiency and waste reduction, aligning with the global focus on minimizing waste generation and optimizing energy and resource utilization. This study enhances the broader discourse on sustainable development and cleaner production processes, promoting environmental sustainability across various sectors.
Materials and methods
Case study and data collection
This study utilizes real-time operational data from a large-scale bio-digestion facility located in Western Cape Province of South Africa. Bio-digestion is expanding on a commercial scale and a local scale in South Africa, which produces organic waste with a substantial amount of bioenergy resources. The case study plant is designed to process about 600 tons of waste per day into energy and useful by-products. The plant generates approximately 760 Nm3/h of biomethane, 18 tons of CO₂ per day, 100 tons of organic fertilizer daily, and up to 200 tons of refuse-derived fuel (RDF). The map in Fig. 1 illustrates the geographical location of the biodigester plant.
[See PDF for image]
Fig. 1
Map showing the location of the case study biodigester plant in Cape Town, South Africa
The dataset for this study was collected from 8 months of operational data, comprising about 120 data samples of biomethane production parameters. It encompasses key biochemical and physicochemical variables such as temperature, pH, total solids (TS), volatile solids (VS), moisture content (MC), and FOS/TAC influencing methane yield. Table 1 summarizes the plant’s core configuration.
Table 1. Plant configuration
| Parameter | Value |
|---|---|
| Suspension tank | 16 m × 7 m |
| Post digestion tank | 32 m × 9 m |
| Substrate | 200 tons of organic mix |
| Production capacity | 1200 Nm3 of biogas/hour |
| Organic loading rate | 150–175 tons daily |
| Additives | Recycled concentrate and water |
| Particle size | Varies based on substrate size |
| Hydraulic retention time | 28 days |
The dataset includes the biodigester's key parameters, whose statistical characteristics are shown in Table 2. This dataset offers a solid basis for the statistical analysis, feature ranking, feedstock regression, and predictive modeling in this study. This research demonstrates how optimized biodigester operations can achieve enhanced methane production results under commercial-scale settings.
Table 2. Statistical breakdown of the data for the model
| Statistic | Temp (°C) | pH | TS (%) | VS (%) | MC (%) | FOS/TAC | CH4 (%) |
|---|---|---|---|---|---|---|---|
| Role | Input | Input | Input | Input | Input | Input | Output |
| Minimum | 36.30 | 7.10 | 0.74 | 58.50 | 42.00 | 0.09 | 45.80 |
| Maximum | 40.50 | 8.25 | 16.00 | 77.75 | 99.30 | 0.27 | 68.60 |
| Average | 38.50 | 7.68 | 8.37 | 68.13 | 70.65 | 0.18 | 57.20 |
| St.D | 2.97 | 0.81 | 10.79 | 13.61 | 40.50 | 0.12 | 16.12 |
Statistical and computational analysis framework
Operational parameter profiling
The linear interrelationship among the plant variables was analyzed using a Pearson correlation matrix and visualized using the correlation heat-map. This expresses the potential co-linearity among the variables as well as the positive and negative correlation of the input parameters to the methane yield. Also, the pair plots, which integrate scatter plots and Kernel density estimation curves, were generated to illustrate both individual feature distributions and their bivariate interactions. Univariate scatter plots were generated to isolate the distinct impact of each operational parameter on methane yield.
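As a minimal sketch of the correlation step described above, the snippet below computes Pearson coefficients for every pair of named columns using only the Python standard library. The column values are hypothetical readings chosen for illustration (MC is constructed as 100 − TS), not the plant data, and the helper names `pearson` and `correlation_matrix` are assumptions of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical readings (illustration only, not the plant dataset).
data = {
    "TS": [6.0, 8.0, 10.0, 12.0, 14.0],
    "MC": [94.0, 92.0, 90.0, 88.0, 86.0],   # constructed as 100 - TS
    "CH4": [58.0, 60.5, 57.0, 59.0, 56.5],
}
corr = correlation_matrix(data)
print(corr[("TS", "MC")], corr[("TS", "CH4")])
```

The resulting dictionary holds the same values a heat-map would display; perfectly complementary columns such as the constructed TS and MC give a coefficient of −1.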
Principal component analysis
Principal Component Analysis (PCA) is a mathematical method employed to reduce the dimensions required to effectively represent the features of data matrices. This approach represents the original matrix through an array of new uncorrelated variables known as principal components (PCs), which retain most of the variance in the biodigester dataset. The covariance matrix $C$ is estimated from the mean-centered dataset using Eq. 1:

$$C = \frac{1}{N-1} X_c^{\top} X_c \tag{1}$$

where $X_c$ is the mean-centered data matrix, $C$ is the covariance matrix, and $N$ is the number of observations. The eigendecomposition can be solved using Eq. 2:

$$C v_i = \lambda_i v_i \tag{2}$$

where $\lambda_i$ is the eigenvalue of the $i$-th PC, while $v_i$ is the corresponding eigenvector. Each PC accounts for a segment of the total variance, and the explained variance ratio (EVR) is computed as in Eq. 3:

$$\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \tag{3}$$

where $p$ is the number of features. To determine the PC scores, the mean-centered data are projected onto the chosen eigenvectors using Eq. 4:

$$Z = X_c V_k \tag{4}$$

where $V_k$ is the matrix of the top $k$ eigenvectors.
Operational clusters analysis (k-means clustering)
The k-means clustering approach was applied in this study to unveil hidden operational patterns in the biodigester data. K-means is a clustering technique that employs an iterative, centroid-based algorithm to partition observations into clusters, where each observation is assigned to the cluster with the nearest centroid (cluster mean). K-means minimizes the within-cluster sum of squares (WCSS) as in Eq. 5:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \tag{5}$$

where each data point is denoted as $x_i$, the set of points in cluster $j$ as $C_j$, and the centroid of cluster $j$ as $\mu_j$. In the standard iterative approach, the centroids are randomly initialized, and the Euclidean distance in Eq. 6 is used so that each data point is assigned to the cluster with the nearest centroid $\mu_j$.

$$d(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2 = \sqrt{\sum_{l=1}^{p} (x_{il} - \mu_{jl})^2} \tag{6}$$

Equation 7 defines the cluster assignment, after which the cluster centroids are recalculated as the mean of all data points inside each cluster according to Eq. 8. Iterations continue until the change in WCSS between successive iterations falls below a threshold.

$$c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 \tag{7}$$

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \tag{8}$$
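The assignment and update steps of k-means can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the generic Lloyd iteration under stated assumptions, not the code used in the study: the function names, the stopping tolerance, the seed, and the one-dimensional toy observations are all hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initialization from the data
    prev_wcss = float("inf")
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if prev_wcss - wcss < tol:             # stop once WCSS no longer improves
            break
        prev_wcss = wcss
    return labels, centroids, wcss

# Hypothetical one-dimensional observations forming two well-separated groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels, centroids, wcss = kmeans(points, k=2)
print(labels, centroids, wcss)
```

Because the starting centroids are random, cluster indices are arbitrary; only the grouping itself is meaningful. On real multivariate data the features would normally be standardized first so that no single variable dominates the distance.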
Regression modeling of feedstock contribution
This study evaluates the individual contribution of each feedstock to methane yield and provides optimization strategies for enhanced biogas production. A linear regression model in Eq. 9 was developed using the biodigester dataset to establish a mathematical relationship between the methane fraction and the proportions of three feedstock types used in the biodigester.
$$Y_{CH_4} = \beta_0 + \beta_1 x_{MSW} + \beta_2 x_{POW} + \beta_3 x_{LPOW} + \varepsilon \tag{9}$$

where $Y_{CH_4}$ represents methane yield, $\beta_0$ stands for the intercept representing the baseline methane yield when all feedstocks are zero, $\beta_1$, $\beta_2$, and $\beta_3$ are regression coefficients indicating the contribution of each feedstock ($x_{MSW}$, $x_{POW}$, and $x_{LPOW}$) per unit increase, while $\varepsilon$ (error term) captures variability not explained by the model. The feedstock input data were normalized for mass flow rate discrepancies, temporally aligned with methane output, and checked for multicollinearity for stable, interpretable coefficients, before the regression modeling.
SHapley Additive exPlanations (SHAP)
To provide an interpretable explanation of the influence of the digestion parameters on methane yield, a SHapley Additive exPlanations (SHAP)-based feature ranking framework was employed. SHAP offers a model-agnostic methodology for decomposing a machine learning model's prediction into the additive impacts of individual features, ensuring both global and local interpretability (Tempel et al., 2025). For a model $f$ with input feature $i$, the Shapley value $\phi_i$ is computed as in Eq. 10:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right] \tag{10}$$

where $\phi_i$ is the Shapley value representing the importance of feature $i$, $M$ denotes the number of features, $F$ represents the set of all input features in the dataset, and $S$ is a subset of $F$ that excludes feature $i$. The SHAP algorithm's basic principle is that the sum of all feature contributions equals the model's predicted value minus the baseline, as in Eq. 11:

$$\sum_{i=1}^{M} \phi_i(x) = f(x) - \phi_0 \tag{11}$$

The value of $\phi_i(x)$ signifies the extent to which feature $i$ influences the model's prediction relative to the baseline for the data instance $x$. The baseline value, $\phi_0$, represents the expected model output over the dataset.
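To make the additive decomposition concrete, the sketch below computes exact Shapley values by enumerating every feature coalition for a small toy model. It is an illustration under simplifying assumptions and not the SHAP library's estimator: features outside a coalition are replaced by a single hypothetical reference point instead of being averaged over a background dataset, and `toy_model`, its coefficients, and the input values are invented for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    m = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(m)]
        return model(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            # Weight |S|! (M - |S| - 1)! / M! for coalitions of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical methane model of [FOS/TAC, VS, MC]; not the fitted model."""
    fos_tac, vs, mc = z
    return 50.0 - 20.0 * fos_tac + 0.1 * vs + 0.02 * mc + 0.001 * vs * mc

x = [0.25, 70.0, 80.0]          # instance to explain (assumed values)
baseline = [0.18, 68.0, 70.0]   # reference point (assumed values)
phi = shapley_values(toy_model, x, baseline)
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi), gap)
```

The contributions sum to the difference between the prediction and the baseline output, which is the additivity property. Exhaustive enumeration grows exponentially with the number of features, so it is only practical for a handful of inputs.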
Extreme gradient boosting
Extreme Gradient Boosting (XGBoost) is a form of ensemble machine learning created by Tianqi Chen (Chen & Guestrin, 2016) with a working principle based on gradient boosted decision trees. It produces decision trees progressively, with each subsequent tree rectifying the residuals of the preceding trees via gradient optimization (Zaidi et al., 2025). XGBoost commences with a preliminary prediction, often the mean of the target variable. It subsequently computes the residuals, defined as the discrepancies between the actual target values and the initial prediction. These residuals signify the errors in the existing prediction. In each iteration, XGBoost constructs a new decision tree that concentrates on learning the residuals. The objective of each subsequent tree is to anticipate the errors from the preceding stage, thereby rectifying the model’s inaccuracies (Mustaffa & Sulaiman, 2025).
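The residual-fitting idea can be illustrated with a deliberately simplified boosting loop. This sketch is plain gradient boosting with squared-error loss and one-split regression stumps; it omits the regularization and second-order terms that distinguish XGBoost, and the function names, learning rate, number of rounds, and data values are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi: lm if xi <= thr else rm

def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)                  # initial prediction: mean of the target
    trees, pred = [], [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)      # each new tree learns the current residuals
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

# Hypothetical FOS/TAC ratios and methane fractions (illustration only).
x = [0.10, 0.12, 0.15, 0.18, 0.20, 0.24, 0.27]
y = [66.0, 65.0, 62.0, 58.0, 57.0, 50.0, 47.0]
model = boost(x, y)
mse_before = mse(y, [sum(y) / len(y)] * len(y))
mse_after = mse(y, [model(xi) for xi in x])
print(mse_before, mse_after)
```

Each round shrinks the training error a little because the new stump is fitted to what the current ensemble still gets wrong; the learning rate controls how much of each correction is applied.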
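Given a target $y_i$, an input feature vector $x_i$, and a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, Eq. 12 depicts the prediction at iteration $t$.

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{12}$$

The objective function minimized at each iteration integrates a differentiable loss function $l$ and a regularization term $\Omega$.

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{13}$$

with

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{14}$$

where $T$ represents the number of leaves in a tree, $w$ denotes the weights of the leaves, $\gamma$ regulates model complexity, and $\lambda$ implements L2 regularization.
Random forest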
Random Forest (RF) is another ensemble ML approach that builds a “forest” of decision trees during training and outputs the mode of the classes in a classification task or the mean prediction of the individual trees in a regression task. The RF algorithm creates an ensemble of decision trees, each trained on a bootstrap sample of the dataset. Given an input vector $x$, a single tree $b$ produces the prediction $h_b(x)$. For regression problems, the recursive binary splits are chosen using impurity metrics such as the mean squared error (MSE) in Eq. 15:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{15}$$

The actual and predicted values of the biomethane yields are denoted as $y_i$ and $\hat{y}_i$, respectively. The RF prediction is derived by averaging the outputs of all $B$ trees using Eq. 16:

$$\hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} h_b(x) \tag{16}$$
The stochastic feature selection at each split stimulates diversity among the trees, diminishing correlation and thereby reducing model variance. The generalization error of Random Forests converges as B approaches infinity, constrained by the efficacy of individual trees and their correlation.
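The combination of bootstrap sampling, random feature selection, and averaging can be sketched with a toy forest of one-split trees. This is a simplified illustration rather than a full Random Forest: each tree is a single stump split on one randomly chosen feature, and the function names, tree count, seed, and data values are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Regression stump on one feature; falls back to the mean if no split helps."""
    mean = sum(targets) / len(targets)
    best = (sum((t - mean) ** 2 for t in targets), None, mean, mean)
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    if thr is None:
        return lambda r: mean
    return lambda r: lm if r[feature] <= thr else rm

def random_forest(rows, targets, n_trees=200, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample with replacement
        feature = rng.randrange(p)                    # random feature for this tree's split
        trees.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return lambda r: sum(t(r) for t in trees) / len(trees)   # average over all trees

# Hypothetical (FOS/TAC, VS %) rows and methane fractions (illustration only).
rows = [(0.10, 76.0), (0.13, 74.0), (0.16, 71.0), (0.19, 68.0), (0.22, 64.0), (0.26, 60.0)]
targets = [66.0, 64.0, 61.0, 57.0, 52.0, 47.0]
forest = random_forest(rows, targets)
print(forest(rows[0]), forest(rows[-1]))
```

Averaging many trees built on different resamples and features smooths out the idiosyncrasies of any single tree, which is the variance-reduction effect described above.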
Support vector machine
Support Vector Machine (SVM) closely resembles discriminant analysis and is typically used to split data by the levels of a categorical variable after the individuals in a training set are mapped into an $n$-dimensional space (Pilloud et al., 2018). SVMs are not probabilistic, which means that the probability that any given classification is right cannot be established, even though they generate accurate estimates (Cortes & Vapnik, 1995). Despite the major benefits of SVM techniques, such as their simplicity, computational efficiency, and capacity to be trained with a small amount of data, determining their optimal kernel and parameters is a significant challenge (Shawe-Taylor & Sun, 2011). The fundamental concept of SVM is to maximize the geometric margin between two classes while concurrently minimizing the empirical classification error. For a regression problem like the biomethane production in this study, the objective function in Eq. 17 is minimized to model the relationship between the bio-digestion and pretreatment conditions and biomethane yield.

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \tag{17}$$

subject to

$$\begin{cases} y_i - \left(w^{\top} \varphi(x_i) + b\right) \le \varepsilon + \xi_i \\ \left(w^{\top} \varphi(x_i) + b\right) - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \tag{18}$$

where $y_i$ denotes the biomethane yield and $x_i$ represents the input variables (digestion operating conditions), $w$ and $b$ are the weight vector and bias, and $\varphi$ is the kernel feature map. The slack variables for points outside the insensitive tube are represented as $\xi_i$ and $\xi_i^*$, $\varepsilon$ is the tube width, while $C$ represents the regularization parameter.
Hyperparameter settings and performance evaluations
To achieve a robust prediction of biomethane yield, hyperparameters of the ensemble learning methods (XGBoost and Random Forest), the MLP, and the SVM were carefully selected. Random Forest was utilized as a bagging ensemble of decision trees, mitigating variation via bootstrap sampling and random feature selection. Key parameters were optimized for the Random Forest model. The was set at 500, while was set for the which defines the limit for features when splitting nodes. was set for and respectively. Some of the hyperparameters optimized for the XGBoost model include a learning rate of 0.05, of 4, and of 500. A 2-hidden-layer MLP architecture with 64 and 32 neurons, respectively, was developed with a Levenberg–Marquardt training algorithm and Rectified linear activation function. The learning rate of SVM was set at 0.001. A radial basis function (RBF) kernel function, a regularization parameter (C) of 100, and epsilon of 0.1 were all set for the SVM model. The prediction performance of the developed machine learning was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Deviation (MAD), Mean Absolute Percentage Error (MAPE), and relative mean bias error (rMBE), and computed using Eqs. 19, 20, 21, 22, 23:
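$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{19}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{20}$$

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert e_i - \bar{e} \rvert, \quad e_i = y_i - \hat{y}_i \tag{21}$$

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \tag{22}$$

$$\mathrm{rMBE} = \frac{100}{\bar{y}} \cdot \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \tag{23}$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted biomethane yields, $n$ is the number of observations, $\bar{y}$ is the mean of the actual values, and $\bar{e}$ is the mean of the residuals $e_i$.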
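The error metrics named above can be sketched as small functions. The definitions follow common conventions and are assumptions of this sketch, particularly for MAD, which is taken here as the mean absolute deviation of the residuals about their mean, and for rMBE, which is expressed as a percentage of the mean actual value. The sample values are hypothetical.

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def mad(y, yhat):
    # Assumed definition: mean absolute deviation of residuals about their mean.
    errors = [a - p for a, p in zip(y, yhat)]
    mean_error = sum(errors) / len(errors)
    return sum(abs(e - mean_error) for e in errors) / len(errors)

def mape(y, yhat):
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y, yhat)) / len(y)

def rmbe(y, yhat):
    # Assumed definition: mean bias (predicted minus actual) as % of the mean actual value.
    bias = sum(p - a for a, p in zip(y, yhat)) / len(y)
    return 100.0 * bias / (sum(y) / len(y))

# Hypothetical actual and predicted methane fractions (illustration only).
actual = [60.0, 50.0, 55.0, 65.0]
predicted = [58.0, 52.0, 55.0, 66.0]
scores = {f.__name__: f(actual, predicted) for f in (rmse, mae, mad, mape, rmbe)}
print(scores)
```

A positive rMBE under this sign convention means the model over-predicts on average, while RMSE penalizes large individual errors more heavily than MAE.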
Results and discussion
Operational parameter profiling for enhanced biomethane recovery
A comprehensive understanding of the interplay between the significant operating conditions and the complex biochemical process within the biodigester plant provides a useful insight into the methane production trends. A Spearman correlation analysis between the critical operating parameters was performed and visualized using the heat-map (see Fig. 2). These observed correlations underscore the multifactorial nature of AD, where no single parameter exclusively dictates methane yield but rather a collective interaction of physical, chemical, and biological factors. Figure 2 shows the magnitude and direction of the linear correlation between the key operating conditions and methane yield. It was observed that temperature, with a correlation value of 0.2, depicts a weak positive relationship with the methane yield. This implies that within the operation context of the biodigester, temperature exhibits little impact on the biomethane yield. Beyond this temperature–methane yield correlation, a better correlation with other parameters was noted. Its correlation with VS (0.58) and MC (0.41) suggests that increased temperatures may facilitate the decomposition of organic matter and improve moisture retention (Conant et al., 2011). However, temperature’s negative correlation with TS (−0.61) and FOS/TAC (−0.44) indicates that elevated temperatures may facilitate decomposition, hence decreasing solids and potentially enhancing system stability (lower FOS/TAC) (Liu & Lv, 2016). Similarly, TS exhibit a little negative connection with CH4 (−0.07), implying that elevated solid concentrations may impede biogas production by restricting bacteria access to substrates and resulting in inadequate mixing (Wang et al., 2023).
[See PDF for image]
Fig. 2
Heat-map correlation map of operating parameters and methane yield
The weak correlation of pH with the methane yield suggests that pH levels were likely maintained within the ideal range for microbial activity, particularly for methanogens, and did not fluctuate sufficiently to substantially influence methane output. While excessive moisture leads to leaching and system dilution, the slight positive correlation between MC and methane yield (0.16) suggests that sufficient moisture is essential for microbial metabolism and biogas production. VS is a significant parameter in the digestion process since it forms the biodegradable component of organic matter. Its positive correlation with methane yield, though weak (0.16), supports this theoretical biochemical link. The organic nature of VS is further expressed by its strong positive correlation (0.61) with MC. The negative correlation of FOS/TAC with CH₄ (−0.23) is practically significant. A high FOS/TAC ratio signifies acid buildup and process instability, which is harmful to methane-producing bacteria (Park et al., 2024). In summary, the varying correlations among the critical operational parameters of the plant and the yield demonstrate the complex multivariate nature of the bio-digestion process. In the context of biodigester operation in this case study, high FOS/TAC and TS inhibit methane yield, while VS, MC, and a low FOS/TAC are beneficial for an improved concentration of methane in the biogas.
Beyond the Spearman correlation analysis, the multicollinearity among the bio-digestion parameters was analyzed using the Variance Inflation Factor (VIF) (see Fig. 3). The VIF quantifies the extent to which a predictor variable can be linearly accounted for by the other predictor variables. The high VIF values for temperature (883) and pH (924) indicate that each is nearly a linear combination of other factors within the dataset. In biodigester operations, pH and temperature frequently correlate with overall process stability and are closely associated with FOS/TAC and gas quality. Such VIFs may result in very unstable individual coefficients, as well as redundancy, which may adversely affect generalization and interpretability. The moisture content has a moderately high VIF value of 50. In sludge and biomass analysis, moisture content and total solids (TS) are fundamentally complementary, and this nearly deterministic relationship drives the VIF upward. Retaining both provides minimal new information while introducing significant multicollinearity. Volatile solids exhibit a lower VIF (17) than pH, temperature, and moisture content. VS frequently exhibits a strong correlation with TS, as VS constitutes a fraction of TS (the organic component of solids), so including VS together with TS (and moisture) often results in a nearly linear subspace. FOS/TAC and total solids, with VIF values of 6.6 and 5.3, respectively, are within the “moderate” VIF range, which also signifies some redundancy that is not harmful in isolation.
[See PDF for image]
Fig. 3
VIF analysis of bio-digestion parameters
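For readers who want to reproduce a VIF calculation, the sketch below uses the identity that the VIF of predictor $j$ equals the $j$-th diagonal entry of the inverse of the predictors' correlation matrix, which is equivalent to $1/(1 - R_j^2)$. The helper names and the sample columns are hypothetical; the moisture values are constructed to be almost complementary to TS so that the inflation is visible.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[j][j] for j in range(len(columns))]

# Hypothetical readings (illustration only): MC is nearly 100 - TS.
ts = [6.0, 8.0, 10.0, 12.0, 14.0]
mc = [93.5, 92.2, 89.9, 88.1, 85.8]
temp = [37.0, 39.5, 36.5, 40.0, 38.0]
vif_all = vif([ts, mc, temp])
vif_pair = vif([ts, mc])
print(vif_all, vif_pair)
```

With only two predictors the result reduces to $1/(1 - r^2)$, which makes it easy to see why a nearly deterministic TS and MC relationship inflates both values.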
From the heat-map, the strong positive correlations between TS and MC, as well as between VS and organic loading, suggest redundancy among these variables. Furthermore, the VIF analysis indicated high multicollinearity for TS and MC, with VIF values exceeding the common threshold of 10. These two analyses indicate that TS and MC are candidate variables for omission in the simplified model owing to their overlapping information. However, all variables were retained for this study, to allow the machine learning algorithms to capture the full range of potential interaction.
The bivariate relationship between the key operating parameters of the digester and their impact on methane concentration is demonstrated using a pair plot as in Fig. 4. The diagonal plots represent kernel density estimations (KDEs) of individual variable distributions, while the off-diagonal scatter plots show pairwise relationships. The chart displays the individual interrelationship of each variable with methane yield. The methane fraction has a minor right skew, indicating that elevated methane yields occurred seldom, whereas the majority of instances are concentrated in a moderate range, likely attributable to differences in feedstock digestibility or process stability. Moreover, the temperature exhibits a tri-modal distribution, suggesting that observations were obtained under three primary thermal regimes, potentially indicative of operational testing in varying thermophilic or mesophilic circumstances. TS and VS exhibit significant variability, reflecting heterogeneous substrate compositions or feeding techniques under various operational conditions. The pH has a confined unimodal distribution, corroborating previous findings that it is maintained within a regulated optimum range (about 7.0–8.0). FOS/TAC exhibits a right-skewed distribution, indicating that the majority of samples reside within a healthy operational range, whereas instances of process imbalance are less frequent.
[See PDF for image]
Fig. 4
Pair plot chart of the biodigester operating parameters with the methane fraction
Shown in Fig. 5 is the parameter-wise influence on methane yield using a scatterplot. Each parameter affects methane production to a varying degree. Methane yield is marginally positively correlated with MC. Microbial mobility and substrate solubilization require sufficient moisture. However, the broad confidence interval shows substantial variability, suggesting that additional interacting parameters may affect this parameter. A slight negative trend is observed between TS and methane yield. Higher TS levels may increase viscosity and restrict microbial substrate accessibility, reducing methane generation efficiency. Furthermore, low TS may improve digestive kinetics and increase methane concentrations. A moderate positive correlation exists between temperature and methane yield. As the temperature rises from ~ 36 °C to 40 °C, methane content increases a little. This supports the AD behavior mechanism, where mesophilic conditions (~ 37–40 °C) promote microbial activity and methane synthesis (Deepanraj et al., 2015). Methane yield increases slightly with pH, with an ideal range of 7.5–8.0. This supports methanogenic bacteria activity in a neutral to slightly alkaline environment. Exceeding this range may impede methane-forming microorganisms. It was also observed that methane yield decreases at higher FOS/TAC ratios, which suggests imbalance and acidification. This ratio is a key stability metric whose values above 0.4 indicate overload or process failure. Thus, consistent biogas generation requires a low FOS/TAC ratio.
[See PDF for image]
Fig. 5
Parameter-wise influence on biogas methane fraction
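The strength and direction of each pairwise trend discussed above can be quantified with a Pearson correlation coefficient. The sketch below is a minimal illustration and not the analysis code of this study; the readings are synthetic values chosen only to show a negative FOS/TAC trend.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Synthetic, illustrative readings only (not plant data).
fos_tac = [0.20, 0.25, 0.30, 0.35, 0.45, 0.55]
ch4_pct = [62.0, 61.0, 60.5, 59.0, 55.0, 51.0]

r = pearson_r(fos_tac, ch4_pct)
print(f"r(FOS/TAC, CH4) = {r:.2f}")
```

A coefficient close to −1 corresponds to the kind of decreasing trend described for FOS/TAC, while values near zero would match the weak MC and TS trends.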
Impact of feedstock type and composition on biogas production
The efficiency of the bio-digestion process and the quality of the biogas generated are highly contingent on the type and composition of the feedstock used. The case study biodigester processes three primary feedstocks: Municipal Solid Waste (MSW), Processed Organic Waste (POW), and liquid POW. This section provides statistical insights into the effect of feedstock type on methane yield. Preliminary analysis of the dataset reveals that the CH4 concentration varies from 45.80% to 68.60%, with an average of 57.20%. Additionally, the TS and VS compositions differ significantly across feedstock types, suggesting possible variation in their digestibility. The regression coefficients for MSW, POW, and liquid POW indicate the change in the methane concentration of the biogas per unit increase in each feedstock. Fig. 6 presents the result of the regression model, showing that MSW has the highest positive coefficient. This means that an increase in MSW input significantly enhances CH4 production. This feedstock possesses the highest concentration of organic matter with good biodegradability, making it efficient for the bio-digestion process. The moderate positive coefficient for POW indicates that increasing POW moderately improves the CH4 concentration in the biogas, though less than MSW. It may therefore be effective as a supplementary feedstock. The negative coefficient for liquid POW indicates a reduction in CH4 concentration as the quantity of liquid POW increases, likely because it lowers the organic load and microbial efficiency.
[See PDF for image]
Fig. 6
Regression coefficient of different feedstocks on the methane production
Based on this outcome, MSW is recommended as the principal feedstock at a concentration of around 60–70% for optimal methane production. Furthermore, POW should be applied in modest quantities (about 20–30%) to improve digestion. To avert excessive dilution that may impede the digestive process, the utilization of liquid POW should be limited.
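A feedstock regression of this kind can be sketched as an ordinary least-squares fit of methane concentration on the three feedstock quantities. The example below is a minimal pure-Python illustration, not the regression used in this study: the feed quantities, units, and coefficients are hypothetical, and the response is generated without noise so that the fit recovers the assumed coefficients.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(p)] for i in range(p)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(p)]
    return solve_linear(xtx, xty)

# Hypothetical daily feed quantities (MSW, POW, liquid POW); not plant data.
rows = [(60, 25, 15), (70, 20, 5), (50, 30, 25), (65, 10, 5),
        (55, 20, 30), (40, 35, 10), (80, 15, 20)]
# Assumed coefficients used only to generate a noiseless response.
true_coef = [50.0, 0.12, 0.05, -0.08]
y = [true_coef[0] + sum(c * v for c, v in zip(true_coef[1:], r)) for r in rows]

coef = fit_ols(rows, y)
for name, value in zip(["intercept", "MSW", "POW", "liquid POW"], coef):
    print(f"{name}: {value:.3f}")
```

The sign pattern of the assumed coefficients mirrors the qualitative finding above (positive for MSW and POW, negative for liquid POW); the magnitudes carry no meaning for the real plant.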
Dimensionality reduction of relevant plant features using PCA
The VIF analysis in the “Operational parameter profiling for enhanced biomethane recovery” section revealed substantial multicollinearity among the bio-digestion parameters, particularly between pH and temperature, with VIF values exceeding 800, and strong redundancy among the solids-related parameters. Such high VIFs signify unstable coefficient estimates and redundant information. PCA was therefore applied to transform the correlated variables into orthogonal components, diminishing redundancy and capturing the underlying process variability more consistently for modeling and interpretation. The PCA was further used to reveal underlying structures and significant interactions within the biodigester dataset, and it provided a compact representation of the variance by identifying the principal components that account for most of the variability. Table 3 and Fig. 7 present the results of the PCA. Table 3 presents the PCA loading matrix of the critical variables influencing bio-digestion efficiency, and Fig. 7a shows the explained variance and the cumulative variance for each component, expressed in percentage. PC1 accounts for approximately 45.5% of the variance, indicating a strong correlation between some key parameters of the digester. Parameters such as VS, MC, and temperature substantially affect biogas yield due to their direct influence on microbial activity and digestion efficiency, thus necessitating close monitoring of these parameters. The second component accounts for 14.9%, depicting additional variability in the dataset. This principal component reflects parameters that significantly influence the stability of the reactor and microbial health, including pH and the FOS/TAC ratio. These parameters may exert unique influences on digestion processes independent of the main group. To achieve stability in the reactor’s operation and optimal microbial performance, these parameters need to be monitored. The third principal component, with 13.9% of the variance, expresses subtle interactions among parameters possibly related to the overall balance within the biodigester system. The dimensionality of the biodigester dataset has thus been reduced to 3 components (PC1 to PC3), collectively accounting for approximately 74.4% of the variance. The PCA outcome has a significant implication for monitoring biodigester operation by directing focus toward parameters that load strongly on PC1 and PC2.
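For reference, the VIF threshold quoted above can be read through the standard definition of the variance inflation factor, which is not restated in this section. Here $R_j^2$ denotes the coefficient of determination obtained when parameter $j$ is regressed on all remaining parameters (notation assumed for illustration).

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 800 \;\Longleftrightarrow\; R_j^2 > 1 - \frac{1}{800} = 0.99875
```

Under this definition, a VIF above 800 means that more than 99.8% of the variance of that parameter is linearly explained by the others, which is why orthogonal components are preferred for modeling.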
Table 3. PCA matrix illustrating the influence of biodigester parameters on the dataset variance
| Parameter | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| Temperature | −0.439 | −0.138 | 0.086 | 0.091 | −0.846 | 0.238 | 0.013 |
| pH | −0.191 | −0.151 | −0.853 | 0.413 | 0.036 | −0.171 | 0.113 |
| TS | 0.491 | 0.037 | −0.209 | −0.203 | −0.183 | 0.391 | 0.697 |
| VS | −0.469 | −0.157 | 0.149 | −0.346 | 0.116 | −0.498 | 0.593 |
| MC | −0.386 | −0.088 | −0.309 | −0.599 | 0.262 | 0.522 | −0.215 |
| FOS/TAC | 0.388 | −0.326 | −0.242 | −0.507 | −0.357 | −0.446 | −0.319 |
| CH4 | −0.099 | 0.904 | −0.212 | −0.210 | −0.198 | −0.205 | −0.040 |
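To illustrate how the loadings in Table 3 are used, the sketch below projects one standardized observation onto PC1 and checks that the PC1 loading vector has approximately unit length, as expected for PCA eigenvectors. The PC1 loadings are copied from Table 3; the observation is a hypothetical z-scored sample introduced only for illustration.

```python
import math

# PC1 loadings from Table 3
# (order: Temperature, pH, TS, VS, MC, FOS/TAC, CH4).
PC1 = [-0.439, -0.191, 0.491, -0.469, -0.386, 0.388, -0.099]

def pc_score(z, loadings):
    """Project one standardized observation z onto a loading vector."""
    if len(z) != len(loadings):
        raise ValueError("observation and loading vector differ in length")
    return sum(zi * li for zi, li in zip(z, loadings))

# Euclidean length of the loading vector (should be close to 1).
norm = math.sqrt(sum(l * l for l in PC1))

# Hypothetical standardized (z-scored) observation, for illustration only.
z_example = [0.5, 0.0, -1.0, 0.8, 0.6, -0.7, 0.2]
score = pc_score(z_example, PC1)
print(f"|PC1| = {norm:.3f}, PC1 score = {score:.3f}")
```

Because TS and FOS/TAC load positively on PC1 while temperature, VS, and MC load negatively, a sample with low TS and high VS and MC receives a negative PC1 score.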
[See PDF for image]
Fig. 7
PCA results showing (a) explained and cumulative variance, (b) scree plot of the eigenvalues for each principal component, and (c) PCA biplot
Figure 7b shows the scree plot which depicts the eigenvalues for each principal component with a reference line at eigenvalue = 1 to indicate significant components. Figure 7c expresses the contributions of the original parameters to the principal components. The methane concentration is significantly associated with TS, indicating that organic content increases methane generation. The moderate alignment of pH and temperature suggests they affect microbial activity. VS and MC negatively correlate with FOS/TAC (stability ratio), showing that moisture decreases digester stability.
Operational cluster analysis of the biodigester using k-means clustering
The dynamics of the bio-digestion process were further illustrated using a k-means cluster analysis visualized via the PCA. The k-means algorithm unveils the natural groups within the operational data of the biodigester by considering all the critical operating parameters. To determine the optimal number of clusters (k) for grouping the operational data, the elbow method was applied. Figure 8 visualizes the elbow analysis, with the within-cluster sum of squares (WCSS) plotted against the number of clusters. A sharp decline was observed in the WCSS values up to k = 3, beyond which the curve begins to level out more gradually. Although further clusters decrease the WCSS, the marginal advantage lessens after k = 3. This suggests that the optimal partitioning of the operational data is achieved at k = 3, implying that the critical structural regimes in the operating patterns can be sufficiently captured in 3 clusters.
[See PDF for image]
Fig. 8
Elbow method for the optimal number of clusters (k)
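The elbow procedure described above can be sketched with a small pure-Python k-means. This is an illustrative implementation under stated assumptions, not the code used in this study: the points are synthetic two-dimensional data with three obvious groups, and a deterministic farthest-point seeding replaces the random initialization that library implementations typically use.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])  # keep the centroid of an empty cluster
        if new == centroids:
            break
        centroids = new
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss

# Synthetic points with three well-separated groups (not plant data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, w in wcss_by_k.items():
    print(k, round(w, 2))
```

On data like this the WCSS falls steeply up to k = 3 and only marginally afterwards, which is the pattern the elbow method looks for.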
To further establish the optimal number of clusters beyond the elbow method, a silhouette score analysis was performed. Silhouette analysis evaluates the separation distance among the resultant clusters by measuring how close each point within a cluster is to points in adjacent clusters. Figure 9 presents the average silhouette score against the number of clusters. A silhouette score of 0.38 at k = 2 indicates moderate clustering. The score rises to 0.41 at k = 3, the highest of all cluster numbers. This peak shows that three clusters are the most coherent and well separated. Beyond k = 3 the silhouette scores decrease, ranging between 0.30 and 0.35, which indicates weaker and less consistent cluster separation as more clusters are added.
[See PDF for image]
Fig. 9
Silhouette score analysis for the optimal number of clusters (k)
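The silhouette score described above can be computed directly from its definition. The following is a minimal sketch assuming Euclidean distance; the one-dimensional points and the two labelings are synthetic and serve only to contrast a coherent partition with a poor one.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != l
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Synthetic 1-D points (not plant data).
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(pts, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(pts, [0, 1, 0, 1])   # clusters that interleave
print(f"good = {good:.3f}, bad = {bad:.3f}")
```

Scores close to 1 indicate compact, well-separated clusters, and scores near or below 0 indicate overlapping ones, which is the scale against which the reported peak of 0.41 is judged to be moderate.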
The cluster analysis revealed 3 clusters, i.e., 3 distinct operational frameworks, each aligning with the operational situation of the bio-digestion process (see Fig. 10). The first cluster (cluster 0) likely represents the optimal and high-performance bio-digestion condition, possibly linked to stable and moderate pH levels (about 7.0–7.5), moderate to high methane concentration, and a well-balanced FOS/TAC ratio. The second group (cluster 1) possibly signals the transitional conditions, such as post-feedstock alteration or digester instability. This encompasses a reduced methane concentration, marginally varying pH values, and increasing or fluctuating FOS/TAC ratios. The last cluster (cluster 2) likely indicates a malfunction period, which is characterized by reduced methane concentration, instability consequent upon elevated moisture or volatile substances, and potential excess of organic matter or inadequate inoculum/feedstock combination. To correlate cluster transitions with operational changes, we established the feedstock in each cluster corresponding to the operational dates. From the plant operational dataset, we observed that cluster 2 often aligns with high MSW input and little/no additive usage. Furthermore, a shift from Cluster 0 to Cluster 2 could indicate overloading, an imbalance in feedstock mix, a lack of pH stabilizers, or micronutrients.
[See PDF for image]
Fig. 10
Operational clusters of the biodigester
The identified digestion clusters were connected to the operational conditions through the statistical centroids and distributions of the clustered variables, as well as through alignment with biogas process theory and empirical evidence. The centroid values of each cluster were specifically compared to established ideal limits for AD (e.g., pH 6.8–7.4, FOS/TAC < 0.4). Clusters with centroids within these ranges were deemed stable operating regimes, whereas those outside the limits (e.g., heightened FOS/TAC or diminished VS availability) were categorized as transitional or unstable. Furthermore, this interpretation is consistent with empirical evidence showing that imbalances (e.g., high FOS/TAC, acid accumulation) lead to process instability, while balanced conditions (pH near neutrality, steady VS degradation) indicate stability (Platošová et al., 2021).
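The centroid check described above can be written as a small rule. The sketch below is an assumption about how such a rule might look rather than the procedure used in this study: it uses only the two limits quoted in the text (pH 6.8–7.4 and FOS/TAC < 0.4), and the centroid values are hypothetical.

```python
def label_regime(centroid, ph_range=(6.8, 7.4), fos_tac_limit=0.4):
    """Label a cluster centroid against the AD limits quoted in the text."""
    ph_ok = ph_range[0] <= centroid["pH"] <= ph_range[1]
    fos_ok = centroid["FOS/TAC"] < fos_tac_limit
    return "stable" if ph_ok and fos_ok else "transitional or unstable"

# Hypothetical centroids for illustration; not values from the study.
centroids = {
    0: {"pH": 7.2, "FOS/TAC": 0.28},
    1: {"pH": 7.3, "FOS/TAC": 0.45},
    2: {"pH": 7.6, "FOS/TAC": 0.62},
}

regimes = {cid: label_regime(c) for cid, c in centroids.items()}
for cid, regime in regimes.items():
    print(f"cluster {cid}: {regime}")
```

Separating transitional from unstable clusters would need additional criteria, such as the methane level or the size of the exceedance, which are not specified here.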
Figure 11 illustrates the distribution and variability of key operating parameters across the 3 clusters. TS, MC, and FOS/TAC demonstrate distinct separation within clusters, indicating their significant impact on cluster membership, while the concentration of methane exhibits minor fluctuations, which are insufficient for statistically differentiating clusters. It was noted that less significant fluctuations of methane concentration across clusters indicate that it is a result rather than a cause of operational variation. MC, TS, and the FOS/TAC ratio are the most statistically significant factors in differentiating operating clusters. Temperature and pH are significant, but to a lesser degree.
[See PDF for image]
Fig. 11
Biodigester parameters variation across the operational clusters
SHAP-based interpretable feature ranking of biodigester parameters
The SHAP-based XAI provides a granular and interpretable explanation of the impact, and the direction of that impact (positive or negative), of each bio-digestion operational parameter on CH4 yield. The results not only highlight the most significant variables but also offer clear interpretability of the predictive model, facilitating data-driven improvement of biodigester operations. All SHAP analyses were computed for the XGBoost model (the best-performing predictive model in this study). Fig. 12 shows the SHAP summary plot, which visualizes the magnitude and direction of the influence of each bio-digestion parameter on biomethane yield. The SHAP values in the summary chart offer local interpretability, illustrating how particular feature values (high/low) influence predictions positively or negatively. The FOS/TAC ratio dominates the SHAP summary plot, exhibiting the largest spread of SHAP values. Its strongest effect is negative, depicted by its large negative SHAP values, indicating that increased FOS/TAC reduces methane output. This corresponds with the biological finding that excess volatile fatty acids induce acidity and limit microbial activity. Conversely, low FOS/TAC exhibits a favorable effect, as evidenced by its positive SHAP values, signifying process stability and efficient buffering. The moisture content exhibited a consistent influence on the model predictions. High moisture enhances methane production, presumably because increased water availability facilitates substrate breakdown and microbial movement. The negative SHAP values attributed to moisture content in the summary plot imply that excessive moisture could also dilute the substrate concentration. Total solids exert both positive and negative influences on the model prediction, depicted by their broad SHAP value dispersion. Moderate total solids concentrations enhance methane yield; however, excessively high total solids may impede performance due to inadequate mixing and mass transport constraints.
An increase in volatile solids typically results in a positive shift in SHAP values, indicating that greater amounts of degradable organic matter enhance methane production. Nevertheless, considerable overlap was detected, suggesting that above specific thresholds, the effect may stabilize or depend on concurrent conditions such as FOS/TAC. A small but non-zero effect was noted for temperature, with its SHAP values close to zero. This indicates that the system operates within a limited mesophilic range (36–38 °C), where microbial activity is comparatively steady. The pH has a negligible effect on the model prediction, with nearly all its SHAP values clustered around zero. This indicates that pH consistently stayed within the ideal neutral range during the study period, hence not significantly influencing variability in CH4 output.
[See PDF for image]
Fig. 12
SHAP summary plot of bio-digestion
Fig. 13 shows the SHAP summary bar plot of the bio-digestion parameters. FOS/TAC was the most important variable, with the greatest mean SHAP value. This means that the balance between volatile acids and buffering capacity is very important for maintaining anaerobic digestion and directly affects how much methane is generated. The second most significant feature is the moisture content, which directly affects substrate availability, microbial mobility, and mass transfer in the digester. This observation is consistent with the established vulnerability of anaerobic systems to severe desiccation or dilution, as both conditions can impede digestion efficiency. Total solids and volatile solids have similar mean SHAP values, showing that the amount of substrate and the amount of organic matter available are important factors in how much methane is produced. These results reinforce the significance of feedstock parameters in biogas optimization. Temperature and pH, which are usually regarded as important control parameters, have lower SHAP effects in this dataset. This could be because their operational ranges were quite restricted, which probably kept them within their optimal ranges and left them little variation with which to explain CH4 production.
[See PDF for image]
Fig. 13
Bar plot of SHAP values of bio-digestion variables
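A bar-plot ranking of this kind is conventionally obtained by averaging the absolute SHAP values of each feature over all samples. The sketch below illustrates that aggregation on a small hypothetical SHAP matrix; the numbers and the three-feature subset are invented for illustration and are not outputs of the model in this study.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value (largest first)."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match feature list")
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [t / n for t in totals]
    return sorted(zip(feature_names, means), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-sample SHAP values (rows = samples, columns = features).
features = ["FOS/TAC", "MC", "TS"]
shap_rows = [
    [-2.0, 0.8, 0.3],
    [1.0, -0.6, -0.5],
    [-3.0, 1.0, 0.1],
]

ranking = mean_abs_shap(shap_rows, features)
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature such as FOS/TAC that pushes predictions strongly in both directions still ranks first.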
Fig. 14 shows the decision plot of the bio-digestion variables. The plot begins at the model’s base methane value of around 60% CH4. FOS/TAC has the greatest impact on CH4 yield, as seen in the steepest shifts of the prediction paths. The prediction rises under steady digestion but drops sharply at high FOS/TAC, indicating that acid accumulation reduces yield. The moisture content strongly affects methane yield, with higher values raising it and lower values decreasing it. Temperature and pH follow similar trajectories with moderate yet stabilizing effects: they protect the yield against excessive declines rather than producing large positive gains. Volatile solids and total solids affect the tails of the trajectories. Due to their role in organic matter degradation, higher volatile solids levels increase yield. Moderate total solids values increase yield, while larger values cause operating stress and decrease CH4 output.
[See PDF for image]
Fig. 14
SHAP decision plot for bio-digestion features
Beyond feature ranking, SHAP values provide further insight. ML methods, particularly ensemble learners and neural networks, are black boxes by nature: they are very good at making predictions, but they do not explain how each variable affects a specific prediction. The SHAP analysis applied to the best predictive model addressed this challenge by breaking down each prediction into the separate contributions of the input features, which makes both local and global interpretations possible. In this study, the anticipated predominance of FOS/TAC in predicting methane yield was validated using the SHAP analysis, while its impact relative to other parameters (e.g., VS, MC, pH) and the interactions between variables were estimated concurrently. Hence, it gives model-consistent evidence that complements empirical evidence. This supports both the precision of the prediction and its interpretability in a large-scale biodigester.
The SHAP dependence plot in Fig. 15 shows how moisture content affects yield through its SHAP values, as well as its non-linear interaction with total solids. Positive SHAP values indicate contributions that raise the predicted methane yield, while negative values indicate contributions that lower it. SHAP values are mostly positive at higher moisture levels (around zero and above on the plotted scale), suggesting that increased moisture promotes methane production within the observed range. Well-hydrated conditions aid hydrolysis and microbial movement, which explains the favorable association. Importantly, this region has moderate to high total solids values, indicating that proper hydration improves organic substrate degradability when solids are balanced. At lower moisture levels, SHAP values become negative, sometimes below −10. This suggests that low moisture reduces methane output by limiting microbial activity and substrate accessibility.
[See PDF for image]
Fig. 15
SHAP dependence plot of moisture content with methane yield and total solids
Furthermore, the non-linear interactions of FOS/TAC with methane yield and total solids are presented in Fig. 16, demonstrating a negative relationship between FOS/TAC and methane yield. SHAP values are mostly positive at lower FOS/TAC levels (near zero or negative), implying that keeping the system in this range increases methane yield. This supports digestion theory, which states that a balanced ratio of volatile fatty acids (FOS) to buffer capacity (TAC) promotes methanogenesis. As FOS/TAC values rise above 1.0, SHAP values drop rapidly into the negative range, even below −4. This shows that acid buildup and buffer overload cause process instability and lower methane yield at higher FOS/TAC ratios. High ratios indicate process imbalance, where high organic input or poor acid-to-methane conversion disrupts microbial function. At low FOS/TAC levels, positive contributions to methane yield coincide with lower total solids concentrations, suggesting that under leaner substrate conditions a balanced FOS/TAC ratio may sustain successful digestion. In contrast, larger total solids values cluster around higher FOS/TAC values and negative SHAP contributions. This interaction indicates that high solids loading is associated with a higher FOS/TAC ratio and a lower methane yield.
[See PDF for image]
Fig. 16
SHAP dependence plot of FOS/TAC with methane yield and total solid
The force plot (see Fig. 17) starts with the base methane yield prediction value (60%), indicating the model’s expected outcome without feature-specific additions. Temperature and FOS/TAC exert negative contributions. FOS/TAC (−0.39) lowers methane yields due to increased volatile acid accumulation compared to buffer capacity. The temperature (−0.85) exhibits the greatest downward pull, indicating that the operating value was slightly outside the ideal mesophilic range. This divergence reduced microbial activity and methane production. On the positive side, pH, volatile solids, and moisture content increase methane output over base value. Methanogenic bacteria thrive under a slightly alkaline pH (0.52 SHAP contribution). Methane formation requires volatile solids (0.35 contribution), while increasing moisture content (0.59 contribution) improves substrate hydrolysis and microbial movement, boosting system efficiency.
[See PDF for image]
Fig. 17
SHAP force plot of bio-digestion variable
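The force plot can be read through SHAP's additive property: the model output for one sample equals the base value plus the sum of the per-feature contributions. The short calculation below applies this to the contributions quoted above. It is a partial sum, since no contribution for total solids is quoted in the text, so the result is only indicative of the prediction shown in the plot.

```python
# Base value and contributions as quoted for the force plot (Fig. 17).
base_value = 60.0  # base methane yield prediction, % CH4
contributions = {
    "FOS/TAC": -0.39,
    "Temperature": -0.85,
    "pH": 0.52,
    "VS": 0.35,
    "MC": 0.59,
}

# Additivity: prediction = base value + sum of feature contributions.
net_shift = sum(contributions.values())
prediction = base_value + net_shift
print(f"net shift = {net_shift:+.2f}, prediction = {prediction:.2f}% CH4")
```

The positive contributions of pH, VS, and MC slightly outweigh the negative pull of temperature and FOS/TAC, so the quoted terms leave the prediction marginally above the base value.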
Performance metrics of the predictive model
The performance of the ensemble learning methods (XGBoost, Random Forest), MLP (ANN), and SVM was compared based on key error-based metrics for a robust prediction of the methane yield as a function of the key operating conditions of the biodigester. Table 4 presents the statistical metrics of all models at the training phase. XGBoost outperformed the other models, exhibiting the lowest prediction error values, with RMSE, MAE, MAD, and MAPE values of 1.1837, 1.0932, 1.0235, and 2.9754%, respectively. Its rMBE of 0.231 suggests low prediction bias. The excellent prediction outcome of XGBoost is attributed to its ensemble learning technique and gradient boosting regularization, which minimize overfitting and handle non-linear interactions well. The Random Forest also exhibited an excellent training performance, with RMSE and MAPE values of 1.8294 and 3.4101%, respectively. Although it is less exact than XGBoost, its rMBE of 0.244 reflects close agreement between the predicted and actual methane yields. This shows the resilience of ensemble tree-based models on complex bio-digestion datasets. In contrast, the SVM and MLP-ANN models trained less effectively. In particular, the SVM had the highest error values (RMSE = 4.4369, MAE = 3.2743, and MAPE = 7.6772%), indicating its limitation in accurately representing the non-linear dynamics of bio-digestion during training. Furthermore, the ANN's moderate error values (RMSE = 3.9309 and MAPE = 5.1421%) may indicate difficulties with optimization or sensitivity to training data size despite its flexible architecture. Higher rMBE values (> 0.3) in both models indicated prediction bias. Overall, the error-based metrics at the training phase revealed that the tree-based ensemble learning models (XGBoost and Random Forest) predicted methane yield more accurately than SVM and ANN.
Table 4. Error-based metrics of all models at the training phase
| Model | RMSE | MAE | MAD | MAPE (%) | rMBE |
|---|---|---|---|---|---|
| SVM | 4.4369 | 3.2743 | 2.6243 | 7.6772 | 0.332 |
| Random Forest | 1.8294 | 1.4087 | 1.6343 | 3.4101 | 0.244 |
| XGBoost | 1.1837 | 1.0932 | 1.0235 | 2.9754 | 0.231 |
| MLP (ANN) | 3.9309 | 3.0599 | 2.2443 | 5.1421 | 0.318 |
A trend similar to the training phase was observed in the model performance at the testing phase, as depicted in Table 5. XGBoost again outperformed the other models, demonstrating its robustness and generalization capability. XGBoost estimated methane yield under unseen digestion conditions with the lowest RMSE, MAE, MAD, and MAPE values of 1.5334, 3.0357, 2.9704, and 5.4685%, respectively. With its low systematic bias (rMBE = 0.293), it is the most accurate of the bio-digestion prediction models considered. Random Forest also performs well, although less accurately than XGBoost: it captures non-linear relationships, but with less precision than gradient boosting. The MLP-ANN and SVM models exhibit poor generalization capability. ANN had RMSE and MAPE values of 5.0848 and 9.5216%, respectively, while SVM had the largest errors, with RMSE = 7.4549 and MAPE = 12.8558%. Again, the testing results favor the ensemble models, especially XGBoost, which maintained high predictive accuracy during both training and testing.
Table 5. Error-based metrics of all models at the testing phase
| Model | RMSE | MAE | MAD | MAPE (%) | rMBE |
|---|---|---|---|---|---|
| SVM | 7.4549 | 6.4852 | 5.9704 | 12.8558 | 0.342 |
| Random Forest | 4.6515 | 3.7193 | 3.9704 | 7.2766 | 0.301 |
| XGBoost | 1.5334 | 3.0357 | 2.9704 | 5.4685 | 0.293 |
| MLP (ANN) | 5.0848 | 4.0951 | 4.9704 | 9.5216 | 0.324 |
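The error metrics in Tables 4 and 5 can be computed as in the sketch below. RMSE, MAE, and MAPE follow their usual definitions. The definitions of MAD and rMBE are not given in this section, so the code assumes MAD to be the median absolute error and rMBE to be the mean bias expressed as a percentage of the mean observed value; both are assumptions, and the sample values are synthetic.

```python
import math
from statistics import median

def error_metrics(actual, predicted):
    """Error-based metrics for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_err = [abs(e) for e in errors]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs_err) / n,
        # Assumption: MAD taken as the median absolute error.
        "MAD": median(abs_err),
        "MAPE": 100 * sum(ae / abs(a) for ae, a in zip(abs_err, actual)) / n,
        # Assumption: rMBE taken as mean bias relative to the mean observation.
        "rMBE": 100 * (sum(errors) / n) / (sum(actual) / n),
    }

# Synthetic methane fractions (% CH4), for illustration only.
actual = [50.0, 60.0, 55.0, 65.0]
predicted = [52.0, 59.0, 55.0, 61.0]

m = error_metrics(actual, predicted)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions RMSE can never be smaller than MAE for the same set of predictions, which offers a quick consistency check when reading tabulated metrics.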
The comparative trend in the performance metrics at the training phase is visualized in a radar chart in Fig. 18. SVM and MLP-ANN expand outward on the radar chart, especially for RMSE and MAPE, signifying higher error values. The SVM exhibits the greatest spread, indicating a suboptimal fit to the training data, while the MLP-ANN displays moderate performance but remains inferior to the ensemble approaches. XGBoost consistently occupies the innermost region across all error metrics, demonstrating its accuracy with the lowest RMSE, MAE, MAD, and MAPE values. The Random Forest model also performs well, with slightly higher error magnitudes than XGBoost, while maintaining a compact and balanced profile in comparison to the other models. This indicates strong predictive capacity during training, albeit with lower precision than gradient boosting.
[See PDF for image]
Fig. 18
Radar chart of error metrics at the training phase
The comparison of error metrics at the testing phase, depicted in Fig. 19, further supports the dominance of the ensemble learning models. XGBoost has the most compact profile across all error dimensions, reflecting the lowest prediction errors and the best generalization. Random Forest follows, with higher error magnitudes than XGBoost but a tighter profile than SVM and MLP-ANN. SVM and MLP-ANN expand outward on all measures, especially MAPE and RMSE, indicating lower predictive accuracy on unseen data. SVM has the broadest spread, indicating its poor adaptability to non-linear feature interactions, whereas ANN performs moderately but below the ensemble approaches.
[See PDF for image]
Fig. 19
Radar chart of error metrics at the testing phase
Figures 20 and 21 show the agreement between the actual and predicted methane yield values and the error histograms at the training phase. The ANN and SVM models fit the actual methane yield values less closely. ANN captures the general trend but deviates significantly from the actual values, especially around yield peaks and troughs, indicating underfitting or over-smoothing. SVM has the poorest fit, with wider gaps between predicted and actual values. XGBoost aligns well, with predicted values closely matching the actual values across all sample indices, while Random Forest tracks the actual values closely but deviates slightly at times, especially when yield variations are high. Figure 21 shows the error histograms for all models at the training phase. The error distribution for XGBoost is tightly concentrated around zero with little spread. Random Forest errors cluster near zero but are more dispersed. ANN errors are approximately normally distributed but wider, demonstrating substantial prediction discrepancies. SVM has the greatest error spread and a skew toward negative errors, indicating persistent underprediction and poor error balance.
[See PDF for image]
Fig. 20
Comparison trend plot of the actual and predicted methane yield values across all models at the training phase
[See PDF for image]
Fig. 21
Error histogram of methane prediction at the training phase
Figures 22 and 23 show the agreement between the actual and predicted methane yield values and the error histograms at the testing phase. The discrepancies from the observed CH4 values show that extreme and uncommon CH4 fractions were recorded on some days. These extreme values may be attributable to external influences that the model does not account for, rather than to changes in the operating parameters. XGBoost matches the actual and predicted CH4 values most closely. In most samples, the predicted line follows the actual trend, confirming good generalization and its superior testing metrics (lowest RMSE and MAPE). Random Forest captures the overall variation but shows more visible deviations at peaks and troughs. The ANN and SVM models forecast less accurately, suggesting that they are unable to capture the non-linearities in the data. SVM has the worst alignment, with substantial deviations across sample indices. The oscillatory mismatches suggest that SVM overfits local trends in the training data but fails to reproduce them in testing. The error histograms in Fig. 23 further reinforce the trend observed in the comparison plots. Owing to its high precision and low bias, the XGBoost errors cluster around zero with little spread. Random Forest errors are centered at zero but more spread out, reflecting its mild deviations. ANN errors are more widely distributed and slightly skewed, indicating prediction instability. SVM has the widest error distribution and multiple large negative errors, demonstrating systematic underestimation and limited generalization.
[See PDF for image]
Fig. 22
Comparison trend plot of the actual and predicted methane yield values across all models at the testing phase
[See PDF for image]
Fig. 23
Error histogram of methane prediction at the testing phase
The regression plot of all models in Fig. 24 visualizes how well the predicted methane yield fits the actual methane yield for XGBoost, Random Forest, SVM, and ANN (MLP). With an R² value of 0.968, XGBoost demonstrated the strongest fit between its predicted values and the actual methane yield, affirming its efficacy in modeling the non-linear dynamics of the biodigester system. Random Forest also demonstrated a good fit, with an R² value of 0.902, but with marginally greater dispersion around the best-fit line in comparison to XGBoost. ANN (MLP) yields a moderate fit with an R² value of 0.75. While it captures the general trend, a broader dispersion is noted around the regression line. SVM exhibits the weakest performance with R² ≈ 0.64. The scatter is more pronounced, and deviations from the best-fit line are larger, showing that SVM struggled to represent the non-linearities inherent in the digestion process.
[See PDF for image]
Fig. 24
Regression plot of the developed model
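The goodness of fit summarized in the regression plot can be expressed with the coefficient of determination. The sketch below uses the common definition based on residual and total sums of squares; whether the reported values follow this definition or the squared correlation of the fitted line is not stated, so the choice is an assumption, and the data are synthetic.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot

# Synthetic methane fractions (% CH4), for illustration only.
actual = [50.0, 60.0, 55.0, 65.0]
predicted = [52.0, 59.0, 55.0, 61.0]

r2 = r_squared(actual, predicted)
print(f"R^2 = {r2:.3f}")
```

A value near 1 means the predictions explain almost all of the variation in the observed yield, while lower values correspond to the wider scatter described for ANN and SVM.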
Having been shown to yield the most accurate predictions, the XGBoost model could assist operators in optimizing the large-scale digester's performance. The objective is to use the model as a decision-making tool in which operators input candidate parameters (e.g., feed composition, moisture content, buffer concentrations) and observe the methane yield predicted by the model. They can then select the configurations that maximize the expected yield while remaining within safe and stable operating limits. Recent research has employed this approach by integrating ML models with soft sensors or digital twin frameworks in bioreactors, enabling predictions to facilitate parameter optimization and enhance performance (e.g., increased biogas production, improved operational stability) (Zou et al., 2024).
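The decision-support use described above amounts to scoring candidate settings with the model and keeping the best one that satisfies the stability constraints. The sketch below illustrates this with a hypothetical stand-in function, `toy_yield_model`, which is not the XGBoost model of this study; the parameter names, grid values, and coefficients are likewise invented for illustration. Only the FOS/TAC limit of 0.4 is taken from the text.

```python
from itertools import product

def toy_yield_model(moisture_pct, fos_tac, msw_share):
    """Hypothetical stand-in for a trained model (NOT the study's model)."""
    penalty = 20.0 * max(fos_tac - 0.3, 0.0)
    return 55.0 + 0.05 * (moisture_pct - 80.0) - penalty + 5.0 * msw_share

def best_setting(model, grid, is_safe):
    """Return the safe candidate with the highest predicted yield."""
    candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
    safe = [c for c in candidates if is_safe(c)]
    if not safe:
        raise ValueError("no candidate satisfies the safety constraints")
    return max(safe, key=lambda c: model(**c))

# Hypothetical candidate settings an operator might consider.
grid = {
    "moisture_pct": [75, 80, 85],
    "fos_tac": [0.25, 0.35, 0.45],
    "msw_share": [0.5, 0.6, 0.7],
}

# Stability constraint from the text: FOS/TAC should stay below 0.4.
best = best_setting(toy_yield_model, grid, lambda c: c["fos_tac"] < 0.4)
print(best, round(toy_yield_model(**best), 2))
```

In practice the stand-in would be replaced by the trained predictor, and the safety check would encode the plant's own operating limits.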
Discussion and implications for large-scale bioenergy exploration
This study holds significant implications for enhanced large-scale biogas generation, particularly in regions where organic waste streams remain underutilized. Large-scale bioenergy operations encounter several challenges related to process variability, scalability, and operational efficiency; hence, industrial-scale digesters require significant intelligent attention. To maximize the effectiveness of energy harvest, the commercial scale must be optimized for both economic feasibility and sustainability. The robust data-driven investigations and insights provided in this study, including SHAP-based feature analysis, k-means clustering, multicollinearity reduction, and hybrid predictive modeling, provide data-driven intelligence for process control and decision-making for the large-scale digestion process. A technical difficulty of the bio-digestion system is the composition of the biogas. It is necessary to minimize the energy carrier while removing the undesirable result. This modeling approach offers a valuable pathway for minimizing economic risks, improving biogas quality, and guiding renewable energy investment decisions. The multicollinearity reduction using the PCA identified volatile solids, moisture content, and temperature as key system performance factors by reducing dimensionality and establishing three PCs that explained over 74% of the variance. This shows that large-scale operators might prioritize these fundamental metrics to streamline monitoring without reducing predictive dependability. Three operational regimes were identified by k-means clustering, which aids real-time process monitoring. Cluster 0 showed stable, high-performing situations, cluster 1 transitional or unstable stages, and cluster 2 overload and imbalance failure-prone levels. This clustering knowledge facilitates adaptive control systems, where early identification of cluster 2 alterations could prompt feedstock mix, buffering agent, or loading rate adjustments. 
Such predictive capability can reduce operational downtime and methane variability, and can deliver economic benefits to industrial enterprises.
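The use of the identified regimes for early warning can be sketched as follows. This is a minimal illustration in Python, not the study's implementation: it shows only the assignment step of k-means (mapping a new observation to its nearest centroid), and the centroid coordinates, the feature ordering, the three-sample persistence window, and the function names `assign_cluster` and `needs_intervention` are all assumptions introduced here.

```python
from math import sqrt

# Hypothetical centroids in standardized (z-score) feature space, ordered as
# (FOS/TAC, volatile solids, moisture content). Illustrative values only;
# they are NOT taken from the plant data used in the study.
CENTROIDS = {
    0: (-0.8, 0.6, 0.4),   # stable, high-performing conditions
    1: (0.1, 0.0, -0.1),   # transitional or unstable stages
    2: (1.5, -0.9, -0.7),  # overload and imbalance
}


def assign_cluster(sample, centroids=CENTROIDS):
    """Return the label of the centroid nearest to `sample` (Euclidean distance)."""
    def distance(a, b):
        return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    return min(centroids, key=lambda label: distance(sample, centroids[label]))


def needs_intervention(samples, window=3, centroids=CENTROIDS):
    """Return True when the last `window` samples all fall in cluster 2.

    The persistence window is an assumed safeguard against reacting to a
    single noisy measurement.
    """
    labels = [assign_cluster(s, centroids) for s in samples]
    recent = labels[-window:]
    return len(recent) == window and all(label == 2 for label in recent)
```

In practice the centroids would come from the fitted clustering model, and new measurements would need the same standardization that was applied before clustering.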
The SHAP-based interpretability framework shows how XAI bridges black-box models and operational decision-making. FOS/TAC, volatile solids, and moisture content were identified as the most influential predictors of methane yield, which is consistent with biochemical knowledge and provides quantitative criteria for process modification. Operators can monitor FOS/TAC ratios in real time and take appropriate action when values reach instability ranges. The ensemble learning methods, particularly XGBoost, offer a strong basis for scaling predictive analytics in biogas plants, demonstrating their potential for adaptive and interpretable control in varied operational settings. The feedstock regression analysis supports municipal solid waste as the best substrate, with processed organic waste as a co-substrate. This has policy and operational relevance in metropolitan areas where waste segregation and valorization remain underdeveloped. Waste management strategies that align with digester feedstock needs could boost renewable energy generation and environmental protection. By translating raw operational data into predictive and explanatory insights, this study supports scalable decision-support for plant operators, decision makers, and investors. It promotes circular economy goals, the reliability of large-scale biogas generation, and decarbonization through energy recovery and reduced reliance on landfill.
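The idea behind the SHAP attributions can be illustrated with exact Shapley values on a toy model. The sketch below is not the SHAP library and not the study's model: it enumerates all feature coalitions against a single baseline, which is feasible only for a handful of features. The function `toy_methane_model`, its coefficients, and the example feature values are hypothetical and serve only to show how a prediction is split among FOS/TAC, VS, and MC.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this is practical only for small n.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_methane_model(features):
    """Hypothetical methane-yield surrogate; coefficients are made up."""
    fos_tac, vs, mc = features
    return 60.0 - 25.0 * fos_tac + 0.3 * vs + 0.05 * mc - 0.1 * fos_tac * vs


# Illustrative operating point and reference point (FOS/TAC, VS, MC).
observation = (0.45, 80.0, 70.0)
reference = (0.30, 75.0, 72.0)
contributions = shapley_values(toy_methane_model, observation, reference)
```

By construction, the contributions sum to the difference between the prediction at the observation and the prediction at the reference point, which is what allows each feature's share of a predicted yield change to be read off directly.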
Conclusion
This study provides a comprehensive data-driven approach to enhancing the operational efficiency and energy output of large-scale biodigesters. By integrating a statistical approach, SHAP-XAI-based feature ranking, and clustering analyses with ensemble learning-based predictive modeling, this research offers multi-dimensional insights into the dynamics of methane generation and process stability. The predictive models were effectively used to predict the biomethane yield in the large-scale biodigester, with XGBoost giving the lowest prediction error: RMSE, MAE, MAD, MAPE, and rMBE values of 1.1837, 1.0932, 1.0235, 2.9754, and 0.231, respectively. The feedstock regression identified MSW as the most effective input, with POW as a viable co-substrate. Additionally, the analysis of plant operational data, including feature ranking, PCA, and k-means clustering, uncovered critical parameters and operational clusters influencing system performance. The SHAP analysis revealed VS, MC, and FOS/TAC as key drivers of methane yield, while the identification of three distinct operational clusters by k-means clustering provides a strategic basis for adaptive process control, enabling the early detection of transitional or failure states. This research delivers an intelligent methane prediction system and provides a roadmap for operational optimization based on real-time data interpretation. These findings offer a scalable decision-support tool for enhancing energy recovery, reducing environmental risk, and supporting policy development in the context of sustainable bioenergy systems.
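For readers who wish to compute comparable error statistics on their own data, the following Python sketch may be useful. It assumes the common textbook definitions of RMSE, MAE, MAPE, and relative mean bias error (rMBE, taken here as the mean error divided by the mean observation, in percent); these definitions are assumptions and may differ from those used in this study. MAD is omitted because its definition varies between sources. The function name `error_metrics` is introduced here for illustration.

```python
from math import sqrt


def error_metrics(observed, predicted):
    """Return RMSE, MAE, MAPE (%) and rMBE (%) under assumed definitions.

    MAPE divides by each observed value, so observations must be non-zero.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)

    rmse = sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n
    # Positive rMBE means over-prediction on average, negative means under-prediction.
    rmbe = 100.0 * (sum(errors) / n) / (sum(observed) / n)

    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "rMBE": rmbe}
```

RMSE and MAE carry the units of the target variable, whereas MAPE and rMBE are percentages, so the two groups should not be compared directly.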
Acknowledgements
The authors thank the Department of Mechanical Engineering, University of Johannesburg, for providing workspace for this research.
Author contributions
**Oluwatobi Adeleke**: Writing—original draft, writing—review & editing, methodology, data curation, conceptualization, formal analysis. **Tien-Chien Jen**: Writing—review & editing, project administration, conceptualization, supervision.
Funding
The authors received no funding for this research.
Data availability
The data will be made available upon reasonable request.
Declarations
Competing interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Adeleke, O; Akinlabi, SA; Jen, T; Dunmade, I; Adeleke, O; Akinlabi, SA; Jen, TC; Akinlabi, SA; Jen, TC. Sustainable utilization of energy from waste: A review of potentials and challenges of waste-to-energy in South Africa. International Journal of Green Energy; 2021; 18,
Adeleke, O; Jen, TC. A FCM-clustered neuro-fuzzy model for estimating the methane fraction of biogas in an industrial-scale bio-digester. Energy Reports; 2022; 8, pp. 576-584. [DOI: https://dx.doi.org/10.1016/j.egyr.2022.10.265]
Antwi, P; Li, J; Opoku, P; Meng, J; Shi, E; Deng, K; Kwesi, F. Bioresource technology estimation of biogas and methane yields in an UASB treating potato starch processing wastewater with backpropagation artificial neural network biogas gas meter water lock effluent. Bioresource Technology; 2017; 228, pp. 106-115. [DOI: https://dx.doi.org/10.1016/j.biortech.2016.12.045]
Awasthi, MK; Sarsaiya, S; Chen, H; Wang, Q; Wang, M; Awasthi, SK; Li, J; Liu, T; Pandey, A; Zhang, Z. Kumar, S; Kumar, R; Pandey, A. Global Status of Waste-to-Energy Technology. Current Developments in Biotechnology and Bioengineering; 2019; Elsevier B.V:
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Conant, RT; Ryan, MG; Ågren, GI; Birge, HE; Davidson, EA; Eliasson, PE; Evans, SE; Frey, SD; Giardina, CP; Hopkins, FM; Hyvönen, R; Kirschbaum, MUF; Lavallee, JM; Leifeld, J; Parton, WJ; Megan Steinweg, J; Wallenstein, MD; Martin Wetterstedt, JÅ; Bradford, MA. Temperature and soil organic matter decomposition rates—Synthesis of current knowledge and a way forward. Global Change Biology; 2011; 17,
Cortes, C; Vapnik, V. Support-vector networks. Machine Learning; 1995; 20,
Dadak, A; Aghbashlo, M; Tabatabaei, M; Younesi, H. Exergy-based sustainability assessment of continuous photobiological hydrogen production using anaerobic bacterium Rhodospirillum rubrum; 2016; Elsevier: [DOI: https://dx.doi.org/10.1016/j.jclepro.2016.08.020]
Deepanraj, B; Sivasubramanian, V; Jayaraj, S. Kinetic study on the effect of temperature on biogas production using a lab scale batch reactor. Ecotoxicology and Environmental Safety; 2015; 121, pp. 100-104. [DOI: https://dx.doi.org/10.1016/j.ecoenv.2015.04.051]
Essienubong, IA; Ndon, AE; Etim, PJ; Akpaden, I; Enin, M; Enin, M; Ibom, A. Fuzzy modelling and optimization of anaerobic co-digestion process parameters for effective biogas yield from bio-wastes. The International Journal of Energy and Engineering Sciences; 2020; 5,
Khaled, AA; Hosseini, S. Fuzzy adaptive imperialist competitive algorithm for global optimization. Neural Computing and Applications; 2015; 26,
Khan, MD; Khan, N; Sultana, S; Joshi, R; Ahmed, S; Yu, E; Scott, K; Ahmad, A; Khan, MZ. Bioelectrochemical conversion of waste to energy using microbial fuel cell technology. Process Biochemistry; 2017; 57,
Kim, M; Chul, P; Kim, W; Cui, F. Application of data smoothing and principal component analysis to develop a parameter ranking system for the anaerobic digestion process. Chemosphere; 2022; 299, [DOI: https://dx.doi.org/10.1016/j.chemosphere.2022.134444] 134444.
Li, J; Zhang, L; Li, C; Tian, H; Ning, J; Zhang, J; Tong, YW; Wang, X. Data-driven based in-depth interpretation and inverse design of anaerobic digestion for CH4-rich biogas production. ACS ES&T Engineering; 2022; 2,
Liu, Z; Lv, J. The effect of total solids concentration and temperature on biogas production by anaerobic digestion. Energy Sources, Part a: Recovery, Utilization, and Environmental Effects; 2016; 38,
Mohamed, Y; Elghadban, A; Lei, HI; Shih, AA; Lee, P-H. Quantum machine learning regression optimisation for full-scale sewage sludge anaerobic digestion. Npj Clean Water; 2025; 8,
Mustaffa, Z; Sulaiman, MH. Advanced forecasting of building energy loads with XGBoost and metaheuristic algorithms integration. Energy Storage and Saving; 2025; [DOI: https://dx.doi.org/10.1016/j.enss.2025.03.005]
Nair, VV; Dhar, H; Kumar, S; Kumar, A; Mukherjee, S; Wong, JWC. Artificial neural network based modeling to evaluate methane yield from biogas in a laboratory-scale anaerobic bioreactor. Bioresource Technology; 2016; 217, pp. 90-99. [DOI: https://dx.doi.org/10.1016/j.biortech.2016.03.046]
Olatunji, KO; Ahmed, NA; Madyira, DM; Adebayo, AO; Ogunkunle, O; Adeleke, O. Performance evaluation of ANFIS and RSM modeling in predicting biogas and methane yields from Arachis hypogea shells pretreated with size reduction. Renewable Energy; 2022; 189, pp. 288-303. [DOI: https://dx.doi.org/10.1016/j.renene.2022.02.088]
Olatunji, OO; Adedeji, PA; Madushele, N; Rasmeni, ZZ; van Rensburg, NJ. Evolutionary optimization of biogas production from food, fruit, and vegetable (FFV) waste. Biomass Conversion and Biorefinery; 2024; 14,
Paladino, O. Data driven modelling and control strategies to improve biogas quality and production from high solids anaerobic digestion: A mini review. Sustainability; 2022; [DOI: https://dx.doi.org/10.3390/su142416467]
Park, S; Kim, G-B; Pandey, AK; Park, J-H; Kim, S-H. Prediction of total organic acids concentration based on FOS/TAC titration in continuous anaerobic digester fed with food waste using a deep neural network model. Biomass and Bioenergy; 2024; 190, [DOI: https://dx.doi.org/10.1016/j.biombioe.2024.107411] 107411.
Pilloud, MA; Maier, C; Scott, GR; Hefner, JT. Latham, KE; Bartelink, EJ; Finnegan, M. Chapter 4 - advances in cranial macromorphoscopic trait and dental morphology analysis for ancestry estimation. New Perspectives in Forensic Human Skeletal Identification; 2018; Academic Press: pp. 23-34. [DOI: https://dx.doi.org/10.1016/B978-0-12-805429-1.00004-1]
Platošová, D; Rusín, J; Platoš, J; Smutná, K; Buryjan, R. Case study of anaerobic digestion process stability detected by dissolved hydrogen concentration. Processes; 2021; 9,
Rossi, E; Pecorini, I; Iannelli, R. Multilinear regression model for biogas production prediction from dry anaerobic digestion of OFMSW. Sustainability; 2022; [DOI: https://dx.doi.org/10.3390/su14084393]
Rutland, H; You, J; Liu, H; Bull, L; Reynolds, D. A systematic review of machine-learning solutions in anaerobic digestion. Bioengineering; 2023; [DOI: https://dx.doi.org/10.3390/bioengineering10121410]
Salehi, R; Yuan, Q; Chaiprapat, S. Development of data-driven models to predict biogas production from spent mushroom compost. Agriculture; 2022; [DOI: https://dx.doi.org/10.3390/agriculture12081090]
Schroer, HW; Just, CL. Feature engineering and supervised machine learning to forecast biogas production during municipal anaerobic co-digestion. ACS ES&T Engineering; 2024; 4,
Shawe-Taylor, J; Sun, S. A review of optimization methodologies in support vector machines. Neurocomputing; 2011; 74,
Sun, J; Xu, Y; Nairat, S; Zhou, J; He, Z. Prediction of biogas production in anaerobic digestion of a full-scale wastewater treatment plant using ensembled machine learning models. Water Environment Research; 2023; 95,
Tempel, F; Ihlen, EAF; Adde, L; Strümke, I. Explaining human activity recognition with SHAP: Validating insights with perturbation and quantitative measures. Computers in Biology and Medicine; 2025; 188, [DOI: https://dx.doi.org/10.1016/j.compbiomed.2025.109838] 109838.
Tsai, W-P; Feng, D; Pan, M; Beck, H; Lawson, K; Yang, Y; Liu, J; Shen, C. From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling. Nature Communications; 2021; 12,
Wang, Z; Hu, Y; Wang, S; Wu, G; Zhan, X. A critical review on dry anaerobic digestion of organic waste: Characteristics, operational conditions, and improvement strategies. Renewable and Sustainable Energy Reviews; 2023; 176, [DOI: https://dx.doi.org/10.1016/j.rser.2023.113208] 113208.
Zaidi, AR; Abbas, T; Daud, A; Alghushairy, O; Dawood, H; Sarwar, N. Enhancing Android malware detection with XGBoost and convolutional neural networks. Computers, Materials & Continua; 2025; 84,
Zou, J; Lü, F; Chen, L; Zhang, H; He, P. Machine learning for enhancing prediction of biogas production and building a VFA/ALK soft sensor in full-scale dry anaerobic digestion of kitchen food waste. Journal of Environmental Management; 2024; 371, [DOI: https://dx.doi.org/10.1016/j.jenvman.2024.123190] 123190.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.