1. Introduction
The global energy consumption scenario is dominated by non-renewable sources such as coal, oil and natural gas. In 2022, according to the Energy Information Administration (EIA) [1], the consumption was: oil (29.5%), coal (26.8%), natural gas (23.7%), biomass (9.8%), nuclear energy (5.0%), hydroelectric energy (2.7%) and other sources (2.5%). In the coming years, oil and natural gas are expected to remain prominent, driven by the development of nations such as China, the largest importer and second largest consumer of oil [2].
Oil, a raw material with high industrial value, has its price influenced by global economic and geopolitical aspects [3,4,5,6,7,8,9]. This price is determined by a complex, non-linear system with many uncertainties [10].
Since 2008, falls in oil prices have been influenced by the global economic slowdown and geopolitical instability, as well as the crisis between China and the US. The COVID-19 pandemic and the war between Russia and Ukraine have added new uncertainties, affecting price formation [6,11,12]. These events have caused price fluctuations that challenge market and policy decisions, but they also offer opportunities to explore forecasting methods.
Forecasting models include linear and non-linear approaches, as well as combination strategies such as hybrids and ensembles. Linear models, such as exponential smoothing, capture patterns in time series by adjusting for trend and seasonality. For example, Simple Exponential Smoothing (SES) is suitable for series with no trend or seasonality, while the Holt-Winters model deals with series that have these characteristics. Box & Jenkins models, such as AR, ARMA and ARIMA, are essential for analyzing temporal dependencies: AR captures the linear relationship between an observation and several past lags, MA models the forecast error as a linear combination of past errors, and ARIMA handles non-stationary series by incorporating differencing [13].
Variants of the ARMA model optimized by Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) improve forecast accuracy by automatically adjusting parameters, allowing for more effective modeling of complex dynamics [14]. These optimization techniques provide an enhanced ability to capture subtle patterns and deal with the inherent complexity of time series.
In addition to hybrid models, combination strategies such as ensembles combine the outputs of individual predictors [15]. These strategies include means, medians, weighted averages and other combinations [16,17]. Ensemble techniques improve forecast accuracy by combining results from multiple models, reducing the variance of errors and increasing the consistency of estimates in volatile markets.
The literature on forecasting models for monthly crude oil (WTI) futures prices has evolved [18,19,20]. Although new techniques are emerging, linear models are still widely used, from simple comparisons to hybrid models [21]. Ensemble models have the potential to improve forecast accuracy, but they are still little explored [15,16,17].
The aim of this article is to explore linear models, specifically smoothing and Box & Jenkins models, and to apply incremental adjustments to the ARMA model using GA and PSO. The ensemble combination strategies considered include the mean, the median, the Moore-Penrose pseudo-inverse, and dynamic weight adjustment with GA and PSO, which significantly improve the performance of the results.
2. Linear Models
2.1. Smoothing Models
This section presents the first set of models used, known as smoothing models. In these models, the main objective is to estimate the smoothing parameters. The models are divided as follows: Simple Exponential Smoothing (SES), Holt Exponential Smoothing (HES), Additive Holt-Winters (A-HW) and Multiplicative Holt-Winters (M-HW).
2.1.1. Simple Exponential Smoothing (SES)
Simple Exponential Smoothing (SES) is a data smoothing model that applies exponentially decreasing weights to the past values of the time series [22,23]. The forecast for one period ahead is given by Equation (1):

$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t$ (1)

where $\hat{y}_{t+1}$ is the forecast, $y_t$ is the actual value in period t, $\hat{y}_t$ is the forecast in period t, and $\alpha$ is the smoothing parameter ($0 \leq \alpha \leq 1$). Values of $\alpha$ close to zero yield forecasts that react slowly to changes, while values close to one result in faster responses to recent changes in the time series.
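As an illustration, the sketch below implements the SES recursion of Equation (1) directly; the price values and the value of $\alpha$ are hypothetical placeholders, not the parameters fitted in this study.

```python
# Minimal sketch of the SES recursion in Equation (1); values are placeholders.
import numpy as np

def ses_forecast(y, alpha):
    """One-step-ahead SES forecasts: y_hat[t+1] = alpha*y[t] + (1-alpha)*y_hat[t]."""
    y_hat = np.empty(len(y) + 1)
    y_hat[0] = y[0]                    # a common initialization: first forecast = first observation
    for t in range(len(y)):
        y_hat[t + 1] = alpha * y[t] + (1 - alpha) * y_hat[t]
    return y_hat                       # y_hat[-1] is the forecast for the next period

prices = np.array([60.2, 61.5, 59.8, 63.1, 64.0])   # hypothetical monthly prices
print(ses_forecast(prices, alpha=0.5)[-1])
```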
After defining the SES model equation, the next section will introduce the Holt model.
2.1.2. Holt Exponential Smoothing (HES)
Holt Exponential Smoothing (HES) models are widely used for time series with a linear trend [24]. Unlike the SES model, which smooths only the level, the Holt model also models the trend, as represented by Equations (2)–(4):

$\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})$ (2)

$b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)b_{t-1}$ (3)

$\hat{y}_{t+h} = \ell_t + h\,b_t$ (4)

where $\ell_t$ is the new smoothed value (level), $\alpha$ is the smoothing coefficient ($0 \leq \alpha \leq 1$), $y_t$ is the current value in period t, $\beta$ is the trend smoothing coefficient ($0 \leq \beta \leq 1$), $b_t$ is the predicted trend, and $\hat{y}_{t+h}$ is the predicted value h steps ahead. Low values of $\beta$ indicate a slow adjustment to the trend, while high values indicate a rapid response to changes in the trend.
The next section will introduce the Holt-Winters model, which models seasonality in an additive or multiplicative way.
2.1.3. Holt-Winters Model
The Holt-Winters model, or triple smoothing model, is used for data with trend, level and seasonality [25]. This model has two variations: Additive Method and Multiplicative Method.
Additive Holt-Winters Method (A-HW): represented by Equation (5):

$y_t = \ell_t + b_t + s_t + \varepsilon_t$ (5)

where $\ell_t$ is the level, $b_t$ the trend, $s_t$ the seasonality at time t and $\varepsilon_t$ the white noise. The estimates of the model components are given by Equations (6)–(8):

$\ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})$ (6)

$b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)b_{t-1}$ (7)

$s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma)s_{t-m}$ (8)

where $\alpha$, $\beta$ and $\gamma$ are the smoothing parameters for level, trend and seasonality ($0 \leq \alpha, \beta, \gamma \leq 1$) and m is the seasonal period.

Multiplicative Holt-Winters Method (M-HW): represented by Equation (9):

$y_t = (\ell_t + b_t)\,s_t + \varepsilon_t$ (9)

The estimates of the model components are given by Equations (10)–(12):

$\ell_t = \alpha\,\frac{y_t}{s_{t-m}} + (1 - \alpha)(\ell_{t-1} + b_{t-1})$ (10)

$b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)b_{t-1}$ (11)

$s_t = \gamma\,\frac{y_t}{\ell_{t-1} + b_{t-1}} + (1 - \gamma)s_{t-m}$ (12)
Additive models are indicated for seasonal variations of constant amplitude [26], while multiplicative models are suggested for increasing or decreasing seasonal variations [27]. Both variants were tested in this study.
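For illustration, a minimal sketch of fitting the A-HW and M-HW variants with the statsmodels library (used later in Section 3.3) is shown below; the synthetic monthly series is only a placeholder for the real data.

```python
# Minimal sketch: additive vs. multiplicative Holt-Winters on a synthetic monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
idx = pd.date_range("1987-01-01", periods=120, freq="MS")            # monthly index
t = np.arange(120)
y = pd.Series(50 + 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120), index=idx)

for seasonal in ("add", "mul"):                                       # A-HW and M-HW
    fit = ExponentialSmoothing(y, trend="add", seasonal=seasonal,
                               seasonal_periods=12).fit()             # optimizes the smoothing parameters
    print(seasonal, round(fit.params["smoothing_level"], 3), round(fit.forecast(12).iloc[0], 2))
```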
After presenting the smoothing models, the next section will introduce the Box & Jenkins models and their variations.
2.2. Box & Jenkins Models
The Box & Jenkins models, such as ARIMA(p,d,q), are notable for their accuracy in forecasting time series [25,28,29]. In addition to ARIMA, there are the AR(p) and ARMA(p,q) models, and the challenge is to determine the values of p, d and q and their respective coefficients [30].
Next, the AR(p), ARMA(p,q) and ARIMA(p,d,q) models are discussed.
2.2.1. Autoregressive Model—AR(p)
The AR(p) model uses p time lags as inputs to forecast future observations, representing the series as a linear combination of its past terms multiplied by coefficients, plus a Gaussian white noise [13,31]. The AR(p) coefficients can be estimated with the Yule-Walker equations, minimizing the error between the observed data and the predictions [28]. Equation (13) represents the model:

$y_t = \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t$ (13)

where $y_t$ is the value predicted at time t, $\phi_i$ is the weighting coefficient for the lag $y_{t-i}$, and $\varepsilon_t$ is the white noise. Direct application of this model requires stationary data.
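A minimal sketch of Yule-Walker estimation with statsmodels is shown below; the simulated AR(2) series and its coefficients are placeholders used only to illustrate the procedure.

```python
# Minimal sketch: AR(p) coefficients estimated via the Yule-Walker equations.
import numpy as np
from statsmodels.regression.linear_model import yule_walker

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):                       # simulate a stationary AR(2) process
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

phi, sigma = yule_walker(y, order=2, method="mle")
print("estimated coefficients:", phi)         # expected to be close to [0.6, -0.3]
```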
2.2.2. Autoregressive Moving Average Model—ARMA(p,q)
The ARMA(p,q) model combines autoregression (AR) and moving average (MA) components [13,32]. Equation (14) describes the model:

$y_t = \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t$ (14)

where $\theta_j$ are the moving average coefficients and $\varepsilon_t$ is the white noise.
2.2.3. Autoregressive Integrated Moving Average Model—ARIMA(p,d,q)
The ARIMA(p,d,q) model extends ARMA with an order of differencing d to remove trends and make the series stationary [33]. Equation (15) describes the model:

$y'_t = \sum_{i=1}^{p} \phi_i\, y'_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t$ (15)

where $y'_t$ denotes the series after d differences. The direct application of the ARIMA model makes it possible to model random shocks using the forecast error from the previous step, $\varepsilon_{t-1} = y_{t-1} - \hat{y}_{t-1}$.
Maximum likelihood estimation can be used to determine the coefficients of the ARMA(p,q) and ARIMA(p,d,q) models [34].
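As an illustration, the sketch below fits an ARMA(1,3) and an ARIMA(6,1,6), the orders later reported in Table 2, by maximum likelihood with statsmodels; the random-walk series is a placeholder, not the WTI data.

```python
# Minimal sketch: maximum-likelihood fitting of ARMA/ARIMA models with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.1, 1.0, 300))           # non-stationary toy series

arma = ARIMA(np.diff(y), order=(1, 0, 3)).fit()    # ARMA(1,3) on the differenced series
arima = ARIMA(y, order=(6, 1, 6)).fit()            # ARIMA(6,1,6) differences internally (d = 1)
print(arma.params)
print(arima.forecast(steps=3))
```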
2.3. Bioinspired Optimization Tools
In this section, we present the algorithms used to optimize the ARMA(p,q) models, using two different strategies: Genetic Algorithms (GA) and Particle Swarm Optimization (PSO). The details of how these algorithms were applied to the problem in question are provided in Section 3, along with the application of the ensemble models.
2.3.1. Genetic Algorithms (GA)
Genetic Algorithms (GA) are among the most widely used algorithms inspired by biological processes. Based on the principles of Darwin's theory of evolution [35,36,37], GAs model biological behavior to solve optimization problems. Introduced by [38] and refined by [39,40], GAs are recognized for identifying optimal or suboptimal solutions and are robust across a variety of problems, as they seek a globally optimal solution [41].
In the context of GAs, the problem is modeled by representing individuals associated with the parameters and coefficients of the models to be optimized, evaluated by the degree of adaptability, known as fitness [14]. This establishes an analogy between an individual’s ability to thrive in an environment and the effectiveness of parameters in producing an optimal solution.
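To make the idea concrete, the simplified sketch below evolves the four coefficients of an ARMA(1,3) predictor by minimizing the one-step MSE. It is only a generic GA sketch: tournament selection and Gaussian mutation are used here for brevity (the study's GA for the ARMA coefficients uses roulette-wheel selection and the parameters of Table 3), and the series is a random placeholder.

```python
# Simplified GA sketch for ARMA(1,3) coefficients [phi1, th1, th2, th3]; placeholder data.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                               # placeholder stationary series

def arma_mse(coefs):
    """One-step MSE of an ARMA(1,3) predictor for a given coefficient vector."""
    phi1, th1, th2, th3 = coefs
    e = np.zeros(len(y))
    sq_err = 0.0
    for t in range(3, len(y)):
        pred = phi1 * y[t - 1] + th1 * e[t - 1] + th2 * e[t - 2] + th3 * e[t - 3]
        e[t] = y[t] - pred
        sq_err += e[t] ** 2
    return sq_err / (len(y) - 3)

pop_size, n_gen = 50, 100
pop = rng.uniform(-1.0, 1.0, size=(pop_size, 4))       # search interval [-1, 1]

for _ in range(n_gen):
    fitness = np.array([arma_mse(ind) for ind in pop])
    elite = pop[np.argmin(fitness)].copy()             # keep the best individual
    # tournament selection of parents
    idx = rng.integers(0, pop_size, size=(pop_size, 3))
    parents = pop[idx[np.arange(pop_size), np.argmin(fitness[idx], axis=1)]]
    # one-point crossover between consecutive parents
    cuts = rng.integers(1, 4, size=pop_size)
    children = parents.copy()
    for i in range(0, pop_size - 1, 2):
        c = cuts[i]
        children[i, c:], children[i + 1, c:] = parents[i + 1, c:], parents[i, c:]
    # Gaussian mutation on ~10% of the genes, clipped back to the search interval
    mask = rng.random(children.shape) < 0.10
    children = np.clip(children + mask * rng.normal(0, 0.1, children.shape), -1, 1)
    children[0] = elite                                 # elitism
    pop = children

best = pop[np.argmin([arma_mse(ind) for ind in pop])]
print("best ARMA(1,3) coefficients:", best)
```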
2.3.2. Particle Swarm Optimization (PSO)
A major advantage of metaheuristics is that they are derivative-free, unlike classical optimization techniques such as gradient descent or Newton methods, which require derivatives of the objective [42]. This makes them especially useful in problems where derivatives are unavailable or difficult to calculate.
Inspired by the social behavior of flocks of birds and schools of fish, Particle Swarm Optimization (PSO), proposed by [35], uses individual and collective experience to solve problems. PSO governs the distribution of swarm members in the search space through the current position ($x_i$) and velocity ($v_i$) of each particle [43]. The search for the best solution is guided by improving the best local position ($p_{best}$) and the best global position ($g_{best}$).
The authors of [44] proposed adding the inertia coefficient ($w$), according to Equation (16), restricting the area surveyed. Values of $w$ range from 0.9 for broad searches to 0.4 for narrow searches, affecting convergence. The cognitive and social components $c_1$ and $c_2$ influence the solution using past experiences and are initially set to 2 [44].

$v_i(t+1) = w\,v_i(t) + c_1 r_1\,[p_{best,i} - x_i(t)] + c_2 r_2\,[g_{best} - x_i(t)]$ (16)

where $r_1$ and $r_2$ are random numbers in [0, 1], and the position is updated as $x_i(t+1) = x_i(t) + v_i(t+1)$. The performance of the PSO is influenced by $c_1$ and $c_2$, which control the speed and direction of the search. When the swarm starts, the particles are randomly distributed in the search space. Each particle is evaluated by the fitness function, and the best positions found are stored in ($p_{best}$) and ($g_{best}$). The velocity of each particle is updated in each iteration based on ($p_{best}$) and ($g_{best}$), until the stopping criterion is reached.
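The sketch below implements the velocity update of Equation (16) for a generic objective function; the sphere function and the parameter values are illustrative placeholders rather than the configuration used in this study (Table 4).

```python
# Minimal PSO sketch implementing the velocity update of Equation (16).
import numpy as np

def pso(objective, dim, n_particles=30, n_iter=100, w=0.7, c1=2.0, c2=2.0, bounds=(-1.0, 1.0)):
    rng = np.random.default_rng(0)
    x = rng.uniform(*bounds, size=(n_particles, dim))       # positions
    v = np.zeros_like(x)                                    # velocities
    p_best = x.copy()
    p_best_val = np.array([objective(p) for p in x])
    g_best = p_best[np.argmin(p_best_val)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)   # Equation (16)
        x = np.clip(x + v, *bounds)                                   # position update
        values = np.array([objective(p) for p in x])
        improved = values < p_best_val
        p_best[improved], p_best_val[improved] = x[improved], values[improved]
        g_best = p_best[np.argmin(p_best_val)].copy()
    return g_best, p_best_val.min()

best, val = pso(lambda z: np.sum(z ** 2), dim=4)   # sphere function as a stand-in fitness
print(best, val)
```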
After defining the linear models and optimization tools, the next section presents the Ensemble strategies used.
2.4. Tools for Combining Predictors (Ensembles)
One of the main advantages of ensembles lies in their ability to synergistically combine different individual models, which can result in remarkable improvements in generalization and in the accuracy of predictive models, as mentioned by [45,46,47]. Therefore, it can be said that an ensemble has the ability to reduce error variance.
However, it is important to note that the effectiveness of ensembles is directly linked to the accuracy of the individual models, which is influenced by the combination method adopted, as pointed out by [48]. There is no consensus on which ensemble strategy should be used [49].
An ensemble can consist of several stages, such as the generation of individual models, the selection of models and, finally, their combination or aggregation [50,51].
The model generation stage is crucial for creating diversity within the ensemble [51]. It can be classified as heterogeneous, using models with different architectures, or homogeneous, using models with the same architecture [50,52]. Combining both approaches is a common way to diversify ensembles [53,54], although heterogeneous models can face challenges in maintaining diversity [53]. Homogeneous models, on the other hand, offer greater control over diversity [53].
The selection and combination of models are fundamental steps in the process of forming an ensemble, with the aim of balancing diversity and forecast accuracy. After generating the models, the next stage is selection, which is essential for building an efficient Ensemble.
A fundamental part of the composition of an ensemble is the selection of predictors, which can involve choosing all the available predictors or a specific subgroup, following established criteria. Selection can be static, using one model or subgroup for the entire test set [55,56], or dynamic, choosing models based on the region of competence during the test phase [50,57]. Although selection is not mandatory at this stage, it can influence the results. Considering all predictors for the final stage may be prudent to avoid selecting models that may underperform in the test set [57]. The next step in forming an Ensemble is the final combination of predictors.
This step integrates the results of the forecasts of the individual predictors, forming the ensemble forecast ($\hat{y}^{ens}$). In time series problems, it is common to aggregate the forecasts of k predictors to obtain more accurate results, usually using the mean or the median of the forecasts [58,59].
Both the mean and the median are non-trainable ensemble models, reducing computational costs as it is not necessary to retrain the models repeatedly. The mean is represented by Equation (17), where $\hat{y}_{i,t}$ are the predictions of predictor i and m is the total number of predictors:

$\hat{y}^{ens}_t = \frac{1}{m}\sum_{i=1}^{m} \hat{y}_{i,t}$ (17)

The median, represented by Equation (18), is useful in the presence of outliers, offering a robust estimate of the central tendency of the forecasts:

$\hat{y}^{ens}_t = \operatorname{median}\left(\hat{y}_{1,t}, \hat{y}_{2,t}, \ldots, \hat{y}_{m,t}\right)$ (18)
Figure 1 illustrates the process of combining previously trained models using a non-trainable approach.
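A minimal sketch of these non-trainable combinations, Equations (17) and (18), is given below; the forecast values are placeholders, with one row per model.

```python
# Minimal sketch: mean and median combination of model forecasts (placeholder values).
import numpy as np

preds = np.array([[71.2, 69.8, 70.5],       # model 1 forecasts
                  [70.1, 70.4, 71.0],       # model 2 forecasts
                  [72.3, 68.9, 69.7]])      # model 3 forecasts

ensemble_mean = preds.mean(axis=0)          # Equation (17)
ensemble_median = np.median(preds, axis=0)  # Equation (18)
print(ensemble_mean, ensemble_median)
```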
In addition to the non-trainable ensembles, we also investigated the trainable ones, which differ in the assignment of weights to each predictor [60,61]. This stands out as the main contribution of this research.
Weights can be determined in various ways, such as minimum, maximum or product of the predictors’ outputs. A viable strategy is the weighted average, assigning greater weights to the models with the best performance [62,63,64].
The Moore-Penrose pseudo-inverse can be used to calculate the weights, effectively adapting to the characteristics of the predictors [65]. Equation (19) provides the solution to this problem:

$W = Y^{+}\, y$ (19)

where $Y$ is the matrix of the individual predictions, $y$ is the vector of observed values, $Y^{+}$ is the Moore-Penrose pseudo-inverse of $Y$, and $W$ is the vector of combination weights that minimizes the squared error of the combination.
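A minimal sketch of Equation (19) using NumPy's pseudo-inverse is shown below; the matrix of predictions and the observed values are placeholders.

```python
# Minimal sketch: least-squares combination weights via the Moore-Penrose pseudo-inverse.
import numpy as np

Y = np.array([[70.2, 71.0, 69.5],           # rows: time steps, columns: individual model forecasts
              [68.9, 69.5, 70.1],
              [72.4, 71.8, 71.0],
              [73.0, 72.5, 72.2]])
y = np.array([70.0, 69.3, 72.0, 72.8])      # observed values

W = np.linalg.pinv(Y) @ y                   # weights minimizing ||Y W - y||^2
print(W, Y @ W)                             # weights and the resulting ensemble forecasts
```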
Figure 2 shows the procedure for combining previously trained models, incorporating an additional re-training step.
This additional step adjusts the weights of the ensemble models according to the evolution of the data or problem conditions, resulting in a trainable approach.
In this paper, we propose an ensemble of predictive models in which the final output is calculated as the weighted average of the models' individual predictions, represented by Equation (20):

$\hat{y}^{ens}_t = \sum_{i=1}^{m} w_i\, \hat{y}_{i,t}$ (20)

The weights $w_i$ are optimized using the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), in order to minimize the prediction error of the ensemble [66].
After defining the linear models, optimization and combination tools, the next section will present the evaluation of these steps.
3. Methodology
In this section, the stages for the development of this research will be presented. Figure 3 illustrates the organization of the stages.
All the statistical tests and computational results were implemented in the Python programming language, version 3.11.5.
3.1. Database
The data analyzed, from the EIA [1], covers the monthly closing prices of WTI crude oil from January 1987 to February 2023, totaling 434 observations and showing total integrity with no missing or null records. The distribution of this data is illustrated in Figure 4.
To develop the models, 75% of the data was used for training, while the remaining 25% was used for testing, as shown in Figure 4.
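A minimal sketch of this chronological 75%/25% split is shown below; a synthetic series stands in for the EIA prices, and only the split logic is intended to be illustrative.

```python
# Minimal sketch: chronological 75%/25% split of 434 monthly observations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("1987-01-01", periods=434, freq="MS")
wti = pd.Series(20 + np.abs(np.cumsum(rng.normal(0, 2, 434))), index=idx)  # placeholder prices

cut = int(len(wti) * 0.75)
train, test = wti.iloc[:cut], wti.iloc[cut:]   # no shuffling: the temporal order is preserved
print(len(train), len(test))                   # 325 training and 109 test observations
```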
3.2. Pre-Processing
After collecting the data, it was analyzed to identify behaviors such as trend, cyclicality, seasonality and a random term. One of the ways to detect these behaviors is to use certain tests, such as the Cox-Stuart test and the Friedman test [67,68].
The tests show that the series has a trend and seasonality. In this case, it is necessary to pre-process the data to ensure stationarity. Stationary series have a constant mean, constant variance, and an autocovariance that does not depend on time, reflecting more stable behavior of the data; this is a sine qua non condition for modeling, especially with the Box & Jenkins models [25,28,69].
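For illustration, one common formulation of the Cox-Stuart trend test [67] is sketched below: each point in the first half of the series is paired with its counterpart in the second half and a sign test is applied (scipy >= 1.7 is assumed for binomtest; the trending series is synthetic).

```python
# Minimal sketch of the Cox-Stuart trend test (sign test on paired halves).
import numpy as np
from scipy.stats import binomtest

def cox_stuart_trend_test(x):
    x = np.asarray(x, dtype=float)
    c = len(x) // 2
    diffs = x[len(x) - c:] - x[:c]     # pair the halves (middle point dropped if the length is odd)
    diffs = diffs[diffs != 0]          # discard ties
    n_pos = int((diffs > 0).sum())
    return binomtest(n_pos, n=len(diffs), p=0.5).pvalue   # small p-value suggests a trend

rng = np.random.default_rng(0)
trend_series = 0.3 * np.arange(100) + rng.normal(0, 1, 100)
print(cox_stuart_trend_test(trend_series))   # expected to be very small for a trending series
```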
For this study, we used the logarithmic transformation and the moving average with a 12-month window, along with differencing, as shown in Equations (21)–(23):

$l_t = \log(y_t)$ (21)

where $l_t$ is the logarithm of the series value and $y_t$ represents the time series at time t.

$MA_t = \frac{1}{N}\sum_{i=0}^{N-1} l_{t-i}$ (22)

where $MA_t$ represents the value of the moving average at time t, calculated as the average of the log values over a window of N periods (here, N = 12), with i iterating over all the points in the N-period window. Applying differencing using Equation (23) then gives:

$z_t = (l_t - MA_t) - (l_{t-1} - MA_{t-1})$ (23)

Combining all the parts, the complete transformation of the time series is represented by $z_t = \Delta\left(\log(y_t) - MA_t\right)$.
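A minimal sketch of this pre-processing, following Equations (21)–(23) as presented above, is shown below with pandas; the synthetic series is a placeholder for the WTI prices.

```python
# Minimal sketch: log transform, 12-month moving average and first differencing.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(60 + np.cumsum(rng.normal(0, 1.5, 434)).clip(-40, 200))   # placeholder positive prices

log_y = np.log(y)                                  # Equation (21)
ma_12 = log_y.rolling(window=12).mean()            # Equation (22), 12-month window
z = (log_y - ma_12).diff().dropna()                # Equation (23): difference of the detrended series
print(z.head())
```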
The next step is to determine the parameters of the smoothing models. In this case, determine the parameters for level, trend and seasonality.
3.3. Estimating the Smoothing Coefficients
Exponential smoothing models require the determination of the $\alpha$, $\beta$ and $\gamma$ coefficients. The literature indicates that there is no consensus on an ideal method for this determination [70]. Using numerical evaluation of the cost function, the L-BFGS-B method (the default algorithm in the statsmodels library, a quasi-Newton method that does not require the Hessian matrix or the explicit structure of the objective function) was used to minimize the MSE and define the parameters on the training set.
Although exponential smoothing models do not require the series to be stationary, it was decided to use the stationary data in this study. The adjusted parameters $\alpha$, $\beta$ and $\gamma$ are shown in Table 1 and Appendix A.
After determining the coefficients of the smoothing models, we proceeded to apply the Box & Jenkins models.
3.4. Estimating the Coefficients and Orders of the Box & Jenkins
For the Box & Jenkins models, the orders and coefficients were determined in two ways: the classical approach, and optimization of the coefficients of the ARMA(p,q) model using GA and PSO.
The candidate orders were evaluated using the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. The coefficients were estimated by solving the Yule-Walker equations for AR(p) [71]. For the ARMA(p,q) and ARIMA(p,d,q) models, the maximum likelihood estimator [34] was used. The d term of the ARIMA model was set to 1 by applying one differencing to the data.
Significant lags were defined by analyzing Figure 5, which makes it easier to understand the autocorrelation patterns in the adjusted time series.
The PACF and ACF help identify significant lag components for the AR and MA models, respectively [72]. Determining these parameters can be challenging due to the complexity and volume of the data. For the AR(p) model, although Figure 5 suggests testing up to lag 2, lags from 1 to 6 were evaluated. For the ARMA(p,q) model, the MA(q) part was tested up to order 6, as indicated by the ACF.
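For illustration, the sketch below produces ACF and PACF plots of the kind used to select the candidate orders; the random series is a placeholder for the transformed data shown in Figure 5.

```python
# Minimal sketch: ACF and PACF plots with statsmodels for order selection.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
z = rng.normal(size=300)            # placeholder for the stationary, transformed series

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(z, lags=24, ax=axes[0])    # suggests candidate MA(q) orders
plot_pacf(z, lags=24, ax=axes[1])   # suggests candidate AR(p) orders
plt.tight_layout()
plt.show()
```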
For the ARIMA(p,d,q) model, the same orders were tested, with d set to 1. The coefficients of the Box & Jenkins models are shown in Table 2. After analyzing the orders, it was decided to refine the choice of parameters, especially for the ARMA(p,q) model, using GA and PSO optimization for greater precision [73,74].
In the GA, the ARMA coefficients $\phi$ and $\theta$ were optimized using one-point crossover, dynamic mutation and roulette-wheel selection; the specific parameters are detailed in Table 3 and Table 4.
In the PSO (Table 4), particles represent candidate solutions, i.e., the coefficients $\phi$ and $\theta$. The particles adjust their trajectories based on the best individual ($p_{best}$) and global ($g_{best}$) experiences. The inertia (w), cognitive ($c_1$) and social ($c_2$) coefficients modulate the search dynamics. Thirty simulations were carried out to identify the optimal configuration.
The optimized coefficient values are shown in Table 5. With the application of GA and PSO, all the linear models are ready to be combined. The ensembles can be developed using the techniques presented in Section 2.4.
3.5. Ensemble
To build ensemble 1, the average of the models’ forecasts was used, as described in Equation (17), for the forecast horizons of 1, 3, 6, 9 and 12 steps ahead. The forecasts were collected and the arithmetic mean was calculated, with all model outputs contributing equally to the final forecast, without the need for readjustment or retraining. Individual errors were calculated for each horizon.
Ensemble 2 was developed based on the median of the models’ forecasts, following the same steps as ensemble 1 and using Equation (18). Each model was evaluated individually.
Ensemble 3, unlike ensembles 1 and 2, is trainable. Initially, the model predictions were organized in a matrix Y and the actual values in a vector y. The weights that minimize the squared difference between predictions and actual values were calculated by applying the Moore-Penrose pseudo-inverse to solve the least squares problem of Equation (19).
For ensemble 4, the weighted average of the models’ predictions was used, with weights initially optimized using GA. A chromosome was formed representing the weights W, with the restriction that the weights add up to 1 and are non-negative. GA was applied with one-point crossover, dynamic mutation and tournament selection, as detailed in Table 6.
After simulations and tests with GA, the parameters for the PSO were defined (Table 7). Each particle in the PSO represents a candidate solution and adjusts its trajectory based on the best individual ($p_{best}$) and global ($g_{best}$) experiences, following Equation (16).
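As an illustration of how candidate weight vectors from GA or PSO can be handled, the sketch below repairs a candidate so that the weights are non-negative and sum to one, and evaluates its fitness as the MSE of the weighted-average forecast of Equation (20); all numbers are placeholders and the repair step is only one possible way to enforce the constraints described above.

```python
# Minimal sketch: constraint repair and fitness evaluation for ensembles 4 and 5.
import numpy as np

def repair(weights):
    w = np.clip(weights, 0.0, None)                 # enforce non-negativity
    return w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))

def ensemble_mse(weights, preds, y_true):
    w = repair(weights)
    return float(np.mean((preds.T @ w - y_true) ** 2))   # Equation (20) combined with the MSE

preds = np.array([[70.2, 68.9, 72.4],               # rows: models, columns: time steps
                  [71.0, 69.5, 71.8],
                  [69.5, 70.1, 71.0]])
y_true = np.array([70.0, 69.3, 72.0])
candidate = np.array([0.8, -0.1, 0.5])              # raw candidate produced by GA/PSO
print(repair(candidate), ensemble_mse(candidate, preds, y_true))
```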
3.6. Post-Processing
After the initial transformations to the time series data, it was necessary to reverse the modifications to recover the “removed” values and return the forecasts to the original scale. This makes it easier to understand and visualize the results accurately, ensuring that the accuracy metrics are on the same scale as the original data.
The following evaluation metrics were used: MSE, MAE, MAPE and AE.
The Mean Squared Error (MSE) calculates the average of the squared errors, as in Equation (24):

$MSE = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2$ (24)

The Mean Absolute Percentage Error (MAPE) avoids scale-dependent penalties and is represented by Equation (25):

$MAPE = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$ (25)

The Mean Absolute Error (MAE) calculates the average of the absolute errors, as in Equation (26):

$MAE = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|$ (26)

Finally, the Absolute Error (AE) is the absolute difference between the observed and predicted values at each time step, calculated with Equation (27):

$AE_t = \left|y_t - \hat{y}_t\right|$ (27)

where $y_t$ is the observed value, $\hat{y}_t$ is the predicted value, and n is the number of forecasts.
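A minimal sketch of these metrics, Equations (24)–(27), is given below; the arrays hold placeholder values, and the MAPE is kept as a fraction (not multiplied by 100), as in the result tables.

```python
# Minimal sketch of the evaluation metrics in Equations (24)-(27).
import numpy as np

def mse(y, y_hat):  return float(np.mean((y - y_hat) ** 2))
def mae(y, y_hat):  return float(np.mean(np.abs(y - y_hat)))
def mape(y, y_hat): return float(np.mean(np.abs((y - y_hat) / y)))
def ae(y, y_hat):   return np.abs(y - y_hat)        # one value per forecast, as in the AE curves

y_true = np.array([70.0, 69.3, 72.0, 72.8])
y_pred = np.array([71.2, 69.8, 70.5, 73.4])
print(mse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred), ae(y_true, y_pred))
```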
Section 4 will present the final results with the models adjusted and reversed as described in Section 3.6.
4. Results
This section presents the results of the models evaluated for each forecast horizon, based on the MSE, MAE and MAPE errors, followed by a ranking of the models (Table 8). For each horizon, the best result is illustrated next to the actual data, as well as the Absolute Error (AE) curves over time for the 14 models evaluated. The graphs are organized as follows:
Figure 6: A corresponds to the prediction of the best model and B to the evaluation of the AE for one-step ahead;
Figure 7: C represents the prediction of the best model and D the AE evaluation for three-steps ahead;
Figure 8: E shows the prediction of the best model and F the AE evaluation corresponding to six-steps ahead;
Figure 9: G shows the best model prediction and H the AE evaluation for nine-steps ahead;
Figure 10: I contains the prediction of the best model and J the AE evaluation of the absolute error considering twelve-steps ahead.
As shown in Table 8, ensemble 5, which uses the weighted average with PSO, stood out by dynamically adjusting the weights of the models in the ensemble based on historical performance, maximizing overall accuracy. This flexibility justifies its superior performance compared to ensembles 1 and 2, which assign equal weight to each model, according to Equations (17) and (18). By assigning a score to each model according to its position in each evaluation metric, an overall score can be constructed as the sum of these scores. Although ensemble 5 had the best overall performance, ensemble 3 stood out with the best position with respect to the MAPE error.
Figure 6 illustrates the best model for the one-step-ahead forecast on the test set (Observed).
Subfigure A contains the values predicted by the best model, while subfigure B presents the absolute error for each predicted value of all the models, alongside the MAE errors per model. The same layout is used for the other forecast horizons.
After evaluating the one-step ahead forecasts, we moved on to analyze the three-steps ahead horizons. Table 9 shows the results of all the models based on the MSE, MAE and MAPE error metrics.
For this horizon, ensemble 3 stood out, using the Moore-Penrose pseudo-inverse to combine the models and taking better advantage of their individual characteristics. Ensembles 4 and 5 also outperformed the individual models, indicating the effectiveness of the GA and PSO approaches. Individual models such as AR, ARMA and ARIMA showed relatively high errors, with the multiplicative Holt-Winters model obtaining the highest MSE.
In the three-steps horizon, the ARMA-GA model outperformed ARMA-PSO, possibly due to uncertainties in the parameter selection process. The smoothing models behaved similarly to the one-step horizon, with larger errors in multi-step forecasts. Ensemble 3 again stood out in this horizon.
As mentioned above and illustrated in Table 9, ensemble 3 obtained better results than all the predictive models. This is because ensemble 3 is more precise when adjusting the weights, directly minimizing the prediction error. In this case, it provides more sensitive and accurate responses to fluctuations, which are more evident in forecasts with longer horizons. Its performance is also evident when evaluating the final ranking, thus obtaining a better score in all error metrics. It is worth noting, however, that as in the previous step, ensembles 4 and 5 obtained good results compared to the other ensembles, again highlighting the efficiency of using GA and PSO. In this sense, Figure 7 shows the best prediction model, obtained by ensemble 3.
Similarly, we went on to evaluate other forecast horizons, in this case for six-steps ahead, as shown in Table 10.
Ensemble 3 was again superior, reinforcing its ability to determine the best weightings for longer forecasts. Ensemble 5 also stood out, showing the efficiency of the optimization algorithms. Its performance is also evident in the final ranking, with a better score in all the error metrics. Ensemble 2, which uses the median, performed reasonably well, being robust against outliers as the number of steps increases. Figure 8 shows the best result for this horizon, obtained by ensemble 3.
After these considerations, the forecasts for the models considering nine-steps ahead were evaluated, as illustrated in Table 11.
Ensemble 3 stood out again, as shown in Table 11. As the horizons increase, the errors of the individual models increase significantly, which does not occur in the ensembles. Ensemble 3 was the best for forecasts nine-steps ahead, as illustrated in Figure 9. Its performance is also evident when evaluating the final ranking, thus obtaining a better score in all the error metrics. The Box & Jenkins models maintained their performance, highlighting the efficiency of the GA and PSO algorithms. Although ensemble 2 was not the best, it obtained considerable results, demonstrating its robustness for longer horizon forecasts, due to the reduction in variability when using central values.
Finally, with regard to the last forecast horizon, considering twelve-steps ahead, Table 12 shows the results of all the models.
The Box & Jenkins models maintained the same relative results as in the previous cases. Among the smoothing models there was a change, with the additive and multiplicative Holt-Winters models, previously the worst, becoming the best. Ensemble 3 remains the most effective.
The exponential smoothing models showed variations in results, with the Additive Holt-Winters being the most effective, especially in long-term forecasts, due to its stability and predictability. The SES model also benefited from stationarity in shorter-horizon forecasts.
Finally, the results reinforce that ensemble 3 significantly outperformed the individual models, and the ensembles in general proved superior at other forecast horizons.
After the aforementioned considerations for a forecast horizon of twelve-steps ahead, Figure 10 shows the best answer, in this case ensemble 3.
Several models were analyzed, and MSE, MAE and MAPE were used to evaluate them. These metrics reflect average values (the best overall approximation in the analysis). Abrupt changes in the direction of the time series are difficult for the models to predict, but some models adapt better than others. By analyzing the AE, it is possible to see which models produce the smallest outliers, additional behavioral information that the usual averages do not provide.
The results presented in this section confirm the concepts discussed in the Section 2.4, demonstrating the robustness of the ensemble models over different forecast horizons. Specifically, ensemble 5 proved to be superior in the forecast horizon of one-step ahead, while for forecasts of 3, 6, 9 and 12 steps ahead, ensemble 3 was superior to all models.
In general, ensemble models have different advantages and disadvantages. The mean is simple, but can be influenced by outliers. The median is robust against outliers, but can ignore variability. The Moore-Penrose inverse optimizes weights based on historical performance, and is accurate but computationally complex. Weighted averaging with PSO and GA dynamically adjusts the weights, improving accuracy, but requires more computing power. For short-term forecasts, the mean and median are effective; for the long term, the inverse of Moore-Penrose and the weighted mean offer better optimization, provided there is sufficient data.
5. Conclusions
The main contribution of this work is related to the use of the pseudo-inverse of Moore-Penrose to determine the weights of the models to be used in the formation of the Ensemble, in addition to the use of metaheuristics.
It is known that GA and PSO algorithms are widely used in the literature, although not as often for ensemble applications. In this sense, as an initial work, it was decided to use these techniques.
The results show that the ensemble models, especially those that used metaheuristics and the pseudo-inverse of Moore-Penrose, significantly improved the individual results of the predictive models at all forecast horizons.
After pre-processing the data, the model parameters were determined in various ways: for the smoothing models, a numerical model that minimizes the cost function was used; for the Box & Jenkins models, the Yule-Walker equations and maximum likelihood estimators were used, with delays tested exhaustively. Specifically for the ARMA model, two coefficient optimization techniques were used: GA and PSO. For the ensemble, several strategies were tested, including arithmetic mean, median, pseudo-inverse of Moore-Penrose and weighted mean with GA and PSO. The results showed that the ensemble approaches outperformed the individual models, with the weighted average with PSO (ensemble 5) standing out in step 1, and the pseudo-inverse of Moore-Penrose (ensemble 3) in the other steps.
The results indicate the feasibility of using ensembles in time series forecasting, allowing it to be applied to forecasting models other than linear ones.
In this sense, the research can be further developed with the insertion of other approaches aimed at technological development, such as the creation of other Ensembles. Mention could be made of the use of artificial neural networks to form a non-linear ensemble.
Conceptualization, J.L.F.d.S., Y.R.K., S.L.S.J. and H.V.S.; methodology, J.L.F.d.S., Y.R.K., S.L.S.J. and H.V.S.; software, J.L.F.d.S.; validation, J.L.F.d.S., Y.R.K., S.L.S.J., T.A.A. and H.V.S.; formal analysis, A.J.C.V.; investigation, J.L.F.d.S., Y.R.K. and H.V.S.; resources, T.A.A.; data curation, J.L.F.d.S., Y.R.K., S.L.S.J. and H.V.S.; writing—original draft preparation, J.L.F.d.S., Y.R.K., S.L.S.J. and H.V.S.; writing—review and editing, J.L.F.d.S., A.J.C.V., S.L.S.J. and H.V.S.; visualization, J.L.F.d.S.; supervision, H.V.S.; project administration, H.V.S.; funding acquisition, T.A.A. All authors have read and agreed to the published version of the manuscript.
Data are contained within the article.
The authors declare no conflicts of interest.
The following abbreviations are used in this manuscript:
ACF | Autocorrelation function |
AE | Absolute error |
A-HW | Additive Holt-Winters’ models |
AR | Autoregressive |
ARIMA | Autoregressive integrated moving average |
ARMA | Autoregressive-moving-average |
EIA | Energy Information Administration |
GA | Genetic algorithm |
HES | Holt exponential smoothing |
MAE | Mean absolute error |
MAPE | Mean absolute percentage error |
M-HW | Multiplicative Holt-Winters’ models |
MSE | Mean squared error |
PACF | Partial autocorrelation function |
PSO | Particle swarm optimization |
SES | Simple exponential smoothing |
WTI | West Texas Intermediate |
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Smoothing Model Coefficients.
Models | α | β | γ |
---|---|---|---|
SES | 1.00 | - | - |
Holt | 1.00 | | - |
A-HW | | | |
M-HW | 1.00 | | |
Source: Own authorship (2024).
Box & Jenkins Model Coefficients.
Models | p | d | q |
---|---|---|---|
AR | 1 | - | - |
ARMA | 1 | - | 3 |
ARIMA | 6 | 1 | 6 |
Source: Own authorship (2024).
Description of GA Parameters—ARMA.
Parameters | Values |
---|---|
Population Size | 50 |
Number of Iterations | 30 |
Offspring Generated by Crossover | 50 |
Mutation Amplitude Regulator | 0.1 |
Percentage Mutation Rate | 10 % |
Frequency of Use of Local Search | 100 |
Local Search Delta | 0.05 |
Maximum Number of Generations | 150 |
Search Interval for φ | [−1, 1] |
Search Interval for θ | [−1, 1] |
Source: Own authorship (2024).
Description of the PSO—ARMA Parameters.
Parameters | Values |
---|---|
Number of Particles | 150 |
Number of Iterations | 150 |
Inertia Coefficient w | 0.7 |
Cognitive Term | 2.0 |
Social Term | 2.0 |
Maximum Number of Executions | 30 |
Local Search Delta | 0.05 |
Search Interval for φ | [−1, 1] |
Search Interval for θ | [−1, 1] |
Source: Own authorship (2024).
Optimized ARMA Coefficients.
Model | $\phi_1$ | $\theta_1$ | $\theta_2$ | $\theta_3$ |
---|---|---|---|---|
ARMA-GA | 0.1192 | −0.0081 | −0.0586 | −0.1268 |
ARMA-PSO | 0.6036 | −0.5537 | 0.1161 | −0.0438 |
Source: Own authorship (2024).
Description of the GA Parameters for Optimizing the Weights in a Weighted Average Ensemble.
Parameters | Values |
---|---|
Population Size | 150 |
Number of Iterations | 30 |
Offspring Generated by Crossover | By Generation, Half of the Population |
Crossover Rate | 90% |
Mutation Rate | 50% |
Mutation Amplitude (Standard Deviation) | 0.1 |
Tournament Size | 4 |
Number of Individual Models | 9 |
Source: Own authorship (2024).
Description of the PSO Parameters for Optimizing the Weights in a Weighted Average Ensemble.
Parameters | Values |
---|---|
Number of Particles | 150 |
Number of Iterations | 150 |
Inertia Coefficient w | 0.9 |
Cognitive Term | 2.0 |
Social Term | 2.0 |
Maximum Number of Executions | 30 |
Number of Models by Particle | 9 |
Source: Own authorship (2024).
Evaluation and Rankings for One-Step Ahead Forecasts.
Models | MSE | MAE | MAPE | Rank MSE | Rank MAE | Rank MAPE | Total Score | Final Ranking |
---|---|---|---|---|---|---|---|---|
Ensemble 5 | 26.1324 | 3.8052 | 0.0732 | 1 | 1 | 2 | 4 | 1 |
Ensemble 1 | 26.5979 | 3.8855 | 0.0748 | 2 | 3 | 3 | 8 | 2 |
ARMA-PSO | 35.7461 | 4.5461 | 0.0836 | 3 | 4 | 4 | 11 | 3 |
ARMA-GA | 36.0904 | 4.5779 | 0.0845 | 4 | 5 | 5 | 14 | 4 |
SES | 39.3276 | 4.7244 | 0.0878 | 5 | 8 | 8 | 21 | 5 |
Holt | 39.7390 | 4.7383 | 0.0881 | 6 | 9 | 9 | 24 | 6 |
ARMA | 42.1666 | 4.9635 | 0.0917 | 7 | 6 | 7 | 20 | 7 |
ARIMA | 42.2707 | 4.9896 | 0.0915 | 8 | 7 | 6 | 21 | 8 |
AR | 42.5677 | 5.0139 | 0.0924 | 9 | 11 | 10 | 30 | 9 |
M-HW | 43.6446 | 5.0792 | 0.0931 | 10 | 13 | 12 | 35 | 10 |
A-HW | 42.7063 | 5.0471 | 0.0940 | 11 | 12 | 13 | 36 | 11 |
Ensemble 2 | 38.1103 | 4.8385 | 0.0888 | 7 | 10 | 11 | 28 | 12 |
Ensemble 4 | 42.8594 | 5.0280 | 0.9389 | 12 | 14 | 14 | 40 | 13 |
Ensemble 3 | 47.2520 | 3.8363 | 0.0732 | 14 | 2 | 1 | 17 | 14 |
Source: Own authorship (2024).
Evaluation and Rankings for Three-Steps Ahead Forecasts.
Models | MSE | MAE | MAPE | Rank MSE | Rank MAE | Rank MAPE | Total Score | Final Ranking |
---|---|---|---|---|---|---|---|---|
Ensemble 3 | 31.5856 | 4.2631 | 0.0809 | 1 | 1 | 1 | 3 | 1 |
Ensemble 5 | 46.6288 | 5.1197 | 0.1039 | 2 | 6 | 2 | 10 | 2 |
ARMA-GA | 62.9834 | 4.5779 | 0.1230 | 3 | 3 | 5 | 11 | 3 |
ARMA-PSO | 72.1269 | 6.2979 | 0.1148 | 4 | 4 | 3 | 11 | 4 |
Ensemble 4 | 62.1147 | 6.0290 | 0.1206 | 5 | 5 | 4 | 14 | 5 |
SES | 85.8354 | 6.8282 | 0.1371 | 6 | 8 | 6 | 20 | 6 |
Holt | 87.4450 | 6.9225 | 0.1385 | 7 | 9 | 7 | 23 | 7 |
Ensemble 1 | 80.2616 | 6.6111 | 0.1325 | 8 | 7 | 8 | 23 | 8 |
ARMA | 93.7271 | 4.3698 | 0.1500 | 9 | 2 | 9 | 20 | 9 |
ARIMA | 94.1366 | 7.1262 | 0.1451 | 10 | 10 | 11 | 31 | 10 |
AR | 97.7222 | 7.4393 | 0.1572 | 11 | 11 | 12 | 34 | 11 |
A-HW | 100.0355 | 7.1951 | 0.1491 | 12 | 12 | 10 | 34 | 12 |
M-HW | 105.9239 | 7.5685 | 0.1528 | 13 | 13 | 13 | 39 | 13 |
Ensemble 2 | 116.4333 | 8.1610 | 0.1540 | 14 | 14 | 14 | 42 | 14 |
Source: Own authorship (2024).
Evaluation and Rankings for Six-Steps Ahead Forecasts.
Models | MSE | MAE | MAPE | Rank MSE | Rank MAE | Rank MAPE | Total Score | Final Ranking |
---|---|---|---|---|---|---|---|---|
Ensemble 3 | 36.6887 | 4.8088 | 0.0940 | 1 | 1 | 1 | 3 | 1 |
Ensemble 4 | 42.2360 | 5.1421 | 0.0985 | 2 | 2 | 2 | 6 | 2 |
Ensemble 5 | 70.8841 | 6.1835 | 0.1182 | 3 | 3 | 3 | 9 | 3 |
Ensemble 2 | 92.1716 | 6.9610 | 0.1375 | 4 | 4 | 4 | 12 | 4 |
ARMA-PSO | 102.7793 | 7.6440 | 0.1418 | 5 | 5 | 5 | 15 | 5 |
ARMA-GA | 108.8052 | 7.9844 | 0.1469 | 6 | 6 | 6 | 18 | 6 |
Ensemble 1 | 113.8113 | 7.7389 | 0.1515 | 7 | 7 | 7 | 21 | 7 |
SES | 145.3774 | 6.6541 | 0.1718 | 11 | 8 | 10 | 29 | 8 |
ARIMA | 143.9546 | 8.5516 | 0.1633 | 10 | 9 | 7 | 26 | 9 |
AR | 141.2104 | 8.6026 | 0.1633 | 9 | 10 | 8 | 27 | 10 |
ARMA | 136.2695 | 8.8349 | 0.1656 | 8 | 11 | 9 | 28 | 11 |
A-HW | 152.3447 | 8.8144 | 0.1734 | 13 | 12 | 12 | 37 | 12 |
Holt | 151.8369 | 8.8701 | 0.1747 | 12 | 13 | 11 | 36 | 13 |
M-HW | 181.0155 | 9.7933 | 0.1855 | 14 | 14 | 13 | 41 | 14 |
Source: Own authorship (2024).
Evaluation and Rankings for Nine-Steps Ahead Forecasts.
Models | MSE | MAE | MAPE | Rank MSE | Rank MAE | Rank MAPE | Total Score | Final Ranking |
---|---|---|---|---|---|---|---|---|
Ensemble 3 | 40.7999 | 5.1785 | 0.0996 | 1 | 2 | 2 | 5 | 1 |
Ensemble 4 | 42.8763 | 5.0218 | 0.0948 | 2 | 1 | 1 | 4 | 2 |
Ensemble 5 | 76.2204 | 6.6119 | 0.1212 | 3 | 3 | 3 | 9 | 3 |
Ensemble 2 | 107.0554 | 7.7230 | 0.1550 | 4 | 4 | 4 | 12 | 4 |
ARMA-GA | 116.7820 | 8.2592 | 0.1580 | 5 | 5 | 6 | 16 | 5 |
ARMA-PSO | 138.9001 | 8.3752 | 0.1573 | 6 | 6 | 5 | 17 | 6 |
Ensemble 1 | 131.7656 | 8.6014 | 0.1707 | 7 | 7 | 7 | 21 | 7 |
SES | 195.8325 | 10.2751 | 0.2050 | 10 | 8 | 9 | 27 | 8 |
ARMA | 183.8635 | 10.2683 | 0.1816 | 8 | 9 | 8 | 25 | 9 |
AR | 207.4238 | 10.7631 | 0.1998 | 9 | 10 | 10 | 29 | 10 |
Holt | 209.9169 | 10.7295 | 0.2121 | 11 | 11 | 10 | 32 | 11 |
ARIMA | 235.2772 | 11.7777 | 0.2230 | 12 | 12 | 11 | 35 | 12 |
M-HW | 262.9169 | 12.3568 | 0.2253 | 14 | 13 | 12 | 39 | 13 |
A-HW | 254.0866 | 12.4684 | 0.2357 | 13 | 14 | 13 | 40 | 14 |
Source: Own authorship (2024).
Evaluation and Rankings for Twelve-Steps Ahead Forecasts.
Models | MSE | MAE | MAPE | Rank MSE | Rank MAE | Rank MAPE | Total Score | Final Ranking |
---|---|---|---|---|---|---|---|---|
Ensemble 3 | 42.399 | 4.9668 | 0.0980 | 1 | 1 | 1 | 3 | 1 |
Ensemble 4 | 59.7894 | 5.7881 | 0.1103 | 2 | 2 | 2 | 6 | 2 |
Ensemble 5 | 84.6815 | 6.7984 | 0.1286 | 3 | 3 | 3 | 9 | 3 |
Ensemble 2 | 106.8666 | 7.3913 | 0.1541 | 4 | 4 | 4 | 12 | 4 |
ARMA-GA | 120.2188 | 8.4345 | 0.1550 | 5 | 6 | 5 | 16 | 5 |
Ensemble 1 | 122.8396 | 8.0137 | 0.1633 | 6 | 5 | 7 | 18 | 6 |
ARMA-PSO | 147.1678 | 8.9393 | 0.1641 | 7 | 7 | 6 | 20 | 7 |
ARMA | 160.8367 | 9.5246 | 0.1768 | 8 | 8 | 8 | 24 | 8 |
AR | 179.3992 | 9.7180 | 0.1840 | 9 | 9 | 9 | 27 | 9 |
ARIMA | 191.2371 | 9.7794 | 0.1891 | 10 | 10 | 10 | 30 | 10 |
A-HW | 190.4970 | 9.9673 | 0.1967 | 11 | 11 | 11 | 33 | 11 |
M-HW | 238.7421 | 11.3661 | 0.2129 | 12 | 12 | 12 | 36 | 12 |
SES | 346.1113 | 13.1934 | 0.2469 | 13 | 13 | 13 | 39 | 13 |
Holt | 379.3841 | 13.8740 | 0.2570 | 14 | 14 | 14 | 42 | 14 |
Source: Own authorship (2024).
Appendix A
This appendix complements the parameters reported for the smoothing and ARMA(p,q) models and lists the weights associated with Ensemble 3, shown below.
Weights Associated with Ensemble 3.
Models | Weights |
---|---|
SES | 5.3110 |
ARMA | 0.3103 |
AR | 0.9854 |
ARMA-PSO | −0.0534 |
A-HW | 0.6266 |
M-HW | −1.0575 |
HOLT | −4.8867 |
ARIMA | −1.1908 |
ARIMA-GA | 0.6770 |
References
1. U.S. Energy Information Administration (EIA). Petroleum & Other Liquids. 2023; Available online: https://www.eia.gov/petroleum (accessed on 3 June 2023).
2. Duan, K.; Ren, X.; Wen, F.; Chen, J. Evolution of the information transmission between Chinese and international oil markets: A quantile-based framework. J. Commod. Mark.; 2023; 29, 100304. [DOI: https://dx.doi.org/10.1016/j.jcomm.2022.100304]
3. Balcilar, M.; Gabauer, D.; Umar, Z. Crude Oil futures contracts and commodity markets: New evidence from a TVP-VAR extended joint connectedness approach. Resour. Policy; 2021; 73, 102219. [DOI: https://dx.doi.org/10.1016/j.resourpol.2021.102219]
4. Lyu, Y.; Tuo, S.; Wei, Y.; Yang, M. Time-varying effects of global economic policy uncertainty shocks on crude oil price volatility: New evidence. Resour. Policy; 2021; 70, 101943. [DOI: https://dx.doi.org/10.1016/j.resourpol.2020.101943]
5. Khan, K.; Su, C.W.; Umar, M.; Yue, X.G. Do crude oil price bubbles occur?. Resour. Policy; 2021; 71, 101936. [DOI: https://dx.doi.org/10.1016/j.resourpol.2020.101936]
6. Wang, X.; Li, X.; Li, S. Point and interval forecasting system for crude oil price based on complete ensemble extreme-point symmetric mode decomposition with adaptive noise and intelligent optimization algorithm. Resour. Policy; 2022; 328, 120194. [DOI: https://dx.doi.org/10.1016/j.apenergy.2022.120194]
7. Karasu, S.; Altan, A. Crude oil time series prediction model based on LSTM network with chaotic Henry gas solubility optimization. Energy; 2022; 242, 122964. [DOI: https://dx.doi.org/10.1016/j.energy.2021.122964]
8. Ren, X.; Liu, Z.; Jin, C.; Lin, R. Oil price uncertainty and enterprise total factor productivity: Evidence from China. Int. Rev. Econ. Financ.; 2023; 83, pp. 201-218. [DOI: https://dx.doi.org/10.1016/j.iref.2022.08.024]
9. Yuan, J.; Li, J.; Hao, J. A dynamic clustering ensemble learning approach for crude oil price forecasting. Eng. Appl. Artif. Intell.; 2023; 123, 106408. [DOI: https://dx.doi.org/10.1016/j.engappai.2023.106408]
10. Zhang, T.; Tang, Z. The dependence and risk spillover between economic uncertainties and the crude oil market: New evidence from a Copula-CoVaR approach incorporating the decomposition technique. Environ. Sci. Pollut. Res.; 2023; 83, pp. 104116-104134. [DOI: https://dx.doi.org/10.1007/s11356-023-29624-0]
11. Inacio, C.; Kristoufek, L.; David, S. Assessing the impact of the Russia—Ukraine war on energy prices: A dynamic cross-correlation analysis. Phys. A Stat. Mech. Its Appl.; 2023; 626, 129084. [DOI: https://dx.doi.org/10.1016/j.physa.2023.129084]
12. An, S.; An, F.; Gao, X.; Wang, A. Early warning of critical transitions in crude oil price. Energy; 2023; 280, 128089. [DOI: https://dx.doi.org/10.1016/j.energy.2023.128089]
13. Siqueira, H.; Boccato, L.; Attux, R.; Lyra, C. Unorganized machines for seasonal streamflow series forecasting. Int. J. Neural Syst.; 2014; 24, pp. 1299-1316. [DOI: https://dx.doi.org/10.1142/S0129065714300095] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24552508]
14. Siqueira, H.; Belotti, J.T.; Boccato, L.; Luna, I.; Attux, R.; Lyra, C. Recursive linear models optimized by bioinspired metaheuristics to streamflow time series prediction. Int. Trans. Oper. Res.; 2023; 30, pp. 742-773. [DOI: https://dx.doi.org/10.1111/itor.12908]
15. Ren, Y.; Zhang, L.; Suganthan, P. Ensemble Classification and Regression-Recent Developments, Applications and Future Directions (Review Article). IEEE Comput. Intell. Mag.; 2016; 11, pp. 41-53. [DOI: https://dx.doi.org/10.1109/MCI.2015.2471235]
16. Yu, L.; Xu, H.; Tang, L. LSSVR ensemble learning with uncertain parameters for crude oil price forecasting. Appl. Soft Comput.; 2017; 56, pp. 692-701. [DOI: https://dx.doi.org/10.1016/j.asoc.2016.09.023]
17. Fathalla, A.; Alameer, Z.; Abbas, M.; Ali, A. A Deep Learning Ensemble Method for Forecasting Daily Crude Oil Price Based on Snapshot Ensemble of Transformer Model. Comput. Syst. Sci. Eng.; 2023; 46, pp. 929-950. [DOI: https://dx.doi.org/10.32604/csse.2023.035255]
18. Cen, Z.; Wang, J. Crude oil price prediction model with long short term memory deep learning based on prior knowledge data transfer. Energy; 2017; 169, pp. 160-171. [DOI: https://dx.doi.org/10.1016/j.energy.2018.12.016]
19. Wang, M.; Tian, L.; Zhou, P. A novel approach for oil price forecasting based on data fluctuation network. Energy Econ.; 2018; 71, pp. 201-212. [DOI: https://dx.doi.org/10.1016/j.eneco.2018.02.021]
20. Bildirici, M.; Bayazit, N.G.; Yasemen, U. Analyzing crude oil prices under the impact of COVID-19 by using lstargarchlstm. Energies; 2020; 13, 2980. [DOI: https://dx.doi.org/10.3390/en13112980]
21. Bekiroglu, K.; Duru, O.; Gulay, E.; Su, R.; Lagoa, C. Predictive analytics of crude oil prices by utilizing the intelligent model search engine. Appl. Energy; 2018; 228, pp. 2387-2397. [DOI: https://dx.doi.org/10.1016/j.apenergy.2018.07.071]
22. Gardner, E.S. Exponential smoothing: The state of the art—Part II. Int. J. Forecast.; 2006; 22, pp. 637-666. [DOI: https://dx.doi.org/10.1016/j.ijforecast.2006.03.005]
23. Saputra, N.D.; Aziz, A.; Harjito, B. Parameter optimization of Brown’s and Holt’s double exponential smoothing using golden section method for predicting Indonesian Crude Oil Price (ICP). Proceedings of the 2016 3rd International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE); Semarang, Indonesia, 19–20 October 2016; pp. 356-360. [DOI: https://dx.doi.org/10.1109/ICITACEE.2016.7892471]
24. Hyndman, R.J.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw.; 2008; 27, pp. 1-22. [DOI: https://dx.doi.org/10.18637/jss.v027.i03]
25. Hyndman, R.J.; Athanasopoulos, G. Forecasting Principles and Practice; OTexts: Melbourne, Australia, 2021; Volume 3, pp. 1-442.
26. Awajan, A.M.; Ismail, M.T.; Al Wadi, S. Improving forecasting accuracy for stock market data using EMD-HW bagging. PLoS ONE; 2018; 13, 199582. [DOI: https://dx.doi.org/10.1371/journal.pone.0199582]
27. Papastefanopoulos, V.; Linardatos, P.; Kotsiantis, S. COVID-19: A Comparison of Time Series Methods to Forecast Percentage of Active Cases per Population. Appl. Sci.; 2020; 10, 3880. [DOI: https://dx.doi.org/10.3390/app10113880]
28. Box, G.; Jenkis, G.; Reinsel, C.; Ljung, M. Time series analysis: Forecasting and control. Wiley Ser. Probab. Stat. N. J.; 2015; 301, pp. 1-709.
29. Theerthagiri, P.; Ruby, A.U. Seasonal learning based ARIMA algorithm for prediction of Brent oil Price trends. Multimed. Tools Appl.; 2023; 18, pp. 2485-24504. [DOI: https://dx.doi.org/10.1007/s11042-023-14819-x]
30. Gujarati, D.N.; Porter, D.C. Econometria Básica-5; Amgh Editora: Porto Alegre, Brazil, 2011.
31. Haykin, S. Adaptive Filter Theory; Pearson Education India: Chennai, India, 2002.
32. Shadab, T.; Ahmad, S.; Said, S. Spatial forecasting of solar radiation using ARIMA model. Remote Sens. Appl. Soc. Environ.; 2023; 20, 100427. [DOI: https://dx.doi.org/10.1016/j.rsase.2020.100427]
33. Almasarweh, M.; Wadi, S.A. ARIMA Model in Predicting Banking Stock Market Data. Mod. Appl. Sci.; 2018; 12, pp. 309-312. [DOI: https://dx.doi.org/10.5539/mas.v12n11p309]
34. Zhong, C. Oracle-efficient estimation and trend inference in non-stationary time series with trend and heteroscedastic ARMA error. Comput. Stat. Data Anal.; 2024; 193, e1475. [DOI: https://dx.doi.org/10.1016/j.csda.2024.107917]
35. Kennedy, J.; Eberhart, R. Particle swarm optimization. Proceedings of the Proceedings of ICNN’95—International Conference on Neural Networks; Perth, WA, Australia, 27 November–1 December 1995; Bastos Filho, C.J.A.; Pozo, A.R.; Lopes, H.S. IEEE: New York, NY, USA, 1995; Volume 4, pp. 1942-1948. [DOI: https://dx.doi.org/10.1109/ICNN.1995.488968]
36. Cortez, P.; Rocha, M.; Neves, J. Evolving Time Series Forecasting ARMA Models. J. Heuristics; 2012; 10, pp. 137-151. [DOI: https://dx.doi.org/10.1023/B:HEUR.0000034714.09838.1e]
37. Binitha, S.; Sathya, S.S. A survey of bio inspired optimization algorithms. Int. J. Soft Comput. Eng.; 2012; 2, pp. 137-151.
38. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; MIT Press: Cambridge, MA, USA, 1975.
39. Goldberg, D. Genetic Algorithms in Search, Optimization, and Machine Learning; Addison-Wesley Publishing Company: London, UK, 1989.
40. Michalewicz, Z.; Schoenauer, M. Evolving Time Series Forecasting ARMA Models. Evol. Comput.; 1996; 4, pp. 1-32. [DOI: https://dx.doi.org/10.1162/evco.1996.4.1.1]
41. Eren, B.; Rezan, U.V.; Ufuk, Y.; Erol, E. A modified genetic algorithm for forecasting fuzzy time series. Appl. Intell.; 2014; 41, pp. 453-463. [DOI: https://dx.doi.org/10.1007/s10489-014-0529-x]
42. Aljamaan, I.; Alenany, A. Identification of Wiener Box-Jenkins Model for Anesthesia Using Particle Swarm Optimization. Appl. Sci.; 2022; 12, 4817. [DOI: https://dx.doi.org/10.3390/app12104817]
43. Edalatpanah, S.A.; Hassani, F.S.; Smarandache, F.; Sorourkhah, A.; Pamucar, D.; Cui, B. A hybrid time series forecasting method based on neutrosophic logic with applications in financial issues. Eng. Appl. Artif. Intell.; 2024; 129, 107531. [DOI: https://dx.doi.org/10.1016/j.engappai.2023.107531]
44. Kennedy, J.; Eberhart, R.; Shi, Y. Swarm Intelligence; Evolutionary Computation Series; Elsevier Science: Amsterdam, The Netherlands, 2001.
45. Donate, J.P.; Cortez, P.; Sánchez, G.G.; de Miguel, A.S. Time series forecasting using a weighted cross-validation evolutionary artificial neural network ensemble. Neurocomputing; 2013; 109, pp. 27-32. [DOI: https://dx.doi.org/10.1016/j.neucom.2012.02.053]
46. Silva, E.G.; de O. Júnior, D.S.; Cavalcanti, G.D.C.; de Mattos Neto, P.S.G. Improving the accuracy of intelligent forecasting models using the Perturbation Theory. Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN); Rio de Janeiro, Brazil, 8–13 July 2018; IEEE: New York, NY, USA, 2018; pp. 1-7. [DOI: https://dx.doi.org/10.1109/IJCNN.2018.8489697]
47. Wang, L.; Wang, Z.; Qu, H.; Liu, S. Optimal Forecast Combination Based on Neural Networks for Time Series Forecasting. Appl. Soft Comput.; 2018; 66, pp. 1-17. [DOI: https://dx.doi.org/10.1016/j.asoc.2018.02.004]
48. Perrone, M.P.; Cooper, L. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks; World Scientific Publishing: Singapore, 1995; Volume 109, pp. 1-404.
49. Sun, Y.; Tang, K.; Zhu, Z.; Yao, X. Concept Drift Adaptation by Exploiting Historical Knowledge. IEEE Trans. Neural Netw. Learn. Syst.; 2018; 29, pp. 4822-4832. [DOI: https://dx.doi.org/10.1109/TNNLS.2017.2775225] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29993956]
50. Britto, A.S.; Sabourin, R.; Oliveira, L.E. Dynamic selection of classifiers—A comprehensive review. Pattern Recognit.; 2014; 47, pp. 3665-3680. [DOI: https://dx.doi.org/10.1016/j.patcog.2014.05.003]
51. Nosrati, V.; Rahmani, M. An ensemble framework for microarray data classification based on feature subspace partitioning. Comput. Biol. Med.; 2022; 148, 105820. [DOI: https://dx.doi.org/10.1016/j.compbiomed.2022.105820]
52. Wilson, J.; Chaudhury, S.; Lall, B. Homogeneous—Heterogeneous Hybrid Ensemble for concept-drift adaptation. Neurocomputing; 2023; 557, 126741. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.126741]
53. Mendes-Moreira, J.A.; Soares, C.; Jorge, A.M.; Sousa, J.F.D. Ensemble approaches for regression: A survey. ACM Comput. Surv.; 2012; 45, pp. 1-40. [DOI: https://dx.doi.org/10.1145/2379776.2379786]
54. Heinermann, J.; Kramer, O. Machine learning ensembles for wind power prediction. Renew. Energy; 2016; 89, pp. 671-679. [DOI: https://dx.doi.org/10.1016/j.renene.2015.11.073]
55. Ma, Z.; Dai, Q. Selected an Stacking ELMs for Time Series Prediction. Neural Process. Lett.; 2016; 44, pp. 831-856. [DOI: https://dx.doi.org/10.1007/s11063-016-9499-9]
56. de Oliveira, J.F.L.; Silva, E.G.; de Mattos Neto, P.S.G. A Hybrid System Based on Dynamic Selection for Time Series Forecasting. IEEE Trans. Neural Netw. Learn. Syst.; 2022; 33, pp. 3251-3263. [DOI: https://dx.doi.org/10.1109/TNNLS.2021.3051384] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33513115]
57. Cruz, R.M.; Sabourin, R.; Cavalcanti, G.D. Dynamic classifier selection: Recent advances and perspectives. Inf. Fusion; 2018; 41, pp. 195-216. [DOI: https://dx.doi.org/10.1016/j.inffus.2017.09.010]
58. Barrow, D.K.; Crone, S.F.; Kourentzes, N. An evaluation of neural network ensembles and model selection for time series prediction. Proceedings of the the 2010 International Joint Conference on Neural Networks IJCNN; Barcelona, Spain, 18–23 July 2010; IEEE: New York, NY, USA, 2010; pp. 1-8. [DOI: https://dx.doi.org/10.1109/IJCNN.2010.5596686]
59. Kourentzes, N.; Barrow, D.K.; Crone, S.F. Neural network ensemble operators for time series forecasting. Expert Syst. Appl.; 2014; 41, pp. 4235-4244. [DOI: https://dx.doi.org/10.1016/j.eswa.2013.12.011]
60. Kazmaier, J.; van Vuuren, J.H. The power of ensemble learning in sentiment analysis. Expert Syst. Appl.; 2022; 187, 115819. [DOI: https://dx.doi.org/10.1016/j.eswa.2021.115819]
61. Chung, D.; Yun, J.; Lee, J.; Jeon, Y. Predictive model of employee attrition based on stacking ensemble learning. Expert Syst. Appl.; 2023; 215, 119364. [DOI: https://dx.doi.org/10.1016/j.eswa.2022.119364]
62. Kuncheva, L.I.; Rodríguez, J.J. A weighted voting framework for classifiers ensembles. Knowl. Inf. Syst.; 2014; 38, pp. 259-275. [DOI: https://dx.doi.org/10.1007/s10115-012-0586-6]
63. Large, J.; Lines, J.; Bagnall, A. A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data Min. Knowl. Discov.; 2019; 33, pp. 1674-1709. [DOI: https://dx.doi.org/10.1007/s10618-019-00638-y]
64. Baradaran, R.; Amirkhani, H. Ensemble learning-based approach for improving generalization capability of machine reading comprehension systems. Neurocomputing; 2021; 33, pp. 229-242. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.08.095]
65. Baksalary, O.M.; Trenkler, G. The Moore–Penrose inverse: A hundred years on a frontline of physics research. Eur. Phys. J. H; 2021; 46, 9. [DOI: https://dx.doi.org/10.1140/epjh/s13129-021-00011-y]
66. Safari, A.; Davallou, M. Oil price forecasting using a hybrid model. Energy; 2018; 148, pp. 49-58. [DOI: https://dx.doi.org/10.1016/j.energy.2018.01.007]
67. Cox, D.R.; Stuart, A. Some quick sign tests for trend in location and dispersion. Biometrika; 1955; 42, pp. 80-95. [DOI: https://dx.doi.org/10.1093/biomet/42.1-2.80]
68. Sprent, P.; Smeeton, N.C. Applied Nonparametric Statistical Methods; CRC Press: Boca Raton, FL, USA, 2016.
69. Chyon, F.A.; Suman, M.N.H.; Fahim, M.R.I.; Ahmmed, M.S. Time series analysis and predicting COVID-19 affected patients by ARIMA model using machine learning. J. Virol. Methods; 2022; 301, 114433. [DOI: https://dx.doi.org/10.1016/j.jviromet.2021.114433] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34919977]
70. Montgomery, D.; Jennings, C.L.; Kulachi, M. Introduction to Time Series Analysis and Forecasting; John Wiley & Sons: Cambridge, MA, USA, 2008.
71. Siqueira, H.V.; Luna, I. Modelos Lineares Realimentados de Previsão: Um Estudo Utilizando Algoritmos Evolucionários. Proceedings of the Anais do 12 Congresso Brasileiro de Inteligência Computacional; Curitiba, Brazil, 13–16 October 2016; Bastos Filho, C.J.A.; Pozo, A.R.; Lopes, H.S. pp. 1-6.
72. Awan, T.M.; Aslam, F. Prediction of daily COVID-19 cases in European countries using automatic ARIMA model. J. Public Health Res.; 2021; 9, pp. 4101-4111. [DOI: https://dx.doi.org/10.4081/jphr.2020.1765]
73. Alsharef, A.; Aggarwal, K.; Sonia,; Kumar, M.; Mishra, A. Review of ML and AutoML Solutions to Forecast Time-Series Data. Arch. Comput. Methods Eng.; 2022; 46, pp. 5297-5311. [DOI: https://dx.doi.org/10.1007/s11831-022-09765-0] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35669518]
74. Ubaid, A.; Hussain, F.; Saqib, M. Container Shipment Demand Forecasting in the Australian Shipping Industry: A Case Study of Asia—Oceania Trade Lane. J. Mar. Sci. Eng.; 2021; 9, 968. [DOI: https://dx.doi.org/10.3390/jmse9090968]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
This paper investigated the use of linear models to forecast crude oil futures prices (WTI) on a monthly basis, emphasizing their importance for financial markets and the global economy. The main objective was to develop predictive models using time series analysis techniques, such as autoregressive (AR), autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), as well as ARMA variants adjusted by genetic algorithms (ARMA-GA) and particle swarm optimization (ARMA-PSO). Exponential smoothing techniques, including SES, Holt, and Holt-Winters, in additive and multiplicative forms, were also covered. The models were integrated using ensemble techniques, by the mean, median, Moore-Penrose pseudo-inverse, and weighted averages with GA and PSO. The methodology adopted included pre-processing that applied techniques to ensure the stationarity of the data, which is essential for reliable modeling. The results indicated that for one-step-ahead forecasts, the weighted average ensemble with PSO outperformed traditional models in terms of error metrics. For multi-step forecasts (3, 6, 9 and 12), the ensemble with the Moore-Penrose pseudo-inverse showed better results. This study has shown the effectiveness of combining predictive models to forecast future values in WTI oil prices, offering a useful tool for analysis and applications. However, it is possible to expand the idea of applying linear models to non-linear models.
1 Graduate Program in Industrial Engineering (PPGEP), Federal University of Technology—Paraná, Ponta Grossa 84017-220, Brazil;
2 Graduate Program in Mechanical Engineering, Federal University of Technology—Paraná, Ponta Grossa 84017-220, Brazil;
3 Department of Industrial Engineering, Federal University of Technology—Paraná, Ponta Grossa 84017-220, Brazil;
4 Graduate Program in Electrical Engineering, Federal University of Technology—Paraná, Ponta Grossa 84017-220, Brazil;
5 Department of Industrial Engineering, Federal University of Technology—Paraná, Ponta Grossa 84017-220, Brazil;