Multivariate Adaptive Regression Splines (MARS) has proved to be a powerful tool for non-parametric regression problems and for model selection on multi-dimensional data. A major challenge lies in model over-fitting and under-fitting on such data. Applying optimization methods to design and select the model can reduce over-fitting and under-fitting in MARS modeling. In particular, information-based model selection criteria have been shown to be effective for MARS modeling. This study proposes a model selection method for MARS using coordinate descent (CD-MARS) that can accurately select the best model. We integrate coordinate descent (CD) features into the formulation of information-based model selection and evaluation criteria, where CD acts as a penalty term added to the negative log-likelihood. To test the model, the study generated a synthetic dataset and then applied a publicly available house prices dataset to evaluate the model's performance. The dataset contains variables that do and do not contribute to the dependent variable. We measure the performance of the proposed model using the mean squared error (MSE), the coefficient of determination (Rsquared) and the mean absolute error (MAE). The results show that the model's MSE was smaller than that of MARS. Additionally, the CD-MARS model scored an Rsquared of 0.931 (93.1%), whereas the traditional MARS score was 0.8952 (89.52%); the proposed model thus improves the traditional MARS by 3.58%. This indicates that the proposed model fits the data better than the traditional MARS and can produce good generalizability in MARS modeling.
Article Highlights
The optimized MARS model with coordinate descent can greatly improve MARS performance and minimize the prediction error.
The material and finish quality, living area, remodel date, square footage and number of bedrooms are important considerations when predicting the house sale price.
The model minimizes the error and provides good generalizability compared to the existing approaches.
Introduction
Introduction and background
Machine learning (ML) models are built and trained on existing datasets and deployed for use in industry [1, 2]. Model performance depends on the dataset used during training. Today, machine learning models are used extensively to address problems in domains such as business and market analysis, health, and security systems. They can find patterns in datasets that may be heterogeneous, for example datasets containing both text and numeric data or text and images, and whose numerical variables may be categorical or continuous.
Real-world datasets are highly diverse (text, images, numerical data), which makes the models built on them challenging to manage. Although these models can produce accurate predictions thanks to their computational power, they are also subject to challenges such as under-fitting and over-fitting.
To overcome these challenges, recent studies show that multivariate adaptive regression splines (MARS) can be robust against over-fitting and under-fitting [3]. MARS uses a split-and-conquer strategy that partitions the training data into regions fitted by linear splines (segments) of different gradients [4], which makes it well suited to problems with high-dimensional inputs. Because MARS can capture interactions among variables without assuming a specific functional form, adapts to the structure of the data, and automatically performs variable selection (feature importance), it is a powerful non-parametric regression algorithm for building nonlinear models compared with methods such as kernel regression, classification and regression trees (CART) and decision trees. In addition, the penalty term MARS applies to model complexity makes it more robust to over-fitting. Although MARS has produced promising results for nonlinear data in different application areas, its predictive performance remains insufficient because MARS models are prone to over-fitting in complex situations, leading to poor generalization [5, 6, 7–8]. Sensitivity to outliers and noisy data also affects the predictive performance of MARS models [9, 10]. Methodologies such as the combination of multivariate adaptive regression splines with the response surface methodology (MARS/RSM), CMARS and information-based criteria [31, 34] have been applied to optimize and improve predictive performance [5, 11, 12–13]. The literature shows different approaches to designing and selecting a model in MARS, but the use of many predictor variables in regression modeling without precise relationships may still lead to model over-fitting [14, 15]. In addition, hybrids of traditional regression models such as MARS with the coordinate descent (CD) methodology have received little attention, leaving a gap in the development of more robust predictive MARS models. A model capable of balancing over-fitting and under-fitting as the number of parameters and the model complexity increase would reduce these issues. Developing such a model in MARS, using the coordinate descent methodology within the MARS modeling framework, is the main objective of this research.
Related work
Several machine learning approaches have been used to address a variety of problems [5, 6, 11, 12]. MARS is applied to predict a dependent variable from the values of independent variables [13]. MARS has been applied in several domains; for example, it was used in [14] to predict tunnel convergence. In that work, MARS was used to predict the diameter convergence of a high-speed railway tunnel in soft rock. The results showed that MARS performed well compared with the other algorithms in the study. The authors also emphasized that MARS is computationally efficient and more flexible, and concluded that MARS can be a dependable alternative for modeling nonlinear problems.
A global optimization approach using multivariate adaptive regression splines was proposed in [15]. The authors sought the global optimum in structural design through a combination of multivariate adaptive regression splines and the response surface methodology (MARS/RSM). When compared with generalized additive models, the study shows that a hybrid of MARS with RSM enhances the typical RSM by handling nonlinear, high-dimensional problems that can be truncated into smaller dimensions while maintaining low computational complexity and greater interpretability. The computational efficiency and accuracy of MARS/RSM were compared with those of simulated annealing and genetic algorithms, and the MARS/RSM method was applied to an array of low-dimensional test functions to show its convergence and limiting characteristics. The authors in [16] evaluated the local scour depth downstream of a sluice gate using MARS. The authors in [17] utilized MARS to predict the elastic modulus of jointed rock mass. MARS was also used to predict heat-transfer properties around a square cylinder [18]. That study used particle swarm optimization to optimize the MARS-generated objective function and compared the approach with ANN and gene expression programming. The findings show that gene expression programming and artificial neural networks were less efficient than MARS with particle swarm optimization in optimizing the heat transfer rate and in predicting forced-convection data. However, the study does not indicate how particle swarm optimization optimizes the MARS model.
The authors in [19] proposed CMARS, which omits the backward stepwise phase. Instead, they formulated a Tikhonov regularization problem, also referred to as ridge regression, over the penalized residual sum of squares of MARS. The problem was treated with continuous optimization methods and then solved within the framework of conic quadratic programming. CMARS turned out to be very competitive with, and promising relative to, MARS.
From the literature, it is clear that MARS has performed well on several problems and has proved to be more flexible and computationally efficient than other algorithms [5, 9, 11, 20]. MARS models have also provided good predictive performance, and some authors [15, 18, 19] have combined MARS with other methodologies to obtain optimal output. The stepwise method with GCV can be used to determine the best spline functions to use as inputs to the model. Friedman's MARS relies on GCV to avoid over-fitting, but the literature raises questions about whether this criterion should be considered the best model selection approach for MARS algorithms. Although MARS has produced promising output for nonlinear data in different application areas and several authors have used it to build models for their applications, its predictive performance remains insufficient, and several methodologies have been applied to optimize and improve it. The criteria discussed in the literature show different approaches to model design and selection, and the use of many predictor variables in regression modeling without precise relationships may lead to model over-fitting. The literature also shows that modeling multi-dimensional data with MARS can result in over-fitting and under-fitting. This research proposes model selection in the MARS framework using a coordinate descent methodology.
Objectives and contribution
Model selection plays a pivotal role in MARS modeling. Traditional MARS uses the GCV criterion to select the model, which may result in over-fitting on multi-dimensional data. Recently, information-theory-based criteria and extensions of the maximum likelihood principle have been shown to be effective for model selection in MARS modeling. The integration of information-based model selection criteria into MARS modeling is therefore a promising way to improve the predictive performance of MARS.
The main objective of the current study is to develop and evaluate an optimized MARS model using the coordinate descent (CD) methodology. MARS was chosen because it adapts to any data structure, performs automatic feature selection during model building and captures the interactions among the features; CD was chosen because it requires little computing power, adapts to different problem structures, is less sensitive to noise in gradient estimates and can be applied to non-differentiable optimization problems. Coordinate descent has proved to be the best choice compared with adaptive algorithms and gradient descent algorithms, which struggle due to the lack of gradients.
This study builds on existing advances to improve the predictive performance of MARS. It incorporates coordinate descent features with the penalized maximum likelihood functions of information criteria to select the best model in MARS, improving MARS performance by reducing over-fitting and under-fitting. The effectiveness of the proposed model is evaluated through analysis and validation, stating the contributions of model selection in MARS modeling and its application to a real dataset. With the developed CD-MARS model, researchers and data analysts can study and analyze multivariate nonlinear data with reduced over-fitting and under-fitting to make data-driven decisions.
Materials and methods
An overview of MARS
Multivariate adaptive regression splines (MARS) is a nonparametric regression technique that can learn the nonlinear relationship between the dependent variable and several independent variables using splines [8]. The MARS algorithm was introduced by Friedman [21] and is considered a multivariate nonparametric regression technique. It is a data-driven modeling technique that uses the interactions between the input variables and the output. MARS uses a split-and-conquer strategy that partitions the training data into regions fitted by linear splines (segments) of different gradients [4], which makes it well suited to problems with high-dimensional inputs. The linear splines are delimited by knots, which mark the partitioning between any two data regions to give the required piecewise curves. These piecewise curves are the basis functions (BFs) and do not require a pre-specified relationship between the input and the output variables, which gives the model the flexibility to capture bends, thresholds and other departures from linearity [14].
MARS constructs the model in two stages: the forward pass and the backward pass. It begins with a model comprising only the intercept term (the average of the response values) and then repeatedly adds basis functions in pairs, at each step choosing the pair that gives the largest reduction in the residual sum of squares (RSS). The two basis functions in a pair are identical except that each uses a different side of a mirrored hinge function. Each new basis function is the product of a hinge function with a term already in the model, which may be the intercept. A hinge function is defined by a variable and a knot, so to add a new basis function MARS must search over all parent (existing) terms, variables, and variable values to place the bend of the new piecewise function [22].
$$\hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x) \tag{1}$$
The form of the MARS model is defined in Eq. (1), where each $B_m(x)$ is the mth BF or a product of multiple BFs, and each basis function takes the form of a constant, a hinge (piecewise linear) function, or a product of hinge functions. The hinge function plays an important part in MARS modeling and can be represented as $\max(0, x - t)$ or $\max(0, t - x)$, where $t$ is the knot, as shown in Eq. (2). Each hinge function is zero over part of its range and is used to partition the data into independent disjoint regions. The idea behind MARS is that such functions can replace step functions with something smoother.
$$h(x) = \max(0,\, x - t) \quad \text{or} \quad h(x) = \max(0,\, t - x) \tag{2}$$
To select the model, a lack-of-fit criterion is used in the forward pass, and in the backward (pruning) pass of MARS a modified generalized cross-validation (GCV) criterion is used, given as
$$\mathrm{GCV}(M) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}_M(x_i)\right)^2}{\left(1 - \frac{C(M)}{n}\right)^2} \tag{3}$$

where $n$ is the number of observations, $\hat{f}_M$ is the fitted model with $M$ terms and $C(M)$ is the effective number of parameters, which grows with the number of basis functions and knots.
The generalized cross-validation (GCV) criterion shown in Eq. (3) was proposed in [23] and is employed to estimate how well the model will perform so that the best subset of terms can be chosen: a lower GCV indicates a better-performing model. GCV can be thought of as a form of regularization, weighing goodness-of-fit against model complexity [4, 17]. GCV was presented by Craven [23] and extended by Friedman for MARS [24], using minimization of the average squared residuals in fitting the model.
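As a minimal sketch of how this forward/backward procedure is exercised in practice, the following R code fits a MARS model with the earth package used later in this study; the data frame name `house` and the response `SalePrice` are assumed names taken from the house-prices example discussed below, and the settings are illustrative.

```r
# Minimal sketch: fit a MARS model with earth and inspect the GCV-based
# backward (pruning) pass.  `house` and `SalePrice` are assumed names.
library(earth)

mars_fit <- earth(
  SalePrice ~ .,        # response and predictors
  data    = house,      # assumed cleaned data frame
  degree  = 1,          # maximum interaction degree
  nk      = 90,         # maximum number of terms in the forward pass
  pmethod = "backward"  # prune terms using the GCV criterion of Eq. (3)
)

summary(mars_fit)       # reports the selected terms, GCV, RSq and GRSq
evimp(mars_fit)         # variable importance of the retained predictors
```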
Model selection in MARS
Traditionally, the generalized cross-validation criterion is used for model selection in MARS. Generalized cross-validation (GCV) is a technique for validating the model by assessing how well the statistical analysis will generalize to an independent dataset, in other words, how well the model will perform on unseen data [25, 26, 27–28]. GCV can be taken as a way to estimate the model's predictive quality. The GCV criterion helps in selecting a suitable value of the smoothing parameter by choosing the value that minimizes GCV [29]. The GCV formulation is shown in Eq. (3).
Recently, information-based model selection and evaluation criteria have been emphasized in the statistical literature [8]. Model assessment has become an established topic, although questions remain about how to select the best model for a given dataset from a class of candidate models. Statistical information theory provides well-known model selection criteria that employ an information approximation as a loss function in high dimensions [30]. These criteria aim to select a model by accounting for both inferential uncertainty and parametric uncertainty (a measure of model complexity and parsimony). Researchers have proposed several model selection criteria in the form of a negative log-likelihood plus a penalty term [31]. The authors in [32] formulate the Akaike information criterion (AIC) as:
$$\mathrm{AIC} = -2\log L(\hat{\theta}_k) + 2k \tag{4}$$
In Eq. (4), $L(\hat{\theta}_k)$ refers to the maximized likelihood function, $\hat{\theta}_k$ is the maximum likelihood estimate of the parameter vector $\theta_k$, and $k$ is the number of estimated independent parameters. The lack-of-fit measure is represented by $-2\log L$, while $2k$ penalizes the model's free estimated parameters.
The number of model parameters enters as a complexity measure to compensate for the bias in the lack-of-fit term, creating a compromise between fit and parsimony. The model that gives the minimum AIC value is selected as the best fit to the data. The AIC selection method is popular because it is simple to use, but it is also known to over-fit the model in complex modeling situations [8, 20, 33]. To address this over-fitting problem, the Schwarz Bayesian criterion (SBC) was introduced, which assumes that the data are generated from an exponential family of distributions [34]. In [35] the minimum description length (MDL) criterion was introduced, which takes a form similar to the SBC and is shown in Eq. (5).
$$\mathrm{SBC} = -2\log L(\hat{\theta}_k) + k\,\log(n) \tag{5}$$
Compared with AIC, the penalty is larger: in the $-\log L$ form, SBC multiplies the number of parameters by $(1/2)\ln(n)$ rather than by 1. The model with the minimum SBC or MDL value is selected as the one that best fits the data.
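A minimal sketch of these penalized log-likelihood criteria is given below; it computes Eqs. (4) and (5) from a fitted model's log-likelihood in R. The ordinary linear model and the column names (`GrLivArea`, `OverallQual`) are illustrative assumptions; the same form applies to any model with a computable likelihood.

```r
# Minimal sketch: AIC (Eq. 4) and an SBC/BIC-type criterion (Eq. 5) computed
# from a fitted model's maximised log-likelihood.  Model and columns are
# illustrative assumptions.
fit  <- lm(SalePrice ~ GrLivArea + OverallQual, data = house)
logL <- as.numeric(logLik(fit))       # maximised log-likelihood
k    <- attr(logLik(fit), "df")       # number of estimated parameters
n    <- nobs(fit)                     # sample size

aic <- -2 * logL + 2 * k              # Eq. (4): lack of fit + 2k penalty
sbc <- -2 * logL + k * log(n)         # Eq. (5): heavier log(n) penalty

c(AIC = aic, builtin = AIC(fit))      # matches R's built-in AIC()
c(SBC = sbc, builtin = BIC(fit))      # matches R's built-in BIC()
```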
The information complexity (ICOMP) criterion was proposed in [8], motivated by AIC and by concepts of information complexity. ICOMP differs from AIC in that it generalizes the covariance complexity index: it measures the complexity of the model structure as a whole, for example of a set of random vectors, rather than simply counting parameters. The ICOMP model selection criterion combines a badness-of-fit term with a measure of model complexity. In general, ICOMP quantifies overall model complexity through the estimated inverse Fisher information matrix (IFIM), and the resulting criterion estimates the sum of two Kullback–Leibler distances [30]. The loss function in the ICOMP layout [36] is given as:
$$\mathrm{ICOMP} = -2\log L(\hat{\theta}_k) + 2\,C\!\left(\hat{\Sigma}_{\text{model}}\right) \tag{6}$$
The ICOMP is defined in [8] as
$$\mathrm{ICOMP(IFIM)} = -2\log L(\hat{\theta}_k) + 2\,C\!\left(\hat{F}^{-1}(\hat{\theta}_k)\right) \tag{7}$$
where $L(\hat{\theta}_k)$ is the maximized likelihood function, $\hat{\theta}_k$ represents the maximum likelihood estimate of the parameter vector $\theta_k$, and $C$ represents a real-valued complexity measure. $\hat{F}^{-1}(\hat{\theta}_k)$ is the estimated covariance of the parameter vector; this covariance matrix is the Cramér–Rao lower bound (CRLB) matrix.

Coordinate descent
Coordinate descent is an optimization method that successively minimizes along coordinate directions to find a function's minimum [37]. At every iteration, the algorithm determines a block of coordinates through a coordinate selection rule and then minimizes over the corresponding coordinate hyperplane while keeping the other coordinate blocks fixed. A line search in the coordinate direction can be carried out at the current iterate to obtain an appropriate step size. The method can be applied in both differentiable and derivative-free contexts. The idea of coordinate descent is that the minimization of a multivariable function can be achieved by minimizing it along one direction at a time, that is, by solving univariate optimization problems in a loop [38]. In the simple case of cyclic coordinate descent, a single cycle iterates through the coordinate directions one by one, reducing the objective function with respect to each coordinate in turn. Starting from initial values of the variables, round $k+1$ determines $x^{k+1}$ from $x^{k}$ by iteratively solving one-variable optimization problems [37].
$$x_i^{k+1} = \arg\min_{y\,\in\,\mathbb{R}}\; f\!\left(x_1^{k+1}, \ldots, x_{i-1}^{k+1},\, y,\, x_{i+1}^{k}, \ldots, x_n^{k}\right) \tag{8}$$
for every variable $x_i$ of $x$, for $i$ from 1 to $n$. One therefore starts with an initial estimate $x^0$ for a local minimum of $f$ and iteratively obtains the sequence $x^1, x^2, x^3, \ldots$. By carrying out a line search in every iteration, one automatically has $f(x^0) \geq f(x^1) \geq f(x^2) \geq \cdots$. This scheme has convergence characteristics similar to steepest descent. When one cycle of line searches across the coordinate directions brings no improvement, a fixed point has been reached [39].
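The following is a minimal R sketch of cyclic coordinate descent as described by Eq. (8); the quadratic objective, the search interval and the stopping rule are illustrative assumptions, not part of the original algorithm description.

```r
# Minimal sketch of cyclic coordinate descent (Eq. 8): each sweep minimises the
# objective one coordinate at a time while the other coordinates stay fixed.
# The convex quadratic objective below is purely illustrative.
f <- function(x) 3 * x[1]^2 + 2 * x[2]^2 + x[1] * x[2] - 4 * x[1] - 6 * x[2]

coordinate_descent <- function(f, x0, max_sweeps = 100, tol = 1e-8) {
  x <- x0
  for (sweep in seq_len(max_sweeps)) {
    x_old <- x
    for (i in seq_along(x)) {
      # one-dimensional minimisation along direction i, other coordinates fixed
      g <- function(z) { xi <- x; xi[i] <- z; f(xi) }
      x[i] <- optimize(g, interval = c(-1e3, 1e3))$minimum
    }
    if (sum(abs(x - x_old)) < tol) break   # no further improvement: fixed point
  }
  x
}

coordinate_descent(f, x0 = c(0, 0))
```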
Consider the unconstrained minimization problem $\min_x f(x)$, where $x \in \mathbb{R}^n$ is a continuous variable. Different coordinate descent (CD) variants place different assumptions on the function $f$, such as smoothness, convexity, or restriction to a bounded domain. Consider the formulation of [40] in Eq. (9):
$$\min_{x}\; h(x) := f(x) + \lambda\,\Omega(x) \tag{9}$$
In Eq. (9), $f$ is smooth, the regularization function $\Omega$ may be extended-valued and non-smooth, and the regularization parameter is $\lambda > 0$. $\Omega$ is normally convex and assumed to be block-separable. When $\Omega$ is separable it has the form in Eq. (10) [40].
$$\Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i) \tag{10}$$
where $\Omega_i: \mathbb{R} \to \mathbb{R}$ for all $i$.

A function $f$ is said to be convex if its domain is a convex set and, for all $x, y$ in its domain and all $t \in [0, 1]$, $f(tx + (1 - t)y) \leq t f(x) + (1 - t) f(y)$. This ensures that a local minimum is also a global minimum, which is the main goal in optimization problems.
Consider a function $f$ defined on a region $D \subseteq \mathbb{R}^n$. The function is said to be (additively) separable with respect to its variables if it can be rewritten as a sum of functions, each depending on a single variable, $f(x) = \sum_i f_i(x_i)$; it is multiplicatively separable if it can be rewritten as a product of such functions, which is equivalent to $\log f$ being additively separable. Similarly, the function is said to be block separable if it can be rewritten as a sum of functions of disjoint subsets of the variables, $f(x) = \sum_{i=1}^{B} f_i\!\left(x_{(i)}\right)$, where the $x_{(i)}$ are sub-vectors of $x$, each constrained to a nonempty closed convex set, and the functions $f_i$ are convex with finite values. Block separability allows each block to be optimized independently, which keeps the optimization simple. Then, for $h(x) = f(x) + \lambda\,\Omega(x)$ with $f$ convex and $\Omega$ convex and separable, coordinate descent leads to the global minimum.
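A minimal R sketch of coordinate descent for the regularized problem of Eqs. (9)–(10) is given below, assuming the standard lasso form with the separable penalty $\Omega(\beta) = \sum_j |\beta_j|$ and the objective $\tfrac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$; in that case each coordinate update has a closed-form soft-thresholding solution. The simulated data are illustrative.

```r
# Minimal sketch: coordinate descent for a separable l1-regularised least
# squares problem (standard lasso form), with closed-form coordinate updates.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

cd_lasso <- function(X, y, lambda, max_sweeps = 200, tol = 1e-8) {
  p    <- ncol(X)
  beta <- rep(0, p)
  for (sweep in seq_len(max_sweeps)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      r_j     <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual
      z_j     <- sum(X[, j] * r_j) / sum(X[, j]^2)         # unpenalised update
      beta[j] <- soft_threshold(z_j, lambda / sum(X[, j]^2))
    }
    if (sum(abs(beta - beta_old)) < tol) break
  }
  beta
}

# Illustrative use on simulated data
set.seed(1)
X <- scale(matrix(rnorm(200 * 5), 200, 5))
y <- X[, 1] * 2 - X[, 3] + rnorm(200)
cd_lasso(X, y, lambda = 10)
```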
Comparative analysis of optimization algorithms
A comparative analysis of several existing optimization algorithms, evaluated for their performance in various optimization tasks, is presented in Table 1. It highlights characteristics such as computational complexity, convergence speed, and suitability for different problem types. This comparison provides a clearer understanding of the strengths and weaknesses of each algorithm and thus helps identify an effective method for a given optimization challenge.
Table 1. Comparative analysis of existing optimization algorithms
Reference | Method | Memory Usage | Speed | Accuracy |
|---|---|---|---|---|
[39, 40] | Coordinate Descent | Low | Moderate | Accurate for simpler models, may degrade in complex settings |
[41] | Gradient Descent | Moderate | Faster | Accurate with proper tuning of learning rate |
[42] | Genetic Algorithms | High | Slow | Can be highly accurate in complex scenarios |
[43, 44] | Adaptive Methods | Moderate to High | Moderate | High accuracy due to dynamic learning rates |
[45] | LASSO | Low to Moderate | Moderate | Accurate with sparse models, may underperform in dense models |
[45, 46] | Ridge Regression | Low to Moderate | Fast | High accuracy in models with many features, especially when no feature is irrelevant |
The dataset
In this research, we created a synthetic test dataset to test the model and then used the house prices dataset publicly available at [47], partitioned into 70% for training and 30% for testing. The house prices dataset has 81 attributes, including the response variable, and 1461 elements. To understand the dataset and identify the features with the greatest influence on the output, we measured feature importance by building a random forest regressor (RFR) model, which performs feature selection through its feature importance mechanism. This algorithm evaluates the contribution of each feature by computing how much the feature reduces an impurity measure such as the mean squared error (MSE); a higher value means the feature has a greater effect on the output. MARS itself performs automatic feature selection during model training, so the RFR-based feature selection was carried out to understand which features are of greatest importance prior to CD-MARS model training. A detailed description of the dataset and its features can be found in the appendix.
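A minimal sketch of this feature-importance step in R is shown below; it assumes the randomForest package and a cleaned data frame named `house` with the response SalePrice (illustrative names, not the exact script).

```r
# Minimal sketch: random forest regressor feature importance, assuming the
# randomForest package and a cleaned data frame `house` with response SalePrice.
library(randomForest)

set.seed(42)
rf <- randomForest(SalePrice ~ ., data = house, ntree = 500, importance = TRUE)

imp <- importance(rf)                                    # %IncMSE and IncNodePurity
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ][1:10, ] # ten most important features
varImpPlot(rf)                                           # visual summary
```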
Data preparation and cleaning
Data cleaning is an essential part of data analysis. It involves fixing and removing incorrect, duplicated, and incomplete data within the dataset, while anomaly detection, one of the techniques used in data cleaning, helps to identify elements that do not share the characteristics of the rest of the group. To do this, we separated the categorical and numerical attributes of the dataset for proper handling. The dataset was checked for duplicated records using the duplicated() function, and missing values were identified using the is.na() function. The results showed no duplicates; however, there were missing values in some attributes. Attributes with a considerable share of missing values (> 40% in our case) were dropped, and the remaining missing values were also dropped. As part of data cleaning, we also performed anomaly detection using the k-nearest neighbors (KNN) supervised method to identify elements that do not share the characteristics of the rest of the group (outlier analysis) in the dataset.
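The following is a minimal R sketch of these cleaning steps using base R functions; the raw data frame name `house_raw` and the 40% threshold written into the code are taken from the description above, and the snippet is illustrative rather than the exact preprocessing script.

```r
# Minimal sketch of the cleaning steps described above (base R only).
sum(duplicated(house_raw))                     # check for duplicated rows

na_share <- colMeans(is.na(house_raw))         # share of missing values per attribute
house    <- house_raw[, na_share <= 0.40]      # drop attributes with > 40% missing
house    <- na.omit(house)                     # drop remaining rows with missing values

# handle numeric and categorical attributes separately
num_cols  <- sapply(house, is.numeric)
house_num <- house[, num_cols]
house_cat <- house[, !num_cols]
```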
The proposed model
In this study, we propose a CD-MARS model designed to optimize the MARS model using the coordinate descent method and thereby improve model performance (goodness of fit). The model integrates features of the coordinate descent optimizer and the MARS algorithm. Considering the formulation in Eq. (9), where the regularization function $\Omega$ is separable and takes the form given in Eq. (10), and considering the formulations in Eqs. (4), (5) and (6), in which a penalty term is added to the negative log-likelihood, the CD-MARS development was motivated partly by the original AIC formulation in Eq. (4) and partly by the CD formulation in Eq. (9). Hence,
$$\text{CD-MARS} = -2\log L(\hat{\theta}_k) + 2\,C\!\left(\hat{F}^{-1}(\hat{\theta}_k)\right) + \lambda\,\Omega(\hat{\theta}_k) \tag{11}$$
where $L(\hat{\theta}_k)$ is the maximized likelihood function, $\hat{\theta}_k$ is the maximum likelihood estimate of the parameter vector $\theta_k$, $C$ represents the real-valued complexity measure, $\Omega$ is the regularization function, and $\lambda$ is the regularization parameter. The complexity term acts as the penalty associated with the model's estimated free parameters. We consider $\Omega$ to be separable, taking the form shown in Eq. (12):

$$\Omega(\theta) = \sum_{i=1}^{k} \Omega_i(\theta_i) \tag{12}$$
Thus, we rewrite Eq. (11) as shown in Eq. (13):

$$\text{CD-MARS} = -2\log L(\hat{\theta}_k) + 2\,C\!\left(\hat{F}^{-1}(\hat{\theta}_k)\right) + \lambda \sum_{i=1}^{k} \Omega_i(\theta_i) \tag{13}$$
To fit the model, we assume that the likelihood function follows a normal distribution in order to estimate the model parameters and the response variable. Although CD does not assume a normal distribution in its formulation, the normality assumption applies when CD is employed in regression techniques to minimize the loss function.
We compute the loss function as
$$L(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{m=1}^{M}\beta_m B_m(x_i)\right)^{2} + \lambda\,\Omega(\beta) \tag{14}$$
where $y_i$ is the observed response, $B_m(x_i)$ are the basis functions and $\beta_m$ are the coefficients. The term $\hat{y}_i = \beta_0 + \sum_{m=1}^{M}\beta_m B_m(x_i)$ is the predicted response and $\lambda$ is the regularization parameter. The proposed CD-MARS considers a non-singular $\Omega$ to prevent the model from collapsing under multicollinearity and to ensure that the coefficients are estimated realistically. We also consider $\Omega$ to be convex and block separable to ensure that the method leads to the global minimum.
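As a minimal illustration of the penalized loss in Eq. (14), the R function below evaluates the squared-error fit of a basis-function expansion plus a regularization term; a separable ridge-type $\Omega(\beta) = \sum_m \beta_m^2$ is assumed here purely for illustration, since the exact form of $\Omega$ is left general in the text.

```r
# Minimal sketch of the penalised loss in Eq. (14): lack of fit of the
# basis-function expansion plus an (assumed) separable ridge-type penalty.
cd_mars_loss <- function(beta, B, y, lambda) {
  y_hat <- B %*% beta              # predicted response from the basis functions
  fit   <- sum((y - y_hat)^2)      # lack-of-fit term
  pen   <- lambda * sum(beta^2)    # separable penalty Omega(beta), assumed ridge-type
  fit + pen
}
```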
The proposed model’s flowchart is shown in Fig. 1 where Mmax is the maximum number of basis functions, Imax is the maximum number of interactions of the input variables and d is the smoothing parameter.
[See PDF for image]
Fig. 1
The flowchart for the proposed model
The pseudo code of the proposed model
The model was designed to optimize the MARS model using the coordinate descent method, which further improves model performance (goodness of fit). Below is the pseudo code of the proposed model.
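Alongside the pseudo code, the following is a minimal R sketch of the same procedure: it alternately searches over nprune with the degree fixed and over the degree with nprune fixed, then refits MARS with the best pair found. The earth package is assumed, the data frame name `house`, the grids and the RMSE-based scoring are simplifying assumptions, and the snippet is an illustrative reading of the CD step rather than the exact implementation.

```r
# Illustrative CD-MARS sketch: alternate coordinate-wise search over nprune and
# degree with earth, then refit MARS with the optimised hyper-parameters.
library(earth)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

cd_mars <- function(data, nprune_grid = seq(5, 50, by = 5),
                    degree_grid = 1:3, iterations = 2) {
  best <- list(nprune = nprune_grid[1], degree = degree_grid[1], rmse = Inf)
  for (it in seq_len(iterations)) {
    # coordinate 1: optimise nprune while the degree is fixed
    for (np in nprune_grid) {
      fit <- earth(SalePrice ~ ., data = data, degree = best$degree, nprune = np)
      err <- rmse(data$SalePrice, predict(fit, data))
      if (err < best$rmse) best <- list(nprune = np, degree = best$degree, rmse = err)
    }
    # coordinate 2: optimise the degree while nprune is fixed
    for (dg in degree_grid) {
      fit <- earth(SalePrice ~ ., data = data, degree = dg, nprune = best$nprune)
      err <- rmse(data$SalePrice, predict(fit, data))
      if (err < best$rmse) best <- list(nprune = best$nprune, degree = dg, rmse = err)
    }
  }
  # refit MARS with the optimised hyper-parameters
  earth(SalePrice ~ ., data = data, degree = best$degree, nprune = best$nprune)
}

final_model <- cd_mars(house)
summary(final_model)
```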
Experimental setup
The experiments were performed using the earth package in the R environment. The model was implemented in the RStudio editor with R version 3.6.1, installed on an Intel Core i7 EliteBook running the Windows 10 operating system. To test the model, test data were generated with 10 attributes (n = 10) and 10,000 instances (m = 10,000). The X values were generated as random uniform values of size m × n. The response y was then computed as a function of the input variables: it combines a non-linear term given by the sine of a product of inputs, scaled so that it oscillates between −10 and 10, a quadratic term that penalizes deviation from 0.5, linear terms indicating that the output scales directly with the values of the corresponding variables, and an additive random value drawn from a uniform distribution.
Since y combines both linear and non-linear functions of the inputs with a random noise term, it yields a complex output; the aim is to generate complex data with non-linear relationships and to explore the behavior of the proposed model on these inputs and the output. In this way we generate a dataset that is used to test the proposed CD-MARS model. The maximum interaction was set to 3, the maximum number of terms was set to 90, and the minspan alpha was set to 0.5. The model was later applied to the house prices dataset to evaluate its behavior on a real dataset.
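A sketch of a synthetic dataset of the kind described above is shown below. The exact generating function used in the study is not reproduced here; the formula in the code is an assumption, chosen because it matches the stated components (a sine term scaled to [−10, 10], a quadratic term around 0.5, linear terms and additive uniform noise) and follows the well-known Friedman benchmark form often used to test MARS.

```r
# Illustrative synthetic data: m = 10,000 instances, n = 10 uniform inputs,
# and a response combining sine, quadratic, linear and uniform-noise terms.
set.seed(123)
m <- 10000; n <- 10
X <- matrix(runif(m * n), nrow = m, ncol = n)

y <- 10 * sin(pi * X[, 1] * X[, 2]) +   # non-linear interaction, oscillates in [-10, 10]
     20 * (X[, 3] - 0.5)^2 +            # quadratic term penalising deviation from 0.5
     10 * X[, 4] + 5 * X[, 5] +         # linear terms
     runif(m, -1, 1)                    # additive uniform noise

sim_data <- data.frame(y = y, X)        # dataset used to exercise the model
```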
Performance metrics
To evaluate the model's performance, the root mean squared error (RMSE) was employed to evaluate the model's goodness of fit, R-squared (Rsquared) to evaluate the model's fit, and generalized R-squared (GRsquared) to determine the model's predictive power. The MSE is computed as in Eq. (15) and the Rsquared is calculated as in Eq. (16).
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{15}$$
where $n$ is the number of data points, $y_i$ are the observed values and $\hat{y}_i$ are the predicted values. In Eq. (16), $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares.

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{16}$$
$$\mathrm{GRsq} = 1 - \frac{\mathrm{GCV}}{\mathrm{GCV}_{\text{null}}} \tag{17}$$

where $\mathrm{GCV}_{\text{null}}$ is the GCV of an intercept-only model.
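A minimal sketch of these evaluation metrics in R is shown below; `model` is a fitted earth/CD-MARS object and `test` is the 30% test partition (illustrative names).

```r
# Minimal sketch: evaluation metrics of Eqs. (15)-(16) on a held-out test set.
pred <- as.numeric(predict(model, newdata = test))
obs  <- test$SalePrice

mse  <- mean((obs - pred)^2)                                  # Eq. (15)
rmse <- sqrt(mse)
mae  <- mean(abs(obs - pred))
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)    # Eq. (16)

c(MSE = mse, RMSE = rmse, MAE = mae, Rsquared = r2)
# GRsquared (Eq. 17) is reported by the earth package, e.g. via summary(model)
```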
Results and discussion
The CD-MARS model
Whereas the MARS model was built from the original algorithm, the CD-MARS model was built by embedding MARS in the coordinate descent (CD) algorithm: the MARS model was trained within the CD to optimize the parameters. We present the existing MARS hybrid models and the proposed approach in Table 2. In this study, the CD-MARS model identifies the best number of basis functions to prune (nprune) and the best degree, returning the best parameters when the CD converges. It should be noted that when optimizing MARS using CD, nprune is optimized while the degree is fixed, and the degree is optimized while nprune is fixed; in other words, CD optimizes one hyperparameter while fixing the other. The CD-MARS results are shown in Table 3, and the selected optimized parameters, shown in Tables 4 and 5, were used to refit the MARS model.
Table 2. Comparison of existing MARS models and the proposed CD-MARS model
Models | Approach | Computing resource | Performance metric |
|---|---|---|---|
MARS | Utilizes GCV to select the best model | Moderate computational cost | 0.897 Rsquared |
PSO–MLP [3] | Uses PSO to optimize the MLP weights | High computational cost and memory usage | 0.891 Rsquared |
MARS–PSO–MLP [3] | Combines the predictive values of the MARS and PSO–MLP models | High computational cost and memory usage | 0.902 Rsquared |
Proposed | Uses coordinate descent to select the model in MARS | Moderate computational cost | 0.931 Rsquared |
Table 3. Basis functions to prune and their performance against the iterations in CD-MARS
Iteration | nprune | RMSE | MAE | Rsquared | GRsquared |
|---|---|---|---|---|---|
1 | 5 | 40,047.57 | 22,683.83 | 0.931 | 0.915 |
1 | 10 | 34,106.37 | 18,740.45 | 0.931 | 0.916 |
1 | 15 | 27,392.43 | 17,538.77 | 0.931 | 0.916 |
1 | 20 | 33,276.51 | 21,022.95 | 0.931 | 0.915 |
1 | 25 | 32,774.56 | 19,473.81 | 0.931 | 0.915 |
1 | 30 | 33,014.52 | 21,703.76 | 0.931 | 0.914 |
1 | 35 | 33,014.52 | 21,652.31 | 0.930 | 0.914 |
1 | 40 | 33,014.52 | 21,791.47 | 0.929 | 0.913 |
1 | 45 | 33,014.52 | 21,585.51 | 0.929 | 0.912 |
1 | 50 | 33,014.52 | 21,376.27 | 0.928 | 0.911 |
2 | 5 | 40,047.57 | 22,683.83 | 0.928 | 0.910 |
2 | 10 | 34,106.37 | 18,740.45 | 0.927 | 0.909 |
2 | 15 | 27,392.43 | 17,594.61 | 0.926 | 0.907 |
2 | 20 | 33,276.51 | 21,022.95 | 0.926 | 0.905 |
2 | 25 | 32,774.56 | 19,473.81 | 0.925 | 0.904 |
2 | 30 | 33,014.52 | 21,871.35 | 0.924 | 0.901 |
2 | 35 | 33,014.52 | 21,732.42 | 0.923 | 0.899 |
2 | 40 | 33,014.52 | 21,873.51 | 0.922 | 0.896 |
2 | 45 | 33,014.52 | 21,893.72 | 0.920 | 0.893 |
2 | 50 | 33,014.52 | 21,705.36 | 0.919 | 0.888 |
* The row in bold shows the selected optimum number of basis functions to prune (nprune)
Table 4. The selected best degree of CD-MARS
Iteration | Degree | RMSE | MAE | Rsquared | GRsquared |
|---|---|---|---|---|---|
1 | 1 | 27,392.43 | 17,538.77 | 0.931 | 0.916 |
1 | 2 | 37,084.13 | 21,376.27 | 0.911 | 0.902 |
1 | 3 | 136,853 | 30,631 | 0.887 | 0.882 |
2 | 1 | 27,392.43 | 17,594.61 | 0.927 | 0.909 |
2 | 2 | 37,084.13 | 22,452.43 | 0.924 | 0.901 |
2 | 3 | 136,853 | 31,575.58 | 0.919 | 0.888 |
* The row in bold shows the selected optimum degree
Table 5. Optimised CD-MARS hyper parameters
Best Parameter/metric | Value |
|---|---|
Optimized nprune: | 15 |
Optimized degree: | 1 |
Best RMSE: | 27,392.43 |
Best Rsquared: | 0.931 |
Experimental results
In this study, attributes with high feature importance were used together with the response variable, the house sale price (SalePrice). Since MARS is capable of performing feature importance, the study examined and considered the feature importance resulting from CD-MARS training. The results showed that 10 attributes are important for building the model, as shown in Fig. 2. It should be noted that if a predictor variable was never used in MARS modeling, it has zero feature importance, since MARS models perform automatic feature selection. The predictor variables whose feature importance is above zero were used with the response variable SalePrice to determine the interactions between the variables that can affect the house price. To explore the factors affecting house sale prices, the current study offers results, interpretations and explanations from the MARS and CD-MARS models.
[See PDF for image]
Fig. 2
Feature importance for CD-MARS
Experiments on MARS and CD-MARS
In MARS modeling, a basis function (BF) can be the intercept (constant), a hinge function, or a product of two or more hinge functions. The splines used as basis functions have the form $(x - t)_+$, where the subscript + denotes the positive part (the function is zero otherwise) and $t$ is the knot. MARS selects the predictor value to be used as the knot for each hinge function. In nonlinear modeling, the basis functions provide a piecewise combination. The basis functions for the proposed CD-MARS model and for the MARS model are presented in Tables 6 and 7, respectively. In MARS and CD-MARS modeling, the basis functions quantify the influence of the independent variables on the sale price, and the interpretation of Tables 6 and 7 is the same: a positive BF coefficient indicates that the predictor variable, or combination of predictor variables, increases the response variable SalePrice, while a negative coefficient indicates that it reduces the sale price, and the magnitude of the coefficient reflects the size of the basis function's or variable's effect on the sale price. From Table 6, we observe that the predictor variables YearBuilt, OverallQual and BsmtFinSF1 play crucial roles in determining the price of the house, as in BF 3, BF 4, BF 7 and BF 8. However, since a basis function can be a product of several basis functions (for this study, the best degree selected is 1 out of 3), the effect of an independent variable on the house price may vary. From Tables 6 and 7, we observe that several basis predictor variables can reduce the house price. Variables such as BedroomAbvGr and X1stFlrSF, SaleCondition and YearBuilt, BsmtUnfSF and OverallCond greatly increase the sale price, as in BF 7, BF 10 and BF 14. On the other hand, combining the variables OverallCond, LotArea and X2ndFlrSF reduces the sale price, as in BF 2, BF 3 and BF 13.
Table 6. List of basis functions for the fitted CD-MARS model and their coefficients for the house price dataset
BF No | Basis functions (BF) | Coefficients |
|---|---|---|
BF 1 | (Intercept) | 235,064.608 |
BF 2 | h(15,431-LotArea) | − 2.420 |
BF 3 | h(7-OverallQual) | − 10,195.601 |
BF 4 | h(OverallQual-7) | 31,933.917 |
BF 5 | h(7-OverallCond) | − 10,281.618 |
BF 6 | h(2004-YearBuilt) | − 543.818 |
BF 7 | h(YearBuilt-2004) | 8391.654 |
BF 8 | h(BsmtFinSF1-616) | 59.284 |
BF 9 | h(1656-TotalBsmtSF) | − 26.699 |
BF 10 | h(TotalBsmtSF-1656) | 85.551 |
BF 11 | h(X1stFlrSF-2097) | − 454.828 |
BF 12 | h(1347-X2ndFlrSF) | − 12.374 |
BF 13 | h(X2ndFlrSF-1347) | 256.473 |
BF 14 | h(GrLivArea-1396) | 69.059 |
BF 15 | h(BedroomAbvGr-4) | − 36,365.553 |
Table 7. List of basis functions for the fitted MARS model and their coefficients for the house price dataset
BF No | Basis function (BFs) | Coefficients |
|---|---|---|
BF 1 | (Intercept) | 344,495.86 |
BF 2 | h(21,384-LotArea) | − 2.12 |
BF 3 | h(7-OverallQual) | − 9691.13 |
BF 4 | h(OverallQual-7) | 38,259.31 |
BF 5 | h(7-OverallCond) | − 10,400.77 |
BF 6 | h(2007-YearBuilt) | − 622.20 |
BF 7 | h(YearBuilt-2007) | 21,797.26 |
BF 8 | h(BsmtFinSF1-790) | 70.79 |
BF 9 | h(2136-TotalBsmtSF) | − 25.99 |
BF 10 | h(2129-X1stFlrSF) | − 39.68 |
BF 11 | h(X1stFlrSF-2129) | − 200.87 |
BF 12 | h(1370-X2ndFlrSF) | − 39.78 |
BF 13 | h(X2ndFlrSF-1370) | 389.09 |
BF 14 | h(GrLivArea-1795) | 54.80 |
BF 15 | h(GrLivArea-2978) | − 123.52 |
BF 16 | h(BedroomAbvGr-4) | − 35,900.11 |
Similarly, in Table 7 the variables OverallQual, YearBuilt, BsmtFinSF1 and X2ndFlrSF positively affect the sale price, as shown by BF 4, BF 7, BF 8 and BF 13, while variables such as X1stFlrSF and BedroomAbvGr negatively impact the sale price, as in BF 11 and BF 16. Other variables do not contribute to the overall house sale price, as they were removed during training and thus are not considered in the final model.
The refitted CD-MARS model with the optimized number of basis functions to prune and the optimized degree is shown in Fig. 3, where the model was selected with 15 terms (BFs). To train the model, the study performed ten-fold cross-validation. Figures 4 and 5 indicate the model performance across each iteration and each fold, respectively.
[See PDF for image]
Fig. 3
Fitting the Optimised CD-MARS Model
[See PDF for image]
Fig. 4
Cross-validation results with respect to the iterations
[See PDF for image]
Fig. 5
Cross-validation results for each fold
In addition to performance metrics such as MSE, RMSE and Rsquared, the study also considered the residual plot to evaluate the model's performance in terms of under-fitting and over-fitting, as shown in Fig. 6. Additionally, the ten-fold cross-validation shows that the RMSE and Rsquared tend to be consistent across the folds, indicating that the model fits the data stably. Figures 4 and 5 show that the model performance was consistent across the folds and iterations. Because the performance measures considered in Figs. 4 and 7 are on different scales, the study applied min–max normalization to the plotted data, which produced values ranging from 0 to 1. This shows that the model is stable and generalizes well on different subsets of the data. The Rsquared is almost consistent across all folds and iterations, which suggests that the response attribute (SalePrice) is explained by the predictor variables in every fold and iteration. However, it can also be observed that on some iterations the model does not generalize as well, as in folds 2 and 9 of Fig. 5, because these folds selected a smaller subset of the dataset.
[See PDF for image]
Fig. 6
The CD-MARS residual Plot
[See PDF for image]
Fig. 7
Variations of MAE, RMSE, and Rsquared against basis functions to prune in CD-MARS
Variations of MAE, RMSE, and Rsquared with the number of basis functions to prune in CD-MARS
Terms (nprune) are the components used to construct the MARS model from predictors and hinge functions; a term may consist of a single predictor, a combination of predictors, or a combination of predictors and hinge functions. Comparing the number of terms in the model with the RMSE, Rsquared and GRsquared, we observe in Fig. 7 that the number of terms varies in each iteration, as in Table 3, giving different RMSE, Rsquared and GRsquared values. However, as the number of terms increases, the RMSE and Rsquared tend to become consistent (Table 3). The CD-MARS model selects the degree and number of terms (nprune) with the minimum error and the best generalizability. From Fig. 7, we observe that the maximum Rsquared is achieved when there are 15 terms in the model, implying that the optimized model is the one with 15 terms and an interaction degree of 1.
The study compared the memory and time consumed by the proposed model and by the traditional MARS, as shown in Table 8. The proposed model uses slightly more time and memory because it first fits MARS within CD to obtain the best parameters and then refits the obtained model in MARS using the optimized parameters. In other words, it is a two-step process, which leads to greater memory consumption and execution time.
Table 8. Execution memory and time comparisons
Item | MARS | CD-MARS |
|---|---|---|
Memory used before execution (bytes) | 193,521,571 | 285,320,728 |
Memory used after execution (bytes) | 193,521,069 | 285,320,168 |
Memory used during execution (bytes) | -502 | -560 |
Time in seconds | 43.35262 | 57.859234 |
Comparison of experimental results
A comparison of the experimental results of the refitted CD-MARS model proposed in this study, the traditional MARS and other soft-computing models is presented in Table 9. The results show that the proposed approach greatly reduces the RMSE and increases the Rsquared, indicating that the model fits and generalizes well, with Rsquared and GRsquared values that are nearly the same. However, the results also show that the proposed model is slightly more computationally intensive, since it uses more memory than the traditional MARS, as presented in Table 8.
Table 9. Comparison of CD-MARS with other models
Model | MSE | RMSE | Rsquared | GRsquared |
|---|---|---|---|---|
Linear Regression | 1,414,931,404.63 | 37,615.57 | 0.8155 | |
Support Vector Machines | 1,132,136,370.34 | 33,647.23 | 0.8524 | |
Random Forest Regressor | 950,844,232.54 | 30,835.76 | 0.8760 | |
XGBoost | 900,813,561.35 | 30,013.56 | 0.8826 | |
MARS | 927,995,704 | 33,716.52 | 0.8952 | 0.891 |
CD-MARS | 829,673,296.4 | 27,392.43 | 0.931 | 0.916 |
Discussion
In this study, the performance of CD-MARS on the most appropriate selected subset of variables is compared with that of the traditional MARS, with results shown on a house prices dataset [47]. As mentioned earlier, model selection criteria have a significant impact on the choice of appropriate knots and of the variables influencing the response. Since the selected number and positions of knots vary even for the same variables, different MARS models can be obtained on the same dataset under different model selection criteria. For this reason, MARS modeling is examined here within a model selection framework. The model selection criterion studied in this research is implemented on top of the MARS algorithm using the earth package in the R environment (Table 6), which implements the MARS algorithm described by Friedman [24].
Applying the model to real data, the results showed that the proposed model outperforms the traditional GCV-based model. The model with the smallest RMSE value is selected as the best model, as in Table 3. Each iteration has its respective RMSE and Rsquared values, as shown in Fig. 7; the model selects the optimized iteration with the highest Rsquared value and the lowest RMSE value and returns the best parameter values to be used in MARS.
The results show that the traditional MARS produces a higher RMSE value than CD-MARS, and the CD-MARS Rsquared is higher than that of the traditional MARS, which shows that the CD-MARS criterion fits well and can produce good generalizability. However, the CD-MARS model requires slightly more computing memory and time than the traditional MARS model, as presented in Table 8.
Conclusions
In MARS modeling, model selection criteria play a vital role in selecting the best model. Information-based model selection criteria have proved to perform better than the GCV criterion for MARS modeling. However, the model may over-fit on multi-dimensional data when there is no precise relationship between the independent and dependent attributes. The study reviewed the literature on statistical information model selection criteria, which was taken as the basis of this work. In this study, the proposed CD-MARS model, which employs the coordinate descent methodology for MARS modeling, was implemented and tested. It was designed to integrate features of the coordinate descent optimizer and information-based criteria: the coordinate descent optimizer was integrated with MARS to select the best model that fits the data, with coordinate descent acting as a penalty term in the information-based model selection criteria. The proposed model was implemented with the earth package in the R environment and tested on a freely available house prices dataset. The results show that the mean squared error of the proposed model is smaller than that of the ordinary MARS model on the house prices dataset used in the study, which shows that the CD-MARS criterion fits well and produces good generalizability. We conclude that in MARS modeling the model with the minimum MSE value is considered the best model, and that a model involving several terms produces a reduced MSE with high Rsquared and GRsquared. This study was limited to model selection in MARS modeling using information-based model selection criteria. To further improve performance, we recommend that future research focus on optimizing MARS to perform classification tasks for multi-dimensional data.
Author contributions
All authors contributed to the conceptualization. N. A conducted the research and wrote the main manuscript text. W. R.M, and K.W.M supervised the research. All authors reviewed the manuscript.
Funding
This research was funded by Deutscher Akademischer Austauschdienst (DAAD) [91560782].
Data availability
The House prices dataset used to support the findings of this study has been deposited in the Kaggle repository at https://www.kaggle.com/datasets/lespin/house-prices-dataset.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Anil, AKP; Singh, UK. An optimal solution to the overfitting and underfitting problem of healthcare machine learning models. J Syst Eng Inf Technol; 2023; 2,
2. Chen, Z; Xiao, F; Guo, F; Yan, J. Interpretable machine learning for building energy management: a state-of-the-art review. Adv Appl Energy; 2023; 9, 100123.
3. Nguyen, H et al. Prediction of ground vibration intensity in mine blasting using the novel hybrid MARS–PSO–MLP model. Eng Comput; 2022; 38,
4. Cheng, MY; Cao, MT. Accurately predicting building energy performance using evolutionary multivariate adaptive regression splines. Appl Soft Comput J; 2014; 22, pp. 178-188.
5. Sahraei, MA; Duman, H; Çodur, MY; Eyduran, E. Prediction of transportation energy demand: multivariate adaptive regression splines. Energy; 2021; 224, 120090.
6. Naser, AH; Badr, AH; Henedy, SN; Ostrowski, KA; Imran, H. Application of multivariate adaptive regression splines (MARS) approach in prediction of compressive strength of eco-friendly concrete. Case Stud Constr Mater; 2022; 17, e01262.
7. Friedman, JH. Multivariate adaptive regression splines. Ann Stat; 1991; 19,
8. Kartal Koc, E; Bozdogan, H. Model selection in multivariate adaptive regression splines (MARS) using information complexity as the fitness function. Mach Learn; 2015; 101,
9. Özmen, A; Batmaz, İ; Weber, GW. Precipitation modeling by polyhedral RCMARS and comparison with MARS and CMARS. Environ Model Assess; 2014; 19,
10. Murat, N. Outlier detection in statistical modeling via multivariate adaptive regression splines. Commun Stat Simul Comput; 2023; 52,
11. Mushtaq, K et al. Multivariate wind power curve modeling using multivariate adaptive regression splines and regression trees. PLoS One; 2023; 18,
12. Kao, LJ; Chiu, CC. Application of integrated recurrent neural network with multivariate adaptive regression splines on SPC-EPC process. J Manuf Syst; 2020; 57,
13. Latha, CBC; Jeeva, SC. Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inform Med Unlocked; 2019; 16, 100203.
14. Adoko, AC; Jiao, YY; Wu, L; Wang, H; Wang, ZH. Predicting tunnel convergence using multivariate adaptive regression spline and artificial neural network. Tunn Undergr Sp Technol; 2013; 38, pp. 368-376.
15. Crino, S; Brown, DE. Global optimization with multivariate adaptive regression splines. IEEE Trans Syst Man Cybern Part B Cybern; 2007; 37,
16. Rezaie-Balf, M. Multivariate adaptive regression splines model for prediction of local scour depth downstream of an apron under 2D horizontal jets. Iran J Sci Technol Trans Civ Eng; 2019; 43, pp. 103-115.
17. Samui, P. Multivariate adaptive regression spline (Mars) for prediction of elastic modulus of jointed rock mass. Geotech Geol Eng; 2013; 31,
18. Dey, P; Das, AK. Application of multivariate adaptive regression spline-assisted objective function on optimization of heat transfer rate around a cylinder. Nucl Eng Technol; 2016; 48,
19. Taylan, P; Weber, GW; Özkurt, FY. A new approach to multivariate adaptive regression splines by using Tikhonov regularization and continuous optimization. TOP; 2010; 18,
20. Rezaei, M; Malekjani, M. Comparison between different methods of model selection in cosmology. Eur Phys J Plus; 2021; 136,
21. Gu, C; Wahba, G. Discussion: multivariate adaptive regression splines. Ann Stat; 1991; 19,
22. Bekar Adiguzel, M; Cengiz, MA. Model selection in multivariate adaptive regressions splines (MARS) using alternative information criteria. Heliyon; 2023; 9,
23. Craven, P; Wahba, G. Smoothing noisy data with spline functions - Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math; 1978; 31,
24. Friedman, JH; Roosen, CB. An introduction to multivariate adaptive regression splines. Stat Methods Med Res; 1995; 4,
25. De La Iglesia Martinez; Labib, SM. Demystifying normalized difference vegetation index (NDVI) for greenness exposure assessments and policy interventions in urban greening. Environ Res; 2023; 220, 115155.
26. Wahba, G. A Comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann Stat; 1985; 13,
27. Golub, GH; Heath, M; Wahba, G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics; 1979; 21,
28. Bottegal, G; Pillonetto, G. The generalized cross validation filter. Automatica; 2018; 90, pp. 130-137.
29. Wang, D; Zhong, Z; Bai, K; He, L. Spatial and temporal variabilities of PM2.5 concentrations in China using functional data analysis. Sustainability; 2019; 11,
30. Kullback, S; Leibler, RA. On information and sufficiency. Ann Math Stat; 1951; 22,
31. Akaike, H. Information theory and an extension of the maximum likelihood principle. Nurs Times; 1994; 90,
32. Sclove, SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika; 1987; 52,
33. Zhang, J; Yang, Y; Ding, J. Information criteria for model selection. Wiley Interdiscip Rev Comput Stat; 2023; 15,
34. Schwarz, G. Estimating the dimension of a model. Ann Stat; 1978; 6, pp. 461-464.
35. Rissanen, J. Modeling by shortest data description. Automatica; 1978; 14,
36. Bozdogan H (2023) Intelligent statistical data mining with information complexity and genetic algorithms
37. Gordon G, Tibshirani R (2015) Coordinate descent Adding to the toolbox, with stats and ML in mind
38. Sun, T; Hannah, R; Yin, W. Asynchronous coordinate descent under more realistic assumption. Adv Neural Inf Process Syst; 2017; 30, pp. 6183-6191.
39. Hazimeh, H; Mazumder, R. Fast best subset selection: coordinate descent and local combinatorial optimization algorithms. Oper Res; 2020; 68, 1517.
40. Wright, SJ. Coordinate descent algorithms. Math Program; 2015; 151,
41. Wanjiku, RN; Nderu, L; Kimwele, M. Dynamic fine-tuning layer selection using Kullback-Leibler divergence. Eng Reports; 2023; 5,
42. Mishra S (2017) Genetic algorithm: an efficient tool for global optimization, 10(8): 2201–2211
43. Azoulay R, David E, Avigal M, Hutzler D (2021) Adaptive task selection in automated educational software: a comparative study, Intell Syst Learn Data Anal Online Educ pp 179–204
44. Kingma DP, Lei Ba J (2017) Adam: a method for stochastic optimization, ICLR
45. Saini, VK; Kumar, R; Al-Sumaiti, AS; Sujil, A; Heydarian-Forushani, E. Learning based short term wind speed forecasting models for smart grid applications: an extensive review and case study. Electr Power Syst Res; 2023; 222, 109502.
46. Marquardt, DW; Snee, RD. Ridge regression in practice. Am Stat; 1975; 29,
47. lespin Lisette, “House Prices dataset,” kaggle. [Online]. Available: https://www.kaggle.com/datasets/lespin/house-prices-dataset
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).