1. Introduction
Spatial regression models, or simultaneous autoregressive (SAR) models, extend traditional linear regression by accounting for spatial dependencies through a spatial weight matrix (see Section 2.1). They have been widely applied in fields such as ecology [1,2], the social sciences [3,4], criminology [5], and finance [6]. The two most widely used SAR-type models are the spatial error model (SEM), in which the spatial dependence enters through the error terms, and the spatial autoregressive model (SAM), in which the spatial dependence enters through the dependent variable; see Section 2.1 for further details.
The literature provides well-established estimation methods for SAR models with no missing values in the response variables. Ref. [7] introduced an efficient maximum likelihood (ML) method. When the spatial weight matrix is sparse, efficient sparse Cholesky factorisation algorithms are employed in ML estimation [8,9,10]. Other methods include the method of moments (MOM) [11,12,13], Bayesian approaches [14,15,16], and instrumental variable (IV) methods [17].
However, SAR models often face two significant challenges: measurement errors and missing data. Measurement errors arise when the observed data differ from the true values due to inaccuracies in collection or reporting, impacting both the dependent and explanatory variables. This can lead to biased estimates and unreliable conclusions. On the other hand, missing data occur when certain observations are unavailable or incomplete, disrupting the spatial structure of the dataset and leading to biased results [18,19]. Addressing these issues is essential, as ignoring them can distort the analysis of spatial dependencies and relationships.
In the context of modelling house prices, measurement errors and missing data are common challenges, particularly in the response variable, house price. Measurement errors can occur when house prices vary across different data sources, such as real estate listings, tax assessments, or sales records. Missing data, in this case, refer to the lack of price information on unsold houses, which is also a significant issue as only a small percentage of properties are sold on the market each year [20]. The combination of these issues can complicate the accurate estimation of spatial relationships, emphasising the need for more complex SAR models that can simultaneously address both measurement errors and missing data. In the current literature, the estimation of SAR models with measurement errors and missing data has been addressed separately. Various methods have been proposed to handle these two challenges, but both issues have rarely been tackled simultaneously in a unified framework. This study aims to fill this gap by jointly addressing measurement errors and missing data in the response variable of SAR models.
Various estimation methods have been proposed for the estimation of SAR models with missing values in the response variable. Ref. [20] proposed an iterative algorithm similar to the expectation-maximization (EM) algorithm [21], using approximations to avoid computationally demanding matrix inversions. Ref. [22] refined this approach and proposed a valid EM algorithm by incorporating exact terms. Alternatively, ref. [23] developed a method that directly maximises the marginal log-likelihood of observed data, which is computationally faster and avoids the EM algorithm’s convergence issues. Additionally, ref. [24] introduced an IV estimator for the SAM with missing responses, and ref. [25] proposed an inverse probability weighting (IPW)-based robust estimator for the same model.
On the other hand, the issue of measurement errors in the response variable of SAR models has been addressed by refs. [26,27], who proposed efficient ML estimation techniques. Additionally, refs. [28,29] applied the integrated nested Laplace approximation (INLA) [30] to estimate such models. However, these studies assumed that there were no missing values in the response variable.
Our article makes two contributions. First, we introduce hierarchical SAR (H-SAR) models that account for measurement errors and missing data in the response variable. Second, we introduce a novel marginal ML method along with two alternative computational approaches that significantly reduce its computational complexity. Our most efficient approach, called the parameterisation approach, reduces the complexity of the proposed marginal ML method from O(n³) to O(n^{3/2}), where n is the total number of observations. The marginal ML estimation of H-SAR models involves computationally intensive operations, such as matrix inversions and determinant calculations, which typically have computational complexities of O(n³). We reduce this complexity by leveraging the specialised sparse matrix operations provided by the R Matrix package (version 1.4-0) [31]. Additionally, we rewrite certain terms in the log-likelihood computation to avoid some of the computationally demanding steps; see Section 3.4 for further details. We illustrate the H-SAR models and their marginal ML method empirically using simulated and real data.
The remainder of this paper is organised as follows. Section 2 provides an overview of standard SAR models and introduces hierarchical SAR models. Section 3 presents the marginal ML method. Efficient computational strategies to reduce the complexity of the proposed algorithm are also discussed. In Section 4, we evaluate the performance of the estimation method using simulated datasets. Section 5 discusses the real data application. Section 6 concludes. The paper has Supplementary Materials containing some further technical and empirical results.
2. Models
In this section, we begin by presenting the two most commonly used standard simultaneous autoregressive (SAR) models in the literature. We then extend these models to account for measurement errors, leading to the development of the hierarchical SAR models employed in this study.
2.1. Standard Simultaneous Autoregressive Models
Before presenting the SAR models, we first define the variables and parameters. Let y = (y_1, …, y_n)′ denote the vector of response variables observed at the n spatial locations s_1, …, s_n. Each element y_i represents the value of the response variable at location s_i. The matrix X is an n × (r + 1) matrix of covariates, where r is the number of covariates. The first column of X typically consists of ones to include the intercept β_0, while the remaining columns contain the values of the r covariates at each spatial location. The vector β = (β_0, β_1, …, β_r)′ contains the fixed effects parameters or regression coefficients associated with the covariates in X.
The matrix W is an n × n spatial weight matrix that encodes the spatial structure and relationships between the locations. The element w_ij indicates the spatial influence or connectivity between locations s_i and s_j, where typically w_ii = 0 (indicating no self-influence), and W is often row-normalised so that each row sums to 1. Several strategies for constructing W have been proposed in the literature (see [7,15] for further details). We assume that W is sparse, as is commonly observed in numerous real-world scenarios, although this need not always be the case.
The parameter ρ is the spatial autocorrelation parameter, capturing the strength of the spatial dependence. When ρ = 0, there is no spatial dependence; positive values of ρ suggest positive spatial autocorrelation (neighbouring locations tend to have similar values), and negative values indicate negative spatial autocorrelation.
The error term ε = (ε_1, …, ε_n)′ is a vector of residuals for the spatial locations, accounting for the unexplained variability in the response variable. We assume that ε follows a multivariate normal distribution with mean vector 0 and covariance matrix σ_ε² I_n, where σ_ε² is the variance of the error term and I_n is the n × n identity matrix, implying homoscedasticity and no correlation between errors across locations.
Given these definitions, we consider two SAR models. The first is the spatial autoregressive model (SAM), defined as
y = ρWy + Xβ + ε,  ε ∼ N(0, σ_ε² I_n).   (1)
The second model is the spatial error model (SEM), expressed as
y = Xβ + u,  u = ρWu + ε,  ε ∼ N(0, σ_ε² I_n);   (2)
see [15,32] for a more detailed explanation of the SAM and SEM. For the standard SAM and SEM, when the error vector ε is normally distributed, the response variable y is also normally distributed, with the mean vector and covariance matrix given in Table 1. To ensure a valid covariance matrix in the SAM and SEM, it is crucial that the correlation parameter ρ does not take any of the values 1/λ_1, …, 1/λ_n, where λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n are the eigenvalues of W sorted in ascending order [33]. When W is normalised either by row or by column, with each row or column summing to 1, the range of ρ is constrained to (1/λ_1, 1) [32].
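The moments in Table 1 can be sketched numerically. The following Python snippet is an illustrative translation (the paper's computations use R and the Matrix package); the chain-graph W, covariates, and parameter values are hypothetical. Writing A = I_n − ρW, the SAM has mean A^{−1}Xβ and covariance σ_ε²(A′A)^{−1}, while the SEM has mean Xβ with the same covariance structure:

```python
import numpy as np

# Hypothetical row-normalised weight matrix on a chain of n = 5 units
n = 5
adj = np.zeros((n, n))
for i in range(n - 1):                       # units i and i+1 are neighbours
    adj[i, i + 1] = adj[i + 1, i] = 1.0
W = adj / adj.sum(axis=1, keepdims=True)     # each row sums to 1

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # intercept + 1 covariate
beta = np.array([1.0, 0.5])
rho, sigma2 = 0.4, 1.0                       # hypothetical parameter values

A = np.eye(n) - rho * W

# SAM: y = rho*W*y + X*beta + eps  =>  mu = A^{-1} X beta
mu_sam = np.linalg.solve(A, X @ beta)
Sigma = sigma2 * np.linalg.inv(A.T @ A)      # common covariance sigma2 (A'A)^{-1}

# SEM: y = X*beta + u, u = rho*W*u + eps  =>  mu = X beta, same covariance
mu_sem = X @ beta

# Validity: rho must avoid 1/lambda_i; for row-normalised W, (1/lambda_min, 1)
lam = np.sort(np.linalg.eigvals(W).real)
assert 1.0 / lam[0] < rho < 1.0
```

The final assertion checks the admissible range of ρ stated above for a row-normalised W.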
2.2. Hierarchical Simultaneous Autoregressive Models
In many practical applications, the spatial autoregressive processes y shown in Equations (1) and (2) are measured imperfectly, introducing measurement errors into the model. Given the observed data z_1, …, z_n at the spatial locations s_1, …, s_n, the observed value at spatial location s_i can be expressed as
z_i = y_i + v_i,  i = 1, …, n,   (3)
where y_i represents the latent spatial autoregressive process defined in Equations (1) and (2), and v_i is the additive measurement error at spatial location s_i. We assume that v_i follows a normal distribution with a mean of 0 and a variance of σ_v². Additionally, we assume that v_i and ε_i are independent of each other. In vector notation, this leads to z = y + v, with v ∼ N(0, σ_v² I_n).
In spatial statistics, the model given in Equation (3) is often called the data model. This model characterises the distribution of the observed data z_i given the underlying latent spatial process y_i, for i = 1, …, n. The models describing the underlying latent spatial processes, such as the spatial autoregressive process in Equation (1) and the spatial error process in Equation (2), are called the process models. A joint spatial statistical model defined through a data model and a process model is called a hierarchical spatial model [34]. In this paper, we refer to SAR models that incorporate measurement errors as hierarchical simultaneous autoregressive (H-SAR) models. Specifically, we define the spatial autoregressive model with measurement errors as a hierarchical spatial autoregressive model (H-SAM) and the spatial error model with measurement errors as a hierarchical spatial error model (H-SEM). The H-SAM is derived by substituting the latent process in Equation (3) with the hidden spatial autoregressive process described in Equation (1). Similarly, the H-SEM is obtained by replacing the latent process in Equation (3) with the latent spatial error process specified in Equation (2).
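As a numerical sanity check on this data/process decomposition, the Python sketch below (hypothetical parameter values; notation z = y + v as above) forms the latent SEM covariance, adds the measurement-error variance σ_v² on the diagonal, and confirms that the implied Gaussian log-density of z agrees with a reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, rho, sig2_eps, sig2_v = 6, 0.3, 1.0, 0.25   # hypothetical values

adj = np.zeros((n, n))
for i in range(n - 1):                          # chain-graph neighbours
    adj[i, i + 1] = adj[i + 1, i] = 1.0
W = adj / adj.sum(axis=1, keepdims=True)
A = np.eye(n) - rho * W

X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([0.5, 1.0])

# Process model (SEM): y ~ N(X beta, sig2_eps (A'A)^{-1})
mu = X @ beta
Sigma_y = sig2_eps * np.linalg.inv(A.T @ A)

# Data model: z = y + v, v ~ N(0, sig2_v I)  =>  z ~ N(mu, Sigma_y + sig2_v I)
Sigma_z = Sigma_y + sig2_v * np.eye(n)

z = rng.multivariate_normal(mu, Sigma_z)
e = z - mu
loglik = (-0.5 * n * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(Sigma_z)[1]
          - 0.5 * e @ np.linalg.solve(Sigma_z, e))
assert np.isclose(loglik, multivariate_normal(mu, Sigma_z).logpdf(z))
```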
We now derive the marginal density of z for the H-SAR models. Let θ = (θ_y′, σ_v²)′ be the vector of model parameters, where θ_y = (β′, ρ, σ_ε²)′ collects the parameters of the latent process. The joint distribution of z and y is
p(z, y | θ) = p(z | y, σ_v²) p(y | θ_y),   (4)
where p(z | y, σ_v²) = ∏_{i=1}^{n} p(z_i | y_i, σ_v²), with p(z_i | y_i, σ_v²) being a normal density with mean y_i and variance σ_v². In addition, p(y | θ_y) represents the density of the standard SAR processes outlined in Equations (1) and (2) for the SAM and SEM, respectively. For both cases, this density follows a multivariate normal distribution. Then, the marginal density of z is obtained by integrating y out of the joint distribution of z and y in Equation (4), which is given by
p(z | θ) = ∫ p(z | y, σ_v²) p(y | θ_y) dy.   (5)
Since both the measurement error and the latent spatial process follow multivariate normal distributions in the H-SAM and H-SEM, the marginal density of z also follows a multivariate normal distribution. The mean vector μ and covariance matrix Σ of the distribution of z for both models are presented in Table 1. The constraints on the correlation parameter ρ in the H-SAM and H-SEM, ensuring a valid covariance matrix Σ, align with those for the standard SEM and SAM, as elucidated by [27]. The log-likelihood function of z for the H-SAR models in terms of the model parameters θ is
ℓ(θ) = −(n/2) log(2π) − (1/2) log|Σ| − (1/2) e′Σ^{−1}e,   (6)
where e = z − μ is the vector of residuals. The expressions for μ and Σ are given in Table 1.
3. H-SAR Models with Missing Data and Their Marginal Maximum Likelihood Methods
Section 3.1 explores H-SAR models with missing data in the response variable. Section 3.2 introduces the marginal maximum likelihood (MML) method, while Section 3.3 proposes two alternative computational approaches for implementing the MML algorithm. This is followed by an analysis of the computational complexities of the proposed MML algorithms in Section 3.4.
3.1. Hierarchical Simultaneous Autoregressive Models with Randomly Missing Responses
Let z_o represent the subset of z containing the n_o observed response variables, and let z_m represent the subset of z containing the n_m missing response variables. The complete-data vector is denoted by z = (z_o′, z_m′)′. The matrices X and W are divided into distinct parts as follows:
X = (X_o′, X_m′)′,  W = ( W_{o,o}  W_{o,m} ; W_{m,o}  W_{m,m} ),   (7)
where X_o and X_m are the corresponding matrices of the covariates for the observed and unobserved response variables, respectively, and W_{o,o}, W_{o,m}, W_{m,o}, and W_{m,m} represent the submatrices of W.
In this study, we assumed that the missing responses z_m are missing at random (MAR). The MML method under the MAR mechanism involves maximising the marginal likelihood of the observed data z_o; see [35] for further details. To compute the marginal likelihood of z_o, the unobserved data z_m must be integrated out from the complete-data density of z, p(z | θ). In the following subsection, we derive the marginal distribution of z_o using the properties of the multivariate normal distribution and present the MML estimators.
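The multivariate-normal property used in the next subsection — a subvector of a Gaussian vector is Gaussian with the corresponding sub-mean and sub-covariance block — can be checked numerically. The Python sketch below uses an arbitrary positive-definite covariance and a hypothetical set of observed indices:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n = 7
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)      # a valid full covariance matrix
mu = rng.standard_normal(n)

obs = np.array([0, 2, 3, 6])         # hypothetical indices of observed responses
mu_o = mu[obs]
Sigma_oo = Sigma[np.ix_(obs, obs)]   # the (o,o) block of the partition

z = rng.multivariate_normal(mu, Sigma)
z_o = z[obs]

# Marginal log-likelihood of z_o built from the partitioned quantities
n_o = len(obs)
e_o = z_o - mu_o
ll_o = (-0.5 * n_o * np.log(2 * np.pi)
        - 0.5 * np.linalg.slogdet(Sigma_oo)[1]
        - 0.5 * e_o @ np.linalg.solve(Sigma_oo, e_o))
assert np.isclose(ll_o, multivariate_normal(mu_o, Sigma_oo).logpdf(z_o))
```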
3.2. Marginal Maximum Likelihood Estimation Method
This section discusses the proposed MML estimation method. The mean vector μ and covariance matrix Σ for the H-SAM and H-SEM, as presented in Table 1, are partitioned as
μ = (μ_o′, μ_m′)′,  Σ = ( Σ_{o,o}  Σ_{o,m} ; Σ_{m,o}  Σ_{m,m} ).   (8)
Since z is a multivariate normal random variable, z_o also follows a multivariate normal distribution, with mean vector μ_o and covariance matrix Σ_{o,o} [36]; see Equation (8). To compute the log-likelihood of the marginal distribution of z_o, replace z with z_o, μ with μ_o, Σ with Σ_{o,o}, and n with n_o in Equation (6). This yields the following expression for the marginal log-likelihood of z_o:
ℓ_o(θ) = −(n_o/2) log(2π) − (1/2) log|Σ_{o,o}| − (1/2) e_o′Σ_{o,o}^{−1}e_o,   (9)
where e_o = z_o − μ_o. Maximising Equation (9) with respect to β and σ_ε² (taking partial derivatives and setting them to zero) while holding ρ and σ_v² fixed, we obtain closed-form ML estimates β̂ and σ̂_ε²; the design matrix entering these estimators is X_o for the H-SEM and the observed rows of (I_n − ρW)^{−1}X for the H-SAM. By substituting β̂ and σ̂_ε² into the log-likelihood in Equation (9), the concentrated marginal log-likelihood has the form
(10)
where c is a constant. To obtain the ML estimates for ρ and σ_v², we maximise the concentrated marginal log-likelihood in Equation (10) numerically.
3.3. Two Alternative Computational Approaches
The numerical optimisation of the concentrated marginal log-likelihood defined in Equation (10) requires its repeated evaluation for different values of ρ and σ_v². Therefore, the efficient computation of the concentrated marginal log-likelihood is critical to reduce the overall computational cost of the proposed MML algorithm. We assume that the spatial weight matrix W is sparse, a characteristic commonly encountered in many practical applications. To exploit this sparsity for the efficient evaluation of the concentrated marginal log-likelihood, we propose two computational approaches. Both were specifically designed to maximise computational efficiency when applied to a sparse W, making them particularly well suited for such scenarios.
Efficient computation of the matrix Σ_{o,o} plays a key role in evaluating the concentrated marginal log-likelihood. Our first approach, referred to as the direct approach, involves directly extracting the submatrix Σ_{o,o} from the larger matrix Σ given in Table 1 as
(11)
After computing Σ_{o,o}, the logarithm of its determinant and its inverse are calculated; subsequently, the concentrated marginal log-likelihood is evaluated.
In the direct computational approach, the matrices Σ, Σ_{o,o}, and Σ_{o,o}^{−1} are explicitly calculated. However, when evaluating the concentrated marginal log-likelihood in Equation (10) and the closed-form ML estimators of β and σ_ε², only certain terms involving Σ_{o,o}^{−1} are required: a small number of quadratic and bilinear forms in Σ_{o,o}^{−1} involving z_o and X_o, together with log|Σ_{o,o}|. Computing these terms by first explicitly forming Σ_{o,o}^{−1} can be computationally challenging, as Σ_{o,o} is a dense matrix. In our second computational approach, we reformulate these terms, along with log|Σ_{o,o}|, to enable more efficient computation using sparse matrix operations.
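To illustrate why Σ_{o,o}^{−1} never needs to be formed explicitly, the Python sketch below (a generic dense stand-in, not the paper's parameterisation) computes a GLS-type estimate of β using only linear solves and checks it against ordinary least squares on data whitened by the Cholesky factor of the covariance block:

```python
import numpy as np

rng = np.random.default_rng(3)
n_o, p = 40, 3
X_o = np.column_stack([np.ones(n_o), rng.standard_normal((n_o, p - 1))])
M = rng.standard_normal((n_o, n_o))
Sigma_oo = M @ M.T + n_o * np.eye(n_o)   # hypothetical dense covariance block
z_o = rng.standard_normal(n_o)

# GLS estimate via solves only: beta = (X' S^{-1} X)^{-1} X' S^{-1} z
SinvX = np.linalg.solve(Sigma_oo, X_o)
Sinvz = np.linalg.solve(Sigma_oo, z_o)
beta_hat = np.linalg.solve(X_o.T @ SinvX, X_o.T @ Sinvz)

# Equivalent: whiten with the Cholesky factor (Sigma_oo = L L') and run OLS
L = np.linalg.cholesky(Sigma_oo)
Xw = np.linalg.solve(L, X_o)
zw = np.linalg.solve(L, z_o)
beta_ols, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
assert np.allclose(beta_hat, beta_ols)
```

Both routes avoid `np.linalg.inv`; the whitening route is the one that carries over to sparse triangular solves.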
In the second approach, referred to as the parameterisation approach, the submatrix Σ_{o,o} is first calculated using an additional sparse matrix, D, derived from Σ as given in Table 1. This approach leverages the sparsity of W and D to simplify the calculation of Σ_{o,o}. We define the sparse selection matrix D as D = (I_{n_o}, O), where I_{n_o} is the n_o × n_o identity matrix and O is the n_o × n_m zero matrix. Now, Σ_{o,o} = DΣD′. Then, after some further simplifications, Σ_{o,o} can be reformulated as
(12)
First, the log-determinant log|Σ_{o,o}| is computed. Using the matrix determinant lemma [37], the logarithm of the determinant of Σ_{o,o} in Equation (12) can be obtained as
(13)
and the proof is given in Section S4.2 of the Supplementary Materials. The determinant of Σ_{o,o} can then be efficiently computed by obtaining the required Cholesky factors using the sparse matrix routines of the R Matrix package.
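The matrix determinant lemma itself is easy to verify numerically in its generic form det(A + UCV) = det(C^{−1} + VA^{−1}U)·det(C)·det(A); the matrices in the Python sketch below are random placeholders, not the paper's specific factors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)               # invertible n x n matrix
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))
C = np.eye(k) + 0.1 * np.ones((k, k))     # invertible k x k matrix

# Matrix determinant lemma: the n x n determinant reduces to a k x k one
lhs = np.linalg.det(A + U @ C @ V)
rhs = (np.linalg.det(np.linalg.inv(C) + V @ np.linalg.solve(A, U))
       * np.linalg.det(C) * np.linalg.det(A))
assert np.isclose(lhs, rhs)
```

The computational gain is that only a k × k determinant and solves with A are needed, which is what makes the sparse-factor route above attractive.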
By applying the Woodbury matrix identity ([38], p. 427) to the matrix Σ_{o,o} in Equation (12), we obtain Σ_{o,o}^{−1} as follows:
(14)
By substituting Σ_{o,o}^{−1} from Equation (14), the required terms in the concentrated marginal log-likelihood are given by
(15)
(16)
(17)
The matrix inverse that appears in Equations (15)–(17) does not need to be computed explicitly. Instead, we solve the linear systems within these equations to compute the required terms without directly performing the inversion. For further details, see Section 3.4.
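In the same spirit, the Woodbury identity allows (A + UCV)^{−1} to be applied to a vector using only solves with A and a small k × k system. The Python sketch below uses generic placeholder matrices, not the paper's specific terms:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 2
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)       # well-conditioned n x n matrix
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))
C = np.eye(k)
b = rng.standard_normal(n)

# Direct dense solve for reference
x_direct = np.linalg.solve(A + U @ C @ V, b)

# Woodbury: (A+UCV)^{-1} b = A^{-1}b - A^{-1}U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} b
Ainv_b = np.linalg.solve(A, b)
Ainv_U = np.linalg.solve(A, U)
small = np.linalg.inv(C) + V @ Ainv_U          # only a k x k system
x_wood = Ainv_b - Ainv_U @ np.linalg.solve(small, V @ Ainv_b)
assert np.allclose(x_direct, x_wood)
```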
In the subsequent sections, we refer to the marginal ML methods as MML methods. The MML algorithm that employs the parameterisation computational approach is denoted as the MML-P method, whereas the implementation that utilises the direct computational approach is referred to as the MML-D method.
3.4. Computational Aspects of Marginal ML Methods
Deriving the computational complexity of the proposed MML methods in the general case is challenging, as it heavily depends on the sparsity of the weight matrix W, which is determined by the number of neighbours of each unit in the dataset. Therefore, we provide theoretical computational complexities for the proposed methods under the assumption of a specific structure for W.
We now provide a brief overview of the grid and neighbourhood structure used to construct W. This structure was employed in both the derivation of computational complexity and the simulation studies presented in Section 4. We utilise a regular grid, where n represents the total number of units in the grid. For instance, consider a regular grid of size 3 × 3, as shown in the left panel of Figure 1. We define spatial units at the vertices, resulting in 9 units. The edges define the neighbours associated with each unit. For instance, unit 5 has four neighbours: units 2, 4, 6, and 8. This type of neighbouring structure is known as the Rook neighbourhood in the literature [39]. It is clear that any given unit can have between two and four neighbours in the Rook neighbourhood, depending on its position on the grid. Neighbourhood structures that have a fixed maximum number of neighbours for a given unit, regardless of the grid size (in this case, four), are referred to as local neighbourhoods.
Once the neighbours are defined on the grid, the weight matrix W is constructed. The binary W, based on the 3 × 3 regular grid, is shown in the right panel of Figure 1. In W, a value of 1 in the jth column of the ith row indicates that unit j is a neighbour of unit i; otherwise, the entry is 0.
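The construction above can be sketched as follows (Python; the paper's implementation is in R, so the function below and its names are purely illustrative):

```python
import numpy as np

def rook_weight_matrix(nrow, ncol, row_normalise=False):
    """Binary Rook-neighbourhood weight matrix for an nrow x ncol regular grid."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for i in range(nrow):
        for j in range(ncol):
            u = i * ncol + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < nrow and 0 <= b < ncol:
                    W[u, a * ncol + b] = 1.0   # unit (a, b) is a Rook neighbour
    if row_normalise:
        W = W / W.sum(axis=1, keepdims=True)
    return W

W = rook_weight_matrix(3, 3)
# Unit 5 (the centre of the 3 x 3 grid) has neighbours 2, 4, 6, and 8 (1-based)
neighbours = np.flatnonzero(W[4]) + 1
assert neighbours.tolist() == [2, 4, 6, 8]
```

Setting `row_normalise=True` yields the row-normalised version used in the SAR models, with each row summing to 1.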
We know that Σ involves the inverse of the sparse matrix (I_n − ρW)′(I_n − ρW) (see Table 1). A key computational step in implementing the MML-D and MML-P methods is calculating the Cholesky factor of (I_n − ρW)′(I_n − ρW). Building on the results reported by [40] on nested dissection ordering algorithms, ref. [41] demonstrated that the Cholesky factorisation of the precision matrix of a Gaussian Markov random field (GMRF) defined on a local neighbourhood of a regular grid has a computational complexity of O(n^{3/2}). Since (I_n − ρW)′(I_n − ρW) represents the precision matrix of a GMRF based on a local neighbourhood, its Cholesky factorisation also exhibits a complexity of O(n^{3/2}).
We now discuss the computational complexity of the MML-D method. Let L denote the Cholesky factor of (I_n − ρW)′(I_n − ρW). To obtain the inverse of this matrix efficiently while avoiding numerical issues, we solve the following system of equations:
LL′B = I_n,   (18)
where B represents the matrix to be solved for, which is ultimately equal to [(I_n − ρW)′(I_n − ρW)]^{−1}. The exact theoretical complexity of solving the system presented in Equation (18) is difficult to derive analytically. To address this, we conducted a simulation study to estimate the empirical computational complexity; see Section S1 of the Supplementary Materials. The results from this study suggest that the complexity of performing the forward and backward substitution is approximately O(n²).
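The forward and backward substitutions behind Equation (18) can be sketched with a small dense stand-in for the sparse symmetric positive-definite matrix (Python with SciPy; in practice sparse Cholesky routines would be used):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(6)
n = 8
G = rng.standard_normal((n, n))
G = G @ G.T + n * np.eye(n)          # stand-in for the sparse SPD matrix

L = cholesky(G, lower=True)          # G = L L'

# Solve L L' B = I_n by forward then backward substitution
Y = solve_triangular(L, np.eye(n), lower=True)     # forward:  L Y = I
B = solve_triangular(L.T, Y, lower=False)          # backward: L' B = Y
assert np.allclose(B, np.linalg.inv(G))
```

The explicit `np.linalg.inv` call here is only for verification; the point of the triangular solves is precisely to avoid it.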
Once [(I_n − ρW)′(I_n − ρW)]^{−1} is computed, the MML-D method proceeds with the calculation of Σ. From Σ, we extract the sub-matrix Σ_{o,o}. Given that Σ is a dense matrix, Σ_{o,o} is also dense. In the subsequent steps, the MML-D method explicitly computes both the inverse of Σ_{o,o} and its determinant; each has a complexity of O(n_o³), as Σ_{o,o} remains dense.
The overall computational complexity of the MML-D method is determined by three main components: the Cholesky factorisation of (I_n − ρW)′(I_n − ρW), which has a complexity of O(n^{3/2}); solving the system of equations in Equation (18), with a complexity of approximately O(n²); and computing both the inverse and the determinant of the submatrix Σ_{o,o}, which has a complexity of O(n_o³). Consequently, the overall computational complexity of the MML-D method is given by O(n^{3/2} + n² + n_o³), which simplifies to O(n² + n_o³).
We now analyse the computational complexity of the MML-P method, which mainly depends on the computation of the Cholesky factor of (I_n − ρW)′(I_n − ρW). This is because, once this factor is computed, the subsequent steps are straightforward. These include computing the Cholesky factor of the smaller matrix arising from the parameterisation, solving the systems in Equations (15)–(17), and calculating the logarithm of the determinant of Σ_{o,o} using Equation (13).
Let L̃ denote the Cholesky factor of this smaller matrix; both factors can be obtained efficiently using the sparse Cholesky routines of the R Matrix package.
To compute the terms in Equations (15)–(17), we solve the systems within these equations without directly calculating the matrix inverse. Using the precomputed Cholesky factors, we perform forward and backward substitution to compute the required terms. For example, to compute the term in Equation (15), we solve the corresponding linear system for the unknown vector by forward and backward substitution with the precomputed factors.
However, in contrast to the system that needs to be solved in Equation (18) within the MML-D method, the systems in the MML-P method are computationally more efficient, as they involve far fewer right-hand sides. The system to be solved in the MML-D method has a much larger dimensionality of n × n, making it computationally more intensive.
The computation of the logarithm of the determinant of Σ_{o,o} is significantly faster when the Cholesky factors L and L̃ are available. We now rewrite log|Σ_{o,o}| from Equation (13) as follows:
(19)
Since both L and L̃ are lower triangular matrices, their log-determinants can be efficiently computed by summing the logarithms of their diagonal elements, as follows:
log|L| = ∑_i log(l_ii),  log|L̃| = ∑_i log(l̃_ii),   (20)
where l_ii and l̃_ii denote the diagonal elements of L and L̃, respectively.
For the MML-P method, the most computationally intensive operation is the calculation of the Cholesky factor of (I_n − ρW)′(I_n − ρW), which has a complexity of O(n^{3/2}). Consequently, the overall computational complexity of the MML-P method is O(n^{3/2}).
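The diagonal-sum rule in Equation (20) is easy to check against a reference log-determinant (Python sketch; G below is a generic symmetric positive-definite stand-in):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10
G = rng.standard_normal((n, n))
G = G @ G.T + n * np.eye(n)

L = np.linalg.cholesky(G)                 # G = L L', L lower triangular
logdet_L = np.sum(np.log(np.diag(L)))     # log|L| = sum_i log(l_ii)
logdet_G = 2.0 * logdet_L                 # since |G| = |L|^2
assert np.isclose(logdet_G, np.linalg.slogdet(G)[1])
```

Working with the log of the diagonal entries also avoids the overflow that forming |G| directly would cause for large n.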
We highly recommend using the MML-P method, not only because of its lower computational complexity compared with the MML-D method but also because of its additional practical advantages. For instance, although (I_n − ρW)′(I_n − ρW) is a sparse matrix, its inverse is a dense matrix. As n increases, the size of this inverse grows, making it impractical to store due to its large, dense structure. However, to implement MML-D, this inverse must be explicitly calculated first. In our simulation studies, attempts to store it for large n on the National Institute for Applied Statistics Research Australia High Performance Computer cluster were unsuccessful owing to memory limitations.
4. Simulation Study
This section discusses the accuracy of the MML methods for different missing data percentages using simulated data. Additionally, we compare the computation times of the MML-P and MML-D methods.
In practice, we often encounter situations where the number of unobserved or missing units (n_m) is significantly larger than the number of observed units (n_o). For instance, in the property market, house prices can be modelled using H-SAR models with missing data. Here, the prices of sold houses represent the observed responses, while the prices of unsold houses are treated as missing. Since only a small percentage of properties are sold each year, or even over a decade, n_o is relatively small compared to the total number of properties (n), making n_m much larger in comparison. Nevertheless, information on explanatory variables, such as house characteristics and location data for all properties, both sold and unsold, is typically available from assessors [20]. Therefore, in this simulation study, we considered a moderate (50%) to high (90%) percentage of missing data in the response variable.
When estimating SAR models with missing response data, researchers often overlook the missingness issue. In such cases, the SAR model employs a spatial weight matrix based only on the location data of the observed units. The likelihood constructed in this manner is referred to as the observed likelihood, and the estimates obtained from maximising the observed log-likelihood are termed the observed maximum likelihood (OML) estimates in the following sections of this paper. However, the observed likelihood does not reflect the marginal likelihood of the observed response variable z_o, as it is based solely on the covariate matrix for the observed locations, X_o, and the spatial weight matrix for the observed units, W_{o,o}. This misspecification generally leads to biased and inconsistent parameter estimates [18,19]. To further clarify this issue, the marginal distributions of z_o under the OML and MML methods are presented in detail, and the biases in the parameter estimates produced by these methods are compared later in this section.
To generate synthetic data, we fixed the spatial autocorrelation parameter ρ and the variance parameters σ_ε² and σ_v² at common values for both the H-SEM and H-SAM, together with the regression parameters β. The covariate was drawn from the standard normal distribution. The spatial weight matrix W was generated on a regular grid, using the Rook neighbourhood explained in Section 3.4. In total, we simulated 250 datasets, each with the same number of units, n. All computations were carried out in R.
Table 2 displays the mean parameter estimates and mean squared errors (MSEs) for the parameters ρ, σ_ε², and σ_v² in the H-SEM and H-SAM. These results are based on analyses with 50% and 90% missing data, estimated using the OML, MML-D, and MML-P methods across 250 simulated datasets. Additionally, the table reports the mean computation times for both parameter estimation and the calculation of standard errors.
For both models, the MML methods (MML-D and MML-P) consistently produce identical parameter estimates that are closer to the true values, achieving lower MSEs and reduced bias for the parameters ρ, σ_ε², and σ_v² compared with the OML method, regardless of the missing data percentage, as shown in Table 2. This pattern holds for the fixed effects parameters β as well; see Section S2 of the Supplementary Materials for further details.
As mentioned at the beginning of this section, the misspecification of the marginal distribution of z_o in the OML method leads to highly biased parameter estimates. The parameters of the marginal distribution of z_o (the marginal mean vector μ_o and the marginal covariance matrix Σ_{o,o}) under MML and OML are summarised in Table 3 (the operation [·]_{o,o} in Table 3 involves first computing the inverse of a matrix and then extracting the submatrix corresponding to the observed locations from the resulting matrix). The MML method uses the correct marginal distribution of z_o, as derived in Section 3.2.
In the case of the H-SEM, while the marginal mean is the same under both MML and OML, the covariance matrix Σ_{o,o} is incorrectly specified in the OML approach. Correctly specifying Σ_{o,o} requires utilising the complete spatial weight matrix W, which accounts for all n spatial locations, as is performed in the MML methods. Similarly, in the H-SAM, both the marginal mean and the covariance matrix of z_o are misspecified under the OML approach. Accurate specification of the marginal distribution parameters of z_o involves incorporating both the full covariate matrix X and the complete spatial weight matrix W.
Notably, for datasets with a moderate percentage of missing data (50% missing), OML estimates are reasonably accurate. However, MML estimates (both MML-D and MML-P) are closer to the true values, exhibiting lower MSEs for ρ, σ_ε², and σ_v² across both models. Under a high percentage of missing data (90% missing), OML estimates become significantly biased and show larger MSEs compared to MML estimates, as shown in Table 2. A similar trend is observed for the fixed effect parameters, as detailed in the tables in Section S2 of the Supplementary Materials. These findings highlight the effectiveness of the proposed MML methods in reducing the bias in SAR parameter estimates for datasets that have both measurement errors and a substantial percentage of missing data.
Fitting H-SAR models (including both parameter estimation and standard error calculation) was faster with MML-P than with MML-D, as shown in Table 2, regardless of the missing data percentage. For datasets with 50% missing data, fitting the H-SEM using the MML-D method took an average of 1293.76 s, whereas the MML-P method required only 10.02 s. Similarly, for fitting the H-SAM with 50% missing data, the MML-D method took an average of 1396.39 s, while the MML-P method required just 13.97 s. When 90% of the data were missing, the average computation time for fitting the H-SEM was 671.9590 s using MML-D, compared to 12.3905 s with MML-P. Similarly, for the H-SAM, MML-D took 675.9505 s, while MML-P took 16.4584 s.
An interesting observation is that the MML-D method requires less computational time when estimating datasets with a higher percentage of missing data compared to those with a lower percentage. This is due to its computational complexity of O(n² + n_o³). Given that n = n_o + n_m, a higher missing data percentage generally implies a lower n_o (the number of observed values), thereby reducing the n_o³ term. In other words, the time consumption of the MML-D method decreases as n_m, the number of unobserved values, increases. Consequently, MML-D is faster for datasets with a high percentage of missing data than for those with a low percentage.
On the other hand, the computational complexity of the MML-P method is O(n^{3/2}), which is independent of n_o and hence of the percentage of missing data. As a result, the MML-P method theoretically maintains a constant computational time, regardless of the percentage of missing data. This is confirmed by the nearly identical computation times for the MML-P method across the two missing data percentages, as shown in Table 2. More importantly, regardless of the percentage of missing data, the MML-P method outperforms the MML-D method due to its lower and constant computational complexity across varying missing data scenarios.
Further results from the simulation study, including mean standard errors and coverage rates for all model parameters (including the fixed effects β), are provided in Tables S1–S4 in Section S2 of the Supplementary Materials. These additional results further demonstrate that both MML methods (MML-D and MML-P) offer better coverage and smaller mean standard errors compared to the OML method.
5. Real Data Application
This section demonstrates the application of the proposed estimation methods on a real dataset of house prices. As discussed in Section 1, house price data typically include both measurement errors and missing values. Additionally, property prices exhibit strong spatial correlation, making standard linear regression models inadequate for accurate modelling. Instead, SAR models provide a more suitable alternative. For this analysis, we used a dataset with no missing values, allowing us to estimate the parameters accurately without the complication of missing data. These estimates were then considered as the true parameter values, serving as a benchmark for comparison with the estimates obtained from the proposed MML methods.
The dataset consists of house prices from Lucas County, OH, USA, including 25,357 observations of single-family homes sold between 1993 and 1998. A detailed description of the dataset is available through the Spatial Econometrics toolbox for Matlab, through which the dataset can be accessed.
The distribution of these house prices is highly right-skewed, as shown in Figure S1 of the Supplementary Materials. Therefore, we used the natural logarithm of housing prices () as the dependent variable, which is a common practice to reduce skewness in property price data. The median of the log of house price is 11.0898. Additionally, the dataset includes a 25,357 × 25,357 row-normalised weight matrix, . This matrix is notably sparser than the simulated weight matrices used in Section 4. Specifically, the average number of neighbours per unit in this dataset is 2.9528, whereas a simulated weight matrix, based on the Rook neighbourhood structure with an approximately similar number of observations (n = 25,281), has an average of 3.9748 neighbours. The selected independent variables include various characteristics of the houses: different powers of house age (age, , and ), the natural logarithm of the lot size in square feet (), the number of rooms (rooms), the natural logarithm of the total living area in square feet (), the number of bedrooms, and binary indicators for each year from 1993 to 1998 (syear) to represent the year of house sale.
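For concreteness, a row-normalised Rook weight matrix of the kind used in the simulated comparison above can be sketched as follows (a Python sketch with scipy; the function name is ours, and a 159 × 159 grid is chosen so that n = 25,281 as quoted above):

```python
import numpy as np
from scipy import sparse

def rook_weight_matrix(nrow, ncol):
    # Binary Rook-neighbourhood adjacency on an nrow x ncol grid,
    # row-normalised so that each row sums to one.
    n = nrow * ncol
    rows, cols = [], []
    for i in range(nrow):
        for j in range(ncol):
            k = i * ncol + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < nrow and 0 <= nj < ncol:
                    rows.append(k)
                    cols.append(ni * ncol + nj)
    W = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # Row-normalise: divide each row by its number of neighbours.
    inv_deg = sparse.diags(1.0 / np.asarray(W.sum(axis=1)).ravel())
    return inv_deg @ W

W = rook_weight_matrix(159, 159)  # 159 x 159 grid gives n = 25,281
avg_neighbours = W.getnnz() / W.shape[0]
print(round(avg_neighbours, 4))  # 3.9748, matching the figure quoted above
```

Row-normalisation only rescales the stored entries, so the matrix remains exactly as sparse as the underlying binary adjacency structure.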
In this section, we compare the estimates from the OML and MML methods with those from the full maximum likelihood (FML) estimation method. The FML method refers to the ML estimation of H-SEM and H-SAM using the entire dataset, which contains no missing values. The estimates from the FML were used as the ground truth for comparing the OML and MML methods. To better simulate real-world scenarios, we introduced moderate to high levels of missing data in the response variable. Specifically, we created two datasets: one with missing responses and another with missing responses, obtained from the complete dataset. We set the convergence threshold for the
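Missing responses of the kind described above can be introduced along these lines (a sketch; the exact percentages used in the study are omitted in this extraction, so a 50% level is shown for illustration, and the completely-at-random draw below is a special case of the assumed MAR mechanism):

```python
import numpy as np

rng = np.random.default_rng(2024)

def delete_responses(y, pct, rng):
    # Set `pct` percent of the responses to NaN, choosing the missing
    # locations completely at random (a special case of MAR).
    y = np.asarray(y, dtype=float).copy()
    n_miss = int(round(len(y) * pct / 100))
    idx = rng.choice(len(y), size=n_miss, replace=False)
    y[idx] = np.nan
    return y

# Stand-in for the real log-price responses (illustrative values only).
log_price = rng.normal(loc=11.09, scale=0.5, size=25357)
y_50 = delete_responses(log_price, 50, rng)
```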
Table 4 presents the estimates and standard errors for parameters , , and obtained from the FML, OML, and MML methods for the H-SEM and H-SAM, based on the dataset with missing data. The table also reports the computation times for both parameter estimation and the calculation of standard errors. The results are consistent with our findings in Section 4, demonstrating that the MML estimates (for both MML-D and MML-P methods) align more closely with the FML estimates than the OML estimates, which rely only on observed data locations. As anticipated, the time required for both parameter estimation and standard error computation was substantially lower for both models when using the proposed MML-P method compared to MML-D.
Tables S7 and S8 in Section S3 of the Supplementary Materials show the results for the dataset with missing data. The results are consistent with those from the dataset with missing data. In addition, Tables S5–S8 in Section S3 of the Supplementary Materials provide estimates and standard errors for all model parameters including fixed effects () for both the H-SEM and H-SAM, for both and missing data.
We now provide a brief analysis of the trends and patterns in this dataset, based on the estimated parameters presented in Table 4, highlighting how effectively the proposed MML methods capture these patterns compared to the OML method for the H-SEM. The estimated spatial autocorrelation parameter , obtained using FML, is , indicating a very high degree of spatial autocorrelation, which is typical for property price datasets [20,22,23]. To assess the performance of the methods in estimating in the presence of measurement errors and missing values, we compare this FML estimate with those from the OML and MML methods.
The OML method produces an estimate of equal to , suggesting only moderate spatial autocorrelation. In contrast, both MML methods estimate as , closely matching the FML estimate and accurately reflecting the strong spatial autocorrelation present in the data. The corresponding estimate for the H-SAM is likewise closer to the FML estimate under the MML methods than under the OML method, further highlighting the superior ability of the MML methods to capture the underlying spatial structure of the data.
To further assess the accuracy of the OML and MML (MML-P) estimates, accounting for all model parameters including the fixed effects and two variance parameters, we computed the average mean squared errors (MSEs) of the estimated parameters for the H-SEM and H-SAM under both missing data percentages, as presented in Table 5. The estimates obtained from the FML were treated as the true parameter values. The average MSE for the MML-P estimates was calculated using the formula , where is the estimated value of the i-th parameter from the FML, is the corresponding parameter estimate from the MML-P method, and is the total number of parameters, with r the number of fixed-effect parameters (excluding the intercept parameter). The average MSE for the OML estimates was calculated analogously.
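Since the formula itself is omitted in this extraction, the computation can be read as a simple mean of squared deviations from the FML estimates, along these lines (a sketch; the fixed-effect values below are made up for illustration, while the first three entries echo Table 4):

```python
import numpy as np

def average_mse(theta_fml, theta_other):
    # Average MSE across the parameters, treating the FML estimates
    # as the true values: mean of squared deviations.
    theta_fml = np.asarray(theta_fml, dtype=float)
    theta_other = np.asarray(theta_other, dtype=float)
    return float(np.mean((theta_other - theta_fml) ** 2))

# Spatial parameter and two variances from Table 4 (H-SEM), padded with
# two illustrative fixed-effect estimates:
fml = [0.9866, 0.0004, 0.0685, 1.20, -0.35]
mml_p = [0.9936, 0.0001, 0.0837, 1.18, -0.36]
oml = [0.6866, 0.1643, 0.0001, 1.00, 0.00]
```

With these numbers, the MML-P estimates yield a far smaller average MSE than the OML estimates, mirroring the pattern in Table 5.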
As expected, the average MSEs of the parameter estimates obtained from the OML method are significantly higher than those of the MML-P method, regardless of the missing data percentages, for both models. These results highlight the importance of using the proposed MML-P method, particularly in real-world situations where high percentages of missing data are common.
6. Conclusions
This article makes two key contributions. First, we introduced novel hierarchical SAR (H-SAR) models that explicitly account for measurement errors and missing data in the response variable. Second, we proposed marginal maximum likelihood (MML) methods, accompanied by efficient computational approaches, to accurately estimate the parameters of these H-SAR models. Specifically, we developed two computational strategies: the parameterisation method (MML-P) and the direct method (MML-D). Both approaches are particularly advantageous when the spatial weight matrix is sparse, enabling more efficient computation while maintaining accuracy.
The computational complexity analysis in Section 3.4 highlights that the MML-P method is both more efficient and more practical than the MML-D algorithm, a conclusion further supported by the simulation study in Section 4. The advantages of the MML-P over the MML-D method arise in two respects. First, the computational complexity of the MML-P method is significantly lower than that of the MML-D method; as a result, MML-P requires considerably less computation time. Second, the MML-D method demands more computational memory than MML-P: storing the intermediate matrices it requires becomes memory-intensive for datasets with large n, as these matrices are dense even when the spatial weight matrix is sparse. The MML-P method overcomes these limitations by avoiding their explicit computation, re-expressing the corresponding terms within the MML-P algorithm and leveraging sparse matrix computations to reduce both computational complexity and memory usage.
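As one concrete instance of the sparse computations referred to above, the log-determinant term that dominates ML estimation of SAR models [8,9,10] can be evaluated from a sparse LU factorisation without ever forming a dense matrix (a sketch illustrating the general technique, not the MML-P algorithm itself):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def logdet_I_minus_rho_W(W, rho):
    # log|det(I - rho W)| via sparse LU: L has a unit diagonal and the
    # row/column permutations contribute only a sign, so the
    # log-determinant is the sum of log|u_ii|.
    n = W.shape[0]
    A = sparse.identity(n, format="csc") - rho * W.tocsc()
    lu = splu(A)
    return float(np.sum(np.log(np.abs(lu.U.diagonal()))))

# Check against a dense computation on a tiny row-normalised W:
W = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.5],
                                [0.5, 0.0, 0.5, 0.0],
                                [0.0, 0.5, 0.0, 0.5],
                                [0.5, 0.0, 0.5, 0.0]]))
sign, ld = np.linalg.slogdet(np.eye(4) - 0.5 * W.toarray())
```

For a row-normalised W with |rho| < 1, I − rho·W is nonsingular and its determinant is positive, so the absolute values above are harmless; the same factorisation can be reused across likelihood evaluations at different rho via standard sparse solvers.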
Although our methods efficiently handle SAR models with both measurement errors and a substantial amount of missing data, we acknowledge two key limitations. Firstly, our study focused primarily on estimating such models under the missing-at-random (MAR) mechanism; there is a pressing need to extend this work to estimation under the missing-not-at-random (MNAR) mechanism, which often arises in real-world applications. Secondly, we assumed that both the error component of the SAR process () and the measurement error () follow normal distributions with constant variance. This assumption could be relaxed by incorporating heteroscedasticity or by replacing these distributions with heavy-tailed alternatives, such as Student’s t distribution, to better accommodate outliers or non-normal error structures. These extensions would enhance the applicability of the models to more complex real-world datasets, and we leave them for future research.
Conceptualisation, A.W.; methodology, A.W. and T.S.; software, A.W.; writing—original draft, A.W.; writing—review and editing, D.G.; supervision, D.G. and T.S. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in the study are included in the article/
The authors declare no potential or apparent conflicts of interest in this article.
Figure 1. The grid used for constructing the spatial weight matrix based on the Rook neighbourhood in the left panel and the corresponding binary weight matrix in the right panel.
Expressions for mean (
| Terms | SAM | SEM | H-SAM | H-SEM |
|---|---|---|---|---|
|  |  |  |  |  |
|  |  |  |  |  |
|  |  |  |  |  |
Mean parameter estimates (with mean squared errors (MSEs) in parentheses) for
|  |  | H-SEM |  |  | H-SAM |  |  |
|---|---|---|---|---|---|---|---|
|  |  | OML | MML-D | MML-P | OML | MML-D | MML-P |
|  |  | 0.6109 | 0.7949 | 0.7949 | 0.4497 | 0.7997 | 0.7997 |
|  |  | 2.3856 | 1.9745 | 1.9745 | 0.0374 | 2.0059 | 2.0059 |
|  |  | 0.9737 | 1.0350 | 1.0350 | 10.4758 | 0.9995 | 0.9995 |
|  | ct | 3.7383 | 1293.76 | 10.02 | 3.8775 | 1396.39 | 13.97 |
|  |  | 0.2557 | 0.7880 | 0.7880 | 0.2434 | 0.8003 | 0.8003 |
|  |  | 1.0016 | 1.9189 | 1.9189 | 0.0020 | 2.0111 | 2.0111 |
|  |  | 3.1122 | 1.1157 | 1.1157 | 19.0154 | 0.9748 | 0.9748 |
|  | ct | 0.8923 | 671.9590 | 12.3905 | 0.631 | 675.9505 | 16.4584 |
The marginal mean (
| Terms | MML | OML |
|---|---|---|
| H-SEM |  |  |
|  |  |  |
| H-SAM |  |  |
|  |  |  |
Parameter estimate (est), with its standard error (se), and computation time in seconds (ct) for fitting H-SAM and H-SEM using FML, OML, MML-D, and MML-P for the dataset with
| Model |  | FML |  | OML |  | MML-D |  | MML-P |  |
|---|---|---|---|---|---|---|---|---|---|
|  |  | est | se | est | se | est | se | est | se |
| H-SEM |  | 0.9866 | 0.0002 | 0.6866 | 0.0012 | 0.9936 | 0.0006 | 0.9936 | 0.0006 |
|  |  | 0.0004 | 0.0001 | 0.1643 | 0.0030 | 0.0001 | 0.0002 | 0.0001 | 0.0002 |
|  |  | 0.0685 | 0.0007 | 0.0001 | 0.0081 | 0.0837 | 0.0035 | 0.0837 | 0.0035 |
|  | ct | 20.098 |  | 1.39 |  | 2792.409 |  | 38.65 |  |
| H-SAM |  | 0.6727 | 0.0001 | 0.0027 | 0.0311 | 0.6046 | 0.0201 | 0.6046 | 0.0201 |
|  |  | 0.0399 | 0.0008 | 0.0882 | 0.0114 | 0.0682 | 0.0070 | 0.0682 | 0.0070 |
|  |  | 0.0420 | 0.0009 | 0.0882 | 0.0234 | 0.0203 | 0.0080 | 0.0203 | 0.0080 |
|  | ct | 12.245 |  | 1.08 |  | 1196.3 |  | 22.53 |  |
Average mean squared error (MSE) of OML and MML-P estimates (relative to the FML estimates) for different missing data percentages.
|  | H-SEM |  | H-SAM |  |
|---|---|---|---|---|
|  |  |  |  |  |
| OML | 0.7029 | 0.4838 | 1.2573 | 1.1248 |
| MML-P | 0.0153 | 0.2556 | 0.0087 | 0.0172 |
Supplementary Materials
The following supporting information can be downloaded at:
References
1. Tognelli, M.F.; Kelt, D.A. Analysis of determinants of mammalian species richness in South America using spatial autoregressive models. Ecography; 2004; 27, pp. 427-436. Available online: https://www.jstor.org/stable/3683416 (accessed on 2 December 2024). [DOI: https://dx.doi.org/10.1111/j.0906-7590.2004.03732.x]
2. Ver Hoef, J.M.; Peterson, E.E.; Hooten, M.B.; Hanks, E.M.; Fortin, M.J. Spatial autoregressive models for statistical inference from ecological data. Ecol. Monogr.; 2018; 88, pp. 36-59. [DOI: https://dx.doi.org/10.1002/ecm.1283]
3. Angrist, J.D.; Lang, K. Does school integration generate peer effects? Evidence from Boston’s Metco Program. Am. Econ. Rev.; 2004; 94, pp. 1613-1634. [DOI: https://dx.doi.org/10.1257/0002828043052169]
4. Ammermueller, A.; Pischke, J.S. Peer effects in European primary schools: Evidence from the progress in international reading literacy study. J. Labor Econ.; 2009; 27, pp. 315-348. [DOI: https://dx.doi.org/10.1086/603650]
5. Glaeser, E.L.; Sacerdote, B.; Scheinkman, J.A. Crime and social interactions. Q. J. Econ.; 1996; 111, pp. 507-548. [DOI: https://dx.doi.org/10.2307/2946686]
6. Calabrese, R.; Elkink, J.A.; Giudici, P. Measuring bank contagion in Europe using binary spatial regression models. J. Oper. Res. Soc.; 2017; 68, pp. 1503-1511. [DOI: https://dx.doi.org/10.1057/s41274-017-0189-4]
7. Ord, K. Estimation methods for models of spatial interaction. J. Am. Stat. Assoc.; 1975; 70, pp. 120-126. [DOI: https://dx.doi.org/10.1080/01621459.1975.10480272]
8. Pace, R.K.; Barry, R. Sparse spatial autoregressions. Stat. Probab. Lett.; 1997; 33, pp. 291-297. [DOI: https://dx.doi.org/10.1016/S0167-7152(96)00140-X]
9. Pace, R.K. Performing large spatial regressions and autoregressions. Econ. Lett.; 1997; 54, pp. 283-291. [DOI: https://dx.doi.org/10.1016/S0165-1765(97)00026-8]
10. Pace, R.K.; LeSage, J.P. Chebyshev approximation of log-determinants of spatial weight matrices. Comput. Stat. Data Anal.; 2004; 45, pp. 179-196. [DOI: https://dx.doi.org/10.1016/S0167-9473(02)00321-3]
11. Kelejian, H.H.; Prucha, I.R. A Generalized Moments estimator for the autoregressive parameter in a spatial model. Int. Econ. Rev.; 1999; 40, pp. 509-533. [DOI: https://dx.doi.org/10.1111/1468-2354.00027]
12. Kelejian, H.H.; Prucha, I.R. On the asymptotic distribution of the Moran I test statistic with applications. J. Econom.; 2001; 104, pp. 219-257. [DOI: https://dx.doi.org/10.1016/S0304-4076(01)00064-1]
13. Lee, L.F. GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. J. Econom.; 2007; 137, pp. 489-514. [DOI: https://dx.doi.org/10.1016/j.jeconom.2005.10.004]
14. Hepple, L.W. Bayesian analysis of the linear model with spatial dependence. Exploratory and Explanatory Statistical Analysis of Spatial Data; Springer: Dordrecht, The Netherlands, 1979; pp. 179-199. [DOI: https://dx.doi.org/10.1007/978-94-009-9233-7_7]
15. Anselin, L. Spatial Econometrics: Methods and Models; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1988; Volume 4.
16. LeSage, J.P. Bayesian estimation of spatial autoregressive models. Int. Reg. Sci. Rev.; 1997; 20, pp. 113-129. [DOI: https://dx.doi.org/10.1177/016001769702000107]
17. Lee, L.F. Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econom. Rev.; 2003; 22, pp. 307-335. [DOI: https://dx.doi.org/10.1081/ETC-120025891]
18. Wang, W.; Lee, L.F. Estimation of spatial autoregressive models with randomly missing data in the dependent variable. Econom. J.; 2013; 16, pp. 73-102. [DOI: https://dx.doi.org/10.1111/j.1368-423X.2012.00388.x]
19. Benedetti, R.; Suesse, T.; Piersimoni, F. Spatial auto-correlation and auto-regressive models estimation from sample survey data. Biom. J.; 2020; 62, pp. 1494-1507. [DOI: https://dx.doi.org/10.1002/bimj.201800225]
20. LeSage, J.P.; Pace, R.K. Models for spatially dependent missing data. J. Real Estate Financ. Econ.; 2004; 29, pp. 233-254. [DOI: https://dx.doi.org/10.1023/B:REAL.0000035312.82241.e4]
21. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. Methodol.; 1977; 39, pp. 1-22. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1977.tb01600.x]
22. Suesse, T.; Zammit-Mangion, A. Computational aspects of the EM algorithm for spatial econometric models with missing data. J. Stat. Comput. Simul.; 2017; 87, pp. 1767-1786. [DOI: https://dx.doi.org/10.1080/00949655.2017.1286495]
23. Suesse, T. Marginal maximum likelihood estimation of SAR models with missing data. Comput. Stat. Data Anal.; 2018; 120, pp. 98-110. [DOI: https://dx.doi.org/10.1016/j.csda.2017.11.004]
24. Kelejian, H.H.; Prucha, I.R. Spatial models with spatially lagged dependent variables and incomplete data. J. Geogr. Syst.; 2010; 12, pp. 241-257. [DOI: https://dx.doi.org/10.1007/s10109-010-0109-5]
25. Luo, G.; Wu, M.; Xu, L. IPW-based robust estimation of the SAR model with missing data. Stat. Probab. Lett.; 2021; 172, 109065. [DOI: https://dx.doi.org/10.1016/j.spl.2021.109065]
26. Burden, S.; Cressie, N.; Steel, D.G. The SAR model for very large datasets: A reduced rank approach. Econometrics; 2015; 3, pp. 317-338. [DOI: https://dx.doi.org/10.3390/econometrics3020317]
27. Suesse, T. Estimation of spatial autoregressive models with measurement error for large data sets. Comput. Stat.; 2018; 33, pp. 1627-1648. [DOI: https://dx.doi.org/10.1007/s00180-017-0774-7]
28. Bivand, R.; Gómez-Rubio, V.; Rue, H. Spatial data analysis with R-INLA with some extensions. J. Stat. Softw.; 2015; 63, pp. 1-31. [DOI: https://dx.doi.org/10.18637/jss.v063.i20]
29. Gómez-Rubio, V.; Bivand, R.S.; Rue, H. Estimating spatial econometrics models with Integrated Nested Laplace Approximation. Mathematics; 2021; 9, 2044. [DOI: https://dx.doi.org/10.3390/math9172044]
30. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using Integrated Nested Laplace Approximations. J. R. Stat. Soc. Ser. Stat. Methodol.; 2009; 71, pp. 319-392. [DOI: https://dx.doi.org/10.1111/j.1467-9868.2008.00700.x]
31. Bates, D.; Maechler, M. R Package Matrix: Sparse and Dense Matrix Classes and Methods, version 1.4-0. 2021; Available online: https://rdrr.io/rforge/Matrix/ (accessed on 2 December 2024).
32. LeSage, J.; Pace, R.K. Introduction to Spatial Econometrics; Chapman and Hall; CRC: Boca Raton, FL, USA, 2009.
33. Li, H.; Calder, C.A.; Cressie, N. One-step estimation of spatial dependence parameters: Properties and extensions of the APLE statistic. J. Multivar. Anal.; 2012; 105, pp. 68-84. [DOI: https://dx.doi.org/10.1016/j.jmva.2011.08.006]
34. Arab, A.; Hooten, M.B.; Wikle, C.K. Hierarchical Spatial Models. Encycl. GIS; 2008; 14, pp. 425-431. [DOI: https://dx.doi.org/10.1007/978-0-387-35973-1_564]
35. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793.
36. Petersen, K.B.; Pedersen, M.S. The Matrix Cookbook; Technical University of Denmark: Kongens Lyngby, Denmark, 2008; Volume 7, 510.
37. Ding, J.; Zhou, A. Eigenvalues of rank-one updated matrices with some applications. Appl. Math. Lett.; 2007; 20, pp. 1223-1226. [DOI: https://dx.doi.org/10.1016/j.aml.2006.11.016]
38. Harville, D.A. Matrix Algebra from a Statistician’s Perspective; Springer: New York, NY, USA, 1997.
39. Moura, A.C.M.; Fonseca, B.M. ESDA (Exploratory Spatial Data Analysis) of vegetation cover in urban areas—Recognition of vulnerabilities for the management of resources in urban green infrastructure. Sustainability; 2020; 12, 1933. [DOI: https://dx.doi.org/10.3390/su12051933]
40. George, A.; Liu, J.W. Computer Solution of Large Sparse Positive Definite Systems; Prentice Hall: Hoboken, NJ, USA, 1981.
41. Rue, H.; Held, L. Gaussian Markov Random Fields: Theory and Applications; Chapman and Hall CRC: Boca Raton, FL, USA, 2005.
42. Davis, T.A.; Hager, W.W. Multiple-Rank Modifications of a Sparse Cholesky Factorization. SIAM J. Matrix Anal. Appl.; 2001; 22, pp. 997-1013. [DOI: https://dx.doi.org/10.1137/S0895479899357346]
43. Bivand, R.; Nowosad, J.; Lovelace, R. R Package spData: Datasets for Spatial Analysis, version 2.2.2. 2023; Available online: https://cran.r-universe.dev/spData (accessed on 2 December 2024).
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Efficient estimation methods for simultaneous autoregressive (SAR) models with missing data in the response variable have been well explored in the literature. A common practice is introducing measurement error into SAR models to separate the noise component from the spatial process. However, prior studies have not considered incorporating measurement error into SAR models with missing data. Maximum likelihood estimation for such models, especially with large datasets, poses significant computational challenges. This paper proposes an efficient likelihood-based estimation method, the marginal maximum likelihood (MML), for estimating SAR models on large datasets with measurement errors and a high percentage of missing data in the response variable. The spatial autoregressive model (SAM) and the spatial error model (SEM), two popular SAR model types, are considered. The missing data mechanism is assumed to follow a missing-at-random (MAR) pattern. We propose a fast method for marginal ML estimation with a computational complexity of