Full text

Turn on search term navigation

1. Introduction

Spatial regression models, or simultaneous autoregressive (SAR) models, extend traditional linear regression by accounting for spatial dependencies using a spatial weight matrix, $W$ (see Section 2.1). They have been widely applied in fields such as ecology [1,2], social sciences [3,4], criminology [5], and finance [6]. The two widely used SAR-type models are the spatial error model (SEM), where the spatial dependence is in the error terms, and the spatial autoregressive model (SAM), where the spatial dependence is in the dependent variable. See Section 2.1 for further details.

The literature provides well-established estimation methods for SAR models with no missing values in the response variables. Ref. [7] introduced an efficient maximum likelihood (ML) method. When dealing with sparse $W$ , efficient sparse Cholesky factorisation algorithms are employed in ML estimation [8,9,10]. Other methods include the method of moments (MOM) [11,12,13], Bayesian approaches [14,15,16], and instrumental variable (IV) methods [17].

However, SAR models often face two significant challenges: measurement errors and missing data. Measurement errors arise when the observed data differ from the true values due to inaccuracies in collection or reporting, impacting both the dependent and explanatory variables. This can lead to biased estimates and unreliable conclusions. On the other hand, missing data occur when certain observations are unavailable or incomplete, disrupting the spatial structure of the dataset and leading to biased results [18,19]. Addressing these issues is essential, as ignoring them can distort the analysis of spatial dependencies and relationships.

In the context of modelling house prices, measurement errors and missing data are common challenges, particularly in the response variable, house price. Measurement errors can occur when house prices vary across different data sources, such as real estate listings, tax assessments, or sales records. Missing data, in this case, refer to the lack of price information on unsold houses, which is also a significant issue as only a small percentage of properties are sold on the market each year [20]. The combination of these issues can complicate the accurate estimation of spatial relationships, emphasising the need for more complex SAR models that can simultaneously address both measurement errors and missing data. In the current literature, the estimation of SAR models with measurement errors and missing data has been addressed separately. Various methods have been proposed to handle these two challenges, but both issues have rarely been tackled simultaneously in a unified framework. This study aims to fill this gap by jointly addressing measurement errors and missing data in the response variable of SAR models.

Various estimation methods have been proposed for the estimation of SAR models with missing values in the response variable. Ref. [20] proposed an iterative algorithm similar to the expectation-maximization (EM) algorithm [21], using approximations to avoid computationally demanding matrix inversions. Ref. [22] refined this approach and proposed a valid EM algorithm by incorporating exact terms. Alternatively, ref. [23] developed a method that directly maximises the marginal log-likelihood of observed data, which is computationally faster and avoids the EM algorithm’s convergence issues. Additionally, ref. [24] introduced an IV estimator for the SAM with missing responses, and ref. [25] proposed an inverse probability weighting (IPW)-based robust estimator for the same model.

On the other hand, the issue of measurement errors in the response variable of SAR models has been addressed by ref. [26,27], who have proposed efficient ML estimation techniques. Additionally, ref. [28,29] applied the integrated nested Laplace approximation (INLA) [30] to estimate such models. However, these studies assumed that there were no missing values in the response variable.

Our article makes two contributions. First, we introduce hierarchical SAR (H-SAR) models that account for measurement errors and missing data in the response variable. Second, we introduce a novel marginal ML method along with two alternative computational approaches that significantly reduce its computational complexity. Our most efficient approach, called the parameterisation approach, reduces the complexity of the proposed marginal ML method from $O (n^{3})$ to $O (n^{3 / 2})$ , where n is the total number of observations. The marginal ML estimation of H-SAR models involves computationally intensive operations, such as matrix inversions and determinant calculations, which typically have computational complexities of $O (n^{3})$ . We reduce this complexity by leveraging the specialised sparse matrix operations provided by the R Matrix package (version 1.4-0) [31]. Additionally, we rewrite certain terms in the log-likelihood computation to avoid some of the computationally demanding steps; see Section 3.4 for further details. We illustrate the H-SAR models and their marginal ML method empirically using simulated and real data.

The remainder of this paper is organised as follows. Section 2 provides an overview of standard SAR models and introduces hierarchical SAR models. Section 3 presents the marginal ML method. Efficient computational strategies to reduce the complexity of the proposed algorithm are also discussed. In Section 4, we evaluate the performance of the estimation method using simulated datasets. Section 5 discusses the real data application. Section 6 concludes. The paper has Supplementary Materials containing some further technical and empirical results.

2. Models

In this section, we begin by presenting the two most commonly used standard simultaneous autoregressive (SAR) models in the literature. We then extend these models to account for measurement errors, leading to the development of the hierarchical SAR models employed in this study.

2.1. Standard Simultaneous Autoregressive Models

Before presenting the SAR models, we first define the variables and parameters. Let $z = {(z_{1}, z_{2}, \dots, z_{n})}^{⊤}$ denote the vector of response variables observed at n spatial locations $s_{1}, s_{2}, \dots, s_{n}$ . Each element $z_{i}$ represents the value of the response variable at the location $s_{i}$ . The matrix $X$ is an $n \times (r + 1)$ matrix of covariates, where r is the number of covariates. The first column of $X$ typically consists of ones to include the intercept $β_{0}$ , while the remaining columns represent the values of the r covariates at each spatial location. The vector $β = {(β_{0}, β_{1}, \dots, β_{r})}^{⊤}$ contains the fixed effects parameters or regression coefficients associated with the covariates in $X$ .

The matrix $W$ is an $n \times n$ spatial weight matrix that encodes the spatial structure and relationship between the locations. The element $w_{i j}$ indicates the spatial influence or connectivity between locations $s_{i}$ and $s_{j}$ , where typically $w_{i i} = 0$ (indicating no self-influence), and $W$ is often row-normalised so that the sum of each row equals 1. Several strategies for constructing $W$ have been proposed in the literature (see [7,15] for further details). We assume that $W$ is sparse, as commonly observed in numerous real-world scenarios. However, this may not necessarily be the case.

The parameter $ρ$ is the spatial autocorrelation parameter, capturing the strength of the spatial dependence. When $ρ = 0$ , it indicates no spatial dependence, while positive values of $ρ$ suggest positive spatial autocorrelation (neighbouring locations tend to have similar values), and negative values indicate negative spatial autocorrelation.

The error term $e = {(e_{1}, e_{2}, \dots, e_{n})}^{⊤}$ is a vector of residuals for each spatial location, accounting for the unexplained variability in the response variable. We assume that $e$ follows a multivariate normal distribution with a mean vector $0$ and a covariance matrix $σ_{e}^{2} I_{n}$ , where $σ_{e}^{2}$ is the variance of the error term, and $I_{n}$ is the $n \times n$ identity matrix, implying homoscedasticity and no correlation between errors across locations.

Given these definitions, we consider two SAR models. The first is the spatial autoregressive model (SAM), defined as

(1) $z = {(I_{n} - ρ W)}^{- 1} (X β + e) .$

The second model is the spatial error model (SEM), expressed as

(2) $z = X β + {(I_{n} - ρ W)}^{- 1} e;$

see [15,32] for a more detailed explanation of SAM and SEM.

For the standard SAM and SEM, when the error vector $e$ is normally distributed, the response variable $z$ is also normally distributed with the mean vector $μ$ and the covariance matrix $Σ$ given in Table 1. To ensure a valid covariance matrix in SAM and SEM, it is crucial that the correlation parameter $ρ$ does not take on any of the values $\frac{1}{λ_{(1)}}, \frac{1}{λ_{(2)}}, \dots, \frac{1}{λ_{(n)}}$ , where $λ_{(1)}, λ_{(2)}, \dots, λ_{(n)}$ are the eigenvalues of $W$ sorted in ascending order [33]. When $W$ is normalised either by row or column, with each row or column summing to 1, the range of $ρ$ is constrained to $\frac{1}{λ_{(1)}} < ρ < 1$ [32].

2.2. Hierarchical Simultaneous Autoregressive Models

In many practical applications, the spatial autoregressive processes, $z$ , shown in Equations (1) and (2), are measured imperfectly, introducing measurement errors into the model. Given the observed data $y = {(y_{1}, y_{2}, \dots, y_{n})}^{⊤}$ at spatial locations $s_{1}, \dots, s_{n}$ , the observed value $y_{i}$ at the spatial location $s_{i}$ can be expressed as

(3) $y_{i} = z_{i} + ϵ_{i} i = 1, 2, \dots, n,$

where

z_{i}

represents the latent spatial autoregressive processes defined in Equations (1) and (2), and

ϵ_{i}

is the additive measurement error at the spatial location

s_{i}

. We assume that

ϵ_{i}

follows a normal distribution with a mean of 0 and a variance

σ_{ϵ}^{2}

. Additionally, we assume that

z_{i}

and

ϵ_{i}

are independent of each other. In vector notation, this leads to

y = z + ϵ

In spatial statistics, the model given in Equation (3) is often called the data model. This model characterises the distribution of the observed data, $y_{i}$ , given the underlying latent spatial process, $z_{i}$ , for $i = 1, \dots, n$ . The models describing the underlying latent spatial processes, such as the spatial autoregressive process in Equation (1) and the spatial error process in Equation (2), are called the process models. A joint spatial statistical model defined through a data model and a process model is called a hierarchical spatial model [34]. In this paper, we refer to SAR models that incorporate measurement errors as hierarchical simultaneous autoregressive (H-SAR) models. Specifically, we define the spatial autoregressive model with measurement errors as a hierarchical spatial autoregressive model (H-SAM) and the spatial error model with measurement errors as a hierarchical spatial error model (H-SEM). The H-SAM is derived by substituting the latent process $z$ in Equation (3) with the hidden spatial autoregressive process described in Equation (1). Similarly, the H-SEM is obtained by replacing the latent process $z$ in Equation (3) with the latent spatial error process specified in Equation (2).

We now derive the marginal density of $y$ for H-SAR models. Let $ϕ = {(β^{⊤}, ρ, ω, θ)}^{⊤}$ be the vector of model parameters, where $ω = σ_{ϵ}^{2}$ and $θ = σ_{e}^{2} / σ_{ϵ}^{2}$ . The joint distribution of $y$ and $z$ is

(4) $f (y, z ∣ ϕ) = f (y ∣ z, ϕ) f (z ∣ ϕ),$

where

f (y ∣ z, ϕ) = \prod_{i = 1}^{n} f (y_{i} ∣ z_{i}, ϕ)

, with

f (y_{i} ∣ z_{i}, ϕ)

, is a normal distribution having mean

z_{i}

and variance

σ_{ϵ}^{2}

. In addition,

f (z ∣ ϕ)

represents the density of the standard SAR processes outlined in Equations (1) and (2) for SAM and SEM, respectively. For both cases, this density follows a multivariate normal distribution. Then, the marginal density of

y

is obtained by integrating out

z

from the joint distribution of

y

and

z

in Equation (4), which is given by

(5) $f (y ∣ ϕ) = \int f (y ∣ z, ϕ) f (z ∣ ϕ) d z .$

Since both the measurement error $ϵ$ and the latent spatial process $z$ follow multivariate normal distributions in the H-SAM and H-SEM, the marginal density of $y$ also follows a multivariate normal distribution. The mean vector $μ$ and covariance matrix $Σ$ of the distribution of $y$ for both models are presented in Table 1. The constraints on the correlation parameter $ρ$ in H-SAM and H-SEM, ensuring a valid covariance matrix $Σ$ , align with those for standard SEM and SAM, as elucidated by [27]. The log-likelihood function of $y$ for H-SAR models in terms of the model parameters $ϕ = {(β^{⊤}, ρ, ω, θ)}^{⊤}$ is

(6) $\log f (y; ω, θ, ρ, β) = - \frac{n}{2} \log (2 π) - \frac{n}{2} \log (ω) - \frac{1}{2} \log | V | - \frac{1}{2 ω} r^{⊤} V^{- 1} r,$

where

r = y - μ

is the vector of the residuals. The expressions for the

V

and

μ

are given in Table 1.

3. H-SAR Models with Missing Data and Their Marginal Maximum Likelihood Methods

Section 3.1 explores H-SAR models with missing data in the response variable. Section 3.2 introduces the marginal maximum likelihood (MML) method, while Section 3.3 proposes two alternative computational approaches for implementing the MML algorithm. This is followed by an analysis of the computational complexities of the proposed MML algorithms in Section 3.4.

3.1. Hierarchical Simultaneous Autoregressive Models with Randomly Missing Responses

Let $y_{o}$ represent the subset of $y$ containing $n_{o}$ observed response variables, and let $y_{u}$ represent the subset of $y$ containing $n_{u}$ missing response variables. The complete-data vector is denoted by $y = {(y_{o}^{⊤}, y_{u}^{⊤})}^{⊤}$ . The matrices $X$ and $W$ are divided into distinct parts as follows:

(7) $X = (\begin{matrix} X_{o} \\ X_{u} \end{matrix}), W = (\begin{matrix} W_{o o} & W_{o u} \\ W_{u o} & W_{u u} \end{matrix}),$

where

X_{o}

and

X_{u}

are the corresponding matrices of the covariates for the observed and unobserved response variables, respectively, and

W_{o o}

W_{o u}

W_{u o}

, and

W_{u u}

represent the submatrices of

W

In this study, we assumed that the missing responses, $y_{u}$ , are missing at random (MAR). The MML method under the MAR mechanism involves maximising the marginal likelihood of the observed data, $y_{o}$ ; see [35] for further details. To compute the marginal likelihood of $y_{o}$ , the unobserved data $y_{u}$ must be integrated out from the complete-data density of $y$ , $f (y_{o} ∣ ϕ) = \int f (y ∣ ϕ) d y_{u}$ , where $f (y ∣ ϕ)$ is the complete data density of $y$ . In the following subsection, we derive the marginal distribution of $y_{o}$ using the properties of the multivariate normal distribution and present the MML estimators.

3.2. Marginal Maximum Likelihood Estimation Method

This section discusses the proposed MML estimation method. The vector $μ$ and matrix $V$ for the H-SAM and H-SEM, as presented in Table 1, are partitioned as

(8) $μ = (\begin{matrix} μ_{o} \\ μ_{u} \end{matrix}), V = (\begin{matrix} V_{o o} & V_{o u} \\ V_{u o} & V_{u u} \end{matrix}) .$

Since $y$ is a multivariate normal random variable, $y_{o}$ also follows a multivariate normal distribution with mean vector $μ_{o}$ and covariance matrix $Σ_{o o} = ω V_{o o}$ [36]; see Equation (8). To compute the log-likelihood of the marginal distribution of $y_{o}$ , replace $V$ with $V_{o o}$ , $μ$ with $μ_{o}$ , $y$ with $y_{o}$ , and n with $n_{o}$ in Equation (6). This yields the following expression for the marginal log-likelihood of $y_{o}$ :

(9) $\log f (y_{o}; ω, θ, ρ, β) = - \frac{n_{o}}{2} \log (2 π) - \frac{n_{o}}{2} \log (ω) - \frac{1}{2} \log | V_{o o} | - \frac{1}{2 ω} r_{o}^{⊤} V_{o o}^{- 1} r_{o},$

where

r_{o} = y_{o} - μ_{o}

. Maximising Equation (9) with respect to

β

and

ω

(taking partial derivatives and setting to zero) while holding

θ

and

ρ

fixed, we obtain the closed-form ML estimates for

β

and

ω

as follows:

\hat{β} (ρ, θ) = {({\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} {\tilde{X}}_{o})}^{- 1} {\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} y_{o} and \hat{ω} (ρ, θ) = \frac{r_{o}^{⊤} V_{o o}^{- 1} r_{o}}{n_{o}}

, where

{\tilde{X}}_{o} = X_{o}

for H-SEM, and

{\tilde{X}}_{o} = {(A^{- 1} X)}_{o}

for H-SAM. By substituting

\hat{β} (ρ, θ)

and

\hat{ω} (ρ, θ)

in the log-likelihood in Equation (9), the concentrated marginal log-likelihood

L_{c}

has the form

(10) $L_{c} (θ, ρ) = c - \frac{n_{o}}{2} \log (\hat{ω} (ρ, θ)) - \frac{1}{2} \log | V_{o o} |,$

where

c = - \frac{n_{o}}{2} \log (2 π) - \frac{n_{o}}{2}

is a constant. To obtain the ML estimates for

θ

and

ρ

, we maximise the concentrated marginal log-likelihood in Equation (10) using optim() function in R.

3.3. Two Alternative Computational Approaches

The numerical optimisation of the concentrated marginal log-likelihood $L_{c}$ , defined in Equation (10), requires the repeated evaluation of $L_{c}$ for different values of $ρ$ and $θ$ . Therefore, the efficient computation of $L_{c}$ is critical to reduce the overall computational cost of the proposed MML algorithm. We assume that the spatial weight matrix, $W$ , is sparse, a characteristic commonly encountered in many practical applications. To exploit this sparsity for efficient evaluation of $L_{c}$ , we propose two computational approaches. These methods were specifically designed to maximise the computational efficiency when applied to sparse $W$ , making them particularly well suited for such scenarios.

Efficient computation of matrix $V_{o o}$ plays a key role in evaluating $L_{c}$ . Our first approach, referred to as the direct approach, involves directly extracting the submatrix $V_{o o}$ from the larger matrix $V$ given in Table 1 as

(11) $V_{o o} = {[I_{n} + θ {(A^{⊤} A)}^{- 1}]}_{o o} .$

After computing $V_{o o}$ , the logarithms of its determinant and its inverse are calculated; subsequently, $L_{c}$ is evaluated.

In the direct computational approach, the matrices $V$ , and $V_{o o}$ are explicitly calculated. However, when evaluating $L_{c}$ in Equation (10) and the closed-form ML estimators of $β$ and $ω$ , only certain terms involving $V_{o o}$ are required. Specifically, we need to compute the following terms: (i) $r_{o}^{⊤} V_{o o}^{- 1} r_{o}$ , (ii) ${\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} {\tilde{X}}_{o}$ , (iii) ${\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} y_{o}$ , and (iv) $\log | V_{o o} |$ . Computing these terms by first explicitly computing $V_{o o}$ can be computationally challenging, as $V_{o o}$ is a dense matrix. In our second computational approach, we reformulate these terms, along with $V_{o o}$ , to enable more efficient computation using sparse matrix operations.

In the second approach, referred to as the parameterisation approach, the submatrix $V_{o o}$ is first calculated using an additional sparse matrix, $B_{o}$ , from $V$ given in Table 1. This approach leverages the sparsity of $W$ and $B_{o}$ to simplify the calculation of $V_{o o}$ . We define sparse matrix $B_{o}$ as $B_{o} = [I_{o} | 0]$ , where $I_{o}$ is the $n_{o} \times n_{o}$ identity matrix, and $0$ is the $n_{o} \times n_{u}$ zero matrix. Now, $V_{o o} = B_{o} V B_{o}^{⊤}$ . Then, after some further simplifications, $V_{o o}$ can be reformulated as

(12) $\begin{matrix} V_{o o} & = B_{o} V B_{o}^{⊤} \\ = B_{o} (I_{n} + θ {(A^{⊤} A)}^{- 1}) B_{o}^{⊤} \\ = I_{o} + θ B_{o} {(A^{⊤} A)}^{- 1} B_{o}^{⊤} . \end{matrix}$

First, the log determinant ( $log | V_{o o} |$ ) is computed. Using the matrix determinant lemma [37], the logarithm of the determinant of $V_{o o}$ in Equation (12) can be obtained as

(13) $\log | V_{o o} | = \log | A^{⊤} A + θ B_{o}^{⊤} B_{o} | - \log | A^{⊤} A |,$

and the proof is given in Section S4.2 of the Supplementary Materials.

The determinant of $A^{⊤} A$ can be efficiently computed by obtaining the Cholesky factors using the Cholesky() function from the Matrix package (version 1.4-0) [31]. The determinant of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ can be computed using its Cholesky factors. Section 3.4 details how the Cholesky factor of $A^{⊤} A$ is utilised to compute the Cholesky factors of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ and subsequently how these factors are employed to efficiently calculate $log | V_{o o} |$ .

By applying the Woodbury matrix identity [38], p. 427, to the matrix $V_{o o}$ in Equation (12), we obtain $V_{o o}^{- 1}$ as follows:

(14) $V_{o o}^{- 1} = I_{o} - θ B_{o} {(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1} B_{o}^{⊤} .$

By substituting $V_{o o}^{- 1}$ in Equation (14), the required terms in $L_{c}$ are given by

(15) $r_{o}^{⊤} V_{o o}^{- 1} r_{o} = r_{o}^{⊤} r_{o} - θ r_{o}^{⊤} B_{o} {(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1} B_{o}^{⊤} r_{o},$

(16) ${\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} {\tilde{X}}_{o} = {\tilde{X}}_{o}^{⊤} {\tilde{X}}_{o} - θ {\tilde{X}}_{o}^{⊤} B_{o} {(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1} B_{o}^{⊤} {\tilde{X}}_{o},$

(17) ${\tilde{X}}_{o}^{⊤} V_{o o}^{- 1} y_{o} = {\tilde{X}}_{o}^{⊤} y_{o} - θ {\tilde{X}}_{o}^{⊤} B_{o} {(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1} B_{o}^{⊤} y_{o} .$

The matrix inverse ${(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1}$ , which appears in Equations (15)–(17), does not need to be computed explicitly. Instead, we solve the linear systems within these equations to compute the terms without directly performing the inversion. For further details, see Section 3.4.

In the subsequent sections, we refer to the marginal ML methods as MML methods. The MML algorithm that employs the parameterisation computational approach is denoted as the MML-P method, whereas the implementation that utilises the direct computational approach is referred to as the MML-D method.

3.4. Computational Aspects of Marginal ML Methods

Deriving the computational complexity of the proposed MML methods in the general case is challenging, as it heavily depends on the sparsity of the weight matrix $W$ , which is determined by the number of neighbours of each unit in the dataset. Therefore, we provide theoretical computational complexities for the proposed methods under the assumption of a specific structure for $W$ .

We now provide a brief overview of the grid and neighbourhood structure used to construct $W$ . This structure was employed in both the derivation of computational complexity and the simulation studies presented in Section 4. We utilise a $\sqrt{n} \times \sqrt{n}$ regular grid, where n represents the total number of units in the grid. For instance, consider a regular grid of size $\sqrt{3} \times \sqrt{3}$ , as shown in Figure 1, left panel. We define spatial units at the vertices, resulting in 9 units. The edges define the neighbours associated with each unit. For instance, unit 5 has four neighbours: units 2, 4, 6, and 8. This type of neighbouring structure is known as the Rook neighbourhood in the literature [39]. It is clear that any given unit can have between two and four neighbours in the Rook neighbourhood, depending on its position on the grid. Neighbourhood structures that have a fixed maximum number of neighbours for a given unit, regardless of the grid size (in this case, four), are referred to as local neighbourhoods.

Once the neighbours are defined on the grid, the weight matrix $W$ is constructed. The binary $W$ , based on the $3 \times 3$ regular grid, is shown in the right panel in Figure 1. In $W$ , a value of 1 in the $j^{th}$ column of the $i^{th}$ row indicates that unit j is a neighbour of unit i; otherwise, the entry is 0.

We know that $A = I_{n} - ρ W$ . A key computational step in implementing the MML-D and MML-P methods is calculating the Cholesky factor of matrix $A^{⊤} A$ . According to the results reproted by [40] on nested dissection ordering algorithms, ref. [41] demonstrated that the Cholesky factorisation of the precision matrix of a Gaussian Markov random field (GMRF) defined on a local neighbourhood of a regular $\sqrt{n} \times \sqrt{n}$ grid has a computational complexity of $O (n^{3 / 2})$ . Since $A^{⊤} A$ represents a GMRF based on a local neighbourhood, its Cholesky factorisation also exhibits a complexity of $O (n^{3 / 2})$ .

We now discuss the computational complexity of the MML-D method. Let $L_{A^{⊤} A}$ denote the Cholesky factor of $A^{⊤} A$ . To obtain the inverse of $A^{⊤} A$ efficiently while avoiding numerical issues, we solve the following system of equations:

(18) $(L_{A^{⊤} A} \cdot L_{A^{⊤} A}^{⊤}) U = I_{n},$

where

U

represents the matrix to be solved, which is ultimately equal to

{(A^{⊤} A)}^{- 1}

. The function solve() from the Matrix package was employed to efficiently perform forward and backward substitutions to solve this system. Solving this problem is particularly challenging due to its high dimensionality. The matrix

U

has n columns, meaning that computing the entire matrix

U

requires solving n individual systems.

The exact theoretical complexity of solving the $n \times n$ system presented in Equation (18) is difficult to derive analytically. To address this, we conducted a simulation study to estimate the empirical computational complexity; see Section S1 of the Supplementary Materials. The results from this study suggest that the complexity of performing forward and backward substitution is approximately $O (n^{2})$ .

Once ${(A^{⊤} A)}^{- 1}$ is computed, the MML-D method proceeds with the calculation of $V = I_{n} + θ {(A^{⊤} A)}^{- 1}$ . From $V$ , we extract the sub-matrix $V_{o o}$ . Given that ${(A^{⊤} A)}^{- 1}$ is a dense matrix, $V_{o o}$ is also dense. In the subsequent steps, the MML-D method explicitly computes both the inverse of $V_{o o}$ and its determinant; each has a complexity of $O (n_{0}^{3})$ , as $V_{o o}$ remains dense.

The overall computational complexity of the MML-D method is determined by three main components: the Cholesky factorisation of $A^{⊤} A$ , which has a complexity of $O (n^{3 / 2})$ ; solving the system of equations in Equation (18), with a complexity of $O (n^{2})$ ; and computing both the inverse and determinant of the submatrix $V_{o o}$ , which has a complexity of $O (n_{o}^{3})$ . Consequently, the overall computational complexity of the MML-D method is given by $max (O (n^{3 / 2}), O (n^{2}), O (n_{o}^{3}))$ , which simplifies to $max (O (n^{2}), O (n_{o}^{3}))$ .

We now analyse the computational complexity of MML-P. The computational complexity of the MML-P method mainly depends on the computation of the Cholesky factor $L_{A^{⊤} A}$ of $A^{⊤} A$ . This is because, once $L_{A^{⊤} A}$ is computed, the subsequent steps are straightforward. These include computing the Cholesky factor of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ , solving the systems in Equations (15) to (17), and calculating the logarithm of the determinant of $V_{o o}$ using Equation (13).

Let $L_{A^{⊤} A + θ B_{o}^{⊤} B_{o}}$ denote the Cholesky factor of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ . Using the updown() function from the Matrix package, this factor can be efficiently computed from the Cholesky factor $L_{A^{⊤} A}$ via the sparse Cholesky update [42], requiring significantly less computational effort compared to directly calculating the Cholesky factors of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ .

To compute the terms in Equations (15)–(17), we solve the systems within these equations without directly calculating the matrix inverse ${(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1}$ . By using the precomputed Cholesky factors of $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ , we perform forward and backward substitution to compute the required terms. For example, to compute the term ${(A^{⊤} A + θ B_{o}^{⊤} B_{o})}^{- 1} B_{o}^{⊤} r_{o}$ in Equation (15), we solve the system $(A^{⊤} A + θ B_{o}^{⊤} B_{o}) u = B_{o}^{⊤} r_{o}$ for $u$ (where $B_{o}^{⊤} r_{o}$ is a vector) using forward and backward substitution. Given the precomputed $L_{A^{⊤} A + θ B_{o}^{⊤} B_{o}}$ , the solve() function from the Matrix package is used to perform these operations efficiently.

However, in contrast to the system that needs to be solved in Equation (18) within the MML-D method, the systems in the MML-P method are computationally more efficient. Specifically, the dimensions of the systems in Equations (15)–(17) are $n \times 1$ , $n \times (r + 1)$ , and $n \times 1$ , respectively, where typically $r + 1 ≪ n$ . In contrast, the system to be solved in the MML-D method has a much larger dimensionality of $n \times n$ , making it computationally more intensive.

The computation of the logarithm of the determinant of $V_{o o}$ is significantly faster when the Cholesky factors of $A^{⊤} A$ and $A^{⊤} A + θ B_{o}^{⊤} B_{o}$ are available. We now rewrite $\log | V_{o o} |$ from Equation (13) as follows:

(19) $\log | V_{o o} | = 2 \log | L_{A^{⊤} A + θ B_{o}^{⊤} B_{o}} | - 2 \log | L_{A^{⊤} A} | .$

Since both $L_{A^{⊤} A + θ B_{o}^{⊤} B_{o}}$ and $L_{A^{⊤} A}$ are lower triangular matrices, their logarithmic determinant can be efficiently computed by summing the logarithms of their diagonal elements, as follows:

(20) $\log | V_{o o} | = 2 \sum_{i = 1}^{n} \log ({(l_{A^{⊤} A + θ B_{o}^{⊤} B_{o}})}_{i}) - 2 \sum_{i = 1}^{n} \log ({(l_{A^{⊤} A})}_{i}),$

where

{(l_{A^{⊤} A + θ B_{o}^{⊤} B_{o}})}_{i}

and

{(l_{A^{⊤} A})}_{i}

denote the

i^{th}

diagonal elements of

L_{A^{⊤} A + θ B_{o}^{⊤} B_{o}}

and

L_{A^{⊤} A}

, respectively.

For the MML-P method, the most computationally intensive operation is the calculation of the Cholesky factors of $A^{⊤} A$ , which has a complexity of $O (n^{3 / 2})$ . Consequently, the overall computational complexity of the MML-P method is $O (n^{3 / 2})$ .

We highly recommend using the MML-P method due to not only its lower computational complexity compared to the MML-D method but also its additional practical advantages. For instance, although $A^{⊤} A$ is a sparse matrix, its inverse is a dense matrix. As n increases, the size of ${(A^{⊤} A)}^{- 1}$ grows, making it impractical to store due to its large, dense matrix structure. However, to implement MML-D, ${(A^{⊤} A)}^{- 1}$ should be implicitly calculated first. In our simulation studies, attempts to store ${(A^{⊤} A)}^{- 1}$ on the National Institute for Applied Statistics Research Australia High Performance Computer cluster (https://hpc.niasra.uow.edu.au/ (accessed on 2 December 2024)) failed when n exceeded approximately 32,500. By contrast, the parameterisation approach enabled us to compute the required terms even for larger values of n = 1,000,000. Furthermore, as demonstrated in our simulation study in Section 4 and supported by our real-world analysis in Section 5, both MML-D and MML-P produce identical results, as expected. However, fitting H-SAR models including both parameter estimation and standard error calculation is substantially faster with MML-P.

4. Simulation Study

This section discusses the accuracy of the MML methods for different missing data percentages using simulated data. Additionally, we compare the computation times of the MML-P and MML-D methods.

In practice, we often encounter situations where the number of unobserved or missing units ( $n_{u}$ ) is significantly larger than the number of observed units ( $n_{o}$ ). For instance, in the property market, house prices can be modeled using H-SAR models with missing data. Here, the prices of sold houses represent the observed responses, while the prices of unsold houses are treated as missing. Since only a small percentage of properties are sold each year, or even over a decade, $n_{o}$ is relatively small compared to the total number of properties (n), making $n_{u}$ much larger in comparison. Nevertheless, information on explanatory variables, such as house characteristics and location data for all properties, both sold and unsold, is typically available from assessors [20]. Therefore, in this simulation study, we considered a moderate ( $50 %$ ) to high ( $90 %$ ) percentage of missing data in the response variable.

When estimating SAR models with missing response data, researchers often overlook the missingness issue. In such cases, the SAR model employs a spatial weight matrix based only on the location data of the observed units. The likelihood constructed in this manner is referred to as the observed likelihood, and the estimates obtained from maximising the observed log-likelihood are termed the observed maximum likelihood (OML) estimates in the following sections of this paper. However, the observed likelihood does not reflect the marginal likelihood of the observed response variable, $y_{o}$ , as it is based solely on the covariate matrix for the observed locations, $X_{o}$ , and the spatial weight matrix for the observed units, $W_{o o}$ . This misspecification generally leads to biased and inconsistent parameter estimates [18,19]. To further clarify this issue, the marginal distributions of $y_{o}$ under the OML and MML methods are presented in detail, and the biases in the parameter estimates produced by these methods are compared later in this section.

To generate synthetic data, we set $ρ = 0.8$ , $σ_{ϵ}^{2} = 2$ , and $σ_{e}^{2} = 1$ for both H-SEM and H-SAM. The regression parameters were fixed at $β_{0} = 1$ and $β_{1} = 5$ . The covariate was drawn from the standard normal distribution. The spatial weight matrix was generated based on a $71 \times 71$ grid, using the Rook neighbourhood as explained in Section 3.4. In total, we simulated 250 datasets, each with $n = 5041$ units. We used the R optim() function to numerically maximise the concentrated marginal log-likelihood function in Equation (10) with respect to parameters $θ$ and $ρ$ . We set $10^{- 8}$ as the convergence threshold for the optim() function.

Table 2 displays the mean parameter estimates and mean squared errors (MSEs) for the parameters $ρ$ , $σ_{ϵ}^{2}$ , and $σ_{e}^{2}$ in the H-SEM and H-SAM. These results are based on analyses with $50 %$ and $90 %$ missing data, estimated using the OML, MML-D, and MML-P methods across 250 simulated datasets. Additionally, the table reports the mean computation times for both parameter estimation and the calculation of standard errors.

For both models, the MML methods (MML-D and MML-P) consistently produce identical parameter estimates that are closer to the true values, achieving lower MSEs and reduced bias for parameters $ρ$ , $σ_{ϵ}^{2}$ , and $σ_{e}^{2}$ compared to the OML method, regardless of the missing data percentage, as shown in Table 2. This pattern holds for the fixed effects parameters as well; see Section S2 of the Supplementary Materials for further details.

As mentioned at the beginning of this section, the misspecification of the marginal distribution of $y_{o}$ in the OML method leads to highly biased parameter estimates. The parameters of the marginal distribution of $y_{o}$ (the marginal mean vector $μ_{o}$ and the marginal covariance matrix $Σ_{o o}$ ) under MML and OML are summarised in Table 3 (the operation ${(A^{⊤} A)}_{o o}^{- 1}$ in Table 3 involves first computing the inverse of matrix $A^{⊤} A$ and then extracting the $n_{o} \times n_{o}$ submatrix corresponding to the observed locations from the resulting matrix). The MML method uses the correct marginal distribution of $y_{o}$ , as derived in Section 3.2.

In the case of H-SEM, while the marginal mean $μ_{o}$ is the same under both MML and OML, the covariance matrix $Σ_{o o}$ is incorrectly specified in the OML approach. Correctly specifying $Σ_{o o}$ requires utilising the complete spatial weight matrix $W$ , which accounts for all n spatial locations, as performed in the MML methods. Similarly, in H-SAM, both the marginal mean $μ_{o}$ and the covariance matrix $Σ_{o o}$ of $y_{o}$ are misspecified under the OML approach. Accurate specification of the marginal distribution parameters of $y_{o}$ involves incorporating both the full covariate matrix $X$ and the complete spatial weight matrix $W$ .

Notably, for datasets with a moderate percentage of missing data ( $50 %$ missing), OML estimates are reasonably accurate. However, MML estimates (both MML-D and MML-P) are closer to the true values, exhibiting lower MSEs for $ρ$ , $σ_{e}^{2}$ , and $σ_{ϵ}^{2}$ across both models. Under a high percentage of missing data ( $90 %$ missing), OML estimates become significantly biased and show larger MSEs compared to MML estimates, as shown in Table 2. A similar trend is observed for the fixed effect parameters, as detailed in the tables in Section S2 of the Supplementary Materials. These findings highlight the effectiveness of the proposed MML methods in reducing the bias in SAR parameter estimates for datasets that have both measurement errors and a substantial percentage of missing data.

Fitting H-SAR models (including both parameter estimation and standard error calculation) was faster with MML-P than with MML-D, as shown in Table 2, regardless of the missing data percentage. For datasets with $50 %$ missing data, fitting H-SEM using the MML-D method took an average of 1293.76 s, whereas the MML-P method required only 10.02 s. Similarly, for fitting H-SAM with 50% missing data, the MML-D method took an average of 1396.39 s, while the MML-P method required just 13.97 s. When $90 %$ of the data were missing, the average computation time for fitting H-SEM was 671.9590 s using MML-D, compared to 12.3905 s with MML-P. Similarly, for the H-SAM, MML-D took 675.9505 s, while MML-P took 16.4584 s.

An interesting observation is that the MML-D method requires less computational time when estimating datasets with a higher percentage of missing data compared to those with a lower percentage. This is due to its computational complexity of $max (O (n^{2}), O (n_{o}^{3}))$ . Given that $n_{o} = n - n_{u}$ , a higher missing data percentage generally implies a lower $n_{o}$ (the number of observed values), thereby reducing the value of $max (O (n^{2}), O (n_{o}^{3}))$ . In other words, the time consumption of the MML-D method is inversely proportional to $n_{u}$ , the number of unobserved values. Consequently, MML-D is faster for datasets with a high percentage of missing data than for those with a low percentage.

On the other hand, the computational complexity of the MML-P method is $O (n^{3 / 2})$ , which is independent of $n_{u}$ or the percentage of missing data. As a result, the MML-P method theoretically maintains constant computational time, regardless of the percentage of missing data. This is confirmed by the nearly identical computation times for the MML-P method across two missing data percentages, as shown in Table 2. More importantly, regardless of the percentage of missing data, the MML-P method outperforms the MML-D method due to its lower and constant computational complexity across varying missing data scenarios.

Further results from the simulation study, including mean standard errors and coverage rates for all model parameters (including fixed effects $β$ ), are provided in Tables S1–S4 in Section S2 of the Supplementary Materials. These additional results further demonstrate that both MML methods (MML-D and MML-P) offer better coverage and smaller mean standard errors compared to the OML method.

5. Real Data Application

This section demonstrates the application of the proposed estimation methods on a real dataset of house prices. As discussed in Section 1, house price data typically include both measurement errors and missing values. Additionally, property prices exhibit strong spatial correlation, making standard linear regression models inadequate for accurate modelling. Instead, SAR models provide a more suitable alternative. For this analysis, we used a dataset with no missing values, allowing us to estimate the parameters accurately without the complication of missing data. These estimates were then considered as the true parameter values, serving as a benchmark for comparison with the estimates obtained from the proposed MML methods.

The dataset consists of house prices from Lucas County, OH, USA, including 25,357 observations of single-family homes sold between 1993 and 1998. A detailed description of the dataset is available through the Spatial Econometrics toolbox for Matlab (the dataset can be accessed at https://www.spatial-econometrics.com/ (accessed on 2 December 2024)). The dataset is also included in the R package spData (version 2.2.2) [43].

The distribution of these house prices is highly right-skewed, as shown in Figure S1 of the Supplementary Materials. Therefore, we used the natural logarithm of housing prices ( $ln (price)$ ) as the dependent variable, which is a common practice to reduce skewness in property price data. The median of the log of house price is 11.0898. Additionally, the dataset includes a 25,357 × 25,357 row-normalised weight matrix, $W$ . This matrix is notably sparser than the simulated weight matrices $W$ used in Section 4. Specifically, the average number of neighbours per unit in this dataset is 2.9528, whereas a simulated weight matrix, based on the Rook neighbourhood structure with an approximately similar number of observations (n = 25,281), has an average of 3.9748 neighbours. The selected independent variables include various characteristics of the houses: different powers of house age (age, ${age}^{2}$ , and ${age}^{3}$ ), the natural logarithm of the lot size in square feet ( $ln (lotsize)$ ), the number of rooms (rooms), the natural logarithm of the total living area in square feet ( $ln (LTA)$ ), the number of bedrooms, and binary indicators for each year from 1993 to 1998 (syear) to represent the year of house sale.

In this section, we compare the estimates from the OML and MML methods with those from the full maximum likelihood (FML) estimation method. The FML method refers to the ML estimation of H-SEM and H-SAM using the entire dataset, which contains no missing values. The estimates from the FML were used as the ground truth for comparing the OML and MML methods. To better simulate real-world scenarios, we introduced moderate to high levels of missing data in the response variable. Specifically, we created two datasets: one with $50 %$ missing responses and another with $90 %$ missing responses, obtained from the complete dataset. We set the convergence threshold for the optim() function to $10^{- 8}$ .

Table 4 presents the estimates and standard errors for parameters $ρ$ , $σ_{ϵ}^{2}$ , and $σ_{e}^{2}$ obtained from the FML, OML, and MML methods for the H-SEM and H-SAM, based on the dataset with $90 %$ missing data. The table also reports the computation times for both parameter estimation and the calculation of standard errors. The results are consistent with our findings in Section 4, demonstrating that the MML estimates (for both MML-D and MML-P methods) align more closely with the FML estimates than the OML estimates, which rely only on observed data locations. As anticipated, the time required for both parameter estimation and standard error computation was substantially lower for both models when using the proposed MML-P method compared to MML-D.

Tables S7 and S8 in Section S3 of the Supplementary Materials show the results for the dataset with $50 %$ missing data. The results are consistent with those from the dataset with $90 %$ missing data. In addition, Tables S5–S8 in Section S3 of the Supplementary Materials provide estimates and standard errors for all model parameters including fixed effects ( $β$ ) for both the H-SEM and H-SAM, for both $50 %$ and $90 %$ missing data.

We now provide a brief analysis of the trends and patterns in this dataset, based on the estimated parameters presented in Table 4, highlighting how effectively the proposed MML methods capture these patterns compared to the OML method for the H-SEM. The estimated spatial autocorrelation parameter $ρ$ , obtained using FML, is $0.9866$ , indicating a very high degree of spatial autocorrelation, which is typical for property price datasets [20,22,23]. To assess the performance of the methods in estimating $ρ$ in the presence of measurement errors and missing values, we compare this FML estimate with those from the OML and MML methods.

The OML method produces an estimate of $ρ$ equal to $0.6866$ , suggesting only moderate spatial autocorrelation. In contrast, both MML methods estimate $ρ$ as $0.9936$ , closely matching the FML estimate and accurately reflecting the strong spatial autocorrelation present in the data. The estimate of parameter $ρ$ in the H-SAM is also closer to the FML estimate using the MML method than the OML method, further highlighting the superior ability of MML methods to accurately capture the underlying spatial structure of the data compared to the OML approach.

To further assess the accuracy of the OML and MML (MML-P) estimates, accounting for all model parameters including the fixed effects and two variance parameters, we computed the average mean squared errors (MSEs) of the estimated parameters for the H-SEM and H-SAM under both missing data percentages, as presented in Table 5. The estimates obtained from the FML were treated as the true parameter values. The average MSE for the MML-P estimates was calculated using the formula ${MSE}_{MML - P} = \frac{1}{r + 4} \sum_{i = 1}^{r + 4} {({\hat{θ}}_{i} - {\hat{θ}}_{i, MML - P})}^{2}$ , where ${\hat{θ}}_{i}$ is the estimated value of the i-th parameter from the FML, ${\hat{θ}}_{i, MML - P}$ is the corresponding parameter estimate from the MML-P method, and $r + 4$ is the total number of parameters with r the number of fixed effect parameters (excluding the intercept parameter). Similarly, the average MSE for the OML estimates was calculated in a similar manner.

As expected, the average MSEs of the parameter estimates obtained from the OML method are significantly higher than those of the MML-P method, regardless of the missing data percentages, for both models. These results highlight the importance of using the proposed MML-P method, particularly in real-world situations where high percentages of missing data are common.

6. Conclusions

This article makes two key contributions. First, we introduced novel hierarchical SAR (H-SAR) models that explicitly account for measurement errors and missing data in the response variable. Second, we proposed marginal maximum likelihood (MML) methods, accompanied by efficient computational approaches, to accurately estimate the parameters of these H-SAR models. Specifically, we developed two computational strategies: the parameterisation method (MML-P) and the direct method (MML-D). Both approaches are particularly advantageous when the spatial weight matrix $W$ is sparse, enabling more efficient computation while maintaining accuracy.

The computational complexity analysis in Section 3.4 highlights that the MML-P method is both more efficient and practical compared to the MML-D algorithm, a conclusion further supported by the simulation study in Section 4. The advantages of using the MML-P over the MML-D method arise in two aspects. First, the computational complexity of the MML-P method is $O (n^{3 / 2})$ , which is significantly lower than the $max (O (n^{2}), O (n_{o}^{3}))$ complexity of the MML-D method. As a result, MML-P requires considerably less computation time than MML-D. Second, the MML-D method demands more computational memory than MML-P. In the MML-D approach, storing matrices such as ${(A^{⊤} A)}^{- 1}$ and $V_{o o}$ becomes memory-intensive for datasets with large n, as these matrices are dense, even if the spatial weight matrix $W$ is sparse. The MML-P method overcomes these limitations by efficiently avoiding the explicit calculation of the $V_{o o}$ matrix. Instead, it re-expresses terms within the MML-P algorithm, leveraging sparse matrix computations to achieve both reduced computational complexity and memory usage.

Although our methods efficiently handle SAR models with both measurement errors and a substantial amount of missing data, we acknowledge two key limitations. Firstly, our study primarily focused on the estimation of such models under the missing-at-random (MAR) mechanism. There is a pressing need to extend this work to include estimations under the missing-not-at-random (MNAR) mechanism, which often arises in real-world applications. Secondly, we assumed that both the error component of the SAR process ( $e$ ) and the measurement error ( $ϵ$ ) follow normal distributions with constant variance. This assumption could be relaxed by incorporating heteroscedasticity or by replacing these distributions with heavy-tailed alternatives, such as Student’s t distribution, to better accommodate scenarios involving outliers or non-normal error structures. These extensions would enhance the applicability of the models to more complex real-world datasets. We leave these advancements for future research.

Author Contributions

Conceptualisation, A.W.; methodology, A.W. and T.S.; software, A.W.; writing—original draft, A.W.; writing—review and editing, D.G.; supervision, D.G. and T.S. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Materials, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no potential or apparent conflicts of interest in this article.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figure and Tables

View Image - Figure 1. The grid ([Forumla omitted. See PDF.]) used for constructing [Forumla omitted. See PDF.] based on the Rook neighbourhood in the left panel and the corresponding binary [Forumla omitted. See PDF.] in the right panel.

Figure 1. The grid ([Forumla omitted. See PDF.]) used for constructing [Forumla omitted. See PDF.] based on the Rook neighbourhood in the left panel and the corresponding binary [Forumla omitted. See PDF.] in the right panel.

Table 1

Expressions for mean ( $μ$ ), covariance ( $Σ$ ), and $V$ for the SAR and H-SAR models with $A = I_{n} - ρ W$ , $ω = σ_{ϵ}^{2}$ and $θ = σ_{e}^{2} / σ_{ϵ}^{2}$ .

Terms	SAM	SEM	H-SAM	H-SEM
$μ$	$A^{- 1} X β$	$X β$	$A^{- 1} X β$	$X β$
$Σ$	$σ_{e}^{2} {(A^{⊤} A)}^{- 1}$	$σ_{e}^{2} {(A^{⊤} A)}^{- 1}$	$ω (I_{n} + θ {(A^{⊤} A)}^{- 1})$	$ω (I_{n} + θ {(A^{⊤} A)}^{- 1})$
$V$	${(A^{⊤} A)}^{- 1}$	${(A^{⊤} A)}^{- 1}$	$(I_{n} + θ {(A^{⊤} A)}^{- 1})$	$(I_{n} + θ {(A^{⊤} A)}^{- 1})$

Table 2

Mean parameter estimates (with mean squared errors (MSEs) in parentheses) for $ρ$ , $σ_{ϵ}^{2}$ , and $σ_{e}^{2}$ , along with the mean computation time in seconds (ct), were calculated from 250 simulated datasets, each containing n = 5041 units. The results are reported for H-SEM and H-SAM with $50 %$ and $90 %$ missing data, using the OML, MML-D, and MML-P algorithms. The true parameter values are $ρ = 0.8$ , $σ_{ϵ}^{2} = 2$ , and $σ_{e}^{2} = 1$ .

		H-SEM			H-SAM
		OML	MML-D	MML-P	OML	MML-D	MML-P
$50 %$	$ρ$	0.6109 (0.0413)	0.7949 (0.0012)	0.7949 (0.0012)	0.4497 (0.1229)	0.7997 (0.0001)	0.7997 (0.0001)
	$σ_{ϵ}^{2}$	2.3856 (0.2385)	1.9745 (0.0567)	1.9745 (0.0567)	0.0374 (3.8650)	2.0059 (0.0247)	2.0059 (0.0247)
	$σ_{e}^{2}$	0.9737 (0.1336)	1.0350 (0.0548)	1.0350 (0.0548)	10.4758 (89.9968)	0.9995 (0.0115)	0.9995 (0.0115)
	ct	3.7383	1293.76	10.02	3.8775	1396.39	13.97
$90 %$	$ρ$	0.2557 (0.3265)	0.7880 (0.0094)	0.7880 (0.0094)	0.2434 (0.3107)	0.8003 (0.0001)	0.8003 (0.0001)
	$σ_{ϵ}^{2}$	1.0016 (2.5969)	1.9189 (0.6545)	1.9189 (0.6545)	0.0020 (3.9921)	2.0111 (0.2587)	2.0111 (0.2587)
	$σ_{e}^{2}$	3.1122 (6.2679)	1.1157 (0.4689)	1.1157 (0.4689)	19.0154 (327.01)	0.9748 (0.0674)	0.9748 (0.0674)
	ct	0.8923	671.9590	12.3905	0.631	675.9505	16.4584

Table 3

The marginal mean ( $μ_{o}$ ) and covariance matrix ( $Σ_{o o}$ ) of the distribution of $y_{o}$ for H-SEM and H-SAM under the MML and OML methods, with $A = I_{n} - ρ W$ and $A_{o} = I_{o} - ρ W_{o o}$ .

	Terms	MML	OML
H-SEM	$μ_{o}$	$X_{o} β$	$X_{o} β$
H-SEM	$Σ_{o o}$	$ω (I_{o} + θ {(A^{⊤} A)}_{o o}^{- 1})$	$ω (I_{o} + θ {(A_{o}^{⊤} A_{o})}^{- 1})$
H-SAM	$μ_{o}$	${(A^{- 1} X)}_{o} β$	$A_{o}^{- 1} X_{o} β$
H-SAM	$Σ_{o o}$	$ω (I_{o} + θ {(A^{⊤} A)}_{o o}^{- 1})$	$ω (I_{o} + θ {(A_{o}^{⊤} A_{o})}^{- 1})$

Table 4

Parameter estimate (est), with its standard error (se), and computation time in seconds (ct) for fitting H-SAM and H-SEM using FML, OML, MML-D, and MML-P for the dataset with $90 %$ missing data.

Model		FML		OML		MML-D		MML-P
Model		est	se	est	se	est	se	est	se
H-SEM	$ρ$	0.9866	0.0002	0.6866	0.0012	0.9936	0.0006	0.9936	0.0006
	$σ_{e}^{2}$	0.0004	0.0001	0.1643	0.0030	0.0001	0.0002	0.0001	0.0002
	$σ_{ϵ}^{2}$	0.0685	0.0007	0.0001	0.0081	0.0837	0.0035	0.0837	0.0035
	ct	20.098		1.39		2792.409		38.65
H-SAM	$ρ$	0.6727	0.0001	0.0027	0.0311	0.6046	0.0201	0.6046	0.0201
	$σ_{e}^{2}$	0.0399	0.0008	0.0882	0.0114	0.0682	0.0070	0.0682	0.0070
	$σ_{ϵ}^{2}$	0.0420	0.0009	0.0882	0.0234	0.0203	0.0080	0.0203	0.0080
	ct	12.245		1.08		1196.3		22.53

Table 5

Average mean squared error (MSE) of OML and MML-P estimates (relative to the FML estimates) for different missing data percentages.

	H-SEM		H-SAM
	$50 %$	$90 %$	$50 %$	$90 %$
OML	0.7029	0.4838	1.2573	1.1248
MML-P	0.0153	0.2556	0.0087	0.0172

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12233870/s1, Figure S1: Comparison of density distributions of house prices in Lucas County, Ohio, USA: original scale vs. natural logarithmic scale; Table S1: H-SEM mean estimates (est.), mean standard error (se.), and coverage (cov.) over 250 simulated datasets of estimated parameters obtained using the OML and MML-P methods- 90% missing data; Table S2: H-SAM mean estimates (est.), mean standard error (se.), and coverage (cov.) over 250 simulated datasets of estimated parameters obtained using the OML and MML-P methods- 90% missing data; Table S3: H-SEM mmean estimates (est.), mean standard error (se.), and coverage (cov.) over 250 simulated datasets of estimated parameters obtained using the OML and MML-P methods- 50% missing data; Table S4: H-SAM mean estimates (est.), mean standard error (se.), and coverage (cov.) over 250 simulated datasets of estimated parameters obtained using the OML and MML-P methods- 50% missing data; Table S5: Parameter estimates (est.) along with their standard errors (se.), and computation time in seconds (ct) for fitting the H-SEM model using FML for the full dataset, and OML, MML-D, and MML-P for a partially observed dataset with 90% missing data; Table S6: Parameter estimates (est.) along with their standard errors (se.), and computation time in seconds (ct) for fitting the H-SAM model using FML for the full dataset, and OML, MML-D, and MML-P for a partially observed dataset with 90% missing data; Table S7: Parameter estimates (est.) along with their standard errors (se.), and computation time in seconds (ct) for fitting the H-SEM model using FML for the full dataset, and OML, MML-D, and MML-P for a partially observed dataset with 50% missing data; Table S8: Parameter estimates (est.) along with their standard errors (se.), and computation time in seconds (ct) for fitting the H-SAM model using FML for the full dataset, and OML, MML-D, and MML-P for a partially observed dataset with 50% missing data.

References

1. Tognelli, M.F.; Kelt, D.A. Analysis of determinants of mammalian species richness in South America using spatial autoregressive models. Ecography; 2004; 27, pp. 427-436. Available online: https://www.jstor.org/stable/3683416 (accessed on 2 December 2024). [DOI: https://dx.doi.org/10.1111/j.0906-7590.2004.03732.x]

2. Ver Hoef, J.M.; Peterson, E.E.; Hooten, M.B.; Hanks, E.M.; Fortin, M.J. Spatial autoregressive models for statistical inference from ecological data. Ecol. Monogr.; 2018; 88, pp. 36-59. [DOI: https://dx.doi.org/10.1002/ecm.1283]

3. Angrist, J.D.; Lang, K. Does school integration generate peer effects? Evidence from Boston’s Metco Program. Am. Econ. Rev.; 2004; 94, pp. 1613-1634. [DOI: https://dx.doi.org/10.1257/0002828043052169]

4. Ammermueller, A.; Pischke, J.S. Peer effects in European primary schools: Evidence from the progress in international reading literacy study. J. Labor Econ.; 2009; 27, pp. 315-348. [DOI: https://dx.doi.org/10.1086/603650]

5. Glaeser, E.L.; Sacerdote, B.; Scheinkman, J.A. Crime and social interactions. Q. J. Econ.; 1996; 111, pp. 507-548. [DOI: https://dx.doi.org/10.2307/2946686]

6. Calabrese, R.; Elkink, J.A.; Giudici, P. Measuring bank contagion in Europe using binary spatial regression models. J. Oper. Res. Soc.; 2017; 68, pp. 1503-1511. [DOI: https://dx.doi.org/10.1057/s41274-017-0189-4]

7. Ord, K. Estimation methods for models of spatial interaction. J. Am. Stat. Assoc.; 1975; 70, pp. 120-126. [DOI: https://dx.doi.org/10.1080/01621459.1975.10480272]

8. Pace, R.K.; Barry, R. Sparse spatial autoregressions. Stat. Probab. Lett.; 1997; 33, pp. 291-297. [DOI: https://dx.doi.org/10.1016/S0167-7152(96)00140-X]

9. Pace, R.K. Performing large spatial regressions and autoregressions. Econ. Lett.; 1997; 54, pp. 283-291. [DOI: https://dx.doi.org/10.1016/S0165-1765(97)00026-8]

10. Pace, R.K.; LeSage, J.P. Chebyshev approximation of log-determinants of spatial weight matrices. Comput. Stat. Data Anal.; 2004; 45, pp. 179-196. [DOI: https://dx.doi.org/10.1016/S0167-9473(02)00321-3]

11. Kelejian, H.H.; Prucha, I.R. A Generalized Moments estimator for the autoregressive parameter in a spatial model. Int. Econ. Rev.; 1999; 40, pp. 509-533. [DOI: https://dx.doi.org/10.1111/1468-2354.00027]

12. Kelejian, H.H.; Prucha, I.R. On the asymptotic distribution of the Moran I test statistic with applications. J. Econom.; 2001; 104, pp. 219-257. [DOI: https://dx.doi.org/10.1016/S0304-4076(01)00064-1]

13. Lee, L.F. GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. J. Econom.; 2007; 137, pp. 489-514. [DOI: https://dx.doi.org/10.1016/j.jeconom.2005.10.004]

14. Hepple, L.W. Bayesian analysis of the linear model with spatial dependence. Exploratory and Explanatory Statistical Analysis of Spatial Data; Springer: Dordrecht, The Netherlands, 1979; pp. 179-199. [DOI: https://dx.doi.org/10.1007/978-94-009-9233-7_7]

15. Anselin, L. Spatial Econometrics: Methods and Models; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1988; Volume 4.

16. LeSage, J.P. Bayesian estimation of spatial autoregressive models. Int. Reg. Sci. Rev.; 1997; 20, pp. 113-129. [DOI: https://dx.doi.org/10.1177/016001769702000107]

17. Lee, L.F. Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econom. Rev.; 2003; 22, pp. 307-335. [DOI: https://dx.doi.org/10.1081/ETC-120025891]

18. Wang, W.; Lee, L.F. Estimation of spatial autoregressive models with randomly missing data in the dependent variable. Econom. J.; 2013; 16, pp. 73-102. [DOI: https://dx.doi.org/10.1111/j.1368-423X.2012.00388.x]

19. Benedetti, R.; Suesse, T.; Piersimoni, F. Spatial auto-correlation and auto-regressive models estimation from sample survey data. Biom. J.; 2020; 62, pp. 1494-1507. [DOI: https://dx.doi.org/10.1002/bimj.201800225]

20. LeSage, J.P.; Pace, R.K. Models for spatially dependent missing data. J. Real Estate Financ. Econ.; 2004; 29, pp. 233-254. [DOI: https://dx.doi.org/10.1023/B:REAL.0000035312.82241.e4]

21. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. Methodol.; 1977; 39, pp. 1-22. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1977.tb01600.x]

22. Suesse, T.; Zammit-Mangion, A. Computational aspects of the EM algorithm for spatial econometric models with missing data. J. Stat. Comput. Simul.; 2017; 87, pp. 1767-1786. [DOI: https://dx.doi.org/10.1080/00949655.2017.1286495]

23. Suesse, T. Marginal maximum likelihood estimation of SAR models with missing data. Comput. Stat. Data Anal.; 2018; 120, pp. 98-110. [DOI: https://dx.doi.org/10.1016/j.csda.2017.11.004]

24. Kelejian, H.H.; Prucha, I.R. Spatial models with spatially lagged dependent variables and incomplete data. J. Geogr. Syst.; 2010; 12, pp. 241-257. [DOI: https://dx.doi.org/10.1007/s10109-010-0109-5]

25. Luo, G.; Wu, M.; Xu, L. IPW-based robust estimation of the SAR model with missing data. Stat. Probab. Lett.; 2021; 172, 109065. [DOI: https://dx.doi.org/10.1016/j.spl.2021.109065]

26. Burden, S.; Cressie, N.; Steel, D.G. The SAR model for very large datasets: A reduced rank approach. Econometrics; 2015; 3, pp. 317-338. [DOI: https://dx.doi.org/10.3390/econometrics3020317]

27. Suesse, T. Estimation of spatial autoregressive models with measurement error for large data sets. Comput. Stat.; 2018; 33, pp. 1627-1648. [DOI: https://dx.doi.org/10.1007/s00180-017-0774-7]

28. Bivand, R.; Gómez-Rubio, V.; Rue, H. Spatial data analysis with R-INLA with some extensions. J. Stat. Softw.; 2015; 63, pp. 1-31. [DOI: https://dx.doi.org/10.18637/jss.v063.i20]

29. Gómez-Rubio, V.; Bivand, R.S.; Rue, H. Estimating spatial econometrics models with Integrated Nested Laplace Approximation. Mathematics; 2021; 9, 2044. [DOI: https://dx.doi.org/10.3390/math9172044]

30. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using Integrated Nested Laplace Approximations. J. R. Stat. Soc. Ser. Stat. Methodol.; 2009; 71, pp. 319-392. [DOI: https://dx.doi.org/10.1111/j.1467-9868.2008.00700.x]

31. Bates, D.; Maechler, M. R Package Matrix: Sparse and Dense Matrix Classes and Methods, version 1.4-0. 2021; Available online: https://rdrr.io/rforge/Matrix/ (accessed on 2 December 2024).

32. LeSage, J.; Pace, R.K. Introduction to Spatial Econometrics; Chapman and Hall; CRC: Boca Raton, FL, USA, 2009.

33. Li, H.; Calder, C.A.; Cressie, N. One-step estimation of spatial dependence parameters: Properties and extensions of the APLE statistic. J. Multivar. Anal.; 2012; 105, pp. 68-84. [DOI: https://dx.doi.org/10.1016/j.jmva.2011.08.006]

34. Arab, A.; Hooten, M.B.; Wikle, C.K. Hierarchical Spatial Models. Encycl. GIS; 2008; 14, pp. 425-431. [DOI: https://dx.doi.org/10.1007/978-0-387-35973-1_564]

35. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793.

36. Petersen, K.B.; Pedersen, M.S. The Matrix Cookbook; Technical University of Denmark: Kongens Lyngby, Denmark, 2008; Volume 7, 510.

37. Ding, J.; Zhou, A. Eigenvalues of rank-one updated matrices with some applications. Appl. Math. Lett.; 2007; 20, pp. 1223-1226. [DOI: https://dx.doi.org/10.1016/j.aml.2006.11.016]

38. Harville, D.A. Matrix Algebra from a Statistician’s Perspective; Springer: New York, NY, USA, 1997.

39. Moura, A.C.M.; Fonseca, B.M. ESDA (Exploratory Spatial Data Analysis) of vegetation cover in urban areas—Recognition of vulnerabilities for the management of resources in urban green infrastructure. Sustainability; 2020; 12, 1933. [DOI: https://dx.doi.org/10.3390/su12051933]

40. George, A.; Liu, J.W. Computer Solution of Large Sparse Positive Definite; Prentice Hall Professional Technical Reference; Prentice Hall: Hoboken, NJ, USA, 1981.

41. Rue, H.; Held, L. Gaussian Markov Random Fields: Theory and Applications; Chapman and Hall CRC: Boca Raton, FL, USA, 2005.

42. Davis, T.A.; Hager, W.W. Multiple-Rank Modifications of a Sparse Cholesky Factorization. SIAM J. Matrix Anal. Appl.; 2001; 22, pp. 997-1013. [DOI: https://dx.doi.org/10.1137/S0895479899357346]

43. Bivand, R.; Nowosad, J.; Lovelace, R. R Package spData: Datasets for Spatial Analysis, version 2.2.2. 2023; Available online: https://cran.r-universe.dev/spData (accessed on 2 December 2024).

Word count: 8835

Show less

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Efficient estimation methods for simultaneous autoregressive (SAR) models with missing data in the response variable have been well explored in the literature. A common practice is introducing measurement error into SAR models to separate the noise component from the spatial process. However, prior studies have not considered incorporating measurement error into SAR models with missing data. Maximum likelihood estimation for such models, especially with large datasets, poses significant computational challenges. This paper proposes an efficient likelihood-based estimation method, the marginal maximum likelihood (ML), for estimating SAR models on large datasets with measurement errors and a high percentage of missing data in the response variable. The spatial autoregressive model (SAM) and the spatial error model (SEM), two popular SAR model types, are considered. The missing data mechanism is assumed to follow a missing-at-random (MAR) pattern. We propose a fast method for marginal ML estimation with a computational complexity of $O (n^{3 / 2})$ , where n is the total number of observations. This complexity applies when the spatial weight matrix is constructed based on a local neighbourhood structure. The effectiveness of the proposed methods is demonstrated through simulations and real-world data applications.

Details

Title

A Marginal Maximum Likelihood Approach for Hierarchical Simultaneous Autoregressive Models with Missing Data

Author

Wijayawardhana, Anjana; Gunawan, David; Suesse, Thomas

First page

3870

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

22277390

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/math12233870

ProQuest document ID

3144154445

A Marginal Maximum Likelihood Approach for Hierarchical Simultaneous Autoregressive Models with Missing Data

Jump to:

Full text

2. Models

2.1. Standard Simultaneous Autoregressive Models

2.2. Hierarchical Simultaneous Autoregressive Models

3. H-SAR Models with Missing Data and Their Marginal Maximum Likelihood Methods

3.1. Hierarchical Simultaneous Autoregressive Models with Randomly Missing Responses

3.2. Marginal Maximum Likelihood Estimation Method

3.3. Two Alternative Computational Approaches

3.4. Computational Aspects of Marginal ML Methods

4. Simulation Study

5. Real Data Application

Abstract

Details

Suggested sources