PML versus minimum χ2: the comeback

Full text

Turn on search term navigation

Introduction

Maximum likelihood and minimum chi-square methods have been competing for the estimator throne for a long time. At the turn of the nineteenth century, Legendre (1805) and Gauss (1809) put forward least squares estimation as a Gaussian-based alternative to Laplace’s (1774) least absolute deviation method, which relied on his eponymous distribution. Almost a century later, Pearson proposed not only the method of moments (see Pearson 1894), but also the chi-square criterion in the context of matching theoretical and empirical frequencies (see Pearson 1900). In turn, the development of maximum likelihood estimation (MLE) by Fisher (1922, 1925) was one of the most important achievements in twentieth-century statistics. Under standard regularity conditions, MLE asymptotically achieves the Cramér–Rao lower bound (see Cramér 1946; Rao 1945), which makes it at least as good as any minimum $χ^{2}$ estimator. In addition, it achieves second-order efficiency after a bias correction (see Rao 1961). Moreover, the imposition of valid equality restrictions on the parameters systematically leads to efficiency gains (see Rothenberg 1973).

However, not everybody was convinced (see Neyman and Scott (1948) on the incidental parameter problem, as well as the inconsistent MLE examples in Basu (1955), Kraft and Le Cam (1956), Bahadur (1958)), and minimum $χ^{2}$ methods remained popular. In fact, Berkson (1980) argued that ML was often just a special case of minimum $χ^{2}$ , and not necessarily the best one. Soon afterwards, White (1982), building on earlier work by Huber (1967), and Gouriéroux et al. (1984) studied the properties of pseudo-MLEs, characterising their consistency and general inefficiency. Arellano (1989a) put another nail on the ML coffin by showing that valid equality restrictions could result in efficiency losses for Gaussian PMLEs. Arguably, the wooden stake to the heart was driven by Newey and Steigerwald (1997), who described the inconsistency of non-Gaussian PMLE procedures under distributional misspecification. Since then, graduate students with non-Bayesian teachers learn the normal distribution only, and Gaussian PMLE is just an example of Hansen (1982) Generalised Method of Moments (GMM). In this paper, though, we argue that non-Gaussian PMLE, like a B-movie vampire, deserves a second life (or death).

We do so by revisiting the two-equation textbook example in Arellano (1989a),1 except that instead of basing PMLE on the Gaussian distribution, as he did, we use discrete mixtures of normals. The reason is twofold. First, Fiorentini and Sentana (2023) show that, under standard regularity conditions, such estimators are consistent for the conditional mean and variance parameters regardless of the true distribution of the shocks to the model and the number of mixture components, thereby nesting the results for Gaussian PMLE in Gouriéroux et al. (1984) while simultaneously avoiding the concerns raised by Newey and Steigerwald (1997). Second, finite normal mixtures with a sufficiently large number of components can provide good approximations to many distributions (see Nguyen et al. 2020), so it is reasonable to conjecture that PMLEs based on them may get close to achieving the semiparametric (SP) efficiency bound, and therefore exploit the potential adaptivity of some of the parameters when it exists, at least asymptotically.2

The rest of the paper is organised as follows. Section 2 introduces the example in Arellano (1989a) and summarises his main results. Then, in Sect. 3 we extend those results to the entire parameter vector, derive the relevant semiparametric efficiency bounds, and use them to benchmark the different estimators, including the PMLEs based on finite Gaussian mixtures. Next, Sect. 4 contains the results of our extensive Monte Carlo experiments, while Sect. 5 concludes. Proofs and auxiliary results are relegated to the appendices.

The example

Consider the following textbook example:

\begin{matrix} y_{1} = & γ + α y_{2} + β z_{1} + u_{1}, \end{matrix}

\begin{matrix} y_{2} = & μ_{0} + μ_{1} z_{1} + μ_{2} z_{2} + u_{2}, \end{matrix}

with

\begin{matrix} ((\begin{matrix} u_{1} \\ u_{2} \end{matrix})| z_{1}, z_{2} \sim D [(\begin{matrix} 0 \\ 0 \end{matrix}), (\begin{matrix} σ_{1}^{2} & σ_{12} \\ σ_{12} & σ_{2}^{2} \end{matrix})] . \end{matrix}

As is well known, the unrestricted Gaussian PMLE of

α

and

β

coincides with the IV estimator that uses a constant,

z_{1}

and

z_{2}

as instruments in the first equation. In turn, the restricted Gaussian PMLE that imposes

σ_{12} = 0

coincides with the OLS estimator of the first equation.

When the joint conditional distribution of $u_{1}$ and $u_{2}$ is Gaussian, OLS is at least as efficient as IV, which justifies the Durbin–Wu–Hausman test.3 But Arellano (1989a) seemingly counterintuitive result says that when the true conditional distribution is not Gaussian, IV may be more efficient than OLS for $α$ and $β$ even though $σ_{12} = 0$ . Specifically, he showed that IV will beat OLS if and only if

\begin{matrix} μ_{22} \geq 1 + ρ_{y_{2} z_{2} . z_{1}}^{- 2}, \end{matrix}

where

\begin{matrix} μ_{22} = E ((\frac{u_{1}^{2}}{σ_{1}^{2}}, \frac{u_{2}^{2}}{σ_{2}^{2}}| z_{1}, z_{2}) \end{matrix}

is the co-kurtosis coefficient between the two structural shocks and

ρ_{y_{2} z_{2} . z_{1}}

is the correlation coefficient between

y_{2}

and

z_{2}

after partialling out the effect of

z_{1}

. Intuitively,

μ_{22}

affects the correct sandwich version of the asymptotic covariance matrix of the OLS estimators of the slope parameters.

Appendix A contains detailed expressions for the asymptotic variances of the OLS and IV estimators of $α$ and $β$ . We have used those expressions to create Fig. 1, which displays in $(ρ_{y_{2} z_{2} . z_{1}}, μ_{22})$ space (minus one plus) the ratio of the asymptotic variances of the OLS and IV estimators of $α$ for positive values of $ρ_{y_{2} z_{2} . z_{1}}$ .4 We do so for the special case in which the $R^{2}$ of Eq. (2) coincides with $ρ_{y_{2} z_{2} . z_{1}}^{2}$ , which allows this parameter to vary freely from 0 to 1.5 As expected, OLS is more/less efficient than IV to the left/right of the boundary line (3).

[See PDF for image]

Fig. 1

Relative efficiency OLS/IV for $α$ . Notes: When the $R^{2}$ of Eq. (2) coincides with $ρ_{y_{2} z_{2} . z_{1}}^{2}$ , the relative efficiency of the OLS/IV estimators of $α$ is given by $\frac{V ({\hat{α}}_{LS})}{V ({\hat{α}}_{IV})} = [(1 - ρ_{y_{2} z_{2} . z_{1}}^{2}) μ_{22} + ρ_{y_{2} z_{2} . z_{1}}^{2}] ρ_{y_{2} z_{2} . z_{1}}^{2} .$ The solid line denotes the boundary line $μ_{22} = 1 + ρ_{y_{2} z_{2} . z_{1}}^{- 2}$ , while the dotted line denotes the locus of $(ρ_{y_{2} z_{2} . z_{1}}, μ_{22})$ combinations for which the IV estimator of $α$ reaches its maximum asymptotic efficiency relative to the corresponding OLS estimator, which is given by $ρ_{y_{2} z_{2} . z_{1}}^{2} = \frac{1}{2} μ_{22} / (μ_{22} - 1)$

This figure also shows the locus of $(ρ_{y_{2} z_{2} . z_{1}}, μ_{22})$ combinations for which the IV estimator of $α$ reaches its maximum asymptotic efficiency relative to the corresponding OLS estimator in this set-up, which is given by the curve $\begin{matrix} ρ_{y_{2} z_{2} . z_{1}}^{2} = \frac{μ_{22}}{2 (μ_{22} - 1)} . \end{matrix}$ Further increases in $ρ_{y_{2} z_{2} . z_{1}}$ for a given $μ_{22}$ result in decreases in relative efficiency, with OLS and IV becoming indistinguishable as $ρ_{y_{2} z_{2} . z_{1}} \to 1$ , in which case $z_{2}$ becomes a perfect instrument for $y_{2}$ .

In this context, Arellano (1989a) proposed solution is to replace Gaussian PMLE by Minimum Distance (MD) estimators, a special case of minimum chi-square methods popularised in econometrics by Malinvaud (1970). The rationale is as follows. Let $θ = {(γ, α, β, μ_{0}, μ_{1}, μ_{2}, σ_{1}^{2}, σ_{2}^{2})}^{'}$ denote the vector of structural parameters. Given that the reduced form of model (2) is

\begin{matrix} ((\begin{matrix} y_{1} \\ y_{2} \end{matrix})| z_{1}, z_{2} \sim & D [μ (z_{1}, z_{2} ; θ), Ω (z_{1}, z_{2} ; θ)] \end{matrix}

\begin{matrix} μ (z_{1}, z_{2} ; θ) = & [\begin{matrix} (γ + α μ_{0}) + (β + α μ_{1}) z_{1} + α μ_{2} z_{2} \\ μ_{0} + μ_{1} z_{1} + μ_{2} z_{2} \end{matrix}] \end{matrix}

\begin{matrix} Ω (z_{1}, z_{2} ; θ) = & (\begin{matrix} σ_{1}^{2} + α^{2} σ_{2}^{2} + 2 α σ_{12} & α σ_{2}^{2} + σ_{12} \\ α σ_{2}^{2} + σ_{12} & σ_{2}^{2} \end{matrix}), \end{matrix}

[See PDF for image]

Fig. 2

Relative efficiency MD/OLS-IV for $α$ . Notes: When the $R^{2}$ of Eq. (2) coincides with $ρ_{y_{2} z_{2} . z_{1}}^{2}$ , the relative efficiency of the MD/OLS and MD/IV estimators of $α$ are given by $\frac{V ({\hat{α}}_{MD})}{V ({\hat{α}}_{LS})} = \frac{μ_{22}}{[1 + (μ_{22} - 1) ρ_{y_{2} z_{2} . z_{1}}^{2}] [(1 - ρ_{y_{2} z_{2} . z_{1}}^{2}) μ_{22} + ρ_{y_{2} z_{2} . z_{1}}^{2}]}$ and $\frac{V ({\hat{α}}_{MD})}{V ({\hat{α}}_{IV})} = \frac{μ_{22} ρ_{y_{2} z_{2} . z_{1}}^{2}}{1 + (μ_{22} - 1) ρ_{y_{2} z_{2} . z_{1}}^{2}},$ respectively. The solid line denotes the boundary line $μ_{22} = 1 + ρ_{y_{2} z_{2} . z_{1}}^{- 2}$

which is exactly identified, the unrestricted MD estimator coincides with IV, which is Indirect Least Squares. Then, Arellano (1989a) shows that imposing the restriction $σ_{12} = 0$ leads to an overidentified optimal MD procedure (weakly) more efficient than both IV and OLS for $α$ and $β$ .

This optimal MD estimator requires an asymptotic covariance of the reduced form parameter estimators which recognises that the third- and fourth-order multivariate cumulants of $u_{1}$ and $u_{2}$ are not usually 0 when they are jointly non-normally distributed.

[See PDF for image]

Fig. 3

Relative efficiency Student t ML/MD for $α$ and $β$ . Notes: When the $R^{2}$ of equation (2) coincides with $ρ_{y_{2} z_{2} . z_{1}}^{2}$ , the relative efficiency of the ML/MD estimators of $α$ and $β$ is, respectively, given by $\frac{AVar (\sqrt{n} {\hat{α}}_{ML})}{AVar (\sqrt{n} {\hat{α}}_{MD})} = \frac{1 + (μ_{22} - 1) ρ_{y_{2} z_{2} . z_{1}}^{2}}{[(1 - ρ_{y_{2} z_{2} . z_{1}}^{2}) M_{ss} + ρ_{y_{2} z_{2} . z_{1}}^{2} M_{ll}] μ_{22}}$ and $\frac{AVar (\sqrt{n} {\hat{β}}_{ML})}{AVar (\sqrt{n} {\hat{β}}_{MD})} = \frac{ρ_{y_{2} z_{2} . z_{1}}^{2}}{[(1 - ρ_{y_{2} z_{2} . z_{1}}^{2}) M_{ss} + ρ_{y_{2} z_{2} . z_{1}}^{2} M_{ll}] μ_{22}},$ where m $_{ll} = ν (2 + ν) / [(ν - 2) (ν + 4)]$ and m $_{ss} = (ν + 2) / (ν + 4)$ with $ν = 2 (2 μ_{22} - 1) / (μ_{22} - 1)$

Appendix A also contains detailed expressions for the asymptotic variances of the optimal MD estimators of $α$ and $β$ . We have used those expressions to create Fig. 2, which depicts in $(ρ_{y_{2} z_{2} . z_{1}}, μ_{22})$ space (minus one plus) the ratio of the asymptotic variance of the restricted optimal MD of $α$ to the asymptotic variance of either the OLS estimator (to the left of (3)) or the IV one (to its right) in the same set-up as Fig. 1. As can be seen, the efficiency gains are relatively small over the displayed range, and they vanish when either the partial correlation goes to 0 or 1 or the co-kurtosis term goes to 0.6

The predictable reaction of a fervent ML believer to Figs. 1 and 2 would be to argue that condition (3) requires the combination of a very good instrument (a high $ρ_{y_{2} z_{2} . z_{1}}^{2}$ ) with a substantial amount of non-normality (a large $μ_{22}$ ), in which case the Gaussian assumption would be very inappropriate. For example, a joint Student t distribution for $u_{1}$ and $u_{2}$ cannot satisfy this condition when the number of degrees of freedom is six or more, and the requirement becomes increasingly difficult for poor instruments.

A naïve ML solution would be to assume that $u_{1}$ and $u_{2}$ follow a bivariate Student t distribution to estimate the model parameters, which should dominate MD. In this respect, we have used the expressions in Appendix A to create Fig. 3a, b, which display in $(ρ_{y_{2} z_{2} . z_{1}}, μ_{22})$ space (minus one plus) the ratio of the asymptotic variances of the t-based MLE of $α$ and $β$ that impose $σ_{12} = 0$ to the asymptotic variances of the corresponding restricted optimal MD. As can be seen, these figures confirm that ML does indeed dominate MD in this case.

The problem with this naïve approach is that if the assumed joint distribution is incorrect, the resulting PMLEs may be inconsistent, as forcefully argued by Newey and Steigerwald (1997).

However, this does not mean that all parameters will be inconsistently estimated. Specifically, Proposition 3 in Fiorentini and Sentana (2019) implies that the unrestricted t-based PMLEs of $α$ and $β$ are always consistent irrespective of the true distribution. Similarly, their Proposition 1 implies that the restricted t-based PMLEs of $α$ and $β$ will remain consistent when the conditional distribution of $σ_{1}^{- 1} u_{1}$ and $σ_{2}^{- 1} u_{2}$ is elliptical even though it does not coincide with the distribution assumed for estimation purposes. Besides, it may be possible to obtain two-step consistent estimators in closed-form along the lines of Fiorentini and Sentana (2019).

More importantly, Fiorentini and Sentana (2023) show that all parameters will always be consistently estimated if one assumes for estimation purposes that $u_{1}$ and $u_{2}$ follow a finite mixture of bivariate normals regardless of the true distribution of those innovations and the number of components of the mixture, as long as the shape parameters are simultaneously estimated with the mean and variance parameters.7 Thus, the consistency of the Gaussian PMLE is just a special case.

The ability of finite Gaussian mixtures to approximate many other distributions mentioned in the introduction suggests that we may be able to relate these finite mixture PMLEs to SP estimators which simply exploit the independence of the shocks and the conditioning variables without making any parametric assumptions. For that reason, in the next section we take SP estimators as our benchmark to study:

The efficiency of the OLS, IV, MD and correct ML estimators relative to SP ones,
The relative efficiency of restricted and unrestricted versions of these SP estimators, and
The relative efficiency of finite mixture-based PMLEs relative to SP estimators

in the context of model (2).

Theoretical analysis

Minimum distance revisited

Although the main focus of the analysis in Arellano (1989a) was $α$ and $β$ , it is of some interest to study the asymptotic efficiency of the optimal MD estimators of the remaining structural model parameters relative to their OLS and IV counterparts. Given that the number of different bivariate cumulants of orders 3 and 4 is 4 and 5, respectively, we focus on the special case in which the joint distribution of the (standardised) structural shocks conditional on the instruments is spherical, or $s (0, I_{2}, η)$ for short, where $η$ is the possibly infinite vector of shape parameters. More formally,

Assumption 1

\begin{matrix} (\frac{u_{1}}{σ_{1}}, \frac{u_{2}}{σ_{2}}| z_{1}, z_{2} ; θ, η \sim i . i . d . s (0, I_{2}, η) \end{matrix}

To simplify the expressions further, we are going to follow Appendix B in Fiorentini and Sentana (2019) and re-parametrise the unrestricted covariance matrix of the structural residuals as

\begin{matrix} (\begin{matrix} σ_{1}^{2} & σ_{12} \\ σ_{12} & σ_{2}^{2} \end{matrix}) = σ^{2} (\begin{matrix} 1 & 0 \\ ψ_{12} & 1 \end{matrix}) (\begin{matrix} e^{ω} & 0 \\ 0 & e^{- ω} \end{matrix}) (\begin{matrix} 1 & ψ_{12} \\ 0 & 1 \end{matrix}), \end{matrix}

where

ψ_{12}

is the coefficient in the least squares projection of

u_{2}

u_{1}

, and

σ^{2}

and

ω

the geometric mean of their variances and the natural log of the ratio of the standard deviations of these shocks, respectively, under the maintained assumption that they are uncorrelated.8 Let

θ^{†} = {(γ, α, β, μ_{0}, μ_{1}, μ_{2}, ω, σ^{2})}^{'}

denote the vector of structural parameters implied by (8) under the restriction

ψ_{12} = 0

. Using the expressions for the Jacobian linking

θ^{†}

and

θ

in (A17), we can then show under standard regularity conditions that:

Proposition 1

Let $(τ_{1}, τ_{2})$ and $(σ_{z_{1}}^{2}, σ_{z_{2}}^{2}, σ_{z_{1} z_{2}})$ denote the means, variances and covariance of $z_{1}$ and $z_{2}$ . If Assumption 1 holds, then:

The difference between the asymptotic covariance matrices of the OLS and MD estimators of $θ^{†}$ , ${\hat{θ}}_{LS}^{†}$ and ${\hat{θ}}_{MD}^{†}$ , respectively, is positive semidefinite of rank 1 at most, with a basis for its image given by
9
$\begin{matrix} \{- [μ_{0} + (τ_{2} - σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} τ_{1}) μ_{2}], 1, - μ_{1} + σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} μ_{2}, 0_{1 \times 5}\}, \end{matrix}$ and a basis for its kernel by
10
$\begin{matrix} [1, μ_{0} + (τ_{2} - σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} τ_{1}) μ_{2}, 0, 0_{1 \times 5}], \end{matrix}$
11
$\begin{matrix} [μ_{1} + σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} μ_{2}, 0, μ_{0} + (τ_{2} - σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} τ_{1}) μ_{2}, 0_{1 \times 5}] \end{matrix}$ and
12
$\begin{matrix} (0_{5 \times 3}, I_{5}) . \end{matrix}$
The difference between the asymptotic covariance matrices of the IV and MD estimators of $θ^{†}$ , ${\tilde{θ}}_{IV}^{†}$ and ${\hat{θ}}_{MD}^{†}$ , respectively, is positive semidefinite of rank 1 at most, with the same basis for image and kernel.
The difference between the asymptotic covariance matrices of the OLS and IV estimators of $θ^{†}$ , ${\hat{θ}}_{LS}^{†}$ and ${\tilde{θ}}_{IV}^{†}$ , respectively, is positive/negative semidefinite of rank 1 depending of condition (3), with exactly the same basis for image and kernel.

This proposition considerably sharpens the results in Arellano (1989a) for the special case of spherically symmetric disturbances by showing that the asymptotic efficiency gains concentrate in a single linear combination of the parameters of the first equation $γ$ , $α$ and $β$ given by (9). In contrast, any other linear combination of the parameters orthogonal to this one does not generate any efficiency gains. Specifically, the parameters of the second equation and the residual variances are estimated just as efficiently by the three procedures.

Semiparametric estimation and efficiency bounds

The optimal instruments theory of Chamberlain (1987) implies that Arellano (1989a) MD estimator achieves the SP efficiency bound which exploits the correct specification of the conditional mean and variance functions for $y_{1}$ and $y_{2}$ in the reduced form model (2) when the joint third- and fourth-order cumulants of $u_{1}$ and $u_{2}$ conditional on $z_{1}$ and $z_{2}$ are constant. However, if this last maintained assumption is true, then one can in principle obtain an even more efficient MD estimator of the model parameters after augmenting it with equations for the third- and fourth-order cumulants of the reduced-form residuals under the assumption that the joint cumulants of $u_{1}$ and $u_{2}$ conditional on $z_{1}$ and $z_{2}$ are constant up to the eighth-order.

In fact, the results in Bickel et al. (1993) allow us to obtain the SP efficiency bound which exploits that the joint distribution of $u_{1}$ and $u_{2}$ is independent of $z_{1}$ and $z_{2}$ . Moreover, we can also consider a restricted version of this SP bound under the maintained assumption that (7) holds, as in Hodgson and Vorkink (2003), which will be bigger in the usual positive semidefinite sense. Henceforth, we shall refer to this bound and its associated estimator by the abbreviation SS, reserving SP for the one which does not impose sphericity.

An interesting question in this context is the possibility that some but not all of the parameters of model (2) can be partially adaptively estimated, in the sense that their SP estimators are as asymptotically efficient as the infeasible ML estimators which exploit the information of the true distribution of the shocks, including the values of their shape parameters. The following proposition provides a precise answer to this question under sphericity for the restricted estimators that impose $σ_{12} = 0$ :

Proposition 2

If Assumption 1 holds, then:

The difference between the asymptotic covariance matrices of the restricted SS and infeasible ML estimators of $θ^{†}$ , ${\hat{θ}}_{SS}^{†}$ and ${\hat{θ}}_{ML}^{†} (\bar{η})$ , respectively, is positive semidefinite of rank 1 at most, with a basis for its image given by $(0_{1 \times 7}, 1)$ , and a basis for its kernel by $(I_{7}, 0_{7 \times 1})$ .
The difference between the asymptotic covariance matrices of the restricted SP and infeasible ML estimators of $θ^{†}$ , ${\hat{θ}}_{SP}^{†}$ and ${\hat{θ}}_{ML}^{†} (\bar{η})$ , respectively, is positive semidefinite of rank 5 at most, with a basis for its image given by $(1, 0_{1 \times 7}), (0, - 1, μ_{1} + σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} μ_{2}, 0_{1 \times 5})$ , $(0_{1 \times 3}, 1, 0_{1 \times 4})$ and $(0_{2 \times 6}, I_{2})$ , and a basis for its kernel by $(0, μ_{1} + σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} μ_{2}, 1, 0_{1 \times 5})$ and $(0_{2 \times 4}, I_{2}, 0_{2 \times 2})$ .
The difference between the asymptotic covariance matrices of the MD and SP estimators of $θ^{†}$ , ${\hat{θ}}_{MD}^{†}$ and ${\hat{θ}}_{SP}^{†}$ , respectively, is positive semidefinite of rank 4 at most, with a basis for its image given by $(0_{2 \times 1}, I_{2}, 0_{2 \times 5})$ and $(0_{2 \times 4}, I_{2}, 0_{2 \times 2})$ , and a basis for its kernel by $(0_{2 \times 6}, I_{2})$ , $(1, μ_{0} + μ_{1} τ_{1} + μ_{2} τ_{2}, τ_{1}, 0_{1 \times 5})$ and $(0_{1 \times 3}, 1, τ_{1}, τ_{2}, 0_{1 \times 2})$ .

The first part of the proposition implies that all the structural model parameters except the overall residual scale $σ^{2}$ can be (partially) adaptively estimated by the SS estimator, as expected from Proposition 12 in Fiorentini and Sentana (2021).

More interestingly, the second part of the proposition implies that in addition to $μ_{1}$ and $μ_{2}$ , the coefficient of the linear projection of $y_{1}$ onto a constant and $z_{1}$ , which is given by $\begin{matrix} β + (μ_{1} + σ_{z_{1} z_{2}} σ_{z_{1}}^{- 2} μ_{2}) α, \end{matrix}$ will be adaptively estimated by the restricted SP estimator. In this respect, a very important by-product of this proposition is that the model parameters that can be partially adaptively estimated often continue to be consistently estimated under distributional misspecification of the innovations, as shown by Fiorentini and Sentana (2019, 2021) in the context of multivariate location-scale models such as (1)–(2). We will revisit this issue in the Monte Carlo section.

Finally, the last part of the proposition says that the variances of the structural-form residuals, as well as the intercepts in the reduced-form regressions of $y_{1}$ and $y_{2}$ on a constant and the demeaned values of $z_{1}$ and $z_{2}$ , which are given by $γ + τ_{1} (β + α μ_{1}) + τ_{2} (α μ_{2})$ and $μ_{0} + τ_{1} μ_{1} + τ_{2} μ_{2}$ , respectively, are asymptotically equally efficiently estimated by the MD and SP estimators. More importantly, it also says that the efficiency gains are concentrated in the four slope coefficients of the two structural equations.

It would be tedious but otherwise straightforward to extend Propositions 1 and 2 to the case in which the distribution of the shocks conditional on $z_{1}$ and $z_{2}$ is not spherical as a function of the four third-order and five fourth-order cumulants of $u_{1}$ and $u_{2}$ . In fact, there is one important instance in which those higher-order cumulants would be unnecessary for the comparisons. Specifically, we can use Proposition 13.2 in Fiorentini and Sentana (2021) to prove that, subject to regularity, both the parameters of the unrestricted covariance matrix of the reduced-form residuals and the intercepts in the reduced-form regressions of $y_{1}$ and $y_{2}$ on a constant and the demeaned values of $z_{1}$ and $z_{2}$ will be as efficiently estimated by the IV estimator and the unrestricted SP estimator, while the slopes will always be adaptively estimated, just as in the second part of Proposition 2. The reason is twofold. First, the information matrix, the feasible parametric efficiency bound, the SP bound, and the usual Gaussian sandwich formula become block-diagonal between those reduced-form parameters and the four structural slope coefficients $α$ , $β$ , $μ_{1}$ and $μ_{2}$ . In turn, this block diagonality leads to a saddle-point characterisation of the asymptotic efficiency of the SP estimator of $θ$ , with the slope coefficients being adaptive and the others only reaching the efficiency of the Gaussian PMLE.

Efficiency gains from the equality constraint

It is also of interest to analyse the effects of imposing the covariance restriction $σ_{12} = 0$ on the different estimators we have considered:

Proposition 3

If Assumption 1 holds, then:

The difference between the asymptotic covariance matrices of the unrestricted and restricted infeasible ML estimators of $θ^{†}$ , ${\tilde{θ}}_{ML}^{†}$ and ${\hat{θ}}_{ML}^{†}$ , respectively, is positive semidefinite of rank 1 at most, with the basis for its image given by (9), and a basis for its kernel by (10), (11) and (12).
The difference between the asymptotic covariance matrices of the unrestricted and restricted SS estimators of $θ^{†}$ , ${\tilde{θ}}_{SS}^{†}$ and ${\hat{θ}}_{SS}^{†}$ , respectively, is positive semidefinite of rank 1 at most, with the basis for its image given by (9), and a basis for its kernel by (10), (11) and (12).
The difference between the asymptotic covariance matrices of the unrestricted and restricted SP estimators of $θ^{†}$ , ${\tilde{θ}}_{SP}^{†}$ and ${\hat{θ}}_{SP}^{†}$ , respectively, is positive semidefinite of rank 1 at most, with the basis for its image given by (9), and a basis for its kernel by (10), (11) and (12).

Therefore, when one uses “efficient” estimators, the imposition of the valid equality constraint $σ_{12} = 0$ always leads to (weak) efficiency gains for exactly the same linear combination of the parameters of the first structural equation for which optimal MD leads to an efficiency gain relative to both OLS and IV. In fact, it is straightforward to generalise (a) so that it applies to the feasible parametric ML estimators of $θ^{†}$ which simultaneously estimate the finite vector of shape parameters $η$ , as well as to the ML estimators of these parameters themselves. This is in contrast to the seemingly counterintuitive result in Arellano (1989a), which simply reflects the fact that OLS does not use the optimal MD weighting matrix in the non-normal case.

Finite mixtures as sieves

Finally, we study the extent to which PMLEs based on finite mixtures of normals with an increasing number of components could constitute a proper sieves-type SP procedure, as we argued in the introduction.

We do so first when the standardised shocks to model (2) conditional on $z_{1}$ and $z_{2}$ follow a bivariate Student t with 0 means, unit standard deviations, no correlation and 5 degrees of freedom but whose parameters are estimated by finite scale mixture-based log-likelihood functions with $K = 2, 3$ and 4 components. For comparison purposes, we consider four different benchmarks that impose the restriction $σ_{12} = 0$ : (i) the MLE based on the correctly specified log-likelihood function that fixes the number of degrees of freedom to 5, (ii) the SS estimator, (ii) the OLS estimator, and (iv) the optimal MD estimator.

We compute the expected value of the Hessian and outer product of the score of the scale mixture-based PMLEs by means of large sample averages of the analytical expressions in Fiorentini and Sentana (2021) evaluated at the true values of the mean and variance parameters in $θ$ and the pseudo-true values of the shape parameters, which we numerically obtain from samples of millions of simulated observations.

The results, which we report in Table 1, show that the scale mixture-based PMLEs of all the model parameters except the overall residual scale $σ^{2}$ quickly approach the asymptotic efficiency of the infeasible MLE despite the fact that no finite scale mixture of normals can approximate the unbounded higher-order moments, tail behaviour or nonlinear tail dependence of a multivariate Student t. In fact, although panel (a) in Fig. 3 of Gallant and Tauchen (1999) clearly illustrates that a more complex misspecified model does not necessarily lead to more efficient estimators because one is not simply adding new elements to the score, but also changing the pseudo-true values of the shape parameters at which one evaluates the original components of the score, we find that the efficiency improvements occur monotonically.9 As a result, it seems that the covariance matrix of the errors in the least squares projection of the scores of the true model onto the scores of the mixture-based log-likelihood becomes smaller and smaller as K increases (see Proposition 7 in Calzolari et al. 2004).

Table 1. Asymptotic variances of alternative estimators

Parameter	OLS	PML– ${SMN}_{K}$
		$K = 2$	$K = 3$	$K = 4$	SS	ML	MD
Mean parameters of equation 1a
$γ$	1.268	0.931	0.905	0.902	0.901	0.901	1.201
$α$	1.500	0.782	0.731	0.725	0.723	0.723	1.125
$β$	1.000	0.656	0.631	0.627	0.627	0.627	0.875
Mean parameters of equation 1b
$μ_{0}$	1.000	0.792	0.775	0.772	0.771	0.771	1.000
$μ_{1}$	0.333	0.264	0.258	0.257	0.257	0.257	0.333
$μ_{2}$	0.333	0.264	0.258	0.257	0.257	0.257	0.333
(Reparametrised) variance parameters of structural innovations
$ω$	3.000	1.493	1.313	1.290	1.286	1.286	3.000
$σ^{2}$	0.833	0.833	0.833	0.833	0.833	0.300	0.833

Notes: DGP for structural innovations: bivariate Student t with 0 means, unit standard deviations, no correlation and 5 degrees of freedom. Parameter values: $γ = 0.204$ , $α = β = 0.398$ , $μ_{0} = 0.155$ , $μ_{1} = μ_{2} = 0.577$ , $σ_{1}^{2} = 1 / 2$ , $σ_{2}^{2} = 1 / 3$ , $μ_{z_{1}} = μ_{z_{2}} = 1$ , $σ_{z_{1}}^{2} = σ_{z_{2}}^{2} = 1$ , and $σ_{z_{1} z_{2}} = 0$ . OLS denotes the usual ordinary least squares estimator, PML– ${SMN}_{K}$ denotes Pseudo-ML based on a bivariate scale mixture of K normals, SS denotes the spherically symmetric SP estimator, ML denotes MLE which exploit the information of the true distribution of the shocks, including the degrees of freedom, and MD denotes the optimum minimum distance estimator. We compute the expected value of the Hessian and variance of the score of the finite mixture-based PMLEs by means of large sample averages of the analytical expressions in Fiorentini and Sentana (2021) evaluated at the true values of the mean and variance parameters in $θ$ and the pseudo-true values of the shape parameters, which we numerically obtain from samples of millions of simulated observations

In contrast, the asymptotic variances of the scale mixture-based PMLEs of $σ^{2}$ coincides with the asymptotic variances of the OLS estimators irrespective of the number of components, which reflects (i) the block diagonality of the different asymptotic covariance matrices in Proposition 12.2 of Fiorentini and Sentana (2021) because the determinant of (8) is precisely $σ^{4}$ , and (ii) the fact that the ML estimators of the mean in a scale mixture of K gammas is numerically the same regardless of K, as explained in Fiorentini and Sentana (2023).

We then conduct a similar exercise when $u_{1}$ and $u_{2}$ conditional on $z_{1}$ and $z_{2}$ follow a bivariate asymmetric Student t with 0 means, unit standard deviations, no correlation, negative tail dependence and the same $μ_{22}$ as in the symmetric case. We estimate the unrestricted model parameters using general finite mixture-based log-likelihood functions with $K = 2, 3$ and 4 components, and consider as benchmarks the following three unrestricted estimators: infeasible MLE, SP, and IV. In this case, we compute the expected value of the Hessian and outer product of the score of the mixture-based PMLEs using large sample averages of the theoretical expressions in Amengual et al. (2023) evaluated at the true values of the mean and variance parameters and the pseudo-true values of the shape parameters obtained from very large samples of simulated observations.

The results we report in Table 2 show that the mixture-based PMLEs of the slope parameters approach the asymptotic efficiency of the infeasible MLE despite the fact that no finite mixture of normals can approximate the unbounded higher-order moments, tail behaviour or nonlinear tail dependence of a multivariate asymmetric Student t. Again, we find that the efficiency improvements occur monotonically. In contrast, the asymptotic variances of the mixture-based PMLEs of the intercepts and covariance matrix of the reduced form in mean-deviation form coincide with the asymptotic variances of the corresponding IV estimators irrespective of the number of components, which reflects the fact that the ML estimators of the mean vector and covariance matrix in mixtures of K normals are numerically the same for any $K \geq 1$ (see also the discussion at the end of Sect. 3.2).

Table 2. Asymptotic variances of alternative estimators

Parameter	IV	PML MN $_{K}$
		$K = 2$	$K = 3$	$K = 4$	SP	ML
Slope parameters of equation 1a
$α$	1.502	1.320	1.301	1.300	1.296	1.296
$β$	1.000	0.879	0.867	0.865	0.863	0.863
Slope parameters of equation 1b
$μ_{1}$	0.333	0.259	0.252	0.251	0.251	0.251
$μ_{2}$	0.333	0.259	0.252	0.251	0.251	0.251
(Reparametrised) reduced form intercepts
$E (y_{1})$	0.553	0.553	0.553	0.553	0.553	0.499
$E (y_{2})$	0.333	0.333	0.333	0.333	0.333	0.299
Reduced form variance parameters
$ω_{11}$	1.803	1.803	1.803	1.803	1.803	0.796
$ω_{22}$	0.950	0.950	0.950	0.950	0.950	0.308
$ω_{12}$	0.815	0.815	0.815	0.815	0.815	0.229

Notes: DGP for structural innovations: bivariate asymmetric Student t with 0 means, unit standard deviations, no correlation and shape parameters $ν = 9.65$ and $b_{i} = - 1$ . Parameter values: $γ = 0.204$ , $α = β = 0.398$ , $μ_{0} = 0.155$ , $μ_{1} = μ_{2} = 0.577$ , $σ_{1}^{2} = 1 / 2$ , $σ_{2}^{2} = 1 / 3$ , $μ_{z_{1}} = μ_{z_{2}} = 1$ , $σ_{z_{1}}^{2} = σ_{z_{2}}^{2} = 1$ , and $σ_{z_{1} z_{2}} = 0$ . IV denotes the usual instrumental variables estimator, PML– ${MN}_{K}$ denotes Pseudo-ML based on a bivariate mixture of K normals, SP denotes the semiparametric estimator, ML denotes MLE which exploit the information of the true distribution of the shocks, including the degrees of freedom. Moreover, $E (y_{1})$ and $E (y_{2})$ are short-hand for $γ + τ_{1} (β + α μ_{1}) + τ_{2} (α μ_{2})$ and $μ_{0} + τ_{1} μ_{1} + τ_{2} μ_{2}$ , respectively. We compute the expected value of the Hessian and variance of the score of the finite mixture-based PMLEs using large sample averages of the theoretical expressions in Amengual et al. (2023) evaluated at the true values of the mean and variance parameters and the pseudo-true values of the shape parameters obtained from very large samples of simulated observations

Monte Carlo analysis

In previous sections, we have derived several asymptotic results regarding the relative efficiency of the LS, IV and MD estimators, as well as the finite mixture-based PMLEs, the SS estimators, and the feasible and infeasible MLEs. In this section, in contrast, we make use of an extensive Monte Carlo simulation exercise to asses their small sample behaviour.

Design

We consider three different parameter configurations:

$μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- \frac{1}{2}} = 1 / \sqrt{2} ≃ 0.71$ , which is such that the IV and OLS estimators of $α$ and $β$ have the same asymptotic efficiency (see the solid line in Fig. 1);

$μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = 2^{- \frac{1}{2}} \sqrt{μ_{22} / (μ_{22} - 1)} = \sqrt{3} / 2 ≃ 0.87$ , which corresponds to the dotted line in Fig. 1; and

$μ_{22} = 7 / 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- \frac{1}{2}} = \sqrt{3} / 2 ≃ 0.87$ , which is another case of equal efficiency of IV and OLS, but with lower co-kurtosis.10

As for the distribution of the structural shocks, we consider four non-Gaussian possibilities in which

(u_{1}, u_{2})

follow a:

Student t distribution with $ν = 5$ or $ν = 5.5$ degrees of freedom corresponding to $μ_{22} = 3$ and $μ_{22} = 7 / 3$ , respectively;
Scale mixture of two normals in which the higher variance component has probability $λ = . 05$ and the ratio of the variances is either $ϰ = 0.094$ or $ϰ = 0.122$ corresponding to $μ_{22} = 3$ and $μ_{22} = 7 / 3$ , respectively;
Asymmetric Student t distribution with negative tail dependence ${b = (- 1, - 1)}^{^{'}}$ but degrees of freedom $ν = 9.65$ or $ν = 10.38$ , respectively;
Location-scale mixture of two normals in which the higher variance component has probability $λ = . 05$ , $μ_{22}$ is as in 1., and the marginal skewness of $u_{1}$ and $u_{2}$ is as in 3., which is achieved with $\begin{matrix} δ = (\begin{matrix} - 1.01 \\ - 1.06 \end{matrix}) or δ = (\begin{matrix} - 1.16 \\ - 1.24 \end{matrix}) and ℵ_{L} = (\begin{matrix} 0.32 & 0 \\ 0 & 0.32 \end{matrix}) or (\begin{matrix} 0.38 & 0 \\ 0 & 0.38 \end{matrix}), \end{matrix}$ respectively (see Appendix D for further details on this parametrisation).

For illustrative purposes, we display the joint densities and contours for standardised versions of these distributions in comparison with the bivariate spherical Gaussian distribution in Figs. 4 and 5 for the spherically symmetric and general cases, respectively.

[See PDF for image]

Fig. 4

Monte Carlo spherical data generating processes versus Gaussian distribution. Notes: In all panels, $E (u_{1 i}^{*}) = E (u_{2 i}^{*}) = 0$ , $V (u_{1 i}^{*}) = V (u_{2 i}^{*}) = 1$ , and $c o v (u_{1 i}^{*}, u_{2 i}^{*}) = 0$ . Panels c–d: Student t distribution with $ν = 5$ degrees of freedom. Panels e–f: Scale mixture of two normals with scale parameter $ϰ = 0.09$ and mixing probability $λ = 0.05$

[See PDF for image]

Fig. 5

Monte Carlo non-spherical data generating processes versus Gaussian distribution. Notes: In all panels, $E (u_{1 i}^{*}) = E (u_{2 i}^{*}) = 0$ , $V (u_{1 i}^{*}) = V (u_{2 i}^{*}) = 1$ , and $c o v (u_{1 i}^{*}, u_{2 i}^{*}) = 0$ . Panels c-d: Asymmetric Student t density with $ν = 9.65$ degrees of freedom, skewness parameters $b_{i} = - 1$ . Panels e-f: Location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.01, 1.06)}^{'}$ and scale parameter $ϰ = 0.32$ (see Appendix D for details)

In all simulated samples, the exogenous variables $z = {(z_{1}, z_{2})}^{'}$ are generated according to a bivariate Student t distribution with 8 degrees of freedom with mean vector $τ = {(1, 1)}^{'}$ and an identity variance covariance matrix.11

Next, for each choice of the partial correlation $ρ_{y_{2} z_{2} . z_{1}}$ mentioned above, we choose $\begin{matrix} R_{2}^{2} = \frac{2 ρ_{y_{2} z_{2} . z_{1}}}{1 + ρ_{y_{2} z_{2} . z_{1}}} and ρ_{y_{2} z_{1}} = ρ_{y_{2} z_{2}} = \sqrt{\frac{R_{2}^{2} - ρ_{y_{2} z_{2} . z_{1}}^{2}}{1 - ρ_{y_{2} z_{2} . z_{1}}}}, \end{matrix}$ which guarantees that (i) $ρ_{y_{2} z_{2} . z_{1}}^{2} \leq R_{2}^{2} \leq 1$ , and (ii) the two slope coefficients of the second equation coincide. If we fix the variance of both $y_{1}$ and $y_{2}$ to 1 without loss of generality, these restrictions implicitly determine the variance of the error term of the second equation as $σ_{2}^{2} = 1 -$ $R_{2}^{2}$ . We also impose the same balancing restriction on the slopes of the first equation by choosing $\begin{matrix} α = β = \sqrt{\frac{(1 + ρ_{y_{2} z_{1}}) R_{1}^{2}}{2} .} \end{matrix}$ Then, we fix $R_{1}^{2}$ to 0.5, which implies $σ_{1}^{2} = 1 / 2$ , an arbitrary choice that simply scales the asymptotic variances of all the different estimators of $α$ and $β$ by the same amount $(1 - R_{1}^{2})$ .12 Finally, we choose the values of the intercepts $γ$ and $μ_{0}$ so that $E (y_{1}) = E (y_{2}) = 1$ (see Appendix C for further details).

Simulation results

We simulate 10,000 samples of length $N = 250$ and $N = 1000$ for each of the above designs. For each simulated sample, we compute the IV, LS and optimal MD estimators, together with unrestricted and restricted versions of PMLE estimators that use either a discrete mixture of two normals–UPML(mn) and RPML(mn)–or a Student t distribution–UPML(t) and RPML(t). In both cases, we simultaneously estimate the shape parameters. Finally, we also compute a two-step SS estimator that starting from the consistent OLS estimator, $θ_{LS}$ , carries out one BHHH iteration using the efficient spherically symmetric semiparametric score estimated nonparametrically. Specifically, we compute the standardised reduced form residuals $\begin{matrix} {\hat{v}}^{*} = {\hat{Ω}}^{- \frac{1}{2}} [y - μ (z_{1}, z_{2} ; θ_{LS})], \end{matrix}$ where ${\hat{Ω}}^{- \frac{1}{2}}$ denotes the inverse of the Cholesky decomposition of the sample covariance matrix of the reduced form residuals $[y - μ (z_{1}, z_{2} ; θ_{LS})]$ , define $\hat{ς} = {\hat{v}}^{*'} {\hat{v}}^{*}$ and estimate nonparametrically the density of $ζ = ς^{1 / 3}$ , $g (ζ)$ , and its derivative, $g^{'} (ζ)$ , using a Gaussian kernel with the usual Silverman (1986) “rule-of-thumb” bandwidth. The change of variable formula then yields $\begin{matrix} δ (ς) = \frac{- 2}{3 ζ^{2}} \frac{g^{'} (ζ)}{g (ζ)} + \frac{4}{3 ζ^{3}}, \end{matrix}$ which we use to compute the semiparametric efficient score using expression (C30) in the Supplemental Appendix C of Fiorentini and Sentana (2021) by subtracting $\begin{matrix} W_{s} (θ_{LS}) [δ (ς) \frac{ς}{2} - 1 - \frac{2}{4 κ + 2} (\frac{ς}{2} - 1)] \end{matrix}$ from the nonparametric score, where $κ$ denotes the coefficient of multivariate excess kurtosis (see Mardia (1970) for details) and $W_{s} (θ)$ is defined in Appendix A.5.

We display the finite sample results by means of the box plots in Figs. 6, 7, 8, 9, 10 and 11, which concentrate on $α$ and $β$ , the two parameters of interest. Figures 6, 7 and 8 show the Monte Carlo results for 250 observations for cases a., b. and c., respectively, while Figs. 9, 10 and 11 contain the results for 1000 observations in the same order.

[See PDF for image]

Fig. 6

Monte Carlo results: $T = 250$ , $μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- 1}$ . Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student t. DGPs: Panel a: Student t distribution with $ν = 5$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.09$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 9.65$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.01, 1.06)}^{'}$ and scale parameter $ϰ = 0.32$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a tie between IV and LS, we set $ρ_{y_{2} z_{2} . z_{1}} = 1 / \sqrt{2}$ so that $σ_{2}^{2} = 1 / 3$ and, therefore, $γ = 0.20$ , $α = β = 0.40$ , $μ_{0} = 0.15$ and $μ_{1} = μ_{2} = 0.58$

[See PDF for image]

Fig. 7

Monte Carlo results: $T = 250$ , $μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = \frac{1}{2} μ_{22} {(μ_{22} - 1)}^{- 1} .$ Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student t. DGPs: Panel a: Student t distribution with $ν = 5$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.09$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 9.65$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.01, 1.06)}^{'}$ and scale parameter $ϰ = 0.32$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a maximum relative efficiency of IV versus LS, we set $ρ_{y_{2} z_{2} . z_{1}} = \sqrt{3} / 2$ so that $σ_{2}^{2} = 1 / 7$ and, therefore, $γ = 0.22$ , $α = β = 0.39$ , $μ_{0} = 0.31$ and $μ_{1} = μ_{2} = 0.65$

[See PDF for image]

Fig. 8

Monte Carlo results: $T = 250$ , $μ_{22} = 7 / 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- 1}$ . Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student-t. DGPs: Panel a: Student t distribution with $ν = 11 / 2$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.12$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 10.38$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.16, 1.24)}^{'}$ and scale parameter $ϰ = 0.38$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a tie between IV and LS, we set $ρ_{y_{2} z_{2} . z_{1}} = \sqrt{3} / 2$ so that $σ_{2}^{2} = 1 / 7$ and, therefore, $γ = 0.22$ , $α = β = 0.39$ , $μ_{0} = 0.31$ and $μ_{1} = μ_{2} = 0.65$

[See PDF for image]

Fig. 9

Monte Carlo results: $T = 1000$ , $μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- 1}$ . Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student-t. DGPs: Panel a: Student t distribution with $ν = 5$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.09$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 9.65$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.01, 1.06)}^{'}$ and scale parameter $ϰ = 0.32$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a tie between IV and LS, we set $ρ_{y_{2} z_{2} . z_{1}} = 1 / \sqrt{2}$ so that $σ_{2}^{2} = 1 / 3$ and, therefore, $γ = 0.20$ , $α = β = 0.40$ , $μ_{0} = 0.15$ and $μ_{1} = μ_{2} = 0.58$

[See PDF for image]

Fig. 10

Monte Carlo results: $T = 1000$ , $μ_{22} = 3$ and $ρ_{y_{2} z_{2} . z_{1}} = \frac{1}{2} μ_{22} {(μ_{22} - 1)}^{- 1}$ . Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student t. DGPs: Panel a: Student t distribution with $ν = 5$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.09$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 9.65$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.01, 1.06)}^{'}$ and scale parameter $ϰ = 0.32$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a maximum relative efficiency of IV versus LS, we set $ρ_{y_{2} z_{2} . z_{1}} = \sqrt{3} / 2$ so that $σ_{2}^{2} = 1 / 7$ and, therefore, $γ = 0.22$ , $α = β = 0.39$ , $μ_{0} = 0.31$ and $μ_{1} = μ_{2} = 0.65$

[See PDF for image]

Fig. 11

Monte Carlo results: $T = 1000$ , $μ_{22} = 7 / 3$ and $ρ_{y_{2} z_{2} . z_{1}} = {(μ_{22} - 1)}^{- 1}$ . Notes: IV denotes the instrumental variables estimator, LS denotes the ordinary least squares estimator, MD denotes the optimum minimum distance estimator, UPML(mn) and RPML(mn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a mixture of two normals, UPML(smn) and RPML(smn) denote the restricted ( $σ_{12} = 0$ ) and unrestricted PML estimators based on a scale mixture of two normals, USS and RSS denote the restricted ( $σ_{12} = 0$ ) and unrestricted elliptically symmetric semiparametric estimators described in Sect. 3, while UPML(t) and RPML(t) denote the restricted ( $σ_{12} = 0$ ) and unrestricted feasible PML estimators based on a Student t. DGPs: Panel a: Student t distribution with $ν = 11 / 2$ degrees of freedom; Panel b: scale mixture of two normals with scale parameter $ϰ = 0.12$ and mixing probability $λ = 0.05$ ; Panel c: asymmetric Student t density with $ν = 10.38$ degrees of freedom, skewness parameters $b_{i} = - 1$ ; and Panel d: location-scale mixture of two normals with mixing probability $λ = 0.05$ , location vector $δ = - {(1.16, 1.24)}^{'}$ and scale parameter $ϰ = 0.38$ (see Appendix D for details). In all DGPs, we set $σ_{1}^{2} = 1$ so that $R_{1}^{2} = 1 / 2$ . In order to have a tie between IV and LS, we set $ρ_{y_{2} z_{2} . z_{1}} = \sqrt{3} / 2$ so that $σ_{2}^{2} = 1 / 7$ and, therefore, $γ = 0.22$ , $α = β = 0.39$ , $μ_{0} = 0.31$ and $μ_{1} = μ_{2} = 0.65$

Our findings indicate that OLS is better in finite samples than what the asymptotic theory suggests because the sample co-kurtosis coefficient is downward biased for $μ_{22}$ . In fact, the asymptotic efficiency of the IV estimator of $α$ relative to LS can only be observed in panels b and d of Fig. 10 when the sample length is large and the distribution of the shocks is either a scale or a general finite mixture of normals, which is when there seems to be a lower small sample bias for $μ_{22}$ .

They also confirm that optimal MD dominates both OLS and IV in finite samples, but the need to estimate third- and fourth-order multivariate cumulants to compute the optimal weighting matrix handicaps it somewhat (see Altonji and Segal (1996)) for analogous results in the context of optimal GMM estimators when the shocks are fat-tailed).

Our results also indicate that non-Gaussian PML based on a restrictive parametric distribution like the Student t or a discrete scale mixture of normals works well when the true distribution is spherical, but it generates inconsistencies otherwise when we impose the constraint $σ_{12} = 0$ . Notice, though, that the unrestricted estimators are always consistent for the slope parameters while the restricted estimators seem to be consistent for $β + μ_{1} α$ despite being inconsistent for both $α$ and $β$ , which is in line with our theoretical discussion following Proposition 2.

In turn, the performance of the two-step SS estimators is very similar to the performance of the corresponding parametric estimators, although their finite sample variances are larger than what the asymptotic theory predicts. Specifically, the consistency pattern of the restricted and unrestricted SS estimators is almost identical.

More importantly, we find that non-Gaussian PMLEs based on a flexible distribution like a general finite mixture of normals works well in practice regardless of the true distribution, systematically dominating MD. In addition, the version that imposes the valid covariance restriction $σ_{12} = 0$ is always more efficient than the unrestricted one.

Directions for further research

As we mentioned at the end of Sect. 3.2, it would be useful to generalise our theoretical results dropping the assumption of spherical symmetry. Similarly, and although we have seen that our proposed finite mixture-based PMLEs get close to achieving the SP efficiency bound both under sphericity and in general, an obvious extension of our Monte Carlo experiments would be to consider standard two-step SP estimators that starting from a consistent estimator such as OLS carry out one BHHH iteration using the efficient SP score estimated nonparametrically without imposing spherical symmetry. The curse of dimensionality in estimating multivariate densities, though, might further reduce the theoretical advantages of this method in finite samples.

Another worthwhile exercise would be to extend the analysis in this paper to the general simultaneous equation model with an arbitrary numbers of endogenous variables and instrumental ones considered by Arellano (1989a). Aside from involving more complex analytical expressions than in the bivariate example we have considered, the main practical complication would be that the number of free parameters of a standardised multivariate mixture increases with the square of the cross-sectional dimension, as we explain in Appendix D.

Last, but not least, deriving a formal result that shows that finite Gaussian-mixture-based PMLEs may provide a proper sieve ML estimator when the number of components increases at a suitable rate constitutes a particularly interesting avenue for further research.

Surprisingly, Arellano (1989a), which should be mentioned in all graduate econometric textbooks, has received very few citations: Pollock (1988), Islam (1993), Monés and Ventura (1996), Calzolari et al. (2004), and Sentana (2005), plus a handful of self-citations, and two more which really meant to cite Arellano, (1989b).

See Fiorentini and Sentana (2022) for a related discussion in the context of structural Vars.

Wu (1973) compared OLS with IV in linear single equation models to assess regressor exogeneity unaware that Durbin (1954) had already suggested this. Hausman (1978) provided a procedure with far wider applicability.

The plot would be the mirror image of Fig. 1 for negative values.

As we shall see in Proposition 1 below, though, this special case is such that, asymptotically, the difference between the IV and OLS estimators affects $α$ exclusively.

Again, Proposition 1 below implies that the differences in asymptotic variances between the MD, IV and OLS estimators affect $α$ exclusively in the special case in which the (squared) partial correlation of $y_{2}$ and $z_{2}$ given $z_{1}$ coincides with the $R^{2}$ in the regression of $y_{2}$ on $z_{1}$ and $z_{2}$

On the other hand, if the shape parameters of the mixture are fixed, then Theorem 7 in Gourieroux (1984) guarantees the inconsistency of the resulting estimators except in the Gaussian limiting case.

More generally, $σ^{2} = \sqrt{σ_{1}^{2} σ_{2}^{2} - σ_{12}^{2}}$ and $ω = ln (σ_{1} / \sqrt{σ_{2}^{2} - σ_{12}^{2} / σ_{1}^{2}})$ .

In this respect, the efficiency gains of any $K > 1$ relative to $K = 1$ should be easy to prove formally because the ML estimators of the unconditional mean of the mixture of gamma random variables underlying the scale mixture model coincide regardless of K.

We do not consider the case in which $μ_{22} = 7 / 3$ and $ρ_{y_{2} z_{2} . z_{1}} = . 5 \sqrt{μ_{22} / (μ_{22} - 1)}$ because the efficiency of IV relative to OLS for $α$ is just 1.02 in that case.

Notice that the choice of $σ_{z_{1} z_{2}} = 0$ considerably simplifies some of the eigenvectors in Propositions 1, 2 and 3. For example the linear combination that according to Proposition 2.b can be adaptively estimated by the SP estimator and consistently estimated by a distributionally misspecified ML estimator becomes $β + μ_{1} α$ .

In design a., we then have $R_{2}^{2} = 2 / 3$ , $σ_{2}^{2} = 1 / 3$ , $γ = 0.20$ , $α = β = 0.40$ , $μ_{0} = 0.16$ , and $μ_{1} = μ_{2} = 0.58$ . In turn, in designs b and c., $R_{2}^{2} = 6 / 7$ , $σ_{2}^{2} = 1 / 7$ , $γ = 0.22$ , $α = β = 0.39$ , $μ_{0} = - 0.31$ , and $μ_{1} = μ_{2} = 0.66$ .

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Altonji, JG; Segal, LM. Small-sample bias in GMM estimation of covariance structures. J Bus Econ Stat; 1996; 14, pp. 353-366.

Amengual, D; Fiorentini, G; Sentana, E. Dolado, JJ; Gambetti, L; Matthes, C. Tests for random coefficient variation in vector autoregressive models. Essays in honour of Fabio Canova, advances in econometrics; 2022; Bingley, Emerald: pp. 1-35.44B

Amengual D, Fiorentini G, Sentana E (2023) Information matrix tests for Gaussian mixtures and switching regression models. Mimeo, CEMFI

Arellano M (1989a) On the efficient estimation of simultaneous equations with covariance restrictions. J Econ 42:247–265

Arellano M (1989b) An efficient GLS estimator of triangular models with covariance restrictions. J Econ 42:267–273

Bahadur, RR. Examples of inconsistency of maximum likelihood estimates. Sankhya; 1958; 20, pp. 207-210.

Basu, D. An inconsistency of the method of maximum likelihood. Ann Math Stat; 1955; 26, pp. 144-145. [DOI: https://dx.doi.org/10.1214/aoms/1177728606]

Berkson, J. Minimum chi-square, not maximum likelihood!. Ann Stat; 1980; 8, pp. 457-487. [DOI: https://dx.doi.org/10.1214/aos/1176345003]

Bickel, PJ; Klaassen, CAJ; Ritov, Y; Wellner, JA. Efficient and adaptive estimation for semiparametric models; 1993; Baltimore, Johns Hopkins:

Calzolari, G; Fiorentini, G; Sentana, E. Constrained indirect estimation. Rev Econ Stud; 2004; 71, pp. 945-973. [DOI: https://dx.doi.org/10.1111/0034-6527.00310]

Chamberlain, G. Asymptotic efficiency in estimation with conditional moment restrictions. J Econ; 1987; 34, pp. 305-334. [DOI: https://dx.doi.org/10.1016/0304-4076(87)90015-7]

Cramér, H. Mathematical methods of statistics; 1946; Princeton, Princeton University Press:

Durbin, J. Errors in variables. Rev Int Stat Inst; 1954; 22, pp. 23-32. [DOI: https://dx.doi.org/10.2307/1401917]

Fisher, RA. On the mathematical foundations of theoretical statistics. Philos Trans R Soc Lond Ser A; 1922; 222, pp. 309-368. [DOI: https://dx.doi.org/10.1098/rsta.1922.0009]

Fisher, RA. Theory of statistical estimation. Proc Camb Philos Soc; 1925; 22, pp. 700-725. [DOI: https://dx.doi.org/10.1017/S0305004100009580]

Fiorentini, G; Sentana, E. Consistent non-Gaussian pseudo maximum likelihood estimators. J Econ; 2019; 213, pp. 321-358. [DOI: https://dx.doi.org/10.1016/j.jeconom.2019.05.017]

Fiorentini, G; Sentana, E. Specification tests for non-Gaussian maximum likelihood estimators. Quant Econ; 2021; 12, pp. 683-742. [DOI: https://dx.doi.org/10.3982/QE1406]

Fiorentini, G; Sentana, E. Discrete mixtures of normals pseudo maximum likelihood estimators of structural vector autoregressions. J Econ; 2022; [DOI: https://dx.doi.org/10.1016/j.jeconom.2022.02.010]

Fiorentini G, Sentana E (2023) Consistent estimation with finite mixtures. mimeo, CEMFI

Gallant, AR; Tauchen, G. The relative efficiency of method of moments estimators. J Econ; 1999; 92, pp. 149-172. [DOI: https://dx.doi.org/10.1016/S0304-4076(98)00088-8]

Gauss, CF. Theoria motus corporum coelestium; 1809; Gotha, Perthes:

Gouriéroux, C; Monfort, A; Trognon, A. Pseudo maximum likelihood methods: theory. Econometrica; 1984; 52, pp. 681-700. [DOI: https://dx.doi.org/10.2307/1913471]

Hansen, LP. Large sample properties of generalized method of moments estimators. Econometrica; 1982; 50, pp. 1029-1054. [DOI: https://dx.doi.org/10.2307/1912775]

Hausman, J. Specification tests in econometrics. Econometrica; 1978; 46, pp. 1273-1291. [DOI: https://dx.doi.org/10.2307/1913827]

Hodgson and Vorkink. Efficient estimation of conditional asset pricing models. J Bus Econ Stat; 2003; 21, pp. 269-283. [DOI: https://dx.doi.org/10.1198/073500103288618954]

Huber PJ (1967) The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the V Berkeley symposium in mathematical statistics and probability, vol 1. University of California Press, pp 221–233

Islam N (1993) Estimation of dynamic models from panel data , unpublished Ph.D. dissertation. Harvard University

Kraft, CH; Le Cam, LM. A remark on the roots of the maximum likelihood equation. Ann Math Stat; 1956; 27, pp. 1174-1177. [DOI: https://dx.doi.org/10.1214/aoms/1177728087]

Laplace P-S (1774) Mémoire sur la probabilité des causes par les évènements. Mém Acad R Sci Presentes par Divers Savan 6:621–656

Legendre A-M (1805) Nouvelles méthodes pour la dé termination des orbites des comètes, F. Didot

Magnus, JR; Neudecker, H. Matrix differential calculus with applications in statistics and econometrics; 2019; 3 New Jersey, Wiley: [DOI: https://dx.doi.org/10.1002/9781119541219]

Malinvaud, E. Statistical methods in econometrics; 1970; 2 North Holland, Elsevier:

Mardia, KV. Measures of multivariate skewness and kurtosis with applications. Biometrika; 1970; 57, pp. 519-530. [DOI: https://dx.doi.org/10.1093/biomet/57.3.519]

Monés, MA; Ventura, E. Saving decisions and fiscal incentives. Appl Econ; 1996; 28, pp. 1105-1117. [DOI: https://dx.doi.org/10.1080/000368496327958]

Newey, WK; Steigerwald, DG. Asymptotic bias for quasi-maximum-likelihood estimators in conditional heteroskedasticity models. Econometrica; 1997; 65, pp. 587-99. [DOI: https://dx.doi.org/10.2307/2171754]

Neyman, J; Scott, EL. Consistent estimation from partially consistent observations. Econometrica; 1948; 16, pp. 1-32. [DOI: https://dx.doi.org/10.2307/1914288]

Nguyen, TT; Nguyen, HD; Chamroukhi, F; McLachlan, GJ. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Math Stat; 2020; 7, 1750861. [DOI: https://dx.doi.org/10.1080/25742558.2020.1750861]

Pearson, K. Contributions to the mathematical theory of evolution. Philos Trans R Soc Lond Ser A; 1894; 185, pp. 71-110. [DOI: https://dx.doi.org/10.1098/rsta.1894.0003]

Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London Edinburgh Dublin Philos Mag J Sci Ser; 1900; 5, 50 pp. 157-175. [DOI: https://dx.doi.org/10.1080/14786440009463897]

Pollock, DSG. The estimation of linear stochastic models with covariance restrictions. Economet Theor; 1988; 4, pp. 403-427. [DOI: https://dx.doi.org/10.1017/S0266466600013372]

Rao, CR. Information and the accuracy attainable in the estimation of statistical parameters. Bull Calcutta Math Soc; 1945; 37, pp. 81-89.

Rao CR (1961) Asymptotic efficiency and limiting information. In: Proceedings of the IV Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 531–546

Rothenberg TJ (1973) Efficient estimation with a priori information. Cowles foundation monograph 23, Yale

Sentana, E. Least squares predictions and mean-variance analysis. J Financ Economet; 2005; 3, pp. 56-78. [DOI: https://dx.doi.org/10.1093/jjfinec/nbi002]

Silverman, BW. Density estimation for statistics and data analysis; 1986; London, Chapman & Hall:

White, H. Maximum likelihood estimation of misspecified models. Econometrica; 1982; 50, pp. 1-25. [DOI: https://dx.doi.org/10.2307/1912526]

Wu, D-M. Alternative tests of independence between stochastic regressors and disturbances. Econometrica; 1973; 41, pp. 733-750. [DOI: https://dx.doi.org/10.2307/1914093]

Word count: 8274

Show less

© The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Arellano (J Econ 42:247–265, 1989a) showed that valid equality restrictions on covariance matrices could result in efficiency losses for Gaussian PMLEs in simultaneous equations models. We revisit his two-equation example using finite normal mixtures PMLEs instead, which are also consistent for mean and variance parameters regardless of the true distribution of the shocks. Because such mixtures provide good approximations to many distributions, we relate the asymptotic variance of our estimators to the relevant semiparametric efficiency bound. Our Monte Carlo results indicate that they systematically dominate MD and that the version that imposes the valid covariance restriction is more efficient than the unrestricted one.

Details

Title

PML versus minimum χ2: the comeback

Author

Amengual, Dante¹

; Fiorentini, Gabriele²

; Sentana, Enrique¹

¹ CEMFI, Madrid, Spain (GRID:grid.423829.6) (ISNI:0000 0001 2154 8962)
² Università di Firenze and RCEA, Firenze, Italy (GRID:grid.8404.8) (ISNI:0000 0004 1757 2304)

Pages

253-300

Publication year

2023

Publication date

Dec 2023

Publisher

Springer Nature B.V.

ISSN

18694187

e-ISSN

18694195

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1007/s13209-023-00280-4

ProQuest document ID

2891049434

PML versus minimum χ2: the comeback

Jump to:

Full text

Abstract

Details

Suggested sources