1. Introduction
The random coefficients model is often used to model unobserved heterogeneity in a population, an important problem in econometrics and statistics. The model is given by

$$Y_i = \beta_{i,0} + \beta_{i,1} X_{i,1} + \dots + \beta_{i,d} X_{i,d} = \boldsymbol{X}_i^\top \boldsymbol{\beta}_i, \qquad i = 1, \dots, n. \quad (1)$$

Here, $\boldsymbol{X}_i = (1, X_{i,1}, \dots, X_{i,d})^\top$ are the regressors, and $\boldsymbol{\beta}_i = (\beta_{i,0}, \beta_{i,1}, \dots, \beta_{i,d})^\top$ are the random regression coefficients. It is assumed that $(\boldsymbol{X}_i, \boldsymbol{\beta}_i)$, $i = 1, \dots, n$, are i.i.d. random variables with $\boldsymbol{X}_i$ and $\boldsymbol{\beta}_i$ being independent, and that $\boldsymbol{X}_i$ and $Y_i$ are observed while $\boldsymbol{\beta}_i$ is unobserved.
In this paper, we introduce an open source Python module that implements nonparametric estimation of the density of the random coefficients $\boldsymbol{\beta}_i$ by regularized maximum likelihood.
Nonparametric estimation of the random coefficients model was first established by [1] for a single regressor. Kernel methods for estimation were developed by [2,3]. Constrained least squares methods were developed by [4,5]. Further recent studies on this model include [6,7,8,9,10,11]. Regularized maximum likelihood methods have been considered for problems other than the random coefficients model in [12,13,14,15], among others. The method implemented by the module presented here is the regularized maximum likelihood approach developed in [16].
The paper is organized as follows: Section 2 briefly describes the regularized maximum likelihood method developed in [16], Section 3 discusses the classes and functions available in the module, Section 4 discusses examples of the module’s usage for general cases, and Section 5 demonstrates the usefulness of the module in a real data application.
2. Regularized Maximum Likelihood
It is assumed that the random coefficients $\boldsymbol{\beta}_i$ have a Lebesgue density $f_\beta$. If the conditional density $f_{Y|X}$ exists, the two densities are connected by the integral equation

$$f_{Y|X}(y \mid x) = \int_{\{b \,:\, \langle b, x \rangle = y\}} f_\beta(b) \, d\mu_{x,y}(b).$$

Here, $\mu_{x,y}$ is the induced Lebesgue measure on the $d$-dimensional hyperplane $\{b : \langle b, x \rangle = y\}$. This connection allows us to employ maximum likelihood estimation to nonparametrically identify $f_\beta$, as seen in the following expression of the log-likelihood

$$\ell(f_\beta) = \sum_{i=1}^{n} \log f_{Y|X}(Y_i \mid X_i) = \sum_{i=1}^{n} \log \int_{\{b \,:\, \langle b, X_i \rangle = Y_i\}} f_\beta(b) \, d\mu_{X_i, Y_i}(b).$$
Direct maximization of $\ell(f_\beta)$ over all densities is not feasible due to ill-posedness and will lead to overfitting. We stabilize the problem by adding a penalty term and state the estimator as a minimization problem with the negative log-likelihood

$$\hat f_\beta \in \operatorname*{arg\,min}_{f \in V} \; -\frac{1}{n} \sum_{i=1}^{n} \log \int_{\{b \,:\, \langle b, X_i \rangle = Y_i\}} f(b) \, d\mu_{X_i, Y_i}(b) + \alpha \, \mathcal{R}(f). \quad (2)$$

Here, $V$ is a constraint set that ensures that $\hat f_\beta$ is a probability density (non-negative and integrating to one), $\alpha > 0$ is a regularization parameter that controls a bias–variance trade-off, and $\mathcal{R}$ is a regularization functional. See [16] for further details. In principle, $\mathcal{R}$ can be any convex lower semi-continuous functional. We implemented the most popular choices, which are
Squared $L^2$ norm: $\mathcal{R}(f) = \|f\|_{L^2}^2$.
Sobolev norm for $H^1$: $\mathcal{R}(f) = \|f\|_{H^1}^2 = \|f\|_{L^2}^2 + \|\nabla f\|_{L^2}^2$.
Entropy: $\mathcal{R}(f) = \int f \log f$.
Of course, other penalty terms are possible; e.g., higher-order Sobolev norms $H^s$, Banach space norms $L^p$, or the total variation norm have been considered for other types of problems. However, these are often not suitable for density estimation problems like ours. We have chosen the three penalties above since they provide good options for obtaining either a smooth reconstruction with $H^1$, very mild smoothness assumptions with entropy, or a compromise with $L^2$.
In addition to the regularization functional, a regularization parameter $\alpha$ also needs to be chosen. In this module, we implement two methods of estimating $\alpha$: Lepskii’s balancing principle and k-fold cross validation.
Remark
The linear random coefficient model is a regression model with a nonparametric component, but it is different from the standard nonparametric regression $Y = m(X) + \varepsilon$. When estimating the random coefficients model, we estimate a density, which makes it possible to use a maximum likelihood approach without any distributional assumptions. We only need regularity assumptions for $f_\beta$, as for most nonparametric methods. In contrast, a maximum likelihood approach for nonparametric regression requires distributional assumptions on $\varepsilon$, e.g., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. With an assumption like that, $m$ can be estimated by penalized maximum likelihood as well. However, the likelihood functional will be quite different, and the constraint that the estimate is a probability density is not required.
3. Python Implementation
There are two main functions used to implement regularized maximum likelihood estimation with the module: transmatrix(), which constructs the discretized integral operator $T$, and rmle(), which minimizes the regularized negative log-likelihood (2).
Note: For the three-dimensional implementation of the algorithm, we observe numerical instabilities in the optimization algorithm when all the random coefficients, $\boldsymbol{\beta}_i$, have an expectation of zero. This is unexpected, as our formulation and implementation of the problem should not be sensitive to $E[\boldsymbol{\beta}_i]$. A simple shift by some constant $c$ is enough to remove the instability
$$Y_i + c = (\beta_{i,0} + c) + \beta_{i,1} X_{i,1} + \dots + \beta_{i,d} X_{i,d}. \quad (3)$$
Since we want the algorithm to be as robust as possible, we repeat the estimation with 10 different random shifts $c \sim U(a, b)$, producing 10 approximations, say $\hat f_1, \dots, \hat f_{10}$. Simply averaging over all 10 approximations is not feasible: if only one of them suffers from the instability, it can significantly degrade the average. Hence, a simple k-means clustering algorithm on the distances between $\hat f_1, \dots, \hat f_{10}$ is applied instead, and only the stable reconstructions are combined.
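To make the idea concrete, the following sketch shows one way such a shift-and-cluster step could be implemented; the uniform range $(a, b)$, the use of scikit-learn's KMeans, and the rule of averaging the larger cluster are assumptions made for this illustration, not the module's actual code.

```python
# Illustrative shift-and-cluster aggregation (assumptions: uniform shifts,
# scikit-learn KMeans with two clusters, averaging the larger cluster).
import numpy as np
from sklearn.cluster import KMeans

def estimate_with_shifts(estimate, y, x, a, b, n_shifts=10, seed=0):
    """Repeat `estimate` under random intercept shifts and aggregate robustly.

    `estimate(y, x)` is assumed to return a flattened density estimate on a
    fixed grid; shifting y by c corresponds to shifting the intercept by c,
    which is undone afterwards by relabeling the grid.
    """
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_shifts):
        c = rng.uniform(a, b)          # random shift of the intercept
        fits.append(estimate(y + c, x))
    fits = np.vstack(fits)

    # Unstable runs end up far away from the stable ones, so two clusters on
    # the estimates separate them; keep and average the larger cluster.
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(fits)
    keep = labels == np.bincount(labels).argmax()
    return fits[keep].mean(axis=0)
```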
3.1. The transmatrix() Function
The purpose of the transmatrix() function is to construct a discrete approximation of the linear operator

$$(Tf)(x, y) = \int_{\{b \,:\, \langle b, x \rangle = y\}} f(b) \, d\mu_{x,y}(b). \quad (4)$$

The linear operator above describes the integral of $f_\beta$ over the hyperplanes parametrized by the sample points $(X_i, Y_i)$. The function makes use of a finite-volume method as a discrete analog in evaluating the integral.
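As an illustration, the following sketch prepares the inputs for such a call; the sample layout and the commented call signature are assumptions rather than the module's documented interface.

```python
# Illustrative input preparation for transmatrix(); the commented call and its
# argument names are assumptions, not the module's documented API.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated data for the model Y = b0 + b1*X with random coefficients.
b0 = rng.normal(0.5, 0.3, n)
b1 = rng.normal(-0.5, 0.3, n)
x1 = rng.normal(0.0, 1.0, n)
y = b0 + b1 * x1

# Sample array: a leading column of ones (random intercept), the regressor,
# and the response.
sample = np.column_stack([np.ones(n), x1, y])

# Hypothetical call producing the discretized operator T on a 40 x 40 grid:
# tmat = transmatrix(sample, grid_range=((-2, 2), (-2, 2)), grid_points=40)
```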
The first argument is the array of sample observations containing the regressors and the response. In the case of a random coefficients model with intercept, the first column would simply be a column of ones.
The base grid that is generated by the function defines the domain over which $\hat f_\beta$ is estimated; its range and the number of grid points per axis are specified by the user.
The output of the transmatrix() function is a class object whose methods include:
| m(): Returns the number of grid points over which $f_\beta$ is to be estimated. |
3.1.1. transmatrix_2d()
The 2-dimensional implementation of this method works for the random coefficients model with a single regressor and random intercept

$$Y_i = \beta_{i,0} + \beta_{i,1} X_{i,1} \quad (5)$$

and the model with two regressors and no intercept

$$Y_i = \beta_{i,1} X_{i,1} + \beta_{i,2} X_{i,2}. \quad (6)$$
The function first initializes an $n \times m$ array of zeros, where $n$ is the sample size and $m$ is the number of grid points over which $f_\beta$ is to be estimated. In this case, the hyperplanes that $f_\beta$ is integrated over reduce to one-dimensional lines, which simplifies the problem. A finite-volume type method of estimation is employed to approximate the integral expression as seen in (4).
To implement this finite-volume method, the lines parametrized by the sample points given by Equations (5) or (6) are made to intersect with the grid. The length of each line segment that intersects with the grid is then calculated and stored in an object. The intersection points of these lines with the grid are retrieved and subsequently mapped to their respective array indices. These indices are used to map the length of the line segments to the appropriate entry in the initialized array, forming the discretized linear operator T. The algorithm is outlined as follows (Algorithm 1):
| Algorithm 1: Construction of the two-dimensional transformation matrix |
The result of this function’s implementation is a large, sparse array $T$, where each row collects the lengths of the line segments intersecting the grid for the line parametrized by one sample point, i.e., each entry $T_{ij}$ corresponds to the length of the line segment intersecting section $j$ of the grid. The resulting array is sparse because $T_{ij}$ is equal to zero unless the $i$-th line passes through section $j$ of the grid. The algorithm’s implementation is illustrated in Figure 1.
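For intuition, the following simplified sketch computes the segment lengths of a single line $\{(b_0, b_1) : b_0 + b_1 X_i = Y_i\}$ inside the cells of a rectangular grid; it illustrates the finite-volume idea and is not the module's actual implementation.

```python
# Simplified illustration of the line/grid intersection step for one sample
# point (x_i, y_i); the grid convention and helper name are assumptions.
import numpy as np

def line_cell_lengths(x_i, y_i, b0_edges, b1_edges):
    """Lengths of the line {b0 + b1*x_i = y_i} inside each cell of a 2-D grid.

    Returns an array of shape (len(b0_edges)-1, len(b1_edges)-1); most entries
    are zero because the line only crosses a small number of cells.
    """
    lengths = np.zeros((len(b0_edges) - 1, len(b1_edges) - 1))
    # Parametrize the line by b1 = t (so b0 = y_i - t*x_i) and collect all
    # crossings with horizontal and vertical grid lines.
    ts = set(np.asarray(b1_edges).tolist())
    if x_i != 0:
        ts |= {(y_i - e) / x_i for e in b0_edges}
    ts = np.array(sorted(t for t in ts if b1_edges[0] <= t <= b1_edges[-1]))
    for t0, t1 in zip(ts[:-1], ts[1:]):
        tm = 0.5 * (t0 + t1)                      # midpoint locates the cell
        i = np.searchsorted(b0_edges, y_i - tm * x_i) - 1
        j = np.searchsorted(b1_edges, tm) - 1
        if 0 <= i < lengths.shape[0] and 0 <= j < lengths.shape[1]:
            # Euclidean length of the segment in the (b0, b1) plane.
            lengths[i, j] = np.hypot((t1 - t0) * x_i, t1 - t0)
    return lengths

# One row of the transformation matrix T is the flattened output:
edges = np.linspace(-2.0, 2.0, 21)
row = line_cell_lengths(x_i=0.7, y_i=0.3, b0_edges=edges, b1_edges=edges).ravel()
```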
The resulting array is then used to evaluate the log-likelihood functional

$$\ell(f) = \frac{1}{n} \sum_{i=1}^{n} \log \Big( \sum_{j=1}^{m} T_{ij} f_j \Big), \quad (7)$$

where $f$ is an $m$-array or vector that serves as the discrete approximation of $f_\beta$.

3.1.2. transmatrix_3d()
The implementation of the function for the 3-dimensional case works for problems with two regressors and a random intercept,

$$Y_i = \beta_{i,0} + \beta_{i,1} X_{i,1} + \beta_{i,2} X_{i,2}, \quad (8)$$

and the model with three regressors and no intercept,

$$Y_i = \beta_{i,1} X_{i,1} + \beta_{i,2} X_{i,2} + \beta_{i,3} X_{i,3}. \quad (9)$$
It is significantly more complicated and computationally more expensive as the increased dimensionality introduces new problems.
Similar to the 2-dimensional implementation, a finite-volume type approach is used to create the discrete version of the linear operator $T$. In this higher-dimensional case, the hyperplanes parametrized by the sample observations are 2-dimensional planes, and the grid that is estimated over comprises three axes and is, therefore, in the shape of a 3-dimensional cube. The intersection of a plane with a grid cell is characterized by a collection of points that define a polygon, whose area serves as the 2-dimensional analog of the line segment lengths in the lower-dimensional implementation of this finite-volume estimation process. The algorithm is outlined below. Figure 2 also illustrates the algorithm for a single cuboidal grid section (see also Algorithm 2).
| Algorithm 2: Construction of the three-dimensional transformation matrix |
The resulting array is in the same form as the one obtained using the 2-dimensional implementation; each entry now stores the area of an intersection polygon rather than the length of a line segment.
It is important to note the computational performance of this algorithm. Matrix multiplication parallelizes the computation and eliminates the need to evaluate the log-likelihood functional sequentially for each sample point. However, the process of creating the transformation matrix scales linearly with the sample size $n$ and exponentially with the dimensionality of the problem. Despite this, the function was written to be as efficient as possible by relying on vectorized, array-wise computation wherever possible.
Solving for the transformation matrix $T$ is only one of two steps in the process of estimating the density $f_\beta$. The second step is optimizing the regularized log-likelihood functional, the runtime of which scales multiplicatively with the sample size $n$ and the number of discretization points $m$. Therefore, choosing an appropriate sample size and level of discretization is an important consideration, as both primarily determine the total cost of running the algorithm.
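The role of the matrix product is easy to see in a minimal sketch of the discretized negative log-likelihood (7); the variable names and toy data below are illustrative.

```python
# Minimal sketch: the discretized negative log-likelihood is one matrix-vector
# product followed by a log and a mean; names and toy data are illustrative.
import numpy as np

def neg_log_likelihood(f, tmat, eps=1e-12):
    """-(1/n) * sum_i log((T f)_i) for a discretized density f."""
    tf = tmat @ f                      # evaluates all n hyperplane integrals at once
    return -np.mean(np.log(tf + eps))  # eps guards against log(0)

rng = np.random.default_rng(0)
tmat = rng.random((500, 400)) * (rng.random((500, 400)) < 0.05)  # sparse-ish toy T
f = np.full(400, 1.0 / 400)                                      # uniform start
print(neg_log_likelihood(f, tmat))
```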
3.2. The rmle() Function
The rmle() function carries out this second step: it minimizes the regularized negative log-likelihood (2) over the grid encoded in the object returned by transmatrix(). The function relies on the SciPy Optimize minimize routine for the constrained minimization. Table 1 contains all the arguments for the rmle() function.
3.2.1. Functionals and Regularization Terms
Recall the average log-likelihood functional to be minimized as in Equation (2). As specified in Section 2, the regularization terms implemented in this module are the Sobolev norm for $H^1$, the squared $L^2$ norm, and entropy. The module also includes an option for a functional that has no regularization term if the user wishes to produce a reconstruction of $f_\beta$ without any form of regularization. This option will often lead to overfitting and produce a highly unstable solution.
3.2.2. Sobolev Norm for $H^1$
The functional incorporating the Sobolev norm for $H^1$ has the following form:

$$F_\alpha(f) = -\frac{1}{n} \sum_{i=1}^{n} \log \big( (Tf)_i \big) + \alpha \left( \|f\|_{L^2}^2 + \|\nabla f\|_{L^2}^2 \right), \quad (10)$$

where $\|\cdot\|_{L^2}^2$ indicates the squared $L^2$ norm. It is important to note that the choice of the regularization term would typically depend on knowledge about the true solution, as it imposes certain assumptions on $f_\beta$. In the case of the $H^1$ penalty term, the solution has a square-integrable first derivative in addition to the assumptions of non-negativity and integrability imposed by the constraints of the minimization problem.
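A minimal sketch of such a function, using the argument names from Table 2, could look as follows; the penalty discretization (a one-dimensional grid and np.gradient with step s) is an assumption made for this illustration, not the module's actual code.

```python
# Hedged sketch of a discretized version of the functional in (10); argument
# names follow Table 2, the penalty discretization is an assumption.
import numpy as np

def sobolev_functional(f, a, tm_long, n, s):
    """Negative average log-likelihood plus a * (||f||^2 + ||f'||^2)."""
    tmat = tm_long.reshape(n, -1)           # restore the n x m transformation matrix
    neg_loglik = -np.mean(np.log(tmat @ f + 1e-12))
    grad = np.gradient(f, s)                # finite-difference first derivative
    penalty = s * np.sum(f**2) + s * np.sum(grad**2)
    return neg_loglik + a * penalty
```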
The module’s function returns the value of the discrete implementation of the functional in (10); Table 2 provides details on the arguments required for this function. The SciPy Optimize minimize function does not accept array arguments with more than one dimension. A necessary step is therefore to unravel the transformation matrix $T$ into a one-dimensional array, which is passed to the function as the argument tm_long (see Table 2) and reshaped internally.
The underlying minimization function is able to approximate the Jacobian of the functional using finite-difference methods, which necessitates a quasi-Newton method for estimating the Hessian. Approximating the Jacobian, however, increases the computation time severely, as it requires many more function evaluations and gradient evaluations and, consequently, many more iterations, since the algorithm converges to the solution more slowly. To this end, we opt to supply an explicit form of the Jacobian of the functional,

$$\nabla F_\alpha(f) = -\frac{1}{n} \, T^\top \Big( \frac{1}{Tf} \Big) + \alpha \big( 2 f - 2 \Delta f \big),$$

where $1/(Tf)$ is taken entry-wise and $\Delta$ denotes the Laplacian.
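A sketch of how this Jacobian could be implemented, mirroring the functional sketch above, is shown below; the discrete Laplacian with zero boundary values is an assumption, not necessarily the module's choice.

```python
# Hedged sketch of the explicit Jacobian for the discretized Sobolev functional;
# mirrors the functional sketch above, with an assumed discrete Laplacian.
import numpy as np

def sobolev_jacobian(f, a, tm_long, n, s):
    """Gradient of -mean(log(T f)) + a*(s*sum(f^2) + s*sum(f'^2)) w.r.t. f."""
    tmat = tm_long.reshape(n, -1)
    tf = tmat @ f + 1e-12
    grad_loglik = -(tmat / tf[:, None]).mean(axis=0)   # -(1/n) T^T (1 / (T f))
    # Derivative of the penalty: 2*s*f from the L2 term and, via integration by
    # parts, -2*s*f'' from the squared-derivative term (zero boundary values).
    lap = np.zeros_like(f)
    lap[1:-1] = (f[2:] - 2.0 * f[1:-1] + f[:-2]) / s**2
    return grad_loglik + a * (2.0 * s * f - 2.0 * s * lap)
```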
The function has identical arguments to the functional being optimized, as listed in Table 2.
The form of this Jacobian implies an additional smoothness assumption on the solution, as it requires $f_\beta$ to be second-order differentiable.
3.2.3. Squared $L^2$ Norm
The form of the functional that incorporates the squared $L^2$ norm as the regularization term is

$$F_\alpha(f) = -\frac{1}{n} \sum_{i=1}^{n} \log \big( (Tf)_i \big) + \alpha \, \|f\|_{L^2}^2.$$
The arguments for this function are identical to those listed in Table 2; the same is true for the Jacobian associated with this regularization term. The Jacobian has the following form:

$$\nabla F_\alpha(f) = -\frac{1}{n} \, T^\top \Big( \frac{1}{Tf} \Big) + 2 \alpha f.$$

Choosing this regularization functional imposes less smoothness on the solution, as the only additional assumption in place is square-integrability. This typically leads to a less smooth reconstruction compared with using the Sobolev norm for $H^1$ as the regularization functional. The corresponding Python functions are coded similarly to those for the $H^1$ regularization functional.
3.2.4. Entropy
The form of the functional that uses the entropy of the function as the regularization term is

$$F_\alpha(f) = -\frac{1}{n} \sum_{i=1}^{n} \log \big( (Tf)_i \big) + \alpha \int f \log f.$$

For densities on a compact domain, the entropy is bounded by the $L^2$ norm and by the Sobolev norm. Since we only work on discretized bounded domains, the entropy implies weaker assumptions on the solution than the other two penalties. The Jacobian has the following form:

$$\nabla F_\alpha(f) = -\frac{1}{n} \, T^\top \Big( \frac{1}{Tf} \Big) + \alpha \, (1 + \log f).$$

3.2.5. Parameter Selection
Recall the minimization problem in (2), where a constant $\alpha$ controls the size of the effect of the regularization term. The module allows the user to provide the value of $\alpha$ directly if the user has a general idea of the level of smoothing necessary for the solution, which is typically affected by the sample size and by the level of discretization of the grid over which $f_\beta$ is estimated. If the user has no best guess for the value of the regularization parameter alpha, the module has options to select the parameter automatically.
There are two methods implemented in the module for automated selection of $\alpha$, namely Lepskii’s balancing principle and k-fold cross validation. The Lepskii method typically yields a less accurate result relative to k-fold cross validation; however, its advantage lies in a significantly lower computational cost.
3.2.6. Lepskii’s Balancing Principle
Lepskii’s principle is an adaptive form of estimation, which is popular for inverse problems [14,20,21,22,23,24]. The method works as follows: we compute estimates $\hat f_{\alpha_j}$ for an increasing sequence of regularization parameters $\alpha_j$, $j = 1, \dots, m$, defined by some constants. We then select $\alpha_{j^*}$ as the optimal parameter choice, where $j^*$ is determined by a balancing condition that compares $\hat f_{\alpha_j}$ with the estimates obtained for all previous parameter values.
The algorithm is implemented in Python as follows (Algorithm 3):
| Algorithm 3: Lepskii’s Balancing Principle |
The bulk of the computational cost of the Lepskii algorithm implementation can be broken down into two components: the fixed cost of generating the transformation matrix and the variable cost of computing the estimates $\hat f_{\alpha_j}$, as the number of $\alpha$ values to be used depends on the chosen sequence and also on the sample size of the data. This implementation of Lepskii’s algorithm scales linearly in runtime with the number of alpha values being tested, $m$.
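To illustrate, a sketch of a Lepskii-type selection loop is given below; the decreasing $\alpha$ sequence, the noise-level proxy, and the constant 4 are assumptions for this illustration rather than the module's documented rule.

```python
# Illustrative Lepskii-type balancing loop; the error proxy and constant are
# assumptions, not the module's documented rule.
import numpy as np

def lepskii_select(estimate, alphas):
    """Pick alpha by a balancing principle.

    `estimate(alpha)` returns the discretized density for that alpha;
    `alphas` must be sorted in decreasing order (strongest smoothing first).
    """
    fits = [estimate(a) for a in alphas]
    noise = 1.0 / np.sqrt(np.asarray(alphas))   # error proxy growing as alpha shrinks
    j_star = 0
    for j in range(1, len(alphas)):
        # Accept alpha_j while it stays within the balancing tube of all
        # previously accepted (larger) alpha values.
        ok = all(np.linalg.norm(fits[j] - fits[i]) <= 4.0 * noise[i]
                 for i in range(j))
        if not ok:
            break
        j_star = j
    return alphas[j_star], fits[j_star]
```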
3.2.7. K-Fold Cross Validation
Cross validation is another popular parameter choice rule, as it optimizes for both training and testing performance. Here, we present a modified implementation of k-fold cross validation with a cost function that is applicable to our problem. The modification we apply is an algorithm that lessens the computation time by reducing the number of $\alpha$ values it needs to iterate through. We consider the loss function

$$L_j(\alpha) = -\sum_{i \in S_j} \log \big( (T \hat f_{\alpha, -j})_i \big),$$

where $S_j$ is a subsample of the data, which can be interpreted as the $j$-th fold that is left out in the current iteration, and $\hat f_{\alpha, -j}$ is the estimate for $f_\beta$ computed from the remaining data. The loss function can be interpreted as the negative of the log-likelihood that the subsample $S_j$ was drawn from the same distribution, $f_\beta$, as the rest of the data. We aim to choose the $\alpha$ that minimizes this loss function.

The modification we implemented changes the search method for the optimal $\alpha$ value by reducing the number of values being tested. The algorithm involves separating the range of $\alpha$ values into two sections, a lower and an upper one. From there, two alpha values, one from each section, are randomly selected and used to compute the corresponding loss function values. The section from which the value that produces the smaller loss was drawn is kept, while the other is discarded. This is repeated until a sufficiently small range of $\alpha$ values remains. Once this range is obtained, the loss function is evaluated over all the remaining values, and the optimal $\alpha$ is chosen as the one which minimizes the loss.
| Algorithm 4: Modified K-fold Cross Validation |
The runtime of the unmodified version of k-fold cross validation scales linearly with the product $k \cdot m$, where $k$ is the number of folds and $m$ is the number of $\alpha$ values being tested. The modified version reduces the number of $\alpha$ values that actually need to be tested to roughly the order of $\log m$ (plus the final sweep over the small remaining range), which is a significant reduction in computational cost.
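A sketch of this halving search is shown below; the held-out likelihood loss and the function names are illustrative assumptions based on the description above.

```python
# Illustrative modified k-fold cross validation search; names and the loss
# form are assumptions based on the description in the text.
import numpy as np

def held_out_loss(f_hat, tmat_fold):
    """Negative mean log-likelihood of a held-out fold under estimate f_hat."""
    return -np.mean(np.log(tmat_fold @ f_hat + 1e-12))

def modified_cv(fit, alphas, min_candidates=5, seed=0):
    """Halve the alpha range until few candidates remain, then sweep them.

    `fit(alpha)` is assumed to return the average of `held_out_loss` over all
    k folds for that alpha; `alphas` is a grid of candidate values.
    """
    rng = np.random.default_rng(seed)
    alphas = np.sort(np.asarray(alphas))
    while len(alphas) > min_candidates:
        half = len(alphas) // 2
        lower, upper = alphas[:half], alphas[half:]
        a1, a2 = rng.choice(lower), rng.choice(upper)   # one candidate per section
        alphas = lower if fit(a1) < fit(a2) else upper  # keep the better half
    losses = [fit(a) for a in alphas]                   # final sweep over survivors
    return alphas[int(np.argmin(losses))]
```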
4. Examples
This section presents three instructive examples of how the module can be used. The examples are (1) the random coefficients model with a single regressor and random intercept for the two-dimensional case, (2) the model with two regressors and random intercept for the three-dimensional case, and (3) the three-dimensional case with additional robustness. This section will also show how to plot the estimated density using the module’s built-in plotting function.
Two-Dimensional Case: Single Regressor with Random Intercept
The first example is the random coefficients model with a single regressor and a random intercept, as in (5). We generate synthetic data, drawn from a known bimodal joint density of intercept and slope, using a simulation function included with the module.
The general flow of the process of using the module can be broken down into five steps: import the necessary modules and functions; establish the dataset to be used (either real or simulated data); specify the grid over which $f_\beta$ is to be estimated; generate the transformation matrix $T$; and run the rmle() function. A sketch of these steps is given below.
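The sketch below walks through the five steps; the import path, the simulation code, and the commented call signatures are assumptions, since the module's exact listings are not reproduced here.

```python
# Illustrative end-to-end workflow; the import path and the commented call
# signatures are assumptions, not the module's documented API.
import numpy as np
# from rcmle import transmatrix, rmle, plot_rmle   # step 1 (hypothetical import)

# Step 2: simulate data for Y = b0 + b1*X with a bimodal coefficient density.
n = 1000
rng = np.random.default_rng(42)
mode = rng.integers(0, 2, n)
b0 = np.where(mode == 0, rng.normal(-0.5, 0.2, n), rng.normal(0.5, 0.2, n))
b1 = np.where(mode == 0, rng.normal(0.5, 0.2, n), rng.normal(-0.5, 0.2, n))
x1 = rng.normal(0.0, 1.0, n)
y = b0 + b1 * x1
sample = np.column_stack([np.ones(n), x1, y])

# Steps 3-5 (hypothetical calls, commented out because the signatures are
# assumptions): specify the grid, build T, and minimize (2).
# tmat = transmatrix(sample, grid_range=((-2, 2), (-2, 2)), grid_points=30)
# result = rmle(functional=sobolev, alpha=0.1, tmat=tmat)
# plot_rmle(result)
```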
The program begins by importing the necessary functions from the module.
The next step is to generate the grid on which $f_\beta$ is estimated. This is achieved by specifying the grid range and the number of grid points when generating the transformation matrix.
After the instance of the transformation matrix is produced, the user can then run the rmle() function.
As stated in Section 3.2, the rmle() function requires the regularized likelihood functional, the regularization parameter, and the transformation matrix object as its main arguments.
Figure 3 and Figure 4 show the contour and surface plots produced by the module’s plotting function for this first run. We then rerun the estimation with a smaller regularization parameter and a reduced number of grid points; the results are shown in Figure 5 and Figure 6.
The smaller value for $\alpha$ resulted in a reconstruction with less smoothness. The two peaks are well separated, and there are no signs of overfitting yet. So we could continue to try smaller values of $\alpha$ until $\hat f_\beta$ becomes jagged, indicating overfitting. However, a better strategy is to choose $\alpha$ by cross validation or Lepskii’s principle.
The reduction in the number of grid points resulted in a significant reduction in the runtime of the algorithm while achieving a better estimate for $f_\beta$. The reduced computational cost of the algorithm makes an automatic parameter choice method such as cross validation more feasible.
The value of $\alpha$ chosen by cross validation for this example yields the estimates shown in Figure 7 and Figure 8: the two modes become slightly sharper compared with the previous estimate. Any smaller value of the regularization parameter would most likely lead to overfitting. The general workflow that we suggest when tuning the parameters used in estimation is as follows:
Establish a relatively large grid range for estimation and generate the transformation matrix $T$. This should be treated as an exploratory step in terms of analyzing the data.
Set $\alpha$ equal to the step size of the grid.
Run the rmle() function with this value of $\alpha$.
Plot the estimate using the module’s plotting function and inspect where the probability mass is concentrated.
Limit the grid range and possibly also the number of grid points, and generate a new matrix $T$.
Run the rmle() function again, now choosing $\alpha$ by cross validation or Lepskii’s principle.
We also want to demonstrate the two alternative regularization methods in the package for the same test case. These are $L^2$ regularization and entropy regularization. In contrast to Sobolev regularization, entropy regularization does not enforce smoothness of the solution, while $L^2$ regularization enforces only a little smoothness. Naturally, the estimates are a bit spiky. This can be an advantage if the true solution is indeed not a smooth function. In this case, $L^2$ and entropy regularization can deliver better results. In most applications, we will not know whether the solution is smooth. The user needs to make an assumption and choose the corresponding regularization.
Figure 9 and Figure 10 show the estimate using $L^2$ regularization. These figures should be compared with Figure 3 and Figure 4, respectively. The code differs from the earlier example only in the choice of the regularization functional.
In Figure 11 and Figure 12, estimates with entropy regularization are displayed. Again, these figures can be compared with Figure 3 and Figure 4. The code for entropy regularization is analogous, with the entropy functional supplied to rmle().
Finally, we present an additional example of the module’s usage for this two-dimensional test case.
Three-Dimensional Case: Two Regressors with Random Intercept
This second example is described by the model (8) with two regressors and a random intercept.
The example is again presented with simulated data generated using the same simulation function as in the previous example.
As in the previous example, the program begins by importing the necessary modules and functions.
The next step is to establish the number of grid points and to generate the transformation matrix using the simulated sample observations. In this case, we first consider 10 grid points on each axis, amounting to a total of 1000 grid points. If the user wishes to estimate over a finer grid, it would be more efficient to first determine the smallest possible range of each axis that would fit almost all of the probability mass of $\hat f_\beta$. We use a Sobolev penalty for the first run.
In the three-dimensional application, we use the three-dimensional version of the Sobolev regularization functional. The resulting estimates are displayed in Figure 15 and Figure 16.
It is possible to achieve more accurate estimates with more grid points in conjunction with a narrower grid range, but at a significantly higher computational cost. We set the number of grid points to 20 on each axis, which amounts to a total of 8000 grid points, and reduce the regularization parameter. The additional level of discretization due to the increased number of grid points and the smaller domain, together with the smaller $\alpha$, results in an improved estimate, as observed in Figure 17 and Figure 18.
The next example will demonstrate the case when the underlying joint density being reconstructed has a single mode at the center of the grid. Both methods of dealing with this issue described in Section 3.1 will be illustrated. Since there were no signs of overfitting in the previous result, we further reduce the regularization parameter.
Results are displayed in Figure 19 and Figure 20. The contours show more kinks and appear to be less regular than in Figure 17. This suggests that $\alpha$ should not be reduced any further for these data.
Finally, we show estimates for this test problem with $L^2$ and entropy regularization. Apart from the alternative penalty terms, we replicate the setup of Figure 15 and Figure 16, so the results are best compared with these plots. $L^2$ and entropy regularization do not enforce smoothness of the solution as much as the Sobolev penalty, which leads to estimates with less regularity. It is up to the user to decide whether they expect the true solution to be smooth and to choose the corresponding regularization.
The result with the $L^2$ penalty is displayed in Figure 21 and Figure 22. The code is analogous to the Sobolev run, with the $L^2$ regularization functional supplied instead.
The result with the entropy penalty is displayed in Figure 23 and Figure 24. The code is analogous, with the entropy regularization functional supplied instead.
Robustness by Shifting
For the three-dimensional implementation, we found that numerical instabilities of the Python optimization algorithm can occur when the underlying density has a single mode located at the center of the grid. This problem is overcome by simply shifting the density and the data by a constant $c$.
The following example demonstrates how our algorithm can automatically detect whether the instability is present by adding $c \sim U(a, b)$ to the intercept. The values $a$ and $b$ are determined by the size of the grid. This applies a transformation that can be undone after estimation by adjusting the grid of $\hat f_\beta$. This is repeated for ten different shifts. Then, a simple k-means clustering algorithm is used to separate stable from unstable reconstructions, as described in Section 3.
5. Application to Household Expenditure and Demand
This section demonstrates the usefulness of the module in a real data application and shows the user how to load the data into Python from a comma-separated values (CSV) format and how to pre-process it, if necessary, into a usable form. We evaluate household expenditure data from the British Family Expenditure Survey, as in [7,16]. The model is the almost-ideal demand system of [25], but with random coefficients instead of fixed coefficients.
As discussed in [7], a linear transformation is applied to the regressors before generating the transformation matrix to be used in the algorithm. This is to ensure that the grid is sufficiently covered by the hyperplanes generated by the sample observations. The OLS estimate for the random coefficients suggests that $f_\beta$ is centered at (0.262755, 0.0048, −0.00069) for a subsample of size 5000. We choose the grid in such a way that this potential mode is not at its center.
The preprocessing code loads the data from a CSV file, selects a random subsample of 5000 observations, and applies the linear transformations to the data (see also Figure 27 and Figure 28). A sketch of such a preprocessing step is given below.
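In the sketch, the file name, column names, and the exact linear transformation are assumptions made for illustration.

```python
# Hedged sketch of the preprocessing step; file name, column names, and the
# transformation below are assumptions, not the study's actual choices.
import numpy as np
import pandas as pd

# Load the survey data and draw a random subsample of 5000 observations.
data = pd.read_csv("fes_data.csv")                  # hypothetical file name
sub = data.sample(n=5000, random_state=0)

# Hypothetical column names for budget share, log total expenditure, log price.
y = sub["budget_share"].to_numpy()
x1 = sub["log_total_expenditure"].to_numpy()
x2 = sub["log_price"].to_numpy()

# Illustrative linear transformation (centering/scaling) of the regressors so
# that the sample hyperplanes cover the estimation grid well.
x1t = (x1 - x1.mean()) / x1.std()
x2t = (x2 - x2.mean()) / x2.std()

# Sample array with a leading column of ones for the random intercept.
sample = np.column_stack([np.ones(len(sub)), x1t, x2t, y])
```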
Included in the module is an option to fit a spline to the estimate $\hat f_\beta$. This allows the user to generate smoother plots from the results if they wish (see Figure 29 and Figure 30).
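One way to achieve such smoothing for a bivariate marginal, shown here as an illustration with SciPy rather than the module's own option, is the following.

```python
# Illustrative spline smoothing of a bivariate marginal for plotting; the grid
# and density values are toy data, and this is not the module's own routine.
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Coarse grid and a bivariate marginal density on it (toy values).
b0_grid = np.linspace(-2, 2, 20)
b1_grid = np.linspace(-2, 2, 20)
density = np.exp(-(b0_grid[:, None] ** 2 + b1_grid[None, :] ** 2))
density /= density.sum() * (b0_grid[1] - b0_grid[0]) * (b1_grid[1] - b1_grid[0])

# Fit a bivariate spline and evaluate it on a finer grid for smoother plots.
spline = RectBivariateSpline(b0_grid, b1_grid, density, kx=3, ky=3)
b0_fine = np.linspace(-2, 2, 200)
b1_fine = np.linspace(-2, 2, 200)
density_fine = spline(b0_fine, b1_fine)     # 200 x 200 array for contour plots
```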
Author Contributions: Conceptualization, F.D., E.M. and M.R.; Writing—original draft, F.D., E.M. and M.R.; Writing—review and editing, F.D., E.M. and M.R. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Figure 1 Two-dimensional Transformation matrix algorithm.
Figure 2 Three-dimensional transformation matrix algorithm.
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15 Contour plots of joint bivariate marginal distributions of
Figure 16 Surface plots of joint bivariate marginal distributions of
Figure 17 Contour plots of joint bivariate marginal distributions of
Figure 18 Surface plots of joint bivariate marginal distributions of
Figure 19 Contour plots of joint bivariate marginal distributions of
Figure 20 Surface plots of joint bivariate marginal distributions of
Figure 21 Contour plots of joint bivariate marginal distributions of
Figure 22 Surface plots of joint bivariate marginal distributions of
Figure 23 Contour plots of joint bivariate marginal distributions of
Figure 24 Surface plots of joint bivariate marginal distributions of
Figure 25 Contour plots of joint bivariate marginal distributions of
Figure 26 Surface plots of joint bivariate marginal distributions of
Figure 27 Contour plots of joint bivariate marginal distributions of
Figure 28 Surface plots of joint bivariate marginal distributions of
Figure 29 Contour plots of the spline interpolated joint bivariate marginal distributions of
Figure 30 Surface plots of the spline interpolated joint bivariate marginal distributions of
| Argument | Description |
|---|---|
| functional | Negative likelihood functional with corresponding regularization term. |
| alpha | Constant that serves as the regularization parameter $\alpha$ in (2). |
| tmat | Class object returned by the transmatrix() function. |
| k | Optional argument: integer that specifies how many folds to use for modified k-fold cross validation. |
| initial_guess | Optional argument from the SciPy Optimize minimize function. Used to supply an initial value for the minimization algorithm to start. Default value is set to a NumPy array of values close to zero. |
| hessian_method | Optional argument from SciPy Optimize minimize function. Default value is set to ‘2-point’. |
| constraints | Refers to the linear constraints imposed on the problem (the estimate must be non-negative and integrate to one). It is set as an optional argument. |
| tolerance | Optional argument for the tolerance criterion used to terminate the optimization algorithm. |
| max_iter | Optional argument for the maximum number of iterations. Default value is set to 100. |
| bounds | Specifies the bound constraints of the optimization problem. |
| Argument | Notation | Description |
|---|---|---|
| f | $f$ | current value of the solution |
| a | $\alpha$ | constant that serves as the regularization parameter |
| tm_long | $T$ | unraveled (one-dimensional) form of the transformation matrix $T$ |
| n | n | the sample size |
| s | $s$ | the step size of the grid |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).