1. Introduction and Motivation
This paper addresses the evaluation of predictions from multinomial models. Such an evaluation is relevant to assess the quality of these models, which are often applied in consumer choice analysis, marketing research, transportation research, and more.
There are various tests for the accuracy of multinomial models. There are tests for the goodness of fit, like the Pearson test and the Brier score. Also, there are tests based on the prediction realization table and on cross-entropy loss (see [1] for a recent survey). These tests all concern a measure of the fit. In contrast, in this paper, we propose a Likelihood Ratio (and Wald) test on forecast bias in predicting independent multinomial outcomes, where the outcomes can be $1, 2, \ldots, J$ and where the predictions are probabilities. Such multinomial outcomes are frequently observed in transportation mode choice and brand choice. The probabilities are given from the outset and can be based on an econometric model, like a multinomial probit or multinomial logit model, and they can also be created by experts, who do not necessarily rely on an econometric model. This is an advantage of our simple test: it does not depend on how the probabilities were obtained.
A useful tool to test for bias in forecasts for continuous variables is the so-called Mincer–Zarnowitz regression, by [2]. If realizations $y_i$ and forecasts $\hat{y}_i$ are continuous variables, and these can be cross-sectional data or time series data, where the forecast sample is $i = 1, \ldots, N$, then the auxiliary regression
$$y_i = \alpha + \beta \hat{y}_i + \varepsilon_i,$$
where $\varepsilon_i$ is a mean zero uncorrelated error process, can be used to test for bias. The parameters can be estimated using Ordinary Least Squares (OLS). The Wald-type test of interest concerns the null hypothesis that $\alpha = 0$ and $\beta = 1$, jointly. Under the null hypothesis, there is no forecast bias.
In this paper, we propose a similar test but now for independent multinomial outcomes, which in a sense extends the test for binomial outcomes in [3]. There are realizations $y_i$ that can be either $1, 2, \ldots, J$, and the predictions are estimated probabilities of each of the outcomes. The question now is whether these predictions based on the probabilities are unbiased or not.
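As a point of reference, the Mincer–Zarnowitz regression for continuous outcomes can be sketched in a few lines. This is a minimal illustration and not the authors' code: the simulated data, the function name `mincer_zarnowitz_wald`, and the use of the homoskedastic OLS covariance matrix in the Wald statistic are assumptions made for this example. Under the null hypothesis, the statistic is compared with a chi-squared distribution with two degrees of freedom (5% critical value 5.991).

```python
import numpy as np

def mincer_zarnowitz_wald(y, y_hat):
    """OLS of y on (1, y_hat) and Wald statistic for H0: alpha = 0, beta = 1."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(y_hat, dtype=float)])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # (alpha_hat, beta_hat)
    resid = y - X @ theta
    sigma2 = resid @ resid / (len(y) - 2)            # error variance estimate
    diff = theta - np.array([0.0, 1.0])              # distance from the null values
    # The inverse of the OLS covariance sigma2 * (X'X)^{-1} is (X'X) / sigma2.
    wald = diff @ (X.T @ X) @ diff / sigma2
    return theta, wald

# Simulated data (assumption): one unbiased and one biased set of realizations.
rng = np.random.default_rng(0)
forecast = rng.normal(size=200)
unbiased = forecast + rng.normal(scale=0.5, size=200)             # alpha = 0, beta = 1
biased = 1.0 + 0.5 * forecast + rng.normal(scale=0.5, size=200)   # alpha = 1, beta = 0.5

theta_u, wald_u = mincer_zarnowitz_wald(unbiased, forecast)
theta_b, wald_b = mincer_zarnowitz_wald(biased, forecast)
print("unbiased case:", theta_u, wald_u)
print("biased case:  ", theta_b, wald_b)
```

The multinomial test proposed in this paper follows the same logic, with the linear regression replaced by a multinomial logit model.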
The new test is as easy to apply as the Mincer–Zarnowitz regression, where now a multinomial logit model is used instead of a linear regression. This article proceeds in Section 2 with the proposed test. In Section 3, the test is evaluated using various simulation experiments. Section 4 applies the test in an empirical setting on brand choice by individual households. Section 5 concludes the article.
2. The Main Idea
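Suppose there is a multinomial model for discrete choices, made by individuals, and that this model generates fitted probabilities $\hat{p}_{i,j}$ for $i = 1, \ldots, N$ and $j = 1, \ldots, J$. Consider the estimated odds of the probability of choosing $j$ versus $J$, that is,
$$\frac{\hat{p}_{i,j}}{\hat{p}_{i,J}} = \exp\left(\log \hat{p}_{i,j} - \log \hat{p}_{i,J}\right) \tag{1}$$
for all $j = 1, \ldots, J - 1$. The estimated odds imply that
$$\hat{p}_{i,j} = \frac{\exp(\log \hat{p}_{i,j})}{\sum_{l=1}^{J} \exp(\log \hat{p}_{i,l})}. \tag{2}$$
In case of a predictive bias, the odds ratios are incorrect.
To describe potential bias, we extend (1) by adding intercepts $\alpha_j$ and a slope parameter $\beta$ (alternative to 1), which results in
$$\frac{\Pr[Y_i = j]}{\Pr[Y_i = J]} = \exp\left(\alpha_j + \beta \left(\log \hat{p}_{i,j} - \log \hat{p}_{i,J}\right)\right). \tag{3}$$
This implies that the choice probabilities in (2) can be written as
$$\Pr[Y_i = j] = \frac{\exp(\alpha_j + \beta \log \hat{p}_{i,j})}{\sum_{l=1}^{J} \exp(\alpha_l + \beta \log \hat{p}_{i,l})} \tag{4}$$
with the assumption that $\alpha_J = 0$ for identification, where $J$ can be any of the choices.
Under the joint null hypothesis
$$\alpha_1 = \cdots = \alpha_{J-1} = 0 \quad \text{and} \quad \beta = 1 \tag{5}$$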
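there is no bias in the estimated probabilities. We can use this result to test for bias in prediction probabilities by estimating the parameters in a multinomial logit model (MNL) in which the $y_i$ variables are the out-of-sample realizations and the $\hat{p}_{i,j}$ are the predicted probabilities from a statistical model or from the judgment of experts. We can use a Likelihood Ratio (LR) test or Wald test for the composite null hypothesis in (5).
The Likelihood Ratio test statistic is given by
$$LR = -2\left(\ell(0, \ldots, 0, 1) - \ell(\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta})\right), \tag{6}$$
where the log-likelihood function is given by
$$\ell(\alpha_1, \ldots, \alpha_{J-1}, \beta) = \sum_{i=1}^{N} \sum_{j=1}^{J} I[y_i = j] \log \Pr[Y_i = j], \tag{7}$$
with $\Pr[Y_i = j]$ as in (4), and where $\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta}$ are the unrestricted Maximum Likelihood estimates.
Let $\theta = (\alpha_1, \ldots, \alpha_{J-1}, \beta)'$ and $\theta_0 = (0, \ldots, 0, 1)'$. The Wald test statistic is given by
$$W = (\hat{\theta} - \theta_0)' \, V(\hat{\theta})^{-1} \, (\hat{\theta} - \theta_0), \tag{8}$$
where $V(\theta)$ denotes the covariance matrix of the Maximum Likelihood estimator
$$V(\theta) = \left(-\frac{\partial^2 \ell(\theta)}{\partial \theta \, \partial \theta'}\right)^{-1} \tag{9}$$
evaluated in the Maximum Likelihood estimates $\hat{\theta}$.
We can also implement partial tests to examine the absence of relative bias, for example by considering the null hypothesis $\alpha_1 = \cdots = \alpha_{J-1} = 0$. As we consider a joint test for zero restrictions on all $\alpha$ parameters, the proposed test statistic will be independent of the chosen alternative for identification.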
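The LR and Wald tests described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`design`, `probs`, `loglik`, `score_info`, `bias_test`), the Newton-Raphson optimizer with step halving, the simulated Dirichlet(2, 2, 2) data, and the "flattened" biased forecasts are all choices made for this example. Outcomes are coded 0, ..., J - 1, and the last alternative is the reference category. Under the null hypothesis, both statistics are compared with a chi-squared distribution with J degrees of freedom, the standard asymptotic result for J restrictions.

```python
import numpy as np

def design(p_hat):
    """X[i, j] = (dummies for alternatives 1..J-1, log p_hat[i, j]); alpha_J = 0."""
    n, J = p_hat.shape
    X = np.zeros((n, J, J))
    X[:, :J - 1, :J - 1] = np.eye(J - 1)
    X[:, :, J - 1] = np.log(p_hat)
    return X

def probs(theta, X):
    """Multinomial logit choice probabilities for theta = (alpha_1..alpha_{J-1}, beta)."""
    v = X @ theta
    e = np.exp(v - v.max(axis=1, keepdims=True))   # subtract the row maximum for stability
    return e / e.sum(axis=1, keepdims=True)

def loglik(theta, X, y):
    return np.log(probs(theta, X)[np.arange(len(y)), y]).sum()

def score_info(theta, X, y):
    """Gradient of the log-likelihood and observed information (minus the Hessian)."""
    P = probs(theta, X)
    D = np.zeros_like(P)
    D[np.arange(len(y)), y] = 1.0
    grad = np.einsum("ij,ijk->k", D - P, X)
    xbar = np.einsum("ij,ijk->ik", P, X)
    info = np.einsum("ij,ijk,ijl->kl", P, X, X) - xbar.T @ xbar
    return grad, info

def bias_test(p_hat, y, max_iter=100, tol=1e-10):
    """ML estimates, LR and Wald statistics for H0: all alpha_j = 0 and beta = 1."""
    X = design(p_hat)
    theta0 = np.zeros(X.shape[2])
    theta0[-1] = 1.0
    theta = theta0.copy()                      # start the iterations at the null values
    for _ in range(max_iter):
        grad, info = score_info(theta, X, y)
        step = np.linalg.solve(info, grad)
        current = loglik(theta, X, y)
        for _halving in range(30):             # step halving guards against overshooting
            if loglik(theta + step, X, y) >= current:
                break
            step = step / 2.0
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    _, info = score_info(theta, X, y)
    diff = theta - theta0
    lr = -2.0 * (loglik(theta0, X, y) - loglik(theta, X, y))
    wald = diff @ info @ diff                  # V^{-1} is the observed information
    return theta, lr, wald

# Simulated example (assumption): true probabilities and outcomes drawn from them.
rng = np.random.default_rng(1)
n, J = 500, 3
p = rng.dirichlet(np.full(J, 2.0), size=n)
u = rng.random(n)
y = np.minimum((u[:, None] > np.cumsum(p, axis=1)).sum(axis=1), J - 1)

q = p ** 0.3                                   # flattened, hence biased, forecasts
q = q / q.sum(axis=1, keepdims=True)

theta_u, lr_u, wald_u = bias_test(p, y)        # forecasts equal to the true probabilities
theta_b, lr_b, wald_b = bias_test(q, y)        # biased forecasts
print("unbiased:", theta_u, lr_u, wald_u)
print("biased:  ", theta_b, lr_b, wald_b)
```

In the biased case the forecasts are too flat, so the estimated slope should lie well above 1 and both statistics should be large relative to the 5% critical value of 7.815 for J = 3.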
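The model in (4) can also be extended to include choice category-specific parameters $\beta_j$, like
$$\Pr[Y_i = j] = \frac{\exp(\alpha_j + \beta_j \log \hat{p}_{i,j})}{\sum_{l=1}^{J} \exp(\alpha_l + \beta_l \log \hat{p}_{i,l})}. \tag{10}$$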
No identification restrictions on the β parameters are necessary as the explanatory variables are different across the alternatives. This extension allows for a more subtle analysis of sources of bias. However, as the number of relevant parameter restrictions increases, this may come at the expense of the power of the test.
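To make the extension with category-specific slopes concrete, the sketch below evaluates choice probabilities of this form for a small made-up forecast matrix. The function name `probs_category_slopes` and the numbers in `p_hat` are hypothetical and serve only as an illustration. The example also counts the restrictions under the joint null hypothesis, which shows why the extended test involves more parameters.

```python
import numpy as np

def probs_category_slopes(alpha, beta, p_hat):
    """Choice probabilities with an intercept alpha_j and a slope beta_j per alternative.

    alpha has length J with alpha[-1] = 0 for identification; beta has length J.
    """
    v = alpha + beta * np.log(p_hat)               # broadcasts over the rows of p_hat
    e = np.exp(v - v.max(axis=1, keepdims=True))   # subtract the row maximum for stability
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical forecast probabilities for two individuals and J = 3 alternatives.
p_hat = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3]])
J = p_hat.shape[1]

# Under the null (all alpha_j = 0, all beta_j = 1) the forecasts are reproduced.
null_probs = probs_category_slopes(np.zeros(J), np.ones(J), p_hat)

# A slope below 1 for the first alternative only raises its probability,
# because log p_hat is negative.
alt_probs = probs_category_slopes(np.zeros(J), np.array([0.5, 1.0, 1.0]), p_hat)

n_restrictions = (J - 1) + J    # J - 1 intercepts plus J slopes
print(null_probs)
print(alt_probs)
print(n_restrictions)
```

With a common slope the joint null involves J restrictions, whereas here it involves 2J - 1, which is consistent with the remark that the more flexible model may cost power.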
3. Simulations
To analyze whether our proposed test is useful in practically relevant cases, we now perform various simulation experiments. As a true Data-Generating Process (DGP), we simulate probabilities $p_{i,1}, \ldots, p_{i,J}$ from a Dirichlet distribution [formula omitted] for $i = 1, \ldots, N$. Hence, for each individual, we have different probabilities. Given these probabilities, we generate true values $y_i$. Next, we create predictive probabilities $\hat{p}_{i,j}$ using [formula omitted] for $j = 1, \ldots, J$ and for different parameters $a$ and $b$, where $I[\cdot]$ is an indicator function. For $a = 0$ and $b = 1$, there is no bias in the forecast probabilities. Next, we estimate the parameters in the MNL model in (4) with the assumption that $\alpha_J = 0$ using Maximum Likelihood Estimation (MLE), described in Chapter 6 of [4], amongst others. Finally, we compute the LR test values for the following three null hypotheses: $\alpha_1 = \cdots = \alpha_{J-1} = 0$ and $\beta = 1$ jointly (LRab), $\alpha_1 = \cdots = \alpha_{J-1} = 0$ (LRa), and $\beta = 1$ (LRb), where we adopt the 5% significance level. The number of replications is 10,000. The results are presented in Figure 1, Figure 2 and Figure 3.
Figure 1a presents the power plots of the three LR tests for the case that $b = 1$ and the sample size is $N = 50$, with different values of $a$ on the horizontal axis. Figure 1b presents the power plots of the three LR tests for the case that $a = 0$ and the sample size is $N = 50$, with different values of $b$ on the horizontal axis. Note that, for typical applications in marketing research, consumer choice modeling, transportation choice, and more, this sample size can be considered quite small.
Figure 1a,b show that, even for this small sample, the empirical size is appropriately close to 5%, and the power increases as $a$ and $b$, respectively, move further away from their values under the null hypothesis. Furthermore, we see that the LRab test has less power than the LRb test when the true [formula omitted]. However, the loss in power for the LRab test is very small compared to the LRb test when the true [formula omitted].
Figure 2a,b consider the same settings, but now for sample size [formula omitted], whereas Figure 3a,b concern the sample size [formula omitted]. Comparing these figures with Figure 1a,b, where we looked at sample size $N = 50$, we see that the power curves of the tests steepen rapidly with increasing sample size, and hence the power quickly converges to 1.
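A small Monte Carlo experiment of this kind can be sketched as follows. Several ingredients are assumptions made only for this illustration, because the corresponding details are not recoverable here: the Dirichlet(1, ..., 1) distribution for the true probabilities, J = 3 alternatives, the distortion $\hat{p}_{i,j} \propto \exp(a I[j = 1] + b \log p_{i,j})$, the sample sizes, and the use of 200 rather than 10,000 replications. The sketch computes only the joint LR statistic (LRab) and compares it with the 5% critical value of a chi-squared distribution with three degrees of freedom.

```python
import numpy as np

CRIT_CHI2_3 = 7.815   # 5% critical value of chi-squared(3); only valid for J = 3

def softmax(v):
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def lr_ab(p_hat, y, max_iter=50):
    """LR statistic for H0: alpha_1 = ... = alpha_{J-1} = 0 and beta = 1."""
    n, J = p_hat.shape
    X = np.zeros((n, J, J))
    X[:, :J - 1, :J - 1] = np.eye(J - 1)          # intercept dummies, alpha_J = 0
    X[:, :, J - 1] = np.log(p_hat)                # regressor with slope beta
    D = np.zeros((n, J))
    D[np.arange(n), y] = 1.0

    def loglik(t):
        return np.log(softmax(X @ t)[np.arange(n), y]).sum()

    theta = np.zeros(J)
    theta[-1] = 1.0                               # start at the null values
    ll_null = loglik(theta)
    for _ in range(max_iter):                     # Newton-Raphson with step halving
        P = softmax(X @ theta)
        grad = np.einsum("ij,ijk->k", D - P, X)
        xbar = np.einsum("ij,ijk->ik", P, X)
        info = np.einsum("ij,ijk,ijl->kl", P, X, X) - xbar.T @ xbar
        step = np.linalg.lstsq(info, grad, rcond=None)[0]
        current = loglik(theta)
        for _halving in range(30):
            if loglik(theta + step) >= current:
                break
            step = step / 2.0
        theta = theta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return 2.0 * (loglik(theta) - ll_null)

def simulate(n, J, a, b, rng):
    """Assumed DGP: true p from Dirichlet(1, ..., 1), forecasts distorted by (a, b)."""
    p = rng.dirichlet(np.ones(J), size=n)
    u = rng.random(n)
    y = np.minimum((u[:, None] > np.cumsum(p, axis=1)).sum(axis=1), J - 1)
    v = b * np.log(p)
    v[:, 0] += a                                  # shift for the first alternative only
    return softmax(v), y

def rejection_rate(n, a, b, reps=200, J=3, seed=0):
    rng = np.random.default_rng(seed)
    stats = np.array([lr_ab(*simulate(n, J, a, b, rng)) for _ in range(reps)])
    return float(np.mean(stats > CRIT_CHI2_3))

size_50 = rejection_rate(50, a=0.0, b=1.0)        # no bias: rejection rate is the size
power_50 = rejection_rate(50, a=1.0, b=1.0)       # biased forecasts, N = 50
power_200 = rejection_rate(200, a=1.0, b=1.0)     # biased forecasts, N = 200
print(size_50, power_50, power_200)
```

With so few replications the rejection frequencies are noisy, so the output should be read as a qualitative check on size and power rather than as a replication of the reported figures.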
4. Illustration
As an illustration, we consider an optical scanner panel data set on purchases of four brands of saltine crackers in the Rome (Georgia) market, collected by Information Resources Incorporated. The data set contains 3292 purchases of crackers made by 136 households over about two years. The data are also analyzed in Chapter 6 of [4] (the data are available from the authors, and the code amounts to a standard module on multinomial logit model estimation). The brands are called Private label, Sunshine, Keebler, and Nabisco. For each purchase, we have the actual price of the purchased brand, the shelf price of the other brands and four times two dummy variables which indicate whether the brands were on display or featured. To describe brand choice, we consider the conditional logit model, where we include as explanatory variables per category the price of the brand and three 0/1 dummy variables indicating whether a brand was on display only or featured only or jointly on display and featured. To allow for out-of-sample evaluation of the model, we hold out the last purchase of each household from the estimation sample. Hence, we have 3156 observations for parameter estimation and use the estimated model to provide 136 forecast probabilities for the out-of-sample purchases. So, $N = 136$ and $J = 4$.
We obtain the following LR test values:
LRab = 1.67
LRa = 1.57
LRb = 0.06
which suggests that the predicted probabilities from our MNL model do not entail biased forecasts.
When we allow for four different $\beta_j$ values, that is, as in (10), we obtain the LR test values
LRab = 2.22
LRa = 3.24
LRb = 2.83
Again, the tests do not reject the null hypothesis, so we find no evidence that the MNL model, as specified in [4], delivers biased forecasts. Hence, the tests based on (4) and on (10) lead to the same conclusion.
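The conclusion can be checked by comparing the reported LR values with 5% critical values. The sketch below assumes that each statistic is asymptotically chi-squared distributed with degrees of freedom equal to the number of restrictions, and that LRa restricts all J - 1 intercepts; both are assumptions of this illustration. The critical values are the standard 95% quantiles of the chi-squared distribution.

```python
# 95% quantiles of the chi-squared distribution for 1 to 7 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592, 7: 14.067}

J = 4  # four cracker brands

# name: (reported LR value, assumed number of restrictions)
tests = {
    "common slope, LRab": (1.67, (J - 1) + 1),
    "common slope, LRa": (1.57, J - 1),
    "common slope, LRb": (0.06, 1),
    "category-specific slopes, LRab": (2.22, (J - 1) + J),
    "category-specific slopes, LRa": (3.24, J - 1),
    "category-specific slopes, LRb": (2.83, J),
}

rejected = {}
for name, (stat, df) in tests.items():
    rejected[name] = stat > CHI2_95[df]
    print(f"{name}: LR = {stat:.2f}, df = {df}, "
          f"critical value = {CHI2_95[df]:.3f}, reject = {rejected[name]}")
```

Under these assumptions none of the six statistics exceeds its critical value, in line with the conclusion that there is no evidence of forecast bias.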
5. Conclusions
We have proposed a simple-to-implement Likelihood Ratio (and Wald) test for forecast bias in case the predictions concern probabilities on independent multinomial outcomes. The test is independent of the origin of the predictions. With simulations, we have shown that the test has proper empirical size and that the empirical power quickly increases with growing sample size. An illustration showed the ease of use of the test.
Conceptualization, P.H.F. and R.P.; methodology, P.H.F. and R.P.; software, R.P.; validation, P.H.F. and R.P.; formal analysis, R.P.; investigation, P.H.F. and R.P.; resources, P.H.F. and R.P.; data curation, R.P.; writing—original draft preparation, P.H.F. and R.P.; writing—review and editing, P.H.F.; visualization, R.P.; supervision, P.H.F.; project administration, P.H.F. All authors have read and agreed to the published version of the manuscript.
The data can be obtained from the authors upon request.
The authors declare no conflicts of interest.
Figure 1. (a) The power curve for $b = 1$ for sample size $N = 50$, where the values of $a$ are on the horizontal axis. (b) The power curve for $a = 0$ for sample size $N = 50$, where the values of $b$ are on the horizontal axis.
Figure 2. (a) The power curve for $b = 1$ for sample size [Formula omitted. See PDF.], where the values of $a$ are on the horizontal axis. (b) The power curve for $a = 0$ for sample size [Formula omitted. See PDF.], where the values of $b$ are on the horizontal axis.
Figure 3. (a) The power curve for $b = 1$ for sample size [Formula omitted. See PDF.], where the values of $a$ are on the horizontal axis. (b) The power curve for $a = 0$ for sample size [Formula omitted. See PDF.], where the values of $b$ are on the horizontal axis.
References
1. de Jong, V.M.T.; Eijkemans, M.J.C.; van Calster, B.; Timmerman, D.; Moons, K.G.M.; Steyerberg, E.W.; van Smeden, M. Sample size consideration and predictive performance of multinomial logistic prediction models. Stat. Med.; 2019; 38, pp. 1601-1619. [DOI: https://dx.doi.org/10.1002/sim.8063] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30614028]
2. Mincer, J.; Zarnowitz, V. The evaluation of economic forecasts. In Economic Forecasts and Expectations; Mincer, J., Ed.; National Bureau of Economic Research: New York, NY, USA, 1969.
3. Franses, P.H. Testing for bias in forecasts for independent binary outcomes. Appl. Econ. Lett.; 2021; 28, pp. 1336-1338. [DOI: https://dx.doi.org/10.1080/13504851.2020.1838429]
4. Franses, P.H.; Paap, R. Quantitative Models in Marketing Research; Cambridge University Press (CUP): Cambridge, UK, 2001.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
This paper deals with a test on forecast bias in predicting independent multinomial outcomes where the predictions are probabilities. The new Likelihood Ratio (and Wald) test extends the familiar Mincer–Zarnowitz regression to a multinomial logit model instead of a linear regression. The test is evaluated using various simulation experiments, which indicate that the size and power properties are good, even for small sample sizes, in the sense that the size is close to the nominal 5% level, and the power quickly reaches 1. We implement the test in an empirical setting on brand choice by individual households.