1. Introduction
The change-point problem was first proposed by Page [1]. It concerns a model in which the distribution of the observed data changes abruptly at some point in time, a situation common in biology [2], finance [3], literature [4] and epidemiology [5]. Change-point detection can be employed as a tool in time series segmentation; a representative reference is [6]. Once a change point is detected in a data sequence, it is used to split the sequence into two segments that are modeled separately. From a practical point of view, behavior and policies can also be adjusted based on changes in events of interest. It is therefore important to perform change-point detection.
There are mainly two problems in the change-point model: checking the existence of change points and estimating their positions. These issues have been studied in a substantial literature. For example, see Sen and Srivastava [7] for the mean change in a normal distribution and Worsley [8] for a change in an exponential family using the maximum likelihood ratio method. Others include Bai [9] for the least squares estimation of a mean shift in linear processes, Vexler [10] for the change-point problem in a linear regression model and Gombay [11] for change points in autoregressive time series. See [12,13,14] for details.
Most work on change-point problems has been done for continuous data [14]. In real life, however, many data are observed on a discrete scale; common discrete distributions include the binomial, multinomial and Poisson distributions. In this article, we consider the change-point problem in a multinomial sequence, a problem that originated in the study of the transcription of the Gospels [15]. The Lindisfarne Gospels were divided into several sections, under the assumptions that only one author contributed to the writing of any section and that the sections written by any one author were contiguous. The goal was to test whether a single author wrote the Gospels; the data may be the frequencies of vocabulary or grammatical forms used by the author of each section. In general, suppose that we observe K independent multinomial variables, where at each time point there are a number of experiments with m possible outcomes and the observation records the frequencies of the m outcomes. We want to test
H_0: p_1 = p_2 = ⋯ = p_K versus H_1: p_1 = ⋯ = p_{k*} ≠ p_{k*+1} = ⋯ = p_K, (1)

where k* is the true change-point, 1 ≤ k* < K, and p_k denotes the probability vector of the k-th observation. If H_0 is rejected, we further estimate k*. To solve this problem, Wolfe and Chen [16] proposed several statistics based on the cumulative sum (CUSUM) method. Horváth and Serbinowska [17] used the maximum likelihood ratio and maximum chi-square statistics to test for the existence of change points and derived their limiting distributions. Batsidis et al. [18] extended this work and proposed a family of phi-divergence tests that includes a broad class of statistics. Riba and Ginebra [19] performed a graphical exploration of a sequence of multinomial observations and found a break point. Note that they all assumed that the number of categories m is fixed.
In recent years, the rise of big data has made the high-dimensional change-point problem more important. It thus becomes necessary to consider high-dimensional multinomial data, since the number of categories can be quite large in practice, such as the types of stores selling a certain item on a shopping platform or the types of illnesses seen in an outpatient clinic during a day. In this paper, we consider problem (1) with m tending to infinity. Recently, Wang et al. [20] proposed a procedure based on Pearson's chi-square test under this scenario. Their idea is to pre-divide the categories into two groups based on their probability magnitudes and to use the original and a modified Pearson's chi-square statistic for the large and small categories, respectively. This pre-classification can balance sparse and dense signals, resulting in good statistical performance. Here, we use the pre-classification idea to construct a test statistic for problem (1) with m tending to infinity.
Another tool used in this article is information entropy. Entropy, originally a concept in statistical physics, was introduced into information theory by Shannon [21] and has been widely applied in change-point problems. Unakafov and Keller [22] used the conditional entropy of ordinal patterns to detect change points. Ma and Sofronov [23] proposed a cross-entropy algorithm to estimate the number and positions of change points. Vexler and Gurevich [24] applied an empirical likelihood method to change-point detection, in which the essence of the empirical likelihood estimation is a density-based entropy estimate. Mutual information (MI), computed as the difference between entropy and conditional entropy, is popular in deep learning; see, for example, [25,26]. In machine learning, MI is closely related to information gain, which is often used to measure the goodness of a step in an algorithm, such as the selection of node splits in a tree. MI is therefore a natural metric in event detection problems; relevant works include [27,28]. We utilize the MI between the data and their positions in this paper, given that a large value of MI indicates a high probability that a change point occurs.
In this paper, we consider the offline change-point problem. We propose a test statistic based on mutual information for the at most one change-point (AMOC) problem (1), with m tending to infinity as the sample size tends to infinity, adopting the pre-classification idea of [20]. The optimal change-point position can also be estimated by MI. We show that the proposed statistic has an asymptotic normal distribution under the null hypothesis and that the power of the test converges to one under the alternative hypothesis. We also point out the relationship between MI and the likelihood ratio; in fact, the proposed statistic is based on the likelihood ratio method. Although there is in general no uniformly most powerful test for change-point detection [29,30], tests based on the likelihood ratio structure have high power [31]. Simulation studies demonstrate the excellent power of the proposed test as well as the high accuracy of the estimation. Our main innovation is to replace the Pearson chi-square statistic in Wang et al. [20] with mutual information, which achieves better performance in terms of power and estimation accuracy compared to their method.
The remaining structure of this paper is as follows. In Section 2, we present the proposed test statistic and the estimation method of a change point. In Section 3, we provide simulation results. In Section 4, we illustrate the method with an example based on physical examination data. In Section 5, we conclude the paper with some remarks. The proofs of the theorems are given in Appendix A.
2. Methods
2.1. Entropy and Mutual Information
We first briefly introduce some concepts about entropy and mutual information.
Definition 1. Suppose that x_1, …, x_u are the possible values taken by a random variable X, where u can be infinite, and let p_i be the probability that X = x_i. The Shannon entropy of X is defined as

H(X) = − Σ_{i=1}^{u} p_i log p_i, (2)

where p_i log p_i is defined to be 0 when p_i = 0.

Definition 2. Let Y be a random variable that takes values in {y_1, …, y_v}, where v can be infinite. The conditional entropy of X given Y is defined as

H(X | Y) = − Σ_{j=1}^{v} Σ_{i=1}^{u} p(x_i, y_j) log p(x_i | y_j), (3)

where p(x_i, y_j) and p(x_i | y_j) are the joint probability of X and Y and the conditional probability of X given Y = y_j, respectively.

Definition 3. Assume that X and Y are the same as in Definitions 1 and 2. The mutual information (MI) of X relative to Y is defined as

MI(X; Y) = Σ_{i=1}^{u} Σ_{j=1}^{v} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]. (4)
The entropy value is larger when the data distribution is closer to uniform; conversely, when the data are skewed, the entropy is small [32]. The conditional entropy measures how much uncertainty about X remains after observing Y. Obviously, mutual information can be written as the difference between entropy and conditional entropy, that is, MI(X; Y) = H(X) − H(X|Y). It represents the average amount of information about X that can be gained, i.e., the reduction of uncertainty in X, by observing Y. MI(X; Y) ≥ 0, and it equals zero if and only if X and Y are independent.
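These relations can be checked numerically. The following sketch (written in Python for illustration, while the paper's own code is in R) estimates entropies from samples and computes MI through the equivalent identity MI(X; Y) = H(X) + H(Y) − H(X, Y):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in nats) of a sample; 0 * log(0) is treated as 0."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), the sample analogue of (4)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Independent X and Y give MI = 0; Y = X gives MI = H(X).
xs = [0, 0, 1, 1] * 3
ys_indep = [0, 1, 0, 1] * 3
```

With these toy samples, `mutual_information(xs, ys_indep)` is 0 and `mutual_information(xs, xs)` equals `entropy(xs)`, matching the two boundary cases described above.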
2.2. Pre-Classification
For multinomial data, when the number of categories m is large, it is sometimes unrealistic to treat all categories equally. For example, of all the cities in China, only a few account for half of the economy, which means that the remaining cities have a small average share. The well-known Pareto principle [33], that 20% of the population owns 80% of the wealth in society, also illustrates this phenomenon. Therefore, it is reasonable to separate categories of different orders of magnitude.
Consider problem (1), i.e.,
Denote the common parameter vector under H_0 and the pre- and post-change parameter vectors under H_1 as in (1). Similar to Wang et al. [20], let the categories whose probabilities exceed a threshold (satisfying some conditions as the sample size tends to infinity) form one subset, and let the remaining categories, obtained via the complement operator, form the other. Then, the m categories are divided into those of large and small orders of magnitude, denoted by A and B, respectively. A change in the parameters might occur either in A or in B.
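As a purely illustrative sketch of such a split (the threshold form c0·log(m)/N below is our own placeholder, not the condition used in the paper), categories whose empirical probability exceeds the threshold form the large group and the rest form the small group:

```python
import math

def pre_classify(counts, c0=5.0):
    """Split category indices into a 'large' set A and a 'small' set B.

    The threshold c0 * log(m) / N is a placeholder assumption for
    illustration; the paper only requires a threshold sequence
    satisfying certain asymptotic conditions.
    """
    N = sum(counts)
    m = len(counts)
    thresh = c0 * math.log(m) / N
    A = [j for j, c in enumerate(counts) if c / N > thresh]
    B = [j for j in range(m) if j not in A]
    return A, B

# A few dominant categories and a long tail of rare ones:
counts = [400, 350, 120, 5, 3, 2, 1, 1]
A, B = pre_classify(counts)
```

Here the first three categories end up in A and the five rare ones in B, mirroring the large/small split described above.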
Let the components of each observation restricted to A, and the corresponding parameter components, be defined accordingly, and similarly for B. Then, the marginal distributions of the restricted observations under the null assumption are
(5)
and

(6)
In the next subsection, we construct a statistic built on the marginal distributions (5) and (6).
Here are some additional notations. Denote , , as the number of experiments in total, before and after time k, and , , as the number of successful trials in total, before and after time k. Let , , and be the corresponding frequencies.
For the data in A, let , , be the number of successful trials in total, before and after time k. Define , , and as the sum of successful trials in B of total, before, and after k. Let , , , , , , be the corresponding frequencies. Subscript S denotes the sum of frequencies. Similarly, we define , , , , , , , , , , , , . We illustrate some of the above notations in Table 1 in a more structured fashion.
2.3. Test Statistic
We use MI between the data and the location of the data to construct the statistic. For the data in A, the entropy is
(7)
The entropies in A before and after k are
(8)
and

(9)
respectively. Denote the indicator function of the position of a sample relative to k, i.e., whether the location of the sample is before k. Note that, given the observations, the decomposition below holds by independence. By Section 2.1, the MI between X and this indicator in A is
(10)
where the second term is the conditional entropy of X given the indicator. The MI in B is defined similarly, with the corresponding quantities defined as in (7)–(9). The uncertainty of X given the indicator reaches its largest reduction when k is at the true break point; hence, either the MI in A or the MI in B should be large there. On the contrary, if the sequence is stable, the MI should be small for any k.
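As a Python illustration of this idea (the sequence and sizes are made up; the paper's simulations use R), the MI between the observations and the before/after-k indicator can be scanned over all candidate split points, and the profile peaks at the true break point:

```python
import math
from collections import Counter

def entropy(zs):
    """Shannon entropy (in nats) of a sample."""
    n = len(zs)
    return -sum((c / n) * math.log(c / n) for c in Counter(zs).values())

def mi_at_split(seq, k):
    """MI between the observed category and the indicator 1{position <= k}."""
    labels = [int(t <= k) for t in range(len(seq))]
    return entropy(seq) + entropy(labels) - entropy(list(zip(seq, labels)))

# A categorical sequence whose distribution changes abruptly after position 9:
seq = [0] * 10 + [1] * 10
profile = [mi_at_split(seq, k) for k in range(len(seq) - 1)]
k_hat = max(range(len(profile)), key=lambda k: profile[k])
```

For this sequence the profile is maximized at k = 9, the last index of the first regime, where the MI attains its theoretical ceiling log 2.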
Since A and B are unknown, in light of Wang et al. [20], we estimate A by thresholding the empirical category probabilities, where the threshold involves some constant. As shown in [20], this yields a consistent estimator of A if the threshold satisfies certain assumptions. Construct the test statistic
(11)
for (1). Summation and maximization over k are conducted for the MI terms of A and of B, respectively. The first term is a weighted log-likelihood ratio estimate, as pointed out after Lemma 1. The second term is based on the maximum norm of the MI; it is widely acknowledged that max-norm tests are more suitable for sparse and strong signals, see [34,35]. A threshold, chosen as a large number, restricts the maximization and ensures that the second term converges to zero under H_0. Note that the statistic in [20] is based on the Pearson chi-square statistic; since in reality the frequencies of small categories might be zero, the Pearson chi-square statistic for the small categories is modified there. The statistic presented here does not need to account for zero frequencies, since by the definition of entropy, p log p = 0 when p = 0. In order to study the properties of the proposed statistic better, we first give a lemma relating the MI terms to log-likelihood ratios; the same relation holds when A is replaced by B in all the subscripts (Lemma 1).
Note that the two quantities in Lemma 1 are estimates of minus twice the log-likelihood ratios for the data in A and in B when the change point is at k. Therefore, the problem based on MI can be transformed into one based on likelihood ratios.
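This relationship is the classical identity between the likelihood-ratio (G) statistic and empirical mutual information, which can be checked numerically. The sketch below (Python, with made-up counts, not the paper's data) verifies that the G statistic for independence in a two-way table equals 2N times the empirical MI in nats:

```python
import math
from collections import Counter

def entropy(zs):
    """Shannon entropy (in nats) of a sample."""
    n = len(zs)
    return -sum((c / n) * math.log(c / n) for c in Counter(zs).values())

# Made-up 2x2 table of joint observations (counts are illustrative):
pairs = [(0, 0)] * 30 + [(0, 1)] * 10 + [(1, 0)] * 15 + [(1, 1)] * 25
N = len(pairs)
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

# Empirical mutual information (in nats):
mi = entropy(xs) + entropy(ys) - entropy(pairs)

# Likelihood-ratio (G) statistic for independence: 2 * sum O * log(O / E),
# where E is the expected cell count under independence.
row, col, cell = Counter(xs), Counter(ys), Counter(pairs)
G = 2 * sum(o * math.log(o / (row[i] * col[j] / N))
            for (i, j), o in cell.items())
```

Up to floating-point error, G equals 2·N·mi, which is why limit theory for likelihood ratios transfers to the MI terms.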
By Lemma 1, the second term in (11) can be expressed through likelihood ratios, and hence the existing limit theorems on likelihood ratios can be applied to it directly. The first term in (11) has the form of a weighted log-likelihood ratio estimate. In Appendix A, we show by a Taylor expansion that it differs from a CUSUM statistic [36] only by an asymptotically negligible term, and we then prove the asymptotic distribution of the proposed statistic from related results.
The unweighted sum of the MI terms is closely related to the Shiryayev–Roberts procedure [37,38], which uses the sum of the likelihood ratios over candidate change points k as a statistic and is widely applied to determine the best stopping criterion in sequential change-point monitoring (see, e.g., [39]). However, replacing the unknown parameters with their maximum likelihood estimates, as done in this paper, would make the asymptotic analysis of the unweighted sum complex [40]. So, here we use the weighted version instead.
Theorem 1. Let |A| denote the cardinality of any set A, with a corresponding notation for the maximal cardinality. Assume that, as the total number of trials N tends to infinity: (i) a rate condition ensuring the consistency of the estimated category split holds; (ii) the threshold diverges sufficiently fast; and (iii) each per-time sample size is much smaller than N. Then, under H_0, the proposed statistic converges in distribution to the standard normal.
Theorem 1 shows that the proposed statistic is asymptotically normally distributed under the null hypothesis. Condition (i) in Theorem 1 ensures the consistency of the estimated set, which was also assumed in Theorem 1 of [20]. Condition (ii) requires the threshold to be large enough to guarantee that the second term of the statistic converges to zero with probability one under the null hypothesis. Condition (iii) means that every per-time sample size is much less than N. Next, we focus on the properties of the statistic under the alternative hypothesis.
Theorem 2. Assume that conditions (i)–(iii) in Theorem 1 hold. Further assume that, as N tends to infinity: (i) the magnitude of the change tends to infinity at a suitable rate; (ii) the sample sizes before and after the change point are comparable; and (iii)–(iv) additional rate conditions hold for some constants. Then the power of the test converges to one, where the critical value is that of the standard normal distribution at level α.
Theorem 2 establishes the consistency of the test under certain conditions when the probabilities in A or B change. Condition (i) in Theorem 2 means that the change magnitude tends to infinity at a certain rate, which ensures that the statistic tends to infinity when the parameters in A change. Condition (ii) requires comparable sample sizes before and after the change point. The proofs of Theorems 1 and 2 are provided in Appendix A.
Once H_0 is rejected, we further use MI to estimate k*: the estimate is the value of k that maximizes the corresponding MI term, over A or B according to which attains the larger value. Numerical studies in the next section show that the power of the new statistic increases rapidly as the difference between the alternative and null hypotheses increases. At the same time, the precision of the estimate using pre-classification is also satisfactory.
3. Simulation
We conduct simulation experiments to assess the performance of the test procedures in empirical size, power and estimation in finite samples. All results are based on 1000 replications. We use R to obtain simulation results. The necessary R code is given in Appendix B.
To analyze the empirical size, we simulate multinomial data under the null hypothesis of no break, with reference to [20]. The probability vector is chosen so that the first d probabilities are much greater than the remaining ones. Following [20], the threshold is based on the sorted values of the estimated probabilities. We consider different situations with the sample size K ranging from 50 to 500. For simplicity, the remaining tuning constants are fixed according to the conditions in the previous section. The simulation results for various combinations of ( , d) are reported in Table 2. We observe that the empirical size of the test is 4.5–6.7%, which is around the nominal 5% level in the different situations. We also performed simulations for other settings and found empirical sizes slightly higher than 5% (data not shown).
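The null-model data generation just described can be sketched as follows (pure-standard-library Python, while the paper's code is in R; the probability weights, seed and sizes below are illustrative choices, not the paper's):

```python
import random
from collections import Counter

random.seed(0)
K, m, d, n_trials = 200, 50, 5, 30

# First d categories carry much more mass than the remaining m - d.
weights = [m] * d + [1] * (m - d)

def draw_multinomial():
    """One multinomial observation: n_trials draws over m categories."""
    draws = random.choices(range(m), weights=weights, k=n_trials)
    cnt = Counter(draws)
    return [cnt.get(j, 0) for j in range(m)]

# K independent multinomial observations under the null (no break):
data = [draw_multinomial() for _ in range(K)]
```

Under the alternative hypotheses below, one would instead switch to a perturbed weight vector after the chosen break point.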
To evaluate the power of the test, the alternative hypotheses stipulate a single break in the data sequence. We first consider parameters of two forms:
(i) the large probabilities change while the small probabilities remain the same;
(ii) the small probabilities change while the large probabilities remain the same.
The weighted maximum likelihood ratio statistic L proposed by Horváth and Serbinowska [17];
The statistic Q in [20], with its tuning parameters set as in the simulations there.
The results are summarized in Figure 1 for the 5% level. The empirical size of L is on the high side, as seen from the curves at small s in Figure 1. The new test is very powerful, as evidenced by the rapid convergence of its power to 1 as s increases. In most cases, the empirical power of G is larger than that of the other two statistics under alternative hypothesis (i). Under alternative hypothesis (ii), the three statistics perform equally well. These results further show that our test has higher power to detect a change located in the middle of the sample than at the beginning, although the power in the latter case is still high.
We also briefly investigate how well the change-point location is approximated by the proposed estimator, for two choices of the change-point location. In Table 3 and Table 4, we report the mean and standard deviation of the absolute errors for the different choices of s and m under alternative hypotheses (i) and (ii), respectively. We compare our estimate with the maximum likelihood ratio estimate and the estimate in [20].
The absolute errors in Table 3 and Table 4 underscore the considerable precision of the proposed estimator, which improves as s increases from 0.3 to 0.8. Under alternative hypothesis (i), in which the large probabilities change while the small probabilities remain the same, our estimator is better than the two competitors in almost all situations; small changes (for example, s = 0.3) are detected with greater difficulty by the competitors, while the precision of our estimator remains high. This is probably because entropy, as a non-linear function, can amplify the difference between frequencies, an effect that is more pronounced when the difference is small (e.g., s = 0.3). Under alternative hypothesis (ii), in which the small probabilities change while the large probabilities remain the same, two of the estimators have similar performance, and both are slightly better than the third.
Finally, we simulate the power and estimation precision for alternative hypothesis (iii):
where the parameters in A and B change simultaneously, a case not considered in [20]. We compare our test statistic and estimator with Q and the corresponding estimator in [20] in this case. The results are displayed in Figure 2, Table 5 and Table 6, from which we see that the power of our test is slightly higher than that of Q, and the precision of our estimator is clearly higher than that of the estimator in [20].

4. Example
In this section, we use a data set to demonstrate the applicability of our method. The data concern the medical examination results of people working in Hefei's financial sector (including banks and insurance companies) from 27 September 2017 to 25 August 2021 and include each person's age, gender, date of examination and the diseases detected. From the perspective of health analysis and disease prevention, it is important to understand how often each disease is detected.
Our goal is to test whether the proportions of people diagnosed with the various diseases change over time. After removing gender-specific diseases, we keep 210 diseases. Because no examinations took place in some weeks, we eliminate those weeks and keep 173 weeks. For each week, let the observation be a 210-dimensional vector whose components record the frequencies of the diseases detected during that week. Then, there are K = 173 vectors of dimension m = 210.
Figure 3 shows the numbers of the top 30 diseases detected. The weekly sample sizes are provided in Figure 4. We find from Figure 3 that the numbers of the first six diseases, Fatty Liver (FL), Overweight (OW), Thyroid Nodule (TN), Pulmonary Nodule (PN), Hepatic Cyst (HC) and Thyroid Cyst (TC), were much higher than those of the other diseases. By calculation, their proportions were, respectively, 0.088, 0.086, 0.06, 0.06, 0.038 and 0.032, which together accounted for 35.8% of all the detected diseases. Hence, we choose d = 6. The value of the statistic exceeds the critical value, and hence the null hypothesis that there is no change in the proportions of diseases detected is rejected.
The estimated change point corresponds to 27 December 2017. This suggests that the proportions of diseases detected differ before and after 2018. Table 7 displays the proportions of the first six diseases before and after 2018. The proportions of Overweight and Thyroid Nodule were the highest before 2018. After 2018, however, the proportion of Fatty Liver became the highest, and the proportion of Pulmonary Nodule also increased significantly. A possible explanation is that some unexpected events led to changes in people's lifestyles, which in turn led to changes in the proportions of the population suffering from different diseases. For example, the start of the Sino–US trade war in early February 2018 led to a continuous decline in the price of China's A-shares, which may have triggered the change in the lifestyle of financial practitioners after 2018. Studying the proportions of people with different diseases in the financial sector can reveal which diseases are on the rise in this sector, and hence proper recommendations can be made for disease prevention.
5. Conclusions
This paper develops a change-point test based on MI for multinomial data when the number of categories is comparable to the sample size. We show that under certain conditions, the proposed statistic is asymptotically normal under the null hypothesis and consistent under the alternative hypothesis. The simulation results suggest that the test based on the proposed statistic has a high power. The proposed inference procedures are used to analyze the change in proportions of diseases detected in physical examination data during a period.
Conceptualization and methodology, B.J., X.X. and Y.W.; software and writing—original draft preparation, X.X.; writing—review and editing, B.J. and Y.W.; supervision, B.J. All authors have read and agreed to the published version of the manuscript.
The authors are grateful to the referees for their insightful comments in revising this paper.
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
MI | Mutual Information |
FL | Fatty Liver |
OW | Overweight |
TN | Thyroid Nodule |
PN | Pulmonary Nodule |
HC | Hepatic Cyst |
TC | Thyroid Cyst |
Figure 1. Empirical power of the three statistics at the 5% level. (a) Power under alternative hypothesis (i). (b) Power under alternative hypothesis (ii). G denotes the proposed statistic; L is the weighted maximum likelihood ratio statistic in [17]; Q is the statistic in [20].
Figure 2. Empirical power of the proposed statistic G and Q under alternative hypothesis (iii).
Table 1. Explanation of some notations: the numbers of experiments, the numbers of successful trials, and the corresponding frequencies, given in total and before/after time k, separately for the category sets A and B.
Table 2. Empirical sizes of the test.

| ( , d) | m = 50 | 100 | 200 | 300 | 500 |
|---|---|---|---|---|---|
| (0.3, 5) | 0.046 | 0.053 | 0.052 | 0.045 | 0.053 |
| (0.3, 6) | 0.052 | 0.058 | 0.059 | 0.052 | 0.052 |
| (0.3, 10) | 0.053 | 0.039 | 0.054 | 0.060 | 0.057 |
| (0.5, 6) | 0.050 | 0.034 | 0.047 | 0.066 | 0.067 |
| (0.5, 8) | 0.053 | 0.045 | 0.054 | 0.062 | 0.062 |
| (0.5, 10) | 0.053 | 0.051 | 0.061 | 0.066 | 0.054 |
Table 3. Mean and standard deviation (in parentheses) of the absolute estimation errors under alternative hypotheses (i) and (ii).

| m | s | Alternative Hypothesis (i) | | | Alternative Hypothesis (ii) | | |
|---|---|---|---|---|---|---|---|
200 | 0.3 | 1.88(3.35) | 15.95(27.15) | 25.97(34.74) | 0.64(1.16) | 0.64(1.18) | 0.66(1.20) |
0.4 | 0.77(1.32) | 2.52(5.02) | 3.94(6.90) | 0.20(0.50) | 0.20(0.50) | 0.21(0.53) | |
0.5 | 0.49(0.91) | 1.11(1.93) | 2.11(3.16) | 0.06(0.25) | 0.06(0.25) | 0.06(0.27) | |
0.6 | 0.31(0.67) | 0.53(0.96) | 1.67(2.59) | 0.01(0.11) | 0.01(0.11) | 0.01(0.11) | |
0.7 | 0.16(0.44) | 0.29(0.63) | 1.12(1.94) | 0 | 0 | 0 | |
0.8 | 0.16(0.52) | 0.16(0.48) | 0.87(1.63) | 0 | 0 | 0 | |
500 | 0.3 | 1.57(2.32) | 10.17(28.95) | 6.40(9.60) | 0.57(1.11) | 0.56(1.10) | 0.57(1.07) |
0.4 | 0.89(1.49) | 2.24(3.41) | 3.52(4.59) | 0.16(0.46) | 0.16(0.46) | 0.18(0.48) | |
0.5 | 0.50(0.92) | 1.08(1.68) | 2.18(3.15) | 0.07(0.26) | 0.07(0.26) | 0.07(0.27) | |
0.6 | 0.31(0.67) | 0.51(1.06) | 1.61(2.46) | 0.01(0.11) | 0.01(0.10) | 0.02(0.13) | |
0.7 | 0.15(0.40) | 0.31(0.67) | 1.10(1.83) | 0 | 0 | 0 | |
0.8 | 0.09(0.32) | 0.14(0.42) | 0.88(1.39) | 0 | 0 | 0 |
Table 4. Mean and standard deviation (in parentheses) of the absolute estimation errors under alternative hypotheses (i) and (ii).

| m | s | Alternative Hypothesis (i) | | | Alternative Hypothesis (ii) | | |
|---|---|---|---|---|---|---|---|
200 | 0.3 | 10.01(31.28) | 33.16(48.90) | 66.82(58.79) | 0.86(1.71) | 0.87(1.70) | 0.89(1.87) |
0.4 | 0.95(1.61) | 8.21(23.85) | 19.98(41.21) | 0.24(0.59) | 0.25(0.62) | 0.25(0.60) | |
0.5 | 0.73(0.26) | 1.82(5.59) | 2.45(3.94) | 0.07(0.27) | 0.08(0.28) | 0.09(0.33) | |
0.6 | 0.71(1.29) | 0.75(1.35) | 1.62(2.49) | 0.02(0.15) | 0.02(0.15) | 0.03(0.18) | |
0.7 | 0.45(0.88) | 0.33(0.70) | 1.2(2.17) | 0 | 0 | 0 | |
0.8 | 0.30(0.66) | 0.20(0.49) | 0.93(1.73) | 0 | 0 | 0 | |
500 | 0.3 | 1.77(2.73) | 52.01(105.27) | 54.39(110.28) | 0.80(1.29) | 0.80(1.26) | 0.83(1.46) |
0.4 | 0.94(1.48) | 3.42(6.47) | 3.82(5.64) | 0.22(0.54) | 0.23(0.56) | 0.24(0.57) | |
0.5 | 0.80(1.34) | 1.40(2.27) | 2.52(3.88) | 0.07(0.28) | 0.07(0.28) | 0.07(0.28) | |
0.6 | 0.68(1.22) | 0.73(1.22) | 1.56(2.27) | 0.01(0.09) | 0.01(0.09) | 0.02(0.13) | |
0.7 | 0.40(0.82) | 0.36(0.75) | 1.09(1.67) | 0.01(0.08) | 0.01(0.08) | 0 | |
0.8 | 0.31(0.66) | 0.20(0.48) | 0.82(1.29) | 0 | 0 | 0 |
Table 5. Mean and standard deviation (in parentheses) of the absolute estimation errors under alternative hypothesis (iii).

| s | | | | |
|---|---|---|---|---|
0.3 | 1.82(2.63) | 4.01(6.68) | 1.62(2.68) | 5.59(8.32) |
0.4 | 0.82(1.42) | 3.42(4.76) | 0.82(0.82) | 3.06(4.49) |
0.5 | 0.47(0.89) | 2.24(3.37) | 0.46(1.07) | 2.33 (3.33) |
0.6 | 0.24(0.58) | 1.52(2.48) | 0.25(0.57) | 1.37(2.15) |
0.7 | 0.17(0.48) | 1.10(1.72) | 0.12(0.41) | 1.16(1.77) |
0.8 | 0.19(0.51) | 0.83(1.31) | 0.10(0.36) | 0.84(1.42) |
Table 6. Mean and standard deviation (in parentheses) of the absolute estimation errors under alternative hypothesis (iii).

| s | | | | |
|---|---|---|---|---|
0.3 | 1.67(2.61) | 1.70(3.69) | 1.80(2.85) | 4.89(7.62) |
0.4 | 0.98(1.58) | 3.22(6.08) | 0.85(1.42) | 3.59(5.02) |
0.5 | 0.84(1.41) | 2.44(3.78) | 0.55(1.11) | 2.54(3.80) |
0.6 | 0.54(1.01) | 1.54(2.41) | 0.69(1.18) | 1.67(2.45) |
0.7 | 0.42(0.91) | 1.28(2.12) | 0.44(0.87) | 1.16(1.91) |
0.8 | 0.29(0.65) | 0.88(1.40) | 0.28(0.62) | 0.87(1.47) |
Table 7. Proportions of the first six diseases before and after the estimated change point.

| Disease | FL | OW | TN | PN | HC | TC |
|---|---|---|---|---|---|---|
| Proportion before | 0.078 | 0.095 | 0.100 | 0.021 | 0.035 | 0.045 |
| Proportion after | 0.090 | 0.084 | 0.051 | 0.062 | 0.038 | 0.029 |
Appendix A
Lemma A1. (i) Under H_0, the stated convergence holds as N → ∞ for any k; (ii) under H_1, the corresponding convergence holds as N → ∞ for any k, where the limit is as defined in the conditions of Theorem 2.
See the proof of Theorem 1 in Wang et al. [20].
By Lemma 1 and Lemma A1, it suffices to deduce the distribution of
We first show that under
Because
Then
Let
First assume that
Now, assume that
Appendix B
Necessary R code related to this article can be found online at
References
1. Page, E.S. Continuous inspection schemes. Biometrika; 1954; 41, pp. 100-115. [DOI: https://dx.doi.org/10.1093/biomet/41.1-2.100]
2. Fletcher, R.J.; Robertson, E.P.; Poli, C.; Dudek, S.; Gonzalez, A.; Jeffery, B. Conflicting nest survival thresholds across a wetland network alter management benchmarks for an endangered bird. Biol. Conserv.; 2021; 253, 108893. [DOI: https://dx.doi.org/10.1016/j.biocon.2020.108893]
3. Fryzlewicz, P. Wild binary segmentation for multiple change-point detection. Ann. Stat.; 2014; 42, pp. 2243-2281. [DOI: https://dx.doi.org/10.1214/14-AOS1245]
4. Ross, G.J.; Chevalier, A.; Sharples, L. Tracking the evolution of literary style via Dirichlet-multinomial change point regression. J. R. Stat. Soc. Ser. A-Stat. Soc.; 2019; 183, pp. 149-167. [DOI: https://dx.doi.org/10.1111/rssa.12492]
5. Jiang, F.; Zhao, Z.; Shao, X. Time series analysis of COVID-19 infection curve: A change-point perspective. J. Econom.; 2020; 232, pp. 1-17. [DOI: https://dx.doi.org/10.1016/j.jeconom.2020.07.039]
6. Palivonaite, R.; Lukoseviciute, K.; Ragulskis, M. Algebraic segmentation of short nonstationary time series based on evolutionary prediction algorithms. Neurocomputing; 2013; 121, pp. 354-364. [DOI: https://dx.doi.org/10.1016/j.neucom.2013.05.013]
7. Sen, A.K.; Srivastava, M.S. On tests for detecting change in mean. Ann. Stat.; 1975; 3, pp. 98-108. [DOI: https://dx.doi.org/10.1214/aos/1176343001]
8. Worsley, K.J. Confidence regions and tests for a change-point in a sequence of exponential family of random variables. Biometrika; 1986; 73, pp. 91-104. [DOI: https://dx.doi.org/10.1093/biomet/73.1.91]
9. Bai, J. Least squares estimation of a shift in linear processes. J. Time Ser. Anal.; 1994; 15, pp. 453-472. [DOI: https://dx.doi.org/10.1111/j.1467-9892.1994.tb00204.x]
10. Vexler, A. Guaranteed testing for epidemic changes of a linear regression model. J. Stat. Plan. Inference; 2006; 136, pp. 3101-3120. [DOI: https://dx.doi.org/10.1016/j.jspi.2004.11.010]
11. Gombay, E. Change detection in autoregressive time series. J. Multivar. Anal.; 2008; 99, pp. 451-464. [DOI: https://dx.doi.org/10.1016/j.jmva.2007.01.003]
12. Truong, C.; Oudre, L.; Vayatis, N. Selective review of offline change point detection methods. Signal Process.; 2020; 167, 107299. [DOI: https://dx.doi.org/10.1016/j.sigpro.2019.107299]
13. Aue, A.; Horváth, L. Structural breaks in time series. J. Time Ser. Anal.; 2013; 34, pp. 1-16. [DOI: https://dx.doi.org/10.1111/j.1467-9892.2012.00819.x]
14. Chen, J.; Gupta, A.K. Parametric Statistical Change Point Analysis; Birkhäuser: Boston, MA, USA, 2000.
15. Ross, A.S.C. Philological probability problems. J. R. Stat. Soc. Ser. B-Stat. Methodol.; 1950; 12, pp. 19-59. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1950.tb00040.x]
16. Wolfe, D.A.; Chen, Y.S. The changepoint problem in a multinomial sequence. Commun.-Stat.-Simul. Comput.; 1990; 19, pp. 603-618. [DOI: https://dx.doi.org/10.1080/03610919008812877]
17. Horváth, L.; Serbinowska, M. Testing for changes in multinomial observations: The Lindisfarne scribes problem. Scand. J. Stat.; 1995; 22, pp. 371-384.
18. Batsidis, A.; Horváth, L.; Martín, N.; Pardo, L.; Zografos, K. Change-point detection in multinomial data using phi-divergence test statistics. J. Multivar. Anal.; 2013; 118, pp. 53-66. [DOI: https://dx.doi.org/10.1016/j.jmva.2013.03.008]
19. Riba, A.; Ginebra, J. Change-point estimation in a multinomial sequence and homogeneity of literary style. J. Appl. Stat.; 2005; 32, pp. 61-74. [DOI: https://dx.doi.org/10.1080/0266476052000330295]
20. Wang, G.H.; Zou, C.L.; Yin, G.S. Change-point detection in multinomial data with a large number of categories. Ann. Stat.; 2018; 46, pp. 2020-2044. [DOI: https://dx.doi.org/10.1214/17-AOS1610]
21. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J.; 1948; 27, pp. 379-432. [DOI: https://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x]
22. Unakafov, A.M.; Keller, K. Change-Point Detection Using the Conditional Entropy of Ordinal Patterns. Entropy; 2018; 20, 709. [DOI: https://dx.doi.org/10.3390/e20090709] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33265798]
23. Ma, L.J.; Sofronov, G. Change-point detection in autoregressive processes via the Cross-Entropy method. Algorithms; 2020; 13, 128. [DOI: https://dx.doi.org/10.3390/a13050128]
24. Vexler, A.; Gurevich, G. Density-based empirical likelihood ratio change point detection policies. Commun. Stat. Simul. Comput.; 2010; 39, pp. 1709-1725. [DOI: https://dx.doi.org/10.1080/03610918.2010.512692]
25. Kamimura, R. Supposed maximum mutual information for improving generalization and interpretation of multi-layered neural networks. J. Artif. Intell. Soft Comput. Res.; 2019; 9, pp. 123-147. [DOI: https://dx.doi.org/10.2478/jaiscr-2018-0029]
26. Liu, L.X. Image multi-threshold method based on fuzzy mutual information. Comput. Eng. Appl.; 2009; 45, pp. 166-168, 197.
27. Oh, B.S.; Sun, L.; Ahn, C.S.; Yeo, Y.K.; Yang, Y.; Liu, N.; Lin, Z.P. Extreme learning machine based mutual information estimation with application to time-series change-points detection. Neurocomputing; 2017; 261, pp. 204-216. [DOI: https://dx.doi.org/10.1016/j.neucom.2015.11.138]
28. Kopylova, Y.; Buell, D.A.; Huang, C.T.; Janies, J. Mutual information applied to anomaly detection. J. Commun. Netw.; 2008; 10, pp. 89-97. [DOI: https://dx.doi.org/10.1109/JCN.2008.6388332]
29. Gurevich, G. Retrospective parametric tests for homogeneity of data. Commun. Stat. Theory Methods; 2007; 36, pp. 2841-2862. [DOI: https://dx.doi.org/10.1080/03610920701386968]
30. James, B.; James, K.L.; Siegmund, D. Tests for a change-point. Biometrika; 1987; 74, pp. 71-83. [DOI: https://dx.doi.org/10.1093/biomet/74.1.71]
31. Lai, T.L. Sequential changepoint detection in quality control and dynamical systems. J. R. Stat. Soc. Ser. B Stat. Methodol.; 1995; 57, pp. 613-658. [DOI: https://dx.doi.org/10.1111/j.2517-6161.1995.tb02052.x]
32. Lee, W. A Data Mining Framework for Constructing Features and Models for Intrusion Detection Systems. Ph.D. Thesis; Columbia University: New York, NY, USA, 1999.
33. Pareto, V. Cours d’Economie Politique; Droz: Geneva, Switzerland, 1896.
34. Chen, S.X.; Qin, Y.L. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Stat.; 2010; 38, pp. 808-835. [DOI: https://dx.doi.org/10.1214/09-AOS716]
35. Fan, J.; Liao, Y.; Yao, J. Power enhancement in high-dimensional cross-sectional tests. Econometrica; 2015; 83, pp. 1497-1541. [DOI: https://dx.doi.org/10.3982/ECTA12749] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26778846]
36. Aue, A.; Hörmann, S.; Horváth, L.; Reimherr, M. Break detection in the covariance structure of multivariate time series models. Ann. Stat.; 2009; 37, pp. 4046-4087. [DOI: https://dx.doi.org/10.1214/09-AOS707]
37. Shiryayev, A.N. On optimum methods in quickest detection problems. Theory Probab. Its Appl.; 1963; 8, pp. 22-46. [DOI: https://dx.doi.org/10.1137/1108002]
38. Roberts, S.W. A comparison of some control chart procedures. Technometrics; 1966; 8, pp. 411-430. [DOI: https://dx.doi.org/10.1080/00401706.1966.10490374]
39. Krieger, A.M.; Pollak, M.; Yakir, B. Surveillance of a Simple Linear Regression. J. Am. Stat. Assoc.; 2003; 98, pp. 456-469. [DOI: https://dx.doi.org/10.1198/016214503000233]
40. Vexler, A.; Gregory, G. Average most powerful tests for a segmented regression. Commun. Stat. Theory Methods; 2009; 38, pp. 2214-2231. [DOI: https://dx.doi.org/10.1080/03610920802521208]
41. Csörgő, M.; Horváth, L. Limit Theorems in Change-Point Analysis; Wiley: New York, NY, USA, 1997.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Time-series data often exhibit an abrupt structural change at an unknown location. This paper proposes a new statistic to test for the existence of a change-point in a multinomial sequence in which the number of categories is comparable to the sample size as it tends to infinity. To construct the statistic, a pre-classification step is performed first; the statistic is then defined in terms of the mutual information between the data and the segment labels obtained from the pre-classification. The statistic can also be used to estimate the position of the change-point. Under certain conditions, the proposed statistic is asymptotically normal under the null hypothesis and consistent under the alternative. Simulation results show that the test based on the proposed statistic has high power and that the resulting estimator is highly accurate. The method is further illustrated with a real example of physical examination data.
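The core idea described in the abstract, locating a change-point in a sequence of multinomial counts by scanning candidate split points and maximizing the mutual information between the segment label (before/after the split) and the category counts, can be illustrated with a minimal sketch. This is not the authors' exact statistic (which involves a pre-classification step and asymptotic normalization); it is a simplified mutual-information scan under assumed inputs: a hypothetical `K x m` count matrix `X` whose row `i` holds the category frequencies of the `i`-th observation.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information (in nats) of a 2 x m contingency table of counts."""
    n = counts.sum()
    joint = counts / n
    px = joint.sum(axis=1, keepdims=True)   # segment (row) marginals
    py = joint.sum(axis=0, keepdims=True)   # category (column) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (px * py))
    return np.nansum(terms)                  # 0 * log 0 terms drop out as NaN

def scan_change_point(X):
    """Scan all candidate split points of a K x m count matrix X and return
    the split (and its mutual information) that maximizes the dependence
    between the before/after segment label and the observed categories."""
    K = X.shape[0]
    best_k, best_mi = None, -np.inf
    for k in range(1, K):                    # split between rows k-1 and k
        table = np.vstack([X[:k].sum(axis=0), X[k:].sum(axis=0)])
        mi = mutual_information(table)
        if mi > best_mi:
            best_k, best_mi = k, mi
    return best_k, best_mi
```

With counts drawn from one cell-probability vector before the split and a different one after it, the scan peaks at the true change-point; under the paper's framework, the maximized statistic would additionally be centered and scaled to obtain its limiting null distribution.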
Details
1 School of Management, University of Science and Technology of China, Hefei 230026, China
2 Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada