1. Introduction
In this article, we use an expanded structure of the symmetric group S_n, over the set of permutations from {1, …, n} to {1, …, n}, to develop a dependence detection procedure for bivariate random vectors. The procedure is based on identifying the longest non-decreasing subsequence (LNDSS) detected in the graph of the paired marginal ranks of the observations. It records the size of the subsequence and computes the probability of observing it in the expanded space of S_n, under the assumption of independence between the variables. The procedure requires no assumptions about the type of the two random variables being tested, which may both be discrete, both continuous, or a mixed structure (discrete-continuous).
When facing the challenge of deciding whether independence between random variables can be discarded, it is necessary to establish the nature of the variables, whether they are continuous or discrete. For continuous random variables, we have several procedures, for example, Hoeffding's test and tests based on dependence coefficients (Spearman's coefficient, Pearson's coefficient, Kendall's coefficient, etc.). For the discrete case, instead, the options are few; the most popular is Pearson's Chi-squared test. Tests based on the Kendall and Spearman coefficients, with corrections that account for ties, can also be used to test for independence between two discrete ordinal variables (see [1,2]); in general, they are recommended for small sample sizes. Moreover, some derivations of the Chi-squared statistic have been proposed to test independence between two nominal variables, as is the case of Cramér's V statistic, see [3].
The goal of this article is to present an independence test developed from the notion of the LNDSS among the ranks of the observations, see [4]. The main notion was introduced previously in [5], with a different implementation from the one proposed in this paper. The alterations proposed here aim to improve the procedure's performance. The methodology works without restriction on the type of the two random variables being tested, which may be continuous or discrete.
The existence of ties in a data set casts doubt on the use of established tests for continuous variables; see, for instance, [6] for a discussion of this issue. Using procedures intended for continuous random variables when the observations contain repetitions, due to the precision used to record the data, may have unforeseen consequences for the performance of the procedure. If the ties are eliminated, the use of asymptotic distributions can be compromised; if the ties are kept (by means of some correction), the control of type 1 and type 2 errors can be put at risk (increasing the false positives/negatives of the procedure). Another frequent situation is when one of the variables is continuous and the other is discrete. For some tests of independence, problems may arise in this situation, forcing the practitioner to apply some arbitrary data categorization. In this setting, one of the most popular procedures is Pearson's Chi-squared statistic. The traditional tests are based on some of the following statistics: Pearson's Chi-squared, the likelihood ratio [1], and Zelterman's [7] for the case in which the number of categories is too large for the available sample size. Moreover, Zelterman's statistic [7] does not work well when one of the variables is continuous. [1] shows several examples of independent data where the Pearson's Chi-squared, likelihood ratio, and Zelterman statistics fail. It is shown in [1] that, to be reliable, those tests require each cell in the frequency table to have a minimal (non-zero) frequency, which can depend on the total size of the data set. It is shown in [7] that, in some situations with a large number of factors, Pearson's Chi-squared statistic behaves as a normal random variable whose mean and variance are unrelated to the Chi-squared distribution, even with a large sample size. Those situations resemble the case of continuous random variables recorded with limited precision, which is, in fact, similar to a discrete random variable with a large number of categories producing sparseness (or sparse tables).
This article is organized as follows. Section 2 introduces the formulation of the test, showing the new strategy in comparison with the one implemented in [5]. Section 3 simulates different situations showing the performance of the procedure; the purpose is to exhibit situations in which the statistic proposed in this paper is efficient in detecting dependence. In the simulations, we consider settings that concentrate points on the diagonals, with the variables being continuous or discrete. We also consider perturbations of such situations, which show the maintenance or loss of power of the test developed here. Section 4 applies the new procedure to real data, and Section 5 presents the final considerations.
2. The Procedure
We start this section with the construction of the test statistic. For that, we introduce the LNDSS notion.
Definition 1.
Given the set Q = {q_1, …, q_n} of cardinality n such that q_i ∈ ℝ, ∀ i ∈ {1, …, n},
i.
the subsequence {q_{i_1}, …, q_{i_k}} of Q is a non-decreasing subsequence of Q if 1 ≤ i_1 < … < i_k ≤ n and q_{i_1} ≤ q_{i_2} ≤ … ≤ q_{i_k};
ii.
the length of a subsequence verifying i. is k;
iii.
lnd_n(Q) = max_k {1 ≤ k ≤ n : {q_{i_1}, …, q_{i_k}} ∈ S_n}, where S_n is the set of subsequences of Q verifying i.
lnd_n(Q) (item iii., Definition 1) is the length of the LNDSS of Q. Consider two illustrations of Definition 1. Suppose that Q = {1.3, 0.2, 2, 2.1, 1.2}. Then the LNDSS are {1.3, 2, 2.1} and {0.2, 2, 2.1}, so lnd_5(Q) = 3. Consider now a collection Q with replications, Q = {1.5, 2.4, 1.1, 2.4, 3, 3.1}; the LNDSS is {1.5, 2.4, 2.4, 3, 3.1} and lnd_6(Q) = 5.
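As a small illustration, the following R sketch computes lnd_n(Q) by the standard quadratic dynamic programming for longest non-decreasing subsequences (the helper name lnd is ours, not part of any package); it reproduces the two examples above.

```r
# Length of the longest non-decreasing subsequence (LNDSS) of a numeric vector q.
# Quadratic dynamic programming: L[i] is the length of the longest
# non-decreasing subsequence ending at position i.
lnd <- function(q) {
  n <- length(q)
  if (n == 0) return(0)
  L <- rep(1, n)
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      if (q[j] <= q[i] && L[j] + 1 > L[i]) L[i] <- L[j] + 1
    }
  }
  max(L)
}

lnd(c(1.3, 0.2, 2, 2.1, 1.2))        # 3
lnd(c(1.5, 2.4, 1.1, 2.4, 3, 3.1))   # 5
```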
In the next definition, we adapt this notion to the context of random samples.
Definition 2.
Consider (X, Y) a random vector with joint cumulative distribution function H, and let (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n) be independent realizations of (X, Y). We denote by LND_n the random variable built from iii. of Definition 1 as
LND_n = lnd_n(Q_D),
where D = {(X_i, Y_i)}_{i=1}^n and Q_D = {q_{rank(X_i)} = rank(Y_i), i = 1, …, n}.
Remark 1.
i.
Note that, without the presence of ties, the set Q_D is a particular case of the permutations of the values in the set {1, …, n}.
ii.
With ties, there is more than one way of defining ranks. We apply the minimum rank notion. For example, the sample 6.1, 2.1, 5.3, 4.7, 5.5, 6.2, 5.3, 4.7 has ranks 7, 1, 4, 2, 6, 8, 4, 2.
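In R, this minimum-rank convention corresponds to rank() with ties.method = "min"; a quick check with the sample above:

```r
rank(c(6.1, 2.1, 5.3, 4.7, 5.5, 6.2, 5.3, 4.7), ties.method = "min")
# [1] 7 1 4 2 6 8 4 2
```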
If we consider S_n = {π : π is a permutation such that π : {1, …, n} → {1, …, n}}, the subset Q_D given by Definition 2, in the absence of ties, is a specific case of the finite set S_n. Also, S_n is an algebraic group when equipped with the law of composition among the possible permutations. Given two permutations π_1, π_2, their composition π_2 ∗ π_1 is applied from right to left, meaning first applying π_1 and then applying π_2 to its result; that composition is also a permutation. The law of composition is associative, with an identity element and with an inverse element for each member of S_n. By definition, the symmetric group over any set is the group whose elements are all the bijections from the set to itself; hence S_n is the symmetric group of the set {1, …, n}, since it is composed of all the bijections from {1, …, n} to {1, …, n}. Since {1, …, n} is finite, the bijections are permutations.
Through the next example, we show the construction of the LNDSS in a set Q_D related to fictional observations.
Example 1.
Table 1 shows an artificial data set with n = 6, already ordered in terms of the magnitude of the x_i values. We show the graphical construction of LND_n in Figure 1.
These data define Q_D = {5, 1, 1, 3, 3, 6}. The maximal non-decreasing subsequence is {1, 1, 3, 3, 6}, given by the trajectory (0,0)−(1,1)−(3,1)−(3,3)−(5,3)−(6,6) in the plot of the ranks of the observations, shown in Figure 1. The value of LND_6 for this example is 5. We note that the indicated trajectory refers to the correspondence 1→1, 3→1, 3→3, 5→3, 6→6, which is no longer a permutation in the traditional sense, since it allows repetition both in the domain and in the image.
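The construction of Example 1 can be reproduced with the lnd() helper sketched earlier. When X contains ties, Definition 2 leaves the order within a tied block unspecified; in the sketch below we order tied x-ranks by the corresponding y-rank, an assumption of ours that is consistent with the trajectory reading of Figure 1.

```r
# Data of Example 1 (Table 1); lnd() is the helper sketched earlier.
x <- c(5.3, 5.3, 6.1, 6.1, 7.1, 7.3)
y <- c(10.2, 9.3, 9.3, 10.1, 10.1, 11.0)

rx <- rank(x, ties.method = "min")   # 1 1 3 3 5 6
ry <- rank(y, ties.method = "min")   # 5 1 1 3 3 6

# Q_D: the y-ranks read in order of increasing x-rank; within ties of the
# x-rank we order by the y-rank (our convention, matching the trajectory of
# Figure 1, where equal x values may both enter the path).
QD <- ry[order(rx, ry)]
lnd(QD)   # 5, the value of LND_6 reported in Example 1
```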
Remark 2.
Note that the construction of the statistic LND_n is symmetric, in the sense that if we exchange the roles of X and Y, we obtain the same result. Formally, this characteristic is a consequence of the following property. Consider a sample {(X_i, Y_i)}_{i=1}^n and an increasing set of indexes {I_1, …, I_k} ⊆ {1, …, n} such that the trajectory (X_{I_1}, Y_{I_1})−(X_{I_2}, Y_{I_2})−…−(X_{I_k}, Y_{I_k}) constitutes a non-decreasing subsequence (as illustrated by Example 1); this occurs if and only if X_{I_i} ≤ X_{I_{i+1}} and Y_{I_i} ≤ Y_{I_{i+1}}, 1 ≤ i ≤ k−1, and then the trajectory (Y_{I_1}, X_{I_1})−(Y_{I_2}, X_{I_2})−…−(Y_{I_k}, X_{I_k}) constitutes a non-decreasing subsequence as well.
The example shows that the procedure operates in an extended space of the symmetric group S_n. Below we give a simple motivation for identifying dependence through trajectories such as those used in Definition 2 and exemplified in Figure 1: the dependence in a bivariate vector can be represented by the ranks of the observations.
We see on the left of Figure 2 an apparent relationship between the random variables; this illusion of a relationship disappears in the graph on the right since, when computing the ranks of the observations, the marginal stochastic structure is neutralized, revealing the true dependence structure between X and Y. In this case, X and Y are independent, since they have been generated in this way. On the other hand, if the variables X and Y were dependent, Figure 2 (right) should expose a pattern, and traces of it would be captured by the LND_n notion.
The formulation of the hypotheses of independence between the random variables is then given by
H0: X and Y are independent vs. H1: X and Y are dependent. (1)
The test statistic, built from Definition 2, follows.
Definition 3.
Let D = {(X_i, Y_i)}_{i=1}^n be replications of (X, Y). Define JLND_n = (1/n) Σ_{(u,v) ∈ D} LND_n(u, v), where LND_n(u, v) = lnd_{n−1}(Q_{D(u,v)}) as given by Definition 2, and D(u, v) = D ∖ {(u, v)}, with (u, v) ∈ D.
That is, we apply the notion given by Definition 2 to each set D(u, v), which includes the entire sample except one observation, allowing us to build Q_{D(u,v)}. Then we obtain LND_n(u, v), and the test statistic is the average over all the cases LND_n(u, v).
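A minimal sketch of Definition 3, reusing lnd() and the rank construction above (the helper names qd and jlnd are ours):

```r
# Q_D of Definition 2 for a paired sample (x, y), with minimum ranks and the
# tie-ordering convention used above.
qd <- function(x, y) {
  rx <- rank(x, ties.method = "min")
  ry <- rank(y, ties.method = "min")
  ry[order(rx, ry)]
}

# JLND_n of Definition 3: average of lnd_{n-1}(Q_{D(u,v)}) over the n
# leave-one-out subsamples D(u, v).
jlnd <- function(x, y) {
  n <- length(x)
  mean(vapply(seq_len(n), function(i) lnd(qd(x[-i], y[-i])), numeric(1)))
}
```

Next, we introduce the most frequent formulation for estimating the two-sided p-value in a context such as that given by the JLND_n statistic.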
Definition 4.
The estimator of the two-sided p-value for the statistical test of independence between X and Y (see (1)) is defined by
min{ 2 F̂_{JLND_n}(jlnd_0) I{F̂_{JLND_n}(jlnd_0) ≤ 1/2} + 2 (1 − F̂_{JLND_n}(jlnd_0)) I{F̂_{JLND_n}(jlnd_0) > 1/2}, 1 },
where jlnd_0 is the value of JLND_n calculated in the sample (see Definition 3), F̂_{JLND_n} is the empirical cumulative distribution function of JLND_n under independence, and I{A} is the indicator function of the set A.
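A direct transcription of Definition 4 (the helper name pvalue_jlnd is ours; null_values stands for values of JLND_n simulated under H0, whose empirical distribution plays the role of F̂_{JLND_n}):

```r
# Two-sided p-value of Definition 4.
pvalue_jlnd <- function(jlnd0, null_values) {
  Fhat <- mean(null_values <= jlnd0)           # empirical cdf at jlnd0
  min(if (Fhat <= 0.5) 2 * Fhat else 2 * (1 - Fhat), 1)
}
```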
In the following subsection, we analyze the performance of two proposals to estimate F_{JLND_n}: one introduced in [5] and the other proposed in this paper.
2.1. F_{JLND_n} Estimates
F̂_{JLND_n} can be estimated by using the bootstrap; see, for instance, [5]. Denote this kind of estimate by F̂^B_{JLND_n}. The procedure to build F̂^B_{JLND_n} under the H0 hypothesis is replicated here. Let B be a positive integer; we compute B resamples of size n, with replacement, of X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n separately, since we assume that H0 is true. That is, we generate X_1^b, X_2^b, …, X_n^b for b = 1, 2, …, B, resampling from X_1, X_2, …, X_n, and we generate Y_1^b, Y_2^b, …, Y_n^b for b = 1, 2, …, B, resampling from Y_1, Y_2, …, Y_n. Then, for each b, define D^b = {(X_i^b, Y_i^b)}_{i=1}^n and from that sample compute the notion JLND_n from Definition 3, say JLND_n^b. Then, if |A| denotes the cardinality of A, set
F̂^B_{JLND_n}(q) = |{b : JLND_n^b ≤ q}| / B. (2)
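A sketch of this Bootstrap scheme, reusing the jlnd() helper above (the function name is ours):

```r
# B replicates of JLND_n under H0 via the Bootstrap of Equation (2):
# X and Y are resampled with replacement, separately and independently.
jlnd_null_bootstrap <- function(x, y, B = 1000) {
  n <- length(x)
  replicate(B, jlnd(sample(x, n, replace = TRUE),
                    sample(y, n, replace = TRUE)))
}
# F-hat^B(q) is then mean(jlnd_null_bootstrap(x, y, B) <= q).
```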
In Table 2, we show the performance of the JLND_n test based on the computation of the p-value (Definition 4) according to the Bootstrap technique, given by Equation (2). We generated n independent pairs of discrete Uniform distributions from 1 to m and computed, over 1000 simulations, the proportion of them showing a p-value (Definition 4) ≤ α, indicating the rejection of H0. Such a proportion is expected to be close to α in order to control the type 1 error. As we can see, when increasing the number of categories m, the α level is no longer respected, since the registered proportion always exceeds α. In order to improve the control of the type 1 error, this paper proposes an alternative way to estimate F_{JLND_n}. The Bootstrap method described above and used in [5] can be modified in order to avoid the removal (or repetition) of any of the observations, following the strategy of shuffling them instead. We consider X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n separately; given a positive integer B, for each b ∈ {1, …, B}, consider a permutation π_b : {1, …, n} → {1, …, n} and define X_{π_b(1)}, …, X_{π_b(n)}. Similarly, consider a permutation σ_b : {1, …, n} → {1, …, n} and define Y_{σ_b(1)}, …, Y_{σ_b(n)}. Then, for each b, define D^{π_b, σ_b} = {(X_{π_b(i)}, Y_{σ_b(i)})}_{i=1}^n and from that sample compute the notion JLND_n from Definition 3, say JLND_n^{π_b σ_b}. Then, set
F̂^{B,π,σ}_{JLND_n}(q) = |{b : JLND_n^{π_b σ_b} ≤ q}| / B. (3)
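The corresponding permutation scheme of Equation (3), again as a sketch with our helper names:

```r
# B replicates of JLND_n under H0 via Equation (3): X and Y are shuffled by
# independent permutations pi_b and sigma_b, so every observation appears
# exactly once in each replicate and no artificial ties are created.
jlnd_null_permutation <- function(x, y, B = 1000) {
  n <- length(x)
  replicate(B, jlnd(sample(x, n, replace = FALSE),
                    sample(y, n, replace = FALSE)))
}

# Full test, e.g.: pvalue_jlnd(jlnd(x, y), jlnd_null_permutation(x, y, 1000))
```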
The Bootstrap generates the estimate given by Equation (2) from samples with replacement, which tends to increase the number of ties. For example, if the original sample has no ties, the Bootstrap procedure tends to create ties, leading to longer non-decreasing subsequences. The permutation-based procedure that leads to Equation (3) lacks such a tendency, and this principle seems to be a more suitable strategy.
In Table 3, we show the performance of the JLND_n test based on the computation of the p-value (Definition 4) according to Equation (3). We implement the same settings used in Table 2, and we also include simulations for m = 2, 3, 4, 5. The impact of Equation (3) allows better control of the type 1 error: we see that in most cases the proportion does not exceed α and, when it does, it remains close to α.
Returning to the construction of the hypothesis test (Equation (1)), we note that the hypothesis H0 is used in the construction of both types of estimates of the cumulative distribution, Equations (2) and (3). In both cases, the observed values {(X_i, Y_i)}_{i=1}^n are treated separately, as if {X_i}_{i=1}^n on one side and {Y_i}_{i=1}^n on the other side were independent. Then, the distribution of the length of the LNDSS under H0 is estimated by both procedures, which allows computing the evidence against H0 given by the observed value in the originally paired sample {(X_i, Y_i)}_{i=1}^n and applying Definition 4. Moreover, the type 1 error refers to rejecting H0 when it is valid; in other words, it represents an unwanted situation, which we must control. In the study presented in Table 2 and Table 3, based on the two ways of estimating the cumulative distribution of JLND_n (under H0) by Equations (2) and (3), respectively, we see that, for a fixed level α, Equation (3) offers better performance than Equation (2), since it maintains the type 1 error at pre-established levels. For this reason, the test based on the statistic JLND_n with the implementation given by Equation (3) is more advisable in practice.
The following section describes the behavior of the test in different simulated situations, in order to identify its strengths and weaknesses.
3. Simulations
To investigate the performance of the JLND_n-based procedure, we aim to determine the rejection ability of the procedure in scenarios with dependence. Our study focuses on the procedure that uses Equation (3) to compute the p-value, given the justification of Section 2.1. We begin by considering the discrete distributions described below and some mixtures or disturbances of them.
We take discrete uniform distributions on different regions; consider fixed values m, b, and a such that m, b, a ∈ ℤ_{>0}, and set
i.
D1(m, a): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a};
ii.
D2(m, a): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a or |x + y − m − 1| ≤ a};
iii.
D3(m, a, b): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a or |x − y + b| ≤ a or |x − y − b| ≤ a or |x − y − 2b| ≤ a or |x − y + 2b| ≤ a}.
The distributions i.–iii. are illustrated in Figure 3, using m = 20, a = 1 and b = 6. Denote by U(m) the Uniform distribution on A = {(x, y) ∈ {1, …, m}²}; given p ∈ [0, 1], consider now the next three mixtures of distributions
iv.
M1(m, a): p D1(m, a) + (1 − p) U(m);
v.
M2(m, a): p D2(m, a) + (1 − p) U(m);
vi.
M3(m, a, b): p D3(m, a, b) + (1 − p) U(m),
where the notation p D1(m, a) + (1 − p) U(m) means that the bivariate vector is drawn, in proportion p, from the distribution D1(m, a) and, in proportion (1 − p), from the distribution U(m). Note that if p = 1 we recover the distributions Di, i = 1, 2, 3. In Table 4, Table 5 and Table 6, the settings were chosen so as to increase the number of categories of the discrete Uniform distribution U(m) and also the parameters a and b. A small sampling sketch for D1 and its mixture M1 is given next.
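As an illustration, the following R sketch (with hypothetical helper names rD1 and rM1) draws samples from D1(m, a) and from the mixture M1(m, a) by enumerating the support A and sampling it uniformly; the other distributions and mixtures can be sampled analogously.

```r
# Sample n points uniformly from the support of D1(m, a) (item i.).
rD1 <- function(n, m, a) {
  A <- expand.grid(x = 1:m, y = 1:m)
  A <- A[abs(A$x - A$y) <= a, ]
  A[sample(nrow(A), n, replace = TRUE), ]
}

# Sample n points from the mixture M1(m, a) = p * D1(m, a) + (1 - p) * U(m).
rM1 <- function(n, m, a, p) {
  from_D1 <- runif(n) < p                                 # mixture indicator
  out <- data.frame(x = sample(1:m, n, replace = TRUE),   # U(m) component
                    y = sample(1:m, n, replace = TRUE))
  out[from_D1, ] <- rD1(sum(from_D1), m, a)
  out
}
```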
Table 4, Table 5 and Table 6 show the rejection rates obtained through 1000 simulations of samples of size n. We inspect the distributions Di, i = 1, 2, 3, and the mixtures Mi, i = 1, 2, 3, with p = 0.8. What is sought is to obtain high rejection proportions, evidencing the control of the type 2 error.
As expected, for distribution D1 the procedure JLND_n shows maximum performance, for all sample sizes and variants of m and a. For distribution D2, the performance of the procedure JLND_n improves and reaches maximum performance as the sample size increases, for all variants of m and a. For distribution D3, we notice a deterioration in the performance of the test when compared to the other two cases, D1 and D2; despite this, the procedure responds adequately to the sample size, increasing its ability to detect dependence as the sample size grows.
Mi is a distribution that results from disturbing Di, so it makes sense to compare the effect of the disturbance, which in the illustrated cases amounts to 20% from U(m). For the distribution M1, the JLND_n-based procedure shows optimal performance, as occurs in the case D1. In cases M2 and M3, there is a deterioration in the performance of the procedure JLND_n when compared to D2 and D3, respectively. Despite this, within the framework given by M2, we see that the good properties of the procedure are preserved when the sample size is increased.
In the following simulations, we investigate the dependence between discrete and continuous variables. The types explored are denoted by D4 and D5; Figure 4 illustrates the cases. Consider m ∈ ℤ_{>0}, a ∈ ℝ_{>0}, and set
vii.
D4(m, a): Uniform distribution on A = {(x, y) ∈ {1, …, m} × [0, m + 1] : |x − y| ≤ a};
viii.
D5(m, a): Uniform on A = {(x, y) ∈ {1, …, m} × [0, m + 1] : |x − y| ≤ a or |x + y − m − 1| ≤ a}.
Denote by W(m) the Uniform distribution on A = {(x, y) ∈ {1, …, m} × [0, m + 1]}; given p ∈ [0, 1], consider now the next two mixtures of distributions
ix.
M4(m, a): p D4(m, a) + (1 − p) W(m);
x.
M5(m, a): p D5(m, a) + (1 − p) W(m).
Note that when using p = 1 in ix. (or x.) we recover D4 (or D5). A sampling sketch for D4 is given below.
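A minimal sampling sketch for D4(m, a) (the helper name rD4 is ours): interpreting "Uniform on A" as a uniform density over A, the discrete coordinate is drawn with probability proportional to the length of its vertical slice of A, and the continuous coordinate uniformly on that slice; for a = 0.5 no slice is clipped, so x is simply uniform on {1, …, m}. The mixture M4 can then be built exactly as M1 above.

```r
# Sample n points from D4(m, a) (item vii.): x discrete, y continuous.
rD4 <- function(n, m, a) {
  lo <- pmax(0, (1:m) - a)            # lower end of each vertical slice
  hi <- pmin(m + 1, (1:m) + a)        # upper end of each vertical slice
  x  <- sample(1:m, n, replace = TRUE, prob = hi - lo)
  y  <- runif(n, lo[x], hi[x])
  data.frame(x = x, y = y)
}
```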
Table 7 and Table 8 show the performance of 1000 simulations of size n: from M4(m, 0.5) in Table 7 and from M5(m, 0.5) in Table 8. To the left of each table (with p = 1) are simulated cases similar to those illustrated in Figure 4, D4 and D5. Table 7 shows that, in the case of distribution D4, the procedure is very efficient, and we see that when the distribution is disturbed (by including 20% from W, to the right of Table 7) the procedure maintains its efficiency in detecting dependence. In relation to the distribution D5, we see from Table 8 that two effects occur, one produced by the sample size n and one produced by the value of m. By increasing n and m, the procedure gains power quickly. The same effect is observed for the M5 distribution (a disturbance of D5), with a certain deterioration in the power of the test.
The JLND_n statistic is built on the graph of the paired ranks of the observations and is given by the size of the LNDSS found in this graph (see Figure 1). The proposal induces a region where this statistic can find evidence of dependence: the diagonal of the graph. The simulation study indicates that the detection power of the procedure appears in situations with an increasing pattern in the direction in which the JLND_n statistic is built. Moreover, the concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure, since the statistic JLND_n is formulated considering the expanded S_n space provided with the uniform distribution. See Table 4, Table 5, Table 6, Table 7 and Table 8, in which we observe that, by increasing the sample size, the detection capacity of JLND_n is preserved. Also, looking at the right side of the tables already cited, we verify the robustness of the procedure when inspecting cases with a concentration of points in the diagonals that suffer contamination, provided the sample size grows.
In the next section, we apply the test to real data and compare our results with other procedures.
4. Applying the Test in Real Data
As already mentioned, some data sets contain ties produced by the precision used in data collection. This is the case of the wine data set (from the gclus R-package), composed of 178 observations. For example, consider the cases (i) Alcohol vs. Flavanoids (see Figure 5, left) and (ii) Flavanoids vs. Intensity (see Figure 5, right). In each of the cases (i) and (ii), both variables are continuous but recorded with a precision of two decimal places. We use known procedures from the area of continuous variables. All computations use the R software environment. The "hoeffd" function of the "Hmisc" package is used to compute the p-value for Hoeffding's test. The "cor.test" function of the "stats" package is used to compute the p-values for the Pearson, Spearman and Kendall tests, see also [8]. Finally, we use the "indepTest" function of the "copula" package to compute the "Copula" test.
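For reference, a sketch of these comparison calls, assuming the column names Alcohol and Flavanoids of the wine data in gclus; the copula-based test, which requires an additional simulation step, is omitted from the sketch.

```r
library(gclus)   # wine data set
library(Hmisc)   # hoeffd(): Hoeffding's independence test

data(wine)
x <- wine$Alcohol
y <- wine$Flavanoids    # case (i); case (ii) uses Flavanoids and Intensity

cor.test(x, y, method = "pearson")$p.value
cor.test(x, y, method = "spearman")$p.value   # warns about exact p-values with ties
cor.test(x, y, method = "kendall")$p.value
hoeffd(x, y)$P[1, 2]                          # Hoeffding's test p-value
```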
In case (i) of Figure 5 (left), all the procedures report a p-value less than 0.02. Using JLND_n (jlnd_0 = 31.843), we obtain p-value = 0.0160 and p-value = 0.0004, applying Equations (2) and (3), respectively. That is, the JLND_n-based procedures detect dependence without the possible contraindications that the other procedures have, since there are ties in the data set.
From the appearance of the scatter plot (Figure 5, right), it is understandable that the tests based on the Spearman and Kendall coefficients have difficulties in detecting dependence, see Table 9. We also see that the other procedures capture the signs of dependence, as does the one proposed in this paper (jlnd_0 = 29.904). In both situations (cases (i) and (ii)), the only procedure without contraindications and with a significant p-value to reject H0 is JLND_n.
We also inspect the dependence between the variables Duration (duration of the eruption) and Interval (time until the following eruption), both measured in minutes, corresponding to 222 eruptions of the Old Faithful Geyser during August 1978 and August 1979. The data come from [9]; it is a traditional data set used in regression analysis with the aim of predicting the time of the next eruption using the duration of the most recent eruption (see [10]).
Figure 6 clearly shows the high number of ties, which compromises procedures designed for continuous variables. We have run the JLND_n test (jlnd_0 = 63.797) using various values of B: B = 1000, 2000, 5000, 10,000. In all cases, the p-value is less than 0.00001, using both versions to estimate the cumulative distribution, Equations (2) and (3). Then, the hypothesis of independence between Duration and Interval is rejected.
The data set cdrate is composed of 69 observations given in the 23 August 1989 issue of Newsday; it consists of the three-month certificate of deposit rates (Return on CD) for 69 Long Island banks and thrifts. The variables are Return on CD and Type = 0 (bank), 1 (thrift); source: [9]. Table 10 shows the data arranged based on the values of the attribute Return on CD and divided into the two cases of the variable Type. That table shows sparseness, an issue reported in the literature that compromises the performance of tests based on Pearson's Chi-squared (Table 11), see [1].
In Table 11, we see the results of testing H0. According to the JLND_n test, we must reject H0, which seems to be confirmed by Figure 7. Figure 7 compares the behavior of the variable Return on CD for the two values of the variable Type.
We conclude this section with a case from the wine data set, Class vs. Alcohol. Figure 8 shows the relationship for which we wish to verify whether independence can be rejected. Class registers 3 possible values, and Alcohol has been recorded with low precision, which leads to observing ties. The observed value of JLND_n is jlnd_0 = 99.438. The p-values given by Equations (2) and (3) indicate the rejection of H0: by Equation (2) we obtain p-value = 0.0004 and by Equation (3) we obtain p-value < 0.00001.
We note that, in the cases of Figure 5, the JLND_n test through Equation (3) offers a lower p-value than the version given by Equation (2). In the cases of Figure 6 and Figure 8, it may simply be an effect of computational precision. For the other cases, it is necessary to take into account that the Bootstrap version, by tending to create more ties, shows a tendency to underestimate the cumulative distribution; in other words, F̂^B_{JLND_n}(q) ≤ F_{JLND_n}(q), where F_{JLND_n}(·) is the true cumulative distribution. Due to the increasing tendency shown by the cases addressed (see Figure 5), it is expected that the observed value of the statistic JLND_n, jlnd_0, is positioned in the upper tail of the distribution in each case, which leads to the p-value being given by 2(1 − F̂^B_{JLND_n}(jlnd_0)), see Definition 4. As a consequence, 2(1 − F̂^B_{JLND_n}(jlnd_0)) > 2(1 − F_{JLND_n}(jlnd_0)). With the proposal made through Equation (3), we seek to correct the underestimation, since it does not favor the proliferation of ties; this would explain the relationship between the p-values.
5. Concluding Remarks
In this article, we investigate the performance of the JLND_n statistic to identify dependence in bivariate random vectors from a paired sample of size n. The procedure requires identifying the LNDSS that can be found in the graph of the marginal ranks of the paired observations, see Definitions 1 and 2. The goal is to compare the length of such a subsequence (Definition 3) with the length of all possible subsequences under the assumption of independence; this means imposing a uniform distribution on the expanded S_n space. The formulation of the procedure requires estimating the distribution of the statistic JLND_n under the assumption of independence, and in this paper this estimate is given by Equation (3) (see also Definition 4). The estimate proposed in this paper shows improved performance compared with the one given in [5], see Section 2.1. The concept of the longest non-decreasing subsequence allows us to build a tool without restrictions on the type of variable, continuous or discrete, to which it can be applied. From the simulation study, we confirm that the detection power of the procedure appears in situations with an increasing pattern from left to right and from bottom to top, which is the direction in which the JLND_n statistic is built (see Figure 1). The observations can be associated with continuous or discrete variables without affecting the power of the test. The concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure if the sample size is big enough. We also verify the robustness of the procedure when inspecting cases that suffer contamination that could conceal the dependence; see Table 4, Table 5, Table 6, Table 7 and Table 8. We use different real data sets that expose the versatility of the procedure to reject independence in situations such as (a) the presence of ties, (b) the presence of sparseness, and (c) mixed situations.
Figure 2. (left) X vs. Y; (right) ranks(X) vs. ranks(Y). The values of X and Y are simulated from two independent exponential distributions, λ = 10 for X and λ = 20 for Y, n = 100.
Figure 5. (left) Alcohol vs. Flavanoids; (right) Flavanoids vs. Intensity. Variables coming from the wine data set of the gclus R-package.
Table 1. Artificial data of Example 1, ordered by the magnitude of the x_i values.
x_i | y_i | Rank(x_i) | Rank(y_i) |
---|---|---|---|
5.3 | 10.2 | 1 | 5 |
5.3 | 9.3 | 1 | 1 |
6.1 | 9.3 | 3 | 1 |
6.1 | 10.1 | 3 | 3 |
7.1 | 10.1 | 5 | 3 |
7.3 | 11.0 | 6 | 6 |
Table 2. Proportion of rejections of H0 over 1000 simulations of n independent pairs of discrete Uniform variables on {1, …, m}; p-values computed via the Bootstrap estimate, Equation (2). Upper block: α = 0.01; lower block: α = 0.05.
n | m=10 | m=20 | m=50 | m=100 |
20 | 0.013 | 0.021 | 0.022 | 0.032 | |
40 | 0.021 | 0.038 | 0.037 | 0.041 | |
α=0.01 | 60 | 0.025 | 0.033 | 0.043 | 0.050 |
80 | 0.019 | 0.040 | 0.053 | 0.050 | |
100 | 0.028 | 0.034 | 0.044 | 0.059 | |
n | m=10 | m=20 | m=50 | m=100 | |
20 | 0.084 | 0.089 | 0.112 | 0.100 | |
40 | 0.091 | 0.105 | 0.134 | 0.143 | |
α=0.05 | 60 | 0.104 | 0.114 | 0.148 | 0.149 |
80 | 0.095 | 0.124 | 0.139 | 0.159 | |
100 | 0.113 | 0.111 | 0.125 | 0.143 |
Table 3. Proportion of rejections of H0 over 1000 simulations of n independent pairs of discrete Uniform variables on {1, …, m}; p-values computed via the permutation estimate, Equation (3). Upper block: α = 0.01; lower block: α = 0.05.
n | m=2 | m=3 | m=4 | m=5 | m=10 | m=20 | m=50 | m=100 |
20 | 0.004 | 0.006 | 0.008 | 0.014 | 0.011 | 0.007 | 0.007 | 0.004 | |
40 | 0.004 | 0.009 | 0.005 | 0.008 | 0.007 | 0.008 | 0.010 | 0.011 | |
α=0.01 | 60 | 0.005 | 0.004 | 0.009 | 0.007 | 0.006 | 0.004 | 0.010 | 0.014 |
80 | 0.006 | 0.005 | 0.010 | 0.011 | 0.012 | 0.009 | 0.006 | 0.008 | |
100 | 0.005 | 0.011 | 0.011 | 0.009 | 0.007 | 0.009 | 0.012 | 0.008 | |
n | m=2 | m=3 | m=4 | m=5 | m=10 | m=20 | m=50 | m=100 | |
20 | 0.021 | 0.032 | 0.041 | 0.047 | 0.038 | 0.048 | 0.042 | 0.043 | |
40 | 0.019 | 0.038 | 0.030 | 0.043 | 0.030 | 0.046 | 0.045 | 0.037 | |
α=0.05 | 60 | 0.029 | 0.041 | 0.042 | 0.044 | 0.040 | 0.032 | 0.056 | 0.044 |
80 | 0.039 | 0.031 | 0.045 | 0.046 | 0.051 | 0.046 | 0.046 | 0.052 | |
100 | 0.031 | 0.041 | 0.048 | 0.049 | 0.052 | 0.054 | 0.053 | 0.053 |
Table 4. Rejection rates over 1000 simulations of samples of size n for the distributions D1, D2, D3 (p = 1.0) and the mixtures M1, M2, M3 (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.349 | 0.028 | 0.994 | 0.179 | 0.018 | |
40 | 1.000 | 0.798 | 0.050 | 1.000 | 0.568 | 0.034 | |
α=0.01 | 60 | 1.000 | 0.983 | 0.136 | 1.000 | 0.858 | 0.078 |
80 | 1.000 | 0.999 | 0.252 | 1.000 | 0.963 | 0.109 | |
100 | 1.000 | 1.000 | 0.352 | 1.000 | 0.990 | 0.181 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.537 | 0.101 | 1.000 | 0.366 | 0.064 | |
40 | 1.000 | 0.906 | 0.177 | 1.000 | 0.757 | 0.125 | |
α=0.05 | 60 | 1.000 | 0.993 | 0.306 | 1.000 | 0.934 | 0.192 |
80 | 1.000 | 1.000 | 0.468 | 1.000 | 0.985 | 0.250 | |
100 | 1.000 | 1.000 | 0.601 | 1.000 | 0.997 | 0.368 |
Table 5. As Table 4, for another setting of the parameters m, a and b.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.431 | 0.044 | 0.993 | 0.229 | 0.024 | |
40 | 1.000 | 0.902 | 0.172 | 1.000 | 0.719 | 0.079 | |
α=0.01 | 60 | 1.000 | 0.989 | 0.374 | 1.000 | 0.924 | 0.180 |
80 | 1.000 | 1.000 | 0.615 | 1.000 | 0.986 | 0.342 | |
100 | 1.000 | 0.999 | 0.762 | 1.000 | 0.998 | 0.515 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.610 | 0.126 | 0.998 | 0.404 | 0.097 | |
40 | 1.000 | 0.949 | 0.368 | 1.000 | 0.837 | 0.196 | |
α=0.05 | 60 | 1.000 | 0.997 | 0.608 | 1.000 | 0.963 | 0.370 |
80 | 1.000 | 1.000 | 0.823 | 1.000 | 0.993 | 0.571 | |
100 | 1.000 | 1.000 | 0.918 | 1.000 | 0.999 | 0.711 |
Table 6. As Table 4, for a further setting of the parameters m, a and b.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.409 | 0.038 | 0.997 | 0.179 | 0.024 | |
40 | 1.000 | 0.865 | 0.137 | 1.000 | 0.623 | 0.063 | |
α=0.01 | 60 | 1.000 | 0.984 | 0.292 | 1.000 | 0.884 | 0.141 |
80 | 1.000 | 0.999 | 0.473 | 1.000 | 0.969 | 0.247 | |
100 | 1.000 | 1.000 | 0.655 | 1.000 | 0.991 | 0.394 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.597 | 0.110 | 1.000 | 0.345 | 0.090 | |
40 | 1.000 | 0.933 | 0.300 | 1.000 | 0.771 | 0.194 | |
α=0.05 | 60 | 1.000 | 0.996 | 0.520 | 1.000 | 0.949 | 0.326 |
80 | 1.000 | 1.000 | 0.715 | 1.000 | 0.990 | 0.468 | |
100 | 1.000 | 1.000 | 0.848 | 1.000 | 0.998 | 0.620 |
Table 7. Rejection rates over 1000 simulations of samples of size n from D4(m, 0.5) (p = 1.0) and the mixture M4(m, 0.5) (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | m=2 | m=3 | m=5 | m=10 | m=2 | m=3 | m=5 | m=10 | |
20 | 1.000 | 1.000 | 1.000 | 1.000 | 0.713 | 0.936 | 0.985 | 0.996 | |
40 | 1.000 | 1.000 | 1.000 | 1.000 | 0.993 | 1.000 | 1.000 | 1.000 | |
α=0.01 | 60 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
80 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
p=1.0 | p=0.8 | ||||||||
n | m=2 | m=3 | m=5 | m=10 | m=2 | m=3 | m=5 | m=10 | |
20 | 1.000 | 1.000 | 1.000 | 1.000 | 0.882 | 0.977 | 0.999 | 0.999 | |
40 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | |
α=0.05 | 60 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
80 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Table 8. Rejection rates over 1000 simulations of samples of size n from D5(m, 0.5) (p = 1.0) and the mixture M5(m, 0.5) (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | m=3 | m=5 | m=10 | m=3 | m=5 | m=10 | |
20 | 0.070 | 0.209 | 0.446 | 0.031 | 0.105 | 0.209 | |
40 | 0.211 | 0.634 | 0.926 | 0.118 | 0.374 | 0.708 | |
α=0.01 | 60 | 0.446 | 0.910 | 0.996 | 0.224 | 0.655 | 0.945 |
80 | 0.611 | 0.977 | 1.000 | 0.364 | 0.852 | 0.994 | |
100 | 0.728 | 0.999 | 1.000 | 0.528 | 0.949 | 0.999 | |
p=1.0 | p=0.8 | ||||||
n | m=3 | m=5 | m=10 | m=3 | m=5 | m=10 | |
20 | 0.161 | 0.385 | 0.638 | 0.112 | 0.239 | 0.388 | |
40 | 0.358 | 0.791 | 0.971 | 0.231 | 0.578 | 0.828 | |
α=0.05 | 60 | 0.612 | 0.970 | 0.999 | 0.384 | 0.810 | 0.973 |
80 | 0.736 | 0.995 | 1.000 | 0.529 | 0.951 | 0.998 | |
100 | 0.850 | 0.999 | 1.000 | 0.671 | 0.981 | 1.000 |
Table 9. Tests of independence between Flavanoids and Intensity (wine data set, case (ii)).
 | Copula | Hoeffding | Spearman | Pearson | Kendall | JLND_n Equation (2) | JLND_n Equation (3) |
---|---|---|---|---|---|---|---|
p-value | 0.0005 | 0.0000 | 0.5695 | 0.0214 | 0.5713 | 0.0380 | 0.0044 |
coefficient | | | −0.0429 | −0.1724 | 0.0287 | | |
Table 10. cdrate data set: frequencies of the Return on CD values for each value of Type (0 = bank, 1 = thrift).
Return on CD | Type = 0 | Type = 1 | Return on CD | Type = 0 | Type = 1 | Return on CD | Type = 0 | Type = 1 |
---|---|---|---|---|---|---|---|---|
7.51 | 0 | 1 | 8.15 | 0 | 1 | 8.49 | 0 | 3 |
7.56 | 1 | 0 | 8.17 | 1 | 0 | 8.50 | 1 | 9 |
7.57 | 1 | 0 | 8.20 | 0 | 1 | 8.51 | 1 | 0 |
7.71 | 1 | 0 | 8.25 | 0 | 2 | 8.52 | 0 | 1 |
7.75 | 0 | 1 | 8.30 | 1 | 2 | 8.55 | 1 | 0 |
7.82 | 2 | 0 | 8.33 | 2 | 1 | 8.57 | 1 | 0 |
7.90 | 1 | 1 | 8.34 | 0 | 1 | 8.65 | 2 | 0 |
8.00 | 7 | 3 | 8.35 | 0 | 2 | 8.70 | 0 | 1 |
8.05 | 2 | 0 | 8.36 | 0 | 1 | 8.71 | 1 | 0 |
8.06 | 1 | 0 | 8.40 | 1 | 6 | 8.75 | 0 | 1 |
8.11 | 1 | 0 | 8.45 | 0 | 1 | 8.78 | 0 | 1 |
Table 11. Tests of independence between Return on CD and Type (cdrate data set).
Test | χ2 | JLND_n (Equation (2)) | JLND_n (Equation (3)) |
---|---|---|---|
p-value | 0.0558 | 0.0320 | 0.0068 |
Statistic’s value | 45.6450 (df = 32) | 50.275 | 50.275 |
Author Contributions
Conceptualization, J.E.G. and V.A.G.-L.; methodology, J.E.G. and V.A.G.-L.; software, J.E.G. and V.A.G.-L.; validation, J.E.G. and V.A.G.-L.; formal analysis, J.E.G. and V.A.G.-L.; investigation, J.E.G. and V.A.G.-L.; data curation, J.E.G. and V.A.G.-L.; writing-review and editing, J.E.G. and V.A.G.-L. Both authors have read and agreed to the published version of the manuscript.
Funding
No funds were received for the execution of this work.
Acknowledgments
The authors wish to thank the two referees for their many helpful comments and suggestions on an earlier draft of this paper.
Conflicts of Interest
The authors declare no conflict of interest.
1. Agresti, A. Categorical Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2002.
2. Kendall, M.G. The treatment of ties in ranking problems. Biometrika 1945, 33, 239-251.
3. Cramér, H. Mathematical Methods of Statistics; Princeton University Press: Princeton, NJ, USA, 1946; Volume 500.
4. Romik, D. The Surprising Mathematics of Longest Increasing Subsequences; Cambridge University Press: New York, NY, USA, 2015; Volume 4.
5. García, J.E.; González-López, V.A. Independence test for sparse data. AIP Conf. Proc. 2016, 1738, 140002.
6. García, J.E.; González-López, V.A. Independence tests for continuous random variables based on the longest increasing subsequence. J. Multivar. Anal. 2014, 127, 126-146.
7. Zelterman, D. Goodness-of-fit tests for large sparse multinomial distributions. J. Am. Stat. Assoc. 1987, 82, 624-629.
8. Hollander, M.; Wolfe, D. Nonparametric Statistical Methods; John Wiley & Sons: New York, NY, USA, 1973; pp. 185-194.
9. Simonoff, J.S. Smoothing Methods in Statistics; Springer: New York, NY, USA, 1996.
10. Weisberg, S. Applied Linear Regression, 4th ed.; John Wiley & Sons: Minneapolis, MN, USA, 2005; Volume 528.
Jesús E. García and Verónica A. González-López*
Department of Statistics, University of Campinas, Sérgio Buarque de Holanda, 651, Campinas 13083-859, São Paulo, Brazil
*Author to whom correspondence should be addressed.
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the "License").
Abstract
In this paper, we show how the longest non-decreasing subsequence, identified in the graph of the paired marginal ranks of the observations, allows the construction of a statistic for the development of an independence test in bivariate vectors. The test works in the case of discrete and continuous data. Since the present procedure does not require the continuity of the variables, it expands the proposal introduced in Independence tests for continuous random variables based on the longest increasing subsequence (2014). We show the efficiency of the procedure in detecting dependence in real cases and through simulations.