1. Introduction
In this article, we use an expanded structure of the symmetric group S_n, over the set of permutations from {1, …, n} to {1, …, n}, to develop a dependence detection procedure for bivariate random vectors. The procedure is based on identifying the longest non-decreasing subsequence (LNDSS) detected in the graph of the paired marginal ranks of the observations. It records the size of the subsequence and computes the probability of observing it in the expanded space of S_n, under the assumption of independence between the variables. The procedure requires no assumptions about the type of the two random variables being tested, which may both be discrete, both continuous, or a mixed structure (discrete-continuous).
When facing the challenge of deciding whether independence between random variables can be discarded, it is necessary to establish the nature of the variables, whether they are continuous or discrete. For continuous random variables, we have several procedures, for example, Hoeffding's test and tests based on dependence coefficients (Spearman's coefficient, Pearson's coefficient, Kendall's coefficient, etc.). For the discrete case, instead, the options are few; the most popular is Pearson's Chi-squared test. Tests based on the Kendall and Spearman coefficients, with corrections that account for ties, can also be used to test for independence between two discrete ordinal variables (see [1,2]); in general, they are recommended for small sample sizes. Moreover, some derivations of the Chi-squared statistic have been proposed to test independence between two nominal variables, as is the case of Cramér's V statistic, see [3].
The goal of this article is to present an independence test developed from the notion of the LNDSS among the ranks of the observations, see [4]. The main notion was introduced previously in [5], with a different implementation from the one proposed in this paper. The alterations proposed here aim to improve the procedure's performance. The methodology works without restriction on the type of the two random variables being tested, which may be continuous or discrete.
The existence of ties in a data set casts doubt on the use of established tests for continuous variables; see, for instance, [6] for a discussion of this issue. Using procedures intended for continuous random variables when the observations contain repetitions, due to the precision used to record the data, may have unforeseen consequences for the performance of the procedure. If the ties are eliminated, the use of asymptotic distributions can be compromised; if the ties are kept (by means of some correction), the control of type 1 and type 2 errors can be put at risk (increasing the false positives/negatives of the procedure). Another frequent situation is when one of the variables is continuous and the other is discrete. For some tests of independence, problems may arise in this situation, forcing the practitioner to apply some arbitrary data categorization. In this setting, one of the most popular procedures is Pearson's Chi-squared statistic. The traditional tests are based on some of the following statistics: Pearson's Chi-squared, the likelihood ratio [1], and Zelterman's [7] for the case in which the number of categories is too large for the available sample size. Moreover, Zelterman's statistic [7] does not work well when one of the variables is continuous. [1] shows several examples of independent data where the Pearson's Chi-squared, likelihood ratio, and Zelterman statistics fail. It is shown in [1] that, to be reliable, those tests require each cell in the frequency table to have a minimal (non-zero) frequency, which can depend on the total size of the data set. It is shown in [7] that, in some situations with a large number of factors, Pearson's Chi-squared statistic behaves as a normal random variable whose mean and variance are unrelated to the Chi-squared distribution, even with a large sample size. Those situations resemble the case of continuous random variables recorded with limited precision, which is, in fact, similar to a discrete random variable with a large number of categories producing sparseness (or sparse tables).
This article is organized as follows. Section 2 introduces the formulation of the test, showing the new strategy in comparison with the one implemented in [5]. Section 3 simulates different situations showing the performance of the procedure; the purpose is to exhibit situations in which the statistic proposed in this paper is efficient in detecting dependence. In the simulations, we consider settings that concentrate points on the diagonals, with the variables being continuous or discrete. We also consider perturbations of such situations, which show the maintenance or loss of power of the test developed here. Section 4 applies the new procedure to real data, and Section 5 presents the final considerations.
2. The Procedure
We start this section with the construction of the test statistic. For that, we introduce the LNDSS notion.
Definition 1.
Given the set Q = {q_1, …, q_n} of cardinality n such that q_i ∈ ℝ, ∀ i ∈ {1, …, n},
i.
the subsequence {q_{i_1}, …, q_{i_k}} of Q is a non-decreasing subsequence of Q if 1 ≤ i_1 < … < i_k ≤ n and q_{i_1} ≤ q_{i_2} ≤ … ≤ q_{i_k};
ii.
the length of a subsequence verifying i. is k;
iii.
lnd_n(Q) = max_k {1 ≤ k ≤ n : {q_{i_1}, …, q_{i_k}} ∈ S_n}, where S_n is the set of subsequences of Q verifying i.
lnd_n(Q) (item iii., Definition 1) is the length of the LNDSS of Q. Consider two illustrations of Definition 1. Suppose that Q = {1.3, 0.2, 2, 2.1, 1.2}. Then the LNDSS are {1.3, 2, 2.1} and {0.2, 2, 2.1}, so lnd_5(Q) = 3. Consider now a collection Q with replications, Q = {1.5, 2.4, 1.1, 2.4, 3, 3.1}; the LNDSS is {1.5, 2.4, 2.4, 3, 3.1} and lnd_6(Q) = 5.
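As a small illustration, the following R sketch computes lnd_n(Q) by the standard quadratic dynamic programming for longest non-decreasing subsequences (the helper name lnd is ours, not part of any package); it reproduces the two examples above.

```r
# Length of the longest non-decreasing subsequence (LNDSS) of a numeric vector q.
# Quadratic dynamic programming: L[i] is the length of the longest
# non-decreasing subsequence ending at position i.
lnd <- function(q) {
  n <- length(q)
  if (n == 0) return(0)
  L <- rep(1, n)
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      if (q[j] <= q[i] && L[j] + 1 > L[i]) L[i] <- L[j] + 1
    }
  }
  max(L)
}

lnd(c(1.3, 0.2, 2, 2.1, 1.2))        # 3
lnd(c(1.5, 2.4, 1.1, 2.4, 3, 3.1))   # 5
```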
In the next definition, we adapt this notion to the context of random samples.
Definition 2.
Consider (X, Y) a random vector with joint cumulative distribution function H, and let (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n) be independent realizations of (X, Y). We denote by LND_n the random variable built from iii. of Definition 1 as
LND_n = lnd_n(Q_D),
where D = {(X_i, Y_i)}_{i=1}^n and Q_D = {q_{rank(X_i)} = rank(Y_i), i = 1, …, n}.
Remark 1.
i.
Note that, without the presence of ties, the set Q_D is a particular case of the permutations of the values in the set {1, …, n}.
ii.
With ties, there is more than one way of defining ranks. We apply the minimum rank notion. For example, the sample 6.1, 2.1, 5.3, 4.7, 5.5, 6.2, 5.3, 4.7 has ranks 7, 1, 4, 2, 6, 8, 4, 2.
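In R, this minimum-rank convention corresponds to rank() with ties.method = "min"; a quick check with the sample above:

```r
rank(c(6.1, 2.1, 5.3, 4.7, 5.5, 6.2, 5.3, 4.7), ties.method = "min")
# [1] 7 1 4 2 6 8 4 2
```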
If we consider S_n = {π : π is a permutation such that π : {1, …, n} → {1, …, n}}, the subset Q_D given by Definition 2, in the absence of ties, is a specific case of the finite set S_n. Also, S_n is an algebraic group when equipped with the law of composition among the possible permutations. Given two permutations π_1, π_2, their composition π_2 ∗ π_1 is applied from right to left, meaning first applying π_1 and then applying π_2 to its result; that composition is also a permutation. The law of composition is associative, with an identity element and with an inverse element for each member of S_n. By definition, the symmetric group over any set is the group whose elements are all the bijections from the set to itself; hence S_n is the symmetric group of the set {1, …, n}, since it is composed of all the bijections from {1, …, n} to {1, …, n}. Since {1, …, n} is finite, the bijections are permutations.
Through the next example, we show the construction of the LNDSS in a set Q_D related to fictional observations.
Example 1.
Table 1 shows an artificial data set with n = 6, already ordered in terms of the magnitude of the x_i values. We show the graphical construction of LND_n in Figure 1.
These data define Q_D = {5, 1, 1, 3, 3, 6}. The maximal non-decreasing subsequence is {1, 1, 3, 3, 6}, given by the trajectory (0,0)−(1,1)−(3,1)−(3,3)−(5,3)−(6,6) in the plot of the ranks of the observations, shown in Figure 1. The value of LND_6 for this example is 5. We note that the indicated trajectory refers to the correspondence 1→1, 3→1, 3→3, 5→3, 6→6, which is no longer a permutation in the traditional sense, since it allows repetition both in the domain and in the image.
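The construction of Example 1 can be reproduced with the lnd() helper sketched earlier. When X contains ties, Definition 2 leaves the order within a tied block unspecified; in the sketch below we order tied x-ranks by the corresponding y-rank, an assumption of ours that is consistent with the trajectory reading of Figure 1.

```r
# Data of Example 1 (Table 1); lnd() is the helper sketched earlier.
x <- c(5.3, 5.3, 6.1, 6.1, 7.1, 7.3)
y <- c(10.2, 9.3, 9.3, 10.1, 10.1, 11.0)

rx <- rank(x, ties.method = "min")   # 1 1 3 3 5 6
ry <- rank(y, ties.method = "min")   # 5 1 1 3 3 6

# Q_D: the y-ranks read in order of increasing x-rank; within ties of the
# x-rank we order by the y-rank (our convention, matching the trajectory of
# Figure 1, where equal x values may both enter the path).
QD <- ry[order(rx, ry)]
lnd(QD)   # 5, the value of LND_6 reported in Example 1
```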
Remark 2.
Note that the construction of the statistic LND_n is symmetric, in the sense that if we exchange the roles of X and Y, we obtain the same result. Formally, this characteristic is a consequence of the following property. Consider a sample {(X_i, Y_i)}_{i=1}^n and an increasing set of indexes {I_1, …, I_k} ⊆ {1, …, n} such that the trajectory (X_{I_1}, Y_{I_1})−(X_{I_2}, Y_{I_2})−…−(X_{I_k}, Y_{I_k}) constitutes a non-decreasing subsequence (as illustrated by Example 1); this occurs if and only if X_{I_i} ≤ X_{I_{i+1}} and Y_{I_i} ≤ Y_{I_{i+1}}, 1 ≤ i ≤ k−1, and then the trajectory (Y_{I_1}, X_{I_1})−(Y_{I_2}, X_{I_2})−…−(Y_{I_k}, X_{I_k}) constitutes a non-decreasing subsequence as well.
The example shows that the procedure operates in an extended space of the symmetric group S_n. Below we give a simple motivation for identifying dependence through trajectories such as those used in Definition 2 and exemplified in Figure 1: the dependence in a bivariate vector can be represented by the ranks of the observations.
We see on the left of Figure 2 an apparent relationship between the random variables; this illusion of a relationship disappears in the graph on the right since, when computing the ranks of the observations, the marginal stochastic structure is neutralized, revealing the true dependence structure between X and Y. In this case, X and Y are independent, since they have been generated in this way. On the other hand, if the variables X and Y were dependent, Figure 2 (right) should expose a pattern, and traces of it would be captured by the LND_n notion.
The formulation of the hypotheses of independence between the random variables is then given by
H0: X and Y are independent vs. H1: X and Y are dependent. (1)
The test statistic, built from Definition 2, follows.
Definition 3.
Let D = {(X_i, Y_i)}_{i=1}^n be replications of (X, Y). Define JLND_n = (1/n) Σ_{(u,v) ∈ D} LND_n(u, v), where LND_n(u, v) = lnd_{n−1}(Q_{D(u,v)}) as given by Definition 2, and D(u, v) = D ∖ {(u, v)}, with (u, v) ∈ D.
That is, we apply the notion given by Definition 2 to each set D(u, v), which includes the entire sample except one observation, allowing us to build Q_{D(u,v)}. Then we obtain LND_n(u, v), and the test statistic is the average over all the cases LND_n(u, v).
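A minimal sketch of Definition 3, reusing lnd() and the rank construction above (the helper names qd and jlnd are ours):

```r
# Q_D of Definition 2 for a paired sample (x, y), with minimum ranks and the
# tie-ordering convention used above.
qd <- function(x, y) {
  rx <- rank(x, ties.method = "min")
  ry <- rank(y, ties.method = "min")
  ry[order(rx, ry)]
}

# JLND_n of Definition 3: average of lnd_{n-1}(Q_{D(u,v)}) over the n
# leave-one-out subsamples D(u, v).
jlnd <- function(x, y) {
  n <- length(x)
  mean(vapply(seq_len(n), function(i) lnd(qd(x[-i], y[-i])), numeric(1)))
}
```

Next, we introduce the most frequent formulation for estimating the two-sided p-value in a context such as that given by the JLND_n statistic.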
Definition 4.
The estimator of the two-sided p-value for the statistical test of independence between X and Y (see (1)) is defined by
min{ 2 F̂_{JLND_n}(jlnd_0) I{F̂_{JLND_n}(jlnd_0) ≤ 1/2} + 2 (1 − F̂_{JLND_n}(jlnd_0)) I{F̂_{JLND_n}(jlnd_0) > 1/2}, 1 },
where jlnd_0 is the value of JLND_n calculated in the sample (see Definition 3), F̂_{JLND_n} is the empirical cumulative distribution function of JLND_n under independence, and I{A} is the indicator function of the set A.
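A direct transcription of Definition 4 (the helper name pvalue_jlnd is ours; null_values stands for values of JLND_n simulated under H0, whose empirical distribution plays the role of F̂_{JLND_n}):

```r
# Two-sided p-value of Definition 4.
pvalue_jlnd <- function(jlnd0, null_values) {
  Fhat <- mean(null_values <= jlnd0)           # empirical cdf at jlnd0
  min(if (Fhat <= 0.5) 2 * Fhat else 2 * (1 - Fhat), 1)
}
```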
In the following subsection, we analyze the performance of two proposals to estimate F_{JLND_n}: one introduced in [5] and the other proposed in this paper.
2.1. F_{JLND_n} Estimates
F̂_{JLND_n} can be estimated by using the bootstrap; see, for instance, [5]. Denote this kind of estimate by F̂^B_{JLND_n}. The procedure to build F̂^B_{JLND_n} under the H0 hypothesis is replicated here. Let B be a positive integer; we compute B resamples of size n, with replacement, of X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n separately, since we assume that H0 is true. That is, we generate X_1^b, X_2^b, …, X_n^b for b = 1, 2, …, B, resampling from X_1, X_2, …, X_n, and we generate Y_1^b, Y_2^b, …, Y_n^b for b = 1, 2, …, B, resampling from Y_1, Y_2, …, Y_n. Then, for each b, define D^b = {(X_i^b, Y_i^b)}_{i=1}^n and from that sample compute the notion JLND_n from Definition 3, say JLND_n^b. Then, if |A| denotes the cardinality of A, set
F̂^B_{JLND_n}(q) = |{b : JLND_n^b ≤ q}| / B. (2)
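A sketch of this Bootstrap scheme, reusing the jlnd() helper above (the function name is ours):

```r
# B replicates of JLND_n under H0 via the Bootstrap of Equation (2):
# X and Y are resampled with replacement, separately and independently.
jlnd_null_bootstrap <- function(x, y, B = 1000) {
  n <- length(x)
  replicate(B, jlnd(sample(x, n, replace = TRUE),
                    sample(y, n, replace = TRUE)))
}
# F-hat^B(q) is then mean(jlnd_null_bootstrap(x, y, B) <= q).
```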
In Table 2, we show the performance of the JLND_n test based on the computation of the p-value (Definition 4) according to the Bootstrap technique, given by Equation (2). We generated n independent pairs of discrete Uniform distributions from 1 to m and computed, over 1000 simulations, the proportion of them showing a p-value (Definition 4) ≤ α, indicating the rejection of H0. Such a proportion is expected to be close to α in order to control the type 1 error. As we can see, when increasing the number of categories m, the α level is no longer respected, since the registered proportion always exceeds α. In order to improve the control of the type 1 error, this paper proposes an alternative way to estimate F_{JLND_n}. The Bootstrap method described above and used in [5] can be modified in order to avoid the removal (or repetition) of any of the observations, following the strategy of shuffling them instead. We consider X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n separately; given a positive integer B, for each b ∈ {1, …, B}, consider a permutation π_b : {1, …, n} → {1, …, n} and define X_{π_b(1)}, …, X_{π_b(n)}. Similarly, consider a permutation σ_b : {1, …, n} → {1, …, n} and define Y_{σ_b(1)}, …, Y_{σ_b(n)}. Then, for each b, define D^{π_b, σ_b} = {(X_{π_b(i)}, Y_{σ_b(i)})}_{i=1}^n and from that sample compute the notion JLND_n from Definition 3, say JLND_n^{π_b σ_b}. Then, set
F̂^{B,π,σ}_{JLND_n}(q) = |{b : JLND_n^{π_b σ_b} ≤ q}| / B. (3)
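The corresponding permutation scheme of Equation (3), again as a sketch with our helper names:

```r
# B replicates of JLND_n under H0 via Equation (3): X and Y are shuffled by
# independent permutations pi_b and sigma_b, so every observation appears
# exactly once in each replicate and no artificial ties are created.
jlnd_null_permutation <- function(x, y, B = 1000) {
  n <- length(x)
  replicate(B, jlnd(sample(x, n, replace = FALSE),
                    sample(y, n, replace = FALSE)))
}

# Full test, e.g.: pvalue_jlnd(jlnd(x, y), jlnd_null_permutation(x, y, 1000))
```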
The Bootstrap generates the estimate given by Equation (2) from samples with replacement, which tends to increase the number of ties. For example, if the original sample has no ties, the Bootstrap procedure tends to create ties, leading to longer non-decreasing subsequences. The permutation-based procedure that leads to Equation (3) lacks such a tendency, and this principle seems to be a more suitable strategy.
In Table 3, we show the performance of the JLND_n test based on the computation of the p-value (Definition 4) according to Equation (3). We implement the same settings used in Table 2, and we also include simulations for m = 2, 3, 4, 5. The impact of Equation (3) allows better control of the type 1 error: we see that in most cases the proportion does not exceed α and, when it does, it remains close to α.
Returning to the construction of the hypothesis test (Equation (1)), we note that the hypothesis H0 is used in the construction of both types of estimates of the cumulative distribution, Equations (2) and (3). In both cases, the observed values {(X_i, Y_i)}_{i=1}^n are treated separately, as if {X_i}_{i=1}^n on one side and {Y_i}_{i=1}^n on the other side were independent. Then, the distribution of the length of the LNDSS under H0 is estimated by both procedures, which allows computing the evidence against H0 given by the observed value in the originally paired sample {(X_i, Y_i)}_{i=1}^n and applying Definition 4. Moreover, the type 1 error refers to rejecting H0 when it is valid; in other words, it represents an unwanted situation, which we must control. In the study presented in Table 2 and Table 3, based on the two ways of estimating the cumulative distribution of JLND_n (under H0) by Equations (2) and (3), respectively, we see that, for a fixed level α, Equation (3) offers better performance than Equation (2), since it maintains the type 1 error at pre-established levels. For this reason, the test based on the statistic JLND_n with the implementation given by Equation (3) is more advisable in practice.
The following section describes the behavior of the test in different simulated situations, in order to identify its strengths and weaknesses.
3. Simulations
To investigate the performance of the JLND_n-based procedure, we aim to determine the rejection ability of the procedure in scenarios with dependence. Our study focuses on the procedure that uses Equation (3) to compute the p-value, given the justification of Section 2.1. We begin by considering the discrete distributions described below and some mixtures or disturbances of them.
We take discrete uniform distributions on different regions; consider fixed values m, b, and a such that m, b, a ∈ ℤ_{>0}, and set
i.
D1(m, a): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a};
ii.
D2(m, a): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a or |x + y − m − 1| ≤ a};
iii.
D3(m, a, b): Uniform on A = {(x, y) ∈ {1, …, m}² : |x − y| ≤ a or |x − y + b| ≤ a or |x − y − b| ≤ a or |x − y − 2b| ≤ a or |x − y + 2b| ≤ a}.
The distributions i.–iii. are illustrated in Figure 3, using m = 20, a = 1 and b = 6. Denote by U(m) the Uniform distribution on A = {(x, y) ∈ {1, …, m}²}; given p ∈ [0, 1], consider now the next three mixtures of distributions
iv.
M1(m, a): p D1(m, a) + (1 − p) U(m);
v.
M2(m, a): p D2(m, a) + (1 − p) U(m);
vi.
M3(m, a, b): p D3(m, a, b) + (1 − p) U(m),
where the notation p D1(m, a) + (1 − p) U(m) means that the bivariate vector is drawn, in proportion p, from the distribution D1(m, a) and, in proportion (1 − p), from the distribution U(m). Note that if p = 1 we recover the distributions Di, i = 1, 2, 3. In Table 4, Table 5 and Table 6, the settings were chosen so as to increase the number of categories of the discrete Uniform distribution U(m) and also the parameters a and b. A small sampling sketch for D1 and its mixture M1 is given next.
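As an illustration, the following R sketch (with hypothetical helper names rD1 and rM1) draws samples from D1(m, a) and from the mixture M1(m, a) by enumerating the support A and sampling it uniformly; the other distributions and mixtures can be sampled analogously.

```r
# Sample n points uniformly from the support of D1(m, a) (item i.).
rD1 <- function(n, m, a) {
  A <- expand.grid(x = 1:m, y = 1:m)
  A <- A[abs(A$x - A$y) <= a, ]
  A[sample(nrow(A), n, replace = TRUE), ]
}

# Sample n points from the mixture M1(m, a) = p * D1(m, a) + (1 - p) * U(m).
rM1 <- function(n, m, a, p) {
  from_D1 <- runif(n) < p                                 # mixture indicator
  out <- data.frame(x = sample(1:m, n, replace = TRUE),   # U(m) component
                    y = sample(1:m, n, replace = TRUE))
  out[from_D1, ] <- rD1(sum(from_D1), m, a)
  out
}
```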
Table 4, Table 5 and Table 6 show the rejection rates obtained through 1000 simulations of samples of size n. We inspect the distributions Di, i = 1, 2, 3, and the mixtures Mi, i = 1, 2, 3, with p = 0.8. What is sought is to obtain high rejection proportions, evidencing the control of the type 2 error.
As expected, for distribution D1 the procedure JLND_n shows maximum performance, for all sample sizes and variants of m and a. For distribution D2, the performance of the procedure JLND_n improves and reaches maximum performance as the sample size increases, for all variants of m and a. For distribution D3, we notice a deterioration in the performance of the test when compared to the other two cases, D1 and D2; despite this, the procedure responds adequately to the sample size, increasing its ability to detect dependence as the sample size grows.
Mi is a distribution that results from disturbing Di, so it makes sense to compare the effect of the disturbance, which in the illustrated cases amounts to 20% from U(m). For the distribution M1, the JLND_n-based procedure shows optimal performance, as occurs in the case D1. In cases M2 and M3, there is a deterioration in the performance of the procedure JLND_n when compared to D2 and D3, respectively. Despite this, within the framework given by M2, we see that the good properties of the procedure are preserved when the sample size is increased.
In the following simulations, we investigate the dependence between discrete and continuous variables. The types explored are denoted by D4 and D5; Figure 4 illustrates the cases. Consider m ∈ ℤ_{>0}, a ∈ ℝ_{>0}, and set
vii.
D4(m, a): Uniform distribution on A = {(x, y) ∈ {1, …, m} × [0, m + 1] : |x − y| ≤ a};
viii.
D5(m, a): Uniform on A = {(x, y) ∈ {1, …, m} × [0, m + 1] : |x − y| ≤ a or |x + y − m − 1| ≤ a}.
Denote by W(m) the Uniform distribution on A = {(x, y) ∈ {1, …, m} × [0, m + 1]}; given p ∈ [0, 1], consider now the next two mixtures of distributions
ix.
M4(m, a): p D4(m, a) + (1 − p) W(m);
x.
M5(m, a): p D5(m, a) + (1 − p) W(m).
Note that when using p = 1 in ix. (or x.) we recover D4 (or D5). A sampling sketch for D4 is given below.
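A minimal sampling sketch for D4(m, a) (the helper name rD4 is ours): interpreting "Uniform on A" as a uniform density over A, the discrete coordinate is drawn with probability proportional to the length of its vertical slice of A, and the continuous coordinate uniformly on that slice; for a = 0.5 no slice is clipped, so x is simply uniform on {1, …, m}. The mixture M4 can then be built exactly as M1 above.

```r
# Sample n points from D4(m, a) (item vii.): x discrete, y continuous.
rD4 <- function(n, m, a) {
  lo <- pmax(0, (1:m) - a)            # lower end of each vertical slice
  hi <- pmin(m + 1, (1:m) + a)        # upper end of each vertical slice
  x  <- sample(1:m, n, replace = TRUE, prob = hi - lo)
  y  <- runif(n, lo[x], hi[x])
  data.frame(x = x, y = y)
}
```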
Table 7 and Table 8 show the performance of 1000 simulations of size n: from M4(m, 0.5) in Table 7 and from M5(m, 0.5) in Table 8. To the left of each table (with p = 1) are simulated cases similar to those illustrated in Figure 4, D4 and D5. Table 7 shows that, in the case of distribution D4, the procedure is very efficient, and we see that when the distribution is disturbed (by including 20% from W, to the right of Table 7) the procedure maintains its efficiency in detecting dependence. In relation to the distribution D5, we see from Table 8 that two effects occur, one produced by the sample size n and one produced by the value of m. By increasing n and m, the procedure gains power quickly. The same effect is observed for the M5 distribution (a disturbance of D5), with a certain deterioration in the power of the test.
The JLND_n statistic is built on the graph of the paired ranks of the observations and is given by the size of the LNDSS found in this graph (see Figure 1). The proposal induces a region where this statistic can find evidence of dependence: the diagonal of the graph. The simulation study indicates that the detection power of the procedure appears in situations with an increasing pattern in the direction in which the JLND_n statistic is built. Moreover, the concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure, since the statistic JLND_n is formulated considering the expanded S_n space provided with the uniform distribution. See Table 4, Table 5, Table 6, Table 7 and Table 8, in which we observe that, by increasing the sample size, the detection capacity of JLND_n is preserved. Also, looking at the right side of the tables already cited, we verify the robustness of the procedure when inspecting cases with a concentration of points in the diagonals that suffer contamination, provided the sample size grows.
In the next section, we apply the test to real data and compare our results with other procedures.
4. Applying the Test in Real Data
As already mentioned, some data sets contain ties produced by the precision used in data collection. This is the case of the wine data set (from the gclus R-package), composed of 178 observations. For example, consider the cases (i) Alcohol vs. Flavanoids (see Figure 5, left) and (ii) Flavanoids vs. Intensity (see Figure 5, right). In each of the cases (i) and (ii), both variables are continuous but recorded with a precision of two decimal places. We use known procedures from the area of continuous variables. All computations use the R software environment. The "hoeffd" function of the "Hmisc" package is used to compute the p-value for Hoeffding's test. The "cor.test" function of the "stats" package is used to compute the p-values for the Pearson, Spearman and Kendall tests, see also [8]. Finally, we use the "indepTest" function of the "copula" package to compute the "Copula" test.
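For reference, a sketch of these comparison calls, assuming the column names Alcohol and Flavanoids of the wine data in gclus; the copula-based test, which requires an additional simulation step, is omitted from the sketch.

```r
library(gclus)   # wine data set
library(Hmisc)   # hoeffd(): Hoeffding's independence test

data(wine)
x <- wine$Alcohol
y <- wine$Flavanoids    # case (i); case (ii) uses Flavanoids and Intensity

cor.test(x, y, method = "pearson")$p.value
cor.test(x, y, method = "spearman")$p.value   # warns about exact p-values with ties
cor.test(x, y, method = "kendall")$p.value
hoeffd(x, y)$P[1, 2]                          # Hoeffding's test p-value
```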
In case (i) of Figure 5 (left), all the procedures report a p-value less than 0.02. Using JLND_n (jlnd_0 = 31.843), we obtain p-value = 0.0160 and p-value = 0.0004, applying Equations (2) and (3), respectively. That is, the JLND_n-based procedures detect dependence without the possible contraindications that the other procedures have, since there are ties in the data set.
From the appearance of the scatter plot (Figure 5, right), it is understandable that the tests based on the Spearman and Kendall coefficients have difficulties in detecting dependence, see Table 9. We also see that the other procedures capture the signs of dependence, as does the one proposed in this paper (jlnd_0 = 29.904). In both situations (cases (i) and (ii)), the only procedure without contraindications and with a significant p-value to reject H0 is JLND_n.
We also inspect the dependence between the variables Duration (duration of the eruption) and Interval (time until the following eruption), both measured in minutes, corresponding to 222 eruptions of the Old Faithful Geyser during August 1978 and August 1979. The data come from [9]; it is a traditional data set used in regression analysis with the aim of predicting the time of the next eruption using the duration of the most recent eruption (see [10]).
Figure 6 clearly shows the high number of ties, which compromises procedures designed for continuous variables. We have run the JLND_n test (jlnd_0 = 63.797) using various values of B: B = 1000, 2000, 5000, 10,000. In all cases, the p-value is less than 0.00001, using both versions to estimate the cumulative distribution, Equations (2) and (3). Then, the hypothesis of independence between Duration and Interval is rejected.
The data set cdrate is composed of 69 observations given in the 23 August 1989 issue of Newsday; it consists of the three-month certificate of deposit rates (Return on CD) for 69 Long Island banks and thrifts. The variables are Return on CD and Type = 0 (bank), 1 (thrift); source: [9]. Table 10 shows the data arranged based on the values of the attribute Return on CD and divided into the two cases of the variable Type. That table shows sparseness, an issue reported in the literature that compromises the performance of tests based on Pearson's Chi-squared (Table 11), see [1].
In Table 11, we see the results of testing H0. According to the JLND_n test, we must reject H0, which seems to be confirmed by Figure 7. Figure 7 compares the behavior of the variable Return on CD for the two values of the variable Type.
We conclude this section with a case from the wine data set, Class vs. Alcohol. Figure 8 shows the relationship for which we wish to verify whether independence can be rejected. Class registers 3 possible values, and Alcohol has been recorded with low precision, which leads to observing ties. The observed value of JLND_n is jlnd_0 = 99.438. The p-values given by Equations (2) and (3) indicate the rejection of H0: by Equation (2) we obtain p-value = 0.0004 and by Equation (3) we obtain p-value < 0.00001.
We note that, in the cases of Figure 5, the JLND_n test through Equation (3) offers a lower p-value than the version given by Equation (2). In the cases of Figure 6 and Figure 8, it may simply be an effect of computational precision. For the other cases, it is necessary to take into account that the Bootstrap version, by tending to create more ties, shows a tendency to underestimate the cumulative distribution; in other words, F̂^B_{JLND_n}(q) ≤ F_{JLND_n}(q), where F_{JLND_n}(·) is the true cumulative distribution. Due to the increasing tendency shown by the cases addressed (see Figure 5), it is expected that the observed value of the statistic JLND_n, jlnd_0, is positioned in the upper tail of the distribution in each case, which leads to the p-value being given by 2(1 − F̂^B_{JLND_n}(jlnd_0)), see Definition 4. As a consequence, 2(1 − F̂^B_{JLND_n}(jlnd_0)) > 2(1 − F_{JLND_n}(jlnd_0)). With the proposal made through Equation (3), we seek to correct the underestimation, since it does not favor the proliferation of ties; this would explain the relationship between the p-values.
5. Concluding Remarks
In this article, we investigate the performance of the JLND_n statistic to identify dependence in bivariate random vectors from a paired sample of size n. The procedure requires identifying the LNDSS that can be found in the graph of the marginal ranks of the paired observations, see Definitions 1 and 2. The goal is to compare the length of such a subsequence (Definition 3) with the length of all possible subsequences under the assumption of independence; this means imposing a uniform distribution on the expanded S_n space. The formulation of the procedure requires estimating the distribution of the statistic JLND_n under the assumption of independence, and in this paper this estimate is given by Equation (3) (see also Definition 4). The estimate proposed in this paper shows improved performance compared with the one given in [5], see Section 2.1. The concept of the longest non-decreasing subsequence allows us to build a tool without restrictions on the type of variable, continuous or discrete, to which it can be applied. From the simulation study, we confirm that the detection power of the procedure appears in situations with an increasing pattern from left to right and from bottom to top, which is the direction in which the JLND_n statistic is built (see Figure 1). The observations can be associated with continuous or discrete variables without affecting the power of the test. The concomitant presence of increasing and decreasing patterns does not necessarily nullify the detection capacity of the procedure if the sample size is big enough. We also verify the robustness of the procedure when inspecting cases that suffer contamination that could conceal the dependence; see Table 4, Table 5, Table 6, Table 7 and Table 8. We use different real data sets that expose the versatility of the procedure to reject independence in situations such as (a) the presence of ties, (b) the presence of sparseness, and (c) mixed situations.
Figure 2. (left) X vs. Y; (right) ranks(X) vs. ranks(Y). The values of X and Y are simulated from two independent exponential distributions, λ = 10 for X and λ = 20 for Y, n = 100.
Figure 5. (left) Alcohol vs. Flavanoids; (right) Flavanoids vs. Intensity. Variables coming from the wine data set of the gclus R-package.
Table 1. Artificial data of Example 1, ordered by the magnitude of the x_i values.
x_i | y_i | Rank(x_i) | Rank(y_i) |
---|---|---|---|
5.3 | 10.2 | 1 | 5 |
5.3 | 9.3 | 1 | 1 |
6.1 | 9.3 | 3 | 1 |
6.1 | 10.1 | 3 | 3 |
7.1 | 10.1 | 5 | 3 |
7.3 | 11.0 | 6 | 6 |
Table 2. Proportion of rejections of H0 over 1000 simulations of n independent pairs of discrete Uniform variables on {1, …, m}; p-values computed via the Bootstrap estimate, Equation (2). Upper block: α = 0.01; lower block: α = 0.05.
n | m=10 | m=20 | m=50 | m=100 |
20 | 0.013 | 0.021 | 0.022 | 0.032 | |
40 | 0.021 | 0.038 | 0.037 | 0.041 | |
α=0.01 | 60 | 0.025 | 0.033 | 0.043 | 0.050 |
80 | 0.019 | 0.040 | 0.053 | 0.050 | |
100 | 0.028 | 0.034 | 0.044 | 0.059 | |
n | m=10 | m=20 | m=50 | m=100 | |
20 | 0.084 | 0.089 | 0.112 | 0.100 | |
40 | 0.091 | 0.105 | 0.134 | 0.143 | |
α=0.05 | 60 | 0.104 | 0.114 | 0.148 | 0.149 |
80 | 0.095 | 0.124 | 0.139 | 0.159 | |
100 | 0.113 | 0.111 | 0.125 | 0.143 |
Table 3. Proportion of rejections of H0 over 1000 simulations of n independent pairs of discrete Uniform variables on {1, …, m}; p-values computed via the permutation estimate, Equation (3). Upper block: α = 0.01; lower block: α = 0.05.
n | m=2 | m=3 | m=4 | m=5 | m=10 | m=20 | m=50 | m=100 |
20 | 0.004 | 0.006 | 0.008 | 0.014 | 0.011 | 0.007 | 0.007 | 0.004 | |
40 | 0.004 | 0.009 | 0.005 | 0.008 | 0.007 | 0.008 | 0.010 | 0.011 | |
α=0.01 | 60 | 0.005 | 0.004 | 0.009 | 0.007 | 0.006 | 0.004 | 0.010 | 0.014 |
80 | 0.006 | 0.005 | 0.010 | 0.011 | 0.012 | 0.009 | 0.006 | 0.008 | |
100 | 0.005 | 0.011 | 0.011 | 0.009 | 0.007 | 0.009 | 0.012 | 0.008 | |
n | m=2 | m=3 | m=4 | m=5 | m=10 | m=20 | m=50 | m=100 | |
20 | 0.021 | 0.032 | 0.041 | 0.047 | 0.038 | 0.048 | 0.042 | 0.043 | |
40 | 0.019 | 0.038 | 0.030 | 0.043 | 0.030 | 0.046 | 0.045 | 0.037 | |
α=0.05 | 60 | 0.029 | 0.041 | 0.042 | 0.044 | 0.040 | 0.032 | 0.056 | 0.044 |
80 | 0.039 | 0.031 | 0.045 | 0.046 | 0.051 | 0.046 | 0.046 | 0.052 | |
100 | 0.031 | 0.041 | 0.048 | 0.049 | 0.052 | 0.054 | 0.053 | 0.053 |
Table 4. Rejection rates over 1000 simulations of samples of size n for the distributions D1, D2, D3 (p = 1.0) and the mixtures M1, M2, M3 (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.349 | 0.028 | 0.994 | 0.179 | 0.018 | |
40 | 1.000 | 0.798 | 0.050 | 1.000 | 0.568 | 0.034 | |
α=0.01 | 60 | 1.000 | 0.983 | 0.136 | 1.000 | 0.858 | 0.078 |
80 | 1.000 | 0.999 | 0.252 | 1.000 | 0.963 | 0.109 | |
100 | 1.000 | 1.000 | 0.352 | 1.000 | 0.990 | 0.181 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.537 | 0.101 | 1.000 | 0.366 | 0.064 | |
40 | 1.000 | 0.906 | 0.177 | 1.000 | 0.757 | 0.125 | |
α=0.05 | 60 | 1.000 | 0.993 | 0.306 | 1.000 | 0.934 | 0.192 |
80 | 1.000 | 1.000 | 0.468 | 1.000 | 0.985 | 0.250 | |
100 | 1.000 | 1.000 | 0.601 | 1.000 | 0.997 | 0.368 |
Table 5. As Table 4, for another setting of the parameters m, a and b.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.431 | 0.044 | 0.993 | 0.229 | 0.024 | |
40 | 1.000 | 0.902 | 0.172 | 1.000 | 0.719 | 0.079 | |
α=0.01 | 60 | 1.000 | 0.989 | 0.374 | 1.000 | 0.924 | 0.180 |
80 | 1.000 | 1.000 | 0.615 | 1.000 | 0.986 | 0.342 | |
100 | 1.000 | 0.999 | 0.762 | 1.000 | 0.998 | 0.515 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.610 | 0.126 | 0.998 | 0.404 | 0.097 | |
40 | 1.000 | 0.949 | 0.368 | 1.000 | 0.837 | 0.196 | |
α=0.05 | 60 | 1.000 | 0.997 | 0.608 | 1.000 | 0.963 | 0.370 |
80 | 1.000 | 1.000 | 0.823 | 1.000 | 0.993 | 0.571 | |
100 | 1.000 | 1.000 | 0.918 | 1.000 | 0.999 | 0.711 |
Table 6. As Table 4, for a further setting of the parameters m, a and b.
p=1.0 | p=0.8 |
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.409 | 0.038 | 0.997 | 0.179 | 0.024 | |
40 | 1.000 | 0.865 | 0.137 | 1.000 | 0.623 | 0.063 | |
α=0.01 | 60 | 1.000 | 0.984 | 0.292 | 1.000 | 0.884 | 0.141 |
80 | 1.000 | 0.999 | 0.473 | 1.000 | 0.969 | 0.247 | |
100 | 1.000 | 1.000 | 0.655 | 1.000 | 0.991 | 0.394 | |
p=1.0 | p=0.8 | ||||||
n | D1 | D2 | D3 | M1 | M2 | M3 | |
20 | 1.000 | 0.597 | 0.110 | 1.000 | 0.345 | 0.090 | |
40 | 1.000 | 0.933 | 0.300 | 1.000 | 0.771 | 0.194 | |
α=0.05 | 60 | 1.000 | 0.996 | 0.520 | 1.000 | 0.949 | 0.326 |
80 | 1.000 | 1.000 | 0.715 | 1.000 | 0.990 | 0.468 | |
100 | 1.000 | 1.000 | 0.848 | 1.000 | 0.998 | 0.620 |
Table 7. Rejection rates over 1000 simulations of samples of size n from D4(m, 0.5) (p = 1.0) and the mixture M4(m, 0.5) (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | m=2 | m=3 | m=5 | m=10 | m=2 | m=3 | m=5 | m=10 | |
20 | 1.000 | 1.000 | 1.000 | 1.000 | 0.713 | 0.936 | 0.985 | 0.996 | |
40 | 1.000 | 1.000 | 1.000 | 1.000 | 0.993 | 1.000 | 1.000 | 1.000 | |
α=0.01 | 60 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
80 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
p=1.0 | p=0.8 | ||||||||
n | m=2 | m=3 | m=5 | m=10 | m=2 | m=3 | m=5 | m=10 | |
20 | 1.000 | 1.000 | 1.000 | 1.000 | 0.882 | 0.977 | 0.999 | 0.999 | |
40 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 | |
α=0.05 | 60 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
80 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
100 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Table 8. Rejection rates over 1000 simulations of samples of size n from D5(m, 0.5) (p = 1.0) and the mixture M5(m, 0.5) (p = 0.8). Upper block: α = 0.01; lower block: α = 0.05.
p=1.0 | p=0.8 |
n | m=3 | m=5 | m=10 | m=3 | m=5 | m=10 | |
20 | 0.070 | 0.209 | 0.446 | 0.031 | 0.105 | 0.209 | |
40 | 0.211 | 0.634 | 0.926 | 0.118 | 0.374 | 0.708 | |
α=0.01 | 60 | 0.446 | 0.910 | 0.996 | 0.224 | 0.655 | 0.945 |
80 | 0.611 | 0.977 | 1.000 | 0.364 | 0.852 | 0.994 | |
100 | 0.728 | 0.999 | 1.000 | 0.528 | 0.949 | 0.999 | |
p=1.0 | p=0.8 | ||||||
n | m=3 | m=5 | m=10 | m=3 | m=5 | m=10 | |
20 | 0.161 | 0.385 | 0.638 | 0.112 | 0.239 | 0.388 | |
40 | 0.358 | 0.791 | 0.971 | 0.231 | 0.578 | 0.828 | |
α=0.05 | 60 | 0.612 | 0.970 | 0.999 | 0.384 | 0.810 | 0.973 |
80 | 0.736 | 0.995 | 1.000 | 0.529 | 0.951 | 0.998 | |
100 | 0.850 | 0.999 | 1.000 | 0.671 | 0.981 | 1.000 |
Table 9. Tests of independence between Flavanoids and Intensity (wine data set, case (ii)).
 | Copula | Hoeffding | Spearman | Pearson | Kendall | JLND_n Equation (2) | JLND_n Equation (3) |
---|---|---|---|---|---|---|---|
p-value | 0.0005 | 0.0000 | 0.5695 | 0.0214 | 0.5713 | 0.0380 | 0.0044 |
coefficient | | | −0.0429 | −0.1724 | 0.0287 | | |
Table 10. cdrate data set: frequencies of the Return on CD values for each value of Type (0 = bank, 1 = thrift).
Return on CD | Type = 0 | Type = 1 | Return on CD | Type = 0 | Type = 1 | Return on CD | Type = 0 | Type = 1 |
---|---|---|---|---|---|---|---|---|
7.51 | 0 | 1 | 8.15 | 0 | 1 | 8.49 | 0 | 3 |
7.56 | 1 | 0 | 8.17 | 1 | 0 | 8.50 | 1 | 9 |
7.57 | 1 | 0 | 8.20 | 0 | 1 | 8.51 | 1 | 0 |
7.71 | 1 | 0 | 8.25 | 0 | 2 | 8.52 | 0 | 1 |
7.75 | 0 | 1 | 8.30 | 1 | 2 | 8.55 | 1 | 0 |
7.82 | 2 | 0 | 8.33 | 2 | 1 | 8.57 | 1 | 0 |
7.90 | 1 | 1 | 8.34 | 0 | 1 | 8.65 | 2 | 0 |
8.00 | 7 | 3 | 8.35 | 0 | 2 | 8.70 | 0 | 1 |
8.05 | 2 | 0 | 8.36 | 0 | 1 | 8.71 | 1 | 0 |
8.06 | 1 | 0 | 8.40 | 1 | 6 | 8.75 | 0 | 1 |
8.11 | 1 | 0 | 8.45 | 0 | 1 | 8.78 | 0 | 1 |
Table 11. Tests of independence between Return on CD and Type (cdrate data set).
Test | χ2 | JLND_n (Equation (2)) | JLND_n (Equation (3)) |
---|---|---|---|
p-value | 0.0558 | 0.0320 | 0.0068 |
Statistic’s value | 45.6450 (df = 32) | 50.275 | 50.275 |
Author Contributions
Conceptualization, J.E.G. and V.A.G.-L.; methodology, J.E.G. and V.A.G.-L.; software, J.E.G. and V.A.G.-L.; validation, J.E.G. and V.A.G.-L.; formal analysis, J.E.G. and V.A.G.-L.; investigation, J.E.G. and V.A.G.-L.; data curation, J.E.G. and V.A.G.-L.; writing-review and editing, J.E.G. and V.A.G.-L. Both authors have read and agreed to the published version of the manuscript.
Funding
No funds were received for the execution of this work.
Acknowledgments
The authors wish to thank the two referees for their many helpful comments and suggestions on an earlier draft of this paper.
Conflicts of Interest
The authors declare no conflict of interest.
1. Agresti, A. Categorical Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2002.
2. Kendall, M.G. The treatment of ties in ranking problems. Biometrika 1945, 33, 239-251.
3. Cramér, H. Mathematical Methods of Statistics; Princeton University Press: Princeton, NJ, USA, 1946; Volume 500.
4. Romik, D. The Surprising Mathematics of Longest Increasing Subsequences; Cambridge University Press: New York, NY, USA, 2015; Volume 4.
5. García, J.E.; González-López, V.A. Independence test for sparse data. AIP Conf. Proc. 2016, 1738, 140002.
6. García, J.E.; González-López, V.A. Independence tests for continuous random variables based on the longest increasing subsequence. J. Multivar. Anal. 2014, 127, 126-146.
7. Zelterman, D. Goodness-of-fit tests for large sparse multinomial distributions. J. Am. Stat. Assoc. 1987, 82, 624-629.
8. Hollander, M.; Wolfe, D. Nonparametric Statistical Methods; John Wiley & Sons: New York, NY, USA, 1973; pp. 185-194.
9. Simonoff, J.S. Smoothing Methods in Statistics; Springer: New York, NY, USA, 1996.
10. Weisberg, S. Applied Linear Regression, 4th ed.; John Wiley & Sons: Minneapolis, MN, USA, 2005; Volume 528.
Jesús E. García and Verónica A. González-López*
Department of Statistics, University of Campinas, Sérgio Buarque de Holanda, 651, Campinas 13083-859, São Paulo, Brazil
*Author to whom correspondence should be addressed.
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the "License").
Abstract
In this paper, we show how the longest non-decreasing subsequence, identified in the graph of the paired marginal ranks of the observations, allows the construction of a statistic for the development of an independence test in bivariate vectors. The test works in the case of discrete and continuous data. Since the present procedure does not require the continuity of the variables, it expands the proposal introduced in Independence tests for continuous random variables based on the longest increasing subsequence (2014). We show the efficiency of the procedure in detecting dependence in real cases and through simulations.