About the Authors:
Stanley Luck
Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing
* E-mail: [email protected]
Affiliation: Vector Analytics LLC, Wilmington, DE, United States of America
ORCID logo https://orcid.org/0000-0001-7081-9407
Abstract
The ordinary linear regression method is limited to bivariate data because it is based on the Cartesian representation y = f(x). Using the chain rule, we transform the method to the parametric representation (x(t), y(t)) and obtain a linear regression framework in which the weighted average is used as a parameter for a multivariate linear relation for a set of linearly related variable vectors (LRVVs). We confirm the proposed approach by a Monte Carlo simulation, where the minimum coefficient of variation for error (CVE) provides the optimal weights when forming a weighted average of LRVVs. Then, we describe a parametric linear regression (PLR) algorithm in which the Moore-Penrose pseudoinverse is used to estimate measurement error regression (MER) parameters individually for the given variable vectors. We demonstrate that MER parameters from the PLR and nonlinear ODRPACK methods are quite similar for a wide range of reliability ratios, but ODRPACK is formulated only for bivariate data. We identify scale invariant quantities for the PLR and weighted orthogonal regression (WOR) methods and their correspondences with the partitioned residual effects between the variable vectors. Thus, the specification of an error model for the data is essential for MER and we discuss the use of Monte Carlo methods for estimating the distributions and confidence intervals for MER slope and correlation coefficient. We distinguish between elementary covariance for the y = f(x) representation and covariance vector for the (x(t), y(t)) representation. We also discuss the multivariate generalization of the Pearson correlation as the contraction between Cartesian polyad alignment tensors for the LRVVs and weighted average. Finally, we demonstrate the use of multidimensional PLR in estimating the MER parameters for replicate RNA-Seq data and quadratic regression for estimating the parameters of the conical dispersion of read count data about the MER line.
Citation: Luck S (2022) A parametric framework for multidimensional linear measurement error regression. PLoS ONE 17(1): e0262148. https://doi.org/10.1371/journal.pone.0262148
Editor: Sriparna Saha, Indian Institute of Technology Patna, INDIA
Received: July 14, 2021; Accepted: December 19, 2021; Published: January 21, 2022
Copyright: © 2022 Stanley Luck. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The RNA-Seq GLDS-251 dataset is available from the GeneLab Omics database at https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-251/. The textbook, "Statistical Inference" by G. Casella and R. L. Berger, also serves as a data source.
Funding: The author, Stanley Luck, is a member of Vector Analytics LLC, which is a science consulting company. The funder provided support in the form of salaries for authors [SL], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.
Competing interests: There are no competing interests connected with our consulting work at Vector Analytics LLC. This affiliation also does not alter our adherence to PLOS ONE policies on sharing data and materials. This work is not associated with any patents or commercial products.
1 Introduction
In this work, we consider the problem of fitting a multidimensional line for data that are subject to stochastic error. The motivation for this work comes from a collaborative R & D effort involving the application of the genome-wide association study (GWAS) [1] and eQTL [2] methods to identify beneficial agronomic variation in maize. This led to our applied algebraic investigation of the merits of various effect size measures and their associated statistical methodologies, as described in our two recent publications in this journal. In the first, we discussed the importance of factoring a 2 × 2 contingency table for obtaining marginal scale invariant effect size measures for proportional variation [3]. We also used projective geometric concepts for discussing connections between the phi coefficient, odds ratio, relative risk and proportion with respect to the representation of effect sizes. In the second, we discussed the formulation of a nonparametric measure for nonoverlap proportion and the importance of specifying a complete set of parameters for point-biserial variation in the formulation of an effect size measure [4]. We also discussed the importance of providing an error model for data and the use of Monte Carlo (MC) methods for estimating distributions and confidence intervals for effect size, as required for best practices [5]. However, in RNA-Seq studies, error estimation is complicated by the low number of replications usually employed. Our RNA-Seq experimental design is consistent with the recommended minimum of three replicates [6], but this is less than the minimum of 13 replicates required for reliably estimating variance [7]. Furthermore, RNA-Seq read counts can range over more than three orders of magnitude with heteroscedastic error. Thus, weighted least squares (WLS) methods [8, 9] are needed for properly partitioning residual effects in data analyses.
The importance of accounting for measurement errors in omics studies has been discussed in several publications [7, 10–13]. These problems provide the motivation for our investigation of linear measurement error regression (MER) methods for multivariate data. The concepts discussed in this paper significantly extend our previous gene expression data analysis work that is briefly described in [14].
We use the term ‘measurement error regression’ in accordance with the recommendation in [15] instead of the other commonly used terms, such as ‘errors-in-variables’ or ‘major axis regression’ [16]. There are many studies on the MER problem [17–19], but [9, 20–22] were particularly useful in this work. There are two forms of WLS optimization in measurement error regression for (x, y) data. The elementary form involves the application of the normal equations algorithm [9] to obtain WLS-based linear regression (WR) estimates for the necessary parameters. However, WR is subject to the limitation that the design matrix must be constructed from error-free independent data, and the residual effects are assigned exclusively to the dependent data. Thus, WR requires the specification of ‘dependent’ and ‘independent’ quantities. Then, when both x and y are subject to error, the WR estimate is biased, and there is extensive literature describing various approaches for solving this problem [18–20]. The second form of WLS involves ‘weighted orthogonal regression’ (WOR) or ‘orthogonal distance regression’, where the weighted sum of squares of the combined residual effects for x and y is minimized [23]. The ‘dependent’ versus ‘independent’ distinction for y and x does not hold in WOR. Then, WOR can be regarded as a generalization because under the condition where only the y data are subject to error, the WOR and WR estimates are equivalent. However, there is no analytical solution for WOR [23], and the implementation requires a numerical iterative search method. In this work, we use the implementation of WOR in open source statistics software [24, 25] based on the nonlinear ODRPACK algorithm [26]. However, ODRPACK is limited to the analysis of bivariate data because it is based on the Cartesian representation y = f(x).
Our objective is to describe a multivariate generalization of the WR algorithm that provides unbiased estimates for MER parameters for a set of linearly related variable vectors (LRVVs). The main novel contributions of this work are as follows: 1) We associate a set of LRVVs with a convex set of weighted averages and use Monte Carlo simulation to demonstrate that the minimum coefficient of variation for error (CVE) provides the optimal weights for forming the weighted average (Eqs 13, 16 and 17). 2) We formulate a parametric linear regression (PLR) algorithm where the weighted average of the LRVVs serves as the independent parameter, and the MER parameters are estimated using the Moore-Penrose pseudoinverse (Eqs 22 and 23; Proposition 1). 3) Convex mappings from the data points in the MER graph to the predicted values along the MER line are provided (Eqs 6 and 7). 4) We propose that in the parametric representation for the linear relation between variable vectors, the corresponding covariance is a parametric vector quantity (Eq 25; Proposition 2). 5) We identify scale invariant quantities for the partitioned residual effects between the LRVVs for both PLR (Eq 27) and WOR (Eq 37). 6) We propose a multivariate generalization of the Pearson correlation coefficient based on the m-fold contraction of Cartesian polyad alignment tensors for the LRVVs and weighted average (Eqs 32, 33 and 35). 7) A quadratic regression method is used to estimate the conical dispersion parameters for determining the read counting error in replicated RNA-Seq data (Eqs 40 and 43).
2 Methods
Section 2.1 introduces our notations. In sections 2.2–2.4, we examine the WLS methods for partitioning residual effects in linear regression for (x, y) data. In section 2.5, we discuss the minimum CVE criterion for obtaining the optimal weighted average of a set of LRVVs. In section 2.6, we describe the PLR framework where parametric equations are used to obtain a novel multidimensional generalization of the normal WLS algorithm for linear regression. In section 3.2, we describe the application of the PLR algorithm for the analysis of the conical dispersion of replicated RNA-Seq data about the MER line.
It has already been established that the specification of an error model for the given data is essential in MER analysis. An error model serves as a realistic assessment of the performance of a data acquisition system, and error parameters are estimated from replicated measurements [27]. The parameters are usually summarized as an ‘expected variance’ for error with respect to each observation. When referring to experimental error, we use the term ‘expected variance’ because error parameters are empirical. We provide a detailed discussion of the role of the expected variance in partitioning residual effects for the WLS optimization of MER. Furthermore, proper statistical practice dictates that effect size measures such as the regression slope, the correlation coefficient and Cohen’s d must be qualified by a distribution [5, 28], and an assessment of substantive significance [29] must account for a distributed system response. Consequently, we also discuss the use of Monte Carlo methods for estimating the distributions and confidence intervals of MER parameters.
2.1 Notation
Our notation and terminology are partly taken from [30], but comprehensive discussions of vector spaces and linear algebra for applied statistics are found in many textbooks, such as [8, 9, 31]. Scalars are denoted by italics y, and vectors are written in bold lowercase letters y ≡ (yi) ≡ (y1, y2, y3, …, ydim(y)). Matrices are written in bold uppercase letters Y ≡ [yij], and the transpose is denoted as YT. Then, n joint observations for m experimental quantities produce a dataset corresponding to the Cartesian product of the data vectors y1, …, ym, where yj ∈ ℝⁿ. We use a convenient terminology for the subsets of the dataset, where each axis of a regression graph is assigned to a variable vector yj, the linear combinations ∑j kj yj are elements of the variable space, and each point in the graph corresponds to an observation vector y(i) [30, 32]. In this work, we describe a novel parametric linear regression algorithm for estimating MER parameters for a set of LRVVs. We also use the shortened term ‘variable vectors’ synonymously when referring to LRVVs, except in a few clearly identified cases. The inner product with the one-vector 1 = (1, 1, 1, …) produces the mean value ȳ = (y ⋅ 1)/n. The subtraction of the mean vector produces a centered vector yc = y − ȳ1. The hat symbol ‘^’ denotes a unit-length vector, i.e., ŷ = y/‖y‖. Then, the Pearson correlation coefficient r(x, y) is defined as the cosine of the angle between unit-length, centered vectors [30, 32]:

r(x, y) = x̂c ⋅ ŷc (1)
The variance for yj is denoted as Var(yj). The expected variance for a normally distributed error is denoted as Var(ej), and a vector of such values denotes the expected variance of the error in yj. For bivariate data (yj, yk), the covariance is denoted as Cov(yj, yk); the notation for a variance-covariance matrix is not needed in this work. In our Cov(yj, yk) notation, yj and yk are independent and dependent quantities, respectively, and the roles are reversed for Cov(yk, yj). The connection between correlation and covariance as measures of linear dependence in data is discussed in [21, 32]. A general discussion of convex sets, linear fractional transformations and perspective functions is found in [33]. The set containing all convex combinations of a set of variable vectors {yj} is denoted as conv({yj}), where τ = ∑j wj yj, for ∑j wj = 1 and 0 ≤ wj ∀j. A perspective function has the form P(y, t) = y/t. In recent publications, we provided an elementary discussion of projective geometry for fractional variation [3] and identified the homogeneous coordinates of the Pearson correlation coefficient as points on a projective line [4]. Equivalent homogeneous coordinates are indicated by a two-sided arrow, i.e., y ↔ x ⇒ y = x t.
2.2 The WLS linear regression algorithm
Consider a set of n linearly related observations {(xi, yi) | 1 ≤ i ≤ n}, which are represented as the variable vectors x, y ∈ ℝⁿ. The observations are subject to error, and the differences between the true (x*, y*) and observed values are denoted by (ex, ey) = (x − x*, y − y*). The fact that (x*, y*) are unknown implies that (ex, ey) are also indeterminate. The specification of an error model for the data is essential for estimating the distributions and corresponding confidence intervals of the statistical parameters, including the slope; see section 2.6. The default approach is to reduce to a stochastic form or some approximation thereof, where the errors (ex, ey) are independent and associated with normal distributions with zero means (0, 0) and specified expected variances. In the standard MER framework [20, 21], the linear relation between x and y is expressed using Cartesian equations:

y* = αxy + βxy x* (2)

y = αxy + βxy(x − ex) + ey (3)

where (αxy, βxy) are the model parameters. We use the subscript ‘xy’ to indicate the correspondence with the derivative dy/dx, because it will be necessary to distinguish between different parameterizations in Eq 21. In linear regression analysis, the objective is to obtain sample estimates (axy, bxy) for (αxy, βxy). After introducing the design matrix X = [1, x] and the diagonal variance matrix Vy, the WLS linear regression estimate is obtained by solving the normal equations to obtain [9]

uxy,WR = (XT Vy−1 X)−1 XT Vy−1 y (4)

where uxy,WR = (axy,WR, bxy,WR). Vy is required for the WLS scaling of both X and y, and (XT X)−1 XT is the Moore-Penrose pseudoinverse of X; in practice, the pseudoinverse is estimated by singular value decomposition. The ordinary linear regression (OLR) estimate is obtained when Vy ∝ I, where I is the identity matrix. However, the bxy,WR estimate is biased because of the correlation of x with ex in Eq 3 [20]. The reliability ratio [17] for x is defined as κx = Var(x*)/Var(x), where Var(x) = Var(x*) + Var(ex). As illustrated in Fig 1A and 1B, bxy,WR is attenuated as κx decreases from 1.
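The WLS estimate of Eq 4 can be sketched in a few lines of Python; `wls_fit` and its arguments are illustrative names, not code from the paper, and the pseudoinverse is computed via SVD as the text notes.

```python
import numpy as np

def wls_fit(x, y, var_y):
    """WLS estimate u = (a, b) for y = a + b*x (a sketch of Eq 4).

    Both the design matrix X = [1, x] and y are scaled by the inverse
    error standard deviations before applying the Moore-Penrose
    pseudoinverse, which numpy computes via SVD.
    """
    w = 1.0 / np.sqrt(var_y)                    # inverse std-dev weights
    X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
    return np.linalg.pinv(X * w[:, None]) @ (y * w)

# OLR is the special case of equal weights (V_y proportional to I)
x = np.linspace(1.0, 5.0, 25)
y = 1.0 + 2.0 * x
a, b = wls_fit(x, y, np.full_like(x, 0.09))
```

For exact, noise-free data the fit recovers the intercept and slope exactly; with errors in x it exhibits the attenuation bias discussed above.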
This bias in bxy,WR arises from the algebraic requirement that X must serve as an error-free quantity in the normal equations, and the residuals are assigned exclusively to y. The solution uxy,WR holds under the condition where Var(ex) is small enough such that x ≈ x* with κx ≈ 1. A key objective of this work is to describe a novel parametric generalization in which the WR algorithm itself serves as the basis for measurement error regression for a set of LRVVs (section 2.6). We also compare the performance of PLR with that of existing MER methods and provide realistic MER examples using publicly available data. In particular, we discuss the WLS orthogonal regression and attenuation-corrected slope estimation methods.
[Figure omitted. See PDF.]
Fig 1. Comparison of various measurement error regression methods.
A, B, C: Monte Carlo simulations of MER for 25 evenly-spaced values . The statistical parameters are estimated from 5000 Monte Carlo datasets (x, y) = (x* + ex, y* + ey), with normally distributed errors (ex, ey). The parametric linear regression and orthogonal distance regression methods both produce consistent estimates for the slope (bMER) for a wide range of reliability ratios κx. The weighted average (bxy,W) of the attenuation-corrected direct and inverse WR estimates also produces comparable results. In contrast, the standard (bWR) estimate is attenuated. A: Homoscedastic error: (sy* = 0.3, sx* = α), and α = {0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35}. B: Heteroscedastic error: , and α = {0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. C: Using the results from (B), the Pearson correlation coefficient rxy is attenuated compared to the parametric correlation coefficient for the weighted average τw = w x + (1 − w)y. D, E, F: MER results for the data (x, y) from Table 11.3.1 in [21]. However, error estimates were not provided for the data. For demonstration purposes, we used a rudimentary error model: , , κx = 0.93, κy = 0.91, r = 0.49, and rPLR = 0.74. D: Each MER line is labeled with its slope. The ordinary least squares (bOLR) and orthogonal regression (bOR) estimates are consistent with [21]. The WLS orthogonal regression (bORw) estimate is also shown. The predicted values for (xi, yi) are plotted for PLR and WOR. E: The bxy,PLR estimate is obtained from separate regressions bτx,WR and bτy,WR. The inset shows the distribution of bxy,PLR estimated from 100 histograms with 5000 MC samples per histogram (see Fig 4); the error bars indicate the degree of convergence for the MC simulation. F: The weighted average prediction residuals for bxy = 1.38 and the WOR residuals are quite similar. The PLR residuals from the τ WR results in (E) are comparable but do not account for heteroscedastic error. A-E: The error bars correspond to ±2σ.
https://doi.org/10.1371/journal.pone.0262148.g001
2.3 WLS orthogonal regression
The WOR algorithm minimizes the weighted sum of squares for the combined residual error for x and y (Eq 5) [26] and does not require the identification of dependent and independent variables. However, this minimization problem does not possess an analytical solution [23], and a numerical method is needed for the implementation of WOR. For our demonstration, the python scipy.odr package [25] provides a convenient WOR implementation based on the ODRPACK optimization library [26]. We are interested in WOR for a linear function y = a + bx, but ODRPACK is more general and uses the Levenberg-Marquardt nonlinear least squares algorithm to support parameter estimation for a user-specified bivariate function y = f(x). As demonstrated in Fig 1, the ODRPACK algorithm performs well for homoscedastic (Fig 1A) and heteroscedastic (Fig 1B) error models over a wide range of reliability ratios. When either κx or κy approaches 1, the WOR estimate is equivalent to the WR estimate. Thus, the WOR algorithm is regarded as a generalization of WR. Furthermore, the ODRPACK algorithm provides predicted and residual values for (xi, yi), as demonstrated in Fig 1F. For comparison, we describe a novel algebraic method to estimate predicted and residual values in MER. First, we associate the observed data (xi, yi) with a convex mapping to an interval on an MER line with (intercept, slope) ≡ (a, b). The interval is bounded by the two points (xi, a + b xi) and ((yi − a)/b, yi), which correspond to the special cases in which the residual is assigned entirely to y and entirely to x, respectively. Then, the weighted average of these two points (Eq 6) is contained in the interval for 0 ≤ w ≤ 1. Using the signal-to-noise ratios (see Eq 17), the optimal weight is obtained (Eq 7), which produces the weighted average prediction (WP) value vi,WP. In Fig 1F, we demonstrate that the ODRPACK and WP estimates for residuals are quite similar. The generalization of Eqs 6 and 7 for MER in ℝᵐ is straightforward.
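The scipy.odr usage described above can be sketched as follows; the synthetic data and error standard deviations are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import odr

def linear(beta, x):
    """Linear model y = beta[0] + beta[1] * x, as required by odr.Model."""
    return beta[0] + beta[1] * x

x = np.linspace(1.0, 5.0, 25)
y = 0.5 + 2.0 * x

# RealData takes the expected error standard deviations sx and sy,
# which ODRPACK uses to weight the combined orthogonal residuals.
data = odr.RealData(x, y, sx=np.full_like(x, 0.1), sy=np.full_like(y, 0.3))
fit = odr.ODR(data, odr.Model(linear), beta0=[0.0, 1.0]).run()
a, b = fit.beta          # intercept and slope estimates
```

The `fit` object also exposes the predicted values and residuals that are compared with the WP estimates in Fig 1F.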
However, the WR and WOR algorithms are both subject to the algebraic limitation that the Cartesian representation y = f(x) only allows for the representation of lines (or curves) in two dimensions. Alternatively, the specification of a straight line in ℝᵐ requires the use of the parametric representation

y(t) = a + b t (8)

with a, b ∈ ℝᵐ [34]. Consequently, our objective is to describe a parametric framework where the WR algorithm serves as the basis for measurement error regression for LRVVs.
2.4 Covariance in ordinary linear regression
The OLR algorithm corresponds to the special case where Vy is replaced with the identity matrix I in Eq 4. Then, the expression for the slope takes a simple form and provides the intuition for our novel parametric algorithm in section 2.6. From Eq 3, the sample covariance [22] is expressed as

Cov(x, y) = βxy(Var(x) − Var(ex)) = βxy κx Var(x),

where Cov(ex, ey) = 0, and Cov(x, ex) = Var(ex). Then, we obtain the direct estimator for the slope

bxy,D = bxy,OLR/κx (9)

where

bxy,OLR = Cov(x, y)/Var(x) (10)

and bxy,D → bxy,OLR as κx → 1. Even though bxy,D is referred to as the ‘regression coefficient corrected for attenuation’ [17], we note that bxy,D is also biased. Reversing the roles of x and y yields the inverse estimator for βxy:

bxy,I = κy Var(y)/Cov(x, y) (11)

where κy = 1 − Var(ey)/Var(y). The relations in Eqs 9 and 11 indicate that the OLR estimates serve as bounding values because for any least squares estimate bMER, it can be shown [20] that

|bxy,OLR| ≤ |bMER| ≤ |byx,OLR|⁻¹ (12)
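The direct and inverse attenuation-corrected estimators can be sketched as follows; this is our reading of Eqs 9 and 11, and the function and variable names are illustrative rather than the paper's.

```python
import numpy as np

def corrected_slopes(x, y, var_ex, var_ey):
    """Direct and inverse attenuation-corrected OLS slope estimates
    for beta_xy (a sketch of Eqs 9 and 11), given the expected error
    variances var_ex and var_ey from the error model."""
    cov = np.cov(x, y, ddof=1)[0, 1]
    kx = 1.0 - var_ex / np.var(x, ddof=1)   # reliability ratio kappa_x
    ky = 1.0 - var_ey / np.var(y, ddof=1)   # reliability ratio kappa_y
    b_direct = cov / (kx * np.var(x, ddof=1))
    b_inverse = ky * np.var(y, ddof=1) / cov
    return b_direct, b_inverse
```

For error-free data the two estimates coincide with the OLS slope; with errors in both variables the uncorrected OLS slopes bracket any least squares estimate, as in Eq 12.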
In section 2.6, we provide a statistical explanation (see Eq 26) and a graphical illustration (Fig 2) of the bounded range of |bMER|. To compensate for the bias in the direct and inverse estimators, various ad-hoc combinations have been proposed, including the arithmetic mean bxy,A [16, 35]:
[Figure omitted. See PDF.]
Fig 2. Bounded ranges of the regression slopes for the errors in both x and y.
Monte Carlo simulations of parametric linear regression are conducted for 25 evenly spaced values , and the error model is , for α = 0.01, 0.2, 0.4. The estimates for the slope bPLR (A) and correlation rPLR = rτx rτy (B) are averages over 5000 datasets (x, y) = (x* + ex, y* + ey) with normally distributed errors (ex, ey) and a weighted average τ = wx x + (1 − wx)y. The optimal (bPLR, rPLR) estimates corresponding to the minimum CVE in τ are also plotted (red). The 2σ error bars are estimates for the positive and negative deviations from the median value.
https://doi.org/10.1371/journal.pone.0262148.g002
Unfortunately, the limiting values of bxy,A are not consistent with the OLR estimate for either κx = 1 or κy = 1. Instead, we introduce the weighted average bxy,W, where the weights are determined from signal-to-noise ratios (see Eq 17). In Fig 1A and 1B, we observe that the bxy,W estimates compare well with the expected slope but with much larger confidence intervals compared to those of WOR.
2.5 Noise reduction for a weighted average of linearly related variable vectors
Consider the weighted average of the LRVVs x and y:

τ = w x + (1 − w)y (13)

with 0 ≤ w ≤ 1. Then, τ ∈ conv(x, y) and corresponds to points on the line segment connecting x and y. The average of x and y is associated with the partial cancellation of error and an improved signal-to-noise ratio. Let Var(ex) and Var(ey) denote the expected variances of the errors in x and y, respectively. Then, the propagation of independent error for the weighted sum of two variables [34] gives

Var(eτ) = w² Var(ex) + (1 − w)² Var(ey) (14)
By introducing the mean value τ̄ and the change in coordinates ω = w/(1 − w), the coefficient of variation for error in τ is

CVE(τ) = √Var(eτ) / τ̄ (15)
Then, setting the derivative dCVE/dω = 0 yields the criterion

w/(1 − w) = (x̄/Var(ex)) / (ȳ/Var(ey)) (16)

for the minimum CVE. In Fig 3, we use Monte Carlo simulations to demonstrate that the minimum is consistent with Eq 16. By partitioning the weighted average τ = ∑j wj yj into a sequence of pairwise averages for the data (yj), we obtain the general min(CVE) expression

wj ∝ ȳj/Var(ej) (17)
[Figure omitted. See PDF.]
Fig 3. Noise reduction for a weighted average of variable vectors.
Monte Carlo simulations of weighted averaging are conducted for the data (x, y) consisting of 25 evenly spaced values for y* = 2x*. The statistical parameters are estimated from 5000 datasets (x, y) = (x* + ex, y* + ey) with normally distributed errors (ex, ey), and the error model is: sy* = 0.3y*, sx* = α x* for α = 0.01, 0.3, 0.5. The minimum coefficient of variation for error for the weighted average τ = wx x + (1 − wx)y is consistent with Eq 17, as shown by the optimal wmin curve. Deming regression is associated with the weights and the corresponding CVE curve (grey) intersects the wmin curve.
https://doi.org/10.1371/journal.pone.0262148.g003
Thus, the noise reduction for the average is optimal when the ratios ȳj/Var(ej) are balanced for the LRVVs. To the best of our knowledge, this expression for optimizing the CVE of a weighted average has not been previously reported. We note that these ratios correspond to the signal-to-noise ratios discussed in [36]. However, for data that are not linearly related, averaging results in the uncontrolled mixing of noise and signal and a loss of information. This implies that the treatment of measurement error in more general problems, including multiple and polynomial regression, requires nonlinear optimization methods. This explains why this work is limited to the treatment of MER for LRVVs. In the next section, we discuss the use of τ as a parameter for measurement error regression.
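The minimum-CVE weighting can be sketched as follows. This is our reading of the criterion (weights proportional to the mean divided by the expected error variance), and the function names are illustrative; the second helper simply evaluates the CVE of Eq 15 under the independent-error propagation of Eq 14.

```python
import numpy as np

def min_cve_weights(means, err_vars):
    """Weights for the minimum-CVE weighted average (our reading of
    Eq 17): w_j proportional to mean(y_j) / Var(e_j), normalized so
    the weights sum to 1."""
    w = np.asarray(means, dtype=float) / np.asarray(err_vars, dtype=float)
    return w / w.sum()

def cve(w, means, err_vars):
    """Coefficient of variation for error of tau = sum_j w_j y_j,
    assuming independent errors (Eqs 14 and 15)."""
    w = np.asarray(w, dtype=float)
    return np.sqrt(np.sum(w**2 * np.asarray(err_vars))) / np.sum(w * np.asarray(means))
```

A quick numerical check confirms that perturbing the weights away from `min_cve_weights` increases the CVE, which mirrors the Monte Carlo demonstration in Fig 3.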
2.6 The chain rule in linear regression
By applying the chain rule [34] to Eq 2, we obtain the factorization βxy = (dy/dt)/(dx/dt) (Eqs 18–21)
The fractional form of Eq 19 is associated with scale invariance and corresponds to the homogeneous coordinates βxy t. The mapping from the scalar slope βxy to the slope vector βxy is equivalent to invoking the parametric representation (x(t), y(t)) for the linear relation between the variables (Eq 8). Therefore, the chain rule permits a generalization in which the slope in linear regression is parameterized by τ*, where βxy corresponds to the point (βxy, 1) on the projective line and a perspective function of the slope vector βxy [4, 33]. Consequently, Eq 21 serves as the basis for our generalization of the linear regression algorithm, where the weighted average τ serves to parameterize the relation between variable vectors. Analogously to Eq 9, the PLR slope vector is obtained as

bxy,PLR = (bτx,WR, bτy,WR) (22)

where the components of bxy,PLR are estimated individually using the WR algorithm (Eq 4) with τ in the design matrix. Thus, x and y both serve as dependent variables. Without loss of generality, the parametric scaling factor can be ignored because of the homogeneous coordinates equivalence. Therefore, we can regard τ as corresponding to a set of fixed values. Then, the parametric linear regression estimator for βxy is

bxy,PLR = bτy,WR/bτx,WR (23)
In the special case where wx = 1, τ = x and bxy,PLR → bxy,WR, so x serves as both a dependent variable and an independent variable. Conversely, for wy = 1, τ = y and y serves as both a dependent and an independent variable. Therefore, the fact that there is a bounded range of values for the MER slope (Eq 12) comes from the dependence of bxy,PLR on τ, which ranges over conv(x, y). The monotonic decrease of bxy,PLR with wx is shown in Fig 2A, with the lower and upper bounds given by (bxy,WR, 1) and (1, byx,WR), respectively; this is consistent with the OLR constraints in Eq 12. As we discuss in section 3.2, the extension of the PLR algorithm for estimating the MER parameters for a set of LRVVs is straightforward. Therefore, we make the parametric linear regression proposition.
Proposition 1. For a set of linearly related variable vectors (yj), the weighted average τ = ∑j wj yj corresponding to the minimum coefficient of variation for error (Eq 17) provides the optimal measurement error regression estimate for the slope vector.
In Fig 1A and 1B, we show that the performance of the bxy,PLR estimate is comparable to that of the WOR algorithm for a wide range of reliability ratios. The PLR algorithm provides predicted values for y(i) (Fig 1E), but predicted values can also be estimated using the WP method (Eq 7). In the special case where the LRVVs are not subject to error and the points fall exactly on a straight line, any τ ∈ conv({yj}) can serve as the parametric variable vector in PLR.
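The PLR estimator can be sketched as follows. For brevity this sketch uses ordinary (unweighted) regressions on τ, i.e., the Vy ∝ I special case, whereas the paper's algorithm uses the full WLS form of Eq 4; the names are illustrative.

```python
import numpy as np

def plr_slope(x, y, var_ex, var_ey):
    """Parametric linear regression sketch (Eqs 22 and 23): regress x
    and y separately on the weighted average tau and take the ratio
    of the two slopes."""
    # min-CVE weights, w_j proportional to mean / error variance (Eq 17)
    wx, wy = x.mean() / var_ex, y.mean() / var_ey
    wx, wy = wx / (wx + wy), wy / (wx + wy)
    tau = wx * x + wy * y
    T = np.column_stack([np.ones_like(tau), tau])   # design matrix [1, tau]
    b_tau_x = (np.linalg.pinv(T) @ x)[1]            # slope of x(tau)
    b_tau_y = (np.linalg.pinv(T) @ y)[1]            # slope of y(tau)
    return b_tau_y / b_tau_x                        # b_xy (Eq 23)
```

Because both x and y are regressed on τ, both serve as dependent variables, and the attenuation biases of the two component regressions cancel in the ratio.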
Now, we consider the pedagogically important special case where Vx and Vy are both proportional to I. We obtain Eqs 24–26, where bxy,POLR = Cov(τ, y)/Cov(τ, x) is a perspective function of the slope vector bxy,POLR. In Eq 25, we introduce the parametric covariance vector notation Cov(y, x ∣ τ) to indicate the homogeneous coordinates equivalence with bxy,POLR. Thus, we note that covariance serves as a measure of linear dependence between variable vectors [21, 32] and is therefore subject to the chain rule. Consequently, we make the parametric covariance proposition.
Proposition 2. In the analysis of linear dependence for a set of LRVVs (yj), the chain rule implies that the covariance is associated with a conditional vector form, which is parameterized by the weighted average τ = ∑j wj yj.
Therefore, covariance is a vector quantity in the parametric representation for the linear relation between variable vectors. Then, there is a range for Cov(y, x ∣ τ) in correspondence with bxy,POLR and the bounded range for bMER in Eq 12. For (wx, wy) given by Eq 17 and τ = wx x + wy y, we obtain the expression in Eq 27; observe that the right side is invariant to scaling the data, (x, y) → (kx x, ky y), and note that there is a similar factorization for Cov(τ, x). Conversely, we conclude that the scaling invariance property for these quantities implies that wx/wy ∝ ky/kx, which is a necessary but not sufficient condition for specifying the optimal weights for τ; in section 3.1, we discuss Deming weighting, which also satisfies the scaling invariance property. Then, we obtain the homogeneous coordinates equivalence and observe the corresponding scaling relation.
Then, bxy,POLR varies proportionally with the scaling (kx x, ky y); the bxy,POLR estimate is consistent with respect to the scaling of the data. These covariance factorizations provide another demonstration of the importance of WLS optimization for partitioning residual effects in measurement error regression. The generalization of these expressions for a (yj) dataset is straightforward. However, we note that Cov(τ, yj) does not account for the heteroscedastic error in yj. Thus, in practice, the PLR estimate (Eq 22) is preferred. Finally, we note that the two special cases given by Eqs 28 and 29 for τ = x, and by Eqs 30 and 31 for τ = y, correspond to the bounding values in Eq 12. Therefore, Cov(x, y) and Cov(y, x) are numerically equal but serve as components of different parametric covariance vectors, and correspond to different estimates for the MER slope, bxy,OLR and byx,OLR, respectively.
Now, we consider the multivariate generalization of the Pearson correlation coefficient r (Eq 1). Using the well-known correspondence between covariance and Pearson correlation [32], we transform Eq 25 and define a parametric correlation vector (Eq 32).
Then, we observe that the Pearson correlation coefficient rτ(x, y) can be written in Cartesian tensor form as the double contraction of dyads (Eqs 33 and 34).
The dyad is a bilinear measure of the joint variation in x and y. In the special cases where τ = x and τ = y, rτ reduces to the standard Pearson correlation coefficient r (Eq 1). Geometrically, rτ serves as a measure of the degree to which xc and yc are mutually aligned in the τc direction. Then, the generalization of Eq 33 as the m-fold contraction of polyads for the variable vectors serves as the basis for our definition of the parametric correlation coefficient (Eq 35), with m ≥ 2. The polyads are referred to as 2nd, 3rd, … order Pearson correlation tensors, respectively. Therefore, we associate a set of m LRVVs with a set of two-way, three-way, …, m-way weighted averages, PLRs, covariance vectors and PLR correlation coefficients. However, a detailed discussion of Cartesian tensors [37, 38] is beyond the scope of this paper. In Fig 2B, the concave-down variation of rPLR = rτx rτy results from the product of the monotonically increasing rτx function and the decreasing rτy function, with the endpoints equal to the Pearson correlation coefficient r. Thus, rPLR varies with the expected error, but Pearson’s r serves as the lower bound. Then, r is attenuated compared to rPLR when the data (x, y) are subject to measurement errors [13], as shown in Fig 1C. When the data (x, y) are highly correlated and Pearson’s r approaches 1, the rPLR curve becomes a horizontal line at rPLR = 1 (or nearly so). We conclude that the parametric covariance (Eq 25), correlation (Eq 32) and PLR slope (Proposition 1) correspond to different perspective functions and coordinate systems for representing the linear relation between variable vectors, i.e., points dispersed about the MER line. Then, a condition for the interpretation of covariance and correlation is that the deviation of the points from the MER line must be consistent with the expected error. In section 3.1, we discuss the ambiguity that arises when the MER line is perturbed by outliers.
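The product form rPLR = rτx rτy shown in Fig 2B can be sketched as follows; this is a minimal bivariate illustration with an externally supplied τ, using ordinary Pearson correlations for the two per-axis factors.

```python
import numpy as np

def plr_correlation(x, y, tau):
    """Parametric correlation sketch: r_PLR = r(tau, x) * r(tau, y),
    the product of the per-axis Pearson correlations of x and y with
    the weighted average tau."""
    r_tau_x = np.corrcoef(tau, x)[0, 1]
    r_tau_y = np.corrcoef(tau, y)[0, 1]
    return r_tau_x * r_tau_y
```

For error-free, perfectly linear data both factors equal 1, so rPLR = 1; with measurement error in both variables, the ordinary Pearson r between x and y is attenuated relative to this product, as in Fig 1C.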
Analytical methods for the propagation of errors and the estimation of distributions for ratios [39, 40], proportions [41, 42], and correlation coefficients [43] are complicated by fractional transformations, bounded ranges, and discreteness. The propagation of errors for composite quantities such as bxy,PLR and rPLR is prohibitively complicated. Alternatively, Monte Carlo (MC) methods are a practical approach for estimating the distributions and confidence intervals of fractional statistical parameters, and they allow for the detailed simulation of stochastic effects in the data acquisition process, as described previously [4]. The error bars shown in the figures in this paper are estimated using MC methods. Fig 4 shows an example of a two-dimensional MC histogram for the joint (rPLR, bxy,PLR) distribution. We note that the distributions for MER parameters are often skewed, so confidence intervals are estimated separately for the positive and negative deviations from the median.
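A minimal MC sketch of this procedure is shown below, with a hypothetical Gaussian error model and an ordinary least squares slope standing in for bxy,PLR (the PLR estimator itself is not reproduced here); the asymmetric deviations about the median reflect the skewness noted above.

```python
import numpy as np

def mc_median_ci(x, y, sigma_x, sigma_y, stat, n_mc=5000, seed=1):
    """Monte Carlo sketch: perturb (x, y) under an assumed Gaussian error
    model, recompute the statistic, and report the median together with
    separate +/- deviations, since MER parameter distributions are often
    skewed."""
    rng = np.random.default_rng(seed)
    vals = np.array([
        stat(x + rng.normal(0.0, sigma_x, x.shape),
             y + rng.normal(0.0, sigma_y, y.shape))
        for _ in range(n_mc)
    ])
    med = np.median(vals)
    lo, hi = np.percentile(vals, [2.275, 97.725])   # ~2-sigma coverage
    return med, med - lo, hi - med                   # median, -dev, +dev

# Stand-in statistic: ordinary least squares slope.
ols_slope = lambda xs, ys: np.polyfit(xs, ys, 1)[0]

x = np.linspace(1.0, 10.0, 20)
y = 1.4 * x + np.random.default_rng(2).normal(0.0, 0.5, 20)
med, dlo, dhi = mc_median_ci(x, y, 0.5, 0.5, ols_slope)
```

The same scheme extends to any statistic of the perturbed data; only the error model and the `stat` callable change.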
[Figure omitted. See PDF.]
Fig 4. Distribution of the statistical parameters in parametric linear regression.
A two-dimensional histogram (B) is calculated from 500,000 Monte Carlo datasets for the data in Fig 1D. The distributions of rPLR (A) and bPLR (C) are averages obtained from 100 MC histograms, with 5000 MC datasets (x, y) per histogram. The 2σ error bars are estimates of the positive and negative deviations from the median and indicate the degree of convergence of the MC estimate. The (rPLR, bPLR) = (0.74, 1.44) estimates are indicated in red.
https://doi.org/10.1371/journal.pone.0262148.g004
3 Data analysis and results
Using publicly available data, we provide practical examples that illustrate the ambiguity that arises when the variable vectors are not linearly related and the application of multidimensional parametric linear regression for analyzing the dispersion of RNA-Seq read count data.
3.1 Example 1: Measurement error regression with outliers
The data for this example come from Table 11.3.1 in [21] and originate from consulting work (R. Berger, personal communication). However, error estimates are not available for these data, so for our demonstration we provide a rudimentary error model, as shown by the 2σ xy-error bars in Fig 1D. The ordinary linear regression estimate for the slope is attenuated, as expected (Eq 9). However, the orthogonal regression (OR) estimate reported in [21] is also biased because of the equal-weight assumption for the residual effects (ex, ey). Using the defined notations for sums of squares and making the substitution (x, y) ↦ (wx x, wy y), we obtain a WLS OR estimate (Eq 36), where the wx/wy factor is needed for the transformation back to the original coordinates for (x, y), and the factor B for the weighted residuals in (x, y) is given by Eq 37.
The weights are determined from the signal-to-noise ratios, with the constraints wx + wy = 1 and wx, wy ≥ 0 (Eq 17). We note that B is invariant to the scaling of the data, (x, y) → (kx x, ky y); therefore, the bxy,ORw estimate is consistent with respect to the scaling of the data. The bxy,OR estimate is recovered when wx = wy [20], and because of this constraint, bxy,OR is not a consistent estimate, since it does not vary proportionally with (kx x, ky y) scaling. There is better agreement between bxy,WOR and bxy,ORw (Fig 1D), but the latter does not account for the heteroscedastic errors. The bxy,PLR estimate is obtained from the separate regressions (bτx,WR, bτy,WR), as shown in Fig 1E. The differences between the bxy,WOR and bxy,PLR estimates are small compared to the experimental uncertainty. However, we observe that there are many positive and negative outliers in the data (Fig 1D), and only a subset of the points are close to any one of the MER lines. This implies that the data (x, y) are not linearly related, and the weighted average τ ∈ conv(x, y) is then subject to confounding effects. In this situation, WLS optimization produces misleading results, and the MER line does not possess a functional or operational interpretation. Criteria can be applied to select an LRVV subset for measurement error regression, but this works best when the outlier subset is small. Otherwise, linear regression is not recommended for the analysis of data that are not linearly related. In [4], we discuss the use of regression tree association graphs in the analysis of weakly correlated data. Finally, we note that the Deming orthogonal regression estimate [22] is obtained by substituting the alternative weighting into Eq 36. Our simulations confirm that Deming weighting has the scale invariance property, but it is not optimal for MER (Fig 3), and the corresponding parameter estimates are biased for both PLR and ORw (not shown).
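The scale-consistency argument can be checked numerically. The sketch below uses the standard Deming slope formula in terms of the error-variance ratio δ (our notation; the paper's Eq 36 in terms of (wx, wy) is not reproduced here): rescaling the data together with δ rescales the slope by ky/kx, whereas holding δ = 1 fixed, as in equal-weight OR, does not.

```python
import numpy as np

def deming_slope(x, y, delta):
    """Deming orthogonal regression slope; delta = var(e_y)/var(e_x)
    is the assumed ratio of the error variances."""
    xc, yc = x - x.mean(), y - y.mean()
    sxx, syy, sxy = xc @ xc, yc @ yc, xc @ yc
    t = syy - delta * sxx
    return (t + np.sqrt(t * t + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

rng = np.random.default_rng(3)
u = np.linspace(1.0, 10.0, 40)
x = u + rng.normal(0.0, 0.3, 40)        # equal errors in x and y
y = 2.0 * u + rng.normal(0.0, 0.3, 40)

kx, ky = 5.0, 0.2
b = deming_slope(x, y, delta=1.0)
# Rescaling the data *and* the error-variance ratio gives a consistent
# estimate: the slope transforms as b -> (ky/kx) * b.
b_scaled = deming_slope(kx * x, ky * y, delta=(ky / kx) ** 2)
assert np.isclose(b_scaled, (ky / kx) * b)
# Holding delta = 1 after rescaling breaks this consistency.
b_fixed = deming_slope(kx * x, ky * y, delta=1.0)
assert not np.isclose(b_fixed, (ky / kx) * b)
```

The first assertion holds exactly by the algebra of the slope formula; the second illustrates the inconsistency of equal-weight OR under anisotropic rescaling.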
3.2 Example 2: Conical dispersion for replicated RNA-Seq data
In this section, our objective is to demonstrate that multidimensional linear MER has practical applications. We expect that there are many applications for PLR in the 'big data' world, but the application to RNA-Seq data analysis is convenient because of our previous collaborations in gene expression research. In particular, we discuss the use of the PLR method in the analysis of read counting errors in a replicated RNA-Seq dataset. The estimation of experimental errors and confidence intervals for effect size measures are important problems in RNA-Seq analysis because experiments are usually performed with very few replicates [6]. The replicated data correspond to the Cartesian product of LRVVs, where yij is the read count for the ith RNA tag of the yj replicate, n is the number of RNA tags, m is the number of replicates, and m ≪ n because RNA-Seq assays are highly multiplexed and n is large. The observation vectors for the RNA tags are denoted g(i). Our error analysis algorithm is iterative and requires an initial guess for the expected error in the data; the read counting errors are assumed to be independent, or approximately so. In our implementation, the initial error estimate is parameterized by q0 and q2. Let min(g(i)) correspond to a set of RNA tags with the lowest read counts (with certain tags excluded); i.e., min(g(i)) is a set of points close to the origin of the regression graph. Then, q0 is initially estimated from the variance for min(g(i)); typically 1 ≤ q0 ≤ 50. An initial value for q2 is obtained by visual inspection of the data (Fig 5); typically 0.002 ≤ q2 ≤ 0.05 (Fig 6 legend). The optimal weights for the minimum CVE are then estimated, and the PLR method provides parameter estimates for the MER line (Eq 38). The data are then transformed so that ℓ passes through the origin with a unit slope (Fig 5B & 5D); i.e., bPLR = 1 and amin ≊ 0. This adjustment for ℓ is important because afterwards, the g(i) are conically dispersed about the 1 line.
Then, the g(i) are partitioned into orthogonal components for the deviation δ(i) and the mean (Eq 39).
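The orthogonal partition of Eq 39 can be sketched for a single observation vector; the numbers below are illustrative, and the MSD normalization is one plausible choice.

```python
import numpy as np

def partition(g):
    """Split a replicate vector g into its mean component along the
    all-ones direction and the orthogonal deviation (cf. Eq 39)."""
    gbar = g.mean()
    mean_part = gbar * np.ones(g.size)
    delta = g - mean_part
    return gbar, mean_part, delta

g = np.array([12.0, 15.0, 11.0, 14.0])   # read counts for one tag, 4 reps
gbar, mean_part, delta = partition(g)

# The two components are orthogonal, so the squared norms add up.
assert np.isclose(delta @ np.ones(4), 0.0)
assert np.isclose(g @ g, mean_part @ mean_part + delta @ delta)

# Mean squared deviation used for the dispersion model (Sec. 3.2).
msd = (delta @ delta) / g.size
```

Only the magnitude of delta matters statistically, since its orientation within the hyperplane orthogonal to the all-ones direction is random.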
[Figure omitted. See PDF.]
Fig 5. Conical quadratic error parameters for replicated RNA-Seq data.
Replicate Arabidopsis thaliana RNA-Seq data from [46] for the (μg, 4 reps) sample in (A, B) and the (1g, 3 reps) sample in (C, D) are scaled so that the parametric linear regression line corresponds to the unit slope line and passes through the origin. Then, the conical dispersion of points about the 1 line in the MER graph serves as a measure of the replication error, as shown in the projections (B, D). An iterative WLS regression of the mean squared deviation versus the mean value produces parameter estimates for the quadratic error model (Eq 41), as shown in (A, C). The corresponding 2σ threshold curves are shown in (B, D). The linear component of the error is shown as 'convergent' dashed curves. The larger overdispersion q2 in (C, D) compared to that in (A, B) indicates that there is significant variation in data quality between the two experiments. The histogram (D inset) shows that the read counts range over four decades.
https://doi.org/10.1371/journal.pone.0262148.g005
[Figure omitted. See PDF.]
Fig 6. Overdispersion of the residual effects in replicated RNA-Seq data.
Empirical survival functions for the root mean square scaled deviation from the conical dispersion analysis (see Fig 5) of the RNA-Seq data are plotted for six A. thaliana samples under fractional gravity [46]. The corresponding curves from the Monte Carlo simulation for the standard normal distribution are also shown. The slower decay in the observed curves indicates that RNA-Seq data are subject to heterogeneous dispersive effects. The inset shows the overdispersion for a 1000-point subset around the median for the μg sample. Thus, overdispersive effects are observed across the entire range of read counts. In the legend, each sample is labeled with its q2 coefficient for the quadratic error model. The variation in q2 indicates that there are significant data quality variations between samples.
https://doi.org/10.1371/journal.pone.0262148.g006
Statistically, only the magnitude of δ(i) is important because the orientation is random. Next, we introduce the mean squared deviation (MSD) and obtain a two-parameter representation for the dispersion, as shown in Fig 5A and 5C. We adopt a heuristic quadratic model for the MSD (Eq 40) [14, 44].
Then, a WLS quadratic regression of the MSD produces sample estimates for the qi (Fig 5A and 5C), which then serve as the error model for the input data (Eq 41), as shown by the 2σ threshold curves in Fig 5B and 5D. The q0 and q1 terms correspond to the background noise and Poisson-like components of the dispersion, respectively [45]. The overdispersive component q2 suggests that there are uncontrolled experimental effects that cannot be averaged out in a random manner. We note that for a well-designed data acquisition process, the q2 component should be minimized. The initial guess for the error model is updated, and the algorithm iterates several times to reach convergence for both ℓ and {qi}.
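The quadratic fit for (q0, q1, q2) can be sketched on synthetic (mean, MSD) pairs. The 1/MSD²-style weighting below is a common choice for variance regression and is our assumption, not a detail taken from the paper; the data are noise-free so the fit recovers the generating coefficients.

```python
import numpy as np

# Synthetic (mean, MSD) pairs from a known quadratic error model,
# sigma^2(v) = q0 + q1*v + q2*v^2 (cf. Eq 41); values are illustrative.
q_true = np.array([25.0, 1.0, 0.01])              # q0, q1, q2
mu = np.linspace(10.0, 2000.0, 200)
msd = q_true[0] + q_true[1] * mu + q_true[2] * mu**2

# Weighted least squares with design matrix [1, mu, mu^2]; weighting by
# 1/MSD^2 damps the influence of high-count points (assumed choice).
X = np.vstack([np.ones_like(mu), mu, mu**2]).T
sw = 1.0 / msd                                    # sqrt of the weights
q_hat, *_ = np.linalg.lstsq(X * sw[:, None], msd * sw, rcond=None)

assert np.allclose(q_hat, q_true, rtol=1e-3)
```

In the iterative algorithm described above, the fitted {qi} would replace the initial error guess and the PLR step would be repeated until both ℓ and {qi} converge.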
The data for this demonstration come from a study of the response of Arabidopsis thaliana to fractional gravity [47] and can be retrieved from the publicly accessible NASA GeneLab Omics database [46]. The dataset contains read count data for six samples with varying gravities (g): {μ, 0.09, 0.18, 0.36, 0.57, 1}. For each sample, the replicates (yj), with m = 3 or 4, correspond to a set of LRVVs with approximately identical noise properties, with the possible exception of a small subset of irreproducible outliers, which must be flagged. The read counts are distributed over more than three decades (Fig 5D inset), and the point estimates are skewed due to heavy tail effects at large values [6]. Consequently, an upper limit for the mean value of the observation vectors is imposed, and 4% of the data are removed from the MER analysis; however, all of the data are included in the graphs (Figs 5 and 6). In Fig 5, we observe that the dispersion varies between the μg (A, B) and 1g (C, D) samples. As discussed above, the data are transformed such that the PLR slope is 1 and the curvature is negligible near the origin in the log-log plots (B, D). We note that log-log scales are convenient for visualizing dispersion, but the data are not log-transformed in the actual calculations; a log or ratio transformation results in nonlinear confounding effects for weighted averages. For duplicate read counts g(i) = (x, y), we define the scaled difference (Eq 42), with the expected variance for the error in x − y given by the sum of the individual variances [21]. We then apply the condition |d| = 2 to obtain the quadratic threshold equation (Eq 43).
The |d| = 2 threshold curves in Fig 5B and 5D are calculated from the roots of this expression. The q2 component dominates at high read counts, as indicated by the convergence of the dashed curves for the linear components (q0, q1). Thus, q2 serves as an indicator of data quality and reproducibility, and we observe significant variations between the samples (Fig 6 legend). We define the root mean square (rms) scaled deviation from the scaled differences d for each observation vector. A comparison of the empirical survival distributions shows that there are heavy tails at large deviations for the six samples (Fig 6). We use Monte Carlo methods to simulate the read count data for normal distributions and confirm that the simulated distributions are identical to chi-squared distributions (Fig 6). The much slower decay in the experimental curves is an indication of uncontrolled processes in the RNA-Seq assay that result in overdispersion in the g(i). The inset graph in Fig 6 shows the overdispersion for a small {g(i)} subset around the median for the μg sample. Thus, the overdispersive effects are observed for intervals over the entire range of read counts. Statistical methods that assume normal distributions and do not account for this overdispersion will overestimate the effect size. In biological studies, the variable vectors for RNA-Seq are not linearly related. Our parametric LRVV framework therefore serves an elementary role in searching for multiway dependency in the variation of gene expression. A detailed discussion of weighted least squares optimization problems in gene expression data analysis, including normalization and the estimation of effect size, is a topic for a separate paper.
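The scaled-difference diagnostic can be sketched for duplicate counts. The simulation below assumes Gaussian counting noise under the quadratic error model with q2 = 0 (no overdispersion); in that case the empirical survival function of |d| should track the standard normal expectation, whereas heavy-tailed experimental data decay more slowly.

```python
import numpy as np
from scipy import stats

q0, q1, q2 = 25.0, 1.0, 0.0           # quadratic error model, no overdispersion
rng = np.random.default_rng(7)
mu = rng.uniform(50.0, 2000.0, 20000)  # true expression levels (illustrative)

var = q0 + q1 * mu + q2 * mu**2
x = mu + rng.normal(0.0, np.sqrt(var))
y = mu + rng.normal(0.0, np.sqrt(var))

# Scaled difference (cf. Eq 42): the error variance of x - y is the sum
# of the individual variances, here evaluated at the true mean for brevity.
d = (x - y) / np.sqrt(2.0 * var)

def survival(vals, t):
    """Empirical survival function P(|d| > t)."""
    return np.mean(np.abs(vals) > t)

# Without overdispersion, |d| is approximately half-normal, so the
# empirical survival should track 2 * (1 - Phi(t)).
for t in (1.0, 2.0):
    assert abs(survival(d, t) - 2.0 * stats.norm.sf(t)) < 0.02
```

For real data, the divergence of the empirical survival curve from this reference at large t is the signature of the overdispersive effects shown in Fig 6.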
4 Software
Numerical computations were performed using the Python (v3.8.4) language with the NumPy (v1.19.0) [48] and SciPy (v1.5.0) [49] packages. Figures were prepared using the Matplotlib (v3.4.1) package [50]. The manuscript was prepared using the MiKTeX (v2.9) implementation of TeX/LaTeX [51].
5 Discussion
In this work, we discuss the fact that statistical measures of linear dependence, including the covariance, correlation coefficient and regression slope, are subject to the chain rule. We develop a novel multidimensional linear regression algorithm in which the relation between variable vectors is parameterized by a weighted average, and the weights are determined from an error model for the input data. The implementation of PLR then involves the use of weighted least squares normal equations to estimate the parameters of the best-fit line for a set of LRVVs. Standard statistical concepts, including covariance and the Moore-Penrose pseudoinverse, provide the necessary intuition for the formulation of our PLR method. This contrasts with the ODRPACK algorithm, which requires iterative numeric optimization because the WOR least-squares equation cannot be solved analytically. The PLR and ODRPACK methods yield very similar results, but the latter is formulated only for bivariate data. We find that in the parametric representation of a linear relation, covariance and correlation are associated with vector quantities. We also identify scale invariant quantities for partitioned residual effects between the LRVVs for both the PLR and weighted orthogonal regression methods. Therefore, when the error model is undefined, MER is not a well-posed problem [52]; i.e., there is no unique solution. This supports previous suggestions that the specification of an error model is a requirement for solving the MER problem [27]. The error model is also essential for determining the distributions and corresponding confidence intervals of PLR parameter estimates and effect size. Constructing a realistic error model for a data acquisition process can be daunting, especially in omics studies, because it is necessary to account for all relevant sources of experimental error and system variability.
However, an approximate trial and error approach can be informative, and Monte Carlo methods are convenient because they allow for the detailed simulation of stochastic effects in the data acquisition process. Finally, the averaging of variable vectors that are not linearly related results in the loss of information because of the destructive interference associated with the mixing of signal and noise. Consequently, accounting for measurement error in polynomial and multiple regression requires nonlinear optimization methods, which is a topic for a separate paper. We conclude that the parametric representation for the relation between variable vectors provides a more general framework for linear regression compared to the standard Cartesian representation.
Acknowledgments
I thank former colleagues in the Genetic Discovery group at DuPont for many helpful discussions about genome-wide association and eQTL methods. I especially thank my biologist collaborator and spouse, Ada Ching.
Citation: Luck S (2022) A parametric framework for multidimensional linear measurement error regression. PLoS ONE 17(1): e0262148. https://doi.org/10.1371/journal.pone.0262148
1. Beló A, Zheng P, Luck S, Shen B, Meyer DJ, Li B, et al. Whole genome scan detects an allelic variant of fad2 associated with increased oleic acid levels in maize. Molecular Genetics and Genomics. 2008;279(1):1–10. pmid:17934760
2. Holloway B, Luck S, Beatty M, Rafalski JA, Li B. Genome-wide expression quantitative trait loci (eQTL) analysis in maize. BMC genomics. 2011;12(1):336. pmid:21718468
3. Luck S. Factoring a 2 x 2 contingency table. PLOS ONE. 2019;14(10):e0224460. pmid:31652283
4. Luck S. Nonoverlap proportion and the representation of point-biserial variation. PLOS ONE. 2020;15(12):e0244517. pmid:33370394
5. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews. 2007;82(4):591–605. pmid:17944619
6. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):1–19.
7. Moseley HNB. Error analysis and propagation in metabolomics data analysis. Computational and Structural Biotechnology Journal. 2013;4(5):e201301006. pmid:23667718
8. Draper NR, Smith H. Applied Regression Analysis. 3rd ed. Wiley Series in Probability and Statistics. Wiley; 1998.
9. Press WH, Teukolsky SA, Vetterling WT, Flannery BP, Cambridge University Press. Numerical Recipes: The Art of Scientific Computing. 3rd ed. New York: Cambridge University Press; 2007.
10. Carroll RJ, Wang Y. Nonparametric variance estimation in the analysis of microarray data: a measurement error approach. Biometrika. 2008;95(2):437–449. pmid:19381352
11. Wang B, Zhang SG, Wang XF, Tan M, Xi Y. Testing for Differentially-Expressed MicroRNAs with Errors-in-Variables Nonparametric Regression. PLoS ONE. 2012;7(5):e37537. pmid:22655055
12. Liao J, Li X, Wong TY, Wang JJ, Khor CC, Tai ES, et al. Impact of Measurement Error on Testing Genetic Association with Quantitative Traits. PLoS ONE. 2014;9(1):e87044. pmid:24475218
13. Saccenti E, Hendriks MHWB, Smilde AK. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Scientific Reports. 2020;10(1):438. pmid:31949233
14. Luck SD. Normalization and error estimation for biomedical expression patterns. In: Bittner ML, Chen Y, Dorsel AN, Dougherty ER, editors. Proceedings of SPIE—The International Society for Optical Engineering. vol. 4266; 2001. p. 153–157. Available from: http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=901072.
15. Stefanski LA. Measurement Error Models. Journal of the American Statistical Association. 2000;95(452):1353.
16. Cantrell CA. Technical Note: Review of methods for linear least-squares fitting of data and application to atmospheric chemistry problems. Atmospheric Chemistry and Physics. 2008;8(17):5477–5487.
17. Fuller WA. Measurement Error Models. Fuller WA, editor. Wiley Series in Probability and Statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2006.
18. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. CRC Press; 2006.
19. Buonaccorsi JP. Measurement error: models, methods, and applications. Boca Raton: Chapman and Hall/CRC; 2010.
20. Cheng CL, Ness JWV. Statistical Regression with Measurement Error. New York: Wiley; 1999.
21. Casella G, Berger R. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.
22. Gillard J, Iles T. Methods of Fitting Straight Lines where Both Variables are Subject to Measurement Error. Current Clinical Pharmacology. 2009;4(3):164–171. pmid:19500068
23. Markovsky I, Van Huffel S. Overview of total least-squares methods. Signal Processing. 2007;87(10):2283–2302.
24. Spiess AN. Orthogonal Nonlinear Least-Squares Regression; 2015. Available from: https://cran.r-project.org/package=onls.
25. Boggs PT, Byrd RH, Rogers JE, Schnabel RB. Orthogonal Distance Regression. In: SciPy v.1.5.2 Reference Guide. SciPy; 2020. Available from: https://docs.scipy.org/doc/scipy/reference/odr.html.
26. Boggs PT, Byrd RH, Rogers JE, Schnabel RB. User's Reference Guide for ODRPACK Version 2.01: Software for Weighted Orthogonal Distance Regression; 1992.
27. Cheng CL, Riu J. On Estimating Linear Relationships When Both Variables Are Subject to Heteroscedastic Measurement Errors. Technometrics. 2006;48(4):511–519.
28. Cumming G. Understanding The New Statistics. New York, NY: Routledge; 2012.
29. Kelley K, Preacher KJ. On effect size. Psychological Methods. 2012;17(2):137–152. pmid:22545595
30. Puntanen S, Styan GPH, Isotalo J. Matrix Tricks for Linear Statistical Models. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011.
31. Noble B, Daniel JW. Applied Linear Algebra. 2nd ed. Englewood Cliffs: Prentice-Hall; 1977.
32. Rodgers JL, Nicewander WA. Thirteen Ways to Look at the Correlation Coefficient. The American Statistician. 1988;42(1):59–66.
33. Boyd SP, Vandenberghe L. Convex optimization. New York, NY: Cambridge University Press; 2004.
34. Boas ML. Mathematical methods in the physical sciences. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1983.
35. Schneeweiss H, Shalabh H. On the estimation of the linear relation when the error variances are known. Computational Statistics & Data Analysis. 2007;52(2):1143–1148.
36. Voigtman E. Comparison of Signal-to-Noise Ratios. Analytical Chemistry. 1997;69(2):226–234.
37. Drew TB. Handbook of Vector and Polyadic Analysis. New York, NY: Reinhold; 1961.
38. Lebedev LP, Cloud MJ, Eremeyev VA. Tensor Analysis with Applications in Mechanics. World Scientific; 2010.
39. Marsaglia G. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10.
40. von Luxburg U, Franz VH. A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117.
41. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890. pmid:9595617
42. Agresti A. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21. pmid:12617505
43. Bishara AJ, Hittner JB. Reducing Bias and Error in the Correlation Coefficient Due to Nonnormality. Educational and Psychological Measurement. 2015;75(5):785–804. pmid:29795841
44. Weng L, Dai H, Zhan Y, He Y, Stepaniants SB, Bassett DE. Rosetta error model for gene expression analysis. Bioinformatics. 2006;22(9):1111–1121. pmid:16522673
45. Alkemade CTJ, Snelleman W, Boutilier GD, Pollard BD, Winefordner JD, Chester TL, et al. A review and tutorial discussion of noise and signal-to-noise ratios in analytical spectrometry—I. Fundamental principles of signal-to-noise ratios. Spectrochimica Acta Part B: Atomic Spectroscopy. 1978;33(8):383–399.
46. Herranz R, Vandenbrink JP, Medina F, Kiss JZ. GLDS-251: RNAseq analysis of the response of Arabidopsis thaliana to fractional gravity under blue-light stimulation during spaceflight; 2019. Available from: https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-251/.
47. Herranz R, Vandenbrink JP, Villacampa A, Manzano A, Poehlman WL, Feltus FA, et al. RNAseq Analysis of the Response of Arabidopsis thaliana to Fractional Gravity Under Blue-Light Stimulation During Spaceflight. Frontiers in Plant Science. 2019;10(November):1–11. pmid:31850027
48. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–362. pmid:32939066
49. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods. 2020;17(3):261–272. pmid:32015543
50. Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007;9(3):90–95.
51. Schenk C. MiKTeX; 2021. Available from: http://miktex.org/.
52. Logan JD. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1997.
© 2022 Stanley Luck. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the "License"), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.