Headnote

Background: Environmental health researchers often aim to identify sources or behaviors that give rise to potentially harmful environmental exposures.

Objective: We adapted principal component pursuit (PCP), a robust and well-established technique for dimensionality reduction in computer vision and signal processing, to identify patterns in environmental mixtures. PCP decomposes the exposure mixture into a low-rank matrix containing consistent patterns of exposure across pollutants and a sparse matrix isolating unique or extreme exposure events.

Methods: We adapted PCP to accommodate nonnegative data, missing data, and values below a given limit of detection (LOD). We simulated data to represent environmental mixtures of two sizes with increasing proportions <LOD and three noise structures. We applied principal component pursuit with limit of detection (PCP-LOD) to the simulated data to evaluate its performance in comparison with principal component analysis (PCA). We next applied PCP-LOD to an exposure mixture of 21 persistent organic pollutants (POPs) measured in 1,000 U.S. adults from the 2001-2002 National Health and Nutrition Examination Survey (NHANES). We applied singular value decomposition to the estimated low-rank matrix to characterize the patterns.

Results: PCP-LOD recovered the true number of patterns through cross-validation for all simulations; based on an a priori specified criterion, PCA recovered the true number of patterns in 32% of simulations. PCP-LOD achieved lower relative predictive error than PCA for all simulated data sets with up to 50% of the data <LOD. When 75% of values were <LOD, PCP-LOD outperformed PCA only when noise was low. In the POP mixture, PCP-LOD identified a rank-three underlying structure and separated 6% of values as extreme events. One pattern represented comprehensive exposure to all POPs. The other patterns grouped chemicals based on known structure and toxicity.

Discussion: PCP-LOD serves as a useful tool to express multidimensional exposures as consistent patterns that, if found to be related to adverse health, are amenable to targeted public health messaging. https://doi.org/10.1289/EHP10479

Introduction

To assess exposure to multiple chemicals simultaneously, researchers must consider the high dimensionality of environmental exposures and the complex correlation structure among chemicals. Environmental researchers often aim to represent patterns of exposure using dimension reduction techniques.1 Grouping chemicals may allow for interpretation of underlying sources or behaviors that give rise to these highly correlated multipollutant exposures. Identification of exposure routes is a consequential research question in environmental health because common sources may prove more modifiable or preventable than single chemical exposures and, thus, more amenable to regulatory action or targeted interventions.

When analyzing high-dimensional data, a major challenge is how to recover low-dimensional patterns from noisy, incomplete, or erroneous measurements.2 In environmental health, observations below the analytic limit of detection (LOD) provide an example of incomplete data. Depending on the laboratory, these observations may be marked as <LOD and not reported, or they may be reported as measured with less certainty than those >LOD.3,4 Identification of exposure patterns in data sets with large proportions of observations <LOD proves challenging.5

Traditional methods to handle observations <LOD include single and multiple imputation, the most common implementation being imputation with LOD/√2. This method was proposed in 1990 as providing more accurate estimation of the mean and standard deviation than imputation with LOD/2 and improved computational efficiency over a maximum likelihood method.7 However, predictive accuracy is not often the goal in environmental epidemiology, and computational speed is no longer a barrier to new methods. Furthermore, substitution of values <LOD with a fixed value (e.g., LOD/√2), especially when some information is available, will impact the distribution of the data, potentially severely impacting exposure pattern identification in the study population.8

Here, we introduce a novel technique to identify patterns in environmental mixtures, adapting a robust and well-established method for data dimensionality reduction and pattern recognition in computer vision applications, principal component pursuit (PCP). PCP decomposes the exposure data matrix into a low-rank matrix (to identify underlying patterns of exposure across the pollutants) and a sparse matrix (to identify unusual, unique, or extreme exposure events).9 PCP has several advantages over traditional methods for pattern recognition in environmental mixtures. In a recent PCP extension, square root PCP (√PCP), Zhang et al. derived a new formulation with a universal choice of regularization parameter.10 Thus, the user is not required to choose or tune hyperparameters (i.e., parameters that control the strength of the penalty term or the amount of shrinkage). We combined this with a separate extension introducing a nonconvex penalty on the low-rank matrix that performs well with data that may not have a strong underlying structure.11,12 Estimation of the sparse matrix is especially advantageous. Traditional methods are sensitive to unusual or extreme exposure events13; the patterns identified by PCP are not influenced by outlying values. Instead, exposures that are not explained by patterns in the low-rank matrix are separated in the sparse matrix and available to the researcher for further analysis.

To our knowledge, this is the first time that PCP has been considered in pattern identification in environmental health or epidemiology. Additionally, we have included three novel extensions designed uniquely for chemical mixtures: a) a distinct penalty for observations <LOD (PCP-LOD) that has improved distributional assumptions over single imputation and adapts to study-specific confidence in measurement, b) a nonnegativity constraint on the low-rank matrix to improve interpretability of results, and c) procedures to accommodate missing values. We also implemented a cross-validation approach so that the choice of estimated components is replicable and free from researchers' implicit assumptions. In this work, we conducted a simulation study based on real multipollutant exposures, simulating an increasing proportion of observations measured <LOD and varying levels and structure of added noise. We used these simulations to compare PCP-LOD performance with that of PCA with values <LOD imputed as LOD/√2. Finally, we applied PCP-LOD to an environmental health data set of persistent organic pollutants (POPs) measured in the 2001-2002 cycle of the National Health and Nutrition Examination Survey (NHANES) to identify consistent patterns of POP exposure while isolating unique or extreme events.

Address correspondence to Marianthi-Anna Kioumourtzoglou, 722 W. 168th St., New York, NY, 10032 USA. Email: [email protected]

Supplemental Material is available online (https://doi.org/10.1289/EHP10479).

The authors declare no potential conflicts of interest.

Received 11 October 2021; Revised 24 August 2022; Accepted 15 September 2022; Published 23 November 2022.

Methods

PCP

We present PCP as a robust method for dimensionality reduction and pattern identification. Given an n × p exposure data matrix X, with individual entries Xij, where n is the number of observations and p is the number of pollutants, PCP seeks to express X as a sum of two matrices: a low-rank matrix L, with individual entries Lij, where r = rank(L) ≪ min(n, p), and a sparse matrix S, with individual entries Sij, where most entries are zero. Because the rank r of L is smaller than p, its rank can correspond to underlying patterns in exposure, such as specific sources or certain behaviors. L is still defined in terms of the original variables, i.e., patterns are not directly estimated. S captures unusual or uniquely high or low events that cannot be explained by the identified patterns from L; it does not capture events relevant to individual chemicals. PCP may be paired with various matrix factorization techniques [e.g., principal component analysis (PCA), factor analysis, or nonnegative matrix factorization (NMF)] to extract chemical loadings and individual scores. In traditional PCP, the rank of L and the number or location of nonzero entries in S do not need to be a priori defined.

We incorporated two existing PCP extensions that suit features of environmental mixtures data. First, Zhang et al. recently proposed √PCP with a noise-independent universal choice of regularization parameters.10 Previous formulations of PCP required knowledge of the true noise level to determine the appropriate parameters.12,14 This approach is problematic in environmental mixtures where we cannot know or accurately estimate the underlying noise level, and it would leave the researcher with the subjective task of tuning parameters on a per-data set basis. Zhang et al. provide a more practical approach to pattern recognition in environmental mixtures.

As first proposed, PCP minimizes a weighted combination of the nuclear norm of L, $\|L\|_*$, and the entry-wise ℓ1 norm of S, $\|S\|_1$. A norm is inherently a distance measure; for example, Euclidean distance is a norm. These norms transform the size of the matrix (or the distance between matrices, as in $\|L + S - X\|_F$ below) on which they are applied to a nonnegative real number. The nuclear norm of L is the sum of its singular values and is often used to search for low-rank matrices. The entry-wise ℓ1 norm of S is the sum of the absolute value of each entry in the matrix; this encourages S to be sparse. The Frobenius norm in Equation 1 is the square root of the sum of the squares of the entries in the matrix. Notably, this formulation has the desirable quality of convexity, meaning that every local optimum is a global optimum, and there is a single best solution. This, together with the particular structure of the ℓ1 and nuclear norms, guarantees that the resulting optimization problem can be solved efficiently.

In practice, however, the nuclear norm assumes a stronger low-rank structure (i.e., slowly decaying singular values) than what is the case in many real-life environmental mixtures (e.g., POPs or air pollution). To address unsatisfactory performance with the nuclear norm, we replaced it with a rank-r projection (Supplemental Materials, "Section S1: Indicator for the rank-r matrices"). Although the nuclear norm is convex, the rank-r projection is not, meaning that the algorithm could find a local optimum that is not the global one. However, closely related nonconvex formulations are accompanied by theoretical guarantees of equivalent performance with the convex implementation. These provide strong motivation for our framework but do not prove its success; our method contains a number of additional proposals that are helpful for processing health data but fall outside the bounds of the strongest existing theoretical results.23,24 Combining a nonconvex rank projection with √PCP, we solve the following optimization problem:

$$\min_{L,\,S} \; \mathbb{1}_{\operatorname{rank}(L) \le r} + \lambda \|S\|_1 + \mu \|L + S - X\|_F \qquad (1)$$

where X denotes the original data matrix. The two parameters, λ and μ, are not tuned by the researcher; instead, they are each set using single universal values, λ = 1/√n from Candès et al. and μ = √(p/2) from Zhang et al., which have been shown theoretically to yield near-optimal estimation performance. The first term of the objective is the rank-r projection, $\mathbb{1}_{\operatorname{rank}(L) \le r}$, an indicator function that constrains L to be of rank ≤ r. The value of r must be specified; to address this, we implemented a cross-validation approach to choose r in the "Implementation and Evaluation" section. The final term is the error between the predicted and the observed values, which favors a solution that is close to the original data.
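As a rough illustration of how the terms of Equation 1 combine, the following Python/NumPy sketch scores a candidate pair (L, S) using the universal parameter values described in the text. It is not the authors' R implementation: the function name `root_pcp_objective` and the toy data are assumptions introduced here, and the sketch only evaluates the objective rather than solving the optimization.

```python
import numpy as np

def root_pcp_objective(X, L, S, r):
    """Evaluate the nonconvex root-PCP objective for a candidate (L, S).

    Hypothetical helper for illustration; it scores a solution, it does not find one.
    """
    n, p = X.shape
    lam = 1.0 / np.sqrt(n)   # universal lambda = 1/sqrt(n)
    mu = np.sqrt(p / 2.0)    # universal mu = sqrt(p/2)
    # Indicator term: infinite cost when rank(L) exceeds r, zero otherwise.
    if np.linalg.matrix_rank(L) > r:
        return np.inf
    sparse_term = lam * np.abs(S).sum()               # entry-wise l1 norm of S
    fit_term = mu * np.linalg.norm(L + S - X, "fro")  # unsquared Frobenius error
    return sparse_term + fit_term

# Toy data (assumed): rank-2 structure plus one extreme event of size 10.
rng = np.random.default_rng(0)
L_true = rng.lognormal(mean=1.0, sigma=1.0, size=(50, 2)) @ rng.random((2, 8))
S_true = np.zeros_like(L_true)
S_true[3, 5] = 10.0
X = L_true + S_true

obj_exact = root_pcp_objective(X, L_true, S_true, r=2)                # only the l1 term remains
obj_no_sparse = root_pcp_objective(X, L_true, np.zeros_like(X), r=2)  # event is paid for as error
obj_bad_rank = root_pcp_objective(X, X, np.zeros_like(X), r=2)        # rank(X) = 3 violates the constraint
```

In this toy case, placing the extreme event in S is far cheaper than leaving it in the error term, which is the mechanism by which the sparse matrix absorbs outlying values.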

Environmental health-relevant extensions. To better adapt nonconvex √PCP (nc√PCP) for use with environmental data, we extended this method in three ways. First, we modified the algorithm to allow for missing values. This method proves beneficial to environmental data sets that often include participants with missing exposure measurements. It also enables the cross-validation procedure outlined in the "Methods" section titled "Implementation and Evaluation." Missing values are handled by only applying the LOD penalty to entries that are observed or <LOD, consistent with the literature on robust matrix completion. Next, we constrained the low-rank matrix to be nonnegative. Nonnegativity in L allows for individual pattern scores and chemical loadings on patterns on the same support as the original chemical distributions. The nonnegativity constraint is enforced by using a splitting approach in which we introduce an auxiliary matrix variable, which is nonnegative, and add an additional equality constraint (Supplemental Materials, "Section S2: Optimization via Alternating Directions Method of Multipliers").24 We tailored the third extension to observations <LOD. We introduced a diverging LOD penalty in the nc√PCP solution to accommodate values <LOD when they are not available to the users (Equation 2), as is most commonly the case. This penalty treats all estimated values from zero to the LOD as equally good approximations (Equation 3, line 2), removing the error term from the objective function:

Here, d is a matrix of LOD values, and dij represents the observation-specific LOD. This is an attribute of the data specified by the researcher; it can be common across all chemicals, chemical-specific, or chemical- and individual-specific, depending on the measurements. If all observations are >LOD, this equation simplifies to Equation 1. For observation Xij <LOD and estimated values >LOD (Equation 3, line 3) or <0 (Equation 3, line 4), we included more stringent penalties than in Equation 1, which acted to push estimates to the known range. Estimated values <0 were penalized the most because measured concentrations cannot be negative. For observation Xij <LOD, estimated values >LOD were penalized less than negative values but more than estimates for observations >LOD because we had prior information for these observations that the estimated values should be <LOD. For observation Xij <LOD and an estimated value between 0 and the LOD (Equation 3, line 2), we placed no penalty on the estimate Lij + Sij. This amounted to all values between zero and the LOD being equally optimal.
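The four cases described above can be sketched as an entry-wise function. The exact functional form and weights of Equation 3 are not reproduced in this text, so the sketch below makes loud assumptions: it uses squared distance to the nearest admissible value, and the multipliers `w_above` and `w_negative` are hypothetical values chosen only to respect the stated ordering (negative estimates penalized most, estimates above the LOD next, ordinary error least). The function name `lod_penalty` is likewise illustrative.

```python
import numpy as np

def lod_penalty(X, pred, lod, below, w_above=2.0, w_negative=4.0):
    """Entry-wise penalty on pred = L + S under a limit of detection.

    below: boolean mask, True where the observation is <LOD.
    w_above, w_negative: HYPOTHETICAL multipliers (not from the paper).
    """
    lod = np.broadcast_to(np.asarray(lod, dtype=float), pred.shape)
    pen = np.zeros(pred.shape, dtype=float)
    detected = ~below
    # Case 1: observation >LOD, ordinary squared error.
    pen[detected] = (pred[detected] - X[detected]) ** 2
    # Case 2: observation <LOD and 0 <= pred <= LOD, no penalty (stays zero).
    # Case 3: observation <LOD but estimate exceeds the LOD.
    over = below & (pred > lod)
    pen[over] = w_above * (pred[over] - lod[over]) ** 2
    # Case 4: observation <LOD but estimate is negative (penalized most).
    neg = below & (pred < 0)
    pen[neg] = w_negative * pred[neg] ** 2
    return pen

# One detected value (5.0) and three values reported only as <LOD (NaN).
X = np.array([[5.0, np.nan, np.nan, np.nan]])
below = np.isnan(X)
pred = np.array([[4.0, 0.5, 2.0, -1.0]])
pen = lod_penalty(X, pred, lod=1.0, below=below)
```

Because any estimate in the interval from zero to the LOD incurs zero penalty, the data never pull a nondetect toward a single substituted value such as LOD/√2.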

The proposed LOD penalty introduced no additional difficulty for optimization. To enforce additional structure, such as nonnegativity, on the low-rank term, we employed an Alternating Directions Method of Multipliers (ADMM) splitting technique (see Supplemental Materials, Section S2, for a detailed algorithmic description).24

Simulations

We simulated 100 exposure matrices for all combinations of two mixture sizes, three noise structures, and three detection proportions (1,800 total). We generated data sets of 500 observations each, $\{x_i\}_{i=1}^{500}$, where $x_i = (X_{i,1}, \ldots, X_{i,p})$ represents an exposure profile with p mixture components. We specified r = 4 underlying patterns and investigated two mixture sizes (p = 16 and p = 48) to represent medium and large environmental mixtures. We first simulated chemical loadings (r × p) to represent realistic environmental patterns where some chemicals were distinct to a single pattern and some chemicals appeared in multiple patterns. Each pattern included p/8 chemicals that loaded distinctly and p/4 chemicals that overlapped with a second pattern. Distinct chemicals were given a loading of 1 on the single pattern on which they loaded and a loading of 0 for the remaining patterns. Thus, within each pattern, one-third of the chemicals appeared in only that pattern, and two-thirds also appeared in a second pattern. This design corresponds to multiple environmental sources giving rise to the chemicals in the mixture. Overlapping chemicals were drawn from a Dirichlet distribution so that their loadings would sum to 1 over all patterns. Of the four loadings across the four patterns for each chemical, two were drawn from Dir(α1 = 1, α2 = 1) and two were set to zero. This introduced variability into the overlapping chemical loadings (Figure 1A).

We next generated individual scores (n × r). We drew scores independently from lognormal (μ = 1, σ = 1) because chemical concentration distributions are generally concentrated at low values with long right tails. We created the simulated data from matrix products of individual scores by chemical loadings with added noise, replacing negative values with zero. We generated noise in one of three ways, a) low Gaussian noise [N(0, 1)], b) high Gaussian noise [N(0, 5)], or c) low Gaussian noise with high sparse events, which most closely aligns with the assumptions of PCP-LOD. We included low- and high-noise scenarios to describe performance under ideal (i.e., low-noise) and extreme (i.e., high-noise) circumstances. We included low noise with sparse events to evaluate the method's ability to identify rare events. Figure 1B shows an example simulated correlation matrix. Finally, to provide representative examples of method performance on data with low to high proportions of values <LOD, we designated a quantile (25th, 50th, or 75th) and set all values below the set threshold as <LOD.
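The simulation design can be sketched in a few lines of Python/NumPy. This is an illustrative reading of the description above, not the authors' R code. Several choices are assumptions made here: overlapping chemicals are shared between each pattern and the next one, the second Gaussian parameter is treated as a standard deviation, the LOD is set per chemical at the chosen quantile, and the sparse-event noise scenario is omitted. The function name `simulate_mixture` is hypothetical.

```python
import numpy as np

def simulate_mixture(n=500, p=16, r=4, noise_sd=1.0, lod_quantile=0.25, seed=0):
    """Simulate one exposure matrix with r patterns and a quantile-based LOD."""
    rng = np.random.default_rng(seed)
    loadings = np.zeros((r, p))
    n_distinct = p // 8  # distinct chemicals per pattern
    # Distinct chemicals: loading 1 on a single pattern, 0 elsewhere.
    for k in range(r):
        loadings[k, k * n_distinct:(k + 1) * n_distinct] = 1.0
    # Overlapping chemicals: two nonzero loadings from Dir(1, 1), two zeros.
    # Assumption: chemical shared between pattern k and pattern k + 1 (cyclic).
    for idx, j in enumerate(range(r * n_distinct, p)):
        k = idx % r
        loadings[[k, (k + 1) % r], j] = rng.dirichlet([1.0, 1.0])
    # Individual scores: independent lognormal(mu = 1, sigma = 1).
    scores = rng.lognormal(mean=1.0, sigma=1.0, size=(n, r))
    truth = scores @ loadings
    # Add Gaussian noise and replace negative values with zero.
    X = np.maximum(truth + rng.normal(0.0, noise_sd, size=(n, p)), 0.0)
    # Assumption: chemical-specific LOD at the requested quantile.
    lod = np.quantile(X, lod_quantile, axis=0)
    below = X < lod
    return truth, X, lod, below, loadings

truth, X, lod, below, loadings = simulate_mixture()
```

With p = 16 this yields two distinct and four overlapping chemicals per pattern, matching the p/8 and p/4 counts in the text.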

Study Population

For pattern recognition in an environmental mixture with varying detection limits across chemicals, we chose a mixture of dioxins, furans, and polychlorinated biphenyls (PCBs) measured in U.S. adults from the 2001-2002 NHANES cycle. POP concentrations from this NHANES cycle are well characterized and have been used in prior environmental mixture analyses. NHANES inclusion criteria have been reported previously. For the chosen cycle, 11,039 participants were interviewed. One-third of participants age 12 y and older were eligible for environmental chemical analysis. We removed individuals below 18 y of age or without any POP measurements, resulting in a final study sample of 1,000. Eighteen PCBs, seven dioxins, and nine furans were measured. Exposure assessment of POPs measured in blood serum in NHANES has been described previously.30,31 All POP values were lipid-adjusted by the U.S. Centers for Disease Control and Prevention (U.S. CDC).32 Of the POPs measured, 21 detected in at least 50% of all samples were included in our main analyses. Written informed consent was obtained from all participants, and NHANES data collection was approved by the National Center for Health Statistics (NCHS) Ethics Review Board.

Implementation and Evaluation

We determined the appropriate rank for PCP-LOD and the number of components to retain from PCA in the same manner for all experiments and the application. For PCA, we a priori defined our component retention criterion as the first k components that explained ≥80% of the variance in the data, as seen previously in environmental mixtures applications.27 Although it is possible to perform cross-validation on PCA,33,34 it is not a common practice in applied environmental health research. For PCP-LOD we used the default parameters for λ and μ and cross-validated to select the rank of the L matrix. We set an initial grid of rank values from 1 to 10 for all scenarios. We performed this cross-validation approach on a single representative data set for each combination of simulated mixture sizes (p = 16 and 48), proportions <LOD (25%, 50%, and 75%), and noise structures (low, high, and sparse) and for the POP mixture.
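The PCA retention rule can be made concrete with a short Python/NumPy sketch. Column-centering before the decomposition is an assumption made here (the text does not state the preprocessing), and the function name `n_components_for_variance` is hypothetical.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.80):
    """Return the smallest k whose first k components explain >= threshold of variance."""
    Xc = X - X.mean(axis=0)                     # assumption: column-centered PCA
    sv = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    explained = sv ** 2 / np.sum(sv ** 2)       # proportion of variance per component
    cumulative = np.cumsum(explained)
    # First index at which the cumulative proportion reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return k, cumulative

# Toy matrix with orthogonal, already centered columns: 90% and 10% of variance.
toy = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
k80, cum = n_components_for_variance(toy, 0.80)
k95, _ = n_components_for_variance(toy, 0.95)
```

Unlike cross-validation, this rule depends on a threshold fixed in advance, so the number of retained components can change with the noise level even when the underlying structure does not.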

To cross-validate PCP-LOD on a single data set, X, we repeated the following steps 100 times for each rank r ∈ [1, 10]: a) we randomly corrupted 20% of the mixture X as missing (i.e., set the value to NA) to serve as a held-out test set, denoted $X_\Omega$, yielding the corrupted matrix $\tilde{X}$; b) we ran PCP-LOD on $\tilde{X}$ to obtain $\hat{L}$ and $\hat{S}$; and c) we recorded the relative recovery error of $\hat{L}_\Omega + \hat{S}_\Omega$ in comparison with the observed data $X_\Omega$ in the held-out set, calculated via the Frobenius norm, $\|X_\Omega - \hat{L}_\Omega - \hat{S}_\Omega\|_F / \|X_\Omega\|_F$. Finally, for each rank, we averaged the relative recovery error across the 100 runs and chose as the optimal rank the one with the lowest mean relative recovery error on the held-out set. We subsequently ran PCP-LOD on the full data set X with the selected rank.
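Steps a) through c) can be sketched as a generic loop that accepts any solver returning a low-rank and a sparse estimate. The solver used below, `hard_impute_fit`, is a simple stand-in (iterative rank-r SVD imputation with the sparse part fixed at zero) and is explicitly not PCP-LOD; it is included only so that the sketch runs end to end. Both function names are hypothetical, and the mask holds out each entry independently with probability 0.2 rather than exactly 20% of entries.

```python
import numpy as np

def hard_impute_fit(X_masked, rank, n_iter=100):
    """Stand-in solver (NOT PCP-LOD): iterative rank-r SVD imputation, S = 0."""
    missing = np.isnan(X_masked)
    filled = np.where(missing, np.nanmean(X_masked, axis=0), X_masked)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(missing, L, X_masked)  # keep observed entries fixed
    return L, np.zeros_like(L)

def cross_validate_rank(X, ranks, fit, n_rep=100, holdout=0.2, seed=0):
    """Choose the rank with the lowest mean relative error on held-out entries."""
    rng = np.random.default_rng(seed)
    mean_err = {}
    for r in ranks:
        errs = []
        for _ in range(n_rep):
            mask = rng.random(X.shape) < holdout   # a) hold out about 20% of entries
            X_tilde = X.copy()
            X_tilde[mask] = np.nan
            L_hat, S_hat = fit(X_tilde, r)         # b) fit on the corrupted matrix
            resid = X[mask] - L_hat[mask] - S_hat[mask]
            errs.append(np.linalg.norm(resid) / np.linalg.norm(X[mask]))  # c)
        mean_err[r] = float(np.mean(errs))
    best = min(mean_err, key=mean_err.get)
    return best, mean_err

# Toy rank-2 data; few repetitions to keep the example fast.
rng = np.random.default_rng(1)
X = rng.lognormal(size=(60, 2)) @ rng.random((2, 8))
best_rank, errors = cross_validate_rank(X, range(1, 4), hard_impute_fit, n_rep=5)
```

Passing the solver as the `fit` argument keeps the held-out evaluation identical regardless of which decomposition is being tuned.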

We ran PCP-LOD and PCA on all simulated data sets. We compared PCP-LOD and PCA to assess their relative performance when faced with large proportions of nondetectable observations. For PCA, we replaced observations <LOD with LOD/√2. For PCP-LOD we estimated the rank of L, the sparsity of S, and their relative change to assess stability of the solution across increasing proportions of data <LOD. Because the sparse matrix may contain nonzero values so close to zero as to be considered zero, we set a threshold above which to regard values as legitimate extreme exposures. We regarded as sparse events those entries lying more than two standard deviations of the model residuals (S + ε), per chemical, from zero, i.e., beyond $2 \times \sqrt{\mathrm{Var}(X_p^{\mathrm{obs}} - L_p^{\mathrm{obs}})}$, where obs indicates "observed" values above the LOD in the simulated data.
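The thresholding rule for sparse events can be sketched as follows in Python/NumPy. The function name `flag_sparse_events` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def flag_sparse_events(X, L_hat, S_hat, below):
    """Flag entries of S more than 2 residual SDs (per chemical) from zero."""
    # Residuals X - L on observed (>LOD) entries only; nondetects are masked out.
    resid = np.where(below, np.nan, X - L_hat)
    threshold = 2.0 * np.nanstd(resid, axis=0, ddof=1)  # one threshold per chemical
    high = S_hat > threshold     # uniquely high events
    low = S_hat < -threshold     # uniquely low events
    return high, low, threshold

# Toy example with two chemicals and no values <LOD.
L_hat = np.full((4, 2), 5.0)
resid_toy = np.array([[1.0, 2.0], [-1.0, -2.0], [1.0, 2.0], [-1.0, -2.0]])
X = L_hat + resid_toy
S_hat = np.array([[3.0, 3.0], [0.0, 0.0], [-3.0, -5.0], [0.0, 0.0]])
below = np.zeros_like(X, dtype=bool)
high, low, threshold = flag_sparse_events(X, L_hat, S_hat, below)
```

Because the threshold is chemical-specific, the same value of 3.0 counts as an extreme event for the first toy chemical but not for the second, whose residuals are more variable.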

For both PCP-LOD and PCA, we calculated relative predictive error as the ratio of the error to the truth in terms of their Frobenius norm: $\|\mathrm{Truth} - \mathrm{Predicted}\|_F / \|\mathrm{Truth}\|_F$. For PCP-LOD we interpreted L as the predicted values, and for PCA we constructed predicted values as the product of the score matrix (i.e., the coordinates of the rotated data on the principal components) by the rotation matrix (i.e., right eigenvectors), truncated at the chosen rank. We defined the "truth" as simulated values before noise or sparse events were added. Finally, we assessed the stability of the identified patterns using the relative prediction error of the singular value decomposition (SVD).
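A minimal Python/NumPy sketch of the relative predictive error and of a rank-k PCA prediction follows. Adding the column means back after multiplying scores by the rotation matrix is an assumption made here so that predictions are on the original scale; the function names `relative_error` and `pca_reconstruct` are hypothetical.

```python
import numpy as np

def relative_error(truth, predicted):
    """Frobenius norm of the error divided by the Frobenius norm of the truth."""
    return np.linalg.norm(truth - predicted, "fro") / np.linalg.norm(truth, "fro")

def pca_reconstruct(X, k):
    """Predicted values from the first k principal components."""
    center = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - center, full_matrices=False)
    scores = U[:, :k] * s[:k]     # coordinates on the principal components
    rotation = Vt[:k].T           # right eigenvectors (p x k)
    # Assumption: add the column means back to return to the original scale.
    return scores @ rotation.T + center

# Toy rank-2 "truth": two components reproduce it, one does not.
rng = np.random.default_rng(2)
truth = rng.lognormal(mean=1.0, sigma=1.0, size=(100, 2)) @ rng.random((2, 6))
err_k2 = relative_error(truth, pca_reconstruct(truth, 2))
err_k1 = relative_error(truth, pca_reconstruct(truth, 1))
err_zero = relative_error(truth, np.zeros_like(truth))
```

A relative error of 1 corresponds to predicting zero everywhere, which gives a natural reference point when reading the error values reported in the Results.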

Application

Prior to the application to the NHANES POP mixture, we examined distributional plots and descriptive statistics for all variables. We scaled all POP concentrations by their standard deviations to make variances comparable across chemicals. The solution, thus, cannot be influenced by high-variance pollutants. For PCA, we replaced observations <LOD with LOD/√2. We used PCP-LOD to separate unique events from underlying patterns. Following PCP-LOD, we extracted individual scores and pattern loadings from L using SVD. We compared scores, loadings, and overall relative error with those obtained from PCA. We present unique events and interpret observed patterns. To better characterize sparse events, we employed hierarchical clustering on the sparse matrix to identify individuals with similar profiles of extreme events. We used Ward's minimum variance linkage method to grow the dendrogram (i.e., the tree-based representation of observations),35 and we cut it based on subject-specific knowledge to choose the appropriate number of clusters.
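The scaling and pattern-extraction steps can be sketched in Python/NumPy as below. This is illustrative only: applying the SVD to the uncentered low-rank matrix and computing explained variance from squared singular values are assumptions made here, and the names `scale_by_sd` and `extract_patterns` are hypothetical. The clustering step is not shown.

```python
import numpy as np

def scale_by_sd(X):
    """Divide each chemical (column) by its standard deviation, ignoring NaN."""
    return X / np.nanstd(X, axis=0, ddof=1)

def extract_patterns(L_hat, rank):
    """Scores, loadings, and share of variation per pattern from an SVD of L."""
    U, s, Vt = np.linalg.svd(L_hat, full_matrices=False)
    scores = U[:, :rank] * s[:rank]   # individual scores (n x rank)
    loadings = Vt[:rank]              # chemical loadings (rank x p)
    explained = s[:rank] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

# Toy rank-3 low-rank matrix standing in for the estimated L.
rng = np.random.default_rng(3)
L_toy = rng.lognormal(mean=1.0, sigma=1.0, size=(40, 3)) @ rng.random((3, 6))
L_scaled = scale_by_sd(L_toy)
scores, loadings, explained = extract_patterns(L_scaled, rank=3)
```

Because scaling only rescales columns, the rank of the matrix is unchanged, and the three extracted patterns reproduce the scaled matrix exactly in this toy case.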

As a sensitivity analysis, we repeated this application using a higher LOD cut point for POP inclusion, retaining only chemicals that were detected in at least 75% of samples. All analyses were conducted using R version 4.0.4 (https://www.r-project.org/). Code to implement PCP-LOD, along with simulations, NHANES data, and analyses conducted in this work are available at github.com/lizzyagibson/PCP-LOD (Supplemental Material, "R code").

Results

Simulations

We ran PCP-LOD and PCA on all simulated data sets. PCP-LOD had lower relative prediction error across the majority of mixture size (p= 16 and 48), proportion <LOD (25%, 50%, and 75%), and noise structure (low, high, and sparse) combinations. PCP-LOD outperformed PCA on all simulations with low noise, simulations with high noise with up to 50% <LOD, and simulations with low noise and added sparse events with up to 50% <LOD (Figure 2 and Supplemental Table S1). Figures 2 and 3 present simulations where p = 16; corresponding figures where p = 48 are included in Supplemental Figures S1 and S2.

PCP-LOD was more affected by the proportion of data <LOD, which can be seen in the larger step size between box plots in Figure 2. The decline in PCP-LOD predictive accuracy as the proportion of values <LOD increased appears because of poorer performance on values <LOD in high-noise scenarios (Figure 3 and Supplemental Table S2). Relative prediction error for values >LOD was approximately constant for PCP-LOD and PCA. Supplemental Tables S3, S4, and S5 contain the median and interquartile range (IQR) of relative error for predicted values overall and stratified by LOD.

Next, we assessed the stability of the identified patterns using the SVD of the simulated data before noise or sparse events were added and compared this with the SVD of the $\hat{L}$ matrix and of PCA results (Figure 4 and Supplemental Table S6). Figure 4 depicts the relative prediction error comparing the left eigenvectors (comparable to scaled individual scores) of the PCP-LOD and PCA solutions with those of the simulated "truth." PCP-LOD's median relative prediction error is generally lower than PCA's for the larger mixture size and higher than PCA's for the smaller mixture size. However, these patterns appear quite stable over increasing proportions of data <LOD for both methods. PCP-LOD solutions achieved lower relative prediction error on chemical loadings (i.e., right eigenvectors) across all simulations (Supplemental Figure S3 and Table S6).

Across PCP-LOD solutions, between 2% and 10% of S entries were nonsparse. We found decreasing sparsity as the proportion <LOD increased, with 3% (IQR: 2%, 4%), 6% (IQR: 4%, 7%), and 7% (IQR: 3%, 8%) unique events, on average, found in simulations with 25%, 50%, and 75% <LOD, respectively (Supplemental Table S7). For simulations that included sparse events in the noise structure, PCP-LOD correctly included 69% (IQR: 67%, 71%), 70% (IQR: 68%, 72%), and 65% (IQR: 62%, 67%) of sparse values in the S matrix, on average, for simulations with 25%, 50%, and 75% <LOD, respectively (Supplemental Table S8).

Application

Thirty-four POPs were measured in the NHANES 2001-2002 cycle. Detection frequency is presented in Figure 5 and Supplemental Table S9. Fourteen PCBs, four furans, and three dioxins were detected in >50% of samples. POP concentrations were all positively correlated (Figure 6A).

We applied PCP-LOD to identify underlying patterns of POP exposure and extreme exposure events that were not explained by these patterns without making a priori assumptions concerning the number of patterns or sparse events. PCP-LOD returned a low-rank matrix of rank three, which corresponds with three patterns of POP exposure in the $\hat{L}$ matrix. Figure 6B depicts the $\hat{L}$ correlation matrix alongside the correlation matrix of the raw data (Figure 6A). By removing sparse events and residual noise, PCP-LOD increased the correlations between POPs. To characterize underlying patterns, we extracted principal components from the low-rank matrix using SVD.

The three components distinguished by PCP-LOD included one component of overall POP exposure, a component that separated dioxins and furans from PCBs, and a third component that separated higher molecular weight PCBs from lower molecular weight PCBs (Figure 7 and Supplemental Table S10). The first component explained 79.4% of the variance in the low-rank matrix, the second explained 14.6%, and the third explained 6.0%.

PCA conducted on the POP mixture chose three components that explained ≥80% of the variance and returned loadings and scores much the same as those from $\hat{L}$ (Supplemental Figure S4 and Table S11). Using the three chosen components, the relative prediction error on values >LOD was 0.30 for PCA, similar to the relative error of 0.32 for PCP-LOD when comparing only $\hat{L}$ with the original data. However, when including $\hat{S}$ in the solution ($\hat{L} + \hat{S}$), the relative error for PCP-LOD on values >LOD was 0.07. This is more comparable to the PCA solution when including all 21 components, 0.06, which does not accomplish any dimension reduction. Because values <LOD are unknown in this application, only the relative prediction error on values >LOD could be calculated.

PCP-LOD partitioned the variation that was unexplained by the low-rank structure into a sparse matrix of large outlying values and the remaining residuals. The $\hat{S}$ matrix contained mostly zero values, with 5.7% of entries being nonsparse. Sparse observations were generally weakly correlated, with the absolute value of r < 0.15 for 70% of Spearman correlations between sparse chemical exposure events (Supplemental Figure S5). Table 1 summarizes the number of individuals with uniquely high [$> 2 \times \sqrt{\mathrm{Var}(X_p^{\mathrm{obs}} - L_p^{\mathrm{obs}})}$] or low exposure events. Figure 8 describes participant-specific sparse events. Most participants had no extreme exposures (44%) or only extremely low exposures (18%). A total of 22% had one high unique event on a single chemical, and 16% had between two and six high exposures across 21 chemicals left unexplained by the identified patterns (Table 1 and Supplemental Table S12).

We identified three clusters grouping individuals with similar profiles of extreme events. The first cluster included the 439 individuals without any unique events. The second grouped 448 individuals with few sparse events per person [median = 1; IQR: (1, 2)] relative to cluster 3. The third cluster of 113 individuals had more sparse events per person [median = 4; IQR: (2, 5)] and more extreme values, on average (Supplemental Figures S6 and S7 and Table S13).

In a sensitivity analysis including POPs detected in >75% of samples, we included 11 POPs. PCP-LOD returned a low-rank matrix of rank two, which corresponds to two patterns of POP exposure in the L matrix. The two patterns distinguished by PCP-LOD were similar to the first two components from the main analysis (Supplemental Figure S8 and Table S14). The first component included all POPs loading in the same direction, and the second component separated dioxins and furans from PCBs. In the low-rank matrix, the first component explained 79.6% of the variance, and the second explained 20.4%. The S matrix contained mostly zero values, with 11.4% of entries being nonsparse.

Discussion

We propose PCP-LOD as a new approach to identify patterns (and extreme events left unexplained by patterns) underlying environmental chemical mixtures in the presence of values <LOD. Our simulation studies highlighted three main advantages of PCP-LOD over PCA at identifying patterns in environmental mixtures: a) reduced error in estimated patterns of exposure, b) identification of extreme or unique events, and c) improved estimation of values <LOD.

Patterns identified by PCP-LOD are more robust to noise and incomplete data than more traditional pattern identification methods because patterns in L are not influenced by events in S. PCP-LOD estimated the underlying low-rank structure of L with lower relative error than PCA under all realistic simulation scenarios. PCA outperformed PCP-LOD for two error structures when 75% of the data set was simulated as <LOD. In that case, PCP-LOD used 25% to reconstruct 75% of the data, and poorer performance was expected. However, it is unlikely that an environmental health researcher will face a chemical mixture with 75% of all values <LOD. In our application to POPs detected in over 50% of measurements among NHANES participants, 76% of all observations were >LOD. In the entire POP mixture of 34 chemicals, with five chemicals never detected, 52% of all observations were >LOD. We observed the highest relative prediction error across all simulations for values <LOD in simulated data sets. This held for PCA, as well, and applies to all methods to address censored or missing data.

Although we compared PCP-LOD with PCA in this work, PCA is not the only existing tool used to characterize chemical mixtures. Other traditional methods, such as factor analysis and NMF, have also been employed to address research questions around patterns of environmental exposures.36-38 Additional techniques, such as frequent itemset mining (FIM) or perturbed factor analysis (PFA), have been borrowed from other fields, adapted, or developed for environmental health data. FIM, a data mining technique commonly employed for retail analysis, has been used to isolate chemical combinations (i.e., patterns) based on their prevalence.39,40 PFA captures similarities and differences in exposures and has been used to express shared exposure profiles within groups and to evaluate differences in exposure across groups.41 These methods lack some of PCP-LOD's distinct features (i.e., LOD-specific penalty, nonnegativity constraint, and procedure to accommodate missing values), but they have other characteristics, some of which PCP-LOD may lack, that make them well suited for environmental mixtures research. It is useful to have a collection of tools to answer research questions concerning multipollutant exposures to allow for improved public health communication.

PCP-LOD may be paired with various dimension reduction techniques. In our simulations and application to NHANES data, we paired PCP-LOD with SVD to make results comparable with those of PCA. The SVD solution does include negative values, but because of the nonnegativity constraint on the L matrix, PCP-LOD can be paired with any nonnegative dimension reduction technique (e.g., NMF) to provide results interpretable on an additive scale with a parts-based representation.42
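As a rough illustration of the pairing with SVD described above, the following Python sketch summarizes a low-rank matrix by its singular vectors. It is not the authors' code: the function name svd_patterns, the tolerance, and the toy matrix L_toy are assumptions introduced here, and the "share" is computed from squared singular values of the uncentered matrix, which may differ from how the variance percentages in the paper were obtained.

```python
import numpy as np

def svd_patterns(L, tol=1e-8):
    """Summarize a low-rank matrix by its nonzero singular triplets (illustrative helper)."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))      # numerical rank
    scores = U[:, :rank] * s[:rank]         # one row of scores per participant
    loadings = Vt[:rank].T                  # one column of loadings per pattern
    share = s[:rank] ** 2 / np.sum(s[:rank] ** 2)
    return scores, loadings, share

# Toy stand-in for an estimated L: 100 participants x 6 chemicals, rank 2, nonnegative.
rng = np.random.default_rng(0)
L_toy = rng.random((100, 2)) @ rng.random((2, 6))

scores, loadings, share = svd_patterns(L_toy)
print(loadings.shape)       # (6, 2)
print(np.round(share, 3))   # shares of the squared singular values, summing to 1
```

For a strictly positive matrix such as this toy example, the first loading vector has entries of a single sign and the second necessarily mixes signs, which is why SVD loadings can be negative even when L itself is nonnegative.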

The three components underlying the NHANES mixture distinguished by PCP-LOD represent one pattern of exposure to all POPs and two patterns grouped by known structural and toxicological properties. More than 90% of human exposure to PCBs, dioxins, and furans is through the food supply, mainly meat, dairy, and seafood.43-45 Thus, the first component of comprehensive exposure may be interpreted as a dietary source of these POPs. The second component separated dioxins and furans, which are generally more toxic, from PCBs.46 Accordingly, a potential interpretation of the second component is as a measure of toxicity. Notably, in a sensitivity analysis restricting to 11 POPs detected in at least 75% of samples, the first two identified patterns remained largely unchanged, demonstrating PCP-LOD's ability to extract the underlying structure in the presence of values <LOD. The third component separated lower molecular weight PCBs from higher molecular weight PCBs, where larger numbers indicate more chlorine atoms and larger molecules. Higher chlorinated congeners tend to bioaccumulate more than lower chlorinated congeners.47,48 The third pattern identified in the main analysis was not found in the sensitivity analysis, because four of the lower molecular weight PCBs and four of the higher molecular weight PCBs from the main analysis were not included. Depending on the research question, any or all of these components could be included in subsequent analyses with health outcomes.

In the original POP mixture, individuals with high values on any chemicals were likely to have high values on other chemicals, or equivalently, individuals with low values on any chemicals were likely to have low values on other chemicals. PCP-LOD captured this in a component representing overall mixture exposure. After removing the underlying patterns in the mixture described in the estimated L matrix, high (or low) exposure events on individual chemicals did not indicate high (or low) exposure to other chemicals; i.e., sparse events in the estimated S matrix were not highly correlated. About half of the unique low exposure events were <LOD in the original mixture; these values <LOD were not explained by overall low exposure or by identified patterns.

The ability to identify and separate extreme events is a unique feature offered by PCP-LOD and cannot be found in other methods. These unique or extreme events not captured in the estimated L matrix may themselves be risk factors (e.g., wildfires, which are unique events not explained by commonly recognized air pollution sources, for asthma emergency admissions),49 or they may modify an association with one of the components of the estimated L matrix (e.g., a Saharan dust episode might modify the association with traffic-related pollution).50

Next steps could entail including exposures from the estimated S matrix along with identified patterns from the estimated L matrix in a health model with some form of penalization (e.g., lasso or elastic net). Because exposure variables in the sparse matrix are not highly correlated, they do not pose the same problems as the original mixture.
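To make the idea of a penalized health model concrete, here is a minimal Python sketch of a lasso fit by coordinate descent on a design matrix that stacks pattern scores and sparse-event columns. Everything in it is hypothetical: the simulated data, the outcome, the penalty value alpha, and the helper names lasso_cd and soft_threshold are assumptions made for illustration, not an analysis from this study.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink a scalar toward zero by t."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # remove column j's current contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            resid -= X[:, j] * beta[j]      # add the updated contribution back
    return beta

# Simulated stand-ins: 2 pattern-score columns and 3 mostly-zero sparse-event columns.
rng = np.random.default_rng(1)
n = 200
pattern_scores = rng.normal(size=(n, 2))
sparse_events = rng.normal(size=(n, 3)) * (rng.random((n, 3)) < 0.06)
X = np.hstack([pattern_scores, sparse_events])
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize columns before penalizing

# Hypothetical outcome driven only by the first pattern.
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)
y = y - y.mean()

beta = lasso_cd(X, y, alpha=0.1)
print(np.round(beta, 3))
```

In this toy setting the penalty keeps the coefficient on the first pattern and sets the unrelated columns to zero. In practice the penalty would be chosen by cross-validation, and an established implementation would be preferable to this hand-written loop.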

Although PCP-LOD addresses several drawbacks of existing methods, it does not overcome all limitations of pattern identification in environmental mixtures. First, in multipollutant exposures the "true" originating mechanism is almost never known; thus PCP-LOD cannot provide the "correct" answer. PCP-LOD also cannot guarantee interpretable results. However, the nonnegativity constraint and the S matrix were added, in part, to enhance interpretability. PCP-LOD, like other methods employed in our field, should be used in conjunction with subject area expertise. The interpretability of results relies on this expert knowledge. This limitation applies, however, to all methods that are used to address research questions concerning patterns of environmental exposures. Second, including scores obtained from any dimension reduction technique paired with PCP-LOD in a health model ignores the uncertainty inherent in the solution selection, resulting in underestimated confidence intervals and, potentially, spurious results.51 Third, some data sets will likely be high-dimensional, with a large number of correlated chemical measurements for each participant. In this situation, PCP-LOD still performs well, provided the rank of the target matrix L0 is small enough in comparison with n (e.g., r < cp/(log n)^2, where c is a constant, as Zhou et al. presented at an information theory symposium).14 Fourth, PCP methods have not yet been extended to accommodate repeated measures such as clustered observations or longitudinal data, which are common in environmental health applications.

Additionally, our application findings should be interpreted in light of their limitations. First, as is the case when using chemical biomarkers, our study is susceptible to exposure measurement error. In a noisy setting, any method will exhibit an inaccuracy in the estimated left singular vectors, which is commensurate with the noise level. Nevertheless, even in this setting, the results produced by PCP-LOD are stable with respect to noise.14 Second, our results may not be generalizable beyond the study population. Although NHANES includes a nationally representative sample of the general noninstitutionalized U.S. population,52 we did not account for the complex sampling design and weights of the study.53 Thus, the PCP-LOD-identified patterns may represent sources or behaviors distinct to the participants.

PCP-LOD also has numerous strengths when compared with existing methods to identify exposure patterns in environmental mixtures, which require strong assumptions and have key limitations. As a consequence, their use has resulted in heterogeneous and inconsistent findings across studies.1 Moreover, results from methods that are not generalizable or interpretable hinder their use in the design and development of regulations, policies, and targeted interventions. Original PCP has few assumptions, namely that L is not sparse and that S is not low rank.9 This is an appealing feature of a tool when the underlying truth is not known. PCP-LOD directly addresses several additional limitations of existing methods: a) its solution is not necessarily orthogonal, allowing correlations between patterns; b) its solution is nonnegative, so patterns can exist in an interpretable space; c) its parameters do not require tuning by the researcher, meaning that the choice of number of patterns in L is not subjective; and d) PCP-LOD is robust to extreme values because of the novel S matrix.
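For readers unfamiliar with the low-rank plus sparse decomposition, the sketch below implements the original PCP problem (minimize the nuclear norm of L plus lambda times the entrywise L1 norm of S, subject to L + S = M) with a basic augmented Lagrangian loop in Python. It is a sketch of classic PCP only, not PCP-LOD: it has no LOD-specific penalty, no nonnegativity constraint, and no handling of missing values. The default lambda = 1/sqrt(max(n, p)) follows the theory for original PCP; the step-size heuristic for mu, the function names, and the simulated matrices are assumptions made for illustration.

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, max_iter=2000, tol=1e-7):
    """Classic PCP via an augmented Lagrangian loop (illustrative, not PCP-LOD)."""
    n, p = M.shape
    lam = 1.0 / np.sqrt(max(n, p))          # default weight on the sparse term
    mu = n * p / (4.0 * np.abs(M).sum())    # common step-size heuristic (assumption)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # dual variable for the constraint L + S = M
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y = Y + mu * resid
        if np.linalg.norm(resid) <= tol * np.linalg.norm(M):
            break
    return L, S

# Simulated example: rank-2 structure plus 5% large, isolated events.
rng = np.random.default_rng(2)
n, p, r = 100, 40, 2
L0 = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))
mask = rng.random((n, p)) < 0.05
S0 = np.where(mask, 10.0 * rng.choice([-1.0, 1.0], size=(n, p)), 0.0)
M = L0 + S0

L_hat, S_hat = pcp(M)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
print(rel_err)
```

The two assumptions named in the text appear here as design conditions on the simulation: L0 is dense and low rank, and S0 is sparse with scattered support, which is the regime in which this decomposition is expected to separate the two parts.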

To our knowledge, this work represents the first instance of decomposing the structure among chemicals in an additive manner. By separating the unique events from underlying patterns, PCP-LOD provides the opportunity to include extreme events in analyses, where they previously may have been suppressed or discarded.

The theory-backed parameter selection and cross-validation enhance the reproducibility of PCP-LOD, ensuring that two different research groups with the same data set will identify the same optimal number of patterns. PCP-LOD may be employed when environmental epidemiologists have research questions concerning sources or behaviors leading to chemical exposure or patterns underlying exposure to multiple pollutants, especially when data are noisy, incomplete, or may contain extreme exposure events.

Acknowledgments

This work was partially supported by the National Institute of Environmental Health Sciences (NIEHS) individual fellowship grant F31 ES030263, as well as PRIME R01 ES028805 and P30 ES009089.

References

1. Gibson EA, Goldsmith J, Kioumourtzoglou M-A. 2019. Complex mixtures, complex analyses: an emphasis on interpretable results. Curr Environ Health Rep 6(2):53-61, PMID: 31069725, https://doi.org/10.1007/s40572-019-00229-5.

2. Gull SF, Daniell GJ. 1978. Image reconstruction from incomplete and noisy data. Nature 272(5655):686-690, https://doi.org/10.1038/272686a0.

3. Helsel DR. 2005. More than obvious: better methods for interpreting nondetect data. Environ Sci Technol 39(20):419A-423A, PMID: 16295833, https://doi.org/10.1021/es053368a.

4. Helsel DR. 2005. Nondetects and Data Analysis. Statistics for Censored Environmental Data. Hoboken, NJ: Wiley-Interscience.

5. U.S. EPA (U.S. Environmental Protection Agency). 2000. Guidance for Data Quality Assessment. Practical Methods for Data Analysis: EPA QA/G-9, QA00 version. Washington, DC: U.S. Environmental Protection Agency, Office of

6. Barr DB, Landsittel D, Nishioka M, Thomas K, Curwin B, Raymer J, et al. 2006. A survey of laboratory and statistical issues related to farmworker exposure studies. Environ Health Perspect 114(6):961-968, PMID: 16760001, https://doi.org/10.1289/ehp.8528.

7. Hornung RW, Reed LD. 1990. Estimation of average concentration in the presence of nondetectable values. Appl Occup Environ Hyg 5(1):46-51, https://doi.org/10.1080/1047322X.1990.10389587.

8. Helsel DR. 1990. Less than obvious-statistical treatment of data below the detection limit. Environ Sci Technol 24(12):1766-1774, https://doi.org/10.1021/

9. Candès EJ, Li X, Ma Y, Wright J. 2011. Robust principal component analysis? J ACM 58(3):1-37, https://doi.org/10.1145/1970392.1970395.

10. Zhang J, Yan J, Wright J. 2021. Square root principal component pursuit: tuning-free noisy robust matrix recovery. Adv Neural Inf Process Syst

11. Netrapalli P, Niranjan UN, Sanghavi S, Anandkumar A, Jain P. 2014. Non-convex robust PCA. Adv Neural Inf Process Syst 27. https://proceedings.neurips.cc/paper/2014/file/443cb001c138b2561a0d90720d6ce111-Paper.pdf [accessed 7

12. Chen Y, Fan J, Ma C, Yan Y. 2020. Bridging convex and nonconvex optimization in robust PCA: noise, outliers, and missing data. Ann Stat 49(5):2948-2971, https://doi.org/10.1214/21-AOS2066.

13. Wold S, Esbensen K, Geladi P. 1987. Principal component analysis. Chemometr Intell Lab Syst 2(1-3):37-52, https://doi.org/10.1016/0169-7439(87)80084-9.

14. Zhou Z, Li X, Wright J, Candès E, Ma Y. 2010. Stable principal component pursuit. In: 2010 IEEE International Symposium on Information Theory Proceedings, 1518-1522, https://doi.org/10.1109/ISIT.2010.5513535.

15. Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS. 2011. Rank-sparsity incoherence for matrix decomposition. SIAM J Optim 21(2):572-596, https://doi.org/10.1137/090761793.

16. Blackledge JM. 2006. Chapter 8, Vector and Matrix Norms. In: Digital Signal Processing: Mathematical and Computational Methods, Software Development and Applications. 2nd ed. Chichester, UK: Horwood.

17. Lin Z, Liu R, Su Z. 2011. Linearized alternating direction method with adaptive penalty for low-rank representation. In: Proceedings of 25th Annual Conference on Neural Information Processing Systems 2011. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. 12-14 December 2011. Granada, Spain: Adv Neural Inf Process Syst 24, 612-620. https://proceedings.neurips.cc/paper/2011/hash/18997733ec258a9fcaf239cc55d53363-Abstract.html [accessed 7 September 2021].

18. Boyd SP, Vandenberghe L. 2004. Convex Optimization. Cambridge, UK: Cambridge University Press.

19. Wright J, Ma Y. 2021. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge, UK: Cambridge University Press.

20. Ge R, Jin C, Zheng Y. 2017. No spurious local minima in nonconvex low rank problems: a unified geometric analysis. In: ICML'17-Proceedings of the 34th International Conference on Machine Learning, vol. 70. 6-11 August 2017. Sydney, Australia: JMLR, 1233-1242.

21. Yi X, Park D, Chen Y, Caramanis C. 2016. Fast algorithms for robust PCA via gradient descent. arXiv, https://doi.org/10.48550/arXiv.1605.07784.

22. Cherapanamjeri Y, Gupta K, Jain P. 2017. Nearly optimal robust matrix completion. In: ICML'17: Proceedings of the 34th International Conference on Machine Learning, 797-805.

23. Gao W, Goldfarb D, Curtis FE. 2020. ADMM for multiaffine constrained optimization. Optim Methods Softw 35(2):257-303, https://doi.org/10.1080/10556788.2019.1683553.

24. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. 2010. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1-122, https://doi.org/10.1561/2200000016.

25. Chen Y, Jalali A, Sanghavi S, Caramanis C. 2011. Low-rank matrix recovery with errors and erasures. IEEE Trans Inf Theory 59(7):4324-4337, https://doi.org/10.1109/TIT.2013.2249572.

26. Li X. 2013. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation 37(1):73-99.

27. Gibson EA, Nunez Y, Abuawad A, Zota AR, Renzetti S, Devick KL, et al. 2019. An overview of methods to address distinct research questions on environmental mixtures: an application to persistent organic pollutants and leukocyte telomere length. Environ Health 18(1):76, PMID: 31462251, https://doi.org/10.1186/s12940-019-0515-1.

28. McGee G, Wilson A, Webster TF, Coull BA. 2021. Bayesian multiple index models for environmental mixtures. Biometrics 1-13. Preprint posted online September 25, 2021, PMID: 34562016, https://doi.org/10.1111/biom.13569.

29. Zipf G, Chiappa M, Porter KS, Ostchega Y, Lewis BG, Dostal J. 2013. Health and Nutrition Examination Survey plan and operations, 1999-2010. Vital Health Stat 156:1-37, PMID: 25078429.

30. U.S. CDC (U.S. Centers for Disease Control and Prevention). 2002. Laboratory Procedure Manual: PCBs and Persistent Pesticides in Serum. 2001-2002. Atlanta, GA: U.S. CDC. https://www.cdc.gov/nchs/data/nhanes/nhanes_01_02/l28poc_b_met_pcb_pesticides.pdf [accessed 7 September 2021].

31. U.S. CDC. 2002. Laboratory Procedure Manual: PCDDs, PCDFs, and cPCBs in Serum. 2001-2002. Atlanta, GA: U.S. CDC. https://www.cdc.gov/nchs/data/nhanes/nhanes_01_02/l28poc_b_met_dioxin_pcb.pdf [accessed 7 September 2021].

32. Akins JR, Waldrep K, Bernert JT Jr. 1989. The estimation of total serum lipids by a completely enzymatic 'summation' method. Clin Chim Acta 184(3):219-226, PMID: 2611996, https://doi.org/10.1016/0009-8981(89)90054-5.

33. Krzanowski W. 1987. Cross-validation in principal component analysis. Biometrics 43(3):575-584, https://doi.org/10.2307/2531996.

34. Diana G, Tommasi C. 2002. Cross-validation methods in principal component analysis: a comparison. Stat Methods Appl 11(1):71-82, https://doi.org/10.1007/BF02511446.

35. Ward JH Jr. 1963. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236-244, https://doi.org/10.1080/01621459.1963.10500845.

36. Zhuang LH, Chen A, Braun JM, Lanphear BP, Hu JM, Yolton K, et al. 2021. Effects of gestational exposures to chemical mixtures on birth weight using Bayesian factor analysis in the Health Outcome and Measures of Environment (HOME) Study. Environ Epidemiol 5(3):e159, PMID: 34131620, https://doi.org/10.1097/EE9.0000000000000159.

37. Traoré T, Forhan A, Sirot V, Kadawathagedara M, Heude B, Hulin M, et al. 2018. To which mixtures are French pregnant women mainly exposed? A combination of the second French total diet study with the EDEN and ELFE cohort studies. Food Chem Toxicol 111:310-328, PMID: 29138022, https://doi.org/10.1016/j.fct.2017.11.016.

38. Paatero P, Hopke PK, Hoppenstock J, Eberly SI. 2003. Advanced factor analysis of spatial distributions of PM2.5 in the eastern United States. Environ Sci Technol 37(11):2460-2476, PMID: 12831032, https://doi.org/10.1021/es0261978.

39. Kapraun DF, Wambaugh JF, Ring CL, Tornero-Velez R, Setzer RW. 2017. A method for identifying prevalent chemical combinations in the US population. Environ Health Perspect 125(8):087017, PMID: 28858827, https://doi.org/10.1289/EHP1265.

40. Stanfield Z, Addington CK, Dionisio KL, Lyons D, Tornero-Velez R, Phillips KA, et al. 2021. Mining of consumer product ingredient and purchasing data to identify potential chemical coexposures. Environ Health Perspect 129(6):67006, PMID: 34160298, https://doi.org/10.1289/EHP8610.

41. Roy A, Lavine I, Herring AH, Dunson DB. 2021. Perturbed factor analysis: accounting for group differences in exposure profiles. Ann Appl Stat 15(3):1386-1404, https://doi.org/10.1214/20-AOAS1435.

42. Lee DD, Seung HS. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788-791, PMID: 10548103, https://doi.org/10.1038/44565.

43. ATSDR (Agency for Toxic Substances and Disease Registry). 2000. Toxicological Profile for Polychlorinated Biphenyls (PCBs).

44. World Health Organization. 2000. WHO Fact Sheets: Dioxins and Their Effects on Human Health. https://www.who.int/news-room/fact-sheets/detail/dioxins-and-their-effects-on-human-health [accessed 7 September 2021].

45. Loganathan BG, Masunaga S. 2020. PCBs, dioxins, and furans: Human exposure and health effects. In: Handbook of Toxicology of Chemical Warfare Agents. 3rd ed. London, UK: Elsevier, 267-278.

46. IARC Working Group on the Evaluation of Carcinogenic Risks to Humans. Chemical Agents and Related Occupations. Lyon (FR): International Agency for Research on Cancer. 2012. (IARC Monographs on the Evaluation of Carcinogenic Risks to Humans, No. 100F.) https://www.ncbi.nlm.nih.gov/books/NBK304416/ [accessed 7 September 2021].

47. Steele G, Stehr-Green P, Welty E. 1986. Estimates of the biologic half-life of polychlorinated biphenyls in human serum. N Engl J Med 314(14):926-927, PMID: 3081811, https://doi.org/10.1056/NEJM198604033141418.

48. Hopf NB, Ruder AM, Succop P. 2009. Background levels of polychlorinated biphenyls in the US population. Sci Total Environ 407(24):6109-6119, PMID: 19773016, https://doi.org/10.1016/j.scitotenv.2009.08.035.

49. Delfino RJ, Brummel S, Wu J, Stern H, Ostro B, Lipsett M, et al. 2009. The relationship of respiratory and cardiovascular hospital admissions to the Southern California wildfires of 2003. Occup Environ Med 66(3):189-197, PMID: 19017694, https://doi.org/10.1136/oem.2008.041376.

50. Karanasiou A, Moreno N, Moreno T, Viana M, De Leeuw F, Querol X. 2012. Health effects from Sahara dust episodes in Europe: literature review and research gaps. Environ Int 47:107-114, PMID: 22796892, https://doi.org/10.1016/j.envint.2012.06.012.

51. Kioumourtzoglou M-A, Coull BA, Dominici F, Koutrakis P, Schwartz J, Suh H. 2014. The impact of source contribution uncertainty on the effects of source-specific PM2.5 on hospital admissions: a case study in Boston. J Expo Sci Environ Epidemiol 24(4):365-371, PMID: 24496220, https://doi.org/10.1038/jes.2014.7.

52. Johnson CL, Paulose-Ram R, Ogden CL, Carroll MD, Kruszan-Moran D, Dohrmann SM, et al. 2013. National Health and Nutrition Examination Survey. Analytic Guidelines 1999-2010. Vital Health Stat 2 161:1-24, PMID: 25090154.

53. Curtin LR, Mohadjer LK, Dohrmann SM, Montaquila JM, Kruszan-Moran D, Mirel LB, et al. 2012. The National Health and Nutrition Examination Survey: sample design, 1999-2006. Vital Health Stat 2 155:1-39, PMID: 22788053.

© 2022. This work is published under Reproduced from Environmental Health Perspectives (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Details

Title

Principal Component Pursuit for Pattern Identification in Environmental Mixtures

Author

Gibson, Elizabeth A¹; Zhang, Junhui²; Yan, Jingkai³; Chillrud, Lawrence¹; Benavides, Jaime¹; Nunez, Yanelli; Herbstman, Julie B; Goldsmith, Jeff; Wright, John; Kioumourtzoglou, Marianthi-Anna

¹ Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, New York, New York, USA
² Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, USA
³ Department of Electrical Engineering, Columbia University Data Science Institute, New York, New York, USA

Pages

1-10

Publication year

2022

Publication date

Nov 2022

Publisher

National Institute of Environmental Health Sciences

e-ISSN

15529924

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1289/EHP10479

ProQuest document ID

3171901623
