Introduction
The relationship between geography and genetics has had enduring importance in evolutionary biology (see Felsenstein, 1982). One fundamental consideration is that individuals who live near one another tend to be more genetically similar than those who live far apart (Wright, 1943; Wright, 1946; Malécot, 1948; Kimura, 1953; Kimura and Weiss, 1964). This phenomenon is often referred to as ‘isolation-by-distance’ (IBD) and has been shown to be a pervasive feature in spatial population genetic data across many species (Slatkin, 1985; Dobzhansky and Wright, 1943; Meirmans, 2012). Statistical methods that use both measures of genetic variation and geographic coordinates to understand patterns of IBD have been widely applied (Bradburd and Ralph, 2019; Battey et al., 2020). One major challenge in these approaches is that the relationship between geography and genetics can be complex. Particularly, geographic features can influence migration in localized regions leading to spatially heterogeneous patterns of IBD (Bradburd and Ralph, 2019).
Multiple approaches have been introduced to model spatially non-homogeneous IBD in population genetic data (McRae, 2006; Duforet‐Frebourg and Blum, 2014; Hanks and Hooten, 2013; Petkova et al., 2016; Bradburd et al., 2018; Al-Asadi et al., 2019; Safner et al., 2011; Ringbauer et al., 2018). Particularly relevant to our proposed approach is the work of Petkova et al., 2016 and Hanks and Hooten, 2013. Both approaches model genetic distance using the ‘resistance distance’ on a weighted graph. This distance metric is inspired by concepts of effective resistance in circuit theory models, or alternatively understood as the commute time of a random walk on a weighted graph or as a Gaussian graphical model (specifically a conditional auto-regressive process) (Chandra et al., 1996; Hanks and Hooten, 2013; Rue and Held, 2005). Additionally, the resistance distance approach is a computationally convenient and accurate approximation to spatial coalescent models (McRae, 2006), although it has limitations in asymmetric migration settings (Lundgren and Ralph, 2019).
Hanks and Hooten, 2013 introduced a Bayesian model that uses measured ecological covariates, such as elevation, to predict genetic distances across sub-populations. Specifically, they use a graph-based model for genotypes observed at different spatial locations. Expected genetic distances across sub-populations in their model are given by resistance distances computed from the edge weights. They parameterize the edge weights of the graph to be a function of known biogeographic covariates, linking local geographic features to genetic variation across the landscape.
Concurrently, the Estimating Effective Migration Surfaces (EEMS) method was developed to help interpret and visualize non-homogeneous gene-flow on a geographic map (Petkova, 2013; Petkova et al., 2016). EEMS uses resistance distances to approximate the between-sub-population component of pairwise coalescent times in a ‘stepping-stone’ model of migration and genetic drift (Kimura, 1953; Kimura and Weiss, 1964). EEMS models the within-sub-population component of pairwise coalescent times, with a node-specific parameter. Instead of using known biogeographic covariates to connect geographic features to genetic variation as in Hanks and Hooten, 2013, EEMS infers a set of edge weights (and diversity parameters) that explain the genetic distance data. The inference is based on a hierarchical Bayesian model and a Voronoi-tessellation-based prior to encourage piece-wise constant spatial smoothness in the fitted edge weights.
EEMS uses Markov Chain Monte Carlo (MCMC) and outputs a visualization of the posterior mean for effective migration and a measure of genetic diversity for every spatial position of the focal habitat. Regions with relatively low effective migration can be interpreted to have reduced gene-flow over time, whereas regions with relatively high migration can be interpreted as having elevated gene-flow. EEMS has been applied to multiple systems to describe spatial genetic structure, but despite EEMS’s advances in formulating a tractable solution to investigate spatial heterogeneity in IBD, the MCMC algorithm it uses can be slow to converge, in some cases leading to days of computation time for large datasets (Peter et al., 2020).
The inference problems faced by EEMS and Hanks and Hooten are related to a growing area referred to as ‘graph learning’ (Dong et al., 2019; Mateos et al., 2019). In graph learning, a noisy signal is measured as a scalar value at a set of nodes from the graph, and the aim is then to infer non-negative edge weights that reflect how spatially ‘smooth’ the signal is with respect to the graph topology (Kalofolias, 2016). In population genetic settings, this scalar could be an allele frequency measured at locations in a discrete spatial habitat with effective migration rates between sub-populations. Like the approach taken by Hanks and Hooten, 2013, one widely used representation of smooth graph signals is to associate the smoothness property with a Gaussian graphical model where the precision matrix has the form of a graph Laplacian (Dong et al., 2016; Egilmez et al., 2016). The probabilistic model defined on the graph signal then naturally gives rise to a likelihood for the observed samples, and thus much of the literature in this area focuses on developing specialized algorithms to efficiently solve optimization problems that allow reconstruction of the underlying latent graph. For more information about graph learning and signal processing in general see the survey papers of Dong et al., 2019 and Mateos et al., 2019.
To position the present work in comparison to the ‘graph learning’ literature, our contributions are twofold. First, in population genetics, it is impossible to collect individual genotypes across all the geographic locations and, as a result, we often work with many, often the majority, of nodes having missing data. As far as we are aware, none of the work in graph signal processing considers this scenario and thus their algorithms are not directly applicable to our setting. In addition, if the number of the observed nodes is much smaller than the number of nodes of a graph, one can project the large matrices associated with the graph to the space of observed nodes, therefore allowing for fast and efficient computation. Second, highly missing nodes in the observed signals can result in significant degradation of the quality of the reconstructed graph unless it is regularized properly. Motivated by the Voronoi-tessellation-based prior adopted in EEMS (Petkova et al., 2016), we propose regularization that encourages spatial smoothness in the edge weights.
Building on advances in graph learning, we introduce a method, Fast Estimation of Effective Migration Surfaces (FEEMS), that uses optimization to obtain penalized-likelihood-based estimates of effective migration parameters. In contrast to EEMS which uses a node-specific parameterization of effective migration, we optimize over edge-specific parameters allowing for more flexible migration processes to be fit, such as spatial anisotropy, in which the migration process is not invariant to rotation of the coordinate system (e.g. migration is more extensive along a particular axis). Although we developed this model as a Gaussian Markov Random Field, the resulting likelihood has key similarities to the EEMS model, in that it is a Wishart-distribution that is a function of a genetic distance matrix. Expected genetic distances in both models can be interpreted as ‘resistance distances’ (McRae, 2006).
To fit the model, rather than using MCMC, we develop a fast quasi-Newton optimization algorithm (Nocedal and Wright, 2006) and a cross-validation approach for choosing the penalty parameter used in the penalized likelihood. We demonstrate the method using coalescent simulations and an application to a dataset of gray wolves from North America. The output is comparable to the results of EEMS but is provided in orders of magnitude less time. With this improvement in speed, FEEMS opens up the ability to perform fast exploratory data analysis of spatial population structure.
Results
Overview of FEEMS
Figure 1 shows a visual schematic of the FEEMS method. The input data are genotypes and spatial locations (e.g. latitudes and longitudes) for a set of individuals sampled across a geographic region. We construct a dense spatial grid embedded in geographic space where nodes represent sub-populations, and we assign individuals to nodes based on spatial proximity (see Appendix 1—figure 1 for a visualization of the grid construction and node assignment procedure). The density of the grid is user defined and must be explored to appropriately balance model mis-specification and computational burden. As the density of the lattice increases, the model is similar to discrete approximations used for continuous spatial processes, but the increased density comes at the cost of computational complexity.
Figure 1.
Schematic of the FEEMS model: The full panel shows a schematic of going from the raw data (spatial coordinates and genotypes) through optimization of the edge weights, representing effective migration, to convergence of FEEMS to a local optima.
(A) Map of sample coordinates (black points) from a dataset of gray wolves from North America (Schweizer et al., 2016). The input to FEEMS are latitude and longitude coordinates as well as genotype data for each sample. (B) The spatial graph edge weights after random initialization uniformly over the graph to begin the optimization algorithm. (C) The edge weights after 10 iterations of running FEEMS, when the algorithm has not converged yet. (D) The final output of FEEMS after the algorithm has fully converged. The output is annotated with important features of the visualization.
Details on the FEEMS model are described in the Materials and methods section, however at a high level, we assume exchangeability of individuals within each sub-population and estimate allele frequencies, , for each sub-population, indexed by , and single nucleotide polymorphism (SNP), indexed by , under a simple Binomial sampling model. We also use the recorded sample sizes at each node to model the precision of the estimated allele frequency. With the estimated allele frequencies in hand, we model the data at each SNP using an approximate Gaussian model whose covariance is, up to constant factors, shared across all SNPs—in other words, after rescaling by SNP-specific variation factors, we assume that the set of observed frequencies at each SNP is an independent realization of the same spatial process. The latent frequency variables, , are modeled as a Gaussian Markov Random Field (GMRF) with a sparse precision matrix determined by the graph Laplacian and a set of residual variances that vary across SNPs. The pseudo-inverse of the graph Laplacian in a GMRF is inherently connected to the notion of resistance distance in an electrical circuit (Hanks and Hooten, 2013) that is often used in population genetics to model the genetic differentiation between sub-populations (McRae, 2006). The graph’s weighted edges, denoted by between nodes and , represent gene-flow between the sub-populations (Friedman et al., 2008; Hanks and Hooten, 2013; Petkova et al., 2016). The Gaussian approximation has the advantage that we can analytically marginalize out the latent frequency variables. The resulting likelihood of the observed frequencies shares a number of similarities to that of EEMS (see Materials and methods).
To prevent over-fitting we use penalized maximum likelihood to estimate the edge weights of the graph. Our overall goal is thus to solve the following optimization problem:where [Character omitted. See PDF] is a vector that stores all the unique elements of the weighted adjacency matrix, [Character omitted. See PDF] and [Character omitted. See PDF] are element-wise non-negative lower and upper bounds for [Character omitted. See PDF], [Formula/expression omitted. See PDF] is the negative log-likelihood function that comes from the GMRF model described above, and [Formula/expression omitted. See PDF] is a penalty that controls how constant or smooth the output migration surface will be and is controlled by the hyperparameter . Writing [Character omitted. See PDF] to denote the set of nodes in the graph and
[Formula omitted. See PDF]
This function serves to penalize large differences between the weights
The scale parameter
We use sparse linear algebra routines to efficiently compute the objective function and gradient of our parameters, allowing for the use of widely applied quasi-Newton optimization algorithms (Nocedal and Wright, 2006) implemented in standard numerical computing libraries like
Evaluating FEEMS on ‘out of model’ coalescent simulations
While our statistical model is not directly based on a population genetic process, it is useful to see how it performs on simulated data under the coalescent stepping stone model (Figure 2, also see Appendix 1—figure 2 for additional scenarios). In these simulations we know, by construction, the model we fit (FEEMS) is different from the true model we simulate data under (the coalescent), allowing us to assess the robustness of the fit to a controlled form of model mis-specification.
Figure 2.
FEEMS fit to coalescent simulations: We run FEEMS on coalescent simulations, varying the migration history (columns) and sampling design (rows).
In each simulation, we used leave-one-out cross-validation (at the level of sampled nodes) to select the smoothness parameter λ. The first column (A–C) shows the ground-truth and fit of FEEMS to coalescent simulations with a homogeneous migration history that is, a single migration parameter for all edge weights. Note that the ground-truth simulation figures (A,D,G) display coalescent migration rates, not fitted effective migration rates output by FEEMS. The second column (D–F) shows the ground truth and fit of FEEMS to simulations with a heterogeneous migration history that is, reduced gene-flow, with 10-fold lower migration, in the center of the habitat. The third column (G–I) shows the ground truth and fit of FEEMS to an anisotropic migration history with edge weights facing east-west having five fold higher migration than north-south. The second row (B,E,H) shows a sampling design with no missing observations on the graph. The final row (C,F,I) shows a sampling design with 80% of nodes missing at random.
The first migration scenario (Figure 2A–C) is a spatially homogeneous model where all the migration rates are set to be a constant value on the graph, this is equivalent to simulating data under an homogeneous isolation-by-distance model. In the second migration scenario (Figure 2D–E), we simulate a non-homogeneous process by representing a geographic barrier to migration, lowering the migration rates by a factor of 10 in the center of the habitat relative to the left and right regions of the graph. Finally, in the third migration scenario (Figure 2G–I), we simulate a pattern which corresponds to anisotropic migration with edges that point east/west being assigned to a fivefold higher migration rate than edges pointing north/south. For each migration scenario, we simulate two sampling designs. In the first ‘dense-sampling’ design (Figure 2B,E,I) we sample individuals for every node of the graph. Next, in the ‘sparse-sampling’ design (Figure 2C,F,J) we sample individuals for only a randomly selected 20% of the nodes.
For each coalescent simulation, we used leave-one-out cross-validation (at the level of sampled nodes) to select the smoothness parameter λ. In the homogeneous migration simulations, the best value for the smoothness parameter, as determined by the grid value with the lowest leave-one-out cross-validation error, is
With regard to the visualizations of effective migration, FEEMS performs best when all the nodes are sampled on the graph, that is, when there is no missing data (Figure 2B,E,H). Interestingly, in the simulated scenarios with many missing nodes, FEEMS can still partly recover the migration history, including the presence of anisotropic migration (Figure 2I). A sampling scheme with a central gap leads to a slightly narrower barrier in the heterogeneous migration scenario (Appendix 1—figure 2I) and for the anisotropic scenario, a degree of over-smoothness in the northern and southern regions of the center of the graph (Appendix 1—figure 2N). For the missing at random sampling design, FEEMS is able to recover the relative edge weights surprisingly well for all scenarios, with the inference being the most challenging when there is anisotropic migration. The potential for FEEMS to recover anisotropic migration is novel relative to EEMS, which was parameterized for fitting non-stationary isotropic migration histories and produces banding patterns perpendicular to the axis of migration when applied to data from anisotropic coalescent simulations (Petkova et al., 2016, Supplementary Figure 2; see also Appendix 1 ‘
Application of FEEMS to genotype data from North American gray wolves
To assess the performance of FEEMS on real data, we used a previously published dataset of 111 gray wolves sampled across North America typed at 17,729 SNPs (Schweizer et al., 2016; Appendix 1—figure 5). This dataset has a number of advantageous features that make it a useful test case for evaluating FEEMS: (1) The broad sampling range across North America includes a number of relevant geographic features that, a priori, could conceivably lead to restricted gene-flow averaged throughout the population history. These geographic features include mountain ranges, lakes, and islands. (2) The scale of the data is consistent with many studies for non-model systems whose spatial population structure is of interest. For instance, the relatively sparse sampling leads to a challenging statistical problem where there is the potential for many unobserved nodes (sub-populations), depending the density of the grid chosen.
Before applying FEEMS, we confirmed a signature of spatial structure in the data through regressing genetic distances on geographic distances and top genetic PCs against geographic coordinates (Appendix 1—figures 6, 7, 8, 9). We also ran multiple replicates of ADMIXTURE for
We first show FEEMS results for four different values of the smoothness parameter, λ from large
Figure 3.
The fit of FEEMS to the North American gray wolf dataset for different choices of the smoothing regularization parameter λ: (A)
As expected, when λ decreases from large to small (A–D), the fitted graph becomes less smooth and eventually over-fits to the data, revealing a patchy surface in (D), whereas earlier in the regularization path FEEMS fits a homogeneous surface with all edge weights having nearly the same fitted value, as in (A). (E) shows the mean square error between predicted and held-out allele frequencies output by running leave-one-out cross-validation to select the smoothness parameter λ. The cross-validation error is minimized over a pre-selected grid at an intermediate value of
The cross-validation approach finds the optimal value of λ to be 2.06. This solution visually appears to have a moderate level of regularization and aligns with several known landscape features (Figure 4). Spatial features in the FEEMS visualization qualitatively matches the structure plot output from ADMIXTURE using
Figure 4.
FEEMS applied to a population genetic dataset of North American gray wolves: We show the fit of FEEMS applied to a previously published dataset of North American gray wolves.
Leave-one-out cross-validation (at the level of sampled nodes) was used to select the smoothness parameter
Comparison to EEMS
We also ran EEMS on the same gray wolf dataset. We used default parameters provided by EEMS but set the number of burn-in iterations to
We find that FEEMS is multiple orders of magnitude faster than EEMS, even when running multiple runs of FEEMS for different regularization settings on both the sparse and dense graphs (Table 1). We note that constructing the graph and fitting the model with very low regularization parameters are the most computationally demanding steps in running FEEMS.
Table 1.
Runtimes for FEEMS and EEMS on the North American gray wolf dataset.
We show a table of runtimes for FEEMS and EEMS for two different grid densities, a sparse grid with 307 nodes and a dense grid with 1207 nodes. The second row shows the FEEMS run-times for applying leave-one-out cross-validation to select λ. The third row shows the run-times when applying FEEMS at the best λ value selected using cross-validation. FEEMS is orders of magnitude faster than EEMS, even when using cross-validation to select λ. Runtimes are based on computation using Intel Xeon E5-2680v4 2.4 GHz CPUs with 5 Gb RAM reserved using the University of Chicago Midway2 cluster.
Method | Sparse grid (run-time) | Dense grid (run-time) |
---|---|---|
EEMS | 27.43 hr | N/A |
FEEMS (Cross-validation) | 10 min 32 s | 1.03 hr |
FEEMS (Best λ) | 1.23 s | 4.08 s |
We find that many of the same geographic features that have reduced or enhanced gene-flow are concordant between the two methods. The EEMS visualization, qualitatively, best matches solutions of FEEMS with lower λ values (Figure 4, Appendix 1—figure 11); however, based on the ADMIXTURE results, visual inspection in relation to known geographical features and inspection of the observed vs fitted dissimilarity values (Appendix 1—figures 14, 22), we find these solutions to be less satisfying compared to the FEEMS solution found with λ chosen by leave-one-out cross-validation. We note that in many of the EEMS runs the MCMC appears to not have converged (based on visual inspection of trace plots) even after a large number of iterations.
Discussion
FEEMS is a fast approach that provides an interpretable view of spatial population structure in real datasets and simulations. We want to emphasize that beyond being a fast optimization approach for inferring population structure, our parameterization of the likelihood opens up a number of exciting new directions for improving spatial population genetic inference. Notably, one major difference between EEMS and FEEMS is that in FEEMS each edge weight is assigned its own parameter to be estimated, whereas in EEMS each node is assigned a parameter and each edge is constrained to be the average effective migration between the nodes it connects (see Materials and methods and Appendix 1 ‘
One general challenge, which is not unique to this method, is selecting the tuning parameters controlling the strength of regularization (λ in our case). A natural approach is to use cross-validation, which estimates the out-of-sample fit of FEEMS for a particular choice of λ. We used leave-one-out cross-validation, leaving one sampled population out at a time, and find such an approach works well based on the coalescent simulations and application to the North American wolf data. That said, we sometimes found high variability in the selected solution when we used cross-validation with fewer folds (e.g. five-fold versus leave-one-out, results not shown). We expect this happens when the number of sampled populations is small relative to the complexity of the gene flow landscape, and we recommend using leave-one-out cross-validation in general. We also find it useful to fit FEEMS to a sequential grid of regularization parameters and to look at what features are consistent or vary across multiple fits. Informally, one can gain an indication of the strongest features in the data by looking at the order they appear in the regularization path that is, what features overcome the strong penalization of smoothness in the data and that are highly supported by the likelihood. For example, early in the regularization path, we see regions of reduced gene-flow occurring in the west coast of Canada that presumably correspond to Coastal mountain ranges and islands in British Columbia (Figure 3B) and this reduced gene-flow appears throughout more flexible fits with lower λ.
An important caveat is that the objective function we optimize is non-convex so any visualization output by FEEMS should be considered a local optimum and, keeping in mind that with different initialization one could get different results. That said, for the datasets investigated, we found the output visualizations were not sensitive to initialization, and thus our default setting is constant initialization fitted under an homogeneous isolation by distance model (See Materials and methods).
When comparing to EEMS, we found FEEMS to be much faster (Table 1). While this is encouraging, care must be taken because the goals and outputs of FEEMS and EEMS have a number of differences. FEEMS fits a sequential grid of solutions for different regularization parameters, whereas EEMS infers a posterior distribution and outputs the posterior mean as a point estimate. FEEMS is not a Bayesian method and unlike EEMS, which explores the entire landscape of the posterior distribution, FEEMS returns a particular point estimate: a local minimum point of the optimization landscape. Setting the prior hyper-parameters in EEMS act somewhat like a choice of the tuning parameter λ, except that EEMS uses hierarchical priors that in principle allow for exploration of multiple scales of spatial structure in a single run, but requires potentially long computation times for adequate MCMC convergence.
Like EEMS, FEEMS is based on an assumed underlying spatial graph of populations exchanging gene flow with neighboring populations. While the inferred migration rates explain the data under an assumed model, it is important for users and readers of FEEMS results to keep in mind the range and density of the chosen grid when interpreting results. We note that using a denser grid has the two potential advantages of providing improved approximation for continuously distributed species, as well as a more flexible model space to fit the data.
Depending on the scale of the analysis and the life history of the species, the process of assuming and assigning a single geographic location for each individual is a potential limitation of the modeling framework used here. For instance, the North American wolves studied here are understood to be generally territorial with individual ranges that are on the scale of 103 km2 (Burch et al., 2005), which is small relative to the greater than 106 km2 scale of our analysis. Thus, modeling individual wolves with single locations may not generally be problematic. However, at the boundary of the Boreal forest and Tundra, there are wolves with larger annual ranges and seasonal migrations that track caribou herds roughly north-south over distances of 1000 km (Musiani et al., 2007), and the wolves in the study were sampled in the winter (Musiani et al., 2007; Schweizer et al., 2016). If the samples were instead obtained in the summer, the position of the inferred low migration feature near the boundary of the Boreal Forest (marked 'C' in Figure 4) would presumably shift northward. The general cautionary lesson is that one must be careful when interpreting these maps to consider the life history of dispersal for the organism under study during the interpretation of results. Extending the methodology to incorporate knowledge of uncertainty in position or known dispersal may be an interesting direction for future work.
One natural extension to FEEMS, pertinent to a number of biological systems, is incorporating long-range migration (Pickrell and Pritchard, 2012; Bradburd et al., 2016). In this work, we have used a triangular lattice embedded in geographic space and enforced smoothness in nearby edge weights through penalizing their squared differences (see Materials and methods). We could imagine changing the structure of the graph by adding edges to allow for long-range connections; however, our current regularization scheme would not be appropriate for this setting. Instead, we could imagine adding an additional penalty to the objective, which would only allow a few long range connections to be tolerated. This could be considered to be a combination of two existing approaches for graph-based inference, graphical lasso (GLASSO) and graph Laplacian smoothing, combining the smoothness assumption for nearby connections and the sparsity assumption for long-range connections (Friedman et al., 2008; Wang et al., 2016). Another potential methodological avenue to incorporate long-range migration is to use a ‘greedy’ approach. We could imagine adding long-range edges one a time, guided by re-fitting the spatial model and taking a data-driven approach to select particular long-range edges to include. The proposed greedy approach could be considered to be a spatial graph analog of TreeMix (Pickrell and Pritchard, 2012).
Another interesting extension would be to incorporate asymmetric migration into the framework of resistance distance and Gaussian Markov Random Field based models. FEEMS, like EEMS, used a likelihood that is based on resistance distances, which are limited in their ability to model asymmetric migration (Lundgren and Ralph, 2019). Recently, Hanks, 2015 developed a promising new framework for deriving the stationary distribution of a continuous time stochastic process with asymmetric migration on a spatial graph. Interestingly, the expected distance of this process has similarities to the resistance distance-based models, in that it depends on the pseudo-inverse of a function of the graph Laplacian. Hanks, 2015 used MCMC to estimate the effect of known covariates on the edge weights of the spatial graph. Future work could adapt this framework into the penalized optimization approach we have considered here, where adjacent edge weights are encouraged to be smooth.
Finally, when interpreted as mechanistic rather than statistical models, both EEMS and FEEMS implicitly assume time-stationarity, so the estimated migration parameters should be considered to be ‘effective’ in the sense of being averaged over time in a reality where migration rates are dynamic and changing (Pickrell and Reich, 2014). The MAPS method is one recent advance that utilizes long stretches of shared haplotypes between pairs of individuals to perform Bayesian inference of time varying migration rates and population sizes (Al-Asadi et al., 2019). With the growing ability to extract high quality DNA from ancient samples, another exciting future direction would be to apply FEEMS to ancient DNA datasets over different time transects in the same focal geographic region to elucidate changing migration histories (Mathieson et al., 2018). There are a number of technical challenges in ancient DNA data that make this a difficult problem, particularly high levels of missing and low-coverage data. Our modeling approach could be potentially more robust, in that it takes allele frequencies as input, which may be estimable from dozens of ancient samples at the same spatial location, in spite of high degrees of missingness (Korneliussen et al., 2014).
In closing, we look back to a review titled ‘How Can We Infer Geography and History from Gene Frequencies?’ published in 1982 (Felsenstein, 1982). In this review, Felsenstein laid out fundamental open problems in statistical inference in population genetic data, a few of which we restate as they are particularly motivating for our work:
How can we make valid statistical estimates of parameters of stepping stone models?
The methods developed here aim to help address these longstanding problems in statistical population genetics and to provide a foundation for future work to elucidate the role of geography and dispersal in ecological and evolutionary processes.
Materials and methods
Model description
See Appendix 1 ‘
(1)
[Formula omitted. See PDF]
Using vector notation, we represent the joint model of estimated allele frequencies as:
[Formula omitted. See PDF]
where [Formula/expression omitted. See PDF] is aTo summarize, we estimate allele frequencies from a subset of nodes on the graph and define latent allele frequencies for all the nodes of the graph. The assignment matrix [Character omitted. See PDF] maps these latent allele frequencies to our observations. Our summary statistics (the data) are thus [Formula/expression omitted. See PDF] where [Formula/expression omitted. See PDF] is a
[Formula omitted. See PDF]
where [Character omitted. See PDF] is the graph Laplacian, † represents the pseudo-inverse operator, andUsing the law of total variance formula, we can derive from (Equations 2, 3) an analytic form for the marginal likelihood. Before proceeding, however, we further approximate the model by assuming
Under (Equation 2, 3), the law of total variance formula leads to specific formulas for the mean and variance structure as given in (Equation 4). With those results, we arrive at the following approximate marginal likelihood:
[Formula omitted. See PDF]
where [Formula/expression omitted. See PDF] is aTo remove the SNP means we transform the estimated frequencies by a contrast matrix, [Formula/expression omitted. See PDF], that is orthogonal to the one-vector:
[Formula omitted. See PDF]
Let [Formula/expression omitted. See PDF] be the
[Formula omitted. See PDF]
where [Formula/expression omitted. See PDF] denotes a Wishart distribution with[Formula omitted. See PDF]
and so[Formula omitted. See PDF]
using the fact that the contrast matrix [Character omitted. See PDF] is orthogonal to the one-vector. Thus, we can use the same spatial covariance model implied by the allele frequencies once we project the distances on to the space of contrasts:[Formula omitted. See PDF]
Overall, the negative log-likelihood function implied by our spatial model is the following (ignoring constant terms):
(7)
We note our model (Equation 6) assumes that the
One key difference between EEMS (Petkova et al., 2016) and FEEMS is how the edge weights are parameterized. In EEMS, each node is given an effective migration parameter
Penalty description
As mentioned previously, we would like to encourage that nearby edge weights on the graph have similar values to each other. This can be performed by penalizing differences between all edges connected to the same node, that is, spatially adjacent edges:
[Formula omitted. See PDF]
where, as before,(8)
One might wonder whether it is possible to use the
Optimization
Putting Equation (7) and Equation (8) together, we infer the migration edge weights [Formula/expression omitted. See PDF] by minimizing the following penalized negative log-likelihood function:
(9)
One advantage of the formulation of Equation 9 is the use of the vector form parameterization [Formula/expression omitted. See PDF] of the symmetric weighted adjacency matrix [Formula/expression omitted. See PDF]. In our triangular graph [Formula/expression omitted. See PDF], the number of non-zero lower-triangular entries is [Formula/expression omitted. See PDF], so working directly on the space of vector parameterization saves computational cost. In addition, this avoids the symmetry constraint imposed on the adjacency matrix [Character omitted. See PDF], hence making optimization easier (Kalofolias, 2016).
We solve the optimization problem using a constrained quasi-Newton optimization algorithm, specifically L-BFGS implemented in
Estimating the residual variance and edge weights under the null model
For estimating the residual variance parameter
One strategy we found effective is to fit the model of homogeneous isolation by distance and then fix the estimated residual variance
Leave-one-out cross-validation to select tuning parameters
FEEMS estimates one set of graph edge weights for each setting of the tuning parameters λ and α which control the smoothness of the fitted edge weights. Figure 3 shows that the estimated migration surfaces vary substantially depending on the particular choices of the tuning parameters, and indeed, due to the large fraction of unobserved nodes, it can highly over-fit the observed data unless regularized accordingly. To address the issue of selecting the tuning parameters, we propose using leave-one-out cross-validation to assess each fitted model’s generalization ability at held out locations.
To simplify the notation, we write the model Equation 4 for the estimated allele frequencies in SNP
[Formula omitted. See PDF]
where[Formula omitted. See PDF]
For each fold, we hold out one node from the set of observed nodes in the graph and use the rest of the nodes to fit FEEMS across a sequential grid of regularization parameters. Note that our objective function is non-convex, so the algorithm converges to different local minima for different regularization parameters, even with the same initial value
[Formula omitted. See PDF]
the conditional mean of the observed frequency[Formula omitted. See PDF]
Using this formula, we can predict allele frequencies at held out locations using the fitted graph [Formula/expression omitted. See PDF] for each setting of tuning parameters λ and α. Note that in Equation (10), the parameters
Comparison between FEEMS and EEMS models
At a high level, we can summarize the differences between FEEMS and EEMS as follows: (1) the likelihood functions of FEEMS and EEMS are slightly different as a function of the graph Laplacian [Character omitted. See PDF]; (2) the migration rates are parameterized in terms of edge weights or in terms of node weights; and (3) EEMS is based on Bayesian inference and thus chooses a prior and studies the posterior distribution, while FEEMS is an optimization-based approach and thus chooses a penalty function and minimizes the penalized log-likelihood (in particular, the EEMS prior and the FEEMS penalty are both aiming for locally constant type migration surfaces). The last two points were already discussed in the above sections, so here we focus on the difference of the likelihoods between the two methods.
FEEMS develops the spatial model for the genetic differentiation through Gaussian Markov Random Field, but the resulting likelihood has similarities to EEMS (Petkova et al., 2016) which considers the pairwise coalescent times. Using our notation, we can write the EEMS model as
[Formula omitted. See PDF]
whereIn Equation (11), the effective degree of freedom ν is introduced to account for the dependency across SNPs in close proximity. Because EEMS uses a hierarchical Bayesian model to infer the effective migration rates, ν is being estimated alongside other model parameters. On the other hand, FEEMS uses an optimization-based approach and the degrees of freedom has no influence on the point estimate of the migration rates. Besides the effective degree of freedom and the SNP-specific re-scaling by
Data description and quality control
We analyzed a population genetic dataset of North American gray wolves previously published in Schweizer et al., 2016. For this, we downloaded plink (
Population structure analyses
We fit the Pritchard, Donnelly, and Stephens model (PSD) and ran principal components analysis on the genotype matrix of North American gray wolves (Price et al., 2006; Pritchard et al., 2000). For the PSD model, we used the ADMIXTURE software (
Grid construction
To create a dense triangular lattice around the sample locations, we first define an outer boundary polygon. As a default, we construct the lattice by creating a convex hull around the sample points and manually trimming the polygon to adhere to the geography of the study organism and balancing the sample point range with the extent of local geography using the following website https://www.keene.edu/campus/maps/tool/. We often do not exclude internal ‘holes’ in the habitat (e.g. water features for terrestrial animals), and let the model instead fit effective migration rates for those features to the extent they lead to elevated differentiation. We also emphasize the importance of defining the lattice for FEEMS as well as EEMS and suggest this should be carefully curated with prior biological knowledge about the system.
To ensure edges cover an equal area over the entire region, we downloaded and intersected a uniform grid defined on the spherical shape of earth (Sahr et al., 2003). These defined grids are pre-computed at a number of different resolutions, allowing a user to test FEEMS at different grid densities which is an important feature to explore.
Code availability
The code to reproduce the results of this paper and more can be found at https://github.com/jhmarcus/feems-analysis (Marcus and Ha, 2021a, copy archived at swh:1:rev:f2d7330f25f8a11124db09000918ae38ae00d4a7, Marcus and Ha, 2021b). A
2 Department of Statistics, University of California, Berkeley Berkeley United States
3 Department of Statistics, University of Chicago Chicago United States
4 Department of Ecology and Evolution, University of Chicago Chicago United States
Pennsylvania State University United States
Pennsylvania State University United States
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2021, Marcus et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Spatial population genetic data often exhibits ‘isolation-by-distance,’ where genetic similarity tends to decrease as individuals become more geographically distant. The rate at which genetic similarity decays with distance is often spatially heterogeneous due to variable population processes like genetic drift, gene flow, and natural selection. Petkova et al., 2016 developed a statistical method called Estimating Effective Migration Surfaces (EEMS) for visualizing spatially heterogeneous isolation-by-distance on a geographic map. While EEMS is a powerful tool for depicting spatial population structure, it can suffer from slow runtimes. Here, we develop a related method called Fast Estimation of Effective Migration Surfaces (FEEMS). FEEMS uses a Gaussian Markov Random Field model in a penalized likelihood framework that allows for efficient optimization and output of effective migration surfaces. Further, the efficient optimization facilitates the inference of migration parameters per edge in the graph, rather than per node (as in EEMS). With simulations, we show conditions under which FEEMS can accurately recover effective migration surfaces with complex gene-flow histories, including those with anisotropy. We apply FEEMS to population genetic data from North American gray wolves and show it performs favorably in comparison to EEMS, with solutions obtained orders of magnitude faster. Overall, FEEMS expands the ability of users to quickly visualize and interpret spatial structure in their data.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer