Introduction
Measurement of in vivo chromosome conformation is a major unsolved problem in structural biology despite its known biological importance 1. The present state of the art is either to obtain indirect information about conformations using 3C-derived methods, which measure DNA-DNA contacts (typically in a cell-averaged population) 2, or else to directly measure the cellular locations of individual chromosomal loci in single cells by microscopy 3. The major limitation of direct localization is one of throughput: only ~ 3–5 labeled loci can be uniquely identified ‘by color’ in a standard microscope image, whereas a whole-chromosome reconstruction would involve labeling and identifying hundreds or thousands of loci.
Several research efforts aim to remove the color limitation either by experimental improvements or by computational inference. The experimental approaches aim to increase the number of labels that can be distinguished in an image 4–6. Alternatively, attempts have been made to infer the identity of labels that cannot be uniquely identified in an image, by comparing the image to the known label positions along the DNA contour. The first attempt to do this was ‘by eye’ 7, but subsequently two computational algorithms have been developed to automate this inference:
This paper presents improvements to
Figure 1.
Legal versus illegal (overlapping) conformations.
Schematic showing one legal and one illegal conformation passing through spots A, B and C.
The final step is to use the mapping probabilities to construct the range of likely conformations compatible with the microscope image. Uncertainty in the conformation results from inaccuracy or uncertainty in the mapping probabilities due to three factors: inaccuracy in the DNA model (the relation between genomic and spatial distance), error in estimating the partition functions, and the inherent uncertainty in the data even with a perfect reconstruction algorithm. The DNA model can be calibrated by a control experiment, and we argue that the remaining model error can reduce our method’s confidence in its results but it generally does not cause our method to reconstruct mistaken conformations. The main focus of this paper is on improving the partition function estimate, using two different strategies. First, we give an efficient method for optimizing the spot penalties when there are hundreds of spots in the image. Next, we provide formulas for the partition functions which allow them to be estimated to arbitrarily high accuracy (given enough computation time), without using spot penalties or any optimization. As we show using simulations, these two methods used individually or in tandem permit confident, chromosome-scale conformational reconstructions using existing experimental technologies.
Methods
First we provide a method for efficiently optimizing the spot penalties regardless of the number of labeled loci. This rule guarantees that a) the rate of missing spots is as expected, and b) the mapping probabilities are properly normalized. Let q_s denote the penalty attached to spot s; then the update rule for that spot penalty is:
where N is the number of loci, P(s) = Σ_L p(L → s) is the total probability of mapping any locus to spot s, and p_fn(c) is the estimated rate of missing spots having color c. The justification for this rule is given in Appendix 1 (Supplementary File 1).
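Equation 1 (not reproduced here) defines a fixed-point iteration on the spot penalties. The sketch below is a rough, hypothetical illustration of such an iteration, not the paper's exact rule: the helper name, the simple multiplicative update, and the unit target for P(s) are all our own assumptions, chosen only to show how penalties can be rescaled until each spot's total mapping probability reaches its target.

```python
import numpy as np

def update_penalties(weight, q, miss_weight, target=1.0, n_iter=300):
    """Fixed-point iteration on spot penalties (illustrative only).

    weight[L, s] : unnormalized locus-to-spot mapping weights
    q[s]         : multiplicative spot penalties
    miss_weight  : weight of the 'missing spot' outcome for each locus
    target       : desired total probability P(s) mapped onto each spot
    (The paper's actual Equation 1 differs; this assumed multiplicative
    rule merely shares the same kind of fixed point.)
    """
    for _ in range(n_iter):
        w = weight * q                          # penalized weights
        norm = w.sum(axis=1, keepdims=True) + miss_weight
        p = w / norm                            # p(L -> s); rows sum to < 1
        P = p.sum(axis=0)                       # total probability per spot
        q = q * target / np.maximum(P, 1e-12)   # rescale toward the target
    return q, p

rng = np.random.default_rng(0)
W = rng.random((6, 5))                          # toy problem: 6 loci, 5 spots
q, p = update_penalties(W, np.ones(5), miss_weight=0.1)
P = p.sum(axis=0)
```

After convergence, every spot receives total mapping probability close to the target, while each locus retains some probability of mapping to a missing spot.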
We can also update a penalty
Typically, we first optimize the
Next, we give two exact formulas for the partition functions Z_L→s and the full partition function Z that determine our locus-to-spot mapping probabilities. We focus on the full partition function Z, since the formulas for Z_L→s are identical. The largest term in each formula, which we denote
Figure 2.
Series expansions.
A. Schematic showing terms in a series expansion, in a case where series 1 and series 2 have the same terms. The full series gives the exact partition function for the 4-locus experiment shown, where only 2 spots appeared in the image (due to a high rate of missing spots). Cartoons show only the constrained loci for each term (so, for example, each term includes the illegal conformation visiting spots s → t → s → t).
B. An illegal conformation for which loci A, C and E overlap at spot s, and loci F and H overlap at spot t. Series expansion 1 includes this conformation in terms
There are two ways we might remove conformations containing overlapping loci, leading us to two different series expansions for the true partition function Z. Suppose that we are calculating the term
Each of the two series expansions is a weighted sum over all possible illegally-constrained terms having two properties: 1) each locus and each spot appear at most once in the indices, and 2) two or more loci map to each constrained spot. Formally, we use Ω to represent the set of all possible illegal constraints: each element of Ω consists of a set of two or more non-adjacent loci and a single spot where they are forced to overlap. Each expansion thus takes the form
where
Here we specify each series expansion by giving a formula for the weights w_ϕ in terms of n_ϕ and the various
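To make the structure of Ω concrete, here is a small illustrative sketch (our own helper, not part of align3d) that enumerates every illegal constraint for a toy experiment: each subset of two or more pairwise non-adjacent loci, paired with each spot where those loci could be forced to overlap.

```python
from itertools import combinations

def illegal_constraints(n_loci, spots):
    """Enumerate Omega for a toy problem: each element is a pair
    (frozenset of >= 2 pairwise non-adjacent loci, spot)."""
    omega = []
    for k in range(2, n_loci + 1):
        for loci in combinations(range(n_loci), k):
            # skip sets containing loci adjacent along the contour
            if any(b - a == 1 for a, b in zip(loci, loci[1:])):
                continue
            for s in spots:
                omega.append((frozenset(loci), s))
    return omega

# 4 loci (0..3) and two spots: the non-adjacent pairs are
# (0,2), (0,3) and (1,3), each combined with either spot.
omega = illegal_constraints(4, ['s', 't'])
```

For this 4-locus, 2-spot example Ω contains 6 elements, since no subset of three or more loci can be pairwise non-adjacent.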
Series expansion 1
For series expansion 1, we do not allow the unconstrained loci to map to spots that were used in constraints. Then the weights w_ϕ in the series formula given by Equation 3 are:
To select terms for a series approximation, we first choose a set of illegal constraints ψ to disallow, then include all series terms
Series expansion 2
For series expansion 2, the unconstrained loci are allowed to map to spots that were used in constraints. Then the weights w_ϕ in Equation 3 are:
To select terms for a series approximation, we first choose a set of N_ψ single-locus-to-spot mappings Ψ, then include all terms Z_ϕ whose illegal constraints use only mappings in Ψ. For example, the constraint (AC)_s would be included if {A → s, C → s} ⊆ Ψ. In order to select the largest terms, we recommend building Ψ from the N_ψ largest mapping probabilities calculated from
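The term-selection rule for series expansion 2 can be sketched as follows. This is an illustrative helper of our own (not align3d code): it picks the N_ψ largest entries of a mapping-probability matrix as Ψ, then returns every single-spot overlap constraint built entirely from mappings in Ψ.

```python
import numpy as np
from itertools import combinations

def select_constraints(p, n_psi):
    """Build Psi from the n_psi largest locus-to-spot mapping
    probabilities, then list all overlap constraints (loci set, spot)
    whose mappings are all contained in Psi."""
    n_loci, n_spots = p.shape
    flat = np.argsort(p, axis=None)[::-1][:n_psi]
    psi = {tuple(np.unravel_index(i, p.shape)) for i in flat}
    included = []
    for s in range(n_spots):
        loci_at_s = sorted(L for (L, ss) in psi if ss == s)
        for k in range(2, len(loci_at_s) + 1):
            for group in combinations(loci_at_s, k):
                # only non-adjacent loci may be forced to overlap
                if any(b - a == 1 for a, b in zip(group, group[1:])):
                    continue
                included.append((frozenset(group), s))
    return psi, included

# toy matrix: loci 0 and 2 both map strongly to spot 1
p = np.zeros((4, 3))
p[0, 1], p[2, 1], p[1, 0], p[3, 2] = 0.9, 0.8, 0.7, 0.6
psi, included = select_constraints(p, n_psi=4)
```

In this example Ψ = {0 → 1, 2 → 1, 1 → 0, 3 → 2}, so the only included constraint forces the non-adjacent loci 0 and 2 to overlap at spot 1.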
Results
We tested the improved
For each scenario, we randomly generated 100 conformations using a wormlike chain model (packing density n_l = 0.3 kb/nm, persistence length l_p = 300 kb, as suggested by the measurements of Trask, Pinkel and van den Engh, 1989 12); applied a random labeling at a mean density of 1 locus per megabase; and simulated experimental error: 100/200-nm Gaussian localization error in xy/z, a 10% rate of missing labels, and a 10% rate of nonspecifically-bound labels. A typical simulated experiment from the Toy scenario is shown in Figure 3A.
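As an illustration of this simulation protocol, the sketch below generates a toy experiment. It substitutes a Gaussian chain for the paper's wormlike chain (an assumption made for brevity, valid only in the flexible-chain limit L ≫ l_p) while keeping the stated parameters: n_l = 0.3 kb/nm, l_p = 300 kb, 100/200 nm localization error in xy/z, and 10% missing and extra spots.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_experiment(n_loci=100, kb_per_locus=1000.0,
                        nl=0.3, lp=300.0,
                        xy_err=100.0, z_err=200.0,
                        p_miss=0.1, p_extra=0.1):
    """Toy stand-in for the paper's simulations (Gaussian chain)."""
    # RMS step between consecutive labels (nm), per axis, using the
    # flexible-chain limit <R^2> = 2 * lp * L with kb -> nm via nl:
    step_rms = np.sqrt(2.0 * lp * kb_per_locus) / nl / np.sqrt(3.0)
    contour = np.cumsum(rng.normal(0.0, step_rms, size=(n_loci, 3)), axis=0)
    # Gaussian localization error: 100 nm in x and y, 200 nm in z
    noise = rng.normal(0.0, [xy_err, xy_err, z_err], size=(n_loci, 3))
    spots = contour + noise
    kept = rng.random(n_loci) > p_miss              # ~10% missing labels
    spots = spots[kept]
    n_extra = rng.binomial(n_loci, p_extra)         # ~10% extra spots,
    lo, hi = contour.min(axis=0), contour.max(axis=0)
    extras = rng.uniform(lo, hi, size=(n_extra, 3)) # scattered in the volume
    return contour, np.vstack([spots, extras]), kept

contour, spots, kept = simulate_experiment()
```

The returned `kept` mask records which loci produced a spot, playing the role of the ground-truth locus-to-spot mapping in the quality metrics below.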
Figure 3.
Example reconstruction.
A. Randomly generated and labeled chromosome contour with simulated experimental error: localization error (lines offsetting spots from the labeled genomic loci) and missing labels (open circles). This example lacks nonspecifically-bound labels (floating spots).
B. Spot mapping probabilities calculated using both the largest series term
Next, we specified a DNA model relating the genomic distance L between two loci to their expected RMS spatial distance R, which is used by
For each simulated conformation, we fed the label positions and colors together with the simulated 3D images and our DNA model into the
Our reconstruction quality metric is the amount of unrecovered information from the mapping probabilities, defined as I = −⟨log p(L_i → s_i)⟩_i, where the average ⟨·⟩ is taken over the set of true locus-to-spot mappings (L_i, s_i). The ideal case of I → 0 implies a perfect reconstruction with no mistakes and zero uncertainty, but in practice I is always positive. In a real experiment where the true mappings are not known, we use a proxy for unrecovered information that we term entropy, defined as S = −⟨p(L_i → s_j) log p(L_i → s_j)⟩_ij, whose average is taken over all locus-to-spot mappings, not just the correct mappings. The goal is to have S ≈ I, so that a real experiment will have an accurate estimate of the reconstruction performance. The left-hand panel of Figure 3C shows how I and S depend on the accuracy of the calculation for the simple example shown, using either of the two series expansions and varying the number of terms from 1 (simply
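Given a matrix of mapping probabilities, the two metrics can be computed directly from their definitions. In this sketch we assume base-2 logarithms (so both quantities are in bits) and, for S, a sum over spots averaged over loci, so that S is on the same per-locus scale as I; the exact normalization used in the paper is not reproduced here.

```python
import numpy as np

def unrecovered_information(p, true_map):
    """I = -<log2 p(L_i -> s_i)>, averaged over true locus-to-spot pairs."""
    return -np.mean([np.log2(p[L, s]) for L, s in true_map])

def entropy(p):
    """S = -<p log2 p>: per-locus entropy of the mapping probabilities,
    summed over spots and averaged over loci (assumed normalization)."""
    q = np.clip(p, 1e-300, 1.0)          # avoid log(0); p=0 terms vanish
    return np.mean(-(p * np.log2(q)).sum(axis=1))

p_perfect = np.eye(3)                    # every locus maps to one spot
I0 = unrecovered_information(p_perfect, [(0, 0), (1, 1), (2, 2)])
p_uniform = np.full((4, 4), 0.25)        # maximally uncertain mapping
I_u = unrecovered_information(p_uniform, [(i, i) for i in range(4)])
S_u = entropy(p_uniform)
```

A perfect reconstruction gives I = S = 0; a uniform mapping over 4 spots gives I = S = 2 bits per locus, and S tracks I exactly when the probabilities are well-calibrated.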
Validation of Equation 1–Equation 5. We first validated each of the two series expansions by comparing them against exact partition function calculations for the simulated Toy experiments. In all cases, both series expansions, when taken to their maximum number of terms, exactly reproduced the partition function calculations obtained by direct enumeration over all possible non-overlapping conformations. This test validates Equation 4 and Equation 5. We also verified that both series expansions could be used in conjunction with spot penalty optimization ( Equation 1 and Equation 2), both by numerically validating the cost function gradient calculation and by testing for convergence on these small problems.
Improved optimization allows large-scale reconstructions. Next, we tested whether the iterative spot-penalty optimization rules given by Equation 1 and Equation 2 could work on large-scale problems such as those of Experiment 2, where the old gradient descent optimizer in
Figure 4.
Comparison of old and new optimization methods.
Each panel compares the number of iterations required to achieve convergence using the old (purple) versus new (yellow) optimization methods. Only trials that successfully converged are counted, so the histograms are not normalized relative to each other. The first number in parentheses of each legend entry shows the number of converged trials, and the second number shows the total number of trials. Note that the second numbers in the right-hand panel equal the first numbers in the left-hand panel, since we required convergence in
Use of more colors dramatically improves reconstructions. Our most striking result is that simulations of the ambitious Experiment 2 produce far better results than even the Toy scenario, despite the fact that these simulations have more loci per color than either the Toy scenario or Experiment 1. This can be seen in the amount of unrecovered information I shown in the simulation-averaged plots of Figure 5A. High-quality reconstructions using ~ 20 colors were also observed by the ChromoTrace reconstruction method 9, even for large numbers of labeled loci. Our explanation is that the reconstruction quality has more to do with the average spatial density of loci per color than with the total number of loci per color, because each ‘propagator’ evolving one potential locus-to-spot mapping to the next sees only the spots within some reasonable radius, as determined by the genomic distance to the next locus. These arguments really pertain to the information recovery of the baseline calculation of
At the end of this section we revisit Experiment 2, in order to assess the reconstruction quality when analyzing more realistic DNA contours having tighter confinement and thus more closely-packed spots.
Figure 5.
Comparison of the convergence rates of series expansion 1 and series expansion 2.
A. Median unrecovered information I as a function of the number of terms used in each series expansion, without using spot penalty optimization (solid lines) versus with optimization (dotted lines), and over the three simulation scenarios (panels left-to-right). Each curve was derived from the 100 individual curves corresponding to the 100 simulations in each scenario using a simple point-by-point median average. B. Percentile distribution of the difference between the unrecovered information using series 2 minus the unrecovered information using series 1; the fact that this difference quickly drops below zero in nearly all individual simulations shows that series 2 recovers more information in the first few terms than does series 1.
Series expansion 2 outperforms series expansion 1. Next, we compared the convergence properties of our two expansions on the three scenarios of simulated experiments. Figure 5A gives a sense of how the amount of unrecovered information varies with the number of terms taken in each series, without (solid lines) and with (dotted lines) the use of spot penalties. Each of the 3 panels summarizes all 100 simulated experiments of that scenario, and each experiment in that scenario shows a unique relationship between information recovery and number of series terms computed. Representative curves of individual experiments in each scenario are shown in Supplementary Figure S1 ( Supplementary File 1). In order to summarize these very dissimilar curves, Figure 5A shows a median average of all 100 individual experimental curves taken at each data point. Note that this averaging process does not necessarily preserve the shape of the curves from typical individual simulations.
In order to directly compare the two series expansions, we plotted their difference in unrecovered information I 2 − I 1 versus the number of series terms in Figure 5B. In this case, we plotted the full distribution showing the median (50th percentile) as well as the 10th, 25th, 75th and 90th percentile curves. These plots show directly that series 2 almost always outperforms series 1 when only a few terms can be evaluated. The reason is that the terms in series 2 are larger in magnitude owing to their looser constraints, and thus remove the extraneous part of the partition function more quickly than the terms of series 1 (see Supplementary Figure S1 and Supplementary Figure S4, Supplementary File 1). Based on these results, we recommend using series expansion 2 in all situations where the partition function cannot be evaluated exactly.
Spot penalty optimization is the most efficient way to recover information. Spot penalty optimization is an iterative process where each iterative step requires the evaluation of some number of series terms. An optimization requiring t iterations thus multiplies computation time by a factor of t relative to the simple evaluation of the series. Alternatively, one could spend the extra computation time on taking the series to a higher order without spot penalty optimization. Figure 6A plots the unrecovered information when a) taking series 2 to a certain order without optimization, versus b) using spot penalty optimization on only the first term yielding
Figure 6.
Optimization in conjunction with series expansions.
A. Comparison of unrecovered information using series expansions without iteration, denoted I, to the unrecovered information obtained by optimizing spot penalties using only the first series term, denoted
Series expansions can improve optimization information recovery. Although spot penalty optimization is the most efficient way to recover information, that process alone can only extract a certain fraction of the recoverable information: once the cost function is zero, optimization can proceed no further despite the problem not having been solved exactly. At this point, the only way forward is to go higher in the order of series terms used; we can still apply spot penalties to this sum of terms and iteratively optimize them as before using Equation 1 and Equation 2. Figure 6B plots the difference in unrecovered information when applying spot penalty optimization between a) a variable number of terms in series expansion 2, and b) only
20-color labeling leads to near-perfect reconstructions. As shown in Figure 5A, the unrecovered information for the whole-chromosome Experiment 2 averages around 0.2 bits per locus, implying near-perfect mapping probabilities. However, because these results were based on randomly-generated unconfined conformations, they do not establish whether such good information recovery is possible with real chromosomes, which are likely to be more compact. To test Experiment 2 on realistic chromosome conformations, we generated four plausible conformations of human chromosome 4 by running the GEM software package 13 on the smoothed human Hi-C data set provided by Yaffe and Tanay, 2011 14, using a 3D spline interpolation to increase the resolution from 1 Mb to 50 kb. These conformations were then virtually labeled at 300 randomly-selected loci, and simulated experimental error was added as before. One set of experiments assumed diffraction-limited 100/200 nm localization error in xy/z, and a second set assumed superresolution 30/50 nm localization error in xy/z; in both sets the missing- and extra-spot rates were 10%. For this experiment we determined the DNA model parameters n_l and l_p by fitting pairwise locus distributions, as one would do in an experiment, and for L > l_p we set ρ = 1/3, as has been reported in the literature for locus separations under 7 Mb 4. Mapping probabilities were reconstructed by taking series expansion 2 to the lowest order that included at least 1000 terms, then applying and optimizing spot penalties. Compared with the random-walk conformations used to test the Experiment 2 scenario, the diffraction-limited reconstructions did somewhat worse (~ 0.4 versus ~ 0.2 bits of unrecovered information per locus), owing to the fact that physical confinement of chromosomes increases the density of competing spots in the image. The superresolution reconstruction quality was unchanged at ~ 0.2 bits of unrecovered information per locus.
Despite the drop in performance when localizing spots at the diffraction limit, 0.4 bits of unrecovered information per locus is still an extremely strong reconstruction, implying that the correct locus-to-spot mappings are assigned p-values averaging around 2^−0.4 ≈ 76%. Starting from such accurate and confident mapping probabilities, one can infer a reasonable conformation simply by assigning each locus to the unassigned spot to which it maps with the highest probability (or calling a missing spot if
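That greedy construction can be sketched as follows. The most-confident-first processing order and the missing-spot threshold value are our own assumptions, added only to make the sketch complete; the paper's exact tie-breaking and missing-spot criterion are not reproduced here.

```python
import numpy as np

def greedy_conformation(p, miss_threshold=0.5):
    """Assign each locus to its highest-probability unassigned spot,
    calling a missing spot (None) when no remaining spot exceeds the
    (assumed) threshold. Loci are processed most-confident first."""
    n_loci, n_spots = p.shape
    assignment, taken = {}, set()
    order = np.argsort(p.max(axis=1))[::-1]      # most confident loci first
    for L in order:
        best, best_p = None, miss_threshold
        for s in range(n_spots):
            if s not in taken and p[L, s] > best_p:
                best, best_p = s, p[L, s]
        if best is not None:
            assignment[int(L)] = best
            taken.add(best)
        else:
            assignment[int(L)] = None            # missing spot
    return assignment

# locus 0 claims spot 0; locus 1's only remaining option is too weak
p = np.array([[0.90, 0.05],
              [0.85, 0.10]])
assign = greedy_conformation(p)
```

With confident, well-separated probabilities like those in Figure 7, this greedy pass rarely has to break ties, which is why such a simple rule already yields accurate contours.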
Figure 7.
Simulated reconstructions of 4 plausible conformations of human chromosome 4.
The left-hand reconstruction of each conformation was obtained using a simulated image from diffraction-limited microscopy (shown in inset; localization error is shown as lines connecting spots to DNA), and the right-hand reconstruction used a simulated superresolution image. Grey shaded lines indicate the underlying DNA contours; blue lines trace the ideal reconstructed contours given the measured spot positions; red lines show our reconstructed contours where they deviate from the ideal contours. Captions above each reconstruction indicate the amount of unrecovered information I per locus after/before the reconstruction process; captions below indicate the number of alignment errors between the spot ID sequences read along the true versus inferred conformations. For both superresolution reconstructions 2 and 3 we calculated I excluding a single locus whose true spot mapping was given 0 probability; including that locus sends I → ∞.
Figure 7 shows that the benefit of superresolution is twofold: 1) the locus-to-spot mapping quality improves relative to diffraction-limited resolution (i.e. fewer red lines), and 2) the small-scale structure of an ideal mapping (blue line) more faithfully traces the underlying contour (grey line). This shows the importance of measuring spot locations to sub-pixel resolution, even in experiments where normal-resolution microscopes using standard fluorophores are used to localize spots separated by two pixels or more. In our GEM conformations, 23 spots were closer than 200 nm to another spot of the same color, which would suggest problems localizing these spots; but this is inconsistent with the data shown in Wang et al., 2016 4, which indicate that virtually all spots in our experimental scenarios should be well-separated, at least in some cell lines.
Discussion
We have developed and evaluated two improvements to the
Our problem of finding likely (i.e. low-free-energy) DNA conformations passing through a set of imaged spots is similar to the well-known traveling salesman problem (TSP), in which a salesman must find the shortest route connecting a set of cities. Somewhat more closely related is a generalization of the TSP called the time-dependent traveling salesman problem (TDTSP) 10, where the intercity distances change every step on the tour; this is analogous to our situation where the free energy needed to thread DNA between two spots depends not only on their separation but also on the length of DNA used to connect them. In our case, the presence of missing and extra spots generalizes our problem still further: in the TDTSP analogy the salesman would be allowed to skip stops and cities for a penalty. Our main finding is that the partition function of this generalized TDTSP (which encompasses traditional TSP and TDTSP problems) can be expressed as a sum of terms computable using a (modified) forward-backward algorithm, a result which should also apply to other route-finding applications where research has historically focused on route optimization rather than route inference.
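To illustrate the route-inference idea, the baseline partition function, before any overlap corrections, can be computed with a forward pass analogous to the HMM forward algorithm (one half of the forward-backward computation mentioned above). This minimal sketch is our own simplification: it omits missing and extra spots and uses generic propagator and emission weights rather than the paper's free-energy model.

```python
import numpy as np

def forward_partition(prop, emit):
    """Baseline partition function Z over all (possibly overlapping)
    locus-to-spot mappings, computed by a forward pass.

    prop[i][s, t] : propagator weight for threading the DNA segment
                    between locus i (at spot s) and locus i+1 (at spot t)
    emit[i][s]    : weight for locus i sitting at spot s
    """
    f = emit[0].copy()                      # forward weights after locus 0
    for i in range(1, len(emit)):
        f = (f @ prop[i - 1]) * emit[i]     # marginalize the previous spot
    return f.sum()

# trivial check: 3 loci, 2 spots, all weights 1 -> Z counts all 2^3 paths
prop = [np.ones((2, 2))] * 2
emit = [np.ones(2)] * 3
Z = forward_partition(prop, emit)
```

Because each step marginalizes over the previous spot, the cost is linear in the number of loci and quadratic in the number of spots, rather than exponential in the path length as in brute-force enumeration.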
Both our mapping p-values and our entropy proxy for information recovery show a systematic bias, which comes from the use of a different DNA model for reconstruction than was used to create the simulated DNA contours. The fact that our reconstructions were nonetheless quite strong shows that the reconstruction method itself is quite robust to model error. This is very fortunate given the uncertainty in the true in vivo biological model describing the cells in a real experiment. For our results to be accurate, we had to calibrate our model so as to reproduce the peak in the distance distribution of pairs of distinguishable loci. An experimenter would perform this calibration by imaging distinguishable pairs of loci in a parallel experiment. Due to
From a genomic standpoint, our most exciting result is that the combination of our computational improvements together with 20-color labeling technology gives almost perfect reconstructions at the whole-chromosome scale. Out of ~ 4 bits per locus of uncertainty inherent in the reconstruction problem, our method recovers ~ 3.6–3.8 bits. Such confident mapping probabilities allow for the direct construction of individual conformations that are ≥ 90% accurate. High-quality piecewise reconstructions are likewise possible with two overlapping copies of the same chromosome (data not shown), although sometimes the fragments cannot be assembled. We want to emphasize that our reconstructions require only a few parameters that would be known experimentally with proper controls: the 3 DNA model parameters, which in a real scenario would be calibrated using a control experiment, and the correct average rates of missing and extra spots, averaged over all experiments, used by
Data availability
The simulated conformations and labelings used to generate the plots in this paper, together with the output of the
Software availability
Results in this paper were generated using version 1.1 of
All source files used in preparing this paper are available from the GitHub page for this paper: https://github.com/heltilda/align3d/tree/master/seriesExpansions.
License: GPL 3.0
Archived code at time of publication:
align3d: https://doi.org/10.5281/zenodo.2580342 15
License: GPL 3.0
wormulator: https://doi.org/10.5281/zenodo.1411503 16
License: GPL 3.0
Cicada scripting language: https://doi.org/10.5281/zenodo.1411505 17
License: MIT License
Copyright: © 2019 Ross BC and Costello JC. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”).
Abstract
We previously published a method that infers chromosome conformation from images of fluorescently-tagged genomic loci, for the case when there are many loci labeled with each distinguishable color. Here we build on our previous work and improve the reconstruction algorithm to address previous limitations. We show that these improvements 1) increase the reconstruction accuracy and 2) allow the method to be used on large-scale problems involving several hundred labeled loci. Simulations indicate that full-chromosome reconstructions at 1/2 Mb resolution are possible using existing labeling and imaging technologies. The updated reconstruction code and the script files used for this paper are available at: https://github.com/heltilda/align3d.