Introduction
Functional genomic data are often summarized as interval sets and deposited in public repositories (e.g., UCSC, ENCODE, Roadmap, GEO, SRA etc.). Identifying relationships among sequences and searching through widely available sequence data are routine tasks in genomic research. A fundamental operation in genomic/epigenomic analysis is comparing two interval sets, and many algorithms and tools have been developed for this purpose ( Alekseyenko & Lee, 2007; Cormen et al., 2001; Feng et al., 2019; Giardine et al., 2005; Jalili et al., 2019; Kent et al., 2002; Li, 2011; Neph et al., 2012; Quinlan & Hall, 2010; Richardson, 2006). These methods are based on computing the total number of intersections (overlaps) between the two interval sets. To compare a query interval set with multiple interval sets in a genomic sequence database, searching tools LOLA ( Sheffield & Bock, 2016) and GIGGLE ( Layer et al., 2018) calculate two values — Fisher’s exact p-value and the odds-ratio based on the total number of intersections — and use them as the similarity score to rank the search results. These similarity metrics have proven useful for determining relationships among interval sets, but also have some flaws. First, calculating the Fisher’s exact test results requires building a contingency table, but determining its values is not straightforward. The p-value and odds-ratio for two intervals sets (with number of intervals N 1 and N 2) are calculated from four numbers, namely, the number of intersections between the two sets, n, the number of intervals in set 1 that do not overlap an interval in set 2, N 1 - n, the number of intervals in set 2 that do not overlap an interval in set 1, N 2 - n, and the number of intervals that are not present in either set, m. Determining the value of the fourth number m is not straightforward; in LOLA, it depends on the definition of a “universe set” that is not objectively defined, whereas GIGGLE estimates m from the two interval sets. Second, the total number of overlaps n does not necessarily reflect similarity since intervals can have very different lengths (often in the range of 1 to 10 5 base pairs) and two very different intervals may intersect by only a few base pairs. This can result in inconsistency of the metrics: a comparison between two identical interval sets may have a larger p-value or smaller odds-ratio than a comparison between two different interval sets (see example cases and analysis in the next section). More strikingly, since one interval may contain or cover other intervals in an interval set, depending on how the overlaps are computed, n can be larger than N 1 and/or N 2, i.e., N 1- n and/or N 2- n can be negative, which leads to both the p-value and odds-ratio being undefined—another potential source of inconsistency. Third, the Fisher’s exact-based metrics require two values ( p-value and odds-ratio) but neither is a direct measurement of the similarity: p-values are sensitive to the total number of regions and can range as low as 10 -200 for large genomic interval sets, and odds-ratios are sensitive to small numbers; and neither metric directly informs on how similar the two sets are. Last, the p-value calculation is computationally expensive for genomic interval sets, particularly when the number of intervals is large (up to 10 9). To overcome these weaknesses of the Fisher’s exact-based metrics, we developed Seqpare, a self-consistent metric for quantifying the similarity among genomic interval sets.
Methods
Seqpare metric
The Seqpare metric uses a single index to quantify the degree of similarity S of two interval sets with number of intervals N 1 and N 2. Similar to the Jaccard index, the S eqpare metric is directly defined as the ratio of the total effective overlap O of the two interval sets over the union N 1+N 2 - O:
For two intervals v 1 in set 1 and v 2 in set 2, the similarity s is defined as:
where o is the length of the intersection and l 1 and l 2 are the lengths of v 1 and v 2 respectively. Definition 2 is the Jaccard index for individual intervals: o represents the effective overlap of the two intervals and s takes values in the range of [0, 1]: s = 0 indicates that there is no overlap between the two intervals, and s = 1 means that the union equals the overlap so v 1 and v 2 are identical. Then the total effective overlap O for the two interval sets can be calculated by adding up the similarities of all mutual best matching (MBM) pairs:
A MBM pair is defined as a pair of intervals v 1 and v 2 that fulfill the following conditions: among all intervals in set 2 that intersect v 1, v 2 matches v 1 the best, i.e., the similarity s between v 1 and v 2 is the highest among those intersections; and among all intervals in set 1 that intersect v 2, v 1 matches v 2 the best. Clearly, if two intervals only intersect each other, then they form an MBM pair. In Figure 1a, the two long intervals (the 1st in set 1 and the 1st in set 2) only intersect each other (intersection pair ip 1,1), so they form an MBM pair; similarly, the two short intervals ip 2,2 form another MBM pair. For intervals that involve multiple intersections, we define a relatively simple and strict rule to find the MBM pairs: find and choose the first MBM pair that has the highest s among all involved intersection pairs, then find and choose the next MBM pair that has the highest s from the rest of the intersection pairs (excluding all pairs that involve the intervals that are already chosen), and so on until there are no more intersection pairs. In Figure 1b, there are 3 intersection pairs: ip 1,2 with s=1/5, ip 1,3 with s=1/9, and ip 2,3 with s=1/5. So the first MBM pair is either ip 1,2 or ip 2 ,3 depending on which one is found first. If ip 1,2 is chosen as the first MBM pair, then ip 1,3 will not be considered since interval 1 in set 1 is already chosen, and then we have only ip 2,3 left, which is the second MBM pair. We get the same result if ip 2,3 is chosen as the first MBM pair. In Figure 1d, there are 6 intersection pairs and two MBM pairs: ip 1,1 and ip 3,2 both with s=1, where interval 2 in set 1 ( i 2,1) does not have a match. Note that interval i 2,1 matches best with interval i 1,2, but i 1,2 does not match best with i 2,1, so they are not the mutual best matching pair.
Figure 1.
Example cases for illustrating the Seqpare similarity metric.
The length ratio of the short interval to the longer intervals are 1: 5 in a), 1: 3: 5 in b), and 2: 3 :4 in d). The total number of overlaps n is 2 for a), 3 for b) (interval 1 in Set 1 intersects two intervals in Set 2), 4 for c) and 6 for d). The p-value in case b is smaller than that in case a. N 1 - n and N 2 - n are both negative in cases c and d.
Since the number of total matching pairs ≤ Min( N 1, N 2) — the minimum of N 1 and N 2 — and s is in range of [0, 1], we obtain O ≤ Min( N 1, N 2), and S takes a value in the range of [0, 1]. If S is zero, then there is no matching pair, and vice versa; if S = 1, then N 1 = N 2 = O (the two sets are equivalent), and vice versa. And, because each s is the mutual best match, O is symmetric (the amount of overlap between set 1 and set 2 is the same as that between set 2 to set 1) and so is S.
In Figure 1a, O a=2 and S a=1, which is correct because the two sets are identical. In Figure 1b, O b=2/5 and S b=1/14, which is expected since the two sets are very different. The Fisher’s exact approach is inconsistent here: the p-value in 1 b is smaller than that in 1 a although the two sets in 1 b are very different while those in 1 a are equivalent. Assuming that the number of intervals N in the ‘universe set’ is 100, then Fisher’s exact test contingency table is [(2, 0), (0, 98)] in 1 a and [(3, 0), (0, 97)] in 1 b, which gives p a = 2.02×10 -4 and p b = 6.18×10 -6 respectively. The odds-ratio is ∞ in both cases. In Figure 1c and Figure 1d, N 1- n and N 2 - n are all negative, so it is not conceptually appropriate to use Fisher’s exact test to calculate the p-value and odds-ratio.
Implementation
The implementation of the Seqpare metric is simple. The searching for MBM pairs is deterministic and it can be implemented by directly following the description in the above section. The Seqpare code is built on top of the AIList v0.0.1 ( Feng et al., 2019) software written in C.
Operation
The Seqpare software ( Feng & Feng, 2020) was tested on Linux machines and the minimum required memory is 8GB. The interval set file should be in the format of bed or bed.gz.
Results
A test with real genomic interval sets
To test Seqpare and compare it with the Fisher’s exact-based metrics, we took 100 interval sets from a UCSC database and used one interval set, affyGnf1h, as a query to search over the database. Because the database contains the query set, affyGnf1h should have the highest similarity score. Table 1 ( Feng & Feng, 2020) shows part of the result. Interval set affyGnf1h indeed ranks first with maximum similarity 1 when using Seqpare, but it ranks 94 th out of 100 when using the p-value and ranks last when using the odds-ratio. This happens because N 1- n and N 2- n are both negative (n=16686, N 1= N 2=12158). Given this inconsistency, GIGGLE sets negative N 1-n and N 2 -n to zero to calculate the p-value, and to one to calculate the odds-ratio. The Seqpare indices for other interval sets are all small (<0.03) because the average effective overlap of an intersection pair in those sets is about 0.1 or less, i.e., they are very different from the query set affyGnf1h; however, all of the p-values are so small (e -200), which suggests that the p-value is not a meaningful similarity index for these genomic interval sets. This search takes 6m30s for Seqpare and 15m32s for GIGGLE. All computations were carried out on a computer with a 2.8GHz CPU, 16GB memory, and an external SSD hard disk. The complete results can be found at the same site as the software.
Table 1.
Comparison of Seqpare and GIGGLE similarity metrics: partial list of the search results from a collection of 100 interval sets, which contains the query set affyGnf1h.
Seqpare metric | p-value and odds-ratio | Interval dataset
| ||||
---|---|---|---|---|---|---|
rank | similarity | rank_p | p-value | rank_or | odds-ratio | |
1 | 1.0 | 94 | 1.543 e -197 | 100 | 9.046 e -16 | affyGnf1h |
2 | 0.029 | 19 | 8.328 e -201 | 1 | 25.25 | ccdsGene |
3 | 0.026 | 35 | 2.743 e -200 | 3 | 4.425 | allenBrainAli |
4 | 0.022 | 39 | 3.372 e -200 | 32 | 7.972 e -11 | affyU133Plus2 |
5 | 0.021 | 4 | 5.403 e -202 | 2 | 17.24 | affyU133 |
Conclusion
We have shown that the Fisher’s exact test approach may be not the most appropriate test statistic for comparing similarity among interval sets. While the approach has been shown to be successful for many questions, we have demonstrated how it can break down for a variety of reasons, such as very similar interval sets, within-set containment, widely varying interval lengths among sets, or small effective overlaps. In contrast, Seqpare is a self-consistent metric for quantifying the similarity of two interval sets that addresses these concerns. Seqpare is the first rigorously defined metric for comparing two sequences based on their interval sets. In addition to the metric itself, our Seqpare software tool provides functions for both searching and mapping large-scale interval datasets. We anticipate that this approach will contribute to novel results for interval set searching.
Data availability
Source data
Test data of interval sets are from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database
A small subset of the test data and instructions are provided for verifying the result: https://github.com/deepstanding/seqpare/ucsc_30
Underlying data
Zenodo: deepstanding/seqpare: First release of Seqpare. http://doi.org/10.5281/zenodo.3840051 ( Feng & Feng, 2020)
This project contains the following underlying data:
-
AffyGnf1h_ucsc100_seqpare (Seqpare similarity result)
-
AffyGnf1h_ucsc100_giggle (GIGGLE p-value and odds-ratio result)
Data is available alongside the source code under the terms of the MIT license.
Software availability
Source code available from: https://github.com/deepstanding/seqpare
Archived source code at time of publication: http://doi.org/10.5281/zenodo.3840051 ( Feng & Feng, 2020)
License: MIT
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright: © 2021 Feng SC et al. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Searching genomic interval sets produced by sequencing methods has been widely and routinely performed; however, existing metrics for quantifying similarities among interval sets are inconsistent. Here we introduce Seqpare, a self-consistent and effective metric of similarity and tool for comparing sequences based on their interval sets. With this metric, the similarity of two interval sets is quantified by a single index, the ratio of their effective overlap over the union: an index of zero indicates unrelated interval sets, and an index of one means that the interval sets are identical. Analysis and tests confirm the effectiveness and self-consistency of the Seqpare metric.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer