About the Authors:
Luis Santana-Quintero
Affiliation: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
Hayley Dingerdissen
Affiliations: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America; Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
Jean Thierry-Mieg
Affiliation: National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
Raja Mazumder
* E-mail: [email protected] (RM); [email protected] (VS)
Affiliation: Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
Vahan Simonyan
* E-mail: [email protected] (RM); [email protected] (VS)
Affiliation: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
Introduction
Sequence alignment is the critical first step of sequence analysis [1], [2]; its results serve as the data source for numerous downstream analyses (e.g., determining the genetic content of short reads and pathway analysis). Before describing the optimized, ultra-fast alignment algorithm implemented in the High-performance Integrated Virtual Environment (HIVE), this section defines the alignment task and describes the conventional methods currently used to solve it.
Given
* There exists a set of “Reference Genomes” numbered 1…r…N, with sizes G_r and a cumulative size of R = Σ_r G_r bases.
* There exists a set of “Short Reads” numbered 1…s…S, each having a length of L.
Task
* Define an alignment as A(s,r) = (s_1 r_1), …, (s_i r_j), …, (s_{A_s} r_{A_r}), where (s_i r_j) signifies the correspondence between the i-th letter of the short read and the j-th letter of the reference sequence. A_s and A_r are the lengths of the alignment with respect to the short read and the reference, respectively.
* Define a set of “Scoring Parameters” P specifying the benefit and cost factors for matches, mismatches, insertions and deletions between bases (s_i r_j) of the short read and the reference genome.
* Define an additive “Score of Alignment” as the sum of scoring factors S_A(s,r) = Σ P_l, where each term P_l is chosen according to whether the aligned positions of s and r form a match, mismatch, insertion or deletion.
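The additive score defined above can be illustrated with a minimal Python sketch. It is not the HIVE implementation: the function name `alignment_score`, the representation of an alignment as a list of index pairs (with `None` marking a gap), and the numeric values in `P` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical scoring parameters P; the values are illustrative,
# not taken from the paper.
P: Dict[str, int] = {"match": 5, "mismatch": -4, "insertion": -12, "deletion": -12}

# One alignment column: (index into the read, index into the reference).
# Indices are 0-based; None marks a gap on that side.
Column = Tuple[Optional[int], Optional[int]]


def alignment_score(read: str, reference: str,
                    alignment: List[Column],
                    params: Dict[str, int] = P) -> int:
    """Sum one scoring factor per alignment column (additive score)."""
    score = 0
    for i, j in alignment:
        if i is None and j is None:
            raise ValueError("a column cannot be a gap on both sides")
        if j is None:
            # Read base with no reference partner: insertion.
            score += params["insertion"]
        elif i is None:
            # Reference base skipped by the read: deletion.
            score += params["deletion"]
        elif read[i] == reference[j]:
            score += params["match"]
        else:
            score += params["mismatch"]
    return score


# Ungapped alignment with one mismatch (G vs. C at position 2).
ungapped = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(alignment_score("ACGT", "ACCT", ungapped))

# Alignment in which the read skips reference position 2 (one deletion).
gapped = [(0, 0), (1, 1), (None, 2), (2, 3)]
print(alignment_score("ACT", "ACGT", gapped))
```

Because the score is a plain sum over columns, the order in which columns are visited does not affect the result; only the choice of which positions are paired (or gapped) does, and that choice is what the alignment task optimizes.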
Solve
* Find an optimal alignment A(s,r) between short read s and reference genome r such that S_A(s,r)...