About the Authors:

Luis Santana-Quintero

Affiliation: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America

Hayley Dingerdissen

Affiliations: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America; Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America

Jean Thierry-Mieg

Affiliation: National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America

Raja Mazumder

* E-mail: [email protected] (RM); [email protected] (VS)

Affiliation: Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America

Vahan Simonyan

* E-mail: [email protected] (RM); [email protected] (VS)

Affiliation: Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America

Introduction

Sequence alignment is the critical first step of sequence analysis [1], [2]; the alignment results then serve as the data source for numerous downstream analyses (e.g., the genetic content of short reads, pathway analysis, etc.). Before describing the optimized, ultra-fast alignment algorithm implemented in the High-performance Integrated Virtual Environment (HIVE), the following section describes the alignment task and the conventional methods currently used to solve it.

Given

* There exists a set of “Reference Genomes” numbered 1…r…N with sizes $G_r$ and cumulative size $R = \sum_r G_r$ bases.

* There exists a set of “Short Reads” from 1…s…S, each one having a length of L.

Task

* Define an alignment as $A(s,r) = (s_1 r_1), \ldots, (s_i r_j), \ldots, (s_{A_s} r_{A_r})$, where $(s_i r_j)$ signifies the correspondence between the i-th letter of the short sequence and the j-th letter of the reference sequence. $A_s$ and $A_r$ are the lengths of the alignment with respect to the short sequence and the reference, respectively.

* Define a set of “Scoring Parameters” $P$ giving the benefit and cost factors for matches, mismatches, insertions and deletions between bases $(s_i r_j)$ of the short read and the reference genome.

* Define an additive “Score of Alignment” as the sum of scoring factors $S_A(s,r) = \sum P_l$, where $l$ is chosen according to whether the aligned positions of s and r form a match, mismatch, insertion or deletion.
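
The additive score defined above can be illustrated with a minimal sketch. It is not taken from HIVE-Hexagon: the function name `alignment_score`, the dictionary `P`, its numeric values, and the convention that a read base with no reference partner counts as an insertion are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical scoring parameters P; the numeric values are illustrative
# assumptions, not the values used by any particular aligner.
P: Dict[str, int] = {"match": 5, "mismatch": -4, "insertion": -12, "deletion": -12}

# One alignment column: (index into the short read, index into the reference).
# None on one side marks a gap on that side.
Column = Tuple[Optional[int], Optional[int]]


def alignment_score(read: str, ref: str, alignment: List[Column],
                    params: Dict[str, int] = P) -> int:
    """Sum one scoring factor P_l per alignment column."""
    score = 0
    for i, j in alignment:
        if i is None and j is None:
            raise ValueError("an alignment column cannot be a gap on both sides")
        if i is None:
            kind = "deletion"    # reference base with no read base (assumed convention)
        elif j is None:
            kind = "insertion"   # read base with no reference base (assumed convention)
        elif read[i] == ref[j]:
            kind = "match"
        else:
            kind = "mismatch"
        score += params[kind]
    return score


# Ungapped example: three matches and one mismatch.
ungapped = alignment_score("ACGT", "ACTT", [(0, 0), (1, 1), (2, 2), (3, 3)])

# Gapped example: the read lacks the reference base at position 2.
gapped = alignment_score("ACT", "ACGT", [(0, 0), (1, 1), (None, 2), (2, 3)])
```

Because the score is a plain sum over columns, extending an alignment by one column changes the score by exactly one parameter from `P`; this additivity is what allows alignment scores to be accumulated incrementally.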

Solve

* Find an optimal alignment $A(s,r)$ between short sequence s and reference genome r such that $S_A(s,r)$...

HIVE-Hexagon: High-Performance, Parallelized Sequence Alignment for Next-Generation Sequencing Data Analysis
