About the Authors:
Hyunsung John Kim
* E-mail: [email protected] (HJK); [email protected] (NP)
Affiliation: Biomolecular Engineering Department, Baskin School of Engineering, University of California, Santa Cruz, Santa Cruz, California, United States of America
Nader Pourmand
* E-mail: [email protected] (HJK); [email protected] (NP)
Affiliation: Biomolecular Engineering Department, Baskin School of Engineering, University of California, Santa Cruz, Santa Cruz, California, United States of America
Introduction
Hematopoietic stem cell transplantation (HSCT) has successfully treated a wide variety of diseases, including autoimmune disorders, rare genetic diseases and blood cancers [1]. The success of allogeneic HSCT is highly correlated with the matching of Human Leukocyte Antigen (HLA) alleles between donor and recipient [2]. Matching a donor and recipient requires HLA haplotyping, which typically relies on specialized antibody or targeted DNA-based tests [3]. The utility of high-throughput sequencing methods for this purpose has been restricted by the need for specialized primer sets to enrich targeted DNA or RNA sequences [4–10]. The use of untargeted RNA-seq data for HLA haplotyping has not seen much development despite dramatic reductions in sequencing costs. This is unfortunate, as RNA-seq assays have proven to be useful tools in personalized medicine for the subclassification of cancers, including breast and prostate cancers, leukemias and lymphomas [11–15]. For a patient who would benefit from RNA sequencing and whose disease requires HSCT, it would be cost-effective to predict HLA haplotypes directly from RNA-seq data rather than perform an additional specialized test.
Predicting HLA haplotypes, however, has historically been a difficult task [16], because the HLA genes reside in the most polymorphic region of the human genome, the Major Histocompatibility Complex (MHC) [17]. As a result of balancing selection, the number of known haplotypes for many HLA genes is in the thousands [18,19]. Despite the diversity of the region, a high degree of sequence similarity exists between known haplotypes, so uniquely assigning a short read to a single allele is nearly impossible. The hierarchical nature of the haplotypes (see Methods for details) and sampling bias, with overrepresentation of the most common haplotypes in public databases (e.g. IMGT), add further complexity. For these reasons, traditional techniques for assigning mapping qualities or...