About the Authors:
Yan Zhang
Contributed equally to this work with: Yan Zhang, Shantao Li
Affiliations: Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America; Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio, United States of America
ORCID http://orcid.org/0000-0002-3357-5121
Shantao Li
Contributed equally to this work with: Yan Zhang, Shantao Li
Affiliation: Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
ORCID http://orcid.org/0000-0002-5440-2780
Alexej Abyzov
* E-mail: [email protected] (AA); [email protected] (MBG)
Affiliation: Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
Mark B. Gerstein
* E-mail: [email protected] (AA); [email protected] (MBG)
Affiliations: Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America; Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
ORCID http://orcid.org/0000-0002-9746-3719

Abstract
Retroduplications come from reverse transcription of mRNAs and their insertion back into the genome. Here, we performed comprehensive discovery and analysis of retroduplications in a large cohort of 2,535 individuals from 26 human populations, as part of 1000 Genomes Phase 3. We developed an integrated approach to discover novel retroduplications combining high-coverage exome and low-coverage whole-genome sequencing data, utilizing information from both exon-exon junctions and discordant paired-end reads. We found 503 parent genes having novel retroduplications absent from the reference genome. Based solely on retroduplication variation, we built phylogenetic trees of human populations; these represent superpopulation structure well and indicate that variable retroduplications are effective population markers. We further identified 43 retroduplication parent genes differentiating superpopulations. This group contains several interesting insertion events, including a SLMO2 retroduplication and insertion into CAV3, which has a potential disease association. We also found retroduplications to be associated with a variety of genomic features: (1) Insertion sites were correlated with regular nucleosome positioning. (2) They, predictably, tend to avoid conserved functional regions, such as exons, but, somewhat surprisingly, also avoid introns. (3) Retroduplications tend to be co-inserted with young L1 elements, indicating recent retrotranspositional activity, and (4) they have a weak tendency to...