1. Introduction
Pinaceae is the largest gymnosperm family, consisting of 10 genera and more than 230 species. Most Pinaceae species are classified as forest and timber species, and they are mainly distributed in the Northern Hemisphere [1]. Pinus densiflora Siebold & Zucc., also called Korean red pine, is widely distributed in East Asia, including the Korean Peninsula, Japan and China [2]. This species occupies more than 23% of forest land in South Korea and is the most important and popular coniferous species in Korea [3].
The chloroplast genome is a valuable resource in molecular phylogenetic studies [4,5]. It usually exhibits uniparental inheritance and contains conserved sequences as a result of its slower rate of evolutionary change compared to nuclear genomes. Specifically, the genome consists of multiple copies of a circular DNA molecule (110–210 kb) in a chloroplast [6]. The chloroplast DNA of seed plants has a quadripartite structure that includes a large single copy (LSC) region, a small single copy (SSC) region, and a pair of inverted repeats (IRs) [7]. Most land plants have 110–130 genes in their chloroplast genome [8].
The IR structures of a chloroplast genome typically range in length from 15 kb to 30 kb [6]. Because of the expansion or rearrangement of IRs, there are variations in gene number and order among species. Large IRs play an important role in stabilizing chloroplast genomes and, consequently, large IR loss can lead to some variation in the genome structure and gene contents [9]. Reductions in IR sequences have been identified in species of Pinaceae, Taxaceae, Cephalotaxaceae and Legumes [10,11,12].
The development of next generation sequencing (NGS) technologies has made it simple and inexpensive to obtain complete chloroplast sequences [13]. However, these technologies come with limitations, including the fact that they generate short reads of no more than a few hundred bases. Short reads cause mis-mappings and mis-alignments, making heterozygous and repetitive regions of the genome inaccessible [14]. In addition, it is difficult to identify structural variation or haplotypic structure using short reads alone [15]. Oxford Nanopore Technology (ONT), a new third-generation sequencing platform, promises read lengths that are orders of magnitude longer than those of previous technologies, delivered by a small, USB-powered sequencer [16]. Long ONT reads make assembly more efficient because a sequence can be completed without piecing together many short reads. To overcome the high error rate of ONT reads, high quality MiSeq short reads can be used for error correction [17].
In this study, we characterized the complete single-molecule chloroplast genome of Pinus densiflora using long reads from the Oxford Nanopore MinION (ONM), with Illumina MiSeq short reads used for error correction. We described the gene content of the chloroplast genome and compared it with that of related species. We also performed a phylogenetic analysis based on the chloroplast genomes of 12 conifers.
2. Materials and Methods
2.1. Sampling, DNA Extraction and Sequencing
Fresh leaves of P. densiflora (MK285358) were collected from a designated cultural heritage site (333–360, Jungyeong-gil, Miro-myeon, Samcheok-si, Gangwon-do, Republic of Korea, N 37˚ 22′ 2″ E 129˚ 3′ 32″), and total genomic DNA was extracted from ~100 mg of frozen leaves using the Exgene™ Plant SV Kit (GeneAll Biotechnology Co., LTD: Seoul, Republic of Korea) following the manufacturer’s instructions. Two types of DNA libraries were prepared in accordance with the manufacturers’ instructions: an Illumina MiSeq library using a TruSeq Nano DNA Kit with a 670 bp average insert size, and an Oxford Nanopore library using a rapid sequencing kit (SQK-RAD004, Oxford Nanopore Technologies: Oxford, UK) with a 55 kbp average insert size. The two genomic libraries were then sequenced on the Illumina MiSeq and Oxford Nanopore MinION platforms at PHYZEN (
2.2. Assemblies of Chloroplast Genome Sequences and Annotation
ONM raw data was basecalled with the default option of the program Albacore (
2.3. Alignments and Construction of a Phylogenetic Tree
A BLASTN search was performed on the NCBI nucleotide database using the newly sequenced P. densiflora chloroplast genome sequence. From this, the entire chloroplast genome sequence of the species with the highest similarity was identified and downloaded. Assembled chloroplast DNA (cpDNA) sequences were compared to the chloroplast genome of the most similar species, P. sylvestris, by BLASTZ analysis (Figure S2). Sequence alignments of the sequences containing protein coding genes (CDSs) were conducted using mVISTA [21]. For phylogenetic analysis, 12 Pinaceae species (Pinus, Picea, Larix and Abies) were selected, together with one Taxus species, T. baccata, as an outgroup. From the 13 species, 59 commonly conserved protein coding genes were extracted: accD, atpA, atpB, atpE, atpF, atpH, atpI, chlB, chlL, chlN, infA, matK, petA, petB, petD, petG, psaA, psaB, psaC, psaI, psaJ, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, rbcL, rpl14, rpl16, rpl2, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps18, rps19, rps3, rps4, rps7, rps8. Multiple sequence alignment was conducted using the MAFFT program (version 7,
3. Results
3.1. The Structure of Chloroplast Genomes of P. densiflora
The complete chloroplast genome of Pinus densiflora was sequenced using an ONM sequencer and error corrected with Illumina MiSeq short reads. After mapping the ONM reads to the completed chloroplast genome sequence, the sequencing error rate for each base position was calculated to be about 9.29% (Figure S3). The total amount of raw data produced was about 14.6 Gb for MiSeq and 2.1 Gb for ONM (Table 1). The average coverage of the chloroplast genome was 303.27× for MiSeq and 28.46× for ONM. The completed chloroplast genome sequence of P. densiflora was submitted to GenBank (accession number MK285358).
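As a rough illustration of what a per-base error rate means, the sketch below counts mismatches, insertions and deletions in a pairwise alignment of one read against the reference. It is a toy example under stated assumptions: the function name `per_base_error_rate` and the aligned strings are hypothetical, and the mapping tool and exact calculation used in this study are not specified here.

```python
def per_base_error_rate(ref_aln: str, read_aln: str) -> float:
    """Fraction of alignment columns where the read disagrees with the reference.

    Both inputs are rows of one pairwise alignment: equal length, '-' for gaps.
    Mismatches, insertions and deletions all count as one error per column.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    if not ref_aln:
        raise ValueError("alignment is empty")
    errors = sum(1 for r, q in zip(ref_aln, read_aln) if r != q)
    return errors / len(ref_aln)


# Hypothetical toy alignment: one mismatch and one deletion over 10 columns.
ref = "ACGTACGTAC"
read = "ACGAACG-AC"
rate = per_base_error_rate(ref, read)
print(f"{rate:.2%}")
```

In practice the same ratio would be accumulated over all mapped reads, for example from the alignment records of a read mapper, rather than from a single hand-written pair.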
The chloroplast genome of P. densiflora was a circular DNA molecule of 119,875 bp, with the typical quadripartite structure composed of an LSC, an SSC and two IRs (IRa and IRb). The total length of the genome (119,875 bp) consisted of the 65,654 bp LSC and the 53,231 bp SSC, as well as the two IRs, which contributed 495 bp each. The newly assembled P. densiflora chloroplast genome (119,875 bp) was slightly larger than those of the other reported Pinus species, except loblolly pine (P. taeda) (Table 2).
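The region lengths can be checked with simple arithmetic: in a quadripartite plastome the genome size should equal LSC + SSC + 2 × IR. The short sketch below applies this check to the values reported in Table 2; the names `PINUS_PLASTOMES` and `quadripartite_total` are illustrative only.

```python
# (species, genome size, LSC, SSC, length of one IR), all in bp, as listed in Table 2.
PINUS_PLASTOMES = [
    ("P. densiflora", 119_875, 65_654, 53_231, 495),
    ("P. sylvestris", 119_758, 65_559, 53_209, 495),
    ("P. thunbergii", 119_707, 65_696, 53_021, 495),
    ("P. tabuliformis", 119_646, 65_618, 53_038, 495),
    ("P. taeda", 121_530, 66_272, 54_288, 485),
]


def quadripartite_total(lsc: int, ssc: int, ir: int) -> int:
    """Genome size implied by one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


for species, genome, lsc, ssc, ir in PINUS_PLASTOMES:
    total = quadripartite_total(lsc, ssc, ir)
    status = "ok" if total == genome else "MISMATCH"
    print(f"{species:16s} {total:>7,d} bp  {status}")
```

For P. densiflora this gives 65,654 + 53,231 + 2 × 495 = 119,875 bp, matching the reported genome size.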
The P. densiflora chloroplast genome encoded 108 unique genes, including 72 protein-coding, 32 tRNA and 4 rRNA genes (Table 3). One of these genes (trnI-CAU) was repeated in the IR regions, two genes (psaM and trnS-GCU) had two copies in the LSC, and another two genes (trnH-GUG and trnT-GGU) had one copy in each of the LSC and SSC. The duplicated copies were identical in sequence, and four of the five pairs were in inverted orientation (the exception being trnT-GGU). As a result of duplication, a total of 113 genes (encoding 73 proteins, 36 tRNAs and 4 rRNAs) were detected in the chloroplast genome. Among the 108 unique genes, 12 genes included one intron and two genes contained two introns. The introns ranged in size from 479 bp (for trnL-UAA) to 2,501 bp (for trnK-UUU). The rps12 gene was found to have 3 exons and was assumed to require trans-splicing, as exon1 was found in the LSC region, while exon2 and exon3 were in the SSC region. Five ndh genes were pseudogenized: ndhB, ndhD, ndhE, ndhH and ndhI. The gene map of the newly generated P. densiflora chloroplast genome is shown in Figure 1.
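The relationship between the unique and total gene counts can be tallied directly from the numbers given above. This is a minimal bookkeeping sketch; the dictionary names and the helper `total_counts` are illustrative and not part of the annotation pipeline.

```python
# Unique gene counts per class, as reported in the text.
UNIQUE = {"protein": 72, "tRNA": 32, "rRNA": 4}

# Genes reported with a second copy, mapped to their gene class.
DUPLICATED = {
    "trnI-CAU": "tRNA",
    "psaM": "protein",
    "trnS-GCU": "tRNA",
    "trnH-GUG": "tRNA",
    "trnT-GGU": "tRNA",
}


def total_counts(unique: dict, duplicated: dict) -> dict:
    """Add one extra copy per duplicated gene to the unique counts."""
    totals = dict(unique)
    for gene_class in duplicated.values():
        totals[gene_class] += 1
    return totals


totals = total_counts(UNIQUE, DUPLICATED)
print(sum(UNIQUE.values()), totals, sum(totals.values()))
```

The five duplicated genes raise the count from 108 unique genes to 113 in total: 73 protein-coding, 36 tRNA and 4 rRNA genes.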
3.2. Comparative Analyses of the Chloroplast Genome with Other Pinus Species for the Identification of DNA Variation
The gene content and order of the P. densiflora chloroplast genome were nearly identical to those of the previously published chloroplast genomes of four Pinus species: P. sylvestris (KR476379), P. thunbergii (D17510), P. tabuliformis (KT740995), and P. taeda (KC427273), except for the ycf genes and the psaM gene duplication. Single nucleotide polymorphisms (SNPs) and insertion/deletion (InDel) variations among these species were typically located in the intergenic regions (Figure 2). However, sequences showed low similarity to each other in the ycf1 and ycf2 regions. Also, a large insertion was found in P. densiflora between trnE-UUC and clpP in the LSC.
A detailed comparison of the border structure was performed in five Pinus chloroplast genomes (Figure 3). P. densiflora was most similar to P. sylvestris, which coincided with the pairwise alignment result. P. taeda had an IR structure that was significantly different from the other four species. For P. densiflora, P. sylvestris, P. thunbergii, and P. tabuliformis, gene positioning at the IRs was stable, whereas SSC and LSC showed variable regions. IR regions had trnI-CAU and partial psbA in the IRa/LSC border, and both gene size and location were well-conserved. Genes adjacent to the border in the LSC and SSC regions were found to have 1–11 bp variations. The trnK-UUU gene, located in the LSC region, ranged from 1,500 bp to 1,511 bp away from the SSC/IRa border. The first exon of rpl2 was located 1,463–1,468 bp upstream of the LSC/IRb border. trnF-GAA and trnH-GUG, in SSC, were located 1,530–1,537 bp and 144–169 bp away from the borders, respectively.
A phylogenetic tree based on the sequence alignment of the chloroplast genomes of 12 Pinaceae species and one Taxaceae species showed that P. densiflora was most closely related to P. sylvestris (Figure 4, Table 4).
4. Discussion and Conclusions
In this study, the complete chloroplast genome of P. densiflora was sequenced by ONT, annotated (Figure 1, Table 3) and compared with previously published conifer plastome sequences. We found gene content, order and intron structure were highly conserved among them.
We assembled a nearly complete chloroplast genome using only Oxford Nanopore sequencer data and error correction by MiSeq. The hybrid strategy of ONT combined with MiSeq was first attempted in bacteria [23,24]. It is conventional to use the hybrid method for relatively short genomes, such as chloroplast DNA. This was well demonstrated in a previous study, wherein a 4.6 Mbp chromosome was sequenced essentially perfectly (>99.99% accuracy) [23]. The hybrid pipelines show higher accuracy than assembly using only MiSeq data [17,25]. Further, in the conventional method, additional polymerase chain reaction (PCR) steps are typically performed to confirm short IRs. However, as sequencing has become more accurate with the combination of ONM and Illumina MiSeq, this additional experiment is no longer necessary [26].
The ndh genes, which encode the NAD(P)H-dehydrogenase-like (NDH) complex, are located in the nuclear, mitochondrial and chloroplast genomes. As is the case for other Pinaceae species, the ndh genes were absent in the P. densiflora chloroplast genome (Table 3). Previous studies have suggested that the ndh genes have been transferred to the nuclear genome or left as pseudogenes in the chloroplast genome [27,28]. Thus, we suggest that the ndh genes missing from the chloroplast genome of P. densiflora have either been moved to the nuclear genome or remain in the chloroplast genome as one of five pseudogenes (ndhB, ndhD, ndhE, ndhH, and ndhI) (Figure 1).
There are extremely shortened IR regions in conifers, including Pinus [11,29]. In P. densiflora, we found large IRs had been replaced by two pairs of IRs highly reduced in size (495 bp and 738 bp), a finding which is consistent with previous studies [30]. We focused on shorter IRs for the comparative analysis. Rearrangements are frequently observed in species with loss of IRs, leading to interspecies variation [9]. Short IRs had some variation in size and sequence among the Pinus species (Table 1, Figure 2). It is presumed that loss of IRs occurred during the speciation of conifers from gymnosperm groups (cycads, conifers, Ginkgo, and Gnetales), as extant seed plants such as Cycas taitungensis (23 kbp), Gnetum parvifolium and Ginkgo biloba (17 kbp) have a large pair of IRs which are not found in Taxaceae, Pinaceae, and Cupressaceae [8,31]. With regard to the evolution of Cupressophytes, Li et al. argued that, after an inverted repeat was modified into a tandem repeat, the tandem repeat was divided by a rearrangement into two different parts to become short inverted repeats [32].
Many phylogenetic studies in land plants have used chloroplast genome sequences to analyze relatedness and classify species [33,34,35]. Some CDSs (e.g., matK, rbcL and rpoB) and intergenic regions have been used as barcode markers in phylogenetic studies [36]. However, a small number of markers is inadequate for classifying closely related species because each marker carries little variability [37,38]. For example, the matK and rpoB genes of P. densiflora are identical to those of P. sylvestris. Moreover, we found only 151 and 72 parsimony-informative sites in the alignment of the 13 species investigated when using the matK and rbcL markers, respectively. Also, using CDSs was more accurate than using the whole genome sequence, which is prone to mis-alignments caused by structural variation and large indels (Supplementary data1). In such cases, the full set of commonly conserved CDSs, which contains more information, can be used in phylogenetic analyses [31]. The phylogenetic analysis of 59 CDSs was conducted in 12 Pinaceae species, using Taxus baccata (Taxaceae) as an outgroup. The number of parsimony-informative sites was 2,436 using 59 CDSs (Supplementary data2). Within the genus Pinus, P. densiflora was found to be most closely related to P. sylvestris. The composition and location of IR-adjacent genes also supported this result (Figure 3). P. sylvestris and P. densiflora are known to be crossable [39]. Previous research has shown that these species have very similar signal patterns in comparative analyses of their FISH karyotypes, which indicates that they are closely related [40]. P. taeda presented an IR structure of a different size, which is consistent with conventional plant taxonomic knowledge. The phylogenetic tree constructed in this study (Figure 4) revealed that P. taeda is genetically distant from the Pinus species with two leaves [1,41]. The two groups (sect. Pinus and sect. Cembra) consisted of eight pine species, which supports the results of previous studies [1,42,43].
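The marker comparison above rests on counts of parsimony-informative sites. Under the standard definition, an alignment column is parsimony-informative when at least two different character states each occur in at least two sequences. The sketch below applies that definition to a toy alignment; the function name, the set of ignored symbols and the example sequences are assumptions for illustration, and the software used for the counts reported in this study is not specified here.

```python
from collections import Counter


def count_parsimony_informative(alignment, ignore="-N?"):
    """Count columns with at least two states that each occur in >= 2 sequences.

    `alignment` is a list of equal-length aligned sequences. Gaps and
    ambiguous symbols listed in `ignore` are skipped when counting states.
    """
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    informative = 0
    for column in zip(*alignment):
        counts = Counter(c.upper() for c in column if c.upper() not in ignore)
        shared_states = sum(1 for n in counts.values() if n >= 2)
        if shared_states >= 2:
            informative += 1
    return informative


# Hypothetical toy alignment of four sequences.
toy = [
    "ACGTA",
    "ACGTA",
    "ATGCA",
    "ATGCC",
]
n_sites = count_parsimony_informative(toy)
print(n_sites)
```

In the toy alignment only the second and fourth columns qualify: the last column has a variant that appears in a single sequence, so it is variable but not parsimony-informative.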
In this work, complete chloroplast sequencing using Oxford Nanopore Technology was newly attempted, assisting in the development of a method for obtaining more accurate sequences. The variations found in this study, which distinguish P. densiflora from other Pinus species, are useful for designing markers for species identification. The complete chloroplast genome sequence of P. densiflora is also useful for understanding phylogenetic relationships among Pinus species.
Supplementary Materials
Supplementary materials can be found at
Author Contributions
D.S., S.-W.L. and T.J.Y. conceived and designed this study; H.-I.K. wrote the first draft of the manuscript; H.-I.K. and H.O.L. performed the analysis and prepared figures and tables; I.H.L., I.S.K., S.-W.L., T.J.Y., and D.S. carefully checked and revised the manuscript.
Funding
This research received no external funding.
Acknowledgments
This work was supported by the National Institute of Forest Science, Republic of Korea (FG0400-2017-01).
Conflicts of Interest
The authors declare no conflict of interest.
Figures and Tables
Figure 1. Gene map of the P. densiflora chloroplast genome. Genes drawn inside the circle are transcribed clockwise, while those drawn outside are transcribed counterclockwise. Different functional gene groups are color-coded. A GC-content graph is depicted within the inner circle. The circle inside the GC content graph marks the 50% threshold.
Figure 2. Pairwise alignments of chloroplast genome sequences of four Pinus species, each with that of P. densiflora.
Figure 3. Distance between adjacent genes and junctions of the small single copy (SSC), large single copy (LSC) and two inverted repeats (IRs) among plastid genomes from five Pinus species.
Figure 4. Phylogenetic tree based on protein coding genes (CDSs) of P. densiflora and 12 reference species. Bootstrap values (%) are shown above branches.
Table 1. Information on next generation sequencing (NGS) data of Pinus densiflora sequenced in this study.
Sequencing Platform | Input Reads | Trimmed Reads | Raw Bases | Trimmed Bases |
---|---|---|---|---|
MiSeq | 49,013,296 | 40,786,223 (83.21%) | 14,563,262,097 | 10,101,761,675 (69.36%) |
ONM | 305,965 | 306,493 (100.21%) | 2,116,530,768 | 2,089,930,503 (98.74%) |
Table 2. Summary of Pinus chloroplast genome features.
Species | Genome Size (bp) | LSC Length (bp) | SSC Length (bp) | IR Length (bp) | Number of Genes |
---|---|---|---|---|---|
P. densiflora | 119,875 | 65,654 | 53,231 | 495 | 113 |
P. sylvestris | 119,758 | 65,559 | 53,209 | 495 | 112 |
P. thunbergii | 119,707 | 65,696 | 53,021 | 495 | 113 |
P. tabuliformis | 119,646 | 65,618 | 53,038 | 495 | 114 |
P. taeda | 121,530 | 66,272 | 54,288 | 485 | 110 |
Table 3. List of genes annotated in the chloroplast genome of Pinus densiflora sequenced in this study.
Function | Genes |
---|---|
RNAs, ribosomal | rrn4.5, rrn5, rrn16, rrn23 |
RNAs, transfer | trnA-UGC *, trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnG-GCC, trnG-UCC *, trnH-GUG, trnI-CAU, trnI-GAU *, trnK-UUU *,T, trnL-CAA, trnL-UAA *, trnL-UAG, trnM-CAU, trnfM-CAU, trnN-GUU, trnP-UGG, trnP-GGG, trnQ-UUG, trnR-ACG, trnR-CCG, trnR-UCU, trnS-GCU, trnS-GGA, trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC, trnV-UAC *, trnW-CCA, trnY-GUA |
Transcription and splicing | rpoA, rpoB, rpoC1 *, rpoC2, matK |
Translation, ribosomal proteins | |
Small subunit | rps2, rps3, rps4, rps7, rps8, rps11, rps12 **,T, rps14, rps15, rps18, rps19 |
Large subunit | rpl2 *, rpl14, rpl16 *, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36 |
Photosynthesis | |
ATP synthase | atpA, atpB, atpE, atpF *, atpH, atpI |
Photosystem Ⅰ | psaA, psaB, psaC, psaI, psaJ, psaM, ycf3 **, ycf4 |
Photosystem Ⅱ | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ |
Calvin cycle | rbcL |
Cytochrome complex | petA, petB *, petD *, petG, petL, petN |
Chlorophyll biosynthesis | chlB, chlL, chlN |
Others | clpP, accD, cemA, ccsA, infA, ycf1, ycf2, ycf12 |
* Genes containing one intron; ** genes containing two introns; T trans-splicing of the related gene. Genes in boldface type have two gene copies.
Table 4. Chloroplast genome comparison with 12 conifer species.
Species | Accession No. | No. of Protein Coding Genes | No. of Common Protein Coding Genes with P. densiflora |
---|---|---|---|
Pinus sylvestris Linn. | KR476379 | 73 | 73 |
Pinus tabuliformis Carr. | KT740995 | 74 | 73 |
Pinus thunbergii Parl. | D17510 | 69 | 69 |
Pinus taeda L. | KC427273 | 71 | 71 |
Pinus strobus Linn. | KP099650 | - | - |
Pinus koraiensis Sieb. et Zucc. | AY228468 | 74 | 72 |
Pinus sibirica (Loud.) Mayr | KT723438 | 77 | 73 |
Picea abies (L.) Karst. | HF937082 | 74 | 72 |
Larix decidua Mill | AB501189 | 72 | 71 |
Abies koreana Wils | KP742350 | 74 | 73 |
Abies sibirica Ledeb. | KR476376 | 74 | 73 |
Taxus baccata L. | KR476375 | 81 | 70 |
© 2019 by the authors.
Abstract
Pinus densiflora (Korean red pine) is widely distributed in East Asia and considered one of the most important species in Korea. In this study, the complete chloroplast genome of P. densiflora was sequenced by combining the advantages of Oxford Nanopore MinION and Illumina MiSeq. The sequenced genome was then compared with previously published conifer plastomes. The chloroplast genome was found to be circular and to have a quadripartite structure, including 113 genes encoding 73 proteins, 36 tRNAs and 4 rRNAs. It had short inverted repeat regions and lacked ndh gene family genes, which is consistent with other Pinaceae species. The gene content of P. densiflora was found to be most similar to that of P. sylvestris. The newly attempted sequencing method could be considered an alternative method for obtaining accurate genetic information, and the chloroplast genome sequence of P. densiflora revealed in this study can be used in the phylogenetic analysis of Pinus species.
Details
1 National Institute of Forest Science, Department of Forest Bio-Resources, Suwon 16631, Korea
2 PHYZEN Genome Institute, 605, Baekgoong Plaza1, Seongnam-si, Gyeonggi-do 13558, Korea
3 Department of Plant Science, Plant Genomics and Breeding Institute, and Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Korea