Insights into population structure of East

Full text

Turn on search term navigation

Revised Amendments from Version 1

A number of misspellings and incorrect claims were revised following the reviewers’ comments. Some misleading and confusing sentences were corrected in the new version. Figure 1d has been slightly modified to remove red dots indicating SNPs. The figure legend has been modified accordingly.

Introduction

The chloroplast (cp) genome has been widely used to study the phylogeography, molecular systematics and the population genetics for plants^1,2. The chloroplast DNA (cpDNA) usually displays uniparental inheritance and represents a relatively high degree of conservation in genome structure and gene content². There are over 800 complete cp sequences available for a wide variety of plants from National Center for Biotechnology Information (NCBI) repository ranging in size from 107 to 218 Kb³. The cp genomes usually contain 110–130 protein encoding genes (PEGs), about 30 transfer RNA (tRNA) genes and four ribosomal RNA (rRNA) genes, primarily participating in the process of photosynthesis^3,4. The cpDNA typically forms a circular quadripartite structure with two inverted repeats (IRs), IRA and IRB, separated by one large single-copy section (LSC) and one small single-copy section (SSC)⁵.

The first cpDNA was sequenced from tobacco (Nicotiana tabacum) using the bacterial artificial chromosome (BAC) sequencing method in 1986⁶. The two IRs were cloned separately in order to distinguish between them. A plethora of cpDNA had since been sequenced with similar methods^7–9. Besides BAC sequencing, an alternative strategy used to sequence cpDNA is whole-cp-genome amplification by rolling-circle amplification (RCA) technology^10–12. However, both approaches require complicated library preparation.

The development of next-generation sequencing (NGS) technologies such as Illumina and Roche 454 facilitate faster and cheaper methods to sequence cp genomes^13–15. The output of the NGS technologies is short reads of size up to a few hundred base pairs. It is difficult to assemble cp genome with short reads only, especially because of the two large IRs of tens of kilobase pairs. In order to solve this problem, a reference cp genome, normally from a related species, is usually used to anchor the contigs assembled from the short reads^4,16. The long reads generated from the third-generation sequencing (TGS) technologies, such as the single-molecule real-time (SMRT) PacBio sequencing and Oxford Nanopore sequencing, can also be used to anchor the contigs and solve the repetitive regions. It is even possible to assemble cp genomes directly from long reads¹⁷. However, as the sequencing error rate of the long reads from the TGS is typically higher than 10%, it is important to introduce an error correction step to guarantee an accurate genome assembly¹⁸. The high-quality NGS short reads can be integrated for error correction to improve accuracy^19,20.

The aforementioned methods to construct cp genomes from NGS or TGS data assume pure cpDNA were sequenced. More precisely, the cpDNA were isolated from the nuclear DNAs and other organelle DNAs before sequencing^4,13–16. However, whole-genome sequencing (WGS) data generated from NGS or TGS technologies always contains cp sequences at various levels determined by the tissue type and library preparation. Normally we are able to gain enough coverage of cp genome for assembly even from low coverage WGS data. There have since been several studies describing assembly of cp genomes from WGS data^21–27. Extraction of cp sequences from the WGS data plays a key role in these methods. The most straightforward idea is to use a reference cp genome. The cp sequences could be extracted by examining the mapping results of the WGS data to the reference cp genome^21,22. An alternative strategy relies upon the fact that there are many more copies of the cpDNA than the nuclear DNA and that from other organelles. The entire WGS data is assembled to construct contigs. Contigs that represent significantly higher coverages are treated as cp contigs^23–25. NOVOPlasty adopted a seed-and-extend paradigm, where the seed could be a cp read sequence, a conserved gene or a cp genome from a related species²⁶. The start and the end of a given seed sequence are iteratively extended with reads that are overlapped with the seed until the circular genome is formed. Izan et al. proposed a K-mer frequency-based selection of cpDNA sequences from WGS data, which was integrated into a reference free cp genome assembler for non-model species²⁷.

Sweetpotato (Ipomoea batatas) ranks among the ten most important food crops worldwide²⁸. The total annual production is more than 100 million metric tonnes grown on about 8.6 million hectares around the world in year 2016²⁹. Understanding the sweetpotato genomes is of significant importance to achieve the full potential of the sweetpotato³⁰. Sweetpotato is a hexaploid (2n=6x=90) with genome size estimated to be between 2,200 to 3,000 Mb²⁸. Due to the complex genome structure, the availability of sweetpotato genomic resources is lacking. Under these circumstances, the cp genome provides researchers with an easy and efficient way to study sweetpotato^4,16,31,32. A number of cp genomes from the genus Ipomoea have been sequenced^4,16,33,34. Most of them are diploid wild relatives of the sweetpotato. The genome size is around 161 Kb, and the structure represents a standard quadripartite circular with a LSC of 87 Kb, a SSC of 12 Kb and two IRs of 31 Kb⁴. The cp genomes were mainly used to perform phylogenetic analyses^4,16,34.

In the present study, we constructed a complete cp genome assembly for the hexaploid sweetpotato cultivar Tanzania³⁵ using long reads produced by the Oxford Nanopore sequencing technology. Despite the <1× genome coverage, we obtained approximately 270× data coverage for the cp genome. Illumina sequencing data was integrated to improve the accuracy of the genome assembly. Using the Tanzania cp genome assembly as a reference, we constructed 19 cp genomes for a further 17 sweetpotato cultivars (including a duplicate for one cultivar) and an I. triloba line from paired-end whole genome Illumina sequence data. The assembled sweetpotato cp genomes were combined to perform phylogenetic analysis to investigate the population structure of 18 East African sweetpotato cultivars. Putting together the assembled cp genomes and nine publicly available cp genomes of the sweetpotato and its wild relatives, we performed a phylogenetic analysis to investigate the phylogenetic relationship for species in Convolvulaceae Ipomoea section Batatas.

Results

Extraction of cp genome sequence from whole genome sequencing data

We generated high-coverage (60×) 150 bp paired-end Illumina WGS data, and low-coverage (<1×) Oxford Nanopore WGS data on a single cultivar, referred to as Tanzania³⁵ (Methods). The cultivar Tanzania was used as one of the parents to develop an F1 outcrossing mapping population (B×T) in the Genomic Tools for Sweetpotato (GT4SP) Improvement Project³⁰. Approximately 162,000 Nanopore reads and 1.46 billion Illumina reads were generated (Supplementary Table 1). A total of 6,710 Nanopore reads were identified for cp genome by mapping to 30 publicly available cp genomes of the species from the Convolvulaceae Ipomoea family^4,16,33,36 (Methods, Supplementary Table 2). The total size is ~43.9 Mb, which represents ~270× data coverage for the cp genome. The longest read is ~30 Kb, and the average size is ~6.5 Kb (Supplementary Figure 1). We identified approximately 45 million Illumina reads for cp genome by mapping to the publicly available cp genomes summing to ~6.2 Gb, which were used for error correction for Nanopore reads and the genome assembly. The other parent for the B×T F1 outcrossing mapping population, Beauregard, was subject to whole genome sequencing at 60× coverage (Methods). A total of approximately 1.3 billion 150 bp Illumina reads were generated summing to ~164 Gb, of which approximately 52 million reads were identified as cp sequences with a total size of ~7.2 Gb (Supplementary Table 1). We performed Illumina WGS at 30× coverage for a further 16 sweetpotato cultivars—Wagabolige and New Kawogo³⁵, Ejumula and SPK004³⁷, NASPOT 1 and NASPOT 5³⁸, NASPOT 7 and NASPOT 10 O³⁹, NK259L and NASPOT 11⁴⁰, Huarmeyano, Dimbuka-Bukulula and NASPOT 5/58⁴¹, Resisto⁴², Magabali⁴³ and Mugande⁴⁴. These cultivars were used as the parental genotypes in the Mwanga Diversity Panel (MDP) which is an 8×8 diallele diversity mating panel constructed by the GT4SP project for genomic selection of the sweetpotato. While the great majority of these sweetpotato cultivars were from East African countries including Uganda and Kenya, Resisto was from USA and Huarmeyano was from Peru (Supplementary Table 3). We have duplicate samples for the cultivar NASPOT 10 O—one was from the screen-house while the other one was from the field. These two NASPOT 10 O samples were analysed separately in this research (Methods). On average, a total of approximately 75 million 251 bp reads were generated for each sample. The number and the total size of the cp reads extracted from the whole genome sequence data, on average, are ~4.4 million and ~1 Gb respectively for each sample (Supplementary Table 1).

We performed Illumina whole genome sequencing at 50× coverage for the I. triloba line, NCNSP-0323³⁰ (Methods). The raw whole genome sequence data consists of approximately 196 million 150bp reads summing to ~29 Gb. We extracted approximately 13 million reads for the cp genome from the raw sequence data summing to ~2 Gb (Supplementary Table 1).

Cp genome assembly for the sweetpotato cultivar Tanzania

We combined the Nanopore long reads with Illumina short reads to construct a cp genome assembly for the sweetpotato cultivar Tanzania (Methods). After trimming off the low-quality bases, approximately 2.2 Gb Illumina sequence data remained which was used for error correction for the Nanopore reads with Nanocorr (Supplementary Table 1). A total of 70 low quality Nanopore reads were removed after error correction and the total size reduced to approximately 43.2 Mb (Figure 1a), which was used to construct a draft genome assembly using Canu. The resulting genome assembly of approximately 218 Kb consists of three contigs of size 46 Kb, 39 Kb and 132 Kb, respectively. Compared to the published sweetpotato cp genome, the assembly is split at the boundaries of the two IRs (Figure 1b). Utilizing the overlap information between the contigs, the AMOS minimus combined the three contigs and generated a single contig of ~183 Kb (Figure 1c) (Methods). The contig contains a ~20 Kb redundancy at the ends which was removed after circularization (Figure 1d). The circularized contig is ~161 Kb, and is highly collinear with the reference cp genome assembly (Figure 1d). Application of Pilon further identified and corrected 42 single-nucleotide polymorphisms (SNPs) and small indels. To follow the paradigm of the published cp genomes, we restructured the genome assembly so that it starts from the LSC (Methods). The final genome assembly consists of a single circular contig of 161,274 bp (Figure 1e).

View Image - Figure 1. Assembly of the Tanzania chloroplast (cp) genome. (a) Dot plot of the Nanopore read length versus the alignment identity to reference assembly. The read alignment identity is defined as I = M/L, where M is the total number of base pairs of the exact match and L is the size of the alignment span on the reference genome. The reference genome is the 30 cp genomes downloaded from the NCBI (Supplementary Table 2). The alignment was performed with BWA MEM45. The alignment identities were calculated from the Cigar string. The purple and yellow represents before and after error correction with Illumina reads using Nanocorr20, respectively. (b) Dot plot of the reference cp genome versus the contigs produced by Canu17. (c) Dot plot of the reference cp genome versus the contigs produced by AMOS minimus46 after merging Canu contigs. (d) Dot plot of the reference cp genome versus the contigs produced by AMOS minimus after circularization. (e) Dot plot of the reference cp genome versus the final cp genome assembly which was polished with Illumina reads using Pilon19 and fixed the start at the LSC. For (b–e), the cp genome assembly of the I. trifida was used as the reference (accession number REM 753, Genbank accession number KF242496)16. The green bars on the x-axis indicate positions of the two IRs.

Figure 1. Assembly of the Tanzania chloroplast (cp) genome. (a) Dot plot of the Nanopore read length versus the alignment identity to reference assembly. The read alignment identity is defined as I = M/L, where M is the total number of base pairs of the exact match and L is the size of the alignment span on the reference genome. The reference genome is the 30 cp genomes downloaded from the NCBI (Supplementary Table 2). The alignment was performed with BWA MEM45. The alignment identities were calculated from the Cigar string. The purple and yellow represents before and after error correction with Illumina reads using Nanocorr20, respectively. (b) Dot plot of the reference cp genome versus the contigs produced by Canu17. (c) Dot plot of the reference cp genome versus the contigs produced by AMOS minimus46 after merging Canu contigs. (d) Dot plot of the reference cp genome versus the contigs produced by AMOS minimus after circularization. (e) Dot plot of the reference cp genome versus the final cp genome assembly which was polished with Illumina reads using Pilon19 and fixed the start at the LSC. For (b–e), the cp genome assembly of the I. trifida was used as the reference (accession number REM 753, Genbank accession number KF242496)16. The green bars on the x-axis indicate positions of the two IRs.

Cp genome assembly for the other 17 sweetpotato cultivars and the I. triloba line NCNSP-0323

The cp sequence data was subjected to quality control before assembled with SPAdes (Methods). After trimming off the low-quality regions, the total sizes of the sequence data of the 19 samples range from approximately 267 Mb to 2.67 Gb (Supplementary Table 1). The contigs generated from SPAdes for the 19 samples vary in numbers and sizes: the minimum number of contigs is 76 for the cultivar NASPOT 7, while the maximum number is 197 for the cultivar Beauregard; and the total sizes of the genome assemblies range from ~169 Kb (cultivars Ejumula and NASPOT 7) to ~229 Kb (cultivar NK259L) (Supplementary Table 4). The SPAdes contigs were then mapped to the Tanzania cp genome assembly for anchoring (Methods). The resulting genome assemblies for the 19 samples are very similar. The largest and the smallest genome assembly is 161,509bp and 161,198bp, derived from the cultivar NASPOT 5 and Beauregard, respectively (Supplementary Table 4).

Molecular structure and gene content of the sweetpotato cp genome

The gene annotation of the cp genome assembly of the sweetpotato cultivar Tanzania was generated with the web tool DOGMA and further refined with MUSCLE (Methods). The circular plot of the gene annotation is depicted in Figure 2. The sweetpotato cp genome represents a common circular structure with two IRs (IRA and IRB) separating one LSC and one SSC2. The size of the IRA, IRB, LSC and SSC is 30,874, 30,835, 87,489 and 12,076 bp, respectively. The overall GC content of the sweetpotato cp genome is 37.54%. The GC contents in different regions are highly variable. The two IRs represent significantly higher GC content than the single-copy regions: for the LSC and SSC, the GC content is 36.14% and 32.20%, respectively, whereas for the two IRs, the GC content is 40.57%. This is mainly caused by the high GC content ribosomal RNA genes in IR regions, including rrn16, rrn23, rrn4.5 and rrn5 (Figure 2). We identified 152 genes in the cp genome of which there are 96 protein encoding genes (PEGs), eight rRNA genes and 48 tRNA genes. Table 1 shows a full list of the functional genes. As we can see, the genes can be divided into 16 functional systems. The number of single-copy and double-copy genes is 71 and 11, respectively, and there is one triple-copy gene (rps12). The results are highly similar to what has been reported for the cultivar Xushu 18 cp genome⁴; the only difference is that the psbZ gene is not found in the cultivar Xushu 18 cpDNA while the ihbA gene is not found in the cultivar Tanzania cpDNA. It should be noted that the double-copy gene ycf1 was not reported for the cultivar Xushu 18 cp genome4, but this was actually a miss-annotation.

View Image - Figure 2. The chloroplast genome of the sweetpotato cultivar Tanzania. The preliminary annotations were produced by DOGMA48. MUSCLE49 was used to refine the annotations. The plot was generated with OGDRAW50.

Figure 2. The chloroplast genome of the sweetpotato cultivar Tanzania. The preliminary annotations were produced by DOGMA48. MUSCLE49 was used to refine the annotations. The plot was generated with OGDRAW50.

Table 1. List of annotated genes.

The functional systems were adopted from the OGDRAW⁵⁰. Bracketed superscripts represent number of copies.

Functional system	Number	Gene list
Photosystem I	7	psaA, psaB, psaC, psaI, psaJ, ycf3, ycf4
Photosystem II	15	psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ
Cytochrome b/f complex	6	petA, petB, petD, petG, petL, petN
ATP synthase	6	atpA, atpB, atpE, atpF, atpH, atpI
NADH dehydrogenase	13	ndhA, ndhB^[2], ndhC, ndhD, ndhE, ndhF, ndhG, ndhH^[2], ndhI, ndhJ, ndhK
RubisCO large subunit	1	rbcL
C-type cytochrome synthesis	1	ccsA
RNA polymerase	4	rpoA, rpoB, rpoC1, rpoC2
Ribosomal proteins (LSU)	9	rpl2, rpl14, rpl16, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36
Ribosomal proteins (SSU)	16	rps2, rps3, rps4, rps7^[2], rps8, rps11, rps12^[3], rps14, rps15^[2] , rps16, rps18, rps19
Maturase K	1	matK
Acetyl-CoA carboxylase carboxyltransferase	1	accD
Clp protease proteolytic subunit	1	clpP
Chloroplast envelope membrane protein	1	cemA
ORFs	6	orf188^[2], orf42^[2], orf56^[2]
Hypothetical chloroplast RF	8	ycf1^[2], ycf15^[2], ycf2^[2], ycf68^[2]

Phylogenetic analysis of the sweetpotato cp genome

We performed a phylogenetic analysis for the Convolvulaceae Ipomoea section Batatas on the basis of the 19 cp genomes of the sweetpotato (I. batatas) and the cp genome of the I. triloba line NCNSP-0323 assembled in this research, coupled with nine publicly available cp genomes, of which, four of them are for sweetpotato and two of them are for I. trifida and the other three are for I. cordatotriloba, I. splendor-sylvae and I. setosa, respectively^4,16 (Supplementary Table 2). The resulted phylogenetic tree is depicted in Figure 3. The 18 sweetpotato cultivars used as the parental genotypes for mapping populations in the GT4SP project represent two distinct clades, consisting of 12 and six cultivars, respectively. Here, the length of any branch in a clade is no greater than 2×10^-4 substitutions per bp. The detailed phylogenetic relationship of the 18 sweetpotato cultivars is shown in Figure 4. As we can see, the distance between the two clades is approximately 5×10^-4 substitutions per bp. In the larger clades, the cultivar Tanzania represents a relatively larger distance (2×10^-4 substitutions ber bp) compared to the other cultivars. The population structure discovered here is similar to the one revealed by using simple sequence repeat primers by David et al. with the exception of the classification of the sweetpotato cultivars NK259L, Resisto and Mugande⁴⁷ (Supplementary Table 3). For the publicly available sweetpotato cp genomes, PI 561258 and Xushu 18 are closely related to the larger clade, while PI 518474 and PI 508520 have a closer relationship with the smaller clade (Figure 3). The diploid wild relative of the hexaploid sweetpotato, I. trifida (REM 753), displays a significantly closer relationship to the I. batatas compared to the other species in the Convolvulaceae Ipomoea section Batatas. The other I. trifida accession PI 618966, however, represents a much larger diversity to the I. batatas and shows a close relationship to the I. triloba line NCNSP-0323 assembled in this research. Interestingly, the accession PI 618966 was originally identified as I. triloba and was recently reidentified as I. trifida by the GRIN National Genetic Resources Program. Among the other three species in the Convolvulaceae Ipomoea section Batatas, the I. cordatotriloba (REM 317) is closely related to the I. triloba (NCNSP-0323) and I. trifida (PI 618966) and therefore displays much closer relationship to the I. batatas compared to the I. splendor-sylvae (REM 763) and I. setosa (REM 68).

View Image - Figure 3. A phylogenetic tree of the Convolvulaceae Ipomoea section Batatas on the basis of chloroplast genomes. The numbers on the branches are bootstrap support values. The branches shorter than 2×10-4 substitutions per bp were collapsed resulting two clades consisting of 12 and 6 sweetpotato cultivars represented by a big and small solid circle respectively in the plot. The plot was generated with iTOL52.

Figure 3. A phylogenetic tree of the Convolvulaceae Ipomoea section Batatas on the basis of chloroplast genomes. The numbers on the branches are bootstrap support values. The branches shorter than 2×10-4 substitutions per bp were collapsed resulting two clades consisting of 12 and 6 sweetpotato cultivars represented by a big and small solid circle respectively in the plot. The plot was generated with iTOL52.

View Image - Figure 4. A phylogenetic tree of the East African sweetpotato cultivars used in the GT4SP project on the basis of chloroplast genomes. This is a fine-scale representation of the two clades in Figure 3. The numbers on the branches are branch lengths given in terms of substitutions per bp.

Figure 4. A phylogenetic tree of the East African sweetpotato cultivars used in the GT4SP project on the basis of chloroplast genomes. This is a fine-scale representation of the two clades in Figure 3. The numbers on the branches are branch lengths given in terms of substitutions per bp.

Discussion

The sweetpotato cp genome contains two ~31 Kb IRs which is very difficult for short-read de novo assemblers. There have been a few studies exploring the possibility to perform de novo assembly of organelle genomes with long reads especially with SMRT PacBio sequencing reads^21,26,51. In this study, we constructed a complete sweetpotato cp genome assembly using the long reads generated from Oxford Nanopore sequencing. Nanopore reads proved to be extremely powerful in assembling the cp genome, especially in solving long repetitive regions. The sweetpotato cp genome contains two ~31 Kb IRs, which is very difficult for short-read de novo assemblers. With the overlapping information from long reads; however, the problem can be easily resolved. Canu¹⁷ provides a useful tool set for assembling Nanopore reads, which was used in this research. It is worth noting that although the average depth of coverage of the whole sweetpotato genome is less than 1×, we obtained enough coverage of the cp genome for assembly.

Although long reads are powerful in solving complex genome structures, the error-prone nature of the raw reads necessitates an extra error-correction step. Illumina reads have been widely used to assist long read error-correction^19,20. The Illumina read-based correction could be performed either on the raw long reads before assembling²⁰ or on the draft genome assembly constructed from the raw reads¹⁹. In the current study, we did both. Before assembling with Canu, the Nanopore reads were corrected with Illumina reads using Nanocorr²⁰. After assembling, the draft genome assembly was polished with Illumina reads using Pilon¹⁹ (Methods). With several pipelines examined, we found that to perform error correction both before and after assembling is the best practice to construct the sweetpotato cp genome.

Assembling the cp genome from the short Illumina reads is challenging owing to the two large IRs. Since the structure of the cp genome is generally stable, reference genomes from the closely related species are usually used to perform reference-based assembling^4,16. In this study, we used the genome assembly constructed from the Nanopore reads as reference to assemble cp genomes for a further 19 cp genomes including 17 sweetpotato cultivars (including a duplicate for one sample) and the I. triloba line NCNSP-0323. SPAdes⁵³ was used as the de novo short-read assembler. The contigs generated by SPAdes were fragmented as expected. Among the 19 genome assemblies, the minimum number of contigs was 76. As the two IRs are highly homologous, there was generally only one copy of repetitive regions being assembled. In order to solve this problem, for reference-based scaffolding, we reused some single-copy contigs from the two IR regions to construct complete cp genome assemblies.

The molecular structure and gene content of the cpDNA are relatively conserved in land plants². Many cpDNAs form a circular quadripartite structure with two IRs separated by one large and one small single-copy section^2,5. All 20 cp genome assemblies constructed in this research represent this common structure. The size of the two IRs of the sweetpotato cpDNA is approximately 31 Kb each, and is much larger than the other plants such as potato¹⁰, rice⁵⁴, wheat⁵⁵, and maize⁵⁶, of which the IRs are usually smaller than 26 Kb. This is highly likely due to gene losses in these species. By comparing the gene annotation of the sweetpotato cpDNA in this study (Figure 2) to the potato cpDNA¹⁰, we can see that, in the potato cpDNA, the boundary region of the IRA and SSC harbors a deletion of approximately 6Kb involved in the genes, ycf1, rsp15 and ndhH. Meanwhile, these three genes are presented in the symmetric boundary region of the IRB and SSC, which explains why the size of the IRs of the potato cpDNA is approximately 6 Kb smaller than the sweetpotato cpDNA.

The cpDNA usually has uniparental inheritance and undergoes low rates of substitution and recombination, which makes it well suited for phylogenetic analysis. The cp genome has been widely used to perform phylogenetic or comparative analysis in previous studies^2,10,16. In this research, we used the complete cp genome assemblies to study the phylogenetic relationship of the 18 sweetpotato potato cultivars used as the parental genotypes for mapping populations in the GT4SP project, as well as the species from the Convolvulaceae Ipomoea section Batatas. The sweetpotato genotypes from the GT4SP project were classified into two distinct clusters, which guarantees the diversities of mapping populations derived from them. The phylogenetic analysis clearly revealed that the I. trifida is the most closely related diploid wild relatives to the hexaploid sweetpotato, I. batatas, which is consistent with conclusions from the previous studies^32,57.

Almost all whole genome sequencing data contains cp sequences, from which we are usually able to obtain cp genome sequences of enough data coverage for de novo assembly. As we can see, all the cp genome assemblies described in this research were constructed using whole genome sequencing data. Given that the cp genome is an important resource for studying plant genomes and whole genome data has gradually become indispensable in modern genome projects, it will be a good practice to construct the cp genome assembly to gain a first insight into the plant genome we are trying to understand before moving to the complex nuclear genome.

Methods

Genome sequencing of the MDP parental genotypes

The 16 sweetpotato cultivars used as the parental genotypes for the MDP diversity panel were subjected to whole genome sequencing. These sweetpotato cultivars were collected from Uganda, Kenya, USA and Peru, and included Wagabolige, New Kawogo, Ejumula, SPK004, NASPOT 1, NASPOT 5, NASPOT 7, NASPOT 10 O, NK259L, NASPOT 11, Huarmeyano, Dimbuka-Bukulula, NASPOT 5/58, Resisto, Magabali and Mugande (Supplementary Table 3). Leaf tissue was ground to a fine powder using the FastPrep-24^TM 5G tissue homogenizer (MP Biomedicals, Santa Ana, California) and DNA extracted from the leaf tissues following published protocols with modifications^58,59. Briefly, tissue was suspended in pre-warmed (65°C) CTAB buffer (200mM Tris-CL, 50mM EDTA, 2M NaCl, 2% CTAB and 3% β-mercapto-ethanol), mixed and heated at 65°C for 45 min prior to extraction with chloroform:isoamyl alcohol (24:1) and precipitated with sodium acetate and ethanol. Paired-end genomic libraries were prepared using the Illumina’s Genomic DNA Sample Preparation kit and sequenced on the Illumina HiSeq 2500 system with paired-end mode and read length of 251 bp (Illumina, San Diego, CA).

10x Genomics’ Chromium sequencing of the sweetpotato cultivar Tanzania and Beauregard

The genomic DNA of Tanzania and Beauregard were extracted using the method cetyltrimethyl ammonium bromide and purified with 1× Agencourt AMPure XP beads (Beckman Coulter), according to manufacturer’s instructions. Before the library preparation, 1.5 µg purified gDNA was size selected using the BluePippin instrument (Sage Science) with the 0.75% Agarose Dye free, Marker U1 High-pass 30–40 kb vs3 protocol followed by a purification step with 0.4× AMPure XP beads. The library preparations for these two samples were done following the Chromium^TM Genome Reagent Kits user guide (CG00022, Rev C). In summary, 10 ng of sample DNA was used to generate Gel Bead-In-Emulsions (GEM) in the Chromium^TM Controller (10× Genomics) followed by isothermal incubation, post GEM incubation cleanup and quality control (QC). Libraries were constructed with end-repair and A-tailing, adaptor ligation, post ligation cleanup using SPRIselect Reagent (Beckman Coulter, USA), sample index PCR, post PCR cleanup, and QC. We modified the protocol by increasing the number of PCR cycles to nice and adding 105 µl SPRIselect reagent for the Post Sample Index PCR Cleanup, which resulted in the recovery of shorter fragments than it was expected. The libraries were sequenced using the HiSeq X Ten platform (Illumina, San Diego, CA).

Oxford Nanopore sequencing of the sweetpotato cultivar Tanzania

Before the MinION library preparation, 5.7 µg Tanzania pure DNA was size selected (start selection size: 8Kb) with the same protocol used in 10x Genomics’ Chromium sequencing. The size selected gDNA was purified with 1× AMPure XP beads. The resulting 950 ng of Tanzania gDNA was used in MinION sequencing library preparation with the SQK-LSK108 1D ligation Sequencing kit (May 2017 version). We modified the protocol as follows: 30 min incubation each end-repair step and adapter ligation; 10 min incubation at RT in the end-repair purification step; 0.7× AMPure XP beads used after adapters ligation and ELB buffer (Oxford Nanopore Technologies) warmed up at 50°C previously to use and incubation of the eluted solution at 50°C. A library of 348 ng was loaded into a FLO-MIN106 (R.9.4 version) flowcell used in a MK1B MinION. We run the 1D protocol in the MinKnow software (version 1.5.18) and we basecalled the raw data using Albacore (version 1.1.0).

Cp genome sequence extraction

WGS data were aligned to 30 publicly available cp genome assemblies of the species from the Ipomoea family^4,16,33,36 (Supplementary Table 2) to extract cp genome reads, using BWA MEM⁴⁵ (version 0.7.15). We used the option ‘-x ont2d’ for Nanopore reads, and default options for Illumina reads. For each Nanopore read, the alignment records with at least 500 bp sequence aligned were selected to calculate the total length of the alignment. A Nanopore read was considered as a cp sequence if at least 1 Kb and 80% of the read aligned. A similar strategy was employed for Illumina reads extraction. Both of the two reads of a read pair were required to be aligned. The minimum size of the alignment block was set to 100 bp.

Cp genome assembly from Nanopore data

We used Nanocorr²⁰ (version 0.01) to perform error correction for Nanopore reads using the Illumina reads. In order to guarantee the quality of Illumina reads, Trimmomatic⁶⁰ (version 0.36) was used to remove the low quality regions. We imposed the quality score of each base pair to be no less than 20 and the length of the reads no less than 100. The corrected Nanopore reads were then used to construct a draft genome assembly with Canu¹⁷ (version 1.5). As the resulting draft genome assembly contained more than one contig, AMOS minimus⁴⁶ (version 3.1.0) was used to remove the redundancy and concatenate contigs using the overlap information. The AMOS minimus was also used to circularize the contig. We aligned the Illumina reads to the circularized contig and corrected the SNPs and small indels with Pilon¹⁹ (version 1.22). In order to follow the paradigms of the published cp genomes, we aligned the genome assembly to the published cp genomes with MUMMER⁶¹ (version 3.23) to find homology regions, and let the genome assembly start from the LSC.

Cp genome assembly from Illumina Hiseq data

The low quality regions of the extracted cp sequences were removed with Trimmomatic⁶⁰ (version 0.36). The minimum quality score of each base pair was set to 20 and the minimum length of the reads was set to 100. SPAdes⁵³ (version 3.10.1) was used to construct contigs from Illumina reads. We excluded the repeat resolve module from SPAdes and used the contigs before repeat resolution as it consistently missed one of the two IRs. The resulting genome assembly contains tens to hundreds of contigs. The size of the contigs ranged from several hundred base pairs to tens of kilobase pairs. Since we know the structure of cp genome is generally stable, the syntenic relationship was used for scaffolding. We mapped the SPAdes contigs to the genome assembly resulting from the Nanopore reads using BWA MEM⁴⁵. The alignments were used to order the contigs. The overlap information between the neighbouring contigs was used to concatenate them.

Cp genome annotation

We used the web tool Dual Organellar GenoMe Annotator (DOGMA)⁴⁸ to generate the preliminarily gene annotations. For each particular gene, we used MUSCLE⁴⁹ (version 3.8.31) to align the genuine protein sequences of the gene gained from the NCBI GenBank to the genome assembly to decide the exact boundary positions. The web tool Organellar Genome DRAW (OGDRAW)⁵⁰ was used to generate the circular annotation plot of the genome assembly. The hypothetical cp open-reading frame ycf1 was not identified by DOGMA initially. It was added to the annotation on the basis of the MUSCLE alignment results.

Phylogenetic analysis

Phylogenetic analysis was performed on the 18 sweetpotato cultivars used as the parental genotypes for constructions of mapping populations in GT4SP project as well as the Convolvulaceae Ipomoea section Batatas including the cp genome assemblies constructed in this research and nine publicly available cp genome assemblies. MAFFT⁶² (version 7.310) was employed to perform the multiple sequence alignment (MSA) for cp genomes. The phylogenetic structure was constructed with PhyML⁶³ (version 3.1). Branch certainty was evaluated with 1000 replications of bootstrap resampling. The phylogenetic tree depicted in this research was constructed with the web tool iTOL (version 4)⁵².

Data availability

Underlying data

Nanopore and Illumina reads and the cp genome assemblies are deposited at NCBI BioProject repository, accession number PRJNA438020: http://identifiers.org/bioproject/PRJNA438020.

Extended data

Supplementary Figure 1. Size distribution of the Nanopore sequencing data of the total DNA. https://doi.org/10.26188/12652034.v2⁶⁴

Supplementary Table 1. Statistics of the chloroplast (cp) sequencing data. https://doi.org/10.26188/12652067.v1⁶⁵

Supplementary Table 2. List of the 30 publicly available Ipomoea chloroplast (cp) genomes in the NCBI repository. https://doi.org/10.26188/12652079.v1⁶⁶

Supplementary Table 3. Description of the parental genotypes of the Mwanga Diversity Panel (MDP). https://doi.org/10.26188/12652085.v1⁶⁷

Supplementary Table 4. Statistics of the chloroplast (cp) genome assemblies of the 18 sweetpotato cultivars and the I. triloba line NCNSP-0323. https://doi.org/10.26188/12652094.v1⁶⁸

Acknowledgements

We would like to thank Rick Miller for his valuable suggestions on the phylogenetic analysis.

Faculty Opinions recommended

References

1. Martin W, Rujan T, Richly E, et al.: Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc Natl Acad Sci U S A. 2002; 99(19): 12246–12251.

2. Shaw J, Lickey EB, Schilling EE, et al.: Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007; 94(3): 275–288.

3. Daniell H, Lin CS, Yu M, et al.: Chloroplast genomes: diversity, evolution, and applications in genetic engineering. Genome Biol. 2016; 17(1): 134.

4. Yan L, Lai X, Li X, et al.: Analyses of the complete genome and gene expression of chloroplast of sweet potato [Ipomoea batata]. PLoS One. 2015; 10(4): e0124083.

5. Sandelius AS, Aronsson H: The chloroplast: interactions with the environment. (Springer Science & Business Media). 2008; 13.

6. Shinozaki K, Ohme M, Tanaka M, et al.: The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J. 1986; 5(9): 2043–2049.

7. Bausher MG, Singh ND, Lee SB, et al.: The complete chloroplast genome sequence of Citrus sinensis (L.) Osbeck var 'Ridge Pineapple': organization and phylogenetic relationships to other angiosperms. BMC Plant Biol. 2006; 6: 21.

8. Samson N, Bausher MG, Lee SB, et al.: The complete nucleotide sequence of the coffee (Coffea arabica L.) chloroplast genome: organization and implications for biotechnology and phylogenetic relationships amongst angiosperms. Plant Biotechnol J. 2007; 5(2): 339–353.

9. Saski C, Lee SB, Fjellheim S, et al.: Complete chloroplast genome sequences of Hordeum vulgare, Sorghum bicolor and Agrostis stolonifera, and comparative analyses with other grass genomes. Theor Appl Genet. 2007; 115(4): 591.

10. Daniell H, Lee SB, Grevich J, et al.: Complete chloroplast genome sequences of Solanum bulbocastanum, Solanum lycopersicum and comparative analyses with other Solanaceae genomes. Theor Appl Genet. 2006; 112(8): 1503–18.

11. Daniell H, Wurdack KJ, Kanagaraj A, et al.: The complete nucleotide sequence of the cassava (Manihot esculenta) chloroplast genome and the evolution of atpF in Malpighiales: RNA editing and multiple losses of a group II intron. Theor Appl Genet. 2008; 116(5): 723–37.

12. Saski C, Lee SB, Daniell H, et al.: Complete chloroplast genome sequence of Gycine max and comparative analyses with other legume genomes. Plant Mol Biol. 2005; 59(2): 309–322.

13. Moore MJ, Dhingra A, Soltis PS, et al.: Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biol. 2006; 6: 17.

14. Wu J, Liu B, Cheng F, et al.: Sequencing of chloroplast genome using whole cellular DNA and solexa sequencing technology. Front Plant Sci. 2012; 3: 243.

15. Cronn R, Liston A, Parks M, et al.: Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 2008; 36(19): e122.

16. Eserman LA, Tiley GP, Jarret RL, et al.: Phylogenetics and diversification of morning glories (tribe Ipomoeeae, Convolvulaceae) based on whole plastome sequences. Am J Bot. 2014; 101(1): 92–103.

17. Koren S, Walenz BP, Berlin K, et al.: Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017; 27(5): 722–736.

18. Chin CS, Alexander DH, Marks P, et al.: Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013; 10(6): 563–569.

19. Walker BJ, Abeel T, Shea T, et al.: Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014; 9(11): e112963.

20. Goodwin S, Gurtowski J, Ethe-Sayers S, et al.: Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015; 25(11): 1750–1756.

21. Soorni A, Haak D, Zaitlin D, et al.: Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data. BMC Genomics. 2017; 18(1): 49.

22. Wang W, Messing J: High-throughput sequencing of three Lemnoideae (duckweeds) chloroplast genomes from total DNA. PLoS One. 2011; 6(9): e24670.

23. Kim K, Lee SC, Lee J, et al.: Complete chloroplast and ribosomal sequences for 30 accessions elucidate evolution of Oryza AA genome species. Sci Rep. 2015; 5: 15655.

24. Antipov D, Hartwick N, Shen M, et al.: plasmidspades: assembling plasmids from whole genome sequencing data. bioRxiv. 2016; 048942.

25. Zhang T, Zhang X, Hu S, et al.: An efficient procedure for plant organellar genome assembly, based on whole genome data from the 454 GS FLX sequencing platform. Plant Methods. 2011; 7: 38.

26. Dierckxsens N, Mardulyn P, Smits G: NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 2017; 45(4): e18.

27. Izan S, Esselink D, Visser RGF, et al.: De Novo Assembly of Complete Chloroplast Genomes from Non-model Species Based on a K-mer Frequency-Based Selection of Chloroplast Reads from Total DNA Sequences. Front Plant Sci. 2017; 8: 1271.

28. Loebenstein G, Thottappilly G: The sweetpotato. (Springer Science & Business Media). 2009.

29. Faostat F: Agriculture organization of the united nations statistics division (2014). [Review date: April 2015]. 2016.

30. Mwanga ROM, Andrade MI, Carey EE, et al.: Sweetpotato (Ipomoea batatas L.). (Springer International Publishing, Cham). 2017; 181–218.

31. Roullier C, Rossel G, Tay D, et al.: Combining chloroplast and nuclear microsatellites to investigate origin and dispersal of New World sweet potato landraces. Mol Ecol. 2011; 20(19): 3963–3977.

32. Huang J, Sun M: Genetic diversity and relationships of sweetpotato and its wild relatives in ipomoea series batatas (convolvulaceae) as revealed by inter-simple sequence repeat (ISSR) and restriction analysis of chloroplast DNA. Theor Appl Genet. 2000; 100(7): 1050–1060.

33. Hoshino A, Jayakumar V, Nitasaka E, et al.: Genome sequence and analysis of the Japanese morning glory Ipomoea nil. Nat Commun. 2016; 7: 13295.

34. Muñoz-Rodríguez P, Carruthers T, Wood JR, et al.: Reconciling conflicting phylogenies in the origin of sweet potato and dispersal to Polynesia. Curr Biol. 2018; 28(8): 1246–1256.e12.

35. Mwanga ROM, Odongo B, p’Obwoya CO, et al.: Release of five sweetpotato cultivars in uganda. HortScience. 2001; 36(2): 385–386.

36. McNeal JR, Kuehl JV, Boore JL, et al.: Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol. 2007; 7: 57.

37. Mwanga ROM, Odongo B, Niringiye C, et al.: Release of two orange-fleshed sweetpotato cultivars, ‘spk004’ (‘kakamega’) and ‘ejumula’, in uganda. HortScience. 2007; 42(7): 1728–1730.

38. Mwanga ROM, Odongo B, Turyamureeba G, et al.: Release of six sweetpotato cultivars (‘naspot 1’ to ‘naspot 6’) in uganda. HortScience. 2003; 38(3): 475–476.

39. Mwanga ROM, Odongo B, Niringiye C, et al.: ‘naspot 7’, ‘naspot 8’, ‘naspot 9 o’, ‘naspot 10 o’, and ‘dimbuka-bukulula’ sweetpotato. HortScience. 2009; 44(3): 828–832.

40. Mwanga ROM, Niringiye C, Alajo A, et al.: ‘naspot 11’, a sweetpotato cultivar bred by a participatory plant-breeding approach in uganda. HortScience. 2011; 46(2): 317–321.

41. Mwanga ROM, Kyalo G, Ssemakula GN, et al.: ‘naspot 12 o’ and ‘naspot 13 o’ sweetpotato. HortScience. 2016; 51(3): 291–295.

42. Jones A, Dukes PD, Schalk JM, et al.: ‘resisto’ sweet potato. HortScience. 1983; 18(2): 251–252.

43. Gruneberg WJ, Ma D, Mwanga ROM, et al.: Advances in sweetpotato breeding from 1992 to 2012. (CABI International). 2015.

44. Abidin PE, van Eeuwijk FA, Stam P, et al.: Adaptation and stability analysis of sweet potato varieties for low-input systems in uganda. Plant Breed. 2005; 124(5): 491–497.

45. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25(14): 1754–1760.

46. Sommer DD, Delcher AL, Salzberg SL, et al.: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007; 8: 64.

47. David MC, Diaz FC, Mwanga ROM, et al.: Gene pool subdivision of east african sweetpotato parental material. Crop Science. 2018; 58(6): 2302–2314.

48. Wyman SK, Jansen RK, Boore JL: Automatic annotation of organellar genomes with DOGMA. Bioinformatics. 2004; 20(17): 3252–3255.

49. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5): 1792–1797.

50. Lohse M, Drechsel O, Bock R: OrganellarGenomeDRAW (OGDRAW): a tool for the easy generation of high-quality custom graphical maps of plastid and mitochondrial genomes. Curr Genet. 2007; 52(5–6): 267–274.

51. Li Q, Li Y, Song J, et al.: High-accuracy de novo assembly and SNP detection of chloroplast genomes using a SMRT circular consensus sequencing strategy. New Phytol. 2014; 204(4): 1041–1049.

52. Letunic I, Bork P: Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016; 44(W1): W242–W245.

53. Bankevich A, Nurk S, Antipov D, et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012; 19(5): 455–477.

54. Tang J, Xia H, Cao M, et al.: A comparison of rice chloroplast genomes. Plant Physiol. 2004; 135(1): 412–420.

55. Gogniashvili M, Naskidashvili P, Bedoshvili D, et al.: Complete chloroplast DNA sequences of zanduri wheat (Triticum spp.). Genet Resour Crop Evol. 2015; 62(8): 1269–1277.

56. Strittmauer G, Kössel H: Cotranscription and processing of 23S, 4.5S and 5S rRNA in chloroplasts from Zea mays. Nucleic Acids Res. 1984; 12(20): 7633–7647.

57. Srisuwan S, Sihachakr D, Siljak-Yakovlev S: The origin and evolution of sweet potato (Ipomoea batatas Lam.) and its wild relatives through the cytogenetic approaches. Plant Sci. 2006; 171(3): 424–433.

58. Dellaporta SL, Wood J, Hicks JB: A plant DNA minipreparation: version ii. Plant Mol Biol Report. 1983; 1(4): 19–21.

59. Mace ES, Buhariwalla KK, Buhariwalla HK, et al.: A high-throughput DNA extraction protocol for tropical molecular breeding programs. Plant Mol Biol Report. 2003; 21(4): 459–460.

60. Bolger AM, Lohse M, Usadel B: Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; 30(15): 2114–2120.

61. Kurtz S, Phillippy A, Delcher AL, et al.: Versatile and open software for comparing large genomes. Genome Biol. 2004; 5(2): R12.

62. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013; 30(4): 772–780.

63. Guindon S, Dufayard JF, Lefort V, et al.: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010; 59(3): 307–321.

64. Zhou C: Nanopore read length distribution. University of Melbourne. Figure. 2020.

65. Zhou C: Statistics of the chloroplast genome sequencing data. University of Melbourne. Dataset. 2020.

66. Zhou C: List of the 30 publicly available Ipomoea chloroplast genomes in the NCBI repository. University of Melbourne. Dataset. 2020.

67. Zhou C: Description of the parental genotypes of the Mwanga Diversity Panel (MDP). University of Melbourne. Dataset. 2020.

68. Zhou C: Statistics of the chloroplast genome assemblies of the 18 sweetpotato cultivars and the I. triloba line NCNSP-0323. University of Melbourne. Dataset. 2020.

AuthorAffiliation

Chenxi Zhou¹, Tania Duarte ¹, Rocio Silvestre², Genoveva Rossel², Robert O. M. Mwanga ³, Awais Khan^2,4, Andrew W. George ⁵, Zhangjun Fei⁶, G. Craig Yencho⁷, David Ellis², Lachlan J. M. Coin ¹

¹ Institute for Molecular Bioscience, University of Queensland, St Lucia, Brisbane, QLD, 4072, Australia

² International Potato Center, P.O. Box 1558, Lima 12, Peru

³ International Potato Center, P.O. Box 22274, Kampala, Uganda

⁴ Plant Pathology and Plant-Microbe Biology Section, School of Integrative Plant Science, Cornell University, Geneva, NY, 14456, USA

⁵ Data61, CSIRO, Ecosciences Precinct, Brisbane, QLD, 4102, Australia

⁶ Boyce Thompson Institute, Cornell University, Ithaca, NY, 14853, USA

⁷ Department of Horticulture, North Carolina State University, Raleigh, North Carolina, 27695, USA

Chenxi Zhou

Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation

Tania Duarte

Roles: Data Curation, Writing – Review & Editing

Rocio Silvestre

Roles: Data Curation, Resources, Writing – Review & Editing

Genoveva Rossel

Roles: Data Curation, Resources, Writing – Review & Editing

Robert O. M. Mwanga

Roles: Funding Acquisition, Project Administration, Writing – Review & Editing

Awais Khan

Roles: Funding Acquisition, Project Administration, Writing – Review & Editing

Andrew W. George

Roles: Supervision, Writing – Review & Editing

Zhangjun Fei

Roles: Data Curation, Funding Acquisition, Project Administration, Writing – Review & Editing

G. Craig Yencho

Roles: Funding Acquisition, Project Administration, Writing – Review & Editing

David Ellis

Roles: Project Administration, Resources, Writing – Review & Editing

Lachlan J. M. Coin

Roles: Conceptualization, Funding Acquisition, Project Administration, Supervision, Writing – Review & Editing

Word count: 7305

Show less

© 2020. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Background: The chloroplast (cp) genome is an important resource for studying plant diversity and phylogeny. Assembly of the cp genomes from next-generation sequencing data is complicated by the presence of two large inverted repeats contained in the cp DNA.

Methods: We constructed a complete circular cp genome assembly for the hexaploid sweetpotato using extremely low coverage (<1×) Oxford Nanopore whole-genome sequencing (WGS) data coupled with Illumina sequencing data for polishing.

Results: The sweetpotato cp genome of 161,274 bp contains 152 genes, of which there are 96 protein coding genes, 8 rRNA genes and 48 tRNA genes. Using the cp genome assembly as a reference, we constructed complete cp genome assemblies for a further 17 sweetpotato cultivars from East Africa and an I. triloba line using Illumina WGS data. Analysis of the sweetpotato cp genomes demonstrated the presence of two distinct subpopulations in East Africa. Phylogenetic analysis of the cp genomes of the species from the Convolvulaceae Ipomoea section Batatas revealed that the most closely related diploid wild species of the hexaploid sweetpotato is I. trifida.

Conclusions: Nanopore long reads are helpful in construction of cp genome assemblies, especially in solving the two long inverted repeats. We are generally able to extract cp sequences from WGS data of sufficiently high coverage for assembly of cp genomes. The cp genomes can be used to investigate the population structure and the phylogenetic relationship for the sweetpotato.

Details

Title

Insights into population structure of East African sweetpotato cultivars from hybrid assembly of chloroplast genomes

Author

Zhou, Chenxi; Duarte, Tania

; Silvestre, Rocio; Rossel, Genoveva; Mwanga, Robert O M

; Khan, Awais; George, Andrew W

; Zhangjun Fei; Yencho, G Craig; Ellis, David; Coin, Lachlan J M

Section

Research Article

Publication year

2020

Publication date

Jul 21, 2020

Publisher

Taylor & Francis Ltd.

ISSN

25724754

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.12688/gatesopenres.12856.2

ProQuest document ID

2661523447

Insights into population structure of East African sweetpotato cultivars from hybrid assembly of chloroplast genomes: [version 2; peer review: 2 approved, 3 approved with reservations]

Jump to:

Full text

Abstract

Details

Suggested sources