Acknowledgements
This work was supported by the UK Biotechnology and Biological Sciences Research Council (BB/L002124/1, BB/L011751/1), including work carried out within the ERACAPS Research Program (BB/L027844/1).
To the Editor - A draft genome sequence of Brassica juncea, a member of the Brassicaceae and therefore a species benefiting from the functional genomics advances in the 'model' species Arabidopsis thaliana, was reported recently by Yang et al.1. B. juncea is a recently formed allotetraploid, whose diploid progenitors were mesohexaploids: Brassica rapa (which contributed the A genome) and Brassica nigra (which contributed the B genome). In addition to underpinning future trait-oriented work in this important crop species, which includes both vegetable and oil types, the sequences were analyzed for characteristics of genome evolution under crop selection. For both purposes, the genome sequences must represent with high fidelity (though not perfectly in 'draft' form) both the gene complement and gene order of the species. As a model for addressing the challenges of achieving an adequate representation of the latter for allopolyploid crops, the construction methodology employed short shotgun sequence reads, single-molecule long reads, BioNano sequencing and high-resolution genetic mapping.
A particular problem in genetic mapping in polyploids is the confounding effects of single nucleotide polymorphisms (SNPs) resulting from interhomeolog polymorphisms (IHPs), which are much more abundant than the allelic SNPs needed for genetic (linkage) mapping.
In species such as B. juncea and B. napus (which also contains the A genome contributed by a B. rapa progenitor, but in this species in combination with the C genome contributed by a Brassica oleracea progenitor), further complications arise from the mesohexaploid nature of the genomes of the diploid progenitors, resulting in interparalog polymorphisms (IPPs). However, as long as sufficient sequencing redundancy has been obtained to overcome stochastic sampling effects and differentiate allelic SNPs (which will segregate across a linkage mapping population) from IHPs and IPPs (which should be invariant), the confounding effects can be overcome. Even using transcriptome sequence data, robust methodologies have been developed in B. napus to score allelic SNPs for high-resolution linkage map construction and to underpin association genetics2-4.
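The distinction drawn above between segregating and invariant polymorphisms can be illustrated with a minimal sketch. It is not the scoring pipeline used in this work (see the Supplementary Note for that). It assumes each marker has already been reduced to one call per mapping-population line, with '-' for a missing call; the function name `classify_marker` and the toy calls are hypothetical.

```python
def classify_marker(calls, missing="-"):
    """Label a marker by whether its calls vary across the mapping population.

    Assumption for illustration: `calls` holds one base call per line.
    An allelic SNP should show different calls in different lines, whereas
    an IHP or IPP should give the same call (e.g. the same IUPAC ambiguity
    code) in every line.
    """
    observed = {c for c in calls if c != missing}
    if not observed:
        return "uninformative"
    return "segregating" if len(observed) > 1 else "invariant"


# Toy data: 'R' is the IUPAC code for A/G, seen here in every line.
print(classify_marker(["R", "R", "R", "R"]))       # same call in all lines
print(classify_marker(["A", "G", "A", "-", "G"]))  # calls differ between lines
```

A real implementation would also need a minimum read depth per call, since a shallowly sampled IHP can appear as a resolved base in some lines and so mimic segregation.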
We aimed to test the fidelity with which the genome sequence reported by Yang et al.1 represents the gene order of B. juncea by comparing that with our own estimates using an AB Brassica genomics platform constructed as was described for our AC Brassica genomics platform5 based on the sequences of the progenitor species B. rapa (A genome) and B. nigra (B genome) (Supplementary Note). For the test, we used the coding DNA sequence (CDS) gene models from (1) the AB Brassica genomics platform and (2) the B. juncea genome sequence of Yang et al.1 (denoted J genome) as the reference sequences for mapping Illumina mRNA-seq reads from 106 lines of the B. juncea VHDH mapping population6,7 with variant calling essentially as described previously for B. napus2-4 (Supplementary Note). The SNP scoring strings were filtered to retain only simple SNPs (i.e., polymorphisms between resolved bases) and displayed in genome sequence order as genome-ordered graphical genotypes (GOGGs). If the order of the genes in which the polymorphisms are scored is correct in the genome sequence, the result should resemble a genetic linkage map, i.e., having few instances of nearby alternating parental alleles in individual recombinant lines.
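The expectation behind GOGGs can be sketched in a few lines. This is an illustration only, not the method used to produce the figures. It assumes that the genotype of each recombinant line is a string of parental allele calls ('A' or 'B', with '-' for missing) already sorted in genome sequence order; the function name `count_switches` and the toy strings are hypothetical.

```python
def count_switches(calls: str, missing: str = "-") -> int:
    """Count changes of parental allele along one line's genome-ordered calls.

    Missing calls are skipped, so a gap between two identical alleles
    does not count as a switch.
    """
    observed = [c for c in calls if c != missing]
    return sum(1 for prev, cur in zip(observed, observed[1:]) if prev != cur)


# Toy data for a single recombinant line.
well_ordered = "AAAAA-BBBBBB"  # one crossover, as expected in a linkage map
mis_ordered = "AABABBAABABB"   # frequent alternation between parental alleles

print(count_switches(well_ordered))  # 1
print(count_switches(mis_ordered))   # 7
```

Summed over all lines of a mapping population, a low switch count is consistent with a correct gene order, whereas a region with many closely spaced switches points to mis-ordered or mis-scored markers.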
The GOGGs generated comprised 33,059 scored SNP markers for the AB Brassica genomics platform and 29,834 scored SNP markers for the B. juncea genome sequence reported by Yang et al.1 (Supplementary Fig. 1). For example, comparison of chromosome J1 of Yang et al.1 to A1 from the AB Brassica genomics platform is shown in Fig. 1. The results of this simple quality control assessment show that the authentic arrangement of genes in B. juncea matches very well to that of their orthologs in the AB reference, and hence that in the progenitor species, but they also show that the B. juncea genome sequence reported by Yang et al.1 is extensively mis-assembled. We note also that the internationally agreed nomenclature for B genome chromosomes8, which we followed for the AB resource, was not followed for the B. juncea genome sequence.
The assembly and validation methodology described by Yang et al.1 sounds plausible and may well be taken as a model to follow for other polyploid crops, so why was it ineffective? Detailed inspection of the GOGGs suggests two problems: chimeric assemblies (in which collinearity with the genome of A. thaliana breaks down) and mistaking IHPs or IPPs for allelic SNPs when undertaking the linkage mapping with the 5,333 "bin markers" or in the pre-existing linkage map (in which collinearity with the genome of A. thaliana is maintained). The bin markers appear to have been scored on the basis of only ~0.7-fold redundant genome re-sequencing, which would not be sufficient (in SNP scoring) to differentiate the differing types of polymorphisms (IHPs, IPPs and allelic SNPs) in polyploid genomes. It is less clear why use of the single-molecule long reads and BioNano sequencing failed to detect the chimerism.
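A rough calculation shows why ~0.7-fold redundancy is so limiting. The sketch below assumes that read depth at a site follows a Poisson distribution with mean equal to the sequencing redundancy; this is a common simplification introduced here for illustration, not a model stated in the source, and the function names are hypothetical.

```python
import math


def prob_no_reads(mean_depth: float) -> float:
    """Poisson probability that a site is covered by zero reads."""
    return math.exp(-mean_depth)


def prob_at_least_two_reads(mean_depth: float) -> float:
    """Poisson probability that a site is covered by two or more reads."""
    return 1.0 - math.exp(-mean_depth) * (1.0 + mean_depth)


depth = 0.7  # approximate redundancy of the bin-marker re-sequencing
print(round(prob_no_reads(depth), 2))
print(round(prob_at_least_two_reads(depth), 2))
```

Under this assumption, about half of all sites (0.50) receive no read in a given line and only about 0.16 receive two or more. Telling an IHP or IPP apart from an allelic SNP requires observing more than one copy of the site, so two reads is a bare minimum, and most sites in most lines would fall short of it.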
Although the draft of the B. juncea genome sequence reported by Yang et al.1 does not appear to faithfully represent the organization of that genome, undermining analyses requiring positional information (such as illustrated in Figs. 1, 2a, 3 and 4a in the report of Yang et al.1), it could easily be improved by exploiting the linkage mapping information depicted by the GOGGs. Indeed, the B genome component of our AB Brassica genomics platform was based on the B. nigra genome sequence reported by Yang et al.1 alongside that of B. juncea and was developed by splitting it (into 175 segments) and re-organizing based on the transcriptome SNPs scored across the B. juncea VHDH mapping population (Supplementary Table 1). The assessment of genome assemblies based on GOGGs therefore not only represents an important quality control measure, but also provides a solution where problems are found. Linkage mapping populations have been a fundamental resource for the genetic analyses of traits in crops, so they will usually be available already in crop species for which genome sequencing is being undertaken. To help assure the quality of genome sequences, we would like to propose an expectation that validation by means of GOGGs should be incorporated into the assembly workflow for polyploid crop genomes.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The B. juncea mRNA-seq data used for production of the graphical genotypes have been deposited in the SRA data library under project ID PRJNA471033.
Zhesi He and Ian Bancroft*
Department of Biology, University of York,
Heslington, York, UK.
*e-mail: [email protected]
Published online: 24 September 2018 https://doi.org/10.1038/s41588-018-0239-0
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section).
n/a
Confirmed
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
- Clearly defined error bars. State explicitly what error bars represent (e.g. SD, SE, CI).
Our web collection on statistics for biologists may be useful.
Software and code
Policy information about availability of computer code
Data collection: n/a
Data analysis: Maq (version 0.7.1), R (version 3.4.1)
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
The B. juncea mRNAseq data used for production of the graphical genotypes have been deposited in the SRA data library under accession number PRJNA471033.
nature research | reporting summary April 2018
Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
[x] Life sciences  [ ] Behavioural & social sciences  [ ] Ecological, evolutionary & environmental sciences
Life sciences study design
Reporting for specific materials, systems and methods
References
1. Yang, J. et al. Nat. Genet. 48, 1225-1232 (2016).
2. Trick, M., Long, Y., Meng, J. & Bancroft, I. Plant Biotechnol. J. 7, 334-346 (2009).
3. Bancroft, I. et al. Nat. Biotechnol. 29, 762-766 (2011).
4. Harper, A. L. et al. Nat. Biotechnol. 30, 798-802 (2012).
5. He, Z. et al. Data Brief 4, 357-362 (2015).
6. Paritosh, K. et al. BMC Genomics 15, 396 (2014).
7. He, Z. et al. Plant Biotechnol. J. 15, 594-604 (2017).
8. King, G. https://doi.org/10.4226/47/5afb8519d194c (2010).
Copyright Nature Publishing Group Nov 2018