Data Descriptor: Construction of a map-based reference genome sequence for barley, Hordeum vulgare L.
OPEN
Received: 26 August 2016
Accepted: 9 February 2017
Published: 27 April 2017
Sebastian Beier et al.#
Barley (Hordeum vulgare L.) is a cereal grass mainly used as animal fodder and raw material for the malting industry. The map-based reference genome sequence of barley cv. Morex was constructed by the International Barley Genome Sequencing Consortium (IBSC) using hierarchical shotgun sequencing. Here, we report the experimental and computational procedures to (i) sequence and assemble more than 80,000 bacterial articial chromosome (BAC) clones along the minimum tiling path of a genome-wide physical map, (ii) nd and validate overlaps between adjacent BACs, (iii) construct 4,265 non-redundant sequence scaffolds representing clusters of overlapping BACs, and (iv) order and orient these BAC clusters along the seven barley chromosomes using positional information provided by dense genetic maps, an optical map and chromosome conformation capture sequencing (Hi-C). Integrative access to these sequence and mapping resources is provided by the barley genome explorer (BARLEX).
Design Type(s) genome assembly
Measurement Type(s) whole genome sequencing assay
Technology Type(s) DNA sequencing
Factor Type(s) library preparation
Sample Characteristic(s) Hordeum vulgare
Correspondence and requests for materials should be addressed to M.M. (email: mailto:[email protected]
Web End [email protected] ). #A full list of authors and their afliations appears at the end of the paper.
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 1
Background & Summary
Barley (Hordeum vulgare L.) is a cereal grass of great agronomical importance. The goal of the International Barley Genome Sequencing Consortium (IBSC) is the construction of a map-based reference sequence assembly of barley cultivar Morex by means of hierarchical shotgun sequencing1.
Towards this aim, the barley genomics community has developed an array of genome-wide physical and genetic mapping resources. These include libraries of bacterial articial chromosomes (BACs)2,
a genome-wide physical map3, a draft whole genome shotgun (WGS) assembly4 and an ultra-dense genetic map5. The last stage on the road towards the reference genome is the shotgun sequencing of BAC clones along a minimum tiling path of the genome dened by the physical map. The advances in high-throughput sequencing technology enabled this task to be completed in a much shorter timeframe than was required for the completion of, for instance, the human6 and maize7 genomes. In addition to the generation of BAC raw sequence data, we constructed (i) physical genome maps by single-molecule optical mapping in nanochannels8 and by chromosome conformation capture sequencing (Hi-C)9,10, and(ii) a high-resolution genetic map of a large bi-parental mapping population through genotyping-by-sequencing11. We undertook the sequence assembly of individual BACs, the construction of larger sequence scaffolds by merging sequences from adjacent clones and the integration of these super-scaffolds with the various genome-wide mapping resources constructed in the present effort as well as those published previously3,5. The nal outcome of this approach was the construction of pseudomolecules, i.e., contiguous sequence scaffolds representing the seven chromosomes of barley.
We have submitted the relevant raw data to public sequence data archives, made analysis results available under permanent digital object identiers (DOIs) and entered the positional information used for pseudomolecule construction into a bespoke information management system, the BARLEX genome explorer12. Here, we give (i) a comprehensive overview of datasets used for assembling the barley genome and methods employed in their generation, (ii) a detailed description of wet-lab procedures for BAC sequencing and the bioinformatics workow of the sequence assembly and data integration procedures together with an outline of (iii) their browsable presentation in an online database. These resources document the construction of the map-based reference sequence of the barley genome and will enable researchers to inspect the evidence used to assemble, order and orient sequence scaffolds and may guide the further improvement of the genome sequence with complementary data sets.
Methods
The main steps for the construction of the map-based reference sequence of the barley genome were(i) shotgun and mate-pair sequencing of BAC clones, (ii) sequence assembly of individual BAC clones and (iii) the construction of a pseudomolecule sequences by merging the sequences of adjacent BACs into super-scaffolds and ordering these using various sources of positional information such as physical maps, optical map and chromosome conformation capture. A schematic overview of our experimental procedures is given in Fig. 1.
BAC sequencing
Identication and analysis of gene-containing BACs. Isolation of gene-containing BACs, construction of a minimal tiling path (MTP), sequencing of MTP clones and the annotation of genes were essentially as described previously13.
Shotgun and mate-pair sequencing of MTP-BACs. Sequencing of MTP-BACs was conducted in four laboratories (Leibniz Institute on AgingFritz Lipmann Institute (FLI) Jena, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Beijing Genomics Institute (BGI) and Earlham Institute (EI) Norwich). Depending on the instrumentation and established protocols, customized approaches were taken to sequence the barley MTP BACs.
Barley chromosomes 1H, 3H and 4H (IPK and FLI)
Shotgun sequencing of MTP BACs
During the initial phase, BACs mostly from chromosome 3H (4870 clones) and a small number of clones from other chromosomes (34 from 1H; 31 from 2H; 50 from 4H; 101 from 5H; 33 from 6H; 64 from 7H; 107 from 0H) were shotgun sequenced using the Roche/454 GS FLX device (Data Citation 1, Data Citation 2, Data Citation 3, Data Citation 4, Data Citation 5, Data Citation 6, Data Citation 7, Data Citation 8, Data Citation 9). BAC DNA was prepared using a modied alkaline lysis protocol14. Construction of barcoded 454 sequencing libraries and sequencing using the Roche platform were performed as described15,16. The remaining BAC clones from chromosomes 1H, 3H and 4H were shotgun sequenced employing Illumina instruments. BAC DNA isolation, library construction, sequencing-by-synthesis (paired-end, 2 100 cycles) using the Illumina HiSeq2000 device was performed as described17 (Data Citation 10, Data Citation 11, Data Citation 12, Data Citation 13). Pools of up to 667 BACs were individually barcoded and sequenced on one HiSeq2000 lane.
In addition, the Illumina GAIIx, HiSeq2500 and MiSeq machines were utilized to sequence pools of up to 384 clones per lane as described previously17.
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 2
Mate-pair sequencing of MTP BACsFor scaffolding of chromosomes 1H, 3H and 4H standard Illumina Nextera mate-pair libraries (span size: 8 kb) of BAC pools up to 384 BACs were constructed and sequenced using the Illumina HiSeq2000 (paired end, 2 100 cycles) and MiSeq (paired end, 2 250 cycles) as described17
(Data Citation 14, Data Citation 15).
Barley chromosomes 5H, 6H and 7H (BGI)
Shotgun sequencing of MTP BACsBacterial starter cultures were inoculated in 0.4 ml 2 YT liquid medium18 supplemented with chloramphenicol (17.5 g ml[C0]1) in 2 ml polypropylene 96-deep well-plates sealed with gas-permeable foil and incubated at 37 C for 14 h in a shaking incubator (210 r.p.m.). For DNA isolation duplicates of cultures (1 ml 2 YT liquid medium containing 17.5 g ml[C0]1 chloramphenicol) were inoculated with 50 l starter culture and incubated (37 C, 14 h, 210 r.p.m.). BAC DNA was isolated using the alkaline lysis method essentially as described previously17. The DNA was dissolved (overnight, 4 C) in 64 l TE (pH 8.0) containing RNase A (30 g ml[C0]1) and stored at 20 C. BAC plasmid DNA (0.52.0 g in 60 l)
was randomly fragmented by focused-ultrasonicator (Covaris LE220 instrument: 21% duty factor, 500 PIP, 500 cycles per burst, 70 s treatment time) in 96-well plates (Axygen, PCR-96M2-HS-C) to an average size of 250750 bp. The DNA fragments were puried using magnetic beads (GeneOn Purication kit, GO-PCRC-5000) according to the manufacturers instructions. DNA was precipitated by adding 10 l magnetic bead suspension and 75 l Binding Buffer. The samples were mixed and incubated at room temperature for 5 min. Beads containing the DNA were reclaimed by using a magnet (96S Super Magnet Plate, ALPAQUA, A001322), and the clear supernatant was discarded. The beads were washed twice with 200 l of 70% ethanol and dried completely. For the elution of DNA the beads were suspended in 42 l Elution Buffer (EB, 10 mM Tris-Cl, pH 8.5) and incubated (5 min). The plate was placed on the magnet, and the supernatant (40 l) was transferred into new 96-well plates.
End-repair and A-Tailing were performed as described19. The reaction clean-ups were performed with GeneOn magnetic beads as described above. Barcode adapters (1 l, 20 M) for the rst index were ligated to the sticky ends of DNA fragments by using T4 DNA ligase19, incubated at 16 C for at least 12 h. Each individual sample was provided with a different barcode of a set of 384 different indices (adapter and barcode sequences are available upon request). Equal volumes of the 384 individually barcoded adapter-ligated products were pooled. The pooled DNA was precipitated by adding 20 l GeneOn magnetic beads and 650 l Binding Buffer (GeneOn Purication Kit, GO-PCRC-5000)
BAC DNA preparation
BAC scaffolds
FPC / BES data
Bionano map data
Paired-end library construction
Mate-pair library construction
BLAST analysis
Quantification, pooling and size fractionation
Quantification, pooling and size fractionation
POPSEQ map
Sequencing-by-synthesis (Illumina)
Sequencing-by-synthesis (Illumina)
BAC overlap clusters
Non-redundant sequence
Removal of low quality sequences & contamination
Removal of low quality sequences & contamination
Conformation capture map
(HiC map)
Conformation capture data
(HiC / TCC)
Individual BAC assembly
Mapping & individual BAC scaffolding
Individual BAC scaffolds
AGP generation & Pseudomolecule sequence
Figure 1. Assembly workow. (a) Assembly of individual BAC clones from paired-end and mate-pair read data. (b) Data integration procedures for pseudomolecule construction.
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 3
to 500 l pooled DNA. The suspension was mixed and incubated at room temperature for 5 min. The beads containing the DNA were reclaimed using a magnet, and the clear supernatant was discarded. The beads were washed twice with 500 l of 70% ethanol and dried completely. The DNA was eluted in 52 l EB. The sample was size-separated by using standard agarose gel electrophoresis (2% agarose gel,
HyAgarose, 16250). DNA was revealed using ethidium bromide and excitation by visible blue light emitted from a Dark Reader blue light transilluminator (Clare Chemical Research) to select the target fragments (580620 bp). The target region was extracted in 27 l EB using the QIAquick Gel Extraction kit (QIAGEN). The second index was introduced using the adapter-ligated products as template DNA (98 C for 30 s, 10 cycles of: 98 C for 10 s, 65 C for 30 s and 72 C for 30 s, nal extension 72 C for 5 min) (Enzymatics, CM0075) and PCR products (target region: 580620 bp) were recovered by agarose gel electrophoresis (2% agarose gel, HyAgarose, 16250) as described above. Index primers were used for barcoding each 384 pooled BAC samples (index primer sequences are available upon request). The average size of the PCR products was determined by using an Agilent 2100 Bioanalyzer (Agilent DNA 1,000 Reagents). Typical average size of the libraries was between 574 to 674 bp. PCR products were quantied using real-time PCR and pooled for sequencing in equal proportion20. Paired-end sequencing (2 100 cycles; rst index: 11 cycles, second index: 8 cycles) was performed on the Illumina HiSeq2000 platform (Data Citation 16, Data Citation 17, Data Citation 18).
Mate-pair sequencing of MTP BACs
For the construction of mate-pair libraries (10 and 20 kb span size), 96 BACs corresponding to 6 g DNA were pooled into one tube. The DNA was fragmented to 10 or 20 kb by using the HydroShear DNA
Shearing system from GeneMachines (10 kb: large assembly, speed code 12, cycles 12, volume 250 l; 20 kb: large assembly, speed code 13, cycles 20, volume 150 l). Following DNA fragmentation, the fragments were puried by using 0.6 volumes magnetic beads (Axygen, MAG-PCR-CL-250). The samples were mixed and incubated at room temperature for 10 min. Beads containing the DNA were reclaimed by using a magnet plate (96S Super Magnet Plate, ALPAQUA, A001322), and the clear supernatant was discarded. The beads were washed twice with 500 l of 70% ethanol and dried completely. For the elution of DNA the beads were resuspended in 80 l EB. End-repair and biotin-labeling were performed as described21. End-repaired DNA was puried using 0.6 volumes magnetic beads (Axygen, MAG-PCR
CL-250) as described for the purication of hydro-sheared DNA. The DNA was eluted in 79 l EB. 20 kb libraries (2026 kb range) were size-selected using agarose gel (0.6%) electrophoresis. The ligation of the libraries, was performed by adding 1 l Barcode Adaptor (20 M, sequences are available upon request), 10 l T4 DNA ligase (Enzymatics, L603-HC) in a total volume of 100 l (20 C, 15 min). 15 individually barcoded adaptor-ligated DNAs (10 kb) were pooled in equimolar manner and size-fractionated (911 kb) using agarose gel (0.6%) electrophoresis. DNA circularization and removal of non-circularized DNA was as described21. The DNA was isolated from the gel using the QIAquick Gel Extraction kit as described by the manufacturer (QIAGEN). Circular DNA was fragmented using the Covaris S2 device (10% duty cycle, 10 intensity, 1,000 bursts per second, 22 min (11 min) treatment time for 10 kb (20 kb) libraries in TC13 Covaris tubes), and biotinylated fragments derived from true mate-pair ligation events were puried using streptavidin-coupled Dynabeads (M-280, Invitrogen)19. Ends of the DNA fragments were repaired and provided with Illumina paired-end adapters as described for the construction of shotgun libraries. The bead-bound DNA was PCR-amplied using Phusion polymerase (NEB) (98 C for 30 s, 18 cycles of: 98 C for 10 s, 65 C for 30 s, 72 C for 30 s and a nal extension: 72 C for 5 min) using manufacturers protocols (NEB). Size-selection was essentially performed as described for shotgun library construction. For the 10 kb (20 kb) mate-pair libraries, DNA in the size range between 270420 bp (400600 bp) was isolated and puried using the QIAquick Gel Extraction kit according to manufacturers instructions (QIAGEN). The average size of the paired-end BAC libraries was determined electrophoretically using an Agilent 2100 Bioanalyzer (Agilent DNA 1,000 Reagents). Libraries were quantied using Real-Time PCR20. The mate-pair libraries were paired-end sequenced using the Illumina HiSeq2500 device (10 kb library: 150 cycles, 20 kb mate-pair library 50 cycles). Raw data are available as Data Citation 19, Data Citation 20, Data Citation 21).
Barley chromosomes 2H and 0H (EI)
Shotgun sequencing of MTP BACs
QRep 384 Pin Replicators (Molecular Devices, New Molton, UK) were used to inoculate clones from stock plates into 384 square deep well culture plates containing 140 l 2 YT media supplemented with12.5 g ml[C0]1 chloramphenicol18. The culture plates were sealed with a gas permeable seal and incubated for 22 h at 37 C in a shaking incubator (200 r.p.m.). Cells were harvested by centrifugation (20 min, 3,220 g, 4 C), the supernatant was discarded. BAC DNA was prepared using a modied alkaline lysis protocol (Beckman Coultier, High Wycombe, UK). Cell pellets were resuspended in 8 l of Resuspension
Buffer (RE1) using a Microplate Shaker TiMix 5 control (Edmund-Buehler, Hechingen, Germany) (10 min, 1,400 r.p.m.). Cells were lysed by adding 8 l of the lysis solution (L2). After shaking (5 min, 500 r.p.m.) 8 l of cold Neutralisation Buffer (N3) were added. The plate was shaken (10 min, 500 r.p.m.)
followed by a centrifugation (20 min, 3,220 g, 4 C). The clear supernatant (14.33 l) was transferred
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 4
to a 384 well PCR plate, which contained 1 l of CosMc beads per well. The plate was mixed briey (500 r.p.m.), 10 l of isopropanol was added and the suspension was mixed briey again (500 r.p.m.). The plate was incubated at room temperature for 15 min to allow precipitation of the DNA onto the beads.
The plate containing the DNA precipitate was moved onto a 96 pin 384 well plate compatible magnet (Alpaqua, Beverley, MA, USA) and left for 5 min for the beads to pellet. The supernatant was discarded and the beads were washed three times with 20 l 70% ethanol while placed in the magnet and air dried (room temperature, 5 min). The DNA was eluted from the beads in 20 l of 10 mM Tris HCl (pH 8.0)
and transferred to a fresh 384 well PCR plate. To remove contaminating host E. coli gDNA samples were treated with Epicentre Plasmid Safe ATP dependent DNase (Cambio, Cambridge, UK), which digests the fragmented E. coli and nicked BAC DNA but leaves supercoiled BAC DNA intact. To 20 l of DNA 2.5 l of 10x Reaction buffer, 1 l 25 mM ATP, 0.1 l ATP dependent DNase (10 u l[C0]1) and 1.4 l water was added, and the samples were incubated at 37 C (8 h) followed by 70 C (20 min) to inactivate the DNase.
Sequencing libraries (single index) from the initial sixteen 384 well plates of BACs (2H chromosome) were constructed in 384 well PCR plates (Fortitude, Wotton, UK) using the Epicentre Nextera Kit (Epicentre, Madison, WI, USA) and Robust 2G Taq polymerase (Kapa Biosciences, London, UK). The 384 adapter oligos with 9 bp barcodes each with a hamming distance of 4 (adapter sequences are available upon request) were designed using standard guidelines22. Briey, 1 l of BAC DNA, 1 l Nextera HMW 5 Reaction Buffer, 1 l of Nextera Enzyme (diluted 50-fold in 50% glycerol, 0.5 TE pH 8.0) and 2 l of water were combined and incubated (5 min, 55 C) as described23. For the denaturation of the Tn5 polymerase, 15 l PB Buffer (Qiagen, Manchester, UK) and for the reaction clean-up, 20 l AMPure XP (Beckman, High Wycombe, UK) beads were added using a Caliper Sciclone Robot (Perkin Elmer,
Coventry, UK). Following an incubation (5 min, room temperature), the precipitated tagmented DNA was puried using a 96 well ring Magnet (Alpaqua, Beverly, MA, USA). The beads were washed twice with 20 l 70% ethanol while placed in the magnet before being air dried for 5 min. The tagmented DNA was eluted in 5 l 10 mM Tris HCl, pH 8.0 and transferred to a fresh 384 well PCR plate. To 5 l puried, tagmented DNA 2 l of 5 2G B Reaction buffer, 0.2 l of 10 mM dNTPs, 0.1 l of Robust 2G Taq polymerase, 0.2 l of 50 Nextera Primer Cocktail and 2.5 l 0.2 M barcoded P2 adapter primer were added in a total reaction volume of 10 l and amplied according to the following thermal cycling prole: 72 C for 3 min, 95 C for 1 min, followed by 21 cycles of 95 C for 10 s, 65 C for 20 s and 72 C for 3 min.
Post amplication the DNA concentration was determined using the Quant-It Picogreen dsDNA assay (Thermo Fisher, Cambridge, UK). Library DNA concentrations typically ranged from 4 to 40 ng l[C0]1 (average of 16 ng l[C0]1). For each sample from a 384 well plate a 5 l aliquot was pooled and split into two 2 ml Lo bind Eppendorf tubes (950 l each). To each aliquot 950 l of AMPure XP (Beckman, High
Wycombe, UK) beads was added. Samples were mixed, incubated (5 min, room temperature) and placed on a magnet particle concentrator (MPC) until the beads were collected. The supernatant was discarded. The beads were washed twice with 20 l 70% ethanol while placed in the MPC and air dried (5 min). The pooled library was eluted from the beads in 17 l of 10 mM Tris HCl pH 8.0. The two 17 l aliquots of the library were combined and the DNA concentration was determined using the Qbit device with the
Quant-It DNA HS Assay (Invitrogen). Typical DNA concentrations were above 100 ng l[C0]1. The DNA size selection was performed using the Blue Pippin (Sage Science, Beverly, MA, USA). About 3 g of the library in 30 l of 10 mM Tris HCl pH 8.0 and 10 l of the R2 ladder were separated (tight selection protocol, 650 bp) using a 1.5% agarose cassette according to the manufacturers instructions (Sage Science, Beverly, MA, USA), thereby yielding an average insert size of about 485 bp. Size selected samples were collected in 40 l of TRIS- TAPS buffer, pH 8.0 (Sage Science, Beverly, MA, USA). The average size of the library was determined using a High Sensitivity Chip and an Agilent 2100
Electrophoresis Bioanalyzer (Agilent). The DNA concentration was measured using the Qbit device and the Quant-It DNA HS Assay (Invitrogen). Size selected libraries were quantied using the Kappa Biosciences Illumina library qPCR quantication kit (Kapa Biosciences) on a Step One qPCR machine (ThermoFisher) according to the manufacturers instructions and compared against a known concentration of a PhiX control library. Several libraries were pooled for sequencing in an equimolar manner, and the nal pool was re-quantied for sequencing relative to a standard library of a known concentration using the Kapa Biosciences Illumina library qPCR quantication kit. Sequencing-by-synthesis for 6,144 BACs from chromosome 2H was performed using an Illumina HiSeq2000 device (2 100 cycles paired-end, single indexing read, 384 BACs/lane) according to manufacturers instructions, thereby yielding at least 32 Gb/lane and an average sequence coverage of at least 500-fold per BAC. The remaining BAC clones from 2H (384 BACs/lane) and 0H (2304 BACs/lane) were sequenced with a HiSeq2500 machine (2 150 cycles paired-end, dual indexing, rapid mode, yield: at least 30 Gb/lane) using a slightly adapted protocol with an additional normalization step prior to sample pooling. Briey, a custom panel of 48 P5 and 48 P7 adapter oligos with 9 bp barcodes (with 4 hamming distance) was designed to individually label up to 2,304 (48 48) libraries by dual indexing. A mixture of 2 l of BAC DNA, 0.5 l Nextera 10 Reaction Buffer, 0.1 l Nextera Enzyme and 2.4 l water was incubated (5 min, 55 C). Tn5 denaturation, reaction clean-up, washing, elution and transfer to a fresh 384 well plate were as described for the single-indexing libraries. 5 l puried, tagmented DNA, 2 l of 5 Kapa Robust 2G B Reaction buffer, 0.2 l of 10 mM dNTPs, 0.05 l of Kapa Robust 2G Taq polymerase, 1 l 2 M P5 primer, 1 l 2 M P7 primer were combined (reaction volume of 10 l) and amplied according to following thermal cycling prole: 72 C for 3 min, 95 C for 1 min, followed by
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 5
16 cycles of 95 C for 10 s, 65 C for 20 s and 72 C for 3 min. The size prole and quantity was determined as described for single-indexing libraries. Amplied libraries were normalised using MagQuant bead technology (GC Biotech, Netherlands) on a Caliper Zephyr Robot (Perkin Elmer), essentially as described by the manufacturer. Normalised libraries were eluted in 10 l of 10 mM Tris HCl pH 8.0 and transferred to a fresh 384 well PCR plate.5 l of 384 normalized samples were pooled (total volume 1,920 l). Purication using AMPure XP beads, washing, elution, size-selection (Blue Pippin) and quality checks prior to sequencing were essentially as described for single indexing libraries. Sequencing-by-synthesis of pooled libraries (2,304 BACs) was performed using an Illumina HiSeq2500 device (rapid run mode, 2 150 cycles paired-end, dual indexing reads) according to manufacturers instructions. At least 40 Gbp/lane, and an average sequence coverage of >100-fold per BAC were obtained (Data Citation 22, Data Citation 23, Data Citation 24, Data Citation 25).
Mate-pair sequencing of MTP BACs
BAC clones were inoculated as described for the preparation of shotgun libraries. The bacterial cultures were grown for 6 h at 37 C in a shaking incubator at 200 r.p.m., and 384 clones were pooled. The pool was used to inoculate 250 ml 2 YT media supplemented with chloramphenicol (12.5 g ml[C0]1). The cultures were incubated (18 h, 37 C, 200 r.p.m.). Cells were harvested by centrifugation (3,220 g, 20 min, 4 C), and the supernatant was discarded. Alkali lysis and DNA isolation steps were performed using the Large Construct kit (Qiagen, UK) essentially following the manufacturers instructions. The DNA was resuspended in 4.75 ml Buffer Ex, 100 l 100 mM ATP (Fisher Scientic, UK) were added and contaminating E. coli DNA was removed using 150 l ATP dependent Exonuclease (Qiagen). During the incubation (1 h, 37 C) a Qiagen Tip-100 column (Qiagen) was equilibrated in Buffer QBT (Qiagen). 5 ml of Buffer QS were added to the DNA, and the sample was applied to the equilibrated column. The column was washed twice with 10 ml of Buffer QC (Qiagen). The DNA was eluted with 7.5 ml of pre-warmed (65 C) Buffer QF (Qiagen). The DNA was precipitated by adding 0.7 volume of room temperature isopropanol and centrifugation (20 min, 3,220 g, 4 C). The pellet was washed twice with 70% ethanol, air dried and dissolved in 200 l TE buffer according to manufacturers guidelines. The
DNA concentration was measured using a Qubit Fluorometer (Thermo Fisher, Cambridge, UK) and adjusted with water to 13 ng l[C0]1. For tagmentation 200 l diluted DNA were equilibrated (6 min, 55 C)
and subsequently provided with 52 l 5 Tagment Buffer Mate-Pair and 8 l Mate-Pair Tagmentation Enzyme (Illumina, San Diego, USA). After the incubation (30 min, 55 C), 65 l Neutralize Tagment
Buffer (Illumina, San Diego, USA) were added, and the reaction was incubated (5 min, room temperature). One volume CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) was added, and the DNA was puried using magnetic separation. The DNA was eluted in 170 l of nuclease-free water, quantied using a Qubit uorometer (DNA HS assay, Invitrogen) and analysed using the
Agilent Bioanalyser (DNA 1,200 chip, Agilent, Stockport, UK). Strand displacement was performed by combining 105.3 l of tagmented DNA, 13 l 10x Strand Displacement Buffer (Illumina), 5.2 l dNTPs (Illumina), 6.5 l Strand Displacement Polymerase (Illumina) and incubation (30 min, room temperature). CleanPCR beads (0.75 volume) were added and the DNA was puried using a magnet.
The DNA was eluted in 30 l nuclease-free water. The concentration was measured (Qubit, DNA HS assay, Invitrogen), and a 1:6 diluted sample was analysed using the Agilent Bioanalyser (DNA 1,200 chip,
Agilent, Stockport, UK). Size selection was performed using a Pippin Blue (Sage Science, Beverly, MA, USA). 30 l DNA were provided with 10 l loading buffer and separated on a 0.75% agarose cassette (size selection centered at 7 kb and collection between 68 kb) according to the manufacturers instructions (Sage Science, Beverly, MA, USA). Size selected samples were collected in 40 l of
TRIS- TAPS buffer (pH 8.0) (Sage Science, Beverly, MA, USA), and analysed using the Agilent Bioanalyser (high sensitivity chip, Agilent, Stockport, UK) to determine the nal library size. The DNA concentration was measured using the Qubit device and the Quant-It DNA HS Assay (Invitrogen). Circularisation was performed by combining 40 l size selected DNA, 12.5 l 10 circularisation buffer (Illumina), 3 l Circularisation Enzyme (Illumina) and 75 l nuclease-free water. The reaction was incubated at 30 C overnight. Linear DNA was digested by adding 3.75 l Exonuclease (Illumina) and incubation (30 min, 37 C). The enzyme was inactivated by heat (30 min, 70 C) and the addition of 5 l stop ligation (Illumina). Circularised DNA (130 l) was sheared in a Covaris MicroTube AFA Fiber (Pre-slit, Snap-cap, 6 16 mm; 2 cycles of 37 s, 10% duty cycle, 200 cycles per burst, 4 intensity, 4 C)
using the Covaris S2 device (Covaris, Massachusetts, USA). M280 Dynabeads (Thermo Fisher) were prepared as described (Illumina). 130 l washed M280 beads were added to the fragmented DNA, mixed and placed on a lab rotator (20 min, room temperature). Library molecules were afnity puried and washed as described (Illumina). The beads were resuspended in a mixture of 85 l nuclease free water, 10 l 10x End Repair Reaction Buffer (Ilumina) and 5 l end repair enzyme mix (Illumina) and incubated (30 min, 30 C). End repaired library molecules bound to M280 beads were washed as described (Illumina). A-Tailing and adapter ligation were performed according to manufacturers instructions (Illumina). For PCR amplication, the beads were resuspended in a reaction mixture (20 l nuclease-free water, 25 l 2x Kappa HiFi (Kappa Biosystems, London, UK), 5 l Illumina Primer Cocktail) and amplied (98 C for 3 min, 12 cycles of 98 C for 10 s, 60 C for 30 s, 72 C for 30 s followed by 72 C for 5 min and storage of the sample at 4 C). Beads were removed by magnetic separation and 45 l of the products were transferred to a 2 ml DNA Lobind Eppendorf tube. The DNA was precipitated by addition
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 6
of 31.5 l CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands). The beads were washed twice with 100 l 70% ethanol, and the nal library was eluted in 20 l resuspension buffer (GC biotech).
The DNA concentration was determined (Qubit, DNA HS assay, Invitrogen), followed by analysis using the Agilent Bioanalyser (High sensitivity chip, Agilent, Stockport, UK). Up to 12 mate-pair libraries were pooled in an equimolar manner and measured using the Kappa qPCR Illumina quantication kit. Sequencing-by-synthesis of pooled mate-pair libraries was performed using an Illumina HiSeq2500 device (rapid run mode, 2 150 cycles paired-end, single indexing reads) according to manufacturers instructions (Data Citation 26, Data Citation 27).
Sequence assembly of individual BACs
Assembly of gene-containing BACs (UCR/JGI). A total of 15,661 gene-bearing BACs were paired-end sequenced (2 100 cycles) using the Illumina HiSeq2000 platform (Illumina, Inc., San Diego, CA, USA) applying a combinatorial pooling design24, as described in Munoz-Amatriain et al.13. Reads were quality trimmed, deconvoluted, and then assembled BAC-by-BAC using Velvet version 1.2.09 (ref. 25) with the parameter k set to 45. Sequences of an additional 50 randomly chosen BACs included in Munoz-Amatriain et al.13 were derived using the Sanger method by Jane Grimwood (US Department of Energy Joint Genome Institute) and Jeremy Schmutz (HudsonAlpha Institute for Biotechnology), including shatter and transposon sequencing. The assignment of BACs to chromosome arms/pericentromeric regions was performed using CLARK26, an accurate k-mer-based classication method that is much faster than BLASTN or MegaBLAST. CLARK makes assignments by using a prebuilt database of k-mers that are specic to each chromosome arm/peri-centromeric region.
Assembly of MTP BACs from barley chromosomes 1H, 3H, 4H, 6H and 7H (FLI and IPK). A total of 10,148 BACs mainly originating from barley chromosome 3H were sequenced on the Roche 454 system. Reads were deconvoluted and assigned to individual BACs16. Reads were quality trimmed according to the manufacturers recommendations. Reads were screened for E. coli and vector sequences with MegaBLAST27. Assemblies were then constructed from the clean reads using the MIRA software28 as described in Steuernagel, et al.16 and Taudien, et al.29.
A total of 41,004 BACs were sequenced on Illumina machines (mainly HiSeq2000) in pools of up to 672 individually barcoded BAC clones. Paired-end reads were quality trimmed with the CLC toolkit and screened for E. coli and vector sequences with MegaBLAST. Assemblies were obtained by running CLC Assembly Cell Version 4.0.6 beta with default parameters. Contigs derived with low read coverage as well as contigs smaller than 500 bp were removed using the criteria described in Beier, et al.17.
The resultant contigs were then compared to NCBIs nucleotide database using MegaBLAST to check for possible contamination. Contigs with non-plant hits were either completely removed or trimmed.
Scaffolding of MTP BACs from barley chromosomes 1H, 3H, 4H, 6H and 7H (FLI and IPK). Scaffolding was performed as described in Beier et al.17 Briey, mate-pair reads were mapped against the concatenated assemblies of up to 384 BACs using BWA mem version 0.7.4 (ref. 30) with default parameters. Only read pairs mapping uniquely (minimal mapping quality of Q40) to different contigs of the same BAC assembly were retained. These reads were used to scaffold individual BACs using SSPACE version 3.0 Standard31.
If multiple mate-pair libraries were present (MiSeq mate-pair reads as well as HiSeq2000 mate-pair reads) an iterative scaffolding procedure17 was used.
Assembly of MTP BACs from barley chromosome 5H (BGI). Obtained raw sequence reads from 5H MTP BACs were ltered to generate high-quality reads by the following criteria: (1) reads containing more than 2% of Ns or with poly-A structures were removed; (2) reads with 40% low quality bases for short insert size libraries (60% for large insert size libraries) were excluded; (3) reads containing adapters were removed; (4) PCR duplicates were detected and excluded; (5) removal of reads contaminated byE. coli, vector sequences or phage sequences. High-quality reads were then used for assembly.
BACs were assembled using SOAPdenovo version 2.01 (ref. 32) multiple times using different k and m
values (main parameter in SOAPdenovo assembly). In total each BAC was assembled 45 times (k from 33 to 66, only odd numbers and m from 1 to 3). The N50 was examined for each assembly and the assembly with the largest N50 was retained as the nal assembly result for each BAC.
Scaffolding of MTP BACs from barley chromosomes 5H (BGI). Assemblies from paired-end sequences were used as reference for mapping 2, 5 and 10 kb mate-pair reads obtained from barley genomic WGS data with SOAPaligner/soap2 version 2.21 with parameters p 6 v 3 R. Mate-pair read pairs mapped in this fashion were used in conjunction with the corresponding paired-end read pairs to re-assemble each BAC using SOAPdenovo version 2.01 as described above.
Assembly of MTP BACs from barley chromosomes 2H and 0H (EI). Minimal tiling path BACs from (i) barley chromosomes 2H or from (ii) ngerprinted contigs not assigned to chromosomes (termed 0H) were sequenced. After demultiplexing, sample quality control (QC) information was generated using FastQC33. Contamination screening was carried out using Kontaminant34. Reads were screened using a k-mer size of 21 against a range of potential contaminants (Phi X, E. coli, Enterobacter cloacae
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 7
genomic DNA and BAC vector) and contaminated reads or reads with quality values o30 were removed.
ABySS assembler (v1.5.1)35 was used to assemble the ltered paired-end reads of each BAC individually (k-71, l-91 b-0). Paired-end contigs were compared to NCBIs NR database using BLAST to check for hits to non-plant organisms using e-value 1e-4 as threshold. The obtained hits were compared to NCBI taxonomy using fastacmd to obtain common names used to check for any non-plant hits.
Scaffolding of MTP BACs from barley chromosomes 2H and 0H (EI). Illumina Nextera mate-pair libraries were created from pools of 384 BACs. After quality checking the reads using PAP34, the reads
were merged using FLASH (version 1.2.9)36. Nextclip (v0.8)37 was run on the ashed reads to trim the junction adapters. A k-mer-based approach was used to assign mate-pair reads to individual BACs with KAT (v1.0.4) (https://github.com/TGAC/KAT
Web End =https://github.com/TGAC/KAT). Scaffolding and gap closing were performed on each BAC individually using an in-house shell script (available from GitHub: https://github.com/DhSaTGAC/BAC-assembly-pipeline.git
Web End =https://github.com/DhSaTGAC/ https://github.com/DhSaTGAC/BAC-assembly-pipeline.git
Web End =BAC-assembly-pipeline.git ). SOAPdenovo scaffolder version 2.01 (ref. 38) was applied to scaffold the ABySS paired-end contigs using the k-mer classied mate-pair reads with parameters k = 41, -G 30, -F, -w and -L 100. The resulting scaffolds were then edited to replace long stretches (>20) of C/G with N characters as SOAP is known to substitute Ns within paired-end contigs to C/G. The scaffolds were then passed through GapCloser (v1.12-r6), a SOAP2 module, to ll in long stretches of Ns produced during the scaffolding steps. Contigs and scaffolds shorter than 500 bp were removed to produce the nal assembly per BAC.
Splash contamination checks of MTP BACs from barley chromosomes 2H and 0H (EI). The raw reads within each plate were aligned to one side of the vector sequence adjacent to the restriction enzyme cut site using exonerate39. Substrings of size 20 bp were extracted from aligning reads containing the BAC sequence adjacent to the vector sequence. Flanking sequences from each BAC were clustered based on a Hamming distanceo3 and consensus sequences generated to account for sequencing errors. These were compared with neighboring wells to check for potential contamination caused by splash during lab processing steps. Where contamination between neighboring wells was indicated, the assembled contigs from each BAC in question were aligned in a pairwise fashion using exonerate and the total percentage of similar sequence ( 99% identity) was computed. In cases where neighboring BACs shared more than 10% similar sequence, both BACs were resequenced.
Pseudomolecule construction
Initial contamination removal. Sequence assemblies of 66,586 MTP clones, 5,468 non-MTP BACs and 15,044 gene-bearing clones13 (total number of unique BACs: 87,098) were combined into a single FASTA le (Data Citation 28,Data Citation 29,Data Citation 30). If a clone had two or more independent sequence assemblies, we selected the one with the largest N50 value for further analyses. BAC assemblies were aligned to a custom library of potential contaminants (Data Citation 31) including phages, bacterial and vector sequences using megablast27. Regions aligning to contaminants (criteria: (alignment length 500 bp AND identity 80%) OR (identity 90%)) were removed from the assembly using UNIX scripts and BEDTools40. Sequences shorter than 500 bp or consisting of less than 500 proper nucleotides (ACGT characters) after contamination removal were discarded. This step removed 55.5 Mb (0.5%) of the assembled BAC sequence.
Sequence alignment of BACs sequences and overlap detection. After contamination removal, a set of 87,075 BAC assemblies (Table 1, Data Citation 32) was aligned against itself using megablast27 with a word size of 44, retaining only alignments with identity 99% and alignment length 500 bp. Two sets of overlaps (stringent and permissive) between BACs were dened from the BLAST results of all BACs against each other. Pairs of BACs were considered as potentially overlapping under stringent criteria if there was at least one high-scoring pair (HSP) with alignment length 5 kb and identity 99.8%. Under permissive criteria, we required at least one HSP with alignment length 2 kb and identity 99.5%. For all pairs of potentially overlapping BACs (under either set of criteria), the size of their overlapping regions was determined using UNIX scripts and BEDTools40 as the extent of non-redundant regions in the BAC sequences (i.e., contigs or scaffolds) contained in HSPs 500 bp and identity 99.5% between BAC sequences having at least one HSP with alignment length 5 kb and identity 99.8% (stringent criteria) or alignment length 2 kb and identity 99.5% (permissive criteria). HSPs less than 200 bp apart were combined into one with BEDTools (command merge). BAC overlap information was imported into the R statistical environment41 for use in genetic anchoring and merging sequence assemblies of adjacent BAC clones (see section Construction of the BAC overlap graph).
Alignment of BACs to the BioNano map of barley cv. Morex. An optical map of the genome of barley cv. Morex was generated using the Irys platform of BioNano Genomics using Nt.BspQI as the nicking enzyme. Further details of the optical map procedure are described in Mascher et al.42 An in silico
BspQI digest was performed with the Knickers software (http://www.bionanogenomics.com
Web End =http://www.bionanogenomics.com) using default parameters. Restriction maps of BAC sequences were aligned to the BioNano map of barley cv. Morex42 (Data Citation 33) with IrysView software43 (http://www.bionanogenomics.com
Web End =http://www.bionanogenomics.com) using the
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 8
MTP chromosome no of. BACs in MTP no. of sequenced BACs no. of anchored BACs* average no. of sequences average N50 (kb) 1H 6,993 6,983 (99.9%) 6,410 (91.8%) 7.6 81.22H 9,061 8,969 (99.0%) 8,195 (91.4%) 9.9 104.53H 8,841 8,807 (99.6%) 8,303 (94.3%) 7.7 87.54H 8,314 8,306 (99.9%) 7,783 (93.7%) 6.7 91.25H 8,426 8,358 (99.2%) 7,573 (90.6%) 9.7 72.26H 8,305 7,886 (95.0%) 6,476 (82.1%) 7.4 70.77H 8,576 7,970 (92.9%) 6,842 (85.8%) 8.5 65.50H 8,256 8,031 (97.3%) 6,714 (83.6%) 7.6 83.6 Non-MTP 21,765 20,397 (93.7%) 14.5 33.7 Total 66,772 87,075 78,693 (90.4%) 9.8 70.3
Table 1. BAC assembly and anchoring statistics. *Number and percentage of BAC clones that have been assigned genetic positions in the POPSEQ map. BAC clones in physical contigs that had not been assigned to chromosomes.
command line tool RefAligner (version 3827) with the following parameters -M 2 -T 1e-4 -extend 1 -biaswt 0 to report all alignments with a condence score 4.
Construction of the updated POPSEQ map of the Morex x Barke mapping population. An ultra-dense linkage map had been constructed previously5 by shallow whole-genome shotgun sequencing of 90 recombinant inbred lines (RILs) derived from a cross between the barley cultivars Morex and Barke. We wished to increase the resolution of this map by reducing the average fraction of missing data per SNP marker. Towards this aim, we sequenced the existing Illumina paired-end libraries of 87 RILs to higher coverage (23x) and combined them (Data Citation 34) with the existing read data set5 (ENA accession: ERP002184). Map construction followed the procedures described in Chapman et al.44.
Reads were aligned to the whole-genome shotgun assembly of barley cv. Morex4 (NCBI accession: CAJW01) with BWA mem version 0.7.5a (ref. 45). Sorting, conversion to BAM format and removal of duplicate reads was done with PicardTools version 1.100 (http://broadinstitute.github.io/picard/
Web End =http://broadinstitute.github.io/picard/). Variant detection and genotype calling were performed with SAMTools version 0.1.19 (commands samtools mpileup BD and bcftools view cvg). The resultant VCF le was ltered using an AWK script (Supplementary Text S3 of Mascher et al. 2013 (ref. 46)). Homozygous genotype calls were set to missing if their read depth was 0 or their genotype quality below 3. Heterozygous genotype calls were set to missing if their read depth was below 3 or their genotype quality below 5. Variants with (i) a quality scores below 40, (ii) more than 10% heterozygous genotype calls, (iii) more than 90% missing data after genotype call ltering, or (iv) a minor allele frequency below 5% were discarded. SNP information was aggregated at the contig level to derive consensus genotypes as described in the section Framework map construction in the Methods section of Chapman et al.44 For map construction with MSTMap47, the population type RIL8 was used. Additional contigs were inserted into the framework map as described in Chapman et al.44 (section Anchoring scaffolds onto the framework map) using previously published read data5. Variant calling and map construction were done for the Oregon Wolfe Barley (OWB) doubled haploid population using the same procedures with the following two changes: (i) heterozygous genotype calls were excluded and (ii) the population type DH was used for map construction with MSTMap47.
Map positions in the OWB map were interpolated into the Morex x Barke map using loess regression in R41. A consensus position was derived as follows: if map positions disagreed by more than 5 cM in both maps, a contig was considered unanchored; otherwise, the Morex x Barke position was preferred if available. The nal map assigned genetic positions to 791,176 WGS contigs (Table 2, Data Citation 35), compared to 723,499 anchored contigs in the original POPSEQ map5.
Genetic anchoring of single BAC clones. The genetic positions of Morex WGS contigs in the updated POPSEQ map were lifted to BAC sequences via sequence alignment. The set of all contigs of the whole-genome shotgun assembly of barley cv. Morex4 (NCBI accession: CAJW01) was aligned to all BAC assemblies with megablast27 using a word size of 44 and retaining only alignments with identity 99.8%
and alignment length 1,000 bp. For each BAC clone, the genetic positions of WGS contigs aligning to its constituent sequences were tabulated and a genetic position of a clone was derived using a majority rule with functions of the R package data.table (https://cran.r-project.org/web/packages/data.table/index.html
Web End =https://cran.r-project.org/web/packages/data.table/index. https://cran.r-project.org/web/packages/data.table/index.html
Web End =html ). Ninety per cent of contigs assigned to a BAC had to originate to the major chromosome and the standard deviation of genetic positions had to be 3 cM. BACs without alignments to anchored WGS contigs were considered as unanchored; those not meeting the consistency criteria were agged as inconsistently anchored. In the second step, unanchored clones were positioned by utilizing positional information from neighboring BACs. We considered as neighbors of a given clone B all those BACs that overlapped for at least 10% of their assembled lengths with clone B. The genetic position of an
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 9
Chromosome No. of anchored WGS contigs Length of anchored WGS contigs (Mb) 1H 74,184 123.72H 130,436 202.63H 119,131 187.64H 96,642 170.65H 117,314 177.86H 121,384 168.47H 132,085 190.2Total 791,176 1220.9
Table 2. Summary statistics of the updated POPSEQ map of the Morex WGS assembly.
unanchored BAC B with an assembled length 300 kb were borrowed from its neighbors if all of them were anchored to same chromosome and the standard deviation of genetic coordinates was at most 3 cM. If these criteria were fullled, the genetic position of B was set to the arithmetic mean of the genetic coordinates of its neighbors. Genetic positions were determined for 78,693 (90.4%) BACs (Table 1, Data Citation 36).
Construction of the BAC overlap graph. We converted the overlap information between BACs in a graph structure using the R package igraph48. Nodes represented BACs. An edge was drawn between two nodes (BACs) if the criteria regarding sequence overlap and consistency of positional information were fullled as detailed below. The edge weights were set to the cumulative length of intervals in which two adjacent BACs overlapped. We named the connected components of this graph clusters. These clusters are analogous to physical contigs in that they represent overlaps between BACs. In contrast to physical contigs, overlaps between BACs in the cluster graph are not derived from restriction maps, but from sequence alignments.
The initial overlap graph was rened in subsequent steps by adding edges that were supported by(i) additional information about links between BACs derived from BAC end sequences, (ii) the genome-wide physical map of barley3 or (iii) the BioNano map. After each renement step, we checked for the existence of branches in the overlap graph. Such branches should not occur in a linear genome and may have arisen from spurious sequence alignments or incorrect positional information. We also determined genetic locations of clusters by aggregating the positional information of their constituent BACs using a majority rule, requiring all anchored BACs to come from the same chromosome and the standard deviation of their genetic coordinates to be 5 cM. Clusters not meeting these criteria were considered inconsistently anchored. Edges giving rise to branches or to inconsistent genetic positions were detected and removed. To detect branches, we calculated a minimum spanning tree (MST) of each cluster using Prims algorithm49 as implemented in the igraph48 function minimum.spanning.tree(). A geodesic of the MST of maximal length was determined with the igraph function get.diameter() and set as the linear(i.e., branchless) backbone of the cluster. In the MST, each BAC B was either part of the diameter or attached to a single BAC of the backbone, i.e., there existed a path from B to one and only one BAC of the backbone. The length of this path to a member of the backbone was dened as its rank. Groups of BACs attached the same backbone BAC were considered as a BAC bin of the cluster. Branches were dened as groups of nodes with rank>1. A cluster was said to be branched if it contained branches, i.e., had a non-linear structure. Note that due to redundancies in the BACs selected for sequencing, we expect BACs with rank equal to 1. After each insertion or removal of edges or nodes, connected components, MST backbones and genetic positions of clusters were re-calculated, and branches and inconsistencies with genetic data removed if necessary. The summary statistics of the overlap graph after each step are given in Table 3. The nal clustering results summarized in Table 4 are available as Data Citation 36).
Step 1: Initial overlap graph from links within FP contigs
In the initial overlap graph, an edge between two BACs was drawn if both BACs were (i) on the same ngerprinted (FP) contig, (ii) the overlapping regions between them accounted for 5% of the length of either BAC and (iiiA) there were genetically anchored to the same chromosome within 3 cM of each other or (iiiB) one or both clone were unanchored. To determine overlap lengths, we used the permissive set of overlaps. BACs that were inconsistently anchored or whose assembled length was>300 kb were excluded from the graph. The initial graph had both branched and inconsistently anchored clusters. To remove inconsistencies in genetic positions, all edges involving unanchored clones were deleted in clusters showing inconsistent genetic positions. To remove branches in the initial graph, we rst removed nodes representing non-MTP clones that were part of branches. This step was iterated twice. In the next steps, BACs in branches and originating from the set of gene-bearing BACs13 were excluded. These BACs were sequenced using combinatorial pooling strategy and errors during demultiplexing may have given rise to chimeric assemblies. After these steps, nine clusters with branches remained in the graph. BACs in
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 10
Step Datasets* Clusters BACs in clusters Singleton BACs Excluded BACs Cluster N50 Average cluster size 1 BAC, FPC 9,637 71,828 13,211 2,036 21 12.92 BAC 4,890 79,871 4,002 3,202 60 38.33 BAC, OM 4,843 79,884 3,989 3,202 61 38.84 FPC, BES, OM 4,653 79,884 3,989 3,202 65 41.25 FPC, BES 4,562 79,908 3,965 3,202 66 41.76 BAC, OM 4,486 79,918 3,955 3,202 66 42.47 FPC, BAC 4,485 79,919 3,954 3,202 66 42.48 FPC, OM 4,390 79,919 3,954 3,202 66 43.09 exBAC 4,382 80,010 3,938 3,127 66 43.110 BAC, OM 4,323 80,010 3,938 3,127 67 43.811 FPC, OM 4,259 80,010 3,938 3,127 69 45.212 BES, FPC 4,251 80,010 3,938 3,127 69 45.2
Table 3. Cluster summary statistics after each step of the BAC overlap graph construction. *Datasets used in each step (BAC, BAC sequence overlap; FPC, physical map; OM, optical map; BES, BAC end sequences; exBAC, previously excluded BAC assemblis. Consistency with the POPSEQ genetic map was checked in each step. An N50 value N indicates that half of all clusters contain at least N BACs. Arithmetic mean of the number of BACs per cluster.
1H 2H 3H 4H 5H 6H 7H Un
Number of clusters 389 605 324 415 549 768 943 242 Number of singletons 65 214 74 78 173 167 162 1190 Assembly length (Mb) 562.8 785.5 704 655.5 687.8 600.2 663.8 130.6 Length in clusters (Mb) 555.9 760.3 695.8 648.4 668.2 581.1 646 28.9 Length in singletons (Mb) 6.9 25.1 8.3 7.1 19.5 19.1 17.7 101.7 N50 (Mb) 2.5 2.1 3.6 2.5 2.0 1.1 1 0.1
Table 4. Final cluster statistics.
these branches were removed from the graph. After these steps, the graph was unbranched and showed no inconsistencies with the genetic map. The graph consisted of 9,637 clusters and 13,211 singletons (Table 3).
Step 2: Adding links between FP contigs
Next, we added edges between BACs on different FP contigs. An edge between two BACs was drawn if (i) the overlapping regions between them accounted for 10% of the length of either BAC and (iii) they were genetically anchored to the same chromosome within 3 cM of each other. Stringent overlap criteria were used in this step. This graph had branches, which were removed in subsequent steps. First, clones shorter than 50 kb or having an N50o10 kb were excluded. Then, nodes representing non-MTP clones that were part of branches were deleted. This step was repeated once. Then, edges where both clones were part of branches and in different FPCs were removed, followed by another removal of non-MTP clones. In the next step, clones in branches that were longer than 250 kb were removed. These large assemblies may combine sequences of two unrelated BACs as a result of chimeric inserts or cross-contamination between neighboring well positions. Next, gene-bearing clones13 in branches were deleted. Finally, all remaining clones in branches were discarded. The resultant graph had no branches and all its clusters were consistently anchored to the genetic map. This step reduced the number of clusters from 9,637 to 4,980 and led to the exclusion of 1,166 putatively chimeric BAC assemblies giving rise to non-linear structures (Table 3).
Step 3: Adding links with permissive overlap criteria, but support by the BioNano map
In the next steps, we tried to nd additional links between BACs that would support the joining of neighboring clusters. This was motivated by our desire to have fewer, but large clusters (i.e., increase the contiguity of the overlap graph) to facilitate the construction of the Hi-C map (see below). Towards this aim, we added edges to the graph using less stringent overlap criteria, but requiring support from other datasets. If the inclusion of an edge gave rise to a branch or map inconsistencies, this edge was removed
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 11
again. We note that in some cases edges do not represent true sequence overlaps between BACs, but only evidence for close proximity of two BACs.
In the rst step, we added edges between two BACs if (i) they were located at the ends of clusters,(ii) the overlapping regions between them accounted for 10% of the length of either BAC, (iii) they were genetically anchored to the same chromosome within 3 cM of each other and (iv) and the link was supported by the BioNano map. The BACs at the ends of clusters were determined from the MST traversals of clusters. Support by the BioNano map means the presence of a single contig of the BioNano map (an optical genome map (OM) in BioNanos nomenclature) that links to two clusters. To nd such genome maps, we aggregated the alignment information between BAC sequences and OMs at the level of clusters. In the alignment table between BioNano maps and BAC sequences, we only retained the best alignment of each BAC sequence contig. A cluster was considered aligned to a OM if the sum of the condence scores (as reported by BioNanos refaligner software) of its BAC sequences was at least 25. A OM was joining two clusters if (i) the distance in the OM between restriction map alignments pertaining to the two clusters was (i) 300 kb and (ii) the order and orientation of alignments to the OM were consistent with the order of BACs in the MSTs of the clusters, requiring a rank correlation above0.5. Adding all edges meeting these criteria to the overlap graph did not result in branches or inconsistent map positions within clusters. The graph consisted of 4,843 clusters (Table 3).
Step 4: Adding links supported by FP contigs, BAC end sequences and the BioNano map
We added edges representing pairs of BAC end sequences linking BACs at ends of clusters on the conditions that (i) these links were supported by the BioNano map and (ii) the joined BACs originated from the same FPC contig. BAC end sequences of cv. Morex (EMBL ENA accessions: HF140858-HF362636, HE975059-HE977519, HF000001-HF140857, HE867107-HE939654, HE939655-HE956691 and HF362637-HF479769) were aligned to all BAC assemblies with megablast27 using a word size of 28 and considering only hits with identity 99.5% and alignment length 500 bp. We identied pairs of
BAC end sequences that aligned to BACs B1 and B2 from two different clusters C1 and C2. BACs B1 and B2 were required to be the end of their clusters and to belong to same FPC contig and were less than 200 kb apart from each other in the physical map (using the conversion factor 1 FPC consensus band = 1.24 kb3) map. Moreover, we required the clusters C1 and C2 to be connected by a BioNano contig under the criteria described in the section Adding links with permissive overlap criteria, but support by the BioNano map. If all these criteria were fullled, we added an edge between B1 and B2. This step did not introduce branches or inconsistently anchored clusters to the graph. The number of clusters decreased to 4,653 (Table 3).
Step 5: Adding links supported by FP contigs and BAC end sequences
In this step, we used BAC end sequences and FP information to nd additional links as described in the previous step, but we did not require support by the BioNano map. This step introduced branches to the graph that were removed by pruning newly introduced edges between BACs in branches. The updated graph was composed of 4,562 clusters (Table 3).
Step 6: Using FP information and inconsistently anchored BACs to bridge gaps
In previous steps, we had excluded inconsistently anchored BAC assemblies from the overlap analysis. We speculated that many of these assemblies may contain BAC sequences from two unlinked genomic loci as a consequence of chimeric inserts or cross-contamination between neighboring wells during handling of BAC plates for MTP rearraying or sequencing. So if both BACs were fully assembled, one could use their sequences to link BAC clusters under the condition that further evidence corroborates the connection. We identied inconsistently anchored BACs (termed link BACs) that showed stringent sequence overlaps ( 10% of the assembled length of either BAC) to two BACs B1 and B2 at the ends of different clusters. We required BACs B1 and B2 to originate from the same FP contig and to be anchored within 1 cM of each other in the POPSEQ genetic map. If these criteria were met, we added an edge between B1 and B2 in the overlap graph. We did not add the link BAC itself to avoid introducing contaminant sequences from other parts of the genome. This step did not introduce branches or inconsistencies with genetic data. The number of clusters decreased to 4,486 (Table 3).
Step 7: Using singletons BACs to bridge gaps in FP contigs
In this step, we tried to nd single BACs that can close gaps within FP contigs. We identied pairs BACs B1 and B2 that were located on the same FP contigs, but different clusters, and searched for a third B3 that had stringent sequence overlap ( 10% of the assembled length of either BAC) to both B1 and B2.
We required that B3 was a singleton (i.e., a cluster of size 1) and was within 3 cM of both B1 and B2 and the POPSEQ genetic map. If these criteria, were fullled we added edges B3o->B1 and
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 12
B3o->B2. No branches or inconsistencies with the POPSEQ map were introduced in this step. This step resulted in the merging of two adjacent clusters and the incorporation of one singleton (Table 3).
Step 8: Using FP information and BioNano data
We searched for links between two BAC clusters that were part of the same FP contig and that were supported by alignments to a single BioNano contig. We searched the BioNano map for links between clusters as described in the section Adding links with permissive overlap criteria, but support by the BioNano map. We required the alignments of connected clusters to be no farther apart than 300 kb and that the corresponding BACs came from the same FP contig and were located within 300 kb in the FP map. Moreover, the order and orientation in the FP contig and the BioNano map were required to be consistent with each other. If these criteria were fullled, we added an edge between the BACs at the abutting end of the two connected clusters. This step introduced inconsistencies to the POPSEQ map that were removed by deleting all newly inserted edges in the affected clusters. This step reduced the number of clusters from 4,485 to 4,390 (Table 3).
Step 9: Adding BACs previously considered as inconsistently anchored
We searched for BACs who (i) were agged as inconsistently anchored because of the standard deviation of the genetic coordinates of the Morex WGS aligned to them was larger than 3 cM, (ii) had stringent overlaps to non-singleton BACs. We required that all Morex WGS contigs aligning to these BACs originated from the same chromosome. We added these BACs and edges leading to them to the overlap graph. This step introduced branches to the overlap graph, which were removed by deleting the newly added BACs in branched clusters. This step resulted in the incorporation of 75 additional BACs into the overlap graph (Table 3).
Step 10: Using BAC overlap information and BioNano data
In this step, we used BAC sequence overlap information and BioNano map data to add edges to the overlap graph. We found potential connections between clusters as detailed in the section Adding links with permissive overlap criteria, but support by the BioNano map. If the two BACs B1 and B2 at the adjoining ends of the two linked clusters were within 3 cM of each other and the overlapping regions was ( 10% of the assembled length of either BAC), we added an edge between B1 and B2. This step did not introduce branches or inconsistencies with the genetic map. The updated graph consisted of 4,323 clusters (Table 3).
Step 11: Using FP information to bridge gaps
In this step, we aimed to use the BioNano map to close gaps between two BACs B1 and B2 that are near to each other in the physical map and were expected to overlap with a common BAC B3 between them (layout: B1 ->B3 ->B2) based on ngerprinting results, but their sequence assemblies failed to do so, resulting in a short gap between B1 and B2. Towards this purpose, we identied pairs of BACs B1 and B2 that (i) were on the same chromosome less than 3 cM part and (ii) located at the ends of two different overlap clusters and (iii) came from the same FP contigs, (iv) were separated by less than 300 kb in the FPC map with a single BAC B3 between them in the FPC map. Such cases may occur if both B1 and B2 were expected to overlap with B3 according to FPC information, but either the overlapping regions could not be detected in the alignment of the sequence assemblies because of low assembly quality or because of BAC mix-ups during ngerprinting, re-arraying of MTP clones or sequencing library preparation, so that B1 and B2 were separated by a gap in the overlap graph. We added an edge between B1 and B2 if the following conditions were fullled: (i) the two clusters of B1 and B2 could be aligned to the same contig of the BioNano map, (ii) the aligned regions were less than 300 kb apart in the BioNano map and (iii) the orientation of the BioNano contigs and the overlap clusters were consistent. This step did not introduce branches or inconsistencies with genetic data. This step decreased the number of clusters from 4,323 to 4,259 (Table 3).
Step 12: Adding links supported by BAC end sequences and the BioNano map
We identied BACs link supported by BAC end sequences and the BioNano map as described in Step 4, but did not require the connected BACs to come from the same FP contig. Added links meeting the criteria to the overlap graph did not create branches or inconsistencies. The nal graph consisted of 80,010 BACs in 4,251 clusters and 3,938 singleton BACs (Table 3).
Construction of non-redundant sequences of BAC overlap clusters
A non-redundant sequence was constructed for each BAC cluster by detecting and removing sequence overlaps between neighboring BACs using an iterative procedure. In the initial step, the complete sequence of the largest sequence scaffold among the assemblies of all BACs in a cluster was added to the
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 13
set of visited BAC sequence scaffolds, all other sequence scaffolds were part of the set of unvisited BAC sequence scaffolds. The set of unvisited sequence scaffolds was then aligned to the visited sequence scaffolds with megablast27 with a word size of 44, accepting only high-scoring pairs with an alignment length 500 bp and an alignment identity 99.5 bp. Alignments between two sequence scaffolds from
BACS B1 and B2 were only allowed if B1 and B2 were separated in the minimum spanning tree of the cluster by no more than 10 BACs. Regions contained in alignments to visited scaffolds satisfying these criteria were subtracted from the unvisited sequence scaffolds using BEDTools40. Sequence scaffolds that were composed of less than 500 proper nucleotides (ACGT characters) after subtraction were discarded. The largest sequence scaffold among the unvisited scaffolds was moved from the set of unvisited to the set of visited scaffolds. These steps of alignment, redundancy removal and selection of the largest unvisited scaffold were repeated until no unvisited scaffolds remained. Finally, stretches of N characters at the ends of non-redundant fragments of sequence scaffolds were trimmed with an AWK script. After these procedures had been carried out for all BAC clusters, the resultant non-redundant sequences were written into a single FASTA le (Data Citation 37).
Construction of a high-resolution GBS map of the Morex x Barke populationAt this stage, we constructed a high-resolution linkage map from GBS data using the non-redundant sequence as a reference for read alignment. This map was used to derive orientations of BAC overlap clusters in the Hi-C map (see Orienting clusters by Hi-C and GBS) and to validate the order of clusters in the Hi-C map (see Technical Validation). GBS libraries of 2,398 recombinant inbred lines of the Morex x Barke lines were constructed using published protocols46,50 and subjected to Illumina or IonTorrent sequencing (Data Citation 38). Adapters were trimmed from GBS reads with cutadapt51
version 1.8.1. Reads shorter than 30 bp after trimming were discarded. Trimmed reads were mapped to the non-redundant sequence of BAC clusters with BWA45 mem version 0.7.12. The resultant alignment les were converted to BAM format with SAMtools52 (version 0.1.19), sorted with Novosort (Novocraft Technologies Sdn Bhd, Malaysia, http://www.novocraft.com/
Web End =http://www.novocraft.com/ ) and merged into a single BAM les with Picard (version 1.128, http://broadinstitute.github.io/picard/
Web End =http://broadinstitute.github.io/picard/ ). Multi-sample SNP calling was performed with FreeBayes53 using the parameters -i -X -u -n 2 -$ 5 -e 2 -m 20 -q 20 --min-coverage 500 -G 200 -F 1 -w --genotype-qualities --report-genotype-likelihood-max. The resulting VCF le was ltered with an AWK scripts (Text S3 of Mascher et al.46). Only bi-allelic SNP with a quality score 40 were considered.
Homozygous genotype calls were set to missing if their read depth was below 2 or their quality score below 20. Heterozygous genotype calls were ignored. Variants with more than 50% missing data or a minor allele frequency below 30% were discarded. The ltered SNP-by-individual matrix was imported into the R statistical environment41 for further processing. After removing samples with less than 6,000 successful genotype calls, the nal marker-by-individual matrix was constructed by discarding SNPs with more than 10% missing data. Genetic map construction was done with MSTMap47 with a P-value cut off of 1 1060 using the population type RIL8. The nal map included genotypic data from 1,613 individuals at 2,637 variant positions (Table 5, Data Citation 39).
Hi-C map construction
Hi-C map construction comprised the steps (i) data alignment to the non-redundant sequence,(ii) ordering and (iii) orienting BAC clusters using Hi-C link information.
Alignment of Hi-C data to restriction fragments. A BED le representing all intact HindIII restriction fragments 100 bp within in the non-redundant sequence was constructed using a custom
AWK script. Whole genome shotgun reads4 of barley cv. Morex corresponding to ~14x whole genome coverage were aligned to non-redundant sequence with BWA mem 0.7.12 (ref. 45), converted to BAM format with SAMtools52. Duplicate removal and sorting were done with Novosort. The coverage of the non-redundant sequence with WGS reads was calculated with SAMtools52 using the command depth Q 20 q 10 and written into a BED le. This le was used to calculate the average coverage of each HindIII
Chromosome No. of SNPs No. of bins Map length (cM) 1H 346 195 133.32H 383 231 153.23H 385 231 154.94H 237 135 115.55H 474 265 173.36H 362 188 122.77H 450 253 143.9total 2,637 1,498 996.8
Table 5. Summary statistics of the GBS map.
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 14
fragment using the BEDTools40 command map. Fragments with an average coverage below 7 or above 21 were discarded.
Paired-end reads9 (Data Citation 40) obtained using the Hi-C and TCC protocols9,54 as described in ref. 42 were trimmed using cutadapt51 version 1.8.1 using as the adapter sequence the extended NheI restriction site (AAGCTAGCTT) created by ligating two blunted HindIII fragments9. Trimmed read pairs were mapped as single ends to the non-redundant sequence using BWA mem version0.7.12 (ref. 45) with parameters -M P S and then converted to BAM format with SAMtools52.
After duplicate removal with Novosort (Novocraft Technologies Sdn Bhd, Malaysia, http://www.novocraft.com/
Web End =http://www. http://www.novocraft.com/
Web End =novocraft.com/ ), BAM les were sorted by read name to group the two mates of a pair together. Hi-C mapping information was then converted from BAM to BED format and assigned to HindIII restriction fragments with BEDTools40 using the command pairtobed -bedpe type both requiring both mates of a pair to have mapping quality 10. A custom AWK script was used to calculate the size of sequence fragments that read pairs originated from based on the distance of mapped ends to the next HindIII restriction site. After discarding fragments with size 500 bp, read pairs linking two different clusters (Hi-C links) were tabulated using standard UNIX tools (AWK, sort, uniq) and the link counts for each cluster pair were imported into R41.
Ordering scaffolds by Hi-C. Clusters whose non-redundant sequence was less than 30 kb or which had less than 20 restriction fragments were not used for making the Hi-C map. Scaffold ordering with Hi-C data was done using a custom R implementation of the algorithm outlined in Burton et al.10. First,
the Hi-C link information was entered into graph structure using the R package igraph (http://igraph.org/r/
Web End =http://igraph. http://igraph.org/r/
Web End =org/r/ ). The graph was composed of nodes representing the clusters and of edges representing Hi-C links between them. The edge weights were set to log10(number of Hi-C links). Only links between clusters anchored genetically to the same chromosome within 15 cM of each other were considered. For each of the seven largest connected components (corresponding to the seven chromosomes of barley), a minimum spanning tree was calculated with Prims algorithm49 as implemented in igraph. This resulted in a backbone map into which further nodes (clusters) were inserted so as to minimize the additional weight incurred by each node insertion. Subsequently, the 2-opt heuristics and single node relocation as used in the MSTMap algorithm for genetic mapping47 were applied to incorporate local perturbations that reduce the weight sum of the initial solution. The resultant paths of each connected component (chromosome) were oriented from short to long arm by comparison to the POPSEQ genetic map.
Orienting clusters by Hi-C and GBS. To orient clusters relative to the telomeres of the long and short chromosome arm, clusters were divided into bins of 300 kb size that were ordered by Hi-C as described above. If a cluster comprises several bins, the scaffold orientation can be inferred from the order of its constituent bins in the global Hi-C map of all 300 kb bins, which is oriented on a chromosome scale (from short to long arm) by comparison to the genetic map as described above. Local inversions may arise in the Hi-C map of the bins because of the reduced accuracy of Hi-C mapping when smaller intervals are used to aggregate Hi-C link information. To correct inverted orientations in the bin map, we checked how the relative order of a cluster C and its two adjacent clusters was correlated with that of their constituent bins. If the correlation coefcient was negative, the orientation of cluster C was reversed. If no HIC orientation could be determined, but orienting clusters was possible using GBS marker information, this information was used instead. The orders and orientation of sequence clusters are given in Data Citation 41.
Construction of pseudomolecule sequences
We constructed a FASTA le containing a single entry for each barley chromosome (a pseudomolecule) and an additional entry combining all sequence not anchored to chromosomes. Prior to the construction of pseudomolecules, we (i) identied genes incomplete or missing in the non-redundant sequence, but represented by (a) BAC sequence that had been excluded from the construction of the non-redundant sequence, or by (b) Morex WGS contigs4; and (ii) performed a nal scan for contaminant sequences.
Identication of additional gene-bearing sequences. The sets of (i) barley high-condence (HC) genes annotated on the WGS assembly of cv. Morex4 and (ii) barley full-length cDNA (-cDNA) sequences55 were aligned with GMAP56 version 2014-12-21 to (a) the set of all BAC assemblies,(b) Morex WGS contigs4 and (c) the non-redundant sequence.
First, we identied genes (as represented by the HC genes or -cDNAs) whose best alignment to the set
of assembled sequences of all BACs in clusters (as opposed to BACs excluded from the overlap analysis) represented at least 5% more of their coding sequence than their best alignment to the non-redundant sequence. Such cases arise if during the iterative construction of the non-redundant sequence, a sequence contig (or scaffold) C1 that breaks within a gene G is chosen before a contig C2 that contains a larger part of G than C1, but the total length of C1 is larger than that of C2. To amend such situations, we added contigs of type C2 to the non-redundant sequence and removed contigs of the non-redundant sequence that had previously represented the sequence now covered by C2. Towards this purpose, we aligned the sequence of each C2-type contig C to the non-redundant sequence of its BAC cluster of origin with megablast27 using a word size of 44 and considering only high-scoring pairs with an alignment
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 15
length 500 bp and an alignment identity 99.5%. Regions of the old non-redundant sequence covered by C (as determined by commands of BEDTools40 suite) were removed and contig C was added instead. This procedure was performed for each C2-type contig.
Next, we queried the GMAP alignments for genes that had no alignments to the non-redundant sequence, but were represented either in (a) the Morex WGS contigs or in (b) sequences of BACs excluded from the overlap analysis. We considered sequence of type (a) and (b) as additional gene-bearing sequences. We aligned these additional gene-bearing sequences to the non-redundant sequence with megablast27 using a word size of 44 and considering only high-scoring pairs with an alignment length 500 bp and an alignment identity 99.5%. Regions covered by the non-redundant sequence under these alignment criteria were subtracted from the additional gene-bearing sequences and sequence fragments with a length 500 bp were added to the non-redundant sequence.
Final contamination removal. We identied regions in the non-redundant sequence that were not covered by whole-genome shotgun reads of cv. Morex. Alignment of WGS reads and read depth calculation were done as described in the section Alignment of Hi-C data to restriction fragments. Regions of the non-redundant sequence not covered by Morex WGS reads and with a length 500 bp were extracted using UNIX command line tools and BEDTools40 (command getfasta). The extracted sequences were aligned to the NCBI NT database with megablast27 using a word size of 44 and requiring the high-scoring pairs to have a length of at least 100 bp and an alignment identity 80%. We retained only hits whose description in the NCBI NT database did not match the following regular expression (R syntax) representing a list of common and taxonomic names of plant species:
Hordeum|Triti|Populus|Aegilops|Avena|Alnus|A\\.squarrosa|Morus|Nelumbo|Brassica|Cucumis|Citrus| Camelina|Fragaria|Lotus|Tarenaya|Spartina|Eucommia|Sorghum|Corylus|Theobroma|Phaseolus|Barley| Trifolium|Elymus|Brachypodium|Beta vulgaris|Ricinus|Licania|Phoenix|H\\.vulgare|Pyrus|Malus|Prunus| Saccharum|Hypericum|Wheat|Oryza|hloroplast|Secale|Vitis|Quercus
Regions overlapping the BLAST hits passing these lters were cut from the non-redundant sequence with BEDTools40 (command subtract). Sequences shorter than 500 bp after the removal of contaminant sequences were discarded. This step removed 5 Mb (0.1%) of the assembled sequence.
Construction of pseudomolecule sequences for chromosome 1H7H and chrUn. We constructed pseudomolecules of the seven barley chromosomes by placing the sequence fragments of single BAC assemblies that constitute the non-redundant sequence according to the Hi-C map positions of the BAC overlap clusters these fragments belong to. Sequences not anchored by Hi-C were placed on chrUn (chromosome unassigned). The order of clusters was taken from the Hi-C map. BACs within the same cluster were ordered according to the minimum spanning tree of the BAC overlap graph of the cluster and oriented relative to the telomeres using the Hi-C orientation of the cluster if available. The relative order of sequence fragments originating from the same BAC bin (see section Construction of the BAC overlap graph) could not be determined so that the placement of sequences within a BAC bin (average size: 70 kb) is arbitrary. ChrUn is composed of (i) sequence fragments originating from BAC overlap clusters not placed in the Hi-C map, or (ii) gene-bearing fragments of BAC sequences and Morex WGS contigs selected in addition to the non-redundant sequence (see section Identication of additional gene-bearing sequences). A gap of 100 N characters was inserted between adjacent sequence fragments. Pseudomolecules of all chromosomes and chrUn were combined into a single FASTA le (Data Citation42). To accommodate limitations of the Sequence/Alignment Map format (see Usage Notes) split pseudomolecules with a size below 512 Mb were constructed by breaking pseudomolecules arbitrarily at breaks between sequence contigs (Data Citation 43, Data Citation 44). A BED le indicating the placement of BAC sequence fragments, Morex WGS contigs and intercalating gaps in the (split) pseudomolecules is available for download (Data Citation 45, Data Citation 46).
A tabular summary of the positional information incorporated into pseudomolecules is given in Data Citation 41.
Masking of residual redundancy
Residual redundancy arising from undetected overlaps between adjacent BACs was detected and masked by aligning the pseudomolecules sequence to itself with megablast27. Genomic intervals contained in BLAST hits with a length 5 kb and an identity 99.8% were considered as potentially redundant (PR)
regions. PR regions were classied to decide which sequence of a redundant pair to mask: (i) PR regions assigned to chromosomal pseudomolecules (as opposed to chrUn), but having BLAST hits only to other chromosomes were considered as originating from chimeric BAC assemblies incorporating unrelated sequences from different chromosomes and masked with Ns; (ii) an analogous procedures was used to nd intrachromosomal chimeras based on Hi-C map information; (iii) PR regions on chrUn that had alignments to regions on chromosomal pseudomolecules were masked, (iv) for other PR regions one sequence of a redundant pair was chosen arbitrarily. Positions of masked regions on the (split) pseudomolecules were written into a BED le (Data Citation 47, Data Citation 48). Masking was done with BEDTools40 (command mask) overwriting nucleotides in redundant intervals with N characters. Masked versions of the (split) pseudomolecules are provided as Data Citation 49, Data Citation 50).
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 16
Figure 2. Collinearity between the Hi-C map and two genetic maps. The positions of genetic markers (x-axis) are plotted against their genetic positions (y-axis) in a GBS map (top row) and a POPSEQ map (bottom row) of the Morex x Barke recombinant inbred lines.
POPSEQ genetic map based on pseudomolecule sequence
After the construction of the map-based reference sequence, we constructed an updated high-resolution genetic map of the Morex x Barke population to validate the order of genetic map in the reference
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 17
Figure 3. Collinearity between the Hi-C map and a cytogenetic map of chromosome 3H. Dots mark the positions of probes in the cytogenetic map (x-axis) and the Hi-C-derived pseudomolecule (y-axis). A linear regression line (red) was tted with the R function lm(). Note that cytogenetic data is not available for distal regions because probes were designed only for non-recombining peri-centromeric regions61.
sequence. Raw reads (see section Construction of the updated POPSEQ map of the Morex x Barke mapping population) were aligned to the barley pseudomolecules with BWA mem (version 0.7.12)45.
Checking mated mapped paired reads, sorting, conversion to BAM format and marking of duplicate read pairs were done with PicardTools version 2.300 (http://broadinstitute.github.io/picard/
Web End =http://broadinstitute.github.io/picard/). Variant detection and genotype calling were performed using GATK Toolkit version 3.3.0 (command HaplotypeCaller)57. A total of ve RILs with >3% heterozygous variants were removed. A variant position was removed if more than 10% of all samples were called heterozygous, there were more than 80% missing data, or the minor allele frequency (in the non-missing data) was smaller than 5%. SNP information was aggregated at the contig level to derive consensus genotype blocks with false discovery rate calculated based on the quality of each variant call in the block. High-condence genotype blocks were obtained based on a Bonferroni correction threshold. Given the fact that the length of crossover tracts is signicantly larger than that of non-crossover tracts and non-crossover tracts would enlarge the genetic distance articially, we only retained high-condence genotype blocks with more than 1 Mb tract length, which are likely to be derived from crossovers. Representative non-redundant genomic variants of high-condence genotype blocks were extracted and used for the construction of a high-resolution map through MSTMap47. We further anchored all remaining markers to the genetic map by the C program canchor5. The nal POPSEQ map consisted of 9,012,742 SNP variants dened on the pseudomolecule sequence Data citation 51).
Representation of full-length cDNAs
The representation of gene models in the whole-genome genome assembly of barley cv. Morex4 and in the pseudomolecules was compared by aligning a set of 22,651 publicly available full-length cDNAs55 to
the assemblies using the GMAP splice aligner software56. The GMAP alignment output was then ltered. If a full-length cDNA had multiple hits, only the hit with the highest % identity was considered. Hits were further ltered by identity ( 98%) and coverage ( 95%). This resulted in a set of hits representing genes recovered intact on a single genomic contig/chromosome.
Code availability
R and shell source code for the construction of the BAC overlap graph and the Hi-C map is provided as Data Citation 52. Code can be re-used under the terms of the MIT license.
Data Records
BAC sequence raw data was submitted to the European Nucleotide Archive (ENA) (Data Citation 1, Data Citation 2, Data Citation 3, Data Citation 4, Data Citation 5, Data Citation 6, Data Citation 7, Data Citation 8, Data Citation 9, Data Citation 10, Data Citation 11, Data Citation 12, Data Citation 13, Data Citation 14, Data Citation 15, Data Citation 16, Data Citation 17, Data Citation 18, Data Citation 19, Data Citation 20, Data Citation 21, Data Citation 22, Data Citation 23, Data Citation 24, Data Citation 25, Data Citation 26, Data Citation 27). BAC assemblies were submitted to ENA or NCBI (Data Citation 28, Data Citation 29). Raw data for POPSEQ (Data Citation 35), GBS (Data Citation 38) and Hi-C mapping (Data Citation 40) were submitted to ENA. Processed datasets are accessible as
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 18
QRLPDAAGSSAEEHSGQDKLLIVVTPTRARASQAYYLSRMGQTLRLVRPPVLWVVVEAGKPTPEAALELRRTAVMHRYVGCCDA
LNASASPAVDFRPHQLNAGLEVVENHRLDGVVYFADEEGVYSLPLFDRLRQIRRFGTWPVPTISDGGHGVVLEGPVCKQNQVVGWHTSGDANKLQRFHVAMSGFAFNSTMLWDPRLRSHKAWNSIRHPEMVEQGFQGTTFVEQLVEDESQMEGIPADCSQIMNWHVPFGSESPVYPKGWRSAANLDVIIPLK
Figure 4. Accessing sequence and positional information with the barley genome explorer (BARLEX). The barley pseudomolecule data was imported into BARLEX, where it is directly linked to the IPK Barley BLAST server. Users can paste a nucleotide or amino acid sequence (1) into the BARLEX input query form and select reference database such as pseudomolecules sequence, the set of all BAC assemblies or annotated genes (2). The sequence is then transferred to the IPK barley BLAST Server (3). The web page with the BLAST results (4) contains references to BARLEX information pages for different structural units (BAC sequence contigs, BAC, BAC cluster, chromosomal Hi-C map). For example, the pages of BAC sequence contigs visualize the repeat content based on genome-wide k-mer histograms (5) and are linked to a graph-based visualization (6) of the entire BAC assembly. Summary statistics and positional information of BAC clusters are presented in tables that can be searched, sorted and subsetted using user-dened criteria (7). Users can convert pseudomolecule coordinates (AGP positions) to intervals in the underlying BAC sequence assemblies (8).
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 19
Digital Object Identiers (DOIs) in the Plant Genomics and Phenomics Research Data Repository58 (Data Citation 30, Data Citation 31, Data Citation 32, Data Citation 33, Data Citation 34, Data Citation 36, Data Citation 37, Data Citation 39, Data Citation 41, Data Citation 42, Data Citation 43, Data Citation 44, Data Citation 45, Data Citation 46, Data Citation 47, Data Citation 48, Data Citation 49, Data Citation 50, Data citation 51, Data Citation 52). DOIs were registered with e!DAL59.
Technical Validation
Collinearity between genetic maps and pseudomolecules
To validate the order of scaffolds in the Hi-C map, we compared the order of genetic marker loci in the Hi-C-derived pseudomolecules to their positions in linkage maps. First, we used genotyping-by-sequencing (GBS)11,50 to type single-nucleotide polymorphisms (SNPs) segregating in a bi-parental population comprising 2,398 recombinant inbred lines (RILs). A total of 2,637 SNPs were detected by aligning GBS reads and calling variants and genotypes using a previously published pipeline46. Second, we reanalysed WGS re-sequencing data of a subset of the same population (POPSEQ data) comprising 90 RILs. Construction of a framework linkage map and insertion of additional markers were performed essentially as described by Chapman et al.44 A dot plot comparison of physical and genetic SNP positions revealed that marker orders were highly collinear between the pseudomolecules and both the GBS and POPSEQ map of the Morex x Barke population (Fig. 2).
Collinearity between a cytogenetic map and the pseudomolecule of chromosome 3HWe could not validate the order of BAC overlap clusters in the large peri-centromeric regions because of severely repressed recombination3,60. Therefore, we compared the order of probes mapped by uorescence in-situ hybridization to chromosomal locations on chromosome 3H and their corresponding sequences in the pseudomolecule of 3H. Since probes were derived from BAC sequences associated with physical contigs, their position from the reference sequence could be determined from the BAC overlap graph. The comparison showed that the cytogenetic and Hi-C maps were highly collinear in pericentromeric regions of chromosome 3H (Fig. 3).
Representation of full-length cDNAs
To assess the completeness of our assembly, we checked for the presence of high-condence transcript sequences. The representation of gene models in the whole-genome shotgun assembly of barley cv. Morex4 and in the map-based reference assembly was compared by aligning a set of 22,651 publicly available full-length cDNAs55 of barley cv. Haruna Nijo. After aligning and ltering, 18,062 (79.74%) intact full-length cDNAs were found in the pseudomolecules, whereas only 10,496 (46.33%) were recovered in the whole-genome assembly. This increase in the number of correctly represented full-length cDNAs vindicates the effort invested in the map-based assembly. Nevertheless, a signicant proportion of genes remain fragmented even in the pseudomolecule assembly (20.26%), and presumably these largely represent difcult to assemble genes that contain e.g., microsatellites, long homopolymer stretches and other difcult features, and/or form part of complex gene families that are difcult to resolve. It is likely that only longer read technologies such as Pacic Biosciences (http://www.pacb.com
Web End =http://www.pacb.com) or Oxford Nanopore (https://www.nanoporetech.com
Web End =https://www.nanoporetech.com) will be able to resolve these more difcult cases. Further results on gene space completeness based on an automated gene annotation of the pseudomolecules, and on the representation of repetitive elements are described elsewhere42.
Usage Notes
Positional information for BAC sequences, physical contigs and WGS contigs can be accessed via the barley genome explorer BARLEX (Fig. 4). BLAST searches against the barley pseudomolecules can also be carried out in BARLEX. We note that processing BAM les with short read alignments to the full pseudomolecules with commonly used tools such as SAMtools52 or BEDTools40 may not work as expected because of restrictions on the chromosome size (512 Mb) for indexing le in Sequence Alignment/Map (SAM) format52. To circumvent this issue, we have split the pseudomolecules into two part and provide (i) a FASTA le with split pseudomolecules (Data Citation 44) along the with the intact sequences and (ii) a BEDle to convert between full and split pseudomolecule coordinate (Data Citation43) Alternatively, the CRAM format (https://samtools.github.io/hts-specs/CRAMv3.pdf
Web End =https://samtools.github.io/hts-specs/CRAMv3.pdf) may be used instead of the BAM format. We note that the orientation of sequence contigs within individual BACs in the pseudomolecules is arbitrary, thus the order and orientation of sequences in the pseudomolecules is accurate only up to resolution of ~100 kb.
References
1. Schulte, D. et al. The international barley sequencing consortium--at the threshold of efcient access to the barley genome. Plant physiology 149, 142147 (2009).
2. Schulte, D. et al. BAC library resources for map-based cloning and physical map construction in barley (Hordeum vulgare L). BMC genomics 12, 247 (2011).
3. Ariyadasa, R. et al. A sequence-ready physical map of barley anchored genetically by two million single-nucleotide polymorphisms. Plant physiology 164, 412423 (2014).
4. International Barley Genome Sequencing Consortium. A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711716 (2012).
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 20
5. Mascher, M. et al. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ). The Plant Journal 76, 718727 (2013).
6. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860921 (2001).7. Schnable, P. S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 11121115 (2009).8. Lam, E. T. et al. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nature biotechnology 30, 771776 (2012).
9. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289293 (2009).
10. Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature biotechnology 31, 11191125 (2013).
11. Poland, J. A., Brown, P. J., Sorrells, M. E. & Jannink, J.-L. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS ONE 7, e32253 (2012).
12. Colmsee, C. et al. BARLEXthe Barley Draft Genome Explorer. Mol Plant 8, 964966 (2015).13. Munoz-Amatriain, M. et al. Sequencing of 15 622 gene-bearing BACs claries the gene-dense regions of the barley genome. Plant Journal 84, 216227 (2015).
14. Pasquariello, M. et al. The barley Frost resistance-H2 locus. Functional & integrative genomics 14, 85100 (2014).15. Meyer, M., Stenzel, U. & Hofreiter, M. Parallel tagged sequencing on the 454 platform. Nature protocols 3, 267278 (2008).16. Steuernagel, B. et al. De novo 454 sequencing of barcoded BAC pools for comprehensive gene survey and genome analysis in the complex genome of barley. BMC genomics 10, 547 (2009).
17. Beier, S. et al. Multiplex sequencing of bacterial articial chromosomes for assembling complex plant genomes. Plant bio-technology journal 14, 15111522 (2016).
18. Sambrook, J. & Russell, D. W. Molecular cloning: a laboratory manual. 3rd edition (Coldspring-Harbour Laboratory Press, 2001).19. Aird, D. et al. Analyzing and minimizing PCR amplication bias in Illumina sequencing libraries. Genome biology 12, R18 (2011).
20. Quail, M. A. et al. A large genome centers improvements to the Illumina sequencing system. Nature methods 5, 10051010 (2008).
21. Asan et al. Paired-end sequencing of long-range DNA fragments for de novo assembly of large, complex Mammalian genomes by direct intra-molecule ligation. PLoS ONE 7, e46211 (2012).
22. Meyer, M. & Kircher, M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010, pdb prot5448 (2010).
23. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome biology 11, R119 (2010).
24. Lonardi, S. et al. Combinatorial pooling enables selective sequencing of the barley gene space. PLoS computational biology 9, e1003010 (2013).
25. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18, 821829 (2008).
26. Ounit, R., Wanamaker, S., Close, T. J. & Lonardi, S. CLARK: fast and accurate classication of metagenomic and genomic sequences using discriminative k-mers. BMC genomics 16, 236 (2015).
27. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. Journal of computational biology: a journal of computational molecular cell biology 7, 203214 (2000).
28. Chevreux, B., Wetter, T. & Suhai, S. in German conference on bioinformatics (1999); 4556.29. Taudien, S. et al. Sequencing of BAC pools by different next generation sequencing platforms and strategies. BMC research notes 4, 411 (2011).
30. Li, H. & Durbin, R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics 25, 17541760 (2009).
31. Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE.
Bioinformatics 27, 578579 (2011).
32. Brenchley, R. et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491, 705710 (2012).33. Andrews, S. FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Web End =http://www.bioinformatics. http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Web End =babraham.ac.uk/projects/fastqc (2010).
34. Leggett, R. M., Ramirez-Gonzalez, R. H., Clavijo, B. J., Waite, D. & Davey, R. P. Sequencing quality assessment tools to enable data-driven informatics for high throughput genomics. Frontiers in genetics 4, 288 (2013).
35. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome research 19, 11171123 (2009).36. Magoc, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 29572963 (2011).
37. Leggett, R. M., Clavijo, B. J., Clissold, L., Clark, M. D. & Caccamo, M. NextClip: an analysis and read preparation tool for Nextera Long Mate Pair libraries. Bioinformatics 30, 566568 (2014).
38. Luo, R. et al. SOAPdenovo2: an empirically improved memory-efcient short-read de novo assembler. GigaScience 1, 18 (2012).39. Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC bioinformatics 6, 31 (2005).
40. Quinlan, A. R. & Hall, I. M. BEDTools: a exible suite of utilities for comparing genomic features. Bioinformatics 26, 841842 (2010).
41. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2015).42. Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature doi:10.1038/nature22043 (2017).
43. Cao, H. et al. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. GigaScience 3, 1 (2014).
44. Chapman, J. A. et al. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome biology 16, 26 (2015).
45. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/pdf/1303.3997v2.pdf
Web End =https://arxiv.org/pdf/ https://arxiv.org/pdf/1303.3997v2.pdf
Web End =1303.3997v2.pdf (2013).
46. Mascher, M., Wu, S., Amand, P. S., Stein, N. & Poland, J. Application of genotyping-by-sequencing on semiconductor sequencing platforms: a comparison of genetic and reference-based marker ordering in barley. PLoS ONE 8, e76925 (2013).
47. Wu, Y., Bhat, P. R., Close, T. J. & Lonardi, S. Efcient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS genetics 4, e1000212 (2008).
48. Csardi, G. & Nepusz, T. The igraph software package for complex network research, InterJournal, Complex Systems 1695 (2006).49. Prim, R. C. Shortest connection networks and some generalizations. Bell system technical journal 36, 13891401 (1957).50. Wendler, N. et al. Unlocking the secondary gene-pool of barley with next-generation sequencing. Plant biotechnology journal 12, 11221131 (2014).
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 21
51. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal 17, 1012 (2011).52. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 20782079 (2009).53. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/pdf/1207.3907v2.pdf
Web End =https://arxiv.org/pdf/ https://arxiv.org/pdf/1207.3907v2.pdf
Web End =1207.3907v2.pdf (2012).
54. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nature biotechnology 30, 9098 (2012).
55. Matsumoto, T. et al. Comprehensive sequence analysis of 24,783 barley full-length cDNAs derived from 12 clone libraries. Plant physiology 156, 2028 (2011).
56. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 18591875 (2005).
57. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491498 (2011).
58. Arend, D. et al. PGP repository: a plant phenomics and genomics data publication infrastructure. Database 2016, baw033 (2016).59. Arend, D. et al. e!DAL--a framework to store, share and publish research data. BMC bioinformatics 15, 214 (2014).60. Knzel, G., Korzun, L. & Meister, A. Cytologically integrated physical restriction fragment length polymorphism maps for the barley genome based on translocation breakpoints. Genetics 154, 397412 (2000).
61. Aliyeva-Schnorr, L. et al. Cytogenetic mapping with centromeric bacterial articial chromosomes contigs shows that this recombination-poor region comprises more than half of barley chromosome 3H. The Plant Journal 84, 385394 (2015).
Data Citations
1. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9062
Web End =European Nucleotide Archive PRJEB9062 (2016).2. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9097
Web End =European Nucleotide Archive PRJEB9097 (2016).3. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9098
Web End =European Nucleotide Archive PRJEB9098 (2016).4. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9099
Web End =European Nucleotide Archive PRJEB9099 (2016).5. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9100
Web End =European Nucleotide Archive PRJEB9100 (2016).6. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9101
Web End =European Nucleotide Archive PRJEB9101 (2016).7. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9102
Web End =European Nucleotide Archive PRJEB9102 (2016).8. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9103
Web End =European Nucleotide Archive PRJEB9103 (2016).9. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9104
Web End =European Nucleotide Archive PRJEB9104 (2016).10. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB8576
Web End =European Nucleotide Archive PRJEB8576 (2016).11. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB8577
Web End =European Nucleotide Archive PRJEB8577 (2016).12. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB8578
Web End =European Nucleotide Archive PRJEB8578 (2016).13. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9619
Web End =European Nucleotide Archive PRJEB9619 (2016).14. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB8579
Web End =European Nucleotide Archive PRJEB8579 (2016).15. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB8580
Web End =European Nucleotide Archive PRJEB8580 (2016).16. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9429
Web End =European Nucleotide Archive PRJEB9429 (2016).17. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9430
Web End =European Nucleotide Archive PRJEB9430 (2016).18. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9431
Web End =European Nucleotide Archive PRJEB9431 (2016).19. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB10963
Web End =European Nucleotide Archive PRJEB10963 (2016).20. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB11489
Web End =European Nucleotide Archive PRJEB11489 (2016).21. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB12096
Web End =European Nucleotide Archive PRJEB12096 (2016).22. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB11758
Web End =European Nucleotide Archive PRJEB11758 (2016).23. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9428
Web End =European Nucleotide Archive PRJEB9428 (2016).24. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB11991
Web End =European Nucleotide Archive PRJEB11991 (2016).25. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB9427
Web End =European Nucleotide Archive PRJEB9427 (2016).26. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB11798
Web End =European Nucleotide Archive PRJEB11798 (2016).27. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB11992
Web End =European Nucleotide Archive PRJEB11992 (2016).28. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB13020
Web End =European Nucleotide Archive PRJEB13020 (2016).29. Muoz-Amatrian, M. et al. http://www.ncbi.nlm.nih.gov/bioproject/PRJNA198204
Web End =NCBI BioProject PRJNA198204 (2015).30. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/21
Web End =http://dx.doi.org/10.5447/IPK/2016/21 (2016).31. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/28
Web End =http://dx.doi.org/10.5447/IPK/2016/28 (2016).32. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/12
Web End =http://dx.doi.org/10.5447/IPK/2016/12 (2016).33. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/31
Web End =http://dx.doi.org/10.5447/IPK/2016/31 (2016).34. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB13028
Web End =European Nucleotide Archive PRJEB13028 (2016).35. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/33
Web End =http://dx.doi.org/10.5447/IPK/2016/33 (2016).36. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/22
Web End =http://dx.doi.org/10.5447/IPK/2016/22 (2016).37. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/30
Web End =http://dx.doi.org/10.5447/IPK/2016/30 (2016).38. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB14130
Web End =European Nucleotide Archive PRJEB14130 (2016).39. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/29
Web End =http://dx.doi.org/10.5447/IPK/2016/29 (2016).40. International Barley Genome Sequencing Consortium. http://www.ebi.ac.uk/ena/data/view/PRJEB14169
Web End =European Nucleotide Archive PRJEB14169 (2016).41. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/20
Web End =http://dx.doi.org/10.5447/IPK/2016/20 (2016).42. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/34
Web End =http://dx.doi.org/10.5447/IPK/2016/34 (2016).43. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/27
Web End =http://dx.doi.org/10.5447/IPK/2016/27 (2016).44. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/36
Web End =http://dx.doi.org/10.5447/IPK/2016/36 (2016).45. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/23
Web End =http://dx.doi.org/10.5447/IPK/2016/23 (2016).46. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/24
Web End =http://dx.doi.org/10.5447/IPK/2016/24 (2016).47. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/25
Web End =http://dx.doi.org/10.5447/IPK/2016/25 (2016).48. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/26
Web End =http://dx.doi.org/10.5447/IPK/2016/26 (2016).49. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/35
Web End =http://dx.doi.org/10.5447/IPK/2016/35 (2016).50. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/37
Web End =http://dx.doi.org/10.5447/IPK/2016/37 (2016).51. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/17
Web End =http://dx.doi.org/10.5447/IPK/2016/17 (2016).52. International Barley Genome Sequencing Consortium. IPK Gatersleben http://dx.doi.org/10.5447/IPK/2016/19
Web End =http://dx.doi.org/10.5447/IPK/2016/19 (2016).
Acknowledgements
This work was carried out under the auspices of the International Barley Genome Sequencing Consortium and supported from the following funding sources: German Ministry of Education and Research (BMBF) grant 0314000 BARLEX and 0315954 TRITEX to M.P., U.S. and N.S and 031A536
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 22
de.NBI to U.S. Leibniz Association grant (Pakt f. Forschung und Innovation) sequencing barley chromosome 3H to N.S. and U.S.; Scottish Government/UK Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/100663X/1 to R.W, P.E.H., J.R.; BBSRC grants BB/I008357/1 to M. D.C., M.C. and BB/I008071/1 to P.K.; of Finland grant 266430 and a BioNano grant to A.H.S.; Carlsberg Foundation grant nr. 2012_01_0461 to the Carlsberg Research Laboratory; Grain Research and Development Corporation (GRDC) grant DAW00233 to C.L. and P.L.; Department of Agricultural and Food, Government of Western Australia grant 681 to C.L.; National Natural Science Foundation of China (NSFC) grant 31129005 to C.L. and G.Zhang; NSFC grant 31330055 to G.Zhang.; Czech Ministry of Education, Youth and Sports grant LO1204 to J.D.; National Science Foundation grant DBI 0321756 Coupling EST and Bacterial Articial Chromosome Resources to Access the Barley Genome to T.J.C. and S.L.; United States Department of Agriculture (USDA), Agriculture and Food Research Initiative Plant Genome, Genetics and Breeding Program of USDA-CSREES-NIFA grant 2009-65300-05645 Advancing the Barley Genome and 2011-68002-30029 TriticeaeCAP to T.J.C., S.L. and G.J.M.; United States National Science Foundation (NSF)-ABI grant DBI-1062301 to T.J.C. and S.L.; University of California grant CA-R-BPS-5306-H to T.J.C and S.L.;National Science Foundation grant DBI 0321756 Algorithms for Genome Assembly of Ultra-deep Sequencing Data to S.L. Next-generation sequencing and library construction was delivered via the BBSRC National Capability in Genomics (BB/J010375/1) at Earlham Institute (formerly The Genome Analysis Centre) by members of the Platforms and Pipelines group and BBSRC Institute Strategic Programme funding for Bioinformatics (BB/J004669/1) to M.D.C., S.A. and M.C. We gratefully acknowledge: (1) the excellent technical assistance by Susanne Knig, Manuela Knauft, Uli Beier, Anne Kusserow, Katrin Trnka, Ines Walde, Sandra Driesslein, Cynthia Voss; (2) Doreen Stengel, Anne Fiebig, Thomas Mnch, Danuta Schler and Daniel Arend and Matthias Lange for sequence raw data management and data submission to EMBL/ENA and registration of DOIs; (3) Dr Hlne Berges, Arnaud Bellec and Sonia Vautrin (CNRGV) for management and distribution of barley BAC libraries; (4) Andreas Graner and David Marshall for scientic discussions.
Author Contributions
BAC sequencing and assembly (1H, 3H, 4H): S.B., A.Himmelbach, S.T., M.F., M.G., M.M., U.S. (co-leader), M.P. (co-leader), N.S. (leader); BAC sequencing and assembly (2H, unassigned): D.S., D.H., S.A. (co-leader), M.D.C. (co-leader), M.C. (co-leader), R.W. (leader); BAC sequencing and assembly (5H, 7H): X.Z., R.A.B., Q.Z., C.T., J.K.M., B.C., G.Zhou, F.D., Y.H., S.Y., S.Cao, S.Wang, X.L., M.I.B., P.L.,G.Zhang (co-leader), C.Li (leader); BAC sequencing and assembly (6H): S.B., S.Wang, C.Lin, H.L., U.S., M.H. (co-leader), I.B. (leader); BAC sequencing (gene-bearing): M.M.-A., R.O., S.Wanamaker, S.L. (co-leader), T.J.C. (leader); Optical mapping: A.Hastie, H.., J.T., H.S., J.V., S.Chan, M.M., N.S., J.D., A.H.S. (leader); Chromosome conformation capture: A.Himmelbach, S.G., M.M. (co-leader), N.S. (leader); Pseudomolecule construction: M.M. (leader), S.B., C.C., D.B., T.S., P.K., N.S., U.S. (co-leader); Validation:L.L., M.B., L.A.-S., A.Houben, J.A.P., N.S., G.J.M., M.M. (leader). All authors read and commented on the manuscript.
Additional information
Competing nancial interests: The authors declare no competing nancial interests.
How to cite this article: Beier, S. et al. Construction of a map-based reference genome sequence for barley, Hordeum vulgare L. Sci. Data 4:170044 doi: 10.1038/sdata.2017.44 (2017).
Publishers note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional afliations.
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the articles Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0
Web End =http://creativecommons.org/licenses/by/4.0
Metadata associated with this Data Descriptor is available at http://www.nature.com/sdata/
Web End =http://www.nature.com/sdata/ and is released under the CC0 waiver to maximize reuse.
The Author(s) 2017
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 23
Sebastian Beier1,*, Axel Himmelbach1,*, Christian Colmsee1, Xiao-Qi Zhang2,Roberto A. Barrero3, Qisen Zhang4, Lin Li5, Micha Bayer6, Daniel Bolser7, Stefan Taudien8, Marco Groth8, Marius Felder8, Alex Hastie9, Hana imkov10, Helena Stakov10,
Jan Vrna10, Saki Chan9, Mara Muoz-Amatrian11, Rachid Ounit12, Steve Wanamaker11, Thomas Schmutzer1, Lala Aliyeva-Schnorr1, Stefano Grasso13, Jaakko Tanskanen14, Dharanya Sampath15, Darren Heavens15, Sujie Cao16, Brett Chapman3, Fei Dai17,Yong Han17, Hua Li16, Xuan Li16, Chongyun Lin16, John K. McCooke3, Cong Tan3, Songbo Wang16, Shuya Yin17, Gaofeng Zhou2, Jesse A. Poland18, Matthew I. Bellgard3, Andreas Houben1, Jaroslav Doleel10, Sarah Ayling15, Stefano Lonardi12, Peter Langridge19,
Gary J. Muehlbauer5,20, Paul Kersey7, Matthew D. Clark15,21, Mario Caccamo15,22, AlanH. Schulman14, Matthias Platzer8, Timothy J. Close11, Mats Hansson23, Guoping Zhang17, Ilka Braumann24, Chengdao Li2,25,26, Robbie Waugh6,27, Uwe Scholz1, Nils Stein1,28& Martin Mascher1,29
1Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Seeland, Germany. 2School of Veterinary and Life Sciences, Murdoch University, Murdoch, Western Australia 6150, Australia. 3Centre for Comparative Genomics, Murdoch University, Murdoch, Western Australia 6150, Australia. 4Australian Export Grains Innovation Centre, South Perth, Western Australia 6151, Australia. 5Department of Agronomy and Plant Genetics, University of Minnesota, St Paul, Minnesota 55108, USA. 6The James Hutton Institute, Dundee DD2 5DA, UK. 7European Molecular Biology LaboratoryThe European Bioinformatics Institute, Hinxton CB10 1SD, UK. 8Leibniz Institute on AgingFritz Lipmann
Institute (FLI), 07745 Jena, Germany. 9BioNano Genomics Inc., San Diego, California 92121, USA. 10Institute of Experimental Botany, Centre of the Region Han for Biotechnological and Agricultural Research, 78371 Olomouc, Czech Republic. 11Department of Botany & Plant Sciences, University of California, Riverside, Riverside, California 92521, USA.
12Department of Computer Science and Engineering, University of California, Riverside, Riverside, California 92521, USA.
13Department of Agricultural and Environmental Sciences, University of Udine, 33100 Udine, Italy. 14Green Technology, Natural Resources Institute (Luke), Viikki Plant Science Centre, and Institute of Biotechnology, University of Helsinki,00014 Helsinki, Finland. 15Earlham Institute, Norwich NR4 7UH, UK. 16BGI-Shenzhen, Shenzhen 518083, China.
17College of Agriculture and Biotechnology, Zhejiang University, Hangzhou 310058, China. 18Kansas State University, Wheat Genetics Resource Center, Department of Plant Pathology and Department of Agronomy, Manhattan, Kansas 66506, USA. 19School of Agriculture, University of Adelaide, Urrbrae, South Australia 5064, Australia. 20Department of Plant and Microbial Biology, University of Minnesota, St Paul, Minnesota 55108, USA. 21School of Environmental Sciences, University of East Anglia, Norwich NR4 7UH, UK. 22National Institute of Agricultural Botany, Cambridge CB3 0LE, UK.
23Department of Biology, Lund University, 22362 Lund, Sweden. 24Carlsberg Research Laboratory, 1799 Copenhagen, Denmark. 25Department of Agriculture and Food, Government of Western Australia, South Perth, Western Australia 6150, Australia. 26Hubei Collaborative Innovation Centre for Grain Industry, Yangtze University, Jingzhou, Hubei 434025, China.
27School of Life Sciences, University of Dundee, Dundee DD2 5DA, UK. 28School of Plant Biology, University of Western Australia, Crawley 6009, Australia. 29German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany. *These authors contributed equally to this work.
SCIENTIFIC DATA | 4:170044 | DOI: 10.1038/sdata.2017.44 24
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright Nature Publishing Group Apr 2017
Abstract
Barley (Hordeum vulgare L.) is a cereal grass mainly used as animal fodder and raw material for the malting industry. The map-based reference genome sequence of barley cv. 'Morex' was constructed by the International Barley Genome Sequencing Consortium (IBSC) using hierarchical shotgun sequencing. Here, we report the experimental and computational procedures to (i) sequence and assemble more than 80,000 bacterial artificial chromosome (BAC) clones along the minimum tiling path of a genome-wide physical map, (ii) find and validate overlaps between adjacent BACs, (iii) construct 4,265 non-redundant sequence scaffolds representing clusters of overlapping BACs, and (iv) order and orient these BAC clusters along the seven barley chromosomes using positional information provided by dense genetic maps, an optical map and chromosome conformation capture sequencing (Hi-C). Integrative access to these sequence and mapping resources is provided by the barley genome explorer (BARLEX).
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer