Long-read-based draft genome sequence of Indian

Full text

Translate

Turn on search term navigation

Abbreviations

BUSCO
benchmarking universal single-copy ortholog

CC
coiled-coil

HMM
hidden Markov model

LRR
leucine-rich repeat

LTR
long-terminal repeat

MITE
miniature inverted-repeat transposable element

NBS
nucleotide-binding site

NBS-LRR
nucleotide binding site–leucine rich repeat

PacBio
Pacific Biosciences of California, Inc.

PCR
polymerase chain reaction

RGA
resistance gene analog

RLK
receptor-like kinase

RLP
receptor-like protein

RNA-seq
RNA sequencing

rRNA
ribosomal RNA

SMRT
single-molecule real-time

SSP
seed storage protein

TIR
Toll-interleukin receptor 1

TM
transmembrane

tRNA
transfer RNA

INTRODUCTION

A continuous escalation in the global human population necessitates a second ‘Green Revolution’, wherein crop plants subsidiary to cereals are stipulated to play an indispensable role in meeting global food and feed security. Legumes constitute a diverse range of functional food crops with exceptional nutritional properties, which contribute substantially toward global protein security. Black gram [Vigna mungo (L.) Hepper var. mungo] (NCBI txid 3915), also known as urd bean, is a member of subfamily Papilionoideae of Fabaceae and harbors a diploid chromosome number of 2n = 2x = 22 with an estimated genome size of 574 Mb (Arumuganathan & Earle, 1991). The grain crop is believed to have originated in India from V. mungo (L.) Hepper var. silvestris (Lukoki et al., 1980). It is a commercially important legume principally grown in semiarid and subtropical regions including India, Myanmar, Pakistan, Bangladesh, Sri Lanka, and Thailand (Somta et al., 2020). Currently, India is dominating the world production of black gram followed by Myanmar and Pakistan (Raizada & Souframanien, 2019). Black gram seeds are a prized source of dietary proteins consisting of ∼25% stored proteins, 60% carbohydrates, 1.5% fat along with high contents of iron and folate. The crop is vital for Asian countries where it provides plant-based proteins to the vegetarian Asian population. Additionally, the nitrogen-fixing ability, short maturity period, and drought resilience of black gram allows its usage in an intercropping system, which results in replenishment of soil fertility and encourages low prevalence of diseases and pests (Yadav et al., 2006). In spite of its commercial and ecological significance, black gram has not been studied extensively compared with other legume crops, owing to the inadequate genetic and genomic resources.

The low yield associated with black gram is the major limiting factor challenging its wide adoption and cultivation. Despite being the largest producer and consumer of black gram, crop production in India suffers from extremely low yield resulting in an increased reliance on imports from other producer countries to meet consumption demands (Rawal & Navarro, 2019; Somta et al., 2019, 2020). The grain yield of black gram is significantly constrained by a variety of biotic and abiotic stresses with viral and fungal diseases along with waterlogging and salinity contributing maximal yield losses (Bhomkar et al., 2008; Kundu et al., 2019; Rawal & Navarro, 2019; Vishalakshi et al., 2017). Moreover, the narrow genetic base of the current cultivars and limited genomic resources have impeded genetic improvement of black gram (Raizada & Souframanien, 2019). Conventional breeding methods have enabled the reduction of yield penalties to an extent; however, true genetic gains could only be realized through the integration of molecular-assisted breeding approaches in improvement programs. Over the past decade, an impetus for the generation of large-scale genomic resources, including genome sequence assemblies, resequencing, high-resolution linkage maps, and deployment of modern breeding techniques for the development of improved and climate-resilient legume varieties, has been initiated (Garg et al., 2021; Mousavi-Derazmahalleh et al., 2019; Pandey et al., 2016; Varshney et al., 2018). In black gram, the majority of the previous molecular studies have focused on the development of molecular markers, exploring the genetic variability, constructing linkage maps, and generating conditional transcriptomes (Kundu et al., 2019; Raizada & Souframanien, 2019; Somta et al., 2019, 2020; Souframanien & Gopalakrishna, 2004; Souframanien et al., 2010). Recently, the first reference draft genome assembly of black gram was derived using a Thailand cultivar, Chai Nat 80 (Pootakham et al., 2021). Nevertheless, sequencing a distinctive black gram genome of an accession belonging to its center of origin (India) is imperative for understanding the origin and evolution of the crop. Besides, the availability of more than one high-quality draft genomes provide access to novel genetic variations for deployment in crop improvement programs.

Here, we used long-read, single-molecule real-time sequencing (SMRT) from the Pacific Biosciences of California, Inc. (PacBio) platform and short reads from Illumina platform to generate a new chromosome level draft genome sequence of an Indian black gram cultivar, IPU 94-1 (‘Uttara’) based on a reference-guided approach. The assembled scaffolds were assigned to 11 pseudochromosomes using the available genome sequence of Thailand cultivar Chai Nat 80 (Pootakham et al., 2021). The functional transcripts derived from Illumina RNA sequencing (RNA-seq) combined with ab initio predictions were used to decipher the detailed annotated genomic features of black gram including repetitive genomic regions, protein-coding genes, regulatory transcription factors (TFs), microsatellites, and disease resistance (R) genes. We explored the multigene family encoding seed storage proteins (SSPs) in the black gram genome and related species. Further, the assembled genome and gene sequences of black gram were used for syntenic analysis with genome sequences of three Vigna species. These findings will increase our understanding related to genetics and genomics of Indian black gram and provide valuable resources for the facilitation of effective crop improvement programs.

MATERIALS AND METHODS Plant materials and DNA extraction

An individual plant of the Indian black gram Kharif (rainy season) cultivar IPU 94-1 (‘Uttara’) was used for genome sequencing. Young expanding leaves were harvested from a greenhouse-grown black gram plant and immediately immersed in liquid nitrogen followed by storage at −80 °C until further use.

High-molecular-weight genomic DNA was extracted from harvested leaves using a modified nuclei isolation protocol (Dorweiler et al., 2000; Steinmuller & Apel, 1986). Briefly, 0.5 g of leaf tissue was homogenized to fine powder using liquid nitrogen followed by addition of 20 ml of chilled nuclei isolation buffer (20 mM Tris-HCl pH 7.8, 250 mM sucrose, 5 mM MgCl2, 5 mM KCl, 40% Glycerol, 0.25% Triton X-100, and 0.1% β- mercaptoethanol). The thawed homogenate was sequentially filtered through four layers of Mira cloth (Calbiochem) and a 53-μm-thick nylon mesh followed by centrifugation at 6,000 × g for 20 min at 4 °C. The obtained pellet was gently washed four times with 20 ml nuclei isolation buffer. Subsequently, the nuclei pellet was resuspended in prewarmed Carlson lysis buffer (100 mM Tris pH 9.5, 2% CTAB, 1.4 M NaCl, 1% PEG, 20 mM EDTA) along with 25 μl β-mercaptoethanol and incubated at 74 °C for 2 h. Then, the contents were cooled to room temperature and an equal volume of chloroform–isoamyl alcohol (24:1) mixture was added. The components were gently mixed followed by centrifugation at 5000 × g for 10 min at 4 °C. The aqueous phase was extracted in a clean tube and another round of chloroform–isoamyl extraction was performed. The extracted DNA was precipitated using one-tenth part of 3 M sodium acetate and chilled isopropanol. The pellet was washed with 70% ethanol, air dried, and dissolved in molecular biology grade water (Sigma). The DNA was then treated with RNaseA at 37 °C for 1 h and extracted using a mixture of chloroform and isoamyl alcohol (24:1). The aqueous phase was collected in a clean microcentrifuge tube and precipitated to yield purified DNA following the steps described above. The quality and integrity of extracted DNA were assessed through gel electrophoresis using 0.8% agarose gel and a Nanodrop spectrophotometer (Thermo Scientific). The extracted DNA was quantified through Qubit fluorometer (Invitrogen) to confirm the high quality of DNA.

Core Ideas

A ∼454.4-Mb draft genome of an Indian black gram was assembled based on SMRT and Illumina sequencing.
Around 444.4 Mb (∼97.8%) of the assembly was anchored onto 11 pseudomolecules based on the reference-guided method.
Large collinear blocks representing high synteny among Vigna species were detected.
Gene discovery predicted 28,881 gene models, of which 95% were functionally annotated.
Disease resistance genes and SSPs were investigated for deployment in crop improvement.

Library construction and sequencing

For short-reads Illumina whole-genome sequencing, two paired-end libraries with insert sizes of 250 bp and 400 bp were prepared and sequenced on Illumina HiSeq 2000 platform following the standard Illumina protocols. For PacBio SMRT sequencing, the SMRTbell template prep kit v1.0 was used to construct a PacBio DNA library with ∼10 kb insert. The long-read sequencing was performed using a total of 38 SMRT cells on a PacBio RS II platform with P6-C4 chemistry.

Estimation of genome size and karyology

The genome size of black gram was estimated through flow cytometry. A fresh young leaf harvested from a healthy plant was used for sample preparation using CyStain PI Absolute P kit (Sysmex) following manufacturer instructions. The nuclear DNA amount was evaluated on CyFlow Cube 8 flow cytometer (Sysmex) with a minimum of 5,000 nuclei counts per sample. Tomato (Solanum lycopersicum L.) ‘Stupicke´ polnı´ rane´’ was used as an inner standard reference with an estimated genome size of 1.96 picogram per 2C DNA content (Dolezel et al., 2007). Replicates were maintained for both the internal control and black gram samples. A k-mer based genome size estimation was performed with clean long-read sequences from PacBio using Kmerfreq (Wang et al., 2020) and Genomic Character Estimator (Liu et al., 2013). The k-mer frequency distribution of long reads was explored at k-mer value of 17. The output including depth frequency and total k-mer number identified from the reads was evaluated for estimation of genome size using Genomic Character Estimator.

The chromosome number of black gram was studied using conventional mitotic metaphase spreads prepared from root tip cells as described by Shamurailatpam et al. (2015). Briefly, healthy growing root tips from germinated seedlings were excised and treated with 0.025% colchicine (Himedia) for 3 h at room temperature followed by fixation in freshly prepared ethanol–acetic acid solution (3:1, v/v) and subsequent storage at 4 °C until further use. The chromosomal spreads were prepared by washing the root tips thoroughly with distilled water followed by depurination with 1N HCl at 60 °C for 8 min and staining with Feulgen stain for 45 min. The stained root tips were washed with distilled water and squash preparations were made in 1% acetocarmine solution.

Hybrid genome assembly

The draft genome of black gram was assembled in three main steps. Firstly, a de novo draft genome was assembled with the PacBio clean reads using FALCON assembler v0.3 based on a length_cutoff of ≥ 500 bases, length_cutoff_pr of ≥ 2500 bases, and minimum coverage of 5 (Chin et al., 2016). FALCON executes a stratified genome assembly process wherein, firstly, seed reads are selected, and shorter reads are aligned to generate error corrected or preassembled reads. Successively, the preassembled reads are assembled into genomic contigs. After assembly via FALCON, the initial draft assembly was self-corrected with PacBio reads. All the SMRT reads were remapped to the de novo assembly using Quiver implemented in the GenomicConsensus package and supported with default settings (Chin et al., 2013). In parallel, the Illumina-derived short paired-end reads were quality filtered using NGSQC toolkit v2.3.3 using parameters comprising cut off percentage of read length for high quality (l) = 70 and threshold PHRED quality score (s) = 20 (Patel et al., 2012). The clean reads were mapped on the FALCON-generated assembly using Burrows–Wheeler Aligner using minimum seed length (k) of 19, penalty for a mismatch (B) = 4, and other parameters at default values (Li & Durbin, 2009). The alignment file (BAM) was subsequently used for error correction with Pilon (Walker et al., 2014). Two consecutive runs of error-correction were applied with Pilon. Lastly, the previously published draft genome of Thailand black gram cultivar Chai Nat 80 (Pootakham et al., 2021) was applied as reference to coalesce the de novo assembled scaffolds generated for the Indian cultivar in the present study into pseudochromosomes using RaGOO (Alonge et al., 2019). The reference-guided scaffolding with RaGOO was executed with g 100 (gap size for padding in pseudomolecules) and C (unplaced contigs individually instead of making a chr0) options.

Assessment of structural integrity and completeness of genome assemblies

Both the derived genome assemblies (de novo and reference-guided) were tested for their structural integrity and completeness using five approaches. Firstly, the quality of the assembled genomes was investigated by aligning clean paired-end reads (∼88×) from short-insert size Illumina libraries back onto the black gram draft assembly using ‘bwa mem’ function of Burrows–Wheeler Aligner v0.7.17 (Li & Durbin, 2009). Secondly, a set of 1,055 DNA sequences was retrieved from the National Center for Biotechnology Information (NCBI) database (Access date: May 2020) using the search term ‘Vigna mungo’. These DNA sequences were mapped on the draft black gram genome using GMAP v2019-09-12 (Wu & Watanabe, 2005). For the third approach, we used the Illumina RNA-seq reads generated for functional annotation in the present study to assemble a transcriptome and the derived transcripts were mapped on the draft genome using GMAP v2019-09-12 (Wu & Watanabe, 2005). Through our fourth approach, we compared the k-mer frequencies in the PacBio long sequencing reads against the assemblies using the ‘comp’ and ‘spectra-cn’ module of the k-mer Analysis Toolkit v2.4.2 (Mapleson et al., 2017). Finally, the completeness of the genome assemblies was investigated through evaluation of gene space integrity using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.0 (Simao et al., 2015) adopting orthologous gene models of three lineages: Embryophyta, Eudicots, and Fabales.

Comparative genomics among Vigna species

The draft assemblies of the sequenced Vigna species including mungbean [V. radiata (L.) R. Wilczek] (Kang et al., 2014), adzuki bean [V. angularis (Willd.) Ohwi & H. Ohashi] (Kang et al., 2015), and cowpea [V. unguiculata (L.) Walp.] (Lonardi et al., 2019) were obtained from web repositories of their respective sequencing groups or RefSeq database (O'Leary et al., 2016). All-in-all pairwise whole-genome alignments of Vigna species were generated against the Indian black gram draft genome (reference-guided assembly) using GSAlign v1.0.22 (Lin et al, 2020). For visualization, Circos diagrams were derived from pairwise alignment files using Circos v0.69-8 (Krzywinski et al., 2009).

RNA isolation, transcriptome sequencing, and assembly

Four black gram vegetative and reproductive tissues, including roots, shoots, total seedlings, and flowers, were sampled from the greenhouse plants for transcriptome sequencing and subsequent genome annotation. The harvested tissues were immediately frozen in liquid nitrogen and stored at −80 °C until further use. Total RNA extraction was performed using Trizol (Molecular Research Center, Inc.) following the manufacturer's instructions. The RNA integrity was verified on a Bioanalyzer 2100 (Agilent Technologies), while its quality assessment was performed on a NanoDrop-1000 spectrophotometer (Thermo Scientific). Four cDNA libraries were constructed and sequenced on Illumina HiSeq 2000 platform in paired-end mode according to the instructions provided by the manufacturer (Illumina). The raw sequencing data was filtered for low quality (Q value ≤ 20) and adaptor contaminated reads using NGSQC toolkit Patel & Jain, 2012). The quality-controlled reads were employed for de novo transcriptome assembly using Trinity v2.0.6 (Grabherr et al., 2011; Haas et al., 2013) with a ‘minimum contig length’ of 200 bp and ‘minimum count for k-mers to be assembled’ by Inchworm algorithm set at 10. The obtained primary assembly was clustered using CD-HIT-EST v4.6.1 (Fu et al., 2012) for the removal of redundant contigs. Quality validation of the transcriptome was evaluated by quantification of read representation, for which, the high-quality RNA-seq reads were mapped back onto the assembled transcriptome using bowtie2 with informed condition of –no-unal (Suppress SAM records for reads that failed to align) along with default parameters (Langmead & Salzberg, 2012).

Annotation of Repeatome of black gram genome

The de novo prediction program RepeatModeler v1.0.11 (Smit & Hubley, 2008) was used to derive a repeat library, which was annotated using the module RepeatClassifier embedded in RepeatModeler. The homology-based tool, RepeatMasker v4.0.7 (Smit et al., 2015) was used to annotate repeat sequences using the RepBase libraries (v20170127) combined with a de novo repeat library extracted from RepeatModeler. TransposonPSI v08222010, a homology search tool based on PSI-Blast, was used to complement the mining of repetitive elements (Haas, 2007). The PSI-Blast uses a set of protein sequences encoded by transposable elements, enabling the search for degenerate and novel transposons that cannot be detected by other repeat finding software programs. The full-length, long-terminal retrotransposons were studied using the structural de novo search tool, LTRharvest (Ellinghaus et al., 2008) combined with LTR digest (Steinbiss et al., 2009) for automated annotation of discovered long-terminal repeats (LTRs). Analysis with LTR harvest was performed at default parameters. The LTR digest run was performed employing pHMM signatures of 20 transposable-element-related protein domains (Supplemental Table S1) from pfam database (El-Gebali et al., 2019) along with the set of 84,464 eukaryotic RNAs from GtRNAdb (Chan & Lowe, 2009) for identification of primer binding site. The two-layered local combinatorial variable tool, HelitronScanner v1.1, was deployed at default parameters to determine the Helitron components of black gram genome (Xiong et al., 2014). We identified the head (5′) and tail (3′) termini of Helitrons, which were then concatenated to assemble putative Helitrons. Helitron search was performed in both the forward and reverse complement orientations. The de novo tool, MITE_Hunter was used for the detection of miniature inverted-repeat transposable elements (MITEs) in the black gram genome (Han & Wessler, 2010). Tandem repeat finder v4.09 was applied to predict tandem repeats of the black gram genome (Benson, 1999).

Gene model prediction and annotation of protein-coding genes

The structural gene prediction was performed using the standard MAKER gene annotation pipeline, which combines ab initio gene finders and evidence from RNA-seq or protein sequences (Campbell et al., 2014). The MAKER run was conducted using high-quality transcripts generated by assembling Illumina RNA-seq reads as transcript evidence and the set of annotated proteins from V. radiata (Kang et al., 2014) as protein evidence together with ab initio gene discovery through Augustus (Stanke et al., 2006) and GeneMark (Besemer & Borodovsky, 2005). Among these predicted genes, gene loci that coincided with masked repeat regions were filtered out. Also, transcripts and proteins with lengths shorter than 150 bases and 50 amino acid residues, respectively, were removed while retaining only gene sets with annotation edit distance scores <1. The functional annotation of the predicted black gram gene models was performed using various public nucleotide and protein databases for the assignment of functional descriptors and gene ontology terms. Homology searches were conducted against NCBI-RefSeq database (O'Leary et al., 2016) using BLASTP (Altschul et al., 1990) with a threshold E-value of 1 × 10⁻³ implemented in OmicsBox v2.0.36 (https://www.biobam.com/omicsbox/), Kyoto Encyclopedia of Genes and Genomes (KEGG) Expression Database through KEGG Automatic Annotation Server (Ogata et al., 1999) and Eukaryotic Orthologous Groups database (Tatusov et al., 2003). The motifs- and domains-based functional annotation and identification of conserved domains and families in protein-coding genes was implemented through InterProScan (Jones et al., 2014) against multiple public databases (CDD, Panther, Pfam, PIR, PRINTS, PROSITE, and SMART).

Mining of simple sequence repeats and annotation of noncoding RNAs

The draft genome was screened for putative perfect (mono- to hexanucleotides) and complex microsatellite or simple sequence repeats employing the MIcroSAtellite (MISA) identification tool. The custom parameters applied for microsatellite search included unit_size (1–6) with min_repeats (1–10, 2–9, 3–6, 4–5, 5–5, and 6–5) allowing an interruption of 100 bp between two microsatellites (Beier et al., 2017). Additionally, we experimentally validated 34 randomly selected microsatellites markers for which primers were designed using Primer3 with the following parameters (primer length, 20−24 bases; product size, 150–450 bases; optimum annealing temperature, 52–60 °C; and primer GC content, 45–60%) (Untergasser et al., 2012). The designed primer pairs were checked for the presence of any secondary structure including hairpin using NetPrimer (PREMIER Biosoft). The polymerase chain reaction (PCR) assays were conducted in a total volume of 15 μl consisting of ∼50 ng of template genomic DNA, 1 × PCR buffer, 2 mM MgCl_2, 0.2 mM of each dNTP, 0.5 mM of each loci-specific forward and reverse primers, and 1.25 U of Taq polymerase (iNtRON Biotechnology). The Veriti thermal cycle (Applied Biosystems) was used for amplification reactions with following cycling conditions: initial denaturation at 96 °C for 5 min followed by 30 repetitive cycles of 96 °C for 45 s, primer annealing at appropriate melting temperature (between 50–60 °C) for 30 s, extension at 72 °C for 30 s, and a final DNA extension at 72 °C for 7 min. The PCR amplicons were tested for product size and quality on 2% agarose gel.

We also surveyed noncoding RNA (ribosomal RNA [rRNA] and transfer RNA [tRNA]) genes in black gram draft genome. The rRNA gene predictions were done using the RNAmmer v1.2 with provided parameters of kingdom (-S euk) and molecule type (-m tsu,ssu,lsu) (Lagesen et al., 2007). The tRNA genes were searched using tRNASCAn-SE v2.0 (Chan & Lowe, 2019) and Infernal v1.1.2 specifying option -E, which delimits search to eukaryotic tRNA (Nawrocki & Eddy, 2013).

Identification of regulatory genes and disease resistance genes

The standalone version of iTAK program was deployed for the discovery of TFs, transcriptional regulators, and protein kinases and their classification into various gene families (Zheng et al., 2016). The set of predicted protein-coding genes from reference-based assembly was applied to the iTAK program at default parameters.

The final set of proteins from reference-guided black gram assembly was inspected for domain and motif structures associated with disease resistance (R) genes using the automated Disease Resistance Analysis and Gene Orthology (DRAGO 2) pipeline against the curated plant resistance gene database (PRGDB) (Osuna-cruz et al., 2018). Based on the presence of specific characteristic domains, the candidate resistance genes were classified into various classes including coiled-coil (CC)–nucleotide-binding site (NBS)–leucine-rich repeat (LRR) (CNL), Toll-interleukin receptor 1 (TIR)-NB-LRR (TNL), serine/threonine–LRR (receptor-like protein [RLP]), receptor-like kinase (RLK), and others. A set of 229 reference NBS-LRR sequences was retrieved from the NCBI Viridiplantae (NCBI:txid 33090) protein database using the search term ‘NBS-LRR’ and employed for BLASTP searches with an E value of 1 × 10⁻⁵. Moreover, a hmmsearch was applied for the identification of NBS-LRR genes in black gram genome using hidden Markov model (HMM) profile for Pfam NBS domain (NB-ARC, PF00931) through HMMER software (Finn et al., 2011). Candidate genes encoding NBS identified by all three search methods, including DRAGO 2, HMMER, and BLASTP searches, were retained for further verification. Functional domains were annotated against the PFAM (http://pfam.xfam.org) and CDD (http://www.ncbi.nlm.nih.gov/cdd/) databases and their genomic distribution was explored through the associated GFF file.

Genome-wide identification of seed storage proteins gene family

The reference-guided V. mungo L. genome and peptide sequences were comprehensively explored for the identification of putative seed storage protein (SSP)-encoding genes. The nucleotide and peptide sequences of V. radiata, V. angularis, and V. unguiculata were retrieved from web repositories of their respective sequencing groups while Arabidopsis thaliana (L.) Heynh. sequences were downloaded from Phytozome v12.1 (https://phytozome.jgi.doe.gov/pz/portal.html). The NCBI database with Viridiplantae (NCBI:txid33090) taxonomic restriction was extensively explored using various keywords related to SSPs including 7S globulin, 11S globulin, vicilin, glycinin, conglycinin, 2S albumin, glutelin, and prolamin. The obtained peptide sequences were used to build HMM profiles using ‘hmmbuild’ function of HMMER v3.3.2 (Finn et al., 2011). The HMM profiles were further used as a query to search for putative SSP genes in genomes of the studied species through hmmsearch with an E-value cutoff of 1 × 10⁻⁵. Also, BLASTP and BLASTN homology-based searches were performed with these sequences. All the candidate SSP genes identified through both the methods were inspected through SMART domain research tool (http://smart.emblheidelberg.de/), Pfam protein family database (http://pfam.xfam.org/; El-Gebali et al., 2019), and CDD (http://www.ncbi.nlm.nih.gov/cdd/) of NCBI to ensure the existence of characteristic domains. Only unique sequences were retained for further analysis. The physiochemical parameters of putative SSP proteins including length (number of amino acids), molecular weight (kDa), and theoretical isoelectric point were investigated through ExPaSy ProtParam tool (http://web.expasy.org/protparam/). The subcellular localization of each seed storage protein was predicted through the Plant-mPLOC v2.0 web server (http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/).

The SSP genes were assigned a chromosomal location based on the location coordinates from the GFF genome files of the respective plant species. The distribution of SSP genes on chromosomes was visualized through the TBtools toolkit (https://github.com/CJ-Chen/TBtools/; Chen et al., 2020). Analysis of gene duplication events, including segmental and tandem duplications in the SSP gene family, was assessed using Multiple Collinearity Scan Toolkit (MCScanX) at the default parameters (Wang et al., 2012). The multiple protein sequence alignments were generated using the MUSCLE method in MEGA 7.0 (Kumar et al., 2016) at default parameters. Subsequently, the alignment results were employed to construct an unrooted phylogenetic tree based on the neighbor-joining method with 1,000 bootstrap iterations. The structural organization of the SSP genes comprising the exonic and intronic boundaries was studied and displayed through Gene Structure Display Server 2.0 using the GFF files of respective species (http://gsds.cbi.pku.edu.cn/; Hu et al., 2015). The conserved motifs of SSP protein sequences were surveyed using the online tool Multiple Expectation Maximization for Motif Elucidation (MEME) v5.3.3 (http://meme-suite.org/tools/meme; Bailey et al., 2009) with a maximum number of motif discovery being set at 10, while other parameters were used at default values. The synteny–collinearity analysis was performed using MCScanX and the results were visualized through TBtools.

RESULTS AND DISCUSSION Genome sequencing

The black gram Indian cultivar IPU 94-1 (‘Uttara’) selected for genome sequencing (Supplemental Figure S1) displays high resistance to mungbean yellow mosaic virus, Cercospora leaf blight, bacterial leaf spot, and Anthracnose fungus (Singh et al., 2010). It is cultivated over a wide geographical region including the northwestern and northeastern plain zones of India (Kumar & Singh, 2009). We applied two different technologies, Illumina paired-end sequencing and PacBio SMRT sequencing, for derivation of the sequencing data for draft genome assembly. A total of 32.47 Gb of Illumina sequencing data from 321,484,777 raw reads was generated (Supplemental Table S2) providing a genome coverage of 112-fold based on the estimated genome size of 574 Mb (Arumuganathan & Earle, 1991). Post filtering the raw reads for quality score and adaptor contamination, a total of 25,3148,662 high-quality reads were obtained representing ∼88× of the black gram genome. The processing of 38 SMRT cells on PacBio platform yielded 32.89 Gb data in 5,711,096 raw reads with an average read length of 5,759 bp and a haploid genome coverage of 57.3× (Supplemental Table S3).

Genome size estimation and karyology

The genome size estimation of black gram through flow cytometry based on the linear relationship of 2C peaks indicated an approximate genome size of 1.16 picograms DNA per 2C content corresponding to ∼567 Mbp 1C⁻¹ (Supplemental Figure S2A,B). The genome size estimated through flow cytometry in the present study is consistent with that reported by Pal (2006) (1.16 picograms DNA per 2C content) but is slightly deviated from that reported by Arumuganathan and Earle (1991) (574 Mb). Curiously, inspection of k-mer distribution analysis at k = 17 using error-corrected long reads from PacBio identified a total k-mer count of 16,274,711,917 and an average k-mer peak at 30 equating to an estimated genome size of ∼498 Mb (Supplemental Figure S2C). Previously, Pootakham et al. (2021) has provided an estimated genome size of ∼531.3 Mb based on k-mer analysis of Illumina short reads. The genome size estimate through k-mer analysis is 12.1% lower than the flow cytometry measurements in this study. The quantum of used PacBio data, high repeat regions and sequencing errors in the data might have likely contributed to lower accuracy of k-mer frequency-based genome size estimation. In the light of these evidence, our flow cytometry measurements for genome size provided higher credibility and we propose an approximate genome size of 560–574 Mbp for black gram. Our analysis of the mitotic spread preparation for the somatic chromosome count of black gram IPU 94-1 (‘Uttara’) revealed 2n = 2x = 22 (Supplemental Figure S3).

Genome assembly and quality assessment

We adopted a hybrid pipeline for whole-genome assembly employing both the long-read PacBio and short-read Illumina clean reads. Firstly, FALCON allowed the generation of 30.78 Gb of error-corrected clean reads with a mean subread length of 5,213 bp and N50 of 8,214 bp from the SMRT sequencing data (Supplemental Table S3, Supplemental Figure S4). Secondly, incorporation of 253,148,662 clean Illumina derived reads for error correction was undertaken. Consequently, a de novo assembled genome (hereafter referred to as ‘VM_denovo’) of ∼454.1 Mb consisting of 3,032 contigs with an N50 of 350 kb and N90 of 93.5 kb was achieved (Table 1). Approximately 78.33% of the draft genome was represented in scaffolds >10 Kb. However, our de novo genome assembly was incontiguous, comprising of thousands of small, fragmented scaffolds and contigs, which challenges its optimal annotation and use for genomic studies. Reference-assisted chromosome assembly approach benefits in such conditions where high-density linkage maps and chromosome conformation sequencing data is unavailable for pseudomolecule construction. In the present study, our de novo assembly of black gram was represented in 3,032 small contigs. Therefore, to accomplish a highly ordered and improved full-length assembly and gain additional information, we applied recently published reference genome of black gram Chai Nat 80 (Pootakham et al., 2021) to order our de novo assembled scaffolds into pseudomolecules. Use of reference genome facilitated anchorage of ∼444.4 Mb of the contigs into 11 pseudomolecules corresponding to the haploid chromosome count of black gram (x = 11) while 10.02 Mb of the assembled genome was captured in 508 unplaced scaffolds and singletons (Table 1, Figure 1). A final reference-guided assembly (hereafter referred to as ‘VM_draft’) of ∼454.43 Mb represented in 519 scaffolds was derived with an N50 value of 42.88 Mb and N90 of 29.91 Mb (Table 1). The length of the 11 pseudomolecules (nomenclature assigned as ‘VM_pChr1–VM_pChr11’) ranged from 24.4 Mb (pseudomolecule VM_pChr 11) to 62.7 Mb (pseudomolecule VM_pChr 1) (Supplemental Table S4).

TABLE 1 Characteristics of black gram genome assembly

Length of genome assembly (bp)	45,442,6400
Scaffolds^a
Total length of scaffold sequence (Mb)	454.43
Total number of scaffolds	519
N50 of scaffolds (Mb)	42.88
N90 of scaffolds (Mb)	29.91
Maximum scaffold length (Mb)	62.77
Number of scaffolds >10 Kb	278
Contigs^b
Total length of contig sequence (Mb)	454.1
Total number of contigs	3,032
N50 of contigs (Kb)	350
N90 of contigs (Kb)	93.5
Pseudomolecules^a
Total number of pseudomolecules	11
Total length of pseudomolecules (Mb)	444.4
Unanchored scaffolds^a
Total number of unanchored scaffolds	508
Total length of unanchored scaffolds (Mb)	10.02
Guanine-Cytosine (GC) content (%)^a	33.5
Gene prediction^a
Number of protein coding genes	28,881
Average gene length (bp)	1,395.02
Gene annotation^a
Annotated genes	27,432

Denotes features of reference-guided (VM_draft) black gram assembly.

Denotes features de novo (VM_denovo) black gram assembly.

View Image - FIGURE 1. Circos plot depicting the genomic landscape of draft black gram genome (VM_draft). Eleven pseudo-chromosomes of black gram designated as VM_pChr1–VM_pChr11 are indicated in the grey outer panel. Tracks from outside to inside (a) microsatellites density, (b) Guanine–Cytosine (GC) content, (c) repeat density, (d) gene density, and (e) paralogous regions between black gram chromosomes indicated by colored lines

FIGURE 1. Circos plot depicting the genomic landscape of draft black gram genome (VM_draft). Eleven pseudo-chromosomes of black gram designated as VM_pChr1–VM_pChr11 are indicated in the grey outer panel. Tracks from outside to inside (a) microsatellites density, (b) Guanine–Cytosine (GC) content, (c) repeat density, (d) gene density, and (e) paralogous regions between black gram chromosomes indicated by colored lines

Implementation of a reference genome may introduce biasness in the assembly and create assembly errors that is majorly propelled by the evolutionary distance between the involved target and reference species. In our study, the employed reference genome belongs to another cultivar of the same species therefore, we hypothesize that the evolutionary distance between the target and reference genome would be minimal and high sequence similarity is shared by these accessions, which would considerably mitigate assembly biasness and errors. Nevertheless, we evaluated the impact of reference-assisted assembly method by comparing the key metrics of both the de novo (VM_denovo) and reference-guided (VM_draft) assemblies of black gram through several approaches. Firstly, clean genomic reads from Illumina libraries were aligned onto the assembled genome and our results revealed an improvement in alignment metrics of reference-guided assembly over de novo assembly. A high percentage (∼94.55%) of short Illumina reads were mapped on the reference-guided assembly than de novo assembly (∼93%). Alignment of the set of black gram DNA sequences retrieved from NCBI provided a mapping percentage of 95.85% for VM_draft assembly, while a comparable mapping rate was recorded for VM_denovo (∼95.63%). Further, mapping of black gram transcripts from RNA-seq data resulted in a similar mapping rate of ∼98.1% for both the assemblies (Table 2). Another quality assessment was performed through comparison of 27-mers content in circular consensus sequencing reads from PacBio and black gram assemblies (VM_denovo and VM_draft) using KAT comp tool. Evidently, both the assemblies showed consistent results depicting vast levels of unique k-mers (1×) represented only once in the assembly with limited duplicated (2× or more) and missing content (0×; Supplemental Figure S5). We further extended our analysis to validate the completeness of the genome assembly using BUSCO and compared BUSCO quality score of both de novo and reference-guided assemblies using three lineages: Embryophyta, Eudicotyledons, and Fabales (Table 2). The complete and fragmented BUSCO score for de novo assembly accounted for 93.9 and 2.79% of the orthologous genes, while the reference-guided assembly displayed an improvement over these statistics with the complete and fragmented BUSCO score of 94.18 and 2.66%, respectively, for lineage Embryophyta (Table 2). Similarly, for the dataset Eudicots and Fabales, higher number of complete orthologous genes were detected in VM_draft than VM_denovo assembly. The proportion of duplicated and fragmented orthologous genes was comparatively higher in VM_denovo reflecting higher redundancy and fragmentation in de novo assembly as compared with reference-guided assembly (Table 2). Cumulatively, our results indicate that the coalescence of de novo and reference-based assembly to guide pseudomolecules construction improved the quality of black gram assembly as high number of contigs could be ordered into longer nonredundant scaffolds which promoted genome contiguity and, consequently, improved gene annotation. Nevertheless, we present repeat and gene annotation for both the assemblies of black gram to promote their easy adoption by black gram genomic and genetic communities.

TABLE 2 Assessment of structural integrity and completeness of black gram genome assemblies

[Table omitted. See PDF]

Note. BUSCO, benchmarking universal single-copy ortholog

Comparative genomics among Vigna species

Syntenic comparisons among various Vigna species indicated the high level of conservation between black gram and other Vigna species including V. radiata (Figure 2a), V. angularis (Figure 2b,) and V. unguiculata (Figure 2c). Although few chromosomal rearrangements were evident, all of the pseudochromosomes of black gram aligned with the chromosomes of Vigna species in a one-to-one syntenic relationship. A large number of syntenic blocks in addition to chromosome rearrangements within the genus Vigna were revealed.

View Image - FIGURE 2. Circos plots showing the macro-syntenic relationship of Vigna mungo (chromosomes denoted as VM_pChr1–VM_pChr11) with (a) Vigna radiata (chromosomes indicated as VR_pChr1–VR_pChr11), (b) Vigna angularis (chromosomes indicated as VA_pChr1–VA_pChr11), and (c) Vigna unguiculata (chromosomes indicated as VU_pChr1–VU_pChr11) as determined by pair-wise whole-genome alignments

FIGURE 2. Circos plots showing the macro-syntenic relationship of Vigna mungo (chromosomes denoted as VM_pChr1–VM_pChr11) with (a) Vigna radiata (chromosomes indicated as VR_pChr1–VR_pChr11), (b) Vigna angularis (chromosomes indicated as VA_pChr1–VA_pChr11), and (c) Vigna unguiculata (chromosomes indicated as VU_pChr1–VU_pChr11) as determined by pair-wise whole-genome alignments

We also compared the genome assembly generated in the present study with the recently published black gram genome of a Thailand cultivar Chai Nat 80 (Pootakham et al., 2021). In the present study, long-read sequencing technology, SMRT sequencing from PacBio, was deployed in conjunction with short-read Illumina sequencing to generate a hybrid assembly of Indian black gram, while the study published earlier used 10× genomics together with HiC technology for derivation of the Thailand black gram genome. The two black gram draft assemblies exhibited comparable genomic features (Supplemental Table S5). The assembly size of the Indian black gram (454.43 Mb) is slightly smaller than the Thailand variety (499 Mb; Supplemental Table S5). This can be due to the differential quantum of data generated in the two studies. The quality assessment of assembly completeness reflected excellent BUSCO scores with a high representation of conserved BUSCO orthologs in both the black gram assemblies (Supplemental Table S5). Notably, the percentage of ambiguous nucleotides (N) recorded in the Indian black gram genome was lesser than that reported for Thailand black gram genome (Supplemental Table S5). Thus, the two reference genomes would reflect upon the available genetic diversity and serve as important resources for fundamental and applied genetics and genomics research in black gram.

Transcriptome sequencing and de novo assembly

The RNA sequencing of black gram was performed for assistance in genome annotation. A total raw sequencing data of 11.01 Gb was generated from four major plant tissues including root, shoot, total seedling, and flower (Supplemental Table S6). Quality filtration of raw data resulted in the obtainment of 10.85 Gb of high-quality reads with an average read length of 101 bp (Supplemental Table S6). The clean reads were used to generate a de novo assembled transcriptome of ∼42.5 Mb containing a total of 54,603 transcripts. The average transcript length and N50 of assembled transcripts were 779 and 1,163 bp, respectively (Supplemental Table S7). We determined the quality of the transcriptome assembly prior to its use for genome annotation. The alignment of high-quality RNA-seq reads onto the assembled transcriptome revealed an overall alignment rate of 93.02%, indicating superior nature of the transcriptome assembly and promoting its use for genome annotation.

Annotation of repeatome of black gram genome

The repetitive structural constituents of the black gram genome were investigated through a combination of de novo and homology-based tools. Altogether, a total of 208.6 Mb of the black gram genome (∼45.91% of VM_draft) was composed of repetitive DNA (Supplemental Table S8), which is in concordance with the earlier reported draft assembly of black gram Chai Nat 80 (41.1.%; Pootakham et al., 2021) and other studied legumes including chickpea (Cicer arietinum L.) (49.4%; Varshney et al., 2013), pigeonpea (Cajanus cajan L. Huth) (51.6%; Varshney et al., 2012), mungbean [V. radiata (L.) R. Wilczek var. radiata] (51.6%, Kang et al., 2014), and adzuki bean [V. angularis (Willd.) Ohwi & H. Ohashi var. angularis] (43.1%; Kang et al., 2015). Long-terminal repeats constituted the major class of repeat elements occupying a total 15.3% of the black gram draft assembly, while DNA transposons represented 2.92% of draft genome sequence. Long interspersed nuclear elements were more prominent than the short interspersed nuclear elements in black gram constituting a proportion of ∼0.17% of the total identified repeats (Supplemental Table S8). The black gram de novo assembly (VM_denovo) exhibited a slightly higher proportion of repetitive components (∼216.6 Mbp) as compared with VM_draft (Supplemental Table S8). Remarkably, a large number of sequences with simple repeats were recorded in VM_denovo assembly. The computational difficulties associated with assembling simple repeats in genome assemblies is well known. Misassemblies and fragmentation introduced by simple repeats owing to their high polymorphic nature may be attributable to sequence redundancy resulting in greater proportion of simple repeats in VM_denovo.

TransposonPSI detected a total of 37,929 repeat elements distributed in 10 transposons superfamilies in VM_draft assembly. Our TransposonPSI-based analysis identified a large proportion of the LTR superfamily, Ty1-Copia in the black gram genome (∼62% of total identified elements) followed by the superfamily Ty3-gypsy (∼23.4%; Supplemental Figure S6A). The most abundant DNA transposons families were Cacta (∼7.19%) and MuDR (∼6%). Given the abundance of Class II transposable elements (retrotransposons) in the black gram genome, we characterized 6,065 candidate full-length, long-terminal retrotransposons using LTRharvest and LTRdigest with characteristic conserved domains of LTRs.

Characterization of Helitrons, MITEs, and tandem repeats

Helitron is an important class of DNA transposons that transposes using rolling circle replication mechanisms and are of particular evolutionary significance owing to their ability to capture gene sequences (Hu et al., 2019; Xiong et al., 2014). However, Helitrons are often left undetected, as these elements do not harbor typical structural features of transposons. In the present study, we deployed HelitronScanner to determine the Helitron components of the black gram genome. We distinguished a total of 331 putative Helitrons in the direct orientation in VM_draft. The length of predicted Helitron transposons varied from 254 to 19,974 bp with an average length of 10,709 bp (Supplemental Table S9). A survey of the distribution of Helitrons in the black gram genome revealed that the highest number of Helitrons (49) were localized on VM_pChr1, while the least number of Helitrons (17) were detected on VM_pChr9 (Supplemental Figure S6B). In VM_denovo, we recorded a total of 306 putative Helitrons, which is slightly less than that in assembly VM_draft (Supplemental Table S9).

Miniature inverted-repeat transposable elements are nonautonomous DNA transposons reported in several eukaryotes. MITE_Hunter revealed a total of 173 and 231 MITE sequences in VM_draft and VM_denovo, respectively (Supplemental Table S9). Tandem repeat finder discerned 251,615 tandem repeats in the assembled genome of black gram (VM_draft). Out of the total tandem repeats identified, 213,381 (84.8%) were localized on the 11 pseudomolecules of black gram, while 38,234 were found on unassembled scaffolds (Supplemental Figure S6C).

Gene model prediction and annotation of protein-coding genes

The integrated gene prediction pipeline using MAKER led to the derivation of a final unique set of 28,881 protein-coding genes with an average length of 1,395.02 bp in VM_draft (Table 1, Figure 1). Functional annotation through homology search against RefSeq assigned possible functions to a total of 26,974 genes (∼93.4% of predicted genes; Supplemental Figure S7A–C), while InterProScan assigned domains to 25,787 black gram genes (Figure 3a). The most abundant InterPro domains were protein kinase, serine-threonine/tyrosine kinase (catalytic domain), pentatricopeptide repeat, and leucine-rich repeat (Supplemental Figure S7D). Kyoto Encyclopedia of Genes and Genomes analysis provided annotation to 8,928 genes. Moreover, we functionally classified black gram genes using Eukaryotic Orthologous Groups, which annotated 16,894 genes within various functional classes. Our results displayed that a significant number of genes were associated with general prediction only, signal transduction mechanisms, posttranslational modifications, protein turnover, chaperones, and transcription (Figure 3b). Cumulatively, we were able to annotate ∼95% of the predicted gene models with at least one functional term from these databases (Figure 3a; Table 3). The structural annotation of VM_denovo recorded a higher number of predicted genes (31,393), of which, 29,837 were annotated against the public databases (Table 3).

View Image - FIGURE 3. Annotation of predicted black gram genes. (a) Venn diagram illustrating the unique and shared annotated genes among different public databases. (b) Function classes identified by eukaryotic orthologous groups (KOG) analysis in VM_draft assembly

FIGURE 3. Annotation of predicted black gram genes. (a) Venn diagram illustrating the unique and shared annotated genes among different public databases. (b) Function classes identified by eukaryotic orthologous groups (KOG) analysis in VM_draft assembly

TABLE 3 Annotation of predicted gene models in black gram

Database	No. of annotated genes
	VM_draft	VM_denovo	‘Chai Nat 80’^a
Total predicted genes	28,881	31,393	27,989
RefSeq	26,974	29,771	26,802
InterPro	25,787	27,843	25,455
Kyoto encyclopedia of genes and genomes	8,928	9,245	9,055
Eukaryotic orthologous groups	16,894	17,781	16,788
Total annotated	27,432	29,837	26,846

Pootakham et al., 2021. For comparative purposes, gene annotation for Chai Nat 80 was reanalyzed in the present study.

Mining of simple sequence repeats and annotation of noncoding RNAs

The screening of draft genome assembly for microsatellites or simple sequence repeats identified a total of 1,77,017 putative microsatellite loci distributed in 378 contigs of VM_draft assembly. Mononucleotide repeats were discarded from further analysis. Out of the total microsatellites, 74,107 perfect (di- to hexanucleotides) microsatellite loci were discerned with dinucleotide repeats being the most dominant class of microsatellites representing a proportion of 55.6% out of the total identified perfect microsatellite loci (Supplemental Figure S8). Additionally, ∼33,062 complex simple sequence repeats were also discovered. In addition, ∼85% of the PCR tested 34 microsatellites markers exhibited robust amplification and provided an amplicon of expected size suggesting good quality of the assembled genome (Supplemental Table S10). Evaluation of VM_denovo assembly identified a total of 177,023 putative microsatellite loci distributed in 2,661 contigs with 33,067 presented in compound form (Supplemental Table S11). The large set of novel microsatellite markers identified in the present study could be widely applied for diversity analysis, quantitative trait loci identification, and marker-assisted breeding in black gram.

We also surveyed noncoding RNA (rRNA and tRNA) genes in the black gram assemblies. We deciphered 380 putative rRNA (representing 59 18S-type, 61 28S-type. and 259 8S-type rRNA) and 619 tRNA genes in our reference guided assembly. Out of the total predicted tRNAs, 596 tRNAs decoded 20 standard amino acids (Supplemental Table S12). The de novo assembly (VM_denovo) consisted of 378 rRNA loci (including 60 18S-type, 58 28S-type, and 260 8S-type rRNA) and 619 putative tRNA genes.

Identification of regulatory genes and disease resistance genes in VM_draft

In total, 1,711 genes encoding TFs classified into 49 different gene families and 439 genes encoding transcriptional regulators belonging to 23 families were identified in the black gram genome assembly, VM_draft. Among the TF gene families, the most represented families were MYB and bHLH, while ARID and AUX/IAA dominated the transcriptional regulators gene families (Supplemental Figure S9A,B). Furthermore, the black gram genome encoded 1,175 protein kinases distributed in 20 gene families. The most prominent gene families among protein kinases were RLK and calcium- and calmodulin-regulated kinase (CAMK) (Supplemental Figure S9C).

The disease resistant (R) gene profile of black gram genome is largely unexplored, disrupting the genetic dissection of disease resistance in the crop. Using the DRAGO pipeline, a total of 1,551 candidate R gene analogs were discerned in the black gram genome (VM_draft). The putative R genes were classified into 21 major classes based on internal domain and motif organization (Table 4). The class kinases (50.6%) dominated the R gene analogs (RGAs) followed by membrane-associated RLKs (14.95%) and RLPs (11.6%). A survey of the distribution of R genes conserved domains revealed that 14.6% of the predicted RGAs contained a single domain (Supplemental Figure S10A,B). Further, two to five domain associations were found (Supplemental Figure S10C–F). A substantial proportion (∼56%) of the RGA analogs showed an internal architecture with two domains. Among the double-domain RGAs, 622 proteins consisted of the dual kinase–transmembrane (TM) domains, while 179 proteins contained LRR-TM domains. The composite kinase-LRR-TM domain was most represented among the association of the triple domains and observed in 231 putative R gene proteins. The largest number of proteins for four domain associations was recorded for CC-LRR-NBS-TM. We observed two combinations of five domains (NBS-LRR-CC-TIR-TM and kinase-NBS-LRR-TIR-TM) in black gram genome.

TABLE 4 Family classification of R gene analogs of black gram (VM_draft) identified through DRAGO 2 pipeline

Family ID	No. of proteins
C	1
CK	94
CL	13
CLK	5
CN	44
CNL	28
CT	3
CTNL	3
KIN	785
L	35
N	48
NK	1
NL	37
RLK	232
RLP	180
T	7
TL	1
TN	3
TNL	27
TRAN	3
Undefined	1
Grand total	1,551

The NBS-LRR proteins are the most characterized disease resistance genes in plants (Dangl & Jones, 2001). In the black gram genome, 119 genes encoding NBS-LRR proteins were identified, of which, most encoded for CC-NBS-LRR (Supplemental Table S13). We detected six genes belonging to Resistance to Powdery Mildew 8 (RPW8)-NBS-LRR (RNL) and 19 genes assigned to TIR-NBS-LRR (TNL). Phylogenetic analysis of NBS-LRR genes exhibited clustering on the basis of domain composition (Figure 4A). Among the genes with NB-ARC domains, 89 harbored the LRR domain (Figure 4b), which is comparable to the number (73) reported in V. radiata (Kang et al., 2014). Further, the NBS-encoding genes showed a dispersed distribution in the black gram genome, with 69% clustered on four chromosomes (VM_pChr1, VM_pChr2, VM_pChr8, and VM_pChr10; Supplemental Figure S11). Concluding from the above, the RGAs identified in the present study represent ready resources for characterization of novel disease resistance genes, which could be applied in black gram improvement programs.

View Image - FIGURE 4. Phylogenetic tree of 119 nucleotide binding site–leucine rich repeat (NBS-LRR) genes of black gram (VM_draft) constructed based on neighbor-joining method with 1,000 bootstrap support. (a) Various classes of genes encoding NBS-LRR are represented by different colors. (b) The NBS-LRR genes are color-coded based on presence or absence of LRR domain

FIGURE 4. Phylogenetic tree of 119 nucleotide binding site–leucine rich repeat (NBS-LRR) genes of black gram (VM_draft) constructed based on neighbor-joining method with 1,000 bootstrap support. (a) Various classes of genes encoding NBS-LRR are represented by different colors. (b) The NBS-LRR genes are color-coded based on presence or absence of LRR domain

Genome-wide identification of SSP genes in Vigna species

In the present study, BLASTP and HMM searches were used to screen the black gram genome (VM_draft assembly) for putative members of the SSP gene family leading to identification of a total of 23 nonredundant SSP-encoding gene candidates, hereafter referred to as Vmu_SSPs1–Vmu_SSPs23 (Table 5). The gene validation at the NCBI database revealed that the black gram genome was enriched with globulins type of SSPs (20) further classified into 11S (10) and 7S (10) types. We found two members of glutenin type and one member of albumin type, while genes encoding prolamin type of SSPs were not detected in the black gram genome (Table 5). Investigation of the physiochemical and molecular parameters showed that protein length of the predicted SSP sequences varied between 123 (Vmu_SSPs10) to 609 (Vmu_SSPs7) amino acids, while the corresponding computed molecular weights ranged from 13.05 (Vmu_SSPs21) to 69.06 (Vmu_SSPs7) kDa (Table 5). The isoelectric point of SSPs displayed a range between 4.45 and 9.07. The prediction of the subcellular localization of black gram SSPs revealed multiple cellular locations including vacuole, cell membrane, and cell wall. However, the majority of SSPs were localized in the vacuole and cell membrane. The important details related to the predicted SSP genes of black gram are presented in Table 5. Using a similar approach, we analyzed SSP-encoding genes from closely related plant species V. radiata, V. angularis, V. unguiculata, and Arabidopsis. Our findings revealed a total of 17 (V. radiata; Vra_SSPs1–Vra_SSPs17), 15 (V. angularis; Van_SSPs1–Van_SSPs15), 35(V. unguiculata; Vun_SSPs1–Vun_SSPs35), and 15 (Arabidopsis; Ath_SSPs1–Ath_SSPs15) SSP-encoding genes (Supplemental Table S14).

TABLE 5 Details of seed storage protein (SSP) gene family of black gram (identified from VM_draft assembly)

Transcript ID	Gene name	Chromosome	Location start	Location end	mRNA length	Protein length	Protein molecular weight	Isoelectric point	No. of exons	No. of introns	NCBI description	Subcellular localization
			bp			AA	kDA
VM_pChr1-gene-545.52-mRNA-1	Vmu_SSPs1	VM_pChr1	54,569,800	54,571,464	1,664	358	38693.49	4.88	3	2	Legumin B	Vacuole
VM_pChr1-gene-545.54-mRNA-1	Vmu_SSPs2	VM_pChr1	54,575,374	54,576,740	1,366	354	38,280.13	5.24	3	2	Glutelin type A 2	Vacuole
VM_pChr2-gene-379.4-mRNA-1	Vmu_SSPs3	VM_pChr2	37,931,332	37,932,614	1,282	405	43,552.58	4.95	2	1	Basic 7S globulin	Cell membrane
VM_pChr3-gene-210.40-mRNA-1	Vmu_SSPs4	VM_pChr3	20,995,502	20,997,702	2,200	609	69,069.61	5.64	4	3	Glycinin	Vacuole
VM_pChr3-gene-273.30-mRNA-1	Vmu_SSPs5	VM_pChr3	27,353,320	27,355,481	2,161	509	58,142.04	5.68	6	5	Vicilin-like seed storage protein	Vacuole
VM_pChr3-gene-279.34-mRNA-1	Vmu_SSPs6	VM_pChr3	27,912,809	27,914,252	1,443	449	48,115.46	9.07	2	1	Basic 7S globulin	Cell membrane
VM_pChr3-gene-353.27-mRNA-1	Vmu_SSPs7	VM_pChr3	35,290,934	35,291,374	440	123	13,056.84	4.45	3	2	Basic 7S globulin	Cell wall
VM_pChr3-gene-417.34-mRNA-1	Vmu_SSPs8	VM_pChr3	41,768,083	41,769,947	1,864	416	47,183.2	5.82	7	6	Beta-conglycinin	Vacuole
VM_pChr3-gene-417.35-mRNA-1	Vmu_SSPs9	VM_pChr3	41,776,319	41,778,442	2,123	532	60,999.45	5.73	6	5	Beta-conglycinin	Vacuole
VM_pChr3-gene-417.37-mRNA-1	Vmu_SSPs10	VM_pChr3	41,785,920	41,787,786	1,866	445	50,742.37	7.15	6	5	Beta-conglycinin	Vacuole
VM_pChr3-gene-418.36-mRNA-1	Vmu_SSPs11	VM_pChr3	41,800,609	41,802,476	1,867	445	50,728.29	6.54	6	5	Beta-conglycinin	Vacuole
VM_pChr6-gene-377.33-mRNA-1	Vmu_SSPs12	VM_pChr6	37,732,502	37,733,839	1,337	445	48,093.23	9	1	0	Basic 7S globulin	Cell membrane
VM_pChr6-gene-377.6-mRNA-1	Vmu_SSPs13	VM_pChr6	37,739,523	37,740,770	1,247	415	44,632.94	6.41	1	0	Basic 7S globulin 2-like	Cell membrane
VM_pChr7-gene-54.37-mRNA-1	Vmu_SSPs14	VM_pChr7	5,444,743	5,446,044	1,301	433	46,297.21	8.83	1	0	Basic 7S globulin 2-like	Cell membrane
VM_pChr7-gene-54.40-mRNA-1	Vmu_SSPs15	VM_pChr7	5,453,171	5,454,472	1,301	433	46,088.99	8.46	1	0	Basic 7S globulin	Cell membrane
VM_pChr7-gene-114.50-mRNA-1	Vmu_SSPs16	VM_pChr7	11,419,992	11,421,266	1,274	424	46,187.9	7.59	1	0	Basic 7S globulin 2-like	Cell membrane
VM_pChr8-gene-57.22-mRNA-1	Vmu_SSPs17	VM_pChr8	5,697,009	5,698,822	1,803	348	37,521.83	5.35	3	2	12S seed storage protein	Vacuole
VM_pChr8-gene-57.22-mRNA-1	Vmu_SSPs18	VM_pChr8	5,704,081	5,705,886	1,805	346	37,322.64	5.45	4	3	12S seed storage protein	Vacuole
VM_pChr8-gene-63.43-mRNA-1	Vmu_SSPs19	VM_pChr8	6,356,409	6,359,298	2,889	492	55,404.52	6.03	5	4	Vicilin-like seed storage protein	Vacuole
VM_pChr8-gene-338.57-mRNA-1	Vmu_SSPs20	VM_pChr8	33,839,965	33,840,761	796	127	13,638.7	4.96	2	1	Albumin-1	Cell membrane
VM_pChr10-gene-152.17-mRNA-1	Vmu_SSPs21	VM_pChr10	15,215,671	15,216,921	1,250	271	31,123.05	5.72	5	4	Beta-conglycinin	Vacuole
VM_pChr10-gene-211.79-mRNA-1	Vmu_SSPs22	VM_pChr10	21,161,127	21,163,026	1,899	452	51,605.94	6.01	6	5	Beta-conglycinin	Vacuole
VM_pChr11-gene-41.21-mRNA-1	Vmu_SSPs23	VM_pChr11	4,161,322	4,163,771	2,426	356	38,526.45	5.46	3	2	Glutelin type-B	Vacuole

We studied the genomic distribution of 23 SSP-encoding genes onto 11 black gram chromosomes. The physical map positions displayed that the 23 SSP genes were unevenly distributed on eight chromosomes, with chromosome VM_pChr3 harboring the maximum number of SSP members (8) (Supplemental Figure S12A). No SSP-encoding gene was localized on chromosomes VM_pChr4, VM_pChr5, and VM_pChr9. In V. radiata, a total of 13 out of the 17 SSP genes detected were localized on its nine chromosomes, while the rest of the genes (4) were distributed on four unmapped scaffolds (Supplemental Figure S12B), whereas in V. angularis, nine out of 15 SSP genes were dispersed on four chromosomes, and six SSP genes were mapped on unplaced scaffolds (Supplemental Figure S12C). All the 35 SSP genes identified in V. unguiculata were distributed across its 11 chromosomes (Supplemental Figure S12D), while the 15 SSP genes in Arabidopsis were located on its five chromosomes (Supplemental Figure S12E). Furthermore, to probe the underlying mechanism responsible for the expansion of the SSP gene family in black gram and other Vigna species, we extended our analysis to characterize tandem and segmental duplication events of SSP genes using MCScanX. In black gram, we identified six tandemly duplicated gene pairs that were located on chromosomes VM_pChr1, VM_pChr3, VM_pChr7, and VM_pChr8 (Supplemental Figure S12A). On the other hand, three segmentally duplicated SSP gene pairs (Vmu_SSPs5–Vmu_SSPs10, Vmu_SSPs12–Vmu_SSPs14, and Vmu_SSPs17–Vmu_SSPs23) were detected in our study. Based on the observed gene duplication pattern, we hypothesized that tandem duplication events are more likely the driving force for SSP gene family expansion in black gram. Among the other studied Vigna species, V. unguiculata exhibited the highest number of gene duplications with six tandem and four segmental duplication events (Supplemental Figure S12D). No tandem duplication events were detected in V. radiata and V. angularis, however, four and one segmental duplicated gene pairs were recorded in the respective genomes (Supplemental Figure S12B, C). In Arabidopsis, three tandem and three segmental gene pairs were discovered (Supplemental Figure S12E).

To ascertain the phylogenetic relationship between SSP genes recognized in the black gram genome, a neighbor-joining phylogenetic tree was constructed using the 23 full-length SSP sequences. The neighbor-joining method segregated the black gram SSP gene family members into two major groups (I and II) supported by high bootstrap values (Supplemental Figure S13A). The Group I, consisting of 14 SSP family members, was further subdivided into two subgroups, Ia (7) and Ib (7). The subgroup Ia majorly included 11S globulin type of SSPs, while the subgroup Ib assembled glutenins along with 11S globulins. The Group II exhibited clustering of 7S globulins together with the albumin member (Supplemental Figure S13A). Analysis of the gene organization of putative SSP genes demonstrated high variability in their genomic structure (Supplemental Figure S13B). Five black gram SSP genes lack introns while, 18 genes possess up to two to seven exons. For the exon number, five SSP genes accumulated three and five genes contained six exons. Among the SSPs, Vmu_SSPs11 contained the highest number of seven exons and six introns (Supplemental Figure S13B). Further, we discerned the domain architecture and explored the conserved motifs in the black gram SSP gene family using the PFAM database and MEME suite, respectively. The genomic structure of SSP genes displayed the presence of three types of domains: Cupin, TAXi (xylanase inhibitor), and AlbuminI. Fourteen SSP genes (Group I) harbored single or double copies of the Cupin domain, while seven members contained TAXi domain at N and C terminal. Vmu_SSPs5 and Vmu_SSPs9 possessed a single copy of Cupin domain, while Vmu_SSPs7 encompassed a TAXi domain only at the C terminal (Supplemental Figure S14A). Also, Vmu_SSPs20 included an AlbuminI domain. Moreover, 10 conserved motifs were explored in the protein sequences of SSP genes (Supplemental Figure S14B). The number of motifs per gene varied from zero to six in black gram and the majority of these motifs resided within the boundaries of observed domains (Cupin and TAXi; Supplemental Figure S14B). Among the 10 motifs, six motifs (1, 4, 5, 6, 7, and 9) were embodied in the Cupin domain while motif 8 was contained within the TAXi domain. Interestingly, the albumin SSP member (Vmu_SSPs20) did not show presence of any conserved motif (Supplemental Figure S13B). Fundamentally, the structural divergence of SSP members corroborated with their clustering based on the phylogenetic tree with SSPs grouped in a cluster showing homogeneous domain and motif composition (Supplemental Figure S13A).

Additionally, we explored the evolutionary relationships between 105 SSP genes discerned from five species including black gram, V. radiata, V. angularis, V. unguiculata, and Arabidopsis using the neighbor-joining method. The SSP members were clustered into two major groups (I and II) with further internal subclustering (Figure 5a). Group I was further classified into three subgroups (Ia, Ib, and Ic), while Group II exhibited division into two subgroups (IIa and IIb). Notably, the phylogenetic tree showed an assemblage similar to the SSPs from black gram (Supplemental Figure S13A) based on the domain and motif composition except the identification of an additional domain, Tryp_alpha_amyl in SSP members of Arabidopsis and V. radiata (Group Ib). The members of Group Ib were devoid of any conserved motifs. The SSP members of Group Ia and Ic harbored Cupin domain and conserved motifs varying between three and nine. Group IIa showed clustering of SSP members harboring Albumin_1_a domain; however, the majority of these genes lacked conserved motifs. Interestingly, none of the SSP genes assayed from Arabidopsis showed presence of Albumin_1_a domain (Figure 5b). All the SSP genes with TAXi domain were grouped in the subgroup IIb. The conserved Motif 4 was universally present in all members of Group IIb. In principle, the SSP genes identified in the present study demonstrated high diversity while maintaining conserved domains and motifs suggesting evolutionary conservation of cellular function of these proteins.

View Image - FIGURE 5. Phylogenetic analysis, conserved motifs, and domains of seed storage protein (SSP) genes. (a) Neighbor-joining-based phylogenetic tree constructed using full-length protein sequences of SSP genes of Vigna mungo (Vmu_SSPs1–Vmu_SSPs23), V. radiata (Vra_SSPs1–Vra_SSPs17), V. angularis (Van_SSPs1–Van_SSPs15), V. unguiculata (Vun_SSPs1–Vun_SSPs35), and Arabidopsis thaliana (Ath_SSPs1–Ath_SSPs15) with 1,000 bootstraps. Clades are color coded. (b) Motif distribution and (c) domain composition is presented

FIGURE 5. Phylogenetic analysis, conserved motifs, and domains of seed storage protein (SSP) genes. (a) Neighbor-joining-based phylogenetic tree constructed using full-length protein sequences of SSP genes of Vigna mungo (Vmu_SSPs1–Vmu_SSPs23), V. radiata (Vra_SSPs1–Vra_SSPs17), V. angularis (Van_SSPs1–Van_SSPs15), V. unguiculata (Vun_SSPs1–Vun_SSPs35), and Arabidopsis thaliana (Ath_SSPs1–Ath_SSPs15) with 1,000 bootstraps. Clades are color coded. (b) Motif distribution and (c) domain composition is presented

Furthermore, syntenic analysis was performed to decipher the collinearity between the homologous SSP genes identified from black gram and the four species including V. radiata, V. angularis, V. unguiculata, and Arabidopsis (Figure 6). The exploration of orthologs revealed that a total of 19 black gram SSP genes shows collinearity with 13 V. radiata, eight V. angularis, 19 V. unguiculata, and seven Arabidopsis SSP genes. Black gram shared the highest level of SSP genes orthology with V. unguiculata, sharing 23 orthologous gene pairs followed by V. radiata with 16 homologous gene pairs. It is noteworthy that several black gram SSP genes displayed collinear relationship with more than one gene from other species indicating that these genes might have played important roles in SSP gene expansion. In total, we observed 60 pairs of homologous SSP gene pairs between black gram and four other Vigna species (Figure 6).

View Image - FIGURE 6. Syntenic analysis of seed storage protein (SSP) genes between Vigna mungo and studied plant species. (a) V. radiata, (b) V. angularis, (c) V. unguiculata, and (d) Arabidopsis thaliana. Grey lines in the background represent collinear blocks between species, while red lines highlight the collinear SSP genes between species. Chromosomal IDs of V. mungo (VM_pChr1–VM_pChr11), V. radiata (VR_pChr1–VR_pChr11), V. angularis (VA_pChr1–VA_pChr11), V. unguiculata (VU_pChr1–VU_pChr11), and A. thaliana (AT_pChr1–AT_pChr5) are given on each chromosome

FIGURE 6. Syntenic analysis of seed storage protein (SSP) genes between Vigna mungo and studied plant species. (a) V. radiata, (b) V. angularis, (c) V. unguiculata, and (d) Arabidopsis thaliana. Grey lines in the background represent collinear blocks between species, while red lines highlight the collinear SSP genes between species. Chromosomal IDs of V. mungo (VM_pChr1–VM_pChr11), V. radiata (VR_pChr1–VR_pChr11), V. angularis (VA_pChr1–VA_pChr11), V. unguiculata (VU_pChr1–VU_pChr11), and A. thaliana (AT_pChr1–AT_pChr5) are given on each chromosome

CONCLUSIONS

A high-quality draft assembly provides the essential foundation resources for expediting future crop improvement programs. In our case, we present a hybrid draft genome sequence and annotation of an Indian black gram cultivar using long-read SMRT sequencing together with short-read Illumina sequencing, extracting the benefits of reference-assisted approach. Considering the quality parameters, including scaffold N50, BUSCO score, and the number of annotated genes, the present reference-guided assembly denotes features of a high-quality genome. A total of 28,881 gene models were predicted, out of which, 27,432 predicted genes could be functionally annotated. A preliminary analysis of R genes provided the putative candidates that could be explored in the area of plant pathogenesis for understanding and fortifying disease development in black gram. The reported SSP genes could be explored to further improve the content of seed proteins in black gram. In totality, this high-quality black gram genome from its center of origin would be instrumental for propagating its genetics- and genomics-assisted improvement programs.

DATA AVAILABILITY

The raw DNA sequencing data from Vigna mungo project has been deposited at NCBI SRA database under the BioProject accession number PRJNA668459 and BioSample accession numbers, SRR12806087 - SRR12806091, SRR12806096 - SRR12806128 (SMRT sequencing) and SRR12806129 - SRR12806130 (Illumina sequencing). Raw RNA sequencing data can be located under the accession numbers SRR12806092 - SRR12806095. This whole genome shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JAMTDO000000000. The version described in this paper is JAMTDO010000000. The data generated in this study is also available at DRYAD Digital Repository: https://doi.org/10.5061/dryad.dr7sqvb1r

ACKNOWLEDGMENTS

We acknowledge the grant provided by Department of Biotechnology, Government of India, to the University of Delhi, India, and North-Eastern Hill University, India, under the Sanction orders “BT/190/NE/TBP/2011” and “BT/PR24637/NER/95/787/2017” through NER Biotechnology Management Cell (NER-BPMC) to support this work. We also acknowledge the grant provided by University of Delhi under the scheme, “Institute of Eminence, sanction order IOE/FRP/LS/2020/27 through Faculty Research Programme Grant – loE”. P.K.O. acknowledges the fellowship by the Council of Scientific and Industrial Research, Ministry of Science and Technology, Government of India.

AUTHOR CONTRIBUTIONS

Shailendra Goel: Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Resources; Supervision; Validation; Writing-original draft; Writing-review & editing. Heena Ambreen: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Software; Validation; Visualization; Writing-original draft; Writing-review & editing. Praveen Kumar Oraon: Data curation; Formal analysis; Methodology; Software; Validation; Visualization; Writing-original draft; Writing-review & editing. Manu Agarwal: Funding acquisition; Methodology; Resources; Writing-review & editing. Arun Jagannath: Methodology; Resources; Supervision; Writing-review & editing. Surekha Katiyar-Agarwal: Funding acquisition; Investigation; Supervision; Writing-review & editing. Amar Kumar: Resources; Supervision; Writing-review & editing. Rama Rao Satyawada: Conceptualization; Funding acquisition; Investigation; Resources; Validation; Writing-review & editing. Daniel Regie Wahlang: Formal analysis; Investigation; Methodology; Validation; Writing-review & editing. Roli Budhwar: Data curation; Formal analysis; Software. Rohit Nandan Shukla: Data curation; Formal analysis; Software.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.

Word count: 10211

Show less

© 2022. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Black gram [Vigna mungo (L.) Hepper var. mungo] is a warm-season legume highly prized for its protein content along with significant folate and iron proportions. To expedite the genetic enhancement of black gram, a high-quality draft genome from the center of origin of the crop is indispensable. Here, we established a draft genome sequence of an Indian black gram cultivar, ‘Uttara’ (IPU 94-1), known for its high resistance to mungbean yellow mosaic virus. Pacific Biosciences of California, Inc. (PacBio) single-molecule real-time (SMRT) and Illumina sequencing assembled a draft reference-guided assembly with a cumulative size of ∼454.4 Mb, of which, 444.4 Mb was anchored on 11 pseudomolecules corresponding to 11 chromosomes. Uttara assembly denotes features of a high-quality draft genome illustrated through high N50 value (42.88 Mb), gene completeness (benchmarking universal single-copy ortholog [BUSCO] score 94.17%) and low levels of ambiguous nucleotides (N) percent (0.0005%). Gene discovery using transcript evidence predicted 28,881 protein-coding genes, from which, ∼95% were functionally annotated. A global survey of genes associated with disease resistance revealed 119 nucleotide binding site–leucine rich repeat (NBS-LRR) proteins, while 23 genes encoding seed storage proteins (SSPs) were discovered in black gram. A large set of microsatellite loci were discovered for marker development in the crop. Our draft genome of an Indian black gram provides the foundational genomic resources for the improvement of important agronomic traits and ultimately will help in accelerating black gram breeding programs.

Details

Title

Long-read-based draft genome sequence of Indian black gram IPU-94-1 ‘Uttara’: Insights into disease resistance and seed storage protein genes

Author

Heena Ambreen¹; Oraon, Praveen Kumar¹; Wahlang, Daniel Regie²

; Rama Rao Satyawada²; Katiyar-Agarwal, Surekha³; Agarwal, Manu¹; Jagannath, Arun¹; Kumar, Amar¹; Budhwar, Roli⁴; Rohit Nandan Shukla⁴; Goel, Shailendra¹

¹ Dep. of Botany, Univ. of Delhi, Delhi, India
² Dep. of Biotechnology & Bioinformatics, North-Eastern Hill Univ., Shillong, India
³ Dep. of Plant Molecular Biology, Univ. of Delhi, New Delhi, India
⁴ Bionivid Technology Pvt. Limited, Bengaluru, India

Section

ORIGINAL RESEARCH

Publication year

2022

Publication date

Sep 2022

Publisher

John Wiley & Sons, Inc.

ISSN

19403372

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/tpg2.20234

ProQuest document ID

2714819007

Long-read-based draft genome sequence of Indian black gram IPU-94-1 ‘Uttara’: Insights into disease resistance and seed storage protein genes

Jump to:

Full text

Abstract

Details

Suggested sources