WheatGene: A genomics database for common wheat and its related species

2021-12-10 12:24:10 Diego Garcia, Zhengyu Wang, Jiantao Guan, Lingjie Yin, Shuaifeng Geng, Aili Li, Long Mao

The Crop Journal, 2021, Issue 6

Diego F. Garcia, Zhengyu Wang, Jiantao Guan, Lingjie Yin, Shuaifeng Geng, Aili Li, Long Mao*

Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China

Keywords:

ABSTRACT

Common wheat (Triticum aestivum) is a hexaploid plant (AABBDD) derived from the genetically related tetraploid wheat T. turgidum (AABB) and the diploid goatgrass Aegilops tauschii (DD). Recent advances in sequencing technology and genome assembly strategies have made multiple wheat genomes available, calling for a centralized database that stores, manages, and queries the genomic information in a way that reflects the evolutionary relationships among these genomes and supports effective comparative genome analysis. Here, we built WheatGene, a database that contains five wheat genomes with 318,102 genes and 945,900 transcripts, together with their expression information in 998 RNA-seq samples, all of which can be searched and compared interactively. WheatGene was developed with Drupal, a popular content management system, and the biological information is managed by the Tripal toolkit. The database is accessible through a web browser with species, search, gene expression, tools, and literature entries. The available tools are BLAST, a synteny viewer, a map viewer, JBrowse, data downloads, a gene expression heatmap and bar chart, and a homologs viewer. Moreover, the map viewer connects genomic data with genetic maps and QTL, which can be searched for markers for molecular breeding. WheatGene was developed with open-source modules and libraries. WheatGene is available at http://wheatgene.agrinome.org.

1. Introduction

Wheat is one of the major crops in the world; it is an important source of protein and calories for more than 30% of the human population [1,2]. Since the rise of modern agriculture in the Fertile Crescent ~10,000 years ago, human beings have domesticated wild wheat species (i.e., wild emmer wheat Triticum turgidum ssp. dicoccoides) into cultivated ones, e.g., emmer wheat T. turgidum ssp. dicoccum [3]. Indeed, domestication of the allotetraploid wild emmer wheat (BBAA, 2n = 4x = 28) led to the subsequent evolution of hexaploid wheat, including bread wheat (Triticum aestivum; AABBDD, 2n = 6x = 42) [4]. In other words, bread wheat is the product of two hybridization events among three diploid progenitor species: Triticum urartu (A genome), Aegilops speltoides (B genome equivalent), and Aegilops tauschii (D genome) [5,6].

The rapid development of biological techniques such as Next Generation Sequencing (NGS) has enabled genome annotation and protein discovery at scale. This has produced a great amount of biological data and created the need for new technological solutions to store, manage, mine, and access these large-scale datasets in an integrative way [7,8]. One way to meet this demand is to build dedicated wheat databases [9,10]. Several plant databases have been created to host wheat genomics data, e.g., PGSB PlantsDB and Phytozome [11,12], but they often do not include the tetraploid and diploid donor species needed for convenient inter-specific comparison. Current wheat databases are mainly focused on common wheat, e.g., the Wheat Expression Browser (http://www.wheat-expression.com/) [13], which has four gene expression datasets exclusively for Chinese Spring (CS), a hexaploid wheat. Moreover, the lack of tools to visualize and analyze wheat genomics data is a common weakness of these databases.

To address the need for a database focused on common wheat as well as its tetraploid and diploid progenitors, we developed WheatGene (http://wheatgene.agrinome.org). This database hosts the latest genome assemblies of the common wheat variety CS, wild emmer wheat (WEW), durum wheat (DW), the A-genome donor T. urartu, and the D-genome donor Aegilops tauschii. In addition to the most recent genome assemblies, the genomic and genetic data available in WheatGene cover genes, transcripts, RNA-seq expression, nucleotide sequences, amino acid sequences, gene functional annotation, and gene ontology. WheatGene features a responsive web interface that is compatible with most browsers and adapts to all screen sizes. The web interface implements a set of tools to visualize and compare the genomic data, including BLAST [14], the genome browser JBrowse [15], a synteny viewer, gene functional annotation and gene ontology charts, and a map viewer. Besides a gene expression heatmap, WheatGene offers two newly developed tools: a bar chart to visualize the expression patterns of a set of genes, and a chart to visualize pairs of homologous/homoeologous genes graphically. WheatGene was designed for easier comparison of the polyploid genomes of common wheat and its relatives and for better use of this hard-earned genomic information.

2. Materials and methods

2.1. Architecture of WheatGene

WheatGene is a web project based on a three-tier software architecture [8] (Fig. 1). We developed this platform using Tripal v3.2, a toolkit for the construction of online community databases [16,17] that has been widely used to develop similar projects [10,18–20]. The toolkit is a suite of PHP modules integrated with Drupal (https://www.drupal.org/, version 7.61), a popular Content Management System (CMS). At the database level, information is arranged in Chado, a schema designed to manage biological knowledge [21]. All data are stored in PostgreSQL (https://www.postgresql.org/), a Database Management System (DBMS, version 9.2.24).
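As a rough illustration of how gene locations can be organized in the data tier, the sketch below builds a heavily simplified stand-in for a few Chado tables and queries the genes on one chromosome. It is not the WheatGene schema or code: it uses an in-memory SQLite database instead of PostgreSQL so that it runs anywhere, it replaces Chado's controlled-vocabulary type reference with a plain text column, and the chromosome name and coordinates are invented for the example.

```python
import sqlite3

# Simplified, hypothetical subset of Chado. The real schema has many more
# tables and columns; for instance, feature.type_id references a cvterm row,
# whereas this sketch stores the type as plain text.
SCHEMA = """
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    uniquename  TEXT,
    type        TEXT
);
CREATE TABLE featureloc (
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin INTEGER,
    fmax INTEGER,
    strand INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database with one chromosome and one gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'Triticum', 'aestivum')")
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [
            (1, 1, "chr6B", "chromosome"),  # assumed chromosome label
            (2, 1, "TraesCS6B02G215300", "gene"),
        ],
    )
    # Coordinates are invented for illustration only.
    conn.execute("INSERT INTO featureloc VALUES (2, 1, 1000, 5000, 1)")
    return conn


def genes_on(conn, chromosome):
    """Return (gene name, start, end) for every gene located on a chromosome."""
    sql = """
        SELECT g.uniquename, loc.fmin, loc.fmax
        FROM feature g
        JOIN featureloc loc ON loc.feature_id = g.feature_id
        JOIN feature chrom ON chrom.feature_id = loc.srcfeature_id
        WHERE g.type = 'gene' AND chrom.uniquename = ?
        ORDER BY loc.fmin
    """
    return conn.execute(sql, (chromosome,)).fetchall()
```

In this layout a gene and the chromosome it sits on are both rows of the same feature table, and the location table links them, which is the general idea that lets one schema hold several genomes side by side.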

2.2. Genome assemblies and sequences

WheatGene hosts five high-quality genomes of common wheat and its related species retrieved from public databases. The genomes of common wheat (IWGSC RefSeq v1.1) [1], durum wheat (Svevo.v1) [22], wild emmer wheat (WEWSeq v.1.0) [4], and Ae. tauschii (Aet v4.0) [23] were downloaded from Ensembl Plants (https://plants.ensembl.org/), whereas the genome of T. urartu (the A-genome donor) [24] was retrieved from MBKBase (http://www.mbkbase.org/Tu/) [25]. In total, WheatGene covers 318,102 genes and 945,900 mRNAs. The complete information is summarized in Table 1.

2.3. Data analysis

During the construction of WheatGene, we performed several data analysis tasks: gene expression analysis, synteny block detection, gene functional annotation, and localization of genetic markers and QTL. For detailed information on the pipelines, see supplementary material S1. For common wheat, gene expression values already normalized to TPM (transcripts per million) were downloaded from the Wheat Expression Browser (http://www.wheat-expression.com/) for the RefSeq v1.1 genome assembly. In total, we selected 37 RNA-seq studies covering a wide range of tissues (leaves, grain, roots, spike, stem, anther, endosperm, etc.), developmental stages (reproductive, seedling, vegetative, among others), and stress conditions, comprising 938 RNA-seq samples as listed in Table S1. For the remaining four species, we collected RNA-seq data from the NCBI Sequence Read Archive (SRA), selecting only studies that have been published in scientific journals. The raw data of six RNA-seq experiments comprising 60 samples were analyzed with an in-house pipeline.
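For readers unfamiliar with the TPM unit used throughout the expression data, the following minimal sketch shows the standard calculation from raw read counts and transcript lengths. It only illustrates what the unit means; the text does not describe the in-house pipeline, so the function name and inputs here are assumptions rather than a description of it.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts     -- reads assigned to each transcript in one sample
    lengths_bp -- length of each transcript in base pairs, same order
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase, which corrects for transcript length.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: scale so that the values of one sample sum to one million.
    return [r / total * 1_000_000 for r in rates]
```

Because every sample is scaled to the same total, TPM values of a gene can be compared across samples, which is what the heatmap and bar chart tools rely on.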

Fig. 1. The three-tier architecture design of WheatGene. (A) The presentation tier, powered by HTML and JavaScript, is the web interface of the system and comprises a set of pages and features to visualize the data. (B) The application tier is responsible for what is known as "the business logic": it implements the algorithms that control how data are shown, created, or altered in the database, as well as the integrity of the database transactions. The backbone of this tier is Drupal, including Tripal, a broad set of modules to manage biological information. (C) The data tier accounts for data storage and is managed by the DBMS PostgreSQL, including Chado, a relational database schema designed to store biological data. Communication between the presentation and application tiers is conducted through HTTP. Database transactions between the application tier and the data tier are performed by the PostgreSQL driver.

Table 1. Summary of genomic data in WheatGene.

To identify synteny blocks and homologous genes among the five species, we used the first protein sequence of each gene as its representative. We aligned these representative protein sets with BLAST against all proteins from the other species and against their own proteomes. In total, we performed 31 comparisons to cover all possible pairwise combinations of homologous chromosomes among the five genomes, taking into account their hexaploid, tetraploid, or diploid nature as well as the combinations of homoeologous subgenomes within the hexaploid and tetraploid species. We detected 11,809 synteny blocks, 351,622 orthologous genes, and 130,154 paralogous gene pairs (Table S2).
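The selection of one representative protein per gene can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for WheatGene: it assumes protein identifiers are formed as the gene identifier followed by a dot and an isoform number (for example TraesCS6B02G215300.1), and that "first" means first in file order. The sequences in the usage example are placeholders.

```python
def first_protein_per_gene(records):
    """Keep the first protein listed for each gene.

    records -- iterable of (protein_id, sequence) pairs in file order.
    Returns a dict mapping gene_id -> (protein_id, sequence).
    """
    representatives = {}
    for protein_id, sequence in records:
        # Assumed naming rule: strip the trailing ".<isoform>" to get the gene.
        gene_id = protein_id.rsplit(".", 1)[0]
        # setdefault stores a value only the first time a gene is seen.
        representatives.setdefault(gene_id, (protein_id, sequence))
    return representatives


# Usage with placeholder sequences.
demo = [
    ("TraesCS6B02G215300.1", "MGNRIG"),
    ("TraesCS6B02G215300.2", "MGNR"),
    ("TraesCS6A02G189300.1", "MGNKIG"),
]
reps = first_protein_per_gene(demo)
```

Reducing each gene to a single protein keeps the all-against-all alignment from reporting the same gene pair once per isoform.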

Genes of the five species were functionally annotated by BLAST searches against the UniProtKB/Swiss-Prot database [26] and by a local installation of InterProScan [27]. A total of 224,498 and 300,176 genes were annotated with UniProtKB/Swiss-Prot homologs and InterProScan, respectively. Furthermore, the InterProScan results were mapped to the set of GO terms provided by the GO Consortium [28] to generate a gene ontology annotation report.

WheatGene hosts 381 genetic markers (SNP) and 448 QTL of hexaploid wheat retrieved from seven GWAS and genetic map studies, as described in Table S3. After aligning the DNA sequences of the genetic markers against the reference genome, we kept only the first match of each marker and used it to identify the marker's physical position on the chromosome. The markers and their associated QTL were arranged in nine physical maps according to the maternal and paternal lines, type of population, and author of the study.
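The "first match per marker" rule can be sketched as below. The text does not name the aligner or its output format, so this example assumes BLAST tabular output (the standard 12-column layout, with subject start and end in columns 9 and 10); the marker names and coordinates in the usage example are hypothetical.

```python
def first_hit_positions(tabular_lines):
    """Map each marker to the location of its first reported alignment.

    tabular_lines -- lines of BLAST tabular output (assumed format), where
    column 1 is the marker, column 2 the chromosome, and columns 9 and 10
    the start and end of the match on the chromosome.
    Returns a dict: marker -> (chromosome, start, end) with start <= end.
    """
    positions = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        marker, chrom = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        if marker not in positions:  # later hits of the same marker are ignored
            # Minus-strand hits report start > end, so order the coordinates.
            positions[marker] = (chrom, min(sstart, send), max(sstart, send))
    return positions


# Usage with hypothetical markers and coordinates.
hits = [
    "snp_001\tchr6B\t100.0\t50\t0\t0\t1\t50\t2000\t2049\t1e-20\t93.5",
    "snp_001\tchr6A\t98.0\t50\t1\t0\t1\t50\t1800\t1849\t1e-18\t88.0",
    "snp_002\tchr3B\t100.0\t50\t0\t0\t1\t50\t950\t901\t1e-20\t93.5",
]
marker_positions = first_hit_positions(hits)
```

Keeping only the first hit gives each marker a single position on the map, at the cost of ignoring additional matches on homoeologous chromosomes.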

3. Results

3.1. WheatGene website

On the WheatGene home page, users can find a short description of the species, genome statistics, and links to the different data visualization features of the platform. The platform also incorporates a data download section that provides direct access to the wheat datasets used in this project, such as whole-genome reference sequences, GFF files, protein sequences, and gene expression values (Fig. S1). In addition, it includes a literature section that lists notable scientific publications on wheat (Fig. S2). The WheatGene website is designed to be compatible with different screen sizes, including computer monitors, smartphones, and tablets.

3.2. Search functions

WheatGene includes three sequence search functionalities. First, the sequence search feature allows users to search for both genes and transcripts among the five genomes hosted in the database. It includes filters by species, type of sequence, chromosome, position, or sequence name, and the results are listed in a data table that can be downloaded in CSV or FASTA format (Fig. S3). Second, four programs of the NCBI BLAST+ 2.9.0 package (blastn, blastx, tblastn, blastp) are available through a web interface that exposes various BLAST input parameters. This module allows users to align one or more nucleotide or peptide sequences in FASTA format against five predefined nucleotide and protein databases (Table S4). Users can visualize the alignment results graphically (Fig. S4) or download them in HTML, tab-delimited, XML, or GFF3 format. Third, users can search by gene or transcript name, chromosome, or coordinates in the genome browser JBrowse. This tool includes information on chromosomes, genes, mRNA, rRNA, and alignments, as well as full genome reference sequences (Fig. S5).

3.3. Genome synteny and homology

Gene conservation among subgenomes and genomes can be visualized in WheatGene with the synteny viewer and a newly developed homologous genes viewer. Given a particular chromosome or gene, the synteny viewer shows both a Circos chart (Fig. S6A) and a data table for each genome compared. The chart represents all synteny blocks connected with the given chromosome or gene in a circular layout. In addition, users can check the list of homoeologous and homologous gene pairs in a block of interest (Fig. S6B). Since the Tripal toolkit does not have a feature for visualizing homologous gene pairs, we developed a new Tripal module named Tripal Homologous. Available on the gene description page, the homologous genes viewer graphically shows the orthologous and paralogous (homoeologous) genes of the gene of interest (Fig. 2A).

3.4. Gene expression pattern visualization

Gene expression profiles are available in three sections of WheatGene. First, on the description page of a gene, the item "Expression" shows a bar chart with the expression values of that gene across developmental stages and tissues. This feature includes a form where the user can group and filter the RNA-seq samples and generate the chart, as shown in Fig. 2B. Second, the option "Gene expression heatmap" on the main menu allows users to generate a heatmap to compare the expression values of a chosen set of genes (Fig. S7). In the heatmap, genes appear on the vertical axis, whilst the horizontal axis corresponds to the RNA-seq samples associated with those genes. Third, we developed a new feature to generate a gene expression bar chart for a set of selected genes (Fig. 2C). Users can also arrange the bars in subsets according to different properties of the RNA-seq samples, e.g., age, tissue, variety, or stress disease. This chart is generated with Highcharts (https://www.highcharts.com/), a modern SVG-based, multi-platform charting library written in JavaScript. In all three cases, a file containing the expression data can be downloaded by clicking on a link at the bottom of the chart.
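The grouped bar chart summarizes many RNA-seq samples per bar. A minimal sketch of such a summary, assuming each bar shows the mean TPM of a group with the standard error of the mean as its error bar, is given below. The function name, the input layout, and the numbers in the usage example are illustrative only and are not taken from the WheatGene code or data.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_by_group(samples):
    """Average expression per group with its standard error.

    samples -- iterable of (group, tpm_value) pairs, e.g. group = tissue.
    Returns a dict: group -> (mean, standard_error).
    """
    grouped = defaultdict(list)
    for group, value in samples:
        grouped[group].append(value)

    summary = {}
    for group, values in grouped.items():
        # Standard error = sample standard deviation / sqrt(n).
        # A single sample has no spread estimate, so report 0.0.
        if len(values) > 1:
            se = stdev(values) / sqrt(len(values))
        else:
            se = 0.0
        summary[group] = (mean(values), se)
    return summary


# Usage with invented TPM values.
summary = summarize_by_group([("grain", 10.0), ("grain", 14.0), ("root", 2.0)])
```

Grouping by a different sample property (age, variety, and so on) only changes the first element of each pair, which mirrors how the chart lets users rearrange the bars.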

Fig. 2. Charts to visualize homology and gene expression patterns on WheatGene. (A) Homologous genes of the hexaploid wheat gene TraesCS6B02G215300; each branch from the left-hand side represents a homology relationship and leads to a target homologous/homoeologous gene. Target homologous genes are grouped by color according to their species. (B) Gene expression profile of the gene TraesCS6B02G215300; each bar represents an RNA-seq sample, and the bars are grouped by tissue and colored by age. (C) Gene expression bar chart of the gene TraesCS6B02G215300 and its two homoeologs, TraesCS6A02G189300 and TraesCS6D02G176900, along with the gene TraesCS3B02G475800. The bars, grouped by high-level tissue, represent the average gene expression values, the black lines show the standard error, and "Development" is the name of the selected RNA-seq study.

3.5. WheatGene as a resource for visualizing genetic markers

The map viewer implemented on WheatGene allows users to visualize the physical or genetic position of genetic markers and QTL of wheat and its related species on their respective chromosomes. In this first version of WheatGene, only physical maps with markers of hexaploid wheat are included. Given a chosen species, map, and chromosome, the tool shows a physical map with the genetic markers and QTL associated with that chromosome (Fig. S8A). In addition, the tool can show all the chromosomes associated with the current map, as well as its description and publication (Fig. S8B).

3.6. Protein functions in WheatGene

The results of the functional annotation can be visualized on the gene pages through a functional annotation viewer and tables that provide links to external databases, as illustrated in Fig. S9. Additionally, WheatGene includes a GO annotation report, which is available on the organism pages (Fig. S10). The default API used to generate the pie chart of the GO report was replaced because the original service is not available in China; we used Highcharts instead.

3.7. An application example of WheatGene

Here we demonstrate the use of WheatGene by identifying the wheat homologs of the rice gene GW2, which is involved in regulating rice grain weight by increasing the cell number of spikelet hulls [29] and also plays a role in wheat grain development [30,31]. First, the complete coding sequence (CDS) of GW2 (NCBI accession KJ697754) was aligned to the nucleotide database of T. aestivum using the BLAST tool. We obtained 10 hits; the three best hits were the query itself (KJ697754, corresponding to the D-genome copy of GW2, TraesCS6D02G176900) and its two homoeologs, TraesCS6A02G189300 and TraesCS6B02G215300 (Fig. S4). This is consistent with the previous report that TaGW2 has three homoeologs in common wheat [30]. Next, we browsed the gene page of TraesCS6B02G215300, where the homologous genes chart confirmed the presence of two homoeologs in the A and D subgenomes. The chart also showed one homolog in T. dicoccoides (TRIDC6BG033820) and two homologs in T. turgidum (TRITD6Av1G091060 and TRITD6Bv1G096950), the tetraploid relatives (Fig. 2A).

To identify the expression patterns of TraesCS6B02G215300, we used the expression functionality on the same gene page. As expected, TraesCS6B02G215300 is highly expressed in grain, especially during the dough and ripening stages of development (Fig. 2B). The functional annotation tool confirms that this gene is homologous to the E3 ubiquitin-protein ligase GW2 in rice, which is involved in the regulation of grain size (Fig. S9). Next, we used the gene expression bar chart to compare the expression patterns of TraesCS6B02G215300 with those of its two homoeologs (Fig. 2C). Once again, the bar chart indicates that the three genes are highly expressed in grain, with TraesCS6B02G215300 showing the highest TPM values among the three. As a comparison, we included the gene TraesCS3B02G475800, a homolog of the wheat gene TaARF4 (NCBI accession AK335756.1), which is strongly associated with plant height and root depth [32]. Consistent with its functions, TraesCS3B02G475800 is highly expressed in spikes, leaves, and roots, and expressed at relatively low levels in grain.

4. Discussion

WheatGene was designed by taking the features of existing wheat genomic databases into consideration. We analyzed the advantages and disadvantages of current wheat genome databases in order to select a suitable set of tools and data types for the first release of WheatGene. Table S5 summarizes the data types, tools, and species available in WheatGene in comparison with other current wheat genome databases [33–35]. WheatGene integrates a wider range of tools and data types than the other databases, which often just provide links to third-party systems. WheatGene brings all of the tools together as parts of the same database, with a focus on wheat genomes, for easier data retrieval in wheat research.

For wheat, the most comprehensive database at present is the IWGSC data repository, which carries only common wheat genomic data without any other genomes for comparative analysis. WheatGene integrates the genome assemblies of hexaploid wheat as well as its tetraploid and diploid ancestors for easier comparison, particularly through tools integrated in the database such as the synteny viewer, the homologous genes viewer, and the genome browser JBrowse. More importantly, WheatGene was built using open-source software, which should make data updates and functionality expansion easier. For instance, Tripal is a community project that continuously receives improvements. Nevertheless, some modules such as Tripal Pub and Tripal Analysis KEGG are not yet available for the newest version of Tripal (v3.2) and remain to be added to WheatGene.

WheatGene was developed to support the wheat research community through the integration of biological data of common wheat and its tetraploid and diploid donors. We expect the diverse data included in this database to be useful to wheat researchers worldwide, and the database will be continuously improved over time. During development, all features were modified to improve usability. Future releases of WheatGene will include genetic marker information for the remaining four species, phenotypes, and syntenic analysis with related species such as rice, barley, and maize. Moreover, new wheat genome assemblies can be easily integrated. The administration pages provided by Drupal make it easy to maintain the content of the site and to upload new genomes, and the site can be updated by following the administrator manuals created during the configuration of WheatGene.

CRediT authorship contribution statement

Long Mao conceptualized the database and supervised the entire development process. Diego Fernando Garcia curated data and developed the database. Jiantao Guan, Zhengyu Wang, Shuaifeng Geng, and Aili Li provided help in data curation. Lingjie Yin provided the computational resources and helped set up the distribution of the database. Diego Fernando Garcia wrote the original draft and, together with Long Mao, edited and revised the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was financially supported by the National Key Research and Development Program of China (2016YFD0101004) and the Agricultural Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences. Diego Garcia was supported by the Chinese Government Scholarship. We thank the Bioinformatics Computing Platform of the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, for providing computational support.

Appendix A. Supplementary data

Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2021.04.011.
