ldentification of novel soybean oil content-related genes using QTL-based collinearity analysis from the collective soybean genome

2018-08-06 12:08XUMingyueLlUZhangxiongQlNHongtaoQlHuidongWANGZhongyuMAOXinruiXlNDaweiHUZhenbangWUXiaoxiaJlANGHongweiQlZhaomingCHENQingshan
Journal of Integrative Agriculture 2018年8期

XU Ming-yue, LlU Zhang-xiong, QlN Hong-tao, Ql Hui-dong, WANG Zhong-yu, MAO Xin-rui, XlN Da-wei, HU Zhen-bang, WU Xiao-xia, JlANG Hong-wei, Ql Zhao-ming, CHEN Qing-shan

1 Key Laboratory of Soybean Biology, Ministry of Education/College of Agriculture, Northeast Agricultural University, Harbin 150030, P.R.China

2 National Key Facility for Crop Gene Resources and Genetic Improvement (NFCRI)/Key Laboratory of Germplasm Utilization,Ministry of Agriculture/Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China

Abstract Soybean is a global principal source of edible plant oil. As more soybean oil-related quantitative trait loci (QTLs) have been located in the collective genome, it is urgent to establish a classification system for these distributed QTLs. A collinear platform may be useful to characterize and identify relationships among QTLs as well as aid in novel gene discovery. In this study,the collinearity MCScanX algorithm and collective soybean genomic information were used to construct collinearity blocks,to which soybean oil-related QTLs were mapped. The results demonstrated that 666 collinearity blocks were detected in the soybean genome across 20 chromosomes, and 521 collinearity relationships existed in 231 of the 242 effective soybean oil-related QTLs. This included 214 inclusion relationships and 307 intersecting relationships. Among them, the collinearity among QTLs that are related to soybean oil content was shown on a maximum of seven chromosomes and minimum of one chromosome, with the majority of QTLs having collinearity on two chromosomes. Using overlapping hotspot regions in the soybean oil QTLs with collinearity, we mined for novel oil content-related genes. Overall, we identified 23 putatively functional genes associated with oil content in soybean and annotated them using a number of annotation databases. Our findings provide a valuable framework for elucidating evolutionary relationships between soybean oil-related QTLs and lay a foundation for functional marker-assisted breeding relating to soybean oil content.

Keywords: soybean oil QTLs, collinearity analysis, candidate genes

1. lntroduction

Soybean (Glycine max(L.) Merr) is a principal economic crop that supplies a significant source of high-quality vegetable oil and is responsible for which approximate 60%of the global seed oil production (http://www.soystats.com)(Wilson 2008). For nearly two decades, molecular markers have been widely used to excavate quantitative trait loci(QTLs) (Patersonet al.1988) and numerous QTLs have been mapped in various soybean genetic backgrounds and environments across 20 linkage groups. Soybean seeds oil content is a complex quantitative trait governed by a few genes (Burton 1987). Currently, at least 312 soybean oil QTLs have been excavated and are listed in the Soybean Breeder’s Toolbox (http://soybase.org/). However, the QTLs underlying the soybean oil phenotype are distributed in different linkage groups and are dispersed irregularly. Thus,it is difficult for breeders to utilize these QTLs and analyze their variation accordingly.

With the rapid development of sequencing technology,researchers are gradually understanding the principle of collinearity. Initially, collinearity was used to describe the relationship between loci on the same chromosome(McCouch 2001). It has now been shown that many loci display the same ancestral type conservation and relative order among diverse species (Tanget al.2008).Using this knowledge, collinearity can be used to discern genomes and solve issues regarding crosses among genomes. Ahn and Tanksley (1993) and Kurataet al.(1994) identified collinearity among a major chromosomes portion between rice and maize as well as between rice and wheat. Recently, collinearity was also used to adjust the confidence intervals of QTLs. Gawronski and Schnurbusch(2012) used a collinearity analysis comparing wheat, rice,and short stalked grass and found that the QTL confidence interval distance for the diploid wheat growth period could be reduced to 0.9 cM. Gene collinear relationships or the correspondence between genes along a chromosome and its expression patterns has been found to be conserved in many groups (Tanget al.2008). In a given species, a certain trait QTLs location that has been confirmed in a group may be the same location as in other groups (Wanget al.2006). QTLs collinearity refers to identifying QTLs on the basis of genome order, and multiple collinearity analysis using many genomes has already revealed its importance(Mason and Perreault 1991). To date, there are very few studies regarding the collinearity of soybean oil QTLs.Therefore, integrated soybean oil QTLs and collinearity analysis may offer more candidate QTLs information and ultimately be used in breeding techniques for oil production optimization. The construction of a collinear soybean whole-genome is necessary for collinearity analysis of soybean oil QTLs. MCScanX is a very effective software for discovering collinearity and synteny (Wanget al.2012).Xuet al.(2014) detected syntenic blocks and was able to classify duplication types betweenCosmopterex carpioandDanio rerio.Another study by Geetikaet al.(2016)demonstrated within theTIFYgene family in pigeonpea by constructing syntenic blocks. In this study, the MCScanX toolkit and soybean genome information were used to construct collinearity blocks for soybean. Subsequently,the identify soybean oil QTLs were mapped into these blocks to establish an information network for soybean oil QTLs. Furthermore, putative soybean oil gene annotation was implemented for hotspot intervals that possessed high collinearity. The purpose of this study was to construct a whole genome collinearity relationship among soybean oil QTLs and to screen for the major intervals. This research provides a new method for researching genetic collinearity in soybean oil QTLs and may provide novel QTLs for breeders concerned with soybean oil content.

2. Materials and methods

2.1. lntegration of mapping information for soybean oil QTLs

The relevant details associated with 242 soybean oil QTLs were collected from SoyBase (http://soybase.org/). The collected information including QTL name, confidence interval, linkage group (LG), chromosome, markers, and physical position (Appendix A). QTL confidence intervals that accounted for more than 70% of a chromosome were not included. Soybean genome information (v1.0) was downloaded from the Phytozome website (http://phytozome.jgi.doe.gov/).

2.2. Establishment of the collinearity blocks using the soybean genome

The synteny and collinearity within the soybean collectivegenome were detected by MCScanX (http://chibba.pgml.uga.edu/mcscan2/) according to previous research (Wanget al.2012). The soybean v1.0 data were imported into the MCScanX toolkit to detect the soybean genome-wide collinearity blocks. BLASTp was employed to compare the collective-genome of soybean, where the default E-value was set as 1e–5, the maximum GAP number was set as 25,and the formation of a collinearity block required at least five pairs of genes to match.

2.3. Construction of the collinearity network of soybean oil QTLs

In order to identify the collinearity of soybean oil QTLs on the basis of soybean collective-genome, the Perl program toolkit (https://www.perl.org/) (Hall and Schwartz 1998)was used. The Perl appendix contained an executable file named “origin” and a Perl script file was initially written for our purposes (Appendix B). Several files were read by the executable file, including QTLs related information downloaded from the SoyBase website such as physical location of genes, the soybean genome collinearity across 20 chromosomes generated by MCScanX, and the information for soybean oil QTLs.

To obtain soybean oil QTL relationships, there were four main steps employed. In the first step, the necessary information for the genes were determined. For example,extracted information of genes from the general feature format 3 (GFF3) file were transferred to a GFF file format to determine gene loci. In the second step, similar protein pairs were determined by BLASTp, where the minimum expected value was set to 1e–10. In the third step, collinearity blocks of coding genes were determined and soybean oil sequences were aligned against themselves using BLASTp. In the final step, all new blocks information was put into a tab-delimited format that included the soybean oil QTL collinearity.

2.4. Visualization of soybean oil QTL network

Visualization of soybean oil QTL collinearity relationship was conducted using the Circos software (Krzywinskiet al.2009). This software offers an information aesthetic for comparative genomics (http://www.circos.ca/software/download/circos/).

2.5. Gene mining from hotspot intervals

Candidate genes involved in soybean oil production that were located in hotspot repeat intervals were collected from the Phytozome website. The putative genes were annotated by gene annotation databases, including Gene Ontology(GO) (http://www.geneontology.org/), Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/), Swissprot (http://www.gpmaw.com/html/swiss-prot.html), and the BLAST nt (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and nr (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome).

Fig. 1 The collinearity blocks of soybean genome.

3. Results and analysis

3.1. QTL information collection and analysis

A total of 242 original soybean oil QTLs have been reported in the last 25 years and were obtained from SoyBase database. These included 12 QTLs starting with the word“cqseed oil”, three QTLs starting with the word “seed oil plus protein”, two QTLs starting with the word “seed oil to protein ratio”, and 225 QTLs starting with the word “seed oil”.These QTLs derived from various population types, such as recombination inbred line (RIL), near isogenic line (NIL),F2, F5, F6, F13, F2:3, and F2:5. The analysis methods used for these QTLs included interval mapping (IM), composite interval mapping (CIM), multiple interval mapping (MIM),and ANOVA.

3.2. Construction of soybean genome blocks

After importing the soybean genome information (v1.0) into the MCScanX toolkit, 666 collinear blocks were obtained with 45 933 soybean genes across 20 chromosomes (Appendix C). Among these, the average of collinear blocks was 33,ranging from 20 collinear blocks detected on chromosome 4 to 52 collinear blocks detected on chromosome 13 (Fig. 1).

3.3. Construction collinearity linkage map of the soybean oil QTLs

After inputting 242 soybean oil QTLs into the Perl script, a single text file named final.txt was outputted (Appendix D).In total, we found 521 collinearity relationships among the 231 QTLs (Appendix E). As shown in Fig. 2-A, 64 QTLs with a single collinearity relationship were the prevalent. On the contrary, single QTL showed a collinearity relationship with 11, 12, 16, 21, and 33 others. Among them, the seed oil 23-1 QTL that showed 33 collinearity relationships was distributed on seven chromosomes and was the most common QTL. The chromosome number of soybean oil QTL with collinearity were measured (Fig. 2-B). Analysis on the QTL related to collinearity, demonstrated that 121, 64, 22, 4, 1, and 1 QTLs were distributed on 1, 2,3, 4, 5, and 7 chromosomes, respectively, indicating that the majority of QTLs were distributed on one or two chromosomes. Additionally, we found a single hotspot interval repeat that occurred 229 times, while only one hotspot interval repeated 5, 9, and 13 times (Fig. 2-C). All of the overlapping intervals were characterized according to size, with an average interval repeat of 3 Mb. Among the 387 overlapping intervals, most ranged from 0 to 3 Mb while two overlapping intervals ranged from 18 to 21 Mb and 24 to 27 Mb, respectively, suggesting that overlapping interval repeats in soybean oil QTLs with collinearity are marginal, as very few of them exceed 18 Mb (Fig. 2-D). The QTL located on the Gm11 chromosome had no collinearity relationship with any QTLs located on any other chromosomes while the QTLs located on Gm01, Gm04, and Gm12 only showed a collinearity relationship with QTLs located on the same chromosome. The QTLs with collinearity located on the remaining chromosomes not only had collinearity relationships within the same chromosome but also with other chromosomes. The QTLs located on Gm06 and Gm13 were widely distributed on seven chromosomes. Seventy collinearity relationships between QTLs distributed on Gm20 were the most prevalent (Table 1). In order to better describe the relationship between QTL and collinearity, the QTLs were divided into two categories based on the overlapping interval repeats. Red shaded portions which represented the collinearity relationship of including or included between soybean oil QTLs is 214 and blue shaded portions which represented the collinearity relationship of intersecting is 307 (Fig. 3). Furthermore, among the 20 linkage groups,the majority of the collinearity relationships were found in Gm20 to Gm03 and Gm20 to Gm20.

3.4. Gene mining from hotspot intervals of soybean oil QTL

Fig. 2 Description of quantitative trait loci (QTLs) for soybean oil with collinearity. Frequency distributions of the number of collinear relationships corresponding to a singe QTL (A), chromosomes corresponding to the collinear distribution of a single QTL (B), hotpot intervals repeats (C), and distribution of hotpot interval (D) for soybean oil QTLs with collinearity.

Table 1 The relationship between chromosomes corresponding to soybean quantitative trait loci (QTLs) with collinearity

Fig. 3 Visualization of the soybean oil quantitative trait loci (QTLs) with collinearity. Blue shaded portions indicate intersecting relationship, and red shaded portions indicate the relation of including and included.

Table 2 Identification of genes related to oil content

To identify QTL functional genes, a total of 1 316 candidate genes were screened from the hotspot intervals. There were 19, 51, 61, 24, 108, 141, 61, 20, 41, 110, 76, 49, 25, 124,and 65 candidate genes distributed in each interval (Table 2).Among them, the shortest interval ranged from 3.2 to 4.1 Mb on Gm15 with an average distance of one functional gene per 0.04 Mb. The maximum interval spanned a total length of 3.3 to 26.0 Mb on Gm20 with an average distance of one functional gene per 0.18 Mb. After gene annotation, 23 oil related functional genes were collected, with one functional gene screened from Gm01, Gm02, Gm03, Gm04 Gm14,Gm15, Gm16, Gm18, Gm19, and Gm20, two functional genes screened from Gm17, Gm19, and three functional genes screened from Gm06, Gm13, and Gm20.

Twenty-three functional genes were involved in eight pathways, namely sphingolipid metabolism, glycerolipid metabolism, fatty acid biosynthesis, fatty acid metabolism,fatty acid degradation, ether lipid metabolism, alphalinolenic acid metabolism, and linoleic acid metabolism.The putative oil synthesis-related geneGlyma.02G127000was annotated as a longevity assurance gene (LAG1)and has been shown to be necessary for ceramide synthetize activity. Ceramide is a precursor molecule in the sphingolipid pathway and plays a key role in sphingolipid metabolism (Guillaset al.2003). The homologous geneGlyma.02G127000was located upstream of growth and differentiation factor 1 (uog1) and has been shown to selectively induce sphingolipid synthesis in mammalian cells (Venkataramanet al.2002). The putative oil synthesisrelated geneGlyma.13G246000had a diacylglycerol kinase catalytic domain and was annotated as sphingosine kinase 1 (SPHK1). Maceykaet al.(2005) demonstrated thatSphK1andSphK2could catalyze phosphorylation of sphingosine to produce the effective sphingolipid metabolite sphingosine 1-phosphate. Another putative oil synthesis-related geneGlyma.18G118600was annotated as encoding for alphagalactosidase, an enzyme that has been shown to act on alpha-galactoside-containing complex polysaccharides,glycoproteins, and glycosphingolipids (Svastits-dücsőet al.2009; Katroliaet al.2012). Leeet al.(2009) reported that alkaline alpha-galactosidase in rice is involved in the degradation of galactosyl diacylglycerol (DGDG) during rice leaf senescence. TheGlyma.20G085600putative oil-related gene was annotated as phosphatidic acid phosphohydrolase 2 (PAH2) and has already been studied extensively in the model plantArabidopsis. Nakamuraet al.(2009) indicatedPAH2mediated membrane lipid remodeling is the adaptation mechanism of phosphate starvation. The geneGlyma.15G047900was annotated as short-chain dehydrogenase/reductase (SDR) and has been previously analyzed by Perssonet al.(2009). They showed that SDR is the largest enzyme superfamily, involved in the metabolism of various compounds, including fatty acid biosynthesis and fatty acid metabolism. The geneGlyma.19G059400was annotated as a long chain acyl-CoA synthetase 4 and has been shown to activate free fatty acids into acyl-CoA thioesters inArabidopsis(Colemanet al.2002). The genesGlyma.13G127900andGlyma.01G116600were annotated as alcohol dehydrogenase class-3 (ADH3), which has been shown to support the oxidation of omega-hydroxy fatty acids(Boledaet al.1993). The putative genesGlyma.19G034800andGlyma.20G049900were annotated as phospholipase D beta 1 (PLD beta 1) and have been previously shown to be one of the ubiquitous eukaryotic enzymes involved in various cellular processes (Liscovitchet al.2000; Wang 2000). The putative oil synthesis-related geneGlyma.03G041900was annotated as a SAM dependent carboxyl methyltransferase,which acts as on a methyl donor and a carboxylic acidcontaining substrate to form a small molecule methyl ester (Zubietaet al.2003). Two gene candidates,Glyma.15G074300andGlyma.14G173500were annotated as PLAT/LH2 domain-containing lipoxygenase (LOX) family oil and linoleate 9S-lipoxygenase-4, respectively. These enzymes oxidize linoleate and alpha-linolenic by inserting molecular oxygen at the C9 position with (S)-configuration(Boeglinet al.2008). Of the remaining ten candidate genes, one gene participates in the sphingolipid metabolism pathway and nine participate in steroid biosynthesis, but they were not annotated. Detailed annotation information can be found in Appendix F.

4. Discussion

There are many advantages of utilizing collinearity analyses when searching for putative QTLs. It has been shown that collinearity analysis can be used to explore evolutionary events, such as transposition, loss, rearrangement, and replication (Ball and Cherry 2001; Lyonset al.2008).Using collinearity analysis, Chenet al.(2013) was able to identify local duplications, insertions, and inversions of region in theOryza brachyanthagenome by constructing a collinearity map of the whole-genome comparison of rice andO. brachyantha. This analysis further revealed mechanisms involved in the variation ofO. brachyanthagenome size, gene family expansion, gene migration,and transformation. Another use for collinearity analysis provides theoretical and practical value for studying the loci that control common traits among species. In wheat and rice, the position of the transcription factor VIVIPAROUS-1(encoded by theVp1gene) could be associated with the maize dormancy gene on chromosome 3 when identifiedviagenome marker location (Baileyet al.1999). Additionally,Sarmaet al.(1998) suggested the vernalization geneVrn-A1located on chromosome 5A flanked by deletion breakpoints 0.68 and 0.78 kb in wheat was shown to be orthologous to a chromosome 3 region in that contained the floweringtime QTLHd-6. In this study, the MCScanX algorithm was used to detect synteny and collinearity. This flexible algorithm can be used not only for analyzing chromosome structure but also for illuminating the history of the gene family. Wanget al.(2013) used MCScanX to detect 26 346 duplicated genes in the collective grapevine genome, further suggesting the great contribution of duplication events. Jiaoet al.(2014) detected 1 021 syntenic blocks using MCScanX and identified the monocotyledon duplication phylogenetic signals. In this study, the collective soybean genome collinearity blocks were constructed using the McscanX algorithm and the soybean genome. Overall, 666 collinear blocks were obtained including 45 933 genes across 20 chromosomes of soybean. It is known that soybean has undergone three whole-genome multiplication events during its evolutionary process (Jaillon 2007; Schlueteret al.2007)and remnants of the duplication events have been largely noted in soybean genome (Schmutzet al.2010). The establishment of the soybean genome collinearity blocks provides an understanding of the genome formation and evolution mechanisms, functional and biological significance of the soybean genome, and a foundation for further QTLs mapping and research.

In recent years, many researchers have become accustomed to analyzing QTL data by preliminary loci and meta-analysis. One study by Maroneet al.(2013)obtained 24 meta-QTLs for powdery mildew resistance using meta-analysis in wheat, minimizing the confidence interval distance using QTL interval collinearity and allowing further screening into genes that have a role in resistance to powdery mildew. Another study by Pottorffet al.(2012)located a QTL for cowpea leaf where a QTL on chromosome 7 had a collinearity relationship with a partial sequence on soybean chromosome 3 and chromosome 19 then subsequently the leaf-related genes were screened using soybean genome sequences. However, there are few studies on collinearity in soybean oil production-related QTLs. In this study, 242 soybean oil QTLs were mapped in collinear blocks from the soybean genome and 521 collinearity relationships from 231 effective soybean oil QTLs were obtained. Based on the established network of these soybean oil QTLs, 15 hotspot intervals repeats were found and 23 functional genes were screened by further mining.

There are many homologous genes in the soybean genome and it is necessary to reduce the interference from redundant homologous genes by mining candidate genes through target trait loci. This will provide useful information for the research of subsequent molecular markers. Hotspot regions obtained in this study provided many candidate oil synthesis-related genes. After screening, 23 genes showed an indirect association to oil production and involved eight pathways, including sphingolipid metabolism(ko00600), glycerolipid metabolism (ko00561), fatty acid biosynthesis (ko00061), fatty acid metabolism (ko01212),fatty acid degradation (ko00071), ether lipid metabolism(ko00565), alpha-Linolenic acid metabolism (ko00592),and linoleic acid metabolism (ko00591). Thirteen functional genes have been annotated in distinct species, but the remaining 10 genes involved in steroid biosynthesis and sphingolipid metabolism were not able to be annotated and call for further study.

Our research has contributed 23 functional genes associated with oil content in soybean. These novel oilassociated genes can be used for breeding or molecular biology research and offer information to elucidate the genetic mechanisms underlying the soybean oil metabolic pathway. Collinearity relationship analysis in soybean oil-related QTLs has shown to be a useful tool in studying the genetic evolution of QTLs and analyzing genes related to soybean quality. This study offers a useful method for the search for novel genes associated with desired traits in soybean as well as demonstrates its robust ability to identify QTL.

5. Conclusion

In this study, a total of 666 collinearity blocks were detected in the soybean genome across 20 chromosomes, and then 521 collinearity relationships existed in 231 of the 242 effective soybean oil-related QTLs. To identify QTL functional genes, a total of 1 316 candidate genes were screened from the hotspot intervals and 23 candidate functional genes were involved in eight pathways.

Acknowledgements

This study was financially supported by the National Key R&D Program of China (2016YFD0100500, 2016YFD0100300,2016YFD0100201-21), the National Natural Science Foundation of China (31701449, 31471516, 31401465,31400074, 31501332), the Natural Science Foundation of Heilongjiang Province, China (QC2017013), the Young Innovative Talent Training Plan of Undergraduate Colleges and Universities in Heilongjiang Province, China(UNPYSCT-2016144), the Special Financial Aid to Postdoctor Research Fellow in Heilongjiang, China (To Qi Zhaoming), the Heilongjiang Funds for Distinguished Young Scientists, China (JC2016004), the Outstanding Academic Leaders Projects of Harbin, China (2015RQXXJ018),the China Post Doctoral Project (2015M581419), the Dongnongxuezhe Project, China (to Chen Qingshan), and the Young Talent Project of Northeast Agricultural University,China (to Qi Zhaoming, 518062).

Appendicesassociated with this paper can be available on http://www.ChinaAgriSci.com/V2/En/appendix.htm