Ming QI, Guiping HE, Zhaoyou XU
Abstract The number and variability of transcripts determine the diversity and complexity of the proteins in a species. To reveal the variations in transcript abundance and the related factors in Chinese fir (Cunninghamia lanceolata), we performed second-generation sequencing (RNA-seq) and third-generation sequencing (PacBio) analyses. The forms and rates of alternative splicing in Chinese fir were studied on the basis of the obtained transcripts. Furthermore, the number of full-length transcripts (isoforms) produced by alternative splicing and the variation patterns in ‘Long 15’ were analyzed. The transcript diversity in Chinese fir was largely caused by six alternative splicing forms, of which intron retention was the most common (47.98% of all alternative splicing events), followed by alternative 3′ splice site (24.29%). The third-generation sequencing analysis detected 61 613 isoforms in ‘Long 15’, with most genes producing 1-10 isoforms; only 0.06% of the genes produced more than 10 isoforms. Transcript abundance was similar among Chinese fir varieties, but 615 more transcripts were detected in ‘Long 15’ than in clone 1339, implying that ‘Long 15’ synthesized more protein types in vivo than 1339. This difference may explain why ‘Long 15’ grows better and is more adaptable to environmental conditions than 1339. An examination of Chinese fir clone Kai6 detected more transcripts after fertilization than following a nutrient stress treatment. Moreover, transcript polymorphism was greater after fertilization than in response to nutrient stress. This finding may be useful for improving fertilizer applications to enhance Chinese fir growth and development. Sequences and alternative splicing forms are critical for research on the Chinese fir transcriptome and the potential biological consequences of alternative splicing.
Key words Cunninghamia lanceolata; Alternative splicing; Transcriptome sequencing; Transcript; Full-length transcript
Received: March 7, 2022 Accepted: May 8, 2022
Supported by the Fundamental Research Funds of the Central Public Welfare Research Institutes of The Chinese Academy of Forestry (CAFYbb2017ZA001-1-2); the "14th Five-year" Forest Variety Breeding Project of Zhejiang Province (2021C010010808).
Ming QI (1962-), male, P. R. China, associate researcher, devoted to research about molecular genetics of forest trees.
*Corresponding author. E-mail: youqingyi1962@163.com.
Transcript abundance may be used as the basis for determining protein diversity and complexity in eukaryotes[1-5]. Alternative splicing is directly responsible for the formation of diverse transcripts[6-12], making it an important mechanism regulating gene expression in multicellular eukaryotes. Additionally, it is important for increasing protein functional diversity, promoting species evolution, and protecting developing plants from biotic and abiotic stresses.
Precursor mRNA can be spliced via constitutive splicing and alternative splicing. More specifically, constitutive splicing involves the sequential removal of introns from the precursor mRNA and the subsequent linking of exons in their original order. In this splicing mode, only one set of splicing sites is used to produce a mature mRNA sequence, which generally leads to the production of only one protein. In contrast, in alternative splicing the splicing complex selects among different splicing sites in the precursor mRNA, so that different exon combinations result in the production of multiple mature transcripts and proteins, with important implications for the regulation of gene expression and the diversification of the proteome[1-5]. Different proteins may therefore be synthesized from an individual gene because of the formation of multiple transcripts. Consequently, studying transcript variations in eukaryotes requires an understanding of the mechanisms underlying alternative splicing.
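As a toy illustration of the distinction drawn above, the following Python sketch treats a precursor mRNA as an ordered list of exons. Constitutive splicing joins every exon in order, whereas exon skipping (one of the alternative splicing forms) is modeled by optionally dropping internal exons. The exon names and the choice of which exons can be skipped are hypothetical, and real splice-site selection is far more constrained than this enumeration suggests.

```python
from itertools import combinations


def constitutive_isoform(exons):
    """Join all exons in their original order: one gene, one transcript."""
    return tuple(exons)


def exon_skipping_isoforms(exons, cassette):
    """Enumerate transcripts formed by skipping any subset of cassette exons.

    `cassette` is a hypothetical set of internal exons that may be skipped;
    the first and last exons are always retained in this simplified model.
    """
    skippable = [e for e in exons[1:-1] if e in cassette]
    isoforms = []
    for k in range(len(skippable) + 1):
        for skipped in combinations(skippable, k):
            isoforms.append(tuple(e for e in exons if e not in skipped))
    return isoforms


exons = ["E1", "E2", "E3", "E4"]  # hypothetical four-exon gene
iso = exon_skipping_isoforms(exons, cassette={"E2", "E3"})
for transcript in iso:
    print("-".join(transcript))
```

With two skippable exons, this toy gene yields four transcripts, the first of which is the constitutive one; the number of possible combinations doubles with each additional skippable exon in this model.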
In recent years, the rapid development of second-generation high-throughput sequencing technology (e.g., Illumina systems) and the related bioinformatics algorithms and software have made it possible to interpret plant genomes and transcriptomes. Genome and transcriptome sequences provide the most basic genetic information for a species[13-16]. In Cunninghamia lanceolata, the transcriptomes of various tissues and organs (e.g., needles, immature xylem, and fibrous roots) in different periods have been sequenced using second-generation Illumina sequencing technology[17-18,23]. The resulting data provide important insights into the growth and development of C. lanceolata and the wood formation process. However, the short reads and long assembly process associated with Illumina transcriptome sequencing are not ideal for genome and transcriptome sequencing studies conducted to annotate new genes, analyze full-length transcripts, and screen for alternative splicing events.
Because of the nature of short-read sequencing, many full-length transcripts cannot be produced, and some important alternative splicing events may be lost. However, the current third-generation sequencing technology (e.g., PacBio systems) has effectively solved this problem[27]. Generating information regarding RNA sequences and alternative splicing forms is crucial for further characterizing plant transcriptomes and the potential biological consequences of alternative splicing.
In this study, PacBio and Illumina sequencing technologies were used to explore transcript abundance and variation in the same ramet of a Chinese fir clone (‘Long 15’), among different Chinese fir varieties, and between fertilization and nutrient stress treatments of a Chinese fir clone (Kai6). To reveal the mechanism regulating alternative splicing during C. lanceolata growth and development, we identified many transcripts, which will serve as a useful reference for studying transcriptional regulation.
Materials and Methods
Collection of original data and the software used in this study
Data from the following three C. lanceolata sequencing projects were collected: ① transcriptome sequencing (RNA-seq) analysis of the molecular mechanism underlying the heterosis of C. lanceolata growth traits[17], ② RNA-seq analysis of the gene expression changes induced by Chinese fir fertilization[18], and ③ full-length transcriptome sequencing of ‘Long 15’. All three projects were commissioned by Hangzhou Lianchuan Biological Co., Ltd. The C. lanceolata sequencing data were deposited in the NCBI Sequence Read Archive (SRX4887816, SRX4887819, and SRX4887827).
The software used for the third-generation sequencing analysis of ‘Long 15’ as well as for the second-generation and third-generation transcriptome sequencing analyses is listed in Table 1. The public databases used for analyses are listed in Table 2.
Experimental design and bioinformatics methods
To study Chinese fir transcript variations in different varieties, after different fertilization treatments, and using different sequencing methods, the following three research schemes were adopted:
Scheme I: New branches and leaves from Chinese fir ‘Long 15’ (perennial species) growing in a clonal seed orchard were used as research materials. The PacBio sequencing platform (i.e., the latest third-generation sequencing technology) was used for full-length transcript sequencing and for comparing gene and full-length transcript sequences after alternative splicing.
Scheme II: An RNA-seq method was used to compare transcript variations among different varieties, including clones ‘Long 15’ and 1339[17].
Scheme III: Chinese fir clone Kai6 and transcriptome data for gene expression induced by fertilization were used to investigate the transcript variations in Chinese fir growing on barren land[18].
The basic sequencing data were processed using Excel software.
Results and Analysis
Alternative splicing forms detected in Chinese fir ‘Long 15’
The PacBio sequencing platform was used to perform a third-generation sequencing analysis of the new branches and leaves of ‘Long 15’. The analysis revealed the following six alternative splicing forms (Fig. 1): exon skipping (ES), intron retention (IR), alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), alternative last exon (ALE), and alternative first exon (AFE). This result is consistent with the findings of several studies involving other species[1,3,7,8].
Frequency of different alternative splicing events in ‘Long 15’
The frequency of different alternative splicing forms in ‘Long 15’ is presented in Fig. 2.
Among the six alternative splicing forms in ‘Long 15’, IR was the most common (Fig. 2), accounting for 47.98% of all splicing events, followed by A3SS (24.29%). The least common alternative splicing forms were ALE (1.42%) and AFE (3.04%).
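The text reports individual shares for four of the six forms only. As a small worked check, the sketch below infers the combined share left for ES and A5SS, under the assumption that the six forms together account for 100% of the detected events; the split between ES and A5SS cannot be recovered from these figures.

```python
# Shares of alternative splicing events reported in the text (percent).
reported = {"IR": 47.98, "A3SS": 24.29, "AFE": 3.04, "ALE": 1.42}

# ES and A5SS are not given individually; only their combined share can be
# inferred, assuming the six forms sum to 100% of events.
reported_total = round(sum(reported.values()), 2)
remaining = round(100.0 - reported_total, 2)

print(f"Reported forms: {reported_total}%")
print(f"ES + A5SS combined: {remaining}%")
```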
Number of full-length transcripts produced by individual genes in ‘Long 15’
The six alternative splicing forms in ‘Long 15’ produced numerous full-length transcripts. After eliminating low-quality redundant transcripts, 61 613 non-redundant full-length transcripts remained. Most genes produced 1-10 full-length transcripts.
A total of 38 genes produced more than 10 full-length transcripts in ‘Long 15’. The frequency and number of full-length transcripts produced by ‘Long 15’ genes are presented in Fig. 3 and Table 3. The results were consistent with those of an earlier study on Brassica napus[1] (Fig. 4).
Quantitative analysis of transcript variations via second-generation and third-generation sequencing analyses of ‘Long 15’
Because the whole genome sequence of C. lanceolata has not been published, the second-generation and third-generation sequencing results for C. lanceolata were analyzed without a reference genome sequence. Additionally, the amount of generated data differed between the second-generation and third-generation sequencing analyses. Consequently, the sequencing data obtained from the second-generation and third-generation sequencing analyses of the same ‘Long 15’ clone were not directly comparable.
New leaves collected from ‘Long 15’ trees were used as the research material for an RNA-seq analysis. The results for related transcripts are listed in Table 3.
A total of 61 613 full-length transcripts were obtained from the third-generation sequencing analysis (Table 3). One and two full-length transcripts were obtained from 77.01% and 15.89% of the genes, respectively. In contrast, only 0.06% of the genes produced more than 10 full-length transcripts. The second-generation sequencing results indicated that ‘Long 15’ has 53 979 genes, from which 104 475 transcripts were generated (Table 3), which suggested that each gene produced about two transcripts on average. However, 39.75% of the genes produced one transcript and 47.33% of the genes produced more than 10 transcripts. There were substantial differences between the results of the second-generation and third-generation sequencing analyses (Table 3). These differences were likely due to the use of different sequencing methods. Many studies generated similar results[1,3,5,24,26,28]. For example, Lei[28] conducted a transcriptome sequencing analysis of mixed whole-tissue samples of Salmo trutta fario, which resulted in the detection of 81 505 genes as well as 153 365 transcripts following an analysis using Trinity software, with an average of about two transcripts per gene. Our findings were in accordance with the results of this earlier study.
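The "about two transcripts per gene" figures quoted above follow directly from the reported counts. The short sketch below reproduces that arithmetic; the dictionary labels and the helper name `transcripts_per_gene` are illustrative, and only the counts come from the text.

```python
def transcripts_per_gene(n_transcripts, n_genes):
    """Mean number of transcripts per detected gene."""
    return n_transcripts / n_genes


# (transcripts, genes) as reported in the text for second-generation data.
datasets = {
    "Long 15 (RNA-seq)": (104_475, 53_979),
    "Salmo trutta fario (Lei[28])": (153_365, 81_505),
}

ratios = {
    name: transcripts_per_gene(t, g) for name, (t, g) in datasets.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} transcripts per gene")
```

Both ratios fall slightly below two, which is why the averages are described as "about two" rather than exactly two.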
Differences in the number and frequency of transcripts obtained using second-generation and third-generation sequencing methods
The results of this study on ‘Long 15’ (Table 3) revealed clear differences in the number and diversity of transcripts obtained by the second-generation and third-generation sequencing analyses. We conducted the following in-depth analysis on the basis of the variation in the number of transcripts produced by the growth-related genes of ‘Long 15’.
An earlier transcriptome sequencing analysis of ‘Long 15’ detected 14 growth-related genes[17]. The results of a comparison of the transcripts obtained from second-generation sequencing and third-generation sequencing analyses of these genes are listed in Table 4.
Of the 14 examined growth-related genes (Table 4), the number of full-length transcripts was higher for the third-generation sequencing analysis than for the second-generation sequencing analysis for four genes (Comp61820_c0, Comp71030_c0, Comp52987_c0, and Comp62983_c0). For both sequencing analyses, two full-length transcripts were obtained for Comp50961_c0. Regarding the other nine growth-related genes, the number of full-length transcripts was higher for the second-generation sequencing analysis than for the third-generation sequencing analysis. These results are consistent with the data presented in Fig. 3.
Comparison of the variation in the number of transcripts in different Cunninghamia lanceolata varieties on the basis of an RNA-seq analysis
Two clones of grafted Chinese fir trees (‘Long 15’ and 1339) grown under the same environmental conditions in a seed orchard were used as research materials. New leaves produced in the current year were collected for an RNA-seq analysis[17]. The results of a comparison of the ‘Long 15’ and 1339 transcripts are listed in Table 5.
A total of 53 979 genes were detected in ‘Long 15’, which is 440 more than the 53 539 genes detected in 1339 (Table 5). A comparison of transcript abundance revealed 615 more transcripts in ‘Long 15’ (104 475) than in 1339 (103 860). Additionally, the ‘Long 15’ transcripts were more polymorphic than the 1339 transcripts, indicative of more kinds of proteins being synthesized in ‘Long 15’ than in 1339. This increased protein diversity is conducive to the growth and development of ‘Long 15’, which is highly adaptable to environmental conditions. These results are in accordance with those of an earlier study[22] involving the conventional breeding of Chinese fir.
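To put the differences between the two clones in proportion, the following sketch recomputes the absolute differences from the reported counts and expresses them relative to clone 1339. The variable names are illustrative, and using 1339 as the baseline is an assumption made here for the calculation.

```python
# Gene and transcript counts reported in the text (Table 5).
long15 = {"genes": 53_979, "transcripts": 104_475}
clone1339 = {"genes": 53_539, "transcripts": 103_860}

# Absolute differences (Long 15 minus 1339).
diff = {key: long15[key] - clone1339[key] for key in long15}

# Relative differences in percent, using 1339 as the baseline (assumption).
rel = {key: 100.0 * diff[key] / clone1339[key] for key in diff}

for key in diff:
    print(f"{key}: +{diff[key]} ({rel[key]:.2f}% more than 1339)")
```

Both relative differences are below 1%, in line with the statement in the abstract that transcript abundance was similar among varieties.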
RNA-seq analysis of the transcript variations induced by nutrient stress and fertilization treatments of Chinese fir clone Kai6
Chinese fir (clone Kai6) trees grown in the same environmental conditions with the same management practices were used for the nutrient stress and fertilization tests[18]. The results of a comparison of the transcriptomes resulting from the two treatments are listed in Table 6.
A total of 69 322 genes were detected following the nutrient stress treatment (Table 6), which was 3 346 more than the 65 976 genes detected after fertilization. However, 1 436 more transcripts were detected after fertilization (146 943) than in response to the exposure to nutrient stress (145 507). Moreover, transcript polymorphism was greater for the fertilization treatment than for the nutrient stress treatment, implying the protein diversity after fertilization was greater than that induced by nutrient stress. This finding may be relevant for enhancing C. lanceolata growth and development in regions in which fertilizer application rates are moderate or low.
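The contrast between the two treatments can be made concrete with the reported counts. The sketch below recomputes the differences and adds a transcripts-per-gene ratio as one crude proxy for transcript diversity; that ratio is an assumption introduced here for illustration and is not necessarily the measure of polymorphism used in the analysis.

```python
# Gene and transcript counts reported in the text (Table 6).
counts = {
    "nutrient stress": {"genes": 69_322, "transcripts": 145_507},
    "fertilization": {"genes": 65_976, "transcripts": 146_943},
}

stress = counts["nutrient stress"]
fert = counts["fertilization"]

gene_diff = stress["genes"] - fert["genes"]                    # more genes under stress
transcript_diff = fert["transcripts"] - stress["transcripts"]  # more transcripts after fertilization

# Transcripts per gene: an illustrative proxy, not the study's own metric.
per_gene = {name: c["transcripts"] / c["genes"] for name, c in counts.items()}

print(f"Extra genes under nutrient stress: {gene_diff}")
print(f"Extra transcripts after fertilization: {transcript_diff}")
for name, ratio in per_gene.items():
    print(f"{name}: {ratio:.2f} transcripts per gene")
```

Fewer genes but more transcripts after fertilization implies a higher transcripts-per-gene ratio for that treatment, which points in the same direction as the greater transcript polymorphism described above.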
Discussion
Alternative splicing of precursor mRNA
In this study on C. lanceolata, six alternative splicing forms were detected. This is consistent with the results of earlier related research. For example, a previous investigation of alternative splicing in woodland strawberry (Fragaria vesca Linn.), which combined second-generation Illumina and third-generation SMRT sequencing technologies, revealed six types of alternative splicing and elucidated the changes in alternative splicing forms during the flower and fruit development stages. The results significantly increased the accuracy and completeness of the woodland strawberry genome annotation. Similarly, six alternative splicing forms were identified in another study that applied PacBio sequencing technology to investigate the complex regulatory mechanisms of alternative splicing in cotton (Gossypium hirsutum Linn.). Alternative splicing thus appears to be widespread among eukaryotes.
Relationship between second-generation and third-generation sequencing
Notably, current transcriptome analyses of Chinese fir are conducted without a reference genome. Consequently, third-generation transcriptome sequencing analyses of Chinese fir cannot generate more specific information about all full-length transcripts. Furthermore, because of the differences in sequencing methods, second-generation and third-generation sequencing data are not comparable. Accordingly, even with the foundation provided by second-generation sequencing data, third-generation sequencing cannot detect new genes, nor can it annotate previously uncharacterized genes.
Jiang[24] sequenced the Camellia nitidissima transcriptome using second-generation and third-generation sequencing methods, with 67 018 genes detected by the second-generation sequencing analysis. Additionally, 122 201 transcripts were generated, with an average of about two transcripts per gene, which is similar to the results of the current study. In contrast, 354 234 full-length transcripts were obtained by the third-generation sequencing analysis. Moreover, the number and diversity of transcripts varied considerably between the second-generation and third-generation sequencing analyses. There have been several similar studies. For example, using Brassica napus L. as the research material, Yuan[1] combined second-generation and third-generation sequencing methods to identify 1 360 new loci and correct 1 821 incorrect annotations. Feng et al.[26] explored the complexity of the transcriptome of Eupatorium adenophorum and produced important data for subsequent examinations of the structural characteristics and sequence compositions of E. adenophorum genes.
On the basis of the above-mentioned research results, we speculated that when second-generation sequencing and third-generation full-length transcriptome sequencing methods are used simultaneously, the second-generation transcriptome sequencing data may be used to correct the full-length transcripts generated by the third-generation transcriptome sequencing analysis, thereby maximizing the utility of the third-generation sequencing data. Furthermore, the third-generation transcriptome sequencing data can be used to optimize the use of the second-generation reference genome information and improve the accuracy of second-generation quantitative data. Thus, the two sequencing approaches appear to complement each other.
Conclusions
In this study, we identified six types of alternative splicing events in C. lanceolata, with the resulting different exon combinations leading to the production of a variety of transcripts that increased the diversity of the proteins in C. lanceolata and the complexity of this tree species. The observed variation in the number of transcripts was related to the genetic characteristics, growth stage, and fertilization of Chinese fir, among other factors.
Because transcriptome sequencing analyses of C. lanceolata are currently conducted in the absence of an established reference genome, the relationship between the second-generation and third-generation sequencing results was not investigated in detail in this study. In conclusion, the C. lanceolata genome has not been completely sequenced, which greatly restricts the possible molecular biology research that can be conducted on C. lanceolata. The lack of an available reference genome has especially limited the study of the C. lanceolata transcriptome because researchers are unable to identify new transcripts and thoroughly examine transcript structural variations, non-coding region functions, and single nucleotide polymorphisms. Thus, the incomplete whole genome sequence is a key bottleneck for in-depth research on the molecular biology of C. lanceolata. To promote such research, the structure and function of the whole C. lanceolata genome will need to be comprehensively characterized.
References
[1] YUAN JL. Alternative splicing analysis of Brassica napus L. based on PacBio RSII and NGS[D]. Wuhan: Master dissertation of Huazhong Agricultural University, 2019.
[2] ZHANG T. Transcriptome analysis of ear of wheat and alternative splicing map analysis of wheat based on RNA-Seq[D]. Yangling: Doctoral thesis of Northwest A&F University, 2019.
[3] LI YP. Global identification of alternative splicing genes and genome re-annotation of the woodland strawberry Fragaria vesca[D]. Wuhan: Doctor dissertation of Huazhong Agricultural University, 2020.
[4] WANG SQ. Alternative splicing and novel TU analysis of maize CMS-C anther based on RNA-seq[D]. Ya'an: Master dissertation of Sichuan Agricultural University, 2014.
[5] WANG MJ. Multiple sets of data reveal the genetic and epigenetic basis of cotton fiber development[D]. Wuhan: Doctor dissertation of Huazhong Agricultural University, 2017.
[6] YANG LY, ZHANG YR, YANG YS, et al. Analysis of alternative splicing events of insect-resistant response genes based on RNA-seq data in Zea mays[J]. Journal of Henan Agricultural University, 2020, 54(2): 181-188.
[7] HUANG L, LI QY, WEI SG, et al. Identification and difference analysis of the alternative splicing event in the Hermaphroditic flowers and male flowers of Asparagus officinalis[J]. Acta Horticulturae Sinica, 2019, 46(8): 1503-1518.
[8] HUANG L, LAI J, WEI SG, et al. Differences of antioxidant enzyme activities and alternative splicing of related genes between hermaphroditic and male flower buds in asparagus[J]. Plant Physiology Journal, 2019, 55(8): 1231-1238.
[9] WEI H, LOU Q, XU K, et al. Pattern of alternative splicing different associated with difference in rooting depth in rice[J]. Plant and Soil, 2020.
[10] LEE HJ, EOM SH, LEE JH, et al. Genome-wide analysis of alternative splicing events during response to drought stress in tomato (Solanum lycopersicum L.)[J]. The Journal of Horticultural Science and Biotechnology, 2020, 95(3): 286-293.
[11] LIU QF, LI H. Transcript diversity of different cellular compartments in Drosophila[J]. Chinese Journal of Bioprocess Engineering, 2019, 17(2): 166-170.
[12] LIU DM, ZHANG YJ, PEI YX. Characterization of alternative splicing of Lemads1 in tomato MADS-box[J]. Chinese Journal of Biochemistry and Molecular Biology, 2016, 32(6): 641-648.
[13] ZHAO CH, ZHOU HJ, TONG ZK, et al. Cloning and expression of a floral related MADS-box gene BlMADS1 from Betula luminifera[J]. Journal of Zhejiang A&F University, 2015, 32(2): 221-228.
[14] DING J. Study on differences in synthesis, accumulation and transcription expression of sea buckthorn flesh and seed oil[D]. Harbin: Doctoral dissertation of Northeast Forestry University, 2016.
[15] FENG YZ. Transcriptome sequencing and FAD3 gene identification and function study[D]. Beijing: Doctoral dissertation of Chinese Academy of Forestry, 2016.
[16] JIANG GX. Analysis of the transcription group of yulong seeds and cloning of important genes for oil synthesis[D]. Changsha: Doctoral dissertation of Central South University of Forestry and Technology, 2014.
[17] QI M, HE GP, ZHOU JG, et al. Transcriptome analysis of heterosis of growth traits in Chinese fir[J]. Forest Research, 2019, 32(3): 113-120.
[18] QI M, HE GP, WANG HR, et al. A transcriptome analysis of the gene expression responses induced by fertilization treatments in Chinese fir[J]. Journal of Central South University of Forestry & Technology, 2020, 40(4): 101-110.
[19] FENG YL, XIONG Y, ZHANG J, et al. The role of alternative splicing in plant development and abiotic stress response[J]. Journal of Nuclear Agricultural Sciences, 2020, 34(1): 62-70.
[20] AO ZL, GUO YH, JIANG C, et al. The bioinformatics analysis about the alternative spliced variant resistgene of OsGATA23 in rice[J]. Journal of Hubei University, 2019, 41(6): 584-591.
[21] ZENG XQ, YUAN HJ, YANG CB, et al. Genome-wide variable shear analysis of highland barley under drought stress[J]. Tibet Journal of Agricultural Sciences, 2019(1): 1-5.
[22] QI M. Isozyme analysis on excellent variety of Cunninghamia lanceolata "Long 15"[J]. Forest Research, 2007, 20(1): 143-146.
[23] ZHANG YX, HAN XJ, SANG J, et al. Transcriptome analysis of immature xylem in the Chinese fir at different developmental phases[J]. PeerJ, 2016(4): e2097.
[24] JIANG LN. Study on flower color formation metabolism mechanism and key genes function of Camellia nitidissima[D]. Beijing: Dissertation for Ph. Degree of Chinese Academy of Forestry, 2020.
[25] ZHU FY, CHEN MX, YE NH, et al. Proteogenomic analysis reveals alternative splicing and translation as part of the abscisic acid response in Arabidopsis seedlings[J]. Plant J, 2017, 91(3): 518-533.
[26] FENG KW, NIE XJ, SONG WN, et al. Study on the complexity of the transcriptome of purple rhizome based on second and third generation sequencing technology[R]. Xinjiang: Abstract of paper presented at the National Biodiversity Conference, 2019.
[27] SHUAI HW, JIANG R, CHEN F, et al. Effect on mRNA alternative splicing of At-RS31 gene under shade stress conditions[J]. Molecular Plant Breeding, 2018, 16(17): 5631-5637.
[28] LEI SY. The artificial breeding for Salmo trutta fario, genetic variation in breeding populations of Erythroculter ilishaeformis and their transcriptome analysis[D]. Shanghai: Master dissertation of Shanghai Ocean University, 2017.