WANG Mei,WANG Ke-Xin,TAN Jian-Jun,WANG Jing-Jing
(Department of Biomedical Engineering,Faculty of Environment and Life,Beijing University of Technology,Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation,Beijing 100124,China)
Abstract Objective Lung cancer is one of the most common cancers in the world.Lung adenocarcinoma(LUAD)has the highest annual mortality rate among lung cancer patients.It has been reported that changes in gene spectrum were associated with the process of tumorigenesis and its development.The purpose of this study is to identify the gene signatures associated with LUAD and to further analyze their prognostic significance.Methods Weighted gene co-expression network analysis(WGCNA),differential gene analysis,cox regression analysis,and protein-protein interaction(PPI)network analysis were used to screen the hub genes highly related to LUAD based on The Cancer Genome Atlas(TCGA)database.The RNA-seq data sets from TCGA and GTEx(Genotype Tissue Expression)database were combined and divided into a training set and a validation set,which were used to construct the diagnostic model by support vector machine recursive feature elimination feature(SVM-RFE)algorithm.GSE32863 and GSE31210 were used to verify the diagnostic accuracy of the model and the prognostic value of our obtained gene signatures,respectively.Results The results demonstrated that the model of 5 gene signatures(anln,cenpa,plk1,tpx2,cdca3)obtained by the SVM-RFE algorithm had an outstanding performance in the classification of LUAD patients.Functional enrichment analysis showed that these 5 gene signatures were highly related to the biological process of tumor initiation and progression.What's more,LUAD patients with high expression of these 5 genes also exerted a poor outcome in survival status.Conclusion Therefore,we could conclude that our study obtained useful models with 5 gene signatures for the diagnosis and prognosis of LUAD,which were essential for the development of novel targets applied in precision therapy.
Key words lung adenocarcinoma,gene signature,WGCNA,SVM-RFE
Lung cancer is the leading cause of death globally,and non-small cell lung cancer(NSCLC)accounts for 85% of all lung cancer[1].Lung adenocarcinoma(LUAD)is the most common type in NSCLC[2].The 5-year survival rate after the diagnosis of lung cancer is less than 20%[3].Although there are recent advances in surgical methods,immunotherapy,and neoadjuvant therapy,the mortality of NSCLC remains high[4].
With the development of high-throughput sequence technology,bioinformatics has become increasingly popular in genomic analysis to investigate the pathological mechanism of tumor and discover tumor-specific biomarkers[5].Bioinformatics has made it possible to identify gene expression changes during tumorigenesis,contributing to determining the prognosis and treatment of lung cancer[6].What's more,after the HGP(Human Genome Project)finished,lots of publicly available databases such as The Cancer Gene Atlas(TCGA,https://tcga-data.nci.nih.gov/),Gene Expression Omnibus(GEO,http://www.ncbi.nlm.nih.gov/geo/)and Genotype Tissue Expression(GTEx,https://commonfund.nih.gov/GTEx/)showed cancer genome sequencing data.A careful and thorough analysis of these data can identify gene signatures and signal pathways about tumor,which will help to explore the mechanism of tumor formation and development.
At present,there have been numerous studies on gene signatures,which are helpful for the selection of lung cancer treatment methods and the prediction of survival rate after lung cancer surgery.For example,Damaet al.[7]suggested that 10 gene signatures may be important prognostic indicators of patients with stage I LUAD.Liuet al.[8]proved that increased mRNA expression of TTK and NEK2 improved the risk of smoking-related LUAD death.Moreover,Xieet al.[9]identified prognostic signatures containing 6 genes significantly correlated with the overall survival(OS)of NSCLC patients,providing support for the construction of treatment regimens for patients.However,most of these studies considered genes as individual bioinformatics analysis factors and finally combined the screened genes to form a predictive model,which did not make full use of the relationship between genes.
Weighted gene co-expression network analysis(WGCNA)is a systematic biological method for selecting co-expression modules of related genes and the critical module associated with clinical traits[10],providing a new direction to predict the gene signatures.Furthermore,the support vector machine recursive feature elimination feature(SVM-RFE)is a powerful algorithm to establish a gene model based on a large number of sample data.It improves the accuracy of the classifier model and has a wide range of applications in the field of bioinformatics[11-12].Therefore,an idea of combining WGCNA and SVMRFE to improve the recognition ability of highly related genes turned out,which was used to establish a cancer diagnosis model and screen candidate gene signatures.
In our study,in order to screen gene signatures related to the diagnosis and prognosis of LUAD,we used WGCNA,SVM-RFE,survival analysis,and other bioinformatics analysis methods to identify and verify the gene signatures highly related to LUAD based on multiple bioinformatics databases.Finally,our study obtained useful models with 5 gene signatures that applied to different sets for the diagnosis and prognosis and provided some valuable insights for the development of novel targets involved in the precision therapy of LUAD.
The transcriptome profiling dataset of LUAD and corresponding clinical dataset were obtained from TCGA database,including 344 tumor samples and 38 normal samples,and RNA-seq count data had about 19 430 genes.The mRNA expression of each sample was merged into a matrix with a merge script in the Perl language.Then the matrix of mRNA expression was annotated with the Ensembl database.With the help of the“edgeR”and“DESeq2”package,genes with low read counts were usually not of interest for further differential gene screening.According to the standard that the average expression of a gene in each sample should be≥1,the mRNA expression data were filtered,a total of 18 127 gene expression data continued to be analyzed.This study conformed to the publication guidelines of TCGA database.
Meanwhile,to improve the accuracy of the diagnostic model constructed by the SVM-RFE algorithm,data of 288 normal LUAD samples were downloaded from the GTEx database and gene count data on 56 754 genes.As the data used herein were freely sourced,approval from the Ethics Committee was not required.However,since only the data from normal samples were collected by GTEx database,it is usually used for bioinformatics analysis combined with TCGA database.Because the types of TCGA data and GTEx data were gene counts and reads per kilobase per million mapped reads(RPKM),respectively.According to the calculation Equation(1),the gene counts of TCGA are converted into RPKM,and then the data sets of TCGA and GTEx are standardized and merged by the Z-socre method to facilitate the further construction of the model.The combination of TCGA and GTEx dataset were divided into a training set(60%)and an interval validation set(40%),including 325 normal samples and 307 tumor samples.
FPKMstandardized total exon reads based on two aspects:mapped reads and exon length.
For further verification,we downloaded LUAD gene expression data from two access sets,GSE32863 and GSE31210,from GEO database.The former one provided 58 cancer samples and 58 normal samples for verifying the accuracy of the diagnostic model and the latter one provided 226 cancer samples with clinical information which helped to explore the relationship between the gene signatures and clinical prognosis.
In order to identify gene modules highly related to tumor patients,a weighted gene co-expression network was constructed by using the“WGCNA”package[13]in R software.WGCNA generated coexpression modules of highly correlated genes based on the interaction between genes and screened the critical module that was highly related to the clinical feature,providing a new insight for gene target prediction in precision medicine[14-16].According to the variance of gene expression,the first 8 000 genes were selected to construct a co-expression network.After constructing a sample tree based on gene expression,the goodSamplesGenes function was used to delete samples with large outliers.According to the co-expression relationship of genes,Pearson correlation coefficients were calculated between each gene,and their absolute values were used to establish the gene adjacency matrix,the formula is Equation(2).In order to make the distribution of genes conform to the scale-free network base on gene connectivity,the best soft-threshold power valueβwas chosen to construct the proximity matrix and transformed into a topological overlap matrix(TOM)in Equation(3).According to the TOM of genes,Equation(4)was used to calculate the distance between genes for hierarchical clustering.The network modules were generated using the dynamic shear method and distinguished by colors,with the best soft-threshold power value and a module size cutoff criterion≥10.Then,the obtained modules were used to screen the key module most relevant to the clinical traits by Pearson correlation test.
xiandxjare the nodes in the scale-free network;β:soft threshold;aij:gene adjacency matrix;k:the sum of the adjacency coefficient of all nodes connected individually between genes;lij:the sum of the product of the adjacency coefficients between geneiandj;Dij:the hierarchical clustering distance between geneiandj.
Due to differential analysis of the macrogenome,adj.Pwas selected to reduce the error rate of differential gene screening.Two DEGs sets were generated by“edgeR”and“DESeq2”packages after normalization and data filter with thresholds|log2foldchange|≥2 and adj.P<0.01 by comparing LUAD group with normal group,respectively.The two DEGs sets of LUAD group were visualized as volcano plots by using“gplots”package.Then,the overlapped genes obtained by crossing the two DEGs sets and the gene modules with high clinical relevance screened by WGCNA were used to identify potential gene signatures,which were shown as the Venn diagram by“VennDiagram”package.
The purpose was to independently assess the impact of overlapping genes on the survival time of LUAD patients.The“survival”package was used to analyze the prognostic value of each gene through univariate cox regression[17].According to the cutoff criterion ofP<0.05 in TCGA set,genes that were significantly related to the survival time of the patients were considered to have prognostic value.Then,the independent prognostic genes were selected to further analysis.
According to the results of the univariate cox regression analysis,we used the Search tool for the Retrieval of Interacting Genes(STRING)database to build PPI networks and then selected the genes that played key roles in the gene network for the further network model analysis.The threshold for minimum interaction score of these genes was 0.7 and built a PPI network model visualized by Cytoscape(version 3.6.1;https://www.cytoscape.org/).According to reports,Maximal Clique Centrality(MCC)algorithm was the most effective way to find hub genes in PPI networks[18].The MCC score of each node in the PPI network was calculated by using CytoHubba plugin.In our study,according to the results of MCC algorithm analysis,the top 15 genes were considered as the hub genes that played important roles in the process of tumor formation.
To investigate the biological function of the selected genes,the“clusterProfiler”package in R software[19]was used to analyze the Gene Ontology(GO)annotation in the overlapping genes and hub genes.GO annotation that includes the three aspects:molecular function(MF),cellular component(CC),and biological process(BP),can describe the molecular functions that gene products may perform,the cellular environment,and the biological processes involved[20].
The“e1071”[21]and“caret”packages[22]in R software were used to select optimized gene signatures based on recursive feature elimination(RFE)algorithm.Furthermore,the expression of genes was considered as the feature,the clinical traits of the sample were considered as the categorical variable,and the support vector machine(SVM)used the linear kernel to predict patients with LUAD to obtain the optimal gene signatures by RFE algorithm.The hub genes that passed the MCC algorithm analysis entered the SVM-RFE algorithm analysis.In the SVM-RFE algorithm,genes were ranked according to the measure of their importance,those genes with lower rankings were removed.And the precision of the gene model was examined by 10-fold cross-validation in the training set.Then,the optimal gene model was examined in the internal validation set and external validation set(GSE32863).Besides,the performance of the classifier in these sets was evaluated by receiver operating characteristic(ROC)curve analysis.The genes selected by the SVM-RFE method were chosen as the gene signatures.
To confirm the prognostic value of gene signatures,the multivariate cox regression model was constructed with the gene signatures as variables based on GSE31210 set.Then,based on the expression of gene signatures and the regression coefficient estimated by the multivariate cox regression model,the risk score(RS)prognostic model was constructed as follows[23]:
βmRNAwas defined as the independent prognostic coefficient andExpmRNArepresented the expression of corresponding mRNA.
According to the median RS as the cut-off point,all patients in GSE31210 set were distributed to lowrisk and high-risk groups.In order to evaluate the survival time difference between low-risk and highrisk groups and verify the prognostic value of RS model,the“survival”(version 3.27)package in R software was used to perform Kaplan-Meier(K-M)survival curve analysis[24].
Fig.1 The workflow of this study
This study was conducted as indicated in Figure 1.The mRNA expression matrix of TCGA-LUAD was obtained(18 127 genes)after data preprocessing.According to the requirements of WGCNA algorithm,we selected the top 8 000 genes in the variance order among genes.Then,in order to ensure the reliability of co-expression network,the outliers of the samples were removed by the method of sample clustering(Figure 2a).The scale-free distribution was shown in Figure 2b,and the appropriate soft-threshold power valueβ=8 was chosen base on the fitting degree and connectivity of the network.Finally,11 modules were identified base on dynamic tree clipping and average hierarchical clustering(Figure 2c).In addition,we plotted the heatmap of module-trait relationships to evaluate the association between modules and two clinical traits.The result of the heatmap revealed that the turquoise module(r=0.61,P=2E-35)was found to be the highest associated with tumor tissues(Figure 2d).
Fig.2 Construction of weighted gene co-expression network for LUAD
Based on the cut-off criteria of|log2foldchange|≥2 and adj.P<0.01,a total of 1 137 DEGs(Figure 3a)and 1 011 DEGs(Figure 3b)were found to be dysregulated in tumor samples by the“edgeR”and“DESeq2”packages,respectively.As shown in Figure 2c,1 799 co-expression genes were found in the turquoise module of TCGA data set.According to the intersection of these three sets,a total of 295 overlapping genes were extracted for gene signature screening(Figure 3c).
Fig.3 Screening of overlapping genes
Univariate cox regression analyses were performed on 306 LUAD patients to evaluate the prognostic relation between the selected 295 genes and OS.A total of 39 genes were obtained with a cutoff criterion ofP<0.05,which was considered to be significantly associated with OS in LUAD patients.Then,PPI network among the selected genes was constructed by using STRING database(Figure 4a).The cut-off value of the interaction score was 0.7.To further analyze the relationship of the 39 genes,we imported PPI network into Cytoscape.HR(hazard ratio),the co-expression relationship among the selected genes was calculated using the gene expression levels by MCC algorithms in the Cytoscape(Figure 4b)and the result was shown in Table 1.As shown in Figure 3b,PPI network of the top 15 highest-scored genes have 19 nodes and 102 edges,including their expanded sub-network.According to the MCC scores,the top 15 highestscored genes were considered as the hub genes for further bioinformatics analysis.
Table 1 The top 15 highest-scored genes in PPI network
Fig.4 Screening of the hub genes
To further analyze the potential biological function of the 295 overlapping genes and hub genes,the functional enrichment analysis was performed by the“clusterProfiler”and“enrichplot”packages in R software.As shown in Figure 5a,several enriched gene sets were obtained by screening from GO enrichment analysis.Biological processes(BP)of the 295 genes were mainly involved in nuclear division and organelle fission.Due to the result of cellular component(CC),these genes were mainly enriched in the chromosomal region and spindle.Moreover,through the molecular function(MF)analysis,ATPase activity,tubulin binding,and microtubule binding were suggested to be related to these 295 genes.In addition,the result of gene enrichment analysis of the hub genes is shown in Figure 5b.These hub genes were mainly enriched in regulation of mitotic cell cycle phase transition,regulation of cell cycle phase transition,regulation of mitotic nuclear division,mitotic nuclear division,regulation of nuclear division,etc.
The recursive feature elimination(RFE)algorithm was used to identify the most significant gene signatures.In order to solve the class skew caused by the imbalance between normal and tumor samples in TCGA data set,data of 288 normal lung tissue samples were downloaded from GTEx database for SVM-RFE algorithm analysis.Then the RFE algorithm was applied to filter the 15 hub genes in order to identify the optimal combination of gene signatures in the training set.Finally,5 genes(anln,cdca3,cenpa,plk1,tpx2)were selected as the gene signatures.Figure 6a shows the optimization process of RFE algorithm.When the number of gene signatures is 5,the accuracy of SVM-RFE model is the highest.Consequently,we constructed a diagnostic gene model based on these 5 genes.
The constructed SVM classifier with 5 gene signatures was applied to the training set(n=391),internal validation set(n=241),and external validation set GSE32863(n=116).As shown in Figure 6b-d,the classifier could successfully differentiate normal samples and tumor samples in the 3 sets(AUC,area under curve;PPV,positive predictive value;NPV,negative predictive value).The training set generated AUC of 0.984,Accuracy of 0.936,and Recall of 0.897,the internal validation set generated AUC of 0.977,Accuracy of 0.900,and Recall of 0.835,and the external validation set GSE32863 generated AUC of 0.903,Accuracy of 0.828,and Recall of 0.800(Table 2).These results illustrate that the classification model based on the 5 genes could accurately predict the LUAD patients(Figure 6).
Fig.5 GO enrichment analysis
Fig.6 Construction and validation of the diagnostic model for LUAD
Table 2 Effectiveness evaluation of the classifier of 5 gene signatures on the three sets
According to the result of SVM-RFE algorithm analysis,5 genes(anln,cdca3,cenpa,plk1,tpx2)were obtained to explore their prognostic value among the patients of the GSE31210 set.The risk score(RS)model was constructed based on the coefficients(Table 3)of the 5 gene signatures andtheir expression levels in the GSE31210.The RS formula was shown in Equation(6).The concordance index of this model was 0.69,andP=1.717 4E-02(Figure 7a).
Table 3 Cox regression model of the 5 gene signatures in the GSE31210 set
Fig.7 Prognostic value of the 5 gene signatures in the GSE31210 set
The expression ofcdca3marked asExpcdca3and it is same to others.
The risk score of each sample was calculated for each LUAD patient,and all 226 patients were divided into two groups containing low-risk and high-risk groups.As shown in Figure 7b,compared with the high-risk group,patients with low-risk scores are demonstrated to have a greater chance of having the same survival time in the GSE31210 set.The AUC was 0.749 and 0.742 for the 3-year and 5-year OS by ROC analysis,respectively(Figure 7c),which indicated the good performance of RS model in survival prediction.Notably,the risk curves,living status of LUAD patients,and the 5 gene expression value associated with the risk score in the GSE31210 set were shown in Figure 7d-f.The mortality of highrisk group was significantly higher than that of lowrisk group.Besides,we put the 5 genes into independent survival curve analysis,and the result showed that the expression ofanln,cdca3,cenpa,plk1,andtpx2had a significant impact on the survival time of patients and displayed good prognostic significance(Figure 7g-k).
NSCLC is the most common type of lung cancer.LUAD is the most common subtype of NSCLC with an extremely high mortality rate[25].In recent years,the increasing studies of high-tech sequencing which illustrated the importance of gene signatures on determining cancer formation and outcomes provided a novel insight to integrate this bioinformatics into the therapeutic schedule[26-27].It is necessary for the diagnosis and treatment of LUAD to reveal the molecular mechanism of LUAD and screen out the gene signatures based on the genome information.In this study,295 overlapping genes were selected by comprehensively differential gene analysis and WGCNA analysis based on TCGA-LUAD database.As shown in functional annotation analysis,these genes were mainly enriched in the nuclear division,organelle fission,and chromosomal region,essential for tumorigenesis.According to the survival status of LUAD patients,39 prognostic genes were screened by univariate cox regression,and the top 15 hub genes were selected out from the prognostic genes through the MCC algorithm with the help of CytoHubba plugin in Cytoscape.The classification model was constructed based on the 15 hub genes using the SVMRFE algorithm for choosing the optimal gene signatures in the training set.Due to the results of the model accuracy,a classifier model with 5 gene signatures(anln,cdca3,cenpa,plk1,tpx2)was obtained and performed well in classifying LUAD samples in the training set,internal validation set,and external validation set GSE32863 through calculating AUC,Accuracy,Recall,PPV and NPV.Moreover,the RS model was constructed to validate the prognostic value of the 5 gene signatures.By applying these signatures to construct the RS model in the GSE31210 set,there were significant differences between the high-risk and low-risk groups,and the high-risk group has a lower survival rate than the low-risk group.Also,we found that these 5 genes had a high prognostic value in LUAD through the survival analysis of each gene signature.
The anillin actin binding protein(anln)gene is located on chromosome 7p14.2 and encodes a protein composed of 1 124 amino acids that contain four domains,including a RhoA-binding domain,a Cterminal pleckstrin homology domain,and myosinand actin-binding domain[28].Anlnplays an important role in the process of cell cycle in the assembly of actin and myosin contractile rings in separates daughter cells[29].Moreover,anillin is a substrate for the anaphase-promoting complex/cyclosome(APC/C),a ubiquitin ligase that controls the mitotic progression[30].In addition,the inheritance and mitotic proliferation of defective genome can cause pathological conditions including a variety of cancers[31].Aniline,a cell cycle regulator,has been proved to play a key role in tumor invasion[32-33].In our study,compared with normal samples,the expression ofanlnis up-regulated in tumor tissues,which is significantly correlated with LUAD.Mechanism ofanlnfunction showed that active cell division in tumor tissue results in higher levels ofanln,which is consistent with our findings.
Cenpa(centromere protein-A),a centromerespecific 17-ku protein,is a unique histone H3,likes the protein found in the active centromeres,relates to the major epigenetic tension of centromeric identity[34].Cenpaplays an important role in cell cycle regulation and cell survival[35].According to previous reports,whencenpaandcenpbwere knocked out at the same time,the inhibitory effect on cell proliferation was more significant[36].A recent study showed that high expression ofcenpawas closely associated with LUAD tumorigenesis proved by realtime polymerase chain reaction(RT-PCR)and Western blotting analysis[37].Therefore,inhibitors targetingcenpamay be a promising anticancer strategy.
Plk1(polo-like kinase 1),a member of the mitotic serine/threonine kinase family,is closely related to spindle formation and chromosome segregation during mitosis[38].Previous studies demonstrated thatplk1mRNA expression was elevated in proliferating cells including tumors of different origins and various cancer cell lines[39].Wanget al.[40]revealed that the overexpression of plk1 protein was an independent prognostic biomarker in NSCLC patients.In the present study,we further demonstrate that LUAD patients who express a high level of plk1 protein have low overall survival,and the high expression of plk1 protein is a prognostic factor validated by the RS model and survival analysis.
The targeting protein for Xenopus kinesin-like protein 2(tpx2),which is required for targeting Aurora-A kinase to the spindle apparatus,had been reported as gene signatures for human lung cancer prognosisin vitrolung carcinogenesis system[41-42].Liet al.[43]demonstrated thattpx2was a potential candidate targeted for amplification and overexpression in NSCLC.In our study,tpx2is mainly plays a role in the regulation of mitotic cell cycle phase transition and mitotic nuclear division.The survival time of samples with a hightpx2gene expression level is significantly lower than that with a lowtpx2gene expression level,which is consistent with previous studies.
Cell division cycle-associated protein-3(cdca3),is required for mitosis entry as a part of the SKP1-Cullin RING-F-box(SCF)ubiquitin ligase complex to degrade the endogenous cell cycle inhibitor Wee1[44].Some studies showed that unregulatedcdca3was associated with the carcinogenic process and malignant patterns of certain tumors[45-46],but the relationship between the gene and formation of LUAD has not been found.Cdca3may be a new potential target for NSCLC to inhibit tumor growth and promote tumor aging,which may play an important role in tumor cell proliferation.
In conclusion,we obtained 5 gene signatures of LUAD by bioinformatics and machine learning analysis methods.The diagnostic and prognostic models constructed by the 5 gene signatures could had an outstanding performance in predicting and prognostic among different sets.PPI network analysis and GO analysis confirmed that these genes were positively related to the tumorigenesis and development of LUAD.The combined application of multiple sets provided more robust supports for our research.Therefore,our study may provide new insight into the diagnosis and treatment of LUAD.