Qianqian Song, Jing Su, Lance D. Miller, Wei Zhang,*
1 Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC 27157, USA
2 Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC 27157, USA
3 Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
KEYWORDS Single-cell RNA sequencing;Consensus clustering;Latent space;Markov Chain Monte Carlo;Maximum likelihood approach
Abstract In gene expression profiling studies, including single-cell RNA sequencing (scRNA-seq)analyses,the identification and characterization of co-expressed genes provides critical information on cell identity and function. Gene co-expression clustering in scRNA-seq data presents certain challenges.We show that commonly used methods for single-cell data are not capable of identifying co-expressed genes accurately,and produce results that substantially limit biological expectations of co-expressed genes. Herein, we present single-cell Latent-variable Model (scLM), a gene coclustering algorithm tailored to single-cell data that performs well at detecting gene clusters with significant biologic context.Importantly,scLM can simultaneously cluster multiple single-cell datasets, i.e., consensus clustering, enabling users to leverage single-cell data from multiple sources for novel comparative analysis.scLM takes raw count data as input and preserves biological variation without being influenced by batch effects from multiple datasets.Results from both simulation data and experimental data demonstrate that scLM outperforms the existing methods with considerably improved accuracy. To illustrate the biological insights of scLM, we apply it to our in-house and public experimental scRNA-seq datasets.scLM identifies novel functional gene modules and refines cell states,which facilitates mechanism discovery and understanding of complex biosystems such as cancers. A user-friendly R package with all the key features of the scLM method is available at https://github.com/QSong-github/scLM.
Co-expressed genes work in concert in biological pathways and processes [1-3]. Such genes are involved in crucial biological activities including immune cell activation [4,5], cellular epithelial-mesenchymal transition(EMT)[6],and transcription factor-mediated gene regulatory networks and signaling pathways[7,8].Co-expression of genes based on similarities among their expression profiles, has been a primary way to unravel gene-gene relationships and facilitate functional annotation[9-12].Therefore,identification of co-expressed genes provides functional insights into underlying cellular and molecular mechanisms in normal and disease processes.
The recently developed single-cell RNA sequencing(scRNA-seq) technology provides high resolution of gene expression at the single-cell level[13],yet presents certain challenges for gene expression analysis [14]. In contrast to bulk RNA-seq, single-cell data have been shown to exhibit a characteristic negative binomial (NB) distribution pattern[15-18], wherein genes suffer from stochastic dropouts and over-dispersion problems. Dropouts, or genes that exhibit excessive zero values[19-21],represent a special type of missing value, which can be caused by low RNA input or stochastic expression fluctuation at the single-cell level. Over-dispersion relates to the substantially large cell-to-cell variability in gene expression profiles which likely arises from technical noise stemming from low input RNA and PCR amplification bias[22].
Rapid advances in scRNA-seq technologies have made it feasible to perform population-scale studies in which the transcriptome is measured for thousands of single cells from multiple samples or conditions [23-27]. This in turn has amplified the need for versatile gene co-expression approaches that not only address the unique challenges of scRNA-seq data, but also the challenges of dataset integration including batch effects, technical variations (e.g., mRNA quality and preamplification efficiency), and extrinsic biological variabilities.
Classical methods designed for analysis of bulk transcriptome data such as weighted gene co-expression network analysis(WGCNA)[28]and Clust[29]are not designed to account for the unique characteristics of scRNA-seq data. Some network-based approaches for single-cell data, including Single-Cell rEgulatory Network Inference and Clustering(SCENIC) [30], Cell Specific Network (CSN) [31], and Left Truncated Mixture Gaussian (LTMG) [32], can detect gene co-expression modules as part of the network reconstruction.However,these methods do not account for the technical noise and extrinsic variance among multiple samples. Therefore,there is a clear need to develop a tailored and effective method for scRNA-seq data to extract‘‘consensus”co-expressed genes[11], that is, to extract the genes that are consistently coexpressed in each of the multiple datasets.
Herein, we have developed a novel method, single-cell Latent-variable Model (scLM), to simultaneously extract coexpressed genes that exhibit consensus behaviors from multiple single-cell datasets. The scLM method accounts for both cell-level covariates and sample-level batch effects. We assessed the performance of scLM in both simulated data and experimental data. scLM achieves the best performance over other commonly used methods. We then applied scLM to our in-house scRNA-seq data generated from four nonsmall cell lung cancer (NSCLC) tumor tissues and their corresponding adjacent normal tissues.The scLM method identified tumor-specific co-expressed gene modules with significant prognostic values. Furthermore, these co-expression modules contributed to the subtle characterization of lung tumor cell states. In addition, we applied scLM to analyze a set of malignant cells from NSCLC, head and neck squamous cell carcinoma (HNSCC), and melanoma. We discovered a common co-expressed gene program across different cancer types, providing insights into fundamental mechanisms of carcinogenesis.
As multiple recent studies [15-18] showed that the expression of most genes in single-cell data is sufficiently captured by NB distribution, NB model is chosen as an appropriate model to formulate single-cell data. It is supported by the physical modeling of bursting gene expression [18,33] and is also commonly used in scRNA-seq analysis[15-18].Therefore,without loss of generality,we assumed that the measured gene expression xijkfor cell j in dataset k follows the NB distribution NB(p,γ), which has the probability function as:
which was used to distinguish the intrinsic biological signals zifrom the extrinsic variabilities(ajkand βjk)including the technical variances at the cell level(j)and batch effects at the sample level(k).That is,the uikis composed of the intrinsic biological signals of gene i captured by latent variables ziregardless of the confounding variabilities at the cell level and sample level,while variances due to technical biases and batch effects are captured by offsets ajkand scale factors βjk.Since ziis the same for specific gene i,and uikis estimated from observed counts xijk,we further turned the formula into
To alleviate the impact of extreme values, we utilized logarithm form in the linear model that has been frequently used[34-38] in scRNA-seq data, i.e., the GLM,
To estimate the parameters in our model,we used the maximum likelihood approach. As is assumed above that xijkfollows the NB distribution, the conditional log-likelihood function of xijkcan be written as:
For the latent variable zi, f(zi) represents the density function of the standard multivariate normal distribution N(0, Ik). Therefore, the joint log-likelihood of (xijk, zi) can be written as
To control model complexity and overfitting, we applied the least absolute shrinkage and selection operator (LASSO,L1-norm penalty), to the following penalized joint loglikelihood estimation:
Then the above parameters are estimated by maximizing the penalized joint log-likelihood function,that is,maximizing the following penalized joint log-likelihood,
where the summation is due to the conditional independence assumption of xijkgiven zi.
To estimate the parameters ajkand βjk, we solved the following optimization problem conditional on zi,
Figure 1 Schematic of scLM for identifying consensus co-expressed gene clusters across multiple datasets
here we used the coordinate descent algorithm provided in[39],therefore optimized the above log-likelihood function. Herein the update of parameters ajkand βjkdepend on zi.As the latent variables ziwere not observable in our model, we used the Markov Chain Monte Carlo (MCMC) simulation to iteratively update zi, for maximizing the penalized joint loglikelihood. That is, we replaced the value in the parameter updates by its expectation with respect to zi, through repeatedly sampling the latent variables zifrom the following joint posterior distribution, i.e.,
With the estimated latent variables zi,that is,with the genes projected into the latent space, we clustered genes that were projected in the latent space to identify co-expressed genes.Here we used K-means clustering to divide genes into k clusters based on the latent variables zi.The parameter k can be either determined according to the Bayesian information criterion(BIC), or chosen by user’s preference.
Based on the single-cell data characteristics, we used the NB distribution to simulate two synthetic cohorts(synthetic cohort 1 and synthetic cohort 2). Each synthetic cohort contained 9 sets of simulated gene expression data with an increasing number of datasets(D1-D9).That is,D1 contained one individual dataset(n=1),D2 contained two individual datasets(n=2),..., and D9 contained nine individual datasets (n = 9). Each individual dataset contained 180 genes belonging to three clusters, with 60 co-expressed genes in each of the three clusters.For each gene cluster c ∈{1,2,3}in batch n,their gene expression was sampled from the NB distribution NB(ucn,hcn),where ucnand hcnreferred to the mean and deviation, respectively.Different gene clusters had different values of ucnand hcn.Full expression values and cluster membership for these datasets were provided in the scLM example data.
Additionally,we utilized the Splatter package[40]to generate another two synthetic cohorts(synthetic cohort 3 and synthetic cohort 4) of simulated data with dropout effects, which more accurately recapitulated actual scRNA-seq data distributions. Specifically, we adjusted the batch parameters‘‘batch.facLoc” and ‘‘batch.facScale” as 1 and generated 16 different batches of data. Each batch consisted of 240 cells,and 240 genes constituting four groups of co-expressed genes as the group truth, which was achieved by adjusting the ‘‘de.prob” parameter. We also added the dropout effects in these simulation data by setting ‘‘experiment” for global dropout and the ‘‘dropout.mid” parameter. These 16 batches of data made up the synthetic cohorts 3 and 4. Full expression values were provided in the scLM example data.
In-house dataset
Fresh tumor and adjacent normal tissues from four NSCLC patients were collected by the Tumor Tissue and Pathology Shared Resource (TTPSR) into Miltenyi Tissue Storage Medium (Catalog No. 130-100-008, Miltenyi Biotec, San Diego, CA). Tissues were then processed to single-cell suspensions using the Miltenyi Human Tumor Dissociation Kit(Catalog No. 130-095-929, Miltenyi Biotec) and the gentleMACS Octo Dissociator with Heaters (Catalog No.130-096-427, Miltenyi Biotec). Red blood cells were removed by negative selection using Miltenyi CD235a (Glycophorin A) microbeads (Catalog No. 130-050-501, Miltenyi Biotec)and LS Columns (Catalog No. 130-042-401, Miltenyi Biotec).Recovered cell numbers were determined by trypan blue exclusion using a LUNA II automated cell counter (Catalog No.L40001, Logos Biosystems, Annandale, VA). In preparation for scRNA-seq, cells were thawed and washed according to the demonstrated protocol developed for human peripheral blood mononuclear cells (PBMCs) by 10X Genomics (San Francisco, CA).
All scRNA-seq procedures were performed by the Cancer Genomics Shared Resource (CGSR) of the Wake Forest Baptist Medical Center Comprehensive Cancer Center(WFBMC-CCC). Viable cells in suspensions were loaded into wells of a 10X Genomics Chromium Single Cell A Chip Kit(Catalog No.PN-120236,10X Genomics).Single-cell gel beads in emulsion (GEMs) were created on a Chromium Single Cell Controller and scRNA-seq libraries were prepared using the Chromium Single Cell 3′Library and Gel Bead Kit v2(Catalog No. PN-120237, 10X Genomics). Sequencing libraries were loaded at 1.3 PM on an Illumina NextSeq500 with a High Output 150 cycle Kit (Catalog No. FC-404-2002, Illumina, San Diego,CA)for paired-end sequencing.A total of 11,813 single cells were captured, with the number of cells recovered per channel ranging from 369 to 2502. Low-quality cells were discarded if the cell with expressed genes was smaller than 200.Only malignant cells from four tumor samples and epithelial cells from three adjacent normal samples were used in this study.The scRNA-seq data were deposited in the Gene Expression Omnibus (GEO) of National Center for Biotechnology Information (NCBI) database (GEO: GSE117570) at https://onlinelibrary.wiley.com/doi/full/10.1002/cam4.2113 [41].
Melanoma dataset
We downloaded the expression matrix data of melanoma from the GEO of NCBI database (GEO: GSE72056) at https://www.ncbi.nlm.nih.gov/pubmed/27124452 [42]. This dataset included expression profiles of 23,689 genes in 4645 cells from 19 melanoma tumors.These cells included both malignant cancer cells and non-malignant cells. For the input matrix of scLM, a sample is excluded if it contains < 200 cells, and a gene is excluded from the input matrix if it is expressed in< 300 cells.
HNSCC dataset
We downloaded the expression matrix data of the HNSCC dataset from the GEO of NCBI database (GEO:GSE103322) at https://www.sciencedirect.com/science/article/pii/S0092867417312709[6].This dataset consisted of 5902 cells from 18 patient samples after initial quality controls,including 2215 malignant and 3363 non-malignant cells. For our analyses, we used the samples with more than 200 malignant cells and genes expressed in over 300 cells as the input matrix.
Breast cancer dataset
We downloaded the expression matrix data of breast cancer(BR) scRNA-seq dataset from the GEO of NCBI database(GEO: GSE118390) at https://www.nature.com/articles/s41467-018-06052-0 [23]. For our analysis, we used malignant cells and genes expressed in over 300 cells as input.
Figure 2 Performance evaluation on simulation data
Each clustering result produced by different methods was assessed using clustering evaluation indices, including the Adjusted Rand Index (ARI) [43], the Calinski-Harabasz(CH) index [44], the Davies-Bouldin (DB) index [45], and the Dunn index [46]. CH index evaluated the cluster validity based on the average between- and within-cluster sum of squares. DB index was obtained by averaging all the cluster similarities. Dunn index used the minimum pairwise distance between objects in different clusters as the inter-cluster separation and the maximum diameter among all clusters as the intra-cluster compactness. Larger CH index, smaller DB index, and larger Dunn index represented better clustering results.
With the co-expressed gene modules,we utilized mean value of the modules in each single cell as input for graph-based clustering. Uniform manifold approximation and projection(UMAP)was used to visualize cell clusters.Graph-based clustering was performed using the Seurat package (v3.1), and UMAP analysis was performed using the ‘‘umap” package(v.0.2.3.1)[47]in R(v.3.4.3).The number of epochs(n_epochs)was set at 20.The n_neighbors value was set at 15,and min_dist was set as 0.1.
Kaplan-Meier (KM) analysis was performed using the ‘‘survival” R package (http://cran.r-project.org/web/packages/survival/index.html). Log-rank test was used to test the differences of survival curves. When evaluating the performance of scLM, P value was calculated by t-test.
Figure 3 Evaluation of scLM using experimental data
Hallmark collection
We downloaded the Hallmark gene set collection for functional analyses from Molecular Signatures Database(MSigDB) [48], which was a widely used and comprehensive database. Each hallmark in this collection consisted of a ‘‘refined”gene set that conveyed a specific biological state or process and displayed coherent expression. The hallmarks effectively summarized most of the relevant information of the original founder sets and, by reducing both variation and redundancy,provided more refined and concise inputs for gene set enrichment analysis.
Pathway database
Reactome (http://www.reactome.org) was a manually curated open-data resource of human pathways and reactions, which was an archive of biological processes and a tool for discovering potential functions. Gene sets derived from the Reactome[49]and Kyoto Encyclopedia of Genes and Genomes(KEGG)[50] pathway database were downloaded from the MSigDB Collections.
Enrichment test
Functional enrichment based on the Reactome and Gene Ontology(GO)databases was assessed by hypergeometric test,which was used to identify a priori-defined gene set that showed statistically significant differences between two given clusters. Enrichment test was performed by the clusterProfiler package [51]. Test P values were further adjusted by Benjamini-Hochberg correction, and adjusted P values less than 0.05 were considered statistically significant.
Figure 4 scLM identifies co-expressed genes with significantly enriched biological functions
To evaluate the performance of scLM, we benchmarked it against other methods, including LTMG [32], CSN [31],Seurat-wgcna, MNN-wgcna, and SCENIC [30]. Seuratwgcna and MNN-wgcna referred to the co-expression analysis using WGCNA [28], following the batch correction by Seurat[52] or MNN [53]. As SCENIC relied on the RcisTarget database that required real gene input,we omitted comparing with SCENIC on simulation data but still included it in the comparison on real single-cell data.
We first generated two synthetic data cohorts (synthetic cohorts 1 and 2)from NB distribution.Each cohort contained 9 sets(D1-D9)of simulated data with an increasing number of samples.That is,D1 contained one individual dataset(n=1),D2 contained two individuals of datasets (n = 2), and so on.Each set contained three co-expressed gene clusters as ground truth. Additionally, we utilized the Splatter package [40] to generate another two batches of simulated data (synthetic cohorts 3 and 4)with dropout effects,which could more accurately recapitulate actual scRNA-seq data distributions.Details of the simulation datasets were provided in the Method.
Figure 5 scLM identifies co-expressed gene modules that characterize subtle cell subpopulations
With the simulated data cohorts, we applied scLM and other methods (LTMG, CSN, Seurat-wgcna, and MNN-wgcna) to identify the co-expression clusters. To assess and quantify clustering accuracy, we used the ARI [43] as the performance metric to rank these methods (Figure 2). The corresponding bar plots represented the ARI of the identified clusters by each method compared to the ground truth.Notably, scLM accurately identified each gene cluster in four cohorts, and demonstrated much higher ARIs (mean ± SE:0.979±0.063 for synthetic cohort 1;0.971±0.031 for synthetic cohort 2; 0.899 ± 0.043 for synthetic cohort 3; 0.886 ± 0.025 for synthetic cohort 4). The other methods showed relatively lower ARIs. For example, CSN showed lower ARIs in synthetic cohort 1 (mean ± SE: 0.627 ± 0.028) and synthetic cohort 3 (mean ± SE: 0.520 ± 0.070). LTMG presented little higher ARIs and lower variances in four synthetic cohorts. These results demonstrated the outperformance of scLM in identifying accurate co-expressed genes from multiple datasets.
To further demonstrate the performance of scLM,we compared scLM with other methods (LTMG, CSN, Seurat-wgnca,MNN-wgcna,and SCENIC)on experimental scRNA-seq datasets.For comparisons,we used two in-house datasets from lung tumor and adjacent normal tissues as well as three public datasets from BR,HNSCC,and melanoma.The data pre-processing procedures were described in the Method.
To assess and quantify clustering accuracy on real datasets,we used performance metrics including the CH index [44],Dunn index [46], and DB index [45], to rank these methods.Importantly, scLM produced sets of clusters that showed significantly higher CH values than other methods (Figure 3A),especially higher than LTMG (P = 1.75E-07) and MNN-wgcna (P = 0.02), demonstrating that scLM achieved better cluster validity than other methods based on average between-and within-cluster sum of squares.In addition,compared to other methods, scLM also achieved significantly higher Dunn index scores representing better inter-cluster separation and intra-cluster compactness (Figure 3B), and lower DB index scores reflecting higher cluster quality (Figure 3C).Though SCENIC and Seurat-wgcna showed higher Dunn index score in one dataset(HNSCC),they failed to show superior performance on other datasets. Thus scLM proved to achieve the best partitioning of co-expressed gene clusters that are most distinct from each other.
As co-expressed genes were likely to be enriched with biological functions,we compared the extent to which different methods affected the functional discovery, based on their identified co-expressed genes. First, the aforementioned methods were evaluated for their capability to detect enriched GO terms in the five experimental datasets. Different methods identified gene clusters enriched with different GO enrichment results.The average number of significantly enriched GO terms (adjusted P value < 0.05) ranged from 15 to 184 (Figure 4A).scLM extracted co-expressed genes with more enriched functional terms than other methods in three of the five datasets(i.e., the BR, HNSCC, and lung Normal datasets). SCENIC identified relatively high number of enriched GO terms in the BR dataset, whereas low number of enriched GO terms in other datasets; CSN identified relatively high number of enriched GO terms in the lung Tumor dataset but low number of enriched GO terms in other datasets; MNN-wgcna identified relatively high number of enriched GO terms in the melanoma dataset but low number of enriched GO terms in other datasets. LTMG and Seurat-wgcna showed lower number of enriched GO terms in five datasets. Similar results were observed when we strengthened the enriched significance by the adjusted P value<0.01(Figure S1).The number of significant terms became fewer for all the methods, yet scLM identified the most on all datasets except for the melanoma dataset.Some methods, like LTMG, failed to identify gene clusters with enriched terms at the threshold of adjusted P value< 0.01.
In addition to GO terms, we also examined the enriched pathways in the Reactome database, based on the coexpressed genes identified by different methods (Figure 4B).Different methods showed different pathway enrichment results. Importantly, scLM identified co-expressed genes with more enriched pathways than other methods in three of the five datasets (i.e., the BR, HNSCC, and lung Normal datasets). Taken together, these results demonstrated that scLM outperforms other methods in functional discovery of co-expressed genes.
In real-world scenarios,samples from different patients or different data sources often demonstrated highly different cell numbers, largely due to strong batch effects and technical issues.The scLM method was designed to address such highly unbalanced data that outperformed other competitors on such datasets. To validate the effectiveness of scLM, we intentionally selected patient samples that varied with respect to cell number, which could create challenges for this method. As a case study, we used scLM to analyze our in-house scRNA-seq profiling from 4 NSCLC patients (P1-P4) [41] to identify the co-expressed genes in tumor and normal epithelial cells,respectively.In tumor cells(Figure 5A,heatmap of latent variables), we discovered 12 co-expressed gene modules in the latent space (T-m1-T-m12). These modules showed clear differences but were consistently concordant across patients(Figure 5A, heatmaps of P1-P4), even though the single cells from different patients presented strong heterogeneity and batch effects (Figure 5B, left panel).
Using the 12 co-expression modules, the single cells were separated into two major clusters. In each cluster, cells from different patients mixed well without interference from batch effects (Figure 5B, right panel), which further supported that the co-expression modules were consistent across patients and not affected by batch effects. Interestingly, we found that cluster 1 had higher expression of epithelial functional markers(EMT-related genes) than cluster 2 (Figure 5C). These results indicated that co-expression modules were capable of characterizing specific cell phenotypes.
Similarly, in normal single cells, we observed 13 coexpressed gene modules (N-m1-N-m13) that showed concordant expression across individual patients (Figure S2). Then we compared the co-expressed gene modules identified from tumor and normal cells. Four modules (T-m1, T-m3, T-m4,and T-m10) were not correlated with any normal modules,suggesting that they were tumor-specific (File S1; Figure S3).
To explore the underlying mechanism of carcinogenesis, we next extended the application of scLM to HNSCC and melanoma.In addition to the 12 co-expressed gene modules identified in NSCLC, we identified 11 modules in HNSCC and 14 modules in melanoma. To determine the similarities of these co-expression modules, we performed a pair-wise comparison using weighted Jaccard similarity, followed by hierarchical clustering. As shown in the diagram (Figure S4A), we found that most branches were dominated by a mixture of cancer types.Importantly, we identified a branch with high similarity among T-m9, HNSCC-m7, and Melanoma-m12 modules.
These three similar modules substantially overlapped with 91 genes, which were defined as a common program across three cancer types. To gain insights into the biological functions of the common program,we performed enrichment analysis in the Hallmark database(Figure S4B).The MYC targets v1 and hypoxia were the top enriched terms, involving the genes FOS, GAPDH, HLA-A, and NFKB1A, which suggested the common oncogenesis mechanism regardless of cancer types(File S1).
From the applications of scLM, we see three meaningful use of scLM to scientific research. 1) scLM identifies coexpressed genes that reveal novel biological processes. An example is the lung tumor-specific module that highlights cell-cell communication in tumor microenvironment (File S1;Figure S3). 2) scLM contributes to the subtle characterization of cell states. In lung cancer, scLM identifies 12 co-expression modules that are consistent across patients. These coexpression modules separate cells into two major clusters, of which one cluster presents different EMT activity suggesting more precise characterization of cell states (Figure 5). 3) With the co-expressed genes identified by scLM, both specific and common gene modules can further be explored for their translational and biological relevance. For example, in melanoma,scLM identifies two co-expressed gene modules that are associated with immune checkpoint inhibitor (ICI) resistance,which provides potential value for predicting ICI therapy response. We also find a common co-expression module from three different cancer types, and reveal the MYC targets and hypoxia as the common intrinsic mechanisms of tumor malignancy (File S1; Figure S4).
Given the merits of scLM,several potential limitations warrant further study. First, zero-inflated genes are excluded during pre-processing.The main reason is that,genes with inflated zeros are not informative and have negligible meaningful contribution to co-expression. The other reason is, with the fast advance of scRNA-seq technology, zero-inflation issue will be very minimal in near future.Second,in future work,we will examine the necessity of providing zero-inflation models,which specifically deal with data of poor sequencing depth and strong dropout effects. Third, the computational cost of scLM can be further reduced.We have already utilized C programming and parallel computing to dramatically boost the efficiency of scLM. However, considering that scRNA-seq data are growing into million-cell level, we will explore the use of GPU computing and cloud-based approaches to catch up with the scale of future scRNA-seq data.
Co-expressed genes with coordinate expression indicate functional linkages between genes. Genes with coordinate biological functions are frequently co-transcribed, resulting in coexpression profiles. Thus, co-expressed genes can be used to intuitively associate genes with biological processes, to reveal disease-related genes,and to discern transcriptional regulatory mechanisms.Accumulative evidence supports the reliability of co-expression analysis for annotating and inferring gene functions [1-3,9-12]. Recent advances in scRNA-seq technologies enable the systematic interrogation of gene co-expression modules in specific cell types,and the elucidation of the underlying biological mechanisms [13,54,55]. The improved data resolution and quality allow accurate identification of diseaserelated modules and regulatory genes for specific cell types and specific tissues. Thus, we expect co-expression analysis to be more widely applied due to the technology advances.
In this study,we introduce a novel method,scLM,to simultaneously identify co-expressed genes across multiple singlecell datasets.The scLM algorithm uses the conditional NB distribution with latent variables to disentangle co-expression patterns across multiple datasets. To our knowledge, scLM is the first available tool that is capable of leveraging multiple scRNA-seq datasets to accurately detect co-expressed genes.We provide an overview of scLM and illustrate how scLM can be used to further characterize cell states and identify tumor-specific modules in lung cancer. We demonstrate that the tumor-specific modules are enriched in pathways,including cell-cell communication and SMAD2/3/4 transcriptional activity, with identified upstream transcriptional factors including TEAD1 and FOXA1. We further show the clinical prognostic significance of these discoveries in clinical samples. Moreover,we explore the common co-expressed genes, i.e., the common module, across three cancer types and offer intrinsic mechanisms of tumor malignancy. The common module is highly enriched in the MYC targets v1 and hypoxia, suggesting the presence of common intrinsic oncogenesis mechanisms.Additionally, the common module is shown to be related with clinical response to ICI in melanoma patients, suggesting that the common module provides predictive value of ICI therapy response.
Compared with other methods, scLM has several key advantages:1)scLM accounts for data heterogeneity and variances among multiple datasets,such as unbalanced sequencing depths and technical biases in library preparation. 2) scLM leverages information across datasets for detecting stable and conserved co-expression modules with high accuracy and reproducibility. 3) scLM is an integrated pipeline that uses raw count matrix without prior batch-correction as input,thus can be easily applied to scRNA-seq data.Overall,scLM opens possibilities for further investigation and mechanistic interpretation of co-expressed genes. With the growing scRNA-seq data,scLM is poised to become a valuable tool for elucidating co-expression studies in single-cell transcriptomics.
For our in-house data,fresh remnant tumor and adjacent normal tissues were collected at the time of elective curative resection by the Tumor Tissue and Pathology Shared Resource(TTPSR)of the Wake Forest Baptist Medical Center Comprehensive Cancer Center (WFBMC-CCC), USA. Collections by the TTPSR adhere to Institutional Review Board approved procedures (Advanced Tumor Bank protocol CCCWFU 01403, TTPSR collections IRB BG04-104 which also allows for the use of de-identified protected health information along with the tissue samples). Acquisition of de-identified samples from the TTPSR for single-cell isolation and research use was in accordance with approved IRB protocol 00048977.All patients provided written consent to participate in the study, which was approved by the institutional reviewing board.
A user-friendly R package with all the key features of the scLM method is available at https://github.com/QSonggithub/scLM.
Qianqian Song: Data curation, Methodology, Software,Writing - original draft. Jing Su: Methodology, Software,Validation, Writing - review & editing. Lance D. Miller:Investigation, Writing - review & editing. Wei Zhang:Conceptualization, Supervision. All authors read and approved the final manuscript.
The authors have declared no competing interests.
The authors thank Elizabeth Forbes and Karen Klein for professional editing service. This work was supported in part by the Cancer Genomics, Tumor Tissue Repository, and Bioinformatics Shared Resources under the NCI Cancer Center Support Grant to the Comprehensive Cancer Center of Wake Forest University Health Sciences, USA (Grant No.P30CA012197).WZ is supported by the Hanes and Willis Professorship in Cancer, USA. Additional support for QS and WZ are provided by a Fellowship to WZ from the National Foundation for Cancer Research, USA. JS is supported by Indiana University Precision Health Initiative, USA.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.09.002.
0000-0002-4455-5302 (Qianqian Song)
0000-0003-4917-6173 (Jing Su)
0000-0003-3799-2528 (Lance D. Miller)
0000-0002-2235-1851 (Wei Zhang)
Genomics,Proteomics & Bioinformatics2021年2期