kLDM: Inferring Multiple Metagenomic Association Networks Based on the Variation of Environmental Factors

2021-06-07 07:44:40 Yuqing Yang, Xin Wang, Kaikun Xie, Congmin Zhu, Ning Chen, Ting Chen
Genomics, Proteomics & Bioinformatics, 2021, Issue 5

Yuqing Yang, Xin Wang, Kaikun Xie, Congmin Zhu 6, Ning Chen *, Ting Chen 3,6,*

1 Department of Computer Science and Technology and Institute of Artificial Intelligence, Tsinghua University, Beijing 100084, China

2 Sogou Inc., Beijing 100084, China

3 Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350122, China

4 Peking Union Medical College, Chinese Academy of Medical Science, Beijing 100005, China

5 Department of Ultrasound, Peking Union Medical College Hospital, Beijing 100005, China

6 Beijing National Research Center for Information Science and Technology, Beijing 100084, China

KEYWORDS Metagenomics; Association inference; Environmental condition; Bayesian model; Clustering

Abstract Identification of significant biological relationships or patterns is central to many metagenomic studies. Methods that estimate association networks have been proposed for this purpose; however, they assume that associations are static, neglecting the fact that relationships in a microbial ecosystem may vary with changes in environmental factors (EFs), which can result in inaccurate estimations. Therefore, in this study, we propose a computational model, called the k-Lognormal-Dirichlet-Multinomial (kLDM) model, which estimates multiple association networks that correspond to specific environmental conditions, and simultaneously infers microbe–microbe and EF–microbe associations for each network. The effectiveness of the kLDM model was demonstrated on synthetic data, a colorectal cancer (CRC) dataset, the Tara Oceans dataset, and the American Gut Project dataset. The results revealed that the widely used Spearman’s rank correlation coefficient method performed much worse than the other methods, indicating the importance of separating samples by environmental conditions. Cancer fecal samples were then compared with cancer-free samples, and the estimation achieved by kLDM exhibited fewer associations among microbes but stronger associations between specific bacteria, especially five CRC-associated operational taxonomic units, indicating gut microbe translocation in cancer patients. Some EF-dependent associations were then found within a marine eukaryotic community. Finally, the gut microbial heterogeneity of inflammatory bowel disease patients was detected. These results demonstrate that kLDM can elucidate the complex associations within microbial ecosystems. The kLDM program, R, and Python scripts, together with all experimental datasets, are accessible at https://github.com/tinglab/kLDM.git.

Introduction

Microbes interact closely with the human body and the environments in which humans live [1–3]. Metagenomic high-throughput sequencing technology plays a vital role in the study of microbes that inhabit various human body sites and different natural ecological environments. Computational tools have been developed to analyze the microbial constituents of microbiota and the interactions within a microbial community, specifically the interactions among microbes and those between microbes and environmental factors (EFs). EFs are known to be associated with variations in the abundance of microbes, and include human disease states, host genotypes, values of physiological and biochemical indicators, quantified lifestyle factors, and nutrient concentrations [4–8]. Recently, there has been a rapid increase in large-scale metagenomic studies aiming to discover biological interactions [9,10], specifically how microbes interact with other microbes and EFs.

Biological interactions can be classified as positive or negative relationships among microbes and between EFs and microbes. Commensalism among microbes and microbial dependency on EFs are positive relationships, while parasitism, competition among microbes, and the inhibition of microbes by EFs are negative relationships [11]. Such relationships can be inferred indirectly by predicting the variation patterns of microbial counts and EF values using association inference methods. Furthermore, interactions among microbes and between EFs and microbes depend on the conditions of EFs (named EF conditions), which are defined as specific ranges of EF values in this study. Under similar EF conditions, interactions in the microbial community are generally stable, but they can change dynamically with the alteration of the EF conditions (Figure 1A). For example, marine microbial interactions may vary by season and depth [12], and interactions among human gut microbes can change according to the onset of diseases and disease states [4,13]. In these two examples, the environmental conditions of the marine microbial communities are specific seasons and ocean depths, and those of the human gut microbes are particular diseases and disease states. In summary, biological interactions are dynamic, a phenomenon that is considered in the association inference approach utilized in this work.

However, in some studies, environmental conditions are neither obvious nor known in advance. For example, the American Gut Project [14] collected information on dozens of human lifestyle factors related to individuals’ diets and living habits, as well as metagenomic sequencing data for thousands of individuals. However, it is unclear which subsets of these individuals belong to the same environmental conditions, and therefore which individuals would, by definition, have similar values of some or all of these lifestyle factors, and would also share identical microbial interactions. Besides, even if hosts’ disease states have been obtained, grouping samples into cases and controls while ignoring other EFs may lead to inaccurate analysis results. The underlying patterns of microbial compositions and associations may be very complex and consist of many subgroups that align with neither the cases nor the controls. These problems call for novel computational methods with which to discover potential environmental conditions and infer association networks.

In many previous studies, due to the limited numbers of sequencing samples, association inference methods have estimated static association networks using all the samples [15–20]. These methods can be classified into two categories: methods that compute pairwise associations independently and methods that estimate multiple associations simultaneously. The widely used Pearson’s correlation coefficient (PCC) and Spearman’s rank correlation coefficient (SCC) methods belong to the former category, as does local similarity association (LSA) [19]. In contrast, CCREPE [16], SparCC [15], SPIEC-EASI [20], CCLasso [18], and mLDM [17] belong to the latter category. The latter methods consider the compositional bias [21] caused by the normalization of microbial read counts, in which read counts are divided by the total sum of read counts to determine the relative abundance. The relative abundance of the jth operational taxonomic unit (OTU) is defined as $\tilde{x}_j = x_j / \sum_{p=1}^{P} x_p$, where $x_j$ is the read count of the jth OTU and $P$ is the number of OTUs, so that $\sum_{j=1}^{P} \tilde{x}_j = 1$. It is clear that this commonly used normalization process [22] introduces dependency into microbial relative abundances, and, as such, it reduces the efficiency of association inference (Figure 1D). CCREPE, SparCC, and CCLasso estimate OTU–OTU (microbe–microbe) correlations by computing the covariance among microbes, while SPIEC-EASI and mLDM infer conditionally dependent OTU–OTU associations by estimating the precision matrix among microbes. With the exception of mLDM, none of these methods considers associations between EFs and microbes. Taking both the compositional bias and the large variance of read counts into consideration, the method proposed in our previous work, mLDM [17], estimates both OTU–OTU and EF–OTU associations more accurately by removing indirect associations among microbes induced by common EFs (Figure 1E).
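The dependency introduced by total-sum normalization can be illustrated with a small simulation. The following is only a sketch and is not part of kLDM: the function name, the Poisson counts, and the sample size are assumptions chosen for illustration. Three OTUs are simulated with independent read counts, and the correlation between two of them is computed before and after normalization.

```python
import numpy as np

def relative_abundance(counts):
    """Total-sum normalization: divide each sample's read counts by its total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Toy data (made-up numbers): 3 OTUs with independently generated read counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[200, 200, 200], size=(2000, 3))
rel = relative_abundance(counts)

# Every row of relative abundances sums to one, so the OTUs can no
# longer vary independently of each other.
row_sums = rel.sum(axis=1)

# Correlation between OTU 0 and OTU 1 before and after normalization.
corr_counts = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
corr_rel = np.corrcoef(rel[:, 0], rel[:, 1])[0, 1]
print(f"raw counts: {corr_counts:.3f}, relative abundances: {corr_rel:.3f}")
```

In this toy setting the raw counts are essentially uncorrelated, whereas the relative abundances become clearly negatively correlated, which is the kind of spurious association that compositionality-aware methods try to avoid.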

However, all the methods noted previously assume only a single biological network, neglecting that biological interactions can differ as EFs vary. In this case, if samples from two different environmental conditions are combined to infer associations, the results will reflect the change of environmental conditions rather than real associations under these two environmental conditions; this may lead to false associations and conclusions. For example, association networks of gut microbes in patients with liver cirrhosis and healthy individuals are distinctly different [23], and to predict gut microbial interactions of disease patients, samples from the healthy population should be excluded.
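The effect of pooling samples from two environmental conditions can be sketched with simulated data. This is a hypothetical illustration rather than an analysis from the study: the two "microbes" are drawn independently within each condition, and only their mean levels differ between conditions. The `spearman` helper is a simple rank-based implementation that assumes tie-free data.

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation for tie-free data (Pearson correlation of ranks)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

rng = np.random.default_rng(1)
n = 500
# Condition A: two microbes with independent abundances around a low level.
cond_a = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
# Condition B: both microbes are more abundant, but still mutually independent.
cond_b = rng.normal(loc=5.0, scale=1.0, size=(n, 2))
pooled = np.vstack([cond_a, cond_b])

scc_a = spearman(cond_a[:, 0], cond_a[:, 1])
scc_b = spearman(cond_b[:, 0], cond_b[:, 1])
scc_pooled = spearman(pooled[:, 0], pooled[:, 1])
print(f"condition A: {scc_a:.2f}, condition B: {scc_b:.2f}, pooled: {scc_pooled:.2f}")
```

Within each condition the correlation is close to zero, while the pooled correlation is strongly positive; the pooled value reflects the shift between conditions rather than any association inside either condition.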

To estimate OTU–OTU associations and EF–OTU associations conditional on environmental conditions, a new hierarchical Bayesian model, the k-Lognormal-Dirichlet-Multinomial (kLDM) model, is proposed in this work (Figure 1). kLDM can automatically determine the number of EF conditions and simultaneously infer complex associations. kLDM considers the compositional bias and large variance of read counts under every EF condition, and estimates both conditionally dependent OTU–OTU associations and direct EF–OTU associations. Most significantly, associations that vary according to environmental conditions can be elaborated by kLDM. It should be noted that the direct or conditionally dependent associations are mathematical concepts that approximate biological interactions, rather than indicate direct biological interactions or causal relationships. In addition, considering that the sequencing of marker genes, such as 16S rRNA / 18S rRNA genes, is widely used to investigate microbial compositions of samples, kLDM is designed for data produced by such marker gene sequencing technology. To the best of our knowledge, kLDM is the first method that infers multiple association networks based on the variation of EF values. The efficiency and robustness of kLDM are first validated on synthetic datasets via comparison with state-of-the-art approaches. It is then applied to a colorectal cancer (CRC) dataset to demonstrate its ability to find well-defined EF conditions. Its applications on the Tara Oceans dataset and the American Gut Project dataset are also explored to discover potential EF conditions and novel biological relationships.

Method

Materials used for kLDM evaluation

Gut microbial samples from a CRC study

The OTU table and metadata of metagenomic gut 16S rRNA sequencing data of healthy individuals and patients with CRC were obtained from Baxter et al. [13]. A dataset was constructed, which consisted of 5 known CRC-associated OTUs (Peptostreptococcus, Parvimonas, Fusobacterium, Porphyromonas, and Prevotella), 112 common OTUs observed in more than half of all 490 samples, and 4 EFs [including the fecal immunochemical test (FIT) result, age, gender, and the diagnostic state of each donor]. The “FIT” result referred to the positive or negative result of FIT. The recorded diagnostic states included “Normal”, “High-risk Normal”, “Adenoma”, “Advanced Adenoma”, and “Cancer”, all of which were determined by colonoscopy examination and the review of biopsies. The diagnostic states were modeled as five numerical values from 1 to 5, with a higher value representing a more serious disease state.

Tara Oceans eukaryotic dataset

In the study of the Tara Oceans project [24], read counts of marine eukaryotic OTUs were obtained by sequencing the V9 region of 18S rRNA genes. This project also included 91 genus-level matched eukaryotic symbiotic interactions. After filtering out the OTUs that existed in < 40% of samples and removing samples with missing EFs or abnormal counts, a subset of 221 samples with 67 OTUs related to genus-level symbiotic interactions and 17 EFs was obtained. The 17 EFs included the depth of water, chlorophyll maximum, maximum Brunt–Väisälä frequency, maximum dissolved oxygen, minimum dissolved oxygen, salinity concentration, oxygen concentration, phosphate concentration, silica concentration, chlorophyll concentration, temperature, sunshine duration, moon phase, maximum Lyapunov exponent, residence time, latitude, and longitude.

16S rRNA sequencing samples from the American Gut Project

The OTU table and metadata from the American Gut Project [14] were downloaded from the FTP site (ftp://ftp.microbio.me/AumericanGut), and 22 metadata fields regarding individuals’ diets and living habits were selected. Among these, 3 factors were associated with living habits (alcohol, exercise, and smoking frequency), and the remaining 19 factors were related to diets (frequencies of the consumption of fermented plants, frozen desserts, fruit, high-fat red meat, home-cooked meals, meat, eggs, milk cheese, milk substitutes, olive oil, probiotics, red meat, salted snacks, seafood, vegetables, vitamin D supplements, vitamin B supplements, whole grains, and whole eggs). The values of the metadata can be categorized as “Never”, “Rarely (less than once/week)”, “Occasionally (1–2 times/week)”, “Regularly (3–5 times/week)”, or “Daily”. For convenience, these categories were recoded into integers from 1 to 5 according to their frequencies. Samples with large (first 1%) or small (last 2%) numbers of reads, as well as those with evenness < 2, were removed. OTUs that existed in < 20% of samples and with an average size of < 50% were filtered out. Finally, 11,946 samples with 216 OTUs and 22 EFs were obtained. The Python script to process the OTU “.biom” file can be found on GitHub (https://github.com/tinglab/kLDM.git). For every sample, information on individuals’ disease statuses was also recorded, and included cardiovascular disease, small intestinal bacterial overgrowth, mental illness, lactose intolerance, diabetes, inflammatory bowel disease (IBD), irritable bowel syndrome, Clostridium difficile infection, cancer, and obesity.
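Two of the preprocessing steps described above, recoding frequency categories into integers and filtering OTUs by prevalence, can be sketched as follows. This is not the authors' processing script; the function names, the exact category strings, the toy count matrix, and the 0.3 threshold used in the example are assumptions made for illustration only.

```python
import numpy as np

# Assumed category labels, taken from the description in the text.
FREQUENCY_CODES = {
    "Never": 1,
    "Rarely (less than once/week)": 2,
    "Occasionally (1–2 times/week)": 3,
    "Regularly (3–5 times/week)": 4,
    "Daily": 5,
}

def recode_frequency(values):
    """Map frequency categories to integers 1..5 (higher means more frequent)."""
    return [FREQUENCY_CODES[v] for v in values]

def filter_otus_by_prevalence(counts, min_prevalence):
    """Keep OTUs (columns) observed in at least `min_prevalence` of samples (rows)."""
    counts = np.asarray(counts)
    prevalence = (counts > 0).mean(axis=0)
    keep = prevalence >= min_prevalence
    return counts[:, keep], keep

# Toy OTU table (made-up numbers): 5 samples x 3 OTUs.
toy_counts = np.array([
    [5, 0, 3],
    [2, 0, 0],
    [7, 1, 0],
    [1, 0, 0],
    [4, 0, 2],
])
filtered, keep = filter_otus_by_prevalence(toy_counts, min_prevalence=0.3)
codes = recode_frequency(["Never", "Daily"])
```

In the toy table the second OTU is present in only one of five samples and is therefore dropped at the illustrative 0.3 threshold; the study itself reports a 20% prevalence cut-off for this dataset.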

Assumption of the kLDM hierarchical Bayesian model

The kLDM model assumes that interactions among microbes are regulated by EFs and tend to be constant when environmental conditions do not change, but may vary due to changes of environmental conditions (Figure 1A). In other words, kLDM accounts for variation in the values of EFs. Under the same environmental condition, EF values may fluctuate within a small range, core microbes stay the same, and their associations remain stable. However, for different environmental conditions, EF values, microbial species, and their associations can be quite different. In addition, the distribution of environmental conditions may be continuous and complex [25]. Thus, kLDM uses mixtures of multiple clusters with known distributions to approximate and capture patterns of environmental conditions, with each cluster representing one environmental condition.

Figure 2 Comparison of the performance of kLDM with other methods on synthetic data

The hierarchical structure of the kLDM model

Sparse association inference of kLDM in theory

The generative model can be solved theoretically via the expectation–maximization (EM) algorithm and maximum a posteriori (MAP) estimation for the latent variable. It is assumed that the microbe–microbe and EF–microbe associations are sparse and can be inferred by kLDM with sparsity constraints. Detailed equations of inference and optimization can be found in File S1.

However, two potential problems confront this theoretical sparse association inference, thereby limiting the practicality of the model. First, results are very sensitive to the initialization of parameters because the EM algorithm can converge to a local optimum. Second, estimating the number of EF conditions is very time-consuming. Therefore, more effective approaches were explored and an efficient split-merge algorithm was ultimately adopted, as subsequently detailed.

Implementation of a split-merge algorithm for kLDM

kLDM adopts a split-merge algorithm to estimate the number of EF conditions and sparse OTU–OTU and EF–OTU associations under each EF condition [28]. First, samples are partitioned into fine-grained clusters using the values of EFs, such that samples within a cluster can be regarded as belonging to the same environmental condition. Second, because this partition is not perfect, multiple clusters are merged into one if they share similar environmental conditions and association networks. The final output is a set of sample groups, each with distinct predicted association networks.

More specifically, the split process starts from a single cluster with all samples, and then iteratively selects a cluster and partitions it into two clusters until the number of samples in each cluster is smaller than a pre-defined threshold $N_{\min}$. This process results in the construction of a binary tree for EFs, each node of which corresponds to one cluster of the samples, and each leaf node of which corresponds to a set of samples with similar values of EFs. It is assumed that the EF vector $m_i$ follows the multivariate Gaussian distribution, and when a cluster is split in two, a mixture of two Gaussian components with means $\mu_j$ and covariance matrices $\Sigma_j$ ($j = 1, 2$) is used to model the distribution of the EFs of the cluster as follows:

$$p(m_i) = \pi_1 \, \mathcal{N}(m_i \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(m_i \mid \mu_2, \Sigma_2),$$

where $\pi_1 + \pi_2 = 1$ and $\pi_j$ ($j = 1, 2$) is the weight of the jth component. After estimating these parameters via the EM algorithm, each sample of the original cluster is re-assigned to whichever of the two new clusters has the larger posterior probability.
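A single split step of this kind can be sketched with a plain NumPy implementation of EM for a two-component Gaussian mixture. This is a minimal illustration under stated assumptions, not the kLDM implementation: the function names, the median-split initialization, the ridge term added to the covariances, and the fixed number of iterations are all choices made here for simplicity.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at each row of x."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, (x - mean).T)          # shape (d, n)
    maha = np.sum(sol ** 2, axis=0)                     # squared Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def split_cluster(ef, n_iter=50, ridge=1e-6):
    """Fit a two-component Gaussian mixture to EF vectors with EM and assign
    each sample to the component with the larger posterior probability."""
    ef = np.asarray(ef, dtype=float)
    n, d = ef.shape
    # Initialization (an assumption of this sketch): median split along
    # the EF with the largest variance.
    k = np.argmax(ef.var(axis=0))
    init = (ef[:, k] > np.median(ef[:, k])).astype(int)
    resp = np.eye(2)[init]                              # one-hot responsibilities (n, 2)
    for _ in range(n_iter):
        # M-step: update mixture weights, means, and covariances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ ef) / nk[:, None]
        covs = []
        for j in range(2):
            diff = ef - means[j]
            covs.append((resp[:, j, None] * diff).T @ diff / nk[j] + ridge * np.eye(d))
        # E-step: posterior probability of each component for each sample.
        log_r = np.stack(
            [np.log(weights[j]) + gaussian_logpdf(ef, means[j], covs[j]) for j in range(2)],
            axis=1,
        )
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), weights, means

# Toy EF data (made-up): two well-separated groups of 100 samples with 2 EFs.
rng = np.random.default_rng(0)
ef_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(6.0, 1.0, size=(100, 2)),
])
labels, weights, means = split_cluster(ef_demo)
```

In the split process described above, a step like this would be applied recursively to the resulting clusters until each contains fewer samples than the threshold; that recursion and the stopping rule are omitted here.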

Association networks are then estimated for the leaves via mLDM [17], which is similar to kLDM when the number of EF conditions is set to one (K = 1). The mLDM algorithm was re-implemented with C++ and OpenMP to improve its stability and efficiency, and a comparison of the running time and memory usage by kLDM and mLDM on a single association network inference is presented in Table S1. Inferred associations were used for the merge process.

The merge process aims to recover clusters that are partitioned into multiple leaf nodes as a result of the greedy approach of the split process. The merge process adopts a bottom-to-top strategy, starting from internal nodes at the lowest level and traversing up to the root, to identify leaf nodes for merging. For each internal node, the algorithm traverses down to its left and right branches to search for leaf nodes. The two leaf nodes with the smallest inter-cluster distance, as measured by the Euclidean distance between the mean values of EFs, are merged. kLDM estimates the associations and the extended Bayesian information criterion (EBIC) score for the merged cluster; if its EBIC score is less than the sum of the EBIC scores of the two leaves, then the merged cluster is kept and replaces the leaf with the closer EF mean values, while the other leaf is discarded. Otherwise, the merged cluster is discarded. This split-merge process reduces the time complexity by limiting the operation at each step to partitioning one large cluster into two, or to merging two nearby clusters into one. The algorithm can run in parallel, making kLDM very efficient in practice.
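The pair selection and the acceptance rule of the merge step can be sketched as two small functions. This is a simplified illustration, not the kLDM code: the function names are hypothetical, the EBIC scores are assumed to have been computed elsewhere (the EBIC computation itself is not shown), and the numbers in the example are made up.

```python
import numpy as np

def closest_leaf_pair(left_means, right_means):
    """Return the indices (i, j) of the left/right leaves whose EF mean
    vectors are closest in Euclidean distance, together with that distance."""
    left = np.asarray(left_means, dtype=float)
    right = np.asarray(right_means, dtype=float)
    # Pairwise distances between every left leaf and every right leaf.
    dist = np.linalg.norm(left[:, None, :] - right[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return int(i), int(j), float(dist[i, j])

def should_merge(ebic_merged, ebic_left, ebic_right):
    """Keep the merged cluster only if its EBIC score is lower than the
    sum of the EBIC scores of the two leaves."""
    return ebic_merged < ebic_left + ebic_right

# Toy EF mean vectors for leaves under the left and right branches of a node.
left_means = [[0.0, 0.0], [1.0, 1.0]]
right_means = [[5.0, 5.0], [1.5, 1.0]]
i, j, dist_ij = closest_leaf_pair(left_means, right_means)

# Hypothetical EBIC scores for the merged cluster and the two leaves.
accept = should_merge(ebic_merged=100.0, ebic_left=60.0, ebic_right=55.0)
reject = should_merge(ebic_merged=120.0, ebic_left=60.0, ebic_right=55.0)
```

Separating the distance-based candidate selection from the EBIC-based decision mirrors the description above, where only one nearby pair is evaluated per step rather than all possible merges.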

Synthetic data generation process

Synthetic datasets were generated by specifying the numbers of microbes, EFs, and clusters, as well as the range of the number of samples per cluster. Samples were then separately constructed for every cluster via the generative process of the kLDM model. For the ith cluster (i = 1, 2, ..., K), the EF–OTU association matrix B_i was produced by sampling uniformly from the interval [-0.5, 0.5], with only 15% of the elements set to non-zero. The OTU–OTU association matrix Θ_i was generated using the R package “huge” [29], which outputs a precision matrix for which the adjacency matrix can be a random, cluster, scale-free, hub, or band graph. Every graph corresponds to a specific association structure among microbes. Values of the mean vector of the EFs of the cluster were obtained by sampling uniformly from the interval [i, (i+0.5)×i]. The Dirichlet-multinomial samples were then produced using the R package “HMP”. For the parameters of every setting, ten repetitions were conducted to generate synthetic data, and the mean and standard deviation of the evaluation metrics were then calculated for comparison. The public R-language codes of CCLasso and SPIEC-EASI were used. The implementation of SparCC was provided by SPIEC-EASI. SCC and SCC(all) were computed directly with the built-in functions of the R language. The “mb” (Meinshausen–Bühlmann) estimation method was set for SPIEC-EASI, and the default parameters of CCLasso, SPIEC-EASI, and SparCC were utilized. In addition, the P value threshold was set to 0.05 for SCC and PCC to retain the significant associations.
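Two ingredients of this generation process, the sparse EF–OTU matrix and a Dirichlet-multinomial draw, can be sketched in Python. The study used the R packages named above, so the NumPy code below is only an illustrative stand-in: the function names, the uniform Dirichlet parameter, and the sequencing depth are assumptions, and the link between the association matrices and the Dirichlet parameters is not modeled here.

```python
import numpy as np

def sample_sparse_ef_otu_matrix(q, p, density=0.15, low=-0.5, high=0.5, rng=None):
    """Sample a Q x P EF-OTU matrix whose non-zero entries (about `density`
    of all elements) are drawn uniformly from [low, high]."""
    rng = np.random.default_rng() if rng is None else rng
    flat = np.zeros(q * p)
    n_nonzero = int(round(density * q * p))
    idx = rng.choice(q * p, size=n_nonzero, replace=False)
    flat[idx] = rng.uniform(low, high, size=n_nonzero)
    return flat.reshape(q, p)

def sample_dirichlet_multinomial(alpha, depth, rng=None):
    """Draw one vector of read counts: proportions from a Dirichlet
    distribution, then counts from a multinomial with `depth` reads."""
    rng = np.random.default_rng() if rng is None else rng
    proportions = rng.dirichlet(alpha)
    return rng.multinomial(depth, proportions)

rng = np.random.default_rng(0)
# Q = 5 EFs and P = 50 microbes, as in one of the settings described in the text.
B_demo = sample_sparse_ef_otu_matrix(q=5, p=50, rng=rng)
# Hypothetical Dirichlet parameters and sequencing depth for a single sample.
counts_demo = sample_dirichlet_multinomial(alpha=np.ones(50), depth=10000, rng=rng)
```

In the full generative process the Dirichlet parameters would depend on the EF values and the association matrices of each cluster; here they are fixed only to keep the sketch self-contained.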

Evaluation metrics on synthetic data

The receiver operating characteristic (ROC) curves and the area under the curve (AUC) values were utilized to evaluate the performance of association inference. Every cluster estimated by kLDM is represented by two ROC curves, namely the ROC curve of the OTU–OTU associations and that of the EF–OTU associations. When the AUC values were computed, the signs of estimated associations were neglected. Regarding the thresholds for the computation of the AUCs, the absolute values of the calculated associations were selected for kLDM, SparCC, CCLasso, and SPIEC-EASI, and P values were used for SCC and PCC. When plotting the ROC curves for SCC and CCLasso, the calculated coefficients were compared with the true correlation matrix, as deduced by the inverse of the precision matrix Θ_i. One estimated non-zero association is regarded as a true positive if its value is also non-zero in the ground truth. If the values of the inferred and real results of one association are both zero, the association is a true negative.
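The sign-agnostic AUC described above can be sketched with the rank-based (Mann–Whitney) formulation. This is an illustrative helper, not the evaluation script used in the study: the function name and the toy matrices are assumptions, and for symmetric OTU–OTU matrices one would typically pass only the off-diagonal upper triangle.

```python
import numpy as np

def association_auc(estimated, truth):
    """AUC for recovering non-zero associations, ignoring signs.

    Scores are absolute estimated values; an entry is a positive when the
    ground-truth value is non-zero. Tied scores count as half-correct.
    """
    scores = np.abs(np.asarray(estimated, dtype=float)).ravel()
    labels = (np.asarray(truth) != 0).ravel()
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("Both zero and non-zero ground-truth entries are required.")
    # Fraction of (positive, negative) pairs in which the positive scores higher.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example (made-up values): 2 EFs x 3 OTUs.
truth = np.array([[0.0, 0.4, 0.0],
                  [-0.3, 0.0, 0.0]])
estimated = np.array([[0.0, -0.5, 0.1],
                      [0.2, 0.05, 0.0]])
auc_demo = association_auc(estimated, truth)
auc_uninformative = association_auc(np.zeros((2, 3)), truth)
```

In the toy example the sign of the second entry is wrong, but its magnitude still ranks above all true zeros, so it counts as correctly recovered under this sign-agnostic criterion.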

Results

Comparing kLDM with other methods on synthetic datasets

The performance of kLDM was first assessed using synthetic datasets. The effects of the similarity of the underlying networks and of missing information (i.e., missing EFs) on the performance of kLDM were then evaluated.

Detailed experiments were conducted to demonstrate the effectiveness of kLDM by comparing it with SCC, CCLasso, and SPIEC-EASI. These three models were included because they were tested in our previous study [17] and exhibited advantages over other methods. SCC estimates both EF–OTU and OTU–OTU associations, CCLasso solves the covariance matrix among microbes, and SPIEC-EASI performs well in inferring the precision matrix among OTUs. For a fair comparison, because none of these methods considers more than one network, data were partitioned into clusters according to the results reported by kLDM, and these programs were applied on each cluster to infer associations. It should be noted that the results of SCC using all samples, denoted as SCC(all), were plotted as the baselines. The ROC curve and AUC value of each cluster were used to compare the performances.

First, the relationship between the number of samples per cluster (N) and the efficiency of kLDM was examined by setting the following parameters: K = 2 clusters, P = 50 microbes, Q = 5 EFs, and N samples per cluster in two separate ranges. These two settings of N had identical associations among microbes and between EFs and microbes. As shown in Figure 2, the ROC curves and AUC values of kLDM were consistently better than those of the other methods, and the ability of kLDM to recover OTU–OTU and EF–OTU associations increased with the increase of N. However, it was observed that the ROC curves of SCC(all) in most situations were lower, demonstrating the importance of separating samples by environmental conditions. It should be noted that SPIEC-EASI did not efficiently estimate the associations due to its strict model selection. Samples from different EF conditions may introduce considerable noise and disturb the results of SPIEC-EASI. Additionally, the sensitivities and specificities of the top associations estimated by the five methods were compared using the synthetic dataset with K = 2, P = 50, Q = 5, and N ∈ [100, 200]. As shown in Table S2, kLDM inferred the OTU–OTU and EF–OTU associations with higher sensitivity and superior specificity as compared to the other methods.

The scalability of kLDM was then investigated with P = 100 and P = 200, and the results are plotted in Figure 3A and B. In both situations, the AUC values of kLDM were all higher than those of the other approaches, which verifies its fine scalability due to the re-implementation of mLDM with the C++ language and the utilization of parallel computing. Next, kLDM was tested by increasing the number of clusters to 3 and 4 (Figure 3C and D), and kLDM again achieved the best results.

Figure 3 Evaluation of the scalability of kLDM after increasing the numbers of microbes and EF conditions

Because the split process of kLDM is affected by the similarity between the EFs of clusters, only the distances between the mean values of EFs of two clusters were changed, and the other parameters were kept the same to examine the performance of kLDM. As revealed in Table S3, as expected, when the distance between two EF mean vectors was sufficiently large, such as 1.5 or 2.0, which are respectively denoted as “1.5 baseline” and “2.0 baseline”, kLDM accurately inferred the associations. However, the effectiveness of kLDM was found to decrease when the distance became smaller, such as 1.0 (“1.0 baseline”), especially on the inference of OTU–OTU associations. This can be attributed to the tendency of kLDM to group samples together to infer common associations when two clusters have similar values of EFs but different associations.

To test the effect of the similarity of EF–OTU or OTU–OTU associations among clusters on the performance of kLDM, two new datasets were generated for each distance (1.0, 1.5, and 2.0 baseline) by separately setting the value of B_i or Θ_i of two clusters (i = 1, 2) to be equal (“same EF–OTU” or “same OTU–OTU” in Table S3). Compared with the results of the corresponding “baseline”, the AUC values of “same EF–OTU” and “same OTU–OTU” were not significantly changed. From this, it was concluded that the similarity of EFs influences kLDM more than does the similarity of associations among environmental conditions.

In the simulated experiments, all EFs were used to estimate association networks, which may not be feasible in reality because some EFs can be missing. Therefore, new datasets with only partial EFs were simulated to test the effectiveness of kLDM. Figure S1 shows the AUC values of kLDM on two clusters when different proportions of EFs (20%, 40%, 60%, and 80%) were retained to infer association networks. With the subsets of EFs, the AUC values of kLDM were found to decline. This indicates the importance of including as many EFs as possible. Furthermore, with only 60% of the EFs, kLDM achieved results comparable to those of CCLasso and SCC on the whole dataset, which again demonstrates the superiority of kLDM (Table S4).

kLDM captures variation in associations of gut microbiota of CRC patients

The relationship between human gut microbiota and CRC has been explored in previous studies [30–32]. In this work, the metagenomic 16S rRNA sequencing dataset from Baxter et al. [13] consisting of 117 OTUs and 3 EFs was chosen, and the efficiency of kLDM in capturing the variation of associations in the microbial community was evaluated. kLDM reported two clusters: Cluster 1 was annotated as “Cancer”, and included 90% of “Cancer” samples with significantly higher positive FIT values (P = 7.18 × 10^-22) (Table S5); Cluster 2 was denoted as “Healthy”, and contained 83.7% of healthy samples (“Normal” and “High-risk Normal”) (Table S6). “Adenoma” and “Advanced Adenoma” samples existed in both Cluster 1 and Cluster 2. kLDM did not split samples simply according to the diagnostic state; instead, it took all EF values, microbial abundances, and microbial associations into account.

Different patterns of microbial abundances and associations were found between these two clusters. Prevotella was abundant in samples from both clusters, while the other four CRC-associated OTUs were significantly over-represented (P < 0.001) in the samples of Cluster 1 (Figure S2). By comparing the OTU–OTU association networks between Cluster 1 and Cluster 2, as presented in Figure S3, it was found that most gut microbes of “Healthy” samples were connected quite closely and the distribution of associations among microbes was balanced, while few associations were observed between the aforementioned 5 known CRC-associated OTUs and other microbes (Figure S3B). In contrast, associations among the gut microbes of “Cancer” patients were relatively sparse (Figure S3A), and strong correlations were found between specific bacteria (Table 1). Peptostreptococcus, Parvimonas, Fusobacterium, and Porphyromonas were found to have strong correlations with each other, but did not connect with other microbes, while Prevotella was found to be more associated with other common OTUs, such as Phascolarctobacterium and Clostridium_XlVa. Based on the differences in microbial abundances and the distributions of associations within the two clusters, gut microbiota translocation in cancer samples can be observed, and Prevotella may play a specific role in tumorigenesis.

Previous studies have shown that Peptostreptococcus and Fusobacterium are associated with inflammation [33,34]. The results of the present study also confirmed that Peptostreptococcus was positively correlated with Fusobacterium in “Cancer” patients, and that it was over-represented in CRC fecal samples. Table 1 reveals that Porphyromonas and Peptostreptococcus were found to have positive correlations with diagnostic states in Cluster 1, suggesting the significance of these two bacteria in CRC diagnosis. On the other hand, Prevotella was found to be positively associated with the diagnostic states in Cluster 2, indicating a high predictive value, and could be a useful indicator for CRC diagnosis. Research has shown that Fusobacterium may act as a passenger microbe to perpetuate tumorigenesis, as a higher load of Fusobacterium is related to disease severity [35]. From the results of the present study, patients with cancer who have a positive FIT result are more likely to carry Fusobacterium.

Table 1 Top associations of CRC-associated microbes for two clusters of the CRC dataset found by kLDM

Table 2 Top 1% associations of two clusters on the Tara Oceans dataset with support from literature

Elaborating relationships between marine eukaryote associations and EFs

Next, kLDM was applied on the Tara Oceans eukaryotic dataset [24] to explore the associations in natural environments and compare associations predicted by kLDM with known genus-level interactions. kLDM found two clusters corresponding to two EF conditions, namely Cluster 1, consisting of 168 samples with 67 OTUs, and Cluster 2, consisting of 53 samples with 26 OTUs. In Cluster 2, 41 OTUs were filtered out because the number of samples was less than the number of OTUs, and kLDM removed small OTUs to infer associations efficiently. The EFs of the two clusters are listed in Table S7. Cluster 1 had a significantly higher salinity and temperature, but lower oxygen, phosphate, and silica concentrations than Cluster 2. The top 1% associations found in these two clusters, including 23 OTU–OTU and 12 EF–OTU associations in Cluster 1, and 4 OTU–OTU and 5 EF–OTU associations in Cluster 2, were analyzed, and associations with support from the existing literature are listed in Table 2.

The associations were then matched with known genus-level interactions among the top-N associations within each of the two clusters discovered by kLDM. Because known interactions were at the genus level, an association between two OTUs was regarded as matched if the OTUs’ genera were identical to the two corresponding genera of the known interaction. The results were compared with those when the environmental conditions were not considered and all samples were regarded as one cluster, denoted as “Static”, by limiting kLDM to predict one set of EF–OTU and OTU–OTU association networks. As shown in Table S8, the combined predicted associations from the two clusters were similar to the “Static” results, with each cluster consisting of both common and specific OTU–OTU associations.
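The genus-level matching rule can be sketched as a small lookup. This is an illustrative helper rather than the authors' script: the function name and data layout are assumptions, the OTU-to-genus mapping for OTU-38 and OTU-25 follows the text, and “OTU-7” is a hypothetical identifier added only for the example.

```python
def match_known_interactions(associations, otu_genus, known_pairs):
    """Return the OTU pairs whose two genera coincide with a known
    genus-level interaction (the order within a pair is ignored)."""
    known = {frozenset(pair) for pair in known_pairs}
    matched = []
    for otu_a, otu_b in associations:
        genera = frozenset((otu_genus[otu_a], otu_genus[otu_b]))
        if genera in known:
            matched.append((otu_a, otu_b))
    return matched

# Toy inputs: "OTU-7" and its genus assignment are hypothetical.
otu_genus = {
    "OTU-38": "Amoebophrya",
    "OTU-25": "Protoperidiniaceae",
    "OTU-7": "Copepoda",
}
known_pairs = [
    ("Amoebophrya", "Protoperidiniaceae"),
    ("Vampyrophrya", "Copepoda"),
]
predicted = [("OTU-25", "OTU-38"), ("OTU-7", "OTU-38")]
matched = match_known_interactions(predicted, otu_genus, known_pairs)
```

Using unordered pairs means that an association is matched regardless of which OTU is listed first, which fits the undirected nature of the inferred associations.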

Different types of known associations detected by kLDM are listed in Table S9. Four known associations, namely “Phaeocystis–Amphibelone”, “Vampyrophrya–Copepoda”, “Amoebophrya–Acanthometra”, and “Blastodiniaceae–Copepoda”, may be dominant in the ocean because they were inferred from both the whole dataset and the larger cluster (Cluster 1), whereas the association between Amoebophrya and Protoperidiniaceae may be strong in a specific EF condition related to Cluster 2. More specifically, the association between OTU-38 and OTU-25, which belong to the genera Amoebophrya and Protoperidiniaceae, respectively, was found only in Cluster 2, in which the mean oxygen, phosphate, and silica concentrations were higher than those in Cluster 1, and the abundance levels of the two OTUs were significantly higher than those in Cluster 1 (P < 0.05). The EF condition in Cluster 2 could be more suitable for the growth of the genus Protoperidiniaceae, and Amoebophrya would then benefit from parasitizing it. These results on the Tara Oceans dataset support the effectiveness of kLDM in elaborating the relationships between OTU–OTU associations and EF values.

Characterizing changes in the association networks of human gut microbes with different lifestyle factors

kLDM has the advantage of analyzing complex datasets with large numbers of samples to infer multiple EF conditions and the corresponding association networks, and this capability was verified on the American Gut Project dataset [14]. kLDM was applied to cluster samples according to lifestyle-related factors and association networks, and two large clusters, Cluster 1 with 6831 samples and Cluster 2 with 5003 samples, and one small cluster, Cluster 3 with 112 samples, were obtained.

Compared with Cluster 1 and Cluster 2, Cluster 3 exhibited different lifestyle patterns and microbiota distributions (Tables S10–S12). Cluster 1 and Cluster 2 together contained almost all diseased and healthy individuals, while Cluster 3 was mainly composed of IBD patients (94.64%), who made up about a quarter of all IBD patients (26.77%). Individuals in Cluster 3 had significantly higher frequencies of alcohol, high-fat red meat, and red meat consumption than those in Cluster 1 and Cluster 2, but they also ate more vegetables and took vitamin D and B supplements and probiotics more frequently. However, their exercise frequency, milk substitute consumption frequency, and milk cheese consumption frequency were distinctly lower (Table S10). For genus-level microbial abundances, the genera Prevotella, Ruminococcus, and Sutterella were more abundant in Cluster 3, but Bifidobacterium and Bacteroides were less abundant (Table S13). When only the IBD samples in each cluster were considered, those in Cluster 1 and Cluster 2 differed substantially from those in Cluster 3 in the aforementioned diet frequencies and genus abundances (Table S14). The IBD samples in Cluster 1 and Cluster 2 had similar lifestyle quantification values and genus abundance levels, except for the frequencies of vegetable, fruit, and home-cooked meal consumption and of smoking, and the abundance of the genus Lachnospira.

The top 1% EF–OTU associations predicted by kLDM were subsequently analyzed (Table S12), and various associations among the three clusters that matched the findings of published literature are presented. A positive correlation between probiotic frequency and Bifidobacteria was found in both Cluster 1 and Cluster 2, which was previously reported by Rajkumar et al. [36]. In people with high animal-protein diets, positive correlations were found between red meat frequency and Bacteroides and between poultry frequency and Ruminococcus in Cluster 1, and between seafood frequency and Clostridiales in Cluster 3, which were all consistent with other research [37]. Associations between high-fat red meat frequency and Bacteroides in Cluster 1 and Clostridiales in Cluster 2 matched a rat study that found that the intake of a high-fat diet resulted in disproportionate increases in propionate- and acetate-producing species such as Clostridiales and Bacteroides [38].

Discussion

Considering the dynamic nature of microbial interactions, a new hierarchical Bayesian model, kLDM, was proposed to infer associations among microbes and associations between EFs and microbes under different environmental conditions. Two algorithms, namely a theoretical EM algorithm and a more practical and efficient split-merge algorithm, were then developed to simultaneously estimate both the number of EF conditions and the associations for each EF condition. The effectiveness of kLDM was verified on simulated datasets, as well as the real CRC, Tara Oceans, and American Gut Project datasets. Although kLDM was implemented with OpenMP, it can also be accelerated with MPI.

For the synthetic experiment, when the scalability of kLDM was tested (Figure 3A and B), the AUC values of CCLasso were the second-best in most cases, but CCLasso predicted more associations than kLDM, and its ROC curves rose slowly at first and then faster as the false positive rate increased. However, a high initial true positive rate is arguably more crucial for scientists wanting to select candidate interactions for validation; in this respect, kLDM presented an advantage. In addition, the influence of the similarity of EF values between different EF conditions on the performance of kLDM was explored in detail. The relationship between the AUC values and the absolute distance (see the equation in Table S3) between the EF values of two clusters (Figure S4) was plotted based on the “baseline” dataset used in Table S1. When the distance was small (< 2), the AUC values of the OTU–OTU and EF–OTU associations increased rapidly with the distance, and when the distance was greater than 2, the power of kLDM tended to be stable. To maintain all AUC values > 0.7, the distance between the EF values of two clusters should be greater than 1.0. The mean values of the inverse Simpson index n_eff of the synthetic datasets were also compared, and the results are shown in Table S15. The values of n_eff of the two synthetic datasets with P = 100 and P = 200 were ≥ 13 and had high compositionality [39]. Taken together, the results in Table S15 and the AUC values of kLDM in Figure 3 indicate that kLDM can handle high compositional effects.
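For reference, the inverse Simpson index used above is n_eff = 1 / sum_i p_i^2, where p_i is the relative abundance of taxon i in a sample. The short Python sketch below computes it for a single sample; the function name and the example counts are hypothetical and are not taken from the kLDM scripts.

```python
def inverse_simpson(counts):
    """Inverse Simpson index n_eff = 1 / sum(p_i ** 2) for one sample of taxon counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# A perfectly even community of four taxa has n_eff equal to the number of taxa.
even = [25, 25, 25, 25]
# A community dominated by one taxon has n_eff close to 1.
dominated = [97, 1, 1, 1]

print(inverse_simpson(even))       # 4.0
print(inverse_simpson(dominated))
```

Averaging this quantity over the samples of a dataset gives a per-dataset mean of the kind summarized in Table S15.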

For the CRC dataset, the diagnostic state of the donors was included as one EF by kLDM. The diagnostic state was classified according to colonoscopy examination and the review of biopsies, rather than by the microbial compositions and associations; therefore, patients with the same diagnostic state may not have the same underlying microbial associations. In addition, cancer may have different subtypes, and each subtype may have a different association network. Patients who will potentially develop cancer may also have different association networks from those who will not. kLDM was also tested on the CRC dataset using three EFs and excluding the diagnostic state. The two clusters stayed the same, and the compositions of the diagnostic states in the two clusters did not change. The reason for this may be that the underlying microbial networks of these two clusters were distinct. In addition, kLDM was compared with a probabilistic clustering model called MicrobeDMM [40], which clusters sequencing samples according only to the microbial composition. As shown in Table S16, MicrobeDMM did not distinguish the healthy samples from the CRC samples well, and the gut samples of these two groups were mixed evenly into two clusters. In comparison, in the results of kLDM, most of the “Normal” samples were clustered into one group, while the “Cancer” samples were clustered into the other, which is consistent with prior knowledge. This demonstrates that using only microbial compositions may not be sufficient, and this result, together with the results of kLDM on the synthetic dataset with partially observable EFs (Figure S1), highlights the importance of collecting all EFs.

On the Tara Oceans dataset, the top associations found by kLDM were consistent with previous studies (Table 2). For Cluster 1, the four listed OTU–OTU associations belonged to two kinds of known microbial interactions. More specifically, the symbiosis between Phaeocystis and Amphibelone anomala, and the parasitism between Amoebophrya ceratii and Cochlodinium_01 fulvescens, were captured by kLDM [41,42]. Associations related to EF values were also found. Blastodinium mangini was found to tend to live in seawater with high salinity [43], and the concentration of PO4 was found to affect the growth of Phaeocystis [44]. In addition, Amphibelone anomala has been found to be associated with PO4, because it has a close phylogenetic relationship with Pfiesteria piscicida [45], which is regulated by phosphate [46]. For Cluster 2, the parasite Blastodinium_06 was found to be linked to temperature and oxygen, and Skovgaard et al. [43] reported that some Blastodinium spp. living in warm temperatures can perform photosynthesis. The concentration of oxygen was associated with some Phaeocystis, and it has been reported that the bloom of Phaeocystis globosa causes oxygen depletion [47]. Because most interactions among marine microorganisms are unknown, the explanation in the present study is limited.

For the American Gut Project dataset, kLDM was used to analyze the relationships between lifestyle factors and microbiota, as well as associations within the microbial community. The resultant three clusters exhibited differential frequencies of lifestyle factors, compositions of microbes, and OTU–OTU and EF–OTU associations. Diet preference is among the most influential EFs of the gut microbiome, and it can even determine microbial compositions in the mammalian evolution process [48]. These findings were again reflected in the results of the present study. IBD patients in the three clusters were analyzed, and the results indicated the gut microbial heterogeneity of IBD patients; subgroups could therefore potentially be classified by individuals’ lifestyles and microbial communities. The concept of enterotypes in gut microbial communities has been extensively discussed [49,50], and kLDM may be a useful tool for the discovery of special subgroups of IBD patients.

Whether IBD patients in Cluster 3 differed from other IBD patients in Cluster 1 and Cluster 2 in pathogenesis, clinical characteristics, treatment, and prognosis was also of interest. Notably, although patients in Cluster 3 consumed more probiotics, they still exhibited low abundances of Bifidobacterium. Probiotic diets can induce the anti-inflammatory factor IL-10, which improves the gut microenvironment and reduces IBD symptoms [51,52]. Individuals in Cluster 1 and Cluster 2 had low frequencies of probiotic consumption, which is in agreement with the results of previous studies, but IBD patients from Cluster 3 had high-frequency probiotic consumption (Table S8). Philpott and Girardin [53] reported that IBD patients carrying NOD2 mutations exhibited decreased transcription of IL-10; thus, it is suspected that patients in Cluster 3 are more likely to carry NOD2 mutations.

By further comparing the IBD diagnoses among the three clusters (Table S17), all IBD patients in Cluster 3 were found to have colonic Crohn’s disease (CD), but this information is missing for the IBD samples in the other two clusters. Patients with CD may lack vitamins B and D [54]; therefore, the use of vitamin B and D supplements in Cluster 3 is consistent with treatment for such patients. Recently, an “anti-inflammatory” diet, in which high-fat meats are avoided, has been reported to reduce symptoms in patients with IBD [55]. However, a high frequency of high-fat red meat consumption was observed for colonic CD patients in Cluster 3, and adjustment in this category may help them in the future. Whether the lifestyles of patients with colonic CD are affected by a doctor’s advice was not included in the dataset, which limits further interpretation.

Taking into account the healthy cohorts in Cluster 1 and Cluster 2, it was observed that the genus abundances of the IBD samples in Cluster 1 and Cluster 2 were more similar to those of the healthy samples than to those of the IBD samples in Cluster 3 (Table S18). Recently, many microbiota-assisted models have been proposed for the detection of gut lesions, but there have been mixed results [50,56,57]. Adding microbial abundance improves sensitivity, but at the cost of specificity. The results of the present study indicate that such decreased specificity may result from the close microbial compositions between some patients and healthy people. Therefore, when microbiota-based prediction models are designed, the heterogeneity within patient samples should be considered.

While it was demonstrated that kLDM is a useful tool for biologists to interpret microbial associations under multiple environmental conditions, there are some implementation challenges, as well as several possible improvements. For example, the mathematical score (e.g., EBIC score) may not be sufficiently sensitive to separate two environmental conditions. Studies on differential gene co-expression networks may be useful for kLDM. For example, characteristics of nodes in networks, such as the degrees, clustering coefficients, and other mathematical measures that summarize the changes in associations, may be included to develop a more suitable approach to the merging of sub-clusters [58]. In addition, the assumption of Gaussian distributions for the EFs may not be very suitable in the case of categorical data types, and other probabilistic distributions can be considered to model categorical metadata. The scalability of kLDM must also be further expanded so that a large number of rare OTUs with low occurrence can be included in the model. Prior knowledge of microbes and their interactions from known studies may also be very helpful for kLDM to reduce the complexity of association inference and further improve its efficiency [59]. The sample sizes of the real Tara Oceans and CRC datasets are relatively small, which is a common phenomenon in current metagenomic studies. However, it is believed that more large-scale datasets like the American Gut Project and Human Microbiome Project datasets will be collected in the future. For these large-scale datasets, kLDM could be well suited to the analysis of complex associations.
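To make the node-level measures mentioned above concrete, the following Python sketch computes the degree and the local clustering coefficient of a node in two small undirected networks and reports how they differ. It only illustrates the kind of summary that could feed a merging criterion; it is not part of kLDM, and the function names, OTU labels, and edges are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph, node):
    """Number of neighbours of `node` (0 if the node is absent)."""
    return len(graph.get(node, ()))


def clustering_coefficient(graph, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = graph.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Two hypothetical association networks over the same OTUs.
net_1 = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])
net_2 = build_graph([("A", "B"), ("A", "D")])

for node in ("A", "B"):
    d_change = degree(net_1, node) - degree(net_2, node)
    c_change = clustering_coefficient(net_1, node) - clustering_coefficient(net_2, node)
    print(node, d_change, round(c_change, 3))
```

Large changes in such node-level summaries between two sub-clusters would suggest that their networks differ, whereas small changes would argue for merging them; how to weight these measures against a score such as EBIC is left open here.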

Conclusion

In this work, the kLDM model was proposed to infer EF conditions existing in microbial communities and to predict direct EF–OTU and conditionally dependent OTU–OTU associations under every EF condition while simultaneously considering compositional bias. Compared with traditional methods that estimate static association networks, kLDM has the advantage of illuminating the influences of EFs on associations in microbial communities. The EF condition estimated by kLDM is the result of the comprehensive consideration of EFs, OTU abundances, and associations in the community, which can provide biologists with more insight into the heterogeneity of microbial communities and can identify microbes and interactions regulated by EFs, such as nutrients, lifestyle, and health status. The superiority of kLDM was validated on both synthetic data and real datasets related to the human gut and marine ecosystems. With the deepening of research on the relationship between microorganisms and human diseases, it is expected that kLDM will enable new discoveries of the variations of microbes and OTU–OTU and EF–OTU associations with human health and diets.

Code availability

The code of kLDM and all experimental datasets are accessible at https://github.com/tinglab/kLDM.git.

CRediT author statement

Yuqing Yang: Conceptualization, Investigation, Methodology, Software, Writing - original draft. Xin Wang: Conceptualization, Investigation, Methodology, Formal analysis, Data curation, Writing - original draft. Kaikun Xie: Investigation, Writing - original draft. Congmin Zhu: Investigation, Writing - original draft. Ning Chen: Conceptualization, Supervision, Writing - review & editing. Ting Chen: Conceptualization, Supervision, Writing - review & editing. All authors have read and approved the final manuscript.

Competing interests

All authors declare no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61872218, 61673241, and 61721003), and the Tsinghua-Fuzhou Institute Research Program, Beijing National Research Center for Information Science and Technology (BNRist), China.

Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.06.015.

ORCID

ORCID 0000-0003-0409-2590 (Yuqing Yang)

ORCID 0000-0003-1422-223X (Xin Wang)

ORCID 0000-0001-5775-3095 (Kaikun Xie)

ORCID 0000-0002-0472-9128 (Congmin Zhu)

ORCID 0000-0002-6742-0048 (Ning Chen)

ORCID 0000-0002-3228-9166 (Ting Chen)