Essential proteins identification method based on four-order distances and subcellular localization information

2024-01-25 07:14PengliLu卢鹏丽YuZhong钟雨andPeishiYang杨培实
Chinese Physics B 2024年1期

Pengli Lu(卢鹏丽), Yu Zhong(钟雨), and Peishi Yang(杨培实)

1School of Computer and Communication,Lanzhou University of Technology,Lanzhou 730050,China

2School of Tianmen Vocational College,Tianmen 431700,China

Keywords: protein–protein interaction(PPI)network,essential proteins,four-order distances,subcellular localization information

1.Introduction

Essential proteins are an integral part of cell life and play an important role in maintaining the fundamental functions of organisms, whose absence would lead to the death or sterility of the organism.[1]Therefore,accurate identification of essential proteins has an important place in biomedical research.Traditional essential proteins identification relies on biological experimental methods,such as single gene knockout,[2]RNA interference[3]and conditional gene knockout,[4]which have a high degree of accuracy but often require a great deal of time and a high amount of money.

With the development of high throughput technologies,a huge amount of PPI network data has been mined, which makes it possible to use computational methods to identify essential proteins.Many methods based on the centrality of PPI network topology properties have also been proposed to identify essential proteins, such as degree centrality (DC),[5]betweenness centrality(BC),[6]eigenvector centrality(EC),[7]closeness centrality (CC),[8]subgraph centrality (SC),[9]local average center(LAC),[10]network centrality(NC),[11]and so on.However, there is a large amount of noisy data in the PPI network, which makes these centrality methods for identifying essential proteins less accurate.To overcome this problem, many researchers have tried to use biological information to enhance accuracy in identifying essential proteins.Currently, the main biological data used by researchers includes gene ontology(GO)terms,[12]subcellular localization information,[13]gene expression sequence,[14,15]and protein complexes information.[16]For example, Hsinget al.[12]developed a central protein classifier using GO terms to identify the central protein.The LDS method for identifying essential proteins was proposed by Liet al.,[17]which combines local fractal dimension and subcellular localization information.Liet al.and Tanget al.proposed the Pec[14]method and the Wdc[18]method using the topological properties of PPI network and gene expression data, respectively.Liet al.[16]proposed the UC method by considering the number of proteins that are present in the complexes.Luet al.[19]proposed the CENC method for essential proteins identification studies by using protein complex information and the node clustering coefficients.Although these methods have yielded good results, it is still challenging to better combine the topological properties of PPI network and biological information to improve the performance of identifying essential proteins.As an optimization algorithm, the random walks method is also widely used to identify essential proteins.For example, Leiet al.[20]proposed the RWEP algorithm, which uses the random walks method to integrate the topological properties of PPI networks and four biological information to identify essential proteins,and obtained an excellent experimental result.In addition,the random walks method has been applied to link prediction,[21]recommender systems,[22]ranking,[23]community detection,[24]and transmission dynamics.[25]

Although these methods have good accuracy and have contributed to the development of the field of bioinformatics,however,they usually only consider the interaction properties between nodes and neighboring nodes in a PPI network,while ignoring the interactions between nodes at higher-order distances.To break these limitations,in this paper a new method called DSEP is proposed that combines the four-order distances and subcellular localization information of a PPI network to identify essential proteins.In DSEP,we first construct a topological information matrix based on the four-order distances of PPI network, we then construct a subcellular localization information matrix based on the four-order distances,and we finally use the random walks method to combine the topological information matrix of PPI network and the subcellular localization information matrix to identify essential proteins.To evaluate the properties of the DSEP algorithm, we used three PPI network data sets (i.e., BioGRID, YHQ, and YMBD).The experiment revealed that DSEP has an outstanding performance in identifying essential proteins compared to 11 comparison methods.

2.Methods

The PPI network can be expressed asG=(V(G),E(G)),whereV(G)={v1,v2,...,vn}denotes the set of vertices in the PPI network andE(G)={e1,e2,...,em}denotes the set of edges.The distance between verticesviandvjis denoted asdi,j.

The DSEP algorithm for scoring proteins in PPI networks has the following three main steps,and the algorithm flowchart is shown in Fig.1.

Fig.1.Flowchart of DSEP.

Step 1 Calculate the four-order distances-based topological information matrix in a PPI network.

Step 2 Calculate the information matrix of subcellular localization based on four-order distances in a PPI network.

Step 3 Based on these two similarity matrices, the proteins are scored using the random walks method.

In Ref.[26], the authors report that the number of paths with distancekbetween nodesviandvjin graphGis(Adenotes the adjacency matrix of graphG).Therefore,we proposed a method to calculate the finite-order distance matrix of the graphG(see step 8 of Algorithm 1).In solving the distance matrix of the network,the idea of Dijkstra’s algorithm is often used to solve the distance matrix,which has a time complexity of O(n3), while our proposed DSEP algorithm only needs to use the four-order distances.With our proposed method, the time complexity of solving the finite-order distances can be reduced to O(n2),which greatly reduces the time complexity of the DSEP algorithm.

2.1.Topological information matrix based on four-order distances

Most real networks are known to satisfy a power-law distribution.Based on this, Liet al.,[17]using the local fractal dimension, proposed the following formula to calculate the scores of nodes in the PPI network:

whereBvi(r) denotes the sum of those nodes with distance from nodevibetween 0 tor.Dvidenotes the score of nodevi,andCdenotes a constant.

Equation(1)can be expressed asDvi=m·lnBvi(r)/lnr.In this equation,each order of distance of nodevihas the same importance for nodevi, which is inconsistent with the actual situation.Therefore, we define the following formula to calculate the topological similarity of nodes in the PPI network:

whereci(i=1,2,3,4) denotes the coefficient of thei-th order distance of nodevi,andri(i=1,2,3,4)denotes the total number of nodes of thei-th order distance of nodevi.In this paper, we have mainly considered the four-order distances ofvi.

Based on the equation defined above,we construct a fourorder distances-based topological information matrix,defined as follows:

whererdij(i) (di,j= 1,2,3,4) denotes the total number of nodes in the PPI network with distancejto nodevi.

2.2.Information matrix for subcellular localization based on four-order distances

Subcellular localization information plays an indispensable role in the study of gene function, which represents the specific location of proteins or gene expression products in the cell.We analyzed the involvement of proteins in 11 intracellular locations and defined a matrix of information on the subcellular localization of proteins as follows:

Combining Eq.(4),we define the information matrix for subcellular localization based on four-order distances

wheresrdij(i)(di,j=1,2,3,4)denotes the number of proteins in the PPI network which are distant from nodeviasjand are jointly involved in subcellular processes.

2.3.DSEP algorithm

We propose the DSEP algorithm based on the random walks method.We chose this approach because of its ability to integrate well the topological properties of the network.The traditional random walks method only considers the topological properties of the network.In 2021, Ahmedet al.[27]proposed the EPD-RW method by combining four biological properties of proteins and eight topology-based centrality methods using the random walks method.The random walks are iteratively formulated as follows:

wherePdenotes the proteins’ topological similarity matrix,Bdenotes the biological property similarity matrix of the proteins,andeis an 1×n-dimensional(ndenotes the total amount of proteins)vector with value all 1/n,which denotes the probability vector of returning to the start node in a random walk.

Equation (6) combines the biological and topological properties of proteins well but the probability of each restart is 1/n, which does not reflect the actual situation effectively.Therefore,we propose the following equation for the random walks method:

whereL'is the matrix obtained by normalizing the matrixL(defined by Eq.(8)), which represents the transfer probability matrix in the random walks method, andT'is the matrix evolved from the matrixT(the exact process can be seen in Eqs.(9)and(10)),which represents the probability of returning to the starting position in the random walks method.

In order not to destroy the distance information contained in the matrix,we normalizeLaccording to a new normalization method as follows:

where MS-L denotes the maximum row sum of matrixL.

We obtained the scoring matrixCof the proteins based on the PPI network topology using the row sum of matrixT.The definition is as follows:

After that,we normalizedCusing the sigmoid function,so that the restart probability of each node is 0–0.5.

Algorithm 1 DSEP Input:1: The data of PPI network G=(V,E);2: The adjacency matrix A of the PPI network;3: Subcellular localization information;4: Parameters in Table 2;Stopping error ε =0.000001;Output:5: The sorting vector q of the PPI network;6: Calculate the matrices A2,A3,A4 through the matrix A;7: Initialize the four-order distances matrix D of the graph G to be a matrix with all values of n×n being zero;8: for each vi,vj ∈V(G)do if Ai,j =1 Di,j =1 elif A2i,j ≥1 Di,j =2 elif A3i,j ≥1 Di,j =3 elif A4i,j ≥1 Di,j =4 9: end for 10: Calculating the four-order distances-based topological information matrix T by Eq.(3);11: Calculating the information matrix of subcellular localization based on four-order distances L by Eqs.(4)and(5);12: Normalize the matrix L to obtain the matrix L' by Eq.(8);13: Reconstruct matrix T to get matrix T' by Eqs.(9)and(10);14: Initialize the vector q(t)=(1,1,...,1);15: while‖q(t+1)−q(t)‖≥ε do Compute q(t+1) by Eq.(7);16: end while 17: q=q(t+1);18: return q

3.Results and discussion

3.1.Experimental data

To evaluate the performance of DSEP,we have performed a computational analysis of the PPI network of Saccharomyces cerevisiae, which is the most integral and reliably available single-cell network.We used data from three PPI networks,which are BioGRID,[28]YHQ,[29]and YMBD.The BioGRID data came from the BioGRID database,[28]the YHQ data was constructed by Yuet al.,[29]and the YMBD data comes from the Mark Gerstein Lab website.After removing duplicate edges and self-interaction from the data,the network data are shown in Table 1.The data for standard essential proteins are integrated from the following data sets: MIPS,[30]SGD,[31]DEG,[32]and SGDP.[33]Subcellular localization information data is downloaded from COMPARTMENTS database,[34]which contains 11 types of subcellular localization information.

3.2.Parameter settings

In this paper,all of our parameters are locally optimal solutions obtained in the various modules of Algorithm 1.For example, in the BioGRID data, we chose the parameters for the best performance case of identifying essential proteins,which was to obtain the specific values (proportions) of parametersc1,c2,c3,c4,here we set to 0–10 to traverse,respectively,and bring them into Eq.(3),find the row sums,and rank them in descending order.The parameters that we chose are shown in Table 2.

Table 1.The information contained in the three PPI networks:BioGRID,YHQ,YMBD.

Table 2.Parameters in Algorithm 1.

3.3.Comparative analysis with other methods

To validate the performance of our proposed DSEP algorithm, we comprehensively compared DSEP with 11 other algorithms that excelled in identifying essential proteins.The experiments showed that DSEP outperformed these compared methods.In the following comparison experiments,the compared methods are DC,[5]BC,[6]EC,[7]CC,[8]SC,[9]LAC,[10]NC,[11]UC,[16]Pec,[14]Wdc,[18]and CENC.[19]

3.3.1.Histograms

A histogram can visualize the performance difference of various methods in correctly identifying the number of essential proteins.First, we ranked the DSEP and 11 centrality methods in sorted descending order with their scores.After that,the top 100–600 selected proteins were used as the candidate essential protein set.Finally,the number of true essential proteins in the candidate essential protein set was derived by comparing the standard essential protein set.The results of the experiments are shown in Figs.2–4.

Figure 2 shows the identification results of DSEP with 11 comparison methods on the BioGRID data set.DSEP identified 24, 39, 50, 46, 39, and 50 more essential proteins when compared to the best identification results of these 11 methods in the top 100–600,respectively.

Figure 3 shows the identification results of DSEP with 11 comparison methods on the YHQ data set.DSEP identified 30,59,77,63,76,and 87 more essential proteins when compared to the best identification results of these 11 methods in the top 100–600,respectively.

Figure 4 shows the identification results of DSEP with 11 comparison methods on the YMBD data set.DSEP identified 13,21,35,47,53,and 51 more essential proteins when compared to the best identification results of these 11 methods in the top 100–600,respectively.

This analysis shows that DSEP was significantly better at accurately identifying the number of essential proteins in the top 100–600 of the BioGRID,YHQ,and YMBD data sets than the 11 comparison methods.

Fig.3.Comparison of the number of essential proteins accurately identified by DSEP with 11 other methods on the YHQ data set.

Fig.4.Comparison of the number of essential proteins accurately identified by DSEP with 11 other methods on the YMBD data set.

3.3.2.Six evaluation indicators

To further assess the performance of our algorithm, we have compared DSEP with 11 other methods in a comprehensive manner under six evaluation metrics.These six evaluation metrics are sensitivity (SN), specificity (SP), positive predictive value(PPV),negative predictive value(NPV),F-measure(F),and accuracy(ACC),[16]which are defined as follows:

where TP indicates the amount of protein positively predicted as essential, TN indicates the amount of protein positively predicted as non-essential, FP indicates the amount of nonessential protein falsely predicted as essential, and FN indicates the amount of essential protein falsely predicted as nonessential.

We ranked the scores of the individual proteins in descending order and selected the top 20%of the proteins as the candidate essential protein set and the other 80% of the proteins as the candidate non-essential protein set.From comparing the standard essential protein sets,we were able to obtain the amount of correctly identified essential proteins in the candidate essential protein sets.Among these six statistical metrics,the higher the value of the metric indicates the better the execution of the algorithm.The comparison results of DSEP algorithm with 11 methods in these six indicators are shown in Table 3,and the experiments show that our algorithm is better than these comparison methods.

3.3.3.TheP–Rcurves

TheX-axis of the precision–recall curve(P–Rcurve)represents the recall and theY-axis represents the precision.In theP–Rcurve,the area enclosed by the curve and the axes is represented by AUPR.When the algorithm’s AUPR value is higher, the algorithm has a better performance.Figures 5–7 show theP–Rcurves of our DSEP algorithm and other 11 comparison methods on three different data sets (i.e., BioGRID, YHQ, and YMBD), and it is shown from the AUPR values that our algorithm performs better.The formulas for the Precision and Recall are defined as follows:

3.3.4.ROC curves

The ROC curve is also a good way to evaluate an algorithm’s good or bad performance.The area enclosed by the ROC curve and the coordinate axes is called AUC,and similar to AUPR, the higher the AUC value, the better the algorithm performs in the ROC curve,as shown in Figs.8–10,our algorithm outperforms the other 11 methods in the ROC curve.

Table 3.Comparison of the DSEP algorithm with the other 11 methods in the six statistical indicators on three different data sets.

Fig.5.The P–R curve of the DSEP algorithm and other 11 methods in the BioGRID data set.

Fig.6.The P–R curve of the DSEP algorithm and other 11 methods in the YHQ data set.

Fig.7.The P–R curve of the DSEP algorithm and other 11 methods in the YMBD data set.

Fig.8.ROC curves of the DSEP algorithm and other 11 methods in the BioGRID data set.

Fig.9.ROC curves of the DSEP algorithm and other 11 methods in the YHQ data set.

Fig.10.ROC curves of the DSEP algorithm and other 11 methods in the YMBD data set.

4.Conclusion

In this paper,we propose an essential proteins identification algorithm based on four-order distances and a method to calculate finite-order distance combined with some knowledge of graph theory, which can effectively reduce the time complexity of solving the distance between nodes.To verify the performance of DSEP,we did comparative experiments on BioGRID,YHQ,and YMBD data sets using various evaluation methods against the existing 11 methods.The results of the experiments show that DSEP has an excellent performance.

Acknowledgments

Project supported by the Gansu Province Industrial Support Plan (Grant No.2023CYZC-25), the Natural Science Foundation of Gansu Province (Grant No.23JRRA770), and the National Natural Science Foundation of China (Grant No.62162040).