DTLM-DBP: Deep Transfer Learning Models for DNA Binding Proteins Identification

Computers, Materials & Continua, 2021, Issue 9

Sara Saber,Uswah Khairuddin,Rubiyah Yusof and Ahmed Madani

1 Department of Computer Engineering, Faculty of Engineering, Arab Academy of Science and Technology, Egypt

2 Centre for Artificial Intelligence & Robotics, Malaysia-Japan International Institute of Technology, Universiti Teknologi Malaysia

Abstract: The identification of DNA binding proteins (DNABPs) is considered a major challenge in genome annotation because these proteins are linked to several important applied and research applications of cellular functions, e.g., the study of the biological, biophysical, and biochemical effects of antibiotics, drugs, and steroids on DNA. This paper presents an efficient approach for DNABPs identification based on deep transfer learning, named “DTLM-DBP.” Two transfer learning methods are used in the identification process. The first uses a pre-trained deep learning model as both feature extractor and classifier. Two different pre-trained Convolutional Neural Networks (CNNs), AlexNet-8 and VGG-16, are tested and compared. The second method uses the deep learning model as a feature extractor only and passes the features to a separate classifier. Two classifiers, Support Vector Machine (SVM) and Random Forest (RF), are tested and compared. The performance of the identification process is evaluated in terms of identification accuracy, sensitivity, specificity, and Matthew's correlation coefficient (MCC) on four available DNA protein datasets: PDB1075, PDB186, PDNA-543, and PDNA-316. The results show that the RF classifier with VGG-Net pre-trained deep transfer learning features gives the highest performance. DTLM-DBP was compared with other published methods and provides a considerable improvement in the performance of DNABPs identification.

Keywords: DNABPs; deep transfer learning; AlexNet 8; VGG 16; SVM; RF

1 Introduction

Deoxyribonucleic Acid (DNA) represents the cell blueprint that contains the main information that codes all organisms. DNA can perform its functions with the help of thousands of proteins, which are called DNA binding proteins (DNABPs). DNABPs have several jobs, such as controlling protein production, regulating cell growth, and storing DNA in the nucleus. DNABPs play an important role in the structural composition of DNA. In addition, they regulate and control different cellular processes such as the transcription, replication, recombination, repair, and modification of DNA.

DNABPs identification is considered a major challenge of genome annotation because these proteins are linked to several cellular functions. The identification process may include: identifying the DNABPs (positive samples) from the non-DNABPs (negative samples) [1], identifying the single-stranded DNABPs from the double-stranded DNABPs [2], or identifying the DNABPs from the ribonucleic acid-binding proteins (RNABPs) [3-5]. In this paper, the identification process is formulated as a binary classification problem that separates DNABPs from non-DNABPs. DNABPs are the proteins that have DNA binding domains, and they generally interact with the major groove of B-DNA. Non-DNABPs, on the other hand, are the structural proteins within the chromosomes.

Several experimental techniques can be used for identifying DNABPs, but they are time-consuming and expensive [6]. Therefore, there is a significant need for a suitable and efficient computational method to replace these experimental methods. Recently, several computational and statistical methods have been proposed for DNABPs identification, but most of these methods cannot provide the knowledge base needed for DNABPs identification. With the advancements in machine and deep learning techniques over recent years, several methods based on machine and deep learning have been presented.

Zhu et al. proposed a method for DNABPs identification based on position-specific scoring matrices (PSSM) and a co-occurrence matrix. The results achieved an accuracy of 97.06% for the Yeast dataset, 98.95% for the Human dataset, and 89.69% for the H. pylori dataset [7]. The PSSM with SVM method (PSSM-DT) tested by Xu et al. [8] achieved an accuracy of 79.96% for the PDB1075 dataset and 79.96% for the PDB186 dataset. In addition, the PSSM with RF tested by Waris et al. [9] achieved an accuracy of 92.3% for their tested dataset. Chowdhury et al. proposed a method (iDNAProt-ES) for DNABPs identification by extracting the structural and evolutionary features that feed the SVM predictor. The results achieved an accuracy of 90.18% on the jack-knife test [10]. Xu used the random forest for DNABPs identification. The results achieved an accuracy of 85.57% on the jack-knife test [11].

Zhang et al. proposed a method for DNABPs identification by combining the position-specific frequency matrix and the distance-bigram transformation (PSFM-DBT). The results achieved an accuracy of 81.02% for the PDB1075 dataset and 80.65% for the PDB186 dataset [12]. Zhang et al. built features from a fusion of evolutionary, structural, and physicochemical features for DNABPs identification, and used binary firefly optimization to remove the redundant features. The results achieved an accuracy of 91% for the DNA dataset, and 0.80.9% for the PDB186 dataset [13]. Ma et al. proposed a method for DNABPs identification based on selecting hybrid features using the random forest. The results achieved an accuracy of 89.56% for the Mainsett dataset [14].

Moreover, Shen et al. used the multi-scale local average blocks approach for DNABPs identification. The results achieved an accuracy of 91.80% for the PDNA-543 dataset, 92.06% for the PDNA-41 dataset, 90.23% for the PDNA-316 dataset, and 77.6% for the PDNA-52 dataset [15]. Krishna et al. proposed a DNABPs identification method (DNA-Prot) by incorporating evolutionary features into the pseudo-amino acid composition. The results achieved an accuracy of 81.83% for the DNA-Prot dataset and 61.42% for the DNAbinder dataset [16]. This method was modified by adding the grey model and named iDNA-Prot [17]. Fu et al. applied the same method on the jack-knife test and an independent test, achieving accuracies of 89.77% and 88.71%, respectively [18]. Wei et al. [19] used RF for training and called the model Local-DPP. Moreover, Liu et al. [20] used SVM and called the method iDNAPro-PseAAC. This method was improved through dimension reduction by Liu et al. [21] and renamed iDNA-Prot-dis. The concept of Pse-AAC was applied in other models called DNAbinder [22], PseDNA-Pro [23], and DPP-PseAAC [24]. Biological information was added by Zaman et al. [25] in a method named HMMBinder.

Szilagyi et al. presented a method for DNABPs identification (DNABIND) based on the amino acid proportions in the sequence of the protein. The results achieved an accuracy of 67.70% for the PDB186 dataset [26]. Gao et al. presented a threading-based method for DNABPs identification (DNA-Threader). The results achieved an accuracy of 59.7% for the PDB186 dataset [27]. Lou et al. presented a DNABPs identification method (DBPPred) based on hybrid feature selection using RF and Gaussian naive Bayes. The results achieved an accuracy of 76.90% for the PDB186 dataset [28].

Zhang et al. proposed a DNABPs identification method using bootstrap multiple CNN. The results achieved an accuracy of 90.77% for the PDNA-543 dataset and 91.04% for the PDNA-316 dataset [29]. They also used long short-term memory together with CNN. The results achieved an accuracy of 81.83% for the DNA-Prot dataset and 89.19% for the Chip-seq dataset [30]. Liu et al. proposed a method for DNABPs identification by combining the auto-cross covariance with ensemble learning (iDNA-KACC). The results achieved an accuracy of 75.16% for the tested dataset [31]. Qu et al. proposed a method for DNABPs identification using mixed feature representation methods. The results achieved an accuracy of 77.43% for the PDB1075 dataset and 81.58% for the PDB186 dataset [32]. Hu et al. [33] combined the sequence features with multiple SVMs and named the method TargetDNA. Si et al. [34] presented a meta-based DNABPs identification method and named it MetaDBSite.

The main contribution of this paper is the testing and adaptation of pre-trained deep transfer learning models for DNABPs sequence identification. The paper presents a novel approach for DNABPs identification using deep transfer learning. In this approach, two transfer learning methods were tested and compared. In the first method, the pre-trained deep CNN model (AlexNet-8 or VGG-16) was used as both feature extractor and classifier. In the second method, the deep learning model was used as a feature extractor only, while the classifier was either the SVM or RF. The performance of the identification process was evaluated in terms of identification accuracy, sensitivity, specificity, and MCC on four available DNA protein datasets: PDB1075, PDB186, PDNA-543, and PDNA-316. The results show that the RF classifier with VGG-Net pre-trained deep transfer learning features produced the highest performance. DTLM-DBP was compared with the other published methods and found to represent a considerable improvement in the performance of DNABPs identification. The remainder of the paper is organized as follows: the second section presents the proposed methodology, the third section gives the results, and the conclusion is given in the last section.

2 Materials and Methods

The general block diagram of the DNABPs identification process in this paper is shown in Fig. 1. Two transfer learning methods were carried out. In the first method, the protein sequences were adapted to CNN models using 1D convolution layers, then one of the pre-trained deep CNN models was used as both feature extractor and classifier. In the second method, the deep learning model was used as a feature extractor only, while the classifier was either the SVM or RF. More details about each block are presented in the following subsections.

Figure 1:DNABPs identification process

2.1 Datasets

There are several publicly available protein sequence datasets, most of which were collected from the Protein Data Bank (PDB). The researchers collected the sequence data from the PDB website by searching for terms such as ‘DNA binding,’ ‘DNA protein,’ and other related terms. Then, certain processing procedures were undertaken to avoid the inclusion of redundant data, and finally, the obtained datasets were used in the research. To guarantee the reliability of the proposed approach and to allow performance comparison, this work used pre-collected, publicly available datasets that have been used by several researchers in the literature.

The experimental work was implemented on four different DNABPs datasets: PDB1075, PDB186, PDNA-543, and PDNA-316. The PDB1075 dataset was collected by Liu et al. [21] and included 1,075 protein samples; 525 samples were positive DNABPs and 550 samples were negative non-DNABPs. The PDB186 dataset was collected by Lou et al. [28] and included 186 protein samples; 93 samples were positive DNABPs and 93 samples were negative non-DNABPs. The PDNA-543 dataset was collected by Hu et al. [33] and included 144,544 protein samples; 9,549 samples were positive DNABPs and 134,995 samples were negative non-DNABPs. The PDNA-316 dataset was collected by Si et al. [34] and included 72,718 protein samples; 5,609 samples were positive DNABPs and 67,109 samples were negative non-DNABPs.
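The class counts quoted above differ sharply between the two PDB datasets and the two PDNA datasets. The short sketch below only re-derives the totals and the positive fraction from those counts; `DATASETS` and `class_balance` are illustrative names.

```python
# (positives, negatives) as quoted in the text for each dataset.
DATASETS = {
    "PDB1075": (525, 550),
    "PDB186": (93, 93),
    "PDNA-543": (9549, 134995),
    "PDNA-316": (5609, 67109),
}


def class_balance(positives, negatives):
    """Return (total sample count, fraction of positive samples)."""
    total = positives + negatives
    return total, positives / total


for name, (pos, neg) in DATASETS.items():
    total, frac = class_balance(pos, neg)
    print(f"{name}: {total} samples, {100 * frac:.1f}% positive")
```

PDB1075 and PDB186 are close to balanced, while positives make up roughly 7% of PDNA-543 and 8% of PDNA-316. On such imbalanced data a predictor that always answers "negative" would already score above 90% accuracy, which is one reason sensitivity and MCC are reported alongside accuracy.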

2.2 Deep Transfer Learning Models

In this paper, two pre-trained deep transfer learning models, AlexNet and VGG-Net, were adapted for the identification of DNABPs sequences. The architecture of each model is presented below. These two models were selected from the large number of pre-trained deep transfer learning models because, according to the literature, they are among the most successful models for identification tasks, while their architectures are simple and contain different numbers of convolution layers.

2.2.1 AlexNet-8 Pre-Trained Deep Transfer Learning Model

AlexNet-8 is a CNN that is 8 layers deep and was introduced by Krizhevsky et al. [35]. AlexNet-8 has 60 million parameters and 650,000 neurons. It consists of 8 layers (5 convolutional and 3 fully connected), as shown in the model architecture in Fig. 2 [36]. The first and second convolutional layers are each followed by normalization and a max-pooling layer, the third and fourth convolutional layers are connected directly, and the last convolutional layer is followed by a max-pooling layer. The output of the last convolutional stage passes through two fully connected layers, and the output of the second is fed into the final fully connected layer with the SoftMax classifier.
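To make the layer sequence concrete, the sketch below traces the feature-map size through the convolution and pooling stages using the usual formula floor((n + 2p - k) / s) + 1. The kernel, stride, and padding values are those of the commonly reproduced image-classification AlexNet with a 227-pixel input; they are assumptions for illustration and are not given in this paper, whose 1D adaptation may use different values. The same formula applies along the single axis of a 1D convolution.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Size along one axis after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding): assumed values for the standard image AlexNet.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace_sizes(input_size, layers):
    """Return [(layer name, output size)] for a stack of layers."""
    sizes = []
    n = input_size
    for name, kernel, stride, padding in layers:
        n = conv_output_size(n, kernel, stride, padding)
        sizes.append((name, n))
    return sizes


for name, size in trace_sizes(227, ALEXNET_LAYERS):
    print(name, size)
```

Under these assumptions the last pooling stage yields a 6 x 6 map per channel, which is what the fully connected layers receive after flattening.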

Figure 2:AlexNet-8 model architecture

2.2.2 VGG-16 Pre-Trained Deep Transfer Learning Model

VGG-16 is a CNN model that is 16 layers deep and was introduced by Simonyan and Zisserman in 2014 [37,38]. According to the literature, VGG-16 offers a considerable improvement over AlexNet in several applications because it learns rich feature representations that can be used for a wide range of applications. The VGG-16 model architecture is shown in Fig. 3 [39]. It consists of 16 weight layers: 13 convolutional layers followed by 3 fully connected layers.

Figure 3:VGG-16 model architecture

2.3 Classifiers

The DNA protein identification process is mainly a binary classification problem between two classes. The first class is the DNABPs, which have DNA binding domains and interact with the DNA. The second class is the non-DNABPs, such as the structural proteins within the chromosomes. Several classifiers are suitable for binary classification; the most commonly used classifiers for DNA protein identification are SVM [8,22,33] and RF [14,16,17,28]. In this paper, both classifiers were used and compared.

2.3.1 SVM Classifier

SVM is a set of related supervised-learning models introduced by Cortes et al. [40]. It minimizes the identification error while maximizing the geometric margin. SVMs are widely regarded as well suited to binary linear classification [40-43]. An SVM handles a two-class problem by dividing the data with a separating hyperplane, as shown in Fig. 4.

Figure 4:SVM separating hyperplanes

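In Fig. 4, let the training sequences be represented by $\{x_i, y_i\}$, $i = 1, \ldots, l$, with $y_i \in \{-1, +1\}$ and $x_i \in \mathbb{R}^d$. The points $x$ that lie on the separating hyperplane satisfy $x \cdot w + b = 0$, where $w$ is normal to the hyperplane. This can be formulated as [44]:

$$x_i \cdot w + b \geq +1 \quad \text{for } y_i = +1 \tag{1}$$

$$x_i \cdot w + b \leq -1 \quad \text{for } y_i = -1 \tag{2}$$

which combine into the single constraint $y_i (x_i \cdot w + b) - 1 \geq 0$ for all $i$.

The primal Lagrangian is given as [44]:

$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \tag{3}$$

where $\alpha_i$, $i = 1, \ldots, l$ are the positive Lagrange multipliers and $\|w\|$ is the Euclidean norm of $w$.

Minimizing $L_P$ with respect to $w$ and $b$ gives the conditions:

$$w = \sum_{i=1}^{l} \alpha_i y_i x_i \tag{4}$$

$$\sum_{i=1}^{l} \alpha_i y_i = 0 \tag{5}$$

Using Eqs. (3)-(5), the dual Lagrangian will be:

$$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \tag{6}$$

The mapping of the training vectors $x_i$ into a higher dimensional space uses a function called the kernel function, $K(x_i, x_j) \equiv \Phi(x_i)^T \Phi(x_j)$. There are several SVM kernel functions, such as:

Linear kernel:

$$K(x_i, x_j) = x_i^T x_j \tag{7}$$

Polynomial kernel:

$$K(x_i, x_j) = (\gamma \, x_i^T x_j + r)^d, \quad \gamma > 0 \tag{8}$$

RBF kernel:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \tag{9}$$

Sigmoid kernel:

$$K(x_i, x_j) = \tanh(\gamma \, x_i^T x_j + r) \tag{10}$$

where $\gamma$, $r$ and $d$ are kernel parameters.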
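The four kernels named above can be sketched in a few lines of plain Python, using their commonly cited forms: linear x_i·x_j, polynomial (γ x_i·x_j + r)^d, RBF exp(-γ‖x_i - x_j‖²), and sigmoid tanh(γ x_i·x_j + r). This is an illustrative sketch only; the function names and default parameter values are assumptions, and it is not the Matlab toolbox code used in the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(xi, xj) + r) ** d


def rbf_kernel(xi, xj, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(xi, xj) + r)


# Two toy feature vectors (hypothetical values).
a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))                           # 11.0
print(polynomial_kernel(a, b, gamma=1.0, r=1.0, d=2))  # 144.0
print(rbf_kernel(a, a))                              # 1.0
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the vectors move apart, with γ controlling how quickly; the linear kernel is the special case of the polynomial kernel with γ = 1, r = 0, and d = 1.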

Figure 5:RF algorithm

The DNA protein sequence identification was carried out using the SVM Matlab Toolbox with different kernel functions: linear, polynomial, RBF, and sigmoid. DNABPs identification using SVM can be carried out in two steps. The first step is building the identification simulation model, while the second step is the feature matching for the model performance evaluation. In the modelling step, the features related to the DNA protein sequences are stored. When a tested sequence arrives, its features are matched with the stored features in the model, and the identification decision is taken based on the matching process.

2.3.2 RF Classifier

RF is a collection of decision trees introduced by Ho [45]; each tree is grown on a subset of all the possible attributes of the input feature vectors [46]. It builds an ensemble of randomized trees from the input features, and the final identification decision is obtained by combining the results from the trees via voting, as shown in Fig. 5.
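The two ingredients described here, a random subset of attributes per tree and a vote over the trees' outputs, can be illustrated without training any trees. The sketch below is a simplified illustration under assumed names (`random_feature_subset`, `majority_vote`) and hypothetical tree outputs; it is not a full random forest implementation.

```python
import random
from collections import Counter


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree (sorted for readability)."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties go to the label that appears first in tree_predictions.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
subset = random_feature_subset(n_features=10, k=3, rng=rng)
votes = [1, 0, 1, 1, 0]  # hypothetical outputs of five trees (1 = DNABP)
print(subset, majority_vote(votes))
```

With three of the five hypothetical trees voting 1, the ensemble decision is 1 (DNABP). Using an odd number of trees avoids ties in a two-class problem.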

3 Results and Discussions

3.1 Performance Evaluation Metrics

The performance of a DNA protein sequence identification system is normally evaluated using a range of performance metrics, such as identification accuracy, sensitivity, specificity, and Matthew's correlation coefficient (MCC). These metrics can be calculated from four counts obtained by testing the identification system on a given dataset. The system tests whether each protein sequence is a DNABP (positive sample) or a non-DNABP (negative sample). For each DNABP tested, a positive result means that the system identified it correctly (True), and accumulating these results over all the tested protein sequences in the dataset gives the number $T_p$. A negative result means that the system identified it incorrectly (False), and the accumulation gives the number $F_n$. For each non-DNABP tested, a positive result means that the system identified it incorrectly (False), and the accumulation gives the number $F_p$. A negative result means that the system identified it correctly (True), and the accumulation gives the number $T_n$. Using these four numbers, it is possible to calculate:

1. Accuracy

$$\mathrm{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \times 100\%$$

2. Sensitivity

$$\mathrm{Sensitivity} = \frac{T_p}{T_p + F_n} \times 100\%$$

3. Specificity

$$\mathrm{Specificity} = \frac{T_n}{T_n + F_p} \times 100\%$$

4. Matthew's correlation coefficient

$$\mathrm{MCC} = \frac{T_p T_n - F_p F_n}{\sqrt{(T_p + F_p)(T_p + F_n)(T_n + F_p)(T_n + F_n)}}$$

The accuracy, sensitivity, and specificity are percentages, while the MCC ranges from -1 to +1; a perfect classifier gives 100% for the first three metrics and an MCC of +1.
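The four metrics follow directly from the counts Tp, Tn, Fp, and Fn using their standard definitions: accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn), sensitivity = Tp / (Tp + Fn), specificity = Tn / (Tn + Fp), and MCC = (Tp·Tn - Fp·Fn) / sqrt((Tp + Fp)(Tp + Fn)(Tn + Fp)(Tn + Fn)). The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity (in %) and MCC (in [-1, 1])."""
    def percent(numerator, denominator):
        # Guard against empty classes so the function never divides by zero.
        return 100.0 * numerator / denominator if denominator else 0.0

    accuracy = percent(tp + tn, tp + tn + fp + fn)
    sensitivity = percent(tp, tp + fn)
    specificity = percent(tn, tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for a test set of 100 DNABPs and 100 non-DNABPs.
example = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print({name: round(value, 3) for name, value in example.items()})
```

For these counts the accuracy is 87.5%, the sensitivity 90%, the specificity 85%, and the MCC about 0.751. Unlike accuracy, the MCC uses all four counts symmetrically, so it stays informative when the positive and negative classes differ greatly in size.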

3.2 Deep Transfer Learning Models

This section presents the results of the first DNABPs identification method, which uses the pre-trained deep transfer learning models as both feature extractor and classifier. Two pre-trained deep transfer learning models, AlexNet and VGG-Net, were tested and compared in terms of identification accuracy, sensitivity, specificity, and MCC for the four examined DNA protein datasets, as shown in Tab. 1.

Table 1:Performance comparison between deep transfer learning models

The results in Tab. 1 show that the VGG-Net 16 pre-trained deep transfer learning model gives higher performance than AlexNet. This may be because the 16-layer VGG-Net is deeper than the 8-layer AlexNet and therefore learns a richer set of feature representations.

3.3 Classifiers Tuning

This section presents the results of the second DNABPs identification method, which uses the pre-trained deep transfer learning models as feature extractors only. The extracted features are passed to one of two classifiers (SVM or RF) for the identification process. The identification accuracy, sensitivity, specificity, and MCC for the four examined DNA protein datasets are shown in Tab. 2.

The results in Tab. 2 show that the RF classifier with VGG-Net pre-trained deep transfer learning features gives the highest performance compared to the other approaches.

3.4 Performance Comparison with Existing Methods

The performance of the proposed DNABPs identification method (DTLM-DBP) was compared with the other published methods for the four available DNA protein datasets: PDB1075, PDB186, PDNA-543, and PDNA-316. For the PDB1075 dataset, DTLM-DBP was compared with DNAbinder [22], DNA-Prot [16], iDNA-Prot [17], iDNA-Prot-dis [21], PSSM-DT [8], PseDNA-Pro [23], iDNAPro-PseAAC [20], PSFM-DBT [12], Mixed Feature [32], Local-DPP [19], iDNAProt-ES [10], HMMBinder [25], iDNA-KACC [31], and DPP-PseAAC [24], as shown in Tab. 3.

The results in Tab. 3 show that the proposed method gives a better performance than the other published methods.

For the PDB186 dataset, DTLM-DBP was compared with DNABIND [26], DNAbinder [22], DNA-Threader [27], DNA-Prot [16], DBPPred [28], iDNA-Prot [17], iDNA-Prot-dis [21], PSSM-DT [8], iDNAPro-PseAAC [20], Mixed Feature [32], PseDNA-Pro [23], iDNAProt-ES [10], PSFM-DBT [12], Local-DPP [19], HMMBinder [25], DPP-PseAAC [24], and iDNA-KACC [31], as shown in Tab. 4. The results show the superiority of the proposed method over the other published methods.

Table 2:Performance comparison between SVM and RF classifiers

Table 3:Comparison of DTLM-DBP with previous methods for PDB 1075 dataset

For the PDNA-543 dataset, DTLM-DBP was compared with TargetDNA [33], EC-RUS [15], and Bootstrap [30], as shown in Tab. 5.

Table 4:Comparison of DTLM-DBP with previous methods for PDB 186 dataset

Table 5:Comparison of DTLM-DBP with previous methods for PDNA-543 dataset

For the PDNA-316 dataset, DTLM-DBP was compared with MetaDBSite [34], TargetDNA [33], EC-RUS [15], and Bootstrap [30], as shown in Tab. 6.

The results confirmed the efficacy and viability of the proposed method for different datasets.

Table 6:Comparison of DTLM-DBP with previous methods for PDNA-316 dataset

4 Conclusions

The paper presented an efficient new approach for DNABPs identification based on deep transfer learning, “DTLM-DBP.” The protein sequences were adapted to CNN models using 1D convolution layers, then the VGG-Net 16 pre-trained deep transfer learning model was used as a feature extractor. Finally, the RF classifier was used for sequence feature matching. DTLM-DBP was tested using different DNA protein datasets and compared with the other published DNABPs identification methods, and it provided a considerable improvement in the performance of DNABPs identification.

Acknowledgement:We would like to acknowledge Universiti Teknologi Malaysia for their support via the 2020-2021 Industry-International Incentive Grant which funded this publication.

Funding Statement: This paper was funded under the 2020-2021 Industry-International Incentive Grant by Universiti Teknologi Malaysia (Grant Number: Q.K130000.3043.02M12), which was granted to U. Khairuddin, F. Behrooz and R. Yusof.

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.