Salman Khan, Mukhtaj Khan2, Nadeem Iqbal, Mohd Amiruddin Abd Rahman and Muhammad Khalis Abdul Karim
1Department of Computer Science, Abdul Wali Khan University Mardan, Khyber Pakhtunkhwa, 23200, Pakistan
2Department of Information Technology, University of Haripur, Khyber Pakhtunkhwa, 22620, Pakistan
3Faculty of Science, Universiti Putra Malaysia, UPM Serdang, 43400, Malaysia
Abstract: Piwi-interacting ribonucleic acids (piRNAs) are a well-known subclass of small non-coding RNA molecules that are mainly responsible for maintaining genome integrity, regulating gene expression, and maintaining germline stem cells by suppressing transposable elements. piRNAs can be used for the diagnosis of multiple tumor types and for drug development. Owing to these vital roles, the identification of piRNAs has become an important area of research in computational biology. This paper proposes a two-layer predictor that uses deep learning methods to improve the prediction of piRNAs and their functions. The proposed model applies several feature extraction methods to capture both the structural information and the physicochemical properties of the biological sequences. The model is extensively evaluated using the k-fold cross-validation method. The evaluation results show that the proposed predictor performs better than the existing models, with accuracy improvements of 7.59% and 2.81% at layer I and layer II, respectively. It is anticipated that the proposed model could be a beneficial tool for cancer diagnosis and precision medicine.
Keywords: Deep neural network; DNC; TNC; CKSNAP; PseDPC; cancer discovery; Piwi-interacting RNAs; Deep-piRNA
Piwi-interacting ribonucleic acids (piRNAs) are a unique class of small non-coding RNA (sncRNA) molecules that are expressed in a variety of human somatic cells and also in animal cells [1]. A piRNA molecule contains 19-33 nucleotides, which makes it slightly different in length from small interfering RNA and micro-RNA molecules [2]. So far, over 1.5 million unique piRNA molecules have been discovered in fruit flies, clustered in thousands of genomic loci [3]. In addition, piRNA molecules have been discovered in rats, mice, fish, mammals and somatic tissues [4]. It has been reported that piRNA molecules play an important role in many gene functions, such as translating specific proteins, regulating gene expression, fighting viral infection, maintaining genome integrity and silencing transposons [5,6]. In addition, the piRNA molecules move within the genome and induce mutations, insertions, and deletions, which may cause genome instability [7]. Moreover, many studies (e.g., [8-11]) have shown that the occurrence of piRNAs is related to multiple tumor types and that piRNAs are actively involved in the development and progression of cancer cells. Hence, there is great interest in identifying and classifying piRNA molecules and their function types for cancer diagnosis and therapy, drug development and gene stability. Owing to the importance of piRNA molecules in the genome, the prediction and identification of piRNA molecules has become an important research area in computational biology [12,13].
In the literature, numerous computational models have been proposed for distinguishing piRNA from non-piRNA sequences using machine learning algorithms. For example, Zhang et al. [14] developed a k-mer based piRNA predictor, and Wang et al. [15] developed a piRNA annotation program that predicts piRNA sequences using SVM as the learning algorithm. Luo et al. [16] and Li et al. [17] demonstrated ensemble approaches for the prediction of piRNA and non-piRNA molecules using physicochemical properties. Recently, Wang et al. [18] proposed a convolutional neural network based computational method for the identification of piRNA molecules; the authors employed k-mers (k = 1 to 5) to represent the sequences. It is worth mentioning that these models ignore whether the identified piRNA molecules are functional or non-functional. To identify piRNAs and their functions, Liu et al. [19] proposed the 2L-piRNA ensemble predictor, which employs the PseKNC method to extract a feature vector along with the physicochemical properties. Similarly, Chen et al. [20] developed an SVM-based predictor called piRNAPred for the prediction of piRNAs and their function types. piRNAPred uses sequence information along with structure information, i.e., physicochemical properties, to represent an RNA sequence as a feature vector. Both models yielded promising results in terms of accuracy; however, they are based on traditional machine learning algorithms, which struggle to accurately predict piRNA sequences and their function types because of the high similarity between piRNAs and non-piRNAs. In addition, these models require a large amount of human expertise to extract the predominant features [21,22].
Recently, Khan et al. proposed two different computational models, known as 2L-piRNADNN [23] and piRNA(2L)-PseKNC [24], for the identification of piRNAs and their functions using multilayer deep neural network algorithms. Both models produced promising results in terms of performance. However, the 2L-piRNADNN model uses a simple dinucleotide auto covariance (DAC) method, and the piRNA(2L)-PseKNC model employs different modes of the pseudo K-tuple nucleotide composition (PseKNC) method with different values of K (i.e., K = 1, 2, 3) for feature extraction, which ignores global sequence order information.
In this paper, we propose a powerful and robust multi-layer deep neural network (DNN) model [25] based on Chou’s 5-step rule [26]. The proposed Deep-piRNA model is constructed to improve the prediction accuracy of piRNAs and their functions. The framework of the proposed model is depicted in Fig. 1. First, Deep-piRNA employs four different feature extraction methods, namely normalized Moreau-Broto autocorrelation (NMBroto), Z curve parameters for frequencies of phase-independent di-nucleotides (Z_curve_12bit), dinucleotide composition (DNC) and single nucleotide composition (SNC), to formulate RNA sequences as feature vectors. Second, a composite feature vector is constructed by combining all the feature vectors. Finally, the DNN algorithm is applied as the prediction engine to build the proposed computational model. The performance of Deep-piRNA is extensively evaluated using rigorous K-fold cross-validation tests. The experimental results show that the proposed model outperforms the existing prediction models in terms of accuracy and other performance measures. It is anticipated that the proposed technique could be a useful tool for cancer diagnosis and drug development.
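The two-layer decision flow can be illustrated with a minimal Python sketch. It assumes that each layer is available as a callable returning the probability of the positive class; the names `predict_two_layer`, `layer1`, `layer2` and the 0.5 threshold are hypothetical and are not taken from the authors' source code.

```python
from typing import Callable, Sequence

# Hypothetical interface: a classifier maps a feature vector to P(positive class).
Classifier = Callable[[Sequence[float]], float]


def predict_two_layer(
    features: Sequence[float],
    layer1: Classifier,
    layer2: Classifier,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> str:
    """Cascade two binary classifiers.

    Layer I separates piRNA from non-piRNA sequences.
    Layer II is consulted only for sequences that layer I accepts as piRNA,
    and separates functional from non-functional piRNAs.
    """
    if layer1(features) < threshold:
        return "non-piRNA"
    if layer2(features) >= threshold:
        return "functional piRNA"
    return "non-functional piRNA"
```

In this sketch a sequence rejected at layer I never reaches layer II, so layer II errors can only affect sequences already predicted to be piRNAs.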
The remainder of the paper is structured as follows: Section 2 explains the materials and methods, including the benchmark dataset, feature extraction and classification algorithms. The performance evaluation metrics are presented in Section 3. Section 4 discusses the experimental findings. Finally, Section 5 presents the conclusion and future work.
According to Chou’s comprehensive reviews [27,28], a valid and reliable benchmark dataset is always required for the design of a powerful and robust computational model. Hence, in this paper we used the same benchmark datasets that were used in [17,19]. The selected datasets can be expressed in mathematical form using Eqs. (1) and (2):
$$D = D_p \cup D_N \tag{1}$$
$$D_p = D_{inst} \cup D_{non\_inst} \tag{2}$$
where $D$ denotes the combination of piRNA and non-piRNA sequences, $D_p$ denotes the subset of piRNA sequences and $D_N$ represents the subset of non-piRNA sequences. In addition, $D_{inst}$ denotes the subset of functional piRNA samples and $D_{non\_inst}$ represents the subset of non-functional piRNA samples. To construct a valid benchmark dataset, we first downloaded the piRNA samples from piRBase [29] and the non-piRNA samples from [23,30]. Second, CD-HIT software was applied with a threshold value of 80% to eliminate highly similar sequences [31]. Third, we employed a random sampling technique to select the same number of positive samples as negative samples in order to balance the benchmark dataset [32,33]. Finally, we obtained a benchmark dataset containing a total of 2836 sequences, of which 1418 are piRNA sequences and 1418 are non-piRNA sequences. Moreover, the piRNA sequences are further divided into the $D_{inst}$ and $D_{non\_inst}$ subsets, each of which contains 709 sequences. Additionally, we examine the generalization capability and effectiveness of the proposed model using an independent dataset. The independent dataset $D_2$ was defined as:
$$D_2 = D_P^{IND} \cup D_N^{IND} \tag{3}$$
$$D_P^{IND} = D_{inst}^{IND} \cup D_{non\_inst}^{IND} \tag{4}$$
where $D_P^{IND}$ represents the piRNA samples, which were extracted from the piRBase database, and $D_N^{IND}$ represents the non-piRNA samples extracted from the NONCODE database [34]. Further, CD-HIT software was applied with a threshold value of 80% to eliminate highly similar sequences [31]. The resulting non-redundant dataset consists of 500 piRNA sequences and the same number of non-piRNA sequences. Furthermore, we split the piRNA samples equally into functional and non-functional piRNA samples, which are represented as $D_{inst}^{IND}$ and $D_{non\_inst}^{IND}$, respectively, in Eq. (4). There are 250 functional piRNA samples and the same number of non-functional piRNA samples.
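The balancing step used when building the benchmark dataset (randomly selecting as many positive samples as negative ones) can be sketched as follows. This is a minimal illustration under the assumption of simple random undersampling; the function name `balance_by_undersampling` and the fixed seed are hypothetical, and the redundancy-removal step performed with CD-HIT is not shown.

```python
import random
from typing import List, Sequence, Tuple


def balance_by_undersampling(
    positives: Sequence[str],
    negatives: Sequence[str],
    seed: int = 0,  # assumed seed, only to make the sketch reproducible
) -> Tuple[List[str], List[str]]:
    """Randomly draw the same number of samples from both classes.

    The size of the smaller class determines how many samples are kept.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    # random.Random.sample draws without replacement.
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```

For example, five positive and three negative sequences would be reduced to three of each.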
In machine learning, extracting all the relevant details from an RNA sequence, including the sequence-order information and the main structural characteristics, is a crucial step. The efficiency of any proposed algorithm largely depends on how effectively relevant information is derived from the raw data, which are the RNA samples in our case. Selecting a suitable feature extraction technique leads to an optimized model that yields precise predictions from the most favorable features [35,36]. The RNA sequences must be translated into a vector or discrete model because the classifier operates on discrete numerical representations, not on the sequences directly [37]. Following the second rule of Chou’s 5-step guidelines, in this paper we utilize four different feature extraction techniques, i.e., NMBroto, Z curve-12-bit, SNC and DNC. These feature extraction methods are explained in the following sections.
2.2.1 Z Curve-12-Bit Method
The Z curve is a three-dimensional (i.e., X, Y, Z) curve that represents an RNA sequence in a unique way. The Z curve-12-bit descriptor considers the frequencies of di-nucleotides, denoted by $f(a,b)$, where $a, b \in \{A, C, G, U\}$. This descriptor can be calculated using Eq. (5).
where $n \in \{A, C, G, U\}$.
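As an illustration, the sketch below computes a 12-dimensional Z curve descriptor from dinucleotide frequencies. It assumes the commonly used phase-independent dinucleotide form, in which, for each first base n, x_n contrasts purine (A, G) with pyrimidine (C, U) second bases, y_n contrasts amino (A, C) with keto (G, U), and z_n contrasts weak (A, U) with strong (G, C) hydrogen bonding. This form is an assumption and may differ in detail from Eq. (5); the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGU"


def dinucleotide_freqs(seq: str) -> Dict[str, float]:
    """Return the relative frequency f(a, b) of every dinucleotide ab."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = {a + b: 0 for a, b in product(BASES, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous symbols such as N
            counts[pair] += 1
            total += 1
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


def z_curve_12bit(seq: str) -> List[float]:
    """Return [x_A, y_A, z_A, x_C, y_C, z_C, x_G, y_G, z_G, x_U, y_U, z_U]."""
    f = dinucleotide_freqs(seq)
    features: List[float] = []
    for n in BASES:
        x = (f[n + "A"] + f[n + "G"]) - (f[n + "C"] + f[n + "U"])
        y = (f[n + "A"] + f[n + "C"]) - (f[n + "G"] + f[n + "U"])
        z = (f[n + "A"] + f[n + "U"]) - (f[n + "G"] + f[n + "C"])
        features.extend([x, y, z])
    return features
```

Four first bases times three coordinates give the 12 features that the descriptor name refers to.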
2.2.2 Normalized Moreau Broto(NMB)Method
The normalized Moreau-Broto autocorrelation method is a type of topological feature encoder, also known as a molecular connectivity index, that expresses the degree of correlation between two nucleotides in terms of their structural or physicochemical properties. The normalized Moreau-Broto autocorrelation method is widely used in studies [38,39] and is defined in Eq. (6).
where j indicates the descriptor, i represents the position in the RNA sequence, and n and lag represent the length of the RNA sequence and the sequential distance between residues, respectively.
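A minimal sketch of this descriptor is given below. It assumes the usual normalized form, in which the sum of products P_i × P_(i+lag) over positions i = 1 … n − lag is divided by n − lag; this may differ in detail from Eq. (6). The property table `HYPOTHETICAL_PROPERTY` contains placeholder values for illustration only, and real tools typically use standardized physicochemical indices, often defined on dinucleotides.

```python
from typing import Dict, List, Sequence

# Placeholder property values for illustration only; not real physicochemical data.
HYPOTHETICAL_PROPERTY: Dict[str, float] = {"A": 0.1, "C": 0.2, "G": 0.3, "U": 0.4}


def encode(seq: str, prop: Dict[str, float]) -> List[float]:
    """Map each nucleotide to its property value."""
    return [prop[base] for base in seq.upper().replace("T", "U")]


def nmbroto(values: Sequence[float], max_lag: int) -> List[float]:
    """Normalized Moreau-Broto autocorrelation for lag = 1 .. max_lag.

    For each lag, sum the products of property values that are `lag`
    positions apart and divide by the number of such pairs (n - lag).
    """
    n = len(values)
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        if n - lag <= 0:  # sequence too short for this lag
            features.append(0.0)
            continue
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features
```

With one property and `max_lag` lags the sketch yields `max_lag` features; using several properties would multiply the dimension accordingly.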
2.2.3 Pseudo K-Tuple Nucleotide Composition(PseKNC)
PseKNC is a feature formulation method that is widely applied in computational biology for the formulation of RNA/DNA sequences [40,41]. In this paper, we applied PseKNC with two values of K, i.e., PseSNC (K = 1) and PseDNC (K = 2), to formulate the RNA sequence as a discrete feature vector. In PseSNC, the RNA sequence is expressed in terms of single nucleotides, which gives a 4-D vector, while in PseDNC the RNA sequence is expressed in terms of pairs of consecutive nucleotides, which gives a 16-D vector [42]. It can be numerically expressed as:
where the symbol T represents the transpose operator.
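The 4-D and 16-D vectors described above can be illustrated with a short k-mer composition sketch. It covers only the normalized occurrence frequencies of single nucleotides (K = 1) and dinucleotides (K = 2); any additional pseudo components of a full PseKNC formulation are omitted, and the function name `kmer_composition` is hypothetical.

```python
from itertools import product
from typing import List


def kmer_composition(seq: str, k: int) -> List[float]:
    """Return the normalized frequency of every k-mer over the alphabet ACGU.

    k = 1 gives a 4-dimensional vector, k = 2 a 16-dimensional vector.
    The k-mers are ordered lexicographically (A, C, G, U).
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(max(len(seq) - k + 1, 0)):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore windows with ambiguous symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]
```

For the sequence `AACC`, the K = 1 vector is `[0.5, 0.5, 0.0, 0.0]`, since half of the nucleotides are A and half are C.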
The proposed Deep-piRNA model utilizes four different feature extraction techniques, i.e., PseSNC, PseDNC, normalized Moreau-Broto autocorrelation and Z curve, to represent the RNA sequences. Tab. 1 shows the number of features obtained for each method. Using Eq. (10), we merged all four feature vectors (i.e., Eqs. (5) and (6) and Eqs. (8) and (9)) to create a composite feature vector.
Table 1: Dimension of the feature vector for different values of K
A DNN is a class of machine learning algorithm in artificial intelligence that is inspired by the working mechanism of the human brain and its activities [43]. A DNN model topology consists of an input layer, an output layer and multiple hidden layers, as shown in Fig. 2. The hidden layers are important elements of the DNN model and are actively engaged in the learning process [44]. Using more hidden layers may increase the model’s efficiency; however, it may also give rise to problems such as higher computational cost, greater model complexity and overfitting [45]. The main characteristic of the DNN model is that it automatically extracts appropriate features from an unlabeled or unstructured dataset using standard learning methods, without requiring human engineering and expertise [46]. Several researchers have shown that DNN models work better than traditional learning methods on various complex classification problems [47]. In addition, DNN models have been successfully applied in many areas, including bio-engineering [48], image recognition [49], speech recognition [50] and natural language processing [51].
Figure 2: DNN configuration topology; the circles represent neurons at each layer
Inspired by the successful application of deep learning models to complex classification problems in various domains, this paper adopts the DNN model for the prediction of piRNAs and their functions using the benchmark dataset. The proposed DNN model is configured with three hidden layers along with an input layer and an output layer, as shown in Fig. 2. Each layer in the DNN topology is configured with multiple neurons that process the input feature vector and produce an output using Eq. (11). The weight matrix of every neuron is initialized using the Xavier function [52], which keeps the variance the same across layers. Moreover, the back-propagation technique is applied to update the weight matrix so that the error between the output class and the target class is minimized. A nonlinear activation function, i.e., sigmoid, is applied at the input layer and at the hidden layers. The activation function helps the model learn non-linear and complex patterns in a dataset. Moreover, it determines whether a neuron is fired or ignored, depending on the output produced by that neuron [53]. Additionally, a softmax activation function is applied at the output layer; it generates values in the range [0,1] that represent the probability that a data point belongs to a particular class.
Figure 3: DNN model confusion matrix using composite features. The left side represents the first-layer confusion matrix, and the right side represents the second-layer confusion matrix
where $y_a$ represents the output at layer $a$, $B$ represents the bias value, $w_{ab}$ represents the weight used at layer $a$ by neuron $b$, $x_{ab}$ represents the input feature and $f$ represents the non-linear sigmoid activation function, which can be calculated using Eq. (12).
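The forward computation described above can be sketched in plain Python. The sketch assumes the usual forms y_a = f(B + Σ_b w_ab · x_ab) for a neuron and f(z) = 1 / (1 + e^(−z)) for the sigmoid, Xavier (Glorot) uniform initialization, and a softmax output. The layer sizes in the usage note are hypothetical placeholders, not the values of Tab. 2, and training by back-propagation is not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def sigmoid(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def xavier_uniform(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """Xavier/Glorot uniform initialization: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


def build_network(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Create one (weights, biases) pair per consecutive pair of layer sizes."""
    rng = random.Random(seed)
    return [
        (xavier_uniform(n_in, n_out, rng), [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def dense(inputs: Sequence[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Weighted sum plus bias for every neuron in a layer."""
    return [b + sum(w * x for w, x in zip(row, inputs)) for row, b in zip(weights, biases)]


def forward(x: Sequence[float], layers: List[Layer]) -> List[float]:
    """Sigmoid on all layers except the last, softmax on the output layer."""
    a = list(x)
    for weights, biases in layers[:-1]:
        a = [sigmoid(z) for z in dense(a, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(a, weights, biases))
```

For example, `forward([0.1] * 32, build_network([32, 16, 8, 4, 2]))` returns two class probabilities for a 32-dimensional input passed through three hidden layers.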
The efficiency of a machine learning system can be measured using various assessment criteria. Here, we adopt commonly used metrics to assess the performance of the proposed Deep-piRNA model: accuracy (Acc), specificity (Sp), sensitivity (Sn), the AUC score, which represents the area under the ROC curve, and the Matthews correlation coefficient (MCC) [54-56]. To make these measures more straightforward and easier to understand, we used the formulation suggested in a number of publications [57,58], which is based on Chou’s symbols and definitions [59]. These metrics are defined using Chou’s symbols as:
where $T^+$ symbolizes true positives, $F^+$ symbolizes false positives, $T^-$ symbolizes true negatives and $F^-$ symbolizes false negatives.
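In terms of these four counts, the standard definitions are Sn = T+ / (T+ + F−), Sp = T− / (T− + F+), Acc = (T+ + T−) / (T+ + F+ + T− + F−), and MCC = (T+ · T− − F+ · F−) / sqrt((T+ + F+)(T+ + F−)(T− + F+)(T− + F−)). The sketch below computes them; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    tp, fp, tn, fn correspond to T+, F+, T- and F- in the text.
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

With hypothetical counts T+ = 90, F+ = 5, T− = 95 and F− = 10, the sketch gives Sn = 0.90, Sp = 0.95, Acc = 0.925 and MCC ≈ 0.851.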
Accuracy (Acc) shows the overall correctness of a model. Sensitivity and specificity trade off against each other: when sensitivity increases, specificity tends to decrease, and vice versa. In other words, when we lower the decision threshold, more samples are predicted as positive, so sensitivity increases and specificity decreases; when we raise the threshold, more samples are predicted as negative, so specificity increases and sensitivity decreases. MCC gives a high score only when the predictor accurately identifies both positive and negative samples. The ROC curve is a graphical plot of the true positive rate (TPR = sensitivity) against the false positive rate (FPR = 1 - specificity), and the AUC is the area under this curve. Here, we consider the AUC to be the key criterion for measuring success independently of any threshold.
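The threshold trade-off and the threshold-free nature of the AUC can be seen on a small toy example. The sketch below uses made-up scores and labels (not data from the paper) and computes the AUC through its rank interpretation: the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
from typing import List, Sequence, Tuple


def sens_spec_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive sample outranks a negative one."""
    pos: List[float] = [s for s, y in zip(scores, labels) if y == 1]
    neg: List[float] = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
TOY_SCORES = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
TOY_LABELS = [1, 1, 1, 0, 0, 0]
```

On the toy data, lowering the threshold from 0.7 to 0.35 raises the sensitivity from 2/3 to 1 while the specificity falls from 1 to 2/3; the AUC stays fixed at 8/9 because it does not depend on the threshold.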
The efficiency of the proposed Deep-piRNA model is evaluated and discussed in depth in this section. We used the Windows 10 operating system along with Java 8 and Python 3.7 to evaluate the proposed algorithm. First, we discuss the empirical optimization of the model hyper-parameters. Second, we evaluate the model performance using different feature extraction techniques. Third, we compare its performance with that of other classifiers. Fourth, we analyze the proposed model using an independent dataset. Finally, we compare its performance with that of existing predictors.
The DNN configuration topology includes several parameters, which are referred to as hyper-parameters. The hyper-parameters include the number of hidden layers, the weight initialization, the learning rate, the optimization method, the activation functions and the number of neurons in each layer. The hyper-parameters must be tuned before the learning process begins [25], as the efficiency of the DNN model depends mainly on their optimum configuration. The commonly used hyper-parameters are shown in Tab. 2.
Table 2: Detailed configuration of the proposed DNN model
The grid search technique was used to tune the hyper-parameter configuration and obtain the optimal values for the DNN model. Grid search automatically evaluates the performance of the proposed model for every combination of parameter values. First, we performed a few experiments to find the optimum configuration values for the learning rate and activation function. Second, we obtained an optimized value for the number of training iterations through several experiments. The optimum values of the various hyper-parameters are shown in Tab. 2.
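The exhaustive evaluation performed by grid search can be sketched as follows. The grid values in `HYPOTHETICAL_GRID` and the scoring function `toy_evaluate` are placeholders chosen only to make the example runnable; in practice the scoring function would train the DNN with the given configuration and return its cross-validated accuracy.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(
    param_grid: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; not the values reported in Tab. 2.
HYPOTHETICAL_GRID = {
    "learning_rate": [0.1, 0.01, 0.001],
    "epochs": [100, 200, 300],
}


def toy_evaluate(params: Dict[str, Any]) -> float:
    """Stand-in for 'train and cross-validate'; peaks at lr = 0.01, epochs = 200."""
    return -abs(params["learning_rate"] - 0.01) - abs(params["epochs"] - 200) / 1000
```

The number of evaluations is the product of the list lengths (nine here), which is why the grid is usually kept small when each evaluation involves training a network.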
The predictive performance of the DNN model was evaluated using different feature extraction techniques, i.e., PseSNC, PseDNC, normalized Moreau-Broto autocorrelation, Z curve and the composite features. Several methods are described in the literature for testing the efficiency and effectiveness of a classification model, including the jackknife test, the independent dataset test and the cross-validation (sub-sampling) test [56]. In this paper, we used the 10-fold cross-validation test and the independent dataset test to check the performance of the DNN model. It should be noted that the DNN model was configured with the hyper-parameter values given in Tab. 2 during the performance evaluation.
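The 10-fold cross-validation protocol can be sketched with a small index generator: the samples are shuffled, split into ten folds, and each fold serves once as the test set while the other nine are used for training. This is a plain, non-stratified sketch; the function name and seed are hypothetical, and whether the authors stratified the folds by class is not stated.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx
```

For the 2836 benchmark sequences, each test fold contains 283 or 284 samples, and the reported metric is the average over the ten folds.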
The prediction performance achieved by the DNN model using different feature vectors at the first and second layers is shown in Tab. 3. The table reveals that, on both layers, the DNN model performs best with the composite feature vector rather than with any individual feature type. For instance, in the first layer the DNN model obtained an average accuracy of 96.13%, a sensitivity of 94.03%, a specificity of 98.00% and an MCC of 0.923 on the composite feature vector. Similarly, Tab. 3 shows that the best performance in the second layer was an accuracy of 85.54%, a sensitivity of 83.46%, a specificity of 87.46% and an MCC of 0.712. Additionally, a confusion matrix is presented in Fig. 3 to further explore the behavior of the proposed DNN when predicting with the composite feature vector.
Table 3: DNN performance at both layers using different sequence formulation methods
Here, we use the composite feature vector to compare the performance of the DNN model with that of traditional machine learning classifiers. The classifiers considered in the performance analysis are Random Forest (RF) [60], K-Nearest Neighbor (KNN) [61] and Support Vector Machine (SVM) [62]. Tab. 4 presents the performance comparison between the classifiers at both layers.
Table 4: Comparison with machine learning algorithms at both layers
Tab. 4 shows that the DNN model attained the highest accuracies in the first and second layers, i.e., 96.13% and 85.54%, respectively, compared with the other classifiers. Moreover, the DNN with the composite feature set attained MCC values of 0.923 and 0.712 in the first and second layers, respectively. Among the traditional classifiers, SVM on the composite feature set performed satisfactorily compared with KNN and RF, obtaining an accuracy of 93.27%, a specificity of 93.23%, a sensitivity of 93.30% and an MCC of 0.86. Therefore, in this paper, the proposed model adopted the DNN as the final classifier.
To assess the stability and reliability of the proposed model, we performed an independent dataset test. The results on the independent dataset D2 are shown in Tab. 5. Tab. 5 shows that, when the composite feature set is evaluated on different classifiers, the proposed DNN classifier achieved an accuracy of 93.53% in the first layer. Moreover, the proposed DNN classifier recorded the highest sensitivity (95.89%) and the highest specificity (90.91%) among all classifiers. Furthermore, SVM on the composite feature set performed well and reported the second-highest accuracy and MCC, i.e., 90.59% and 0.811, respectively, in the first layer.
Table 5: Performance of the proposed DNN model on the independent dataset
In the second layer, the composite feature set in conjunction with the DNN outperformed each of the individual feature sets, reporting an accuracy of 81.90% and an MCC of 0.637. The composite features also yielded a sensitivity of 80.37% and a specificity of 83.33%, which indicates that the success rates reached by the proposed predictor are quite high.
Here, we compare our proposed model with the existing benchmark methods, i.e., [16,17,19,20,23], in the first and second layers. These methods build prediction models based on machine learning algorithms. The performance of our proposed model and the existing benchmark models is evaluated on the benchmark datasets using 10-fold cross-validation. To facilitate comparison, Tab. 6 shows the corresponding results obtained by the existing state-of-the-art methods.
Table 6: Comparison of the proposed model results with the existing models at both layers
It can be observed from Tab. 6 that the proposed Deep-piRNA model performs better than the existing models. Deep-piRNA achieved the highest accuracy, 96.13% and 85.54%, and the highest Matthews correlation coefficient (MCC), 0.923 and 0.712, in the first and second layers, respectively. These two most important metrics reflect the overall performance, robustness and stability of the proposed predictor. The proposed method also yields better specificity (Sp) and sensitivity (Sn) than the existing methods, i.e., 98.00% and 94.03% in the first layer and 87.93% and 83.46% in the second layer, respectively. The average accuracy improvements of 7.59% and 2.81% in the first and second layers, respectively, illustrate the significance of the proposed model compared with the existing predictors.
Furthermore, we adopted graphical analysis to show the usefulness of the proposed Deep-piRNA model, as such analysis has proved useful in recent studies of complicated biological systems [63,64]. The area under the receiver operating characteristic (ROC) curve reflects the model’s efficiency: the higher the value, the better the performance [65,66]. Fig. 4 shows the ROC curves and the corresponding area under the curve (AUC). As Fig. 4 shows, the AUC of the proposed classifier is high, i.e., 0.983 and 0.878 in the first and second layers, respectively. This indicates that there is roughly a 98% and an 88% chance, respectively, that the model will rank a randomly chosen positive sample above a randomly chosen negative one. The intuitive graphical approach demonstrates the merit of the proposed Deep-piRNA model.
Figure 4: AUC at the first and second layers using composite features
The present study showed high performance for the detection of piRNAs and their functions. The proposed two-layer predictor is robust and accurate and can be used for the diagnosis of numerous tumor types and for drug development. We used different methods for sequence encoding and then fused them to construct a more effective representation of the input space. The performance of the two-layer predictor was investigated with different classifiers. The results show that the optimized DNN classifier outperformed existing state-of-the-art models, with accuracy improvements of 7.59% and 2.81% at layer I and layer II, respectively. In future work, we plan to design a web server that provides access to the proposed model. It is anticipated that Deep-piRNA will be a useful tool that helps the research community in precision medicine, drug development and cancer diagnosis. In addition, advances in next-generation sequencing technology generate a huge amount of genome data, which poses computational challenges for sequential computing approaches. We therefore also plan to apply parallel programming techniques to distribute the computations over a few processing nodes.
Acknowledgement: The authors would like to thank the Ministry of Higher Education Malaysia (MOHE) and Universiti Putra Malaysia for supporting the project.
Funding Statement: This research was supported by the Ministry of Higher Education (MOHE) of Malaysia through the Fundamental Research Grant Scheme (FRGS/1/2020/ICT02/UPM/02/3).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
Source Code and Benchmark Dataset: The proposed model source code and benchmark dataset are available at https://www.github.com/salman-khan-mrd/piRNA-2L-pseKNC.
Computers, Materials & Continua, 2022, Issue 8