XU Xiaolong, CHEN Wen, and WANG Xinheng
1. Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; 2. Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng 224000, China; 3. School of Computing and Engineering, University of West London, London W5 5RF, UK
Abstract: Software defect prediction (SDP) performs a statistical analysis of historical defect data to find the distribution rule of historical defects, so as to effectively predict defects in new software. However, there are redundant and irrelevant features in software defect datasets that affect the performance of defect predictors. In order to identify and remove the redundant and irrelevant features in software defect datasets, we propose ReliefF-based clustering (RFC), a cluster-based feature selection algorithm. RFC first calculates the relevance between each feature and the target class with ReliefF and removes irrelevant features. Then, the correlation between the remaining features is calculated based on the symmetric uncertainty. According to the correlation degree, RFC partitions the features into k clusters based on the k-medoids algorithm, and finally selects the representative features from each cluster to form the final feature subset. In the experiments, we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration (NASA) software defect prediction datasets in terms of the area under curve (AUC) and F-value. The experimental results show that RFC can effectively improve the performance of SDP.
Keywords: software defect prediction (SDP), feature selection, cluster.
Software with defects can bring unexpected results or behaviors at run time, which can cause huge damage or even disasters in serious cases [1].
Software defect prediction (SDP) [2] is an effective method to identify defects in system modules in advance. First, the software code or the development process is analyzed and the metrics related to software defects are designed; then the defect dataset is created by mining the software historical repositories. Finally, based on the defect dataset, the SDP model is constructed.
In order to better predict software defects, many metrics that have a strong correlation with software defects have been proposed to measure modules [3−6]. Attributes of software quality, such as defect density and failure rate, are external measures of the software product and its development process. Generally, a software quality prediction model is built with software metrics and defect data collected from previously developed systems or similar software projects. The selection of the specific set of metrics becomes an integral component of the model-building process. However, the redundant and irrelevant features in the software defect dataset increase the time complexity and affect the performance of defect predictors [7].
Therefore, the shortcomings of the existing algorithms can be summarized as follows:
(i) The filter-based feature selection algorithms can remove irrelevant features, but cannot remove redundant features.
(ii) The embedded feature selection algorithms, which combine the advantages of the filter and wrapper methods, suffer from high complexity.
In order to solve the above problems, we propose ReliefF-based clustering (RFC), a cluster-based feature selection algorithm. First, the ReliefF algorithm [8] is used to calculate the relevance between each feature and the target class, and features are sorted to remove irrelevant features; then the remaining features are clustered according to the correlation between them, and finally the representative features of each cluster are selected. We implement experiments based on the software defect prediction datasets released by the National Aeronautics and Space Administration (NASA) to test and verify the proposed algorithm. RFC considers both the correlation between features and the relevance between features and the target class, so it can effectively remove redundant and irrelevant features, alleviate the problem of dimension disaster, and improve the performance of SDP.
This paper is organized as follows. Section 2 introduces the related work. Section 3 describes RFC in detail. Section 4 presents the experimental results and the performance analysis. Section 5 concludes this paper and presents our possible future work.
The methods of SDP can be divided into two categories. On the one hand, some researchers focus on utilizing software metrics [9−11], such as the Halstead scientific metrics [3], the McCabe cyclomatic complexity [4], the quality model for object oriented design (QMOOD) [5], and the Chidamber and Kemerer (CK) metrics [6]. On the other hand, some researchers focus on the quality of SDP datasets. There are many problems to be solved in SDP, such as data imbalance [12−15] and dimension disaster [16]. Feature selection is an effective method to solve the problem of dimension disaster.
Appropriate feature selection results can improve the learning accuracy, reduce the learning time, and simplify the learning results. Kira et al. [17] proposed a feature selection method, Relief, which randomly selects m instances from the training set. For each selected instance i, Relief computes the nearest neighbor from the same class of i and the nearest neighbor from the opposite class. The quality of each feature is estimated with respect to whether the feature differentiates two instances from the same class and from different classes. A feature has an undesired property if it differentiates two near instances that belong to the same class. The original Relief algorithm is limited to binary classification problems. ReliefF [8] is extended from Relief to solve the multiclass problem. Its main idea is to take the Euclidean distance as the correlation index and then weight features according to how well they differentiate instances of different classes. Meanwhile, the embedded methods select features in the training process of the learning model, and the feature selection result is output automatically when the training process is finished. For example, a hybrid feature selection method proposed by Shivkumar et al. [18] first uses the filter or the classifier to score each feature, then removes features with lower scores, uses the classifier to evaluate the prediction performance of the remaining features, and finally chooses the feature set with the best prediction performance as the final feature set. When the features are reduced to a certain extent, the predictive performance begins to decline due to the lack of important information. These algorithms combine the advantages of the filter method and the wrapper method, but have a high time complexity.
Guo et al. [19] proposed a software quality prediction method based on the random forest, where the prediction accuracy of the proposed methodology is higher compared to logistic regression and discriminant analysis. The random forest is more robust to noise and outliers than other methods. However, irrelevant features have a great impact on the predictive effect of the random forest; the effect of using the first five most relevant features to train the random forest is comparable to that of using all the features. Menzies et al. [20] used the information gain (IG) to rank features. The prediction effects of the first three features are comparable to the use of all features. Rodriguez et al. [21] made use of feature selection methods in different datasets, and tested different data mining algorithms for classification to detect faulty modules. The results showed that, in general, smaller datasets with fewer attributes maintained or improved the prediction capability compared with the original datasets. Catal et al. [22] verified the influence of feature selection on model performance, and showed that feature selection is beneficial to improve the performance of the software defect prediction model. Gao et al. [23] compared seven different feature ranking methods and four different feature subset selection approaches based on software metrics and defect data collected from multiple releases of a large real-world software system. The results showed that the automatic hybrid search algorithm performed the best among the feature subset selection methods. Moreover, performances of the defect prediction models either improved or remained unchanged even if over 85% of the software metrics were eliminated. Wang et al. [24] presented a comprehensive empirical study by examining 17 different ensembles of feature ranking methods (rankers), including six commonly used feature ranking methods, the signal-to-noise filter method, and 11 threshold-based feature ranking methods. This study utilized 16 real-world software measurement datasets of different sizes and built 13 600 classification models. Experimental results indicated that ensembles of very few rankers were very effective and even better than ensembles of many or all rankers. Bennin et al. [25] proposed a synthetic oversampling approach called MAHAKIL based on the chromosomal theory of inheritance. Experiments showed that MAHAKIL improved the prediction performance for software defect prediction. Miholca et al. [26] developed a supervised classification method called HyGRAR, which combines gradual relational association rule mining and artificial neural networks to discriminate between defective and non-defective software entities. Experiments demonstrated the excellent performance of the HyGRAR classifier. Cao et al. [27] proposed an SDP model based on the twin support vector machines (TSVM) and a multi-objective cuckoo search (MOCS), which achieved a better performance than other SDP models.
Different from the above works, in order to effectively identify the redundant and irrelevant features in the software defect dataset, in this paper we propose a cluster-based feature selection algorithm, RFC. RFC removes irrelevant features, then partitions the remaining features into k clusters based on symmetric uncertainty (SU), and finally selects representative features from each cluster. The specific workflow of RFC is shown in Fig. 1.
Fig.1 Workflow of RFC
(i) Calculate the relevance between features and the target class based on ReliefF, and then remove irrelevant features according to the threshold.
The key idea of ReliefF is to estimate the quality of attributes according to how well their values distinguish between instances that are near to each other. Given a randomly selected instance R, ReliefF searches for k nearest neighbors of R from the same class, called nearHits, and k nearest neighbors from each of the different classes, called nearMisses. If the difference between R and the nearHits is less than that between R and the nearMisses on a feature, the feature is beneficial to classification, and the corresponding weight of the feature is increased. Conversely, the weight of the feature is reduced. This process is repeated until the end condition is met, and then the weight of each feature is returned, which is the relevance between the feature and the target class. The features are sorted by weight, and then the threshold δ is set to determine whether each feature is valid or not.
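A minimal sketch of this weight update is given below, assuming numeric metrics scaled to [0, 1]; the function name relieff_weights and its parameters (n_iterations, k) are illustrative choices, not names from the paper.

```python
import numpy as np

def relieff_weights(X, y, n_iterations=None, k=5, rng=None):
    """Estimate feature relevance with a simplified ReliefF.

    X: (n_samples, n_features) array with features scaled to [0, 1].
    y: (n_samples,) class labels.
    Returns one weight per feature; larger means more relevant.
    """
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    n_iterations = n_iterations or n_samples
    classes, priors = np.unique(y, return_counts=True)
    priors = priors / n_samples
    weights = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        r, r_class = X[i], y[i]
        dist = np.linalg.norm(X - r, axis=1)       # Euclidean distance to R
        for c, prior in zip(classes, priors):
            mask = (y == c)
            mask[i] = False                         # exclude R itself
            idx = np.where(mask)[0]
            if idx.size == 0:
                continue
            nearest = idx[np.argsort(dist[idx])[:k]]
            diff = np.abs(X[nearest] - r).mean(axis=0)
            if c == r_class:
                # nearHits: small differences are good, so subtract
                weights -= diff / n_iterations
            else:
                # nearMisses: large differences are good, weighted by class prior
                p = prior / (1.0 - priors[classes == r_class][0])
                weights += p * diff / n_iterations
    return weights
```

Features whose returned weight falls below the threshold δ would then be discarded as irrelevant.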
(ii) Cluster feature subsets.
i) Calculate the degree of correlation between features.
We use the nonlinear SU [28] to measure the correlation between features. SU is derived from mutual information by normalizing it to the entropies of the feature values. It can be computed with
SU(X, Y) = 2 IG(X|Y) / (H(X) + H(Y))
where H(X) is the entropy of a discrete random variable X. Suppose p(x) is the prior probability for all values of X, and p(x|y) is the posterior probability of X given the values of Y. H(X) can be computed with
H(X) = −Σ_x p(x) log₂ p(x)
The information gain IG(X|Y) measures the amount by which the entropy of X decreases given the values of Y. It reflects the additional information about X provided by Y. IG(X|Y) can be computed with
IG(X|Y) = H(X) − H(X|Y)
where H(X|Y) is the conditional entropy, which quantifies the remaining uncertainty of a random variable X given that the value of another random variable Y is known. H(X|Y) is defined with
H(X|Y) = −Σ_y p(y) Σ_x p(x|y) log₂ p(x|y)
SU can compensate for the information gain's bias toward features with more values and restricts its value to [0, 1]. A value of 1 for SU(X, Y) indicates that knowledge of X can completely predict knowledge of Y and vice versa, while a value of 0 for SU(X, Y) reveals that X and Y are independent. Although the entropy-based measures are designed for nominal or discrete variables, they can deal with continuous features as well, provided that the values can be discretized properly in advance.
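A minimal sketch of the SU computation for two discretized feature vectors follows; the helper names entropy, cond_entropy, and symmetric_uncertainty are illustrative, not from the paper.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum_x p(x) log2 p(x) over the observed values of x."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def cond_entropy(x, y):
    """H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)."""
    n = len(y)
    h = 0.0
    for value, c in Counter(y).items():
        x_given_y = [xi for xi, yi in zip(x, y) if yi == value]
        h += (c / n) * entropy(x_given_y)
    return h

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), restricted to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    ig = hx - cond_entropy(x, y)
    denom = hx + hy
    return 2.0 * ig / denom if denom > 0 else 0.0
```

Continuous metrics would need to be discretized (e.g., by equal-width binning) before being passed to these helpers.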
ii) Calculate the average correlation degree.
Before feature clustering, several features should be selected as the initial representative features. To further accelerate the convergence of our proposed algorithm, we consider the features with the most information content as the initial features. The information content of a feature is measured by its average correlation degree. The average value of the symmetric uncertainty between fi and the other features in the feature set can be computed with
AvgRel(fi) = (1/(m − 1)) Σ_{j≠i} SU(fi, fj)
where m represents the number of remaining features, and SU(fi, fj) represents the correlation between feature fi and feature fj.
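A small sketch of this average correlation degree, computed from a precomputed m × m SU matrix (the name avg_rel and the matrix representation are illustrative assumptions):

```python
import numpy as np

def avg_rel(su_matrix):
    """Average SU of each feature with all other remaining features.

    su_matrix: (m, m) symmetric matrix with su_matrix[i, j] = SU(fi, fj).
    Returns a length-m vector of average correlation degrees.
    """
    m = su_matrix.shape[0]
    off_diag_sum = su_matrix.sum(axis=1) - np.diag(su_matrix)
    return off_diag_sum / (m - 1)
```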
iii) Cluster the features based on k-medoids.
Step 1 Sort the features according to AvgRel in descending order, and select the top k features as the initial medoids.
Step 2 Assign each feature to the cluster associated with the most similar medoid.
Step 3 Update each cluster's medoid. For each feature cluster, the feature with the highest degree of correlation with the other features in the cluster is selected as the new cluster medoid.
Step 4 Step 2 and Step 3 are iterated until the medoids no longer change, as shown in the sketch below.
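The following is a minimal sketch of this k-medoids feature clustering driven by the SU matrix; the function name kmedoids_features and its parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kmedoids_features(su_matrix, k, max_iter=100):
    """Cluster features into k groups using SU as the similarity measure.

    su_matrix: (m, m) matrix with su_matrix[i, j] = SU(fi, fj), diagonal = 1.
    Returns (labels, medoids): cluster index per feature and medoid indices.
    """
    m = su_matrix.shape[0]
    # Step 1: initial medoids are the k features with the largest average SU
    # to all other features (the AvgRel criterion).
    avg = (su_matrix.sum(axis=1) - np.diag(su_matrix)) / (m - 1)
    medoids = np.argsort(-avg)[:k]
    for _ in range(max_iter):
        # Step 2: assign each feature to the most similar (highest-SU) medoid.
        labels = np.argmax(su_matrix[:, medoids], axis=1)
        # Step 3: within each cluster, pick the feature most correlated
        # with the other members as the new medoid.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            within = su_matrix[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmax(within)]
        # Step 4: stop when the medoids no longer change.
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids
```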
(iii) Select representative features.
After the features are partitioned into k clusters, relevant features need to be selected from each cluster. Since the scales of different clusters are not uniform, the scale of a cluster is taken into account when relevant features are selected from it. The larger a cluster is, the more relevant features are selected from it.
For feature cluster Ci, the cluster's medoid is selected first, and then the remaining features are sorted according to their relevance with the target class in descending order. After that, the top ⌈(|C|/m)·r⌉ features are selected from each cluster, where |C| is the scale of the cluster, m is the number of features in the feature subset, and r is the scale of the final feature subset that we want to select.
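A sketch of this per-cluster selection under the proportional-allocation reading above (the ⌈(|C|/m)·r⌉ quota and the name select_representatives are assumptions for illustration):

```python
import math
import numpy as np

def select_representatives(labels, medoids, relevance, r):
    """Pick representative features from each cluster.

    labels: cluster index per feature, medoids: medoid feature indices,
    relevance: ReliefF weight per feature, r: target size of the final subset.
    """
    m = len(labels)                       # number of clustered features
    selected = []
    for c, medoid in enumerate(medoids):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        quota = max(1, math.ceil(len(members) / m * r))
        picks = [medoid]                  # the cluster medoid is always kept
        # Remaining members, ranked by relevance with the target class.
        others = [f for f in members if f != medoid]
        others.sort(key=lambda f: relevance[f], reverse=True)
        picks.extend(others[:quota - 1])
        selected.extend(picks)
    return sorted(set(int(f) for f in selected))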
The metrics of SDP datasets are composed of lines of code (LOC) counts, Halstead complexity, and McCabe complexity, as shown in Table 1.
Table 1 LOC counts and Halstead complexity
In particular, LOC counts measure the number of code lines, comment lines, and so on. Halstead complexity estimates the program complexity by counting operators and operands in a module. McCabe complexity measures the complexity of a module's inner structure. These metrics, i.e., features, are widely used in SDP. However, the effectiveness of current methods is influenced by irrelevant and redundant features.
The pseudo code of the algorithm is as follows:
In the pseudo code, T is the dataset; k is the number of feature clusters for software defect prediction; F is the original software defect feature set; δ is the threshold; r is the number of selected software defect features; S is the output software defect feature subset; wi denotes the relevance between feature fi and the target class; Ci represents the ith feature cluster; |C| is the size of the cluster; m represents the number of the feature subset; and r represents the number of the final feature subset.
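A compact end-to-end sketch of this workflow under the notation above, reusing the illustrative helpers relieff_weights, symmetric_uncertainty, kmedoids_features, and select_representatives from the earlier sketches (all assumed names, not the paper's original pseudo code):

```python
import numpy as np

def rfc(X, y, k, delta, r, n_bins=10):
    """Cluster-based feature selection (RFC) sketch.

    X: (N, M) metric matrix, y: defect labels, k: number of clusters,
    delta: ReliefF relevance threshold, r: size of the final feature subset.
    Returns the indices of the selected features (S).
    """
    # (i) Relevance between each feature and the target class (ReliefF),
    #     then drop irrelevant features below the threshold delta.
    w = relieff_weights(X, y)
    kept = np.where(w >= delta)[0]

    # Discretize the remaining features before the entropy-based SU measure.
    X_kept = X[:, kept]
    bins = np.apply_along_axis(
        lambda col: np.digitize(col, np.histogram_bin_edges(col, bins=n_bins)),
        0, X_kept)

    # (ii) Pairwise SU between the remaining m features.
    m = kept.size
    su = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            su[i, j] = su[j, i] = symmetric_uncertainty(bins[:, i], bins[:, j])

    # Cluster the features with k-medoids and pick representatives.
    labels, medoids = kmedoids_features(su, min(k, m))
    picked = select_representatives(labels, medoids, w[kept], r)

    # (iii) Map back to the original feature indices to form S.
    return [int(kept[p]) for p in picked]
```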
For example, for the kc3 dataset, the indexes of the finally selected feature subset are (31, 35, 16, 24, 27, 7). These features are highly correlated with the target class and lowly correlated with each other.
In order to verify the effectiveness of our proposed method, nine software defect datasets are selected from the NASA Metrics Data Program (MDP) repository [29]. Their measurements are mainly designed from LOC counts, Halstead complexity, and McCabe complexity. The datasets, stored in the attribute-relation file format (ARFF), were processed with Weka. Table 2 describes these data in terms of the number of instances, the number of features, the number of defective modules, and the defect rate.
Table 2 Descriptions of the datasets
The hardware for the experiments is as follows: 8 GB RAM, Intel Core i5 processor at 1.8 GHz, and 500 GB hard disk.
In order to evaluate the prediction results, it is necessary to select reasonable performance evaluation metrics. SDP can be simplified to a two-class problem. In Table 3, TP and TN denote the numbers of positive and negative samples that are classified correctly, while FP and FN denote the numbers of negative and positive samples that are misclassified, respectively. Precision = TP / (TP + FP) indicates the ratio of positive samples that are classified correctly to all samples classified as positive, and Recall = TP / (TP + FN) indicates the ratio of positive samples that are classified correctly to actual positive samples.
Table 3 Confusion matrix for a two-class problem
F-value is defined as
F-value = ((1 + β²) × Recall × Precision) / (β² × Precision + Recall)
F-value is a combination of Recall and Precision, which are effective metrics in the information retrieval community. F-value is high if both Recall and Precision are high. It can be adjusted by changing the value of β. β corresponds to the relative importance of Precision vs. Recall, and is usually set to 1.
The true positive rate (TPR) is the same as Recall.
The false positive rate (FPR) is defined as
FPR = FP / (FP + TN)
The receiver operating characteristic (ROC) curve describes the relationship between TPR and FPR.
On an ROC graph, the TPR is plotted on the Y axis and the FPR is plotted on the X axis. In the ROC space, each classifier with a given class distribution and cost matrix is represented by a point on the ROC curve. The area under curve (AUC) is a single-value measurement that originated from the field of signal detection. The value of the AUC ranges from 0 to 1. On the ROC curve, points close to (0, 1) are preferable, which means a low false positive rate and a high recall value. A perfect classifier provides an AUC equal to 1.
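As a quick illustration of these metrics, the snippet below computes the F-value, FPR, and AUC from hard predictions and predicted scores with scikit-learn; the use of scikit-learn and the toy arrays are assumptions for illustration, not part of the paper's setup.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])           # 1 = defective module
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])           # hard predictions
y_score = np.array([.9, .2, .4, .8, .1, .6, .7, .3])   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # also the TPR
fpr       = fp / (fp + tn)
beta      = 1.0
f_value   = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
auc       = roc_auc_score(y_true, y_score)
print(f"F-value={f_value:.3f}, FPR={fpr:.3f}, AUC={auc:.3f}")
```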
AUC is used for evaluating the predictive capability of classifiers. Jiang et al. [30] selected different proportions of training sets on the NASA datasets for experiments to compare the performance differences of different evaluation metrics. The experimental results indicate that AUC is a more stable measurement than F-value.
The main parameters of RFC include the number of clusters k and the threshold δ of ReliefF. Inspired by Gao et al. [23], we select r = ⌈log₂ M⌉ relevant features from the original dataset to construct the final selected feature subset, where M represents the number of original features. δ is adjusted via experiments so that the final feature number is close to ⌈log₂ M⌉. k is set heuristically, and m represents the number of remaining features after removing irrelevant features with ReliefF.
In the defect predictor construction phase, we employ the Naïve Bayes and J48 classification algorithms. We implement 10×10-fold cross validation. The performance measurements are averaged.
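A sketch of this evaluation loop is shown below, with scikit-learn's GaussianNB and DecisionTreeClassifier standing in for Naïve Bayes and Weka's J48; the stand-in classifiers and all names are assumptions, not the paper's exact tooling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(X, y, selected, repeats=10, folds=10, seed=0):
    """10x10-fold cross validation on the selected feature subset."""
    X_sel = X[:, selected]
    scores = {"NB": [], "J48": []}
    for rep in range(repeats):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + rep)
        for train, test in skf.split(X_sel, y):
            for name, clf in (("NB", GaussianNB()),
                              ("J48", DecisionTreeClassifier())):
                clf.fit(X_sel[train], y[train])
                prob = clf.predict_proba(X_sel[test])[:, 1]
                pred = clf.predict(X_sel[test])
                scores[name].append((roc_auc_score(y[test], prob),
                                     f1_score(y[test], pred)))
    # Average (AUC, F-value) over all repeats and folds for each classifier.
    return {name: np.mean(vals, axis=0) for name, vals in scores.items()}
```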
First, we implement experiments to compare our proposed method with the method using all the original features, denoted as NONE. Then, we compare our proposed method with four representative feature selection methods. Both IG and Chi-Square (CS) consider the relevance between a feature and the target class, in terms of the information gain and the chi-square value respectively. ReliefF measures the relevance between the features and the target class based on instances. By comparing with ReliefF, we can find out the influence of feature clustering. By comparing with IG and CS, we can find out the influence of different correlation measurement methods on prediction performance. Correlation-based feature selection (CFS) [31] exploits the best first search based on the evaluation of a subset that contains features highly correlated with the target class, yet uncorrelated with each other. This method can deal with both irrelevant features and redundant features.
The performance of a feature selection algorithm can usually be evaluated from two aspects: with the same classification effect, the smaller the feature subset, the better the performance; with the same degree of feature reduction, the better the classification effect, the better the performance.
(i) For J48 classifier
Table 4 shows that the RFC algorithm has obvious advantages compared with NONE, IG, CS, ReliefF and CFS in terms of AUC, which greatly improves the performance of the classifier. RFC achieves the best performance on datasets pc3, kc4 and cm1. On dataset mc2, RFC performs only worse than NONE, and on dataset mw1, RFC performs only worse than CFS. The AUC of RFC on average is 0.673, and the AUC of CFS is 0.676. Table 5 shows that the F-value of RFC is better than those of the other methods on most datasets.
Table 4 AUC of the J48 classifier after using different feature selection methods
Table 5 F-value of the J48 classifier after using different feature selection methods
(ii) For Naïve Bayes classifier
Table 6 shows that RFC has advantages over NONE, IG, CS, ReliefF, and CFS. Based on AUC, RFC achieves the best performance on datasets pc1, kc4, mc2, cm1, and mw1. On the dataset kc1, the performance of RFC is very similar to that of ReliefF. The AUC of RFC on average is the highest, followed by NONE and CFS. As shown in Table 7, RFC achieves the best F-value on most datasets, and the F-value of RFC is the highest on average, followed by CFS.
Table 6 AUC of Naïve Bayes after using different feature selection methods
Table 7 F-value of Naïve Bayes after using different feature selection methods
Figs. 2−5 show the AUC and F-value of different feature selection algorithms based on the J48 and Naïve Bayes classifiers. For the J48 classifier, Fig. 2 and Fig. 3 indicate that the advantage of RFC over the other algorithms is significant. Compared with the other algorithms on the Naïve Bayes defect predictor, RFC also has obvious advantages.
Fig.2 AUC of J48 after using different feature selection methods
Fig.3 F-value of J48 after using different feature selection methods
Fig.4 AUC of Naïve Bayes after using different feature selection methods
Fig.5 F-value of Naïve Bayes after using different feature selection methods
We present the proportion of features selected by each feature selection method for each dataset in Table 8. The number of features retained by RFC is set to be close to or the same as the number of features retained by IG, CS, and ReliefF.
From Table 8, we can see that RFC on average retains 18.43% of the features. IG, CS, and ReliefF rank first with 13.8% of selected features. RFC performs similarly to those methods that only use feature ranking in terms of the proportion of selected features. Table 9 lists the indexes of the features selected by the different feature selection algorithms. RFC selects fewer features than CFS, and more than CS, IG and ReliefF.
Table 8 Proportion of features selected with different feature selection methods
Table 9 Indexes of features on all data sets selected with different feature selection methods
From the experimental results, we can analyze and draw the following conclusions about the performance of RFC and the other methods:
(i) With each of the above feature selection methods, SDP can achieve a better performance with fewer features. All five feature selection methods can help to select features with more information content and more distinctive power.
(ii) RFC performs similarly to ReliefF in terms of the proportion of selected features. From Table 4 to Table 7, we can see that RFC has a better performance than ReliefF. RFC can effectively improve the performance of SDP by feature clustering.
(iii) As shown in Table 8, RFC selects more features than IG, CS, and ReliefF. The reason is that the number of features selected in each cluster is related to the scale of the cluster, and at least one feature is selected from each cluster.
To sum up, based on clustering, RFC can effectively improve the performance of SDP.
Suppose that N represents the number of instances of the dataset, M represents the number of features of the original dataset, and k represents the number of feature clusters. The computation time of RFC is mainly composed of five parts:
(i) Calculate the relevance between features and the target class.
ReliefF updates the feature weights for one sampled instance with a time complexity of O(M×N). This is repeated multiple times, and the number of iterations is close to the number of instances. Therefore, the time complexity of ReliefF is O(M×N²).
(ii) Calculate the correlation between features.
After ReliefF removes the irrelevant features, the number of remaining features is m. The time complexity of using SU to compute the correlation between two features on a dataset of N instances is O(N). Therefore, the time complexity of calculating the correlations between all features is O(N×m²).
(iii) Calculate and rank the average correlation degree of features.
The time complexity of computing the average correlation degrees of the features is O(m²), followed by a quick sort. Therefore, the time complexity of this part is O(m²).
(iv) Cluster features.
The original feature set is divided into k clusters, and then the medoids are updated. The time complexity of this part is O(m²).
(v) Select representative features.
The medoid of each cluster is selected, and the remaining features are sorted according to their relevance with the target class in descending order. The time complexity of this part is O(m log₂ m).
The number of instances N is much larger than M and m, so the time complexity of RFC is O(N²). Therefore, the time complexity of RFC is not higher than that of the typical filter-based feature selection algorithms, and is lower than that of the embedded feature selection algorithms.
In this paper, we propose a new feature selection algorithm, RFC, which first removes irrelevant features, then clusters the remaining features, and finally selects representative features from each cluster. The experimental results demonstrate that RFC achieves a better performance compared with other classical feature selection algorithms on most datasets, which proves the effectiveness of the RFC algorithm in SDP. However, RFC can be further optimized. Our future work will focus on the redundant features of high-dimensional datasets.