Breast cancer prediction based on a hybrid comparison of multiple classification algorithms

控制理论与应用 (Control Theory & Applications), 2021, No. 10

LI Li, WANG Yong, LU Ning, LIN Kuo-yi

(1. College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China; 2. Shanghai Research Institute for Artificial Intelligence Science & Technology, Tongji University, Shanghai 201203, China)

Abstract: Breast cancer is a leading cause of death among female cancer patients due to its easy recurrence and high mortality. Early diagnosis of breast cancer increases the probability of a cure, so improving the accuracy of early diagnosis is particularly important. Traditional early diagnosis relies mainly on human experience to judge breast cancer from clinical or examination data, and sufficient accuracy cannot be guaranteed. Many researchers have proposed machine learning methods to improve prediction accuracy and efficiency. However, current algorithms have high computational complexity, and it is difficult to select the appropriate algorithm directly from the many available. This paper experiments with ten popular classification algorithms, compares the differences between them, and applies a quantum support vector machine to speed up the computation. Numerical experiments show that the support vector machine and the artificial neural network achieve the best prediction results, verifying the effectiveness of a hybrid comparison of multiple classification algorithms.

Key words: breast cancer; support vector machines; classification; quantum computing

1 Introduction

Cancer, also called malignant tumor, is a disease caused by abnormally growing cells. It spreads readily and recurs easily. Breast cancer is one of the primary malignancies in women and an important cause of death among female cancer patients. Early diagnosis can improve the survival chances of cancer patients [1]. However, there is still a lack of ideal methods with strong specificity, especially for the early diagnosis of deep tumors. Traditional diagnostic methods rely on human experience and cannot guarantee the efficiency and accuracy of diagnosis, which delays the patient's opportunity for treatment.

In the past few decades, machine learning has attracted the interest of many researchers due to its high scalability and its ability to represent high-dimensional data. As a branch of artificial intelligence, machine learning builds a model that learns from given data to improve its performance. Machine learning can be divided into supervised learning and unsupervised learning. Supervised learning requires each sample to carry a label in addition to its feature values. A model predicts the label from the feature values, compares the prediction with the actual label to compute the error, and iteratively corrects itself based on that error. The most common supervised learning tasks are classification and regression. Unsupervised learning does not require labels: it explores the degree of similarity between examples according to specific indicators and methods, or studies the relationships between feature values. Breast cancer prediction can be regarded as a classification problem in supervised learning, and the literature offers many machine learning methods to assist breast cancer diagnosis.

Zhou proposed a training mode for the artificial neural network (ANN), using the decision tree (DT) algorithm C4.5 to extract features from rule models for various diseases such as diabetes and breast cancer; it achieved high accuracy as well as strong generalization ability [2]. Huang used a support vector machine (SVM) combined with ultrasound texture analysis to classify breast cancer ultrasound images [3]. In 2007, Wu proposed a cancer prediction model based on the SVM algorithm, which effectively solved the small-sample learning problem and limited overfitting well [4]. Moayedi put forward a three-stage breast cancer diagnosis method in which an optimized SVM classifier performs the classification, reaching 96.6% accuracy on image data sets [4–5]. In 2017, Wang used a particle swarm algorithm to select features from high-dimensional mass spectrometry data, analyzed and compared the results of the extreme learning machine (ELM), K-nearest neighbors (KNN), ANN, SVM and random forest (RF), and verified the feasibility of ELM in cancer diagnosis [6]. In 2019, Miao proposed a machine learning training method based on the Spark model and RF, which had high fault tolerance and fast training speed and reached an accuracy of 99.01% [7].

The models selected in the above literature are relatively narrow, and as the data grow, their computational cost increases rapidly, making it difficult to meet the needs of early and rapid diagnosis of breast cancer. To overcome these shortcomings and improve the utility of breast cancer prediction models, this paper applies ten popular machine learning models to data analysis and prediction, compares the models, and reaches a prediction accuracy of 98.25%. A quantum SVM is used to speed up the computation of the model. The main contributions of this paper are: 1) after data analysis and comprehensive consideration of various machine learning models, ANN and SVM achieve the best results on the data set; 2) the quantum SVM accelerates the computation of the model and adapts to larger data sets.

This paper is organized as follows. Section 2 briefly reviews work related to machine learning, breast cancer, model evaluation, and quantum machine learning. In Sec. 3, the efficiency of the proposed approach is verified by numerical experiments. Finally, conclusions and future perspectives are given in Sec. 4.

2 Research work

2.1 Abbreviations

The abbreviations used in the paper are described in Table 1.

2.2 Machine learning models

Over the recent decades of machine learning development, many models have been proposed, including KNN, SVM, logistic regression (LR), DT, naive Bayes (NB), RF, and linear discriminant analysis (LDA). KNN, SVM and neural network models are the most commonly used and perform well on classification problems, so the principles of these three models are briefly introduced.

2.2.1 K-nearest neighbors

The KNN classification algorithm is one of the simplest data classification methods [8], and it is relatively mature in theory. Its core idea is that if most of the K nearest samples of a sample in the feature space belong to a certain category, the sample also belongs to that category and shares its characteristics, as shown in Fig. 1. The method determines the category of a sample to be classified based only on the categories of the nearest one or several samples. Since KNN relies mainly on a limited number of nearby samples rather than on a discriminant class domain, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.

Fig.1 The K-nearest samples in KNN algorithm

The KNN algorithm is easy to understand and implement, and it requires no parameter estimation or training, so it is particularly suitable for multi-classification problems. However, it has high computational cost and low efficiency, since the distance from each sample to be classified to every known sample must be computed [9]. To overcome this shortcoming, a variety of improved KNN algorithms have been derived, targeting either efficiency or accuracy. Wang proposed an improved clustering KNN algorithm [10], which improved classification by classifying and weighting the test samples before running KNN. Kuang proposed a circular KNN algorithm based on clustering [11], which used filters to divide the training points around the point to be tested into outer and inner training points; in classification, the inner training points replaced the original training points to reduce redundancy and improve the efficiency of high-dimensional KNN classification. Johansson proposed an improved KNN classifier based on a genetic algorithm with an optimized K [12]; the ensemble version of the KNN model was significantly better than the original KNN classifier in terms of accuracy and AUC. The improved KNN algorithm proposed by Chaudhari considered the distance between adjacent K-nearest-neighbor line segments when processing imbalanced data sets, calculated the average Euclidean distance, and used it to generate synthetic samples, effectively improving the accuracy of the decision boundary [13]. Shang et al. proposed an improved KNN algorithm based on fuzzy set theory, which improved the accuracy and recall of text classification to a certain extent [14]. Chen et al. proposed a fast density peak method based on the KNN algorithm that used KNN density in place of density, which greatly improved the efficiency of the density calculation [15]. Shaban et al. used an enhanced K-nearest neighbor classifier to detect COVID-19 patients; it avoids the pitfalls of traditional KNN by adding reliable heuristics when selecting the neighbors of the tested item [16].
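As a concrete illustration of the basic (unimproved) algorithm described above, the following minimal sketch classifies samples with scikit-learn's KNeighborsClassifier; the synthetic data and the choice K = 5 are illustrative assumptions, not values from this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative two-class data standing in for tumor feature vectors.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is labeled by a majority vote of its K nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 is an illustrative choice
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```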

2.2.2 Support vector machine

The SVM model is a maximum-margin linear classifier defined in the feature space. Its basic idea is to solve for the separating hyperplane that correctly divides the training data set with the largest geometric margin, as shown in Fig. 2. For problems that are not linearly separable in low dimensions, the SVM model uses the kernel trick: a method that computes in the low-dimensional space while achieving the classification effect of a high-dimensional one, avoiding the complicated calculations of an explicit mapping to a high-dimensional space.

Fig.2 The largest hyperplane in SVM algorithm

SVM shows many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems, and many improved algorithms have been derived from it. In 2017, Ma proposed incremental learning based on an SVM with a hyper-cone structure, which avoids retraining on all samples and effectively improves classification efficiency, precision and generalization [17]. Zheng proposed an algorithm for multi-class classification that combines multiple binary classifiers [18]. Wieslaw proposed voting classifiers built on combinations of binary classifiers to solve multi-class problems, which effectively improved model performance; as the number of classes increases, the improvement over the original algorithm grows [19]. An improved twin SVM was proposed that converts one large classification problem into two small-scale ones: the number of constraints in each quadratic programming problem becomes one half of the original, and the training time is reduced to one quarter of the SVM's, which effectively improves the computing performance and generalization ability of SVM [20]. The accuracy of SVM models with different kernels and the importance of inputs in different SVM models were compared, and the radial basis function was found to be the best kernel [21]. By incorporating a transformed particle-swarm-optimized ANN weight vector into a Bayesian-optimized SVM kernel for time series, Kouziokas introduced a novel SVM kernel [22].
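A minimal sketch of the kernel SVM described above, using scikit-learn's SVC with the radial basis function kernel that reference [21] singles out; the toy data set and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A nonlinearly separable toy problem; the RBF kernel handles it
# without explicitly mapping the data to a higher-dimensional space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# C and gamma are illustrative; in practice they come from a grid search.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```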

2.2.3 Neural network

ANNs imitate the structure and function of the biological brain, based on modern neuroscience results, to make effective decisions and perform specific functions. Since the concept of the ANN was proposed in 1943, it has attracted many researchers. However, two key bottlenecks temporarily hindered the development of ANNs:

1) Computers had limited computing power for processing large ANNs.

2) Single-layer perceptrons cannot handle the XOR problem [23].

Fortunately, five years later, researchers proposed new algorithms that addressed the shortage of computing power and made multilayer neural networks feasible. After AlphaGo appeared in 2016, a new wave of deep learning research began.

The most basic fully connected neural network (FCNN) usually includes three parts: an input layer, hidden layers, and an output layer, as shown in Fig. 3. Each neuron is connected by weights to all neurons in the previous layer, but there are no connections between neurons in the same layer. When the number of hidden layers is greater than one, the network is called a deep neural network (DNN).

Fig.3 The most basic fully connected neural network
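As a minimal sketch of the fully connected structure in Fig. 3, the following uses scikit-learn's MLPClassifier (the library named in Sec. 3); the synthetic data and the two hidden layers with their sizes are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Two hidden layers make this a (small) deep neural network; every
# neuron is fully connected to all neurons in the previous layer.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
mlp.fit(X, y)
print("train accuracy:", mlp.score(X, y))
```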

However, the structure of the FCNN easily leads to an explosion in the number of parameters, and the convolutional neural network (CNN) effectively alleviates this problem. Compared with the FCNN, the difference is that the CNN does not use full connections: each neuron is connected only to part of the previous layer of neurons, which greatly reduces the number of parameters and improves operating efficiency.

Many breakthroughs have been made in both theoretical and applied research, and many new and improved neural network models have been derived, including improved circular back propagation (ICBP), discounted least squares (DLS)-ICBP, chained DLS-ICBP and Plane-Gaussian networks [24].

2.3 Breast cancer prediction

Early diagnosis and prediction of breast cancer can improve patient survival. The prediction data come from the patient's clinical or examination records, including at minimum the patient's gender, age and medical history. Common methods such as breast ultrasound and mammography help doctors judge the disease. Doctors generally stage breast cancer with the TNM system: T describes the size and location of the tumor, N describes the size and location of the lymph nodes to which the cancer has spread, and M describes the spread of cancer to other parts of the body. The TNM system has high clinical value for predicting tumor recurrence and metastasis, and it is a relatively mature risk assessment index.

For breast cancer prediction, a machine learning workflow usually includes preprocessing steps such as data cleaning and integration, dimensionality reduction, model selection, model training and prediction. The specific process is shown in Fig. 4, and a pipeline sketch is given below it. Preprocessing ensures the purity of the data the model sees, which helps prediction performance. Choosing an appropriate model is sometimes more effective than any amount of data processing, since the model determines how well the data can be characterized.

Fig.4 The process of machine learning algorithm for breast cancer prediction
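A minimal sketch of the workflow in Fig. 4 as a single scikit-learn pipeline; the concrete steps (standardization, PCA to 10 components, an RBF-kernel SVM) are illustrative assumptions rather than this paper's exact configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # Wisconsin data bundled with sklearn

pipe = Pipeline([
    ("scale", StandardScaler()),        # preprocessing / normalization
    ("reduce", PCA(n_components=10)),   # dimensionality reduction
    ("model", SVC(kernel="rbf")),       # model selection and training
])
print("5-fold CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```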

2.4 Model evaluation

For binary classification models, the most basic metrics for evaluating the prediction result are precision and recall. Usually the class of interest is the positive class and the other is the negative class. The possible prediction outcomes of a classifier on a data set are: true positive (TP), false negative (FN), false positive (FP) and true negative (TN). TP means that the actual value is positive and the predicted value is positive; FN means that the actual value is positive and the predicted value is negative. Common evaluation metrics for the two-class model include:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.$$

And the harmonic mean of precision and recall, the F1 score, is:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

In addition to the above metrics, the receiver operating characteristic (ROC) curve can be used: under given stimulus conditions, the false-alarm probability $P(y|N)$ obtained under different judgment standards is taken as the abscissa and the hit probability $P(y|SN)$ as the ordinate, and a curve is drawn through the points. The curve itself supports only a qualitative analysis of the model; for a quantitative analysis a further concept is needed, the AUC, which is the area under the ROC curve. Computing the AUC only requires integrating along the horizontal axis of the ROC. In real scenarios the ROC curve generally lies above the line $y = x$, so the AUC is generally between 0.5 and 1; the larger the AUC, the better the model.
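A minimal sketch of computing the ROC curve and AUC with scikit-learn, under the illustrative assumption that the classifier exposes a continuous decision score (here an SVM's decision_function); each score threshold yields one point on the curve.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)  # continuous scores, one threshold per ROC point

fpr, tpr, thresholds = roc_curve(y_te, scores)  # abscissa: FPR, ordinate: TPR
print("AUC:", roc_auc_score(y_te, scores))      # area under the ROC curve
```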

2.5 Quantum machine learning

Quantum machine learning (QML) is a subdiscipline of quantum computation and quantum information [25]. Its purpose is to exploit quantum mechanical effects to accelerate classical machine learning algorithms. The concept of quantum computation originated in the late 1970s, when scholars pointed out that computers based on quantum mechanics could solve specific problems beyond the reach of traditional computers. In 1994, Shor proposed the Shor algorithm (quantum factorization), achieving exponential acceleration over the classical algorithm [26]. Whereas classical computing processes only one state per bit, quantum computing can process two states in parallel thanks to its unique superposition and entanglement, achieving exponential acceleration and the ability to represent high-dimensional data. After quantum algorithms attracted the attention of a large number of researchers, people began trying to use the parallelism of quantum computing and its exponential data-storage capacity to address the inefficiencies of the big-data era.

Today, many quantum algorithms have emerged, such as the quantum K-means algorithm, quantum clustering algorithms, the Grover algorithm [27], the Harrow-Hassidim-Lloyd (HHL) algorithm [28], QSVM [29] and the quantum principal components analysis (QPCA) algorithm [30]. More and more companies also open their quantum platforms online for free, such as the domestic Origin Quantum platform, Huawei's HiQ quantum platform, IBM's Qiskit quantum platform and Google's quantum platform, encouraging deeper research in the quantum field. Although a large number of quantum algorithms have been produced, the hardware and software challenges remain very large [31–32].

3 Experiment

This paper experiments with the prediction performance of various machine learning algorithms on breast cancer. The experimental data come from the Wisconsin breast cancer diagnosis data set provided on the Kaggle website (www.kaggle.com/uciml/breast-cancer-wisconsin-data); the prediction performance of each model is obtained through k-fold cross-validation, and the parameters are obtained through grid search. All experiments were performed on a computer with an Intel Core i5-6300HQ CPU running at 2.30 GHz, 8 GB of RAM and Python 3.6. The algorithm library was Scikit-learn.
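A minimal sketch of the evaluation protocol described above (k-fold cross-validation with grid-searched parameters), using the copy of the Wisconsin data bundled with scikit-learn; the parameter grid and k = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search over illustrative SVM hyperparameters, scored by 5-fold CV.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)
```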

3.1 Raw data analysis

Visualizing the raw data vividly reveals its underlying patterns and helps in understanding the subsequent data processing.

In Fig. 5, the number of benign samples is larger than the number of malignant samples, but not to the degree of an imbalanced data set. In the subsequent experiments, normalization is applied to make it easier to reach the optimal solution.

Fig.5 Comparison of sample size of benign and malignant data

In Fig. 6, each box plot has five lines, from top to bottom: the upper edge, the upper quartile, the median, the lower quartile and the lower edge. The remaining points represent outliers that fall outside the range of the box plot. The figure makes it obvious that the distribution of the benign data lies lower than that of the malignant data, and every feature has some outliers. The first feature in the figure shows clearly separated distribution intervals with relatively few outliers, so it can be concluded that this feature is of great significance for predicting benign versus malignant breast cancer.

Fig.6 Data box plot of the top 10 features

The lighter the color of a square in the heat map, the higher the correlation between the features on the horizontal and vertical axes. Fig. 7 clearly shows the relationships between the different features. For example, the color distributions of the blocks in columns 1, 3 and 4 are basically the same, indicating that the correlation between the features in these columns is very high. When the number of features is large, this method can be used to reduce the dimensionality effectively, lowering the computational difficulty and improving efficiency.

Fig.7 Heat map between all features
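A minimal sketch of producing such a correlation heat map with pandas and seaborn; the plotting library is an assumption, since the paper does not state its visualization tools.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Pairwise Pearson correlations between all features; near-identical
# columns in the map flag redundant features that dimensionality
# reduction or manual selection can remove.
sns.heatmap(df.corr(), cmap="viridis")
plt.show()
```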

Random forests make it easy to rank feature importance, as shown in Fig. 8. Combining the feature importance ranking with the feature correlations from the heat map makes feature selection and dimensionality reduction more effective. For the same data set, the random seed should be kept fixed so that the computed feature importances do not change; otherwise the order of neighboring features may swap.

Fig.8 Feature importance ranking
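A minimal sketch of the random-forest importance ranking behind Fig. 8; the fixed random_state plays the role of the fixed random seed mentioned above, and the forest size is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Fixing random_state keeps the importance ranking reproducible.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:  # the five most important features
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```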

3.2 Results

After completing the data processing, the data were fitted with a variety of classification algorithms. The experimental results are as follows.

Comparing the performance in Table 2, only the ANN meets the target on all three metrics. Although the training-set accuracy of the SVM model is slightly lower and misses the expected indicator, its other two metrics lead. The reason is that, with fixed parameters, the ANN model gradually overfits as the number of training epochs increases, so it can eventually take the lead in training-set accuracy; statistical models such as SVM have no notion of training epochs, and their final training accuracy depends mainly on the model parameters and the characteristics of the data set itself. Compared with SVM, the ensemble models AdaBoost, RF and Voting do not achieve better results on the test set.

Table 2 Comparison of the optimal indicators of each model

The QSVM experiment was done on the Qiskit platform provided by IBM. Compared with the traditional SVM algorithm, the QSVM algorithm improves computational efficiency and reduces the execution time of the algorithm; but since the mathematics it relies on is the same as that of the traditional SVM, it has little effect on the accuracy of the final prediction. The quantum state preparation used in the algorithm is amplitude encoding, which encodes the components of a feature vector into the probability amplitudes of a quantum state.

The key steps that achieve algorithm acceleration in QSVM are:

1) Inner product operations between samples when solving the constrained optimization problem: the classical algorithm has complexity O(N), while the swap-test quantum algorithm achieves O(log N) (see the sketch after this list).

2) After introducing slack variables, the inequality constraints of the original SVM become equality constraints, and the problem reduces to solving a system of linear equations; the HHL algorithm can solve linear systems with exponential acceleration.
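A minimal Qiskit sketch of the swap test mentioned in step 1), estimating the squared inner product of two single-qubit states; the state-preparation angles are illustrative assumptions, and the import style follows the Qiskit releases current at the time of the paper.

```python
from qiskit import Aer, QuantumCircuit, execute

qc = QuantumCircuit(3, 1)
qc.ry(0.8, 1)  # prepare |psi> on qubit 1 (illustrative angle)
qc.ry(1.6, 2)  # prepare |phi> on qubit 2 (illustrative angle)

qc.h(0)            # put the ancilla into superposition
qc.cswap(0, 1, 2)  # controlled-SWAP of the two states
qc.h(0)
qc.measure(0, 0)

shots = 8192
counts = execute(qc, Aer.get_backend("qasm_simulator"), shots=shots).result().get_counts()
# P(0) = 1/2 + |<psi|phi>|^2 / 2, so the squared overlap follows from the counts.
overlap_sq = 2 * counts.get("0", 0) / shots - 1
print("estimated |<psi|phi>|^2:", overlap_sq)
```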

The Qiskit library includes the QSVM algorithm, but because of the limited open permissions of the quantum platform, the feature data had to be reduced to 2 dimensions with PCA and normalized. In the experiment, 20 benign and 20 malignant samples were selected as the training set, and 10 samples were used as the test set. The final test-set accuracy was 90%, and the AUC was 0.9.
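A rough sketch of such a run under the legacy Qiskit Aqua API that was current for Python 3.6 (the QSVM class has since been removed from Qiskit); the feature map and sample split follow the description above, but the paper's exact configuration is not stated.

```python
from qiskit import Aer
from qiskit.aqua import QuantumInstance
from qiskit.aqua.algorithms import QSVM
from qiskit.circuit.library import ZZFeatureMap
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
# PCA to 2 dimensions, then normalization, as in the experiment above.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(PCA(n_components=2).fit_transform(X))

# 20 samples per class for training; a small held-out test set.
train = {"benign": X[y == 1][:20], "malignant": X[y == 0][:20]}
test = {"benign": X[y == 1][20:25], "malignant": X[y == 0][20:25]}

feature_map = ZZFeatureMap(feature_dimension=2, reps=2)  # encodes the 2-D features
qsvm = QSVM(feature_map, train, test)
result = qsvm.run(QuantumInstance(Aer.get_backend("qasm_simulator"), shots=1024))
print("test accuracy:", result["testing_accuracy"])
```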

4 Conclusion

Complex problems such as the prediction and diagnosis of breast cancer require doctors' professional knowledge and medical experience, and machine learning models can help doctors make more accurate decisions. To improve the practicality of the model, this paper considers seven basic models and three ensemble models, among which SVM achieves the best results, reaching the highest prediction accuracy of 98.25%. QSVM is used to accelerate the computation of the model. Because only a small number of samples were selected, the experiment could not truly show the superiority of the quantum algorithm on problems of high computational complexity; but the analysis of its principles shows that quantum machine learning has natural advantages for specific problems and can achieve exponential gains in computational efficiency.

However, this paper conducted experiments on only one data set and did not consider the comprehensive performance of the machine learning algorithms across data sets. At the same time, the number of samples selected for the QSVM experiment was too small to reflect its superiority. In future work, more data sets with larger volumes can be selected for experiments.